RACER: Avoiding End-to-End Slowdowns in Accelerated Chip Multi-Processors
Neel Patel, Ren Wang, Mohammad Alian
Published 2025 in ACM Transactions on Architecture and Code Optimization (TACO)

ABSTRACT

Recent chip multiprocessors incorporate several on-chip accelerators, marking the beginning of the Accelerated Chip Multi-Processor (XMP) era in datacenters. Despite the close proximity of accelerators and general-purpose cores, offloading functions to accelerators is not always beneficial: offloading can introduce end-to-end overheads that negate the speedup of the accelerable function. In this article, we design RACER, a hardware architecture and runtime system that avoids end-to-end slowdowns when using hardware acceleration. RACER combines a low-overhead interface between general-purpose cores and on-chip accelerators, fine-grained context switching, accelerator-initiated preemption, and seamless data motion between cores and accelerators to improve the performance of workloads that use on-chip accelerators. We evaluate RACER on five representative request-processing workloads featuring diverse memory access patterns, accelerable functions, and compute intensities. On a real XMP, RACER improves the performance of hardware acceleration by an average of 1.31× across these workloads and guarantees that accelerator offloads never cause slowdowns.
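The never-slowdown guarantee described in the abstract can be illustrated with a minimal sketch of the offload decision: offload only when the estimated end-to-end accelerator path (offload overhead plus accelerated compute) beats running the function on a general-purpose core. All names and the cost model below are illustrative assumptions, not APIs from the paper.

```python
# Hypothetical sketch of a never-slowdown offload policy in the spirit of
# RACER (names and cost model are illustrative, not from the paper).
from dataclasses import dataclass


@dataclass
class OffloadEstimate:
    cpu_time_us: float          # time to run the function on a general-purpose core
    accel_time_us: float        # time for the accelerator to run the function body
    offload_overhead_us: float  # interface, context-switch, and data-motion cost


def should_offload(est: OffloadEstimate) -> bool:
    """Offload only if the end-to-end accelerator path is strictly faster,
    so an offload can never cause a slowdown."""
    return est.accel_time_us + est.offload_overhead_us < est.cpu_time_us


def run_request(est: OffloadEstimate) -> float:
    """Return the end-to-end latency the runtime pays under this policy."""
    if should_offload(est):
        return est.accel_time_us + est.offload_overhead_us
    return est.cpu_time_us
```

Under this toy model, the latency returned by `run_request` is always the minimum of the two paths, which is the essence of the guarantee that acceleration never hurts end-to-end performance.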
PUBLICATION RECORD
- Publication year: 2025
- Publication date: 2025-07-24
- Venue: ACM Transactions on Architecture and Code Optimization (TACO)
- Fields of study: Computer Science, Engineering
- Source metadata: Semantic Scholar