Towards Efficient and Practical GPU Multitasking in the Era of LLM

Jiarong Xing; Yifan Qiao; Simon Mo; Xiao-Bing Cui; Gur-Eyal Sela; Yang Zhou; Joseph E. Gonzalez; Ion Stoica

doi:10.48550/arxiv.2508.08448

Abstract

GPU singletasking is becoming increasingly inefficient and unsustainable as hardware capabilities grow and workloads diversify. We are now at an inflection point where GPUs must embrace multitasking, much like CPUs did decades ago, to meet the demands of modern AI workloads. In this work, we highlight the key requirements for GPU multitasking, examine prior efforts, and discuss why they fall short. To advance toward efficient and practical GPU multitasking, we envision a resource management layer, analogous to a CPU operating system, to handle various aspects of GPU resource management and sharing. We outline the challenges and potential solutions, and hope this paper inspires broader community efforts to build the next-generation GPU compute paradigm grounded in multitasking.

Preprint, 2024
