SkyWalker: A Locality-Aware Cross-Region Load Balancer for LLM Inference

Tian Xia; Ziming Mao; Jamison Kerney; Ethan J. Jackson; Zhifei Li; Jiarong Xing; Scott Shenker; Ion Stoica

doi:10.1145/3767295.3769353

Abstract


Serving Large Language Models (LLMs) efficiently in multi-region setups remains a challenge. Because of cost and GPU availability concerns, providers typically deploy LLMs in multiple regions using instances with long-term commitments, such as reserved instances or on-premises clusters. These instances are often underutilized because each region handles only its own traffic, and that traffic varies diurnally. In this paper, we introduce SkyWalker, a multi-region load balancer for LLM inference that aggregates regional diurnal patterns through cross-region traffic handling. This lets providers reserve instances based on expected global demand rather than on peak demand in each individual region. At the same time, SkyWalker preserves KV-cache locality and load balancing, ensuring cost efficiency without sacrificing performance. It achieves this with a cache-aware cross-region traffic handler and a load balancing mechanism based on selective pushing. Our evaluation on real-world workloads shows that SkyWalker achieves 1.12–2.06× higher throughput and 1.74–6.30× lower latency than existing load balancers, while reducing total serving cost by 25%.
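The capacity argument in the abstract (reserving for global demand rather than for each region's own peak) can be illustrated with a small arithmetic sketch. Everything in it is an assumption made for illustration: the region names, the cosine-shaped demand curve, the peak hours, and the request rates are hypothetical and are not taken from the paper's workloads, so the resulting numbers are not expected to match the reported 25% cost reduction.

```python
import math

def diurnal_demand(hour_utc, peak_hour_utc, base=20.0, amplitude=80.0):
    """Hypothetical demand (requests/s) with one daily peak at peak_hour_utc."""
    phase = 2 * math.pi * (hour_utc - peak_hour_utc) / 24
    return base + amplitude * (1 + math.cos(phase)) / 2

# Hypothetical regions whose local peaks fall 8 hours apart (in UTC).
peak_hours = {"us": 20, "eu": 12, "asia": 4}

hours = range(24)
demand = {
    region: [diurnal_demand(h, peak) for h in hours]
    for region, peak in peak_hours.items()
}

# Region-local serving: each region must be provisioned for its own peak.
sum_of_regional_peaks = sum(max(series) for series in demand.values())

# Cross-region serving: capacity only has to cover the peak of the global sum.
global_peak = max(sum(demand[r][h] for r in demand) for h in hours)

saving = 1 - global_peak / sum_of_regional_peaks
print(f"sum of regional peaks: {sum_of_regional_peaks:.1f} req/s")
print(f"peak of global demand: {global_peak:.1f} req/s")
print(f"capacity saving:       {saving:.0%}")
```

Because the hypothetical peaks are offset in time, the peak of the summed demand is lower than the sum of the per-region peaks. The sketch deliberately ignores cross-region latency and KV-cache locality, which are the concerns the abstract says SkyWalker's traffic handler and load balancer address.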

Related publications

Preprint, 2025

SkyWalker: A Locality-Aware Cross-Region Load Balancer for LLM Inference

Locality-aware Fair Scheduling in LLM Serving

Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving

Optimizing LLM Queries in Relational Data Analytics Workloads

Pie: Pooling CPU Memory for LLM Inference

RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs