SkyWalker: A Locality-Aware Cross-Region Load Balancer for LLM Inference
Article 2026
Authors
Tian Xia
Ziming Mao
Jamison Kerney
Abstract
Serving Large Language Models (LLMs) efficiently in multi-region setups remains a challenge. Due to cost and GPU availability concerns, providers typically deploy LLMs in multiple regions using instances with long-term commitments, such as reserved instances or on-premises clusters. These instances are often underutilized because each region handles only its local traffic, and that traffic varies diurnally. In this paper, we introduce SkyWalker, a multi-region load balancer for LLM inference that aggregates regional diurnal patterns through cross-region traffic handling. By doing so, SkyWalker enables providers to reserve instances based on expected global demand rather than on peak demand in each individual region. At the same time, SkyWalker preserves KV-Cache locality and load balancing, ensuring cost efficiency without sacrificing performance. SkyWalker achieves this with a cache-aware cross-region traffic handler and a load balancing mechanism based on selective pushing. Our evaluation on real-world workloads shows that SkyWalker achieves 1.12–2.06× higher throughput and 1.74–6.30× lower latency than existing load balancers, while reducing total serving cost by 25%.
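To illustrate why reserving for global demand can need fewer instances than reserving for each region's own peak, here is a minimal sketch. It is not taken from the paper: the cosine demand curve, the region names, the peak hours, and the `base` and `amplitude` values are all hypothetical, chosen only to show that regional peaks at different hours partly cancel when traffic is aggregated.

```python
import math


def diurnal_demand(hour, peak_hour, base=2.0, amplitude=8.0):
    """Hypothetical demand (in instances) for one region at a given UTC hour."""
    # The cosine term is largest at peak_hour and smallest 12 hours later.
    phase = 2 * math.pi * (hour - peak_hour) / 24
    return base + amplitude * (1 + math.cos(phase)) / 2


def reservation_sizes(peak_hours):
    """Return (sum of per-region peaks, peak of aggregated demand)."""
    hours = range(24)
    demand = {
        region: [diurnal_demand(h, peak) for h in hours]
        for region, peak in peak_hours.items()
    }
    # Region-local handling: every region reserves for its own peak.
    per_region = sum(math.ceil(round(max(series), 6)) for series in demand.values())
    # Cross-region handling: reserve once for the peak of the summed demand.
    aggregate = [sum(demand[region][h] for region in demand) for h in hours]
    global_peak = math.ceil(round(max(aggregate), 6))
    return per_region, global_peak


# Three hypothetical regions whose daily peaks are 8 hours apart.
per_region, global_peak = reservation_sizes({"us": 20, "eu": 12, "asia": 4})
print(f"sum of regional peaks: {per_region}, global peak: {global_peak}")
```

The sketch deliberately ignores cross-region network latency and KV-Cache locality, which the abstract names as the constraints SkyWalker must preserve while forwarding traffic across regions.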