Federated Learning-Enabled Hybrid Language Models for Communication-Efficient Token Transmission

Faranaksadat Solat; Joohyung Lee; Mohamed Seif; Dusit Niyato; H. Vincent Poor

doi:10.48550/arxiv.2507.00082

Abstract

Hybrid Language Models (HLMs) combine the low-latency efficiency of Small Language Models (SLMs) on edge devices with the high accuracy of Large Language Models (LLMs) on centralized servers. Unlike traditional end-to-end LLM inference, HLMs reduce latency and communication by invoking LLMs only when local SLM predictions are uncertain, i.e., when token-level confidence is low or entropy is high. However, ambiguous or low-confidence predictions still require frequent offloading to the LLM, leading to significant communication overhead in bandwidth-constrained settings. To address this, we propose FedHLM, a communication-efficient HLM framework that integrates uncertainty-aware inference with Federated Learning (FL). FedHLM's key innovation lies in collaboratively learning token-level uncertainty thresholds that govern when LLM assistance is needed. Rather than using static or manually tuned thresholds, FedHLM employs FL to optimize these thresholds in a privacy-preserving, distributed manner. Additionally, it leverages embedding-based token representations for Peer-to-Peer (P2P) resolution, enabling clients to reuse tokens inferred by semantically similar peers without engaging the LLM. We further introduce hierarchical model aggregation: edge servers refine local routing policies through client updates, while cross-cluster coordination aligns global decision boundaries. This layered design captures recurring uncertainty patterns, reducing redundant LLM queries. Experiments on large-scale news classification tasks show that FedHLM reduces LLM transmissions by over 95 percent with negligible accuracy loss, making it well-suited for scalable and efficient edge-AI applications.
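
The routing and aggregation ideas described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how thresholds are learned or aggregated, so the weighted averaging below is an assumed FedAvg-style stand-in, and all function names (`token_entropy`, `route_token`, `aggregate_thresholds`, `hierarchical_aggregate`) are hypothetical. The sketch shows entropy-gated offloading at a client and two-level averaging of scalar thresholds, first within each edge cluster and then across clusters.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route_token(probs, threshold):
    """Keep the SLM prediction when entropy is at or below the threshold;
    otherwise offload the token to the LLM."""
    return "slm" if token_entropy(probs) <= threshold else "llm"


def aggregate_thresholds(thresholds, weights):
    """Weighted mean of scalar thresholds.

    Assumption: a FedAvg-style average weighted by, e.g., local sample counts.
    """
    total = sum(weights)
    return sum(t * w for t, w in zip(thresholds, weights)) / total


def hierarchical_aggregate(clusters):
    """Two-level aggregation over edge clusters.

    `clusters` is a list of (client_thresholds, client_weights) pairs,
    one pair per edge server. Each edge server averages its own clients,
    then the edge-level results are averaged using the cluster totals.
    """
    edge_thresholds = []
    edge_weights = []
    for client_thresholds, client_weights in clusters:
        edge_thresholds.append(aggregate_thresholds(client_thresholds, client_weights))
        edge_weights.append(sum(client_weights))
    return aggregate_thresholds(edge_thresholds, edge_weights)


if __name__ == "__main__":
    confident = [0.97, 0.01, 0.01, 0.01]   # peaked distribution, low entropy
    uncertain = [0.25, 0.25, 0.25, 0.25]   # uniform distribution, high entropy
    print(route_token(confident, threshold=0.5))
    print(route_token(uncertain, threshold=0.5))
    clusters = [([0.4, 0.8], [1, 3]), ([1.0], [4])]
    print(hierarchical_aggregate(clusters))
```

Under these assumptions, a lower threshold sends more tokens to the LLM (more communication, fewer SLM errors), while a higher threshold keeps more tokens local; the federated step only shares scalar thresholds, never raw text.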
