ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling

Yuchen Yang, Yaru Zhao, Pu Yang, Shaowei Wang, Zhijian Zhou

Published 2026 in arXiv.org

ABSTRACT

While Mixture-of-Experts (MoE) architectures substantially bolster the expressive power of large language models, their prohibitive memory footprint severely impedes practical deployment on resource-constrained edge devices, especially when model behavior must be preserved without relying on lossy quantization. In this paper, we present ZipMoE, an efficient and semantically lossless on-device MoE serving system. ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters via a caching-scheduling co-design with a provable performance guarantee. Fundamentally, our design shifts the paradigm of on-device MoE inference from an I/O-bound bottleneck to a compute-centric workflow that enables efficient parallelization. We implement a prototype of ZipMoE and conduct extensive experiments on representative edge computing platforms using popular open-source MoE models and real-world workloads. Our evaluation reveals that ZipMoE achieves up to $72.77\%$ inference latency reduction and up to $6.76\times$ higher throughput than state-of-the-art systems.

PUBLICATION RECORD

Publication year: 2026
Venue: arXiv.org
Publication date: 2026-01-29
Fields of study: Computer Science, Engineering
Identifiers: DOI 10.48550/arXiv.2601.21198; arXiv 2601.21198
Source metadata: Semantic Scholar
