Distilling Privileged Knowledge From Transformers to Lightweight CNNs for On-Device Time Series Forecasting

Sangjin Na, Yong-Jun Cho, Yunju Baek

Published 2026 in IEEE Access

ABSTRACT

Time Series Forecasting (TSF) is a pivotal capability for critical applications in industrial automation, energy management, and smart infrastructure. While Large Language Models (LLMs) and Transformer-based architectures have set new benchmarks in forecasting accuracy, their exorbitant computational costs and memory requirements prohibit deployment on resource-constrained Microcontroller Units (MCUs). Lightweight models offer an alternative but often fail to capture complex long-term dependencies. This paper introduces a comprehensive framework that addresses these interconnected challenges of accuracy and efficiency. We propose a highly efficient Convolutional Neural Network (CNN) architecture, enhanced through a novel heterogeneous knowledge distillation methodology. In our framework, a powerful LLM-based teacher model, utilizing self-attention mechanisms, guides the training of a compact student model that relies solely on lightweight convolutions during inference. This approach is augmented by the Gated Dilated depthwise separable Convolution (GDC) block, which efficiently captures long-range dependencies and multi-scale patterns while drastically reducing the parameter count to meet strict MCU constraints. The robustness and generalizability of our framework are validated through extensive experiments on six long-term and five short-term forecasting benchmarks. Critically, our framework establishes a new state-of-the-art among lightweight architectures, demonstrating consistent performance gains across diverse forecasting horizons. Experimental results confirm that our method reduces Mean Squared Error (MSE) by 11.5% in long-term and 12.4% in short-term forecasting tasks compared to massive LLM-based baselines, while utilizing less than 0.01% of the parameters.
The final, optimized model is successfully deployed on a commercial ESP32-S3 MCU, demonstrating its practical viability for real-time, accurate, and efficient forecasting on edge devices with an inference latency of just 256 ms.
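The abstract does not detail the internals of the GDC block, but its name suggests a familiar composition: a dilated depthwise convolution (one kernel per channel, with gaps between taps to widen the receptive field), a pointwise 1x1 convolution to mix channels, and a multiplicative sigmoid gate. The following NumPy sketch illustrates that composition under these assumptions; all function names, shapes, and the causal-padding choice are illustrative, not taken from the paper.

```python
import numpy as np

def depthwise_dilated_conv1d(x, w, dilation):
    """Depthwise dilated 1-D convolution: each channel is convolved with its
    own kernel. x: (C, T), w: (C, K). Causal left-padding preserves length T."""
    C, T = x.shape
    K = w.shape[1]
    pad = (K - 1) * dilation
    xp = np.pad(x, ((0, 0), (pad, 0)))          # pad only the past
    out = np.zeros((C, T))
    for k in range(K):                          # sum shifted, weighted copies
        out += w[:, k:k + 1] * xp[:, k * dilation : k * dilation + T]
    return out

def gdc_block(x, w_dw, w_pw, w_gate, dilation=2):
    """Hypothetical GDC-style block: dilated depthwise conv -> pointwise
    (1x1) channel mix -> sigmoid gate computed from the input.
    x: (C, T), w_dw: (C, K), w_pw: (C_out, C), w_gate: (C_out, C)."""
    h = depthwise_dilated_conv1d(x, w_dw, dilation)   # per-channel temporal filter
    h = w_pw @ h                                      # pointwise mix -> (C_out, T)
    g = 1.0 / (1.0 + np.exp(-(w_gate @ x)))           # sigmoid gate -> (C_out, T)
    return g * h

# Shape check with random weights (illustrative sizes only).
rng = np.random.default_rng(0)
C, C_out, T, K = 4, 8, 16, 3
x = rng.standard_normal((C, T))
y = gdc_block(x, rng.standard_normal((C, K)),
              rng.standard_normal((C_out, C)),
              rng.standard_normal((C_out, C)), dilation=2)
```

Depthwise plus pointwise factorization is what keeps the parameter count small (C*K + C_out*C weights instead of C_out*C*K for a full convolution), while dilation extends the receptive field without adding taps, which is consistent with the MCU-oriented design the abstract describes.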
