Distilling Privileged Knowledge From Transformers to Lightweight CNNs for On-Device Time Series Forecasting

Sangjin Na, Yong-Jun Cho, Yunju Baek

Published 2026 in IEEE Access

ABSTRACT

Time Series Forecasting (TSF) is a pivotal capability for critical applications in industrial automation, energy management, and smart infrastructure. While Large Language Models (LLMs) and Transformer-based architectures have set new benchmarks in forecasting accuracy, their exorbitant computational costs and memory requirements prohibit deployment on resource-constrained Microcontroller Units (MCUs). Lightweight models offer an alternative but often fail to capture complex long-term dependencies. This paper introduces a comprehensive framework that addresses these interconnected challenges of accuracy and efficiency. We propose a highly efficient Convolutional Neural Network (CNN) architecture, enhanced through a novel heterogeneous knowledge distillation methodology. In our framework, a powerful LLM-based teacher model, utilizing self-attention mechanisms, guides the training of a compact student model that relies solely on lightweight convolutions during inference. This approach is augmented by the Gated Dilated depthwise separable Convolution (GDC) block, which efficiently captures long-range dependencies and multi-scale patterns while drastically reducing the parameter count to meet strict MCU constraints. The robustness and generalizability of our framework are validated through extensive experiments on six long-term and five short-term forecasting benchmarks. Critically, our framework establishes a new state-of-the-art among lightweight architectures, demonstrating consistent performance gains across diverse forecasting horizons. Experimental results confirm that our method reduces Mean Squared Error (MSE) by 11.5% in long-term and 12.4% in short-term forecasting tasks compared to massive LLM-based baselines, while utilizing less than 0.01% of the parameters.
The final, optimized model is successfully deployed on a commercial ESP32-S3 MCU, demonstrating its practical viability for real-time, accurate, and efficient forecasting on edge devices with an inference latency of just 256 ms.
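The abstract does not detail the internals of the GDC block, but its name suggests a familiar composition: a dilated depthwise convolution (one kernel per channel, with gaps between taps to widen the receptive field), a pointwise 1x1 convolution to mix channels, and a multiplicative sigmoid gate. The following NumPy sketch illustrates that composition under these assumptions; all function names, shapes, and the causal-padding choice are illustrative, not taken from the paper.

```python
import numpy as np

def depthwise_dilated_conv1d(x, w, dilation):
    """Depthwise dilated 1-D convolution: each channel is convolved with its
    own kernel. x: (C, T), w: (C, K). Causal left-padding preserves length T."""
    C, T = x.shape
    K = w.shape[1]
    pad = (K - 1) * dilation
    xp = np.pad(x, ((0, 0), (pad, 0)))          # pad only the past
    out = np.zeros((C, T))
    for k in range(K):                          # sum shifted, weighted copies
        out += w[:, k:k + 1] * xp[:, k * dilation : k * dilation + T]
    return out

def gdc_block(x, w_dw, w_pw, w_gate, dilation=2):
    """Hypothetical GDC-style block: dilated depthwise conv -> pointwise
    (1x1) channel mix -> sigmoid gate computed from the input.
    x: (C, T), w_dw: (C, K), w_pw: (C_out, C), w_gate: (C_out, C)."""
    h = depthwise_dilated_conv1d(x, w_dw, dilation)   # per-channel temporal filter
    h = w_pw @ h                                      # pointwise mix -> (C_out, T)
    g = 1.0 / (1.0 + np.exp(-(w_gate @ x)))           # sigmoid gate -> (C_out, T)
    return g * h

# Shape check with random weights (illustrative sizes only).
rng = np.random.default_rng(0)
C, C_out, T, K = 4, 8, 16, 3
x = rng.standard_normal((C, T))
y = gdc_block(x, rng.standard_normal((C, K)),
              rng.standard_normal((C_out, C)),
              rng.standard_normal((C_out, C)), dilation=2)
```

Depthwise plus pointwise factorization is what keeps the parameter count small (C*K + C_out*C weights instead of C_out*C*K for a full convolution), while dilation extends the receptive field without adding taps, which is consistent with the MCU-oriented design the abstract describes.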
