On the Impact of Chunking Strategies in NLP Pipelines: A Multi-Task Empirical Study

Sawan Rai, R. Belwal

Published 2025 in 2025 OITS International Conference on Information Technology (OCIT)

ABSTRACT

Chunking input text is a crucial preprocessing step when using Large Language Models (LLMs) for long or structured documents. However, its impact on downstream task performance remains underexplored. This study presents a comprehensive empirical analysis evaluating the effect of various chunking strategies: fixed-size, overlapping, sentence-based, and paragraph-based, across three fundamental NLP tasks: question answering, text classification, and abstractive summarization. Experiments were conducted using lightweight, open-access models such as Flan-T5, GPT-2, DistilBERT, and RoBERTa on benchmark datasets including SQuAD, CoQA, QuAC, IMDB, Amazon Polarity, CNN/DailyMail, and XSum. Performance was measured using task-appropriate metrics (ROUGE, EM, F1, precision, recall) along with latency. Results reveal that chunking strategies significantly affect performance and latency, with no single approach universally optimal. These findings highlight the need for task-specific chunking choices in practical LLM deployments, especially under resource constraints.
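The four chunking strategies named in the abstract can be sketched as simple Python functions. This is an illustrative sketch only: the chunk sizes, overlap values, and the naive sentence/paragraph splitting rules below are assumptions, not the paper's actual implementation or hyperparameters.

```python
import re

def fixed_size_chunks(text, size=100):
    """Split text into consecutive, non-overlapping chunks of `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def overlapping_chunks(text, size=100, overlap=20):
    """Fixed-size chunks where consecutive chunks share `overlap` characters."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def sentence_chunks(text):
    """Naive sentence-based chunking: split on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def paragraph_chunks(text):
    """Paragraph-based chunking: split on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```

For example, `overlapping_chunks("abcdef", size=4, overlap=2)` yields `["abcd", "cdef", "ef"]`, showing how the shared overlap preserves context across chunk boundaries at the cost of processing some text twice, which is one source of the latency differences the study measures.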
