ABSTRACT
Chunking input text is a crucial preprocessing step when applying Large Language Models (LLMs) to long or structured documents, yet its impact on downstream task performance remains underexplored. This study presents a comprehensive empirical analysis of four chunking strategies (fixed-size, overlapping, sentence-based, and paragraph-based) across three fundamental NLP tasks: question answering, text classification, and abstractive summarization. Experiments were conducted with lightweight, open-access models such as Flan-T5, GPT-2, DistilBERT, and RoBERTa on benchmark datasets including SQuAD, CoQA, QuAC, IMDB, Amazon Polarity, CNN/DailyMail, and XSum. Performance was measured with task-appropriate metrics (ROUGE, exact match, F1, precision, recall) alongside latency. The results show that the choice of chunking strategy significantly affects both performance and latency, and that no single approach is universally optimal. These findings highlight the need for task-specific chunking choices in practical LLM deployments, especially under resource constraints.
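To make the four strategies concrete, the following Python sketch shows one plausible implementation of each; the function names, default chunk sizes, and overlap values are illustrative assumptions and do not reproduce the paper's implementation.

# Minimal illustrative sketch (assumed, not the paper's code): one plausible
# implementation of each chunking strategy named in the abstract.
import re
from typing import List

def fixed_size_chunks(text: str, size: int = 200) -> List[str]:
    # Split on whitespace and emit non-overlapping windows of `size` tokens.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def overlapping_chunks(text: str, size: int = 200, overlap: int = 50) -> List[str]:
    # Same as fixed-size, but consecutive windows share `overlap` tokens.
    words = text.split()
    step = max(size - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def sentence_chunks(text: str, max_sentences: int = 5) -> List[str]:
    # Naive sentence segmentation on terminal punctuation, then group sentences.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(sentences[i:i + max_sentences])
            for i in range(0, len(sentences), max_sentences)]

def paragraph_chunks(text: str) -> List[str]:
    # Treat blank lines as paragraph boundaries.
    return [p.strip() for p in text.split("\n\n") if p.strip()]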
PUBLICATION RECORD
- Publication year: 2025
- Publication date: 2025-12-18
- Venue: 2025 OITS International Conference on Information Technology (OCIT)
- Fields of study: not labeled
- Source metadata: Semantic Scholar