TextGram: Towards a better domain-adaptive pretraining

Sharayu Hiwarkhedkar, Saloni Mittal, Vidula Magdum, Omkar Dhekane, Raviraj Joshi, Geetanjali Kale, Arnav Ladkat

Published 2024 in arXiv.org

ABSTRACT

For green AI, it is crucial to measure and reduce the carbon footprint of training large language models. In NLP, pre-training Transformer models requires significant computational resources. Pre-training uses a large amount of text data to acquire prior knowledge for downstream tasks. It is therefore important to select the right data, namely domain-specific data, from this vast corpus to achieve optimum results on domain-specific tasks. While training on large unsupervised data is expensive, the cost can be reduced by performing a data selection step before pre-training. Selecting important data reduces the space overhead and the substantial time required to pre-train the model while maintaining accuracy. We investigate existing selection strategies and propose our own domain-adaptive data selection method, TextGram, that effectively selects essential data from large corpora. We compare and evaluate the results of fine-tuned models on a text classification task with and without data selection. We show that the proposed strategy works better than other selection methods.
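
The abstract does not spell out how TextGram scores text, so the following is only a minimal sketch of the general idea of a data selection step before pre-training, not the authors' method. It assumes a simple word-bigram overlap score against a small in-domain sample; the function names (`ngrams`, `select_in_domain`), the `keep_fraction` parameter, and the whitespace tokenization are all hypothetical choices made for illustration.

```python
from collections import Counter


def ngrams(text, n=2):
    """Return the word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def select_in_domain(candidates, domain_texts, keep_fraction=0.5, n=2):
    """Keep the top fraction of candidates ranked by n-gram overlap.

    Hypothetical scoring (an assumption, not TextGram itself): the share of a
    candidate's n-grams that also occur anywhere in the in-domain sample.
    """
    domain_counts = Counter(g for t in domain_texts for g in ngrams(t, n))

    def score(text):
        grams = ngrams(text, n)
        if not grams:
            return 0.0
        hits = sum(1 for g in grams if g in domain_counts)
        return hits / len(grams)

    # sorted() is stable, so ties keep their original corpus order.
    ranked = sorted(candidates, key=score, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


domain_sample = ["the movie was great", "the film was boring"]
corpus = [
    "the movie was long",
    "stock prices fell sharply today",
    "the film was great fun",
    "rain is expected tomorrow",
]
selected = select_in_domain(corpus, domain_sample, keep_fraction=0.5)
```

Only the selected subset would then be passed to pre-training, which is where the savings in storage and training time described above would come from.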

PUBLICATION RECORD

Publication year
2024
Venue
arXiv.org
Publication date
2024-04-28
Fields of study
Computer Science, Environmental Science
Identifiers
DOI 10.1007/978-3-031-58495-4_12; arXiv 2404.18228
Source metadata
Semantic Scholar
