Biological language model performance depends heavily on pretraining data quality, diversity, and size. While metagenomic datasets feature enormous biological diversity, their utilization as pretraining data has been limited due to challenges in data accessibility, quality filtering and deduplication. Here, we present the Open MetaGenomic (OMG) corpus, a genomic pretraining dataset totalling 3.1T base pairs and 3.3B protein coding sequences, obtained by combining two largest metagenomic dataset repositories (JGI’s IMG and EMBL’s MGnify). We first document the composition of the dataset and describe the quality filtering steps taken to remove poor quality data. We make the OMG corpus available as a mixed-modality genomic sequence dataset that represents multi-gene encoding genomic sequences with translated amino acids for protein coding sequences, and nucleic acids for intergenic sequences. We train the first mixed-modality genomic language model (gLM2) that leverages genomic context information to learn robust functional representations, as well as coevolutionary signals in protein-protein interfaces and genomic regulatory syntax. Furthermore, we show that deduplication in embedding space can be used to balance the corpus, demonstrating improved performance on downstream tasks. The OMG dataset is publicly hosted on the Hugging Face Hub at https://huggingface.co/datasets/tattabio/OMG and gLM2 is available at https://huggingface.co/tattabio/gLM2_650M.
The OMG dataset: An Open MetaGenomic corpus for mixed-modality genomic language modeling
Andre Cornman,Jacob West-Roberts,A. Camargo,Simon Roux,Martin Beracochea,M. Mirdita,Sergey Ovchinnikov,Yunha Hwang
Published 2024 in bioRxiv
ABSTRACT
PUBLICATION RECORD
- Publication year
2024
- Venue
bioRxiv
- Publication date
2024-10-01
- Fields of study
Biology, Computer Science
- Identifiers
- External record
- Source metadata
Semantic Scholar
CITATION MAP
EXTRACTION MAP
CLAIMS
- No claims are published for this paper.
CONCEPTS
- No concepts are published for this paper.
REFERENCES
Showing 1-58 of 58 references · Page 1 of 1
CITED BY
Showing 1-36 of 36 citing papers · Page 1 of 1