On patterns and re-use in bioinformatics databases

Published 2017 in Bioinform.

ABSTRACT

Motivation: As the quantity of data being deposited into biological databases continues to increase, it becomes ever more vital to develop methods that enable us to understand this data and ensure that the knowledge is correct. It is widely held that data percolates between different databases, which causes particular concerns for data correctness; if this percolation occurs, incorrect data in one database may eventually affect many others while, conversely, corrections in one database may fail to percolate to others. In this paper, we test this widely held belief by directly looking for sentence reuse both within and between databases. Further, we investigate patterns of how sentences are reused over time. Finally, we consider the limitations of this form of analysis and the implications that this may have for bioinformatics database design.

Results: We show that reuse of annotation is common within many different databases, and that there is also a detectable level of reuse between databases. In addition, we show that there are patterns of reuse that have previously been shown to be associated with percolation errors.

Availability and implementation: Analytical software is available on request.

Contact: phillip.lord@newcastle.ac.uk
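
The abstract describes looking for sentence reuse within and between databases but does not give the procedure, and the authors' software is available only on request. The following is a minimal illustrative sketch of one way such a search could be set up, not the authors' implementation. The record layout (database, entry_id, annotation_text), the naive sentence splitter, the normalisation step, and the toy records are all assumptions made for this example.

```python
import re
from collections import defaultdict


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def normalise(sentence):
    # Lower-case and collapse whitespace so trivial differences do not hide reuse.
    return " ".join(sentence.lower().split())


def index_sentences(records):
    """Map each normalised sentence to the set of (database, entry_id) using it.

    `records` is an iterable of (database, entry_id, annotation_text) tuples;
    this layout is hypothetical and chosen only for the example.
    """
    index = defaultdict(set)
    for database, entry_id, text in records:
        for sentence in split_sentences(text):
            index[normalise(sentence)].add((database, entry_id))
    return index


def reused_within(index):
    # Sentences that occur in more than one entry of the same database.
    result = {}
    for sentence, locations in index.items():
        per_db = defaultdict(set)
        for database, entry_id in locations:
            per_db[database].add(entry_id)
        repeated = {db: ids for db, ids in per_db.items() if len(ids) > 1}
        if repeated:
            result[sentence] = repeated
    return result


def reused_between(index):
    # Sentences that occur in entries of more than one database.
    return {
        sentence: locations
        for sentence, locations in index.items()
        if len({database for database, _ in locations}) > 1
    }


# Toy records, invented for illustration only.
records = [
    ("DB_A", "A1", "Belongs to the kinase family. Binds ATP."),
    ("DB_A", "A2", "Belongs to the kinase family. Located in the nucleus."),
    ("DB_B", "B1", "Binds ATP. Involved in signalling."),
]

index = index_sentences(records)
within = reused_within(index)
between = reused_between(index)
```

Exact matching after light normalisation is the simplest possible choice; it will miss sentences that were edited slightly when copied, so a real analysis would likely need a more tolerant comparison.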

PUBLICATION RECORD

Publication year
2017
Venue
Bioinform.
Publication date
2017-05-19
Fields of study
Biology, Medicine, Computer Science
Identifiers
DOI 10.1093/bioinformatics/btx310 arXiv 1705.08730 PMID 28525546 PMCID 5860070
Source metadata
Semantic Scholar, PubMed

REFERENCES

Provenance, propagation and quality of biological annotation
2014, cited by this paper
The W3C PROV family of specifications for modelling provenance metadata
2013, cited by this paper
Can Inferred Provenance and Its Visualisation Be Used to Detect Erroneous Annotation? A Case Study Using UniProtKB
2013, cited by this paper
New and continuing developments at PROSITE
2012, cited by this paper
The 2013 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection
2012, cited by this paper
The automatic annotation of bacterial genomes
2012, cited by this paper
An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB
2012, cited by this paper
The Pfam protein families database
2011, cited by this paper
The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection
2011, cited by this paper
InterPro in 2011: new developments in the family and domain prediction database
2011, cited by this paper
neXtProt: a knowledge platform for human proteins
2011, cited by this paper
Provenance and evidence in UniProtKB
2010, cited by this paper
Estimating the Quality of Ontology-Based Annotations by Considering Evolutionary Changes
2009, cited by this paper
Manual curation is not sufficient for annotation of genomic databases
2007, cited by this paper
The Pfam protein families database
2007, cited by this paper
UniSave: the UniProtKB Sequence/Annotation Version database
2006, cited by this paper
On the Nature of Biological Data
2005, cited by this paper
Studying cooperation and conflict between authors with history flow visualizations
2004, cited by this paper
PRINTS and its automatic supplement, prePRINTS
2003, influential reference
The TIGRFAMs database of protein families
2003, cited by this paper
PRECIS: an automated pipeline for producing concise reports about proteins
2001, cited by this paper

CITED BY

Perspectives on tracking data reuse across biodata resources
2024, cites this paper
Recent applications of bioinformatics in target identification and drug discovery for Alzheimer's disease
2022, cites this paper
Flavin Mononucleotide-Dependent l-Lactate Dehydrogenases: Expanding the Toolbox of Enzymes for l-Lactate Biosensors
2022, cites this paper
Experimental and computational investigation of enzyme functional annotations uncovers misannotation in the EC 1.1.3.15 enzyme class
2021, cites this paper
In vivo, in vitro and in silico: an open space for the development of microbe-based applications of synthetic biology
2021, cites this paper
The reuse of public datasets in the life sciences: potential risks and rewards
2020, cites this paper