The MEDLINE database provides an extensive source of scientific articles and heterogeneous biomedical information in the form of unstructured text. Among the most important knowledge contained in these articles are the relations between human proteins and their phenotypes, which can remain hidden because of the exponential growth in publications. This has created a range of opportunities for developing computational methods that extract these biomedical relations from the articles. Currently, however, no such method exists for the automated extraction of relations between human proteins and Human Phenotype Ontology (HPO) terms. In our previous work, we developed a comprehensive database of all co-mentions of proteins and phenotypes. In this study, we present a supervised machine learning approach called PPPred (Protein-Phenotype Predictor) for classifying the validity of a given sentence-level co-mention. Using a gold-standard dataset developed in-house, we demonstrate that PPPred significantly outperforms several baseline methods. This two-step approach of co-mention extraction followed by classification constitutes a complete biomedical relation extraction pipeline for protein-phenotype relations.
CCS CONCEPTS • Computing methodologies → Information extraction; Supervised learning by classification; • Applied computing → Bioinformatics
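To make the first step of the pipeline concrete, the following is a minimal sketch of sentence-level co-mention extraction by dictionary matching. It is not the authors' implementation: the toy lexicons `PROTEINS` and `PHENOTYPES`, the naive sentence splitter, and the function names are all assumptions made for illustration. Each candidate triple it returns would then be passed to a trained classifier such as PPPred, which decides whether the co-mention expresses a valid relation.

```python
import re
from itertools import product

# Hypothetical toy lexicons (assumption); a real system would load full
# protein-name and HPO-term dictionaries instead.
PROTEINS = {"BRCA1", "FBN1"}
PHENOTYPES = {"breast carcinoma", "scoliosis"}


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_mentions(sentence, lexicon):
    """Return lexicon entries occurring in the sentence as whole words, ignoring case."""
    found = []
    for term in sorted(lexicon):
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            found.append(term)
    return found


def extract_comentions(text):
    """List every (sentence, protein, phenotype) triple that co-occurs in one sentence."""
    triples = []
    for sentence in split_sentences(text):
        proteins = find_mentions(sentence, PROTEINS)
        phenotypes = find_mentions(sentence, PHENOTYPES)
        # Every protein/phenotype pair in the same sentence is a candidate co-mention.
        triples.extend((sentence, p, h) for p, h in product(proteins, phenotypes))
    return triples


if __name__ == "__main__":
    example = "Mutations in FBN1 cause scoliosis. BRCA1 was sequenced."
    for triple in extract_comentions(example):
        print(triple)
```

Dictionary matching of this kind finds every co-occurrence, including sentences where the protein and phenotype are mentioned together without being related, which is why a separate classification step is needed.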
PPPred: Classifying Protein-phenotype Co-mentions Extracted from Biomedical Literature
Morteza Pourreza Shahri,G. Reynolds,Mandi M. Roe,Indika Kahanda
Published 2019 in bioRxiv
ABSTRACT
PUBLICATION RECORD
- Publication year: 2019
- Venue: bioRxiv
- Publication date: 2019-05-31
- Fields of study: Biology, Medicine, Computer Science
- Source metadata: Semantic Scholar