The parameter sensitivity of random forests

Published 2016 in BMC Bioinformatics

ABSTRACT

The Random Forest (RF) algorithm for supervised machine learning is an ensemble learning method widely used in science and many other fields. Its popularity has been increasing, but relatively few studies address the parameter selection process, a critical step in model fitting. Because of numerous assertions that the default parameters perform reliably, many RF models are fit using these values. However, there has not yet been a thorough examination of the parameter sensitivity of RFs in computational genomic studies. We address this gap here. We examined the effects of parameter selection on classification performance using the RF machine learning algorithm on two biological datasets with distinct p/n ratios: sequencing summary statistics (low p/n) and microarray-derived data (high p/n). Here, p refers to the number of variables and n to the number of samples. Our findings demonstrate that parameterization is highly correlated with prediction accuracy and variable importance measures (VIMs). Further, we demonstrate that different parameters are critical in tuning different datasets, and that parameter optimization significantly improves upon the default parameters. Parameter performance demonstrated wide variability on both low and high p/n data. Therefore, there is significant benefit to be gained by tuning RFs away from their default parameter settings.
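To make the idea of a parameter sweep concrete, here is a minimal sketch and not the study's actual pipeline. The abstract does not name the parameters that were tuned, so the sketch assumes the three arguments familiar from the R randomForest package (ntree, mtry, nodesize). The synthetic data, the small grid, the helper names (gini, grow_tree, oob_error, make_data), and the simplified from-scratch forest are all illustrative assumptions. Each setting is scored by its out-of-bag (OOB) misclassification rate and compared against a default-like baseline.

```python
import math
import random
from collections import Counter
from itertools import product


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def grow_tree(X, y, idx, mtry, nodesize, rng):
    """Grow one classification tree on the (bootstrap) sample indices idx."""
    labels = [y[i] for i in idx]
    majority = Counter(labels).most_common(1)[0][0]
    if len(idx) <= nodesize or len(set(labels)) == 1:
        return majority  # leaf: predict the majority class
    best = None
    # Only a random subset of mtry variables is tried at each split.
    for j in rng.sample(range(len(X[0])), mtry):
        for t in sorted({X[i][j] for i in idx})[:-1]:
            left = [i for i in idx if X[i][j] <= t]
            right = [i for i in idx if X[i][j] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:  # every sampled variable is constant in this node
        return majority
    _, j, t, left, right = best
    return (j, t, grow_tree(X, y, left, mtry, nodesize, rng),
            grow_tree(X, y, right, mtry, nodesize, rng))


def predict(tree, x):
    """Walk from the root to a leaf and return the leaf's class."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree


def oob_error(X, y, ntree, mtry, nodesize, seed=0):
    """Fit a forest and return its out-of-bag misclassification rate."""
    rng = random.Random(seed)
    n = len(X)
    votes = [Counter() for _ in range(n)]
    for _ in range(ntree):
        boot = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        tree = grow_tree(X, y, boot, mtry, nodesize, rng)
        for i in set(range(n)) - set(boot):  # samples this tree never saw
            votes[i][predict(tree, X[i])] += 1
    scored = [i for i in range(n) if votes[i]]
    wrong = sum(votes[i].most_common(1)[0][0] != y[i] for i in scored)
    return wrong / len(scored)


def make_data(n=60, p=5, seed=1):
    """Toy data (assumption): class depends on variables 0 and 1 only."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [int(row[0] + row[1] > 0) for row in X]
    return X, y


X, y = make_data()
p = len(X[0])
# Baseline: mtry = floor(sqrt(p)) and nodesize = 1 mirror common classification
# defaults; ntree is scaled down from the usual 500 to keep the toy fast.
baseline = (25, math.isqrt(p), 1)
grid = product((10, 25), (1, math.isqrt(p), p), (1, 5))  # (ntree, mtry, nodesize)
results = {cfg: oob_error(X, y, *cfg) for cfg in grid}
best = min(results, key=results.get)
print("baseline", baseline, round(results[baseline], 3))
print("best    ", best, round(results[best], 3))
```

Because the best setting is chosen using the same OOB estimate that is then reported, its error is optimistically biased. A held-out test set or nested resampling would give a fairer comparison with the baseline.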

PUBLICATION RECORD

Publication year: 2016
Venue: BMC Bioinformatics
Publication date: 2016-09-01
Fields of study: Biology, Medicine, Computer Science
Identifiers: DOI 10.1186/s12859-016-1228-x; PMID 27586051; PMCID 5009551
Source metadata: Semantic Scholar, PubMed
