Random Forest
Random forest is an ensemble machine learning algorithm that builds many decision trees on bootstrapped samples and aggregates their predictions to improve accuracy and reduce overfitting.
Random forest is a supervised ensemble learning algorithm that constructs a large number of decision trees during training and aggregates their predictions. Formally introduced by Leo Breiman in 2001, it builds on bootstrap aggregating (bagging) and adds the idea of randomly sampling features at each tree split. Random forests are one of the most widely used non-deep-learning algorithms for tabular data, valued for accuracy, robustness, and ease of use.
Background
Decision trees are simple, interpretable models that recursively split data on features to produce predictions. Individual trees, however, are prone to high variance: small changes in the training data can lead to very different trees. Bagging, proposed by Breiman in 1996, reduces variance by training many trees on bootstrap samples and averaging their predictions. Random forests further decorrelate the trees by considering only a random subset of features when choosing each split. This prevents a few highly informative features from dominating every tree and produces a more diverse ensemble.
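The variance argument can be made concrete with a standard result (discussed, for example, in Hastie et al., 2009). The notation here is introduced for illustration and is not used elsewhere in this article: suppose the predictions of B trees, T_1(x), ..., T_B(x), are identically distributed with variance σ² and pairwise correlation ρ. The variance of their average is then:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

Adding more trees shrinks only the second term, so the correlation ρ sets a floor on the achievable variance. Bagging alone increases B; random feature selection at each split additionally aims to lower ρ.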
Algorithm
A random forest with n_trees trees is constructed as follows.
- For each tree, draw a bootstrap sample of the training set: the same size as the original dataset but sampled with replacement, so that some observations are left out (the out-of-bag, or OOB, sample for that tree).
- Grow a decision tree on the bootstrap sample. At every node, randomly select mtry features (a subset of all features) and choose the best split among them.
- Continue splitting until a stopping criterion is met (minimum samples per leaf, maximum depth, or until pure leaves are reached). Trees are typically not pruned.
- Aggregate predictions: for classification, take the majority vote across trees; for regression, take the average prediction.
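The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes NumPy and scikit-learn are available, borrows scikit-learn's `DecisionTreeClassifier` for the per-node feature subsampling (its `max_features` argument plays the role of mtry), and uses the hypothetical helper names `fit_forest` and `predict_forest`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=25, mtry="sqrt", seed=0):
    """Fit n_trees unpruned trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Bootstrap: draw n row indices with replacement.
        idx = rng.integers(0, n, size=n)
        # max_features makes the tree consider a random feature subset at each split.
        tree = DecisionTreeClassifier(
            max_features=mtry, random_state=int(rng.integers(1 << 31))
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def predict_forest(trees, X):
    """Majority vote across trees (assumes integer class labels 0..K-1)."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)  # (n_trees, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])


# Synthetic data, used only to exercise the sketch.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)
trees = fit_forest(X, y)
pred = predict_forest(trees, X)
train_accuracy = float((pred == y).mean())
```

Because each bootstrap sample draws n rows with replacement, it omits roughly 1/e (about 37%) of the observations on average; those are the tree's out-of-bag rows. The hard majority vote here follows the description above, whereas scikit-learn's own `RandomForestClassifier` averages the trees' class probabilities instead.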
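Out-of-bag samples can be used to estimate generalisation error without a separate validation set: each observation is predicted using only the trees whose bootstrap sample did not contain it, and the resulting errors are aggregated. Feature importance is typically computed from the average impurity decrease attributable to each feature, or from permutation tests on out-of-bag data.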
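As an illustration, scikit-learn exposes both ideas. The sketch below assumes scikit-learn is installed and uses synthetic data. One caveat: `permutation_importance` permutes features in whatever dataset it is given (a held-out split here), not in the out-of-bag samples as in Breiman's original formulation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 8 features, of which 3 carry signal.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# oob_score=True asks the forest to score each training row using only
# the trees that did not see that row in their bootstrap sample.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_                   # OOB estimate of accuracy
impurity_importance = forest.feature_importances_  # mean impurity decrease, sums to 1

# Permutation importance on held-out data (an assumption of this sketch,
# standing in for the OOB-based permutation test described above).
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
perm_importance = perm.importances_mean
```

Comparing `oob_accuracy` with `forest.score(X_test, y_test)` is a quick sanity check that the OOB estimate tracks held-out performance on a given dataset.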
Hyperparameters
| Hyperparameter | Description | Typical default |
| --- | --- | --- |
| n_estimators | Number of trees | 100–500 |
| max_features | Features considered per split | sqrt(p) for classification, p/3 for regression (p = number of features) |
| max_depth | Maximum tree depth | None (grow fully) |
| min_samples_split | Minimum samples to split a node | 2 |
| min_samples_leaf | Minimum samples at a leaf | 1 |
| bootstrap | Whether to sample with replacement | True |
Random forests are comparatively forgiving of hyperparameter choices, which contributes to their popularity as a strong default model.
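For reference, the settings in the table map onto scikit-learn's estimators roughly as follows. This is a sketch that assumes scikit-learn; the values are chosen to mirror the table, and the library's own defaults vary by version and may differ from them (for the regressor's `max_features` in particular).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification forest using the table's typical values.
clf = RandomForestClassifier(
    n_estimators=300,        # number of trees
    max_features="sqrt",     # sqrt(p) features considered per split
    max_depth=None,          # grow each tree fully
    min_samples_split=2,
    min_samples_leaf=1,
    bootstrap=True,          # sample rows with replacement
    random_state=0,
)

# Regression forest: a float max_features is read as a fraction of p.
reg = RandomForestRegressor(n_estimators=300, max_features=1 / 3, random_state=0)

# Fit the classifier on synthetic data to confirm the configuration is valid.
X, y = make_classification(n_samples=200, n_features=9, random_state=0)
clf.fit(X, y)
n_fitted_trees = len(clf.estimators_)
```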
Strengths and limitations
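Random forests handle both classification and regression, can work with mixed numerical and categorical features (although some implementations, such as scikit-learn, require categorical features to be numerically encoded first), are robust to outliers and to irrelevant features, and provide useful feature-importance measures. They generally require less tuning than gradient-boosted trees and, because the trees are trained independently, parallelise easily across cores.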
Limitations include a large memory footprint on very large datasets, slower inference than a single tree, lower interpretability than a single decision tree (although tools such as SHAP and LIME mitigate this), and a tendency to underperform gradient boosting on competitive tabular tasks. Because a regression forest's prediction is an average of training targets, random forests cannot extrapolate outside the range of the training data. They are also not the best choice for very high-dimensional sparse data such as text bag-of-words.
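The extrapolation limit is easy to demonstrate. The sketch below assumes NumPy and scikit-learn and uses a made-up linear target, y = 2x, observed only for x between 0 and 10. Each tree predicts an average of training targets, so the forest's output cannot exceed the largest target it saw.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Training data covers x in [0, 10] with a noiseless linear target y = 2x.
x_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2.0 * x_train.ravel()

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(x_train, y_train)

inside = forest.predict([[5.0]])[0]    # within the training range; true value is 10
outside = forest.predict([[20.0]])[0]  # outside the range; true value is 40

# The out-of-range prediction is capped by the largest training target (20).
capped = outside <= y_train.max()
```

A linear model or a tree ensemble fitted on detrended targets may be a better fit when predictions are needed beyond the observed range.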
Comparison with related methods
Random forest is part of a family of tree-based ensemble methods that also includes Extremely Randomized Trees (Extra Trees), gradient-boosted trees, and modern implementations such as XGBoost, LightGBM, and CatBoost. Gradient boosting builds trees sequentially, with each tree correcting the errors of the previous ensemble; it often achieves higher accuracy than random forest on structured tabular data but requires more careful tuning. Extra Trees randomise the split thresholds in addition to feature selection, sometimes reducing variance further at the cost of higher bias.
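A quick way to compare these methods on a given problem is cross-validation. The sketch below assumes scikit-learn and uses its `GradientBoostingClassifier` as a stand-in for the gradient-boosted family. The data are synthetic, so the resulting ranking says nothing about how the methods compare on real datasets.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score

# Synthetic data, used only to exercise the comparison.
X, y = make_classification(n_samples=400, n_features=12, n_informative=5, random_state=0)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "extra_trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()
}
```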
For tabular machine learning competitions on platforms such as Kaggle, gradient-boosted variants (XGBoost, LightGBM, CatBoost) are typically the strongest baseline, with random forests and stacked ensembles close behind. Deep learning methods such as TabNet and FT-Transformer have closed some of the gap but rarely dominate on small to mid-sized tabular datasets.
Applications
Random forests are used in a wide variety of domains:
- Finance: credit scoring, fraud detection, churn prediction
- Healthcare: disease risk modelling, biomarker discovery
- Industrial operations: predictive maintenance, quality control
- Agriculture: yield prediction, crop disease identification
- Marketing: customer segmentation, response modelling
- Public sector: tax risk assessment, social-program eligibility
- Cybersecurity: intrusion detection, malware classification
In ecology and remote sensing, random forests are a standard tool for classifying land cover from satellite imagery and for species distribution modelling.
Implementations
Random forests are available in essentially every major machine learning library: scikit-learn, R's randomForest and ranger packages, Apache Spark MLlib, H2O.ai, XGBoost's random forest mode, and cloud services such as AWS SageMaker, Google Vertex AI, and Microsoft Azure Machine Learning. Their simplicity and strong default behaviour make them a common baseline in production machine learning pipelines and a popular teaching example in data-science courses.
References
- Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
- Breiman, L. (1996). Bagging Predictors. Machine Learning, 24(2), 123–140.
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.
- Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.