Random Forest

Random forest is an ensemble machine learning algorithm that builds many decision trees on bootstrapped samples and aggregates their predictions to improve accuracy and reduce overfitting.

Last updated May 2026 | Foundations

Random forest is a supervised ensemble learning algorithm that constructs a large number of decision trees during training and aggregates their predictions. Formally introduced by Leo Breiman in 2001, it builds on bootstrap aggregating (bagging) and adds the idea of randomly sampling features at each tree split. Random forests are one of the most widely used non-deep-learning algorithms for tabular data, valued for accuracy, robustness, and ease of use.

Background

Decision trees are simple, interpretable models that recursively split data on features to produce predictions. Individual trees, however, are prone to high variance: small changes in the training data can lead to very different trees. Bagging, proposed by Breiman in 1996, reduces variance by training many trees on bootstrap samples and averaging their predictions. Random forests further decorrelate the trees by considering only a random subset of features when choosing each split, which mitigates the dominance of a few highly informative features and produces more diverse ensembles.
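The benefit of decorrelating trees can be made concrete with a standard identity for the variance of an average of identically distributed, pairwise-correlated predictions (discussed in Hastie et al.). The sketch below is illustrative only; the function name and the symbols sigma2 (variance of a single tree) and rho (pairwise correlation between trees) are assumptions introduced here, not names from any library.

```python
def ensemble_variance(sigma2: float, rho: float, n_trees: int) -> float:
    """Variance of the mean of n_trees predictions.

    Assumes every tree's prediction has variance sigma2 and every pair of
    trees has correlation rho:
        Var(mean) = rho * sigma2 + (1 - rho) * sigma2 / n_trees
    """
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


# Adding trees shrinks only the second term; the first term is a floor
# set by the correlation between trees.
highly_correlated = ensemble_variance(sigma2=1.0, rho=0.5, n_trees=100)
decorrelated = ensemble_variance(sigma2=1.0, rho=0.1, n_trees=100)
```

Under these assumptions, lowering rho (which is what random feature selection aims to do) reduces the variance floor that no number of additional trees can remove.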

Algorithm

A random forest with n_trees trees (n_estimators in scikit-learn) is constructed as follows.

1. For each tree, draw a bootstrap sample of the training set: the same size as the original dataset, but sampled with replacement. On average roughly 37% of observations are left out of each sample; these form the out-of-bag (OOB) sample for that tree.
2. Grow a decision tree on the bootstrap sample. At every node, randomly select mtry features (a subset of all features; max_features in scikit-learn) and choose the best split among them.
3. Continue splitting until a stopping criterion is met (minimum samples per leaf, maximum depth, or until pure leaves are reached). Trees are typically not pruned.
4. Aggregate predictions: for classification, take the majority vote across trees; for regression, take the average prediction.
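The steps above can be sketched in plain Python for a small classification problem. This is a teaching sketch under stated assumptions, not a reference implementation: it uses Gini impurity, midpoint thresholds, and "pure leaf or no improving split" as the only stopping rule, and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, feature_ids):
    """Best (feature, threshold) among the candidate features, or None."""
    best, best_score = None, gini(labels)
    for f in feature_ids:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring values
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(rows, labels, mtry, rng):
    """Step 2 and 3: grow an unpruned tree with mtry random features per node."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:
        return majority  # pure leaf
    feature_ids = rng.sample(range(len(rows[0])), mtry)
    split = best_split(rows, labels, feature_ids)
    if split is None:
        return majority  # no candidate split reduces impurity
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "f": f,
        "t": t,
        "left": grow_tree([r for r, _ in left], [y for _, y in left], mtry, rng),
        "right": grow_tree([r for r, _ in right], [y for _, y in right], mtry, rng),
    }


def predict_tree(node, row):
    while isinstance(node, dict):
        node = node["left"] if row[node["f"]] <= node["t"] else node["right"]
    return node


def fit_forest(rows, labels, n_trees=25, mtry=1, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample of size n, drawn with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx], mtry, rng))
    return forest


def predict_forest(forest, row):
    """Step 4: majority vote across trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]


# Toy data (invented for illustration): two well-separated clusters.
rows = [(1.0, 1.2), (1.5, 0.8), (2.0, 1.0), (0.9, 1.9),
        (6.0, 6.5), (7.0, 5.8), (6.5, 7.1), (7.2, 6.9)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels, n_trees=25, mtry=1)
```

For regression, the same structure would apply with a variance-based split criterion and an average in place of the vote.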

Out-of-bag samples can be used to estimate generalisation error without a separate validation set. Feature importance is typically computed from the average impurity decrease attributable to each feature, or from permutation tests on out-of-bag data.
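To see why out-of-bag data is plentiful enough to estimate error, it helps to compute how much of the training set a bootstrap sample misses. The sketch below compares the closed-form expectation with a single simulated draw; the function names are hypothetical and introduced only for illustration.

```python
import random


def expected_oob_fraction(n: int) -> float:
    """Probability that a given observation is never drawn in n draws
    with replacement from n observations: (1 - 1/n) ** n."""
    return (1.0 - 1.0 / n) ** n


def simulated_oob_fraction(n: int, seed: int = 0) -> float:
    """Fraction of observations missing from one simulated bootstrap sample."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1.0 - len(drawn) / n


theory = expected_oob_fraction(1000)
simulation = simulated_oob_fraction(10_000)
```

As n grows, the expected fraction approaches 1/e, about 0.368, so each tree has roughly a third of the data available as an internal test set.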

Hyperparameters

| Hyperparameter | Description | Typical default |
| --- | --- | --- |
| n_estimators | Number of trees | 100–500 |
| max_features | Features considered per split | sqrt(p) for classification, p/3 for regression |
| max_depth | Maximum tree depth | None (grow fully) |
| min_samples_split | Minimum samples to split a node | 2 |
| min_samples_leaf | Minimum samples at a leaf | 1 |
| bootstrap | Whether to sample with replacement | True |

Random forests are comparatively forgiving of hyperparameter choices, which contributes to their popularity as a strong default model.
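The max_features rule of thumb in the table can be written as a small helper, where p is the total number of features. This is a sketch: the function name is hypothetical, and rounding down with a minimum of one feature is an assumption made here, since the table does not say how fractional values are handled.

```python
import math


def default_max_features(p: int, task: str) -> int:
    """Rule-of-thumb number of features to consider at each split.

    Assumption: fractional results are rounded down, with a floor of 1.
    """
    if task == "classification":
        return max(1, math.isqrt(p))  # floor(sqrt(p))
    if task == "regression":
        return max(1, p // 3)  # floor(p / 3)
    raise ValueError("task must be 'classification' or 'regression'")


features_for_classifier = default_max_features(100, "classification")
features_for_regressor = default_max_features(100, "regression")
```

Libraries differ in their actual defaults, so it is worth checking the documentation of the implementation in use rather than relying on this rule.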

Strengths and limitations

Random forests handle both classification and regression, work with mixed numerical and categorical features (natively in some implementations, after encoding in others such as scikit-learn), are robust to outliers and to irrelevant features, and provide useful feature-importance measures. They generally require less tuning than gradient-boosted trees and parallelise easily across cores.

Limitations include large memory footprint for very large datasets, slower inference than a single tree, less interpretability than a single decision tree (although tools such as SHAP and LIME mitigate this), and a tendency to underperform gradient boosting on competitive tabular tasks. Random forests struggle to extrapolate outside the range of training data and are not the best choice for very high-dimensional sparse data such as text bag-of-words.
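The extrapolation limit follows from how trees predict: every leaf returns an average of training targets, so no prediction can fall outside the range of targets seen in training. The sketch below illustrates this with a deliberately tiny regression forest of depth-1 trees on one feature; the function names and the data (y = 2x for x from 0 to 9) are invented for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def fit_stump(xs, ys):
    """Depth-1 regression tree on one feature: (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, ml, mr)
    if best is None:  # bootstrap sample contained a single distinct x
        m = mean(ys)
        return (float("inf"), m, m)
    return best[1:]


def fit_forest(xs, ys, n_trees=50, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict(forest, x):
    """Average the leaf means selected by each tree."""
    return mean([left if x <= thr else right for thr, left, right in forest])


xs = list(range(10))
ys = [2.0 * x for x in xs]
forest = fit_forest(xs, ys)

inside = predict(forest, 9)     # largest x seen in training
outside = predict(forest, 100)  # far outside the training range
```

Here the true value at x = 100 would be 200, yet the forest returns the same prediction as at x = 9, because both points fall into the right-hand leaf of every tree.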

Comparison with related methods

Random forest is part of a family of tree-based ensemble methods that also includes Extremely Randomized Trees (Extra Trees), gradient-boosted trees, and modern implementations such as XGBoost, LightGBM, and CatBoost. Gradient boosting builds trees sequentially, with each tree correcting the errors of the previous ensemble; it often achieves higher accuracy than random forest on structured tabular data but requires more careful tuning. Extra Trees randomise the split thresholds in addition to feature selection, sometimes reducing variance further at the cost of higher bias.

For tabular machine learning competitions on platforms such as Kaggle, gradient-boosted variants (XGBoost, LightGBM, CatBoost) are typically the strongest baseline, with random forests and stacked ensembles close behind. Deep learning methods such as TabNet and FT-Transformer have closed some of the gap but rarely dominate on small to mid-sized tabular datasets.

Applications

Random forests are used in a wide variety of domains:

Finance: credit scoring, fraud detection, churn prediction
Healthcare: disease risk modelling, biomarker discovery
Industrial operations: predictive maintenance, quality control
Agriculture: yield prediction, crop disease identification
Marketing: customer segmentation, response modelling
Public sector: tax risk assessment, social-program eligibility
Cybersecurity: intrusion detection, malware classification

In ecology and remote sensing, random forests are a standard tool for classifying land cover from satellite imagery and for species distribution modelling.

Malaysian Context — Random Forests in Malaysian Applications

Random forest models have a strong presence in Malaysian industry and research. In banking, Maybank, CIMB, RHB, Public Bank, and Hong Leong Bank deploy random forests and gradient-boosted variants for credit decisioning, anti-money-laundering screening, and customer-churn modelling, operating under Bank Negara Malaysia's (BNM) Risk Management in Technology (RMiT) framework and emerging MY-AI standards on model governance. The Securities Commission Malaysia (SC) recognises tree-based ensembles in its discussions of model risk management for licensed entities.

In the plantation and agriculture sector, the Malaysian Palm Oil Board (MPOB), Sime Darby Plantation, IOI Corporation, and Felda Global Ventures (FGV) use random forests for yield prediction, disease detection, and land-use classification with remote sensing data from satellites and drones. The Malaysian Agricultural Research and Development Institute (MARDI) applies random forests to rice and horticulture research, and the Forest Research Institute Malaysia (FRIM) uses them for biodiversity and forest cover monitoring.

Healthcare adoption includes risk-prediction models in collaboration with University Malaya Medical Centre, IMU University, Universiti Sains Malaysia, and the Ministry of Health Malaysia, particularly in chronic disease modelling and laboratory analytics. Cybersecurity teams at CyberSecurity Malaysia and the National Cyber Security Agency (NACSA) use random forests as part of intrusion detection and malware classification pipelines.

In manufacturing, Penang's electronics cluster — Intel Malaysia, Western Digital, Inari Amertron, ViTrox, and Pentamaster — employs random forests in predictive maintenance and defect classification. Training is supported by HRD Corp, MDEC's MyDigital Maker programme, and university data-science curricula at UM, UTM, UKM, UTP, and MMU.

Implementations

Random forests are available in essentially every major machine learning library: scikit-learn, R's randomForest and ranger packages, Apache Spark MLlib, H2O.ai, XGBoost's random forest mode, and cloud services such as AWS SageMaker, Google Vertex AI, and Microsoft Azure Machine Learning. Their simplicity and strong default behaviour make them a common baseline in production machine learning pipelines and a popular teaching example in data-science courses.
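As one example among the libraries listed, a minimal scikit-learn sketch might look like the following. It assumes scikit-learn is installed, uses a synthetic dataset generated purely for illustration, and picks hyperparameter values arbitrarily rather than as recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only (not from the article).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)

clf = RandomForestClassifier(
    n_estimators=200,     # number of trees
    max_features="sqrt",  # features considered at each split
    oob_score=True,       # estimate accuracy from out-of-bag samples
    random_state=0,
)
clf.fit(X, y)

oob_accuracy = clf.oob_score_            # out-of-bag accuracy estimate
importances = clf.feature_importances_   # impurity-based importances
```

The out-of-bag score gives a quick generalisation estimate without a separate validation split, and the importances array holds one non-negative value per feature, normalised to sum to one.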

References

Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
Breiman, L. (1996). Bagging Predictors. Machine Learning, 24(2), 123–140.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.
Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Tags: random forest, ensemble learning, decision tree, machine learning

Type	Supervised ensemble learning
Introduced	2001 (Leo Breiman)
Builds on	Bagging, decision trees
Tasks	Classification and regression
Key feature	Random feature subsets per split
Related	Gradient boosting, bagging, decision tree
