What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

MLOps

A set of practices and tools that combine machine learning, DevOps, and data engineering to automate and operationalise the full lifecycle of ML models from development through production deployment and monitoring.

7 min read | Last updated May 2026 | Infrastructure

MLOps (Machine Learning Operations) is an engineering discipline that applies DevOps principles — automation, continuous integration, monitoring, and cross-functional collaboration — to the unique challenges of developing, deploying, and maintaining machine learning models in production. The goal of MLOps is to shorten the development cycle, improve model reliability, and ensure that ML systems perform as intended as real-world data distributions change over time.[^1]

The Problem MLOps Solves

Machine learning projects often fail to reach production. Industry surveys commonly report that most ML models never make it to production, and that many of those that do are retired quickly because their performance degrades. Several characteristics of ML systems make them harder to operationalise than traditional software:

ML models are probabilistic and sensitive to the statistical properties of their input data. When production data drifts away from the training distribution (because of seasonal changes, shifts in user behaviour, or upstream data pipeline modifications), model performance can degrade silently without triggering conventional software errors.

ML development involves a complex graph of dependencies between raw data, preprocessing logic, model code, hyperparameters, training infrastructure, and deployment environment. A change in any of these can cause a seemingly identical model to produce different results.

Unlike conventional software, where the artefact (compiled code) behaves deterministically, an ML model's behaviour is determined by the combination of architecture, training data, and training procedure.

MLOps addresses these challenges by providing tooling, processes, and organisational structures to manage the full ML lifecycle reproducibly and at scale.

Core Components

Experiment Tracking

Experiment tracking tools record the inputs and outputs of each training run — hyperparameters, dataset versions, code commits, evaluation metrics, and artefacts — enabling teams to reproduce results and compare experiments systematically. MLflow, Weights & Biases, Comet ML, and Neptune.ai are leading platforms in this space.
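The idea can be illustrated without any particular platform. The following is a minimal sketch, not the API of MLflow or any other tool named above: the `Run` and `Tracker` classes, the dataset tag `data-v1`, and the commit string `abc123` are all hypothetical names chosen for illustration.

```python
import hashlib
import json
import time
from dataclasses import dataclass, field


@dataclass
class Run:
    """One training run: what went in and what came out."""
    params: dict          # hyperparameters
    dataset_version: str  # hypothetical data snapshot tag
    code_commit: str      # hypothetical git commit identifier
    metrics: dict = field(default_factory=dict)
    started_at: float = field(default_factory=time.time)

    @property
    def run_id(self) -> str:
        # Derive the ID from the inputs only, so that two runs with the
        # same configuration, data, and code share an identifier.
        payload = json.dumps(
            [self.params, self.dataset_version, self.code_commit],
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


class Tracker:
    """In-memory stand-in for an experiment tracking backend."""

    def __init__(self):
        self.runs = []

    def log(self, run: Run) -> str:
        self.runs.append(run)
        return run.run_id

    def best(self, metric: str, higher_is_better: bool = True) -> Run:
        candidates = [r for r in self.runs if metric in r.metrics]
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda r: r.metrics[metric])


tracker = Tracker()
tracker.log(Run({"lr": 0.1, "depth": 3}, "data-v1", "abc123", {"auc": 0.81}))
tracker.log(Run({"lr": 0.01, "depth": 5}, "data-v1", "abc123", {"auc": 0.86}))

best = tracker.best("auc")
print(best.run_id, best.params, best.metrics)
```

Recording the dataset version and code commit next to the hyperparameters is what makes a result reproducible later. A real platform would add persistent storage, artefact upload, and a comparison UI on top of the same record structure.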

Feature Stores

A feature store is a centralised repository for computed ML features — derived attributes of raw data used as model inputs. Feature stores serve dual purposes: they allow features computed in the training pipeline to be reused consistently in serving (preventing training-serving skew), and they enable teams to share features across models rather than recomputing the same transformations repeatedly. Examples include Feast, Tecton, and managed offerings from Databricks and AWS SageMaker.
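A rough sketch of how a shared feature definition prevents training-serving skew is shown below. It does not reflect the API of Feast, Tecton, or any managed offering; `FEATURES`, `compute_features`, `OnlineStore`, and the sample customer record are hypothetical.

```python
from datetime import date

# Single source of truth for feature logic. The training pipeline and the
# serving path both call these functions instead of re-implementing them.
FEATURES = {
    "account_age_days": lambda row, today: (today - row["opened_on"]).days,
    "avg_txn_amount": lambda row, today: (
        sum(row["txns"]) / len(row["txns"]) if row["txns"] else 0.0
    ),
}


def compute_features(row, today):
    """Apply every registered feature definition to one raw record."""
    return {name: fn(row, today) for name, fn in FEATURES.items()}


class OnlineStore:
    """Hypothetical key-value store of precomputed features for serving."""

    def __init__(self):
        self._rows = {}

    def put(self, entity_id, features):
        self._rows[entity_id] = features

    def get(self, entity_id):
        return self._rows[entity_id]


raw = {"opened_on": date(2024, 1, 1), "txns": [120.0, 80.0, 100.0]}
today = date(2024, 1, 31)

# Offline path: features used to build the training set.
training_row = compute_features(raw, today)

# Online path: the same definitions, materialised for low-latency lookup.
store = OnlineStore()
store.put("cust-1", compute_features(raw, today))
serving_row = store.get("cust-1")

print(training_row == serving_row)
```

Because both paths go through `compute_features`, a change to a feature definition reaches training and serving together, which is the consistency property described above.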

Model Registry

A model registry is a versioned catalogue of trained model artefacts, storing each model along with its associated metadata, evaluation metrics, lineage information, and deployment status. The registry provides governance over which model version is active in production and facilitates rollback if a new version performs poorly. MLflow Model Registry and Amazon SageMaker Model Registry are common implementations.
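The promotion and rollback behaviour can be sketched with a small in-memory registry. This is an illustration under stated assumptions rather than the interface of MLflow Model Registry or SageMaker Model Registry: the class names, the stage labels, and the `s3://example-bucket/...` URIs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelVersion:
    version: int
    artefact_uri: str
    metrics: dict
    stage: str = "staging"  # "staging", "production", or "archived"


class ModelRegistry:
    """Versioned catalogue with at most one production version at a time."""

    def __init__(self):
        self._versions = []
        self._history = []  # version numbers that were previously in production

    def register(self, artefact_uri, metrics):
        mv = ModelVersion(len(self._versions) + 1, artefact_uri, metrics)
        self._versions.append(mv)
        return mv

    def production(self):
        return next((v for v in self._versions if v.stage == "production"), None)

    def promote(self, version):
        current = self.production()
        if current is not None:
            current.stage = "archived"
            self._history.append(current.version)
        self._versions[version - 1].stage = "production"

    def rollback(self):
        if not self._history:
            raise RuntimeError("no earlier production version to roll back to")
        bad = self.production()
        previous = self._history.pop()
        if bad is not None:
            bad.stage = "archived"
        self._versions[previous - 1].stage = "production"


registry = ModelRegistry()
v1 = registry.register("s3://example-bucket/churn/1", {"auc": 0.84})
v2 = registry.register("s3://example-bucket/churn/2", {"auc": 0.79})

registry.promote(v1.version)
registry.promote(v2.version)  # the new version turns out to perform poorly
registry.rollback()           # restore the previous production version

print(registry.production().version, v2.stage)
```

Keeping the production history separate from the version list is one possible design; it lets `rollback` restore the last known-good version without scanning metadata.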

CI/CD for ML

Continuous Integration and Continuous Deployment pipelines for ML extend conventional CI/CD to accommodate ML-specific artefacts, namely datasets and trained models. A CI pipeline might automatically retrain a model when new training data arrives, run evaluation suites against holdout datasets, validate data schema and quality, and check model performance against defined thresholds before approving deployment. Tools such as GitHub Actions, GitLab CI, and dedicated ML workflow orchestrators (Kubeflow Pipelines, Metaflow, Prefect) implement these pipelines.[^2]
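Two of the checks mentioned above, schema validation and a performance threshold, could be expressed as a script that a CI job runs before approving deployment. The sketch below is illustrative only: the schema, the AUC floor of 0.80, the allowed regression of 0.01, and the function names are assumptions, not values taken from any of the tools listed.

```python
EXPECTED_SCHEMA = {"age": int, "income": float, "label": int}  # hypothetical


def validate_schema(rows, schema):
    """Return a list of human-readable schema violations (empty if clean)."""
    errors = []
    for i, row in enumerate(rows):
        missing = schema.keys() - row.keys()
        if missing:
            errors.append(f"row {i}: missing {sorted(missing)}")
            continue
        for col, typ in schema.items():
            if not isinstance(row[col], typ):
                errors.append(f"row {i}: {col} is not {typ.__name__}")
    return errors


def deployment_gate(candidate, baseline, min_auc=0.80, max_regression=0.01):
    """Approve only if the candidate clears an absolute floor and does not
    regress against the current production model by more than the margin."""
    reasons = []
    if candidate["auc"] < min_auc:
        reasons.append(f"auc {candidate['auc']:.3f} is below the floor {min_auc:.3f}")
    if candidate["auc"] < baseline["auc"] - max_regression:
        reasons.append("auc regressed against the production baseline")
    return not reasons, reasons


rows = [
    {"age": 34, "income": 5200.0, "label": 0},
    {"age": 51, "income": "n/a", "label": 1},  # wrong type on purpose
]
schema_errors = validate_schema(rows, EXPECTED_SCHEMA)
approved, reasons = deployment_gate({"auc": 0.86}, {"auc": 0.85})

print(schema_errors)
print(approved, reasons)
```

In a pipeline, a non-empty `schema_errors` list or `approved == False` would typically fail the job with a non-zero exit code so that the deployment step never runs.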

Model Serving

Model serving is the infrastructure that makes a trained model available for inference. The two primary patterns are online serving — synchronous REST or gRPC endpoints that respond to real-time requests within milliseconds — and batch serving, where the model processes large volumes of requests offline. Serving infrastructure must handle load balancing, horizontal scaling, hardware acceleration (GPU/TPU allocation), and graceful model version transitions. Frameworks such as TorchServe, Triton Inference Server, and BentoML abstract serving infrastructure; managed services include AWS SageMaker Endpoints, Google Vertex AI Prediction, and Azure Machine Learning Endpoints.
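One common way to make a model version transition graceful is to send a small, fixed share of traffic to the new version first. The following sketch shows that routing decision in isolation; `pick_version`, the version labels, and the request ID format are hypothetical, and it is not how any of the frameworks above implement traffic splitting.

```python
import hashlib


def pick_version(request_id: str, canary_percent: int,
                 stable: str = "v1", canary: str = "v2") -> str:
    """Deterministically route a fixed share of requests to the canary.

    Hashing the request (or user) ID keeps routing stable across retries,
    so the same caller always reaches the same model version.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return canary if bucket < canary_percent else stable


request_ids = [f"req-{i}" for i in range(1000)]
routed = [pick_version(rid, canary_percent=10) for rid in request_ids]

print("share sent to canary:", routed.count("v2") / len(routed))
```

Raising `canary_percent` step by step to 100 completes the rollout, while setting it back to 0 returns all traffic to the stable version.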

Monitoring and Drift Detection

Production monitoring tracks both system-level metrics (latency, throughput, error rate) and ML-specific metrics (prediction distribution, input feature statistics, output calibration). Data drift refers to changes in the statistical distribution of input features relative to the training set; concept drift refers to changes in the relationship between inputs and the target variable. Automated alerts and retraining triggers based on drift metrics allow teams to intervene before performance degradation becomes visible to users.[^3]
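As one concrete example of a data drift metric, the Population Stability Index (PSI) compares the binned distribution of a feature in production with its distribution in the training set. PSI is a swapped-in illustration, not a method prescribed by this article, and the bin count, the smoothing constant `eps`, and the alert threshold of 0.25 (a widely used rule of thumb) are assumptions.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample (for example,
    training data) and a live sample of the same numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant features

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp values outside the range
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


train = [i / 100 for i in range(1000)]  # reference distribution
live_same = list(train)                 # no drift
live_shifted = [v + 3 for v in train]   # inputs shifted upwards

DRIFT_THRESHOLD = 0.25  # rule-of-thumb value for a significant shift

for name, sample in [("same", live_same), ("shifted", live_shifted)]:
    score = psi(train, sample)
    print(name, round(score, 3), "ALERT" if score > DRIFT_THRESHOLD else "ok")
```

A metric of this kind only detects data drift. Detecting concept drift additionally requires ground-truth labels, or a proxy for them, so that the relationship between inputs and the target can be re-evaluated.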

MLOps Maturity Levels

Google has described a three-level MLOps maturity framework:

| Level | Characteristics |
|-------|----------------|
| Level 0 | Manual, script-based training and deployment; no pipeline automation |
| Level 1 | Automated training pipeline; continuous delivery of model predictions |
| Level 2 | Automated CI/CD for ML pipelines; full pipeline automation from data to deployment |

Most organisations begin at Level 0 and progress toward Level 2 as ML usage matures and the cost of manual operations becomes apparent.

LLMOps

The emergence of large language model deployments has spawned a subfield sometimes called LLMOps, which extends MLOps practices to the specific challenges of operating LLMs: managing prompt versions alongside model versions, monitoring for hallucination and output quality degradation rather than statistical drift, evaluating models against qualitative criteria using LLM-as-judge approaches, and managing the cost and latency of inference at scale. Tools such as LangSmith, Langfuse, Helicone, and Arize AI provide LLM-specific observability features.
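Managing prompt versions alongside model versions can be sketched by deriving a version identifier from both. This is an illustrative assumption, not the mechanism used by LangSmith, Langfuse, Helicone, or Arize AI; `PromptVersion` and the model identifiers `model-a` and `model-b` are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    name: str
    template: str
    model: str  # hypothetical identifier of the model the prompt targets

    @property
    def version_id(self) -> str:
        # Any change to the template or the model yields a new identifier,
        # so logged outputs can be traced back to the exact pair used.
        digest = hashlib.sha256(
            f"{self.model}\n{self.template}".encode()
        ).hexdigest()
        return f"{self.name}@{digest[:8]}"

    def render(self, **variables) -> str:
        return self.template.format(**variables)


summarise_v1 = PromptVersion(
    "summarise", "Summarise the following text:\n{text}", "model-a"
)
summarise_v2 = PromptVersion(
    "summarise", "Summarise the following text in {language}:\n{text}", "model-a"
)
summarise_v2_other_model = PromptVersion(
    "summarise", summarise_v2.template, "model-b"
)

print(summarise_v1.version_id)
print(summarise_v2.version_id)
print(summarise_v2.render(language="Bahasa Malaysia", text="..."))
```

Logging `version_id` with every inference request makes it possible to attribute a change in output quality to a prompt edit, a model swap, or both.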

Malaysian Context — MLOps Adoption and Ecosystem

MLOps adoption in Malaysia is accelerating alongside broader enterprise AI investments, with financial services, telecommunications, and the public sector emerging as the primary domains of activity.

Maybank's data and AI division has built internal MLOps pipelines to manage the bank's portfolio of credit risk models, fraud detection systems, and customer personalisation algorithms. The platform integrates model versioning, automated retraining triggers based on statistical drift detection, and model performance dashboards compliant with Bank Negara Malaysia's (BNM) Risk Management in Technology (RMiT) guidelines, which require financial institutions to maintain audit trails and ensure AI system reliability.

PETRONAS' digital arm has applied MLOps to predictive maintenance models across its upstream and downstream operations. Models that predict equipment failure probabilities based on sensor telemetry require continuous monitoring for data drift as equipment ages, operating conditions change, and sensor configurations are updated — a canonical MLOps use case.

Telekom Malaysia (TM) has implemented model serving infrastructure on its private cloud to support network anomaly detection and churn prediction models. The organisation has adopted a tiered serving architecture that reserves GPU resources for computationally intensive inference workloads while routing simpler predictions to CPU-based endpoints to manage cost.

The Malaysian AI startup ecosystem, concentrated at the Malaysia Digital Hub in Cyberjaya and MRANTI Park in Bukit Jalil, includes a growing number of MLOps-focused companies and consultancies offering platform engineering services. Cloud hyperscaler programmes — Microsoft Azure's ISV Success Programme, AWS's ASEAN Partner Network, and Google Cloud's Partner Advantage — provide technical and financial support for Malaysian companies building MLOps solutions on their platforms.

MDEC's Digital Talent programme and HRD Corp-funded training initiatives include MLOps engineering courses covering experiment tracking, pipeline orchestration, and model monitoring. Universiti Teknologi Malaysia (UTM) and Multimedia University (MMU) have introduced MLOps modules into their postgraduate AI and data science curricula.

References

[^1]: Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., & Dennison, D. (2015). Hidden Technical Debt in Machine Learning Systems. Advances in Neural Information Processing Systems, 28.
[^2]: AWS. (2025). What is MLOps? Machine Learning Operations Explained. Amazon Web Services. https://aws.amazon.com/what-is/mlops/
[^3]: Klaise, J., Van Looveren, A., Vacanti, G., & Coca, A. (2020). Monitoring and Explainability of Models in Production. ICML Workshop on Challenges in Deploying and Monitoring Machine Learning Systems.
[^4]: Gartner. (2025). Magic Quadrant for Data Science and Machine Learning Platforms. Gartner Research.

Tags: MLOps, model deployment, model monitoring, CI/CD, machine learning

Full name	Machine Learning Operations
Type	Engineering discipline and tooling ecosystem
Analogous to	DevOps in software engineering
Key tools	MLflow, Weights & Biases, Kubeflow, SageMaker, Vertex AI
Related	DataOps, model serving, feature store, data pipeline
