MLOps
A set of practices and tools that combine machine learning, DevOps, and data engineering to automate and operationalise the full lifecycle of ML models from development through production deployment and monitoring.
MLOps (Machine Learning Operations) is an engineering discipline that applies DevOps principles — automation, continuous integration, monitoring, and cross-functional collaboration — to the unique challenges of developing, deploying, and maintaining machine learning models in production. The goal of MLOps is to shorten the development cycle, improve model reliability, and ensure that ML systems perform as intended as real-world data distributions change over time.[^1]
The Problem MLOps Solves
Machine learning projects fail to reach production at a high rate. Industry surveys commonly report that most ML models never make it to production, and that many of those that do are retired quickly because their performance degrades. Several characteristics of ML systems make operationalisation harder than for traditional software:
- ML models are probabilistic and sensitive to the statistical properties of their input data. When production data drifts away from the training distribution (because of seasonal changes, shifts in user behaviour, or upstream data pipeline modifications), model performance can degrade silently without triggering conventional software errors.
- ML development involves a complex graph of dependencies between raw data, preprocessing logic, model code, hyperparameters, training infrastructure, and deployment environment. A change in any one of these can cause a seemingly identical model to produce different results.
- Unlike conventional software, where the artefact (compiled code) is deterministic, an ML model's behaviour is determined by the combination of architecture, training data, and training procedure.
MLOps addresses these challenges by providing tooling, processes, and organisational structures to manage the full ML lifecycle reproducibly and at scale.
Core Components
Experiment Tracking
Experiment tracking tools record the inputs and outputs of each training run — hyperparameters, dataset versions, code commits, evaluation metrics, and artefacts — enabling teams to reproduce results and compare experiments systematically. MLflow, Weights & Biases, Comet ML, and Neptune.ai are leading platforms in this space.
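The record-keeping described above can be illustrated with a minimal sketch. It uses only the Python standard library and does not reproduce the API of any of the platforms named; `RunRecord`, `ExperimentLog`, and the example values (`sales-v3`, `abc1234`, `val_auc`) are hypothetical names chosen for illustration.

```python
import hashlib
import json
import time
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """One training run: the inputs and outputs needed to reproduce and compare it."""
    run_id: str
    params: dict
    dataset_version: str
    code_commit: str
    metrics: dict = field(default_factory=dict)
    started_at: float = field(default_factory=time.time)


class ExperimentLog:
    """Hypothetical in-memory tracker; real tools persist runs to a server or database."""

    def __init__(self):
        self.runs = []

    def start_run(self, params, dataset_version, code_commit):
        # Derive the ID from the inputs so that identical configurations are easy to spot.
        payload = json.dumps(
            {"params": params, "data": dataset_version, "commit": code_commit},
            sort_keys=True,
        )
        run_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
        run = RunRecord(run_id, dict(params), dataset_version, code_commit)
        self.runs.append(run)
        return run

    def best_run(self, metric, higher_is_better=True):
        # Compare only the runs that actually logged the requested metric.
        scored = [r for r in self.runs if metric in r.metrics]
        if not scored:
            return None
        pick = max if higher_is_better else min
        return pick(scored, key=lambda r: r.metrics[metric])


log = ExperimentLog()
a = log.start_run({"lr": 0.01, "depth": 4}, "sales-v3", "abc1234")
a.metrics["val_auc"] = 0.81
b = log.start_run({"lr": 0.001, "depth": 6}, "sales-v3", "abc1234")
b.metrics["val_auc"] = 0.84
best = log.best_run("val_auc")
print(best.run_id, best.params, best.metrics)
```

Hashing the hyperparameters, dataset version, and commit together is one possible design choice: two runs with the same ID were launched from the same inputs, which makes unexplained metric differences between them easier to notice.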
Feature Stores
A feature store is a centralised repository for computed ML features — derived attributes of raw data used as model inputs. Feature stores serve dual purposes: they allow features computed in the training pipeline to be reused consistently in serving (preventing training-serving skew), and they enable teams to share features across models rather than recomputing the same transformations repeatedly. Examples include Feast, Tecton, and managed offerings from Databricks and AWS SageMaker.
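The way a shared feature definition prevents training-serving skew can be sketched as follows. This is a toy in-memory illustration, not the interface of Feast, Tecton, or any managed offering; `FeatureStore`, `materialize`, `get_online`, and the feature `days_since_last_order` are hypothetical names.

```python
from datetime import date


def days_since_last_order(last_order: date, as_of: date) -> int:
    """Single feature definition shared by the training and serving paths."""
    return (as_of - last_order).days


class FeatureStore:
    """Hypothetical in-memory store keyed by (entity_id, feature_name)."""

    def __init__(self):
        self._definitions = {}
        self._online = {}

    def register(self, name, fn):
        self._definitions[name] = fn

    def materialize(self, name, entity_id, **inputs):
        # Compute the feature with the registered definition and cache it for online lookup.
        value = self._definitions[name](**inputs)
        self._online[(entity_id, name)] = value
        return value

    def get_online(self, entity_id, name):
        # Serving reads the stored value instead of re-implementing the transformation.
        return self._online[(entity_id, name)]


store = FeatureStore()
store.register("days_since_last_order", days_since_last_order)

# Training path: the value is computed once from raw data.
training_value = store.materialize(
    "days_since_last_order",
    "customer-1",
    last_order=date(2024, 1, 1),
    as_of=date(2024, 1, 31),
)

# Serving path: the same value is looked up rather than recomputed.
serving_value = store.get_online("customer-1", "days_since_last_order")
print(training_value, serving_value)
```

Because both paths go through one registered definition, a change to the transformation affects training and serving at the same time instead of diverging between two codebases.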
Model Registry
A model registry is a versioned catalogue of trained model artefacts, storing each model along with its associated metadata, evaluation metrics, lineage information, and deployment status. The registry provides governance over which model version is active in production and facilitates rollback if a new version performs poorly. MLflow Model Registry and Amazon SageMaker Model Registry are common implementations.
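The versioning, promotion, and rollback behaviour described above might look like the following minimal sketch. It is not the MLflow or SageMaker API; `ModelRegistry`, `ModelVersion`, the stage names, and the `s3://example-bucket/...` URIs are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelVersion:
    version: int
    artifact_uri: str
    metrics: dict
    stage: str = "staging"  # one of "staging", "production", "archived"


class ModelRegistry:
    """Hypothetical in-memory registry with a single production slot."""

    def __init__(self):
        self._versions = []
        self._history = []  # versions that have been in production, most recent last

    def register(self, artifact_uri, metrics):
        mv = ModelVersion(len(self._versions) + 1, artifact_uri, dict(metrics))
        self._versions.append(mv)
        return mv

    def production(self):
        return next((v for v in self._versions if v.stage == "production"), None)

    def promote(self, version):
        # Archive whatever is currently live, then mark the requested version as live.
        current = self.production()
        if current is not None:
            current.stage = "archived"
        mv = self._versions[version - 1]
        mv.stage = "production"
        self._history.append(version)
        return mv

    def rollback(self):
        # Restore the version that was live before the most recent promotion.
        if len(self._history) < 2:
            raise RuntimeError("no earlier production version to roll back to")
        bad = self._history.pop()
        self._versions[bad - 1].stage = "archived"
        previous = self._versions[self._history[-1] - 1]
        previous.stage = "production"
        return previous


registry = ModelRegistry()
v1 = registry.register("s3://example-bucket/models/1", {"auc": 0.81})
v2 = registry.register("s3://example-bucket/models/2", {"auc": 0.84})
registry.promote(v1.version)
registry.promote(v2.version)
restored = registry.rollback()  # v2 misbehaves in production, so v1 is restored
print(restored.version, v2.stage)
```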
CI/CD for ML
Continuous Integration and Continuous Deployment pipelines for ML extend conventional CI/CD to accommodate two ML-specific artefacts: datasets and trained models. Such a pipeline might automatically retrain a model when new training data arrives, run evaluation suites against holdout datasets, validate data schema and quality, and check model performance against defined thresholds before approving deployment. Tools such as GitHub Actions, GitLab CI, and dedicated ML workflow orchestrators (Kubeflow Pipelines, Metaflow, Prefect) implement these pipelines.[^2]
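Two of the checks mentioned above, schema validation and a performance threshold, can be sketched as plain functions that a pipeline step could call. The function names, the `accuracy` metric, and the threshold values are illustrative assumptions and are not taken from any of the tools listed.

```python
def validate_schema(rows, schema):
    """Return a list of human-readable schema violations (an empty list means valid)."""
    problems = []
    for i, row in enumerate(rows):
        for column, expected_type in schema.items():
            if column not in row:
                problems.append(f"row {i}: missing column '{column}'")
            elif not isinstance(row[column], expected_type):
                problems.append(f"row {i}: '{column}' is not {expected_type.__name__}")
    return problems


def deployment_gate(candidate, baseline, min_accuracy=0.80, max_regression=0.01):
    """Approve only if the candidate clears an absolute floor and does not regress."""
    acc = candidate["accuracy"]
    if acc < min_accuracy:
        return False, f"accuracy {acc:.3f} is below the floor {min_accuracy:.3f}"
    drop = baseline["accuracy"] - acc
    if drop > max_regression:
        return False, f"accuracy regressed by {drop:.3f} against the baseline"
    return True, "approved"


schema = {"age": int, "country": str}
good_rows = [{"age": 30, "country": "DE"}]
bad_rows = [{"age": "30", "country": "DE"}, {"country": "FR"}]

print(validate_schema(good_rows, schema))
print(validate_schema(bad_rows, schema))
print(deployment_gate({"accuracy": 0.85}, {"accuracy": 0.84}))
print(deployment_gate({"accuracy": 0.82}, {"accuracy": 0.90}))
```

In a CI job, a failed gate would typically be turned into a non-zero exit status so that the pipeline stops before the deployment step.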
Model Serving
Model serving is the infrastructure that makes a trained model available for inference. The two primary patterns are online serving — synchronous REST or gRPC endpoints that respond to real-time requests within milliseconds — and batch serving, where the model processes large volumes of requests offline. Serving infrastructure must handle load balancing, horizontal scaling, hardware acceleration (GPU/TPU allocation), and graceful model version transitions. Frameworks such as TorchServe, Triton Inference Server, and BentoML abstract serving infrastructure; managed services include AWS SageMaker Endpoints, Google Vertex AI Prediction, and Azure Machine Learning Endpoints.
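One common way to handle a graceful version transition is canary routing, in which a small share of online traffic is sent to the new version while the rest stays on the stable one. The sketch below assumes that technique; `CanaryRouter` is a hypothetical class, and the two lambda "models" stand in for real inference calls.

```python
import random


class CanaryRouter:
    """Route a configurable fraction of online requests to a candidate model version."""

    def __init__(self, stable, candidate, candidate_share=0.1, seed=None):
        if not 0.0 <= candidate_share <= 1.0:
            raise ValueError("candidate_share must be between 0 and 1")
        self.stable = stable
        self.candidate = candidate
        self.candidate_share = candidate_share
        self._rng = random.Random(seed)

    def predict(self, features):
        # Online path: one request in, one prediction out, tagged with the version used.
        if self._rng.random() < self.candidate_share:
            return {"version": "candidate", "prediction": self.candidate(features)}
        return {"version": "stable", "prediction": self.stable(features)}

    def predict_batch(self, rows):
        # Batch path: score many rows offline, using the stable version only.
        return [self.stable(row) for row in rows]


# Stand-in models: any callable that maps features to a prediction.
stable_model = lambda x: x * 2
candidate_model = lambda x: x * 2 + 1

router = CanaryRouter(stable_model, candidate_model, candidate_share=0.1, seed=0)
responses = [router.predict(5) for _ in range(1000)]
share = sum(r["version"] == "candidate" for r in responses) / len(responses)
print(f"observed candidate share: {share:.2f}")
print(router.predict_batch([1, 2, 3]))
```

Raising `candidate_share` in steps, while watching the monitoring metrics for the candidate, moves traffic over gradually; setting it back to zero acts as an immediate rollback.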
Monitoring and Drift Detection
Production monitoring tracks both system-level metrics (latency, throughput, error rate) and ML-specific metrics (prediction distribution, input feature statistics, output calibration). Data drift refers to changes in the statistical distribution of input features relative to the training set; concept drift refers to changes in the relationship between inputs and the target variable. Automated alerts and retraining triggers based on drift metrics allow teams to intervene before performance degradation becomes visible to users.[^3]
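Data drift on a single numeric feature can be quantified in several ways. The sketch below uses the Population Stability Index (PSI), which is one common choice and not a method prescribed by the sources cited here. Equal-width bins derived from the reference sample, the `eps` smoothing constant, and the alert threshold of 0.25 (a frequently quoted rule of thumb) are all assumptions.

```python
import math


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a reference sample (e.g. training data) and a production sample."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has no spread; cannot build bins")
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range production values into the first or last bin.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        total = len(values)
        # eps avoids log(0) and division by zero when a bin is empty.
        return [max(c / total, eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training_sample = [i / 100 for i in range(100)]
production_sample = [x + 0.5 for x in training_sample]  # simulated shift in the feature

psi_same = population_stability_index(training_sample, training_sample)
psi_shifted = population_stability_index(training_sample, production_sample)

ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb cut-off, to be tuned per feature
print(psi_same, psi_shifted, psi_shifted > ALERT_THRESHOLD)
```

PSI only compares input distributions, so it can flag data drift but not concept drift; detecting the latter generally requires ground-truth labels or a proxy for them.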
MLOps Maturity Levels
Google has described a three-level MLOps maturity framework:
| Level | Characteristics |
|-------|----------------|
| Level 0 | Manual, script-based training and deployment; no pipeline automation |
| Level 1 | Automated training pipeline; continuous delivery of model predictions |
| Level 2 | Automated CI/CD for ML pipelines; full pipeline automation from data to deployment |
Most organisations begin at Level 0 and progress toward Level 2 as ML usage matures and the cost of manual operations becomes apparent.
LLMOps
The emergence of large language model deployments has spawned a subfield sometimes called LLMOps, which extends MLOps practices to the specific challenges of operating LLMs: managing prompt versions alongside model versions, monitoring for hallucination and output quality degradation rather than statistical drift, evaluating models against qualitative criteria using LLM-as-judge approaches, and managing the cost and latency of inference at scale. Tools such as LangSmith, Langfuse, Helicone, and Arize AI provide LLM-specific observability features.
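Managing prompt versions alongside model versions can be sketched by fingerprinting the two together, so that a change to either one produces a new, traceable version. This does not reflect the interface of any of the tools named above; `PromptVersion`, the model identifiers `model-a` and `model-b`, and the template text are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    name: str
    template: str
    model: str  # identifier of the model the prompt was evaluated with

    @property
    def fingerprint(self) -> str:
        # Hash the template and model together: changing either yields a new version.
        digest = hashlib.sha256(f"{self.model}\n{self.template}".encode("utf-8"))
        return digest.hexdigest()[:10]

    def render(self, **variables) -> str:
        return self.template.format(**variables)


p1 = PromptVersion("summarise", "Summarise in one sentence: {text}", "model-a")
p2 = PromptVersion("summarise", "Summarise in one sentence: {text}", "model-b")
p3 = PromptVersion("summarise", "Summarise briefly: {text}", "model-a")

# Logging the fingerprint with every request lets output quality be traced
# back to the exact prompt and model combination that produced it.
print(p1.fingerprint, p2.fingerprint, p3.fingerprint)
print(p1.render(text="MLOps applies DevOps principles to ML."))
```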
References
- Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., & Dennison, D. (2015). Hidden Technical Debt in Machine Learning Systems. Advances in Neural Information Processing Systems, 28.
- AWS. (2025). What is MLOps? — Machine Learning Operations Explained. Amazon Web Services. https://aws.amazon.com/what-is/mlops/
- Klaise, J., Van Looveren, A., Vacanti, G., & Coca, A. (2020). Monitoring and Explainability of Models in Production. ICML Workshop on Challenges in Deploying and Monitoring Machine Learning Systems.
- Gartner. (2025). Magic Quadrant for Data Science and Machine Learning Platforms. Gartner Research.