Data Pipeline
A data pipeline is an automated sequence of processes that ingests, transforms, and delivers data from source systems to destination systems for analysis, machine learning, or operational use.
A data pipeline is an automated sequence of steps that moves data from one or more source systems through a series of transformations before delivering it to a destination — such as a data warehouse, machine learning model, or analytical dashboard. In the context of artificial intelligence and machine learning, data pipelines are foundational infrastructure: the quality, freshness, and completeness of data flowing through a pipeline directly determines the reliability of any model trained or served downstream.
Data pipelines differ from ad hoc data exports or manual ETL scripts in that they are repeatable, monitored, and designed for continuous or scheduled operation. A well-engineered pipeline handles schema changes, source failures, late-arriving data, and throughput spikes without manual intervention.
Core Stages
Ingestion
The first stage of a data pipeline collects raw data from source systems. These sources may include relational databases, REST APIs, message queues, file stores, event streams, IoT sensors, or third-party SaaS platforms. Ingestion can be batch-based — where data is collected at scheduled intervals — or streaming-based, where events are captured in near real-time using systems such as Apache Kafka, Amazon Kinesis, or Google Pub/Sub.
Transformation
Raw ingested data is rarely suitable for immediate use. The transformation stage applies cleaning, normalisation, deduplication, type casting, and business-rule logic to produce consistent, queryable records. Tools such as Apache Spark handle large-scale distributed transformation, while dbt (data build tool) enables SQL-based transformations within data warehouses using version-controlled models and lineage tracking.
Feature Engineering
For machine learning pipelines specifically, transformation extends into feature engineering — the process of deriving meaningful numerical representations from raw fields. This may involve computing rolling aggregates, encoding categorical variables, applying text embeddings, or creating interaction terms. The resulting features are typically persisted in a feature store so that the same representations used during model training can be reproduced at inference time, preventing training-serving skew.
Validation and Quality Checks
Production pipelines incorporate data quality gates at multiple points. These checks verify that expected columns are present, null rates remain below thresholds, statistical distributions have not shifted unexpectedly, and referential integrity is maintained. Frameworks such as Great Expectations, Soda, and Apache Griffin are commonly used to codify and automate these validations.
Storage and Serving
Processed data is written to destination systems appropriate to its use: a data warehouse (BigQuery, Redshift, Snowflake) for analytics, a feature store (Feast, Tecton, Vertex AI Feature Store) for ML serving, or a NoSQL store for low-latency operational queries. In streaming architectures, processed events may be written to Kafka topics for downstream consumers to read.
Monitoring and Orchestration
Pipeline orchestration tools such as Apache Airflow, Prefect, and Dagster schedule and coordinate workflow execution, manage dependencies between tasks, and provide visibility into run history and failures. Monitoring captures throughput metrics, latency, error rates, and data freshness so that on-call engineers are alerted before downstream consumers are affected.
Batch vs. Streaming Pipelines
| Characteristic | Batch Pipeline | Streaming Pipeline | |---|---|---| | Latency | Minutes to hours | Milliseconds to seconds | | Typical tools | Spark, dbt, Airflow | Kafka, Flink, Dataflow | | Use cases | Daily model retraining, reporting | Real-time fraud detection, recommendations | | Complexity | Lower | Higher | | Cost | Lower per record | Higher infrastructure cost |
Many production systems combine both paradigms in a lambda architecture, where a high-throughput streaming layer handles real-time requirements and a batch layer performs periodic reconciliation and reprocessing.
AI-Specific Considerations
Machine learning data pipelines carry concerns not present in conventional analytics pipelines. Training pipelines must handle large volumes of labelled examples, potentially spanning terabytes of image, text, or tabular data. Versioning of datasets is essential so that a model can be retrained on exactly the same data distribution for reproducibility. Continuous training pipelines detect model drift and automatically trigger retraining when incoming data diverges from historical patterns.
Data lineage — the ability to trace each record from its origin through every transformation to its ultimate use in a model — is increasingly required for regulatory compliance, particularly in financial services and healthcare where auditability is mandated.
Common Frameworks and Tools
Apache Airflow remains the most widely deployed open-source pipeline orchestrator, using directed acyclic graphs (DAGs) to define task dependencies and scheduling. Apache Spark provides distributed data processing at scale. dbt has become the standard for analytics engineering within data warehouses. For streaming, Apache Flink and Kafka Streams offer stateful stream processing with exactly-once semantics. Cloud-managed services such as AWS Glue, Google Cloud Dataflow, and Azure Data Factory reduce operational overhead for teams that prefer managed infrastructure.
See Also
References
- Zaharia, M. et al. (2018). Accelerating the Machine Learning Lifecycle with MLflow. ACM SIGMOD Record.
- IBM. (2024). What Is an AI Data Pipeline? IBM Think. https://www.ibm.com/think/topics/ai-data-pipeline
- Bank Negara Malaysia. (2023). Risk Management in Technology (RMiT) Policy Document. BNM.
- dbt Labs. (2024). What is dbt? dbt Documentation. https://docs.getdbt.com
- Apache Software Foundation. (2024). Apache Airflow Documentation. https://airflow.apache.org