What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

What topics does AIWiki Malaysia cover?

AIWiki Malaysia covers a wide range of AI topics including large language models (LLMs), AI agents, machine learning fundamentals, prompt engineering, AI automation, generative AI tools, Malaysian AI regulations, local vendor landscape, and real-world AI use cases relevant to the Malaysian market.

How do I search for AI topics on AIWiki Malaysia?

You can use the search bar at the top of the site to find articles by keyword or topic. Articles are also organised by category, so you can browse by subject area such as Models, Tools, Concepts, or Use Cases.

Is AIWiki Malaysia available in Bahasa Malaysia?

Yes. AIWiki Malaysia publishes content in both English and Bahasa Malaysia to serve the full breadth of the Malaysian professional and student community. Language availability is indicated on each article page.

How can I submit a topic or suggest an article?

You can suggest topics or submit article ideas by contacting the AIWiki Malaysia team at admin@aiteragrid.com. AITG Sdn Bhd reviews all submissions and publishes content that meets editorial accuracy standards.

Data Pipeline

A data pipeline is an automated sequence of processes that ingests, transforms, and delivers data from source systems to destination systems for analysis, machine learning, or operational use.

6 min readLast updated June 2026Infrastructure

A data pipeline is an automated sequence of steps that moves data from one or more source systems through a series of transformations before delivering it to a destination — such as a data warehouse, machine learning model, or analytical dashboard. In the context of artificial intelligence and machine learning, data pipelines are foundational infrastructure: the quality, freshness, and completeness of data flowing through a pipeline directly determines the reliability of any model trained or served downstream.

Data pipelines differ from ad hoc data exports or manual ETL scripts in that they are repeatable, monitored, and designed for continuous or scheduled operation. A well-engineered pipeline handles schema changes, source failures, late-arriving data, and throughput spikes without manual intervention.

Core Stages

Ingestion

The first stage of a data pipeline collects raw data from source systems. These sources may include relational databases, REST APIs, message queues, file stores, event streams, IoT sensors, or third-party SaaS platforms. Ingestion can be batch-based — where data is collected at scheduled intervals — or streaming-based, where events are captured in near real-time using systems such as Apache Kafka, Amazon Kinesis, or Google Pub/Sub.

Transformation

Raw ingested data is rarely suitable for immediate use. The transformation stage applies cleaning, normalisation, deduplication, type casting, and business-rule logic to produce consistent, queryable records. Tools such as Apache Spark handle large-scale distributed transformation, while dbt (data build tool) enables SQL-based transformations within data warehouses using version-controlled models and lineage tracking.

Feature Engineering

For machine learning pipelines specifically, transformation extends into feature engineering — the process of deriving meaningful numerical representations from raw fields. This may involve computing rolling aggregates, encoding categorical variables, applying text embeddings, or creating interaction terms. The resulting features are typically persisted in a feature store so that the same representations used during model training can be reproduced at inference time, preventing training-serving skew.

Validation and Quality Checks

Production pipelines incorporate data quality gates at multiple points. These checks verify that expected columns are present, null rates remain below thresholds, statistical distributions have not shifted unexpectedly, and referential integrity is maintained. Frameworks such as Great Expectations, Soda, and Apache Griffin are commonly used to codify and automate these validations.

Storage and Serving

Processed data is written to destination systems appropriate to its use: a data warehouse (BigQuery, Redshift, Snowflake) for analytics, a feature store (Feast, Tecton, Vertex AI Feature Store) for ML serving, or a NoSQL store for low-latency operational queries. In streaming architectures, processed events may be written to Kafka topics for downstream consumers to read.

Monitoring and Orchestration

Pipeline orchestration tools such as Apache Airflow, Prefect, and Dagster schedule and coordinate workflow execution, manage dependencies between tasks, and provide visibility into run history and failures. Monitoring captures throughput metrics, latency, error rates, and data freshness so that on-call engineers are alerted before downstream consumers are affected.

Batch vs. Streaming Pipelines

| Characteristic | Batch Pipeline | Streaming Pipeline | |---|---|---| | Latency | Minutes to hours | Milliseconds to seconds | | Typical tools | Spark, dbt, Airflow | Kafka, Flink, Dataflow | | Use cases | Daily model retraining, reporting | Real-time fraud detection, recommendations | | Complexity | Lower | Higher | | Cost | Lower per record | Higher infrastructure cost |

Many production systems combine both paradigms in a lambda architecture, where a high-throughput streaming layer handles real-time requirements and a batch layer performs periodic reconciliation and reprocessing.

AI-Specific Considerations

Machine learning data pipelines carry concerns not present in conventional analytics pipelines. Training pipelines must handle large volumes of labelled examples, potentially spanning terabytes of image, text, or tabular data. Versioning of datasets is essential so that a model can be retrained on exactly the same data distribution for reproducibility. Continuous training pipelines detect model drift and automatically trigger retraining when incoming data diverges from historical patterns.

Data lineage — the ability to trace each record from its origin through every transformation to its ultimate use in a model — is increasingly required for regulatory compliance, particularly in financial services and healthcare where auditability is mandated.

Common Frameworks and Tools

Apache Airflow remains the most widely deployed open-source pipeline orchestrator, using directed acyclic graphs (DAGs) to define task dependencies and scheduling. Apache Spark provides distributed data processing at scale. dbt has become the standard for analytics engineering within data warehouses. For streaming, Apache Flink and Kafka Streams offer stateful stream processing with exactly-once semantics. Cloud-managed services such as AWS Glue, Google Cloud Dataflow, and Azure Data Factory reduce operational overhead for teams that prefer managed infrastructure.

Malaysian Context — Data Pipelines in Malaysia's AI Ecosystem

Malaysian enterprises across banking, telecommunications, and manufacturing have significantly expanded their data pipeline investments as part of the MyDigital Blueprint and the broader Malaysia Digital economy initiative administered by MDEC. The shift from legacy on-premise ETL tools to cloud-native pipeline architectures has accelerated following the establishment of hyperscaler data centres in Malaysia by AWS, Microsoft Azure, and Google Cloud, all of which began Malaysian operations between 2022 and 2024.

In the financial sector, Bank Negara Malaysia (BNM) has emphasised data governance and lineage requirements in its Risk Management in Technology (RMiT) framework, effectively mandating that licensed financial institutions maintain auditable data pipelines. Maybank, CIMB, and RHB have all invested in modern data stack implementations — adopting tools such as dbt, Snowflake, and Apache Spark — to support credit scoring, anti-money laundering detection, and personalised product recommendations.

Telecommunications companies TM (Telekom Malaysia) and Maxis operate streaming data pipelines to process network telemetry and customer interaction events at high volume, feeding anomaly detection and churn prediction models. Petronas, through its digital and data science arm, has deployed industrial data pipelines connecting SCADA systems and sensor networks to machine learning platforms for predictive maintenance across upstream and downstream operations.

For AI developers and startups, MDEC's Malaysia Digital (MD) status programme provides tax incentives and infrastructure grants that reduce the cost of cloud-based data pipeline infrastructure. The HRD Corp (Human Resource Development Corporation) funds certified data engineering training courses, including Apache Airflow, Apache Spark, and cloud data platform programmes, helping to develop the pipeline engineering talent pool within the country.

Grab Malaysia and AirAsia (now AirAsia MOVE) are prominent examples of Malaysian-headquartered technology companies that operate real-time data pipelines processing millions of daily events — spanning ride-hailing telemetry, payment transactions, and customer browsing behaviour — to power recommendation engines, dynamic pricing, and fraud prevention systems.

References

Zaharia, M. et al. (2018). Accelerating the Machine Learning Lifecycle with MLflow. ACM SIGMOD Record.
IBM. (2024). What Is an AI Data Pipeline? IBM Think. https://www.ibm.com/think/topics/ai-data-pipeline
Bank Negara Malaysia. (2023). Risk Management in Technology (RMiT) Policy Document. BNM.
dbt Labs. (2024). What is dbt? dbt Documentation. https://docs.getdbt.com
Apache Software Foundation. (2024). Apache Airflow Documentation. https://airflow.apache.org

Tags:data engineering ETL MLOps data ingestion feature engineering

Type	Data engineering infrastructure
Core stages	Ingestion, transformation, feature engineering, serving
Common tools	Apache Spark, Apache Kafka, Airflow, dbt, Fivetran
Related	MLOps, Feature store, DataOps, Model serving
Key use	Feeding clean, structured data to ML models and analytics systems