Data Engineering · Pipelines & Warehousing

Data Analytics & BI (Data Engineering Track)

Data engineering is the backbone of modern data-driven organisations. This roadmap covers the skills to design, build, and maintain robust pipelines, warehouses, and infrastructure at scale.

Track: Pipelines · Warehouses · Big Data · Streaming
Level: Intermediate → Advanced
Outcome: Junior → Mid Data Engineer

What you'll get

  • End-to-end pipeline skills: ingestion → transform → serve.
  • Warehouse and modelling patterns that scale.
  • Streaming + CDC foundations for real-time systems.
  • Capstone project you can showcase with clear architecture decisions.

Skills you'll build

Learn how to build reliable data systems: clean ingestion, trustworthy models, and production-ready pipelines with monitoring and governance.

  • Python for data work (pandas/polars)
  • SQL mastery (joins, windows, CTEs, optimisation)
  • ETL/ELT design + reliability patterns
  • Orchestration (Airflow / Prefect / Dagster)
  • dbt transformations + modelling
  • Warehouses & lakehouse patterns
  • Big data processing (Spark)
  • Streaming + CDC (Kafka, Flink, Debezium)
  • Data quality & observability
  • Cloud data stacks (AWS/GCP/Azure)

Prerequisites

This track moves quickly. Having the basics will help you focus on system design and scale.

Recommended before you start

  • Basic Python & SQL knowledge

Tools you'll use

A modern data stack blends orchestration, modelling, and scalable compute, then deploys it all cleanly.

Airflow / Prefect / Dagster

Orchestrate and monitor pipelines.

dbt

Transform, test, and document analytics models.

Airbyte / Fivetran

Connector-based ingestion at speed.

PostgreSQL / MySQL

Operational sources + deep SQL practice.

Snowflake / BigQuery / Redshift

Warehouses and scalable analytics.

Spark + Kafka

Big data processing and streaming.

Learning roadmap

A clear phase-based path. We intentionally don't show weeks, so you can progress at your own pace.

Phase 1 — Foundations

Core
  • Python for data engineering: types, control flow, functions
  • Working with files: CSV, JSON, Parquet; dependency management and virtual environments
  • pandas and polars for transformation and manipulation
  • SQL mastery: joins, group by, window functions, CTEs/subqueries, indexing and optimisation
  • Linux & command line: shell scripting, cron scheduling, SSH and remote server basics
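
To give a feel for the SQL topics in this phase, here is a minimal sketch that combines a CTE with a window function to pick each customer's most recent order. It runs against an in-memory SQLite database from Python's standard library; the `orders` table, its columns, and the sample rows are made up for illustration, and it assumes your Python ships SQLite 3.25 or newer (needed for window functions).

```python
import sqlite3


def latest_order_per_customer(rows):
    """Return each customer's most recent order using a CTE and ROW_NUMBER()."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table used only for this example.
    conn.execute("CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    query = """
        WITH ranked AS (
            SELECT customer, order_date, amount,
                   ROW_NUMBER() OVER (
                       PARTITION BY customer ORDER BY order_date DESC
                   ) AS rn
            FROM orders
        )
        SELECT customer, order_date, amount
        FROM ranked
        WHERE rn = 1
        ORDER BY customer
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Made-up sample data; ISO dates sort correctly as text.
SAMPLE_ORDERS = [
    ("alice", "2024-01-05", 20.0),
    ("alice", "2024-02-10", 35.0),
    ("bob", "2024-01-20", 15.0),
]

if __name__ == "__main__":
    print(latest_order_per_customer(SAMPLE_ORDERS))
```

The same "rank within a partition, keep rank 1" pattern is a common way to deduplicate records in a warehouse.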

Phase 2 — Data Pipelines & ETL/ELT

Pipelines
  • ETL concepts: batch vs streaming, validation, idempotency, reliability
  • Orchestration: Airflow DAGs + modern alternatives (Prefect/Dagster)
  • Monitoring, alerting, retries, and dependency management
  • Integration tooling: Airbyte/Fivetran connectors, Singer, and custom Python ingestion
  • Transformations: dbt, SQL, pandas/Spark depending on scale
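
Idempotency is easier to grasp with a concrete case. The sketch below shows one common approach, an upsert keyed on a primary key, so that re-running the same batch leaves the table unchanged. It uses SQLite from the standard library as a stand-in for a real warehouse; the `customers` table and the function names are hypothetical, and the `ON CONFLICT` syntax assumes SQLite 3.24 or newer.

```python
import sqlite3


def load_batch(conn, rows):
    """Upsert rows keyed by id so replaying a batch does not create duplicates."""
    # Hypothetical target table for this example.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, email TEXT)"
    )
    conn.executemany(
        """
        INSERT INTO customers (id, email) VALUES (?, ?)
        ON CONFLICT(id) DO UPDATE SET email = excluded.email
        """,
        rows,
    )
    conn.commit()


def row_count(conn):
    """Count rows in the target table."""
    return conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

Running `load_batch` twice with the same rows leaves the row count unchanged, which is what makes retries after a partial failure safe.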

Phase 3 — Storage & Warehousing

Warehouses
  • Warehouses: Snowflake, BigQuery, Redshift (core architecture + cost/perf patterns)
  • Lakehouse: Delta Lake / Iceberg patterns and when to use them
  • Data modelling: star/snowflake schemas, Kimball, SCDs, data vault basics
  • File formats: Parquet, ORC, Avro; object storage (S3/GCS/Azure Blob)
  • Data lake partitioning strategies and layout decisions
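
As a small illustration of a layout decision, the sketch below builds a Hive-style, date-partitioned object-storage prefix. Date partitioning is only one possible strategy, and the bucket name, table name, and `partition_path` helper are assumptions made up for the example.

```python
from datetime import date


def partition_path(base, table, event_date):
    """Build a Hive-style date-partitioned prefix such as .../year=2024/month=03/day=07/."""
    return (
        f"{base.rstrip('/')}/{table}/"
        f"year={event_date.year:04d}/"
        f"month={event_date.month:02d}/"
        f"day={event_date.day:02d}/"
    )


if __name__ == "__main__":
    # Hypothetical bucket and table names.
    print(partition_path("s3://my-bucket/lake/", "orders", date(2024, 3, 7)))
```

Zero-padded `key=value` directories keep partitions sorted lexicographically, and many query engines can use them to skip files that fall outside a date filter.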

Phase 4 — Big Data & Streaming

Scale
  • Apache Spark: DataFrames, PySpark transformations, optimisation (partitioning/caching/broadcast joins)
  • Streaming with Kafka: topics/partitions, producers/consumers, exactly-once concepts
  • Real-time pipelines: Flink for stateful processing; Lambda vs Kappa architectures
  • CDC with Debezium for database change streams
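
To make CDC less abstract, here is a minimal sketch of applying change events to an in-memory copy of a table. It assumes a simplified version of the Debezium event envelope with only `op`, `before`, and `after` fields; real events carry more metadata (source, timestamps, schema) and other operation types, and the `apply_change_event` helper and `id` key are hypothetical.

```python
def apply_change_event(table, event, key_field="id"):
    """Apply one simplified Debezium-style change event to `table` (a dict keyed by primary key)."""
    op = event["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update
        row = event["after"]
        table[row[key_field]] = row  # upsert, so a redelivered event is harmless
    elif op == "d":  # delete: the removed row arrives in "before"
        table.pop(event["before"][key_field], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")
    return table


if __name__ == "__main__":
    # Made-up events for illustration.
    events = [
        {"op": "c", "before": None, "after": {"id": 1, "name": "Ada"}},
        {"op": "u", "before": {"id": 1, "name": "Ada"}, "after": {"id": 1, "name": "Ada L."}},
        {"op": "d", "before": {"id": 1, "name": "Ada L."}, "after": None},
    ]
    state = {}
    for e in events:
        apply_change_event(state, e)
    print(state)
```

Treating creates and updates as upserts, and deletes as "remove if present", keeps the consumer tolerant of at-least-once delivery.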

Phase 5 — Cloud & DevOps

Production
  • Cloud platforms: AWS (S3/Glue/Athena/EMR), GCP (Pub/Sub/Dataflow/BigQuery), Azure (Synapse/Data Factory/ADLS)
  • Infrastructure as Code: Terraform; packaging workloads with Docker
  • Kubernetes basics for data workloads (when it fits)
  • Data quality & observability: Great Expectations; lineage and monitoring concepts
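
The idea behind data quality checks can be sketched in plain Python before reaching for a framework. The example below is not the Great Expectations API; it is a hand-rolled illustration of two typical expectations (required columns are not null, a key is unique), and the `run_checks` function and sample columns are hypothetical.

```python
def run_checks(rows, required, unique_key):
    """Return a list of human-readable failures; an empty list means the batch passed."""
    failures = []
    seen = set()
    for i, row in enumerate(rows):
        # Not-null check on each required column.
        for col in required:
            if row.get(col) is None:
                failures.append(f"row {i}: {col} is null")
        # Uniqueness check on the key column.
        key = row.get(unique_key)
        if key in seen:
            failures.append(f"row {i}: duplicate {unique_key}={key!r}")
        seen.add(key)
    return failures


if __name__ == "__main__":
    # Made-up batch with one null and one duplicate key.
    batch = [{"id": 1, "email": "a@example.com"}, {"id": 1, "email": None}]
    for failure in run_checks(batch, ["id", "email"], "id"):
        print(failure)
```

In a pipeline, a non-empty failure list would typically block the load step or raise an alert rather than let bad rows reach the warehouse.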

Phase 6 — Capstone

Portfolio
  • Ingest a real dataset via a pipeline (batch or streaming)
  • Model in a warehouse using dbt
  • Orchestrate with Airflow (or Prefect/Dagster)
  • Deploy on a cloud platform and document architecture decisions
  • Present trade-offs: reliability, cost, latency, and governance

Portfolio projects (what you can show)

Build end-to-end systems that demonstrate engineering maturity: reliability, performance, and clean modelling.

Reliable batch pipeline

  • Ingest source data, validate quality, and load to warehouse
  • Idempotent runs + retries + alerting
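
For the retry part of this project, a minimal sketch of retries with exponential backoff might look like the following. Orchestrators such as Airflow provide their own retry settings; this hand-written `run_with_retries` helper is hypothetical and only illustrates the mechanism. The `sleep` parameter is injectable so the behaviour can be tested without waiting.

```python
import time


def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `task` until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error so alerting can fire
            sleep(base_delay * 2 ** (attempt - 1))
```

Retries are only safe when the task itself is idempotent, which is why the two appear together in this project.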

Warehouse modelling with dbt

  • Star schema + SCD dimension
  • Tests + docs + lineage-friendly structure
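
The SCD dimension in this project would normally be built in dbt and SQL, but the logic of a Type 2 slowly changing dimension can be sketched in plain Python. The example below assumes Type 2 (keep full history) and a hypothetical row shape with `key`, `attrs`, `valid_from`, `valid_to`, and `is_current` fields; none of these names come from dbt itself.

```python
from datetime import date


def apply_scd2(dimension, key, new_attrs, effective):
    """Close the current row for `key` if its attributes changed, then append a new current row.

    `dimension` is a list of dicts with: key, attrs, valid_from, valid_to, is_current.
    """
    current = next(
        (r for r in dimension if r["key"] == key and r["is_current"]), None
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return dimension  # nothing changed, so history stays as it is
        # Expire the old version instead of overwriting it.
        current["valid_to"] = effective
        current["is_current"] = False
    dimension.append(
        {
            "key": key,
            "attrs": dict(new_attrs),
            "valid_from": effective,
            "valid_to": None,
            "is_current": True,
        }
    )
    return dimension


if __name__ == "__main__":
    dim = []
    apply_scd2(dim, 1, {"city": "Leeds"}, date(2024, 1, 1))
    apply_scd2(dim, 1, {"city": "York"}, date(2024, 3, 1))
    for row in dim:
        print(row)
```

Because old versions are expired rather than overwritten, facts can be joined to the dimension row that was valid at the time of the event.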

Real-time stream + CDC

  • Kafka topic ingestion with a stream processor
  • Debezium CDC into a lake/warehouse for near-real-time analytics

Cloud deployment

  • Deploy the stack on AWS/GCP/Azure
  • Terraform for infrastructure and repeatability

Ready to build production data systems?

Share your background and goals — we'll recommend the best starting phase and a first project to build.
