Arestech — Tech Consulting, Academy & Talent

Skills you'll build

Learn how to build reliable data systems: clean ingestion, trustworthy models, and production-ready pipelines with monitoring and governance.

Python for data work (pandas/polars)
SQL mastery (joins, windows, CTEs, optimisation)
ETL/ELT design + reliability patterns
Orchestration (Airflow / Prefect / Dagster)
dbt transformations + modelling
Warehouses & lakehouse patterns
Big data processing (Spark)
Streaming + CDC (Kafka, Flink, Debezium)
Data quality & observability
Cloud data stacks (AWS/GCP/Azure)

Prerequisites

This track moves quickly. Having the basics will help you focus on system design and scale.

Recommended before you start

Basic Python & SQL knowledge

Tools you'll use

A modern data stack blends orchestration, modelling, and scalable compute, then gets deployed cleanly.

Airflow / Prefect / Dagster

Orchestrate and monitor pipelines.

dbt

Transform, test, and document analytics models.

Airbyte / Fivetran

Connector-based ingestion at speed.

PostgreSQL / MySQL

Operational sources + deep SQL practice.

Snowflake / BigQuery / Redshift

Warehouses and scalable analytics.

Spark + Kafka

Big data processing and streaming.

Learning roadmap

A clear phase-based path. We intentionally don't show weeks so you can progress at your own pace.

Phase 1 — Foundations

Core

Python for data engineering: types, control flow, functions
Working with files: CSV, JSON, Parquet; dependency management and virtual environments
pandas and polars for transformation and manipulation
SQL mastery: joins, GROUP BY, window functions, CTEs/subqueries, indexing and optimisation
Linux & command line: shell scripting, cron scheduling, SSH and remote server basics
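
To give a flavour of the SQL topics in this phase, here is a minimal sketch that combines a CTE with two window functions. It uses Python's built-in sqlite3 module so it runs without a database server. The `orders` table, its columns, and the sample rows are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical sample data: (customer, order_day, amount)
ORDERS = [
    ("alice", "2024-01-01", 30.0),
    ("alice", "2024-01-03", 20.0),
    ("bob", "2024-01-02", 50.0),
    ("bob", "2024-01-05", 10.0),
]

# The CTE computes a per-customer running total and a recency rank;
# the outer query keeps only each customer's most recent order.
QUERY = """
WITH ranked AS (
    SELECT
        customer,
        order_day,
        SUM(amount) OVER (
            PARTITION BY customer ORDER BY order_day
        ) AS running_total,
        ROW_NUMBER() OVER (
            PARTITION BY customer ORDER BY order_day DESC
        ) AS recency_rank
    FROM orders
)
SELECT customer, order_day, running_total
FROM ranked
WHERE recency_rank = 1
ORDER BY customer
"""


def latest_order_with_running_total(rows):
    """Load rows into an in-memory table and run the windowed query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (customer TEXT, order_day TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    result = conn.fetchall() if False else conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(latest_order_with_running_total(ORDERS))
```

Window functions need SQLite 3.25 or newer, which recent Python builds typically bundle. The same query shape should carry over to PostgreSQL or a cloud warehouse with little change.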

Phase 2 — Data Pipelines & ETL/ELT

Pipelines

ETL concepts: batch vs streaming, validation, idempotency, reliability
Orchestration: Airflow DAGs + modern alternatives (Prefect/Dagster)
Monitoring, alerting, retries, and dependency management
Integration tooling: Airbyte/Fivetran connectors, Singer, and custom Python ingestion
Transformations: dbt, SQL, pandas/Spark depending on scale
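
Idempotency and retries are easier to grasp with a small example. The sketch below assumes a hypothetical `events` table keyed on `id` and uses sqlite3 as a stand-in for a real warehouse. The helper names `with_retries` and `load_batch` are illustrative, not part of any orchestration library.

```python
import sqlite3
import time


def with_retries(func, attempts=3, base_delay=0.01):
    """Call func(); on failure, wait with exponential backoff and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts, surface the error
            time.sleep(base_delay * 2 ** (attempt - 1))


def load_batch(conn, rows):
    """Upsert rows keyed on id, so replaying a batch leaves the table unchanged."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO events (id, payload) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET payload = excluded.payload",
            rows,
        )


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
batch = [(1, "signup"), (2, "purchase")]

with_retries(lambda: load_batch(conn, batch))
with_retries(lambda: load_batch(conn, batch))  # replay of the same batch
print(row_count(conn))
```

Because the load is an upsert inside a transaction, a retry after a partial failure cannot create duplicates. In an orchestrator such as Airflow, the retry loop would normally be handled by task-level retry settings rather than hand-written code.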

Phase 3 — Storage & Warehousing

Warehouses

Warehouses: Snowflake, BigQuery, Redshift (core architecture + cost/perf patterns)
Lakehouse: Delta Lake / Iceberg patterns and when to use them
Data modelling: star/snowflake schemas, Kimball, SCDs, data vault basics
File formats: Parquet, ORC, Avro; object storage (S3/GCS/Azure Blob)
Data lake partitioning strategies and layout decisions
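
Slowly changing dimensions (SCDs) are one of the modelling ideas listed above. The following is a minimal in-memory sketch of the Type 2 pattern, where a change closes the current row and appends a new version. The function name `apply_scd2`, the row layout, and the sample customer are assumptions for illustration; in a warehouse this logic would usually live in SQL or a dbt snapshot.

```python
from datetime import date


def apply_scd2(dimension, key, attributes, as_of):
    """Record a new version for `key` only when its attributes changed.

    `dimension` is a list of dicts with key, attributes, valid_from, valid_to.
    valid_to is None for the current version of a key.
    """
    current = next(
        (
            row
            for row in dimension
            if row["key"] == key and row["valid_to"] is None
        ),
        None,
    )
    if current is not None:
        if current["attributes"] == attributes:
            return dimension  # nothing changed, keep history as-is
        current["valid_to"] = as_of  # expire the old version
    dimension.append(
        {
            "key": key,
            "attributes": attributes,
            "valid_from": as_of,
            "valid_to": None,
        }
    )
    return dimension


customers = []
apply_scd2(customers, "c1", {"city": "Lagos"}, date(2024, 1, 1))
apply_scd2(customers, "c1", {"city": "Lagos"}, date(2024, 2, 1))  # no change
apply_scd2(customers, "c1", {"city": "Abuja"}, date(2024, 3, 1))  # new version

for row in customers:
    print(row)
```

The history keeps both cities with their validity ranges, which is what lets a fact table join to the version that was current at the time of each event.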

Phase 4 — Big Data & Streaming

Scale

Apache Spark: DataFrames, PySpark transformations, optimisation (partitioning/caching/broadcast joins)
Streaming with Kafka: topics/partitions, producers/consumers, exactly-once concepts
Real-time pipelines: Flink for stateful processing; Lambda vs Kappa architectures
CDC with Debezium for database change streams
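
The link between Kafka keys, partitions, and ordering can be sketched without a broker. The example below routes keyed events to partitions by hashing the key, so all events for one key land in the same partition in their original order. This is a conceptual model only: `partition_for` and `route` are hypothetical helpers, and `zlib.crc32` is a stand-in hash, since Kafka clients use their own hash function (the Java client's default partitioner uses murmur2).

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a deterministic stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Append each (key, value) event to the partition chosen by its key."""
    partitions = {p: [] for p in range(num_partitions)}
    for key, value in events:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-1", "logout"),
]
routed = route(events, num_partitions=3)

for partition, items in routed.items():
    print(partition, items)
```

Ordering is only guaranteed within a partition, which is why choosing the key (for example, a user or account id) is a design decision rather than a detail.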

Phase 5 — Cloud & DevOps

Production

Cloud platforms: AWS (S3/Glue/Athena/EMR), GCP (Pub/Sub/Dataflow/BigQuery), Azure (Synapse/Data Factory/ADLS)
Infrastructure as Code: Terraform; packaging workloads with Docker
Kubernetes basics for data workloads (when it fits)
Data quality & observability: Great Expectations; lineage and monitoring concepts
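
As a taste of what data quality checks look like, here is a small plain-Python sketch of three common expectations: not-null, uniqueness, and a value range. It deliberately does not use the Great Expectations API. The function names, the report format, and the sample rows are all assumptions made for illustration.

```python
def check_not_null(rows, column):
    """Fail every row where the column is missing or None."""
    failed = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"check": f"not_null({column})", "passed": not failed, "failed_rows": failed}


def check_unique(rows, column):
    """Fail every row whose value was already seen in an earlier row."""
    seen, failed = set(), []
    for i, row in enumerate(rows):
        value = row.get(column)
        if value in seen:
            failed.append(i)
        seen.add(value)
    return {"check": f"unique({column})", "passed": not failed, "failed_rows": failed}


def check_between(rows, column, low, high):
    """Fail non-null values outside [low, high]; nulls are left to not_null."""
    failed = [
        i
        for i, row in enumerate(rows)
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]
    return {"check": f"between({column})", "passed": not failed, "failed_rows": failed}


# Hypothetical batch with one null amount, one duplicate id, one negative amount
rows = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": -5.0},
]

report = [
    check_not_null(rows, "amount"),
    check_unique(rows, "order_id"),
    check_between(rows, "amount", 0, 10_000),
]
failed_checks = [result["check"] for result in report if not result["passed"]]
print(failed_checks)
```

In a pipeline, a non-empty list of failed checks would typically stop the load or raise an alert before bad rows reach the warehouse.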

Phase 6 — Capstone

Portfolio

Ingest a real dataset via a pipeline (batch or streaming)
Model in a warehouse using dbt
Orchestrate with Airflow (or Prefect/Dagster)
Deploy on a cloud platform and document architecture decisions
Present trade-offs: reliability, cost, latency, and governance

Portfolio projects (what you can show)

Build end-to-end systems that demonstrate engineering maturity: reliability, performance, and clean modelling.

Reliable batch pipeline

Ingest source data, validate quality, and load to warehouse
Idempotent runs + retries + alerting

Warehouse modelling with dbt

Star schema + SCD dimension
Tests + docs + lineage-friendly structure

Real-time stream + CDC

Kafka topic ingestion with a stream processor
Debezium CDC into a lake/warehouse for near-real-time analytics

Cloud deployment

Deploy the stack on AWS/GCP/Azure
Terraform for infrastructure and repeatability

Ready to build production data systems?

Share your background and goals — we'll recommend the best starting phase and a first project to build.
