🎓 Program Overview
Python is the undisputed language of the data economy. The demand for engineers who can work across the full stack, from raw data to deployed model, has never been higher — and the gap between data analysts who run notebooks and data engineers who build systems is where the salary premium lives. This course is explicitly designed to produce engineers, not analysts.
In five weeks you will build data pipelines with Airflow, train machine learning models with scikit-learn and PyTorch, serve predictions via FastAPI, track experiments with MLflow, and deploy to production on AWS SageMaker — with monitoring that detects model drift before it causes silent failures.
💡 Why Python Data & ML Engineering in 2026
📚 Full Program — 5 Weeks
This phase is not "intro to Python." It covers the language features and tooling that production-grade data and ML code depends on — the parts that separate a data engineer from someone who learned Python for scripting in a notebook.
- Python typing system: type hints, TypedDict, dataclasses, Pydantic models, and runtime validation — the foundation for type-safe data pipelines (see the sketch after this list)
- Advanced Python: generators, iterators, context managers, decorators, and comprehensions at scale — the constructs that make Python data code readable and memory-efficient
- Concurrency in Python: threading vs multiprocessing vs asyncio — when to use each for data workloads, CPU-bound tasks, and I/O-bound pipelines
- Memory management: how Python manages objects, reference counting, and avoiding memory leaks in long-running data jobs
- Modern Python tooling: pyproject.toml, uv (the modern pip replacement), virtual environments, and dependency locking for reproducible environments
- Project structure for data and ML: src layout, configuration management with pydantic-settings, and separating pipeline stages
- Testing data code with pytest: fixtures, parametrize, and testing data transformation logic with deterministic test data
- Logging and observability for Python pipelines: structlog for structured logging and OpenTelemetry for distributed tracing
- Docker for Python: containerising data applications, multi-stage builds, and keeping images small for Lambda and ECS deployment
- Git workflows for data projects: DVC (Data Version Control) for versioning large datasets and model artefacts alongside code in Git
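To make the typing and validation material concrete, here is a minimal sketch of runtime validation with Pydantic v2. The record fields (user_id, amount, event_time, country) are invented for illustration, not part of any course dataset.

```python
from datetime import datetime

from pydantic import BaseModel, Field, field_validator


class Event(BaseModel):
    """One raw pipeline record, validated at the point of ingestion."""

    user_id: int
    amount: float = Field(ge=0)  # reject negative amounts at parse time
    event_time: datetime
    country: str = "PK"

    @field_validator("country")
    @classmethod
    def normalise_country(cls, value: str) -> str:
        # keep the rest of the pipeline free of casing bugs
        return value.strip().upper()


def parse_rows(rows: list[dict]) -> list[Event]:
    # an invalid row raises pydantic.ValidationError with a precise error report
    return [Event(**row) for row in rows]


print(parse_rows([{"user_id": 1, "amount": "9.99", "event_time": "2026-01-05T10:00:00"}]))
```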
Data engineering is the work of acquiring, cleaning, transforming, and storing data so it is ready for analysis or model training. This phase covers Pandas (the industry standard) and Polars (the performance-first modern replacement) so you can work in either ecosystem — and make an informed choice for new projects.
- Series and DataFrame internals: dtypes, memory layout, and why they matter for performance
- Data ingestion: reading CSV, JSON, Parquet, Excel, and SQL databases with Pandas
- Data cleaning: handling nulls, duplicates, type coercion, and string normalisation
- Data transformation: groupby, merge, pivot, melt, and window functions
- Time series data: DatetimeIndex, resampling, rolling windows, and timezone handling
- Categorical data and memory optimisation: reducing DataFrame memory footprint by 60–80% (see the Pandas sketch after this list)
- Method chaining and pipe: writing readable, composable transformation pipelines
- Vectorisation vs loops: why loc/iloc/apply patterns matter for performance
- Why Polars: lazy evaluation, zero-copy memory model, true parallelism — 10–100x performance over Pandas on large datasets
- Polars expressions: the composable, lazy query API — the fundamental building block of Polars code
- Lazy vs eager execution: building query plans before materialising results — the key to Polars performance (see the Polars sketch after this list)
- Polars with Parquet: the native storage format for Polars workflows
- Migrating Pandas code to Polars: the common patterns and where they differ
- When to use Pandas vs Polars: practical decision framework based on data size and team context
- Parquet: columnar storage, compression codecs, and why it is the production standard for data pipelines
- Apache Arrow: the in-memory columnar format that Polars, DuckDB, and modern data tools are built on
- JSON Lines (JSONL): streaming-friendly format for log and event data
- DuckDB: in-process SQL analytics engine that queries Parquet and CSV files directly — the fastest path from raw file to SQL query
- Working with large files that do not fit in memory: chunked processing and streaming reads with both Pandas and Polars
- Data lake patterns on AWS S3: partitioned Parquet datasets and querying with Athena
- Data validation with Great Expectations and Pandera: asserting data quality before it enters your pipeline
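A minimal sketch of the Pandas memory-optimisation and method-chaining ideas above, assuming a hypothetical orders.csv with a couple of low-cardinality string columns; actual savings depend entirely on the data.

```python
import pandas as pd

# Assumed input file and column names, purely illustrative.
df = pd.read_csv("orders.csv")

optimised = (
    df
    .astype({"country": "category", "status": "category"})  # low-cardinality strings -> category dtype
    .assign(order_date=lambda d: pd.to_datetime(d["order_date"]))
    .dropna(subset=["amount"])
)

# Compare the memory footprint before and after the dtype changes.
print(df.memory_usage(deep=True).sum(), optimised.memory_usage(deep=True).sum())
```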
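A minimal sketch of the lazy Polars pattern described above: the query plan is built lazily and nothing is read or computed until .collect(). The Parquet path and column names are assumptions, and the group_by spelling assumes a recent Polars release.

```python
import polars as pl

# scan_parquet is lazy: Polars optimises the whole plan before touching the files.
top_spenders = (
    pl.scan_parquet("events/*.parquet")
    .filter(pl.col("amount") > 0)
    .group_by("user_id")
    .agg(pl.col("amount").sum().alias("total_spent"))
    .sort("total_spent", descending=True)
    .limit(10)
    .collect()  # only now is the query executed
)
print(top_spenders)
```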
A one-off data transformation script is not a data pipeline. A pipeline runs on a schedule, handles failures gracefully, retries, logs, alerts, and produces auditable outputs. This phase covers how to build them with the tools the industry has standardised on.
- ETL vs ELT: the architectural difference and when each pattern applies — why modern data stacks have shifted to ELT
- Building ETL pipelines in pure Python: extract → validate → transform → load as composable, testable steps
- Apache Airflow fundamentals: DAGs, operators, sensors, XComs, connections, and the task lifecycle
- Writing Airflow DAGs in Python: scheduling pipelines, managing task dependencies, and setting retry policies (see the sketch after this list)
- Airflow best practices: idempotent tasks, templating with Jinja, and avoiding common DAG pitfalls
- Airflow on AWS MWAA: deploying managed Airflow, configuring plugins, and monitoring DAG health
- Prefect as a modern Airflow alternative: flows, tasks, deployments, and work pools
- Database pipelines: incremental loads, upserts, and change data capture (CDC) patterns
- AWS Glue: serverless ETL jobs for large-scale data transformation on S3 using PySpark
- PostgreSQL as a data warehouse at medium scale: schemas, materialized views, BRIN and GIN indexes for analytics
- Pipeline monitoring: alerting on failures, data quality regressions, SLO tracking, and dead-letter queues for failed records
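A minimal sketch of an Airflow DAG along the lines of the items above. The DAG id, schedule, and callables are placeholders rather than a prescribed pipeline, and the schedule argument assumes Airflow 2.4 or newer.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    ...  # pull raw data from the source system


def transform(**context):
    ...  # clean and reshape; keep it idempotent so retries are safe


with DAG(
    dag_id="daily_orders_etl",
    start_date=datetime(2026, 1, 1),
    schedule="0 2 * * *",  # run at 02:00 every day
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task  # transform runs only after extract succeeds
```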
Data and ML systems need APIs — to receive data, trigger pipelines, serve predictions, and expose results to applications. FastAPI is the dominant Python API framework in 2026 for anything that needs to be fast, well-typed, and auto-documented; it is used internally at Uber, Netflix, and Microsoft.
- FastAPI fundamentals: routing, path parameters, query parameters, request bodies, and response models (see the sketch after this list)
- Pydantic v2 for request and response validation: models, field validators, computed fields, model_validator, and serialisation
- Dependency injection in FastAPI: building composable, testable service layers — database sessions, authentication, and configuration
- Async endpoints: using asyncio properly in FastAPI for I/O-bound operations — when to use def vs async def
- Database integration with SQLAlchemy 2.0 async: async sessions, transactions, and connection pooling with asyncpg
- Alembic for database migrations: versioned schema changes in production without downtime
- Authentication: JWT tokens, OAuth2 password flow, and API key authentication — middleware-based route protection
- Background tasks: offloading heavy work (data processing, email, ML inference) with FastAPI BackgroundTasks and Celery
- File upload and streaming: receiving large data files, validating them, and processing them asynchronously
- Streaming responses: server-sent events for streaming ML inference output token by token
- OpenAPI documentation: auto-generated Swagger and ReDoc docs, customising schemas, and using the docs UI for testing
- Testing FastAPI applications: TestClient, async test patterns with httpx, and mocking external dependencies
- Deploying FastAPI: Docker + Gunicorn + Uvicorn workers, AWS ECS, and AWS Lambda with Mangum
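A minimal sketch of a typed FastAPI endpoint with a Pydantic v2 request model and an injected dependency. The route, fields, and settings stub are invented for illustration.

```python
from fastapi import Depends, FastAPI
from pydantic import BaseModel, Field

app = FastAPI(title="pipeline-api")


class IngestRequest(BaseModel):
    source: str
    rows: list[dict] = Field(max_length=1000)  # reject oversized payloads at validation time


class IngestResponse(BaseModel):
    accepted: int


def get_settings() -> dict:
    # stand-in for pydantic-settings configuration, injected per request
    return {"environment": "dev"}


@app.post("/ingest", response_model=IngestResponse)
async def ingest(payload: IngestRequest, settings: dict = Depends(get_settings)) -> IngestResponse:
    # by the time this body runs, the payload has already been validated
    return IngestResponse(accepted=len(payload.rows))
```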
This phase is split into two tracks: classical ML with scikit-learn (the workhorse of most production ML systems) and deep learning with PyTorch (for neural networks, NLP, and computer vision). Both are taught from an engineering perspective — not academic theory or research papers.
- The ML workflow: problem framing, data preparation, model selection, evaluation, and deployment — the engineering lifecycle
- Feature engineering: encoding categorical variables, scaling, imputation, and feature selection techniques
- scikit-learn Pipelines: chaining preprocessing and model steps into a single, deployable, serialisable object (see the scikit-learn sketch after this list)
- Supervised learning: linear and logistic regression, decision trees, random forests, gradient boosting (XGBoost, LightGBM)
- Unsupervised learning: K-means clustering, DBSCAN, PCA for dimensionality reduction
- Model evaluation: cross-validation, confusion matrices, ROC-AUC, precision/recall trade-offs, and RMSE
- Hyperparameter tuning: GridSearchCV, RandomizedSearchCV, and Optuna for Bayesian optimisation
- Handling imbalanced datasets: SMOTE, class weights, and threshold tuning for precision/recall balance
- Model explainability: SHAP values for understanding feature importance in production models — a requirement for regulated industries
- Saving and loading models: joblib, pickle, and ONNX for cross-framework portability
- PyTorch fundamentals: tensors, autograd, and the computational graph — how backpropagation actually works
- Building neural networks with nn.Module: layers, activations, dropout, batch normalisation, and forward passes
- Training loops: loss functions, optimisers (Adam, AdamW), learning rate schedulers, gradient clipping, and early stopping (see the PyTorch sketch after this list)
- Datasets and DataLoaders: batching, shuffling, and custom dataset classes for structured and unstructured data
- Transfer learning: fine-tuning pre-trained models from HuggingFace Transformers for classification tasks
- NLP with transformers: tokenisation, embeddings, and using BERT/DistilBERT for text classification and named entity recognition
- Computer vision basics: CNNs, image classification, and object detection with torchvision and ResNet fine-tuning
- GPU training: moving tensors to CUDA, mixed precision training with torch.amp, and gradient accumulation
- Model checkpointing: saving and resuming training, best model selection, and checkpoint management
- Exporting models: TorchScript, ONNX export, and preparing PyTorch models for Triton Inference Server
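A minimal sketch of a scikit-learn Pipeline that chains preprocessing and a classifier, as covered in the classical-ML items above. The column names and the churn framing are illustrative assumptions.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed feature split; adjust to the real dataset.
numeric = ["age", "monthly_spend"]
categorical = ["plan", "country"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
])

# With X as a DataFrame holding the columns above and y the labels:
# from sklearn.model_selection import cross_val_score
# scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
```

Because the whole pipeline is a single estimator, it can be tuned with GridSearchCV and serialised with joblib as one object.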
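A compressed sketch of a PyTorch training loop in the spirit of the deep-learning items above, using random placeholder data so it runs end to end on CPU.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class MLP(nn.Module):
    """Small feed-forward classifier: linear layers, ReLU, dropout."""

    def __init__(self, in_features: int = 20) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 64), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64, 2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# Random placeholder data standing in for a real Dataset.
X, y = torch.randn(512, 20), torch.randint(0, 2, (512,))
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = MLP()
optimiser = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for batch_x, batch_y in loader:
        optimiser.zero_grad()
        loss = loss_fn(model(batch_x), batch_y)
        loss.backward()  # autograd computes gradients for every parameter
        optimiser.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```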
Training a model once in a notebook is not ML engineering. MLOps is the discipline of making ML reproducible, auditable, and maintainable across teams and over time. This phase covers the tools and practices that enterprise ML teams rely on.
- MLOps principles: reproducibility, versioning, automation, and monitoring as first-class engineering concerns
- MLflow fundamentals: tracking experiments, logging parameters, metrics, and artefacts from every training run (see the sketch after this list)
- MLflow Projects: packaging ML code for reproducible runs in any environment — Docker and conda backends
- MLflow Model Registry: versioning trained models, managing staging/production transitions, and model lineage tracking
- DVC (Data Version Control): versioning large datasets and model artefacts alongside code in Git without LFS overhead
- Feature stores: what they are and when you need one — Feast for online and offline feature serving
- Weights & Biases (W&B) as an MLflow alternative: rich deep learning experiment tracking with gradient histograms and media logging
- Model cards: documenting model capabilities, limitations, training data, and evaluation results — a requirement for responsible AI deployment
- CI/CD for ML: automated retraining pipelines triggered by data drift detection or scheduled cadences with GitHub Actions
- Experiment comparison: systematically analysing runs across hyperparameter sweeps with the MLflow UI and programmatic comparison API
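A minimal sketch of MLflow experiment tracking as introduced above. The experiment name, parameters, and metric value are illustrative placeholders.

```python
import mlflow

mlflow.set_experiment("churn-baseline")  # created on first use if it does not exist

with mlflow.start_run(run_name="rf-200-trees"):
    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("max_depth", 8)

    # ... fit the model here ...
    roc_auc = 0.91  # placeholder metric value

    mlflow.log_metric("roc_auc", roc_auc)
    # mlflow.sklearn.log_model(model, artifact_path="model")  # store the fitted pipeline as an artefact
```

Every run then appears in the MLflow UI with its parameters, metrics, and artefacts, ready for side-by-side comparison.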
A model that is not serving predictions is not doing anything. This phase covers every practical pattern for getting models into production — from self-managed FastAPI endpoints to fully managed SageMaker endpoints with automated monitoring and retraining.
- Serving scikit-learn and PyTorch models with FastAPI: prediction endpoints, batch inference, and async request queuing (see the sketch after this list)
- BentoML: packaging models with their dependencies into portable, OCI-compliant, deployable services
- Triton Inference Server: high-performance model serving for GPU-accelerated PyTorch and ONNX models
- Model caching and warm-up: ensuring low-latency responses on the first request after cold start
- Batching inference requests: grouping incoming requests for GPU efficiency — dynamic batching in Triton
- SageMaker overview: training jobs, processing jobs, pipelines, model registry, and real-time endpoints — the full managed ML platform
- SageMaker Training Jobs: running scikit-learn and PyTorch training at scale on managed compute with custom Docker containers
- SageMaker Processing: running data preprocessing and post-processing jobs at scale — an alternative to Airflow for ML-specific data work
- SageMaker Model Registry: versioning and approving models for deployment — approval workflows from the AWS console and API
- SageMaker Real-Time Endpoints: deploying models to auto-scaling inference endpoints — instance selection and scaling policy
- SageMaker Serverless Inference: cost-efficient endpoints for low-traffic prediction APIs without idle compute cost
- SageMaker Batch Transform: running inference over large datasets without a persistent endpoint — cost-effective for periodic scoring
- SageMaker Pipelines: building end-to-end ML pipelines that chain data processing, training, evaluation, and deployment in a single workflow
- SageMaker with MLflow: using MLflow tracking with SageMaker training jobs — unified experiment tracking across environments
- Data drift detection: monitoring input feature distributions over time with Evidently AI — statistical tests for distribution shift
- Model performance monitoring: tracking prediction accuracy, latency, and error rates in production with Prometheus and CloudWatch
- SageMaker Model Monitor: automated data quality and model quality monitoring on SageMaker endpoints
- Retraining triggers: detecting when model performance has degraded and automatically scheduling retraining via EventBridge
- A/B testing models in production: shadow deployments and traffic splitting between model versions with SageMaker multi-variant endpoints
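A minimal sketch of the self-managed serving pattern from the first bullet in this list: a serialised scikit-learn pipeline loaded once at process start and exposed behind a typed prediction endpoint. The artefact path, feature names, and binary-classification framing are assumptions.

```python
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="churn-predictor")

# Load the serialised scikit-learn Pipeline once, at startup, not per request.
model = joblib.load("artifacts/churn_pipeline.joblib")


class Features(BaseModel):
    age: int
    monthly_spend: float
    plan: str
    country: str


class Prediction(BaseModel):
    churn_probability: float


@app.post("/predict", response_model=Prediction)
def predict(features: Features) -> Prediction:
    # the pipeline expects a DataFrame with the training-time column names
    row = pd.DataFrame([features.model_dump()])
    proba = model.predict_proba(row)[0][1]  # probability of the positive class
    return Prediction(churn_probability=float(proba))
```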
📅 Schedule & Timings
Weekday Groups
Weekend Groups
📍 Location: In-house training, F-11 Markaz, Islamabad · 📱 Online option available for out-of-city participants