🐍 Python Data + ML Engineering
Currently available in Islamabad
Python is the language of the data economy. Nearly every major data pipeline, machine learning model, and AI system running in production today is built on Python, and demand for engineers who can work across the full stack, from raw data to deployed model, has never been higher. This track is not a data science course focused on Jupyter notebooks and academic theory. It is an engineering-first program that teaches you to build systems that process real data, train reliable models, and serve predictions in production.
You will graduate with the skills to work as a data engineer, ML engineer, or backend engineer in data-heavy systems: three of the fastest-growing and highest-paying roles in the global remote job market.
💡 Why This Track
Most Python data courses teach you how to analyse data in a notebook. This course teaches you how to build the systems that analysts depend on. The distinction matters enormously in the job market:
- Data engineers build the pipelines; this course teaches you to build pipelines
- ML engineers build, train, and deploy models; this course covers the full lifecycle, not just training
- The combination of FastAPI + data engineering + MLOps is exactly what companies hiring for remote Python roles want in 2025
- AWS SageMaker and MLflow are production tools used at scale, not toys
- Polars is replacing Pandas for performance-critical data work; you will learn both
📚 Module Breakdown
Week 1 – Phase 0: Python Engineering Foundations
This phase is not "intro to Python." It covers the parts of Python that separate a data engineer from someone who learned Python for scripting: the language features and tooling that production-grade data and ML code depends on.
- Python typing system: type hints, TypedDict, dataclasses, Pydantic models, and runtime validation
- Advanced Python: generators, iterators, context managers, decorators, and comprehensions at scale
- Concurrency in Python: threading vs multiprocessing vs asyncio, and when to use each for data workloads
- Memory management: how Python manages objects, reference counting, and avoiding memory leaks in long-running data jobs
- Virtual environments, pyproject.toml, and dependency management with uv (the modern pip replacement)
- Project structure for data and ML projects: src layout, configuration management with pydantic-settings
- Testing data code with pytest: fixtures, parametrize, and testing data transformation logic
- Logging and observability for Python data pipelines: structlog and OpenTelemetry
- Docker for Python: containerising data applications, multi-stage builds, and keeping images small
- Git workflows for data projects: DVC (Data Version Control) for versioning datasets alongside code
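Several of the Phase 0 ideas above (typed records via dataclasses, lazy generators) combine naturally in data code. A minimal sketch; the `Reading` record and its fields are invented for illustration:

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator

@dataclass(frozen=True)
class Reading:
    # Hypothetical record type for a sensor feed.
    sensor: str
    value: float

def parse_readings(lines: Iterator[str]) -> Iterator[Reading]:
    """Lazily parse CSV lines into typed records, skipping malformed rows."""
    for row in csv.reader(lines):
        try:
            yield Reading(sensor=row[0], value=float(row[1]))
        except (IndexError, ValueError):
            continue  # in a real pipeline you would log the bad row here

raw = io.StringIO("a,1.5\nb,not-a-number\nc,2.5\n")
readings = list(parse_readings(raw))
total = sum(r.value for r in readings)
```

Because `parse_readings` is a generator, it can consume a file of any size one line at a time: typing, laziness, and error handling in one small unit.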
Week 1–2 – Phase 1: Data Engineering with Pandas & Polars
Data engineering is the work of acquiring, cleaning, transforming, and storing data so it is ready for analysis or model training. This phase covers both Pandas (the industry standard) and Polars (the performance-first modern replacement) so you can work in either ecosystem.
Pandas in depth:
- Series and DataFrame internals: dtypes, memory layout, and why they matter for performance
- Data ingestion: reading CSV, JSON, Parquet, Excel, and SQL databases with Pandas
- Data cleaning: handling nulls, duplicates, type coercion, and string normalisation
- Data transformation: groupby, merge, pivot, melt, and window functions
- Time series data: DatetimeIndex, resampling, rolling windows, and timezone handling
- Categorical data and memory optimisation: reducing DataFrame memory footprint by 60–80%
- Method chaining and pipe: writing readable, composable transformation pipelines
- Vectorisation vs loops: why loc/iloc/apply patterns matter for performance
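Method chaining with `pipe` can be sketched as follows; the column names and the `add_revenue` helper are hypothetical, not course materials:

```python
import pandas as pd

def add_revenue(df: pd.DataFrame) -> pd.DataFrame:
    # A small, testable transformation step.
    return df.assign(revenue=df["units"] * df["price"])

orders = pd.DataFrame({
    "region": ["north", "south", "north"],
    "units": [10, 5, 20],
    "price": [2.0, 3.0, 2.0],
})

# Method chaining keeps each step of the pipeline explicit and readable.
summary = (
    orders
    .pipe(add_revenue)
    .groupby("region", as_index=False)["revenue"].sum()
    .sort_values("revenue", ascending=False)
    .reset_index(drop=True)
)
```

Each step returns a new DataFrame, so individual transformations like `add_revenue` can be unit-tested in isolation.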
Polars – the modern replacement:
- Why Polars: lazy evaluation, zero-copy memory, true parallelism, and 10–100x performance over Pandas on large data
- Polars expressions: the composable, lazy query API that defines Polars
- Lazy vs eager execution: building query plans before materialising results
- Polars with Parquet: the native storage format for Polars workflows
- Migrating Pandas code to Polars: the common patterns and where they differ
- When to use Pandas vs Polars: practical decision framework based on data size and team context
Data formats and storage:
- Parquet: columnar storage, compression, and why it is the standard for data pipelines
- Arrow: the in-memory columnar format that Polars and modern data tools are built on
- JSON Lines (JSONL): streaming-friendly format for log and event data
- Working with large files that do not fit in memory: chunked processing and streaming reads
- Data lake patterns on AWS S3: partitioned Parquet datasets and querying with Athena
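Chunked processing can be sketched with Pandas' `chunksize` option, which streams a file in fixed-size pieces so memory use stays roughly constant; the file and column names are invented for the demo:

```python
import os
import tempfile

import pandas as pd

# Write a small demo CSV (imagine this file is far too large for memory).
path = os.path.join(tempfile.mkdtemp(), "events.csv")
pd.DataFrame({"user": range(1000), "amount": [1.0] * 1000}).to_csv(path, index=False)

# Stream the file in chunks and aggregate incrementally: memory usage is
# bounded by the chunk size, not the file size.
total = 0.0
rows = 0
for chunk in pd.read_csv(path, chunksize=250):
    total += chunk["amount"].sum()
    rows += len(chunk)
```

The same pattern, read a slice, fold it into a running aggregate, discard it, is the basis of most out-of-core processing.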
Week 2 – Phase 2: Data Pipelines & Orchestration
A one-off data transformation script is not a data pipeline. A pipeline runs on a schedule, handles failures gracefully, retries, logs, alerts, and produces auditable outputs. This phase covers how to build them.
- ETL vs ELT: the architectural difference and when each pattern applies
- Building ETL pipelines in pure Python: extract → validate → transform → load as composable steps
- Data validation with Great Expectations and Pandera: asserting data quality before it enters your system
- Apache Airflow fundamentals: DAGs, operators, sensors, and the task lifecycle
- Writing Airflow DAGs in Python: scheduling pipelines, managing dependencies, and setting retry policies
- Airflow on AWS: deploying with Amazon MWAA (Managed Workflows for Apache Airflow)
- Prefect as a modern Airflow alternative: flows, tasks, and deployments
- Database pipelines: incremental loads, upserts, and change data capture (CDC) patterns
- PostgreSQL as a data warehouse for small to medium scale: schemas, materialized views, and indexes for analytics
- AWS Glue: serverless ETL jobs for large-scale data transformation on S3
- Monitoring pipelines: alerting on failures, data quality regressions, and pipeline SLOs
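The extract → validate → transform → load pattern can be sketched in pure Python as composable functions; all names and the in-memory "sink" below are illustrative stand-ins for real sources and destinations:

```python
from typing import Iterable

Record = dict

def extract() -> list[Record]:
    # Stand-in for reading from an API, file, or database.
    return [{"id": 1, "amount": "10"}, {"id": 2, "amount": "bad"}, {"id": 3, "amount": "5"}]

def validate(records: Iterable[Record]) -> list[Record]:
    # Reject records that fail quality checks before they enter the system.
    return [r for r in records if r["amount"].isdigit()]

def transform(records: Iterable[Record]) -> list[Record]:
    # Coerce types once, at a known boundary.
    return [{**r, "amount": int(r["amount"])} for r in records]

def load(records: list[Record], sink: list) -> int:
    sink.extend(records)
    return len(records)

def run_pipeline(sink: list) -> int:
    # Each stage is a plain function, so every step is unit-testable.
    return load(transform(validate(extract())), sink)

warehouse: list[Record] = []
loaded = run_pipeline(warehouse)
```

Orchestrators like Airflow and Prefect add scheduling, retries, and alerting around exactly this kind of step decomposition.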
Week 2–3 – Phase 3: Production APIs with FastAPI
Data and ML systems need APIs: to receive data, trigger pipelines, serve predictions, and expose results to applications. FastAPI is the dominant Python API framework in 2025 for anything that needs to be fast, well-typed, and auto-documented.
- FastAPI fundamentals: routing, path parameters, query parameters, and request bodies
- Pydantic v2 for request and response validation: models, validators, computed fields, and serialisation
- Dependency injection: building composable, testable service layers with FastAPI's DI system
- Async endpoints: using asyncio properly in FastAPI for I/O-bound operations
- Database integration with SQLAlchemy 2.0 (async): sessions, transactions, and connection pooling
- Alembic for database migrations: versioned schema changes in production
- Authentication: JWT tokens, OAuth2 password flow, and API key authentication in FastAPI
- Background tasks: offloading heavy work (data processing, email, ML inference) with FastAPI BackgroundTasks and Celery
- File upload and streaming: receiving large data files, processing them asynchronously
- Streaming responses: server-sent events and streaming ML inference output
- OpenAPI documentation: auto-generated docs, customising schemas, and using Swagger UI
- Testing FastAPI applications: TestClient, mocking dependencies, and async test patterns
- Deploying FastAPI: Docker, Gunicorn + Uvicorn workers, and deployment on AWS ECS / Lambda
Week 3–4 – Phase 4: Machine Learning Engineering with scikit-learn & PyTorch
This phase is split into two tiers: classical ML with scikit-learn (the workhorse of most production ML systems) and deep learning with PyTorch (for neural networks, NLP, and computer vision). Both are taught from an engineering perspective, not academic theory.
Classical ML with scikit-learn:
- The ML workflow: problem framing, data preparation, model selection, evaluation, and deployment
- Feature engineering: encoding categorical variables, scaling, imputation, and feature selection
- scikit-learn Pipelines: chaining preprocessing and model steps into a single deployable object
- Supervised learning: linear and logistic regression, decision trees, random forests, gradient boosting (XGBoost, LightGBM)
- Unsupervised learning: K-means clustering, DBSCAN, PCA for dimensionality reduction
- Model evaluation: cross-validation, confusion matrices, ROC-AUC, precision/recall, and RMSE
- Hyperparameter tuning: GridSearchCV, RandomizedSearchCV, and Optuna for Bayesian optimisation
- Handling imbalanced datasets: SMOTE, class weights, and threshold tuning
- Model explainability: SHAP values for understanding feature importance in production models
- Saving and loading models: joblib, pickle, and ONNX for cross-framework portability
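The Pipeline idea, preprocessing and model bundled into one fit/predict object, in miniature (synthetic data, illustrative step names):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data for the demo.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling and the classifier travel together as one deployable object,
# so the exact same preprocessing runs at training and inference time.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)
```

Serialising `pipe` with joblib captures the scaler's fitted statistics along with the model, which is what makes the object safely deployable.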
Deep learning with PyTorch:
- PyTorch fundamentals: tensors, autograd, and the computational graph
- Building neural networks with nn.Module: layers, activations, and forward passes
- Training loops: loss functions, optimisers (Adam, AdamW), learning rate schedulers, and gradient clipping
- Datasets and DataLoaders: batching, shuffling, and custom dataset classes for structured and unstructured data
- Transfer learning: fine-tuning pre-trained models from HuggingFace for classification tasks
- NLP with transformers: tokenisation, embeddings, and using BERT/DistilBERT for text classification and NER
- Computer vision basics: CNNs, image classification, and object detection with torchvision
- GPU training: moving tensors to CUDA, mixed precision training with torch.amp
- Model checkpointing: saving and resuming training, best model selection
- Exporting models: TorchScript, ONNX export, and preparing models for production serving
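The canonical training loop in miniature, on a toy regression task with invented data; a real project would add validation, checkpointing, and a learning rate scheduler around this core:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy regression data: y = 3x plus a little noise.
X = torch.randn(256, 1)
y = 3 * X + 0.1 * torch.randn(256, 1)

model = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# The canonical loop: zero grads, forward, loss, backward, step.
losses = []
for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Every PyTorch training script, however large, is an elaboration of these five lines per step.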
Week 4–5 – Phase 5: MLOps – Experiment Tracking, Versioning & Model Registry
Training a model once in a notebook is not ML engineering. MLOps is the discipline of making ML reproducible, auditable, and maintainable across teams and over time. This phase covers the tools and practices that separate ML engineers from data scientists.
- MLOps principles: reproducibility, versioning, automation, and monitoring as first-class concerns
- MLflow fundamentals: tracking experiments, logging parameters, metrics, and artefacts
- MLflow Projects: packaging ML code for reproducible runs in any environment
- MLflow Model Registry: versioning trained models, managing staging/production transitions, and model lineage
- DVC (Data Version Control): versioning large datasets and model artefacts alongside code in Git
- Feature stores: what they are and when you need one; Feast for online and offline feature serving
- Experiment comparison: analysing runs across hyperparameter sweeps with MLflow UI
- Weights & Biases (W&B) as an MLflow alternative for deep learning experiment tracking
- Model cards: documenting model capabilities, limitations, training data, and evaluation results
- CI/CD for ML: automated retraining pipelines triggered by data drift or scheduled cadences with GitHub Actions
Week 5 – Phase 6: Model Deployment & AWS SageMaker
A model that is not serving predictions is not doing anything. This phase covers every practical pattern for getting models into production, from simple FastAPI endpoints to fully managed SageMaker endpoints.
Self-managed model serving:
- Serving scikit-learn and PyTorch models with FastAPI: prediction endpoints, batch inference, and async queuing
- BentoML: packaging models with their dependencies into portable, deployable services
- Triton Inference Server: high-performance model serving for GPU-accelerated PyTorch and ONNX models
- Model caching and warm-up: ensuring low-latency responses on the first request
- Batching inference requests: grouping incoming requests for GPU efficiency
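Request micro-batching can be sketched with nothing but asyncio: a worker drains a queue into batches and answers each caller through a future. The doubling "model" and the batch window here are placeholders for a real batched inference call:

```python
import asyncio

async def batch_worker(queue: asyncio.Queue, batch_size: int = 4, wait: float = 0.05):
    """Collect requests into batches and answer them together, the way a
    GPU inference server amortises per-call overhead."""
    while True:
        item = await queue.get()
        batch = [item]
        try:
            # Keep pulling until the batch is full or the window expires.
            while len(batch) < batch_size:
                batch.append(await asyncio.wait_for(queue.get(), timeout=wait))
        except asyncio.TimeoutError:
            pass
        # Stand-in for one batched model call over all inputs at once.
        for value, future in batch:
            future.set_result(value * 2)

async def infer(queue: asyncio.Queue, value: int) -> int:
    # Each caller submits its input plus a future to receive the answer.
    future = asyncio.get_running_loop().create_future()
    await queue.put((value, future))
    return await future

async def main() -> list[int]:
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue))
    results = await asyncio.gather(*(infer(queue, v) for v in range(8)))
    worker.cancel()
    return results

results = asyncio.run(main())
```

Tools like Triton implement this same trade-off (a small added latency window in exchange for much higher GPU throughput) at production scale.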
AWS SageMaker:
- SageMaker overview: training jobs, processing jobs, pipelines, model registry, and endpoints
- SageMaker Training Jobs: running scikit-learn and PyTorch training at scale on managed compute
- SageMaker Processing: running data preprocessing and post-processing jobs at scale
- SageMaker Model Registry: versioning and approving models for deployment from the AWS console
- SageMaker Real-Time Endpoints: deploying models to auto-scaling inference endpoints
- SageMaker Serverless Inference: cost-efficient endpoints for low-traffic prediction APIs
- SageMaker Batch Transform: running inference over large datasets without a persistent endpoint
- SageMaker Pipelines: building end-to-end ML pipelines that chain data processing, training, evaluation, and deployment
- SageMaker with MLflow: using MLflow tracking with SageMaker training jobs
Monitoring models in production:
- Data drift detection: monitoring input feature distributions over time with Evidently AI
- Model performance monitoring: tracking prediction accuracy, latency, and error rates in production
- SageMaker Model Monitor: automated data quality and model quality monitoring on SageMaker endpoints
- Retraining triggers: detecting when model performance has degraded and automatically scheduling retraining
- A/B testing models in production: shadow deployments and traffic splitting between model versions
📅 Schedule & Timings
Choose one group based on your availability. Maximum 5 candidates per group to ensure individual attention and hands-on lab support.
Weekday Groups:
- Group 1: Mon–Wed, 10 AM – 1 PM
- Group 2: Mon–Wed, 4 PM – 7 PM
Weekend Groups:
- Group 3: Sat & Sun, 10 AM – 2 PM
- Group 4: Sat & Sun, 4 PM – 8 PM
📍 Location: In-house training in Islamabad
📱 Online option may be arranged for out-of-city participants
🛠️ Tools & Technologies Covered
- Language & Tooling: Python 3.12+, uv, pyproject.toml, pytest, Docker
- Data Engineering: Pandas, Polars, Apache Arrow, Parquet, DVC, Great Expectations, Pandera
- Orchestration: Apache Airflow, Prefect, AWS Glue, Amazon MWAA
- APIs: FastAPI, Pydantic v2, SQLAlchemy 2.0, Alembic, Celery
- Classical ML: scikit-learn, XGBoost, LightGBM, SHAP, Optuna
- Deep Learning: PyTorch, HuggingFace Transformers, torchvision, ONNX
- MLOps: MLflow, DVC, Weights & Biases, BentoML, Feast, Evidently AI
- Cloud: AWS SageMaker, S3, Athena, ECS, Lambda, RDS (PostgreSQL)
✅ Prerequisites
- Comfortable writing Python (functions, classes, file I/O, error handling)
- Basic understanding of SQL and relational databases
- Familiarity with the command line and Git
- No prior data engineering or ML experience required
- No mathematics degree required; the course covers the essential maths where needed
🎯 Who This Is For
- Python backend developers transitioning into data or ML engineering roles
- Data analysts who want to move from analysis to building production data systems
- Software engineers who want to add ML capabilities to their product development skillset
- Engineers targeting remote data engineering or ML engineering positions
💳 Course Fee & Booking
- ⏳ Duration: 5 Weeks
- 👥 Seats: 5 only per group