🎓 Program Overview
There is a significant difference between using AI and building AI systems. Millions of developers now use ChatGPT and GitHub Copilot daily — but only a fraction can build the production systems that power those experiences: the retrieval pipelines, the agent architectures, the evaluation frameworks, the cost-optimised inference layers, and the monitoring systems that keep AI features working reliably at scale.
This track trains you to be on the engineering side of that divide. You will work with the OpenAI, Anthropic, and Google Gemini APIs not as a user but as a builder — designing RAG pipelines, orchestrating multi-step LLM agents, managing vector databases, evaluating model outputs programmatically, and deploying AI features that behave predictably in production.
💡 Why AI Engineering in 2026
📚 Curriculum — 9 Phases + Capstone
Before calling a single API, you need to understand what language models actually are, how they work at a systems level, and what their capabilities and failure modes look like in production. This phase gives engineers the mental model needed to make good architectural decisions throughout the entire course.
- How large language models work: tokens, embeddings, attention, and the transformer architecture — explained for engineers, not researchers
- Tokenisation in practice: how text becomes tokens, why token counts matter for cost and context limits, and how to measure them with tiktoken
- Context windows: what they are, how they constrain system design, and current limits across GPT-4o, Claude 3.5, and Gemini 1.5 Pro
- Temperature, top-p, and sampling parameters: what they control and how to set them for different use cases
- LLM failure modes engineers must understand: hallucination, context loss, sycophancy, prompt injection, and positional bias
- Model comparison: GPT-4o vs Claude 3.5 Sonnet vs Gemini 1.5 Pro vs open-source (Llama 3, Mistral) — capabilities, pricing, and when to use each
- Open-source vs proprietary models: self-hosted inference with Ollama and vLLM vs API-based models
- Cost modelling: estimating and controlling LLM API spend at scale — token budgeting, caching, and model tiering strategies
- Setting up the AI engineering environment: Python, API keys, environment variable management, and rate limit handling
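To make the tokenisation and cost-modelling bullets above concrete, here is a minimal sketch of counting tokens with tiktoken and estimating per-request spend. It assumes a tiktoken release that knows the GPT-4o encoding, and the per-million-token prices are placeholders rather than current rates.

```python
import tiktoken

# Illustrative prices in USD per 1M tokens -- placeholders, not current rates
PRICE_PER_MTOK = {"input": 2.50, "output": 10.00}

def estimate_cost(prompt: str, expected_output_tokens: int, model: str = "gpt-4o") -> float:
    """Count prompt tokens with tiktoken and estimate the cost of a single request."""
    enc = tiktoken.encoding_for_model(model)   # picks the tokenizer matching the model
    input_tokens = len(enc.encode(prompt))
    cost = (input_tokens * PRICE_PER_MTOK["input"]
            + expected_output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000
    print(f"{input_tokens} input tokens, ~${cost:.6f} per request")
    return cost

estimate_cost("Summarise the attached incident report in three bullet points.",
              expected_output_tokens=150)
```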
This phase covers working directly with the OpenAI, Anthropic, and Google Gemini APIs, and the prompt engineering techniques that make the difference between a toy prototype and a reliable production feature.
- OpenAI API: chat completions, function calling, structured outputs, vision, and streaming with the Python SDK
- Anthropic API: messages API, system prompts, tool use, vision, and extended thinking with Claude
- Google Gemini API: multimodal inputs, long context, grounding, and the Gemini Python SDK
- Streaming responses: handling token-by-token output in APIs and surfacing it to users in real time
- Structured output: forcing models to return valid JSON using OpenAI structured outputs, Anthropic tool use, and the Instructor library
- Vision and multimodal inputs: sending images, PDFs, and documents to LLM APIs for analysis
- Batch API: processing thousands of requests asynchronously at lower cost with OpenAI Batch and Anthropic Batch
- Rate limiting and retry logic: exponential backoff, request queuing, and graceful degradation
- Provider abstraction: building a unified LLM client that can swap providers without rewriting application logic
- System prompts: writing effective prompts that define persona, behaviour, output format, and constraints
- Few-shot prompting: selecting and formatting examples that steer model behaviour reliably
- Chain-of-thought: making models reason step by step before producing output
- XML and structured prompt formatting: Anthropic's recommended approach for complex prompts
- Prompt templating: building dynamic prompts from user input and context using Jinja2 and f-strings
- Output formatting control: requesting JSON, markdown, tables, and code blocks reliably
- Prompt versioning: treating prompts as code — version control and A/B testing
- Prompt injection: understanding attack vectors and how to defend against them
- Context window management: summarisation, truncation, and prioritisation strategies
- Instruction following: writing prompts that models actually follow — specificity, positive framing
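As a small taste of the API work in this phase, here is a minimal sketch of a streamed chat completion wrapped in exponential-backoff retry logic using the official OpenAI Python SDK. The model name is illustrative, and the client expects an OPENAI_API_KEY environment variable.

```python
import time
from openai import OpenAI, RateLimitError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def stream_completion(system: str, user: str,
                      model: str = "gpt-4o-mini", max_retries: int = 5) -> str:
    """Stream a chat completion token by token, retrying with exponential backoff on rate limits."""
    for attempt in range(max_retries):
        try:
            stream = client.chat.completions.create(
                model=model,
                messages=[{"role": "system", "content": system},
                          {"role": "user", "content": user}],
                stream=True,
            )
            chunks = []
            for chunk in stream:
                delta = chunk.choices[0].delta.content or ""
                print(delta, end="", flush=True)   # surface tokens to the user as they arrive
                chunks.append(delta)
            return "".join(chunks)
        except RateLimitError:
            time.sleep(2 ** attempt)               # wait 1s, 2s, 4s, ... before retrying
    raise RuntimeError("Rate-limited after all retries")

stream_completion("You are a concise assistant.",
                  "Explain what a context window is in two sentences.")
```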
Embeddings are the foundation of semantic search, RAG pipelines, recommendation systems, and clustering. This phase covers generating embeddings and storing/querying them at scale using production vector databases.
- What embeddings are: converting text, images, and data into high-dimensional vectors that encode semantic meaning
- Embedding models: OpenAI text-embedding-3 (small/large), Cohere Embed v3, and open-source alternatives (sentence-transformers, BGE, E5)
- Embedding dimensions and model selection: accuracy vs cost vs latency trade-offs
- Similarity metrics: cosine similarity, dot product, and Euclidean distance — when each applies
- Batching embedding requests: efficient bulk generation for large document corpora
- Multimodal embeddings: text, images, and code — CLIP and other multimodal embedding models
- Embedding drift: how model updates can change embedding spaces and break existing indexes
- pgvector: vector similarity search in PostgreSQL — HNSW vs IVFFlat indexing, and querying vectors alongside relational data with standard SQL
- Pinecone: managed vector database — indexes, namespaces, metadata filtering, and hybrid search
- Qdrant: open-source vector database — collections, payload filtering, and self-hosted deployment
- Choosing a vector database: decision framework based on scale, cost, latency, and infrastructure
- Hybrid search: combining dense vector search with sparse BM25 keyword search for better retrieval
- Metadata filtering: narrowing searches by document type, date, user, tenant, or structured fields
- Vector index performance: HNSW graph construction, ef_construction, and recall/latency trade-offs
- Re-ranking: using cross-encoders (Cohere Rerank, Voyage Rerank) to improve retrieval precision
- Amazon OpenSearch with vector engine: AWS-native vector search alternative for AWS deployments
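A minimal sketch of the embedding and similarity-search workflow above: embed a handful of documents with OpenAI's text-embedding-3-small model and rank them against a query by cosine similarity in memory. The in-memory index is a stand-in for what pgvector, Pinecone, or Qdrant do at scale; the documents are illustrative.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts in one API call and return an (n, dim) array."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

docs = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "You can reset your password from the account settings page.",
]
doc_vecs = embed(docs)
query_vec = embed(["How long does a refund take?"])[0]

# Cosine similarity = dot product of L2-normalised vectors
doc_norm = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_norm = query_vec / np.linalg.norm(query_vec)
scores = doc_norm @ query_norm

for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```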
RAG is the most important pattern in production AI engineering — it addresses the core limitations of LLMs (outdated training data, hallucination, and no access to private knowledge) by retrieving relevant context at inference time. This phase covers RAG from basic implementation through to advanced production patterns.
- Why RAG: the problem it solves, when to use it, and when fine-tuning is a better answer
- The basic RAG pipeline: ingest → chunk → embed → store → retrieve → augment → generate
- Document ingestion: loading PDFs, Word docs, web pages, Notion, and databases with LangChain loaders and LlamaIndex readers
- Text chunking strategies: fixed-size, recursive character splitting, semantic chunking, document-structure-aware
- Chunk size and overlap: how they affect retrieval quality and what to tune for different document types
- Metadata enrichment: adding source, page number, section headers, and timestamps to chunks
- Embedding and indexing: bulk ingestion pipelines with progress tracking and error handling
- Query embedding and similarity search: retrieving the top-k most relevant chunks
- Context assembly: formatting retrieved chunks into a coherent prompt context block
- Source attribution: citing which documents the answer was drawn from
- Query transformation: rewriting user queries with an LLM before retrieval to improve recall
- HyDE (Hypothetical Document Embeddings): generating a hypothetical answer and using it as the retrieval query
- Multi-query retrieval: generating multiple query variants and merging their results
- Parent-child chunking: indexing small child chunks for precision, retrieving larger parent context
- Contextual compression: extracting only the relevant portion of a retrieved chunk
- Corrective RAG (CRAG): evaluating retrieval quality and falling back to web search when the knowledge base is insufficient
- Multi-vector retrieval: indexing documents by multiple representations (summary + full text + hypothetical questions)
- Agentic RAG: building retrieval as a tool that an agent calls dynamically
- RAG evaluation with RAGAS: measuring retrieval quality (context precision, recall) and generation quality (faithfulness, answer relevancy)
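Tying the basic pipeline above together, here is a deliberately simplified end-to-end RAG sketch: naive fixed-size chunking, an in-memory vector index, and a grounded completion. Model names are illustrative; a production build would swap in a real vector database, smarter chunking, and the advanced retrieval patterns listed above.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size character chunking with overlap."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def embed(texts: list[str]) -> np.ndarray:
    """Embed texts and L2-normalise so dot product equals cosine similarity."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def answer(question: str, document: str, top_k: int = 3) -> str:
    """Ingest -> chunk -> embed -> retrieve -> augment -> generate."""
    chunks = chunk(document)
    index = embed(chunks)                                # in-memory 'vector store'
    scores = index @ embed([question])[0]
    context = "\n\n".join(chunks[i] for i in np.argsort(scores)[::-1][:top_k])
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. Say 'I don't know' if the answer is not there."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```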
LangChain and LlamaIndex are the two dominant frameworks for orchestrating LLM applications — managing chains of calls, tool integrations, memory, and retrieval pipelines. Both are covered so you can choose the right tool for each job.
- LangChain architecture: chains, runnables, and the LCEL (LangChain Expression Language) pipeline syntax
- Prompt templates, output parsers, and structured output chains
- LangChain retrieval chains: complete RAG pipelines with LCEL
- Conversation chains and memory: maintaining history across turns with different memory backends
- LangChain Tools: wrapping functions, APIs, and databases as tools LLMs can call
- LangSmith: tracing, debugging, and evaluating LangChain applications in production
- LlamaIndex architecture: nodes, indexes, query engines, and pipelines
- Document and node processing: readers, transformations, and metadata extractors
- Index types: VectorStoreIndex, SummaryIndex, KnowledgeGraphIndex, PropertyGraphIndex
- Query engines and chat engines: conversational interfaces over your data
- Sub-question query engine: decomposing complex questions across multiple data sources
- LlamaIndex Workflows: event-driven, step-based orchestration for complex multi-stage pipelines
- LlamaParse: managed document parsing for complex PDFs, tables, and mixed-format documents
- LangChain vs LlamaIndex vs building from scratch: practical decision framework with real trade-offs
- Using both together: LlamaIndex for retrieval, LangChain for orchestration
- When to avoid frameworks: cases where direct API calls produce simpler, more maintainable code
- Dependency pinning and version management: keeping orchestration framework upgrades from breaking production
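To give a feel for LCEL, here is a minimal sketch of a prompt template piped into a chat model and an output parser. It assumes the langchain-core and langchain-openai packages; exact import paths vary between LangChain releases, so treat this as the shape of a chain rather than a pinned recipe.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system", "You write release notes in exactly three bullet points."),
    ("human", "Summarise these commit messages:\n{commits}"),
])
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# LCEL: composable runnables chained with the | operator
chain = prompt | llm | StrOutputParser()

print(chain.invoke({
    "commits": "fix: retry on 429\nfeat: add streaming endpoint\nchore: bump deps"
}))
```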
Agents are LLMs that can take actions — calling tools, writing and executing code, querying databases, and orchestrating other AI models. This phase covers agent architectures from simple tool-calling to complex multi-agent systems.
- Function calling fundamentals: defining tools as JSON schemas and letting LLMs decide when and how to call them
- Parallel tool calls: models that call multiple tools simultaneously and merge results
- Tool design principles: naming, descriptions, and parameter schemas that LLMs use reliably
- Built-in tools: web search, code execution, and file reading across OpenAI, Anthropic, and Gemini
- Custom tools: wrapping REST APIs, database queries, Python functions, and external services as LLM tools
- ReAct (Reasoning + Acting): the foundational agent loop — think, act, observe, repeat
- OpenAI Assistants API: threads, runs, tool calls, and file search — managed agent infrastructure
- LangGraph: stateful, graph-based agent workflows with cycles, branches, and human-in-the-loop
- LlamaIndex Workflows: event-driven agent pipelines with explicit step definitions
- Memory in agents: short-term (conversation buffer), long-term (vector memory), and entity memory
- Planning agents: breaking complex goals into sub-tasks and executing in order
- Code execution agents: agents that write Python, run it in a sandbox, and iterate on output
- Browser agents: agents that navigate web pages and extract information (Playwright + LLM)
- Multi-agent patterns: supervisor agents that delegate to specialist sub-agents
- Agent-to-agent communication: how agents pass context, results, and instructions
- CrewAI: role-based multi-agent orchestration for structured collaborative workflows
- AutoGen: Microsoft's multi-agent conversation framework for complex task decomposition
- Guardrails in agent systems: preventing runaway loops, cost overruns, and unintended actions
- Human-in-the-loop: checkpoints where agents pause and request human approval before proceeding
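To illustrate the function-calling loop that underpins this phase, here is a minimal single-tool turn with the OpenAI API: the model decides whether to call a hypothetical get_weather tool, the application executes it, and the result is fed back for a final answer. The tool and model names are illustrative.

```python
import json
from openai import OpenAI

client = OpenAI()

def get_weather(city: str) -> str:
    """Hypothetical tool -- a real agent would call an actual weather API here."""
    return f"18C and cloudy in {city}"

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather like in Islamabad?"}]
resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
msg = resp.choices[0].message

if msg.tool_calls:                                   # the model chose to call a tool
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_weather(**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    final = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
    print(final.choices[0].message.content)
else:
    print(msg.content)
```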
Fine-tuning is not always the right answer — but when it is, it dramatically outperforms prompting alone. This phase covers when fine-tuning makes sense, how to do it correctly, and the alternatives that are often faster and cheaper.
- Fine-tuning vs RAG vs prompt engineering: the decision framework every AI engineer needs
- When fine-tuning wins: style consistency, format adherence, domain-specific terminology, and latency-sensitive use cases
- Dataset preparation: formatting training data as instruction-response pairs, quality filtering, and diversity
- OpenAI fine-tuning API: uploading datasets, running training jobs, evaluating fine-tuned models, and cost estimation
- Fine-tuning GPT-4o mini for classification, extraction, and structured output tasks
- LoRA and QLoRA: parameter-efficient fine-tuning of open-source models (Llama 3, Mistral) on consumer hardware
- HuggingFace PEFT library: implementing LoRA fine-tuning with the Trainer API
- Instruction tuning vs continued pre-training: understanding the difference and when each applies
- RLHF overview: how models are aligned with human preferences — conceptual understanding
- Deploying fine-tuned models: serving with vLLM, BentoML, or uploading to HuggingFace Hub
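A minimal sketch of the OpenAI fine-tuning workflow covered above: training examples as JSONL chat conversations, an upload through the Files API, and a fine-tuning job. The base-model snapshot name is illustrative; check which snapshots are currently fine-tunable before running this.

```python
import json
from openai import OpenAI

client = OpenAI()

# Each training example is one chat conversation ending in the desired assistant output
examples = [
    {"messages": [
        {"role": "system", "content": "Classify support tickets as billing, technical, or other."},
        {"role": "user", "content": "I was charged twice this month."},
        {"role": "assistant", "content": "billing"},
    ]},
    # ... hundreds more examples, diverse and quality-filtered
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

uploaded = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=uploaded.id,
    model="gpt-4o-mini-2024-07-18",   # illustrative snapshot name
)
print(job.id, job.status)
```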
Production AI systems fail in ways that are hard to predict and hard to detect. This phase covers evaluation frameworks, guardrails, and safety layers that make AI features trustworthy in customer-facing applications.
- Why LLM evaluation is hard: non-determinism, subjective quality, and the absence of ground truth
- Evaluation metrics: faithfulness, answer relevancy, context precision, context recall, and toxicity
- RAGAS: automated RAG evaluation — measuring retrieval and generation quality end-to-end
- LLM-as-judge: using a strong LLM to evaluate the outputs of another — prompting patterns and limitations
- Human evaluation: building annotation interfaces and rubrics for systematic human review
- Regression testing: building an evaluation dataset and running it on every prompt or model change
- LangSmith and Braintrust: platforms for logging, evaluating, and comparing LLM outputs across runs
- Evals as code: integrating LLM evaluation into CI/CD pipelines so regressions are caught before deployment
- Input guardrails: classifying and filtering user inputs before they reach the LLM
- Output guardrails: validating, filtering, and post-processing LLM outputs before they reach users
- Guardrails AI: declarative guardrail definitions with validators for PII, toxicity, and schema conformance
- Llama Guard: Meta's open-source safety classifier for screening inputs and outputs
- PII detection and redaction: identifying and masking personal data in inputs and outputs with Presidio
- Jailbreak and prompt injection defence: input sanitisation and instruction hierarchy patterns
- Content moderation: OpenAI Moderation API and custom classifiers for domain-specific policies
- Fallback strategies: graceful degradation when models fail, time out, or produce unsafe output
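To show what an evaluation harness looks like in code, here is a minimal LLM-as-judge sketch that scores an answer's faithfulness to retrieved context. The rubric and the 1-to-5 scale are illustrative; a production setup adds curated datasets, CI integration, and human review of judge disagreements.

```python
import json
from openai import OpenAI

client = OpenAI()

def judge_faithfulness(question: str, context: str, answer: str) -> dict:
    """Ask a strong model to grade whether the answer is supported by the context (1-5)."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "You are an evaluator. Score the answer's faithfulness to the context "
                        "from 1 (contradicts or fabricates) to 5 (fully supported). "
                        'Reply as JSON: {"score": <int>, "reason": "<short explanation>"}'},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer: {answer}"},
        ],
    )
    return json.loads(resp.choices[0].message.content)

print(judge_faithfulness(
    question="How long do refunds take?",
    context="Refunds are processed within 5 business days.",
    answer="Refunds usually take about a week.",
))
```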
Shipping an AI feature to production is different from shipping a traditional API — latency is higher, costs vary with usage, outputs are non-deterministic, and failures are often silent. This phase covers running AI systems reliably at scale.
- AI feature architecture: synchronous vs asynchronous patterns in full-stack applications
- Streaming AI responses to the frontend: Server-Sent Events in FastAPI and Next.js
- Background AI jobs: document processing, embeddings, and batch inference with Celery / ARQ
- Caching LLM responses: semantic caching with GPTCache and Redis to reduce cost and latency
- LLM proxy layer: routing requests across providers, fallbacks, and usage tracking with LiteLLM
- Multi-tenancy: isolating AI features, vector namespaces, and usage quotas per user or organisation
- FastAPI on AWS ECS: containerised AI inference services with auto-scaling
- AWS Lambda for lightweight AI features: serverless LLM calls with cold start optimisation
- AWS Bedrock: accessing Claude, Llama, Titan, and other foundation models through AWS
- Amazon OpenSearch with vector engine: AWS-native vector search for RAG at scale
- Secrets management: storing and rotating API keys with AWS Secrets Manager
- LLM observability: what to log — prompts, completions, tokens, latency, cost, and user feedback
- Langfuse: open-source LLM observability — tracing, scoring, and dataset management
- OpenTelemetry for AI: tracing LLM calls as spans in distributed traces
- Cost monitoring: per-user, per-feature, and per-model spend with dashboards and budget alerts
- Latency monitoring: p50/p95/p99 tracking and alerting on degradation
- Hallucination monitoring: automated detection of factual inconsistencies in production
- User feedback loops: thumbs up/down signals and using them to improve prompts and retrieval
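A minimal sketch of streaming model output to the browser with Server-Sent Events in FastAPI, as listed above. A production endpoint would add authentication, token/latency/cost logging, and error handling; the model name is illustrative.

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

@app.get("/chat/stream")
def chat_stream(q: str):
    """Proxy a streamed completion to the browser as Server-Sent Events."""
    def event_stream():
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": q}],
            stream=True,
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content or ""
            if delta:
                yield f"data: {delta}\n\n"          # SSE frame: 'data: <payload>' + blank line
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```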
The final week is a guided capstone project where each student builds a complete, production-ready AI feature integrated into a full-stack application. Example project options:
- Document Q&A System: upload any PDF, ask questions, get answers with cited sources — built with RAG + pgvector + FastAPI + Next.js
- AI Customer Support Agent: handles FAQs from a knowledge base, escalates to humans when uncertain — RAG + LangGraph + guardrails
- Semantic Search Engine: replacing keyword search with vector search + hybrid retrieval over a product catalogue
- Code Review Agent: analyses pull request diffs and produces structured feedback using multi-step tool use
- Content Generation Pipeline: brief → research → draft → review loop with multiple specialised agents
🛠️ Tools & Technologies Covered
OpenAI · Anthropic · Gemini · AWS Bedrock
GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and open-source models (Llama 3, Mistral) via Ollama and vLLM
pgvector · Pinecone · Qdrant · OpenSearch
PostgreSQL-native vectors, managed Pinecone, self-hosted Qdrant, and AWS OpenSearch vector engine
LangChain · LlamaIndex · LangGraph · CrewAI
Complete orchestration frameworks, stateful graph-based agents, and multi-agent workflows
RAGAS · LangSmith · Langfuse · Guardrails AI
Automated evaluation, LLM tracing, production observability, and input/output safety layers
OpenAI Fine-Tuning · HuggingFace PEFT · LoRA
OpenAI's fine-tuning API, parameter-efficient LoRA/QLoRA for open-source models on consumer hardware
FastAPI · Celery · LiteLLM · Docker · AWS ECS
Production API deployment, background job processing, provider routing, containerisation, and auto-scaled container services on AWS ECS
📅 Schedule & Timings
Weekday Groups
Weekend Groups
📍 Location: In-house training, F-11 Markaz, Islamabad · 📱 Online option available for out-of-city participants