Intermediate Databricks Course Outline
Length: 3 Days
Module 1: Advanced Data Engineering with Delta Lake
Delta Lake architecture deep dive: transaction log, time travel, and ACID guarantees
Schema enforcement vs. schema evolution strategies
Optimizing Delta tables: OPTIMIZE, ZORDER, and VACUUM operations (see the maintenance sketch after this module)
Handling slowly changing dimensions (SCD Types 1, 2, and 3); a MERGE sketch follows below
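A minimal sketch of the maintenance and time-travel commands above, as run from a Databricks notebook; the table name events and the column event_date are illustrative placeholders.

    # Assumes a notebook attached to a cluster with an existing Delta table named `events`
    spark.sql("OPTIMIZE events ZORDER BY (event_date)")         # compact small files, co-locate by a common filter column
    spark.sql("VACUUM events RETAIN 168 HOURS")                  # remove unreferenced files past the 7-day retention window
    df_v0 = spark.sql("SELECT * FROM events VERSION AS OF 0")    # time travel via the transaction log
    display(df_v0)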
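And a sketch of a MERGE-based Type 1 upsert with the DeltaTable API; table and column names are placeholders. Type 2 additionally closes out the current row (end date or current flag) before inserting the new version.

    from delta.tables import DeltaTable

    target = DeltaTable.forName(spark, "dim_customer")           # placeholder dimension table
    updates = spark.table("staging_customer_updates")            # placeholder staging table

    (target.alias("t")
       .merge(updates.alias("s"), "t.customer_id = s.customer_id")
       .whenMatchedUpdate(set={"email": "s.email", "segment": "s.segment"})  # Type 1: overwrite in place
       .whenNotMatchedInsertAll()
       .execute())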
Module 2: Delta Live Tables & Pipeline Architecture
Change Data Capture (CDC) with Delta Lake
Delta Live Tables: declarative pipelines and expectations
Medallion architecture implementation (bronze/silver/gold)
Lab: Build an end-to-end Delta Live Tables pipeline with data quality constraints
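A minimal sketch of the kind of pipeline the lab builds, assuming it runs inside a Delta Live Tables pipeline rather than an interactive cluster; the storage path, table names, and expectation rules are illustrative placeholders.

    import dlt
    from pyspark.sql import functions as F

    @dlt.table(comment="Bronze: raw orders ingested as-is")
    def orders_bronze():
        # Auto Loader incrementally picks up new files from cloud storage
        return (spark.readStream.format("cloudFiles")
                .option("cloudFiles.format", "json")
                .load("/Volumes/demo/raw/orders"))               # placeholder path

    @dlt.table(comment="Silver: validated orders")
    @dlt.expect_or_drop("valid_amount", "amount > 0")            # rows failing this are dropped
    @dlt.expect("has_customer", "customer_id IS NOT NULL")       # tracked in metrics, not enforced
    def orders_silver():
        return dlt.read_stream("orders_bronze").withColumn("ingested_at", F.current_timestamp())

    @dlt.table(comment="Gold: daily revenue")
    def daily_revenue_gold():
        return (dlt.read("orders_silver")
                .groupBy("order_date")
                .agg(F.sum("amount").alias("revenue")))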
Module 3: Spark Performance Fundamentals
Spark UI deep dive: understanding jobs, stages, and tasks
Identifying and resolving data skew
Broadcast joins vs. shuffle hash joins vs. sort-merge joins
Adaptive Query Execution (AQE) tuning
Partition pruning and predicate pushdown
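A sketch of the join-strategy and AQE knobs above, run from a notebook; the table names and the 64 MB broadcast threshold are arbitrary choices for illustration.

    from pyspark.sql import functions as F

    # Adaptive Query Execution: re-plan at runtime using shuffle statistics
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")                   # split skewed shuffle partitions
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024))   # broadcast small tables up to 64 MB

    facts = spark.table("sales_facts")    # large fact table (placeholder)
    dims = spark.table("store_dim")       # small dimension table (placeholder)

    # Broadcast hint: ship the small side to every executor, avoiding a shuffle of the fact table
    joined = facts.join(F.broadcast(dims), "store_id")

    # Filtering on the partition column lets Spark prune files before reading them
    recent = joined.where(F.col("sale_date") >= "2024-01-01")
    recent.explain(True)   # look for BroadcastHashJoin and PartitionFilters in the plan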
Module 4: Cluster & Code Optimization
Cluster configuration: worker types, autoscaling policies, spot instances
Caching strategies and when to use them
UDFs: performance implications and alternatives (pandas UDFs, vectorized operations)
Lab: Performance tune a slow-running job using Spark UI diagnostics
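A sketch contrasting a row-at-a-time Python UDF with a vectorized pandas UDF, plus the built-in expression that usually beats both, and a cache that only pays off when the result is reused; the column names and sizes are placeholders.

    import pandas as pd
    from pyspark.sql import functions as F
    from pyspark.sql.functions import pandas_udf, udf
    from pyspark.sql.types import DoubleType

    df = spark.range(1_000_000).withColumn("amount", F.rand() * 100)

    @udf(returnType=DoubleType())            # row-at-a-time: one Python call per row
    def add_tax_slow(amount):
        return amount * 1.2

    @pandas_udf(DoubleType())                # vectorized: whole Arrow batches as pandas Series
    def add_tax_fast(amount: pd.Series) -> pd.Series:
        return amount * 1.2

    # Built-in column expressions avoid Python entirely and are usually fastest
    with_tax = df.withColumn("with_tax", F.col("amount") * 1.2)

    # Cache only when the same intermediate result is reused several times
    with_tax.cache()
    with_tax.count()   # first action materializes the cache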
Module 5: Orchestration & DevOps
Databricks Workflows: jobs, tasks, and dependencies
Parameterized notebooks (see the widgets sketch after this module); job clusters vs. all-purpose clusters
CI/CD patterns: Repos integration, testing strategies, promotion workflows
Unity Catalog fundamentals: metastore, catalogs, schemas, and governance
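A sketch of a parameterized notebook using widgets; the parameter names and the environment-specific catalog naming are placeholders, and in a Workflows job the same values can be supplied as task parameters.

    from pyspark.sql import functions as F

    dbutils.widgets.text("run_date", "2024-01-01", "Run date")
    dbutils.widgets.dropdown("env", "dev", ["dev", "staging", "prod"], "Environment")

    run_date = dbutils.widgets.get("run_date")
    env = dbutils.widgets.get("env")

    # Unity Catalog three-level namespace, switched per environment (placeholder convention)
    source_table = f"{env}_catalog.sales.orders"
    df = spark.table(source_table).where(F.col("order_date") == run_date)
    print(f"{df.count()} rows for {run_date} from {source_table}")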
Module 6: Security, Governance & Production Readiness
Row-level and column-level security with Unity Catalog
Secret management and secure credential handling
Monitoring and alerting: job notifications, query history, audit logs
Lab: Deploy a production-ready pipeline with Unity Catalog governance and scheduled workflows
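A sketch of the secret handling and Unity Catalog governance pieces the lab touches; the secret scope, JDBC endpoint, group, and table names are all placeholders, and the dynamic view is one common row-level pattern rather than the only option.

    # Secrets resolve at runtime and are redacted in notebook output
    jdbc_password = dbutils.secrets.get(scope="prod-kv", key="warehouse-password")   # placeholder scope/key

    external = (spark.read.format("jdbc")
                .option("url", "jdbc:postgresql://db-host:5432/analytics")           # placeholder endpoint
                .option("dbtable", "public.customers")
                .option("user", "etl_user")
                .option("password", jdbc_password)
                .load())

    # Unity Catalog governance: least-privilege grants expressed in SQL
    spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `data_analysts`")

    # Row-level filtering via a dynamic view keyed on group membership
    spark.sql("""
        CREATE OR REPLACE VIEW main.sales.orders_emea AS
        SELECT * FROM main.sales.orders
        WHERE is_account_group_member('admins') OR region = 'EMEA'
    """)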
Prerequisites
Basic Spark/PySpark experience
SQL proficiency
Familiarity with Databricks notebooks and basic cluster operations