Week 3: Delta Lake & Workflows

Overview

Build reliable data pipelines with Delta Lake — ACID transactions, schema enforcement, DML operations (INSERT, UPDATE, MERGE), and time travel. Then orchestrate pipelines with Databricks Jobs, Dashboards, and Workflows.

Topics

#      Type   Title                         Duration
3.1.1  Video  What Is Delta Lake            10 min
3.1.2  Video  Delta Lake Concepts           12 min
3.1.3  Video  Creating Delta Tables         10 min
3.2.1  Video  Insert, Update & Merge        12 min
3.2.2  Video  Time Travel                   8 min
3.3.1  Video  Jobs, Dashboards & Workflows  12 min
       Lab    Delta Tables                  45 min
       Lab    Jobs & Workflows              30 min
       Quiz   Delta Lake & Workflows        15 min

Key Concepts

Delta Lake Architecture

Delta Table
├── _delta_log/                                  # Transaction log (JSON + Parquet)
│   ├── 00000000000000000000.json                # Version 0
│   ├── 00000000000000000001.json                # Version 1
│   └── 00000000000000000010.checkpoint.parquet  # Checkpoint (every 10 commits by default)
└── part-00000-*.parquet                         # Data files (standard Parquet)

The transaction log records every change, enabling ACID guarantees.
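
A minimal PySpark sketch of this flow, assuming a cluster where Delta Lake is available (the default on Databricks); the table name demo_users is illustrative:

from pyspark.sql import SparkSession

# On Databricks, `spark` already exists; building a session here just
# keeps the sketch self-contained.
spark = SparkSession.builder.getOrCreate()

# The write creates the table's Parquet data files plus a JSON commit
# entry in _delta_log/ (version 0).
spark.range(5).withColumnRenamed("id", "user_id") \
    .write.format("delta").mode("overwrite").saveAsTable("demo_users")

# Every commit in the log surfaces as a queryable version.
spark.sql("DESCRIBE HISTORY demo_users") \
    .select("version", "timestamp", "operation").show()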

Delta Lake Features

Feature             What It Does                Why It Matters
ACID Transactions   Atomic, consistent writes   No corrupt/partial data
Schema Enforcement  Validates data on write     Data quality
Schema Evolution    Add columns safely          Agile development
Time Travel         Query historical versions   Auditing, rollback
MERGE (Upsert)      INSERT + UPDATE + DELETE    Efficient CDC
Auto-Optimize       Compacts small files        Query performance
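
Schema enforcement and evolution in one sketch, continuing the hypothetical demo_users table from above; mergeSchema is the standard Delta write option for evolving a schema:

# A batch with a column the table does not have yet; names and values
# are illustrative.
new_rows = spark.createDataFrame([(6, "gold")], "user_id long, plan string")

# Enforcement: without mergeSchema, Delta rejects this append because
# `plan` is not in the table schema.
# new_rows.write.format("delta").mode("append").saveAsTable("demo_users")

# Evolution: with mergeSchema, the `plan` column is added safely, and
# existing rows read it back as NULL.
new_rows.write.format("delta").mode("append") \
    .option("mergeSchema", "true").saveAsTable("demo_users")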

DML Operations

  • INSERT: df.write.format("delta").mode("append").saveAsTable("table_name")
  • UPDATE: UPDATE table SET col = val WHERE condition
  • MERGE: Match on a key; update rows that exist, insert rows that don't
  • Time Travel: SELECT * FROM table VERSION AS OF n (see the sketch after this list)
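
A sketch of MERGE and time travel via spark.sql, again against the hypothetical demo_users table; the updates view and key column are assumptions:

# Incoming changes: user 1 exists (update), user 99 does not (insert).
spark.createDataFrame([(1, "silver"), (99, "trial")],
                      "user_id long, plan string") \
    .createOrReplaceTempView("updates")

# One atomic upsert: matched rows are updated, unmatched rows inserted.
spark.sql("""
    MERGE INTO demo_users AS t
    USING updates AS s
    ON t.user_id = s.user_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Time travel: query the table as it stood at an earlier version.
spark.sql("SELECT * FROM demo_users VERSION AS OF 0").show()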

Databricks Workflows

  • Job: Scheduled execution of a notebook or script
  • Task: Single unit of work within a workflow
  • Workflow: Multi-task DAG with dependencies
  • Dashboard: SQL-powered visualizations connected to SQL Warehouses
  • Widgets: Parameterize notebooks for reusable pipelines (see the sketch after this list)
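
A minimal widget sketch, assuming it runs in a Databricks notebook where dbutils is predefined; the parameter name and source table are illustrative:

# A text widget declares a notebook parameter with a default; a Job
# task can override it at run time via task parameters.
dbutils.widgets.text("run_date", "2024-01-01")
run_date = dbutils.widgets.get("run_date")

# The same notebook then serves ad-hoc runs and scheduled Jobs alike.
df = spark.table("demo_users")  # hypothetical source table
print(f"Rows available on {run_date}: {df.count()}")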

Certification Topics

Key certification concepts from this week:

  1. Delta Lake provides ACID transactions via the transaction log
  2. MERGE combines INSERT, UPDATE, and DELETE in one atomic operation
  3. Time travel enables querying any previous version of the data
  4. Schema enforcement prevents bad data; schema evolution adds columns safely
  5. Jobs use job clusters (auto-created, auto-terminated) for scheduled workloads
  6. Workflows orchestrate multi-step pipelines with DAG dependencies

Demo Code