Databricks Specialization on Coursera

Courses 1, 3 & 4 of the Databricks Specialization on Coursera

Platform: Databricks Free Edition | Comparison Layer: Sovereign AI Stack (Rust)

Design Philosophy

Course 1 is Databricks-only, building the foundation for the specialization.

Courses 3 & 4 use a dual-layer pedagogy:

  1. Databricks layer — Hands-on with MLflow, Feature Store, Model Serving, Vector Search, Foundation Models
  2. Sovereign AI Stack layer — Build the same concepts from scratch in Rust to understand what platforms abstract

Why both?

  • Practitioners need to use Databricks effectively
  • Engineers need to understand what's underneath
  • "Understand by building" creates deeper retention

Course Overview

Course | Title | Duration
1 | Lakehouse Fundamentals | ~15 hours
3 | MLOps Engineering | ~30 hours
4 | GenAI Engineering | ~34 hours

Sovereign AI Stack

┌──────────────────────────────────────────────────────────────────┐
│                   batuta (Orchestration)                         │
│              Privacy Tiers · CLI · Stack Coordination            │
├───────────────────┬──────────────────┬───────────────────────────┤
│  realizar         │  entrenar        │      pacha                │
│  (Inference)      │  (Training)      │   (Model Registry)        │
│  GGUF/SafeTensors │  autograd/LoRA   │  Sign/Encrypt/Lineage     │
├───────────────────┴──────────────────┴───────────────────────────┤
│                    aprender                                       │
│         ML Algorithms: regression, trees, clustering              │
├──────────────────────────────────────────────────────────────────┤
│                     trueno                                        │
│         SIMD/GPU Compute (AVX2/AVX-512/NEON, wgpu)               │
├──────────────────────────────────────────────────────────────────┤
│  trueno-rag      │ trueno-db       │ alimentar     │ pmat        │
│  BM25 + Vector   │ GPU Analytics   │ Arrow/Parquet │ Quality     │
└──────────────────┴─────────────────┴───────────────┴─────────────┘

Prerequisites

Databricks

Sovereign AI Stack (Rust)

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Key crates
cargo install batuta realizar pmat

Getting Started

Begin with Course 1: Lakehouse Fundamentals for the foundational concepts, then continue to Course 3: MLOps Engineering or jump directly to Course 4: GenAI Engineering if you're already familiar with MLOps concepts.

Course 1: Databricks Lakehouse Fundamentals

Subtitle

Master the Data Lakehouse Architecture with Databricks Free Edition

Description

Build a solid foundation in the Databricks Lakehouse Platform: understand the evolution from data warehouses and data lakes to the lakehouse paradigm, navigate the Databricks workspace and Unity Catalog, write Spark DataFrames and SQL, and work with Delta Lake for reliable, versioned data storage. This is a Databricks-only course — no Sovereign AI Stack component.

Certification Alignment

This course prepares you for the Databricks Accredited Lakehouse Platform Fundamentals accreditation:

  • 25 multiple-choice questions
  • Tests conceptual understanding of the platform
  • Covers architecture, components, governance, and workloads
  • Free for Databricks customers and partners

Learning Outcomes

  1. Explain the data lakehouse architecture and how it combines warehouse reliability with lake flexibility
  2. Navigate the Databricks workspace, Unity Catalog, and compute resources
  3. Use Databricks notebooks with magic commands, dbutils, and multiple languages
  4. Write Spark transformations (select, filter, groupBy, join) and actions
  5. Create and manage Delta Lake tables with ACID transactions, MERGE, and time travel
  6. Build parameterized ETL pipelines and schedule them as Databricks Jobs

Duration

~15 hours | 18 videos | 6 labs | 3 quizzes

Weeks

Week | Topic | Focus
1 | Lakehouse Architecture & Platform | Architecture, workspace, catalog, compute
2 | Spark Fundamentals | Notebooks, DataFrames, SQL, transformations
3 | Delta Lake & Workflows | Delta tables, DML, time travel, jobs

Databricks Free Edition Features Used

  • Workspace and Notebooks
  • Unity Catalog (basic)
  • Apache Spark DataFrames and SQL
  • Delta Lake tables
  • DBFS (Databricks File System)
  • Jobs and Workflows
  • Sample datasets (/databricks-datasets/)

Prerequisites

  • Basic SQL knowledge
  • Familiarity with Python
  • Databricks Free Edition account (sign up)

Week 1: Lakehouse Architecture & Platform

Overview

Understand the evolution of data architectures from data warehouses to data lakes to the data lakehouse. Explore the Databricks platform: workspace navigation, Unity Catalog hierarchy, and compute resources.

Topics

# | Type | Title | Duration
1.1.1 | Video | Data Architecture Evolution | 8 min
1.1.2 | Video | Lakehouse Architecture | 10 min
1.1.3 | Video | Databricks and the Lakehouse | 8 min
1.2.1 | Video | Databricks Overview | 10 min
1.2.2 | Video | Workspace, Catalog & Data | 12 min
1.3.1 | Video | Compute Resources | 8 min
- | Lab | Lakehouse Concepts | 30 min
- | Lab | Workspace & Catalog | 30 min
- | Quiz | Lakehouse Architecture | 15 min

Key Concepts

Data Architecture Evolution

Era | Architecture | Strengths | Weaknesses
1980s–2000s | Data Warehouse | ACID, schema, BI | Expensive, rigid, no unstructured data
2010s | Data Lake | Cheap, flexible, any format | No ACID, quality issues, "data swamp"
2020s+ | Data Lakehouse | Best of both | Requires a modern platform

Lakehouse Properties

A data lakehouse provides:

  • ACID transactions on data lake storage (via Delta Lake)
  • Schema enforcement and evolution for data quality
  • Direct BI access to source data (no ETL to warehouse)
  • Unified batch and streaming in one architecture
  • Open formats (Parquet + Delta) — no vendor lock-in
  • Governance via Unity Catalog

Databricks Platform Architecture

  • Control Plane: Managed by Databricks — workspace UI, job scheduling, notebooks
  • Data Plane: Runs in your cloud account — compute clusters, data storage, processing
  • Unity Catalog: Three-level namespace (catalog.schema.table), organized under a metastore
  • Compute Options: All-purpose clusters, job clusters, SQL warehouses, serverless

Certification Topics

Key accreditation concepts from this week:

  1. A data lakehouse combines warehouse reliability with lake flexibility
  2. Delta Lake provides ACID transactions on data lake storage
  3. Unity Catalog provides unified governance across all data assets
  4. The control plane is managed by Databricks; the data plane runs in your cloud
  5. Photon accelerates SQL queries without requiring code changes
  6. Open formats prevent vendor lock-in

Lab: Lakehouse Concepts

Explore the data lakehouse architecture hands-on: compare architectures, inspect platform components, and create your first Delta table.

Objectives

  • Identify key properties of a data lakehouse
  • Compare lakehouse vs data warehouse vs data lake
  • Create a Delta table and inspect the transaction log
  • Verify the Databricks environment

Lab Exercise

See labs/course1/week1/lab_lakehouse.py

Key Tasks

  1. Verify environment — Print Spark version and runtime info
  2. Architecture comparison — Build a DataFrame comparing warehouse/lake/lakehouse features
  3. Create Delta table — Write sample data as a Delta table
  4. Inspect history — Use DESCRIBE HISTORY to view the transaction log

Validation

The lab includes a validate_lab() function that checks:

  • Spark environment is running
  • Delta table was created with at least 5 rows
  • Architecture comparison DataFrame has all 3 architectures

Lab: Workspace & Catalog

Navigate the Databricks workspace, explore the Unity Catalog hierarchy, browse DBFS, and inspect compute resources.

Objectives

  • Navigate the Databricks Workspace UI
  • Explore Unity Catalog (Metastore > Catalog > Schema > Table)
  • Use DBFS to browse files and sample datasets
  • Inspect cluster configuration

Lab Exercise

See labs/course1/week1/lab_workspace.py

Key Tasks

  1. Catalog exploration — List catalogs and schemas using SQL
  2. Create schema and table — Build a lab_workspace.cities table with data
  3. File system exploration — Browse /databricks-datasets/ with dbutils
  4. Compute inspection — Print cluster and runtime configuration

Validation

The lab includes a validate_lab() function that checks:

  • Schema lab_workspace was created
  • Cities table exists with at least 3 rows

Week 2: Spark Fundamentals

Overview

Master Apache Spark on Databricks: use notebooks with magic commands and utilities, load and preview data, then apply core DataFrame operations — select, filter, groupBy, aggregations, and joins.

Topics

# | Type | Title | Duration
2.1.1 | Video | Using Notebooks | 10 min
2.1.2 | Video | Magic Commands & Utilities | 8 min
2.1.3 | Video | Loading & Previewing Data | 10 min
2.2.1 | Video | Spark Core Concepts | 12 min
2.2.2 | Video | Select & Filter Operations | 10 min
2.2.3 | Video | GroupBy, Aggregations & Joins | 12 min
- | Lab | Using Notebooks | 30 min
- | Lab | Spark Operations | 45 min
- | Quiz | Spark Fundamentals | 15 min

Key Concepts

Databricks Notebooks

  • Support Python, SQL, Scala, R in the same notebook
  • Magic commands: %python, %sql, %scala, %r, %md, %sh, %fs, %run
  • dbutils: File system ops (fs), notebook chaining (notebook), widgets, secrets
  • display(): Rich visualizations built into Databricks

Spark Core Architecture

  • SparkSession: Entry point (spark variable, auto-created on Databricks)
  • DataFrame: Distributed collection of rows with named columns
  • Lazy evaluation: Transformations build a plan; actions trigger execution
  • Catalyst Optimizer: Optimizes the query plan regardless of API used

Transformations vs Actions

Transformations (Lazy) | Actions (Eager)
select() | show()
filter() / where() | count()
groupBy() | collect()
join() | first()
orderBy() | take(n)
withColumn() | write.*
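
The lazy-vs-eager split can be illustrated with a small pure-Python analogy. This is not Spark itself; TinyFrame is a made-up class that simply queues transformations until an action runs, which is the behavior the table above describes:

```python
class TinyFrame:
    """Toy stand-in for a DataFrame: transformations are lazy, actions are eager."""
    def __init__(self, rows, plan=None):
        self.rows = rows
        self.plan = plan or []          # queued transformations, not yet run

    def filter(self, predicate):        # transformation: just extends the plan
        return TinyFrame(self.rows, self.plan + [("filter", predicate)])

    def select(self, *cols):            # transformation: just extends the plan
        return TinyFrame(self.rows, self.plan + [("select", cols)])

    def collect(self):                  # action: executes the whole plan now
        rows = self.rows
        for op, arg in self.plan:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            elif op == "select":
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows

df = TinyFrame([{"name": "a", "price": 5}, {"name": "b", "price": 15}])
result = df.filter(lambda r: r["price"] > 10).select("name").collect()
# Nothing executed until collect(); result is [{"name": "b"}]
```

In real Spark the queued plan additionally passes through the Catalyst Optimizer before execution, which is why chaining many transformations is cheap.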

Core Operations

  • select() — Choose and transform columns
  • filter() / where() — Select rows by condition
  • groupBy().agg() — Group rows and compute aggregates (sum, avg, count, max, min)
  • join() — Combine DataFrames (inner, left, right, full)
  • orderBy() — Sort results

Data Formats

Format | Command | Use Case
CSV | spark.read.csv() | Simple tabular data
JSON | spark.read.json() | Semi-structured data
Parquet | spark.read.parquet() | Columnar analytics
Delta | spark.read.format("delta") | Lakehouse tables

Lab: Using Notebooks

Practice using Databricks notebooks: magic commands for multi-language cells, dbutils for file operations, loading data, and visualizations.

Objectives

  • Use magic commands to switch between Python, SQL, and Markdown
  • Work with dbutils for file system operations
  • Load data from CSV files
  • Use display() for rich visualizations

Lab Exercise

See labs/course1/week2/lab_notebooks.py

Key Tasks

  1. Magic commands — Write Python and SQL cells in the same notebook
  2. dbutils exploration — List sample datasets, preview file contents
  3. Load data — Read a CSV file with schema inference
  4. Visualization — Use display() to create charts from aggregated data

Validation

The lab includes a validate_lab() function that checks:

  • Python magic command executed correctly
  • DataFrame was loaded with data

Lab: Spark Operations

Practice core Spark DataFrame operations: select, filter, groupBy, aggregations, and joins using sales data.

Objectives

  • Use select() to choose and transform columns
  • Use filter() to select rows by condition
  • Use groupBy() with aggregation functions (sum, avg, count, max)
  • Perform inner and left joins between DataFrames
  • Write equivalent SQL queries

Lab Exercise

See labs/course1/week2/lab_spark.py

Key Tasks

  1. Select — Create derived columns (total_revenue, discounted_price)
  2. Filter — Find rows by price, category, region, and date range
  3. GroupBy — Compute revenue by category, average price by region, max price per category
  4. Join — Combine sales with region lookup, then aggregate by territory
  5. SQL — Register DataFrames as views and write equivalent SQL queries

Validation

The lab includes a validate_lab() function that checks:

  • Sales data loaded (10 rows)
  • Select returns correct number of columns
  • Filter returns non-empty results
  • GroupBy produces correct number of groups
  • Join produces correct row count

Week 3: Delta Lake & Workflows

Overview

Build reliable data pipelines with Delta Lake — ACID transactions, schema enforcement, DML operations (INSERT, UPDATE, MERGE), and time travel. Then orchestrate pipelines with Databricks Jobs, Dashboards, and Workflows.

Topics

# | Type | Title | Duration
3.1.1 | Video | What Is Delta Lake | 10 min
3.1.2 | Video | Delta Lake Concepts | 12 min
3.1.3 | Video | Creating Delta Tables | 10 min
3.2.1 | Video | Insert, Update & Merge | 12 min
3.2.2 | Video | Time Travel | 8 min
3.3.1 | Video | Jobs, Dashboards & Workflows | 12 min
- | Lab | Delta Tables | 45 min
- | Lab | Jobs & Workflows | 30 min
- | Quiz | Delta Lake & Workflows | 15 min

Key Concepts

Delta Lake Architecture

Delta Table
├── _delta_log/                          # Transaction log (JSON + Parquet)
│   ├── 00000000000000000000.json       # Version 0
│   ├── 00000000000000000001.json       # Version 1
│   └── 00000000000000000010.checkpoint.parquet
└── part-00000-*.parquet                 # Data files (standard Parquet)

The transaction log records every change, enabling ACID guarantees.
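
A minimal Python sketch of how replaying such a log produces the current table state. This is illustrative only: the real Delta protocol has more action types (metadata, protocol, transaction) and checkpoints, but the core idea is that the live file set is derived by folding add/remove actions in version order:

```python
# Simplified replay of a Delta-style transaction log: each version is a list
# of (action, file) pairs; the current state is the set of files still "live".
def replay(log):
    live = set()
    for version in log:                  # versions are applied in commit order
        for action, path in version:
            if action == "add":
                live.add(path)
            elif action == "remove":     # e.g. from an UPDATE/MERGE rewriting a file
                live.discard(path)
    return live

log = [
    [("add", "part-00000.parquet")],     # version 0: initial write
    [("add", "part-00001.parquet")],     # version 1: append
    [("remove", "part-00000.parquet"),   # version 2: update rewrites a file
     ("add", "part-00002.parquet")],
]
print(sorted(replay(log)))               # files visible at the latest version
```

Time travel falls out of the same structure: replaying only `log[:n]` reconstructs the table as of version n-1, which is what `VERSION AS OF` does conceptually.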

Delta Lake Features

Feature | What It Does | Why It Matters
ACID Transactions | Atomic, consistent writes | No corrupt/partial data
Schema Enforcement | Validates data on write | Data quality
Schema Evolution | Add columns safely | Agile development
Time Travel | Query historical versions | Auditing, rollback
MERGE (Upsert) | INSERT + UPDATE + DELETE | Efficient CDC
Auto-Optimize | Compacts small files | Query performance

DML Operations

  • INSERT: df.write.format("delta").mode("append")
  • UPDATE: UPDATE table SET col = val WHERE condition
  • MERGE: Match on key — update if exists, insert if not
  • Time Travel: SELECT * FROM table VERSION AS OF n

Databricks Workflows

  • Job: Scheduled execution of a notebook or script
  • Task: Single unit of work within a workflow
  • Workflow: Multi-task DAG with dependencies
  • Dashboard: SQL-powered visualizations connected to SQL Warehouses
  • Widgets: Parameterize notebooks for reusable pipelines

Certification Topics

Key accreditation concepts from this week:

  1. Delta Lake provides ACID transactions via the transaction log
  2. MERGE combines INSERT, UPDATE, and DELETE in one atomic operation
  3. Time travel enables querying any previous version of the data
  4. Schema enforcement prevents bad data; schema evolution adds columns safely
  5. Jobs use job clusters (auto-created, auto-terminated) for scheduled workloads
  6. Workflows orchestrate multi-step pipelines with DAG dependencies

Lab: Delta Tables

Create and manage Delta Lake tables: INSERT, UPDATE, MERGE operations, time travel queries, and schema enforcement.

Objectives

  • Create Delta tables from DataFrames
  • Perform INSERT, UPDATE, and MERGE (upsert) operations
  • Use time travel to query historical versions
  • Understand schema enforcement and evolution

Lab Exercise

See labs/course1/week3/lab_delta.py

Key Tasks

  1. Create table — Build an inventory Delta table with 6+ products
  2. INSERT — Append new products
  3. UPDATE — Modify prices for a category
  4. MERGE — Upsert with matched updates and unmatched inserts
  5. Time travel — View history, query version 0, compare price changes
  6. Schema enforcement — Verify that mismatched schemas are rejected

Key SQL Patterns

-- MERGE pattern
MERGE INTO target USING source ON target.key = source.key
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...

-- Time travel
SELECT * FROM table VERSION AS OF 0

-- History
DESCRIBE HISTORY table

Validation

The lab includes a validate_lab() function that checks:

  • Delta table exists
  • Table has at least 6 rows
  • Multiple versions exist (DML operations were performed)

Lab: Jobs & Workflows

Build a parameterized ETL pipeline, create dashboard-ready queries, and understand Databricks job scheduling.

Objectives

  • Create parameterized notebooks with widgets
  • Build an Extract-Transform-Load pipeline
  • Write dashboard-ready SQL queries
  • Understand job scheduling and workflow orchestration

Lab Exercise

See labs/course1/week3/lab_workflows.py

Key Tasks

  1. Widgets — Create text and dropdown widgets for runtime parameters
  2. ETL pipeline — Extract raw orders, transform (filter + enrich), load to Delta
  3. Dashboard queries — Revenue by category, daily trends, top products
  4. Job concepts — Answer questions about cluster types, retries, and parameter passing

Key Concepts

  • Widgets: dbutils.widgets.text(), dbutils.widgets.dropdown()
  • Job clusters: Auto-created and terminated — best for scheduled workloads
  • Workflows: Multi-task DAG with dependency ordering
  • Dashboards: SQL queries connected to SQL Warehouses for visualization

Validation

The lab includes a validate_lab() function that checks:

  • Parameters are configured
  • Gold Delta table was created with data
  • Revenue column exists in output
  • Only completed orders were loaded

Course 3: MLOps Engineering on Databricks

Subtitle

Build and Deploy ML Systems with MLflow, Feature Store, and Model Serving

Description

Master the complete MLOps lifecycle on Databricks: experiment tracking with MLflow, feature engineering with Feature Store, model management with Unity Catalog, and deployment with Model Serving. Understand each component deeply by building equivalent systems from scratch with the Sovereign AI Stack.

Learning Outcomes

  1. Track experiments and manage model lifecycle with MLflow on Databricks
  2. Build and serve features using Databricks Feature Store and SQL Warehouses
  3. Register, version, and govern models with Unity Catalog
  4. Deploy models for batch and real-time inference
  5. Implement quality gates and monitoring for production ML

Duration

~30 hours | 38 videos | 12 labs | 5 quizzes | 1 capstone

Weeks

Week | Topic | Sovereign AI Stack
1 | Experiment Tracking with MLflow | reqwest, serde, pacha
2 | Feature Engineering | alimentar, trueno, delta-rs
3 | Model Training and Registry | aprender, pacha
4 | Model Serving and Inference | realizar
5 | Production Quality and Orchestration | pmat, batuta
6 | Capstone: Fraud Detection Platform | Full stack

Databricks Free Edition Features Used

  • Experiments (MLflow Tracking)
  • Catalog (Unity Catalog for model registry)
  • Jobs & Pipelines (orchestration)
  • SQL Warehouses (feature computation)
  • Playground (model testing)

Week 1: Experiment Tracking with MLflow

Overview

Understand experiment tracking by implementing an MLflow REST client in Rust.

Topics

# | Type | Title | Platform | Duration
1.1 | Video | The Reproducibility Crisis | Concept | 8 min
1.2 | Video | MLflow Architecture: Tracking, Registry, Projects | Databricks | 10 min
1.3 | Lab | Create Experiments in Databricks | Databricks | 30 min
1.4 | Video | MLflow REST Protocol Deep Dive | Concept | 10 min
1.5 | Lab | Build MLflow Client in Rust | Sovereign | 40 min
1.6 | Video | Autologging and Framework Integration | Databricks | 8 min
1.7 | Video | Artifact Storage: DBFS, S3, Unity Catalog | Databricks | 8 min
1.8 | Lab | Compare: Databricks MLflow vs Rust Client | Both | 25 min
1.9 | Quiz | Experiment Tracking Fundamentals | - | 15 min

Sovereign AI Stack Components

  • reqwest for HTTP client
  • serde for JSON serialization
  • pacha concepts for artifact storage

Key Concepts

MLflow Tracking

  • Experiments organize related runs
  • Runs contain parameters, metrics, and artifacts
  • Metrics can be logged at each training step

REST API

  • POST /api/2.0/mlflow/experiments/create
  • POST /api/2.0/mlflow/runs/create
  • POST /api/2.0/mlflow/runs/log-metric
  • POST /api/2.0/mlflow/runs/log-batch
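
A sketch of the request body the `runs/log-metric` endpoint expects. `build_log_metric` is a hypothetical helper for illustration; the field names follow the MLflow REST API, with the timestamp in milliseconds since epoch:

```python
import time

# Build the JSON body for POST /api/2.0/mlflow/runs/log-metric.
# (build_log_metric is a made-up helper name; fields match the MLflow REST API.)
def build_log_metric(run_id, key, value, step=0):
    return {
        "run_id": run_id,
        "key": key,
        "value": value,
        "timestamp": int(time.time() * 1000),  # milliseconds since epoch
        "step": step,                          # training step for stepped metrics
    }

body = build_log_metric("abc123", "rmse", 0.42, step=7)
# Then e.g.: requests.post(f"{base_url}/api/2.0/mlflow/runs/log-metric", json=body)
```

The Rust client in the lab below assembles the same body with `serde_json`, which is why building it by hand first makes the protocol transparent.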

Lab: MLflow Client

Build an MLflow REST client in Rust to understand experiment tracking internals.

Objectives

  • Implement HTTP client for MLflow REST API
  • Create experiments and runs
  • Log parameters and metrics
  • Search and retrieve runs

Demo Code

See demos/course3/week1/mlflow-client/

Lab Exercise

See labs/course3/week1/lab_1_5_mlflow_client.py

Key Implementation

use chrono::Utc;
use serde_json::json;

pub struct MlflowClient {
    base_url: String,
    client: reqwest::Client,
}

impl MlflowClient {
    pub async fn log_metric(
        &self,
        run_id: &str,
        key: &str,
        value: f64,
    ) -> Result<(), MlflowError> {
        let body = json!({
            "run_id": run_id,
            "key": key,
            "value": value,
            "timestamp": Utc::now().timestamp_millis(),
        });
        self.post_void("runs/log-metric", &body).await
    }
}

Validation

Run tests:

cd demos/course3/week1/mlflow-client
cargo test

Lab: Feature Pipeline

Build a SIMD-accelerated feature computation pipeline.

Objectives

  • Compute feature statistics
  • Implement normalization transforms
  • Build a composable pipeline

Demo Code

See demos/course3/week2/feature-pipeline/

Lab Exercise

See labs/course3/week2/lab_2_5_feature_pipeline.py

Key Transforms

pub fn normalize_zscore(values: &[f32]) -> Result<Vec<f32>, FeatureError> {
    let stats = compute_statistics(values)?;
    Ok(values.iter()
        .map(|v| (v - stats.mean) / stats.std_dev)
        .collect())
}

pub fn normalize_minmax(values: &[f32]) -> Result<Vec<f32>, FeatureError> {
    let stats = compute_statistics(values)?;
    // Note: a constant column gives range == 0; production code should guard this.
    let range = stats.max - stats.min;
    Ok(values.iter()
        .map(|v| (v - stats.min) / range)
        .collect())
}

Validation

Run tests:

cd demos/course3/week2/feature-pipeline
cargo test

Week 3: Model Training and Registry

Overview

Train models with aprender and manage them with pacha's signed registry.

Topics

# | Type | Title | Platform | Duration
3.1 | Video | ML Algorithms: From Scratch to AutoML | Concept | 10 min
3.2 | Lab | Train Models with aprender | Sovereign | 40 min
3.3 | Video | Databricks AutoML | Databricks | 10 min
3.4 | Lab | AutoML Experiment in Databricks | Databricks | 30 min
3.5 | Video | Model Registry with Unity Catalog | Databricks | 10 min
3.6 | Video | Model Signing and Security | Sovereign | 8 min
3.7 | Lab | Register and Sign Models with pacha | Sovereign | 35 min
3.8 | Video | Model Lineage and Governance | Databricks | 8 min
3.9 | Quiz | Training and Registry | - | 15 min

Sovereign AI Stack Components

  • aprender for ML algorithms
  • pacha for Ed25519 signing and BLAKE3 hashing

Key Concepts

Model Training

  • Linear regression with gradient descent
  • Random forest ensemble methods
  • Cross-validation for model selection

Model Registry

  • Version control for models
  • Stage transitions (staging → production)
  • Cryptographic signing for integrity
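
The integrity half of registry signing can be sketched in a few lines of Python. pacha uses BLAKE3 hashing with Ed25519 signatures; the sha256 below is a stdlib stand-in that shows the shape of the check, not the real scheme:

```python
import hashlib

# Integrity-check sketch: record a digest of the model bytes at registration,
# verify it before loading. (sha256 is a stand-in for pacha's BLAKE3/Ed25519.)
def register(model_bytes):
    return hashlib.sha256(model_bytes).hexdigest()   # stored alongside the model

def verify(model_bytes, recorded_digest):
    return hashlib.sha256(model_bytes).hexdigest() == recorded_digest

weights = b"model-weights-v1"
digest = register(weights)
assert verify(weights, digest)              # untampered model loads
assert not verify(b"tampered", digest)      # modified bytes are rejected
```

A signature scheme adds the missing piece on top of this: it proves *who* produced the digest, so a tampered model cannot simply ship a recomputed hash.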

Lab: Model Training

Train ML models with gradient descent and evaluate performance.

Objectives

  • Implement linear regression
  • Train on synthetic datasets
  • Calculate evaluation metrics

Demo Code

See demos/course3/week3/model-training/

Lab Exercise

See labs/course3/week3/lab_3_4_automl.py

Key Implementation

impl LinearRegression {
    pub fn fit(&mut self, features: &[Vec<f64>], labels: &[f64]) {
        let n_samples = features.len() as f64;
        for _ in 0..self.n_iterations {
            let mut weight_gradients = vec![0.0; self.weights.len()];
            let mut bias_gradient = 0.0;

            for (x, &y) in features.iter().zip(labels.iter()) {
                let pred = self.predict_single(x);
                let error = pred - y;
                for (j, &xj) in x.iter().enumerate() {
                    weight_gradients[j] += error * xj;
                }
                bias_gradient += error;
            }

            // Gradient-descent update, averaged over the batch
            for (w, grad) in self.weights.iter_mut().zip(&weight_gradients) {
                *w -= self.learning_rate * grad / n_samples;
            }
            self.bias -= self.learning_rate * bias_gradient / n_samples;
        }
    }
}

Lab: Inference Server

Build a model serving infrastructure with batching and health checks.

Objectives

  • Implement prediction endpoint
  • Add request batching
  • Configure health monitoring

Demo Code

See demos/course3/week4/inference-server/

Lab Exercise

See labs/course3/week4/lab_4_5_serving.py

Key Components

use std::time::Instant;

pub struct InferenceServer {
    model: Box<dyn Model>,
    batcher: RequestBatcher,
    metrics: ServerMetrics,
}

impl InferenceServer {
    pub async fn predict(&self, request: PredictRequest) -> PredictResponse {
        let start = Instant::now();

        let result = self.batcher.add(request).await;

        self.metrics.record_request(start.elapsed());
        result
    }

    pub fn health(&self) -> HealthResponse {
        HealthResponse {
            status: "healthy",
            model_loaded: self.model.is_loaded(),
            requests_processed: self.metrics.total_requests(),
        }
    }
}

Week 5: Production Quality and Orchestration

Overview

Implement quality gates with pmat and orchestration with batuta.

Topics

# | Type | Title | Platform | Duration
5.1 | Video | MLOps Maturity Model | Concept | 8 min
5.2 | Video | Databricks Workflows for ML | Databricks | 10 min
5.3 | Lab | Build ML Pipeline with Jobs | Databricks | 35 min
5.4 | Video | Quality Gates with pmat | Sovereign | 8 min
5.5 | Lab | Enforce TDG Quality Score | Sovereign | 25 min
5.6 | Video | Monitoring and Drift Detection | Databricks | 10 min
5.7 | Video | batuta Orchestration | Sovereign | 8 min
5.8 | Quiz | Production MLOps | - | 15 min

Sovereign AI Stack Components

  • batuta for orchestration
  • pmat for quality gates
  • renacer for syscall tracing

Key Concepts

Quality Gates

  • TDG (Technical Debt Gauge) scoring
  • Complexity thresholds
  • Test coverage requirements

Orchestration

  • DAG-based workflow execution
  • Privacy tier enforcement
  • Retry and failure handling
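
A minimal DAG runner makes the first and third bullets concrete: tasks run only after their dependencies finish, and each task retries a bounded number of times. This is an illustrative sketch, not batuta's actual API:

```python
# Minimal DAG runner: execute tasks in dependency order, retrying failures.
def run_dag(tasks, deps, max_retries=2):
    done, order = set(), []
    while len(done) < len(tasks):
        # a task is ready once all of its dependencies have completed
        ready = [t for t in tasks if t not in done and deps.get(t, set()) <= done]
        if not ready:
            raise RuntimeError("cycle or unsatisfiable dependency")
        for name in ready:
            for attempt in range(max_retries + 1):
                try:
                    tasks[name]()
                    break
                except Exception:
                    if attempt == max_retries:
                        raise          # retries exhausted: fail the workflow
            done.add(name)
            order.append(name)
    return order

log = []
tasks = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load": lambda: log.append("load"),
}
deps = {"transform": {"extract"}, "load": {"transform"}}
print(run_dag(tasks, deps))  # tasks execute in dependency order
```

Databricks Workflows implement the same model at platform scale, with per-task clusters, retry policies, and alerting layered on top.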

Lab: Quality Gates

Implement production quality enforcement with pmat.

Objectives

  • Configure quality thresholds
  • Implement pre-commit hooks
  • Enforce TDG scoring

Demo Code

See demos/course3/week5/quality-gates/

Lab Exercise

See labs/course3/week5/lab_5_5_quality_gates.py

Configuration

# .pmat-gates.toml
[gates]
min_tdg_score = "B"
max_cyclomatic = 30
max_cognitive = 25
min_line_coverage = 80
min_branch_coverage = 70

[pre_commit_checks]
checks = ["complexity", "dead-code", "security", "duplicates"]

Commands

# Repository health score
pmat repo-score

# Quality gate check
pmat quality-gate

# Rust project score
pmat rust-project-score

# Analyze complexity
pmat analyze complexity --path .

Course 4: GenAI Engineering on Databricks

Subtitle

Build LLM Applications with Foundation Models, Vector Search, and RAG

Description

Construct production GenAI systems on Databricks: serve foundation models, implement vector search for semantic retrieval, build RAG pipelines, and fine-tune models for domain adaptation. Understand the internals by building equivalent systems with the Sovereign AI Stack.

Learning Outcomes

  1. Serve and query foundation models on Databricks
  2. Generate embeddings and build vector search indexes
  3. Implement production RAG pipelines with hybrid retrieval
  4. Fine-tune models with LoRA/QLoRA for domain adaptation
  5. Deploy privacy-aware GenAI systems with proper governance

Duration

~34 hours | 40 videos | 12 labs | 5 quizzes | 1 capstone

Weeks

Week | Topic | Sovereign AI Stack
1 | Foundation Models and LLM Serving | realizar, tokenizers
2 | Prompt Engineering and Structured Output | batuta, serde
3 | Embeddings and Vector Search | trueno, trueno-rag
4 | RAG Pipelines | trueno-rag, alimentar
5 | Fine-Tuning and Model Security | entrenar, pacha
6 | Production Deployment | batuta, renacer
7 | Capstone: Enterprise Knowledge Assistant | Full stack

Databricks Free Edition Features Used

  • Playground (Foundation Models)
  • Vector Search (via Catalog)
  • Genie (AI/BI demo)
  • Experiments (evaluation tracking)
  • Jobs & Pipelines (RAG orchestration)

Week 1: Foundation Models and LLM Serving

Overview

Understand LLM serving by building a tokenizer and inference server in Rust.

Topics

# | Type | Title | Platform | Duration
1.1 | Video | The GenAI Landscape | Concept | 10 min
1.2 | Video | Databricks Foundation Model APIs | Databricks | 10 min
1.3 | Lab | Query Models in Playground | Databricks | 25 min
1.4 | Video | GGUF Format and Quantization | Sovereign | 10 min
1.5 | Lab | Serve Local Model with realizar | Sovereign | 35 min
1.6 | Video | Tokenization Deep Dive | Concept | 10 min
1.7 | Lab | Build BPE Tokenizer | Sovereign | 30 min
1.8 | Video | External Models and AI Gateway | Databricks | 8 min
1.9 | Quiz | LLM Serving Fundamentals | - | 15 min

Sovereign AI Stack Components

  • realizar for GGUF inference
  • tokenizers crate for BPE

Key Concepts

Tokenization

  • BPE (Byte-Pair Encoding) algorithm
  • Vocabulary and merge rules
  • Special tokens: <|endoftext|>, <|pad|>

Model Quantization

  • FP16, INT8, INT4 representations
  • GGUF format: Q4_K_M, Q5_K_M, Q8_0
  • Memory vs accuracy trade-offs
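
The trade-off in the last bullet can be shown with a symmetric INT8 sketch in Python: one scale factor maps floats to 8-bit integers, cutting memory 4x at a bounded precision cost. This is illustrative of the idea behind Q8_0-style formats; GGUF's block-wise schemes are more elaborate:

```python
# Symmetric INT8 quantization sketch: one scale maps floats into [-127, 127].
def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0   # largest magnitude -> 127
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.5, 0.33, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each weight now fits in 1 byte instead of 4; the rounding error is
# bounded by one quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
assert max_err < scale
```

INT4 halves the storage again but doubles the step size, which is exactly the memory-vs-accuracy dial the GGUF variants (Q4_K_M, Q5_K_M, Q8_0) expose.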

Lab: Tokenizer

Build a BPE tokenizer to understand LLM text processing.

Objectives

  • Implement byte-pair encoding
  • Handle special tokens
  • Encode and decode text

Demo Code

See demos/course4/week1/llm-serving/

Lab Exercise

See labs/course4/week1/lab_1_7_tokenizer.py

Key Implementation

use std::collections::HashMap;

pub struct BpeTokenizer {
    vocab: HashMap<String, u32>,
    merges: Vec<(String, String)>,
    special_tokens: HashMap<String, u32>,
}

impl BpeTokenizer {
    pub fn encode(&self, text: &str) -> Vec<u32> {
        let mut tokens: Vec<String> = text.chars()
            .map(|c| c.to_string())
            .collect();

        // Apply merge rules in priority order
        for (a, b) in &self.merges {
            tokens = self.apply_merge(&tokens, a, b);
        }

        tokens.iter()
            .filter_map(|t| self.vocab.get(t).copied())
            .collect()
    }
}

Lab: Prompt Templates

Build type-safe prompt templates with variable substitution.

Objectives

  • Create reusable templates
  • Implement variable validation
  • Build a prompt library

Demo Code

See demos/course4/week2/prompt-engineering/

Lab Exercise

See labs/course4/week2/lab_2_6_prompt_templates.py

Key Implementation

use std::collections::HashMap;

pub struct PromptTemplate {
    template: String,
    variables: Vec<String>,
}

impl PromptTemplate {
    pub fn render(&self, vars: &HashMap<String, String>) -> Result<String, PromptError> {
        let mut result = self.template.clone();
        for var in &self.variables {
            let value = vars.get(var)
                .ok_or(PromptError::MissingVariable(var.clone()))?;
            result = result.replace(&format!("{{{}}}", var), value);
        }
        Ok(result)
    }
}

Week 3: Embeddings and Vector Search

Overview

Build SIMD-accelerated vector search with trueno and implement HNSW indexing.

Topics

# | Type | Title | Platform | Duration
3.1 | Video | What Are Embeddings? | Concept | 10 min
3.2 | Video | Databricks Vector Search | Databricks | 10 min
3.3 | Lab | Create Vector Search Index | Databricks | 35 min
3.4 | Video | SIMD Similarity: Cosine, Dot Product | Sovereign | 10 min
3.5 | Lab | Build SIMD Vector Search with trueno | Sovereign | 35 min
3.6 | Video | HNSW: Approximate Nearest Neighbors | Concept | 10 min
3.7 | Lab | Implement HNSW Index | Sovereign | 40 min
3.8 | Video | Hybrid Search: BM25 + Vector | Sovereign | 8 min
3.9 | Lab | Hybrid Retrieval with trueno-rag | Sovereign | 35 min
3.10 | Quiz | Vector Search | - | 15 min

Sovereign AI Stack Components

  • trueno for SIMD computation
  • trueno-rag for BM25 + HNSW
  • trueno-db for GPU analytics

Key Concepts

Similarity Metrics

  • Cosine similarity: dot(a, b) / (||a|| * ||b||)
  • Euclidean distance: sqrt(sum((a - b)^2))
  • Dot product: sum(a * b)

HNSW Algorithm

  • Hierarchical navigable small world graphs
  • O(log n) search complexity
  • Configurable M and ef parameters
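
The core move HNSW repeats across its layer hierarchy is a greedy walk on a proximity graph: hop to whichever neighbor is closer to the query, stop when no neighbor improves. The Python sketch below shows a single layer only; real HNSW adds multiple layers, an ef-sized candidate beam, and M-bounded neighbor lists:

```python
# Greedy nearest-neighbor search on a proximity graph (single-layer HNSW core).
def greedy_search(graph, points, query, start):
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))  # squared Euclidean
    current = start
    while True:
        # hop to the neighbor closest to the query, if any improves on current
        best = min(graph[current], key=lambda n: dist(points[n], query))
        if dist(points[best], query) >= dist(points[current], query):
            return current              # local minimum: no neighbor is closer
        current = best

points = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (3.0, 0.0)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(greedy_search(graph, points, query=(2.9, 0.0), start=0))  # walks 0 -> 1 -> 2 -> 3
```

The upper layers of HNSW exist to pick a good `start` cheaply, which is where the O(log n) behavior comes from.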

Lab: Embeddings

Build a vector search index with SIMD-accelerated similarity.

Objectives

  • Generate text embeddings
  • Implement similarity metrics
  • Build a searchable index

Demo Code

See demos/course4/week3/vector-search/

Lab Exercise

See labs/course4/week3/lab_3_5_embeddings.py

Key Implementation

pub struct SearchResult {
    pub id: String,
    pub score: f32,
}

pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

pub struct VectorIndex {
    embeddings: Vec<Embedding>,
}

impl VectorIndex {
    pub fn search(&self, query: &[f32], k: usize) -> Vec<SearchResult> {
        let mut results: Vec<SearchResult> = self.embeddings.iter()
            .map(|e| SearchResult {
                id: e.id.clone(),
                score: cosine_similarity(query, &e.vector),
            })
            .collect();
        // Sort descending by similarity score
        results.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap());
        results.into_iter().take(k).collect()
    }
}

Lab: RAG Pipeline

Build an end-to-end RAG system with chunking, retrieval, and generation.

Objectives

  • Implement document chunking
  • Build retrieval pipeline
  • Generate contextual answers

Demo Code

See demos/course4/week4/rag-pipeline/

Lab Exercise

See labs/course4/week4/lab_4_7_rag.py

Key Implementation

pub struct RagPipeline {
    chunker: TextChunker,
    vector_store: VectorStore,
    generator: Generator,
}

impl RagPipeline {
    pub fn query(&self, question: &str) -> RagResponse {
        // 1. Embed query
        let query_embedding = self.embed(question);

        // 2. Retrieve relevant chunks
        let results = self.vector_store.search(&query_embedding, 3);

        // 3. Build context
        let context = results.iter()
            .map(|r| r.chunk.text.as_str())
            .collect::<Vec<_>>()
            .join("\n\n");

        // 4. Generate answer
        let prompt = format!(
            "Context:\n{}\n\nQuestion: {}\n\nAnswer:",
            context, question
        );
        let answer = self.generator.generate(&prompt);

        RagResponse { answer, sources: results }
    }
}
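
The chunking step the pipeline depends on can be as simple as a sliding window over whitespace tokens. This is a minimal sketch (the lab's `TextChunker` may well differ; production chunkers usually split on sentence or token boundaries rather than word counts):

```rust
/// Split text into fixed-size word windows that overlap by `overlap` words,
/// so adjacent chunks share context across the boundary.
pub fn chunk_text(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < words.len() {
        let end = (start + chunk_size).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break;
        }
        start = end - overlap; // step back so chunks overlap
    }
    chunks
}
```

Overlap trades index size for recall: a query matching text near a chunk boundary is still retrievable from either side.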

Week 5: Fine-Tuning and Model Security

Overview

Fine-tune models with LoRA/QLoRA and implement secure model distribution.

Topics

| #   | Type  | Title                            | Platform   | Duration |
|-----|-------|----------------------------------|------------|----------|
| 5.1 | Video | When to Fine-Tune vs RAG         | Concept    | 10 min   |
| 5.2 | Video | Databricks Fine-Tuning           | Databricks | 10 min   |
| 5.3 | Lab   | Fine-Tune in Databricks          | Databricks | 40 min   |
| 5.4 | Video | LoRA/QLoRA from Scratch          | Sovereign  | 10 min   |
| 5.5 | Lab   | Fine-Tune with entrenar          | Sovereign  | 45 min   |
| 5.6 | Video | Model Encryption and Signing     | Sovereign  | 10 min   |
| 5.7 | Lab   | Secure Model Pipeline with pacha | Sovereign  | 35 min   |
| 5.8 | Video | EU AI Act and Governance         | Concept    | 8 min    |
| 5.9 | Quiz  | Fine-Tuning and Security         |            | 15 min   |

Sovereign AI Stack Components

  • entrenar for LoRA/QLoRA training
  • pacha for ChaCha20-Poly1305 encryption

Key Concepts

LoRA (Low-Rank Adaptation)

  • Freeze base model weights
  • Add trainable low-rank matrices
  • Scaling factor: alpha / r
  • Target modules: q_proj, v_proj, k_proj
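
Written out as a forward pass, the frozen weight W is untouched and the low-rank update B·A is scaled by alpha / r. This is a scalar sketch assuming plain `Vec` matrices (entrenar's actual API may differ):

```rust
/// Naive matrix-vector product: m is row-major, rows x len(x).
fn matvec(m: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum())
        .collect()
}

/// LoRA forward pass for one linear layer: y = W·x + (alpha/r)·B·(A·x).
/// W (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r) train.
pub fn lora_forward(
    w: &[Vec<f32>],
    a: &[Vec<f32>],
    b: &[Vec<f32>],
    x: &[f32],
    alpha: f32,
) -> Vec<f32> {
    let r = a.len() as f32;
    let scale = alpha / r;                // the alpha / r scaling factor
    let base = matvec(w, x);              // W·x, frozen path
    let delta = matvec(b, &matvec(a, x)); // B·(A·x), trainable path
    base.iter().zip(&delta).map(|(y, d)| y + scale * d).collect()
}
```

Because A and B are tiny relative to W, only the delta path produces gradients, which is where LoRA's memory savings come from.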

QLoRA

  • Quantized base model (4-bit)
  • Double quantization for memory efficiency
  • Paged optimizers for large batches
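
To see what 4-bit quantization of the base model means concretely, here is absmax quantization of one weight block (an illustrative sketch only: QLoRA's NF4 maps to a normal-distribution codebook rather than this uniform grid, and double quantization additionally quantizes the per-block scales):

```rust
/// Absmax 4-bit quantization: map each f32 in a block to a signed
/// integer in [-7, 7] plus one f32 scale shared by the whole block.
pub fn quantize_4bit(block: &[f32]) -> (Vec<i8>, f32) {
    let absmax = block.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = if absmax == 0.0 { 1.0 } else { absmax / 7.0 };
    let q = block.iter().map(|x| (x / scale).round() as i8).collect();
    (q, scale)
}

/// Recover approximate f32 weights from the 4-bit codes and scale.
pub fn dequantize_4bit(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}
```

Each weight shrinks from 32 bits to 4 (plus the amortized scale), an ~8x memory reduction, at the cost of the rounding error visible on dequantization.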

Fine-Tuning vs RAG

| Aspect    | Fine-Tuning           | RAG                  |
|-----------|-----------------------|----------------------|
| Knowledge | Baked into weights    | Retrieved at runtime |
| Updates   | Requires retraining   | Update index only    |
| Cost      | Higher compute        | Lower compute        |
| Use case  | Style/behavior change | Knowledge access     |

Lab: Fine-Tuning

Configure LoRA fine-tuning for domain adaptation.

Objectives

  • Configure LoRA parameters
  • Prepare training data
  • Calculate training metrics

Demo Code

See demos/course4/week5/fine-tuning/

Lab Exercise

See labs/course4/week5/lab_5_3_fine_tuning.py

Key Implementation

pub struct LoraConfig {
    pub r: usize,           // Rank (4, 8, 16)
    pub alpha: usize,       // Scaling (16, 32)
    pub dropout: f32,       // Dropout rate
    pub target_modules: Vec<String>,
}

impl LoraConfig {
    pub fn scaling_factor(&self) -> f32 {
        self.alpha as f32 / self.r as f32
    }

    pub fn estimated_params(&self, hidden: usize, layers: usize) -> usize {
        self.r * hidden * 2 * self.target_modules.len() * layers
    }
}

// Example: 7B model with r=8, hidden=4096, 2 target modules, 32 layers
// Params: 8 * 4096 * 2 * 2 * 32 = 4.2M (0.06% of 7B)

Lab: Production Deployment

Deploy GenAI systems with guardrails and monitoring.

Objectives

  • Implement input/output guardrails
  • Configure rate limiting
  • Track production metrics

Demo Code

See demos/course4/week6/production/

Lab Exercise

See labs/course4/week6/lab_6_3_production.py

Key Implementation

use std::time::Instant;

pub struct ProductionServer {
    guardrails: Guardrails,
    rate_limiter: RateLimiter,
    metrics: Metrics,
    router: ABRouter,
}

impl ProductionServer {
    pub fn process(&mut self, request: Request) -> Response {
        // 1. Check rate limit
        if !self.rate_limiter.check() {
            return Response::error("Rate limited");
        }

        // 2. Check guardrails
        let check = self.guardrails.check_input(&request.prompt);
        if !check.passed {
            self.metrics.record_guardrail_block();
            return Response::error("Blocked by guardrails");
        }

        // 3. Route to model variant
        let model = self.router.select();

        // 4. Generate and record metrics
        let start = Instant::now();
        let response = model.generate(&request.prompt);
        self.metrics.record(start.elapsed(), response.tokens);

        response
    }
}
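
One plausible shape for the rate limiter used above is a token bucket (a sketch under assumed names; the lab's actual `RateLimiter` is not specified here):

```rust
use std::time::Instant;

/// Token-bucket rate limiter: tokens refill continuously at a fixed
/// rate up to `capacity`; each admitted request spends one token.
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last: Instant,
}

impl TokenBucket {
    pub fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, tokens: capacity, refill_per_sec, last: Instant::now() }
    }

    /// Refill based on elapsed time, then try to spend one token.
    pub fn check(&mut self) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last).as_secs_f64();
        self.last = now;
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

The bucket allows short bursts up to `capacity` while enforcing the long-run rate, which suits LLM serving where requests arrive unevenly.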

Sovereign AI Stack

The Sovereign AI Stack is a collection of Rust crates for building ML and GenAI systems from first principles.

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                   batuta (Orchestration)                         │
│              Privacy Tiers · CLI · Stack Coordination            │
├───────────────────┬──────────────────┬───────────────────────────┤
│  realizar         │  entrenar        │      pacha                │
│  (Inference)      │  (Training)      │   (Model Registry)        │
│  GGUF/SafeTensors │  autograd/LoRA   │  Sign/Encrypt/Lineage     │
├───────────────────┴──────────────────┴───────────────────────────┤
│                    aprender                                       │
│         ML Algorithms: regression, trees, clustering              │
├──────────────────────────────────────────────────────────────────┤
│                     trueno                                        │
│         SIMD/GPU Compute (AVX2/AVX-512/NEON, wgpu)               │
├──────────────────────────────────────────────────────────────────┤
│  trueno-rag      │ trueno-db       │ alimentar     │ pmat        │
│  BM25 + Vector   │ GPU Analytics   │ Arrow/Parquet │ Quality     │
└──────────────────┴─────────────────┴───────────────┴─────────────┘

Component Reference

| Component  | Purpose                | Course Usage                    |
|------------|------------------------|---------------------------------|
| trueno     | SIMD tensor operations | Feature computation, embeddings |
| aprender   | ML algorithms          | Model training                  |
| realizar   | Inference serving      | Model deployment                |
| entrenar   | LoRA/QLoRA training    | Fine-tuning                     |
| pacha      | Model registry         | Signing, encryption             |
| batuta     | Orchestration          | Pipeline coordination           |
| trueno-rag | RAG pipeline           | Retrieval + generation          |
| alimentar  | Data loading           | Parquet, chunking               |
| pmat       | Quality gates          | TDG scoring                     |

Installation

# Install from crates.io
cargo install batuta realizar pmat

# Or add to Cargo.toml
[dependencies]
trueno = "0.11"
aprender = "0.24"
realizar = "0.5"
pacha = "0.2"
batuta = "0.4"
alimentar = "0.2"
pmat = "2.213"

Privacy Tiers

The Sovereign AI Stack supports three privacy tiers:

| Tier      | Description             | Data Location                     |
|-----------|-------------------------|-----------------------------------|
| Sovereign | Air-gapped, on-premises | Never leaves local infrastructure |
| Private   | Cloud but encrypted     | Your cloud account, E2E encrypted |
| Standard  | Managed services        | Third-party APIs allowed          |

Configure in batuta.toml:

[privacy]
tier = "sovereign"  # or "private", "standard"
allowed_endpoints = ["localhost", "*.internal.corp"]

Databricks Setup

This guide covers setting up Databricks Free Edition for the courses.

Create Account

  1. Go to databricks.com
  2. Click "Try Databricks Free"
  3. Sign up with your email
  4. Verify your account

Workspace Setup

Create Cluster

  1. Navigate to Compute in the sidebar
  2. Click Create Cluster
  3. Select the smallest instance type
  4. Enable auto-termination (15 minutes)

Install Libraries

For Python notebooks:

%pip install mlflow databricks-feature-store

Features Used

Course 3: MLOps

| Feature        | Purpose                |
|----------------|------------------------|
| Experiments    | MLflow tracking        |
| Catalog        | Model registry         |
| Jobs           | Pipeline orchestration |
| SQL Warehouses | Feature computation    |
| Playground     | Model testing          |

Course 4: GenAI

| Feature       | Purpose             |
|---------------|---------------------|
| Playground    | Foundation Models   |
| Vector Search | Semantic retrieval  |
| Genie         | AI/BI demo          |
| Experiments   | Evaluation tracking |
| Jobs          | RAG orchestration   |

Notebook Conventions

All Databricks notebooks in this repository use:

# Databricks notebook source
# MAGIC %md
# MAGIC # Notebook Title

# COMMAND ----------
# Code cell

Running Labs

  1. Import notebook into Databricks workspace
  2. Attach to running cluster
  3. Run cells sequentially
  4. Complete TODO sections

Troubleshooting

Cluster won't start

  • Check your free tier limits
  • Ensure auto-termination is enabled
  • Try a smaller instance type

MLflow not found

%pip install mlflow --quiet
dbutils.library.restartPython()

Feature Store issues

%pip install databricks-feature-store --quiet
dbutils.library.restartPython()