Week 2: Spark Fundamentals

Overview

Master Apache Spark on Databricks: use notebooks with magic commands and utilities, load and preview data, then apply core DataFrame operations — select, filter, groupBy, aggregations, and joins.

Topics

#     | Type  | Title                         | Duration
2.1.1 | Video | Using Notebooks               | 10 min
2.1.2 | Video | Magic Commands & Utilities    | 8 min
2.1.3 | Video | Loading & Previewing Data     | 10 min
2.2.1 | Video | Spark Core Concepts           | 12 min
2.2.2 | Video | Select & Filter Operations    | 10 min
2.2.3 | Video | GroupBy, Aggregations & Joins | 12 min
      | Lab   | Using Notebooks               | 30 min
      | Lab   | Spark Operations              | 45 min
      | Quiz  | Spark Fundamentals            | 15 min

Key Concepts

Databricks Notebooks

  • Support Python, SQL, Scala, R in the same notebook
  • Magic commands: %python, %sql, %scala, %r, %md, %sh, %fs, %run
  • dbutils: File system ops (fs), notebook chaining (notebook), widgets, secrets
  • display(): Rich visualizations built into Databricks
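
A minimal notebook sketch tying these together. Note that dbutils and display() are only defined inside Databricks notebooks, and the dataset path assumes the standard /databricks-datasets mount:

```python
# NOTE: dbutils and display() exist only inside Databricks notebooks;
# the path below assumes the standard /databricks-datasets sample mount.

# List files with dbutils (the %fs ls magic is shorthand for this)
for f in dbutils.fs.ls("/databricks-datasets/")[:5]:
    print(f.path, f.size)

# Load a bundled sample dataset and preview it
df = spark.read.csv(
    "/databricks-datasets/samples/population-vs-price/data_geo.csv",
    header=True,
    inferSchema=True,
)
display(df)  # interactive table with built-in charting, richer than df.show()
```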

Spark Core Architecture

  • SparkSession: Entry point (spark variable, auto-created on Databricks)
  • DataFrame: Distributed collection of rows with named columns
  • Lazy evaluation: Transformations build a plan; actions trigger execution
  • Catalyst Optimizer: Optimizes the query plan regardless of API used
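
A short PySpark sketch of these concepts. The explicit builder is only needed off-Databricks, where spark is not pre-created, and the sample rows are invented for illustration:

```python
from pyspark.sql import SparkSession, functions as F

# On Databricks, `spark` already exists; the explicit builder just makes
# this sketch runnable on a plain local PySpark install too.
spark = SparkSession.builder.appName("core-concepts").getOrCreate()

# A DataFrame: a distributed collection of rows with named columns
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)],
    ["name", "age"],
)

# Transformations are lazy: this line only builds a query plan
adults = df.filter(F.col("age") >= 30).select("name")

# Catalyst's optimized plan is visible before anything runs
adults.explain()

# An action finally triggers execution of the whole plan
adults.show()
```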

Transformations vs Actions

Transformations (Lazy) | Actions (Eager)
select()               | show()
filter() / where()     | count()
groupBy()              | collect()
join()                 | first()
orderBy()              | take(n)
withColumn()           | write.*
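
A quick demonstration of the split; spark.range() is used here just to get cheap data, and nothing executes until the actions at the end:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)  # one column, `id`, with a million rows

# Transformations return new DataFrames instantly; no job runs yet
evens  = df.filter(F.col("id") % 2 == 0)
scored = evens.withColumn("sq", F.col("id") ** 2)
ranked = scored.orderBy(F.col("sq").desc())

# Each action below launches a Spark job over the full plan
print(ranked.count())  # 500000
print(ranked.first())  # highest-sq row
rows = ranked.take(3)  # first three rows, pulled to the driver
```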

Core Operations

  • select() — Choose and transform columns
  • filter() / where() — Select rows by condition
  • groupBy().agg() — Group rows and compute aggregates (sum, avg, count, max, min)
  • join() — Combine DataFrames (inner, left, right, full)
  • orderBy() — Sort results
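
These operations compose into a single pipeline. A sketch with invented orders/customers data, where all table and column names are assumptions for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative data; the schemas are invented for this sketch
orders = spark.createDataFrame(
    [(1, "widget", 2, 9.99), (2, "gadget", 1, 24.50), (1, "widget", 5, 9.99)],
    ["customer_id", "product", "quantity", "price"],
)
customers = spark.createDataFrame(
    [(1, "alice"), (2, "bob"), (3, "carol")],
    ["customer_id", "name"],
)

result = (
    orders
    .filter(F.col("quantity") > 0)                     # select rows by condition
    .select("customer_id",
            (F.col("quantity") * F.col("price")).alias("total"))  # derive a column
    .groupBy("customer_id")                            # group rows...
    .agg(F.sum("total").alias("revenue"),
         F.count("*").alias("n_orders"))               # ...and compute aggregates
    .join(customers, on="customer_id", how="left")     # combine DataFrames
    .orderBy(F.col("revenue").desc())                  # sort results
)
result.show()
```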

Data Formats

Format  | Command                    | Use Case
CSV     | spark.read.csv()           | Simple tabular data
JSON    | spark.read.json()          | Semi-structured data
Parquet | spark.read.parquet()       | Columnar analytics
Delta   | spark.read.format("delta") | Lakehouse tables
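
Reader one-liners for each format. The paths are placeholders, not real datasets; Delta support is preconfigured on Databricks but requires the delta-spark package elsewhere:

```python
# `spark` is the auto-created session on Databricks. All paths below are
# placeholders; Delta needs the delta-spark package outside Databricks.
csv_df   = spark.read.csv("/data/events.csv", header=True, inferSchema=True)
json_df  = spark.read.json("/data/events.json")        # handles nested fields
pq_df    = spark.read.parquet("/data/events.parquet")  # schema travels with the file
delta_df = spark.read.format("delta").load("/data/events_delta")

# Writing mirrors reading; write.* is an action, so it executes immediately
pq_df.write.mode("overwrite").format("delta").save("/data/events_delta")
```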

Demo Code