Databricks Series, Part 1: Getting Started
Navigating the Databricks workspace, launching clusters, writing notebooks, and submitting your first PySpark job.
The Databricks Workspace
When you log into Databricks, you enter a workspace — the top-level organizational unit. A workspace is your team’s Databricks environment: it contains notebooks, clusters, jobs, and shared data. If you have multiple teams or environments (dev, prod), you have multiple workspaces.
The left navigation bar provides access to:
- Compute — clusters and SQL warehouses
- Workflows — scheduled jobs and pipelines
- Data — catalogs, schemas, and tables (Unity Catalog)
- ML — experiments and models (MLflow)
- Repos — version control for notebooks and code
Beyond Spark’s distributed RDD/DataFrame mental model, Databricks introduces a namespace hierarchy: workspace → catalog → schema → table. This mirrors the schema/table convention of SQL databases, with an extra catalog level for separating environments or teams. Part 2 dives deeper; for now, understand that data is organized in hierarchies, not just paths on disk.
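As a quick illustration (the catalog, schema, and table names here are hypothetical), a table is addressed by its three-level name rather than by a storage path:

```python
# Hypothetical names, to illustrate the three-level hierarchy.
catalog, schema, table = "dev", "sales", "orders"

# Unity Catalog addresses a table as catalog.schema.table,
# not as a path on disk.
fq_name = f"{catalog}.{schema}.{table}"
print(fq_name)  # dev.sales.orders

# In a notebook you would then query it with:
#   spark.sql(f"SELECT * FROM {fq_name}").show()
```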
Cluster Types
A cluster is a set of compute resources (VMs) running Spark. Databricks offers two main cluster types for data engineering:
All-Purpose Clusters are interactive, long-running, and shared across team members. You attach notebooks to them, run code, iterate. Ideal for exploration and development. Pay per DBU-hour, with auto-scaling and auto-termination to control costs.
Job Clusters are ephemeral — created per-job run, run the job, then terminated. No shared state between runs. Ideal for production pipelines where you want a fresh environment each time. Slightly cheaper because you don’t pay for idle time.
SQL Warehouses are specialized for SQL-only workloads, optimized for analytical queries via the Photon engine. They are not used in this PySpark series.
Every cluster runs a Databricks Runtime (DBR) — a pre-configured Spark distribution with Delta Lake, MLflow, and other libraries pre-installed. DBR version maps directly to Spark: DBR 14.x = Spark 3.5, DBR 13.x = Spark 3.4. Choosing the runtime version is as important as choosing a Spark version — it locks in library versions for reproducibility.
Code block — creating an All-Purpose cluster via the Databricks SDK:
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import AutoScale

w = WorkspaceClient()  # uses DATABRICKS_HOST and DATABRICKS_TOKEN env vars

cluster = w.clusters.create(
    cluster_name="dev-cluster",
    spark_version="14.3.x-scala2.12",  # DBR 14.3 = Spark 3.5
    node_type_id="i3.xlarge",          # instance type
    autoscale=AutoScale(min_workers=2, max_workers=8),
    autotermination_minutes=60,        # shut down after 1 hr idle
).result()  # block until the cluster reaches the RUNNING state

print(f"Cluster ID: {cluster.cluster_id}")
Your First Notebook
A notebook in Databricks is a web-based document containing code cells (Python, SQL, Scala, R). Cells can be executed independently, and results display inline. The spark session is pre-initialized — unlike a local PySpark script, you don’t call SparkSession.builder.
When you create a notebook, attach it to your All-Purpose cluster. Then write Python in the first cell:
# spark is already available — no SparkSession needed
df = spark.read.csv(
    "/databricks-datasets/airlines/",  # built-in sample data
    header=True,
    inferSchema=True,
)
df.printSchema()
df.show(5)
The output displays in the browser. This is the Databricks notebook experience — rapid iteration without infrastructure.
Databricks File System (DBFS)
DBFS is a distributed file system abstraction over cloud object storage. Paths like /mnt/mydata or /FileStore/uploads are DBFS. Internally, Databricks maps these to S3, ADLS, or GCS.
DBFS has a few built-in locations:
- /databricks-datasets/ — sample datasets provided by Databricks
- /FileStore/ — writable user file storage
- /mnt/ — mount points for cloud storage
In production, you avoid DBFS and use cloud paths directly (s3://my-bucket/data/): mounts add a layer of indirection and sidestep fine-grained access control. But for quick experiments, DBFS is convenient.
Code block — file system operations:
# List the airlines dataset
dbutils.fs.ls("/databricks-datasets/airlines/")

# Mount an S3 bucket (older pattern — Unity Catalog volumes replace this in Part 2)
dbutils.fs.mount(
    source="s3a://my-bucket/raw/events/",
    mount_point="/mnt/raw",
    extra_configs={"fs.s3a.access.key": "YOUR_KEY", "fs.s3a.secret.key": "YOUR_SECRET"},
)

# Read from the mount
events = spark.read.json("/mnt/raw/2024-10-20/")
Writing a PySpark Job
Notebooks are for interactive development. For production, you write a Python script and submit it as a job. The script looks like any PySpark code — import, create a session, compute, write results.
Code block — standalone PySpark ETL script (pipeline.py):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("airline-stats").getOrCreate()

raw = spark.read.csv(
    "s3://my-bucket/raw/airlines/",
    header=True,
    inferSchema=True,
)

# Compute on-time departure percentage per carrier
stats = (
    raw
    .filter(F.col("DepDelay").isNotNull())
    .groupBy("UniqueCarrier")
    .agg(
        F.count("*").alias("total_flights"),
        F.sum((F.col("DepDelay") <= 15).cast("int")).alias("on_time"),
    )
    .withColumn("on_time_pct", F.round(F.col("on_time") / F.col("total_flights") * 100, 2))
)

stats.write.format("delta").mode("overwrite").save("s3://my-bucket/silver/carrier_stats/")
print(f"Computed stats for {stats.count()} carriers")
Save this as pipeline.py and upload it to your Databricks repo or workspace.
Submitting Jobs via the UI and CLI
From the UI: Go to Workflows, click “Create job”, and provide:
- A name (e.g., “airline-stats”)
- Task type: Spark submit or notebook
- Path to your Python script
- Cluster type (new Job Cluster or All-Purpose)
- Optional: parameters, schedule (cron), alerts
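The same fields appear in the Jobs API payload that the UI builds for you. A sketch of such a spec (the name, path, schedule, and cluster sizing are placeholders):

```json
{
  "name": "airline-stats",
  "schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC"},
  "tasks": [
    {
      "task_key": "main",
      "spark_python_task": {
        "python_file": "/Workspace/Repos/data-eng/pipeline.py",
        "parameters": ["2024-10-13", "prod"]
      },
      "new_cluster": {
        "spark_version": "14.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      }
    }
  ]
}
```

Because the task uses new_cluster rather than an existing cluster ID, each run gets a fresh Job Cluster, matching the ephemeral pattern described earlier.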
From the CLI: Use the Databricks CLI to submit and monitor jobs programmatically:
# Configure the CLI once
databricks configure --host https://adb-<workspace>.azuredatabricks.net/ --token <pat>
# Submit a job run
databricks jobs run-now --job-id 12345 \
  --python-params '["2024-10-13", "prod"]'
# Monitor the run
databricks runs get --run-id 987654
Job parameters are passed via command-line arguments. In your Python script, read them:
import sys
run_date = sys.argv[1] if len(sys.argv) > 1 else "2024-10-13"
env = sys.argv[2] if len(sys.argv) > 2 else "dev"
print(f"Running for {run_date} in {env}")
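For anything beyond a couple of positional values, argparse is more robust than raw sys.argv indexing. A sketch, with illustrative parameter names and defaults:

```python
import argparse

# Named, defaulted parameters instead of positional sys.argv access.
parser = argparse.ArgumentParser(description="airline-stats job parameters")
parser.add_argument("--date", default="2024-10-13", help="run date, YYYY-MM-DD")
parser.add_argument("--env", default="dev", choices=["dev", "prod"])

# In a real job this would be parser.parse_args() on the actual argv;
# here we parse an example list to show the behavior.
args = parser.parse_args(["--date", "2024-10-20", "--env", "prod"])
print(f"Running for {args.date} in {args.env}")  # Running for 2024-10-20 in prod
```

Unknown or misspelled flags then fail fast with a usage message instead of being silently ignored.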
Secrets and Configuration
Production jobs need credentials (cloud storage keys, API tokens) without hardcoding them in code. Databricks Secrets API solves this.
Create a secret scope once (a logical container for secrets):
databricks secrets create-scope --scope my-scope
Then add a secret:
databricks secrets put --scope my-scope --key storage-key
# Databricks prompts you for the value — enter your S3 secret key
In your notebook or job, read it:
# Retrieve the secret
storage_key = dbutils.secrets.get(scope="my-scope", key="storage-key")
# Use it to configure cloud access
spark.conf.set(
"fs.s3a.access.key",
dbutils.secrets.get(scope="my-scope", key="s3-access-key")
)
spark.conf.set(
"fs.s3a.secret.key",
dbutils.secrets.get(scope="my-scope", key="s3-secret-key")
)
# Now read directly from S3
events = spark.read.json("s3a://my-bucket/raw/events/")
Secrets are stored in Databricks’ secure vault — never exposed in job logs or code.
Key Takeaways
- Databricks workspace is your team’s environment; it contains notebooks, clusters, jobs, and data organized in catalogs and schemas
- All-Purpose Clusters are for interactive development; Job Clusters are ephemeral, per-job compute (cheaper for batch)
- The spark session is pre-initialized in notebooks — no SparkSession.builder needed
- DBFS is a convenience layer over cloud storage; in production use cloud paths directly (s3://bucket/...)
- A Databricks job is a Python script or notebook scheduled on a cluster — the unit of production data engineering
- Use Databricks Secrets to store credentials; read them with dbutils.secrets.get() — never hardcode credentials
Next: Lakehouse Architecture — understanding Unity Catalog, the medallion Bronze/Silver/Gold pattern, and Delta tables in practice.