Medallion Architecture Guide

$29

A comprehensive decision framework and implementation guide for building production-grade medallion (bronze/silver/gold) architectures in Databricks.

📁 26 files

Markdown, Python, AWS, Azure, GCP, Databricks, PySpark, Spark, Delta Lake

📁 File Structure 26 files

medallion-architecture-guide/
├── LICENSE
├── README.md
├── cheatsheets/
│   ├── layer_comparison.md
│   ├── migration_checklist.md
│   └── naming_conventions_cheatsheet.md
├── code_examples/
│   ├── bronze_ingestion.py
│   ├── cross_layer_pipeline.py
│   ├── gold_aggregation.py
│   ├── naming_convention_generator.py
│   ├── schema_migration.py
│   └── silver_transformation.py
├── diagrams/
│   ├── data_flow.md
│   ├── decision_tree.md
│   └── medallion_overview.md
├── free-sample.zip
├── guide/
│   ├── 01_introduction.md
│   ├── 02_decision_framework.md
│   ├── 03_bronze_layer.md
│   ├── 04_silver_layer.md
│   ├── 05_gold_layer.md
│   ├── 06_naming_conventions.md
│   ├── 07_schema_evolution.md
│   ├── 08_data_quality_gates.md
│   ├── 09_anti_patterns.md
│   └── 10_reference_architectures.md
└── index.html

📖 Documentation Preview README excerpt

Product: Medallion Architecture Guide

Version: 1.0.0

Price: $19

Publisher: DataStack Pro

License: MIT

Why This Guide Exists

Every team that adopts Databricks eventually faces the same question: "How should we organize our lakehouse?"

The medallion architecture (bronze → silver → gold) is the dominant pattern, but most teams implement it poorly. They tend to do one of the following:

1. Copy a blog post's toy example and wonder why it doesn't scale

2. Over-engineer the layers with unnecessary abstractions

3. Skip layers they actually need, creating unmaintainable pipelines

4. Choose medallion when a different pattern would serve them better

This guide gives you the decision framework to know when medallion is right, and the implementation playbook to do it properly when it is.

What You Get

Decision Framework (Not Just "Use Medallion Always")

Before writing a single line of code, you'll work through a genuine decision tree that considers:

- Your data volume, velocity, and variety
- Team size and SQL vs. engineering skill mix
- Regulatory and audit requirements
- Whether data vault, one-big-table, streaming-first, or hybrid patterns might serve you better

Deep-Dive Layer Guides

Each layer gets a dedicated chapter covering patterns that blog posts skip:

Layer	Key Topics
Bronze	Auto Loader vs. batch, append-only vs. overwrite, CDC capture, raw schema preservation
Silver	SCD Type 1/2, deduplication strategies, data quality gates, conforming dimensions
Gold	Business aggregates, feature stores, materialized views, serving layer patterns
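
To make one of the silver topics above concrete, here is a minimal sketch of a "keep the latest record per business key" deduplication strategy. It is written in plain Python so the logic is easy to follow; it is not taken from the guide, and the function name `dedupe_latest` and the sample column names are assumptions for illustration only. In Spark the same idea is commonly expressed with a window function ordered by a timestamp column.

```python
from __future__ import annotations

from typing import Any, Iterable


def dedupe_latest(
    records: Iterable[dict[str, Any]],
    key: str,
    order_by: str,
) -> list[dict[str, Any]]:
    """Keep one record per key: the one with the greatest order_by value."""
    latest: dict[Any, dict[str, Any]] = {}
    for rec in records:
        current = latest.get(rec[key])
        # Replace only on a strictly newer value, so the first-seen record wins ties.
        if current is None or rec[order_by] > current[order_by]:
            latest[rec[key]] = rec
    return list(latest.values())


# Hypothetical bronze rows: order 1 arrives twice with different timestamps.
raw_orders = [
    {"order_id": 1, "status": "created", "updated_at": "2024-01-01T10:00:00"},
    {"order_id": 1, "status": "shipped", "updated_at": "2024-01-02T09:30:00"},
    {"order_id": 2, "status": "created", "updated_at": "2024-01-01T11:15:00"},
]
deduped = dedupe_latest(raw_orders, key="order_id", order_by="updated_at")
```

ISO 8601 timestamps in a single consistent format sort correctly as strings, which is why the sketch can compare `updated_at` values directly. With mixed formats or time zones, parse them to `datetime` first.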

Production-Ready Code

Not snippets—complete, runnable PySpark scripts with type hints, docstrings, and error handling:

- Bronze ingestion with Auto Loader and batch fallback
- Silver transformation pipeline with quality gates
- Gold aggregation with incremental processing
- End-to-end cross-layer pipeline
- Naming convention SQL generator (YAML-driven)
- Schema migration utility
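
As a rough illustration of what a convention-driven name generator could look like, the sketch below builds three-part table names in the `layer.source_system.table` shape that appears in this product's code preview (for example `bronze.erp.raw_orders`). Everything else is an assumption: the `CONVENTION` dictionary stands in for a parsed YAML file, the empty silver and gold prefixes are placeholders, and `qualified_name` and `create_schema_sql` are hypothetical helpers, not the API of the bundled `naming_convention_generator.py`.

```python
from __future__ import annotations

import re

# Hypothetical convention config; a real one might be loaded from a YAML file.
# Only the bronze "raw_" prefix is visible in the product preview; the empty
# silver and gold prefixes are placeholders.
CONVENTION = {
    "layers": {"bronze": "raw_", "silver": "", "gold": ""},
}

# Lowercase snake_case identifiers only (an assumed rule, for illustration).
_IDENTIFIER = re.compile(r"^[a-z][a-z0-9_]*$")


def qualified_name(layer: str, source_system: str, entity: str) -> str:
    """Build a three-part name such as 'bronze.erp.raw_orders'."""
    if layer not in CONVENTION["layers"]:
        raise ValueError(f"unknown layer: {layer!r}")
    for part in (source_system, entity):
        if not _IDENTIFIER.match(part):
            raise ValueError(f"not a snake_case identifier: {part!r}")
    prefix = CONVENTION["layers"][layer]
    return f"{layer}.{source_system}.{prefix}{entity}"


def create_schema_sql(layer: str, source_system: str) -> str:
    """Generate DDL for the schema that holds a source system's tables."""
    return f"CREATE SCHEMA IF NOT EXISTS {layer}.{source_system}"


orders_table = qualified_name("bronze", "erp", "orders")
erp_schema_ddl = create_schema_sql("bronze", "erp")
```

Validating each part before formatting keeps malformed names out of generated SQL, which matters once the output is executed rather than just printed.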

Anti-Pattern Encyclopedia

... continues with setup instructions, usage examples, and more.

📄 Code Sample .py preview

code_examples/bronze_ingestion.py

"""Bronze Layer Ingestion Pipeline.

Complete bronze layer ingestion example supporting both Auto Loader (streaming)
and batch JDBC ingestion patterns. Designed for Databricks Runtime 13.3+.

Usage:
    # In a Databricks notebook or job:
    from bronze_ingestion import BronzeIngestionPipeline

    pipeline = BronzeIngestionPipeline(spark, config)
    pipeline.ingest_files("json", "/landing/orders/", "bronze.erp.raw_orders")
    pipeline.ingest_jdbc("silver_source_db", "public.customers", "bronze.erp.raw_customers")

Copyright (c) 2026 DataStack Pro. MIT License.
"""

from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime
from typing import Any

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F
from pyspark.sql.streaming import StreamingQuery
from pyspark.sql.types import (
    StringType,
    StructField,
    StructType,
    TimestampType,
)


@dataclass
class IngestionConfig:
    """Configuration for a bronze ingestion pipeline.

    Attributes:
        checkpoint_base: Base path for streaming checkpoints
        source_system: Name of the source system (for metadata)
        batch_id: Optional batch identifier
        enable_schema_evolution: Whether to allow schema changes
        file_format_options: Format-specific options for file ingestion
        jdbc_options: JDBC-specific connection options
        trigger_mode: Streaming trigger mode ('availableNow', 'processingTime', 'once')
        trigger_interval: Interval for processingTime trigger (e.g., '5 minutes')
    """

    checkpoint_base: str = "/checkpoints"
    source_system: str = "unknown"
    batch_id: str | None = None

# ... 473 more lines ...
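
The preview cuts off before showing how `trigger_mode` and `trigger_interval` are used. One plausible approach, sketched here as an assumption rather than the file's actual implementation, is to translate them into keyword arguments for PySpark's `DataStreamWriter.trigger`, which accepts `availableNow`, `once`, and `processingTime`. The helper name `trigger_kwargs` is hypothetical.

```python
from __future__ import annotations


def trigger_kwargs(trigger_mode: str, trigger_interval: str | None = None) -> dict[str, object]:
    """Map a config trigger mode to keyword arguments for DataStreamWriter.trigger."""
    if trigger_mode == "availableNow":
        return {"availableNow": True}
    if trigger_mode == "once":
        return {"once": True}
    if trigger_mode == "processingTime":
        # A processing-time trigger needs an interval such as '5 minutes'.
        if not trigger_interval:
            raise ValueError("processingTime requires trigger_interval, e.g. '5 minutes'")
        return {"processingTime": trigger_interval}
    raise ValueError(f"unsupported trigger_mode: {trigger_mode!r}")


# Intended use with a streaming writer (not executed here, needs a Spark session):
#   df.writeStream.trigger(**trigger_kwargs("processingTime", "5 minutes"))
batch_style = trigger_kwargs("availableNow")
micro_batch = trigger_kwargs("processingTime", "5 minutes")
```

Keeping the mapping in a small pure function means an invalid mode fails fast with a clear message, and the logic can be unit tested without starting Spark.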
