A data lake is just a swamp with better marketing — right up until you give it a skeleton.
Dump enough raw files into cheap object storage and you don’t get a lake; you get a murky basin where data sinks to the bottom and is never reliably seen again. The fix isn’t a fancier query engine or a more expensive warehouse. It’s a discipline with a slightly silly name (bronze, silver, gold), and at its heart sits one stubborn rule it quietly borrowed from a corner of software most data engineers never visit:
🪨 Never change what already happened.
Let’s talk about how the medallion pattern actually works, how to build each layer, and why — if you’ve ever written an event-sourced system — the whole thing is going to feel suspiciously familiar.
The data lake pitch was intoxicating:
For about six months, it feels like freedom. Then the silt sets in.
s3://data/raw/2019/
Phil Karlton quipped that the two hardest problems in computing are cache invalidation and naming things; a swamp is what happens when finding and trusting your data becomes the third. None of this is exotic. It’s the predictable result of storing data with no shape.
(The shift from ETL to ELT — load raw first, transform later — didn’t create the swamp, but it did remove the bouncer at the door.)
The lesson isn’t “go back to rigid warehouses.” It’s that a lake needs a spine.
[Figure: lake-vs-swamp. “Same water. The difference is whether anything gives it a shape.” Left panel, “The Swamp”: scattered files named final_v2, final_REAL, copy(3), ??.csv; no ACID, no schema, no time travel; “what did Q2 look like?” gets silence. Right panel, “Lake With a Spine”: Bronze → Silver → Gold; ACID, schema evolution, time travel; “what did Q2 look like?” is one query.]
The medallion pattern gives the lake a skeleton: three layers, each more refined than the last.
[Figure: medallion-stack. Raw data in at the top, business value out at the bottom. Bronze: raw, immutable, append-only, Iceberg on S3, partitioned by event_date (“the insurance policy”). Silver, reached via dbt + quality gates: cleaned, deduped, type-cast, schema-enforced, PII tokenized (“the contract”). Gold, reached via dbt + contracts: business marts, aggregations, definitions (“the product”).]
The names borrow from the Olympic podium, and the metaphor is doing real work — each layer is a higher grade of the same underlying thing. Here’s the part most explainers miss:
💡 ETL/ELT is a data-movement strategy. Medallion is a data-trust strategy. One answers how does data get from A to B? The other answers how confident am I in the data at each stage? They’re not competitors. They’re answering different questions.
If the progression feels familiar, it should. It’s Kent Beck’s old mantra wearing a data hat:
“Make it work, make it right, make it fast.” — Kent Beck
Bronze makes it exist. Silver makes it right. Gold makes it useful. Same discipline, different domain.
What it is: every record exactly as it arrived, never modified.
How it’s actually built:
Bronze is the layer you’re tempted to skip and the one you’ll be most grateful for — your insurance policy. Anything downstream can be rebuilt from it, because it never lies and never forgets.
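To make “never modified” concrete, here is a minimal sketch in plain Python rather than a real table format. The `BronzeLog` class, its field names, and the sample payloads are hypothetical, invented for illustration; the only point is that the write path offers append and nothing else.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from types import MappingProxyType
from typing import Any, Mapping


@dataclass(frozen=True)
class BronzeRecord:
    """One raw record, stored as it arrived plus ingestion metadata."""
    payload: Mapping[str, Any]   # untouched source payload (read-only view)
    source: str                  # hypothetical: which feed delivered it
    ingested_at: datetime        # when we received it, not when it happened


class BronzeLog:
    """Append-only store: there is deliberately no update or delete method."""

    def __init__(self) -> None:
        self._records: list[BronzeRecord] = []

    def append(self, payload: dict, source: str) -> BronzeRecord:
        record = BronzeRecord(
            payload=MappingProxyType(dict(payload)),  # copy, then freeze
            source=source,
            ingested_at=datetime.now(timezone.utc),
        )
        self._records.append(record)
        return record

    def scan(self) -> tuple[BronzeRecord, ...]:
        # Hand back an immutable snapshot so callers cannot edit history.
        return tuple(self._records)


bronze = BronzeLog()
bronze.append({"order_id": "A1", "amount": "19.99"}, source="vendor_feed")
# A correction is a new record, never an edit of the old one.
bronze.append({"order_id": "A1", "amount": "21.99"}, source="vendor_feed")
```

Both versions of order `A1` survive in the log, and note that `amount` stays a string: casting it is a downstream job, not bronze’s.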
What it is: cleaned, conformed, trustworthy data with a stable shape.
How it’s actually built:
Silver is the contract layer. When a vendor changes their feed, only bronze-to-silver breaks. Everything downstream keeps running against the contract silver promises. That isolation — vendor chaos on one side, business logic on the other — is the whole point.
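A rough sketch of what a bronze-to-silver step might do, assuming “cleaned and conformed” means casting types, tokenizing PII, deduplicating on a business key, and quarantining rows that break the contract. The column names, the `tokenize` helper, and the fixed salt are all hypothetical; in practice this would more likely live in a dbt model with tests than in a Python function.

```python
import hashlib
from decimal import Decimal, InvalidOperation


def tokenize(value: str, salt: str = "demo-salt") -> str:
    """Replace PII with a stable one-way token (illustrative, not a vault)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def to_silver(bronze_rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (clean rows, rejected rows). Later rows win on duplicate keys."""
    clean: dict[str, dict] = {}
    rejected: list[dict] = []
    for raw in bronze_rows:
        try:
            # Building the row with a fixed set of typed columns is the
            # "contract": anything that cannot be cast is rejected.
            row = {
                "order_id": raw["order_id"].strip(),
                "email_token": tokenize(raw["email"].strip().lower()),
                "amount": Decimal(raw["amount"]),
            }
        except (KeyError, AttributeError, TypeError, InvalidOperation):
            rejected.append(raw)  # quarantine; bronze itself is untouched
            continue
        clean[row["order_id"]] = row  # dedupe on the business key
    return list(clean.values()), rejected


bronze_rows = [
    {"order_id": "A1", "email": "Ada@Example.com", "amount": "19.99"},
    {"order_id": "A1", "email": "ada@example.com", "amount": "19.99"},  # duplicate
    {"order_id": "B2", "email": "bob@example.com", "amount": "n/a"},    # bad type
]
silver, rejects = to_silver(bronze_rows)
```

If the vendor renames `email` tomorrow, this function is the only thing that has to change; consumers of `silver` still see `order_id`, `email_token`, and `amount`.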
What it is: business-ready marts shaped for the people and systems that consume them.
How it’s actually built:
Gold is the product. When the business redefines “active customer,” only silver-to-gold changes — silver stays stable for everyone else.
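As a small illustration of that isolation, the sketch below treats “active customer” as a gold-layer definition applied to unchanged silver rows. The rows, the `active_customers` function, and the 30-day and 90-day windows are assumptions made up for the example, not definitions from any real mart.

```python
from datetime import date, timedelta

# Hypothetical silver rows: one per cleaned order.
silver_orders = [
    {"customer_id": "c1", "order_date": date(2024, 3, 1)},
    {"customer_id": "c2", "order_date": date(2024, 1, 10)},
    {"customer_id": "c3", "order_date": date(2023, 11, 5)},
]


def active_customers(orders: list[dict], as_of: date, window_days: int) -> set[str]:
    """Gold definition: a customer is active if they ordered within the window."""
    cutoff = as_of - timedelta(days=window_days)
    return {o["customer_id"] for o in orders if cutoff <= o["order_date"] <= as_of}


as_of = date(2024, 3, 31)
gold_v1 = active_customers(silver_orders, as_of, window_days=30)  # old definition
gold_v2 = active_customers(silver_orders, as_of, window_days=90)  # redefined
```

Changing the window changes who counts as active (`{"c1"}` versus `{"c1", "c2"}` here), while `silver_orders` is read but never rewritten.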
If you’ve built a CQRS or event-sourced system, a bell has been ringing for several paragraphs. Let it ring. Martin Fowler describes event sourcing as storing every state change as a sequence of events you never edit, then deriving current state by replaying them. Sound like anyone you know?
| Event Sourcing | Medallion |
|---|---|
| Append-only event log | Immutable bronze |
| Events are facts, never edited | Raw records never overwritten |
| Projections derived from the log | Silver / gold derived from bronze |
| Rebuild state by replaying events | Re-derive marts by reprocessing bronze |
| Temporal query (“state as of T”) | Table-format time-travel, e.g. Iceberg (“data as of T”) |
| Corrective events (never delete) | Version-2 records (never delete) |
💡 Bronze is an event log for your analytics. The medallion pattern is event sourcing that wandered into the data warehouse and decided to stay.
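For readers who haven’t met the event-sourcing side of the table, here is a minimal sketch of replay: state is never stored directly, only folded out of the log. The account events and the `apply_event` and `state_as_of` names are hypothetical, chosen only to show the shape.

```python
from functools import reduce

# Hypothetical event log for one account; events are appended, never edited.
events = [
    {"seq": 1, "type": "deposited", "amount": 100},
    {"seq": 2, "type": "withdrew", "amount": 30},
    {"seq": 3, "type": "deposited", "amount": 5},
]


def apply_event(balance: int, event: dict) -> int:
    """Pure transition function: current state + one event -> next state."""
    if event["type"] == "deposited":
        return balance + event["amount"]
    if event["type"] == "withdrew":
        return balance - event["amount"]
    return balance  # unknown event types are ignored in this sketch


def state_as_of(log: list[dict], seq: int) -> int:
    """Temporal query: replay only the events up to and including `seq`."""
    return reduce(apply_event, (e for e in log if e["seq"] <= seq), 0)


current = state_as_of(events, seq=3)  # 75
earlier = state_as_of(events, seq=2)  # 70
```

Swap “events” for bronze rows and “balance” for a mart, and the same fold is what re-deriving silver and gold from bronze amounts to.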
But — and this is the part worth saying out loud, because it keeps you honest — they are analogous, not identical:
The shared DNA is immutability + derivation; the replay mechanics differ. Name the resemblance, then name the seam.
Here’s where the spine earns its keep. The dreaded request:
“Can you re-run last March, but with the corrected numbers?”
In a swamp, that sentence ruins a week. In a medallion lake:
A retroactive correction stops being archaeology and becomes a Tuesday — not a feature you bolt on later, but a property that falls out of never overwriting bronze.
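A toy version of that restatement, assuming corrections land in bronze as higher-versioned records and that each row carries an ingestion date. The invoice rows and the `march_total` function are hypothetical, and filtering on `ingested_on` is only a stand-in for real table-format time travel, which in Iceberg works on table snapshots rather than a column filter.

```python
from datetime import date

# Hypothetical bronze rows, one dict per record as it arrived.
bronze = [
    {"invoice": "INV-7", "version": 1, "ingested_on": date(2024, 3, 5), "amount": 1000},
    {"invoice": "INV-8", "version": 1, "ingested_on": date(2024, 3, 9), "amount": 250},
    # The correction arrives in April as a new version; version 1 stays put.
    {"invoice": "INV-7", "version": 2, "ingested_on": date(2024, 4, 2), "amount": 900},
]


def march_total(rows: list[dict], as_of: date) -> int:
    """Re-derive the mart from bronze using only what was known by `as_of`."""
    latest: dict[str, dict] = {}
    for row in rows:
        if row["ingested_on"] > as_of:
            continue  # not yet known at that point in time
        seen = latest.get(row["invoice"])
        if seen is None or row["version"] > seen["version"]:
            latest[row["invoice"]] = row  # keep the highest version per invoice
    return sum(r["amount"] for r in latest.values())


as_originally_reported = march_total(bronze, as_of=date(2024, 3, 31))  # 1250
restated = march_total(bronze, as_of=date(2024, 4, 30))                # 1150
```

Both answers come from the same untouched rows: the original March figure and the corrected one are just two different `as_of` values.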
Three layers aren’t free. Each is more pipeline, more orchestration, more places to break at 2 a.m. Silver earns its keep only when input is messy or multiple consumers need the same cleaned data. Knock out both — clean-on-arrival data, one consumer — and silver becomes an elaborate SELECT *.
In that world, collapse to two tiers (raw → curated) and don’t apologize.
“Everything should be made as simple as possible, but no simpler.” — Einstein (give or take a paraphrase)
The value was never the bronze/silver/gold vocabulary. It’s the separation of concerns — keeping raw, cleaned, and business-ready apart so a vendor change can’t ripple into a dashboard. Three boxes are a convenient default, not a commandment.
The pattern is tool-agnostic, but here’s the slate I reached for above — by job, with the trade-offs. The biggest decision is the open table format; the rest are fairly settled defaults.
| Tool | Type | Pros | Cons |
|---|---|---|---|
| Apache Iceberg | Open table format | Vendor-neutral; read by Athena, Spark, Trino, Snowflake, BigQuery; strong schema & partition evolution | Some ecosystem tooling still maturing; metadata needs periodic compaction |
| Delta Lake | Open table format | Mature; best-in-class on Databricks/Spark; strong ML tooling | Historically Spark-centric; shines most inside Databricks’ orbit |
| Apache Hudi | Open table format | Excellent for high-frequency upserts & CDC; record-level indexing | More operational complexity; steeper learning curve |
| Amazon S3 | Object storage | Cheap, durable, ubiquitous; deep AWS integration | AWS-coupled; egress costs (alternatives: GCS, Azure Blob Storage, with the same trade-off per cloud) |
| dbt | Transformation | SQL-native; version-controlled; tested; strong lineage & community | SQL-centric, so awkward for heavy non-SQL work; needs a separate orchestrator |
| Great Expectations | Data quality | Rich distributional checks; docs-as-tests; Python-native | Setup overhead; verbose for trivial checks |
💡 The one real choice here is the table format. Default to Iceberg for vendor neutrality; pick Delta if you live in Databricks; reach for Hudi if your world is high-frequency upserts. Everything else — object storage, dbt, a quality framework — is a near-default for an AWS-shaped lake.
🦴 A lake doesn’t become trustworthy by getting bigger. It becomes trustworthy by getting a shape.