Clocks, Quorums, and Lies
Distributed Systems Patterns That Survive Contact with Production
What you'll learn
- Diagnose why timestamps, retries, and dual writes silently corrupt data — and name the exact failure mode in a design review
- Add idempotency keys and deduplication so any request or message can be safely retried
- Explain quorums, leases, and fencing tokens well enough to spot a split-brain risk before it pages you
- Decide when you actually need consensus — and when a Postgres row or an etcd lease is enough
- Replace dual writes with the transactional outbox and wire change events reliably into a log
- Break a cross-service transaction into a saga with compensations that survive crashes mid-flight
- Run a repeatable correctness review: delivery semantics, ordering assumptions, and failure injection for any new service
Contents
- 1. Welcome to Partial Failure
- 2. Time Is a Rumor
- 3. Happened-Before: Clocks That Actually Work
- 4. The Double-Charged Customer
- 5. Exactly-Once Is a Lie
- 6. Retries: How to Ask Again Without Making It Worse
- 7. One Copy Is None: Replication Basics
- 8. The Zombie Leader
- 9. Quorums: How Many Nodes Make a Truth
- 10. Consensus Without Tears
- 11. Consistency Models You Can Say Out Loud
- 12. The Dual-Write Problem
- 13. The Outbox Pattern
- 14. Sagas: Transactions That Apologize
- 15. The Replayed Webhook
- 16. The Myth of Global Order
- 17. Designing for Failure on Purpose
Free sample — the opening of Chapter 1, Welcome to Partial Failure. The complete book (158 pages, 17 chapters) comes as DRM-free PDF + EPUB with purchase.
Chapter 1: Welcome to Partial Failure
At 2:07 a.m., Priya’s phone lights up with three pages in ninety seconds. The first: payments.charge.count for order BB-88412 is 2, and the customer has already emailed. She was charged $30.95 twice, $61.90 in all, for one copy of a $30.95 hardcover. The second: the nightly settlement export finished twice, four minutes apart, and the second run pushed 14,206 duplicate rows to the bank’s SFTP drop. The third: order BB-79001, refunded the day before, has flipped its status back to paid, and the customer-facing order page now shows a refund that apparently happened before the payment it refunds.
Priya is the on-call engineer at Brindle Books, an online bookshop. She opens three dashboards, sees three healthy services, and finds no crashed process anywhere. Nothing is down. Every node reports green. The logs show retries — a Checkout retry at 2:03:41 after a five-second timeout, a Scheduler node resuming a job at 1:58:12 after a 30-second GC pause, a webhook from the payment provider redelivered at 2:05:55 with a timestamp from Tuesday. Every component did exactly what it was built to do.
That’s the trap this book is about. In a system made of one process, a failure is loud: the process crashes, the request errors, you see it. In a system made of many processes connected by a network, the common failure is quiet and partial — one call out of thousands lands in a state where the caller doesn’t know what happened, retries, and the retry does damage that no single component can see. The question is never “did it fail?” It is “did it half-fail, and what did the retry do?”
Brindle Books, in one tour
Brindle started as one Rails-ish monolith with one database. Growth split it into services, each owning its own Postgres database:
- Checkout takes the customer’s cart and coordinates the purchase.
- Payments talks to Paylode, the external card provider, and records charges and refunds.
- Orders owns the order record and its status: pending, paid, shipped, refunded.
- Inventory owns stock counts and reservations.
- Notifications sends the emails nobody reads until something goes wrong.
The services communicate two ways. Synchronously, over HTTP-style request/response — Checkout calls Payments and waits. Asynchronously, over the log: a Kafka-style ordered, partitioned broker. Services append records to the log; other services consume them at their own pace. When Payments records a successful charge, it appends payment.captured to the log; Orders consumes it and marks the order paid; Notifications consumes it and emails a receipt. The log is split into log partitions — shards, each internally ordered — so that records for the same order land in the same shard and stay in sequence.
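To make "same order, same shard" concrete, here is a minimal in-memory sketch in Python. It is not Brindle's broker or any real client library: the Log class, the partition count of 4, the CRC32 hash, and the order.shipped record name are all assumptions made for illustration. The only idea it demonstrates is that a stable hash of the order ID picks the partition, so every record for one order is appended to the same ordered list.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(order_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to a partition using a stable hash.

    zlib.crc32 is a stand-in chosen because it gives the same answer in
    every process; a real broker client uses its own hash function.
    """
    return zlib.crc32(order_id.encode("utf-8")) % num_partitions


class Log:
    """Toy partitioned log: each partition is an append-only list."""

    def __init__(self, num_partitions: int = NUM_PARTITIONS):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, order_id: str, event: str):
        """Append a record and return its (partition, offset)."""
        p = partition_for(order_id, len(self.partitions))
        self.partitions[p].append((order_id, event))
        return p, len(self.partitions[p]) - 1


log = Log()
first = log.append("BB-88412", "payment.captured")
log.append("BB-79001", "payment.captured")      # a different order, interleaved
second = log.append("BB-88412", "order.shipped")  # hypothetical record name

# Same key, same partition, and the later record has the larger offset.
print(first[0] == second[0], first[1] < second[1])
```

Order is guaranteed only inside one partition. Records for two different orders may sit in different partitions, and nothing in this sketch says which of them a consumer will see first.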
Two more actors. Paylode, the payment provider, calls back into Brindle with webhooks: payment.settled, payment.disputed, refund.completed. Paylode promises to deliver each webhook at least once, which — read carefully — is a promise to sometimes deliver it twice. And a Scheduler service runs nightly jobs: the settlement export to the bank, a stock reconciliation pass. Only one Scheduler node should run a job at a time, so the nodes elect a leader by taking a lease in an etcd-style coordination store: whoever holds the lease runs the job, and the lease expires unless renewed every few seconds.
That’s the whole cast. No new services get invented in this book. Every failure you’ll read about happens to these six components and this one provider, because the point is that this perfectly ordinary architecture — the one you probably run some variant of — contains every trap worth knowing.
Three incidents, in miniature
Priya’s three pages are the book’s three canonical incidents. We’ll tear each one apart in its own chapter; here is the shape of each.
Incident 1 — the double-charged customer. Checkout called Payments to charge $30.95. The call timed out after five seconds. Checkout retried. Both requests had, in fact, reached Payments; both charged the card. The customer paid $61.90. The timeout didn’t mean the first charge failed — it meant Checkout stopped waiting for the answer. Chapter 4 fixes this with idempotency keys.
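The mechanics of Incident 1 fit in a few lines of Python. This is a sketch under stated assumptions, not the Chapter 4 implementation: the Payments class, the idempotency_key parameter, the ResponseLost exception, and the key string are all hypothetical, and the timeout is modeled as a reply that never arrives. Amounts are in cents. The first run reproduces the double charge; the second shows the direction of the fix, where a retry that carries the same key gets the earlier result back instead of a second charge.

```python
class Payments:
    """In-memory stand-in for the Payments service (hypothetical API)."""

    def __init__(self):
        self.charges = []  # every charge that reached the card
        self.seen = {}     # idempotency key -> id of the charge it produced

    def charge(self, order_id, cents, idempotency_key=None):
        if idempotency_key is not None and idempotency_key in self.seen:
            # Replay of a request already processed: return the old result.
            return self.seen[idempotency_key]
        charge_id = len(self.charges)
        self.charges.append((order_id, cents))
        if idempotency_key is not None:
            self.seen[idempotency_key] = charge_id
        return charge_id


class ResponseLost(Exception):
    """The request was processed, but the caller stopped waiting."""


def call_with_lost_reply(payments, order_id, cents, key=None):
    payments.charge(order_id, cents, key)  # the charge happens...
    raise ResponseLost()                   # ...but the answer never arrives


def checkout(payments, order_id, cents, key=None):
    try:
        return call_with_lost_reply(payments, order_id, cents, key)
    except ResponseLost:
        # Checkout cannot tell "failed" from "succeeded silently", so it retries.
        return payments.charge(order_id, cents, key)


naive = Payments()
checkout(naive, "BB-88412", 3095)
print(len(naive.charges))  # 2 charges: the customer pays 6190 cents

keyed = Payments()
checkout(keyed, "BB-88412", 3095, key="checkout:BB-88412")
print(len(keyed.charges))  # 1 charge: the retry was recognized
```

The key has to be chosen by the caller before the first attempt and reused on every retry. A key generated fresh per attempt would make each retry look like a new request.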
Incident 2 — the zombie leader. Scheduler node A held the lease and started the settlement export at 1:57. Then it hit a 30-second GC pause. Its lease, which required renewal every 10 seconds, expired. Node B, correctly observing an expired lease, took over and started the export. Node A woke up at 1:58:12, unaware that time had passed, and kept running its export. Two exports, one bank, 14,206 duplicate rows. Node A wasn’t malfunctioning; it was frozen, and a frozen node cannot know it was frozen. Chapter 8 fixes this with fencing tokens.
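The core idea of a fencing token can be sketched briefly. The names below (LeaseStore, ProtectedResource, the row strings) are hypothetical, and this is an in-memory illustration rather than the Chapter 8 design or a real etcd client. It assumes two things: the coordination store hands out a strictly increasing number with every lease grant, and the place where writes land is able to check that number.

```python
class LeaseStore:
    """Stand-in for the coordination store: every grant gets a larger token."""

    def __init__(self):
        self.token = 0
        self.holder = None

    def acquire(self, node):
        self.token += 1
        self.holder = node
        return self.token


class ProtectedResource:
    """Stand-in for wherever the export lands; it rejects stale tokens."""

    def __init__(self):
        self.highest_token = 0
        self.rows = []

    def write(self, token, row):
        if token < self.highest_token:
            raise PermissionError(
                f"stale token {token}, already saw {self.highest_token}"
            )
        self.highest_token = token
        self.rows.append(row)


store = LeaseStore()
drop = ProtectedResource()

token_a = store.acquire("A")            # node A takes the lease: token 1
drop.write(token_a, "row 1 from A")

# Node A freezes; its lease expires; node B takes over with token 2.
token_b = store.acquire("B")
drop.write(token_b, "row 1 from B")

# Node A wakes up, still believing it is the leader.
try:
    drop.write(token_a, "row 2 from A")
except PermissionError as err:
    print("rejected:", err)
```

Node A never has to notice that it was frozen. The decision moves to the resource, which only needs to remember the highest token it has seen.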
Incident 3 — the replayed webhook. Paylode delivered payment.settled for order BB-79001 on Tuesday. Orders marked it paid. On Thursday the customer got a refund; Orders marked it refunded. Early Friday morning, Paylode — recovering from its own internal failure — redelivered Tuesday’s webhook. Orders processed it again, faithfully, and set the order back to paid. Old news, applied late, overwrote newer truth. Chapter 15 fixes this with versioning and deduplication.
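A minimal sketch of the two defenses named above, with loud assumptions: it supposes each webhook carries a unique event ID and a per-order version number that only increases, which the sample does not say Paylode provides. The Order class, the apply_webhook function, and the event IDs are hypothetical, and this is not the Chapter 15 implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    status: str = "pending"
    version: int = 0  # version of the last event applied
    seen_event_ids: set = field(default_factory=set)


def apply_webhook(order, event_id, version, new_status):
    """Apply an event only if it is new and newer than the current state.

    Returns True if the order changed, False if the event was ignored.
    """
    if event_id in order.seen_event_ids:
        return False  # deduplication: this exact delivery was seen before
    order.seen_event_ids.add(event_id)
    if version <= order.version:
        return False  # versioning: old news must not overwrite newer state
    order.status = new_status
    order.version = version
    return True


order = Order()
apply_webhook(order, "evt_tuesday", 1, "paid")       # Tuesday: settled
apply_webhook(order, "evt_thursday", 2, "refunded")  # Thursday: refunded
apply_webhook(order, "evt_tuesday", 1, "paid")       # Friday: redelivery, ignored
print(order.status)  # refunded
```

The two checks cover different cases. Deduplication catches a byte-for-byte redelivery; the version check still holds if the provider reissues the old event under a new ID.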
The sample ends here. Buy Clocks, Quorums, and Lies above to keep reading — one-time purchase, instant download, yours forever.