Skip to main content
Rows on Disk — cover
Clarqo Press

Rows on Disk

How PostgreSQL Stores, Finds, and Reclaims Your Data

By Tobias Krell Level: Intermediate 166 pages PDF & EPUB

One-time purchase · instant download · PDF + EPUB included · secure checkout via Stripe

What you'll learn

  • Read a raw heap page with pageinspect and decode tuple headers — xmin, xmax, ctid — by hand
  • Explain table and index bloat from first principles, measure it with pgstattuple, and know which fixes actually shrink files
  • Trace an UPDATE from new tuple version through WAL record, dirty buffer, checkpoint, and eventual vacuum
  • Diagnose index-only scans that still hit the heap, and use the visibility map to fix them
  • Configure autovacuum from mechanism, not folklore — thresholds, cost limits, and why long transactions block cleanup
  • Understand xid wraparound and freezing well enough to never fear the 'database is not accepting commands' warning
  • Read replication lag and replication-slot disk retention as WAL plumbing problems and resolve them safely

Contents

  • 1. A Database Is a Directory
  • 2. The 8 KB Page
  • 3. Anatomy of a Tuple
  • 4. UPDATE Is INSERT Plus Forget
  • 5. Snapshots: Who Sees What
  • 6. WAL: Write It Down First
  • 7. The Buffer Cache
  • 8. Indexes Point at Ghosts
  • 9. Index-Only Scans That Aren't
  • 10. TOAST: Rows Too Big to Sit Still
  • 11. Where Bloat Comes From
  • 12. VACUUM: The Reclaimer
  • 13. Freezing and Wraparound
  • 14. Getting the Space Back
  • 15. Shipping the WAL
  • 16. The Field Guide

Read a free sample below — the full book comes with purchase (PDF & EPUB)

Free sample — the opening of Chapter 1, A Database Is a Directory. The complete book (166 pages, 16 chapters) comes as DRM-free PDF + EPUB with purchase.

Chapter 1: A Database Is a Directory

It’s 2 a.m. and the disk on the database server is at 94 percent. You SSH in, run du -sh on the PostgreSQL data directory, and get 412 GB. Then you connect to the database and add up every table size that \dt+ reports: about 180 GB. Somewhere on that volume, 230 GB of disk is being held by something that doesn’t appear in any table listing you know how to produce. The pager will not stop until you find it.

This chapter is about closing that gap. By the end of it you will be able to take any table, index, or database and put your finger on the exact files that hold it — and, going the other direction, take any file in the data directory and say which relation it belongs to and which of several forks of that relation it is. Once you can do that, disk usage stops being a mystery you query around and becomes an inventory you can audit line by line.

The data directory tour

Everything a PostgreSQL cluster knows lives under one directory, conventionally called the data directory or PGDATA. Ask the server where it is:

SHOW data_directory;
 data_directory
--------------------------
 /var/lib/postgresql/data

List it and you’ll see a couple dozen entries. Three of them matter enormously; the rest are small bookkeeping.

base/ holds the actual table and index data, one subdirectory per database. The subdirectory names are numbers, not database names — we’ll decode them in a moment. When your disk fills up because a table grew, the bytes are here.

pg_wal/ holds the write-ahead log: the append-only journal of every change made to the cluster, written before the change touches any table file. It is a sequence of fixed-size segment files, 16 MB each by default. (Version note: the segment size is configurable at build or initdb time, and before v10 this directory was called pg_xlog — it was renamed precisely because people saw “log” in the name and deleted it. Don’t. Old docs and function names still say “xlog”; same thing.) Chapter 6 is devoted to WAL. For tonight’s incident, know this much: pg_wal/ is a common culprit when disk usage outruns table sizes, because anything that prevents old segments from being recycled — a failing archiver, or most notoriously an abandoned replication slot — makes this directory grow without bound. That is incident (c) in this book, the replica 20 minutes behind while a forgotten slot eats the disk, and we’ll resolve it properly at the end.

global/ holds cluster-wide catalogs — things that exist once per cluster rather than once per database, like the list of databases themselves and the list of roles. It’s small and you’ll rarely think about it, but it explains why some system tables look identical from every database.

Around those three you’ll find pg_tblspc/ (symlinks to tablespaces, if you’ve placed relations on other volumes), several pg_* directories tracking transaction status and other metadata, and the configuration files. None of them normally accounts for serious disk space.

OIDs and relfilenodes

Create the database this book lives in. Brightpath is a small SaaS helpdesk, and its two central tables will follow us through every chapter:

CREATE DATABASE brightpath;

Every database has an OID — a numeric object identifier, the internal primary key of PostgreSQL’s catalogs — and its subdirectory under base/ is named after it:

SELECT oid, datname FROM pg_database WHERE datname = 'brightpath';
  oid  |  datname
-------+------------
 16384 | brightpath

So the files for everything in brightpath live under base/16384/. Connect to it and create the two tables:

CREATE TABLE orders (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id bigint NOT NULL,
    status      text NOT NULL DEFAULT 'new',
    total       numeric NOT NULL,
    notes       text,
    created_at  timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE events (
    id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    order_id bigint NOT NULL,
    kind     text NOT NULL,
    payload  text,
    at       timestamptz NOT NULL DEFAULT now()
);

orders is update-heavy — an order’s status flips several times over its life. events is an append-only audit log: rows go in, nothing changes. That difference in write pattern will turn out to explain most of the difference in how these two tables age on disk, but that’s for later chapters. Right now the goal is simpler: find their files.

Tables have OIDs too, but the file on disk is not named after the table’s OID. It’s named after a second number, the relfilenode. The two usually start out equal, which lulls people into thinking they’re the same thing; they are not, and the difference is load-bearing:

SELECT relname, oid, relfilenode
FROM pg_class
WHERE relname IN ('orders', 'events');
 relname |  oid  | relfilenode
---------+-------+-------------
 orders  | 16385 | 16385
 events  | 16392 | 16392

The sample ends here. Buy Rows on Disk above to keep reading — one-time purchase, instant download, yours forever.