PostgreSQL DBA Field Guide
PostgreSQL Internals
Core concepts every PostgreSQL DBA must know — explained in plain English, no jargon fog.

🗄️
01 · Storage Layer
How Postgres Stores Data
Tuple
aka row / record
A tuple is simply one row in a table. Postgres calls rows "tuples" internally. Every tuple has a hidden header (t_xmin, t_xmax) that tracks which transaction created and deleted it — the foundation of MVCC.
Page / Block
8 KB default unit
Postgres reads/writes data in 8 KB pages (also called blocks). A table file is a sequence of pages. Each page holds multiple tuples. Changing one byte forces the whole 8 KB page to be written to disk.
Heap
the table's main file
The heap is the raw data file for a table — tuples stored in no particular order. Unlike B-tree indexes, the heap has no sorted structure. pg_relation_filepath('mytable') shows you the file on disk.
OID / Relfilenode
internal object identifiers
Every database object has an OID (Object ID). The relfilenode is the actual filename on disk (a number like 16384). After CLUSTER or VACUUM FULL, the relfilenode changes even though the table name stays the same.
TOAST
The Oversized-Attribute Storage Technique
When a column value is too large to fit in a page, Postgres automatically compresses or moves it to a separate TOAST table. Large TEXT, JSONB, or BYTEA values live there. Transparent to SQL, important for bloat investigations.
Fork
relation file sub-types
Each relation (table/index) is stored in up to four forks on disk:
• Main fork — actual data
• FSM fork — Free Space Map (which pages have room)
• VM fork — Visibility Map (which pages are all-visible)
• Init fork — empty template used to reset unlogged relations after a crash
You'll see suffixes like _fsm, _vm, _init in the data directory.
πŸ‘οΈ
02 Β· Concurrency
MVCC β€” Multi-Version Concurrency Control
What is MVCC?
Postgres never overwrites a row in place. Instead, an UPDATE creates a new tuple version and marks the old one as dead. This means readers never block writers and writers never block readers. Each transaction sees a snapshot of the database from when it started — like a photograph taken at that moment in time.
Row v1 (xmin=100)
→ UPDATE →
Row v2 (xmin=201)
+
Row v1 (xmax=201) dead
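The diagram above can be run as a toy model. A minimal Python sketch (not Postgres source: it ignores commit status, hint bits, and subtransactions) in which each tuple carries xmin/xmax and a snapshot XID decides visibility; the `Tuple` and `visible` names are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class Tuple:
    data: str
    xmin: int      # XID that created this version
    xmax: int = 0  # XID that deleted/updated it (0 = still live)

def visible(t: Tuple, snapshot_xid: int) -> bool:
    """Toy rule: visible if created before our snapshot and not yet
    deleted, or deleted only after our snapshot was taken."""
    created_before = t.xmin <= snapshot_xid
    not_deleted = t.xmax == 0 or t.xmax > snapshot_xid
    return created_before and not_deleted

# An UPDATE in XID 201 keeps both versions in the heap:
heap = [
    Tuple("v1", xmin=100, xmax=201),  # old version, now dead
    Tuple("v2", xmin=201),            # new version
]

# A snapshot taken at XID 150 still sees v1; one taken at 300 sees v2.
print([t.data for t in heap if visible(t, 150)])  # ['v1']
print([t.data for t in heap if visible(t, 300)])  # ['v2']
```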
xmin / xmax
hidden system columns
xmin = Transaction ID that inserted this tuple. xmax = Transaction ID that deleted/updated this tuple (0 means it's still live). You can see them with:
SELECT xmin, xmax, * FROM mytable;
Transaction ID (XID)
32-bit counter, wraps around!
Every transaction that writes gets a monotonically increasing XID. It's a 32-bit integer, so the usable window is about 2 billion XIDs before comparisons wrap. This is the famous XID wraparound problem: if old tuples aren't frozen in time, Postgres escalates to aggressive autovacuums and eventually refuses new write transactions until a vacuum completes.
⚠️
Monitor this! age(datfrozenxid) must stay below 2 billion. Alert at 500 million.
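The wraparound arithmetic can be sketched directly: XIDs live in a circular 32-bit space, so "age" is modular distance between the current XID and a table's oldest unfrozen XID. Illustrative Python, not Postgres source:

```python
# Circular 32-bit XID arithmetic (illustrative sketch, not Postgres source).
XID_MAX = 2**32

def xid_age(current_xid: int, frozen_xid: int) -> int:
    """How many transactions ago frozen_xid was, in modulo-2^32 space.
    This is the quantity age(datfrozenxid) reports."""
    return (current_xid - frozen_xid) % XID_MAX

# The counter restarts after 2^32, but circular distance still works:
print(xid_age(1_000, 4_294_966_296))      # 2000: wrapped past 2^32
print(xid_age(600_000_000, 100_000_000))  # 500000000: alert threshold hit
```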
Snapshot
your transaction's "world view"
When your transaction starts, Postgres takes a snapshot listing all currently active XIDs. Your query only sees tuples committed before your snapshot. Rows inserted by concurrent transactions are invisible to you until you start a new transaction.
Isolation Levels
READ COMMITTED · REPEATABLE READ · SERIALIZABLE
READ COMMITTED (default) — new snapshot per statement. You can see rows committed between statements in the same transaction.

REPEATABLE READ — snapshot taken at first statement, held for the whole transaction.

SERIALIZABLE — full Serializable Snapshot Isolation (SSI), as if transactions ran one by one.
🧹
03 · Housekeeping
VACUUM — The Garbage Collector
Why VACUUM Exists
Because MVCC creates dead tuples on every UPDATE/DELETE, someone has to clean them up. VACUUM marks dead tuple space as reusable (but doesn't shrink the file). Think of it as "marking empty seats in a theater" — others can sit there but the theater size stays the same.
Key point: Regular VACUUM does NOT return space to the OS. Only VACUUM FULL does (but it locks the table).
VACUUM
non-blocking cleanup
Marks dead tuples as free space, updates the FSM and VM, and advances relfrozenxid. Runs concurrently with reads and most writes.
VACUUM (VERBOSE, ANALYZE) mytable;
VACUUM FULL
⚠️ table-locking rewrite
Rewrites the entire table to a new file, reclaiming all dead space and returning it to the OS. Acquires an ACCESS EXCLUSIVE lock — no reads or writes during this time. Use pg_repack instead in production for zero-downtime compaction.
ANALYZE
stats collector for the planner
Samples the table and updates pg_statistic with column data distributions (histograms, most-common-values). The query planner uses these stats to pick the best plan. Stale stats = bad plans = slow queries.
Freeze
XID wraparound prevention
VACUUM freezes old tuples by replacing their xmin with a special FrozenXID. Frozen tuples are visible to all future transactions, preventing wraparound. VACUUM FREEZE forces this for the whole table regardless of age.
Visibility Map (VM)
vacuum's shortcut
A bitmap with two bits per heap page: an all-visible bit (every tuple on the page is visible to all transactions) and an all-frozen bit (every tuple is frozen). VACUUM skips pages whose bits are already set. Index-only scans also use the VM to avoid heap fetches.
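The two-bit scheme can be modeled in a few lines. An illustrative Python sketch (the real VM is a packed bitmap stored in the _vm fork; the class and method names here are made up):

```python
# Toy visibility map: two bits per heap page (illustrative, not the on-disk
# format). Bit meanings follow the description above.
ALL_VISIBLE = 0b01
ALL_FROZEN  = 0b10

class VisibilityMap:
    def __init__(self, n_pages: int):
        self.bits = [0] * n_pages

    def set_all_visible(self, page: int):
        self.bits[page] |= ALL_VISIBLE

    def set_all_frozen(self, page: int):
        # A page can only be all-frozen if it is also all-visible.
        self.bits[page] |= ALL_VISIBLE | ALL_FROZEN

    def aggressive_vacuum_can_skip(self, page: int) -> bool:
        # An anti-wraparound vacuum may skip pages already fully frozen.
        return self.bits[page] & ALL_FROZEN != 0

vm = VisibilityMap(4)
vm.set_all_visible(0)   # page 0: visible but not frozen -> must still freeze
vm.set_all_frozen(1)    # page 1: nothing left to do
print([vm.aggressive_vacuum_can_skip(p) for p in range(4)])
# [False, True, False, False]
```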
⚡
04 · Transactions
ACID & Transaction Management
ACID
the four guarantees
Atomicity — all or nothing.
Consistency — constraints always satisfied.
Isolation — concurrent transactions don't interfere.
Durability — committed data survives crashes (via WAL).
Savepoint
partial rollback
A bookmark inside a transaction. If something fails, you roll back to the savepoint instead of the whole transaction. Used heavily in ORMs to handle expected errors without aborting the whole session.
SAVEPOINT my_save;
-- do risky stuff
ROLLBACK TO my_save;
Two-Phase Commit (2PC)
distributed transactions
For distributed systems. Phase 1: PREPARE TRANSACTION (durable write, not yet committed). Phase 2: COMMIT PREPARED. If the coordinator crashes between phases, pg_prepared_xacts holds the zombie transaction. Monitor and clean these!
Idle in Transaction
the silent killer
A connection that's opened a transaction but isn't doing anything. Holds locks, prevents VACUUM from cleaning rows, bloats the table. Find them with:
SELECT pid, state, now() - xact_start FROM pg_stat_activity WHERE state = 'idle in transaction';
🔒
05 · Locking
Locks & Blocking
Lock Levels (lightest → heaviest)
Lock · Who Takes It · Blocks
ACCESS SHARE · SELECT · only DROP TABLE
ROW SHARE · SELECT FOR UPDATE · DROP, ALTER
ROW EXCLUSIVE · INSERT/UPDATE/DELETE · ALTER TABLE, LOCK
SHARE UPDATE EXCLUSIVE · VACUUM, CREATE INDEX CONCURRENTLY · other VACUUMs, ALTER
SHARE · CREATE INDEX (non-concurrent) · all writes
ACCESS EXCLUSIVE · DROP, TRUNCATE, VACUUM FULL, some ALTERs · everything including SELECT
Deadlock
circular lock wait
Session A waits for Session B's lock, and Session B waits for Session A's lock. Postgres detects this within deadlock_timeout (default 1s) and kills one transaction. Fix: always acquire locks in a consistent order in your application code.
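The "consistent order" fix looks like this in application code. A Python sketch with made-up resource names; the same idea applies to the order in which you touch rows or tables in SQL:

```python
import threading

# Sketch: always acquire locks in one global order (here: sorted by name)
# so two workers can never each hold a lock the other is waiting on.
locks = {"accounts": threading.Lock(), "orders": threading.Lock()}

def do_work(touching: list[str]):
    # Sorting gives every worker the same acquisition order, so a
    # circular wait (A waits on B, B waits on A) cannot form.
    for name in sorted(touching):
        locks[name].acquire()
    try:
        pass  # perform the updates here
    finally:
        for name in sorted(touching, reverse=True):
            locks[name].release()

# Worker A asks for (orders, accounts), worker B for (accounts, orders):
# both actually lock 'accounts' first, so neither can deadlock the other.
a = threading.Thread(target=do_work, args=(["orders", "accounts"],))
b = threading.Thread(target=do_work, args=(["accounts", "orders"],))
a.start(); b.start(); a.join(); b.join()
print("no deadlock")
```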
Lock Wait Monitoring
find who's blocking whom
SELECT blocking.pid, blocked.pid AS waiting FROM pg_stat_activity blocked JOIN pg_stat_activity blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
Advisory Locks
application-level mutexes
Custom locks your application defines β€” Postgres doesn't know their meaning. Useful for preventing two workers from running the same job simultaneously.
SELECT pg_advisory_lock(12345);
-- do work
SELECT pg_advisory_unlock(12345);
💾
06 · Memory
Buffers & Caches
shared_buffers
Postgres's own page cache
The shared memory pool all backends share. Hot pages live here. Start at 25% of RAM. A cache hit means no disk I/O. Monitor hit ratio — should be >99% for OLTP:
SELECT round(blks_hit * 100.0 / (blks_hit + blks_read), 2) AS hit_ratio FROM pg_stat_database WHERE datname = current_database();
work_mem
per-sort / per-hash memory
Memory per sort operation or hash table. Each query can use multiple work_mem chunks. Set too low → spills to disk (slow). Set too high × many connections = OOM. Typical: 4–64 MB. Use SET LOCAL for specific heavy queries.
effective_cache_size
planner hint, not allocated
Tells the planner how much memory is available for caching (shared_buffers + OS page cache). Not actually allocated! Higher value nudges the planner toward index scans. Typical: 50–75% of RAM.
Buffer Eviction (Clock Sweep)
how Postgres manages its cache
When shared_buffers is full, Postgres uses a clock sweep algorithm to evict pages. Each buffer has a usage count. The sweeper decrements counts as it scans; it evicts the first buffer with count 0. Frequently accessed pages survive longer.
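The sweep can be simulated in miniature. An illustrative Python sketch (real Postgres caps usage_count at 5 and tracks pins; buffer and function names here are made up):

```python
# Toy clock-sweep eviction (illustrative, not Postgres source).
class Buffer:
    def __init__(self, page):
        self.page = page
        self.usage_count = 0

def access(buf: Buffer):
    # Each hit bumps the usage count, capped (as in Postgres) at 5.
    buf.usage_count = min(buf.usage_count + 1, 5)

def clock_sweep(buffers, hand=0):
    """Advance the hand, decrementing counts, until a count-0 victim
    is found. Hot pages need several full sweeps to reach zero."""
    while True:
        buf = buffers[hand]
        if buf.usage_count == 0:
            return hand  # evict this buffer
        buf.usage_count -= 1
        hand = (hand + 1) % len(buffers)

pool = [Buffer(p) for p in "ABCD"]
for _ in range(3):
    access(pool[0])   # page A is hot
access(pool[2])       # page C touched once

victim = clock_sweep(pool)
print(pool[victim].page)  # 'B': first buffer the hand finds at count 0
```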
📝
07 · Durability
WAL — Write-Ahead Log
What is WAL?
Before any data page is modified on disk, Postgres writes a log record first (write-ahead). If Postgres crashes mid-write, it replays WAL on restart to get back to a consistent state. WAL is stored in pg_wal/. It's also the foundation for replication — standbys stream WAL from primary.
Transaction COMMIT
→
WAL flushed to disk
→
Client gets "success"
→
Data pages written later
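The sequence above (log first, reply, apply data pages later, replay on crash) can be sketched as a toy key-value store. Illustrative Python only, nothing like the real WAL record format:

```python
# Minimal write-ahead logging sketch: the log is durable, the data pages
# are not; recovery rebuilds committed state by replaying the log.
wal: list[tuple[str, str, int]] = []   # durable (op, key, value) records
data: dict[str, int] = {}              # "data pages", written lazily

def commit(key: str, value: int):
    wal.append(("set", key, value))    # 1. WAL record flushed to disk
    # 2. client gets "success" here; the data page write can happen
    #    much later, or never, if we crash first.

def crash_and_recover() -> dict:
    recovered: dict[str, int] = {}     # lazy data pages were lost/stale
    for op, key, value in wal:         # replay WAL from last checkpoint
        if op == "set":
            recovered[key] = value
    return recovered

commit("balance", 100)
commit("balance", 250)
print(crash_and_recover())  # {'balance': 250}
```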
LSN
Log Sequence Number
A byte offset into the WAL stream (e.g. 0/3A4F2B0). Used to track replication lag, confirm data has been received, and coordinate crash recovery. Compare primary vs replica LSN to measure lag:
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS lag_bytes FROM pg_stat_replication;
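The LSN format itself is easy to decode by hand: the hex parts before and after the slash are the high and low 32 bits of a 64-bit byte position in the WAL stream. A small Python re-implementation of the pg_wal_lsn_diff arithmetic, for illustration:

```python
# Decode 'high/low' LSNs into 64-bit byte offsets (illustrative helper).
def lsn_to_bytes(lsn: str) -> int:
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)

def lsn_diff(a: str, b: str) -> int:
    """Same arithmetic as pg_wal_lsn_diff(a, b): bytes between two LSNs."""
    return lsn_to_bytes(a) - lsn_to_bytes(b)

print(lsn_to_bytes("0/3A4F2B0"))      # 61141680
print(lsn_diff("1/0", "0/FF000000"))  # 16777216 bytes of lag
```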
Checkpoint
flushing dirty pages to disk
A checkpoint flushes all dirty shared_buffers to disk and writes a checkpoint record in WAL. Crash recovery only needs to replay WAL from the last checkpoint. Triggered by checkpoint_timeout or max_wal_size; checkpoint_completion_target spreads the write burst over time.
WAL Level
minimal · replica · logical
minimal — bare minimum for crash recovery.
replica — enough for physical standby replication (default).
logical — needed for logical replication, the pgoutput decoder, and CDC tools like Debezium.
synchronous_commit
durability vs speed dial
Controls when Postgres confirms a COMMIT to the client:
on (default) — WAL flushed before reply.
off — reply immediately, WAL flushed in the background. Risk: up to about three times wal_writer_delay (~600 ms at the default 200 ms) of data loss on crash. No corruption risk — the database stays consistent; only the most recently committed transactions can be lost.
🔄
08 · Replication
HA & Standbys
Streaming Replication
physical, byte-for-byte copy
Standby connects to primary and streams WAL records in real time. The standby is a binary copy — same Postgres major version required. Standbys can serve read-only queries (hot_standby = on).
Logical Replication
row-level, selective
Replicates individual DML changes (INSERT/UPDATE/DELETE) decoded from WAL. Can replicate specific tables and work across major versions; the same logical decoding layer also feeds CDC tools that target non-Postgres systems. Publisher → Subscriber model. Useful for zero-downtime major upgrades.
Replication Slot
⚠️ WAL retention mechanism
A slot makes the primary retain WAL until the subscriber consumes it. If a subscriber goes offline for a long time, WAL accumulates and can fill your disk. Monitor pg_replication_slots closely — drop stale slots!
⚠️
Disk full risk: An inactive replication slot will make Postgres keep WAL forever.
Failover vs Switchover
planned vs unplanned promotion
Switchover — planned, graceful. Primary is demoted cleanly, standby promoted. Zero data loss.

Failover — unplanned, primary is dead. Standby is promoted. Potential data loss equal to replication lag at time of failure.
pg_basebackup
base backup for new standbys
Creates a consistent physical backup of a running Postgres cluster. Used to provision new standbys or as a base for PITR restores. Includes WAL needed to make the backup consistent on replay.
🔍
09 · Indexes
Index Types & Usage
B-tree
the default, for =, <, >, BETWEEN
Balanced tree. Good for equality and range queries on any orderable type. Supports ORDER BY without a sort step. Use for 90% of indexes.
GIN
Generalized Inverted Index
Indexes elements within a value — arrays, JSONB keys, full-text tsvector. Slow to build and update, fast to query. Use for @>, &&, @@ operators.
GiST / BRIN
geometry / sorted large tables
GiST — for geometric types, PostGIS, full-text, exclusion constraints. Lossy (checks heap).

BRIN — tiny index storing min/max per page range. Only good when data is physically sorted (time-series, log tables).
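BRIN's pruning logic is small enough to sketch. A toy Python model over a flat list (real BRIN summarizes heap page ranges and supports operator classes beyond min/max; all names here are made up):

```python
# Toy BRIN: one (min, max) summary per fixed-size range of "pages"; a scan
# skips every range whose summary can't contain the search value.
def build_brin(values: list[int], pages_per_range: int) -> list[tuple[int, int]]:
    summaries = []
    for i in range(0, len(values), pages_per_range):
        chunk = values[i:i + pages_per_range]
        summaries.append((min(chunk), max(chunk)))
    return summaries

def candidate_ranges(brin, target: int) -> list[int]:
    return [i for i, (lo, hi) in enumerate(brin) if lo <= target <= hi]

# Physically sorted data, like a timestamp column in an append-only log:
sorted_col = list(range(100))
print(candidate_ranges(build_brin(sorted_col, 25), 42))      # [1]: 1 of 4 ranges

# Poor physical ordering widens every summary, so pruning degrades:
interleaved = sorted_col[::2] + sorted_col[1::2]
print(candidate_ranges(build_brin(interleaved, 25), 42))     # [0, 2]
```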
Partial Index
index a subset of rows
Index only rows matching a WHERE clause. Smaller, faster, and allows unique constraints on subsets:
CREATE INDEX idx_active_users ON users(email) WHERE status = 'active';
Index-Only Scan
heap access avoided
If all columns needed by a query are in the index, Postgres reads only the index without touching the heap — very fast. Requires the VM to confirm tuples are visible. This is why VACUUM matters for performance.
CREATE INDEX CONCURRENTLY
zero-downtime index builds
Builds the index without blocking writes to the table. Takes longer and makes two passes. If it fails, it leaves an INVALID index behind — check pg_index.indisvalid and drop/rebuild.
SELECT c.relname, i.indisvalid FROM pg_index i JOIN pg_class c ON c.oid = i.indexrelid WHERE NOT i.indisvalid;
🤖
10 · Automation
Autovacuum — The Background Daemon
How Autovacuum Decides to Run
Autovacuum fires on a table when dead tuples exceed this threshold:
threshold = autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × reltuples
-- default: 50 + 0.2 × table_rows
-- so a 1M row table triggers at 200,050 dead tuples
For large, hot tables: set per-table storage parameters to lower the threshold:
ALTER TABLE big_table SET (autovacuum_vacuum_scale_factor = 0.01);
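The same formula as runnable arithmetic, with defaults matching autovacuum_vacuum_threshold = 50 and autovacuum_vacuum_scale_factor = 0.2 (function name is illustrative):

```python
# Dead-tuple count at which autovacuum fires for a table of a given size.
def autovacuum_trigger(reltuples: int,
                       threshold: int = 50,
                       scale_factor: float = 0.2) -> float:
    return threshold + scale_factor * reltuples

print(autovacuum_trigger(1_000_000))                     # 200050.0
# After the ALTER TABLE above lowers the scale factor to 0.01:
print(autovacuum_trigger(1_000_000, scale_factor=0.01))  # 10050.0
```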
autovacuum_max_workers
parallel vacuum workers
How many autovacuum workers can run simultaneously (default 3). Each worker handles one table. On busy databases with many tables, increase this. Workers are throttled by autovacuum_vacuum_cost_delay to be gentle on I/O.
Monitoring Autovacuum
is it keeping up?
-- Last vacuum/analyze times per table
SELECT relname, last_autovacuum, last_autoanalyze, n_dead_tup, n_live_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;
Autovacuum & Long Transactions
the nemesis
A long-running transaction prevents autovacuum from removing dead tuples created before it started — those rows are still "potentially visible" to the old transaction. This causes table bloat fast. Kill idle-in-transaction sessions promptly.
📈
11 · Bloat
Table & Index Bloat
Table Bloat
wasted space in the heap
Dead tuples not yet reclaimed by VACUUM. The file grows but never shrinks (unless VACUUM FULL or pg_repack). Bloat causes more I/O: Postgres reads pages full of dead rows looking for live ones. High bloat = slow sequential scans.
Index Bloat
indexes bloat too!
B-tree index leaf pages accumulate dead entries. Regular VACUUM cleans index entries but doesn't shrink the tree. Use REINDEX CONCURRENTLY to rebuild a bloated index online without downtime.
pg_repack
online table/index compaction
Extension that rebuilds a table in the background and swaps it in with minimal locking — unlike VACUUM FULL which locks the whole time. The go-to tool for reclaiming bloat in production without a maintenance window.
FSM — Free Space Map
the recycling bin registry
Tracks how much free space exists in each page. After VACUUM marks dead tuples as free, it updates the FSM. INSERTs consult the FSM to find pages with room, avoiding writing to the end of the file constantly.
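The INSERT-side lookup can be sketched as a mapping of per-page free bytes. Illustrative Python only; the real FSM is a tree in the _fsm fork with coarse, roughly 1/256th-of-a-page granularity:

```python
# Toy free space map: page number -> free bytes. An INSERT asks for any
# page with enough room instead of always extending the file.
class FreeSpaceMap:
    def __init__(self, page_size: int = 8192):
        self.free = {}                 # page number -> free bytes
        self.page_size = page_size

    def record_free(self, page: int, free_bytes: int):
        self.free[page] = free_bytes   # updated by VACUUM

    def find_page(self, needed: int):
        for page, avail in self.free.items():
            if avail >= needed:
                return page
        return None                    # no room anywhere: extend the file

fsm = FreeSpaceMap()
fsm.record_free(0, 120)
fsm.record_free(7, 4000)   # VACUUM reclaimed dead tuples on page 7
print(fsm.find_page(2000)) # 7: reuse recycled space, file doesn't grow
```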
🧠
12 · Query Planner
EXPLAIN & Query Optimization
EXPLAIN ANALYZE — your most important tool
Shows the actual execution plan with real timing. Never tune without it.
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT * FROM orders WHERE user_id = 42;
Key things to look for: Rows estimate vs actual (big gap = stale stats), Seq Scan on large tables (missing index?), Hash Batches > 1 (work_mem too low), Buffers: hit vs read (cache misses).
Seq Scan vs Index Scan
planner chooses based on cost
The planner chooses a Seq Scan when it's cheaper — for example, if it estimates most of the table will be returned, reading sequentially is faster than random index lookups. A Seq Scan on a large table isn't always wrong.
Join Types
nested loop Β· hash join Β· merge join
Nested Loop — outer row × inner scan. Good for small tables or an indexed inner side.
Hash Join — build a hash table of the smaller side, probe with the larger. Good for large unsorted sets.
Merge Join — both inputs must be sorted. Good when inputs are pre-sorted via an index.
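The hash join's two phases (build, then probe) can be sketched in Python. Illustrative only, with made-up table data:

```python
# Hash join sketch: build a hash table on the smaller input, then probe it
# once per row of the larger input. Matching rows are merged into output.
def hash_join(small, large, key_small, key_large):
    buckets = {}
    for row in small:                     # build phase: O(len(small))
        buckets.setdefault(row[key_small], []).append(row)
    out = []
    for row in large:                     # probe phase: O(len(large))
        for match in buckets.get(row[key_large], []):
            out.append({**match, **row})
    return out

users = [{"user_id": 1, "name": "ada"}, {"user_id": 2, "name": "lin"}]
orders = [{"order_id": 10, "user_id": 2}, {"order_id": 11, "user_id": 2}]
print(hash_join(users, orders, "user_id", "user_id"))
# [{'user_id': 2, 'name': 'lin', 'order_id': 10},
#  {'user_id': 2, 'name': 'lin', 'order_id': 11}]
```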
pg_stat_statements
query performance history
Extension that tracks query performance across all executions — total time, calls, rows. Essential for finding which queries to optimize first:
SELECT query, calls, mean_exec_time, total_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;

PostgreSQL DBA Field Guide · Internal Concepts Reference

PostgreSQL 16+ DBA Edition