PostgreSQL DBA Field Guide
PostgreSQL Internals
Core concepts every PostgreSQL DBA must know — explained in plain English, no jargon fog.

🗄️
01 · Storage Layer
How Postgres Stores Data
Tuple
aka row / record
A tuple is simply one row in a table. Postgres calls rows "tuples" internally. Every tuple has a hidden header (t_xmin, t_xmax) that tracks which transaction created and deleted it — the foundation of MVCC.
Page / Block
8 KB default unit
Postgres reads/writes data in 8 KB pages (also called blocks). A table file is a sequence of pages. Each page holds multiple tuples. Changing one byte forces the whole 8 KB page to be written to disk.
Heap
the table's main file
The heap is the raw data file for a table — tuples stored in no particular order. Unlike B-tree indexes, the heap has no sorted structure. pg_relation_filepath('mytable') shows you the file on disk.
OID / Relfilenode
internal object identifiers
Every database object has an OID (Object ID). The relfilenode is the actual filename on disk (a number like 16384). After CLUSTER or VACUUM FULL, the relfilenode changes even though the table name stays the same.
TOAST
The Oversized-Attribute Storage Technique
When a column value is too large to fit in a page, Postgres automatically compresses or moves it to a separate TOAST table. Large TEXT, JSONB, or BYTEA values live there. Transparent to SQL, important for bloat investigations.
Fork
relation file sub-types
Each relation (table/index) is stored in up to four forks on disk:
• Main fork — actual data
• FSM fork — Free Space Map (which pages have room)
• VM fork — Visibility Map (which pages are all-visible)
• Init fork — empty template used to reset unlogged relations after a crash
You'll see suffixes like _fsm, _vm, _init in the data directory.
πŸ‘οΈ
02 Β· Concurrency
MVCC β€” Multi-Version Concurrency Control
What is MVCC?
Postgres never overwrites a row in place. Instead, an UPDATE creates a new tuple version and marks the old one as dead. This means readers never block writers and writers never block readers. Each transaction sees a snapshot of the database from when it started — like a photograph taken at that moment in time.
Row v1 (xmin=100)
→ UPDATE →
Row v2 (xmin=201)
+
Row v1 (xmax=201) dead
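The diagram above can be run as a toy model. A minimal Python sketch (not Postgres source: it ignores commit status, hint bits, and subtransactions) in which each tuple carries xmin/xmax and a snapshot XID decides visibility; the `Tuple` and `visible` names are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class Tuple:
    data: str
    xmin: int      # XID that created this version
    xmax: int = 0  # XID that deleted/updated it (0 = still live)

def visible(t: Tuple, snapshot_xid: int) -> bool:
    """Toy rule: visible if created before our snapshot and not yet
    deleted, or deleted only after our snapshot was taken."""
    created_before = t.xmin <= snapshot_xid
    not_deleted = t.xmax == 0 or t.xmax > snapshot_xid
    return created_before and not_deleted

# An UPDATE in XID 201 keeps both versions in the heap:
heap = [
    Tuple("v1", xmin=100, xmax=201),  # old version, now dead
    Tuple("v2", xmin=201),            # new version
]

# A snapshot taken at XID 150 still sees v1; one taken at 300 sees v2.
print([t.data for t in heap if visible(t, 150)])  # ['v1']
print([t.data for t in heap if visible(t, 300)])  # ['v2']
```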
xmin / xmax
hidden system columns
xmin = Transaction ID that inserted this tuple. xmax = Transaction ID that deleted/updated this tuple (0 means it's still live). You can see them with:
SELECT xmin, xmax, * FROM mytable;
Transaction ID (XID)
32-bit counter, wraps around!
Every transaction that writes gets a monotonically increasing XID. It's a 32-bit integer, so the usable window is about 2 billion XIDs before comparisons wrap. This is the famous XID wraparound problem: if old tuples aren't frozen in time, Postgres escalates to aggressive autovacuums and eventually refuses new write transactions until a vacuum completes.
⚠️
Monitor this! age(datfrozenxid) must stay below 2 billion. Alert at 500 million.
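The wraparound arithmetic can be sketched directly: XIDs live in a circular 32-bit space, so "age" is modular distance between the current XID and a table's oldest unfrozen XID. Illustrative Python, not Postgres source:

```python
# Circular 32-bit XID arithmetic (illustrative sketch, not Postgres source).
XID_MAX = 2**32

def xid_age(current_xid: int, frozen_xid: int) -> int:
    """How many transactions ago frozen_xid was, in modulo-2^32 space.
    This is the quantity age(datfrozenxid) reports."""
    return (current_xid - frozen_xid) % XID_MAX

# The counter restarts after 2^32, but circular distance still works:
print(xid_age(1_000, 4_294_966_296))      # 2000: wrapped past 2^32
print(xid_age(600_000_000, 100_000_000))  # 500000000: alert threshold hit
```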
Snapshot
your transaction's "world view"
When your transaction starts, Postgres takes a snapshot listing all currently active XIDs. Your query only sees tuples committed before your snapshot. Rows inserted by concurrent transactions are invisible to you until you start a new transaction.
Isolation Levels
READ COMMITTED · REPEATABLE READ · SERIALIZABLE
READ COMMITTED (default) — new snapshot per statement. You can see rows committed between statements in the same transaction.

REPEATABLE READ — snapshot taken at first statement, held for the whole transaction.

SERIALIZABLE — full Serializable Snapshot Isolation (SSI), as if transactions ran one by one.
🧹
03 · Housekeeping
VACUUM — The Garbage Collector
Why VACUUM Exists
Because MVCC creates dead tuples on every UPDATE/DELETE, someone has to clean them up. VACUUM marks dead tuple space as reusable (but doesn't shrink the file). Think of it as "marking empty seats in a theater" — others can sit there but the theater size stays the same.
Key point: Regular VACUUM does NOT return space to the OS. Only VACUUM FULL does (but it locks the table).
VACUUM
non-blocking cleanup
Marks dead tuples as free space, updates the FSM and VM, and advances relfrozenxid. Runs concurrently with reads and most writes.
VACUUM (VERBOSE, ANALYZE) mytable;
VACUUM FULL
⚠️ table-locking rewrite
Rewrites the entire table to a new file, reclaiming all dead space and returning it to the OS. Acquires an ACCESS EXCLUSIVE lock — no reads or writes during this time. Use pg_repack instead in production for zero-downtime compaction.
ANALYZE
stats collector for the planner
Samples the table and updates pg_statistic with column data distributions (histograms, most-common-values). The query planner uses these stats to pick the best plan. Stale stats = bad plans = slow queries.
Freeze
XID wraparound prevention
VACUUM freezes old tuples by replacing their xmin with a special FrozenXID. Frozen tuples are visible to all future transactions, preventing wraparound. VACUUM FREEZE forces this for the whole table regardless of age.
Visibility Map (VM)
vacuum's shortcut
A bitmap with two bits per heap page: an all-visible bit (every tuple on the page is visible to all transactions) and an all-frozen bit (every tuple is frozen). VACUUM skips pages whose bits are already set. Index-only scans also use the VM to avoid heap fetches.
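The two-bit scheme can be modeled in a few lines. An illustrative Python sketch (the real VM is a packed bitmap stored in the _vm fork; the class and method names here are made up):

```python
# Toy visibility map: two bits per heap page (illustrative, not the on-disk
# format). Bit meanings follow the description above.
ALL_VISIBLE = 0b01
ALL_FROZEN  = 0b10

class VisibilityMap:
    def __init__(self, n_pages: int):
        self.bits = [0] * n_pages

    def set_all_visible(self, page: int):
        self.bits[page] |= ALL_VISIBLE

    def set_all_frozen(self, page: int):
        # A page can only be all-frozen if it is also all-visible.
        self.bits[page] |= ALL_VISIBLE | ALL_FROZEN

    def aggressive_vacuum_can_skip(self, page: int) -> bool:
        # An anti-wraparound vacuum may skip pages already fully frozen.
        return self.bits[page] & ALL_FROZEN != 0

vm = VisibilityMap(4)
vm.set_all_visible(0)   # page 0: visible but not frozen -> must still freeze
vm.set_all_frozen(1)    # page 1: nothing left to do
print([vm.aggressive_vacuum_can_skip(p) for p in range(4)])
# [False, True, False, False]
```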
⚡
04 · Transactions
ACID & Transaction Management
ACID
the four guarantees
Atomicity — all or nothing.
Consistency — constraints always satisfied.
Isolation — concurrent transactions don't interfere.
Durability — committed data survives crashes (via WAL).
Savepoint
partial rollback
A bookmark inside a transaction. If something fails, you roll back to the savepoint instead of the whole transaction. Used heavily in ORMs to handle expected errors without aborting the whole session.
SAVEPOINT my_save;
-- do risky stuff
ROLLBACK TO my_save;
Two-Phase Commit (2PC)
distributed transactions
For distributed systems. Phase 1: PREPARE TRANSACTION (durable write, not yet committed). Phase 2: COMMIT PREPARED. If the coordinator crashes between phases, pg_prepared_xacts holds the zombie transaction. Monitor and clean these!
Idle in Transaction
the silent killer
A connection that's opened a transaction but isn't doing anything. Holds locks, prevents VACUUM from cleaning rows, bloats the table. Find them with:
SELECT pid, state, now() - xact_start FROM pg_stat_activity WHERE state = 'idle in transaction';
🔒
05 · Locking
Locks & Blocking
Lock Levels (lightest → heaviest)
Lock · Who Takes It · Blocks
ACCESS SHARE · SELECT · only DROP TABLE
ROW SHARE · SELECT FOR UPDATE · DROP, ALTER
ROW EXCLUSIVE · INSERT/UPDATE/DELETE · ALTER TABLE, LOCK
SHARE UPDATE EXCLUSIVE · VACUUM, CREATE INDEX CONCURRENTLY · other VACUUMs, ALTER
SHARE · CREATE INDEX (non-concurrent) · all writes
ACCESS EXCLUSIVE · DROP, TRUNCATE, VACUUM FULL, some ALTERs · everything including SELECT
Deadlock
circular lock wait
Session A waits for Session B's lock, and Session B waits for Session A's lock. Postgres detects this within deadlock_timeout (default 1s) and kills one transaction. Fix: always acquire locks in a consistent order in your application code.
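The "consistent order" fix looks like this in application code. A Python sketch with made-up resource names; the same idea applies to the order in which you touch rows or tables in SQL:

```python
import threading

# Sketch: always acquire locks in one global order (here: sorted by name)
# so two workers can never each hold a lock the other is waiting on.
locks = {"accounts": threading.Lock(), "orders": threading.Lock()}

def do_work(touching: list[str]):
    # Sorting gives every worker the same acquisition order, so a
    # circular wait (A waits on B, B waits on A) cannot form.
    for name in sorted(touching):
        locks[name].acquire()
    try:
        pass  # perform the updates here
    finally:
        for name in sorted(touching, reverse=True):
            locks[name].release()

# Worker A asks for (orders, accounts), worker B for (accounts, orders):
# both actually lock 'accounts' first, so neither can deadlock the other.
a = threading.Thread(target=do_work, args=(["orders", "accounts"],))
b = threading.Thread(target=do_work, args=(["accounts", "orders"],))
a.start(); b.start(); a.join(); b.join()
print("no deadlock")
```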
Lock Wait Monitoring
find who's blocking whom
SELECT blocking.pid, blocked.pid AS waiting FROM pg_stat_activity blocked JOIN pg_stat_activity blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
Advisory Locks
application-level mutexes
Custom locks your application defines β€” Postgres doesn't know their meaning. Useful for preventing two workers from running the same job simultaneously.
SELECT pg_advisory_lock(12345);
-- do work
SELECT pg_advisory_unlock(12345);
💾
06 · Memory
Buffers & Caches
shared_buffers
Postgres's own page cache
The shared memory pool all backends share. Hot pages live here. Start at 25% of RAM. A cache hit means no disk I/O. Monitor hit ratio — should be >99% for OLTP:
SELECT round(blks_hit * 100.0 / (blks_hit + blks_read), 2) AS hit_ratio FROM pg_stat_database WHERE datname = current_database();
work_mem
per-sort / per-hash memory
Memory per sort operation or hash table. Each query can use multiple work_mem chunks. Set too low → spills to disk (slow). Set too high × many connections = OOM. Typical: 4–64 MB. Use SET LOCAL for specific heavy queries.
effective_cache_size
planner hint, not allocated
Tells the planner how much memory is available for caching (shared_buffers + OS page cache). Not actually allocated! Higher value nudges the planner toward index scans. Typical: 50–75% of RAM.
Buffer Eviction (Clock Sweep)
how Postgres manages its cache
When shared_buffers is full, Postgres uses a clock sweep algorithm to evict pages. Each buffer has a usage count. The sweeper decrements counts as it scans; it evicts the first buffer with count 0. Frequently accessed pages survive longer.
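The sweep can be simulated in miniature. An illustrative Python sketch (real Postgres caps usage_count at 5 and tracks pins; buffer and function names here are made up):

```python
# Toy clock-sweep eviction (illustrative, not Postgres source).
class Buffer:
    def __init__(self, page):
        self.page = page
        self.usage_count = 0

def access(buf: Buffer):
    # Each hit bumps the usage count, capped (as in Postgres) at 5.
    buf.usage_count = min(buf.usage_count + 1, 5)

def clock_sweep(buffers, hand=0):
    """Advance the hand, decrementing counts, until a count-0 victim
    is found. Hot pages need several full sweeps to reach zero."""
    while True:
        buf = buffers[hand]
        if buf.usage_count == 0:
            return hand  # evict this buffer
        buf.usage_count -= 1
        hand = (hand + 1) % len(buffers)

pool = [Buffer(p) for p in "ABCD"]
for _ in range(3):
    access(pool[0])   # page A is hot
access(pool[2])       # page C touched once

victim = clock_sweep(pool)
print(pool[victim].page)  # 'B': first buffer the hand finds at count 0
```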
📝
07 · Durability
WAL — Write-Ahead Log
What is WAL?
Before any data page is modified on disk, Postgres writes a log record first (write-ahead). If Postgres crashes mid-write, it replays WAL on restart to get back to a consistent state. WAL is stored in pg_wal/. It's also the foundation for replication — standbys stream WAL from primary.
Transaction COMMIT
→
WAL flushed to disk
→
Client gets "success"
→
Data pages written later
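The sequence above (log first, reply, apply data pages later, replay on crash) can be sketched as a toy key-value store. Illustrative Python only, nothing like the real WAL record format:

```python
# Minimal write-ahead logging sketch: the log is durable, the data pages
# are not; recovery rebuilds committed state by replaying the log.
wal: list[tuple[str, str, int]] = []   # durable (op, key, value) records
data: dict[str, int] = {}              # "data pages", written lazily

def commit(key: str, value: int):
    wal.append(("set", key, value))    # 1. WAL record flushed to disk
    # 2. client gets "success" here; the data page write can happen
    #    much later, or never, if we crash first.

def crash_and_recover() -> dict:
    recovered: dict[str, int] = {}     # lazy data pages were lost/stale
    for op, key, value in wal:         # replay WAL from last checkpoint
        if op == "set":
            recovered[key] = value
    return recovered

commit("balance", 100)
commit("balance", 250)
print(crash_and_recover())  # {'balance': 250}
```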
LSN
Log Sequence Number
A byte offset into the WAL stream (e.g. 0/3A4F2B0). Used to track replication lag, confirm data has been received, and coordinate crash recovery. Compare primary vs replica LSN to measure lag:
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS lag_bytes FROM pg_stat_replication;
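The LSN format itself is easy to decode by hand: the hex parts before and after the slash are the high and low 32 bits of a 64-bit byte position in the WAL stream. A small Python re-implementation of the pg_wal_lsn_diff arithmetic, for illustration:

```python
# Decode 'high/low' LSNs into 64-bit byte offsets (illustrative helper).
def lsn_to_bytes(lsn: str) -> int:
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)

def lsn_diff(a: str, b: str) -> int:
    """Same arithmetic as pg_wal_lsn_diff(a, b): bytes between two LSNs."""
    return lsn_to_bytes(a) - lsn_to_bytes(b)

print(lsn_to_bytes("0/3A4F2B0"))      # 61141680
print(lsn_diff("1/0", "0/FF000000"))  # 16777216 bytes of lag
```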
Checkpoint
flushing dirty pages to disk
A checkpoint flushes all dirty shared_buffers to disk and writes a checkpoint record in WAL. Crash recovery only needs to replay WAL from the last checkpoint. Triggered by checkpoint_timeout or max_wal_size; checkpoint_completion_target spreads the write burst over time.
WAL Level
minimal · replica · logical
minimal — bare minimum for crash recovery.
replica — enough for physical standby replication (default).
logical — needed for logical replication, the pgoutput decoder, and CDC tools like Debezium.
synchronous_commit
durability vs speed dial
Controls when Postgres confirms a COMMIT to the client:
on (default) — WAL flushed before reply.
off — reply immediately, WAL flushed in the background. Risk: up to about three times wal_writer_delay (~600 ms at the default 200 ms) of data loss on crash. No corruption risk — the database stays consistent; only the most recently committed transactions can be lost.
🔄
08 · Replication
HA & Standbys
Streaming Replication
physical, byte-for-byte copy
Standby connects to primary and streams WAL records in real time. The standby is a binary copy — same Postgres major version required. Standbys can serve read-only queries (hot_standby = on).
Logical Replication
row-level, selective
Replicates individual DML changes (INSERT/UPDATE/DELETE) decoded from WAL. Can replicate specific tables and work across major versions; the same logical decoding layer also feeds CDC tools that target non-Postgres systems. Publisher → Subscriber model. Useful for zero-downtime major upgrades.
Replication Slot
⚠️ WAL retention mechanism
A slot makes the primary retain WAL until the subscriber consumes it. If a subscriber goes offline for a long time, WAL accumulates and can fill your disk. Monitor pg_replication_slots closely — drop stale slots!
⚠️
Disk full risk: An inactive replication slot will make Postgres keep WAL forever.
Failover vs Switchover
planned vs unplanned promotion
Switchover — planned, graceful. Primary is demoted cleanly, standby promoted. Zero data loss.

Failover — unplanned, primary is dead. Standby is promoted. Potential data loss equal to replication lag at time of failure.
pg_basebackup
base backup for new standbys
Creates a consistent physical backup of a running Postgres cluster. Used to provision new standbys or as a base for PITR restores. Includes WAL needed to make the backup consistent on replay.
🔍
09 · Indexes
Index Types & Usage
B-tree
the default, for =, <, >, BETWEEN
Balanced tree. Good for equality and range queries on any orderable type. Supports ORDER BY without a sort step. Use for 90% of indexes.
GIN
Generalized Inverted Index
Indexes elements within a value — arrays, JSONB keys, full-text tsvector. Slow to build and update, fast to query. Use for @>, &&, @@ operators.
GiST / BRIN
geometry / sorted large tables
GiST — for geometric types, PostGIS, full-text, exclusion constraints. Lossy (checks heap).

BRIN — tiny index storing min/max per page range. Only good when data is physically sorted (time-series, log tables).
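BRIN's pruning logic is small enough to sketch. A toy Python model over a flat list (real BRIN summarizes heap page ranges and supports operator classes beyond min/max; all names here are made up):

```python
# Toy BRIN: one (min, max) summary per fixed-size range of "pages"; a scan
# skips every range whose summary can't contain the search value.
def build_brin(values: list[int], pages_per_range: int) -> list[tuple[int, int]]:
    summaries = []
    for i in range(0, len(values), pages_per_range):
        chunk = values[i:i + pages_per_range]
        summaries.append((min(chunk), max(chunk)))
    return summaries

def candidate_ranges(brin, target: int) -> list[int]:
    return [i for i, (lo, hi) in enumerate(brin) if lo <= target <= hi]

# Physically sorted data, like a timestamp column in an append-only log:
sorted_col = list(range(100))
print(candidate_ranges(build_brin(sorted_col, 25), 42))      # [1]: 1 of 4 ranges

# Poor physical ordering widens every summary, so pruning degrades:
interleaved = sorted_col[::2] + sorted_col[1::2]
print(candidate_ranges(build_brin(interleaved, 25), 42))     # [0, 2]
```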
Partial Index
index a subset of rows
Index only rows matching a WHERE clause. Smaller, faster, and allows unique constraints on subsets:
CREATE INDEX idx_active_users ON users(email) WHERE status = 'active';
Index-Only Scan
heap access avoided
If all columns needed by a query are in the index, Postgres reads only the index without touching the heap — very fast. Requires the VM to confirm tuples are visible. This is why VACUUM matters for performance.
CREATE INDEX CONCURRENTLY
zero-downtime index builds
Builds the index without blocking writes to the table. Takes longer and makes two passes. If it fails, it leaves an INVALID index behind — check pg_index.indisvalid and drop/rebuild.
SELECT c.relname, i.indisvalid FROM pg_index i JOIN pg_class c ON c.oid = i.indexrelid WHERE NOT i.indisvalid;
🤖
10 · Automation
Autovacuum — The Background Daemon
How Autovacuum Decides to Run
Autovacuum fires on a table when dead tuples exceed this threshold:
threshold = autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × reltuples
-- default: 50 + 0.2 × table_rows
-- so a 1M row table triggers at 200,050 dead tuples
For large, hot tables: set per-table storage parameters to lower the threshold:
ALTER TABLE big_table SET (autovacuum_vacuum_scale_factor = 0.01);
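The same formula as runnable arithmetic, with defaults matching autovacuum_vacuum_threshold = 50 and autovacuum_vacuum_scale_factor = 0.2 (function name is illustrative):

```python
# Dead-tuple count at which autovacuum fires for a table of a given size.
def autovacuum_trigger(reltuples: int,
                       threshold: int = 50,
                       scale_factor: float = 0.2) -> float:
    return threshold + scale_factor * reltuples

print(autovacuum_trigger(1_000_000))                     # 200050.0
# After the ALTER TABLE above lowers the scale factor to 0.01:
print(autovacuum_trigger(1_000_000, scale_factor=0.01))  # 10050.0
```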
autovacuum_max_workers
parallel vacuum workers
How many autovacuum workers can run simultaneously (default 3). Each worker handles one table. On busy databases with many tables, increase this. Workers are throttled by autovacuum_vacuum_cost_delay to be gentle on I/O.
Monitoring Autovacuum
is it keeping up?
-- Last vacuum/analyze times per table
SELECT relname, last_autovacuum, last_autoanalyze, n_dead_tup, n_live_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;
Autovacuum & Long Transactions
the nemesis
A long-running transaction prevents autovacuum from removing dead tuples created before it started — those rows are still "potentially visible" to the old transaction. This causes table bloat fast. Kill idle-in-transaction sessions promptly.
📈
11 · Bloat
Table & Index Bloat
Table Bloat
wasted space in the heap
Dead tuples not yet reclaimed by VACUUM. The file grows but never shrinks (unless VACUUM FULL or pg_repack). Bloat causes more I/O: Postgres reads pages full of dead rows looking for live ones. High bloat = slow sequential scans.
Index Bloat
indexes bloat too!
B-tree index leaf pages accumulate dead entries. Regular VACUUM cleans index entries but doesn't shrink the tree. Use REINDEX CONCURRENTLY to rebuild a bloated index online without downtime.
pg_repack
online table/index compaction
Extension that rebuilds a table in the background and swaps it in with minimal locking — unlike VACUUM FULL which locks the whole time. The go-to tool for reclaiming bloat in production without a maintenance window.
FSM — Free Space Map
the recycling bin registry
Tracks how much free space exists in each page. After VACUUM marks dead tuples as free, it updates the FSM. INSERTs consult the FSM to find pages with room, avoiding writing to the end of the file constantly.
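The INSERT-side lookup can be sketched as a mapping of per-page free bytes. Illustrative Python only; the real FSM is a tree in the _fsm fork with coarse, roughly 1/256th-of-a-page granularity:

```python
# Toy free space map: page number -> free bytes. An INSERT asks for any
# page with enough room instead of always extending the file.
class FreeSpaceMap:
    def __init__(self, page_size: int = 8192):
        self.free = {}                 # page number -> free bytes
        self.page_size = page_size

    def record_free(self, page: int, free_bytes: int):
        self.free[page] = free_bytes   # updated by VACUUM

    def find_page(self, needed: int):
        for page, avail in self.free.items():
            if avail >= needed:
                return page
        return None                    # no room anywhere: extend the file

fsm = FreeSpaceMap()
fsm.record_free(0, 120)
fsm.record_free(7, 4000)   # VACUUM reclaimed dead tuples on page 7
print(fsm.find_page(2000)) # 7: reuse recycled space, file doesn't grow
```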
🧠
12 · Query Planner
EXPLAIN & Query Optimization
EXPLAIN ANALYZE — your most important tool
Shows the actual execution plan with real timing. Never tune without it.
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT * FROM orders WHERE user_id = 42;
Key things to look for: Rows estimate vs actual (big gap = stale stats), Seq Scan on large tables (missing index?), Hash Batches > 1 (work_mem too low), Buffers: hit vs read (cache misses).
Seq Scan vs Index Scan
planner chooses based on cost
The planner chooses a Seq Scan when it's cheaper — for example, if it estimates most of the table will be returned, reading sequentially is faster than random index lookups. A Seq Scan on a large table isn't always wrong.
Join Types
nested loop Β· hash join Β· merge join
Nested Loop — outer row × inner scan. Good for small tables or an indexed inner side.
Hash Join — build a hash table of the smaller side, probe with the larger. Good for large unsorted sets.
Merge Join — both inputs must be sorted. Good when inputs are pre-sorted via an index.
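The hash join's two phases (build, then probe) can be sketched in Python. Illustrative only, with made-up table data:

```python
# Hash join sketch: build a hash table on the smaller input, then probe it
# once per row of the larger input. Matching rows are merged into output.
def hash_join(small, large, key_small, key_large):
    buckets = {}
    for row in small:                     # build phase: O(len(small))
        buckets.setdefault(row[key_small], []).append(row)
    out = []
    for row in large:                     # probe phase: O(len(large))
        for match in buckets.get(row[key_large], []):
            out.append({**match, **row})
    return out

users = [{"user_id": 1, "name": "ada"}, {"user_id": 2, "name": "lin"}]
orders = [{"order_id": 10, "user_id": 2}, {"order_id": 11, "user_id": 2}]
print(hash_join(users, orders, "user_id", "user_id"))
# [{'user_id': 2, 'name': 'lin', 'order_id': 10},
#  {'user_id': 2, 'name': 'lin', 'order_id': 11}]
```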
pg_stat_statements
query performance history
Extension that tracks query performance across all executions — total time, calls, rows. Essential for finding which queries to optimize first:
SELECT query, calls, mean_exec_time, total_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;

PostgreSQL DBA Field Guide · Internal Concepts Reference

PostgreSQL 16+ DBA Edition