PostgreSQL Architecture & How SELECT, INSERT, UPDATE, DELETE Work Internally | Bright DBA
PostgreSQL Internals · Deep Dive
PostgreSQL Architecture & How Queries Work Inside
A complete technical guide to PostgreSQL's process, memory, and storage architecture —
and a step-by-step explanation of exactly what happens when you run
SELECT, INSERT, UPDATE, or DELETE. With Oracle comparisons throughout.
Three-layer architecture: Client · Server Processes · Storage
PostgreSQL follows a client-server, multi-process architecture. Unlike MySQL's multi-threaded model
or Oracle's mix of threads and processes, PostgreSQL spawns a dedicated OS process for each client connection.
The architecture has three distinct layers: the client application layer, the server process layer, and the storage layer.
When PostgreSQL starts, the Postmaster process launches and listens on the configured port (default 5432).
For every new client connection, Postmaster calls fork() to create a dedicated Backend process.
That backend handles the entire session lifecycle — parsing queries, planning, executing, and returning results —
all within its own private process address space.
🏠
Postmaster (postgres)
The parent of all PostgreSQL processes. Authenticates connections via pg_hba.conf, then forks a backend for each client. If a backend crashes, it resets shared memory and restarts the background processes. The root of the cluster's process tree.
👤
Backend Process
One per client connection. Handles parsing, planning, and execution of SQL statements. Has access to shared memory but its own stack. Dies when client disconnects.
✍️
WAL Writer
Periodically flushes WAL buffers to pg_wal on disk. Ensures WAL is durable even if no transaction commits. Runs every wal_writer_delay (default 200ms).
✅
Checkpointer
Writes all dirty pages from shared buffers to data files at checkpoint intervals. Reduces crash recovery time. Triggered by checkpoint_timeout or max_wal_size.
🧹
Autovacuum Launcher
Monitors tables for dead tuple accumulation. Spawns autovacuum worker processes to VACUUM/ANALYZE tables that cross the configured thresholds.
📝
Background Writer
Proactively writes dirty buffers to disk between checkpoints, reducing I/O spikes at checkpoint time. Works alongside the checkpointer process.
📦
WAL Archiver
When archive_mode = on, copies completed WAL segments to the archive location using archive_command. Enables PITR and remote standby servers.
📊
Stats Collector
Gathers the cumulative statistics behind views such as pg_stat_user_tables and pg_stat_database, writing them to temporary files in pg_stat_tmp/. Removed in PostgreSQL 15, where these statistics moved to shared memory.
03
Memory Architecture
Shared Memory (all processes) + Local Memory (per backend)
🔵 Shared Memory — All Processes
Shared Buffers — The main data cache: 8KB pages from data files. Default 128MB; a common starting point is 25% of RAM. All backends share this pool. ≈ Oracle SGA Buffer Cache.
WAL Buffers — In-memory buffer for WAL records before they are flushed to pg_wal on disk. Default wal_buffers = -1 (auto-sized). ≈ Oracle Redo Log Buffer.
Lock Space — Shared hash table of all active locks, used by all processes to detect conflicts. Part of the shared memory segment.
Proc Array — Shared array of all running backend PIDs, their XIDs, and snapshot info. Used for MVCC visibility checks.
🟠 Local Memory — Per Backend Process
work_mem — Memory for sort operations, hash joins, and bitmap scans. Per sort node, not per query: a query with 5 sorts can use up to 5 × work_mem. Default 4MB. ≈ Oracle PGA sort area.
maintenance_work_mem — Memory for VACUUM, CREATE INDEX, ALTER TABLE. Default 64MB. Larger values speed up index builds significantly.
temp_buffers — Cache for temporary table data. Per session; default 8MB. Only allocated when the session uses temp tables.
Catalog Cache — Per-backend cache of system catalog lookups (table definitions, index info). Avoids repeated catalog queries for the same objects.
💡
Key Insight
PostgreSQL has no Oracle-equivalent SGA/PGA distinction as a hard boundary. Instead, shared memory (shared_buffers + WAL buffers + locks) is the shared pool, and each process has private memory. The effective_cache_size parameter is NOT actual memory — it's a hint to the query planner about how much OS file cache exists.
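The interplay of these parameters can be sketched as back-of-the-envelope arithmetic. The numbers below (25% of RAM for shared_buffers, 50% for the effective_cache_size hint, two sort nodes per query) are illustrative rules of thumb, not tuning advice:

```python
# Back-of-the-envelope memory budget for a PostgreSQL host.
# All ratios are illustrative starting points, not recommendations.

def memory_budget(ram_gb, max_connections, work_mem_mb=4, sorts_per_query=2):
    shared_buffers_gb = ram_gb * 0.25          # common 25%-of-RAM starting point
    # Worst case: every backend runs a query with N concurrent sort/hash nodes,
    # each allowed up to work_mem (work_mem is per node, not per query).
    worst_case_work_mem_gb = max_connections * sorts_per_query * work_mem_mb / 1024
    os_cache_hint_gb = ram_gb * 0.5            # typical effective_cache_size hint
    return {
        "shared_buffers_gb": shared_buffers_gb,
        "worst_case_work_mem_gb": worst_case_work_mem_gb,
        "effective_cache_size_gb": os_cache_hint_gb,
    }

budget = memory_budget(ram_gb=64, max_connections=200)
print(budget)
```

Note how small the worst-case work_mem figure is at the 4MB default — which is exactly why raising work_mem globally on a high-connection system is risky: the multiplier is connections × sort nodes.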
Every piece of data in PostgreSQL is ultimately stored as 8KB pages in heap files.
A table is a collection of heap pages. An index is a separate collection of index pages.
The storage is entirely on the filesystem — there is no raw device mode.
$PGDATA Directory Structure
base/ — Database files. One subdirectory per database OID. Heap files are named by relfilenode, which initially equals the table's OID (e.g., 16384, 16385...).
pg_wal/ — Redo logs. WAL segment files (16MB each), named by timeline and LSN. The archive_status/ subdirectory tracks archiving.
global/ — Cluster-wide catalog. Shared tables such as pg_database, pg_tablespace, pg_authid. Shared across all databases.
pg_tblspc/ — Tablespaces. Symlinks to tablespace locations, allowing data on different disk volumes.
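The base/ layout can be sketched as a small path helper. This assumes the relation's relfilenode still equals its initial OID (true until a rewrite such as VACUUM FULL or TRUNCATE assigns a new one):

```python
import os

# Sketch: where a table's heap file lives under $PGDATA, assuming the
# relation's relfilenode equals its initial OID (true until a rewrite).
def heap_file_path(pgdata, database_oid, relfilenode, segment=0):
    name = str(relfilenode) if segment == 0 else f"{relfilenode}.{segment}"
    return os.path.join(pgdata, "base", str(database_oid), name)

print(heap_file_path("/var/lib/postgresql/data", 16384, 16385))
# Heap files over 1GB are split into 1GB segments: 16385, 16385.1, 16385.2, ...
print(heap_file_path("/var/lib/postgresql/data", 16384, 16385, segment=2))
```

In a real cluster you would resolve the true path with `SELECT pg_relation_filepath('tablename');` rather than assuming OID = relfilenode.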
A SELECT statement passes through 5 stages inside the backend process before results reach the client.
No WAL is written for a pure SELECT. Data is served from shared buffers if cached, or read from disk if not.
SQL Text (client sends query) → Parser (syntax check → parse tree) → Analyzer / Rewriter (semantic analysis, view expansion) → Planner / Optimizer (cost-based plan using pg_statistic) → Executor (walks plan tree, fetches pages) → Results (wire protocol → client)
1
Parser — SQL Text → Parse Tree
The backend receives the SQL string and runs it through the gram.y grammar (built with Bison). It performs syntactic analysis only — checking for correct SQL syntax, not whether tables exist. Output: a raw parse tree (a List of Node structures).
SELECT * FROM orders WHERE status = 'ACTIVE' → RawStmt → SelectStmt
2
Analyzer / Rewriter — Parse Tree → Query Tree
The analyzer looks up table and column names in pg_catalog, resolves types, checks permissions, and builds a Query tree. The rewriter then expands any VIEWs (replacing the view reference with its defining SQL) and applies pg_rewrite rules.
3
Planner / Optimizer — Generate the Cheapest Execution Plan
The planner is the most complex stage. It generates candidate execution plans (sequential scan, index scan, bitmap scan, nested loop, hash join, merge join...) and estimates the cost of each using statistics from pg_statistic. It selects the plan with the lowest total cost. Output: a Plan tree.
EXPLAIN SELECT * FROM orders WHERE status = 'ACTIVE' → shows chosen plan
4
Executor — Walk the Plan Tree, Fetch Data
The executor traverses the plan tree node-by-node. For each data page needed:
1. Check shared_buffers — is the 8KB page already cached? If yes → read from RAM (buffer hit).
2. Cache miss — find a free buffer frame, read the 8KB page from disk into shared_buffers, then read from there.
3. MVCC visibility check — for each tuple (row) on the page, check if it is visible to the current transaction using t_xmin/t_xmax (see Section 9).
4. Filter — apply WHERE clause predicates.
5. Project — extract only the requested columns.
Results are streamed back to the client via the wire protocol (not buffered all at once).
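The per-page loop above can be sketched in miniature. Shared buffers are modeled as a dict, disk as a list of pages, and tuple visibility reduced to a precomputed boolean — a deliberate simplification of the real MVCC check:

```python
# Minimal sketch of the executor's per-page loop for a sequential scan.
# shared_buffers is modeled as {page_no: page}; disk as a list of pages.
# Tuple visibility is reduced to a boolean for brevity.

def seq_scan(disk_pages, shared_buffers, predicate, columns):
    for page_no, _ in enumerate(disk_pages):
        page = shared_buffers.get(page_no)        # 1. buffer hit?
        if page is None:
            page = disk_pages[page_no]            # 2. cache miss: read from disk
            shared_buffers[page_no] = page        #    ...and keep it cached
        for tup in page:
            if not tup["visible"]:                # 3. MVCC visibility check
                continue
            if not predicate(tup):                # 4. WHERE filter
                continue
            yield {c: tup[c] for c in columns}    # 5. projection, streamed

disk = [[{"visible": True, "id": 1, "status": "ACTIVE"},
         {"visible": False, "id": 2, "status": "ACTIVE"}],
        [{"visible": True, "id": 3, "status": "DONE"}]]
buffers = {}
rows = list(seq_scan(disk, buffers, lambda t: t["status"] == "ACTIVE", ["id"]))
print(rows)          # [{'id': 1}]
print(len(buffers))  # 2 — both pages now cached as a side effect of the scan
```

The generator mirrors how the real executor streams tuples to the client instead of materializing the whole result set.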
5
No WAL Written — SELECT is Read-Only
A plain SELECT never writes WAL. No data files are modified. The only side effect is that shared_buffers may be populated with pages that weren't there before (they stay cached for future queries). Lock: typically an AccessShareLock on the relation — compatible with everything except ACCESS EXCLUSIVE (e.g., DROP TABLE).
💡
Oracle DBA Note
PostgreSQL SELECT ≈ Oracle SELECT. Both check their buffer caches first (Oracle: Buffer Cache; PostgreSQL: shared_buffers). Both use MVCC to ensure consistent reads without locking rows. Oracle uses UNDO tablespace for old row versions; PostgreSQL keeps them in the heap itself. Oracle's SQL Trace / AUTOTRACE ≈ PostgreSQL's EXPLAIN ANALYZE.
An INSERT must write both the WAL record and the tuple into a shared buffer page.
The page is written to disk lazily (by bgwriter or checkpointer) — not immediately on INSERT.
The WAL record however must be flushed to disk before the COMMIT returns (when synchronous_commit = on).
INSERT SQL (client sends) → Parse + Plan (same as SELECT stages 1–3) → Get XID (assign transaction ID → t_xmin) → Find Page (FSM → page with free space) → Write Tuple (insert into shared buffer) → Write WAL (record into WAL buffers) → COMMIT (flush WAL to disk, mark visible)
1
Begin Transaction — Assign XID
When the first DML statement executes, PostgreSQL assigns a Transaction ID (XID) — a 32-bit monotonically increasing integer. This XID becomes the t_xmin value stamped into every new tuple this transaction inserts. The new row will only be visible to other transactions after this XID commits.
-- Internal: XID = 1001 assigned. New tuple will have t_xmin = 1001, t_xmax = 0
2
Free Space Map (FSM) — Find Target Page
The executor consults the Free Space Map (FSM) — a compact data structure (stored in the _fsm fork of each relation) that tracks how much free space each heap page has. It finds a page with enough free space to fit the new tuple. If no page has enough space, a new page is allocated (extends the file).
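The FSM lookup can be sketched as below. The real FSM is a three-level tree of coarse per-page free-space categories, not a flat dict — this only shows the decision logic:

```python
# Sketch of the Free Space Map lookup: pick a page with enough room for the
# new tuple, or extend the relation with a fresh page. The real FSM is a
# three-level tree of per-page free-space categories, not a flat dict.

PAGE_SIZE = 8192

def choose_target_page(fsm, tuple_size):
    for page_no, free_bytes in fsm.items():
        if free_bytes >= tuple_size:
            return page_no, False          # reuse an existing page
    new_page = max(fsm, default=-1) + 1    # no room anywhere: extend the file
    fsm[new_page] = PAGE_SIZE - 24         # fresh page, minus the page header
    return new_page, True

fsm = {0: 40, 1: 500}
print(choose_target_page(fsm, 120))   # (1, False) — page 1 has 500 bytes free
print(choose_target_page(fsm, 6000))  # (2, True)  — relation extended
```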
3
Acquire Buffer Page — Write Tuple
The executor loads the target 8KB heap page into shared_buffers (if not already there). It then writes the new HeapTuple into the page — stamping:
• t_xmin = current XID (transaction that inserted it)
• t_xmax = 0 (not deleted yet)
• t_ctid = (page, offset) (physical location of this tuple)
The page is now marked dirty in shared_buffers. It is NOT written to disk yet.
New tuple: t_xmin=1001 | t_xmax=0 | data: ('Alice', 'DBA', 50000)
4
Write WAL Record — Into WAL Buffers
A WAL record describing the INSERT is written into the WAL buffers (in shared memory). The WAL record contains enough information to redo the insert during crash recovery. On the first modification after a checkpoint, a Full Page Image (FPI) of the entire 8KB page is also included to protect against torn page writes.
5
COMMIT — Flush WAL to Disk, Make Visible
On COMMIT, PostgreSQL writes a COMMIT WAL record and calls fsync() (or equivalent) to ensure all WAL up to this LSN is durably written to disk in pg_wal. Only after this disk write returns successfully does COMMIT return success to the client. The page in shared_buffers remains dirty — it will be written to the actual data file later by bgwriter or checkpointer.
-- After COMMIT: WAL on disk. Tuple visible to new transactions. Data page still in RAM (dirty).
6
Index Update (if indexed columns)
For each index on the table, a corresponding index entry must be inserted pointing to the new tuple's ctid (physical location). Each index insert generates its own WAL record. Note that the HOT (Heap-Only Tuple) optimization — skipping index maintenance via a ctid chain — applies only to UPDATEs of non-indexed columns; a fresh INSERT always needs an entry in every index.
⚡
Performance Insight
The INSERT performance bottleneck is almost always WAL fsync latency. Setting synchronous_commit = off removes the fsync wait from the commit path, often achieving 3–10× higher INSERT throughput — at the cost that a crash can lose the most recently committed transactions (up to roughly 3 × wal_writer_delay, about 600ms by default). Use it for non-critical bulk inserts; for critical data, keep synchronous_commit = on.
💡
Oracle DBA Note
PostgreSQL INSERT ≈ Oracle INSERT in flow: both write redo (WAL≈redo log) and update the buffer cache. Key difference: Oracle also writes UNDO data (to reconstruct the old row for other readers). PostgreSQL does NOT need UNDO because old row versions stay in the heap — they just become invisible to new transactions once t_xmax is set.
07
How UPDATE Works — The MVCC Magic
UPDATE = DELETE old + INSERT new. Old version kept for concurrent readers.
This is one of the most important things to understand about PostgreSQL:
UPDATE does not modify the row in place. Instead, it creates a brand new row version
and marks the old row as "dead". This is the foundation of MVCC — multiple transactions can
read different versions of the same row concurrently without blocking each other.
UPDATE Step-by-Step: what happens to row versions in the heap
BEFORE UPDATE (salary from 50000 → 65000):
Row    | t_xmin           | t_xmax          | t_ctid | name  | salary | Visible to
Row v1 | 1001 (committed) | 0 (not deleted) | (5,1)  | Alice | 50000  | ALL transactions
DURING UPDATE (XID 2005 running: UPDATE emp SET salary=65000 WHERE name='Alice'):
Row    | t_xmin             | t_xmax             | t_ctid  | name  | salary | Visible to
Row v1 | 1001 (committed)   | 2005 (in-progress) | (5,2) → | Alice | 50000  | Txns started before 2005
Row v2 | 2005 (in-progress) | 0                  | (5,2)   | Alice | 65000  | Only XID 2005 itself
AFTER COMMIT (XID 2005 committed):
Row           | t_xmin           | t_xmax           | t_ctid  | name  | salary | Visible to
Row v1 (dead) | 1001             | 2005 (committed) | (5,2) → | Alice | 50000  | Dead — invisible to all new txns
Row v2 (live) | 2005 (committed) | 0                | (5,2)   | Alice | 65000  | ALL new transactions
⚠ Row v1 (salary=50000) remains on the page as a dead tuple. It will be cleaned up by VACUUM. This is why UPDATE-heavy tables need regular vacuuming.
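The version transitions above can be simulated in a few lines. This is a teaching sketch: commit state is a simple set of XIDs, and the visibility check ignores snapshots and the own-transaction cases covered in Section 9:

```python
# Simulation of the row-version walkthrough: UPDATE = stamp t_xmax on the old
# version + insert a new version. Commit state is tracked per XID.

committed = {1001}          # XID 1001 has already committed
heap = [{"xmin": 1001, "xmax": 0, "ctid": (5, 1), "salary": 50000}]

def update_salary(heap, xid, new_salary):
    old = heap[-1]
    new_ctid = (old["ctid"][0], old["ctid"][1] + 1)
    old["xmax"] = xid                    # old version marked deleted-by-xid
    old["ctid"] = new_ctid               # ctid chain points at the new version
    heap.append({"xmin": xid, "xmax": 0, "ctid": new_ctid, "salary": new_salary})

def visible(tup, committed):
    # simplified rule: inserter committed AND deleter absent or uncommitted
    return tup["xmin"] in committed and (tup["xmax"] == 0 or tup["xmax"] not in committed)

update_salary(heap, xid=2005, new_salary=65000)
print([visible(t, committed) for t in heap])   # [True, False] — old version still live
committed.add(2005)                            # COMMIT of XID 2005
print([visible(t, committed) for t in heap])   # [False, True] — v1 dead, v2 live
```

Note that nothing is ever removed from `heap` — exactly why UPDATE-heavy tables accumulate dead tuples until VACUUM runs.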
1
Find the Old Row — Take Row-Level Lock
The executor locates the target row via an index or sequential scan. It checks that the row is visible and updatable (not already being updated by another transaction). It acquires an ExclusiveLock on the tuple (stored in t_xmax as the current XID) to prevent concurrent updates.
2
Mark Old Row Dead — Set t_xmax
PostgreSQL stamps the old row's t_xmax = current XID. This marks it as "deleted by this transaction". The old row is still physically present in the page — just logically invisible to transactions that start after this XID commits.
Old row: t_xmin=1001 | t_xmax=2005 | salary=50000 ← marked dead
3
Insert New Row Version — Set t_xmin
A completely new HeapTuple is inserted (same as INSERT): t_xmin = current XID, t_xmax = 0. It's stamped with the new column values. The old row's t_ctid is updated to point to the new row's location (ctid chain — allows readers of the old version to find the new one).
New row: t_xmin=2005 | t_xmax=0 | salary=65000 ← new live row
4
WAL Written — One Record Covers Both Changes
A heap-UPDATE WAL record is written describing both changes: the old tuple's new t_xmax and the new tuple's contents. If this is the first modification of the page since the last checkpoint, a Full Page Image of the affected page is also included in WAL (of both pages, when the old and new versions land on different pages).
5
Index Update — Unless HOT Optimization Applies
Every index on the table must have its entry updated to point from the old tuple's ctid to the new tuple's ctid. HOT Optimization (Heap-Only Tuple): if the updated columns are NOT part of any index AND the new row fits on the same page as the old one, PostgreSQL skips the index update entirely — dramatically reducing write amplification for frequently-updated rows with non-indexed column changes.
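The two HOT preconditions reduce to a small predicate. The column sets and sizes below are made-up illustrations:

```python
# Sketch of the two conditions for a HOT (Heap-Only Tuple) update described
# above: no indexed column changed AND the new version fits on the same page.

def hot_update_possible(changed_columns, indexed_columns, free_on_page, new_tuple_size):
    no_index_change = not (set(changed_columns) & set(indexed_columns))
    fits_on_page = new_tuple_size <= free_on_page
    return no_index_change and fits_on_page

# salary is not indexed and the page has room: HOT applies, no index writes
print(hot_update_possible({"salary"}, {"id", "email"}, 900, 120))   # True
# email is indexed: every index must get a new entry
print(hot_update_possible({"email"}, {"id", "email"}, 900, 120))    # False
```

The `fits_on_page` condition is why lowering fillfactor (leaving free space on each page) raises the HOT hit rate on hot rows.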
⚠
Write Amplification Warning
Every UPDATE writes 2 tuples (old dead + new live) + potentially 2 index entries + WAL for all of the above. On a table with many indexes, a single-column UPDATE can write 5–10× more data than the actual change. Use partial indexes, set appropriate fillfactor (e.g., 80) to leave room for HOT updates, and run VACUUM regularly to reclaim dead tuple space.
08
How DELETE Works
Logical deletion only — physical removal happens at VACUUM time
DELETE in PostgreSQL is purely logical. No data is physically removed from the page when you DELETE.
The row is simply marked as deleted by setting t_xmax. The physical space is reclaimed only when VACUUM runs.
1
Locate the Row — Visibility Check
The executor finds the target row(s) via index scan or sequential scan. For each row, it confirms it is visible to the current transaction (t_xmin committed, t_xmax = 0 or t_xmax not yet committed).
2
Mark as Dead — Set t_xmax = Current XID
The row is marked deleted by setting t_xmax = current XID. No new row is inserted. The row data (column values) is still physically on the page — it just becomes invisible to transactions starting after this XID commits.
Before DELETE: t_xmin=1001 | t_xmax=0 | data: ('Alice', 50000)
After DELETE: t_xmin=1001 | t_xmax=3007 | data: ('Alice', 50000) ← dead tuple
3
WAL Record Written for the Deletion
A WAL DELETE record is written, describing which page and which offset was marked dead. Much smaller than an INSERT WAL record since we're only recording the t_xmax change (unless a Full Page Image is required).
4
Index Entries Are NOT Immediately Removed
The index entries still point to the now-dead tuple. When a query uses the index and follows the pointer to the dead tuple, it performs a visibility check and skips it. Index cleanup happens during VACUUM (VACUUM cleans both the heap pages and the corresponding index entries).
5
VACUUM — Physical Space Reclamation
VACUUM scans heap pages looking for dead tuples (t_xmax set and that XID committed). For each dead tuple that is no longer needed by any active transaction, VACUUM:
• Clears the tuple slot (marks it available for reuse)
• Updates the Free Space Map (FSM) so future inserts can use that space
• Cleans corresponding index entries (index vacuum)
Note: VACUUM does NOT shrink the file — it just marks space as reusable. Only VACUUM FULL or pg_repack physically returns space to the OS.
SELECT n_dead_tup, n_live_tup FROM pg_stat_user_tables WHERE relname = 'emp';
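A toy model of the reclamation pass ties the steps together. A tuple is reclaimable once its deleting XID has committed and is older than every still-running transaction (the oldest active XID); the sketch frees the slot and credits the FSM, but — like real VACUUM — never shrinks the file:

```python
# Sketch of a single-table VACUUM pass: free dead tuple slots, credit the FSM.
# A tuple is dead once its deleting XID is committed and older than every
# transaction still running (oldest_active_xid).

def vacuum(pages, fsm, committed, oldest_active_xid, tuple_size=60):
    removed = 0
    for page_no, page in enumerate(pages):
        survivors = []
        for tup in page:
            dead = (tup["xmax"] != 0 and tup["xmax"] in committed
                    and tup["xmax"] < oldest_active_xid)
            if dead:
                fsm[page_no] = fsm.get(page_no, 0) + tuple_size  # space reusable
                removed += 1
            else:
                survivors.append(tup)
        page[:] = survivors       # slot cleared in place; file never shrinks
    return removed

pages = [[{"xmin": 1001, "xmax": 2005, "salary": 50000},   # dead version
          {"xmin": 2005, "xmax": 0, "salary": 65000}]]     # live version
fsm = {0: 100}
print(vacuum(pages, fsm, committed={1001, 2005}, oldest_active_xid=3000))  # 1
print(fsm[0])  # 160 — 60 bytes credited back to the FSM
```

The `oldest_active_xid` guard is also why one long-running transaction can stall dead-tuple cleanup cluster-wide.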
Operation         | Oracle Behavior                                                  | PostgreSQL Behavior
DELETE            | Row marked in data block; old version moved to UNDO segment.     | t_xmax set in heap tuple; old data stays in the page as a dead tuple.
Consistent reads  | Readers reconstruct old versions from the UNDO tablespace.       | Old tuple versions are already in the heap — readers use t_xmin/t_xmax to pick the right one.
Space reclamation | Automatic via the UNDO retention period.                         | Manual/auto via VACUUM. Must run regularly to prevent bloat.
TRUNCATE          | Drops and recreates the segment. Instant; implicit commit (DDL). | Same concept: drops and recreates the relation files. Instant, transactional, and WAL-logged.
MVCC is the mechanism that allows concurrent readers and writers to never block each other.
Each transaction sees a consistent snapshot of the database as of its start time (or statement start time, depending on isolation level).
The visibility of a tuple is determined entirely by its t_xmin and t_xmax values.
MVCC Visibility Rules — How PostgreSQL decides if a tuple is visible
TUPLE IS VISIBLE when:
✓ t_xmin is committed AND older than my snapshot (or is my own XID)
✓ t_xmax is 0 (not deleted), OR the deleting transaction is not yet committed, OR it committed after my snapshot was taken
TUPLE IS INVISIBLE when:
✗ t_xmin is not yet committed (inserting txn still running)
✗ t_xmin was rolled back (aborted)
✗ t_xmax is committed and older than my snapshot (deleted before I started)
✗ t_xmax is my own XID (I deleted it earlier in this transaction)
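The visibility rules above can be written as a predicate. Here `snapshot_xmin` stands in for "my snapshot": XIDs at or above it were still running (or unborn) when the snapshot was taken. This ignores subtransactions and the real snapshot's in-progress list, so it is a teaching sketch, not the actual HeapTupleSatisfiesMVCC:

```python
# The MVCC visibility rules as a predicate (simplified teaching sketch).

def tuple_visible(t_xmin, t_xmax, my_xid, committed, aborted, snapshot_xmin):
    # Insertion must be committed and older than my snapshot, or be my own.
    if t_xmin != my_xid:
        if t_xmin in aborted or t_xmin not in committed or t_xmin >= snapshot_xmin:
            return False
    if t_xmax == 0:
        return True                            # never deleted
    if t_xmax == my_xid:
        return False                           # I deleted it myself
    if t_xmax not in committed or t_xmax >= snapshot_xmin:
        return True                            # deleter aborted/running/too new
    return False                               # deletion visible to my snapshot

committed, aborted = {1001, 2005}, set()
# Row v1 (xmin=1001, xmax=2005) under a snapshot taken before 2005 committed:
print(tuple_visible(1001, 2005, my_xid=2010, committed=committed,
                    aborted=aborted, snapshot_xmin=2005))   # True — deleter too new
# Same row under a snapshot taken after 2005 committed:
print(tuple_visible(1001, 2005, my_xid=2010, committed=committed,
                    aborted=aborted, snapshot_xmin=2006))   # False — deletion seen
```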
Transaction Snapshot Timeline — READ COMMITTED vs REPEATABLE READ
T1 START — snapshot taken (READ COMMITTED: per statement)
T2 INSERT — another txn inserts a row
T2 COMMIT — row now committed, t_xmin = 2005
T1 SELECT — READ COMMITTED: sees T2's row ✓ · REPEATABLE READ: does NOT see it ✗
T1 COMMIT — T1 ends
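The timeline reduces to a single question: when is the snapshot taken? In this sketch a snapshot is just the set of XIDs committed by snapshot time; the event times and XID 2005 are illustrative:

```python
# Sketch of the timeline above: the only difference between READ COMMITTED and
# REPEATABLE READ is WHEN the snapshot is taken. A snapshot here is simply the
# set of XIDs that had committed at snapshot time.

def run_t1(isolation, commit_log):
    # commit_log: list of (event_time, xid) commits by other transactions
    txn_start, stmt_time = 0, 2               # T1 starts at t=0; SELECT runs at t=2
    snap_time = txn_start if isolation == "REPEATABLE READ" else stmt_time
    snapshot = {xid for t, xid in commit_log if t <= snap_time}
    return 2005 in snapshot                    # does T1's SELECT see T2's row?

commit_log = [(1, 2005)]                       # T2 (xid 2005) commits at t=1
print(run_t1("READ COMMITTED", commit_log))    # True  — per-statement snapshot
print(run_t1("REPEATABLE READ", commit_log))   # False — snapshot from txn start
```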
Isolation Level          | Snapshot Taken At       | Dirty Read | Non-Repeatable Read                   | Phantom Read
READ COMMITTED (default) | Start of each statement | ❌ No      | ✅ Yes — sees newly committed changes | ✅ Yes
REPEATABLE READ          | Start of transaction    | ❌ No      | ❌ No                                 | ❌ No (PostgreSQL prevents)
SERIALIZABLE             | Start of transaction    | ❌ No      | ❌ No                                 | ❌ No (SSI algorithm)
💡
Oracle DBA Note
Oracle defaults to READ COMMITTED isolation (same as PostgreSQL default). Oracle's "consistent read" using UNDO data ≈ PostgreSQL's MVCC using old tuple versions in the heap. Oracle's SCN (System Change Number) ≈ PostgreSQL's transaction ID (XID) for snapshot comparison. Oracle doesn't support REPEATABLE READ as a distinct level — PostgreSQL does.
10
WAL — Write-Ahead Logging
Durability · Crash Recovery · Replication · PITR
WAL is the backbone of PostgreSQL's durability and replication story.
The principle is simple: write the log record before changing the data.
This guarantees that even if PostgreSQL crashes mid-write, no committed data is lost —
recovery replays the WAL to reconstruct any missing changes.
SQL DML (INSERT/UPDATE/DELETE) → WAL Record Written (into WAL buffers in shared memory) → Dirty Page in Buffers (8KB page changed in shared_buffers) → COMMIT (WAL flushed to pg_wal on disk) → COMMIT OK (client receives success) → Later: Checkpoint (dirty pages flushed to data files)
🔄
Crash Recovery
On startup after a crash, PostgreSQL replays WAL records from the last checkpoint to reconstruct all committed changes that weren't written to data files yet. Fast and automatic.
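Redo replay can be sketched as follows: any WAL record newer than the checkpoint whose change never reached the data page is re-applied, using per-page LSNs to decide. Pages and records are toy structures, not real WAL:

```python
# Sketch of crash recovery: replay WAL records from the last checkpoint,
# re-applying any change whose on-disk page version is older than the
# record's LSN. Pages are modeled as {page_no: (page_lsn, value)}.

def recover(data_pages, wal, checkpoint_lsn):
    replayed = 0
    for lsn, page_no, value in wal:
        if lsn <= checkpoint_lsn:
            continue                            # already safely in the data files
        page_lsn, _ = data_pages.get(page_no, (0, None))
        if lsn > page_lsn:                      # page missed this change: redo it
            data_pages[page_no] = (lsn, value)
            replayed += 1
    return replayed

# Crash scenario: WAL up to LSN 300 was fsynced at commit time, but only
# changes up to LSN 100 reached the data files before the crash.
data_pages = {0: (100, "old")}
wal = [(100, 0, "old"), (200, 0, "newer"), (300, 1, "brand new")]
print(recover(data_pages, wal, checkpoint_lsn=100))  # 2 records replayed
print(data_pages[0])                                 # (200, 'newer')
```

This is the write-ahead contract in action: because WAL reached disk before COMMIT returned, every committed change is reconstructible even though the data pages were stale.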
🏭
Streaming Replication
Standby servers connect to primary and receive a continuous stream of WAL records. They apply them to their own data files — staying in sync in near-real time. pg_stat_replication monitors this.
📅
Point-in-Time Recovery
When archive_mode=on, completed WAL segments are archived. You can restore a base backup + replay WAL to any specific timestamp — e.g., just before an accidental DROP TABLE.
📡
Logical Decoding
WAL records can be decoded into logical row-level changes (INSERT/UPDATE/DELETE) via output plugins (pgoutput, wal2json). Used for logical replication and CDC with Debezium/Kafka.
WAL File Naming — Understanding LSN and Segments
shell + psql — WAL file structure
# WAL files in $PGDATA/pg_wal/
ls -la $PGDATA/pg_wal/

# Filename format: 000000TT XXXXXXXX YYYYYYYY
#   TT                = Timeline ID (increases on failover/PITR)
#   XXXXXXXX YYYYYYYY = 64-bit LSN (Log Sequence Number), split in two
000000010000000000000091   # Timeline 1, LSN segment 0x91
000000010000000000000092   # Timeline 1, LSN segment 0x92
000000010000000000000093   # Timeline 1, LSN segment 0x93
# Each WAL segment = 16MB (default wal_segment_size)

-- Check current WAL position (LSN)
SELECT pg_current_wal_lsn();                  -- e.g., 0/93007FF0
SELECT pg_walfile_name(pg_current_wal_lsn()); -- convert LSN to filename

-- Measure WAL generated since a known LSN
SELECT pg_size_pretty(
  pg_wal_lsn_diff(pg_current_wal_lsn(), '0/91000000')
) AS wal_since_checkpoint;
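The filename-to-LSN arithmetic is simple enough to verify by hand. This sketch assumes the default 16MB wal_segment_size (initdb can change it):

```python
# Parse a WAL segment file name into (timeline, start-LSN string, start-LSN int),
# assuming the default 16MB wal_segment_size.

WAL_SEG_SIZE = 16 * 1024 * 1024

def parse_walfile(name):
    timeline = int(name[0:8], 16)   # TT part
    log = int(name[8:16], 16)       # high 32 bits of the LSN
    seg = int(name[16:24], 16)      # segment number within that log file
    start = seg * WAL_SEG_SIZE
    return timeline, f"{log:X}/{start:08X}", (log << 32) + start

print(parse_walfile("000000010000000000000093"))
# (1, '0/93000000', 2466250752)
```

The result matches `SELECT pg_walfile_name(...)` run in the other direction: segment 0x93 of timeline 1 begins at LSN 0/93000000.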
🔒
Full Page Images (FPI) — Write Amplification Source
After every checkpoint, the first write to any page includes a full 8KB copy of that page in the WAL record — even if only 10 bytes changed. This protects against torn page writes (partial OS writes). Enable wal_compression = lz4 or zstd to compress FPIs and reduce WAL volume by 50–80% with minimal CPU cost.