PostgreSQL DBA
Intermediate Interview
Questions & Answers
25 carefully crafted intermediate-level questions covering table partitioning, logical replication, autovacuum tuning, connection pooling, Patroni HA, query optimization, and real-world DBA scenarios.
Table partitioning divides a large logical table into smaller physical pieces called partitions, while queries still work against the parent table. PostgreSQL has supported declarative partitioning since version 10.
PostgreSQL supports three partition types:
- Range Partitioning — Rows are distributed based on a range of values (e.g., date ranges). Best for time-series data.
- List Partitioning — Rows are distributed based on a discrete list of values (e.g., country code, region).
- Hash Partitioning — Rows are distributed using a hash function for even distribution when no natural range/list exists.
```sql
-- Create partitioned parent table
CREATE TABLE orders (
    order_id   BIGSERIAL,
    order_date DATE NOT NULL,
    amount     NUMERIC
) PARTITION BY RANGE (order_date);

-- Create partitions
CREATE TABLE orders_2024 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE orders_2025 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

-- Query works on parent — partition pruning happens automatically
SELECT * FROM orders WHERE order_date = '2025-06-01';
```
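The list and hash variants follow the same declarative pattern; a brief sketch with illustrative table names:

```sql
-- List partitioning: one partition per discrete value set
CREATE TABLE customers (
    customer_id  BIGSERIAL,
    country_code TEXT NOT NULL
) PARTITION BY LIST (country_code);

CREATE TABLE customers_emea PARTITION OF customers
    FOR VALUES IN ('DE', 'FR', 'GB');
CREATE TABLE customers_apac PARTITION OF customers
    FOR VALUES IN ('JP', 'SG', 'AU');

-- Hash partitioning: even spread across N partitions
CREATE TABLE events (
    event_id BIGINT NOT NULL
) PARTITION BY HASH (event_id);

CREATE TABLE events_p0 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
-- ... REMAINDER 2 and 3 likewise
```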
Partition pruning is PostgreSQL's optimizer feature that skips scanning partitions which cannot contain rows matching the query's WHERE clause. This dramatically reduces I/O on large partitioned tables.
- Static pruning — Happens at plan time when the WHERE clause uses constant values.
- Dynamic pruning — Happens at execution time when values are not known until runtime (e.g., parameterized queries). Controlled by `enable_partition_pruning = on` (the default).
```sql
-- Check if partition pruning is working
EXPLAIN SELECT * FROM orders WHERE order_date = '2025-03-15';
-- Good output: only orders_2025 scanned
-- Bad output: all partitions scanned → no pruning

-- Ensure partition pruning is enabled
SHOW enable_partition_pruning;
SET enable_partition_pruning = on;
```
A common pruning killer is casting the partition key, e.g. `WHERE order_date::TEXT = '2025-03-15'`. Always query the partition key in its native type.

Table bloat occurs when dead tuples (from MVCC updates/deletes) accumulate faster than autovacuum can clean them, causing tables and indexes to consume more disk space than necessary and slowing queries.
```sql
-- Find tables with high dead tuple ratio
SELECT schemaname,
       relname AS table_name,
       n_live_tup,
       n_dead_tup,
       round(n_dead_tup::numeric
             / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_pct,
       last_autovacuum,
       last_vacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 15;

-- Fix 1: Standard VACUUM (no lock, space reusable)
VACUUM VERBOSE mytable;

-- Fix 2: VACUUM FULL (reclaims disk, takes EXCLUSIVE LOCK!)
VACUUM FULL mytable;

-- Fix 3: pg_repack (online defrag — no exclusive lock)
-- pg_repack -d mydb -t mytable
```
- A `dead_pct` above 10–15% indicates a table needing immediate attention.
- Prefer `pg_repack` over `VACUUM FULL` in production — it rebuilds the table without long locks.
- Index bloat is separate — use `REINDEX CONCURRENTLY` for bloated indexes.
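A minimal sketch of the index-side cleanup (index and table names are illustrative; `CONCURRENTLY` requires PostgreSQL 12+):

```sql
-- Rebuild one bloated index without blocking writes
REINDEX INDEX CONCURRENTLY idx_orders_customer_id;

-- Or rebuild every index on a table in one pass
REINDEX TABLE CONCURRENTLY orders;
```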
TOAST (The Oversized-Attribute Storage Technique) is PostgreSQL's mechanism for storing large column values that exceed the 8KB page size. PostgreSQL automatically stores large values in a separate TOAST table linked to the main table.
- TOAST kicks in when a row exceeds roughly 2KB (the `TOAST_TUPLE_THRESHOLD`).
- TOAST can compress and/or move large values out of line (TEXT, BYTEA, JSONB, arrays).
- Four storage strategies: `PLAIN`, `EXTENDED` (default), `EXTERNAL`, `MAIN`.
- Each table with TOAST-able columns has a corresponding `pg_toast.pg_toast_NNNN` table.
```sql
-- Check TOAST strategy for each column
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'mytable'::regclass AND attnum > 0;

-- Change storage strategy for a column
ALTER TABLE mytable
    ALTER COLUMN doc_content SET STORAGE EXTERNAL;
-- EXTERNAL = no compression, fast access
```
PostgreSQL uses a 32-bit transaction ID (XID) counter. After ~2.1 billion transactions, the XID wraps around. Without preventive action, PostgreSQL will shut down the database to prevent data loss — this is one of the most critical operational risks in PostgreSQL.
```sql
-- Check XID age per database (alert if > 1.5 billion)
SELECT datname,
       age(datfrozenxid) AS xid_age,
       round(age(datfrozenxid)::numeric / 2100000000 * 100, 1) AS pct_used
FROM pg_database
ORDER BY xid_age DESC;

-- Check per-table (find which table is holding an old XID)
SELECT relname, age(relfrozenxid) AS table_xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 10;

-- Force freeze old XIDs
VACUUM FREEZE mytable;
```
- PostgreSQL automatically triggers an anti-wraparound autovacuum when XID age exceeds `autovacuum_freeze_max_age` (default: 200 million).
- Monitor and alert when XID age exceeds 1.5 billion.
- The root cause is often a long-running idle transaction holding back the oldest XID — check `pg_stat_activity`.
Default autovacuum settings are too conservative for high-write databases. The key is to make autovacuum trigger more frequently and work faster, especially for large tables.
```
── Global tuning (postgresql.conf) ──────────────────
# More workers for concurrent vacuuming
autovacuum_max_workers = 6              # default: 3

# Check tables more frequently
autovacuum_naptime = '30s'              # default: 1min

# Trigger vacuum when 5% of rows are dead (not 20%)
autovacuum_vacuum_scale_factor = 0.05   # default: 0.2

# Allow autovacuum to work faster
autovacuum_vacuum_cost_limit = 1000     # default: 200
autovacuum_vacuum_cost_delay = '2ms'    # default: 20ms

── Per-table override for very large/hot tables ─────
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold = 1000
);
```
- For a 100M-row table, the default 20% scale factor waits for 20M dead tuples — far too late.
- Use per-table overrides via `ALTER TABLE ... SET` for your hottest tables.
- Monitor with `SELECT * FROM pg_stat_user_tables ORDER BY n_dead_tup DESC;`
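The trigger condition itself is simple arithmetic: autovacuum vacuums a table once `n_dead_tup` exceeds `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples`. A monitoring sketch comparing each table's dead tuples against its trigger point under the global settings (per-table reloptions overrides are ignored here):

```sql
-- Dead tuples vs. the global-settings autovacuum trigger point
SELECT s.relname,
       s.n_dead_tup,
       round(current_setting('autovacuum_vacuum_threshold')::numeric
             + current_setting('autovacuum_vacuum_scale_factor')::numeric
               * c.reltuples) AS global_trigger_point
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
ORDER BY s.n_dead_tup DESC
LIMIT 10;
```

For a 100M-row table under the defaults this works out to 50 + 0.2 × 100,000,000 ≈ 20 million dead tuples before autovacuum even starts.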
Cost-based delay throttles autovacuum's I/O by making it sleep when it exceeds a cost threshold, preventing it from overwhelming the disk subsystem on production systems.
```
# Cost assigned per vacuum I/O operation
vacuum_cost_page_hit = 1      # buffer cache hit
vacuum_cost_page_miss = 10    # disk read
vacuum_cost_page_dirty = 20   # dirty page write

# When accumulated cost hits the limit, sleep for the delay duration
autovacuum_vacuum_cost_limit = 200      # default
autovacuum_vacuum_cost_delay = '20ms'   # default

# For NVMe/fast storage — let autovacuum run at full speed
autovacuum_vacuum_cost_limit = 2000
autovacuum_vacuum_cost_delay = '0ms'
```
- A high `cost_delay` with a low `cost_limit` means slow autovacuum, which means bloat accumulation.
- On fast NVMe storage, set `cost_delay = 0` and raise `cost_limit` significantly.
- On shared or spinning-disk environments, keep `cost_delay` higher to avoid I/O saturation.
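These settings translate into a concrete I/O ceiling. With the defaults, in the worst case where every page is a cache miss (cost 10), autovacuum reads 200/10 = 20 pages per cycle, then sleeps 20 ms: about 1,000 pages, or roughly 8 MB, per second. The arithmetic as a quick check:

```sql
-- pages/cycle (200/10) * cycles/sec (1000ms/20ms) * 8 KB page, in MB/s
SELECT round((200 / 10) * (1000.0 / 20) * 8192 / 1024.0 / 1024, 1)
       AS default_read_ceiling_mb_per_sec;  -- 7.8
```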
Long-running transactions and idle-in-transaction sessions prevent autovacuum from cleaning dead tuples because it cannot advance the oldest XID past the blocking transaction.
```sql
-- Find sessions blocking autovacuum (idle-in-transaction)
SELECT pid, usename, state, wait_event_type, wait_event,
       now() - xact_start AS xact_duration,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'idle in transaction (aborted)')
ORDER BY xact_start;

-- Find the oldest running transaction (the XID bottleneck)
-- (xid has no ordering operator, so sort by age() instead)
SELECT pid, usename, state,
       now() - xact_start AS age,
       backend_xmin
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;

-- Terminate the blocker
SELECT pg_terminate_backend(pid);
```
Tip: set `idle_in_transaction_session_timeout = '10min'` in postgresql.conf to automatically terminate sessions that hold transactions open too long.

pg_repack is a PostgreSQL extension that removes bloat from tables and indexes online — without holding an exclusive lock for the full duration. It is the production-safe alternative to VACUUM FULL.
| Feature | VACUUM FULL | pg_repack |
|---|---|---|
| Table Lock | Exclusive (full duration) | Brief lock at start/end only |
| Production Safe | ❌ Blocks all reads/writes | ✅ Online operation |
| Reclaims Disk | ✅ Yes | ✅ Yes |
| Rebuilds Indexes | ✅ Yes | ✅ Yes |
| Requires Extension | No | Yes (pg_repack) |
```shell
# Install pg_repack
yum install pg_repack17

# Repack a specific table online
pg_repack -d mydb -t orders --no-superuser-check

# Repack all tables in a database
pg_repack -d mydb
```
| Feature | Streaming Replication | Logical Replication |
|---|---|---|
| Level | Physical (block/byte level) | Logical (row level) |
| Selective Tables | ❌ Entire cluster | ✅ Per table |
| Cross-Version | ❌ Same major version | ✅ Different versions |
| Cross-Platform | ❌ Same OS/arch | ✅ Flexible |
| DDL Replication | ✅ Automatic | ❌ Manual |
| wal_level needed | replica | logical |
| Standby writable | ❌ Read-only | ✅ Yes |
```sql
-- Step 1: Set wal_level = logical in postgresql.conf
-- wal_level = logical

-- Step 2: Create a replication user
CREATE ROLE repl_user WITH REPLICATION LOGIN PASSWORD 'secret';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO repl_user;

-- Step 3: Create a publication for specific tables
CREATE PUBLICATION pub_orders FOR TABLE orders, order_items;

-- Or publish ALL tables
CREATE PUBLICATION pub_all FOR ALL TABLES;
```
```sql
-- Step 4: Tables must exist on the subscriber first
CREATE TABLE orders ( ... );

-- Step 5: Create the subscription
CREATE SUBSCRIPTION sub_orders
    CONNECTION 'host=192.168.1.51 port=5432 dbname=mydb user=repl_user password=secret'
    PUBLICATION pub_orders;

-- Step 6: Monitor subscription status
SELECT * FROM pg_stat_subscription;
SELECT * FROM pg_replication_slots;  -- on publisher
```
A replication slot is a durable state tracker that ensures WAL files needed by a subscriber or standby are NOT deleted — even if the consumer disconnects — until those WAL files have been confirmed consumed.
```sql
-- List all replication slots and WAL retained
SELECT slot_name, active, restart_lsn, confirmed_flush_lsn,
       pg_size_pretty(pg_wal_lsn_diff(
           pg_current_wal_lsn(), restart_lsn)) AS wal_retained
FROM pg_replication_slots;

-- Drop an inactive/stale replication slot
SELECT pg_drop_replication_slot('stale_slot_name');
```
Warning: a slot whose consumer has disappeared retains WAL indefinitely and can fill the disk. Use `max_slot_wal_keep_size` (PostgreSQL 13+) as a safety limit.

Logical replication allows replication between different major PostgreSQL versions, making it ideal for near-zero-downtime upgrades (e.g., PG 15 → PG 17).
- Step 1 — Install new PostgreSQL 17 on a new server and initialize a cluster.
- Step 2 — Dump the schema only from the old cluster: `pg_dump --schema-only -d mydb | psql -d mydb_new`
- Step 3 — Set up logical replication: create a publication on PG 15 and a subscription on PG 17.
- Step 4 — Wait for initial data copy to complete and replication to catch up (monitor lag).
- Step 5 — During maintenance window: stop application writes, let replication catch up fully, switchover application to PG17.
- Step 6 — Drop the subscription; sequences are not replicated, so they must be reset manually.
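Step 6 matters because logical replication does not carry sequence state. After switchover, each sequence on the new primary must be advanced past the highest value already replicated; a sketch (table and sequence names are illustrative):

```sql
-- Advance a serial column's sequence past the table's current max
SELECT setval('orders_order_id_seq',
              (SELECT COALESCE(max(order_id), 1) FROM orders));

-- Inventory of sequences needing the same treatment (PostgreSQL 10+)
SELECT schemaname, sequencename, last_value FROM pg_sequences;
```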
PgBouncer is a lightweight connection pooler for PostgreSQL. Since PostgreSQL spawns a new OS process per client connection (expensive), PgBouncer maintains a pool of server connections and multiplexes client connections onto them.
| Mode | Server Connection Released | Best For |
|---|---|---|
| Session Mode | When client disconnects | Legacy apps, safe default |
| Transaction Mode | After each transaction ends | Most web apps — best efficiency |
| Statement Mode | After each SQL statement | Rarely used — breaks multi-statement logic |
```ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_port = 6432
listen_addr = *
auth_type = scram-sha-256
pool_mode = transaction   # recommended
max_client_conn = 1000
default_pool_size = 25
server_idle_timeout = 600
```

Caveat: transaction pooling breaks features that assume a stable session, such as session-level SET, prepared statements (without protocol-level support), advisory locks, and LISTEN/NOTIFY.

pg_stat_statements is a PostgreSQL extension that tracks execution statistics for every SQL statement. It is the equivalent of Oracle's V$SQL — your first stop for query performance analysis.
```sql
-- Enable in postgresql.conf (requires restart)
-- shared_preload_libraries = 'pg_stat_statements'

-- Enable the extension
CREATE EXTENSION pg_stat_statements;

-- Top 10 queries by total execution time
SELECT round(total_exec_time::numeric, 2) AS total_ms,
       calls,
       round(mean_exec_time::numeric, 2) AS avg_ms,
       round(stddev_exec_time::numeric, 2) AS stddev_ms,
       rows,
       left(query, 80) AS query_snippet
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Reset statistics
SELECT pg_stat_statements_reset();
```
- Partial Index — An index built on a subset of rows defined by a WHERE clause. Much smaller than a full index, making it faster and cheaper to maintain.
- Covering Index — An index that stores extra non-key columns via INCLUDE, so that matching queries can be answered entirely from the index (an Index-Only Scan).
```sql
-- Partial index: only index active orders (not all 50M rows)
CREATE INDEX idx_active_orders
    ON orders(customer_id)
    WHERE status = 'ACTIVE';

-- Covering index: include extra columns to avoid the table fetch
-- (an Index-Only Scan becomes possible)
CREATE INDEX idx_orders_covering
    ON orders(customer_id) INCLUDE (order_date, amount);

-- Query that can use the covering index (Index-Only Scan)
SELECT customer_id, order_date, amount
FROM orders
WHERE customer_id = 12345;
```
- Covering indexes using `INCLUDE` allow Index-Only Scans — no table heap access is needed, which is dramatically faster.
- Partial indexes shine on tables where only a small percentage of rows is queried frequently.
- Stale statistics — The planner estimates too many or too few rows. Fix: `ANALYZE tablename;`
- Low selectivity — The query returns a large percentage of rows (e.g., >10–20%), so a Seq Scan is cheaper than many random I/Os.
- Small table — If the entire table fits in a few pages, a Seq Scan is faster than the index lookup overhead.
- No index on the column — The index simply doesn't exist.
- Data type mismatch / implicit cast — Comparing an indexed INTEGER column against a value bound as TEXT or NUMERIC forces a cast that can prevent index use.
- Function on an indexed column — `WHERE LOWER(email) = 'x'` bypasses the index unless a matching functional index exists.
- Low effective_cache_size — The planner underestimates the OS cache, making index scans look more expensive than they are.
```sql
-- Force index scan for testing only
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 99;
SET enable_seqscan = on;  -- always reset after testing

-- Update statistics
ANALYZE orders;
```
```sql
-- EXPLAIN: plan only, no execution, no actual timings
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- EXPLAIN ANALYZE: executes the query + actual timings
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

-- EXPLAIN (ANALYZE, BUFFERS): shows buffer hits/misses
-- Shared hit  = from shared_buffers (fast)
-- Shared read = from disk (slow)
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 42;

-- Full diagnostic output for pgBadger / explain.dalibo.com
EXPLAIN (ANALYZE, BUFFERS, SETTINGS, WAL, FORMAT JSON)
SELECT * FROM orders WHERE customer_id = 42;
```
- `Shared hit` — Pages served from shared_buffers (no I/O). Good.
- `Shared read` — Pages read from disk. A high value means cache misses — consider increasing `shared_buffers`.
- Use explain.dalibo.com to visualise the JSON output as a graphical plan.
Patroni is an open-source HA solution that uses a distributed configuration store (etcd, ZooKeeper, or Consul) to manage leader election and automatic failover for PostgreSQL clusters.
- Patroni runs as a daemon on every node and monitors the local PostgreSQL instance.
- The current primary holds a leader lock in etcd (with TTL). If the primary fails, the lock expires.
- Replicas race to acquire the leader lock. The winner promotes itself to primary.
- HAProxy or Keepalived (VIP) routes application traffic to the current primary automatically.
- Full cluster configuration is stored in etcd — making it consistent across all nodes.
```
# Typical Patroni HA stack
┌─────────────────────────────────────────────┐
│  App → HAProxy (VIP via Keepalived)         │
│        ↓                    ↓               │
│  PostgreSQL Primary   PostgreSQL Replica    │
│  + Patroni            + Patroni             │
│        ↓                    ↓               │
│        └── etcd cluster (3 nodes) ──┘       │
└─────────────────────────────────────────────┘

# Check cluster status
patronictl -c /etc/patroni/patroni.yml list

# Manual switchover
patronictl -c /etc/patroni/patroni.yml switchover
```
| Feature | Patroni | repmgr |
|---|---|---|
| Auto Failover | ✅ Native, consensus-based | ⚠ Requires repmgrd + extra setup |
| Split-brain Protection | ✅ etcd/ZK consensus prevents it | ⚠ Risk without fencing |
| Complexity | Higher (needs etcd cluster) | Lower (simpler setup) |
| DCS Dependency | etcd / ZooKeeper / Consul | PostgreSQL itself |
| Cluster Management | patronictl (full REST API) | repmgr CLI |
| Industry Adoption | Widely used in enterprise | Common for simpler setups |
For production enterprise environments, Patroni is the recommended choice due to its reliable consensus-based failover and split-brain prevention.
Split-brain occurs when two PostgreSQL nodes both believe they are the primary and accept writes simultaneously — leading to data divergence and corruption.
- Patroni prevents split-brain using a distributed lock in etcd. Only the node holding the leader key can act as primary.
- The leader key has a TTL (Time-To-Live). The primary must renew it continuously. If the primary cannot reach etcd, it loses the key and demotes itself.
- Replicas will only promote if they successfully acquire the leader key — guaranteed by etcd's atomic compare-and-swap.
- Network partitions where a primary cannot reach etcd will cause self-demotion, never split-brain.
```sql
-- On the PRIMARY: lag per standby (bytes and estimated time)
SELECT application_name, client_addr, state, sync_state,
       pg_size_pretty(sent_lsn - replay_lsn) AS replay_lag_bytes,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

-- On the STANDBY: how far behind in seconds
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay,
       pg_is_in_recovery() AS is_standby,
       pg_last_wal_receive_lsn() AS received_lsn,
       pg_last_wal_replay_lsn() AS replayed_lsn;
```
- < 1 second — Excellent. Standard for healthy async replication on LAN.
- 1–30 seconds — Acceptable for most OLTP workloads.
- > 60 seconds — Alert: investigate I/O bottleneck on standby, network issues, or standby under load.
- High lag + network failure = data loss window. Size your recovery point objective (RPO) accordingly.
pg_upgrade upgrades a PostgreSQL cluster to a new major version in-place, by copying only the catalog files and hard-linking or copying the data files — much faster than a dump/restore for large databases.
```shell
# Step 1: Install new PostgreSQL 17 binaries (don't init cluster)
yum install postgresql17-server

# Step 2: Initialize new (empty) cluster
/usr/pgsql-17/bin/initdb -D /pgdata/pgsql17/data

# Step 3: Stop old cluster
systemctl stop postgresql-15

# Step 4: Run pg_upgrade check first (dry run)
/usr/pgsql-17/bin/pg_upgrade \
  -b /usr/pgsql-15/bin \
  -B /usr/pgsql-17/bin \
  -d /pgdata/pgsql15/data \
  -D /pgdata/pgsql17/data \
  --check

# Step 5: Run actual upgrade (--link = hard-link, no copy)
/usr/pgsql-17/bin/pg_upgrade \
  -b /usr/pgsql-15/bin \
  -B /usr/pgsql-17/bin \
  -d /pgdata/pgsql15/data \
  -D /pgdata/pgsql17/data \
  --link

# Step 6: Start new cluster and run post-upgrade tasks
systemctl start postgresql-17
./analyze_new_cluster.sh  # generated by pg_upgrade
```
pg_cron is a PostgreSQL extension that enables scheduling of SQL queries or stored procedures using cron syntax — directly inside the database, without needing external OS cron jobs.
```sql
-- Enable in postgresql.conf (requires restart)
-- shared_preload_libraries = 'pg_cron'
-- cron.database_name = 'mydb'

CREATE EXTENSION pg_cron;

-- Schedule a job: delete old logs every day at 3 AM
SELECT cron.schedule(
    'cleanup-old-logs',
    '0 3 * * *',
    $$DELETE FROM app_logs
      WHERE created_at < NOW() - INTERVAL '90 days'$$
);

-- Schedule a job every 5 minutes
SELECT cron.schedule('refresh-stats', '*/5 * * * *',
    $$CALL refresh_materialized_views()$$);

-- List all scheduled jobs
SELECT * FROM cron.job;

-- View execution history
SELECT * FROM cron.job_run_details
ORDER BY start_time DESC
LIMIT 10;
```
A `FATAL: remaining connection slots are reserved for non-replication superuser connections` error means `max_connections` is exhausted. Here is a systematic approach to diagnose and resolve it:
```sql
-- Step 1: Check current connection usage
SELECT max_conn, used_conn, superuser_reserved,
       max_conn - used_conn AS free_conn
FROM (
    SELECT current_setting('max_connections')::int AS max_conn,
           COUNT(*) AS used_conn,
           current_setting('superuser_reserved_connections')::int
               AS superuser_reserved
    FROM pg_stat_activity
) t;

-- Step 2: Find connection counts per user/database
SELECT usename, datname, state, COUNT(*)
FROM pg_stat_activity
GROUP BY usename, datname, state
ORDER BY COUNT(*) DESC;

-- Step 3: Find idle-in-transaction sessions to terminate
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - xact_start > interval '30 minutes';
```
- Immediate fix — Terminate idle/stale sessions as shown above.
- Short-term fix — Temporarily raise `max_connections` (requires restart). Note: each connection uses roughly 5–10 MB of RAM.
- Permanent fix — Deploy PgBouncer in transaction pooling mode to dramatically reduce the number of actual PostgreSQL connections needed.
- Set `idle_in_transaction_session_timeout` to auto-kill stuck transactions.
- Set a `CONNECTION LIMIT` per role to prevent any single user from exhausting connections.
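The last two guardrails can be sketched as follows (the role name is illustrative):

```sql
-- Cap connections for a single application role
ALTER ROLE app_user CONNECTION LIMIT 50;

-- Auto-terminate sessions that sit idle inside a transaction
ALTER SYSTEM SET idle_in_transaction_session_timeout = '10min';
SELECT pg_reload_conf();
```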