PostgreSQL Advanced DBA Interview
Expert-level questions covering query planner internals, JIT compilation, logical decoding & CDC, Citus sharding, custom extensions, WAL internals, parallel query, and Oracle-to-PostgreSQL migration strategy.
The PostgreSQL query planner uses a cost-based optimizer (CBO). A query passes through four stages:
- Parse — SQL text → parse tree (syntax check via gram.y). Produces an abstract syntax tree.
- Analyze/Rewrite — Semantic validation, name resolution, rule rewriting (views expand here). Produces a query tree.
- Plan/Optimize — Generates candidate execution plans, estimates their cost using pg_statistic data, and selects the cheapest plan.
- Execute — The executor walks the plan tree, calling each node's Init/Next/End functions.
The planner estimates cost using three components: seq_page_cost, random_page_cost, and cpu_tuple_cost. Row count estimates come from pg_statistic (populated by ANALYZE). The planner uses a dynamic programming algorithm for join ordering up to join_collapse_limit tables; beyond that it switches to a genetic algorithm (GEQO).
```sql
-- See full plan with cost breakdown
EXPLAIN (ANALYZE, BUFFERS, SETTINGS, WAL, FORMAT JSON)
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'ACTIVE';

-- Tune planner cost constants
SET random_page_cost = 1.1;  -- SSD (default 4.0 for HDD)
SET seq_page_cost = 1.0;     -- default
SET cpu_tuple_cost = 0.01;   -- default

-- Check statistics on a column
SELECT * FROM pg_statistic WHERE starelid = 'orders'::regclass;
```
PostgreSQL has no built-in query hints; if you need them, use the pg_hint_plan extension.

JIT (Just-In-Time) compilation uses LLVM to compile portions of an execution plan into native machine code at query runtime, rather than interpreting them through PostgreSQL's generic executor. It was introduced in PostgreSQL 11.
- What gets JIT-compiled: expression evaluation, tuple deforming, aggregate functions, projections.
- When it helps: CPU-intensive analytical queries with complex WHERE clauses, large aggregations, or many columns — typically OLAP workloads.
- When it hurts: OLTP queries with low row counts. JIT compilation has ~20–100ms overhead. For short queries this overhead exceeds any gain.
```sql
-- Check if JIT is enabled
SHOW jit;
SHOW jit_above_cost;          -- default: 100000
SHOW jit_inline_above_cost;   -- default: 500000
SHOW jit_optimize_above_cost; -- default: 500000

-- Disable JIT for OLTP workloads (set in postgresql.conf):
--   jit = off

-- Check if JIT was used in a plan
EXPLAIN (ANALYZE, FORMAT TEXT)
SELECT SUM(amount * discount * tax_rate)
FROM orders WHERE year = 2025;
-- Look for: JIT: Functions: N, Options: Inlining true...
```
For OLTP systems, raise jit_above_cost or disable JIT globally and enable it only for specific analytical sessions.

PostgreSQL can execute certain queries using multiple CPU cores via parallel workers (background processes). The leader process divides work among parallel workers using Parallel Seq Scan, Parallel Index Scan, Parallel Hash Join, and Parallel Aggregate.
```ini
max_parallel_workers = 8              # total parallel workers
max_parallel_workers_per_gather = 4   # per query
max_parallel_maintenance_workers = 4  # for CREATE INDEX
parallel_setup_cost = 1000
parallel_tuple_cost = 0.1
min_parallel_table_scan_size = '8MB'
```

```sql
-- Force parallel for testing
SET max_parallel_workers_per_gather = 4;
SET parallel_setup_cost = 0;
EXPLAIN ANALYZE SELECT COUNT(*) FROM large_table;
```
Limitations:
- Parallel query is NOT used inside transactions with SERIALIZABLE isolation level.
- Queries with FOR UPDATE/SHARE, CURSOR, or LIMIT (in some cases) don't parallelize.
- Functions marked PARALLEL UNSAFE (the default for user functions) prevent parallelism.
- Small tables: the planner's parallel_setup_cost threshold prevents parallel overhead on tiny scans.
- Worker spawning has overhead — not beneficial for queries under ~100ms.
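To illustrate the function restriction above, a user-defined function can be marked PARALLEL SAFE so the planner is allowed to parallelize queries that call it. A minimal sketch (the function and table names here are hypothetical):

```sql
-- User functions default to PARALLEL UNSAFE; mark pure functions SAFE
CREATE OR REPLACE FUNCTION net_price(amount numeric, tax numeric)
RETURNS numeric
LANGUAGE sql
IMMUTABLE PARALLEL SAFE
AS $$ SELECT amount * (1 + tax) $$;

-- Verify the parallel-safety flag (s = safe, r = restricted, u = unsafe)
SELECT proname, proparallel FROM pg_proc WHERE proname = 'net_price';
```

With the function marked SAFE, an aggregate such as SELECT SUM(net_price(amount, tax_rate)) over a large table becomes eligible for a Parallel Aggregate plan.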
By default, PostgreSQL collects statistics on each column independently. When columns are correlated (e.g., city + zip_code), the planner severely underestimates row counts for multi-column predicates. Extended statistics capture cross-column correlations.
```sql
-- Problem: city and zip_code are correlated
-- Planner underestimates: rows = 5 (actual: 500)
EXPLAIN ANALYZE
SELECT * FROM addresses
WHERE city = 'London' AND zip_code = 'EC1A';

-- Solution: create extended statistics on correlated columns
CREATE STATISTICS stat_city_zip (dependencies)
    ON city, zip_code FROM addresses;
ANALYZE addresses;
-- Now rerun EXPLAIN — row estimate dramatically improves

-- Types of extended statistics:
--   dependencies: functional dependency (A determines B)
--   ndistinct:    number of distinct value combinations
--   mcv:          most common value combinations (PostgreSQL 12+)
CREATE STATISTICS stat_full (dependencies, ndistinct, mcv)
    ON city, state, zip_code FROM addresses;
```
The Oracle equivalent is DBMS_STATS.CREATE_EXTENDED_STATS. In PostgreSQL: CREATE STATISTICS. Both solve the same correlated-column row estimation problem.

| Join Type | Algorithm | Memory | Best When |
|---|---|---|---|
| Nested Loop | For each outer row, scan inner table/index | Low | Small inner table, index available on join key, OLTP |
| Hash Join | Build hash table from smaller relation, probe with larger | High (work_mem) | Large tables, no useful index, OLAP aggregations |
| Merge Join | Sort both sides, merge-scan sorted results | Medium (sort memory) | Both sides already sorted or have btree index on join key |
```sql
-- Diagnose which join is being chosen
EXPLAIN (ANALYZE, FORMAT TEXT)
SELECT * FROM orders o JOIN customers c ON o.cid = c.id;

-- Force specific join types for testing
SET enable_hashjoin = off;
SET enable_nestloop = off;
SET enable_mergejoin = off;

-- Hash join spills to disk if work_mem is too low
-- Look for: Batches: N (if N > 1, it spilled to disk)
SET work_mem = '256MB';  -- raise per session for large joins
```
If you see Hash Batches: 8 in EXPLAIN output, the hash join spilled to disk 8 times. Increasing work_mem for that session can collapse it to 1 batch (in-memory) and cut the query time dramatically.

Every change in PostgreSQL generates a WAL record — a binary log entry describing exactly what changed. WAL records are appended sequentially to WAL segment files (16MB each by default).
A WAL record contains: header (total length, resource manager ID, CRC32), data (the actual change: block number, offset, new data), and optionally a full-page image (FPI) — a complete copy of the 8KB block written on the first modification after a checkpoint.
- LSN (Log Sequence Number) — A 64-bit byte offset into the WAL stream. Format: 0/A1B2C3D4. Monotonically increasing. Used to identify a precise WAL position for replication lag, PITR, and crash recovery.
- Full-Page Images (FPI) — Written to survive partial write scenarios (torn pages). Controlled by full_page_writes = on (default). This is the main driver of WAL write amplification.
- WAL compression — wal_compression = on compresses FPIs with LZ4/zstd, dramatically reducing WAL volume with minimal CPU overhead.
```sql
-- Current WAL write position (LSN)
SELECT pg_current_wal_lsn();

-- WAL generated between two points (bytes)
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/A1B2C3D4');

-- Decode WAL records (pg_walinspect extension, PG 15+)
CREATE EXTENSION pg_walinspect;
SELECT * FROM pg_get_wal_records_info('0/A1000000', '0/A2000000')
LIMIT 20;

-- Check WAL statistics
SELECT * FROM pg_stat_wal;
```
Write amplification is when PostgreSQL writes significantly more data to disk than the logical data change size. The primary sources are full-page images (FPIs) in WAL and autovacuum rewrites.
- FPI amplification: After every checkpoint, the first write to each page generates a full 8KB FPI in WAL — even if only 10 bytes changed. A table doing many random updates can generate 100x WAL volume vs actual data changes.
- Heap amplification: MVCC means every UPDATE writes a new row version — the old version stays until vacuumed. High-update tables swell in size.
- Index amplification: Every row version change requires an index entry update. Partial indexes and HOT (Heap-Only Tuple) updates help.
```sql
-- Measure WAL generation rate per statement
EXPLAIN (ANALYZE, WAL)
UPDATE orders SET status = 'CLOSED' WHERE id = 42;
-- Output shows: WAL Records: N, WAL Bytes: N
```

Ways to reduce write amplification:

```ini
# 1. Enable WAL compression (huge win for FPIs)
wal_compression = lz4          # or zstd (PG 15+)

# 2. Increase checkpoint_timeout (fewer FPI triggers)
checkpoint_timeout = '30min'
```

```sql
-- 3. Use HOT updates (HOT = Heap-Only Tuple: no index update needed)
-- Requires: update only non-indexed columns,
--           and sufficient fillfactor on the table
ALTER TABLE orders SET (fillfactor = 80);
```
A checkpoint is a point where PostgreSQL guarantees all dirty (modified) shared_buffer pages have been flushed to disk. After a crash, recovery only needs to replay WAL from the last checkpoint — reducing recovery time.
- Triggers: checkpoint_timeout elapsed, max_wal_size exceeded, or an explicit CHECKPOINT command.
- The checkpointer process writes dirty pages spread across the checkpoint window — controlled by checkpoint_completion_target.
- Checkpoints cause I/O spikes if they complete too quickly. Setting checkpoint_completion_target = 0.9 spreads the writes over 90% of the checkpoint interval.
```ini
checkpoint_timeout = '15min'         # default 5min — longer = less WAL
max_wal_size = '4GB'                 # default 1GB
min_wal_size = '1GB'
checkpoint_completion_target = 0.9   # spread I/O over 90% of interval
```

```sql
-- Monitor checkpoint frequency
SELECT checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean,
       round(buffers_checkpoint::numeric /
             NULLIF(checkpoints_timed + checkpoints_req, 0)) AS avg_bufs
FROM pg_stat_bgwriter;
-- High checkpoints_req = max_wal_size too small
```
If checkpoints_req is much higher than checkpoints_timed, your max_wal_size is too small and checkpoints are being forced by WAL volume, not by time. Increase max_wal_size.

synchronous_commit controls when PostgreSQL considers a transaction committed relative to WAL durability. It is a per-transaction setting — no restart required.
| Value | WAL Written to Disk | Standby Received | Standby Applied | Risk |
|---|---|---|---|---|
| on (default) | ✅ Before COMMIT returns | — | — | None (fully durable) |
| off | ❌ Async (up to 3x wal_writer_delay later) | — | — | Last ~600ms of commits on crash |
| remote_write | ✅ | ✅ Written to OS buffer | — | Small: standby OS crash |
| remote_apply | ✅ | ✅ | ✅ Applied to standby | None — reads from standby guaranteed fresh |
| local | ✅ Primary only | — | — | Standby may lag |
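Because the setting is per-transaction, durability can be relaxed only where it is acceptable. A minimal sketch (the table name is hypothetical):

```sql
-- Relax durability for a single non-critical transaction only
BEGIN;
SET LOCAL synchronous_commit = off;  -- applies to this transaction alone
INSERT INTO audit_log_events (payload) VALUES ('page_view');
COMMIT;  -- returns without waiting for the WAL flush

-- Everything else on the connection stays fully durable
SHOW synchronous_commit;  -- back to the session/default value
```

SET LOCAL reverts at COMMIT/ROLLBACK, so there is no risk of the relaxed setting leaking into later transactions.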
synchronous_commit = off is safe for non-critical workloads (logging, analytics inserts) and can give 3–10x write throughput improvement with only a very small durability window. The database will never corrupt — you only risk losing the last ~600ms of commits on a hard crash.

Logical decoding interprets WAL records and emits them as a logical stream of row-level changes (INSERT/UPDATE/DELETE) using a pluggable output plugin. This powers Change Data Capture (CDC) without application code changes.
- Output plugins: pgoutput (built-in, used by logical replication), wal2json (JSON output), test_decoding (text, for debugging).
- Requires wal_level = logical and a logical replication slot to track consumer position.
- Debezium is the most popular CDC framework — it connects to PostgreSQL's logical decoding API and streams changes to Kafka.
```sql
-- Create a logical decoding slot with wal2json
SELECT pg_create_logical_replication_slot('my_cdc_slot', 'wal2json');

-- Peek at changes (non-consuming)
SELECT * FROM pg_logical_slot_peek_changes('my_cdc_slot', NULL, NULL);

-- Consume changes (advances slot position)
SELECT * FROM pg_logical_slot_get_changes(
    'my_cdc_slot', NULL, NULL, 'pretty-print', '1'
);

-- Always monitor slot lag!
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(
           pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
FROM pg_replication_slots;
```
A production CDC pipeline from PostgreSQL to a data warehouse (Snowflake, BigQuery, Redshift) typically uses: PostgreSQL → Debezium → Apache Kafka → Kafka Connect sink.
- Step 1 — PostgreSQL config: Set wal_level = logical, create a dedicated replication user, add a pg_hba.conf entry.
- Step 2 — Debezium PostgreSQL connector: Creates a replication slot (pgoutput plugin), reads WAL changes, publishes to Kafka topics per table.
- Step 3 — Kafka topics: Each table gets a topic (e.g., pg.public.orders). Payload contains before/after row images in Avro or JSON.
- Step 4 — Sink connector: Kafka Connect sink writes to the target (Snowflake Kafka Connector, BigQuery Connector).
```json
{
  "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
  "database.hostname": "192.168.1.51",
  "database.port": "5432",
  "database.user": "repl_user",
  "database.password": "secret",
  "database.dbname": "mydb",
  "database.server.name": "pg",
  "plugin.name": "pgoutput",
  "slot.name": "debezium_slot",
  "publication.name": "dbz_publication",
  "table.include.list": "public.orders,public.customers",
  "snapshot.mode": "initial"
}
```

Set max_slot_wal_keep_size = '10GB' as a safety valve — but note: if exceeded, the slot is invalidated and you must re-snapshot.

REPLICA IDENTITY controls what column data is included in the WAL record for UPDATE and DELETE operations. Without sufficient identity columns, subscribers cannot identify which row to update or delete.
| Setting | OLD Row Data in WAL | Use When |
|---|---|---|
| DEFAULT | Primary key columns only | Table has a primary key (recommended) |
| FULL | All columns (before image) | No PK — but doubles WAL volume! |
| NOTHING | None | Only INSERT replication needed |
| USING INDEX | Specified unique index columns | No PK but has unique index |
```sql
-- Check current replica identity
SELECT relname, relreplident FROM pg_class WHERE relname = 'orders';
-- d=DEFAULT, f=FULL, n=NOTHING, i=INDEX

-- Set FULL for tables without a PK (CDC requirement)
ALTER TABLE orders_legacy REPLICA IDENTITY FULL;

-- Use an index instead (less WAL overhead than FULL)
ALTER TABLE orders REPLICA IDENTITY USING INDEX idx_orders_uuid;
```
Citus is an open-source PostgreSQL extension (developed by Microsoft) that transforms a single PostgreSQL instance into a distributed database by adding a coordinator node and multiple worker nodes.
- Coordinator node: Receives all queries. Parses and plans them. Routes shard-specific parts to workers. Merges results.
- Worker nodes: Each holds a subset of shards. Executes pushed-down query fragments in parallel.
- Distribution column: A column used to hash rows across shards. Choose carefully — cross-shard JOINs are expensive.
- Shard count: Default 32 shards per table. Shards are regular PostgreSQL tables on workers.
```sql
-- On coordinator: add worker nodes
SELECT citus_add_node('worker1', 5432);
SELECT citus_add_node('worker2', 5432);

-- Distribute a table by tenant_id
SELECT create_distributed_table('orders', 'tenant_id');

-- Colocate related tables (JOIN-friendly)
SELECT create_distributed_table('order_items', 'tenant_id',
                                colocate_with := 'orders');

-- Reference table (replicated to all workers)
SELECT create_reference_table('countries');

-- Check shard placement
SELECT * FROM citus_shards;
SELECT * FROM pg_dist_placement;
```
For a multi-tenant SaaS application, tenant_id is ideal: all tenant data goes to the same shard, enabling efficient local JOINs without cross-node network hops.

Table colocation ensures that related tables (e.g., orders and order_items) that share the same distribution key have their corresponding shards placed on the same worker node. This allows JOINs between them to be executed locally on each worker — no network hops between workers.
```sql
-- Without colocation: JOIN requires cross-node data transfer
-- With colocation: JOIN runs locally on each worker (parallel)
SELECT create_distributed_table('orders', 'tenant_id');
SELECT create_distributed_table('order_items', 'tenant_id',
                                colocate_with := 'orders');

-- This JOIN now executes locally on workers (fast)
SELECT o.id, SUM(oi.amount)
FROM orders o
JOIN order_items oi
  ON o.id = oi.order_id AND o.tenant_id = oi.tenant_id
WHERE o.tenant_id = 42
GROUP BY o.id;

-- Verify colocation group assignment
SELECT logicalrelid, colocationid
FROM pg_dist_partition
WHERE logicalrelid::text IN ('orders', 'order_items');
```
- No cross-shard transactions with full ACID: Multi-shard writes use 2-phase commit (2PC) which is slower and can leave transactions in an uncertain state if coordinator crashes.
- Changing distribution column: Once a table is distributed, you cannot easily change the distribution column without redistributing all data.
- Cross-shard JOINs on non-distribution key: If you JOIN on a column other than the distribution key, Citus must repartition data on the fly — expensive and slow.
- Sequences: Citus uses sharded sequences that may produce gaps. Standard SERIAL sequences may not be globally unique across workers.
- DDL limitations: Some DDL operations require quorum across all nodes. Schema changes must be applied carefully.
- Aggregation pushdown: Only certain aggregates (SUM, COUNT, AVG, MIN, MAX) can be pushed down. Complex aggregates may pull data to coordinator.
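To see the repartition penalty from the list above in practice, EXPLAIN a JOIN whose condition lacks the distribution key. This sketch reuses the orders table from the earlier examples and assumes a customers table distributed by tenant_id; the plan shapes described in the comments are what Citus typically reports:

```sql
-- JOIN includes the distribution key: executed locally per shard
EXPLAIN SELECT o.id, c.name
FROM orders o
JOIN customers c
  ON o.tenant_id = c.tenant_id AND o.customer_id = c.id;

-- JOIN on a non-distribution column: Citus must repartition on the fly
-- (may require: SET citus.enable_repartition_joins = on;)
-- Look for repartition/merge task steps in the distributed plan
EXPLAIN SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
```

Comparing the two plans makes the cost visible: the first runs one fragment per colocated shard pair, the second adds intermediate-result shuffling between workers.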
Background workers (BGW) are custom processes that run inside the PostgreSQL server process tree — started at server startup or dynamically. They are used by extensions to run periodic tasks, listen to events, or implement custom replication.
- pg_cron — Uses a BGW to wake up every minute and check the cron schedule.
- pglogical — Uses BGWs per subscription to stream WAL changes.
- Patroni — Not a BGW; Patroni watches PostgreSQL as an external agent.
- BGWs run as normal postgres backend processes — they can execute SQL, access shared memory, and use background worker APIs.
```c
#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"
#include "postmaster/interrupt.h"
#include "storage/latch.h"

PG_MODULE_MAGIC;

/* In your extension's _PG_init() function */
void
_PG_init(void)
{
    BackgroundWorker worker;

    MemSet(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
                       BGWORKER_BACKEND_DATABASE_CONNECTION;
    worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
    worker.bgw_restart_time = 10;   /* restart after 10s on crash */
    snprintf(worker.bgw_name, BGW_MAXLEN, "my_extension worker");
    /* entry point is looked up by library + function name */
    snprintf(worker.bgw_library_name, BGW_MAXLEN, "my_extension");
    snprintf(worker.bgw_function_name, BGW_MAXLEN, "my_worker_main");
    worker.bgw_main_arg = (Datum) 0;
    RegisterBackgroundWorker(&worker);
}

/* Worker main function: runs in its own process */
void
my_worker_main(Datum arg)
{
    /* arrange for ShutdownRequestPending to be set on SIGTERM */
    pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
    BackgroundWorkerUnblockSignals();

    BackgroundWorkerInitializeConnection("mydb", NULL, 0);

    while (!ShutdownRequestPending)
    {
        /* do work every 5 seconds */
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         5000L, PG_WAIT_EXTENSION);
        ResetLatch(MyLatch);
    }
}
```
Wait events in pg_stat_activity tell you exactly what a session is waiting for — the single most important diagnostic tool for PostgreSQL performance problems.
```sql
-- Real-time wait event summary (run repeatedly)
SELECT wait_event_type, wait_event, COUNT(*) AS sessions
FROM pg_stat_activity
WHERE state != 'idle'
GROUP BY wait_event_type, wait_event
ORDER BY sessions DESC;
```
| Wait Event | Type | Meaning & Action |
|---|---|---|
| Lock:relation | Lock | Table-level lock contention. Find blocker with pg_blocking_pids(). |
| Lock:tuple | Lock | Row-level lock contention. Typical in high-concurrency OLTP. |
| IO:DataFileRead | IO | Reading data pages from disk. shared_buffers too small or missing indexes. |
| IO:WALWrite | IO | WAL writes slow. Check disk I/O, consider wal_compression. |
| CPU | CPU | Query is running. Normal — unless it never finishes. Check for full table scans. |
| Client:ClientRead | Client | Waiting for client to send next command. Idle connection pooling issue. |
| Lock:advisory | Lock | Application-level advisory lock contention. |
| LWLock:WALInsert | LWLock | Contention on WAL insertion. Very high write concurrency. |
PostgreSQL has no built-in ASH equivalent, but you can build one by periodically sampling pg_stat_activity into a history table. Extensions like pg_activity, pgBadger, and Percona Monitoring and Management (PMM) provide this.
```sql
-- Create ASH history table
CREATE TABLE ash_history (
    sample_time     TIMESTAMPTZ DEFAULT NOW(),
    pid             INT,
    usename         TEXT,
    datname         TEXT,
    state           TEXT,
    wait_event_type TEXT,
    wait_event      TEXT,
    query           TEXT
);

-- Schedule sampling every minute via pg_cron
SELECT cron.schedule('ash-sample', '* * * * *', $$
    INSERT INTO ash_history
    SELECT NOW(), pid, usename, datname, state,
           wait_event_type, wait_event, left(query, 200)
    FROM pg_stat_activity
    WHERE state != 'idle' AND pid != pg_backend_pid();
$$);

-- Query ASH: top wait events in the last hour
SELECT wait_event_type, wait_event, COUNT(*) AS samples
FROM ash_history
WHERE sample_time > NOW() - interval '1 hour'
GROUP BY 1, 2 ORDER BY 3 DESC;
```
A PostgreSQL extension packages custom SQL objects (functions, types, operators, views) with C or PL/pgSQL code. The minimum files are: a control file, an SQL script, and optionally a C shared library.
## Directory structure

```
my_extension/
├── my_extension.control   # extension metadata
├── my_extension--1.0.sql  # SQL definitions
├── my_extension.c         # C implementation
└── Makefile
```

## my_extension.control

```ini
comment = 'My custom extension'
default_version = '1.0'
relocatable = true
module_pathname = '$libdir/my_extension'
```

## my_extension--1.0.sql

```sql
CREATE FUNCTION my_add(a INT, b INT) RETURNS INT
AS 'MODULE_PATHNAME', 'my_add'
LANGUAGE C STRICT;
```

## my_extension.c

```c
#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(my_add);

Datum
my_add(PG_FUNCTION_ARGS)
{
    int32 a = PG_GETARG_INT32(0);
    int32 b = PG_GETARG_INT32(1);
    PG_RETURN_INT32(a + b);
}
```

## Install

```shell
make install
psql -c "CREATE EXTENSION my_extension;"
```
| Feature | Oracle | PostgreSQL Equivalent |
|---|---|---|
| Empty string vs NULL | '' = NULL (same!) | '' ≠ NULL — critical behavioural difference |
| ROWNUM / ROWID | ROWNUM, ROWID | LIMIT/OFFSET, ctid (internal, not stable) |
| Packages | CREATE PACKAGE (state + code) | No direct equivalent — use schemas + functions |
| CONNECT BY (hierarchical) | CONNECT BY PRIOR | WITH RECURSIVE (standard SQL) |
| DECODE / NVL | DECODE(), NVL() | CASE WHEN, COALESCE() |
| Sequences | NEXTVAL on sequence object | SERIAL / IDENTITY columns or explicit sequences |
| SYSDATE / SYSTIMESTAMP | SYSDATE | NOW(), CURRENT_TIMESTAMP, CURRENT_DATE |
| VARCHAR2 | VARCHAR2(n BYTE/CHAR) | VARCHAR(n) — always character-based |
| DUAL table | SELECT 1 FROM DUAL | SELECT 1 (no FROM needed) |
| Outer join syntax | WHERE a.id = b.id(+) | Standard ANSI: LEFT JOIN ... ON |
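As one concrete conversion from the table above, Oracle's CONNECT BY hierarchy query maps to a standard recursive CTE. A sketch, assuming an illustrative employees(id, name, manager_id) table:

```sql
-- Oracle:
--   SELECT id, name FROM employees
--   START WITH manager_id IS NULL
--   CONNECT BY PRIOR id = manager_id;

-- PostgreSQL equivalent:
WITH RECURSIVE org AS (
    SELECT id, name, manager_id, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL           -- START WITH
  UNION ALL
    SELECT e.id, e.name, e.manager_id, org.depth + 1
    FROM employees e
    JOIN org ON e.manager_id = org.id  -- CONNECT BY PRIOR
)
SELECT * FROM org;
```

The depth column stands in for Oracle's LEVEL pseudo-column; ordering by path requires carrying an array of ancestor ids in the CTE.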
Critical gotcha: in Oracle the empty string '' IS NULL, so code written for Oracle often treats the two interchangeably. In PostgreSQL they are completely different values. Applications that store '' in Oracle to mean "no value" will break silently if migrated without a code audit.

- ora2pg — Open-source tool that exports Oracle schema, data, and PL/SQL to PostgreSQL-compatible SQL. Best for schema conversion and data export. Handles tables, indexes, sequences, views, functions, triggers.
- AWS Schema Conversion Tool (SCT) — GUI tool, good for assessing migration complexity and converting stored procedures.
- AWS Database Migration Service (DMS) — For ongoing data replication from Oracle to PostgreSQL during cutover period.
- Striim / GoldenGate — For zero-downtime migration using CDC from Oracle to PostgreSQL.
```shell
# Step 1: Assess migration complexity
ora2pg --type SHOW_REPORT --estimate_cost \
       --oracle_dsn "dbi:Oracle:host=ora-server;sid=ORCL"

# Step 2: Export schema only (no data)
ora2pg --type TABLE -o tables.sql
ora2pg --type INDEX -o indexes.sql
ora2pg --type PACKAGE -o packages.sql   # converts to PL/pgSQL

# Step 3: Export data in COPY format (fast)
ora2pg --type COPY --table ORDERS -o orders_data.sql

# Step 4: Load into PostgreSQL
psql -d targetdb -f tables.sql
psql -d targetdb -f orders_data.sql
```
Oracle PL/SQL packages (package spec + body with global state) have no direct equivalent in PostgreSQL. The conversion strategy depends on what the package does:
- Package as namespace: Create a PostgreSQL schema with the package name. Put all functions inside that schema.
- Package global variables: No equivalent. Use a session-local temporary table, a custom GUC parameter, or pass state explicitly as function parameters.
- Package initialization code: Use a function called explicitly on session start, or a SET LOCAL in a transaction.
- EXECUTE IMMEDIATE: Convert to PostgreSQL EXECUTE format(...) in PL/pgSQL.
- Autonomous transactions: PostgreSQL has only limited support, via dblink or the pg_background extension.
```sql
-- Oracle package → PostgreSQL schema approach
-- Oracle: CREATE PACKAGE emp_pkg AS ...
-- PostgreSQL equivalent:
CREATE SCHEMA emp_pkg;

CREATE OR REPLACE FUNCTION emp_pkg.get_salary(p_id INT)
RETURNS NUMERIC AS $$
BEGIN
    RETURN (SELECT salary FROM employees WHERE id = p_id);
END;
$$ LANGUAGE plpgsql;

-- Oracle global package variable → custom GUC
-- In a C extension: DefineCustomIntVariable("myapp.user_id", ...)
-- Or purely in SQL:
SELECT set_config('myapp.user_id', '42', true);  -- true = transaction-local
SELECT current_setting('myapp.user_id')::INT;
```
Row-Level Security (RLS) allows you to define policies that restrict which rows a user can see or modify. The database enforces these policies transparently — the application doesn't need to add WHERE clauses.
```sql
-- Step 1: Enable RLS on the table
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

-- Step 2: Create isolation policy (tenants see only their data)
CREATE POLICY tenant_isolation ON orders
    USING (tenant_id = current_setting('app.current_tenant')::INT);

-- Step 3: Application sets tenant context on connection
SET app.current_tenant = '42';
SELECT * FROM orders;  -- tenant 42 now only sees their own rows!

-- Step 4: Superusers bypass RLS by default.
-- Force RLS even for the table owner:
ALTER TABLE orders FORCE ROW LEVEL SECURITY;

-- Separate policies for SELECT vs INSERT vs UPDATE
CREATE POLICY select_policy ON orders FOR SELECT
    USING (tenant_id = current_setting('app.current_tenant')::INT);
CREATE POLICY insert_policy ON orders FOR INSERT
    WITH CHECK (tenant_id = current_setting('app.current_tenant')::INT);
```
pgAudit is the standard PostgreSQL extension for detailed audit logging. It extends PostgreSQL's standard log_statement to provide fine-grained control over which operations are logged — required for SOX, PCI-DSS, HIPAA compliance.
## postgresql.conf

```ini
shared_preload_libraries = 'pgaudit'

# Session-level audit (all DDL + DML)
pgaudit.log = 'ddl, write, role'
# Options: read, write, function, role, ddl, misc, all

# Object-level audit (specific tables)
pgaudit.log_level = log
pgaudit.log_relation = on
pgaudit.log_parameter = on
```

```sql
-- Enable extension
CREATE EXTENSION pgaudit;

-- Audit a specific role
ALTER ROLE sensitive_user SET pgaudit.log = 'all';

-- Audit a specific table (object-level)
ALTER ROLE auditor SET pgaudit.log = 'read';
-- then grant privileges — pgaudit tracks access

-- Log format (in PostgreSQL log file):
-- AUDIT: SESSION,1,1,DDL,CREATE TABLE,,,CREATE TABLE secret_data...
```
This is a capstone architecture question. A 50TB OLTP database with five-nines availability (~5 minutes downtime/year) requires layered redundancy across every tier.
- HA Layer: Patroni cluster (1 primary + 2 synchronous replicas + 1 async replica in DR region). etcd 3-node cluster for leader election. HAProxy + Keepalived for VIP failover. Target RTO: <30 seconds automatic.
- Synchronous replication: synchronous_standby_names = 'FIRST 1 (replica1, replica2)' — RPO = 0 (zero data loss) for primary datacenter failure.
- Storage: NVMe SSDs with hardware RAID-10 or all-flash SAN. random_page_cost = 1.1, wal_compression = lz4.
- Connection pooling: PgBouncer tier (2+ instances behind load balancer) in transaction mode. Target: 2000 app connections → 200 PostgreSQL connections.
- Partitioning: Range-partition large tables by date. Detach old partitions for archival without locking.
- Backup: pgBackRest with WAL archiving to S3/object storage. Daily base backups, continuous WAL archiving. RPO target: 5 minutes. Monthly restore drills.
- Monitoring: Prometheus + postgres_exporter + Grafana. Alert on: replication lag >5s, XID age >1.5B, dead tuple ratio >10%, connection usage >80%, checkpoint frequency, WAL retention of replication slots.
- Maintenance: pg_repack for online bloat removal. REINDEX CONCURRENTLY for index maintenance. Patch upgrades via logical replication-based rolling upgrade.
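A minimal postgresql.conf fragment tying the durability and WAL settings above together. This is a sketch, not a definitive template; the standby names, sizes, and pgBackRest stanza name are illustrative assumptions:

```ini
# Primary node — illustrative values only
synchronous_standby_names = 'FIRST 1 (replica1, replica2)'  # RPO = 0
synchronous_commit = on
wal_compression = lz4
max_wal_size = '16GB'
checkpoint_timeout = '30min'
checkpoint_completion_target = 0.9
max_slot_wal_keep_size = '50GB'   # protect disk from stalled slots
archive_mode = on
archive_command = 'pgbackrest --stanza=main archive-push %p'
```

Patroni manages most of these on members' behalf, so in practice they would live in Patroni's DCS-backed configuration rather than a hand-edited postgresql.conf.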