PostgreSQL Interview Questions

TL;DR

35+ PostgreSQL interview questions organized by topic, each with a detailed answer.

Short on time? Focus on Fundamentals and Query Performance sections — these come up in almost every interview.

PostgreSQL Fundamentals

Q: What is PostgreSQL and how does it differ from other relational databases?

PostgreSQL is an open-source object-relational database whose lineage goes back to the POSTGRES project started at Berkeley in 1986. It differentiates itself through: extensibility (custom types, functions, operators, index methods), standards compliance (closest to the SQL standard of any major database), advanced features out of the box (JSONB, full-text search, CTEs, window functions), and MVCC-based concurrency in which readers never block writers. Unlike MySQL, PostgreSQL has had robust support for transactions, subqueries, and foreign keys since its early releases.

Q: Explain ACID properties and how PostgreSQL implements them.

Atomicity: transactions are all-or-nothing, implemented via MVCC — uncommitted changes are invisible to other transactions. Consistency: constraints (PK, FK, CHECK, UNIQUE) are enforced at statement time by default, or at commit for constraints declared DEFERRABLE INITIALLY DEFERRED. Isolation: PostgreSQL accepts all four SQL isolation levels, but Read Uncommitted behaves identically to Read Committed; the three real levels (Read Committed, Repeatable Read, Serializable) are built on MVCC snapshot isolation. Durability: guaranteed by the Write-Ahead Log (WAL) — changes are flushed to the WAL before data files, so committed data survives crashes.

Q: What is MVCC and why does PostgreSQL use it?

Multi-Version Concurrency Control (MVCC) means PostgreSQL keeps multiple versions of each row. Every transaction sees a snapshot of the database at its start time. Readers don't block writers, and writers don't block readers. Each row has hidden xmin (creating transaction ID) and xmax (deleting transaction ID) columns. The trade-off: old row versions accumulate and must be cleaned up by VACUUM.
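The hidden MVCC columns can be inspected directly by naming them in a SELECT — a useful trick to show in an interview. A minimal sketch (the accounts table is hypothetical):

```sql
-- xmin: ID of the transaction that created this row version
-- xmax: 0 for a live row; otherwise the transaction that
--       deleted it (or created a newer version via UPDATE)
-- ctid: the row version's physical location (page, offset)
SELECT xmin, xmax, ctid, *
FROM accounts
LIMIT 5;
```

Running this before and after an UPDATE in another session is a quick way to watch new row versions appear.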

Q: What is the Write-Ahead Log (WAL) and what is it used for?

The WAL is a sequential log where PostgreSQL writes all changes before modifying actual data files. It serves three purposes: crash recovery (replay WAL after restart to restore committed data), replication (stream WAL to replicas for high availability), and point-in-time recovery (replay WAL up to a specific timestamp). WAL is written sequentially, which is much faster than random writes to data files.

Q: What are the key differences between PostgreSQL and MySQL?

Key differences: Concurrency — PostgreSQL uses MVCC natively; MySQL/InnoDB added it later. Data types — PostgreSQL has JSONB, arrays, hstore, range types, custom types; MySQL is more limited. Standards — PostgreSQL follows SQL standard more closely. Extensibility — PostgreSQL supports custom index types, operators, procedural languages (PL/pgSQL, PL/Python, PL/V8). Replication — PostgreSQL has native streaming and logical replication; MySQL has binlog-based replication. PostgreSQL is often preferred for complex queries and analytics; MySQL for simple read-heavy web workloads.

Q: What is the difference between a schema and a database in PostgreSQL?

A database is a fully isolated container — you cannot query across databases in a single SQL statement (without extensions like dblink/FDW). A schema is a namespace within a database that organizes tables, views, and functions. You can query across schemas freely: SELECT * FROM schema_a.users JOIN schema_b.orders .... The default schema is public. Use schemas for multi-tenant isolation or logical grouping; use separate databases only when you need complete isolation.

Data Types & Schema Design

Q: What is the difference between JSON and JSONB in PostgreSQL?

JSON stores data as a raw text string — preserves whitespace, key order, and duplicates. It must be re-parsed on every access. JSONB stores data in a decomposed binary format — faster to query, supports indexing (GIN indexes), but slightly slower to insert. Almost always use JSONB unless you need to preserve exact formatting. JSONB supports operators like @> (contains), ? (key exists), and ->> (extract as text).
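A short sketch of the JSONB operators and GIN indexing mentioned above (the events table and payload shape are hypothetical):

```sql
CREATE TABLE events (
    id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    payload jsonb NOT NULL
);

-- GIN index accelerates containment (@>) and key-existence (?) queries
CREATE INDEX events_payload_gin ON events USING GIN (payload);

INSERT INTO events (payload)
VALUES ('{"type": "signup", "user": {"plan": "pro"}}');

-- @> : rows whose payload contains this sub-document
SELECT * FROM events WHERE payload @> '{"type": "signup"}';

-- -> navigates to a JSON value; ->> extracts it as text
SELECT payload -> 'user' ->> 'plan' AS plan FROM events;
```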

Q: When would you use PostgreSQL array columns versus a separate table?

Use arrays when: the list is small and always accessed as a whole (tags, labels), you don't need to join on individual elements, and order matters. Use a separate table when: elements have their own attributes, you need referential integrity, you'll query by individual elements frequently, or the list can grow unbounded. Arrays break first normal form and can make queries less efficient for large or frequently filtered datasets.

Q: What are the pros and cons of ENUM types in PostgreSQL?

Pros: type safety (only valid values allowed), compact storage (4 bytes), self-documenting, and fast comparisons. Cons: adding new values requires ALTER TYPE ... ADD VALUE (before PostgreSQL 12 this could not run inside a transaction block; since PG 12 it can, but the new value can't be used until the transaction commits), removing values is not supported (you must recreate the type), and types can't be shared across databases. Alternative: a lookup table with a foreign key is more flexible but requires a JOIN.
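A minimal sketch of defining and evolving an enum (order_status and the orders table are illustrative):

```sql
CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'delivered');

CREATE TABLE orders (
    id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    status order_status NOT NULL DEFAULT 'pending'
);

-- In PG 12+ this may run inside a transaction block, but the
-- new value cannot be used before that transaction commits
ALTER TYPE order_status ADD VALUE 'cancelled' AFTER 'delivered';
```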

Q: Compare SERIAL, IDENTITY, and UUID for primary keys.

SERIAL is a legacy shortcut that creates a sequence and sets a default. IDENTITY (SQL standard, PostgreSQL 10+) is preferred: id INT GENERATED ALWAYS AS IDENTITY — it prevents accidental manual inserts. UUID (using gen_random_uuid()) is globally unique and safe for distributed systems, but each value is 16 bytes (vs 4 for INT), and random UUIDs cause index bloat and have poor B-tree locality. Use IDENTITY for most cases; use UUID when merging data from multiple sources or exposing IDs externally (prevents enumeration).
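The two recommended options side by side (table names are hypothetical; gen_random_uuid() is built in since PostgreSQL 13, earlier versions need the pgcrypto extension):

```sql
-- SQL-standard identity column (preferred over SERIAL)
CREATE TABLE users (
    id    int GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email text UNIQUE NOT NULL
);

-- UUID key: safe to generate anywhere, safe to expose externally
CREATE TABLE api_keys (
    id    uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    label text
);
```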

Q: What is normalization and when should you normalize your PostgreSQL schema?

Normalization eliminates redundancy by splitting data into related tables. 1NF: atomic values, no arrays. 2NF: no partial dependencies on composite keys. 3NF: no transitive dependencies. Normalize when: data integrity is critical, storage is a concern, or write-heavy workloads need consistency. PostgreSQL's strong JOIN performance and advanced indexing make normalized schemas perform well even at scale.

Q: When is denormalization appropriate in PostgreSQL?

Denormalize when: read performance is critical and JOINs are too expensive, you need pre-computed aggregates for dashboards, or you're building a materialized view for reporting. PostgreSQL helps with denormalization through materialized views (REFRESH MATERIALIZED VIEW CONCURRENTLY), JSONB columns for embedding related data, and generated columns for computed values. Always measure before denormalizing — PostgreSQL's optimizer often handles normalized schemas better than expected.

Queries & SQL

Q: What are CTEs (Common Table Expressions) and when should you use them?

A CTE is a named temporary result set defined with WITH ... AS. Use them to: break complex queries into readable steps, avoid repeating subqueries, and enable recursive queries. In PostgreSQL 12+, CTEs can be inlined by the optimizer (treated like subqueries) unless you add MATERIALIZED. Before PG 12, CTEs were always materialized, which could hurt performance. Always test with EXPLAIN ANALYZE to check if a CTE is being inlined.
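The PG 12 inlining behavior and the MATERIALIZED override look like this (the orders table is hypothetical):

```sql
-- PG 12+: the optimizer may inline this CTE into the outer query,
-- letting it push down predicates and choose indexes freely
WITH recent AS (
    SELECT * FROM orders WHERE created_at > now() - interval '7 days'
)
SELECT count(*) FROM recent;

-- Force the pre-12 behavior: compute once, reuse the result set
WITH recent AS MATERIALIZED (
    SELECT * FROM orders WHERE created_at > now() - interval '7 days'
)
SELECT count(*) FROM recent;
```

MATERIALIZED is worth it when the CTE is referenced multiple times and is expensive to recompute; otherwise inlining usually wins.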

Q: How do window functions work in PostgreSQL? Give an example.

Window functions perform calculations across a set of rows related to the current row without collapsing them (unlike GROUP BY). Syntax: function() OVER (PARTITION BY ... ORDER BY ...). Example: rank employees by salary within each department: SELECT name, dept, salary, RANK() OVER (PARTITION BY dept ORDER BY salary DESC) FROM employees. Common functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM, AVG. Window functions execute after WHERE and GROUP BY but before ORDER BY.

Q: What is a LATERAL join and when would you use it?

A LATERAL join lets the right-hand subquery reference columns from the left-hand table — like a correlated subquery but in the FROM clause. Use cases: top-N per group (e.g., latest 3 orders per customer), set-returning functions that depend on each row, and unnesting arrays with row context. Example: SELECT c.name, o.* FROM customers c, LATERAL (SELECT * FROM orders WHERE customer_id = c.id ORDER BY created_at DESC LIMIT 3) o. Often faster than window functions for top-N queries on large tables.

Q: What does DISTINCT ON do and how is it different from DISTINCT?

DISTINCT removes fully duplicate rows. DISTINCT ON (expression) is a PostgreSQL extension that returns the first row for each unique value of the expression, based on the ORDER BY clause. Example: get the latest order per customer: SELECT DISTINCT ON (customer_id) * FROM orders ORDER BY customer_id, created_at DESC. It's often simpler and faster than a window function + filter for "first row per group" queries. Note: DISTINCT ON is PostgreSQL-specific, not standard SQL.

Q: How does UPSERT (ON CONFLICT) work in PostgreSQL?

UPSERT is INSERT ... ON CONFLICT ... DO UPDATE/DO NOTHING (PostgreSQL 9.5+). It atomically inserts a row or updates it if a conflict occurs on a unique constraint or index. Example: INSERT INTO users (email, name) VALUES ('[email protected]', 'Alice') ON CONFLICT (email) DO UPDATE SET name = EXCLUDED.name. The EXCLUDED pseudo-table refers to the row that was proposed for insertion. This is atomic and race-condition free, unlike the "check then insert" pattern.

Q: How do recursive queries work in PostgreSQL?

Recursive CTEs use WITH RECURSIVE and have two parts: a base case (non-recursive term) and a recursive step that references the CTE itself. PostgreSQL iterates until the recursive step returns no new rows. Common uses: traversing hierarchies (org charts, category trees), generating series, and graph traversal. Example for an org chart: WITH RECURSIVE org AS (SELECT id, name, manager_id, 1 AS depth FROM employees WHERE manager_id IS NULL UNION ALL SELECT e.id, e.name, e.manager_id, o.depth + 1 FROM employees e JOIN org o ON e.manager_id = o.id) SELECT * FROM org. Always include a depth limit or cycle detection to prevent infinite loops.
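For the cycle-detection point, PostgreSQL 14+ has a built-in CYCLE clause, which is simpler than a hand-rolled path array. A sketch reusing the org-chart shape from above:

```sql
-- PG 14+: CYCLE marks rows that revisit an id and stops recursing
WITH RECURSIVE org AS (
    SELECT id, name, manager_id FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.name, e.manager_id
    FROM employees e JOIN org o ON e.manager_id = o.id
) CYCLE id SET is_cycle USING path
SELECT * FROM org WHERE NOT is_cycle;
```

On older versions, carry a depth counter or an array of visited ids in the recursive term and filter on it.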

Query Performance & Optimization

Q: How do you read an EXPLAIN ANALYZE output?

Read bottom-up (inner nodes execute first). Key fields: actual time (startup..total in ms), rows (estimated vs actual — big differences mean stale statistics), loops (multiply time by loops for true cost). Watch for: Seq Scan on large tables (missing index), Nested Loop with high row counts (consider Hash Join), Sort with high memory (add index to avoid sort). Use EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) for the most useful output. The Buffers section shows I/O: shared hit = cache, shared read = disk.
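The recommended invocation, for reference (query and table are hypothetical):

```sql
-- BUFFERS adds I/O counters to each node:
--   shared hit  = pages found in the buffer cache
--   shared read = pages fetched from disk
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders
WHERE customer_id = 42
ORDER BY created_at DESC
LIMIT 10;
```

Note that ANALYZE actually executes the query, so wrap data-modifying statements in BEGIN ... ROLLBACK when experimenting.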

Q: What index types does PostgreSQL support and when would you use each?

B-tree (default): equality and range queries (=, <, >, BETWEEN, ORDER BY). GIN (Generalized Inverted Index): JSONB containment (@>), array overlap (&&), full-text search (@@), trigram similarity. GiST (Generalized Search Tree): geometric/spatial data (PostGIS), range types, nearest-neighbor searches. BRIN (Block Range Index): very large, naturally ordered tables (time-series data, append-only logs) — tiny index size, good for sequential correlation. Hash: equality-only lookups (rarely used, B-tree is usually better). Also: SP-GiST for partitioned search spaces (IP addresses, phone numbers).

Q: When should you NOT add an index?

Don't index when: the table is small (sequential scan is faster), the column has low selectivity (boolean, status with few values — the index returns most of the table), the table is write-heavy with few reads (indexes slow down INSERT/UPDATE/DELETE), or you already have a covering index that includes the column. Also avoid indexing columns used with functions unless you create an expression index. Every index adds storage overhead and VACUUM work. Rule of thumb: an index helps when it filters out at least 90% of rows.
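Two of the techniques mentioned above — expression indexes and a partial index as an alternative to indexing low-value rows — sketched on hypothetical tables:

```sql
-- A plain index on email cannot serve this query...
SELECT * FROM users WHERE lower(email) = '[email protected]';

-- ...but an expression index on lower(email) can
CREATE INDEX users_email_lower_idx ON users (lower(email));

-- A partial index covers only the rows worth indexing,
-- e.g. the small "pending" slice of a huge orders table
CREATE INDEX orders_pending_idx ON orders (created_at)
WHERE status = 'pending';
```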

Q: How do you diagnose a slow query in PostgreSQL?

Step-by-step: (1) Enable pg_stat_statements to find the slowest queries by total time. (2) Run EXPLAIN (ANALYZE, BUFFERS) on the query. (3) Check for sequential scans on large tables — add indexes. (4) Check estimated vs actual rows — if wildly different, run ANALYZE to update statistics. (5) Check for lock contention via pg_stat_activity and pg_locks. (6) Check work_mem — if sorts spill to disk, increase it. (7) Check if the query can be rewritten (e.g., replace correlated subquery with JOIN, use LATERAL for top-N).
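Steps (1) and (5) above can be sketched as queries (column names assume PG 13+, where pg_stat_statements renamed total_time to total_exec_time):

```sql
-- (1) Top queries by cumulative execution time
--     (requires the pg_stat_statements extension)
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- (5) Sessions currently waiting on locks
SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```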

Q: What is the N+1 query problem and how do you solve it in PostgreSQL?

The N+1 problem occurs when an application fetches a list of N items, then makes N additional queries to fetch related data for each item. Example: fetch 100 orders, then 100 separate queries for each order's items. Solutions: (1) Use JOINs to fetch everything in one query. (2) Use IN with a list of IDs: SELECT * FROM items WHERE order_id IN (1, 2, 3, ...). (3) Use LATERAL joins for top-N per group. (4) In ORMs, use eager loading (e.g., prefetch_related in Django, includes in Rails). The fix is always to reduce round trips to the database.

Q: What is connection pooling and why is it important for PostgreSQL?

Each PostgreSQL connection spawns a separate OS process (typically several MB of RAM each). With hundreds of connections, memory and context-switching overhead become significant. Connection pooling (via PgBouncer or pgpool-II) maintains a pool of reusable connections, reducing overhead. PgBouncer modes: session (connection per client session), transaction (connection per transaction — most common), statement (connection per statement). Transaction mode gives the best connection reuse but doesn't support session-level features (SET, LISTEN/NOTIFY), and only supports prepared statements as of PgBouncer 1.21.

Q: What are prepared statements and how do they improve performance?

A prepared statement is a pre-parsed, pre-planned query that can be executed multiple times with different parameters. Benefits: (1) Avoids re-parsing SQL on each execution. (2) Prevents SQL injection by separating query structure from data. (3) After 5 executions, PostgreSQL switches to a generic plan if it's consistently as cheap as the custom per-parameter plans. Use PREPARE name AS SELECT ... then EXECUTE name(params). Most client libraries (psycopg, JDBC) handle this automatically. Note: prepared statements are session-scoped, so they don't work in PgBouncer transaction mode before PgBouncer 1.21.
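The manual PREPARE/EXECUTE flow, for completeness (names are hypothetical):

```sql
-- Parse and plan once; $1 is the parameter placeholder
PREPARE user_by_email (text) AS
    SELECT id, name FROM users WHERE email = $1;

-- Execute many times with different values
EXECUTE user_by_email('[email protected]');
EXECUTE user_by_email('[email protected]');

-- Free the prepared statement (otherwise it lives until session end)
DEALLOCATE user_by_email;
```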

Administration & Operations

Q: What is VACUUM and why does PostgreSQL need it?

Because of MVCC, deleted and updated rows leave behind "dead tuples" (old row versions). VACUUM reclaims this space for reuse. Regular VACUUM marks dead tuples as reusable but doesn't return space to the OS. VACUUM FULL rewrites the entire table to reclaim disk space but takes an exclusive lock — avoid in production. Autovacuum runs automatically based on thresholds (autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * row count). Tune autovacuum to run more aggressively on high-churn tables. Without vacuuming, tables bloat and queries slow down.
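Two practical follow-ups interviewers often probe: finding the tables that need attention, and tuning autovacuum per table rather than globally (the orders table is illustrative):

```sql
-- High-churn tables ranked by dead tuples
SELECT relname, n_dead_tup, n_live_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Vacuum this hot table when ~1% of its rows are dead,
-- instead of the 20% global default
ALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.01);
```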

Q: What is the difference between pg_dump and pg_basebackup?

pg_dump creates a logical backup — SQL statements or a custom archive format. It's database-level, can selectively back up tables, and produces portable output you can restore to different PostgreSQL versions. pg_basebackup creates a physical backup — a binary copy of the entire data directory. It's faster for large databases, supports point-in-time recovery (PITR) with WAL archiving, and is required for setting up streaming replicas. Use pg_dump for small/medium databases and migrations; pg_basebackup for disaster recovery and replica setup.

Q: Explain the difference between streaming replication and logical replication.

Streaming replication sends the raw WAL bytes to replicas, creating exact physical copies. The replica is read-only. It replicates everything (all databases, all schemas). Logical replication (PostgreSQL 10+) decodes WAL into logical changes (INSERT, UPDATE, DELETE) and sends them via publications/subscriptions. It can replicate specific tables, allows writes on the subscriber, and works across PostgreSQL major versions. Use streaming for high availability (failover); logical for selective replication, online major version upgrades, or data integration.

Q: When and how should you partition a table in PostgreSQL?

Partition when a table exceeds tens of millions of rows and queries consistently filter on the partition key. PostgreSQL supports range (dates, IDs), list (categories, regions), and hash (even distribution) partitioning. Benefits: partition pruning (only scans relevant partitions), faster VACUUM (per-partition), easy data archival (drop old partitions). Create with: CREATE TABLE orders (... ) PARTITION BY RANGE (created_at), then create child partitions. Avoid partitioning small tables or when queries don't filter on the partition key — it adds overhead without benefit.
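A minimal range-partitioning sketch, including the instant-archival trick (table and column names are hypothetical):

```sql
CREATE TABLE orders (
    order_id   bigint NOT NULL,
    created_at timestamptz NOT NULL,
    total      numeric
) PARTITION BY RANGE (created_at);

-- Child partitions cover non-overlapping ranges of the key
CREATE TABLE orders_2024_01 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE orders_2024_02 PARTITION OF orders
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Archival: detach (then optionally DROP TABLE) an old partition —
-- far cheaper than DELETE + VACUUM on a monolithic table
ALTER TABLE orders DETACH PARTITION orders_2024_01;
```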

Q: What tools do you use to monitor a PostgreSQL database?

Built-in: pg_stat_statements (query performance), pg_stat_activity (active connections and locks), pg_stat_user_tables (table I/O, dead tuples), pg_stat_bgwriter (checkpoint and buffer stats). External: pgAdmin (GUI), pg_top (real-time process monitor), pgBadger (log analysis), Prometheus + Grafana with postgres_exporter (dashboards and alerting), PgHero (quick health checks). Key metrics to watch: cache hit ratio (>99%), dead tuple ratio, long-running queries, replication lag, and connection count.
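Two of the key metrics above expressed as ad-hoc queries (thresholds are rules of thumb, not hard limits):

```sql
-- Buffer cache hit ratio across all databases; aim for > 0.99
-- on a warmed-up OLTP workload
SELECT sum(blks_hit)::float
       / nullif(sum(blks_hit) + sum(blks_read), 0) AS cache_hit_ratio
FROM pg_stat_database;

-- Long-running queries (here: active for more than 5 minutes)
SELECT pid, now() - query_start AS runtime, state, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '5 minutes';
```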

Q: What is transaction ID wraparound and why is it dangerous?

PostgreSQL uses 32-bit transaction IDs (about 4 billion). MVCC compares transaction IDs to determine row visibility. When IDs wrap around, old "past" transactions would suddenly appear to be in the "future," making their data invisible — effectively data loss. PostgreSQL prevents this with aggressive autovacuum: VACUUM freezes old transaction IDs (marks rows as visible to all). If autovacuum falls behind, PostgreSQL issues warnings and eventually shuts down to prevent wraparound. Monitor pg_database.datfrozenxid and ensure autovacuum keeps up on all tables, especially large ones.
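The monitoring query for wraparound risk is short enough to memorize:

```sql
-- Age (in transactions) of the oldest unfrozen XID per database.
-- Autovacuum forces a freeze around 200 million by default
-- (autovacuum_freeze_max_age); alert well before that.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```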

Scenario-Based Questions

Q: How would you migrate a production database from MySQL to PostgreSQL?

Step-by-step: (1) Schema conversion — map MySQL types to Postgres (AUTO_INCREMENT → IDENTITY, TINYINT(1) → BOOLEAN, DATETIME → TIMESTAMPTZ). Tools: pgLoader (automates the full migration) or AWS SCT. (2) Data migration — use pgLoader for bulk transfer, validate row counts and checksums. (3) Query audit — fix MySQL-specific syntax (backtick quoting → double quotes, LIMIT offset syntax, IF() → CASE WHEN). (4) Test thoroughly — run the app against Postgres with production-like data. (5) Cutover — use logical replication or dual-write during transition to minimize downtime.

Q: A table with 500 million rows is getting slow. How do you fix it?

Systematic approach: (1) Identify slow queries via pg_stat_statements. (2) Partitioning — partition by date or other natural key to enable partition pruning. (3) Indexes — ensure proper indexes exist; consider BRIN indexes for time-ordered data (tiny index for huge tables). (4) VACUUM and statistics — tune autovacuum to run more frequently; increase default_statistics_target for better estimates. (5) Hardware — increase shared_buffers, work_mem, effective_cache_size. (6) Archival — move old data to archive tables or cold storage. (7) Read replicas — offload analytics queries to a replica. (8) Materialized views — pre-compute expensive aggregations.

Q: How would you design a multi-tenant schema in PostgreSQL?

Three approaches: (1) Shared table with tenant_id — simplest, add tenant_id column to every table with Row-Level Security (RLS) policies. Best for many tenants with small data. (2) Schema per tenant — each tenant gets their own schema (SET search_path TO tenant_123). Good isolation, same database. Works well up to hundreds of tenants. (3) Database per tenant — maximum isolation but hardest to manage. Best for enterprise customers with strict compliance needs. Most teams start with option 1 (shared table + RLS) and migrate large tenants to option 2 if needed.
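Option 1 (shared table + RLS) can be sketched as follows; the documents table, the app.tenant_id setting name, and the per-request SET are all illustrative conventions, not built-ins:

```sql
CREATE TABLE documents (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    tenant_id int NOT NULL,
    body      text
);

ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

-- Every query through this policy sees only the current tenant's rows.
-- The application runs e.g.  SET app.tenant_id = '42';  per request.
CREATE POLICY tenant_isolation ON documents
    USING (tenant_id = current_setting('app.tenant_id')::int);
```

Caveat: table owners and superusers bypass RLS by default, so the application should connect as a non-owner role.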

Q: Is PostgreSQL a good choice for time-series data? How would you set it up?

Yes, PostgreSQL handles time-series well with the right setup. Key strategies: (1) Partitioning by time — range partitions by day/week/month for fast queries and easy archival (detaching or dropping an old partition is instant). (2) BRIN indexes on timestamps — tiny index, perfect for naturally ordered data. (3) TimescaleDB extension — adds automatic partitioning, compression, continuous aggregates, and retention policies. (4) Materialized views for pre-computed rollups (hourly, daily aggregates). (5) Tune work_mem and maintenance_work_mem for large time-range queries. For extreme scale (millions of inserts/sec), consider TimescaleDB or a dedicated time-series database.

Q: How do you handle database migrations in a CI/CD pipeline with PostgreSQL?

Best practices: (1) Use a migration tool (Flyway, Alembic, golang-migrate, or sqitch) that tracks applied migrations in a table. (2) Make migrations backward-compatible — add columns as nullable first, deploy code that handles both old and new schema, then make columns required. (3) Avoid long locks — use CREATE INDEX CONCURRENTLY; ALTER TABLE ... ADD COLUMN is instant when the column is nullable with no default, and in PG 11+ even with a constant default. (4) Test migrations against a production-sized copy before deploying. (5) Separate schema changes from code deployments — schema changes should be safe to run before or after the new code. (6) Always have a rollback script for each migration.
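A sketch of the expand-then-contract rollout described in (2) and (3), on a hypothetical users table:

```sql
-- Step 1: additive, instant change (nullable, no table rewrite)
ALTER TABLE users ADD COLUMN phone text;

-- Step 2: deploy code that writes phone; backfill old rows in batches.

-- Step 3: tighten the constraint only after the backfill is complete
-- (this takes a brief lock plus a validating scan)
ALTER TABLE users ALTER COLUMN phone SET NOT NULL;

-- Index builds should never block writes in production.
-- Note: CONCURRENTLY cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY users_phone_idx ON users (phone);
```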

Q: When would you choose PostgreSQL over a NoSQL database, and vice versa?

Choose PostgreSQL when: you need ACID transactions, complex JOINs, strong consistency, or structured data with relationships. PostgreSQL's JSONB gives you document-store flexibility without sacrificing transactions. Choose NoSQL when: you need horizontal scaling across many nodes (Cassandra, DynamoDB), sub-millisecond key-value lookups (Redis), graph traversals (Neo4j), or truly schema-less documents at massive scale (MongoDB). In practice, PostgreSQL with JSONB covers 80%+ of use cases that teams reach for NoSQL for. Start with Postgres unless you have a specific scaling or data model requirement that demands a specialized database.