PostgreSQL Internals Series, Part 1: Page Layout & Storage
How PostgreSQL stores data on disk — page structure, tuple anatomy, alignment, TOAST, and practical inspection with pageinspect.
PostgreSQL stores data in pages — fixed 8KB blocks on disk (by default). Everything lives in pages: table data (heap), index entries, even free space tracking. Understanding page structure is the foundation for understanding everything else.
This part is hands-on. We’ll build a test table, inspect its pages byte-by-byte, and learn why tuple alignment and TOAST matter for performance.
Page Structure
Every page is 8192 bytes and has this layout:
┌─────────────────────────────────────────┐
│ PageHeaderData (24 bytes) │ ← LSN, flags, free space
├─────────────────────────────────────────┤
│ ItemIdData[] (4 bytes per tuple) │ ← Pointers to tuples
├─────────────────────────────────────────┤
│ ...free space... │
├─────────────────────────────────────────┤
│ ...tuple data... │ (stored from end backwards)
└─────────────────────────────────────────┘
Key insight: Tuple data grows backwards from the end of the page. This leaves free space in the middle for new tuples or index entries.
Header (24 bytes)
The PageHeaderData struct contains:
- pd_lsn (8 bytes): Log sequence number (WAL position) when page was last modified
- pd_checksum (2 bytes): Optional page checksum for corruption detection
- pd_flags (2 bytes): Bit flags (reserved, all visible, etc.)
- pd_lower (2 bytes): Offset to first free space after the last line pointer
- pd_upper (2 bytes): Offset to start of tuples
- pd_special (2 bytes): Offset to page-specific data (used by indexes, not heap)
- pd_pagesize_version (2 bytes): Page size and PostgreSQL version
- pd_linp (variable): Array of line pointers (4 bytes each)
Line Pointers
After the header comes an array of line pointers (4 bytes each), one per tuple on the page.
typedef struct ItemPointerData {
BlockIdData ip_blkid; // 4 bytes: block number within relation
OffsetNumber ip_posid; // 2 bytes: offset within page
} ItemPointerData;
The line pointer is the tuple’s address: (block, offset). This is how PostgreSQL references tuples — not by byte position, but by their block and slot number. This allows tuples to be moved within a page (during VACUUM) without breaking references to them.
Tuple Structure
Each tuple starts with a HeapTupleHeaderData (23 bytes minimum) followed by user data:
typedef struct HeapTupleHeaderData {
union {
HeapTupleFields t_heap;
DatumTupleFields t_datum;
} t_choice;
ItemPointerData t_ctid; // 6 bytes: current item pointer (for MVCC)
uint16 t_infomask; // 2 bytes: visibility flags
uint16 t_infomask2; // 2 bytes: more flags
uint8 t_hoff; // 1 byte: offset to user data
uint8 t_bits[1]; // Variable: NULL bitmap
} HeapTupleHeaderData;
Tuple Header Fields (MVCC metadata)
- xmin (4 bytes): ID of transaction that inserted this tuple
- xmax (4 bytes): ID of transaction that deleted this tuple (0 if active)
- xvac (4 bytes): Transaction ID of VACUUM that moved this tuple
- t_ctid: Current item pointer (points to self or the newer version)
- t_infomask: Flags (HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, etc.)
- t_hoff: Byte offset from tuple start to user data (after NULL bitmap)
These fields enable MVCC — different transactions see different versions of the same tuple based on visibility rules (Part 2).
NULL Bitmap
If the table has NULLable columns, a NULL bitmap follows the header. It’s a bit array where bit N indicates if column N is NULL. For tables with no nullable columns, this is omitted entirely.
Formula for NULL bitmap size: BITMAPLEN(natts) = ((natts) + 7) / 8 bytes
User Data
After the header and NULL bitmap comes the actual column values, stored in attribute order.
Alignment rule: Most data types require alignment. For example:
int4(4 bytes) must start at a 4-byte boundaryint8(8 bytes) must start at an 8-byte boundary
PostgreSQL pads data to ensure alignment. This is why a table with (int1, int8) wastes 7 bytes of padding — the int8 needs to start at an 8-byte boundary.
Example: Inspecting a Page
Let’s create a test table and look at its storage:
-- Create test table
CREATE TABLE users (
id BIGINT,
name TEXT,
age INT,
email TEXT
);
-- Insert a few rows
INSERT INTO users VALUES
(1, 'Alice', 30, 'alice@example.com'),
(2, 'Bob', 25, 'bob@example.com'),
(3, 'Charlie', 35, 'charlie@example.com');
-- Enable pageinspect extension
CREATE EXTENSION IF NOT EXISTS pageinspect;
Now inspect the page header:
SELECT * FROM page_header(get_raw_page('users', 0));
Output:
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----+----------+-------+-------+-------+---------+----------+---------+-----------
0/0 | 0 | 0 | 44 | 8128 | 8192 | 8192 | 18 | 0
This tells us:
- lower = 44: The line pointer array ends at byte 44 (header is 24 bytes + 4 line pointers = 44)
- upper = 8128: Tuple data starts at byte 8128
- Free space: 8128 - 44 = 8084 bytes available for new tuples
Now inspect individual tuples:
SELECT * FROM heap_page_items(get_raw_page('users', 0));
Output:
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask | t_infomask2 | t_hoff | t_bits | t_oid
----+--------+----------+--------+--------+--------+----------+--------+------------+-------------+--------+--------+-------
1 | 8128 | 1 | 48 | 742 | 0 | 0 | (0,1) | 2050 | 2 | 24 | |
2 | 8080 | 1 | 48 | 743 | 0 | 0 | (0,2) | 2050 | 2 | 24 | |
3 | 8032 | 1 | 48 | 744 | 0 | 0 | (0,3) | 2050 | 2 | 24 | |
Key observations:
- lp (line pointer 1, 2, 3): Each tuple has a pointer
- t_xmin (742, 743, 744): Different transaction IDs created each tuple
- t_xmax (0): All tuples are active (not deleted)
- t_ctid ((0,1), (0,2), (0,3)): Each tuple points to itself (or would point to a newer version if updated)
- t_hoff (24): User data starts at byte 24 (no NULL bitmap, all columns are NOT NULL)
Decode the actual tuple data:
SELECT t_ctid, encode(t_data, 'hex') as hex_data
FROM heap_page_items(get_raw_page('users', 0));
The hex output shows raw bytes. The id (int8) comes first, then name (variable length text), then age (int4), then email (variable length text).
TOAST — Out-of-Line Storage
PostgreSQL has a 4KB limit per tuple (after TOAST compression). Large values (text, bytea, arrays) are stored separately in a TOAST table (Tuple Attribute Storage).
When a column value is too large:
- PostgreSQL compresses it (if compression works)
- If still too large, stores it in the TOAST table
- The original tuple holds a TOAST pointer (18 bytes) instead of the data
Example:
CREATE TABLE documents (
id BIGINT,
content TEXT
);
INSERT INTO documents VALUES (1, repeat('x', 100000));
-- Check page size
SELECT * FROM page_header(get_raw_page('documents', 0));
The tuple on the heap page is now ~18 bytes (just the TOAST pointer), and the actual content lives in a separate TOAST table.
When TOAST activates:
storage_typeparameter per column controls this- Default behavior: compress, then TOAST if still > 2KB
- Can force
EXTERNAL(never compress, TOAST immediately) orPLAIN(never TOAST)
CREATE TABLE big_data (
id BIGINT,
data BYTEA STORAGE EXTERNAL
);
Practical Impact: Page Density & Performance
The tuple overhead is 24 bytes (header) + NULL bitmap (if needed) + padding for alignment.
For a table with small rows (e.g., (int4, int4, int4) = 12 bytes data + 24 byte header = 36 bytes):
- Tuples per page: ~8192 / 36 ≈ 227 tuples
- With 100M rows, you need ~440,000 pages = ~3.4 GB
For a table with large rows (e.g., (text, text, text) with 1KB each):
- Tuples per page: ~8192 / 3KB ≈ 2–3 tuples
- With 100M rows, you need ~50M pages = ~390 GB
Implications:
- Wider tables = slower scans (fewer tuples per page, more I/O)
- Unnecessary columns bloat the table size
- Column order matters for padding efficiency
- TOAST effectively “unbundles” large values from the main table
Checking Table Size
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size,
pg_size_pretty(pg_relation_size(schemaname || '.' || tablename)) AS table_size,
pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename) - pg_relation_size(schemaname || '.' || tablename)) AS indexes_and_toast
FROM pg_tables t
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC;
Key Takeaways
- Pages are 8KB fixed blocks. Tuple data grows backwards from the end, leaving free space for growth.
- Line pointers enable indirection. Tuples can be moved within a page without breaking references.
- Tuple headers carry MVCC metadata (xmin, xmax, t_ctid). This overhead is unavoidable — roughly 24 bytes per tuple.
- Alignment and padding waste space. Column order impacts layout efficiency.
- TOAST unbundles large values to the TOAST table, keeping the main tuple small.
- Pageinspect extension lets you inspect pages directly — invaluable for debugging storage issues.
Monitoring Page Health
Check table bloat and page efficiency:
SELECT
schemaname,
tablename,
pg_size_pretty(pg_relation_size(schemaname || '.' || tablename)) AS size,
(SELECT count(*) FROM pg_class WHERE relname = tablename) AS row_count
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_relation_size(schemaname || '.' || tablename) DESC;
Next Steps
Now that you understand how PostgreSQL stores individual tuples, the question becomes: how does it handle concurrent modifications to those tuples? When one transaction deletes a row and another transaction reads it simultaneously, who sees what?
This is MVCC — multi-version concurrency control. Every tuple carries visibility metadata, and every transaction sees a snapshot of the database at a specific point in time.
Part 1 complete. Next: Transaction Isolation & MVCC