Postgres autovacuum: why your 200GB table won't release space

dead tuples, MVCC visibility, vacuum vs vacuum full, autovacuum_naptime, hot standby conflicts

May 29, 2026

The first time we got paged on a 200 GB table that wouldn't shrink, it was an events table sitting at around 400 million rows. Disk dashboard was yellowing, autovacuum was running clean - pg_stat_user_tables showed worker passes completing, log lines free of errors, last_autovacuum updated an hour ago. Somebody from the DBA chat asked why we didn't just run a VACUUM. We did. Nothing changed.

Every signal looked healthy. The only thing wrong was the outcome. Autovacuum was doing exactly what it's supposed to, and the table was still 200 GB - both true at the same time. Once we saw why, we stopped chasing the wrong fix. Issue #17 covered Postgres on Kubernetes at the cluster level. This is what happened inside one instance under write load, before any of that mattered.

The query that misled us first

We started with the obvious one and immediately got pointed in the wrong direction by it.

SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, autovacuum_count
FROM pg_stat_user_tables
WHERE relname = 'events';

n_dead_tup came back at 80 million on a 400-million-row table. Our first read was that vacuum was broken. It wasn't. That number isn't a count of bloat, it's a count of tuples vacuum decided it couldn't remove on that pass. The interesting question turned out to be why it couldn't.

We learned the hard way that the disk-not-shrinking part has two layered causes. One is a misconception about what VACUUM actually does to disk. The other is the visibility horizon - the thing silently holding vacuum back without showing up as an error anywhere. We peeled them in that order on the second outage and got to the real fix in an afternoon instead of a week.

Why MVCC leaves corpses behind

Postgres uses MVCC - multi-version concurrency control - which means writes never overwrite a row in place. An UPDATE writes a new physical tuple and marks the old one as superseded. A DELETE marks the existing tuple as deleted rather than removing it. The old tuple keeps sitting in the same page on disk, taking up the bytes it took yesterday.

Two hidden system columns track this on every tuple: xmin, the transaction ID that created the tuple, and xmax, the transaction ID that invalidated it via UPDATE or DELETE. A tuple with xmax = 0 is still live; once xmax is populated the tuple is dead, but only from the perspective of transactions that started after that xmax committed. When we peeked at one of our hot rows with the pageinspect extension, we saw exactly what the docs describe: the new version with one xmin and the old version still sitting there with a populated xmax, waiting on cleanup.

The reason for the design is write throughput. An UPDATE doesn't rewrite the row in place or worry about concurrent readers seeing a half-written tuple - it appends a new tuple, flips a header bit on the old one, commits, and moves on. Readers in older transactions see the old version; readers in newer transactions follow the chain to the new one. Everyone gets a consistent view at the moment their transaction started, no lock blocking a high-traffic row.

The price is that nothing has actually been freed. On a table where we were updating the same row a hundred times an hour, we'd accumulate a hundred dead tuples per row per hour, all living on the same page until vacuum got around to them. The 200 GB we'd been paged on was mostly that gap - dead tuples vacuum hadn't been allowed to remove yet.

The discovery that VACUUM doesn't free disk

I'll admit we ran VACUUM by hand three times before anyone opened the docs - each run finished clean in the logs, and each time the disk graph sat flat. The reason turned out to be mundane: VACUUM does not hand pages back to the OS. It marks the space the dead tuples held as free, and that free space is only ever reused by later inserts and updates into the same table.

So autovacuum walks the table and writes the location of every dead tuple it finds into the free space map. The next INSERT grabs one of those slots instead of growing the file. When the write rate matches the rate dead tuples pile up, the table parks at one size and sits there. Ours hadn't gotten there by accident - the events table took a one-time backfill six months before the page, then dropped to a fraction of that write rate, so it was frozen at the backfill high-water mark with no fresh writes to refill the holes.

This is the right tradeoff. Truncating a file requires the dead space to be at the end of the file, and in a heavily-updated table dead tuples are scattered across every page. Actually shrinking the file means rewriting the table, which locks it. VACUUM does the cheap thing and leaves the rewrite as a separate operation you have to ask for - either VACUUM FULL, which takes an ACCESS EXCLUSIVE lock and ran for hours on our 200 GB table, or pg_repack, which does an online shadow-table swap but needs roughly 2x disk to run.

When someone in chat says "autovacuum isn't reclaiming disk", the technically-correct answer is that it was never going to. The right question is whether the table sits at a sensible steady-state size for the write pattern. For us the answer was no - the table was bloated way past steady state because vacuum hadn't been able to mark enough dead tuples reusable. Which led us to the horizon.

The visibility horizon: where the real bug was

Vacuum can only remove a dead tuple if no active transaction could still need to see it. That sounds obvious until you trace what "active transaction could still need to see it" actually means in a busy Postgres instance, which is what we spent a Tuesday afternoon doing.

Every transaction starts with a snapshot of the database at the moment it began. As long as that transaction is open, its snapshot pins the visibility of every dead tuple whose xmax is newer than the snapshot's view. Vacuum sees those tuples, checks the oldest active snapshot, and skips them - removing them would corrupt the view of a transaction still running.

The horizon - the oldest snapshot held by anything - turned out to be the single most important thing to look at when vacuum looks fine but disk doesn't shrink:

SELECT max(age(backend_xid)) AS oldest_xid_age,
       max(age(backend_xmin)) AS oldest_xmin_age
FROM pg_stat_activity
WHERE state <> 'idle';

age() measures how many transactions have happened since the given XID. Our oldest_xmin_age came back at 47 million. Something was holding a very old snapshot, and that something was the silent culprit. We later found the same shape - oldest_xmin_age in the millions - in roughly half the autovacuum-bloat tickets our SRE team had filed that year.

The actual hold turned out to be one of four sources, and we've since seen each at least twice in production. The most obvious case is a long-running transaction left open - a reporting query running for two hours, or a migration script that opened a transaction and forgot it. Sneakier are idle-in-transaction sessions, where a connection started a transaction, ran some queries, then the app went off to do something else and never committed:

SELECT pid, state, query, now() - xact_start AS age
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY age DESC;

The transaction is open. The snapshot is pinned. Vacuum can't touch tuples newer than it. We've watched idle in transaction sessions with an age of hours in apps that aren't careful about pooling and transaction boundaries.

Replication slots got us once and were the most operationally tricky of the four. A logical or physical slot tells Postgres not to discard WAL and not to advance the horizon past the consumer's position. A replica we'd torn down months earlier had left its slot behind, and the primary had been holding WAL and pinning the horizon for it ever since:

SELECT slot_name, slot_type, active, restart_lsn, confirmed_flush_lsn,
       age(xmin) AS xmin_age
FROM pg_replication_slots;

A slot with active = false and xmin_age in the tens of millions is a dead replica quietly killing your vacuum. The fix was pg_drop_replication_slot after confirming the replica really was gone.

Prepared transactions are the rare but devastating one. Two-phase commit transactions can be prepared but not committed - they sit in pg_prepared_xacts until someone explicitly commits or rolls them back. A forgotten prepared transaction holds the horizon indefinitely. If pg_prepared_xacts has anything in it and nobody knows why, that's your answer.

None of these show up in the autovacuum logs. Vacuum runs, vacuum logs success, vacuum quietly skips tuples it can't remove, and the only visible symptom is n_dead_tup climbing.

What the fix actually looked like

Order matters here. We had to diagnose the horizon first, then decide on cleanup tactics. A team across the hall had done it the other way the previous quarter - VACUUM FULL in a maintenance window without checking the horizon - and their table was bloated again in a week because the hold-back was still in place.

Our first concrete action was the horizon check from the previous section. We killed two idle-in-transaction sessions older than our typical query time, dropped the stale slot from the torn-down replica, and confirmed pg_prepared_xacts was empty. An hour after the next autovacuum pass, n_dead_tup finally started dropping for the first time in weeks.

Once the horizon was clean, autovacuum finally had permission to do its job - but it still wasn't keeping up with how fast that table took writes. The parameter we reached for was autovacuum_vacuum_scale_factor. It ships at 0.2, so a table doesn't even become eligible for autovacuum until dead tuples cross 20% of the live count. Do that math on 400 million rows and vacuum sits on its hands until 80 million tuples are already dead. For a table that hot, we wanted it twitchier:

ALTER TABLE events SET (autovacuum_vacuum_scale_factor = 0.02);

At 2% the table became eligible roughly ten times sooner. Each pass had less to clean, so instead of n_dead_tup sawtoothing up into the hundreds of millions and back, it stayed in a narrow band we could actually reason about. On a payments-log table we own, we've since gone down to 0.01 and added autovacuum_vacuum_threshold as a fixed-row trigger on top.

The rewrite came at the end, only because steady-state wasn't the size we wanted. Our events table should have been around 80 GB given the new write pattern, but it was at 200 because of the historical bloat - reusable space the table would never consume again. pg_repack was the right tool there: online, transparent, left the table at its actual minimum size after a weekend run. VACUUM FULL gets you the same outcome with a long exclusive lock - the right choice when you have a maintenance window and don't want to install an extension.

While the rewrite was running we watched progress directly:

SELECT pid, phase, heap_blks_scanned, heap_blks_total,
       num_dead_tuples, max_dead_tuples
FROM pg_stat_progress_vacuum;

On a different table we owned, vacuum had been stuck in vacuuming indexes for hours - a huge fragmented index, and the vacuum pass was mostly index work. That was a separate problem with its own fix: REINDEX CONCURRENTLY on the affected indexes, run during a low-traffic window.

What we keep getting wrong

The same patterns kept showing up in the autovacuum-bloat-200-GB tickets after that first incident, and most of them came back to the same misreads:

Running VACUUM FULL in a maintenance window without checking the horizon. The table shrinks for a day, then bloats again because the idle-in-transaction session keeps leaking from the same broken app. We did exactly this on a different table six months later, before we'd internalized the order.
Reading n_dead_tup as a measure of bloat. It's a measure of what vacuum can't remove right now. Real bloat estimation needs the pgstattuple extension - we now run a pgstattuple sweep weekly across the top ten tables by size.
Ignoring idle in transaction sessions because they aren't running queries. They aren't idle. They're holding a snapshot.
Leaving stale replication slots from torn-down read-replicas. Every team we've talked to has at least one. Cheapest fix for the most expensive symptom we've seen.
Never tuning autovacuum_vacuum_scale_factor for large hot tables. The 20% default is fine for tables with a few thousand rows and absurd for tables with hundreds of millions.
Tuning autovacuum_naptime instead of scale factor. Naptime controls how often the launcher wakes up. Scale factor controls when a specific table becomes eligible. Most "vacuum doesn't run often enough" complaints we've debugged turned out to be scale-factor problems in disguise.
Assuming autovacuum behaves the same on a primary and a hot standby. With hot_standby_feedback = on, long queries on the replica hold the primary's horizon back too - a separate failure mode that only appeared in our setup once we started routing read traffic to the replica.

Links

PostgreSQL docs: pgstattuple - the extension that gives you actual bloat numbers instead of n_dead_tup.
PostgreSQL docs: hot_standby_feedback - the replica-side parameter that can stall the primary's horizon.
PostgreSQL wiki: VACUUM FULL vs CLUSTER vs pg_repack - tradeoffs across the three full-rewrite options.

This is the first in a Postgres-fundamentals mini-arc. Storage was today. Next in the queue: how the planner picks the plans it picks, what transaction isolation actually buys you under concurrent writes, and when JSONB is the right column type versus a documented mistake. The 200 GB table is just where the iceberg pokes above the water.

Podo Stack

Discussion about this post

Ready for more?