Identify partitions with many small files by querying the files metadata table for partitions where file count exceeds a threshold or average file size is below target (typically 128–512 MB).
Call rewrite_data_files with the target-file-size-bytes option, and set strategy to binpack (the default) for pure size optimization, or to sort with a sort_order argument to co-locate rows on frequently filtered columns.
Set the max-concurrent-file-group-rewrites option to control parallelism; higher values speed up compaction but increase memory pressure on the cluster.
Use the partial-progress options (partial-progress.enabled and partial-progress.max-commits) to commit rewrites incrementally, so that a failure mid-job does not lose all progress.
After compaction, run expire_snapshots to expire the older snapshots that still reference the replaced small files; their storage is reclaimed only once no retained snapshot points to them.
Monitor the compaction metrics returned by the procedure (data files rewritten, data files added, bytes rewritten) to tune parameters iteratively.
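The steps above can be sketched as a few small Python helpers that only build the Spark SQL strings. This is a minimal sketch, not a definitive implementation: the catalog and table names (my_catalog, db.events), the thresholds (100 files, 128 MB average, 256 MB target) and the helper names are illustrative assumptions, and the option keys are written in the hyphenated form that Iceberg's Spark procedures use.

```python
MB = 1024 * 1024


def small_file_partitions_sql(table, max_files=100, min_avg_bytes=128 * MB):
    """Query the files metadata table for partitions that look fragmented.

    `table` is the fully qualified table name (hypothetical in this sketch).
    """
    return (
        "SELECT partition, COUNT(*) AS file_count, "
        "AVG(file_size_in_bytes) AS avg_file_bytes "
        f"FROM {table}.files "
        "GROUP BY partition "
        f"HAVING COUNT(*) > {max_files} "
        f"OR AVG(file_size_in_bytes) < {min_avg_bytes}"
    )


def rewrite_data_files_sql(catalog, table, where=None, sort_order=None,
                           target_file_bytes=256 * MB,
                           max_concurrent_groups=5, max_commits=10):
    """Build a CALL statement for the rewrite_data_files procedure."""
    # The filter is embedded in a single-quoted SQL literal, so this sketch
    # rejects single quotes; use double quotes for string values instead.
    if where is not None and "'" in where:
        raise ValueError("use double quotes for string literals inside where")
    options = {
        "target-file-size-bytes": target_file_bytes,
        "max-concurrent-file-group-rewrites": max_concurrent_groups,
        "partial-progress.enabled": "true",
        "partial-progress.max-commits": max_commits,
    }
    args = [f"table => '{table}'"]
    if sort_order is None:
        args.append("strategy => 'binpack'")
    else:
        args.append("strategy => 'sort'")
        args.append(f"sort_order => '{sort_order}'")
    if where is not None:
        args.append(f"where => '{where}'")
    pairs = ", ".join(f"'{key}', '{value}'" for key, value in options.items())
    args.append(f"options => map({pairs})")
    return f"CALL {catalog}.system.rewrite_data_files(" + ", ".join(args) + ")"


def expire_snapshots_sql(catalog, table, older_than, retain_last=5):
    """Build a CALL statement that expires snapshots older than a timestamp."""
    return (
        f"CALL {catalog}.system.expire_snapshots(table => '{table}', "
        f"older_than => TIMESTAMP '{older_than}', retain_last => {retain_last})"
    )
```

In a Spark session with an Iceberg catalog configured, each string would be passed to spark.sql(...); the row returned by the rewrite call carries the metrics to watch when tuning the parameters.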
Known gotchas
Compaction with the sort strategy rewrites every file matched by the filter, which is expensive; restrict the where argument to the partitions that actually need it rather than running table-wide.
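The sort strategy changes file layout: per-file min/max statistics become tight for the sort key, but queries that filter on other columns may see little pruning benefit, and can prune worse than before if the previous layout happened to cluster those columns.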
Running compaction concurrently with high-frequency writes can cause optimistic concurrency conflicts at commit time; schedule it in low-write windows, or enable partial-progress commits so that each commit covers fewer files.
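One way to keep each rewrite narrow, sketched here under stated assumptions, is to turn the fragmented partitions into one where filter per partition and compact them one call at a time. The row shape (a dict with partition and file_count keys) and both helper names are hypothetical, and the mapping from partition values to column predicates only holds for identity-partitioned columns; transforms such as days(ts) or bucket(...) would need a different predicate.

```python
def partition_filter(partition):
    """Render a partition's column values as a filter string.

    String values are double-quoted so the result can sit inside the
    single-quoted where argument of the procedure call.
    """
    parts = []
    for column, value in partition.items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            literal = str(value)
        else:
            literal = '"' + str(value) + '"'
        parts.append(f"{column} = {literal}")
    return " AND ".join(parts)


def filters_for_fragmented(rows, max_files=100):
    """Return one filter per partition whose file count exceeds max_files."""
    return [
        partition_filter(row["partition"])
        for row in rows
        if row["file_count"] > max_files
    ]


# Example rows, shaped like the result of a per-partition file count query.
rows = [
    {"partition": {"event_date": "2024-01-01", "region": "eu"}, "file_count": 840},
    {"partition": {"event_date": "2024-01-02", "region": "eu"}, "file_count": 12},
    {"partition": {"event_date": "2024-01-03", "shard": 7}, "file_count": 310},
]
filters = filters_for_fragmented(rows)
```

Each resulting filter would be passed as the where argument of a separate rewrite call, so a conflict with a concurrent writer affects only that partition's commit.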