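Identify the most common query filter columns (e.g. event_date, region, status) as partition candidates. Choose columns with low to moderate cardinality: each distinct value, or each combination of values when several partition columns are used, becomes its own directory, so too many distinct values create too many partitions.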
Organize files in a Hive-compatible directory structure: base_path/partition_col=value/file.parquet; this enables partition pruning in most engines.
Target a file size between roughly 128 MB and 512 MB per Parquet file after compression; too many small files degrade metadata scan performance.
Within each file, sort data by frequently filtered or joined columns and size row groups so that each row group's min/max statistics cover a narrow range of values; this lets engines skip whole row groups through predicate pushdown.
Document the partition scheme and validate it with the target engine (Spark, Athena, Trino, etc.) by confirming that EXPLAIN output shows partition pruning in query plans.
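To make the directory convention and the idea of partition pruning concrete, here is a minimal sketch in plain Python. It is illustrative only and not how any particular engine implements pruning. The helper names (partition_path, parse_partitions, prune) and the example paths are hypothetical. It assumes partition values contain no "/" or "=" characters and does not handle the URL-encoding that Hive applies to special characters.

```python
def partition_path(base_path, partitions, filename):
    """Build base_path/col=value/.../filename from ordered partition columns."""
    segments = [f"{col}={value}" for col, value in partitions.items()]
    return "/".join([base_path.rstrip("/"), *segments, filename])


def parse_partitions(path):
    """Recover partition values from the col=value directory segments of a path."""
    values = {}
    for segment in path.split("/")[:-1]:  # skip the file name itself
        col, sep, value = segment.partition("=")
        if sep and col:
            values[col] = value
    return values


def prune(paths, **filters):
    """Keep only files whose partition values satisfy every equality filter.

    The decision uses the path alone; no file is opened. That is the
    essence of partition pruning.
    """
    kept = []
    for path in paths:
        values = parse_partitions(path)
        if all(values.get(col) == str(want) for col, want in filters.items()):
            kept.append(path)
    return kept


if __name__ == "__main__":
    # Hypothetical dataset layout, for illustration only.
    files = [
        partition_path("warehouse/events", {"event_date": d, "region": r}, "part-0.parquet")
        for d in ("2024-01-01", "2024-01-02")
        for r in ("eu", "us")
    ]
    for path in prune(files, region="eu"):
        print(path)
```

A filter on a partition column narrows the set of files before any Parquet footer is read, which is why the filter columns from the first step are the natural partition candidates.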
Known gotchas
Partitioning on a high-cardinality column such as user_id or UUID creates millions of tiny files, causing severe metadata overhead and slow listing operations.
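Changing the partition columns after data is written requires either rewriting existing data or using a table format that supports partition evolution natively, such as Apache Iceberg. With Delta Lake, check what your version supports before relying on this, since changing partition columns there has typically meant rewriting the table.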
Engines that require explicit partition registration (e.g. Hive Metastore-backed systems) need MSCK REPAIR TABLE or equivalent after new partition directories are added; otherwise new data is invisible to queries.
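A quick back-of-the-envelope check can catch the high-cardinality gotcha before any data is written. The sketch below is a rough estimate and not a sizing tool. The function name evaluate_partition_scheme and the example numbers are hypothetical. It assumes every combination of partition values actually occurs and that data is spread evenly across partitions, so it gives an upper bound on partition count and a lower bound on average partition size. The 128 MB threshold is the lower end of the file size target mentioned earlier.

```python
def evaluate_partition_scheme(total_bytes, distinct_counts, min_file_bytes=128 * 1024**2):
    """Estimate partition count and average bytes per partition.

    distinct_counts maps each candidate partition column to its number of
    distinct values. Assumes all combinations occur and data is evenly
    distributed (both simplifications).
    """
    partitions = 1
    for count in distinct_counts.values():
        partitions *= count
    avg_bytes = total_bytes / partitions
    return {
        "partitions": partitions,
        "avg_bytes_per_partition": avg_bytes,
        # True when even a single file per partition would fall below the target size.
        "too_fine": avg_bytes < min_file_bytes,
    }


if __name__ == "__main__":
    one_tib = 1024**4  # hypothetical compressed dataset size
    print(evaluate_partition_scheme(one_tib, {"event_date": 365, "region": 5}))
    print(evaluate_partition_scheme(one_tib, {"user_id": 10_000_000}))
```

If too_fine comes back True, the column is probably better used as a sort key inside files than as a partition column.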