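Identify the skewed key: in the Spark UI, compare task durations and shuffle read sizes within the join stage, where a few tasks running far longer than the median indicate skewed partitions; then count rows per join key to confirm which key values are responsible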
Add a salt column to the larger (skewed) DataFrame by appending a random integer in the range 0 to N-1 (e.g., N=10) to the join key
Explode the smaller (non-skewed) DataFrame by replicating each row N times, each with a different salt value appended to the join key
Perform the join on the composite key (original_key + salt) instead of the original key alone
Drop the salt column from the output after the join; if partial aggregates were computed per salted key, run a final aggregation on the original key to recombine the split rows
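The steps above can be modeled in plain Python to check that a salted join returns the same rows as an unsalted one. This is a minimal sketch of the logic only, not Spark code: the toy data, the names `large` and `small`, and N=10 are assumptions made for illustration. In PySpark the same steps would typically be built from `pyspark.sql.functions` such as `rand`, `array`, and `explode`.

```python
import random
from collections import Counter

N = 10  # number of salt values; an assumed setting, tune it for real data

# Hypothetical toy data: "large" is heavily skewed toward key "a",
# "small" is the lookup side of the join.
large = [("a", i) for i in range(1000)] + [("b", 1000), ("c", 1001)]
small = [("a", "alpha"), ("b", "beta"), ("c", "gamma")]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Add a salt: tag every row of the skewed side with a random integer in 0..N-1.
salted_large = [((key, rng.randrange(N)), value) for key, value in large]

# Explode: replicate every row of the small side once per salt value.
exploded_small = [((key, salt), label) for key, label in small for salt in range(N)]

# Join on the composite (key, salt) instead of the key alone.
lookup = dict(exploded_small)
joined = [
    (key, salt, value, lookup[(key, salt)])
    for (key, salt), value in salted_large
    if (key, salt) in lookup
]

# Drop the salt from the output.
result = sorted((key, value, label) for key, _salt, value, label in joined)

# Reference: the same join without salting, for comparison.
plain_lookup = dict(small)
expected = sorted(
    (key, value, plain_lookup[key]) for key, value in large if key in plain_lookup
)

# Rows per composite join key: the hot key "a" is now spread over up to N groups.
rows_per_join_key = Counter(join_key for join_key, _value in salted_large)

print(result == expected)
print(max(rows_per_join_key.values()))
```

Because every salt value exists on the exploded side, each salted row finds exactly one match, so a plain join needs no recombination step after the salt is dropped.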
Known gotchas
Salting multiplies the smaller DataFrame by N; if N is large and the smaller DataFrame is not truly small, the memory and shuffle cost of replication can outweigh the skew reduction benefit
Salting only works for equi-joins; range joins or inequality joins cannot be salted using this technique
The salt count N must be chosen empirically based on the degree of skew; too small and some partitions remain oversized, too large and unnecessary replication degrades performance
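The trade-off behind choosing N can be estimated before running anything on a cluster. The sketch below is a rough model under stated assumptions: salts are spread evenly across rows, the helper `salting_tradeoff` is hypothetical, and the row counts are invented for illustration. It reports, for each candidate N, the approximate size of the largest salted group and the number of rows the replicated smaller side would contain.

```python
import math


def salting_tradeoff(key_counts, small_rows, candidates):
    """Estimate the effect of each candidate salt count N.

    Returns a list of (N, largest salted group, replicated small-side rows),
    assuming random salts split each key's rows evenly.
    """
    hottest = max(key_counts.values())
    return [(n, math.ceil(hottest / n), small_rows * n) for n in candidates]


# Hypothetical row counts per join key on the skewed side.
key_counts = {"a": 1_000_000, "b": 20_000, "c": 15_000}

for n, largest_group, replicated in salting_tradeoff(
    key_counts, small_rows=50_000, candidates=[1, 5, 10, 50, 100]
):
    print(f"N={n}: largest group ~{largest_group} rows, small side {replicated} rows")
```

The estimate only narrows the range of candidates worth trying; actual task durations on the real data remain the deciding measurement.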