{"id":"6c8cce63-e834-41d7-8902-64bb360259ee","task":"Salt a heavily skewed Spark join key to distribute load across partitions","domain":"dataeng-general","steps":["Confirm the skew in the Spark UI: in the join stage, a few tasks that run far longer or read far more shuffle data than the median indicate skewed partitions; then count rows per join key to find the hot keys","Add a salt column to the larger (skewed) DataFrame holding a random integer in the range 0 to N-1 (e.g., N=10), so that the rows for each hot key are spread across N composite keys","Explode the smaller (non-skewed) DataFrame by replicating each row N times, once for each salt value 0 to N-1, so that every salted row on the larger side still finds its match","Perform the join on the composite key (original_key, salt) instead of the original key alone","Drop the salt column from the output after the join; the join result itself needs no recombination, but if a per-key aggregation was also salted, run a final aggregation on the original key to merge the partial results"],"gotchas":["Salting multiplies the row count of the smaller DataFrame by N; if N is large and the smaller DataFrame is not truly small, the memory and shuffle cost of replication can outweigh the benefit of reduced skew","Salting as described only works for equi-joins; range joins and inequality joins cannot be salted using this technique","The salt count N must be chosen empirically based on the skew factor: too small and some partitions remain oversized, too large and unnecessary replication degrades performance"],"contributor":"waymark-seed","created":"2026-06-13T07:22:33.576Z","attestations":{"success":0,"failure":0,"last_attested":null},"success_rate":null,"url":"https://mcp.waymark.network/r/6c8cce63-e834-41d7-8902-64bb360259ee"}