Create or populate a staging table (or DataFrame) containing incoming change records with the same schema as the target Iceberg table, adding a change_type column (I/U/D) if needed.
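Write the MERGE INTO statement: MERGE INTO my_catalog.db.customers t USING staging s ON t.id = s.id WHEN MATCHED AND s.change_type = 'D' THEN DELETE WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED AND s.change_type <> 'D' THEN INSERT *. The guard on the final clause keeps delete records for keys that are absent from the target from being inserted as new rows.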
Execute the statement in Spark SQL; Iceberg commits a new snapshot that contains rewritten data files in copy-on-write mode (the default), or new data files plus delete files in merge-on-read mode.
Verify row counts by comparing pre- and post-merge SELECT COUNT(*) on the target table and cross-referencing with the staging counts: the post-merge count should equal the pre-merge count plus the rows inserted minus the rows deleted.
For large tables, partition the staging data to match the target partition spec, and where possible include the partition column in the ON condition, so that the MERGE scans and rewrites only the affected partitions.
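The steps above can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the helper names (build_merge_sql, expected_count, merge_and_verify) are hypothetical, spark is assumed to be an existing SparkSession with Iceberg configured, staging keys are assumed to be unique, and the change_type <> 'D' guard on the insert branch is an addition here so that delete records for unknown keys are not inserted.

```python
def build_merge_sql(target: str, source: str, key: str = "id") -> str:
    """Build a MERGE INTO statement that applies I/U/D change records."""
    return (
        f"MERGE INTO {target} t\n"
        f"USING {source} s\n"
        f"ON t.{key} = s.{key}\n"
        "WHEN MATCHED AND s.change_type = 'D' THEN DELETE\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        # Assumption: skip delete records whose key is not in the target.
        "WHEN NOT MATCHED AND s.change_type <> 'D' THEN INSERT *"
    )


def expected_count(pre_count: int, inserted: int, deleted: int) -> int:
    """Updates do not change the row count; inserts add, deletes remove."""
    return pre_count + inserted - deleted


def merge_and_verify(spark, target: str, source: str, key: str = "id") -> bool:
    """Run the merge and check the post-merge count against the expectation.

    `spark` is assumed to be a SparkSession; staging keys are assumed unique.
    """
    def count(sql: str) -> int:
        return spark.sql(sql).first()[0]

    pre = count(f"SELECT COUNT(*) FROM {target}")
    # Staging rows with no match in the target that are not deletes: inserts.
    inserted = count(
        f"SELECT COUNT(*) FROM {source} s LEFT ANTI JOIN {target} t "
        f"ON s.{key} = t.{key} WHERE s.change_type <> 'D'"
    )
    # Staging delete records that do match a target row: deletes.
    deleted = count(
        f"SELECT COUNT(*) FROM {source} s LEFT SEMI JOIN {target} t "
        f"ON s.{key} = t.{key} WHERE s.change_type = 'D'"
    )
    spark.sql(build_merge_sql(target, source, key))
    post = count(f"SELECT COUNT(*) FROM {target}")
    return post == expected_count(pre, inserted, deleted)
```

Calling merge_and_verify(spark, "my_catalog.db.customers", "staging") would run the statement from step 2 and return False if the counts do not reconcile. The insert and delete counts are computed before the merge because the joins depend on the target's pre-merge contents.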
Known gotchas
MERGE INTO in Iceberg uses copy-on-write by default, rewriting every data file that contains a matched row; for write-heavy workloads, enable merge-on-read by setting write.merge.mode=merge-on-read in the table properties (this requires Iceberg format version 2).
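If the staging table has duplicate keys that match the same target row, the MERGE is ambiguous: Spark typically fails with a cardinality error, and where that check does not apply the result can be non-deterministic. Deduplicate the staging data before executing the MERGE.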
Spark MERGE INTO requires the Iceberg Spark runtime JAR on the classpath and spark.sql.extensions set to org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions; without them the SQL parser will not recognize the MERGE syntax.
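The gotchas above can be guarded against with a few small helpers. This is a sketch under stated assumptions: the function names are hypothetical, the catalog name my_catalog is taken from the example statement, and the returned configuration is deliberately incomplete (a real catalog also needs settings such as its type and warehouse location, which are not covered here).

```python
ICEBERG_EXTENSIONS = (
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
)


def session_conf(catalog: str = "my_catalog") -> dict:
    """Spark settings needed for the parser to accept MERGE INTO.

    Incomplete on purpose: catalog type and warehouse are not set here.
    """
    return {
        "spark.sql.extensions": ICEBERG_EXTENSIONS,
        f"spark.sql.catalog.{catalog}": "org.apache.iceberg.spark.SparkCatalog",
    }


def merge_on_read_sql(table: str) -> str:
    """Statement that switches MERGE on `table` from copy-on-write to merge-on-read."""
    return (
        f"ALTER TABLE {table} "
        "SET TBLPROPERTIES ('write.merge.mode' = 'merge-on-read')"
    )


def duplicate_keys_sql(source: str, key: str = "id") -> str:
    """Query listing staging keys that appear more than once.

    An empty result means the staging data is safe to merge on `key`.
    """
    return (
        f"SELECT {key}, COUNT(*) AS n FROM {source} "
        f"GROUP BY {key} HAVING COUNT(*) > 1"
    )
```

The settings from session_conf() could be passed to SparkSession.builder.config(key, value) one pair at a time, and the two SQL strings run with spark.sql(...) before the merge. Checking for duplicates first, instead of silently dropping them, leaves the choice of which record wins to the pipeline owner.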