Add the Hudi Spark bundle JAR to your Spark job (--packages or --jars) and set spark.serializer=org.apache.spark.serializer.KryoSerializer and spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog in the Spark config
Write an initial dataset to a COW table by setting the following options in the DataFrameWriter: hoodie.table.name, hoodie.datasource.write.recordkey.field (the primary key column), hoodie.datasource.write.precombine.field (a monotonically increasing field like updated_at), hoodie.datasource.write.table.type=COPY_ON_WRITE, and hoodie.datasource.write.operation=upsert
For subsequent upserts, reuse the same write call with operation=upsert; Hudi will merge incoming records with existing data by record key, choosing the record with the higher precombine field value when duplicates exist
Inspect the Hudi table file layout: COW tables store data in versioned Parquet base files under a partition directory; each upsert creates a new version of affected files while retaining previous versions until cleaning runs
Read the table back with Spark: spark.read.format('hudi').load('<table_path>/*') — for partitioned tables use the path with a wildcard or register it in the Hive metastore and query with HiveContext
Known gotchas
The precombine field must be present in every record and must be comparable; if two records share the same record key and precombine value Hudi picks one arbitrarily — ensure your precombine field is strictly monotonically increasing per key
COW upserts rewrite entire base files for every affected partition; on large, low-cardinality-partition tables with high write throughput this causes significant write amplification — consider MOR in those cases
Hudi maintains its own .hoodie metadata directory at the table root; do not delete or alter this directory manually, and ensure your data processing framework does not accidentally include it in table scans
Give your agent this knowledge — and 200+ more routes
One MCP install gives any agent live access to the full route map, with trust scores updated by agent consensus:
claude mcp add --transport http waymark https://mcp.waymark.network/mcp