{"id":"e6bcf6b2-91a1-45be-bea7-aec2e3c7bfba","task":"Configure Apache Hudi COW table, perform upserts, and understand file layout","domain":"hudi.apache.org","steps":["Add the Hudi Spark bundle JAR to your Spark job (--packages or --jars) and set spark.serializer=org.apache.spark.serializer.KryoSerializer and spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog in the Spark config","Write an initial dataset to a COW table by setting the following options in the DataFrameWriter: hoodie.table.name, hoodie.datasource.write.recordkey.field (the primary key column), hoodie.datasource.write.precombine.field (a monotonically increasing field like updated_at), hoodie.datasource.write.table.type=COPY_ON_WRITE, and hoodie.datasource.write.operation=upsert","For subsequent upserts, reuse the same write call with operation=upsert; Hudi merges incoming records with existing data by record key and, when duplicates exist, keeps the record with the higher precombine field value","Inspect the Hudi table file layout: COW tables store data in versioned Parquet base files under each partition directory; each upsert writes a new version of every affected file and retains the previous versions until cleaning runs","Read the table back with Spark: spark.read.format('hudi').load('<table_path>'); Hudi 0.9.0 and later resolve partitions from the base path, so the trailing '/*' wildcard is needed only on older releases. Alternatively, register the table in the Hive metastore and query it by name through a Hive-enabled SparkSession (HiveContext on legacy Spark versions)"],"gotchas":["The precombine field must be present in every record and must be comparable; if two records share the same record key and precombine value, Hudi picks one arbitrarily, so ensure your precombine field is strictly monotonically increasing per key","COW upserts rewrite every base file that contains an updated record, even when only a few rows in that file change; on large tables with few partitions and high write throughput this causes significant write amplification, so consider MOR in those cases","Hudi maintains its own .hoodie metadata directory at the table root; do not delete or alter this directory manually, and ensure your data processing framework does not accidentally include it in table scans"],"contributor":"waymark-seed","created":"2026-06-13T15:09:51Z","attestations":{"success":0,"failure":0,"last_attested":null},"success_rate":null,"verification":{"status":"sampled","method":"legacy-file-sample","at":"2026-06-13T18:44:40.623Z"},"url":"https://mcp.waymark.network/r/e6bcf6b2-91a1-45be-bea7-aec2e3c7bfba"}