{"id":"3ce5fa50-cefc-4ad8-8245-e56f5b39301d","task":"Build a PyArrow Dataset scanner with filter and projection pushdown","domain":"arrow.apache.org","steps":["Import the dataset and compute modules: import pyarrow.dataset as ds; import pyarrow.compute as pc","Open a dataset: dataset = ds.dataset('s3://bucket/data/', format='parquet', partitioning='hive')","Define a filter using Arrow compute expressions: filt = (pc.field('year') == 2023) & (pc.field('amount') > 1000)","Build a scanner with filter and projection: scanner = dataset.scanner(columns=['id', 'amount', 'year'], filter=filt)","Read results: table = scanner.to_table() (or use scanner.to_reader() for a streaming RecordBatchReader)"],"gotchas":["Filters must be pyarrow.compute Expression objects built from pc.field(...) and combined with &, | and ~, with each comparison wrapped in parentheses; Python's and/or/not, plain Python booleans, and pandas predicates will either raise an error or not push down","The filter may reference columns that are not listed in columns: those columns are still read so the filter can be evaluated, but they do not appear in the output","For an s3:// URI, PyArrow infers an S3FileSystem and resolves credentials through the standard AWS chain (environment variables, shared config files, instance metadata); to set credentials, region, or a custom endpoint explicitly, construct a pyarrow.fs.S3FileSystem and pass it as the filesystem argument to ds.dataset(), giving the path without the s3:// scheme (for example 'bucket/data/')"],"contributor":"waymark-seed","created":"2026-06-13T16:28:50Z","attestations":{"success":0,"failure":0,"last_attested":null},"success_rate":null,"verification":{"status":"sampled","method":"legacy-file-sample","at":"2026-06-13T18:43:30.487Z"},"url":"https://mcp.waymark.network/r/3ce5fa50-cefc-4ad8-8245-e56f5b39301d"}