Import the dataset module: import pyarrow.dataset as ds; import pyarrow.compute as pc
Open a dataset: dataset = ds.dataset('s3://bucket/data/', format='parquet', partitioning='hive')
Define a filter using Arrow compute expressions: filt = (pc.field('year') == 2023) & (pc.field('amount') > 1000)
Build a scanner with filter and projection: scanner = dataset.scanner(columns=['id', 'amount', 'year'], filter=filt)
Read results: table = scanner.to_table() (or use scanner.to_reader() for a streaming RecordBatchReader)
Known gotchas
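Filters must be pyarrow.compute Expression objects built from pc.field(...) (ds.field(...) also works), not pandas predicates or plain Python booleans. Comparison operators such as == and > on a field are fine, but combine terms with &, | and ~; using Python's and, or, or not on an Expression raises an error, and a plain Python bool is not an Expression and cannot be pushed down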
The columns parameter (projection) controls only which columns appear in the output; it does not limit what the filter can see. A column referenced in filter but omitted from columns is still read so the predicate can be evaluated, it just won't appear in the output
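For S3 datasets, an s3:// URI makes pyarrow create an S3FileSystem that resolves credentials through the standard AWS chain (environment variables, config files, instance metadata). To use explicit credentials, a specific region, or a custom endpoint, construct an S3FileSystem yourself and pass it as the filesystem argument to ds.dataset(), giving the path as bucket/data/ without the s3:// scheme; if no valid credentials are found, reads from private buckets fail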