Steps

Import the dataset module: import pyarrow.dataset as ds; import pyarrow.compute as pc
Open a dataset: dataset = ds.dataset('s3://bucket/data/', format='parquet', partitioning='hive')
Define a filter using Arrow compute expressions: filt = (pc.field('year') == 2023) & (pc.field('amount') > 1000)
Build a scanner with filter and projection: scanner = dataset.scanner(columns=['id', 'amount', 'year'], filter=filt)
Read results: table = scanner.to_table() (or use scanner.to_reader() for a streaming RecordBatchReader)

Known gotchas

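Filters must be pyarrow.compute Expression objects built from pc.field(...) and combined with &, | and ~. Comparison operators such as == and > applied to a field build an expression, but Python's and/or/not raise a ValueError, and a plain Python bool or a pandas predicate is not an expression and cannot be pushed down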
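Projection (the columns parameter) and the filter are pushed down together: the scanner reads the columns required by either one, applies the filter, then returns only the projected columns. A column referenced in filter but not listed in columns is therefore still read; it just won't appear in the output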
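For S3 datasets that need explicit credentials, a region, or a custom endpoint, configure an S3FileSystem and pass it as the filesystem argument to ds.dataset(), typically giving the path without the s3:// scheme; without this, PyArrow infers the filesystem from the s3:// URI and relies on the default AWS credential chain (environment variables, shared config files, instance roles), so access to private buckets fails when that chain does not resolve valid credentials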


Build a PyArrow Dataset scanner with filter and projection pushdown
