Lesson 7 of 12 · The lookup tables

Bounding boxes, byte strings, and the “which-files-do-I-need” query.

You have the static layers. You have the live data path. What you don’t yet have is a fast way to answer “which vehicle-day GPS files touch route X on day Y?” The routing engine in Lesson 8 needs to ask that question millions of times. This lesson builds the three lookup tables that make it cheap: parcels with WKB geometry + per-route assignment + bbox columns; per-day vehicle-day summary indexes; and a STAC-like master index that joins them.

Time: ~75 min You'll touch: prep/parcels_wkb · cache/gps_indexes · cache/master_index Result: the indexes the engine uses to be fast

· Objective

Build the three lookup tables. Each one answers one specific question fast.

  • prep/parcels_wkb.py — parcels in a DuckDB-friendly form: WKB geometry bytes, per-parcel route_id assignment via spatial join to route polygons, bbox columns precomputed. Answers “which parcels belong to route X?” in milliseconds.
  • cache/gps_indexes.py — one summary row per vehicle-day parquet file: min_dt, max_dt, n_rows, bounding box in 4326. Cheap metadata that lets you skip files that can’t possibly matter.
  • cache/master_index.py — the cross-join: vehicle-day bbox × route bbox → pairs that overlap. The “STAC-like” lookup that turns “which GPS files touch route X on day Y?” into a small filter on a small table.
The rule that runs through this lesson: the cheapest way to filter big data is to filter against small data. Don’t scan 500 vehicle-day parquet files to find the ones that touch a route. Build a small lookup table once, scan that, then open only the files it points to. Engineering big-data queries to feel small is almost always “build the right index.”

· Build it, step by step

1 Parcels with WKB — prep/parcels_wkb.py

The parcels module from Lesson 2 returned a GeoPandas GeoDataFrame. That’s great for Python-side work, but DuckDB doesn’t speak GeoPandas; it speaks WKB (Well-Known Binary), the standard byte encoding for geometries. So we produce a derived form designed for DuckDB consumption:

build_parcels_wkb(
    parcels_gdf,        # the GeoDataFrame from prep.parcels
    routes_gdf,         # the route polygons (for assignment)
)
# Returns a plain DataFrame with columns:
#   APN, route_id, geom_wkb (bytes), min_lon, min_lat, max_lon, max_lat
  • WKB bytes — the geometry as a bytes column DuckDB can ST_GeomFromWKB on without Python ever touching it.
  • Per-parcel route_id assignment — a spatial join (parcels-in-routes) attaches the route each parcel belongs to. Parcels outside any route get NULL; they’re not serviced.
  • Bbox columns precomputed — min/max lon/lat directly on the row. Bbox filtering becomes plain SQL WHERE clauses.

Writing the output to a single parquet via write_parcels_wkb() gives you a file DuckDB can query directly: SELECT * FROM 'parcels_wkb.parquet' WHERE route_id = 'A1'.

2 Vehicle-day indexes — cache/gps_indexes.py

Each GPS cache file (one parquet per vehicle per local day from Lesson 6) gets summarized to a single row of metadata:

build_day_index(cache_dir, "2026-01-18")
# -> writes cache/2026-01-18/_vehicle_day_index.parquet
#    columns: vehicle_id, n_rows, min_dt, max_dt,
#             min_lon, min_lat, max_lon, max_lat

The index file lives inside the day directory, so it’s discoverable next to the data it summarizes. The _ prefix sorts it to the top and makes it look like the “table of contents” for that day.

  • summarize_vehicle_day_file(path) — do the math for one file. Reads min/max dt and lon/lat in a single pyarrow scan; no full materialization.
  • build_day_index(cache_dir, day) — loop one day’s files, write the summary parquet.
  • build_all_day_indexes(cache_dir) — rebuild every day’s index. Idempotent.

3 The master index — cache/master_index.py

The headline lookup table. For every (vehicle, day) in the cache and every route_id in your route bbox index, check whether their bounding boxes overlap. If they do, emit a row. If not, don’t.

build_master_index(
    vehicle_day_indexes_df,   # concatenated output of build_day_index
    route_bboxes_df,          # output of build_route_index from Lesson 3
)
# -> DataFrame with columns:
#    vehicle_id, day, route_id, n_rows, min_dt, max_dt

This is STAC-like in spirit: a small lookup of “which assets cover which spatial extents.” The actual implementation is a self-join with bbox-overlap predicates — no heavy geometry library needed, just numeric comparisons.

files_for_route(master_df, route_id, day) is the tiny convenience: takes the master index and a (route, day), returns the list of vehicle-day cache files that need to be opened. The routing engine in Lesson 8 will call this once at the top of each route-day analysis — maybe 5 files instead of 500.

4 Tests with synthetic data

Three test files, all using synthetic geometries and pings (no real data):

  • tests/test_parcels_wkb.py — WKB encoding round-trips; spatial-join assignment correct; bbox columns match geometry.
  • tests/test_gps_indexes.py — summary stats correct; index file written; build_all idempotent.
  • tests/test_master_index.py — bbox-overlap logic; non-overlapping pairs excluded; files_for_route returns the right paths.
pip install -e ".[dev,geotab,postgres]"
ruff check .
pytest -q

5 Commit and push

git add .
git commit -m "Lesson 7: parcels_wkb + GPS indexes + master index"
git push origin main

· Package anatomy after this lesson

Where everything lives now. new marks files added this lesson.

opentrash/ ├── pyproject.toml ├── mkdocs.yml ├── .github/workflows/{ci,docs}.yml ├── docs/{index,architecture}.md + CNAME ├── opentrash/ │ ├── adapters/gps/ # base, geotab, postgres │ ├── core/ # crs, duckdb_session, vehicle_ids │ ├── prep/ │ │ ├── sites.py │ │ ├── parcels.py │ │ ├── static_layers.py │ │ └── parcels_wkb.py # [new] WKB + route assignment + bbox │ ├── cache/ │ │ ├── gps_cache.py │ │ ├── gps_indexes.py # [new] per-day summary indexes │ │ └── master_index.py # [new] STAC-like cross-join │ ├── tonnage/ # registry, cleaners, keys, upsert, pipeline │ └── (engine, patterns, routeview scaffolded) └── tests/ ├── (existing 11 test files) ├── test_parcels_wkb.py # [new] ├── test_gps_indexes.py # [new] └── test_master_index.py # [new]

· What you built

  • Parcels in WKB form — the parquet DuckDB queries directly, with per-parcel route assignment and bbox columns precomputed.
  • Per-day vehicle-day summary indexes — the “table of contents” for each day’s GPS cache.
  • The master spatial-temporal index — the small lookup table that answers “which files touch which routes when” in a SQL filter.
  • Three new tests covering encoding, summary math, bbox-overlap logic, and the convenience lookup.
The substrate is in place. Next lesson is the routing engine — the layer that uses these indexes to join every GPS ping to every applicable GIS layer once, producing the enriched ping stream every downstream product consumes.

· Companion resources

Optional, for going deeper.

  • WKB and WKT: Wikipedia — the binary/text geometry standards every spatial DB consumes.
  • STAC: SpatioTemporal Asset Catalog — the geospatial-data-catalog spec our master index resembles in spirit (we’re building a private one; STAC is the public version).
  • DuckDB’s spatial extension: spatial functions referenceST_GeomFromWKB, ST_Intersects, and friends.

· Next lesson

Lesson 8 — The routing engine: the architectural heart of the package. One unified spatial join that takes raw GPS pings and enriches them with route, parcel, and facility context. Every downstream product consumes the engine’s output.