Bounding boxes, byte strings, and the “which-files-do-I-need” query.
You have the static layers. You have the live data path. What you don’t yet have is a fast way to answer “which vehicle-day GPS files touch route X on day Y?” The routing engine in Lesson 8 needs to ask that question millions of times. This lesson builds the three lookup tables that make it cheap: parcels with WKB geometry + per-route assignment + bbox columns; per-day vehicle-day summary indexes; and a STAC-like master index that joins them.
· Objective
Build the three lookup tables. Each one answers one specific question fast.
prep/parcels_wkb.py— parcels in a DuckDB-friendly form: WKB geometry bytes, per-parcelroute_idassignment via spatial join to route polygons, bbox columns precomputed. Answers “which parcels belong to route X?” in milliseconds.cache/gps_indexes.py— one summary row per vehicle-day parquet file:min_dt,max_dt,n_rows, bounding box in 4326. Cheap metadata that lets you skip files that can’t possibly matter.cache/master_index.py— the cross-join: vehicle-day bbox × route bbox → pairs that overlap. The “STAC-like” lookup that turns “which GPS files touch route X on day Y?” into a small filter on a small table.
· Build it, step by step
1 Parcels with WKB — prep/parcels_wkb.py
The parcels module from Lesson 2 returned a GeoPandas GeoDataFrame. That’s great for Python-side work, but DuckDB doesn’t speak GeoPandas; it speaks WKB (Well-Known Binary), the standard byte encoding for geometries. So we produce a derived form designed for DuckDB consumption:
build_parcels_wkb(
parcels_gdf, # the GeoDataFrame from prep.parcels
routes_gdf, # the route polygons (for assignment)
)
# Returns a plain DataFrame with columns:
# APN, route_id, geom_wkb (bytes), min_lon, min_lat, max_lon, max_lat
- WKB bytes — the geometry as a
bytescolumn DuckDB canST_GeomFromWKBon without Python ever touching it. - Per-parcel
route_idassignment — a spatial join (parcels-in-routes) attaches the route each parcel belongs to. Parcels outside any route getNULL; they’re not serviced. - Bbox columns precomputed — min/max lon/lat directly on the row. Bbox filtering becomes plain SQL
WHEREclauses.
Writing the output to a single parquet via write_parcels_wkb() gives you a file DuckDB can query directly: SELECT * FROM 'parcels_wkb.parquet' WHERE route_id = 'A1'.
2 Vehicle-day indexes — cache/gps_indexes.py
Each GPS cache file (one parquet per vehicle per local day from Lesson 6) gets summarized to a single row of metadata:
build_day_index(cache_dir, "2026-01-18")
# -> writes cache/2026-01-18/_vehicle_day_index.parquet
# columns: vehicle_id, n_rows, min_dt, max_dt,
# min_lon, min_lat, max_lon, max_lat
The index file lives inside the day directory, so it’s discoverable next to the data it summarizes. The _ prefix sorts it to the top and makes it look like the “table of contents” for that day.
summarize_vehicle_day_file(path)— do the math for one file. Reads min/max dt and lon/lat in a single pyarrow scan; no full materialization.build_day_index(cache_dir, day)— loop one day’s files, write the summary parquet.build_all_day_indexes(cache_dir)— rebuild every day’s index. Idempotent.
3 The master index — cache/master_index.py
The headline lookup table. For every (vehicle, day) in the cache and every route_id in your route bbox index, check whether their bounding boxes overlap. If they do, emit a row. If not, don’t.
build_master_index(
vehicle_day_indexes_df, # concatenated output of build_day_index
route_bboxes_df, # output of build_route_index from Lesson 3
)
# -> DataFrame with columns:
# vehicle_id, day, route_id, n_rows, min_dt, max_dt
This is STAC-like in spirit: a small lookup of “which assets cover which spatial extents.” The actual implementation is a self-join with bbox-overlap predicates — no heavy geometry library needed, just numeric comparisons.
files_for_route(master_df, route_id, day) is the tiny convenience: takes the master index and a (route, day), returns the list of vehicle-day cache files that need to be opened. The routing engine in Lesson 8 will call this once at the top of each route-day analysis — maybe 5 files instead of 500.
4 Tests with synthetic data
Three test files, all using synthetic geometries and pings (no real data):
tests/test_parcels_wkb.py— WKB encoding round-trips; spatial-join assignment correct; bbox columns match geometry.tests/test_gps_indexes.py— summary stats correct; index file written;build_allidempotent.tests/test_master_index.py— bbox-overlap logic; non-overlapping pairs excluded;files_for_routereturns the right paths.
pip install -e ".[dev,geotab,postgres]"
ruff check .
pytest -q
5 Commit and push
git add .
git commit -m "Lesson 7: parcels_wkb + GPS indexes + master index"
git push origin main
· Package anatomy after this lesson
Where everything lives now. new marks files added this lesson.
· What you built
- Parcels in WKB form — the parquet DuckDB queries directly, with per-parcel route assignment and bbox columns precomputed.
- Per-day vehicle-day summary indexes — the “table of contents” for each day’s GPS cache.
- The master spatial-temporal index — the small lookup table that answers “which files touch which routes when” in a SQL filter.
- Three new tests covering encoding, summary math, bbox-overlap logic, and the convenience lookup.
· Companion resources
Optional, for going deeper.
- WKB and WKT: Wikipedia — the binary/text geometry standards every spatial DB consumes.
- STAC: SpatioTemporal Asset Catalog — the geospatial-data-catalog spec our master index resembles in spirit (we’re building a private one; STAC is the public version).
- DuckDB’s spatial extension: spatial functions reference —
ST_GeomFromWKB,ST_Intersects, and friends.
· Next lesson
Lesson 8 — The routing engine: the architectural heart of the package. One unified spatial join that takes raw GPS pings and enriches them with route, parcel, and facility context. Every downstream product consumes the engine’s output.