One spatial join. Done once. For everything that follows.
Eight lessons have been preparation. Foundational primitives, static GIS layers, GPS adapters, the cache, the substrate — all of it set up so that today, one module can do the only spatial join the package needs. Every GPS ping joins, in one DuckDB pass, to its route polygon, nearest parcel, landfill membership, and depot proximity. The output is the canonical enriched ping stream. From here forward, every downstream module is a pure aggregation on top of it.
· Objective
Two modules, one architectural payoff.
engine/config.py—EngineConfig, a frozen dataclass holding every tunable knob the engine uses: parcel edge distance, slow-speed threshold for service-event candidacy, landfill buffer, depot radius, output CRS, and batch size. One place to read, one place to change.engine/enrichment.py—enrich_pings(): the single spatial-join entry point. Takes a vehicle-day’s pings and the substrate; returns the same pings with four new columns (route_id,apn,in_landfill,at_depot). Plusenrich_vehicle_day(), the convenience that reads a cache file and writes an enriched parquet in the mirror layout.
· Build it, step by step
1 Engine config — engine/config.py
A frozen dataclass that surfaces every tunable value. Defaults reflect the City of San Diego setup; override individual fields for a different agency or commodity:
@dataclass(frozen=True, slots=True)
class EngineConfig:
parcel_edge_ft: float = 25.0 # how close to parcel boundary counts
slow_mph_max: float = 12.0 # fast pings don't service parcels
landfill_buffer_ft: float = 50.0
depot_radius_ft: float = 250.0
output_crs: str = "EPSG:4326"
batch_size: int = 200_000
Why a dataclass and not a settings file? Because the engine is library code, not an application. The caller’s notebook owns configuration; the package gives them a typed, immutable object to pass.
2 Canonical output shape — ENRICHED_COLUMNS
Six original GPS columns + four engine-added columns, in this exact order:
ENRICHED_COLUMNS = (
"vehicle_id", "dt_utc", "dt_local", "lat", "lon", "speed_mph",
"route_id", # the route polygon containing the ping, or NULL
"apn", # nearest parcel (within parcel_edge_ft), or NULL
"in_landfill", # bool: ping is inside any landfill polygon
"at_depot", # bool: ping is within depot_radius_ft of the depot
)
Downstream modules — segments (Lesson 9), patterns (Lesson 10), RouteView (Lesson 11) — expect this shape. Treat the column tuple as the contract.
3 The single CTAS — enrich_pings()
The whole enrichment is one DuckDB CREATE TABLE AS SELECT. Inputs registered as views; LEFT JOINs ensure pings are never dropped:
WITH ping_routes AS (
SELECT p.*, r.route_id
FROM pings p
LEFT JOIN routes r
ON ST_Within(ST_Point(p.lon, p.lat),
ST_GeomFromWKB(r.geom_wkb))
),
ping_parcels AS (
SELECT pr.*, pc.APN AS apn
FROM ping_routes pr
LEFT JOIN 'parcels_wkb.parquet' pc
ON pr.speed_mph <= config.slow_mph_max
AND bbox prefilter on min/max columns from L7
AND ST_DWithin(ST_Point(pr.lon, pr.lat),
ST_GeomFromWKB(pc.geom_wkb),
parcel_edge_deg)
),
ping_landfills AS (...),
ping_depot AS (...)
SELECT ... FROM ping_depot ORDER BY vehicle_id, dt_utc
Three things this design buys us:
- The bbox prefilter from Lesson 7 pays off here. Without it, joining millions of pings against a half-million-parcel layer would scan every polygon. With the precomputed min/max columns, DuckDB filters numerically first and only computes exact
ST_DWithinon a tiny survivor set. - Speed gating happens in the JOIN, not after. Fast-moving pings don’t even compute a parcel match. The truck driving past at 35 mph isn’t servicing parcels; the engine doesn’t pretend otherwise.
- Landfill / depot tests are
EXISTSsubqueries. Cheap. The substrate is small (one or two depot polygons, a handful of landfills); the engine answers “is this ping at one of them?” with constant-time checks.
4 Vehicle-day driver — enrich_vehicle_day()
The convenience entry point most callers use. Reads one cache parquet, runs enrichment, writes to the mirror layout:
enrich_vehicle_day(
cache_path="cache/2026-01-18/815001.parquet",
out_root="enriched/",
parcels_wkb_path="parcels_wkb.parquet",
routes_gdf=routes_gdf,
landfills_gdf=landfills_gdf,
depot_gdf=depot_gdf,
)
# -> writes enriched/2026-01-18/815001.parquet
One vehicle-day in, one enriched vehicle-day out. Looping over the master index from Lesson 7 gives you a whole fleet’s worth of enrichment in one script.
5 Tests with synthetic substrate
Seventeen tests — ten pure unit tests (no DuckDB needed) plus seven behavior tests that build a tiny synthetic substrate (one route box, one parcel, one landfill, one depot) and verify each enrichment column in isolation:
- Route attribution: a ping inside route A gets
route_id="A". - Parcel attribution: a slow ping near a parcel gets its APN; a fast ping doesn’t.
- Landfill flag: ping inside the polygon →
in_landfill=True. - Depot flag: ping near the depot centroid →
at_depot=True. - Ping outside everything: all four enrichments are NULL/false.
The seven behavior tests skip gracefully in environments where DuckDB’s spatial extension can’t be downloaded (a sandboxed CI, for example). On any machine with network access to extensions.duckdb.org, all seventeen pass.
6 Commit and push
pip install -e ".[dev,geotab,postgres]"
ruff check .
pytest -q
git add . && git commit -m "Lesson 8: routing engine — config + enrichment"
git push origin main
· Package anatomy after this lesson
Where everything lives now. new marks files added this lesson.
· What you built
- A frozen config dataclass — every engine knob in one place, immutable, typed.
- The canonical enriched-ping shape — six GPS columns + four engine enrichments, in a tuple that downstream code treats as a contract.
- One DuckDB CTAS — one spatial join, four enrichments, bbox-prefiltered, speed-gated, written for millions of pings.
- A vehicle-day driver that reads from the L6 cache and writes to a mirror layout downstream lessons can scan cheaply.
- Seventeen tests over synthetic substrate covering every enrichment in isolation.
· Companion resources
Optional, for going deeper.
- DuckDB CTAS pattern: CREATE TABLE AS SELECT — the “build a derived table in one shot” idiom we use throughout.
- ST_DWithin and ST_Within: spatial function reference — the two predicates this engine relies on.
- Why frozen dataclasses: Python docs on frozen instances — immutability as a property of the type, not a runtime check.
· Next lesson
Lesson 9 — Segments: timeline analysis on top of the enriched-ping stream. Service, travel, landfill, load, shift — all derived from sequences of enriched pings. The first downstream module that proves the architecture: zero spatial joins, pure aggregation on the engine output.