Lesson 8 of 12 · The architectural payoff

One spatial join. Done once. For everything that follows.

Eight lessons have been preparation. Foundational primitives, static GIS layers, GPS adapters, the cache, the substrate — all of it set up so that today, one module can do the only spatial join the package needs. Every GPS ping joins, in one DuckDB pass, to its route polygon, nearest parcel, landfill membership, and depot proximity. The output is the canonical enriched ping stream. From here forward, every downstream module is a pure aggregation on top of it.

Time: ~90 min You'll touch: engine/config · engine/enrichment Result: the canonical enriched ping stream

· Objective

Two modules, one architectural payoff.

  • engine/config.pyEngineConfig, a frozen dataclass holding every tunable knob the engine uses: parcel edge distance, slow-speed threshold for service-event candidacy, landfill buffer, depot radius, output CRS, and batch size. One place to read, one place to change.
  • engine/enrichment.pyenrich_pings(): the single spatial-join entry point. Takes a vehicle-day’s pings and the substrate; returns the same pings with four new columns (route_id, apn, in_landfill, at_depot). Plus enrich_vehicle_day(), the convenience that reads a cache file and writes an enriched parquet in the mirror layout.
The rule the whole architecture rests on: spatial joins are infrastructure; products are calculations. The engine does the spatial work once. Every downstream module — segments, patterns, RouteView, reports — reads the engine’s output and aggregates. None of them open a parcel polygon. None of them buffer a landfill. The rule is what keeps the system simple even as the surface area grows.

· Build it, step by step

1 Engine config — engine/config.py

A frozen dataclass that surfaces every tunable value. Defaults reflect the City of San Diego setup; override individual fields for a different agency or commodity:

@dataclass(frozen=True, slots=True)
class EngineConfig:
    parcel_edge_ft: float = 25.0     # how close to parcel boundary counts
    slow_mph_max: float = 12.0       # fast pings don't service parcels
    landfill_buffer_ft: float = 50.0
    depot_radius_ft: float = 250.0
    output_crs: str = "EPSG:4326"
    batch_size: int = 200_000

Why a dataclass and not a settings file? Because the engine is library code, not an application. The caller’s notebook owns configuration; the package gives them a typed, immutable object to pass.

2 Canonical output shape — ENRICHED_COLUMNS

Six original GPS columns + four engine-added columns, in this exact order:

ENRICHED_COLUMNS = (
    "vehicle_id", "dt_utc", "dt_local", "lat", "lon", "speed_mph",
    "route_id",       # the route polygon containing the ping, or NULL
    "apn",            # nearest parcel (within parcel_edge_ft), or NULL
    "in_landfill",    # bool: ping is inside any landfill polygon
    "at_depot",       # bool: ping is within depot_radius_ft of the depot
)

Downstream modules — segments (Lesson 9), patterns (Lesson 10), RouteView (Lesson 11) — expect this shape. Treat the column tuple as the contract.

3 The single CTAS — enrich_pings()

The whole enrichment is one DuckDB CREATE TABLE AS SELECT. Inputs registered as views; LEFT JOINs ensure pings are never dropped:

WITH ping_routes AS (
    SELECT p.*, r.route_id
    FROM pings p
    LEFT JOIN routes r
        ON ST_Within(ST_Point(p.lon, p.lat),
                     ST_GeomFromWKB(r.geom_wkb))
),
ping_parcels AS (
    SELECT pr.*, pc.APN AS apn
    FROM ping_routes pr
    LEFT JOIN 'parcels_wkb.parquet' pc
        ON pr.speed_mph <= config.slow_mph_max
       AND bbox prefilter on min/max columns from L7
       AND ST_DWithin(ST_Point(pr.lon, pr.lat),
                      ST_GeomFromWKB(pc.geom_wkb),
                      parcel_edge_deg)
),
ping_landfills AS (...),
ping_depot AS (...)
SELECT ... FROM ping_depot ORDER BY vehicle_id, dt_utc

Three things this design buys us:

  • The bbox prefilter from Lesson 7 pays off here. Without it, joining millions of pings against a half-million-parcel layer would scan every polygon. With the precomputed min/max columns, DuckDB filters numerically first and only computes exact ST_DWithin on a tiny survivor set.
  • Speed gating happens in the JOIN, not after. Fast-moving pings don’t even compute a parcel match. The truck driving past at 35 mph isn’t servicing parcels; the engine doesn’t pretend otherwise.
  • Landfill / depot tests are EXISTS subqueries. Cheap. The substrate is small (one or two depot polygons, a handful of landfills); the engine answers “is this ping at one of them?” with constant-time checks.

4 Vehicle-day driver — enrich_vehicle_day()

The convenience entry point most callers use. Reads one cache parquet, runs enrichment, writes to the mirror layout:

enrich_vehicle_day(
    cache_path="cache/2026-01-18/815001.parquet",
    out_root="enriched/",
    parcels_wkb_path="parcels_wkb.parquet",
    routes_gdf=routes_gdf,
    landfills_gdf=landfills_gdf,
    depot_gdf=depot_gdf,
)
# -> writes enriched/2026-01-18/815001.parquet

One vehicle-day in, one enriched vehicle-day out. Looping over the master index from Lesson 7 gives you a whole fleet’s worth of enrichment in one script.

5 Tests with synthetic substrate

Seventeen tests — ten pure unit tests (no DuckDB needed) plus seven behavior tests that build a tiny synthetic substrate (one route box, one parcel, one landfill, one depot) and verify each enrichment column in isolation:

  • Route attribution: a ping inside route A gets route_id="A".
  • Parcel attribution: a slow ping near a parcel gets its APN; a fast ping doesn’t.
  • Landfill flag: ping inside the polygon → in_landfill=True.
  • Depot flag: ping near the depot centroid → at_depot=True.
  • Ping outside everything: all four enrichments are NULL/false.

The seven behavior tests skip gracefully in environments where DuckDB’s spatial extension can’t be downloaded (a sandboxed CI, for example). On any machine with network access to extensions.duckdb.org, all seventeen pass.

6 Commit and push

pip install -e ".[dev,geotab,postgres]"
ruff check .
pytest -q
git add . && git commit -m "Lesson 8: routing engine — config + enrichment"
git push origin main

· Package anatomy after this lesson

Where everything lives now. new marks files added this lesson.

opentrash/ ├── pyproject.toml ├── mkdocs.yml ├── .github/workflows/{ci,docs}.yml ├── docs/{index,architecture}.md + CNAME ├── opentrash/ │ ├── adapters/gps/ # base, geotab, postgres │ ├── core/ # crs, duckdb_session, vehicle_ids │ ├── prep/ # sites, parcels, static_layers, parcels_wkb │ ├── cache/ # gps_cache, gps_indexes, master_index │ ├── tonnage/ # registry, cleaners, keys, upsert, pipeline │ ├── engine/ │ │ ├── config.py # [new] EngineConfig dataclass │ │ └── enrichment.py # [new] THE ONE SPATIAL JOIN │ └── (patterns, routeview scaffolded) └── tests/ ├── (existing 14 test files) └── test_engine_enrichment.py # [new] 17 tests

· What you built

  • A frozen config dataclass — every engine knob in one place, immutable, typed.
  • The canonical enriched-ping shape — six GPS columns + four engine enrichments, in a tuple that downstream code treats as a contract.
  • One DuckDB CTAS — one spatial join, four enrichments, bbox-prefiltered, speed-gated, written for millions of pings.
  • A vehicle-day driver that reads from the L6 cache and writes to a mirror layout downstream lessons can scan cheaply.
  • Seventeen tests over synthetic substrate covering every enrichment in isolation.
From here forward, the package gets simpler. The hard work is done. Segments, patterns, RouteView — each one is a focused calculation on top of the engine’s output. No spatial joins anywhere downstream.

· Companion resources

Optional, for going deeper.

· Next lesson

Lesson 9 — Segments: timeline analysis on top of the enriched-ping stream. Service, travel, landfill, load, shift — all derived from sequences of enriched pings. The first downstream module that proves the architecture: zero spatial joins, pure aggregation on the engine output.