An ant colony has a routine. So does a collection truck.
A typical residential-collection day looks like an ant colony at work: leave home, travel to the site, do the job, come back to dump, repeat until done, return home. The truck has the same choreography — depot, windshield, collection, dump, repeat. This lesson takes the enriched ping stream from Lesson 8 and aggregates it into the segment-level timeline: one row per phase, organized by load, with mileage and duration baked in. Plus operational red flags when the choreography breaks.
· Objective
One module, one canonical artifact, one operational tool.
- derive_phase_per_ping(): classify each enriched ping as depot, landfill, collection, or windshield from the L8 enrichment flags + speed.
- build_timeline(): gap-and-island grouping into chronological segments, with load numbers, cumulative haversine mileage, and duration per segment.
- flag_choreography_violations(): surface operational red flags from the timeline (loaded depot returns/departures, overrun loads, mid-load route switches).
- build_all_segments(): the driver, which reads one enriched parquet and writes timeline.parquet + violations.parquet.
- engine/config.py gets a new knob: mileage_inflation_pct for opt-in road-network approximation.
DEPOT_DEPARTURE → LOAD 1 (windshield + collection + dump) → LOAD 2 → LOAD 3? → DEPOT_ARRIVAL.
A typical 10-hour day is 2–3 loads. The depot bookends carry no
load number. Every active segment carries a load number 1..N.
That structure is what makes operational reports possible: per-load
mileage, per-load duration, per-load tonnage (joined in a later
product lesson).
· Build it, step by step
1 Phase per ping — derive_phase_per_ping()
The enriched ping already carries the flags we need (at_depot, in_landfill, route_id, speed_mph). Phase derivation is a 4-rule priority cascade:
- at_depot = true → depot (overrides everything: start/end-of-shift dwell)
- in_landfill = true → landfill (truck is tipping)
- route_id not null AND speed_mph ≤ slow_mph_max → collection
- otherwise → windshield (travel time)
Implementation is one numpy np.full(len, "windshield") + 3 boolean masks. No loops, no Python-level row iteration. Millions of pings classify in milliseconds.
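The cascade above can be sketched as follows. This is an illustration, not the module's actual code: the function name derive_phase_sketch and the 5.0 mph default for slow_mph_max are assumptions, and the column names are taken from the flags listed in this step. Masks are applied from lowest to highest priority so that later assignments overwrite earlier ones.

```python
import numpy as np
import pandas as pd


def derive_phase_sketch(pings: pd.DataFrame, slow_mph_max: float = 5.0) -> np.ndarray:
    """Classify each ping. The 5.0 mph default is an assumed value."""
    # Rule 4 (default): everything starts as windshield.
    phase = np.full(len(pings), "windshield", dtype=object)

    at_depot = pings["at_depot"].to_numpy(dtype=bool)
    in_landfill = pings["in_landfill"].to_numpy(dtype=bool)
    collecting = (
        pings["route_id"].notna() & (pings["speed_mph"] <= slow_mph_max)
    ).to_numpy()

    # Apply the lowest-priority mask first; higher-priority masks overwrite it.
    phase[collecting] = "collection"   # rule 3
    phase[in_landfill] = "landfill"    # rule 2
    phase[at_depot] = "depot"          # rule 1 wins over everything
    return phase


pings = pd.DataFrame({
    "at_depot":    [True, False, False, False, False],
    "in_landfill": [False, False, False, True, False],
    "route_id":    [None, None, "A", "A", "A"],
    "speed_mph":   [0.0, 30.0, 3.0, 2.0, 25.0],
})
phases = derive_phase_sketch(pings)
print(list(phases))
```

The fourth ping is slow and on a route, but it is also inside the landfill, so the landfill rule overrides collection.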
2 Gap-and-island → segments
Adjacent rows sharing the same phase (and route_id, for collection phases) collapse into one segment. Classic SQL pattern done in pandas:
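import numpy as np
# Assumes phase and route_id are pandas string Series; collection rows key on phase + route_id.
effective_key = np.where(is_collection, phase + "|" + route_id, phase)
# A boundary is any row whose key differs from the previous row; the first row never is one.
boundaries = np.r_[False, effective_key[1:] != effective_key[:-1]]
segment_id = boundaries.cumsum()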
The route_id is part of the key only for collection segments. A truck driving down route A then B then back to A produces three collection segments, not one — we want per-route attribution. Windshield phases ignore route_id (the truck is just moving; the route polygons it passes through aren’t meaningful).
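To make the A, B, A case concrete, here is a small self-contained sketch on synthetic pings. The column names and the shift-and-compare formulation are assumptions for illustration; the module's own implementation may differ.

```python
import numpy as np
import pandas as pd

pings = pd.DataFrame({
    "phase": ["windshield", "windshield", "collection", "collection",
              "collection", "collection", "windshield"],
    "route_id": [None, "A", "A", "B", "B", "A", "A"],
})

is_collection = (pings["phase"] == "collection").to_numpy()

# route_id joins the key only for collection rows.
key = pd.Series(
    np.where(
        is_collection,
        pings["phase"] + "|" + pings["route_id"].fillna(""),
        pings["phase"],
    ),
    index=pings.index,
)

# A new island starts wherever the key differs from the previous row.
pings["segment_id"] = (key != key.shift()).cumsum() - 1

# Collapse each island into one row: its key and its ping count.
segments = pings.assign(key=key).groupby("segment_id")["key"].agg(["first", "size"])
print(segments)
```

The two leading windshield pings stay in one segment even though the second one falls inside route A, while the collection pings split into three segments (A, B, A).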
3 Mileage — haversine, cumulative, optional inflation
For each segment, sum the great-circle distance between consecutive pings:
haversine(lat1, lon1, lat2, lon2) # great-circle, miles
miles = sum(haversine(p_i, p_{i+1}) for i in segment)
miles *= (1 + config.mileage_inflation_pct)
- Haversine: treats the Earth as a sphere, which keeps it within ~0.5% of the ellipsoidal distance at this scale. No reprojection cost; numpy-vectorized over all ping-pairs in a segment at once.
- Cumulative: the sum of pairwise distances reflects what the truck drove; displacement (first-to-last) would zero out for any loop.
- Inflation knob: straight-line GPS mileage systematically undercounts real road-network distance by 5–15% because trucks follow streets (which curve) while the straight line between consecutive pings cuts corners.
mileage_inflation_pct defaults to 0.0 (honest straight-line). Set to 0.08 for an 8% road-network approximation. Never silently inflate; let the operator opt in.
Higher accuracy would require a road network (OpenStreetMap, OSM). That’s a future release; for now the inflation knob calibrates against whatever ground-truth dataset the operator can produce.
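A minimal sketch of the mileage calculation, assuming a mean Earth radius of 3958.8 miles and plain latitude/longitude arrays in degrees. The function names haversine_miles and segment_miles are illustrative, not the module's actual API.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumed constant


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; accepts scalars or numpy arrays (degrees)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))


def segment_miles(lat, lon, mileage_inflation_pct=0.0):
    """Cumulative mileage over consecutive pings, with opt-in inflation."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    if len(lat) < 2:
        return 0.0  # a single ping covers no distance
    # Vectorized over all consecutive ping pairs at once.
    legs = haversine_miles(lat[:-1], lon[:-1], lat[1:], lon[1:])
    return float(legs.sum() * (1 + mileage_inflation_pct))


# An out-and-back loop: displacement is zero, cumulative mileage is not.
lat = [40.00, 40.01, 40.00]
lon = [-75.00, -75.00, -75.00]
loop_miles = segment_miles(lat, lon)
inflated = segment_miles(lat, lon, mileage_inflation_pct=0.08)
displacement = float(haversine_miles(lat[0], lon[0], lat[-1], lon[-1]))
print(loop_miles, inflated, displacement)
```

The loop comes out at roughly 1.38 miles cumulative while the first-to-last displacement is zero, which is why the timeline sums pairwise legs.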
4 Load numbers — the workday backbone
Walk the timeline chronologically. Bookend depot segments get load_number = NaN. The first active segment after depot_departure starts load 1. After every dump, look ahead: if any future segment is a collection or dump, this is a new load. If the next active segment is depot_arrival, the truck is going home; that windshield stays in the current load.
DEPOT_DEPARTURE load=NaN
windshield load=1 (going to first route)
collection (A) load=1
dump load=1 (load 1 complete)
windshield load=2 (more work ahead, going to next route)
collection (B) load=2
dump load=2 (load 2 complete)
windshield load=2 (going home — no future work)
DEPOT_ARRIVAL load=NaN
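The walk can be sketched in a few lines. This is a simplified illustration: the function name assign_load_numbers is hypothetical, segments are reduced to a list of labels using the "dump" wording from the example above, and every depot segment is treated as a bookend.

```python
import math


def assign_load_numbers(kinds):
    """Return one load number per segment; depot segments get NaN."""
    loads = []
    current = 1
    for i, kind in enumerate(kinds):
        if kind == "depot":
            loads.append(math.nan)
            continue
        loads.append(current)
        if kind == "dump":
            # Look ahead: only start a new load if real work remains.
            more_work = any(k in ("collection", "dump") for k in kinds[i + 1:])
            if more_work:
                current += 1
    return loads


day = ["depot", "windshield", "collection", "dump",
       "windshield", "collection", "dump", "windshield", "depot"]
loads = assign_load_numbers(day)
print(loads)
```

The final windshield keeps load 2 because nothing after the second dump is a collection or a dump, matching the "going home" case in the example.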
5 Choreography violations — the operational payoff
The timeline is descriptive. The violations are what make it useful:
- loaded_depot_return (severity high): day ended on a collection segment without a closing dump. Truck went home loaded.
- loaded_depot_departure (severity high): day started with a dump before any collection. Truck started loaded from yesterday.
- overrun_loads (severity medium): more than 3 loads. Review for shift overrun.
- mid_load_route_switch (severity low): a single load touched multiple route_ids. Could be HTC parcels off the main route, or could indicate routing inefficiency.
This is the analytical payoff of the segment abstraction. A pile of GPS pings is just dots; a load-organized timeline is a workday; flagged violations are a review queue.
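As a rough sketch of how the four checks might read against a timeline, assuming hypothetical columns kind, load_number, and route_id (the real timeline schema is not shown in this lesson) and a hypothetical function name flag_violations_sketch:

```python
import pandas as pd


def flag_violations_sketch(timeline: pd.DataFrame, max_loads: int = 3) -> list:
    """Return a list of violation dicts for one vehicle-day."""
    flags = []

    # Only collection and dump segments decide whether the truck is loaded.
    work = timeline[timeline["kind"].isin(["collection", "dump"])]
    if not work.empty:
        if work["kind"].iloc[-1] == "collection":
            flags.append({"violation": "loaded_depot_return", "severity": "high"})
        if work["kind"].iloc[0] == "dump":
            flags.append({"violation": "loaded_depot_departure", "severity": "high"})

    n_loads = timeline["load_number"].max()  # NaN bookends are skipped by max()
    if pd.notna(n_loads) and n_loads > max_loads:
        flags.append({"violation": "overrun_loads", "severity": "medium"})

    # Count distinct routes touched by the collection segments of each load.
    collection = timeline[timeline["kind"] == "collection"]
    routes_per_load = collection.groupby("load_number")["route_id"].nunique()
    for load in routes_per_load[routes_per_load > 1].index:
        flags.append({
            "violation": "mid_load_route_switch",
            "severity": "low",
            "load_number": int(load),
        })
    return flags


timeline = pd.DataFrame({
    "kind": ["depot", "windshield", "collection", "collection", "windshield", "depot"],
    "load_number": [float("nan"), 1, 1, 1, 1, float("nan")],
    "route_id": [None, None, "A", "B", None, None],
})
flags = flag_violations_sketch(timeline)
print([f["violation"] for f in flags])
```

This synthetic day never dumps and collects on two routes within load 1, so it raises both a high-severity and a low-severity flag.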
6 Driver + output layout — build_all_segments()
build_all_segments(
enriched_path="enriched/2026-01-18/815001.parquet",
out_root="segments/",
)
# writes:
# segments/timeline/2026-01-18/815001.parquet
# segments/violations/2026-01-18/815001.parquet
One enriched vehicle-day in, two parquets out. Loop over the L7 master index and you get the whole fleet’s timeline in one batch.
7 Tests + commit
26 tests covering the haversine math, phase derivation cascade, gap-and-island grouping, load numbering edge cases, mileage inflation, and every violation type. All synthetic data — no DuckDB, no network. The whole suite runs in under 2 seconds.
pip install -e ".[dev,geotab,postgres]"
ruff check .
pytest -q
git add . && git commit -m "Lesson 9: segments — load-organized timeline + violations"
git push origin main
· Package anatomy after this lesson
Where everything lives now. new marks files added this lesson.
· What you built
- A per-ping phase classifier — depot / landfill / collection / windshield, derived from the L8 enrichments via a 4-rule priority cascade.
- A load-organized timeline — one row per segment, chronological, with cumulative haversine mileage and load numbers that respect the “going home” windshield case.
- Choreography-violation flags — four operational red flags surfaced from the timeline. The descriptive becomes prescriptive.
- An opt-in mileage inflation knob — default 0.0; honest reporting; calibrate against ground truth when needed.
- Zero spatial joins. The whole module is a pure aggregation on the engine’s enriched-ping output. The L8 architectural rule paying off for the first time.
· Companion resources
Optional, for going deeper.
- Haversine formula: Wikipedia — great-circle distance, taught for centuries to navigators.
- Gap-and-island problem: canonical pattern — SQL by origin, pandas by adaptation. Recognize it once and you’ll see it everywhere event-stream data appears.
- The ant colony as metaphor: a real research thread on ant colony optimization (ACO) algorithms, which study the same routine-following behavior at much larger scale. Different problem, same intuition.
· Next lesson
Lesson 10 — Pattern detection: per-parcel service signatures (weekly1, weekly2, biweekly) derived from a long history of enriched pings. Zero spatial joins; pure DuckDB CTAS aggregation. The first true product on top of the engine.