Lesson 9 of 12 · The workday timeline

An ant colony has a routine. So does a collection truck.

A typical residential-collection day looks like an ant colony at work: leave home, travel to the site, do the job, come back to dump, repeat until done, return home. The truck has the same choreography — depot, windshield, collection, dump, repeat. This lesson takes the enriched ping stream from Lesson 8 and aggregates it into the segment-level timeline: one row per phase, organized by load, with mileage and duration baked in. Plus operational red flags when the choreography breaks.

Time: ~90 min
You'll touch: engine/segments · engine/config (mileage knob)
Result: the load-organized workday timeline

· Objective

One module, one canonical artifact, one operational tool.

  • derive_phase_per_ping() — classify each enriched ping as depot, landfill, collection, or windshield from the L8 enrichment flags + speed.
  • build_timeline() — gap-and-island grouping into chronological segments, with load numbers, cumulative haversine mileage, and duration per segment.
  • flag_choreography_violations() — surface operational red flags from the timeline: loaded depot returns/departures, overrun loads, mid-load route switches.
  • build_all_segments() — the driver: read one enriched parquet, write timeline.parquet + violations.parquet.
  • engine/config.py gets a new knob: mileage_inflation_pct for opt-in road-network approximation.

The choreography we’re modeling: DEPOT_DEPARTURE → LOAD 1 (windshield + collection + dump) → LOAD 2 → LOAD 3? → DEPOT_ARRIVAL. A typical 10-hour day is 2–3 loads. The depot bookends carry no load number. Every active segment carries a load number 1..N. That structure is what makes operational reports possible: per-load mileage, per-load duration, per-load tonnage (joined in a later product lesson).

· Build it, step by step

1 Phase per ping — derive_phase_per_ping()

The enriched ping already carries the flags we need (at_depot, in_landfill, route_id, speed_mph). Phase derivation is a 4-rule priority cascade:

  1. at_depot = true → depot (overrides everything — start/end-of-shift dwell)
  2. in_landfill = true → landfill (truck is tipping)
  3. route_id not null AND speed_mph ≤ slow_mph_max → collection
  4. otherwise → windshield (travel time)

Implementation is one numpy np.full(len, "windshield") + 3 boolean masks. No loops, no Python-level row iteration. Millions of pings classify in milliseconds.
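The cascade above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name derive_phase_sketch and the 5.0 mph default for slow_mph_max are assumptions made for the example. The masks are applied from lowest to highest priority, so a later assignment overrides an earlier one.

```python
import numpy as np
import pandas as pd


def derive_phase_sketch(pings: pd.DataFrame, slow_mph_max: float = 5.0) -> np.ndarray:
    """Classify each ping with a 4-rule priority cascade (illustrative only)."""
    # Rule 4 (fallback): everything starts as windshield.
    phase = np.full(len(pings), "windshield", dtype=object)

    at_depot = pings["at_depot"].to_numpy(dtype=bool)
    in_landfill = pings["in_landfill"].to_numpy(dtype=bool)
    on_route_slow = (
        pings["route_id"].notna() & (pings["speed_mph"] <= slow_mph_max)
    ).to_numpy(dtype=bool)

    # Lowest priority first; each later mask overwrites the earlier ones.
    phase[on_route_slow] = "collection"  # rule 3
    phase[in_landfill] = "landfill"      # rule 2
    phase[at_depot] = "depot"            # rule 1 wins over everything
    return phase


# Hypothetical pings: the last two rows match more than one rule.
pings = pd.DataFrame(
    {
        "at_depot": [True, False, False, False, True],
        "in_landfill": [False, False, False, True, True],
        "route_id": [None, None, "A", "A", None],
        "speed_mph": [0.0, 30.0, 3.0, 2.0, 0.0],
    }
)
phases = derive_phase_sketch(pings)
print(list(phases))
```

Row 4 (slow, on route A, inside the landfill) resolves to landfill, and row 5 (at the depot and inside a landfill polygon) resolves to depot, which is the override order the cascade describes.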

2 Gap-and-island → segments

Adjacent rows sharing the same phase (and route_id, for collection phases) collapse into one segment. Classic SQL pattern done in pandas:

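import numpy as np
# route_id joins the key only on collection rows
effective_key = np.where(is_collection, phase + "|" + route_id, phase)
# row 0 opens segment 0; every later key change opens a new segment
boundaries = np.concatenate(([False], effective_key[1:] != effective_key[:-1]))
segment_id = boundaries.cumsum()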

The route_id is part of the key only for collection segments. A truck driving down route A then B then back to A produces three collection segments, not one — we want per-route attribution. Windshield phases ignore route_id (the truck is just moving; the route polygons it passes through aren’t meaningful).
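To see the A, B, A case concretely, here is a small pandas sketch of the grouping. The column names and the sample rows are made up for illustration, and it uses Series.shift to find boundaries, which is one common way to write the pattern rather than necessarily what segments.py does.

```python
import pandas as pd

# Hypothetical ping stream: route A, then B, then back to A, then travel.
df = pd.DataFrame(
    {
        "phase": ["windshield", "collection", "collection", "collection",
                  "collection", "windshield", "windshield"],
        "route_id": [None, "A", "A", "B", "A", "A", None],
    }
)

is_collection = df["phase"].eq("collection")

# Keep the bare phase on non-collection rows; append route_id on collection rows.
df["key"] = df["phase"].where(
    ~is_collection, df["phase"] + "|" + df["route_id"].fillna("")
)

# A row starts a new island when its key differs from the previous row's key.
# The first row compares against NaN, so it always starts an island.
boundaries = df["key"].ne(df["key"].shift())
df["segment_id"] = boundaries.cumsum() - 1

segments = df.groupby("segment_id").agg(
    key=("key", "first"),
    n_pings=("key", "size"),
)
print(segments)
```

The two windshield pings at the end collapse into one segment even though one of them still carries route_id "A", because route_id is left out of the key for non-collection rows.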

3 Mileage — haversine, cumulative, optional inflation

For each segment, sum the great-circle distance between consecutive pings:

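# haversine(lat1, lon1, lat2, lon2) returns great-circle miles
# lat, lon: the segment's ping coordinates as arrays, in time order
miles = haversine(lat[:-1], lon[:-1], lat[1:], lon[1:]).sum()
# opt-in inflation; the 0.0 default leaves the straight-line figure untouched
miles *= (1 + config.mileage_inflation_pct)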
  • Haversine — accurate to ~0.5% at this scale. No reprojection cost; numpy-vectorized over all ping-pairs in a segment at once.
  • Cumulative — sum of pairwise distances reflects what the truck drove; displacement (first-to-last) would zero out for any loop.
  • Inflation knob — straight-line GPS mileage systematically undercounts real road-network distance by 5–15% because trucks follow streets (which curve) while the line cuts corners. mileage_inflation_pct defaults to 0.0 (honest straight-line). Set to 0.08 for an 8% road-network approximation. Never silently inflate; let the operator opt in.
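The three points above can be sketched in a few lines of numpy. This is an illustration under stated assumptions: the function names, the mean Earth radius constant, and the sample coordinates are chosen for the example and are not taken from the module.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumed constant for this sketch


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; accepts scalars or numpy arrays (degrees)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # Clip guards against tiny floating-point overshoot before the square root.
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def segment_miles(lat, lon, mileage_inflation_pct=0.0):
    """Cumulative ping-to-ping mileage for one segment, with opt-in inflation."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    if lat.size < 2:
        return 0.0  # a single ping has no distance
    legs = haversine_miles(lat[:-1], lon[:-1], lat[1:], lon[1:])
    return float(legs.sum() * (1 + mileage_inflation_pct))


# Hypothetical out-and-back segment: 0.01 degrees north, then back to the start.
lat = [40.0, 40.01, 40.0]
lon = [-75.0, -75.0, -75.0]

cumulative = segment_miles(lat, lon)                # about 1.38 miles driven
inflated = segment_miles(lat, lon, 0.08)            # same path with 8% inflation
displacement = float(haversine_miles(lat[0], lon[0], lat[-1], lon[-1]))  # 0.0
print(cumulative, inflated, displacement)
```

The loop drives roughly 1.38 miles while the first-to-last displacement is zero, which is why the timeline sums pairwise distances instead.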

Higher accuracy would require a road network such as OpenStreetMap (OSM). That’s a future release; for now the inflation knob calibrates against whatever ground-truth dataset the operator can produce.

4 Load numbers — the workday backbone

Walk the timeline chronologically. Bookend depot segments get load_number = NaN. The first active segment after depot_departure starts load 1. After every dump, look ahead: if any later segment is a collection or dump, the next segment starts a new load. If only the drive back to depot_arrival remains, the truck is going home, and that windshield stays in the current load.

DEPOT_DEPARTURE     load=NaN
windshield          load=1   (going to first route)
collection (A)      load=1
dump                load=1   (load 1 complete)
windshield          load=2   (more work ahead, going to next route)
collection (B)      load=2
dump                load=2   (load 2 complete)
windshield          load=2   (going home — no future work)
DEPOT_ARRIVAL       load=NaN
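The walk can be sketched as a single pass with a look-ahead after each dump. The lowercase segment kind strings below mirror the example timeline and are assumptions for this illustration; the real module may name or store them differently.

```python
import math

WORK_KINDS = {"collection", "dump"}
BOOKENDS = {"depot_departure", "depot_arrival"}


def assign_load_numbers(kinds):
    """Return one load number per segment; depot bookends get NaN."""
    loads = []
    current = 0
    start_new = True  # the first active segment after departure opens load 1
    for i, kind in enumerate(kinds):
        if kind in BOOKENDS:
            loads.append(math.nan)
            continue
        if start_new:
            current += 1
            start_new = False
        loads.append(current)
        if kind == "dump":
            # Open another load only if collection or dump work still lies ahead.
            start_new = any(k in WORK_KINDS for k in kinds[i + 1:])
    return loads


day = [
    "depot_departure", "windshield", "collection", "dump",
    "windshield", "collection", "dump", "windshield", "depot_arrival",
]
loads = assign_load_numbers(day)
print(loads)
```

On the example day this reproduces the listing above, including the final windshield staying in load 2. The look-ahead rescans the remaining segments after each dump, which is fine for a day with a few dozen segments.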

5 Choreography violations — the operational payoff

The timeline is descriptive. The violations are what make it useful:

  • loaded_depot_return (severity high) — day ended on a collection segment without a closing dump. Truck went home loaded.
  • loaded_depot_departure (severity high) — day started with a dump before any collection. Truck started loaded from yesterday.
  • overrun_loads (severity medium) — more than 3 loads. Review for shift overrun.
  • mid_load_route_switch (severity low) — a single load touched multiple route_ids. Could be HTC parcels off the main route, or could indicate routing inefficiency.

This is the analytical payoff of the segment abstraction. A pile of GPS pings is just dots; a load-organized timeline is a workday; flagged violations are a review queue.
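As a rough sketch of how the four checks could be read off a timeline, assume a DataFrame with hypothetical columns kind, load_number, and route_id. The function name, the column names, and the return shape are all illustrative assumptions, not the actual flag_choreography_violations() interface.

```python
import math

import pandas as pd

MAX_LOADS = 3  # threshold taken from the overrun_loads rule above


def flag_violations_sketch(timeline: pd.DataFrame) -> list[tuple[str, str]]:
    """Return (violation, severity) pairs for one vehicle-day timeline."""
    flags = []
    work = timeline[timeline["kind"].isin(["collection", "dump"])]
    if not work.empty:
        # Day ended on collection with no closing dump.
        if work["kind"].iloc[-1] == "collection":
            flags.append(("loaded_depot_return", "high"))
        # Day opened with a dump before any collection.
        if work["kind"].iloc[0] == "dump":
            flags.append(("loaded_depot_departure", "high"))
    # max() skips NaN bookends; an all-NaN column compares False.
    if timeline["load_number"].max() > MAX_LOADS:
        flags.append(("overrun_loads", "medium"))
    routes_per_load = (
        timeline[timeline["kind"] == "collection"]
        .groupby("load_number")["route_id"]
        .nunique()
    )
    if (routes_per_load > 1).any():
        flags.append(("mid_load_route_switch", "low"))
    return flags


# Hypothetical day: load 1 touches routes A and B; load 2 is never dumped.
timeline = pd.DataFrame(
    {
        "kind": ["depot_departure", "windshield", "collection", "collection",
                 "dump", "windshield", "collection", "depot_arrival"],
        "load_number": [math.nan, 1, 1, 1, 1, 2, 2, math.nan],
        "route_id": [None, None, "A", "B", None, None, "A", None],
    }
)
flags = flag_violations_sketch(timeline)
print(flags)
```

Each check is a plain filter or aggregation over the segment rows, so the review queue falls out of the timeline without touching the raw pings again.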

6 Driver + output layout — build_all_segments()

build_all_segments(
    enriched_path="enriched/2026-01-18/815001.parquet",
    out_root="segments/",
)
# writes:
#   segments/timeline/2026-01-18/815001.parquet
#   segments/violations/2026-01-18/815001.parquet

One enriched vehicle-day in, two parquets out. Loop over the L7 master index and you get the whole fleet’s timeline in one batch.
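The output layout follows mechanically from the input path. The helper below is a hypothetical illustration of that mapping using pathlib; how build_all_segments() derives its paths internally is not shown in this lesson.

```python
from pathlib import Path


def output_paths(enriched_path: str, out_root: str) -> tuple[Path, Path]:
    """Map one enriched vehicle-day file to its timeline and violations files."""
    src = Path(enriched_path)
    day = src.parent.name   # e.g. "2026-01-18"
    filename = src.name     # e.g. "815001.parquet"
    root = Path(out_root)
    return root / "timeline" / day / filename, root / "violations" / day / filename


timeline_path, violations_path = output_paths(
    "enriched/2026-01-18/815001.parquet", "segments/"
)
print(timeline_path.as_posix())
print(violations_path.as_posix())
```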

7 Tests + commit

26 tests covering the haversine math, phase derivation cascade, gap-and-island grouping, load numbering edge cases, mileage inflation, and every violation type. All synthetic data — no DuckDB, no network. The whole suite runs in under 2 seconds.

pip install -e ".[dev,geotab,postgres]"
ruff check .
pytest -q
git add . && git commit -m "Lesson 9: segments — load-organized timeline + violations"
git push origin main

· Package anatomy after this lesson

Where everything lives now. [new] marks files added this lesson; [updated] marks files changed.

opentrash/
├── pyproject.toml · mkdocs.yml · .github/workflows/{ci,docs}.yml
├── docs/{index,architecture}.md + CNAME
├── opentrash/
│   ├── adapters/gps/        # base, geotab, postgres
│   ├── core/                # crs, duckdb_session, vehicle_ids
│   ├── prep/                # sites, parcels, static_layers, parcels_wkb
│   ├── cache/               # gps_cache, gps_indexes, master_index
│   ├── tonnage/             # registry, cleaners, keys, upsert, pipeline
│   ├── engine/
│   │   ├── config.py        # [updated] +mileage_inflation_pct
│   │   ├── enrichment.py
│   │   └── segments.py      # [new] timeline + violations
│   └── (patterns, routeview scaffolded)
└── tests/
    ├── (existing 15 test files)
    └── test_engine_segments.py   # [new] 26 tests

· What you built

  • A per-ping phase classifier — depot / landfill / collection / windshield, derived from the L8 enrichments via a 4-rule priority cascade.
  • A load-organized timeline — one row per segment, chronological, with cumulative haversine mileage and load numbers that respect the “going home” windshield case.
  • Choreography-violation flags — four operational red flags surfaced from the timeline. The descriptive becomes prescriptive.
  • An opt-in mileage inflation knob — default 0.0; honest reporting; calibrate against ground truth when needed.
  • Zero spatial joins. The whole module is a pure aggregation on the engine’s enriched-ping output. The L8 architectural rule paying off for the first time.

With the engine and segments built, the package now has both ingredients for downstream products. Patterns (next) reads enriched pings + segments to surface per-parcel service signatures. RouteView reads them to render a single-route, single-day map. Both are pure consumers; neither does spatial work.

· Companion resources

Optional, for going deeper.

  • Haversine formula: Wikipedia — great-circle distance, taught for centuries to navigators.
  • Gap-and-island problem: canonical pattern — SQL by origin, pandas by adaptation. Recognize it once and you’ll see it everywhere event-stream data appears.
  • The ant colony as metaphor: a real research thread on ACO algorithms — same routine-following behavior at much larger scale. Different problem, same intuition.

· Next lesson

Lesson 10 — Pattern detection: per-parcel service signatures (weekly1, weekly2, biweekly) derived from a long history of enriched pings. Zero spatial joins; pure DuckDB CTAS aggregation. The first true product on top of the engine.