How regular are the ants? A year of pings, condensed to one row per parcel.
You have months or years of enriched pings on disk. Each one carries a parcel APN, a vehicle, a timestamp. The question this lesson answers: for every parcel, who shows up regularly, on what day, at what hour, with what cadence — weekly, every other week, irregularly? The output is one wide table, one row per parcel, with weekly1 / weekly2 / biweekly slots. Plot it as a gradient by DOW and hour and the route choreography reveals itself.
· Objective
Five modules. One product. Designed for route-agnostic detection with chunked DuckDB execution.
patterns/config.py—PatternConfig, all 15 detection knobs in one frozen dataclass.patterns/window.py—compute_window("past_year"),"past_quarter", etc. — tiny helper that turns named periods into explicit dates.patterns/detector.py—detect_patterns_chunk(), the DuckDB CTAS pipeline. Pure function; one chunk at a time.patterns/runner.py—run_patterns(), the chunked orchestrator with idempotent writes + skip-if-fresh.patterns/validator.py— post-run diagnostics: how many parcels got a pattern? Are the slots disjoint? Is the biweekly score distribution healthy?
(APN, vehicle_id) pairs only. Route columns
(route_ref, route_org, route_rec)
flow through from parcels_wkb as sidecar metadata, but never enter the
detection logic. Why? Because the top-3 ranked vehicles per parcel
naturally map to the 3 commodities (refuse, organics, recycling) —
one vehicle is dedicated to one commodity stream. The algorithm
doesn’t need to be told which vehicle does what; the ranking
surfaces it.
· Build it, step by step
1 The window helper — patterns/window.py
Pattern detection runs over an explicit time window. The detector takes start_date and end_date dates; this helper turns operational shorthand into the matching pair:
compute_window("past_year") # -> (today - 365 days, today)
compute_window("past_quarter") # -> (today - 90 days, today)
compute_window("custom", start_date="2025-01-01",
end_date="2025-12-31")
Decoupled from the detector so the time-window decision stays visible at the call site, and so a test can dial in any reproducible window via the anchor parameter.
2 Config — patterns/config.py
Fifteen knobs in one frozen dataclass:
- Stop binning —
stop_bin_minutes=15for dwell de-noising. When a truck passes the same parcel twice in a short window (e.g. serving both sides of a street), the pings cluster within 15 min; the bin collapses them to one visit. Removes pass-by noise without losing real visits. - Weekly acceptance —
weekly_min_regularity=0.25,weekly_min_weeks=6,weekly_min_weeks_ratio=0.25. Effective threshold =max(floor, ratio * total_weeks)— flat floor protects short windows, ratio scales with long windows. - Biweekly — gap target 14 days ± 4, phase share ≥ 0.72, composite score ≥ 0.58, max weekly regularity 0.80 (above 80% it’s weekly, not biweekly).
- Composite biweekly scoring — weights on phase share, percent of gaps near 14 days, and "regularity centeredness" (rewards values near 0.5, which is what a clean biweekly looks like in weekly-regularity terms). Weights sum to 1.0 by convention.
Defaults match the values that produced clean patterns on a year of real residential-collection data. Override per-field for a different agency’s cadence.
3 Detector — the DuckDB CTAS pipeline
The hardest-working file in the package. Read it top to bottom; it’s a tour of set-based analytics done right:
enriched pings + parcels_wkb (sidecar)
WHERE apn IS NOT NULL AND dt_local BETWEEN start AND end
[AND apn IN (chunk's APNs)]
-> refined (the analysis universe — route-agnostic from here)
-> stops (15-min dwell debouncing)
-> daily_visits (one row per parcel/vehicle/day)
-> dow_stats (weeks_present / total_weeks per parcel/vehicle/DOW)
-> best_dow (top DOW per parcel/vehicle)
-> best_days (the actual daily rows on that best DOW)
Branch A — biweekly:
-> biweekly_gaps (gap-day analysis between consecutive visits)
-> biweekly_phase (even/odd week share around an anchor)
-> biweekly_scored (composite score, threshold filter)
-> biweekly_best (top parcel/vehicle pair per APN)
Branch B — weekly ranking (excludes biweekly winners):
-> weekly_pool (best_dow rows minus biweekly winners)
-> weekly_ranked (row_number per APN)
-> weekly1, weekly2 (rank 1 and 2)
Final:
-> patterns (parcels_wkb LEFT JOIN weekly1 weekly2 biweekly)
One DuckDB connection, one chunk, one wide DataFrame out. No spatial extension needed — this is pure timeline analysis.
4 The runner — chunked, idempotent, resumable
The detector is a pure function. The runner is the orchestrator. It iterates chunks of parcels (one per route_id in parcels_wkb, for organizational batching only), calls the detector once per chunk, writes the result to a partitioned parquet:
patterns/
<window_label>/ e.g. "2025-03-01_to_2026-02-28"
by_route/
11722.parquet one chunk per route_id
11632.parquet
...
_manifest.parquet audit trail: ran when, with what config
Three properties that matter for real ops:
- Idempotent — re-running a chunk overwrites cleanly. Same input, same output.
- Resumable — the runner skips chunks whose output is newer than the inputs. Crash mid-run? Re-run; it picks up.
- Parallel-safe at the caller — chunks are independent. A workflow runner can dispatch them across processes without locking; each process opens its own DuckDB connection.
5 Validator — post-run diagnostics
After a run, the validator answers: how many parcels got a pattern? Are weekly1 and weekly2 picking the same vehicle (a bug; they should be disjoint)? What does the biweekly score distribution look like? Returns a single ValidationReport dataclass — print it, or pipe it into a small markdown report.
from opentrash.patterns.validator import validate_patterns
from opentrash.patterns.runner import load_patterns
combined = load_patterns("patterns/", "2025-03-01_to_2026-02-28")
report = validate_patterns(combined)
print(report)
6 Tests + commit
22 tests covering the window helper, config invariants, the detector on a known weekly pattern (Tuesdays at 8am for 52 weeks → weekly1_regularity ≥ 0.95), a known biweekly pattern, the "top 3 commodities" scenario (refuse + organics + recycling on one parcel → correct vehicles land in weekly1/weekly2/biweekly), sparse-input behavior (no pattern detected), the floor/ratio threshold math, runner chunking + skip-if-fresh + refresh, and validator basics. All run on synthetic data — no real GPS, no network.
pip install -e ".[dev,geotab,postgres]"
ruff check .
pytest -q
git add . && git commit -m "Lesson 10: pattern detection — route-agnostic chunked DuckDB"
git push origin main
· Package anatomy after this lesson
Where everything lives now. new marks files added this lesson.
· What you built
- A route-agnostic detector that groups on (parcel, vehicle) only — the top-3 vehicles per parcel emerge as the 3 commodities naturally.
- A dwell-debounced visit counter — the 15-min bin removes pass-by noise without losing real visits.
- Floor + ratio thresholds — weekly_min_weeks = max(6, 25% of total_weeks) — scales gracefully across short and long analysis windows.
- A chunked runner that processes the city in route-sized batches, writes idempotent parquets, skips up-to-date chunks, and is parallel-safe at the caller layer.
- Zero spatial joins — the detection algorithm is pure (parcel, vehicle) timeline analysis on top of the engine’s output. The architectural rule paying off again.
· Companion resources
Optional, for going deeper.
- DuckDB CTAS pipelines: official docs — the "build a derived table in one shot" idiom we lean on throughout.
- Idempotency in data pipelines: the same pattern as dbt incremental models at production scale. Same shape, smaller library.
- Composite scoring: the biweekly score blends three signals (phase, gap percentage, regularity centeredness) into one threshold. The general technique — weighted sum of normalized signals — is everywhere from ML feature engineering to credit scoring.
· Next lesson
Lesson 11 — RouteView: one route + one day + one vehicle, rendered as an interactive MapLibre HTML. Reads enriched pings, segments, and patterns; renders parcels-served vs missed, the truck’s trail, the load cycles. The micro counterpart to patterns’ macro view.