Lesson 10 of 12 · The first product

How regular are the ants? A year of pings, condensed to one row per parcel.

You have months or years of enriched pings on disk. Each one carries a parcel APN, a vehicle, a timestamp. The question this lesson answers: for every parcel, who shows up regularly, on what day, at what hour, with what cadence — weekly, every other week, irregularly? The output is one wide table, one row per parcel, with weekly1 / weekly2 / biweekly slots. Plot it as a gradient by DOW and hour and the route choreography reveals itself.

Time: ~100 min You'll touch: patterns/{config,window,detector,runner,validator} Result: per-parcel service signatures across an analysis window

· Objective

Five modules. One product. Designed for route-agnostic detection with chunked DuckDB execution.

  • patterns/config.pyPatternConfig, all 15 detection knobs in one frozen dataclass.
  • patterns/window.pycompute_window("past_year"), "past_quarter", etc. — tiny helper that turns named periods into explicit dates.
  • patterns/detector.pydetect_patterns_chunk(), the DuckDB CTAS pipeline. Pure function; one chunk at a time.
  • patterns/runner.pyrun_patterns(), the chunked orchestrator with idempotent writes + skip-if-fresh.
  • patterns/validator.py — post-run diagnostics: how many parcels got a pattern? Are the slots disjoint? Is the biweekly score distribution healthy?
The rule that drives the algorithm: detection groups on (APN, vehicle_id) pairs only. Route columns (route_ref, route_org, route_rec) flow through from parcels_wkb as sidecar metadata, but never enter the detection logic. Why? Because the top-3 ranked vehicles per parcel naturally map to the 3 commodities (refuse, organics, recycling) — one vehicle is dedicated to one commodity stream. The algorithm doesn’t need to be told which vehicle does what; the ranking surfaces it.

· Build it, step by step

1 The window helper — patterns/window.py

Pattern detection runs over an explicit time window. The detector takes start_date and end_date dates; this helper turns operational shorthand into the matching pair:

compute_window("past_year")    # -> (today - 365 days, today)
compute_window("past_quarter") # -> (today - 90 days, today)
compute_window("custom", start_date="2025-01-01",
                          end_date="2025-12-31")

Decoupled from the detector so the time-window decision stays visible at the call site, and so a test can dial in any reproducible window via the anchor parameter.

2 Config — patterns/config.py

Fifteen knobs in one frozen dataclass:

  • Stop binningstop_bin_minutes=15 for dwell de-noising. When a truck passes the same parcel twice in a short window (e.g. serving both sides of a street), the pings cluster within 15 min; the bin collapses them to one visit. Removes pass-by noise without losing real visits.
  • Weekly acceptanceweekly_min_regularity=0.25, weekly_min_weeks=6, weekly_min_weeks_ratio=0.25. Effective threshold = max(floor, ratio * total_weeks) — flat floor protects short windows, ratio scales with long windows.
  • Biweekly — gap target 14 days ± 4, phase share ≥ 0.72, composite score ≥ 0.58, max weekly regularity 0.80 (above 80% it’s weekly, not biweekly).
  • Composite biweekly scoring — weights on phase share, percent of gaps near 14 days, and "regularity centeredness" (rewards values near 0.5, which is what a clean biweekly looks like in weekly-regularity terms). Weights sum to 1.0 by convention.

Defaults match the values that produced clean patterns on a year of real residential-collection data. Override per-field for a different agency’s cadence.

3 Detector — the DuckDB CTAS pipeline

The hardest-working file in the package. Read it top to bottom; it’s a tour of set-based analytics done right:

enriched pings + parcels_wkb (sidecar)
  WHERE apn IS NOT NULL AND dt_local BETWEEN start AND end
    [AND apn IN (chunk's APNs)]
  -> refined         (the analysis universe — route-agnostic from here)
  -> stops           (15-min dwell debouncing)
  -> daily_visits    (one row per parcel/vehicle/day)
  -> dow_stats       (weeks_present / total_weeks per parcel/vehicle/DOW)
  -> best_dow        (top DOW per parcel/vehicle)
  -> best_days       (the actual daily rows on that best DOW)
  Branch A — biweekly:
    -> biweekly_gaps        (gap-day analysis between consecutive visits)
    -> biweekly_phase       (even/odd week share around an anchor)
    -> biweekly_scored      (composite score, threshold filter)
    -> biweekly_best        (top parcel/vehicle pair per APN)
  Branch B — weekly ranking (excludes biweekly winners):
    -> weekly_pool          (best_dow rows minus biweekly winners)
    -> weekly_ranked        (row_number per APN)
    -> weekly1, weekly2     (rank 1 and 2)
  Final:
    -> patterns             (parcels_wkb LEFT JOIN weekly1 weekly2 biweekly)

One DuckDB connection, one chunk, one wide DataFrame out. No spatial extension needed — this is pure timeline analysis.

4 The runner — chunked, idempotent, resumable

The detector is a pure function. The runner is the orchestrator. It iterates chunks of parcels (one per route_id in parcels_wkb, for organizational batching only), calls the detector once per chunk, writes the result to a partitioned parquet:

patterns/
  <window_label>/                       e.g. "2025-03-01_to_2026-02-28"
    by_route/
      11722.parquet                       one chunk per route_id
      11632.parquet
      ...
    _manifest.parquet                     audit trail: ran when, with what config

Three properties that matter for real ops:

  • Idempotent — re-running a chunk overwrites cleanly. Same input, same output.
  • Resumable — the runner skips chunks whose output is newer than the inputs. Crash mid-run? Re-run; it picks up.
  • Parallel-safe at the caller — chunks are independent. A workflow runner can dispatch them across processes without locking; each process opens its own DuckDB connection.

5 Validator — post-run diagnostics

After a run, the validator answers: how many parcels got a pattern? Are weekly1 and weekly2 picking the same vehicle (a bug; they should be disjoint)? What does the biweekly score distribution look like? Returns a single ValidationReport dataclass — print it, or pipe it into a small markdown report.

from opentrash.patterns.validator import validate_patterns
from opentrash.patterns.runner import load_patterns

combined = load_patterns("patterns/", "2025-03-01_to_2026-02-28")
report = validate_patterns(combined)
print(report)

6 Tests + commit

22 tests covering the window helper, config invariants, the detector on a known weekly pattern (Tuesdays at 8am for 52 weeks → weekly1_regularity ≥ 0.95), a known biweekly pattern, the "top 3 commodities" scenario (refuse + organics + recycling on one parcel → correct vehicles land in weekly1/weekly2/biweekly), sparse-input behavior (no pattern detected), the floor/ratio threshold math, runner chunking + skip-if-fresh + refresh, and validator basics. All run on synthetic data — no real GPS, no network.

pip install -e ".[dev,geotab,postgres]"
ruff check .
pytest -q
git add . && git commit -m "Lesson 10: pattern detection — route-agnostic chunked DuckDB"
git push origin main

· Package anatomy after this lesson

Where everything lives now. new marks files added this lesson.

opentrash/ ├── pyproject.toml · mkdocs.yml · .github/workflows/{ci,docs}.yml ├── docs/{index,architecture}.md + CNAME ├── opentrash/ │ ├── adapters/gps/ # base, geotab, postgres │ ├── core/ # crs, duckdb_session, vehicle_ids │ ├── prep/ # sites, parcels, static_layers, parcels_wkb │ ├── cache/ # gps_cache, gps_indexes, master_index │ ├── tonnage/ # registry, cleaners, keys, upsert, pipeline │ ├── engine/ # config, enrichment, segments │ ├── patterns/ │ │ ├── config.py # [new] 15 knobs │ │ ├── window.py # [new] compute_window helper │ │ ├── detector.py # [new] DuckDB CTAS pipeline │ │ ├── runner.py # [new] chunked orchestrator │ │ └── validator.py # [new] post-run diagnostics │ └── (routeview scaffolded) └── tests/ ├── (existing 16 test files) └── test_patterns.py # [new] 22 tests

· What you built

  • A route-agnostic detector that groups on (parcel, vehicle) only — the top-3 vehicles per parcel emerge as the 3 commodities naturally.
  • A dwell-debounced visit counter — the 15-min bin removes pass-by noise without losing real visits.
  • Floor + ratio thresholds — weekly_min_weeks = max(6, 25% of total_weeks) — scales gracefully across short and long analysis windows.
  • A chunked runner that processes the city in route-sized batches, writes idempotent parquets, skips up-to-date chunks, and is parallel-safe at the caller layer.
  • Zero spatial joins — the detection algorithm is pure (parcel, vehicle) timeline analysis on top of the engine’s output. The architectural rule paying off again.
One product on top of the engine, done. RouteView is next — the interactive map that takes one route, one day, one vehicle and renders the choreography on top of static GIS layers + enriched pings + segments. The macro view (patterns) and the micro view (RouteView) together: how the routes are actually driven.

· Companion resources

Optional, for going deeper.

  • DuckDB CTAS pipelines: official docs — the "build a derived table in one shot" idiom we lean on throughout.
  • Idempotency in data pipelines: the same pattern as dbt incremental models at production scale. Same shape, smaller library.
  • Composite scoring: the biweekly score blends three signals (phase, gap percentage, regularity centeredness) into one threshold. The general technique — weighted sum of normalized signals — is everywhere from ML feature engineering to credit scoring.

· Next lesson

Lesson 11 — RouteView: one route + one day + one vehicle, rendered as an interactive MapLibre HTML. Reads enriched pings, segments, and patterns; renders parcels-served vs missed, the truck’s trail, the load cycles. The micro counterpart to patterns’ macro view.