Lesson 5 of 12 · The legacy data path

Excel files in, partitioned parquet out — idempotent every time.

Tonnage data — the records of what each truck tipped at which landfill on which day — lives in a legacy system that exports messy Excel files. This lesson builds the ingestion pipeline: register input files, clean them, hash each row for dedup, upsert into year-partitioned parquet. Running it twice on the same file produces zero new rows.

Time: ~90 min You'll touch: tonnage/* · core/vehicle_ids Result: tonnage normalized + docs site live at opentrash.app

· Objective

Land the tonnage pipeline and one supporting primitive, plus turn on the docs deploy so opentrash.app finally reflects the v2 architecture.

  • tonnage/registry.py — scan a folder of input Excel files, parse source (RAD vs ARTS) and date from filename, pick the latest by mtime.
  • tonnage/cleaners.py — per-source header resolution, type coercion, vehicle-ID normalization, material code normalization.
  • tonnage/keys.py — deterministic row hashes for dedup. Same record produces the same hash on every run.
  • tonnage/upsert.py — idempotent merge into year-partitioned parquet (one file per year per source).
  • tonnage/pipeline.py — the top-level run_ingest() entry point that wires it all together.
  • core/vehicle_ids.py — parse fleet vehicle IDs (e.g. 815001) into prefix + number; classify AUTO vs MANUAL vs OTHER.
  • Docs deploymkdocs.yml + .github/workflows/docs.yml + a docs/CNAME file. After this lesson, every push to main rebuilds and deploys to opentrash.app.
The idempotency rule. Re-running the pipeline on the same input must be safe. Legacy data sources are messy — files arrive late, get re-sent, change names. The pipeline has to treat every run as “catch up to the current state on disk.” That’s what hash-based dedup buys us: identical rows produce identical hashes; the upsert step drops duplicates by key. Run it ten times in a row, get the same parquet output.

· Build it, step by step

1 Vehicle IDs — core/vehicle_ids.py

The first foundational primitive in core/ we haven’t built yet. Vehicle IDs in this fleet follow a pattern: three-digit prefix + sequence number (e.g. 815001, prefix 815, number 001). The prefix classifies the vehicle:

  • 815, 832 — AUTO fleet (city-owned automated collection trucks)
  • Anything else — MANUAL or OTHER (contractor, hand-collection)

The module exposes a frozen VehicleId dataclass plus parse_vehicle_id() — tolerant of Excel-style noise (commas, trailing .0 from floats, whitespace).

v = parse_vehicle_id("815001")
v.prefix  # "815"
v.number  # "001"
v.kind    # "AUTO"

2 File registry — tonnage/registry.py

Tonnage files arrive daily-ish in a folder. Two sources, two naming conventions:

  • RAD exports — the “ride-along data” tonnage records
  • ARTS exports — the “automated routing tonnage system” records

The registry: scan the folder, parse source + date from each filename, attach the file’s mtime for tie-breaking. pick_latest_file(registry, source) returns the path of the most recent file for that source. Same input folder, same output — deterministic.

3 Cleaners — tonnage/cleaners.py

Each source has its own header conventions and quirks. The cleaners handle:

  • Header resolution — given a list of acceptable header names, find which one this file uses. Legacy systems rename columns occasionally; the cleaner is tolerant.
  • Type coercion — dates, floats, fleet IDs — all the “Excel exported numbers as strings or strings as numbers” mess.
  • Material code normalization — the two systems use different vocabularies for the same materials (RAD: "RFS" for refuse; ARTS: "Refuse"). matl_norm_rad() and matl_norm_arts() map them to a canonical form.
  • Top-level entry points: clean_arts_file(path) and clean_rad_file(path) — return a clean DataFrame + a metadata dict.

4 Hash keys — tonnage/keys.py

For dedup we need a key that’s deterministic from the row content alone — same row, same hash, every time, in every run. Trickier than it sounds because floats have rounding issues and datetimes have timezone issues.

The pattern: round tons to 3 decimals, floor datetimes to the minute, normalize text (strip + uppercase), then concatenate the key columns and hash. add_keys_arts() and add_keys_rad() attach the resulting row_hash column to each cleaned DataFrame.

Why the floors and rounds? Because 0.1 + 0.2 != 0.3 in floating point, and a datetime read back from parquet might lose microseconds it had in Excel. If the hash isn’t stable across round-trips, dedup falls apart.

5 Idempotent upsert — tonnage/upsert.py

One year, one parquet file per source: tonnage_out/<source>/<year>.parquet. The upsert step:

  1. Read the existing year file (if it exists).
  2. Concatenate new rows.
  3. Drop duplicates by row_hash — first occurrence wins.
  4. Write back to disk.

Running this on the same input twice does nothing on the second run, because every row hashes to a key that’s already in the file.

6 Pipeline — tonnage/pipeline.py

The one entry point most callers use:

from opentrash.tonnage.pipeline import run_ingest

run_ingest(
    data_dir="data/tonnage_in/",
    out_dir="data/tonnage_out/",
)

Internally: scan registry → pick latest of each source → clean → add keys → upsert into the year-partitioned parquet store. The whole job is a few hundred lines but reads top-to-bottom as five clean steps.

7 Docs deploy — mkdocs.yml + .github/workflows/docs.yml

Three small files turn on the live docs site at opentrash.app:

  • mkdocs.yml at repo root — MkDocs Material configuration. Theme is the same charcoal-and-cyan palette as the v1 site. Nav has Home and Architecture for now; more pages added as the package grows.
  • .github/workflows/docs.yml — GitHub Actions workflow. On every push to main: install mkdocs-material, run mkdocs gh-deploy --force, which builds the site to a gh-pages branch.
  • docs/CNAME — one line: opentrash.app. Tells GitHub Pages to serve under the custom domain.

First push after this lesson, the docs site rebuilds with the new architecture page — replaces whatever was there before. Take a look at opentrash.app after CI is green.

8 Tests + push

pip install -e ".[dev]"
ruff check .
pytest -q
git add . && git commit -m "Lesson 5: tonnage pipeline + vehicle IDs + docs deploy"
git push origin main

Two workflows now trigger: ci.yml runs tests, docs.yml rebuilds and deploys the site.

· Package anatomy after this lesson

Where everything lives now. new marks files added this lesson.

opentrash/ ├── pyproject.toml ├── README.md ├── mkdocs.yml # [new] docs site config ├── .github/workflows/ │ ├── ci.yml │ └── docs.yml # [new] docs deploy ├── docs/ │ ├── architecture.md │ ├── index.md # [new] docs home page │ └── CNAME # [new] custom domain ├── opentrash/ │ ├── core/ │ │ ├── crs.py │ │ ├── duckdb_session.py │ │ └── vehicle_ids.py # [new] fleet ID parsing │ ├── prep/ │ │ ├── sites.py │ │ ├── parcels.py │ │ └── static_layers.py │ ├── tonnage/ │ │ ├── registry.py # [new] file inventory │ │ ├── cleaners.py # [new] per-source cleaning │ │ ├── keys.py # [new] row-hash dedup keys │ │ ├── upsert.py # [new] idempotent year-partition merge │ │ └── pipeline.py # [new] top-level ingest entry point │ └── (cache, engine, patterns, routeview scaffolded) └── tests/ ├── test_sites.py ├── test_crs.py ├── test_duckdb_session.py ├── test_parcels.py ├── test_static_layers.py ├── test_vehicle_ids.py # [new] └── test_tonnage.py # [new] full tonnage suite

· What you built

  • A vehicle-ID primitive with prefix/number parsing and fleet classification — used everywhere fleet IDs appear.
  • A file registry that scans an input folder and parses source + date from filenames.
  • Per-source cleaners for RAD and ARTS exports — handle header drift and material-code normalization.
  • Hash-based dedup keys that are stable across round-trips through parquet.
  • An idempotent upsert into year-partitioned parquet — safe to re-run forever.
  • A pipeline entry point that wires it all together.
  • Docs deploy liveopentrash.app now reflects the v2 architecture on every push.
The static and semi-static data layers are complete. Next lesson we pivot to the live data stream: GPS adapters (Geotab + Postgres) behind one Protocol, plus the cache-first layer.

· Companion resources

Optional, for going deeper.

  • Idempotency in data pipelines: the same principle that drives database UPSERT exists at every scale — from one Excel file to a billion-row warehouse load. dbt incremental models are a good production-scale example.
  • Hash-based dedup pitfalls: floats, datetimes, and unicode normalization are the three classic gotchas. The keys module handles them explicitly — worth reading the source.
  • MkDocs Material: official site. The theme is heavily customizable; we’re using ~5% of its features and the result already looks professional.

· Next lesson

Lesson 6 — GPS adapters & cache: the live data path. Two vendors (Geotab API + Postgres stream) behind one Protocol; cache-first reads (one parquet per vehicle per day).