Excel files in, partitioned parquet out — idempotent every time.
Tonnage data — the records of what each truck tipped at which landfill on which day — lives in a legacy system that exports messy Excel files. This lesson builds the ingestion pipeline: register input files, clean them, hash each row for dedup, upsert into year-partitioned parquet. Running it twice on the same file produces zero new rows.
· Objective
Land the tonnage pipeline and one supporting primitive, plus turn
on the docs deploy so opentrash.app finally reflects
the v2 architecture.
tonnage/registry.py— scan a folder of input Excel files, parse source (RAD vs ARTS) and date from filename, pick the latest by mtime.tonnage/cleaners.py— per-source header resolution, type coercion, vehicle-ID normalization, material code normalization.tonnage/keys.py— deterministic row hashes for dedup. Same record produces the same hash on every run.tonnage/upsert.py— idempotent merge into year-partitioned parquet (one file per year per source).tonnage/pipeline.py— the top-levelrun_ingest()entry point that wires it all together.core/vehicle_ids.py— parse fleet vehicle IDs (e.g.815001) into prefix + number; classify AUTO vs MANUAL vs OTHER.- Docs deploy —
mkdocs.yml+.github/workflows/docs.yml+ adocs/CNAMEfile. After this lesson, every push to main rebuilds and deploys toopentrash.app.
· Build it, step by step
1 Vehicle IDs — core/vehicle_ids.py
The first foundational primitive in core/ we haven’t built yet. Vehicle IDs in this fleet follow a pattern: three-digit prefix + sequence number (e.g. 815001, prefix 815, number 001). The prefix classifies the vehicle:
815,832— AUTO fleet (city-owned automated collection trucks)- Anything else — MANUAL or OTHER (contractor, hand-collection)
The module exposes a frozen VehicleId dataclass plus parse_vehicle_id() — tolerant of Excel-style noise (commas, trailing .0 from floats, whitespace).
v = parse_vehicle_id("815001")
v.prefix # "815"
v.number # "001"
v.kind # "AUTO"
2 File registry — tonnage/registry.py
Tonnage files arrive daily-ish in a folder. Two sources, two naming conventions:
- RAD exports — the “ride-along data” tonnage records
- ARTS exports — the “automated routing tonnage system” records
The registry: scan the folder, parse source + date from each filename, attach the file’s mtime for tie-breaking. pick_latest_file(registry, source) returns the path of the most recent file for that source. Same input folder, same output — deterministic.
3 Cleaners — tonnage/cleaners.py
Each source has its own header conventions and quirks. The cleaners handle:
- Header resolution — given a list of acceptable header names, find which one this file uses. Legacy systems rename columns occasionally; the cleaner is tolerant.
- Type coercion — dates, floats, fleet IDs — all the “Excel exported numbers as strings or strings as numbers” mess.
- Material code normalization — the two systems use different vocabularies for the same materials (RAD:
"RFS"for refuse; ARTS:"Refuse").matl_norm_rad()andmatl_norm_arts()map them to a canonical form. - Top-level entry points:
clean_arts_file(path)andclean_rad_file(path)— return a clean DataFrame + a metadata dict.
4 Hash keys — tonnage/keys.py
For dedup we need a key that’s deterministic from the row content alone — same row, same hash, every time, in every run. Trickier than it sounds because floats have rounding issues and datetimes have timezone issues.
The pattern: round tons to 3 decimals, floor datetimes to the minute, normalize text (strip + uppercase), then concatenate the key columns and hash. add_keys_arts() and add_keys_rad() attach the resulting row_hash column to each cleaned DataFrame.
Why the floors and rounds? Because 0.1 + 0.2 != 0.3 in floating point, and a datetime read back from parquet might lose microseconds it had in Excel. If the hash isn’t stable across round-trips, dedup falls apart.
5 Idempotent upsert — tonnage/upsert.py
One year, one parquet file per source: tonnage_out/<source>/<year>.parquet. The upsert step:
- Read the existing year file (if it exists).
- Concatenate new rows.
- Drop duplicates by
row_hash— first occurrence wins. - Write back to disk.
Running this on the same input twice does nothing on the second run, because every row hashes to a key that’s already in the file.
6 Pipeline — tonnage/pipeline.py
The one entry point most callers use:
from opentrash.tonnage.pipeline import run_ingest
run_ingest(
data_dir="data/tonnage_in/",
out_dir="data/tonnage_out/",
)
Internally: scan registry → pick latest of each source → clean → add keys → upsert into the year-partitioned parquet store. The whole job is a few hundred lines but reads top-to-bottom as five clean steps.
7 Docs deploy — mkdocs.yml + .github/workflows/docs.yml
Three small files turn on the live docs site at opentrash.app:
mkdocs.ymlat repo root — MkDocs Material configuration. Theme is the same charcoal-and-cyan palette as the v1 site. Nav has Home and Architecture for now; more pages added as the package grows..github/workflows/docs.yml— GitHub Actions workflow. On every push to main: installmkdocs-material, runmkdocs gh-deploy --force, which builds the site to agh-pagesbranch.docs/CNAME— one line:opentrash.app. Tells GitHub Pages to serve under the custom domain.
First push after this lesson, the docs site rebuilds with the new architecture page — replaces whatever was there before. Take a look at opentrash.app after CI is green.
8 Tests + push
pip install -e ".[dev]"
ruff check .
pytest -q
git add . && git commit -m "Lesson 5: tonnage pipeline + vehicle IDs + docs deploy"
git push origin main
Two workflows now trigger: ci.yml runs tests, docs.yml rebuilds and deploys the site.
· Package anatomy after this lesson
Where everything lives now. new marks files added this lesson.
· What you built
- A vehicle-ID primitive with prefix/number parsing and fleet classification — used everywhere fleet IDs appear.
- A file registry that scans an input folder and parses source + date from filenames.
- Per-source cleaners for RAD and ARTS exports — handle header drift and material-code normalization.
- Hash-based dedup keys that are stable across round-trips through parquet.
- An idempotent upsert into year-partitioned parquet — safe to re-run forever.
- A pipeline entry point that wires it all together.
- Docs deploy live —
opentrash.appnow reflects the v2 architecture on every push.
· Companion resources
Optional, for going deeper.
- Idempotency in data pipelines: the same principle that drives database UPSERT exists at every scale — from one Excel file to a billion-row warehouse load. dbt incremental models are a good production-scale example.
- Hash-based dedup pitfalls: floats, datetimes, and unicode normalization are the three classic gotchas. The keys module handles them explicitly — worth reading the source.
- MkDocs Material: official site. The theme is heavily customizable; we’re using ~5% of its features and the result already looks professional.
· Next lesson
Lesson 6 — GPS adapters & cache: the live data path. Two vendors (Geotab API + Postgres stream) behind one Protocol; cache-first reads (one parquet per vehicle per day).