Lesson 6 of 12 · The live data path

One Protocol, two vendors, one cache that never re-fetches.

GPS data is the stream the whole engine sits on. This lesson builds the three pieces that make the stream consumable: a tiny GPSAdapter Protocol that defines the canonical input shape, two concrete adapters (Geotab API and streaming Postgres) that produce it from real vendor data, and a cache-first read layer that lands one parquet per vehicle per local day.

Time: ~75 min
You'll touch: adapters/gps/{base,geotab,postgres} · cache/gps_cache
Result: swappable GPS sources, persistent local cache

· Objective

Build the data ingress for the rest of the package.

  • adapters/gps/base.py — the GPSAdapter Protocol and the canonical 6-column output schema. Anything that supplies GPS must produce this shape.
  • adapters/gps/geotab.py — Geotab MyGeotab API adapter. Pulls one vehicle’s pings for a local-day window via the LogRecord API.
  • adapters/gps/postgres.py — Streaming Postgres adapter. The production data path: a database where pings stream in every ~10 seconds.
  • cache/gps_cache.py — cache-first reads: ask once, write to disk, subsequent reads hit disk. One parquet per vehicle per local day in cache/YYYY-MM-DD/<vehicle>.parquet.

The credentials rule. No credentials in code. Ever. Every adapter takes its URL, API key, or password as a constructor argument or reads it from an environment variable. The package itself contains zero secret values, so it’s safe to commit, fork, and share. If a credential lands in a notebook or a script, managing it is the user’s responsibility, and it should be rotated immediately if it ever reaches version control.
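
The argument-then-environment lookup behind this rule can be sketched roughly as follows. The helper name resolve_credential is hypothetical and not part of the package; only the GEOTAB_USERNAME variable name comes from this lesson.

```python
import os
from typing import Optional


def resolve_credential(value: Optional[str], env_var: str) -> str:
    """Return the explicit value if given, else fall back to the environment.

    Hypothetical helper for illustration; the real adapters may differ.
    """
    if value:
        return value
    from_env = os.environ.get(env_var)
    if not from_env:
        # Fail loudly instead of silently running with an empty credential.
        raise ValueError(f"Missing credential: pass it explicitly or set {env_var}")
    return from_env
```

An adapter constructor could then call resolve_credential(username, "GEOTAB_USERNAME"), so a value passed in code always wins and nothing secret needs to be committed.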

· Build it, step by step

1 The Protocol — adapters/gps/base.py

A Python typing.Protocol defines an interface by shape, not inheritance. Anything with a matching fetch() method satisfies the protocol — no base-class wiring required. That’s the right tool for the “swap vendors freely” design:

from typing import Protocol

import pandas as pd

class GPSAdapter(Protocol):
    def fetch(self, vehicle: str, start_date, end_date) -> pd.DataFrame:
        """Return canonical GPS rows for one vehicle over a local-day range."""
        ...

The canonical 6-column schema (GPS_COLUMNS) is also defined here:

GPS_COLUMNS = ("vehicle_id", "dt_utc", "dt_local", "lat", "lon", "speed_mph")

Every downstream module imports those names. Adapter quirks — Geotab’s VehicleName, Postgres’s DateTime, mph vs km/h, naive vs aware datetimes — all get normalized to this shape inside each adapter.
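
One way an adapter could enforce the canonical shape before returning is a small check like the one below. This is a sketch only: to_canonical is a hypothetical helper name, and the package may validate its output differently or not at all.

```python
import pandas as pd

GPS_COLUMNS = ("vehicle_id", "dt_utc", "dt_local", "lat", "lon", "speed_mph")


def to_canonical(df: pd.DataFrame) -> pd.DataFrame:
    """Return df restricted to GPS_COLUMNS, in canonical order.

    Hypothetical helper: raises if the adapter failed to produce a column.
    """
    missing = [c for c in GPS_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"GPS frame is missing columns: {missing}")
    # Selecting by the tuple order drops extras and fixes column order.
    return df.loc[:, list(GPS_COLUMNS)]
```

Calling a check like this as the last line of each fetch() keeps vendor-specific column names from leaking into downstream code.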

2 Geotab adapter — adapters/gps/geotab.py

Geotab is the legacy / on-demand path: a REST API you authenticate against and call to pull one vehicle’s LogRecord rows. The adapter:

  • Takes username, password, database as constructor args, with env-var fallback (GEOTAB_USERNAME etc.). No values committed.
  • Lazily imports mygeotab — the package installs cleanly without the [geotab] extra; the adapter only fails if you actually try to use it.
  • Looks up the device by vehicle name, queries LogRecord for the local-day UTC window, normalizes km/h speeds to mph and naive UTC datetimes to (UTC, Pacific) aware ones.
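
The last bullet involves two small conversions: turning a local calendar day into a UTC query window, and normalizing each ping's timestamp and speed. A standard-library sketch of both is below. The function names are hypothetical, and "America/Los_Angeles" is an assumed IANA zone name for "Pacific"; the adapter itself works on DataFrames and may do this differently.

```python
from datetime import date, datetime, time, timedelta, timezone
from typing import Tuple
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")  # assumption: zone used for "Pacific"
KMH_TO_MPH = 0.621371


def local_day_utc_window(day: date, tz: ZoneInfo = PACIFIC) -> Tuple[datetime, datetime]:
    """Return the [start, end) bounds of a local calendar day as aware UTC datetimes."""
    start_local = datetime.combine(day, time.min, tzinfo=tz)
    # Use the next local midnight so DST days come out as 23 or 25 hours.
    end_local = datetime.combine(day + timedelta(days=1), time.min, tzinfo=tz)
    return start_local.astimezone(timezone.utc), end_local.astimezone(timezone.utc)


def normalize_ping(naive_utc: datetime, speed_kmh: float, tz: ZoneInfo = PACIFIC):
    """Turn a naive-UTC timestamp and km/h speed into (dt_utc, dt_local, speed_mph)."""
    dt_utc = naive_utc.replace(tzinfo=timezone.utc)  # attach UTC, no clock shift
    return dt_utc, dt_utc.astimezone(tz), speed_kmh * KMH_TO_MPH
```

Computing the end bound from the next local midnight, rather than adding 24 hours to the start, keeps the window aligned with the local day across daylight-saving transitions.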

3 Postgres adapter — adapters/gps/postgres.py

The production path: a Postgres database where a separate streaming process writes pings every ~10 seconds. Reads are cheap and incremental, and you can do range queries efficiently with SQL.

  • Takes a SQLAlchemy URL (postgresql+psycopg2://...) as a constructor arg, with env-var fallback (OPENTRASH_PG_URL). Same no-creds-in-code rule.
  • Default queries match a Geotab-style schema (LogRecords2, Devices2); override devices_sql / pings_sql on the constructor if your shape differs.
  • Server-side chunked fetch (execution_options(stream_results=True) + fetchmany(chunksize)) so multi-year pulls never hold the full result set in memory. The default 250k-row chunk size handles a year of pings comfortably.
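
The chunked read loop from the last bullet can be sketched against any DB-API cursor. To run without a database server, this sketch uses the standard-library sqlite3 module as a stand-in for Postgres; the pings table and the iter_chunks helper are hypothetical. With SQLAlchemy, the result object returned by execute() also exposes fetchmany(), and stream_results=True is what requests a server-side cursor from the driver.

```python
import sqlite3
from typing import Iterator, List


def iter_chunks(cursor, chunksize: int) -> Iterator[List[tuple]]:
    """Yield lists of at most `chunksize` rows until the cursor is exhausted."""
    while True:
        rows = cursor.fetchmany(chunksize)
        if not rows:
            break
        yield rows


# Stand-in data: 10 made-up rows for one vehicle.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pings (vehicle_id TEXT, seq INTEGER)")
conn.executemany(
    "INSERT INTO pings VALUES (?, ?)", [("815001", i) for i in range(10)]
)

cursor = conn.execute("SELECT vehicle_id, seq FROM pings ORDER BY seq")
chunk_sizes = [len(chunk) for chunk in iter_chunks(cursor, chunksize=4)]
print(chunk_sizes)  # [4, 4, 2]
```

Only one chunk is held in memory at a time, so peak usage is bounded by the chunk size rather than by the length of the date range.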

Both adapters implement the same fetch(vehicle, start_date, end_date) signature. Downstream code — cache/gps_cache.py below, the engine, RouteView — never knows or cares which vendor is behind the call. Swap one for the other and nothing else changes.

4 Cache-first reads — cache/gps_cache.py

Calling a vendor every time we want pings is wasteful and slow. The cache layer fixes that:

get_gps_day(adapter, "815001", "2026-01-18", cache_dir="cache/")
# First call: adapter.fetch(...) -> writes cache/2026-01-18/815001.parquet
# Second call: reads cache/2026-01-18/815001.parquet directly

  • Layout: one file per vehicle per local day. The YYYY-MM-DD/ directory naming sorts chronologically and matches what the Postgres extractor writes natively.
  • Refresh: pass refresh=True to force a re-fetch (useful when a vendor is back-filling history).
  • Fleet helper: get_gps_fleet_day(adapter, vehicles, day) loops through a list, calls get_gps_day for each, returns one concatenated DataFrame.
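
The cache-first logic itself is small. The sketch below shows the path layout and the hit/miss/refresh decision with the file format abstracted behind read and write callables, so it runs with JSON and no parquet engine. The names cache_path and get_or_fetch are hypothetical; in the package, the equivalent of read and write would presumably be pandas' read_parquet and DataFrame.to_parquet.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


def cache_path(cache_dir: str, vehicle: str, day: str, suffix: str = ".parquet") -> Path:
    """Build <cache_dir>/YYYY-MM-DD/<vehicle><suffix>, matching the layout above."""
    return Path(cache_dir) / day / f"{vehicle}{suffix}"


def get_or_fetch(
    path: Path,
    fetch: Callable[[], Any],
    read: Callable[[Path], Any],
    write: Callable[[Any, Path], None],
    refresh: bool = False,
) -> Any:
    """Cache-first read: use the file if present, otherwise fetch and store it."""
    if path.exists() and not refresh:
        return read(path)
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    write(data, path)
    return data


# Demo with JSON files in a temp dir and a fake fetch that counts its calls.
calls = []


def fake_fetch():
    calls.append(1)
    return [{"vehicle_id": "815001", "lat": 45.5}]  # made-up ping


def read_json(path: Path):
    return json.loads(path.read_text())


def write_json(data, path: Path) -> None:
    path.write_text(json.dumps(data))


with tempfile.TemporaryDirectory() as tmp:
    path = cache_path(tmp, "815001", "2026-01-18", suffix=".json")
    first = get_or_fetch(path, fake_fetch, read_json, write_json)   # miss: fetches
    second = get_or_fetch(path, fake_fetch, read_json, write_json)  # hit: reads disk
    forced = get_or_fetch(path, fake_fetch, read_json, write_json, refresh=True)

fetch_count = len(calls)  # one miss plus one forced refresh
```

Keeping the decision separate from the file format also makes the cache easy to test, since the fetch side can be replaced by any callable.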

5 Tests with stub adapters

Real GPS APIs need real credentials. For tests, we use a StubAdapter — a tiny class that satisfies the Protocol by returning canned pings and counting how many times it’s called. That lets us prove cache-miss writes, cache-hit reads (no second call), refresh behavior, fleet concatenation, and all the error-path conditions — without network.
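
A stub along these lines is enough to exercise the cache without a network. This is a sketch, not the package's actual test fixture: the canned row is made-up data, and the runtime_checkable decorator is an addition here so that isinstance() can confirm the stub has a fetch method.

```python
from typing import Protocol, runtime_checkable

import pandas as pd

GPS_COLUMNS = ("vehicle_id", "dt_utc", "dt_local", "lat", "lon", "speed_mph")


@runtime_checkable  # assumption: added so isinstance() works in this demo
class GPSAdapter(Protocol):
    def fetch(self, vehicle: str, start_date, end_date) -> pd.DataFrame: ...


class StubAdapter:
    """Returns the same canned rows on every call and counts the calls."""

    def __init__(self, rows: pd.DataFrame):
        self.rows = rows
        self.calls = 0

    def fetch(self, vehicle: str, start_date, end_date) -> pd.DataFrame:
        self.calls += 1
        return self.rows.copy()  # copy so tests cannot mutate the canned data


# One made-up ping in the canonical column order.
canned = pd.DataFrame(
    [["815001", "2026-01-18T20:00:00Z", "2026-01-18T12:00:00-08:00", 45.5, -122.6, 12.0]],
    columns=list(GPS_COLUMNS),
)
stub = StubAdapter(canned)
frame = stub.fetch("815001", "2026-01-18", "2026-01-18")
```

StubAdapter never inherits from GPSAdapter; it satisfies the Protocol purely by having a matching fetch method. A cache test can then assert that stub.calls stays at 1 after a second read of the same vehicle-day.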

pip install -e ".[dev]"        # base + dev tools
pip install -e ".[postgres]"   # if you'll use the Postgres adapter
ruff check .
pytest -q

6 Commit and push

git add .
git commit -m "Lesson 6: GPS adapters (Geotab + Postgres) + cache-first reads"
git push origin main

· Package anatomy after this lesson

Where everything lives now. [new] marks files added this lesson.

opentrash/
├── pyproject.toml
├── mkdocs.yml
├── .github/workflows/{ci,docs}.yml
├── docs/{index,architecture}.md + CNAME
├── opentrash/
│   ├── adapters/gps/
│   │   ├── base.py        # [new] GPSAdapter Protocol + canonical schema
│   │   ├── geotab.py      # [new] Geotab API adapter
│   │   └── postgres.py    # [new] Streaming Postgres adapter
│   ├── core/
│   │   ├── crs.py
│   │   ├── duckdb_session.py
│   │   └── vehicle_ids.py
│   ├── prep/
│   │   ├── sites.py
│   │   ├── parcels.py
│   │   └── static_layers.py
│   ├── cache/
│   │   └── gps_cache.py   # [new] cache-first GPS reads
│   ├── tonnage/           # registry, cleaners, keys, upsert, pipeline
│   └── (engine, patterns, routeview scaffolded)
└── tests/
    ├── (existing 7 test files)
    ├── test_gps_adapter_base.py   # [new]
    ├── test_geotab_adapter.py     # [new]
    ├── test_postgres_adapter.py   # [new]
    └── test_gps_cache.py          # [new]

· What you built

  • A Protocol that lets the package treat any GPS vendor identically — the foundational decoupling that keeps engine code vendor-neutral.
  • Two concrete adapters — Geotab (REST, on-demand) and Postgres (streaming, server-side chunked) — both producing the canonical 6-column shape.
  • A cache-first read layer that turns repeated “give me this vehicle-day” calls into disk reads after the first fetch.
  • Credentials via args or env, never in code — the package is safe to fork and commit without leaking secrets.
  • Tests using stub adapters — no network, no real credentials, full behavior coverage.

One housekeeping reminder. If you ever committed real Geotab or Postgres credentials to any repo (notebooks count), rotate them now. Git history is forever; the moment a secret hits a public branch, treat it as leaked. The package never has this problem because we never put credentials in the package.

· Companion resources

Optional, for going deeper.

  • typing.Protocol: official docs — structural typing in Python.
  • SQLAlchemy streaming results: stream_results docs — how to fetch huge result sets without holding them in memory.
  • The 12-factor app: factor III, “Config” — the canonical argument for environment-variable configuration.

· Next lesson

Lesson 7 — The substrate: parcels with WKB geometry, per-day vehicle-day indexes, and the master GPS index that joins them. The lookup tables that make the routing engine fast.