Lesson 4 of 12 · The customer bridge

From a CSV of customers to a layer of points on parcels.

Sites are customer accounts — they come from a Salesforce export with addresses and lat/lon coordinates. They’re the bridge between billing (who pays for service) and operations (which parcel gets serviced). This lesson turns the tabular sites table from Lesson 0 into a real points layer, then uses point-in-polygon spatial join to attach each site to its parcel.

Time: ~60 min You'll touch: prep/sites (three new functions) Result: sites with parcel APNs attached via geometry

· Objective

Extend prep/sites.py with three new functions that take the existing tabular cleaner one step further — into geometry.

  • sites_to_geo(df, lon_col, lat_col, source_crs) — build a points GeoDataFrame from the cleaned tabular sites; drop rows with missing or out-of-range coordinates.
  • attach_apns_via_spatial_join(sites_gdf, parcels_gdf, apn_col) — point-in-polygon spatial join. Every site gets the APN of the parcel it sits on; sites that fall outside any parcel get NaN (a useful data-quality flag).
  • load_sites_with_apns(sites_path, parcels_gdf, ...) — the convenience one-call entry point: load → clean → geom → spatial join → APNs attached.
Why a spatial join when the data has an APN column already? Because data quality is what it is — APNs in customer records drift, get typo’d, fall out of sync with the assessor’s map. A geometry-based attribution is grounded truth: this lon/lat is inside this parcel, full stop. When the customer-record APN and the spatially-derived APN agree, you’ve confirmed the record. When they disagree, you’ve found a data-quality issue worth investigating.

· Build it, step by step

1 Sites as points — sites_to_geo()

Take the cleaned tabular sites (from load_sites — the function we built back in Lesson 0) and turn it into a points GeoDataFrame:

  • Coerce the lon/lat columns to numeric (a tolerant pd.to_numeric(errors="coerce"); any junk becomes NaN).
  • Drop rows with missing coordinates or out-of-range values (lon outside [−180, 180] or lat outside [−90, 90]). A point with no location can’t be joined to anything; keeping it would just pollute downstream output with nulls.
  • Build shapely.geometry.Point objects from the survivors and wrap them in a GeoDataFrame at source_crs (default EPSG:4326, virtually always right for CSV).

Column names are knobs (lon_col, lat_col) with sensible defaults ("Longitude", "Latitude"). Salesforce uses those; other exports might use "Long" / "Lat" or "X-Coordinate" / "Y-Coordinate". Same pattern; just pass the right column names.

2 Spatial join — attach_apns_via_spatial_join()

The cornerstone operation: point-in-polygon join via GeoPandas’ sjoin.

joined = gpd.sjoin(
    sites_gdf,
    parcels_min,
    how="left",       # keep all sites; APN becomes NaN for misses
    predicate="within"   # point is "within" polygon
)
  • CRS matching — both sides must agree on what coordinates mean. If sites are in EPSG:4326 and parcels are in EPSG:2230, the function reprojects parcels to match sites (we preserve the sites’ CRS so any downstream rendering doesn’t need to reproject again).
  • Column collision avoidance — we keep only APN + geometry from parcels before the join, because sites and parcels often share field names like ADDRESS or OWNER which would collide.
  • Cleanup — sjoin adds an index_right column we don’t want; we drop it before returning.

A parcel is a non-overlapping property lot, so a point sits on at most one parcel in practice. We don’t need to disambiguate multi-matches; the within predicate handles boundary cases via shapely’s own rules.

3 The convenience entry point — load_sites_with_apns()

For most callers, the one-call entry point is what you want:

from opentrash.prep.parcels import load_parcels
from opentrash.prep.sites import load_sites_with_apns

parcels = load_parcels("data/parcels.gpkg")
sites_with_apns = load_sites_with_apns("data/sites_export.csv", parcels)

Equivalent to chaining load_sites()sites_to_geo()attach_apns_via_spatial_join(), but in one obvious function for the common case.

4 Tests with synthetic Salesforce-style data

Seven new tests appended to tests/test_sites.py:

  • sites_to_geo builds points; drops bad coords; raises when columns missing.
  • attach_apns finds parcels; missing APN column raises; missing CRS raises.
  • Cross-CRS works — if parcels are in EPSG:2230 and sites are in EPSG:4326, the function reprojects parcels in.
pip install -e ".[dev]"
ruff check .
pytest -q

5 Commit and push

git add .
git commit -m "Lesson 4: sites_to_geo + spatial join to parcels"
git push origin main

· Package anatomy after this lesson

Where everything lives now. new marks files added or extended this lesson.

opentrash/ ├── pyproject.toml ├── README.md ├── .github/workflows/ci.yml ├── docs/architecture.md ├── opentrash/ │ ├── __init__.py │ ├── adapters/gps/ # scaffolded │ ├── core/ │ │ ├── crs.py │ │ └── duckdb_session.py │ ├── prep/ │ │ ├── sites.py # [updated] +sites_to_geo, +spatial join, +entry point │ │ ├── parcels.py │ │ └── static_layers.py │ ├── cache/ # scaffolded │ ├── tonnage/ # scaffolded │ ├── engine/ # scaffolded — THE ROUTING ENGINE │ ├── patterns/ # scaffolded │ └── routeview/ # scaffolded └── tests/ ├── test_sites.py # [updated] +7 geometry/spatial-join tests ├── test_crs.py ├── test_duckdb_session.py ├── test_parcels.py └── test_static_layers.py

· What you built

  • Sites as a points layer built from a Salesforce-style CSV — with sensible defaults for the column names and tolerant handling of bad coordinates.
  • A point-in-polygon spatial join that bridges customer accounts to assessor parcels via geometry, not assumed-correct identifiers.
  • A one-call entry pointload_sites_with_apns() — for the common case.
  • Seven new tests covering bad-coord handling, the join itself, cross-CRS behavior, and the error paths.
The static-data foundation is complete: parcels, routes, facilities, sites. The next lesson adds the last static-ish piece — tonnage ingestion — before we pivot to the live data stream.

· Companion resources

Optional, for going deeper.

  • GeoPandas sjoin: spatial joins user guide — the predicates (within, intersects, contains) and when each applies.
  • Shapely Point objects: Point docs — the tiny class that powers tens of thousands of customer records.
  • Data-quality framing: when you reconcile the spatially-derived APN against a record-carried APN, agreements are silent confirmation and disagreements are usually the interesting stuff — addresses split across parcels, geocoder errors, parcel re-numbering after a re-subdivision.

· Next lesson

Lesson 5 — Tonnage: the Excel-to-parquet ingestion pipeline. Daily-ish files, vehicle-ID parsing, hash-based idempotent dedup. After this, all static and semi-static data is in — the GPS stream is next.