From a CSV of customers to a layer of points on parcels.
Sites are customer accounts — they come from a Salesforce export with addresses and lat/lon coordinates. They’re the bridge between billing (who pays for service) and operations (which parcel gets serviced). This lesson turns the tabular sites table from Lesson 0 into a real points layer, then uses point-in-polygon spatial join to attach each site to its parcel.
· Objective
Extend prep/sites.py with three new functions that take
the existing tabular cleaner one step further — into geometry.
sites_to_geo(df, lon_col, lat_col, source_crs)— build a points GeoDataFrame from the cleaned tabular sites; drop rows with missing or out-of-range coordinates.attach_apns_via_spatial_join(sites_gdf, parcels_gdf, apn_col)— point-in-polygon spatial join. Every site gets the APN of the parcel it sits on; sites that fall outside any parcel getNaN(a useful data-quality flag).load_sites_with_apns(sites_path, parcels_gdf, ...)— the convenience one-call entry point: load → clean → geom → spatial join → APNs attached.
· Build it, step by step
1 Sites as points — sites_to_geo()
Take the cleaned tabular sites (from load_sites — the function we built back in Lesson 0) and turn it into a points GeoDataFrame:
- Coerce the lon/lat columns to numeric (a tolerant
pd.to_numeric(errors="coerce"); any junk becomes NaN). - Drop rows with missing coordinates or out-of-range values (lon outside [−180, 180] or lat outside [−90, 90]). A point with no location can’t be joined to anything; keeping it would just pollute downstream output with nulls.
- Build
shapely.geometry.Pointobjects from the survivors and wrap them in a GeoDataFrame atsource_crs(default EPSG:4326, virtually always right for CSV).
Column names are knobs (lon_col, lat_col) with sensible defaults ("Longitude", "Latitude"). Salesforce uses those; other exports might use "Long" / "Lat" or "X-Coordinate" / "Y-Coordinate". Same pattern; just pass the right column names.
2 Spatial join — attach_apns_via_spatial_join()
The cornerstone operation: point-in-polygon join via GeoPandas’ sjoin.
joined = gpd.sjoin(
sites_gdf,
parcels_min,
how="left", # keep all sites; APN becomes NaN for misses
predicate="within" # point is "within" polygon
)
- CRS matching — both sides must agree on what coordinates mean. If sites are in EPSG:4326 and parcels are in EPSG:2230, the function reprojects parcels to match sites (we preserve the sites’ CRS so any downstream rendering doesn’t need to reproject again).
- Column collision avoidance — we keep only
APN+geometryfrom parcels before the join, because sites and parcels often share field names likeADDRESSorOWNERwhich would collide. - Cleanup — sjoin adds an
index_rightcolumn we don’t want; we drop it before returning.
A parcel is a non-overlapping property lot, so a point sits on at most one parcel in practice. We don’t need to disambiguate multi-matches; the within predicate handles boundary cases via shapely’s own rules.
3 The convenience entry point — load_sites_with_apns()
For most callers, the one-call entry point is what you want:
from opentrash.prep.parcels import load_parcels
from opentrash.prep.sites import load_sites_with_apns
parcels = load_parcels("data/parcels.gpkg")
sites_with_apns = load_sites_with_apns("data/sites_export.csv", parcels)
Equivalent to chaining load_sites() → sites_to_geo() → attach_apns_via_spatial_join(), but in one obvious function for the common case.
4 Tests with synthetic Salesforce-style data
Seven new tests appended to tests/test_sites.py:
sites_to_geobuilds points; drops bad coords; raises when columns missing.attach_apnsfinds parcels; missing APN column raises; missing CRS raises.- Cross-CRS works — if parcels are in EPSG:2230 and sites are in EPSG:4326, the function reprojects parcels in.
pip install -e ".[dev]"
ruff check .
pytest -q
5 Commit and push
git add .
git commit -m "Lesson 4: sites_to_geo + spatial join to parcels"
git push origin main
· Package anatomy after this lesson
Where everything lives now. new marks files added or extended this lesson.
· What you built
- Sites as a points layer built from a Salesforce-style CSV — with sensible defaults for the column names and tolerant handling of bad coordinates.
- A point-in-polygon spatial join that bridges customer accounts to assessor parcels via geometry, not assumed-correct identifiers.
- A one-call entry point —
load_sites_with_apns()— for the common case. - Seven new tests covering bad-coord handling, the join itself, cross-CRS behavior, and the error paths.
· Companion resources
Optional, for going deeper.
- GeoPandas sjoin: spatial joins user guide — the predicates (within, intersects, contains) and when each applies.
- Shapely Point objects: Point docs — the tiny class that powers tens of thousands of customer records.
- Data-quality framing: when you reconcile the spatially-derived APN against a record-carried APN, agreements are silent confirmation and disagreements are usually the interesting stuff — addresses split across parcels, geocoder errors, parcel re-numbering after a re-subdivision.
· Next lesson
Lesson 5 — Tonnage: the Excel-to-parquet ingestion pipeline. Daily-ish files, vehicle-ID parsing, hash-based idempotent dedup. After this, all static and semi-static data is in — the GPS stream is next.