Data platform · Cloud · Full-stack .NET
RA Import Platform
A production-grade data-aggregation and contact-enrichment engine that scrapes regulatory records for residential assisted-living facilities across all 50 U.S. states, normalizes them into a single source of truth, verifies contact data, and feeds marketing systems — running fully automated on Azure.
What it does
A single automated pipeline turns fragmented public health-department data into a clean, sales-ready contact database.
Nationwide coverage
Harvests every state's assisted-living registry — from modern open-data APIs to PDF-only records and CAPTCHA-gated portals — into one consistent dataset.
Verified contacts
Enriches each facility with a website, email, and phone, then validates deliverability so only real, mailable contacts reach the marketing team.
Always current
Runs on a schedule, tracks change history per facility, and exports ready-to-import lists for ActiveCampaign (email) and Postalytics (direct mail).
Architecture
Layered pipeline — acquire → normalize → enrich → export — behind a secured API, orchestrated by scheduled background workers.
Technology stack
Modern .NET, real browser automation, and a normalized SQL model — deployed as a container on Azure with CI/CD.
Engineering highlights for technical reviewers
The problems that made this hard, and the patterns used to solve them.
Registry-driven scraper fan-out
A single KnownStates map registers 59 scrapers as keyed DI services. Adding a state is a one-line registration — no factory or switch logic. Multi-track states (OH has 3 license systems, CA/AZ have 3 sources each) coexist cleanly.
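As a rough illustration of the registry idea, here is a minimal sketch in Python rather than the project's .NET code. The decorator, the scraper keys, and the sample records are all hypothetical; only the notion of one keyed map from which scrapers are resolved comes from the description above.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for the KnownStates map: key -> scraper callable.
# In the described system this role is played by keyed DI registrations.
Scraper = Callable[[], List[dict]]
KNOWN_STATES: Dict[str, Scraper] = {}


def register(key: str):
    """Register a scraper under a unique key (illustrative helper)."""
    def decorator(fn: Scraper) -> Scraper:
        if key in KNOWN_STATES:
            raise ValueError(f"duplicate scraper key: {key}")
        KNOWN_STATES[key] = fn
        return fn
    return decorator


# A multi-track state simply registers several keys; no switch logic is needed.
@register("OH/track-1")
def scrape_oh_track_1() -> List[dict]:
    return [{"state": "OH", "name": "Example Home A"}]


@register("OH/track-2")
def scrape_oh_track_2() -> List[dict]:
    return [{"state": "OH", "name": "Example Home B"}]


@register("TX")
def scrape_tx() -> List[dict]:
    return [{"state": "TX", "name": "Example Home C"}]


def run(keys: Optional[List[str]] = None) -> Dict[str, List[dict]]:
    """Run the selected scrapers (all of them by default), resolved by key."""
    selected = keys if keys is not None else sorted(KNOWN_STATES)
    return {key: KNOWN_STATES[key]() for key in selected}
```

Calling run() with no arguments fans out across every registered key, while run(["TX"]) mirrors an on-demand scrape of a single source.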
Stable identity & idempotent upserts
Every facility gets a deterministic FacilityKey (SHA-256 of canonical name + address). Re-scrapes upsert by key, so history and hard-won enrichment data survive re-runs instead of being overwritten.
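A minimal sketch of the deterministic-key idea, again in Python rather than the project's .NET code. The canonicalization rules (lowercasing, stripping punctuation, collapsing whitespace), the "|" separator, and the dictionary used as a store are assumptions made for illustration; the source only states that the key is a SHA-256 of canonical name plus address.

```python
import hashlib
import re


def canonical(text: str) -> str:
    """Assumed canonical form: lowercase, punctuation removed, whitespace collapsed."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(cleaned.split())


def facility_key(name: str, address: str) -> str:
    """SHA-256 hex digest over the canonical name and address."""
    payload = canonical(name) + "|" + canonical(address)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def upsert(store: dict, scraped: dict) -> str:
    """Insert or update by key. Fields missing from the new scrape
    (for example enrichment results) are kept rather than wiped."""
    key = facility_key(scraped["name"], scraped["address"])
    existing = store.get(key, {})
    store[key] = {**existing, **scraped}
    return key
```

Because the key depends only on canonicalized inputs, a re-scrape that differs in casing or punctuation lands on the same record instead of creating a duplicate.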
Normalized schema + JSON grab-bag
Core entities (Facility, Address, Person, Email, Phone, Website) are relational; volatile state-specific fields live in an ISJSON-checked Details column with an append-only history table — schema stays stable as 50 states' quirks change.
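The snapshot-plus-history pattern can be sketched as follows, using Python and in-memory containers in place of the SQL tables. The function name, field names, and the choice to compare key-sorted JSON strings are assumptions; the source only describes a current JSON Details value with an append-only history.

```python
import json
from datetime import datetime, timezone


def record_details(current: dict, history: list, facility_key: str, details: dict) -> bool:
    """Store the latest JSON snapshot and append a history entry only when
    the content actually changed. Returns True if a change was recorded."""
    # Sorting keys makes the comparison independent of field order.
    snapshot = json.dumps(details, sort_keys=True)
    if current.get(facility_key) == snapshot:
        return False
    history.append({
        "facility_key": facility_key,
        "details": snapshot,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })
    current[facility_key] = snapshot
    return True
```

History entries are only ever appended, so the timeline of state-specific fields can be replayed without altering the relational schema.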
Cost-aware enrichment
Email verification is metered per call, so verdicts are cached in-process and in SQL to avoid re-billing. A Facebook fallback with a login-wall circuit breaker recovers contacts the primary path misses.
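A hedged sketch of the two ideas in this paragraph, in Python rather than the project's .NET code. SQLite stands in for the SQL database, and the class names, the EmailVerdict table, and the failure threshold are hypothetical. The metered verifier is passed in as a plain function, since the actual provider is not named here.

```python
import sqlite3


class CachedVerifier:
    """Check memory first, then SQL, and only then pay for a metered call."""

    def __init__(self, verify_fn, db: sqlite3.Connection):
        self._verify = verify_fn  # hypothetical metered verification call
        self._memory = {}
        self._db = db
        db.execute(
            "CREATE TABLE IF NOT EXISTS EmailVerdict "
            "(email TEXT PRIMARY KEY, verdict TEXT NOT NULL)"
        )

    def check(self, email: str) -> str:
        key = email.strip().lower()
        if key in self._memory:
            return self._memory[key]
        row = self._db.execute(
            "SELECT verdict FROM EmailVerdict WHERE email = ?", (key,)
        ).fetchone()
        if row is not None:
            verdict = row[0]
        else:
            verdict = self._verify(key)  # the only billable step
            self._db.execute(
                "INSERT INTO EmailVerdict (email, verdict) VALUES (?, ?)",
                (key, verdict),
            )
            self._db.commit()
        self._memory[key] = verdict
        return verdict


class CircuitBreaker:
    """Opens after N consecutive login-wall hits so the fallback stops retrying."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.threshold

    def record(self, hit_login_wall: bool) -> None:
        # Any successful fetch resets the streak.
        self.failures = self.failures + 1 if hit_login_wall else 0
```

The SQL tier is what lets verdicts survive a restart; the in-process tier only saves round trips within a single run.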
Heterogeneous source handling
One pipeline absorbs ArcGIS/Socrata REST feeds, JS-heavy portals via real Chromium, Excel workbooks, and PDF-only state records parsed positionally — plus a reCAPTCHA-gated portal handled via snapshot.
Cloud-native operations
Multi-stage Docker image bakes Chromium + system deps for headless scraping in-container. Background workers run scrapes on an interval without blocking Kestrel startup; a warm-up service rebuilds in-memory state from SQL on boot.
Data & API surface
- Facility — identity, source, active/seen/scraped timestamps
- Address / Person / FacilityPerson — typed addresses, role-carrying links (Owner, Administrator, Agent…)
- Email / Phone / Website — owned by facility or person; carry source & verification state
- FacilityDetails + History — current JSON snapshot plus append-only change timeline
- ScrapeRun — per-run audit: counts, success, errors
- POST /scrape — trigger an on-demand state scrape
- POST /enrich · /enrich-all — run contact enrichment
- GET /latest · /latest-multi — results as JSON or CSV
- GET /status — facility counts & scrape-run history
- GET /states — catalog of supported sources
- Secured with X-Api-Key; documented via Swagger
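For readers unfamiliar with header-based API keys, a minimal sketch of the check is shown below in Python. It is an assumption about how such a check is commonly written, not the project's middleware; the function name and the plain-dictionary headers are illustrative.

```python
import hmac


def is_authorized(headers: dict, expected_key: str) -> bool:
    """Return True only if the X-Api-Key header matches the configured key."""
    if not expected_key:
        return False  # refuse everything when no key is configured
    supplied = ""
    for name, value in headers.items():
        if name.lower() == "x-api-key":  # header names are case-insensitive
            supplied = value
            break
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(supplied.encode("utf-8"), expected_key.encode("utf-8"))
```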
Capabilities demonstrated for services & hiring
What building and running this system proves I can deliver.