Geomarker Curation and Computation

2023 BMI Practicum

Cole Brokamp

2/20/23

👋 Hello!

Cole Brokamp, Associate Professor
University of Cincinnati
Cincinnati Children’s Hospital Medical Center

  • Understanding the heterogeneous effects of environmental exposures and community characteristics on childhood psychiatric and neurobehavioral health outcomes
  • Fairness in place-based data science, environmental and population health, precision health
  • Developing new methods and technologies to support environmental and population health research
    • statistical computing tools for geocoding and geomarker assessment
    • high resolution spatiotemporal air pollution exposure assessment models
    • machine learning methods for causal inference

Geomarker Assessment

Place-Based Health

  • Place is a powerful determinant of health
  • Place-based characteristics can have high spatiotemporal variability
  • Join health data to extant place-based data
  • Direct measurements often not feasible in cohort- or EHR-based studies
  • Observational studies

Harmonizing spatiotemporal resolution and extent

Geomarker

Geomarker: Any place-based or geospatial measure that influences or predicts health

Geocoding: Converting a string of text into spatial coordinates or boundaries

place (+ time) → estimating past geomarkers

Geomarker Exposure Assessment Pipeline for Research

Geomarkers

Social & Economic: American Community Survey, indices, crime, resource deserts, community material deprivation index

Environmental Exposures: air pollution, weather, greenness/greenspace, land use, traffic, noise

Hyperlocal: parcel-level housing characteristics, estimated transportation time to resources, combined sewer overflows, opioid addiction treatment deserts, gunfire

Geomarker Assessment

Containing geography: census tract linkage to survey data; neighborhood linked to policies or characteristics

Radial measures: buffer of a given radius drawn around a location; length, area, or density of sources within the buffer

Exact location: proximity to predicted source, nearest neighbor weighting, kriging, land use models

Identifier linkage: Auditor, tax, educational data resources
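
As an illustration of the radial measures above, here is a minimal sketch that counts point sources inside a circular buffer, converts the count to a density, and reports the distance to the nearest source. The function names, the 400 m radius, and the coordinates are all hypothetical, and the haversine formula stands in for whatever projection-based distance a real pipeline would use.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def radial_geomarkers(home, sources, radius_m=400):
    """Summarize point sources within a circular buffer around one location."""
    distances = [haversine_m(home[0], home[1], s[0], s[1]) for s in sources]
    n_within = sum(d <= radius_m for d in distances)
    buffer_area_km2 = math.pi * (radius_m / 1000) ** 2
    return {
        "n_within": n_within,
        "density_per_km2": n_within / buffer_area_km2,
        "nearest_m": min(distances) if distances else None,
    }


# Hypothetical (lat, lon) pairs; not real study data.
home = (39.1400, -84.5000)
sources = [(39.1410, -84.5000), (39.1400, -84.4900), (39.2000, -84.5000)]
result = radial_geomarkers(home, sources, radius_m=400)
print(result)
```

Only the first source falls inside the buffer here; the other two are roughly 0.9 km and 6.7 km away.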

ZIP Code

  • utilize addresses without geocoding
  • frequently changing
  • ill-defined boundaries
  • ZIP Code Tabulation Areas (ZCTAs)

Geomarker Challenges

Protected Health Information (PHI)

  • Confidentiality of research subjects must be safeguarded
  • HIPAA-defined “Safe Harbor” provision prohibits sharing of identifiers and quasi-identifiers, such as:
    • time finer than a calendar year
    • spatial boundary with 20,000 or fewer residents
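
The two quasi-identifier rules above can be sketched as a generalization step: dates are coarsened to the calendar year, and an area label is kept only when its population exceeds the threshold. This is a simplified illustration, not a complete Safe Harbor implementation; the function name, field names, and example populations are assumptions.

```python
from datetime import date

POPULATION_THRESHOLD = 20_000  # areas at or below this size are suppressed


def generalize_record(event_date, area_id, area_population):
    """Coarsen a date to its year and suppress small-population areas."""
    # Treating an area with exactly 20,000 residents as too small is the
    # conservative choice made in this sketch.
    area = area_id if area_population > POPULATION_THRESHOLD else None
    return {"year": event_date.year, "area": area}


# Hypothetical records: (date, area label, area population).
kept = generalize_record(date(2019, 3, 14), "452", 35_000)
suppressed = generalize_record(date(2019, 3, 14), "999", 20_000)
print(kept, suppressed)
```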

Sharing Data with PHI

  • Consent often not obtained for unforeseen future analyses
  • Retrospective consent often not feasible + consent bias
  • IRB and institutional DUA approvals can be lengthy and have different requirements
  • Transmission of PHI to a third party often not possible

De- and Re-identification

  • de-identification leaves a small, but non-zero, chance of re-identification
  • re-identification of identifiers ≠ re-identification of quasi-identifiers
    • quasi-identifiers recovered by merging with extant datasets
    • institutional restrictions on sharing of quasi-identifiers
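
One way to reason about the risk from quasi-identifiers is to count how many records share each combination of them; a record whose combination is unique is the easiest to recover by merging with another dataset. The sketch below computes these group sizes (the smallest is the "k" in k-anonymity). The field names and records are made up for illustration.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """For each record, count the records sharing its quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [counts[k] for k in keys]


# Hypothetical records; not real data.
records = [
    {"birth_year": 2010, "zip3": "452", "sex": "F"},
    {"birth_year": 2010, "zip3": "452", "sex": "F"},
    {"birth_year": 2011, "zip3": "452", "sex": "M"},
]
sizes = equivalence_class_sizes(records, ["birth_year", "zip3", "sex"])
k = min(sizes)  # k == 1 means at least one record is unique
print(sizes, k)
```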

Existing Approaches

  • Anonymization: geomasking, date shifting, generalization
  • Independent Geomarker Assessment: differences in methods introduce differential bias
  • Existing Software Approaches: costly, not scalable, not reproducible
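
Two of the anonymization techniques listed above can be sketched in a few lines. Date shifting applies one random offset to all of a participant's dates, so intervals between events are preserved. Geomasking displaces a location by a random distance and bearing. The function names, offset range, and displacement bounds are assumptions chosen for illustration, and the flat-earth degree conversion is only an approximation for short distances.

```python
import math
import random
from datetime import date, timedelta

METERS_PER_DEGREE_LAT = 111_320  # approximate


def shift_dates(dates, rng, max_days=182):
    """Shift all of one participant's dates by the same random offset."""
    offset = timedelta(days=rng.randint(-max_days, max_days))
    return [d + offset for d in dates]


def geomask(lat, lon, rng, min_m=50, max_m=500):
    """Displace a point by a random distance (uniform in [min_m, max_m]) and bearing."""
    distance = rng.uniform(min_m, max_m)
    bearing = rng.uniform(0, 2 * math.pi)
    dlat = distance * math.cos(bearing) / METERS_PER_DEGREE_LAT
    dlon = distance * math.sin(bearing) / (METERS_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon


rng = random.Random(2023)  # seeded only so the example is repeatable
visits = [date(2020, 1, 15), date(2020, 3, 1)]  # hypothetical visit dates
shifted = shift_dates(visits, rng)
masked = geomask(39.14, -84.50, rng)  # hypothetical location
print(shifted, masked)
```

Both techniques trade accuracy for privacy: the shifted dates no longer line up with daily exposure data, and the masked location may fall in a different census tract.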

DeGAUSS

Multi-Site Study

Data → Computation

Data ← Computation

Sharing Computation on Data

DeGAUSS

DEcentralized Geomarker Assessment for mUlti Site Studies

Curated and standardized library for secure, efficient, automated, and reproducible linkage of geomarkers to protected health and geolocation data

DeGAUSS

  • Container framework that reads and writes CSV files
  • No extensive computational resources
  • No geospatial or computing expertise required
  • PHI is never exposed to a third party or the internet
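
The CSV-in, CSV-out contract described above can be illustrated with a small sketch: read a file with coordinate columns, append one geomarker column, and write a new file next to it. This is not DeGAUSS code; the column names `lat` and `lon`, the file names, and the toy geomarker are assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path


def append_geomarker(in_path, out_path, name, compute):
    """Copy a CSV, adding one column computed from each row's lat and lon."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=list(reader.fieldnames) + [name])
        writer.writeheader()
        for row in reader:
            row[name] = compute(float(row["lat"]), float(row["lon"]))
            writer.writerow(row)


def north_of_39(lat, lon):
    """Toy geomarker: 1 if the point lies north of 39 degrees latitude, else 0."""
    return int(lat > 39.0)


# Hypothetical input written to a temporary directory so the sketch is runnable.
workdir = Path(tempfile.mkdtemp())
in_path = workdir / "my_address_file_geocoded.csv"
in_path.write_text("id,lat,lon\n1,39.14,-84.50\n2,38.90,-84.40\n")
out_path = workdir / "my_address_file_geocoded_toy.csv"

append_geomarker(in_path, out_path, "toy_geomarker", north_of_39)
with open(out_path, newline="") as f:
    output_rows = list(csv.DictReader(f))
print(output_rows)
```

Because the step only touches local files, the same computation can be shipped to each site and run where the data already live.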

Sample Workflow

DeGAUSS

  • Free and open source
  • Automated and continuous documentation and integration
  • Metadata curation and integration
  • Community support and contributions

DeGAUSS Development

DeGAUSS Applications

eMERGE Network

site                    N        median distance (m)   median income (USD)
All                     61,866   10,761                57,750
Cincinnati Children’s   7,018    3,342                 56,656
Columbia                3,029    1,200                 49,750
Marshfield              19,808   39,625                64,611
Mayo Clinic             9,622    12,116                59,743
Vanderbilt              22,389   5,210                 50,143

CREW

ECHO Program

  • DeGAUSS images for exposure assessment of three ambient air pollutants at daily, 1 sq. km resolution
    • 17,587 study participants across 53 cohorts
    • 1,590,931 person-months of follow up time
  • Integrated into DAC infrastructure for secure enclave

Geohash

H3 is a hexagonal hierarchical geospatial indexing system
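
To show how a hierarchical spatial index works, here is a minimal sketch of the classic base-32 geohash encoder. Note that this is the rectangular-cell geohash, not the H3 algorithm, whose cells are hexagons; it is used here because it fits in a few lines. The function name and the test coordinates are choices made for this example.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Encode a point as a geohash by repeatedly bisecting longitude and latitude."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, n_bits = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        n_bits += 1
        if n_bits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, n_bits = 0, 0
    return "".join(chars)


coarse = geohash_encode(57.64911, 10.40744, precision=3)
fine = geohash_encode(57.64911, 10.40744, precision=7)
print(coarse, fine)
```

Each extra character narrows the cell, so a coarse code is always a prefix of the finer code for the same point. Truncating a code is therefore a simple way to trade spatial resolution for a larger, more populous cell.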

Safe Harbor Geohash

Spatiotemporal Air Pollution

Elevated Blood Lead

Community Partners

Conclusion

  1. Curated library that scientists can utilize for secure, efficient, automated, and reproducible linkage of geomarkers to protected health and geolocation data.
  2. A generalized framework for geomarker curation and computation to which exposure scientists can contribute.

FAIR data principles
Reproducible using “computable exposures”
Portable for sharing and mobility of compute

Thank You

📍 https://degauss.org
👨‍💻️ github.com/cole-brokamp
🐦 @cole_brokamp
📧 cole.brokamp@cchmc.org