Air Quality & Public Health Risk Forecasting

An Interpretable Research Framework for Early-Warning Pollution Signals

Overview

Air pollution is one of the most critical environmental determinants of public health. Exposure to elevated ozone levels has been consistently linked to respiratory and cardiovascular diseases. This project develops a research-oriented forecasting framework to analyze air quality dynamics and identify early-warning signals for high-risk regions.

The repository is structured as an academic research baseline, emphasizing methodological rigor, interpretability, and reproducibility.

Key Research Questions

Predictive Persistence: Can counties with chronic high-pollution risk be forecasted years in advance?
Geographic Heterogeneity: How do temporal pollution patterns differ between industrial urban centers and rural regions?
Policy Evaluation: Are the effects of regulation (e.g., the Clean Air Act) uniformly distributed across U.S. states?
Threshold Identification: Can rolling statistics define effective early-warning thresholds for public health?
Health Linkage: How does air quality volatility relate to long-term public health outcomes?

Exploratory Insights & Trends

Our analysis reveals substantial geographic variability: national averages capture the general trend but mask specific "hotspots" that require localized attention.

Figure 1: Comparative analysis of ozone trends across high-impact states.

Figure 2: Statistical frequency of counties exceeding safety thresholds per state.

Methodological Framework

1. Feature Engineering (Signal Extraction)

We focus on extracting temporal signals that act as early-warning indicators:

Lagged Indicators: Capturing historical pollution "memory".
Rolling Statistics: Using 3-year windows to smooth volatility and detect emerging trends.
Normalized Temporal Representations: Accounting for long-term shifts.

Figure 3: 3-year rolling average vs. actual fluctuations.

2. Modeling Strategy

We employ a tiered modeling approach to ensure analytical clarity:

Baseline Forecasting: A persistence model ($Value_{t} = Value_{t-1}$) to establish a reference point.
Machine Learning: Using Random Forest to capture non-linear spatiotemporal patterns.
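As a rough sketch of the two tiers, the example below fits both models on a synthetic county-year panel using scikit-learn. The data, column names, feature set, hyperparameters, and the year-based train/test split are all assumptions for illustration; they are not taken from the project's notebooks, and the printed scores say nothing about the real results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

rng = np.random.default_rng(0)

# Synthetic panel (NOT project data): 40 counties observed over 20 years.
n_counties, n_years = 40, 20
county_level = rng.uniform(5, 40, size=n_counties)
days = np.clip(
    county_level[:, None] + rng.normal(0, 5, size=(n_counties, n_years)), 0, None
)
panel = pd.DataFrame(
    {
        "county": np.repeat(np.arange(n_counties), n_years),
        "year": np.tile(np.arange(2000, 2000 + n_years), n_counties),
        "days": days.ravel(),
    }
)

# Strictly historical features, built per county.
grouped = panel.groupby("county")["days"]
panel["lag_1"] = grouped.shift(1)
panel["roll_mean_3"] = grouped.transform(lambda s: s.shift(1).rolling(3).mean())
panel = panel.dropna()

# Temporal split: train on earlier years, evaluate on later years.
train = panel[panel["year"] < 2015]
test = panel[panel["year"] >= 2015]
features = ["lag_1", "roll_mean_3"]

# Tier 1: persistence baseline, forecast for year t equals the value at t-1.
baseline_pred = test["lag_1"]

# Tier 2: Random Forest on the engineered features.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(train[features], train["days"])
rf_pred = model.predict(test[features])

for name, pred in [("persistence", baseline_pred), ("random forest", rf_pred)]:
    mae = mean_absolute_error(test["days"], pred)
    r2 = r2_score(test["days"], pred)
    print(f"{name}: MAE={mae:.2f} days, R2={r2:.3f}")
```

Splitting by year rather than at random avoids training on observations that come after the ones being predicted, which would otherwise flatter both models.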

Performance Benchmarking

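The Random Forest outperforms the persistence baseline on both metrics: MAE falls from 8.47 to 2.17 days, and R² rises from -0.1593 to 0.6098. The baseline's negative R² means it fits worse than simply predicting the mean, which suggests that the engineered features add predictive value beyond simple persistence.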

Model	MAE (days)	R²	Status
Baseline (Naive)	8.47	-0.1593	Reference
Random Forest	2.17	0.6098	Best Performer
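To make the two metrics in the table concrete, the sketch below computes MAE and R² by hand and shows how a persistence forecast can produce a negative R². The numbers are toy values chosen for illustration, not project data, and the helper names `mae` and `r2` are hypothetical.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average size of the forecast miss."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy series that swings from year to year (illustrative only).
observed = [4, 12, 5, 11]
persistence = [10, 4, 12, 5]  # each forecast is the previous year's value
mean_forecast = [8, 8, 8, 8]  # always predict the mean of `observed`

print(mae(observed, persistence))   # average miss in days
print(r2(observed, persistence))    # negative: worse than the mean forecast
print(r2(observed, mean_forecast))  # zero by construction
```

R² compares a model's squared error with that of a constant mean forecast, so any model whose squared error exceeds that reference scores below zero. A volatile series is exactly where carrying last year's value forward tends to do this.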

Figure 4: Visualizing the baseline model's limitations.

Figure 5: High correlation achieved by the Random Forest model.

Future Research Directions

This framework is designed as an open-ended research baseline. I am actively looking to extend this work in the following directions:

Multimodal Health Integration: Correlating exceedance forecasts with geo-coded public health datasets (e.g., CDC PLACES, hospital admission rates, and respiratory mortality indices) to quantify the health burden.
Causal Inference & Policy Evaluation: Utilizing quasi-experimental designs (e.g., Difference-in-Differences) to evaluate the effectiveness of specific state-level environmental regulations.
Advanced Spatiotemporal Architectures: Transitioning from tree-based ensembles to Graph Neural Networks (GNNs) and LSTMs to capture complex spatial "spillover" effects between neighboring counties.
Early-Warning Decision Support: Developing a probabilistic threshold-based system to support local government decision-making for "Code Red" air quality alerts.
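For the causal inference direction, the core Difference-in-Differences calculation can be sketched in a few lines. This is the textbook two-group, two-period estimator under the parallel-trends assumption, not an analysis the project has run; the function name `did_estimate` and the numbers are hypothetical.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-group, two-period Difference-in-Differences estimate.

    Change in the treated group minus change in the control group.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Toy exceedance-day counts (illustrative only, not project data).
effect = did_estimate(
    treated_pre=[20, 22],   # regulated counties, before the policy
    treated_post=[14, 16],  # regulated counties, after the policy
    control_pre=[18, 20],   # unregulated counties, before
    control_post=[16, 18],  # unregulated counties, after
)
print(effect)
```

In practice the same quantity is usually estimated with a regression that includes group and period effects, which also yields standard errors; the arithmetic above only shows what is being estimated.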

Author & Academic Collaboration

Mariam Zakaria
Machine Learning & Data Science Researcher

Research Interests:

Interpretable Machine Learning in Environmental Science.
Spatio-temporal Risk Modeling.
Data-driven Public Health Policy.

Open for Collaboration: I am actively seeking academic mentorship and collaborative opportunities to refine this framework for potential journal submission or conference presentation. If you are a faculty member or researcher interested in environmental health and predictive modeling, I would welcome the opportunity to discuss this work further.

Project Structure

air-quality-health-risk-forecasting/
├── data/                 # Data documentation & preprocessing logs
├── notebooks/           # Standardized EDA, Feature Engineering, & Baseline Modeling
├── src/                 # Modular Python scripts for pipeline reproducibility
├── results/figures/     # High-fidelity research visualizations for publication
├── research/            # Literature review, abstract drafts, and methodology notes
└── README.md            # Research-centric project documentation
