An Interpretable Research Framework for Early-Warning Pollution Signals
Air pollution is one of the most critical environmental determinants of public health. Exposure to elevated ozone levels has been consistently linked to respiratory and cardiovascular diseases. This project develops a research-oriented forecasting framework to analyze air quality dynamics and identify early-warning signals for high-risk regions.
The repository is structured as an academic research baseline, emphasizing methodological rigor, interpretability, and reproducibility.
- Predictive Persistence: Can counties with chronic high-pollution risk be forecasted years in advance?
- Geographic Heterogeneity: How do temporal pollution patterns differ between industrial urban centers and rural regions?
- Policy Evaluation: Are regulatory effects (e.g., Clean Air Act) uniformly distributed across U.S. states?
- Threshold Identification: Can rolling statistics define effective early-warning thresholds for public health?
- Health Linkage: How does air quality volatility relate to long-term public health outcomes?
Our analysis reveals significant geographic variability. While national averages show general trends, specific "hotspots" require localized attention.
Figure 1: Comparative analysis of ozone trends across high-impact states.
Figure 2: Statistical frequency of counties exceeding safety thresholds per state.
We focus on extracting temporal signals that act as early-warning indicators:
- Lagged Indicators: Capturing historical pollution "memory".
- Rolling Statistics: Using 3-year windows to smooth volatility and detect emerging trends.
- Normalized Temporal Representations: Accounting for long-term shifts.
Figure 3: 3-year rolling average vs. actual fluctuations.
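The three feature families above could be built roughly as follows. This is a minimal sketch, not the repository's actual pipeline: the column names `county`, `year`, and `exceedance_days`, the helper `add_temporal_features`, and the toy data are all hypothetical. Shifting by one year before taking the rolling mean is an assumption made here so that a row's features only use earlier years, and centering on the same-year cross-county mean is just one possible reading of "normalized temporal representation".

```python
import pandas as pd

def add_temporal_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add lagged, rolling, and year-centered features per county (illustrative only)."""
    df = df.sort_values(["county", "year"]).copy()
    by_county = df.groupby("county")["exceedance_days"]

    # Lagged indicator: last year's value for the same county.
    df["lag_1"] = by_county.shift(1)

    # Rolling statistic: mean of the three preceding years (current year excluded).
    df["roll_mean_3"] = by_county.transform(lambda s: s.shift(1).rolling(3).mean())

    # Normalized representation (assumption): lagged value minus that year's
    # cross-county mean of lagged values, which removes a shared long-term shift.
    df["lag_1_centered"] = df["lag_1"] - df.groupby("year")["lag_1"].transform("mean")
    return df

# Hypothetical toy panel: two counties, five years each.
toy = pd.DataFrame({
    "county": ["A"] * 5 + ["B"] * 5,
    "year": list(range(2015, 2020)) * 2,
    "exceedance_days": [10, 12, 8, 14, 9, 2, 3, 1, 4, 2],
})
features = add_temporal_features(toy)
print(features)
```

The first year of each county has no lag, and the first three years have no rolling mean, so those rows carry missing values and would normally be dropped before modeling.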
We employ a tiered modeling approach to ensure analytical clarity:
- Baseline Forecasting: A persistence model ($Value_{t} = Value_{t-1}$) to establish a reference point.
- Machine Learning: Using Random Forest to capture non-linear spatiotemporal patterns.
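The two tiers might be compared along these lines. This is a sketch on synthetic data, assuming scikit-learn is available; the panel, the two features, the hyperparameters, and the train/test split by year are all illustrative choices rather than the settings used in this repository.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n_counties, n_years = 40, 12

# Synthetic panel (hypothetical): a county-specific level plus yearly noise.
levels = rng.uniform(2, 20, size=n_counties)
panel = levels[:, None] + rng.normal(0, 3, size=(n_counties, n_years))

# For each target year t, build features from earlier years only:
# the value at t-1 and the mean over t-3..t-1.
rows = []
for t in range(3, n_years):
    lag_1 = panel[:, t - 1]
    roll_3 = panel[:, t - 3:t].mean(axis=1)
    rows.append(np.column_stack([np.full(n_counties, t), lag_1, roll_3, panel[:, t]]))
data = np.vstack(rows)
year, X, y = data[:, 0], data[:, 1:3], data[:, 3]

# Temporal split: train on earlier years, evaluate on the last two.
train = year < n_years - 2
test = ~train

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[train], y[train])
rf_pred = model.predict(X[test])

# Persistence baseline: predict the previous year's value unchanged.
baseline_pred = X[test][:, 0]

rf_mae = mean_absolute_error(y[test], rf_pred)
baseline_mae = mean_absolute_error(y[test], baseline_pred)
print(f"baseline MAE: {baseline_mae:.2f}  random forest MAE: {rf_mae:.2f}")
print("feature importances (lag_1, roll_3):", model.feature_importances_)
```

Splitting by year rather than at random keeps future observations out of training, and `feature_importances_` gives a first, coarse view of which temporal signal the forest relies on.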
Relative to the persistence baseline, the Random Forest reduces MAE from 8.47 to 2.17 days and raises R² from -0.1593 to 0.6098, which suggests that the engineered temporal features carry predictive signal.
| Model | MAE (Error in Days) | R² Score | Status |
|---|---|---|---|
| Baseline (Naive) | 8.47 Days | -0.1593 | Reference |
| Random Forest | 2.17 Days | 0.6098 | Best Performer |
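A negative R² for the baseline is not an error. With R² defined as 1 - SS_res / SS_tot, the score drops below zero whenever a model's squared error exceeds that of always predicting the mean of the observed values. The sketch below computes both metrics for a persistence forecast; the numbers are made up for illustration and are not taken from the project's data.

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical values for one county: the current year and the year before.
y_true = np.array([5.0, 9.0, 4.0, 8.0])
y_prev = np.array([9.0, 4.0, 8.0, 5.0])  # persistence forecast: Value_t = Value_{t-1}

baseline_mae = mae(y_true, y_prev)
baseline_r2 = r2(y_true, y_prev)
print(baseline_mae, baseline_r2)
```

In this toy series the values swing from year to year, so last year's value is a worse guess than the overall mean and R² comes out negative.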
Figure 4: Visualizing the baseline model's limitations.
Figure 5: High correlation achieved by the Random Forest model.
This framework is designed as an open-ended research baseline. I am actively looking to extend this work in the following directions:
- Multimodal Health Integration: Correlating exceedance forecasts with geo-coded public health datasets (e.g., CDC PLACES, hospital admission rates, and respiratory mortality indices) to quantify the health burden.
- Causal Inference & Policy Evaluation: Utilizing quasi-experimental designs (e.g., Difference-in-Differences) to evaluate the effectiveness of specific state-level environmental regulations.
- Advanced Spatiotemporal Architectures: Transitioning from tree-based ensembles to Graph Neural Networks (GNNs) and LSTMs to capture complex spatial "spillover" effects between neighboring counties.
- Early-Warning Decision Support: Developing a probabilistic threshold-based system to support local government decision-making for "Code Red" air quality alerts.
Mariam Zakaria
Machine Learning & Data Science Researcher
Research Interests:
- Interpretable Machine Learning in Environmental Science.
- Spatio-temporal Risk Modeling.
- Data-driven Public Health Policy.
Open for Collaboration: I am actively seeking academic mentorship and collaborative opportunities to refine this framework for potential journal submission or conference presentation. If you are a faculty member or researcher interested in environmental health and predictive modeling, I would welcome the opportunity to discuss this work further.
air-quality-health-risk-forecasting/
├── data/ # Data documentation & preprocessing logs
├── notebooks/ # Standardized EDA, Feature Engineering, & Baseline Modeling
├── src/ # Modular Python scripts for pipeline reproducibility
├── results/figures/ # High-fidelity research visualizations for publication
├── research/ # Literature review, abstract drafts, and methodology notes
└── README.md # Research-centric project documentation
