Total events: 405,586
Years: 2020–2025
Tornado-only: 8,106
The pipeline starts from data/processed/storm_events.csv, the merged output of 01_data_preparation.qmd and 01_merge.qmd. The upstream notebooks parse raw NOAA Storm Events records, collapse event locations to one centroid per EVENT_ID, convert damage strings with K/M/B suffixes into numeric values, and attach the most recent ISD station observation taken within 100 km of the event and no more than 90 minutes before event start.
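The suffix conversion can be sketched as follows. This is a minimal illustration, not the code in the upstream notebooks: the helper name `parse_damage`, the handling of blank or malformed strings (returned as `None`), and the tolerance for lowercase suffixes and thousands separators are all assumptions.

```python
from typing import Optional

# Multipliers for the K/M/B suffixes (thousand, million, billion).
SUFFIX_MULTIPLIER = {"K": 1e3, "M": 1e6, "B": 1e9}


def parse_damage(raw: Optional[str]) -> Optional[float]:
    """Convert a damage string such as '2.5K' into a plain number.

    Hypothetical helper: returns None for missing, blank, or
    unparseable input instead of guessing a value.
    """
    if raw is None:
        return None
    text = raw.strip().upper().replace(",", "")
    if not text:
        return None

    multiplier = 1.0
    if text[-1] in SUFFIX_MULTIPLIER:
        multiplier = SUFFIX_MULTIPLIER[text[-1]]
        text = text[:-1]

    try:
        return float(text) * multiplier
    except ValueError:
        # Covers a bare suffix ("K") and non-numeric text ("abc").
        return None
```

Returning `None` rather than zero keeps "no damage reported" distinguishable from "zero damage", which matters if the target is later log-transformed.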
Raw inputs are two public NOAA/NCEI sources. The Storm Events files provide timestamps, location, event type, magnitude, damage, injury/death counts, tornado-specific measurements, and narrative text. The ISD station files provide hourly surface weather fields such as wind, ceiling, visibility, temperature, dew point, and sea-level pressure.
The 100 km / 90-minute rule is a practical availability heuristic, not proof that the station fully represents the storm environment. It makes the start-time meteorological panel usable for S1 and for any at-event context variables. S2 is different: it is a retrospective EF-severity classification task, so later sections explicitly decide which post-event tornado descriptors may be used.
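As a rough sketch of how the 100 km / 90-minute rule could be applied to one event, the snippet below filters candidate observations by great-circle distance and time lag, then keeps the most recent one. The `Observation` type, the function names, the haversine distance, the inclusion of observations stamped exactly at event start, and the nearest-station tie-break are assumptions made for illustration; the upstream notebooks may implement the rule differently.

```python
import math
from datetime import datetime, timedelta
from typing import NamedTuple, Optional, Sequence


class Observation(NamedTuple):
    """Hypothetical container for one ISD station observation."""
    station_id: str
    lat: float
    lon: float
    time: datetime


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def match_observation(
    event_lat: float,
    event_lon: float,
    event_start: datetime,
    observations: Sequence[Observation],
    max_km: float = 100.0,
    max_lag: timedelta = timedelta(minutes=90),
) -> Optional[Observation]:
    """Return the most recent observation inside both windows, or None."""
    candidates = []
    for obs in observations:
        lag = event_start - obs.time
        # Keep only observations at or before event start, within max_lag.
        if not (timedelta(0) <= lag <= max_lag):
            continue
        dist = haversine_km(event_lat, event_lon, obs.lat, obs.lon)
        if dist <= max_km:
            candidates.append((obs.time, -dist, obs))
    if not candidates:
        return None
    # Most recent first; ties go to the nearer station (assumption).
    return max(candidates, key=lambda c: (c[0], c[1]))[2]
```

Events with no observation inside both windows get `None`, which makes the coverage cost of the heuristic visible instead of silently borrowing a distant or stale reading.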
Target: LOG_DAMAGE · Rows: 67,687 · Columns: 44
Target: TOR_F_SCALE_NUM (EF0/1/2/3p) · Rows: 8,097 · Columns: 37
Total events: 405,586
Tornado-only: 8,106
EF3+ share: 3.1%
EDA summarizes the merged Storm Events data used to build the S1 and S2 task datasets.
Scenario 1 predicts LOG_DAMAGE and compares statistical regressors with feature-partitioning models using test RMSE.
Task rows: 67,687
Primary metric: RMSE (lower = better)
Best model: rf, RMSE = 1.393
H1 verdict: PASS
Statistical regressors (linear · polynomial · ridge): best is polynomial, RMSE = 1.716
Feature-partitioning models (random forest · xgboost): best is rf, RMSE = 1.393
Saved tuning choice for the damage_reg winner (rf). Engines without tunable parameters are reported as such.
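For reference, test RMSE is the square root of the mean squared difference between predicted and observed target values, and the group comparison picks the lower score. The sketch below is illustrative only: the function name `rmse` and the dictionary layout are assumptions, and the two scores are the best-per-group values reported for Scenario 1.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between observed and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Best test RMSE per model group, as reported for Scenario 1.
best_by_group = {"polynomial": 1.716, "rf": 1.393}

# Lower RMSE is better, so the winner is the minimum.
winner = min(best_by_group, key=best_by_group.get)
gap = best_by_group["polynomial"] - best_by_group["rf"]
```

Because the target is LOG_DAMAGE, these errors are in log units rather than dollars, so the gap between the two groups should be read on that scale.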
Scenario 2 predicts ordinal TOR_F_SCALE_NUM and compares parametric with non-parametric classifiers using test quadratic weighted kappa (QWK).
Task rows: 8,097
Primary metric: QWK (higher = better)
EF3+ share: 3.1%
H2 verdict: PASS
Parametric classifiers (logistic · LDA · naive Bayes): best is lda, QWK = 0.523
Non-parametric classifiers (CART · random forest · k-NN): best is rf, QWK = 0.581
Saved tuning choice for the tornado_ef winner (rf). Engines without tunable parameters are reported as such.
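QWK (quadratic weighted kappa) rewards predictions that land close to the true ordinal class and penalizes misses by the squared class distance, which suits an EF-scale target. The following is a minimal standard-library sketch of the usual definition, not the notebook's own scoring code. It assumes classes are encoded as integers 0 to `n_classes - 1` (for example 0, 1, 2, 3 for EF0, EF1, EF2, EF3p), and the function name is hypothetical.

```python
from typing import Sequence


def quadratic_weighted_kappa(
    y_true: Sequence[int], y_pred: Sequence[int], n_classes: int
) -> float:
    """Cohen's kappa with quadratic disagreement weights."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if n_classes < 2:
        raise ValueError("need at least two classes")

    n = len(y_true)
    k = n_classes

    # Observed confusion counts: rows are true classes, columns predictions.
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two margins.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0.0:
        # Every label falls in a single class; kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected
```

A score of 1 means perfect agreement, 0 means agreement no better than chance given the class margins, and negative values mean systematic disagreement. With a rare top class (EF3+ share of 3.1%), this is more informative than plain accuracy.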
Scenario 3 applies feature selection to the S1/S2 winners and checks whether RFE keeps fewer than half of candidate predictors.
Verdicts and summary tables are loaded from the notebook summary artifact.