{"baselines_definition":{"always_one":"predict_proba=1; recall=1, precision=base_rate","always_zero":"predict_proba=0; trivial F1=0 baseline","base_rate_constant":"predict_proba=train_pos_rate; well-calibrated, no resolution","country_base_rate":"predict_proba=country's historical positive rate; no temporal signal","predict_yesterday":"predict_proba=y[t-1] per (country, ...) group; tough on autocorrelated targets","random_with_base_rate":"Bernoulli sample with p=base_rate; sanity floor"},"generated_at":"2026-05-21T20:06:13+00:00","honest_caveats":["Lift is computed against the predict_yesterday baseline (lag-1 of the label within each country group). This is the toughest trivial baseline on autocorrelated targets like 'will country X be in shutdown next week'.","Some models report training-time sidecar metrics (random or stratified CV) and we compute baselines on a separate last-60d temporal holdout. Lift numbers in those rows are approximate, NOT a clean apples-to-apples test. Look at the `source` field per row to see which are live re-eval.","AUC of None on always_zero / always_one is by design (degenerate constants); those rows still have F1 + Brier.","barely_beats_baseline=true is a yellow flag, not red. Some models (e.g. classifier_v3.3) have legitimate value beyond F1 lift (per-country thresholds, conformal intervals, SHAP attributions).","Per-platform/domain/method are a random sample of 5 each, seeded 20260521 for reproducibility. 
Re-running will pick the same items."],"n_models":23,"rows":[{"barely_beats_baseline":true,"barely_beats_metric":"f1","baseline_metrics":{"always_one":{"auc":null,"brier":0.8428571428571429,"f1":0.2716049382716049},"always_zero":{"auc":null,"brier":0.15714285714285714,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.05204872646733112,"best_threshold":0.05,"brier":0.14349375589428126,"f1":0.2716049382716049},"country_base_rate":{"auc":0.901615020259088,"best_threshold":0.1,"brier":0.12757851144148047,"f1":0.6037735849056604},"predict_yesterday":{"auc":0.9490669405923643,"best_threshold":0.05,"brier":0.026984126984126985,"f1":0.9141414141414141,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":0.49148262283855504,"brier":0.1992063492063492,"f1":0.0599250936329588}},"family":"forecast","honest_caveats":["Model F1 beats predict_yesterday by less than 5.0pp (lift=2.29pp). Honest reading: the model is mostly memorizing the prior."],"horizon":"7d","lift_vs_predict_yesterday":{"auc_pp":4.506695961498208,"brier_pp":0.7957636451272516,"f1_pp":2.2886293667150115},"model_id":"forecast_7day","model_metrics":{"auc":0.9941339002073464,"best_threshold":0.3,"brier":0.01902649053285447,"f1":0.9370277078085643,"n":1260,"pos_rate":0.15714285714285714},"notes":"XGBoost + isotonic calibrator. 
Train base rate 0.052; holdout pos rate 0.157","source":"live re-eval on temporal holdout (1260 rows, last 60 days)"},{"barely_beats_baseline":true,"barely_beats_metric":"f1","baseline_metrics":{"always_one":{"auc":null,"brier":0.7087301587301588,"f1":0.45113706207744314},"always_zero":{"auc":null,"brier":0.2912698412698413,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.012774111134766872,"best_threshold":0.5,"brier":0.28399159253995093,"f1":0},"country_base_rate":{"auc":0.7287546799051662,"best_threshold":0.5,"brier":0.2818553770656089,"f1":0},"predict_yesterday":{"auc":0.7154983812944152,"best_threshold":0.05,"brier":0.23492063492063492,"f1":0.5967302452316077,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":0.5056174118408084,"brier":0.2912698412698413,"f1":0.03674540682414698}},"family":"forecast-multi-horizon","honest_caveats":["Model F1 beats predict_yesterday by less than 5.0pp (lift=-6.48pp). Honest reading: the model is mostly memorizing the prior."],"horizon":"1d","lift_vs_predict_yesterday":{"auc_pp":22.63208387888257,"brier_pp":21.123963803076972,"f1_pp":-6.48153516145864},"model_id":"forecast_1d","model_metrics":{"auc":0.9418192200832409,"best_threshold":0.28,"brier":0.023680996889865188,"f1":0.5319148936170213,"n":2303,"pos_rate":0.03560573165436387},"notes":"XGBoost+isotonic per horizon. 
Note: sidecar uses random split, baselines use temporal split, so lift is approximate.","source":"multi-horizon sidecar (random-split test) + baselines on last-60d of training_data_multi"},{"barely_beats_baseline":true,"barely_beats_metric":"f1","baseline_metrics":{"always_one":{"auc":null,"brier":0.38015873015873014,"f1":0.7653111219990201},"always_zero":{"auc":null,"brier":0.6198412698412699,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.05734156553828685,"best_threshold":0.05,"brier":0.5520439873837751,"f1":0.7653111219990201},"country_base_rate":{"auc":0.7255018056717606,"best_threshold":0.05,"brier":0.5439987025643375,"f1":0.7478460654796094},"predict_yesterday":{"auc":0.9376902905380661,"best_threshold":0.05,"brier":0.05873015873015873,"f1":0.9526248399487837,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":0.4956428645893199,"brier":0.6087301587301587,"f1":0.1091753774680604}},"family":"forecast-multi-horizon","honest_caveats":["Model F1 beats predict_yesterday by less than 5.0pp (lift=-12.23pp). Honest reading: the model is mostly memorizing the prior."],"horizon":"7d","lift_vs_predict_yesterday":{"auc_pp":3.9474134554004614,"brier_pp":3.069461005352674,"f1_pp":-12.234961059098548},"model_id":"forecast_7d","model_metrics":{"auc":0.9771644250920707,"best_threshold":0.6200000000000001,"brier":0.02803554867663199,"f1":0.8302752293577982,"n":2303,"pos_rate":0.1033434650455927},"notes":"XGBoost+isotonic per horizon. 
Note: sidecar uses random split, baselines use temporal split, so lift is approximate.","source":"multi-horizon sidecar (random-split test) + baselines on last-60d of training_data_multi"},{"barely_beats_baseline":true,"barely_beats_metric":"f1","baseline_metrics":{"always_one":{"auc":null,"brier":0.2507936507936508,"f1":0.8566243194192378},"always_zero":{"auc":null,"brier":0.7492063492063492,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.17706337378468526,"best_threshold":0.05,"brier":0.5152437798395979,"f1":0.8566243194192378},"country_base_rate":{"auc":0.6573897768719159,"best_threshold":0.05,"brier":0.5017825691447151,"f1":0.8805970149253731},"predict_yesterday":{"auc":0.9577612100407636,"best_threshold":0.05,"brier":0.031746031746031744,"f1":0.9788135593220338,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":0.4945558892941429,"brier":0.6674603174603174,"f1":0.2756244616709733}},"family":"forecast-multi-horizon","honest_caveats":["Model F1 beats predict_yesterday by less than 5.0pp (lift=-17.40pp). Honest reading: the model is mostly memorizing the prior."],"horizon":"30d","lift_vs_predict_yesterday":{"auc_pp":-0.09022415213534307,"brier_pp":-2.881501741384603,"f1_pp":-17.398084556738702},"model_id":"forecast_30d","model_metrics":{"auc":0.9568589685194102,"best_threshold":0.34,"brier":0.060561049159877776,"f1":0.8048327137546468,"n":2303,"pos_rate":0.22405557967867998},"notes":"XGBoost+isotonic per horizon. 
Note: sidecar uses random split, baselines use temporal split, so lift is approximate.","source":"multi-horizon sidecar (random-split test) + baselines on last-60d of training_data_multi"},{"barely_beats_baseline":true,"barely_beats_metric":"f1","baseline_metrics":{"always_one":{"auc":null,"brier":0,"f1":1},"always_zero":{"auc":null,"brier":1,"f1":0},"base_rate_constant":{"auc":null,"base_rate":0.26339391078593344,"best_threshold":0.05,"brier":0.5425885306672414,"f1":1},"country_base_rate":{"auc":null,"best_threshold":0.05,"brier":0,"f1":1},"predict_yesterday":{"auc":null,"best_threshold":0.05,"brier":0.09012875536480687,"f1":0.952808988764045,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":null,"brier":0.7476394849785408,"f1":0.4030157642220699}},"family":"classifier","honest_caveats":["Model F1 beats predict_yesterday by less than 5.0pp (lift=-21.04pp). Honest reading: the model is mostly memorizing the prior.","Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. They may overstate live performance because the original train/test split is the model author's, not this script's."],"horizon":null,"lift_vs_predict_yesterday":{"auc_pp":null,"brier_pp":null,"f1_pp":-21.042794549692868},"model_id":"classifier_v3.3","model_metrics":{"auc":null,"best_threshold":0.336,"brier":null,"f1":0.7423810432671163,"n":4237,"pos_rate":0.26339391078593344},"notes":"Classifier v3.3 (regime-weighted contagion). 
Note: sidecar reports training-time CV F1; baselines are computed on the rolling holdout, so lift is approximate.","source":"training-time sidecar (5-fold stratified CV F1 from v3.7 stacking sidecar)"},{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.7087301587301588,"f1":0.45113706207744314},"always_zero":{"auc":null,"brier":0.2912698412698413,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.012774111134766872,"best_threshold":0.5,"brier":0.28399159253995093,"f1":0},"country_base_rate":{"auc":0.7287546799051662,"best_threshold":0.5,"brier":0.2818553770656089,"f1":0},"predict_yesterday":{"auc":0.7154983812944152,"best_threshold":0.05,"brier":0.23492063492063492,"f1":0.5967302452316077,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":0.5056174118408084,"brier":0.2912698412698413,"f1":0.03674540682414698}},"family":"trajectory","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. They may overstate live performance because the original train/test split is the model author's, not this script's."],"horizon":"1d","lift_vs_predict_yesterday":{"auc_pp":5.400161870558473,"brier_pp":null,"f1_pp":null},"model_id":"trajectory_d1","model_metrics":{"auc":0.7695,"best_threshold":null,"brier":null,"f1":null,"in_sample_auc":0.9655,"n":15351,"pos_rate":0.0099},"notes":"Trajectory v1 per-horizon GradientBoosting. 
LOCO AUC reported (no F1 in sidecar).","source":"training-time LOCO sidecar (trajectory_metrics.json) + baselines on last-60d temporal split of forecast_training_data_multi"},{"barely_beats_baseline":true,"barely_beats_metric":"auc","baseline_metrics":{"always_one":{"auc":null,"brier":0.38015873015873014,"f1":0.7653111219990201},"always_zero":{"auc":null,"brier":0.6198412698412699,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.05734156553828685,"best_threshold":0.05,"brier":0.5520439873837751,"f1":0.7653111219990201},"country_base_rate":{"auc":0.7255018056717606,"best_threshold":0.05,"brier":0.5439987025643375,"f1":0.7478460654796094},"predict_yesterday":{"auc":0.9376902905380661,"best_threshold":0.05,"brier":0.05873015873015873,"f1":0.9526248399487837,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":0.4956428645893199,"brier":0.6087301587301587,"f1":0.1091753774680604}},"family":"trajectory","honest_caveats":["Model AUC beats predict_yesterday by less than 5.0pp (lift=-21.19pp). Honest reading: the model is mostly memorizing the prior.","Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. They may overstate live performance because the original train/test split is the model author's, not this script's."],"horizon":"7d","lift_vs_predict_yesterday":{"auc_pp":-21.189029053806607,"brier_pp":null,"f1_pp":null},"model_id":"trajectory_d7","model_metrics":{"auc":0.7258,"best_threshold":null,"brier":null,"f1":null,"in_sample_auc":0.9396,"n":15351,"pos_rate":0.05205},"notes":"Trajectory v1 per-horizon GradientBoosting. 
LOCO AUC reported (no F1 in sidecar).","source":"training-time LOCO sidecar (trajectory_metrics.json) + baselines on last-60d temporal split of forecast_training_data_multi"},{"barely_beats_baseline":true,"barely_beats_metric":"auc","baseline_metrics":{"always_one":{"auc":null,"brier":0.2507936507936508,"f1":0.8566243194192378},"always_zero":{"auc":null,"brier":0.7492063492063492,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.17706337378468526,"best_threshold":0.05,"brier":0.5152437798395979,"f1":0.8566243194192378},"country_base_rate":{"auc":0.6573897768719159,"best_threshold":0.05,"brier":0.5017825691447151,"f1":0.8805970149253731},"predict_yesterday":{"auc":0.9577612100407636,"best_threshold":0.05,"brier":0.031746031746031744,"f1":0.9788135593220338,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":0.4945558892941429,"brier":0.6674603174603174,"f1":0.2756244616709733}},"family":"trajectory","honest_caveats":["Model AUC beats predict_yesterday by less than 5.0pp (lift=-22.78pp). Honest reading: the model is mostly memorizing the prior.","Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. They may overstate live performance because the original train/test split is the model author's, not this script's."],"horizon":"30d","lift_vs_predict_yesterday":{"auc_pp":-22.776121004076366,"brier_pp":null,"f1_pp":null},"model_id":"trajectory_d30","model_metrics":{"auc":0.73,"best_threshold":null,"brier":null,"f1":null,"in_sample_auc":0.9047,"n":15351,"pos_rate":0.15751},"notes":"Trajectory v1 per-horizon GradientBoosting. 
LOCO AUC reported (no F1 in sidecar).","source":"training-time LOCO sidecar (trajectory_metrics.json) + baselines on last-60d temporal split of forecast_training_data_multi"},{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.7842565597667639,"f1":0.354916067146283},"always_zero":{"auc":null,"brier":0.21574344023323616,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.21574344023323616,"best_threshold":0.05,"brier":0.1691982082295642,"f1":0.354916067146283},"country_base_rate":{"auc":0.5,"best_threshold":0.05,"brier":0.1691982082295642,"f1":0.354916067146283},"predict_yesterday":{"auc":0.6467647945343113,"best_threshold":0.05,"brier":0.239067055393586,"f1":0.44594594594594594,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":0.47631367426906457,"brier":0.35276967930029157,"f1":0.17687074829931973}},"family":"survival","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. They may overstate live performance because the original train/test split is the model author's, not this script's."],"horizon":null,"lift_vs_predict_yesterday":{"auc_pp":null,"brier_pp":null,"f1_pp":null},"model_id":"duration_rsf","model_metrics":{"auc":null,"brier":null,"c_index":null,"f1":null,"n":343,"pos_rate":0.21574344023323616},"notes":"Survival model evaluates event_observed; baselines computed on same flag. 
c-index is the canonical RSF metric; not directly comparable to F1/AUC.","source":"training-time survival_features.meta.json sidecar (no fitted RSF in /ml-deploy)"},{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.23,"f1":0.8700564971751412},"always_zero":{"auc":null,"brier":0.77,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.767774936061381,"best_threshold":0.05,"brier":0.17710495090953093,"f1":0.8700564971751412},"country_base_rate":{"auc":0.5,"best_threshold":0.05,"brier":0.17710495090953093,"f1":0.8700564971751412},"predict_yesterday":{"auc":0.5109260304912479,"best_threshold":0.05,"brier":0.347,"f1":0.774528914879792,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":1,"brier":0,"f1":1}},"family":"per-platform","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. They may overstate live performance because the original train/test split is the model author's, not this script's.","AUC at 0.999+ is a code smell: the model may be reconstructing the labeling rule from a leaked feature. 
Treat with suspicion."],"horizon":"7d","lift_vs_predict_yesterday":{"auc_pp":48.90739695087522,"brier_pp":null,"f1_pp":null},"model_id":"per_platform_instagram","model_metrics":{"auc":1,"best_threshold":null,"brier":null,"f1":null,"n":495,"pos_rate":0.767774936061381},"notes":"Random sample 5 from 9 available per_platform.","source":"training-time sidecar metrics; baselines synthesized from a 1000-sample Bernoulli at the model's positive rate (no feature parquet found)"},{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.496,"f1":0.6702127659574468},"always_zero":{"auc":null,"brier":0.504,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.5127582017010935,"best_threshold":0.05,"brier":0.25006070609703707,"f1":0.6702127659574468},"country_base_rate":{"auc":0.5,"best_threshold":0.05,"brier":0.25006070609703707,"f1":0.6702127659574468},"predict_yesterday":{"auc":0.49996799795186886,"best_threshold":0.05,"brier":0.5,"f1":0.503968253968254,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":1,"brier":0,"f1":1}},"family":"per-platform","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. They may overstate live performance because the original train/test split is the model author's, not this script's.","AUC at 0.999+ is a code smell: the model may be reconstructing the labeling rule from a leaked feature. 
Treat with suspicion."],"horizon":"7d","lift_vs_predict_yesterday":{"auc_pp":50.00320020481311,"brier_pp":null,"f1_pp":null},"model_id":"per_platform_twitter","model_metrics":{"auc":1,"best_threshold":null,"brier":null,"f1":null,"n":1420,"pos_rate":0.5127582017010935},"notes":"Random sample 5 from 9 available per_platform.","source":"training-time sidecar metrics; baselines synthesized from a 1000-sample Bernoulli at the model's positive rate (no feature parquet found)"},{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.244,"f1":0.8610478359908884},"always_zero":{"auc":null,"brier":0.756,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.7567293042153377,"best_threshold":0.05,"brier":0.18446453188463854,"f1":0.8610478359908884},"country_base_rate":{"auc":0.5,"best_threshold":0.05,"brier":0.18446453188463854,"f1":0.8610478359908884},"predict_yesterday":{"auc":0.5114385462746118,"best_threshold":0.05,"brier":0.361,"f1":0.7610853739245532,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":1,"brier":0,"f1":1}},"family":"per-platform","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. They may overstate live performance because the original train/test split is the model author's, not this script's.","AUC at 0.999+ is a code smell: the model may be reconstructing the labeling rule from a leaked feature. 
Treat with suspicion."],"horizon":"7d","lift_vs_predict_yesterday":{"auc_pp":48.85614537253882,"brier_pp":null,"f1_pp":null},"model_id":"per_platform_signal","model_metrics":{"auc":1,"best_threshold":null,"brier":null,"f1":null,"n":506,"pos_rate":0.7567293042153377},"notes":"Random sample 5 from 9 available per_platform.","source":"training-time sidecar metrics; baselines synthesized from a 1000-sample Bernoulli at the model's positive rate (no feature parquet found)"},{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.261,"f1":0.8499137435307648},"always_zero":{"auc":null,"brier":0.739,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.735655737704918,"best_threshold":0.05,"brier":0.19289018409029826,"f1":0.8499137435307648},"country_base_rate":{"auc":0.5,"best_threshold":0.05,"brier":0.19289018409029826,"f1":0.8499137435307648},"predict_yesterday":{"auc":0.5197481322487155,"best_threshold":0.05,"brier":0.371,"f1":0.7488151658767772,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":1,"brier":0,"f1":1}},"family":"per-platform","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. They may overstate live performance because the original train/test split is the model author's, not this script's.","AUC at 0.999+ is a code smell: the model may be reconstructing the labeling rule from a leaked feature. 
Treat with suspicion."],"horizon":"7d","lift_vs_predict_yesterday":{"auc_pp":48.02518677512845,"brier_pp":null,"f1_pp":null},"model_id":"per_platform_telegram","model_metrics":{"auc":1,"best_threshold":null,"brier":null,"f1":null,"n":502,"pos_rate":0.735655737704918},"notes":"Random sample 5 from 9 available per_platform.","source":"training-time sidecar metrics; baselines synthesized from a 1000-sample Bernoulli at the model's positive rate (no feature parquet found)"},{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.232,"f1":0.8687782805429864},"always_zero":{"auc":null,"brier":0.768,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.7665706051873199,"best_threshold":0.05,"brier":0.1781780431695305,"f1":0.8687782805429864},"country_base_rate":{"auc":0.5,"best_threshold":0.05,"brier":0.1781780431695305,"f1":0.8687782805429864},"predict_yesterday":{"auc":0.5138739224137931,"best_threshold":0.05,"brier":0.347,"f1":0.7739413680781759,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":1,"brier":0,"f1":1}},"family":"per-platform","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. They may overstate live performance because the original train/test split is the model author's, not this script's.","AUC at 0.999+ is a code smell: the model may be reconstructing the labeling rule from a leaked feature. 
Treat with suspicion."],"horizon":"7d","lift_vs_predict_yesterday":{"auc_pp":48.61260775862068,"brier_pp":null,"f1_pp":null},"model_id":"per_platform_whatsapp","model_metrics":{"auc":1,"best_threshold":null,"brier":null,"f1":null,"n":482,"pos_rate":0.7665706051873199},"notes":"Random sample 5 from 9 available per_platform.","source":"training-time sidecar metrics; baselines synthesized from a 1000-sample Bernoulli at the model's positive rate (no feature parquet found)"},{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.438,"f1":0.7195902688860435},"always_zero":{"auc":null,"brier":0.562,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.5666666666666667,"best_threshold":0.05,"brier":0.24617777777777777,"f1":0.7195902688860435},"country_base_rate":{"auc":0.5,"best_threshold":0.05,"brier":0.24617777777777777,"f1":0.7195902688860435},"predict_yesterday":{"auc":0.4953647280586295,"best_threshold":0.05,"brier":0.497,"f1":0.5574354407836153,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":1,"brier":0,"f1":1}},"family":"per-domain","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. They may overstate live performance because the original train/test split is the model author's, not this script's.","AUC at 0.999+ is a code smell: the model may be reconstructing the labeling rule from a leaked feature. 
Treat with suspicion."],"horizon":"7d","lift_vs_predict_yesterday":{"auc_pp":50.46352719413705,"brier_pp":null,"f1_pp":null},"model_id":"per_domain_chat.mistral.ai","model_metrics":{"auc":1,"best_threshold":null,"brier":null,"f1":null,"n":30,"pos_rate":0.5666666666666667},"notes":"Random sample 5 from 28 available loco_per_domain.","source":"training-time sidecar metrics; baselines synthesized from a 1000-sample Bernoulli at the model's positive rate (no feature parquet found)"},{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.101,"f1":0.94681411269089},"always_zero":{"auc":null,"brier":0.899,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.903755868544601,"best_threshold":0.05,"brier":0.09082161828561354,"f1":0.94681411269089},"country_base_rate":{"auc":0.5,"best_threshold":0.05,"brier":0.09082161828561354,"f1":0.94681411269089},"predict_yesterday":{"auc":0.5093503232414454,"best_threshold":0.05,"brier":0.179,"f1":0.9003895381190874,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":1,"brier":0,"f1":1}},"family":"per-domain","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. They may overstate live performance because the original train/test split is the model author's, not this script's.","AUC at 0.999+ is a code smell: the model may be reconstructing the labeling rule from a leaked feature. 
Treat with suspicion."],"horizon":"7d","lift_vs_predict_yesterday":{"auc_pp":49.04596229099643,"brier_pp":null,"f1_pp":null},"model_id":"per_domain_nytimes.com","model_metrics":{"auc":0.9998099461514096,"best_threshold":null,"brier":null,"f1":null,"n":852,"pos_rate":0.903755868544601},"notes":"Random sample 5 from 28 available loco_per_domain.","source":"training-time sidecar metrics; baselines synthesized from a 1000-sample Bernoulli at the model's positive rate (no feature parquet found)"},{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.575,"f1":0.5964912280701754},"always_zero":{"auc":null,"brier":0.425,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.4225352112676056,"best_threshold":0.05,"brier":0.24438107518349536,"f1":0.5964912280701754},"country_base_rate":{"auc":0.5,"best_threshold":0.05,"brier":0.24438107518349536,"f1":0.5964912280701754},"predict_yesterday":{"auc":0.4925831202046036,"best_threshold":0.05,"brier":0.496,"f1":0.4164705882352941,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":1,"brier":0,"f1":1}},"family":"per-domain","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. They may overstate live performance because the original train/test split is the model author's, not this script's.","AUC at 0.999+ is a code smell: the model may be reconstructing the labeling rule from a leaked feature. 
Treat with suspicion."],"horizon":"7d","lift_vs_predict_yesterday":{"auc_pp":50.74168797953964,"brier_pp":null,"f1_pp":null},"model_id":"per_domain_pornhub.com","model_metrics":{"auc":1,"best_threshold":null,"brier":null,"f1":null,"n":852,"pos_rate":0.4225352112676056},"notes":"Random sample 5 from 28 available loco_per_domain.","source":"training-time sidecar metrics; baselines synthesized from a 1000-sample Bernoulli at the model's positive rate (no feature parquet found)"},{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.094,"f1":0.950682056663169},"always_zero":{"auc":null,"brier":0.906,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.9084507042253521,"best_threshold":0.05,"brier":0.08517000595120017,"f1":0.950682056663169},"country_base_rate":{"auc":0.5,"best_threshold":0.05,"brier":0.08517000595120017,"f1":0.950682056663169},"predict_yesterday":{"auc":0.5062819970879714,"best_threshold":0.05,"brier":0.169,"f1":0.9066813914964108,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":1,"brier":0,"f1":1}},"family":"per-domain","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. They may overstate live performance because the original train/test split is the model author's, not this script's.","AUC at 0.999+ is a code smell: the model may be reconstructing the labeling rule from a leaked feature. 
Treat with suspicion."],"horizon":"7d","lift_vs_predict_yesterday":{"auc_pp":49.325421175056306,"brier_pp":null,"f1_pp":null},"model_id":"per_domain_psiphon.ca","model_metrics":{"auc":0.9995362088385344,"best_threshold":null,"brier":null,"f1":null,"n":852,"pos_rate":0.9084507042253521},"notes":"Random sample 5 from 28 available loco_per_domain.","source":"training-time sidecar metrics; baselines synthesized from a 1000-sample Bernoulli at the model's positive rate (no feature parquet found)"},{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.093,"f1":0.951232302045097},"always_zero":{"auc":null,"brier":0.907,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.9107981220657277,"best_threshold":0.05,"brier":0.08436542573122616,"f1":0.951232302045097},"country_base_rate":{"auc":0.5,"best_threshold":0.05,"brier":0.08436542573122616,"f1":0.951232302045097},"predict_yesterday":{"auc":0.5074569358988038,"best_threshold":0.05,"brier":0.167,"f1":0.9078874793160507,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":1,"brier":0,"f1":1}},"family":"per-domain","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. They may overstate live performance because the original train/test split is the model author's, not this script's.","AUC at 0.999+ is a code smell: the model may be reconstructing the labeling rule from a leaked feature. 
Treat with suspicion."],"horizon":"7d","lift_vs_predict_yesterday":{"auc_pp":49.206829470347515,"brier_pp":null,"f1_pp":null},"model_id":"per_domain_protonvpn.com","model_metrics":{"auc":0.9995252306022789,"best_threshold":null,"brier":null,"f1":null,"n":852,"pos_rate":0.9107981220657277},"notes":"Random sample 5 from 28 available loco_per_domain.","source":"training-time sidecar metrics; baselines synthesized from a 1000-sample Bernoulli at the model's positive rate (no feature parquet found)"},{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.957,"f1":0.08245445829338446},"always_zero":{"auc":null,"brier":0.043,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.045787113523719614,"best_threshold":0.5,"brier":0.0411587680017941,"f1":0},"country_base_rate":{"auc":0.5,"best_threshold":0.5,"brier":0.0411587680017941,"f1":0},"predict_yesterday":{"auc":0.477533960292581,"best_threshold":0.5,"brier":0.086,"f1":0,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":1,"brier":0,"f1":1}},"family":"per-method","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. 
They may overstate live performance because the original train/test split is the model author's, not this script's."],"horizon":null,"lift_vs_predict_yesterday":{"auc_pp":47.466084927560296,"brier_pp":null,"f1_pp":48.394004282655246},"model_id":"per_method_dns-blocking","model_metrics":{"auc":0.9521948095681839,"best_threshold":0.5499999999999999,"brier":null,"f1":0.48394004282655245,"n":4237,"pos_rate":0.045787113523719614},"notes":"Random sample 5 from 4 available methods.","source":"training-time sidecar metrics; baselines synthesized from a 1000-sample Bernoulli at the model's positive rate (no feature parquet found)"},{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.879,"f1":0.215878679750223},"always_zero":{"auc":null,"brier":0.121,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.12154826528203917,"best_threshold":0.05,"brier":0.10635930059481948,"f1":0.215878679750223},"country_base_rate":{"auc":0.5,"best_threshold":0.05,"brier":0.10635930059481948,"f1":0.215878679750223},"predict_yesterday":{"auc":0.515790859259677,"best_threshold":0.05,"brier":0.206,"f1":0.1487603305785124,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":1,"brier":0,"f1":1}},"family":"per-method","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. 
They may overstate live performance because the original train/test split is the model author's, not this script's."],"horizon":null,"lift_vs_predict_yesterday":{"auc_pp":38.50820846311468,"brier_pp":null,"f1_pp":42.14596355805401},"model_id":"per_method_http-blocking","model_metrics":{"auc":0.9008729438908238,"best_threshold":0.6,"brier":null,"f1":0.5702199661590525,"n":4237,"pos_rate":0.12154826528203917},"notes":"Random sample 5 from 4 available methods.","source":"training-time sidecar metrics; baselines synthesized from a 1000-sample Bernoulli at the model's positive rate (no feature parquet found)"},{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.96,"f1":0.07692307692307693},"always_zero":{"auc":null,"brier":0.04,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.03847061600188813,"best_threshold":0.5,"brier":0.03840233901541367,"f1":0},"country_base_rate":{"auc":0.5,"best_threshold":0.5,"brier":0.03840233901541367,"f1":0},"predict_yesterday":{"auc":0.4791666666666667,"best_threshold":0.5,"brier":0.08,"f1":0,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":1,"brier":0,"f1":1}},"family":"per-method","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. 
They may overstate live performance because the original train/test split is the model author's, not this script's."],"horizon":null,"lift_vs_predict_yesterday":{"auc_pp":42.335831103120555,"brier_pp":null,"f1_pp":38.869257950530034},"model_id":"per_method_tcp-blocking","model_metrics":{"auc":0.9025249776978722,"best_threshold":0.7999999999999999,"brier":null,"f1":0.38869257950530034,"n":4237,"pos_rate":0.03847061600188813},"notes":"Random sample 5 from 4 available methods.","source":"training-time sidecar metrics; baselines synthesized from a 1000-sample Bernoulli at the model's positive rate (no feature parquet found)"},{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.923,"f1":0.14298978644382543},"always_zero":{"auc":null,"brier":0.077,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.07387302336558886,"best_threshold":0.05,"brier":0.07108077798287214,"f1":0.14298978644382543},"country_base_rate":{"auc":0.5,"best_threshold":0.05,"brier":0.07108077798287214,"f1":0.14298978644382543},"predict_yesterday":{"auc":0.49346428219667654,"best_threshold":0.05,"brier":0.144,"f1":0.06493506493506493,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":1,"brier":0,"f1":1}},"family":"per-method","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. 
They may overstate live performance because the original train/test split is the model author's, not this script's."],"horizon":null,"lift_vs_predict_yesterday":{"auc_pp":42.41845930118256,"brier_pp":null,"f1_pp":44.111820383006815},"model_id":"per_method_tls-blocking","model_metrics":{"auc":0.9176488752085021,"best_threshold":0.49999999999999994,"brier":null,"f1":0.5060532687651331,"n":4237,"pos_rate":0.07387302336558886},"notes":"Random sample 5 from 4 available methods.","source":"training-time sidecar metrics; baselines synthesized from a 1000-sample Bernoulli at the model's positive rate (no feature parquet found)"}],"schema":"voidly-baseline-benchmark/v1","summary":{"barely_beats_threshold_pp_f1":5,"best_model_by_lift":{"family":"per-domain","lift_metric":"auc","lift_pp":50.74168797953964,"model_id":"per_domain_pornhub.com"},"n_models_barely_beating_baseline":7,"n_models_evaluated":23,"toughest_baseline":"predict_yesterday","worst_model_by_lift":{"family":"trajectory","lift_metric":"auc","lift_pp":-22.776121004076366,"model_id":"trajectory_d30"}}}