{"baselines_definition":{"always_one":"predict_proba=1; recall=1, precision=base_rate","always_zero":"predict_proba=0; trivial F1=0 baseline","base_rate_constant":"predict_proba=train_pos_rate; well-calibrated, no resolution","country_base_rate":"predict_proba=country's historical positive rate; no temporal signal","predict_yesterday":"predict_proba=y[t-1] per (country, ...) group; tough on autocorrelated targets","random_with_base_rate":"Bernoulli sample with p=base_rate; sanity floor"},"generated_at":"2026-05-21T20:06:13+00:00","honest_caveats":["Lift is computed against the predict_yesterday baseline (lag-1 of the label within each country group). This is the toughest trivial baseline on autocorrelated targets like 'will country X be in shutdown next week'.","Some models report training-time sidecar metrics (random or stratified CV) and we compute baselines on a separate last-60d temporal holdout. Lift numbers in those rows are approximate, NOT a clean apples-to-apples test. Look at the `source` field per row to see which are live re-eval.","AUC of None on always_zero / always_one is by design (degenerate constants); those rows still have F1 + Brier.","barely_beats_baseline=true is a yellow flag, not red. Some models (e.g. classifier_v3.3) have legitimate value beyond F1 lift (per-country thresholds, conformal intervals, SHAP attributions).","Per-platform/domain/method are a random sample of 5 each, seeded 20260521 for reproducibility. 
Re-running will pick the same items."],"n_models":23,"n_rows_filtered":5,"rows":[{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.438,"f1":0.7195902688860435},"always_zero":{"auc":null,"brier":0.562,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.5666666666666667,"best_threshold":0.05,"brier":0.24617777777777777,"f1":0.7195902688860435},"country_base_rate":{"auc":0.5,"best_threshold":0.05,"brier":0.24617777777777777,"f1":0.7195902688860435},"predict_yesterday":{"auc":0.4953647280586295,"best_threshold":0.05,"brier":0.497,"f1":0.5574354407836153,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":1,"brier":0,"f1":1}},"family":"per-domain","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. They may overstate live performance because the original train/test split is the model author's, not this script's.","AUC at 0.999+ is a code smell: the model may be reconstructing the labeling rule from a leaked feature. 
Treat with suspicion."],"horizon":"7d","lift_vs_predict_yesterday":{"auc_pp":50.46352719413705,"brier_pp":null,"f1_pp":null},"model_id":"per_domain_chat.mistral.ai","model_metrics":{"auc":1,"best_threshold":null,"brier":null,"f1":null,"n":30,"pos_rate":0.5666666666666667},"notes":"Random sample 5 from 28 available loco_per_domain.","source":"training-time sidecar metrics; baselines synthesized from a 1000-sample Bernoulli at the model's positive rate (no feature parquet found)"},{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.101,"f1":0.94681411269089},"always_zero":{"auc":null,"brier":0.899,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.903755868544601,"best_threshold":0.05,"brier":0.09082161828561354,"f1":0.94681411269089},"country_base_rate":{"auc":0.5,"best_threshold":0.05,"brier":0.09082161828561354,"f1":0.94681411269089},"predict_yesterday":{"auc":0.5093503232414454,"best_threshold":0.05,"brier":0.179,"f1":0.9003895381190874,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":1,"brier":0,"f1":1}},"family":"per-domain","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. They may overstate live performance because the original train/test split is the model author's, not this script's.","AUC at 0.999+ is a code smell: the model may be reconstructing the labeling rule from a leaked feature. 
Treat with suspicion."],"horizon":"7d","lift_vs_predict_yesterday":{"auc_pp":49.04596229099643,"brier_pp":null,"f1_pp":null},"model_id":"per_domain_nytimes.com","model_metrics":{"auc":0.9998099461514096,"best_threshold":null,"brier":null,"f1":null,"n":852,"pos_rate":0.903755868544601},"notes":"Random sample 5 from 28 available loco_per_domain.","source":"training-time sidecar metrics; baselines synthesized from a 1000-sample Bernoulli at the model's positive rate (no feature parquet found)"},{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.575,"f1":0.5964912280701754},"always_zero":{"auc":null,"brier":0.425,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.4225352112676056,"best_threshold":0.05,"brier":0.24438107518349536,"f1":0.5964912280701754},"country_base_rate":{"auc":0.5,"best_threshold":0.05,"brier":0.24438107518349536,"f1":0.5964912280701754},"predict_yesterday":{"auc":0.4925831202046036,"best_threshold":0.05,"brier":0.496,"f1":0.4164705882352941,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":1,"brier":0,"f1":1}},"family":"per-domain","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. They may overstate live performance because the original train/test split is the model author's, not this script's.","AUC at 0.999+ is a code smell: the model may be reconstructing the labeling rule from a leaked feature. 
Treat with suspicion."],"horizon":"7d","lift_vs_predict_yesterday":{"auc_pp":50.74168797953964,"brier_pp":null,"f1_pp":null},"model_id":"per_domain_pornhub.com","model_metrics":{"auc":1,"best_threshold":null,"brier":null,"f1":null,"n":852,"pos_rate":0.4225352112676056},"notes":"Random sample 5 from 28 available loco_per_domain.","source":"training-time sidecar metrics; baselines synthesized from a 1000-sample Bernoulli at the model's positive rate (no feature parquet found)"},{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.094,"f1":0.950682056663169},"always_zero":{"auc":null,"brier":0.906,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.9084507042253521,"best_threshold":0.05,"brier":0.08517000595120017,"f1":0.950682056663169},"country_base_rate":{"auc":0.5,"best_threshold":0.05,"brier":0.08517000595120017,"f1":0.950682056663169},"predict_yesterday":{"auc":0.5062819970879714,"best_threshold":0.05,"brier":0.169,"f1":0.9066813914964108,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":1,"brier":0,"f1":1}},"family":"per-domain","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. They may overstate live performance because the original train/test split is the model author's, not this script's.","AUC at 0.999+ is a code smell: the model may be reconstructing the labeling rule from a leaked feature. 
Treat with suspicion."],"horizon":"7d","lift_vs_predict_yesterday":{"auc_pp":49.325421175056306,"brier_pp":null,"f1_pp":null},"model_id":"per_domain_psiphon.ca","model_metrics":{"auc":0.9995362088385344,"best_threshold":null,"brier":null,"f1":null,"n":852,"pos_rate":0.9084507042253521},"notes":"Random sample 5 from 28 available loco_per_domain.","source":"training-time sidecar metrics; baselines synthesized from a 1000-sample Bernoulli at the model's positive rate (no feature parquet found)"},{"barely_beats_baseline":false,"barely_beats_metric":null,"baseline_metrics":{"always_one":{"auc":null,"brier":0.093,"f1":0.951232302045097},"always_zero":{"auc":null,"brier":0.907,"f1":0},"base_rate_constant":{"auc":0.5,"base_rate":0.9107981220657277,"best_threshold":0.05,"brier":0.08436542573122616,"f1":0.951232302045097},"country_base_rate":{"auc":0.5,"best_threshold":0.05,"brier":0.08436542573122616,"f1":0.951232302045097},"predict_yesterday":{"auc":0.5074569358988038,"best_threshold":0.05,"brier":0.167,"f1":0.9078874793160507,"note":"lag-1 of label within each (country, ...) group; falls back to global lag-1 if grouping unavailable"},"random_with_base_rate":{"auc":1,"brier":0,"f1":1}},"family":"per-domain","honest_caveats":["Metrics pulled from the model's TRAINING-TIME sidecar, not from a held-out 30-day window. They may overstate live performance because the original train/test split is the model author's, not this script's.","AUC at 0.999+ is a code smell: the model may be reconstructing the labeling rule from a leaked feature. 
Treat with suspicion."],"horizon":"7d","lift_vs_predict_yesterday":{"auc_pp":49.206829470347515,"brier_pp":null,"f1_pp":null},"model_id":"per_domain_protonvpn.com","model_metrics":{"auc":0.9995252306022789,"best_threshold":null,"brier":null,"f1":null,"n":852,"pos_rate":0.9107981220657277},"notes":"Random sample 5 from 28 available loco_per_domain.","source":"training-time sidecar metrics; baselines synthesized from a 1000-sample Bernoulli at the model's positive rate (no feature parquet found)"}],"schema":"voidly-baseline-benchmark/v1","summary":{"barely_beats_threshold_pp_f1":5,"best_model_by_lift":{"family":"per-domain","lift_metric":"auc","lift_pp":50.74168797953964,"model_id":"per_domain_pornhub.com"},"n_models_barely_beating_baseline":7,"n_models_evaluated":23,"toughest_baseline":"predict_yesterday","worst_model_by_lift":{"family":"trajectory","lift_metric":"auc","lift_pp":-22.776121004076366,"model_id":"trajectory_d30"}}}