Statistical Analysis

statistical_analysis.py computes publication-ready statistics from validation experiment and Pareto sweep results. It implements bootstrap confidence intervals, pairwise significance tests, effect sizes, and multiple comparison corrections.

Analysis Modes

| Mode | Input | Purpose |
|---|---|---|
| validation | validation-summary.json | Compare conditions within a validation experiment |
| pareto | pareto-summary.json | Analyze injection-weight ablation sweep |
| aggregate | Multiple sweep directories | Cross-condition comparisons for scale/framework analysis |

CLI Usage

Validation Analysis

python harness/statistical_analysis.py \
  --input reports/validation-4.1/validation-summary.json \
  --output reports/validation-4.1/statistics.json \
  --mode validation

Pareto Analysis

python harness/statistical_analysis.py \
  --input reports/pareto-qwen7b/pareto-summary.json \
  --output reports/pareto-qwen7b/pareto-statistics.json \
  --mode pareto

Aggregate Analysis

Aggregate mode reads sweep directories and groups results by model and condition. When a sweep directory contains a manifest.json (written by run_model_sweep.sh), the model tag and condition knobs (authority style, optimizer, target framework, etc.) are read from the manifest. This prevents silent merging of different experimental conditions that happen to share a directory-name pattern. Sweep directories without a manifest fall back to directory-name parsing.

python harness/statistical_analysis.py \
  --output reports/aggregate-statistics.json \
  --mode aggregate \
  --sweep-dirs reports/sweep-qwen7b reports/sweep-qwen32b reports/sweep-qwen72b

All Flags

| Flag | Default | Description |
|---|---|---|
| --input | (none) | Input JSON file (required for validation/pareto; unused by aggregate) |
| --output | statistics.json | Output statistics JSON |
| --mode | validation | Analysis mode: validation, pareto, or aggregate |
| --sweep-dirs | (none) | Space-separated sweep directories (aggregate mode) |
| --n-bootstrap | 10000 | Number of bootstrap resamples |
| --ci | 0.95 | Confidence interval level |

Statistical Tests

Bootstrap Confidence Intervals

Computes confidence intervals via nonparametric percentile bootstrap (10,000 resamples by default):

bootstrap_ci(values, n_bootstrap=10000, ci=0.95)

  • Returns: mean, CI lower bound, CI upper bound, standard deviation, sample size
  • Used for: continuous metrics (injection rates, retrieval rates)

bootstrap_proportion_ci(successes, total, n_bootstrap=10000, ci=0.95)

  • Returns: proportion, CI bounds, sample size
  • Used for: binary outcomes (injection detected per trial)
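As a sketch of what the percentile recipe looks like: only the `bootstrap_ci` name and signature above come from the harness; the body below is an illustrative stand-in (the `seed` parameter is an addition for reproducibility).

```python
import random
import statistics

def bootstrap_ci(values, n_bootstrap=10000, ci=0.95, seed=0):
    """Nonparametric percentile bootstrap CI for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_bootstrap)
    )
    lo_idx = int((1 - ci) / 2 * n_bootstrap)        # e.g. 2.5th percentile
    hi_idx = int((1 + ci) / 2 * n_bootstrap) - 1    # e.g. 97.5th percentile
    return {
        "mean": statistics.fmean(values),
        "ci_lower": means[lo_idx],
        "ci_upper": means[hi_idx],
        "std": statistics.stdev(values),
        "n": n,
    }

result = bootstrap_ci([0.4, 0.5, 0.6, 0.7, 0.5, 0.6], n_bootstrap=2000)
```

`bootstrap_proportion_ci` follows the same pattern, resampling 0/1 trial outcomes instead of continuous values.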

Cohen's h Effect Size

Measures the practical significance of the difference between two proportions:

$$ h = 2 \arcsin\sqrt{p_1} - 2 \arcsin\sqrt{p_2} $$

| $\lvert h\rvert$ | Interpretation |
|---|---|
| < 0.2 | Negligible |
| 0.2–0.5 | Small |
| 0.5–0.8 | Medium |
| > 0.8 | Large |

Cohen's h is reported alongside p-values to distinguish statistically significant but practically meaningless differences from genuinely meaningful effects.
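The formula is a one-liner; the sketch below (function names are illustrative, not the harness's) applies it to the injection rates from the validation example output, where baseline = 0.50 and bayesian = 0.73:

```python
import math

def cohens_h(p1, p2):
    """Effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def interpret_h(h):
    """Map |h| onto the conventional interpretation bands."""
    a = abs(h)
    if a < 0.2:
        return "negligible"
    if a < 0.5:
        return "small"
    if a < 0.8:
        return "medium"
    return "large"

h = cohens_h(0.73, 0.50)  # ≈ 0.48, a "small" effect
```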

Fisher's Exact Test

Tests independence in a 2×2 contingency table:

| | Injection Success | Injection Failure |
|---|---|---|
| Condition A | $a$ | $b$ |
| Condition B | $c$ | $d$ |

Returns odds ratio and exact p-value. Preferred over chi-squared when expected cell counts are small (<5).
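In practice this is usually a call to `scipy.stats.fisher_exact`; as a dependency-free illustration of what the two-sided test computes (not the harness's implementation), one can sum hypergeometric probabilities over all tables with the observed margins that are no more likely than the observed table:

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def p_table(x):
        # Hypergeometric probability of a table with cell (1,1) equal to x.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = p_table(a)
    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    # Two-sided p-value: sum over tables at least as extreme as observed.
    p_value = sum(p for x in range(lo, hi + 1) if (p := p_table(x)) <= p_obs + 1e-12)
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, p_value

odds, p = fisher_exact_2x2(8, 2, 1, 5)
```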

McNemar's Test

Tests for paired differences between two conditions on the same corpus:

  • If the discordant pair count (n = b + c) is < 25: exact binomial test
  • Otherwise: chi-squared approximation

Here $b$ counts trials that succeed under condition A but fail under condition B, and $c$ counts the reverse.
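The exact-binomial branch can be sketched as follows (illustrative, not the harness's code): under the null hypothesis each discordant pair is equally likely to land in $b$ or $c$, so the smaller count follows a Binomial(b + c, 0.5) distribution.

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact binomial McNemar test on discordant pair counts b and c.

    Two-sided p-value: double the lower binomial tail of min(b, c)
    out of b + c trials with success probability 0.5, capped at 1.0.
    """
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For n ≥ 25 the chi-squared approximation is used instead, as noted above.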

Bonferroni Correction

Adjusts p-values for multiple comparisons:

$$ \alpha_{\text{adjusted}} = \frac{\alpha}{k} $$

where $k$ is the number of comparisons. Applied automatically when comparing across multiple payload categories within an experiment.
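Equivalently, each p-value can be multiplied by $k$ and compared against the original $\alpha$; a sketch producing rows shaped like the `bonferroni` entries in the output format below (the function name is illustrative):

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: scale each p-value by the number of comparisons."""
    k = len(p_values)
    return [
        {
            "p_original": p,
            "p_corrected": min(1.0, p * k),  # cap at 1.0
            "significant": p * k < alpha,
            "alpha": alpha,
            "n_comparisons": k,
        }
        for p in p_values
    ]

rows = bonferroni([0.042, 0.30])  # 0.042 no longer significant after correction
```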

Variance Decomposition (ANOVA)

One-way ANOVA with intraclass correlation coefficient (ICC) for comparing between-group and within-group variance:

  • Between-group variance: Differences across conditions, models, or weight values
  • Within-group variance: Run-to-run variation under the same condition
  • F-test: Statistical significance of group differences
  • ICC: Proportion of total variance attributable to group membership
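A minimal sketch of the decomposition, assuming the common ICC(1) estimator with the average group size (the exact estimator the harness uses is an assumption here):

```python
from statistics import fmean

def anova_icc(groups):
    """One-way ANOVA F statistic and ICC(1) from a list of value lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = fmean(v for g in groups for v in g)
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((v - fmean(g)) ** 2 for v in g) for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    f_stat = ms_between / ms_within
    n_bar = n_total / k  # average group size; assumes roughly balanced groups
    icc = (ms_between - ms_within) / (ms_between + (n_bar - 1) * ms_within)
    return f_stat, icc

# Two well-separated conditions: large F, ICC near 1.
f_stat, icc = anova_icc([[0.10, 0.20, 0.15], [0.70, 0.80, 0.75]])
```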

Output Format

Validation Mode

{
  "experiment": "4.1",
  "comparisons": {
    "redirect": {
      "baseline": {
        "injection": {"mean": 0.50, "ci_lower": 0.32, "ci_upper": 0.68, "std": 0.18, "n": 30},
        "retrieval": {"mean": 0.65, "ci_lower": 0.48, "ci_upper": 0.82, "std": 0.15, "n": 30}
      },
      "bayesian": {
        "injection": {"mean": 0.73, "ci_lower": 0.56, "ci_upper": 0.87, "std": 0.14, "n": 30},
        "retrieval": {"mean": 0.58, "ci_lower": 0.41, "ci_upper": 0.75, "std": 0.16, "n": 30}
      },
      "effect_size": {
        "cohens_h": 0.48,
        "interpretation": "small",
        "direction": "baseline vs bayesian"
      },
      "fisher_exact": {"odds_ratio": 2.1, "p_value": 0.042},
      "mcnemar": {"statistic": 4.0, "p_value": 0.063, "method": "exact_binomial", "b": 5, "c": 1}
    }
  },
  "bonferroni": [
    {"p_original": 0.042, "p_corrected": 0.084, "significant": false, "alpha": 0.05, "n_comparisons": 2}
  ]
}

Pareto Mode

{
  "sweep": "pareto",
  "weight_analysis": {
    "0.0000": {
      "weight": 0.0,
      "injection": {"mean": 0.50, "ci_lower": 0.30, "ci_upper": 0.70, "std": 0.12, "n": 10},
      "retrieval": {"mean": 0.75, "ci_lower": 0.55, "ci_upper": 0.90, "std": 0.10, "n": 10}
    },
    "0.1000": { "..." : "..." }
  },
  "vs_baseline": [
    {
      "weight": 0.1,
      "cohens_h": 0.15,
      "interpretation": "negligible",
      "fisher_exact": {"odds_ratio": 1.3, "p_value": 0.45}
    }
  ],
  "optimal_weight": {
    "weight": 0.3,
    "combined_score": 0.65,
    "note": "Geometric mean of injection and retrieval rates"
  }
}

See Also