Statistical Analysis¶
statistical_analysis.py computes publication-ready statistics from validation-experiment and Pareto-sweep results. It implements bootstrap confidence intervals, pairwise significance tests, effect sizes, and multiple-comparison corrections.
Analysis Modes¶
| Mode | Input | Purpose |
|---|---|---|
| `validation` | `validation-summary.json` | Compare conditions within a validation experiment |
| `pareto` | `pareto-summary.json` | Analyze the injection-weight ablation sweep |
| `aggregate` | Multiple sweep directories | Cross-condition comparisons for scale/framework analysis |
CLI Usage¶
Validation Analysis¶
```bash
python harness/statistical_analysis.py \
  --input reports/validation-4.1/validation-summary.json \
  --output reports/validation-4.1/statistics.json \
  --mode validation
```
Pareto Analysis¶
```bash
python harness/statistical_analysis.py \
  --input reports/pareto-qwen7b/pareto-summary.json \
  --output reports/pareto-qwen7b/pareto-statistics.json \
  --mode pareto
```
Aggregate Analysis¶
Aggregate mode reads sweep directories and groups results by model and condition.
When a sweep directory contains a manifest.json (written by run_model_sweep.sh),
the model tag and condition knobs (authority style, optimizer, target framework, etc.)
are read from the manifest. This prevents silent merging of different experimental
conditions that happen to share a directory-name pattern. Sweep directories without
a manifest fall back to directory-name parsing.
```bash
python harness/statistical_analysis.py \
  --output reports/aggregate-statistics.json \
  --mode aggregate \
  --sweep-dirs reports/sweep-qwen7b reports/sweep-qwen32b reports/sweep-qwen72b
```
All Flags¶
| Flag | Default | Description |
|---|---|---|
| `--input` | — | Input JSON file (required for validation/pareto; unused by aggregate) |
| `--output` | `statistics.json` | Output statistics JSON |
| `--mode` | `validation` | Analysis mode: `validation`, `pareto`, or `aggregate` |
| `--sweep-dirs` | — | Space-separated sweep directories (aggregate mode) |
| `--n-bootstrap` | `10000` | Number of bootstrap resamples |
| `--ci` | `0.95` | Confidence interval level |
Statistical Tests¶
Bootstrap Confidence Intervals¶
Computes confidence intervals via nonparametric percentile bootstrap (10,000 resamples by default):
```python
bootstrap_ci(values, n_bootstrap=10000, ci=0.95)
```
- Returns: mean, CI lower bound, CI upper bound, standard deviation, sample size
- Used for: continuous metrics (injection rates, retrieval rates)
```python
bootstrap_proportion_ci(successes, total, n_bootstrap=10000, ci=0.95)
```
- Returns: proportion, CI bounds, sample size
- Used for: binary outcomes (injection detected per trial)
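A minimal sketch of the percentile bootstrap described above (the `seed` parameter is an assumption added here for reproducibility; the actual implementation in `statistical_analysis.py` may differ):

```python
import numpy as np

def bootstrap_ci(values, n_bootstrap=10000, ci=0.95, seed=0):
    """Nonparametric percentile bootstrap CI for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n = values.size
    # Resample with replacement and record each resample's mean.
    means = rng.choice(values, size=(n_bootstrap, n), replace=True).mean(axis=1)
    alpha = (1.0 - ci) / 2.0
    lo, hi = np.percentile(means, [100 * alpha, 100 * (1 - alpha)])
    return {
        "mean": float(values.mean()),
        "ci_lower": float(lo),
        "ci_upper": float(hi),
        "std": float(values.std(ddof=1)),
        "n": int(n),
    }
```

`bootstrap_proportion_ci` follows the same pattern, resampling a 0/1 vector of per-trial outcomes and reporting the proportion instead of the mean.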
Cohen's h Effect Size¶
Measures the practical significance of the difference between two proportions:
$$ h = 2 \arcsin\sqrt{p_1} - 2 \arcsin\sqrt{p_2} $$
| $\vert h \vert$ | Interpretation |
|---|---|
| < 0.2 | Negligible |
| 0.2–0.5 | Small |
| 0.5–0.8 | Medium |
| > 0.8 | Large |
Cohen's h is reported alongside p-values to distinguish statistically significant but practically meaningless differences from genuinely meaningful effects.
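The formula and thresholds above translate directly to code; a sketch (function names are illustrative, not necessarily those used in the script):

```python
import math

def cohens_h(p1, p2):
    """Cohen's h: effect size for the difference of two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def interpret_h(h):
    """Map |h| onto the conventional interpretation bands."""
    h = abs(h)
    if h < 0.2:
        return "negligible"
    if h < 0.5:
        return "small"
    if h < 0.8:
        return "medium"
    return "large"
```

For example, comparing injection rates of 0.73 and 0.50 gives h ≈ 0.48, a "small" effect, matching the sample output later in this page.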
Fisher's Exact Test¶
Tests independence in a 2×2 contingency table:
| | Injection Success | Injection Failure |
|---|---|---|
| Condition A | $a$ | $b$ |
| Condition B | $c$ | $d$ |
Returns odds ratio and exact p-value. Preferred over chi-squared when expected cell counts are small (<5).
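With SciPy this is a one-liner; the counts below are hypothetical (15/30 successes under condition A vs. 22/30 under condition B):

```python
from scipy.stats import fisher_exact

# Rows: conditions; columns: (successes, failures).
table = [
    [15, 15],  # Condition A
    [22, 8],   # Condition B
]
odds_ratio, p_value = fisher_exact(table)
```

SciPy reports the sample odds ratio $(a/b)/(c/d)$ alongside the exact two-sided p-value.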
McNemar's Test¶
Tests for paired differences between two conditions on the same corpus:
- If the discordant pair count ($n = b + c$) is below 25: exact binomial test
- Otherwise: chi-squared approximation

Here $b$ counts trials that succeed under condition A but fail under condition B, and $c$ counts trials that succeed under B but fail under A.
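A sketch of this decision rule using SciPy (the returned fields are illustrative; the chi-squared branch applies the usual continuity correction, which is an assumption about the script's exact form):

```python
from scipy.stats import binomtest, chi2

def mcnemar(b, c):
    """McNemar's test on discordant pair counts b and c."""
    n = b + c
    if n < 25:
        # Exact: under H0 the b discordant successes are Binomial(n, 0.5).
        p = binomtest(b, n, 0.5).pvalue
        return {"p_value": float(p), "method": "exact_binomial", "b": b, "c": c}
    # Chi-squared approximation with continuity correction.
    stat = (abs(b - c) - 1) ** 2 / n
    return {
        "statistic": stat,
        "p_value": float(chi2.sf(stat, df=1)),
        "method": "chi_squared",
        "b": b,
        "c": c,
    }
```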
Bonferroni Correction¶
Adjusts p-values for multiple comparisons:
$$ \alpha_{\text{adjusted}} = \frac{\alpha}{k} $$
where $k$ is the number of comparisons. Applied automatically when comparing across multiple payload categories within an experiment.
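Equivalently, each p-value can be multiplied by $k$ and compared against the unadjusted $\alpha$; a sketch producing entries shaped like the `bonferroni` array in the sample output below:

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: scale each p-value by the number of comparisons k."""
    k = len(p_values)
    return [
        {
            "p_original": p,
            "p_corrected": min(p * k, 1.0),
            "significant": p * k < alpha,
            "alpha": alpha,
            "n_comparisons": k,
        }
        for p in p_values
    ]
```

With two comparisons, an uncorrected p = 0.042 becomes 0.084 and is no longer significant at α = 0.05.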
Variance Decomposition (ANOVA)¶
One-way ANOVA with intraclass correlation coefficient (ICC) for comparing between-group and within-group variance:
- Between-group variance: Differences across conditions, models, or weight values
- Within-group variance: Run-to-run variation under the same condition
- F-test: Statistical significance of group differences
- ICC: Proportion of total variance attributable to group membership
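The four quantities above can be computed from the standard one-way ANOVA sums of squares; a sketch using the ICC(1) estimator from mean squares (the exact estimator used by the script is an assumption; `n_bar` is exact only for balanced designs):

```python
import numpy as np
from scipy.stats import f as f_dist

def anova_icc(groups):
    """One-way ANOVA F-test plus ICC(1) computed from the mean squares."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    n_total = sum(g.size for g in groups)
    grand_mean = np.concatenate(groups).mean()
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    f_stat = ms_between / ms_within
    p_value = f_dist.sf(f_stat, k - 1, n_total - k)
    n_bar = n_total / k  # average group size (exact for balanced designs)
    icc = (ms_between - ms_within) / (ms_between + (n_bar - 1) * ms_within)
    return {"F": float(f_stat), "p_value": float(p_value), "icc": float(icc)}
```

A large F with ICC near 1 means condition (group membership) explains most of the variance; ICC near 0 means run-to-run noise dominates.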
Output Format¶
Validation Mode¶
```json
{
  "experiment": "4.1",
  "comparisons": {
    "redirect": {
      "baseline": {
        "injection": {"mean": 0.50, "ci_lower": 0.32, "ci_upper": 0.68, "std": 0.18, "n": 30},
        "retrieval": {"mean": 0.65, "ci_lower": 0.48, "ci_upper": 0.82, "std": 0.15, "n": 30}
      },
      "bayesian": {
        "injection": {"mean": 0.73, "ci_lower": 0.56, "ci_upper": 0.87, "std": 0.14, "n": 30},
        "retrieval": {"mean": 0.58, "ci_lower": 0.41, "ci_upper": 0.75, "std": 0.16, "n": 30}
      },
      "effect_size": {
        "cohens_h": 0.48,
        "interpretation": "small",
        "direction": "baseline vs bayesian"
      },
      "fisher_exact": {"odds_ratio": 2.1, "p_value": 0.042},
      "mcnemar": {"statistic": 4.0, "p_value": 0.063, "method": "exact_binomial", "b": 5, "c": 1}
    }
  },
  "bonferroni": [
    {"p_original": 0.042, "p_corrected": 0.084, "significant": false, "alpha": 0.05, "n_comparisons": 2}
  ]
}
```
Pareto Mode¶
```json
{
  "sweep": "pareto",
  "weight_analysis": {
    "0.0000": {
      "weight": 0.0,
      "injection": {"mean": 0.50, "ci_lower": 0.30, "ci_upper": 0.70, "std": 0.12, "n": 10},
      "retrieval": {"mean": 0.75, "ci_lower": 0.55, "ci_upper": 0.90, "std": 0.10, "n": 10}
    },
    "0.1000": { "..." : "..." }
  },
  "vs_baseline": [
    {
      "weight": 0.1,
      "cohens_h": 0.15,
      "interpretation": "negligible",
      "fisher_exact": {"odds_ratio": 1.3, "p_value": 0.45}
    }
  ],
  "optimal_weight": {
    "weight": 0.3,
    "combined_score": 0.65,
    "note": "Geometric mean of injection and retrieval rates"
  }
}
```
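The `optimal_weight` selection, per the "geometric mean" note, can be sketched as below (function names and the exact scoring rule are assumptions based on that note, not confirmed implementation details):

```python
import math

def combined_score(injection_rate, retrieval_rate):
    """Geometric mean of the injection and retrieval rates."""
    return math.sqrt(injection_rate * retrieval_rate)

def pick_optimal_weight(weight_analysis):
    """Return the weight_analysis entry with the highest combined score."""
    best = max(
        weight_analysis.values(),
        key=lambda e: combined_score(e["injection"]["mean"], e["retrieval"]["mean"]),
    )
    return {
        "weight": best["weight"],
        "combined_score": combined_score(
            best["injection"]["mean"], best["retrieval"]["mean"]
        ),
    }
```

The geometric mean rewards weights that keep both rates high: a weight that drives either rate toward zero scores near zero regardless of the other rate.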
See Also¶
- Validation Experiments — generates the input data
- Pareto Sweep — generates the Pareto input data
- Figure Generation — visualizes the statistics