Statistical Analysis

statistical_analysis.py computes publication-ready statistics from validation experiment and Pareto sweep results. It implements bootstrap confidence intervals, pairwise significance tests, effect sizes, and multiple comparison corrections.

Analysis Modes

| Mode | Input | Purpose |
|---|---|---|
| validation | validation-summary.json | Compare conditions within a validation experiment |
| pareto | pareto-summary.json | Analyze injection-weight ablation sweep |
| aggregate | Multiple sweep directories | Cross-condition comparisons for scale/framework analysis |

CLI Usage

Validation Analysis

python harness/statistical_analysis.py \
  --input reports/validation-4.1/validation-summary.json \
  --output reports/validation-4.1/statistics.json \
  --mode validation

Pareto Analysis

python harness/statistical_analysis.py \
  --input reports/pareto-qwen7b/pareto-summary.json \
  --output reports/pareto-qwen7b/pareto-statistics.json \
  --mode pareto

Aggregate Analysis

Aggregate mode reads sweep directories and groups results by model and condition. When a sweep directory contains a manifest.json (written by run_model_sweep.sh), the model tag and condition knobs (authority style, optimizer, target framework, etc.) are read from the manifest. This prevents silent merging of different experimental conditions that happen to share a directory-name pattern. Sweep directories without a manifest fall back to directory-name parsing.

python harness/statistical_analysis.py \
  --output reports/aggregate-statistics.json \
  --mode aggregate \
  --sweep-dirs reports/sweep-qwen7b reports/sweep-qwen32b reports/sweep-qwen72b

All Flags

| Flag | Default | Description |
|---|---|---|
| --input | (none) | Input JSON file (required for validation/pareto; unused by aggregate) |
| --output | statistics.json | Output statistics JSON |
| --mode | validation | Analysis mode: validation, pareto, or aggregate |
| --sweep-dirs | (none) | Space-separated sweep directories (aggregate mode) |
| --n-bootstrap | 10000 | Number of bootstrap resamples |
| --ci | 0.95 | Confidence interval level |

Statistical Tests

Bootstrap Confidence Intervals

Computes confidence intervals via nonparametric percentile bootstrap (10,000 resamples by default):

bootstrap_ci(values, n_bootstrap=10000, ci=0.95)

  • Returns: mean, CI lower bound, CI upper bound, standard deviation, sample size
  • Used for: continuous metrics (injection rates, retrieval rates)

bootstrap_proportion_ci(successes, total, n_bootstrap=10000, ci=0.95)

  • Returns: proportion, CI bounds, sample size
  • Used for: binary outcomes (injection detected per trial)
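As a sketch of what the percentile recipe looks like: only the `bootstrap_ci` name and signature above come from the harness; the body below is an illustrative stand-in (the `seed` parameter is an addition for reproducibility).

```python
import random
import statistics

def bootstrap_ci(values, n_bootstrap=10000, ci=0.95, seed=0):
    """Nonparametric percentile bootstrap CI for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_bootstrap)
    )
    lo_idx = int((1 - ci) / 2 * n_bootstrap)        # e.g. 2.5th percentile
    hi_idx = int((1 + ci) / 2 * n_bootstrap) - 1    # e.g. 97.5th percentile
    return {
        "mean": statistics.fmean(values),
        "ci_lower": means[lo_idx],
        "ci_upper": means[hi_idx],
        "std": statistics.stdev(values),
        "n": n,
    }

result = bootstrap_ci([0.4, 0.5, 0.6, 0.7, 0.5, 0.6], n_bootstrap=2000)
```

`bootstrap_proportion_ci` follows the same pattern, resampling 0/1 trial outcomes instead of continuous values.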

Cohen's h Effect Size

Measures the practical significance of the difference between two proportions:

$$ h = 2 \arcsin\sqrt{p_1} - 2 \arcsin\sqrt{p_2} $$

| $\lvert h\rvert$ | Interpretation |
|---|---|
| < 0.2 | Negligible |
| 0.2–0.5 | Small |
| 0.5–0.8 | Medium |
| > 0.8 | Large |

Cohen's h is reported alongside p-values to distinguish statistically significant but practically meaningless differences from genuinely meaningful effects.
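The formula is a one-liner; the sketch below (function names are illustrative, not the harness's) applies it to the injection rates from the validation example output, where baseline = 0.50 and bayesian = 0.73:

```python
import math

def cohens_h(p1, p2):
    """Effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def interpret_h(h):
    """Map |h| onto the conventional interpretation bands."""
    a = abs(h)
    if a < 0.2:
        return "negligible"
    if a < 0.5:
        return "small"
    if a < 0.8:
        return "medium"
    return "large"

h = cohens_h(0.73, 0.50)  # ≈ 0.48, a "small" effect
```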

Fisher's Exact Test

Tests independence in a 2×2 contingency table:

| | Injection Success | Injection Failure |
|---|---|---|
| Condition A | $a$ | $b$ |
| Condition B | $c$ | $d$ |

Returns odds ratio and exact p-value. Preferred over chi-squared when expected cell counts are small (<5).
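In practice this is usually a call to `scipy.stats.fisher_exact`; as a dependency-free illustration of what the two-sided test computes (not the harness's implementation), one can sum hypergeometric probabilities over all tables with the observed margins that are no more likely than the observed table:

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def p_table(x):
        # Hypergeometric probability of a table with cell (1,1) equal to x.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = p_table(a)
    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    # Two-sided p-value: sum over tables at least as extreme as observed.
    p_value = sum(p for x in range(lo, hi + 1) if (p := p_table(x)) <= p_obs + 1e-12)
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, p_value

odds, p = fisher_exact_2x2(8, 2, 1, 5)
```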

McNemar's Test

Tests for paired differences between two conditions on the same corpus:

  • If the discordant pair count (n = b + c) is < 25: exact binomial test
  • Otherwise: chi-squared approximation

Here $b$ counts trials that succeed under condition A but fail under condition B, and $c$ counts the reverse.
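The exact-binomial branch can be sketched as follows (illustrative, not the harness's code): under the null hypothesis each discordant pair is equally likely to land in $b$ or $c$, so the smaller count follows a Binomial(b + c, 0.5) distribution.

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact binomial McNemar test on discordant pair counts b and c.

    Two-sided p-value: double the lower binomial tail of min(b, c)
    out of b + c trials with success probability 0.5, capped at 1.0.
    """
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For n ≥ 25 the chi-squared approximation is used instead, as noted above.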

Bonferroni Correction

Adjusts p-values for multiple comparisons:

$$ \alpha_{\text{adjusted}} = \frac{\alpha}{k} $$

where $k$ is the number of comparisons. Applied automatically when comparing across multiple payload categories within an experiment.
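Equivalently, each p-value can be multiplied by $k$ and compared against the original $\alpha$; a sketch producing rows shaped like the `bonferroni` entries in the output format below (the function name is illustrative):

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: scale each p-value by the number of comparisons."""
    k = len(p_values)
    return [
        {
            "p_original": p,
            "p_corrected": min(1.0, p * k),  # cap at 1.0
            "significant": p * k < alpha,
            "alpha": alpha,
            "n_comparisons": k,
        }
        for p in p_values
    ]

rows = bonferroni([0.042, 0.30])  # 0.042 no longer significant after correction
```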

Variance Decomposition (ANOVA)

One-way ANOVA with intraclass correlation coefficient (ICC) for comparing between-group and within-group variance:

  • Between-group variance: Differences across conditions, models, or weight values
  • Within-group variance: Run-to-run variation under the same condition
  • F-test: Statistical significance of group differences
  • ICC: Proportion of total variance attributable to group membership
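A minimal sketch of the decomposition, assuming the common ICC(1) estimator with the average group size (the exact estimator the harness uses is an assumption here):

```python
from statistics import fmean

def anova_icc(groups):
    """One-way ANOVA F statistic and ICC(1) from a list of value lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = fmean(v for g in groups for v in g)
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((v - fmean(g)) ** 2 for v in g) for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    f_stat = ms_between / ms_within
    n_bar = n_total / k  # average group size; assumes roughly balanced groups
    icc = (ms_between - ms_within) / (ms_between + (n_bar - 1) * ms_within)
    return f_stat, icc

# Two well-separated conditions: large F, ICC near 1.
f_stat, icc = anova_icc([[0.10, 0.20, 0.15], [0.70, 0.80, 0.75]])
```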

Output Format

Validation Mode

{
  "experiment": "4.1",
  "comparisons": {
    "redirect": {
      "baseline": {
        "injection": {"mean": 0.50, "ci_lower": 0.32, "ci_upper": 0.68, "std": 0.18, "n": 30},
        "retrieval": {"mean": 0.65, "ci_lower": 0.48, "ci_upper": 0.82, "std": 0.15, "n": 30}
      },
      "bayesian": {
        "injection": {"mean": 0.73, "ci_lower": 0.56, "ci_upper": 0.87, "std": 0.14, "n": 30},
        "retrieval": {"mean": 0.58, "ci_lower": 0.41, "ci_upper": 0.75, "std": 0.16, "n": 30}
      },
      "effect_size": {
        "cohens_h": 0.48,
        "interpretation": "small",
        "direction": "baseline vs bayesian"
      },
      "fisher_exact": {"odds_ratio": 2.1, "p_value": 0.042},
      "mcnemar": {"statistic": 4.0, "p_value": 0.063, "method": "exact_binomial", "b": 5, "c": 1}
    }
  },
  "bonferroni": [
    {"p_original": 0.042, "p_corrected": 0.084, "significant": false, "alpha": 0.05, "n_comparisons": 2}
  ]
}

Pareto Mode

{
  "sweep": "pareto",
  "weight_analysis": {
    "0.0000": {
      "weight": 0.0,
      "injection": {"mean": 0.50, "ci_lower": 0.30, "ci_upper": 0.70, "std": 0.12, "n": 10},
      "retrieval": {"mean": 0.75, "ci_lower": 0.55, "ci_upper": 0.90, "std": 0.10, "n": 10}
    },
    "0.1000": { "..." : "..." }
  },
  "vs_baseline": [
    {
      "weight": 0.1,
      "cohens_h": 0.15,
      "interpretation": "negligible",
      "fisher_exact": {"odds_ratio": 1.3, "p_value": 0.45}
    }
  ],
  "optimal_weight": {
    "weight": 0.3,
    "combined_score": 0.65,
    "note": "Geometric mean of injection and retrieval rates"
  }
}

See Also