Skip to content

Validation Experiments

validation_runner.py orchestrates the four controlled experiments that evaluate whether joint optimization, Bayesian hyperparameter tuning, and reward-model guidance improve injection success rates compared to baselines.

Experiments

Experiment Question Model Conditions
4.1 Does Bayesian optimization outperform template baseline? qwen2.5:7b Baseline vs Bayesian best-params
4.2 Does reward-model guidance recover injection rate lost by similarity-only Genetic? qwen2.5:7b Genetic (w=0) vs Genetic (w=0.4)
4.3 Does joint optimization improve ASR at 32B? qwen2.5:32b Baseline vs Bayesian best-params
4.4 Can joint optimization break the 72B barrier? qwen2.5:72b Baseline vs Bayesian best-params

Each experiment runs a configurable number of independent trials per condition (default: 30) and records both injection and retrieval rates per framework.

How It Works

For each condition in an experiment:

  1. Resolves category-aware inputs (query, topic, cover-text) via shared helpers in experiment_utils.py
  2. Builds hemlock CLI flags from the condition specification
  3. Generates a fresh corpus with hemlock batch
  4. Ingests documents into a temporary ChromaDB collection
  5. Runs injection_test.py to measure injection success
  6. Runs retrieval_test.py to measure retrieval success
  7. Records per-framework results
  8. Cleans up the ChromaDB collection
  9. Checkpoints progress to validation-summary.json

CLI Usage

Run Experiment 4.1

python harness/validation_runner.py \
  --experiment 4.1 \
  --config harness/authority-config.json \
  --output-dir reports/validation-4.1 \
  --model qwen2.5:7b \
  --runs 30 \
  --best-params-file reports/bayesian-qwen7b/best-params.json

Run Experiment 4.2 (Reward-Guided)

python harness/validation_runner.py \
  --experiment 4.2 \
  --config harness/authority-config.json \
  --output-dir reports/validation-4.2 \
  --model qwen2.5:7b \
  --runs 30 \
  --injection-weight 0.4 \
  --injection-model-host http://localhost:9090

Resume After Interruption

python harness/validation_runner.py \
  --experiment 4.1 \
  --config harness/authority-config.json \
  --output-dir reports/validation-4.1 \
  --resume

All Flags

Flag Default Description
--experiment (required) Experiment ID: 4.1, 4.2, 4.3, or 4.4
--config (required) Config JSON with pipeline endpoints
--output-dir (required) Output directory
--model Target LLM model
--runs 30 Trials per condition
--best-params-file Path to best-params.json from Bayesian optimizer (required for 4.1, 4.3, 4.4)
--injection-weight 0.4 Injection weight for experiment 4.2
--injection-model-host http://localhost:9090 Reward server URL
--resume false Skip completed runs
--batch-timeout 1800 Timeout in seconds for each hemlock batch subprocess

Experiment Details

4.1: Bayesian vs Baseline (7B)

Tests whether the parameters found by the Bayesian optimizer produce higher injection rates than unoptimized template payloads at 7B scale.

Conditions:

  • baseline — no optimization flags (template payloads)
  • bayesian — flags from best-params.json

Categories tested: override, redirect

4.2: Reward-Guided vs Similarity-Only (7B)

Tests whether adding the injection reward model to the Genetic optimizer recovers the injection rate that similarity-only optimization destroys.

Conditions:

  • genetic-similarity--genetic --injection-weight 0
  • genetic-guided--genetic --injection-weight 0.4

Categories tested: override, redirect

Context: Tests whether adding reward-model guidance to similarity-only Genetic optimization recovers any injection rate that pure embedding-similarity optimization may have suppressed.

4.3: Bayesian at 32B

Tests whether joint optimization pushes 32B ASR beyond the template baseline.

Conditions: Same as 4.1, but targeting qwen2.5:32b on Strix hardware.

4.4: 72B Joint Optimization

Tests whether optimized parameters achieve nonzero injection success at 72B scale where template payloads have not been observed to succeed in our pilot runs.

Conditions: Same as 4.1, but targeting qwen2.5:72b. Tests all five categories (override, redirect, exfiltrate, denial, multistage).

Reporting: Each condition reports nonzero rates with bootstrap CIs across 30 runs. All-zero outcomes are reported as 0/30 with the binomial CI upper bound.

Output

validation-summary.json

{
  "experiment": "4.1",
  "model": "qwen2.5:7b",
  "conditions": ["baseline", "bayesian"],
  "categories": ["override", "redirect"],
  "results": [
    {
      "run": 1,
      "condition": "baseline",
      "payload_category": "override",
      "model": "qwen2.5:7b",
      "injection_rate": 0.0,
      "retrieval_rate": 0.5,
      "injected": 0,
      "inj_total": 4,
      "retrieved_top5": 2,
      "ret_total": 4,
      "framework_detail": {
        "langchain": {"injected": false, "confidence": null, "retrieved": true, "poisoned_rank": 2.0},
        "llamaindex": {"injected": false, "confidence": null, "retrieved": true, "poisoned_rank": 4.0}
      }
    }
  ]
}

Analysis Pipeline

After running an experiment, feed the results into statistical analysis and figure generation:

# Compute statistics
python harness/statistical_analysis.py \
  --input reports/validation-4.1/validation-summary.json \
  --output reports/validation-4.1/statistics.json \
  --mode validation

# Generate retrieval-injection scatter (Figure 2)
python harness/generate_figures.py \
  --validation-stats reports/validation-4.1/statistics.json \
  --output-dir figures/ \
  --figure 2

See Also