Figure Generation

generate_figures.py produces five publication-quality figures from the statistical analysis output. Each figure is saved as both PDF (vector) and PNG (300 DPI raster).

CLI Usage

Generate All Figures

python harness/generate_figures.py \
  --stats-dir reports/statistics/ \
  --output-dir figures/

Generate a Single Figure

python harness/generate_figures.py \
  --pareto-stats reports/pareto-qwen7b/pareto-statistics.json \
  --output-dir figures/ \
  --figure 4

All Flags

| Flag | Default | Description |
| --- | --- | --- |
| `--stats-dir` | | Directory containing all statistics JSONs |
| `--output-dir` | (required) | Output directory for figure files |
| `--figure` | (all) | Single figure number (1–5), or omit for all |
| `--pareto-stats` | | Pareto statistics JSON (for Figure 4) |
| `--validation-stats` | | Validation statistics JSON (for Figure 2) |
| `--aggregate-stats` | | Aggregate statistics JSON (for Figures 1, 3, 5) |

Figures

Figure 1: ASR vs Model Scale

Data source: Aggregate statistics (cross-model sweep results)

Plots injection success rate (ASR) against model size ($\log_2$ parameters) with one line per model family (Qwen, Llama, optionally Gemma). Confidence intervals are rendered as shaded regions.

Output: fig1_scale.pdf, fig1_scale.png

Figure 2: Retrieval vs Injection Scatter

Data source: Validation statistics (experiment 4.1 or 4.2)

Scatter plot with retrieval rate on the X-axis and injection rate on the Y-axis. Each point represents one optimizer condition. Colors distinguish optimizer types (baseline, CEM, Genetic, Whitebox). Error bars show bootstrap CIs in both dimensions. A $y = x$ diagonal reference line highlights the retrieval-injection disconnect.

Output: fig2_retrieval_vs_injection.pdf, fig2_retrieval_vs_injection.png
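The layout described above can be sketched in a few lines of matplotlib. The data points, error bars, and colors below are hypothetical stand-ins; the real script reads them from the validation statistics JSON:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, as in the style table
import matplotlib.pyplot as plt

# Hypothetical per-condition points: (retrieval rate, injection rate,
# CI half-width on X, CI half-width on Y).
conditions = {
    "baseline": (0.85, 0.10, 0.04, 0.03),
    "cem":      (0.80, 0.25, 0.05, 0.04),
    "genetic":  (0.75, 0.30, 0.05, 0.05),
    "whitebox": (0.70, 0.45, 0.06, 0.05),
}
colors = {"baseline": "#808080", "cem": "#56B4E9",
          "genetic": "#E69F00", "whitebox": "#F0E442"}

fig, ax = plt.subplots(figsize=(6, 4))
for name, (x, y, xerr, yerr) in conditions.items():
    ax.errorbar(x, y, xerr=xerr, yerr=yerr, fmt="o",
                color=colors[name], label=name, capsize=3)
ax.plot([0, 1], [0, 1], "k--", linewidth=0.8)  # y = x reference line
ax.set_xlabel("Retrieval rate")
ax.set_ylabel("Injection rate")
ax.legend()
fig.savefig("fig2_sketch.png", dpi=300)
```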

Figure 3: Authority Style Heatmap

Data source: Aggregate statistics (model_sizes, cells keys)

Heatmap matrix with model scale on the X-axis and authority style on the Y-axis. Cell color intensity represents ASR. Each cell is annotated with the percentage value. Visualizes how authority-style cover text interacts with model scale across the swept conditions.

Warning

If the aggregate statistics JSON does not contain model_sizes and cells, this figure will fail with an explicit error rather than silently rendering a zero-filled heatmap.

Warning

If the only authority style present in the cells data is none (i.e. no authority-framed sweeps have been run), Figure 3 is skipped with a warning. This prevents presenting fabricated zero rows for academic, institutional, and regulatory styles that were never actually measured. Run authority-style sweeps (e.g. run_model_sweep.sh --authority-style academic) to populate these cells.
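The skip condition amounts to a small guard. A sketch, assuming a cells layout where each cell records its authority style (the actual JSON schema may differ):

```python
def should_skip_authority_heatmap(cells):
    """Return True when every cell was measured with authority style
    'none', i.e. no authority-framed sweep has been run yet."""
    styles = {cell.get("authority_style", "none") for cell in cells}
    return styles <= {"none"}

# Only the unstyled sweep has been run: Figure 3 is skipped.
assert should_skip_authority_heatmap([{"authority_style": "none"}])

# An academic-style sweep exists: the heatmap can be rendered honestly.
assert not should_skip_authority_heatmap(
    [{"authority_style": "none"}, {"authority_style": "academic"}]
)
```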

Output: fig3_authority_heatmap.pdf, fig3_authority_heatmap.png

Figure 4: Pareto Frontier

Data source: Pareto statistics

Dual-axis plot with injection weight on the X-axis. The left Y-axis (orange) shows injection rate; the right Y-axis (blue) shows retrieval rate. Error bars show bootstrap CIs at each weight. The optimal weight (highest injection rate × retrieval rate product) is marked. The crossing point of the two curves indicates where the retrieval-injection trade-off peaks.

Output: fig4_pareto_curve.pdf, fig4_pareto_curve.png
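The optimal-weight marker follows directly from the per-weight rates. A sketch with hypothetical numbers (the real values come from the Pareto statistics JSON):

```python
# Hypothetical sweep: injection rises and retrieval falls as the
# injection weight increases.
weights   = [0.0, 0.25, 0.5, 0.75, 1.0]
injection = [0.05, 0.20, 0.40, 0.55, 0.60]
retrieval = [0.95, 0.90, 0.75, 0.50, 0.20]

# Optimal weight = highest injection rate x retrieval rate product.
best = max(zip(weights, injection, retrieval), key=lambda t: t[1] * t[2])
best_weight = best[0]
print(best_weight)  # 0.5 here: 0.40 * 0.75 = 0.30 beats every neighbor
```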

Figure 5: Framework Vulnerability Comparison

Data source: Aggregate statistics (framework_rates key)

Grouped bar chart with payload categories on the X-axis and ASR on the Y-axis. Bars are grouped by framework (LangChain, LlamaIndex, Haystack, Unstructured, ColPALI). Shows differential vulnerability: which frameworks are susceptible to which payload categories.

Warning

If the aggregate statistics JSON does not contain framework_rates, this figure will fail with an explicit error rather than silently rendering zero-height bars.

Output: fig5_framework_comparison.pdf, fig5_framework_comparison.png

Style Configuration

All figures use a consistent publication-ready style:

| Setting | Value |
| --- | --- |
| Font family | Serif |
| Font size | 10 pt |
| Figure size | 6 × 4 inches |
| DPI | 300 |
| Grid | Enabled |
| Backend | Agg (non-interactive) |
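These settings map naturally onto matplotlib rcParams. A plausible mapping (the exact keys used by generate_figures.py are an assumption; figure size and DPI are often also passed per-figure):

```python
# Assumed rcParams equivalent of the style table above.
PUB_STYLE = {
    "font.family": "serif",
    "font.size": 10,
    "figure.figsize": (6, 4),
    "figure.dpi": 300,
    "axes.grid": True,
}
```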

Color Palette

Colorblind-friendly palette (Okabe-Ito):

| Label | Color | Hex |
| --- | --- | --- |
| qwen | Blue | #0072B2 |
| llama | Orange | #D55E00 |
| gemma | Green | #009E73 |
| baseline | Gray | #808080 |
| optimized | Pink | #CC79A7 |
| genetic | Yellow | #E69F00 |
| cem | Light blue | #56B4E9 |
| whitebox | Light yellow | #F0E442 |
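In code, the palette is just a lookup table keyed by condition label (same labels and hex values as the table above; the variable name is illustrative):

```python
# Okabe-Ito palette plus gray, keyed by condition label.
PALETTE = {
    "qwen": "#0072B2",       # blue
    "llama": "#D55E00",      # orange
    "gemma": "#009E73",      # green
    "baseline": "#808080",   # gray
    "optimized": "#CC79A7",  # pink
    "genetic": "#E69F00",    # yellow
    "cem": "#56B4E9",        # light blue
    "whitebox": "#F0E442",   # light yellow
}
```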

Workflow

The typical end-to-end workflow from raw experiment data to figures:

# 1. Run experiments (see validation-experiments.md, pareto-sweep.md)

# 2. Compute statistics
#    Output filenames can be anything — the figure loader classifies by content schema.
python harness/statistical_analysis.py \
  --input reports/validation-4.1/validation-summary.json \
  --output reports/statistics/validation.json \
  --mode validation

python harness/statistical_analysis.py \
  --input reports/pareto-qwen7b/pareto-summary.json \
  --output reports/statistics/pareto.json \
  --mode pareto

python harness/statistical_analysis.py \
  --output reports/statistics/aggregate.json \
  --mode aggregate \
  --sweep-dirs reports/sweep-*

# 3. Generate figures
#    --stats-dir scans for *.json and classifies by content:
#      {"mode": "aggregate"}  → figures 1, 3, 5
#      {"sweep": "pareto"}    → figure 4
#      {"comparisons": ...}   → figure 2
python harness/generate_figures.py \
  --stats-dir reports/statistics/ \
  --output-dir figures/
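The content-based classification in step 3 can be sketched as a small dispatcher over the keys each statistics JSON contains. The key names follow the comments above; the real loader may check additional fields:

```python
import json
from pathlib import Path

def classify_stats(doc):
    """Map one parsed statistics JSON to the figures it feeds."""
    if doc.get("mode") == "aggregate":
        return [1, 3, 5]
    if doc.get("sweep") == "pareto":
        return [4]
    if "comparisons" in doc:
        return [2]
    return []  # unrecognized schema: ignored by the loader

def scan_stats_dir(stats_dir):
    """Scan *.json files in a directory; return {filename: figures}."""
    return {
        p.name: classify_stats(json.loads(p.read_text()))
        for p in Path(stats_dir).glob("*.json")
    }
```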

See Also