Figure Generation¶
generate_figures.py produces five publication-quality figures from the statistical analysis output. Each figure is saved as both PDF (vector) and PNG (300 DPI raster).
CLI Usage¶
Generate All Figures¶
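With a directory of statistics JSONs (see the Workflow section below), all figures are generated in one invocation:

```shell
python harness/generate_figures.py \
  --stats-dir reports/statistics/ \
  --output-dir figures/
```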
Generate a Single Figure¶
```bash
python harness/generate_figures.py \
  --pareto-stats reports/pareto-qwen7b/pareto-statistics.json \
  --output-dir figures/ \
  --figure 4
```
All Flags¶
| Flag | Default | Description |
|---|---|---|
| `--stats-dir` | (none) | Directory containing all statistics JSONs |
| `--output-dir` | (required) | Output directory for figure files |
| `--figure` | (all) | Single figure number (1–5), or omit to generate all |
| `--pareto-stats` | (none) | Pareto statistics JSON (for Figure 4) |
| `--validation-stats` | (none) | Validation statistics JSON (for Figure 2) |
| `--aggregate-stats` | (none) | Aggregate statistics JSON (for Figures 1, 3, 5) |
Figures¶
Figure 1: ASR vs Model Scale¶
Data source: Aggregate statistics (cross-model sweep results)
Plots injection success rate (ASR) against model size ($\log_2$ parameters) with one line per model family (Qwen, Llama, optionally Gemma). Confidence intervals are rendered as shaded regions.
Output: fig1_scale.pdf, fig1_scale.png
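A minimal matplotlib sketch of this construction. The families, sizes, ASR values, and CI half-width below are illustrative placeholders, not real sweep results, and the actual code in generate_figures.py may differ:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, matching the harness
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: ASR per model size for two families, plus palette colors.
families = {
    "qwen": ("#0072B2", [0.62, 0.48, 0.31]),
    "llama": ("#D55E00", [0.55, 0.41, 0.28]),
}
log2_params = np.array([31, 33, 35])  # log2(parameter count), illustrative

fig, ax = plt.subplots(figsize=(6, 4))
for name, (color, asr) in families.items():
    asr = np.asarray(asr)
    ci = 0.05  # stand-in for a per-point bootstrap CI half-width
    ax.plot(log2_params, asr, color=color, marker="o", label=name)
    ax.fill_between(log2_params, asr - ci, asr + ci, color=color, alpha=0.2)
ax.set_xlabel(r"$\log_2$ parameters")
ax.set_ylabel("Injection success rate (ASR)")
ax.legend()
fig.savefig("fig1_scale.png", dpi=300)
```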
Figure 2: Retrieval vs Injection Scatter¶
Data source: Validation statistics (experiment 4.1 or 4.2)
Scatter plot with retrieval rate on the X-axis and injection rate on the Y-axis. Each point represents one optimizer condition. Colors distinguish optimizer types (baseline, CEM, Genetic, Whitebox). Error bars show bootstrap CIs in both dimensions. A $y = x$ diagonal reference line highlights the retrieval-injection disconnect.
Output: fig2_retrieval_vs_injection.pdf, fig2_retrieval_vs_injection.png
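The scatter construction can be sketched as follows; the per-condition rates and error-bar widths are illustrative placeholders, and only the optimizer names and colors come from this page:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# One illustrative point per optimizer condition:
# (color, retrieval rate, injection rate, x CI half-width, y CI half-width)
conditions = {
    "baseline": ("#808080", 0.90, 0.12, 0.03, 0.02),
    "cem":      ("#56B4E9", 0.85, 0.35, 0.04, 0.05),
    "genetic":  ("#E69F00", 0.80, 0.42, 0.04, 0.06),
    "whitebox": ("#F0E442", 0.75, 0.55, 0.05, 0.06),
}

fig, ax = plt.subplots(figsize=(6, 4))
for name, (color, ret, inj, xerr, yerr) in conditions.items():
    ax.errorbar(ret, inj, xerr=xerr, yerr=yerr, fmt="o", color=color, label=name)
ax.plot([0, 1], [0, 1], ls="--", color="gray")  # y = x reference line
ax.set_xlabel("Retrieval rate")
ax.set_ylabel("Injection rate")
ax.legend()
```

Points far below the diagonal are conditions whose payloads are retrieved but not injected — the disconnect the figure is meant to highlight.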
Figure 3: Authority Style Heatmap¶
Data source: Aggregate statistics (model_sizes, cells keys)
Heatmap matrix with model scale on the X-axis and authority style on the Y-axis. Cell color intensity represents ASR. Each cell is annotated with the percentage value. Visualizes how authority-style cover text interacts with model scale across the swept conditions.
Warning
If the aggregate statistics JSON does not contain model_sizes and cells, this figure will fail with an explicit error rather than silently rendering a zero-filled heatmap.
Warning
If the only authority style present in the cells data is none (i.e. no authority-framed sweeps have been run), Figure 3 is skipped with a warning. This prevents presenting fabricated zero rows for academic, institutional, and regulatory styles that were never actually measured. To populate these cells, run authority-style sweeps (e.g. run_model_sweep.sh --authority-style academic, and likewise for the other styles).
Output: fig3_authority_heatmap.pdf, fig3_authority_heatmap.png
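A hedged sketch of the heatmap, including the skip guard described in the warning above. The styles, scales, and ASR matrix are illustrative placeholders:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

styles = ["academic", "institutional", "regulatory"]  # rows (authority styles)
scales = ["1.5B", "7B", "32B"]                        # columns (model scales)
asr = np.array([[0.20, 0.35, 0.50],
                [0.15, 0.30, 0.45],
                [0.10, 0.25, 0.40]])  # illustrative ASR per cell

if set(styles) == {"none"}:
    # Mirrors the documented behavior: skip rather than plot fabricated zeros.
    print("skipping Figure 3: no authority-framed sweeps present")
else:
    fig, ax = plt.subplots(figsize=(6, 4))
    im = ax.imshow(asr, cmap="viridis")
    for i in range(len(styles)):
        for j in range(len(scales)):
            ax.text(j, i, f"{asr[i, j]:.0%}", ha="center", va="center",
                    color="white")  # annotate each cell with its percentage
    ax.set_xticks(range(len(scales)), scales)
    ax.set_yticks(range(len(styles)), styles)
    ax.set_xlabel("Model scale")
    ax.set_ylabel("Authority style")
```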
Figure 4: Pareto Frontier¶
Data source: Pareto statistics
Dual-axis plot with injection weight on the X-axis. Left Y-axis (orange) shows injection rate; right Y-axis (blue) shows retrieval rate. Error bars from bootstrap CIs at each weight. Marks the optimal weight (highest injection rate × retrieval rate product). The crossing point of the two curves indicates where the retrieval-injection trade-off peaks.
Output: fig4_pareto_curve.pdf, fig4_pareto_curve.png
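The dual-axis layout can be sketched with matplotlib's `twinx`; the weights and rates below are illustrative placeholders, but the optimal-weight rule (highest injection × retrieval product) follows the description above:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

weights = np.array([0.0, 0.25, 0.5, 0.75, 1.0])          # injection weights
injection = np.array([0.05, 0.30, 0.45, 0.55, 0.60])     # illustrative
retrieval = np.array([0.95, 0.85, 0.70, 0.45, 0.20])     # illustrative

fig, ax_inj = plt.subplots(figsize=(6, 4))
ax_ret = ax_inj.twinx()  # second y-axis sharing the x-axis
ax_inj.errorbar(weights, injection, yerr=0.04, color="#D55E00", marker="o")
ax_ret.errorbar(weights, retrieval, yerr=0.04, color="#0072B2", marker="s")

best = weights[np.argmax(injection * retrieval)]  # optimal weight by product
ax_inj.axvline(best, ls=":", color="gray")
ax_inj.set_xlabel("Injection weight")
ax_inj.set_ylabel("Injection rate", color="#D55E00")
ax_ret.set_ylabel("Retrieval rate", color="#0072B2")
```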
Figure 5: Framework Vulnerability Comparison¶
Data source: Aggregate statistics (framework_rates key)
Grouped bar chart with payload categories on the X-axis and ASR on the Y-axis. Bars are grouped by framework (LangChain, LlamaIndex, Haystack, Unstructured, ColPALI). Shows differential vulnerability — which frameworks are susceptible to which payload categories.
Warning
If the aggregate statistics JSON does not contain framework_rates, this figure will fail with an explicit error rather than silently rendering zero-height bars.
Output: fig5_framework_comparison.pdf, fig5_framework_comparison.png
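A grouped-bar sketch of this figure. The payload category names and rate matrix are illustrative placeholders (only the framework names appear on this page), and the real figure draws all five frameworks:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

categories = ["category A", "category B", "category C"]  # placeholder payload categories
frameworks = ["LangChain", "LlamaIndex", "Haystack"]
rates = np.array([[0.4, 0.3, 0.2],
                  [0.5, 0.2, 0.3],
                  [0.3, 0.4, 0.1]])  # rows: frameworks, columns: categories

x = np.arange(len(categories))
width = 0.8 / len(frameworks)  # bars within a group share the 0.8-wide slot
fig, ax = plt.subplots(figsize=(6, 4))
for i, fw in enumerate(frameworks):
    ax.bar(x + i * width, rates[i], width, label=fw)
ax.set_xticks(x + width, categories)
ax.set_xlabel("Payload category")
ax.set_ylabel("ASR")
ax.legend()
```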
Style Configuration¶
All figures use a consistent publication-ready style:
| Setting | Value |
|---|---|
| Font family | Serif |
| Font size | 10pt |
| Figure size | 6×4 inches |
| DPI | 300 |
| Grid | Enabled |
| Backend | Agg (non-interactive) |
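The settings above correspond to a matplotlib rcParams block along these lines (a sketch of the style, not necessarily the exact keys generate_figures.py sets):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

plt.rcParams.update({
    "font.family": "serif",
    "font.size": 10,
    "figure.figsize": (6, 4),   # inches
    "figure.dpi": 300,
    "axes.grid": True,
})
```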
Color Palette¶
Colorblind-friendly palette (Okabe-Ito):
| Label | Color | Hex |
|---|---|---|
| qwen | Blue | #0072B2 |
| llama | Orange | #D55E00 |
| gemma | Green | #009E73 |
| baseline | Gray | #808080 |
| optimized | Pink | #CC79A7 |
| genetic | Yellow | #E69F00 |
| cem | Light blue | #56B4E9 |
| whitebox | Light yellow | #F0E442 |
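The same palette as a Python mapping, convenient for keeping figures consistent (the constant name `PALETTE` is illustrative):

```python
# Okabe-Ito-derived colorblind-friendly palette, as tabulated above.
PALETTE = {
    "qwen": "#0072B2",       # blue
    "llama": "#D55E00",      # orange
    "gemma": "#009E73",      # green
    "baseline": "#808080",   # gray
    "optimized": "#CC79A7",  # pink
    "genetic": "#E69F00",    # yellow
    "cem": "#56B4E9",        # light blue
    "whitebox": "#F0E442",   # light yellow
}
```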
Workflow¶
The typical end-to-end workflow from raw experiment data to figures:
```bash
# 1. Run experiments (see validation-experiments.md, pareto-sweep.md)

# 2. Compute statistics
# Output filenames can be anything — the figure loader classifies by content schema.
python harness/statistical_analysis.py \
  --input reports/validation-4.1/validation-summary.json \
  --output reports/statistics/validation.json \
  --mode validation

python harness/statistical_analysis.py \
  --input reports/pareto-qwen7b/pareto-summary.json \
  --output reports/statistics/pareto.json \
  --mode pareto

python harness/statistical_analysis.py \
  --output reports/statistics/aggregate.json \
  --mode aggregate \
  --sweep-dirs reports/sweep-*

# 3. Generate figures
# --stats-dir scans for *.json and classifies by content:
#   {"mode": "aggregate"}  → figures 1, 3, 5
#   {"sweep": "pareto"}    → figure 4
#   {"comparisons": ...}   → figure 2
python harness/generate_figures.py \
  --stats-dir reports/statistics/ \
  --output-dir figures/
```
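The content-schema classification in step 3 can be sketched as a small helper (the function name and return labels are illustrative; the key checks match the comments above):

```python
import json
from pathlib import Path

def classify_stats(path):
    """Classify a statistics JSON by its content schema, not its filename."""
    data = json.loads(Path(path).read_text())
    if data.get("mode") == "aggregate":
        return "aggregate"   # feeds figures 1, 3, 5
    if data.get("sweep") == "pareto":
        return "pareto"      # feeds figure 4
    if "comparisons" in data:
        return "validation"  # feeds figure 2
    return "unknown"
```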
See Also¶
- Statistical Analysis — produces the input statistics
- Validation Experiments — experiment orchestration
- Pareto Sweep — injection-weight ablation