Figure Generation

generate_figures.py produces five publication-quality figures from the statistical analysis output. Each figure is saved as both PDF (vector) and PNG (300 DPI raster).

CLI Usage

Generate All Figures

python harness/generate_figures.py \
  --stats-dir reports/statistics/ \
  --output-dir figures/

Generate a Single Figure

python harness/generate_figures.py \
  --pareto-stats reports/pareto-qwen7b/pareto-statistics.json \
  --output-dir figures/ \
  --figure 4

All Flags

| Flag | Default | Description |
| --- | --- | --- |
| `--stats-dir` | | Directory containing all statistics JSONs |
| `--output-dir` | (required) | Output directory for figure files |
| `--figure` | (all) | Single figure number (1–5), or omit for all |
| `--pareto-stats` | | Pareto statistics JSON (for Figure 4) |
| `--validation-stats` | | Validation statistics JSON (for Figure 2) |
| `--aggregate-stats` | | Aggregate statistics JSON (for Figures 1, 3, 5) |

Figures

Figure 1: ASR vs Model Scale

Data source: Aggregate statistics (cross-model sweep results)

Plots injection success rate (ASR) against model size ($\log_2$ parameters) with one line per model family (Qwen, Llama, optionally Gemma). Confidence intervals are rendered as shaded regions.

Output: fig1_scale.pdf, fig1_scale.png

Figure 2: Retrieval vs Injection Scatter

Data source: Validation statistics (experiment 4.1 or 4.2)

Scatter plot with retrieval rate on the X-axis and injection rate on the Y-axis. Each point represents one optimizer condition. Colors distinguish optimizer types (baseline, CEM, Genetic, Whitebox). Error bars show bootstrap CIs in both dimensions. A $y = x$ diagonal reference line highlights the retrieval-injection disconnect.

Output: fig2_retrieval_vs_injection.pdf, fig2_retrieval_vs_injection.png
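The layout described above can be sketched in a few lines of matplotlib. The data points, error bars, and colors below are hypothetical stand-ins; the real script reads them from the validation statistics JSON:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, as in the style table
import matplotlib.pyplot as plt

# Hypothetical per-condition points: (retrieval rate, injection rate,
# CI half-width on X, CI half-width on Y).
conditions = {
    "baseline": (0.85, 0.10, 0.04, 0.03),
    "cem":      (0.80, 0.25, 0.05, 0.04),
    "genetic":  (0.75, 0.30, 0.05, 0.05),
    "whitebox": (0.70, 0.45, 0.06, 0.05),
}
colors = {"baseline": "#808080", "cem": "#56B4E9",
          "genetic": "#E69F00", "whitebox": "#F0E442"}

fig, ax = plt.subplots(figsize=(6, 4))
for name, (x, y, xerr, yerr) in conditions.items():
    ax.errorbar(x, y, xerr=xerr, yerr=yerr, fmt="o",
                color=colors[name], label=name, capsize=3)
ax.plot([0, 1], [0, 1], "k--", linewidth=0.8)  # y = x reference line
ax.set_xlabel("Retrieval rate")
ax.set_ylabel("Injection rate")
ax.legend()
fig.savefig("fig2_sketch.png", dpi=300)
```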

Figure 3: Authority Style Heatmap

Data source: Aggregate statistics (model_sizes, cells keys)

Heatmap matrix with model scale on the X-axis and authority style on the Y-axis. Cell color intensity represents ASR. Each cell is annotated with the percentage value. Visualizes how authority-style cover text interacts with model scale across the swept conditions.

Warning

If the aggregate statistics JSON does not contain model_sizes and cells, this figure will fail with an explicit error rather than silently rendering a zero-filled heatmap.

Warning

If the only authority style present in the cells data is none (i.e. no authority-framed sweeps have been run), Figure 3 is skipped with a warning. This prevents presenting fabricated zero rows for academic, institutional, and regulatory styles that were never actually measured. Run authority-style sweeps (e.g. run_model_sweep.sh --authority-style academic) to populate these cells.
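The skip condition amounts to a small guard. A sketch, assuming a cells layout where each cell records its authority style (the actual JSON schema may differ):

```python
def should_skip_authority_heatmap(cells):
    """Return True when every cell was measured with authority style
    'none', i.e. no authority-framed sweep has been run yet."""
    styles = {cell.get("authority_style", "none") for cell in cells}
    return styles <= {"none"}

# Only the unstyled sweep has been run: Figure 3 is skipped.
assert should_skip_authority_heatmap([{"authority_style": "none"}])

# An academic-style sweep exists: the heatmap can be rendered honestly.
assert not should_skip_authority_heatmap(
    [{"authority_style": "none"}, {"authority_style": "academic"}]
)
```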

Output: fig3_authority_heatmap.pdf, fig3_authority_heatmap.png

Figure 4: Pareto Frontier

Data source: Pareto statistics

Dual-axis plot with injection weight on the X-axis. The left Y-axis (orange) shows injection rate; the right Y-axis (blue) shows retrieval rate. Error bars show bootstrap CIs at each weight. The optimal weight (highest injection rate × retrieval rate product) is marked. The crossing point of the two curves indicates where the retrieval-injection trade-off peaks.

Output: fig4_pareto_curve.pdf, fig4_pareto_curve.png
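The optimal-weight marker follows directly from the per-weight rates. A sketch with hypothetical numbers (the real values come from the Pareto statistics JSON):

```python
# Hypothetical sweep: injection rises and retrieval falls as the
# injection weight increases.
weights   = [0.0, 0.25, 0.5, 0.75, 1.0]
injection = [0.05, 0.20, 0.40, 0.55, 0.60]
retrieval = [0.95, 0.90, 0.75, 0.50, 0.20]

# Optimal weight = highest injection rate x retrieval rate product.
best = max(zip(weights, injection, retrieval), key=lambda t: t[1] * t[2])
best_weight = best[0]
print(best_weight)  # 0.5 here: 0.40 * 0.75 = 0.30 beats every neighbor
```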

Figure 5: Framework Vulnerability Comparison

Data source: Aggregate statistics (framework_rates key)

Grouped bar chart with payload categories on the X-axis and ASR on the Y-axis. Bars are grouped by framework (LangChain, LlamaIndex, Haystack, Unstructured, ColPALI). Shows differential vulnerability: which frameworks are susceptible to which payload categories.

Warning

If the aggregate statistics JSON does not contain framework_rates, this figure will fail with an explicit error rather than silently rendering zero-height bars.

Output: fig5_framework_comparison.pdf, fig5_framework_comparison.png

Style Configuration

All figures use a consistent publication-ready style:

| Setting | Value |
| --- | --- |
| Font family | Serif |
| Font size | 10 pt |
| Figure size | 6 × 4 inches |
| DPI | 300 |
| Grid | Enabled |
| Backend | Agg (non-interactive) |
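These settings map naturally onto matplotlib rcParams. A plausible mapping (the exact keys used by generate_figures.py are an assumption; figure size and DPI are often also passed per-figure):

```python
# Assumed rcParams equivalent of the style table above.
PUB_STYLE = {
    "font.family": "serif",
    "font.size": 10,
    "figure.figsize": (6, 4),
    "figure.dpi": 300,
    "axes.grid": True,
}
```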

Color Palette

Colorblind-friendly palette (Okabe-Ito):

| Label | Color | Hex |
| --- | --- | --- |
| qwen | Blue | #0072B2 |
| llama | Orange | #D55E00 |
| gemma | Green | #009E73 |
| baseline | Gray | #808080 |
| optimized | Pink | #CC79A7 |
| genetic | Yellow | #E69F00 |
| cem | Light blue | #56B4E9 |
| whitebox | Light yellow | #F0E442 |
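In code, the palette is just a lookup table keyed by condition label (same labels and hex values as the table above; the variable name is illustrative):

```python
# Okabe-Ito palette plus gray, keyed by condition label.
PALETTE = {
    "qwen": "#0072B2",       # blue
    "llama": "#D55E00",      # orange
    "gemma": "#009E73",      # green
    "baseline": "#808080",   # gray
    "optimized": "#CC79A7",  # pink
    "genetic": "#E69F00",    # yellow
    "cem": "#56B4E9",        # light blue
    "whitebox": "#F0E442",   # light yellow
}
```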

Workflow

The typical end-to-end workflow from raw experiment data to figures:

# 1. Run experiments (see validation-experiments.md, pareto-sweep.md)

# 2. Compute statistics
#    Output filenames can be anything — the figure loader classifies by content schema.
python harness/statistical_analysis.py \
  --input reports/validation-4.1/validation-summary.json \
  --output reports/statistics/validation.json \
  --mode validation

python harness/statistical_analysis.py \
  --input reports/pareto-qwen7b/pareto-summary.json \
  --output reports/statistics/pareto.json \
  --mode pareto

python harness/statistical_analysis.py \
  --output reports/statistics/aggregate.json \
  --mode aggregate \
  --sweep-dirs reports/sweep-*

# 3. Generate figures
#    --stats-dir scans for *.json and classifies by content:
#      {"mode": "aggregate"}  → figures 1, 3, 5
#      {"sweep": "pareto"}    → figure 4
#      {"comparisons": ...}   → figure 2
python harness/generate_figures.py \
  --stats-dir reports/statistics/ \
  --output-dir figures/
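The content-based classification in step 3 can be sketched as a small dispatcher over the keys each statistics JSON contains. The key names follow the comments above; the real loader may check additional fields:

```python
import json
from pathlib import Path

def classify_stats(doc):
    """Map one parsed statistics JSON to the figures it feeds."""
    if doc.get("mode") == "aggregate":
        return [1, 3, 5]
    if doc.get("sweep") == "pareto":
        return [4]
    if "comparisons" in doc:
        return [2]
    return []  # unrecognized schema: ignored by the loader

def scan_stats_dir(stats_dir):
    """Scan *.json files in a directory; return {filename: figures}."""
    return {
        p.name: classify_stats(json.loads(p.read_text()))
        for p in Path(stats_dir).glob("*.json")
    }
```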

See Also