Joint Optimization¶

Status of empirical claims on this page

The numerical effect sizes on this page are historical exploratory sweeps from the early 7B-only epoch. The canonical paired-replay evidence about the retrieval-injection disconnect, the optimizer-hurts-coupled-categories finding, and the three-shape coupling taxonomy lives in companion paper-track materials and is not reproduced here. Treat the magnitudes below as illustrative; the qualitative observation (similarity-side optimization does not buy injection-side success) is the load-bearing claim.

Overview¶

Traditional RAG poisoning tools optimize documents for a single objective — embedding similarity to target queries. hemlock's joint optimization framework introduces a multi-objective scoring function that combines embedding similarity with a naturalness term and a metadata-conditioned prior, allowing operators to blend objectives instead of optimizing only for retrieval rank.

This addresses a core architectural limitation: optimizers that maximize embedding similarity often reduce end-to-end attack success, because the modifications that improve retrieval (keyword stuffing, semantic alignment) can disrupt the hidden payload's ability to influence model output.

The Retrieval-Injection Disconnect¶

Early exploratory sweeps showed that improved retrieval did not translate into improved injection:

Baseline (no optimization): 50% redirect injection rate at 7B scale
Genetic optimizer (similarity-only): 17% redirect injection rate — worse than baseline
CEM optimizer: ~0% injection rate despite ~95% retrieval rate

This disconnect exists because retrieval optimization modifies the visible text surrounding the hidden payload, and these modifications can:

Dilute the payload's influence in the model's attention window
Introduce contradictory context that the model follows instead of the payload
Shift the document's semantic focus away from the payload's intent

Architecture¶

Joint optimization uses a blended scoring function in all three optimizers (CEM, Genetic, Whitebox):

$$ \text{score} = (1 - w_{\text{inj}} - w_{\text{nat}}) \cdot s_{\text{sim}} + w_{\text{nat}} \cdot s_{\text{nat}} + w_{\text{inj}} \cdot s_{\text{inj}} $$

Where:

Component	Description
$s_{\text{sim}}$	Cosine similarity between document embedding and target query
$s_{\text{nat}}$	Naturalness score from perplexity estimation
$s_{\text{inj}}$	Reward-model score: a metadata-conditioned prior over (model × framework × category) cells with cross-validated AUC ~0.79 on inference-time-available features (see the Known Limitation callout below)
$w_{\text{inj}}$	Injection weight (`--injection-weight`, default 0.0)
$w_{\text{nat}}$	Naturalness weight (`--naturalness-weight`, default 0.0)

When both --injection-weight and --naturalness-weight are 0 (the defaults), the similarity term carries a weight of 1 and the scoring function is identical to the original single-objective behavior, ensuring full backward compatibility.
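A minimal sketch of the blend, written in Python for readability rather than taken from hemlock's Go optimizers. The function name `blend_score` and the range check on the weights are assumptions of this sketch; the page does not say how out-of-range weights are handled.

```python
def blend_score(s_sim: float, s_nat: float, s_inj: float,
                w_inj: float = 0.0, w_nat: float = 0.0) -> float:
    """Blend the three component scores with the weighted sum given above.

    The range check is an assumption of this sketch, not documented behavior.
    """
    w_sim = 1.0 - w_inj - w_nat  # similarity receives whatever weight is left over
    if w_inj < 0 or w_nat < 0 or w_sim < 0:
        raise ValueError("weights must be non-negative and sum to at most 1")
    return w_sim * s_sim + w_nat * s_nat + w_inj * s_inj


# With both weights at their defaults the blend reduces to similarity alone.
print(blend_score(0.82, 0.40, 0.73))

# Illustrative weights: 0.4 injection, 0.1 naturalness, leaving 0.5 for similarity.
print(blend_score(0.82, 0.40, 0.73, w_inj=0.4, w_nat=0.1))
```

The component values passed in here are arbitrary numbers chosen to show the arithmetic, not measurements.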

Reward Model¶

The injection score $s_{\text{inj}}$ comes from a small trained MLP classifier served via HTTP. The model takes document metadata features and emits a scalar score in $[0, 1]$.

Known limitation: train-vs-inference feature asymmetry

The shipped model is trained on metadata features that include three fields not available at the Go optimizer's inference time:

indicator_hits (count of indicator strings in the response — known only after generation)
response_length (length of the response — known only after generation)
poisoned_in_sources (whether the poisoned doc appeared in retrieval — known only after retrieval testing)

When the Go optimizer queries the reward server during candidate scoring, these fields default to constants (0, 200, false) because they are post-hoc with respect to the candidate. The model then returns a score driven only by the remaining metadata (model, framework, authority style, optimizer type, payload category) — all of which are constant within a single experiment cell.

Cross-validated AUC on the leaky training distribution is ~0.997. Cross-validated AUC on inference-time-available features only is ~0.79. The shipped reward model is therefore better described as "a metadata-conditioned prior with AUC ~0.79 over (model × framework × category) cells, used as one term in the candidate-scoring blend" than as "predicts P(injection_success)". None of the headline empirical claims in the validation experiments below depends on the reward server providing useful per-document signal; lifts and hurts come from the outer-loop Bayesian search over document-generation parameters, not from inner-loop reward scoring.

A redesigned reward model with inference-time-only features is reserved for follow-up work.
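One consequence of the asymmetry can be illustrated with a short sketch. It assumes the naturalness weight is 0 and the injection weight is below 1. If $s_{\text{inj}}$ is the same constant for every candidate in a cell, the blend is an increasing linear function of $s_{\text{sim}}$, so candidates are ranked exactly as they would be by similarity alone. The similarity values and the constant 0.73 below are made up for illustration.

```python
def blend_score(s_sim, s_nat, s_inj, w_inj=0.0, w_nat=0.0):
    """Same weighted sum as the scoring formula in the Architecture section."""
    return (1.0 - w_inj - w_nat) * s_sim + w_nat * s_nat + w_inj * s_inj


def ranking(scores):
    """Return candidate indices ordered from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Made-up similarity scores for four candidates in one experiment cell.
similarities = [0.61, 0.84, 0.47, 0.79]

# Every candidate in the cell shares the same metadata, so the reward
# server returns the same value for each of them (arbitrary value here).
CONSTANT_S_INJ = 0.73

similarity_only = ranking(similarities)
blended = ranking(
    [blend_score(s, 0.0, CONSTANT_S_INJ, w_inj=0.4) for s in similarities]
)

# The two orderings are identical: the constant term shifts every score equally.
print(similarity_only, blended, similarity_only == blended)
```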

The trained MLP takes the following metadata features as input:

Model scale (log₂ of parameter count)
Framework identity (one-hot encoded)
Authority style (one-hot encoded)
Optimizer type (one-hot encoded)
Payload category (one-hot encoded)
Response characteristics — leaky at inference time, see warning above

The model is trained on historical experiment results using binary cross-entropy with class weighting. See the hemlock-lab reward model documentation for training and serving details.
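The feature list above can be sketched as a flat vector. This is an illustration only: the vocabularies are hypothetical placeholders (only the optimizer and payload names appear on this page), the feature order and lack of normalization are assumptions, and `encode_features` is not a function from hemlock or hemlock-lab.

```python
import math

# Hypothetical vocabularies; the real category lists are not given on this page.
FRAMEWORKS = ["framework_a", "framework_b"]
AUTHORITY_STYLES = ["style_a", "style_b"]
OPTIMIZERS = ["cem", "genetic", "whitebox"]
PAYLOAD_CATEGORIES = ["redirect", "override"]


def one_hot(value, vocabulary):
    """Return a one-hot list for value; raises ValueError if it is unknown."""
    index = vocabulary.index(value)
    return [1.0 if i == index else 0.0 for i in range(len(vocabulary))]


def encode_features(param_count, framework, authority_style, optimizer, category,
                    indicator_hits=0, response_length=200,
                    poisoned_in_sources=False):
    """Build a flat feature vector in the order of the list above.

    The three keyword defaults mirror the constants (0, 200, false) that
    are substituted for the post-hoc fields at inference time.
    """
    return (
        [math.log2(param_count)]
        + one_hot(framework, FRAMEWORKS)
        + one_hot(authority_style, AUTHORITY_STYLES)
        + one_hot(optimizer, OPTIMIZERS)
        + one_hot(category, PAYLOAD_CATEGORIES)
        + [float(indicator_hits), float(response_length), float(poisoned_in_sources)]
    )


# Two different candidate documents in the same cell produce the same vector,
# because nothing in the encoding depends on the document text.
vec_a = encode_features(7e9, "framework_a", "style_b", "genetic", "redirect")
vec_b = encode_features(7e9, "framework_a", "style_b", "genetic", "redirect")
print(len(vec_a), vec_a[-3:], vec_a == vec_b)
```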

Communication Flow¶

sequenceDiagram
    participant O as Go Optimizer
    participant R as Reward Server (Python)
    participant E as Embedding Provider

    loop Each candidate document
        O->>E: Embed candidate text
        E-->>O: Embedding vector
        O->>O: Compute similarity score
        O->>R: POST /predict-injection {text, model, framework, ...}
        R-->>O: {score: 0.73}
        O->>O: Blend scores with weights
        O->>O: Select/evolve candidates
    end

The Go optimizers call the reward server via HTTP POST to {injection-model-host}/predict-injection. The default host is http://localhost:9090. The call adds ~10ms per candidate with a local server, which is negligible compared to embedding computation (~200ms) and model inference (~2s).

CLI Usage¶

Basic Joint Optimization¶

# Start the reward server (in hemlock-lab)
python harness/reward_server.py --model-path reward_model.pt

# Generate with joint optimization (40% injection weight)
hemlock craft \
  --format html \
  --payload redirect \
  --genetic \
  --injection-weight 0.4 \
  --embed-provider ollama \
  --target-query "What is the company refund policy?"

Batch Generation¶

hemlock batch \
  --payload override \
  --genetic \
  --injection-weight 0.3 \
  --injection-model-host http://localhost:9090 \
  --embed-provider ollama

Cover Text Controls¶

Two additional parameters control document structure:

# --cover-text-density 0.7 retains 70% of the cover text
# --payload-position start places the payload at the document start
hemlock craft \
  --format docx \
  --payload redirect \
  --genetic \
  --injection-weight 0.4 \
  --cover-text-density 0.7 \
  --payload-position start \
  --embed-provider ollama

--cover-text-density (0.3–1.0, default 1.0): Fraction of generated cover text to retain. Lower values produce shorter documents with higher payload-to-text ratio.
--payload-position (start or end, default: format-specific): Controls whether the hidden payload is embedded at the beginning or end of the cover text.

Bayesian Hyperparameter Search¶

Rather than manually tuning the injection weight and other parameters, the Bayesian optimizer searches a 10-dimensional continuous space using Gaussian Process regression with Expected Improvement acquisition. It evaluates each parameter configuration by:

Generating a document corpus with the candidate parameters
Ingesting into a fresh ChromaDB collection
Running injection tests across all framework pipelines
Running retrieval tests to measure poisoned document ranking
Computing a composite reward: $0.3 \times r_{\text{retrieval}} + 0.7 \times r_{\text{injection}}$

Because the retrieval term stays nonzero when the injection rate is zero, the composite reward still distinguishes configurations in that regime, so the optimizer is not left searching a flat reward landscape. After 50–100 evaluations, the optimizer writes best-params.json, which contains the best-found CLI flags for the target model.
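The composite reward can be sketched in a few lines. The function name `composite_reward` and the sample rates are assumptions made for illustration; only the 0.3 and 0.7 weights come from the formula above.

```python
def composite_reward(retrieval_rate: float, injection_rate: float,
                     w_retrieval: float = 0.3, w_injection: float = 0.7) -> float:
    """Weighted sum of the two measured rates, each expected in [0, 1]."""
    return w_retrieval * retrieval_rate + w_injection * injection_rate


# Made-up measurements for two parameter configurations with zero injection.
config_a = composite_reward(retrieval_rate=0.40, injection_rate=0.0)
config_b = composite_reward(retrieval_rate=0.95, injection_rate=0.0)

# The rewards differ even though both injection rates are zero, so the
# search can still rank one configuration above the other.
print(config_a, config_b, config_b > config_a)
```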

Pareto Analysis¶

The Pareto sweep ablates the injection weight from 0.0 to 1.0 in configurable steps, measuring both injection and retrieval rates at each weight. This produces the Pareto frontier — the fundamental trade-off curve between retrieval quality and injection effectiveness.

Understanding this curve is critical for practical use. Some applications require high retrieval (the poisoned document must appear in results) and tolerate lower injection rates; others prioritize injection success even if retrieval sometimes fails.
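Given the per-weight measurements from such a sweep, the non-dominated points can be extracted as in the sketch below. The helper `pareto_front` and the sweep values are hypothetical; the numbers are made up to show the filtering and are not hemlock results.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    Each point is (weight, retrieval_rate, injection_rate). A point is
    dominated when another point is at least as good on both rates and
    strictly better on at least one.
    """
    front = []
    for w, r, i in points:
        dominated = any(
            (r2 >= r and i2 >= i) and (r2 > r or i2 > i)
            for _, r2, i2 in points
        )
        if not dominated:
            front.append((w, r, i))
    return front


# Made-up sweep results, not measurements from hemlock.
sweep = [
    (0.0, 0.95, 0.10),
    (0.2, 0.90, 0.25),
    (0.4, 0.80, 0.40),
    (0.6, 0.75, 0.35),  # dominated by the 0.4 row
    (0.8, 0.50, 0.45),
    (1.0, 0.30, 0.45),  # dominated by the 0.8 row
]

for weight, retrieval, injection in pareto_front(sweep):
    print(f"w_inj={weight:.1f} retrieval={retrieval:.2f} injection={injection:.2f}")
```

Weights whose points fall off the front give up one rate without gaining on the other, so they are poor choices regardless of which objective matters more.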

Experimental Validation¶

The validation experiments run controlled A/B comparisons:

Experiment	Question
4.1	Baseline vs Bayesian parameters at 7B
4.2	Similarity-only Genetic vs reward-guided Genetic at 7B
4.3	Baseline vs Bayesian parameters at 32B
4.4	Baseline vs Bayesian parameters at 72B

Each experiment runs 30 independent trials per condition with bootstrap confidence intervals and effect size analysis.
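The page does not say which bootstrap variant or effect-size measure the experiments use. As one plausible reading, the sketch below computes a percentile bootstrap interval for the difference in success rates and Cohen's h for two proportions. The function names and the per-trial outcomes are assumptions; the outcomes are synthetic and are not results from experiments 4.1 to 4.4.

```python
import math
import random


def bootstrap_diff_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(b) - mean(a).

    a and b are lists of per-trial outcomes (for example 0/1 success flags).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_resamples):
        resample_a = [rng.choice(a) for _ in a]
        resample_b = [rng.choice(b) for _ in b]
        diffs.append(sum(resample_b) / len(b) - sum(resample_a) / len(a))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def cohens_h(p1, p2):
    """Cohen's h effect size for the change from proportion p1 to p2."""
    return 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))


# Synthetic outcomes for 30 trials per condition (not real results).
baseline = [1] * 15 + [0] * 15
treatment = [1] * 9 + [0] * 21

lo, hi = bootstrap_diff_ci(baseline, treatment)
point = sum(treatment) / len(treatment) - sum(baseline) / len(baseline)
print(f"difference={point:.2f}, 95% CI=({lo:.2f}, {hi:.2f}), "
      f"h={cohens_h(0.5, 0.3):.2f}")
```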

Design Decisions¶

Why a separate reward server? The ML inference stack (PyTorch, scikit-learn, sentence-transformers) lives in Python. Embedding it in Go would require CGo bindings or ONNX runtime integration, which is significant complexity to take on just to avoid roughly 10ms of per-request HTTP overhead.

Why metadata features instead of text embeddings? The training set is ~5000 examples. A transformer-based text model would likely overfit at this scale. Metadata features (model, framework, category, authority style) capture the dominant predictors of injection success with high sample efficiency.

Why three-way blending instead of Pareto optimization? A weighted sum is simple to implement, tune, and interpret. The --injection-weight parameter gives users direct control over the retrieval-injection trade-off, and the Pareto sweep provides empirical guidance for choosing the weight.

Why backward-compatible defaults? All new flags default to values that reproduce the original single-objective behavior (--injection-weight 0, --cover-text-density 1.0, --payload-position ""). Existing workflows are unaffected unless the user explicitly opts in.