
Joint Optimization

Status of empirical claims on this page

The numerical effect sizes on this page are historical exploratory sweeps from the early 7B-only epoch. The canonical paired-replay evidence about the retrieval-injection disconnect, the optimizer-hurts-coupled-categories finding, and the three-shape coupling taxonomy lives in companion paper-track materials and is not reproduced here. Treat the magnitudes below as illustrative; the qualitative observation (similarity-side optimization does not buy injection-side success) is the load-bearing claim.

Overview

Traditional RAG poisoning tools optimize documents for a single objective — embedding similarity to target queries. hemlock's joint optimization framework introduces a multi-objective scoring function that combines embedding similarity with a naturalness term and a metadata-conditioned prior, allowing operators to blend objectives instead of optimizing only for retrieval rank.

This addresses a core architectural limitation: optimizers that maximize embedding similarity often reduce end-to-end attack success, because the modifications that improve retrieval (keyword stuffing, semantic alignment) can disrupt the hidden payload's ability to influence model output.

The Retrieval-Injection Disconnect

Early exploratory sweeps suggested that better retrieval does not translate into better injection:

  • Baseline (no optimization): 50% redirect injection rate at 7B scale
  • Genetic optimizer (similarity-only): 17% redirect injection rate — worse than baseline
  • CEM optimizer: ~0% injection rate despite ~95% retrieval rate

This disconnect exists because retrieval optimization modifies the visible text surrounding the hidden payload, and these modifications can:

  1. Dilute the payload's influence in the model's attention window
  2. Introduce contradictory context that the model follows instead of the payload
  3. Shift the document's semantic focus away from the payload's intent

Architecture

Joint optimization uses a blended scoring function in all three optimizers (CEM, Genetic, Whitebox):

$$ \text{score} = (1 - w_{\text{inj}} - w_{\text{nat}}) \cdot s_{\text{sim}} + w_{\text{nat}} \cdot s_{\text{nat}} + w_{\text{inj}} \cdot s_{\text{inj}} $$

Where:

| Component | Description |
| --- | --- |
| $s_{\text{sim}}$ | Cosine similarity between the document embedding and the target query |
| $s_{\text{nat}}$ | Naturalness score from perplexity estimation |
| $s_{\text{inj}}$ | Reward-model score: a metadata-conditioned prior over (model × framework × category) cells with cross-validated AUC ~0.79 on inference-time-available features (see the Known Limitation callout below) |
| $w_{\text{inj}}$ | Injection weight (--injection-weight, default 0.0) |
| $w_{\text{nat}}$ | Naturalness weight (--naturalness-weight, default 0.0) |

When both --injection-weight and --naturalness-weight are 0 (the defaults), the similarity coefficient is 1 and the scoring function is identical to the original single-objective behavior, ensuring full backward compatibility.
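The blend can be written out directly. The following is a minimal Python sketch of the formula above, not the Go optimizers' actual code; the function name `blended_score` and the weight validation (non-negative weights summing to at most 1) are assumptions added for illustration.

```python
def blended_score(s_sim: float, s_nat: float, s_inj: float,
                  w_inj: float = 0.0, w_nat: float = 0.0) -> float:
    """Weighted blend of similarity, naturalness, and injection scores."""
    # Assumption: weights are non-negative and leave a non-negative
    # coefficient for the similarity term.
    if w_inj < 0 or w_nat < 0 or w_inj + w_nat > 1:
        raise ValueError("weights must be non-negative and sum to at most 1")
    w_sim = 1.0 - w_inj - w_nat
    return w_sim * s_sim + w_nat * s_nat + w_inj * s_inj


# With both weights at their defaults the blend reduces to s_sim.
print(blended_score(0.82, 0.60, 0.73))
# With an injection weight of 0.4 the similarity term keeps the remaining 0.6.
print(blended_score(0.82, 0.60, 0.73, w_inj=0.4))
```

The input scores here are arbitrary placeholders, chosen only to show how the coefficients shift as the weights change.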

Reward Model

The injection score $s_{\text{inj}}$ comes from a small trained MLP classifier served via HTTP. The model takes document metadata features and emits a scalar score in $[0, 1]$.

Known limitation: train-vs-inference feature asymmetry

The shipped model is trained on metadata features that include three fields not available at the Go optimizer's inference time:

  • indicator_hits (count of indicator strings in the response — known only after generation)
  • response_length (length of the response — known only after generation)
  • poisoned_in_sources (whether the poisoned doc appeared in retrieval — known only after retrieval testing)

When the Go optimizer queries the reward server during candidate scoring, these fields default to constants (0, 200, false) because they are post-hoc with respect to the candidate. The model then returns a score driven only by the remaining metadata (model, framework, authority style, optimizer type, payload category) — all of which are constant within a single experiment cell.

Cross-validated AUC on the leaky training distribution is ~0.997. Cross-validated AUC on inference-time-available features only is ~0.79. The shipped reward model is therefore better described as "a metadata-conditioned prior with AUC ~0.79 over (model × framework × category) cells, used as one term in the candidate-scoring blend" than as "predicts P(injection_success)". None of the headline empirical claims in the validation experiments below depend on the reward server providing useful per-document signal; lifts and hurts come from the outer-loop Bayesian search over document-generation parameters, not from inner-loop reward scoring.

A redesigned reward model with inference-time-only features is reserved for follow-up work.

The trained MLP takes the following metadata features as input:

  • Model scale (log₂ of parameter count)
  • Framework identity (one-hot encoded)
  • Authority style (one-hot encoded)
  • Optimizer type (one-hot encoded)
  • Payload category (one-hot encoded)
  • Response characteristics — leaky at inference time, see warning above

The model is trained on historical experiment results using binary cross-entropy with class weighting. See the hemlock-lab reward model documentation for training and serving details.
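To make the feature list and the inference-time defaults concrete, here is a rough Python sketch of how such a feature vector might be assembled. It is not the hemlock-lab encoder: the function names, the feature ordering, and the framework and authority-style vocabularies are hypothetical. Only the three post-hoc field names and their default constants (0, 200, false) come from the description above.

```python
import math

# Hypothetical vocabularies; the real category lists are not given here.
FRAMEWORKS = ["framework_a", "framework_b"]
AUTHORITY_STYLES = ["style_a", "style_b"]
OPTIMIZERS = ["cem", "genetic", "whitebox"]
PAYLOAD_CATEGORIES = ["redirect", "override"]


def one_hot(value: str, vocabulary: list[str]) -> list[float]:
    """Return a one-hot vector for value over vocabulary."""
    if value not in vocabulary:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == item else 0.0 for item in vocabulary]


def encode_features(param_count: float, framework: str, authority_style: str,
                    optimizer: str, payload_category: str,
                    indicator_hits: int = 0, response_length: int = 200,
                    poisoned_in_sources: bool = False) -> list[float]:
    """Concatenate model scale, one-hot metadata, and post-hoc fields."""
    return (
        [math.log2(param_count)]
        + one_hot(framework, FRAMEWORKS)
        + one_hot(authority_style, AUTHORITY_STYLES)
        + one_hot(optimizer, OPTIMIZERS)
        + one_hot(payload_category, PAYLOAD_CATEGORIES)
        # Post-hoc fields: at inference time these stay at their defaults.
        + [float(indicator_hits), float(response_length), float(poisoned_in_sources)]
    )


vec = encode_features(7e9, "framework_a", "style_a", "genetic", "redirect")
print(len(vec), vec[-3:])
```

Because the candidate text never enters this vector, every candidate in the same experiment cell gets the same features, which is why the score behaves as a per-cell prior rather than a per-document signal.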

Communication Flow

sequenceDiagram
    participant O as Go Optimizer
    participant R as Reward Server (Python)
    participant E as Embedding Provider

    loop Each candidate document
        O->>E: Embed candidate text
        E-->>O: Embedding vector
        O->>O: Compute similarity score
        O->>R: POST /predict-injection {text, model, framework, ...}
        R-->>O: {score: 0.73}
        O->>O: Blend scores with weights
        O->>O: Select/evolve candidates
    end

The Go optimizers call the reward server via HTTP POST to {injection-model-host}/predict-injection. The default host is http://localhost:9090. The call adds ~10ms per candidate with a local server, which is negligible compared to embedding computation (~200ms) and model inference (~2s).
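For smoke-testing a running reward server outside the Go optimizers, the request shape from the diagram can be reproduced in a few lines of Python. This is a sketch under assumptions: the helper names are hypothetical, only the `text`, `model`, and `framework` fields shown in the diagram are sent (the elided fields are not guessed), and the response is assumed to be a JSON object with a `score` key as in the diagram.

```python
import json
import urllib.request


def build_predict_request(host: str, text: str, model: str,
                          framework: str) -> urllib.request.Request:
    """Build the POST request for the /predict-injection endpoint."""
    body = json.dumps({"text": text, "model": model, "framework": framework})
    return urllib.request.Request(
        url=host.rstrip("/") + "/predict-injection",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_injection(host: str, text: str, model: str, framework: str,
                      timeout: float = 2.0) -> float:
    """Send the request and return the scalar score (needs a live server)."""
    request = build_predict_request(host, text, model, framework)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return float(json.load(response)["score"])


# Placeholder values; building the request does not touch the network.
request = build_predict_request(
    "http://localhost:9090", "candidate text", "example-7b", "example-framework"
)
print(request.get_method(), request.full_url)
```

Only `predict_injection` touches the network, so the request construction can be checked without a server running.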

CLI Usage

Basic Joint Optimization

# Start the reward server (in hemlock-lab)
python harness/reward_server.py --model-path reward_model.pt

# Generate with joint optimization (40% injection weight)
hemlock craft \
  --format html \
  --payload redirect \
  --genetic \
  --injection-weight 0.4 \
  --embed-provider ollama \
  --target-query "What is the company refund policy?"

Batch Generation

hemlock batch \
  --payload override \
  --genetic \
  --injection-weight 0.3 \
  --injection-model-host http://localhost:9090 \
  --embed-provider ollama

Cover Text Controls

Two additional parameters control document structure:

# Retain 70% of the cover text and place the payload at the document start
hemlock craft \
  --format docx \
  --payload redirect \
  --genetic \
  --injection-weight 0.4 \
  --cover-text-density 0.7 \
  --payload-position start \
  --embed-provider ollama

  • --cover-text-density (0.3–1.0, default 1.0): Fraction of generated cover text to retain. Lower values produce shorter documents with a higher payload-to-text ratio.
  • --payload-position (start or end, default: format-specific): Controls whether the hidden payload is embedded at the beginning or end of the cover text.

Rather than manually tuning the injection weight and other parameters, the Bayesian optimizer searches a 10-dimensional continuous space using Gaussian Process regression with Expected Improvement acquisition. It evaluates each parameter configuration by:

  1. Generating a document corpus with the candidate parameters
  2. Ingesting into a fresh ChromaDB collection
  3. Running injection tests across all framework pipelines
  4. Running retrieval tests to measure poisoned document ranking
  5. Computing a composite reward: $0.3 \times r_{\text{retrieval}} + 0.7 \times r_{\text{injection}}$

The retrieval term keeps the reward informative even when the injection rate is zero, so the optimizer is not searching a flat reward landscape. After 50–100 evaluations, the optimizer writes best-params.json, which contains the best-performing CLI flags found for the target model.
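The composite reward from step 5 is a fixed weighted sum. A small Python sketch makes the role of the retrieval term visible; the function name and the assumption that both inputs are rates in [0, 1] are illustrative, not taken from the optimizer's code.

```python
def composite_reward(retrieval_rate: float, injection_rate: float,
                     w_retrieval: float = 0.3, w_injection: float = 0.7) -> float:
    """Composite reward: 0.3 * retrieval + 0.7 * injection by default."""
    # Assumption: both inputs are rates in [0, 1].
    for name, rate in (("retrieval_rate", retrieval_rate),
                       ("injection_rate", injection_rate)):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {rate}")
    return w_retrieval * retrieval_rate + w_injection * injection_rate


# Two configurations with zero injection still rank differently,
# because the retrieval term separates them.
low = composite_reward(0.20, 0.0)
high = composite_reward(0.95, 0.0)
print(low < high)  # True
```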

Pareto Analysis

The Pareto sweep varies the injection weight from 0.0 to 1.0 in configurable steps, measuring both injection and retrieval rates at each weight. This produces an empirical Pareto frontier: the trade-off curve between retrieval quality and injection effectiveness.

Understanding this curve is critical for practical use: some applications require high retrieval (the poisoned document must appear in results) and tolerate lower injection rates, whereas others prioritize injection success even if retrieval sometimes fails.
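Given the per-weight measurements, the frontier is the set of points that no other point beats on both axes. The Python sketch below shows one straightforward way to extract it; the record layout and the function name are assumptions, and the sweep values are synthetic numbers made up purely to exercise the function.

```python
def pareto_front(points: list[dict]) -> list[dict]:
    """Return points not dominated on (retrieval, injection); higher is better."""
    front = []
    for p in points:
        dominated = any(
            q["retrieval"] >= p["retrieval"]
            and q["injection"] >= p["injection"]
            and (q["retrieval"] > p["retrieval"] or q["injection"] > p["injection"])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Synthetic sweep results (not measured data).
sweep = [
    {"weight": 0.0, "retrieval": 0.95, "injection": 0.10},
    {"weight": 0.25, "retrieval": 0.90, "injection": 0.30},
    {"weight": 0.5, "retrieval": 0.70, "injection": 0.25},
    {"weight": 0.75, "retrieval": 0.60, "injection": 0.45},
    {"weight": 1.0, "retrieval": 0.40, "injection": 0.40},
]
print([p["weight"] for p in pareto_front(sweep)])  # [0.0, 0.25, 0.75]
```

Weights that fall off the frontier (0.5 and 1.0 in this synthetic case) are worse on both axes than some other setting, so there is no reason to pick them.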

Experimental Validation

The validation experiments run controlled A/B comparisons:

| Experiment | Comparison |
| --- | --- |
| 4.1 | Baseline vs Bayesian parameters at 7B |
| 4.2 | Similarity-only Genetic vs reward-guided Genetic at 7B |
| 4.3 | Baseline vs Bayesian parameters at 32B |
| 4.4 | Baseline vs Bayesian parameters at 72B |

Each experiment runs 30 independent trials per condition with bootstrap confidence intervals and effect size analysis.
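As an illustration of the analysis step, the sketch below computes a percentile bootstrap confidence interval for the difference in mean outcome between two conditions. It is one common bootstrap variant and may differ from what the validation harness uses; the function name, the binary per-trial outcomes, and the data are all assumptions, and the numbers are synthetic.

```python
import random


def bootstrap_diff_ci(control, treatment, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(treatment) - mean(control)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each condition independently, with replacement.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(sum(t) / len(t) - sum(c) / len(c))
    diffs.sort()
    lower = diffs[int(n_resamples * alpha / 2)]
    upper = diffs[min(n_resamples - 1, int(n_resamples * (1 - alpha / 2)))]
    return lower, upper


# Synthetic outcomes (1 = success), 30 trials per condition; not real results.
control = [1] * 15 + [0] * 15
treatment = [1] * 24 + [0] * 6
observed = sum(treatment) / len(treatment) - sum(control) / len(control)
lower, upper = bootstrap_diff_ci(control, treatment)
print(f"difference={observed:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

An interval that excludes zero suggests that the two conditions differ; the observed difference itself serves as a simple effect size on the rate scale.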

Design Decisions

Why a separate reward server? The ML inference stack (PyTorch, scikit-learn, sentence-transformers) lives in Python. Embedding it in Go would require CGo bindings or ONNX runtime integration, which is significant complexity to take on just to avoid ~10ms of per-request HTTP overhead.

Why metadata features instead of text embeddings? The training set is ~5,000 examples, and a transformer-based text model would likely overfit at this scale. Metadata features (model, framework, category, authority style) capture the dominant predictors of injection success with high sample efficiency.

Why three-way blending instead of Pareto optimization? A weighted sum is simple to implement, tune, and interpret. The --injection-weight parameter gives users direct control over the retrieval-injection trade-off, and the Pareto sweep provides empirical guidance for choosing the weight.

Why backward-compatible defaults? All new flags default to values that reproduce the original single-objective behavior (--injection-weight 0, --cover-text-density 1.0, --payload-position ""). Existing workflows are unaffected unless the user explicitly opts in.