Reward Model

The reward model is a three-stage pipeline that trains a classifier to predict injection success from experiment metadata, then serves predictions to hemlock's Go optimizers via HTTP. This enables joint optimization — blending retrieval similarity with predicted injection success during document generation.

Pipeline Overview

flowchart LR
    A["reports/**/<br/>injection-results.json<br/>retrieval-results.json"] --> B["build_training_data.py"]
    B --> C["training_data.parquet"]
    C --> D["reward_model.py"]
    D --> E["reward_model.pt"]
    E --> F["reward_server.py<br/>:9090"]
    F --> G["hemlock<br/>--injection-weight 0.4"]

    style B fill:#4a148c,stroke:#7c43bd,color:#ffffff
    style D fill:#4a148c,stroke:#7c43bd,color:#ffffff
    style F fill:#00695c,stroke:#00897b,color:#ffffff

Stage 1: Training Data (build_training_data.py)

Parses all experiment results from the reports directory into a flat tabular dataset.

Usage

python harness/build_training_data.py \
  --reports-dir reports/ \
  --output training_data.parquet \
  --format parquet

Flags

| Flag | Default | Description |
|------|---------|-------------|
| --reports-dir | ./reports | Directory containing sweep result subdirectories |
| --output | training_data.parquet | Output file path |
| --format | parquet | Output format: parquet or csv |

Output Schema

| Column | Type | Description |
|--------|------|-------------|
| framework | string | RAG framework (langchain, llamaindex, etc.) |
| payload_category | string | override, exfiltrate, redirect, denial, multistage, manyshot |
| injection_detected | bool | Label — positive class for training |
| confidence | string | high, medium, low, or null |
| indicator_hits | int | Number of injection indicator keyword matches |
| poisoned_rank | float | Rank of poisoned document in retrieval results (NaN if not retrieved) |
| poisoned_in_sources | bool | Whether poisoned document appeared in top-k retrieval |
| model | string | Target LLM (extracted from sweep directory name) |
| optimizer_type | string | baseline, cem, genetic, whitebox |
| authority_style | string | none, academic, institutional, regulatory |
| system_prompt | string | permissive, default, restrictive, etc. |
| response_length | int | Length of model response text |
| sweep_dir | string | Source sweep directory name |

Data Pipeline Details

The script joins two data sources:

  1. injection-results.json — Per-framework injection test outcomes, indicator hits, confidence levels
  2. retrieval-results.json — Retrieval rankings, poisoned document presence in top-k

Metadata (model, optimizer, authority style) is extracted from directory names and hemlock-batch.log files via regex pattern matching.
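
A minimal sketch of the join, assuming both JSON files are lists of per-framework records sharing a framework key (the real script's field handling may differ):

import json
from pathlib import Path
import pandas as pd

frames = []
for inj_path in Path("reports").rglob("injection-results.json"):
    sweep = inj_path.parent
    ret_path = sweep / "retrieval-results.json"
    if not ret_path.exists():
        continue
    inj = pd.DataFrame(json.loads(inj_path.read_text()))
    ret = pd.DataFrame(json.loads(ret_path.read_text()))
    merged = inj.merge(ret, on="framework", how="left")  # retrieval fields NaN if missing
    merged["sweep_dir"] = sweep.name  # directory name carries model/optimizer metadata
    frames.append(merged)

pd.concat(frames, ignore_index=True).to_parquet("training_data.parquet")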

Expected Dataset Size

With the current experiment history, build_training_data.py produces ~5000 rows with an ~8% positive rate (injection detected). The class imbalance is handled during training via class weighting.
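
A quick sanity check on a rebuilt dataset, using the schema columns above:

import pandas as pd

df = pd.read_parquet("training_data.parquet")
print(f"{len(df)} rows, {df['injection_detected'].mean():.1%} positive")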


Stage 2: Model Training (reward_model.py)

Trains a classifier to predict P(injection_success) from metadata features.

Usage

# Train MLP (recommended)
python harness/reward_model.py \
  --data training_data.parquet \
  --mode mlp \
  --output reward_model.pt \
  --epochs 100

# Train logistic regression baseline
python harness/reward_model.py \
  --data training_data.parquet \
  --mode baseline \
  --output reward_baseline.pt

# Evaluate a trained model
python harness/reward_model.py \
  --data training_data.parquet \
  --mode evaluate \
  --model-path reward_model.pt

Flags

| Flag | Default | Description |
|------|---------|-------------|
| --data | (required) | Training data file (Parquet or CSV) |
| --mode | mlp | Training mode: baseline (logistic), mlp, or evaluate |
| --model-path | | Model checkpoint path (for evaluate mode) |
| --output | reward_model.pt | Output model checkpoint path |
| --epochs | 100 | Training epochs for MLP |
| --lr | 1e-3 | Learning rate |

Feature Encoding

The model uses 23 metadata features (no text embeddings):

| Feature Group | Dimensions | Encoding |
|---------------|------------|----------|
| Model scale | 1 | $\log_2(\text{params in billions})$ |
| Framework | 5 | One-hot (langchain, llamaindex, unstructured, haystack, colpali) |
| Authority style | 4 | One-hot (none, academic, institutional, regulatory) |
| Optimizer type | 4 | One-hot (baseline, cem, genetic, whitebox) |
| Payload category | 6 | One-hot (override, exfiltrate, redirect, denial, multistage, manyshot) |
| Indicator hits | 1 | Raw count |
| Response length | 1 | $\log(1 + \text{length})$ |
| Poisoned in sources | 1 | Boolean → float |
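
A sketch of the encoder with vocabularies ordered as in the table (the actual feature order in reward_model.py may differ):

import numpy as np

FRAMEWORKS = ["langchain", "llamaindex", "unstructured", "haystack", "colpali"]
AUTHORITY = ["none", "academic", "institutional", "regulatory"]
OPTIMIZERS = ["baseline", "cem", "genetic", "whitebox"]
CATEGORIES = ["override", "exfiltrate", "redirect", "denial", "multistage", "manyshot"]

def one_hot(value, vocab):
    return [1.0 if value == v else 0.0 for v in vocab]

def encode(row, params_billions):
    # params_billions is parsed from the model name, e.g. qwen2.5:7b -> 7.0
    # Total: 1 + 5 + 4 + 4 + 6 + 1 + 1 + 1 = 23 dimensions
    return np.array(
        [np.log2(params_billions)]
        + one_hot(row["framework"], FRAMEWORKS)
        + one_hot(row["authority_style"], AUTHORITY)
        + one_hot(row["optimizer_type"], OPTIMIZERS)
        + one_hot(row["payload_category"], CATEGORIES)
        + [float(row["indicator_hits"]),
           np.log1p(row["response_length"]),
           float(row["poisoned_in_sources"])],
        dtype=np.float32,
    )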

Why Metadata Instead of Text Embeddings?

The training set is ~5000 examples. A transformer-based text encoder would overfit badly at this scale. Metadata features (model, framework, category, authority style) capture the dominant predictors of injection success with high sample efficiency. Text embeddings can be added when the training set grows to 50k+ rows.

Architecture

Logistic regression baseline — sklearn.LogisticRegression with class weights. Establishes what metadata features alone can achieve with a linear model.

MLP — 2-layer network:

Input(23) → Linear(128) → ReLU → Dropout(0.3) → Linear(64) → ReLU → Dropout(0.2) → Linear(1)
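
The same stack in PyTorch, as a sketch (layer names in the actual checkpoint may differ):

import torch.nn as nn

class RewardMLP(nn.Module):
    def __init__(self, in_dim=23):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64, 1),  # raw logit; sigmoid is applied at serving time
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)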

Training uses (see the sketch after this list):

  • 5-fold stratified cross-validation — ensures each fold has the same positive/negative ratio
  • BCEWithLogitsLoss with class weighting — upweights the minority positive class
  • Early stopping (patience 15 epochs) — prevents overfitting when validation AUC plateaus
  • Evaluation metric — ROC-AUC (area under the receiver operating characteristic curve)
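
The bullets above translate roughly to the following sketch, where X and y are the encoded features and labels as float tensors (simplified: full-batch updates, no feature scaler):

import torch
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def train_fold(model, X_tr, y_tr, X_va, y_va, epochs=100, lr=1e-3, patience=15):
    # Upweight the minority positive class (~8% of rows)
    pos_weight = (y_tr == 0).sum() / (y_tr == 1).sum()
    loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
    opt = torch.optim.Adam(model.parameters(), lr=lr)

    best_auc, stale = 0.0, 0
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X_tr), y_tr).backward()
        opt.step()
        with torch.no_grad():
            auc = roc_auc_score(y_va.numpy(), torch.sigmoid(model(X_va)).numpy())
        if auc > best_auc:
            best_auc, stale = auc, 0
        else:
            stale += 1
            if stale >= patience:
                break  # early stopping: validation AUC plateaued
    return best_auc

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = [train_fold(RewardMLP(), X[tr], y[tr], X[va], y[va])
        for tr, va in skf.split(X.numpy(), y.numpy())]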

Output

The checkpoint (reward_model.pt) contains (see the save/load sketch below):

  • Model state dictionary (weights)
  • StandardScaler parameters (mean, scale) for feature normalization
  • Feature dimensionality
  • Training metadata (best AUC, number of folds)
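
Given that layout, saving and reloading might look like this sketch (key names are illustrative; model, scaler, and best_auc come from the training step):

import torch

torch.save({
    "state_dict": model.state_dict(),
    "scaler_mean": scaler.mean_,    # StandardScaler normalization parameters
    "scaler_scale": scaler.scale_,
    "feature_dim": 23,
    "best_auc": best_auc,
    "n_folds": 5,
}, "reward_model.pt")

ckpt = torch.load("reward_model.pt")
model = RewardMLP(ckpt["feature_dim"])
model.load_state_dict(ckpt["state_dict"])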

Stage 3: Serving (reward_server.py)

FastAPI server that loads a trained checkpoint and serves injection score predictions to hemlock's Go optimizers.

Usage

python harness/reward_server.py \
  --model-path reward_model.pt \
  --port 9090 \
  --host 0.0.0.0

Flags

| Flag | Default | Description |
|------|---------|-------------|
| --model-path | (required) | Path to trained model checkpoint |
| --port | 9090 | Server port |
| --host | 0.0.0.0 | Server bind address |

Endpoints

POST /predict-injection

Single prediction for one candidate document.

Request:

{
  "text": "document content...",
  "model": "qwen2.5:7b",
  "framework": "langchain",
  "authority_style": "academic",
  "optimizer_type": "genetic",
  "payload_category": "redirect",
  "indicator_hits": 0,
  "response_length": 200,
  "poisoned_in_sources": false
}

Response:

{
  "score": 0.73
}

The score is the sigmoid output of the MLP: $P(\text{injection\_success}) \in [0, 1]$.
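
For example, scoring one candidate from Python (same fields as the request above):

import requests

resp = requests.post("http://localhost:9090/predict-injection", json={
    "text": "document content...",
    "model": "qwen2.5:7b",
    "framework": "langchain",
    "authority_style": "academic",
    "optimizer_type": "genetic",
    "payload_category": "redirect",
    "indicator_hits": 0,
    "response_length": 200,
    "poisoned_in_sources": False,
})
print(resp.json()["score"])  # e.g. 0.73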

POST /predict-batch

Batch predictions for multiple candidates (used by the Genetic optimizer's population scoring).

Request:

{
  "items": [
    {"text": "...", "model": "qwen2.5:7b", "framework": "langchain", ...},
    {"text": "...", "model": "qwen2.5:7b", "framework": "haystack", ...}
  ]
}

Response:

{
  "scores": [0.73, 0.45]
}

GET /health

Health check endpoint.

Response:

{
  "status": "ok",
  "model_loaded": true,
  "feature_dim": 22
}
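
A minimal FastAPI skeleton matching these endpoints, as a sketch (it assumes a loaded reward_model plus the hypothetical encode and params_from_name helpers from the feature-encoding sketch; the real reward_server.py also applies the saved StandardScaler and implements /predict-batch):

import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Candidate(BaseModel):
    text: str
    model: str
    framework: str
    authority_style: str
    optimizer_type: str
    payload_category: str
    indicator_hits: int = 0
    response_length: int = 0
    poisoned_in_sources: bool = False

@app.post("/predict-injection")
def predict_injection(c: Candidate):
    x = encode(c.dict(), params_from_name(c.model))  # hypothetical helpers
    with torch.no_grad():
        score = torch.sigmoid(reward_model(torch.from_numpy(x).unsqueeze(0)))
    return {"score": score.item()}

@app.get("/health")
def health():
    return {"status": "ok", "model_loaded": True, "feature_dim": 23}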

Integration with hemlock

The Go optimizers call the reward server during candidate evaluation when --injection-weight > 0:

# Start the reward server
python harness/reward_server.py --model-path reward_model.pt

# In another terminal, run hemlock with joint optimization
hemlock batch \
  --payload redirect \
  --genetic \
  --injection-weight 0.4 \
  --injection-model-host http://localhost:9090 \
  --embed-provider ollama

The Go code in score_injection.go POSTs to /predict-injection for each candidate document during optimization. The 3-way blended scoring function is:

$$ \text{score} = (1 - w_{\text{inj}} - w_{\text{nat}}) \cdot s_{\text{sim}} + w_{\text{nat}} \cdot s_{\text{nat}} + w_{\text{inj}} \cdot s_{\text{inj}} $$
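
In Python terms (hemlock implements this in Go; the 0.4 default here just mirrors the CLI example above):

def blended_score(s_sim, s_nat, s_inj, w_inj=0.4, w_nat=0.0):
    # Weight not assigned to the injection or nat terms goes to similarity
    return (1 - w_inj - w_nat) * s_sim + w_nat * s_nat + w_inj * s_inj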

Retraining

As new experiment data accumulates (from Phase 1 batch runs, Bayesian optimizer evaluations, or validation experiments), rebuild the training set and retrain:

# Rebuild training data (picks up all new results)
python harness/build_training_data.py --reports-dir reports/

# Retrain
python harness/reward_model.py --data training_data.parquet --mode mlp --output reward_model.pt

# Restart server with new model
python harness/reward_server.py --model-path reward_model.pt

See Also