Reward Model

The reward model is a three-stage pipeline that trains a classifier to predict injection success from experiment metadata, then serves predictions to hemlock's Go optimizers via HTTP. This enables joint optimization — blending retrieval similarity with predicted injection success during document generation.

Pipeline Overview

flowchart LR
    A["reports/**/<br/>injection-results.json<br/>retrieval-results.json"] --> B["build_training_data.py"]
    B --> C["training_data.parquet"]
    C --> D["reward_model.py"]
    D --> E["reward_model.pt"]
    E --> F["reward_server.py<br/>:9090"]
    F --> G["hemlock<br/>--injection-weight 0.4"]

    style B fill:#4a148c,stroke:#7c43bd,color:#ffffff
    style D fill:#4a148c,stroke:#7c43bd,color:#ffffff
    style F fill:#00695c,stroke:#00897b,color:#ffffff

Stage 1: Training Data (build_training_data.py)

Parses all experiment results from the reports directory into a flat tabular dataset.

Usage

python harness/build_training_data.py \
  --reports-dir reports/ \
  --output training_data.parquet \
  --format parquet

Flags

| Flag | Default | Description |
|------|---------|-------------|
| --reports-dir | ./reports | Directory containing sweep result subdirectories |
| --output | training_data.parquet | Output file path |
| --format | parquet | Output format: parquet or csv |

Output Schema

| Column | Type | Description |
|--------|------|-------------|
| framework | string | RAG framework (langchain, llamaindex, etc.) |
| payload_category | string | override, exfiltrate, redirect, denial, multistage, manyshot |
| injection_detected | bool | Label — positive class for training |
| confidence | string | high, medium, low, or null |
| indicator_hits | int | Number of injection indicator keyword matches |
| poisoned_rank | float | Rank of poisoned document in retrieval results (NaN if not retrieved) |
| poisoned_in_sources | bool | Whether poisoned document appeared in top-k retrieval |
| model | string | Target LLM (extracted from sweep directory name) |
| optimizer_type | string | baseline, cem, genetic, whitebox |
| authority_style | string | none, academic, institutional, regulatory |
| system_prompt | string | permissive, default, restrictive, etc. |
| response_length | int | Length of model response text |
| sweep_dir | string | Source sweep directory name |

Data Pipeline Details

The script joins two data sources:

  1. injection-results.json — Per-framework injection test outcomes, indicator hits, confidence levels
  2. retrieval-results.json — Retrieval rankings, poisoned document presence in top-k

Metadata (model, optimizer, authority style) is extracted from directory names and hemlock-batch.log files via regex pattern matching.
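
A minimal sketch of the join, assuming both JSON files are lists of per-framework records sharing a framework key (the real script's field handling may differ):

import json
from pathlib import Path
import pandas as pd

frames = []
for inj_path in Path("reports").rglob("injection-results.json"):
    sweep = inj_path.parent
    ret_path = sweep / "retrieval-results.json"
    if not ret_path.exists():
        continue
    inj = pd.DataFrame(json.loads(inj_path.read_text()))
    ret = pd.DataFrame(json.loads(ret_path.read_text()))
    merged = inj.merge(ret, on="framework", how="left")  # retrieval fields NaN if missing
    merged["sweep_dir"] = sweep.name  # directory name carries model/optimizer metadata
    frames.append(merged)

pd.concat(frames, ignore_index=True).to_parquet("training_data.parquet")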

Expected Dataset Size

With the current experiment history, build_training_data.py produces ~5000 rows with an ~8% positive rate (injection detected). The class imbalance is handled during training via class weighting.
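
A quick sanity check on a rebuilt dataset, using the schema columns above:

import pandas as pd

df = pd.read_parquet("training_data.parquet")
print(f"{len(df)} rows, {df['injection_detected'].mean():.1%} positive")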


Stage 2: Model Training (reward_model.py)

Trains a classifier to predict P(injection_success) from metadata features.

Usage

# Train MLP (recommended)
python harness/reward_model.py \
  --data training_data.parquet \
  --mode mlp \
  --output reward_model.pt \
  --epochs 100

# Train logistic regression baseline
python harness/reward_model.py \
  --data training_data.parquet \
  --mode baseline \
  --output reward_baseline.pt

# Evaluate a trained model
python harness/reward_model.py \
  --data training_data.parquet \
  --mode evaluate \
  --model-path reward_model.pt

Flags

| Flag | Default | Description |
|------|---------|-------------|
| --data | (required) | Training data file (Parquet or CSV) |
| --mode | mlp | Training mode: baseline (logistic), mlp, or evaluate |
| --model-path | | Model checkpoint path (for evaluate mode) |
| --output | reward_model.pt | Output model checkpoint path |
| --epochs | 100 | Training epochs for MLP |
| --lr | 1e-3 | Learning rate |

Feature Encoding

The model uses 23 metadata features (no text embeddings):

| Feature Group | Dimensions | Encoding |
|---------------|------------|----------|
| Model scale | 1 | $\log_2(\text{params in billions})$ |
| Framework | 5 | One-hot (langchain, llamaindex, unstructured, haystack, colpali) |
| Authority style | 4 | One-hot (none, academic, institutional, regulatory) |
| Optimizer type | 4 | One-hot (baseline, cem, genetic, whitebox) |
| Payload category | 6 | One-hot (override, exfiltrate, redirect, denial, multistage, manyshot) |
| Indicator hits | 1 | Raw count |
| Response length | 1 | $\log(1 + \text{length})$ |
| Poisoned in sources | 1 | Boolean → float |
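
A sketch of the encoder with vocabularies ordered as in the table (the actual feature order in reward_model.py may differ):

import numpy as np

FRAMEWORKS = ["langchain", "llamaindex", "unstructured", "haystack", "colpali"]
AUTHORITY = ["none", "academic", "institutional", "regulatory"]
OPTIMIZERS = ["baseline", "cem", "genetic", "whitebox"]
CATEGORIES = ["override", "exfiltrate", "redirect", "denial", "multistage", "manyshot"]

def one_hot(value, vocab):
    return [1.0 if value == v else 0.0 for v in vocab]

def encode(row, params_billions):
    # params_billions is parsed from the model name, e.g. qwen2.5:7b -> 7.0
    # Total: 1 + 5 + 4 + 4 + 6 + 1 + 1 + 1 = 23 dimensions
    return np.array(
        [np.log2(params_billions)]
        + one_hot(row["framework"], FRAMEWORKS)
        + one_hot(row["authority_style"], AUTHORITY)
        + one_hot(row["optimizer_type"], OPTIMIZERS)
        + one_hot(row["payload_category"], CATEGORIES)
        + [float(row["indicator_hits"]),
           np.log1p(row["response_length"]),
           float(row["poisoned_in_sources"])],
        dtype=np.float32,
    )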

Why Metadata Instead of Text Embeddings?

The training set is ~5000 examples. A transformer-based text encoder would overfit badly at this scale. Metadata features (model, framework, category, authority style) capture the dominant predictors of injection success with high sample efficiency. Text embeddings can be added when the training set grows to 50k+ rows.

Architecture

Logistic regression baseline — sklearn.LogisticRegression with class weights. Establishes what metadata features alone can achieve with a linear model.

MLP — 2-layer network:

Input(23) → Linear(128) → ReLU → Dropout(0.3) → Linear(64) → ReLU → Dropout(0.2) → Linear(1)
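
The same stack in PyTorch, as a sketch (layer names in the actual checkpoint may differ):

import torch.nn as nn

class RewardMLP(nn.Module):
    def __init__(self, in_dim=23):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64, 1),  # raw logit; sigmoid is applied at serving time
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)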

Training uses (see the sketch after this list):

  • 5-fold stratified cross-validation — ensures each fold has the same positive/negative ratio
  • BCEWithLogitsLoss with class weighting — upweights the minority positive class
  • Early stopping (patience 15 epochs) — prevents overfitting when validation AUC plateaus
  • Evaluation metric — ROC-AUC (area under the receiver operating characteristic curve)
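
The bullets above translate roughly to the following sketch, where X and y are the encoded features and labels as float tensors (simplified: full-batch updates, no feature scaler):

import torch
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def train_fold(model, X_tr, y_tr, X_va, y_va, epochs=100, lr=1e-3, patience=15):
    # Upweight the minority positive class (~8% of rows)
    pos_weight = (y_tr == 0).sum() / (y_tr == 1).sum()
    loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
    opt = torch.optim.Adam(model.parameters(), lr=lr)

    best_auc, stale = 0.0, 0
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X_tr), y_tr).backward()
        opt.step()
        with torch.no_grad():
            auc = roc_auc_score(y_va.numpy(), torch.sigmoid(model(X_va)).numpy())
        if auc > best_auc:
            best_auc, stale = auc, 0
        else:
            stale += 1
            if stale >= patience:
                break  # early stopping: validation AUC plateaued
    return best_auc

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = [train_fold(RewardMLP(), X[tr], y[tr], X[va], y[va])
        for tr, va in skf.split(X.numpy(), y.numpy())]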

Output

The checkpoint (reward_model.pt) contains (see the save/load sketch below):

  • Model state dictionary (weights)
  • StandardScaler parameters (mean, scale) for feature normalization
  • Feature dimensionality
  • Training metadata (best AUC, number of folds)
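
Given that layout, saving and reloading might look like this sketch (key names are illustrative; model, scaler, and best_auc come from the training step):

import torch

torch.save({
    "state_dict": model.state_dict(),
    "scaler_mean": scaler.mean_,    # StandardScaler normalization parameters
    "scaler_scale": scaler.scale_,
    "feature_dim": 23,
    "best_auc": best_auc,
    "n_folds": 5,
}, "reward_model.pt")

ckpt = torch.load("reward_model.pt")
model = RewardMLP(ckpt["feature_dim"])
model.load_state_dict(ckpt["state_dict"])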

Stage 3: Serving (reward_server.py)

FastAPI server that loads a trained checkpoint and serves injection score predictions to hemlock's Go optimizers.

Usage

python harness/reward_server.py \
  --model-path reward_model.pt \
  --port 9090 \
  --host 0.0.0.0

Flags

| Flag | Default | Description |
|------|---------|-------------|
| --model-path | (required) | Path to trained model checkpoint |
| --port | 9090 | Server port |
| --host | 0.0.0.0 | Server bind address |

Endpoints

POST /predict-injection

Single prediction for one candidate document.

Request:

{
  "text": "document content...",
  "model": "qwen2.5:7b",
  "framework": "langchain",
  "authority_style": "academic",
  "optimizer_type": "genetic",
  "payload_category": "redirect",
  "indicator_hits": 0,
  "response_length": 200,
  "poisoned_in_sources": false
}

Response:

{
  "score": 0.73
}

The score is the sigmoid output of the MLP: $P(\text{injection\_success}) \in [0, 1]$.
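
For example, scoring one candidate from Python (same fields as the request above):

import requests

resp = requests.post("http://localhost:9090/predict-injection", json={
    "text": "document content...",
    "model": "qwen2.5:7b",
    "framework": "langchain",
    "authority_style": "academic",
    "optimizer_type": "genetic",
    "payload_category": "redirect",
    "indicator_hits": 0,
    "response_length": 200,
    "poisoned_in_sources": False,
})
print(resp.json()["score"])  # e.g. 0.73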

POST /predict-batch

Batch predictions for multiple candidates (used by the Genetic optimizer's population scoring).

Request:

{
  "items": [
    {"text": "...", "model": "qwen2.5:7b", "framework": "langchain", ...},
    {"text": "...", "model": "qwen2.5:7b", "framework": "haystack", ...}
  ]
}

Response:

{
  "scores": [0.73, 0.45]
}

GET /health

Health check endpoint.

Response:

{
  "status": "ok",
  "model_loaded": true,
  "feature_dim": 22
}
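
A minimal FastAPI skeleton matching these endpoints, as a sketch (it assumes a loaded reward_model plus the hypothetical encode and params_from_name helpers from the feature-encoding sketch; the real reward_server.py also applies the saved StandardScaler and implements /predict-batch):

import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Candidate(BaseModel):
    text: str
    model: str
    framework: str
    authority_style: str
    optimizer_type: str
    payload_category: str
    indicator_hits: int = 0
    response_length: int = 0
    poisoned_in_sources: bool = False

@app.post("/predict-injection")
def predict_injection(c: Candidate):
    x = encode(c.dict(), params_from_name(c.model))  # hypothetical helpers
    with torch.no_grad():
        score = torch.sigmoid(reward_model(torch.from_numpy(x).unsqueeze(0)))
    return {"score": score.item()}

@app.get("/health")
def health():
    return {"status": "ok", "model_loaded": True, "feature_dim": 23}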

Integration with hemlock

The Go optimizers call the reward server during candidate evaluation when --injection-weight > 0:

# Start the reward server
python harness/reward_server.py --model-path reward_model.pt

# In another terminal, run hemlock with joint optimization
hemlock batch \
  --payload redirect \
  --genetic \
  --injection-weight 0.4 \
  --injection-model-host http://localhost:9090 \
  --embed-provider ollama

The Go code in score_injection.go POSTs to /predict-injection for each candidate document during optimization. The 3-way blended scoring function is:

$$ \text{score} = (1 - w_{\text{inj}} - w_{\text{nat}}) \cdot s_{\text{sim}} + w_{\text{nat}} \cdot s_{\text{nat}} + w_{\text{inj}} \cdot s_{\text{inj}} $$
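
In Python terms (hemlock implements this in Go; the 0.4 default here just mirrors the CLI example above):

def blended_score(s_sim, s_nat, s_inj, w_inj=0.4, w_nat=0.0):
    # Weight not assigned to the injection or nat terms goes to similarity
    return (1 - w_inj - w_nat) * s_sim + w_nat * s_nat + w_inj * s_inj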

Retraining

As new experiment data accumulates (from Phase 1 batch runs, Bayesian optimizer evaluations, or validation experiments), rebuild the training set and retrain:

# Rebuild training data (picks up all new results)
python harness/build_training_data.py --reports-dir reports/

# Retrain
python harness/reward_model.py --data training_data.parquet --mode mlp --output reward_model.pt

# Restart server with new model
python harness/reward_server.py --model-path reward_model.pt

See Also