Reward Model¶
The reward model is a three-stage pipeline that trains a classifier to predict injection success from experiment metadata, then serves predictions to hemlock's Go optimizers via HTTP. This enables joint optimization — blending retrieval similarity with predicted injection success during document generation.
Pipeline Overview¶
```mermaid
flowchart LR
    A["reports/**/<br/>injection-results.json<br/>retrieval-results.json"] --> B["build_training_data.py"]
    B --> C["training_data.parquet"]
    C --> D["reward_model.py"]
    D --> E["reward_model.pt"]
    E --> F["reward_server.py<br/>:9090"]
    F --> G["hemlock<br/>--injection-weight 0.4"]
    style B fill:#4a148c,stroke:#7c43bd,color:#ffffff
    style D fill:#4a148c,stroke:#7c43bd,color:#ffffff
    style F fill:#00695c,stroke:#00897b,color:#ffffff
```
Stage 1: Training Data (build_training_data.py)¶
Parses all experiment results from the reports directory into a flat tabular dataset.
Usage¶
```bash
python harness/build_training_data.py \
  --reports-dir reports/ \
  --output training_data.parquet \
  --format parquet
```
Flags¶
| Flag | Default | Description |
|---|---|---|
| `--reports-dir` | `./reports` | Directory containing sweep result subdirectories |
| `--output` | `training_data.parquet` | Output file path |
| `--format` | `parquet` | Output format: `parquet` or `csv` |
Output Schema¶
| Column | Type | Description |
|---|---|---|
| `framework` | string | RAG framework (langchain, llamaindex, etc.) |
| `payload_category` | string | override, exfiltrate, redirect, denial, multistage, manyshot |
| `injection_detected` | bool | Training label (positive class) |
| `confidence` | string | high, medium, low, or null |
| `indicator_hits` | int | Number of injection indicator keyword matches |
| `poisoned_rank` | float | Rank of the poisoned document in retrieval results (NaN if not retrieved) |
| `poisoned_in_sources` | bool | Whether the poisoned document appeared in top-k retrieval |
| `model` | string | Target LLM (extracted from sweep directory name) |
| `optimizer_type` | string | baseline, cem, genetic, whitebox |
| `authority_style` | string | none, academic, institutional, regulatory |
| `system_prompt` | string | permissive, default, restrictive, etc. |
| `response_length` | int | Length of the model response text |
| `sweep_dir` | string | Source sweep directory name |
Data Pipeline Details¶
The script joins two data sources:
- `injection-results.json`: per-framework injection test outcomes, indicator hits, confidence levels
- `retrieval-results.json`: retrieval rankings, poisoned-document presence in top-k
Metadata (model, optimizer, authority style) is extracted from directory names and hemlock-batch.log files via regex pattern matching.
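The directory naming convention itself is defined by the sweep scripts. As a purely hypothetical sketch of the technique (the pattern and group names below are assumptions for illustration, not the real regexes in build_training_data.py):

```python
import re

# Hypothetical pattern for a sweep directory such as "qwen2.5-7b_genetic_academic";
# the actual patterns live in build_training_data.py.
SWEEP_RE = re.compile(
    r"(?P<model>[\w.:-]+)_(?P<optimizer>baseline|cem|genetic|whitebox)_(?P<authority>\w+)"
)

def parse_sweep_dir(name: str) -> dict:
    """Extract metadata fields from a sweep directory name, if it matches."""
    m = SWEEP_RE.search(name)
    return m.groupdict() if m else {}
```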
Expected Dataset Size
With the current experiment history, build_training_data.py produces ~5000 rows with an ~8% positive rate (injection detected). The class imbalance is handled during training via class weighting.
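A quick way to confirm the row count and positive rate after a rebuild (assumes pandas is installed):

```python
import pandas as pd

df = pd.read_parquet("training_data.parquet")
print(len(df))                          # ~5000 rows
print(df["injection_detected"].mean())  # ~0.08 positive rate
```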
Stage 2: Model Training (reward_model.py)¶
Trains a classifier to predict P(injection_success) from metadata features.
Usage¶
```bash
# Train MLP (recommended)
python harness/reward_model.py \
  --data training_data.parquet \
  --mode mlp \
  --output reward_model.pt \
  --epochs 100

# Train logistic regression baseline
python harness/reward_model.py \
  --data training_data.parquet \
  --mode baseline \
  --output reward_baseline.pt

# Evaluate a trained model
python harness/reward_model.py \
  --data training_data.parquet \
  --mode evaluate \
  --model-path reward_model.pt
```
Flags¶
| Flag | Default | Description |
|---|---|---|
| `--data` | (required) | Training data file (Parquet or CSV) |
| `--mode` | `mlp` | Training mode: `baseline` (logistic), `mlp`, or `evaluate` |
| `--model-path` | (none) | Model checkpoint path (for `evaluate` mode) |
| `--output` | `reward_model.pt` | Output model checkpoint path |
| `--epochs` | `100` | Training epochs for MLP |
| `--lr` | `1e-3` | Learning rate |
Feature Encoding¶
The model uses 23 metadata features (no text embeddings):
| Feature Group | Dimensions | Encoding |
|---|---|---|
| Model scale | 1 | $\log_2(\text{params in billions})$ |
| Framework | 5 | One-hot (langchain, llamaindex, unstructured, haystack, colpali) |
| Authority style | 4 | One-hot (none, academic, institutional, regulatory) |
| Optimizer type | 4 | One-hot (baseline, cem, genetic, whitebox) |
| Payload category | 6 | One-hot (override, exfiltrate, redirect, denial, multistage, manyshot) |
| Indicator hits | 1 | Raw count |
| Response length | 1 | $\log(1 + \text{length})$ |
| Poisoned in sources | 1 | Boolean → float |
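As a concrete illustration of the table above, the encoder can be sketched as follows. The vocabulary orderings, the `params_billions` field, and the function names are assumptions; the real encoder lives in reward_model.py:

```python
import math
import numpy as np

# Vocabulary orderings here are assumptions, not the canonical ones.
FRAMEWORKS = ["langchain", "llamaindex", "unstructured", "haystack", "colpali"]
AUTHORITY_STYLES = ["none", "academic", "institutional", "regulatory"]
OPTIMIZERS = ["baseline", "cem", "genetic", "whitebox"]
CATEGORIES = ["override", "exfiltrate", "redirect", "denial", "multistage", "manyshot"]

def one_hot(value: str, vocab: list[str]) -> list[float]:
    return [1.0 if value == v else 0.0 for v in vocab]

def encode(row: dict) -> np.ndarray:
    feats = [math.log2(row["params_billions"])]                 # model scale (1)
    feats += one_hot(row["framework"], FRAMEWORKS)              # framework (5)
    feats += one_hot(row["authority_style"], AUTHORITY_STYLES)  # authority style (4)
    feats += one_hot(row["optimizer_type"], OPTIMIZERS)         # optimizer type (4)
    feats += one_hot(row["payload_category"], CATEGORIES)       # payload category (6)
    feats.append(float(row["indicator_hits"]))                  # indicator hits (1)
    feats.append(math.log1p(row["response_length"]))            # response length (1)
    feats.append(1.0 if row["poisoned_in_sources"] else 0.0)    # poisoned in sources (1)
    return np.asarray(feats, dtype=np.float32)                  # 23 dims total
```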
Why Metadata Instead of Text Embeddings?
The training set is ~5000 examples. A transformer-based text encoder would overfit badly at this scale. Metadata features (model, framework, category, authority style) capture the dominant predictors of injection success with high sample efficiency. Text embeddings can be added when the training set grows to 50k+ rows.
Architecture¶
**Logistic regression baseline**: `sklearn.linear_model.LogisticRegression` with class weights. Establishes a reference for how far metadata features alone can go with a linear model.

**MLP**: a 2-layer feed-forward network, sketched below.
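The exact layer sizes are not listed here, so this minimal PyTorch sketch assumes a hidden width of 64 and light dropout; see reward_model.py for the actual architecture:

```python
import torch.nn as nn

class RewardMLP(nn.Module):
    """2-layer MLP over the 23 metadata features (hidden width is an assumption)."""

    def __init__(self, in_dim: int = 23, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, 1),  # single logit; sigmoid is applied at serving time
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)
```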
Training uses:
- 5-fold stratified cross-validation — ensures each fold has the same positive/negative ratio
- BCEWithLogitsLoss with class weighting — upweights the minority positive class
- Early stopping (patience 15 epochs) — prevents overfitting when validation AUC plateaus
- Evaluation metric — ROC-AUC (area under the receiver operating characteristic curve)
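These pieces map onto a conventional PyTorch loop. The following sketch (reusing the `RewardMLP` sketch above; loop structure and variable names are assumptions, not the reward_model.py implementation) shows how the class weighting and early stopping fit together:

```python
import numpy as np
import torch
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def train_cv(X: np.ndarray, y: np.ndarray, epochs: int = 100, patience: int = 15):
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, val_idx in skf.split(X, y):
        model = RewardMLP(in_dim=X.shape[1])
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        # Upweight the minority positive class: pos_weight = #negatives / #positives.
        pw = (y[train_idx] == 0).sum() / (y[train_idx] == 1).sum()
        loss_fn = torch.nn.BCEWithLogitsLoss(
            pos_weight=torch.tensor(pw, dtype=torch.float32)
        )
        xt = torch.as_tensor(X[train_idx], dtype=torch.float32)
        yt = torch.as_tensor(y[train_idx], dtype=torch.float32)
        xv = torch.as_tensor(X[val_idx], dtype=torch.float32)
        best_auc, stale = 0.0, 0
        for _ in range(epochs):
            opt.zero_grad()
            loss_fn(model(xt), yt).backward()
            opt.step()
            with torch.no_grad():
                auc = roc_auc_score(y[val_idx], model(xv).numpy())
            best_auc, stale = (auc, 0) if auc > best_auc else (best_auc, stale + 1)
            if stale >= patience:  # stop when validation AUC has plateaued
                break
```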
Output¶
The checkpoint (reward_model.pt) contains:
- Model state dictionary (weights)
- StandardScaler parameters (mean, scale) for feature normalization
- Feature dimensionality
- Training metadata (best AUC, number of folds)
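Downstream code can reload everything needed for inference from that single file. The key names below are assumptions for illustration; see reward_server.py for the real ones:

```python
import torch

ckpt = torch.load("reward_model.pt", map_location="cpu")
model = RewardMLP(in_dim=ckpt["feature_dim"])  # RewardMLP sketched above
model.load_state_dict(ckpt["state_dict"])
model.eval()

# Apply the stored StandardScaler parameters before scoring:
# x_norm = (x - ckpt["scaler_mean"]) / ckpt["scaler_scale"]
```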
Stage 3: Serving (reward_server.py)¶
FastAPI server that loads a trained checkpoint and serves injection score predictions to hemlock's Go optimizers.
Usage¶
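Point the server at a trained checkpoint; it binds to port 9090 by default (see the flags below):

```bash
python harness/reward_server.py \
  --model-path reward_model.pt \
  --port 9090
```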
Flags¶
| Flag | Default | Description |
|---|---|---|
| `--model-path` | (required) | Path to the trained model checkpoint |
| `--port` | `9090` | Server port |
| `--host` | `0.0.0.0` | Server bind address |
Endpoints¶
POST /predict-injection¶
Single prediction for one candidate document.
Request:
```json
{
  "text": "document content...",
  "model": "qwen2.5:7b",
  "framework": "langchain",
  "authority_style": "academic",
  "optimizer_type": "genetic",
  "payload_category": "redirect",
  "indicator_hits": 0,
  "response_length": 200,
  "poisoned_in_sources": false
}
```
Response:
The score is the sigmoid output of the MLP: $P(\text{injection_success}) \in [0, 1]$.
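The exact response schema is not reproduced here; a plausible shape, with the field name assumed, is:

```json
{
  "score": 0.73
}
```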
POST /predict-batch¶
Batch predictions for multiple candidates (used by the Genetic optimizer's population scoring).
Request:
```json
{
  "items": [
    {"text": "...", "model": "qwen2.5:7b", "framework": "langchain", ...},
    {"text": "...", "model": "qwen2.5:7b", "framework": "haystack", ...}
  ]
}
```
Response:
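A plausible batch response, with scores in request order (field name assumed; check reward_server.py for the exact schema):

```json
{
  "scores": [0.73, 0.12]
}
```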
GET /health¶
Health check endpoint.
Response:
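A typical shape for a health response (fields assumed):

```json
{
  "status": "ok"
}
```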
Integration with hemlock¶
The Go optimizers call the reward server during candidate evaluation when --injection-weight > 0:
```bash
# Start the reward server
python harness/reward_server.py --model-path reward_model.pt

# In another terminal, run hemlock with joint optimization
hemlock batch \
  --payload redirect \
  --genetic \
  --injection-weight 0.4 \
  --injection-model-host http://localhost:9090 \
  --embed-provider ollama
```
The Go code in score_injection.go POSTs to /predict-injection for each candidate document during optimization. The 3-way blended scoring function is:
$$ \text{score} = (1 - w_{\text{inj}} - w_{\text{nat}}) \cdot s_{\text{sim}} + w_{\text{nat}} \cdot s_{\text{nat}} + w_{\text{inj}} \cdot s_{\text{inj}} $$
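The blend is straightforward to express. For reference, here is a Python rendering of the formula above (illustrative only; the authoritative implementation is the Go code in score_injection.go):

```python
def blended_score(s_sim: float, s_nat: float, s_inj: float,
                  w_inj: float = 0.4, w_nat: float = 0.0) -> float:
    """3-way blend of similarity, naturalness, and predicted injection score.

    Weights must satisfy w_inj + w_nat <= 1 so the similarity term stays
    non-negative. Mirrors the formula above, not the actual Go code.
    """
    return (1 - w_inj - w_nat) * s_sim + w_nat * s_nat + w_inj * s_inj
```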
Retraining¶
As new experiment data accumulates (from Phase 1 batch runs, Bayesian optimizer evaluations, or validation experiments), rebuild the training set and retrain:
```bash
# Rebuild training data (picks up all new results)
python harness/build_training_data.py --reports-dir reports/

# Retrain
python harness/reward_model.py --data training_data.parquet --mode mlp --output reward_model.pt

# Restart server with new model
python harness/reward_server.py --model-path reward_model.pt
```
See Also¶
- Joint Optimization — hemlock-side scoring architecture
- Bayesian Optimizer — hyperparameter search
- Validation Experiments — tests using reward-guided optimization
- Optimization Architecture — system-level diagram