Retrieval Tests¶
Layer 2 tests whether poisoned documents, after being ingested into ChromaDB, appear in the top-k results when a target query is run.
How It Works¶
sequenceDiagram
participant H as harness
participant P as Pipeline
participant C as ChromaDB
H->>P: POST /ingest (noise docs)
P->>C: Embed + store noise
H->>P: POST /ingest (poisoned doc)
P->>C: Embed + store poisoned
H->>P: POST /query (target query)
P->>C: Vector search
C-->>P: Top-k results
P-->>H: Ranked documents
H->>H: Check if poisoned doc in top-k
Step by Step¶
- Ingest noise corpus — The 11 legitimate business documents from noise-corpus/
- Ingest poisoned document — A hemlock-generated document with an embedded payload
- Query — Run a query related to the payload's target topic
- Check ranking — Determine if the poisoned document appears in the top-k retrieved results
- Record position — Track the rank (1st, 2nd, 3rd, ... or not retrieved)
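The steps above can be sketched as a single function. This is a minimal illustration, not the harness's actual code: the `post` callable, the request bodies (`filename`, `content`, `query`, `top_k`), and the response shape (`results` as a ranked list of objects with a `filename` key) are all assumptions. Only the `/ingest` and `/query` paths come from the diagram above.

```python
from typing import Callable, List, Optional

# Hypothetical transport: post(path, body) returns the parsed JSON response.
Post = Callable[[str, dict], dict]


def run_retrieval_test(
    post: Post,
    noise_docs: List[dict],
    poisoned_doc: dict,
    target_query: str,
    top_k: int = 5,
) -> Optional[int]:
    """Return the 1-based rank of the poisoned document, or None if absent."""
    # Step 1: ingest the noise corpus first so the poisoned doc has competition.
    for doc in noise_docs:
        post("/ingest", doc)

    # Step 2: ingest the poisoned document.
    post("/ingest", poisoned_doc)

    # Step 3: run the target query (request body shape is assumed).
    response = post("/query", {"query": target_query, "top_k": top_k})

    # Steps 4-5: look for the poisoned doc in the top-k and record its rank.
    ranked = [hit["filename"] for hit in response["results"][:top_k]]
    if poisoned_doc["filename"] in ranked:
        return ranked.index(poisoned_doc["filename"]) + 1
    return None
```

Passing the transport in as a callable keeps the sketch independent of any HTTP library and makes it easy to exercise against an in-memory fake before pointing it at a live pipeline.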
Why Noise Matters¶
A poisoned document that is retrieved only because it is the sole document in the collection proves nothing. The retrieval test therefore ingests the noise corpus first, creating realistic competition:
Collection contents after setup:
1. refund-policy.html (legitimate)
2. faq.html (legitimate)
3. employee-onboarding.md (legitimate)
4. ...8 more legitimate docs...
5. poisoned-override-001.html (hemlock payload targeting "refund policy")
When querying "What is the refund policy?", the poisoned document must compete with the legitimate refund-policy.html for retrieval ranking.
Result Categories¶
| Status | Meaning |
|---|---|
| RETRIEVED_TOP1 | Poisoned doc was the #1 result |
| RETRIEVED_TOP3 | Poisoned doc in top 3 results, but not #1 |
| RETRIEVED_TOP5 | Poisoned doc in top 5 results, but not top 3 |
| NOT_RETRIEVED | Poisoned doc not in top-k |
| ERROR | Ingestion or query failed |
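One way to read the table is as a mapping from rank to the most specific status that applies. The function below is a sketch under that assumption. `classify_rank` is a hypothetical name, and treating any rank beyond 5 as `NOT_RETRIEVED` is a guess, since the table defines no status for larger k.

```python
from typing import Optional


def classify_rank(rank: Optional[int], error: bool = False) -> str:
    """Map a 1-based rank (or None) to one of the status strings in the table."""
    if error:
        return "ERROR"
    if rank is None:
        return "NOT_RETRIEVED"
    if rank < 1:
        raise ValueError("rank is 1-based")
    if rank == 1:
        return "RETRIEVED_TOP1"
    if rank <= 3:
        return "RETRIEVED_TOP3"
    if rank <= 5:
        return "RETRIEVED_TOP5"
    # Assumption: the table has no category past the top 5.
    return "NOT_RETRIEVED"
```

Under this reading, a document at rank 2 is reported as `RETRIEVED_TOP3` rather than `RETRIEVED_TOP5`, so each result carries exactly one status.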
Output Format¶
retrieval-results.json:
{
"timestamp": "2026-04-02T10:35:00Z",
"total": 144,
"results": [
{
"document": "poisoned-override-001.html",
"format": "html",
"technique": "csshide",
"payload": "override",
"framework": "langchain",
"target_query": "What is the refund policy?",
"rank": 2,
"status": "RETRIEVED_TOP3",
"total_results": 5
}
]
}
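A results file in this shape is straightforward to aggregate. The sketch below counts results per status and computes a retrieval rate per framework. It relies only on the `results`, `status`, and `framework` fields shown above. `summarize` is a hypothetical helper, and excluding `ERROR` entries from the rate's denominator is a design choice made for this example, not documented behavior.

```python
import json
from collections import Counter, defaultdict


def summarize(results_json: str) -> dict:
    """Count statuses and compute per-framework retrieval rates."""
    data = json.loads(results_json)
    by_status = Counter()
    tallies = defaultdict(lambda: [0, 0])  # framework -> [retrieved, scored]

    for result in data["results"]:
        by_status[result["status"]] += 1
        if result["status"] == "ERROR":
            continue  # errors say nothing about ranking, so skip them
        tally = tallies[result["framework"]]
        tally[1] += 1
        if result["status"].startswith("RETRIEVED_"):
            tally[0] += 1

    rates = {fw: hit / scored for fw, (hit, scored) in tallies.items()}
    return {"by_status": dict(by_status), "retrieval_rate_by_framework": rates}
```

Calling `summarize(open("retrieval-results.json").read())` would give a quick overview before moving on to a fuller report.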
Running Retrieval Tests Only¶
Modifies ChromaDB state
Retrieval tests create collections and ingest documents. Run make restore before subsequent test runs.
Factors Affecting Retrieval¶
Embedding Model¶
The nomic-embed-text model produces embeddings that may rank documents differently than larger models:
- Small embedding space — Similar documents may cluster together
- Topic relevance — Payload text that matches the query topic ranks higher
- Document length — Longer documents may dilute the payload's embedding weight
Payload Type¶
Different payload types have different retrieval characteristics:
| Payload | Retrieval Impact |
|---|---|
| Override | Directly targets a topic — high retrieval relevance |
| Redirect | Contains target topic + redirect instructions |
| Exfiltrate | May not match topic queries — lower retrieval |
| Denial | Refusal text may not match positive queries |
Technique¶
The technique used to embed the payload in the document (not to be confused with vector embedding) determines whether the payload text contributes to the document's vector:
- Visible techniques (comment injection) — Payload text is part of the embedding input
- Hidden techniques (CSS hide, white font) — Depends on whether the framework extracts the hidden text before embedding
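The dependence on text extraction can be illustrated with a toy HTML extractor. This sketch uses only Python's standard `html.parser` and does not reproduce any framework's actual loader. `TextExtractor`, `extract_text`, and the `skip_hidden` switch are hypothetical names, and the hidden-element check recognizes only an inline `display:none` style.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TextExtractor(HTMLParser):
    """Collect text nodes, optionally skipping inline display:none elements."""

    def __init__(self, skip_hidden: bool):
        super().__init__()
        self.skip_hidden = skip_hidden
        self.hidden_depth = 0  # > 0 while inside a hidden element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        if self.hidden_depth or "display:none" in style:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not (self.skip_hidden and self.hidden_depth):
            self.chunks.append(text)


def extract_text(html: str, skip_hidden: bool) -> str:
    """Return the text that would be passed on to the embedding model."""
    parser = TextExtractor(skip_hidden)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

With `skip_hidden=False`, the text of a `display:none` element is included in the output and would influence the document's vector. With `skip_hidden=True`, it is dropped and the payload has no effect on retrieval.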
Hybrid Retrieval Mode¶
The --hybrid-retrieval flag enables BM25 + dense fusion for pipelines that support it.
When enabled, retrieval_test.py calls /query-hybrid instead of /query for Haystack only, because Haystack is the only pipeline that implements hybrid retrieval. The other pipelines fall back automatically to dense-only /query and are labeled as such in the results. A hybrid run therefore produces Haystack hybrid results alongside dense-only results for the other frameworks, which allows a direct comparison and avoids failed requests to an endpoint those pipelines do not expose.
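The fallback rule can be expressed as a small lookup. This is an illustration of the behavior described above, not the code in retrieval_test.py. `select_endpoint` and `HYBRID_CAPABLE` are hypothetical names, the lowercase framework identifiers are assumed, and the `"dense"` label is a placeholder for whatever label the harness actually writes.

```python
from typing import Tuple

# Assumption: only Haystack exposes /query-hybrid, per the description above.
HYBRID_CAPABLE = {"haystack"}


def select_endpoint(framework: str, hybrid_requested: bool) -> Tuple[str, str]:
    """Return (endpoint path, retrieval-mode label) for one pipeline."""
    if hybrid_requested and framework.lower() in HYBRID_CAPABLE:
        return "/query-hybrid", "hybrid-bm25-dense"
    # Everything else falls back to dense-only retrieval.
    return "/query", "dense"  # "dense" is a placeholder label
```

Deciding the endpoint up front, rather than trying /query-hybrid and catching the failure, keeps unsupported pipelines from recording spurious `ERROR` results.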
Why It Matters¶
Dense retrieval ranks documents by embedding cosine similarity. Adversarial documents optimized for embedding similarity (via CEM, Genetic, or Whitebox) can achieve high retrieval rates even with unnatural payloads. BM25 scores documents by keyword overlap with the query, providing an orthogonal signal that penalizes documents lacking natural keyword matches.
Hybrid retrieval tests whether this defense reduces poisoned document retrieval rates compared to dense-only.
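To see how a keyword signal can demote a document that ranks first on embedding similarity alone, consider reciprocal rank fusion (RRF), one common way to combine two rankings. Nothing above states which fusion method the Haystack pipeline uses, so RRF and the constant `k=60` are assumptions chosen for illustration, and the document names are reused from the earlier example.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists; each doc scores sum(1 / (k + rank))."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for position, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + position)
    # Highest fused score first; ties broken by name for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Dense ranking: the optimized poisoned doc wins on cosine similarity.
dense = ["poisoned-override-001.html", "refund-policy.html", "faq.html"]
# BM25 ranking: the poisoned doc has weak keyword overlap with the query.
bm25 = ["refund-policy.html", "faq.html", "poisoned-override-001.html"]

fused = reciprocal_rank_fusion([dense, bm25])
```

In this toy case the poisoned document drops from first place in the dense ranking to second place in `fused`, behind the legitimate refund-policy.html. Whether the same thing happens against a real pipeline is what the hybrid run measures.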
Result Differences¶
Hybrid results appear in the same retrieval-results.json format. The sources field in the response includes a retrieval_mode: hybrid-bm25-dense indicator.
Next Steps¶
- Injection Tests — Layer 3: does the LLM act on the payload?
- Drift Report — Understanding prediction errors