Retrieval Tests¶

Layer 2 tests whether poisoned documents, after being ingested into ChromaDB, appear in the top-k results when a target query is run.

How It Works¶

sequenceDiagram
    participant H as harness
    participant P as Pipeline
    participant C as ChromaDB

    H->>P: POST /ingest (noise docs)
    P->>C: Embed + store noise
    H->>P: POST /ingest (poisoned doc)
    P->>C: Embed + store poisoned
    H->>P: POST /query (target query)
    P->>C: Vector search
    C-->>P: Top-k results
    P-->>H: Ranked documents
    H->>H: Check if poisoned doc in top-k

Step by Step¶

Ingest noise corpus — The 11 legitimate business documents from noise-corpus/
Ingest poisoned document — A hemlock-generated document with an embedded payload
Query — Run a query related to the payload's target topic
Check ranking — Determine if the poisoned document appears in the top-k retrieved results
Record position — Track the rank (1^st, 2^nd, 3^rd, ... or not retrieved)

Why Noise Matters¶

A poisoned document that gets retrieved when it's the only document in the collection isn't impressive. The retrieval test ingests the noise corpus first, creating realistic competition:

Collection contents after setup:
  1. refund-policy.html         (legitimate)
  2. faq.html                   (legitimate)
  3. employee-onboarding.md     (legitimate)
  4. ...8 more legitimate docs...
  5. poisoned-override-001.html (hemlock payload targeting "refund policy")

When querying "What is the refund policy?", the poisoned document must compete with the legitimate refund-policy.html for retrieval ranking.

Result Categories¶

Status	Meaning
RETRIEVED_TOP1	Poisoned doc was the #1 result
RETRIEVED_TOP3	Poisoned doc in top 3 results
RETRIEVED_TOP5	Poisoned doc in top 5 results
NOT_RETRIEVED	Poisoned doc not in top-k
ERROR	Ingestion or query failed

Output Format¶

retrieval-results.json:

{
  "timestamp": "2026-04-02T10:35:00Z",
  "total": 144,
  "results": [
    {
      "document": "poisoned-override-001.html",
      "format": "html",
      "technique": "csshide",
      "payload": "override",
      "framework": "langchain",
      "target_query": "What is the refund policy?",
      "rank": 2,
      "status": "RETRIEVED_TOP3",
      "total_results": 5
    }
  ]
}

Running Retrieval Tests Only¶

make test-retrieval

Modifies ChromaDB state

Retrieval tests create collections and ingest documents. Run make restore before subsequent test runs.

Factors Affecting Retrieval¶

Embedding Model¶

The nomic-embed-text model produces embeddings that may rank documents differently than larger models:

Small embedding space — Similar documents may cluster together
Topic relevance — Payload text that matches the query topic ranks higher
Document length — Longer documents may dilute the payload's embedding weight

Payload Type¶

Different payload types have different retrieval characteristics:

Payload	Retrieval Impact
Override	Directly targets a topic — high retrieval relevance
Redirect	Contains target topic + redirect instructions
Exfiltrate	May not match topic queries — lower retrieval
Denial	Refusal text may not match positive queries

Technique¶

The embedding technique affects whether the payload contributes to the document's vector:

Visible techniques (comment injection) — Payload text is part of the embedding input
Hidden techniques (CSS hide, white font) — Depends on whether the framework extracts the hidden text before embedding

Hybrid Retrieval Mode¶

The --hybrid-retrieval flag enables BM25 + dense fusion for pipelines that support it:

bash harness/run_all.sh --hybrid-retrieval

When enabled, retrieval_test.py calls /query-hybrid instead of /query for Haystack only — the only pipeline that implements hybrid retrieval. Other pipelines automatically fall back to dense-only /query and are labeled accordingly in results. This means a hybrid run produces Haystack-hybrid results alongside dense-only results for the other frameworks, giving a direct comparison without avoidable endpoint failures.

Why It Matters¶

Dense retrieval ranks documents by embedding cosine similarity. Adversarial documents optimized for embedding similarity (via CEM, Genetic, or Whitebox) can achieve high retrieval rates even with unnatural payloads. BM25 scores documents by keyword overlap with the query, providing an orthogonal signal that penalizes documents lacking natural keyword matches.

Hybrid retrieval tests whether this defense reduces poisoned document retrieval rates compared to dense-only.

Result Differences¶

Hybrid results appear in the same retrieval-results.json format. The sources field in the response includes a retrieval_mode: hybrid-bm25-dense indicator.

Next Steps¶

Injection Tests — Layer 3: does the LLM act on the payload?
Drift Report — Understanding prediction errors