Extraction Tests

Layer 1 tests whether hemlock's injected payloads survive extraction by each RAG framework's document parser.

How It Works

sequenceDiagram
    participant H as harness
    participant P as Pipeline (:8100-8103)

    H->>H: Read poisoned document
    H->>P: POST /extract (multipart file)
    P->>P: Parse document with framework loader
    P-->>H: Extracted text
    H->>H: Search extracted text for payload
    H->>H: Compare against hemlock prediction
    H->>H: Record MATCH / DRIFT / ERROR
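
The POST /extract step in the diagram can be sketched with the Python standard library alone. This is a minimal sketch, not the harness's actual code: the multipart field name `file`, the helper names `build_multipart` and `extract`, and the assumption that the response body is the extracted text as plain UTF-8 are all hypothetical.

```python
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract(base_url: str, filename: str, data: bytes, timeout: float = 30.0) -> str:
    """POST a document to <base_url>/extract and return the response body as text."""
    # Assumption: the pipeline reads the upload from a form field named "file".
    body, content_type = build_multipart("file", filename, data)
    request = urllib.request.Request(
        f"{base_url}/extract",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a pipeline listening on port 8100, a call might look like `extract("http://localhost:8100", "poisoned-comment-001.html", data)`, where the host name is also an assumption.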

Step by Step

1. Generate documents: hemlock batch creates poisoned documents across all format×technique combinations.
2. Load predictions: hemlock validate --json emits hemlock's predicted outcome for each combination.
3. POST each document: send it to each pipeline's /extract endpoint, skipping unsupported combinations.
4. Check survival: run technique-aware payload matching against the extracted text (literal → zero-width decode → reversed text → HTML comment chunks → homoglyph).
5. Compare: check whether the actual result matches hemlock's prediction.
6. Record: write the result as MATCH, DRIFT, SKIP, NO_PRED, or ERROR.
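
The survival check in the steps above can be illustrated as a chain of matchers tried in order. This is a sketch under stated assumptions, not hemlock's or the harness's real matching logic: it assumes zero-width payloads are hidden by interleaving zero-width characters, that comment-chunked payloads are split across consecutive HTML comments, and that homoglyphs are a handful of Cyrillic lookalikes. The name `find_payload` is hypothetical.

```python
import re
from typing import Optional

# Zero-width code points mapped to None so str.translate deletes them (assumed encoding).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

# A small, illustrative set of Cyrillic lookalikes mapped back to Latin letters.
HOMOGLYPHS = str.maketrans({
    "\u0430": "a", "\u0435": "e", "\u043e": "o",
    "\u0440": "p", "\u0441": "c", "\u0445": "x", "\u0456": "i",
})

COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def find_payload(extracted: str, payload: str) -> Optional[str]:
    """Return the name of the first matcher that finds the payload, else None."""
    candidates = [
        ("literal", extracted),
        ("zero_width", extracted.translate(ZERO_WIDTH)),
        ("reversed", extracted[::-1]),
        ("comment_chunks", "".join(COMMENT_RE.findall(extracted))),
        ("homoglyph", extracted.translate(HOMOGLYPHS)),
    ]
    for name, text in candidates:
        if payload in text:
            return name
    return None
```

Returning the matcher name rather than a bare boolean makes it easier to tell a genuine drift from a harness matching difference when investigating a result.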

Result Categories

Status	Meaning	Action
MATCH	Actual result matches hemlock's prediction	None: the prediction is accurate
DRIFT	Actual result differs from hemlock's prediction	Investigate: may be a hemlock prediction issue, a framework change, or a harness matching difference
SKIP	Unsupported format/framework combination	None: excluded from the accuracy calculation
NO_PRED	No prediction available	Add a prediction to hemlock for this combination
ERROR	Framework failed to parse the document	Investigate: may be a framework bug or an unsupported format
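
The categories in the table can be expressed as one small decision function. This is a hedged sketch: the function name `classify`, its parameters, and in particular the order in which the checks are applied (SKIP before NO_PRED before ERROR) are assumptions, since the table does not state a precedence.

```python
from typing import Optional


def classify(
    supported: bool,
    predicted: Optional[str],
    actual: Optional[str],
    error: Optional[str] = None,
) -> str:
    """Map one test outcome to a status from the table (precedence is assumed)."""
    if not supported:
        return "SKIP"      # combination is never sent to the pipeline
    if predicted is None:
        return "NO_PRED"   # hemlock has no prediction to compare against
    if error is not None or actual is None:
        return "ERROR"     # the framework failed to parse the document
    return "MATCH" if predicted == actual else "DRIFT"
```

For example, `classify(True, "survive", "stripped")` yields `"DRIFT"`, which corresponds to a wrong prediction in the table.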

Output Format

extraction-results.json:

{
  "timestamp": "2026-04-02T10:30:00Z",
  "total": 576,
  "results": [
    {
      "document": "poisoned-comment-001.html",
      "format": "html",
      "technique": "comment",
      "framework": "langchain",
      "predicted": "survive",
      "actual": "survive",
      "status": "MATCH",
      "payload_found": true,
      "extracted_length": 2451
    },
    {
      "document": "poisoned-csshide-001.html",
      "format": "html",
      "technique": "csshide",
      "framework": "unstructured",
      "predicted": "survive",
      "actual": "stripped",
      "status": "DRIFT",
      "payload_found": false,
      "extracted_length": 1823
    }
  ]
}
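
A report in this shape can be summarized in a few lines. The sketch below assumes accuracy is MATCH / (MATCH + DRIFT); the source only states that SKIP is excluded from the accuracy calculation, so leaving out NO_PRED and ERROR as well is an assumption. The helper names `load_report` and `summarize` are hypothetical.

```python
import json
from collections import Counter


def load_report(path: str) -> dict:
    """Read an extraction-results.json file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize(report: dict) -> dict:
    """Count results per status and compute accuracy over scored results only."""
    counts = Counter(result["status"] for result in report["results"])
    # Assumption: only MATCH and DRIFT count toward accuracy.
    scored = counts["MATCH"] + counts["DRIFT"]
    accuracy = counts["MATCH"] / scored if scored else None
    return {"counts": dict(counts), "accuracy": accuracy}
```

Running `summarize(load_report("reports/extraction-results.json"))` after a test run would give a quick per-status breakdown.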

Test Matrix

The extraction test covers every combination of:

10 formats × 36 techniques × 5 frameworks = up to 1,800 combinations

Not all technique×format combinations are valid (e.g., CSS-hide only applies to HTML). The actual test count depends on hemlock's supported combinations:

# See how many documents hemlock generates
hemlock batch --output-dir /tmp/count --all-formats --all-techniques --dry-run
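
To make the arithmetic concrete, the sketch below counts combinations under an applicability rule. The rule `example_rule` and the function `count_combinations` are hypothetical illustrations; hemlock's real set of supported combinations comes from the dry run above, not from this code.

```python
from itertools import product
from typing import Callable, Iterable


def count_combinations(
    formats: Iterable,
    techniques: Iterable,
    frameworks: Iterable,
    applies: Callable[[object, object], bool],
) -> int:
    """Count format x technique x framework triples whose format/technique pair is valid."""
    return sum(
        1
        for fmt, technique, _framework in product(formats, techniques, frameworks)
        if applies(fmt, technique)
    )


def example_rule(fmt: str, technique: str) -> bool:
    """Hypothetical rule: csshide applies only to HTML, everything else applies everywhere."""
    return technique != "csshide" or fmt == "html"


# Upper bound with no filtering: 10 formats x 36 techniques x 5 frameworks.
upper_bound = count_combinations(range(10), range(36), range(5), lambda f, t: True)
```

Any rule that rejects some format/technique pairs lowers the count below this upper bound, which is why the real test count is smaller than 1,800.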

Running Extraction Tests Only

make test-extract

Or manually:

hemlock batch --output-dir /tmp/hemlock-batch --all-formats --all-techniques
hemlock validate --json --output /tmp/hemlock-predictions.json
python3 harness/extraction_test.py \
  --docs /tmp/hemlock-batch \
  --predictions /tmp/hemlock-predictions.json \
  --output reports/extraction-results.json

Common Patterns

Framework-Specific Stripping

Some frameworks consistently strip certain techniques:

unstructured + csshide → ALWAYS stripped (partition extracts visible text)
unstructured + aria    → ALWAYS stripped (hidden elements skipped)
langchain + csshide    → ALWAYS survives (BSHTMLLoader preserves structure)
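
The difference between the two behaviors can be illustrated with the standard library's HTML parser. This is not how unstructured or LangChain are implemented; it is a simplified sketch in which `skip_hidden=True` stands in for a visible-text extractor and `skip_hidden=False` for one that collects every text node. It only recognizes inline `display:none` styles, and the names `TextCollector` and `extract_text` are hypothetical.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextCollector(HTMLParser):
    """Collect text nodes, optionally skipping subtrees hidden with display:none."""

    def __init__(self, skip_hidden: bool):
        super().__init__()
        self.skip_hidden = skip_hidden
        self.hidden_depth = 0  # > 0 while inside a hidden subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        hidden = self.skip_hidden and "display:none" in style
        if self.hidden_depth or hidden:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.parts.append(data)


def extract_text(html: str, skip_hidden: bool) -> str:
    """Return whitespace-normalized text from an HTML string."""
    parser = TextCollector(skip_hidden)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

A payload placed in a `display:none` element appears in the output with `skip_hidden=False` and disappears with `skip_hidden=True`, which mirrors the survive/stripped split listed above.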

Version-Dependent Behavior

A technique that survives in one framework version may be stripped in an update:

llamaindex 0.12.33 + comment → survive
llamaindex 0.12.34 + comment → stripped  ← version changed behavior

This is exactly what the drift report catches.
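
A version-dependent change like this one can also be spotted by comparing the results of two runs directly. The sketch below assumes each run was saved in the extraction-results.json shape and that a (document, framework) pair identifies one test; `diff_runs` is a hypothetical helper, not part of the harness.

```python
def diff_runs(old_results: list, new_results: list) -> dict:
    """Return {(document, framework): (old_actual, new_actual)} for changed outcomes."""

    def index(results):
        # Assumption: (document, framework) is unique within one run.
        return {(r["document"], r["framework"]): r["actual"] for r in results}

    old, new = index(old_results), index(new_results)
    return {
        key: (old[key], new[key])
        for key in sorted(old.keys() & new.keys())
        if old[key] != new[key]
    }
```

Only tests present in both runs are compared, so newly added or removed documents do not show up as behavior changes.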

Next Steps

Drift Report — How to interpret extraction drifts
Retrieval Tests — Layer 2: does the document get retrieved?