Extraction Tests
Layer 1 tests whether hemlock's injected payloads survive extraction by each RAG framework's document parser.
How It Works
sequenceDiagram
participant H as harness
participant P as Pipeline (:8100-8103)
H->>H: Read poisoned document
H->>P: POST /extract (multipart file)
P->>P: Parse document with framework loader
P-->>H: Extracted text
H->>H: Search extracted text for payload
H->>H: Compare against hemlock prediction
H->>H: Record MATCH / DRIFT / ERROR
Step by Step
- Generate documents: hemlock batch creates poisoned documents across all format×technique combinations
- Load predictions: hemlock validate --json produces hemlock's predictions for each combination
- POST each document: send it to each pipeline's /extract endpoint (skip unsupported combinations)
- Check survival: run technique-aware payload matching against the extracted text, trying in order: literal → zero-width decode → reversed text → HTML comment chunks → homoglyph
- Compare: check whether the actual result matches hemlock's prediction
- Record: write the result as MATCH, DRIFT, SKIP, NO_PRED, or ERROR
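The survival check in the steps above can be sketched as a cascade of matchers tried in the listed order. This is only an illustration, not the harness's actual code: the function name `find_payload`, the set of zero-width characters, the tiny homoglyph table, and the assumption that zero-width characters are interleaved (so removing them recovers the payload) are all hypothetical.

```python
import re
from typing import Optional

# Assumed set of zero-width characters; the real harness may handle others.
ZERO_WIDTH = "\u200b\u200c\u200d\u2060\ufeff"

# Hypothetical, deliberately tiny homoglyph table (Cyrillic lookalike -> Latin).
HOMOGLYPHS = str.maketrans({
    "\u0430": "a", "\u0435": "e", "\u043e": "o",
    "\u0440": "p", "\u0441": "c", "\u0445": "x",
})

COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def find_payload(extracted: str, payload: str) -> Optional[str]:
    """Return the name of the first strategy that finds the payload, else None."""
    # 1. Literal match.
    if payload in extracted:
        return "literal"
    # 2. Zero-width: assume interleaved characters, so stripping them suffices.
    stripped = extracted.translate({ord(ch): None for ch in ZERO_WIDTH})
    if payload in stripped:
        return "zero-width"
    # 3. Reversed text.
    if payload[::-1] in extracted:
        return "reversed"
    # 4. HTML comment chunks: assume the payload is split across comments in order.
    if payload in "".join(COMMENT_RE.findall(extracted)):
        return "comment-chunks"
    # 5. Homoglyphs: map lookalike characters back to Latin before matching.
    if payload in extracted.translate(HOMOGLYPHS):
        return "homoglyph"
    return None
```

Returning the strategy name rather than a bare boolean makes it easier to tell, when a drift is investigated, whether the difference comes from the framework or from the matcher.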
Result Categories
| Status | Meaning | Action |
|---|---|---|
| MATCH | hemlock predicted correctly | None — prediction is accurate |
| DRIFT | hemlock predicted wrong | Investigate — may be a hemlock prediction issue, framework change, or harness matching difference |
| SKIP | Unsupported format/framework combination | None — excluded from accuracy calculation |
| NO_PRED | No prediction available | Add prediction to hemlock for this combination |
| ERROR | Framework failed to parse | Investigate — may be a framework bug or unsupported format |
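As a rough sketch, the table above could map to a single classification function. The function name `classify`, its parameters, and the precedence between categories (SKIP first, then ERROR, then NO_PRED) are assumptions made for illustration; only the status names and the `survive` / `stripped` values come from this page.

```python
from typing import Optional


def classify(supported: bool, predicted: Optional[str],
             payload_found: bool, parse_error: bool = False) -> str:
    """Map one test outcome to a status from the table (assumed precedence)."""
    if not supported:
        return "SKIP"       # unsupported format/framework combination
    if parse_error:
        return "ERROR"      # the framework failed to parse the document
    if predicted is None:
        return "NO_PRED"    # hemlock has no prediction for this combination
    actual = "survive" if payload_found else "stripped"
    return "MATCH" if actual == predicted else "DRIFT"
```

For example, `classify(True, "survive", False)` yields `DRIFT`, which corresponds to the second record in the sample output.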
Output Format
extraction-results.json:
{
  "timestamp": "2026-04-02T10:30:00Z",
  "total": 576,
  "results": [
    {
      "document": "poisoned-comment-001.html",
      "format": "html",
      "technique": "comment",
      "framework": "langchain",
      "predicted": "survive",
      "actual": "survive",
      "status": "MATCH",
      "payload_found": true,
      "extracted_length": 2451
    },
    {
      "document": "poisoned-csshide-001.html",
      "format": "html",
      "technique": "csshide",
      "framework": "unstructured",
      "predicted": "survive",
      "actual": "stripped",
      "status": "DRIFT",
      "payload_found": false,
      "extracted_length": 1823
    }
  ]
}
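A report in this shape is easy to summarize with a few lines of Python. The sketch below is illustrative only: the `summarize` helper is hypothetical, and the accuracy definition, MATCH / (MATCH + DRIFT), is an assumption. The page states only that SKIP is excluded from the accuracy calculation; how ERROR and NO_PRED are treated is not specified, and here they are left out as well.

```python
import json
from collections import Counter

# Trimmed-down stand-in for extraction-results.json (only "status" is needed).
SAMPLE = '{"results": [{"status": "MATCH"}, {"status": "DRIFT"}, {"status": "SKIP"}]}'


def summarize(report: dict) -> dict:
    """Count statuses and compute accuracy over MATCH and DRIFT only (assumed)."""
    counts = Counter(r["status"] for r in report["results"])
    scored = counts["MATCH"] + counts["DRIFT"]
    accuracy = counts["MATCH"] / scored if scored else None
    return {"counts": dict(counts), "accuracy": accuracy}


summary = summarize(json.loads(SAMPLE))
print(summary["counts"], summary["accuracy"])
```

To run it against a real report, replace `json.loads(SAMPLE)` with `json.load(open("reports/extraction-results.json"))`.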
Test Matrix
The extraction test covers every combination of format, technique, and framework: 10 formats × 36 techniques × 5 frameworks = up to 1,800 combinations.
Not all technique×format combinations are valid (e.g., CSS-hide only applies to HTML). The actual test count depends on hemlock's supported combinations:
# See how many documents hemlock generates
hemlock batch --output-dir /tmp/count --all-formats --all-techniques --dry-run
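To see how the upper bound relates to the actual count, the matrix can be enumerated and filtered. This is a toy sketch under stated assumptions: the axes are shortened placeholders (`pdf` and `docx` are illustrative guesses, not taken from this page), and `is_valid` encodes only the one rule mentioned above, that CSS-hide applies to HTML alone. hemlock's real support table is richer.

```python
from itertools import product

# Placeholder axes; "pdf" and "docx" are illustrative, not from hemlock.
FORMATS = ["html", "pdf", "docx"]
TECHNIQUES = ["comment", "csshide", "aria"]
FRAMEWORKS = ["langchain", "llamaindex", "unstructured"]


def is_valid(fmt: str, technique: str) -> bool:
    """Hypothetical validity rule: CSS-hide only applies to HTML."""
    return technique != "csshide" or fmt == "html"


def count_tests(formats, techniques, frameworks):
    """Return (upper bound, number of valid combinations)."""
    upper_bound = len(formats) * len(techniques) * len(frameworks)
    actual = sum(
        1
        for fmt, tech, _fw in product(formats, techniques, frameworks)
        if is_valid(fmt, tech)
    )
    return upper_bound, actual


print(count_tests(FORMATS, TECHNIQUES, FRAMEWORKS))  # (27, 21)
print(10 * 36 * 5)  # 1800, the upper bound quoted above
```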
Running Extraction Tests Only
To run the extraction layer manually:
hemlock batch --output-dir /tmp/hemlock-batch --all-formats --all-techniques
hemlock validate --json --output /tmp/hemlock-predictions.json
python3 harness/extraction_test.py \
--docs /tmp/hemlock-batch \
--predictions /tmp/hemlock-predictions.json \
--output reports/extraction-results.json
Common Patterns
Framework-Specific Stripping
Some frameworks consistently strip certain techniques:
unstructured + csshide → ALWAYS stripped (partition extracts visible text)
unstructured + aria → ALWAYS stripped (hidden elements skipped)
langchain + csshide → ALWAYS survives (BSHTMLLoader preserves structure)
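Patterns like these can be recovered from a results file by grouping records per framework and technique. The helper below is a hypothetical sketch, not part of the harness; it assumes records shaped like the entries in extraction-results.json and considers only MATCH and DRIFT records, since those are the ones with a meaningful `actual` value.

```python
from collections import defaultdict


def consistent_outcomes(results):
    """Map (framework, technique) -> outcome when every record agrees."""
    outcomes = defaultdict(set)
    for r in results:
        if r["status"] in ("MATCH", "DRIFT"):
            outcomes[(r["framework"], r["technique"])].add(r["actual"])
    # Keep only the pairs that always produced the same outcome.
    return {key: next(iter(seen)) for key, seen in outcomes.items() if len(seen) == 1}
```

A pair that is missing from the returned mapping produced mixed outcomes across documents, which is itself worth a closer look.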
Version-Dependent Behavior
A technique that survives in one framework version may be stripped in an update:
llamaindex 0.12.33 + comment → survive
llamaindex 0.12.34 + comment → stripped ← version changed behavior
This is exactly what the drift report catches.
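A complementary check, separate from the drift report itself, is to diff two result files produced against different framework versions. The sketch below is hypothetical: `behavior_changes` is not a harness function, and it assumes both runs use the same document names and the record fields shown in extraction-results.json.

```python
def behavior_changes(before, after):
    """List (document, framework, old, new) where the actual outcome changed."""
    old = {(r["document"], r["framework"]): r["actual"] for r in before}
    changes = []
    for r in after:
        prev = old.get((r["document"], r["framework"]))
        # Only report pairs present in both runs whose outcome differs.
        if prev is not None and prev != r["actual"]:
            changes.append((r["document"], r["framework"], prev, r["actual"]))
    return changes
```

Applied to the llamaindex example above, a record whose `actual` moves from `survive` to `stripped` between the two runs would appear in the returned list.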
Next Steps
- Drift Report — How to interpret extraction drifts
- Retrieval Tests — Layer 2: does the document get retrieved?