hemlock defend corpus-monitor¶
Polls a target RAG corpus at configured cadence; for each operator-supplied query, records the top-k retrieved document IDs and ranks; detects persistence anomalies. The defensive companion to corpus-poisoning attacks that don't deliver in a single shot — the corpus is mutable, attackers iterate, and persistence (top-k stays poisoned over time) is its own TTP.
Synopsis¶
hemlock defend corpus-monitor --query <q> [--query <q>] [flags]
hemlock defend corpus-monitor --queries-file <path> [flags]
Detectors¶
The monitor flags four classes of anomaly:
| Kind | Triggers when | What it suggests |
|---|---|---|
recency-burst |
≥3 docs from the same source added within 1 h all currently rank in top-k | A coordinated single-source ingest spike that also dominates retrieval |
single-source-dom |
≥60% of a query's current top-k from one source | One source has captured retrieval for that query — possible topic squatting |
rank-flap |
A doc oscillates wildly in rank across recent polls | Possible adversarial activity churning the corpus |
stale-but-ranking |
(skeleton) doc added long ago dominates rank-1 for a query that didn't exist at ingest time | Possible "seeded for later queries" pattern |
Flags¶
| Flag | Type | Default | Description |
|---|---|---|---|
--query |
[]string |
(required if no --queries-file) |
Repeatable. Each query is monitored independently. |
--queries-file |
string |
Newline-delimited file of queries. Combines with --query. |
|
--top-k |
int |
10 |
Retrieval k passed to each query |
--poll-interval |
int |
300 (5 min) |
Seconds between polls |
--window-days |
int |
30 |
Rolling window (days) for anomaly computation |
--audit-log |
string |
JSONL anomaly log path. One row per anomaly, append-only. | |
--quiet |
bool |
false |
Suppress per-poll stderr logs |
--one-shot |
bool |
false |
Run a single poll cycle and exit (testing / cron-driven deployments) |
--backend |
string |
stub |
Corpus backend: stub (built-in, returns empty results — for CLI surface verification) or chroma (planned, talks to ChromaDB v2) |
Output¶
Per-poll (stderr unless --quiet): one line per query with rank-1 doc id, plus any flagged anomalies.
Audit log (JSONL, when --audit-log is set): one JSON object per anomaly with timestamp, kind, query, doc IDs, source, and a one-line detail string.
On exit / one-shot (stdout): human-readable report including query count, poll cadence, anomaly count, top-3 most-frequent rank-1 docs per query, and the 10 most-recent anomalies.
Example¶
Two-query monitor, restricted-to-one-cycle smoke test:
hemlock defend corpus-monitor \
--query "what are the latest security policies?" \
--query "where can I find the official documentation?" \
--top-k 10 \
--one-shot
Long-running deployment with audit log, polling every 10 minutes:
hemlock defend corpus-monitor \
--queries-file ./monitored-queries.txt \
--poll-interval 600 \
--top-k 20 \
--audit-log ./corpus-monitor.jsonl
./monitored-queries.txt — one query per line, blank lines and # comments allowed:
# High-value queries to monitor for adversarial top-k capture.
what are the latest policy updates?
where can I find the official documentation?
how do I configure MFA?
what are the SLA tiers?
When to use this¶
Corpus-monitoring complements hemlock defend monitor (in-band response detection) by addressing the persistence layer of the attack chain. defend monitor catches a single response that emits a canary; defend corpus-monitor catches the corpus-side conditions — anomalous ingest patterns, topic capture, source domination — that make a successful attack persist over time as the corpus evolves.
Run both in concert: defend monitor on every response (in-band, always-on); defend corpus-monitor on the operator-supplied query set at a slower cadence (out-of-band, scheduled).
Backend integration¶
The shipped stub backend returns no results — useful for verifying the CLI surface and the analysis primitives without a live corpus. To wire it to a real ChromaDB instance (or any other vector store), implement the defend.CorpusBackend interface:
type CorpusBackend interface {
Query(ctx context.Context, q string, k int) ([]RetrievedDoc, error)
Document(ctx context.Context, id string) (DocMeta, error)
}
A first-party chroma backend that talks to ChromaDB v2 is reserved for a follow-up release. Until then, operators with non-Chroma stores (Pinecone, Weaviate, Qdrant, custom) can supply their own backend implementation by depending on pkg/defend directly.
Related¶
hemlock defend monitor— in-band response-side detector middleware (reverse proxy)hemlock defend detect— strict-canary CLI primitive (offline detection on collected response samples)hemlock defend report— render run-results JSONL into operator briefs