
hemlock defend corpus-monitor

Polls a target RAG corpus at a configured cadence. For each operator-supplied query, it records the top-k retrieved document IDs and their ranks, then flags persistence anomalies. This is the defensive companion to corpus-poisoning attacks that don't land in a single shot: the corpus is mutable, attackers iterate, and persistence (the top-k staying poisoned over time) is its own TTP.

Synopsis

hemlock defend corpus-monitor --query <q> [--query <q>] [flags]
hemlock defend corpus-monitor --queries-file <path> [flags]

Detectors

The monitor flags four classes of anomaly:

| Kind | Triggers when | What it suggests |
| --- | --- | --- |
| recency-burst | ≥3 docs from the same source added within 1 h all currently rank in top-k | A coordinated single-source ingest spike that also dominates retrieval |
| single-source-dom | ≥60% of a query's current top-k from one source | One source has captured retrieval for that query (possible topic squatting) |
| rank-flap | A doc oscillates wildly in rank across recent polls | Possible adversarial activity churning the corpus |
| stale-but-ranking (skeleton) | A doc added long ago dominates rank-1 for a query that didn't exist at ingest time | Possible "seeded for later queries" pattern |

Flags

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --query | []string | (required if no --queries-file) | Repeatable. Each query is monitored independently. |
| --queries-file | string | | Newline-delimited file of queries. Combines with --query. |
| --top-k | int | 10 | Retrieval k passed to each query |
| --poll-interval | int | 300 (5 min) | Seconds between polls |
| --window-days | int | 30 | Rolling window (days) for anomaly computation |
| --audit-log | string | | JSONL anomaly log path. One row per anomaly, append-only. |
| --quiet | bool | false | Suppress per-poll stderr logs |
| --one-shot | bool | false | Run a single poll cycle and exit (testing / cron-driven deployments) |
| --backend | string | stub | Corpus backend: stub (built-in, returns empty results, for CLI surface verification) or chroma (planned, talks to ChromaDB v2) |

Output

Per-poll (stderr unless --quiet): one line per query with the rank-1 doc ID, plus any flagged anomalies.

Audit log (JSONL, when --audit-log is set): one JSON object per anomaly with timestamp, kind, query, doc IDs, source, and a one-line detail string.

On exit / one-shot (stdout): human-readable report including query count, poll cadence, anomaly count, top-3 most-frequent rank-1 docs per query, and the 10 most-recent anomalies.

Example

Two-query monitor, run as a single-cycle smoke test:

hemlock defend corpus-monitor \
  --query "what are the latest security policies?" \
  --query "where can I find the official documentation?" \
  --top-k 10 \
  --one-shot

Long-running deployment with audit log, polling every 10 minutes:

hemlock defend corpus-monitor \
  --queries-file ./monitored-queries.txt \
  --poll-interval 600 \
  --top-k 20 \
  --audit-log ./corpus-monitor.jsonl

./monitored-queries.txt — one query per line, blank lines and # comments allowed:

# High-value queries to monitor for adversarial top-k capture.
what are the latest policy updates?
where can I find the official documentation?
how do I configure MFA?
what are the SLA tiers?

When to use this

Corpus-monitoring complements hemlock defend monitor (in-band response detection) by addressing the persistence layer of the attack chain. defend monitor catches a single response that emits a canary; defend corpus-monitor catches the corpus-side conditions — anomalous ingest patterns, topic capture, source domination — that make a successful attack persist over time as the corpus evolves.

Run both in concert: defend monitor on every response (in-band, always-on); defend corpus-monitor on the operator-supplied query set at a slower cadence (out-of-band, scheduled).

Backend integration

The shipped stub backend returns no results, which is useful for verifying the CLI surface and the analysis primitives without a live corpus. To wire the monitor to a real ChromaDB instance (or any other vector store), implement the defend.CorpusBackend interface:

type CorpusBackend interface {
    Query(ctx context.Context, q string, k int) ([]RetrievedDoc, error)
    Document(ctx context.Context, id string) (DocMeta, error)
}

A first-party chroma backend that talks to ChromaDB v2 is reserved for a follow-up release. Until then, operators with non-Chroma stores (Pinecone, Weaviate, Qdrant, custom) can supply their own backend implementation by depending on pkg/defend directly.