hemlock score

Calculate composite effectiveness scores for format/technique/framework/payload combinations using hemlock's scoring engine. Scores are computed from lookup tables without generating any documents.

Synopsis

hemlock score --format <format> --framework <framework> [flags]

Flags

Flag	Type	Default	Description
`--format`	`string`	(required)	Document format to score: `html`, `docx`, `pdf`, `txt`, `markdown`, `rtf`, `epub`, `csv`, `json`, `xlsx`, `image`
`--framework`	`string`	(required)	Target RAG framework: `langchain`, `llamaindex`, `haystack`, `unstructured`, `generic`
`--technique`	`string`		Specific technique to score. If omitted, scores all techniques for the format.
`--payload`	`string`	`override`	Payload category for complexity factor: `override`, `exfiltrate`, `redirect`, `denial`, `multistage`, `authority`, `custom`
`--json`	`bool`	`false`	Output results as JSON instead of a table

Description

score computes a composite effectiveness score for each technique based on three factors:

Stealth score (0–100): How well the payload is hidden from human inspection
Survival probability (0.0–1.0): How likely the technique is to survive extraction by the target framework
Payload complexity (0.0–1.0): How sophisticated the payload category is

The composite score is calculated as:

$$\text{composite} = \frac{\text{stealth}}{100} \times \text{survival} \times (0.5 + 0.5 \times \text{complexity})$$
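
As a rough illustration, the formula can be reproduced in a few lines of Python. The function name `composite_score` is hypothetical and is not part of hemlock; the example inputs are the `css-hide` values from the HTML/LangChain example on this page (stealth 75, survival 1.00, complexity 0.60).

```python
def composite_score(stealth: float, survival: float, complexity: float) -> float:
    """Sketch of the documented formula; not hemlock's actual implementation."""
    if not 0 <= stealth <= 100:
        raise ValueError("stealth must be in [0, 100]")
    if not 0.0 <= survival <= 1.0:
        raise ValueError("survival must be in [0.0, 1.0]")
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must be in [0.0, 1.0]")
    return (stealth / 100) * survival * (0.5 + 0.5 * complexity)


# css-hide against LangChain with the default `override` payload:
# 0.75 * 1.00 * (0.5 + 0.5 * 0.60) = 0.75 * 0.80 = 0.60
print(f"{composite_score(75, 1.00, 0.60):.2f}")  # prints 0.60
```

Because the complexity term is bounded between 0.5 and 1.0, payload choice can at most double a score, while a survival probability of 0 forces the composite to 0 regardless of stealth.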

Each result receives a letter rating:

Rating	Score Range
A	≥ 0.80
B	≥ 0.60
C	≥ 0.40
D	≥ 0.20
F	< 0.20
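
The threshold lookup can be sketched as follows. `letter_rating` is a hypothetical helper, and rounding to two decimal places before comparing is an assumption made here to match the precision of the SCORE column; hemlock's own boundary handling may differ.

```python
# Thresholds copied from the rating table, highest first.
RATING_THRESHOLDS = [(0.80, "A"), (0.60, "B"), (0.40, "C"), (0.20, "D")]


def letter_rating(score: float) -> str:
    """Map a composite score to a letter (hypothetical helper, not hemlock code)."""
    # Assumption: compare at two decimal places so floating-point noise
    # (e.g. 0.6000000000000001 or 0.5999999999999999) cannot flip a boundary case.
    rounded = round(score, 2)
    for threshold, letter in RATING_THRESHOLDS:
        if rounded >= threshold:
            return letter
    return "F"


for s in (0.68, 0.60, 0.44, 0.00):
    print(f"{s:.2f} -> {letter_rating(s)}")
```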

Lookup-based scoring

Scores are computed entirely from hemlock's built-in lookup tables (stealth scores, framework survival maps, payload complexity factors). No documents are generated and no live validation is performed. Use hemlock validate for runtime survival testing.

Examples

Score all HTML techniques against LangChain

hemlock score --format html --framework langchain

FORMAT  TECHNIQUE            STEALTH  SURVIVAL  COMPLEXITY  SCORE  RATING
------  ---------            -------  --------  ----------  -----  ------
html    comment              30       0.00      0.60        0.00   F
html    invisible-div        55       1.00      0.60        0.44   C
html    aria-hidden          70       1.00      0.60        0.56   C
html    css-hide             75       1.00      0.60        0.60   B
html    microdata            60       1.00      0.60        0.48   C
html    chunk-boundary       65       1.00      0.60        0.52   C
html    offscreen            80       1.00      0.60        0.64   B
html    color-transparent    85       1.00      0.60        0.68   B
html    noscript             60       1.00      0.60        0.48   C

Score a specific technique with multistage payloads

hemlock score \
  --format html \
  --technique css-hide \
  --framework langchain \
  --payload multistage

JSON output for programmatic use

hemlock score \
  --format markdown \
  --framework langchain \
  --json | jq '.[] | select(.rating == "A" or .rating == "B")'
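
For scripts that outgrow `jq`, the same filter can be written in Python. This sketch assumes only what the `jq` expression implies: that `--json` emits an array of objects, each with a `rating` field. The function name `filter_by_rating` is hypothetical, and the sample input is a placeholder; real output presumably carries more fields per object.

```python
import json


def filter_by_rating(json_text: str, keep=("A", "B")) -> list:
    """Return the result objects whose "rating" value is in `keep`."""
    results = json.loads(json_text)
    return [r for r in results if r.get("rating") in keep]


# Placeholder input standing in for `hemlock score ... --json` output.
sample = '[{"rating": "A"}, {"rating": "C"}, {"rating": "B"}]'
print(filter_by_rating(sample))
# In a pipeline, pass sys.stdin.read() instead of `sample`.
```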

Compare techniques across frameworks

# Score Markdown techniques against each framework
for fw in langchain llamaindex haystack unstructured; do
  echo "=== $fw ==="
  hemlock score --format markdown --framework "$fw"
  echo
done

Payload Complexity Factors

The --payload flag determines the complexity factor used in scoring:

Category	Complexity	Rationale
`override`	0.6	Direct instruction override; moderate sophistication
`exfiltrate`	0.8	Requires external endpoint interaction; high sophistication
`redirect`	0.7	Social engineering cues; above average sophistication
`denial`	0.5	Simple disruption; lower sophistication
`multistage`	0.9	Two-phase attack with primer/trigger coordination; highest sophistication
`authority`	0.85	Authority mimicry with institutional framing; very high sophistication
`custom`	0.5	Default factor for user-provided payloads
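
To see how much each category moves the composite score, the factors can be converted into the multiplier term (0.5 + 0.5 × complexity) from the scoring formula. The sketch below copies the values from the table; the dictionary and function names are illustrative, not hemlock identifiers.

```python
# Complexity factors copied from the table above.
PAYLOAD_COMPLEXITY = {
    "override": 0.6,
    "exfiltrate": 0.8,
    "redirect": 0.7,
    "denial": 0.5,
    "multistage": 0.9,
    "authority": 0.85,
    "custom": 0.5,
}


def payload_multiplier(category: str) -> float:
    """Return the (0.5 + 0.5 * complexity) term of the composite formula."""
    return 0.5 + 0.5 * PAYLOAD_COMPLEXITY[category]


# List categories from the largest multiplier to the smallest.
for name in sorted(PAYLOAD_COMPLEXITY, key=payload_multiplier, reverse=True):
    print(f"{name:<11} {payload_multiplier(name):.3f}")
```

The multipliers range from 0.75 (`denial`, `custom`) to 0.95 (`multistage`). Switching from the default `override` (0.80) to `multistage` therefore scales any score by about 1.19, so a technique scoring 0.60 under `override` scores about 0.71 under `multistage`.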

Tips

Use score to plan engagements

Run score before generating documents to identify the most effective technique/framework combinations for your target, then focus document generation on techniques rated A or B.

Generic framework scoring

When --framework generic is used, survival probability is computed as the proportion of frameworks the technique survives (e.g., if a technique survives 3 of 4 frameworks, survival = 0.75).
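
A minimal sketch of that averaging, assuming per-framework survival is recorded as a simple yes/no. The function name and the input mapping are hypothetical, and the values are illustrative only; hemlock's internal survival maps may be structured differently.

```python
def generic_survival(survives: dict[str, bool]) -> float:
    """Fraction of frameworks in which the technique survives extraction."""
    if not survives:
        raise ValueError("at least one framework is required")
    # True counts as 1 and False as 0, so the sum is the number of survivals.
    return sum(survives.values()) / len(survives)


# Illustrative values only: the technique survives three of four frameworks.
example = {
    "langchain": True,
    "llamaindex": True,
    "haystack": True,
    "unstructured": False,
}
print(generic_survival(example))  # prints 0.75
```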