LlamaIndex Pipeline

Property	Value
Framework	LlamaIndex
Version	0.12.33
Port	8101
Container	`llamaindex-rag`
Image	`docker/llamaindex-rag`

Extraction

The pipeline extracts text with LlamaIndex's SimpleDirectoryReader, which picks a reader for each file based on its extension:

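import tempfile
from pathlib import Path

from llama_index.core import SimpleDirectoryReader

# Write the uploaded file (filename, content) to a temporary directory
with tempfile.TemporaryDirectory() as tmpdir:
    # Keep only the base name so the upload cannot escape tmpdir
    file_path = Path(tmpdir) / Path(filename).name
    file_path.write_bytes(content)
    # load_data() returns a list of Document objects
    documents = SimpleDirectoryReader(tmpdir).load_data()
    text = "\n".join(doc.text for doc in documents)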

How It Works

The uploaded file is written to a temporary directory
SimpleDirectoryReader scans the directory and detects each file's type from its extension
It delegates internally to format-specific readers (PDFReader, DocxReader, HTMLTagReader, MarkdownReader)
It returns Document objects with .text and .metadata
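
The steps above can be sketched without LlamaIndex at all. The following is a minimal standard-library illustration of extension-based dispatch, not SimpleDirectoryReader's actual implementation; the reader functions, the READERS table, and the record shape are all hypothetical.

```python
from pathlib import Path
from typing import Callable, Dict, List


def read_plain(path: Path) -> str:
    """Fallback reader: return the file content as-is."""
    return path.read_text(encoding="utf-8", errors="replace")


def read_markdown(path: Path) -> str:
    """Toy Markdown reader: only strips leading '#' heading markers."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return "\n".join(line.lstrip("#").strip() for line in lines)


# Hypothetical suffix-to-reader table; unknown suffixes fall back to plain text.
READERS: Dict[str, Callable[[Path], str]] = {".md": read_markdown}


def load_directory(directory: str) -> List[dict]:
    """Return one {'text', 'metadata'} record per file, chosen by suffix."""
    docs = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        reader = READERS.get(path.suffix.lower(), read_plain)
        docs.append({"text": reader(path), "metadata": {"file_name": path.name}})
    return docs
```

The fallback branch is the part that matters when testing: a file whose suffix has no registered reader is still ingested, only with less processing.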

Implications for hemlock

HTML — Uses HTMLTagReader, which extracts text from tags; hidden elements may or may not survive, depending on the tag structure
PDF — Uses PDFReader (pypdf-based), which reads the text layer only, similar to LangChain
DOCX — Uses DocxReader — reads paragraphs from the main body
Markdown — Parses Markdown to extract text content
TXT — Direct file read
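
To see why hidden HTML content can survive extraction, consider a deliberately naive text extractor. This sketch uses Python's built-in html.parser as a stand-in; it is not HTMLTagReader and makes no claim about that reader's behavior. The class name, function name, and sample markup are made up for illustration.

```python
from html.parser import HTMLParser
from typing import List


class TextOnlyParser(HTMLParser):
    """Collects text nodes; never looks at CSS, attributes, or comments."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        # Called for text between tags; comments go to handle_comment instead.
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


SAMPLE = (
    "<p>Refunds are accepted within 30 days.</p>"
    '<p style="display:none">hidden css payload</p>'
    '<span aria-hidden="true">hidden aria payload</span>'
    "<!-- hidden comment payload -->"
)
extracted = extract_text(SAMPLE)
```

An extractor like this keeps text hidden by CSS or ARIA attributes, because it never evaluates them, and drops the comment. A reader that filters by tag or attribute can behave differently.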

RAG Chain

graph LR
    Q["Query"] --> I["VectorStoreIndex"]
    I --> R["ChromaDB<br/>Retriever"]
    R --> D["Top-k<br/>Nodes"]
    D --> E["Query Engine"]
    E --> L["Ollama<br/>smollm2:135m"]
    L --> A["Response"]

LlamaIndex uses VectorStoreIndex with the global Settings object:

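from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama

# Global defaults picked up by the index and the query engine
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
Settings.llm = Ollama(model="smollm2:135m")

# chroma_vector_store is the ChromaDB-backed vector store, defined elsewhere (not shown)
index = VectorStoreIndex.from_vector_store(chroma_vector_store)
query_engine = index.as_query_engine()  # retrieves top-k nodes, then calls the LLM
response = query_engine.query(query_text)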
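
The retrieval step in the diagram (query, top-k nodes, prompt) can be illustrated with a small standalone sketch. It assumes cosine similarity and uses made-up 3-dimensional vectors; the metric ChromaDB actually applies depends on how the collection is configured, and real embeddings come from nomic-embed-text. None of the names below are LlamaIndex APIs.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    nodes: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the text of the k nodes most similar to the query vector."""
    ranked = sorted(nodes, key=lambda n: cosine(query_vec, n[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy (text, embedding) pairs; the vectors are invented for this example.
NODES = [
    ("Refunds are accepted within 30 days.", [0.9, 0.1, 0.0]),
    ("Shipping takes 3 to 5 business days.", [0.1, 0.9, 0.0]),
    ("Contact support for refund status.", [0.7, 0.2, 0.1]),
]

context = top_k([1.0, 0.0, 0.0], NODES, k=2)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: What is the refund policy?"
```

Whatever text lands in the top-k nodes is pasted into the prompt verbatim, which is why content that survives extraction can reach the LLM.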

Curl Examples

Health Check

curl http://localhost:8101/health

{"framework": "llamaindex", "status": "ok", "version": "0.12.33"}

Extract Text

curl -X POST http://localhost:8101/extract \
  -F "file=@poisoned-document.pdf"

{"text": "...", "metadata": {"filename": "poisoned-document.pdf"}}

Ingest Document

curl -X POST http://localhost:8101/ingest \
  -F "file=@document.pdf" \
  -F "collection=test-collection"

{"status": "ingested", "collection": "test-collection", "chunks": 5}

Query

curl -X POST http://localhost:8101/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the refund policy?", "collection": "test-collection"}'

{"answer": "...", "sources": ["refund-policy.html"]}
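
The same query can be issued from Python. This is a minimal sketch using only the standard library; it assumes the /query endpoint and JSON body shown in the curl example, and the function names are hypothetical. Building the request is kept separate from sending it so the payload can be inspected without a running container.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8101"  # port listed in the table at the top of this page


def build_query_request(
    query: str, collection: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) the POST request for /query."""
    body = json.dumps({"query": query, "collection": collection}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query: str, collection: str) -> dict:
    """Send the request; requires the llamaindex-rag container to be running."""
    request = build_query_request(query, collection)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the container up, run_query("What is the refund policy?", "test-collection") should return a dictionary with the answer and sources keys shown above.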

Known Behaviors

Technique	Format	Survives?	Notes
CSS hide	HTML	Varies	Depends on HTMLTagReader version
HTML comment	HTML	✗	Comments typically stripped
ARIA hidden	HTML	Varies	Tag-based extraction may skip
Invisible text	HTML	✓	Zero-width chars pass through
Metadata	PDF	✗	PDFReader ignores metadata
Frontmatter	Markdown	✓	YAML frontmatter included in text
Custom XML	DOCX	✗	DocxReader reads paragraphs only
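
Because zero-width characters pass through extraction, it can help to check the returned text for them. The following sketch covers a few common code points only; the list is not exhaustive, and the function names are hypothetical.

```python
from typing import List, Tuple

# Common zero-width / invisible code points (not an exhaustive list).
ZERO_WIDTH = {
    "\u200b": "ZERO WIDTH SPACE",
    "\u200c": "ZERO WIDTH NON-JOINER",
    "\u200d": "ZERO WIDTH JOINER",
    "\u2060": "WORD JOINER",
    "\ufeff": "ZERO WIDTH NO-BREAK SPACE",
}


def find_zero_width(text: str) -> List[Tuple[int, str]]:
    """Return (index, name) for every zero-width character in text."""
    return [(i, ZERO_WIDTH[ch]) for i, ch in enumerate(text) if ch in ZERO_WIDTH]


def strip_zero_width(text: str) -> str:
    """Return text with the listed zero-width characters removed."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


sample = "refund\u200bpolicy\u200d"
hits = find_zero_width(sample)
cleaned = strip_zero_width(sample)
```

Running find_zero_width on the text field of an /extract response gives a quick way to confirm the ✓ in the table for a specific file.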

Auto-detection variability

SimpleDirectoryReader behavior depends on which optional packages are installed, so the reader that actually handles a given format may differ from what you expect.
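
One way to reduce the guesswork is to check which parser dependencies are importable inside the container. This is a sketch under assumptions: the package names below (pypdf, docx2txt, bs4) are guesses at the optional dependencies involved and may not match your image, so adjust the list as needed.

```python
from importlib.util import find_spec
from typing import Dict, Iterable

# Assumed top-level import names of optional parser dependencies.
ASSUMED_OPTIONAL_PACKAGES = ("pypdf", "docx2txt", "bs4")


def installed_packages(
    names: Iterable[str] = ASSUMED_OPTIONAL_PACKAGES,
) -> Dict[str, bool]:
    """Map each top-level package name to whether it can be imported."""
    return {name: find_spec(name) is not None for name in names}


status = installed_packages()
```

Printing status from a shell inside the llamaindex-rag container shows which of the assumed packages are present before you interpret extraction results.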

Next Steps

Unstructured — Compare with Unstructured's element-based approach
LangChain — Compare with LangChain's loader approach