LlamaIndex Pipeline

Property    Value
Framework   LlamaIndex
Version     0.12.33
Port        8101
Container   llamaindex-rag
Image       docker/llamaindex-rag

Extraction

The pipeline extracts text with SimpleDirectoryReader, which picks a reader for each file based on its extension:

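import tempfile
from pathlib import Path

from llama_index.core import SimpleDirectoryReader

# filename and content (bytes) come from the upload handler
# Write the uploaded file to a temporary directory so the reader can scan it
with tempfile.TemporaryDirectory() as tmpdir:
    file_path = Path(tmpdir) / filename
    file_path.write_bytes(content)
    # load_data() returns one or more Document objects per file
    documents = SimpleDirectoryReader(tmpdir).load_data()
    text = "\n".join(doc.text for doc in documents)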

How It Works

  1. The uploaded file is written to a temporary directory.
  2. SimpleDirectoryReader scans the directory and detects each file's type from its extension.
  3. It delegates to a format-specific reader (PDFReader, DocxReader, HTMLTagReader, MarkdownReader).
  4. It returns Document objects with .text and .metadata.
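
The steps above can be modelled with the standard library alone. This is a toy sketch under stated assumptions, not LlamaIndex's implementation: `load_upload`, `read_plain`, `read_markdown`, and the `READERS` table are hypothetical names, and the "Markdown reader" only strips heading markers.

```python
import tempfile
from pathlib import Path


def read_plain(path: Path) -> str:
    # Fallback reader: return the file contents as text
    return path.read_text(encoding="utf-8", errors="replace")


def read_markdown(path: Path) -> str:
    # Toy behaviour only: drop leading '#' heading markers from each line
    lines = path.read_text(encoding="utf-8").splitlines()
    return "\n".join(line.lstrip("#").strip() for line in lines)


# Illustrative suffix -> reader table (NOT LlamaIndex's actual mapping)
READERS = {".md": read_markdown}


def load_upload(filename: str, content: bytes) -> list[dict]:
    """Write an upload to a temp dir, scan it, and dispatch on file suffix."""
    with tempfile.TemporaryDirectory() as tmpdir:
        (Path(tmpdir) / filename).write_bytes(content)
        docs = []
        for path in sorted(Path(tmpdir).iterdir()):
            reader = READERS.get(path.suffix.lower(), read_plain)
            docs.append({"text": reader(path), "metadata": {"file_name": path.name}})
        return docs
```

The point of the sketch is that the reader is chosen from the file name, not from the file's contents, so renaming a file changes which extraction path it takes.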

Implications for hemlock

  • HTML: HTMLTagReader extracts text from tags, so whether hidden elements survive depends on the tag structure
  • PDF: PDFReader (pypdf-based) reads the text layer only, similar to LangChain
  • DOCX: DocxReader reads paragraphs from the main body
  • Markdown: MarkdownReader parses the Markdown and extracts its text content
  • TXT: The file is read directly as plain text

RAG Chain

graph LR
    Q["Query"] --> I["VectorStoreIndex"]
    I --> R["ChromaDB<br/>Retriever"]
    R --> D["Top-k<br/>Nodes"]
    D --> E["Query Engine"]
    E --> L["Ollama<br/>smollm2:135m"]
    L --> A["Response"]

LlamaIndex uses VectorStoreIndex with the global Settings object:

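from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama

# Register the embedding model and LLM globally
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
Settings.llm = Ollama(model="smollm2:135m")

# chroma_vector_store wraps the ChromaDB collection and is created elsewhere (not shown)
index = VectorStoreIndex.from_vector_store(chroma_vector_store)
query_engine = index.as_query_engine()
response = query_engine.query(query_text)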
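
To make the retrieval step in the diagram concrete, here is a minimal standard-library sketch of top-k selection followed by prompt assembly. It is an illustration under assumptions, not what the query engine does internally: `cosine`, `top_k`, and `build_prompt` are hypothetical helpers, the vectors are hand-written stand-ins for embeddings, `k=2` is an arbitrary choice, and the prompt template is invented for this example.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, nodes, k=2):
    """Return the texts of the k nodes most similar to the query.

    nodes is a list of (text, vector) pairs.
    """
    ranked = sorted(nodes, key=lambda n: cosine(query_vec, n[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    # Hypothetical template: retrieved text is pasted in verbatim
    context = "\n---\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

Because the retrieved chunks are inserted into the prompt as-is, any text that survived extraction and ranks in the top k reaches the model unchanged.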

Curl Examples

Health Check

curl http://localhost:8101/health
{"framework": "llamaindex", "status": "ok", "version": "0.12.33"}

Extract Text

curl -X POST http://localhost:8101/extract \
  -F "file=@poisoned-document.pdf"
{"text": "...", "metadata": {"filename": "poisoned-document.pdf"}}

Ingest Document

curl -X POST http://localhost:8101/ingest \
  -F "file=@document.pdf" \
  -F "collection=test-collection"
{"status": "ingested", "collection": "test-collection", "chunks": 5}

Query

curl -X POST http://localhost:8101/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the refund policy?", "collection": "test-collection"}'
{"answer": "...", "sources": ["refund-policy.html"]}
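
The same query can be issued from Python. The sketch below uses only the standard library and mirrors the curl call above; `build_query_request` and `run_query` are hypothetical helper names, and the 30-second timeout is an arbitrary assumption.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8101"  # port taken from the property table


def build_query_request(query: str, collection: str,
                        base_url: str = BASE_URL) -> urllib.request.Request:
    # Build the JSON POST for /query without sending it
    body = json.dumps({"query": query, "collection": collection}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query: str, collection: str) -> dict:
    """Send the request; requires the llamaindex-rag container to be running."""
    request = build_query_request(query, collection)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Splitting request construction from sending keeps the payload easy to inspect or log before it reaches the service.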

Known Behaviors

Technique       Format    Survives?  Notes
CSS hide        HTML      Varies     Depends on HTMLTagReader version
HTML comment    HTML      No         Comments typically stripped
ARIA hidden     HTML      Varies     Tag-based extraction may skip
Invisible text  HTML      Yes        Zero-width chars pass through
Metadata        PDF       No         PDFReader ignores metadata
Frontmatter     Markdown  Yes        YAML frontmatter included in text
Custom XML      DOCX      No         DocxReader reads paragraphs only
Auto-detection variability

SimpleDirectoryReader's behavior depends on which optional packages are installed, so the reader that actually handles a given file may differ from the one listed on this page.
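
One way to see what an environment provides is to check which optional packages are importable. The sketch below is a starting point only: `installed_packages` is a hypothetical helper, and the package names in `OPTIONAL_PACKAGES` are assumptions about what the readers import, so adjust them to match your installation.

```python
from importlib.util import find_spec

# Assumed package names; replace with whatever your readers actually import
OPTIONAL_PACKAGES = ("pypdf", "docx2txt", "bs4")


def installed_packages(names=OPTIONAL_PACKAGES) -> dict:
    """Map each top-level package name to whether it can be imported."""
    return {name: find_spec(name) is not None for name in names}
```

Recording this alongside extraction results makes it easier to explain differences between two runs of the same document.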


Next Steps

  • Unstructured — Compare with Unstructured's element-based approach
  • LangChain — Compare with LangChain's loader approach