LlamaIndex Pipeline¶
| Property | Value |
|---|---|
| Framework | LlamaIndex |
| Version | 0.12.33 |
| Port | 8101 |
| Container | llamaindex-rag |
| Image | docker/llamaindex-rag |
Extraction¶
LlamaIndex uses SimpleDirectoryReader with auto-detection:
import tempfile
from pathlib import Path

from llama_index.core import SimpleDirectoryReader

# Write the uploaded file to a temporary directory
with tempfile.TemporaryDirectory() as tmpdir:
    file_path = Path(tmpdir) / filename
    file_path.write_bytes(content)
    # Load while the temporary directory still exists
    documents = SimpleDirectoryReader(tmpdir).load_data()

text = "\n".join(doc.text for doc in documents)
How It Works¶
- Uploaded file is written to a temporary directory
- SimpleDirectoryReader scans the directory and auto-detects the file type
- Internally delegates to format-specific readers (PDFReader, DocxReader, HTMLTagReader, MarkdownReader)
- Returns Document objects with .text and .metadata
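The flow above can be approximated with the standard library alone. The sketch below is a simplified stand-in, not LlamaIndex's actual implementation: `READERS`, `read_plain_text`, and `extract_upload` are hypothetical names, and only plain-text reading is implemented.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict


def read_plain_text(path: Path) -> str:
    """Fallback reader: decode the file as UTF-8 text."""
    return path.read_text(encoding="utf-8", errors="replace")


# Hypothetical registry mapping file suffixes to reader callables.
# A real pipeline would register PDF, DOCX, and HTML readers here.
READERS: Dict[str, Callable[[Path], str]] = {
    ".txt": read_plain_text,
    ".md": read_plain_text,
}


def extract_upload(filename: str, content: bytes) -> str:
    """Write an upload to a temp directory and read it back by suffix."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Path(filename).name drops any directory components in the upload name
        file_path = Path(tmpdir) / Path(filename).name
        file_path.write_bytes(content)
        reader = READERS.get(file_path.suffix.lower(), read_plain_text)
        # Read inside the with-block: the directory is deleted on exit
        return reader(file_path)


text = extract_upload("notes.TXT", b"refund policy: 30 days")
```

Dispatching on the lowercased suffix means `notes.TXT` and `notes.txt` reach the same reader, and an unknown suffix falls back to a plain-text read.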
Implications for hemlock¶
- HTML: uses HTMLTagReader, which extracts text from tags; hidden elements may or may not survive depending on the tag structure
- PDF: uses PDFReader (pypdf-based); text layer only, similar to LangChain
- DOCX: uses DocxReader; reads paragraphs from the main body
- Markdown: parses Markdown to extract text content
- TXT: direct file read
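To see why hidden HTML content can survive tag-based extraction, consider a minimal sketch built on Python's standard `html.parser`. It is not HTMLTagReader and may behave differently from it; `TextOnlyParser` and `extract_text` are illustrative names. The point is that a parser collecting text nodes does not evaluate CSS or ARIA attributes, while comments are delivered through a separate callback and are dropped unless handled.

```python
from html.parser import HTMLParser
from typing import List


class TextOnlyParser(HTMLParser):
    """Collects text nodes; comments are dropped because
    handle_comment is not overridden."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def extract_text(html: str) -> str:
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<p>Visible text</p>"
    '<p style="display:none">hidden by CSS</p>'
    "<!-- comment payload -->"
    '<span aria-hidden="true">aria hidden</span>'
)
extracted = extract_text(sample)
```

In this sketch the CSS-hidden and ARIA-hidden strings both end up in `extracted`, while the comment payload does not. Whether a specific reader behaves the same way has to be verified against that reader.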
RAG Chain¶
graph LR
Q["Query"] --> I["VectorStoreIndex"]
I --> R["ChromaDB<br/>Retriever"]
R --> D["Top-k<br/>Nodes"]
D --> E["Query Engine"]
E --> L["Ollama<br/>smollm2:135m"]
L --> A["Response"]
LlamaIndex uses VectorStoreIndex with the global Settings object:
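from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama

# Global defaults used for both embedding and generation
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
Settings.llm = Ollama(model="smollm2:135m")

# chroma_vector_store is the ChromaDB-backed vector store created elsewhere
index = VectorStoreIndex.from_vector_store(chroma_vector_store)
query_engine = index.as_query_engine()
response = query_engine.query(query_text)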
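The "Top-k Nodes" step in the diagram can be illustrated with a toy retriever. This is a conceptual sketch only, not how LlamaIndex or ChromaDB implement retrieval: the two-dimensional vectors are made up, and `cosine` and `top_k` are hypothetical helpers.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    nodes: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the text of the k nodes most similar to the query vector."""
    ranked = sorted(nodes, key=lambda node: cosine(query_vec, node[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Made-up embeddings, purely for illustration
nodes = [
    ("refund policy", [1.0, 0.0]),
    ("shipping times", [0.0, 1.0]),
    ("returns and refunds", [0.9, 0.1]),
]
result = top_k([1.0, 0.0], nodes, k=2)
```

Only the retrieved node texts are passed on to the LLM, so content that ranks outside the top k never reaches the response.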
Curl Examples¶
Health Check¶
Extract Text¶
Ingest Document¶
curl -X POST http://localhost:8101/ingest \
-F "file=@document.pdf" \
-F "collection=test-collection"
Query¶
curl -X POST http://localhost:8101/query \
-H "Content-Type: application/json" \
-d '{"query": "What is the refund policy?", "collection": "test-collection"}'
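The same query can be issued from Python. The sketch below mirrors the curl command using only the standard library. `build_query_request` and `send` are illustrative names, and because the response format is not documented here, `send` returns the raw body as text.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8101"  # port taken from the property table


def build_query_request(
    query: str, collection: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build the POST /query request without sending it."""
    payload = json.dumps({"query": query, "collection": collection}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_query_request("What is the refund policy?", "test-collection")
# With the container running: print(send(request))
```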
Known Behaviors¶
| Technique | Format | Survives? | Notes |
|---|---|---|---|
| CSS hide | HTML | Varies | Depends on HTMLTagReader version |
| HTML comment | HTML | ✗ | Comments typically stripped |
| ARIA hidden | HTML | Varies | Tag-based extraction may skip |
| Invisible text | HTML | ✓ | Zero-width chars pass through |
| Metadata | PDF | ✗ | PDFReader ignores metadata |
| Frontmatter | Markdown | ✓ | YAML frontmatter included in text |
| Custom XML | DOCX | ✗ | DocxReader reads paragraphs only |
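One way to confirm the "Invisible text" row for a given document is to scan the extracted text for zero-width code points. The helper below is a sketch: the character set is a common but not exhaustive selection, and `find_zero_width` and `strip_zero_width` are hypothetical names.

```python
import unicodedata
from typing import List, Tuple

# Common zero-width code points; extend the set as needed
ZERO_WIDTH_CHARS = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def find_zero_width(text: str) -> List[Tuple[int, str]]:
    """Return (index, Unicode name) for each zero-width character found."""
    return [
        (index, unicodedata.name(char, "UNKNOWN"))
        for index, char in enumerate(text)
        if char in ZERO_WIDTH_CHARS
    ]


def strip_zero_width(text: str) -> str:
    """Remove zero-width characters so the visible text can be compared."""
    return "".join(char for char in text if char not in ZERO_WIDTH_CHARS)


extracted = "re\u200bfund pol\u200dicy"
hits = find_zero_width(extracted)
```

A non-empty `hits` list means the invisible characters passed through extraction unchanged.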
Auto-detection variability
SimpleDirectoryReader's behavior depends on which optional packages are installed, so the reader selected for a given file type may differ from one environment to another.
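A quick way to see which optional parsers are importable in a given environment is sketched below. The module names in `OPTIONAL_MODULES` are assumptions about common reader dependencies, not a list confirmed by this page. Adjust them to match the readers you rely on.

```python
from importlib.util import find_spec
from typing import Dict, Iterable

# Assumed top-level module names for PDF, DOCX, and HTML parsing
OPTIONAL_MODULES = ("pypdf", "docx2txt", "bs4")


def check_optional_modules(names: Iterable[str] = OPTIONAL_MODULES) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be imported."""
    return {name: find_spec(name) is not None for name in names}


availability = check_optional_modules()
for name, present in availability.items():
    print(f"{name}: {'installed' if present else 'missing'}")
```

Running this inside the container, rather than on the host, reports what the pipeline itself will see.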
Next Steps¶
- Unstructured — Compare with Unstructured's element-based approach
- LangChain — Compare with LangChain's loader approach