LangChain Pipeline¶
| Property | Value |
|---|---|
| Framework | LangChain |
| Version | 0.3.35 |
| Port | 8100 |
| Container | langchain-rag |
| Image | docker/langchain-rag |
Extraction¶
LangChain uses an extension-to-loader mapping:
LOADERS = {
".html": BSHTMLLoader,
".pdf": PyPDFLoader,
".docx": Docx2txtLoader,
".md": UnstructuredMarkdownLoader,
".txt": None, # plain open() read
}
How It Works¶
- Uploaded file is saved to a temp path
- Extension is matched to a loader class
- Loader's
.load()returns a list ofDocumentobjects - Text is extracted from
doc.page_content
Implications for hemlock¶
BSHTMLLoaderuses BeautifulSoup and preserves HTML structure including hidden elements — CSS-hidden and invisible payloads tend to survivePyPDFLoaderextracts text layer only — JavaScript and annotation payloads are strippedDocx2txtLoaderreads the main document body — custom XML and metadata payloads are stripped.txtfallback reads raw bytes — all text-based payloads survive
RAG Chain¶
graph LR
Q["Query"] --> R["ChromaDB<br/>Retriever"]
R --> D["Top-k<br/>Documents"]
D --> P["Prompt<br/>Template"]
P --> L["Ollama<br/>smollm2:135m"]
L --> A["Answer"]
LangChain uses RetrievalQA.from_chain_type():
embeddings = OllamaEmbeddings(model="nomic-embed-text", base_url=OLLAMA_URL)
vectorstore = Chroma(
collection_name=collection,
embedding_function=embeddings,
client=chromadb_client
)
retriever = vectorstore.as_retriever()
llm = Ollama(model="smollm2:135m", base_url=OLLAMA_URL)
chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
result = chain.invoke({"query": query_text})
Curl Examples¶
Health Check¶
Extract Text¶
Ingest Document¶
curl -X POST http://localhost:8100/ingest \
-F "file=@document.html" \
-F "collection=test-collection"
Query¶
curl -X POST http://localhost:8100/query \
-H "Content-Type: application/json" \
-d '{"query": "What is the refund policy?", "collection": "test-collection"}'
Known Behaviors¶
| Technique | Format | Survives? | Notes |
|---|---|---|---|
| CSS hide | HTML | ✓ | BSHTMLLoader preserves display:none elements |
| HTML comment | HTML | ✓ | Comments included in output |
| ARIA hidden | HTML | ✓ | Attribute preserved, text extracted |
| Invisible text | HTML | ✓ | Zero-width chars pass through |
| Metadata | ✗ | PyPDFLoader ignores metadata fields | |
| JavaScript | ✗ | Not executed, text-layer only | |
| Custom XML | DOCX | ✗ | Docx2txt reads body only |
These are observed behaviors
Behaviors may change with LangChain version updates. Run make test-extract to verify against the current version.
Next Steps¶
- LlamaIndex — Compare with LlamaIndex's approach
- API Reference: Extract — Full
/extractendpoint spec