LangChain Pipeline¶

Property	Value
Framework	LangChain
Version	0.3.35
Port	8100
Container	`langchain-rag`
Image	`docker/langchain-rag`

Extraction¶

LangChain uses an extension-to-loader mapping:

LOADERS = {
    ".html": BSHTMLLoader,
    ".pdf":  PyPDFLoader,
    ".docx": Docx2txtLoader,
    ".md":   UnstructuredMarkdownLoader,
    ".txt":  None,  # plain open() read
}

How It Works¶

Uploaded file is saved to a temp path
Extension is matched to a loader class
Loader's .load() returns a list of Document objects
Text is extracted from doc.page_content

Implications for hemlock¶

BSHTMLLoader uses BeautifulSoup and preserves HTML structure including hidden elements — CSS-hidden and invisible payloads tend to survive
PyPDFLoader extracts text layer only — JavaScript and annotation payloads are stripped
Docx2txtLoader reads the main document body — custom XML and metadata payloads are stripped
.txt fallback reads raw bytes — all text-based payloads survive

RAG Chain¶

graph LR
    Q["Query"] --> R["ChromaDB<br/>Retriever"]
    R --> D["Top-k<br/>Documents"]
    D --> P["Prompt<br/>Template"]
    P --> L["Ollama<br/>smollm2:135m"]
    L --> A["Answer"]

LangChain uses RetrievalQA.from_chain_type():

embeddings = OllamaEmbeddings(model="nomic-embed-text", base_url=OLLAMA_URL)
vectorstore = Chroma(
    collection_name=collection,
    embedding_function=embeddings,
    client=chromadb_client
)
retriever = vectorstore.as_retriever()
llm = Ollama(model="smollm2:135m", base_url=OLLAMA_URL)
chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
result = chain.invoke({"query": query_text})

Curl Examples¶

Health Check¶

curl http://localhost:8100/health

{"framework": "langchain", "status": "ok", "version": "0.3.35"}

Extract Text¶

curl -X POST http://localhost:8100/extract \
  -F "file=@poisoned-document.html"

{"text": "...", "metadata": {"source": "poisoned-document.html"}}

Ingest Document¶

curl -X POST http://localhost:8100/ingest \
  -F "file=@document.html" \
  -F "collection=test-collection"

{"status": "ingested", "collection": "test-collection", "chunks": 3}

Query¶

curl -X POST http://localhost:8100/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the refund policy?", "collection": "test-collection"}'

{"answer": "...", "sources": ["refund-policy.html"]}

Known Behaviors¶

Technique	Format	Survives?	Notes
CSS hide	HTML	✓	BSHTMLLoader preserves `display:none` elements
HTML comment	HTML	✓	Comments included in output
ARIA hidden	HTML	✓	Attribute preserved, text extracted
Invisible text	HTML	✓	Zero-width chars pass through
Metadata	PDF	✗	PyPDFLoader ignores metadata fields
JavaScript	PDF	✗	Not executed, text-layer only
Custom XML	DOCX	✗	Docx2txt reads body only

These are observed behaviors

Behaviors may change with LangChain version updates. Run make test-extract to verify against the current version.

Next Steps¶

LlamaIndex — Compare with LlamaIndex's approach
API Reference: Extract — Full /extract endpoint spec