LangChain Pipeline

Property   Value
Framework  LangChain
Version    0.3.35
Port       8100
Container  langchain-rag
Image      docker/langchain-rag

Extraction

The pipeline maps each file extension to a LangChain document loader:

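from langchain_community.document_loaders import (
    BSHTMLLoader, Docx2txtLoader, PyPDFLoader, UnstructuredMarkdownLoader,
)

# Maps a lowercase file extension to the loader class used for extraction.
LOADERS = {
    ".html": BSHTMLLoader,
    ".pdf":  PyPDFLoader,
    ".docx": Docx2txtLoader,
    ".md":   UnstructuredMarkdownLoader,
    ".txt":  None,  # no loader: plain open() read
}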

How It Works

  1. Uploaded file is saved to a temp path
  2. Extension is matched to a loader class
  3. Loader's .load() returns a list of Document objects
  4. Text is extracted from doc.page_content
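
The four steps above can be sketched as a single function. This is a minimal illustration, not the pipeline's actual code: the name `extract_text`, the `loaders` parameter, and the UTF-8 decoding in the plain-text fallback are assumptions. Passing the mapping in as an argument keeps the sketch runnable without LangChain installed.

```python
import os
import tempfile
from typing import Callable, Mapping, Optional

# A loader factory takes a file path and returns an object with .load(),
# which is how the loader classes in LOADERS are used.
LoaderFactory = Optional[Callable[[str], object]]


def extract_text(filename: str, data: bytes, loaders: Mapping[str, LoaderFactory]) -> str:
    """Save an upload to a temp path, pick a loader by extension, return its text."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in loaders:
        raise ValueError(f"unsupported extension: {ext!r}")

    # Step 1: save the uploaded bytes to a temp path, keeping the extension.
    with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        # Step 2: match the extension to a loader class.
        factory = loaders[ext]
        if factory is None:
            # The ".txt" entry is None: fall back to a plain open() read.
            # UTF-8 with replacement is an assumption made for this sketch.
            with open(path, encoding="utf-8", errors="replace") as fh:
                return fh.read()
        # Step 3: .load() returns a list of Document-like objects.
        docs = factory(path).load()
        # Step 4: the text comes from each document's page_content.
        return "\n".join(doc.page_content for doc in docs)
    finally:
        os.remove(path)
```

With LangChain installed, a call would look like `extract_text("poisoned-document.html", raw_bytes, LOADERS)`.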

Implications for hemlock

  • BSHTMLLoader parses the page with BeautifulSoup, which does not evaluate CSS, so text inside hidden elements is extracted along with visible text. CSS-hidden and invisible payloads tend to survive.
  • PyPDFLoader extracts the text layer only, so JavaScript and annotation payloads are stripped.
  • Docx2txtLoader reads the main document body, so custom XML and metadata payloads are stripped.
  • The .txt fallback reads the file contents without any parsing, so all text-based payloads survive.
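
To illustrate why CSS-hidden text survives, here is a small sketch that uses the standard library's `html.parser` as a stand-in for BeautifulSoup (a swapped-in parser, chosen so the example runs without extra dependencies). Like any text extractor that never evaluates stylesheets, it returns the text of a `display:none` element and leaves zero-width characters untouched. The class and function names and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class TextOnlyExtractor(HTMLParser):
    """Collects every text node, ignoring CSS entirely."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        # Called for all text nodes, whether or not a browser would render them.
        if data.strip():
            self.chunks.append(data.strip())


def all_text(html: str) -> str:
    """Return the text of every element, hidden or not, joined by spaces."""
    parser = TextOnlyExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<p>Refunds are issued within 30 days.</p>"
    '<div style="display:none">hidden instruction</div>'
    "<p>Contact\u200bsupport</p>"  # contains a zero-width space
)
text = all_text(sample)
print("hidden instruction" in text)  # the hidden div's text is present
print("\u200b" in text)              # the zero-width space is still there
```

The sketch only demonstrates the general mechanism; the exact output of BSHTMLLoader should be checked against the running container.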

RAG Chain

graph LR
    Q["Query"] --> R["ChromaDB<br/>Retriever"]
    R --> D["Top-k<br/>Documents"]
    D --> P["Prompt<br/>Template"]
    P --> L["Ollama<br/>smollm2:135m"]
    L --> A["Answer"]

The pipeline builds the chain with RetrievalQA.from_chain_type():

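# Import paths are an assumption; the source does not show them.
from langchain.chains import RetrievalQA
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma

# OLLAMA_URL, collection, chromadb_client and query_text are defined elsewhere.
embeddings = OllamaEmbeddings(model="nomic-embed-text", base_url=OLLAMA_URL)
vectorstore = Chroma(
    collection_name=collection,
    embedding_function=embeddings,
    client=chromadb_client,
)
retriever = vectorstore.as_retriever()
llm = Ollama(model="smollm2:135m", base_url=OLLAMA_URL)
chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
result = chain.invoke({"query": query_text})  # answer text is in result["result"]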
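
The flow in the diagram can be sketched without LangChain to show how retrieved text reaches the model. This is a simplified illustration under stated assumptions, not the library's implementation: the prompt wording is hypothetical, `retrieve` and `generate` are placeholder callables standing in for the Chroma retriever and the Ollama model, and `k=4` is an assumed default. The extra `prompt` key is added here only so the final prompt can be inspected.

```python
from typing import Callable, Sequence

# Hypothetical prompt wording for illustration; not LangChain's actual template.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def answer_query(
    query: str,
    retrieve: Callable[[str, int], Sequence[str]],
    generate: Callable[[str], str],
    k: int = 4,
) -> dict:
    """Retrieve top-k chunks, place them all in one prompt, and call the model."""
    chunks = retrieve(query, k)
    # Every retrieved chunk is concatenated verbatim into the context.
    context = "\n\n".join(chunks)
    prompt = PROMPT_TEMPLATE.format(context=context, question=query)
    return {"query": query, "result": generate(prompt), "prompt": prompt}
```

Because the chunks are inserted verbatim, any payload that survived extraction and was retrieved ends up in the prompt the model sees.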

Curl Examples

Health Check

curl http://localhost:8100/health
{"framework": "langchain", "status": "ok", "version": "0.3.35"}

Extract Text

curl -X POST http://localhost:8100/extract \
  -F "file=@poisoned-document.html"
{"text": "...", "metadata": {"source": "poisoned-document.html"}}

Ingest Document

curl -X POST http://localhost:8100/ingest \
  -F "file=@document.html" \
  -F "collection=test-collection"
{"status": "ingested", "collection": "test-collection", "chunks": 3}

Query

curl -X POST http://localhost:8100/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the refund policy?", "collection": "test-collection"}'
{"answer": "...", "sources": ["refund-policy.html"]}
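
The query call can also be made from Python. The sketch below mirrors the curl example using only the standard library; the function names and the 30-second timeout are assumptions for illustration. Building the request is kept separate from sending it so the payload can be inspected without a running container.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8100"  # port from the property table


def build_query_request(
    query: str, collection: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build the POST /query request without sending it."""
    body = json.dumps({"query": query, "collection": collection}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(
    query: str, collection: str, base_url: str = BASE_URL, timeout: float = 30.0
) -> dict:
    """Send the request and decode the JSON response."""
    req = build_query_request(query, collection, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

With the container running, `run_query("What is the refund policy?", "test-collection")` should return a dictionary shaped like the response shown in the curl example.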

Known Behaviors

Technique       Format  Survives?  Notes
CSS hide        HTML    Yes        BSHTMLLoader preserves display:none elements
HTML comment    HTML    Yes        Comments included in output
ARIA hidden     HTML    Yes        Attribute preserved, text extracted
Invisible text  HTML    Yes        Zero-width chars pass through
Metadata        PDF     No         PyPDFLoader ignores metadata fields
JavaScript      PDF     No         Not executed, text-layer only
Custom XML      DOCX    No         Docx2txt reads body only

These are observed behaviors

Behaviors may change with LangChain version updates. Run make test-extract to verify against the current version.
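
A check of this kind can be sketched as a comparison between expected and observed survival. This is not the project's `make test-extract` target, whose contents are not shown here. The expectations are read off the Notes column of the Known Behaviors table, and the canary string and function name are hypothetical.

```python
# Expectations read off the Notes column of the Known Behaviors table.
EXPECTED_SURVIVAL = {
    "CSS hide": True,
    "HTML comment": True,
    "ARIA hidden": True,
    "Invisible text": True,
    "Metadata": False,
    "JavaScript": False,
    "Custom XML": False,
}

CANARY = "HEMLOCK-CANARY-7f3a"  # hypothetical marker embedded in each test payload


def find_changes(extracted: dict[str, str], canary: str = CANARY) -> list[str]:
    """Return the techniques whose observed survival differs from the table."""
    changed = []
    for technique, text in extracted.items():
        # A payload "survives" if its canary appears in the extracted text.
        survived = canary in text
        if survived != EXPECTED_SURVIVAL[technique]:
            changed.append(technique)
    return changed
```

Here `extracted` maps each technique name to the text returned by the /extract endpoint for a document carrying that payload. An empty result means the observed behavior still matches the table.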


Next Steps