Document Extraction and Self Training

An agentic system for extracting text from documents (PDF/JPEG), with reinforcement learning capabilities built on RAG. It features OCR, vision models, LLM-based structured data extraction, validation, and self-learning from user corrections.

Idea

Create an intelligent document extraction system that learns from user corrections. It uses LangGraph for agentic workflows, ChromaDB for RAG-based learning, OpenRouter for flexible LLM selection, and Langfuse for observability. It supports schema management, re-extraction, and real-time accuracy validation.
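
As a rough illustration of the "learns from user corrections" idea, the sketch below stores past corrections, retrieves the ones made on the most similar documents, and adds them to the extraction prompt as hints. It is a minimal stand-in written with the Python standard library only: the bag-of-words similarity takes the place of the embedding search that a vector store such as ChromaDB would provide, and every name in it (Correction, CorrectionMemory, build_prompt) is hypothetical rather than taken from the project.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def _embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: a real system would use learned embeddings)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass
class Correction:
    """One user correction: the document it came from and the field that was fixed."""
    document_text: str
    field_name: str
    wrong_value: str
    corrected_value: str


@dataclass
class CorrectionMemory:
    """In-memory stand-in for a vector store holding past corrections."""
    corrections: list = field(default_factory=list)

    def add(self, correction: Correction) -> None:
        self.corrections.append(correction)

    def retrieve(self, document_text: str, k: int = 2) -> list:
        """Return up to k corrections whose source documents best match the new one."""
        query = _embed(document_text)
        scored = [(_cosine(query, _embed(c.document_text)), c) for c in self.corrections]
        scored = [(score, c) for score, c in scored if score > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [c for _, c in scored[:k]]


def build_prompt(document_text: str, memory: CorrectionMemory, k: int = 2) -> str:
    """Assemble an extraction prompt that includes retrieved corrections as hints."""
    lines = ["Extract the structured fields from the document below."]
    examples = memory.retrieve(document_text, k)
    if examples:
        lines.append("Past user corrections on similar documents:")
        for c in examples:
            lines.append(
                f"- {c.field_name}: extracted {c.wrong_value!r}, "
                f"user corrected to {c.corrected_value!r}"
            )
    lines.append("Document:")
    lines.append(document_text)
    return "\n".join(lines)


if __name__ == "__main__":
    memory = CorrectionMemory()
    memory.add(Correction("Invoice total amount due 1,200.00 EUR", "total", "1.200", "1200.00"))
    print(build_prompt("Invoice number 42 total amount due 980.00 EUR", memory, k=1))
```

In this sketch the model itself is never retrained: the system "learns" only in the sense that each stored correction changes what later prompts contain. The resulting prompt would then be sent to whichever LLM is selected for extraction.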

Tech Stack

ChromaDB, FastAPI, Langfuse, LangGraph, OpenRouter, Python, React, SQLite, Tailwind CSS, TypeScript, Vite

Resources

Medium Article

https://medium.com/p/6b422568eabe
