# smart-llm-loader **Repository Path**: tolryg0/smart-llm-loader ## Basic Information - **Project Name**: smart-llm-loader - **Description**: No description available - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2025-03-13 - **Last Updated**: 2025-03-13 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # LLMLoader llm_loader is a lightweight yet powerful Python package that transforms any document into LLM-ready chunks. It handles the entire document processing pipeline: - 📄 Converts documents to clean markdown - 🔍 Built-in OCR for scanned documents and images - ✂️ Smart, context-aware text chunking - 🔌 Seamless integration with LangChain and LlamaIndex - 📦 Ready for vector stores and LLM ingestion Spend less time on preprocessing headaches and more time building what matters. From RAG systems to chatbots to document Q&A, LLMLoader handles the heavy lifting so you can focus on creating exceptional AI applications. LLMLoader's chunking approach has been benchmarked against traditional methods, showing superior performance particularly when paired with Google's Gemini Flash model. This combination offers an efficient and cost-effective solution for document chunking in RAG systems. View the detailed performance comparison [here](https://www.sergey.fyi/articles/gemini-flash-2). ## Features - Support for multiple LLM providers - In-built OCR for scanned documents and images - Flexible document type support - Supports different chunking strategies such as: context-aware chunking and page-based chunking - Supports custom prompts and custom chunking ## Installation You can install LLMLoader using pip: ```bash pip install llm-loader ``` Or using Poetry: ```bash poetry add llm-loader ``` ## Quick Start llm-loader package uses litellm to call the LLM so any arguments supported by litellm can be used. You can find the litellm documentation [here](https://docs.litellm.ai/docs/providers). You can use any multi-modal model supported by litellm. ```python from llm_loader import LLMLoader # Using Gemini Flash model os.environ["GEMINI_API_KEY"] = "YOUR_GEMINI_API_KEY" model = "gemini/gemini-1.5-flash" # Using openai model os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY" model = "openai/gpt-4o" # Using anthropic model os.environ["ANTHROPIC_API_KEY"] = "YOUR_ANTHROPIC_API_KEY" model = "anthropic/claude-3-5-sonnet" # Initialize the document loader loader = LLMLoader( file_path="your_document.pdf", chunk_strategy="contextual", model=model, ) # Load and split the document into chunks documents = loader.load_and_split() ``` ## Parameters ```python class LLMLoader(BaseLoader): """A flexible document loader that supports multiple input types.""" def __init__( self, file_path: Optional[Union[str, Path]] = None, # path to the document to load url: Optional[str] = None, # url to the document to load chunk_strategy: str = 'page', # chunking strategy to use (page, contextual, custom) custom_prompt: Optional[str] = None, # custom prompt to use model: str = "gemini/gemini-2.0-flash", # LLM model to use save_output: bool = False, # whether to save the output to a file output_dir: Optional[Union[str, Path]] = None, # directory to save the output to api_key: Optional[str] = None, # API key to use **kwargs, ): ``` ## Comparison with Traditional Methods Let's see LLMLoader in action! We'll compare it with PyMuPDF (a popular traditional document loader) to demonstrate why LLMLoader's intelligent chunking makes such a difference in real-world applications. ### The Challenge: Processing an Invoice We'll process this sample invoice that includes headers, tables, and complex formatting: ![Sample Invoice Document](https://raw.githubusercontent.com/AskYourPdf/llm-loader/refs/heads/master/examples/data/test_ocr_doc.png?height=200) ### Head-to-Head Comparison #### 1. LLMLoader Output LLMLoader intelligently breaks down the document into semantic chunks, preserving structure and meaning (note that the json output below has been formatted for readability): ```json [ { "content": "Invoice no: 27301261\nDate of issue: 10/09/2012", "metadata": { "page": 0, "semantic_theme": "invoice_header", "source": "data/test_ocr_doc.pdf" } }, { "content": "Seller:\nWilliams LLC\n72074 Taylor Plains Suite 342\nWest Alexandria, AR 97978\nTax Id: 922-88-2832\nIBAN: GB70FTNR64199348221780", "metadata": { "page": 0, "semantic_theme": "seller_information", "source": "data/test_ocr_doc.pdf" } }, { "content": "Client:\nHernandez-Anderson\n084 Carter Lane Apt. 846\nSouth Ronaldbury, AZ 91030\nTax Id: 959-74-5868", "metadata": { "page": 0, "semantic_theme": "client_information", "source": "data/test_ocr_doc.pdf" } }, { "content": "Item table:\n" "| No. | Description | Qty | UM | Net price | Net worth | VAT [%] | Gross worth |\n" "|-----|-----------------------------------------------------------|------|------|-----------|-----------|---------|-------------|\n" "| 1 | Lilly Pulitzer dress Size 2 | 5.00 | each | 45.00 | 225.00 | 10% | 247.50 |\n" "| 2 | New ERIN Erin Fertherston Straight Dress White Sequence Lining Sleeveless SZ 10 | 1.00 | each | 59.99 | 59.99 | 10% | 65.99 |\n" "| 3 | Sequence dress Size Small | 3.00 | each | 35.00 | 105.00 | 10% | 115.50 |\n" "| 4 | fire los angeles dress Medium | 3.00 | each | 6.50 | 19.50 | 10% | 21.45 |\n" "| 5 | Eileen Fisher Women's Long Sleeve Fleece Lined Front Pockets Dress XS Gray | 3.00 | each | 15.99 | 47.97 | 10% | 52.77 |\n" "| 6 | Lularoe Nicole Dress Size Small Light Solid Grey/White Ringer Tee Trim | 2.00 | each | 3.75 | 7.50 | 10% | 8.25 |\n" "| 7 | J.Crew Collection Black & White sweater Dress sz S | 1.00 | each | 30.00 | 30.00 | 10% | 33.00 |", "metadata": { "page": 0, "semantic_theme": "items_table", "source": "data/test_ocr_doc.pdf" } }, { "content": "Summary table:\n" "| VAT [%] | Net worth | VAT | Gross worth |\n" "|---------|-----------|--------|-------------|\n" "| 10% | 494,96 | 49,50 | 544,46 |\n" "| Total | $ 494,96 | $ 49,50| $ 544,46 |", "metadata": { "page": 0, "semantic_theme": "summary_table", "source": "data/test_ocr_doc.pdf" } } ] ``` **Key Benefits:** - ✨ Clean, structured chunks - 🎯 Semantic understanding - 📊 Preserved table formatting - 🏷️ Intelligent metadata tagging #### 2. Traditional PyMuPDF Output PyMuPDF provides a basic text extraction without semantic understanding: ```json [ { "page": 0, "content": "Invoice no: 27301261 \nDate of issue: \nSeller: \nWilliams LLC \n72074 Taylor Plains Suite 342 \nWest Alexandria, AR 97978 \nTax Id: 922-88-2832 \nIBAN: GB70FTNR64199348221780 \nITEMS \nNo. \nDescription \n2l \nLilly Pulitzer dress Size 2 \n2. \nNew ERIN Erin Fertherston \nStraight Dress White Sequence \nLining Sleeveless SZ 10 \n3. \n Sequence dress Size Small \n4. \nfire los angeles dress Medium \nL \nEileen Fisher Women's Long \nSleeve Fleece Lined Front \nPockets Dress XS Gray \n6. \nLularoe Nicole Dress Size Small \nLight Solid Grey/ White Ringer \nTee Trim \nT \nJ.Crew Collection Black & White \nsweater Dress sz S \nSUMMARY \nTotal \n2,00 \n1,00 \nVAT [%] \n10% \n10/09/2012 \neach \neach \nClient: \nHernandez-Anderson \n084 Carter Lane Apt. 846 \nSouth Ronaldbury, AZ 91030 \nTax Id: 959-74-5868 \nNet price \n Net worth \nVAT [%] \n45,00 \n225,00 \n10% \n59,99 \n59,99 \n10% \n35,00 \n105,00 \n10% \n6,50 \n19,50 \n10% \n15,99 \n47,97 \n10% \n3,75 \n7.50 \n10% \n30,00 \n30,00 \n10% \nNet worth \nVAT \n494,96 \n49,50 \n$ 494,96 \n$49,50 \nGross \nworth \n247,50 \n65,99 \n115,50 \n21,45 \n52,77 \n8,25 \n33,00 \nGross worth \n544,46 \n$ 544,46 \n", "metadata": { "source": "./data/test_ocr_doc.pdf", "file_path": "./data/test_ocr_doc.pdf", "page": 0, "total_pages": 1, "format": "PDF 1.5", "title": "", "author": "", "subject": "", "keywords": "", "creator": "", "producer": "AskYourPDF.com", "creationDate": "", "modDate": "D:20250213152908Z", "trapped": "" } } ] ``` ### Real-World Impact: RAG Performance Let's see how this difference affects a real Question-Answering system: ```python question = "What is the total gross worth for item 1 and item 7?" # LLMLoader Result ✅ "The total gross worth for item 1 (Lilly Pulitzer dress) is $247.50 and for item 7 (J.Crew Collection sweater dress) is $33.00. Total: $280.50" # PyMuPDF Result ❌ "The total gross worth for item 1 is $45.00, and for item 7 it is $33.00. Total: $78.00" ``` **Why LLMLoader Won:** - 🎯 Maintained table structure - 💡 Preserved relationships between data - 📊 Accurate calculations - 🤖 Better context for the LLM You can try it yourself by running the complete [RAG example](./examples/rag_example.py) to see the difference in action! ## License This project is licensed under the MIT License - see the LICENSE file for details. ## Contributing Contributions are welcome! Please feel free to submit a Pull Request. ## Authors - David Emmanuel ([@drmingler](https://github.com/drmingler))