Why Mistral AI
Handles document understanding across PDFs, scanned images, and handwritten forms. Native vision capability means no separate OCR pipeline — the model reads directly from pixels, extracting tables, key-value pairs, and structured content in a single pass.
Powers both the classification agent (mapping documents to 83+ types) and schema-based extraction (pulling structured fields per template). High accuracy on financial documents with strong JSON output adherence.
Flow 1: Ingestion Pipeline
8-step pipeline from file arrival to compliance verification.
SharePoint Ingestion
Airbyte Connector: Pull files from SharePoint, deduplicate via SHA-256 hash, store raw bytes.
Document Parsing
Mistral AI (Pixtral): Parse document content (OCR, table extraction, key-value pairs) from scanned images and PDFs.
Classification
Mistral Large: Classify into taxonomy category/subcategory with confidence score.
Schema Lookup
In-memory: Load the correct schema family extraction template based on classification result.
Schema-Based Extraction
Mistral Large: Extract structured fields per schema template into normalized JSON output.
Document Relation Agent
Custom Logic: Link related documents (e.g., a borrower's SBA Form 1919 to their tax returns), compute match confidence.
Store & Index
SQL Server + Vector DB: Write structured data to SQL, embed chunks in vector DB for semantic search.
Compliance Engine
Cron Job: Check for missing documents, validate checklist completeness, notify loan officers.
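The deduplication in the ingestion step can be illustrated with a minimal sketch. It assumes an in-memory store keyed by the SHA-256 digest of the raw bytes; `RawFileStore` and its `add` method are hypothetical names, not part of the Airbyte connector or the actual pipeline.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the raw file bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


class RawFileStore:
    """Hypothetical in-memory stand-in for the raw-byte store."""

    def __init__(self) -> None:
        self._by_hash: dict[str, bytes] = {}

    def add(self, data: bytes) -> tuple[str, bool]:
        """Store the bytes unless an identical file was already seen.

        Returns (digest, stored), where stored is False for a duplicate.
        """
        digest = sha256_hex(data)
        if digest in self._by_hash:
            return digest, False
        self._by_hash[digest] = data
        return digest, True


store = RawFileStore()
first = store.add(b"loan application v1")
second = store.add(b"loan application v1")  # same bytes, treated as a duplicate
third = store.add(b"loan application v2")   # different bytes, stored as new
```

Hashing the content rather than the file name means a renamed or re-uploaded copy of the same file is still detected as a duplicate.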
How Classification Works
{
  "category": "BORROWER",
  "subcategory": "sba_form_1919",
  "confidence": 0.94
}
How Extraction Works
83 document types map to 9 extraction templates (schema families). Each schema family defines the exact fields to extract, reducing prompt complexity and improving accuracy.
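The many-to-few mapping described above might look like the following sketch. The family names, the `tax_return_1120s` subcategory, the `applicant_name` field, and the 0.8 confidence threshold are all assumptions made for illustration; the source does not name the 9 families or state a threshold. Only `sba_form_1919` and the tax-return field names come from the samples in this document.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaFamily:
    name: str
    fields: tuple[str, ...]


# Hypothetical families: names and the SBA field list are illustrative only.
SBA_FORM = SchemaFamily("sba_form", ("applicant_name",))
BUSINESS_TAX_RETURN = SchemaFamily(
    "business_tax_return",
    ("filing_year", "entity_type", "gross_income", "net_income",
     "total_assets", "depreciation", "officer_compensation"),
)

# Many subcategories map onto a few families (83 types to 9 in the source).
SUBCATEGORY_TO_FAMILY = {
    "sba_form_1919": SBA_FORM,
    "tax_return_1120s": BUSINESS_TAX_RETURN,  # hypothetical subcategory name
}


def lookup_schema(classification: dict, min_confidence: float = 0.8) -> SchemaFamily | None:
    """Return the schema family for a classification result.

    Returns None when the subcategory is unknown or the confidence is
    below the (assumed) threshold.
    """
    if classification["confidence"] < min_confidence:
        return None
    return SUBCATEGORY_TO_FAMILY.get(classification["subcategory"])


def missing_fields(family: SchemaFamily, extracted: dict) -> list[str]:
    """List template fields absent from an extraction result."""
    return [name for name in family.fields if name not in extracted]


family = lookup_schema(
    {"category": "BORROWER", "subcategory": "sba_form_1919", "confidence": 0.94}
)
```

Because the template fixes the field list, the extraction prompt only has to ask for those fields, and `missing_fields` gives a cheap completeness check on the model's JSON. The sample output that follows has the shape a tax-return template of this kind would produce.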
{
  "filing_year": 2023,
  "entity_type": "1120S",
  "gross_income": 1245000,
  "net_income": 187500,
  "total_assets": 2100000,
  "depreciation": 45000,
  "officer_compensation": 120000
}
Flow 2 & 3 Overview
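Flows 2 and 3 communicate through two shared stores: documents and job_queue. Flow 2 reads from documents and writes to job_queue. Flow 3 polls job_queue and executes the queued jobs. Because the queue is the only point of contact, each flow can be deployed, scaled, and debugged independently.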
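The hand-off through job_queue can be sketched as follows. This is a simplified illustration, not the actual implementation: it uses Python's built-in sqlite3 in place of SQL Server, and the column names, the `status` values, and the functions `enqueue_jobs` and `poll_once` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    subcategory TEXT NOT NULL
);
CREATE TABLE job_queue (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    status TEXT NOT NULL DEFAULT 'pending'
);
""")


def enqueue_jobs(conn: sqlite3.Connection) -> None:
    """Flow 2 stand-in: queue one job per document that has none yet."""
    conn.execute(
        "INSERT INTO job_queue (document_id) "
        "SELECT d.id FROM documents d "
        "WHERE NOT EXISTS "
        "(SELECT 1 FROM job_queue j WHERE j.document_id = d.id)"
    )
    conn.commit()


def poll_once(conn: sqlite3.Connection, handler) -> bool:
    """Flow 3 stand-in: run the oldest pending job; False if queue is empty."""
    row = conn.execute(
        "SELECT id, document_id FROM job_queue "
        "WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return False
    job_id, document_id = row
    conn.execute("UPDATE job_queue SET status = 'running' WHERE id = ?", (job_id,))
    conn.commit()
    handler(document_id)
    conn.execute("UPDATE job_queue SET status = 'done' WHERE id = ?", (job_id,))
    conn.commit()
    return True


conn.executemany(
    "INSERT INTO documents (subcategory) VALUES (?)",
    [("sba_form_1919",), ("sba_form_1919",)],
)
enqueue_jobs(conn)
enqueue_jobs(conn)  # second call adds nothing: both documents already have jobs

processed: list[int] = []
while poll_once(conn, processed.append):
    pass
```

Neither function calls the other; they only share the tables, which is what lets the two flows run as separate processes. This sketch assumes a single worker. With several workers polling concurrently, the select-then-update claim step would need to be made atomic.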