AI Pipeline Architecture

Flow 1 ingestion pipeline — from SharePoint to compliance alerts.

Why Mistral AI

Pixtral: Multimodal Model

Handles document understanding across PDFs, scanned images, and handwritten forms. Native vision capability means no separate OCR pipeline — the model reads directly from pixels, extracting tables, key-value pairs, and structured content in a single pass.

Mistral Large: Classification & Extraction

Powers both the classification agent (mapping documents to the 83 document types in the taxonomy) and schema-based extraction (pulling structured fields per template). It offers high accuracy on financial documents and strong JSON output adherence.

Why not GPT / Claude for this pipeline?

$ Cost efficiency at scale. With 80+ document types per loan and hundreds of loans per year, Mistral offers significantly lower per-token cost for high-volume structured extraction.

🔒 Self-hostable. Bank data compliance requires strict data residency. Mistral models can be deployed on-premises or in a private cloud, keeping loan documents off third-party APIs.

⚡ Optimized for structure. JSON mode with schema enforcement produces consistently parseable output, which is critical for pipeline reliability at scale.

Flow 1: Ingestion Pipeline

8-step pipeline from file arrival to compliance verification.

Stages: Ingestion, AI Processing, Storage, Compliance

SharePoint Ingestion

Airbyte Connector

Pull files from SharePoint, deduplicate via SHA-256 hash, store raw bytes.

In: SharePoint folder

Out: Raw file bytes + metadata
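The deduplication step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names and the in-memory `seen_hashes` set are assumptions, and a real deployment would presumably keep the hashes in the database.

```python
import hashlib
from typing import Optional

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the raw file bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()

def ingest_file(name: str, data: bytes, seen_hashes: set) -> Optional[dict]:
    """Return a metadata record for a new file, or None for duplicate content."""
    digest = sha256_hex(data)
    if digest in seen_hashes:
        return None  # same bytes already ingested, possibly under another name
    seen_hashes.add(digest)
    return {"filename": name, "sha256": digest, "size_bytes": len(data)}
```

Hashing the content rather than the filename means a renamed copy of the same file is still recognized as a duplicate.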

Document Parsing

Mistral AI (Pixtral)

Parse document content — OCR, table extraction, key-value pairs from scanned images and PDFs.

In: Raw file bytes

Out: Structured markdown + extracted tables

Classification

Mistral Large

Classify into taxonomy category/subcategory with confidence score.

In: Filename + first 2,000 chars

Out: {category, subcategory, confidence}
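A sketch of how the classifier input might be assembled and the reply checked. The prompt layout and both function names are assumptions; only the 2,000-character limit and the three output keys come from the step description, and the model call itself is left out.

```python
import json

MAX_CHARS = 2000  # from the step description: first 2,000 characters

def build_classifier_input(filename: str, text: str) -> str:
    """Combine the filename with the first MAX_CHARS characters of parsed text."""
    return f"Filename: {filename}\n\nContent:\n{text[:MAX_CHARS]}"

def parse_classification(raw: str) -> dict:
    """Parse the model's JSON reply and check the three expected keys."""
    result = json.loads(raw)
    missing = {"category", "subcategory", "confidence"} - set(result)
    if missing:
        raise ValueError(f"classification reply missing keys: {sorted(missing)}")
    if not 0.0 <= float(result["confidence"]) <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return result
```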

Schema Lookup

In-memory

Load the correct schema family extraction template based on classification result.

In: Classification result

Out: Extraction template (JSON schema)
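An in-memory lookup could be as simple as two dictionaries. In this sketch only `sba_form_1919` and `tax_returns` appear in the source; every other key, the template contents, and the fallback to a generic family are assumptions made for illustration.

```python
# Hypothetical mapping from classifier subcategory to schema family.
SUBCATEGORY_TO_FAMILY = {
    "sba_form_1919": "sba_forms",          # family key is assumed
    "business_tax_return": "tax_returns",  # subcategory key is assumed
}

# Hypothetical, heavily abbreviated extraction templates.
TEMPLATES = {
    "sba_forms": {"type": "object", "required": ["form_number"]},
    "tax_returns": {"type": "object", "required": ["filing_year", "entity_type"]},
    "generic_document": {"type": "object", "required": []},
}

def lookup_template(classification: dict) -> tuple:
    """Return (family, template); unknown subcategories fall back to generic."""
    family = SUBCATEGORY_TO_FAMILY.get(
        classification["subcategory"], "generic_document"
    )
    return family, TEMPLATES[family]
```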

Schema-Based Extraction

Mistral Large

Extract structured fields per schema template into normalized JSON output.

In: Document content + schema template

Out: Structured JSON (extracted fields)

Document Relation Agent

Custom Logic

Link related documents (e.g., a borrower's SBA Form 1919 to their tax returns) and compute a match confidence score for each link.

In: Extracted data + existing documents

Out: Document links + match scores
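The source does not describe the matching logic, so the following is only one plausible sketch: score a pair of documents by the fraction of shared identifying fields that agree. The field names (`borrower_name`, `ein`), the 0.5 threshold, and both function names are hypothetical.

```python
def match_score(a: dict, b: dict, keys: tuple) -> float:
    """Fraction of keys present in both records whose normalized values agree."""
    comparable = [k for k in keys if k in a and k in b]
    if not comparable:
        return 0.0
    hits = sum(
        1 for k in comparable
        if str(a[k]).strip().lower() == str(b[k]).strip().lower()
    )
    return hits / len(comparable)

def link_documents(new_doc: dict, existing: list,
                   keys: tuple = ("borrower_name", "ein"),
                   threshold: float = 0.5) -> list:
    """Return links from new_doc to existing documents, best match first."""
    links = []
    for doc in existing:
        score = match_score(new_doc["fields"], doc["fields"], keys)
        if score >= threshold:
            links.append({"from": new_doc["id"], "to": doc["id"], "score": score})
    return sorted(links, key=lambda link: link["score"], reverse=True)
```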

Store & Index

SQL Server + Vector DB

Write structured data to SQL, embed chunks in vector DB for semantic search.

In: Extracted JSON + document chunks

Out: Indexed records + embeddings
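Before embedding, document text has to be split into chunks. A minimal sketch of fixed-size chunking with overlap is shown below; the chunk size, the overlap, and the function name are assumptions, since the source does not specify a chunking strategy.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list:
    """Split text into chunks of up to `size` characters, overlapping by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += size - overlap
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.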

Compliance Engine

Cron Job

Check for missing documents, validate checklist completeness, notify loan officers.

In: Document inventory per loan

Out: Gap alerts + checklist status
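The gap check reduces to a set difference between the required checklist and the documents on file. In this sketch the record shape, the subcategory names other than `sba_form_1919`, and the function name are assumptions.

```python
def find_gaps(required: set, inventory: list) -> dict:
    """Compare a loan's document inventory against its required checklist."""
    present = {doc["subcategory"] for doc in inventory}
    missing = sorted(required - present)
    return {
        "complete": not missing,
        "missing": missing,
        "received": len(required & present),
        "required": len(required),
    }
```

A scheduled job could call this once per loan and notify the loan officer whenever `missing` is non-empty.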

How Classification Works

Document arrives → Filename + first 2,000 chars → Mistral Large → Classification result

Output

{
  "category": "BORROWER",
  "subcategory": "sba_form_1919",
  "confidence": 0.94
}

⚠️ If confidence < 0.7, the document is flagged for human review. The loan officer sees the top-3 predicted categories and manually confirms the correct one.

The classifier assigns one of the 7 categories and 83 subcategories defined in the taxonomy.
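The review rule can be sketched as a small routing function. The 0.7 threshold and the top-3 list come from the description above; the assumption here is that the classifier can return a score per candidate label, which the source does not state, and the function and status names are hypothetical.

```python
REVIEW_THRESHOLD = 0.7  # from the rule above

def route(scores: dict) -> dict:
    """Accept the top prediction, or flag for review with the top-3 candidates."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_conf = ranked[0]
    if best_conf < REVIEW_THRESHOLD:
        return {
            "status": "needs_review",
            "candidates": [label for label, _ in ranked[:3]],
        }
    return {
        "status": "auto_accepted",
        "subcategory": best_label,
        "confidence": best_conf,
    }
```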

How Extraction Works

83 document types map to 9 extraction templates (schema families). Each schema family defines the exact fields to extract, reducing prompt complexity and improving accuracy.

SBA Forms, Financial (Precise), Tax Returns, Bank Statements, Narrative / Q&A, Legal Documents, Resume / Bio, Spreadsheet / Workbook, Generic Document

Example: Tax Return → tax_returns schema family

{
  "filing_year": 2023,
  "entity_type": "1120S",
  "gross_income": 1245000,
  "net_income": 187500,
  "total_assets": 2100000,
  "depreciation": 45000,
  "officer_compensation": 120000
}

Explore all schema families on the Taxonomy page → Schema Families tab.
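Even with schema-enforced output, a cheap check on the extracted record is a reasonable safeguard before storage. The sketch below validates the tax return example against expected field types; the types are inferred from the example values and the function name is hypothetical.

```python
# Expected fields for the tax_returns family; types inferred from the example.
TAX_RETURN_FIELDS = {
    "filing_year": int,
    "entity_type": str,
    "gross_income": (int, float),
    "net_income": (int, float),
    "total_assets": (int, float),
    "depreciation": (int, float),
    "officer_compensation": (int, float),
}

def validate_extraction(record: dict, expected: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    errors = []
    for name, typ in expected.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif isinstance(record[name], bool) or not isinstance(record[name], typ):
            # bool is rejected explicitly because it is a subclass of int
            errors.append(f"wrong type for {name}: {type(record[name]).__name__}")
    return errors
```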

Flow 2 & 3 Overview

💬 Flow 2: Teams Bot

Loan officers interact via Microsoft Teams. The bot parses intent (status check, gap query, report request), reads from the database, and responds instantly. Complex requests (full audit, report generation) are queued as jobs for Flow 3.

Trigger: Teams message

⏰ Flow 3: Job Runner

Polls the job queue every 5 minutes. Routes by job type: audit execution, report generation, or checklist compilation. Results are written back to the database and notifications sent to the requesting officer.

Trigger: cron (every 5 minutes)
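Routing by job type can be sketched as a dispatch table. The three job types come from the description of Flow 3; the type keys, the handler bodies, and the result shape are placeholders.

```python
def run_audit(job: dict) -> str:
    return f"audit finished for loan {job['loan_id']}"

def generate_report(job: dict) -> str:
    return f"report generated for loan {job['loan_id']}"

def compile_checklist(job: dict) -> str:
    return f"checklist compiled for loan {job['loan_id']}"

# Hypothetical job type keys mapped to their handlers.
HANDLERS = {
    "audit": run_audit,
    "report": generate_report,
    "checklist": compile_checklist,
}

def process_jobs(pending: list) -> list:
    """Run each pending job through its handler and collect the results."""
    results = []
    for job in pending:
        handler = HANDLERS.get(job["type"])
        if handler is None:
            results.append({"job_id": job["id"], "status": "failed",
                            "detail": f"unknown job type: {job['type']}"})
            continue
        results.append({"job_id": job["id"], "status": "done",
                        "detail": handler(job)})
    return results
```

Recording unknown job types as failures, rather than raising, keeps one malformed row from blocking the rest of the batch.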

🔑 Flows communicate through SQL, not direct calls. Flow 1 writes to documents and job_queue. Flow 2 reads from documents and writes to job_queue. Flow 3 polls job_queue and executes. Because the flows share only these tables, each one can be deployed, scaled, and debugged independently.
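The table-based handoff can be illustrated with SQLite standing in for SQL Server. Only the table names `documents` and `job_queue` come from the text; the column names, status values, and function names are assumptions.

```python
import sqlite3

def init_db(conn: sqlite3.Connection) -> None:
    """Create the two shared tables (column layout is hypothetical)."""
    conn.executescript("""
        CREATE TABLE documents (
            id INTEGER PRIMARY KEY, loan_id TEXT, subcategory TEXT);
        CREATE TABLE job_queue (
            id INTEGER PRIMARY KEY, job_type TEXT, loan_id TEXT,
            status TEXT DEFAULT 'pending');
    """)

def enqueue_job(conn: sqlite3.Connection, job_type: str, loan_id: str) -> int:
    """Producer side (Flow 1 or Flow 2): add a pending job."""
    cur = conn.execute(
        "INSERT INTO job_queue (job_type, loan_id) VALUES (?, ?)",
        (job_type, loan_id),
    )
    conn.commit()
    return cur.lastrowid

def claim_next_job(conn: sqlite3.Connection):
    """Consumer side (Flow 3): mark the oldest pending job as running."""
    row = conn.execute(
        "SELECT id, job_type, loan_id FROM job_queue "
        "WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE job_queue SET status = 'running' WHERE id = ?", (row[0],))
    conn.commit()
    return {"id": row[0], "job_type": row[1], "loan_id": row[2]}
```

The select-then-update claim shown here is safe only with a single runner; with several concurrent runners the claim would need to be atomic, for example through row locking in the production database.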
