AI Pipeline Architecture

Flow 1 ingestion pipeline — from SharePoint to compliance alerts.

Why Mistral AI

Pixtral: Multimodal Model

Handles document understanding across PDFs, scanned images, and handwritten forms. Native vision capability means no separate OCR pipeline — the model reads directly from pixels, extracting tables, key-value pairs, and structured content in a single pass.

Mistral Large: Classification & Extraction

Powers both the classification agent (mapping documents to 83+ types) and schema-based extraction (pulling structured fields per template). High accuracy on financial documents with strong JSON output adherence.

Why not GPT / Claude for this pipeline?

$ Cost efficiency at scale. 80+ doc types per loan, hundreds of loans per year. Mistral offers significantly lower per-token cost for high-volume structured extraction.
🔒 Self-hostable. Bank data compliance requires strict data residency. Mistral models can be deployed on-premises or in a private cloud, keeping loan documents off third-party APIs.
Optimized for structure. JSON mode with schema enforcement produces consistently parseable output, critical for pipeline reliability at scale.

Flow 1: Ingestion Pipeline

7-step pipeline (steps 0 to 6, plus a schema lookup sub-step, 3a) from file arrival to compliance verification.

Stages: Ingestion, AI Processing, Storage, Compliance

Step 0: SharePoint Ingestion

Airbyte Connector

Pull files from SharePoint, deduplicate via SHA-256 hash, store raw bytes.

In: SharePoint folder
Out: Raw file bytes + metadata
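
The deduplication in this step can be sketched as follows. This is a minimal illustration, assuming an in-memory store; the class `RawFileStore` and its `ingest` method are hypothetical names, and the real pipeline presumably persists hashes in its own storage layer.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the raw file bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


class RawFileStore:
    """Hypothetical in-memory stand-in for the raw file store."""

    def __init__(self) -> None:
        self._by_hash: dict[str, dict] = {}

    def ingest(self, filename: str, data: bytes) -> tuple[str, bool]:
        """Store the file unless identical bytes were already seen.

        Returns (digest, is_new). A re-upload under a different filename
        is still treated as a duplicate because only the content is hashed.
        """
        digest = sha256_hex(data)
        if digest in self._by_hash:
            return digest, False
        self._by_hash[digest] = {
            "filename": filename,
            "size_bytes": len(data),
            "content": data,
        }
        return digest, True
```

Hashing the content rather than the filename means a renamed copy of the same file is skipped, while an edited file with the same name is ingested again.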

Step 1: Document Parsing

Mistral AI (Pixtral)

Parse document content — OCR, table extraction, key-value pairs from scanned images and PDFs.

In: Raw file bytes
Out: Structured markdown + extracted tables

Step 2: Classification

Mistral Large

Classify into taxonomy category/subcategory with confidence score.

In: Filename + first 2,000 chars
Out: {category, subcategory, confidence}
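
The classifier input described above (filename plus the first 2,000 characters) could be assembled like this. The function name `build_classifier_input` and the exact text layout are assumptions for illustration, not the pipeline's actual prompt format.

```python
MAX_CHARS = 2000  # the "first 2,000 chars" limit stated for this step


def build_classifier_input(filename: str, parsed_text: str,
                           max_chars: int = MAX_CHARS) -> str:
    """Combine the filename with a truncated excerpt of the parsed content."""
    excerpt = parsed_text[:max_chars]
    return f"Filename: {filename}\n\nContent:\n{excerpt}"
```

Truncating the content keeps the per-document token count bounded regardless of document length, which matters when many documents arrive per loan.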

Step 3a: Schema Lookup

In-memory

Load the correct schema family extraction template based on classification result.

In: Classification result
Out: Extraction template (JSON schema)
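
An in-memory lookup of this kind might look like the sketch below. Only `sba_form_1919` and the `tax_returns` family appear in this document; the subcategory `business_tax_return`, the family `sba_forms`, the field `applicant_name`, and the function `lookup_template` are hypothetical placeholders.

```python
# Subcategory -> schema family. Entries marked hypothetical are illustrative.
SCHEMA_FAMILY_BY_SUBCATEGORY = {
    "sba_form_1919": "sba_forms",          # family name is hypothetical
    "business_tax_return": "tax_returns",  # subcategory name is hypothetical
}

# Schema family -> extraction template, written in JSON Schema style.
EXTRACTION_TEMPLATES = {
    "tax_returns": {
        "type": "object",
        "required": ["filing_year", "entity_type"],
        "properties": {
            "filing_year": {"type": "integer"},
            "entity_type": {"type": "string"},
        },
    },
    "sba_forms": {  # hypothetical template
        "type": "object",
        "required": ["applicant_name"],
        "properties": {"applicant_name": {"type": "string"}},
    },
}


def lookup_template(classification: dict) -> dict:
    """Return the extraction template for a classification result."""
    subcategory = classification["subcategory"]
    try:
        family = SCHEMA_FAMILY_BY_SUBCATEGORY[subcategory]
    except KeyError:
        raise LookupError(
            f"no schema family mapped for subcategory {subcategory!r}"
        ) from None
    return EXTRACTION_TEMPLATES[family]
```

Because the lookup is two dictionary reads, it adds no model call or I/O between classification and extraction.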

Step 3: Schema-Based Extraction

Mistral Large

Extract structured fields per schema template into normalized JSON output.

In: Document content + schema template
Out: Structured JSON (extracted fields)

Step 4: Document Relation Agent

Custom Logic

Link related documents (e.g., borrower's 1919 to tax returns), compute match confidence.

In: Extracted data + existing documents
Out: Document links + match scores
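
The matching logic is custom and not specified here, so the following is only one plausible shape for it: a sketch that links documents by fuzzy borrower-name similarity. The field names `id` and `borrower_name`, the 0.8 cutoff, and the use of `difflib` are all assumptions.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between two names, in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def link_documents(new_doc: dict, existing_docs: list[dict],
                   min_score: float = 0.8) -> list[dict]:
    """Return links from new_doc to existing documents, best match first.

    Assumes every document dict carries hypothetical 'id' and
    'borrower_name' keys taken from the extracted JSON.
    """
    links = []
    for doc in existing_docs:
        score = name_similarity(new_doc["borrower_name"], doc["borrower_name"])
        if score >= min_score:
            links.append({
                "from": new_doc["id"],
                "to": doc["id"],
                "match_score": round(score, 2),
            })
    return sorted(links, key=lambda link: link["match_score"], reverse=True)
```

A production matcher would likely combine several signals (for example tax IDs or filing years) instead of relying on names alone.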

Step 5: Store & Index

SQL Server + Vector DB

Write structured data to SQL, embed chunks in vector DB for semantic search.

In: Extracted JSON + document chunks
Out: Indexed records + embeddings
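
Before embedding, document text has to be split into chunks. The chunking strategy is not described in this document, so the fixed-size, overlapping split below is an assumption, as are the function name `chunk_text` and the sizes of 500 and 50 characters.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that overlap by `overlap` characters.

    The overlap keeps a sentence that straddles a boundary retrievable
    from at least one chunk.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks
```

Each chunk would then be passed to an embedding model and written to the vector DB alongside the document ID, so that a semantic search hit can be traced back to its SQL record.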

Step 6: Compliance Engine

Cron Job

Check for missing documents, validate checklist completeness, notify loan officers.

In: Document inventory per loan
Out: Gap alerts + checklist status
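
At its core, the gap check is a set difference between the required checklist and the documents on file. A minimal sketch, assuming both are available as sets of subcategory names; `find_gaps` and the returned keys are hypothetical.

```python
def find_gaps(required: set[str], inventory: set[str]) -> dict:
    """Compare a loan's required checklist against its document inventory."""
    missing = sorted(required - inventory)
    return {
        "complete": not missing,
        "missing": missing,
        "received": len(required & inventory),
        "required": len(required),
    }
```

For example, with a required set of `{"sba_form_1919", "business_tax_return"}` (the second name is illustrative) and only the Form 1919 on file, the result lists `business_tax_return` as missing, which is what a notification to the loan officer would report.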

How Classification Works

Document arrives → Filename + first 2,000 chars → Mistral Large → Classification result

Output:
{
  "category": "BORROWER",
  "subcategory": "sba_form_1919",
  "confidence": 0.94
}
⚠️ If confidence < 0.7, the document is flagged for human review. The loan officer sees the top 3 predicted categories and manually confirms.
Classifies into 7 categories / 83 subcategories defined in the taxonomy.
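
The confidence rule above can be sketched as a small routing function. The 0.7 threshold and the output keys come from this page; the function names and the `scores` mapping used to pick the top 3 candidates are assumptions, since the page does not show how alternative predictions are returned.

```python
import json

REVIEW_THRESHOLD = 0.7  # below this, the document goes to human review


def route_classification(raw_response: str,
                         threshold: float = REVIEW_THRESHOLD) -> dict:
    """Parse the classifier's JSON output and decide whether review is needed."""
    result = json.loads(raw_response)
    confidence = float(result["confidence"])
    return {
        "category": result["category"],
        "subcategory": result["subcategory"],
        "confidence": confidence,
        "needs_review": confidence < threshold,
    }


def top_candidates(scores: dict[str, float], k: int = 3) -> list[tuple[str, float]]:
    """Return the k highest-scoring labels from a hypothetical label -> score map."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
```

A confidence of exactly 0.7 passes automatically, because the rule is a strict less-than comparison.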

How Extraction Works

83 document types map to 9 extraction templates (schema families). Each schema family defines the exact fields to extract, reducing prompt complexity and improving accuracy.

Example: Tax Return → tax_returns schema family
{
  "filing_year": 2023,
  "entity_type": "1120S",
  "gross_income": 1245000,
  "net_income": 187500,
  "total_assets": 2100000,
  "depreciation": 45000,
  "officer_compensation": 120000
}
Explore all schema families on the Taxonomy page → Schema Families tab.
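
Since pipeline reliability depends on parseable output, a downstream check on the extracted JSON is a natural safeguard. The sketch below validates the tax return example against its field list; the expected Python types and the function name `validate_extraction` are assumptions inferred from the example values.

```python
import json

# Field names come from the tax_returns example; the types are inferred.
TAX_RETURNS_FIELDS = {
    "filing_year": int,
    "entity_type": str,
    "gross_income": (int, float),
    "net_income": (int, float),
    "total_assets": (int, float),
    "depreciation": (int, float),
    "officer_compensation": (int, float),
}


def validate_extraction(raw_json: str, fields: dict = TAX_RETURNS_FIELDS) -> list[str]:
    """Return a list of problems; an empty list means the output is usable."""
    data = json.loads(raw_json)
    problems = []
    for name, expected in fields.items():
        if name not in data:
            problems.append(f"missing field: {name}")
            continue
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    return problems
```

A non-empty result could send the document back for re-extraction or to human review instead of writing incomplete data to SQL.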

Flow 2 & 3 Overview

Flow 2: Teams Bot
Loan officers interact via Microsoft Teams. The bot parses intent (status check, gap query, report request), reads from the database, and responds instantly. Complex requests (full audit, report generation) are queued as jobs for Flow 3.
Trigger: Teams message
Flow 3: Job Runner
Polls the job queue every 5 minutes. Routes by job type: audit execution, report generation, or checklist compilation. Results are written back to the database and notifications sent to the requesting officer.
Trigger: cron (every 5 minutes)
Flows communicate through SQL, not direct calls. Flow 1 writes to documents + job_queue. Flow 2 reads from documents and writes to job_queue. Flow 3 polls job_queue and executes. Clean decoupling — each flow can be deployed, scaled, and debugged independently.
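
The SQL-mediated handoff between Flow 2 and Flow 3 can be illustrated with a small queue table. This sketch uses SQLite from the Python standard library as a stand-in for SQL Server; the `job_queue` column layout, the status values, and the handler functions are assumptions.

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create a hypothetical job_queue table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS job_queue (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               job_type TEXT NOT NULL,
               loan_id TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending',
               result TEXT
           )"""
    )


def enqueue_job(conn: sqlite3.Connection, job_type: str, loan_id: str) -> int:
    """Flow 2 side: queue a complex request instead of running it inline."""
    cur = conn.execute(
        "INSERT INTO job_queue (job_type, loan_id) VALUES (?, ?)",
        (job_type, loan_id),
    )
    conn.commit()
    return cur.lastrowid


# Flow 3 side: one placeholder handler per job type named on this page.
HANDLERS = {
    "audit": lambda loan_id: f"audit executed for {loan_id}",
    "report": lambda loan_id: f"report generated for {loan_id}",
    "checklist": lambda loan_id: f"checklist compiled for {loan_id}",
}


def run_pending_jobs(conn: sqlite3.Connection) -> int:
    """Process every pending job once; intended to be called by the 5-minute cron."""
    rows = conn.execute(
        "SELECT id, job_type, loan_id FROM job_queue "
        "WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for job_id, job_type, loan_id in rows:
        handler = HANDLERS.get(job_type)
        if handler is None:
            status, result = "failed", f"unknown job type: {job_type}"
        else:
            status, result = "done", handler(loan_id)
        conn.execute(
            "UPDATE job_queue SET status = ?, result = ? WHERE id = ?",
            (status, result, job_id),
        )
    conn.commit()
    return len(rows)
```

Neither side imports or calls the other: the bot only inserts rows and the runner only reads and updates them, which is what allows each flow to be deployed and scaled separately.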