Advanced Extraction Models

This guide covers advanced techniques for building robust extraction models that handle complex documents with high accuracy.

Using Variables for Dynamic Extraction

Variables let you inject context into extraction, making models reusable across different periods, entities, or scenarios without creating separate models.

Why use variables?

Without variables, you'd need a separate model for each audit period (FY2024, FY2025) or each entity. Variables let you create one model and change the context at runtime.

Common variable use cases:

Fiscal year end dates (closing_date)
Reference periods (reference_period, e.g., Q4 2024)
Entity identifiers (loan_reference, contract_number)
Currency or unit context

Creating variables

In your extraction model, click Add Variable (in the instruction text area)
Name your variable clearly (e.g., closing_date, reference_period)
Choose the data type (text, number, or date)
Click Create

Variable syntax

Reference variables in your instruction (meta prompt) or field explanations using double curly braces:

{{variable_name}}

Example instruction with variable:

Extract financial data from this loan amortization schedule. The reference date for all balances is {{closing_date}}. Find the row closest to this date and extract the relevant values.

Example field explanation with variable:

Find the outstanding principal balance as of {{closing_date}}. Look for the row where the payment date is closest to but not after the closing date.

Setting variable values

When you run an extraction or configure a Test of Details step:

The system shows all variables defined in the model
Enter the value for each variable (e.g., 31-12-2024 for a date)
Values are substituted into the instruction and field explanations before extraction runs
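
The substitution step can be pictured with a short Python sketch. The helper name `render_template` and the decision to raise an error for an undefined variable are assumptions made for illustration; the product's actual substitution logic is not described in this guide.

```python
import re

# Matches {{variable_name}} placeholders, allowing optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template: str, values: dict[str, str]) -> str:
    """Replace every {{name}} placeholder with its runtime value (hypothetical helper)."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            # Assumption: a missing value is treated as an error, not left blank.
            raise KeyError(f"No value provided for variable '{name}'")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)


explanation = "Find the outstanding principal balance as of {{closing_date}}."
rendered = render_template(explanation, {"closing_date": "31-12-2024"})
print(rendered)
```

The same model text can then be rendered with a different `closing_date` for each audit period, which is the reuse that variables are meant to provide.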

Variable examples

Variable	Purpose	Format	Example value
`closing_date`	Fiscal year end	DD-MM-YYYY	`31-12-2024`
`reference_period`	Period description	Text	`Q4 2024`
`entity_name`	Company/entity	Text	`Acme Corp`
`loan_reference`	Specific loan ID	Text	`LOAN-2024-001`
`currency`	Expected currency	Text	`EUR`

List/Table Field Extraction

The List data type extracts repeating rows from tables — invoice line items, transaction lists, amortization schedules, etc.

When to use List vs flat fields

Use a List field when:

The document has multiple rows/records with the same structure
Each row has the same columns (e.g., Description, Quantity, Amount)
You want to extract ALL rows, not just one value

Use flat fields when:

Each piece of information appears exactly once
Fields are distinct (Invoice Number, Invoice Date, Total Amount)

Quick test

Ask: "Will there be multiple records with the SAME fields, or just ONE value per field?"

Multiple records → List field
One value each → Flat fields

Defining a List field

When you add a List field, you define its columns (sub-fields):

Example: Invoice line items

List field name	Column	Type	Description
Line Items	Description	Text	Product or service description
	Quantity	Number	Number of units
	Unit Price	Number	Price per unit (HT/excluding tax)
	Amount	Number	Line total

Each column gets its own:

Name — The column header
Type — text, number, or date
Explanation — How to identify this column in the document
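
As a mental model, a List field is a named collection of column definitions. The sketch below expresses the invoice example in Python; the `Column` and `ListField` classes are hypothetical and only mirror the three properties listed above, not the product's internal storage format.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str         # the column header
    type: str         # "text", "number", or "date"
    explanation: str  # how to identify this column in the document


@dataclass
class ListField:
    name: str
    columns: list[Column] = field(default_factory=list)


line_items = ListField(
    name="Line Items",
    columns=[
        Column("Description", "text", "Product or service description"),
        Column("Quantity", "number", "Number of units"),
        Column("Unit Price", "number", "Price per unit (HT/excluding tax)"),
        Column("Amount", "number", "Line total"),
    ],
)

for column in line_items.columns:
    print(f"{column.name}: {column.type}")
```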

Writing column explanations

Be specific about label variations and formats:

Good column explanation (Description):

Product or service description. May be labeled "Description", "Désignation", "Item", "Article", or "Libellé". Extract the full text as shown.

Good column explanation (Amount):

Line total for this item. Look for "Montant HT", "Amount", "Total ligne", or the rightmost numeric column. Number only with comma decimal separator (e.g., 1234,56).

List extraction limits

List extraction works best for tables with fewer than 50 rows. Tables of 50-100 rows may work but need careful testing, and quality decreases beyond that.

Row count	Recommendation
< 50 rows	✅ Use List extraction
50-100 rows	⚠️ Test carefully, may work
> 100 rows	❌ Use OCR to Excel instead

For larger tables, use OCR to Excel to convert the table to a spreadsheet instead.
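
If you triage documents before choosing a method, the thresholds in the table can be encoded as a small rule. This is only a sketch: the function name is hypothetical, and the cut-offs are the approximate guidance above rather than hard product limits.

```python
def recommend_extraction_method(row_count: int) -> str:
    """Map an estimated table row count to the recommendation in the table above."""
    if row_count < 50:
        return "List extraction"
    if row_count <= 100:
        return "List extraction (test carefully)"
    return "OCR to Excel"


for rows in (12, 75, 400):
    print(rows, "->", recommend_extraction_method(rows))
```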

Using List totals in formulas

In Test of Details analysis, you can reference numeric totals from List fields:

[DocType: Total of Line Items - Amount]

This automatically sums the Amount column across all extracted rows.
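
To make the total concrete, the following Python sketch reproduces the same sum outside the product. The rows are made-up sample data, and `parse_amount` is a hypothetical helper that assumes amounts arrive as comma-decimal strings, as described under Number format.

```python
from decimal import Decimal


def parse_amount(value: str) -> Decimal:
    """Convert a comma-decimal string such as '1234,56' to a Decimal."""
    return Decimal(value.replace(",", "."))


# Made-up extracted rows, for illustration only.
line_items = [
    {"Description": "Consulting", "Quantity": "2", "Amount": "1234,56"},
    {"Description": "Travel", "Quantity": "1", "Amount": "200,00"},
    {"Description": "Discount", "Quantity": "1", "Amount": "-34,56"},
]

# Equivalent of "Total of Line Items - Amount": sum the Amount column over all rows.
total = sum((parse_amount(row["Amount"]) for row in line_items), Decimal("0"))
print(total)
```

Using `Decimal` rather than `float` avoids rounding drift when the total is later compared with a ledger amount.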

Extracting from Loan Amortization Schedules

Loan amortization schedules present specific challenges: they contain many rows at different dates, and you need to identify the correct row based on your audit date.

The challenge

A typical amortization schedule shows:

Monthly or periodic payment rows
Running balances before and after each payment
Interest and principal breakdown per payment

You need to extract values as of a specific date, and the balance to use depends on where the closing date falls relative to the payment dates:

Closing date vs payment date	What you typically need
Closing date falls on a payment date	Post-amortization balance (after that payment)
Closing date between two payment dates	Post-amortization balance from the preceding payment

Pre-amortization vs Post-amortization

This is the most critical distinction to understand:

Balance type	Definition	When to use
Pre-amortization	Balance before a payment is applied	Rarely — only if you need the opening balance for a period
Post-amortization	Balance after a payment is applied	Most common — the actual debt remaining at closing

Example: Closing date is 31-12-2024. The schedule shows:

Row for 15-12-2024: Balance after payment = 100,000
Row for 15-01-2025: Balance after payment = 98,000

→ You want 100,000 (post-amortization from the last payment before closing).
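
The row-selection rule can be written down explicitly, which is handy for checking extracted values by hand. The sketch below is not how the AI works internally; the `schedule` structure and its key names are assumptions for illustration.

```python
from datetime import date, datetime


def parse_date(text: str) -> date:
    """Parse a DD-MM-YYYY string."""
    return datetime.strptime(text, "%d-%m-%Y").date()


def balance_at_closing(schedule: list[dict[str, str]], closing_date: str):
    """Return the post-amortization balance of the last payment on or before closing_date."""
    closing = parse_date(closing_date)
    eligible = [row for row in schedule if parse_date(row["payment_date"]) <= closing]
    if not eligible:
        return None  # no payment has occurred yet at the closing date
    latest = max(eligible, key=lambda row: parse_date(row["payment_date"]))
    return latest["balance_after"]


# The two rows from the example above, in the comma-decimal number format.
schedule = [
    {"payment_date": "15-12-2024", "balance_after": "100000,00"},
    {"payment_date": "15-01-2025", "balance_after": "98000,00"},
]

print(balance_at_closing(schedule, "31-12-2024"))
```

A closing date that falls exactly on a payment date selects that payment's own row, which matches the first case in the table above.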

Key decisions before building

Question	Options	Impact
Multiple loans?	Single / Multiple in document	May need loan identifier variable

Let the AI handle timing logic

You don't need to decide pre vs post-amortization upfront. The instruction template below lets the AI determine the correct row based on the closing date.

Recommended model structure

Field	Type	Purpose
`closing_date`	Variable	Reference date for extraction
`loan_reference`	Variable (if needed)	Which loan if multiple
Principal Balance	Number	Outstanding principal at closing
Interest Rate	Number	Annual rate
Next Payment Date	Date	First payment after closing
Remaining Payments	Number	Payments left
Monthly Payment	Number	Regular payment amount

Writing the instruction (meta prompt)

Extract loan information from this amortization schedule.

Reference date: {{closing_date}}
Loan identifier (if multiple schedules): {{loan_reference}}

Determine the outstanding balance at the reference date by analyzing the schedule:

- If {{closing_date}} falls exactly ON a payment date → extract the post-amortization balance (after that payment)
- If {{closing_date}} falls BETWEEN two payment dates → extract the post-amortization balance from the last payment before the reference date

The goal is to find what the borrower still owes as of {{closing_date}}.

Field explanation examples

Principal Balance:

Determine the outstanding principal as of {{closing_date}}:

If {{closing_date}} matches a payment date exactly → use the balance AFTER that payment (post-amortization)

If {{closing_date}} falls between payments → use the balance AFTER the most recent payment before {{closing_date}}

Look for columns labeled: "Capital restant", "CRD", "Remaining Balance", "Solde", "Outstanding Principal", or "Balance après échéance".

Format: Number only with comma decimal separator (e.g., 125000,00).

Next Payment Date:

Find the first payment date that occurs after {{closing_date}}.

Look for columns labeled: "Date échéance", "Payment Date", "Date", or "Échéance".

Format: DD-MM-YYYY (e.g., 15-01-2025).
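
The Next Payment Date rule is a strict "after" comparison, as the following sketch shows. The function name is hypothetical, and returning an empty string when no later payment exists is an assumption chosen to match the "return an empty value" convention used elsewhere in this guide.

```python
from datetime import datetime

DATE_FORMAT = "%d-%m-%Y"


def next_payment_date(payment_dates: list[str], closing_date: str) -> str:
    """Return the earliest payment date strictly after closing_date, or '' if none."""
    closing = datetime.strptime(closing_date, DATE_FORMAT)
    parsed = [datetime.strptime(d, DATE_FORMAT) for d in payment_dates]
    later = [d for d in parsed if d > closing]
    return min(later).strftime(DATE_FORMAT) if later else ""


dates = ["15-12-2024", "15-01-2025", "15-02-2025"]
print(next_payment_date(dates, "31-12-2024"))
```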

Handling multiple loans in one document

If a document contains multiple amortization schedules:

Add a loan_reference variable
Include it in your instruction and field explanations:

From the amortization schedule for loan {{loan_reference}}, find the row where...

Format Requirements for Accurate Extraction

The AI returns values in fixed formats. Stating these formats in your field explanations keeps results consistent:

Date format

Required format: DD-MM-YYYY

Document shows	AI extracts
December 31, 2024	`31-12-2024`
2024-12-31	`31-12-2024`
31/12/24	`31-12-2024`

In explanations, specify:

Format: DD-MM-YYYY (e.g., 31-12-2024)
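
If you post-process or validate results outside the product, the conversions in the table can be reproduced with Python's standard `datetime` module. This is a minimal sketch: the list of accepted input formats is an assumption, and `%B` (full month name) expects English names under the default locale.

```python
from datetime import datetime

# Input formats to try, in order. The two-digit-year form comes before the
# four-digit one so that "31/12/24" is read as 2024.
INPUT_FORMATS = ["%d-%m-%Y", "%Y-%m-%d", "%d/%m/%y", "%d/%m/%Y", "%B %d, %Y"]


def normalize_date(raw: str) -> str:
    """Return the date as DD-MM-YYYY, or '' if no known format matches."""
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%d-%m-%Y")
        except ValueError:
            continue
    return ""


for shown in ("December 31, 2024", "2024-12-31", "31/12/24"):
    print(shown, "->", normalize_date(shown))
```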

Number format

Required format: Digits only (with a leading minus sign for negative values), comma as decimal separator

Document shows	AI extracts
€1,234.56	`1234,56`
1 234,56 €	`1234,56`
$1,234.56	`1234,56`
-500.00	`-500,00`

In explanations, specify:

Number only with comma decimal separator (e.g., 1234,56). No currency symbols, thousand separators, or units.
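
The conversions in the table can be approximated with a small heuristic, sketched below in Python. The function is hypothetical and treats the last comma or dot as the decimal separator, so it would misread a value that contains only a thousands separator (for example `1,234` meaning one thousand two hundred thirty-four).

```python
import re


def normalize_number(raw: str) -> str:
    """Reduce a displayed amount to digits with a comma decimal separator."""
    # Drop currency symbols, spaces, and units; keep digits, separators, and minus.
    cleaned = re.sub(r"[^\d,.\-]", "", raw)
    negative = cleaned.startswith("-")
    cleaned = cleaned.replace("-", "")
    # Heuristic: the last comma or dot is the decimal separator.
    last_sep = max(cleaned.rfind(","), cleaned.rfind("."))
    if last_sep == -1:
        integer_part, decimal_part = cleaned, ""
    else:
        integer_part = re.sub(r"[,.]", "", cleaned[:last_sep])
        decimal_part = cleaned[last_sep + 1:]
    result = integer_part + ("," + decimal_part if decimal_part else "")
    return ("-" if negative else "") + result


for shown in ("€1,234.56", "1 234,56 €", "$1,234.56", "-500.00"):
    print(shown, "->", normalize_number(shown))
```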

Text format

Text is extracted verbatim with leading/trailing whitespace trimmed.

In explanations, specify any requirements:

Extract the company name exactly as shown. Include legal form (SA, SARL, etc.) if present.

Tips for Robust Field Explanations

Include synonyms and label variations

Documents use different labels for the same data. List common variations:

Invoice number (aka: invoice no., bill no., reference number, facture n°, numéro de facture). Prefer the alphanumeric code near the "Invoice" header.

Specify location hints

Help the AI find the right value when multiple similar values exist:

Total amount including VAT. Usually found at the bottom of the invoice, near labels like "Total TTC", "Amount Due", or "Balance Due". This is typically the largest amount on the document.

Handle edge cases explicitly

Describe what to do when expected data is missing or ambiguous:

If no due date is shown, return an empty value. Do not guess or calculate from invoice date.

Use the document's language

If documents are in French, write explanations in French for better matching:

Montant total TTC. Chercher près des étiquettes "Total TTC", "Net à payer", ou "Montant dû".

Debugging Extraction Issues

Step 1: Check grounding

Click any extracted value to see where it came from in the document:

Correct location, wrong value → Format issue or description needs refinement
Wrong location → Description isn't specific enough, or label variations missing

Step 2: Review field explanations

For each problematic field, ask:

Would a human following this description find the right value?
Are all label variations covered?
Is the location in the document described?
Are edge cases handled?

Step 3: Test on diverse samples

Run on 5+ documents with different:

Layouts and formats
Languages (if applicable)
Scanned vs digital PDFs
Single page vs multi-page

Step 4: Iterate incrementally

Refine one field at a time:

Identify the failing field
Update its explanation
Re-test on the problem document
Verify it still works on other documents
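
The last step, verifying that a fix does not break other documents, is easier with a small regression check. The sketch below assumes you keep a hand-verified set of expected values per document and can export the extracted values as plain dictionaries; the file names, field values, and function name are all made up for illustration.

```python
def failing_fields(
    expected: dict[str, dict[str, str]],
    actual: dict[str, dict[str, str]],
) -> dict[str, list[str]]:
    """Map each document to the fields whose extracted value differs from the expected one."""
    failures: dict[str, list[str]] = {}
    for doc, expected_fields in expected.items():
        extracted = actual.get(doc, {})
        wrong = [name for name, value in expected_fields.items() if extracted.get(name) != value]
        if wrong:
            failures[doc] = wrong
    return failures


# Hand-verified values (made-up sample data).
expected = {
    "invoice_a.pdf": {"Invoice Number": "F-1001", "Due Date": "31-01-2025"},
    "invoice_b.pdf": {"Invoice Number": "F-1002", "Due Date": ""},
}
# Values from the latest extraction run (made-up sample data).
actual = {
    "invoice_a.pdf": {"Invoice Number": "F-1001", "Due Date": "31-01-2025"},
    "invoice_b.pdf": {"Invoice Number": "F-1002", "Due Date": "15-02-2025"},
}

report = failing_fields(expected, actual)
print(report)
```

Re-running the same check after each explanation change shows at a glance whether a fix for one document introduced a regression in another.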

Common issues and solutions

Problem	Likely cause	Solution
Value not found	Label variations	Add synonyms to explanation
Wrong value extracted	Multiple similar values	Add location hints
Wrong row (in tables)	Unclear row identification	Make date/row criteria more specific
Format errors	Unexpected format in document	Describe expected format variations
Inconsistent results	Ambiguous description	Make explanation more specific

AI-Generated vs Manual Models

When to use AI generation

Starting a new model for a common document type (invoices, contracts)
Quickly prototyping before refinement
Learning what fields are typically extracted from a document type

When to build manually

Highly specific extraction requirements
Non-standard document formats
Need precise control over field explanations
Building on an existing model

Refining AI-generated models

AI generation is a starting point. Always:

Review all generated fields
Test on real documents
Adjust explanations for your specific document formats
Add label variations you discover
Remove unnecessary fields

Multi-Document Patterns

Matching related documents

In Test of Details, you often match multiple document types to transactions:

Transaction field	Invoice	Delivery Note	Payment
Reference	Invoice Number	Delivery Reference	Payment Reference
Amount	Total Amount	—	Payment Amount
Date	Invoice Date	Delivery Date	Payment Date

Create separate extraction models for each document type, then use Test of Details to match them.

Cross-document validation

Use variables to ensure consistency:

Set an expected_amount variable from the transaction data
Reference it in the extraction instruction: Verify total matches {{expected_amount}}

Best Practices Summary

Use variables for dates, periods, and identifiers that change between extractions
Write specific explanations with label variations and location hints
Choose List vs flat fields based on whether data repeats
Respect format requirements — dates as DD-MM-YYYY, numbers with comma decimals
Test on diverse documents before running large batches
Iterate incrementally — fix one field at a time
Use AI generation as a starting point, then refine
