Advanced Extraction Models
This guide covers advanced techniques for building robust extraction models that handle complex documents with high accuracy.
Using Variables for Dynamic Extraction
Variables let you inject context into extraction, making models reusable across different periods, entities, or scenarios without creating separate models.
Why use variables?
Without variables, you'd need separate models for each audit period (FY2024, FY2025) or each entity. Variables let you create one model and change the context at runtime.
Common variable use cases:
- Fiscal year end dates (closing_date)
- Reference periods (Q4 2024)
- Entity identifiers (loan_reference, contract_number)
- Currency or unit context
Creating variables
- In your extraction model, click Add Variable (in the instruction text area)
- Name your variable clearly (e.g., closing_date, reference_period)
- Choose the data type (text, number, or date)
- Click Create
Variable syntax
Reference variables in your instruction (meta prompt) or field explanations using double curly braces:
{{variable_name}}
Example instruction with variable:
Extract financial data from this loan amortization schedule. The reference date for all balances is {{closing_date}}. Find the row closest to this date and extract the relevant values.
Example field explanation with variable:
Find the outstanding principal balance as of {{closing_date}}. Look for the row where the payment date is closest to but not after the closing date.
Setting variable values
When you run an extraction or configure a Test of Details step:
- The system shows all variables defined in the model
- Enter the value for each variable (e.g., 31-12-2024 for a date)
- Values are substituted into the instruction and field explanations before extraction runs
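To make the substitution step concrete, here is a minimal sketch of how {{variable_name}} placeholders could be replaced before extraction runs. The function name render_template and the choice to raise an error on a missing value are assumptions for illustration; the product's internal implementation is not documented here.

```python
import re


def render_template(template: str, values: dict[str, str]) -> str:
    """Replace every {{name}} placeholder with its value from `values`."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            # Assumption: fail loudly rather than run with an unresolved placeholder.
            raise KeyError(f"No value provided for variable '{name}'")
        return values[name]

    # Matches {{closing_date}} and also tolerates spaces, e.g. {{ closing_date }}.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


instruction = (
    "Extract financial data from this loan amortization schedule. "
    "The reference date for all balances is {{closing_date}}."
)
rendered = render_template(instruction, {"closing_date": "31-12-2024"})
print(rendered)
```

The same function would apply to field explanations, since they use the same double-curly-brace syntax.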
Variable examples
| Variable | Purpose | Format | Example value |
|---|---|---|---|
| closing_date | Fiscal year end | DD-MM-YYYY | 31-12-2024 |
| reference_period | Period description | Text | Q4 2024 |
| entity_name | Company/entity | Text | Acme Corp |
| loan_reference | Specific loan ID | Text | LOAN-2024-001 |
| currency | Expected currency | Text | EUR |
List/Table Field Extraction
The List data type extracts repeating rows from tables — invoice line items, transaction lists, amortization schedules, etc.
When to use List vs flat fields
Use a List field when:
- The document has multiple rows/records with the same structure
- Each row has the same columns (e.g., Description, Quantity, Amount)
- You want to extract ALL rows, not just one value
Use flat fields when:
- Each piece of information appears exactly once
- Fields are distinct (Invoice Number, Invoice Date, Total Amount)
Ask: "Will there be multiple records with the SAME fields, or just ONE value per field?"
- Multiple records → List field
- One value each → Flat fields
Defining a List field
When you add a List field, you define its columns (sub-fields):
Example: Invoice line items
| List field name | Column | Type | Description |
|---|---|---|---|
| Line Items | Description | Text | Product or service description |
| | Quantity | Number | Number of units |
| | Unit Price | Number | Price per unit (HT/excluding tax) |
| | Amount | Number | Line total |
Each column gets its own:
- Name — The column header
- Type — text, number, or date
- Explanation — How to identify this column in the document
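As a way to picture the structure, the List field and its columns can be thought of as a small nested record. The sketch below is purely illustrative: the class names Column and ListField are hypothetical and do not reflect how the product stores models.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str         # the column header
    type: str         # "text", "number", or "date"
    explanation: str  # how to identify this column in the document


@dataclass
class ListField:
    name: str
    columns: list[Column] = field(default_factory=list)


# The invoice line items example from the table above.
line_items = ListField(
    name="Line Items",
    columns=[
        Column("Description", "text", "Product or service description"),
        Column("Quantity", "number", "Number of units"),
        Column("Unit Price", "number", "Price per unit (HT/excluding tax)"),
        Column("Amount", "number", "Line total"),
    ],
)
print([column.name for column in line_items.columns])
```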
Writing column explanations
Be specific about label variations and formats:
Good column explanation (Description):
Product or service description. May be labeled "Description", "Désignation", "Item", "Article", or "Libellé". Extract the full text as shown.
Good column explanation (Amount):
Line total for this item. Look for "Montant HT", "Amount", "Total ligne", or the rightmost numeric column. Number only with comma decimal separator (e.g., 1234,56).
List extraction limits
List extraction works best for tables with fewer than about 50 rows and can still work up to about 100 rows. Quality decreases for larger tables.
| Row count | Recommendation |
|---|---|
| < 50 rows | ✅ Use List extraction |
| 50-100 rows | ⚠️ Test carefully, may work |
| > 100 rows | ❌ Use OCR to Excel instead |
For larger tables, use OCR to Excel to convert them to spreadsheet format.
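If you triage documents before choosing a method, the thresholds in the table can be written as a simple rule. This is a sketch only: the function name and return labels are hypothetical, and the row counts are the approximate guidance from the table rather than hard limits.

```python
def recommend_extraction_method(row_count: int) -> str:
    """Map a table's row count to the recommendation in the table above."""
    if row_count < 50:
        return "list"               # use List extraction
    if row_count <= 100:
        return "list_with_testing"  # may work, test carefully
    return "ocr_to_excel"           # convert to a spreadsheet instead


for rows in (12, 75, 400):
    print(rows, recommend_extraction_method(rows))
```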
Using List totals in formulas
In Test of Details analysis, you can reference numeric totals from List fields:
[DocType: Total of Line Items - Amount]
This automatically sums the Amount column across all extracted rows.
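The total is equivalent to parsing each extracted Amount value and adding the results. The sketch below assumes rows arrive as dictionaries of strings in the comma-decimal number format described in this guide; the helper names and sample rows are hypothetical.

```python
from decimal import Decimal


def parse_amount(value: str) -> Decimal:
    """Convert a comma-decimal string such as '1234,56' to a Decimal."""
    return Decimal(value.replace(",", "."))


def total_of_column(rows: list[dict[str, str]], column: str) -> Decimal:
    """Sum one numeric column across all extracted rows, skipping empty cells."""
    return sum(
        (parse_amount(row[column]) for row in rows if row.get(column)),
        Decimal("0"),
    )


rows = [
    {"Description": "Consulting", "Amount": "1234,56"},
    {"Description": "Travel", "Amount": "200,00"},
    {"Description": "Discount", "Amount": "-34,56"},
]
total = total_of_column(rows, "Amount")
print(total)
```

Decimal is used instead of float so that monetary sums do not pick up binary rounding errors.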
Extracting from Loan Amortization Schedules
Loan amortization schedules present specific challenges: they contain many rows at different dates, and you need to identify the correct row based on your audit date.
The challenge
A typical amortization schedule shows:
- Monthly or periodic payment rows
- Running balances before and after each payment
- Interest and principal breakdown per payment
You need to extract values as of a specific date, but which balance to use depends on where that date falls relative to the payment dates:
| Closing date vs payment date | What you typically need |
|---|---|
| Closing date falls on a payment date | Post-amortization balance (after that payment) |
| Closing date between two payment dates | Post-amortization balance from the preceding payment |
Pre-amortization vs Post-amortization
This is the most critical decision:
| Balance type | Definition | When to use |
|---|---|---|
| Pre-amortization | Balance before a payment is applied | Rarely — only if you need the opening balance for a period |
| Post-amortization | Balance after a payment is applied | Most common — the actual debt remaining at closing |
Example: Closing date is 31-12-2024. The schedule shows:
- Row for 15-12-2024: Balance after payment = 100,000
- Row for 15-01-2025: Balance after payment = 98,000
→ You want 100,000 (post-amortization from the last payment before closing).
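The row selection rule in this example can be sketched as "take the latest payment on or before the closing date and read its balance after payment". The field names payment_date and balance_after_payment are hypothetical, and the balances are written in the comma-decimal format used elsewhere in this guide.

```python
from datetime import date, datetime


def parse_date(value: str) -> date:
    """Parse a DD-MM-YYYY string."""
    return datetime.strptime(value, "%d-%m-%Y").date()


def balance_at_closing(schedule: list[dict[str, str]], closing_date: str):
    """Return the post-amortization balance of the last payment on or before closing."""
    closing = parse_date(closing_date)
    eligible = [row for row in schedule if parse_date(row["payment_date"]) <= closing]
    if not eligible:
        return None  # closing date is before the first payment in the schedule
    latest = max(eligible, key=lambda row: parse_date(row["payment_date"]))
    return latest["balance_after_payment"]


schedule = [
    {"payment_date": "15-12-2024", "balance_after_payment": "100000,00"},
    {"payment_date": "15-01-2025", "balance_after_payment": "98000,00"},
]
print(balance_at_closing(schedule, "31-12-2024"))
```

Using "on or before" covers both cases from the table: a closing date that falls exactly on a payment date, and one that falls between two payments.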
Key decisions before building
| Question | Options | Impact |
|---|---|---|
| Multiple loans? | Single / Multiple in document | May need loan identifier variable |
You don't need to decide pre vs post-amortization upfront. The instruction template below lets the AI determine the correct row based on the closing date.
Recommended model structure
| Field | Type | Purpose |
|---|---|---|
| closing_date | Variable | Reference date for extraction |
| loan_reference | Variable (if needed) | Which loan if multiple |
| Principal Balance | Number | Outstanding principal at closing |
| Interest Rate | Number | Annual rate |
| Next Payment Date | Date | First payment after closing |
| Remaining Payments | Number | Payments left |
| Monthly Payment | Number | Regular payment amount |
Writing the instruction (meta prompt)
Extract loan information from this amortization schedule.
Reference date: {{closing_date}}
Loan identifier (if multiple schedules): {{loan_reference}}
Determine the outstanding balance at the reference date by analyzing the schedule:
- If {{closing_date}} falls exactly ON a payment date → extract the post-amortization balance (after that payment)
- If {{closing_date}} falls BETWEEN two payment dates → extract the post-amortization balance from the last payment before the reference date
The goal is to find what the borrower still owes as of {{closing_date}}.
Field explanation examples
Principal Balance:
Determine the outstanding principal as of {{closing_date}}:
- If {{closing_date}} matches a payment date exactly → use the balance AFTER that payment (post-amortization)
- If {{closing_date}} falls between payments → use the balance AFTER the most recent payment before {{closing_date}}
Look for columns labeled: "Capital restant", "CRD", "Remaining Balance", "Solde", "Outstanding Principal", or "Balance après échéance".
Format: Number only with comma decimal separator (e.g., 125000,00).
Next Payment Date:
Find the first payment date that occurs after {{closing_date}}.
Look for columns labeled: "Date échéance", "Payment Date", "Date", or "Échéance".
Format: DD-MM-YYYY (e.g., 15-01-2025).
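When you check a Next Payment Date result by hand, the rule is "the earliest payment date strictly after the closing date". A minimal sketch, assuming payment dates are already in DD-MM-YYYY format; the function name and sample dates are hypothetical.

```python
from datetime import datetime

DATE_FORMAT = "%d-%m-%Y"


def next_payment_date(payment_dates: list[str], closing_date: str):
    """Return the first payment date strictly after the closing date, or None."""
    closing = datetime.strptime(closing_date, DATE_FORMAT)
    parsed = [datetime.strptime(value, DATE_FORMAT) for value in payment_dates]
    later = [value for value in parsed if value > closing]
    return min(later).strftime(DATE_FORMAT) if later else None


dates = ["15-12-2024", "15-01-2025", "15-02-2025"]
print(next_payment_date(dates, "31-12-2024"))
```

A payment that falls exactly on the closing date is not returned, because the field asks for the first payment after closing.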
Handling multiple loans in one document
If a document contains multiple amortization schedules:
- Add a loan_reference variable
- Include it in your instruction and field explanations:
From the amortization schedule for loan {{loan_reference}}, find the row where...
Format Requirements for Accurate Extraction
The AI extracts values in specific formats. Understanding these ensures accuracy:
Date format
Required format: DD-MM-YYYY
| Document shows | AI extracts |
|---|---|
| December 31, 2024 | 31-12-2024 |
| 2024-12-31 | 31-12-2024 |
| 31/12/24 | 31-12-2024 |
In explanations, specify:
Format: DD-MM-YYYY (e.g., 31-12-2024)
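If you post-process or spot-check extracted dates, the conversions in the table can be reproduced with a small helper. This sketch assumes English month names, day-first ordering for slash-separated dates, and a fixed list of input formats chosen for illustration; it is not the product's own normalization logic.

```python
from datetime import datetime

# Assumed input formats, tried in order. Slash dates are read day-first.
KNOWN_FORMATS = ["%B %d, %Y", "%Y-%m-%d", "%d/%m/%y", "%d/%m/%Y", "%d-%m-%Y"]


def normalize_date(raw: str):
    """Return the date as DD-MM-YYYY, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%d-%m-%Y")
        except ValueError:
            continue  # try the next candidate format
    return None


for sample in ("December 31, 2024", "2024-12-31", "31/12/24"):
    print(sample, "->", normalize_date(sample))
```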
Number format
Required format: Digits only, comma as decimal separator
| Document shows | AI extracts |
|---|---|
| €1,234.56 | 1234,56 |
| 1 234,56 € | 1234,56 |
| $1,234.56 | 1234,56 |
| -500.00 | -500,00 |
In explanations, specify:
Number only with comma decimal separator (e.g., 1234,56). No currency symbols, thousand separators, or units.
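The conversions in the table can be sketched as follows. The key assumption, which does not hold for every document, is that the last comma or dot in the value is the decimal separator; a value such as 1,234 with no decimals would be misread under this rule. The function name is hypothetical.

```python
import re


def normalize_number(raw: str) -> str:
    """Return digits with a comma decimal separator, e.g. '1234,56'."""
    # Drop currency symbols, spaces (including non-breaking), and units.
    cleaned = re.sub(r"[^\d,.\-]", "", raw)
    sign = "-" if cleaned.startswith("-") else ""
    body = cleaned.replace("-", "")
    # Assumption: the last ',' or '.' is the decimal separator.
    last_separator = max(body.rfind(","), body.rfind("."))
    if last_separator == -1:
        return sign + body
    integer_part = re.sub(r"[,.]", "", body[:last_separator])
    fraction_part = body[last_separator + 1:]
    return f"{sign}{integer_part},{fraction_part}"


for sample in ("€1,234.56", "1 234,56 €", "$1,234.56", "-500.00"):
    print(sample, "->", normalize_number(sample))
```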
Text format
Text is extracted verbatim with leading/trailing whitespace trimmed.
In explanations, specify any requirements:
Extract the company name exactly as shown. Include legal form (SA, SARL, etc.) if present.
Tips for Robust Field Explanations
Include synonyms and label variations
Documents use different labels for the same data. List common variations:
Invoice number (aka: invoice no., bill no., reference number, facture n°, numéro de facture). Prefer the alphanumeric code near the "Invoice" header.
Specify location hints
Help the AI find the right value when multiple similar values exist:
Total amount including VAT. Usually found at the bottom of the invoice, near labels like "Total TTC", "Amount Due", or "Balance Due". This is typically the largest amount on the document.
Handle edge cases explicitly
Describe what to do when expected data is missing or ambiguous:
If no due date is shown, return an empty value. Do not guess or calculate from invoice date.
Use the document's language
If documents are in French, write explanations in French for better matching:
Montant total TTC. Chercher près des étiquettes "Total TTC", "Net à payer", ou "Montant dû".
Debugging Extraction Issues
Step 1: Check grounding
Click any extracted value to see where it came from in the document:
- Correct location, wrong value → Format issue or description needs refinement
- Wrong location → Description isn't specific enough, or label variations missing
Step 2: Review field explanations
For each problematic field, ask:
- Would a human following this description find the right value?
- Are all label variations covered?
- Is the location in the document described?
- Are edge cases handled?
Step 3: Test on diverse samples
Run on 5+ documents with different:
- Layouts and formats
- Languages (if applicable)
- Scanned vs digital PDFs
- Single page vs multi-page
Step 4: Iterate incrementally
Refine one field at a time:
- Identify the failing field
- Update its explanation
- Re-test on the problem document
- Verify it still works on other documents
Common issues and solutions
| Problem | Likely cause | Solution |
|---|---|---|
| Value not found | Label variations | Add synonyms to explanation |
| Wrong value extracted | Multiple similar values | Add location hints |
| Wrong row (in tables) | Unclear row identification | Make date/row criteria more specific |
| Format errors | Unexpected format in document | Describe expected format variations |
| Inconsistent results | Ambiguous description | Make explanation more specific |
AI-Generated vs Manual Models
When to use AI generation
- Starting a new model for a common document type (invoices, contracts)
- Quickly prototyping before refinement
- Learning what fields are typically extracted from a document type
When to build manually
- Highly specific extraction requirements
- Non-standard document formats
- Need precise control over field explanations
- Building on an existing model
Refining AI-generated models
AI generation is a starting point. Always:
- Review all generated fields
- Test on real documents
- Adjust explanations for your specific document formats
- Add label variations you discover
- Remove unnecessary fields
Multi-Document Patterns
Matching related documents
In Test of Details, you often match multiple document types to transactions:
| Transaction field | Invoice | Delivery Note | Payment |
|---|---|---|---|
| Reference | Invoice Number | Delivery Reference | Payment Reference |
| Amount | Total Amount | — | Payment Amount |
| Date | Invoice Date | Delivery Date | Payment Date |
Create separate extraction models for each document type, then use Test of Details to match them.
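Conceptually, the matching pairs each transaction with the extracted document that carries the same reference. The sketch below only illustrates that idea and is not how Test of Details is implemented; the function name and the sample references are hypothetical, while the field names come from the table above.

```python
def match_by_reference(transactions, documents, reference_field):
    """Map each transaction reference to the document with the same reference, or None."""
    index = {
        document[reference_field]: document
        for document in documents
        if document.get(reference_field)
    }
    return {tx["Reference"]: index.get(tx["Reference"]) for tx in transactions}


transactions = [
    {"Reference": "INV-001", "Amount": "1234,56"},
    {"Reference": "INV-002", "Amount": "500,00"},
]
invoices = [{"Invoice Number": "INV-001", "Total Amount": "1234,56"}]

matches = match_by_reference(transactions, invoices, "Invoice Number")
print(matches)
```

A transaction with no matching document maps to None, which makes missing supporting documents easy to list.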
Cross-document validation
Use variables to ensure consistency:
- Set the expected_amount variable from transaction data
- Reference it in the extraction:
Verify total matches {{expected_amount}}
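If you also want to confirm the match outside the extraction, the comparison is a numeric equality check on the two comma-decimal values. The sketch below assumes a tolerance of 0,01 to absorb rounding; both the tolerance and the function names are hypothetical choices, not product behavior.

```python
from decimal import Decimal


def to_decimal(value: str) -> Decimal:
    """Convert a comma-decimal string such as '1234,56' to a Decimal."""
    return Decimal(value.replace(",", "."))


def amounts_match(extracted_total: str, expected_amount: str,
                  tolerance: Decimal = Decimal("0.01")) -> bool:
    """Return True when the two amounts differ by no more than the tolerance."""
    difference = abs(to_decimal(extracted_total) - to_decimal(expected_amount))
    return difference <= tolerance


print(amounts_match("1234,56", "1234,56"))
print(amounts_match("1234,56", "1250,00"))
```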
Best Practices Summary
- Use variables for dates, periods, and identifiers that change between extractions
- Write specific explanations with label variations and location hints
- Choose List vs flat fields based on whether data repeats
- Respect format requirements — dates as DD-MM-YYYY, numbers with comma decimals
- Test on diverse documents before running large batches
- Iterate incrementally — fix one field at a time
- Use AI generation as a starting point, then refine