
Advanced Extraction Models

This guide covers advanced techniques for building robust extraction models that handle complex documents with high accuracy.

Using Variables for Dynamic Extraction

Variables let you inject context into extraction, making models reusable across different periods, entities, or scenarios without creating separate models.

Why use variables?

Without variables, you'd need separate models for each audit period (FY2024, FY2025) or each entity. Variables let you create one model and change the context at runtime.

Common variable use cases:

  • Fiscal year end dates (closing_date)
  • Reference periods (Q4 2024)
  • Entity identifiers (loan_reference, contract_number)
  • Currency or unit context

Creating variables

  1. In your extraction model, click Add Variable (in the instruction text area)
  2. Name your variable clearly (e.g., closing_date, reference_period)
  3. Choose the data type (text, number, or date)
  4. Click Create

Variable syntax

Reference variables in your instruction (meta prompt) or field explanations using double curly braces:

{{variable_name}}

Example instruction with variable:

Extract financial data from this loan amortization schedule. The reference date for all balances is {{closing_date}}. Find the row closest to this date and extract the relevant values.

Example field explanation with variable:

Find the outstanding principal balance as of {{closing_date}}. Look for the row where the payment date is closest to but not after the closing date.

Setting variable values

When you run an extraction or configure a Test of Details step:

  1. The system shows all variables defined in the model
  2. Enter the value for each variable (e.g., 31-12-2024 for a date)
  3. Values are substituted into the instruction and field explanations before extraction runs
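
The substitution step can be pictured with a short Python sketch. It is an illustration only, not the product's actual implementation: the function name `render_instruction` and the decision to raise an error when a value is missing are assumptions.

```python
import re

# Hypothetical helper: illustrates {{variable}} substitution, not the product's real code.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_instruction(template: str, values: dict[str, str]) -> str:
    """Replace every {{name}} placeholder in `template` with its value."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            # Assumption: a missing value is treated as an error, not as empty text.
            raise KeyError(f"No value provided for variable '{name}'")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)


instruction = "The reference date for all balances is {{closing_date}}."
print(render_instruction(instruction, {"closing_date": "31-12-2024"}))
```

The same values would apply to both the instruction and every field explanation, which is why one model can serve several periods.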

Variable examples

Variable | Purpose | Format | Example value
closing_date | Fiscal year end | DD-MM-YYYY | 31-12-2024
reference_period | Period description | Text | Q4 2024
entity_name | Company/entity | Text | Acme Corp
loan_reference | Specific loan ID | Text | LOAN-2024-001
currency | Expected currency | Text | EUR

List/Table Field Extraction

The List data type extracts repeating rows from tables — invoice line items, transaction lists, amortization schedules, etc.

When to use List vs flat fields

Use a List field when:

  • The document has multiple rows/records with the same structure
  • Each row has the same columns (e.g., Description, Quantity, Amount)
  • You want to extract ALL rows, not just one value

Use flat fields when:

  • Each piece of information appears exactly once
  • Fields are distinct (Invoice Number, Invoice Date, Total Amount)

Quick test

Ask: "Will there be multiple records with the SAME fields, or just ONE value per field?"

  • Multiple records → List field
  • One value each → Flat fields

Defining a List field

When you add a List field, you define its columns (sub-fields):

Example: Invoice line items

List field name | Column | Type | Description
Line Items | Description | Text | Product or service description
Line Items | Quantity | Number | Number of units
Line Items | Unit Price | Number | Price per unit (HT/excluding tax)
Line Items | Amount | Number | Line total

Each column gets its own:

  • Name — The column header
  • Type — text, number, or date
  • Explanation — How to identify this column in the document
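
As a mental model, a List field is a named collection of column definitions. The sketch below expresses the invoice example as Python dataclasses. The class names `Column` and `ListField` are hypothetical and do not correspond to any product API; the model itself is configured in the interface.

```python
from dataclasses import dataclass, field

ALLOWED_TYPES = {"text", "number", "date"}


@dataclass
class Column:
    """One sub-field of a List field (hypothetical structure for illustration)."""
    name: str
    type: str
    explanation: str

    def __post_init__(self) -> None:
        # Only the three data types named in this guide are accepted.
        if self.type not in ALLOWED_TYPES:
            raise ValueError(f"Unsupported column type: {self.type}")


@dataclass
class ListField:
    """A repeating-row field made of several columns."""
    name: str
    columns: list[Column] = field(default_factory=list)


line_items = ListField(
    name="Line Items",
    columns=[
        Column("Description", "text", "Product or service description"),
        Column("Quantity", "number", "Number of units"),
        Column("Unit Price", "number", "Price per unit (HT/excluding tax)"),
        Column("Amount", "number", "Line total"),
    ],
)
print([column.name for column in line_items.columns])
```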

Writing column explanations

Be specific about label variations and formats:

Good column explanation (Description):

Product or service description. May be labeled "Description", "Désignation", "Item", "Article", or "Libellé". Extract the full text as shown.

Good column explanation (Amount):

Line total for this item. Look for "Montant HT", "Amount", "Total ligne", or the rightmost numeric column. Number only with comma decimal separator (e.g., 1234,56).

List extraction limits

List extraction works best for tables with up to ~50-100 rows. Quality decreases for larger tables.

Row count | Recommendation
< 50 rows | ✅ Use List extraction
50-100 rows | ⚠️ Test carefully, may work
> 100 rows | ❌ Use OCR to Excel instead

For large tables, use OCR to Excel to convert to spreadsheet format.
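
If you triage documents before processing, the thresholds above can be written as a small decision function. This is a sketch: the function name and the returned labels are made up for illustration, and the cut-offs simply mirror the table.

```python
def recommend_method(row_count: int) -> str:
    """Map a table's row count to the recommendation from the table above."""
    if row_count < 50:
        return "list_extraction"
    if row_count <= 100:
        # Borderline range: may work, but results should be checked carefully.
        return "list_extraction_test_carefully"
    return "ocr_to_excel"


for rows in (12, 75, 400):
    print(rows, recommend_method(rows))
```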

Using List totals in formulas

In Test of Details analysis, you can reference numeric totals from List fields:

[DocType: Total of Line Items - Amount]

This automatically sums the Amount column across all extracted rows.
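
Conceptually, the total is a sum over one column of the extracted rows. The Python sketch below shows that idea with made-up sample rows; the helper names are hypothetical, and it assumes amounts arrive in the comma-decimal format described in this guide.

```python
from decimal import Decimal


def parse_amount(value: str) -> Decimal:
    """Convert an extracted number such as '1234,56' to a Decimal."""
    return Decimal(value.replace(",", "."))


def column_total(rows: list[dict[str, str]], column: str) -> Decimal:
    """Sum one numeric column across all extracted rows."""
    return sum((parse_amount(row[column]) for row in rows), Decimal("0"))


# Illustrative rows only; real rows come from the List field extraction.
rows = [
    {"Description": "Widget", "Amount": "1234,56"},
    {"Description": "Service fee", "Amount": "100,00"},
    {"Description": "Discount", "Amount": "-34,56"},
]
print(column_total(rows, "Amount"))  # 1300.00
```

Using `Decimal` rather than `float` avoids rounding drift when many monetary values are added.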


Extracting from Loan Amortization Schedules

Loan amortization schedules present specific challenges: they contain many rows at different dates, and you need to identify the correct row based on your audit date.

The challenge

A typical amortization schedule shows:

  • Monthly or periodic payment rows
  • Running balances before and after each payment
  • Interest and principal breakdown per payment

You need to extract values as of a specific date, and the row that holds the right balance depends on where the closing date falls relative to the payment dates:

Closing date vs payment date | What you typically need
Closing date falls on a payment date | Post-amortization balance (after that payment)
Closing date between two payment dates | Post-amortization balance from the preceding payment

Pre-amortization vs Post-amortization

This is the most critical decision:

Balance type | Definition | When to use
Pre-amortization | Balance before a payment is applied | Rarely: only if you need the opening balance for a period
Post-amortization | Balance after a payment is applied | Most common: the actual debt remaining at closing

Example: Closing date is 31-12-2024. The schedule shows:

  • Row for 15-12-2024: Balance after payment = 100,000
  • Row for 15-01-2025: Balance after payment = 98,000

→ You want 100,000 (post-amortization from the last payment before closing).
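
The row selection rule can be stated precisely in code: take the latest payment on or before the closing date and read its post-amortization balance. The Python sketch below applies that rule to the two rows from the example. The function and key names are hypothetical, and in practice the AI performs this selection from your instruction.

```python
from datetime import date, datetime


def parse_date(text: str) -> date:
    """Parse a DD-MM-YYYY string."""
    return datetime.strptime(text, "%d-%m-%Y").date()


def balance_at_closing(schedule: list[dict[str, str]], closing_date: str):
    """Return the post-amortization balance of the latest payment on or before closing."""
    closing = parse_date(closing_date)
    eligible = [row for row in schedule if parse_date(row["payment_date"]) <= closing]
    if not eligible:
        # Assumption: no payment has occurred yet, so no balance is returned.
        return None
    latest = max(eligible, key=lambda row: parse_date(row["payment_date"]))
    return latest["balance_after_payment"]


schedule = [
    {"payment_date": "15-12-2024", "balance_after_payment": "100000,00"},
    {"payment_date": "15-01-2025", "balance_after_payment": "98000,00"},
]
print(balance_at_closing(schedule, "31-12-2024"))  # 100000,00
```

The `<=` comparison covers both cases in the table: a closing date that falls on a payment date and one that falls between two payments.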

Key decisions before building

Question | Options | Impact
Multiple loans? | Single / Multiple in document | May need loan identifier variable

Let the AI handle timing logic

You don't need to decide pre vs post-amortization upfront. The instruction template below lets the AI determine the correct row based on the closing date.

Field | Type | Purpose
closing_date | Variable | Reference date for extraction
loan_reference | Variable (if needed) | Which loan if multiple
Principal Balance | Number | Outstanding principal at closing
Interest Rate | Number | Annual rate
Next Payment Date | Date | First payment after closing
Remaining Payments | Number | Payments left
Monthly Payment | Number | Regular payment amount

Writing the instruction (meta prompt)

Extract loan information from this amortization schedule.

Reference date: {{closing_date}}
Loan identifier (if multiple schedules): {{loan_reference}}

Determine the outstanding balance at the reference date by analyzing the schedule:

- If {{closing_date}} falls exactly ON a payment date → extract the post-amortization balance (after that payment)
- If {{closing_date}} falls BETWEEN two payment dates → extract the post-amortization balance from the last payment before the reference date

The goal is to find what the borrower still owes as of {{closing_date}}.

Field explanation examples

Principal Balance:

Determine the outstanding principal as of {{closing_date}}:

  1. If {{closing_date}} matches a payment date exactly → use the balance AFTER that payment (post-amortization)
  2. If {{closing_date}} falls between payments → use the balance AFTER the most recent payment before {{closing_date}}

Look for columns labeled: "Capital restant", "CRD", "Remaining Balance", "Solde", "Outstanding Principal", or "Balance après échéance".

Format: Number only with comma decimal separator (e.g., 125000,00).

Next Payment Date:

Find the first payment date that occurs after {{closing_date}}.

Look for columns labeled: "Date échéance", "Payment Date", "Date", or "Échéance".

Format: DD-MM-YYYY (e.g., 15-01-2025).
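
The Next Payment Date rule is the mirror image of the balance rule: the earliest payment date strictly after the closing date. A minimal Python sketch follows. The function name is hypothetical, and returning `None` when no later payment exists is an assumption.

```python
from datetime import datetime

DATE_FORMAT = "%d-%m-%Y"


def next_payment_date(payment_dates: list[str], closing_date: str):
    """Return the first payment date strictly after closing_date, or None."""
    closing = datetime.strptime(closing_date, DATE_FORMAT)
    parsed = [datetime.strptime(value, DATE_FORMAT) for value in payment_dates]
    later = [value for value in parsed if value > closing]
    # Assumption: an empty result means the loan has no payment after closing.
    return min(later).strftime(DATE_FORMAT) if later else None


dates = ["15-12-2024", "15-01-2025", "15-02-2025"]
print(next_payment_date(dates, "31-12-2024"))  # 15-01-2025
```

Note the strict comparison: a payment made on the closing date itself is not the next payment.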

Handling multiple loans in one document

If a document contains multiple amortization schedules:

  1. Add a loan_reference variable
  2. Include it in your instruction and field explanations:

From the amortization schedule for loan {{loan_reference}}, find the row where...


Format Requirements for Accurate Extraction

The AI returns each value in a specific format. Stating that format in your field explanations helps keep extracted values accurate and consistent:

Date format

Required format: DD-MM-YYYY

Document shows | AI extracts
December 31, 2024 | 31-12-2024
2024-12-31 | 31-12-2024
31/12/24 | 31-12-2024

In explanations, specify:

Format: DD-MM-YYYY (e.g., 31-12-2024)
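
If you post-process or validate dates yourself, the conversions in the table can be reproduced with the standard library. The sketch below is not how the AI normalizes dates internally. The list of accepted input formats is an assumption, and month names are parsed in English under Python's default locale.

```python
from datetime import datetime

# Assumed input formats, tried in order; extend the list for your own documents.
KNOWN_FORMATS = ["%d-%m-%Y", "%Y-%m-%d", "%d/%m/%y", "%d/%m/%Y", "%B %d, %Y"]


def normalize_date(raw: str):
    """Return the date as DD-MM-YYYY, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%d-%m-%Y")
        except ValueError:
            continue
    return None


for sample in ("December 31, 2024", "2024-12-31", "31/12/24"):
    print(sample, "->", normalize_date(sample))
```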

Number format

Required format: Digits only, comma as decimal separator

Document shows | AI extracts
€1,234.56 | 1234,56
1 234,56 € | 1234,56
$1,234.56 | 1234,56
-500.00 | -500,00

In explanations, specify:

Number only with comma decimal separator (e.g., 1234,56). No currency symbols, thousand separators, or units.
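
The table's conversions can be approximated with a short heuristic, shown here as a Python sketch for validating or post-processing values. It is not the AI's internal logic. It assumes that a trailing separator followed by one or two digits is the decimal separator, and that any other separator is a thousands separator.

```python
import re


def normalize_number(raw: str) -> str:
    """Return digits with a comma decimal separator, e.g. '€1,234.56' -> '1234,56'."""
    # Drop currency symbols, spaces, and units; keep digits, separators, and sign.
    cleaned = re.sub(r"[^\d.,\-]", "", raw)
    negative = cleaned.startswith("-")
    cleaned = cleaned.replace("-", "")

    # Heuristic (assumption): a final '.' or ',' followed by 1-2 digits is the decimal part.
    match = re.search(r"[.,](\d{1,2})$", cleaned)
    if match:
        integer_part = re.sub(r"[.,]", "", cleaned[: match.start()])
        result = f"{integer_part},{match.group(1)}"
    else:
        result = re.sub(r"[.,]", "", cleaned)
    return ("-" if negative else "") + result


for sample in ("€1,234.56", "1 234,56 €", "$1,234.56", "-500.00"):
    print(sample, "->", normalize_number(sample))
```

A value such as `1,234` is read as one thousand two hundred thirty-four under this heuristic, so documents that use three decimal places would need a different rule.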

Text format

Text is extracted verbatim with leading/trailing whitespace trimmed.

In explanations, specify any requirements:

Extract the company name exactly as shown. Include legal form (SA, SARL, etc.) if present.


Tips for Robust Field Explanations

Include synonyms and label variations

Documents use different labels for the same data. List common variations:

Invoice number (aka: invoice no., bill no., reference number, facture n°, numéro de facture). Prefer the alphanumeric code near the "Invoice" header.

Specify location hints

Help the AI find the right value when multiple similar values exist:

Total amount including VAT. Usually found at the bottom of the invoice, near labels like "Total TTC", "Amount Due", or "Balance Due". This is typically the largest amount on the document.

Handle edge cases explicitly

Describe what to do when expected data is missing or ambiguous:

If no due date is shown, return an empty value. Do not guess or calculate from invoice date.

Use the document's language

If documents are in French, write explanations in French for better matching:

Montant total TTC. Chercher près des étiquettes "Total TTC", "Net à payer", ou "Montant dû".


Debugging Extraction Issues

Step 1: Check grounding

Click any extracted value to see where it came from in the document:

  • Correct location, wrong value → Format issue or description needs refinement
  • Wrong location → Description isn't specific enough, or label variations missing

Step 2: Review field explanations

For each problematic field, ask:

  • Would a human following this description find the right value?
  • Are all label variations covered?
  • Is the location in the document described?
  • Are edge cases handled?

Step 3: Test on diverse samples

Run on 5+ documents with different:

  • Layouts and formats
  • Languages (if applicable)
  • Scanned vs digital PDFs
  • Single page vs multi-page

Step 4: Iterate incrementally

Refine one field at a time:

  1. Identify the failing field
  2. Update its explanation
  3. Re-test on the problem document
  4. Verify it still works on other documents
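
Step 4 is easier to do reliably if you keep a small set of expected values per document and compare them after each change. The Python sketch below shows one way to do that outside the product. The function name, the data layout, and the sample values are all hypothetical.

```python
def find_regressions(expected: dict, actual: dict) -> dict:
    """Return {document: {field: (expected, actual)}} for every mismatching field."""
    regressions = {}
    for document, fields in expected.items():
        extracted = actual.get(document, {})
        diffs = {
            name: (want, extracted.get(name))
            for name, want in fields.items()
            if extracted.get(name) != want
        }
        if diffs:
            regressions[document] = diffs
    return regressions


# Illustrative values: what you expect versus what the latest run returned.
expected = {
    "invoice_a.pdf": {"Invoice Number": "F-001", "Total Amount": "1234,56"},
    "invoice_b.pdf": {"Invoice Number": "F-002", "Total Amount": "99,00"},
}
actual = {
    "invoice_a.pdf": {"Invoice Number": "F-001", "Total Amount": "1234,56"},
    "invoice_b.pdf": {"Invoice Number": "F-002", "Total Amount": "990,00"},
}
print(find_regressions(expected, actual))
```

An empty result means the updated explanation fixed the problem document without breaking the others.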

Common issues and solutions

Problem | Likely cause | Solution
Value not found | Label variations | Add synonyms to explanation
Wrong value extracted | Multiple similar values | Add location hints
Wrong row (in tables) | Unclear row identification | Make date/row criteria more specific
Format errors | Unexpected format in document | Describe expected format variations
Inconsistent results | Ambiguous description | Make explanation more specific

AI-Generated vs Manual Models

When to use AI generation

  • Starting a new model for a common document type (invoices, contracts)
  • Quickly prototyping before refinement
  • Learning what fields are typically extracted from a document type

When to build manually

  • Highly specific extraction requirements
  • Non-standard document formats
  • Need precise control over field explanations
  • Building on an existing model

Refining AI-generated models

AI generation is a starting point. Always:

  1. Review all generated fields
  2. Test on real documents
  3. Adjust explanations for your specific document formats
  4. Add label variations you discover
  5. Remove unnecessary fields

Multi-Document Patterns

In Test of Details, you often match multiple document types to transactions:

Transaction field | Invoice | Delivery Note | Payment
Reference | Invoice Number | Delivery Reference | Payment Reference
Amount | Total Amount | - | Payment Amount
Date | Invoice Date | Delivery Date | Payment Date

Create separate extraction models for each document type, then use Test of Details to match them.
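
To make the mapping concrete, the sketch below encodes the table as a Python dictionary and looks up documents whose reference field equals a transaction's reference. Test of Details performs the real matching; the names `FIELD_MAP` and `find_matches`, the data layout, and the sample values are hypothetical.

```python
# Transaction field -> extracted field, per document type (from the table above).
FIELD_MAP = {
    "Invoice": {"Reference": "Invoice Number", "Amount": "Total Amount", "Date": "Invoice Date"},
    "Delivery Note": {"Reference": "Delivery Reference", "Date": "Delivery Date"},
    "Payment": {"Reference": "Payment Reference", "Amount": "Payment Amount", "Date": "Payment Date"},
}


def find_matches(transaction: dict, documents: list[dict]) -> list[dict]:
    """Return the documents whose reference field equals the transaction's Reference."""
    matches = []
    for document in documents:
        reference_field = FIELD_MAP[document["type"]]["Reference"]
        if document["fields"].get(reference_field) == transaction["Reference"]:
            matches.append(document)
    return matches


# Illustrative data only.
transaction = {"Reference": "INV-001", "Amount": "1234,56"}
documents = [
    {"type": "Invoice", "fields": {"Invoice Number": "INV-001", "Total Amount": "1234,56"}},
    {"type": "Payment", "fields": {"Payment Reference": "INV-002", "Payment Amount": "50,00"}},
]
print([doc["type"] for doc in find_matches(transaction, documents)])  # ['Invoice']
```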

Cross-document validation

Use variables to ensure consistency:

  • Set expected_amount variable from transaction data
  • Reference in extraction: Verify total matches {{expected_amount}}
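
The check itself amounts to comparing two amounts in the comma-decimal format. A minimal Python sketch of such a comparison is shown below. The function name and the 0,01 tolerance are assumptions, not product behavior.

```python
from decimal import Decimal


def amounts_match(extracted: str, expected: str, tolerance: Decimal = Decimal("0.01")) -> bool:
    """Compare two comma-decimal amounts, allowing a small rounding tolerance."""

    def to_decimal(value: str) -> Decimal:
        return Decimal(value.replace(",", "."))

    return abs(to_decimal(extracted) - to_decimal(expected)) <= tolerance


print(amounts_match("1234,56", "1234,56"))  # True
print(amounts_match("1234,56", "1234,50"))  # False
```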

Best Practices Summary

  1. Use variables for dates, periods, and identifiers that change between extractions
  2. Write specific explanations with label variations and location hints
  3. Choose List vs flat fields based on whether data repeats
  4. Respect format requirements — dates as DD-MM-YYYY, numbers with comma decimals
  5. Test on diverse documents before running large batches
  6. Iterate incrementally — fix one field at a time
  7. Use AI generation as a starting point, then refine