I built ParseIt to automate document processing with AI. The core problem: LLMs are never 100% accurate on real-world documents out of the box. They make mistakes - misread fields, incorrect formatting, wrong entity names. For production workflows, even small errors break the entire pipeline.

The obvious solution is model fine-tuning. Train a custom model on your specific document formats, iterate until it’s accurate enough, ship it. But fine-tuning carries real infrastructure overhead: setting up Vertex AI, collecting training data, managing model versions, and paying for custom hosting.

For a bootstrapped SaaS, that’s a lot of friction before you even validate product-market fit.

So I built continuous learning using few-shot prompting instead. Here’s what actually happened.

How Few-Shot Learning Works

The concept is simple: inject examples of past corrections directly into the AI prompt before processing new documents.

Standard prompt:

Extract document_id, entity_name, total_amount from this document.

Few-shot prompt:

Extract document_id, entity_name, total_amount from this document.

LEARNING FROM PAST CORRECTIONS:

Example 1:
Field: entity_name
Initially extracted (INCORRECT): "ABC Ltd"
Corrected value (CORRECT): "ABC Limited (Australia)"
Lesson: Include full legal entity name with country suffix

Example 2:
Field: document_date
Initially extracted (INCORRECT): "12/03/2024"
Corrected value (CORRECT): "2024-03-12"
Lesson: Always use ISO 8601 format (YYYY-MM-DD)

Now extract data from this document...

The AI sees its past mistakes and adjusts. No model training required.

The Implementation

I added four core functions to ParseIt’s Go backend:

1. Capture Corrections

When a user edits extracted data, compare old vs new values:

// Correction records a single field-level edit made by a user.
// The db tags let the retrieval query below scan rows straight into the struct.
type Correction struct {
    FieldPath      string `db:"field_path"`
    OriginalValue  string `db:"original_value"`
    CorrectedValue string `db:"corrected_value"`
}

func detectCorrections(oldData, newData map[string]interface{}) []Correction {
    var corrections []Correction

    // Compare top-level fields; reflect.DeepEqual avoids panics on
    // uncomparable values such as nested maps and slices.
    for key, newVal := range newData {
        oldVal := oldData[key]
        if !reflect.DeepEqual(oldVal, newVal) {
            corrections = append(corrections, Correction{
                FieldPath:      key,
                OriginalValue:  fmt.Sprintf("%v", oldVal),
                CorrectedValue: fmt.Sprintf("%v", newVal),
            })
        }
    }

    return corrections
}
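
The version above only walks top-level keys. When the extracted JSON contains nested objects (address blocks, line items), a recursive walk that builds dotted field paths does the job. Here’s a minimal sketch, not the exact production code:

// detectCorrectionsNested walks nested maps and records a dotted FieldPath
// (e.g. "supplier.entity_name") for every leaf value that changed.
// Sketch only: slices and type mismatches are treated as leaf values.
func detectCorrectionsNested(oldData, newData map[string]interface{}, prefix string) []Correction {
    var corrections []Correction

    for key, newVal := range newData {
        path := key
        if prefix != "" {
            path = prefix + "." + key
        }

        oldVal := oldData[key]

        newMap, newIsMap := newVal.(map[string]interface{})
        oldMap, oldIsMap := oldVal.(map[string]interface{})
        if newIsMap && oldIsMap {
            // Recurse into nested objects.
            corrections = append(corrections, detectCorrectionsNested(oldMap, newMap, path)...)
            continue
        }

        if !reflect.DeepEqual(oldVal, newVal) {
            corrections = append(corrections, Correction{
                FieldPath:      path,
                OriginalValue:  fmt.Sprintf("%v", oldVal),
                CorrectedValue: fmt.Sprintf("%v", newVal),
            })
        }
    }

    return corrections
}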

2. Store Corrections

Save each correction with metadata for retrieval:

func storeCorrections(app core.App, corrections []Correction, document *core.Record) error {
    for _, correction := range corrections {
        // correctionsCollection is the *core.Collection for "document_corrections",
        // looked up once elsewhere (e.g. via app.FindCollectionByNameOrId).
        record := core.NewRecord(correctionsCollection)
        record.Set("tenant_id", document.GetString("tenant_id"))
        record.Set("template_id", document.GetString("template_id"))
        record.Set("field_path", correction.FieldPath)
        record.Set("original_value", correction.OriginalValue)
        record.Set("corrected_value", correction.CorrectedValue)

        if err := app.Save(record); err != nil {
            return err
        }
    }
    return nil
}

Important: Tenant isolation matters. Each client’s corrections only improve their templates, never leak across tenants.

3. Retrieve Relevant Corrections

When processing a new document, grab the most recent corrections for each field:

// getRelevantCorrections fetches the most recent corrections for a template,
// running a raw query through PocketBase's dbx builder (github.com/pocketbase/dbx).
func getRelevantCorrections(app core.App, templateID string) ([]Correction, error) {
    var corrections []Correction

    err := app.DB().NewQuery(`
        SELECT field_path, original_value, corrected_value
        FROM document_corrections
        WHERE template_id = {:templateId}
        ORDER BY created DESC
        LIMIT 10
    `).Bind(dbx.Params{"templateId": templateID}).All(&corrections)

    return corrections, err
}

Why 10? Context engineering. Gemini 2.0 Flash accepts up to 1M input tokens, but the “lost in the middle” problem means quality matters more than quantity. Ten recent corrections strike the right balance: enough to cover current edge cases without diluting the prompt.

Research: Lost in the Middle

Language models perform significantly worse on information in the middle of long contexts compared to information at the beginning or end. See “Lost in the Middle: How Language Models Use Long Contexts” (Liu et al., 2023) for the research behind this phenomenon.

4. Inject Into Prompts

Modify the Gemini prompt to include corrections:

func buildPromptWithFewShot(template Template, corrections []Correction) string {
    // In the real prompt, the template's field definitions follow this line
    // (as in the example prompts above); omitted here for brevity.
    prompt := "Extract the following data from this document:\n\n"

    if len(corrections) > 0 {
        prompt += "--- LEARNING FROM PAST CORRECTIONS ---\n"
        for i, corr := range corrections {
            prompt += fmt.Sprintf("Example %d:\n", i+1)
            prompt += fmt.Sprintf("  Field: %s\n", corr.FieldPath)
            prompt += fmt.Sprintf("  Incorrect: \"%s\"\n", corr.OriginalValue)
            prompt += fmt.Sprintf("  Correct: \"%s\"\n", corr.CorrectedValue)
        }
        prompt += "--- END CORRECTIONS ---\n\n"
    }

    prompt += "Now extract data from this document..."
    return prompt
}

That’s it. Four functions, zero model training.
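
To see how the pieces fit together, here’s a rough sketch of the processing flow. The `Template` type is the one from the prompt builder above; `callGemini` is a hypothetical stand-in for the actual Gemini client wrapper:

// processDocument sketches the end-to-end flow: fetch corrections for the
// document's template, build a few-shot prompt, and send it to the model.
// callGemini is a hypothetical wrapper around the Gemini API client.
func processDocument(app core.App, doc *core.Record, template Template) (map[string]interface{}, error) {
    corrections, err := getRelevantCorrections(app, doc.GetString("template_id"))
    if err != nil {
        return nil, err
    }

    prompt := buildPromptWithFewShot(template, corrections)

    // Hypothetical call: send the prompt plus document content to Gemini
    // and parse the JSON response into a map.
    return callGemini(prompt, doc)
}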

What It Actually Solved

The system learns patterns after seeing them once. Real example:

  • Field: entity_name
  • Initial extractions: “Acme Co”, “ACME”, “Acme”
  • User corrected to: “Acme Corporation Pty Ltd”
  • After correction: AI consistently extracts full legal name with suffix

Unlike static models with fixed accuracy, few-shot learning approaches 100% as users correct edge cases. Corrections apply immediately to the next document - no training runs, no Vertex AI setup, standard Gemini API pricing.

Track accuracy metrics

Store total documents processed and total corrections made per template. Watching the correction rate decrease validates the approach and shows which templates need more examples.
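
A minimal way to compute that rate, sketched against an assumed `template_stats` collection (the collection and field names here are illustrative, not ParseIt’s actual schema):

// correctionRate reads two counters from a hypothetical template_stats
// collection and returns corrections per processed document.
func correctionRate(app core.App, templateID string) (float64, error) {
    stats, err := app.FindFirstRecordByFilter(
        "template_stats",
        "template_id = {:id}",
        dbx.Params{"id": templateID},
    )
    if err != nil {
        return 0, err
    }

    docs := stats.GetFloat("documents_processed")
    if docs == 0 {
        return 0, nil
    }
    return stats.GetFloat("corrections_made") / docs, nil
}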

What It Doesn’t Solve

Few-shot learning has real limitations:

Token Bloat at Scale

As corrections accumulate, prompt size explodes. At some point, you hit diminishing returns:

  • Slower inference: Larger prompts take longer to process
  • Higher costs: Pay per token for every request
  • No better accuracy: After a certain point, more examples don’t help

That’s when you migrate to Vertex AI fine-tuning. Export corrections as training data, fine-tune a custom model, deploy it. But for 90% of use cases, you’ll never hit that limit.

Context Window Constraints

Few-shot examples compete with document content for context window space. If documents are 50 pages, you have less room for examples. Worse, the “lost in the middle” problem applies to document content too - the model struggles with information buried in the middle of long documents.

My approach: Limit to ~10 examples per field, prioritize recent corrections. For long documents, consider chunking or extracting key sections first.
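
One way to enforce a per-field cap after retrieval (with the query’s LIMIT raised accordingly): group corrections by FieldPath and keep only the newest few for each field. A sketch, assuming corrections arrive newest-first as in the query above:

// capPerField keeps at most max corrections per field path, preserving
// the newest-first order produced by the retrieval query.
func capPerField(corrections []Correction, max int) []Correction {
    seen := make(map[string]int)
    var capped []Correction

    for _, c := range corrections {
        if seen[c.FieldPath] >= max {
            continue
        }
        seen[c.FieldPath]++
        capped = append(capped, c)
    }
    return capped
}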

No Semantic Similarity Matching

I retrieve corrections by recency (most recent first). Better would be semantic similarity: find corrections from documents most similar to the current one.

Why I didn’t build it: Embeddings add complexity (vector database, similarity scoring). Recency works well enough for v1. When accuracy plateaus, I’ll revisit.
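
For reference, here’s roughly what similarity-based retrieval would look like: embed the incoming document, compare it against pre-computed embeddings of each correction’s source document, and rank by cosine similarity. The embeddings are assumed to already exist; this is a sketch of the idea, not something ParseIt ships today:

// cosineSimilarity compares two embedding vectors of equal length.
func cosineSimilarity(a, b []float64) float64 {
    if len(a) != len(b) || len(a) == 0 {
        return 0
    }
    var dot, normA, normB float64
    for i := range a {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    if normA == 0 || normB == 0 {
        return 0
    }
    return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// rankBySimilarity reorders corrections so the ones whose source documents
// are most similar to the current document come first. embeddings[i] is the
// (hypothetical, pre-computed) embedding for corrections[i].
func rankBySimilarity(corrections []Correction, embeddings [][]float64, docEmbedding []float64) []Correction {
    // Sort index positions so corrections and embeddings stay aligned.
    idx := make([]int, len(corrections))
    for i := range idx {
        idx[i] = i
    }
    sort.SliceStable(idx, func(a, b int) bool {
        return cosineSimilarity(embeddings[idx[a]], docEmbedding) >
            cosineSimilarity(embeddings[idx[b]], docEmbedding)
    })

    ranked := make([]Correction, len(corrections))
    for out, in := range idx {
        ranked[out] = corrections[in]
    }
    return ranked
}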

Manual Architecture Required

The AI doesn’t figure out what to learn. I explicitly:

  • Defined the corrections schema (field_path, old_value, new_value)
  • Built comparison logic to detect changes
  • Wrote retrieval queries to find relevant examples
  • Designed the prompt structure for few-shot injection

This isn’t “AI that teaches itself.” It’s structured learning guided by careful system design.

When to migrate to fine-tuning

As corrections accumulate and token bloat becomes an issue, fine-tuning becomes worth considering. Options: fine-tune on Vertex AI (ParseIt uses Gemini Flash), or fine-tune local LLMs like Llama for faster iteration. Export corrections as JSONL training data. But most use cases never hit this limit - start simple, scale only when needed.
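
The export itself is a few lines; the wrinkle is the schema, which depends on the tuning target (Vertex AI’s Gemini format differs from what local Llama tooling expects). This sketch writes a generic prompt/completion pair per correction - a fuller export would pair whole documents with their fully corrected output - and leaves the exact key names to whatever format you target:

// trainingExample is a generic prompt/completion pair. Rename the JSON keys
// to match whichever fine-tuning format you're targeting.
type trainingExample struct {
    Prompt     string `json:"prompt"`
    Completion string `json:"completion"`
}

// exportCorrectionsJSONL writes one JSON object per line, pairing the field
// and its originally extracted value with the user-corrected value.
func exportCorrectionsJSONL(corrections []Correction, w io.Writer) error {
    enc := json.NewEncoder(w) // Encode appends a newline after each object, giving JSONL.
    for _, c := range corrections {
        ex := trainingExample{
            Prompt:     fmt.Sprintf("Field: %s\nExtracted value: %s\nReturn the corrected value.", c.FieldPath, c.OriginalValue),
            Completion: c.CorrectedValue,
        }
        if err := enc.Encode(ex); err != nil {
            return err
        }
    }
    return nil
}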

The Hybrid Approach: Best of Both Worlds

After implementing this in production, I learned that few-shot and fine-tuning aren’t mutually exclusive. The optimal strategy uses both at different stages:

Few-Shot: 0-50 Corrections

When it works best: Early stage, exploring edge cases, rapid iteration

  • Ship immediately with zero infrastructure
  • Corrections apply to next document (no training delay)
  • Perfect for validating product-market fit
  • Sweet spot: 10-30 corrections per template

Performance: Each correction noticeably improves accuracy. The model learns patterns quickly because it’s already pre-trained on millions of similar documents.

Transition Zone: 50-100 Corrections

The crossover point: Few-shot starts hitting diminishing returns

  • Prompts grow large (slower, more expensive)
  • Model stops learning new patterns (most edge cases covered)
  • Fine-tuning becomes cost-effective

Decision point: If a tenant processes >500 docs/month on a template, fine-tuning pays for itself through faster inference and lower per-request costs.

Fine-Tuning: 100+ Corrections

When it wins: High volume, stable formats, mature product

Google’s Gemini fine-tuning documentation recommends 100 examples minimum for meaningful improvement. From my research:

  • 100 corrections: Viable starting point, noticeable improvement
  • 500 corrections: Sweet spot for production-ready custom models
  • 1000+ corrections: Diminishing returns unless format is highly specialized

The process:

  1. Export corrections as JSONL training data
  2. Submit fine-tuning job to Vertex AI (2-6 hours training time)
  3. Deploy custom model endpoint
  4. A/B test against few-shot to validate improvement

The Hybrid Strategy

Here’s the approach I’m implementing for ParseIt:

Tier 1: Starter (Free/Basic)

  • Few-shot only, limit to 20 most recent corrections
  • Good enough for low-volume users (<100 docs/month)

Tier 2: Professional ($50/month)

  • Few-shot with 50 correction limit
  • Handles ~500 docs/month efficiently
  • When tenant hits 100 corrections, offer upgrade

Tier 3: Enterprise ($200/month)

  • Custom fine-tuned model per template format
  • Requires 100+ corrections to activate
  • Lower per-document cost at high volume
  • Keep few-shot running: New edge cases get added to prompts until next fine-tuning run (monthly or quarterly)

Why hybrid?

  • Fine-tuned models still miss edge cases
  • Few-shot catches new patterns immediately
  • Periodic re-training incorporates new corrections

The beauty: you get immediate learning (few-shot) plus baked-in knowledge (fine-tuning). Start with few-shot, graduate to fine-tuning when volume justifies it, keep both running for optimal accuracy.
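
In code, that routing can be one small decision function: use the tuned endpoint when the template has one, and only inject corrections made since the last tuning run. The `TemplateConfig` fields below are hypothetical; this is a sketch of the logic, not ParseIt’s actual scheduler:

// TemplateConfig is a hypothetical per-template record of tuning state.
type TemplateConfig struct {
    TunedModelEndpoint string    // empty until a fine-tuned model is deployed
    LastTunedAt        time.Time // zero value if never tuned
}

// chooseModelAndExamples implements the hybrid rule: use the tuned model when
// one exists, and only inject corrections newer than the last tuning run so
// the prompt carries just the edge cases the tuned model hasn't seen yet.
// correctedAt[i] is the timestamp of corrections[i] (a real implementation
// would carry the timestamp on the correction record itself).
func chooseModelAndExamples(cfg TemplateConfig, corrections []Correction, correctedAt []time.Time, baseModel string) (string, []Correction) {
    if cfg.TunedModelEndpoint == "" {
        // No tuned model yet: base model plus the full few-shot set.
        return baseModel, corrections
    }

    var fresh []Correction
    for i, c := range corrections {
        if correctedAt[i].After(cfg.LastTunedAt) {
            fresh = append(fresh, c)
        }
    }
    return cfg.TunedModelEndpoint, fresh
}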

Cost optimization

Fine-tuning makes economic sense when: (inference cost savings from smaller prompts) > (fine-tuning cost + hosting overhead). For ParseIt, that break-even is around 500 docs/month per template. Below that, few-shot is cheaper.
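
The check itself is one line of arithmetic once you plug in your own measured numbers (nothing below hardcodes Gemini pricing):

// fineTuningBreaksEven reports whether monthly savings from shorter prompts
// outweigh the amortized tuning cost plus hosting overhead. All inputs are
// your own measured numbers; no specific model pricing is assumed here.
func fineTuningBreaksEven(docsPerMonth, promptTokensSaved int, costPerMillionTokens,
    tuningCostAmortizedMonthly, hostingOverheadMonthly float64) bool {

    monthlySavings := float64(docsPerMonth) * float64(promptTokensSaved) / 1e6 * costPerMillionTokens
    return monthlySavings > tuningCostAmortizedMonthly+hostingOverheadMonthly
}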

Lessons Learned

  • Database gotcha: JSON fields come back wrapped in the framework’s own types and don’t deserialize like plain Go maps. Read the framework docs for the proper deserialization helpers (see the sketch after this list). Logging exposed this immediately.
  • Tenant isolation is critical: Each correction is scoped to tenant + template. Never leak Client A’s corrections to Client B.
  • Correction patterns inform product: Tracking which fields get corrected most revealed which formats needed better prompts. Some templates stabilized after 3 corrections, others needed 20+.
  • Ship imperfect, improve continuously: The beauty of few-shot is you don’t need perfect accuracy on day one. Real usage makes it better.
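
On the database gotcha: the core.App/core.Record API in the snippets above is PocketBase’s, where a JSON column comes back as a raw JSON type rather than a ready-to-use map, so you unmarshal it explicitly. A sketch of what that looks like, assuming a hypothetical `extracted_data` JSON field on the document record:

// loadExtractedData unmarshals a JSON field on a PocketBase record into a
// plain Go map. Calling record.Get and type-asserting to map[string]interface{}
// will not work, because the stored value is a raw JSON type.
func loadExtractedData(document *core.Record) (map[string]interface{}, error) {
    var data map[string]interface{}
    if err := document.UnmarshalJSONField("extracted_data", &data); err != nil {
        return nil, err
    }
    return data, nil
}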

Try It Yourself

The pattern works for any LLM extraction task: capture corrections, store with context, retrieve relevant examples, inject into prompts. ~200 lines of code, any LLM API.

Try ParseIt at parseit.ai (currently in closed beta - sign up to get early access). Upload documents, correct errors, watch accuracy improve with each correction.

The future of AI products isn’t just “use GPT.” It’s building systems that learn from real usage. Few-shot learning is the pragmatic path to get there.