Extraction Runs
Overview
An Extraction Run is the core processing job that transforms your document collections into structured knowledge graphs using an ontology. It takes a collection of documents and applies an ontology to systematically extract entities and their relationships, creating a queryable knowledge base.
Think of an Extraction Run as an intelligent processing job that reads through all your documents, identifies the information defined by your ontology (people, companies, events, etc.), and organizes everything into a structured graph that you can query and analyze.
Core Concepts
What is an Extraction Run?
An Extraction Run is a multi-step pipeline that:
- Processes Document Collections: Takes a collection of documents as input
- Applies an Ontology: Uses your ontology to guide what information to extract
- Creates Knowledge Graphs: Builds a structured representation of information from your documents
Once you are satisfied with your extraction results, turn them into a Knowledge Base to make them queryable by AI agents.
Extraction Run Pipeline
Pipeline Stages
Every Extraction Run follows a systematic multi-stage pipeline:
1. Chunking Stage (Optional)
Documents are broken down into manageable pieces.
Chunking is recommended only for very long documents containing many entities. Start without it, and enable it if your extraction feels incomplete!
Available chunking methods:
- Character-based chunking: Split text into segments with a maximum character count (e.g., 1500 characters)
- Semantic chunking (coming soon): Use AI to create meaningful content segments based on guidelines you provide
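For intuition, here is a minimal sketch of character-based chunking in Python. The word-boundary heuristic and the 1500-character default are illustrative assumptions, not the platform's actual splitter.

```python
def chunk_by_characters(text: str, max_chars: int = 1500) -> list[str]:
    """Split text into segments of at most max_chars characters,
    backing up to whitespace so words are not cut (illustrative heuristic)."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            boundary = text.rfind(" ", start, end)
            if boundary > start:
                end = boundary  # break at the last space inside the window
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end
    return chunks
```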
2. Extraction Stage
Entity and Relationship Extraction:
- Identifies instances of entity types and their relations, as defined in your ontology
- Configurable with custom extraction guidelines in "Advanced parameters"
- Can extract either at the document level or from the chunks produced in the previous pipeline stage
Advanced Features:
- Custom Guidelines: Provide specific extraction instructions. Reserve these for guidance that is highly specific to your collection and does not belong directly in your ontology.
- Think before extraction: Enable detailed reasoning for extractions (the chain-of-thought appears in the results, which is useful for learning how the AI interprets your guidelines or documents)
- Reflect on extracted items: Apply additional validation to extracted entities (the self-reflection will appear in the results)
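To make these parameters concrete, here is an illustrative configuration sketch. Every field name in it is an assumption for this example, not the platform's documented schema.

```python
# Illustrative extraction-run configuration; all field names here are
# assumptions for the sketch, not the platform's documented schema.
extraction_config = {
    "collection_id": "col_123",   # document collection to process
    "ontology_id": "ont_456",     # ontology guiding the extraction
    "chunking": {"method": "character", "max_chars": 1500},  # optional stage
    "advanced_parameters": {
        "guidelines": "Treat subsidiaries as separate Company entities.",
        "think_before_extraction": True,     # chain-of-thought in results
        "reflect_on_extracted_items": True,  # self-reflection in results
    },
}
```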
We use caching so that changing one class in your ontology and re-launching the extraction does not recompute results for the other, untouched classes. The goal is for you to work iteratively on your ontology, refining it based on your extraction results!
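One way to picture this caching: extraction results can be keyed by a hash of the class definition plus the document, so editing one class invalidates only that class's cached entries. A minimal sketch under that assumption:

```python
import hashlib
import json

cache: dict[str, list] = {}  # results keyed by (class definition, document) hash

def run_llm_extraction(class_def: dict, document_text: str) -> list:
    ...  # the expensive LLM call, stubbed for this sketch

def extract_class(class_def: dict, document_text: str) -> list:
    # Hash the class definition together with the document so that editing
    # one class only invalidates that class's cached results.
    payload = json.dumps(class_def, sort_keys=True) + document_text
    key = hashlib.sha256(payload.encode()).hexdigest()
    if key not in cache:
        cache[key] = run_llm_extraction(class_def, document_text)
    return cache[key]
```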
3. Deduplication (Optional)
Deduplication is a two-step process, illustrated in the diagram and the sketch below:
- First, identify duplicate candidates.
  - For obvious duplicates (typically when a strong identifier matches exactly), go straight to the merging step.
  - For less obvious cases (typically when two company names fuzzy-match, such as Tesla and Tesla, Inc.), trigger an AI review. The AI considers all the entity properties and decides whether the two entities should be treated as duplicates.
- Then, merge the records detected as duplicates.
  - If they have no conflicting properties, merge them directly.
  - If they have conflicts, use AI to resolve them based on your merge guidelines.
```mermaid
graph TD
    B[Step 1: Identify Duplicate Candidates]
    B --> D[Duplicates]
    B --> E[Potential Duplicates]
    B --> M[Not Duplicates]
    D --> C[Step 2: Merge Records<br/>Check for Conflicting Properties]
    E --> G[Trigger AI Review]
    G --> D
    G --> M
    C --> I[No Conflicts]
    C --> J[Has Conflicts]
    I --> K[Merge Them]
    J --> L[Use AI to Manage Conflicts<br/>Based on Merge Guidelines]
    style B fill:#FED766,stroke:#333,stroke-width:2px,color:#000
    style C fill:#FED766,stroke:#333,stroke-width:2px,color:#000
    style D fill:#FC5A8F,stroke:#333,stroke-width:2px,color:#fff
    style E fill:#7061A3,stroke:#333,stroke-width:2px,color:#fff
    style M fill:#F22FB6,stroke:#333,stroke-width:2px,color:#fff
    style G fill:#7061A3,stroke:#333,stroke-width:2px,color:#fff
    style I fill:#FC5A8F,stroke:#333,stroke-width:2px,color:#fff
    style J fill:#F22FB6,stroke:#333,stroke-width:2px,color:#fff
    style K fill:#FC5A8F,stroke:#333,stroke-width:2px,color:#fff
    style L fill:#F22FB6,stroke:#333,stroke-width:2px,color:#fff
```
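For intuition on the candidate-identification step, here is a sketch using simple string similarity from Python's standard library. The 0.6 threshold and the tax_id strong-identifier check are illustrative assumptions; the platform's matching logic is more involved.

```python
from difflib import SequenceMatcher

def classify_pair(a: dict, b: dict) -> str:
    """Classify an entity pair as 'duplicate' (merge directly), 'potential'
    (send to AI review), or 'distinct'. Thresholds are illustrative."""
    # A strong identifier matching exactly -> obvious duplicate.
    if a.get("tax_id") and a.get("tax_id") == b.get("tax_id"):
        return "duplicate"
    # Fuzzy name match -> potential duplicate, needs AI review.
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return "potential" if ratio >= 0.6 else "distinct"

print(classify_pair({"name": "Tesla"}, {"name": "Tesla, Inc."}))  # potential (~0.62)
```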
4. Results
When you are satisfied with the results, you can create a Knowledge Base:
- Makes the extraction run results permanent
- Enables querying and analysis of your results with AI
Usage Monitoring
- Check LLM usage insights in the extraction run status
- Smaller chunks and more entity types increase processing costs
- "Think before extraction" and "Reflect on extracted items" in the extraction parameters add processing overhead
Best Practices
Start Simple:
- Begin with a small document collection to test your ontology
- Run extraction on a subset of entity types first
- Iterate on your ontology based on extraction quality