The Core Problem: Documents Are Not Data
Most organisations hold enormous amounts of useful information locked inside documents. Contracts, invoices, reports, filings, correspondence, application forms. The information is there — the parties to an agreement, the financial terms, the key dates — but it is buried in prose and formatted pages rather than stored as structured, queryable data.
To do anything systematic with that information — analyse it, report on it, feed it into another system — someone has to read each document and manually transfer the relevant data into a spreadsheet or database. For large document sets, this is one of the most time-consuming and error-prone tasks in professional services.
Modern AI extraction pipelines solve this. Here is how they work, stage by stage.
Stage 1: Document Ingestion
The first step is getting the documents into the system. Documents typically arrive in several formats:
- Native PDFs — PDFs that were created digitally (e.g., exported from Word). These contain machine-readable text already embedded.
- Scanned PDFs — PDFs created by scanning a physical document. These are images; there is no underlying text layer.
- Word documents (.docx) — Generally straightforward to parse, as the XML structure is accessible.
- Images (JPEG, PNG, TIFF) — Scanned documents saved as image files rather than PDFs.
The pipeline needs to handle all of these. For native PDFs and Word documents, text extraction is direct. For scanned documents and images, an OCR step is required first.
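The routing decision above can be sketched in a few lines. This is an illustrative helper, not a production implementation: in a real pipeline, whether a PDF has an embedded text layer would be determined by probing the file with a PDF library, whereas here it is passed in as a flag so the routing logic stands alone.

```python
from pathlib import Path

def route_document(path: str, pdf_has_text_layer: bool = True) -> str:
    """Decide which processing route a document takes on ingestion."""
    ext = Path(path).suffix.lower()
    if ext == ".docx":
        return "parse-text"          # XML structure is directly accessible
    if ext in {".jpg", ".jpeg", ".png", ".tif", ".tiff"}:
        return "ocr"                 # image files always need OCR
    if ext == ".pdf":
        # native PDFs carry an embedded text layer; scanned PDFs do not
        return "parse-text" if pdf_has_text_layer else "ocr"
    return "unsupported"
```

Keeping the probe (is there a text layer?) separate from the routing rule makes the rule trivial to test on its own.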
Stage 2: OCR (Optical Character Recognition)
OCR converts an image of text into actual machine-readable characters. Modern OCR tools — such as Tesseract (open source) or commercial alternatives like AWS Textract or Google Document AI — are highly accurate on clean scans, typically achieving 98–99% character accuracy on good-quality documents.
The accuracy drops on low-quality scans, unusual fonts, handwriting, or documents with complex layouts (tables, multi-column text, headers/footers that overlap with body text). A good extraction pipeline includes pre-processing steps to improve scan quality before OCR — deskewing, contrast adjustment, noise reduction — and post-processing to catch and correct common OCR errors.
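As a small example of the post-processing step, the sketch below corrects a classic OCR failure mode: letters misread as digits inside currency amounts ('O' for '0', 'l' or 'I' for '1'). The pattern is deliberately narrow so it only rewrites characters inside an amount, never ordinary prose; a real pipeline would accumulate many such rules from observed errors in its own document set.

```python
import re

# Match a currency amount, including characters OCR commonly
# substitutes for digits.
AMOUNT = re.compile(r"£[\dOlI,.]+")

def fix_misread_amounts(text: str) -> str:
    def _repair(m: re.Match) -> str:
        return (m.group(0)
                .replace("O", "0")
                .replace("l", "1")
                .replace("I", "1"))
    return AMOUNT.sub(_repair, text)
```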
For documents that mix machine-readable and handwritten content (common in legal and financial contexts), hybrid approaches are used — OCR for printed text, and either human review or specialist handwriting recognition for handwritten portions.
Stage 3: Text Cleaning and Structure Detection
Raw OCR output is not clean text. It contains page numbers, headers, footers, watermarks, stray characters, and formatting artefacts. Before the AI extraction step, the text needs to be cleaned: irrelevant elements removed, paragraphs properly reassembled (OCR often breaks lines mid-sentence), tables identified and structured appropriately.
For complex documents, layout analysis is also performed at this stage — identifying which text is in the main body, which is in headers and footers, which is in tables, and which is in margin notes or annotations. This structure matters for extraction accuracy: a rent figure in a table carries a different significance from the same number in a narrative paragraph.
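Two of the cleaning steps described above — dropping page-number lines and rejoining lines broken mid-sentence — can be sketched as follows. This is a simplified illustration: it uses end-of-line punctuation as a crude signal that a sentence is complete, which real pipelines refine considerably.

```python
import re

# Lines that are just a page marker, e.g. "3", "Page 3", "Page 3 of 12".
PAGE_MARKER = re.compile(r"^\s*(?:Page\s+)?\d+(?:\s+of\s+\d+)?\s*$", re.IGNORECASE)

def clean_ocr_text(raw: str) -> str:
    """Drop page-number lines and rejoin lines broken mid-sentence."""
    paragraphs: list[str] = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or PAGE_MARKER.match(line):
            continue                      # discard blanks and page markers
        if paragraphs and not paragraphs[-1].endswith((".", ":", "!", "?")):
            if paragraphs[-1].endswith("-"):
                # rejoin a word hyphenated across a line break
                paragraphs[-1] = paragraphs[-1][:-1] + line
            else:
                paragraphs[-1] += " " + line
        else:
            paragraphs.append(line)
    return "\n".join(paragraphs)
```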
Stage 4: LLM-Based Extraction
This is where the AI does its core work. A large language model (LLM) — the same technology underlying tools like GPT-4 or Claude — is given the cleaned document text alongside a structured prompt that specifies exactly what to extract.
The prompt is designed for the specific document type. For a commercial lease, it might instruct the model to identify and return: the landlord's name, the tenant's name, the demised premises address, the lease start date, the lease end date, the initial annual rent, the rent review mechanism, any break clause dates and conditions, and any provisions that appear to deviate from a standard commercial lease.
The LLM reads the document and returns structured output — typically in JSON format — containing the requested fields and their values. This is not keyword matching or template-based extraction; the model understands context. It can identify that "the term shall commence on the date of this deed" means the start date is the execution date, even though no explicit date is written in that sentence.
Unlike rules-based extraction — which breaks when documents vary from an expected format — LLM extraction handles variation naturally, because the model understands what the text means, not just what it looks like.
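The prompt-and-parse halves of this stage can be sketched as below. The field names and prompt wording are illustrative, not a prescribed schema, and the actual model call is omitted; the parsing side shows a practical detail — models sometimes wrap their JSON reply in a code fence, so the parser tolerates one.

```python
import json

# Illustrative field set for a commercial lease.
LEASE_FIELDS = [
    "landlord", "tenant", "premises",
    "term_start", "term_end", "annual_rent",
]

def build_prompt(document_text: str) -> str:
    """Assemble the extraction instruction sent to the model."""
    return (
        "Extract the following fields from the lease below and reply "
        "with a single JSON object using exactly these keys: "
        + ", ".join(LEASE_FIELDS)
        + ". Use ISO dates (YYYY-MM-DD) and a plain number for the rent. "
        "If a field cannot be determined, set it to null.\n\n"
        + document_text
    )

def parse_extraction(model_output: str) -> dict:
    """Parse the model's JSON reply, tolerating a code fence around it."""
    cleaned = model_output.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`").removeprefix("json").strip()
    data = json.loads(cleaned)
    missing = [f for f in LEASE_FIELDS if f not in data]
    if missing:
        raise ValueError(f"model reply missing fields: {missing}")
    return data
```

Instructing the model to return null for undeterminable fields, rather than omitting them, keeps the output schema stable and makes the missing-field check meaningful.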
Stage 5: Validation and Confidence Scoring
LLMs are very capable but not infallible. A well-engineered extraction pipeline does not treat every output as correct. Validation steps include:
- Format validation — Is the extracted date in a valid date format? Is the rent figure a number?
- Cross-document consistency checks — If the same party name appears in 50 documents, do all extractions match?
- Confidence flagging — The model can be instructed to indicate when it is uncertain about an extraction. These items are queued for human review rather than passed through automatically.
- Mandatory field checks — If a required field is missing from the output, the document is flagged rather than silently producing an incomplete record.
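The checks above can be combined into a single validation pass over each extracted record. This sketch assumes the illustrative lease fields from earlier and a model-reported confidence score between 0 and 1; the 0.8 threshold is an arbitrary example, and in practice the threshold is tuned against reviewed samples.

```python
from datetime import date

def validate_record(record: dict, required: tuple[str, ...]) -> list[str]:
    """Return a list of issues; an empty list means the record passes."""
    issues = []
    for field in required:
        if record.get(field) in (None, ""):
            issues.append(f"{field}: missing")          # mandatory field check
    for field in ("term_start", "term_end"):
        value = record.get(field)
        if value:
            try:
                date.fromisoformat(value)               # format validation
            except ValueError:
                issues.append(f"{field}: not an ISO date")
    rent = record.get("annual_rent")
    if rent is not None and (not isinstance(rent, (int, float)) or rent <= 0):
        issues.append("annual_rent: not a positive number")
    if record.get("confidence", 1.0) < 0.8:             # confidence flagging
        issues.append("low confidence: route to human review")
    return issues
```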
Human review is not eliminated — it is targeted. Instead of a person reading every document, they review only the flagged items: the ones where the AI was uncertain, or where validation checks failed. This is a much more efficient use of review time.
Stage 6: Output to Database or Spreadsheet
The validated extracted data is written to the output system. This might be:
- A structured database (PostgreSQL, SQL Server) that other systems can query
- A spreadsheet (Excel, Google Sheets) for direct use by the team
- An integration with an existing system (a case management system, a property management platform, a CRM)
- A structured JSON or CSV export for further processing
The output format is determined by how the data will be used. For ongoing pipelines where new documents are added regularly, database storage with an API is usually the right approach. For one-off extraction projects, a clean spreadsheet is often sufficient.
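For the spreadsheet and CSV cases, the final step is a straightforward serialisation of the validated records. A minimal sketch, assuming records are plain dictionaries and the chosen columns are passed in explicitly:

```python
import csv
import io

def records_to_csv(records: list[dict], fieldnames: list[str]) -> str:
    """Serialise validated records to CSV text; keys outside
    fieldnames are ignored, missing keys become empty cells."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

Passing the column list in, rather than deriving it from the records, keeps the output schema stable even when individual extractions are missing fields.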
What Good Extraction Looks Like
A well-built extraction pipeline is not just technically functional — it is built around the specific documents and use case it needs to serve. The extraction prompts are developed and refined using real examples of the documents in question. The validation rules are designed around what errors would matter most. The output format matches what the downstream users actually need.
This is why off-the-shelf document extraction tools often underperform: they are built to handle any document, which means they are not optimised for your documents. A custom-built pipeline, tuned for your specific document types, consistently outperforms generic tools on accuracy and on the relevance of what it extracts.
If your firm is sitting on large volumes of documents that contain information you need but cannot easily access, document extraction is likely a straightforward and high-value automation project.