The Due Diligence Problem
A mid-size corporate transaction — a company acquisition, a property portfolio deal, a merger — typically involves hundreds of documents. Shareholder agreements, employment contracts, leases, regulatory filings, board minutes, intellectual property licences, supply chain agreements. Each one needs to be read, understood, and assessed for risk.
In most UK law firms today, this work still falls on associates and paralegals working through document bundles manually, often under significant time pressure. A straightforward M&A transaction might require 300–600 hours of document review. At a cost of £80–£150 per hour for a mid-level associate, that is between £24,000 and £90,000 in fee earner time — on the review work alone, before any legal analysis is written up.
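For anyone who wants to rerun that estimate with their own hours and rates, the arithmetic is simple enough to sketch. The figures below are the ones quoted above; the function name is just an illustrative choice.

```python
# Figures taken from the paragraph above; swap in your own firm's numbers.
HOURS_LOW, HOURS_HIGH = 300, 600   # review hours for a straightforward M&A deal
RATE_LOW, RATE_HIGH = 80, 150      # GBP per hour for a mid-level associate


def review_cost_range(hours_low, hours_high, rate_low, rate_high):
    """Return the (lowest, highest) cost of the review work in GBP."""
    return hours_low * rate_low, hours_high * rate_high


low, high = review_cost_range(HOURS_LOW, HOURS_HIGH, RATE_LOW, RATE_HIGH)
print(f"£{low:,} to £{high:,}")  # prints "£24,000 to £90,000"
```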
The problem is not that solicitors are slow. It is that the work is structurally repetitive: read a lease, extract the key dates, parties, break clauses, and rent review provisions. Repeat for 120 leases. That is a task that does not require legal judgement — it requires careful reading and consistent data extraction. And that is exactly what AI systems are now very good at.
How AI Document Extraction Works in Due Diligence
A well-built AI extraction system for due diligence operates in several stages. First, documents are ingested, whether they arrive as scanned PDFs, Word documents, or native PDFs from Companies House or a data room. OCR (optical character recognition) converts any scanned pages into machine-readable text. Modern OCR tools are generally accurate even on older, lower-quality scans, although badly degraded or handwritten pages are still worth routing to a human check.
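As a rough illustration of the ingestion stage, the sketch below decides whether a file can have its text read directly or needs OCR first. Everything in it is an assumption for the sake of the example: the function name, the route labels, and the 50-characters-per-page threshold are hypothetical, and a real pipeline would obtain the embedded text from whatever PDF library it uses.

```python
from pathlib import Path

# Hypothetical threshold: a PDF with less embedded text than this per page
# is treated as a scan that needs OCR.
MIN_CHARS_PER_PAGE = 50


def ingestion_route(filename, embedded_text="", page_count=1):
    """Pick a processing route for one incoming document."""
    suffix = Path(filename).suffix.lower()
    if suffix in {".doc", ".docx"}:
        return "extract_text"
    if suffix == ".pdf":
        # Native PDFs carry a text layer; scanned PDFs usually carry little or none.
        chars_per_page = len(embedded_text.strip()) / max(page_count, 1)
        return "extract_text" if chars_per_page >= MIN_CHARS_PER_PAGE else "ocr"
    if suffix in {".tif", ".tiff", ".png", ".jpg", ".jpeg"}:
        return "ocr"
    # Anything unrecognised goes to a person rather than being guessed at.
    return "manual_review"
```

Sending unknown file types to manual review rather than guessing keeps the pipeline's failures visible to the legal team.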
Once the text is extracted, a large language model (LLM), the same class of AI model as GPT-4, is given structured instructions for what to find. These instructions are tailored to the document type. For a commercial lease, the system might be asked to identify: the landlord and tenant parties, the lease term start and end dates, the annual rent, any rent review mechanism, break clause dates and conditions, permitted use, alienation restrictions, and any unusual or non-standard clauses.
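To make "structured instructions" concrete, here is a minimal sketch of a lease schema and the prompt built from it. The field names mirror the list above, but the class, the function names, and the prompt wording are illustrative assumptions, and the call to the model itself is deliberately left out because it depends on the provider.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class LeaseRecord:
    """Hypothetical schema for the commercial lease fields listed above."""
    landlord: Optional[str] = None
    tenant: Optional[str] = None
    term_start: Optional[str] = None        # ISO date, e.g. "2021-03-25"
    term_end: Optional[str] = None          # ISO date
    annual_rent_gbp: Optional[float] = None
    rent_review_mechanism: Optional[str] = None
    break_clauses: Optional[str] = None
    permitted_use: Optional[str] = None
    alienation_restrictions: Optional[str] = None
    unusual_clauses: Optional[str] = None


def build_instructions(document_text: str) -> str:
    """Build the extraction prompt for one lease (wording is illustrative)."""
    field_names = [f.name for f in fields(LeaseRecord)]
    return (
        "Extract the following fields from the commercial lease below. "
        "Reply with a single JSON object using exactly these keys, and use "
        "null for anything the lease does not state: "
        + json.dumps(field_names)
        + "\n\n--- LEASE TEXT ---\n"
        + document_text
    )


def parse_reply(reply: str) -> LeaseRecord:
    """Turn the model's JSON reply into a record, ignoring unexpected keys."""
    data = json.loads(reply)
    known = {f.name for f in fields(LeaseRecord)}
    return LeaseRecord(**{k: v for k, v in data.items() if k in known})
```

Asking for null rather than a best guess matters here: a blank field can be flagged for review, whereas a plausible invented value cannot.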
The LLM reads each document and returns structured data: not a summary, but a filled-in record with specific fields and values. That data is then validated (cross-checked against other documents, and flagged if a field is missing or ambiguous) and written to a database or spreadsheet that the legal team can review.
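The validation step might look something like the sketch below. It is only one possible set of checks, under my own assumptions: the required-field list, the date sanity check, and the landlord cross-check against a related document are all hypothetical examples rather than a fixed standard.

```python
from datetime import date
from typing import List, Optional

# Hypothetical list of fields a lease record must always contain.
REQUIRED_FIELDS = ("landlord", "tenant", "term_start", "term_end", "annual_rent_gbp")


def validate_record(record: dict, expected_landlord: Optional[str] = None) -> List[str]:
    """Return human-readable flags for one extracted record (empty list = clean)."""
    flags = []

    # 1. Missing fields.
    for name in REQUIRED_FIELDS:
        if record.get(name) in (None, ""):
            flags.append(f"missing: {name}")

    # 2. Internal consistency: the term must end after it starts.
    start, end = record.get("term_start"), record.get("term_end")
    if start and end:
        try:
            if date.fromisoformat(end) <= date.fromisoformat(start):
                flags.append("term_end is not after term_start")
        except ValueError:
            flags.append("unparseable date in term_start or term_end")

    # 3. Cross-check against another document, e.g. the name on the title register.
    if expected_landlord and record.get("landlord"):
        if record["landlord"].strip().lower() != expected_landlord.strip().lower():
            flags.append("landlord differs from the name on the related document")

    return flags
```

A record that comes back with an empty list goes straight into the schedule; anything else is queued for a person to look at.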
What Gets Extracted
The specific data points extracted depend on the transaction type, but common categories include:
- Contracts and agreements: Parties, effective date, term, termination provisions, payment terms, key obligations, change of control clauses, governing law.
- Property leases: Landlord/tenant, demised premises, lease term, rent and review schedule, break options, repairing obligations, alienation.
- Employment contracts: Role, salary, notice period, restrictive covenants (non-compete, non-solicit), IP assignment clauses.
- Corporate filings: Directors, shareholders, charges registered at Companies House, confirmation statement data.
- IP licences: Licensed rights, territory, exclusivity, royalties, termination triggers.
The output is a structured dataset — typically a spreadsheet or database table — where every document is a row and every extracted field is a column. The legal team can sort, filter, and review at the data level rather than reading every document from scratch.
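As a small illustration of that row-and-column output, the sketch below writes extracted records to CSV and pulls out the leases with an early break date. The column names and helper functions are assumptions chosen for the example; a real schedule would use whatever fields the transaction calls for.

```python
import csv
import io

# Hypothetical column set for a lease schedule: one row per document.
COLUMNS = ["document", "landlord", "tenant", "term_end", "annual_rent_gbp", "break_date"]


def to_schedule_csv(records):
    """Render a list of record dicts as CSV text, one document per row."""
    buffer = io.StringIO()
    # Missing fields become empty cells; fields outside COLUMNS are dropped.
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def breaks_before(records, cutoff_iso):
    """Return records whose break date falls before the cutoff, earliest first."""
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    hits = [r for r in records if r.get("break_date") and r["break_date"] < cutoff_iso]
    return sorted(hits, key=lambda r: r["break_date"])
```

Calling breaks_before(records, "2026-01-01"), for instance, answers "which tenants can walk away before 2026?" without anyone reopening a single lease.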
Time Savings in Practice
Take a real-world example: a property solicitor is handling a portfolio acquisition involving 85 commercial leases. Manually, a paralegal might spend 45 minutes per lease extracting the key terms into a schedule, which is roughly 64 hours of work, spread over two weeks. With an AI extraction pipeline, the same 85 leases are processed in under two hours, with a structured schedule produced automatically. The paralegal's role shifts to reviewing the output, spot-checking flagged items, and handling the genuinely complex cases where the AI has noted ambiguity.
Typical time savings in due diligence document review run between 60% and 85% depending on document type and complexity. The time saving is highest on high-volume, relatively uniform documents (leases, standard employment contracts) and somewhat lower on heavily negotiated bespoke agreements that require more nuanced reading.
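To see how the 85-lease example relates to that range, it helps to add the human review time back in. In the sketch below, the 85 leases, 45 minutes per lease, and two hours of processing come from the example above; the 10 minutes of review per lease is purely my assumption for illustration, and the result moves considerably if you change it.

```python
def manual_hours(n_docs, minutes_per_doc):
    """Hours needed to extract key terms by hand."""
    return n_docs * minutes_per_doc / 60


def saving_fraction(manual, ai_processing, review):
    """Share of the manual hours saved once AI processing and review are counted."""
    return 1 - (ai_processing + review) / manual


N_LEASES = 85
manual = manual_hours(N_LEASES, 45)   # 63.75 hours, the "roughly 64" above
ai_processing = 2                     # upper bound from the example ("under two hours")

# ASSUMPTION (not from the example): 10 minutes of human review per lease.
review = manual_hours(N_LEASES, 10)

saving = saving_fraction(manual, ai_processing, review)
print(f"Manual: {manual:.2f} h, with AI: {ai_processing + review:.2f} h, saving: {saving:.0%}")
```

Under that review assumption the saving comes out at roughly three quarters, which sits inside the 60% to 85% range; with no review time at all it would be far higher, which is why the review step belongs in any honest estimate.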
What AI Does Not Replace
It is important to be clear about what these systems do and do not do. AI extraction does not replace legal judgement. It does not tell you whether a break clause is commercially acceptable, whether a non-compete is enforceable, or whether a particular risk is deal-breaking. Those decisions require a solicitor.
What it does is eliminate the hours of mechanical reading and data entry that currently precede that judgement. When a senior associate can see all 85 leases' key terms in a single spreadsheet in two hours rather than two weeks, they can spend their time on the actual legal analysis — and the client gets a faster, more cost-effective result.
Getting Started
The right approach for most firms is to start with a defined, repeatable document type that appears frequently in their practice — leases, NDAs, employment contracts — and build an extraction pipeline for that specific document class. This produces a working system quickly and demonstrates measurable time savings before expanding to other document types.
If your firm is handling significant volumes of due diligence work and you are interested in what an AI extraction system would look like for your specific practice area, I am happy to walk through the options.