Convert the contract to plain text (from PDF via a PDF extraction library, or from DOCX via python-docx or a similar tool), preserving section headings and page numbers for later citation.
Split the document into overlapping chunks (e.g., 1500–2000 tokens with 200-token overlap) aligned to paragraph or section boundaries so clauses are not split mid-sentence.
For each chunk, prompt the LLM to extract targeted clause types or metadata fields (parties, effective date, governing law, termination provisions, etc.) and return results as structured JSON.
Merge and deduplicate extractions across overlapping chunks; where the same clause appears in multiple chunks, resolve conflicts by preferring the extraction that captures the clause most completely (for example, the one whose excerpt is not cut off at a chunk boundary).
For every extracted field, record the source chunk index and a verbatim excerpt (a span of the original text) so downstream consumers can verify accuracy against the original document.
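The chunking and merging steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it assumes the contract has already been split into paragraphs, approximates token counts by whitespace-separated words (a real pipeline would use the model's tokenizer), and treats the longest excerpt as the "most complete" one. The names `Chunk`, `count_tokens`, `chunk_paragraphs`, and `merge_extractions` are hypothetical, and the LLM call itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    index: int
    text: str


def count_tokens(text):
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())


def chunk_paragraphs(paragraphs, max_tokens=1800, overlap_tokens=200):
    """Group whole paragraphs into chunks, repeating trailing paragraphs as overlap.

    max_tokens is a soft limit: paragraphs are never split, so an oversized
    paragraph (or overlap plus one paragraph) can push a chunk past it.
    """
    chunks = []
    current = []
    fresh = 0  # paragraphs in `current` that are not carried-over overlap
    for para in paragraphs:
        size = sum(count_tokens(p) for p in current)
        if fresh and size + count_tokens(para) > max_tokens:
            chunks.append(Chunk(len(chunks), "\n\n".join(current)))
            # Carry trailing paragraphs forward until the overlap budget is met.
            carried, carried_size = [], 0
            for prev in reversed(current):
                if carried_size >= overlap_tokens:
                    break
                carried.insert(0, prev)
                carried_size += count_tokens(prev)
            current, fresh = carried, 0
        current.append(para)
        fresh += 1
    if fresh:
        chunks.append(Chunk(len(chunks), "\n\n".join(current)))
    return chunks


def merge_extractions(per_chunk):
    """Merge per-chunk results into one record per field, keeping provenance.

    per_chunk: list of (chunk_index, fields), where fields maps a field name
    to None (absent) or {"value": ..., "excerpt": verbatim source text}.
    """
    merged = {}
    for chunk_index, fields in per_chunk:
        for name, found in fields.items():
            merged.setdefault(name, None)  # absent everywhere stays None
            if found is None:
                continue
            best = merged[name]
            # Assumption: the longest verbatim excerpt is the most complete one.
            if best is None or len(found["excerpt"]) > len(best["excerpt"]):
                merged[name] = dict(found, chunk_index=chunk_index)
    return merged
```

Each merged record keeps the chunk index and the verbatim excerpt next to the value, so a reviewer can trace any field back to the source text. Preferring the longest excerpt is only one possible tie-break; a pipeline might instead prefer the chunk in which the clause does not touch a chunk boundary.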
Known gotchas
LLMs can hallucinate clause content that is not in the document, especially when prompted broadly; constrain prompts to extract only explicitly present text and instruct the model to return null for absent fields rather than inferring them.
Long contracts (100+ pages) can exceed a model's context window; chunking is then necessary, but it risks missing clauses that cross-reference text in other chunks. Consider a second pass that checks the full table of extracted clauses for internal consistency.
Extracted clause data is not legal advice and may contain errors; all outputs must be reviewed by qualified legal counsel before being relied upon for legal or business decisions.
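One inexpensive guard against the hallucination gotcha is to check that every excerpt the model returned actually occurs in the source text before the record goes to review. The sketch below assumes each extracted field is either None or a dict with an "excerpt" key holding the claimed verbatim text; `normalize` and `verify_excerpts` are hypothetical helper names. Only whitespace is normalized, since text extracted from PDFs often wraps lines differently from what the model echoes back.

```python
import re


def normalize(text):
    # Collapse runs of whitespace so line wrapping does not cause false mismatches.
    return re.sub(r"\s+", " ", text).strip()


def verify_excerpts(document_text, extracted):
    """Split fields into those whose excerpt occurs in the document and those that do not.

    extracted: maps a field name to None or {"excerpt": str, ...}.
    Returns (verified, rejected); None fields pass through as verified absences.
    """
    haystack = normalize(document_text)
    verified, rejected = {}, {}
    for name, record in extracted.items():
        if record is None:
            verified[name] = None
            continue
        needle = normalize(record.get("excerpt") or "")
        # An empty excerpt proves nothing, so it is rejected as well.
        if needle and needle in haystack:
            verified[name] = record
        else:
            rejected[name] = record
    return verified, rejected
```

A rejected field is not necessarily wrong (the model may have paraphrased a real clause), but it cannot be traced to the source and is a natural candidate for re-extraction or closer manual review. This check does not replace review by counsel.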