How Loss Runs Extraction Actually Works

Theodore Johnson-Kanu · 9 min read

Loss runs are the single most important document in commercial insurance underwriting. They also arrive as PDFs formatted differently by every carrier. Here is why extraction is hard and how automated approaches actually handle it.

What a loss run is

A loss run is the historical claims record for an insured, usually covering the prior three to five policy years. For every claim in the period, the loss run typically includes the date of loss, description, coverage, paid amount, reserve amount, total incurred, and status (open or closed). For liability claims, it may include litigation status and settlement details. For property claims, it may include cause of loss and location.
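The fields above can be sketched as a minimal claim record. The field names and types here are illustrative, not an industry standard:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ClaimRecord:
    """One row of a loss run: the fields most carriers report per claim."""
    date_of_loss: date
    description: str
    coverage: str          # e.g. "GL", "Auto Liability", "Property"
    paid: float            # amounts paid to date
    reserve: float         # amount set aside for expected future payments
    total_incurred: float  # typically paid + reserve
    status: str            # "open" or "closed"
    litigation_status: Optional[str] = None  # liability claims only
    cause_of_loss: Optional[str] = None      # property claims only
    location: Optional[str] = None           # property claims only
```

Everything downstream in this article, from extraction to validation, is ultimately an attempt to populate records like this reliably.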

This document is the foundation of renewal underwriting. An underwriter evaluating a commercial account starts with the loss runs because they are the closest thing to objective evidence about how the account has performed. A clean loss history suggests good risk management. A frequency pattern suggests systemic issues. A severity pattern suggests catastrophic exposure that may not be fully captured in the current policy.

For brokers, loss runs are a bottleneck. Every renewal requires collecting loss runs from each carrier that has covered the account in the prior period. Every new submission to a prospective carrier requires organizing, normalizing, and sometimes re-explaining the loss data. A mid-size commercial broker can spend hundreds of hours per year just on loss run handling.

Why extraction is the hard part

If loss runs arrived as standardized, structured data files, none of this would be difficult. They do not. They arrive as PDFs, and the PDFs are formatted differently by every carrier.

Some carriers produce loss runs with clean tabular layouts that OCR tools can parse reasonably well. Others produce reports with headers, footers, and summary sections interleaved with claim detail, and the claim detail itself may span multiple pages per claim. Some carriers use abbreviations and coverage codes that are not self-explanatory. Some produce loss runs that combine multiple policy periods into a single document without clear delineation.

Variation across carriers is the first extraction challenge. Variation within a single carrier's output over time is the second. Loss run formats change. A carrier's reporting system gets updated, the output format shifts, and the extraction logic that worked last quarter breaks.

A third challenge is that loss runs often arrive with annotations, handwritten notes, or redactions. Some are scanned from paper originals rather than generated directly from the policy administration system. Scanned loss runs introduce OCR quality issues on top of the formatting variability.

The result is a document type that looks, to a human, like a predictable structured report, but that is surprisingly hostile to naive automated extraction.

Automated loss run extraction is one of the problems Polysea was specifically built to solve. Rather than building yet another point solution that handles some formats and breaks on others, we are developing extraction tooling that operates as part of the broader shared data infrastructure, where clean extraction feeds into the shared exposure records that brokers and carriers can both access.

The manual workflow

The standard manual workflow for handling loss runs looks like this:

  1. The broker or underwriter receives the loss run PDF, typically as an email attachment.

  2. The document is opened and visually scanned to confirm it covers the right policy period and account.

  3. Each claim is read and the relevant fields are transcribed into a spreadsheet or underwriting system.

  4. Coverage codes are translated, if necessary, into the receiving party's taxonomy.

  5. The resulting data is used for rating, triage, or analysis.

For an account with ten years of loss data across multiple carriers, this process can take four to eight hours. For large national accounts with hundreds of claims, it can take days. In most commercial insurance operations, this work is done by underwriting assistants, loss control analysts, or brokerage account managers. It is high-volume, repetitive, error-prone work that directly constrains how many accounts a team can process.

What automated extraction actually involves

Automated loss run extraction is often described as "OCR plus AI." That description is technically accurate and not very useful. The actual pipeline involves several distinct steps, each with its own failure modes.

Step 1: Document ingestion and classification

Before anything is extracted, the system needs to confirm that the incoming document is a loss run at all, not a binder, a Schedule of Values, or a policy declaration. It also needs to identify which carrier produced it, because the extraction strategy depends on the source format.

Classification can be done with a combination of filename heuristics, document layout analysis, and small-model text classification. For volume operations, this step is automatable with good accuracy but is not zero-error, especially for carrier formats the system has not seen before.
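The heuristic side of classification can be sketched as follows. The keyword lists and the filename pattern are illustrative assumptions; a production system would combine this with layout analysis and a trained text classifier:

```python
import re

# Illustrative keyword cues; a real system would learn these from labeled documents.
DOC_TYPE_CUES = {
    "loss_run": ["loss run", "claims history", "valuation date", "total incurred"],
    "sov": ["schedule of values", "tiv", "construction class"],
    "binder": ["binder", "bound effective", "subject to"],
}

def classify_document(filename: str, first_page_text: str) -> str:
    """Guess the document type from filename hints and first-page keywords."""
    name = filename.lower()
    if re.search(r"loss[_ ]?run", name):
        return "loss_run"
    text = first_page_text.lower()
    scores = {
        doc_type: sum(cue in text for cue in cues)
        for doc_type, cues in DOC_TYPE_CUES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

Returning "unknown" rather than forcing a guess matters: an unrecognized document should go to a human, not into the wrong extraction pipeline.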

Step 2: Optical character recognition

OCR converts the PDF into machine-readable text. Modern OCR handles cleanly generated PDFs well and handles scanned documents acceptably. The failure modes are tables with unusual layouts, handwritten annotations, and low-quality scans.

The output of OCR is a stream of text with approximate position information, not structured data. Turning that text into claim records is the next step.

Step 3: Structural parsing

Structural parsing is where most of the complexity lives. The system needs to identify where in the document the claim data begins, where each individual claim record starts and ends, and which text corresponds to which field.

For a carrier with a consistent table-based layout, structural parsing can be handled with layout-aware rules. For carriers with variable formats, the parsing typically involves a combination of positional heuristics and language models that can identify field boundaries from context.

The common failure modes at this step include: merging two claims into one when the visual separator is ambiguous, splitting one claim across two records when it spans a page break awkwardly, and misassigning values to fields when the table structure is inconsistent.
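The record-boundary part of structural parsing can be sketched over simulated OCR output. The claim-number pattern here describes one hypothetical carrier format, not a general solution; it is exactly the kind of assumption that breaks when the source format shifts:

```python
import re

# Assumption: in this hypothetical format, each claim record starts at a line
# beginning with a claim number like "CLM-2023-0001".
CLAIM_START = re.compile(r"^CLM-\d{4}-\d{4}\b")

def split_into_claims(ocr_lines: list[str]) -> list[list[str]]:
    """Group OCR text lines into one chunk of lines per claim record."""
    claims: list[list[str]] = []
    current: list[str] = []
    for line in ocr_lines:
        if CLAIM_START.match(line):
            if current:
                claims.append(current)
            current = [line]
        elif current:  # skip header/footer lines before the first claim
            current.append(line)
    if current:
        claims.append(current)
    return claims
```

Note how the failure modes described above map onto this sketch: a missed claim-number match merges two claims into one chunk, and a page-break header that happens to match the pattern splits one claim into two.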

Step 4: Field normalization

Once the raw fields are extracted, they have to be normalized into a consistent schema. Dates arrive in multiple formats. Dollar amounts may include or exclude reserves. Coverage codes vary by carrier. Claim status ("open," "closed," "reopened," "in suit") uses different terminology in different source formats.

Normalization is where a loss run extraction system either delivers usable data or produces structured garbage. Good normalization requires an ontology of insurance data that maps the source variations to a consistent target schema.
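A minimal sketch of the normalization step. The mappings and accepted formats here are a tiny illustrative subset; a real ontology covers far more carrier variants:

```python
from datetime import datetime, date

# Illustrative mappings; real coverage of carrier terminology is much larger.
STATUS_MAP = {
    "open": "open", "o": "open", "reopened": "open", "in suit": "open",
    "closed": "closed", "c": "closed", "closed w/o payment": "closed",
}
DATE_FORMATS = ("%m/%d/%Y", "%m-%d-%Y", "%Y-%m-%d", "%d-%b-%y")

def normalize_date(raw: str) -> date:
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

def normalize_amount(raw: str) -> float:
    """'$12,500.00' or '(500)' (an accounting-style credit) -> signed float."""
    cleaned = raw.strip().replace("$", "").replace(",", "")
    if cleaned.startswith("(") and cleaned.endswith(")"):
        return -float(cleaned[1:-1])
    return float(cleaned)

def normalize_status(raw: str) -> str:
    return STATUS_MAP.get(raw.strip().lower(), "unknown")
```

Raising on an unrecognized date, rather than silently guessing, is the "structured garbage" safeguard: a loud failure here is cheaper than a wrong date in an underwriting system.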

Step 5: Validation and confidence scoring

The last step, and the one most extraction tools underinvest in, is validation. The system should know which fields it is confident about and which it is not. A total incurred value that exceeds the sum of paid and reserve is a validation error. A date of loss that falls outside the stated policy period is a validation error. An open claim with a zero reserve is a data quality flag worth surfacing.

Confidence scoring lets the system hand off only high-confidence extractions to automated downstream use, while routing lower-confidence records to human review. This is the difference between a tool that replaces manual work and a tool that creates a new kind of manual work (reviewing low-quality automated output).
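The validation checks and the confidence-based routing described above can be sketched like this, operating on normalized claim dicts with ISO date strings (field names and the 0.9 threshold are illustrative):

```python
def validate_claim(claim: dict, policy_start: str, policy_end: str) -> list[str]:
    """Return validation flags for one normalized claim (ISO date strings)."""
    flags = []
    # Total incurred exceeding paid + reserve is a validation error.
    if claim["total_incurred"] > claim["paid"] + claim["reserve"] + 0.01:
        flags.append("incurred_exceeds_paid_plus_reserve")
    # A date of loss outside the stated policy period is a validation error.
    if not (policy_start <= claim["date_of_loss"] <= policy_end):
        flags.append("loss_outside_policy_period")
    # An open claim with a zero reserve is a data quality flag, not an error.
    if claim["status"] == "open" and claim["reserve"] == 0:
        flags.append("open_claim_zero_reserve")
    return flags

def route(claim: dict, flags: list[str], min_confidence: float = 0.9) -> str:
    """Hand off only clean, confident records to automated downstream use."""
    if flags or claim.get("confidence", 0.0) < min_confidence:
        return "human_review"
    return "automated"
```

The routing function is the whole point of the step: the flags do not block processing, they decide which records a human looks at.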

What good extraction output looks like

A well-designed loss run extraction system produces output with several properties:

  • Structured records, one per claim, with consistent field names across carriers.

  • Source attribution, so every field can be traced back to the exact location in the source PDF.

  • Confidence scores per field, not just per document.

  • Validation flags for records that have internal inconsistencies.

  • Normalization metadata explaining how coverage codes and statuses were translated.

  • Round-trip verifiability, meaning a human reviewer can quickly check an extracted value against the source.
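Put together, an output record with these properties might look like the following. The field names, the bounding-box convention, and the specific codes are illustrative, not a published schema:

```python
import json

# An illustrative extracted-claim record carrying the properties listed above.
record = {
    "claim": {
        "date_of_loss": "2022-03-14",
        "coverage": "general_liability",
        "paid": 12000.0,
        "reserve": 8000.0,
        "total_incurred": 20000.0,
        "status": "open",
    },
    "source": {  # attribution: where the values came from in the PDF
        "file": "acme_lossrun_2023.pdf",
        "page": 4,
        "paid_bbox": [112, 540, 178, 552],  # x0, y0, x1, y1 in page coordinates
    },
    "confidence": {"paid": 0.99, "reserve": 0.91, "status": 0.78},  # per field
    "validation_flags": [],
    "normalization": {  # how source codes and statuses were translated
        "coverage": {"raw": "GL-PREM/OPS", "mapped_to": "general_liability"},
        "status": {"raw": "O", "mapped_to": "open"},
    },
}
print(json.dumps(record, indent=2))
```

The `source` block is what makes round-trip verifiability cheap: a reviewer can jump straight to page 4 and check the extracted paid amount against the PDF.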

The gap between a basic "OCR a PDF and return a spreadsheet" tool and a useful extraction system is largely in these output properties. The extraction itself is solvable. Producing output that an underwriter can actually trust is harder.

Where AI language models fit

Large language models have changed what is practical in document extraction over the past two years. For loss runs specifically, they help in several ways:

  • Format-flexible parsing. A language model can parse a loss run format it has never seen before, by reasoning about what the fields should be rather than relying on a pre-built template.

  • Context-aware field extraction. Distinguishing a reserve from a paid amount based on surrounding text, even when labels are ambiguous.

  • Coverage code translation. Mapping carrier-specific codes to standardized categories based on context.

  • Summary generation. Producing narrative summaries of loss patterns that an underwriter can review quickly.

Language models do not solve the extraction problem on their own. They introduce new failure modes (confident extractions of wrong values) and they require validation infrastructure to be trustworthy. But used well, they dramatically reduce the engineering effort required to handle new carrier formats.
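A sketch of how model-based parsing might be framed. Here `call_llm` is a hypothetical stand-in for whatever model client is in use, and the target schema is illustrative; the point is that the model is asked for a fixed schema and its output is parsed and validated, never trusted blindly:

```python
import json

# Illustrative target schema; the model is told exactly what shape to return.
TARGET_SCHEMA = {
    "date_of_loss": "ISO 8601 date or null",
    "description": "free text",
    "coverage": "one of: general_liability, auto, property, workers_comp, other",
    "paid": "number in USD or null",
    "reserve": "number in USD or null",
    "total_incurred": "number in USD or null",
    "status": "one of: open, closed",
}

def build_extraction_prompt(claim_text: str) -> str:
    """Ask the model for one JSON object matching the schema, nothing else."""
    return (
        "Extract one insurance claim from the loss run text below.\n"
        "Return only a JSON object with these fields:\n"
        f"{json.dumps(TARGET_SCHEMA, indent=2)}\n"
        "If a field is not present in the text, use null. Do not guess amounts.\n\n"
        f"Loss run text:\n{claim_text}"
    )

def extract_claim(claim_text: str, call_llm) -> dict:
    """call_llm is a hypothetical model client: prompt string in, text out.
    The parsed result still goes through normalization and validation."""
    raw = call_llm(build_extraction_prompt(claim_text))
    return json.loads(raw)
```

The "do not guess amounts" instruction and the downstream validation are the guards against the confident-but-wrong failure mode described above.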

What this unlocks

Loss run extraction is not an exciting product category. It is plumbing. But the plumbing has downstream consequences that are meaningful.

For brokers, automated extraction shifts the time spent on loss runs from data entry to analysis. Instead of transcribing, the team reviews patterns, identifies risk management opportunities, and prepares recommendations for the insured.

For carriers, automated extraction at submission intake accelerates quote turnaround, which is a meaningful competitive factor. Carriers that can produce indicative quotes in hours rather than days win more business.

For insureds, the downstream effect is faster and more informed renewal conversations. Time that was spent on data plumbing becomes time spent on strategic risk decisions.

For the industry as a whole, structured loss data at scale enables pattern analysis that is currently infeasible. Loss patterns across portfolios, emerging risk categories, and frequency-severity trends become visible when the underlying data is clean.

How to evaluate an extraction tool

Useful questions when evaluating any loss run extraction tool:

  • How many carrier formats does it handle out of the box? Look for a concrete list, not a vague claim. A realistic number is dozens, not hundreds.

  • How does it handle formats it has not seen before? Ideally, a combination of model-based parsing and a review workflow for uncertain extractions.

  • What is the output schema? It should be structured and versioned, with source attribution and confidence scores.

  • How does it handle validation? Look for internal consistency checks, outlier detection, and clear flags for human review.

  • What is the review workflow? Low-confidence extractions should be easy to review and correct, with corrections improving future extractions.

  • How is the data delivered? Via API, direct integration with common broker and carrier systems, or export to standard formats.

The answers to these questions distinguish a production-grade extraction tool from a demo. Both exist in the market. The difference matters a lot when the output is feeding underwriting decisions.

Conclusion

Loss run extraction is a deceptively hard problem hiding inside a boring document. The variability of source formats, the subtlety of field-level semantics, and the need for trustworthy output make it a non-trivial engineering challenge. Modern tooling, particularly language-model-assisted extraction with good validation, has made production-grade extraction practical in a way it was not five years ago. The industry is still early in deploying these tools at scale, but the direction is clear: the time brokers and underwriters spend transcribing loss runs is going to become one of those historical anachronisms that younger practitioners will find hard to believe was ever necessary.

Polysea is building neutral infrastructure for the commercial insurance ecosystem, including shared exposure data management, authorization chain tooling, and automated loss run extraction. If the problems described in this article are relevant to your work, we would like to hear from you at hello@polysea.ai.