Table Detection Rules

This article explains how Model Reef detects tables inside PDF files during import, and how you can work with the results or improve detection when it falls short.

You will learn:

  • What the table detector looks for.

  • Typical patterns it handles well.

  • Common edge cases.

  • How to intervene when detection is imperfect.

How table detection works at a high level

When you upload a PDF, Model Reef scans each page and looks for areas that behave like tables. It uses a combination of:

  • Text alignment and spacing.

  • Repeating vertical and horizontal structure.

  • Lines, borders and shading where present.

  • Keywords that commonly appear in financial statements (for example, Revenue, Assets, Liabilities).
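
As a rough illustration of how signals like these can be combined, here is a simplified sketch in Python. It is not Model Reef's actual detector; the names (looks_like_table_row, score_region), the regular expression and the thresholds are illustrative assumptions, and a real detector would also use page geometry such as ruling lines and shading.

    import re

    NUMERIC = re.compile(r"^\(?-?[\d,]+(\.\d+)?\)?%?$")   # e.g. 1,234.5  (123)  12%
    KEYWORDS = {"revenue", "assets", "liabilities", "equity", "cash"}

    def looks_like_table_row(cells):
        # A row is "table-like" if most cells after the label are numeric.
        if len(cells) < 2:
            return False
        numeric = sum(bool(NUMERIC.match(c.strip())) for c in cells[1:])
        return numeric / (len(cells) - 1) >= 0.6

    def score_region(rows):
        # Score a candidate page region: the share of table-like rows,
        # nudged upward when financial-statement keywords appear.
        if not rows:
            return 0.0
        table_like = sum(looks_like_table_row(r) for r in rows)
        text = " ".join(c for r in rows for c in r).lower()
        keyword_boost = 0.1 * sum(k in text for k in KEYWORDS)
        return table_like / len(rows) + keyword_boost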

The result is a set of candidate table blocks. Each block has:

  • A header row or rows.

  • A set of row labels.

  • A grid of numeric cells.

You can then accept, adjust or skip each table before proceeding with mapping.
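
Conceptually, each candidate block can be pictured as a small record like the one sketched below. This is an illustrative Python structure only, not Model Reef's internal data model; the field names are assumptions.

    from dataclasses import dataclass

    @dataclass
    class CandidateTable:
        page: int
        header_rows: list[list[str]]      # e.g. [["", "FY23", "FY24"]]
        row_labels: list[str]             # e.g. ["Revenue", "Cost of sales"]
        cells: list[list[float | None]]   # numeric grid; None for blank cells
        accepted: bool = True             # set to False if you skip the table

    # A two-period revenue line detected on page 3
    block = CandidateTable(
        page=3,
        header_rows=[["", "FY23", "FY24"]],
        row_labels=["Revenue"],
        cells=[[1200.0, 1350.0]],
    )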

Patterns that detect well

The detector works best on tables that have:

  • Clearly separated rows and columns.

  • Consistent spacing between columns.

  • A single, obvious header row.

  • Text left-aligned in the first column and numbers right-aligned in the remaining columns.

  • No embedded charts, images or unusual decoration inside the table.

Typical examples include:

  • P&L statements.

  • Balance sheets.

  • Cashflow statements.

  • Notes with standard tabular layouts.

  • KPI tables with one row per metric and one column per period.

If your tables look like these, detection will usually be straightforward.

Common edge cases

Some PDFs contain tables that are technically valid but difficult to parse automatically. Examples include:

  • Highly formatted tables with heavy use of merged cells and nested sections.

  • Free-form layouts where numbers are aligned visually but not in a strict grid.

  • Tables that mix text blocks and numeric data in a single region.

  • Scanned tables with poor OCR or skewed orientation.

  • Multi-table pages where the spacing between tables is very small.

In these cases Model Reef may:

  • Combine two logical tables into one region.

  • Split one logical table into multiple candidate blocks.

  • Miss some rows or columns around the edges.

  • Extract header or footnote text as part of the table.

You can still work with these tables, but you may need to do more manual cleaning in the mapping step.
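
Much of that cleaning is mechanical. For example, footnote or heading text that leaks into an extracted grid usually shows up as rows with no numeric cells, which are easy to filter out once the header rows have been confirmed. A minimal sketch in Python, assuming the grid arrives as lists of strings:

    def clean_body_rows(rows):
        # Applied to the body of an extracted grid, after the header rows have
        # been separated: keep only rows with at least one numeric-looking cell,
        # so stray footnotes and page headings are dropped.
        def is_number(cell):
            try:
                float(cell.replace(",", "").strip("()%"))
                return True
            except ValueError:
                return False
        return [r for r in rows if any(is_number(c) for c in r[1:])]

    rows = [
        ["Revenue", "1,200", "1,350"],
        ["Figures are unaudited.", "", ""],   # footnote picked up by mistake
    ]
    print(clean_body_rows(rows))              # [['Revenue', '1,200', '1,350']]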

Working with detected tables

In the import UI you can usually:

  • Navigate through each detected table one by one.

  • Preview the extracted grid of rows and columns.

  • Decide whether to keep or skip a table.

  • Adjust which rows are treated as headers.

  • Exclude non-data rows before mapping.

If a page contains many small tables but you only care about a few, skip the ones you do not need so the model stays clean.
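
If you pre-screen a large document programmatically before import, the same keep-or-skip decision can be expressed as a simple filter. A hedged sketch; the table summaries and keyword list below are assumptions for illustration only:

    # Each detected table summarised as a page number plus its row labels.
    detected_tables = [
        {"page": 3, "row_labels": ["Revenue", "Cost of sales"]},
        {"page": 3, "row_labels": ["Director biographies"]},
    ]

    KEEP_KEYWORDS = ("revenue", "cost", "ebitda", "cash")

    def worth_keeping(table):
        # Keep a table only if its row labels mention a metric we plan to map.
        labels = " ".join(table["row_labels"]).lower()
        return any(k in labels for k in KEEP_KEYWORDS)

    wanted = [t for t in detected_tables if worth_keeping(t)]
    # wanted now holds only the Revenue / Cost of sales table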

Improving detection quality

You can improve detection outcomes by:

  • Providing higher quality PDFs where possible (not downsampled or flattened).

  • Avoiding scan-to-PDF workflows when you can export the original reports directly from the source system.

  • Removing password protection or unusual security settings.

  • Splitting extremely long or complex documents into smaller logical sections and importing them separately.

If you control the layout of exported reports, using consistent and simple table structures will make future imports easier.
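
For example, splitting a long annual report into smaller sections before import can be scripted with the open-source pypdf library. A minimal sketch; the file names and page ranges are placeholders:

    from pypdf import PdfReader, PdfWriter

    reader = PdfReader("annual_report.pdf")

    # Placeholder ranges (zero-based page indexes): pages 10-25 hold the
    # financial statements, pages 26-40 hold the notes.
    sections = {
        "financial_statements.pdf": range(9, 25),
        "notes.pdf": range(25, 40),
    }

    for name, pages in sections.items():
        writer = PdfWriter()
        for i in pages:
            writer.add_page(reader.pages[i])
        with open(name, "wb") as out:
            writer.write(out)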

When detection is not good enough

If a table is too messy for automatic extraction to be useful, consider:

  • Exporting the same report from the source system as Excel or CSV, then using the Excel or CSV import instead.

  • Copying the relevant data into a clean spreadsheet and importing that.

  • Creating a smaller, simplified table just for modelling purposes.

PDF import is powerful, but it is still limited by the quality and structure of the source document.
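
If you take the Excel or CSV route, a quick tidy-up before import often pays off. A minimal sketch using pandas, assuming the export has a single header row, some blank spacer rows, and a few columns you do not need; the file, sheet and column names are placeholders:

    import pandas as pd

    # Read the exported workbook and drop fully blank spacer rows.
    df = pd.read_excel("pl_export.xlsx", sheet_name="P&L")
    df = df.dropna(how="all")

    # Keep only the label column and the periods you intend to map.
    df = df[["Line item", "FY23", "FY24"]]

    df.to_csv("pl_clean.csv", index=False)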
