Case studies / Noqta

Rethinking document intelligence without the ML tax

Use case
Document layout analysis
Industry
Document AI, Data extraction
Tech stack

The instinct when building a document understanding system in 2026 is predictable. You reach for a fine-tuned layout model. You spin up a GPU cluster. You wrangle training data. Then you wait. We took a different path. It is faster, leaner, cheaper, and easier to explain. Built in the same amount of time, it also outperformed the model-first alternatives.

Noqta is a Python package. It takes raw PDF documents and turns them into high-quality image crops. These crops include text blocks, tables, figures, and stamps. They are ready for OCR. No training data. No GPU required. No black box. Just a tightly engineered pipeline built from a deep understanding of the problem space.

+10%

F1-score improvement in OCR accuracy

100%

CPU-based processing pipeline

The problem with the default playbook

Most teams confronting document layout analysis immediately turn to object detection models or document-specific foundation models. These tools exist for good reason, but they carry hidden costs that compound fast:

• Training data dependency. Every new document class, language, or layout variation can require annotated examples. That cost never goes away.
• Infrastructure overhead. GPU setup, model versions, and inference servers cost money. They also do not solve your real problem.
• Opacity. When a model crops a box in the wrong place, debugging it is largely guesswork. You retrain, you hope.
• Latency and throughput. Batch inference through a neural network puts a hard ceiling on how fast you can process documents at scale.

We looked at this and asked a more useful question: what do we actually need to know about a document to split it well? The answer turned out to be simpler than the industry assumes.

The Noqta approach: classical CV, done seriously

A document page, at its core, is a distribution of dark pixels on a light background. Coherent regions - paragraphs, tables, figures - cluster together. Whitespace separates them. This is not a profound observation, but acting on it with precision and engineering discipline produces something genuinely powerful.

Noqta's pipeline is built around five composable stages, each tunable, each inspectable, and each running entirely on CPU.

1. Adaptive rendering and binarization

Instead of processing documents at a fixed DPI, Noqta calculates a dynamic scale factor for each page so that the longest side fits within a configurable pixel budget. This keeps clustering fast regardless of whether you're processing a business card scan or an A3 engineering drawing. Binarization uses Otsu's method by default, automatically finding the optimal threshold from the pixel intensity histogram - no hand-tuned constants, no fragility.
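The two calculations this stage rests on - the dynamic scale factor and Otsu's threshold - can be sketched in a few lines of NumPy. The function names here are illustrative, not Noqta's actual API:

```python
import numpy as np

def dynamic_scale(width, height, pixel_budget=600):
    """Scale factor so the page's longest side fits the pixel budget."""
    return min(1.0, pixel_budget / max(width, height))

def otsu_threshold(gray):
    """Otsu's method: pick the threshold that maximizes the
    between-class variance of the grayscale histogram."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    omega = np.cumsum(hist) / total                      # class-0 probability
    mu = np.cumsum(hist * np.arange(256)) / total        # class-0 cumulative mean
    mu_t = mu[-1]                                        # global mean
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1 - omega))
    sigma_b = np.nan_to_num(sigma_b)                     # degenerate classes -> 0
    return int(np.argmax(sigma_b))
```

On a clean bimodal page (dark ink on light paper), the returned threshold lands between the two intensity peaks, so pixels at or below it can be treated as content.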

2. EDT smudging: connecting fragments intelligently

The most elegant piece of the pipeline. Rather than relying on morphological dilation with a fixed kernel (coarse, lossy), Noqta optionally applies a Euclidean Distance Transform to connect fragmented pixel clusters before clustering begins. EDT computes, for every white pixel, its distance to the nearest black pixel. Thresholding this distance map at a small value “blooms” content regions outward by a precise margin - close enough to connect broken characters or dotted line segments, far enough to keep real whitespace intact. The result is that hierarchical clustering sees complete, coherent regions rather than pixel confetti.

EDT smudging connects fragmented pixels into coherent regions while preserving real whitespace.
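Assuming a boolean mask where True marks black content pixels, the EDT smudge described above can be sketched with SciPy (the function name is illustrative):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def edt_smudge(black, max_gap=3.0):
    """Bloom content outward: any pixel within `max_gap` of a black
    pixel becomes black, bridging broken strokes while leaving
    wide whitespace untouched."""
    # distance_transform_edt gives each nonzero pixel its distance to the
    # nearest zero, so invert the mask to measure distance-to-content.
    dist = distance_transform_edt(~black)
    return dist <= max_gap
```

Two ink fragments separated by a gap smaller than 2 × max_gap end up in one connected region, while genuine column gutters and margins stay white.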

3. Frame removal via Hough transform

Scanned documents frequently carry page borders, ruled lines, or frame artifacts that would contaminate any clustering step. Noqta’s Clusterer runs a Sobel edge detector followed by a probabilistic Hough transform to find near-perfect horizontal and vertical lines that span most of the page’s width or height. These lines are whitened out before any clustering occurs, so the clustering step sees only genuine content - not scanner noise.

Detected page borders and frame lines are removed before clustering.
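The real pipeline uses Sobel edges plus a probabilistic Hough transform; as a simplified, dependency-light stand-in, the same idea - whiten any near-full-span horizontal or vertical run - can be sketched as:

```python
import numpy as np

def remove_frame_lines(black, span_frac=0.9):
    """Whiten rows/columns whose black-pixel count spans most of the
    page - a simplified stand-in for the Sobel + Hough stage."""
    h, w = black.shape
    out = black.copy()
    row_black = black.sum(axis=1)   # black pixels per row
    col_black = black.sum(axis=0)   # black pixels per column
    out[row_black >= span_frac * w, :] = False   # horizontal frame lines
    out[:, col_black >= span_frac * h] = False   # vertical frame lines
    return out
```

A proper Hough-based implementation is more robust - it tolerates slight skew and detects lines that do not align with the pixel grid - but the suppression principle is the same.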

4. Hierarchical clustering on Black Pixel Coordinates

This is where the core intelligence lives. Noqta collects the (x, y) coordinates of every black pixel in the binarized, smudged image and feeds them into agglomerative hierarchical clustering, with a configurable distance threshold and linkage strategy.

Single-linkage clustering works well here: it merges two clusters as soon as any two of their points fall within the distance threshold. This naturally follows the chaining patterns of text lines and table rows, and it handles irregular figure boundaries without needing to know their shape or content. The result is a set of pixel clusters that closely match the meaningful visual regions on the page.

Black pixel coordinates are clustered into coherent document regions.
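A sketch of this stage with SciPy's single-linkage hierarchical clustering, assuming a boolean content mask as input (the function name is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_black_pixels(black, distance_threshold=3.0):
    """Single-linkage agglomerative clustering over black-pixel (x, y)
    coordinates; returns one bounding box (x0, y0, x1, y1) per cluster."""
    ys, xs = np.nonzero(black)
    coords = np.column_stack([xs, ys]).astype(float)
    Z = linkage(coords, method="single")
    labels = fcluster(Z, t=distance_threshold, criterion="distance")
    boxes = []
    for lbl in np.unique(labels):
        pts = coords[labels == lbl]
        (x0, y0), (x1, y1) = pts.min(axis=0), pts.max(axis=0)
        boxes.append((int(x0), int(y0), int(x1), int(y1)))
    return boxes
```

Note that single-linkage over N points is quadratic in time and memory, which is exactly why the detection pass runs at a reduced resolution.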

5. Multi-stage box suppression and tall-box splitting

Raw cluster bounding boxes are noisy. Noqta's Suppressor runs a principled multi-stage cleanup:

• Oversized boxes (exceeding a configurable fraction of page area) are removed.
• Fully enclosed boxes are eliminated.
• High-overlap pairs are merged into their union, iteratively until stable.
• Small, isolated box fragments are grouped by proximity graph, merged, and filtered.

Boxes that survive suppression and still span more than a configurable fraction of the page height are handed to the Splitter. It finds horizontal runs of near-solid white or black pixels across the full image width. These runs are real content separators and become cut points. It then scales the split lines back to the high-DPI image for precise cropping.

Overlapping and noisy boxes are merged, filtered, and split into clean regions.
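The containment and overlap-merge rules above can be sketched on (x0, y0, x1, y1) boxes. This is a simplified version; the real Suppressor also drops oversized boxes and groups small fragments by proximity graph:

```python
def _contains(outer, inner):
    """True if `outer` fully encloses `inner`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def _overlap_frac(a, b):
    """Intersection area divided by the smaller box's area."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    if ix0 > ix1 or iy0 > iy1:
        return 0.0
    inter = (ix1 - ix0 + 1) * (iy1 - iy0 + 1)
    smaller = min((a[2] - a[0] + 1) * (a[3] - a[1] + 1),
                  (b[2] - b[0] + 1) * (b[3] - b[1] + 1))
    return inter / smaller

def suppress(boxes, overlap_threshold=0.5):
    # Stage 1: drop boxes fully enclosed by another box.
    kept = [b for i, b in enumerate(boxes)
            if not any(i != j and _contains(o, b)
                       for j, o in enumerate(boxes))]
    # Stage 2: merge high-overlap pairs into their union until stable.
    changed = True
    while changed:
        changed = False
        for i in range(len(kept)):
            for j in range(i + 1, len(kept)):
                if _overlap_frac(kept[i], kept[j]) >= overlap_threshold:
                    a, b = kept[i], kept[j]
                    union = (min(a[0], b[0]), min(a[1], b[1]),
                             max(a[2], b[2]), max(a[3], b[3]))
                    kept = [x for k, x in enumerate(kept) if k not in (i, j)]
                    kept.append(union)
                    changed = True
                    break
            if changed:
                break
    return kept
```

Iterating until stable matters: merging two boxes can create a union that now overlaps a third.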

Two-pass resolution: speed and fidelity without compromise

One architectural decision that distinguishes Noqta from naive approaches is the deliberate separation of detection DPI from extraction DPI. Clustering runs on a low-resolution rendering (default 600px longest side). This keeps the black pixel count tractable and clustering fast.

Once the final boxes are set, the same page is re-rendered at high resolution using a configurable zoom_rate (default 3×), and crops are extracted from this full-fidelity image. The coordinate mapping between the two resolutions is handled exactly, with independent x/y scale factors to eliminate rounding drift. You get the speed of a small image for detection and the quality of a large image for output - at the same time.
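The low-to-high resolution mapping with independent x and y factors can be sketched as follows (the function name is an assumption, not Noqta's API):

```python
def map_box_to_hires(box, lo_size, hi_size):
    """Map a detection-resolution box (x0, y0, x1, y1) onto the
    high-DPI render using independent x and y scale factors,
    so rounding drift cannot accumulate across axes."""
    (lo_w, lo_h), (hi_w, hi_h) = lo_size, hi_size
    sx, sy = hi_w / lo_w, hi_h / lo_h   # separate factors per axis
    x0, y0, x1, y1 = box
    return (round(x0 * sx), round(y0 * sy),
            round(x1 * sx), round(y1 * sy))
```

Using separate factors matters because the renderer may not produce exactly proportional widths and heights after integer pixel rounding; a single shared factor would skew one axis.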

Fully observable, fully debuggable

Every stage writes its intermediate output to disk with sequential numbered filenames: binarized images, smudged images, detected frame lines, cluster scatter plots, box visuals at each suppression stage, and the final high-resolution crops. When something goes wrong - and in production document processing, things always go wrong - the diagnostic trail is right there.

This is what we mean when we say classical CV done seriously. It is not just about avoiding a neural network. It is about building a system where every decision is inspectable. Every parameter is clear. Every failure mode is diagnosable without a PhD or a GPU debugger.

Performance built in

Because the whole pipeline runs on the CPU, deployment is simple. There are no CUDA dependencies, no driver issues, and no GPU scheduling overhead. It scales horizontally across workers trivially. The built-in Timer context manager measures every pipeline stage for each document and page and writes structured timing CSVs, making benchmarking and bottleneck hunting a first-class workflow rather than an afterthought.
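A minimal sketch of such a stage timer - the class and method names here are assumptions, not Noqta's actual API:

```python
import csv
import time
from contextlib import contextmanager

class StageTimer:
    """Collect per-stage wall-clock timings and dump them as CSV rows."""

    def __init__(self):
        self.rows = []

    @contextmanager
    def stage(self, document, page, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record even if the stage raised, so partial runs are visible.
            self.rows.append((document, page, name,
                              time.perf_counter() - start))

    def write_csv(self, path):
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["document", "page", "stage", "seconds"])
            writer.writerows(self.rows)
```

Wrapping each pipeline stage in `with timer.stage(doc, page, "binarize"):` yields one CSV row per stage per page, which is all a benchmarking workflow needs.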

When GPU acceleration is needed for specific steps - EDT on large pages, parallel cluster extraction - the modular design lets you drop in GPU kernels without changing the pipeline’s structure.

In terms of performance gains, using Noqta as a preprocessing step before OCR increased the F1-score of our client’s system by 10%. And because the approach is built around zero information loss, Noqta itself does not miss content regions.

What needs to improve

Improvements for Noqta are needed in two areas: speed and frame removal. The main edge of this approach is that it promises no loss of information, but that guarantee comes with a speed trade-off, because hierarchical clustering is time-intensive. That is why GPU optimization is needed on that front for optimal performance.

On the other hand, the frame removal algorithm occasionally removes continuous lines of text that span a majority of the page’s width or height, mistaking them for frame lines. This unfortunately leads to some loss of information. Users can mitigate it by setting a very low line-break tolerance parameter (i.e. the maximum line gap value), so that only runs with tiny gaps - the signature of a genuine frame line - are treated as lines.

What this reflects about how we work

Noqta is not made by following trends. It does not default to what the community calls standard tooling this year. It comes from understanding the problem in detail, choosing the right tool from many options, and building it with discipline. We could have wrapped a layout detection model and called it done in a week. Instead, we built something that runs on any CPU.

It needs no training data. It shows every intermediate result. You can tune it for almost any document domain using configuration alone. It will keep running and producing results long after today’s fine-tuned layout models have been superseded by the next set of weight files. That is the difference between reaching for a tool because it is familiar and reaching for one because it is right.

Noqta is open source. The pipeline is configurable via a single JSON dictionary. It processes single PDFs or entire directories and writes structured outputs ready for downstream OCR.

Want to build something?

Let’s talk about what you’re working on next and see how we can help.

Book a call

No pitches, no hard sell. Just a real conversation.
