Logikcull lets you process and search the text within documents. When you upload a document, Logikcull checks for existing text metadata and uses it to drive text searches. If there is no existing text, the platform uses Optical Character Recognition (OCR) to extract text from the document. Logikcull also offers Deep Text Recognition (DTR), which extracts text from images embedded in supported file types. Every document processed by OCR or DTR is tagged, so you can filter to view only those documents.
Native and Imported Text
When processing File Uploads (drag-n-drop) and Cloud Uploads, Logikcull will use the existing text metadata for documents that arrive with a text layer (for example, a Microsoft Word document) to drive text searches across documents.
For Production Uploads that have imported text mapped during import, Logikcull will use the imported text to drive text searches across documents.
If you're performing a database upload and have been given an accompanying set of text files for the imaged versions of your documents (PDF or TIFF), we typically recommend using those text files and bypassing our OCR engine.
Optical Character Recognition (OCR)
If a document does not have an imported text layer, Logikcull will send it through an Optical Character Recognition (OCR) process.
💡 All documents that Logikcull OCRs receive the QC tag "OCRed". If a document has no text, it receives the QC tag "Has No Text".
Filter on the "OCRed" QC Tag in the filter carousel to view documents that have been OCRed by Logikcull.
Examine the text of the OCR by selecting the Text button in the document toolbar.
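The text-source decision described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not Logikcull's implementation: the function name `resolve_text` and the `run_ocr` callback are hypothetical, and applying both tags to a document whose OCR yields nothing is an assumption based on a literal reading of the tagging rules.

```python
from typing import Callable, Optional, Set, Tuple


def resolve_text(
    existing_text: Optional[str],
    run_ocr: Callable[[], str],
) -> Tuple[str, Set[str]]:
    """Return the text used for searching plus the QC tags to apply.

    existing_text: native or imported text that arrived with the document.
    run_ocr: hypothetical callback that returns OCR output for the document.
    """
    tags: Set[str] = set()

    # Native or imported text is used as-is; OCR is skipped.
    if existing_text and existing_text.strip():
        return existing_text, tags

    # No usable text layer: fall back to OCR and record that it ran.
    text = run_ocr()
    tags.add("OCRed")

    # Assumption: a document that still has no text is tagged as well.
    if not text.strip():
        tags.add("Has No Text")

    return text, tags
```

For example, `resolve_text(None, lambda: "scanned page text")` returns the OCR output together with the "OCRed" tag, while a document that arrives with text is returned unchanged and untagged.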
There are two components that determine text searchability in Logikcull:
The extracted text - available in the document viewer or search results when text preview is selected. A document can be ingested with extracted text, or it can be obtained through Logikcull's OCR service.
The searchable text layer on the image of the document (allows for highlighting). A document can be ingested with this searchable text layer or it can be obtained through Logikcull's OCR service. This is a layer of text on top of the image.
⚠️ Documents can be ingested with one, both, neither, or incomplete versions of text.
Logikcull’s OCR aims to do the following:
Replace any existing/provided text already in the text view with a newly-generated text file based on optical character recognition of the image.
Associate a searchable text layer with the image of the document (to allow for highlighting).
For terms to be highlighted on the image of the document, at least one of the following must be true:
The document was ingested with an existing searchable text layer already on the image of the document.
Logikcull OCRed the document.
Logikcull uses an in-house optimized version of Tesseract to run optical character recognition (OCR). Based on our own testing and industry knowledge, it outperforms Adobe's own PDF plugin and most, if not all, other OCR engines on the market.
Logikcull will automatically attempt English-language OCR on image documents (e.g. JPG, PNG, PDF) that don't meet the minimum threshold of text, and give them the "OCRed" QC tag.
Deep Text Recognition (DTR)
DTR extracts "deep text" from embedded images in supported file types (see below), revealing important searchable text that OCR might have missed. For example, in a document that contains both text and screenshots with text, DTR searches the document and the embedded screenshot images for Latin characters. It also expands OCR to 100% of:
DTR-supported documents (.pdf, .doc, .docx, .ppt, .pptx)
Supported image types (.bmp, .dcx, .gif, .j2k, .jb2, .jbig2, .jfif, .jp2, .jpc, .jpeg, .jpg, .jpm, .jpx, .pcx, .pdf, .png, .tif, .tiff)
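If you want to pre-check a folder before uploading, the two extension lists above can be turned into a small lookup. This is a minimal sketch that assumes matching on the file extension alone; the function name `dtr_category` is hypothetical, and how Logikcull actually identifies file types is not described here.

```python
from pathlib import Path
from typing import Optional

# Extension lists copied from the article.
DTR_DOCUMENT_EXTENSIONS = {".pdf", ".doc", ".docx", ".ppt", ".pptx"}
SUPPORTED_IMAGE_EXTENSIONS = {
    ".bmp", ".dcx", ".gif", ".j2k", ".jb2", ".jbig2", ".jfif", ".jp2",
    ".jpc", ".jpeg", ".jpg", ".jpm", ".jpx", ".pcx", ".pdf", ".png",
    ".tif", ".tiff",
}


def dtr_category(filename: str) -> Optional[str]:
    """Classify a file name against the lists above (case-insensitive).

    Returns "document", "image", or None when the extension is in neither list.
    .pdf appears in both lists; this sketch reports it as "document".
    """
    ext = Path(filename).suffix.lower()
    if ext in DTR_DOCUMENT_EXTENSIONS:
        return "document"
    if ext in SUPPORTED_IMAGE_EXTENSIONS:
        return "image"
    return None
```

For example, `dtr_category("Exhibit_A.DOCX")` returns `"document"` and `dtr_category("notes.txt")` returns `None`.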
The "Has Deep Text" QC tag is applied to documents if additional searchable text is found after running DTR. This indicates that the document has an embedded image that contains text.
The Upload stats page also shows the number of documents that have "Deep Text". This is the total number of documents with the "Has Deep Text" QC tag.
Running QC on OCR/DTR
You can check samples of your scanned documents to help determine the quality of the character recognition, and leverage fuzzy search terms as necessary:
Using the Filter Carousel, select the checkboxes next to "OCRed" and/or "Has Deep Text" under the QC Tags filter set.
From the resulting search, you can use fuzzy search terms to find words with similar spellings. This accounts for any characters misread during character recognition.
Fuzzy Search: use the tilde symbol (~) at the end of your search term. Add a value between 0 and 1 to alter the level of similarity. Example: Need~0.3 may return Needle, Neading, Meeting, etc., whereas Need~0.8 may only return Need, Nead.
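To build intuition for how the similarity value narrows or widens results, here is a minimal Python sketch of threshold-based fuzzy matching. It uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity score; Logikcull's actual scoring is not described in this article, so the exact cutoff at which a given word appears will differ. The function name `fuzzy_matches` and the sample word list are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, List


def fuzzy_matches(term: str, candidates: Iterable[str], min_similarity: float) -> List[str]:
    """Return candidates whose similarity to `term` is at least `min_similarity`.

    Similarity is difflib's ratio (0.0 to 1.0), compared case-insensitively.
    This is a stand-in score, not Logikcull's.
    """
    if not 0.0 <= min_similarity <= 1.0:
        raise ValueError("min_similarity must be between 0 and 1")

    term_lower = term.lower()
    return [
        word
        for word in candidates
        if SequenceMatcher(None, term_lower, word.lower()).ratio() >= min_similarity
    ]


# Hypothetical words as they might appear in OCRed text.
words = ["Need", "Nead", "Needle", "Neading", "Meeting", "Contract"]

loose = fuzzy_matches("Need", words, 0.3)   # low threshold: broad matching
strict = fuzzy_matches("Need", words, 0.8)  # high threshold: near-exact only
```

Every word returned at the stricter setting is also returned at the looser one, which is why lowering the value is a useful way to catch terms that OCR may have misread.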