Logikcull uses an in-house optimized version of Tesseract to run optical character recognition (OCR). Based on our own thorough testing and industry knowledge, it outperforms Adobe's own PDF plugin and most, if not all, other OCR engines on the market.
Logikcull automatically attempts English-language OCR on image documents (e.g., JPG, PNG, PDF) that don't meet the minimum threshold of text, and gives them an "Ocred" QC tag. You can check samples of your OCRed documents to help gauge the quality of the character recognition, and use fuzzy search terms to find near matches:
To find words with similar spellings, add the tilde symbol (~) at the end of the term. Add a value between 0 and 1 to alter the level of similarity. Example:
Need~0.3 may return Needle, Neading, Meeting, etc., whereas Need~0.8 may only return Need, Nead.
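To build intuition for how the similarity value narrows results, here is a minimal Python sketch. It does not reproduce Logikcull's search engine: the function name fuzzy_matches is hypothetical, and the similarity score comes from Python's standard difflib module, which is an assumption chosen purely for illustration. Actual results in Logikcull may differ from what this sketch returns.

```python
from difflib import SequenceMatcher


def fuzzy_matches(term, candidates, min_similarity):
    """Return the candidates whose similarity to `term` is at least `min_similarity`.

    Illustrative only: the score is difflib's ratio (0 = nothing in common,
    1 = identical), not the scoring used by Logikcull.
    """
    if not 0 <= min_similarity <= 1:
        raise ValueError("min_similarity must be between 0 and 1")
    term = term.lower()
    return [
        word
        for word in candidates
        if SequenceMatcher(None, term, word.lower()).ratio() >= min_similarity
    ]


words = ["Need", "Nead", "Needle", "Meeting", "Contract"]

# A low value is permissive and lets in loosely similar words.
loose = fuzzy_matches("Need", words, 0.3)

# A high value is strict and keeps only close spellings.
strict = fuzzy_matches("Need", words, 0.8)

print("Need~0.3:", loose)
print("Need~0.8:", strict)
```

The takeaway mirrors the example above: raising the value toward 1 shrinks the result set to near-exact spellings, while lowering it toward 0 widens the net, which helps when OCR has misread some characters.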
For uploads created after March 15th, 2018, we made major improvements to this engine: searchable text on the document image can carry persistent and search highlights, and even handwriting is searchable! This means you won't necessarily need to open the text preview to search for keywords on a document that was scanned and OCRed in Logikcull. If the OCR work was done prior to this date, the document may not be searchable from the image viewer.
NOTE: If you're performing a database upload and you have been provided an accompanying set of text files that contain the text for the imaged versions of your documents (e.g., PDF or TIFF), we typically recommend using the provided text files and bypassing our OCR engine.
There are two components that determine text searchability in Logikcull:
The extracted text - available in the document viewer or search results when text preview is selected. A document can be ingested with extracted text, or it can be obtained through Logikcull's OCR service.
The searchable text layer on the image of the document (allows for highlighting). A document can be ingested with this searchable text layer or it can be obtained through Logikcull's OCR service. This is essentially a layer of text on top of the image.
⚠️ Documents can be ingested with one, both, neither, or incomplete versions of the two.
⚠️ The Logikcull OCR service will only be triggered if the document does not meet the minimum threshold of text.
Logikcull’s OCR aims to do the following:
Replace any existing/provided text already in the text view with a newly-generated text file based on OCR of the image.
Associate a searchable text layer with the image of the document (to allow for highlighting).
If you are looking for terms to be highlighted on the image of the document, at least one of these two conditions must be true:
The document was ingested with an existing searchable text layer already on the image of the document.
Logikcull OCRed the document (either because the document did not meet the minimum threshold of text, or because the document was force OCRed by customer request).
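The rules above can be summarized in a short Python sketch. This is a simplified model, not Logikcull's implementation: the Document class, the function names, and the MIN_TEXT_CHARS value are all hypothetical, and measuring the minimum threshold in characters is an assumption, since the actual threshold is not stated here.

```python
from dataclasses import dataclass

# Hypothetical value: the real minimum text threshold is not documented here.
MIN_TEXT_CHARS = 50


@dataclass
class Document:
    extracted_text: str = ""      # component 1: text shown in the text preview
    has_text_layer: bool = False  # component 2: searchable layer on the image
    force_ocr: bool = False       # OCR forced by customer request


def should_ocr(doc: Document) -> bool:
    """OCR runs when text is below the threshold or when OCR is forced."""
    return doc.force_ocr or len(doc.extracted_text.strip()) < MIN_TEXT_CHARS


def ingest(doc: Document, ocr_text: str) -> Document:
    """If OCR runs, it replaces the existing text and adds a text layer."""
    if should_ocr(doc):
        return Document(extracted_text=ocr_text, has_text_layer=True)
    return doc


def can_highlight(doc: Document) -> bool:
    """Highlighting on the image requires a searchable text layer."""
    return doc.has_text_layer


scanned = ingest(Document(), ocr_text="Text recognized from the scan.")
text_only = ingest(Document(extracted_text="x" * 500), ocr_text="unused")
forced = ingest(Document(extracted_text="x" * 500, force_ocr=True), ocr_text="fresh OCR text")

print(can_highlight(scanned), can_highlight(text_only), can_highlight(forced))
```

In this model, a document ingested with plenty of extracted text but no text layer stays searchable through the text preview yet cannot be highlighted on the image until it is force OCRed.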