
How to Enable OCR for Non-English Images

Question

How can I enable OCR for non-English images?

Answer

The steps below explain how to deploy additional OCR language packs and how to specify which files each installed pack should process. They assume that OCR is already enabled and working. For details, see the KB article: Process Document Images results in no extracted text or invalid text.

Download the pack for the language you want to use from the Language Packs list below, then deploy it:

  1. Deploy the pack on all servers, in each of the following locations:
    1. conceptQS (typically: C:\inetpub\wwwroot\NDC\bin\Tesseract-OCR\tessdata)
    2. conceptCollector (typically: C:\Program Files\Netwrix\Data Classification\Services\ConceptCollectorService\Tesseract-OCR\tessdata)
  2. Do not rename the language pack file.
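
The deployment step can be scripted. The following is a minimal Python sketch, not an official tool: the function name deploy_language_pack and the file name in the usage comment are hypothetical, and the default directories are simply the "typical" paths quoted above, which may differ on your servers. It copies the pack under its original name and reports any target directory it could not find. It would need to be run on each server.

```python
import shutil
from pathlib import Path

# "Typical" tessdata locations quoted in the article (assumption: your
# installation uses these defaults; otherwise pass your own directories).
DEFAULT_TESSDATA_DIRS = (
    Path(r"C:\inetpub\wwwroot\NDC\bin\Tesseract-OCR\tessdata"),
    Path(r"C:\Program Files\Netwrix\Data Classification\Services"
         r"\ConceptCollectorService\Tesseract-OCR\tessdata"),
)


def deploy_language_pack(pack_file, target_dirs=DEFAULT_TESSDATA_DIRS):
    """Copy a language pack, keeping its file name, into each target directory.

    Returns (copied, missing): the destination files that were written and
    the target directories that do not exist on this machine.
    """
    pack = Path(pack_file)
    if not pack.is_file():
        raise FileNotFoundError(f"Language pack not found: {pack}")

    copied, missing = [], []
    for target in map(Path, target_dirs):
        if target.is_dir():
            destination = target / pack.name  # the pack must not be renamed
            shutil.copy2(pack, destination)
            copied.append(destination)
        else:
            missing.append(target)
    return copied, missing


# Hypothetical usage (the download path is a placeholder):
# copied, missing = deploy_language_pack(r"C:\Downloads\rus.traineddata")
```

Any directory returned in missing should be checked by hand, since the pack has to be present in both locations.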

Next, specify which files each language pack should process:

  1. Log into the Administration Portal.
  2. Select Config.
  3. Expand Text Processing.
  4. Select OCR Path Mapping.
  5. Each mapping allows you to define part of a path to identify specific files for processing:
    1. Select Add.
    2. Define the inclusion filter, such as:
      • *ru_* - Identifies any file that contains ru_ within the path
      • * - Identifies any file
    3. Select the language (mapped to the deployed language pack).
    4. Select Save.
  6. If a file matches multiple inclusion filters, the longest matching filter is used.
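
To see how the inclusion filters and the longest-match rule interact, the following Python sketch imitates the behaviour described above. It is an illustration only, not the product's implementation. It assumes that "longest" means the filter with the most characters and that matching is case-sensitive, neither of which the article states. Python's fnmatch also treats ? and [ ] as wildcards, which the product may not. The function name pick_language and the language labels are hypothetical.

```python
from fnmatch import fnmatchcase


def pick_language(path, mappings):
    """Return the language of the longest inclusion filter that matches path.

    mappings is a list of (inclusion_filter, language) pairs.
    Returns None when no filter matches.
    Assumption: "longest" is measured in characters of the filter text.
    """
    matches = [
        (pattern, language)
        for pattern, language in mappings
        if fnmatchcase(path, pattern)  # "*" matches any run of characters
    ]
    if not matches:
        return None
    _, language = max(matches, key=lambda match: len(match[0]))
    return language


# Hypothetical mappings: a catch-all filter plus a more specific one.
example_mappings = [
    ("*", "English"),
    ("*ru_*", "Russian"),
]
```

With these mappings, a path such as \\share\docs\ru_invoice.png matches both filters, and the longer filter *ru_* is chosen. A path without ru_ matches only the catch-all filter *.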

Language Packs: