Skip to main content

Text Processing

This section contains information on how to configure text processing. Related options apply to:

  • Best Bets
  • Content Type Extension Mapping
  • Content Type Extraction Methods
  • Language Detection
  • No Stem
  • OCR Language Mapping
  • Synonyms
  • Text Patterns

Best Bets

Sometimes an application may want to push selected documents to the top of a hitlist for specific queries. This may be implemented by specifying Best Bets for specific query text.

configbestbets_thumb_0_0

First, enter the search term that you want to match and then click the Add button.

Next, click the term, and specify one or more URLs that should appear at the top of the hit list.

Content Type Extension Mapping

Sometimes an organization may want to process certain file types as a different content type. The primary use case for this is internal content types that map to a content type already understood / identified.

In this case the example has a .rpt file being treated as a text file, as such the file will be copied to a temporary location as a .txt file and processed as if it were any other text file.

configcontenttypeextensionmappings

Content Type Extraction Methods

The Content Type Extraction methods describes how documents will be handled by the APIs and the core services. A number of built-in processing methods are available, where there is no available method the processing will default to running through standard Microsoft Search iFilter processing.

You can alter the methods by clicking Edit and then selecting the preferred processing method. You can also specify an iFilter as a backup if the primary method fails to extract text from the document – the backup method will be used if the extraction fails to find more than 5 characters of text.

If you have updated the extraction method, Netwrix recommends re-processing any documents that have already been processed to ensure consistency. Selecting Re-index from the grid for the affected content type will re-process the necessary records.

configcontenttypeextractionmethods_thumb_0_0

Language Detection

The language detection list specifies which languages will be considered for auto-detection.

configlanguages_thumb_0_0

If a language is excluded then it can't be used to identify the language of a document and it will be removed from the language options in Taxonomy Manager.

tip

You can also enable OCR recognition for non-English images. See the Netwrix knowledge base article How to enable OCR for non-English images for setup instructions.

No Stem

The No Stem list offers the ability to disable language stemming for a particular word or phrase, this supports the ability to always apply a phrasematch when a particular term is used as either a clue – or a search term.

confignostem_thumb_0_0

OCR Language Mapping

The OCR language mapping configuration screen can be used if you want to OCR non-English images using Tesseract and the Apache Tika OCR engine. File paths (including parts of paths) can be mapped to specific Tesseract language packs. You can also override the OCR processing mode, enable conversion of PDF files to images for improved text extraction, and override the Page Segmentation mode used by Tika to identify text.

configocrlanguagemapping

Synonyms

Often you need to submit a query and have synonyms automatically included. A generic set of synonyms may be configured by using the Synonyms form.

configsynonyms

Text Patterns

Many HTML web pages contain navigation information and other extraneous information that is the same for all pages and/or not relevant to the individual page content. If all of the text is indexed from these HTML pages then this can lead to unwanted search results where a match is made, for example, to an entry in a standard page navigation area.

The Text Patterns feature is provided to assist with the cleanup of HTML documents. Use TextPatterns to also index terms that would normally be discarded.

configtextpatterns_thumb_0_0

The StartTag and EndTag values are case sensitive strings used to identify the content to be managed, the content is then managed based on the filter type.

You can use three tag types to assist in the cleanup:

  • FILTER—Extracts a subset of the HTML page, before extracting the plain text. Only a single section will be extracted for each TextFilter processed.
  • DELETE—Deletes sections of the HTML page, before extracting the plain text.
  • INDEX TERM (EndTag ignored)—Create index terms that would otherwise not be formed. For example the term “E.ON” is a useful one for people interested in energy companies. However, this term would not normally be created because a full stop normally acts as a term separator. However, creating an INDEX TERM for this pattern means it will be detected and indexed as required.