Amazon Comprehend Enrich Scanner

Introduction

The Amazon Comprehend Enrich Scanner is one of the source connectors available in migration-center starting with version 3.17. It is a special connector that enriches the objects scanned by other source connectors with data computed by Amazon Comprehend.

The supported Comprehend classifiers are the Dominant Language Classifier, the Entities Classifier and the Custom Classifier.

Known Issues and Limitations

  • The scanner extracts the text again in enrich mode, even if it was already run in simulation mode.

  • The entities and language classifiers can be run in the same scan run, but the entities classifier will then not take into account the language detected by the dominant language classifier. The attribute generated by the language classifier can be used by the entities classifier only if the language classifier was run before the entities classifier. How to use a source attribute as the entities language is described in Classifiers Configuration.

Scanner Configuration

To create a new Amazon Comprehend Enrich Scanner job, click the New Scanner button and select "AmazonComprehendEnrich" from the adapter type dropdown list. Once the adapter type has been selected, the parameters list is populated with the Amazon Comprehend Enrich Scanner parameters.

The Properties window of a scanner can be accessed by double-clicking the scanner in the list or selecting the Properties button or entry from the toolbar or context menu.

Scanner Parameters

The common adapter parameters are described in Common Parameters.

The configuration parameters available for the Amazon Comprehend Enrich Scanner are described below:

  • publicKey

    The Amazon public key used to create a connection to AWS.

  • privateKey

    The Amazon private key. This should be the pair of the public key.

  • region

    The Amazon region used to create the connection to AWS.

  • executeClassifiers

    Flag indicating whether the classifiers will be executed. If this parameter is not checked, the scanner runs in simulation mode; otherwise, the classifier jobs are started in Comprehend. See Classifiers Configuration.

  • inputS3Uri

    The S3 location where the text files will be uploaded.

  • outputS3Uri

    The S3 location where the output of the classifier will be located.

  • kmsKeyId

    The ARN of the customer managed key used to encrypt the data in S3.

    Example: arn:aws:kms:eu-central-1:0908887578777:key/d484ee92-ffff1-444e-bcbb0-7cccceffcffc

  • dataAccessRoleArn

    The ARN of the IAM role that has Comprehend as a trusted entity.

    Example: MCDEMO_Comprehend

  • deleteFiles

    Flag indicating whether the files in S3 will be deleted. If the parameter is checked, the files from inputS3Uri and outputS3Uri are deleted.

  • jobRunId*

    The ID of the job run that scanned the objects to be enriched. The jobRunId must exist.

  • configurationFile

    The location of the file in which the classifiers are configured. This parameter is mandatory when the executeClassifiers parameter is checked. How to configure the classifiers is detailed in Classifiers Configuration.

  • loggingLevel*

    See Common Parameters.

Parameters marked with an asterisk (*) are mandatory.

Classifiers Configuration

The classifiers are configured using an XML file with a predefined structure.

The supported classifiers are divided into two types: standard classifiers and custom classifiers. There is a predefined XML structure for each classifier type. An example of this configuration file can be found in \fme AG\migration-center Server Components <Version>\lib\mc-aws-comprehend-scanner\classifiers-config.xml.

For every classifier, you can specify whether the score should be displayed by using the XML attribute "displayScore". Set this attribute only if you want the score to be available as an attribute in migration-center; otherwise it can be omitted, because the default value is false.

Standard Classifier

Two standard classifiers are supported. They are distinguished by the XML attribute "type".

  • Dominant Language Classifier

The structure of this classifier is presented in the following block. The "threshold" XML element is mandatory and is used to filter the values. If the score for a specific language is lower than the threshold value, the language is not saved in the database.

<standard_classifier type="language" displayScore="true">
    <threshold>0.8</threshold>
</standard_classifier>

  • Entities Classifier

The structure for the entities classifier is presented in the following block.

<standard_classifier type="entities">
    <threshold>0.6</threshold>
    <language>de</language>
    <entityRecognizerArn>arn:aws:comprehend:eu-central-1:0000000000:entities-classifiers/docClassifier-copy</entityRecognizerArn>
    <entities>DATE,PERSON</entities>
</standard_classifier>

The XML sub-elements are:

  1. threshold - used to filter the entities. If the entity score is lower than the threshold value, the entity is not saved in the database.

  2. language - a mandatory element that specifies the language of the documents. If the documents are in different languages, a source attribute can be used to specify the language instead. The attribute name must be prefixed with the $ character, e.g. $aws_language.

  3. entityRecognizerArn - specifies a custom entity recognizer to be used instead of the standard one.

  4. entities - specifies the entities that will be saved in the database. If an entity is not present in the entities list, it is ignored by the scanner.
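The effect of the sub-elements above can be illustrated with a small Python sketch. It is not the scanner's implementation: the function names are hypothetical, and the shape of a detected entity (a dictionary with "Type", "Text" and "Score") is assumed to follow the result format returned by Comprehend.

```python
from typing import Dict, List, Optional


def resolve_language(configured: str, source_attributes: Dict[str, str]) -> Optional[str]:
    """Return the language code for one document.

    A value prefixed with "$" is treated as the name of a source attribute;
    any other value is used as a literal language code.
    """
    if configured.startswith("$"):
        return source_attributes.get(configured[1:])
    return configured


def filter_entities(detected: List[dict], threshold: float, entities: str) -> List[dict]:
    """Keep entities whose type is listed and whose score reaches the threshold."""
    allowed = {name.strip() for name in entities.split(",") if name.strip()}
    return [
        entity
        for entity in detected
        if entity["Type"] in allowed and entity["Score"] >= threshold
    ]


# Hypothetical classifier output for one document.
detected = [
    {"Type": "PERSON", "Text": "Jane Doe", "Score": 0.97},
    {"Type": "DATE", "Text": "2021-05-01", "Score": 0.41},
    {"Type": "LOCATION", "Text": "Berlin", "Score": 0.99},
]

language = resolve_language("$aws_language", {"aws_language": "de"})
kept = filter_entities(detected, threshold=0.6, entities="DATE,PERSON")
print(language, [entity["Text"] for entity in kept])
```

In this sketch the DATE entity is dropped because its score is below the threshold, and the LOCATION entity is dropped because its type is not in the entities list.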

Custom Classifier

The Custom Classifier is used to classify documents using custom-created categories. The scanner allows users to use multiple custom classifiers in the same scan run.

The XML sub-element "classifierEndpointArn" is mandatory and specifies the Amazon Resource Name (ARN) of the custom classifier.

The "threshold" sub-element is used to filter the classes. If the class score is lower than the threshold value, the attribute is not saved in the database. The attribute name in the database will be "aws_className_awsJobId".

<custom_classifier displayScore="true">
    <classifierEndpointArn>arn:aws:comprehend:eu-central-1:000000000:document-classifier/docClassifier-copy</classifierEndpointArn>
    <threshold>0.7</threshold>
</custom_classifier>
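To check such a fragment before a scan run, it can be read with a few lines of Python. The following is only a sketch using the standard library: the function name is hypothetical, and treating a missing "threshold" as "no filtering" is an assumption, since the behaviour for an omitted threshold is not described here.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<custom_classifier displayScore="true">
    <classifierEndpointArn>arn:aws:comprehend:eu-central-1:000000000:document-classifier/docClassifier-copy</classifierEndpointArn>
    <threshold>0.7</threshold>
</custom_classifier>
"""


def read_custom_classifier(xml_text: str) -> dict:
    """Read one custom_classifier element into a plain dictionary."""
    element = ET.fromstring(xml_text.strip())
    arn = element.findtext("classifierEndpointArn")
    if not arn or not arn.strip():
        raise ValueError("classifierEndpointArn is mandatory")
    threshold_text = element.findtext("threshold")
    return {
        "arn": arn.strip(),
        # Assumption: a missing threshold means that no filtering is applied.
        "threshold": float(threshold_text) if threshold_text is not None else None,
        # displayScore defaults to false when the attribute is omitted.
        "display_score": element.get("displayScore", "false").lower() == "true",
    }


config = read_custom_classifier(FRAGMENT)
print(config["threshold"], config["display_score"])
```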

Using the Amazon Comprehend Enrich Scanner

The Amazon Comprehend Enrich Scanner can be run in two modes: simulation mode and enrich mode.

In both modes, the scanner uses Tika to extract the text and Tesseract for OCR. OCR is disabled by default, but it can be activated by the user. More information can be found in the chapters Tika Configuration and Tesseract OCR Configuration.

We recommend running the scanner in simulation mode first to estimate the cost before running it in enrich mode to extract the Comprehend attributes.

Simulation Mode

To run the scanner in simulation mode, the parameter "executeClassifiers" must be unchecked. To see the information generated by the scanner, the parameter "loggingLevel" should be set to 3 or 4.

The scanner extracts the text from the documents locally and computes the number of characters and units to help the user estimate the cost of executing the classifiers.
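The character and unit counts can be approximated with a short Python sketch. It rests on an assumption taken from the published Amazon Comprehend pricing model rather than from this scanner: one unit is 100 characters, and each document is charged at least 3 units. The exact formula used by the scanner is not documented here, so the figures in the report log remain the authoritative ones.

```python
import math
from typing import Iterable, Tuple

UNIT_SIZE = 100  # characters per unit (assumption based on AWS pricing)
MIN_UNITS = 3    # minimum units charged per document (assumption)


def estimate_units(char_count: int) -> int:
    """Return the number of billable units for one document."""
    return max(MIN_UNITS, math.ceil(char_count / UNIT_SIZE))


def summarize(texts: Iterable[str]) -> Tuple[int, int]:
    """Return the total number of characters and units for a set of texts."""
    total_chars = 0
    total_units = 0
    for text in texts:
        total_chars += len(text)
        total_units += estimate_units(len(text))
    return total_chars, total_units


# Two hypothetical extracted texts of 250 and 1000 characters.
chars, units = summarize(["a" * 250, "b" * 1000])
print(chars, units)
```

Multiplying the unit total by the current price per unit of each classifier gives a rough cost estimate for one scan run.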

The information generated during execution is written to the report log. An example of a report log is shown in the following image.

Enrich Mode

To run the scanner in enrich mode, check the parameter "executeClassifiers".

The Amazon Comprehend Enrich Scanner first extracts the text from the documents locally. The text files are then uploaded to S3 at inputS3Uri. The classifier jobs are started, and their results are saved in S3 at outputS3Uri. Finally, the scanner downloads the result files and saves the results in the database.
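To make the role of the scanner parameters in this workflow more concrete, the following Python sketch shows how they could map to the request of an asynchronous Comprehend entities detection job. This is an illustration, not the scanner's own code: the function name is hypothetical, the bucket and role values are placeholders, and the input format ONE_DOC_PER_FILE is an assumption. The dictionary keys are the request parameters of the Comprehend StartEntitiesDetectionJob operation.

```python
from typing import Dict, Optional


def build_entities_job_request(
    input_s3_uri: str,
    output_s3_uri: str,
    data_access_role_arn: str,
    language_code: str,
    kms_key_id: Optional[str] = None,
    entity_recognizer_arn: Optional[str] = None,
) -> Dict[str, object]:
    """Build the keyword arguments for an entities detection job request."""
    output_config: Dict[str, str] = {"S3Uri": output_s3_uri}
    if kms_key_id:
        # Encrypt the job output with the given customer managed key.
        output_config["KmsKeyId"] = kms_key_id

    request: Dict[str, object] = {
        # Assumption: one uploaded text file corresponds to one document.
        "InputDataConfig": {"S3Uri": input_s3_uri, "InputFormat": "ONE_DOC_PER_FILE"},
        "OutputDataConfig": output_config,
        "DataAccessRoleArn": data_access_role_arn,
        "LanguageCode": language_code,
    }
    if entity_recognizer_arn:
        # Use a custom entity recognizer instead of the standard one.
        request["EntityRecognizerArn"] = entity_recognizer_arn
    return request


# Placeholder values for illustration only.
request = build_entities_job_request(
    input_s3_uri="s3://example-bucket/input/",
    output_s3_uri="s3://example-bucket/output/",
    data_access_role_arn="arn:aws:iam::000000000000:role/MCDEMO_Comprehend",
    language_code="de",
)
print(sorted(request))
```

With boto3 installed and credentials configured, such a request could be passed to the Comprehend client as start_entities_detection_job(**request).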

The following image presents the attributes in the database after one scan run in enrich mode with the standard entities classifier and the dominant language classifier.

Tika Configuration

The Tika library is used by the scanner to extract the text from documents.

The scanner provides a Tika configuration file that contains all the parsers needed to extract the text from office documents. The user can modify the configuration file if further tuning is needed. The file is located at \fme AG\migration-center Server Components <Version>\lib\mc-aws-comprehend-scanner\tika-config.xml.

The "OOXMLParser" is used for office documents such as docx, and the "PDFParser" is used for pdf documents. The default configuration provided by the Tika library is used for other document types.

More information about the configuration can be found at https://tika.apache.org/1.26/configuring.html.

Tesseract OCR Configuration

Tesseract OCR is used to extract the text from embedded images and from image files. The documentation of this library is available at https://tesseract-ocr.github.io/tessdoc/Home.html.

To install this library you can follow the article: https://medium.com/quantrium-tech/installing-and-using-tesseract-4-on-windows-10-4f7930313f82. The executable file can be downloaded from https://sourceforge.net/projects/tesseract-ocr-alt/files/ or https://digi.bib.uni-mannheim.de/tesseract/.

After installing Tesseract, set tesseractPath and tessdataPath in the TesseractOCRConfig.properties file. Example:

tesseractPath=C:\\Users\\user\\AppData\\Local\\Tesseract-OCR
tessdataPath=C:\\Users\\user\\AppData\\Local\\Tesseract-OCR\\tessdata

By default, Tesseract is disabled. To enable it, follow these steps:

  • Open tika-config.xml and remove the line <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/> from the DefaultParser. After the change, the DefaultParser element looks like this:

<parser class="org.apache.tika.parser.DefaultParser">
    <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser" />
    <parser-exclude class="org.apache.tika.parser.pdf.PDFParser" />
    <parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser" />
</parser>

  • Set the ocrStrategy parameter of the PDFParser to ocr_and_text.

<parser class="org.apache.tika.parser.pdf.PDFParser">
    <params>
        <param name="extractInlineImages" type="bool">true</param>
        <param name="sortByPosition" type="bool">true</param>
        <param name="extractUniqueInlineImagesOnly" type="bool">false</param>
        <param name="ocrStrategy" type="string">ocr_and_text</param>
        <param name="ocrImageType" type="string">rgb</param>
        <param name="ocrDPI" type="int">100</param>
    </params>
</parser>

Additional Configuration

To configure additional parameters that apply to all scanner runs, a configuration file (internal-configuration.properties) is provided in the folder …\lib\mc-aws-comprehend-scanner. The following settings are available:

  • waiting_time_between_requests

    The time in seconds that the scanner waits between two requests to Amazon for the status of a Comprehend classifier job.

    Example: waiting_time_between_requests=10 means that the scanner makes a request and, if the status is "in progress", waits 10 seconds before making another request to check the status.
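The polling behaviour controlled by waiting_time_between_requests can be sketched in Python as follows. This is a simplified illustration and not the scanner's implementation: the function name, the max_requests safety limit and the injected get_status callable are hypothetical, while SUBMITTED and IN_PROGRESS are the Comprehend job statuses treated here as "still running".

```python
import time
from typing import Callable

RUNNING_STATUSES = ("SUBMITTED", "IN_PROGRESS")


def wait_for_job(
    get_status: Callable[[], str],
    waiting_time_between_requests: float = 10,
    sleep: Callable[[float], None] = time.sleep,
    max_requests: int = 1000,
) -> str:
    """Poll a job status until the job leaves the running states.

    get_status is any callable that returns the current job status;
    sleep can be replaced in tests to avoid real waiting.
    """
    for _ in range(max_requests):
        status = get_status()
        if status not in RUNNING_STATUSES:
            return status
        # Job still running: wait before asking for the status again.
        sleep(waiting_time_between_requests)
    raise TimeoutError("job did not finish within the allowed number of requests")


# Simulated status sequence; no real waiting takes place in this example.
statuses = iter(["SUBMITTED", "IN_PROGRESS", "COMPLETED"])
waits = []
final_status = wait_for_job(lambda: next(statuses), 10, sleep=waits.append)
print(final_status, waits)
```

A larger value reduces the number of status requests sent to Amazon, at the price of noticing a finished job later.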