Migrating text analytics extractors from AQL 1.x to modular AQL 2.0
How do I migrate my existing text analytics extractor written in AQL 1.x to modular AQL 2.0?
BigInsights Text Analytics v2.0 introduces major new features to the AQL language, most notably support for AQL modules. It can still compile AQL code created with BigInsights Text Analytics v1.4 or earlier, using a backward compatibility mode. This backward compatibility support is offered only in Text Analytics v2.0 and will be discontinued in the next major release of Text Analytics. Extractors written in AQL 1.x should therefore be migrated to AQL 2.0 while using the BigInsights v2.0 edition. The migration from AQL 1.x to AQL 2.0 consists of two steps:
- Design: Decide how to split the extractor into modules, what each module should contain, and how the modules should interact with each other.
- Development: Using the InfoSphere BigInsights Eclipse tools, implement the design from Step 1, and migrate the Extraction Plan if one exists.
Step 1: Design
In BigInsights Text Analytics v1.x, an extractor consists of a main AQL file and a data path to find other files that are required by the extractor, such as included AQL files, dictionaries and UDF jars. This step involves a) deciding how best to break the existing extractor into multiple modules, b) deciding what should be contained in each module, and c) deciding the best possible interaction that can be achieved among these modules. The following four questions are meant to guide you in the design process.
Question 1: How complex is your extractor?
Simple extractors can be transformed into a single text analytics module. The more complex the extractor, the more apparent the need to modularize it.
Question 2: Does your extractor have components that are used in other extractors?
A component in the current context refers to individual views, tables, dictionaries, functions, or any combinations of them. Examples of such components are apparent in the following two scenarios:
- You have one UDF jar that you use in each of your BigInsights Text Analytics projects.
- You have an extractor for Organization mentions that you use in other extractors, for example, an extractor that identifies Financial Events, or an extractor that identifies relationships between Person and Organization mentions.
In these instances, it is recommended to abstract out the common components into one or more modules, and expose (export) the required objects from the common modules so that they can be consumed (imported) in other modules.
Question 3: Would you like to expose components inside your extractor for customization?
One useful way to customize the behavior of an extractor is to abstract out the information that the extractor uses to arrive at its decisions, such as dictionaries and tables, and let the consumer of the compiled extractor fill in these values when the extractor is applied to a specific domain. In this way, the same extractor can be applied to different domains, each time with a different set of entries for the customizable dictionaries and tables. For example, assume that you have a Person extractor that is used to highlight person mentions in the email inbox of a customer. You would like to share this extractor with multiple customers, and let each customer customize the extractor with names from the employee directory of their organization, without having to provide the source AQL code of your extractor or recompile it. In BigInsights Text Analytics 2.0, you can expose such customization points by using the new AQL statements create external dictionary and create external table.
Question 4: Are there domain-specific components in your extractor? Are there any common components that are used by such domain-specific components within your extractor?
In the text analytics area, it is generally difficult to create generic extractors that perform reasonably well on each and every possible domain or language. In general, a generic extractor requires customization to improve the accuracy and coverage on a specific domain. Such customization generally falls into three categories:
- Application customization: You have a generic extractor that is used in multiple applications. Each application has specific requirements on the output schema, or the types of mentions that should be extracted. For example, you have a Person extractor and you use it in two different applications. In one application, the extractor should not output person mentions that consist of a single token (for example, mentions of the first name of a person, without a mention of their last name). In the second application, such single-token person mentions are acceptable. In addition, the output schema of the Person extractor differs across the two applications. In one application, a single attribute for the full name is sufficient; the second application requires a few extra attributes in addition to the full name: the first name, the last name, and the job position of the person, if present in the text.
- Data source customization: You have an extractor that is applied to several different types of data sources. Each data source has specifics that require specialized rules to be applied. For example, you have an extractor for Location mentions that is applied to two types of text: formal news reports and emails. In formal news, you know that there is a mention of Location right at the beginning of the first line. You want to write a special rule to capture that location, but you do not want that rule evaluated when the extractor is applied to emails.
- Language customization: Formal language and grammar structure vary across languages. If your extractor applies to multiple languages, you might wish to implement rules that are applicable to a single language. For example, you have a Sentiment extractor that is applied to documents in English and French. You might have a few specialized rules that only apply to English text. Similarly, you might have a few specialized rules that only apply to French text.
In such cases, the ability to modularize source AQL code is a great benefit to ensure that multiple concepts are well-separated in terms of rule sets and patterns used. Some recommendations include:
- Identify AQL objects (views, tables, dictionaries, functions) that are common across multiple specific versions of your extractor and abstract them out into new common modules that can be re-used by other specialized modules;
- Create specialized modules that provide different implementations for the same semantic concept, where each implementation is customized for a particular application, domain, or language;
- Use the new form of the AQL statement output view, output view … as …, to ensure that different specialized modules output a consistent set of names;
- In cases of language customization, use the new AQL statement set default dictionary language to set the default dictionary languages for each module according to the language it is intended to work with.
Step 2: Development
Every extractor is different and the modularization of the extractor must be architected on a case-by-case basis, according to the guidelines that are described in Step 1. Therefore, BigInsights Eclipse tools do not perform automatic migration of AQL source.
When you install BigInsights Eclipse tools v2.0, your existing projects that were created with BigInsights v1.x compile in backward compatibility mode. Compared to the BigInsights Text Analytics v1.x compiler, the BigInsights Text Analytics v2.0 backward compatibility compiler has a number of restrictions on AQL source code. The restrictions, and the corresponding workarounds, are described next. If your existing AQL source code does not satisfy these restrictions, you must apply the corresponding workarounds to make it compile in BigInsights v2.0.
Restrictions of Text Analytics v2.0 backward compatibility compiler:
- Names of AQL objects (that is, views, tables, dictionaries, functions) cannot contain the period character ('.'). Workaround: remove any periods from object names. If your application requires that the extractor outputs a view whose name contains the period character, use the new AQL statement output view … as … instead.
- The statement select * from Document; is not supported in AQL 2.0. Workaround: explicitly select the desired fields of the special view Document.
- The statement output view Document; is not supported in AQL 2.0. Workaround: if your application requires outputting specific fields of the special view Document, create a new view that selects the specific fields and output that view instead.
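Before attempting to compile a large v1.x code base, it can help to scan the AQL files for the three restricted constructs listed above. The following is a minimal Python sketch of such a scan, not part of the BigInsights tooling. The function name find_restriction_violations and the regular expressions are illustrative assumptions: they work line by line, treat everything after -- as a comment, and do not parse AQL, so statements that span several lines or quoted object names with unusual characters can be missed.

```python
import re

# Heuristic patterns for the three restrictions (assumptions, not an AQL parser).
# 1. An object name that contains a period, e.g. "create view Person.Final as ..."
OBJECT_NAME_WITH_PERIOD = re.compile(
    r"\bcreate\s+(?:view|table|dictionary|function)\s+\"?([\w.]*\.[\w.]*)\"?"
)
# 2. "select * from Document"
SELECT_STAR_FROM_DOCUMENT = re.compile(r"\bselect\s+\*\s+from\s+Document\b")
# 3. "output view Document;"
OUTPUT_VIEW_DOCUMENT = re.compile(r"\boutput\s+view\s+Document\s*;")


def find_restriction_violations(aql_source):
    """Return (line_number, message) pairs for lines that look restricted."""
    findings = []
    for number, line in enumerate(aql_source.splitlines(), start=1):
        # Drop a trailing single-line comment so commented-out code is ignored.
        code = line.split("--", 1)[0]
        match = OBJECT_NAME_WITH_PERIOD.search(code)
        if match:
            findings.append(
                (number, "object name contains a period: " + match.group(1))
            )
        if SELECT_STAR_FROM_DOCUMENT.search(code):
            findings.append((number, "select * from Document is not supported"))
        if OUTPUT_VIEW_DOCUMENT.search(code):
            findings.append((number, "output view Document is not supported"))
    return findings
```

Calling find_restriction_violations on the text of each .aql file yields a list of candidate lines to review by hand; the compiler remains the authority on what is actually accepted.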
Before you begin:
- Back up your project that contains the older AQL source. Steps: In Eclipse tools, click File > Export > General
- Ensure the presence of regression tests to validate the migration. Your extractor should generate the same result before migration, and after migration. Ensure that you have regression tests that cover all the functionality of your extractor. Use several input document collections that are representative inputs to your extractors. In general, there are two categories of regression tests for AQL code:
- Manual regression testing. Before migration, run the extractor on each input collection. Save the extractor output as a labeled collection. Steps: In Eclipse tools, right click the result folder and click Labeled document collection > Import from extracted result. After migrating, run the modularized extractor on each input collection. Use the Annotation Difference tool to compare the extracted result against the previously created labeled collection. Steps: In Eclipse tools, right click the result folder and click Compare text analytics result with... > Labeled document collection.
- Automated regression testing. Before migration, use the Text Analytics v1.4 Java APIs to compile and run the extractor on each input collection. Serialize the output in a format of your choice. After migrating, use the Text Analytics v2.0 APIs to compile and run the extractor on each input collection. Serialize the output to the format that is used in serializing results prior to migration. Compare the two result sets by using the mechanism of your choice (for example, file comparison).
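For the automated approach, the comparison step can be as simple as a file-by-file comparison of the two serialized result sets. The sketch below assumes (this layout is not prescribed by the product) that the output of each run was serialized into its own directory, with the same file names before and after migration. The function names compare_result_dirs and migration_is_clean are hypothetical.

```python
import filecmp
from pathlib import Path


def _relative_files(root):
    """Return the set of file paths under root, relative to root."""
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def compare_result_dirs(before_dir, after_dir):
    """Compare two directories of serialized extractor output.

    Returns a dict with three sorted lists of relative file names:
    'missing' (present only before migration), 'unexpected' (present only
    after migration), and 'different' (present in both, content differs).
    """
    before_dir, after_dir = Path(before_dir), Path(after_dir)
    before = _relative_files(before_dir)
    after = _relative_files(after_dir)
    different = sorted(
        name
        for name in before & after
        # shallow=False forces a byte-by-byte comparison of the contents.
        if not filecmp.cmp(before_dir / name, after_dir / name, shallow=False)
    )
    return {
        "missing": sorted(before - after),
        "unexpected": sorted(after - before),
        "different": different,
    }


def migration_is_clean(report):
    """True when the two result sets contain the same files with equal content."""
    return not any(report.values())
```

A byte-level comparison only works if the serialization is deterministic, for example if tuples are written in a stable order; otherwise sort or normalize the output before comparing.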
Step 2.A: Explicitly migrate the Text Analytics properties of your BigInsights project from v1.x to v2.0
Steps: In Eclipse tools, open the properties of your BigInsights project. Under BigInsights click Text Analytics > General, and then click Migrate.
After this step, the Text Analytics v1.x properties are migrated to Text Analytics v2.0 properties. The v1.x properties consist of the location of a main AQL file and the data path used to search for included AQL files, dictionaries, and UDF jars. The v2.0 properties consist of the location of the source AQL modules (by default, /textAnalytics/src in the root directory of the project) and the location of the compiled Text Analytics module (.tam) files (by default, /textAnalytics/bin in the root directory of the project).
However, the AQL code is not automatically moved from the previous location to the new location in textAnalytics/src. You must perform this operation manually.
NOTE: The migration of text analytics properties is irreversible; therefore, it is recommended that you back up your project first.
Step 2.B: Move the AQL v1.x source manually into the textAnalytics/src location within the migrated BigInsights project
Following the design that is established in Step 1, create new AQL modules; create new AQL files within each module, or move existing AQL files to a module.
- To create a new AQL module, click New > BigInsights > Other > AQL Module.
- To create a new AQL file, click New > BigInsights > Other > AQL Script.
- Move AQL source files into an existing module and fix any compilation errors as required by AQL modules. For example, add a module statement at the beginning of each AQL file, and remove any include statements. For details, see the AQL Reference Manual in the BigInsights 2.0 InfoCenter.
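The two mechanical edits mentioned in the last bullet (adding a module statement and removing include statements) can be scripted when many files are involved. The following Python sketch is one possible way to do that and is not part of the Eclipse tools. The function name modularize_aql is hypothetical, and the regular expressions assume that each include statement sits on a line of its own in the form include '<file>'; and that the module statement has the form module <name>;.

```python
import re

# Matches an AQL 1.x include statement on its own line,
# for example: include 'common/person.aql';
INCLUDE_STATEMENT = re.compile(r"^\s*include\s+'[^']*'\s*;\s*$")
# Matches an existing module statement, for example: module person;
MODULE_STATEMENT = re.compile(r"^\s*module\s+\w+\s*;")


def modularize_aql(source, module_name):
    """Return source with include statements removed and a module statement added.

    The module statement is only prepended when the file does not already
    contain one, so running the function twice gives the same result.
    """
    kept = [line for line in source.splitlines() if not INCLUDE_STATEMENT.match(line)]
    body = "\n".join(kept).lstrip("\n")
    if not any(MODULE_STATEMENT.match(line) for line in kept):
        body = "module {};\n\n{}".format(module_name, body)
    return body + "\n"
```

This handles only the textual part of the move. Objects that are defined in one module and used in another still need the export and import statements chosen during the design step, and the compiler errors reported in Eclipse tools remain the guide for what else must be fixed.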
Step 2.C: Validate the migration by running the regression tests and ensuring that the extraction results are the same before migration and after migration.
Step 2.D: Migrate the Extraction Plan of your project, if it exists
When a Text Analytics project of version 1.x is imported to Eclipse tools v2.0 and the Extraction Plan is opened in the BigInsights Text Analytics Workflow Perspective, the Extraction Plan is converted to the new design automatically. However, the AQL source code of the project is still in version 1.x and needs to be migrated explicitly to v2.0, as described earlier. This section outlines a few actions that are necessary on the Extraction Plan after the AQL source is migrated to v2.0.
There are three types of objects in an extraction plan: label, example (snippet and clue), and AQL element.
- Label: In Eclipse tools v2.0, there is a connection between a root label name and four modules named <LabelName>_BasicFeatures, <LabelName>_CandidateGeneration, <LabelName>_FilterConsolidate, and <LabelName>_Finals. In fact, when a root label is created, these four modules are created automatically. When migrating AQL code from v1.x to v2.0, it is recommended to organize AQL code following this concept; in other words, create the modules with those names and move the AQL scripts to the appropriate module. In v2.0, labels can be inside the groups BasicFeatures, CandidateGeneration, FilterConsolidate, and Finals. You should take advantage of this feature to put the sub-labels and their related AQL views together in the same group.
- Example: For this type of object, if the referenced document is also imported into the 2.0 workspace at the same location as before, the examples are still valid and no further action is necessary. You can check whether the examples are still valid by double-clicking them to see if the right documents are loaded and the examples are highlighted correctly. If the examples are not correctly linked to the documents, either move the referenced document to the same location as before or recreate the examples.
- AQL element: After migrating projects to v2.0, it is very likely that the location of AQL files is no longer the same as before; therefore, the AQL views in the Extraction Plan are no longer linked with their definitions in AQL files. You will have to manually fix the AQL views in the extraction plan one by one. There are two ways to fix the linkage.
- Double-click the AQL view. Eclipse tools report that the view cannot be found and ask you to manually set the new location for it. Enter the information about its module and AQL file. The new location information is preserved even when the AQL view object is moved to a different place in the extraction plan.
- Drag-and-drop. Locate the definition of the AQL view in the AQL editor. Delete the existing AQL view object from the extraction plan, then highlight the view name in the definition and drag and drop it into the extraction plan to create a new AQL view object.
Software version: 2.0.0
Operating system(s): Linux
Software edition: Enterprise Edition
Reference #: 1617267
Modified date: 19 November 2012