Exporting documents for use in other applications

To use information from Watson Content Analytics for other purposes, such as data warehouse, business intelligence, classification, and eDiscovery or compliance applications, you can export documents from collections and then import the exported data into your applications.

When you configure options for exporting documents, you specify whether the documents are exported as XML files or CSV files to a file system, exported to a relational database, or exported according to a format and location that is specified by a custom plug-in. Watson Content Analytics does not provide any utilities for importing the exported documents into other applications.
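Because the import step is left to you, a short script is often enough for simple targets. The following Python sketch loads an exported CSV file into a SQLite table. The column names (docid, title, date), the header row, and the table name exported_docs are assumptions made for illustration; check the files that your own export configuration produces before you rely on any of them.

```python
import csv
import io
import sqlite3


def load_csv_export(csv_file, conn, table="exported_docs"):
    """Load rows from an exported CSV file object into a SQLite table.

    Assumes the first row is a header; every column is stored as TEXT.
    Rows whose length does not match the header are skipped.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    # Quote identifiers so that unusual column names do not break the SQL.
    quoted_table = '"{}"'.format(table.replace('"', '""'))
    cols = ", ".join('"{}" TEXT'.format(name.replace('"', '""')) for name in header)
    conn.execute("CREATE TABLE IF NOT EXISTS {} ({})".format(quoted_table, cols))
    placeholders = ", ".join("?" for _ in header)
    rows = [row for row in reader if len(row) == len(header)]
    conn.executemany(
        "INSERT INTO {} VALUES ({})".format(quoted_table, placeholders), rows
    )
    conn.commit()
    return len(rows)


# Hypothetical sample data standing in for an exported CSV file.
sample = io.StringIO(
    "docid,title,date\n"
    "doc-001,Fuel pump relay,2014-03-02\n"
    "doc-002,Brake noise,2014-03-05\n"
)
conn = sqlite3.connect(":memory:")
loaded = load_csv_export(sample, conn)
```

For a real export, you would pass an open file instead of the in-memory sample and connect to the database that your application uses.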

Restriction: To take advantage of the automatic metadata facets, duplicate document detection, and document flagging support that Watson Content Analytics provides to assist with content assessment, all products in the environment must run on the same operating system. For example, if you export data from Watson Content Analytics, and then use IBM® Content Collector to add the exported data to IBM FileNet® P8, all three products must run on the same operating system.

Types of documents that you can export

Documents can be exported when they are crawled, after they are parsed and analyzed, or as the result of searching or mining a collection.

Crawled documents
You can export documents that were crawled by Watson Content Analytics crawlers from the document cache before they are parsed or analyzed. You might select this approach, for example, if your applications need to collect documents from various sources.

In this scenario, assume that your content management system supports importing documents from a file system, but you also want to import documents that are stored in a BLOB column in a relational database. You can configure a database crawler to crawl the BLOB column and then export the crawled data to XML files on a file system. You can then import the exported files into your content management system.

Analyzed documents
You can export analyzed documents from the document processing pipeline. You might select this approach, for example, if you want to make use of unstructured information in a business intelligence application for structured data. The exported data includes all tokens, metadata, and facets generated from annotations that were added to documents by parsers and by the UIMA annotators that the collection is configured to use.

In this scenario, assume that you want to analyze reports about defects in automobiles. The reports contain structured data, such as a problem code or the date of the report. Each report also includes a description of the customer's complaint and how the problem was addressed in free text format. For example, the problem report might be "Customer smelled burning odor under the hood" and the problem solution might be "Rusty connection to the fuel pump relay was replaced."

You can create an annotator to extract industry-specific keywords or patterns from unstructured text, such as the symptoms of a problem, the names of replacement parts, and so on. If you configure a collection to use the annotator, relevant unstructured data can be analyzed, extracted, and annotated when crawled documents enter the document processing pipeline. You can export this analyzed data to XML files, CSV files, or a relational database, and then import the exported data into your business intelligence application.
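To make the idea of keyword extraction concrete, the following Python sketch matches small dictionaries against the example text from this scenario. It is a simplified stand-in and not a UIMA annotator: the dictionaries, the facet names, and the annotate function are all hypothetical, and a real annotator would rely on domain-specific resources and the UIMA framework.

```python
import re

# Hypothetical dictionaries; a real annotator would use domain-specific resources.
SYMPTOMS = ["burning odor", "grinding noise", "engine stall"]
PARTS = ["fuel pump relay", "brake pad", "alternator"]


def annotate(text, terms, facet):
    """Return (facet, term, start, end) tuples for each dictionary term found in text."""
    annotations = []
    for term in terms:
        # Match whole words only, ignoring case.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append((facet, term, match.start(), match.end()))
    return annotations


report = "Customer smelled burning odor under the hood"
solution = "Rusty connection to the fuel pump relay was replaced."

symptoms = annotate(report, SYMPTOMS, "symptom")
parts = annotate(solution, PARTS, "part")
```

Each tuple records the facet, the matched term, and the character offsets of the match, which is roughly the kind of information that an exported annotation would need to carry.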

If you use IBM Cognos® Business Intelligence (IBM Cognos BI), you can configure Watson Content Analytics to export documents directly to a relational database. You can then run online analytical processing (OLAP) queries against the reports to do a more in-depth analysis of both structured and unstructured data.
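As a rough illustration of the kind of analysis that becomes possible after structured fields and extracted facets sit in the same table, the following Python sketch runs an aggregate query in SQLite. It does not use IBM Cognos BI, and the defect_reports table, its columns, and the sample rows are hypothetical; the schema that an actual export creates depends on the fields and facets that you select.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one structured field (problem_code) and one
# facet extracted from free text (part).
conn.execute(
    "CREATE TABLE defect_reports (docid TEXT, problem_code TEXT, part TEXT)"
)
conn.executemany(
    "INSERT INTO defect_reports VALUES (?, ?, ?)",
    [
        ("doc-001", "P100", "fuel pump relay"),
        ("doc-002", "P100", "fuel pump relay"),
        ("doc-003", "P100", "alternator"),
        ("doc-004", "P200", "brake pad"),
    ],
)

# Count reports for each combination of problem code and replaced part.
counts = conn.execute(
    """
    SELECT problem_code, part, COUNT(*) AS reports
    FROM defect_reports
    GROUP BY problem_code, part
    ORDER BY reports DESC, problem_code, part
    """
).fetchall()
```

The query combines a value that came from structured data with a value that came from text analysis, which is the point of exporting analyzed documents to a database.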

Searched documents
You can query a collection, narrow the results by selecting facets and specifying additional search criteria, and then export documents that match the results. You might select this approach, for example, if you want to create a subset of documents that require further investigation.

In this scenario, assume that your legal compliance department asks you to gather documents in response to a discovery request regarding patent infringement. The documents of interest are crawled, analyzed, and stored in the index. However, the collection also includes documents that are of no value to the current investigation. When you search the collection, you can specify criteria to limit the results to documents that are relevant to the discovery request. You can then export the matching documents to XML files and import the exported files into your eDiscovery system, such as IBM eDiscovery Manager.

In another scenario, assume that you need to train an IBM Content Classification knowledge base. When you query the collection, you can export documents that match your search conditions as XML files. When the documents are exported, a catalog.xml file that contains information about the fields in the documents is also exported. If you import the document XML files and catalog.xml file into Classification Workbench, you can use the data to train knowledge bases and decision plans. By repeatedly searching collections and exporting documents, you can improve how content is classified over time.

Export options

To export documents, you must enable the document cache for the collection.

When you configure export options for crawled or analyzed documents, you can specify whether the documents are to be added to the index. For example, if you use Watson Content Analytics primarily as a means to collect documents or collect analytical data about content, then you might want to export the documents without adding them to the index.

When you configure export options for searched documents, you can configure schedules to control when the documents are to be exported from the document cache. You can create a general schedule for all export requests and configure custom schedules for individual requests. You can also schedule the request to run on an incremental basis. In this case, only documents that were added to the index after the last time the export program ran are exported.
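The selection rule behind an incremental export can be sketched in a few lines of Python. The product performs this selection itself; the function name, the indexed_at field, and the sample entries below are hypothetical and serve only to show which documents an incremental run would pick up.

```python
from datetime import datetime


def select_incremental(documents, last_run):
    """Return documents that were indexed after the last export run.

    If last_run is None (no earlier run), every document is selected.
    """
    if last_run is None:
        return list(documents)
    return [doc for doc in documents if doc["indexed_at"] > last_run]


# Hypothetical index entries.
index = [
    {"docid": "doc-001", "indexed_at": datetime(2014, 3, 1, 9, 0)},
    {"docid": "doc-002", "indexed_at": datetime(2014, 3, 4, 14, 30)},
    {"docid": "doc-003", "indexed_at": datetime(2014, 3, 6, 8, 15)},
]
last_export = datetime(2014, 3, 3, 0, 0)
to_export = [doc["docid"] for doc in select_incremental(index, last_export)]
```

In this sample, only the two documents that were indexed after the last run are selected; the document that was indexed earlier is assumed to have been exported already.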

When you configure export options for crawled, analyzed, or searched documents:
  • You can enable or disable the ability to export documents from the collection.
  • You specify whether you want to export documents as XML files, as CSV files, to a relational database, or according to the logic in a custom plug-in.
  • Depending on the export format, wizards can help you specify export options. For example, if you export documents to a relational database, a wizard helps you specify information about the target database and the fields and facets that you want to export. If you use IBM Cognos BI, the wizard also helps you specify options for exporting directly to an IBM Cognos BI database.
  • The exported data reflects the documents as they exist at the time of the export and does not include any history of changes to the documents. For example:
    • If the originally crawled document is removed from the crawl space, the exported document is not removed from the output file system.
    • For a searched document export, documents that are added to the index after the export process starts are not exported.