Web services often offer text analysis functions that can enrich your documents. The Web Services converter accesses web service functions via their REST API and adds the response to your document, which you may optionally post-process in your own custom converter.
The Web Services converter augments data with content and analysis from an external web service while that data is being ingested. It is generalized for use with nearly any web service. Like all Watson Explorer converters, it processes the text to be indexed at ingestion time. When configured properly, the Web Services converter sends administrator-defined name-value pairs as CGI parameters to a REST-based web service. The response from the REST web service is then stored in its entirety in a new administrator-specified <content> element. The Web Services converter is designed to work in conjunction with two custom converters: one that pre-processes the text before it is sent to the web service, and one that post-processes the response.
For an example of the Watson Explorer Web Services converter configured for use with the Watson Developer Cloud Relationship Extraction service, see the wex-web-services-converter example on GitHub.
Additional Watson Explorer cloud integrations can also be found on GitHub.
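The request and response handling described above can be illustrated with a minimal Python sketch. It is not the converter's implementation: the function names, the parameter names (txt, rt), and the content name (enrichment) are hypothetical, and escaping the response before storing it is an assumption made here for illustration. No network call is made; the sketch only shows how name-value pairs become CGI-style parameters and how a response could be held whole in one <content> element.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape, quoteattr


def build_request_body(params):
    """URL-encode administrator-defined name-value pairs as CGI parameters."""
    return urlencode(params)


def wrap_response(content_name, response_text):
    """Store the entire web service response in a single <content> element.

    Assumption: the response is XML-escaped so that it cannot break the
    surrounding document; a post-processing converter could decode it later.
    """
    return "<content name=%s>%s</content>" % (
        quoteattr(content_name),
        escape(response_text),
    )


if __name__ == "__main__":
    # Hypothetical parameters; real names depend on the web service.
    body = build_request_body({"txt": "IBM acquired Vivisimo", "rt": "xml"})
    print(body)  # txt=IBM+acquired+Vivisimo&rt=xml
    print(wrap_response("enrichment", "<rel/>"))
    # <content name="enrichment">&lt;rel/&gt;</content>
```

Keeping the response intact in one element leaves all interpretation to the post-processing converter, which matches the division of labor described above.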
Implementation considerations:
Security - If you might be crawling sensitive data, ensure at the very least that the web service can be accessed over an encrypted HTTPS connection. Additional security options in the Advanced HTTP Config section of the Web Services converter configuration allow Watson Explorer Engine to establish a secure channel, authenticate, and so on. For more detail, see the tooltip for each setting.
Performance - Adding the Web Services converter can significantly impact the collection's total crawl time. Factors include: the size of the web service request, the size of the web service response, the web service processing time, latency and bandwidth of the networks connecting the Watson Explorer server and the web service endpoint, etc. As mentioned earlier, a caching proxy will likely increase crawl performance if there is any chance of duplicate web service calls during crawling or subsequent refreshes.
Failures will happen - All distributed systems are inherently unreliable and failures will inevitably occur when calling out to a web service. Carefully consider how failures should be handled at conversion time. Should the whole document fail? Is a partially indexed document without enriched metadata OK for your use cases?
Data Preparation - It is the responsibility of the caller to ensure that representative data is being sent to a web service. Data preparation strategies not demonstrated here may be required in some cases. For example, some Watson Explorer Engine converters, such as PDFtoHTML or WordtoHTML, produce HTML tags in the "snippet" <content>. These tags provide hints to the indexer, but in a snippet they become encoded XML. Prepare your data carefully and ensure it is clean enough for your web service.
Scalability - Some web services will enforce limits on usage. Such limits may include metrics like calls per day, data per call, number of simultaneous calls, etc. Carefully consider the demand you are placing on the web service. You may need to reduce the aggressiveness of the Watson Explorer Engine crawl, reduce the number of converters that may be running simultaneously, or reduce the number of simultaneously active crawls in order to stay within the operational limits of the web service or within the limits of your license to use it.
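Several of the considerations above (data preparation, caching, failure handling, and usage limits) can be sketched together in a small Python wrapper. This is an illustration under stated assumptions, not part of Watson Explorer: EnrichmentClient, clean_text, and the call_service callable are hypothetical names, the tag stripping is deliberately naive, and returning None on failure represents the choice to index a document without enrichment rather than fail it.

```python
import html
import re
import time

TAG_RE = re.compile(r"<[^>]+>")


def clean_text(snippet):
    """Decode encoded markup, drop HTML-like tags, and collapse whitespace.

    Naive by design: text such as "a < b > c" would also be altered.
    """
    decoded = html.unescape(snippet)
    return " ".join(TAG_RE.sub(" ", decoded).split())


class EnrichmentClient:
    """Hypothetical wrapper around a web service call (call_service)."""

    def __init__(self, call_service, max_attempts=3, min_interval=0.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.call_service = call_service  # callable: text -> response
        self.max_attempts = max_attempts
        self.min_interval = min_interval  # seconds between calls
        self.sleep = sleep
        self.clock = clock
        self.cache = {}                   # cleaned text -> response
        self.last_call = None

    def _throttle(self):
        # Space calls out to respect a per-call rate limit.
        if self.last_call is not None:
            wait = self.min_interval - (self.clock() - self.last_call)
            if wait > 0:
                self.sleep(wait)
        self.last_call = self.clock()

    def enrich(self, snippet):
        text = clean_text(snippet)
        if text in self.cache:            # avoid duplicate service calls
            return self.cache[text]
        for attempt in range(self.max_attempts):
            self._throttle()
            try:
                result = self.call_service(text)
            except Exception:
                # Broad catch for the sketch; back off, then retry.
                if attempt + 1 < self.max_attempts:
                    self.sleep(0.1 * 2 ** attempt)
                continue
            self.cache[text] = result
            return result
        return None                       # caller indexes without enrichment
```

The sleep and clock arguments are injectable so the behavior can be exercised without real delays. Whether a None result should fail the whole document instead is exactly the decision raised under "Failures will happen".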
Adding the Web Services Converter
After successfully completing these tasks, you must create two custom converters. The first pre-processes the text that is sent to the web service. See Creating the Custom Converter for Pre-Processing Text.
See Custom Converters for more information about custom converters.