The Data Fetcher Development Kit
The Data Fetcher Development Kit allows you to integrate a new data
source into IBM Cognos Consumer Insight. You do this by coding, testing, and
publishing a custom data fetcher.
Data Fetcher Development Kit overview
This document explains how you can implement your own data fetcher for use in IBM Cognos Consumer Insight. You do this by creating a file that describes your data fetcher and how to run it. The output created by the data fetcher must be in a specific format and location so that it can be accessed by Cognos Consumer Insight for analysis.
After you create and test your data fetcher, you must publish it. Publishing a data fetcher makes it available for running and testing queries in Cognos Consumer Insight.
The Data Fetcher Development Kit provides two reference implementations that are written in Jaql.
Where to find the Data Fetcher Development Kit
In CCI 18.104.22.168, the kit is extracted into the same directory as cci_installmgr. Look for cci_datafetcher under the install directory to which the fix pack was extracted. The default install path is /local/ibm/cognos/ci/coninsight.
Create the data fetcher .dfm file
The data fetcher meta file (.dfm) is a properties file, which contains information that describes the data fetcher and how to run it.
Properties in the data fetcher .dfm file
You must define the values for the following properties in your .dfm file.
name: The data fetcher name displayed in the UI.
id: The identifier of the data fetcher. This value must match the name of the folder that contains the data fetcher and must not contain spaces.
description: Description of the data fetcher.
version: The version of this meta file. Set this value to 1.
command: Shell command invoked by Cognos Consumer Insight to run the data fetcher. For more information about the format of the command property, see Data fetcher command in the following section.
document_format: Set to "text" if the data fetcher outputs the documents as text files. Set to "hadoop_sequence" if it outputs the documents as Hadoop sequence files.
document_type: The document type value displayed in the administration portal. Multiple data fetchers may share the same document type.
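Putting these properties together, a minimal .dfm file might look like the following sketch. All names and values are illustrative, not taken from the kit; because the .dfm file is a properties file, the key=value syntax follows standard Java properties conventions. The reference implementations under DataFetcher_SDK/reference contain the authoritative examples.

```
name=My Reviews Fetcher
id=MyReviewsFetcher
description=Fetches product review documents from an example source.
version=1
command=./MyReviewsFetcher.sh %OUTPUT_FOLDER% %MAX_DOCS% %MODE% %QUERY% %START_DATE% %END_DATE% %WORKING_FOLDER% %QUICK_SEARCH%
document_format=hadoop_sequence
document_type=Reviews
```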
Data fetcher command
The command specified in the .dfm file is used by Cognos Consumer Insight to invoke the data fetcher. The invocation happens by executing the command inside the folder to which the data fetcher is published. Any relative path used by the command, or by whatever it invokes, must be relative to this folder. Because the publish step copies artifacts to the Hadoop node, use relative paths rather than full paths in your implementation.
The command must contain certain macros, which are replaced at run time with the actual values. The macros must not be surrounded by quotes. The supported macros are listed below.
%OUTPUT_FOLDER%: The folder where the data fetcher puts the fetched documents. This folder will exist and be empty. This macro is required.
%MAX_DOCS%: The maximum number of documents to be fetched. A value of 0 means that all documents matching the query are fetched. This macro is required.
Cognos Consumer Insight uses the %MAX_DOCS% parameter when running test searches in the Data Fetcher > Queries tab of the administration portal. To request more documents, increase the value of the property toro.flowmgr.datafetchers.test.search.max.docs in the properties file used to create Flow Manager's data store. This can help if your data fetcher returns few or no results after the documents are filtered, for example, by query or date. Note that test searches take longer to complete if this value is increased.
%MODE%: The mode in which the data fetcher should run. Use "test" while you are developing the data fetcher. Use "production" when you are validating the data fetcher or after it has been published to Cognos Consumer Insight.
%QUERY%: The query that is used to fetch documents.
%START_DATE%: The start date used to filter fetched documents. It must have the following format: yyyy-mm-dd
%END_DATE%: The end date used to filter fetched documents. It must have the following format: yyyy-mm-dd
%WORKING_FOLDER%: A temporary working folder for the data fetcher's intermediate files. If your Hadoop cluster contains multiple nodes, this folder must be in a shared file system.
%QUICK_SEARCH%: If "true", data fetching should complete as quickly as possible, even if fewer documents than expected are returned. Use this argument when you want to test a query and do not need to wait for all the results. Cognos Consumer Insight sets this argument to "true" when it runs test searches from the Data Fetcher > Queries menu item in the administration portal.
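As an illustration of the substitution, consider a command property that passes a subset of the macros to a hypothetical script, and the invocation that results after Cognos Consumer Insight replaces the macro values at run time (the script name, paths, and values are fabricated for this example):

```
command=./MyFetcher.sh %OUTPUT_FOLDER% %MAX_DOCS% %MODE% %QUERY% %START_DATE% %END_DATE%

# After substitution, the invocation is roughly:
./MyFetcher.sh /shared/fetch/output 0 production ibm 2010-01-01 2011-12-31
```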
Define the output for your data fetcher
The output that your data fetcher produces must be in a specific format and structure.
Format for fetched documents
Your data fetcher must output the documents it retrieves in a predefined JSON format.
Each output file must contain a single array of JSON records. The file encoding must be UTF-8. Each JSON record in this array must match a specific schema so that it will be processed properly by the analysis pipeline. The schema file is in the following location:
The schema is written in the Jaql schema language. All fields have the data type String. Fields marked with a question mark (?) are optional; all other fields are required. For more information about the syntax of the schema, see the Jaql schema language documentation.
If your data fetcher cannot provide values for the optional fields, omit the field. Do not return an empty string or a null value.
The following table lists the fields in the output file and what they are used for:
|Field Name||Required Field||Description|
|Id||Yes||The unique identifier of the document.|
|TextHtml||Yes||The text of the document in HTML format or as plain text.|
|SubjectHtml||Yes||The title of the document in HTML format or as plain text.|
|DocumentType||Yes||The type of the document. The type can be one of the existing document types in Cognos Consumer Insight. For example, “news” or “blogs”. The type can also be a custom document type. For example, “reviews” or “my blogs”. A data fetcher can return multiple document types but only one document type per document. For example, the document type in the .dfm file is “Connections” and the data fetcher returns documents of type “Connections – blogs” and “Connections – forums”.|
|Url||Yes||The URL where the document was found.|
|Published||Yes||The date when the document was published (GMT). It must be specified in format “YYYY-MM-DD HH:MM:SS”.|
|SiteUrl||Yes||The URL of the site where the document was found.|
|SiteName||Yes||The name of the site where the document was found.|
|Language||Optional||The language that the document is written in. If this field is not provided by the data fetcher, Cognos Consumer Insight runs automated language detection on the document. The following languages are currently supported. Use one of these language names (instead of language codes) or omit the field to run the automated language detection: Albanian, Arabic, Bulgarian, Catalan, Chinese, Chinese - Simplified, Chinese - Traditional, Croatian, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French, German, Greek, Hebrew, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kazakh, Korean, Lithuanian, Malay, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Tagalog, Thai, Turkish, Ukrainian, Vietnamese.|
|Author||Optional||The author of the document.|
|AuthorNickName||Optional||The nickname of the document author.|
|AuthorUrl||Optional||The URL of the profile of the document author.|
|AuthorAge||Optional||The age of the document author.|
|AuthorLocation||Optional||The geographic location of the document author.|
|AuthorSex||Optional||The gender of the document author.|
|Tags||Optional||The terms used by the document author to “tag” the post.|
|ThreadId||Optional||The unique identifier of the thread.|
|ThreadTitle||Optional||The title of the thread.|
|IsComment||Optional||A Boolean value that specifies if a document is a comment (“1”) or not (“0”).|
|CommentsInThread||Optional||The number of comments for the associated thread.|
|Country||Optional||The 2-letter country code defined by ISO 3166. Valid country codes are listed at https://www.iso.org/obp/ui/#search/code/|
|Rating||Optional||The rating that the document author has given to the subject that the document covers. It must be an integer between 0 and 100; for example, a rating of "4 out of 5 stars" corresponds to a value of 80.|
|PreferredSnippetMode||Optional||A flag that determines how the analytics pipeline derives snippets from the text in the document. Valid values are "TypeMatch" and "FullDocument". If "TypeMatch" is selected, the analytics pipeline searches for matches of a type's patterns and builds snippets using the matching and surrounding sentences. If "FullDocument" is selected, the analytics pipeline searches for matches of a type's patterns in the text provided in the field "TriggerForFullDocumentMode" and splits the text of the document into snippets if it finds a match there. This is useful, for example, for documents with review content in which the whole document text is associated with the item named in the title. If this field is not provided, the analytics pipeline uses "TypeMatch" as the default mode.|
|TriggerForFullDocumentMode||Optional||The text used to search for type patterns if the field "PreferredSnippetMode" is set to "FullDocument".|
Be aware that outputting incomplete documents or files can lead to a failure during the analysis phase of a job.
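For illustration, a record that satisfies the required fields might look like the following. All values are fabricated; note that every field, including Rating, is a String, and that optional fields with no value are omitted entirely rather than set to empty or null:

```json
{
  "Id": "rev-0001",
  "TextHtml": "<p>Great battery life and a solid build.</p>",
  "SubjectHtml": "Example phone review",
  "DocumentType": "Reviews",
  "Url": "http://www.example.com/reviews/rev-0001",
  "Published": "2011-11-09 03:35:12",
  "SiteUrl": "http://www.example.com",
  "SiteName": "Example Reviews",
  "Language": "English",
  "Author": "Jane Doe",
  "Rating": "80"
}
```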
Output file structure
You can write the fetched documents to the output folder in one of two formats: Hadoop sequence files or text files. The format in which the documents are output must be defined in the data fetcher meta file.
Text files
Write the fetched documents to files named with a .json suffix. Put these files anywhere under the output folder. Cognos Consumer Insight reads any file matching the following structure: <output_folder>/**/*.json
Each .json file must contain valid JSON content and use the following structure:
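The expected shape is a single UTF-8 encoded JSON array of document records, each record using the fields listed in the table above. A minimal sketch (field values elided for brevity):

```json
[
  { "Id": "doc-1", "TextHtml": "…", "SubjectHtml": "…", "DocumentType": "Reviews", "Url": "…", "Published": "2011-01-01 00:00:00", "SiteUrl": "…", "SiteName": "…" },
  { "Id": "doc-2", "TextHtml": "…", "SubjectHtml": "…", "DocumentType": "Reviews", "Url": "…", "Published": "2011-01-02 00:00:00", "SiteUrl": "…", "SiteName": "…" }
]
```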
Hadoop sequence files
Write the fetched documents to Hadoop sequence files named with a .json suffix. When your data fetcher writes to Hadoop sequence files, ensure that it uses com.ibm.jaql.io.hadoop.converter.ToJsonTextConverter as the output converter and com.ibm.jaql.io.hadoop.TextFileOutputConfigurator as the output configurator. This ensures that JSON documents are written to the sequence files correctly, with one document per line. Also ensure that you use org.apache.hadoop.io.NullWritable as the key type and org.apache.hadoop.io.Text as the value type. For more information, see http://publib.boulder.ibm.com/infocenter/bigins/v1r1/index.jsp?topic=%2Fcom.ibm.swg.im.infosphere.biginsights.doc%2Fdoc%2Fc0057877.html.
Cognos Consumer Insight reads any folder or subfolder and file with ".json" suffix as a Hadoop sequence file when it matches the following structure: <output_folder>/**/*.json or <output_folder>/**/*.json/*
Your data fetcher can provide an error message to be displayed in the Cognos Consumer Insight user interface. To do this, create a file in the output folder named "error". Keep the message short, since it is intended to be displayed in the user interface. If you want to provide a long description about the problem, write it to the log.
Print log messages to the standard output or error streams. When the data fetcher is run, Cognos Consumer Insight redirects these two streams to a log file.
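The two mechanisms above can be sketched as follows. This is a hypothetical failure path inside a data fetcher script; the message text and the /tmp path are illustrative, and at run time the script would use the folder substituted for the %OUTPUT_FOLDER% macro instead:

```shell
# Stand-in for the value substituted for the %OUTPUT_FOLDER% macro.
OUTPUT_FOLDER="/tmp/df_output"
mkdir -p "$OUTPUT_FOLDER"

# Short message shown in the Cognos Consumer Insight user interface.
printf '%s\n' "Could not reach the data source" > "$OUTPUT_FOLDER/error"

# Longer diagnostics go to the output or error streams, which Cognos
# Consumer Insight redirects to a log file.
echo "fetch failed: no response from the source feed after 3 retries" >&2
```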
Test and validate your data fetcher
To test and validate a data fetcher, run cci_datafetcher.sh using the data fetcher folder as input. This script is located under cci_datafetcher/DataFetcher_SDK/ in the install directory. Validate the output that is produced. Make sure the .dfm file is located under the data fetcher implementation directory before validation is run. For the example below, the .dfm file would be: reference/Reviews/reviews.dfm.
Example: ./cci_datafetcher.sh --validate --df_folder reference/Reviews/ --outputPath output_folder --query "my query" --start_date 2010-01-01 --end_date 2011-12-31
If the data fetcher does not return any documents, the validation will fail with a message that indicates no documents were found in the output. If the data fetcher does return documents but validation fails, refer to /DataFetcher_SDK/log/jaql.log to see which record (document) caused the failure.
Publish your data fetcher
Before a user can use your data fetcher in Cognos Consumer Insight, you must publish it using the Management Console command line interface. When you publish the data fetcher the following things happen:
- Properties related to the published data fetcher are added to the .dfm file.
- The contents of the data fetcher are copied to a folder in the publishing location on the server. The name of the folder is the data fetcher id.
Before you publish your data fetcher, complete the following tasks:
1. Validate your data fetcher using the validation tool.
2. Put the data fetcher that you want to publish under the DataFetcher_SDK folder. This ensures that the correct libraries are included when you publish the data fetcher. Warning: Changes to folders or files provided in the original DataFetcher_SDK folder can overwrite existing files in the publishing location on the server when that data fetcher is published again.
3. Make sure all files have the necessary file permissions before publishing them. Cognos Consumer Insight cannot execute your data fetcher properly if any executable file lacks at least "read" and "execute" permissions.
4. The .sh file referenced in the .dfm file must configure the environment in which your data fetcher runs; setting up class paths for a Java implementation is one example. The publish stage does not configure the environment for the data fetcher; this is solely the data fetcher's responsibility.
Note: The current directory during the data fetcher's execution is the directory that contains the .dfm file; all relative paths are resolved relative to that directory.
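A launcher script of the kind step 4 describes might look like the following sketch. The jar and class names are hypothetical, not part of the kit; the script sets up its own classpath using paths relative to the .dfm directory, because the publish stage does not do this:

```shell
#!/bin/sh
# Hypothetical launcher referenced by the command property of a .dfm file.
# Jar and class names are illustrative. Paths are relative to the .dfm
# directory, which is the working directory at execution time.
CLASSPATH="lib/myfetcher.jar:lib/commons-lang.jar"
export CLASSPATH
# Forward the substituted macro values to the implementation. A real
# launcher would run: java com.example.MyFetcher "$@"
echo "launching with classpath $CLASSPATH"
```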
Published structure and artifacts
<any other file required to execute the command defined in the .dfm file>
Publishing your data fetcher
Publishing a data fetcher makes it available for execution. To publish a data fetcher, run the following command:
./cci_cli.sh -process publish
This script is different from the one used to validate the data fetcher; it is located under cci_installmgr/cci_mngmt/cci_cli/ in the install directory. As mentioned earlier, the default install directory path is /local/ibm/cognos/ci/coninsight. Run the script as cciusr, and run it from within the cci_cli/ directory so that it finds all the files it needs to copy over.
The following prompt appears:
Publish Data Fetcher
You are about to access a third party data site which may be subject
to acceptance of the third party's terms and conditions. IBM is not
responsible for the third party content and is not a party to such
Do you agree to this legal statement (y/n)[n]:
If you reply "y" to the legal statement, the process continues. If you reply "n", the process exits.
When the process continues the following message appears:
Path to Data Fetcher metafile:
Enter the path of the .dfm file that you wish to process. For example, to publish the reviews data fetcher, enter './reference/Reviews/reviews.dfm', or specify the full path '/local/ibm/cognos/ci/coninsight/cci_datafetcher/DataFetcher_SDK/reference/Reviews/reviews.dfm'.
Removing your data fetcher
To remove the data fetcher from Cognos Consumer Insight, you must unpublish it. To do this, type the following command:
./cci_cli.sh -process unpublish datafetcherId <id from .dfm>
Updating your data fetcher
If you have made changes to your data fetcher and want to update it, simply apply your changes to the original data fetcher implementation and publish it. You do not need to remove it first.
Troubleshooting and logging
Logs and other artifacts for searches run against custom data fetchers from the Data Fetcher > Queries tab in the administration UI are found under the following directories on the Hadoop node:
<Flow Manager Home>/search_test/<TenantId>_<DocumentType>_<TimeStamp>/ and <Flow Manager Home>/logs/search_test/<TenantId>_<DocumentType>_<TimeStamp>/
For example: /home/hadoop/FlowMgr/search_test/tenant1_IBMConnections_2011-11-09_03-35-12/ and /home/hadoop/FlowMgr/logs/search_test/tenant1_IBMConnections_2011-11-09_03-35-12/
Failed custom data fetcher jobs
If queries against a custom data fetcher fail for any reason, the overall job continues to run, but logs and output artifacts for that data fetcher run are created under a dedicated directory, <PermanentDirs>/customDatafetcherFailedJobs, with the following structure:
/<TenantId>/<TimeStamp>/<DocumentType>/<DataFetcherId>/<QueryId>/<JobId as shown in Admin UI>
For example, the following directory is created if, under "tenant1", the "Reviews" data fetcher fails for a query with id "cognos":
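Given the structure above, the resulting directory would look something like the following, assuming the document type and data fetcher id are both "Reviews" (the <TimeStamp> and <JobId> placeholders stand for the actual run values):

```
<PermanentDirs>/customDatafetcherFailedJobs/tenant1/<TimeStamp>/Reviews/Reviews/cognos/<JobId>
```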
Data fetcher reference implementations
The Data Fetcher Development Kit contains two reference implementations. The first reference implementation retrieves data from IBM Connections blogs and the second one retrieves review documents which are provided on an FTP server.
Both data fetchers are implemented in Jaql. They share a common set of Jaql scripts, which are located in DataFetcher_SDK/api. The implementation-specific Jaql scripts are located in DataFetcher_SDK/reference/IBMConnections and DataFetcher_SDK/reference/Reviews.
The common Jaql API scripts control the overall flow of the data fetchers. The reference implementations provide variables and functions that are used or called by the API scripts. The reference implementations also contain the .dfm files which describe the data fetcher metadata.
Common Jaql API scripts
The following API Jaql scripts are shared by the two reference implementations.
The script contains input and output format definitions that allow you to read and write files in text and sequence format.
The script declares Java UDFs that implement helper functions for use by the reference implementations. It also provides Jaql helper functions, for example, to convert XML data to JSON format, log errors, or read data from input streams. The options for the various stream types are also defined in the script. Comments in the script provide more details.
This is the main Jaql script that controls the flow of the data fetchers.
If the data fetcher mode is set to "test", the script creates a URL that retrieves data from the data source and performs one call using this URL. If necessary, it converts the returned data from XML to JSON format. It then maps the data to the metadata format that is required by IBM Cognos Consumer Insight. In test mode, the script writes two different outputs. First, it writes the data as read from the data source and converted to JSON format to the folder DataFetcher_SDK/reference/<implementation>/original_documents; this allows you to see the original structure of the data for later mapping to the required metadata format. Second, it writes the data after mapping to the metadata format to the output folder in sequence format, so that you can compare the data before and after mapping.
If you start the data fetcher in "production" mode, it will first create the URL and perform a first call to the data source. Both reference implementations provide the total number of documents that will be returned in the result feed. Based on the total number of documents, the data fetcher fetches the data in chunks from the data source, performs the metadata mapping and writes the transformed data to the output folder in sequence format.
Use the script cci_datafetcher.sh to run a reference implementation in test or production mode. For example:
./cci_datafetcher.sh --execute --df_folder reference/Reviews/ --outputPath ./output --mode test --query "IBM" --start_date "2005-07-30" --end_date "2012-01-01"
To see a list of all options, type the following command in the command line:
If you use the quick_search option, Jaql is started with the option -Djaql.mapred.mode=local to run in Hadoop local mode. This ensures that the data fetcher does not wait for other long-running jobs on the Hadoop cluster.
IBM Connections blogs data fetcher
This data fetcher retrieves blog entries from an IBM Connections server and has been tested with IBM Connections 3.0.1. It uses the Atom search API to retrieve content for a specific query.
To support IBM Connections 2.5, apply the following change before publishing the data fetcher:
- In DataFetcher_SDK/reference/IBMConnections/Functions.jaql, change the following line subjecthtml: transformText($.(atomNS)."title"."text()") to subjecthtml: transformText($.(atomNS)."title")
The entry point for the data fetcher is the meta information file DataFetcher_SDK/reference/IBMConnections/IBMConnections.dfm. Besides the meta information described previously, it specifies "Connections - blogs" as the document type shown for the data fetcher in the administration portal and contains the command that IBM Cognos Consumer Insight calls to run the data fetcher. The command runs the shell script IBMConnections.sh provided with the reference implementation, passing the macros (query, start date, end date, and so on) described previously. The shell script calls the Jaql files provided in DataFetcher_SDK and the reference folder and passes the values to Jaql.
Note that if the IBM Connections server resides behind a firewall, you must ensure that the certificate is properly installed on the Hadoop nodes by accessing the server through a browser and confirming the security exception.
Note that some expected pages may not show up when an IBM Connections data fetcher job is run. This is because IBM Connections searches both the blog entry title and the entry text, while CCI's fetch and consolidation searches only the entry text (TextHtml) for snippets. Only the TextHtml field is used for snippet analysis.
The reference folder for IBM Connections contains the following files:
This file provides several variables like the document type that will be displayed for snippets in the analysis portal and the Cognos Consumer Insight reports and various substrings that are needed to build the URL that is used to fetch data from IBM Connections.
This file contains one statement to perform a limited query conversion for the query that is passed from administration portal. It converts the query from the BoardReader query syntax to the query syntax that is used by IBM Connections.
This Jaql script implements functions that are called by the common API Jaql scripts. It builds the first and subsequent URL calls and specifies an http stream as input stream. If your IBM Connections server requires username and password, use an https stream. The subsequent URL calls use different page offsets to get the complete search result.
The main function is getEntries, which performs the data conversion and maps the Atom tags to the required metadata format. It also filters the documents by the start and end dates specified for the query, because IBM Connections does not support filtering by date in the search API. The result feed returned by IBM Connections contains only a summary of the blog text, so getEntries also uses the getBlogText function to retrieve the complete blog entry. If the blog entry cannot be retrieved, the document is discarded.
The IBM Connections search API also returns blog entries when the query matches a comment of a blog entry rather than the blog text itself. The reference implementation does not retrieve comments of blog entries; only the blog text itself is retrieved. This can result in retrieved documents that do not contain a snippet during the analysis phase of a job.
Reviews data fetcher
This data fetcher retrieves documents containing product reviews, packaged in zip files on an FTP server. The FTP server used in this implementation is hosted by BoardReader. To use the review data fetcher, you must contact BoardReader to obtain a username and password, the host IP address, and a dedicated folder that contains review documents based on your requirements. You must also register your IP address with BoardReader to get access to this server.
The reviews data fetcher does not use a search API. Each URL call fetches a zip file from the FTP server and uncompresses it. Each zip file contains several review documents.
The data fetcher meta file is DataFetcher_SDK/reference/Reviews/Reviews.dfm. As with the IBM Connections data fetcher, the command invokes a shell script that passes the parameters to the Jaql scripts.
The FTP URL (SourceUrl) to which the data fetcher points should follow a flat directory hierarchy; that is, it should be a single directory containing all the content, without any subfolders.
The document type returned by the data fetcher is "Reviews". The SourceUrl variable contains an FTP address, including user and password. To use this data source, modify the value of this variable with the correct user and password, the server name or IP address, and the name of the folder on the FTP server. This information can be obtained from BoardReader.
This script reads the list of available zip files on the FTP server so that they can be fetched in subsequent calls from the server. The query that is passed from the administration portal is interpreted as a regular expression. There is no query conversion from BoardReader query syntax to regular expressions. The only change to the query is that it is made case insensitive.
This script implements the functions called by the common API Jaql script. The first and subsequent URLs are the file names of the zip files on the FTP server. The ftpGZIPStream that is used unzips and reads the zip files. The getEntries function performs the data conversion and mapping to the IBM Cognos Consumer Insight metadata format. The function searches for the query string in the subject and text of the reviews and filters by start and end date. The value of the PreferredSnippetMode field for the documents is "FullDocument", using the product name as a trigger: during the analysis phase, the document is split into snippets if a type match is found in the product name. This differs from the default snippet creation mode, which creates snippets from sentences in the document text that contain the type match. The assumption is that the whole review deals with the product specified in the product name field, even if the text itself does not mention the product.
Note: The TextHtml field is prepended with the subject field of the metadata. This allows the subject to be analyzed and taken into account for snippet creation during the analytics phase of a job.