Crawler plug-ins for archive files are Java™
application programming interfaces (APIs) to which you can add your own logic. You can use this
type of plug-in with type A data source crawlers to extract entries from archive files, which can
then be parsed and included in collections.
Before you begin
Ensure that the correct version of Java is installed.
The crawler plug-in for archive files must be compiled with the IBM® Software Development Kit (SDK) for Java Version
1.6.
Restriction: You cannot use this plug-in with the following type B data source
crawlers:
- Agent for Windows file systems crawler
- BoardReader crawler
- Case Manager crawler
- Exchange Server crawler
- FileNet P8 crawler
- SharePoint crawler
About this task
Type A data source crawlers provide a plug-in interface that enables you to extend their
crawling capabilities and crawl archive files in
Watson Content Analytics.
The crawler uses the specified crawler plug-in for archive files to extract archive entries from
an archive file and send the extracted archive entries to the parsers.
To use this capability,
you must develop a crawler plug-in for archive files that implements the
com.ibm.es.crawler.plugin.archive.ArchiveFile interface and register the plug-in in the crawler
configuration file.
Important: To enable users to fetch and view files that are
extracted from an archive file when they view search results, you must extend your archive
plug-in to support viewing the extracted files.
Procedure
To create and deploy a plug-in for archive files:
- Create a Java class
to use as a crawler plug-in for archive files.
- Implement the com.ibm.es.crawler.plugin.archive.ArchiveFile
interface and implement the following methods:
public interface ArchiveFile {
/**
* Creates a new archive file with the specified InputStream instance.
*/
public void open(InputStream input) throws IOException;
/**
* Close this archive file.
*/
public void close() throws IOException;
/**
* Reads the next archive entry and positions stream at the beginning of
* the entry data.
*
* @param charset the name of charset
* @return the next entry
*/
public ArchiveEntry getNextEntry(String charset) throws IOException;
/**
* Returns an input stream of the current archive entry.
*
* @return the input stream
*/
public InputStream getInputStream() throws IOException;
}
For name resolution, use the ES_INSTALL_ROOT/lib/dscrawler.jar file.
- Implement the com.ibm.es.crawler.plugin.archive.ArchiveEntry
interface and implement the following methods:
public interface ArchiveEntry {
/**
* Returns the name of this entry.
*
* @return the name of this entry
*/
public String getName();
/**
* Returns the modify time of this entry.
*
* @return the modify time of this entry
*/
public long getTime();
/**
* Returns the length of file in bytes.
*
* @return the length of file in bytes
*/
public long getSize();
/**
* Tests whether the entry is a directory.
*
* @return true if the entry is a directory
*/
public boolean isDirectory();
}
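As a concrete illustration of how the two interfaces fit together, here is a minimal sketch of a ZIP-based plug-in that wraps java.util.zip from the standard library. The class name ZipArchiveFile is an example, not part of the product; in a real plug-in the ArchiveFile and ArchiveEntry interfaces come from dscrawler.jar rather than being declared locally (they are inlined here only so the sketch compiles standalone).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Inlined stand-ins for the interfaces provided by dscrawler.jar.
interface ArchiveEntry {
    String getName();
    long getTime();
    long getSize();
    boolean isDirectory();
}

interface ArchiveFile {
    void open(InputStream input) throws IOException;
    void close() throws IOException;
    ArchiveEntry getNextEntry(String charset) throws IOException;
    InputStream getInputStream() throws IOException;
}

// Hypothetical ZIP-based implementation of the plug-in interfaces.
public class ZipArchiveFile implements ArchiveFile {
    private ZipInputStream zipIn;

    public void open(InputStream input) throws IOException {
        zipIn = new ZipInputStream(input);
    }

    public void close() throws IOException {
        if (zipIn != null) {
            zipIn.close();
        }
    }

    public ArchiveEntry getNextEntry(String charset) throws IOException {
        // The charset parameter is ignored in this sketch: ZipInputStream
        // decodes entry names itself and returns null at end of archive.
        final ZipEntry entry = zipIn.getNextEntry();
        if (entry == null) {
            return null;
        }
        return new ArchiveEntry() {
            public String getName() { return entry.getName(); }
            public long getTime() { return entry.getTime(); }
            public long getSize() { return entry.getSize(); }
            public boolean isDirectory() { return entry.isDirectory(); }
        };
    }

    public InputStream getInputStream() throws IOException {
        // After getNextEntry(), the stream is positioned at the start of the
        // current entry's data; return it without closing the archive.
        return zipIn;
    }
}
```

A plug-in for a format such as LZH would follow the same shape, with the format's decoding logic in place of ZipInputStream.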
- Compile the implemented code and create a JAR file for
it. Add the dscrawler.jar file to the class path
when you compile. The crawler plug-in for archive files must be compiled
with the IBM Software Development
Kit (SDK) for Java Version 1.6.
- Verify the crawler plug-in with the com.ibm.es.crawler.plugin.archive.ArchiveFileTester
class. Add the dscrawler.jar file
and your plug-in code to the class path when you run this Java application.
- List the archive entries with your plug-in code. Confirm that this command returns correct information about
the archive file.
- AIX® or Linux
java -classpath $ES_INSTALL_ROOT/lib/dscrawler.jar:path_to_plugin_jar com.ibm.es.crawler.plugin.archive.ArchiveFileTester plugin_classname -tv input_archive_filepath
- Windows
java -classpath %ES_INSTALL_ROOT%\lib\dscrawler.jar;path_to_plugin_jar com.ibm.es.crawler.plugin.archive.ArchiveFileTester plugin_classname -tv input_archive_filepath
- Extract the archive entries with your plug-in code. Confirm that this command extracts all archive entries successfully.
- AIX or Linux
java -classpath $ES_INSTALL_ROOT/lib/dscrawler.jar:path_to_plugin_jar com.ibm.es.crawler.plugin.archive.ArchiveFileTester plugin_classname -xv input_archive_filepath
- Windows
java -classpath %ES_INSTALL_ROOT%\lib\dscrawler.jar;path_to_plugin_jar com.ibm.es.crawler.plugin.archive.ArchiveFileTester plugin_classname -xv input_archive_filepath
- Deploy the crawler plug-in.
- In the administration console, stop the crawler that
you want to use with your crawler plug-in for archive files.
- Create a configuration file named crawler_typecrawler_ext.xml
in the following location, where crawler_ID identifies the crawler that you
want to configure, and crawler_type identifies
the prefix of the existing crawler configuration file. The existing
file is named crawler_typecrawler.xml and
it is located in the ES_NODE_ROOT/master_config/crawler_ID directory.
- AIX or Linux
- $ES_NODE_ROOT/master_config/crawler_ID/crawler_typecrawler_ext.xml
- Windows
- %ES_NODE_ROOT%\master_config\crawler_ID\crawler_typecrawler_ext.xml
- Use a text editor to update the crawler_typecrawler_ext.xml file
and add the rules for your crawler plug-in for archive files. Here is a template crawler configuration file for enabling your
crawler plug-in for archive files.
<ExtendedProperties>
<AppendChild XPath="/Crawler" Name="ArchiveFileRegistry" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry" Name="ArchiveFile" />
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Type">archive_file_type</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Class">plugin_classname</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Classpath">path_to_required_jars</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Path"></SetAttribute>
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Extensions" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile/Extensions"
Name="Extension">archive_file_extension</AppendChild>
</ExtendedProperties>
where:
- archive_file_type
- Specifies the type of the archive files.
- plugin_classname
- Specifies the fully qualified class name of your crawler plug-in
for archive files.
- path_to_required_jars
- Specifies the class path entries, delimited by the path separator, that
are required to run your crawler plug-in for archive files.
- archive_file_extension
- Specifies the file extension of the archive files that you want
to process with your crawler plug-in for archive files.
- Restart the crawler that you stopped.
Example
Here is a sample crawler configuration for enabling the crawler
plug-in for LZH archive files.
<ExtendedProperties>
<AppendChild XPath="/Crawler" Name="ArchiveFileRegistry" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry" Name="ArchiveFile" />
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Type">lzh</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Class">com.ibm.es.sample.archive.lzh.LzhFile</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Classpath">C:\lzhplugin;C:\lzhplugin\lzhplugin.jar</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Path"></SetAttribute>
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Extensions" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile/Extensions"
Name="Extension">.lzh</AppendChild>
</ExtendedProperties>