Crawler plug-ins for archive files are Java™
application programming interfaces (APIs) to which you can add your own logic. You can use this
type of plug-in with type A data source crawlers to extract entries from archive files, which can
then be parsed and included in collections.
Before you begin
Ensure that the correct version of Java is installed.
The crawler plug-in for archive files must be compiled with the IBM® Software Development Kit (SDK) for Java Version
1.6.
Restriction: You cannot use this plug-in with the following type B data source
crawlers:
- Agent for Windows file systems crawler
- BoardReader crawler
- Case Manager crawler
- Exchange Server crawler
- FileNet P8 crawler
- SharePoint crawler
About this task
Type A data source crawlers provide a plug-in interface that enables you to extend their
crawling capabilities and crawl archive files in
Watson Content Analytics.
The crawler uses the specified crawler plug-in for archive files to extract archive entries from
an archive file and send the extracted archive entries to the parsers.
To use this capability,
you must develop a crawler plug-in for archive files that implements the
com.ibm.es.crawler.plugin.archive.ArchiveFile interface and register the plug-in in the crawler
configuration file.
Important: To enable users to fetch and view files that are
extracted from an archive file when they view search results, you must extend your archive
plug-in to support viewing the extracted files.
Procedure
To create and deploy a plug-in for archive files:
- Create a Java class
to use as a crawler plug-in for archive files.
- Implement the com.ibm.es.crawler.plugin.archive.ArchiveFile
interface and implement the following methods:
public interface ArchiveFile {
/**
* Creates a new archive file with the specified InputStream instance.
*/
public void open(InputStream input) throws IOException;
/**
* Close this archive file.
*/
public void close() throws IOException;
/**
* Reads the next archive entry and positions stream at the beginning of
* the entry data.
*
* @param charset the name of charset
* @return the next entry
*/
public ArchiveEntry getNextEntry(String charset) throws IOException;
/**
* Returns an input stream of the current archive entry.
*
* @return the input stream
*/
public InputStream getInputStream() throws IOException;
}
For name resolution, use the ES_INSTALL_ROOT/lib/dscrawler.jar file.
- Implement the com.ibm.es.crawler.plugin.archive.ArchiveEntry
interface and implement the following methods:
public interface ArchiveEntry {
/**
* Returns the name of this entry.
*
* @return the name of this entry
*/
public String getName();
/**
* Returns the modify time of this entry.
*
* @return the modify time of this entry
*/
public long getTime();
/**
* Returns the length of file in bytes.
*
* @return the length of file in bytes
*/
public long getSize();
/**
* Tests whether the entry is a directory.
*
* @return true if the entry is a directory
*/
public boolean isDirectory();
}
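As a concrete illustration of how the two interfaces fit together, here is a minimal sketch of a ZIP-based plug-in that wraps java.util.zip from the standard library. The class name ZipArchiveFile is an example, not part of the product; in a real plug-in the ArchiveFile and ArchiveEntry interfaces come from dscrawler.jar rather than being declared locally (they are inlined here only so the sketch compiles standalone).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Inlined stand-ins for the interfaces provided by dscrawler.jar.
interface ArchiveEntry {
    String getName();
    long getTime();
    long getSize();
    boolean isDirectory();
}

interface ArchiveFile {
    void open(InputStream input) throws IOException;
    void close() throws IOException;
    ArchiveEntry getNextEntry(String charset) throws IOException;
    InputStream getInputStream() throws IOException;
}

// Hypothetical ZIP-based implementation of the plug-in interfaces.
public class ZipArchiveFile implements ArchiveFile {
    private ZipInputStream zipIn;

    public void open(InputStream input) throws IOException {
        zipIn = new ZipInputStream(input);
    }

    public void close() throws IOException {
        if (zipIn != null) {
            zipIn.close();
        }
    }

    public ArchiveEntry getNextEntry(String charset) throws IOException {
        // The charset parameter is ignored in this sketch: ZipInputStream
        // decodes entry names itself and returns null at end of archive.
        final ZipEntry entry = zipIn.getNextEntry();
        if (entry == null) {
            return null;
        }
        return new ArchiveEntry() {
            public String getName() { return entry.getName(); }
            public long getTime() { return entry.getTime(); }
            public long getSize() { return entry.getSize(); }
            public boolean isDirectory() { return entry.isDirectory(); }
        };
    }

    public InputStream getInputStream() throws IOException {
        // After getNextEntry(), the stream is positioned at the start of the
        // current entry's data; return it without closing the archive.
        return zipIn;
    }
}
```

A plug-in for a format such as LZH would follow the same shape, with the format's decoding logic in place of ZipInputStream.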
- Compile the implemented code and create a JAR file for
it. Add the dscrawler.jar file to the class path
when you compile. The crawler plug-in for archive files must be compiled
with the IBM Software Development
Kit (SDK) for Java Version 1.6.
- Verify the crawler plug-in with the com.ibm.es.crawler.plugin.archive.ArchiveFileTester
class. Add the dscrawler.jar file
and your plug-in code to the class path when you run this Java application.
- List the archive entries with your plug-in code. Confirm that this command returns correct information about
the archive file.
- AIX® or Linux
java -classpath $ES_INSTALL_ROOT/lib/dscrawler.jar:path_to_plugin_jar com.ibm.es.crawler.plugin.archive.ArchiveFileTester plugin_classname -tv input_archive_filepath
- Windows
java -classpath %ES_INSTALL_ROOT%\lib\dscrawler.jar;path_to_plugin_jar com.ibm.es.crawler.plugin.archive.ArchiveFileTester plugin_classname -tv input_archive_filepath
- Extract the archive entries with your plug-in code. Confirm that this command extracts all archive entries successfully.
- AIX or Linux
java -classpath $ES_INSTALL_ROOT/lib/dscrawler.jar:path_to_plugin_jar com.ibm.es.crawler.plugin.archive.ArchiveFileTester plugin_classname -xv input_archive_filepath
- Windows
java -classpath %ES_INSTALL_ROOT%\lib\dscrawler.jar;path_to_plugin_jar com.ibm.es.crawler.plugin.archive.ArchiveFileTester plugin_classname -xv input_archive_filepath
- Deploy the crawler plug-in.
- In the administration console, stop the crawler that
you want to use with your crawler plug-in for archive files.
- Create a configuration file named crawler_typecrawler_ext.xml
in the following location, where crawler_ID identifies the crawler that you
want to configure, and crawler_type identifies
the prefix of the existing crawler configuration file. The existing
file is named crawler_typecrawler.xml and
it is located in the ES_NODE_ROOT/master_config/crawler_ID directory.
- AIX or Linux
- $ES_NODE_ROOT/master_config/crawler_ID/crawler_typecrawler_ext.xml
- Windows
- %ES_NODE_ROOT%\master_config\crawler_ID\crawler_typecrawler_ext.xml
- Use a text editor to update the crawler_typecrawler_ext.xml file
and add the rules for your crawler plug-in for archive files. Here is a template crawler configuration file for enabling your
crawler plug-in for archive files.
<ExtendedProperties>
<AppendChild XPath="/Crawler" Name="ArchiveFileRegistry" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry" Name="ArchiveFile" />
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Type">archive_file_type</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Class">plugin_classname</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Classpath">path_to_required_jars</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Path"></SetAttribute>
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Extensions" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile/Extensions"
Name="Extension">archive_file_extension</AppendChild>
</ExtendedProperties>
where:
- archive_file_type
- Specifies the type of the archive files.
- plugin_classname
- Specifies the fully qualified class name of your crawler plug-in
for archive files.
- path_to_required_jars
- Specifies the class path entries, delimited by the path separator, that
are required to run your crawler plug-in for archive files.
- archive_file_extension
- Specifies the file extension of the archive files that you want
to process with your crawler plug-in for archive files.
- Restart the crawler that you stopped.
Example
Here is a sample crawler configuration for enabling the crawler
plug-in for LZH archive files.
<ExtendedProperties>
<AppendChild XPath="/Crawler" Name="ArchiveFileRegistry" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry" Name="ArchiveFile" />
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Type">lzh</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Class">com.ibm.es.sample.archive.lzh.LzhFile</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Classpath">C:\lzhplugin;C:\lzhplugin\lzhplugin.jar</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Path"></SetAttribute>
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Extensions" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile/Extensions"
Name="Extension">.lzh</AppendChild>
</ExtendedProperties>