Web crawler plug-ins

The web crawler supports two types of plug-ins: a prefetch plug-in and a postparse plug-in.

With the prefetch plug-in, you can use Java™ APIs to add fields to the HTTP request header that is sent to the origin server to request a document.
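For example, the following sketch shows a prefetch plug-in that appends a custom field to the outgoing request header. The package, interface, and method names (com.ibm.es.wc.pi.PrefetchPlugin, PrefetchPluginArg1, getHTTPHeader, setHTTPHeader) are assumptions based on the general shape of the crawler plug-in API; verify them against the Javadoc that is installed with your system.

    // Sketch of a prefetch plug-in that adds a field to the HTTP request
    // header. All com.ibm.es.wc.pi names here are assumptions; check the
    // installed Javadoc for the actual interface.
    import com.ibm.es.wc.pi.PrefetchPlugin;
    import com.ibm.es.wc.pi.PrefetchPluginArg;
    import com.ibm.es.wc.pi.PrefetchPluginArg1;

    public class HeaderPrefetchPlugin implements PrefetchPlugin {
        public boolean init() { return true; }     // one-time setup
        public boolean release() { return true; }  // one-time cleanup

        // Called once per download request; can run on many threads at once.
        public boolean processDocument(PrefetchPluginArg[] args) {
            PrefetchPluginArg1 arg = (PrefetchPluginArg1) args[0];
            // Append a custom field; HTTP header lines end with CRLF.
            arg.setHTTPHeader(arg.getHTTPHeader() + "X-Crawler-Trace: on\r\n");
            return true;  // assumed convention: true lets the download proceed
        }
    }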

With the postparse plug-in, you can use Java APIs to view the content, security tokens, and metadata of a document before the document is parsed and tokenized. You can add to, delete from, or replace any of these fields, or stop the document from being sent to the parser.
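A postparse plug-in has the same overall shape. In the sketch below, the names (PostparsePlugin, PostparsePluginArg1, getSecurityTokens, setSecurityTokens) are again assumptions to check against the installed Javadoc, as is the convention that returning false stops the document from being sent to the parser.

    // Sketch of a postparse plug-in that edits security tokens before the
    // document is parsed. Names and return-value semantics are assumptions.
    import com.ibm.es.wc.pi.PostparsePlugin;
    import com.ibm.es.wc.pi.PostparsePluginArg;
    import com.ibm.es.wc.pi.PostparsePluginArg1;

    public class TokenPostparsePlugin implements PostparsePlugin {
        public boolean init() { return true; }
        public boolean release() { return true; }

        public boolean processDocument(PostparsePluginArg[] args) {
            PostparsePluginArg1 arg = (PostparsePluginArg1) args[0];
            // Replace the document's security tokens with a fixed group list.
            arg.setSecurityTokens("group-sales,group-support");
            // Assumed convention: return false to keep this document from
            // being sent to the parser; return true to let it continue.
            return true;
        }
    }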

If your plug-in depends on additional Java classes, non-Java libraries, or other files, the plug-in itself must load them. For example, your plug-in can invoke a class loader to bring in more Java classes, load libraries, make network or database connections, or do anything else that it needs.
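As a sketch of that approach, the helper below pulls an extra JAR into the plug-in with a standard java.net.URLClassLoader. Only JDK APIs are used; the JAR path and class name are illustrative.

    import java.net.URL;
    import java.net.URLClassLoader;

    // Loads a helper class from a JAR that is not on the crawler classpath.
    public class DependencyLoader {
        // Keep the loader open for the life of the plug-in so that classes
        // the helper references lazily can still be resolved.
        private static URLClassLoader loader;

        public static synchronized Object newHelper() throws Exception {
            if (loader == null) {
                // Illustrative path and class name; substitute your own.
                URL jar = new URL("file:///opt/crawler/plugins/helper.jar");
                loader = new URLClassLoader(new URL[] { jar },
                        DependencyLoader.class.getClassLoader());
            }
            Class<?> cls = loader.loadClass("com.example.CrawlerHelper");
            return cls.getDeclaredConstructor().newInstance();
        }
    }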

Plug-ins run as part of the crawler's JVM process. The crawler catches exceptions and errors, but plug-in execution directly affects crawler performance, so write plug-ins to do the minimum amount of processing and to catch all anticipated exceptions. Plug-in code must also be thread-safe: if the crawler runs 200 concurrent downloads, it might make 200 concurrent calls to your plug-in.
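The sketch below, using the same assumed PrefetchPlugin interface as earlier, shows one way to keep shared plug-in state safe under that kind of concurrency: confine mutable state to java.util.concurrent types and local variables.

    import java.util.concurrent.atomic.AtomicLong;
    import com.ibm.es.wc.pi.PrefetchPlugin;
    import com.ibm.es.wc.pi.PrefetchPluginArg;

    // With up to 200 concurrent downloads, processDocument can run on 200
    // threads at once, so any shared state must be thread-safe.
    public class CountingPrefetchPlugin implements PrefetchPlugin {
        // A lock-free counter; a plain long field here would be unsafe.
        private final AtomicLong processed = new AtomicLong();

        public boolean init() { return true; }
        public boolean release() { return true; }

        public boolean processDocument(PrefetchPluginArg[] args) {
            processed.incrementAndGet();
            // Keep per-document work in local variables; avoid shared
            // non-thread-safe objects such as SimpleDateFormat instances.
            return true;
        }
    }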

Using a plug-in to crawl secure WebSphere Portal sites

If application security is enabled in WebSphere® Application Server and you want to crawl secure WebSphere Portal sites with the web crawler, you must create a crawler plug-in to handle form-based authentication requests. For a discussion of form-based authentication and a sample program that you can adapt for your custom web crawler plug-in, see http://www.ibm.com/developerworks/db2/library/techarticle/dm-0707nishitani.
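In outline, such a plug-in can be a prefetch plug-in that performs the form login once, caches the resulting session cookie, and attaches it to each request header. The sketch below is not the developerWorks sample: the login URL and credentials are illustrative, the plug-in interface names are the same assumptions as earlier, and the j_security_check form fields (j_username, j_password) follow the standard Java EE form-login convention.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import com.ibm.es.wc.pi.PrefetchPlugin;
    import com.ibm.es.wc.pi.PrefetchPluginArg;
    import com.ibm.es.wc.pi.PrefetchPluginArg1;

    // Sketch of form-based authentication in a prefetch plug-in.
    public class PortalAuthPrefetchPlugin implements PrefetchPlugin {
        private volatile String sessionCookie;  // cached after first login

        public boolean init() { return true; }
        public boolean release() { return true; }

        public boolean processDocument(PrefetchPluginArg[] args) {
            PrefetchPluginArg1 arg = (PrefetchPluginArg1) args[0];
            try {
                if (sessionCookie == null) {
                    sessionCookie = formLogin();  // for example, an LTPA token
                }
                // Attach the authenticated session cookie to the request.
                arg.setHTTPHeader(arg.getHTTPHeader()
                        + "Cookie: " + sessionCookie + "\r\n");
                return true;
            } catch (Exception e) {
                return false;  // assumed convention: false skips this document
            }
        }

        // Posts the standard Java EE login form; the URL and credentials are
        // illustrative and should come from your own configuration.
        private String formLogin() throws Exception {
            URL url = new URL(
                    "https://portal.example.com/wps/myportal/j_security_check");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setInstanceFollowRedirects(false);
            conn.setRequestProperty("Content-Type",
                    "application/x-www-form-urlencoded");
            byte[] form = "j_username=crawler&j_password=secret"
                    .getBytes(StandardCharsets.UTF_8);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(form);
            }
            String cookie = conn.getHeaderField("Set-Cookie");
            conn.disconnect();
            if (cookie == null) {
                throw new IllegalStateException("login did not set a cookie");
            }
            int semi = cookie.indexOf(';');
            return semi < 0 ? cookie : cookie.substring(0, semi);
        }
    }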

The plug-in is required if you use the web crawler to crawl any sites through WebSphere Portal, including Workplace Web Content Management sites and Lotus® Quickr® sites.