How the Web crawler handles soft error pages

You can configure the Web crawler to handle custom pages that Web site administrators create when they do not want to return a standard error code in response to requests for certain pages.

If an HTTP server cannot return the page that a client requests, the server normally returns a response that consists of a header with a status code. The status code indicates what the problem is (such as error 404, which indicates that the file could not be found). Some Web site administrators create special pages that explain the problem in more detail and configure the HTTP server to return these pages instead. These custom pages are called soft error pages.

Soft error pages can distort the Web crawler's results. For example, instead of receiving a header that indicates a problem, the crawler receives a soft error page and the status code 200, which indicates the successful download of a valid HTML page. But this downloaded soft error page is not related to the requested URL, and its content is nearly identical each time it is returned in place of a requested page. These irrelevant and near-duplicate pages distort the index and search results.

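To handle this situation, you can specify options for handling soft error pages when you configure the Web crawler. The Web crawler needs the following information about each Web site that returns soft error pages:

A URL pattern that identifies the pages on the Web site to check
A title pattern to compare with the <TITLE> tag of each returned page
A content pattern to compare with the content of each returned page
The HTTP status code that the crawler is to use in place of 200 when a page matches the patterns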

Example

The following configuration tells the Web crawler to compare all valid HTML pages (status code 200) that are returned from the http://www.mysite.com/hr/* Web site to the specified title and content patterns. If the <TITLE> tag of a page begins with "Sorry, the page" and the content of the document contains anything (*), then the crawler handles the page the same way it would handle a page that is returned with status code 404 (the page was not found).

Table 1. Soft error page example

URL pattern                  Title pattern      Content pattern   HTTP status code
http://www.mysite.com/hr/*   Sorry, the page*   *                 404

You can create multiple entries for the same Web site to handle different status codes. Each status code from the same Web site requires its own entry in the Web crawler's configuration.
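
The way such an entry might be applied can be sketched in Python. This is only an illustration of the rule described above, not the crawler's actual implementation or configuration format: the names SoftErrorRule, _matches, and effective_status are hypothetical, and the sketch assumes that each pattern must match the whole value and that matching is case-sensitive.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SoftErrorRule:
    """One hypothetical soft error entry (one row of the table above)."""
    url_pattern: str
    title_pattern: str
    content_pattern: str
    status_code: int


def _matches(pattern: str, text: str) -> bool:
    # Assumption: '*' is the only wildcard and the pattern covers the whole text.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, text, flags=re.DOTALL) is not None


def effective_status(rules, url, status, title, content):
    """Return the status code the crawler would act on for a fetched page."""
    if status != 200:
        return status  # only successfully downloaded pages are checked
    for rule in rules:
        if (_matches(rule.url_pattern, url)
                and _matches(rule.title_pattern, title)
                and _matches(rule.content_pattern, content)):
            return rule.status_code  # treat the page as this error instead
    return status


rules = [SoftErrorRule("http://www.mysite.com/hr/*", "Sorry, the page*", "*", 404)]
```

With this rule, a page under http://www.mysite.com/hr/ that comes back with status 200 and a title that begins with "Sorry, the page" would be treated as a 404. A page with any other title, or from another part of the site, would keep its status code 200.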

Using wildcard characters

The URL, title, and content patterns are not regular expressions. The asterisk character (*) matches any sequence of characters up to the next occurrence of the non-wildcard character that follows it in the pattern. For example:

*404 matches any characters404
404: * matches 404: any characters
http://*.mysite.com/* matches http://any host.mysite.com/any file
* matches any characters
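
The wildcard behavior shown in these examples can be approximated with a short Python sketch. It is an illustration under stated assumptions, not the crawler's own matching code: the function name wildcard_match is hypothetical, and the sketch assumes that a pattern must match the entire string and that matching is case-sensitive. Every character other than the asterisk is treated literally, which is why the period in a host name does not act as a regular expression metacharacter.

```python
import re


def wildcard_match(pattern: str, text: str) -> bool:
    """Return True if text matches a pattern in which '*' is the only wildcard."""
    # Escape each literal piece so that characters such as '.' and '?' stay literal.
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    # A non-greedy '.*?' consumes characters only up to the next literal piece.
    regex = ".*?".join(literal_parts)
    # Assumption: the pattern must cover the whole string (anchored at both ends).
    return re.fullmatch(regex, text, flags=re.DOTALL) is not None


print(wildcard_match("*404", "Error 404"))
print(wildcard_match("404: *", "404: Not Found"))
print(wildcard_match("http://*.mysite.com/*", "http://www.mysite.com/hr/index.html"))
print(wildcard_match("404: *", "Error 404: Not Found"))
```

Under these assumptions, the first three calls report a match. The last call does not, because the pattern has no leading asterisk to absorb the text before "404".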

Effect on performance

When you configure options for handling soft error pages, you increase the amount of crawler processing time because all successfully crawled pages must be checked. More processing time is required to check for pattern matches and determine whether a page or a replacement status code should be returned.