You can configure the Web crawler to handle custom pages that Web site administrators create when they do not want to return a standard error code in response to requests for certain pages.
If an HTTP server cannot return the page that a client requests, the server normally returns a response that consists of a header with a status code. The status code indicates what the problem is (such as error 404, which indicates that the file could not be found). Some Web site administrators create special pages that explain the problem in more detail and configure the HTTP server to return these pages instead. These custom pages are called soft error pages.
Soft error pages can distort the Web crawler's results. For example, instead of receiving a header that indicates a problem, the crawler receives a soft error page and the status code 200, which indicates the successful download of a valid HTML page. But this downloaded soft error page is not related to the requested URL, and its content is nearly identical each time it is returned in place of a requested page. These irrelevant and near-duplicate pages distort the index and search results.
To handle this situation, you can specify options for handling soft error pages when you configure the Web crawler. The Web crawler needs the following information about each Web site that returns soft error pages:
- A URL pattern for a site that uses soft error pages. This URL pattern consists of the protocol (HTTP or HTTPS), the host name, port number (if nonstandard), and path name. You can use an asterisk (*) as a wildcard character to match one or more characters up to the next occurrence of a non-wildcard character in the pattern. The pattern that you specify is case sensitive.
- A title pattern for text that corresponds to the <TITLE> tag of an HTML document. You can use the asterisk (*) as a wildcard character to specify this pattern. The pattern that you specify is case sensitive.
- A content pattern for text that corresponds to the content of an HTML document. The content is not just the content of the <BODY> tag, if a <BODY> tag is present. The content is everything that comes after the HTTP header in the file. You can use the asterisk (*) as a wildcard character to specify this pattern. The pattern that you specify is case sensitive.
- An integer that represents the status code to use for documents that match the URL, title, and content patterns that you specified.
Example
The following configuration tells the Web crawler to compare all valid HTML pages (status code 200) that are returned from the http://www.mysite.com/hr/* Web site to the specified title and content patterns. If the <TITLE> tag of a page begins with "Sorry, the page" and the content of the document contains anything (*), then the crawler handles the page the same way that it handles a status code 404 (the page was not found).
Table 1. Soft error page example
URL pattern | Title pattern | Content pattern | HTTP status code
http://www.mysite.com/hr/* | Sorry, the page* | * | 404
You can create multiple entries for the same Web site to handle different status codes. Each status code from the same Web site requires its own entry in the Web crawler's configuration.
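The way such entries might be applied to a downloaded page can be sketched in Python. This is an illustration only, not the crawler's actual implementation: the names SoftErrorRule and effective_status, the second "Access denied*" entry with status code 403, and the rule that the first matching entry wins are all assumptions made for the sketch. It also assumes that an asterisk can match an empty run of characters.

```python
import re
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SoftErrorRule:
    """One configuration entry (hypothetical structure, not the product's format)."""
    url_pattern: str
    title_pattern: str
    content_pattern: str
    status_code: int


def _matches(pattern: str, text: str) -> bool:
    # Only "*" is special; every other character is matched literally and
    # case-sensitively. The whole text must match the whole pattern.
    regex = ".*?".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, text, re.DOTALL) is not None


def _extract_title(body: str) -> str:
    # Text between the first <title> and </title> tags, or "" if there is none.
    found = re.search(r"<title[^>]*>(.*?)</title>", body, re.IGNORECASE | re.DOTALL)
    return found.group(1).strip() if found else ""


def effective_status(url: str, status: int, body: str,
                     rules: Iterable[SoftErrorRule]) -> int:
    """Return the status code the crawler would act on for this page."""
    if status != 200:
        return status  # Only successfully downloaded pages are checked.
    title = _extract_title(body)
    for rule in rules:
        if (_matches(rule.url_pattern, url)
                and _matches(rule.title_pattern, title)
                and _matches(rule.content_pattern, body)):
            return rule.status_code  # Assumption: the first matching entry wins.
    return status


RULES = [
    # The entry from Table 1.
    SoftErrorRule("http://www.mysite.com/hr/*", "Sorry, the page*", "*", 404),
    # Hypothetical second entry for the same site and a different status code.
    SoftErrorRule("http://www.mysite.com/hr/*", "Access denied*", "*", 403),
]
```

Calling effective_status with a page from http://www.mysite.com/hr/ whose title starts with "Sorry, the page" yields 404 under these assumptions, while a page from the same site with any other title keeps its original status code 200.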
Using wildcard characters
The URL, title, and content patterns are not regular expressions. The asterisk character matches any characters up to the next occurrence of any non-wildcard character. For example:
- *404 matches any characters followed by 404
- 404: * matches 404: followed by any characters
- http://*.mysite.com/* matches http://, any host name, .mysite.com/, and any file name
- * matches any characters
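The wildcard behavior described above can be approximated with a short Python sketch. The function names wildcard_to_regex and wildcard_match are hypothetical, and the sketch makes one assumption that the text leaves open: the asterisk is allowed to match an empty run of characters. To require at least one character per asterisk, replace ".*?" with ".+?".

```python
import re


def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a soft-error wildcard pattern into a compiled regular expression.

    Only "*" is treated as a wildcard. All other characters, including ".",
    "?", and "[", are escaped so that they match themselves literally.
    """
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    # ".*?" consumes as little as possible before the next literal part.
    # DOTALL lets a wildcard span line breaks in page content.
    return re.compile(".*?".join(literal_parts), re.DOTALL)


def wildcard_match(pattern: str, text: str) -> bool:
    """Return True if the whole text matches the pattern, case-sensitively."""
    return wildcard_to_regex(pattern).fullmatch(text) is not None
```

Because the whole text must match, a pattern without a trailing asterisk, such as *404, matches only text that ends in 404, and a pattern without a leading asterisk matches only text that begins with its first literal characters.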
Effect on performance
When you configure options for handling soft error pages, you increase the amount of crawler processing time because all successfully crawled pages must be checked. More processing time is required to check for pattern matches and determine whether a page or a replacement status code should be returned.