Watson Explorer Engine strictly adheres to the Internet
Engineering Task Force (IETF) Request for Comments (RFC) standards. Because web sites and web
browsers do not strictly adhere to these standards, there are several things you should keep
in mind when configuring your project.
- Watson Explorer Engine percent-encodes non-ASCII characters in
URLs and normalizes the hexadecimal digits used in percent-encoding to lowercase. Other
software may normalize hexadecimal case differently, but the IETF standard declares
that these hexadecimal digits must be treated in a case-insensitive way. For details, see
RFC 3986, section 6.2.2: Syntax-Based Normalization.
- Watson Explorer Engine drops everything after the anchor
symbol (#) when checking whether it has already crawled a URL.
- Watson Explorer Engine does not recognize file paths (for
example C:\..) as URLs. Use a URL like the following to reference a
resource on the local filesystem:
file:///C%3a/Program%20Files/my%20file.txt
- Watson Explorer Engine does not recognize Windows Universal
Naming Convention (UNC) file paths (for example \\sharehost\path\file) as
URLs.
- Domain Name System (DNS) aliases and host names used in URLs cannot include
underscores (_).
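
The rules above can be sketched in a few lines of Python. This is an illustration only, not Watson Explorer Engine code: the function names (`normalize_for_dedup`, `windows_path_to_file_url`, `host_has_underscore`) are hypothetical, and the sketch assumes UTF-8 when percent-encoding non-ASCII characters. It can help you predict how a URL might look after normalization, or convert a Windows path to a file URL before adding it to your configuration.

```python
import re
from pathlib import PureWindowsPath
from urllib.parse import quote, urldefrag, urlsplit

# Matches one percent-encoded byte, such as %3A or %e2.
_PCT = re.compile(r"%[0-9A-Fa-f]{2}")
# Every ASCII character; used to tell quote() to leave ASCII untouched.
_ASCII = "".join(chr(c) for c in range(128))


def _lower_hex(text: str) -> str:
    """Lowercase the hexadecimal digits of every percent-encoded byte."""
    return _PCT.sub(lambda m: m.group(0).lower(), text)


def normalize_for_dedup(url: str) -> str:
    """Hypothetical helper: approximate the form used to compare URLs.

    Drops the fragment, percent-encodes non-ASCII characters
    (UTF-8 is an assumption), and lowercases percent-encoding hex digits.
    Simplification: non-ASCII host names are not handled specially.
    """
    without_fragment, _fragment = urldefrag(url)
    encoded = quote(without_fragment, safe=_ASCII)
    return _lower_hex(encoded)


def windows_path_to_file_url(path: str) -> str:
    """Hypothetical helper: turn a Windows path into a file:/// URL."""
    posix_style = PureWindowsPath(path).as_posix()  # backslashes -> slashes
    return "file:///" + _lower_hex(quote(posix_style, safe="/"))


def host_has_underscore(url: str) -> bool:
    """Hypothetical helper: flag URLs whose host name contains '_'."""
    host = urlsplit(url).hostname or ""
    return "_" in host


if __name__ == "__main__":
    print(normalize_for_dedup("http://example.com/caf\u00e9?q=%E2%9C%93#top"))
    print(windows_path_to_file_url(r"C:\Program Files\my file.txt"))
    print(host_has_underscore("http://my_host.example.com/index.html"))
```

Underscores in the path or query string are not flagged by `host_has_underscore`; only the host portion of the URL is checked, which matches the restriction on DNS aliases and host names described above.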
You can relax the checks that Watson Explorer Engine applies to URLs by
disabling URL normalizations. To do so, open the search collection's tab, then open the
URL normalization section. See URL Normalizations for more information.