The Web crawler can find some links (URLs) that are contained in the JavaScript portions of Web documents. If you determine that a high number of the URLs that are embedded in text are of low relevance, you can disable text link parsing by configuring advanced Web crawler properties.
The Web crawler can find both relative and absolute links. If an HTML document contains a BASE element, the crawler uses that element to resolve relative links. Otherwise, the crawler uses the document's own URL.
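For illustration, here is a minimal sketch of this resolution rule, using the urljoin function from Python's standard library. The function name resolve_link and the URLs are made-up examples, not part of the crawler.

    from urllib.parse import urljoin

    def resolve_link(document_url, base_href, link):
        """Resolve a link as described above: prefer the document's
        BASE element, otherwise fall back to the document's own URL."""
        base = base_href if base_href else document_url
        return urljoin(base, link)

    # Relative link; the document declares <base href="http://example.com/docs/">
    print(resolve_link("http://example.com/page.html",
                       "http://example.com/docs/", "intro.html"))
    # -> http://example.com/docs/intro.html

    # No BASE element: resolved against the document's own URL
    print(resolve_link("http://example.com/page.html", None, "intro.html"))
    # -> http://example.com/intro.html

    # Absolute links are returned unchanged
    print(resolve_link("http://example.com/page.html", None,
                       "http://other.example.org/a.html"))
    # -> http://other.example.org/a.html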
Support for JavaScript is limited to link extraction. The crawler does not parse JavaScript, does not build a DOM (Document Object Model), and does not interpret or execute JavaScript statements. Instead, the crawler looks for strings in the document content (including, but not limited to, the JavaScript portions) that are likely to be URLs in JavaScript statements. This approach has two consequences (a sketch of this kind of scanning follows the list):
- Some URLs are found that the stricter HTML parser ignores. The crawler rejects anything that is not a syntactically valid URL, but some of the valid URLs that the scanning step returns might be of low interest for searching.
- Document content that is generated by JavaScript, such as when a human user views a page with a browser and the browser executes some JavaScript, cannot be detected by the Web crawler and thus is not indexed.
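To make the scanning behavior concrete, the following sketch shows one plausible way such string scanning could work. The regular expression and the function scan_for_urls are illustrative assumptions, not the crawler's actual logic.

    import re

    # Assumed pattern: any quoted http(s) string literal in the raw
    # document text, including inside <script> blocks.
    URL_LIKE = re.compile(r"""["'](https?://[^"'\s]+)["']""")

    def scan_for_urls(document_text):
        """Return string literals that look like URLs, without parsing
        or executing any JavaScript."""
        return URL_LIKE.findall(document_text)

    page = '''
    <script>
      var next = "http://example.com/reports/2024.html";
      // Assembled at run time, so no complete URL appears in the source:
      var dynamic = "http://example.com/" + year + "/index.html";
    </script>
    '''
    print(scan_for_urls(page))
    # -> ['http://example.com/reports/2024.html', 'http://example.com/']

Note how the second result is a syntactically valid URL of low interest for searching (the first consequence) and how the page whose URL is assembled at run time is missed entirely (the second consequence).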
Because the Web crawler does not execute JavaScript in HTML files, URLs that JavaScript assembles or generates at run time are not crawled. To enable the Web crawler to crawl such URLs, you can take either of the following actions:
- In the administration console, edit the Web crawler and, on the Web Crawl Space page, add the URLs to the list of URLs that the crawler uses as a starting point for adding URLs to the collection (Start URLs). For the changes to take effect, restart the Web crawler (you do not need to start a full crawl).
- Use the anchor tag (<a href="..">) to specify the URLs as hypertext links in the HTML file, as shown in the sketch after this list.
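As a quick illustration of why the second action works, the sketch below uses Python's standard html.parser module to collect href values from anchor tags; this mirrors the kind of link that an HTML parser finds reliably. The class name and the markup are made-up examples.

    from html.parser import HTMLParser

    class AnchorLinkExtractor(HTMLParser):
        """Collect href values from <a> tags, the links that an HTML
        parser (as opposed to string scanning) extracts reliably."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    parser = AnchorLinkExtractor()
    parser.feed('<p><a href="http://example.com/reports/2024.html">'
                '2024 report</a></p>')
    print(parser.links)
    # -> ['http://example.com/reports/2024.html']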