How to append a Locale parameter and specify the Accept-Language HTTP header for the seedlist crawler

Technote (troubleshooting)


Problem(Abstract)

I am trying to crawl a WebSphere Portal site that is set up in Hebrew by using the seedlist crawler. With the default configuration, the Hebrew content is not displayed correctly in the search results. Instead, the search results show the content that a browser receives when it accesses the Portal in English.

How do you set up the seedlist crawler so that it retrieves the correct Hebrew content?

Symptom

When you access the target WebSphere Portal content with a browser, the content varies depending on the browser language setting, and you get the expected content only when the browser is set to the matching language.

For example, you see the correct Hebrew content when the browser language is set to Hebrew, but not when it is set to English.



Resolving the problem

To retrieve content in the correct language with the seedlist crawler, specify additional parameters in the seedlist crawler configuration file (ES_NODE_ROOT/master_config/<collection id>.<crawler session id>/seedlistcrawler_ext.xml), as follows.


For example:

To append a Locale parameter so that the seedlist is retrieved in a specific language:

<ExtendedProperties>
<RemoveChild XPath="/Crawler/DataSources/Server/Target/SeedlistExtraParameter" />
<AppendChild XPath="/Crawler/DataSources/Server/Target" Name="SeedlistExtraParameter">&amp;Locale=iw</AppendChild>
</ExtendedProperties>



The value should be a valid locale that is supported by WebSphere Portal or by another data source that the seedlist crawler supports.

To change the Accept-Language HTTP header so that the content itself is retrieved in a specific language:

<ExtendedProperties>
<RemoveChild XPath="/Crawler/DataSources/Server/Target/AcceptLanguageHeader" />
<AppendChild XPath="/Crawler/DataSources/Server/Target" Name="AcceptLanguageHeader">he,en;q=0.5</AppendChild>
</ExtendedProperties>


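The value must be a valid Accept-Language value as described in RFC 2616 (section 14.4). In the example above, he,en;q=0.5 requests Hebrew first and accepts English with a lower preference. To confirm what a browser sends in the Accept-Language header for a given language setting, change the browser language and inspect the request with a browser tool such as an add-on.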

You can specify both entries within a single ExtendedProperties element in one seedlistcrawler_ext.xml file.
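
As a sketch, a combined seedlistcrawler_ext.xml might look like the following. It assumes that the four entries can simply be listed one after another inside the same ExtendedProperties element, and it reuses the Hebrew example values from above; substitute the locale and Accept-Language value that your site requires.

<ExtendedProperties>
<RemoveChild XPath="/Crawler/DataSources/Server/Target/SeedlistExtraParameter" />
<AppendChild XPath="/Crawler/DataSources/Server/Target" Name="SeedlistExtraParameter">&amp;Locale=iw</AppendChild>
<RemoveChild XPath="/Crawler/DataSources/Server/Target/AcceptLanguageHeader" />
<AppendChild XPath="/Crawler/DataSources/Server/Target" Name="AcceptLanguageHeader">he,en;q=0.5</AppendChild>
</ExtendedProperties>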


Related information

Adding support for unsupported language to WPS

Cross reference information

Segment: Enterprise Content Management
Product: Content Analytics with Enterprise Search
Platform: AIX, Linux, Linux on System z, Windows
Version: 3.0, 2.2

Document information


More support for:

OmniFind Enterprise Edition

Software version:

8.5, 9.1

Operating system(s):

AIX, Linux, Linux on System z, Windows

Reference #:

1598232

Modified date:

2013-05-09
