I'm trying to crawl a WebSphere Portal site that is set up in Hebrew by using the seedlist crawler. With the default configuration, the Hebrew content is not displayed in Hebrew in the search results. Instead, the search results show the content that I get when I access the Portal with a browser set to English.
How do you set up the seedlist crawler to retrieve the content in Hebrew?
When you access the target WebSphere Portal content with a browser, the content varies depending on the browser language setting, so the expected content is returned only for certain browser languages.
For example, you see the content in Hebrew when the browser language is set to Hebrew, but not when it is set to English.
Resolving the problem
To get the content in the correct language with the seedlist crawler, specify additional parameters in the seedlist crawler configuration file (ES_NODE_ROOT/master_config/<collection id>.<crawler session id>/seedlistcrawler_ext.xml) as follows.
To append a Locale parameter so that the seedlist is retrieved in a specific language:
<RemoveChild XPath="/Crawler/DataSources/Server/Target/SeedlistExtraParameter" />
<AppendChild XPath="/Crawler/DataSources/Server/Target" Name="SeedlistExtraParameter">&amp;Locale=iw</AppendChild>
The value must be a valid locale that is supported by WebSphere Portal or by another data source that the seedlist crawler supports.
To change the Accept-Language HTTP header so that the content itself is retrieved in a specific language:
<RemoveChild XPath="/Crawler/DataSources/Server/Target/AcceptLanguageHeader" />
<AppendChild XPath="/Crawler/DataSources/Server/Target" Name="AcceptLanguageHeader">he,en;q=0.5</AppendChild>
The value must be a valid Accept-Language value as described in RFC 2616. (To confirm what your browser sends in the Accept-Language HTTP header after you change the browser language, you can use browser tools such as an add-on.)
You can specify both entries within the ExtendedProperties element in a single seedlistcrawler_ext.xml file.
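As a rough illustration, a combined file might look like the following sketch. It assumes that ExtendedProperties directly encloses the RemoveChild and AppendChild entries and that nothing else in the file needs to change; if your seedlistcrawler_ext.xml already contains other elements, keep them in place. The ampersand in the Locale parameter is written as &amp; here on the assumption that the file is read by a standard XML parser, which rejects a bare ampersand.

```xml
<ExtendedProperties>
  <!-- Retrieve the seedlist in Hebrew. "iw" is the locale code used in this example. -->
  <RemoveChild XPath="/Crawler/DataSources/Server/Target/SeedlistExtraParameter" />
  <AppendChild XPath="/Crawler/DataSources/Server/Target" Name="SeedlistExtraParameter">&amp;Locale=iw</AppendChild>

  <!-- Retrieve the content itself in Hebrew, with English as a lower-priority fallback. -->
  <RemoveChild XPath="/Crawler/DataSources/Server/Target/AcceptLanguageHeader" />
  <AppendChild XPath="/Crawler/DataSources/Server/Target" Name="AcceptLanguageHeader">he,en;q=0.5</AppendChild>
</ExtendedProperties>
```

In each pair, the RemoveChild entry clears any previous value before the AppendChild entry adds the new one, so the setting is not duplicated if the file is applied again.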
|Enterprise Content Management||Content Analytics with Enterprise Search||AIX, Linux, Linux on System z, Windows||3.0, 2.2|