These options are in the Converting sub-section of the
Global Settings for a search collection.
- Number of converters - The number of parallel converting
processes to run. Typically, set this to the number of processors in
your system.
- Link/Index - Crawler settings that are analogous to
robots.txt exclusions:
  - no-index - do not index the content of this URL
  - no-follow - do not follow links on a page
- Text cache content types - newline-separated list of content types
that will be stored in the cache.
To modify the default list of cached content types,
select the Modified radio button and add, edit, or remove entries
in the associated text area.
- Rich cache content types - newline-separated list of MIME content
types for which rich preview versions of crawled documents will be generated and retained in
the cache. To define content types for rich preview generation and caching, select the
Modified radio button and enter a list of content types in the
associated text area. Common values for these content types are the following Microsoft
content types:
application/word
application/excel
application/powerpoint
application/ms-ooxml-word
application/ms-ooxml-excel
application/ms-ooxml-powerpoint
- Converter CPU limit - The amount of CPU time that a converter
process can take is limited to the larger of this value and the CPU
limit specified in the Advanced section of the
configuration for that specific converter. If the limit is exceeded, the converter will be
stopped. Enter this value as a positive number of seconds or use -1 for "infinity".
- Converter memory limit - The amount of memory that a converter
process can consume is limited to the larger of this value and the Memory
limit specified in the Advanced section of the
configuration for that specific converter. If the limit is exceeded, the converter will be
stopped. Enter this value in megabytes or use -1 for "infinity".
- Converter elapsed time - The amount of elapsed time for which a
converter process can run is limited to the larger of this value and the Elapsed
time limit specified in the Advanced section of the
configuration for that specific converter. If the limit is exceeded, the converter will be
stopped. Enter this value as a positive number of seconds or use -1 for "infinity".
- Rich-preview max elapsed time - The maximum number of seconds
allowed for generation of the rich-preview for any individual document in the collection.
The rich-preview will be unavailable for that document if the limit is exceeded. Enter this
value as a positive number of seconds or use -1 for "infinity".
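The "larger of" rule that the three converter limits share can be sketched as follows. This is a minimal illustration, not the product's implementation: the function name effective_limit is hypothetical, and the sketch assumes that -1 on either side is treated as infinity and that the per-converter setting accepts -1 in the same way.

```python
import math


def effective_limit(global_value: float, converter_value: float) -> float:
    """Return the larger of the global and per-converter limits.

    Assumption: -1 means "infinity" for both settings, so the
    effective limit is unbounded if either value is -1.
    """
    def normalize(value: float) -> float:
        if value == -1:
            return math.inf
        if value <= 0:
            raise ValueError("limit must be a positive number or -1")
        return float(value)

    return max(normalize(global_value), normalize(converter_value))


# Global CPU limit of 30 seconds, per-converter CPU limit of 120 seconds.
print(effective_limit(30, 120))
# Global memory limit of -1 (infinity), per-converter limit of 512 MB.
print(effective_limit(-1, 512))
```

Under these assumptions, lowering the global value below a converter's own limit has no effect on that converter, because the larger of the two values always applies.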
Note: Rich preview of the specified document types is available only for matching
documents that have been crawled since those document types were specified.
The text/plain MIME type is not handled through rich preview, but
text/plain preview is supported. Text preview is generated from the text cache,
not the rich cache.
Watson Explorer Application Builder supports rich preview for PDF and RTF
files, and generally supports rich preview of Microsoft Word, Microsoft PowerPoint, and
Microsoft Excel documents. However, many MIME types are associated with these documents. If a
rich preview is not available as you expect, ensure that you listed the correct MIME type for
your document. If you add a MIME type, you must recrawl the collection to save the rich cache
content. You can use the following list as a starting point to enable rich preview of
supported documents:
application/word
application/excel
application/powerpoint
application/ms-ooxml-word
application/ms-ooxml-excel
application/ms-ooxml-powerpoint
application/pdf
application/rtf
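Because the field takes a newline-separated list, the starting list can be assembled and checked with a short script before it is pasted into the text area. The following is a sketch only: the helper names build_field_value and is_listed are hypothetical, and the comparison is an exact string match, which may not reflect how the product itself matches content types.

```python
# Starting list of content types taken from this topic.
STARTING_TYPES = [
    "application/word",
    "application/excel",
    "application/powerpoint",
    "application/ms-ooxml-word",
    "application/ms-ooxml-excel",
    "application/ms-ooxml-powerpoint",
    "application/pdf",
    "application/rtf",
]


def build_field_value(content_types):
    """Join content types into newline-separated text, dropping
    blank entries and duplicates while keeping the original order."""
    seen = set()
    lines = []
    for content_type in content_types:
        content_type = content_type.strip()
        if content_type and content_type not in seen:
            seen.add(content_type)
            lines.append(content_type)
    return "\n".join(lines)


def is_listed(mime_type, field_value):
    """Return True if mime_type appears as a line in field_value."""
    return mime_type.strip() in field_value.splitlines()


field_value = build_field_value(STARTING_TYPES)
print(field_value)
print(is_listed("application/pdf", field_value))
```

If a rich preview is missing for a document, a check like is_listed against the document's reported MIME type can help confirm whether the type was included before you recrawl.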
In addition to supporting rich preview for some documents,
Application Builder also supports rich preview of email and of email
attachments that are supported document types. To see rich previews of email and email
attachments, add the following content type to the Rich cache content types
field in addition to the other content types that are necessary for rich
preview of supported documents:
text/html
For email message bodies that are not text/plain content types, Application Builder uses the text/html content type
to provide rich previews. Email that contains attachments goes through additional conversion
to provide rich previews for each attachment.
Before email can be crawled, indexed, and previewed, it must meet the following requirements:
- The email must be converted to the text/mail content type. Many email
formats can be converted to this type with a converter that converts the email directly
from its native format to text/mail. If a converter does not exist to
convert an email file directly to text/mail, the email can be converted
indirectly by using multiple converters. After an email is converted to
text/mail, the email message converter runs, which converts it to the
vivisimo/crawl-data content type. The converter creates
vivisimo/crawl-data output for each component of the email, including
the headers, the body, and the attachments. By default, attachments are not indexed as
individual documents.
- Attachments must have known content types. The content types must be supported for rich
cache, or be able to be converted to content types that are supported for rich cache. If
the content type of an attachment is unknown or not supported, you cannot preview the
attachment in an Application Builder application.
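The direct and indirect conversion paths described in the first requirement can be pictured as a search over the available converters. The sketch below is illustrative only: the converter table and the two application/x-example-* content types are hypothetical, the function find_chain is not part of the product, and only the text/mail to vivisimo/crawl-data step is taken from this topic.

```python
from collections import deque

# Hypothetical converter table: (input content type, output content type).
CONVERTERS = [
    ("application/x-example-mail", "application/x-example-intermediate"),
    ("application/x-example-intermediate", "text/mail"),
    # Step described in this topic: the email message converter.
    ("text/mail", "vivisimo/crawl-data"),
]


def find_chain(source, target, converters):
    """Breadth-first search for the shortest chain of content types
    that leads from source to target, or None if no chain exists."""
    graph = {}
    for src, dst in converters:
        graph.setdefault(src, []).append(dst)

    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for next_type in graph.get(path[-1], []):
            if next_type not in visited:
                visited.add(next_type)
                queue.append(path + [next_type])
    return None


chain = find_chain("application/x-example-mail", "vivisimo/crawl-data", CONVERTERS)
print(" -> ".join(chain))
```

A result of None corresponds to the case in which no converter, or combination of converters, can bring the email to text/mail, so the email cannot be indexed or previewed.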