IBM Support

DeveloperWorks Article: Extracting Author Metadata from Documents by Configuring Index Fields in IBM Content Analytics with Enterprise Search



This IBM DeveloperWorks article describes a procedure for extracting document metadata fields that the crawler does not extract by default but that might be needed in the search results or as facets.


The IBM Content Analytics with Enterprise Search file system crawlers do not extract the Author metadata from documents. This document shows a procedure to extract the author information from documents without doing it at the crawler level. After you follow the steps in this document, the metadata is visible in the DumpIndex output, but it still cannot be seen in the raw data store (RDS).
The crawlers do not extract the author because Microsoft Office documents on the Windows platform can store the author information in multiple locations. One location is the file's binary content. Another is an NTFS alternate stream, where the extended properties of a file are stored separately from the file content, in the format named FMTID_SummaryInformation.

By design, the Windows File System crawler crawls the basic properties of files, but it does not read extended properties in the alternate stream. However, if Microsoft Word copied that information into the binary content of the file, the author information is also in the binary content and can be parsed by the provided Oracle Outside In Content Access technology (the Stellent parser).
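
Before configuring index fields, it can help to confirm that a given file actually carries the author in its content rather than only in an NTFS alternate stream. The following Python sketch is not part of the product or of the referenced article. It is a diagnostic aid that assumes the file is an Office Open XML package (.docx, .xlsx, .pptx) that keeps its core properties in the conventional docProps/core.xml part. The function name read_ooxml_author is hypothetical. Legacy binary formats such as .doc are not covered and return None here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core-properties part of an Office Open XML package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Assumption: the core properties live at the conventional part name.
CORE_PROPERTIES_PART = "docProps/core.xml"


def read_ooxml_author(source):
    """Return the dc:creator value stored in the file content, or None.

    `source` is a path or a binary file-like object. None is returned when
    the input is not a ZIP package, has no core-properties part, or has an
    empty or missing dc:creator element.
    """
    try:
        with zipfile.ZipFile(source) as package:
            try:
                core_xml = package.read(CORE_PROPERTIES_PART)
            except KeyError:
                return None  # package has no core-properties part
    except zipfile.BadZipFile:
        return None  # not an OOXML package (for example, a legacy .doc file)

    root = ET.fromstring(core_xml)
    creator = root.find("dc:creator", NS)
    if creator is None or not (creator.text or "").strip():
        return None
    return creator.text.strip()
```

If the function returns a name, the author is present in the file content, which is the case in which the document parser can reach it. If it returns None for an Office Open XML file, the author is likely stored only as an extended property.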

If you need to crawl the extended properties of files, you must develop a crawler plug-in that calls the Windows C++ API to extract this information.

By using the procedure described in this document, you can create index fields and configure the parser to extract the author information.

Related information

Extracting Author Metadata from Documents by Configurin

Document information

More support for: Watson Content Analytics

Software version: 3.0

Operating system(s): AIX, Linux, Windows

Reference #: 1649486

Modified date: 04 April 2014
