IBM Content Analytics with Enterprise Search does not index header information from Microsoft PowerPoint
In Microsoft PowerPoint, notes can be added to each slide. These notes can have a header and footer, as in a Word document.
IBM Content Analytics with Enterprise Search (ICAwES) text extraction does not extract this header and footer text.
In addition, Microsoft PowerPoint has a slide master function where text can also be stored. ICAwES 3.0 Fix Pack 1 includes a fix with which text extraction extracts the slide master footer text, but not the header text.
Slide notes are ignored by default, but a workaround is available.
Resolving the problem
Use the following workaround steps to extract this information:
- Create the directory '$ES_INSTALL_ROOT/lib/com/ibm/es/oze/parser/outsidein/'
- Copy the attached tag_actions.properties file into that directory
- Modify the classpath parameter in the '$ES_INSTALL_ROOT/configurations/interfaces/stellent__interface.ini' file to include the 'lib' directory
- Restart Parse and Index, then re-crawl, re-parse, and re-index the documents
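The first two steps above (creating the directory and placing the properties file) could be scripted roughly as follows. This is only a minimal sketch: the function name install_tag_actions is hypothetical and not part of ICAwES, and it assumes the attached tag_actions.properties file has already been downloaded locally. The classpath change and the restart are left as manual steps, because the exact layout of the .ini file is not described here.

```python
import shutil
from pathlib import Path

# Directory named in the steps above, relative to ES_INSTALL_ROOT.
PARSER_SUBDIR = Path("lib/com/ibm/es/oze/parser/outsidein")


def install_tag_actions(install_root, properties_file):
    """Hypothetical helper: create the parser directory under install_root
    and copy the given properties file into it as tag_actions.properties."""
    target_dir = Path(install_root) / PARSER_SUBDIR
    # Step 1: create the directory, including any missing parent directories.
    target_dir.mkdir(parents=True, exist_ok=True)
    # Step 2: copy the attached file into the new directory.
    target = target_dir / "tag_actions.properties"
    shutil.copyfile(properties_file, target)
    return target
```

For example, calling install_tag_actions(os.environ["ES_INSTALL_ROOT"], "tag_actions.properties") from the directory that holds the downloaded file would return the path of the installed copy.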