Fine-Tuning the Search Collection

As you look through the search results returned for an empty query, you may notice that one result looks different from the others. All of the files in the sample directory used in this tutorial have the same format and content except for one: the index.html file, which provides a generic interface to these files. Luckily, Watson Explorer Engine provides several ways of excluding specific files from the crawling and indexing process, the simplest of which is discussed here.

In the Watson Explorer Engine administration tool, re-open the binning-collection search collection that you created for this tutorial and select the Configuration tab if it is not already selected. Next, click the Crawling sub-tab if it is not already selected. In the Conditional Settings section of this page, click Add a new condition, select Custom conditional settings, and click Add. Your screen should look something like the one shown in Figure 1 (with no values in the text input areas).

Figure 1. Adding a Custom Condition to Exclude Files by Path

In the text box under the Conditions apply for a path matching label, enter the path of the directory that houses the sample files used for this tutorial, followed by the name of the file that you want to exclude (in this case, index.html), as in the following example:

  */data/metadata-example/index.html
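
To get a feel for how a wildcard pattern like this one selects a single file, the following minimal Python sketch uses the standard library's fnmatch module as a stand-in. It is an illustration only: the matching rules that Watson Explorer Engine applies may differ from fnmatch, and the /opt/example prefix and the record-01.html file name are hypothetical placeholders.

```python
from fnmatch import fnmatchcase

# Pattern taken from the example above.
EXCLUDE_PATTERN = "*/data/metadata-example/index.html"


def is_excluded(path: str, pattern: str = EXCLUDE_PATTERN) -> bool:
    """Return True if path matches the shell-style wildcard pattern.

    Assumption: fnmatch-style matching, in which "*" matches any run of
    characters, including "/". The product's own matcher may behave
    differently.
    """
    return fnmatchcase(path, pattern)


# Hypothetical paths, used only to exercise the pattern.
sample_paths = [
    "/opt/example/data/metadata-example/index.html",
    "/opt/example/data/metadata-example/record-01.html",
    "/opt/example/data/other-example/index.html",
]

excluded = [p for p in sample_paths if is_excluded(p)]
print(excluded)
```

Under these assumptions, only the first path matches: the pattern pins down both the directory and the file name, so other files in the same directory and index.html files elsewhere are left alone.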

Next, scroll down to and expand the Converting section of this screen and select the check box beside no-index. Finally, click OK to save your settings and return to the Configuration section's Crawling tab.

The Retrieval options that you just specified tell the Watson Explorer Engine search engine not to index the file (..install-directory../data/examples/metadata-example/index.html), but to follow any links that this file contains. When you re-crawl and re-index this site, the result that contains this file will no longer be displayed.

When you change the configuration of a smaller search collection, such as the one used in this tutorial, the easiest way to obtain an updated index is to delete the existing index and regenerate it by recrawling and reindexing the files in your search collection. To do this, and to confirm that the index.html file is now being excluded, click the Overview tab and select delete data to the right of the Live Status header. Click Delete on the confirmation screen to proceed. Finally, click the start button to the right of the Live Status header to begin the new crawl and index process.

Tip: Avoid simply deleting the data for a search collection whose complete crawl and index processes take a significant amount of time. In these cases, removing a single file or URL from a collection is best done by re-enqueuing that file or URL using the Live Status > Enqueue tab. This causes the crawling and indexing process to re-examine the specified file or URL based on the current search collection definition, and removes it from the crawled data and the index.

Once the index update completes, click Search under the search box that appears under the Test with project label in the Watson Explorer Engine administration tool's left-hand navigation bar to display the search results. The entry for the index.html file is gone, and the All Results heading at the left side of the page now shows 42 possible results rather than the 43 results that were previously identified.
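
The drop from 43 to 42 results follows directly from excluding one file from the index. The short Python sketch below reproduces that arithmetic, assuming one search result per indexed file. The record-NN.html names are hypothetical placeholders, not the actual sample file names.

```python
# Hypothetical listing: 42 sample documents plus index.html (43 files in all).
all_files = [f"record-{n:02d}.html" for n in range(1, 43)] + ["index.html"]


def indexed_files(files, excluded_name="index.html"):
    """Return the files that remain once the excluded file is skipped."""
    return [name for name in files if name != excluded_name]


results_before = len(all_files)
results_after = len(indexed_files(all_files))
print(results_before, results_after)
```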

Congratulations! You've reached the end of this tutorial. For additional information about adding metadata, adding fields, and so on, search this documentation for these (or related) topics.