Searching for special characters

OmniFind supports indexing and searching special characters.

You can search for special characters like other query terms. To find a special character in a document, include the special character in the query expression. In some cases, escaping special characters is required.

Escaping special characters

Special characters can serve different functions in the query syntax. For example, question marks (?) can be used as wildcard characters. To search for a special character that has a special function in the query syntax, you must escape it by adding a backslash before it. For example:

To search for the string “where?”, escape the question mark as follows: “where\?”
To search for the string “c:\temp”, escape the colon and backslash as follows: “c\:\\temp”

Not escaping such special characters can result in syntax errors.

Table 1. Special characters that must be escaped to be searched
Special Character	Notes on behavior when not escaped
Ampersand (&)
Asterisk (*)	Used as a wildcard character.
At sign (@)	A syntax error is generated when an at sign is the first character of a query. In xmlxp expressions, the at sign is used to refer to an attribute.
Brackets [ ]	Used in xmlxp expressions to search the contents of elements and attributes.
Braces { }	Generates a syntax error.
Backslash (\)
Caret (^)	Used for weighting (boosting) terms.
Colon (:)	Used to search in the contents of fields.
Equal sign (=)	Generates a syntax error.
Exclamation point (!)	A syntax error is returned when an exclamation point is the first character of a query.
Forward slash (/)	In xmlxp expressions, a forward slash is used as an element path separator.
Greater than symbol (>) and less than symbol (<)	Used in xmlxp expressions to compare the value of an attribute. Otherwise, these characters generate syntax errors.
Minus sign (-)	When a minus sign is the first character of a term, only documents that do not contain the term are returned.
Parentheses ( )	Used for grouping.
Percent sign (%)	Specifies that a search term is optional.
Plus sign (+)
Question mark (?)	Handled as a wildcard character.
Semicolon (;)
Single quotation mark (')	Single quotation marks are used to contain xmlxp expressions.
Tilde (~)	Handled as the proximity and fuzzy search operator.
Vertical bar (|)
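
The escaping rule can be applied mechanically before a query is submitted. The following Python sketch is illustrative only: the helper name escape_query_term and the character set are assumptions that were built from Table 1, and they are not part of any OmniFind API.

```python
# Characters listed in Table 1 (assumed to be the complete set that needs escaping).
# The backslash itself is included, so existing backslashes are doubled.
SPECIAL_CHARS = set("&*@[]{}\\^:=!/><-()%+?;'~|")


def escape_query_term(term: str) -> str:
    """Prefix every Table 1 character in `term` with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in term)


print(escape_query_term("where?"))    # where\?
print(escape_query_term("c:\\temp"))  # c\:\\temp
print(escape_query_term("30$"))       # 30$ (Table 2 characters are left alone)
```

Use a helper like this only for text that must be matched literally. Characters that are meant to act as operators, such as a wildcard asterisk, must stay unescaped.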

Escaping special characters that do not serve a special function in the query syntax is optional. The following table shows some examples of special characters that do not require escaping.

Table 2. Examples of special characters that do not require escaping
Special Character	Notes on behavior when not escaped
Comma (,)
Dollar sign ($)
Period (.)	In xmlxp expressions, a period is used to search the content of elements.
Pound sign (#)
Underscore (_)

Special characters adjacent to query terms

When a special character is adjacent to a word in a query, documents that contain the special character and word in the same order are returned. For example, searching for “30$” finds documents that contain “30$”, but does not find documents that contain “$30”. However, searching for “30 $” (with a space) finds all documents that contain “30” and “$” anywhere in the document, including both “30$” and “$30”.

When a special character is adjacent to a stop word in a query, the stop word is not removed from the query. For example, searching for “at&t” does not remove the stop word “at”. However, searching for “at & t” with spaces removes the stop word “at”.

When a special character separates two words, the resulting tokens are searched as a sequence, in the order given. For example, searching for “jack_jones” finds documents that contain “jack_jones” but not documents that contain “jack_and_jones”.

Words that are adjacent to special characters are lemmatized. For example, searching for “cats&dogs” in English finds documents that contain “cat&dog”.

You can use special characters in wildcard search expressions. For example, searching for “ja*_” finds documents that contain “jack_jones”. However, you cannot use wildcard characters to find special characters. For example, searching for “ca*s” finds documents that contain “cats”, “categories”, or “cas”, but not documents that contain “ca_s”.
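
The wildcard behavior described above can be modeled with a short Python sketch. It is not OmniFind's matching code: the function names are hypothetical, words are assumed to be runs of ASCII letters and digits, and only unescaped wildcards are handled. Wildcards expand to word characters only, and each query part must match one whole token.

```python
import re

WORD = "A-Za-z0-9"  # assumption: word tokens are ASCII letters and digits


def tokenize(text: str) -> list[str]:
    """Split text into word tokens and single special-character tokens."""
    return re.findall(rf"[{WORD}]+|[^{WORD}\s]", text.lower())


def split_query(expr: str) -> list[str]:
    """Like tokenize, but keep the wildcards * and ? inside word parts."""
    return re.findall(rf"[{WORD}*?]+|[^{WORD}*?\s]", expr.lower())


def part_to_regex(part: str) -> str:
    """Translate one query part; wildcards expand to word characters only."""
    return "".join(
        f"[{WORD}]*" if ch == "*" else f"[{WORD}]" if ch == "?" else re.escape(ch)
        for ch in part
    )


def wildcard_matches(expr: str, text: str) -> bool:
    """True if consecutive tokens of `text` match the parts of `expr`."""
    parts = [re.compile(part_to_regex(p)) for p in split_query(expr)]
    tokens = tokenize(text)
    return any(
        all(rx.fullmatch(tok) for rx, tok in zip(parts, tokens[i:i + len(parts)]))
        for i in range(len(tokens) - len(parts) + 1)
    )


print(wildcard_matches("ja*_", "jack_jones"))  # True
print(wildcard_matches("ca*s", "categories"))  # True
print(wildcard_matches("ca*s", "ca_s"))        # False
```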

Indexing special characters

During tokenization and language processing, the OmniFind server identifies and indexes special characters as punctuation. Special characters are token delimiters.

For example, “jack_jones” is tokenized as three separate tokens: “jack”, “_”, and “jones”. Emails, URLs, and file paths are broken down into tokens, for example:

Jack_jones@ibm.com is tokenized as jack _ jones @ ibm . com
http://www.ibm.com is tokenized as http :// www . ibm . com
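
The two examples above can be reproduced with a regular expression. This Python sketch is an approximation, not the OmniFind tokenizer: it assumes that words are runs of ASCII letters and digits, that tokens are lowercased, and that a run of adjacent special characters such as :// is kept together, as the second example is displayed.

```python
import re

# A token is either a run of letters/digits or a run of other non-space characters.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]+")


def tokenize(text: str) -> list[str]:
    """Split `text` into lowercase word tokens and special-character tokens."""
    return TOKEN_PATTERN.findall(text.lower())


print(" ".join(tokenize("Jack_jones@ibm.com")))  # jack _ jones @ ibm . com
print(" ".join(tokenize("http://www.ibm.com")))  # http :// www . ibm . com
```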

Special characters do not occupy a token position in the file. For example, "jack_jones" is indexed with the underscore in the same token position as "jack". Special characters also do not occupy a token position when spaces are included. For example, “jack_jones” is indexed in the same way as “jack _ jones”.

The token position is used for exact phrase search and for proximity search. For example, if a document contains the expression jack_jones, searching for the exact phrase "jack jones" finds this document.
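
The following Python sketch shows why the phrase search succeeds. It is a simplified model with hypothetical function names, not the OmniFind index format: each word advances the position counter, and a special character reuses the position of the word before it.

```python
import re

TOKEN_PATTERN = re.compile(r"([A-Za-z0-9]+)|([^A-Za-z0-9\s])")


def index_positions(text: str) -> list[tuple[str, int]]:
    """Return (token, position) pairs; only word tokens advance the position."""
    entries = []
    position = -1
    for match in TOKEN_PATTERN.finditer(text.lower()):
        if match.group(1) is not None:  # word token
            position += 1
        # A leading special character has no preceding word; place it at 0.
        entries.append((match.group(0), max(position, 0)))
    return entries


def phrase_matches(text: str, phrase: str) -> bool:
    """Exact phrase search over word positions only."""
    words = [tok for tok, _ in index_positions(text) if tok.isalnum()]
    wanted = phrase.lower().split()
    return any(
        words[i:i + len(wanted)] == wanted
        for i in range(len(words) - len(wanted) + 1)
    )


print(index_positions("jack_jones"))    # [('jack', 0), ('_', 0), ('jones', 1)]
print(index_positions("jack _ jones"))  # same positions as above
print(phrase_matches("jack_jones", "jack jones"))  # True
```

Because the underscore does not take a position of its own, “jack” and “jones” sit at adjacent positions, which is what an exact phrase match requires.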

When the special characters in a sequence are indexed separately, they are searched in no particular order. For example, searching for “#$” also finds documents that contain “$#”.

Special characters in CJK languages

In CJK languages, to find a sequence of characters that includes special characters, the query expression must include the special characters. If you omit the special characters from the query expression, the character sequence might not be found. In non-CJK languages, the character sequence is always found, even if the query expression omits the special characters. For example, if an indexed document contains john_smith, you can search for john_smith or "john smith" (exact match, without the underscore), and both queries return the document that contains john_smith.

Restriction: You cannot search for the following special characters in CJK documents: ? * \