The uniform resource identifier (URI) of each document
in the index indicates the type of crawler that added the document
to the collection.
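As a rough illustration of how the URI scheme points back to a crawler type, the following Python sketch maps the schemes that are documented in this topic to crawler names. The table and the function name candidate_crawlers are hypothetical and are not part of the product. SharePoint documents use ordinary http or https URLs, so they are not in the table.

```python
# Scheme-to-crawler lookup assembled from the URI formats in this topic.
# The names below are illustrative only; they are not a product API.
CRAWLERS_BY_SCHEME = {
    "winfs": ["Agent for Windows file systems"],
    "boardreader": ["BoardReader"],
    "p8ce": ["Case Manager", "FileNet P8"],  # both crawlers use p8ce://
    "vbr": ["Content Integrator"],
    "cm": ["Content Manager"],
    "db2": ["DB2"],
    "exchadp": ["Exchange Server"],
    "jdbc": ["JDBC database"],
    "domino": ["Notes"],
    "quickplace": ["Quickr for Domino"],
    "seedlist": ["Seed list"],
    "file": ["UNIX file system", "Windows file system"],
}


def candidate_crawlers(uri: str) -> list[str]:
    """Return the crawler types whose URI format starts with this scheme."""
    scheme, separator, _ = uri.partition(":")
    if not separator:
        return []  # no scheme present
    return CRAWLERS_BY_SCHEME.get(scheme.lower(), [])


print(candidate_crawlers("db2://LOCALDB/SCHEMA1.TABLE1/MODEL/ThinkPadA20"))
print(candidate_crawlers("file:///home/user/test.doc"))
```

Some schemes are shared by more than one crawler, so the lookup returns a list of candidates rather than a single answer.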
You can specify URIs or URI patterns when you configure
categories, scopes, and quick links for a collection. You also specify
the URI when you need to remove documents from the index or view detailed
status information about a specific URI.
Search the collection
to determine the URIs or URI patterns for a document. You can click
the URIs in the search results to retrieve documents that you are
interested in. You can copy the URI from the search results to use
the URI in the administration console. For example, you can specify
a URI pattern to automatically associate documents that match that
URI pattern with a quick link.
When
you specify a URI or URI pattern, you must specify the URL-encoded
format for the URI and ensure that the URI does not contain characters
that are not included in the US-ASCII coded character set. For details,
see RFC 1738, the Internet standard for URLs.
In the following
example, you cannot specify the first URI, which contains Hebrew characters.
You can, however, specify the second URI, which is the URL-encoded
format of the first URI.
- Incorrect URI
file:///c:/shared/hebrew/עברית
- Correct URI
file:///c:/shared/hebrew/%D7%A2%D7%91%D7%A8%D7%99%D7%AA
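One way to produce the URL-encoded form is sketched below in Python. The function name to_url_encoded is hypothetical, and the use of UTF-8 as the byte encoding is inferred from the example above rather than stated by the product documentation.

```python
from urllib.parse import quote


def to_url_encoded(uri: str) -> str:
    """Percent-encode the non-ASCII characters of a URI as UTF-8 bytes."""
    # ":" and "/" stay literal so that the scheme and path separators survive.
    # Do not pass a URI that is already encoded: "%" would become "%25".
    return quote(uri, safe=":/")


encoded = to_url_encoded("file:///c:/shared/hebrew/עברית")
print(encoded)            # file:///c:/shared/hebrew/%D7%A2%D7%91%D7%A8%D7%99%D7%AA
print(encoded.isascii())  # True
```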
Archive files
The URI format for documents
that are extracted from an archive file (such as a .zip or .tar file)
and then crawled is:
Original_URI(?|&)ArchiveEntry=Entry_Name(&ArchiveEntry=Entry_Name)
- Parameters
- Original_URI
- The location of the archive file on the data source.
- Entry_Name
- The URL-encoded name of the archive entry in the archive file.
- Examples
file:///d:/Archive1.zip
file:///d:/Archive1.zip?ArchiveEntry=Folder1/PowerPoint.ppt
file:///d:/Archive1.zip?ArchiveEntry=Folder2/Text.txt
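The archive entry format can be sketched in Python as follows. The function name archive_entry_uri is hypothetical. Leaving "/" unencoded in entry names follows the examples above; how other reserved characters in entry names are encoded is an assumption.

```python
from urllib.parse import quote


def archive_entry_uri(original_uri: str, *entry_names: str) -> str:
    """Append one ArchiveEntry parameter per (possibly nested) entry name."""
    uri = original_uri
    for name in entry_names:
        # The first parameter is introduced by "?" and later ones by "&",
        # which matches the (?|&) alternative in the format.
        separator = "&" if "?" in uri else "?"
        uri += f"{separator}ArchiveEntry={quote(name, safe='/')}"
    return uri


print(archive_entry_uri("file:///d:/Archive1.zip", "Folder1/PowerPoint.ppt"))
# file:///d:/Archive1.zip?ArchiveEntry=Folder1/PowerPoint.ppt
```

Passing more than one entry name models an archive that is nested inside another archive, which is one reading of the repeated ArchiveEntry parameter in the format.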
Agent for Windows file systems crawlers
The URI format for documents that are crawled by
an
Agent for Windows file systems crawler
is:
winfs://Host_Name/Drive:/Directory_Path/File_Name
- Parameters
- URL encoding is applied to all of the fields.
- Host_Name
- The host name or IP address of the server where the
document is located.
- Drive
- The drive on the server where the document is
located.
- Directory_Path
- The path for a shared directory in the Windows
domain.
- File_Name
- The name of the file.
- Example
-
winfs:////9.187.186.83/c:/temp/test/test2/Copy+%284%29+of+dumpstore_1.txt
BoardReader crawlers
The
URI format for documents that are crawled by a
BoardReader
crawler is as follows:
- Replace the protocol of the URL with boardreader://
- Add the parameter boardreaderid= and the BoardReader ID to the
URL
- Add the parameter useSSL=true to the URL when the original protocol
is https://
URL encoding is applied to all of the fields.
- Example 1
-
URL: http://www.facebook.com/1102412197/posts/10202426006027108
BoardReader ID: 17005669247
URI: boardreader://www.facebook.com/1102412197/posts/10202426006027108?boardreaderid=17005669247
- Example 2
-
URL: http://foursoftpaws.yuku.com/reply/5369/Kjara-Tockica#reply-5369
BoardReader ID: 12702129376
URI: boardreader://foursoftpaws.yuku.com/reply/5369/Kjara-Tockica%23reply-5369?boardreaderid=12702129376
- Example 3
-
URL: https://www.flashback.org/t2284857#p46743985
BoardReader ID: 23427574780
URI: boardreader://www.flashback.org/t2284857%23p46743985?boardreaderid=23427574780&useSSL=true
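The three rules can be sketched in Python as shown below. The function name boardreader_uri is hypothetical. The sketch encodes only the "#" character, which is the only encoding that the examples demonstrate, and it assumes that "&" is used as the joiner when the original URL already has a query string.

```python
def boardreader_uri(url: str, boardreader_id: str) -> str:
    """Rewrite a post URL into the boardreader:// form described above."""
    scheme, separator, rest = url.partition("://")
    if not separator:
        raise ValueError("expected an absolute http:// or https:// URL")
    # Simplification: only "#" is percent-encoded here, as in the examples.
    rest = rest.replace("#", "%23")
    # Assumption: use "&" if the original URL already carries a query string.
    joiner = "&" if "?" in rest else "?"
    uri = f"boardreader://{rest}{joiner}boardreaderid={boardreader_id}"
    if scheme.lower() == "https":
        uri += "&useSSL=true"
    return uri


print(boardreader_uri("https://www.flashback.org/t2284857#p46743985", "23427574780"))
# boardreader://www.flashback.org/t2284857%23p46743985?boardreaderid=23427574780&useSSL=true
```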
Case Manager crawlers
The
URI format for documents that are crawled by a
Case Manager crawler
is:
p8ce://host_name:port/object_store/version_series_id/hash_code[/element_number]?protocol=http
- Parameters
- URL encoding is applied to all of the fields.
- host_name
- A host name of a server on which the IBM® FileNet® Content Engine runs.
- port
- A port number on which the Content Engine Web
Service runs.
- object_store
- A name of an object store in which a document is
stored.
- version_series_id
- A unique document identifier. The version series ID is
used because the document ID
changes as the document is
versioned while the version series ID does not
change.
- hash_code
- To make a distinction between folders, a hash code is
added to the path for the object
URI. In the following
example, 7584373 is the hash code of
folder path
/ObjectStore/CaseSolution/../CaseFolder/SubFolders:
p8ce://9.39.44.204:9080/wsi/FNCEWS40MTOM/ATOSAIX2/{2D09F43F-3392-485E-B338-E67D68F04FA6}.7584373?protocol=http
- element_number
- An index of content elements. This variable is appended
only when a URI points to a
document that contains
multiple content elements.
- protocol
- A protocol for accessing the Web Service. Valid values
are http or
https.
Content Integrator crawlers
The
URI format for documents that are crawled by a
Content Integrator crawler in server access
mode is:
vbr://Server_Name/Repository_System_ID/Repository_Persistent_ID
/Item_ID/Version_ID
/Item_Type/?[Page=Page_Number&] JNDI_properties
The
URI format for documents that are crawled by a
Content Integrator crawler in direct access
mode is:
vbr:///Repository_System_ID/Repository_Persistent_ID
/Item_ID/Version_ID
/Item_Type/[?Page=Page_Number]
- Parameters
- URL encoding is applied to all of the fields.
- Server_Name
- The name of the IBM Content
Integrator server.
- Repository_System_ID
- The system ID for the repository.
- Repository_Persistent_ID
- The persistent ID for the repository.
- Item_ID
- The ID for the item.
- Version_ID
- The ID for the version. If the version ID is blank, this value
indicates the latest version of the document.
- Item_Type
- The type of the item (CONTENT or FOLDER).
- Page_Number
- The page number.
- JNDI_properties
- The JNDI properties for the J2EE application client. There are
two types of properties:
- java.naming.factory.initial
- The name of the class for the application server that is used
to create the EJB handle.
- java.naming.provider.url
- The URL to the naming service for the application server that
is used to request the EJB handle.
- Examples
- Documentum:
vbr://vbrsrv.ibm.com/Documentum/c06b/094e827780000302//CONTENT/?
java.naming.provider.url=iiop%3A%2F%2Fmyvbr.ibm.com%3A2809&
java.naming.factory.initial=com.ibm.websphere.naming.WsnInitContextFactory
- FileNet PanagonCS:
vbr://vbrsrv.ibm.com/PanagonCS/4a4c/003671066//CONTENT/?Page=1&
java.naming.provider.url=iiop%3A%2F%2Fmyvbr.ibm.com%3A2809&
java.naming.factory.initial=com.ibm.websphere.naming.WsnInitContextFactory
Content Manager crawlers
The
URI format for documents that are crawled by a
Content Manager crawler is:
cm://Server_Name/Item_Type_Name/PID
- Parameters
- URL encoding is applied to the PID parameter.
- Server_Name
- The name of the IBM Content
Manager Enterprise Edition library
server.
- Item_Type_Name
- The name of the target item type.
- PID
- The Content Manager EE persistent
identifier.
- Example
cm://cmsrvctg/ITEMTYPE1/92+3+ICM8+icmnlsdb12+ITEMTYPE159+26+A1001001A
03F27B94411D1831718+A03F27B+94411D183171+14+1018
DB2 crawlers
The
URI format for documents that are crawled by a
DB2 crawler is:
db2://Database_Name/Table_Name
/Unique_Identifier_Column_Name1/Unique_Identifier_Value1
[/Unique_Identifier_Column_Name2/Unique_Identifier_Value2/...
/Unique_Identifier_Column_NameN/Unique_Identifier_ValueN]
- Parameters:
- URL encoding is applied to all of the fields.
- Database_Name
- The internal name of the database or the alias for the database.
- Table_Name
- The name of the target table, including the name of the schema.
- Unique_Identifier_Column_Name1
- The name of the first Unique Identifier column in the table.
- Unique_Identifier_Value1
- The value of the first Unique Identifier column.
- Unique_Identifier_Column_NameN
- The name of the nth Unique Identifier column in the table.
- Unique_Identifier_ValueN
- The value of the nth Unique Identifier column.
- Examples
- Local, cataloged database:
db2://LOCALDB/SCHEMA1.TABLE1/MODEL/ThinkPadA20
- Remote, uncataloged database:
db2://myserver.mycompany.com:50001/REMOTEDB/SCHEMA2.TABLE2/NAME/DAVID
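For a local, cataloged database, the format can be sketched in Python as follows. The function name db2_uri is hypothetical, and the sketch does not cover the remote, uncataloged form, where the host and port precede the database name.

```python
from urllib.parse import quote


def db2_uri(database: str, table: str, key_columns: list[tuple[str, str]]) -> str:
    """Build a db2:// URI from a table name and its unique identifier columns."""
    if not key_columns:
        raise ValueError("at least one unique identifier column is required")
    fields = [database, table]
    for column_name, value in key_columns:
        fields += [column_name, str(value)]
    # Each field is encoded on its own, so a "/" inside a value becomes "%2F"
    # and cannot be mistaken for a field separator.
    return "db2://" + "/".join(quote(field, safe="") for field in fields)


print(db2_uri("LOCALDB", "SCHEMA1.TABLE1", [("MODEL", "ThinkPadA20")]))
# db2://LOCALDB/SCHEMA1.TABLE1/MODEL/ThinkPadA20
```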
Exchange Server crawlers
Because Watson Content Analytics cannot obtain the URL
of attachments through Outlook Web App (OWA), it shows alternate URLs
for attached items. Because Exchange Server 2007 supports only the
Internet Explorer browser, users can access OWA of Exchange Server
2007 only with that browser.
When users click titles in the
results page of the enterprise search application or content analytics
miner, the corresponding Exchange Server item is shown through OWA.
If the user has MailboxPermission to the mailbox that contains the
search results, the user can also open the item through OWA. However,
if the user has MailboxFolderPermission or Delegation to the mailbox
that contains the search results, the user must access the following
URL before clicking the title to access the item, where
user's_primarySmtpAddress is the address to which the search results
originally belong:
https://hostname/OWA/user's_primarySmtpAddress/?cmd=contents
The
Exchange Server crawler generates
original URIs for crawled documents. The crawler uses IDs for the
URI that are unique values among items and attachments. If a document
is an item, the URI is formatted as follows:
exchadp://hostname/mailbox_name/itemId=itemId&owa=owaURL
If the document is an attachment, the URI is formatted as follows:
exchadp://hostname/mailbox_name/attachmentId=attachmentId&owa=owaURL
FileNet P8 crawlers
The
URI format for documents that are crawled by a
FileNet P8 crawler is:
p8ce://host_name:port/object_store/object_id[/element_number]?protocol=http
- Parameters
- URL encoding is applied to all of the fields.
- host_name
- A host name of a server on which the IBM FileNet Content Engine runs.
- port
- A port number on which the Content Engine Web
Service runs.
- object_store
- A name of an object store in which a document is stored.
- object_id
- A globally unique identifier (GUID) that the Content Engine assigns to a stored object. The GUID
is a 38-character string that consists of a left curly
brace, 8 hexadecimal characters, a dash, 4 hexadecimal characters,
a dash, 4 hexadecimal characters, a dash, 4 hexadecimal characters,
a dash, 12 hexadecimal characters, and a right curly brace. The braces
are URL-encoded in the URI. For example:
%7B1234abcd-56ef-7a89-9fe8-7d65cd43ba21%7D
- element_number
- An index of content elements. This variable is appended only when
a URI points to a document that contains multiple content elements.
- protocol
- A protocol for accessing the Web Service. Valid values are http or https.
- Example
p8ce://host.filenet.com:9080/STORE1/{1234abcd-56ef-7a89-9fe8-7d65cd43ba21}/2
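The GUID layout in the object_id description can be checked and URL-encoded with a short Python sketch such as the following. The names GUID_PATTERN and encode_object_id are hypothetical.

```python
import re
from urllib.parse import quote

# Left brace, 8-4-4-4-12 hexadecimal characters separated by dashes, right brace.
GUID_PATTERN = re.compile(
    r"\{[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}\}"
)


def encode_object_id(guid: str) -> str:
    """Validate a 38-character GUID and return it with the braces URL-encoded."""
    if not GUID_PATTERN.fullmatch(guid):
        raise ValueError(f"not a GUID in {{8-4-4-4-12}} form: {guid!r}")
    return quote(guid, safe="")


print(encode_object_id("{1234abcd-56ef-7a89-9fe8-7d65cd43ba21}"))
# %7B1234abcd-56ef-7a89-9fe8-7d65cd43ba21%7D
```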
JDBC database crawlers
The
URI format for documents that are crawled by a
JDBC database crawler is:
jdbc://DB_URL/Table_Name
/Unique_Identifier_Column_Name1/Unique_Identifier_Value1
/[Unique_Identifier_Column_Name2/Unique_Identifier_Value2
/.../Unique_Identifier_Column_NameN/Unique_Identifier_ValueN]
- Parameters
- URL encoding is applied to all of the fields.
- DB_URL
- The URL for the database.
- Table_Name
- The name of the target table, including the name of the schema.
- Unique_Identifier_Column_Name1
- The name of the first Unique Identifier column in the table.
- Unique_Identifier_Value1
- The value of the first Unique Identifier column.
- Unique_Identifier_Column_NameN
- The name of the nth Unique Identifier column in the table.
- Unique_Identifier_ValueN
- The value of the nth Unique Identifier column.
- Examples:
- DB2 database:
jdbc:db2://host01.svl.ibm.com:50000/SAMPLE/DB2INST1.ORG/DEPTNUMB/51
- Oracle database:
jdbc:oracle:thin:@/host01.svl.ibm.com:1521:ora/SCOTT.EMP/EMPNO/7934
- MS SQL Server 2000 database:
jdbc:microsoft:sqlserver://host01.svl.ibm.com:1433;
DatabaseName=Northwind/dbo.Region/RegionID/100
- MS SQL Server 2005 database:
jdbc:sqlserver://host01.svl.ibm.com:1433;
DatabaseName=Northwind/dbo.Region/RegionID/100
Notes crawlers
The
URI format for documents that are crawled by a
Notes crawler is:
domino://Server_Name[:Port_Number]/Database_Replica_ID/Database_Path_and_Name
/[View_Universal_ID]/Document_Universal_ID
[?AttNo=Attachment_Number&AttName=Attachment_File_Name]
- Parameters
- URL encoding is applied to all of the fields.
- Server_Name
- The name of the Lotus Notes® server.
- Port_Number
- The port number for the Lotus Notes server. The port number is
optional.
- Database_Replica_ID
- The identifier for the database replica.
- Database_Path_and_Name
- The path and file name for the NSF database on the target Lotus
Notes server.
- View_Universal_ID
- The View Universal ID that is defined on the target database.
This ID is specified only when the document is selected from a view
or folder. If you do not designate a view or folder to crawl (for
example if you specify that you want to crawl all documents in a database),
the View Universal ID is not specified.
- Document_Universal_ID
- The Document Universal ID that is defined in the document that
is crawled by the crawler.
- Attachment_Number
- A consecutive number, starting from zero, for each attachment.
The attachment number is optional.
- Attachment_File_Name
- The original name of the attachment file. The attachment file
name is optional.
- Examples
- A document that was selected for crawling by view or folder:
domino://dominosvr.ibm.com/49256D3A000A20DE/Database.nsf/
8178B1C14B1E9B6B8525624F0062FE9F/0205F44FA3F45A9049256DB20042D226
- A document that was not selected for crawling by view or folder:
domino://dominosvr.ibm.com/49256D3A000A20DE/Database.nsf//
0205F44FA3F45A9049256DB20042D226
- A document attachment:
domino://dominosvr.ibm.com/49256D3A000A20DE/Database.nsf//
0205F44FA3F45A9049256DB20042D226?AttNo=0&AttName=AttachedFile.doc
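To show how the fields line up, the following Python sketch splits a Notes crawler URI into its parts. The function name parse_domino_uri and the dictionary keys are hypothetical. The sketch assumes that the URI is supplied on a single line (the examples above are wrapped for display) and that the last two path segments are always the view ID, which can be empty, and the document ID.

```python
from urllib.parse import parse_qs, unquote, urlsplit


def parse_domino_uri(uri: str) -> dict:
    """Split a domino:// URI into the fields described above."""
    parts = urlsplit(uri)
    if parts.scheme != "domino":
        raise ValueError("not a domino:// URI")
    segments = parts.path[1:].split("/")  # drop the leading "/"
    if len(segments) < 4:
        raise ValueError("expected replica ID, database, view ID, document ID")
    query = parse_qs(parts.query)
    return {
        "server": parts.hostname,
        "port": parts.port,  # None when the optional port is omitted
        "replica_id": segments[0],
        "database": unquote("/".join(segments[1:-2])),
        "view_unid": segments[-2] or None,  # empty when no view was selected
        "document_unid": segments[-1],
        "attachment_number": int(query["AttNo"][0]) if "AttNo" in query else None,
        "attachment_name": query["AttName"][0] if "AttName" in query else None,
    }


info = parse_domino_uri(
    "domino://dominosvr.ibm.com/49256D3A000A20DE/Database.nsf//"
    "0205F44FA3F45A9049256DB20042D226?AttNo=0&AttName=AttachedFile.doc"
)
print(info["view_unid"], info["attachment_number"], info["attachment_name"])
# None 0 AttachedFile.doc
```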
Quickr for Domino crawlers
The
URI format for documents that are crawled by a
Quickr for Domino crawler is:
quickplace://Server_Name:Port_Number/Database_Replica_ID/Database_Path_and_Name
/View_Universal_ID/Document_Universal_ID
/?AttNo=Attachment_Number&AttName=Attachment_File_Name
- Parameters
- URL encoding is applied to all of the fields.
- Server_Name
- The host name of the Quickr for Domino server.
- Port_Number
- Optional: The port number for the Quickr for Domino server.
- Database_Replica_ID
- The identifier for the database replica.
- Database_Path_and_Name
- The path and file name for the document NSF database on the target
Quickr for Domino server.
- View_Universal_ID
- The View Universal ID that is used to crawl documents.
- Document_Universal_ID
- The Document Universal ID that is defined in the crawled document.
- Attachment_Number
- Optional: A consecutive number, starting from zero, for each attachment.
- Attachment_File_Name
- Optional: The original name of the attachment file.
- Examples
- A document:
quickplace://ltwsvr.ibm.com/49257043000214B3/QuickPlace%5Csampleplace
%5CPageLibrary4925704300021490.nsf
/A7986FD2A9CD47090525670800167225
/2B02B1DE3A82B2CE49257043001C2498
- A page attachment:
quickplace://ltwsvr.ibm.com/49257043000214B3/QuickPlace%5Csampleplace
%5CPageLibrary4925704300021490.nsf
/A7986FD2A9CD47090525670800167225
/2B02B1DE3A82B2CE49257043001C2498
?AttNo=0&AttName==QPCons3.ppt
Seed list crawlers
The
URI format for documents that are crawled by a
Seed list crawler is:
seedlist://Page_URL?pageID=Page_ID[&useSSL=true]
- Parameters
- URL encoding is applied to all of the fields.
- Page_URL
- The URL for the document (unique for each document).
- Page_ID
- The object identifier for the document.
- useSSL
- When the protocol is HTTPS, &useSSL=true is
added to the URI. Otherwise, useSSL is omitted.
- Example
- HTTPS protocol:
seedlist://quickrserver.ibm.com:10035/lotus/mypoc?uri=dm:bec6090046f1cd5
2bc5cfcb06e9f4550&verb=view&pageID=NlFSZURlMkJQNjZSMDZQMUMwM1FPNjZCQzY
2SUw2SUhPNk1RQ0M2Uk80Nk9PNjVCRUM2UUs2TDFDMA==&useSSL=true
SharePoint crawlers
The
SharePoint crawler does not generate
its own format of document URI. It creates an accessible URL for the
document URI. The accessible URL can be changed according to the Site
and Form configuration of the SharePoint server. The crawler tries
to retrieve the display form URL and append the document ID to it.
If the crawler is configured to retrieve a URL from a specific field,
the crawler tries to use the field value as the URI. This format is
useful for crawling lists that do not generate URLs based on the primary
key value. The default format is:
http://server/display_form_path?primary_key_field_name=primary_key_value
- Parameters
- URL encoding is applied to all of the fields.
- server
- display_form_path
- primary_key_field_name
- primary_key_value
- Example
https://sharepoint.example.ibm.com:9999/rootDir/Shared%20Documents/
Forms/DispForm.aspx?ID=5
UNIX file system crawlers
The
URI format for documents that are crawled by a
UNIX file system crawler is:
file:///Directory_Name/File_Name
- Parameters
- URL encoding is applied to all of the fields.
- Directory_Name
- The absolute path name for the directory.
- File_Name
- The name of the file.
- Example
file:///home/user/test.doc
Windows file system crawlers
The
URI formats for documents that are crawled by a
Windows file system crawler are:
file:///Directory_Name/File_Name
file:////Network_Folder_Name/Directory_Name/File_Name
- Parameters
- URL encoding is applied to all of the fields.
- Directory_Name
- The absolute path name for the directory.
- File_Name
- The name of the file.
- Network_Folder_Name
- For documents on remote servers only, the name of the shared folder
on a Windows network.
- Examples
- Local file system:
file:///d:/directory/test.doc
- Network file system:
file:////filesvr.ibm.com/directory/file.doc
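The two Windows formats can be sketched in Python as follows. The function name windows_file_uri is hypothetical, and the percent-encoding of characters other than ":" and "/" is an assumption about what the crawler emits.

```python
from urllib.parse import quote


def windows_file_uri(path: str) -> str:
    """Convert a local or UNC Windows path to the file: form shown above."""
    normalized = path.replace("\\", "/")
    encoded = quote(normalized, safe="/:")
    if normalized.startswith("//"):
        # UNC path: "file://" plus the two leading slashes gives four slashes.
        return "file://" + encoded
    return "file:///" + encoded


print(windows_file_uri("d:\\directory\\test.doc"))
# file:///d:/directory/test.doc
print(windows_file_uri("\\\\filesvr.ibm.com\\directory\\file.doc"))
# file:////filesvr.ibm.com/directory/file.doc
```

The four-slash network form differs from what some standard helpers produce for UNC paths, so the sketch builds the string by hand.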