Custom annotations and splitters

To control how the system processes incoming log file records, you can define custom annotations and splitters for your Insight Pack.

Before IBM® Operations Analytics - Log Analysis indexes any data, it can split and annotate the incoming log file records. To do this, you can use either Annotation Query Language (AQL) rules or custom logic that is implemented in a language such as Java™ or Python.

Splitting

Splitting describes how IBM Operations Analytics - Log Analysis separates physical log file records into logical records using a logical boundary such as a timestamp or a new line. For example, when a timestamp is used as the logical boundary, all physical records from the beginning of the first detected timestamp up to the beginning of the next timestamp are included in one logical record. The next timestamp ends that logical record and starts the next one.
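The following Python sketch illustrates timestamp-based grouping only. The regular expression and the function name group_by_timestamp are assumptions made for this illustration. They are not the rules or interfaces that the product uses.

```python
import re

# Assumed boundary: a bracketed, WebSphere-style timestamp at the start of a line.
TIMESTAMP = re.compile(r"^\[\d{1,2}/\d{1,2}/\d{2} \d{1,2}:\d{2}:\d{2}:\d{3} [^\]]+\]")

def group_by_timestamp(lines):
    """Group physical lines into logical records that each start at a timestamp."""
    records = []
    for line in lines:
        if TIMESTAMP.match(line) or not records:
            # A boundary opens a new logical record. Lines that precede the
            # first boundary are kept together in a group of their own.
            records.append([line])
        else:
            # Any other line continues the current logical record.
            records[-1].append(line)
    return ["\n".join(group) for group in records]

physical = [
    "[9/21/12 14:31:13:117 GMT+05:30] 0000003e InternalGener I ",
    "  DSRA8203I: Database product name : D2/LINUXX8664",
    "[9/21/12 14:31:13:119 GMT+05:30] 0000003e InternalGener I ",
    "  DSRA8204I: Database product version : SQL09070",
]
logical = group_by_timestamp(physical)
for record in logical:
    print(repr(record))
```

In this sketch, the four physical lines are grouped into two logical records, one for each timestamp.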

The logic that a splitter uses to manage incoming data records must adhere to a schema that is required by IBM Operations Analytics - Log Analysis. This is true for both AQL and custom logic splitters. Splitter logic processes records in batches, and a batch might not contain a complete set of logical log records. The splitter must therefore process partial records that can occur at the start of the batch as well as at the end of the batch.

A splitter must distinguish incoming data records that form a complete log record from records that it must buffer until additional records arrive to complete them. It must also identify records that can be discarded, for example, records that the splitter determines can never be part of a complete log record. The splitter logic processes a batch of incoming records and must split them on the defined boundary. It returns the split records, each with a type that indicates to IBM Operations Analytics - Log Analysis how the record is to be handled.

The general schema that is returned by the splitter contains the following attributes:

Log text

The text that is contained in the log record after it is split.

Timestamp

The timestamp, if there is one, that is associated with the log record.

Type

The type is a single character, A, B, or C, that indicates the type of this log record. The possible types are as follows:

A: indicates a complete log record. The splitter logic determines that the associated record is complete. The record can be sent to the annotation and indexing processes. In the following example, the first record is a type A record and the second is a type B record. This is because the start of the second record indicates to the splitter that the first record is complete:
```
[9/21/12 14:31:13:117 GMT+05:30] 0000003e InternalGener I 
  DSRA8203I: Database product name : D2/LINUXX8664
[9/21/12 14:31:13:119 GMT+05:30] 0000003e InternalGener I 
  DSRA8204I: Database product version : SQL09070
```
B: indicates that there is a partial log record at the end of the set. For example, the splitter detects the start of a new logical record but cannot determine if it is complete because the splitter cannot find the next logical record boundary that indicates the start of the next record. The splitter marks the record as type B to indicate to the IBM Operations Analytics - Log Analysis server that this record is a partial record and it must be buffered until more incoming records are received to allow it to complete the logical record. The IBM Operations Analytics - Log Analysis server sends all type A log records for annotation and indexing. It buffers type B records. The buffered type B records are then prefixed to the next batch of input that is sent to the splitter when it receives more input records. For example:
```
[9/21/12 14:31:27:882 GMT+05:30] 00000051 servlet       
E com.ibm.ws.webcontainer.servlet.ServletWrapper service SRVE0068E:
 Uncaught exception created in one of the 
service methods of the servlet TradeAppServlet in application
 DayTrader2-EE5. Exception created : 
javax.servlet.ServletException: TradeServletAction.doLogout
(...)exception logging out user uid:1
at org.apache.geronimo.samples.daytrader.web
.TradeServletAction.doLogout(TradeServletAction.java:458)
at org.apache.geronimo.samples.daytrader.web
.TradeAppServlet.performTask(TradeAppServlet.java:169)
at org.apache.geronimo.samples.daytrader
.web.TradeAppServlet.doGet(TradeAppServlet.java:78)
```
C: indicates that the text can be discarded. The IBM Operations Analytics - Log Analysis server discards this text. This type of record is not sent for annotation and indexing, and it is not buffered. You must define the splitter so that it marks text as type C only if it is certain that the text is not part of an incomplete log record that later input might complete. For example, a partial log record is detected at the beginning of a batch of records. Then, a complete but unrelated logical log record is found. IBM Operations Analytics - Log Analysis can never complete the partial record that was detected first. The partial record must be marked as type C and discarded. For example:
```
************ Start Display Current Environment ************
WebSphere Platform 7.0.0.0 [ND 7.0.0.0 r0835.03] running with process
 name cldftp48Node01Cell\cldftp48Node01\server1 and process id 28811
Host Operating System is Linux, version 2.6.18-194.el5
Java version = 1.6.0, Java Compiler = j9jit24, Java VM name = IBM J9 VM
```

Annotating

After the log records are split, the logical records are sent to the annotation engine. The engine uses rules that are written in AQL or custom logic that is written in Java or Python to extract important pieces of information that are sent to the indexing engine. IBM Operations Analytics - Log Analysis represents the results from the annotation process in a JavaScript Object Notation (JSON) data structure called annotations. The annotations JSON structure is part of a larger structure that also contains the original log record text (the content key) and the metadata passed into the REST API (the metadata key). You can reference the annotations structure to access the actual values from the annotation result.

You can reference the annotation results in the source.paths attributes that are contained in the field definitions in the indexing configuration. You use dot notation to indicate where the values of the fields that are indexed are located in the annotations structure, as the example that follows shows.

For example, the annotation engine in IBM Operations Analytics - Log Analysis generates the following JSON structure when it processes an AQL rule set against an incoming logical log record:

```
{
  "annotations": {
    "annotatorCommon_EventTypeOutput": [
      {
        "field_type": "EventTypeWS",
        "span": { "begin": 57, "end": 58, "text": "E" },
        "text": "E"
      }
    ],
    "annotatorCommon_LogTimestamp": [
      {
        "span": { "begin": 1, "end": 32, "text": "03/24/13 07:16:28:103 GMT+05:30" }
      }
    ],
    "annotatorCommon_MsgIdOutput": [
      {
        "field_type": "MsgId",
        "span": { "begin": 59, "end": 68, "text": "DSRA1120E" },
        "text": "DSRA1120E"
      }
    ],
    "annotatorCommon_ShortnameOutput": [
      {
        "field_type": "ShortnameWS",
        "span": { "begin": 43, "end": 56, "text": "TraceResponse" },
        "text": "TraceResponse"
      }
    ],
    "annotatorCommon_ThreadIDOutput": [
      {
        "field_type": "ThreadIDWS",
        "span": { "begin": 34, "end": 42, "text": "00000010" },
        "text": "00000010"
      }
    ],
    "annotatorCommon_msgText": [
      {
        "fullMsg": {
          "begin": 59,
          "end": 167,
          "text": "DSRA1120E: Application did not explicitly close all handles to this Connection. Connection cannot be pooled."
        },
        "span": {
          "begin": 70,
          "end": 167,
          "text": "Application did not explicitly close all handles to this Connection. Connection cannot be pooled."
        }
      }
    ]
  },
  "content": {
    "span": {
      "begin": 1,
      "end": 169,
      "text": "[03/24/13 07:16:28:103 GMT+05:30] 00000010 TraceResponse E DSRA1120E: Application did not explicitly close all handles to this Connection. Connection cannot be pooled.\n"
    },
    "text": "[03/24/13 07:16:28:103 GMT+05:30] 00000010 TraceResponse E DSRA1120E: Application did not explicitly close all handles to this Connection. Connection cannot be pooled.\n"
  },
  "metadata": {
    "batchsize": "506",
    "flush": true,
    "hostname": "mylogfilehost",
    "inputType": "logs",
    "logpath": "/data/unityadm/IBM/LogAnalysis/logsources/was/SystemOut.log",
    "datasource": "WAS system out",
    "regex_class": "AllRecords",
    "timestamp": "03/24/13 07:16:28:103 GMT+05:30",
    "type": "A"
  }
}
```

In the example, there are three main sections or keys that are defined in the JSON data structure:

annotations: provides access to the annotation results that are created by the annotation engine when it processes an incoming log record according to AQL rules or custom logic.
content: provides access to the raw logical log record.
metadata: provides access to some of the metadata that describes the file that the log record was obtained from, for example, the host name or data source. In general, the metadata section contains any name/value pairs that a client sends to the IBM Operations Analytics - Log Analysis server along with the log data.
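
To make the three keys concrete, the following Python sketch builds a structure of the same shape with a single regular expression rule. The function annotate and the MSG_ID pattern are hypothetical and stand in for real AQL rules or custom logic. The content key is simplified to hold only the text. This is not the annotator interface that the product defines.

```python
import json
import re

# Hypothetical rule: a message ID made of four letters, four digits, and a letter.
MSG_ID = re.compile(r"\b[A-Z]{4}\d{4}[A-Z]\b")

def annotate(log_text, metadata):
    """Return a dict with the annotations, content, and metadata keys."""
    annotations = {}
    match = MSG_ID.search(log_text)
    if match:
        # Mirror the shape of annotatorCommon_MsgIdOutput in the example.
        annotations["annotatorCommon_MsgIdOutput"] = [{
            "field_type": "MsgId",
            "span": {
                "begin": match.start(),
                "end": match.end(),
                "text": match.group(),
            },
            "text": match.group(),
        }]
    return {
        "annotations": annotations,      # results of the annotation rules
        "content": {"text": log_text},   # raw logical log record
        "metadata": metadata,            # name/value pairs sent with the data
    }

record = (
    "[03/24/13 07:16:28:103 GMT+05:30] 00000010 TraceResponse E DSRA1120E: "
    "Application did not explicitly close all handles to this Connection. "
    "Connection cannot be pooled.\n"
)
result = annotate(record, {"hostname": "mylogfilehost", "type": "A"})
print(json.dumps(result["annotations"], indent=2))
```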

When you create the indexing configuration, you can set the value of the source.paths attribute for each field to a dot notation reference to an attribute within the input JSON data structure.

For example, to specify the text value for the annotated field MsgId from the previous example, use the following dot notation reference, which resolves to the actual value DSRA1120E:

annotations.annotatorCommon_MsgIdOutput.text

The following reference produces the same result:

annotations.annotatorCommon_MsgIdOutput.span.text

In a similar manner, you can use dot notation references to the content and metadata keys for the source.paths attribute value of each field to be indexed. For example:

content.text
metadata.hostname
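
The dot notation references above can be understood through the following Python sketch, which resolves a reference against a trimmed copy of the example structure. The function resolve is hypothetical. In particular, stepping into the first element of a list is an assumption made for this illustration, and the product might combine multiple list values differently.

```python
def resolve(path, document):
    """Follow a dot notation reference through nested dicts and lists."""
    current = document
    for key in path.split("."):
        if isinstance(current, list):
            # Assumption: enter a list through its first element.
            if not current:
                return None
            current = current[0]
        if not isinstance(current, dict) or key not in current:
            return None  # the reference does not exist in this record
        current = current[key]
    return current

# Trimmed copy of the example structure.
doc = {
    "annotations": {
        "annotatorCommon_MsgIdOutput": [{
            "field_type": "MsgId",
            "span": {"begin": 59, "end": 68, "text": "DSRA1120E"},
            "text": "DSRA1120E",
        }],
    },
    "content": {"text": "[03/24/13 07:16:28:103 GMT+05:30] 00000010 ...\n"},
    "metadata": {"hostname": "mylogfilehost"},
}

for reference in (
    "annotations.annotatorCommon_MsgIdOutput.text",
    "annotations.annotatorCommon_MsgIdOutput.span.text",
    "content.text",
    "metadata.hostname",
):
    print(reference, "->", repr(resolve(reference, doc)))
```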

For more information about indexing configuration, see Indexing configuration in the Extending guide.