Collecting data for index corruption and/or index or data inconsistency issues

Question & Answer

Question

If you think that the DB2® index manager might be the origin of a problem, collect diagnostic data that you or IBM® Software Support can use to diagnose and resolve the problem.

Cause

Background

There are many flavours of index-related corruption problems, for instance:

Unique index contains duplicates with different RID(s) or same RID(s)
Multiple index entries pointing to the same RID
Index key is out of place (wrong index key order)
Row exists, but index keys do not exist in any or some of the indexes
Index entry pointing to an empty data slot or unused data slot or RID is invalid
Incorrect previous or next index page pointers, incorrect high key or other corruptions on index pages

Even if you see the same return code from the same function or probe point, problems could vary from one to another. For instance, when SQLI_NOKEY error is received from index function "procLeaf2Del" or "sqlischd" during a delete or update, you could be hitting problems 3, 4, or 6.

If "row not found" or "bad page" type of error is received when an index is used to fetch the RID for a delete, update or select, then you could be hitting problems 3 or 5.

In short, for each problem, a DB2 support analyst will need to look at the data collected, and analyze it to determine root cause. To have data collected as much as possible at the first failure point as well as preserving necessary log files are often critical to the root cause determination.

Since index corruption or index data inconsistency type of errors could be caused by DB2 defects or by other issues such as hardware problems, it is necessary to collect the proper diagnostic data in order to determine the potential cause of the error.

DB2 tools to detect these problems

DB2 products offer a variety of tools to detect index or data corruption problems and index or data mismatch problems.

An off-line db2dart /t (table inspection) would be able to detect corruption symptoms 1, 2, 3, 6 of the above list. For example:

Table resides in table space id 4, and object id of table is 2, the database name is T30, the output is "inspTsi4oi2.rpt":
db2dart T30 /t /tsi 4 /oi 2 /scr n /rptn inspTsi4oi2.rpt
The RUNSTATS command with a special registry variable set will collect statistics as well as detect corruption symptoms 4, 5 of the above list. For example:

db2stop
db2set DB2_KEYDATACHECK=IXSTATDATADEBUG
db2start
db2 connect to <database>
db2 -v "runstats on table <tableschema>.<tabname> and detailed indexes all allow read access"

NOTE: After each RUNSTATS, you need to save the contents of the sqllib/db2dump directory. You can do so by moving the entire content to another location. As long as the RUNSTATS does not return with DB2 completed successfully, you will need to report this to DB2 support.

Some enhancements are currently being investigated. The goal is to improve the existing tools for serviceability and usability in future releases in the above areas.

Answer

You often need an advanced knowledge of DB2 in order to determine whether the problem is an index corruption or an index or data inconsistency problem. Therefore you should always contact IBM support organization to help you recover from these problems.

For index-related corruption or index or data inconsistency problems, the following information is typically required:

the contents of the DB2DUMP directory
the on-disk version of the related index pages (collected via the db2dart tool)
related data page(s) for the index or data object (collected via the db2dart tool)
database transaction log files, which are used to understand the past history of the index or table object that encountered the error.

The following steps are the recommended best practice for data collection (and database recovery) when an error related to index corruption occurs. NOTE: DO NOT perform these steps unless you have been asked to do so by an IBM DB2 Support analyst.

Collect the standard diagnostic information using the db2support command. This should include all the files in DB2DUMP directory (or the path specified by DIAGPATH in the dbm configuration).

There are cases where the corruptions are caused by a disk problem, and you do not care about root cause analysis of the problem. In these situations, skip step 2, and go directly to step 3.
Run db2dart as indicated below (note: if this is a multi-partition DPF environment, then you will need to run db2dart on the partition where failure occurred):

db2dart <dbname> /t /tsi <pool id> /oi <object id> /rptn inspTsi<poolid>oi<objectid>_bcrashrec.rpt

In general, this inspection should be fairly quick. If the database is mission critical database and needs to be brought up online immediately, then you may skip this step at this time.
Restart your database and verify crash recovery completes successfully.

If crash recovery completes successfully and you cannot take the time needed to collect this data for root cause analysis of the problem, skip steps 4, 5 and 6 and go directly to step 7.

If the restart fails, and it is determined to be a problem redoing or undoing an index record, then you can mark an index invalid (e.g. db2dart <dbname> /mi /oi <objectID> /tsi <tablespaceID>). INDEXREC will control when the invalid index object will be recreated.
To prevent the previous failing application from accessing the table again before the problem on the related index or table is resolved, you can lock the table in exclusive mode. Use the following command:

db2 +c "LOCK TABLE <SCHEMA NAME>.<TABLENAME> IN EXCLUSIVE MODE"

The "+c" turns off the auto commit, since LOCK TABLE is only effective within the unit of work.

If you do care about the availability of the rest of the tables in the same table space as the problem table, instead you may also want to consider quiescing the table spaces for the table object. For example:

db2 "QUIESCE TABLESPACES FOR TABLE <SCHEMA NAME>.<TABLE NAME> EXCLUSIVE"

NOTE: QUIESCE command will also quiesce the index table space if the index object is in a separate table space. In a partitioned database environment, this command will only quiesce the portion on current node.
Run db2dart against the index and table object.

If the problem is an index NOKEY error, then a number of db2dart commands will be printed in the db2diag.log file. Search for these commands (using either grep on UNIX® or find on Windows®) and save them in a file. Edit the file, replace "DBNAME" with the database name. If the problem has been hit multiple times, then there could be some duplicate entries. You only need to keep the latest set of db2dart commands.

If it is the data manager that reports row not found or bad page, then NO db2dart commands are printed out. You can determine the pool id, object id, data page number from the db2diag.log entries. For example:

2006-06-17-11.32.59.941406+000 I235565A460 LEVEL: Severe PID : 9224396 TID : 1 PROC : db2agent (SAMPLE) 0 INSTANCE: db2inst NODE : 000 DB : SAMPLE APPHDL : 0-78 APPID: .... FUNCTION: DB2 UDB, data management, sqldDeleteRow, probe:1329 RETCODE : ZRC=0x87040001=-2029780991=SQLD_BADPAGE "Bad Data Page" DIA8500C A data file error has occurred, record id is "". 2006-06-17-11.32.59.964241+000 I236403A380 LEVEL: Severe PID : 9224396 TID : 1 PROC : db2agent (SAMPLE) 0 INSTANCE: db2inst NODE : 000 DB : SAMPLE APPHDL : 0-78 APPID: ... MESSAGE : PAGE OBJECT IDENTIFIERS: DATA #1 : String, 54 bytes Tablespace ID = 15, Object ID = 8737, Object Type = 0 2006-06-17-11.32.59.964448+000 I236784A350 LEVEL: Severe PID : 9224396 TID : 1 PROC : db2agent (SAMPLE) 0 INSTANCE: db2inst NODE : 000 DB : SAMPLE APPHDL : 0-78 APPID: ... MESSAGE : PAGE NUMBERS: DATA #1 : String, 35 bytes Obj Page = 297, Pool Page = 787527

In this case, the data page is 787527, pool id is 15 and object id is 8737.

The commands that you need are:
1. db2dart <dbname> /dd /tsi <pool id> /oi <object id> /ps <data page number>p /np 1 /v y /scr n /rptn dTsi<poolid>oi<objectid>.rpt
  
  ...where <data page number> is 787527p using the example above, "p" is used to indicate this is pool relative page which is important for DMS table space.
2. db2dart <dbname> /di /tsi <pool id> /oi <object id> /ps 0 /np <large enough value to dump all index pages> /v y /scr n /rptn iTsi<poolid>oi<objectid>_0-<pagenum>.rpt.rpt
  
  Note: The command might take longer to run if a large /np value is specified. Ideally, you should collect as much data as possible. At minimum, it should be no less than 100 pages.
  
  If table is not too big, also dump the whole table. If the table is too big to dump, select a reasonably large number so that at least a part of the table can be dumped, for example:
  
  db2dart <dbname> /dd /tsi <pool id> /oi <object id> /ps <data page to start> /np <large enough number> /v y /scr n /rptn dTsi<pool id>oi<object id>_0-<page num>.rpt
  
  where <data page to start> would be "0" if you want to dump the whole table, or <data page number - 100>p if the table is too big to dump (using the example, that would be "787427p"). Replace the <pool id> and <object id> with the actual number.
Run RUNSTATS to check for any index or data inconsistency. For example:

db2stop
db2set DB2_KEYDATACHECK=IXSTATDATADEBUG
db2start
db2 connect to <db>
db2 -v "SELECT INDSCHEMA, INDNAME FROM SYSCAT.INDEXES WHERE TABSCHEMA='<TABLE SCHENAME>' AND TABNAME='<TABLENAME>' "

Then use the index schema name and index name from the above SELECT output for the following RUNSTATS command:

db2 -v "runstats on table <tableschema>.<tabname> and index <index schema>.<index name> allow read access"

NOTE: You need to perform the RUNSTATS on all of the indexes defined on the offending table; after each RUNSTATS command, you need to save the contents of the sqllib/db2dump directory (or DIAGPATH, if you're not using the default path).

RUNSTATS on a large table could take a while and the table will not be available for updates while RUNSTATS is running. In a multi-partition DPF environment, RUNSTATS is done on a per partition basis.

If you had quiesced the table space at step 4, then unquiesce the table spaces before you terminate the database connection. You will need to use the same user id for the QUIESCE command in order to reset:

db2 -v "QUIESCE TABLESPACES FOR TABLE <SCHEMA NAME>.<TABLE NAME> RESET"

Once all the RUNSTATS commands are completed, you need to collect the db2diag.log along with the content of sqllib/db2dump (or content of DIAGPATH).
If it is determined that the index is corrupted (that is, if either RUNSTATS or db2dart /t inspection returns with an error), then proceed with rebuilding the index as follows:

db2 "REORG INDEXES ALL FOR TABLE <SCHEMA NAME>.<TABLE NAME> ALLOW NO ACCESS"

Applications cannot access your table while the above REORG is running. If you need the table to be available immediately, you can choose to run the above REORG INDEXES command with ALLOW READ ACCESS or ALLOW WRITE ACCESS options. Be aware, however, that the database is at risk of being brought down again if the previous failing application is run against the table before the REORG completes.

If you are not concerned with root cause analysis of the problem, you do not need to collect any additional information and can skip the subsequent steps.
Collect table or index DDLs. You can use the db2look utility to do so. For example:

db2look -d <database> -l -e -a -t <table name> -o table.ddl
Collect transaction log files

You will only know which log files to collect after a DB2 Support analyst examines the data collected thus far. In general, they will ask you to send a raw log file. In some special cases, they may ask you to format the file and only send the relevant log files. The number of log files required will depend on when the problem was introduced, so it might be necessary to retrieve older log files from archive.
Additional information that may be important to root cause analysis or known APAR identification. For example:
- What type of workload is typically running against the table?
- Did you perform any actions that you don't normally do on this table, for instance any REORG INDEXES/TABLE commands, load or import operations, a high utilities workload, or restore or rollforward operations?
- Were there any other unusual occurrences? For example, if there were recent hardware problems, provide the operating system error log. This would involve collecting the errpt -a output on AIX® or /var/adm/messages file on Solaris.

Related Information

Collecting data

[{"Product":{"code":"SSEPGG","label":"Db2 for Linux, UNIX and Windows"},"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Component":"Database Objects\/Config - Indexes","Platform":[{"code":"PF002","label":"AIX"},{"code":"PF010","label":"HP-UX"},{"code":"PF016","label":"Linux"},{"code":"PF027","label":"Solaris"},{"code":"PF033","label":"Windows"}],"Version":"9.7;9.5;9.1;10.1;10.5","Edition":"Enterprise Server;Express;Personal;Personal Developer\u0027s;Workgroup Server","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Tips

Collecting data for index corruption and/or index or data inconsistency issues

Question & Answer

Question

Cause

Answer

Related Information

Was this topic helpful?

Document Information

UID

Share your feedback

Need support?