Sample estimator tool: ldif2tdbm.est.awk

(Note: A text version of this page is available in the Readme file.)

A sample tool, ldif2tdbm.est.awk, is now available for download. It estimates the space requirements for the DB2 load utility datasets generated by the ldif2tdbm utility program.

Background

The ldif2tdbm utility program generates DB2 load utility input datasets from an LDIF file. These datasets must be preallocated prior to running the ldif2tdbm utility. However, it is not always clear how much space to allocate for these datasets.

For extremely large LDIF files, it becomes more important to allocate these datasets with an appropriate amount of space. Allocating too much wastes DASD space, while allocating too little can cost the administrator considerable time recovering from an out-of-space failure.

The ldif2tdbm utility documentation provides some general formulas for calculating the amount of space needed based on the input LDIF file.

However, for many cases, closer examination of data within the LDIF file is necessary to determine the actual space required. In particular, when long attributes or long entries are present, it is difficult to determine the distribution of data between the DIR_ENTRY table and the two overflow tables, DIR_LONGATTR and DIR_LONGENTRY.

Customization of LDAP Server TDBM configuration values and DB2 DDL values can influence the size of these datasets.

Furthermore, determining the number of tracks or cylinders required for a given number of bytes of LDIF data requires additional calculations.
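As a rough illustration of that last conversion, the sketch below turns a byte count into whole tracks and cylinders. The geometry is an assumption (a 3390-type device with 56,664 bytes per track and 15 tracks per cylinder), and the function name bytes_to_space is hypothetical; it is not part of ldif2tdbm.est.awk. Usable capacity per track is lower in practice and depends on the block size, so the result is best read as a lower bound.

```shell
# Convert a byte count to whole tracks and cylinders, rounding up.
# Assumed geometry (not from the tool): 56664 bytes/track, 15 tracks/cylinder.
bytes_to_space() {
    awk -v bytes="$1" -v bpt="${2:-56664}" -v tpc="${3:-15}" 'BEGIN {
        tracks = int(bytes / bpt)
        if (tracks * bpt < bytes) tracks++      # round up to a whole track
        cyls = int(tracks / tpc)
        if (cyls * tpc < tracks) cyls++         # round up to a whole cylinder
        printf "%d tracks, %d cylinders\n", tracks, cyls
    }'
}

bytes_to_space 23200000     # prints: 410 tracks, 28 cylinders
```

The optional second and third arguments override the bytes-per-track and tracks-per-cylinder values for other device types or block sizes.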

ldif2tdbm.est.awk takes many of the customized values described above into account and then examines the data within the LDIF input file. It produces an estimate of the space requirements that is generally much closer to the actual requirement than the estimates produced by the simple formulas in the ldif2tdbm documentation.

Some of the settings that influence the space required for these datasets are:

  • The AttrOverflowSize value from the LDAP Server configuration file.
  • The length of the AdminDN value from the LDAP Server configuration file.
  • The tablespace page size of the various TDBM tablespaces, which is determined by the bufferpools chosen.
  • The size of the VALUE column of the DIR_SEARCH table.
  • The size of the DN and DN_TRUNC columns of the DIR_ENTRY table.

Additionally, the data within the LDIF file itself influences the space required for these datasets, such as:

  • The number of entries in the file.
  • The number of individual attribute value pairs.
  • The size of the "dn" (DistinguishedName) of the individual entries.
  • The size of the individual attribute values, particularly when they are longer than the AttrOverflowSize or the DIR_SEARCH VALUE column size.
  • The presence of multi-valued attributes.
  • The presence of Base64 encoded values.
  • Whether "objectClass: top" is specified or omitted for the entries in the file.
  • The length of the structural objectClass value of the typical entry.

The sample tool considers the aforementioned factors when making its estimates.
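To make a few of these data characteristics concrete, here is a minimal sketch that gathers some of them from an LDIF file in a single pass. It is not the logic of ldif2tdbm.est.awk; the function name ldif_stats and the default threshold of 255 are hypothetical, with the threshold standing in for a limit such as AttrOverflowSize. It assumes standard LDIF conventions: a blank line separates entries, a line starting with a space continues the previous line, and "::" marks a Base64 encoded value.

```shell
# Gather simple size statistics from an LDIF file (illustrative only).
# $1 = LDIF file, $2 = length above which a value counts as long (hypothetical default).
ldif_stats() {
    awk -v threshold="${2:-255}" '
        function finish_value() {               # account for the value just completed
            if (attr == "") return
            if (tolower(attr) == "dn") {
                entries++
                if (vlen > max_dn) max_dn = vlen
            } else {
                pairs++
                if (vlen > threshold) long_values++
            }
            attr = ""
        }
        /^#/ { next }                                           # comment line
        /^ / { if (attr != "") vlen += length($0) - 1; next }   # folded continuation
        /^$/ { finish_value(); next }                           # blank line ends an entry
        {
            finish_value()
            sep = index($0, ":")
            if (sep == 0) next
            attr = substr($0, 1, sep - 1)
            value = substr($0, sep + 1)
            if (substr(value, 1, 1) == ":") { base64++; value = substr(value, 2) }
            sub(/^ +/, "", value)
            vlen = length(value)
        }
        END {
            finish_value()
            printf "entries=%d pairs=%d max_dn=%d base64=%d long_values=%d\n",
                   entries, pairs, max_dn, base64, long_values
        }
    ' "$1"
}

# Usage: ldif_stats ldif_file_name 255
```

Base64 values are measured in their encoded form here, so their decoded length would be shorter; a closer estimate would need to account for that.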

Note that the tool is intended to provide a reasonably accurate estimate without consuming too much time processing the LDIF file. While the tool does a line-by-line examination of the LDIF file, the data is not examined to the extent done by the actual ldif2tdbm utility. Several other factors related to the LDIF data and the directory schema also affect the load dataset sizes but are not considered by the tool:

  • Presence of alternate names for the attribute types in the LDIF file. For example, "c: US" and "countryName: US" may be used interchangeably in the LDIF file. The directory will store these both under the primary name chosen in the schema, usually "c: US". However, within the LDIF file, "c: US" accounts for 5 characters, while "countryName: US" accounts for 15 characters. Thus estimates based on the size of the LDIF file and its data alone contain some degree of inaccuracy when the data uses alternate attribute names.
  • Data whose normalized form is of different length than its input form. For example, attributes with "telephone number" syntax, such as "telephoneNumber: 1-234-567-8901", will be stored in a shortened form as "telephoneNumber: 12345678901". Also, large amounts of embedded whitespace in string data may cause inaccuracies (e.g., "cn: Jane    A.    Doe" usually gets normalized to "cn: JANE A. DOE"). It is expected that these inaccuracies will be small.
  • Presence of binary attributes. Binary attribute values are not stored in the DIR_SEARCH table. However, they may account for a considerable percentage of the LDIF data (e.g., large .gif attributes) and may cause the space estimate for the DIR_SEARCH tablespace to be too large.

To help speed up the estimator, a "sample_percent" processing option is provided to reduce the amount of data it examines closely while making its estimate. In addition, when multi-valued attributes are not widely used, the checking for them can be disabled to speed processing further.
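The general idea behind sampling can be sketched as follows: measure only a fraction of the entries closely, then scale the result up by the total entry count. How ldif2tdbm.est.awk actually applies "sample_percent" is not described on this page, so the selection rule below (every Nth entry) and the function name sampled_value_bytes are assumptions made for illustration.

```shell
# Estimate the total bytes of LDIF data by measuring only a sample of entries.
# $1 = LDIF file, $2 = percentage of entries to examine (hypothetical default 10).
sampled_value_bytes() {
    awk -v sample_percent="${2:-10}" '
        BEGIN {
            if (sample_percent < 1) sample_percent = 1
            step = int(100 / sample_percent)    # examine 1 entry out of every "step"
            if (step < 1) step = 1
        }
        /^$/ { in_entry = 0; next }             # a blank line ends the entry
        !in_entry {                             # first line of a new entry
            in_entry = 1
            entries++
            sampled = ((entries - 1) % step == 0)
            if (sampled) examined++
        }
        sampled { bytes += length($0) }         # only sampled entries are measured
        END {
            if (examined > 0) printf "%d\n", bytes * entries / examined
            else print 0
        }
    ' "$1"
}

# Usage: sampled_value_bytes ldif_file_name 10
```

The scaled figure is only as good as the sample: if entries vary widely in size, a small percentage can miss the large ones, which is why sampling suits data that is consistent throughout the file.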

Running the Estimator

1. Download the Estimator tool.

2. If you wish, copy the tool into the HFS directory where your LDIF data resides. The tool is an awk script that runs under OS/390 Unix System Services. Because it is written in awk, it can potentially be run on other Unix platforms as well.

3. Edit the working copy of the tool (ldif2tdbm.est.awk) and update the customizable values in the script according to:

  • Run options for the tool
  • Values you set earlier in your Directory Server configuration file
  • Values in your SPUFI input file (when you created the TDBM databases, tablespaces, tables, and indexes)
  • Characteristics of your LDIF file data
  • Characteristics of the DASD where the files will be allocated

4. Invoke the awk script from the Unix System Services command line:

awk -f ldif2tdbm.est.awk ldif_file_name > estimator.output.report

Where: "ldif_file_name" is the file name of the LDIF file you intend to load into your Directory Server's TDBM database. You can specify multiple files here (separated by blanks), as the awk command accepts multiple input files.

And "estimator.output.report" is the name of the file in which you wish to capture the estimator's output report. The output is written to "stdout"; the ">" redirects it to the file.
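The way awk walks through several input files in one run can be seen with a small stand-alone sketch. It is unrelated to the estimator's own logic; the function name entries_per_file is hypothetical. It counts the "dn:" lines found in each file named on the command line, using awk's built-in FILENAME variable.

```shell
# Count entries (lines starting with "dn:") in each LDIF file given as an argument.
entries_per_file() {
    awk '/^[Dd][Nn]:/ { n[FILENAME]++ }
         END { for (f in n) printf "%s %d\n", f, n[f] }' "$@" | sort
}

# Usage: entries_per_file first.ldif second.ldif
```

A file that contains no "dn:" lines does not appear in the output of this sketch.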

Sample Report

For a sample report, see the readme file.

Performance Considerations

As previously mentioned, using a smaller "sample_percent" value (e.g., 10) in the estimator can speed up processing, sometimes reducing elapsed time by as much as 50%. This is useful when you have an extremely large amount of data to load (megabytes or gigabytes) and the characteristics of the data are very consistent throughout the LDIF file.

Also, when loading extremely large amounts of data, you may wish to split the LDIF input data into multiple files. If the data is extremely consistent (i.e., most entries look alike, with the same attributes present but slightly different values), you can run the estimator against only one of the files and calculate the load dataset sizes for the others yourself, scaling the estimates in proportion to the LDIF file sizes in bytes.

An example of highly consistent data is a set of phone book entries, each containing a dn (distinguishedName), cn (commonName), sn (surname), an objectClass of person, and a telephoneNumber. Suppose you have two LDIF files: one with surnames starting with the letters A-M, which is 22 megabytes, and one with surnames starting with N-Z, which is 23.2 megabytes. You could process the A-M file with the estimator, then calculate the N-Z file estimates by multiplying the A-M values by 23.2/22 (approximately 1.05).
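The proportional calculation can be sketched as a small helper. The function name scale_estimate is hypothetical, and the figure of 400 cylinders is made up for illustration; only the 22 and 23.2 megabyte file sizes come from the example above. The result is rounded up so that the scaled dataset is not undersized.

```shell
# Scale an estimate made for one LDIF file to another file with similar content.
# $1 = estimate for the processed file, $2 = its size, $3 = size of the other file
scale_estimate() {
    awk -v est="$1" -v base="$2" -v other="$3" 'BEGIN {
        scaled = est * other / base
        rounded = int(scaled)
        if (rounded < scaled) rounded++         # round up to a whole unit
        print rounded
    }'
}

scale_estimate 400 22 23.2      # prints: 422
```

The two sizes only need to be in the same unit (bytes, kilobytes, or megabytes), since only their ratio is used.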

Also, if you do not need to check for multi-valued attributes, because they are either not used or not prevalent in your LDIF files, setting "check_multi_valued=0" will speed processing of the estimator.
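To judge whether multi-valued attributes are prevalent in a file before disabling the check, a quick scan such as the following may help. It is a simplified sketch and not the estimator's own check: the function name multi_valued_attrs is hypothetical, and comment lines and folded continuation lines are skipped rather than reassembled. For each attribute type, it reports the number of entries in which that type appears more than once.

```shell
# List attribute types that occur more than once within a single entry,
# together with the number of entries where that happens.
multi_valued_attrs() {
    awk '
        function end_entry(   a) {
            for (a in seen) if (seen[a] > 1) multi[a]++
            split("", seen)                     # clear the per-entry counts
        }
        /^$/    { end_entry(); next }           # blank line ends an entry
        /^[ #]/ { next }                        # skip continuations and comments
        {
            sep = index($0, ":")
            if (sep > 1) seen[tolower(substr($0, 1, sep - 1))]++
        }
        END {
            end_entry()
            for (a in multi) printf "%s %d\n", a, multi[a]
        }
    ' "$1" | sort
}

# Usage: multi_valued_attrs ldif_file_name
```

Entries that list both "objectClass: top" and a structural class will show objectclass in the output, so that line alone does not indicate heavy use of multi-valued attributes.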

 
