***********************************************
* Licensed Materials - Property of IBM        *
* 5647-A01                                    *
* (C) Copyright IBM Corp. 2000                *
***********************************************

Sample estimator tool, ldif2tdbm.est.awk

Background:
----------
The ldif2tdbm utility program generates DB2 load utility input
datasets from an LDIF file. These datasets must be preallocated
prior to running the ldif2tdbm utility. However, it is not always
clear how much space to allocate for these datasets.

For extremely large LDIF files, it becomes more important to
allocate these datasets with an appropriate amount of space.
Allocating them too large may waste large amounts of DASD space.
However, allocating them too small may cost the administrator a lot
of time recovering from an out-of-space failure.

The ldif2tdbm utility documentation provides some general formulas
for calculating the amount of space needed based on the input LDIF
file. However, for many cases, closer examination of data within the
LDIF file is necessary to determine the actual space required. In
particular, when long attributes or long entries are present, it is
difficult to determine the distribution of data between the
DIR_ENTRY table and the two overflow tables, DIR_LONGATTR and
DIR_LONGENTRY. Customization of LDAP Server TDBM configuration
values and DB2 DDL values can influence the size of these datasets.
Furthermore, determining the number of tracks or cylinders required
for a given number of bytes of LDIF data requires additional
calculations.

Estimator Tool Description:
--------------------------
The sample tool ldif2tdbm.est.awk is useful for estimating the space
requirements for the DB2 load utility datasets generated by the
ldif2tdbm utility program. The sample tool uses many of the
customized values described above, and then examines the data within
the LDIF input file.
It then produces an estimate of the space requirements which is
generally much closer to the actual requirements than the estimates
produced by the simple formulas provided in the ldif2tdbm
documentation.

Some of the settings which influence the space required for these
datasets are:

  - the AttrOverflowSize value from the LDAP Server configuration
    file.
  - the length of the AdminDN value from the LDAP Server
    configuration file.
  - the tablespace page size of the various TDBM tablespaces, which
    is determined by the bufferpools chosen.
  - the size of the VALUE column of the DIR_SEARCH table.
  - the size of the DN and DN_TRUNC columns of the DIR_ENTRY table.

Additionally, the data within the LDIF file itself influences the
space required for these datasets, such as:

  - the number of entries in the file
  - the number of individual attribute value pairs
  - the size of the "dn" (DistinguishedName) of the individual
    entries
  - the size of the individual attribute values (particularly when
    they are longer than the AttrOverflowSize or the DIR_SEARCH
    VALUE column size)
  - the presence of multi-valued attributes
  - the presence of Base64 encoded values
  - whether "objectClass: top" is specified or omitted for the
    entries in the file
  - the length of the structural objectClass value of the typical
    entry

The sample tool considers the aforementioned factors when making its
estimates.

Note that the tool was intended to provide a reasonably accurate
estimate, without consuming too much time processing the LDIF file.
While the tool does a line-by-line examination of the LDIF file, the
data is not examined to the extent done by the actual ldif2tdbm
utility. Many other factors related to the LDIF data and the
directory schema affect the load data set sizes. However, these
factors are not considered by the tool:

  - presence of alternate names for the attribute types in the LDIF
    file. For example, "c: US" and "countryName: US" may be used
    interchangeably in the LDIF file.
    The directory will store both under the primary name chosen in
    the schema, usually "c: US". However, within the LDIF file,
    "c: US" accounts for 5 characters, while "countryName: US"
    accounts for 15 characters. Thus estimates based on the size of
    the LDIF file and its data alone contain some degree of
    inaccuracy when the data uses alternate attribute names.
  - data whose normalized form is of different length than its
    input form. For example, attributes with "telephone number"
    syntax, such as "telephoneNumber: 1-234-567-8901", will be
    stored in a shortened form as "telephoneNumber: 12345678901".
    Also, large amounts of embedded whitespace in string data may
    cause inaccuracies (e.g., "cn: Jane    A.    Doe" usually gets
    normalized to "cn: JANE A. DOE"). It is expected that these
    inaccuracies will be small.
  - presence of binary attributes. Binary attribute values are not
    stored in the DIR_SEARCH table. However, they may account for a
    considerably large percentage of the LDIF data (e.g., large
    .gif attributes) and may cause the space estimate for the
    DIR_SEARCH tablespace to be too large.

To help speed the processing of the estimator, a "sample_percent"
processing option is provided to reduce the amount of data it
actually examines closely while making its estimate. In addition,
when multi-valued attributes are not widely used, the checking for
them can be disabled to help speed processing.

Running the Estimator:
---------------------
  1. First, you may wish to make a copy of the tool into the HFS
     directory where your LDIF data resides. The tool is an awk
     script which runs under the OS/390 Unix System Services. The
     tool can potentially be run under other Unix platforms as
     well, since it is written in awk.

  2. Edit the working copy of the tool (ldif2tdbm.est.awk) and
     update the customizable values in the script according to:

       - run options for the tool
       - values you set earlier in your Directory Server
         configuration file
       - values in your SPUFI input file (when you created the TDBM
         databases, tablespaces, tables, and indexes)
       - characteristics of your LDIF file data
       - characteristics of the DASD where the files will be
         allocated

  3. Invoke the awk script from the Unix System Services command
     line:

       awk -f ldif2tdbm.est.awk ldif_file_name > estimator.output.report

     Where:
       "ldif_file_name" is the file name of the LDIF file you
       intend to load into your Directory Server's TDBM database.
       You can specify multiple files here (separated by blanks),
       as the awk command accepts multiple input files.
     And:
       "estimator.output.report" is a file name where you wish to
       capture the estimator's output report. The output is
       actually written to "stdout", but the ">" redirects the
       output to a file where it can be captured.

A sample report looks as follows:

    ldif2tdbm.est.awk: TDBM load dataset space estimator
    ... started on Fri Sep 1 17:59:19 EDT 2000

    Input parameters:
      Sampling percent         : 100
      Check multi-valued attrs : FALSE
      admin DN                 : cn=root,o=ibm,c=us
      admin DN length          : 18
      AttrOverflowSize         : 255
      DIR_ENTRY page size      : 4096
      DIR_LONGATTR page size   : 4096
      DIR_LONGENTRY page size  : 4096
      DIR_SEARCH page size     : 4096
      DN_TRUNC column size     : 32
      DN column size           : 512
      SEARCH_VALUE column size : 32
      Average entry depth      : 3.000000
      objectclass top present? : FALSE
      avg length structural oc : 20
      DIR_DESC BLKSIZE         : 27998   BLKS/TRK : 2
      DIR_ENTRY BLKSIZE        : 27998   BLKS/TRK : 2
      DIR_LONGATTR BLKSIZE     : 27998   BLKS/TRK : 2
      DIR_LONGENTRY BLKSIZE    : 27998   BLKS/TRK : 2
      DIR_SEARCH BLKSIZE       : 27998   BLKS/TRK : 2
      Tracks per Cylinder      : 15

    processing ...
    10000 entries processed ...

    Number entries: 10596
    Number samples: 10596
    Sampling percent: 100.0000

    Load DS       Records      Bytes     TRKs     CYLs
    ---------- ---------- ---------- -------- --------
    DIR_DESC        31788     953640    19.31     1.29
    DIR_ENTRY       10596   11716004   211.92    14.13
    DIR_LATTR           0          0     0.00     0.00
    DIR_LENTRY          0          0     0.00     0.00
    DIR_SEARCH     235263   10589961   206.01    13.73

    ldif2tdbm.est.awk: TDBM load dataset space estimator
    ... ended on Fri Sep 1 17:59:40 EDT 2000

Performance Considerations:
--------------------------
As previously mentioned, using a smaller "sample_percent" value
(e.g., 10) in the estimator can speed up processing, sometimes
reducing elapsed time by as much as 50%. This is useful if you have
an extremely large amount of data to load (megabytes or gigabytes)
and the characteristics of the data are very consistent throughout
the LDIF file.

When loading extremely large amounts of data, you may also wish to
split the LDIF input data into multiple files. Again, if the data is
extremely consistent, you might want to process only one of the
files, and calculate the load dataset sizes for the others yourself.
You can make the estimates proportional based on the LDIF file sizes
in bytes, provided the data is consistent (i.e., most entries look
alike with the same attributes present, but slightly different
values). Examples of highly consistent data might be phone book
entries with each entry containing a dn (distinguishedName), cn
(commonName), sn (surname), objectClass of person, and a
telephoneNumber. Perhaps you have two LDIF files, one with surnames
starting with the letters A-M which is 22 megabytes, the other with
surnames starting with N-Z which is 23.2 megabytes in size. You
could process the A-M file with the estimator, then calculate the
N-Z file estimates by multiplying the A-M values by 23.2/22
(approximately 1.05).

Finally, if you do not need to check for multi-valued attributes,
either because they are not used or because they are not prevalent
in your LDIF files, setting "check_multi_valued=0" will speed
processing of the estimator.
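
The proportional calculation described earlier in this section (the
A-M and N-Z phone book files) can be sketched as a small shell
function that calls awk. This is only an illustration: the function
name "scale_estimate" and its argument order are hypothetical and
are not part of ldif2tdbm.est.awk, and the TRKs value used in the
example is borrowed from the sample report purely as sample input.

```shell
#!/bin/sh
# Hypothetical helper (not part of ldif2tdbm.est.awk): scale one value
# from an estimator report in proportion to the LDIF file sizes.
#   $1 = size of the LDIF file that was run through the estimator
#   $2 = size of the LDIF file to be estimated (same unit as $1)
#   $3 = a Records, Bytes, TRKs, or CYLs value from the estimator report
scale_estimate() {
    awk -v sampled="$1" -v target="$2" -v value="$3" \
        'BEGIN { printf "%.2f\n", value * target / sampled }'
}

# Illustration: the A-M file (22 MB) was processed by the estimator and
# the N-Z file is 23.2 MB. 211.92 is the DIR_ENTRY TRKs figure from the
# sample report, used here only as an example input.
scale_estimate 22 23.2 211.92    # prints 223.48
```

Because datasets are allocated in whole tracks or cylinders, round
the scaled value up before allocating. The result is only as
reliable as the consistency of the data between the two files.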