PureData System for Analytics (Netezza): Maintenance tasks for optimal performance
For my PDA appliance what regular maintenance should I be carrying out to maintain optimal performance?
The PureData for Analytics appliance tries to keep things as simple as possible for users and administrators alike, but there are a few regular housekeeping jobs that are key for good, sustained performance on your system. In this technote we will briefly outline what they are and the impact they can have. Many of these are covered in more detail in presentations available on the DataWorks Enzee forum which we will link to.
1. Statistics
The PDA appliance has a cost-based optimizer, as do most relational database systems. The optimizer makes choices about how to execute queries based on the information it has about tables - the table statistics.
Some statistics are generated automatically as data flows into the system: minimum/maximum values are captured for numeric and temporal datatypes regardless of the load method, and basic statistics are generated when a CTAS operation is carried out on a significant volume of data (by default more than 10,000 rows, as set by the ctas_autostats_min_rows parameter). Beyond that, the onus of collecting statistics is on the application or DBA team.
There are a number of different types of statistics - you can read more about them, how they are collected and how accurate they are in this document:
Rule number 1: collect statistics! If you find you have slow-running queries, the first step is to review the underlying tables and ensure that statistics are up to date.
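As a brief sketch, statistics are collected with the GENERATE STATISTICS command; the table and column names below are illustrative, not from any particular system:

```sql
-- Collect full statistics on every column of a (hypothetical) sales table
GENERATE STATISTICS ON sales;

-- Or restrict collection to the columns used in joins and filter predicates
GENERATE STATISTICS ON sales(customer_id, order_date);

-- EXPRESS statistics are a lighter-weight alternative for very large tables,
-- trading some accuracy (estimated values) for a faster collection
GENERATE EXPRESS STATISTICS ON sales;
```

Collecting statistics after each significant load or change to the data keeps the optimizer's cost estimates realistic.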
2. Groom for reclaim
The groom command is something of a Swiss-army knife - it carries out several maintenance tasks on tables. The first of these is to physically remove logically deleted rows from tables. As tables are inserted into and deleted from over time, these logical rows still take up space, their pages may still need to be scanned, and this can slow processing down. You can read more about this in some detail in this blog post:
Rule number 2: Groom tables that are updated/deleted from regularly to ensure that they are not burdened with wasted space.
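A minimal illustration, assuming a hypothetical sales table that receives regular updates and deletes:

```sql
-- Physically remove logically deleted rows and reclaim their space
GROOM TABLE sales RECORDS ALL;
```

If you rely on incremental backups, note that the RECORDS READY variant exists to avoid reclaiming rows that a backup set still needs; check which variant is appropriate for your backup strategy.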
3. Groom for organized tables
The second thing that groom does for you is to take organized tables and order the data in them. Organized or clustered tables are tables that have had an ORGANIZE ON clause specified. This tells the system that you have an ideal order you'd like the data to be stored in, based on between one and four columns. Just defining this does nothing - it's only metadata. However, when you run a groom, the system takes note of it and reorders your data on disk, optimizing your zonemaps for query performance.
We talk in some detail about zonemaps (page 22), clustered tables (page 26) and groom (page 42) in this best practices presentation:
Rule number 3: If you have clustered tables, regularly organize the row order with groom to ensure that they are optimized for your queries.
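The two steps - declaring the order and then acting on it - might look like this sketch (the table and column names are hypothetical):

```sql
-- Declare the preferred row order - metadata only, nothing moves yet
ALTER TABLE sales ORGANIZE ON (order_date, customer_id);

-- Reorder the rows on disk and refresh the zone maps
GROOM TABLE sales RECORDS ALL;
```

Choosing ORGANIZE ON columns that match your most common range predicates is what lets the zonemaps skip the largest number of disk extents.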
4. Groom for versions
The final job of groom that we will discuss relates to table versions. When you ALTER a table to ADD or DROP columns, behind the scenes we keep the original table and create a new version of the table with the new layout. As data is added to the table or updated, the new rows are stored in the new table version. As you query the table, the system automatically carries out a UNION ALL join between the table versions. For very large tables this means we can end up having to process a UNION ALL join over significant data volumes, which is far less efficient than querying a single table.
In order to deal with this there is a special invocation of groom - GROOM TABLE VERSIONS. This consolidates all the table data into the new format. It is advisable to do this as soon as possible after altering a table to avoid any performance implications.
Rule number 4: Always run GROOM TABLE VERSIONS on tables as soon as possible after they have been altered.
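The sequence can be sketched as follows, again with a hypothetical table and column:

```sql
-- The ALTER creates a second table version behind the scenes
ALTER TABLE sales ADD COLUMN region VARCHAR(32);

-- Consolidate all versions back into a single table layout
GROOM TABLE sales VERSIONS;
```

Running the groom immediately after the ALTER keeps the window in which queries pay the UNION ALL cost as short as possible.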
5. System catalog size and reindexing
The PDA appliance stores user data on disks that are managed by the S-Blades. The system catalog data, including users, passwords, access rights and all other DDL, is stored on the host system in the /nz filesystem. Unlike the user data areas, the system catalog is optimized for well-defined queries returning a small set of rows based on key values. If you consider the kinds of catalog tasks that happen implicitly all the time - looking up access rights, column definitions and the like - you can quickly see why this is the case.
Given these requirements, it's no surprise to learn that in the system catalog we make significant use of indexing to return the data we need quickly. As we add and remove objects from the catalog (for example by creating and dropping tables, views, external tables and any other permanent object type) we write to and delete from these system catalog tables... and their indexes.
When objects are dropped, we delete from the catalog tables and a background vacuum process runs to physically delete these rows - you can see these vacuum processes in the pg.log file. This happens automatically and needs no management. However what we do not do is delete from the indexes on these tables, and over time these indexes can get bloated with details about deleted rows and their performance will suffer.
You can monitor the size of your catalog with the nz_catalog_size script on the PDA host - this will indicate how much space is currently used. It is very difficult to define a rule around how big is 'too big', as this depends somewhat on the number of objects in each database and your performance requirements. The nz_catalog_size script does provide an indication of whether a reindex may be of benefit - the 'reindex ?' flag is set to 'yes' when the index size exceeds 250% of the base table size, as in this example:

    | /nz/data TOTAL directory size: 3.7G
    + CORE files: 2.6G

    Database (sorted by name)    Size    reindex ?    Directory Path
    Database (sorted by size)    Size    reindex ?    Directory Path
If you determine a reindex is necessary, you can do this in one of two ways:

1. nz_manual_vacuum. This script requires the system to be offline and will reindex (and vacuum) all database areas in the system, after first taking a catalog backup.
2. nz_online_vacuum. Recent versions of NPS (7.2 and up) add the capability to reindex databases with the system online. Each database you want to reindex must have no connections when the reindex starts, and will be unavailable for the duration of the reindex operation. Because of this requirement it is often easier to schedule downtime and carry out an offline reindex.
In addition to these two manual processes, every time you restart the database service with nzstart, the system reviews catalog index sizes and, if required, reindexes them on startup.
Rule number 5: Monitor your catalog sizes. If required, reindex them for optimal performance.
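A sketch of the monitoring and offline-reindex workflow on the host, assuming the support toolkit scripts are on the PATH (locations may differ on your appliance):

```shell
# Report catalog sizes and the 'reindex ?' flag for each database
nz_catalog_size

# If a reindex is indicated: stop the system, reindex and vacuum offline, restart
nzstop
nz_manual_vacuum
nzstart
```

Scheduling this during a planned maintenance window avoids the per-database connection restrictions of the online approach.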