Recommendations to improve performance of Column Analysis with IBM InfoSphere Information Analyzer

Question & Answer

Question

How can I improve the performance of Column Analysis in IBM InfoSphere Information Analyzer, in particular when analyzing very large tables?

Answer

Here is a list of things to consider:

Increase the number of nodes in the configuration file. Adding more nodes increases the degree of parallelism, which in turn may reduce the total processing time. To learn more about this file and the considerations for editing it, refer to the Parallel Job Developer Guide.
Use Data Sampling. If you do not need to analyze the complete list of values of a column, Data Sampling is a good alternative. It generates results based on a fraction of your data, and analyzing a sample takes significantly less time than analyzing the full table.
Break down the analysis by columns. Select a small group of columns (for example, 8 columns) and start the analysis. Once it has finished and you have a better idea of how long it takes to analyze that number of columns, create additional groups of columns and schedule them so that they do not all run at the same time. This approach has several advantages: it gives you a better picture of the progress; it lets you see the results for your table as they become available, so you do not have to wait for the entire table to be analyzed; and if a problem occurs, it is easier to isolate because the process now has a smaller scope.
Check the network latency. Measure the latency between the IA Engine and the source database, and also between the IA Engine and the IADB repository. The latency with the IADB repository should always be very low, because this database should be located in the same network as the IA Engine (commonly on the same machine that contains xmeta). If the latency with the source database is high, consider using Data Sampling or copying the source data to a database in the local network. See tech note 1515972 to learn more about network latency.
Check the performance of the databases involved outside of IA. Column Analysis is a process that runs queries against a source database to extract data and also against the Information Analyzer (IADB) repository to insert data. Check with your DBA that these databases are not being overloaded with other tasks or applications when you are running a Column Analysis.
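
Two of the recommendations above, grouping columns and checking network latency, can be sketched with a short Python script. This is an illustrative sketch only and is not part of Information Analyzer: the function names `column_batches` and `tcp_connect_latency_ms` are hypothetical, and the time needed to open a TCP connection is used here only as a rough proxy for network latency, not as a measure of database response time.

```python
import socket
import time
from typing import Iterator, List, Sequence


def column_batches(columns: Sequence[str], batch_size: int = 8) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size column names.

    Hypothetical helper: each group could be submitted as a separate
    Column Analysis run, scheduled so the runs do not overlap.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(columns), batch_size):
        yield list(columns[start:start + batch_size])


def tcp_connect_latency_ms(
    host: str, port: int, samples: int = 5, timeout: float = 3.0
) -> List[float]:
    """Return the time taken to open a TCP connection, in milliseconds.

    Hypothetical helper: run it on the IA Engine machine against the
    listener port of the source database and of the IADB repository.
    Raises OSError if the host cannot be reached.
    """
    timings = []
    for _ in range(samples):
        started = time.perf_counter()
        # Open the connection and close it again immediately.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        timings.append((time.perf_counter() - started) * 1000.0)
    return timings
```

For example, `list(column_batches(all_columns, 8))` splits a table's column names into groups of up to 8, and `tcp_connect_latency_ms(host, port)` can be called once with the host and port of the source database and once with those of the IADB repository, so the two sets of timings can be compared. The host names and ports are whatever your environment uses; none are assumed here.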

Applies to: InfoSphere Information Analyzer versions 8.0, 8.0.1, 8.1, and 8.1.1 on AIX, HP-UX, Linux, Solaris, and Windows.
