IBM Support

Frequency Based Bucketing (regular and dynamic)

Question & Answer


Question

How is Dynamic Frequency Based Bucketing (DFBB) different from regular Frequency Based Bucketing (FBB)? How can I set them up and how do they work?

Cause

Regular FBB has been a standard feature in MDS for a number of releases. It marks large buckets (buckets associated with a large number of member records) as anonymous. To configure it, open the algorithm, select a Bucketing group, and set "Maximum bucket size" in the properties view. Whenever a bucket grows beyond the maximum bucket size, it is no longer used. This improves performance (fewer comparisons in real time and smaller data sets due to smaller buckets). However, regular FBB does not delete large buckets, so very large buckets can remain in the database and be loaded into memory even though they are ignored for deriving, comparing, and scoring.

Dynamic Frequency Based Bucketing was introduced in a patch on v-9.7 to delete the large buckets that were previously only ignored. To enable it, set the following Workbench setting to a value greater than zero: Member Types->Entity Types->"Maximum candidate count for dynamic frequencies". Buckets larger than this value are marked for deletion by the task manager (discussed later).
Note: If your deployment uses synchronous entity management, then dynamic frequency based bucketing (DFBB) is not possible. For more information about synchronous and asynchronous entity management, see the entity manager documentation, for example for version 12.0: https://www.ibm.com/docs/en/imdm/12.0?topic=configuration-using-entity-managers.

Bucket manipulations in both regular FBB and DFBB are driven by the mpi_strfreq table. Regular FBB uses entries in this table as a list of bucket values to ignore and DFBB triggers processes based on entries in this table, which result in deletion of those buckets. This table is populated in the following two ways:

1. By running the mpxfreq utility in -fbb mode and then loading the results into the database. This utility loops through all the data in the database and checks each bucket against the maximum count defined in the workbench. If it finds any buckets with more members than the specified maximum count, it adds them to the mpi_strfreq table.
2. In real time by the entity manager. Whenever the entity manager encounters a bucket with more members than the specified maximum count, it adds that bucket to the mpi_strfreq table.
Note: DFBB is designed to delete entries during runtime, so method #1 (using mpxfreq to populate the mpi_strfreq table) is not recommended.

Note: With mpxfreq, the maudrecno may always be 1 when an entry is added, and DFBB may not delete buckets whose maudrecno is 1 (the reason is explained later, in the description of how DFBB works). To delete these buckets, use the workaround in step 2.2 of the Answer section further down.

When the table is populated, the regular FBB part is simple; it keeps the contents of the table in memory and if it encounters any bucket with string value and code identical to a row in the table, it ignores that bucket and its contents.
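
The in-memory lookup described above can be pictured with a minimal Python sketch. It is an illustration only, not the engine's implementation: the function names, the use of a set of (string value, code) pairs, and the sample values are all assumptions.

```python
# Illustrative sketch only; this is not the MDS engine's code.
# Hypothetical in-memory copy of mpi_strfreq: (string value, code) pairs.
frequent_buckets = {("263199065", 3), ("418388869", 3)}


def is_ignored(str_val, code, frequent=frequent_buckets):
    """Return True when the bucket matches a row loaded from mpi_strfreq."""
    return (str_val, code) in frequent


def usable_buckets(candidate_buckets, frequent=frequent_buckets):
    """Keep only the buckets that regular FBB would still use."""
    return [b for b in candidate_buckets if not is_ignored(b[0], b[1], frequent)]


buckets = [("263199065", 3), ("SMITH", 1), ("418388869", 3)]
print(usable_buckets(buckets))  # only ("SMITH", 1) remains
```

Note that the listed buckets are only skipped here; nothing is deleted, which matches the regular FBB behavior described above.
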
For DFBB, the following occurs:

1. When the engine starts, the task manager records the highest maudrecno value in the mpi_strfreq table.
2. It periodically polls the mpi_strfreq table to see whether an entry was added with a higher maudrecno than the one recorded in step #1. If it doesn't find any, it keeps polling. If it finds such entries, it moves to step #3 before it starts polling again.
3. For each entry with a higher maudrecno, it deletes the contents of the bucket corresponding to that entry and then updates its internal highest maudrecno to the value in the newly added entry.
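
The three steps above can be sketched as a small polling loop. This is a simplified model under stated assumptions, not the task manager's code: the class name FbbPoller, the dictionary keys strval and code, and the list used to record deletions are hypothetical; only maudrecno is taken from the text.

```python
# Illustrative sketch only; this is not the task manager's code.
class FbbPoller:
    def __init__(self, strfreq_rows):
        # Step 1: remember the highest maudrecno present at startup.
        self.high_water = max((r["maudrecno"] for r in strfreq_rows), default=0)
        self.deleted = []  # stands in for the real bucket deletion

    def poll_once(self, strfreq_rows):
        # Step 2: look for rows added with a higher maudrecno.
        new_rows = sorted(
            (r for r in strfreq_rows if r["maudrecno"] > self.high_water),
            key=lambda r: r["maudrecno"],
        )
        for row in new_rows:
            # Step 3: delete the bucket contents, then advance the mark.
            self.deleted.append((row["strval"], row["code"]))
            self.high_water = row["maudrecno"]
        return len(new_rows)


# Row "A" is already present at startup (for example, loaded via mpxfreq).
rows = [{"strval": "A", "code": 3, "maudrecno": 1}]
poller = FbbPoller(rows)
# Row "B" is added at run time with a higher maudrecno.
rows.append({"strval": "B", "code": 3, "maudrecno": 7})
poller.poll_once(rows)
print(poller.deleted, poller.high_water)
```

In this model, row "A" is never deleted because its maudrecno is not higher than the value recorded at startup, which is the situation the mpxfreq note describes.
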

After deletion, if a member put transaction occurs with the same string value and code combination as a bucket already in the mpi_strfreq table, the engine does not create that bucket and the mpi_strfreq table is not touched.

Answer

Along with the workbench setting ("Maximum candidate count for dynamic frequencies" mentioned previously for DFBB), the task manager also needs to be configured for FBB and DFBB to work. To do so, update the following files:

1. For versions 10.1 and older, check your com.initiate.server.features.cfg engine configuration file. For versions 11.0 and newer, ignore this step.
The feature list for the engine needs to contain a reference to the task manager so it will start with the engine.
Example:
featuresBoot=ldap-server,event-manager,net-listener,task-manager

2. For versions 10.1 and older, check your com.initiate.server.task.cfg engine configuration file. For versions 11.0 and newer, create or modify your com.ibm.mdm.mds.task.manager.cfg engine configuration file located at <WAS_INSTALL_HOME>/AppServer/profiles/<YOUR_APP_SERVER>/InstalledApps/<CELL_NAME>/MDM-native-<INSTANCE_ID>.ear/native.war/conf/.
The file should contain at least the following parameters (values may vary):
pollInterval=60
fbbrebkt.enabled=true
fbbrebkt.invokeInterval=60
fbbrebkt.maudrecno=0
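
As a quick sanity check, a short script can confirm that a task manager configuration file contains the four parameters listed above. The following is a minimal sketch, assuming a plain key=value file format; the helper names parse_cfg and missing_keys are hypothetical and are not part of any IBM tooling.

```python
# Illustrative sketch only; the helper names are hypothetical.
REQUIRED_KEYS = (
    "pollInterval",
    "fbbrebkt.enabled",
    "fbbrebkt.invokeInterval",
    "fbbrebkt.maudrecno",
)


def parse_cfg(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def missing_keys(settings):
    """Return the required parameters that are absent from the file."""
    return [k for k in REQUIRED_KEYS if k not in settings]


sample = """\
pollInterval=60
fbbrebkt.enabled=true
fbbrebkt.invokeInterval=60
"""
print(missing_keys(parse_cfg(sample)))  # fbbrebkt.maudrecno is missing
```
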


After this is done, the entity manager starts adding buckets to the mpi_strfreq table and the task manager starts deleting them.

If DFBB is not working as expected, then there are two possible points of failure:

1. Entries are not being added to the mpi_strfreq table:
If this is the case, save a member (with or without an update) that is part of a large bucket, and check the entity manager logs for a descriptive error message or other leads on why the entry was not added.
Whenever the entity manager encounters a bucket with more candidates than the maximum allowed in workbench, it should add that bucket to the table (if it isn't already there).

2. Entries are added in the mpi_strfreq table but the corresponding buckets are not deleted:
For this case, the relevant logging is done by the task manager. These messages appear in the engine logs unless the task manager is set up as a stand-alone JVM, in which case it writes a separate log file. Check for descriptive error messages or other leads. Keep in mind that deletion is driven by the maudrecno. For example, if mpxfreq updated the mpi_strfreq table while the task manager was not running, the corresponding buckets are not deleted: when the engine started, it recorded the highest maudrecno already in the table, so those earlier entries are never treated as new.

In such an event, you may consider using the following workaround:

2.1. Start the task manager (started with the engine if embedded).
2.2. Execute an SQL statement that sets the maudrecno of the entries whose buckets have not yet been deleted to the highest maudrecno in the mpi_strfreq table plus one. This makes the task manager treat them as new entries and delete the corresponding buckets. This may also be used as a test to ensure that the task manager is working.

Please ensure that you do not modify the caudrecno because caudrecnos have corresponding entries in other tables like audhead and manipulating them may lead to complications later.
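
The workaround in step 2.2 can be illustrated with a small, self-contained Python sketch that runs against an in-memory SQLite table rather than a real MDS database. The table here is a simplified stand-in: only the maudrecno and caudrecno column names come from this document, while strval, bktrole, and the helper name bump_maudrecno are assumptions. Check the actual mpi_strfreq schema of your release before adapting the SQL.

```python
import sqlite3

# Illustrative sketch only: a simplified, hypothetical stand-in for
# mpi_strfreq. Only maudrecno and caudrecno are names from the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mpi_strfreq ("
    "strval TEXT, bktrole INTEGER, caudrecno INTEGER, maudrecno INTEGER)"
)
conn.executemany(
    "INSERT INTO mpi_strfreq VALUES (?, ?, ?, ?)",
    [("263199065", 3, 10, 1), ("418388869", 3, 11, 1), ("555000111", 3, 12, 42)],
)


def bump_maudrecno(conn, strval, bktrole):
    """Set maudrecno to (current highest + 1) for one undeleted bucket."""
    (highest,) = conn.execute("SELECT MAX(maudrecno) FROM mpi_strfreq").fetchone()
    # Only maudrecno is modified; caudrecno is deliberately left alone.
    conn.execute(
        "UPDATE mpi_strfreq SET maudrecno = ? WHERE strval = ? AND bktrole = ?",
        (highest + 1, strval, bktrole),
    )
    conn.commit()
    return highest + 1


new_value = bump_maudrecno(conn, "263199065", 3)
row = conn.execute(
    "SELECT caudrecno, maudrecno FROM mpi_strfreq WHERE strval = ?",
    ("263199065",),
).fetchone()
print(new_value, row)
```

The highest value is read first and then passed as a parameter, so the update does not depend on how a particular database evaluates a subquery against the table being updated.
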

For reference, the following are the expected log entries during FBB:

1. The following log entry shows the task manager re-bucketing in the engine:
INFO stems.hub.engine.tskmgr.FBBReBkt: Identified new frequency strings: strVal='263199065' bktHash=3617065758672475907 bktRole=3.
5 members were affected.


2. The following entry shows that the "Maximum candidate count for DFBB" (4 in this case) is exceeded, and some rows may be inserted into the mpi_strfreq table:
AUDIT iatesystems.hub.logging.AuditLog: MPI_CtxEntManage: Candidate count 5 exceeds threshold 4, checking for frequent buckets.

3. The following entry implies that a new member matches a record that exists in mpi_strfreq, so the bucket is not created (this entry requires the algorithm log to be enabled):
ALGO MPI_DvdInfo_ChkFreq: bktRole=3, bktVal='418388869', ignored, freq=6, max=4.
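
When reviewing large log files, the sample entries above can be picked out with a few regular expressions. The following is a minimal sketch based only on the three sample lines shown here; the function name classify and the returned labels are hypothetical, and the exact message formats may differ between releases.

```python
import re

# Patterns derived only from the sample log lines in this document.
THRESHOLD_RE = re.compile(r"Candidate count (\d+) exceeds threshold (\d+)")
IGNORED_RE = re.compile(
    r"bktRole=(\d+), bktVal='([^']*)', ignored, freq=(\d+), max=(\d+)"
)


def classify(line):
    """Label a log line as one of the three FBB events, or return None."""
    m = THRESHOLD_RE.search(line)
    if m:
        return ("threshold_exceeded", int(m.group(1)), int(m.group(2)))
    m = IGNORED_RE.search(line)
    if m:
        return ("bucket_ignored", m.group(2), int(m.group(3)), int(m.group(4)))
    if "FBBReBkt: Identified new frequency strings" in line:
        return ("rebucketed",)
    return None


audit_line = (
    "AUDIT iatesystems.hub.logging.AuditLog: MPI_CtxEntManage: "
    "Candidate count 5 exceeds threshold 4, checking for frequent buckets."
)
algo_line = "ALGO MPI_DvdInfo_ChkFreq: bktRole=3, bktVal='418388869', ignored, freq=6, max=4."
print(classify(audit_line))
print(classify(algo_line))
```
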

For more information about bucketing, see the Introduction to buckets for IBM Infosphere Master Data Service (Initiate) webcast.


Product Synonym

MDS;Master Data Service;MDM;MDMSE;MDM SE;Master Data Management;IBM Infosphere Master Data Service;MDM Standard Edition;MDM Hybrid Edition;Initiate

Document Information

Modified date:
17 May 2021

UID

swg21655423