Technical Preview: Machine learning assisted data stewardship

Abstract

This article describes the configuration steps to use the technical preview version of the InfoSphere MDM machine learning assisted data stewardship feature. This technical preview is available from IBM Fix Central alongside the InfoSphere MDM Version 11.6.0.8 product release.

Note: Technical previews are not yet fully supported product features. Use this preview to evaluate, test, and provide feedback to IBM about this feature.

The technical preview version of the machine learning assisted data stewardship feature is only available as part of the Consent Management user interface.

Content

Technical Preview: Using machine learning to improve data stewardship

The InfoSphere MDM machine learning assisted data stewardship capability can help guide your organization's data stewardship decisions. Data stewardship is a critical part of any MDM system, and some suspect duplicate processing decisions can only be made by real people. By using machine learning, your organization can greatly improve the efficiency, accuracy, and automation of suspect duplicate processing decisions.

Important: This feature is provided alongside InfoSphere MDM V11.6.0.8 as a technical preview only. The intention is for this capability to be included as a full InfoSphere MDM product feature in a future release. Use this preview to evaluate, test, and provide feedback to IBM about this feature.

Machine learning assisted data stewardship improves over time by tracking and learning from the decisions made by your data stewards. You can periodically retrain the machine learning model, enabling it to make smarter, better-informed suggestions.

System requirements

  • MDM 11.6.0.8 Standard Edition with the virtual MDM RESTful services installed
  • Docker v18

If InfoSphere MDM is deployed on Microsoft Windows, Cygwin must be installed to execute the extraction script. This script needs to be executed whenever the machine learning model is trained or retrained. Afterwards, Cygwin is not required to use the trained model in machine learning assisted data stewardship.

Downloading the MDM Machine Learning (mdm-ml) Docker image from IBM Passport Advantage

To acquire the InfoSphere MDM Docker assets, you must download them from IBM Passport Advantage. You will then be able to use a script to acquire the MDM Machine Learning Docker image from the IBM Docker registry.

Before you start, verify that you have an IBM ID, and that the IBM Passport Advantage primary contact at your company has granted your IBM ID permission to access and download IBM InfoSphere Master Data Management. If you do not have an IBM ID, you can create one at the Passport Advantage site.

  1. Locate and download the MDM Docker artifacts from IBM Passport Advantage.
    Note: For more information, see the InfoSphere MDM Knowledge Center topic "Downloading MDM Docker assets".
  2. Unpack each of the parts into a Docker working directory on your host machine (such as /mdm).
  3. Open the MDM_<version>_DKR_COMPOSE.tar.gz package file and extract mdmdocker.sh and mdm-ml.yml.

Configuring the Consent Management UI to use machine learning assisted data stewardship

To enable machine learning assisted data stewardship for suspect duplicate processing in the Consent Management UI, you must change the UI configuration. The general procedure for changing the configuration is described in the InfoSphere MDM Knowledge Center.

In the configuration JSON file, add "tasksEnabled": true to each entity type for which you want to enable suspect duplicate processing.

The Consent Management UI allows you to filter tasks by owner. This filter option can be configured by adding a list of taskUsers and taskGroups to the configuration. For example:

{
    "physicalEnabled": true,
    "virtualEnabled": true,
    "externalEnabled": true,
    ...
    "virtual": {
        "appName": "ConsentUI",
        ...
        "taskUsers": [
            "datasteward1",
            "datasteward2"
        ],
        "taskGroups": [
            "stewardGroup1"
        ],
        "entityTypes": [
            {
                "entityTypeName": "id",
                "tasksEnabled": true,
                ...
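
The same change can be scripted. The following is a minimal Python sketch, assuming the configuration has the shape shown in the excerpt above (a top-level "virtual" object that contains an "entityTypes" list). The function name enable_tasks and the trimmed-down sample dictionary are hypothetical and are not part of the product.

```python
import json


def enable_tasks(config, entity_type_name):
    """Set "tasksEnabled": true on the matching virtual entity type.

    Assumes the layout shown in the excerpt: config["virtual"]["entityTypes"]
    is a list of objects that carry an "entityTypeName" key.
    Returns True if at least one entry was updated.
    """
    updated = False
    for entity_type in config.get("virtual", {}).get("entityTypes", []):
        if entity_type.get("entityTypeName") == entity_type_name:
            entity_type["tasksEnabled"] = True
            updated = True
    return updated


# Hypothetical, trimmed-down configuration used only for illustration.
sample = {
    "virtualEnabled": True,
    "virtual": {
        "appName": "ConsentUI",
        "taskUsers": ["datasteward1", "datasteward2"],
        "taskGroups": ["stewardGroup1"],
        "entityTypes": [{"entityTypeName": "id"}],
    },
}

if enable_tasks(sample, "id"):
    print(json.dumps(sample["virtual"]["entityTypes"], indent=4))
```

To apply this to a real file, load it with json.load, call the function, and write the result back with json.dump. Keep a copy of the original configuration file first.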

Note: Without deploying the machine learning Docker service, you cannot benefit from machine learning predictions. However, the user interface will still function and will show n/a as the Linking Suggestion instead of the machine learning derived suggestion. Continue to the next section for details about configuring and deploying the Docker service.

Configuring the machine learning Docker service

The machine learning Docker service can be deployed on any machine that can be reached by the InfoSphere MDM operational server. This machine must have Docker installed. For performance reasons, the Docker service should be close to the InfoSphere MDM server. It is possible to run both on the same machine.

The service stores the machine learning models in a folder known as the shared-folder.
Optionally, you can edit mdm-ml.yml to change the host path of the MDM-ML exchange folder so that it matches your shared-folder location:
  1. Open mdm-ml.yml in an editor.
  2. At the bottom of the file, locate the following section:
    volumes:
    - /tmp/MLPredict:/MLPredict
  3. Change the host path /tmp/MLPredict (the part before the colon) to your shared-folder location. The part after the colon, /MLPredict, is the path inside the container that the later examples in this article use.
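
If you prefer to script this edit, the following Python sketch rewrites only the host side of the volume mapping. It assumes the volume entry has the exact form shown above; the function name set_shared_folder and the host path /data/mdm-ml are hypothetical.

```python
def set_shared_folder(compose_text, host_path, container_path="/MLPredict"):
    """Return compose_text with the host side of the volume mapping replaced.

    Only list entries that end with ":<container_path>" are touched, and the
    original indentation of the entry is preserved.
    """
    lines = []
    for line in compose_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("- ") and stripped.endswith(":" + container_path):
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}- {host_path}:{container_path}"
        lines.append(line)
    return "\n".join(lines)


# The section from mdm-ml.yml as quoted in this article.
snippet = "    volumes:\n    - /tmp/MLPredict:/MLPredict"
print(set_shared_folder(snippet, "/data/mdm-ml"))
```

A plain text replacement is used instead of a YAML parser so that the rest of the file, including comments and ordering, stays byte-for-byte unchanged.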

Starting the machine learning Docker service

Use the mdmdocker.sh script to start the machine learning Docker service:
  1. Review the text of the license agreement located in ./license/LALI.text.
  2. If you agree to the license terms, run the mdmdocker.sh script with the -acceptlicense argument. You will not be able to continue if you do not accept the license.
       $ ./mdmdocker.sh -acceptlicense
  3. The script logs in to the IBM Docker registry and displays a list of MDM related Docker Compose (YAML) files that you can either run directly or run using a custom script.
  4. Select option [9] Machine Learning Assisted Data Stewardship [Technical Preview] to run the associated machine learning Docker Compose file (mdm-ml).

After you select the Machine Learning option, the script automatically:

  • Downloads the required MDM Machine Learning Docker image from the IBM Docker registry.
  • Creates the machine learning container.
  • Starts the container.

Making the Docker container known to InfoSphere MDM

To allow InfoSphere MDM to use the machine learning Docker service, add the URL of the service to the madappsvcs.properties file. In a property called serviceMetaData.predictServiceURL, specify the full URL, including the IP address or hostname, the port, and the path to the prediction service.

For example:

serviceMetaData.predictServiceURL=http://localhost:8080/v1/predict

Note: If you have mapped the Docker internal port 8080 to a port other than 8080, change the port in the above example accordingly.
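
As an illustration, the property line can be generated and merged into existing properties text with a few lines of Python. This is a sketch under simple assumptions: the file uses plain key=value lines without spaces around the equals sign, and the helper names and the host name mlhost used in the usage note are hypothetical.

```python
KEY = "serviceMetaData.predictServiceURL"


def predict_service_property(host, port=8080):
    """Build the property line that points MDM at the prediction endpoint."""
    return f"{KEY}=http://{host}:{port}/v1/predict"


def upsert_predict_url(properties_text, host, port=8080):
    """Replace an existing predictServiceURL line, or append one if absent."""
    new_line = predict_service_property(host, port)
    lines = properties_text.splitlines()
    for index, line in enumerate(lines):
        if line.strip().startswith(KEY + "="):
            lines[index] = new_line
            break
    else:
        # No existing entry was found, so add the property at the end.
        lines.append(new_line)
    return "\n".join(lines) + "\n"


print(predict_service_property("localhost"))
```

For a service published on a different host or port, a call such as upsert_predict_url(text, "mlhost", 9090) produces the corresponding line.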

Determining the machine learning Docker container name

The Docker container name is automatically generated. To determine the container name:

  1. Use the docker ps command to list all active containers.
  2. Look for a container whose name begins with mdmdocker_mdm-ml_ and ends with a number.
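
The lookup can be automated. The sketch below filters container names with a regular expression that encodes the naming rule described above; the exact generated name on your system may differ, and the sample names are hypothetical. The helper that calls Docker relies on the docker ps --format option with the {{.Names}} template and requires the docker CLI on the PATH.

```python
import re
import subprocess

# Name rule from the steps above: starts with "mdmdocker_mdm-ml_", ends with a number.
NAME_PATTERN = re.compile(r"^mdmdocker_mdm-ml_.*\d+$")


def pick_ml_containers(names):
    """Return the names that look like the machine learning container."""
    return [name for name in names if NAME_PATTERN.match(name)]


def list_running_container_names():
    """Ask Docker for the names of all running containers."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()


# Hypothetical names, used here so the example runs without Docker.
print(pick_ml_containers(["mdmdocker_mdm-ml_1", "some_other_service_1"]))
```

On a machine with Docker running, pick_ml_containers(list_running_container_names()) returns the candidate names.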

Retrieving training data from InfoSphere MDM

To create a machine learning model, you first need to retrieve the task resolution history from InfoSphere MDM so that you can train the model. To do this, IBM provides an extraction script (entrulecomp_se.sh), together with an environment settings file (env.sh), that extracts information from the virtual MDM tables of InfoSphere MDM Standard Edition. The scripts are stored in the container and can be retrieved using the following command:

docker cp <container-name>:/home/mdmuser/extraction-script <extraction-script-folder>

This command copies the scripts out of the container to the host and creates a folder called extraction-script inside the <extraction-script-folder> folder. This folder contains the necessary scripts.

Tip: It's not necessary to complete this step every time you need to create a new model.

To run the extraction script:

  1. Browse to the extraction-script-folder location: cd <extraction-script-folder>/extraction-script
  2. Change the values in the env.sh script to match your InfoSphere MDM system.
  3. Grant execute permissions on both scripts. Run: chmod +x entrulecomp_se.sh and chmod +x env.sh
  4. Run the script with the entity type as an argument, for example ./entrulecomp_se.sh <id> (where <id> is the entity type name). This creates a CSV file named entrulecomp_id.csv that you can use for training the model.
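
Before moving on to training, it can be worth confirming that the extraction produced a non-empty file. The following Python sketch does that. It assumes the output name follows the pattern entrulecomp_<id>.csv (the article only shows entrulecomp_id.csv), and both helper names are hypothetical. The CSV layout is not documented here, so the check only counts non-blank lines.

```python
import tempfile
from pathlib import Path


def extracted_csv_name(entity_type):
    """Assumed naming pattern; the article only shows entrulecomp_id.csv."""
    return f"entrulecomp_{entity_type}.csv"


def count_nonempty_lines(csv_path):
    """Count non-blank lines; whether a header row exists is not documented."""
    with open(csv_path, encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.strip())


# Demonstration against a throwaway file with placeholder content.
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / extracted_csv_name("id")
    sample.write_text("placeholder-row-1\nplaceholder-row-2\n\n", encoding="utf-8")
    print(sample.name, count_nonempty_lines(sample))
```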

Training your machine learning model

With the data extracted from InfoSphere MDM that was written to entrulecomp_id.csv, you now have what you need to train and create a machine learning model. The training process creates and saves a new machine learning model file in the shared-folder location.

  1. For the Docker container to access the training data, move the CSV file to your shared folder. For example: mv entrulecomp_id.csv <shared-folder>.
  2. Trigger the training by calling the corresponding REST service of the Docker container. The endpoint for training is: POST http://hostname:port/v1/train.
    Note: Change entrulecomp_id.csv to match your CSV file name if it differs from the default.

Sample request:

{
    "inputDataPath": "/MLPredict/entrulecomp_id.csv"
}

Tip: An easy way to trigger the training is by using curl. For example:

curl -d '{"inputDataPath":"/MLPredict/entrulecomp_id.csv"}' -H "Content-Type: application/json" -X POST http://localhost:8080/v1/train

The inputDataPath corresponds to the path inside the container. In this example, we mounted the host shared folder as /MLPredict/ inside the container.

Note: If you have mapped the Docker internal port 8080 to a port other than 8080, change the port in the above example accordingly.
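
The same request can be sent from Python using only the standard library. This sketch mirrors the curl example: it posts the JSON body to /v1/train and decodes the JSON response. The function names are hypothetical, and the defaults (localhost, port 8080, mount point /MLPredict) are taken from the examples in this article and may differ in your deployment.

```python
import json
import urllib.request


def build_train_request(csv_name, host="localhost", port=8080, mount="/MLPredict"):
    """Build the POST request for the training endpoint.

    inputDataPath must be the path as seen inside the container, which is
    why the CSV name is joined to the container mount point.
    """
    body = json.dumps({"inputDataPath": f"{mount}/{csv_name}"}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/train",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def train(csv_name, **kwargs):
    """Send the training request and return the decoded JSON response."""
    request = build_train_request(csv_name, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_train_request("entrulecomp_id.csv")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling train("entrulecomp_id.csv") performs the actual request and blocks until the service responds, which can take a while for large training sets.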

After the training process is complete, you should get a response like this:

{
    "metrics": {
        "labels": [
            "D",
            "S"
        ],
        "precision": [
            0.8031324089,
            0.9795910413
        ],
        "recall": [
            0.9143464747,
            0.9483018199
        ],
        "fscore": [
            0.8551386498,
            0.9636925224
        ],
        "accuracy": 0.9419375807
    },
    "dataVolume": {
        "training": 978446,
        "test": 326149
    }
}

This response shows you the quality of the newly trained model.
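
To read the response more easily, the per-label arrays can be lined up by label. The Python sketch below does this for the sample response shown above. It also recomputes the harmonic mean of precision and recall, 2PR/(P+R), and the share of records used for training. The sample values are consistent with that F-score formula and with roughly three quarters of the data being used for training, but this is an observation about the sample, not documented behavior of the service. The function name summarize is hypothetical.

```python
def summarize(response):
    """Pair up the per-label metric arrays and compute the training share."""
    metrics = response["metrics"]
    rows = {}
    for index, label in enumerate(metrics["labels"]):
        precision = metrics["precision"][index]
        recall = metrics["recall"][index]
        rows[label] = {
            "precision": precision,
            "recall": recall,
            "fscore": metrics["fscore"][index],
            # Harmonic mean of precision and recall, for comparison.
            "f1_recomputed": 2 * precision * recall / (precision + recall),
        }
    volume = response["dataVolume"]
    train_share = volume["training"] / (volume["training"] + volume["test"])
    return rows, train_share


# The sample response from this article.
sample = {
    "metrics": {
        "labels": ["D", "S"],
        "precision": [0.8031324089, 0.9795910413],
        "recall": [0.9143464747, 0.9483018199],
        "fscore": [0.8551386498, 0.9636925224],
        "accuracy": 0.9419375807,
    },
    "dataVolume": {"training": 978446, "test": 326149},
}

rows, train_share = summarize(sample)
for label, row in rows.items():
    print(label, {name: round(value, 4) for name, value in row.items()})
print("training share:", round(train_share, 2))
```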

Once the model has been trained, it is automatically picked up by the machine learning container, and all predictions are executed against the new model. The change is immediately reflected in the Consent Management UI, which now shows the machine learning predictions.

Saving and backing up a model

The training step creates and saves the model with the name MDM-ML.model in the shared-folder location.

To save a backup of the model, create a copy before retraining.
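
A backup copy can be created with a short Python helper, sketched below. It assumes that MDM-ML.model sits directly in the shared folder, and it handles both a single file and a directory, because the article does not say which form the model takes. The function name backup_model and the timestamped .bak naming are hypothetical choices.

```python
import shutil
import tempfile
import time
from pathlib import Path


def backup_model(shared_folder, model_name="MDM-ML.model"):
    """Copy the current model to a timestamped sibling and return its path."""
    source = Path(shared_folder) / model_name
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = source.with_name(f"{model_name}.{stamp}.bak")
    if source.is_dir():
        # The model may be stored as a directory; copy the whole tree.
        shutil.copytree(source, target)
    else:
        shutil.copy2(source, target)
    return target


# Demonstration against a throwaway folder with a placeholder model file.
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "MDM-ML.model").write_bytes(b"placeholder")
    print(backup_model(folder).name)
```

Run the helper against your shared-folder location before each call to the training service, because training replaces the current model.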

Retraining the model

To retrain the machine learning model at any time, call the v1/train REST service again. The model will be retrained to include the newly provided training data. The updated model will replace the current model.

Product Synonym

InfoSphere MDM; Master Data Management; MDM

Document Information

Modified date:
27 April 2022

UID

ibm10788347