Information Management IBM InfoSphere Master Data Management, Version 11.3

Technical overview of big data matching

Although the big data matching capability is technically a part of the IBM® InfoSphere® MDM offering, it integrates directly with IBM InfoSphere BigInsights™.

Perhaps the most important aspect of big data matching is the mechanism for efficiently resolving members into entities. Resolving entities is the process of associating two or more member records that refer to the same individual or organization. After resolving entities, the big data matching applications write the entity linking data into a table within the same database on the InfoSphere BigInsights node as the source table. The entity linking data allows users to run probabilistic searches of the members and entities.

IBM InfoSphere BigInsights is built on the Apache Hadoop software framework, the open source technology for reliably managing large volumes of data. InfoSphere BigInsights also includes the Hadoop MapReduce processing framework and HBase. HBase is the Hadoop-based database that stores data in tables that are non-relational and distributed across nodes.

The big data matching capability does not connect with or rely on the operational server that supports a typical InfoSphere MDM installation. Instead, you install the capability directly onto your InfoSphere BigInsights cluster.

The capability uses only a limited portion of the InfoSphere MDM capabilities. Specifically, it uses the InfoSphere MDM Workbench. With the MDM Workbench, you create a Probabilistic Matching Engine (PME) configuration that you then export for use within InfoSphere BigInsights.

The chief component of the Probabilistic Matching Engine configuration is one or more MDM algorithms. In the realm of IBM InfoSphere MDM, an algorithm is a step-by-step procedure that compares and scores the similarities and differences of member attributes. As part of a process called derivation, the algorithm standardizes and buckets the data. The algorithm then defines a comparison process, which yields a numerical score. That score indicates the likelihood that two records refer to the same member. As a final step, the process specifies whether to create linkages between records that the algorithm considers to be the same member. A set of linked members is known as an entity, so this last step in the process is called entity linking.

Users familiar with IBM InfoSphere MDM might know that the matching process can be used to generate review tasks for potential linkages that don't surpass a certain threshold of certainty. The big data matching capability does not generate tasks. Potential linkages that do not meet the threshold are simply not linked together as entities.
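The derive, compare, and link steps described above can be pictured with a small sketch. Everything in it (the standardization rules, bucketing scheme, weights, and threshold) is invented for illustration; the real algorithm is configured in the MDM Workbench, not written by hand.

```python
# Simplified illustration of the derive -> compare -> link flow.
# All rules, weights, and thresholds here are invented for demonstration;
# they are not the actual PME algorithm.

def derive(record):
    """Standardize attributes and compute bucket values."""
    name = record["name"].strip().upper()
    phone = "".join(ch for ch in record["phone"] if ch.isdigit())
    std = {"name": name, "phone": phone}
    # Bucket on a name prefix and a phone suffix so that only
    # plausible candidate pairs are ever compared.
    buckets = {name[:3], phone[-4:]}
    return std, buckets

def compare(a, b):
    """Score the similarity of two standardized records."""
    score = 0.0
    if a["name"] == b["name"]:
        score += 5.0
    if a["phone"] == b["phone"]:
        score += 4.0
    return score

LINK_THRESHOLD = 6.0  # assumed cutoff; below it, records stay unlinked

def link(records):
    """Return index pairs of records that resolve to the same entity."""
    derived = [derive(r) for r in records]
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            (a, ab), (b, bb) = derived[i], derived[j]
            # Compare only records that share a bucket.
            if ab & bb and compare(a, b) >= LINK_THRESHOLD:
                pairs.append((i, j))
    return pairs

members = [
    {"name": "Ann Smith ", "phone": "555-010-2222"},
    {"name": "ANN SMITH", "phone": "(555) 010 2222"},
    {"name": "Bob Jones", "phone": "555-999-0000"},
]
print(link(members))  # records 0 and 1 resolve to the same entity
```

Note that, as the text states, below-threshold pairs are simply left unlinked; no review task is created for them.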

The algorithms you create will differ depending on the data you need to process. Creating the algorithm can be a complex procedure. See the link at the end of this topic for information about creating an algorithm.

After you have created your algorithm, you export it from the MDM Workbench as a Probabilistic Matching Engine (PME) bundle. That bundle includes a .zip file that contains the algorithm. You must then import the .zip file into your InfoSphere BigInsights cluster. The big data matching applications rely on this configuration when you run them to derive, compare, and link the data.

Before you run the applications, you configure the HBase tables for the data you want to manage. Configuring the tables requires you to:
  1. Create an .xml configuration file for each table. Among other settings, the configuration files contain settings that define a one-to-one mapping from the HBase column family and column name to an attribute and field combination in the configuration you created with the MDM Workbench. The configuration file also specifies which algorithms to run when you run big data matching.
  2. Run a set of commands in the HBase console to enable big data matching for each table.
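The exact schema of the per-table configuration file is defined by the product documentation; the fragment below is only an illustration of the kind of one-to-one mapping the steps above describe. Every element and attribute name in it is an assumption, not the real schema.

```xml
<!-- Illustrative only: element and attribute names are invented, not the
     actual big data matching schema. The point is the one-to-one mapping
     from an HBase column family and column name to a PME attribute and
     field, plus the choice of algorithm to run for this table. -->
<tableConfig table="members" algorithm="PersonAlgorithm">
  <mapping hbaseFamily="d" hbaseColumn="given_name"
           pmeAttribute="LegalName" pmeField="FirstName"/>
  <mapping hbaseFamily="d" hbaseColumn="surname"
           pmeAttribute="LegalName" pmeField="LastName"/>
  <mapping hbaseFamily="d" hbaseColumn="phone"
           pmeAttribute="HomePhone" pmeField="PhoneNumber"/>
</tableConfig>
```
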
When you install big data matching within your InfoSphere BigInsights cluster, the installer creates a component within the HBase master that is notified whenever a new table is enabled for big data matching. When a table is enabled, the component creates a corresponding table in which to store the derivation, comparison, and linking data generated when you run the big data matching applications. The component also loads the algorithm configuration into the new matching processes on the HBase Region Server and into the JVMs for MapReduce.
The big data matching capability installs within the primary HBase machine, and within the JVMs and HBase Region Server on the secondary machines.
After the components are installed and configured, you can run the applications in one of two ways: as an automatic background process or as a batch process.

Note that the fields that participate in matching must be written into HBase in an uncompressed and unencrypted format. Fields that are not used in matching can be stored in any format. To maximize storage capacity, use Snappy compression at the HBase level; with HBase-level compression in place, you do not need to pre-compress the data yourself.

By default, the derive, compare, and link applications run as an automatic background process. Depending on your hardware configuration, you might choose to run the applications in batch mode instead. For example, if your InfoSphere BigInsights cluster has a high spindle count, batch mode is likely to be more efficient. The linking application requires ample memory. IBM internal testing suggests that the HBase cluster needs to be able to allocate approximately one gigabyte of RAM for every million members that you need to process. Sufficient RAM is a priority because the entity linking application must load the entire entity graph into memory.
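The sizing guidance above reduces to simple arithmetic. The sketch below applies the stated ratio of roughly one gigabyte of allocatable HBase RAM per million members; it is a back-of-the-envelope check, not an official IBM sizing tool.

```python
# Rough capacity check based on the guidance above: about 1 GB of
# allocatable HBase cluster RAM per million members. This is a
# back-of-the-envelope sketch, not an official sizing formula.

GB_PER_MILLION_MEMBERS = 1.0  # ratio stated in the text

def ram_needed_gb(member_count):
    """Estimate the RAM that entity linking needs for this many members."""
    return member_count / 1_000_000 * GB_PER_MILLION_MEMBERS

def can_link(member_count, allocatable_gb):
    """True if the cluster can hold the full entity graph in memory."""
    return allocatable_gb >= ram_needed_gb(member_count)

print(ram_needed_gb(250_000_000))  # 250.0 GB for 250 million members
print(can_link(250_000_000, 192))  # False: 192 GB is not enough
```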

The derive and compare applications that run automatically do not differ from the corresponding derive and compare functions available with the MDM operational server. By contrast, the entity linking application proceeds as a two-step process that might feel unfamiliar to experienced users of InfoSphere MDM. As a first step, the entity linking application unlinks any members that were previously linked into an entity but no longer have a connection to other entity members. It then links members into entities from scratch based on their most current weights and based on the most current version of the algorithm. Unlinking before linking ensures that only the appropriate members are part of an entity. It also ensures that any new members are processed and linked to the appropriate entity.
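One way to picture the unlink-then-relink behavior described above is as rebuilding the entity graph from scratch on every run, so that stale linkages cannot survive. The sketch below does this with a simple union-find over the current above-threshold pairs; it illustrates the idea only and is not the actual linking implementation.

```python
# Illustration of the unlink-then-relink idea: rather than patching old
# entities, rebuild them from the current above-threshold pairs, so any
# member whose links no longer hold falls out of its old entity.
# This union-find sketch is not the actual linking implementation.

def relink(member_ids, linked_pairs):
    """Return entities (frozensets of members) rebuilt from scratch."""
    parent = {m: m for m in member_ids}  # every member starts unlinked

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path compression
            m = parent[m]
        return m

    for a, b in linked_pairs:  # union each current above-threshold pair
        parent[find(a)] = find(b)

    entities = {}
    for m in member_ids:
        entities.setdefault(find(m), set()).add(m)
    return {frozenset(members) for members in entities.values()}

# Suppose A, B, and C were previously one entity; after an update, only
# A-B still scores above threshold, while a new member D now matches C.
print(relink(["A", "B", "C", "D"], [("A", "B"), ("C", "D")]))
# C drops out of its old entity and is relinked with D.
```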

If you have experience running entity linking with the InfoSphere MDM operational server, you might notice small differences between the results that the operational server returns and the results that are returned by the entity linking application in InfoSphere BigInsights. In particular, you might notice that, on average, a greater number of members are assigned to an entity.

The big data matching installation package also includes APIs that extend the public HBase API so that you can run probabilistic searches of the data.

A combination of REST API and Java™ API manages communication among the components that are used by big data matching.

The big data matching offering includes a tab-separated sample data set and the default PME algorithm for the Party project. The sample algorithm allows you to explore big data matching without needing to first generate an algorithm with the MDM Workbench. However, the sample is for exploration purposes only. You will not be able to pull the algorithm back into the MDM Workbench to customize it for your own implementation.

A lightweight, web-based dashboard is provided as a sample to allow users to experiment with member searches and entity searches. Administrators configure the dashboard by making a copy of a template file, ui_config.xml.template, that is included with the installation, renaming the copy to ui_config.xml, and editing it as needed.

Where applicable, big data matching takes advantage of the security features available with InfoSphere BigInsights. The capability does not include security features independent of InfoSphere BigInsights.



Last updated: 27 June 2014