Although the big data matching capability is technically a part of the IBM® InfoSphere® MDM offering, it integrates directly with IBM InfoSphere BigInsights™.
Perhaps the most important aspect of big data matching is the mechanism for efficiently resolving members into entities. Resolving entities is the process of associating two or more member records that refer to the same individual or organization. After resolving entities, the big data matching applications write the entity linking data into a table within the same database on the InfoSphere BigInsights node as the source table. The entity linking data allows users to run probabilistic searches of the members and entities.
IBM InfoSphere BigInsights is built on the Apache Hadoop software framework, the open source technology for reliably managing large volumes of data. InfoSphere BigInsights also includes the Hadoop MapReduce processing framework and HBase. HBase is the Hadoop-based database that stores data in tables that are non-relational and distributed across nodes.
The big data matching capability does not connect with or rely on the operational server that supports a typical InfoSphere MDM installation. As the instructions make clear, you install the capability directly onto your InfoSphere BigInsights cluster.
The capability uses only a limited portion of the InfoSphere MDM capabilities. Specifically, it uses the InfoSphere MDM Workbench. With the MDM Workbench, you create a Probabilistic Matching Engine (PME) configuration that you then export for use within InfoSphere BigInsights. The chief component of the Probabilistic Matching Engine configuration is one or more MDM algorithms.
In the realm of IBM InfoSphere MDM, an algorithm is a step-by-step procedure that compares and scores the similarities and differences of member attributes. As part of a process called derivation, the algorithm standardizes and buckets the data. The algorithm then defines a comparison process, which yields a numerical score. That score indicates the likelihood that two records refer to the same member. As a final step, the process specifies whether to create linkages between records that the algorithm considers to be the same member. A set of linked members is known as an entity, so this last step in the process is called entity linking.
Users familiar with IBM InfoSphere MDM might know that the matching process can be used to generate review tasks for potential linkages that don't surpass a certain threshold of certainty. The big data matching capability does not generate tasks. Potential linkages that do not meet the threshold are simply not linked together as entities.
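The derive, compare, and threshold steps described above can be illustrated with a minimal Python sketch. It is not the PME implementation: the field names, the bucket key, the comparison weights, and the LINK_THRESHOLD value are all hypothetical and chosen only to make the flow visible. In a real deployment these details come from the algorithm that you export from the MDM Workbench.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical threshold; a real value comes from the PME configuration.
LINK_THRESHOLD = 5.0


def standardize(record):
    """Derivation, part 1: normalize case and whitespace (illustrative only)."""
    return {
        "id": record["id"],
        "name": " ".join(record["name"].upper().split()),
        "zip": record["zip"].strip(),
    }


def bucket_key(record):
    """Derivation, part 2: a toy bucket key (first letter of name plus ZIP)."""
    return record["name"][:1] + record["zip"]


def compare(a, b):
    """Toy comparison: add weight for agreement, subtract for disagreement."""
    score = 0.0
    score += 4.0 if a["name"] == b["name"] else -2.0
    score += 2.0 if a["zip"] == b["zip"] else -1.0
    return score


def candidate_links(records):
    """Compare only records that share a bucket; keep pairs at or above the threshold."""
    buckets = defaultdict(list)
    for rec in map(standardize, records):
        buckets[bucket_key(rec)].append(rec)

    links = []
    for members in buckets.values():
        for a, b in combinations(members, 2):
            score = compare(a, b)
            # Pairs below the threshold are simply not linked; no review task is created.
            if score >= LINK_THRESHOLD:
                links.append((a["id"], b["id"], score))
    return links


members = [
    {"id": 1, "name": "Ann  Lee", "zip": "10001"},
    {"id": 2, "name": "ann lee", "zip": "10001 "},
    {"id": 3, "name": "Adam Lee", "zip": "10001"},
]
print(candidate_links(members))  # [(1, 2, 6.0)]
```

Bucketing limits comparisons to records that are plausible matches, which is what keeps pairwise scoring tractable on large data sets.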
The algorithms you create will differ depending on the data you need to process. Creating the algorithm can be a complex procedure. See the link at the end of this topic for information about creating an algorithm.
After you have created your algorithm, you export it from the MDM Workbench as a Probabilistic Matching Engine (PME) bundle. That bundle includes a .zip file that contains the algorithm. You must then import the .zip file into your InfoSphere BigInsights cluster. The InfoSphere BigInsights applications rely on the configuration when you run the applications to derive, compare, and link the data.
Note that the fields that participate in matching must be written into HBase in an uncompressed and unencrypted format. Fields not used in matching can be stored in any format you wish. To maximize storage capacity, use Snappy compression at the HBase level. Because HBase then compresses the stored data for you, you do not need to pre-compress the data yourself.
By default, the derive and compare applications run as an automatic background process. Depending on your hardware configuration, you might choose to run the applications in batch mode instead. For example, if your InfoSphere BigInsights cluster has a high spindle count, batch mode is likely to be more efficient. The linking application requires ample memory. IBM internal testing suggests that the HBase cluster needs to be able to allocate approximately one gigabyte of RAM for every million members you need to process. Sufficient RAM is a priority because the entity linking application must load the entire entity graph into memory.
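As a rough planning aid, the sizing guideline above can be turned into simple arithmetic. The following Python sketch assumes the approximate figure of one gigabyte per million members; the function name and the example member count are hypothetical, and actual requirements depend on your data and cluster.

```python
# Approximate guideline from IBM internal testing: about 1 GB of RAM per million members.
GB_PER_MILLION_MEMBERS = 1.0


def estimate_linking_ram_gb(member_count, gb_per_million=GB_PER_MILLION_MEMBERS):
    """Return a rough RAM estimate, in gigabytes, for the entity linking application."""
    if member_count < 0:
        raise ValueError("member_count must not be negative")
    return member_count / 1_000_000 * gb_per_million


# Example: a hypothetical data set of 250 million members.
print(estimate_linking_ram_gb(250_000_000))  # 250.0
```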
The derive and compare applications that run automatically do not differ from the corresponding derive and compare functions available with the MDM operational server. By contrast, the entity linking application proceeds as a two-step process that might feel unfamiliar to experienced users of InfoSphere MDM. As a first step, the entity linking application unlinks any members that were previously linked into an entity but no longer have a connection to other entity members. It then links members into entities from scratch based on their most current weights and based on the most current version of the algorithm. Unlinking before linking ensures that only the appropriate members are part of an entity. It also ensures that any new members are processed and linked to the appropriate entity.
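The unlink-then-link sequence can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's implementation: it assumes that members are linked transitively (a union-find structure groups them into connected components), and the function name, input shapes, and threshold are hypothetical.

```python
def relink(member_ids, scored_pairs, threshold, previous_entities):
    """Unlink stale members, then rebuild entities from the current scores."""
    # Keep only the pairs whose current score meets the threshold.
    strong = [(a, b) for a, b, score in scored_pairs if score >= threshold]
    neighbours = {}
    for a, b in strong:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Step 1: unlink members that no longer connect to anyone in their old entity.
    unlinked = set()
    for entity in previous_entities:
        for member in entity:
            others = entity - {member}
            if others and not (neighbours.get(member, set()) & others):
                unlinked.add(member)

    # Step 2: link from scratch with union-find over the current strong pairs.
    parent = {m: m for m in member_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in strong:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for m in member_ids:
        groups.setdefault(find(m), set()).add(m)
    entities = sorted(sorted(group) for group in groups.values())
    return unlinked, entities


# Member 3 was in an entity with 1 and 2, but its score against 2 has dropped
# and it now matches a new member, 4.
unlinked, entities = relink(
    member_ids=[1, 2, 3, 4],
    scored_pairs=[(1, 2, 6.0), (2, 3, 1.5), (3, 4, 7.0)],
    threshold=5.0,
    previous_entities=[{1, 2, 3}],
)
print(unlinked)  # {3}
print(entities)  # [[1, 2], [3, 4]]
```

In the example, member 3 leaves its old entity in the first step and joins the new member 4 in the second, which mirrors how unlinking before linking keeps entity membership current.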
If you have experience running entity linking with the InfoSphere MDM operational server, you might notice small differences between the results that the operational server returns and the results that are returned by the entity linking application in InfoSphere BigInsights. In particular, you might notice that on average a greater number of members are assigned to an entity.
The big data matching installation package also includes APIs that extend the public HBase API so that you can run probabilistic searches of the data.
A combination of REST API and Java™ API manages communication among the components that are used by big data matching.
The big data matching offering includes a tab-separated sample data set and the default PME algorithm for the Party project. The sample algorithm allows you to explore big data matching without needing to first generate an algorithm with the MDM Workbench. However, the sample is for exploration purposes only. You will not be able to pull the algorithm back into the MDM Workbench to customize it for your own implementation.
A lightweight, web-based dashboard is provided as a sample to allow users to experiment with member searches and entity searches. Administrators can configure the dashboard by making a copy of a template file called ui_config.xml.template that is included with the installation. Administrators can then rename the file from ui_config.xml.template to ui_config.xml and edit the file as needed.
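The copy step can also be scripted. The following Python sketch assumes that you know the directory that holds ui_config.xml.template; the function name and the directory argument are hypothetical, because the actual location depends on your installation. The sketch copies the template rather than renaming it, so the original template stays available, and it leaves an existing ui_config.xml untouched.

```python
import shutil
from pathlib import Path


def create_dashboard_config(config_dir):
    """Copy ui_config.xml.template to ui_config.xml unless the target already exists."""
    template = Path(config_dir) / "ui_config.xml.template"
    target = Path(config_dir) / "ui_config.xml"
    if target.exists():
        # Do not overwrite a configuration that an administrator has already edited.
        return target
    shutil.copyfile(template, target)
    return target
```

After the copy is created, edit ui_config.xml as needed for your environment.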
Where applicable, big data matching takes advantage of the security features available with InfoSphere BigInsights. The capability does not include security features independent of InfoSphere BigInsights.