Data Deduplication with QGate Paribus Family

Gaining Efficiencies with Paribus Discovery – Organizing your Match Sessions

Summary: This article is a continuation of the theme of maximizing efficiencies through an informed configuration of your Paribus system – Organizing your Match Sessions
Article Type: How-To Guide
Related Product(s): This article relates to the following products:

  • Paribus Discovery
  • Microsoft Dynamics 365/CRM
  • Infor CRM
  • Saleslogix

Note: This article assumes the reader has read the included Paribus Discovery documentation and has understood the basic concepts of the building blocks of the Paribus Discovery configuration, including Data Providers, Data Sets, Match Conditions, Match Sets and Match Sessions. In addition, you should be familiar with how to run a Paribus Cleansing Session for your CRM system.

Introduction

In a previous article, Gaining Efficiencies with Paribus Discovery – Filters, we covered many of the basic steps for setting up logical Match Sessions and explained how to create Filters that make sense for your particular needs. This article continues the theme of maximizing efficiencies through an informed configuration of your Paribus system. We recommend reading the Filters article before this one.

The overriding goal when using Paribus Discovery is to clean up your duplicates as quickly as possible. There’s not much you can do (other than improving your infrastructure) to speed up the processing of Match Sessions or the processing of Cleansing Sessions – in both these steps, speed relies greatly on how many records you’re processing and the computing power available.

You do have some control over minimizing the time it takes to review your results. Some help with that is covered in the other articles regarding Filters and Information Columns. Here, we’ll talk more about organizing your Match Sessions to reduce the review time.

Your First Match Sessions

Because you can tell Paribus how similar the comparison elements (Match Values) need to be, you can start seeing results quickly by creating specific Sessions that look for very close matches. These Sessions eliminate the more obvious duplicates first.

In many cases, organizations have duplicates due to multiple imports from similar sources or simply from users not checking to see if a record already exists before entering a new record. Often these duplicates are exactly the same – or at least, very similar. If your first Sessions are created in such a way as to try to find these very-close matches first, that could greatly reduce or even eliminate the need to spend any significant time reviewing those Match Session results.

For example, if you base your initial Match Session(s) on very strict matching criteria (e.g., 99-100% matching on 2 or 3 values), then you can be highly confident that the Groups Paribus finds are duplicates. A Match Session looking for a 99% match on Account Name and a 99% match on Full Address would likely only contain groups of actual duplicate Accounts. Try this with your data and see. It’s very easy to do with the first sample Match Session provided with the included Definitions you likely imported.

Note: I would suggest creating a filter (something like Address1_StateorProvince=’CT’ or similar if you’re using Dynamics CRM or Address.State=’CT’ for Infor CRM) that narrows it down to a couple of thousand records to test. When I’m testing any new settings, this is what I do so that I get some quick results to confirm (or disprove) my expectations.

By combining a Filter comparison like Customer vs Non-Customer and the high level of similarity outlined here, in many cases you can eliminate the review entirely by clicking the Global Group Review button on the Session Review toolbar and choosing to mark every Group in the Session as Reviewed.

After you’ve processed (merged) all of these obvious duplicates, you can then ‘loosen’ your matching criteria to allow Paribus to find more duplicates. Note that when you lower the Match Score Threshold, Paribus will allow more variation between records and therefore increase the chance that it will ‘see’ similar records that you don’t agree are duplicates. We call these ‘false positives’. The key is to understand how low you can go with the match criteria you use, and to accept that below a certain level you will likely find some false positives that you will need to exclude during the Review step.

So, using our sample Match Sessions as examples, you’ll see that the first one below is looking for very close Account Names AND Full Addresses (99% for both), whereas with the 2nd one we’ve lowered the Match Score Thresholds down to 85% for Account Name and 90% for Full Address.

Organizing your Match Sessions

For the results of the 1st Session above, it is likely that you won’t need to review at all to determine whether or not they are actually duplicates. With the 2nd Session, we advise you to spend some time reviewing to ensure you won’t be merging records that aren’t actually duplicates. There isn’t any one ‘right’ group of settings that will work for every organization. You need to run some tests with a Filtered group of your data to find the Match Score Thresholds that work best for you.

Also, the accuracy of the results you get will rely greatly upon the quality of the data in your system. For example, if you have many records with no addresses or very spotty (incomplete) addresses, matching on the Address values will be less reliable than if your address data were fairly complete and accurate. Again, the only way to find out what works with your data is to run some test sessions.
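To make the effect of the Match Score Thresholds more concrete, here is a minimal Python sketch. It does not use Paribus or reproduce its scoring algorithm; it assumes a generic string similarity from Python’s standard difflib module, and the field names (account_name, full_address) and sample records are hypothetical. It only illustrates why a 99%/99% session returns near-identical records while an 85%/90% session also returns pairs that need review.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score.

    Illustrative stand-in only; this is NOT how Paribus scores matches.
    """
    return 100.0 * SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_match(rec_a: dict, rec_b: dict, thresholds: dict) -> bool:
    """A pair matches only if every compared value meets its threshold."""
    return all(
        similarity(rec_a[field], rec_b[field]) >= minimum
        for field, minimum in thresholds.items()
    )


# Thresholds taken from the two sample sessions described in the article.
STRICT = {"account_name": 99, "full_address": 99}
LOOSE = {"account_name": 85, "full_address": 90}

# Hypothetical records for illustration.
a = {"account_name": "Acme Industries", "full_address": "12 Main Street, Hartford, CT"}
b = {"account_name": "ACME Industries", "full_address": "12 Main Street, Hartford, CT"}
c = {"account_name": "Acme Industrial", "full_address": "12 Main Street, Hartford CT"}

print(is_match(a, b, STRICT))  # True: differs only in letter case
print(is_match(a, c, STRICT))  # False: too different for the strict session
print(is_match(a, c, LOOSE))   # True: found by the looser session, needs review
```

In this sketch, records a and c may or may not be the same company, which is exactly the kind of pair you would want to inspect during the Review step.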

Naming your Match Sessions

You’ll notice that the name of the first sample Match Session above is 001 – Account Name 99%, Full Address 99% – Filter Active. Let’s talk about some naming convention options you might employ.

Here’s a breakdown of why it’s named as it is –

Starting with 001:

As is the case in all of the areas of the Paribus interface, the list of Match Sessions is sorted alphabetically. This is why we’ve numbered the sample Match Sessions as we have – in the logical order that one might run them so as to speed up the journey to cleaner data. Keep this in mind when you’re naming your Match Sessions (and other configuration items) so that you can organize them in a logical way for your particular use.

Account Name 99%, Full Address 99%:

It helps to indicate in the name of the Match Session which values you’re matching on, as well as the Match Score Thresholds, so you know what the settings for that session are (or were) without opening it.

Filter Active:

Similarly, it’s good to show which Filter(s) were applied for this session so that you don’t have to open the configuration to find out. If you feel you need to be even more descriptive, there is a very large text field called Description on the Session Details tab in the Match Session Settings.

Note: The name field is simply a text field, so you can name Match Sessions what you wish so as to be as descriptive as you need to be – up to 64 characters.

The naming patterns you use are entirely up to you. These are just suggestions that we’ve found work well for many of our customers.
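If you plan your session names outside of Paribus (in a spreadsheet or script, for example), the convention above can be sketched in a few lines of Python. This is only an illustration: the session_name helper is hypothetical and not part of Paribus, and it uses a plain hyphen as the separator. It shows why the zero-padded prefix matters for alphabetical sorting and checks the 64-character limit mentioned in the note.

```python
MAX_NAME_LENGTH = 64  # limit stated in the article's note


def session_name(order: int, criteria: dict, filter_note: str) -> str:
    """Build a name like '001 - Account Name 99%, Full Address 99% - Filter Active'.

    Hypothetical helper for planning names; Paribus itself only sees the text.
    """
    parts = ", ".join(f"{value} {threshold}%" for value, threshold in criteria.items())
    name = f"{order:03d} - {parts} - {filter_note}"
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError(f"name is {len(name)} characters; the limit is {MAX_NAME_LENGTH}")
    return name


names = [
    session_name(2, {"Account Name": 85, "Full Address": 90}, "Filter Active"),
    session_name(10, {"Contact Name": 90}, "Filter Active"),
    session_name(1, {"Account Name": 99, "Full Address": 99}, "Filter Active"),
]

# Zero-padded prefixes sort alphabetically in the intended run order.
for name in sorted(names):
    print(name)

# Without padding, alphabetical order puts 10 ahead of 2.
print(sorted(["2 - Accounts", "10 - Contacts"]))
```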

Merge Accounts before Contacts

If you use your CRM system to track Companies (Accounts) and the people who work for those Companies (Contacts), we suggest finding and merging your duplicate Accounts before finding and merging your Contacts. Why? Some duplicate Contacts may not be as obvious as you’d think because they are attached to different Accounts, and those Accounts are likely duplicates as well. We’ve found that the Address and Phone information attached to Contacts is often their local office address (which may or may not be the official corporate address), and their work phone number may be their direct line or the main number with an extension.

So, if you match on Contact Name and Address and/or Phone in a Contact-first session, you may not find these duplicate Contacts. However, if you’ve already cleaned up your duplicate Accounts, you can look for Contacts with similar names where their AccountID is an Exact Match! It’s much more efficient.
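The Account-first approach can be illustrated with a short Python sketch. It is not Paribus code: the similarity measure comes from Python’s difflib module as a stand-in, and the field names (id, account_id, name), the 90% name threshold and the sample Contacts are assumptions made for the example. Contacts are grouped by an exact AccountID match, and only names within the same Account are compared.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def name_score(a: str, b: str) -> float:
    """0-100 similarity; an illustrative stand-in, not Paribus's scoring."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def contact_candidates(contacts: list, min_name_score: float = 90.0) -> list:
    """Pair up Contacts that share an AccountID exactly and have similar names."""
    by_account = defaultdict(list)
    for contact in contacts:
        by_account[contact["account_id"]].append(contact)

    pairs = []
    for group in by_account.values():
        for left, right in combinations(group, 2):
            if name_score(left["name"], right["name"]) >= min_name_score:
                pairs.append((left["id"], right["id"]))
    return pairs


# Hypothetical Contacts after duplicate Accounts have already been merged.
contacts = [
    {"id": "C1", "account_id": "A1", "name": "Jon Smith"},
    {"id": "C2", "account_id": "A1", "name": "John Smith"},
    {"id": "C3", "account_id": "A2", "name": "John Smith"},
    {"id": "C4", "account_id": "A1", "name": "Mary Jones"},
]

print(contact_candidates(contacts))  # [('C1', 'C2')]
```

Contact C3 has the same name as C2 but belongs to a different Account, so it is not paired here. Had Accounts A1 and A2 been duplicates that were merged first, all three would share one AccountID and be compared.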

Notice that this order of Match Sessions is used in the Sample Match Session numbering system.

Running Multiple Sessions before Cleansing

In the Gaining Efficiencies with Paribus Discovery – Filters article, we talk about breaking up your data into subsets for creating manageable Session sizes. There are many good reasons to do this. In some cases, users will break things up into multiple groups and then run a few different Sessions before going through the Review and Cleansing steps. This is perfectly acceptable, but be careful not to run multiple Sessions whose results might contain the same records.

For example, it’s OK to do the following:

Run a Session for Accounts in NY, another for Accounts in NJ and another for Accounts in CT and MA. Then review any or all of the sessions and process (merge) them as you please.

It’s NOT a good idea to do the following:

Run a Session for Accounts in NY and another for Accounts in the Northeast Territory (let’s say that encompasses all the states northeast of Pennsylvania) and then review both Sessions before cleansing either of them. Why? You will find duplicates in the first Session that would also appear in the 2nd Session, and you could make different decisions about the same records during the review!

If you need to run Sessions where records may appear in both, it’s always best to run one, process the results and then run the other to avoid the possibility of conflicts.
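One way to check a planned set of Sessions for this problem before running them is to test whether their Filters select any of the same records. The Python sketch below is an illustration only and does not call Paribus: the record layout, the filter functions and the membership of the Northeast Territory are all hypothetical.

```python
from itertools import combinations

# Hypothetical Accounts; in practice this would be an export from your CRM.
accounts = [
    {"id": "A1", "state": "NY"},
    {"id": "A2", "state": "NJ"},
    {"id": "A3", "state": "CT"},
    {"id": "A4", "state": "MA"},
]

# Assumed membership of the Northeast Territory, for illustration only.
NORTHEAST = {"NY", "CT", "MA", "RI", "VT", "NH", "ME"}


def overlaps(records: list, filters: dict) -> dict:
    """Return the record ids selected by more than one filter, per pair of filters."""
    selected = {
        name: {r["id"] for r in records if keep(r)} for name, keep in filters.items()
    }
    return {
        (a, b): selected[a] & selected[b]
        for a, b in combinations(selected, 2)
        if selected[a] & selected[b]
    }


safe_plan = {
    "NY": lambda r: r["state"] == "NY",
    "NJ": lambda r: r["state"] == "NJ",
    "CT and MA": lambda r: r["state"] in {"CT", "MA"},
}
risky_plan = {
    "NY": lambda r: r["state"] == "NY",
    "Northeast Territory": lambda r: r["state"] in NORTHEAST,
}

print(overlaps(accounts, safe_plan))   # {} : no record appears in two Sessions
print(overlaps(accounts, risky_plan))  # {('NY', 'Northeast Territory'): {'A1'}}
```

An empty result suggests the Sessions can be reviewed in any order. A non-empty result points to the pairs of Sessions that should be run and cleansed one at a time.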

Further Information:


See the Paribus Help Center User Guidelines for important considerations of use.