Server Configuration and Sizing
Warning |
---|
Before starting any action or performance investigation, it is important to check the sizing of the servers where the Semarchy application runs (the Application Server) and where the MDM data is stored and processed (the Database Server). |
Application Server
If you are using Tomcat, check the memory usage using the Tomcat Manager:
- Connect to the Tomcat Manager and click Server Status in the upper right corner.
- In the Server Status screen, check the memory used by Tomcat: Free memory, Total memory, Max memory.
For Semarchy xDM:
- Maximum Memory: the Xmx value must be at least 4 GB.
- The OS should have at least 8 GB of RAM.
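As an illustrative sketch only, the heap setting above is typically applied through CATALINA_OPTS; the file location ($CATALINA_BASE/bin/setenv.sh) and the exact values below are assumptions to adapt to your installation:

```shell
# Hypothetical setenv.sh fragment: give the Tomcat JVM at least 4 GB of heap,
# per the recommendation above. Values are illustrative, not prescriptive.
CATALINA_OPTS="$CATALINA_OPTS -Xms1g -Xmx4g"
export CATALINA_OPTS
```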
Database Server
Server Requirements
- CPU: 4 CPUs minimum. Plan for more CPUs when using database parallelism.
- Memory: 16 GB.
Database Requirements
- Oracle: minimum size of the SGA: 8 GB or above, depending on the number of records.
- PostgreSQL: minimum size of the cache: 8 GB or above, depending on the number of records.
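For PostgreSQL, the cache sizing above maps mainly to shared_buffers. The fragment below is a hedged sketch matching the minimums listed here; the exact values depend on your record volumes and total server RAM:

```
# postgresql.conf sketch (illustrative values, assuming a 16 GB server)
shared_buffers = 8GB          # minimum cache size recommended above
effective_cache_size = 12GB   # planner hint: shared_buffers + OS file cache
```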
Model Design Recommendations
This section provides recommendations and tips for designing the model for better performances.
Many to Many Relations
Many to many relations are designed as a dedicated entity, which implies storage and processing overhead: More complex SQL is automatically generated when querying and manipulating data from the two entities related by a many-to-many relation.
Tip |
---|
Only create a many-to-many relationship between entities when strictly necessary. As a general rule, avoid over-engineering the model. |
Fuzzy Matching Entities
An entity using Fuzzy Matching always runs the matching and consolidation phases of the certification process, which are processing-intensive.
Tip |
---|
Only use Fuzzy Matching for entities that really have a fuzzy match and merge requirement: customers, contacts, etc. Avoid using fuzzy matching for Addresses, Phones, etc., or reference data entities. |
Data Integration (ETL) Process
The Data Integration process (using an ETL, ESB, etc.) loads data into or consumes data from the xDM data locations. It sometimes causes a large part of the delay in the data chain.
When assessing performance issues, make sure to separate the Data Integration Time (before you Submit data to xDM) from the Certification Process Time (when you actually Submit data to xDM) when reviewing the complete data processing time.
Tip |
---|
The Data Integration time does not depend on Semarchy xDM. If this integration time is a substantial part of the data integration chain, consider optimizing your data integration flow. |
Certification Process
The certification process converts Source Data loaded into the xDM data locations (by the Data Integration process) to Golden Data.
If the certification process execution is too long, it is important to identify precisely the performance bottlenecks by executing the query to Export Semarchy xDM Integration Batch Logs.
This query should help you identify specific phases or tasks in the certification process that take most of the time. The following sections give you tips for optimizing these phases.
Enrichment or Validation
SemQL Using PL/SQL or PL/pgSQL
A very common cause of performance issues is the use of PL/SQL (Oracle) or PL/pgSQL (PostgreSQL) functions as part of the enrichment and validation phases.
Built-in database functions are usually extremely optimized, whereas user-defined functions are scripts that run once for each row. A poorly coded function has a dramatic impact on performance.
Tip |
---|
Make sure to use PL/SQL or PL/pgSQL only when necessary, and do not try to rewrite existing built-in database functions. |
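To illustrate the point above with a sketch (the table, column, and wrapper function names here are hypothetical), prefer built-in functions over a user-defined wrapper that the database must invoke once per row:

```sql
-- Slower: a user-defined PL/SQL or PL/pgSQL wrapper, called once per row
-- select my_schema.normalize_name(CUSTOMER_NAME) from MI_CUSTOMER;

-- Faster: equivalent built-in functions, which the engine can optimize inline
select upper(trim(CUSTOMER_NAME)) from MI_CUSTOMER;
```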
Plug-Ins
xDM built-in plug-in enrichers and validations are optimized for fast execution.
Parallel Execution
By default, plug-ins process one record at a time. It is possible to launch several threads to process multiple records at the same time, reducing the number of read/write interactions with the database. This is done by configuring the Thread Pool Size in the Enricher or Validation.
A typical value between 4 and 8 is sufficient in most cases.
Warnings:
- Using a Thread Pool Size greater than 1 means that the plug-in must be thread-safe. xDM built-in plug-ins are thread-safe, but user-designed plug-ins might not be.
- Increasing the Thread Pool Size increases the application server load, as the server processes multiple instances of the plug-in at the same time.
- When calling an external service from the plug-in, pay attention to possible limitations or throttling limits of the service.
When piping plug-in enrichers using the PARAM_AGGREGATE_JOB_PLUGIN_ENRICHERS job parameter, make sure to align the thread pool sizes in the chain of enrichers.
User Plug-ins
xDM allows you to code and use your own plug-ins. When designing your own plug-ins, make sure to take into account the execution time of your plug-in code, as well as the response time, network latency or throttling when using external services (e.g.: Google Maps) or APIs.
Built-in Plug-ins
The following plug-ins have performance features or considerations to take into account:
- Email Validator: The email plug-in uses a local cache of known domain names to avoid repeating useless MX record lookups for domain name validation. This cache is populated during the first plug-in execution using an MX record lookup for each email domain in the dataset. Subsequent executions favor the cache over the MX lookup.
- Avoid dropping the table storing the cache.
- Review the Offline Mode and Processing Mode parameters for tuning the use of this cache.
- Lookup Enricher: This enricher uses a query or table for looking up information for each enriched record.
- Make sure that access to the table is fast and that the query executes quickly (<40 ms per call), or use the Cache Lookup Data parameter to load the lookup table in memory. This second option is suitable for reasonably small tables (it is a tradeoff between the application memory load and the database query speed).
- Phonetic transformation: Avoid using custom metaphone functions. Use the Text plugin enricher with METAPHONE Phonetic Transformation instead.
Matching
Overmatch
A common cause of a performance bottleneck in the Matching phase is Overmatching.
Overmatching consists of creating large clusters of matching records. As an example, a 1,000-record cluster means 1M matching pairs to consider (records in the DU table), making this cluster impossible to manage (manually or automatically).
Overmatch Symptoms
- Temporary Tablespace full (unable to extend) while writing to DU table.
- In the job log:
- The Match Incoming Record with <rule> task takes a long time.
- The Match and Find Duplicates > <entity> > Compute Groups > Compute Transitive Closure for System Dups task takes a long time.
Troubleshooting
Running or profiling the matching and/or binning rules directly in SQL can identify which part of the rule causes the issue. You can also compare the record counts of the DU table and the MI table to see if there is overmatching.
Info |
---|
Here is an example of troubleshooting an overmatch situation on a phone number. It allowed us to identify a placeholder value in the standardized phone number that created a massive cartesian product. Run the query below to learn how many records were inserted into the DU_CONTACT table by the previous matching rules, and compare it with the current record count of MI_CONTACT. This query should allow you to pinpoint any oddities in the MI_CONTACT table: select STD_PRIMARY_PHONE, count(*) as bin_size from MI_CONTACT group by STD_PRIMARY_PHONE order by bin_size desc |
Solution
- Avoid using attributes containing default or placeholder values in binning or matching expressions. Null is always different from anything, so null values are usually not an issue. Typical causes of default/placeholder values: replacing non-existing values with spaces, dummy, or default values.
- Fix the wrong data (for example, replace the placeholder value with an enriched value) or fix the rule to handle the placeholder/wrong data properly.
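As a sketch of the data fix (the column name and placeholder values below are assumptions), an enricher expression can nullify known placeholders so they never form a giant bin; null values, as noted above, do not match anything:

```sql
-- Hypothetical enricher expression: map placeholder phone values to NULL
case
  when STD_PRIMARY_PHONE in ('0000000000', '9999999999', 'N/A') then null
  else STD_PRIMARY_PHONE
end
```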
Match Rules Issues
The following issues are a common source of performance problems in the matching process.
- Using Transformations in Matching Rules
- Avoid functions transforming data (SOUNDEX, UPPER, etc. included) in match/binning rules.
- Reason: these functions may prevent index usage, and they are performed every time the record is compared.
- Solution: materialize these values into attributes via enrichers.
- Use Fuzzy Matching Functions with Care
- Distance and Distance Similarity functions are the most costly functions. Sometimes, materializing a phonetized value with an enricher and then comparing with equality gives functionally equivalent results.
- Very Large Complex Rules
- Avoid one big matching rule.
- Each rule should address a functionally consistent set of attributes.
- Avoid joining to parent entities
- If you have large data sets, you may see very bad performance if your match rule looks up the parent entity to find attribute values for matching.
- For example, the match rule Record1.Account.CustomerNumber = Record2.Account.CustomerNumber took 4 hours to run for 1 million records. When we updated the child entity to enrich the CustomerNumber from the parent into an attribute on the child entity, the match time went down to 2 minutes. Another option is to use Record1.FID_Account = Record2.FID_Account if it satisfies business requirements.
- Consider Indexing
- For very large volumes, add an index on the significant columns involved in the binning, then one index for the columns of the matching rule. e.g.:
create index USR_<indexName> on MI_<entity> (<columns involved in matching, with those having more distinct values first>, B_PUBID, B_SOURCEID, B_CLASSNAME);
Have you done truncate and full reload operations recently? One of the standard maintenance operations, in this case, is to rebuild all of the indexes with:
alter index [index name] rebuild online; /* Oracle syntax */
The code to do this for all indexes can easily be generated using a select on the user_indexes view.
To rebuild the indexes on the entire database in PostgreSQL, you can use:
REINDEX DATABASE database_name; /* PostgreSQL syntax */
If you need to reindex just a specific table, you can use:
REINDEX TABLE table_name; /* PostgreSQL syntax */
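The Oracle generation step mentioned above can be sketched as follows (run as the data location schema owner, and review the generated statements before executing them):

```sql
-- Generate one "rebuild online" statement per index in the current schema
select 'alter index ' || index_name || ' rebuild online;' as ddl
from user_indexes
order by index_name;
```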
- Detect overmatching possibilities:
- To detect potential huge clusters, this query should let you know whether your binning needs more exact-match criteria to reduce cluster sizes (replace the column names with the columns you are matching on):
/* For SQL Server */
select stdaddress1, stdzip_code, count(*) as cnt
from mi_contact
group by stdaddress1, stdzip_code
order by count(*) desc
- Use this query if you're comparing cluster sizes and the total number of calls to SEM_EDIT_DISTANCE:
/* For SQL Server */
select sum(cnt * cnt)
from (
select stdaddress1, stdzip_code, count(*) as cnt
from mi_contact
group by stdaddress1, stdzip_code
) derived
- SEM_EDIT_DISTANCE_SIMILARITY and SEM_EDIT_DISTANCE functions perform poorly on SQL Server (as of v5.2):
- If you are using the SEM_EDIT_DISTANCE_SIMILARITY or SEM_EDIT_DISTANCE functions in your match rules, they could perform poorly on SQL Server. One of our customers migrating from PostgreSQL to SQL Server faced a huge performance regression during the migration (see SUPPORT-9524).
- The workaround they found is interesting:
- They found a CLR procedure on the web for the Levenshtein distance calculation (in this case, this one: https://github.com/DanHarltey/Fastenshtein).
- They compiled the C# code into a DLL and installed it on the SQL Server database.
- They created a database function with this script:
sp_configure 'clr enabled', 1;
RECONFIGURE;
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'clr strict security', 0;
RECONFIGURE;
CREATE ASSEMBLY FastenshteinAssembly
FROM 'E:\INTEND_INSTALL\CLR\Fastenshtein.dll'
WITH PERMISSION_SET = SAFE
GO
CREATE FUNCTION SEM_LEVENSHTEIN_CUST(@value1 [nvarchar](MAX), @value2 [nvarchar](MAX))
RETURNS [int]
AS EXTERNAL NAME [FastenshteinAssembly].[Fastenshtein.Levenshtein].[Distance]
GO
- Then, they declared SEM_LEVENSHTEIN_CUST in xDM and used it in the match rules instead of the native SEM_EDIT_DISTANCE function.
- Results: the matching rule execution completes in a few seconds instead of many minutes (for only 4,000 records).
- The SEM_NGRAMS_SIMILARITY function performs poorly on SQL Server (as of v5.0):
- If you are using the SEM_NGRAMS_SIMILARITY function in your match rules, it could perform poorly on SQL Server due to the lack of support for native matching functions in SQL Server. Hilaire came up with an alternative version of this function that should allow you to process the same number of calls roughly 10 times faster (same algorithm, but fewer nested function calls and optimized code). Compile the function, noting the USR_ prefix, and call USR_SEM_NGRAMS_SIMILARITY directly in the match rules.
/* For SQL Server only: 10x faster execution of SEM_NGRAMS_SIMILARITY */
create or alter function [dbo].[USR_SEM_NGRAMS_SIMILARITY] (
  @P_STRING1 nvarchar(max),
  @P_STRING2 nvarchar(max),
  @P_NGRAM_LEN int = 2
)
RETURNS int
AS
BEGIN
  DECLARE @C1 int, @C2 int, @RES int
  DECLARE @P_TAB1 SEM_SPLIT_TBL, @P_TAB2 SEM_SPLIT_TBL
  INSERT INTO @P_TAB1 SELECT * from [dbo].SEM_SPLIT_NGRAMS(@P_STRING1, @P_NGRAM_LEN);
  SET @C1 = @@ROWCOUNT
  INSERT INTO @P_TAB2 SELECT * from [dbo].SEM_SPLIT_NGRAMS(@P_STRING2, @P_NGRAM_LEN);
  SET @C2 = @@ROWCOUNT
  SELECT @RES = case when @C1 = 0 or @C2 = 0 then 0
                     else cast(2.0 * 100 * count(*) / (@C1 + @C2) as int) end
  from @P_TAB1 L1 inner join @P_TAB2 L2 on L1.STR = L2.STR
  RETURN @RES;
END
GO
- To estimate the time required to execute the step, you can compute the theoretical number of calls to SEM_NGRAMS_SIMILARITY using the binning expression of your first matching rule based on SEM_NGRAMS_SIMILARITY as follows:
select sum (CNT * CNT) as NB_CALLS
from (
select STD_PRIMARY_PHONE, count(*) as CNT
from MI_CONTACT
group by STD_PRIMARY_PHONE
) derived
...
- Symptom: the "ORA-01467: sort key too long" error.
- Solution: set
alter session set "_windowfunc_optimization_settings" = 128;
as the session-initialization SQL (initSQL) for the connection pool in the datasource configuration. Example:
<Resource name="jdbc/SEMARCHY_MDM"
  username="DATA_LOCATION_USER"
  password="DATA_LOCATION_PASSWORD"
  url="jdbc:oracle:thin:@rdmagdev.lcke3gss.eu-west-1.rds.amazonaws.com:1521:ORCL"
  auth="Container"
  factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"
  type="javax.sql.DataSource"
  driverClassName="oracle.jdbc.OracleDriver"
  maxActive="8"
  maxIdle="8"
  minIdle="0"
  maxWait="15000"
  initialSize="1"
  defaultAutoCommit="false"
  validationQuery="select 1 from dual"
  testOnBorrow="true"
  logValidationErrors="true"
  timeBetweenEvictionRunsMillis="45000"
  validationInterval="60000"
  initSQL="alter session set &quot;_windowfunc_optimization_settings&quot; = 128" />
- Solution:
Publish Certified Golden Data
...