Table of Contents | ||||
---|---|---|---|---|
|
...
For Semarchy xDM:
- Maximum Memory
-
Xmx
value must be at least 4 Gb. - The OS should have at least 8 Gb of RAM.
...
- Using Transformations in Matching Rules
Avoid functions transforming data (SOUNDEX, UPPER, etc. included) in match/binning rules.- Reasons:
- May cause an issue on the Indexes. These functions are performed every time the record is compared.
- Solution
- Materialize these values into attributes via enrichers
- Reasons:
- Use Fuzzy Matching Functions with Care
- Distance and Distance Similarity functions are the most costly functions.
Sometimes, materializing a phonetized value with enricher then comparing with equality gives functionally equivalent results.
- Distance and Distance Similarity functions are the most costly functions.
- Very Large Complex Rules
- Avoid one big matching rule.
- Each rule should address a functionality consistent with a set of attribute data.
- Avoid joining to parent entities
- If you have large data sets, you may have very bad performance if your match rule looks up the parent entity to find attribute values for matching.
- For example, this match rule
Record1.Account.CustomerNumber = Record2.Account.CustomerNumber
took 4 hours to run for 1 million records. When we updated the child entity to enrich the CustomerNumber from the parent into an attribute on the child entity, the match time went down to 2 minutes. Another option is to useRecord1.FID_Account = Record2.FID_Account
if it satisfies business requirements.
- Consider Indexing
- For very large volumes, adding an index on the significant columns involved in the binning, then one index for the columns matching rule.
e.g.:create index USR_<indexName> on MI_<entity> (
B_BATCHID,B_BRANCHID,B_CLASSNAME, B_PUBID, B_SOURCEID,<columns involved matching, with those having more distinct values first>
);-- Remove BranchID for v4.0 and above
,
B_PUBID, B_SOURCEID,
);B_CLASSNAME
Have you done truncate and full reload operations recently?
One of the standard maintenance operations, in this case, is to rebuild all of the indexes with:alter index [index name] rebuild online; /* Oracle syntax */
The code to do this for all indexes can be easily generated using a select on the user_indexes view.To rebuild the indexes on the entire database in PostgreSQL, you can use the Postgres statement
REINDEX DATABASE database_name;
/* PostgreSQL syntax */
If you need to just reindex a specific table you can useREINDEX TABLE table_name;
/* PostgreSQL syntax */
- For very large volumes, adding an index on the significant columns involved in the binning, then one index for the columns matching rule.
- Detect over matching possibilities:
- To detect potential huge clusters, this query should let you know if your query needs to do more exact matching for binning to reduce cluster sizes when performing exact matches (replace the column names with the columns you are matching on):
/* For SQL Server */
select stdaddress1, stdzip_code, count(*) as cnt
from mi_contact
group by stdaddress1, stdzip_code
order by count(*) desc - Use this query if you're comparing cluster sizes and total number of calls to
SEM_EDIT_DISTANCE
:/* For SQL Server */
select sum(cnt * cnt)
from (
select stdaddress1, stdzip_code, count(*) as cnt
from mi_contact
group by stdaddress1, stdzip_code
) derived
SEM_EDIT_DISTANCE_SIMILARITY
andSEM_EDIT_DISTANCE
functions on SQL Server performs poorly (as of v5.2):- If you are using the
SEM_EDIT_DISTANCE_SIMILARITY or
functions in your match rules, it could be performing poorly on SQL Server. One of our customer migrating from PostgreSQL to SQL Server was facing huge performance regression during the migration (SEM_EDIT_DISTANCE
).Jira Legacy server System JIRA serverId 7db41d32-f010-38fc-b966-a5c06dc46fba key SUPPORT-9524 - The workaround he found is interesting:
- He found a CLR procedure on the web for the Levenshtein distance calculation (in this case, this one : https://github.com/DanHarltey/Fastenshtein)
- He has compiled the c# code in a dll and install it on the SQL Server database
- He has created a database function with this script :
Code Block sp_configure 'clr enabled',1 RECONFIGURE EXEC sp_configure 'show advanced options',1; RECONFIGURE; EXEC sp_configure 'clr strict security',0; RECONFIGURE; CREATE ASSEMBLY FastenshteinAssembly FROM 'E:\INTEND_INSTALL\CLR\Fastenshtein.dll' WITH PERMISSION_SET = SAFE CREATE FUNCTION SEM_LEVENSHTEIN_CUST(@value1 [nvarchar](MAX),@value2 [nvarchar](MAX)) RETURNS[int] AS EXTERNAL NAME[FastenshteinAssembly].[Fastenshtein.Levenshtein].[Distance] GO
- Then, he has declared SEM_LEVENSHTIEN_CUST in xDM and has used it in the match rules, instead of native SEM_EDIT_DISTANCE function.
- Results : the Matching rule execution is completed in a few seconds instead of many minutes (for only 4 000 records)
- If you are using the
...
- Symptom: "ORA-01467: Sort Key too long" issue
- Solution:
alter session set "_windowfunc_optimization_settings" = 128;
in session initializing for the connection pool in the datasource configuration. Example:
Code Block language xml <Resource name="jdbc/SEMARCHY_MDM" username="DATA_LOCATION_USER" password="DATA_LOCATION_PASSWORD" url="jdbc:oracle:thin:@rdmagdev.lcke3gss.eu-west-1.rds.amazonaws.com:1521:ORCL" auth="Container" factory="org.apache.tomcat.jdbc.pool.DataSourceFactory" type="javax.sql.DataSource" driverClassName="oracle.jdbc.OracleDriver" maxActive="8" maxIdle="8" minIdle="0" maxWait="15000" initialSize="1" defaultAutoCommit="false" validationQuery="select 1 from dual" testOnBorrow="true" logValidationErrors="true" timeBetweenEvictionRunsMillis="45000" validationInterval="60000" initSQL="alter session set "_windowfunc_optimization_settings" = 128" />
- Solution:
...