Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents
maxLevel2
stylesquare

...

For Semarchy v4.x:

  • Maximum Memory -Xmx value must be at least 4 Go.
  • The OS should have at least 8 Go of RAM.

...

  • Using Transformations in Matching Rules
    Avoid functions transforming data (SOUNDEX, UPPER, etc. included) in match/binning rules.
    • Reasons:
      • May cause an issue on the Indexes. These functions are performed every time the record is compared.  
    • Solution
      • Materialize these values into attributes via enrichers
  • Use Fuzzy Matching Functions with Care
    • Distance and Distance Similarity functions are the most costly functions.
      Sometimes, materializing a phonetized value with enricher then comparing with equality gives functionally equivalent results.
  • Very Large Complex Rules
    • Avoid one big matching rule.
    • Each rule should address a functionality consistent with a set of attribute data.
  • Avoid joining to parent entities
    • If you have large data sets, you may have very bad performance if your match rule looks up the parent entity to find attribute values for matching.  
    • For example, this match rule Record1.Account.CustomerNumber = Record2.Account.CustomerNumber took 4 hours to run for 1 million records. When we updated the child entity to enrich the CustomerNumber from the parent into an attribute on the child entity, the match time went down to 2 minutes. Another option is to use Record1.FID_Account = Record2.FID_Account if it satisfies business requirements.
  • Consider Indexing
    • For very large volumes, adding an index on the significant columns involved in the binning, then one index for the columns matching rule.
      e.g.: 
      create index USR_<indexName> on MI_<entity> (B_BATCHID, B_BRANCHID, B_CLASSNAME, B_PUBID, B_SOURCEID, <columns involved matching, with those having more distinct values first>);
      -- Remove BranchID for v4.0 and above
    • Have you done truncate and full reload operations recently?
      One of the standard maintenance operations, in this case, is to rebuild all of the indexes with:

      alter index [index name] rebuild online; /* Oracle syntax */
      The code to do this for all indexes can be easily generated using a select on the user_indexes view.
    • To rebuild the indexes on the entire database in PostgreSQL, you can use the Postgres statement

      REINDEX DATABASE database_name; /* PostgreSQL syntax */


      If you need to just reindex a specific table you can use
      REINDEX TABLE table_name;  /* PostgreSQL syntax */

  • Detect over matching possibilities:
    • To detect potential huge clusters, this query should let you know if your query needs to do more exact matching for binning to reduce cluster sizes when performing exact matches (replace the column names with the columns you are matching on):
      /* For SQL Server */
      select stdaddress1, stdzip_code, count(*) as cnt
      from mi_contact
      group by stdaddress1, stdzip_code
      order by count(*) desc
    • Use this query if you're comparing cluster sizes and total number of calls to SEM_EDIT_DISTANCE
      /* For SQL Server */
      select sum(cnt * cnt)
      from (
      select stdaddress1, stdzip_code, count(*) as cnt
      from mi_contact
      group by stdaddress1, stdzip_code
      ) derived

  • SEM_NGRAMS_SIMILARITY function on SQL Server performs poorly (as of v5.0):
    • If you are using the SEM_NGRAMS_SIMILARITY function in your match rules, it could be performing poorly on SQL Server due to the lack of support for native matching functions in SQL Server. Hilaire came up with an alternative version of this function that should allow you to process the same number of calls roughly 10 times faster (same algorithm, but less nested function calls and optimized code)
    • Compile the function and note that there is the USR_ prefix. Call the function USR_SEM_NGRAMS_SIMILARITY directly in the match rules.

      Code Block
      /* For SQL Server only: 10x faster execution of the SEM_NGRAMS_SIMILARITY on SQL Server */
      
      create  or alter function [dbo].[USR_SEM_NGRAMS_SIMILARITY]
      (
         @P_STRING1 nvarchar(max),
         @P_STRING2 nvarchar(max),
         @P_NGRAM_LEN int = 2
      )
      RETURNS int
      AS
      BEGIN
      DECLARE @C1 int, @C2 int, @RES int
      DECLARE @P_TAB1 SEM_SPLIT_TBL, @P_TAB2 SEM_SPLIT_TBL
          INSERT INTO @P_TAB1 SELECT * from [dbo].SEM_SPLIT_NGRAMS(@P_STRING1, @P_NGRAM_LEN);
      Set @C1 = @@ROWCOUNT
          INSERT INTO @P_TAB2 SELECT * from [dbo].SEM_SPLIT_NGRAMS(@P_STRING2, @P_NGRAM_LEN);
      Set @C2 = @@ROWCOUNT
      select @RES = case when @C1 = 0 or @C2= 0 then 0 else cast(2.0 * 100 * count(*) / (@C1 + @C2) as int) end
      from @P_TAB1 L1 inner join @P_TAB2 L2 on L1.STR = L2.STR
          RETURN @RES;
      END
      
      GO


    • To estimate the time required to execute the step, you can compute the theoretical number of calls to SEM_NGRAMS_SIMILARITY using the binning expression of your first matching rule based on SEM_NGRAMS_SIMILARITY as follows:
      select sum (CNT * CNT) as NB_CALLS 
      from (
      select STD_PRIMARY_PHONE, count(*) as CNT
      from MI_CONTACT
      group by STD_PRIMARY_PHONE
      ) derived

...

  • Symptom: "ORA-01467: Sort Key too long" issue
    • Solution: alter session set "_windowfunc_optimization_settings" = 128; in session initializing for the connection pool in the datasource configuration.
    • Example:

      Code Block
      languagexml
      <Resource name="jdbc/SEMARCHY_MDM" 
        username="DATA_LOCATION_USER" password="DATA_LOCATION_PASSWORD" 
        url="jdbc:oracle:thin:@rdmagdev.lcke3gss.eu-west-1.rds.amazonaws.com:1521:ORCL" auth="Container"
        factory="org.apache.tomcat.jdbc.pool.DataSourceFactory" type="javax.sql.DataSource" driverClassName="oracle.jdbc.OracleDriver"
        maxActive="8" maxIdle="8" minIdle="0" maxWait="15000" initialSize="1" defaultAutoCommit="false" 
        validationQuery="select 1 from dual" testOnBorrow="true" 
        logValidationErrors="true" timeBetweenEvictionRunsMillis="45000" validationInterval="60000" 
        initSQL="alter session set "_windowfunc_optimization_settings" = 128" 
      />


...