...

For Semarchy v4.x:

  • The Maximum Memory -Xmx value must be at least 4 GB.
  • The OS should have at least 8 GB of RAM.

...

Many-to-many relations are designed as a dedicated entity, which implies storage and processing overhead: more complex SQL is automatically generated when querying and manipulating data from the two entities related by a many-to-many relation.

Tip

Only create a many-to-many relationship between entities when strictly necessary. As a general rule, avoid over-engineering the model.
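As a sketch of this overhead, a many-to-many relation modeled as a dedicated link entity requires an extra join in the generated queries. All table and column names below are hypothetical:

```sql
-- Hypothetical schema: PRODUCT and SUPPLIER related through a dedicated
-- PRODUCT_SUPPLIER link entity. Reading across the relation now needs
-- two joins instead of one.
SELECT p.PRODUCT_NAME, s.SUPPLIER_NAME
FROM GD_PRODUCT p
JOIN GD_PRODUCT_SUPPLIER ps ON ps.F_PRODUCT = p.PRODUCT_ID
JOIN GD_SUPPLIER s ON s.SUPPLIER_ID = ps.F_SUPPLIER;
```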

...

The Data Integration process (using an ETL, ESB, etc.) loads data into, or consumes data from, the xDM data locations. It sometimes causes a large part of the delay in the data chain.
When assessing performance issues, make sure to separate the Data Integration Time (before you submit data to xDM) from the Certification Process Time (when you actually submit data to xDM) when reviewing the complete data processing time.

Tip

The Data Integration time does not depend on Semarchy xDM. If this integration time is a substantial part of the data integration chain, consider optimizing your data integration flow.

...

This query should help you identify specific phases or tasks in the certification process that take the most time. The following sections give you tips for optimizing these phases.

...

Built-in database functions are usually highly optimized, whereas user-defined functions are scripts that run once for each row. A poorly coded function has a dramatic impact on performance.

Tip

Make sure to use PL/SQL or PL/pgSQL only when necessary, and do not try to rewrite existing database built-in functions.
Assess the execution time of the function, and assume that a function should never take more than 40 ms per call. This number allows a throughput of 25 executions per second.
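To check a function against that 40 ms budget, a quick benchmark can be run directly in the database. A minimal sketch for Oracle PL/SQL, where MY_FUNC and its sample input are hypothetical placeholders for your own function:

```sql
-- Hypothetical benchmark: average execution time per call of a
-- user-defined function MY_FUNC over 1,000 invocations.
SET SERVEROUTPUT ON
DECLARE
  t0    NUMBER;
  t1    NUMBER;
  dummy VARCHAR2(4000);
BEGIN
  t0 := DBMS_UTILITY.GET_TIME;              -- hundredths of a second
  FOR i IN 1 .. 1000 LOOP
    dummy := MY_FUNC('sample input');       -- replace with your function
  END LOOP;
  t1 := DBMS_UTILITY.GET_TIME;
  -- (t1 - t0) is in 1/100 s; x10 gives total ms, /1000 gives ms per call.
  DBMS_OUTPUT.PUT_LINE('Avg ms/call: ' || ROUND((t1 - t0) * 10 / 1000, 2));
END;
/
```

If the average is well above 40 ms per call, consider rewriting the logic with built-in functions or set-based SQL.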

...

By default, plugins process one record at a time. It is possible to launch several threads to process multiple records at the same time, reducing the number of read/write interactions with the database. This is performed by configuring the Thread Pool Size in the Enricher or Validation.
A typical value between 4 and 8 is sufficient in most cases.

Warnings:

  • Using a Thread Pool Size greater than 1 requires that the plug-in is thread-safe. xDM built-in plug-ins are thread-safe, but user-designed plug-ins might not be.
  • Increasing the Thread Pool Size increases the application server load, as it processes multiple instances of the plugin at the same time.
  • When calling an external service with the plugin, pay attention to possible limitations or throttling limits of the service.
  • When piping plugin enrichers using the PARAM_AGGREGATE_JOB_PLUGIN_ENRICHERS job parameter, make sure to align the thread pool sizes in the chain of enrichers.

...

  • Email Validator: The email plug-in uses a local cache of known domain names to avoid repeating useless MX record lookups for domain name validation. This cache is populated during the first plug-in execution using an MX record lookup for each email domain in the dataset. Subsequent executions favor the cache over the MX lookup.
    • Avoid dropping the table storing the cache.
    • Review the Offline Mode and Processing Mode parameters for tuning the use of this cache.
  • Lookup Enricher: This enricher uses a query or table for looking up information for each enriched record.
    • Make sure that access to the table and the query are fast (<40 ms/call), or use the Cache Lookup Data parameter to load the lookup table in memory. This second option is suitable for reasonably small tables (it is a tradeoff between the application memory load and the database query speed).
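If you keep querying the database rather than caching, an index on the lookup key usually keeps the per-call time within budget. A hypothetical sketch (table and column names are examples only):

```sql
-- Hypothetical lookup table keyed by COUNTRY_CODE: an index on the
-- lookup key lets each per-record query complete well under 40 ms.
CREATE INDEX IDX_COUNTRY_REF_CODE ON COUNTRY_REF (COUNTRY_CODE);
```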

...

  • Using Transformations in Matching Rules
    Avoid functions transforming data (SOUNDEX, UPPER, etc.) in match/binning rules.
    • Reasons:
      • They may prevent the use of indexes, and these functions are executed every time a record is compared.
    • Solution
      • Materialize these values into attributes via enrichers.
  • Use Fuzzy Matching Functions with Care
    • Distance and Distance Similarity functions are the most costly functions.
      Sometimes, materializing a phonetized value with an enricher and then comparing with equality gives functionally equivalent results.
  • Very Large Complex Rules
    • Avoid one big matching rule.
    • Each rule should address a functionally consistent set of attribute data.
  • Consider Indexing
    • For very large volumes, add an index on the significant columns involved in binning, then one index for the columns involved in each matching rule.
      e.g.: 
      create index USR_<indexName> on MI_<entity> (B_BATCHID, B_BRANCHID, B_CLASSNAME, B_PUBID, B_SOURCEID, <columns involved in matching, with those having more distinct values first>);
      -- Remove B_BRANCHID for v4.0 and above

Issues in Other Certification Phases

...

  • Symptom: "ORA-01467: Sort Key too long" error
    • Solution: run alter session set "_windowfunc_optimization_settings" = 128; in the session initialization for the connection pool in the datasource configuration.

...

Turning off logging can significantly speed up processing, but no Semarchy activity events will be logged.

Explain Plans

Collect an explain plan and analyze it to determine whether a specific query is taking too long.

  1. Identify the step that is taking a very long time from the integration job logs. 
  2. Get the query. 
  3. Run an explain plan in your SQL Client.
    1. For Oracle Explain Plan.  
    2. For PostgreSQL Explain Plan.
  4. Send the explain plan to Semarchy Support for help analyzing its performance.
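The steps above can be sketched as follows; the query shown is a hypothetical placeholder for the slow query taken from the integration job logs:

```sql
-- Oracle: compute the plan, then display it.
EXPLAIN PLAN FOR
  SELECT * FROM MI_CUSTOMER WHERE B_BATCHID = 42;  -- hypothetical slow query
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- PostgreSQL: EXPLAIN ANALYZE also executes the query and reports
-- actual row counts and timings alongside the plan.
EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM MI_CUSTOMER WHERE B_BATCHID = 42;
```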