...
...
...
...
...
...
...
...
...
...
Server Configuration and Sizing
...
If the certification process takes too long, it is important to identify the performance bottlenecks precisely by executing the query to Export Semarchy xDM Integration Batch Logs.
...
xDM built-in enricher and validation plug-ins are optimized for fast execution.
When designing your own plug-ins
...
- Application Server load increases.
- Call to external service possibly limited or throttled.
- Align thread pool size in case of chained plugin enrichers
...
Parallel Execution
By default, plug-ins process one record at a time. It is possible to launch several threads to process multiple records at the same time, reducing the number of read/write interactions with the database. This is done by configuring the Thread Pool Size in the Enricher or Validation.
A value between 4 and 8 is sufficient in most cases.
Warnings:
- Using a Thread Pool Size greater than 1 requires the plug-in to be thread-safe. xDM built-in plug-ins are thread-safe, but user-designed plug-ins might not be.
- Increasing the Thread Pool Size increases the application server load, as it processes multiple instances of the plug-in at the same time.
- When calling an external service from the plug-in, pay attention to possible limits or throttling of the service.
When piping plug-in enrichers using the PARAM_AGGREGATE_JOB_PLUGIN_ENRICHERS job parameter, make sure to align the thread pool size in the chain of enrichers.
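The interaction between the Thread Pool Size and a throttled external service can be estimated with simple arithmetic. The following is a rough sketch, not an xDM API: the function name, its parameters, and the assumption that every call takes the same time are all hypothetical. It only illustrates why raising the pool size stops helping once the service limit is reached.

```python
from typing import Optional


def estimated_throughput(
    pool_size: int,
    seconds_per_call: float,
    service_limit_per_second: Optional[float] = None,
) -> float:
    """Rough upper bound on records processed per second.

    Assumes (hypothetically) that each record triggers exactly one call,
    that every call takes `seconds_per_call`, and that threads never wait
    on each other.
    """
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    if seconds_per_call <= 0:
        raise ValueError("seconds_per_call must be positive")

    # Each thread completes 1 / seconds_per_call records per second.
    local_rate = pool_size / seconds_per_call

    # A throttled service caps the rate regardless of the pool size.
    if service_limit_per_second is None:
        return local_rate
    return min(local_rate, service_limit_per_second)


single_thread = estimated_throughput(1, 0.25)
eight_threads = estimated_throughput(8, 0.25)
throttled = estimated_throughput(8, 0.25, service_limit_per_second=20)
```

With a 250 ms call, going from 1 to 8 threads raises the estimate from 4 to 32 records per second, but a service limited to 20 calls per second caps it at 20, so a pool size above 5 would only add application server load.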
User Plug-ins
xDM allows you to code and use your own plug-ins. When designing them, take into account the execution time of your plug-in code, as well as the response time, network latency, and throttling of any external services (e.g., Google Maps)
...
or APIs.
Built-in Plug-ins
The following plug-ins have performance features or considerations to take into account:
- Email Validator: The Email Validator plug-in uses a local cache of known domain names (the EXT_MAIL_DOMAINS table) to avoid repeating useless MX Record lookups for domain name validation. This cache is populated during the first plug-in execution, using an MX Record lookup for each distinct email domain in the dataset. Subsequent executions favor the cache over the MX lookup.
  - Avoid dropping or truncating the table storing the cache.
  - Review the Offline Mode and Processing Mode parameters for tuning the use of this cache.
- Lookup Enricher: This enricher uses a query or table for looking up information for each enriched record.
  - Make sure that access to the table or the query is fast (<40ms/call), or use the Cache Lookup Data parameter to load the lookup table in memory. This second option is suitable for reasonably small tables (it is a tradeoff between the application memory load and the database query speed).
Matching
Overmatch
A common cause of a performance bottleneck in the Matching phase is Overmatching.
...
Overmatching consists of creating large clusters of matching records. For example, a 1,000-record cluster means about 1M matching pairs to consider (records in the DU table), making this cluster impossible to manage (manually or automatically).
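The quadratic growth behind this figure can be checked with a few lines of arithmetic. This is a minimal sketch with hypothetical function names. The 1M figure matches the count of ordered pairs (each pair counted in both directions); whether the DU table actually stores pairs that way is an assumption here, and with unordered pairs the count is halved.

```python
def ordered_pairs(cluster_size: int) -> int:
    """Number of (a, b) pairs with a != b, counting both directions."""
    return cluster_size * (cluster_size - 1)


def unordered_pairs(cluster_size: int) -> int:
    """Number of distinct pairs {a, b}, counting each pair once."""
    return cluster_size * (cluster_size - 1) // 2


# One overmatched cluster of 1,000 records.
one_large_cluster = ordered_pairs(1000)

# The same 1,000 records split into 10 clusters of 100 records each.
ten_small_clusters = 10 * ordered_pairs(100)
```

One cluster of 1,000 records yields 999,000 ordered pairs, whereas the same records spread over ten clusters of 100 yield 99,000, which is why a single placeholder value that glues records together is so costly.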
Overmatch Symptoms
- Temporary tablespace full (unable to extend) while writing to the DU table.
- In the job log:
  - The Match Incoming Record with <rule> task takes a long time.
  - The Match and Find Duplicates > <entity> > Compute Groups > Compute Transitive Closure for System Dups task takes a long time.
Troubleshooting
If you run or profile the matching and/or binning rules using SQL, you can identify which part of the rule causes the issue.
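As an illustration of this kind of profiling, the sketch below groups records by a binning expression and lists the largest bins with the number of candidate pairs each one generates. It runs against an in-memory SQLite database so that it is self-contained; the `mi_contact` table, its columns, and the `profile_bins` helper are hypothetical, and the same GROUP BY query would need to be adapted to the actual MI table and binning expression of your model.

```python
import sqlite3


def profile_bins(conn, table, bin_expression, limit=5):
    """Return the largest bins as (bin_key, records, candidate_pairs).

    `table` and `bin_expression` are inserted into the SQL text as-is,
    so they must come from a trusted source (this is a profiling helper,
    not application code).
    """
    query = (
        f"SELECT {bin_expression} AS bin_key, "
        f"COUNT(*) AS records, "
        f"COUNT(*) * (COUNT(*) - 1) AS candidate_pairs "
        f"FROM {table} "
        f"WHERE {bin_expression} IS NOT NULL "  # nulls never match
        f"GROUP BY {bin_expression} "
        f"ORDER BY records DESC "
        f"LIMIT ?"
    )
    return conn.execute(query, (limit,)).fetchall()


# Hypothetical sample data: 50 records carry the placeholder 'UNKNOWN'
# as zip code, 20 records carry a real value (2 records per zip code).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mi_contact (b_sourceid TEXT, last_name TEXT, zip_code TEXT)"
)
rows = [(f"S{i}", f"NAME{i}", "UNKNOWN") for i in range(50)]
rows += [(f"T{i}", f"NAME{i}", f"750{i % 10:02d}") for i in range(20)]
conn.executemany("INSERT INTO mi_contact VALUES (?, ?, ?)", rows)

top_bins = profile_bins(conn, "mi_contact", "zip_code")
for bin_key, records, candidate_pairs in top_bins:
    print(bin_key, records, candidate_pairs)
```

In this sample, the placeholder bin stands out immediately: it holds 50 records and 2,450 candidate pairs, while every real zip code produces only 2.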
Solution
- Avoid using attributes containing default or placeholder values in binning or matching expressions. Null is always different from anything, so null values are usually not an issue.
  Typical causes of default/placeholder values: replacing non-existing values by spaces, dummy values, or default values.
- Fix wrong data (for example, replace a placeholder value by an enriched value), or fix the rule to properly handle the placeholder/wrong data.
Match Rules Issues
The following issues are common sources of performance problems in the matching process.
- Using Transformations in Matching Rules
  - Avoid functions transforming data (SOUNDEX, UPPER, etc.) in match/binning rules.
  - Reasons: these functions may cause issues with the indexes, and they are performed every time the record is compared.
  - Solution: materialize these values into attributes via enrichers.
- Use Fuzzy Matching Functions with Care
  - Distance and Distance Similarity functions are the most costly functions.
  - Sometimes, materializing a phonetized value with an enricher and then comparing with equality gives functionally equivalent results.
- Split Very Large Complex Rules
  - Avoid one big matching rule.
  - Each rule should address a functionally consistent set of attribute data.
- Consider Indexing
  - For very large volumes, add an index on the significant columns involved in the binning, then one index for the columns involved in the matching rule.
    e.g.: create index S_<indexName> on MI_<entity> (B_BATCHID, B_BRANCHID, B_CLASSNAME, B_PUBID, B_SOURCEID, <columns involved in matching, with those having more distinct values first>); -- Remove B_BRANCHID for v4.0 and above
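The cost of transforming data inside a match rule, compared with materializing the transformed value once per record, can be illustrated by counting function calls. This is a simplified sketch in plain Python, not how xDM executes match rules: the `normalize` function stands in for a transformation such as UPPER, and all names are hypothetical.

```python
from itertools import combinations


def normalize(value, counter):
    """Stand-in for a transformation such as UPPER; counts its own calls."""
    counter["calls"] += 1
    return " ".join(value.upper().split())


def match_with_inline_transform(names):
    """Apply the transformation inside every pairwise comparison."""
    counter = {"calls": 0}
    pairs = [
        (a, b)
        for a, b in combinations(names, 2)
        if normalize(a, counter) == normalize(b, counter)
    ]
    return pairs, counter["calls"]


def match_with_materialized_key(names):
    """Compute the transformed value once per record, then compare keys."""
    counter = {"calls": 0}
    keyed = [(name, normalize(name, counter)) for name in names]
    pairs = [
        (a, b)
        for (a, key_a), (b, key_b) in combinations(keyed, 2)
        if key_a == key_b
    ]
    return pairs, counter["calls"]


names = ["Acme Corp", "ACME  corp", "Globex", "globex", "Initech"]
inline_pairs, inline_calls = match_with_inline_transform(names)
materialized_pairs, materialized_calls = match_with_materialized_key(names)
```

Both approaches return the same two matching pairs, but the inline version calls the transformation 20 times for 5 records (twice per compared pair), while the materialized version calls it 5 times. The gap widens quadratically with the number of records compared.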
Other Certification Phases
...
- Symptom: "ORA-01467: Sort Key too long" issue
- Solution: add the following statement to the session initialization of the connection pool in the datasource configuration:
alter session set "_windowfunc_optimization_settings" = 128;
...