Server Configuration and Sizing
Review the Configuration
Before to start any action or any performance investigation, it is important to check the sizing of the servers where the Semarchy application is executed (the Application Server) and where the MDM data is stored and processed (Database Server).
Application Server
If you are using Tomcat, check the memory used using the Tomcat Manager:
- Connect to the Tomcat Manager, click Server Status in the upper right corner.
- In the Server Status screen, check the memory user by TOMCAT: Free memory, Total memory, Max memory
For Semarchy v4.x:
- Max Memory must be at least 4 Go.
- The OS should have at least 8 Go of RAM.
Database Server
Server Requirements
- CPU: 4 CPUs minimum. Plan for more CPUs when using database parallelism
- Memory: 16 Go
Database Requirements
- Oracle: Minimum size of the SGA : 8 Go or above, depending on number of records.
- PostgreSQL: Minimum size of the Cache : 8 Go or above, depending on number of records.
Model Design Recommendations
This section provides recommendations and tips for designing the model for better performances.
Many to Many Relations
Many to many relations are designed as a dedicated entity, which implies storage and processing overhead: More complex SQL is automatically generated when querying and manipulating data from the two entities related by a many to many relation.
Only create a many to many relations between entities when strictly necessary. As a general rule, avoid over-engineering the model.
Fuzzy Matching Entities
An entity using Fuzzy Matching always runs the matching and consolidation phases of the certification process, which are process intensive.
Only use Fuzzy Matching for entities that really have a fuzzy match and merge requirement: customers, contacts, etc. Avoid using fuzzy matching for Addresses, Phones, etc., or reference data entities .
Data Integration (ETL) Process
The Data Integration process (using an ETL, ESB etc.) loads data into, or consumes data from the xDM data locations. It causes sometimes a large part of the delay in the data chain.
When assessing performance issues, make sure to separate the Data Integration Time (before you Submit data to xDM) from the Certification Process Time (when you actually Submit data to xDM) when reviewing the complete data processing time.
The Data Integration time does not depending on Semarchy xDM. If this integration time is a substantial part of the data integration chain consider optimizing your data integration flow.
Certification Process
The certification process converts Source Data loaded into the xDM data locations (by the Data Integration process) to Golden Data.
If the certification process execution is too long, it is important to identify precisely the performance bottlenecks by executing the query to Export Semarchy xDM Integration Batch Logs.
This query should help you identify specific phases or tasks in the certification process that take most of the time. The following sections gives you tips for optimizing these phases.
Enrichment or Validation
SemQL using PL*SQL or PL/PGSQL
A very common cause of performance issues is the use of PL/SQL (Oracle) or PL/pgSQL (PostgreSQL) functions as part of the enrichment and validation phases.
Built-in database functions are usually extremely optimized, whereas user-defined functions are scripts that run once for each line. A poorly coded function has dramatic impact on the performece.
Make sure to use PL/SQL or PL/pbSQL only when necessary, and do not try to rewrite existing database built-in functions.
Assess the execution time of the function, and assume that a function should never have execution time greater than 40 ms per call. This number allows a throughput of 25 executions per second.
Plug-Ins
xDM built-in plug-in enrichers and validation are optimized for fast execution.
Parallel Execution
By default, plugins process one record at a time. It is possible to launch several thread to process multiple records at the time time, reducing the number of read/writes interactions with the database. This is performed by configuring the Thread Pool Size in the Enricher or Validation.
A typical value between 4~8 is sufficient is most cases.
Warnings:
- Using a Thread Pool Size greater than 1 means that the plug-in in thread-safe. xDM built-in plug-ins are thread safe, but user-designed plug-ins might not be.
- Increasing the Thread Pool Size increate the application server load as it processes multiple instances of the plugins at the same time.
- When calling to an external service with the plugin, pay attention to possible limitation or throttling limit of the service.
When piping plugin enrichers using the
PARAM_AGGREGATE_JOB_PLUGIN_ENRICHERS
job paramter, make sure to align the thread pool size in the chain of enrichers.
User Plug-ins
xDM allows you to code and use your own plug-ins. When designing your own plug-ins, make sure to take into account the execution time of your plug-in code, as well as the response time, network latency or throttling when using external services (e.g.: Google Maps) or APIs.
Built-in Plug-ins
The following plug-ins have performances features or considerations to take into account;
- Email Validator: The email plug-in uses a local cache for known domain names to avoid repeating useless MX Records lookup for domain name validation. This cache is populated during the first plug-in execution using MX Record Lookup for each email domain in the dataset. Subsequent execution favor the cache over the MX Lookup.
- Avoid dropping the table storing the cache.
- Review the Offline Mode and Processing Mode parameters for tuning the use of this cache.
- Lookup Enricher: This enricher uses a query or table for looking up information for each enriched record.
- Make sure that access to the table and that the query should be fast (<40ms/call), or use the Cache Lookup Data parameter to load the lookup table in memory. This second option is suitable for reasonably small tables (If is a tradeoff between the application memory load and the database query speed).
Matching
Overmatch
A common cause of a performance bottleneck in the Matching phase is Overmatching.
Overmatching consist in creating large clusters of matching records. As an example, a 1000 records cluster means 1M matching pairs to consider (records in the DU table), making this cluster impossible to manage (manually or automatically).
Overmatch Symptoms
- Temporary Tablespace full (unable to extend) while writing to DU table.
- In the job log
- The Match Incoming Record with <rule> task takes a long time.
- Match and Find Duplicates > <entity> > Compute Groups > Compute Transitive Closure for System Dups task takes a long time
Troubleshooting
If you run or profile the matching and/or binning rules using SQL, you can identify which part of the rule causes the issue.
Solution
- Avoid using attributes containing default or placeholder values in binning or matching expressions. Null is always different from anything, so null values are usually not an issue.
Typical causes of default/placeholder values: Replacing non-existing values by spaces, dummy or default values. - Fix wrong data (for example: Replace placeholder value by an enriched value) or fix the rule to handle properly the placeholder/wrong data.
Match Rules Issues
The following issues are common source of performance in the matching process.
- Using Transformations in Matching Rules
Avoid functions transforming data (SOUNDEX, UPPER, etc. included) in match/binning rules.- Reasons:
- May cause an issue on the Indexes. These functions is performed for every time the record is compared.
- Solution
- Materialize these values into attributes via enrichers
- Reasons:
- Use Fuzzy Matching Functions with Care
- Distance and Distance Similarity functions are the most costly functions.
Sometimes, materializing a phonetized value with enricher then comparing with equality gives functionally equivalent results.
- Distance and Distance Similarity functions are the most costly functions.
- Very Large Complex Rules
- Avoid one big matching rule.
- Each rule should address a functionnally consistent set of attribute data.
- Consider Indexing
- For very large volumes, adding an index on the significative columns involved in the binning, then one index for the columns matching rule.
e.g.:create index S_<indexName> on MI_<entity> (B_BATCHID,
B_BRANCHID,B_CLASSNAME, B_PUBID, B_SOURCEID, <columns involved matching, with those having more distinct values first>);-- Remove BranchID for v4.0 and above
- For very large volumes, adding an index on the significative columns involved in the binning, then one index for the columns matching rule.
Other Certification Phases
Consolidation
- Symptom: "ORA-01467: Sort Key too long" issue
- Solution:
alter session set "_windowfunc_optimization_settings" = 128;
in session initializing for the connection pool in the datasource configuration.
- Solution:
Publish Certified Golden Data
- Merge Master Records and Merge Golden Records are lengthy if there is an I/O Problem on the database.
This is an issue to report to the DBA (disk, network, etc).
Applications
Check first server sizing and review overload on DB and app server tier.
Collections
Slow Collections
- Filtering is active (user-defined or design-time)
- Consider disabling user filters.
- Number of different entities involved to compose the collection (looking up data through the references)
- Number of computed columns
- Using PL/SQL or Using SemQL
Optimize SemQL or PL/SQL code.
- Using PL/SQL or Using SemQL
Business Views
Slow Business Views
- Slow Embedded Collections (see previous paragraph)
- To many Embedded Collections : when a view is displayed, each embedded collection generate a SQL call to the database. So try to move not necessary embedded collections in tabs
Custom Search
Filter data by using a custom Search can be costly in many cases :
- Using FDN_ columns instead of FID_ columns in the SemQL condition
- Using function transforming data (SOUNDEX, UPPER, etc. included) may cause an issue on the Indexes
- To complex SemQL condition with