Table of Contents | ||||
---|---|---|---|---|
|
...
For Semarchy v4.x:
- Maximum Memory
-
Xmx
value must be at least 4 Go. - The OS should have at least 8 Go of RAM.
...
If you run or profile the matching and/or binning rules using SQL, you can identify which part of the rule causes the issue. Another thing you can do is query the DU table and compare it to the MI table to see if there is overmatching.
Info |
---|
Here is an example of troubleshooting an overmatch situation on phone number. It allowed us to identify that there was a placeholder value in the standardized phone number that creates a massive cartesian product. Run the query below to learn how many records were inserted into the DU_CONTACT table by the previous matching rules. Compare it with the current record count of the MI_CONTACT. This query should allow us to pinpoint any oddities in the MI_CONTACT table: select STD_PRIMARY_PHONE, count(*) bin_size |
Solution
- Avoid using attributes containing default or placeholder values in binning or matching expressions. Null is always different from anything, so null values are usually not an issue.
Typical causes of default/placeholder values: Replacing non-existing values by spaces, dummy or default values. - Fix wrong data (for example: Replace placeholder value by an enriched value) or fix the rule to handle properly the placeholder/wrong data.
...
- Using Transformations in Matching Rules
Avoid functions transforming data (SOUNDEX, UPPER, etc. included) in match/binning rules.- Reasons:
- May cause an issue on the Indexes. These functions is performed for every time the record is compared.
- Solution
- Materialize these values into attributes via enrichers
- Reasons:
- Use Fuzzy Matching Functions with Care
- Distance and Distance Similarity functions are the most costly functions.
Sometimes, materializing a phonetized value with enricher then comparing with equality gives functionally equivalent results.
- Distance and Distance Similarity functions are the most costly functions.
- Very Large Complex Rules
- Avoid one big matching rule.
- Each rule should address a functionnally consistent set of attribute data.
- Consider Indexing
- For very large volumes, adding an index on the significant columns involved in the binning, then one index for the columns matching rule.
e.g.:create index USR_<indexName> on MI_<entity> (B_BATCHID,
B_BRANCHID,B_CLASSNAME, B_PUBID, B_SOURCEID, <columns involved matching, with those having more distinct values first>);-- Remove BranchID for v4.0 and above
Have you done truncate and full reload operations recently?
One of the standard maintenance operations, in this case, is to rebuild all of the indexes with:alter index [index name] rebuild online; /* Oracle syntax */
The code to do this for all indexes can be easily generated using a select on the user_indexes view.To rebuild the indexes on the entire database in PostgreSQL, you can use the Postgres statement
REINDEX DATABASE database_name;
/* PostgreSQL syntax */
If you need to just reindex a specific table you can useREINDEX TABLE table_name;
/* PostgreSQL syntax */
- For very large volumes, adding an index on the significant columns involved in the binning, then one index for the columns matching rule.
...
- Symptom: "ORA-01467: Sort Key too long" issue
- Solution:
alter session set "_windowfunc_optimization_settings" = 128;
in session initializing for the connection pool in the datasource configuration.
- Solution:
...