How to Reprocess Existing Data in Production to Apply New Rules

User Story

You have deployed your new MDM solution into production and it has been running smoothly. You have already started new development sprints in your Dev environment. A month later, you have new enrichers, tweaked existing match rules, configured the survivorship rules. You are ready to deploy the new model changes to production.

Once you closed your model in Dev, test it in QA, and deploy the new model to Production, you’re ready for the changes to the model (add new enrichment, match rules, etc.) to be reflected in all of the records in Production.

How do you push the Production records through the certification process to force the new rules to run?

Solution

This is a great question as it will impact your thinking every time you push a model to production. The key is that you need to reload the Production data to trigger reprocessing of your records so new rules (enrichers, validation rules) work against existing data or replay matching rules.

The quick answer is: you must reload your data in production so it goes through certification again. To do this: take the MD records and reload them into the SD table in Production.

For quick guidance on how to replay matching, here is our Integration documentation on how to replay the matching process. If you need to replay matching and don’t need to reset user decisions, you can skip directly to the section called “Re-load the data from the latest master values.” This similar script can be used to reprocess records for new enrichers and other non-matching rules.

Here is the skeleton of the script you need to reload data. This only covers the insert statement. You will need the additional steps of getting a new load ID and submitting the load (or a continuous load if you choose to delegate the overhead to xDM).

INSERT into SEMARCHY_DEMO_MDM.SD_CUSTOMER (
       B_LOADID         /* Semarchy specific data fields */
      ,B_CLASSNAME      
      ,B_PUBID
      ,B_SOURCEID       /* Customer system fields */
      ,CUSTOMER_NAME    /* Data fields        */
      ,TOTAL_REVENUE
      ,INPADDRESS
      ,INPPOSTAL_CODE
      ,INPCITY
      ,INPCOUNTRY
      ,F_ACCOUNT_MANAGER
    )
    SELECT
       v_load_id          /* B_LOADID    - v_load_id is a variable declared earlier                              */
      ,'Customer'         /* B_CLASSNAME - Entity name. Found in MDM Workbench UI.                               */
      ,B_PUBID            /* B_PUBID     - Publisher ID mapped to B_PUBID in SD table to track  source systems.  */
      ,B_SOURCEID         /* B_SOURCEID  - mapped to B_SOURCEID in SD table.                                     */
      ,CUSTOMER_NAME      /* Data fields        */
      ,TOTAL_REVENUE
      ,INPADDRESS
      ,INPPOSTAL_CODE
      ,INPCITY
      ,INPCOUNTRY
      ,F_ACCOUNT_MANAGER
    FROM SEMARCHY_DEMO_MDM.MD_CUSTOMER
    /* Add a clause to filter records */
    -- WHERE MD_CUSTOMER.CUSTOMER_ID = ??

Wrapping this insert into a stored procedure could look something like the following, using Oracle:

create or replace procedure  SEMARCHY_DEMO_MDM.RELOAD_CUSTOMER_LOAD
AUTHID CURRENT_USER AS
  v_load_id int;
  v_batchid int;
  v_submit_load_ok boolean := TRUE;
BEGIN
  v_load_id := SEMARCHY_DEMO_REPOSITORY.INTEGRATION_LOAD.GET_NEW_LOADID (
    'CustomerMDM'                    /* Data Location Name. Case sensitive. Found in MDM Workbench UI.         */
    ,'MANUAL_ETL'                    /* Informational. Describes the process or program doing the load.        */
    ,'Test to reload master records' /* Informational. Describes the load.                                     */
    ,'MANUAL_ETL_USER'               /* User initializing the load. (Not found in MDM Workbench UI.)           */
  );
  BEGIN
    INSERT into SEMARCHY_DEMO_MDM.SD_CUSTOMER (
       B_LOADID         /* Semarchy specific data fields */
      ,B_CLASSNAME      
      ,B_PUBID
      ,B_SOURCEID       /* Customer system fields */
      ,CUSTOMER_NAME    /* Data fields        */
      ,TOTAL_REVENUE
      ,INPADDRESS
      ,INPPOSTAL_CODE
      ,INPCITY
      ,INPCOUNTRY
      ,F_ACCOUNT_MANAGER
    )
    SELECT
       v_load_id          /* B_LOADID    - v_load_id is a variable declared earlier                              */
      ,'Customer'         /* B_CLASSNAME - Entity name. Found in MDM Workbench UI.                               */
      ,B_PUBID            /* B_PUBID     - Publisher ID mapped to B_PUBID in SD table to track  source systems.  */
      ,B_SOURCEID         /* B_SOURCEID  - mapped to B_SOURCEID in SD table.                                     */
      ,CUSTOMER_NAME      /* Data fields        */
      ,TOTAL_REVENUE
      ,INPADDRESS
      ,INPPOSTAL_CODE
      ,INPCITY
      ,INPCOUNTRY
      ,F_ACCOUNT_MANAGER
    FROM SEMARCHY_DEMO_MDM.MD_CUSTOMER
    COMMIT;
  EXCEPTION
    WHEN DUP_VAL_ON_INDEX THEN 
      dbms_output.put_line('Encountered error ORA-00001. You attempted to insert non-unique records into the SD table.');
      v_submit_load_ok := FALSE;
    WHEN OTHERS THEN 
      dbms_output.put_line('Encountered an error while inserting into the SD table');
      v_submit_load_ok := FALSE;
  END;
  IF v_submit_load_ok THEN
    v_batchid := SEMARCHY_DEMO_REPOSITORY.INTEGRATION_LOAD.SUBMIT_LOAD (
       v_load_id
      ,'INTEGRATE_DATA'   /* Must exactly match a defined integration job. Found in MDM Workbench UI.  */
      ,'MANUAL_ETL_USER'  /* Must exactly match the value of the User used in GET_NEW_LOADID above.    */
    );
  ELSE
    SEMARCHY_DEMO_REPOSITORY.INTEGRATION_LOAD.CANCEL_LOAD( v_load_id, 'MANUAL_ETL_USER' );
  END IF;

END;
/

/* Run the procedure like this (it takes no parameters) */
exec SEMARCHY_DEMO_MDM.CUSTOMER_LOAD;

Performance Concerns

It can make sense to add a where clause on your selection statement to strategically reload only the records you want to reprocess. It's OK to reload all the records as well. But you may want better performance by dividing your records into smaller batches and loading them in a piecemeal fashion.

Matching Concerns

xDM does not automatically replay match because it’s possible you don’t want all of your records to undergo matching again. Consider which records to reload can be important. Test in QA to ensure that it is safe to reload all the records you intend to reprocess and ensure you do not get unexpected bad match results.

If you need to refresh your DEV or QA environment with your production data, ask Semarchy support for a copy of the article https://semarchy.atlassian.net/wiki/spaces/SKB/pages/23396390 . That article is for taking data from Prod and loading it into your Dev environment so that you can have up-to-date production data to use for testing your new match rules against while you're in the development process and configuring new rules. It’s a complex process to make sure you know what you’re doing before attempting it.

Additional Information

A related topic is How to Make Related Entities (Siblings, Children, Parents) Perform An Action (Deep Copy, Reprocess Enrichers). This document covers the best practices around how to set up a sibling, parent, or child entity to perform an action, such as deep copy, reprocess children entities, or repopulate/recalculate enrichers on sibling entities. It’s not available on the public Semarchy Knowledge Base. Ask Semarchy support to export a version for you if you need help to trigger enrichers, deep copy, etc. to run.

Other related topics include how to https://semarchy.atlassian.net/wiki/spaces/SKB/pages/23396390 and the reverse https://semarchy.atlassian.net/wiki/spaces/SKB/pages/922255419.