Inaccurate data can enter BW from a source system or become corrupted once extracted. The author explains how data can become corrupted and describes the means available in BW to identify, avoid, and repair bad data.
Since SAP BW serves as a tool for decision-making, it is imperative that the data within the system be accurate. One cause of inaccurate data is contamination due to errors, duplicate or missing items, and other inconsistencies, which is why most BW implementations perform data cleansing to some extent. If your site does not, or, more likely, if your data-cleansing efforts are not as thorough as they could be, I will show you how best to approach the task. I will identify the most common causes of contaminated data and explain which tools are available within BW for avoiding, identifying, and repairing it.
Several processes can create contaminated data in BW. Data can be contaminated in the source system, so that it arrives in BW already flawed, or it can become contaminated during the extraction and staging processes. The six most common sources of contaminated data are:
1. Corrupted source system data. Data coming into the BW system may already be inaccurate, redundant, or inconsistent due to errors within the source system itself.
2. Consolidated master data. If duplicate keys exist in the data sources, consolidating master data can break the relationship between master data values and the corresponding transaction data and master data attributes. For example, two separate master data items may share the same key (in system A, customer 1001 = ACME, while in system B, customer 1001 = Terasail), or the same customer may have multiple keys (in system A, customer 1001 = ACME and customer 3000 also = ACME). Give careful thought to the logic used to combine this master data so that an accurate view is preserved; a key-mapping sketch follows this list.
3. Platform issues. For example, character conversions and code page inconsistencies can alter data during the extraction and load process.
4. Administrative processes. Double or missing data loads, or errors resulting from poorly coded user exits are a few examples of administrative processes that can taint data.
5. Data format inconsistencies. Specifically in relation to time and date fields, format inconsistencies can alter the meaning of data. For example, if the date formats MM/DD/YYYY and DD/MM/YYYY were confused, how would the data field 03012003 be interpreted? March 1 or January 3?
6. Data timing. Data can become contaminated in the load process during master data reads for transaction data. For example, assume that during the data load process, the system reads the master data file for “customer” to determine the current sales representative to associate with sales transactions for customer ABC. If the sales force has been reorganized and the transaction data is loaded before a current master data load, the routine picks up the old sales representative instead of the current one, and the data written to the InfoProvider (e.g., an InfoCube or reporting ODS object) in SAP BW will not be accurate.
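To illustrate the consolidation problem described in source 2 above, here is a minimal sketch of a key-mapping lookup as it might appear in the body of a BW transfer routine (BW 3.x style). The mapping table ZCUST_KEYMAP, its fields, and the transfer-structure field names are all hypothetical; each project maintains its own mapping.

* Map a source-system-specific customer key to a harmonized global
* key. ZCUST_KEYMAP (source system ID + local key -> global key) is
* a hypothetical project-specific table.
  DATA: l_global(10) TYPE c.

  SELECT SINGLE global_cust FROM zcust_keymap
         INTO l_global
         WHERE soursysid  = tran_structure-soursysid
           AND local_cust = tran_structure-customer.

  IF sy-subrc = 0.
    result     = l_global.    " write the harmonized key
    returncode = 0.
  ELSE.
    returncode = 1.           " unmapped key: flag the record as erroneous
  ENDIF.

Unmapped keys are flagged rather than silently passed through, so the load log reveals exactly which source keys still need a mapping entry.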
The best way to keep contaminated data out of BW queries is to take measures to avoid contaminating the data in the first place. You also need to make a concerted effort to identify and repair contaminated data before loading it into the InfoProviders. Here are 13 suggestions you might follow:
1. Apply logic to avoid data anomalies. This can be done either within the data extraction process in the source system or within the BW staging process via the transfer and update rules.
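As a minimal illustration of such logic, the following sketch of a transfer-routine body corrects two common anomalies: lowercase variants of the same value and blank entries. The REGION field and the 'UNK' default bucket are hypothetical.

* Normalize case and default blank values before the record reaches
* the InfoProvider. Field names and the default are illustrative.
  DATA: l_region(3) TYPE c.

  l_region = tran_structure-region.
  TRANSLATE l_region TO UPPER CASE.   " fold lowercase variants together
  IF l_region IS INITIAL.
    l_region = 'UNK'.                 " park blanks in a visible default bucket
  ENDIF.
  result = l_region.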
2. Avoid free-form entries (e.g., keys, text, and attributes combined in one field). Free-form entry can lead to duplicate data entries as a result of misspellings. Another drawback of free-form fields is that the data entered, which may be significant for identifying relationships among records, can become lost in the generic nature of the field.
3. Check for versions during master-data lookups. This can help ensure the timeliness of your data.
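As one hedged sketch of such a dated lookup: the time-dependent master data tables generated by BW (the /BI0/Q* tables) carry DATEFROM/DATETO validity fields and an OBJVERS flag, so a routine can read the attribute version that was valid on the transaction date. The SALESREP attribute and the POST_DATE field below are hypothetical.

* Read the attribute version valid on the transaction's posting date
* instead of whatever happens to be current.
  DATA: l_rep(10) TYPE c.

  SELECT SINGLE salesrep FROM /bi0/qcustomer
         INTO l_rep
         WHERE customer  = tran_structure-customer
           AND objvers   = 'A'                       " active version
           AND datefrom <= tran_structure-post_date
           AND dateto   >= tran_structure-post_date.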
4. Test, test, test! Keep in mind that test-phase data is often clean (manually created for testing purposes), so be sure to also run tests with data extracted from the source system.
5. Run data validations to ensure that the staging process is thorough and complete. Some examples are:
a. Extract the same data via two different methods and compare the results.
b. Use the Aggregate Check Tool (SAP note 202469), which executes a background query against an aggregate and then against an InfoCube and compares the results.
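Suggestion 5a can be implemented as a small reconciliation report that totals a key figure over the results of both extraction paths and compares them. Everything in this sketch, from the table names to the AMOUNT field, is hypothetical and stands in for whatever staging targets the two methods fill.

REPORT zbw_reconcile.

* Compare the total of a key figure delivered by two different
* extraction methods into two hypothetical staging tables.
DATA: l_total_a TYPE p DECIMALS 2,
      l_total_b TYPE p DECIMALS 2.

SELECT SUM( amount ) FROM zsales_stage INTO l_total_a.
SELECT SUM( amount ) FROM zsales_check INTO l_total_b.

IF l_total_a <> l_total_b.
  WRITE: / 'Reconciliation failed:', l_total_a, '<>', l_total_b.
ELSE.
  WRITE: / 'Totals match:', l_total_a.
ENDIF.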
6. If you expect to encounter a lot of data issues, consider adding an audit dimension (not delivered by SAP) to your InfoCube. Fields such as the data source or the logic rules applied to data records can help identify which part of the process needs to be changed to correct errors. It is possible to add InfoObjects for storing these values in an InfoCube and then assign them to a common dimension (which you can title “Audit”); this grouping is standard functionality.
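One hedged way to fill such an audit characteristic is an update routine that records which cleansing rule, if any, touched the record. The characteristic ZAUDITRULE, the COMM_STRUCTURE field, and the rule names below are all hypothetical.

* Body of an update routine for a hypothetical audit characteristic:
* record which cleansing rule was applied to the record.
  IF comm_structure-region IS INITIAL.
    result = 'REGION_DEFAULTED'.   " record arrived without a region
  ELSE.
    result = 'NONE'.               " no cleansing rule applied
  ENDIF.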
7. Turn on master-data validation when loading data. This function checks against existing master data when loading transaction data. If a transaction comes through for which there is no valid master data, the system marks the transaction with an error. This setting, Do not update data if no master data exists for a characteristic, is in the InfoPackage on the Update tab (Figure 1).

Figure 1
Turn on the master data validation
8. Maintain the permitted characters table. BW validates incoming characteristic values against a set of permitted characters and rejects all others. This table can be accessed in BW Customizing under Business Information Warehouse>General BW Settings>Maintain Permitted Extra Characters, or via transaction code RSKC (Figure 2).
Note
The documentation enhancement “Permitted Characters in Characteristic Values” addresses this topic. You may download it from the SAP Service Marketplace.

Figure 2
Maintain the permitted extra characters table
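If you need a comparable check inside your own routines, for example to flag individual records before the standard check fails them, the ABAP comparison operator CO (contains only) does the work. The permitted set and the MATNR field below are illustrative only; maintain the real set via RSKC as described above.

* Minimal character validation in the spirit of the RSKC check. Note
* that the permitted set must include the space character so that
* trailing blanks in character fields pass the test.
  CONSTANTS: c_permitted(50) TYPE c VALUE
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 -_'.

  IF NOT tran_structure-matnr CO c_permitted.
    returncode = 1.   " illegal character found: flag the record
  ENDIF.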
9. Turn on the consistency checks within the InfoPackage. These check for character values in NUMC-type data, for the conversion routine ALPHA (which pads values with leading zeros), for lowercase letters, for the use of special characters, and for the correct format of date and time fields. The consistency check can be set within the InfoPackage on the Processing tab (Figure 3).

Figure 3
Turn on consistency checks for the InfoPackage
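The ALPHA routine mentioned above can also be invoked explicitly in your own routines through the standard function module CONVERSION_EXIT_ALPHA_INPUT, so that externally formatted keys are stored in their padded internal form. The MATNR field name is illustrative.

* Normalize a key into its ALPHA-converted internal form so that
* '1001' and its zero-padded equivalent do not load as two
* different values.
  DATA: l_matnr(18) TYPE c.

  CALL FUNCTION 'CONVERSION_EXIT_ALPHA_INPUT'
    EXPORTING
      input  = tran_structure-matnr
    IMPORTING
      output = l_matnr.

  result = l_matnr.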
10. Consider the use of local master data, which means that master data is compounded with the source system ID. Compounding is a way to set dependencies between InfoObjects. For example, if the description of Customer 3000 is dependent upon the source system (e.g., in system A customer 3000 is ACME and in system B Customer 3000 is Terasail), it is possible to compound the InfoObject for source system with the InfoObject for customer. As a result, the InfoObject for source system is required whenever the InfoObject for customer is used.
Note
Not all master data should be compounded to the source system ID—carefully consider the reporting and load consequences before creating local master data.
11. Take advantage of transaction code RSRV, which checks the integrity of most of the objects in BW. Details on how to use this transaction are available in SAP’s online help.
12. Consider having the data-quality manager manually set the data-quality status of an ODS object or InfoCube to green instead of having the status set automatically after a data load. This can be set from within the InfoCube Maintenance area, menu path Environment>Automatic Request Processing (Figure 4). To keep the status manual, do not select the option Set quality status to OK.

Figure 4
If Set quality status to OK is selected, the data-quality manager cannot manually update the data-quality status to green
13. Use the data-deletion option in the InfoPackage scheduler (when appropriate) to avoid double loads.
I have just mentioned quite a few tasks that you can do to help avoid or detect data contamination. However, you should consider how much data cleansing to do. The concept of diminishing returns can be applied here. How pure must the data be? Is 100 percent accurate data required, or is there an acceptable margin for error? This is important because a certain cost is involved in cleansing data. For example, if you choose to manually set the quality status on InfoCubes, consider the cost associated with the time and resources that it takes to review the data and change the status. If you turn on all the checks at the InfoPackage level and run extremely large volumes through your data loads, expect an impact on performance. You need to determine what level of data cleansing is most appropriate for each project.
Another topic to consider is where to do the cleansing. You often have multiple ways to accomplish the same result. Options include cleaning the data:
- In the source system (possibly using a third-party tool)
- During the extraction process (by applying additional logic to the extractors)
- Through BW staging (via the transfer and update rules)
For example, missing data can be derived in a user exit in the extractor on the source system, or in a routine in the transfer rules on the BW system, as sketched below. Which is the most efficient? The best option varies with the data volumes and the systems involved. Keep in mind that, ideally, the data should be clean by the time it reaches the ODS level.
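As a sketch of the transfer-rule variant, the following routine body derives a missing country from a hypothetical company-code lookup table ZCOMP_COUNTRY; the same logic could equally be placed in a user exit in the extractor. All field and table names are illustrative.

* Derive a missing country on the BW side from a hypothetical
* company-code-to-country lookup table.
  DATA: l_country(3) TYPE c.

  IF tran_structure-country IS INITIAL.
    SELECT SINGLE country FROM zcomp_country
           INTO l_country
           WHERE comp_code = tran_structure-comp_code.
    IF sy-subrc = 0.
      result = l_country.
    ELSE.
      returncode = 1.    " cannot derive the value: flag for follow-up
    ENDIF.
  ELSE.
    result = tran_structure-country.
  ENDIF.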
If contaminated data is found in an InfoProvider (an InfoCube or ODS object), you have a few options for repairing it. One option is to delete the contaminated data, correct it, and reload it. Unless the data has been compressed, you can delete it by request from the InfoCube or ODS object. Another option, useful if only a few records are contaminated, is to use the selective-deletion functionality to delete only very specific records. If the contamination occurred in the staging process, you may opt for an InfoCube reconstruction, in which the InfoCube is rebuilt from the PSA. (Of course, you would want to correct any errors in the transformation routines prior to the rebuild.) Finally, depending on whether the InfoCube has already been compressed, you may opt for a request reversal posting, which loads the same request from the PSA but with negative signs, so that the reversal records cancel the original ones when the data is aggregated.
Data cleansing is a vital topic for any data-warehousing project. Unfortunately, many projects wait until the test phase to address it, which can jeopardize a successful go-live. It is best to begin addressing data-cleansing issues during the blueprint phase so that the appropriate controls and processes can be implemented properly.
Lori Vanourek
Lori Vanourek started with SAP in 1996, after receiving her master’s degree in international business from the University of South Carolina, as a financial applications consultant focusing on the reporting and information analysis requirements of SAP’s customer base. She moved into the BW practice with its first release and has since specialized in the BW product, currently as a member of the BW product management team. Prior to joining SAP, Lori spent five years in the private sector as a financial analyst.
You may contact the author at lori.vanourek@sap.com.
If you have comments about this article or publication, or would like to submit an article idea, please contact the editor.