One of the most under-used areas of SAP Data Services is its capability to not only add a comprehensive geographical dimension to your data, but also ensure that such enhanced data is of the highest quality. Data Services provides multiple transforms that make this happen. A transform, such as a geocoder, allows you to derive latitude and longitude based on address data, or vice versa. Address cleansing transforms help ensure that the address data that flows in is accurate.
Key Concept
SAP Data Services enriches any address data entering your application by incorporating value-added geographical components, while ensuring that this data is clean and accurate. This clean geocoded data can then be used by a Business Intelligence application to do geospatial analysis.
How often have you stared at a sales report or a dashboard with all the metrics and key performance indicators (KPIs) and wondered how much more convenient it would be to have all these metrics displayed on a map? Or stared at a supply chain report or dashboard thinking how much easier it would be to visualize all the information about where the delivery trucks are and how many there are on a map?
I can certainly say from my own experience that the answer to both these questions is likely to be many times. Indeed, geospatial analysis is a powerful paradigm that enables you to visually answer the where question; compute the shape and size of data clusters (such as suppliers and vendors) and the distribution of business entities, such as plant locations and warehouses; discover relationships amongst objects on the map; optimize distances; detect patterns; and perform prescriptive and predictive analytics.
Note
The key to geospatial analysis is to successfully geocode your data (that may not have all or any of the geographical components such as addresses, postal and ZIP Codes, latitudes, and longitudes). Generally, data entering transactional systems such as SAP ERP and SAP ERP Central Component (ECC) does not have all the geographical components and therefore geocoding is necessary to enable complete, accurate, and effective geospatial analysis.
Key Terminology
Here are some key terms that are relevant to this article that you need to understand.
- Geographic information system (GIS) – GIS is a software application that enables users to acquire, manage, present, and visualize geographical data on a map to recognize and analyze trends and patterns, and thus make effective decisions.
- Geocoding –Geocoding is the process of adding geographical coordinates, namely latitude and longitude, and other related geographical attributes to address data. You can also call this address geocoding.
- Reverse geocoding – As the name suggests, in this process, the data containing geographical attributes such as the coordinates is deconstructed into an address (at the highest level of granularity) or less specific information such as a point of interest, landmark, region, or county.
- Spatial intelligence – Everyone is familiar with the term business intelligence (BI). The geographical components, when added to a BI system, gives it another space dimension. Additionally, many GIS packages provide tools and software development kits (SDKs) to design custom geographical functionality. All of this provides you with the ability to analyze your data from the space dimension.
Address Data and Data Quality
Data quality is of the utmost importance when you are dealing with address data. By now, you might have inferred that the success of spatial intelligence lies in accurate geocoding. This success is only possible if the data entering your system is of sufficiently high quality. There is probably not a single global organization that does not have to grapple with incomplete or incorrect customer or vendor/supplier address data. Such data needs to be either cleaned or completed to avoid downstream impacts such as incorrect reporting and erroneous decision-making.
Geocoding and Address Cleansing Transforms in SAP Data Services 4.2
SAP Data Services 4.2 provides multiple functions that take a variety of different inputs (that contain non-geocoded address data), and parses, scrubs, and corrects them based on the various address libraries available, or geocodes them. This geocoded data can then be consumed by a client BI application such as SAP Lumira to do basic geographical analysis, or passed onto a third-party GIS application, such as ESRI, for geospatial analysis.
In SAP Data Services, such functions are called transforms. These transforms can be found in the Data Quality node. The fact that the geocoder and global-address-cleanse transforms are included in the Data Quality node should come as no surprise as data quality and geocoding are intrinsically intertwined.
Figure 1 displays the relevant transforms (functions) that are available.
Figure 1
The data quality and geocoding transforms in SAP Data Services 4.2
The Data_Cleanse transform is generic in nature. Its cleansing capabilities are not limited to addresses—these extend to person and firm data, email addresses, phone numbers, and dates. The focus of this article is on the Global_Address_Cleanse and Geocoder transforms because Global_Address_Cleanse scrubs and corrects addresses and is most often the input to the Geocoder transform.
The Global Address Cleanse Transform
This comprehensive transform takes global address data and deconstructs, analyzes, and corrects it to the desired and feasible level of granularity. Your input data (such as customer master data or vendor master data) or any dataset with address data tends to contain many of these fields. SAP Data Services provides you with templates for many countries to expedite your data-preparation work, as shown in
Figure 2.
Figure 2
Country-specific templates for the Global_Address_Cleanse transform
The Geocoder Transform
The Geocoder transform in its most basic form (e.g., address transform) accepts address data (generally without geographic coordinates) and adds latitudes and longitudes at multiple levels of precision (when feasible), and census-related information (for U.S.-based addresses). In its advanced form, it supports reverse geocoding and searches for points of interest. The geographically enriched data can then be fed into a GIS application for spatial analysis on a map.
Figure 3 displays the Geocoder transform list.
Figure 3
The Geocoder transform details
Cleansing and Geocoding Vendor Master Data
Now that I have provided you with the background information on these capabilities, it is time to walk you through them with step-by-step instructions. In my example, I use customer master data from an SAP ERP system. It contains real address data for Company ABC. (Since both Company ABC and its customer data are real, I do not display screenprints of the results of the transformations in this article because of the sensitive nature of the information.) Note that this real data contains some instances of inaccurate and incomplete address data.
Step 1. Create a Project
The highest-level container for all related activities in SAP Data Services Designer is a project. The process for creating a project using the SAP Data Services Designer tool is straightforward. Log on to SAP Data Services and click the Create Project hyperlink as shown in
Figure 4. This opens a Create Project screen (not shown) where you can give your new project an appropriate name and save it. In this example, I named mine Geocoding.
Figure 4
Create a new project
Step 2. Make Your Source Data Available in SAP Data Services Designer
You can connect to a variety of data sources from SAP Data Services Designer. You can also load flat files. The container for your data in Data Services is a datastore. To make this a one-time activity, I recommend that you work with your Data Services Administrator to set up the connection to the source system. In my case, this source system is SAP ERP Central Component (ECC) and the table is the vendor master (general data) table LFA1 (
Figure 5).
Figure 5
Create a vendor master table as a datastore object
Note that Data Services provides you with all the metadata information. To get metadata about table LFA1, you need to click the tabs Attributes, Class Attributes, Indexes, and Partitions, respectively. The General tab is clicked by default. To view the data contained in this table, click the View Data tab in the figure.
Step 3. Create a New Batch Job in Your Project
The next step is to create a new batch job in your project (Geocoding). This is shown in
Figure 6. Hover your mouse over the project name (in this case, Geocoding on the left) and right-click. From the context-menu options that open, select New Batch job. In the screen that opens (not shown) give your job an appropriate name and save it. Once the entire process is set up, you need to run this batch job to fetch the latest data from your SAP ECC datastore object and apply the transforms that you set up later.
Figure 6
Create a new batch job
Tip!
A job—whether a batch job or real-time job—is the only executable in Data Services. All other components are containers.
Step 4. Create a Dataflow
In this step, you establish the relationships between the source of the data, the transformations that it will go through, and the desired output. All of this is done in a drag-and-drop manner. Click either the dataflow icon at the bottom of the screen or in the palette on the right as shown in
Figure 7. In the pop-up window that opens (not shown), give your data flow an appropriate name (in this case, I call it Addresse_Cleanse_Geocode), and save it.
Figure 7
Create a new dataflow
This opens a blank canvas (shown on the right of
Figure 7), where you can drag-and-drop all the components of this dataflow, carry out the relevant customizations, and create the relationships among these components.
Step 5. Identify Input Data Sources, Map Transforms, and Create Outputs
This is the most important step and takes more time than the previous steps. First, you need to drag and drop the LFA1 table that you created in your datastore (in step 2) to the canvas. This is shown in
Figure 8.
Figure 8
Select the master table in the datastore
Next you need to select your transforms, but you want to cleanse your address data prior to geocoding it. So, you first need to look for an appropriate cleansing transform, within the Data Quality node (
Figure 9).
Figure 9
The list of relevant transforms
Note
I have intentionally expanded the transforms that are relevant to this article. When it comes to address cleansing, there are a lot of options. You might want to do some research on each of these based on your use case to find the most appropriate one. Doing research is as simple as reading the standard SAP documentation on each of these. You can access this documentation by double-clicking the name of any of the transforms in Figure 9.
Since the vendor master data contains global vendors and therefore global addresses, I select the GlobalSuggestions transform in this scenario. As for geocoding, the choice is more obvious—I select the Geocode transform since all I need for this example is the latitude and longitude for each address.
Drag and drop the GlobalSuggestions transform (circled in red in the left of
Figure 10, under the Global_Suggestions_List node), from the bottom panel. Now place your cursor on the green arrow of the LFA1 datastore and drag your mouse to the left of the GlobalSuggestions transform to connect the two. Then drag and drop the Geocode transform from the bottom panel. Place your cursor on the right edge of the GlobalSuggestions transform and drag your mouse to the left of the Geocode transform to connect the two transforms.
The next steps are to drag and drop the output (target) on the canvas (which is an XML file called Output) and connect the right end/output of the Geocode transform to the input of the target in the same way you did in the previous steps.
Figure 10 shows how this done—it is easier to understand the relationships visually since all I am really doing here is connecting the output from one element to the input of another and thus creating a simple data flow.
Figure 10
Complete the address cleansing and geocoding of the dataflow
Here are some more details about key aspects of this dataflow. Also, some of these activities are highly intuitive and allow you to experiment without fear of messing up the system.
When you connect two entities in a dataflow, you need to map the output of one entity to the input of the next one. Let’s take the first transform, GlobalSuggestions. When you double-click it (in
Figure 10), a mapping screen is displayed (
Figure 11) that shows the input as Schema In (LFA1) and the output as Schema Out (GlobalSuggestions_AddressCleanse). Then click the Input tab in
Figure 11.
Figure 11
Mapping the source and target inside a transform
In the Schema In box in
Figure 11 you see all the fields in your source table LFA1. Highlight all the fields in table LFA1 that you want to include in the output and move them to the right Schema Out panel. In the bottom panel under the Output tab, you need to map those fields from your input that you want to feed to the corresponding fields of the transform.
Click the Output tab (
Figure 12). Here you can add any new fields (that are not in the original source—e.g., table LFA1) from the transform to the output list. For example, one of the new fields selected for the output is COUNTRY_NAME.
Figure 12
Add additional fields from the transform to the output
Once you are done, validate all your work by clicking the green check-mark icon in the ribbon (boxed in red in
Figure 12). If there are no errors, save the information by clicking the save all icon (circled in red).
You need to take the same approach with the Geocode transform. Select the fields from the input (i.e., the output of the GlobalSuggestions transform), map the relevant fields from the input with the fields in the Geocode transform, and include any additional fields from this transform into the output. In this case, add the key latitude and longitude fields to the output since they are the basic essence of geocoding, as shown in
Figure 13.
Figure 13
Geocode the cleansed data
Step 6. Execute Your Job
The last step is to execute the (batch) job that you created in step 3. Go back to your project area, highlight the job (BT
_Job1), and right-click your mouse (
Figure 14). Then select the Execute… option.
Figure 14
Execute the batch job for address cleansing and geocoding
This executes the batch job, and you can now check the job log (
Figure 15). It generally has a lot of information. You can configure the level of detail you want to have logged. If the job is run successfully, you should see a success notification in the log.
Figure 15
The job log shows successful completion of the batch job run
So, what does all this mean? Basically, the vendor master table that was pulled from a source ECC system was scrubbed for incorrect and/or incomplete address data, corrected and/or completed, and these addresses were then geocoded. As mentioned earlier, I am not showing the contents of the output (XML) file because it contains sensitive data.
Note
All the features and functions I present here are also available in SAP HANA starting with Support Package Stack 9. These come as part of smart data integration and smart data quality. These components make up SAP solutions for enterprise information management and provide data services-like integration features inside SAP HANA via SAP HANA studio and the SAP HANA Web-based Integrated Development Environment (IDE). Smart data quality focuses on data-enrichment activities, namely cleansing and geocoding, using various transforms also within SAP HANA. The focus of this article is on data enrichment through SAP Data Services 4.2, but you could certainly do all of this and then some in SAP HANA
Anurag Barua
Anurag Barua is a principal at TruQua Enterprises. He has 23 years of experience in conceiving, designing, managing, and implementing complex software solutions, including nearly 18 years of experience with SAP applications. He has been associated with several SAP implementations in various capacities. His core SAP competencies include FI and Controlling (FI/CO), logistics, SAP HANA, SAP BW, SAP BusinessObjects, Enterprise Performance Management, SAP Solution Manager, Governance, Risk, and Compliance (GRC), and project management. He is a frequent speaker at SAPinsider conferences and contributes to several publications. He holds a BS in computer science and an MBA in finance. He is a PMI-certified PMP, a Certified Scrum Master (CSM), and is ITIL V3F certified.
You may contact the author at
Anurag.barua@truqua.com.
If you have comments about this article or publication, or would like to submit an article idea, please contact the
editor.