After you master basic data mining methods such as ABC classification, you can try more advanced techniques such as clustering with Analysis Process Designer (APD). See how you can create a clustering model and view the reports that APD provides for clustering data.
Key Concept
Clustering is a statistical method applied to a set of objects to identify groups of similar objects. You define a distance measure to determine the similarity of the objects. Then, an algorithm compares the objects according to their distances and produces the clusters (i.e., the groups of similar objects). For example, you could use clustering to find out whether you can partition your customers into a number of distinct groups and whether these classifications influence customers’ buying behaviors. You can use the clustering results to analyze the structure in a set of objects. Then you can use the resulting criteria to predict the group to which new data belongs.
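To make the idea of a distance measure plus a grouping algorithm concrete, here is a minimal sketch in Python. It is not APD code, and the article does not state which algorithm APD uses internally; k-means is used here only as a common stand-in. The customer tuples, the function names, and the naive choice of starting centroids are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Distance measure: straight-line distance between two numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, k, max_iterations=10):
    """Group points into k clusters; returns (centroids, labels)."""
    centroids = [points[i] for i in range(k)]  # naive start: first k points
    labels = [0] * len(points)
    for _ in range(max_iterations):
        # Assign every object to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
        if new_labels == labels:  # nothing changed, so stop early
            break
        labels = new_labels
    return centroids, labels

# Hypothetical customers as (age, number of children).
customers = [(25, 0), (62, 3), (27, 1), (58, 2), (24, 0), (65, 4)]
centroids, labels = k_means(customers, k=2)
print(labels)     # cluster index per customer
print(centroids)  # the "typical" customer of each cluster
```

In this toy data the younger customers with few children end up in one cluster and the older customers with more children in the other. Real data usually needs scaled fields first, because a field with a large value range (such as age) otherwise dominates the distance.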
Analysis Process Designer (APD) offers advanced data mining capabilities such as clustering, decision trees, association analysis, and scoring models. These capabilities differ from simple transformations such as filters or ABC classification, which classifies data according to a simple key figure filter. Advanced data mining tools are more complex because you must configure them before you can use them; this is called the training phase. In this phase you configure the system to recognize data, and then you keep adjusting that configuration until the system does what you want. After the training phase, you can use the training results immediately to learn more about the structure of the data. These results serve as a base for predicting classifications or clusters of new and unknown data.
However, you cannot use the training results immediately as a transformation in an APD model. (A transformation is any object that connects data sources and targets — it receives data, transforms it, and outputs it again.) First, you create a trained data mining model with a closed APD process (i.e., without open branches and including at least one data source and one data target) that contains the desired data mining object as the data target.
I will briefly describe how to set up clustering to give you a general idea of how to configure this advanced data mining process. A full description of clustering — how the actual algorithm works, on which kind of data you should apply it, and how the different parameter settings influence the results — is outside the scope of this article. However, you should be able to create an advanced data mining process and explore the possibilities of the parameters and their effects using this article. For this, you should have SAP BW 3.5 or higher and SAP CRM 4.0 or higher. You should also have the CRM business partners extracted into BW. Typically, both the BW and CRM teams are involved in this process. If you’d like to read more about the process, refer to the sidebar “Additional Resources.”
Where to Use Clustering
Say you are a bike seller and you want to expand your customer base. You could use clustering to identify customers most likely to purchase your bikes. Using the information you have about your customers, such as gender, age, number of children, number of cars, and commute distance, the clustering algorithm tries to find out whether the data contains groups of customers with similar lifestyles.
For example, it might turn out that people with more than one car also tend to have more than three children. People in large cities could have long commuting distances. You can then determine whether specific groups are also bike buyers or buyers of specific bike types. For instance, you probably won’t sell the expensive bikes to families with many children. Cluster analysis sometimes provides you with results that sound plausible (e.g., people who own cars and have long commutes are probably not bike buyers). Then again, in some cases it can be very surprising (e.g., people who have to drive far each day might be happy to bike to the grocery store in the evening).
You can use clustering to classify customers automatically with a set of properties. Clustering classifies customers according to both characteristics (such as age, income, and number of children) and key figures (such as sales values), so you can classify customers by more than one criterion. For example, the clustering algorithm could return data that shows that customers who own a house are always older than 30 and have at least one child; this is group or “cluster” 1. Customers with one child or fewer who do not own a house, but have at least two cars, become cluster 2.
Create the Clustering Model
I created the APD process shown in Figure 1. The Customer InfoObject contains information typically used for clustering. This is a custom InfoObject that contains no key figures. Instead, it contains additional properties (i.e., characteristics) of the business partners. It demonstrates the variety of sources that you can link (here via the business partner InfoObject that exists in both data sources).

Figure 1
Create the APD process for clustering
In Figure 1, the check box Bike Buyer signifies whether you sold a bike to a customer. If you combine this information with the clustering information, you could predict which customers will most likely buy your product. To create a new clustering model, double-click on the clustering model 1 icon on the left side of Figure 1. In the screen that appears in Figure 2, enter a Method name, such as SAP_CLUSTERING, to apply in your new data mining model.

Figure 2
Enter the parameters for the new data mining model
You should connect your object during the APD process as early as possible to get automatic proposals. In the APD screen in Figure 1, click on the clustering model 1 icon. Define the mapping between the incoming data and the clustering model by clicking on the edge between the join operator and the clustering symbol. Because I connect the join operator with the clustering symbol before I define the clustering model, the system proposes fields in the clustering model based on the input fields that I already defined in the join operator. In my example, I’ll use all of the fields from the join operator, so I only need to choose automatic mapping. (For more information about automatic mapping, refer to my December 2006 CRM Expert article.)
Define the Clustering Model
Next, define the clustering model parameters (depending on the method), fill the model with data, and train it. After that, you can use the trained clustering model as a transformation.
Switch to the data mining workbench transaction RSDMWB to define the data mining model. You can either click on the new document icon in the Model line highlighted in Figure 2 or start transaction RSDMWB in another BW session (or from SAP CRM with transaction CRM_RSDMWB). Follow SAP Easy Access menu path SAP menu>Analytics>Predictive Modeling>CRM_RSDMWB.
In RSDMWB, the system creates a new clustering model with the specified name and proposes fields (Figure 3). The main difference between the fields is that they are classified as Discrete or Continuous, which is derived from the field type. You should set key fields, such as Business Partner, to Key Field to remove them from the classification task because such keys usually have no meaning.
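The following Python sketch illustrates the kind of proposal described above. It is not how RSDMWB is implemented; the article only says that the classification is derived from the field type. The rule used here (numbers become Continuous, everything else Discrete) and the field names are assumptions for illustration.

```python
def propose_field_roles(record, key_fields=("business_partner",)):
    """Propose a role per field: Key Field, Discrete, or Continuous."""
    roles = {}
    for name, value in record.items():
        if name in key_fields:
            # Keys identify a customer but carry no meaning for clustering.
            roles[name] = "Key Field"
        elif isinstance(value, (bool, str)):
            # Check boxes and text codes take a limited set of values.
            roles[name] = "Discrete"
        elif isinstance(value, (int, float)):
            roles[name] = "Continuous"
        else:
            roles[name] = "Discrete"
    return roles

# Hypothetical sample record from the joined data sources.
sample = {"business_partner": "0000100", "age": 34, "gender": "F", "bike_buyer": True}
roles = propose_field_roles(sample)
print(roles)
```

The check for `bool` comes before the numeric check on purpose, because Python treats `True` and `False` as integers.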

Figure 3
New clustering model fields
Normally, you can use the proposal values to test the APD process to check the clustering result. After reviewing it, you might find that 10 clusters are too many, that the clusters are too similar, or that you feel completely lost. You could try another training with just a few criteria (e.g., only age, bike buyer, and number of children) to get a better feel for the data. When you have information about your training set, you can use it for new data.
Tip!
If you only have one large data set, say 10,000 customers, but few new customers per month, then you can select a subset of these customers as the training set to train the clustering model. Then you can apply it to the complete data set. This saves you computation time, because you don’t need to run the clustering each time for all customers, and helps to make the clustering model more objective. It applies the criteria gained from the training to the rest of the data set. This is a typical methodology in data mining, especially in cases in which new data is not available often.
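As a rough illustration of this tip, the Python sketch below trains on a subset and then applies the result to the complete data set. It is not APD code: the tiny k-means routine stands in for whatever algorithm the system uses, and the customer tuples, the function names, and the choice of taking every second customer as the training set are assumptions.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

def train(points, k, iterations=10):
    """Very small k-means stand-in: returns k centroids learned from points."""
    centroids = list(points[:k])  # naive start: first k training points
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # New centroid = mean of the group; keep the old one if a group is empty.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

# Hypothetical customers as (age, number of children).
customers = [(22, 1), (25, 0), (61, 3), (58, 2), (27, 1), (24, 0), (64, 4), (59, 3)]

training_set = customers[::2]          # every second customer
centroids = train(training_set, k=2)   # training phase on the subset only
labels = [nearest(c, centroids) for c in customers]  # apply to the complete set
print(labels)
```

Only the training step is expensive. Assigning a customer to the nearest trained centroid is a single pass over the data, which is why applying the trained model to the rest of the data set is cheap.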
Next, set the clustering model parameters in the Parameters tab (Figure 4). Here, the most important value is No. of Clusters. With these parameters, you set how the algorithm divides the customers. For instance, you could split your customers into two groups. The clustering algorithm applied in APD uses a fixed and predefined number of clusters; by default it partitions the data into 10 groups. If you don’t know anything about the structure of the data, you can only guess and compare different results. Often it helps in the beginning to run the clustering with 2, 4, and 8 clusters (or similar combinations like 2, 5, and 10) to find out how the system segments the data. If, for example, the data for 10 clusters looks very similar in each cluster, a lower cluster number is probably better.
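One way to compare runs with different cluster counts is to measure how tightly the members sit around their cluster centers. The Python sketch below does this for a single hypothetical field (age) with 2, 4, and 8 clusters. It is an illustration only: the one-dimensional k-means, the evenly spread starting centers, and the spread measure (sum of squared distances to the nearest center) are assumptions, not the measures that APD reports.

```python
def spread_for_k(values, k, iterations=10):
    """Run a tiny 1-D k-means and return the total squared distance
    of all values to their nearest cluster center (lower = tighter)."""
    values = sorted(values)
    # Assumption: start with centers spread evenly over the sorted values.
    centers = [float(values[i * len(values) // k]) for i in range(k)]
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return sum(min((v - c) ** 2 for c in centers) for v in values)

# Hypothetical customer ages.
ages = [20, 22, 35, 37, 50, 52, 65, 67]
results = {k: spread_for_k(ages, k) for k in (2, 4, 8)}
for k, spread in results.items():
    print(k, spread)
```

With this toy data the spread drops sharply between 2 and 4 clusters and only a little between 4 and 8, which suggests that four groups already describe the data well. More clusters always reduce the spread, so look for the point where the improvement flattens rather than for the smallest number.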

Figure 4
Parameters for new clustering model
The default value of 25 for Max.Distinct Values allowed is sometimes too low. It represents the maximum number of distinct values for each field used in clustering. In this example, this could be the case for “age.” You might have customers between 20 and 60 years old, with at least one person of every age between these values. You would then have 41 different values (counting both 20 and 60), which exceeds the limit of 25. In this case, you receive a warning after running the model in the training phase that says that the system couldn’t use the field because you have too many different values. You should then increase the parameter, but keep it to a limited number of values.
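Before training, you can check outside the system which fields are likely to trigger this warning. The Python sketch below counts the distinct values per field and reports those above the limit. The record layout and function name are hypothetical; only the default of 25 comes from the text above.

```python
MAX_DISTINCT_VALUES = 25  # default mentioned in the article

def fields_over_limit(records, limit=MAX_DISTINCT_VALUES):
    """Return {field: distinct_count} for every field that exceeds the limit."""
    distinct = {}
    for record in records:
        for field, value in record.items():
            distinct.setdefault(field, set()).add(value)
    return {f: len(v) for f, v in distinct.items() if len(v) > limit}

# Hypothetical customers: one of every age from 20 to 60 inclusive.
customers = [{"age": age, "gender": "F" if age % 2 else "M"} for age in range(20, 61)]
over = fields_over_limit(customers)
print(over)
```

Ages 20 through 60 inclusive give 41 distinct values, so `age` is reported, while `gender` with two values is not.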
The Stopping Criteria influence the algorithm’s performance in two ways. The lower the Max.no. of Iterations (i.e., algorithm trials to order your data properly), the faster the analysis runs. Conversely, the quality of the results tends to increase with more iterations. Depending on the data, more computational time can lead to more accurate (or different) results. This is very data dependent, so you have to find the best parameter setting by trying different ones.
When you finish setting the parameters, click on the Fields tab to return to the screen shown in Figure 3. Enter a description in the empty Description field and click on the activate icon to activate the clustering. Then click on the left arrow icon to return to transaction RSANWB. Now you can activate the clustering APD process and then execute it. Depending on the data, this can take a while.
Tip!
You could activate the clustering process in the background. Depending on your system’s settings, a timeout (by default after 20 minutes) can occur. If you anticipate that the activation will take longer than 20 minutes, your BW administrator should increase the timeout to avoid stopping the process.
Review the Clustering Results
Review the results via the clustering model context menu by following menu path Data Mining Model>View Model results. Various statistics and graphics are available in this screen (Figure 5).

Figure 5
Analysis of clustering data
The left side of the screen displays the sizes of the clusters (i.e., what percentage of the data is covered by which cluster). The right side of the screen shows which of the input fields (i.e., the fields with the values used for creating the clusters) had the most influence (i.e., were the most useful to split the data into different groups). Fields with low values appear in most of the clusters; fields with higher values (compared to the others) only in some clusters. These values make a cluster more unique compared to another because all members of this unique cluster have a property that other clusters don’t have. Additional figures show the distribution of each value inside each cluster.
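The two simplest statistics on this screen, the cluster sizes and the distribution of a field’s values inside each cluster, are easy to reproduce for a quick cross-check. The Python sketch below uses hypothetical cluster assignments and a hypothetical Bike Buyer flag. It does not compute the influence values, because the article does not say how the system derives them.

```python
from collections import Counter

def cluster_shares(labels):
    """Percentage of all records that fall into each cluster."""
    counts = Counter(labels)
    total = len(labels)
    return {cluster: 100.0 * n / total for cluster, n in sorted(counts.items())}

def value_distribution(labels, values):
    """Count how often each value of one field occurs inside each cluster."""
    dist = {}
    for label, value in zip(labels, values):
        dist.setdefault(label, Counter())[value] += 1
    return dist

# Hypothetical results: cluster number and Bike Buyer flag per customer.
labels = [1, 1, 2, 1, 2, 1, 1, 2]
bike_buyer = [True, True, False, True, False, False, True, False]

shares = cluster_shares(labels)
dist = value_distribution(labels, bike_buyer)
print(shares)
print(dist)
```

In this toy data, cluster 1 consists mostly of bike buyers and cluster 2 contains none, which is the kind of field that separates clusters well.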
Before using a model to classify new data, you should adjust the model parameters and compare the results. For example, if the age field has a much higher influence value than other fields, then you could expect that few, if any, clusters contain customers of all ages. Most of the clusters would consist of customers of a certain age or small age range. Therefore, it is useful to start with fewer than 10 clusters. In this case, you have a better chance of getting fields with a high influence factor and it is easier to compare the clusters.
After you train the model, you can use it as a transformation. A filter selects all objects that belong to a specific cluster x. The system transfers only business partners who appear in this cluster into the specified target group. You can also use APD to fill the BW/CRM business partner attribute with the newly assigned cluster.
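Conceptually, the filter step is a simple selection on the assigned cluster. The Python sketch below shows the idea with hypothetical business partner numbers and cluster assignments; in APD you configure this with a filter transformation rather than writing code.

```python
def select_cluster(assignments, cluster):
    """Return the business partners that were assigned to the given cluster."""
    return [partner for partner, c in assignments.items() if c == cluster]

# Hypothetical output of the trained model: business partner -> cluster.
assignments = {"BP0001": 1, "BP0002": 2, "BP0003": 1, "BP0004": 3}

target_group = select_cluster(assignments, cluster=1)
print(target_group)
```

Only the partners in the chosen cluster end up in the target group; all others are dropped by the filter.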
Additional Resources
For more information about APD, refer to my article, “Analysis Process Designer 101: How to Perform Complex Data Analysis Easily,” which appeared in CRM hub of SAPexperts in December 2006.
The following two books offer additional information about clustering:
- Berry, Michael J.A. and Gordon S. Linoff. Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management. New York: John Wiley & Sons, Inc., 1997.
- Han, Jiawei and Micheline Kamber. Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann Publishers, 2001.
Dr. Andre Skusa
Dr. Andre Skusa is a BW consultant for syskoplan AG and focuses on data mining and geo-marketing. Since its foundation in 1983, syskoplan has made a name for itself as a systems integrator and consulting partner for all aspects of BI and customer relationship management. syskoplan carries out software projects for major companies and sector leaders in Germany, Europe, and the United States.
You may contact the author at andre.skusa@syskoplan.de.