Cluster analysis offers a method of grouping data values based on similarities within the data. This technique segments different items into groups depending on the degree of association between items. The degree of association between two objects is maximal if they belong to the same group and minimal otherwise. A specified or determined number of groups, or clusters, is formed, allowing each data value to then be mathematically categorized into the appropriate cluster.
Cluster analysis is considered an unsupervised learning technique, since there is no target or dependent variable. There are usually underlying features that determine why certain things appear related and others unrelated. Analyzing clusters of related elements can yield meaningful insight into how various elements of a dataset report relate to each other.
MicroStrategy employs the k-Means algorithm for determining clusters. Using this technique, clusters are defined by a center in multidimensional space. The dimensions of this space are determined by the independent variables that characterize each item. Continuous variables are normalized to the range of zero to one (so that no variable dominates). Categorical variables are replaced with a binary indicator variable for each category (1 = an item is in that category, 0 = an item is not). In this fashion, each variable spans a similar range and represents a dimension in this space.
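The preprocessing described above can be sketched in a few lines of Python. This is an illustrative sketch only, not MicroStrategy's implementation: the function names `min_max_normalize` and `indicator_columns` and the sample customer values are assumptions made up for this example.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range zero to one."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def indicator_columns(values):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


# Hypothetical customer data, invented for illustration.
income = [20000, 50000, 80000]
marital_status = ["Single", "Married", "Single"]

normalized_income = min_max_normalize(income)
status_indicators = indicator_columns(marital_status)
```

After this step every column, continuous or categorical, lies between zero and one, so each contributes a comparable amount to distances measured in the multidimensional space.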
The user specifies the number of clusters, k, and the algorithm determines the coordinates of the center of each cluster.
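As a rough illustration of how cluster centers can be found for a given k, the following sketch uses the standard textbook k-means iteration (assign each item to its nearest center, then move each center to the mean of its items). The source does not describe the exact variant MicroStrategy uses, so the initialization, stopping rule, function names, and sample points here are all assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def k_means(points, k, max_iterations=100, seed=0):
    """Textbook k-means: returns (centers, labels) for a list of tuples."""
    rng = random.Random(seed)
    # Assumption: start from k distinct items chosen at random.
    centers = rng.sample(points, k)
    for _ in range(max_iterations):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                # Move the center to the mean of its members, per dimension.
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Keep an empty cluster's center where it was.
                new_centers.append(centers[j])
        if new_centers == centers:
            break  # Centers stopped moving, so the result is stable.
        centers = new_centers
    return centers, assign(points, centers)


# Hypothetical, already-normalized items in a two-dimensional space.
points = [(0.0, 0.0), (0.0, 0.1), (1.0, 1.0), (1.0, 0.9)]
centers, labels = k_means(points, k=2)
```

With these four points, the two items near the origin end up in one cluster and the two items near (1, 1) in the other; which cluster receives index 0 depends on the random starting centers.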
In MicroStrategy, the number of clusters to be generated from a set of data can be specified by the user or determined by the software. If determined by the software, the number of clusters is based on multiple iterations to reveal the optimal grouping. A maximum is set by the user to limit the number of clusters determined by the software.
The table below shows the result of a model that determines which of five clusters a customer belongs to.
The center of each cluster is represented in the five cluster columns. With this information, it is possible to learn what the model says about each cluster. For example, since the center of Cluster 1 is where “Marital Status = Single” is one and all other marital statuses are zero, single people tend to be in this cluster. On the other hand, there is not a lot of variability across the clusters for the Age Range dimensions, so age is not a significant factor for segmentation.
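One simple way to make this kind of reading systematic is to measure how much each dimension varies across the cluster centers. The sketch below is an assumption-laden illustration: the center values are invented rather than taken from the table, and using the range (maximum minus minimum) as the measure of variability is a choice made here, not something the model itself reports.

```python
# Hypothetical cluster centers (one value per cluster, five clusters),
# invented for illustration; these are not the values from the table.
centers_by_dimension = {
    "Marital Status = Single": [1.0, 0.0, 0.0, 0.1, 0.0],
    "Marital Status = Married": [0.0, 0.9, 0.8, 0.0, 0.2],
    "Age Range = 25-34": [0.20, 0.25, 0.20, 0.22, 0.18],
}


def spread(values):
    """Range of a dimension across cluster centers (assumed measure)."""
    return max(values) - min(values)


spreads = {name: spread(vals) for name, vals in centers_by_dimension.items()}

# Dimensions with the largest spread separate the clusters most strongly.
ranked = sorted(spreads, key=spreads.get, reverse=True)
```

In this made-up data the marital status indicators rank first because their centers swing between zero and one, while the age range dimension ranks last because its centers are nearly identical in every cluster.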
Creating a cluster model is only the first step. While the algorithm will always generate clusters, it is up to the user to understand how each cluster differs from the others.