Tuesday, 28 August 2018

Cluster analysis in MSTR


Cluster analysis offers a method of grouping data values based on similarities within the data. This technique segments different items into groups depending on the degree of association between items. The degree of association between two objects is maximal if they belong to the same group and minimal otherwise. A specified or determined number of groups, or clusters, is formed, allowing each data value to then be mathematically categorized into the appropriate cluster.
Cluster analysis is an unsupervised learning technique, since there is no target or dependent variable. There are usually underlying features that determine why certain things appear related and others unrelated. Analyzing clusters of related elements can yield meaningful insight into how various elements of a dataset report relate to each other.
MicroStrategy employs the k-Means algorithm for determining clusters. Using this technique, clusters are defined by a center in multidimensional space. The dimensions of this space are determined by the independent variables that characterize each item. Continuous variables are normalized to the range of zero to one (so that no variable dominates). Categorical variables are replaced with a binary indicator variable for each category (1 = an item is in that category, 0 = an item is not). In this fashion, each variable spans a similar range and represents a dimension in this space.
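The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration, not MicroStrategy's implementation: the helper names `min_max_normalize` and `one_hot` and the customer values are assumptions made up for the example.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical customer data, invented for illustration.
incomes = [20000, 50000, 80000]
marital = ["Single", "Married", "Single"]

scaled_income = min_max_normalize(incomes)
categories, indicators = one_hot(marital)

# Each item becomes one point in the space: [income, Married?, Single?]
points = [[s] + row for s, row in zip(scaled_income, indicators)]
```

After this step every coordinate lies between zero and one, so a distance computed between two points is not dominated by whichever variable happened to have the largest raw values.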
In the basic algorithm, the user specifies the number of clusters, k, and the algorithm determines the coordinates of the center of each cluster.
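For readers who want to see the mechanics, here is a minimal sketch of the textbook k-Means loop (often called Lloyd's algorithm). It is not MicroStrategy's code; the function names, the choice to pass starting centers explicitly, and the sample points are all assumptions for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, initial_centers, max_iterations=100):
    """Textbook k-Means; k is the number of starting centers supplied."""
    centers = [list(c) for c in initial_centers]
    k = len(centers)
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # No point changed cluster, so the centers are stable.
        assignments = new_assignments
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # Leave an empty cluster's center where it is.
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignments


# Hypothetical, already-normalized points forming two obvious groups.
points = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
centers, assignments = k_means(points, initial_centers=[points[0], points[2]])
```

The result of k-Means depends on where the centers start, so practical implementations typically try several starting configurations and keep the tightest grouping.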
In MicroStrategy, the number of clusters to be generated from a set of data can be specified by the user or determined by the software. If determined by the software, the number of clusters is based on multiple iterations to reveal the optimal grouping. A maximum is set by the user to limit the number of clusters determined by the software.
The table below shows the result of a model that determines which of five clusters a customer belongs to.
The center of each cluster is represented in the five cluster columns. With this information, it is possible to learn what the model says about each cluster. For example, since the center of Cluster 1 is where “Marital Status = Single” is one and all other marital statuses are zero, single people tend to be in this cluster. On the other hand, there is not a lot of variability across the clusters for the Age Range dimensions, so age is not a significant factor for segmentation.
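One rough way to spot which dimensions separate the clusters is to compare how far each coordinate varies across the cluster centers. The sketch below uses hypothetical centers and dimension names invented for illustration (they are not taken from the model described above), and the helper `spread_by_dimension` is likewise an assumption.

```python
# Hypothetical cluster centers, invented for illustration.
dimensions = [
    "Marital Status = Single",
    "Marital Status = Married",
    "Age Range = 25-34",
]
centers = [
    [1.0, 0.0, 0.30],
    [0.0, 1.0, 0.28],
    [0.0, 1.0, 0.33],
]


def spread_by_dimension(centers):
    """Range (max minus min) of each dimension across all cluster centers."""
    return [max(col) - min(col) for col in zip(*centers)]


spreads = spread_by_dimension(centers)

# Dimensions with the largest spread differ most between clusters.
ranked = sorted(zip(dimensions, spreads), key=lambda pair: pair[1], reverse=True)
for name, spread in ranked:
    print(f"{name}: {spread:.2f}")
```

A dimension whose value is nearly the same in every center, like the age range here, contributes little to telling the clusters apart.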
Creating a cluster model is only the first step: the algorithm generates the clusters, but it is up to the user to understand how each cluster differs from the others.

Friday, 3 August 2018

Logistic regression in MSTR


Logistic regression is a classification technique used to determine the most likely outcome from a finite set of values. The term logistic comes from the logistic function (the inverse of the logit function), f(x) = 1 / (1 + e^(-x)).
Notice how this function tends to push values toward zero or one: as x becomes more negative, the function approaches zero; as x becomes more positive, it approaches one. This is how regression, a technique originally used to calculate results across a continuous range of values, can be used to choose among a finite number of possible outcomes.
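The S-shaped curve described above is commonly written as f(x) = 1 / (1 + e^(-x)). A small Python sketch makes the squashing behavior easy to check; the function name `logistic` and the sample inputs are choices made for this example.

```python
import math


def logistic(x):
    """Logistic (sigmoid) function: maps any real x to a value between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, this equivalent form avoids overflow in exp(-x).
    z = math.exp(x)
    return z / (1.0 + z)


for x in (-6, -2, 0, 2, 6):
    print(x, round(logistic(x), 4))
```

Inputs far below zero land close to zero, inputs far above zero land close to one, and an input of exactly zero gives 0.5.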
The technique determines the probability that each outcome will occur. A regression equation is created for each outcome and comparisons are made to select a prediction based on the outcome with the highest likelihood.
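A minimal sketch of the "one equation per outcome, pick the most likely" idea follows. The coefficients, the input variables, and the use of softmax to turn scores into probabilities are all assumptions for illustration; the source does not say how MicroStrategy normalizes the per-outcome results.

```python
import math

# Hypothetical coefficients, invented for illustration: one equation per
# outcome, each of the form intercept + sum(weight * input).
equations = {
    "Low":    {"intercept": 2.0,  "weights": [-0.04, -0.5]},
    "Medium": {"intercept": 0.0,  "weights": [0.0, 0.0]},
    "High":   {"intercept": -3.0, "weights": [0.05, 1.0]},
}


def predict(inputs):
    """Return (most likely outcome, probability of each outcome)."""
    scores = {
        name: eq["intercept"] + sum(w * x for w, x in zip(eq["weights"], inputs))
        for name, eq in equations.items()
    }
    # Softmax: exponentiate and normalize so the probabilities sum to one.
    # Subtracting the largest score first keeps exp() from overflowing.
    top = max(scores.values())
    exps = {name: math.exp(s - top) for name, s in scores.items()}
    total = sum(exps.values())
    probabilities = {name: e / total for name, e in exps.items()}
    return max(probabilities, key=probabilities.get), probabilities


# Inputs are [age, smoker flag]; both the inputs and their meaning are assumptions.
outcome, probabilities = predict([30, 0])
```

Each outcome gets its own score, the scores are converted to probabilities, and the prediction is simply the outcome with the largest probability.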

Evaluating logistic regression using a confusion matrix

As with other categorical predictors, the success of a logistic regression model can be evaluated using a confusion matrix. The confusion matrix highlights the instances of true positives, true negatives, false positives, and false negatives. For example, a model is used to predict the risk of heart disease as Low, Medium, or High. The confusion matrix would be generated as follows:
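To make the structure concrete, here is a small sketch that builds a three-class confusion matrix from lists of actual and predicted labels. The labels come from the heart disease example; the sample results and the helper name `confusion_matrix` are hypothetical.

```python
from collections import Counter

LABELS = ["Low", "Medium", "High"]


def confusion_matrix(actual, predicted, labels=LABELS):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Hypothetical results, invented for illustration.
actual    = ["Low", "Low", "Medium", "High", "High", "Medium", "Low", "High"]
predicted = ["Low", "Medium", "Medium", "High", "Medium", "Medium", "Low", "High"]

matrix = confusion_matrix(actual, predicted)
for label, row in zip(LABELS, matrix):
    print(f"{label:>6}", row)
```

Counts on the diagonal are correct predictions; every off-diagonal cell records one specific kind of mistake, such as an actual High predicted as Medium.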
Proper analysis of the matrix depends on the predictive situation. In the scenario concerning heart disease risk, outcomes with higher false positives for Medium and High Risk are favored over increased false negatives of High Risk. A false positive in this case encourages preventative measures, whereas a false negative implies good health when, in fact, significant concern exists.
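The trade-off above can be checked numerically by counting false positives and false negatives per class. The sketch below assumes a confusion matrix with actual classes in rows and predicted classes in columns; the matrix values and the helper `false_counts` are hypothetical.

```python
LABELS = ["Low", "Medium", "High"]

# Hypothetical confusion matrix (rows = actual, columns = predicted).
matrix = [
    [2, 1, 0],
    [0, 2, 0],
    [0, 1, 2],
]


def false_counts(matrix, labels):
    """Per-class false positives and false negatives from a confusion matrix."""
    result = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])               # items that truly are this class
        col_total = sum(row[i] for row in matrix)  # items predicted as this class
        true_pos = matrix[i][i]
        result[label] = {
            "false_positives": col_total - true_pos,
            "false_negatives": row_total - true_pos,
        }
    return result


counts = false_counts(matrix, LABELS)
```

In this made-up matrix, `counts["High"]["false_negatives"]` is the figure to watch: it counts patients who are actually High risk but were predicted as something lower.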