Cluster analysis methods. Iterative methods. Types of cluster analysis procedures. The condensation search method.

Cluster analysis (CLA) is a set of multidimensional classification methods whose purpose is to form groups (clusters) of similar objects. Unlike the traditional groupings considered in the general theory of statistics, CLA divides objects into groups taking all grouping characteristics into account simultaneously.

CLA methods allow you to solve the following tasks:

- Classification of objects taking into account many features;

- verification of hypotheses about the presence of some structure in the studied set of objects, i.e. a search for an existing structure;

- the construction of new classifications for poorly studied phenomena, when it is necessary to establish the presence of connections within the aggregate and try to bring structure into it.

The following notation is used to record formalized CLA algorithms:

- $X = (X_1, X_2, \ldots, X_n)$ - the set of observed objects;

- $X_i$ - the i-th observation in the m-dimensional feature space ($i = 1, 2, \ldots, n$);

- $d_{ij}$ - the distance between the i-th and j-th objects;

- $z_{ij}$ - the normalized values of the source variables;

- $D = (d_{ij})$ - the matrix of distances between objects.

To implement any CLA method, it is necessary to introduce the concept of "similarity of objects". Moreover, in the process of classification, objects that have the greatest similarity to each other in terms of the observed variables should fall into the same cluster.

To quantify similarity, the concept of a metric is introduced. Each object is described by m features and is represented as a point in m-dimensional space. The similarity or difference between the classified objects is established depending on the metric distance between them. As a rule, the following distance measures between objects are used:

- Euclidean distance

$d_{ij} = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^2}$;

- weighted Euclidean distance

$d_{ij} = \sqrt{\sum_{k=1}^{m} w_k (x_{ik} - x_{jk})^2}$;

- city-block (Manhattan) distance

$d_{ij} = \sum_{k=1}^{m} |x_{ik} - x_{jk}|$;

- Mahalanobis distance

$d_{ij} = \sqrt{(X_i - X_j)^T S^{-1} (X_i - X_j)}$,

where $d_{ij}$ is the distance between the i-th and j-th objects; $x_{ik}$ and $x_{jk}$ are the values of the k-th variable for the i-th and j-th objects, respectively; $X_i$ and $X_j$ are the vectors of variable values of the i-th and j-th objects; $S$ is the overall covariance matrix; $w_k$ is the weight assigned to the k-th variable.
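For concreteness, a small numpy sketch of the four distance measures listed above is given below; the example vectors, the weights and the small sample used to estimate the covariance matrix are arbitrary assumptions, not data from this text.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.5])
w = np.array([0.5, 0.3, 0.2])          # feature weights for the weighted distance

euclidean = np.sqrt(((x - y) ** 2).sum())
weighted_euclidean = np.sqrt((w * (x - y) ** 2).sum())
city_block = np.abs(x - y).sum()

# The Mahalanobis distance needs the inverse of the overall covariance matrix S,
# estimated here from a small illustrative sample.
sample = np.array([[1.0, 2.0, 3.0], [2.0, 1.0, 3.0], [0.5, 2.5, 2.0], [1.5, 1.0, 4.0]])
S_inv = np.linalg.inv(np.cov(sample, rowvar=False))
mahalanobis = np.sqrt((x - y) @ S_inv @ (x - y))

print(euclidean, weighted_euclidean, city_block, mahalanobis)
```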

All CLA methods can be divided into two groups: hierarchical (agglomerative and divisive) and iterative (the k-means method, the condensation search method).

Hierarchical cluster analysis. Of all the methods of cluster analysis, the agglomerative classification algorithm is the most common. The essence of the algorithm is that at the first step each sample object is considered a separate cluster. Clusters are then joined sequentially: based on the distance matrix or similarity matrix, the closest objects are combined. If the distance matrix initially has dimension n × n, the whole merging process is completed in (n − 1) steps, after which all objects are combined into one cluster.

The sequence of mergers can be represented as a dendrogram, shown in Figure 3.1. The dendrogram shows that at the first step the second and third objects were combined into one cluster at a distance of 0.15 between them. At the second step the first object joined them; the distance from the first object to the cluster containing the second and third objects is 0.3, and so on.

The many methods of hierarchical cluster analysis differ in their merging (similarity) algorithms, the most common of which are the single-linkage method, the complete-linkage method, the average-linkage method, and Ward's method.

Complete-linkage method: a new object is included in a cluster only if the similarity between it and all objects of the cluster is not less than some given level of similarity (Figure 1.3).



Average-linkage method: when a new object is included in an existing cluster, the average value of the similarity measure between the object and the cluster members is calculated and compared with a given threshold level. If two clusters are being merged, the similarity measure between their centers is calculated and compared with the given threshold value. Consider a geometric example with two clusters (Figure 1.4).

Figure 1.4. Combining two clusters by the average-linkage method

If the similarity measure between the cluster centers is not less than the given level, the clusters are merged into one.

Ward's method: at the first step, each cluster consists of one object. First, the two closest clusters are merged. For the merged cluster, the mean value of each attribute is determined and the within-cluster sum of squared deviations is calculated:

$V_k = \sum_{i=1}^{n_k} \sum_{j=1}^{m} (x_{ij} - \bar{x}_{jk})^2$, (1.1)

where k is the cluster number, i is the object number, j is the attribute number, m is the number of attributes characterizing each object, $n_k$ is the number of objects in the k-th cluster, and $\bar{x}_{jk}$ is the mean value of the j-th attribute in cluster k.

Subsequently, at each step of the algorithm, those objects or clusters are merged that give the smallest increment of the quantity $V_k$.

The Ward method leads to the formation of clusters of approximately equal sizes with minimal intracluster variation.

The hierarchical cluster analysis algorithm can be represented as a sequence of procedures:

- normalization of the initial values of the variables;

- calculation of a matrix of distances or a matrix of measures of similarity;

- determination of the pair of the closest objects (clusters) and their combination according to the selected algorithm;

- repeating the first three procedures until all objects are combined into one cluster.

The similarity measure used to merge two clusters is determined by the following methods:

- the "nearest neighbor" method: the degree of similarity between clusters is estimated by the degree of similarity between the most similar (closest) objects of these clusters;

- the "farthest neighbor" method: the degree of similarity is estimated by the degree of similarity between the most distant (most dissimilar) objects of the clusters;

- the average-linkage method: the degree of similarity is estimated as the average value of the degrees of similarity between the objects of the clusters;

- the median-linkage method: the distance between any cluster S and the new cluster obtained by merging clusters p and q is defined as the distance from the center of cluster S to the middle of the segment connecting the centers of clusters p and q.

Condensation search method. One of the iterative classification methods is the condensation search algorithm. The essence of this iterative method is the use of a hypersphere of a given radius that moves in the space of classification features in search of local condensations (concentrations) of objects.


The condensation search method requires, first of all, calculating the matrix of distances (or the matrix of similarity measures) between objects and choosing the initial center of the sphere. Usually, at the first step, the center of the sphere is the object (point) in whose immediate vicinity the largest number of neighbors is located. Based on the given radius of the sphere R, the set of points that fall inside this sphere is determined, and for them the coordinates of the center (the vector of mean values of the attributes) are calculated.

When the next recalculation of the coordinates of the center of the sphere leads to the same result as in the previous step, the movement of the sphere stops, and the points that fall into it form a cluster, and are excluded from the further process of clustering. The listed procedures are repeated for all remaining points. The algorithm completes in a finite number of steps, and all points are distributed across the clusters. The number of clusters formed is not known in advance and strongly depends on the radius of the sphere.

To assess the stability of the resulting partition, it is advisable to repeat the clustering process several times for various values of the radius of the sphere, each time changing the radius by a small amount.

There are several ways to choose the radius of the sphere. If $d_{ij}$ is the distance between the i-th and j-th objects, the lower bound of the radius ($R_{min}$) is usually chosen close to the smallest of the pairwise distances $d_{ij}$, and the upper bound can be taken close to the largest of them.

If the algorithm is started from the value $R_{min}$ and the radius is changed by a small amount at each repetition, one can identify the values of the radius that lead to the formation of the same number of clusters, i.e. to a stable partition.
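The scheme above can be expressed as a short Python sketch; the function name, the choice of the initial center and the convergence check are illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np

def condensation_search(X, radius, max_iter=100):
    """Sketch of the condensation search method: a sphere of fixed radius is
    moved to the mean of the points it covers until its center stops changing;
    the covered points form a cluster and are removed, and the procedure repeats."""
    X = np.asarray(X, dtype=float)
    remaining = np.arange(len(X))
    labels = np.full(len(X), -1)
    cluster_id = 0
    while remaining.size > 0:
        pts = X[remaining]
        # start from the point with the most neighbours inside the sphere
        dists = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
        center = pts[np.argmax((dists < radius).sum(axis=1))]
        for _ in range(max_iter):
            inside = np.linalg.norm(pts - center, axis=1) <= radius
            new_center = pts[inside].mean(axis=0)
            if np.allclose(new_center, center):
                break
            center = new_center
        labels[remaining[inside]] = cluster_id
        remaining = remaining[~inside]
        cluster_id += 1
    return labels
```

Calling the function for several nearby values of the radius and comparing the number of clusters obtained gives the stability check described above.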

Example 1. Based on the data in table 1.1, it is necessary to classify five enterprises using hierarchical agglomerative cluster analysis.

Table 1.1

Here: $x_1$ is the average annual value of fixed assets, billion rubles; $x_2$ is material costs per ruble of manufactured products, kopecks; $x_3$ is the volume of manufactured products, billion rubles.

Solution. Before calculating the distance matrix, we normalize the initial data by the formula

$z_{ij} = \dfrac{x_{ij} - \bar{x}_j}{\sigma_j}$,

where $\bar{x}_j$ and $\sigma_j$ are the sample mean and standard deviation of the j-th variable.

From the standardized data we form the matrix Z of the normalized variable values.

We carry out the classification using the hierarchical agglomerative method. To construct the distance matrix we use the Euclidean distance; the distance between the first and second objects, for example, is computed by this formula.

The distance matrix characterizes the distances between the objects, each of which at the first step constitutes a separate cluster.

As can be seen from the matrix, the two closest objects are found; we combine them into one cluster and assign it a new number. We then recalculate the distances from all remaining objects (clusters) to this new cluster and obtain a new distance matrix.

In the new matrix, the distances between clusters are determined by the "farthest neighbor" rule, and the distance between each remaining object and the newly formed cluster is recalculated accordingly.

In this matrix we again find the closest clusters and merge them, obtaining a new cluster and assigning it the next number. Three clusters now remain: (1, 3), (2, 5) and (4).

Judging by the recalculated matrix, at the next step two of these clusters are combined into one and assigned the next number, so that only two clusters remain.

Finally, at the last step, the two remaining clusters are united at a distance of 3.861.

We present the classification results in the form of a dendrogram (Figure 1.5). The dendrogram shows which cluster is the more homogeneous in the composition of its objects: it is the one whose objects were merged at smaller distances.

Figure 1.5. Dendrogram of the clustering of five objects
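For readers who want to reproduce the scheme of Example 1 in software, a hedged sketch using SciPy is shown below; since Table 1.1 is not reproduced here, the numerical values are purely hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Hypothetical values for the five enterprises (Table 1.1 is not reproduced here);
# columns: fixed assets, material costs per ruble, output volume.
X = np.array([
    [2.0, 60.0, 1.5],
    [7.0, 45.0, 6.0],
    [3.0, 58.0, 2.0],
    [9.0, 40.0, 8.0],
    [6.5, 47.0, 5.5],
])

# Normalization (z-scores), then complete-linkage ("farthest neighbor")
# agglomeration on Euclidean distances, as in Example 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
merge_history = linkage(Z, method="complete", metric="euclidean")

dendrogram(merge_history, labels=[f"obj {i+1}" for i in range(len(X))])
plt.title("Dendrogram of five objects (complete linkage)")
plt.show()
```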

Example 2. Based on the data below, classify the stores according to three criteria: $x_1$ - the area of the sales floor, m²; $x_2$ - turnover per salesperson, monetary units; $x_3$ - the level of profitability, %.


To classify stores, use the condensation search method (you must select the first cluster).

Solution. 1. Calculate the distances between the objects using the Euclidean metric

$d_{ij} = \sqrt{\sum_{k=1}^{t} (z_{ik} - z_{jk})^2}$,

where $z_{ik}$ and $z_{jk}$ are the standardized values of the source variables for the i-th and j-th objects, respectively, and t is the number of features.

2. Based on the matrix Z, we calculate the square symmetric matrix D of distances between the objects.

Analysis of the distance matrix helps determine the position of the initial center of the sphere and select the radius of the sphere.

In this example, most of the "small" distances are in the first line, i.e. the first object has a lot of “close” neighbors. Therefore, the first object can be taken as the center of the sphere.

3. Define the radius of the sphere. In this case, objects whose distance from the first object is less than 2 fall into the sphere.

Cluster analysis is a statistical analysis that makes it possible to partition a large amount of data into classes or groups (from the English word cluster, a concentration or group) according to some criterion or combination of criteria.

To classify the data $X_1, \ldots, X_n$, the concept of a metric, or distance, is used.

A metric is a function ρ mapping a metric space into the space of real numbers and having the following properties (the axioms of a metric):

  • 1) ρ(X, Y) ≥ 0;
  • 2) ρ(X, Y) = ρ(Y, X);
  • 3) ρ(X, Y) = 0 if and only if X = Y;
  • 4) ρ(X, Y) ≤ ρ(X, Z) + ρ(Z, Y) (the triangle inequality).

In the theory of cluster analysis, the following metrics are used to measure the distance between individual points (vectors):

1) Euclidean distance

$\rho(X, Y) = \sqrt{\sum_{k=1}^{m} (x_k - y_k)^2}$;

2) weighted Euclidean distance

$\rho_w(X, Y) = \sqrt{\sum_{k=1}^{m} w_k (x_k - y_k)^2}$,

where $w_k$ are weights proportional to the importance of the k-th attribute in the classification problem. The weights are set on the basis of additional research, and it is assumed that $\sum_k w_k = 1$;

3) Hamming distance (or city-block distance), the distance on a city map between blocks,

$\rho_H(X, Y) = \sum_{k=1}^{m} |x_k - y_k|$;

4) Mahalanobis distance (Mahalanobis-type distance)

$\rho(X, Y) = \sqrt{(X - Y)^T \Lambda \Sigma^{-1} \Lambda (X - Y)}$,

where $\Lambda$ is a symmetric positive-definite matrix of weight coefficients (often chosen diagonal) and $\Sigma$ is the covariance matrix of the vectors $X_1, \ldots, X_n$;

5) Minkowski distance

$\rho(X, Y) = \left( \sum_{k=1}^{m} |x_k - y_k|^r \right)^{1/r}$.

Distances 1), 2), 3) and 5) are used in the case of a normal distribution of independent random variables $X_1, \ldots, X_n \sim N(M, \Sigma)$ or in the case of their homogeneity in the geometric sense, when each vector is equally important for the classification. Distance 4) is used when the vectors $X_1, \ldots, X_n$ are covariance-linked (correlated).

The choice of metric is carried out by the researcher, depending on what result he wants to get. This choice is not formalized, since it depends on many factors, in particular, on the expected result, on the researcher’s experience, the level of his mathematical training, etc.

In a number of algorithms, along with distances between vectors, distances between clusters and cluster associations are used.

Let $S_l$ be the l-th cluster, consisting of $n_l$ vectors (points), and let $\bar{X}^{(l)}$ be the sample mean over the points falling into the cluster $S_l$, i.e. the center of gravity of the cluster. Then the following distances are distinguished between clusters that do not contain other clusters inside them:

1) the distance between clusters according to the "nearest neighbor" principle

$\rho_{\min}(S_l, S_m) = \min_{X_i \in S_l,\, X_j \in S_m} \rho(X_i, X_j)$;

2) the distance between clusters according to the "farthest neighbor" principle

$\rho_{\max}(S_l, S_m) = \max_{X_i \in S_l,\, X_j \in S_m} \rho(X_i, X_j)$;

3) the distance between the centers of gravity of the groups

$\rho_c(S_l, S_m) = \rho(\bar{X}^{(l)}, \bar{X}^{(m)})$;

4) the distance between clusters according to the "average linkage" principle

$\rho_{avg}(S_l, S_m) = \dfrac{1}{n_l n_m} \sum_{X_i \in S_l} \sum_{X_j \in S_m} \rho(X_i, X_j)$;

5) the generalized Kolmogorov distance

$\rho_K^{(\tau)}(S_l, S_m) = \left[ \dfrac{1}{n_l n_m} \sum_{X_i \in S_l} \sum_{X_j \in S_m} \rho^{\tau}(X_i, X_j) \right]^{1/\tau}$.

The distance between clusters that are unions of other classes can be calculated by the general formula

$\rho(S, S_{(k,l)}) = \alpha\,\rho(S, S_k) + \beta\,\rho(S, S_l) + \delta\,|\rho(S, S_k) - \rho(S, S_l)| + \gamma\,\rho(S_k, S_l)$,

where $S_{(k,l)}$ is the cluster obtained by merging the classes $S_k$ and $S_l$.

All the particular distances are obtained from this generalized formula. For $\alpha = \beta = 1/2$, $\delta = -1/2$, $\gamma = 0$ we obtain the "nearest neighbor" distance; for $\alpha = \beta = \delta = 1/2$, $\gamma = 0$, the "farthest neighbor" distance; for $\alpha = n_k/(n_k + n_l)$, $\beta = n_l/(n_k + n_l)$, $\delta = 0$, $\gamma = 0$, the distance between the centers of gravity of the groups.

The methods of cluster analysis are divided into I) agglomerative (combining), II) divisive (dividing) and III) iterative.

The first sequentially combine individual objects into clusters; the second, on the contrary, divide clusters into objects; the third combine the two approaches. Their feature is the formation of clusters based on partition conditions (so-called parameters), which can be changed while the algorithm is running in order to improve the quality of the partition. Iterative methods are commonly used to classify large amounts of information.

Let us consider the agglomerative methods in more detail. Agglomerative methods are the simplest and most common among cluster analysis algorithms. At the first step, each vector or object $X_1, \ldots, X_n$ of the raw data is considered a separate cluster or class. Based on the calculated distance matrix, the objects closest to each other are selected and combined. Obviously, the process ends in (n − 1) steps, when all objects have been combined into one cluster.

The sequence of mergers can be represented as a dendrogram, or tree. Fig. 1.18 shows that at the first step the vectors $X_1$ and $X_2$ were combined, since the distance between them is 0.3. At the second step the vector $X_3$ was attached to them, lying at a distance of 0.5 from the cluster $(X_1, X_2)$, and so on. At the last step all vectors are combined into one cluster.

Fig. 1.18.

The agglomerative methods include the single-linkage, average-linkage and complete-linkage methods and Ward's method.

1. Single-linkage method. Let $X_1, \ldots, X_n$ be the data vectors, each of which forms its own cluster. First, the matrix of distances between these clusters is calculated, using the nearest-neighbor distance as the metric. From this matrix the two closest vectors are selected; they form the first cluster $S_1$. At the next step a new distance matrix is calculated between $S_1$ and the remaining vectors (which are treated as clusters), with the distance between merged classes taken with $\alpha = \beta = 1/2$, $\delta = -1/2$, $\gamma = 0$. The cluster closest to the previous class $S_1$ is merged with it, forming $S_2$, and so on; after (n − 1) steps all vectors are combined into a single cluster.

Advantages: 1) at each step of the algorithm, only one element is added, 2) the method is extremely simple, 3) the algorithm is insensitive to transformations of the source data (rotation, shift, transfer, stretching).

Disadvantages: 1) the distance matrix must be constantly recalculated; 2) the number of clusters is known in advance and cannot be reduced.

  • 2. Complete-linkage method. The method practically repeats the single-linkage method, except that a new object is included in a cluster if and only if the distance between the objects (vectors or clusters) is less than a certain predetermined number, which is set by the user. The distance is calculated only according to the "farthest neighbor" principle (the same applies to the distance between merged classes: only the farthest-neighbor rule with $\alpha = \beta = \delta = 1/2$, $\gamma = 0$ is used).
  • 3. Average-linkage method. The cluster-formation algorithm coincides with the single-linkage algorithm, but the decision on whether to include a new object in a cluster is made according to the average-linkage principle. As in the complete-linkage method, all distances calculated between clusters are compared with a user-defined number, and if the distance is less than the given number, the new object is included in the old class. Thus, the average-linkage method differs from the complete-linkage method only in the way the distance between clusters is calculated.
  • 4. Ward's method. Let $X_1, \ldots, X_n$ be the data, each vector forming its own cluster. We find the distance matrix using some metric (for example, the Mahalanobis distance) and determine from it the clusters closest to each other. For a cluster $S_k$ we calculate the sum of squared deviations of the vectors within the cluster by the formula

$V_k = \sum_{i=1}^{n_k} \sum_{j=1}^{m} (x_{ij} - \bar{x}_{jk})^2$,

where k is the cluster number, i is the number of the vector in the cluster, j is the coordinate number of $X_i \in \mathbb{R}^m$, $n_k$ is the number of vectors in the cluster, and $\bar{x}_{jk}$ is the sample mean of the j-th coordinate in $S_k$. The value $V_k$ characterizes the deviations of the vectors from one another inside the cluster (the new $S_k \cup S_l$ or the old $S_k$). $V_k$ should be calculated before and after a merger, and all possible variants of such mergers must be considered. Only those vectors or clusters are added to the cluster $S_k$ that lead to the smallest change in $V_k$ after merging and which, consequently, lie at the minimum distance from the source cluster $S_k$.
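The computation of $V_k$ and the choice of the merge with the smallest increment can be sketched in a few lines of Python; the function names and the sample data below are illustrative assumptions.

```python
import numpy as np

def within_cluster_ss(points):
    """Sum of squared deviations of the vectors from the cluster mean (V_k)."""
    points = np.asarray(points, dtype=float)
    return ((points - points.mean(axis=0)) ** 2).sum()

def best_ward_merge(clusters):
    """Return the pair of cluster indices whose merger gives the smallest
    increase of the total within-cluster sum of squares."""
    best_pair, best_increase = None, np.inf
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            merged = np.vstack([clusters[a], clusters[b]])
            increase = (within_cluster_ss(merged)
                        - within_cluster_ss(clusters[a])
                        - within_cluster_ss(clusters[b]))
            if increase < best_increase:
                best_pair, best_increase = (a, b), increase
    return best_pair, best_increase

# Illustrative data: each object starts as its own cluster.
data = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 7.0], [5.1, 6.8]])
clusters = [data[i:i + 1] for i in range(len(data))]
print(best_ward_merge(clusters))
```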

Let us now consider iterative methods. The essence of iterative methods is that clustering begins with the specification of some initial conditions. For example, one must specify the number of clusters to be obtained, or specify the distance that determines the end of the cluster-formation process, etc. The initial conditions are selected according to the result the researcher needs; usually, however, they are given by a solution found with one of the agglomerative methods. Iterative methods include the k-means method and the condensation search method.

  • 1. The k-means method. Let there be vectors $X_1, \ldots, X_n \in \mathbb{R}^m$ that must be divided into k clusters. At the zero step, k of the n vectors are selected at random, each of them being considered a separate cluster. This gives the set of reference clusters (etalons) $e_1^{(0)}, \ldots, e_k^{(0)}$ with weights $\omega_1^{(0)} = \ldots = \omega_k^{(0)} = 1$, and the distance matrix between the remaining vectors $X_i$ and the etalons $e_1^{(0)}, \ldots, e_k^{(0)}$ is calculated by some metric, for example the Euclidean one.

Using the calculated distance matrix, the vector $X_i$ is assigned to the etalon whose distance to it is minimal. Suppose, for definiteness, that this is $e_j^{(0)}$. The etalon is replaced by a new one, recalculated with the attached point taken into account:

$e_j^{(1)} = \dfrac{\omega_j^{(0)} e_j^{(0)} + X_i}{\omega_j^{(0)} + 1}$.

In addition, the weight is recalculated: $\omega_j^{(1)} = \omega_j^{(0)} + 1$.

If the matrix contains two or more minimum distances, $X_i$ is included in the cluster with the lowest sequence number.

At the next step, the next vector is selected from the remaining ones, and the procedure is repeated. Thus, after (n − k) steps each etalon $e_j^{(n-k)}$ has a corresponding weight $\omega_j^{(n-k)}$ and the clustering procedure ends. For large n and small k the algorithm quickly converges to a stable solution, i.e. to a solution in which the etalons obtained after the first application of the algorithm coincide in number and composition with the etalons found by repeated application of the method. Nevertheless, the algorithmic procedure is always repeated several times, using the partition obtained in previous calculations as the reference vectors (as an initial approximation): the previously found etalons $e_1^{(n-k)}, \ldots, e_k^{(n-k)}$ are taken as $e_1^{(0)}, \ldots, e_k^{(0)}$, and the procedure is repeated.

  • 2. Condensation search method. This is another iterative algorithm. It does not require a priori specification of the number of clusters. At the first step, the matrix of distances between $X_1, \ldots, X_n \in \mathbb{R}^m$ is calculated by some metric. Then one vector is selected at random to play the role of the center of the first cluster; this is the initial approximation. This vector is assumed to lie at the center of an m-dimensional sphere of radius R, the radius being set by the researcher. After that, the vectors $X_{s_1}, \ldots, X_{s_k}$ falling into this sphere are determined and their sample mean is computed:

$\bar{X} = \dfrac{1}{k} \sum_{i=1}^{k} X_{s_i}$.

The center of the sphere is then moved to $\bar{X}$, and the calculation procedure is repeated. The condition for terminating the iterative process is the equality of the mean vectors $\bar{X}$ found at steps t and (t + 1). The elements $X_{s_1}, \ldots, X_{s_k}$ falling inside the sphere are included in one cluster and excluded from further consideration. For the remaining points the algorithm is repeated. The algorithm converges for any choice of the initial approximation and any amount of input data. However, to obtain a stable partition (i.e. a partition in which the clusters found after the first application of the algorithm coincide in number and composition with the clusters found by repeated application of the method), it is recommended to repeat the algorithmic procedure several times for different values of the sphere radius R. A sign of a stable partition is the formation of the same number of clusters with the same composition.

Note that the clustering problem does not have a unique solution. As a result, iterating over all valid partitions of data into classes is quite difficult and not always possible. In order to assess the quality of various clustering methods, the concept of a partition quality functional is introduced, which takes a minimum value on the best (from the researcher's point of view) partition.

Let $X_1, \ldots, X_n \in \mathbb{R}^m$ be a set of observations divided into classes $S = (S_1, \ldots, S_k)$, where k is known in advance. Then the main partition-quality functionals for a known number of clusters have the form:

1) The weighted sum of intraclass variances

$Q_1(S) = \sum_{l=1}^{k} \sum_{X_i \in S_l} \rho^2(X_i, a^{(l)})$,

where $a^{(l)}$ is the sample mean (mathematical expectation) of the cluster $S_l$.

The functional $Q_1(S)$ makes it possible to evaluate the degree of homogeneity of all clusters as a whole.

2) The sum of pairwise intraclass distances between elements,

$Q_2(S) = \sum_{l=1}^{k} \sum_{X_i, X_j \in S_l} \rho(X_i, X_j)$ (or the analogous sum of squared pairwise distances $\rho^2(X_i, X_j)$),

where $n_l$ is the number of elements in the cluster $S_l$.

3) The generalized intraclass variance

$Q_3(S) = \dfrac{1}{k} \sum_{j=1}^{k} \det \hat{\Sigma}_j$,

where $n_j$ is the number of elements in $S_j$ and $\hat{\Sigma}_j$ is the sample covariance matrix for $S_j$.

This functional is the arithmetic mean of the generalized intraclass variances calculated for each cluster. As is known, the generalized variance allows one to estimate the degree of dispersion of multidimensional observations; therefore $Q_3(S)$ makes it possible to estimate the average scatter of the observation vectors in the classes $S_1, \ldots, S_k$, hence its name. $Q_3(S)$ is used when it is necessary to solve the problem of data compression, i.e. of concentrating the observations in a space of dimension lower than the original one.

4) The quality of the classification of observations can also be estimated using the Hotelling criterion. To do this, we apply the criterion for testing the hypothesis $H_0$ of equality of the mean vectors of two multidimensional populations and calculate the statistic

$Q_4(S) = \dfrac{n_l n_m}{n_l + n_m} (\bar{X}_l - \bar{X}_m)^T (S^*)^{-1} (\bar{X}_l - \bar{X}_m)$,

where $n_l$ and $n_m$ are the numbers of vectors in the classes $S_l$ and $S_m$; $X_l$, $X_m$ are the centered source data; $S^*$ is the combined covariance matrix of the clusters $S_l$, $S_m$: $S^* = \dfrac{1}{n_l + n_m - 2}(X_l^T X_l + X_m^T X_m)$. As before, the value $Q_4(S)$ is compared with a tabular value calculated by the formula

$T^2_{\alpha} = \dfrac{m (n_l + n_m - 2)}{n_l + n_m - m - 1}\, F_{\alpha}(m,\; n_l + n_m - m - 1)$,

where m is the initial dimension of the observation vectors and $\alpha$ is the significance level.

The hypothesis $H_0$ is accepted with probability $(1 - \alpha)$ if $Q_4(S) < T^2_{\alpha}$, and is rejected otherwise.

It is possible to evaluate the quality of classifications empirically. For example, you can compare the sample means found for each class with the sample average of the entire set of observations. If they differ twice or more, then the partition is good. A more correct comparison of the cluster sample means with the sample average of the entire set of observations leads to the use of analysis of variance to assess the quality of classifications.
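As an illustration, the weighted sum of intraclass variances $Q_1(S)$ introduced above can be computed directly for any partition; the helper name and the toy data below are assumptions made only for the example.

```python
import numpy as np

def q1_functional(X, labels):
    """Q1: sum over clusters of squared Euclidean distances of the objects
    to their cluster mean (the weighted sum of intraclass variances)."""
    X = np.asarray(X, dtype=float)
    total = 0.0
    for lbl in np.unique(labels):
        pts = X[labels == lbl]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

# Illustrative comparison of two partitions of the same data.
X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
print(q1_functional(X, np.array([0, 0, 1, 1])))  # compact clusters -> small Q1
print(q1_functional(X, np.array([0, 1, 0, 1])))  # mixed clusters  -> large Q1
```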

If the number of clusters in $S = (S_1, \ldots, S_k)$ is not known in advance, the following partition-quality functionals are used, for an arbitrarily chosen integer m:

1) $I_m(S) = \left[ \dfrac{1}{n} \sum_{i=1}^{k} \dfrac{1}{n_i} \sum_{X_j \in S_i} \sum_{X_t \in S_i} \rho^m(X_j, X_t) \right]^{1/m}$ - the average measure of intraclass scattering;

2) $Z_m(S) = \left[ \dfrac{1}{n} \sum_{i=1}^{n} \left( \dfrac{\nu(X_i)}{n} \right)^m \right]^{1/m}$ - the measure of concentration of points,

where $\nu(X_i)$ is the number of elements in the cluster containing the point $X_i$.

Note that for an arbitrary value of the parameter m the functional $Z_m(S)$ reaches its minimum, equal to 1/n, if the original clusterization $S = (S_1, \ldots, S_k)$ is split into mono-clusters $S_i = (X_i)$, since then $\nu(X_i) = 1$. At the same time $Z_m(S)$ reaches its maximum, equal to 1, if S is a single cluster containing all the source data, since then $\nu(X_i) = n$. In particular cases it can be shown that $Z_{-1}(S) = 1/k$, where k is the number of distinct clusters in $S = (S_1, \ldots, S_k)$; $Z_{\infty}(S) = \max_i (n_i/n)$, where $n_i$ is the number of elements in the cluster $S_i$; and $Z_{-\infty}(S) = \min_i (n_i/n)$.

Note that in the case of an unknown number of clusters, the partition-quality functionals Q(S) can be chosen as an algebraic combination (sum, difference, product, ratio) of the two functionals $I_m(S)$ and $Z_m(S)$, since the first is a decreasing and the second an increasing function of the number of classes k. This behavior of $I_m(S)$ and $Z_m(S)$ guarantees the existence of an extremum of Q(S).

Introduction

The term cluster analysis, first introduced by Tryon in 1939, includes more than 100 different algorithms.

Unlike classification tasks, cluster analysis does not require a priori assumptions about the data set, does not impose restrictions on the representation of the objects under study, and allows the analysis of indicators of various types of data (interval data, frequencies, binary data). It should be remembered that variables should be measured in comparable scales.

Cluster analysis makes it possible to reduce the dimensionality of the data and to make it more visual.

Cluster analysis can be applied to sets of time series; periods of similarity of some indicators can be distinguished here and groups of time series with similar dynamics can be determined.

Cluster analysis simultaneously developed in several directions, such as biology, psychology, and others, so most methods have two or more names.

The tasks of cluster analysis can be grouped into the following groups:

    Typology or classification development.

    A study of useful conceptual schemes for grouping objects.

    Presentation of hypotheses based on data research.

    Testing hypotheses or studies to determine whether the types (groups) identified in one way or another are actually present in the available data.

As a rule, in the practical use of cluster analysis, several of these problems are simultaneously solved.

                Lesson purpose

Obtaining skills of practical application of hierarchical and iterative methods of cluster analysis.

                Practical task

Develop algorithms for the nearest-neighbor and k-means methods and implement them as computer programs. Using a random number generator, generate 50 realizations of x = (x1, x2), a random two-dimensional quantity whose coordinates are uniformly distributed in the interval (3.8). Using the developed programs, distribute them into the minimum number of clusters, each of which fits into a sphere of radius 0.15.
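A possible starting point for this task is sketched below in Python; the interval is read here as (3, 8), which is one possible interpretation of the "(3.8)" above, and the function name and the radius check are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 two-dimensional points with coordinates uniform on the assumed interval (3, 8).
sample = rng.uniform(low=3.0, high=8.0, size=(50, 2))

def fits_in_sphere(points, radius=0.15):
    """Check that every point of a cluster lies within `radius`
    of the cluster's geometric center."""
    center = points.mean(axis=0)
    return np.all(np.linalg.norm(points - center, axis=1) <= radius)

print(fits_in_sphere(sample[:3]))
```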

                Guidelines

The name "cluster analysis" comes from the English word cluster, meaning a concentration or bunch. Cluster analysis is a wide class of multivariate statistical analysis procedures that allow automated grouping of observations into homogeneous classes, or clusters.

A cluster has the following mathematical characteristics:

  • cluster center;

  • cluster radius;

  • cluster dispersion;

  • standard deviation.

The center of a cluster is the mean position of its points in the space of variables.

Cluster radius - the maximum distance of points from the center of the cluster.

Cluster dispersion is a measure of the dispersion of points in space relative to the center of a cluster.

The standard deviation (RMS) of the objects relative to the center of the cluster is the square root of the dispersion of the cluster.

Cluster Analysis Methods

Cluster analysis methods can be divided into two groups:

    hierarchical;

    non-hierarchical.

Each group includes many approaches and algorithms.

Using different methods of cluster analysis, the analyst can get different solutions on the same data. This is considered normal.

    Hierarchical methods of cluster analysis

The essence of hierarchical clustering consists in sequentially combining smaller clusters into large ones or dividing large clusters into smaller ones.

Hierarchical agglomerative methods (Agglomerative Nesting, AGNES)

This group of methods is characterized by a sequential combination of the initial elements and a corresponding decrease in the number of clusters.

At the beginning of the algorithm, all objects are separate clusters. In the first step, the most similar objects are combined into a cluster. In the next steps, the union continues until all the objects make up one cluster.

Hierarchical divisive methods (DIvisive ANAlysis, DIANA)

These methods are the logical opposite of agglomerative methods. At the beginning of the algorithm, all objects belong to one cluster, which is divided into smaller clusters in the next steps, resulting in a sequence of splitting groups.

Hierarchical clustering methods differ in the rules for building clusters. The rules are the criteria that are used when deciding on the "similarity" of objects when they are combined into a group.

    Similarity measures

To calculate the distance between objects, various similarity measures (similarity measures) are used, also called distance metrics or functions.

Euclidean distance is the geometric distance in multidimensional space and is calculated by formula (4.1):

$d(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2}$. (4.1)

The Euclidean distance (and its square) is calculated from the source, and not from standardized data.

The squared Euclidean distance is calculated by formula (4.2):

$d(x, y) = \sum_{i} (x_i - y_i)^2$. (4.2)

The Manhattan distance (the distance of city blocks), also called the "Hamming" or "city-block" distance, is calculated as the sum of the absolute differences of the coordinates. In most cases, this distance measure leads to results similar to those for the Euclidean distance. However, for this measure the influence of individual outliers is smaller than for the Euclidean distance, since the coordinates are not squared. The Manhattan distance is calculated by formula (4.3):

$d(x, y) = \sum_{i} |x_i - y_i|$. (4.3)

The Chebyshev distance is worth using when two objects should be considered "different" if they differ in any single dimension. The Chebyshev distance is calculated by formula (4.4):

$d(x, y) = \max_{i} |x_i - y_i|$. (4.4)

The power distance is used when one wants to progressively increase or decrease the weight assigned to a dimension in which the corresponding objects differ greatly. The power distance is calculated by formula (4.5):

$d(x, y) = \left( \sum_{i} |x_i - y_i|^p \right)^{1/r}$, (4.5)

where r and p are user-defined parameters. The parameter p is responsible for the gradual weighting of differences along individual coordinates, and the parameter r for the progressive weighting of large distances between objects. If both parameters r and p are equal to two, this distance coincides with the Euclidean distance.

The percent disagreement is used when the data are categorical. This distance is calculated by formula (4.6):

$d(x, y) = \dfrac{\text{number of coordinates for which } x_i \ne y_i}{\text{number of coordinates}}$. (4.6)
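For reference, most of the measures above are available ready-made in SciPy; a small sketch follows (the example vectors are arbitrary):

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 4.0, 2.0])
y = np.array([2.0, 1.0, 2.0])

print(distance.euclidean(x, y))        # formula (4.1)
print(distance.sqeuclidean(x, y))      # formula (4.2)
print(distance.cityblock(x, y))        # formula (4.3), Manhattan
print(distance.chebyshev(x, y))        # formula (4.4)
print(distance.minkowski(x, y, p=3))   # related to the power distance (4.5) with r = p
print(distance.hamming(x, y))          # proportion of differing coordinates, cf. (4.6)
```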

    Combining or linking methods

At the first step, when each object is a separate cluster, the distances between these objects are determined by the chosen measure. However, when several objects are linked together, other methods for determining the distance between the clusters must be used. There are many methods for cluster joining:

    Single linkage (nearest neighbor method): the distance between two clusters is determined by the distance between the two closest objects (nearest neighbors) in the different clusters.

    Complete linkage (the method of the most distant neighbors): the distance between clusters is determined by the largest distance between any two objects in the different clusters (i.e., by the "most distant neighbors").

    Unweighted pairwise mean: the distance between two different clusters is calculated as the average distance between all pairs of objects in them.

    Weighted pairwise mean: the method is identical to the unweighted pairwise mean method, except that in the calculations the size of the corresponding clusters (i.e., the number of objects contained in them) is used as a weighting coefficient.

    Unweighted centroid method: the distance between two clusters is defined as the distance between their centers of gravity.

    Weighted centroid method (median): the method is identical to the unweighted centroid method, except that the calculations use weights to take into account the difference between cluster sizes (i.e., the number of objects in them).

    Ward's method: the distance between clusters is defined as the increase in the sum of squared distances of the objects to the cluster centers that results from merging them. The method differs from all the others because it uses analysis-of-variance techniques to estimate the distances between clusters. The method minimizes the sum of squares for any two (hypothetical) clusters that can be formed at each step.

Nearest neighbor method

The distance between two classes is defined as the distance between their closest representatives.

Before the algorithm starts, the distance matrix between the objects is calculated. According to the classification criterion, the clusters whose closest representatives are at the smallest distance are merged: the two objects with the smallest distance are placed in one cluster. After that, the distance matrix must be recalculated taking the new cluster into account. At each step, the minimum value in the distance matrix is found; it corresponds to the distance between the two closest clusters. The clusters found are combined into a new cluster. This procedure is repeated until all clusters are combined.

When using the nearest neighbor method, special attention should be paid to the choice of the measure of distance between objects. Based on it, the initial distance matrix is formed, which determines the entire further classification process.
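Since the practical task asks for an implementation of the nearest neighbor method, a minimal from-scratch sketch is given below; the function name, the stopping condition by a target number of clusters, and the example points are assumptions of this sketch.

```python
import numpy as np

def single_linkage(X, n_clusters):
    """Sketch of agglomerative clustering with the nearest-neighbor (single
    linkage) rule: repeatedly merge the two clusters whose closest members
    are nearest to each other."""
    X = np.asarray(X, dtype=float)
    clusters = [[i] for i in range(len(X))]
    # full matrix of pairwise Euclidean distances between objects
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    while len(clusters) > n_clusters:
        best, best_d = (0, 1), np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].min()  # closest members
                if d < best_d:
                    best, best_d = (a, b), d
        a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Example: four points forming two natural groups.
print(single_linkage(np.array([[0, 0], [0.1, 0], [5, 5], [5, 5.2]]), 2))
```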

    Iterative methods.

With a large number of observations, hierarchical methods of cluster analysis are not suitable. In such cases, non-hierarchical methods are used, which partition the source data into new clusters and are iterative methods of fragmenting the original population. In the process of division, new clusters are formed until the stopping rule is fulfilled.

Such non-hierarchical clustering consists in dividing the data set into a certain number of individual clusters. There are two approaches. The first is to determine the boundaries of the clusters as the most dense sections in the multidimensional space of the source data, that is, the definition of the cluster where there is a large "cluster of points". The second approach is to minimize the measure of differences in objects.

Unlike hierarchical classification methods, iterative methods can lead to the formation of intersecting clusters, when one object can simultaneously belong to several clusters.

Iterative methods include, for example, the k-means method, the condensation search method and others. Iterative methods are fast, which allows them to be used for processing large arrays of source information.

K-means algorithm (k-means)

Among the iterative methods, the most popular is MacQueen's k-means method. Unlike hierarchical methods, in most implementations of this method the user must specify the desired number of final clusters, usually denoted by k. The k-means algorithm builds k clusters located as far apart from one another as possible. The main requirement for the type of problems the k-means algorithm solves is the presence of assumptions (hypotheses) about the number of clusters, and the clusters should be as different as possible. The choice of k may be based on previous research, theoretical considerations, or intuition.

As in hierarchical clustering methods, the user can choose one or another type of similarity measure. Different k-means algorithms differ in the way the initial centers of the given clusters are chosen. In some versions of the method, the user can (or must) set such initial points himself, either by choosing them from real observations or by specifying the coordinates of these points for each of the variables. In other implementations, a given number k of initial points is generated at random, and these initial points (cluster centers) can be refined further in several stages. There are four main stages of such methods:

    k observations are selected or assigned to serve as the primary cluster centers;

    if necessary, intermediate clusters are formed by assigning each observation to the nearest specified cluster centers;

    after all observations are assigned to individual clusters, primary cluster centers are replaced by cluster averages;

    the previous iteration is repeated until the changes in the coordinates of the cluster centers become minimal.

The general idea of the algorithm: for a given fixed number of clusters k, the observations are assigned to clusters so that the means within the clusters (for all variables) differ from one another as much as possible.

Algorithm description

    The initial distribution of objects in clusters.

The number k is chosen, along with k points. At the first step, these points are considered the "centers" of the clusters. Each cluster has one center. The selection of the initial centroids can be carried out as follows:

    choosing k observations so as to maximize the initial distances between them;

    random selection of k observations;

    choosing the first k observations.

Then, each object is assigned to a specific closest cluster.

    Iterative process.

The cluster centers are calculated; from this point on, the center of a cluster is the coordinate-wise mean of its objects. The objects are then redistributed. The process of calculating the centers and redistributing the objects continues until one of the following conditions is met:

    cluster centers stabilized, i.e., all observations belong to the cluster to which they belonged before the current iteration. In some variants of this method, the user can set the numerical value of the criterion, interpreted as the minimum distance for selecting new cluster centers. Observation will not be considered as a candidate for a new cluster center if its distance to the replaced cluster center exceeds a predetermined number. This parameter is called a "radius" in a number of programs. In addition to this parameter, it is usually possible to specify a sufficiently small number with which the change in distance is compared for all cluster centers. This parameter is usually called “convergence,” because it reflects the convergence of the iterative clustering process;

    the number of iterations is equal to the maximum number of iterations.
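A minimal Python sketch of the stages just listed (random initial centers, assignment to the nearest center, recomputation of the centers, repetition until the centers stabilize) is given below; the function name, the tolerance and the iteration limit are illustrative assumptions.

```python
import numpy as np

def k_means(X, k, max_iter=100, tol=1e-6, seed=0):
    """Basic k-means iteration as described above."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # stage 1: random choice of k observations as primary centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # stage 2: assign each observation to the nearest center
        labels = np.argmin(
            np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
        # stage 3: replace centers by the coordinate-wise cluster means
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)])
        # stage 4: stop when the centers stabilize
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return labels, centers
```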

Clustering Quality Check

After obtaining the results of cluster analysis by the k-means method, the correctness of the clustering should be verified (i.e., one should assess how much the clusters differ from each other). For this, the mean values for each cluster are calculated. With good clustering, very different means should be obtained for all measurements, or at least for most of them.

Advantages of the k-means algorithm:

    ease of use;

    speed of use;

    understandability and transparency of the algorithm.

Disadvantages of the k-means algorithm:

    the algorithm is too sensitive to outliers, which can distort the mean. A possible solution to this problem is to use a modification of the algorithm, the k-medians algorithm;

    the algorithm can work slowly on large databases. A possible solution to this problem is to use a data sample.

The report should contain:

    description and flowcharts of algorithms;

    source codes of software modules;

    the results of the operation of algorithms in the form of graphs.

With a large number of observations, hierarchical methods of cluster analysis are not suitable. In such cases, non-hierarchical methods based on partitioning are used; these are iterative methods of fragmenting the original population. In the process of division, new clusters are formed until the stopping rule is fulfilled.

Such non-hierarchical clustering consists in dividing the data set into a certain number of individual clusters. There are two approaches. The first is to determine the boundaries of the clusters as the most dense sections in the multidimensional space of the source data, i.e. definition of a cluster where there is a large "concentration of points". The second approach is to minimize the measure of differences in objects

K-means algorithm (k-means)

The most common among non-hierarchical methods is the k-means algorithm, also called fast cluster analysis. A full description of the algorithm can be found in Hartigan and Wong (1979). Unlike hierarchical methods, which do not require preliminary assumptions about the number of clusters, to use this method it is necessary to have a hypothesis about the most probable number of clusters.

The k-means algorithm constructs k clusters located at as large distances as possible from each other. The main type of problems that the k-means algorithm solves is the existence of assumptions (hypotheses) regarding the number of clusters, and they should be as different as possible. The choice of k can be based on previous research, theoretical considerations, or intuition.

The general idea of the algorithm: for a given fixed number of clusters k, the observations are assigned to clusters so that the means within the clusters (for all variables) differ from one another as much as possible.

Algorithm description

  1. The initial distribution of objects among clusters. The number k is chosen and k points are selected; at the first step these points are considered the "centers" of the clusters. Each cluster has one center.
    The selection of initial centroids can be carried out as follows:
    - selection of k-observations to maximize the initial distance;
    - random selection of k-observations;
    - selection of the first k-observations.
    As a result, each object is assigned to a specific cluster.
  2. Iterative process. The cluster centers are calculated; from this point on, the center of a cluster is the coordinate-wise mean of its objects. The objects are then redistributed.
    The process of calculating the centers and redistributing the objects continues until one of the conditions is met:
    - the cluster centers have stabilized, i.e. all observations belong to the same cluster as before the current iteration;
    - the number of iterations is equal to the maximum number of iterations.
Figure 1 shows an example of the operation of the k-means algorithm for k equal to two.

Figure 1 - An example of the operation of the k-means algorithm (k \u003d 2)

Choosing the number of clusters is a complex issue. If there are no assumptions about this number, it is recommended to create 2 clusters, then 3, 4, 5, etc., comparing the results.

Clustering Quality Check

After obtaining the results of cluster analysis using the k-means method, the correct clustering should be checked (i.e., to assess how much the clusters differ from each other). For this, the average values \u200b\u200bfor each cluster are calculated. With good clustering, very different averages should be obtained for all measurements, or at least most of them.

Advantages of the k-means algorithm:

  1. ease of use;
  2. speed of use;
  3. understandability and transparency of the algorithm.
Disadvantages of the k-means algorithm:
  1. the algorithm is too sensitive to outliers that can distort the average. A possible solution to this problem is to use a modification of the algorithm - the k-median algorithm;
  2. the algorithm can work slowly on large databases. A possible solution to this problem is to use a data sample.
PAM algorithm (partitioning around Medoids)

PAM is a modification of the k-means algorithm, the k-medoids algorithm.

The algorithm is less sensitive to noise and outliers in the data than the k-means algorithm, since a medoid, unlike a mean, is less affected by outliers.
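A simplified k-medoids sketch in the spirit of PAM is shown below; it alternates assignment and medoid update rather than reproducing the exact build and swap phases of PAM, and the function name and the example data are illustrative assumptions.

```python
import numpy as np

def k_medoids(D, k, max_iter=100, seed=0):
    """Simplified k-medoids iteration on a precomputed distance matrix D:
    assign each object to its nearest medoid, then move each medoid to the
    object of its cluster with the smallest total distance to the others."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if members.size:
                # object minimizing the sum of distances to the other members
                new_medoids[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    return labels, medoids

# Example usage with Euclidean distances on illustrative data.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.8]])
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
print(k_medoids(D, k=2))
```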

PAM is effective for small databases, but should not be used for large data sets.

Preliminary dimension reduction

Consider an example. There is a database of company clients, which should be divided into homogeneous groups. Each client is described using 25 variables. The use of such a large number of variables leads to the allocation of clusters of fuzzy structure. As a result, it is rather difficult for an analyst to interpret the resulting clusters.

More understandable and transparent clustering results can be obtained if, instead of the set of source variables, some generalized variables or criteria are used that contain compressed information about the relationships between the variables. That is, there is a task of reducing the dimensionality of the data. It can be solved using various methods; one of the most common is factor analysis. Let us dwell on it in more detail.

Factor analysis

Factor analysis is a method used to study the relationships between variable values.

In general, factor analysis has two objectives:

  1. reduction in the number of variables;
  2. classification of variables - determining the structure of the relationship between variables.
Accordingly, factor analysis can be used to solve problems of reducing the dimensionality of data or to solve classification problems.

The criteria or main factors identified as a result of factor analysis contain, in a compressed form, information about the existing relationships between the variables. This information allows you to get better clustering results and better explain the semantics of clusters. Factors themselves may be given a certain meaning.

Using factor analysis, a large number of variables is reduced to a smaller number of independent influencing variables called factors.

The “compressed” form contains information about several variables. Variables that strongly correlate with each other are combined into one factor. As a result of factor analysis, such complex factors are found that as fully as possible explain the relationships between the variables in question.

At the first step of factor analysis, the values \u200b\u200bof variables are standardized, the need for which was discussed in the previous lecture.

Factor analysis is based on the hypothesis that the analyzed variables are indirect manifestations of a relatively small number of some hidden factors.

Factor analysis is a set of methods aimed at identifying and analyzing hidden relationships between observed variables. Hidden dependencies are also called latent.

One of the methods of factor analysis - the method of principal components - is based on the assumption of the independence of factors from each other.

Iterative clustering in SPSS

Typically, statistical packages implement a wide arsenal of methods, which makes it possible first to reduce the dimensionality of the data set (for example, using factor analysis) and then to perform the clustering itself (for example, using the fast cluster analysis method). Let us consider this variant of clustering in the SPSS package.

To reduce the dimensionality of the source data, we use factor analysis. To do this, select in the menu: Analyze / Data Reduction / Factor.

Using the Extraction: button, select the extraction method. We leave the default principal components analysis mentioned above. You should also choose a rotation method; we choose one of the most popular, the varimax method. To save the factor values as variables, select the "Save as variables" checkbox on the "Values" tab.

As a result of this procedure, the user receives an "Explained total variance" report, which shows the number of selected factors: these are the components whose eigenvalues exceed one.

The obtained factor values, which are usually assigned the names fact1_1, fact1_2, etc., are used for cluster analysis by the k-means method. To conduct a quick cluster analysis, select from the menu:

Analyze / Classify / K-Means Cluster: (K-Means Cluster Analysis).

In the K Means Cluster Analysis dialog box, you need to place the factor variables fact1_1, fact1_2, etc. in the field of tested variables. Here you must specify the number of clusters and the number of iterations.

As a result of this procedure, we obtain a report with the values of the centers of the formed clusters, the number of observations in each cluster, and any additional information specified by the user.

Thus, the k-means algorithm divides the set of source data into a given number of clusters. To be able to visualize the results obtained, one of the graphs should be used, for example, a dispersion diagram. However, traditional visualization is possible for a limited number of dimensions, because, as you know, a person can perceive only three-dimensional space. Therefore, if we analyze more than three variables, we should use special multidimensional methods for presenting information, which will be discussed in one of the subsequent lectures of the course.
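An analogous workflow can be sketched outside SPSS, for example in Python with scikit-learn; the data, the number of components and the number of clusters below are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Illustrative data: 100 observations of 25 variables.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 25))

# Dimensionality reduction (an analogue of the factor-analysis step),
# keeping a small number of components.
scores = PCA(n_components=4).fit_transform(StandardScaler().fit_transform(data))

# Fast clustering of the component scores (an analogue of K-Means Cluster in SPSS).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)
print(np.bincount(labels))  # number of observations in each cluster
```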

Iterative clustering methods differ in the choice of the following parameters:

  1. starting point;
  2. the rule for the formation of new clusters;
  3. stop rule.
The choice of clustering method depends on the amount of data and whether there is a need to work simultaneously with several types of data.

In the SPSS package, for example, if you need to work with both quantitative (e.g., income) and categorical (e.g., marital status) variables, and the data volume is large enough, the Two-Stage Cluster Analysis method is used: a scalable cluster analysis procedure that allows you to work with data of various types. At the first stage, the records are pre-clustered into a large number of sub-clusters. At the second stage, the resulting sub-clusters are grouped into the required number; if this number is unknown, the procedure determines it automatically. Using this procedure, a bank employee can, for example, identify groups of people using indicators such as age, gender and income level. The results obtained make it possible to identify customers included in the risk groups for loan defaults.

In the general case, all stages of cluster analysis are interconnected, and decisions made at one of them determine actions at subsequent stages.

The analyst should decide whether to use all the observations or to exclude some data or samples from the data set.

The choice of metrics and standardization method of the source data.

Determining the number of clusters (for iterative cluster analysis).

Definition of a clustering method (association or communication rules).

According to many experts, the choice of clustering method is crucial in determining the shape and specificity of clusters.

Analysis of the results of clustering. This stage involves answering the following questions: Is the obtained clusterization random? Is the partition reliable and stable on subsamples of the data? Is there a relationship between the clustering results and variables that did not participate in the clustering process? Can the results of the clustering be interpreted?

Checking the results of clustering. Clustering results should also be verified by formal and informal methods. Formal methods depend on the method used for clustering. Informal ones include the following clustering quality control procedures:

  1. analysis of the results of clustering obtained on certain samples of the data set;
  2. cross-validation;
  3. clustering when changing the order of observations in the data set;
  4. clustering while removing some observations;
  5. clustering in small samples.
One of the options for checking the quality of clustering is to use several methods and compare the results. A lack of similarity does not necessarily mean the results are incorrect, but the presence of similar groups is considered a sign of good-quality clustering.
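One way to make such a comparison concrete is to measure the agreement of two partitions, for example with the adjusted Rand index; the sketch below uses scikit-learn, and the data and the pair of methods chosen are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Illustrative data; two different methods applied to the same observations.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(30, 2)), rng.normal(3, 0.3, size=(30, 2))])

labels_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
labels_ag = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)

# Values close to 1 indicate that the two partitions agree,
# which is taken as a sign of good clustering.
print(adjusted_rand_score(labels_km, labels_ag))
```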

Difficulties and problems that may arise when applying cluster analysis

Like any other methods, cluster analysis methods have certain weaknesses, i.e. some difficulties, problems and limitations.

When conducting cluster analysis, it should be borne in mind that the results of clustering depend on the criteria for dividing the totality of the source data. With a decrease in the dimension of the data, certain distortions can occur, due to generalizations some individual characteristics of objects can be lost.

There are a number of challenges that should be considered before clustering.

  1. The difficulty of choosing the characteristics on the basis of which clustering is carried out. A rash choice leads to inadequate partitioning into clusters and, as a result, to an incorrect solution to the problem.
  2. The difficulty of choosing a clustering method. This choice requires a good knowledge of the methods and of the prerequisites for their use. To check the effectiveness of a particular method in a particular subject area, it is advisable to apply the following procedure: take several groups that are known a priori to differ and randomly mix their representatives. Then clustering is performed in order to restore the original grouping. The proportion of objects that coincide in the identified and the initial groups is an indicator of the effectiveness of the method.
  3. The problem of choosing the number of clusters. If there is no information regarding the possible number of clusters, it is necessary to conduct a series of experiments and, as a result of enumerating a different number of clusters, choose the optimal number of clusters.
  4. The problem of interpreting the results of clustering. The shape of the clusters in most cases is determined by the choice of the union method. However, it should be borne in mind that specific methods tend to create clusters of certain forms, even if there are actually no clusters in the studied data set.
Comparative analysis of hierarchical and non-hierarchical clustering methods

Before conducting clustering, the analyst may wonder which group of cluster analysis methods should be preferred. Choosing between hierarchical and non-hierarchical methods, it is necessary to take into account the following features.

Non-hierarchical methods show higher stability with respect to noise and outliers, an incorrect choice of metric, and the inclusion of insignificant variables in the set used for clustering. The price to be paid for these advantages is the word "a priori": the analyst must determine in advance the number of clusters, the number of iterations or the stop rule, and some other clustering parameters. This is especially difficult for beginners.

If there are no assumptions about the number of clusters, hierarchical algorithms are recommended. However, if the sample size does not allow this, a possible way out is to conduct a series of experiments with different numbers of clusters, for example, to start splitting the data set into two groups and, gradually increasing their number, compare the results (a minimal sketch of such an experiment is given below). This "variation" of the results provides considerable flexibility in clustering.
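
A minimal sketch of such a series of experiments, assuming Python with scikit-learn and a numeric data matrix X; the silhouette score is used here as one possible way to compare the partitions, not as a prescribed criterion.

```python
# A sketch of the "series of experiments": split the data into 2, 3, ...
# groups with k-means and compare the partitions by silhouette score.
# X is assumed to be a numeric matrix; the data below are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(1).normal(size=(200, 5))   # placeholder data

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")
# The k with the best score is only a candidate for the final number of
# clusters; the choice still has to be interpreted by the analyst.
```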

Hierarchical methods, in contrast to non-hierarchical ones, do not require the number of clusters to be fixed in advance; instead, they build a complete tree of nested clusters.

The difficulties of hierarchical clustering methods are the limit on the volume of the data set, the choice of the proximity measure, and the rigidity of the resulting classifications.

The advantage of this group of methods over non-hierarchical methods is their clarity and the ability to obtain a detailed picture of the data structure.

When using hierarchical methods it is fairly easy to identify outliers in a data set and, as a result, improve data quality. This procedure underlies the two-step clustering algorithm: the cleaned data set can then be used for non-hierarchical clustering.
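
The sketch below illustrates this two-step idea under simple assumptions: a hierarchical pass (scipy's linkage) flags observations that end up in very small clusters as outliers, and the cleaned data are then clustered non-hierarchically. The distance threshold and the "small cluster" rule are illustrative choices, not fixed recommendations.

```python
# A minimal sketch of the two-step idea: a hierarchical pass flags
# isolated observations, and the cleaned data are then passed to a
# non-hierarchical method. All thresholds below are illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

X = np.random.default_rng(2).normal(size=(150, 3))     # placeholder data

Z = linkage(X, method="ward")                          # merging history
labels = fcluster(Z, t=5.0, criterion="distance")      # cut the tree
sizes = np.bincount(labels)
is_outlier = sizes[labels] <= 2                        # tiny clusters = outliers
X_clean = X[~is_outlier]

final = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_clean)
```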

There is another aspect that has already been mentioned in this lecture: whether to cluster the entire data set or a sample of it. This question matters for both groups of methods, but it is more critical for hierarchical methods. Hierarchical methods cannot work with large data sets, but using a sample, i.e. a part of the data, can make these methods applicable.

Clustering results may not have sufficient statistical justification. On the other hand, a non-statistical interpretation of the results is acceptable when solving clustering problems, as is a rather wide variety of notions of what a cluster is. Such a non-statistical interpretation enables the analyst to obtain clustering results that satisfy him, which is often difficult with other methods.

New algorithms and some modifications of cluster analysis algorithms

The methods that we examined in this and the previous lectures are the “classics” of cluster analysis. Until recently, the main criterion by which the clustering algorithm was evaluated was the quality of clustering: it was assumed that the entire data set fit in RAM.

However, now, in connection with the advent of extra-large databases, there are new requirements that the clustering algorithm must satisfy. The main one, as mentioned in previous lectures, is the scalability of the algorithm.

We also note other properties that the clustering algorithm must satisfy: independence of the results from the order of the input data; independence of the algorithm parameters from the input data.

Recently, new clustering algorithms capable of processing very large databases have been actively developed; their focus is scalability. Such algorithms rely on a generalized (summarized) cluster representation, as well as on the selection and use of data structures supported by the underlying DBMS.

Algorithms have been developed in which hierarchical clustering methods are integrated with other methods. Such algorithms include: BIRCH, CURE, CHAMELEON, ROCK.

BIRCH Algorithm (Balanced Iterative Reducing and Clustering using Hierarchies)

The algorithm was proposed by Tian Zhang and his colleagues.

Thanks to the generalized representations of clusters, the clustering speed increases, while the algorithm is highly scalable.

This algorithm implements a two-stage clustering process.

During the first stage, a preliminary set of clusters is formed. At the second stage, other clustering algorithms, suitable for working in RAM, are applied to the identified clusters.

The following analogy describes the algorithm. If each data element is imagined as a bead lying on the surface of a table, then clumps of beads can be "replaced" by tennis balls, and the analysis proceeds to a more detailed study of the clusters of tennis balls. The number of beads can be very large, but the diameter of the tennis balls can be chosen so that at the second stage traditional clustering algorithms can determine the actual, complex shape of the clusters.
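
A minimal sketch of this two-stage scheme, assuming scikit-learn's Birch implementation is available: the first stage builds compact subclusters (the "tennis balls"), and the second stage re-clusters them with an ordinary in-memory algorithm. The parameter values are illustrative.

```python
# A minimal sketch of the two-stage scheme using scikit-learn's Birch.
# The first stage builds compact subclusters; the second stage
# re-clusters them with an in-memory algorithm.
import numpy as np
from sklearn.cluster import Birch, AgglomerativeClustering

X = np.random.default_rng(3).normal(size=(10_000, 2))   # placeholder data

model = Birch(
    threshold=0.5,             # radius of a subcluster (the "ball" size)
    branching_factor=50,
    n_clusters=AgglomerativeClustering(n_clusters=3),  # second-stage algorithm
)
labels = model.fit_predict(X)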

WaveCluster Algorithm

WaveCluster is a clustering algorithm based on the wavelet transform. At the start of the algorithm the data are generalized by superimposing a multidimensional lattice on the data space. In the subsequent steps, the algorithm analyzes not individual points but the generalized characteristics of the points falling into each cell of the lattice. As a result of this generalization, the necessary information fits in RAM. To determine the clusters, the algorithm then applies the wavelet transform to the generalized data.
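
The sketch below illustrates this principle in two dimensions, assuming numpy, PyWavelets (pywt) and scipy are available: points are summarized on a grid, a wavelet transform smooths the grid, and connected dense cells are treated as clusters. It is only an illustration of the idea, not the published WaveCluster algorithm.

```python
# A two-dimensional sketch of the idea: summarize the data on a grid,
# apply a wavelet transform, and treat connected dense cells as clusters.
# Thresholds and grid size are illustrative.
import numpy as np
import pywt
from scipy import ndimage

X = np.random.default_rng(4).normal(size=(5_000, 2))    # placeholder data

# Step 1: generalize the points on a lattice (counts per cell).
grid, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=64)

# Step 2: wavelet transform of the grid; keep the approximation part.
approx, _ = pywt.dwt2(grid, "haar")

# Step 3: dense cells in the transformed grid -> connected components.
dense = approx > approx.mean()
cluster_map, n_clusters = ndimage.label(dense)
print("clusters found on the grid:", n_clusters)
```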

Key features of WaveCluster:

  1. the implementation is complex;
  2. the algorithm can detect clusters of arbitrary shape;
  3. the algorithm is not sensitive to noise;
  4. the algorithm is applicable only to low-dimensional data.
CLARA Algorithm (Clustering LARge Applications)

The CLARA algorithm was developed by Kaufman and Rousseeuw in 1990 for clustering data in large databases. It is built into statistical analysis packages such as S-PLUS.

Let us briefly outline the essence of the algorithm. CLARA draws multiple samples from the database; clustering is applied to each of the samples, and the best clustering is returned as the output of the algorithm.

For large databases, this algorithm is more efficient than the PAM algorithm. Its effectiveness depends on the data chosen as a sample: a good clustering of the selected sample may not yield a good clustering of the entire data set.
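
A minimal sketch of the sampling scheme, assuming scikit-learn is available; k-means is used here as a stand-in for PAM, so this illustrates the idea of clustering several samples and evaluating each candidate on the full data set rather than reproducing the original algorithm.

```python
# A sketch of the CLARA idea: cluster several random samples and keep
# the partition whose centres describe the full data set best.
# k-means stands in for PAM; all parameter values are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(5)
X = rng.normal(size=(20_000, 4))          # placeholder "large" data set
k, n_samples, sample_size = 4, 5, 1_000

best_cost, best_model = np.inf, None
for _ in range(n_samples):
    idx = rng.choice(len(X), size=sample_size, replace=False)
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[idx])
    # Evaluate the candidate clustering on the *entire* data set.
    _, dist = pairwise_distances_argmin_min(X, model.cluster_centers_)
    cost = dist.sum()
    if cost < best_cost:
        best_cost, best_model = cost, model

labels = best_model.predict(X)            # final partition of all objects
```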

Algorithms Clarans, CURE, DBScan

The Clarans algorithm (Clustering Large Applications based upon RANdomized Search) formulates the clustering problem as a randomized search in a graph. As a result of the algorithm's operation, a set of graph nodes represents a partition of the data set into the number of clusters defined by the user. The "quality" of the resulting clusters is evaluated with a criterion function. Clarans searches through possible partitions of the data set for an acceptable solution; the search stops at the node where a minimum is reached among a predetermined number of local minima.
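
The following sketch illustrates the randomized-search idea: a "node" is a set of k medoids, neighbouring nodes differ in one medoid, and the search moves to a better neighbour until a local minimum is found. The parameter names num_local and max_neighbor follow the usual description of the algorithm; the implementation itself is only an illustration, written with plain numpy.

```python
# An illustrative sketch of the randomized search over medoid sets.
import numpy as np

def cost(X, medoid_idx):
    """Total distance of every object to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def clarans(X, k, num_local=3, max_neighbor=50, seed=0):
    rng = np.random.default_rng(seed)
    best_idx, best_cost = None, np.inf
    for _ in range(num_local):
        current = rng.choice(len(X), size=k, replace=False)
        current_cost = cost(X, current)
        tried = 0
        while tried < max_neighbor:
            neighbor = current.copy()
            pos = rng.integers(k)                       # medoid to replace
            candidates = np.setdiff1d(np.arange(len(X)), neighbor)
            neighbor[pos] = rng.choice(candidates)      # random non-medoid
            neighbor_cost = cost(X, neighbor)
            if neighbor_cost < current_cost:            # move to a better node
                current, current_cost, tried = neighbor, neighbor_cost, 0
            else:
                tried += 1
        if current_cost < best_cost:                    # local minimum found
            best_idx, best_cost = current, current_cost
    return best_idx, best_cost

X = np.random.default_rng(6).normal(size=(500, 3))      # placeholder data
medoids, total_cost = clarans(X, k=4)
```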

Among the new scalable algorithms one can also note the CURE hierarchical clustering algorithm and the DBScan algorithm, in which the concept of a cluster is formulated using the concept of density.

The main disadvantage of the BIRCH, Clarans, CURE and DBScan algorithms is that they require the setting of certain point-density thresholds, which is not always acceptable. These limitations stem from the fact that the algorithms are oriented toward very large databases and cannot afford large computational resources.

Many researchers are now actively working on scalable methods, whose main task is to overcome the shortcomings of the algorithms that exist today.

One of the iterative classification methods that does not require specifying the number of clusters is the method of searching for condensations. The method requires calculating the distance matrix; then an object is selected to serve as the initial center of the first cluster. This object may be chosen arbitrarily or on the basis of a preliminary analysis of the points and their neighborhoods.

The selected point is taken as the center of a hypersphere of a given radius R. The set of points falling inside this sphere is determined, and the coordinates of their center (the vector of mean values of the attributes) are calculated. Next, a hypersphere of the same radius but with the new center is considered; for the set of points falling into it, the vector of mean values is calculated again and taken as the new center of the sphere, and so on. When the next recalculation of the coordinates of the center gives the same result as the previous step, the movement of the sphere stops; the points that fall into it form a cluster and are excluded from further clustering. For all remaining points, the procedure is repeated.
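
A direct sketch of this procedure in Python (numpy assumed), under the assumptions that the radius R is given and that the starting point of each sphere is simply an arbitrary remaining object.

```python
# A minimal sketch of the condensation-search procedure described above.
import numpy as np

def condensation_search(X, R, max_iter=100):
    remaining = np.arange(len(X))
    clusters = []
    while len(remaining) > 0:
        center = X[remaining[0]]                     # arbitrary starting object
        inside = remaining[:1]
        for _ in range(max_iter):
            new_inside = remaining[np.linalg.norm(X[remaining] - center, axis=1) <= R]
            if len(new_inside) == 0:
                break
            inside = new_inside
            new_center = X[inside].mean(axis=0)      # mean of points in the sphere
            if np.allclose(new_center, center):      # the sphere stopped moving
                break
            center = new_center
        clusters.append(inside)                      # points of the new cluster
        remaining = np.setdiff1d(remaining, inside)  # exclude them and repeat
    return clusters

X = np.random.default_rng(7).normal(size=(300, 2))   # placeholder data
groups = condensation_search(X, R=1.0)
print("number of clusters found:", len(groups))
```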

Thus, there are more non-hierarchical methods, although they work on similar principles. In essence, they are iterative methods of partitioning the original population: new clusters are formed in the process of division until the stop rule is satisfied. The methods differ in the choice of the starting point, the rule for forming new clusters, and the stop rule. The most commonly used is the k-means algorithm.

Conclusion

Cluster analysis is a method of grouping objects into classes based on experimental data on the properties of objects.

In this case, a cluster model of representing objects is used - objects with similar properties belong to the same class.

Cluster analysis includes a set of different classification algorithms (as an example of the cluster analysis method, you can use the dendrogram method).

Moreover, as a rule, the number of classes and the principles of separation into classes are determined in advance based on general information about the set of objects and the goals of cluster analysis.

Cluster analysis methods are supplemented by discriminant analysis methods that allow you to determine the boundaries between clusters and use them to solve problems of data analysis and classification.

The results of cluster analysis are most often presented graphically, in the form of a dendrogram ("tree") showing the order in which objects are combined into clusters. Interpreting the cluster structure, which in many cases begins with determining the number of clusters, is a creative task; to solve it effectively, the researcher must have sufficient information about the clustered objects. In clustering "with learning", the results can be presented as lists of the objects assigned to each class.
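
A minimal sketch of such a graphical presentation, assuming scipy and matplotlib are available: the merging history is computed with the Ward method and drawn as a dendrogram; the data and parameters are placeholders.

```python
# A minimal dendrogram of the agglomerative merging order.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(8).normal(size=(30, 4))   # placeholder data
Z = linkage(X, method="ward")                        # merging history
dendrogram(Z)                                        # "tree" of the merges
plt.xlabel("objects")
plt.ylabel("merge distance")
plt.show()
```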

The main advantages of cluster analysis are the absence of restrictions on the distribution of variables used in the analysis; the possibility of classification (clustering) even in cases where there is no a priori information about the number and nature of classes; universality (cluster analysis can be applied not only to collections of objects, but also to sets of variables or any other units of analysis).

We list the disadvantages of cluster analysis:

    Like factor analysis, it can produce unstable clusters. If the study is repeated on a different sample and the classification results are compared, they will most likely differ; how much they differ is a question of the quality of the study itself.

    It implements an inductive method of research, from the particular to the general, which is fraught with unscientific conclusions. Ideally, the sample for classification should be very large and heterogeneous, preferably selected by stratification or randomization. Science moves toward testing hypotheses, so cluster analysis should not be abused: it is best used to test a hypothesis about the presence of certain types, rather than to create a classification from scratch.

    Like any multidimensional scaling method, cluster analysis has many features connected with its internal procedures: the criterion for combining objects into clusters, the way differences are measured, the number of steps before the algorithm terminates in the k-means method, and so on. The results may therefore vary, if only slightly, depending on the "settings" of the procedure.

There are two groups of cluster analysis methods: hierarchical and non-hierarchical.

The main methods of hierarchical cluster analysis are the single-link (nearest neighbor) method, the full-link method, the middle-link method, and the Ward method; the latter is the most universal.

There are more non-hierarchical methods, although they work on similar principles. In essence, they are iterative methods of partitioning the original population: new clusters are formed in the process of division until the stop rule is satisfied. The methods differ in the choice of the starting point, the rule for forming new clusters, and the stop rule. The most commonly used is the k-means algorithm, which assumes that the analyst fixes the number of clusters in the resulting partition in advance.

Speaking about the choice of a specific clustering method, we emphasize once again that this process requires the analyst to have a good understanding of the nature and prerequisites of the methods; otherwise the results will resemble the "average temperature in the hospital." To make sure that the selected method is really effective in a given area, the following procedure is usually used:

Several groups that are known a priori to be different are taken, and their members are randomly mixed. Then the clustering procedure is carried out in order to restore the original partition into groups. The proportion of objects that end up in the same groups as before serves as an indicator of the method's effectiveness.
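
A minimal sketch of this check, assuming scikit-learn and synthetic groups with known membership; the adjusted Rand index is used here as a convenient stand-in for the "proportion of coincidence" between the recovered and the initial groups.

```python
# A sketch of the effectiveness check: mix objects with known group
# membership, re-cluster them, and measure agreement with the original
# groups. Data, group means and parameters are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(9)
# Three a priori different groups (synthetic data with known labels).
groups = [rng.normal(loc=c, size=(100, 2)) for c in (0.0, 3.0, 6.0)]
X = np.vstack(groups)
true_labels = np.repeat([0, 1, 2], 100)

perm = rng.permutation(len(X))                 # "randomly mix" the objects
X, true_labels = X[perm], true_labels[perm]

recovered = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("agreement with the initial groups:",
      adjusted_rand_score(true_labels, recovered))
```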

Choosing between hierarchical and non-hierarchical methods, you should pay attention to the following points:

Non-hierarchical methods exhibit higher stability with respect to outliers, an incorrect choice of metric, the inclusion of insignificant variables in the basis for clustering, and so on. The price for this is the word "a priori": the researcher must fix in advance the resulting number of clusters, the stop rule and, if there are grounds for it, the initial centers of the clusters. The last point significantly affects the efficiency of the algorithm; if there is no reason to set this condition artificially, it is generally recommended to use hierarchical methods. We also note one more point that is essential for both groups of algorithms: clustering all observations is not always the right decision. It may be more accurate to first clear the sample of outliers and then continue the analysis. It is also possible to avoid setting very strict stopping criteria.

 
