Methods of cluster analysis. Iterative methods. Types of cluster analysis procedures. The concentration search method

Cluster analysis (CA) is a set of multidimensional classification methods whose purpose is the formation of groups (clusters) of objects that are similar to each other. Unlike the traditional groupings considered in the general theory of statistics, CA divides objects into groups taking all grouping characteristics into account simultaneously.

CA methods allow the following problems to be solved:

— carrying out the classification of objects, taking into account a variety of features;

- verification of the assumptions made about the presence of some structure in the studied set of objects, i.e. search for an existing structure;

- construction of new classifications for poorly studied phenomena, when it is necessary to establish the presence of connections within the population and try to introduce structure into it.

To describe the formalized CA algorithms, the following notation is used:

– n is the number of observation objects;

– X_i = (x_i1, x_i2, ..., x_im) is the i-th observation in the m-dimensional feature space (i = 1, 2, ..., n);

– d_ij is the distance between the i-th and j-th objects;

– z_ij are the normalized values of the initial variables;

– D = (d_ij) is the matrix of distances between objects.

To implement any CL method, it is necessary to introduce the concept of “object similarity”. Moreover, in the process of classification, objects that have the greatest similarity with each other in terms of the observed variables should fall into each cluster.

For a quantitative assessment of similarity, the concept of a metric is introduced. Each object is described by m features and is represented as a point in m-dimensional space. The similarity or difference between the classified objects is established depending on the metric distance between them. As a rule, the following measures of distance between objects are used:

— the Euclidean distance: d_ij = sqrt( sum_{k=1..m} (x_ik - x_jk)^2 );

— the weighted Euclidean distance: d_ij = sqrt( sum_{k=1..m} w_k (x_ik - x_jk)^2 );

— the city-block distance: d_ij = sum_{k=1..m} |x_ik - x_jk|;

— the Mahalanobis distance: d_ij = sqrt( (X_i - X_j)' S^(-1) (X_i - X_j) ),

where d_ij is the distance between the i-th and j-th objects;

x_ik and x_jk are the values of the k-th variable for the i-th and j-th objects, respectively;

X_i and X_j are the vectors of variable values for the i-th and j-th objects;

S is the general covariance matrix;

w_k is the weight assigned to the k-th variable.
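As an illustration (not part of the source text), here is a minimal Python sketch of these four measures; the two objects and the covariance matrix are hypothetical.

```python
import numpy as np

def euclidean(x_i, x_j):
    return np.sqrt(np.sum((x_i - x_j) ** 2))

def weighted_euclidean(x_i, x_j, w):
    # w: weights assigned to the variables
    return np.sqrt(np.sum(w * (x_i - x_j) ** 2))

def city_block(x_i, x_j):
    return np.sum(np.abs(x_i - x_j))

def mahalanobis(x_i, x_j, cov):
    # cov: the general covariance matrix of the features
    diff = x_i - x_j
    return np.sqrt(diff @ np.linalg.inv(cov) @ diff)

x1, x2 = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.5])
print(euclidean(x1, x2), city_block(x1, x2))
print(mahalanobis(x1, x2, cov=np.eye(3)))   # with an identity covariance it equals the Euclidean distance
```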

All CA methods can be divided into two groups: hierarchical (agglomerative and divisive) and iterative (the k-means method, the method of searching for concentrations).

Hierarchical cluster analysis. Of all the methods of cluster analysis, the agglomerative classification algorithm is the most common. Its essence is that at the first step each sample object is treated as a separate cluster. The clusters are then merged sequentially: based on the distance matrix or the similarity matrix, the closest objects are combined. If the distance matrix initially has dimension (n × n), the merging process is completed in (n − 1) steps. As a result, all objects are combined into one cluster.

The merging sequence can be represented as a dendrogram, shown in Figure 3.1. The dendrogram shows that at the first step the second and third objects were combined into one cluster at a distance of 0.15 between them. At the second step, the first object joined them; the distance from the first object to the cluster containing the second and third objects is 0.3, and so on.

The many methods of hierarchical cluster analysis differ in their combining (similarity) algorithms, of which the most common are the single connection method, the complete connection method, the average connection method, and Ward's method.

The complete connection method: a new object is included in a cluster only if the similarity between all objects of the cluster is not less than some given level of similarity (Figure 1.3).



Average connection method: when a new object is included in an already existing cluster, the average value of the similarity measure is calculated and then compared with a given threshold level. If two clusters are being merged, the measure of similarity between their centers is calculated and compared with a given threshold value. Consider a geometric example with two clusters (Figure 1.4).

Figure 1.4. Combining two clusters using the average link method:

If the measure of similarity between the centers of these clusters is not less than the given level, the two clusters are merged into one.

Ward's method: at the first step, each cluster consists of one object. First, the two closest clusters are merged. For them, the average values of each feature are determined and the sum of squared deviations is calculated:

V_k = sum_{i=1..n_k} sum_{j=1..m} (x_ij^(k) - x̄_j^(k))^2 ,   (1.1)

where k is the cluster number, i is the object number, j is the feature number; m is the number of features characterizing each object; n_k is the number of objects in the k-th cluster; x̄_j^(k) is the mean value of the j-th feature in the k-th cluster.

Further, at each step of the algorithm, those objects or clusters are merged that give the smallest increment of the value V_k.

Ward's method leads to the formation of clusters of approximately equal sizes with minimal intracluster variation.
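A small sketch of the quantity that Ward's method minimizes at each merge, in the spirit of formula (1.1); the two clusters are hypothetical.

```python
import numpy as np

def within_ss(cluster):
    """Sum of squared deviations of the cluster's objects from its feature means (V_k)."""
    cluster = np.asarray(cluster, dtype=float)
    return np.sum((cluster - cluster.mean(axis=0)) ** 2)

def ward_increment(cluster_a, cluster_b):
    """Increase in the total sum of squared deviations caused by merging the two clusters."""
    merged = np.vstack([cluster_a, cluster_b])
    return within_ss(merged) - within_ss(cluster_a) - within_ss(cluster_b)

a = [[1.0, 2.0], [1.2, 1.8]]
b = [[5.0, 5.0], [5.1, 4.9]]
print(ward_increment(a, b))   # the pair of clusters with the smallest increment is merged
```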

The hierarchical cluster analysis algorithm can be represented as a sequence of procedures:

— normalization of initial values ​​of variables;

— calculation of the distance matrix or matrix of similarity measures;

– determination of a pair of closest objects (clusters) and their combination according to the selected algorithm;

- repeating the first three procedures until all objects are combined into one cluster.
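For illustration, the whole sequence of procedures can be sketched with SciPy (assumed to be available); the data matrix is hypothetical and the "far neighbor" (complete) linkage is chosen arbitrarily.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(10, 3)                              # 10 objects, 3 features
Z = (X - X.mean(axis=0)) / X.std(axis=0)               # 1) normalization of the variables
D = pdist(Z, metric='euclidean')                       # 2) distance matrix (condensed form)
merges = linkage(D, method='complete')                 # 3) repeated merging of the closest clusters
labels = fcluster(merges, t=2, criterion='maxclust')   # 4) cut the resulting hierarchy into 2 clusters
print(labels)
```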

The measure of similarity for combining two clusters is determined by the following methods:

- the method of "nearest neighbor" - the degree of similarity between clusters is estimated by the degree of similarity between the most similar (nearest) objects of these clusters;

— method of "distant neighbor" - the degree of similarity is estimated by the degree of similarity between the most distant (dissimilar) objects of clusters;

- method of average connection - the degree of similarity is estimated as the average value of the degrees of similarity between objects of clusters;

- median connection method - the distance between any cluster S and a new cluster, which resulted from the union of clusters p and q, is defined as the distance from the center of cluster S to the middle of the segment connecting the centers of clusters p and q.
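A small sketch of how the first three rules evaluate the distance between two clusters, assuming a full distance matrix D and clusters given as lists of object indices (the median rule additionally requires the cluster centers).

```python
import numpy as np

def cluster_distance(D, a, b, rule="nearest"):
    block = D[np.ix_(a, b)]        # pairwise distances between the two clusters
    if rule == "nearest":          # "nearest neighbor"
        return block.min()
    if rule == "distant":          # "distant neighbor"
        return block.max()
    if rule == "average":          # average connection
        return block.mean()
    raise ValueError(rule)         # the median rule is not computable from D alone

X = np.random.rand(6, 2)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(cluster_distance(D, [0, 1], [2, 3, 4], rule="average"))
```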

Condensation search method. One of the iterative methods of classification is the algorithm for searching for concentrations (condensations). The essence of this iterative algorithm is the use of a hypersphere of a given radius, which moves in the space of classification features in order to find local concentrations of objects.


The method of searching for concentrations requires, first of all, calculating the distance matrix (or the matrix of similarity measures) between objects and choosing the initial center of the sphere. Usually, at the first step, the center of the sphere is the object (point) in whose nearest neighborhood the largest number of neighbors is located. Based on the given radius of the sphere (R), the set of points that fall inside this sphere is determined, and for them the coordinates of the center (the vector of average feature values) are calculated.

When the next recalculation of the coordinates of the center of the sphere leads to the same result as at the previous step, the movement of the sphere stops; the points that fall into it form a cluster and are excluded from the further clustering process. The above procedures are repeated for all remaining points. The algorithm terminates in a finite number of steps, with all points distributed over clusters. The number of clusters formed is not known in advance and depends strongly on the radius of the sphere.

To assess the stability of the resulting partition, it is advisable to repeat the clustering process several times for different values ​​of the sphere radius, each time changing the radius by a small amount.

There are several ways to choose the radius of the sphere. If d_ij is the distance between the i-th and j-th objects, then the lower limit of the radius (R_min) is chosen on the basis of the smallest of the distances d_ij, and the upper limit of the radius (R_max) on the basis of the largest of them.

If the algorithm starts with the value R_min and the radius is changed by a small amount each time it is repeated, then it is possible to identify the values of the radius that lead to the formation of the same number of clusters, i.e. to a stable partition.
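A minimal sketch of finding one such concentration (the data, radius and starting point are hypothetical; standardized features are assumed):

```python
import numpy as np

def find_one_cluster(Z, R, start_index, max_iter=100):
    center = Z[start_index].astype(float)
    inside = np.array([start_index])
    for _ in range(max_iter):
        inside = np.where(np.linalg.norm(Z - center, axis=1) <= R)[0]
        new_center = Z[inside].mean(axis=0)          # vector of average feature values
        if np.allclose(new_center, center):          # the sphere has stopped moving
            break
        center = new_center
    return inside                                    # indices of the objects forming the cluster

Z = np.random.rand(30, 2)
cluster = find_one_cluster(Z, R=0.3, start_index=0)
print(cluster)   # these points are then removed and the search is repeated for the rest
```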

Example 1. Based on the data in Table 1.1, it is necessary to classify five enterprises using hierarchical agglomerative cluster analysis.

Table 1.1

Here x1 is the average annual cost of fixed production assets, billion rubles; x2 is the material costs per ruble of manufactured products, kopecks; x3 is the volume of manufactured products, billion rubles.

Solution. Before calculating the distance matrix, we normalize the initial data using the standardization formula z_ij = (x_ij - x̄_j) / σ_j, where x̄_j and σ_j are the mean and the standard deviation of the j-th variable; this gives the matrix Z of normalized variable values.

The classification will be carried out using the hierarchical agglomerative method. To construct the distance matrix, we use the Euclidean distance; for example, the distance between the first and second objects is d_12 = sqrt( sum_k (z_1k - z_2k)^2 ). The resulting distance matrix D characterizes the distances between objects, each of which, at the first step, is a separate cluster.

As can be seen from the matrix D, two of the objects are closest to each other. We combine them into one cluster, assign it the next number, and recalculate the distances of all remaining objects (clusters) to the new cluster, obtaining a new distance matrix.

In the new matrix, the distances between clusters are determined by the "far neighbor" algorithm: the distance between an object and the new cluster is taken as the distance to its most distant member.

In this matrix we again find the closest clusters and combine them at this step as well, obtaining a new cluster and giving it the next number. Now we have three clusters: (1, 3), (2, 5) and (4).

Judging by the updated distance matrix, at the next step we combine two of these clusters into one, assign it the next number, and are left with only two clusters.

Finally, at the last step, we combine the two remaining clusters at a distance of 3.861.

Let us present the classification results in the form of a dendrogram (Figure 1.5). The dendrogram indicates that one of the clusters is more homogeneous in the composition of its objects, since within it the merging occurred at smaller distances than in the other cluster.

Figure 1.5. Dendrogram of the clustering of five objects

Example 2. Based on the data below, classify the stores according to three criteria: x1, the area of the trading floor, m²; x2, turnover per seller, monetary units; x3, the level of profitability, %.


To classify stores, use the method of searching for concentrations (you need to select the first cluster).

Solution. 1. Calculate the distances between objects using the Euclidean metric

d_ij = sqrt( sum_{k=1..t} (z_ik - z_jk)^2 ),

where z_ik and z_jk are the standardized values of the initial variables for the i-th and j-th objects, respectively; t is the number of features.

2. Based on the matrix Z of standardized values, we calculate the square symmetric matrix of distances between objects (D).

Analysis of the distance matrix helps to determine the position of the initial center of the sphere and choose the radius of the sphere.

In this example, most of the "small" distances are in the first row, i.e. the first object has many "close" neighbors. Therefore, the first object can be taken as the center of the sphere.

3. Set the radius of the sphere R = 2. In this case, the sphere contains those objects whose distance to the first object is less than 2.

Cluster analysis is a statistical analysis that allows a large amount of data to be split into classes or groups (from the English cluster) according to some criterion or a combination of criteria.

To classify the data X_1, ..., X_n, the concept of a metric, or distance, is used.

A metric is a function ρ that maps pairs of points of some space into the real numbers and has the following properties (the metric axioms):

  • 1) ρ(X, Y) ≥ 0;
  • 2) ρ(X, Y) = ρ(Y, X);
  • 3) ρ(X, Y) = 0 if and only if X = Y;
  • 4) ρ(X, Y) ≤ ρ(X, Z) + ρ(Z, Y) (the triangle inequality).

The theory of cluster analysis uses the following metrics to measure the distance between individual points (vectors):

1) the Euclidean distance

ρ(X_i, X_j) = sqrt( sum_{k=1..p} (x_ik - x_jk)^2 );

2) the weighted Euclidean distance

ρ(X_i, X_j) = sqrt( sum_{k=1..p} w_k (x_ik - x_jk)^2 ),

where w_k are weights proportional to the importance of the k-th feature in the classification problem; the weights are set after additional research and are assumed to satisfy sum_k w_k = 1;

3) the Hamming (or city-block) distance, the distance on a map between quarters in a city:

ρ(X_i, X_j) = sum_{k=1..p} |x_ik - x_jk|;

4) the Mahalanobis distance (or Mahalanobis angle)

ρ(X_i, X_j) = sqrt( (X_i - X_j)' Λ A^(-1) Λ (X_i - X_j) ),

where Λ is a symmetric positive-definite weight matrix (often chosen to be diagonal) and A is the covariance matrix of the vectors X_1, ..., X_n;

5) the Minkowski distance

ρ(X_i, X_j) = ( sum_{k=1..p} |x_ik - x_jk|^q )^(1/q).

Distances 1), 2), 3) and 5) are used in the case of a normal distribution of independent random variables X_1, ..., X_n ~ N(M, A), or in the case of their homogeneity, when each vector is equally important for the classification. Distance 4) is used if there is a covariance relationship between the vectors X_1, ..., X_n.

The choice of the metric is carried out by the researcher, depending on what result he wants to obtain. This choice is not formalized, since it depends on many factors, in particular, on the expected result, on the experience of the researcher, the level of his mathematical training, etc.

In a number of algorithms, along with distances between vectors, distances between clusters and unions of clusters are used.

Let S_i be the i-th cluster, consisting of n_i vectors (points), and let X̄^(i) be the sample mean over the points falling into the cluster S_i, i.e. the center of gravity of the cluster S_i. Then the following distances are distinguished between clusters that do not contain other clusters:

1) the distance between clusters according to the "nearest neighbor" principle

ρ_min(S_l, S_m) = min_{X_i ∈ S_l, X_j ∈ S_m} ρ(X_i, X_j);

2) the distance between clusters according to the "far neighbor" principle

ρ_max(S_l, S_m) = max_{X_i ∈ S_l, X_j ∈ S_m} ρ(X_i, X_j);

3) the distance between the centers of gravity of the groups

ρ_c(S_l, S_m) = ρ(X̄^(l), X̄^(m));

4) the distance between clusters according to the "average connection" principle

ρ_avg(S_l, S_m) = (1 / (n_l n_m)) sum_{X_i ∈ S_l} sum_{X_j ∈ S_m} ρ(X_i, X_j);

5) the generalized Kolmogorov distance

ρ_K(S_l, S_m) = [ (1 / (n_l n_m)) sum_{X_i ∈ S_l} sum_{X_j ∈ S_m} ρ^τ(X_i, X_j) ]^(1/τ).

The distance between clusters that are themselves unions of other classes can be calculated using the general recurrence formula

ρ(S_(k,l), S_m) = α ρ(S_k, S_m) + β ρ(S_l, S_m) + γ ρ(S_k, S_l) + δ |ρ(S_k, S_m) - ρ(S_l, S_m)|,

where S_(k,l) is the cluster obtained by combining the classes S_k and S_l.

All the special cases of distances are obtained from this generalized formula. With α = β = 1/2, δ = -1/2, γ = 0 we have the distance according to the "nearest neighbor" principle; with α = β = δ = 1/2, γ = 0, the "far neighbor" distance; with α = n_k / (n_k + n_l), β = n_l / (n_k + n_l), δ = 0, γ = 0, the distance between the centers of gravity of the groups.
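As an illustration (not from the source), a tiny sketch of this recurrence and of the two simplest special cases listed above.

```python
def union_distance(d_km, d_lm, d_kl, alpha, beta, gamma, delta):
    """Distance from the union S_(k,l) to S_m, given the three pairwise cluster distances."""
    return alpha * d_km + beta * d_lm + gamma * d_kl + delta * abs(d_km - d_lm)

d = dict(d_km=2.0, d_lm=5.0, d_kl=3.0)
print(union_distance(**d, alpha=0.5, beta=0.5, gamma=0.0, delta=-0.5))  # nearest neighbor: 2.0 (the minimum)
print(union_distance(**d, alpha=0.5, beta=0.5, gamma=0.0, delta=0.5))   # far neighbor: 5.0 (the maximum)
```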

Cluster analysis methods are divided into I) agglomerative (combining), II) divisive (separating) and III) iterative.

The former sequentially combine individual objects into clusters, while the latter, on the contrary, split clusters into objects. The third group combines the first two; its feature is the formation of clusters based on partitioning conditions (the so-called parameters), which can be changed during the operation of the algorithm to improve the quality of the partition. Iterative methods are commonly used to classify large amounts of information.

Let us consider agglomerative methods in more detail. Agglomerative methods are the simplest and most common among cluster analysis algorithms. At the first step, each vector (object) X_1, ..., X_n of the source data is considered a separate cluster or class. From the calculated matrix of distances, the objects closest to each other are selected and combined. Obviously, the process ends at the (n - 1)-th step, when all objects have been combined into one cluster.

The sequence of unions can be represented as a dendrogram, or tree. Fig. 1.18 shows that the vectors X_1 and X_2 were combined at the first step, since the distance between them is 0.3.

At the second step, the vector X_3, located at a distance of 0.5 from the cluster (X_1, X_2), was attached to them, and so on. At the last step, all vectors are combined into one cluster.

Fig. 1.18.

Agglomerative methods include the single, average and complete connection methods and Ward's method.

1. The single connection method. Let X_1, ..., X_n be the data vectors, each vector forming one cluster. First, the matrix of distances between these clusters is calculated, with the nearest-neighbor distance used as the metric. Using this matrix, the two closest vectors are selected; they form the first cluster S_1. At the next step, a new distance matrix is calculated between S_1 and the remaining vectors (which we treat as clusters), with the distance between clusters combined into classes (α = β = 1/2, δ = -1/2, γ = 0) used as the metric. The cluster closest to the previously obtained class S_1 is united with it, forming S_2, and so on. After n - 1 steps, all vectors are combined into one cluster.

Advantages: 1) only one element is added at each step of the algorithm; 2) the method is extremely simple; 3) the algorithm is insensitive to transformations of the initial data (rotation, shift, translation, stretching).

Disadvantages: 1) the distance matrix must be constantly recalculated; 2) the number of clusters is known in advance and cannot be reduced.

  • 2. The complete connection method. The method practically repeats the single connection method, except that a new object is included in a cluster if and only if the distance between the objects (vectors or clusters) is less than some predetermined number, which is set by the user. The distance is calculated only according to the "far neighbor" principle (the same applies to the distance between classes combined into classes: only the far neighbor principle, with α = β = δ = 1/2, γ = 0).
  • 3. The average connection method. The cluster formation algorithm coincides with the single connection algorithm; however, when deciding whether to include a new object in a cluster, calculations are performed according to the average connection principle. As with the complete connection method, all calculated distances between clusters are compared with a user-specified number, and if the distance is less than that number, the new object is included in the old class. Thus, the average connection method differs from the complete connection method only in the way the distance between clusters is calculated.
  • 4. Ward's method. Let X_1, ..., X_n be the data, each vector forming one cluster. We find the distance matrix using some metric (for example, the Mahalanobis distance) and use it to determine the clusters closest to each other. The sum of squared deviations of the vectors within the cluster S_k is calculated according to the formula

V_k = sum_{i=1..n_k} sum_{j=1..p} (x_ij - x̄_jk)^2,

where k is the cluster number, i is the vector number in the cluster, j is the coordinate number of X_i ∈ R^p, n_k is the number of vectors in the cluster, and x̄_jk is the sample mean of the j-th coordinate in S_k. The value V_k characterizes the deviations of the vectors from each other within the cluster (the new S_k ∪ S_l or the old S_k). The calculation of V_k should be carried out before and after the union, and all possible variants of such unions must be examined. Subsequently, only those vectors or clusters are added to the cluster S_k that result in the smallest change of V_k after merging and, as a result, are located at the minimum distance from the original cluster S_k.

Let us now consider iterative methods. The essence of iterative methods is that clustering begins with the setting of some initial conditions. For example, it may be required to set the number of clusters to be obtained, or to set the distance that determines the end of the cluster formation process, etc. The initial conditions are chosen according to the result the researcher needs; usually, however, they are set according to the solution found by one of the agglomerative methods. Iterative methods include the k-means method and the method of searching for concentrations.

1. The k-means method. Let there be vectors X_1, ..., X_n ∈ R^p that need to be divided into k clusters. At the zero step, k of the n vectors are chosen at random, assuming that each of them forms one cluster. We obtain a set of reference standards e_1^(0), ..., e_k^(0) with weights

ω_1^(0) = ... = ω_k^(0) = 1, and calculate the distances between the next vector X_i and the standards e_1^(0), ..., e_k^(0) according to some metric, for example the Euclidean one:

ρ(X_i, e_j^(0)) = sqrt( sum_{s=1..p} (x_is - e_js^(0))^2 ).

Based on the computed distances, the vector X_i is attached to the standard to which the distance is minimal. Suppose for definiteness that this is e_j^(0). It is replaced by a new standard, recalculated taking the attached point into account, according to the formula

e_j^(1) = ( ω_j^(0) e_j^(0) + X_i ) / ( ω_j^(0) + 1 ).

In addition, the weight is also recalculated: ω_j^(1) = ω_j^(0) + 1.

If two or more minimum distances occur in the matrix, then X_i is included in the cluster with the lowest sequence number.

At the next step, the next vector is selected from the remaining ones, and the procedure is repeated. Thus, after (n - k) steps each standard e_j^(n-k) has a corresponding weight ω_j^(n-k), and the clustering procedure ends. With large n and small k, the algorithm quickly converges to a stable solution, i.e. to a solution in which the standards obtained after the first application of the algorithm coincide in number and composition with the standards found when the method is repeated. Nevertheless, the algorithmic procedure is always repeated several times, using the partition obtained in the previous calculations as the reference vectors (as an initial approximation): the previously found references e_1^(n-k), ..., e_k^(n-k) are taken as e_1^(0), ..., e_k^(0), and the algorithmic procedure is repeated. (A computational sketch of this sequential scheme is given after item 2 below.)

  • 2. The concentration search method. This is the following iterative algorithm; it does not require an a priori specification of the number of clusters. At the first step, the matrix of distances between X_1, ..., X_n ∈ R^p is calculated according to some metric. Then one vector is chosen at random to play the role of the center of the first cluster; this is the initial approximation. We assume that this vector lies at the center of a p-dimensional sphere of radius R, and this radius is set by the researcher. After that, the vectors X_s1, ..., X_sk that fall inside this sphere are determined, and their sample mean X̄ = (X_s1 + ... + X_sk) / k is calculated. The center of the sphere is then moved to X̄, and the calculation procedure is repeated. The condition for ending the iterative process is the equality of the mean vectors X̄ found at the t-th and (t + 1)-th steps. The elements that end up inside the sphere are included in one cluster and excluded from further consideration; for the remaining points, the algorithm is repeated. The algorithm converges for any choice of the initial approximation and any amount of initial data. However, to obtain a stable partition (i.e. a partition in which the clusters found after the first application of the algorithm coincide in number and composition with the clusters found when the method is repeated), it is recommended to repeat the algorithmic procedure several times for different values of the sphere radius R. A sign of a stable partition is the formation of the same number of clusters with the same composition.
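As promised in item 1 above, a minimal Python sketch of the sequential k-means scheme; the data and the value of k are hypothetical.

```python
import numpy as np

def sequential_kmeans(X, k):
    centers = X[:k].astype(float)            # initial standards e_1, ..., e_k
    weights = np.ones(k)                     # their weights
    for x in X[k:]:
        j = np.argmin(np.linalg.norm(centers - x, axis=1))             # nearest standard
        centers[j] = (weights[j] * centers[j] + x) / (weights[j] + 1)  # recalculate the standard
        weights[j] += 1                                                # and its weight
    return centers, weights

X = np.random.rand(100, 2)
centers, weights = sequential_kmeans(X, k=3)   # the run can be repeated with these centers as new standards
print(centers, weights)
```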

Note that the clustering problem does not have a unique solution. As a result, it is rather difficult, and not always possible, to enumerate all admissible partitions of the data into classes. In order to evaluate the quality of different clustering variants, the notion of a partition quality functional is introduced; it takes its minimum value on the best (from the researcher's point of view) partition.

Let X_1, ..., X_n ∈ R^p be a set of observations that is divided into the classes S = (S_1, ..., S_k), with k known in advance. Then the main partition quality functionals for a known number of clusters have the form:

1) The weighted sum of intraclass variances

Q_1(S) = sum_{l=1..k} sum_{X_i ∈ S_l} ρ^2(X_i, a^(l)),

where a^(l) is the sample mathematical expectation (mean vector) of the cluster S_l.

The functional Q_1(S) allows us to estimate the degree of homogeneity of all the clusters as a whole.

2) The sum of pairwise intraclass distances between elements,

Q_2(S) = sum_{l=1..k} sum_{X_i, X_j ∈ S_l} ρ(X_i, X_j), or the weighted variant Q_2'(S) = sum_{l=1..k} (1/n_l) sum_{X_i, X_j ∈ S_l} ρ^2(X_i, X_j),

where n_l is the number of elements in the cluster S_l.

3) The generalized intraclass variance

Q_3(S) = (1/k) sum_{j=1..k} det(A_j),

where n_j is the number of elements in S_j and A_j is the sample covariance matrix for S_j (computed over its n_j elements).

This functional is the arithmetic mean of the generalized intraclass variances calculated for each cluster. As is known, the generalized variance makes it possible to estimate the degree of dispersion of multivariate observations. That is why Q_3(S) allows the average scatter of the observation vectors in the classes S_1, ..., S_k to be estimated; hence its name, the generalized intraclass variance. Q_3(S) is used when it is necessary to solve the problem of data compression, i.e. the concentration of the observations in a space of dimension lower than the original one.

4) The quality of the classification of observations can also be assessed using Hotelling's criterion. To do this, we apply the criterion for testing the hypothesis H_0 about the equality of the mean vectors of two multidimensional populations and calculate the statistic

Q_4(S) = ( n_l n_m / (n_l + n_m) ) (X̄_l - X̄_m)' (S*)^(-1) (X̄_l - X̄_m),

where n_l and n_m are the numbers of vectors in the classes S_l and S_m; X̄_l and X̄_m are their mean vectors; X_l and X_m are the centered initial data matrices of these classes; S* is the pooled covariance matrix of the clusters S_l and S_m:

S* = ( 1 / (n_l + n_m - 2) ) (X_l' X_l + X_m' X_m).

As before, the value Q_4(S) is compared with the table value calculated according to the formula

Q_table = ( m (n_l + n_m - 2) / (n_l + n_m - m - 1) ) F_α(m, n_l + n_m - m - 1),

where m is the initial dimension of the observation vectors and α is the significance level.

The hypothesis H_0 is accepted with probability (1 - α) if Q_4(S) ≤ Q_table, and is rejected otherwise.

It is also possible to estimate the quality of the division into classes empirically. For example, one can compare the sample means found for each class with the sample mean of the entire population of observations; if they differ by a factor of two or more, the partition is good. A more correct comparison of the cluster sample means with the sample mean of the entire population leads to the use of analysis of variance to assess the quality of the division into classes.
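As an illustration, a rough sketch of computing Q_1 and Q_3 as reconstructed above; the data matrix and the labels are hypothetical.

```python
import numpy as np

def q1(X, labels):
    """Weighted sum of intraclass variances: squared deviations from each cluster's mean."""
    return sum(np.sum((X[labels == l] - X[labels == l].mean(axis=0)) ** 2)
               for l in np.unique(labels))

def q3(X, labels):
    """Arithmetic mean of the generalized (determinant) variances of the clusters."""
    dets = [np.linalg.det(np.cov(X[labels == l], rowvar=False)) for l in np.unique(labels)]
    return float(np.mean(dets))

X = np.random.rand(40, 3)
labels = np.random.randint(0, 2, size=40)
print(q1(X, labels), q3(X, labels))
```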

If the number of clusters in S = (S_1, ..., S_k) is not known in advance, then the following partition quality functionals are used for an arbitrarily chosen integer m:

I_m(S) = [ (1/n) sum_{i=1..k} (1/n_i) sum_{X_j, X_l ∈ S_i} ρ^m(X_j, X_l) ]^(1/m) is the average measure of intraclass scattering;

Z_m(S) = [ (1/n) sum_{i=1..n} ( v(X_i) / n )^m ]^(1/m) is the measure of concentration of the points of the set S, where v(X_i) is the number of elements in the cluster containing the point X_i.

Note that for an arbitrary value of the parameter m the functional Z_m(S) reaches its minimum, equal to 1/n, if the original clustering S = (S_1, ..., S_k) is a partition into mono-clusters S_i = {X_i}, because then v(X_i) = 1. At the same time, Z_m(S) reaches its maximum of 1 if S is one cluster containing all the initial data, because then v(X_i) = n. In particular cases it can be shown that Z_(-1)(S) = 1/k, where k is the number of distinct clusters in S = (S_1, ..., S_k), while Z_∞(S) = max_i (n_i / n) and Z_(-∞)(S) = min_i (n_i / n), where n_i is the number of elements in the cluster S_i.

Note that in the case of an unknown number of clusters, the partition quality functional Q(S) can be chosen as an algebraic combination (sum, difference, product, ratio) of the two functionals I_m(S) and Z_m(S), since the first is a decreasing and the second an increasing function of the number of classes k. Such behavior of Z_m(S) guarantees the existence of an extremum of Q(S).

Introduction

The term cluster analysis, first introduced by Tryon in 1939, includes more than 100 different algorithms.

Unlike classification problems, cluster analysis does not require a priori assumptions about the data set, does not impose restrictions on the representation of the objects under study, and allows you to analyze indicators of various types of data (interval data, frequencies, binary data). It must be remembered that the variables must be measured on comparable scales.

Cluster analysis allows you to reduce the dimension of data and make it visual.

Cluster analysis can be applied to sets of time series, here periods of similarity of some indicators can be distinguished and groups of time series with similar dynamics can be determined.

Cluster analysis has developed in parallel in several directions, such as biology, psychology, and others, so most methods have two or more names.

Cluster analysis tasks can be grouped into the following groups:

    Development of a typology or classification.

    Exploring useful conceptual schemes for grouping objects.

    Presenting hypotheses based on data exploration.

    Hypothesis testing or research to determine whether types (groups) identified in one way or another are actually present in the available data.

As a rule, in the practical use of cluster analysis, several of these tasks are simultaneously solved.

                Purpose of the lesson

Obtaining skills in the practical application of hierarchical and iterative methods of cluster analysis.

                Practical task

Develop algorithms for the nearest neighbor and k-means methods and implement them as computer programs. Using a random number generator, generate 50 realizations of x = (x1, x2), a random 2-dimensional variable whose coordinates are uniformly distributed on the interval (3, 8). Using the developed programs, distribute them into the minimum number of clusters, each of which fits into a sphere with a radius of 0.15.
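One possible greedy sketch of the task (it does not guarantee the minimum number of clusters, and the interval is read here as (3, 8)); the k-means part can then be run with the number of clusters found.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(3.0, 8.0, size=(50, 2))     # 50 realizations of x = (x1, x2)

R = 0.15
leaders, labels = [], np.empty(len(points), dtype=int)
for i, p in enumerate(points):
    dists = [np.linalg.norm(p - l) for l in leaders]
    if dists and min(dists) <= R:
        labels[i] = int(np.argmin(dists))        # join the nearest existing cluster
    else:
        leaders.append(p)                        # start a new cluster with p as its "leader"
        labels[i] = len(leaders) - 1
print(len(leaders), "clusters")                  # each cluster fits in a sphere of radius R around its leader
```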

                Guidelines

The name cluster analysis comes from the English word cluster - bunch, accumulation. Cluster analysis is a wide class of multivariate statistical analysis procedures that allow automated grouping of observations into homogeneous classes - clusters.

The cluster has the following mathematical characteristics:

  • center;

  • radius;

  • cluster dispersion;

  • standard deviation.

The cluster center is the mean geometric locus of its points in the space of variables.

Cluster radius - the maximum distance of points from the center of the cluster.

Cluster dispersion is a measure of the spread of points in space relative to the center of the cluster.

The standard deviation (RMS) of objects relative to the cluster center is the square root of the cluster variance.

Cluster analysis methods

Cluster analysis methods can be divided into two groups:

    hierarchical;

    non-hierarchical.

Each group includes many approaches and algorithms.

Using different methods of cluster analysis, an analyst can get different solutions on the same data. This is considered normal.

    Hierarchical methods of cluster analysis

The essence of hierarchical clustering is the sequential merging of smaller clusters into larger clusters or the division of large clusters into smaller ones.

Hierarchical agglomerative methods (Agglomerative Nesting, AGNES)

This group of methods is characterized by a consistent combination of initial elements and a corresponding decrease in the number of clusters.

At the beginning of the algorithm, all objects are separate clusters. At the first step, the most similar objects are combined into a cluster. In subsequent steps, the merging continues until all objects form one cluster.

Hierarchical divisive methods (DIvisive ANAlysis, DIANA)

These methods are the logical opposite of agglomerative methods. At the beginning of the algorithm, all objects belong to one cluster, which is divided into smaller clusters at subsequent steps, as a result, a sequence of splitting groups is formed.

Hierarchical clustering methods differ in the rules for building clusters. The rules are the criteria that are used when deciding on the "similarity" of objects when they are combined into a group.

    Similarity measures

To calculate the distance between objects, various similarity measures (similarity measures), also called metrics or distance functions, are used.

Euclidean distance is the geometric distance in multidimensional space and is calculated by formula (4.1):

d(x, y) = sqrt( sum_i (x_i - y_i)^2 ).   (4.1)

The Euclidean distance (and its square) is calculated from the original data, not from standardized data.

The squared Euclidean distance is calculated by formula (4.2):

d(x, y) = sum_i (x_i - y_i)^2.   (4.2)

Manhattan distance (city-block distance), also called the "Hamming" or "city block" distance, is computed as the sum of the absolute differences over the coordinates. In most cases, this distance measure leads to results similar to those for the Euclidean distance. However, for this measure the effect of individual outliers is smaller than with the Euclidean distance, since the coordinates are not squared. The Manhattan distance is calculated using formula (4.3):

d(x, y) = sum_i |x_i - y_i|.   (4.3)

Chebyshev distance should be used when it is necessary to define two objects as "different" if they differ in any one dimension. The Chebyshev distance is calculated by formula (4.4):

d(x, y) = max_i |x_i - y_i|.   (4.4)

Power distance is used when one wishes to progressively increase or decrease the weight related to a dimension for which the corresponding objects are very different. The power distance is calculated by formula (4.5):

d(x, y) = ( sum_i |x_i - y_i|^p )^(1/r),   (4.5)

where r and p are user-defined parameters. The parameter p is responsible for the gradual weighting of the differences over individual coordinates, and the parameter r for the progressive weighting of large distances between objects. If both parameters r and p are equal to two, this distance coincides with the Euclidean distance.

Percent disagreement is used when the data are categorical. This distance is calculated by formula (4.6):

d(x, y) = (number of coordinates at which x_i ≠ y_i) / (total number of coordinates).   (4.6)
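For illustration, the same measures are available by name in SciPy (assumed to be installed); the two observations are hypothetical.

```python
import numpy as np
from scipy.spatial import distance

x, y = np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 2.0])
print(distance.euclidean(x, y))        # cf. (4.1)
print(distance.sqeuclidean(x, y))      # cf. (4.2)
print(distance.cityblock(x, y))        # cf. (4.3), Manhattan
print(distance.chebyshev(x, y))        # cf. (4.4)
print(distance.minkowski(x, y, p=3))   # a power-type distance with r = p, cf. (4.5)
print(distance.hamming(x, y))          # fraction of disagreeing coordinates, cf. (4.6)
```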

    Join or link methods

At the first step, when each object is a separate cluster, the distances between these objects are determined by the chosen measure. However, when multiple objects are linked together, other methods must be used to determine the distance between clusters. There are many methods for joining clusters:

    Single link (nearest neighbor method) - the distance between two clusters is determined by the distance between the two closest objects (nearest neighbors) in different clusters.

    Full connection (most distant neighbor method) - distances between clusters are determined by the largest distance between any two objects in different clusters (i.e. "most distant neighbors").

    Unweighted pairwise mean - the distance between two different clusters is calculated as the average distance between all pairs of objects in them.

    Weighted pairwise mean – the method is identical to the unweighted pairwise mean method, except that the calculation uses the size of the corresponding clusters (i.e., the number of objects they contain) as a weighting factor.

    Unweighted centroid method - the distance between two clusters is defined as the distance between their centers of gravity.

    Weighted centroid method (median) - the method is identical to the unweighted centroid method, except that weights are used in the calculations to take into account the difference between the sizes of clusters (i.e., the number of objects in them).

    Ward's method - the distance between clusters is defined as the increase in the sum of squared distances of objects to the centers of clusters, obtained as a result of their union. The method is different from all other methods because it uses ANOVA methods to estimate distances between clusters. The method minimizes the sum of squares for any two (hypothetical) clusters that can be formed at each step.

Nearest neighbor method

The distance between two classes is defined as the distance between their nearest members.

Before the algorithm starts, the distance matrix between objects is calculated. According to the classification criterion, clusters whose nearest representatives are separated by the smallest distance are merged: the two objects with the smallest distance between them are placed into one cluster. After that, the distance matrix must be recalculated taking the new cluster into account. At each step, the distance matrix is searched for the minimum value, corresponding to the distance between the two closest clusters. The found clusters are combined to form a new cluster. This procedure is repeated until all clusters are merged.

When using the nearest neighbor method, special attention should be paid to the choice of a distance measure between objects. Based on it, an initial distance matrix is ​​formed, which determines the entire further classification process.
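A minimal sketch of this merging loop, assuming a precomputed full distance matrix D and clusters stored as lists of object indices.

```python
import numpy as np

def single_link(D, n_clusters):
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])  # nearest-neighbor distance
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)       # merge the two closest clusters
    return clusters

X = np.random.rand(8, 2)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(single_link(D, n_clusters=3))
```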

    Iterative methods

With a large number of observations, hierarchical methods of cluster analysis are not suitable. In such cases, non-hierarchical methods are used, which are based on splitting the original set of objects into clusters and are iterative methods of partitioning the original population. During the division process, new clusters are formed until the stopping rule is met.

Such non-hierarchical clustering consists of dividing a data set into a certain number of distinct clusters. There are two approaches. The first is to define the boundaries of clusters as the most dense areas in the multidimensional space of the original data, i.e., the definition of a cluster where there is a large "point cluster". The second approach is to minimize the measure of object difference.

Unlike hierarchical classification methods, iterative methods can lead to the formation of overlapping clusters, when one object can simultaneously belong to several clusters.

Iterative methods include, for example, the k-means method, the method of searching for concentrations, and others. Iterative methods are fast, which allows them to be used to process large arrays of initial information.

Algorithm k-means (k-means)

Among iterative methods, the most popular is MacQueen's k-means method. Unlike hierarchical methods, in most implementations of this method the user must specify the desired number of final clusters, which is usually denoted k. The k-means algorithm builds k clusters located as far apart as possible. The main class of problems the k-means algorithm solves presupposes assumptions (hypotheses) regarding the number of clusters, which should be as different as possible. The choice of the number k may be based on prior research, theoretical considerations, or intuition.

As in hierarchical clustering methods, the user can choose one or another type of similarity measure. Different algorithms of the k-means method also differ in the way the initial centers of the clusters are chosen. In some versions of the method, the user can (or must) specify such initial points, either by selecting them from real observations or by specifying the coordinates of these points for each of the variables. In other implementations, a given number k of initial points is chosen randomly, and these initial points (cluster centers) can subsequently be refined in several stages. There are four main stages of such methods:

    k observations are chosen or assigned that will serve as the primary cluster centers;

    if necessary, intermediate clusters are formed by assigning each observation to the nearest given cluster centers;

    after assigning all observations to individual clusters, the primary cluster centers are replaced by cluster averages;

    the previous iteration is repeated until the changes in the coordinates of the cluster centers become minimal.

The general idea of the algorithm: a given fixed number k of clusters is matched to the observations in such a way that the cluster means (for all variables) differ from each other as much as possible.

Description of the algorithm

    Initial distribution of objects by clusters.

The number k is selected, along with k initial points. At the first step, these points are considered to be the "centers" of the clusters. Each cluster corresponds to one center. The choice of the initial centroids can be carried out as follows:

    choice of k observations to maximize the initial distance between them;

    random selection of k observations;

    selection of the first k observations.

Each object is then assigned to the nearest of these clusters.

    Iterative process.

The centers of the clusters are calculated; from then on, the coordinate-wise means of the clusters are taken as the centers. The objects are then redistributed. The process of calculating centers and redistributing objects continues until one of the following conditions is met:

    cluster centers have stabilized, i.e. all observations belong to the cluster they belonged to before the current iteration. In some variants of this method, the user can set a numerical value of the criterion, interpreted as the minimum distance for selecting new cluster centers. An observation will not be considered as a candidate for a new cluster center if its distance to the center being replaced exceeds the specified number. In a number of programs such a parameter is called the "radius". In addition to this parameter, it is possible to set a usually sufficiently small number with which the change in distance for all cluster centers is compared. This parameter is usually called "convergence", because it reflects the convergence of the iterative clustering process;

    the number of iterations is equal to the maximum number of iterations.

Checking the quality of clustering

After obtaining the results of cluster analysis by the method k- averages, you should check the correctness of clustering (i.e., evaluate how the clusters differ from each other). To do this, average values ​​for each cluster are calculated. Good clustering should produce very different means for all measurements, or at least most of them.
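A short illustration (scikit-learn assumed available) of running k-means and then comparing the per-cluster means, as this check suggests; the data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 4)                               # 200 objects, 4 variables
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

for c in range(3):                                       # the cluster means should differ noticeably
    print(c, X[km.labels_ == c].mean(axis=0))
```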

Advantages of the k-means algorithm:

    ease of use;

    speed of use;

    clarity and transparency of the algorithm.

Disadvantages of the k-means algorithm:

    the algorithm is too sensitive to outliers, which can distort the means. A possible solution to this problem is to use a modification of the algorithm, the k-medians algorithm;

    the algorithm can be slow on large databases. A possible solution to this problem is to use data sampling.

The report must contain:

    description and block diagrams of algorithms;

    source texts of program modules;

    the results of the algorithms in the form of graphs.

With a large number of observations, hierarchical methods of cluster analysis are not suitable. In such cases, non-hierarchical methods based on division are used, which are iterative methods of splitting the original population. During the division process, new clusters are formed until the stopping rule is met.

Such non-hierarchical clustering consists of dividing a data set into a certain number of distinct clusters. There are two approaches. The first is to define the boundaries of clusters as the densest areas in the multidimensional space of the initial data, i.e. definition of a cluster where there is a large "concentration of points". The second approach is to minimize the measure of object difference

Algorithm k-means (k-means)

The most common among non-hierarchical methods is the k-means algorithm, also called fast cluster analysis. A complete description of the algorithm can be found in Hartigan and Wong (1978). Unlike hierarchical methods, which do not require preliminary assumptions about the number of clusters, to be able to use this method, it is necessary to have a hypothesis about the most probable number of clusters.

The k-means algorithm builds k clusters spaced as far apart as possible. The main class of problems that the k-means algorithm solves presupposes assumptions (hypotheses) regarding the number of clusters, which should be as different as possible. The choice of the number k may be based on previous research, theoretical considerations, or intuition.

The general idea of the algorithm: a given fixed number k of clusters is matched to the observations so that the cluster means (for all variables) differ from each other as much as possible.

Description of the algorithm

  1. Initial distribution of objects by clusters. The number k and k initial points are chosen, and at the first step these points are considered to be the "centers" of the clusters. Each cluster corresponds to one center.
    The choice of initial centroids can be carried out as follows:
    - choice of k-observations to maximize the initial distance;
    - random selection of k-observations;
    - choice of the first k-observations.
    As a result, each object is assigned to a specific cluster.
  2. Iterative process. The centers of the clusters are calculated, which then and further are considered to be the coordinate means of the clusters. Objects are redistributed again.
    The process of calculating centers and redistributing objects continues until one of the following conditions is met:
    - cluster centers have stabilized; all observations belong to the cluster they belonged to before the current iteration;
    - the number of iterations is equal to the maximum number of iterations.
Figure 1 shows an example of how the k-means algorithm works for k equal to two.

Figure 1 - An example of the operation of the k-means algorithm (k=2)

The choice of the number of clusters is a complex issue. If there are no assumptions about this number, it is recommended to create 2 clusters, then 3, 4, 5, etc., comparing the results.

Checking the quality of clustering

After obtaining the results of cluster analysis using the k-means method, one should check the correctness of the clustering (i.e., evaluate how the clusters differ from each other). To do this, average values ​​for each cluster are calculated. Good clustering should produce very different means for all measurements, or at least most of them.

Advantages of the k-means algorithm:

  1. ease of use;
  2. speed of use;
  3. clarity and transparency of the algorithm.
Disadvantages of the k-means algorithm:
  1. the algorithm is too sensitive to outliers that can distort the mean. A possible solution to this problem is to use a modification of the algorithm - the k-median algorithm;
  2. the algorithm can be slow on large databases. A possible solution to this problem is to use data sampling.
The PAM algorithm (Partitioning Around Medoids)

PAM is a modification of the k-means algorithm: the k-medoids (k-median) algorithm.

The algorithm is less sensitive to data noise and outliers than the k-means algorithm because the median is less affected by outliers.
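A rough k-medoids-style sketch (a simplified scheme, not the full PAM algorithm), in which each medoid is a real observation re-chosen as the cluster member with the smallest total distance to the other members.

```python
import numpy as np

def k_medoids(X, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distance matrix
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)                # assign each object to its nearest medoid
        for j in range(k):
            members = np.where(labels == j)[0]
            costs = D[np.ix_(members, members)].sum(axis=1)      # total distance to the other members
            medoids[j] = members[np.argmin(costs)]               # the best-placed member becomes the medoid
    return medoids, labels

X = np.random.rand(60, 2)
medoids, labels = k_medoids(X, k=3)
print(medoids)
```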

PAM is effective for small databases, but should not be used for large data sets.

Preliminary dimensionality reduction

Consider an example. There is a database of the company's clients, which should be divided into homogeneous groups. Each client is described by 25 variables. The use of such a large number of variables leads to clusters with a fuzzy structure. As a result, it is quite difficult for an analyst to interpret the resulting clusters.

More understandable and transparent clustering results can be obtained if, instead of the set of initial variables, some generalized variables or criteria are used that contain information about the relationships between the variables in compressed form. That is, the problem of data dimensionality reduction arises. It can be solved by various methods; one of the most common is factor analysis. Let us dwell on it in more detail.

Factor analysis

Factor analysis is a technique used to study relationships between the values ​​of variables.

In general, factor analysis has two goals:

  1. reducing the number of variables;
  2. classification of variables - determination of the structure of relationships between variables.
Accordingly, factor analysis can be used to solve data dimensionality reduction problems or to solve classification problems.

Criteria or main factors identified as a result of factor analysis contain information in a compressed form about the existing relationships between variables. This information allows you to get better clustering results and better explain the semantics of clusters. The factors themselves can be given a certain meaning.

With the help of factor analysis, a large number of variables is reduced to a smaller number of independent influencing quantities, which are called factors.

A factor in a "compressed" form contains information about several variables. Variables that are highly correlated with each other are combined into one factor. As a result of factor analysis, such complex factors are found that explain the relationships between the variables under consideration as fully as possible.

At the first step of factor analysis, the values ​​of variables are standardized, the need for which was considered in the previous lecture.

Factor analysis is based on the hypothesis that the analyzed variables are indirect manifestations of a relatively small number of some hidden factors.

Factor analysis is a set of methods focused on identifying and analyzing hidden dependencies between observed variables. Hidden dependencies are also called latent dependencies.

One of the methods of factor analysis - the method of principal components - is based on the assumption that the factors are independent of each other.

Iterative clustering in SPSS. Statistical packages usually implement a wide arsenal of methods, which makes it possible first to reduce the dimension of a data set (for example, using factor analysis) and then to carry out the clustering itself (for example, using the fast cluster analysis method). Let us consider this clustering option in the SPSS package.
To reduce the dimension of the initial data, we use factor analysis. To do this, select in the menu: Analyze / Data Reduction / Factor (factor analysis).
Use the Extraction button to select the extraction method; we leave the default Principal Component Analysis mentioned above. A rotation method should also be chosen, for example one of the most popular, the varimax method. To save the factor values as variables, the "Save as variables" option must be checked under "Values".

As a result of this procedure, the user receives the report "Explained total variance", which shows the number of selected factors - these are the components whose eigenvalues ​​exceed one.

The obtained values ​​of the factors, which are usually given the names fact1_1, fact1_2, etc., are used for cluster analysis using the k-means method. To conduct a quick cluster analysis, select from the menu:

Analyze/Classify/K-Means Cluster: (K-means cluster analysis).

In the K Means Cluster Analysis dialog box (Cluster analysis by k-means), you need to put the factor variables fact1_1, fact1_2, etc. in the field of tested variables. Here you also need to specify the number of clusters and the number of iterations.

As a result of this procedure, we obtain a report with the values of the centers of the formed clusters, the number of observations in each cluster, and additional information specified by the user.

Thus, the k-means algorithm divides the set of initial data into a given number of clusters. To visualize the results obtained, one of the graphs, for example a scatterplot, can be used. However, conventional visualization is possible only for a limited number of dimensions, since a person can perceive only three-dimensional space. Therefore, if more than three variables are analyzed, special multidimensional methods for presenting information should be used; they will be discussed in one of the subsequent lectures of the course.
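Outside SPSS, the same two-stage idea can be sketched with scikit-learn (assumed available): FactorAnalysis stands in for the factor step and KMeans for the fast cluster analysis; the data are hypothetical.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.random.rand(300, 25)                                   # e.g. 300 clients, 25 variables
Z = StandardScaler().fit_transform(X)                         # standardize the variables
factors = FactorAnalysis(n_components=3).fit_transform(Z)     # factor scores (analogous to fact1_1, fact1_2, ...)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(factors)
print(np.bincount(labels))                                    # number of observations in each cluster
```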

Iterative clustering methods differ in the choice of the following parameters:

  1. starting point;
  2. the rule for the formation of new clusters;
  3. stopping rule.
The choice of clustering method depends on the amount of data and on whether there is a need to work with several types of data at the same time.

In the SPSS package, for example, if you need to work with both quantitative (for example, income) and categorical (for example, marital status) variables, and if the amount of data is large enough, the two-stage cluster analysis method is used: a scalable cluster analysis procedure that can work with data of various types. At the first stage, the records are pre-clustered into a large number of sub-clusters; at the second stage, the resulting sub-clusters are grouped into the required number, and if this number is unknown, the procedure determines it automatically. Using this procedure, a bank employee can, for example, select groups of people using such indicators as age, gender and income level simultaneously. The results obtained make it possible to identify clients who are at risk of loan default.

In the general case, all stages of cluster analysis are interconnected, and the decisions made at one of them determine the actions at the subsequent stages.

The analyst must decide whether to use all of the observations or to exclude some data or samples from the data set.

Choice of metric and method of standardization of initial data.

Determining the number of clusters (for iterative cluster analysis).

Defining the clustering method (merging or linking rules).

According to many experts, the choice of clustering method is decisive in determining the form and specificity of clusters.

Analysis of the clustering results. This stage involves answering the following questions: is the resulting clustering random; is the partition reliable and stable on subsamples of the data; is there a relationship between the clustering results and variables that did not participate in the clustering process; can the clustering results be interpreted?

Checking clustering results. The results of clustering should also be verified by formal and informal methods. Formal methods depend on the method used for clustering. Informal ones include the following procedures for checking the quality of clustering:

  1. analysis of clustering results obtained on certain samples of the data set;
  2. cross-validation;
  3. performing clustering when changing the order of observations in the data set;
  4. carrying out clustering when deleting some observations;
  5. clustering on small samples.
One of the options for checking the quality of clustering is to use several methods and compare the results. The absence of similarity does not mean that the results are incorrect, but the presence of similar groups is considered a sign of high-quality clustering.

Difficulties and problems that may arise when applying cluster analysis

Like any other methods, cluster analysis methods have certain weaknesses, i.e. certain difficulties, problems and limitations.

When conducting cluster analysis, it should be taken into account that the results of clustering depend on the criteria for splitting the set of initial data. When the data dimension is reduced, certain distortions may occur; due to generalizations, some individual characteristics of objects may be lost.

There are a number of complexities that should be considered before clustering.

  1. The complexity of the choice of characteristics on the basis of which clustering is carried out. An ill-considered choice leads to inadequate partitioning into clusters and, as a result, to an incorrect solution of the problem.
  2. Difficulty in choosing a clustering method. This choice requires a good knowledge of the methods and the prerequisites for their use. To check the effectiveness of a particular method in a particular subject area, it is advisable to apply the following procedure: consider several groups that are a priori different from each other and randomly mix their representatives with each other. Next, clustering is carried out to restore the original partition into clusters. The proportion of coincidences of objects in the identified and initial groups is an indicator of the effectiveness of the method.
  3. The problem of choosing the number of clusters. If there is no information about the possible number of clusters, it is necessary to conduct a series of experiments and, as a result of enumeration of a different number of clusters, choose their optimal number.
  4. The problem of interpretation of clustering results. The form of clusters in most cases is determined by the choice of the method of association. However, it should be taken into account that specific methods tend to create clusters of certain shapes, even if in fact there are no clusters in the studied dataset.
Comparative analysis of hierarchical and non-hierarchical clustering methods

Before clustering, the analyst may have a question which group of cluster analysis methods to give preference to. When choosing between hierarchical and non-hierarchical methods, it is necessary to take into account the following features.

Non-hierarchical methods show higher resistance to noise and outliers, to an incorrect choice of metric, and to the inclusion of insignificant variables in the set involved in clustering. The price to be paid for these advantages of the method is the word "a priori": the analyst must determine in advance the number of clusters, the number of iterations or the stopping rule, as well as some other clustering parameters. This is especially difficult for beginners.

If there are no assumptions about the number of clusters, it is recommended to use hierarchical algorithms. However, if the sample size does not allow this, a possible path is to conduct a series of experiments with different numbers of clusters: for example, start by splitting the data set into two groups and, gradually increasing the number, compare the results. This "variation" of the results provides considerable clustering flexibility.

Hierarchical methods, unlike non-hierarchical ones, do not require the number of clusters to be fixed in advance; instead, they build a complete tree of nested clusters.
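The sketch below illustrates this property, assuming the SciPy implementation of hierarchical clustering: one linkage tree is built and then cut at different levels to obtain 2, 3 or 4 clusters without re-running the algorithm. Ward linkage and the toy data are illustrative choices.

```python
# Build a complete tree of nested clusters once, then cut it at several levels.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=2)
Z = linkage(X, method="ward")          # the full merge tree (dendrogram data)

for k in (2, 3, 4):
    labels = fcluster(Z, t=k, criterion="maxclust")   # cut the tree into k clusters
    print(k, "clusters, sizes:", np.bincount(labels)[1:])

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree itself (matplotlib required).
```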

Difficulties of hierarchical clustering methods: the limit on the size of the data set; the choice of the proximity measure; the inflexibility of the resulting classifications.

The advantage of this group of methods in comparison with non-hierarchical ones is their clarity and the ability to obtain a detailed picture of the data structure.

When hierarchical methods are used, outliers in a data set can be identified quite easily and, as a result, data quality can be improved. This procedure underlies the two-step clustering approach: the cleaned data set can then be used for non-hierarchical clustering.
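One possible reading of this two-step scheme is sketched below, under illustrative assumptions: single-linkage clustering with a distance threshold flags tiny clusters as likely outliers, and K-means is then run on the cleaned data. The distance threshold and the "fewer than three points" rule are arbitrary illustration parameters, not part of a standard algorithm.

```python
# Step 1: hierarchical clustering to flag outliers; step 2: K-means on cleaned data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=3)
X = np.vstack([X, [[25.0, 25.0], [-30.0, 20.0]]])    # two artificial outliers

# Step 1: single-linkage clustering with a generous distance threshold.
labels = fcluster(linkage(X, method="single"), t=5.0, criterion="distance")
sizes = np.bincount(labels)
keep = sizes[labels] >= 3                            # drop members of tiny clusters
X_clean = X[keep]

# Step 2: non-hierarchical clustering on the cleaned data set.
km = KMeans(n_clusters=3, n_init=10, random_state=3).fit(X_clean)
print("removed as outliers:", int((~keep).sum()), "objects")
print("cluster sizes after cleaning:", np.bincount(km.labels_))
```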

There is another aspect that has already been mentioned in this lecture: whether to cluster the entire data set or a sample of it. This aspect is essential for both groups of methods, but it is more critical for hierarchical ones. Hierarchical methods cannot work with large data sets, and using a sample, i.e. a part of the data, can make it possible to apply them.

Clustering results may not have sufficient statistical justification. On the other hand, when solving clustering problems, a non-statistical interpretation of the results is acceptable, as is a fairly wide variety of notions of what a cluster is. Such a non-statistical interpretation enables the analyst to obtain satisfactory clustering results, which is often difficult to do with other methods.

New Algorithms and Some Modifications of Cluster Analysis Algorithms

The methods considered in this and previous lectures are the "classics" of cluster analysis. Until recently, the main criterion by which a clustering algorithm was evaluated was the quality of the clustering itself: it was assumed that the entire data set would fit in RAM.

Now, however, with the advent of very large databases, new requirements have appeared that a clustering algorithm must satisfy. The main one, as mentioned in previous lectures, is the scalability of the algorithm.

We also note other properties that a clustering algorithm should satisfy: independence of the results from the order of the input data, and independence of the algorithm parameters from the input data.

Recently, new clustering algorithms capable of processing very large databases have been actively developed, with a focus on scalability. Such algorithms rely on generalized cluster representations, as well as on sampling and on data structures supported by the underlying DBMS.

Algorithms have also been developed in which hierarchical clustering methods are integrated with other methods. These algorithms include BIRCH, CURE, CHAMELEON and ROCK.

Algorithm BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)

The algorithm was proposed by Tian Zhang and colleagues.

Thanks to generalized cluster representations, clustering speed increases and the algorithm scales well to large data sets.

This algorithm implements a two-stage clustering process.

During the first stage, a preliminary set of clusters is formed. At the second stage, other clustering algorithms, suitable for working in RAM, are applied to the identified clusters.

The following analogy describes this algorithm. If each data element is thought of as a bead lying on the surface of a table, then clumps of beads can be "replaced" with tennis balls, and one can proceed to a more detailed study of the clusters of tennis balls. The number of beads can be quite large, but the diameter of the tennis balls can be chosen so that, at the second stage, traditional clustering algorithms can determine the actual, possibly complex, shape of the clusters.
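The sketch below illustrates the two-stage idea using the Birch implementation available in scikit-learn: the first stage builds compact subclusters (the "tennis balls" of the analogy), and the second stage clusters those subclusters with a conventional in-memory method. The threshold, branching factor and number of clusters are illustrative assumptions.

```python
# Two-stage clustering with scikit-learn's Birch estimator.
from sklearn.cluster import Birch, AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=4)

birch = Birch(
    threshold=0.8,                                      # radius within which points are absorbed into one subcluster
    branching_factor=50,
    n_clusters=AgglomerativeClustering(n_clusters=5),   # second-stage, in-memory clusterer
)
labels = birch.fit_predict(X)

print("subclusters built at the first stage:", len(birch.subcluster_centers_))
print("final clusters at the second stage:", len(set(labels)))
```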

WaveCluster Algorithm

WaveCluster is a clustering algorithm based on wavelet transforms. At the beginning of the algorithm, the data are generalized by imposing a multidimensional lattice on the data space. In subsequent steps the algorithm analyses not individual points but the generalized characteristics of the points falling into each cell of the lattice. As a result of this generalization, the necessary information fits in RAM. The algorithm then applies a wavelet transform to the generalized data to determine the clusters.
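The sketch below illustrates only the grid-summarization stage described above: a lattice is imposed on a two-dimensional data space and each cell is reduced to a generalized characteristic (the count of points in it). The subsequent wavelet-transform stage of WaveCluster is not shown; the two-dimensional case and the 64x64 grid are illustrative assumptions.

```python
# Grid-based generalization: replace individual points by per-cell summaries.
import numpy as np
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=5000, centers=4, random_state=5)

# Impose a 64x64 lattice on the 2-D data space; each cell keeps only a
# generalized characteristic (here, the number of points in the cell).
counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=64)

print("grid cells:", counts.size, "of which non-empty:", int((counts > 0).sum()))
# WaveCluster would now apply a wavelet transform to this grid and look for
# connected high-density regions in the transformed space.
```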

Main features of WaveCluster:

  1. complexity of implementation;
  2. the algorithm can detect clusters of arbitrary shapes;
  3. the algorithm is not sensitive to noise;
  4. the algorithm is only applicable to low-dimensional data.
Algorithm CLARA (Clustering LARge Applications)

The CLARA algorithm was developed by Kaufman and Rousseeuw in 1990 for clustering data in large databases. It is built into statistical analysis packages such as S-Plus.

Let us briefly describe the essence of the algorithm. CLARA draws multiple samples from the database, applies clustering to each sample, and returns the best clustering found as the output of the algorithm.

For large databases, this algorithm is more efficient than the PAM algorithm, but its effectiveness depends on the data chosen as samples: a good clustering of a selected sample may not yield a good clustering of the entire data set.
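A hedged sketch of this sampling idea is given below. Since scikit-learn does not provide a PAM implementation, K-means is used here as a stand-in for the medoid-based step; the sample size, the number of clusters and the cost measure are illustrative assumptions, so this is an imitation of the CLARA scheme rather than the algorithm itself.

```python
# CLARA-like scheme: cluster several random samples, keep the clustering that
# scores best on the full data set.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=20_000, centers=4, random_state=6)
rng = np.random.default_rng(6)

best_model, best_cost = None, np.inf
for _ in range(5):                                   # several random samples
    idx = rng.choice(len(X), size=1000, replace=False)
    model = KMeans(n_clusters=4, n_init=10, random_state=6).fit(X[idx])
    cost = -model.score(X)                           # sum of squared distances on the full data
    if cost < best_cost:
        best_model, best_cost = model, cost

print("best cost over samples:", round(best_cost, 1))
print("centers of the best clustering:\n", best_model.cluster_centers_.round(2))
```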

Algorithms Clarans, CURE, DBScan

The Clarans algorithm (Clustering Large Applications based upon RANdomized Search) formulates the clustering problem as a randomized search in a graph, in which each node represents a partition of the data set into the number of clusters specified by the user. The "quality" of the resulting clusters is assessed with a criterion function. The algorithm searches through possible partitions of the data set for an acceptable solution; the search stops at the node at which the minimum among a predetermined number of local minima is reached.

Among the new scalable algorithms, one can also note the CURE algorithm, a hierarchical clustering algorithm, and the DBScan algorithm, in which the concept of a cluster is formulated in terms of point density.
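The density-based notion of a cluster can be illustrated with the DBSCAN implementation in scikit-learn: a cluster is a region of sufficiently high point density, and points in sparse regions are labelled as noise. The eps and min_samples values below are illustrative and must in practice be tuned to the data.

```python
# Density-based clustering: clusters are dense regions, sparse points are noise (-1).
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=7)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("points marked as noise:", list(labels).count(-1))
```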

The main disadvantage of the BIRCH, Clarans, CURE and DBScan algorithms is that they require certain point-density thresholds to be specified, which is not always acceptable. These limitations stem from the fact that the algorithms are aimed at very large databases and cannot afford to use large computational resources.

Many researchers are actively working on scalable methods, whose main task is to overcome the shortcomings of the algorithms that exist today.

One of the iterative classification methods that does not require the number of clusters to be specified is the method of searching for concentrations. The method requires the calculation of the distance matrix; then an object is selected to serve as the initial center of the first cluster. This object can be chosen arbitrarily, or the choice can be based on a preliminary analysis of the points and their neighborhoods.

The selected point is taken as the center of a hypersphere of a given radius R. The set of points falling inside this sphere is determined, and for them the coordinates of the center (the vector of feature means) are calculated. Next, a hypersphere of the same radius but with the new center is considered, the vector of means is again calculated for the points falling into it and taken as the new center of the sphere, and so on. When the next recalculation of the coordinates of the center leads to the same result as in the previous step, the movement of the sphere stops; the points that fall into it form a cluster and are excluded from the further clustering process. The procedure is then repeated for all remaining points.
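A minimal sketch of this procedure is given below, assuming numeric data and a radius R chosen by the analyst; the toy data set and R = 2 are illustrative. (The procedure is closely related to the mean-shift idea.)

```python
# Concentration (clump) search: move a fixed-radius hypersphere to the mean of
# the points it contains until it stops, then remove that cluster and repeat.
import numpy as np
from sklearn.datasets import make_blobs

def clump_search(X, R, max_iter=100):
    X = np.asarray(X, dtype=float)
    remaining = np.ones(len(X), dtype=bool)
    labels = np.full(len(X), -1)
    cluster = 0
    while remaining.any():
        center = X[remaining][0]                 # arbitrary starting object
        for _ in range(max_iter):
            inside = remaining & (np.linalg.norm(X - center, axis=1) <= R)
            new_center = X[inside].mean(axis=0)  # vector of feature means
            if np.allclose(new_center, center):  # the sphere has stopped moving
                break
            center = new_center
        labels[inside] = cluster                 # points in the sphere form a cluster
        remaining &= ~inside                     # and are excluded from further steps
        cluster += 1
    return labels

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=8)
labels = clump_search(X, R=2.0)
print("clusters found:", labels.max() + 1)
```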

Thus, non-hierarchical methods are more numerous, although they work on the same principles: in essence, they are iterative methods of partitioning the original population. New clusters are formed in the process of division until the stopping rule is satisfied. The methods differ from one another in the choice of the starting point, the rule for forming new clusters, and the stopping rule. The most commonly used algorithm is K-means.

Conclusion

Cluster analysis is a method of grouping objects into classes based on experimental data on the properties of objects.

In this case, a cluster model of object representation is used: objects with similar properties belong to the same class.

Cluster analysis includes a set of different classification algorithms (the dendrogram method can be cited as an example of a cluster analysis method).

In this case, as a rule, the number of classes and the principles of division into classes are determined in advance based on general information about the set of objects and the goals of cluster analysis.

Cluster analysis methods are complemented by discriminant analysis methods, which make it possible to determine the boundaries between clusters and to use those boundaries for data analysis and classification problems.
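A hedged sketch of this combination is given below: cluster labels obtained by K-means are treated as class labels, and a linear discriminant model is then fitted to describe the boundaries between the clusters and to classify new objects. The data set, k = 3 and the choice of linear discriminant analysis are illustrative assumptions.

```python
# Use cluster labels as classes for a discriminant model that describes
# the boundaries between clusters and classifies new objects.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, _ = make_blobs(n_samples=300, centers=3, random_state=9)
labels = KMeans(n_clusters=3, n_init=10, random_state=9).fit_predict(X)

lda = LinearDiscriminantAnalysis().fit(X, labels)    # boundaries between clusters
print("training accuracy of the discriminant rule:", round(lda.score(X, labels), 3))
print("class of a new object:", lda.predict([[0.0, 0.0]])[0])
```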

The results of cluster analysis are most often presented graphically, as a dendrogram ("tree") showing the order in which objects are combined into clusters. Interpreting the cluster structure, which in many cases begins with determining the number of clusters, is a creative task; to solve it effectively, the researcher must have sufficient information about the clustered objects. In addition, the results can be presented as lists of the objects assigned to each class.

The main advantages of cluster analysis are: the absence of restrictions on the distribution of the variables used in the analysis; the possibility of classification (clustering) even when there is no a priori information about the number and nature of the classes; and its universality (cluster analysis can be applied not only to collections of objects, but also to sets of variables or any other units of analysis).

We list the disadvantages of cluster analysis:

    Like factor analysis, it can produce unstable clusters. If the study is repeated on another sample and the classification results are compared, they will most likely differ; how much they differ is a question of the quality of the study itself.

    It implements an inductive approach, reasoning from the particular to the general, which is fraught with unscientific conclusions. Ideally, the sample for classification should be very large and heterogeneous, preferably drawn by stratification or randomization. Science advances by testing hypotheses, so cluster analysis should not be overused: it is best applied to test a hypothesis about the presence of certain types, rather than to create a classification from scratch.

    Like any multivariate method, cluster analysis has many internal settings: the criterion for grouping objects into clusters, the measure of differences, the number of steps before the k-means algorithm terminates, and so on. The results may therefore vary, albeit not substantially, depending on the "settings" of the procedure.

There are two groups of cluster analysis methods: hierarchical and non-hierarchical.

The main methods of hierarchical cluster analysis are the single connection (nearest neighbor) method, the full connection method, the average connection method, and the Ward method. The last is the most versatile.

There are more non-hierarchical methods, although they work on the same principles: in essence, they are iterative methods of partitioning the original population. New clusters are formed in the process of division until the stopping rule is satisfied. The methods differ from one another in the choice of the starting point, the rule for forming new clusters, and the stopping rule. The most commonly used algorithm is K-means, which requires the analyst to fix in advance the number of clusters in the resulting partition.
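A minimal sketch of K-means with the number of clusters fixed in advance by the analyst is given below; k = 3 and the toy data are illustrative.

```python
# K-means with the number of clusters specified a priori by the analyst.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=10)

km = KMeans(n_clusters=3, n_init=10, random_state=10).fit(X)   # k chosen in advance
print("cluster sizes:", [int((km.labels_ == c).sum()) for c in range(3)])
print("cluster centers:\n", km.cluster_centers_.round(2))
```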

Speaking about the choice of a specific clustering method, we emphasize once again that this process requires the analyst to be well acquainted with the nature and prerequisites of the methods; otherwise, the results will resemble the proverbial "average temperature in the hospital". To make sure that the chosen method is really effective in a given area, the following procedure is usually applied:

Several a priori different groups are taken and their representatives are randomly mixed. A clustering procedure is then carried out in order to restore the original grouping. The indicator of the effectiveness of the method is the proportion of coincidences between objects in the identified and the initial groups.
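A minimal sketch of this check is given below. Instead of a raw proportion of coincidences, the agreement is measured here with the adjusted Rand index, which sidesteps the problem of matching cluster labels to the original group labels; the toy data and the choice of K-means are illustrative assumptions.

```python
# Mix a priori known groups, re-cluster, and measure how well the original
# grouping is recovered.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Three a priori different groups with known membership.
X, true_groups = make_blobs(n_samples=300, centers=3, random_state=11)

# Randomly mix the representatives of the groups.
perm = np.random.default_rng(11).permutation(len(X))
X_mixed, true_mixed = X[perm], true_groups[perm]

# Try to restore the original partition by clustering.
recovered = KMeans(n_clusters=3, n_init=10, random_state=11).fit_predict(X_mixed)

print("recovery of the a priori groups (ARI):",
      round(adjusted_rand_score(true_mixed, recovered), 3))
```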

When choosing between hierarchical and non-hierarchical methods, you should pay attention to the following points:

Non-hierarchical methods show higher robustness to outliers, to an incorrect choice of metric, to the inclusion of insignificant variables in the clustering base, and so on. But the price for this is the word "a priori": the researcher must fix in advance the resulting number of clusters, the stopping rule and, if there are grounds for it, the initial cluster centers; the last point significantly affects the efficiency of the algorithm. If there is no reason to set these conditions artificially, it is generally recommended to use hierarchical methods. One more point is essential for both groups of algorithms: clustering all observations is not always the right solution. It may be more accurate to first clean the sample of outliers and then continue the analysis. It is also possible not to set the stopping criterion too high.

 
