Cluster analysis is a family of methods for dividing data into groups of objects with similar characteristics.

The object of study in applied statistics is statistical data obtained as a result of observations or experiments. Statistical data are a set of objects (observations, cases) and the features (variables) that characterize them. For example, the objects of study may be the countries of the world, and the features may be geographical and economic indicators characterizing them: continent; elevation above sea level; average annual temperature; the country's place in the quality-of-life ranking; share of GDP per capita; public spending on health care, education and the army; average life expectancy; unemployment rate; proportion of illiterates; quality-of-life index, etc.
Variables are quantities that, as a result of measurement, can take on different values.
Independent variables are variables whose values can be changed during the experiment, while dependent variables are variables whose values can only be measured.
Variables can be measured on various scales. The difference between the scales is determined by their information content. The following types of scales are considered, presented in ascending order of information content: nominal, ordinal, interval, ratio and absolute. These scales also differ from each other in the number of valid mathematical operations. The "poorest" scale is the nominal one, since no arithmetic operation is defined on it; the "richest" is the absolute scale.
Measurement in the nominal (classification) scale means determining whether an object (observation) belongs to a particular class. For example: gender, branch of service, profession, continent, etc. In this scale, one can only count the number of objects in classes - frequency and relative frequency.
Measurement on the ordinal (rank) scale, in addition to determining the class an observation belongs to, allows observations to be ordered by comparing them with each other in some respect. However, this scale does not determine the distance between classes, only which of two observations is preferable. Therefore, ordinal experimental data, even if represented by numbers, cannot be treated as numbers, and arithmetic operations cannot be performed on them. On this scale, in addition to calculating the frequency of an object, one can calculate the rank of the object. Examples of variables measured on an ordinal scale: student grades, prizes in competitions, military ranks, a country's place in a quality-of-life ranking, etc. Sometimes nominal and ordinal variables are called categorical, or grouping, variables, as they allow the division of research objects into subgroups.
When measuring on an interval scale, the ordering of observations can be done so precisely that the distances between any two of them are known. The interval scale is unique up to linear transformations (y = ax + b). This means that the scale has an arbitrary reference point, a conventional zero. Examples of variables measured on an interval scale: temperature, time, elevation above sea level. The distance between observations on such a scale can be determined; distances are real numbers, and any arithmetic operations can be performed on them.
The ratio scale is similar to the interval scale, but it is unique up to a transformation of the form y = ax. This means that the scale has a fixed reference point, an absolute zero, but an arbitrary unit of measurement. Examples of variables measured on a ratio scale are length, weight, current, amount of money, society's spending on health care, education and the military, life expectancy, and so on. Measurements on this scale are real numbers, and any arithmetic operations can be performed on them.
An absolute scale has both an absolute zero and an absolute unit of measurement (scale). An example of an absolute scale is the number line. This scale is dimensionless, so measurements in it can be used as an exponent or base of a logarithm. Examples of measurements in an absolute scale: unemployment rate; proportion of illiterates, quality of life index, etc.
Most statistical methods are methods of parametric statistics based on the assumption that the random vector of variables follows some multivariate distribution, usually normal or transformable to normal. If this assumption is not confirmed, nonparametric methods of mathematical statistics should be used.

Correlation analysis. There may be a functional relationship between variables (random variables), manifested in the fact that one of them is defined as a function of the other. But there can also be a connection of another kind, manifested in the fact that one of them reacts to a change in the other by changing its distribution law. Such a relationship is called stochastic. It appears when there are common random factors that affect both variables. The correlation coefficient (r), which varies from -1 to +1, is used as a measure of dependence between variables. If the correlation coefficient is negative, this means that as the values of one variable increase, the values of the other decrease. If the variables are independent, the correlation coefficient is 0 (the converse is true only for variables that have a normal distribution); in this case the variables are called uncorrelated. If the correlation coefficient is not equal to 0, there is a relationship between the variables, and the closer the value of r is to 1 in absolute value, the stronger the dependence. The correlation coefficient reaches its extreme values of +1 or -1 if and only if the relationship between the variables is linear. Correlation analysis makes it possible to establish the strength and direction of the stochastic relationship between variables (random variables). If the variables are measured at least on an interval scale and have a normal distribution, correlation analysis is performed by calculating the Pearson correlation coefficient; otherwise Spearman, Kendall's tau, or Gamma correlations are used.
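A minimal sketch of the distinction just described, using SciPy on simulated data (the variables and numbers are invented for illustration): the Pearson coefficient assumes interval-scale, normally distributed variables, while the rank-based Spearman and Kendall coefficients do not.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(scale=0.5, size=200)   # linearly related with noise

r_pearson, p_pearson = stats.pearsonr(x, y)     # parametric: assumes normality
rho, p_spearman = stats.spearmanr(x, y)         # rank-based alternative
tau, p_tau = stats.kendalltau(x, y)             # another rank-based alternative

print(f"Pearson r = {r_pearson:.2f}, Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```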

Regression analysis. Regression analysis models the relationship of one random variable with one or more other random variables. In this case, the first variable is called dependent, and the rest - independent. The choice or assignment of dependent and independent variables is arbitrary (conditional) and is carried out by the researcher depending on the problem he is solving. The independent variables are called factors, regressors, or predictors, and the dependent variable is called the outcome feature, or response.
If the number of predictors is 1, the regression is called simple, or univariate; if the number of predictors is greater than 1, it is called multiple, or multifactorial. In the general case, the regression model can be written as follows:

Y = f(x1, x2, ..., xn),

where Y is the dependent variable (response), xi (i = 1, ..., n) are the predictors (factors), and n is the number of predictors.
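A minimal sketch of a multiple regression of the form above, fitted with scikit-learn on simulated data (the coefficients and sample are invented for illustration); it shows how the estimated effect of each factor and a prediction for new factor values can be obtained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                      # three predictors x1, x2, x3
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.3, size=100)  # x3 is irrelevant

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_)                           # effect of each factor
print("prediction for new factor values:", model.predict([[0.5, -1.0, 0.2]]))
```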
Regression analysis makes it possible to solve a number of tasks that are important for the problem under study:
1) Reducing the dimension of the space of analyzed variables (the factor space) by replacing some of the factors with one variable, the response. This problem is solved more fully by factor analysis.
2) Quantifying the effect of each factor: multiple regression allows the researcher to ask (and likely answer) the question of what the best predictor is for a given response. At the same time, the influence of individual factors on the response becomes clearer, and the researcher better understands the nature of the phenomenon under study.
3) Calculating predicted response values for given factor values: regression analysis creates the basis for a computational experiment aimed at answering questions like "What will happen if ...".
4) In regression analysis, the causal mechanism appears in a more explicit form, and the forecast then lends itself better to meaningful interpretation.

Canonical analysis. Canonical analysis is designed to analyze dependencies between two lists of features (independent variables) that characterize objects. For example, one can study the relationship between various adverse factors and the appearance of a certain group of symptoms of a disease, or the relationship between two groups of clinical and laboratory parameters (syndromes) of a patient. Canonical analysis is a generalization of multiple correlation as a measure of the relationship between one variable and many other variables. As is known, multiple correlation is the maximum correlation between one variable and a linear function of other variables. This concept has been generalized to the case of a connection between sets of variables, i.e. the features that characterize objects. In this case, it suffices to confine ourselves to a small number of the most correlated linear combinations from each set. Let, for example, the first set consist of the features y1, ..., yp and the second set of x1, ..., xq; then the relationship between these sets can be estimated as the correlation between the linear combinations a1y1 + a2y2 + ... + apyp and b1x1 + b2x2 + ... + bqxq, which is called the canonical correlation. The task of canonical analysis is to find the weight coefficients such that the canonical correlation is maximal.
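A minimal sketch of the idea, using scikit-learn's CCA on simulated data (the two feature sets and the shared latent signal are invented for illustration): the first pair of canonical variates is extracted and their correlation, the canonical correlation, is reported.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(7)
common = rng.normal(size=(200, 1))                       # a shared latent signal
Y_set = np.hstack([common + 0.5 * rng.normal(size=(200, 1)) for _ in range(3)])  # y1..y3
X_set = np.hstack([common + 0.5 * rng.normal(size=(200, 1)) for _ in range(4)])  # x1..x4

cca = CCA(n_components=1).fit(X_set, Y_set)
u, v = cca.transform(X_set, Y_set)                       # the two weighted linear combinations
print("first canonical correlation:", round(float(np.corrcoef(u[:, 0], v[:, 0])[0, 1]), 2))
```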

Methods for comparing means. In applied research, there are often cases when the mean result of some feature in one series of experiments differs from the mean result in another series. Since the means are the results of measurements, they will, as a rule, always differ; the question is whether the observed discrepancy between the means can be explained by the inevitable random errors of the experiment or whether it is caused by certain factors. If two means are being compared, Student's test (t-test) can be applied. This is a parametric test, since it is assumed that the feature has a normal distribution in each series of experiments. Nonparametric tests for comparing means are also widely used today.
Comparison of average results is one of the ways to identify dependencies between variable features that characterize the studied set of objects (observations). If, when dividing the objects of study into subgroups using a categorical independent variable (predictor), the hypothesis about the inequality of the means of some dependent variable in subgroups is true, then this means that there is a stochastic relationship between this dependent variable and the categorical predictor. So, for example, if it is established that the hypothesis about the equality of the average indicators of the physical and intellectual development of children in the groups of mothers who smoked and did not smoke during pregnancy is incorrect, then this means that there is a relationship between the child's mother's smoking during pregnancy and his intellectual and physical development.
The most common method for comparing means is analysis of variance. In ANOVA terminology, a categorical predictor is called a factor.
Analysis of variance can be defined as a parametric statistical method designed to assess the influence of various factors on the result of an experiment, as well as for subsequent planning of experiments. It makes it possible to investigate the dependence of a quantitative feature on one or more qualitative features (factors). If one factor is considered, one-way analysis of variance is used; otherwise, multi-way (multifactorial) analysis of variance is used.
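A minimal sketch with SciPy on simulated samples (group means and sizes are invented for illustration): a two-sample t-test, a one-way ANOVA for three groups, and a nonparametric alternative for comparison.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=11.0, scale=2.0, size=30)
group_c = rng.normal(loc=12.0, scale=2.0, size=30)

t_stat, p_t = stats.ttest_ind(group_a, group_b)          # parametric comparison of two means
f_stat, p_f = stats.f_oneway(group_a, group_b, group_c)  # one-way analysis of variance
u_stat, p_u = stats.mannwhitneyu(group_a, group_b)       # a nonparametric alternative

print(f"t-test p = {p_t:.3f}, ANOVA p = {p_f:.3f}, Mann-Whitney p = {p_u:.3f}")
```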

Frequency analysis. Frequency tables, also called single-entry tables, are the simplest method for analyzing categorical variables. Frequency tables can also be used successfully to study quantitative variables, although this can lead to difficulties in interpreting the results. This type of statistical analysis is often used as one of the exploratory procedures to see how different groups of observations are distributed in the sample, or how the values of a feature are distributed over the interval from the minimum to the maximum value. As a rule, frequency tables are illustrated graphically using histograms.

Crosstabulation (cross-classification) is the process of combining two (or more) frequency tables so that each cell in the constructed table is represented by a single combination of values or levels of the tabulated variables. Crosstabulation makes it possible to combine the frequencies of occurrence of observations at different levels of the factors under consideration. By examining these frequencies, it is possible to identify relationships between the tabulated variables and explore the structure of this relationship. Typically, categorical or scale variables with relatively few values are tabulated. If a continuous variable is to be tabulated (say, blood sugar), it should first be recoded by dividing the range of variation into a small number of intervals (e.g., level: low, medium, high).
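A minimal sketch with pandas (the records, cut points and labels are invented for illustration): a single-entry frequency table, recoding of a continuous variable into levels, and a crosstabulation of the two categorical variables.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["f", "m", "f", "f", "m", "m", "f", "m"],
    "blood_sugar": [4.8, 6.9, 5.2, 7.4, 5.0, 6.1, 4.5, 8.2],
})

print(df["gender"].value_counts())                 # single-entry frequency table

# Recode the continuous variable into a small number of levels, then crosstabulate.
df["sugar_level"] = pd.cut(df["blood_sugar"], bins=[0, 5.5, 7.0, 100],
                           labels=["low", "medium", "high"])
print(pd.crosstab(df["gender"], df["sugar_level"]))
```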

Correspondence analysis. Compared to frequency analysis, correspondence analysis provides more powerful descriptive and exploratory methods for analyzing two-way and multi-way tables. Like contingency tables, the method allows one to explore the structure and relationship of the grouping variables included in the table. In classical correspondence analysis, the frequencies in the contingency table are standardized (normalized) so that the sum of the elements in all cells is equal to 1.
One of the goals of the correspondence analysis is to represent the contents of the table of relative frequencies in the form of distances between individual rows and/or columns of the table in a lower dimensional space.

Cluster analysis. Cluster analysis is a classification method; its main purpose is to divide the set of objects and features under study into groups, or clusters, that are homogeneous in a certain sense. It is a multivariate statistical method, so it is assumed that the initial data can be of significant volume, i.e. both the number of objects of study (observations) and the number of features characterizing these objects can be large. The great advantage of cluster analysis is that it partitions objects not by one attribute but by a set of attributes. In addition, cluster analysis, unlike most mathematical and statistical methods, does not impose any restrictions on the type of objects under consideration and allows one to explore initial data of an almost arbitrary nature. Since clusters are groups of homogeneous objects, the task of cluster analysis is to divide the set of objects into m (m an integer) clusters on the basis of the objects' features so that each object belongs to only one group of the partition. Objects belonging to the same cluster must be homogeneous (similar), and objects belonging to different clusters must be heterogeneous. If the clustered objects are represented as points in the n-dimensional feature space (n is the number of features that characterize the objects), the similarity between objects is defined through the concept of distance between points, since it is intuitively clear that the smaller the distance between objects, the more similar they are.
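A minimal sketch of distance-based clustering with scikit-learn on synthetic data (the two groups and four features are invented for illustration); the features are standardized first because Euclidean distances are sensitive to the scale of each feature, and the number of clusters m is chosen by the analyst.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=0, size=(50, 4)),
               rng.normal(loc=5, size=(50, 4))])   # 100 objects, 4 features

X_std = StandardScaler().fit_transform(X)          # put all features on one scale
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)
print("cluster sizes:", np.bincount(labels))
```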

Discriminant analysis. Discriminant analysis includes statistical methods for classifying multivariate observations in situations where the researcher has so-called training samples. This type of analysis is multivariate, since it uses several features of the object, the number of which can be arbitrarily large. The purpose of discriminant analysis is to classify an object on the basis of the measurement of various characteristics (features), i.e. to assign it to one of several specified groups (classes) in some optimal way. It is assumed that the initial data, along with the features of the objects, contain a categorical (grouping) variable that determines whether an object belongs to a particular group. Discriminant analysis therefore includes checking the consistency of the classification produced by the method with the original empirical classification. The optimal method is understood as either the minimum of the expected losses or the minimum of the probability of misclassification. In the general case, the problem of discrimination is formulated as follows. Let the result of observing an object be a k-dimensional random vector X = (X1, X2, ..., Xk), where X1, X2, ..., Xk are the features of the object. It is required to establish a rule according to which, based on the values of the coordinates of the vector X, the object is assigned to one of the possible populations i, i = 1, 2, ..., n. Discrimination methods can be conditionally divided into parametric and nonparametric. In parametric methods it is assumed that the distribution of the feature vectors in each population is normal, but there is no information about the parameters of these distributions. Nonparametric discrimination methods do not require knowledge of the exact functional form of the distributions and allow discrimination problems to be solved on the basis of scant a priori information about the populations, which is especially valuable for practical applications. If the conditions for the applicability of discriminant analysis are met (the independent variables, also called predictors, are measured at least on an interval scale and their distribution corresponds to the normal law), classical discriminant analysis should be used; otherwise, the method of general discriminant analysis models should be used.
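A minimal sketch of classical linear discriminant analysis with scikit-learn, using the iris data shipped with the library as a stand-in training sample; the held-out accuracy plays the role of the check against the original empirical classification.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                  # features and the grouping variable
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
print("agreement with the empirical classification:", lda.score(X_test, y_test))
```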

Factor analysis. Factor analysis is one of the most popular multivariate statistical methods. Whereas cluster and discriminant methods classify observations by dividing them into homogeneous groups, factor analysis classifies the features (variables) that describe the observations. The main goal of factor analysis is therefore to reduce the number of variables by classifying the variables and determining the structure of the relationships between them. The reduction is achieved by extracting hidden (latent) common factors that explain the relationships between the observed features of the object; instead of the initial set of variables, it becomes possible to analyze data on the extracted factors, whose number is much smaller than the initial number of interrelated variables.
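A minimal sketch with scikit-learn on simulated data (the six observed variables are generated from two hidden factors purely for illustration): the estimated loadings show how each observed variable relates to the extracted latent factors.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(4)
latent = rng.normal(size=(300, 2))                        # two hidden common factors
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + 0.3 * rng.normal(size=(300, 6))   # six observed variables

fa = FactorAnalysis(n_components=2).fit(X)
print("estimated loadings (variables x factors):")
print(fa.components_.T.round(2))
```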

Classification trees. Classification trees are a classification method that makes it possible to predict the membership of objects in a particular class depending on the values of the features that characterize the objects. The features are called independent variables, and the variable indicating the class membership of objects is called dependent. Unlike classical discriminant analysis, classification trees can perform one-dimensional branching on variables of various types: categorical, ordinal, interval. No restrictions are imposed on the distribution law of quantitative variables. By analogy with discriminant analysis, the method makes it possible to analyze the contributions of individual variables to the classification procedure. Classification trees can be, and sometimes are, very complex. However, the use of special graphical procedures makes it possible to simplify the interpretation of results even for very complex trees. The possibility of graphical presentation of results and ease of interpretation largely explain the great popularity of classification trees in applied fields; however, the most important distinguishing properties of classification trees are their hierarchical structure and wide applicability. The structure of the method is such that, using controllable parameters, the user can build trees of arbitrary complexity, achieving minimal classification errors. But with a complex tree, because of the large set of decision rules, it is difficult to classify a new object. Therefore, when constructing a classification tree, the user must find a reasonable compromise between the complexity of the tree and the complexity of the classification procedure. The wide applicability of classification trees makes them a very attractive data analysis tool, but it should not be assumed that they are recommended instead of traditional classification methods. On the contrary, if the more stringent theoretical assumptions imposed by traditional methods are met and the sampling distribution has certain special properties (for example, the distribution of the variables corresponds to the normal law), the use of traditional methods will be more effective. However, as a method of exploratory analysis, or as a last resort when all traditional methods fail, classification trees, according to many researchers, are unmatched.
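A minimal sketch with scikit-learn (again on the bundled iris data): the depth cap is the controllable parameter that trades tree complexity against classification error, and the printed rules illustrate the interpretable hierarchical structure.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))                     # the decision rules, one branch per line
print("training accuracy:", tree.score(X, y))
```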

Principal component analysis and classification. In practice, the problem of analyzing high-dimensional data often arises. The method of principal component analysis and classification allows solving this problem and serves to achieve two goals:
– reduction of the total number of variables (data reduction) in order to obtain "principal" and "uncorrelated" variables;
– classification of variables and observations using the constructed factor space.
The method is similar to factor analysis in the formulation of the tasks being solved, but has a number of significant differences:
– principal component analysis does not use iterative methods to extract factors;
– along with the active variables and observations used to extract the principal components, auxiliary variables and/or observations can be specified; then the auxiliary variables and observations are projected onto the factor space computed from the active variables and observations;
- the listed possibilities allow using the method as a powerful tool for classifying both variables and observations.
The solution of the main problem of the method is achieved by creating a vector space of latent (hidden) variables (factors) with a dimension less than the original one. The initial dimension is determined by the number of variables for analysis in the source data.
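A minimal sketch with scikit-learn on simulated data (six correlated variables generated from two true dimensions, purely for illustration): the components give two uncorrelated latent variables, and transform() can later project auxiliary observations onto the same factor space.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
base = rng.normal(size=(200, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 4))])   # six variables, two true dimensions

pca = PCA(n_components=2)
scores = pca.fit_transform(StandardScaler().fit_transform(X))   # factor space of the active data
print("explained variance ratio:", pca.explained_variance_ratio_.round(2))
```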

Multidimensional scaling. The method can be viewed as an alternative to factor analysis, which achieves a reduction in the number of variables by extracting latent (not directly observable) factors that explain the relationships between the observed variables. The purpose of multidimensional scaling is to find and interpret latent variables that make it possible to explain the similarities between objects, which are given as points in the original feature space. In practice, the indicators of the similarity of objects can be distances or degrees of connection between them. In factor analysis, similarities between variables are expressed by a matrix of correlation coefficients. In multidimensional scaling, an arbitrary type of object similarity matrix can be used as input data: distances, correlations, etc. Although there are many similarities in the nature of the questions studied, the methods of multidimensional scaling and factor analysis have a number of significant differences. Thus, factor analysis requires that the data under study obey a multivariate normal distribution and that the dependencies be linear. Multidimensional scaling imposes no such restrictions; it can be applied whenever a matrix of pairwise similarities of objects is given. In terms of differences in outcomes, factor analysis tends to extract more latent variables than multidimensional scaling, so multidimensional scaling often leads to solutions that are easier to interpret. More importantly, however, multidimensional scaling can be applied to any type of distance or similarity, while factor analysis requires that a correlation matrix of variables be used as input or that a correlation matrix first be computed from the input data file. The main assumption of multidimensional scaling is that there is some metric space of essential basic characteristics which implicitly served as the basis for the obtained empirical data on the proximity between pairs of objects. Therefore, objects can be represented as points in this space. It is also assumed that closer (according to the initial matrix) objects correspond to smaller distances in the space of basic characteristics. Multidimensional scaling is thus a set of methods for analyzing empirical data on the proximity of objects, with the help of which the dimension of the space of characteristics of the measured objects that are essential for a given substantive task is determined and a configuration of points (objects) in this space is constructed. This space (the "multidimensional scale") is similar to commonly used scales in the sense that the values of the essential characteristics of the measured objects correspond to certain positions on the axes of the space. The logic of multidimensional scaling can be illustrated with the following simple example. Suppose there is a matrix of pairwise distances (i.e. similarities of some features) between some cities. Analyzing the matrix, it is necessary to place points representing the cities in two-dimensional space (on a plane), preserving the real distances between them as far as possible. The resulting placement of points on the plane can later be used as an approximate geographic map. In the general case, multidimensional scaling allows objects (cities in our example) to be located in a space of some small dimension (here it is equal to two) in such a way as to reproduce the observed distances between them adequately.
As a result, these distances can be expressed in terms of the latent variables found. In our example, the distances can be explained in terms of a pair of geographic coordinates: North/South and East/West.
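A minimal sketch of the "map from distances" example with scikit-learn (the city names and the distance matrix are invented for illustration): metric MDS takes a precomputed matrix of pairwise distances and returns a 2-D configuration of points that approximately reproduces those distances.

```python
import numpy as np
from sklearn.manifold import MDS

cities = ["A", "B", "C", "D"]
distances = np.array([[0, 10, 20, 30],
                      [10, 0, 15, 25],
                      [20, 15, 0, 12],
                      [30, 25, 12, 0]], dtype=float)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(distances)           # an approximate "map" of the four cities
for name, (x, y) in zip(cities, coords):
    print(f"{name}: ({x:.1f}, {y:.1f})")
```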

Structural equation modeling (causal modeling). Recent advances in multivariate statistical analysis and the analysis of correlation structures, combined with the latest computational algorithms, served as the starting point for the creation of a new but already recognized technique of structural equation modeling (SEPATH). This extraordinarily powerful technique of multivariate analysis includes methods from various fields of statistics; multiple regression and factor analysis are naturally developed and combined here.
The object of modeling structural equations are complex systems, the internal structure of which is not known ("black box"). By observing system parameters using SEPATH, you can explore its structure, establish cause-and-effect relationships between system elements.
The statement of the structural modeling problem is as follows. Let there be variables whose statistical moments are known, for example, a matrix of sample correlation or covariance coefficients. Such variables are called explicit (manifest). They can be characteristics of a complex system. The real relationships between the observed explicit variables can be quite complex, but we assume that there is a number of hidden variables that explain the structure of these relationships with a certain degree of accuracy. Thus, with the help of latent variables, a model of relationships between explicit and implicit variables is built. In some problems, the latent variables can be considered as causes and the explicit ones as consequences; such models are therefore called causal. It is assumed that the hidden variables, in turn, can be related to each other. The structure of connections is assumed to be quite complex, but its type is postulated: these are connections described by linear equations. Some parameters of the linear models are known, while others are not and are free parameters.
The main idea of structural equation modeling is that one can check whether the variables Y and X are related by a linear relationship Y = aX by analyzing their variances and covariances. This idea is based on a simple property of the mean and variance: if each number is multiplied by some constant k, the mean is also multiplied by k, and the standard deviation is multiplied by the modulus of k. For example, consider the set of three numbers 1, 2, 3. These numbers have a mean of 2 and a standard deviation of 1. If all three numbers are multiplied by 4, it is easy to calculate that the mean becomes 8, the standard deviation 4, and the variance 16. Thus, if there are sets of numbers X and Y related by the dependence Y = 4X, the variance of Y must be 16 times greater than the variance of X. Therefore, we can test the hypothesis that Y and X are related by the equation Y = 4X by comparing the variances of the variables Y and X. This idea can be generalized in various ways to several variables related by a system of linear equations. In that case the transformation rules become more cumbersome and the calculations more complex, but the main idea remains the same: one can check whether variables are linearly related by studying their variances and covariances.
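A minimal numeric check of the idea just described, not a full structural equation model: with the same three numbers as in the text, multiplying by 4 multiplies the mean by 4, the sample standard deviation by 4, and the variance by 16.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = 4.0 * x                                    # the postulated linear relation Y = 4X

print(np.mean(x), np.std(x, ddof=1))           # mean 2.0, sample standard deviation 1.0
print(np.mean(y), np.std(y, ddof=1))           # mean 8.0, sample standard deviation 4.0
print(np.var(y, ddof=1) / np.var(x, ddof=1))   # variance ratio 16.0, consistent with Y = 4X
```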

Survival analysis methods. Survival analysis methods were originally developed in medical and biological research and in insurance, but then became widely used in the social and economic sciences, as well as in industry in engineering problems (analysis of reliability and failure times). Imagine that a new treatment or drug is being studied. Obviously, the most important and objective characteristic is the average life expectancy of patients from the moment of admission to the clinic, or the average duration of remission of the disease. Standard parametric and nonparametric methods could be used to describe mean lifetimes or remissions. However, the analyzed data have an important peculiarity: there may be patients who survived during the entire observation period, and in some of them the disease is still in remission. There may also be a group of patients with whom contact was lost before the completion of the study (for example, they were transferred to other clinics). Using standard methods for estimating the mean, this group of patients would have to be excluded, thereby losing important information that was collected with difficulty. In addition, most of these patients survived (recovered) during the time they were observed, which is evidence in favor of the new method of treatment (drug). This kind of information, when there are no data on the occurrence of the event of interest, is called incomplete. If there are data on the occurrence of the event of interest, the information is called complete. Observations that contain incomplete information are called censored observations. Censored observations are typical when the observed value represents the time until some critical event occurs and the duration of observation is limited in time. The use of censored observations is the specific feature of the method under consideration, survival analysis. This method investigates the probabilistic characteristics of the time intervals between successive occurrences of critical events. This kind of research is called analysis of durations until termination, which can be defined as the time intervals between the start of observation of the object and the moment of termination, at which the object ceases to satisfy the properties specified for observation. The purpose of the research is to determine the conditional probabilities associated with the durations until termination. The construction of life tables, the fitting of survival distributions, and the estimation of the survival function using the Kaplan-Meier procedure are descriptive methods for studying censored data. Some of the available methods allow survival to be compared in two or more groups. Finally, survival analysis contains regression models for evaluating the relationships between multivariate continuous variables and values of the lifetime type.
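A minimal sketch assuming the third-party lifelines package is installed (the durations and censoring flags are invented for illustration): a Kaplan-Meier estimate of the survival function that keeps the censored observations instead of discarding them.

```python
from lifelines import KaplanMeierFitter

durations = [5, 8, 12, 12, 15, 21, 27, 30, 34, 40]   # months of follow-up (invented)
observed = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]            # 1 = event occurred, 0 = censored

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=observed)
print(kmf.survival_function_)                        # estimated S(t) at the observed times
```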
General discriminant analysis models. If the conditions for the applicability of discriminant analysis (DA) are not met (the independent variables, or predictors, must be measured at least on an interval scale and their distribution must correspond to the normal law), the method of general discriminant analysis (GDA) models should be used. The method is so named because it uses the general linear model (GLM) to analyze the discriminant functions. In this module, discriminant function analysis is treated as a general multivariate linear model in which the categorical dependent variable (response) is represented by vectors with codes denoting the different groups for each observation. The GDA method has a number of significant advantages over classical discriminant analysis. For example, there are no restrictions on the type of predictor used (categorical or continuous) or on the type of model being defined; stepwise selection of predictors and selection of the best subset of predictors are possible; and if there is a cross-validation sample in the data file, the selection of the best subset of predictors can be based on the misclassification rates for the cross-validation sample, etc.

Time series. Time series analysis is one of the most intensively developing and promising areas of mathematical statistics. A time (dynamic) series is a sequence of observations of a certain feature X (random variable) at successive, equally spaced moments t. The individual observations are called the levels of the series and are denoted xt, t = 1, ..., n. When studying a time series, several components are distinguished:
xt = ut + yt + ct + et, t = 1, ..., n,
where ut is the trend, a smoothly changing component that describes the net impact of long-term factors (population decline, income decline, etc.); yt is the seasonal component, reflecting the recurrence of processes over not very long periods (a day, a week, a month, etc.); ct is the cyclical component, reflecting the recurrence of processes over periods longer than one year; et is the random component, reflecting the influence of random factors that cannot be accounted for and registered. The first three components are deterministic. The random component is formed as a result of the superposition of a large number of external factors, each of which individually has only a slight influence on the change in the values of the feature X. Analysis and study of a time series make it possible to build models for predicting the values of the feature X for the future, if the sequence of observations in the past is known.
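A minimal sketch with statsmodels on a simulated monthly series (trend, seasonal and random parts are generated artificially for illustration): an additive decomposition in the spirit of the formula above.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(6)
t = np.arange(72)
series = pd.Series(0.5 * t                                   # trend u_t
                   + 5.0 * np.sin(2 * np.pi * t / 12)        # seasonal y_t (period 12)
                   + rng.normal(scale=1.0, size=72),         # random e_t
                   index=pd.date_range("2015-01-01", periods=72, freq="MS"))

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())      # estimated trend component
print(result.seasonal.head(12))          # estimated seasonal pattern over one year
```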

Neural networks. A neural network is a computing system whose architecture is analogous to the construction of nervous tissue from neurons. The neurons of the lowest layer are supplied with the values of the input parameters on the basis of which certain decisions must be made. For example, in accordance with the values of a patient's clinical and laboratory parameters, it is necessary to assign the patient to one or another group according to the severity of the disease. These values are perceived by the network as signals that are transmitted to the next layer, weakened or strengthened depending on the numerical values (weights) assigned to the interneuron connections. As a result, a certain value is generated at the output of the neuron of the top layer, which is considered the response of the entire network to the input parameters. For the network to work, it must be trained on data for which the values of the input parameters and the correct responses to them are known. Training consists in selecting the weights of the interneuron connections that provide responses as close as possible to the known correct answers. Neural networks can be used to classify observations.
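A minimal sketch with scikit-learn (the bundled iris data stands in for clinical measurements, purely for illustration): a small multilayer perceptron is trained on cases with known group labels and then classifies unseen cases.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0))
net.fit(X_train, y_train)                    # "training" = fitting the connection weights
print("accuracy on unseen cases:", net.score(X_test, y_test))
```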

Experiment planning. The art of arranging observations in a particular order, or of carrying out specially planned checks in order to make full use of the possibilities of these methods, constitutes the content of the subject of experiment planning. Experimental methods are now widely used both in science and in various fields of practical activity. Usually, the main goal of scientific research is to show the statistical significance of the effect of a particular factor on the dependent variable under study. As a rule, the main goal of experiment planning is to extract the maximum amount of objective information about the influence of the factors under study on the indicator (dependent variable) of interest to the researcher using the smallest number of expensive observations. Unfortunately, in practice, insufficient attention is paid to research planning in most cases: data are collected (as much as can be collected), and only then are statistical processing and analysis carried out. But properly conducted statistical analysis alone is not sufficient to achieve scientific validity, since the quality of any information obtained from data analysis depends on the quality of the data itself. Therefore, the design of experiments is increasingly used in applied research. The purpose of experiment planning methods is to study the influence of certain factors on the process under study and to search for the optimal levels of the factors that determine the required course of this process.

Quality control charts. In the modern world, the problem of the quality not only of manufactured products but also of the services provided to the population is extremely relevant. The well-being of any firm, organization or institution largely depends on the successful solution of this important problem. The quality of products and services is formed in the process of scientific research and design and technological development, and is ensured by good organization of production and services. But the manufacture of products and the provision of services, regardless of their type, is always associated with a certain variability in the conditions of production and provision. This leads to some variability in their quality characteristics. It is therefore important to develop quality control methods that allow timely detection of signs of a violation of the technological process or of service provision. At the same time, in order to achieve and maintain a high level of quality that satisfies the consumer, methods are needed that are aimed not at eliminating defects in finished products and inconsistencies in services, but at preventing and predicting the causes of their occurrence. A control chart is a tool that makes it possible to track the progress of a process and influence it (using appropriate feedback), preventing it from deviating from the requirements imposed on the process. Quality control charts make extensive use of statistical methods based on probability theory and mathematical statistics. The use of statistical methods makes it possible, with limited volumes of analyzed products, to judge the state of product quality with a given degree of accuracy and reliability. They provide forecasting, optimal management of quality problems, and the making of correct management decisions not on the basis of intuition but with the help of scientific study and identification of patterns in the accumulated arrays of numerical information.


Revolutionary discoveries in natural science were often made under the influence of the results of experiments set up by talented experimenters. Great experiments in biology, chemistry, and physics contributed to changing our ideas about the world in which we live, the structure of matter, and the mechanisms of the transmission of heredity. The results of these great experiments led to further theoretical and technological discoveries.

§ 9. Theoretical methods of research

Lesson-lecture

There are things in the world more important
than the most beautiful discoveries:
the knowledge of the methods by which they were made.

Leibniz

Method. Classification. Systematization. Systematics. Induction. Deduction.

Observation and description of physical phenomena. Physical laws. (Physics, grades 7-9).

What is a method. In science, a method is a way of constructing knowledge, a form of the practical and theoretical mastering of reality. Francis Bacon compared the method to a lamp that lights the way for a traveler in the dark: "Even a lame man walking on a road outpaces one who walks without a road." The right method should be clear, logical, lead to a specific goal, and give a result. The doctrine of the system of methods is called methodology.

The methods of cognition used in scientific activity are the empirical (practical, experimental) methods: observation and experiment, and the theoretical (logical, rational) methods: analysis, synthesis, comparison, classification, systematization, abstraction, generalization, modeling, induction and deduction. In real scientific cognition these methods are always used in unity. For example, when developing an experiment, a preliminary theoretical understanding of the problem and the formulation of a research hypothesis are required, and after the experiment it is necessary to process the results using mathematical methods. Let us consider the features of some theoretical methods of cognition.

Classification and systematization. Classification allows you to organize the material under study by grouping the set (class) of the objects under study into subsets (subclasses) in accordance with the selected feature.

For example, all students of a school can be divided into the subclasses "girls" and "boys". One can also choose another feature, such as height. In this case, the classification can be carried out in different ways: for example, select a height limit of 160 cm and classify students into the subclasses "short" and "tall", or divide the height scale into segments of 10 cm, in which case the classification will be more detailed. If the results of such a classification are compared over several years, this makes it possible to establish empirically trends in the physical development of students. Consequently, classification as a method can be used to obtain new knowledge and can even serve as a basis for constructing new scientific theories.

In science, classifications of the same objects according to different criteria are usually used, depending on the goals. However, only one feature (the basis of the classification) is chosen at a time. For example, chemists subdivide the class "acids" into subclasses by the degree of dissociation (strong and weak), by the presence of oxygen (oxygen-containing and oxygen-free), by physical properties (volatile and non-volatile; soluble and insoluble), and by other characteristics.

The classification may change in the course of the development of science.

In the middle of the 20th century, the study of various nuclear reactions led to the discovery of elementary (indivisible) particles. Initially, they were classified by mass, which is how leptons (light), mesons (intermediate), baryons (heavy) and hyperons (superheavy) appeared. The further development of physics showed that classification by mass has little physical meaning, but the terms have been preserved, with the result that there are leptons much more massive than baryons.

Classification is conveniently reflected in the form of tables or diagrams (graphs). For example, the classification of the planets of the solar system, represented by a diagram - a graph, may look like this:

MAJOR PLANETS OF THE SOLAR SYSTEM
  Terrestrial planets: Mercury, Venus, Earth, Mars
  Giant planets: Jupiter, Saturn, Uranus, Neptune
  Pluto

Note that in this classification the planet Pluto represents a separate subclass and belongs neither to the terrestrial planets nor to the giant planets. Scientists note that Pluto is similar in its properties to an asteroid, of which there may be many on the periphery of the solar system.

In the study of complex natural systems, classification actually serves as the first step towards the construction of a natural scientific theory. The next, higher level is systematization (systematics). Systematization is carried out on the basis of the classification of a sufficiently large amount of material. At the same time, the most significant features are singled out, which make it possible to present the accumulated material as a system that reflects all the various relationships between objects. It is necessary in those cases where there is a variety of objects and the objects themselves are complex systems. The result of the systematization of scientific data is a systematics or, in other words, a taxonomy. Systematics as a field of science developed in such areas of knowledge as biology, geology, linguistics, and ethnography.

The unit of a taxonomy is called a taxon. In biology, taxa are, for example, phylum, class, order, family, genus, species, etc. Taxa of different ranks are combined into a single system according to the hierarchical principle. Such a system includes a description of all existing and previously extinct organisms and establishes the paths of their evolution. If scientists discover a new species, they must confirm its place in the overall system. Changes can also be made to the system itself, which remains developing and dynamic. Systematics makes it easy to navigate the whole variety of organisms: about 1.5 million species of animals alone are known, and more than 500 thousand species of plants, not counting other groups of organisms. Modern biological systematics reflects Saint-Hilaire's law: "All the diversity of life forms constitutes a natural taxonomic system consisting of hierarchical groups of taxa of various ranks."

Induction and deduction. The path of cognition in which, on the basis of systematization of the accumulated information, one proceeds from the particular to the general and draws a conclusion about an existing regularity is called induction. This method of studying nature was developed by the English philosopher F. Bacon. He wrote: "It is necessary to take as many cases as possible - both those where the phenomenon under study is present and those where it is absent but where one would expect to meet it; then one must arrange them methodically ... and give the most probable explanation; finally, try to verify this explanation by further comparison with the facts."

Thought and image

Portraits of F. Bacon and S. Holmes

Why are the portraits of a scientist and a literary hero located side by side?

Induction is not the only way of obtaining scientific knowledge about the world. While experimental physics, chemistry and biology were built as sciences mainly through induction, theoretical physics and modern mathematics were built on a system of axioms: consistent, speculative statements that are reliable from the point of view of common sense and of the historical level of development of the science. Knowledge can then be built on these axioms by deriving inferences from the general to the particular, by moving from premises to consequences. This method is called deduction. It was developed by René Descartes, a French philosopher and scientist.

A striking example of obtaining knowledge about the same subject in different ways is the discovery of the laws of motion of celestial bodies. J. Kepler, on the basis of a large amount of observational data on the motion of the planet Mars, discovered the empirical laws of planetary motion in the solar system by induction at the beginning of the 17th century. At the end of the same century, Newton deduced the generalized laws of motion of celestial bodies from the law of universal gravitation.

In real research activity, research methods are interconnected.

1. ○ Explain what a research method is and what the methodology of the natural sciences is.

All these approximations should be substantiated and the errors introduced by each of them should be numerically evaluated.

The development of science shows that every natural science law has limits to its application. For example, Newton's laws are inapplicable to the study of the processes of the microworld. To describe these processes, the laws of quantum theory are formulated, which become equivalent to Newton's laws when applied to describe the motion of macroscopic bodies. From the point of view of modeling, this means that Newton's laws are a model that follows, under certain approximations, from a more general theory. However, the laws of quantum theory are not absolute either and have their own limits of applicability. More general laws have already been formulated and more general equations obtained, which, in turn, also have limitations. And there is no end in sight to this chain. No absolute laws describing everything in nature have yet been obtained, from which all particular laws could be derived, and it is not clear whether such laws can be formulated at all. But this means that any law of natural science is in fact a kind of model. The difference from the models considered in this section is only that a natural science law is a model applicable not to one specific phenomenon but to a wide class of phenomena.

Basically, data mining is about processing information and identifying patterns and trends in it that help make decisions. The principles of data mining have been known for many years, but with the advent of big data they have become even more widespread.

Big data has led to an explosion in the popularity of broader data mining techniques, in part because information has become much more abundant and, by its very nature and content, is becoming more diverse and expansive. When dealing with large datasets, relatively simple and straightforward statistics are no longer enough. With 30 or 40 million detailed purchase records, it's not enough to know that two million of them are from the same place. To better meet the needs of buyers, it is necessary to understand whether these two million belong to a certain age group, and to know their average earnings.

These business requirements have led from simple data retrieval and statistical analysis to more complex data mining. Solving business problems requires data analysis that allows one to build a model describing the information and ultimately leads to the creation of a resulting report. This process is illustrated in Figure 1.

Figure 1. Process diagram

The process of analyzing data, searching, and building a model is often iterative as you seek out and discover the various pieces of information that you can extract. You also need to understand how to link, transform and combine them with other data to get the result. As new data elements and aspects are discovered, the approach to identifying data sources and formats and then matching that information to a given outcome may change.

Data Mining Tools

Data mining is not only about the tools or database software used. It can be done with relatively modest database systems and simple tools, including building your own, or with ready-made software packages. Sophisticated data mining relies on past experience and on algorithms defined in existing software and packages, with different specialized tools associated with different methods.

For example, IBM SPSS®, which is rooted in statistical analysis and surveys, allows you to build powerful predictive models on past trends and make accurate forecasts. IBM InfoSphere® Warehouse provides data source discovery, preprocessing and data mining in one package, allowing you to extract information from the source database directly into the final report.

Recently, very large datasets and cluster/large-scale data processing have become possible, allowing even more complex generalizations of data mining results across groups and data mappings. An entirely new range of tools and systems is available today, including combined storage and processing systems.

You can analyze a wide variety of datasets, including traditional SQL databases, raw text data, key/value stores, and document databases. Clustered databases such as Hadoop, Cassandra, CouchDB, and Couchbase Server store and access data in ways that do not follow a traditional table structure.

In particular, the more flexible storage format of a document database gives information processing a new direction and complicates it. SQL databases are highly structured and rigidly bound to a schema, which makes it easy to query and analyze data with a known format and structure.

Document databases that conform to a standard JSON-like structure, or files with some machine-readable structure, are also easy to process, although their variable and flexible structure can make things difficult. For example, in Hadoop, which processes completely "raw" data, it can be difficult to identify and extract information before it is processed and matched.

Basic Methods

Several basic methods used for data mining describe the type of analysis and the data retrieval operation. Unfortunately, different companies and solutions do not always use the same terms, which can add to the confusion and perceived complexity.

Let's look at some key methods and examples of the tools that can be used for data mining.

Association

Association (or relation) is probably the most well-known, familiar and simple method of data mining. To identify patterns, a simple comparison is made between two or more elements, often of the same type. For example, when tracking shopping habits, you might notice that cream is usually bought along with strawberries.
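A minimal sketch of association mining assuming the third-party mlxtend package (the purchase baskets are invented, and this stands in for, rather than reproduces, the InfoSphere Warehouse wizard described below): frequent itemsets and association rules are derived from a few transactions.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

baskets = [["strawberries", "cream", "sugar"],
           ["strawberries", "cream"],
           ["bread", "butter"],
           ["strawberries", "cream", "bread"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(baskets), columns=te.columns_)  # one-hot basket matrix
itemsets = apriori(onehot, min_support=0.5, use_colnames=True)         # frequent itemsets
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```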

Building data mining tools based on associations or relations is easy. For example, InfoSphere Warehouse has a wizard that provides information flow configurations for creating associations by examining the input source, decision basis, and output information. Figure 2 shows the corresponding example for the sample database.

Figure 2. Information flow used in the association approach

Classification

Classification can be used to get an idea of the type of customers, products or objects by describing several attributes that identify a particular class. For example, cars can easily be classified by type (sedan, SUV, convertible) by defining various attributes (number of seats, body shape, driven wheels). When studying a new car, you can assign it to a particular class by comparing its attributes with a known definition. The same principles can be applied to customers, for example by classifying them by age and social group.

In addition, the classification can be used as input to other methods. For example, decision trees can be used to define a classification. Clustering allows you to use the common attributes of different classifications in order to identify clusters.

Clustering

By examining one or more attributes or classes, individual data elements can be grouped together to produce a structured conclusion. At a simple level, clustering uses one or more attributes as the basis for defining a cluster of similar results. Clustering is useful for identifying different information because it correlates with other examples, so you can see where similarities and ranges agree.

The clustering method works both ways. You can assume that there is a cluster at a certain point and then use your identification criteria to check this. The graph in Figure 3 shows a good example. Here, the buyer's age is compared with the purchase price. It is reasonable to expect that people in their twenties and thirties (before marriage and children) and in their fifties and sixties (when the children have left home) have higher disposable incomes.

Figure 3. Clustering

In this example, two clusters are visible, one around $2000/20-30 years and another around $7000-8000/50-65 years. In this case, we made a hypothesis and tested it on a simple graph that can be plotted with any suitable charting software. More complex combinations require a complete analysis package, especially if you need to base decisions automatically on nearest-neighbor information.

Such a construction of clusters is a simplified example of so-called nearest-neighbor matching. Individual buyers can be distinguished by their literal proximity to each other on the chart. It is very likely that buyers from the same cluster share other common attributes, and this assumption can be used to search for, classify, and otherwise analyze the members of the data set.
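A minimal sketch of the nearest-neighbor idea on the age/spend example above (all numbers are invented): find the existing buyers closest to a new observation. In practice the two features should be rescaled first, since the dollar amounts otherwise dominate the distance.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# columns: buyer age, purchase amount in dollars (unscaled, for simplicity of the sketch)
buyers = np.array([[24, 2100], [27, 1900], [29, 2300],
                   [55, 7400], [58, 8100], [62, 7700]], dtype=float)

nn = NearestNeighbors(n_neighbors=2).fit(buyers)
distances, indices = nn.kneighbors([[26, 2000]])    # a new buyer to place on the chart
print("nearest existing buyers:", indices[0], "at distances", distances[0].round(1))
```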

The clustering method can also be applied in the opposite direction: given certain input attributes, various artifacts can be identified. For example, a recent study of four-digit PIN codes found clusters of numbers in the ranges 1-12 and 1-31 for the first and second pairs. By plotting these pairs on a graph, you can see clusters associated with dates (birthdays, anniversaries).

Forecasting

Forecasting is a broad topic that ranges from predicting equipment component failures to detecting fraud and even predicting company profits. When combined with other data mining techniques, forecasting involves trending, classification, pattern matching, and relationships. By analyzing past events or instances, one can predict the future.

For example, using credit card authorization data, a decision tree analysis of a person's past transactions can be combined with classification and matching against historical patterns to detect fraudulent transactions. If US airfare purchases match US transactions, then it is likely that these transactions are genuine.

Sequential models

Sequential models, which are often used to analyze long-term data, are a useful method for identifying trends, or regular repetitions of similar events. For example, it can be determined from customer data that they buy certain sets of products at different times of the year. Based on this information, the shopping cart prediction application can automatically guess which products will be added to the cart based on the frequency and history of purchases.

Decision Trees

The decision tree, which is related to most other methods (mainly classification and forecasting), can be used either within selection criteria or to support the selection of certain data within the overall structure. A decision tree starts with a simple question that has two (sometimes more) answers. Each answer leads to the next question, helping to classify and identify the data or make predictions.
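As a sketch of the idea (the car attributes and labels below are made up), a decision tree classifier builds and answers such a chain of questions automatically:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical car attributes: [seats, doors, cargo volume in litres]
X = [[2, 2, 150], [5, 4, 450], [7, 5, 800], [2, 2, 200], [5, 4, 500]]
y = ["convertible", "sedan", "SUV", "convertible", "sedan"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Each split is a yes/no question about one attribute.
print(export_text(tree, feature_names=["seats", "doors", "cargo_volume"]))
print(tree.predict([[5, 4, 480]]))  # classify a previously unseen car
```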

Figure 5. Data preparation

The data source, location, and database affect how information is processed and aggregated.

Reliance on SQL

The simplest of all approaches is often to rely on SQL databases. SQL (and the corresponding table structure) is well understood, but the structure and format of the information cannot be completely ignored. For example, when studying user behavior from sales data in an SQL data model (and in data mining in general), there are two main formats that can be used: transactional and behavioral-demographic.

When working with InfoSphere Warehouse, building a behavioral-demographic model to analyze customer data and understand behavior patterns involves taking raw SQL data based on transactional information and known customer characteristics and organizing it into a predefined tabular structure. InfoSphere Warehouse can then use this information to perform clustering and classification mining and obtain the desired result. Buyer demographics and transaction data can be combined and then converted into a format that allows analysis of the data, as shown in Figure 6.

Figure 6. Special data analysis format

For example, sales data can reveal sales trends for specific products. Raw sales data for individual items can be converted into transactional information that maps customer IDs to transaction data and item codes. Using this information, it is easy to identify sequences and relationships for individual products and individual customers over time. This allows InfoSphere Warehouse to derive sequential information, such as when a customer is most likely to purchase the same item again.
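A hedged sketch of such a conversion with pandas (the column names and rows are invented; the actual InfoSphere Warehouse flow is configured through its own tooling):

```python
import pandas as pd

# Hypothetical raw sales rows: one line per sold item.
sales = pd.DataFrame({
    "customer_id": [1, 1, 2, 1, 2],
    "item_code":   ["A", "B", "A", "A", "C"],
    "date": pd.to_datetime(["2023-01-05", "2023-01-05",
                            "2023-01-07", "2023-02-04", "2023-03-01"]),
})

# Transactional view: how often and at what interval each customer buys each item.
per_item = (sales.sort_values("date")
                 .groupby(["customer_id", "item_code"])["date"]
                 .agg(purchases="count",
                      mean_days_between=lambda d: d.diff().dt.days.mean()))
print(per_item)
```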

You can create new data analysis points from the original data. For example, you can expand (or refine) product information by comparing or classifying individual products into broader groups, and then analyze the data for those groups instead of individual customers.

Figure 7. MapReduce structure

In the previous example, we processed (in this case by MapReduce) the original data in a document database and converted it to a tabular format in an SQL database for data mining purposes.

Working with this complex and even unstructured information may require more careful preparation and processing. There are complex data types and structures that cannot be processed and prepared in the form you need in one step. In this case, you can chain the output of one MapReduce step either into a sequential transformation that produces the necessary data structure, as shown in Figure 8, or into separate steps that produce several output tables.

Figure 8. Sequential chain of MapReduce processing output

For example, in one pass you can take the original information from the document database and perform a MapReduce operation to obtain a summary of this information by date. A good example of a sequential process is to regenerate the information and combine the results with a decision matrix (created in the second MapReduce step), followed by additional simplification into a sequential structure. The MapReduce processing steps require that the whole data set support separate data-processing steps.
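A toy sketch of such a two-step chain in plain Python (no Hadoop; the documents and the decision rule are invented): the first map/reduce pass summarizes documents by date, the second pass combines that summary with a simple decision matrix.

```python
from collections import defaultdict

# Hypothetical documents from a document store.
docs = [
    {"date": "2023-01-05", "amount": 120.0},
    {"date": "2023-01-05", "amount": 80.0},
    {"date": "2023-01-06", "amount": 30.0},
]

# Step 1: map each document to (date, amount), reduce by summing per date.
step1 = defaultdict(float)
for key, value in ((d["date"], d["amount"]) for d in docs):   # map
    step1[key] += value                                       # reduce

# Step 2: feed step-1 output into a second pass that applies a decision rule.
decision_matrix = {True: "review", False: "ok"}               # toy rule
step2 = {date: decision_matrix[total > 100] for date, total in step1.items()}
print(step2)   # {'2023-01-05': 'review', '2023-01-06': 'ok'}
```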

Regardless of the source data, many tools can use flat files, CSV, or other data sources. For example, InfoSphere Warehouse can parse flat files in addition to communicating directly with the DB2 data warehouse.

Conclusion

Data mining is not just about running complex queries on data stored in a database. Whether you work with SQL, document-based databases, Hadoop, or simple flat files, you need to prepare the data, format it, or restructure it. You need to determine the format of the information on which your method and analysis will be based. Then, once the information is in the right format, you can apply various methods (individually or collectively), regardless of the underlying data structure or data set.


Topic 7. CLASSIFICATION ANALYSIS

Lecture No. 9

1. Exploratory data analysis. Measurement scales

2. Classification trees

3. Discriminant analysis (trained classification)

4. Cluster analysis (classification without training)

5. Canonical correlations

1. Exploratory data analysis. Measurement scales

In the presence of a large number of variables and the absence of information about relationships and patterns, one of the first stages of analyzing the available data is so-called exploratory data analysis. As a rule, exploratory analysis considers and compares a large number of variables, and classification and scaling of the variables are carried out for this purpose. Variables differ in how well they can be measured, or, in other words, in how much measurable information their measurement scale provides. Another factor that determines the amount of information is the type of scale on which the measurement is taken. The following types of measurement scales are usually used: nominal, ordinal, interval and relative (ratio).

Nominal variables are used only for qualitative classification. This means that they can only be measured in terms of belonging to some substantially different classes. Typical examples of nominal variables are the manufacturer, the type of product, the sign of its suitability, etc. Nominal variables are often called categorical.

Ordinal variables allow objects to be ranked by indicating which of them possess the quality expressed by the variable to a greater or lesser extent. However, they do not allow one to say how much more or how much less of that quality is present. A typical example is the sorting of goods into grades: highest, first, second, third. The same product differs qualitatively between grades, but it is impossible to say that the difference between them is, say, 25%. Categorical and ordinal variables are especially common in surveys.

Interval variables allow not only ordering the objects but also numerically expressing and comparing the differences between them. For example, temperature measured in degrees forms an interval scale, since the difference between values can be evaluated numerically (40 degrees is more than 30 by 10). An interval scale can easily be turned into an ordinal one if certain values of the variable are taken as boundaries between classes (for example, whether it is warm or hot outside, taking the boundary between the classes "warm" and "hot" at a chosen value of the variable). Relative (ratio) variables are similar to interval variables, but their distinctive feature is the presence of a fixed absolute zero point. As a rule, these are continuous variables.

2. Classification trees

Classification trees is a method that allows one to predict whether observations or objects belong to one or another class of a categorical dependent variable, depending on the corresponding values of one or more predictor variables. Building a classification tree can be illustrated with a hierarchical coin-sorting device. Let the coins roll along a narrow chute in which a slot the size of a one-kopeck coin is cut. If a coin falls into the slot, it is 1 kopeck; otherwise, it continues along the chute and reaches a slot for a two-kopeck coin; if it falls there, it is 2 kopecks; if not (which means it is 3 or 5 kopecks), it rolls further, and so on. Thus, we have built a classification tree. The decision rule implemented in this classification tree allows efficient sorting of a handful of coins and is generally applicable to a wide range of classification problems. Classification trees are ideally suited for graphical representation, and therefore the conclusions drawn from them are much easier to interpret than if they were presented only in numerical form. The hierarchical structure of a classification tree is one of its most important properties. The process of building a classification tree consists of four main steps:

    Selection of the forecast accuracy criterion

    Branch type selection

    Determining when to stop branching

    Determining "appropriate" tree sizes

Ultimately, the goal of analysis with classification trees is to obtain the most accurate prediction possible.
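A sketch of the last two steps (deciding when to stop branching and choosing an "appropriate" tree size) using cost-complexity pruning and cross-validation in scikit-learn; the dataset is a stand-in chosen only for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in data set

# Search over pruning strength and depth: larger ccp_alpha => smaller tree.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": [0.0, 0.001, 0.01, 0.05], "max_depth": [3, 5, None]},
    scoring="accuracy",  # the chosen forecast accuracy criterion
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```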

3. Discriminant analysis (trained classification)

Discriminant analysis is used to decide which class (group) a given object (process) should be assigned to, based on the study of its parameters or characteristics. For example, several parameters are measured for separately grouped aggregates (grades) of a product that form the general population, and the task is to establish which of the parameters contribute to the difference (discrimination) between the groups. After that, a decision is made on whether a given product belongs to a certain group. This kind of statistical analysis is therefore multivariate. The main idea of discriminant analysis is to determine whether the populations differ in the mean of some parameter (variable), and then to use this variable to predict, for new observations, which population they belong to. Each of the areas differs from the others by the value of a certain parameter (or rather, by the value of its mean) or by a set of parameters taken as the classification feature. The discrimination rule is chosen in accordance with a certain optimality principle, for example the minimum probability of false classification. In practical calculations, one passes from the vector of features to a linear function (the discriminant function), which for two groups (classes) has the form of a linear multiple regression equation in which the coded group membership acts as the dependent variable. If there are more than two groups, more than one discriminant function can be composed. For example, when there are three populations, one can evaluate: (1) a function for discriminating between population 1 and populations 2 and 3 taken together, and (2) a function for discriminating between population 2 and population 3. In this sense, discriminant analysis is very similar to multivariate analysis of variance. Once the discriminant functions have been obtained, the question arises of how well they can predict to which population a particular sample belongs. For this, classification indicators or classification functions are determined, and the next observation or specific sample is assigned to the group for which the classification function has the greatest value.

4. Cluster analysis (classification without training)

Cluster analysis is a statistical method that includes a set of different algorithms for distributing objects into clusters (cluster - bunch, accumulation). The task is to partition a set of H objects into an integer number K of clusters so that each object belongs to one and only one subset of the partition. Objects belonging to the same cluster must be similar, while objects belonging to different clusters must be heterogeneous. The solution to the cluster analysis problem is a partition that satisfies an optimality criterion. This criterion is called the objective function, which, for example, can be the minimum of the sum of squared deviations of the features of the group objects from the group mean value:

min Σ(xᵢ – x̄)²

The similarity and heterogeneity of objects in groups is characterized by a certain value called the distance function. The larger the distance function between objects, the more heterogeneous they are. It is clear that if this function exceeds a certain limit, the objects should be assigned to different groups (clusters). Depending on the clustering algorithm used, the following distance functions are distinguished: the Euclidean metric (Σ(xᵢ – xⱼ)²)^1/2; the Manhattan distance Σ|xᵢ – xⱼ|; the Chebyshev distance max|xᵢ – xⱼ|, etc. In hierarchical (agglomerative) algorithms, all objects are initially considered as separate clusters. Then, at each step of the algorithm, the two closest clusters are combined and, taking into account the adopted distance function, all distances are recalculated according to the corresponding formula. When the objective function is reached, the iterations stop.

5. Canonical correlations

Classical correlation analysis allows you to find statistical relationships between two variables (the so-called pair correlation); to study relationships between two sets of variables, the methods of canonical analysis are used. Canonical analysis, being a generalization of multiple correlation as a measure of the relationship between one random variable and many other random variables, considers the relationship between sets of random variables. At the same time, it is limited to considering a small number of the most correlated linear combinations from each set. The analysis of canonical correlation is based on the use of canonical roots or canonical variables, which are treated as "hidden" variables characterizing the observed phenomena. The number of canonical roots is equal to the number of variables in the smaller set. In practice, when determining the canonical correlation, a separate correlation matrix is built, which is the product of the standard correlation matrices characterizing the dependencies between the individual variables. Then as many eigenvalues of the resulting matrix are calculated as there are canonical roots. The square roots of the obtained eigenvalues give a set of numbers that can be interpreted as correlation coefficients; since they refer to the canonical variables, they are called canonical correlations. Discriminant, cluster and canonical analysis should be performed using special statistical packages that implement these algorithms on a computer.
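As the closing remark notes, these methods are available in standard statistical packages; a brief scikit-learn/SciPy sketch on synthetic data (purely illustrative, not part of the lecture) shows the three analyses side by side:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.cross_decomposition import CCA
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 4)), rng.normal(3, 1, (30, 4))])
y = np.array([0] * 30 + [1] * 30)

# Discriminant analysis: a linear rule assigning new objects to a group.
lda = LinearDiscriminantAnalysis().fit(X, y)
print("LDA accuracy:", lda.score(X, y))

# Agglomerative (hierarchical) clustering with the Manhattan (cityblock) metric.
Z = linkage(X, method="average", metric="cityblock")
print("cluster sizes:", np.bincount(fcluster(Z, t=2, criterion="maxclust")))

# Canonical correlations between two sets of variables.
Y = X @ rng.normal(size=(4, 3)) + rng.normal(0, 0.5, (60, 3))
U, V = CCA(n_components=2).fit_transform(X, Y)
print("canonical correlations:",
      [round(np.corrcoef(U[:, i], V[:, i])[0, 1], 2) for i in range(2)])
```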

Car make recognition task

Last year, Avito held a number of competitions, including one on recognizing car makes. Its winner, Evgeny Nizhibitsky, talked about his solution at a training session.


Formulation of the problem. From the images of the cars, you need to determine the make and model. The metric was the accuracy of predictions, that is, the proportion of correct answers. The sample consisted of three parts: the first part was available for training initially, the second was given later, and the third was required to show the final predictions.


Computing resources. I used the home computer that kept my room warm all the time and the servers provided at work.

Model overview. Since our task is image recognition, the first thing to do is take advantage of the progress in image classification quality on the well-known ImageNet. As you know, modern architectures can achieve quality even higher than human. So I started by reviewing recent articles and put together a summary table of architectures, implementations, and ImageNet quality.


Note the architectures on which the best quality was achieved.

Fine-tuning networks. Training a deep neural network from scratch is a rather time-consuming task and, besides, not always effective in terms of the result. Therefore, the fine-tuning technique is often used: a network already trained on ImageNet is taken, the last layer is replaced with a layer with the required number of classes, and then the network is tuned with a low learning rate on the competition data. This scheme trains the network faster and to higher quality.
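A minimal fine-tuning sketch in PyTorch/torchvision (the competition solutions used other frameworks; the architecture and the class count here are arbitrary assumptions made only to show the general recipe):

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 87            # assumed number of car-make classes, for illustration
model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)

# Replace the last (ImageNet 1000-way) layer with one for our classes.
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Fine-tune the body with a low learning rate; the new head can use a larger one.
optimizer = torch.optim.SGD([
    {"params": [p for n, p in model.named_parameters() if not n.startswith("fc.")],
     "lr": 1e-4},
    {"params": model.fc.parameters(), "lr": 1e-3},
], momentum=0.9)
criterion = nn.CrossEntropyLoss()
# ...then loop over the competition data: forward pass, criterion, backward, step.
```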

The first approach to GoogLeNet retraining showed approximately 92% accuracy during validation.

Crop predictions. When using a neural network to predict on the test sample, you can improve quality by cutting out fragments of the right size at different places of the original image and then averaging the results. A 1x10 crop means taking the center of the image and the four corners, and then the same again but reflected horizontally. As you can see, the quality increases, but so does the prediction time.
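A sketch of such 10-crop test-time averaging with torchvision (illustrative; the exact crop scheme and sizes used in the competition may have differed):

```python
import torch
from torchvision import transforms
from PIL import Image

ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),            # 4 corners + center, plus mirrored versions
    transforms.Lambda(lambda crops: torch.stack(
        [transforms.ToTensor()(c) for c in crops])),
])

def predict_tta(model, image_path):
    crops = ten_crop(Image.open(image_path).convert("RGB"))  # shape (10, 3, 224, 224)
    with torch.no_grad():
        probs = torch.softmax(model(crops), dim=1)           # one prediction per crop
    return probs.mean(dim=0)                                 # average over the 10 crops
```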

Validation of results. After the second part of the sample appeared, I split the sample into several parts. All further results are shown on this partition.

ResNet-34 Torch. You can use the ready-made repository of the authors of the architecture, but in order to get predictions on the test in the desired format, you have to fix some scripts. In addition, you need to solve the problems of high memory consumption by dumps. Validation accuracy is about 95%.


Inception-v3 TensorFlow. A ready-made implementation was also used here, but the image preprocessing was changed and image cropping during batch generation was limited. The result is almost 96% accuracy.


Ensemble of models. As a result, we got two ResNet models and two Inception-v3 models. What validation quality can be obtained by mixing the models? The class probabilities were averaged using the geometric mean. The weights (in this case, exponents) were selected on a held-out sample.
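A sketch of mixing class probabilities with a weighted geometric mean, where the weights act as exponents picked on a held-out set (the prediction arrays here are random placeholders):

```python
import numpy as np

def geometric_blend(prob_list, weights):
    """prob_list: list of (n_samples, n_classes) probability arrays."""
    log_mix = sum(w * np.log(np.clip(p, 1e-15, 1.0))
                  for p, w in zip(prob_list, weights))
    mix = np.exp(log_mix / sum(weights))
    return mix / mix.sum(axis=1, keepdims=True)   # renormalize rows

# Placeholder predictions of four models on three samples with two classes.
preds = [np.random.dirichlet([1, 1], size=3) for _ in range(4)]
blended = geometric_blend(preds, weights=[1.0, 1.0, 0.7, 0.5])
print(blended.round(3))
```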


Results. ResNet training on GTX 980 took 60 hours and Inception-v3 on TitanX took 48 hours. During the competition, we managed to try out new frameworks with new architectures.


The task of classifying bank customers

Link to Kaggle.

Stanislav Semyonov tells how he and other top-ranked Kaggle participants teamed up and won a prize in a competition on classifying applications from clients of a large bank, BNP Paribas.


Formulation of the problem. Using obfuscated data from insurance claims, you need to predict whether a claim can be approved without additional manual checks. For the bank, this means automating the processing of applications; for data analysts, it is simply a binary classification task. There are about 230 thousand objects and 130 features. The metric is LogLoss. It is worth noting that the winning team deciphered the data, which helped them win the competition.

Getting rid of artificial noise in features. The first step is to look at the data. A few things immediately come to mind. Firstly, all features take values ​​from 0 to 20. Secondly, if you look at the distribution of any of the features, you can see the following picture:

Why is that? At the stage of anonymizing and adding noise to the data, random noise was added to all values, and then the values were scaled to the segment from 0 to 20. The reverse transformation was carried out in two steps: first, the values were rounded to a certain decimal place, and then the denominator was selected. Was this necessary, given that a tree picks a splitting threshold anyway? Yes: after the inverse transformation, differences between variables begin to make more sense, and one-hot encoding becomes possible for categorical variables.

Removing linearly dependent features. We also noticed that some features are sums of others; clearly they are not needed. To find them, subsets of features were taken, and on each subset a regression was built to predict some other feature. If the predicted values were close to the true ones (taking the artificial noise into account), the feature could be removed. But the team did not bother with this and used a ready-made set of filtered features prepared by someone else. One of the features of Kaggle is the forum and public solutions, through which participants share their findings.

How to understand what to use? There is a small hack. Let's say you know that someone in an old competition used some technique that helped them place high (on the forums, they usually write short solutions). If in the current competition this participant is again among the leaders, most likely, the same technique will work here too.

Coding of categorical variables. It was striking that a certain variable V22 has a large number of values, but at the same time, if we take a subsample for a certain value, the number of levels (different values) of other variables decreases markedly. There is also a good correlation with the target variable. What can be done? The simplest solution is to build a separate model for each value of V22, but this is the same as splitting over all values ​​of the variable in the first split of the tree.

There is another way to use the received information - coding by the average value of the target variable. In other words, each value of the categorical variable is replaced by the average value of the target for objects in which this feature takes the same value. It is impossible to perform such encoding directly for the entire training set: in the process, we will implicitly introduce information about the target variable into the features. We are talking about information that almost any model is sure to detect.

Therefore, such statistics are counted by folds. Here is an example:

Let's assume the training data is divided into three folds. For each fold, we compute the new feature using the other two folds; for the test sample, we compute it using the entire training set. Then the information about the target variable enters the sample less explicitly, and the model can still use the knowledge gained.
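A sketch of such out-of-fold mean-target encoding with pandas (column names and the number of folds are illustrative):

```python
import pandas as pd
from sklearn.model_selection import KFold

def out_of_fold_mean_encoding(train, test, cat_col, target_col, n_splits=3):
    train, test = train.copy(), test.copy()
    encoded = pd.Series(index=train.index, dtype=float)
    for rest_idx, fold_idx in KFold(n_splits, shuffle=True, random_state=0).split(train):
        # Means are computed on the other folds only, then applied to this fold.
        means = train.iloc[rest_idx].groupby(cat_col)[target_col].mean()
        encoded.iloc[fold_idx] = train.iloc[fold_idx][cat_col].map(means).values
    train[cat_col + "_mean"] = encoded
    # For the test set, use statistics computed on the whole training sample.
    test[cat_col + "_mean"] = test[cat_col].map(train.groupby(cat_col)[target_col].mean())
    return train, test
```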

Will there be problems with anything else? Yes - with rare categories and cross-validation.

Rare categories. Let's say a certain category occurs only a few times, and all the corresponding objects belong to class 0. Then the average value of the target variable will also be zero, while a completely different situation may arise on the test sample. The solution is a smoothed average (or smoothed likelihood), calculated with the following formula:

smoothed = (nrows × category_mean + alpha × global_mean) / (nrows + alpha)

Here global_mean is the average value of the target variable over the entire sample, category_mean is its average within the given category, nrows is the number of times the particular value of the categorical variable occurs, and alpha is the regularization parameter (for example, 10). Now, if some value is rare, the global average will carry more weight, and if it is frequent enough, the result will be close to the original category average. By the way, this formula also allows you to handle previously unseen values of a categorical variable.

Cross-validation. Let's say we have computed all the smoothed means for the categorical variables out of fold. Can we now evaluate the quality of the model with standard k-fold cross-validation? No. Let's look at an example.

Suppose we want to evaluate the model on the third fold. We train the model on the first two folds, but they already contain a new feature with the average value of the target variable, and the third (test) fold was used to compute it. This prevents a correct evaluation of the results, but the problem is solved by computing the statistics on folds within folds. Let's look at the example again:

We still want to evaluate the model on the third fold. Let's divide the first two folds (the training sample of our evaluation) into another three folds; in them we compute the new feature according to the scenario already described, and for the third fold (the test sample of our evaluation) we compute it from the first two folds taken together. Then no information from the third fold is used when training the model, and the estimate is honest. In the competition we are discussing, only such cross-validation made it possible to assess the quality of the model correctly. Of course, the numbers of "outer" and "inner" folds can be arbitrary.
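A compact sketch of this "folds within folds" evaluation on toy data (the column names, the toy target and the simple logistic model are all assumptions made for illustration): the encoding for the outer training part is built only from its own inner folds, so the outer test fold never leaks into the features.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def encode_inner(frame, cat, target, inner_splits=3):
    """Out-of-fold mean encoding computed only inside `frame`."""
    enc = pd.Series(index=frame.index, dtype=float)
    for rest, fold in KFold(inner_splits, shuffle=True, random_state=1).split(frame):
        means = frame.iloc[rest].groupby(cat)[target].mean()
        enc.iloc[fold] = frame.iloc[fold][cat].map(means).values
    return enc.fillna(frame[target].mean())

# Toy data with one categorical feature.
rng = np.random.default_rng(0)
df = pd.DataFrame({"v22": rng.integers(0, 20, 600)})
df["target"] = (rng.random(600) < 0.3 + 0.02 * df["v22"]).astype(int)

scores = []
for outer_train, outer_test in KFold(3, shuffle=True, random_state=0).split(df):
    tr, te = df.iloc[outer_train].copy(), df.iloc[outer_test].copy()
    tr["enc"] = encode_inner(tr, "v22", "target")                   # inner folds only
    te["enc"] = (te["v22"].map(tr.groupby("v22")["target"].mean())  # whole outer train
                         .fillna(tr["target"].mean()))
    model = LogisticRegression().fit(tr[["enc"]], tr["target"])
    scores.append(log_loss(te["target"], model.predict_proba(te[["enc"]])[:, 1]))
print("honest CV logloss:", np.round(np.mean(scores), 3))
```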

Feature building. We used not only the already mentioned smoothed means of the target variable, but also weights of evidence - almost the same thing, but with a logarithmic transformation. In addition, features such as the difference between the numbers of positive- and negative-class objects in a group turned out to be useful without any normalization. The intuition is as follows: the scale shows the degree of confidence in the class. But what about quantitative features? If they are processed in a similar way, all values get flattened by regularization towards the global average. One option is to split the values into bins, which are then treated as separate categories. Another is simply to build some kind of linear model on the same feature with the same target. In total, about two thousand features were obtained from the 80 filtered ones.

Stacking and blending. As in most competitions, model stacking is an important part of the solution. In short, the essence of stacking is that we pass the predictions of one model as a feature to another model. However, it is important not to overfit in the process. Let's look at an example:


Taken from Alexander Dyakonov's blog

Suppose we decided to split our sample into three folds at the stacking stage. Similarly to computing the statistics, we must train the model on two folds and add the predicted values for the remaining fold. For the test sample, the model predictions from each pair of folds can be averaged. Each level of stacking is the process of adding a group of new features - predictions of models built on the existing dataset.
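A sketch of one stacking level with scikit-learn, where cross_val_predict produces the out-of-fold predictions that become features for the next level (the data and the choice of base models are synthetic stand-ins):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=200, random_state=0),
    GradientBoostingClassifier(random_state=0),
]

# Level-1 features: out-of-fold probability predictions of the base models.
level1 = np.column_stack([
    cross_val_predict(m, X, y, cv=3, method="predict_proba")[:, 1]
    for m in base_models
])

# Level-2 model (the blender) is trained on those predictions.
print(cross_val_score(LogisticRegression(), level1, y, cv=3,
                      scoring="neg_log_loss").mean())
```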

At the first level, the team had 200-250 different models, at the second - another 20-30, at the third - a few more. The result is blending, that is, mixing the predictions of different models. Various algorithms were used: gradient boosting with different parameters, random forests, neural networks. The main idea is to use the most diverse models with different parameters, even if they do not give the highest quality.

Teamwork. Usually, participants unite into teams closer to the end of a competition, when everyone already has their own achievements. We teamed up with other Kagglers from the very beginning. Each team member had a folder in a shared cloud where datasets and scripts were placed. A general cross-validation procedure was agreed in advance so that results could be compared with each other. The roles were distributed as follows: I came up with new features, the second participant built models, the third one selected them, and the fourth one managed the whole process.

Where to get power. Testing a large number of hypotheses, building multi-level stacking, and training models can take too much time when using a laptop. Therefore, many participants use computing servers with a large number of cores and RAM. I usually use AWS servers, and my team members seem to be using machines at work for contests while they are idle.

Communication with the organizing company. After a successful performance in a competition, communication with the company takes place in the form of a joint conference call. Participants talk about their solution and answer questions. At BNP, people were not surprised by multi-level stacking, but they were, of course, interested in the construction of features, the teamwork, the validation of results - everything that could be useful for improving their own system.

Did the dataset need to be deciphered? The winning team noticed one peculiarity in the data: some features have missing values and some do not, i.e. some characteristics did not depend on specific people. In addition, 360 unique values were observed, so it is logical to assume these are some kind of timestamps. It turned out that if you take the difference between two such features and sort the entire sample by it, zeros appear more often at the beginning and ones towards the end. This is exactly what the winners exploited.

Our team took third place. Almost three thousand teams participated in total.

Ad category recognition task

Link to DataRing.

This is another Avito contest. It took place in several stages, the first of which (as well as the third, by the way) was won by Arthur Kuzin.


Formulation of the problem. From the photos in an ad, you need to determine its category. Each ad had from one to five images. The metric took into account the coincidence of categories at different levels of the hierarchy, from general to narrower ones (the last level contains 194 categories). In total, there were almost a million images in the training sample, which is close to the size of ImageNet.


Recognition difficulties. It would seem that you just need to learn to distinguish a TV from a car, and a car from shoes. But, for example, there is a category "British cats" and there is "other cats", and among them there are very similar images, although you can still tell them apart. What about tires, rims and wheels? That is something even a person cannot always do. These difficulties are the reason for a certain ceiling in the results of all participants.


Resources and framework. I had at my disposal three computers with powerful video cards: a home one, one provided by the laboratory at the Moscow Institute of Physics and Technology, and a computer at work. Therefore, it was possible (and necessary) to train several networks at the same time. MXNet, created by the same people who wrote the well-known XGBoost, was chosen as the main framework for training neural networks. This alone was a reason to trust their new product. The advantage of MXNet is that an efficient iterator with standard augmentation is available right out of the box, which is enough for most tasks.


Network architectures. The experience of participating in one of the previous competitions showed that architectures of the Inception series give the best quality, so I used them here. Batch normalization was added to GoogLeNet because it sped up model training. I also used the Inception-v3 and Inception BN architectures from the Model Zoo model library, with a dropout added before the last fully connected layer. Due to technical problems, it was not possible to train the networks with stochastic gradient descent, so Adam was used as the optimizer.



Data augmentation. To improve the quality of the network, augmentation was used - adding distorted images to the sample in order to increase the diversity of data. Transformations such as random cropping of the photo, reflection, rotation by a small angle, changing the aspect ratio and shift were involved.
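The author relied on MXNet's built-in iterator for this; purely as an illustration (in a different framework than the one used in the competition), an equivalent set of transformations in torchvision might look like:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(10),                      # rotation by a small angle
    transforms.RandomResizedCrop(224,                   # random crop + aspect-ratio change
                                 scale=(0.7, 1.0),
                                 ratio=(0.8, 1.25)),
    transforms.RandomHorizontalFlip(),                  # reflection
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # small shift
    transforms.ToTensor(),
])
```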

Accuracy and learning speed. At first I divided the sample into three parts, but then I abandoned one of the validation steps for mixing models. Therefore, later the second part of the sample was added to the training set, which improved the quality of the networks. In addition, GoogLeNet was originally trained on Titan Black, which has half the memory of Titan X. So this network was retrained with a larger batch size, and its accuracy increased. If you look at the training time of networks, we can conclude that in conditions of limited time it is not worth using Inception-v3, since training is much faster with the other two architectures. The reason is the number of parameters. The fastest learner is Inception BN.

Building Predictions.

Like Evgeny in the car make competition, Arthur used crop predictions, but with 24 crops instead of 10: the corners, their reflections, the center, rotations of the central parts, and ten more random crops.

If you save the state of the network after each epoch, the result is many different models, not just the final network. Given the time remaining until the end of the competition, I could use the predictions of 11 epoch models - since building predictions using the network also takes a long time. All these predictions were averaged according to the following scheme: first, using the arithmetic mean within the crop groups, then using the geometric mean with weights selected on the validation set. These three groups are mixed, then we repeat the operation for all epochs. At the end, the class probabilities of all images of one ad are averaged using a geometric mean without weights.


Results. When fitting the weights at the validation stage, the competition metric was used, because it did not correlate well with plain accuracy. Predicting on different areas of the images adds only a small amount of quality compared to a single prediction, but it is precisely this increase that made it possible to show the best result. At the end of the competition, it turned out that the first three places differed by thousandths. For example, Evgeny Nizhibitsky had a single model that was only slightly inferior to my ensemble of models.


Learning from scratch vs. fine-tuning. After the end of the competition, it turned out that despite the large sample size, it was worth training the network not from scratch, but using a pre-trained network. This approach shows better results.

Reinforcement learning problem

The Black Box Challenge was not quite like a typical Kaggle competition. The point is that for a solution it was not enough to label some "test" sample. It was required to program an "agent" and load its code into the system; the agent was then placed in an environment unknown to the participant and made decisions in it on its own. Such tasks belong to the field of reinforcement learning.

Mikhail Pavlov from 5vision spoke about approaches to the solution. He took second place in the competition.


Formulation of the problem. For an environment with unknown rules, it was necessary to write an "agent" that would interact with the specified environment. Schematically, this is a kind of brain that receives information about the state and reward from the black box, makes a decision about the action, and then receives a new state and a reward for the action. Actions are repeated one after another during the game. The current state is described by a vector of 36 numbers. The agent can take four actions. The goal is to maximize the amount of rewards for the entire game.


Environment analysis. A study of the distribution of the environment state variables showed that the first 35 components do not depend on the selected action, and only the 36th changes depending on it. At the same time, different actions influenced it differently: some increased or decreased it, some did not change it at all. But it cannot be said that the entire environment depends on one component: there may be hidden variables in it. In addition, experiments showed that if you perform more than 100 identical actions in a row, the reward becomes negative, so strategies like "do only one action" were ruled out immediately. Some participants noticed that the reward is proportional to the same 36th component. There was an assumption on the forum that the black box imitates a financial market, where the 36th component is the portfolio and the actions are buying, selling and doing nothing. Those actions correlated with the change in the portfolio, while the meaning of one of the actions remained unclear.


Q-learning. The main goal of participating was to try various reinforcement learning techniques. One of the simplest and best-known methods is q-learning. Its essence is to build a Q-function that depends on the state and the selected action and evaluates how "good" it is to choose a particular action in a particular state. The notion of "good" includes the reward that we will receive not only now but also in the future. The function is trained iteratively: at each iteration, we try to bring it closer to itself at the next step of the game, taking into account the reward received now. Q-learning assumes fully observable Markov processes (in other words, the current state should contain all information from the environment). Although, according to the organizers, the environment did not meet this requirement, q-learning could be applied quite successfully.
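A minimal one-step Q-learning sketch with a linear Q-function over the 36-dimensional state and 4 actions (the black-box API is not reproduced here; the names, learning rate and epsilon-greedy policy are illustrative assumptions):

```python
import numpy as np

n_state, n_actions, gamma, lr = 36, 4, 0.99, 1e-3
W = np.zeros((n_actions, n_state))            # linear Q: Q(s, a) = W[a] @ s

def q_values(state):
    return W @ state

def update(state, action, reward, next_state):
    """One-step Q-learning: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = reward + gamma * np.max(q_values(next_state))
    td_error = target - q_values(state)[action]
    W[action] += lr * td_error * state        # gradient step for the linear model

def act(state, eps=0.1):
    if np.random.rand() < eps:                # epsilon-greedy exploration
        return np.random.randint(n_actions)
    return int(np.argmax(q_values(state)))
```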

Adaptation to black box. Empirically, it was found that n-step q-learning was best suited for the environment, where the reward was used not for one last action, but for n actions ahead. The environment allowed you to save the current state and roll back to it, which made it easier to collect a sample - you could try to perform each action from one state, and not just one. At the very beginning of training, when the q-function was not yet able to evaluate actions, the strategy “perform action 3” was used. It was assumed that it did not change anything and it was possible to start training on the data without noise.

Learning process. The training went like this: with the current policy (the agent's strategy), we play the entire episode, accumulating a sample; then, using the obtained sample, we update the q-function, and so on - the sequence is repeated for a certain number of epochs. The results were better than when updating the q-function during the game. Other methods - the replay memory technique (with a common data bank for training into which new game episodes are added) and the simultaneous training of several asynchronously playing agents - turned out to be less effective.

Models. The solution used three regressions (each one time per action) and two neural networks. Some quadratic features and interactions have been added. The final model is a mixture of all five models (five Q-functions) with equal weights. In addition, online retraining was used: in the process of testing, the weights of the old regressions were mixed with the new weights obtained on the test set. This was done only for regressions, since their solutions can be written analytically and recalculated fairly quickly.


Other ideas. Naturally, not all ideas improved the final result. For example, reward discounting (when we do not just maximize the total reward, but consider each next move less useful), deep networks, dueling architecture (with an assessment of the usefulness of the state and each action separately) did not give an increase in results. Due to technical problems, it was not possible to apply recurrent networks - although, in an ensemble with other models, they might provide some benefit.


Results. Team 5vision took second place, but only by a very small margin over the bronze medalists.


So, why enter a data analysis competition?

  • Prizes. Successful performance in most competitions is rewarded with cash prizes or other valuable gifts. Over seven million dollars have been awarded on Kaggle over seven years.
  • Career. Sometimes a strong result also helps with finding a job.
  • Experience. This, of course, is the most important thing. You can explore a new area and start solving problems that you have not encountered before.

Currently, machine learning trainings are held every second Saturday. The venue is the Moscow office of Yandex; the usual number of attendees (guests plus Yandex employees) is 60-80 people. The main feature of the trainings is their topicality: each time, a competition that ended one or two weeks ago is analyzed. This makes exact planning difficult, but the competition is still fresh in memory, and many people who have tried their hand at it gather in the hall. The trainings are supervised by Emil Kayumov, who, by the way, helped with writing this post.

In addition, there is another format: joint solution sessions, where novice specialists take part in ongoing competitions together. These sessions are held on the Saturdays when there are no trainings. Anyone can come to both types of events; announcements are published in the groups.

 
