Cluster analysis is a method of exploring data by dividing it into groups of objects with similar characteristics.

The object of research in applied statistics is statistical data obtained as a result of observations or experiments. Statistical data are a collection of objects (observations, cases) and the features (variables) that characterize them. For example, the objects of study may be the countries of the world, and the features the geographic and economic indicators that characterize them: continent; terrain height above sea level; average annual temperature; the country's position in the quality-of-life ranking; share of GDP per capita; public spending on health care, education, and the army; average life expectancy; unemployment and illiteracy rates; quality of life index, and so on.
Variables are quantities that, as a result of measurement, can take on different values.
Independent variables are variables whose values can be changed during the experiment, while dependent variables are variables whose values can only be measured.
Variables can be measured on a variety of scales. The difference between the scales is determined by their information content. Consider the following types of scales, presented in ascending order of information content: nominal, ordinal, interval, ratio, and absolute. These scales also differ in the number of permissible mathematical operations. The "poorest" scale is the nominal one, since not a single arithmetic operation is defined on it; the "richest" is the absolute scale.
Measurement in the nominal (classification) scale means determining the belonging of an object (observation) to a particular class. For example: gender, branch of service, profession, continent, etc. In this scale, you can only count the number of objects in classes - frequency and relative frequency.
Measurement on an ordinal (rank) scale, in addition to determining the class an object belongs to, makes it possible to order observations by comparing them with each other in some respect. However, this scale does not determine the distance between classes, only which of two observations is preferable. Therefore, ordinal experimental data, even if represented by numbers, cannot be treated as numbers, and arithmetic operations cannot be performed on them. In this scale, besides counting frequencies, one can calculate the rank of an object. Examples of variables measured on an ordinal scale: student grades, prize places in competitions, military ranks, a country's position in a quality-of-life ranking, and so on. Nominal and ordinal variables are sometimes called categorical, or grouping, variables, since they allow the objects of study to be divided into subgroups.
When measured on an interval scale, the ordering of the observations can be done so precisely that the distances between any two of them are known. The scale of intervals is unique up to linear transformations (y = ax + b). This means that the scale has an arbitrary reference point - conditional zero. Examples of variables measured on an interval scale: temperature, time, terrain altitude above sea level. Variables in this scale can be used to determine the distance between observations. Distances are full-fledged numbers and any arithmetic operations can be performed on them.
The ratio scale is similar to the interval scale, but it is unique up to a transformation of the form y = ax. This means that the scale has a fixed reference point, an absolute zero, but an arbitrary unit of measurement. Examples of variables measured on a ratio scale: length, weight, electric current, amount of money, public spending on health, education, and the military, life expectancy, and so on. Measurements in this scale are full-fledged numbers and any arithmetic operations can be performed on them.
An absolute scale has both absolute zero and an absolute unit of measurement (scale). An example of an absolute scale is a number line. This scale is dimensionless, so measurements on it can be used as an exponent or base of a logarithm. Examples of measurements on an absolute scale: unemployment rate; the proportion of illiterate people, the quality of life index, etc.
Most statistical methods relate to parametric statistics methods, which are based on the assumption that a random vector of variables forms some multivariate distribution, as a rule, normal or transforms to a normal distribution. If this assumption is not confirmed, you should use nonparametric methods of mathematical statistics.

Correlation analysis. There can be a functional relationship between variables (random variables), which manifests itself in one of them being defined as a function of the other. But there can also be a connection of another kind, in which one variable reacts to a change in the other by changing its distribution law. Such a relationship is called stochastic. It appears when there are common random factors affecting both variables. The correlation coefficient (r), which varies from -1 to +1, is used as a measure of the relationship between the variables. If the correlation coefficient is negative, then as the values of one variable increase, the values of the other decrease. If the variables are independent, the correlation coefficient is 0 (the converse is true only for variables with a normal distribution); variables with zero correlation are called uncorrelated. If the correlation coefficient is not equal to 0, there is a dependence between the variables, and the closer |r| is to 1, the stronger the dependence. The correlation coefficient reaches its limit values +1 or -1 if and only if the relationship between the variables is linear. Correlation analysis makes it possible to establish the strength and direction of the stochastic relationship between variables (random variables). If the variables are measured at least on an interval scale and have a normal distribution, correlation analysis is carried out by calculating the Pearson correlation coefficient; otherwise the Spearman, Kendall tau, or gamma correlations are used.
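As a rough illustration (synthetic data, arbitrary coefficients), the Pearson, Spearman, and Kendall coefficients can be computed in Python with SciPy:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.6 * x + rng.normal(scale=0.8, size=200)   # linearly related with noise

r_pearson, p_pearson = stats.pearsonr(x, y)     # assumes interval scale and normality
rho, p_spearman = stats.spearmanr(x, y)         # rank-based, no normality assumption
tau, p_tau = stats.kendalltau(x, y)

print(f"Pearson r={r_pearson:.2f}, Spearman rho={rho:.2f}, Kendall tau={tau:.2f}")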

Regression analysis. Regression analysis models the relationship of one random variable to one or more other random variables. In this case, the first variable is called dependent, and the rest are called independent. The choice or assignment of the dependent and independent variables is arbitrary (conditional) and is carried out by the researcher depending on the problem he is solving. The independent variables are called factors, regressors, or predictors, and the dependent variable is called the outcome characteristic, or response.
If there is one predictor, the regression is called simple (univariate); if there is more than one, it is called multiple (multivariate). In the general case the regression model can be written as follows:

Y = f(x1, x2, ..., xn),

where Y is the dependent variable (response), xi (i = 1, ..., n) are the predictors (factors), and n is the number of predictors.
Regression analysis can be used to solve a number of problems important for the study:
1) Reducing the dimension of the space of analyzed variables (the factor space) by replacing some of the factors with a single variable, the response. This problem is solved more fully by factor analysis.
2) Quantifying the effect of each factor. Multiple regression allows the researcher to ask (and probably answer) the question of "what is the best predictor for ...". At the same time, the effect of individual factors on the response becomes clearer, and the researcher gains a better understanding of the nature of the phenomenon under study.
3) Calculating predicted values of the response for given values of the factors. Regression analysis thus creates the basis for a computational experiment aimed at answering questions such as "What will happen if ...".
4) In regression analysis, the causal mechanism appears in a more explicit form. In this case, the forecast lends itself better to meaningful interpretation.
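A minimal sketch of a multiple regression with two predictors, using scikit-learn on synthetic data (the coefficients and noise level are arbitrary):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))                      # two predictors x1, x2
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)               # estimates Y = b0 + b1*x1 + b2*x2
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)                # effect of each factor on the response
print("prediction for x1=1, x2=0:", model.predict([[1.0, 0.0]]))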

Canonical analysis. Canonical analysis is intended for analyzing dependencies between two lists of features (independent variables) that characterize objects. For example, one can study the relationship between various adverse factors and the appearance of a certain group of disease symptoms, or the relationship between two groups of a patient's clinical and laboratory parameters (syndromes). Canonical analysis is a generalization of multiple correlation as a measure of the relationship between one variable and many other variables. As is known, multiple correlation is the maximum correlation between one variable and a linear function of other variables. This concept has been generalized to the case of relationships between sets of variables, that is, between the features that characterize objects. In this case, it is enough to restrict consideration to a small number of the most correlated linear combinations from each set. Suppose, for example, that the first set consists of the features y1, ..., yp and the second of x1, ..., xq; then the relationship between these sets can be estimated as the correlation between the linear combinations a1y1 + a2y2 + ... + apyp and b1x1 + b2x2 + ... + bqxq, which is called the canonical correlation. The problem of canonical analysis is to find the weight coefficients so that the canonical correlation is maximal.
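A minimal sketch of canonical correlation between two synthetic feature sets, using the CCA class from scikit-learn (the data are artificial and driven by one shared latent variable):

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(2)
latent = rng.normal(size=(300, 1))
Y = latent @ rng.normal(size=(1, 3)) + 0.5 * rng.normal(size=(300, 3))  # first feature set y1..y3
X = latent @ rng.normal(size=(1, 4)) + 0.5 * rng.normal(size=(300, 4))  # second feature set x1..x4

cca = CCA(n_components=1).fit(X, Y)
U, V = cca.transform(X, Y)                 # canonical variates b'x and a'y
r = np.corrcoef(U[:, 0], V[:, 0])[0, 1]    # first canonical correlation
print(f"first canonical correlation: {r:.2f}")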

Methods for comparing means. In applied research, there are often cases when the average result of some feature in one series of experiments differs from the average result in another series. Since the averages are the results of measurements, they will, as a rule, always differ; the question is whether the observed discrepancy between the means can be explained by the inevitable random errors of the experiment or whether it is caused by specific reasons. If two means are being compared, Student's test (t-test) can be applied. This is a parametric criterion, since it assumes that the feature has a normal distribution in each series of experiments. Nonparametric criteria for comparing means (such as the Mann-Whitney test) are now also widely used.
Comparison of mean results is one way of identifying dependencies between the variable features characterizing the studied set of objects (observations). If, when the objects of study are divided into subgroups using a categorical independent variable (predictor), the hypothesis of inequality of the means of some dependent variable in the subgroups is true, then there is a stochastic relationship between this dependent variable and the categorical predictor. For example, if the hypothesis of equality of the average indicators of physical and intellectual development of children in the groups of mothers who did and did not smoke during pregnancy turns out to be false, this means that there is a relationship between the mother's smoking during pregnancy and the child's intellectual and physical development.
The most common method for comparing means is analysis of variance. In ANOVA terminology, a categorical predictor is called a factor.
Analysis of variance can be defined as a parametric statistical method designed to assess the influence of various factors on the result of an experiment, as well as for the subsequent planning of experiments. Analysis of variance therefore makes it possible to investigate the dependence of a quantitative trait on one or more qualitative factor traits. If one factor is considered, one-way analysis of variance is used; otherwise multifactor analysis of variance is used.
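A short illustration with SciPy on synthetic data: a t-test and its nonparametric counterpart for two groups, and a one-way ANOVA for three groups (group means and sizes are arbitrary):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group_a = rng.normal(loc=10.0, scale=2.0, size=40)
group_b = rng.normal(loc=11.0, scale=2.0, size=40)
group_c = rng.normal(loc=10.5, scale=2.0, size=40)

t, p_t = stats.ttest_ind(group_a, group_b)          # parametric comparison of two means
u, p_u = stats.mannwhitneyu(group_a, group_b)       # nonparametric alternative
f, p_f = stats.f_oneway(group_a, group_b, group_c)  # one-way ANOVA for several groups
print(f"t-test p={p_t:.3f}, Mann-Whitney p={p_u:.3f}, ANOVA p={p_f:.3f}")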

Frequency analysis. Frequency tables, or one-way tables as they are also called, are the simplest method for analyzing categorical variables. Frequency tables can also be used successfully to investigate quantitative variables, although interpretation of the results can be difficult. This type of statistical study is often used as an exploratory analysis procedure to see how different groups of observations are distributed in the sample, or how the values of a feature are distributed over the interval from the minimum to the maximum value. Frequency tables are typically illustrated graphically with histograms.

Cross-tabulation (contingency tables) is the process of combining two (or more) frequency tables so that each cell in the constructed table corresponds to a single combination of values or levels of the tabulated variables. Cross-tabulation makes it possible to combine the frequencies of occurrence of observations at different levels of the factors under consideration. By examining these frequencies, one can identify relationships between the tabulated variables and explore the structure of these relationships. Usually categorical or quantitative variables with relatively few values are tabulated. If a continuous variable (say, blood sugar) needs to be tabulated, it must first be recoded by dividing its range into a small number of intervals (for example, level: low, medium, high).
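A small sketch with pandas on invented data: a continuous "blood sugar" variable is recoded into low/medium/high levels (the cut points are arbitrary) and cross-tabulated against a grouping variable:

import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "group": rng.choice(["control", "treatment"], size=200),
    "sugar": rng.normal(loc=5.5, scale=1.2, size=200),   # continuous variable, e.g. blood sugar
})

# Recode the continuous variable into a small number of levels before tabulating
df["sugar_level"] = pd.cut(df["sugar"], bins=[0, 4.5, 6.5, np.inf],
                           labels=["low", "medium", "high"])

table = pd.crosstab(df["group"], df["sugar_level"])      # two-way contingency table
print(table)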

Correspondence analysis. Correspondence analysis provides more powerful descriptive and exploratory methods for analyzing two-way and multi-way tables than frequency analysis. Like contingency tables, the method makes it possible to explore the structure and relationships of the grouping variables included in the table. In classical correspondence analysis, the frequencies in the contingency table are standardized (normalized) so that the sum of the elements in all cells equals 1.
One of the goals of correspondence analysis is to represent the contents of a table of relative frequencies as distances between individual rows and/or columns of the table in a lower-dimensional space.

Cluster analysis. Cluster analysis is a classification method; its main purpose is to divide the set of objects and features under study into groups, or clusters, that are homogeneous in a certain sense. It is a multivariate statistical method, so it is assumed that the initial data can be of significant volume, i.e. both the number of objects of study (observations) and the number of features characterizing these objects can be very large. The great advantage of cluster analysis is that it makes it possible to partition objects not by one feature but by a whole set of features. In addition, cluster analysis, unlike most mathematical and statistical methods, does not impose any restrictions on the type of objects under consideration and allows one to study a variety of initial data of almost arbitrary nature. Since clusters are groups of homogeneity, the task of cluster analysis is to divide the set of objects into m (an integer) clusters on the basis of the objects' attributes so that each object belongs to exactly one group of the partition. Objects belonging to one cluster must be homogeneous (similar), while objects belonging to different clusters must be heterogeneous. If the objects being clustered are represented as points in an n-dimensional feature space (n being the number of features characterizing the objects), then the similarity between objects is defined through the concept of distance between points, since it is intuitively clear that the smaller the distance between objects, the more similar they are.
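A minimal clustering sketch with scikit-learn on synthetic data containing three artificial groups; the features are standardized first because distances are scale-sensitive:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
# Objects described by three features; three hidden groups with different centers
X = np.vstack([rng.normal(loc=c, scale=0.7, size=(50, 3)) for c in (0.0, 4.0, 8.0)])

X_std = StandardScaler().fit_transform(X)        # put all features on a comparable scale
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)
print(np.bincount(labels))                       # sizes of the three recovered clusters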

Discriminant analysis. Discriminant analysis includes statistical methods for classifying multivariate observations in a situation where the researcher has so-called training samples. This type of analysis is multivariate, since it uses several features of an object, the number of which can be as large as desired. The purpose of discriminant analysis is to classify an object on the basis of measurements of its various characteristics (features), that is, to assign it to one of several given groups (classes) in some optimal way. It is assumed that the initial data, along with the attributes of the objects, contain a categorical (grouping) variable that determines which group each object belongs to. Discriminant analysis therefore provides for checking the consistency of the classification carried out by the method against the original empirical classification. The optimal method is understood as either the minimum expected loss or the minimum probability of misclassification. In the general case, the problem of discrimination is formulated as follows. Let the result of an observation of an object be a k-dimensional random vector X = (X1, X2, ..., Xk), where X1, X2, ..., Xk are the features of the object. It is required to establish a rule according to which, based on the values of the coordinates of the vector X, the object is assigned to one of the possible populations i, i = 1, 2, ..., n. Discrimination methods can be roughly divided into parametric and nonparametric. In parametric methods it is known that the distribution of the feature vectors in each population is normal, but there is no information about the parameters of these distributions. Nonparametric discrimination methods do not require knowledge of the exact functional form of the distributions and allow discrimination problems to be solved on the basis of scant a priori information about the populations, which is especially valuable for practical applications. If the conditions for the applicability of discriminant analysis are met (the independent variables, or features, also called predictors, are measured at least on an interval scale and their distribution corresponds to the normal law), classical discriminant analysis should be used; otherwise, the method of general discriminant analysis models should be used.
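A minimal sketch of classical (linear) discriminant analysis with scikit-learn, trained on a synthetic two-class training sample and checked on held-out objects:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
# Training sample: feature vectors X and the known group (class) of each object
X = np.vstack([rng.normal(loc=0.0, size=(60, 4)), rng.normal(loc=1.5, size=(60, 4))])
y = np.array([0] * 60 + [1] * 60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
print("classification accuracy on held-out objects:", lda.score(X_test, y_test))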

Factor analysis. Factor analysis is one of the most popular multivariate statistical methods. While cluster and discriminant methods classify observations by dividing them into homogeneous groups, factor analysis classifies the features (variables) that describe the observations. Its main goal is therefore to reduce the number of variables on the basis of a classification of the variables and a determination of the structure of relationships between them. The reduction is achieved by extracting hidden (latent) common factors that explain the relationships between the observed features of an object: instead of the initial set of variables, it becomes possible to analyze data on the extracted factors, whose number is significantly smaller than the initial number of interrelated variables.
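A minimal factor analysis sketch with scikit-learn: six synthetic observed variables are generated from two latent factors, and the estimated loadings are recovered (the data and loadings are artificial):

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(7)
# Six observed variables driven by two latent (hidden) factors plus noise
factors = rng.normal(size=(500, 2))
loadings = rng.normal(size=(2, 6))
X = factors @ loadings + 0.3 * rng.normal(size=(500, 6))

fa = FactorAnalysis(n_components=2).fit(X)
print(np.round(fa.components_, 2))   # estimated loadings: how each variable relates to the factors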

Classification trees. Classification trees are a classification method that makes it possible to predict the membership of objects in a particular class depending on the corresponding values of the features characterizing the objects. The features are called independent variables, and the variable indicating class membership is called dependent. Unlike classical discriminant analysis, classification trees can perform univariate splits on variables of various types: categorical, ordinal, and interval. No restrictions are imposed on the distribution law of quantitative variables. By analogy with discriminant analysis, the method makes it possible to analyze the contributions of individual variables to the classification procedure. Classification trees can be, and sometimes are, very complex. However, the use of special graphical procedures makes it possible to simplify interpretation of the results even for very complex trees. The ability to present the results graphically and the ease of interpretation largely explain the great popularity of classification trees in applied areas; however, the most important distinguishing properties of classification trees are their hierarchical structure and wide applicability. The method is designed so that the user can construct trees of arbitrary complexity using controllable parameters, achieving minimal classification error. But classifying a new object with a complex tree is difficult because of the large set of decision rules. Therefore, when constructing a classification tree, the user must find a reasonable compromise between the complexity of the tree and the complexity of the classification procedure. The wide applicability of classification trees makes them a very attractive tool for data analysis, but it should not be assumed that they are recommended in place of traditional classification methods. On the contrary, if the more rigorous theoretical assumptions imposed by traditional methods are met, and the sample distribution has certain special properties (for example, the distribution of the variables corresponds to the normal law), then traditional methods will be more effective. However, as a method of exploratory analysis, or as a last resort when all traditional methods fail, classification trees are, in the opinion of many researchers, unmatched.
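A short classification tree sketch with scikit-learn on the built-in iris data; the max_depth parameter illustrates the compromise between tree complexity and interpretability:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth limits tree complexity: deeper trees fit better but are harder to interpret
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=["sepal len", "sepal wid", "petal len", "petal wid"]))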

Principal component analysis and classification. In practice, the problem of analyzing high-dimensional data often arises. Principal component analysis and classification make it possible to solve this problem and serve two purposes:
- reduction of the total number of variables (data reduction) in order to obtain "principal", uncorrelated variables;
- classification of variables and observations, using the constructed factor space.
The method is similar to factor analysis in the formulation of the problems being solved, but it has a number of significant differences:
- principal component analysis does not use iterative methods to extract factors;
- along with the active variables and observations used to extract the principal components, auxiliary variables and/or observations can be specified; the auxiliary variables and observations are then projected onto the factor space computed from the active variables and observations;
- these capabilities make the method a powerful tool for classifying variables and observations simultaneously.
The solution to the main problem of the method is achieved by creating a vector space of latent (hidden) variables (factors) with a dimension less than the original one. The original dimension is determined by the number of variables for analysis in the original data.
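A minimal principal component sketch with scikit-learn: ten correlated synthetic variables are reduced to three components, and the observations are projected into the reduced factor space:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
# Ten correlated variables generated from three underlying directions plus noise
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 10)) + 0.2 * rng.normal(size=(200, 10))

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=3).fit(X_std)
print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 2))
scores = pca.transform(X_std)       # observations projected into the reduced factor space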

Multidimensional scaling. The method can be viewed as an alternative to factor analysis, in which a reduction in the number of variables is achieved by extracting latent (not directly observable) factors that explain the relationships between the observed variables. The purpose of multidimensional scaling is to find and interpret latent variables that enable the user to explain the similarities between objects given as points in the original feature space. In practice, the indicators of similarity between objects can be distances or degrees of connection between them. In factor analysis, similarities between variables are expressed using a matrix of correlation coefficients. In multidimensional scaling, an arbitrary type of object similarity matrix can be used as input data: distances, correlations, and so on. Although the questions studied are similar in nature, the methods of multidimensional scaling and factor analysis have a number of significant differences. Factor analysis requires that the data under study obey a multivariate normal distribution and that the dependences be linear. Multidimensional scaling imposes no such restrictions; it can be applied whenever a matrix of pairwise similarities of objects is given. In terms of differences in the results obtained, factor analysis tends to extract more factors (latent variables) than multidimensional scaling does, so multidimensional scaling often leads to solutions that are easier to interpret. More importantly, however, multidimensional scaling can be applied to any type of distance or similarity, whereas factor analysis requires that the correlation matrix of the variables be used as input, or that the correlation matrix first be computed from the source data file. The main assumption of multidimensional scaling is that there is a certain metric space of essential basic characteristics that implicitly served as the basis for the obtained empirical data on the proximity between pairs of objects. Objects can therefore be thought of as points in this space. It is also assumed that objects that are closer (according to the initial matrix) correspond to smaller distances in the space of basic characteristics. Multidimensional scaling is thus a set of methods for analyzing empirical data on the proximity of objects, with the help of which the dimensionality of the space of characteristics of the measured objects that are essential for a given substantive problem is determined and the configuration of points (objects) in this space is constructed. This space ("multidimensional scale") is similar to commonly used scales in the sense that the values of the essential characteristics of the measured objects correspond to certain positions on the axes of the space. The logic of multidimensional scaling can be illustrated by the following simple example. Suppose there is a matrix of pairwise distances (i.e., similarities of some features) between certain cities. Analyzing the matrix, one must position points with the coordinates of the cities in two-dimensional space (on a plane), preserving the actual distances between them as far as possible. The resulting placement of points on the plane can later be used as an approximate geographic map. In the general case, multidimensional scaling thus makes it possible to place objects (the cities in our example) in a space of some small dimension (here it is two) so as to reproduce the observed distances between them adequately.
As a result, these distances can be expressed in terms of the found latent variables. In our example, the distances can be explained in terms of a pair of geographic coordinates: North/South and East/West.
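A minimal multidimensional scaling sketch with scikit-learn: a hypothetical matrix of pairwise distances between four cities (the numbers are invented) is mapped to points on a plane that approximately reproduce those distances:

import numpy as np
from sklearn.manifold import MDS

# Hypothetical matrix of pairwise distances (km) between four cities A, B, C, D
cities = ["A", "B", "C", "D"]
D = np.array([
    [0, 650, 400, 900],
    [650, 0, 300, 550],
    [400, 300, 0, 500],
    [900, 550, 500, 0],
], dtype=float)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)      # points on a plane approximately reproducing the distances
for name, (x, y) in zip(cities, coords):
    print(f"{name}: ({x:7.1f}, {y:7.1f})")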

Structural equation modeling (causal modeling). Recent progress in multivariate statistical analysis and the analysis of correlation structures, combined with the latest computational algorithms, served as the starting point for the creation of the new but already recognized technique of structural equation modeling (SEPATH). This extraordinarily powerful multivariate analysis technique includes methods from various fields of statistics; multiple regression and factor analysis are naturally developed and combined in it.
The object of modeling by structural equations is complex systems, the internal structure of which is not known ("black box"). Observing the parameters of the system using SEPATH, one can investigate its structure, establish cause-and-effect relationships between the elements of the system.
The statement of the structural modeling problem is as follows. Let there be variables for which statistical moments are known, for example, a matrix of sample correlation coefficients or covariance. Such variables are called explicit. They can be characteristics of a complex system. The real relationships between the observed explicit variables can be quite complex, but we assume that there are a number of latent variables that explain the structure of these relationships with a certain degree of accuracy. Thus, with the help of latent variables, a model of relationships between explicit and implicit variables is built. In some tasks, latent variables can be considered as causes, and explicit ones as consequences, therefore, such models are called causal. It is assumed that hidden variables, in turn, can be related to each other. The structure of links is allowed to be quite complex, but its type is postulated - these are links described by linear equations. Some parameters of linear models are known, some are not, and are free parameters.
The main idea of structural equation modeling is that you can check whether the variables Y and X are related by a linear relationship Y = aX by analyzing their variances and covariances. This idea is based on a simple property of the mean and variance: if you multiply each number by some constant k, the mean is also multiplied by k, and the standard deviation is multiplied by the modulus of k. For example, consider the set of three numbers 1, 2, 3. These numbers have a mean of 2 and a standard deviation of 1. If you multiply all three numbers by 4, it is easy to calculate that the mean becomes 8, the standard deviation 4, and the variance 16. Thus, if there are sets of numbers X and Y related by Y = 4X, then the variance of Y must be 16 times greater than the variance of X. Therefore, you can test the hypothesis that Y and X are related by the equation Y = 4X by comparing the variances of the variables Y and X. This idea can be generalized in various ways to several variables linked by a system of linear equations. In that case the transformation rules become more cumbersome and the calculations more complicated, but the main idea remains the same: you can check whether variables are related by a linear relationship by studying their variances and covariances.
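The arithmetic from the example above can be checked directly (using the sample standard deviation, i.e. the n-1 formula, as in the text):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = 4.0 * x                                        # hypothesized linear relation Y = 4X

sd_x = x.std(ddof=1)                               # 1.0, as in the text
sd_y = y.std(ddof=1)                               # 4.0: multiplied by |k| = 4
print(sd_x, sd_y, y.var(ddof=1) / x.var(ddof=1))   # variance ratio is 16, consistent with Y = 4X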

Survival analysis methods. Survival analysis methods were originally developed in medical and biological research and in insurance, but they later became widely used in the social and economic sciences, as well as in industry for engineering problems (analysis of reliability and failure times). Imagine you are studying the effectiveness of a new treatment or drug. Obviously, the most important and objective characteristic is the average life expectancy of patients from the moment of admission to the clinic, or the average duration of remission of the disease. Standard parametric and nonparametric methods could be used to describe mean lifetimes or remissions. However, the analyzed data has a significant feature: there may be patients who survived throughout the entire observation period, and in some of them the disease is still in remission. There may also be a group of patients with whom contact was lost before the end of the experiment (for example, they were transferred to other clinics). If standard methods were used to estimate the mean, this group of patients would have to be excluded, thereby losing important information that was hard to collect. Moreover, most of these patients survived (recovered) during the time they were observed, which speaks in favor of the new method of treatment (drug). Information of this kind, when there are no data on the occurrence of the event of interest, is called incomplete; if there are data on the occurrence of the event of interest, the information is called complete. Observations that contain incomplete information are called censored observations. Censored observations are typical when the observed quantity is the time until some critical event occurs and the duration of observation is limited. The use of censored observations is specific to the method under consideration, survival analysis. This method investigates the probabilistic characteristics of the time intervals between successive occurrences of critical events. Such research is called analysis of durations until termination, defined as the time intervals between the beginning of observation of an object and the moment of termination, at which the object ceases to satisfy the properties specified for observation. The purpose of the research is to determine the conditional probabilities associated with the durations until termination. The construction of life tables, the fitting of survival distributions, and the estimation of the survival function using the Kaplan-Meier procedure are descriptive methods for examining censored data. Some of the methods make it possible to compare survival in two or more groups. Finally, survival analysis includes regression models for estimating relationships between multivariate continuous variables with values similar to lifetimes.
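A minimal Kaplan-Meier sketch, assuming the third-party lifelines package is installed (pip install lifelines); the durations and censoring flags are invented:

from lifelines import KaplanMeierFitter

durations = [5, 8, 12, 12, 15, 20, 24, 30, 30, 36]     # months of follow-up per patient
observed  = [1, 1,  1,  0,  1,  1,  0,  1,  0,  0]     # 0 = censored (event not observed)

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=observed)
print(kmf.survival_function_)            # Kaplan-Meier estimate of S(t); censoring is handled, not discarded
print("median survival:", kmf.median_survival_time_)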
General discriminant analysis models. If the conditions for the applicability of discriminant analysis (DA) are not met, that is, the independent variables (predictors) are not measured at least on an interval scale or their distribution does not correspond to the normal law, the method of general discriminant analysis models (GDA) should be used. The method has this name because it uses the general linear model (GLM) to analyze the discriminant functions. In this module, discriminant function analysis is treated as a general multivariate linear model in which the categorical dependent variable (response) is represented by vectors of codes denoting the different groups for each observation. The GDA method has a number of significant advantages over classical discriminant analysis. For example, no restrictions are placed on the type of predictor used (categorical or continuous) or on the type of model being estimated; stepwise selection of predictors and selection of the best subset of predictors are possible; and if there is a cross-validation sample in the data file, the selection of the best subset of predictors can be based on the misclassification rate for the cross-validation sample, and so on.

Time series. Time series analysis is one of the most intensively developing and promising areas of mathematical statistics. A time (dynamic) series is a sequence of observations of some feature X (a random variable) at successive, equally spaced moments t. The individual observations are called the levels of the series and are denoted xt, t = 1, ..., n. When studying a time series, several components are distinguished:
xt = ut + yt + ct + et,   t = 1, ..., n,

where ut is the trend, a smoothly changing component that describes the net influence of long-term factors (population decline, income decline, etc.); yt is the seasonal component, reflecting the recurrence of processes over relatively short periods (a day, a week, a month, etc.); ct is the cyclical component, reflecting the recurrence of processes over long periods exceeding one year; and et is the random component, reflecting the influence of random factors that cannot be taken into account and recorded. The first three components are deterministic. The random component is formed as a result of the superposition of a large number of external factors, each of which individually has an insignificant effect on the change in the values of the feature X. Analysis and study of a time series make it possible to build models for predicting the values of the feature X in the future, if the sequence of past observations is known.
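A minimal decomposition sketch using statsmodels on a synthetic monthly series with an artificial trend and a 12-month seasonal component:

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

t = np.arange(120)
# Monthly series: trend + seasonal component + random noise
x = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + np.random.default_rng(9).normal(scale=2, size=120)
series = pd.Series(x, index=pd.date_range("2010-01-01", periods=120, freq="MS"))

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())      # estimated trend component ut
print(result.seasonal.head(12))          # estimated seasonal component yt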

Neural networks. Neural networks are a computing system whose architecture is analogous to the construction of nerve tissue from neurons. The values of the input parameters, on the basis of which certain decisions must be made, are fed to the neurons of the lowest layer. For example, according to the values of a patient's clinical and laboratory parameters, the patient must be assigned to one or another group by disease severity. These values are perceived by the network as signals transmitted to the next layer, weakened or amplified depending on the numerical values (weights) attributed to the interneuron connections. As a result, a certain value is produced at the output of a neuron in the top layer, which is regarded as the response of the entire network to the input parameters. For the network to work, it must be "trained" on data for which the values of the input parameters and the correct responses to them are known. Training consists in selecting the weights of the interneuron connections that bring the network's answers as close as possible to the known correct answers. Neural networks can be used to classify observations.
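A minimal sketch of training a small neural network classifier with scikit-learn on synthetic "clinical" data (all features and groups are artificial):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 numeric features, two severity groups
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0))
net.fit(X_train, y_train)                       # "training" = fitting the connection weights
print("held-out accuracy:", net.score(X_test, y_test))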

Experiment planning. The art of arranging observations in a certain order, or of conducting specially planned tests in order to make full use of the possibilities of these methods, is the content of the subject of "experiment planning". At present, experimental methods are widely used both in science and in various fields of practical activity. Usually the main goal of a scientific study is to show the statistical significance of the effect of a given factor on a dependent variable of interest, and the main goal of experiment planning is to extract the maximum amount of objective information about the influence of the factors under study on the indicator (dependent variable) of interest to the researcher using the smallest number of expensive observations. Unfortunately, in practice insufficient attention is paid in most cases to research planning: data are collected (as much as can be collected) and then subjected to statistical processing and analysis. But correctly conducted statistical analysis alone is not sufficient to achieve scientific validity, since the quality of any information obtained from data analysis depends on the quality of the data itself. Therefore, experiment planning is increasingly used in applied research. The purpose of experiment planning methods is to study the influence of certain factors on the process under study and to find the optimal levels of the factors that determine the required course of this process.

Quality control charts. In the modern world, the problem of the quality not only of manufactured products but also of services provided to the population is extremely urgent. The well-being of any firm, organization, or institution largely depends on the successful solution of this important problem. The quality of products and services is formed in the process of scientific research and design and technological development, and is ensured by good organization of production and services. But the manufacture of products and the provision of services, regardless of their type, are always associated with a certain variability in the conditions of production and provision. This leads to some variability in quality characteristics. Therefore, it is important to develop quality control methods that make it possible to detect signs of a violation of the technological process or of service provision in a timely manner. At the same time, in order to achieve and maintain a high level of quality that satisfies the consumer, methods are needed that are aimed not at eliminating defects in finished products and inconsistencies in services, but at preventing and predicting the causes of their occurrence. A control chart is a tool that makes it possible to track the progress of a process and to influence it (with the help of appropriate feedback), preventing its deviation from the requirements imposed on the process. Quality control chart tools make extensive use of statistical methods based on probability theory and mathematical statistics. The use of statistical methods makes it possible, with limited volumes of analyzed products, to judge the state of product quality with a given degree of accuracy and reliability. They provide forecasting, optimal regulation of quality problems, and the making of sound management decisions not on the basis of intuition, but with the help of scientific study and identification of patterns in the accumulated arrays of numerical information.
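A simplified sketch of an X-bar control chart in Python, with three-sigma limits estimated from the subgroup standard deviations (a rough approximation of the standard chart constants; the measurements are simulated):

import numpy as np

rng = np.random.default_rng(10)
# 25 subgroups of 5 measurements each from a process assumed to be in control
samples = rng.normal(loc=50.0, scale=2.0, size=(25, 5))
xbar = samples.mean(axis=1)                      # subgroup means plotted on the chart

grand_mean = xbar.mean()
sigma_xbar = samples.std(ddof=1, axis=1).mean() / np.sqrt(samples.shape[1])
ucl = grand_mean + 3 * sigma_xbar                # conventional three-sigma control limits
lcl = grand_mean - 3 * sigma_xbar

out_of_control = np.where((xbar > ucl) | (xbar < lcl))[0]
print(f"CL={grand_mean:.2f}, UCL={ucl:.2f}, LCL={lcl:.2f}, signals at subgroups {out_of_control}")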

Kritzman V. A., Rozen B. Ya., Dmitriev I. S. To the Secrets of the Structure of Matter. Higher School, 1983.

Revolutionary discoveries in natural science were often carried out under the influence of the results of experiments set up by talented experimenters. Great experiments in biology, chemistry, physics contributed to a change in the idea of ​​the world in which we live, of the structure of matter, of the mechanisms of transmission of heredity. Other theoretical and technological discoveries were made on the basis of the results of great experiments.

§ 9. Theoretical research methods

Lesson-lecture

There is something in the world more important than the finest discoveries: the knowledge of the methods by which they were made.

Leibniz

Key terms: method, classification, systematization, systematics, induction, deduction.

Observation and description of physical phenomena. Physical laws. (Physics, grades 7 - 9).

What is a method. In science, a method is a way of constructing knowledge, a form of practical and theoretical mastery of reality. Francis Bacon compared the method to a lamp illuminating a traveler's path in the dark: "Even a lame man walking along a road outpaces one who walks without a road." A correctly chosen method should be clear, logical, lead to a specific goal, and give a result. The doctrine of the system of methods is called methodology.

The methods of cognition used in scientific activity are the empirical (practical, experimental) methods: observation and experiment, and the theoretical (logical, rational) methods: analysis, synthesis, comparison, classification, systematization, abstraction, generalization, modeling, induction, and deduction. In real scientific knowledge these methods are always used in unity. For example, when developing an experiment, a preliminary theoretical understanding of the problem and the formulation of a research hypothesis are required, and after the experiment the results must be processed using mathematical methods. Let us consider the features of some theoretical methods of cognition.

Classification and systematization. Classification allows you to order the material under study by grouping the set (class) of the objects under study into subsets (subclasses) in accordance with the selected feature.

For example, all students in a school can be divided into the subclasses "girls" and "boys". You can choose another characteristic, for example, height. In this case, the classification can be carried out in different ways: for example, set a height boundary of 160 cm and classify students into the subclasses "short" and "tall", or divide the height scale into 10 cm segments, in which case the classification will be more detailed. If the results of such a classification are compared over several years, this makes it possible to establish empirically trends in the physical development of students. Consequently, classification as a method can be used to obtain new knowledge and can even serve as a basis for constructing new scientific theories.
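The height classification described above can be expressed as a short data-processing sketch (the heights are invented; the 10 cm bins follow the example):

import pandas as pd

heights_cm = [148, 152, 155, 158, 160, 161, 163, 167, 170, 171, 175, 182]

# Classification by a single feature: 10 cm height intervals
bins = range(140, 200, 10)
groups = pd.cut(heights_cm, bins=bins, right=False)
print(pd.Series(groups).value_counts().sort_index())   # number of students in each interval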

In science, classifications of the same objects according to different criteria are usually used, depending on the goals. However, only one characteristic (the basis of the classification) is selected at a time. For example, chemists subdivide the class of "acids" into subclasses by the degree of dissociation (strong and weak), by the presence of oxygen (oxygen-containing and oxygen-free), by physical properties (volatile and non-volatile; soluble and insoluble), and by other characteristics.

The classification can change in the course of the development of science.

In the middle of the 20th century, the study of various nuclear reactions led to the discovery of elementary (indivisible) particles. Initially they were classified by mass, which gave rise to leptons (light), mesons (intermediate), baryons (heavy), and hyperons (superheavy). The further development of physics showed that classification by mass has little physical meaning, but the terms were preserved, and as a result there are now leptons that are much more massive than baryons.

It is convenient to reflect the classification in the form of tables or diagrams (graphs). For example, the classification of the planets of the solar system, represented by a diagram - a graph, may look like this:

[Diagram: the major planets of the solar system are divided into the terrestrial planets (Mercury, Venus, Earth, Mars) and the giant planets (Jupiter, Saturn, Uranus, Neptune), with Pluto shown as a separate subclass.]

Note that in this classification the planet Pluto forms a separate subclass and belongs neither to the terrestrial planets nor to the giant planets. Scientists note that Pluto is similar in its properties to an asteroid, of which there may be many on the periphery of the solar system.

In the study of complex natural systems, classification is in fact the first step towards constructing a natural-scientific theory. The next, higher level is systematization (systematics). Systematization is carried out on the basis of the classification of a fairly large amount of material. At the same time, the most significant features are singled out, making it possible to present the accumulated material as a system that reflects all the various relationships between objects. It is needed in cases where there is a great variety of objects and the objects themselves are complex systems. The result of the systematization of scientific data is systematics or, in other words, taxonomy. Systematics as a field of science has developed in such areas of knowledge as biology, geology, linguistics, and ethnography.

The unit of systematics is called a taxon. In biology, taxa are, for example, the phylum, class, family, genus, order, and so on. Taxa of various ranks are combined into a unified system according to the hierarchical principle. Such a system includes a description of all existing and previously extinct organisms and clarifies the paths of their evolution. If scientists discover a new species, they must confirm its place in the overall system. Changes can also be made to the system itself, which remains developing and dynamic. Systematics makes it easy to navigate the whole variety of organisms: about 1.5 million species of animals alone are known, and more than 500 thousand species of plants, not counting other groups of organisms. Modern biological systematics reflects Saint-Hilaire's law: "All the diversity of life forms constitutes a natural taxonomic system consisting of hierarchical groups of taxa of various ranks."

Induction and deduction. The path of cognition, in which, on the basis of the systematization of the accumulated information - from the particular to the general - a conclusion is made about the existing regularity, is called induction. This method as a method of studying nature was developed by the English philosopher F. Bacon. He wrote: “It is necessary to take as many cases as possible - both those where the phenomenon under investigation is present, and those where it is absent, but where it could be expected to be encountered; then you have to arrange them methodically ... and give the most likely explanation; finally, try to verify this explanation by further comparison with the facts. "

Thought and image

Portraits of F. Bacon and S. Holmes

Why are the portraits of a scientist and a literary hero located next to each other?

Induction is not the only way of obtaining scientific knowledge about the world. While experimental physics, chemistry, and biology were built as sciences mainly through induction, theoretical physics and modern mathematics were founded on a system of axioms: consistent, speculative statements that are reliable from the point of view of common sense and the level of the historical development of science. Knowledge can then be built on these axioms by deriving conclusions from the general to the particular, passing from premises to consequences. This method is called deduction. It was developed by René Descartes, the French philosopher and scientist.

A striking example of obtaining knowledge about the same subject in different ways is the discovery of the laws of motion of celestial bodies. At the beginning of the 17th century, on the basis of a large amount of observational data on the motion of the planet Mars, J. Kepler discovered by induction the empirical laws of planetary motion in the solar system. At the end of the same century, Newton deduced the generalized laws of motion of celestial bodies on the basis of the law of universal gravitation.

In real research activity, research methods are interconnected.

1. Explain what a research method is. What is the methodology of natural science?

All these approximations should be justified and the errors introduced by each of them should be estimated numerically.

The development of science shows that every natural-scientific law has its limits of application. For example, Newton's laws turn out to be inapplicable in the study of the processes of the microworld. To describe these processes, the laws of quantum theory have been formulated, which become equivalent to Newton's laws if they are applied to describe the motion of macroscopic bodies. From a modeling point of view, this means that Newton's laws are some kind of model that follows, under certain approximations, from a more general theory. However, the laws of quantum theory are not absolute and have their limitations in their applicability. More general laws have already been formulated and more general equations have been obtained, which in turn also have limitations. And this chain has no end in sight. So far, no absolute laws have been obtained that describe everything in nature, from which all particular laws could be derived. And it is not clear whether such laws can be formulated. But this means that any of the natural-scientific laws is actually some kind of model. The difference from the models considered in this section is only in the fact that natural science laws are a model applicable to describe not one specific phenomenon, but for a wide class of phenomena.

Basically, data mining is about processing information and identifying patterns and trends in it that help you make decisions. The principles of data mining have been known for many years, but with the advent of big data they have become even more widespread.

Big data has led to explosive growth in the popularity of broader data mining techniques, partly because there is so much more information, and by its very nature and content it is becoming more diverse and expansive. When dealing with large datasets, relatively simple and straightforward statistics are no longer enough. With 30 or 40 million detailed purchase records, it is not enough to know that two million of them come from the same location. To better meet the needs of customers, you need to know whether those two million belong to a particular age group and what their average earnings are.

These business requirements have driven a move from simple search and statistical analysis of data to more sophisticated data mining. Solving business problems requires data analysis that makes it possible to build a model describing the information and ultimately leads to the creation of a resulting report. This process is illustrated in Figure 1.

Figure 1. Process flow diagram

The process of analyzing data, searching, and building a model is often iterative, as you need to track down and uncover various pieces of information that can be extracted. You also need to understand how to link, transform, and combine them with other data to get a result. Once new elements and aspects of data are discovered, the approach to identifying data sources and formats and then comparing this information with a given result may change.

Data mining tools

Data mining is not only about the tools or database software used. It can be done with relatively modest database systems and simple tools, including building your own, or with off-the-shelf software packages. Sophisticated data mining draws on past experience and on algorithms defined in existing software and packages, with different specialized tools associated with different methods.

For example, IBM SPSS®, which is rooted in statistical analysis and surveys, allows you to build effective predictive models based on past trends and make accurate forecasts. IBM InfoSphere® Warehouse provides data source discovery, preprocessing, and mining in a single package, allowing you to extract information from the source database directly into the final report.

In recent years, it has become possible to work with very large datasets and cluster / large-scale data processing, which allows for even more complex generalizations of data mining results by groups and comparisons of data. A completely new range of tools and systems is available today, including combined storage and data processing systems.

A wide variety of datasets can be analyzed, including traditional SQL databases, raw text data, key / value sets, and document databases. Clustered databases such as Hadoop, Cassandra, CouchDB, and Couchbase Server store and access data in ways that do not follow a traditional tabular structure.

In particular, a more flexible format for storing a document base gives information processing a new focus and complicates it. SQL databases are strictly structured and adhere to the schema, making it easy to query and parse data with a known format and structure.

Documentary databases that follow a standard structure like JSON, or files with some machine-readable structure, are also easy to handle, although this can be complicated by a varied and fluid structure. For example, in Hadoop, which processes completely "raw" data, it can be difficult to identify and extract information before processing and correlating it.

Basic methods

Several basic methods that are used for data mining describe the type of analysis and the data recovery operation. Unfortunately, different companies and solutions don't always use the same terms, which can add to the confusion and perceived complexity.

Let's look at some key techniques and examples of how to use certain data mining tools.

Association

Association (or relation) is probably the best-known, most familiar, and simplest data mining technique. To identify patterns, a simple comparison is made between two or more elements, often of the same type. For example, by tracking shopping habits, you may notice that cream is usually bought together with strawberries.
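A minimal association sketch in plain Python: counting how often item pairs co-occur in invented shopping baskets and reporting their support:

from collections import Counter
from itertools import combinations

transactions = [
    {"strawberries", "cream", "bread"},
    {"strawberries", "cream"},
    {"bread", "milk"},
    {"strawberries", "cream", "milk"},
    {"bread", "butter"},
]

pair_counts = Counter()
for basket in transactions:
    pair_counts.update(combinations(sorted(basket), 2))   # count co-occurring item pairs

for pair, count in pair_counts.most_common(3):
    support = count / len(transactions)                   # share of baskets containing the pair
    print(pair, f"support={support:.2f}")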

Building data mining tools based on associations or relationships is not difficult. For example, InfoSphere Warehouse provides a wizard that guides you through information flow configurations for creating associations by examining the input source, the decision basis, and the output information. Figure 2 provides an example for the sample database.

Figure 2. Information flow used in the association approach

Classification

Classification can be used to get an idea of the type of customer, product, or object by describing multiple attributes that identify a particular class. For example, cars can easily be classified by type (sedan, SUV, convertible) by defining different attributes (number of seats, body shape, drive wheels). When studying a new car, you can assign it to a certain class by comparing its attributes with a known definition. The same principles can be applied to customers, for example by categorizing them by age and social group.

In addition, the classification can be used as input to other methods. For example, decision trees can be used to define a classification. Clustering allows you to use the common attributes of different classifications in order to identify clusters.

Clustering

By examining one or more attributes or classes, you can group individual data items together to get a structured conclusion. At a simple level, clustering uses one or more attributes as the basis for defining a cluster of similar results. Clustering is useful in identifying different information because it correlates with other examples, so that you can see where similarities and ranges agree.

The clustering method works both ways. You can assume that there is a cluster in a certain region and then use your identification criteria to verify it. The graph shown in Figure 3 is an illustrative example. Here, the age of the buyer is compared with the purchase price. It is reasonable to expect that people in their twenties and thirties (before marriage and children) and those in their fifties and sixties (when the children have left home) have higher disposable income.

Figure 3. Clustering

In this example, two clusters are visible, one around $2,000 / 20-30 years and the other around $7,000-8,000 / 50-65 years. In this case, we formed a hypothesis and tested it on a simple graph that can be plotted with any suitable graphing software. More complex combinations require a complete analytical package, especially if decisions are to be made automatically on the basis of nearest neighbor information.

This clustering is a simplified example of so-called nearest neighbor analysis. Individual buyers can be distinguished by their literal proximity to each other on the chart. It is highly likely that customers from the same cluster share other common attributes, and this assumption can be used to search, classify, and otherwise analyze the members of the dataset.

The clustering method can also be applied in the opposite direction: given certain input attributes, various artifacts can be identified. For example, a recent study of four-digit PIN codes found clusters of numbers in the ranges 1-12 and 1-31 for the first and second pair. By plotting these pairs on a graph, you can see clusters associated with dates (birthdays, anniversaries).
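A minimal sketch of this kind of clustering, assuming scikit-learn is available and using synthetic age/purchase data in place of real customer records:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic (age, purchase amount) points imitating the two groups described above.
young = np.column_stack([rng.uniform(20, 30, 50), rng.normal(2000, 300, 50)])
older = np.column_stack([rng.uniform(50, 65, 50), rng.normal(7500, 500, 50)])
X = np.vstack([young, older])

# Scaling matters: age and dollars live on very different ranges.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
for label in range(2):
    center = X[kmeans.labels_ == label].mean(axis=0)
    print(f"cluster {label}: mean age {center[0]:.0f}, mean purchase ${center[1]:.0f}")
```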

Forecasting

Forecasting is a broad topic that ranges from predicting component failures to detecting fraud and even predicting a company's profit. Combined with other data mining techniques, forecasting involves trend analysis, classification, pattern matching, and relations. By analyzing past events or instances, the future can be predicted.

For example, using credit card authorization data, you can combine decision tree analysis of a person's past transactions with classification and historical pattern matching to identify fraudulent transactions. If a purchase of airline tickets to the United States coincides with transactions in the United States, those transactions are likely to be genuine.

Sequential models

Sequential models, which are often used to analyze long-term data, are a useful technique for identifying trends, or regular recurrences of similar events. For example, by looking at customer data, you can determine that they buy certain sets of products at different times of the year. Based on this information, the shopping basket prediction application can automatically assume that certain products will be added to the shopping cart based on the frequency and history of purchases.

Decision trees

A decision tree, which is related to most other methods (mainly classification and forecasting), can be used either as part of the selection criteria or to support the choice of specific data within the overall structure. A decision tree starts with a simple question that has two (sometimes more) answers. Each answer leads to a further question, helping to classify and identify the data or to make a prediction.
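A toy illustration of this question-by-question structure; the attributes and thresholds are invented, not taken from any real classifier:

```python
# Toy decision tree: each answer leads to the next question.
# The questions and thresholds are invented purely for illustration.
def classify_vehicle(seats: int, convertible_roof: bool) -> str:
    if seats > 5:
        return "SUV / minivan"
    if convertible_roof:
        return "convertible"
    return "sedan"

print(classify_vehicle(seats=7, convertible_roof=False))  # SUV / minivan
print(classify_vehicle(seats=4, convertible_roof=True))   # convertible
```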

Figure 5. Data preparation

The data source, location, and database affect how information is processed and combined.

Reliance on SQL

The simplest of all approaches is often to rely on SQL databases. SQL (and the corresponding table structure) is well understood, but the structure and format of the information cannot be completely ignored. For example, when studying user behavior in sales data in an SQL data model (and in data mining in general), there are two main formats you can use: transactional and behavioral-demographic.

With InfoSphere Warehouse, building a demographic-behavioral model to analyze customer data and understand customer behavior involves taking raw SQL data based on transaction information and known customer parameters and organizing it into a predefined tabular structure. InfoSphere Warehouse can then use this information for clustering and classification data mining to obtain the desired result. Customer demographic and transactional data can be combined and then converted into a format that allows the analysis of specific data, as shown in Figure 6.

Figure 6. Custom data analysis format

For example, sales data can be used to identify sales trends for specific products. Raw sales data for individual items can be converted into transaction information that maps customer IDs to transaction data and item codes. Using this information, it is easy to identify sequences and relationships for individual products and individual buyers over time. This allows InfoSphere Warehouse to compute sequential information, determining, for example, when a customer is likely to purchase the same item again.
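A hedged sketch of this kind of reshaping using pandas rather than InfoSphere Warehouse; the raw sales rows are invented:

```python
import pandas as pd

# Invented raw sales rows: one line per item sold.
raw_sales = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 1],
    "item_code": ["A10", "B20", "A10", "C30", "A10"],
    "date": pd.to_datetime(["2024-01-05", "2024-01-05",
                            "2024-01-07", "2024-02-01", "2024-03-02"]),
})

# Transaction-style view: how often each customer bought each item, and when last.
transactions = (
    raw_sales
    .groupby(["customer_id", "item_code"])
    .agg(purchases=("date", "size"), last_purchase=("date", "max"))
    .reset_index()
)
print(transactions)
```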

From the original data, you can create new data analysis points. For example, you can expand (or refine) product information by matching or classifying individual products into broader groups, and then analyze the data for those groups instead of individual customers.

Figure 7. MapReduce structure

In the previous example, the original data in a document database was processed (in this case through MapReduce) and converted into a tabular format in an SQL database for data mining purposes.

Working with this complex and even unstructured information may require more preparation and processing. Some complex data types and structures cannot be processed and prepared in the form you need in one step. In this case, you can chain the MapReduce output either into a sequential transformation that obtains the required data structure, as shown in Figure 8, or into separate passes that produce multiple output tables.

Figure 8. Consecutive output chain of MapReduce processing results

For example, in one pass you can take the source information from the document database and run a MapReduce operation to obtain a brief overview of that information by date. A good example of a sequential process is regenerating the information and combining the results with a decision matrix (created in the second stage of MapReduce processing), with subsequent further simplification into a sequential structure. During the MapReduce processing phases, the whole dataset must be available to support the individual data-processing steps.
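The chaining idea can be sketched in plain Python without Hadoop; the two "stages" below stand in for separate MapReduce passes, and the documents and threshold are made up:

```python
from collections import defaultdict

# Invented "documents"; a real job would read them from a document store.
docs = [
    {"date": "2024-01-05", "amount": 120},
    {"date": "2024-01-05", "amount": 80},
    {"date": "2024-01-07", "amount": 40},
]

# Stage 1: map each document to a (key, value) pair, then reduce by key.
mapped = [(doc["date"], doc["amount"]) for doc in docs]
daily_totals = defaultdict(int)
for key, value in mapped:
    daily_totals[key] += value

# Stage 2: a second pass over the first stage's output, here reduced to
# flagging days above an arbitrary threshold (standing in for a decision matrix).
flags = {day: ("high" if total > 100 else "normal") for day, total in daily_totals.items()}
print(dict(daily_totals), flags)
```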

Regardless of the source data, many tools can use flat files, CSVs, or other data sources. For example, InfoSphere Warehouse can parse flat files in addition to directly connecting to the DB2 data warehouse.

Conclusion

Data mining is not only about performing some complex queries on the data stored in the database. Whether you're using SQL, document-based databases like Hadoop, or simple flat files, you need to work with, format, or restructure the data. You want to define the format of the information on which your method and analysis will be based. Then, once the information is in the right format, different methods can be applied (individually or collectively), independent of the underlying data structure or dataset required.

Home> Lecture

Topic 7. CLASSIFICATION ANALYSIS

Lecture number 9

1. Exploratory data analysis. Measurement scales

2. Classification trees

3. Discriminant analysis (classification with training)

4. Cluster analysis (classification without training)

5. Canonical correlations

1. Exploratory data analysis. Measurement scales

In the presence of a large number of variables and the absence of information about relationships and patterns, one of the first stages of analyzing the available data is so-called exploratory data analysis. As a rule, in exploratory analysis a large number of variables are considered and compared, and the variables are classified and scaled for the search. Variables differ in how well they can be measured, or, in other words, in how much information the scale of their measurement provides. Another factor that determines the amount of information is the type of scale on which the measurement is made. The following types of measurement scales are typically used: nominal, ordinal, interval, and relative (ratio).

Nominal variables are used only for qualitative classification: they can be measured only in terms of belonging to certain substantially different classes. Typical examples of nominal variables are the manufacturer, the type of product, an indicator of its suitability, and so on. Nominal variables are often called categorical. Ordinal variables allow objects to be ranked by indicating which of them possess, to a greater or lesser extent, the quality expressed by the given variable; however, they do not allow one to judge how much more or how much less of that quality is present. A typical example is the grade of a product: highest, first, second, third. The same product differs qualitatively between grades, but it cannot be said that the difference between them is 25%. Categorical and ordinal variables are especially common in questionnaires. Interval variables allow not only ranking but also measuring and comparing the differences between values. An example is temperature measured in degrees, which forms an interval scale, since the difference between values can already be estimated numerically (40 degrees is 10 more than 30). An interval scale can easily be converted into an ordinal one by taking certain values of the variable as the boundaries between classes (for example, whether it is warm or hot outside, with the boundary between the classes "warm" and "hot" set at some value of the variable). Relative (ratio) variables are similar to interval ones, but their distinctive feature is the presence of a fixed absolute zero point; they are usually continuous variables.

2. Classification trees

Classification trees are a method that makes it possible to predict the belonging of observations or objects to a particular class of a categorical dependent variable, depending on the corresponding values of one or more predictor variables. The construction of a classification tree is a hierarchical procedure; a convenient illustration is a coin-sorting device. Let the coins roll down a narrow chute with a slot the size of a one-kopeck coin. If a coin falls into the slot, it is 1 kopeck; otherwise it continues to roll along the chute and encounters the slot for a two-kopeck coin; if it falls there, it is 2 kopecks; if not (meaning it is 3 or 5 kopecks), it rolls further, and so on. In this way we have built a classification tree. The decision rule implemented in this tree allows a handful of coins to be sorted effectively, and the approach is applicable to a wide range of classification problems. Classification trees are ideally suited for graphical presentation, and therefore the conclusions drawn from them are much easier to interpret than if they were presented only in numerical form. The hierarchical structure of a classification tree is one of its most important properties. The process of building a classification tree consists of four main steps:

    Selection of the forecast accuracy criterion

    Choosing the type of branching

    Determining when to stop branching

    Determining "suitable" tree sizes

Ultimately, the goal of analysis with classification trees is to obtain the most accurate prediction possible.
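As an illustrative sketch (not part of the lecture material), a classification tree can be built with scikit-learn on a standard toy dataset; limiting the depth is one simple way to control the "suitable" tree size mentioned above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth limits branching, one simple way of choosing a "suitable" tree size.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))  # the fitted tree as a readable sequence of questions
```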

3. Discriminant analysis (classification with training)

Discriminant analysis is used to decide to which class (group) a particular object (process) should be assigned based on the study of its parameters or characteristics. For example, the measured parameters of a product are studied, and the task is to establish which of the parameters contribute to the difference (discrimination) between separately grouped sets (grades) of goods that form the general population. A decision is then made about the belonging of a given product to a certain group. This kind of statistical analysis is therefore multivariate. The main idea of discriminant analysis is to determine whether populations differ in the mean of some parameter (variable) and then to use this variable to predict, for new observations, the population to which they belong. Each of the populations differs from the others in the value of a certain parameter (more precisely, in the value of its mean) or in a set of parameters taken as the classification feature. The discrimination rule is chosen in accordance with a certain optimality principle, for example the minimum probability of false classification. In practical calculations, one passes from the feature vector to a linear function (the discriminant function), which for two groups (classes) has the form of a linear multiple regression equation in which coded indicators of group membership act as the dependent variable. If there are more than two groups, more than one discriminant function can be constructed. For example, when there are three populations, one can estimate (1) a function for discriminating between population 1 and populations 2 and 3 combined, and (2) a function for discriminating between population 2 and population 3; in this sense, discriminant analysis is very similar to multivariate analysis of variance. Once the discriminant functions have been obtained, the question arises of how well they can predict to which population a particular observation belongs. For this, classification indicators or classification functions are determined, and the next observation or specific sample is assigned to the group for which the classification function has the greatest value.

4. Cluster analysis (classification without training)

Cluster analysis is a statistical method that includes a set of different algorithms for distributing objects into clusters (cluster - bunch, accumulation). The task is to partition a set of n objects into an integer number K of clusters so that each object belongs to one and only one subset of the partition. Objects belonging to the same cluster must be similar, while objects belonging to different clusters must be heterogeneous. The solution of the cluster analysis problem is a partition that satisfies a criterion of optimality. This criterion is called the objective function; it can be, for example, the minimum sum of the squared deviations of the features of the group objects from the mean value:

min Σ (x_i - x_mean)²

The similarity and heterogeneity of objects in groups are characterized by a certain value called the distance function. The larger the distance function between objects, the more heterogeneous they are. Clearly, if this function exceeds a certain set limit, the objects should be assigned to different groups (clusters). Depending on the clustering algorithm used, the following distance functions are distinguished: the Euclidean metric (Σ (x_i - x_j)²)^(1/2); the Manhattan distance Σ |x_i - x_j|; the Chebyshev distance max |x_i - x_j|; and others. In hierarchical (agglomerative) algorithms, all objects are initially considered as separate clusters. Then, at each step of the algorithm, the two closest clusters are merged and, taking the accepted distance function into account, all distances are recalculated. Iterations stop when the objective function is reached.

5. Canonical correlations

Classical correlation analysis allows one to find statistical relationships between two variables; to study relationships between two sets of variables, the methods of canonical analysis are used. Canonical analysis, being a generalization of multiple correlation as a measure of the relationship between one random variable and many other random variables, considers the relationship between sets of random variables. At the same time, it is limited to considering a small number of the most correlated linear combinations from each set. The analysis of canonical correlations is based on canonical roots or canonical variables, which are treated as "hidden" variables characterizing the observed phenomena. The number of canonical roots is equal to the number of variables in the smaller set. In practice, when determining a canonical correlation, a separate correlation matrix is constructed as a product of the standard correlation matrices characterizing the relationships between the two sets of variables. Then as many eigenvalues of the resulting matrix are computed as there are canonical roots. Taking the square roots of the obtained eigenvalues gives a set of numbers that can be interpreted as correlation coefficients; since they refer to the canonical variables, they are called canonical correlations. It is advisable to carry out discriminant, cluster, and canonical analysis using specialized statistical packages that implement these algorithms on a computer.
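A small sketch of agglomerative clustering with an explicit choice of distance function, assuming SciPy is available and using synthetic data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(5, 1, (20, 3))])

# Pairwise distances: 'euclidean', 'cityblock' (Manhattan) or 'chebyshev'.
distances = pdist(X, metric="euclidean")

# Agglomerative merging: at each step the two closest clusters are joined.
Z = linkage(distances, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")  # stop once two clusters remain
print(labels)
```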

Last year Avito held a number of competitions, including one on recognizing car makes, whose winner, Evgeny Nizhibitsky, talked about his solution at a training session.


Formulation of the problem... It is necessary to determine the make and model of a car from its images. The metric was prediction accuracy, that is, the proportion of correct answers. The sample consisted of three parts: the first part was available for training from the start, the second was released later, and the third was the one for which final predictions had to be submitted.


Computing resources... I used my home computer, which was heating my room all this time, and the servers provided at work.

Model overview... Since our task is image recognition, the first thing we want to do is take advantage of the progress in image classification quality on the well-known ImageNet. As you know, modern architectures can achieve even higher quality than humans. So I started with a review of recent articles and put together a summary table of ImageNet-based architectures, implementations, and their quality.


Note that the best quality is achieved by the ResNet and Inception architectures.

Fine-tuning networks... Training a deep neural network from scratch is a rather time-consuming exercise and, moreover, not always effective in terms of results. Therefore the technique of fine-tuning pretrained networks is often used: a network already trained on ImageNet is taken, its last layer is replaced with a layer with the required number of classes, and then the network is tuned with a low learning rate using the competition data. This scheme lets you train the network faster and to a higher quality.
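A generic sketch of this fine-tuning scheme in PyTorch/torchvision (the author used other frameworks and architectures); the class count, learning rate, and fake batch are placeholders:

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10  # placeholder: use the number of makes/models in your data

# Pretrained ImageNet weights are downloaded on first use (torchvision >= 0.13 API).
model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)

# Replace the final ImageNet layer with one sized for our classes.
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Fine-tune with a low learning rate so the pretrained weights shift only slightly.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a fake batch.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
loss = criterion(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```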

The first attempt at fine-tuning GoogLeNet showed about 92% accuracy on validation.

Crop predictions... The quality of prediction on the test sample can be improved by using crops: fragments of a suitable size are cut out at different places of the original image, and the results are averaged. A 10-crop means the center of the image and the four corners, plus the same five crops reflected horizontally. As you can see, the quality increases, but so does the prediction time.
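A possible way to do 10-crop averaging with torchvision's TenCrop transform; normalization is omitted for brevity, and the model is assumed to output class scores for a batch:

```python
import torch
from torchvision import transforms
from torchvision.transforms import functional as TF

# TenCrop = centre + four corners, plus the same five crops flipped horizontally.
ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),
    transforms.Lambda(lambda crops: torch.stack([TF.to_tensor(c) for c in crops])),
])

def predict_with_crops(model, pil_image):
    crops = ten_crop(pil_image)              # shape: (10, 3, 224, 224); no normalisation here
    model.eval()
    with torch.no_grad():
        probs = model(crops).softmax(dim=1)  # one prediction per crop
    return probs.mean(dim=0)                 # average over the 10 crops
```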

Validation of results... After the second part of the sample appeared, I split the data into several parts. All further results are shown for this split.

ResNet-34, Torch... You can use the ready-made repository of the architecture's authors, but to get predictions on the test set in the required format you have to fix some scripts. You also need to deal with the high memory consumption of the dumps. The validation accuracy is about 95%.


Inception-v3, TensorFlow... Here, too, a ready-made implementation was used, but the image preprocessing was changed and the cropping of images during batch generation was restricted. The result is almost 96% accuracy.


Ensemble of models... The result is two ResNet models and two Inception-v3 models. What validation quality can be obtained by mixing the models? The class probabilities were averaged using a geometric mean. The weights (in this case, the exponents) were selected on a held-out sample.
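A sketch of weighted geometric-mean blending of class probabilities in NumPy; the weights and the random example predictions are placeholders:

```python
import numpy as np

def geometric_blend(prob_list, weights):
    """Weighted geometric mean of per-model class probabilities."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    log_probs = sum(w * np.log(np.clip(p, 1e-15, 1.0))
                    for w, p in zip(weights, prob_list))
    blended = np.exp(log_probs)
    return blended / blended.sum(axis=1, keepdims=True)  # renormalise each row

# Example: predictions of two models for 3 objects and 4 classes.
p1 = np.random.dirichlet(np.ones(4), size=3)
p2 = np.random.dirichlet(np.ones(4), size=3)
print(geometric_blend([p1, p2], weights=[0.6, 0.4]))
```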


Results... Training ResNet took 60 hours on a GTX 980, and Inception-v3 on a Titan X took 48 hours. During the competition I managed to try out new frameworks with new architectures.


The problem of classification of bank clients

Link to Kaggle.

Stanislav Semyonov tells how he and other top Kaggle participants teamed up and won a prize in the competition on classifying claims for the large bank BNP Paribas.


Formulation of the problem... Based on obfuscated data from insurance claims, it is necessary to predict whether a claim can be confirmed without additional manual checks. For the bank, this is automation of claim processing; for data analysts, it is just a machine learning task of binary classification. There are about 230 thousand objects and 130 features. The metric is LogLoss. It is worth noting that the winning team decrypted the data, which helped them win the competition.

Getting rid of artificial noise in the features... The first step is to look at the data. Several things are immediately apparent. First, all features take values from 0 to 20. Second, if you look at the distribution of any of the features, you see the following picture:

Why is that? At the stage of anonymization and noising, random noise was added to all values, which were then scaled to the segment from 0 to 20. The reverse transformation was carried out in two steps: first the values were rounded to a certain decimal place, and then the denominator was selected ... Was this necessary if a tree picks up the threshold at a split anyway? Yes: after the reverse transformation, differences between the variables begin to make more sense, and for categorical variables one-hot encoding becomes possible.

Removing linearly dependent features... We also noticed that some features are the sum of others. Clearly they are not needed. To find them, subsets of features were taken and a regression was built on each subset to predict some other variable. If the predicted values were close to the true ones (keeping the artificial noise in mind), the feature could be removed. But the team did not bother with this and used a ready-made set of filtered features prepared by someone else. One of the features of Kaggle is its forum and public solutions, through which participants share their findings.

How do you know what to use? There is a small hack. Suppose you know that in old competitions someone used a technique that helped them rank high (short solutions are usually posted on the forums). If that participant is again among the leaders in the current competition, the same technique will most likely work here as well.

Encoding categorical variables... It was striking that a certain variable V22 has a large number of values, and at the same time, if you take a subsample for a particular value, the number of levels (distinct values) of the other variables decreases markedly. There is also a good correlation with the target variable. What can be done? The simplest solution is to build a separate model for each value of V22, but that is the same as splitting on all values of this variable in the first split of a tree.

There is another way to use the obtained information: encoding with the mean of the target variable. In other words, each value of the categorical variable is replaced by the mean value of the target over the objects for which this attribute takes the same value. Such encoding cannot be performed directly on the entire training set: in the process we would implicitly add information about the target variable to the features, and this is information that almost any model will certainly find.

Therefore, these statistics are computed on folds. Here is an example:

Suppose the data is split into three parts. For each fold of the training set we compute the new feature on the other two folds, and for the test set we compute it over the entire training set. Then the information about the target variable is not included in the sample so explicitly, and the model can use the knowledge gained.
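A sketch of out-of-fold target mean encoding with pandas and scikit-learn; the function name and defaults are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def out_of_fold_target_mean(train, test, col, target, n_splits=3, seed=0):
    """Encode a categorical column by the target mean, computed out of fold."""
    encoded = pd.Series(np.nan, index=train.index)
    for fit_idx, enc_idx in KFold(n_splits, shuffle=True, random_state=seed).split(train):
        means = train.iloc[fit_idx].groupby(col)[target].mean()
        encoded.iloc[enc_idx] = train.iloc[enc_idx][col].map(means).values
    global_mean = train[target].mean()
    # For the test set, use the statistics of the whole training set.
    test_encoded = test[col].map(train.groupby(col)[target].mean())
    return encoded.fillna(global_mean), test_encoded.fillna(global_mean)
```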

Will there be any problems with anything else? Yes - with rare categories and cross-validation.

Rare categories... Suppose a certain category occurs only a few times and the corresponding objects all belong to class 0. Then the mean value of the target variable will also be zero, while a completely different situation may arise on the test sample. The solution is a smoothed mean (smoothed likelihood), calculated by a formula of the form:

smoothed mean = (nrows · category mean + alpha · global mean) / (nrows + alpha)

Here global mean is the mean value of the target variable over the entire sample, nrows is the number of times the specific value of the categorical variable was encountered, and alpha is a regularization parameter (for example, 10). Now, if some value is rare, the global mean has more weight, and if it occurs often enough, the result is close to the original category mean. Incidentally, this formula also makes it possible to handle previously unseen values of a categorical variable.
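The same smoothing written out as code, following the description of global mean, nrows, and alpha above; the example numbers are invented:

```python
def smoothed_mean(category_sum, nrows, global_mean, alpha=10.0):
    """Smoothed likelihood: rare categories are pulled towards the global mean."""
    # category_sum is nrows * (mean of the target within the category).
    return (category_sum + alpha * global_mean) / (nrows + alpha)

# A category seen twice, both times with target 0, in a sample whose mean is 0.3:
print(smoothed_mean(category_sum=0.0, nrows=2, global_mean=0.3))      # 0.25, near the global mean
print(smoothed_mean(category_sum=900.0, nrows=1000, global_mean=0.3)) # ~0.89, near the category mean
```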

Cross-validation... Suppose we have calculated all the smoothed means for the categorical variables on the other folds. Can we assess the quality of the model using standard k-fold cross-validation? No. Let's look at an example.

Say we want to evaluate a model on the third fold. We train the model on the first two folds, but they contain a new variable with the mean of the target that we have already calculated using the third, test fold. This does not allow us to assess the results correctly, but the problem is solved by computing the statistics on folds within folds. Let's look at the example again:

We still want to evaluate the model on the third fold. Let's split the first two folds (the training sample of our evaluation) into three further folds; in them we compute the new feature according to the scheme already described, and for the third fold (the test sample of our evaluation) we compute it over the first two folds together. Then no information from the third fold is used when training the model, and the estimate is fair. In the competition under discussion, only this kind of cross-validation made it possible to assess the quality of the model correctly. Of course, the "outer" and "inner" numbers of folds can be anything.

Building features... We used not only the already mentioned smoothed means of the target variable, but also weights of evidence: almost the same thing, but with a logarithmic transformation. In addition, features such as the difference between the number of objects of the positive and negative classes in a group, without any normalization, turned out to be useful. The intuition is that such a scale shows the degree of confidence in the class. But what about quantitative features? If you process them the same way, all values are "hammered down" by the regularization towards the global mean. One option is to split the values into bins, which are then treated as separate categories. Another way is simply to build some kind of linear model on a single feature with the same target. In total we got about two thousand features out of the 80 filtered ones.

Stacking and blending... As in most competitions, model stacking is an important part of the solution. In short, the essence of stacking is that we pass the predictions of one model as a feature to another model. However, it is important not to overfit once again. Let's look at an example:


Taken from the blog of Alexander Dyakonov

Suppose we decided to split our sample into three folds at the stacking stage. Similarly to computing the statistics, we must train the model on two folds and add the predicted values for the remaining fold. For the test sample, the predictions of the models from each pair of folds can be averaged. Adding a group of new model-prediction features based on the existing dataset is called a stacking level.
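A sketch of producing one stacked feature via out-of-fold predictions, assuming NumPy arrays and a scikit-learn binary classifier with predict_proba:

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def stacking_feature(model, X_train, y_train, X_test, n_splits=3, seed=0):
    """Out-of-fold predictions for train, fold-averaged predictions for test."""
    oof = np.zeros(len(X_train))
    test_parts = []
    for fit_idx, pred_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X_train):
        m = clone(model).fit(X_train[fit_idx], y_train[fit_idx])
        oof[pred_idx] = m.predict_proba(X_train[pred_idx])[:, 1]
        test_parts.append(m.predict_proba(X_test)[:, 1])
    return oof, np.mean(test_parts, axis=0)
```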

At the first level the team had 200-250 different models, at the second 20-30 more, and at the third a few more. The result is blending, that is, mixing the predictions of different models. Various algorithms were used: gradient boosting with different parameters, random forests, neural networks. The main idea is to use the most diverse models possible with different parameters, even if they do not give the highest quality individually.

Teamwork... Usually participants unite into teams closer to the end of a competition, when everyone already has their own groundwork. We teamed up with other Kagglers from the very beginning. Each team member had a folder in a shared cloud where datasets and scripts were kept. A common cross-validation procedure was agreed in advance so that results could be compared. The roles were distributed as follows: I came up with new features, the second participant built models, the third selected them, and the fourth managed the whole process.

Where to get the computing power... Testing a large number of hypotheses, building multilevel stacking, and training models can take too long on a laptop. Therefore many participants use computing servers with a large number of cores and plenty of RAM. I usually use AWS servers, and my teammates turned out to use machines at work for competitions while they were idle.

Communication with the organizing company... After a successful performance, communication with the company takes the form of a joint conference call. Participants talk about their solution and answer questions. At BNP, people were not surprised by multi-level stacking, but they were, of course, interested in feature construction, teamwork, and validation of results - everything that could be useful for improving their own system.

Do you need to decrypt the dataset... The winning team noticed one peculiarity in the data: some of the features have missing values and some do not, that is, some characteristics did not depend on specific people. In addition, there were 360 unique values. It is logical to assume that these are some kind of time stamps. It turned out that if you take the difference between two such features and sort the entire sample by it, zeros appear more often at first and then ones. This is exactly what the winners took advantage of.

Our team took third place. In total, almost three thousand teams participated.

The task of recognizing the ad category

Link to DataRing.

This is another Avito contest. It took place in several stages, the first of which (as well as the third, by the way) was won by Arthur Kuzin.


Formulation of the problem... It is necessary to determine the category of an ad based on its photos. Each ad had from one to five images. The metric took into account the coincidence of categories at different levels of the hierarchy, from general to narrower ones (the last level contains 194 categories). In total, there were almost a million images in the training sample, which is close to the size of ImageNet.


Difficulties of recognition... It would seem that you just need to learn to distinguish a TV from a car and a car from shoes. But there is, for example, a category "British cats" and a category "other cats", and among them there are very similar images - although you can still tell them apart. What about tires, rims, and wheels? Here even a person cannot cope. These difficulties are the reason for a certain ceiling in the results of all participants.


Resources and framework... I had at my disposal three computers with powerful video cards: one at home, one provided by a laboratory at MIPT, and one at work. Therefore it was possible (and necessary) to train several networks at the same time. MXNet was chosen as the main framework for training neural networks; it was created by the same people who wrote the well-known XGBoost, which alone was a reason to trust their new product. An advantage of MXNet is that an efficient iterator with standard augmentation is available out of the box, which is sufficient for most tasks.


Network architectures... The experience of participating in one of the past competitions showed that the best quality comes from the Inception family of architectures, and I used them here. GoogLeNet was included because it makes model training faster. We also used the Inception-v3 and Inception BN architectures from the Model Zoo library, with a dropout layer added before the last fully connected layer. Due to technical problems it was not possible to train the networks with stochastic gradient descent, so Adam was used as the optimizer.



Data augmentation... To improve the quality of the network, augmentation was used - adding distorted images to the sample in order to increase the variety of the data. Transformations such as random crops, flips, rotation by a small angle, changes in aspect ratio, and shifts were used.
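A rough torchvision analogue of the augmentations listed above (the author used MXNet's built-in iterator); all parameter values are guesses:

```python
from torchvision import transforms

# A rough analogue of the augmentations listed above; all parameters are guesses.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0), ratio=(0.8, 1.25)),  # crop + aspect ratio
    transforms.RandomHorizontalFlip(),                                        # flipping
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1)),                # small rotation + shift
    transforms.ToTensor(),
])
```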

Accuracy and speed of training... At first I divided the sample into three parts, but later abandoned one of the validation stages for mixing models, so the second part of the sample was added to the training set, which improved the quality of the networks. In addition, GoogLeNet was originally trained on a Titan Black, which has half the memory of a Titan X, so this network was retrained with a larger batch size and its accuracy increased. Looking at the training times, we can conclude that under time constraints it is not worth using Inception-v3, since the other two architectures train much faster; the reason is the number of parameters. Inception BN learns the fastest.

Making predictions.

Like Evgeny in the car makes competition, Arthur used crop predictions - but with 24 crops instead of 10: the corners, their reflections, the center, rotations of the central parts, and ten more random crops.

If you save the state of the network after each epoch, you end up with many different models, not just the final network. Taking into account the time remaining until the end of the competition, I could use the predictions of 11 model-epochs, since making predictions with a network also takes considerable time. All these predictions were averaged according to the following scheme: first with an arithmetic mean within the crop groups, then with a geometric mean with weights selected on the validation set. These three groups are mixed, and the operation is repeated for all epochs. At the end, the class probabilities of all images in one ad are averaged with an unweighted geometric mean.


Results... When selecting the weights at the validation stage, the competition metric was used, since it did not correlate too well with plain accuracy. Predicting on different parts of the images gives only a small quality gain compared with a single prediction, but it is precisely this gain that makes it possible to show the best result. At the end of the competition it turned out that the first three places differed by thousandths. For example, Zhenya Nizhibitsky had a single model that was only slightly inferior to my ensemble of models.


Training from scratch vs. fine-tuning... After the end of the competition it turned out that, despite the large sample size, it was worth training the network not from scratch but from a pretrained network: this approach shows better results.

Reinforcement Learning Problem

The Black Box Challenge was not quite like a regular Kaggle competition. The point is that it was not enough to label some "test" sample to produce a solution. You had to program an "agent" and upload its code into the system; the agent was placed in an environment unknown to the participant and made decisions in it independently. Such tasks belong to the field of reinforcement learning.

Mikhail Pavlov from the 5vision company spoke about the approaches to the solution. In the competition, he took second place.


Formulation of the problem... For an environment with unknown rules, it was necessary to write an "agent" that would interact with it. Schematically, this is a kind of brain that receives a state and a reward from a black box, decides on an action, and then receives a new state and a reward for the action performed. Actions are repeated one after another throughout the game. The current state is described by a vector of 36 numbers. The agent can take four actions. The goal is to maximize the sum of rewards over the entire game.


Environment analysis... A study of the distribution of the environment's state variables showed that the first 35 components do not depend on the chosen action and only the 36th changes with it. Different actions affected it differently: some increased or decreased it, some did not change it at all. But it cannot be said that the whole environment depends on one component: there may be hidden variables in it. In addition, an experiment showed that if you perform more than 100 identical actions in a row, the reward becomes negative, so strategies like "do only one action" were ruled out immediately. One of the participants noticed that the reward is proportional to that same 36th component. It was suggested on the forum that the black box imitates a financial market, where the 36th component is the portfolio and the actions are buying, selling, and doing nothing. These options correlated with portfolio changes, and the meaning of one of the actions remained unclear.


Q-learning... The main goal of participating was to try different reinforcement learning techniques. One of the simplest and best known methods is Q-learning. Its essence is an attempt to construct a function Q of the state and the chosen action that evaluates how "good" it is to choose a particular action in a particular state, where "good" includes the reward we will receive not only now but also in the future. Such a function is trained iteratively: at each iteration we bring the function closer to its own estimate at the next step of the game, taking into account the reward received now. Q-learning assumes working with fully observable Markov processes (in other words, the current state should contain all the information about the environment). Although the environment, according to the organizers, did not meet this requirement, Q-learning could be used quite successfully.
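The core update can be sketched as tabular, epsilon-greedy Q-learning; this is only the textbook rule with hashable (e.g. discretized) states, not the competition agent, whose state was a continuous 36-component vector:

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.99, 0.1
q = defaultdict(lambda: [0.0] * 4)   # Q[state] -> estimated value of each of the 4 actions

def choose_action(state):
    # Epsilon-greedy: mostly exploit the current Q estimates, sometimes explore.
    if random.random() < epsilon:
        return random.randrange(4)
    return max(range(4), key=lambda a: q[state][a])

def q_update(state, action, reward, next_state):
    # Pull Q(s, a) towards the reward plus the discounted value of the best next action.
    target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])
```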

Adaptation to the black box... It was found experimentally that n-step Q-learning suited the environment best, where the reward is taken not for the single last action but for n steps ahead. The environment made it possible to save the current state and roll back to it, which simplified collecting a sample: each action could be tried from the same state rather than just one. At the very beginning of training, when the Q-function did not yet know how to evaluate actions, the strategy "always take action 3" was used; it was assumed that this action changed nothing, so training could start on data without noise.

Learning process... Training proceeded as follows: with the current policy (the agent's strategy) we play the whole episode, accumulating a sample; then we update the Q-function on the sample obtained, and the sequence is repeated for a certain number of epochs. The results were better than when the Q-function was updated during the game. Other methods - the replay memory technique (with a common data bank for training into which new episodes of the game are added) and the simultaneous training of several agents playing asynchronously - also turned out to be less effective.

Models... The solution used three regressions (one per action) and two neural networks. Some quadratic features and interactions were added. The final model is a mixture of all five models (five Q-functions) with equal weights. In addition, online fine-tuning was used: during testing, the weights of the old regressions were mixed with new weights obtained on the test sample. This was done only for the regressions, since their solutions can be written out analytically and recalculated quite quickly.


More ideas... Naturally, not all ideas improved the final result. For example, discounting the reward (when we do not simply maximize the total reward but consider each subsequent step less useful), deeper networks, and a dueling architecture (with separate estimates of the value of the state and of each action) did not yield an improvement. Due to technical problems it was not possible to apply recurrent networks, although in an ensemble with other models they might have provided some benefit.


Outcomes... The 5vision team took second place, but with a very small margin over the bronze winners.


So why do you need to compete in data science competitions?

  • Prizes. Successful performance in most competitions is rewarded with cash prizes or other valuable gifts. Over seven million dollars have been awarded on Kaggle over seven years.
  • Career. Sometimes a prize-winning place attracts the attention of employers.
  • Experience. This is, of course, the most important thing. You can explore a new area and start tackling problems you have not encountered before.

Machine learning trainings are now held on Saturdays every other week. The venue is the Moscow office of Yandex; the usual number of attendees (guests plus Yandex employees) is 60-80 people. The main feature of the training is its topicality: each time, a competition that ended one or two weeks earlier is analyzed. This makes it difficult to plan everything precisely, but the competition is still fresh in memory, and many people gather in the hall to try their hand at it. The trainings are organized by Emil Kayumov, who, by the way, helped with writing this post.

In addition, there is another format: solving sessions, where novice specialists jointly take part in ongoing competitions. These sessions are held on Saturdays when there is no training. Anyone can attend events of both types; announcements are published in the corresponding groups.

 
