Laboratory work: "Application of primary exploratory data analysis methods to data mining problems using the integrated system Statistica". Data mining techniques and intelligent data analysis.

Updated 07.29.

My rather chaotic thoughts on the use of statistical methods in the processing of proteomic data.

APPLICATION OF STATISTICS IN PROTEOMICS

Review of methods for analyzing experimental data

Pyatnitsky M.A.

V.N. Orekhovich Institute of Biomedical Chemistry, Russian Academy of Medical Sciences (RAMS)

119121, Moscow, Pogodinskaya st. 10,

e-mail: mpyat@bioinformatics.ru

Proteomic experiments require carefully thought-out statistical processing of the results. Several important features characterize proteomic data:

  • a large number of variables;
  • complex relationships between these variables, which are assumed to reflect biological facts;
  • the number of variables is much larger than the number of samples, which makes it very difficult for many statistical methods to work.

Similar features are, however, inherent in many other kinds of data obtained using high-throughput technologies.

Typical objectives of a proteomic experiment are:

  • comparison of protein expression profiles between different groups (e.g. cancer / normal). Typically, the task is to construct a decision rule that separates one group from another. Variables with the highest discriminatory ability (biomarkers) are also of interest.
  • the study of the relationship between proteins.

Here I will focus mainly on the application of statistics to the analysis of mass spectra, although much of what is said applies to other types of experimental data as well. The methods themselves are hardly described here (with the exception of a more detailed treatment of ROC curves); instead, the arsenal of data analysis methods is briefly outlined and guidelines for its meaningful application are given.

Exploratory analysis

The most important step when working with any data set is exploratory data analysis (EDA). In my opinion, this is perhaps the key point in statistical data processing. It is at this stage that you need to get a feel for the data, understand which methods are best to apply and, more importantly, what results can be expected. Otherwise it becomes a "blind" game (let's try such and such a method), a mindless enumeration of the statistical arsenal, data dredging. Statistics is dangerous in that it will always produce some kind of result. Now that running the most complex computational method takes only a couple of mouse clicks, this is especially important.

According to Tukey, the objectives of exploratory analysis are:

  • maximize insight into a data set;
  • uncover underlying structure;
  • extract important variables;
  • detect outliers and anomalies;
  • test underlying assumptions;
  • develop parsimonious models; and
  • determine optimal factor settings.

At this stage, it is wise to get as much information about the data as possible, using primarily graphical tools. Plot histograms for each variable. As trivial as it sounds, look at the descriptive statistics. It is useful to examine scatter plots (drawing the points with different symbols to indicate class membership). The results of PCA (principal component analysis) and MDS (multidimensional scaling) are also interesting to inspect. So EDA is, first of all, the extensive use of graphical visualization.

The use of projection pursuit methods for finding the most "interesting" projection of the data is promising. Some degree of automation of this work is usually possible (e.g. GGobi). The index used to search for interesting projections can be chosen arbitrarily.
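As a complement to the graphical tools just listed, here is a minimal R sketch of the two projections mentioned above (PCA and classical MDS); the matrix X and the class labels are simulated placeholders rather than real proteomic data.

    # Simulated stand-in for a proteomic data set: 60 samples, 100 variables
    set.seed(1)
    X <- matrix(rnorm(60 * 100), nrow = 60, ncol = 100)
    groups <- rep(c("cancer", "control"), each = 30)

    # PCA on scaled variables; plot the first two principal components
    pca <- prcomp(X, scale. = TRUE)
    plot(pca$x[, 1:2], col = ifelse(groups == "cancer", "red", "blue"),
         pch = ifelse(groups == "cancer", 17, 1), main = "PCA")

    # Classical multidimensional scaling of Euclidean distances between samples
    mds <- cmdscale(dist(X), k = 2)
    plot(mds, col = ifelse(groups == "cancer", "red", "blue"),
         pch = ifelse(groups == "cancer", 17, 1),
         xlab = "Coordinate 1", ylab = "Coordinate 2", main = "MDS")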

Normalization

Typically, the data are not normally distributed, which is inconvenient for statistical procedures. The log-normal distribution is common, and a simple logarithm can make the distribution much more tractable. In general, do not underestimate such simple techniques as taking logarithms and other data transformations. In practice, there have been more than a few cases where meaningful results began to appear only after taking the logarithm, while before this preprocessing the results made little sense (an example is the mass spectrometry of wines).

In general, the choice of normalization is a separate problem to which many works are devoted. The choice of preprocessing and scaling method can significantly affect the results of the analysis (Berg et al., 2006). In my opinion, it is better to always carry out the simplest normalization by default (for example, leaving the data as-is if the distribution is symmetric, and taking the logarithm otherwise) than not to use these methods at all.
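A minimal R sketch of this default preprocessing step, using a simulated log-normal variable in place of real peak intensities:

    # Simulated peak intensities with a typical right-skewed (log-normal) shape
    set.seed(2)
    intensity <- rlnorm(500, meanlog = 5, sdlog = 1)

    op <- par(mfrow = c(1, 2))
    hist(intensity, breaks = 30, main = "Raw intensities", xlab = "intensity")
    hist(log2(intensity), breaks = 30, main = "log2 intensities",
         xlab = "log2(intensity)")
    par(op)

    # A quick formal check of how close the transformed variable is to normal
    shapiro.test(log2(intensity))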

Here are some examples of graphical visualization and the use of simple statistical techniques for exploratory data analysis.

Examples

Below are examples of graphs that may make sense to plot for each variable. On the left, the density estimates for each of the two classes (red - cancer, blue - control) are shown; note that the actual values used to estimate the densities are displayed under the curves. On the right, the ROC curve is shown together with the area under it (AUC). Thus, the potential of each variable as a discriminator between the classes can be seen at once. After all, discrimination between classes is usually the ultimate goal of the statistical analysis of proteomic data.

The following figure illustrates normalization: a typical distribution of peak intensities in a mass spectrum (left), after taking the logarithm, gives a distribution close to normal (right).

Next, the use of a heatmap for exploratory data analysis is shown. Columns correspond to patients, rows to genes; the color encodes the numerical value. A clear division into several groups is visible. This is a good example of how EDA can provide an immediate visual picture of the data.

The following picture shows an example of a gel-view plot. This is a standard technique for visualizing a large set of spectra: each row is a sample, each column is a peak, and the color encodes the intensity (the brighter, the higher the value). Such pictures can be obtained, for example, in ClinProTools. There is a big drawback, however: the rows (samples) appear in the order in which they were loaded. It is much more correct to rearrange the rows so that similar samples are placed next to each other on the plot. In essence, a gel-view is a heatmap without sorting of the columns and without dendrograms on the sides.

The following picture shows an example of multidimensional scaling. Circles are controls, triangles are cancer. It can be seen that the cancer group has a noticeably greater variance and that constructing a decision rule is quite feasible. This result is achieved with only the first two coordinates! Looking at such a picture, one can be optimistic about the results of further data processing.

The missing values problem

The next problem the researcher faces is missing values. Again, many books are devoted to this topic, each describing dozens of ways to address it. Missing values are common in data obtained through high-throughput experiments, while many statistical methods require complete data.

Here are the main ways to handle missing values (a minimal sketch of the simplest of them follows this list):

  • remove rows / columns with missing values. This is justified if there are relatively few missing values; otherwise you will end up removing everything;
  • impute new values instead of the missing ones (replace with the mean, or draw from an estimated distribution);
  • use methods that are insensitive to missing data;
  • repeat the experiment!
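A minimal R sketch of the two simplest strategies (row removal and mean imputation); the matrix X with artificially introduced NA values is purely illustrative.

    # Simulated numeric matrix with some missing values
    set.seed(3)
    X <- matrix(rnorm(20 * 5), nrow = 20)
    X[sample(length(X), 10)] <- NA

    # 1) remove rows that contain any missing value
    X_complete <- X[complete.cases(X), ]

    # 2) replace missing values in each column with the column mean
    impute_mean <- function(m) {
      apply(m, 2, function(col) {
        col[is.na(col)] <- mean(col, na.rm = TRUE)
        col
      })
    }
    X_imputed <- impute_mean(X)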

The outlier problem

An outlier is a sample whose values differ drastically from the main group. Again, this topic has been developed deeply and extensively in the relevant literature.

What is the danger of outliers? First of all, they can significantly affect the behavior of non-robust (not resistant to outliers) statistical procedures. The presence of even a single outlier in the data can significantly change the estimates of the mean and variance.

Outliers are difficult to notice in multivariate data, since they may manifest themselves in the values of only one or two variables (recall that in a typical case a proteomic experiment is described by hundreds of variables). This is where per-variable analysis comes in handy: when looking at descriptive statistics or histograms (such as those shown above), such an outlier is easy to spot.

There are two strategies for finding outliers:

1) manually - by analyzing scatter plots, PCA, and other exploratory methods. Try building a dendrogram: on it, an outlier appears as a separate branch that departs from the root early.

2) many formal detection criteria have been developed (Yang, Mardia, Schwager, ...)

Ways of dealing with outliers:

  • removal of the outliers;
  • application of outlier-resistant (robust) statistical methods.

At the same time, one must keep in mind that a suspected outlier may be not an experimental error but an essentially new biological fact. This, of course, happens extremely rarely, but still...

The following figure shows the possible types of outliers according to how they affect the statistics.

Let us illustrate how outliers affect the behavior of the correlation coefficients.

We are interested in case (f): the presence of only 3 outliers yields a Pearson correlation coefficient of 0.68, while the Spearman and Kendall coefficients give much more reasonable estimates (there is no correlation). Indeed, the Pearson correlation coefficient is not a robust statistic.
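The same effect is easy to reproduce in R on simulated data: a handful of outliers inflates the Pearson coefficient, while the rank-based coefficients stay close to zero.

    # 100 uncorrelated observations plus 3 outliers lying along a line
    set.seed(4)
    x <- c(rnorm(100), 8, 9, 10)
    y <- c(rnorm(100), 8, 9, 10)

    cor(x, y, method = "pearson")   # strongly inflated by the outliers
    cor(x, y, method = "spearman")  # much closer to zero
    cor(x, y, method = "kendall")   # much closer to zero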

Let's show the application of the PCA method for visual detection of outliers.

Of course, you should not always rely on such "artisanal" detection methods; it is better to turn to the literature.

Classification and dimension reduction

Usually the main purpose of analyzing proteomic data is to construct a decision rule separating one group of samples from another (e.g. cancer / normal). After exploratory analysis and normalization, the next step is usually dimensionality reduction.

Variable selection

A large number of variables (the standard situation in proteomic experiments):

  • complicates data analysis;
  • usually not all variables have a biological interpretation;
  • often the aim of the work is precisely to select "interesting" variables (biomarkers);
  • degrades the performance of classification algorithms and leads to overfitting.

Therefore, the standard step is to apply dimensionality reduction before classification.

Dimensionality reduction methods can be divided into 2 types:

1) Filter

The task of this group of methods is either to remove the existing "uninteresting" variables or to create new variables as linear combinations of the old ones. This includes PCA, MDS, methods from information theory, etc.

Another idea is the directed selection of "interesting" variables: for example, bimodal variables are always worth looking at (ideally, for binary classification each of the two peaks corresponds to its own class). However, this can be attributed to exploratory analysis.

Another approach is to exclude highly correlated variables. Here the variables are clustered using correlation coefficients as a measure of distance (not only the Pearson correlation but also other coefficients can be used). From each cluster of correlated variables only one is kept (for example, the one with the largest area under the ROC curve).

The figure shows an example of visualizing such a cluster analysis of peaks using a heatmap. The matrix is symmetric; the color shows the values of the Pearson correlation coefficient (blue - high correlation, red - low). Several clusters of highly dependent variables are clearly distinguishable.
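A minimal R sketch of this correlation-based filter on a simulated matrix X (samples in rows, peaks in columns): the peaks are clustered on 1 - |r|, the tree is cut, and one representative per cluster is kept; a heatmap of the correlation matrix gives a picture similar to the figure described above. The choice of the representative (here simply the first peak of each cluster) is a placeholder for the AUC-based choice mentioned in the text.

    set.seed(5)
    X <- matrix(rnorm(40 * 30), nrow = 40, ncol = 30)   # simulated peak table

    corr_mat <- cor(X)                       # peak-by-peak Pearson correlations
    d  <- as.dist(1 - abs(corr_mat))         # correlation-based distance
    hc <- hclust(d, method = "average")
    clusters <- cutree(hc, h = 0.5)          # groups of correlated peaks

    # keep one representative per cluster (here: simply the first peak)
    keep <- tapply(seq_along(clusters), clusters, function(i) i[1])
    X_reduced <- X[, keep]

    heatmap(corr_mat, symm = TRUE)           # visual check of the clusters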



2) Wrapper

These methods use a classification algorithm as the measure of quality of the selected set of variables. The optimal solution would be a complete enumeration of all combinations of variables, because with complex relationships between the variables it is quite possible that two variables which are not discriminative separately become discriminative when a third one is added. Obviously, an exhaustive search is computationally infeasible for any significant number of variables.

An attempt to overcome this "curse of dimensionality" is the use of genetic algorithms to find the optimal set of variables. Another strategy is to include / exclude variables one at a time while monitoring the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC).

Cross-validation is mandatory for this group of methods. Read more about this in the section on comparing classifiers.
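A minimal R sketch of the stepwise wrapper idea, using logistic regression and step() with AIC on simulated data; in a real analysis the whole selection procedure has to be wrapped inside cross-validation, as stressed above.

    set.seed(6)
    dat <- data.frame(matrix(rnorm(100 * 10), nrow = 100))
    names(dat) <- paste0("v", 1:10)
    dat$y <- rbinom(100, 1, plogis(dat$v1 - dat$v2))   # only v1, v2 matter

    null_model  <- glm(y ~ 1, data = dat, family = binomial)
    upper_scope <- formula(paste("~", paste(paste0("v", 1:10), collapse = " + ")))
    selected <- step(null_model, scope = upper_scope, direction = "both",
                     k = 2, trace = FALSE)   # k = 2 gives AIC, k = log(n) gives BIC
    summary(selected)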

Classification

The task is to build a decision rule that will allow the newly processed sample to be assigned to one or another class.

Unsupervised learning - cluster analysis. This is a search for the best (in some sense) grouping of objects. Unfortunately, the number of clusters usually has to be specified a priori, or a cut-off threshold has to be chosen (for hierarchical clustering). This always introduces unpleasant arbitrariness.

Supervised Learning: neural networks, SVM, decision trees, ...

A large sample with pre-classified objects is required.

It usually works better than unsupervised learning. Cross-validation is used in the absence of a test sample, and the problem of overfitting arises.

An important and simple test that is rarely performed is running the trained classifier on random data. Generate a matrix the size of the original sample, fill it with random noise (or draw it from a normal distribution), and carry out the entire pipeline, including normalization, variable selection, and training. If "reasonable" results are obtained (i.e. you have learned to recognize random noise), there is less reason to believe in the constructed classifier.

There is an easier way: simply permute the class labels of the objects at random, without touching the rest of the variables. This again yields a meaningless data set on which the classifier should be run.

It seems to me that the constructed classifier can be trusted only if at least one of the above tests for recognizing random data has been performed.
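A minimal R sketch of the label-permutation check, with logistic regression standing in for an arbitrary classifier and simulated data in place of real spectra; the cross-validated accuracy on permuted labels should stay near the chance level of 0.5.

    set.seed(7)
    n <- 80
    X <- matrix(rnorm(n * 5), nrow = n)
    y <- rbinom(n, 1, 0.5)

    cv_accuracy <- function(X, y, folds = 5) {
      idx <- sample(rep(1:folds, length.out = length(y)))
      mean(sapply(1:folds, function(f) {
        train <- idx != f
        fit <- glm(y[train] ~ ., family = binomial,
                   data = data.frame(X[train, , drop = FALSE]))
        pred <- predict(fit, type = "response",
                        newdata = data.frame(X[!train, , drop = FALSE])) > 0.5
        mean(pred == (y[!train] == 1))
      }))
    }

    y_permuted <- sample(y)        # destroy any real class structure
    cv_accuracy(X, y_permuted)     # should be close to 0.5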

ROC curve

Receiver Operating Characteristic curve

  • It is used to present the results of classification into 2 classes when the correct partition is known.
  • It is assumed that the classifier has a parameter (the cut-off point); varying it yields one or another division into the two classes. This determines the proportions of false positive (FP) and false negative (FN) results.

The sensitivity and specificity are calculated, and a graph is plotted in the coordinates (1 - specificity, sensitivity). By varying the classifier's parameter, different values of FP and FN are obtained, and the point moves along the ROC curve.

  • Accuracy = (TP + TN) / (TP + FP + FN + TN)
  • Sensitivity = TP / (TP + FN)
  • Specificity = TN / (TN + FP)

What counts as a "positive" event depends on the problem. If the probability of the presence of a disease is predicted, then a positive outcome is the class "sick patient" and a negative outcome is the class "healthy patient".
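A minimal R sketch of these quantities computed from a confusion matrix; truth and predicted are short hypothetical 0/1 vectors (1 = "sick patient").

    truth     <- c(1, 1, 1, 0, 0, 0, 1, 0)
    predicted <- c(1, 0, 1, 0, 1, 0, 1, 0)

    TP <- sum(predicted == 1 & truth == 1)
    TN <- sum(predicted == 0 & truth == 0)
    FP <- sum(predicted == 1 & truth == 0)
    FN <- sum(predicted == 0 & truth == 1)

    accuracy    <- (TP + TN) / (TP + FP + FN + TN)
    sensitivity <- TP / (TP + FN)
    specificity <- TN / (TN + FP)
    c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)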

The clearest explanation (with excellent java applets illustrating the essence of the ROC idea) I saw at http://www.anaesthetist.com/mnm/stats/roc/Findex.htm

ROC-curve:

  • It is convenient for analyzing the comparative effectiveness of two classifiers.
  • The closer the curve is to the upper left corner, the higher the predictive power of the model.
  • The diagonal line corresponds to a "useless classifier", i.e. complete indistinguishability of the classes.
  • Visual comparison does not always allow an accurate assessment of which classifier is preferable.
  • AUC (Area Under Curve) is a numerical estimate that allows ROC curves to be compared.
  • AUC takes values from 0 to 1.

Comparison of two ROC curves

Area under the curve (AUC) as a measure for comparing classifiers.

Other examples of ROC curves are given in the exploratory analysis section.
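A minimal R sketch of building a ROC curve and its AUC "by hand" by sweeping the cut-off point of a continuous classifier score; the score and the true labels are simulated, and in practice a dedicated package (for example pROC) would normally be used instead.

    set.seed(8)
    truth <- rep(c(1, 0), each = 50)                      # 1 = cancer, 0 = control
    score <- c(rnorm(50, mean = 1), rnorm(50, mean = 0))  # classifier output

    cutoffs <- sort(unique(score), decreasing = TRUE)
    roc <- t(sapply(cutoffs, function(cut) {
      pred <- score >= cut
      c(fpr = mean(pred[truth == 0]),   # 1 - specificity
        tpr = mean(pred[truth == 1]))   # sensitivity
    }))

    plot(roc[, "fpr"], roc[, "tpr"], type = "l",
         xlab = "1 - specificity", ylab = "sensitivity", main = "ROC curve")
    abline(0, 1, lty = 2)               # the "useless classifier" diagonal

    # AUC = probability that a random positive scores higher than a random negative
    auc <- mean(outer(score[truth == 1], score[truth == 0], ">") +
                0.5 * outer(score[truth == 1], score[truth == 0], "=="))
    auc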

Comparative analysis of classifiers

There are many options for applying pattern recognition methods. Comparing different approaches and choosing the best one is an important task.

The most common way of comparing classifiers in papers on proteomics (and not only) is cross-validation. In my opinion, a single run of the cross-validation procedure makes little sense. A smarter approach is to run cross-validation many times (ideally, the more the better) and build confidence intervals for the classification accuracy. Confidence intervals make it possible to decide reasonably whether, for example, a 0.5% improvement in classification quality is statistically significant. Unfortunately, only a small number of papers report confidence intervals for accuracy, sensitivity, and specificity; for this reason, the figures given in different works are difficult to compare with one another, since the range of possible values is not indicated.

Another issue is the choice of the type of cross-validation. I prefer 10-fold or 5-fold cross-validation to leave-one-out.

Of course, using cross-validation is something of an "act of desperation". Ideally, the sample should be divided into 3 parts: the model is built on the first, its parameters are optimized on the second, and the check is performed on the third. Cross-validation is an attempt to avoid these constructions and is only justified when the number of samples is small.

Other useful information can be gleaned from the numerous runs of the cross-validation procedure. For example, it is interesting to see on which objects the recognition procedure errs most often; perhaps these are data errors, outliers, or other interesting cases. Having studied the characteristic properties of these objects, you can sometimes understand in which direction the classification procedure should be improved.
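A minimal R sketch of repeated cross-validation with a simple percentile interval for the accuracy; the data and the logistic-regression classifier are illustrative placeholders.

    set.seed(9)
    n <- 100
    X <- matrix(rnorm(n * 5), nrow = n)
    y <- rbinom(n, 1, plogis(X[, 1]))

    one_cv_run <- function(X, y, folds = 10) {
      idx <- sample(rep(1:folds, length.out = length(y)))
      mean(sapply(1:folds, function(f) {
        train <- idx != f
        fit <- glm(y[train] ~ ., family = binomial,
                   data = data.frame(X[train, , drop = FALSE]))
        pred <- predict(fit, type = "response",
                        newdata = data.frame(X[!train, , drop = FALSE])) > 0.5
        mean(pred == (y[!train] == 1))
      }))
    }

    accuracies <- replicate(100, one_cv_run(X, y))
    mean(accuracies)                        # average CV accuracy
    quantile(accuracies, c(0.025, 0.975))   # simple 95% percentile interval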

Below is a table comparing classifiers from the work of Moshkovskii et al., 2007. SVM and logistic regression (LR) were used as classifiers; RFE (Recursive Feature Elimination) and Top Scoring Pairs (TSP) were used for feature selection. The use of confidence intervals makes it possible to judge reasonably whether the differences between the classification schemes are significant.

Literature

Here are some books and articles that may be useful in analyzing proteomic data.

C. Bishop, Neural Networks for Pattern Recognition

* Berrar, Dubitzky, Granzow. A Practical Approach to Microarray Data Analysis (Kluwer, 2003). The book deals with microarray processing (although I would not recommend it as an introduction to the subject), but it also contains a couple of interesting chapters. The illustration showing the effect of outliers on correlation coefficients is taken from there.

Literature marked with * is available in electronic format, and the author shares it free of charge.

Moreover, the emergence of fast modern computers and free software (such as R) has made all of these computationally intensive methods available to almost every researcher. However, this accessibility further exacerbates a well-known problem of all statistical methods, often described in English as "rubbish in, rubbish out". The point is that miracles do not happen: if we do not pay due attention to how a particular method works and what requirements it places on the analyzed data, the results obtained with its help cannot be taken seriously. Therefore, the researcher should always begin the work with a thorough acquaintance with the properties of the data obtained and a check of the conditions necessary for the applicability of the relevant statistical methods. This first stage of the analysis is called exploratory data analysis (EDA).

There are many recommendations in the statistical literature for performing exploratory data analysis (EDA). Two years ago, an excellent article was published in the journal Methods in Ecology and Evolution in which these recommendations were consolidated into a single protocol: Zuur A. F., Ieno E. N., Elphick C. S. (2010) A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution 1 (1): 3-14. Although the article was written for biologists (in particular, ecologists), the principles outlined in it certainly hold for other scientific disciplines as well. In this and subsequent blog posts I will provide excerpts from Zuur et al. (2010) and describe the EDA protocol proposed by the authors. As in the original article, the description of the individual steps of the protocol is accompanied by brief recommendations on the corresponding functions and packages of the R system.

The proposed protocol includes the following main elements:

  1. Research hypothesis formulation. Performing experiments / observations to collect data.
  2. Exploratory data analysis:
    • Identifying outlier observations
    • Checking the uniformity of dispersions
    • Checking the Normality of the Data Distribution
    • Identifying excess zero values
    • Identifying collinear variables
    • Revealing the nature of the relationship between the analyzed variables
    • Identifying interactions between predictor variables
    • Revealing spatio-temporal correlations among the values ​​of the dependent variable
  3. Application of a statistical method (model) appropriate to the situation.

Zuur et al. (2010) note that EDA is most effective when a variety of graphical tools are used, since graphs often provide a better understanding of the structure and properties of the analyzed data than formal statistical tests.

We begin our consideration of the EDA protocol with the identification of outliers. The sensitivity of different statistical methods to the presence of outliers in the data varies. Thus, when a generalized linear model is used to analyze a dependent variable distributed according to Poisson's law (for example, the number of cases of a disease in different cities), the presence of outliers can cause overdispersion, which makes the model inapplicable. At the same time, when nonparametric multidimensional scaling based on the Jaccard index is used, all initial data are converted to a nominal scale with two values (1/0), and the presence of outliers does not affect the analysis result in any way. The researcher should clearly understand these differences between methods and, if necessary, check the data for outliers. Let us give a working definition: by an "outlier" we mean an observation that is "too" large or "too" small compared to the majority of the other available observations.

Typically, box plots (box-and-whisker diagrams) are used to detect outliers. R uses robust estimates of the central tendency (the median) and of the spread (the interquartile range, IQR) to construct box plots. The upper whisker extends from the upper border of the box to the largest sample value within a distance of 1.5 x IQR from this border. Likewise, the lower whisker extends from the lower border of the box to the smallest sample value within a distance of 1.5 x IQR from that border. Observations lying outside the whiskers are considered potential outliers (Figure 1).

Figure 1. The structure of a box plot.

Examples of R functions for plotting box plots:
  • The base boxplot() function.
  • The ggplot2 package: the geometric object ("geom") boxplot. For example:
    library(ggplot2)
    p <- ggplot(mtcars, aes(factor(cyl), mpg))
    p + geom_boxplot()
    # or:
    qplot(factor(cyl), mpg, data = mtcars, geom = "boxplot")
Another very useful but unfortunately under-used graphical tool for outlier detection is the Cleveland dot plot. On such a graph, the ordinal numbers of the individual observations are plotted on the ordinate axis, and the values of these observations are plotted on the abscissa. Observations that stand out "significantly" from the main point cloud are potential outliers (Figure 2).

Figure 2. Cleveland dot plot depicting wing length data for 1,295 sparrows (Zuur et al. 2010). In this example the data have been pre-ordered according to the weight of the birds, so the point cloud has a roughly S-shaped form.


In Figure 2, the point corresponding to a wing length of 68 mm stands out well. However, this wing length should not be considered an outlier, since it differs only slightly from the other lengths. The point stands out against the general background only because the original wing lengths were ordered by the weight of the birds. Accordingly, the outlier should rather be sought among the weight values (i.e. a rather large wing length of 68 mm was recorded for an unusually light sparrow).
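A minimal R sketch of a Cleveland dot plot for this kind of outlier screening, using simulated wing lengths with one unusually large value; dotchart() plots each observation's value against its position in the data set.

    set.seed(10)
    wing_length <- c(rnorm(100, mean = 58, sd = 3), 68)  # one suspiciously large value
    dotchart(wing_length, xlab = "Wing length (mm)",
             ylab = "Order of observations")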

Up to this point, we have called an "outlier" an observation that is "significantly" different from most other observations in the data. However, a more rigorous approach is to assess how such unusual observations affect the results of the analysis. In doing so, a distinction should be made between unusual values of the dependent and independent variables (predictors). For example, when studying the dependence of the abundance of a biological species on temperature, most of the temperature values may lie between 15 and 20°C, and only one value may be equal to 25°C. Such an experimental design is, to put it mildly, not ideal, since the range from 20 to 25°C will be studied unevenly. However, in real field studies it may only be possible to take a measurement at a high temperature once. What, then, should be done with this unusual measurement taken at 25°C? With a large number of observations, such rare values can be excluded from the analysis. However, with a relatively small amount of data, reducing it even further may be undesirable from the point of view of the statistical significance of the results. If unusual values of a predictor cannot be removed for one reason or another, a transformation of this predictor (for example, taking the logarithm) can help.

Unusual values of the dependent variable are more difficult to "deal with", especially when building regression models. A transformation such as taking the logarithm can help, but since the dependent variable is of particular interest in regression modeling, it is better to try to find an analysis method based on a probability distribution that allows a wider spread of values at large mean values (for example, a gamma distribution for continuous variables or a Poisson distribution for discrete count variables). This approach makes it possible to work with the original values of the dependent variable.

Ultimately, it is up to the researcher to decide whether to remove unusual values from the analysis. At the same time, one must remember that the reasons for such observations may differ. Removing outliers arising from an unsuccessful experimental design (see the temperature example above) can be quite justified, as can removing outliers that are clearly due to measurement errors. Unusual observations among the values of the dependent variable, however, may require a more sophisticated approach, especially if they reflect the natural variability of that variable. In this regard, it is important to keep detailed documentation of the conditions under which the experimental part of the study is carried out - this can help in interpreting "outliers" during data analysis. Regardless of the reasons for the unusual observations, the final scientific report (for example, an article) should inform the reader both of the fact that such observations were identified and of the measures taken with respect to them.

STATISTICA offers a wide range of methods for exploratory statistical analysis. The system can compute virtually all descriptive statistics, including the median, mode, quartiles, user-defined percentiles, means and standard deviations, confidence intervals for the mean, skewness and kurtosis coefficients (with their standard errors), harmonic and geometric means, and many others. A choice of criteria for testing the normality of the distribution is available (the Kolmogorov-Smirnov, Lilliefors, and Shapiro-Wilk tests). A wide selection of charts aids exploratory analysis.

2. Correlations.

This section includes a large number of tools for exploring dependencies between variables. It is possible to compute practically all common measures of dependence, including the Pearson correlation coefficient, Spearman's rank correlation coefficient, Kendall's tau (b, c), gamma, the contingency coefficient C, and many others.

Correlation matrices can also be computed for data with gaps using special techniques for handling missing values.

Accessible graphics allow you to select individual points in a scatterplot and evaluate their contribution to a regression curve or any other curve fitted to the data.

3. t-tests (and other criteria for group differences).

The procedures allow the calculation of t-tests for dependent and independent samples, as well as Hotelling's statistics (see also ANOVA / MANOVA).

4. Tables of frequencies and tables of cross-tabulations.

The module contains an extensive set of procedures for tabulating continuous, categorical, and dichotomous variables, as well as variables obtained from multiple-response surveys. Both cumulative and relative frequencies are calculated. Tests for cross-tabulated frequencies are available: Pearson's chi-square, the maximum-likelihood chi-square, the Yates-corrected chi-square, Fisher's exact test, McNemar's test, and many others.

Module "Multiple Regression"

The Multiple Regression module includes a comprehensive set of tools for multiple linear and fixed nonlinear (in particular, polynomial, exponential, logarithmic, etc.) regression, including stepwise, hierarchical and other methods, as well as ridge regression.

The STATISTICA system computes a comprehensive set of statistics and advanced diagnostics, including the full regression table, partial and semi-partial correlations and covariances for the regression weights, sweep matrices, the Durbin-Watson statistic, Mahalanobis and Cook distances, deleted residuals, and many others. Residuals and outliers can be analyzed using a wide variety of plots, including various scatterplots, partial correlation plots, and many others. The prediction system allows the user to perform what-if analyses. Extremely large regression problems (up to 300 variables in the exploratory regression procedure) are supported. STATISTICA also contains a Nonlinear Estimation module, with which almost any user-defined nonlinear model can be estimated, including logit and probit regression, and more.

ANOVA module. Generic ANOVA / MANOVA module

ANOVA / MANOVA module is a set of procedures for general univariate and multivariate analysis of variance and covariance.

The module provides a wide selection of statistical procedures for testing the basic assumptions of analysis of variance, in particular the Bartlett, Cochran, Hartley, and Box tests, among others.

Discriminant Analysis Module

Discriminant analysis methods make it possible to construct, on the basis of a number of assumptions, a classification rule for assigning an object to one of several classes while minimizing some reasonable criterion, for example the probability of misclassification or a user-specified loss function. The choice of criterion is determined by the user based on the losses that would be incurred due to classification errors.

The discriminant analysis module of the STATISTICA system contains a complete set of procedures for stepwise discriminant function analysis. STATISTICA allows you to perform stepwise analysis both forward and backward, as well as within a user-defined block of variables in the model.

Module "Nonparametric Statistics and Fitting Distributions"

The module contains an extensive set of nonparametric tests, in particular the Kolmogorov-Smirnov test, the Mann-Whitney, Wald-Wolfowitz, and Wilcoxon rank tests, and many others.

All implemented rank tests handle tied ranks and use corrections for small samples.

The module's statistical procedures allow the user to easily compare the distribution of observed values with a large number of theoretical distributions. The data can be fitted to the normal, uniform, linear, exponential, gamma, lognormal, chi-square, Weibull, Gompertz, binomial, Poisson, geometric, and Bernoulli distributions. The goodness of fit is assessed using the chi-square test or the one-sample Kolmogorov-Smirnov test (the fitting parameters can be controlled); the Lilliefors and Shapiro-Wilk tests are also supported.

Factor Analysis Module

The factor analysis module contains a wide range of methods and options that provide the user with comprehensive factor analysis tools.

It includes, in particular, the principal components method, the minimum residuals method, the maximum likelihood method, etc., with extended diagnostics and an extremely wide range of analytical and exploratory plots. The module can compute the principal components of general and hierarchical factor analysis for arrays containing up to 300 variables. The space of common factors can be plotted and viewed either slice by slice or in 2D or 3D scatterplots with labeled variable points.

After the solution is determined, the user can recalculate the correlation matrix from the appropriate number of factors in order to assess the quality of the constructed model.

In addition, STATISTICA contains the Multidimensional Scaling module, the Reliability Analysis module, the Cluster Analysis module, the Log-Linear Analysis module, the Nonlinear Estimation module, the Canonical Correlation module, the Survival Analysis module, the Time Series Analysis and Forecasting module, and others.

Numerical results of statistical analysis in the STATISTICA system are output in the form of special spreadsheets called result output tables - Scrollsheets™. Scrollsheets can contain any information (both numerical and text), from a short line to megabytes of results. In STATISTICA this information is displayed as a sequence (queue) consisting of a set of Scrollsheet tables and graphs.

STATISTICA contains a large number of tools for convenient viewing and visualization of statistical analysis results. They include standard table editing operations (including operations on blocks of values, Drag-and-Drop, autofilling of blocks, etc.), convenient viewing operations (movable column boundaries, split scrolling in a table, etc.), and access to the basic statistics and graphical capabilities of the STATISTICA system. When displaying a range of results (for example, a correlation matrix), STATISTICA marks significant correlation coefficients with color. The user can also highlight the required values in a Scrollsheet table using color.

If the user needs to conduct a detailed statistical analysis of intermediate results, the Scrollsheet table can be saved in the STATISTICA data file format and then worked with as with ordinary data.

In addition to displaying the analysis results as separate windows with graphs and Scrollsheet tables in the STATISTICA workspace, the system makes it possible to create a report in whose window all this information can be displayed. A report is a document (in RTF format) that can contain any textual or graphical information. In STATISTICA it is possible to create a report automatically, the so-called auto-report; moreover, any Scrollsheet table or graph can be automatically sent to the report.

Answer:

Graphical methods can be used to find dependencies, trends, and offsets hidden in unstructured datasets.

Visualization methods include:

Presentation of data in the form of bar, line charts in multidimensional space;

Overlay and merge multiple images;

Identification and labeling of subgroups of data that meet certain conditions;

Splitting or merging subgroups of data on a graph;

Data aggregation;

Data smoothing;

Construction of pictographs;

Creation of mosaic structures;

Spectral planes and contour maps; methods of dynamic rotation and dynamic stratification of 3D images; selection of certain sets and blocks of data, etc.

Types of graphs in Statistica:

§ two-dimensional plots (histograms);

§ three-dimensional graphics;

§ matrix graphs;

§ pictographs.

Answer: These plots are collections of 2D, 3D, ternary, or n-dimensional plots (such as histograms, scatterplots, line plots, surfaces, pie charts), with one plot for each selected category (subset) of observations.

A categorized graph is a set of graphs (e.g. pie charts) built for each category of the selected variable (2 genders - 2 graphs).

A categorized data structure can be handled in a similar way: for example, statistics on customers have been accumulated and the purchase amounts need to be analyzed by various categories (men - women; elderly - mature - young).

In Statistica these include histograms, scatterplots, line graphs, pie charts, 3D graphs, and 3D ternary graphs.

As you can see, this variable has an approximately normal distribution within each group (flower species).

5. What information about the nature of the data can be obtained by analyzing scatterplots and categorized scatterplots?

Answer:

Scatterplots are commonly used to reveal the nature of the relationship between two variables (for example, profit and payroll) because they provide much more information than the correlation coefficient.



If it is assumed that one of the parameters depends on the other, then usually the values ​​of the independent parameter are plotted along the horizontal axis, and the values ​​of the dependent parameter are plotted along the vertical. Scatterplots are used to show the presence or absence of a correlation between two variables.

Each point marked on the diagram represents two characteristics of one observation, for example the age and income of an individual, each plotted along its own axis. This often helps to find out whether there is any significant statistical relationship between these characteristics and what type of function it makes sense to fit.

6. What information about the nature of the data can be obtained by analyzing histograms and categorized histograms?

Answer: Histograms are used to study the frequency distributions of variable values. Such a frequency distribution shows which specific values or ranges of values of the studied variable occur most often, how different these values are, whether the majority of observations are located near the mean, whether the distribution is symmetric or asymmetric, multimodal (i.e. has two or more peaks) or unimodal, and so on. Histograms are also used to compare observed distributions with theoretical or expected ones.



Categorized histograms are sets of histograms corresponding to different values ​​of one or more categorizing variables or sets of logical categorization conditions.

A histogram is a way of presenting statistical data in a graphical form - in the form of a bar chart. It displays the distribution of individual measurements of product or process parameters. It is sometimes called the frequency distribution, since the histogram shows the frequency of occurrence of the measured values ​​of the object's parameters.

The height of each column indicates the frequency of occurrence of parameter values ​​in the selected range, and the number of columns indicates the number of selected ranges.

An important advantage of the histogram is that it allows you to visualize the tendencies of change in the measured quality parameters of the object and to visually evaluate the law of their distribution. In addition, the histogram makes it possible to quickly determine the center, spread and shape of the distribution of a random variable. A histogram is built, as a rule, for an interval change in the values ​​of the measured parameter.

7. How are categorized graphs fundamentally different from matrix graphs in Statistica?

Answer:

Matrix plots also consist of multiple plots; however, here each of them is based (or may be based) on the same set of observations, and graphs are drawn for all combinations of variables from one or two lists.

Matrix plots depict the dependencies between several variables in the form of a matrix of XY plots. The most common type of matrix plot is the scatterplot matrix, which can be thought of as the graphical equivalent of the correlation matrix.

Matrix scatterplots. Matrix plots of this type depict 2D scatterplots organized in the form of a matrix (the values of a column variable are used as the X coordinates and the values of a row variable as the Y coordinates). Histograms depicting the distribution of each variable are located on the diagonal of the matrix (for square matrices) or along the edges (for rectangular matrices).


Categorized plots require the same choice of variables as uncategorized plots of the corresponding type (for example, two variables for a scatterplot). At the same time, for categorized plots it is necessary to specify at least one grouping variable (or a way of dividing observations into categories) containing information about the membership of each observation in a certain subgroup. The grouping variable itself is not plotted, but it serves as a criterion for dividing all analyzed cases into separate subgroups. One graph is plotted for each group (category) defined by the grouping variable.

8. What are the advantages and disadvantages of graphical methods of exploratory data analysis?

Answer: Advantages - visibility and simplicity: a multidimensional graphical presentation of the data from which the analyst himself identifies patterns and relationships between the data.

Disadvantages:

  • the methods give only approximate values;
  • a high proportion of subjectivity in the interpretation of the results;
  • lack of analytical models.

9. What analytical methods of primary exploratory data analysis do you know?

Answer: statistical methods and neural networks.

10. How to test the hypothesis about the agreement of the distribution of sample data with the normal distribution model in the Statistica system?

Answer: The χ² (chi-square) distribution with n degrees of freedom is the distribution of the sum of squares of n independent standard normal random variables.

Chi-square is a measure of the discrepancy between observed and expected frequencies. We set the significance level α = 0.05; accordingly, if the p-value > α, the hypothesis that the data follow the normal distribution is not rejected.

To test the hypothesis that the sample data agree with the normal distribution model using the chi-square test, select the Statistics / Distribution Fitting menu item. Then, in the Fitting Continuous Distributions dialog box, set the type of theoretical distribution to Normal, select the variable (Variables), and set the analysis parameters (Parameters).
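For comparison with the STATISTICA dialog described above, here is a minimal R sketch of the same kind of check on a simulated sample x: a chi-square goodness-of-fit test against the fitted normal distribution (note that the degrees of freedom are not adjusted here for the two estimated parameters), plus the Shapiro-Wilk test.

    set.seed(11)
    x <- rnorm(200, mean = 10, sd = 2)

    breaks   <- quantile(x, probs = seq(0, 1, by = 0.1))   # 10 equal-count bins
    observed <- table(cut(x, breaks, include.lowest = TRUE))
    expected_p <- diff(pnorm(breaks, mean = mean(x), sd = sd(x)))
    expected_p <- expected_p / sum(expected_p)

    chisq.test(observed, p = expected_p)   # p > 0.05: normality is not rejected
    shapiro.test(x)                        # an alternative normality test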

11. What are the main statistical characteristics of quantitative variables do you know? Their description and interpretation in terms of the problem being solved.

Answer: The main statistical characteristics of quantitative variables (a short R sketch follows this list):

mean (the sample average: the sum of the values divided by n; for example, the average volume of production among enterprises)

median (the middle value of the ordered sample)

standard deviation (square root of variance)

variance (a measure of the spread of a given random variable, i.e. its deviation from the mathematical expectation)

skewness coefficient (determines the shift relative to the center of symmetry: if it is greater than 0, the bulk of the distribution is shifted to the left and the long tail is on the right, otherwise to the right)

kurtosis coefficient (measures how peaked or flat the distribution is compared to the normal distribution)

minimum sampled value, maximum sampled value,

range (the difference between the maximum and minimum values)

sampled upper and lower quartiles

mode (the most frequently occurring value)
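A minimal R sketch computing these characteristics for a simulated numeric sample x; skewness and kurtosis are written out explicitly to avoid extra packages, and the mode is estimated crudely from rounded values.

    set.seed(12)
    x <- rnorm(200, mean = 50, sd = 10)
    n <- length(x)

    c(mean     = mean(x),
      median   = median(x),
      sd       = sd(x),
      variance = var(x),
      skewness = sum((x - mean(x))^3) / (n * sd(x)^3),
      kurtosis = sum((x - mean(x))^4) / (n * sd(x)^4) - 3,
      min      = min(x),
      max      = max(x),
      range    = diff(range(x)))

    quantile(x, c(0.25, 0.75))                      # lower and upper quartiles
    as.numeric(names(which.max(table(round(x)))))   # crude estimate of the mode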

12. What measures of the relationship are used to measure the degree of closeness of the relationship between quantitative and ordinal variables? Their calculation in Statistica and interpretation.

Answer: Correlation is a statistical relationship between two or more random variables.

In this case, changes in one or more of these quantities lead to a systematic change in another or other quantities. A measure of the correlation of two random variables is the correlation coefficient.

Quantitative:

The correlation coefficient is an indicator of the nature of changes in two random variables.

Pearson's correlation coefficient (measures the degree of linear relationships between variables. You can say that correlation determines the degree to which the values ​​of two variables are proportional to each other.)

Partial correlation coefficient (measures the degree of closeness between variables, provided that the values ​​of other variables are fixed at a constant level).

Qualitative:

Spearman's rank correlation coefficient (used for the purpose of statistically studying the relationship between phenomena. The objects under study are ordered in relation to some attribute, that is, they are assigned ordinal numbers - ranks.)


In STATISTICA, classical methods of cluster analysis are implemented, including k-means, hierarchical clustering, and two-way joining.

The data can come both in its original form and in the form of a matrix of distances between objects.

Observations and variables can be clustered using different distance measures (Euclidean, squared Euclidean, Manhattan, Chebyshev, etc.) and different rules for combining clusters (single linkage, complete linkage, unweighted and weighted pair-group average, etc.).

Formulation of the problem

The original data file contains the following information about cars and their owners:

The purpose of this analysis is to classify cars and their owners into classes, each of which corresponds to a specific risk group. Observations that fall into one group are characterized by the same probability of occurrence of an insured event, which is subsequently assessed by the insurer.

Cluster analysis is the most effective approach to this problem. In the general case, cluster analysis is designed to combine objects into classes (clusters) in such a way that the most similar objects fall into one class, while objects from different classes differ from each other as much as possible. The quantitative measure of similarity is calculated in a specified way from the data characterizing the objects.

Measurement scale

All clustering algorithms need estimates of the distances between clusters or objects, and it is clear that when calculating a distance it is necessary to set the scale of measurement.

Since different measurements use completely different types of scales, the data need to be standardized (in the Data menu, select the Standardize item), so that each variable has a mean of 0 and a standard deviation of 1.

A table with standardized variables is shown below.
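A minimal R sketch of the same standardization step on a simulated stand-in for the car data (the variable names are hypothetical): each column is rescaled to mean 0 and standard deviation 1, which is what the Data -> Standardize command does.

    set.seed(13)
    cars_df <- data.frame(power = rnorm(20, 120, 30),
                          age   = rnorm(20, 5, 2),
                          price = rnorm(20, 15000, 4000))

    cars_std <- scale(cars_df)      # column-wise (x - mean) / sd
    round(colMeans(cars_std), 10)   # approximately 0 for every variable
    apply(cars_std, 2, sd)          # exactly 1 for every variable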

Step 1. Hierarchical classification

The first step is to find out whether the cars form "natural" clusters that can be meaningfully interpreted.

Choose Cluster Analysis in the menu Analysis - Multivariate exploratory analysis to display the start panel of the Cluster Analysis module. In this dialog, select Hierarchical classification and press OK.

Press the Variables button and choose All; in the Objects field choose Observations (rows). As the amalgamation (linkage) rule, select the complete linkage method; as the proximity measure, the Euclidean distance. Click OK.

The complete linkage method defines the distance between clusters as the greatest distance between any two objects in different clusters (i.e. the "furthest neighbors").

The proximity measure defined by the Euclidean distance is the geometric distance in n-dimensional space and is calculated as follows: distance(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2).

The most important result of tree clustering is the hierarchical tree (dendrogram). Click on the Vertical dendrogram button.

Tree diagrams may seem a little confusing at first, but after some study they become more comprehensible. The diagram starts at the top (for a vertical dendrogram) with each vehicle in its own cluster.

As you move downward, cars that are "closer" to each other are merged and form clusters. Each node in the diagram represents the union of two or more clusters; the position of a node on the vertical axis determines the distance at which the corresponding clusters were merged.
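A minimal R sketch of the same hierarchical step (complete linkage on Euclidean distances over standardized data, followed by a dendrogram); cars_std is a simulated stand-in for the standardized car data.

    set.seed(14)
    cars_std <- scale(matrix(rnorm(20 * 3), nrow = 20,
                             dimnames = list(paste0("car", 1:20),
                                             c("power", "age", "price"))))

    d  <- dist(cars_std, method = "euclidean")
    hc <- hclust(d, method = "complete")     # "furthest neighbour" amalgamation
    plot(hc, main = "Complete linkage, Euclidean distance", xlab = "", sub = "")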

Step 2. K-means clustering

Based on the visual presentation of the results, it can be assumed that the cars form four natural clusters. Let us check this assumption by partitioning the initial data into 4 clusters with the k-means method and testing the significance of the differences between the resulting groups.

In the start panel of the module Cluster Analysis choose K-Means Clustering.

Press the Variables button and choose All; in the Objects field choose Observations (rows); set the number of clusters to 4.

The k-means method works as follows: the calculations begin with k randomly selected observations (in our case, k = 4), which become the centers of the groups; after that, the object composition of the clusters is changed in order to minimize the variability within clusters and maximize the variability between clusters.

Each subsequent observation (k+1, k+2, ...) is assigned to the group whose center of gravity is nearest to it.

After the composition of a cluster changes, a new center of gravity is calculated, most often as the vector of means of each parameter. The algorithm continues until the composition of the clusters stops changing.
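A minimal R sketch of this step on the same kind of simulated standardized data: k-means with k = 4, followed by the per-cluster means that are used to compare the groups.

    set.seed(15)
    cars_std <- scale(matrix(rnorm(40 * 3), nrow = 40,
                             dimnames = list(NULL, c("power", "age", "price"))))

    km <- kmeans(cars_std, centers = 4, nstart = 20)
    km$centers        # mean of each (standardized) variable within each cluster
    table(km$cluster) # cluster sizes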

When the classification results are obtained, the mean values of the indicators can be calculated for each cluster in order to assess how the clusters differ from one another.

In the Results of K-means clustering window, choose Analysis of variance (ANOVA) to determine the significance of the differences between the resulting clusters.

The p-value is < 0.05, which indicates a significant difference between the clusters.

Press the button Cluster elements and distances to view the observations included in each of the clusters. The option also allows you to display the Euclidean distances of objects from the centers (mean values) of the corresponding clusters.

First cluster:

Second cluster:

Third cluster:

Fourth cluster:

Thus, each of the four clusters contains objects with a similar influence on the loss process.

 
