Data classification and analysis


Revolutionary discoveries in natural science have often been made under the influence of experiments set up by talented experimenters. Great experiments in biology, chemistry and physics have changed our ideas about the world we live in, the structure of matter, and the mechanisms by which heredity is transmitted. Many theoretical and technological discoveries have in turn been made on the basis of the results of great experiments.

§ 9. Theoretical research methods

Lesson-lecture

There are things in the world more important than the most beautiful discoveries: the knowledge of the methods by which they were made.

Leibniz

Key terms: method, classification, systematization, systematics, induction, deduction.

Observation and description of physical phenomena. Physical laws. (Physics, grades 7 - 9).

What is a method? In science, a method is a way of constructing knowledge, a form of the practical and theoretical mastery of reality. Francis Bacon compared the method to a lamp lighting a traveler's way in the dark: "Even a lame man walking along a road outstrips one who runs without a road." A correctly chosen method must be clear and logical, lead to a specific goal, and give a result. The doctrine of the system of methods is called methodology.

The methods of cognition used in scientific activity are the empirical (practical, experimental) methods: observation and experiment; and the theoretical (logical, rational) methods: analysis, synthesis, comparison, classification, systematization, abstraction, generalization, modeling, induction and deduction. In real scientific work these methods are always used together. For example, designing an experiment requires a preliminary theoretical understanding of the problem and the formulation of a research hypothesis, and after the experiment the results must be processed with mathematical methods. Let us consider the features of some theoretical methods of cognition.

Classification and systematization. Classification allows you to order the material under study by grouping the set (class) of objects under study into subsets (subclasses) in accordance with the selected feature.

For example, all the students of a school can be divided into the subclasses "girls" and "boys". Another characteristic can be chosen instead, for example height; in that case the classification can be carried out in different ways: set a boundary of 160 cm and divide students into the subclasses "short" and "tall", or divide the height scale into 10-cm segments for a more detailed classification. Comparing the results of such a classification over several years makes it possible to establish empirically trends in the students' physical development. Thus classification as a method can be used to obtain new knowledge and can even serve as a basis for constructing new scientific theories.

In science, the same objects are usually classified according to different criteria, depending on the goals. However, the characteristic (the basis for the classification) is always chosen one at a time. For example, chemists subdivide the class of acids by the degree of dissociation (strong and weak), by the presence of oxygen (oxygen-containing and oxygen-free), by physical properties (volatile and non-volatile; soluble and insoluble), and by other characteristics.

The classification can change in the course of the development of science.

In the middle of the 20th century, the study of various nuclear reactions led to the discovery of elementary (indivisible) particles. Initially they were classified by mass, which gave rise to leptons (light), mesons (intermediate), baryons (heavy) and hyperons (superheavy). The further development of physics showed that classification by mass has little physical meaning, but the terms were kept, and as a result there are now leptons that are much more massive than baryons.

It is convenient to represent a classification in the form of tables or diagrams (graphs). For example, the classification of the planets of the solar system, represented as a graph diagram, may look like this:

[Diagram: MAJOR PLANETS OF THE SOLAR SYSTEM, divided into terrestrial planets (Mercury, Venus, Mars, ...), giant planets (Jupiter, Saturn, Uranus, ...) and Pluto as a separate subclass]

Note that in this classification the planet Pluto forms a separate subclass and belongs neither to the terrestrial planets nor to the giant planets. Scientists note that in its properties Pluto resembles an asteroid, of which there may be many on the periphery of the solar system.

In the study of the complex systems of nature, classification is in fact the first step towards constructing a natural-scientific theory. The next, higher level is systematization. Systematization is carried out on the basis of the classification of a sufficiently large amount of material: the most essential features are singled out, which make it possible to present the accumulated material as a system reflecting all the various relationships between the objects. It is needed where there is a great variety of objects and the objects themselves are complex systems. The result of the systematization of scientific data is systematics or, in other words, taxonomy. Systematics as a field of science has developed in such areas of knowledge as biology, geology, linguistics and ethnography.

The unit of systematics is called a taxon. In biology, taxa are, for example, phylum, class, order, family and genus. They are united into a single system of taxa of various ranks according to a hierarchical principle. Such a system includes a description of all existing and previously extinct organisms and clarifies the paths of their evolution. If scientists find a new species, they must confirm its place in the overall system. Changes can also be made to the system itself, which remains dynamic and continues to develop. Systematics makes it easy to navigate the whole variety of organisms: about 1.5 million species of animals alone are known, and more than 500 thousand species of plants, not counting other groups of organisms. Modern biological systematics reflects Saint-Hilaire's law: "All the diversity of life forms constitutes a natural taxonomic system consisting of hierarchical groups of taxa of various ranks."

Induction and deduction. The path of cognition in which, on the basis of systematizing the accumulated information, a conclusion about an existing regularity is drawn from the particular to the general is called induction. This method of studying nature was developed by the English philosopher Francis Bacon. He wrote: "One should take as many cases as possible - both those where the phenomenon under investigation is present and those where it is absent but where one would expect to encounter it; then one must arrange them methodically ... and give the most likely explanation; finally, try to verify this explanation by further comparison with the facts."

Thought and image

Portraits of F. Bacon and S. Holmes

Why are the portraits of a scientist and a literary hero located next to each other?

Induction is not the only way of obtaining scientific knowledge about the world. While experimental physics, chemistry and biology were built as sciences mainly through induction, theoretical physics and modern mathematics were founded on a system of axioms - consistent, speculative statements that are acceptable from the standpoint of common sense and of the historical level of development of science. Knowledge can then be built on these axioms by deriving inferences from the general to the particular, by passing from premises to consequences. This method is called deduction, and it was developed by René Descartes, the French philosopher and scientist.

A striking example of obtaining knowledge about the same subject in different ways is the discovery of the laws of motion of celestial bodies. At the beginning of the 17th century, on the basis of a large amount of observational data on the motion of the planet Mars, Kepler discovered by induction the empirical laws of planetary motion in the solar system. At the end of the same century Newton deduced the generalized laws of motion of celestial bodies from the law of universal gravitation.

In real research activity the methods of scientific research are interrelated.

1. ○ Explain what a research method is and what the methodology of natural science is.

All these approximations should be substantiated and the errors introduced by each of them should be estimated numerically.

The development of science shows that every natural-scientific law has its limits of applicability. For example, Newton's laws turn out to be inapplicable to the study of processes in the microworld. To describe these processes the laws of quantum theory were formulated, which become equivalent to Newton's laws when applied to the motion of macroscopic bodies. From the point of view of modeling this means that Newton's laws are a kind of model that follows, under certain approximations, from a more general theory. However, the laws of quantum theory are not absolute either and have their own limits of applicability. More general laws have already been formulated and more general equations obtained, which in turn also have limitations, and there is no end to this chain in sight. So far no absolute laws describing everything in nature have been obtained from which all particular laws could be derived, and it is not clear whether such laws can be formulated at all. But this means that any natural-scientific law is in fact a kind of model. The only difference from the models considered in this section is that a natural-scientific law is a model that can be used to describe not one specific phenomenon but a wide class of phenomena.

Propositions arrived at by purely logical means turn out to be completely empty when compared with reality.

A. Einstein

How to analyze and classify data correctly? Why do we need graphs and charts?

Lesson-workshop

Purpose of the work. Learn to classify and analyze data obtained from a text.

Work plan. 1. Analyze the text in order to determine the essential properties of the object it deals with. 2. Structure the content of the text in order to single out the classes of objects it deals with. 3. Understand the role of logical schemes, graphs and diagrams in making sense of the studied material, establishing logical connections and systematizing it.

Analyze the text. To do this, you need to single out mentally the subject of the text - what is essential in it - and break it down into its component parts in order to find the individual elements, features and aspects of this object.

Ivan Kramskoy. D. I. Mendeleev

Whose portraits of scientists-systematizers would you add to this series?

PORTRAIT OF BALL LIGHTNING. "A portrait of a mysterious natural phenomenon - ball lightning - was made by specialists of the A. I. Voeikov Main Geophysical Observatory using computers and the methods of forensic science. The 'composite image' of the mysterious stranger was compiled from data published in the press over three centuries, the results of research surveys, and eyewitness reports from different countries.

Which of its secrets did the floating energy bunch tell scientists?

It is most often noticed during thunderstorms. At all times, four forms of ball lightning have been observed: a sphere, an oval, a disk and a rod. This product of atmospheric electricity naturally arose for the most part in the air; however, according to American surveys, lightning can just as often be seen settling on various objects - telegraph poles, trees, houses. The dimensions of this amazing companion of thunderstorms range from 15 to 40 cm. The color? Three quarters of eyewitnesses watched sparkling balls of red, yellow and pink.

The life of this bunch of electric plasma is truly a mayfly's life, usually about five seconds. Only up to 36% of eyewitnesses saw it last longer, though never more than 30 s. Its death was almost always the same: it exploded spontaneously, sometimes after bumping into an obstacle. The 'collective portraits' made by observers of different times and peoples coincided."

If, after reading the text, you were able to answer what the text is about and what the main features, elements, aspects and properties of the subject of the discussion are, then you have analyzed it. In this case the subject, the main content of the text, is the notion of ball lightning; the properties of ball lightning are its appearance - size, shape, color - as well as its lifetime and the features of its behavior.

Based on your analysis of the text, determine its logical structure. Suggest ways of working with this text in order to assimilate and memorize it and to use it as interesting, unusual material in your further studies - in discussions and presentations.

HINT. You can draw up a plan of the text, a summary, or theses (the generalizations and conclusions you consider its main ideas). It is useful to highlight what in the material is new and unfamiliar to you. You can also take structured notes: after analyzing the text, pick out the information that is meaningful to you, try to combine it into groups, and show the connections between these groups.

The use of tables, graphs and diagrams helps to organize the study of natural-science subjects. Suppose we have at our disposal the average monthly daytime temperatures for one year for St. Petersburg and for Sochi. This material must be analyzed and systematized in order to identify any patterns.

Let us represent the disparate data first as a table, then as a graph and a diagram (Fig. 5, 6). Find patterns in the temperature distribution. Answer the questions:

  1. What are the features of the temperature distribution by months in different cities? How do these distributions differ?
  2. What is the reason for the processes that lead to this distribution?
  3. Did organizing the material with a graph and a diagram help you complete the task?

Average monthly daytime temperatures for one year for St. Petersburg and Sochi

Fig. 5. Graph of the average monthly daytime temperatures over one year for St. Petersburg and Sochi

Fig. 6. Diagram of the average monthly daytime temperatures over one year for St. Petersburg and Sochi
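Such a table can also be turned into a graph and a bar chart programmatically. Below is a minimal sketch in Python with matplotlib; the temperature values are purely illustrative placeholders, not the figures from the table above.

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
spb   = [-5, -5, 0, 7, 15, 20, 23, 21, 15, 8, 2, -3]     # illustrative values only
sochi = [ 8,  9, 11, 16, 20, 24, 27, 27, 23, 18, 13, 10]  # illustrative values only

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Graph: the monthly course of temperature for the two cities
ax1.plot(months, spb, marker="o", label="St. Petersburg")
ax1.plot(months, sochi, marker="o", label="Sochi")
ax1.set_ylabel("Average daytime temperature, °C")
ax1.legend()

# Diagram (bar chart): the same data side by side
x = range(len(months))
ax2.bar([i - 0.2 for i in x], spb, width=0.4, label="St. Petersburg")
ax2.bar([i + 0.2 for i in x], sochi, width=0.4, label="Sochi")
ax2.set_xticks(list(x))
ax2.set_xticklabels(months)
ax2.legend()

plt.tight_layout()
plt.show()
```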

Important steps to mastering the methods of scientific knowledge are:

  1. Logical analysis of the text.
  2. Drawing up a plan or diagrams and identifying the structure of the material.
  3. Writing a summary of the text or theses.
  4. Singling out new knowledge and using it in discussions, presentations, and in solving new tasks and problems.

Additional reading literature

  1. Einstein A. Without Formulas / A. Einstein; comp. K. Kedrov; trans. from English. Moscow: Mysl, 2003.
  2. Methodology of Science and Scientific Progress. Novosibirsk: Nauka, 1981.
  3. Feyerabend P. Selected Works on the Methodology of Science / P. Feyerabend. Moscow: Progress, 1986.

Last year Avito held a number of competitions, among them a contest on recognizing car makes; its winner, Evgeny Nizhibitsky, talked about his solution at the training session.


Problem statement. The make and model of a car must be determined from its images. The metric was prediction accuracy, i.e. the share of correct answers. The sample consisted of three parts: the first part was available for training from the start, the second was released later, and final predictions had to be submitted for the third.


Computing resources. I used my home computer, which heated my room all this time, and servers provided at work.

Model overview. Since our task is image recognition, the first thing we want to do is take advantage of the progress in image classification quality on the well-known ImageNet dataset. As is known, modern architectures can achieve even better quality than a human. So I started with a review of recent papers and put together a summary table of ImageNet-based architectures, implementations and reported quality.


Note that the best quality is achieved on architectures and.

Fine-tuning networks. Training a deep neural network from scratch is a rather time-consuming exercise and, moreover, not always effective in terms of results. Therefore the technique of fine-tuning is often used: a network already trained on ImageNet is taken, its last layer is replaced with a layer with the required number of classes, and then the network is tuned with a low learning rate, but on the competition data. This scheme lets you train the network faster and to a higher quality.
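A minimal sketch of this fine-tuning scheme in PyTorch/torchvision (not the author's original Torch or TensorFlow code; the number of classes is an assumption for illustration):

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 100  # hypothetical number of car make/model classes

# Take a network pretrained on ImageNet...
model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)

# ...replace the last layer with one sized for the competition's classes...
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# ...and tune the whole network with a low learning rate on the competition data.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One optimization step of fine-tuning."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```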

The first attempt at fine-tuning GoogLeNet gave about 92% accuracy on validation.

Crop predictions. The quality of predictions on the test sample can be improved by using crops: fragments of a suitable size are cut out at different places of the original image, and the results are then averaged. A 1x10 crop means taking the center of the image and its four corners, and then the same crops reflected horizontally. As you can see, the quality grows, but so does the prediction time.
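A sketch of such 1x10-crop prediction using torchvision's built-in TenCrop (four corners, the center, and their horizontal reflections); normalization is omitted for brevity, and the competition code itself may have differed:

```python
import torch
from torchvision import transforms

ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),  # 4 corners + center, plus the horizontally flipped versions
    transforms.Lambda(lambda crops: torch.stack(
        [transforms.functional.to_tensor(c) for c in crops])),
])

def predict_10crop(model, pil_image):
    model.eval()
    crops = ten_crop(pil_image)                  # (10, 3, 224, 224)
    with torch.no_grad():
        probs = model(crops).softmax(dim=1)      # (10, num_classes)
    return probs.mean(dim=0)                     # average the predictions over the crops
```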

Validation of results. After the second part of the sample appeared, I re-split the data into several parts. All further results are given for this split.

ResNet-34, Torch. You can use the ready-made repository of the architecture's authors, but to obtain test predictions in the required format you have to fix some scripts. In addition, you need to deal with the high memory consumption of the dumps. Validation accuracy is about 95%.


Inception-v3, TensorFlow. Here, too, a ready-made implementation was used, but the image preprocessing was changed and the cropping of images during batch generation was restricted. The result is almost 96% accuracy.


Ensemble of models. The result is two ResNet models and two Inception-v3 models. What validation quality can be obtained by mixing the models? The class probabilities were averaged with a geometric mean, and the weights (in this case, the exponents) were selected on a hold-out sample.
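A sketch of such a blend in numpy (the function name and the epsilon are illustrative): per-model class probabilities are combined by a weighted geometric mean, with the weights acting as exponents.

```python
import numpy as np

def geometric_blend(prob_list, weights, eps=1e-15):
    """Weighted geometric mean of class probabilities from several models.

    prob_list: list of arrays of shape (n_objects, n_classes)
    weights:   per-model exponents, selected on a hold-out sample
    """
    log_mix = sum(w * np.log(p + eps) for p, w in zip(prob_list, weights))
    mix = np.exp(log_mix / sum(weights))
    return mix / mix.sum(axis=1, keepdims=True)  # renormalize rows to probabilities
```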


Results. Training ResNet took 60 hours on a GTX 980, and Inception-v3 took 48 hours on a Titan X. During the competition we managed to try out new frameworks with new architectures.


The problem of classification of bank customers

Link to Kaggle.

Stanislav Semyonov tells how he and other top Kaggle members teamed up and won a prize in the competition on classifying client applications of a large bank, BNP Paribas.


Problem statement. Using obfuscated data from insurance claims, one must predict whether a claim can be confirmed without additional manual checks. For the bank this automates the processing of applications; for data analysts it is just a machine-learning task of binary classification. There are about 230 thousand objects and 130 features. The metric is LogLoss. It is worth noting that the winning team decrypted the data, which helped them win the competition.

Getting rid of artificial noise in the features. The first step is to look at the data. Several things are immediately apparent. First, all features take values from 0 to 20. Second, if you look at the distribution of any of the features, you see the following picture:

Why is that? The fact is that at the anonymization and noising stage random noise was added to all the values, which were then scaled to the segment from 0 to 20. The reverse transformation was carried out in two steps: first the values were rounded to a certain decimal place, and then a denominator was selected. Was this necessary if a tree will pick a split threshold anyway? Yes: after the reverse transformation the differences between variables begin to make more sense, and one-hot encoding becomes possible for the categorical variables.

Removing linearly dependent features. We also noticed that some features are sums of others, and these are clearly not needed. To detect them, subsets of features were taken and a regression was built on each subset to predict some other variable; if the predicted values were close to the true ones (taking the artificial noise into account), the feature could be removed. But the team did not bother with this and used a ready-made set of filtered features prepared by someone else. One of the features of Kaggle is its forum and public solutions, through which participants share their findings.

How do you know what to use? There is a small hack. Suppose you know that in old competitions someone used a technique that helped them rank high (short write-ups are usually posted on the forums). If that participant is again among the leaders in the current competition, most likely the same technique will work here too.

Encoding categorical variables. It was striking that a certain variable, V22, has a large number of values, yet if you take a subsample with one particular value of it, the number of levels (distinct values) of the other variables drops noticeably. There is also a good correlation with the target variable. What can be done? The simplest solution is to build a separate model for each value of V22, but that is the same as splitting on all values of this variable in the first split of a tree.

There is another way to use this information: encoding by the mean of the target variable. In other words, each value of the categorical variable is replaced by the mean value of the target over the objects for which the attribute takes that value. Such an encoding cannot be performed directly on the whole training set: in the process we would implicitly leak information about the target variable into the features - information that almost any model will certainly find.

Therefore such statistics are computed on folds. Here's an example:

Suppose the data is split into three parts. For each fold of the training set we compute the new feature using the other two folds, and for the test set we compute it over the whole training set. Then information about the target variable is not included into the sample so explicitly, and the model can make use of the resulting knowledge.
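A sketch of this fold-wise mean-target encoding in Python with pandas and scikit-learn (the function and column names are illustrative, not code from the competition):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def mean_target_encode(train, test, col, target, n_splits=3, seed=0):
    """Add an out-of-fold target-mean encoding of a categorical column."""
    train, test = train.copy(), test.copy()
    new_col = col + "_target_mean"
    train[new_col] = np.nan

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train):
        # Means are computed on the other folds only.
        means = train.iloc[fit_idx].groupby(col)[target].mean()
        train.loc[train.index[enc_idx], new_col] = (
            train.iloc[enc_idx][col].map(means).values)

    # For the test set the statistic is computed over the whole training set.
    test[new_col] = test[col].map(train.groupby(col)[target].mean())
    return train, test
```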

Will there be problems with anything else? Yes - with rare categories and with cross-validation.

Rare categories. Suppose a certain category was encountered only a few times and the corresponding objects all belong to class 0. Then the mean value of the target variable will also be zero, although a completely different situation may arise on the test sample. The solution is a smoothed mean (or smoothed likelihood), calculated by the following formula:

smoothed mean = (category mean × nrows + global mean × alpha) / (nrows + alpha)

Here global mean is the mean of the target variable over the entire sample, nrows is the number of times the specific value of the categorical variable was encountered, and alpha is a regularization parameter (for example, 10). Now if a value is rare, the global mean carries more weight, and if it is frequent enough, the result is close to the ordinary category mean. Incidentally, this formula also makes it possible to handle previously unseen values of a categorical variable.
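Expressed as code, the same smoothing looks roughly like this (a sketch; the function name and default alpha are illustrative):

```python
def smoothed_mean(category_sum, nrows, global_mean, alpha=10):
    """Smoothed target mean for one value of a categorical variable.

    category_sum / nrows would be the plain category mean; alpha pulls rare
    categories towards the global mean, and an unseen value (nrows == 0)
    simply gets the global mean.
    """
    return (category_sum + alpha * global_mean) / (nrows + alpha)
```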

Cross-validation. Suppose we have computed all the smoothed means for the categorical variables on the other folds. Can we now assess the quality of the model with standard k-fold cross-validation? No. Let's look at an example.

Say we want to evaluate the model on the third fold. We train the model on the first two folds, but they contain a new variable - the mean of the target - whose computation already used the third, test fold. This prevents a correct assessment of the results, but the problem is solved by computing the statistics on folds within folds. Let's look at the example again:

We still want to evaluate the model on the third fold. Let us split the first two folds (the training sample of our evaluation) into, say, another three folds, compute the new feature within them according to the scheme already discussed, and for the third fold (the test sample of our evaluation) compute it over the first two folds together. Then no information from the third fold is used when training the model, and the estimate is fair. In the competition under discussion, only such cross-validation made it possible to assess the model's quality correctly. Of course, the numbers of "outer" and "inner" folds can be anything.
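A sketch of this folds-within-folds validation, reusing the mean_target_encode sketch above; the data here is synthetic and the single-feature logistic regression is only a stand-in for the real models:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
data = pd.DataFrame({
    "V22": rng.integers(0, 50, size=3000).astype(str),  # a categorical feature
    "target": rng.integers(0, 2, size=3000),            # a binary target
})

outer = KFold(n_splits=3, shuffle=True, random_state=0)
scores = []
for tr_idx, te_idx in outer.split(data):
    tr, te = data.iloc[tr_idx], data.iloc[te_idx]
    # Inside the outer training part the encoding is computed on inner folds;
    # for the outer test fold it is computed over the whole outer training part,
    # so the fold being evaluated never contributes to its own features.
    tr_enc, te_enc = mean_target_encode(tr, te, col="V22", target="target")
    fill = tr["target"].mean()
    tr_enc["V22_target_mean"] = tr_enc["V22_target_mean"].fillna(fill)
    te_enc["V22_target_mean"] = te_enc["V22_target_mean"].fillna(fill)

    clf = LogisticRegression().fit(tr_enc[["V22_target_mean"]], tr_enc["target"])
    proba = clf.predict_proba(te_enc[["V22_target_mean"]])[:, 1]
    scores.append(log_loss(te_enc["target"], proba))

print(np.mean(scores))  # an honest LogLoss estimate
```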

Building features. We used not only the smoothed means of the target variable already mentioned, but also weights of evidence - almost the same thing, but with a logarithmic transformation. In addition, features such as the difference between the numbers of positive- and negative-class objects in a group, without any normalization, turned out to be useful. The intuition is this: such a scale shows the degree of confidence in the class. But what about quantitative features? If they are processed in the same way, all the values are "hammered down" by regularization towards the global mean. One option is to split the values into bins, which are then treated as separate categories. Another is simply to build some kind of linear model on one feature with the same target. In total, about two thousand features were produced from the 80 filtered ones.

Stacking and blending. As in most competitions, model stacking was an important part of the solution. In short, the essence of stacking is that we pass the predictions of one model as a feature to another model. It is important, however, not to overfit once again. Let's just look at an example:


Taken from the blog of Alexander Dyakonov

Say we decided to split our sample into three folds for stacking. Similarly to computing the statistics, we must train the model on two folds and add its predictions for the remaining fold; for the test sample the predictions of the models from each pair of folds can be averaged. Each addition of a group of new model-prediction features computed on the existing dataset is called a level of stacking.
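A sketch of such out-of-fold stacking (the helper name is illustrative; the base model can be anything with predict_proba):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

def stack_feature(model, X_train, y_train, X_test, n_splits=3, seed=0):
    """Out-of-fold predictions of `model` as a new feature for the next level.

    X_train, y_train, X_test are numpy arrays. For each fold the model is
    fitted on the other folds and predicts the held-out part; the test-set
    predictions of the per-fold models are averaged.
    """
    oof = np.zeros(len(X_train))
    test_parts = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, pred_idx in kf.split(X_train):
        model.fit(X_train[fit_idx], y_train[fit_idx])
        oof[pred_idx] = model.predict_proba(X_train[pred_idx])[:, 1]
        test_parts.append(model.predict_proba(X_test)[:, 1])
    return oof, np.mean(test_parts, axis=0)

# usage (illustrative):
# new_train_feat, new_test_feat = stack_feature(RandomForestClassifier(), X_tr, y_tr, X_te)
```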

At the first level the team had 200-250 different models, at the second another 20-30, at the third a few more. The result is blending, i.e. mixing the predictions of different models. Various algorithms were used: gradient boosting with different parameters, random forests, neural networks. The main idea is to use the most diverse models with different parameters, even if they do not give the highest quality individually.

Teamwork. Usually participants unite into teams closer to the end of a competition, when everyone already has their own groundwork. We teamed up with the other "Kagglers" at the very beginning. Each team member had a folder in a shared cloud where datasets and scripts were kept, and the common cross-validation procedure was agreed in advance so that results could be compared. The roles were distributed as follows: I invented new features, the second participant built models, the third selected them, and the fourth managed the whole process.

Where to get the computing power. Testing a large number of hypotheses, building multilevel stacking and training models can take too long on a laptop, so many participants use computing servers with many cores and plenty of RAM. I usually use AWS servers, and my teammates, it turns out, use the machines at their workplaces for competitions while those machines are idle.

Communication with the organizing company. After a successful result, communication with the company takes the form of a joint conference call: participants describe their solutions and answer questions. At BNP people were not surprised by multilevel stacking, but they were, of course, interested in feature construction, teamwork, validation of results - everything that could be useful for improving their own system.

Was it necessary to decrypt the dataset? The winning team noticed a peculiarity in the data: some features have missing values and some do not, i.e. some characteristics did not depend on specific people. In addition, 360 unique values were found, and it is logical to assume that these are time stamps. It turned out that if you take the difference between two such features and sort the whole sample by it, zeros come mostly first and then ones. This is exactly what the winners took advantage of.

Our team took third place. In total, almost three thousand teams participated.

The task of recognizing an ad category

Link to DataRing.

This is another Avito contest. It took place in several stages; the first (as well as the third, by the way) was won by Arthur Kuzin.


Problem statement. The category of an ad must be determined from its photos. Each ad had from one to five images. The metric took into account how categories coincide at different levels of the hierarchy, from general to narrower ones (the last level contains 194 categories). In total the training sample contained almost a million images, which is close to the size of ImageNet.


Difficulties of recognition. It would seem that you just need to learn to tell a TV set from a car and a car from shoes. But there is, for example, a category "British cats" and a category "other cats", and among them there are very similar images - although they can still be told apart. And what about tires, rims and wheels? Here even a human cannot cope. These difficulties explain the existence of a certain ceiling in the results of all participants.


Resources and framework. I had at my disposal three computers with powerful video cards: a home one, one provided by a laboratory at MIPT, and one at work. Therefore it was possible (and necessary) to train several networks simultaneously. MXNet, created by the same people who wrote the well-known XGBoost, was chosen as the main framework for training the neural networks; this alone was a reason to trust their new product. The advantage of MXNet is that an efficient iterator with standard augmentation is available out of the box, which is sufficient for most tasks.


Network architectures. Experience from one of the past competitions showed that the Inception family of architectures gives the best quality, so I used them here. GoogLeNet was added because it speeds up model training. We also used the Inception-v3 and Inception-BN architectures from the Model Zoo, to which a dropout layer was added before the last fully connected layer. Because of technical problems it was not possible to train with stochastic gradient descent, so Adam was used as the optimizer.



Data augmentation. To improve the quality of the network, augmentation was used - adding distorted images to the sample in order to increase the variety of the data. Transformations such as random cropping of the photo, flipping, rotation by a small angle, changing the aspect ratio and shifting were involved.
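A sketch of an equivalent augmentation pipeline in torchvision (the author used MXNet's built-in iterator; the exact parameters here are assumptions):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0),
                                 ratio=(0.8, 1.25)),          # random crop + aspect-ratio change
    transforms.RandomHorizontalFlip(),                        # flipping
    transforms.RandomRotation(10),                            # rotation by a small angle
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)), # shifting
    transforms.ToTensor(),
])
```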

Accuracy and speed of training. At first I divided the sample into three parts, but then gave up one of the validation stages for model mixing, so the second part of the sample was later added to the training set, which improved the quality of the networks. In addition, GoogLeNet was originally trained on a Titan Black, which has half the memory of a Titan X, so this network was retrained with a larger batch size and its accuracy increased. Looking at the training times, one can conclude that under time constraints it is not worth using Inception-v3, since the other two architectures train much faster; the reason is the number of parameters. Inception-BN learns the fastest.

Making predictions.

Like Evgeny in the car-makes competition, Arthur used crop predictions - but on 24 patches rather than 10. The patches were the corners, their reflections, the center, rotations of the central parts, and ten more random ones.

If you save the state of the network after each epoch, you end up with many different models, not just the final network. Taking into account the time remaining before the end of the competition, I could use the predictions of models from 11 epochs, since making predictions with a network also takes considerable time. All these predictions were averaged according to the following scheme: first with an arithmetic mean inside the crop groups, then with a geometric mean with weights selected on the validation set; these groups are mixed, and the operation is repeated over all epochs. At the end, the class probabilities of all the images of one ad are averaged with an unweighted geometric mean.
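A sketch of this averaging scheme in numpy; the exact grouping of the 24 crops is an interpretation of the description above, and the function names are illustrative:

```python
import numpy as np

def geom_mean(p, axis, weights=None, eps=1e-15):
    return np.exp(np.average(np.log(p + eps), axis=axis, weights=weights))

def combine_ad_predictions(probs, group_weights):
    """probs: (n_images, n_epochs, n_groups, n_crops_per_group, n_classes).

    1) arithmetic mean inside each crop group,
    2) weighted geometric mean over the groups (weights from validation),
    3) geometric mean over epochs,
    4) unweighted geometric mean over the ad's images.
    """
    per_group = probs.mean(axis=3)
    per_epoch = geom_mean(per_group, axis=2, weights=group_weights)
    per_image = geom_mean(per_epoch, axis=1)
    ad_probs = geom_mean(per_image, axis=0)
    return ad_probs / ad_probs.sum()
```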


Results. When selecting the weights at the validation stage, the competition metric was used, since it did not correlate all that well with plain accuracy. Predicting on different parts of the images adds only a small fraction of quality compared with a single prediction, but it is precisely this increase that makes it possible to show the best result. At the end of the competition it turned out that the first three places differed by thousandths. For example, Zhenya Nizhibitsky had a single model that was only slightly inferior to my ensemble of models.


Learning from scratch vs. fine-tuning. After the competition ended, it turned out that despite the large sample size it was better to train the network not from scratch but from a pre-trained one: this approach shows better results.

Reinforcement Learning Problem

The Black Box Challenge was not quite like an ordinary Kaggle competition: it was not enough to label some test sample. You had to program an "agent" and upload its code into the system; the agent was placed in an environment unknown to the participant and made decisions in it on its own. Such tasks belong to the field of reinforcement learning.

Mikhail Pavlov from 5vision talked about his approaches to the solution. He took second place in the competition.


Problem statement. For an environment with unknown rules, one had to write an "agent" that would interact with that environment. Schematically, it is a kind of brain that receives the state and a reward from the black box, decides on an action, and then receives a new state and a reward for the action performed. The actions are repeated one after another throughout the game. The current state is described by a vector of 36 numbers, and the agent can take one of four actions. The goal is to maximize the sum of rewards over the entire game.


Environment analysis. A study of the distribution of the environment's state variables showed that the first 35 components do not depend on the chosen action and only the 36th changes with it. Different actions affected it differently: some increased or decreased it, some did not change it at all. That does not mean the whole environment depends on a single component: it may contain hidden variables. In addition, experiments showed that if you perform more than 100 identical actions in a row, the reward becomes negative, so strategies like "do only one action" were ruled out immediately. One of the participants noticed that the reward is proportional to that same 36th component. It was suggested on the forum that the black box imitates a financial market, where the 36th component is the portfolio and the actions are buying, selling and doing nothing. These options correlated with changes in the portfolio, and the meaning of one of the actions remained unclear.


Q-learning. The main goal of participating was to try out different reinforcement-learning techniques. One of the simplest and best-known methods is Q-learning. Its essence is an attempt to build a function Q that depends on the state and the chosen action and measures how "good" it is to choose that action in that state, where "good" includes not only the reward received now but also future rewards. Such a function is trained iteratively: at each iteration we move the function closer to itself at the next step of the game, taking into account the reward received now. Q-learning assumes a fully observable Markov process (in other words, the current state must contain all the information about the environment). Although, according to the organizers, the environment did not satisfy this requirement, Q-learning could be applied quite successfully.
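A minimal sketch of the Q-learning update for this setting (36 state components, four actions), using a simple linear approximation of Q instead of the author's regressions and neural networks:

```python
import numpy as np

N_STATE, N_ACTIONS = 36, 4
GAMMA, LR = 0.99, 1e-3

w = np.zeros((N_ACTIONS, N_STATE))   # Q(s, a) is approximated as w[a] . s

def q_values(state):
    return w @ state                 # Q-estimates for all four actions

def q_update(state, action, reward, next_state):
    """One-step update: move Q(s, a) towards reward + gamma * max_a' Q(s', a')."""
    target = reward + GAMMA * q_values(next_state).max()
    error = target - q_values(state)[action]
    w[action] += LR * error * state  # gradient step for the linear approximation
```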

Adaptation to the black box. It was found experimentally that n-step Q-learning, where the reward is taken not for the last action alone but for n steps ahead, suited the environment best. The environment allowed saving the current state and rolling back to it, which made collecting a sample easier: you could try every action from the same state, not just one. At the very beginning of training, when the Q-function could not yet evaluate actions, the strategy was "always perform action 3"; it was assumed that this action changed nothing, so training could start on noise-free data.

The learning process. Training proceeded as follows: with the current policy (the agent's strategy) we play a whole episode, accumulating a sample; then we update the Q-function on the collected sample, and the sequence is repeated for a certain number of epochs. The results were better than when updating the Q-function during the game. Other methods - replay memory (a shared database of recorded game episodes used for training) and the simultaneous training of several agents playing asynchronously - also turned out to be less effective.

Models. The solution used three regressions (each one per action) and two neural networks, with some quadratic features and interactions added. The final model is a mixture of all five models (five Q-functions) with equal weights. In addition, online fine-tuning was used: during testing, the old regression weights were mixed with the new weights obtained on the test sample. This was done only for the regressions, since their solutions can be written out analytically and recalculated rather quickly.


More ideas. Naturally, not all ideas improved the final result. For example, reward discounting (when we do not simply maximize the total reward but consider each following move less useful), deeper networks, and a dueling architecture (which evaluates the usefulness of the state and of each action separately) did not improve the results. Because of technical problems it was not possible to use recurrent networks, although in an ensemble with other models they might have given some benefit.


Outcome. The 5vision team took second place, though with a very small margin over the bronze winners.


So why should you compete in data science competitions?

  • Prizes. Successful performance in most competitions is rewarded with cash prizes or other valuable gifts. Over seven years, more than seven million dollars in prize money have been awarded on Kaggle.
  • Career. Sometimes a prize place.
  • Experience. This is, of course, the most important thing. You can explore a new area and start tackling challenges you haven't encountered before.

Machine learning trainings are now held on Saturdays every other week. The venue is the Moscow office of Yandex; the usual number of attendees (guests plus Yandex employees) is 60-80 people. The main feature of the training is its topicality: each time, a competition that ended one or two weeks earlier is dissected. This makes it hard to plan everything precisely, but the competition is still fresh in memory, and many people who tried their hand at it gather in the hall. The trainings are supervised by Emil Kayumov, who, by the way, helped with writing this post.

In addition, there is another format: joint solving sessions, where novice specialists take part together in ongoing competitions. These sessions are held on Saturdays when there is no training. Anyone can attend events of both types; announcements are published in the groups.

Basically, data mining is about processing information and identifying patterns and trends in it that help in decision-making. The principles of data mining have been known for many years, but with the advent of big data they have become even more widespread.

Big data has led to an explosive growth in the popularity of broader data mining techniques, in part because information has grown so much more, by its very nature and content, becoming more diverse and expansive. When dealing with large datasets, relatively simple and straightforward statistics are no longer enough. With 30 or 40 million detailed purchase records, it’s not enough to know that two million of them are from the same location. To better meet the needs of customers, you need to understand if the two million are in a specific age group and know their average earnings.

These business requirements have gone from simple search and statistical data analysis to more sophisticated data mining. To solve business problems, data analysis is required that allows you to build a model for describing information and ultimately leads to the creation of a resulting report. This process is illustrated.

Figure 1. Process flow diagram

The process of analyzing data, searching, and building a model is often iterative, as you need to track down and identify various pieces of information that can be extracted. You also need to understand how to link, transform, and combine them with other data to get a result. After discovering new elements and aspects of data, the approach to identifying sources and data formats and then comparing this information with a given result may change.

Data mining tools

Data mining is not only about the tools or database software used. Data mining can be done with relatively modest database systems and simple tools, including building your own, or using off-the-shelf software packages. Sophisticated data mining draws on past experience and algorithms defined using existing software and packages, with different specialized tools associated with different methods.

For example, IBM SPSS®, which is rooted in statistical analysis and polling, allows you to build effective predictive models on past trends and make accurate predictions. IBM InfoSphere® Warehouse provides data source discovery, preprocessing and mining in a single package, allowing you to extract information from the source database directly into the final report.

More recently, it has become possible to work with very large datasets and cluster / large-scale data processing, which allows for even more complex generalizations of data mining results across groups and comparisons of data. A completely new range of tools and systems is available today, including combined storage and data processing systems.

You can analyze a wide variety of datasets, including traditional SQL databases, raw text data, key / value sets, and document databases. Clustered databases such as Hadoop, Cassandra, CouchDB, and Couchbase Server store and provide access to data in ways that do not follow the traditional tabular structure.

In particular, a more flexible format for storing a database of documents gives information processing a new focus and complicates it. SQL databases are strictly structured and adhere to the schema, making it easy to query and parse data with a known format and structure.

Documentary databases that follow a standard structure like JSON, or files with some machine-readable structure, are also easy to handle, although this can be complicated by their varied and fluid structure. For example, in Hadoop, which processes completely "raw" data, it can be difficult to identify and extract information before processing and correlating it.

Basic methods

Several basic methods that are used for data mining describe the type of analysis and data recovery operation. Unfortunately, different companies and solutions don't always use the same terms, which can add to the confusion and perceived complexity.

Let's take a look at some key techniques and examples of how to use certain data mining tools.

Association

Association (or relation) is probably the most well-known, familiar and simple data mining technique. To identify patterns, a simple comparison is made between two or more elements, often of the same type. For example, by tracking shopping habits, you may notice that cream is usually bought with strawberries.
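A toy sketch of the association idea in Python (hypothetical baskets and plain co-occurrence counting, rather than any particular product's wizard):

```python
from collections import Counter
from itertools import combinations

baskets = [
    {"strawberries", "cream", "bread"},
    {"strawberries", "cream"},
    {"bread", "milk"},
    {"strawberries", "cream", "milk"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    pair_counts.update(combinations(sorted(basket), 2))

for (a, b), n in pair_counts.most_common(3):
    print(f"{a} + {b}: bought together in {n / len(baskets):.0%} of baskets")
```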

Building data mining tools based on associations or relations is not difficult. For example, InfoSphere Warehouse provides a wizard that walks you through information-flow configurations for creating associations by examining the input source, the basis for decisions and the output information. Figure 2 gives the corresponding example for the sample database.

Figure 2. Information flow used in the association approach

Classification

Classification can be used to get an idea of the type of a customer, product or object by describing several attributes that identify a particular class. For example, cars can easily be classified by type (sedan, SUV, convertible) by defining various attributes (number of seats, body shape, drive wheels). Studying a new car, you can assign it to a particular class by comparing its attributes with a known definition. The same principles can be applied to customers, for example by classifying them by age and social group.

In addition, classification can be used as input for other methods. For example, decision trees can be used to define a classification, and clustering allows the common attributes of different classifications to be used to identify clusters.
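A toy sketch of attribute-based classification with a decision tree (scikit-learn; the attributes and values are invented for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical car attributes: [seats, doors, ground clearance in cm]
X = [[5, 4, 15], [5, 4, 14], [7, 5, 22], [5, 5, 21], [2, 2, 12], [2, 2, 13]]
y = ["sedan", "sedan", "SUV", "SUV", "convertible", "convertible"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# A new car is assigned to a class by comparing its attributes with the learned rules.
print(clf.predict([[5, 5, 20]]))   # likely ['SUV'] for this toy data
```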

Clustering

By examining one or more attributes or classes, you can group individual data items together to obtain a structured conclusion. At a simple level, clustering uses one or more attributes as the basis for identifying a cluster of similar results. Clustering is useful for identifying different information because it correlates with other examples, so that you can see where similarities and ranges agree.

The clustering method works in both directions. You can assume that there is a cluster in a certain area and then use your identification criteria to check it. The graph in Figure 3 shows an illustrative example in which the buyer's age is compared with the purchase price. It is reasonable to expect that people in their twenties and thirties (before marriage and children) and people in their fifties and sixties (when the children have left home) have higher disposable income.

Figure 3. Clustering

In this example two clusters are visible, one around $2,000 / 20-30 years and the other around $7,000-8,000 / 50-65 years. In this case we put forward a hypothesis and tested it on a simple graph that can be plotted with any suitable graphing software. More complex combinations require a full analytical package, especially if decisions are to be based automatically on nearest-neighbor information.
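The same hypothesis can be checked with a clustering algorithm; below is a sketch with synthetic (age, purchase amount) data mimicking the two groups described above:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
young = np.column_stack([rng.uniform(20, 30, 50), rng.normal(2000, 300, 50)])
older = np.column_stack([rng.uniform(50, 65, 50), rng.normal(7500, 500, 50)])
X = np.vstack([young, older])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # centers near (25, 2000) and (57, 7500)
```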

This clustering is a simplified example of the so-called nearest-neighbor picture. Individual buyers can be distinguished by their literal proximity to one another on the chart. It is very likely that customers from the same cluster share other common attributes, and that assumption can be used for searching, classification and other analyses of the members of a dataset.

The clustering method can also be applied in the opposite direction: given certain input attributes, identify various artifacts. For example, a recent study of four-digit PIN codes found clusters of numbers in the ranges 1-12 and 1-31 for the first and second pairs. By plotting these pairs on a graph, you can see the clusters associated with dates (birthdays, anniversaries).

Forecasting

Forecasting is a broad topic that ranges from predicting component failures to detecting fraud and even predicting a company's profit. When combined with other data mining techniques, forecasting involves trend analysis, classification, model matching, and relationships. By analyzing past events or instances, the future can be predicted.

For example, using credit card authorization data, you can combine decision tree analysis of a person's past transactions with classification and comparison with historical models to detect fraudulent transactions. If the purchase of airline tickets in the United States coincides with transactions in the United States, then it is likely that these transactions are genuine.

Sequential models

Sequential models, which are often used to analyze long-term data, are a useful method for identifying trends, or regular recurrence of similar events. For example, by looking at customer data, you can determine that they buy certain sets of products at different times of the year. Based on this information, the shopping basket forecasting application can automatically assume that certain products will be added to the shopping cart based on the frequency and history of purchases.

Decision trees

A decision tree associated with most other methods (mainly classification and forecasting) can be used either within the selection criteria or to support the selection of specific data within the overall structure... A decision tree starts with a simple question that has two answers (sometimes more). Each answer leads to the next question, helping to classify and identify data or make predictions.

Figure 5. Data preparation

The data source, location, and database affect how information is processed and combined.

Reliance on SQL

The simplest approach is often to rely on SQL databases. SQL (and the corresponding table structure) is well understood, but the structure and format of the information cannot be ignored entirely. For example, when studying user behavior from sales data in an SQL data model (and in data mining in general), there are two main formats you can use: transactional and behavioral-demographic.

When working with InfoSphere Warehouse, building a behavioral-demographic model to analyze customer data and understand customer behavior means taking raw SQL data based on transaction information and known customer parameters and organizing it into a predefined table structure. InfoSphere Warehouse can then mine this information using clustering and classification techniques to obtain the desired result. Customer demographic and transactional data can be combined and then converted into a format that allows the analysis of specific data, as shown in Figure 6.

Figure 6. Custom data analysis format

For example, sales data can show sales trends for specific products. Raw sales data for individual items can be converted into transaction information, which maps customer IDs to transaction data and item codes. Using this information, it is easy to identify consistencies and relationships for individual products and individual customers over time. This allows InfoSphere Warehouse to compute consistent information by determining, for example, when a customer is likely to purchase the same item again.

New data analysis points can be created from the original data. For example, you can expand (or refine) product information by comparing or classifying individual products into broader groups, and then analyze the data for those groups instead of individual customers.

Figure 7. MapReduce structure

In the previous example, we processed (in this case with MapReduce) the original data in a document database and converted it to a tabular format in a SQL database for data mining purposes.

Working with this complex and even unstructured information may require more preparation and processing. Some complex data types and structures cannot be processed and prepared in the desired form in a single step. In that case you can chain the MapReduce output either to transform the data sequentially and obtain the required structure, as shown in Figure 8, or to produce several output tables separately.

Figure 8. Sequential chain of MapReduce processing results

For example, in a single pass you can take the raw information from the document database and perform a MapReduce operation to obtain an overview of that information by date. A good example of a sequential process is regenerating the information and combining the results with a decision matrix (created in the second stage of MapReduce processing), followed by further simplification into a sequential structure. At the processing stage, MapReduce requires the whole dataset to support the individual data-processing steps.
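A toy illustration of the map/reduce idea in plain Python - summarizing "documents" by date so the result can be loaded into a tabular store (the field names are invented):

```python
from collections import defaultdict

docs = [
    {"date": "2024-01-01", "amount": 10.0},
    {"date": "2024-01-01", "amount": 5.5},
    {"date": "2024-01-02", "amount": 7.0},
]

# Map: emit (key, value) pairs from each document.
mapped = ((doc["date"], doc["amount"]) for doc in docs)

# Shuffle + reduce: group by key and aggregate.
totals = defaultdict(float)
for day, amount in mapped:
    totals[day] += amount

for day in sorted(totals):
    print(day, totals[day])   # a per-date overview ready for a SQL table
```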

Regardless of the source data, many tools can use flat files, CSVs, or other data sources. For example, InfoSphere Warehouse can parse flat files in addition to directly communicating with the DB2 data warehouse.

Conclusion

Data mining is about more than just performing some complex queries on the data stored in the database. Whether you're using SQL, document-based databases like Hadoop, or simple flat files, you need to work with, format, or restructure the data. You want to determine the format of information on which your method and analysis will be based. Then, when the information is in the right format, you can apply different methods (individually or collectively), independent of the underlying data structure or dataset required.

Despite the fact that the "information analysis process" is more of a technical term, its content is 90% related to human activities.

Understanding the needs at the heart of any information analysis task is closely related to understanding a company's business. Collecting data from suitable sources requires experience in fitting them, no matter how automated the final data collection process may be. Turning the collected data into insights and effectively putting them into practice requires deep knowledge of business processes and consulting skills.

The information analysis process is a cyclical flow of events that begins with an analysis of the needs in the area under consideration. This is followed by the collection of information from secondary and (or) primary sources, its analysis and preparation of a report for decision-makers who will use it, as well as give their feedback and prepare proposals.

At the international level, the process of analyzing information is characterized as follows:

  • First, the decision stages are defined in the key business processes and compared with the standard information analysis end results.
  • The information analysis process begins with a needs assessment at the international level, that is, with the identification and verification of future decision-making needs.
  • The stage of collecting information is automated, which allows you to allocate time and resources for the primary analysis of information and, accordingly, increase the value of the already available secondary information.
  • A significant part of time and resources is spent on information analysis, inference and interpretation.
  • The resulting analytical information is brought to the attention of each decision-maker on an individual basis, tracking the process of its further use.
  • The members of the information analysis team have a mindset for continuous improvement.

Introduction: the cycle of information analysis

The term "information analysis process" refers to an ongoing, cyclical process that begins with identifying the information needs of decision-makers and ends with the provision of the amount of information that meets those needs. In this regard, an immediate distinction should be made between the volume of information and the process of analyzing information. Determination of the amount of information is aimed at identifying the goals and needs for information resources for the entire information analysis program, while the information analysis process begins with determining the needs for one, albeit insignificant, end result of such an analysis.

The information analysis process should always be tied to the company's existing processes - strategic planning, sales, marketing or product management - in which the information will be used. In practice, the resulting information should either be directly linked to decision-making situations or help to raise the organization's awareness in those areas of operating activity that are relevant to various business processes.

Fig. 1 shows the stages of the cyclical information analysis process (described in more detail below). The right side of the diagram distinguishes the deliverables produced when decisions are based on general market research from the deliverables tied directly to specific business processes and projects.


The cycle of information analysis consists of six stages. Their detailed description is given below.

1. Analysis of needs

A thorough needs assessment defines the goals and scope of the information analysis task. Even when the people solving the task will use the information themselves, it makes sense to identify the key directions clearly so that resources are concentrated in the most appropriate areas. In the vast majority of cases, however, those conducting the research are not the end users of the results. They therefore need a complete understanding of what the deliverables will be used for, to avoid collecting and analyzing data that ultimately proves irrelevant to users. Various templates and questionnaires have been developed for the needs analysis stage, setting a high quality standard at the very start of the task.

The most important thing, however, is that the organization's information analysis needs be fully understood and internalized; only then will the information analysis program deliver real value. Templates and questionnaires alone cannot achieve this. They can certainly be useful, but there have been cases where an excellent needs analysis was carried out simply on the basis of an informal conversation with company leaders. This, in turn, requires the information analysis team to take a consulting approach, or at least to be able to communicate effectively with decision-makers.
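Purely as an illustration of what such a template might capture (the fields and values below are assumptions, not a standard form), a needs-assessment checklist can be sketched as a simple data structure:

```python
# A hypothetical needs-assessment template, sketched as a plain data structure.
needs_assessment = {
    "requested_by": "Head of Product Management",      # decision-maker
    "business_process": "annual strategic planning",   # where the output will be used
    "decision_to_support": "enter or skip market X",
    "key_questions": [
        "What is the current market size and growth rate?",
        "Who are the main competitors and what are their shares?",
    ],
    "deadline": "2024-03-31",
    "preferred_format": "short slide deck plus one-page summary",
}

# Fields left empty or vague should trigger a follow-up conversation with the
# decision-maker before any data collection starts.
open_items = [field for field, value in needs_assessment.items() if not value]
print("open items:", open_items or "none")
```

Whether the template lives in a spreadsheet, a form, or an informal conversation, the point is the same: every item should be answered before collection begins.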

2. Coverage of secondary sources of information

Within the information analysis cycle, we treat the collection of information from secondary and primary sources separately, for several reasons. First, gathering information from publicly available sources is cheaper than going directly to primary sources. Second, it is easier, provided of course that the people given the task have sufficient experience in working with the available secondary sources; in fact, managing information sources and optimizing the related costs is a separate area of expertise in itself. Third, covering secondary sources before conducting interview-based research gives the researchers valuable background information that can be verified against, and used in response to, what interviewees say. Moreover, if the study of secondary sources already answers some of the questions, the cost of the primary-source stage is reduced, and sometimes the need for it disappears altogether.

3. Research of primary sources

However large the amount of publicly available information today, not everything can be learned from secondary sources. Once the secondary sources have been explored, the remaining research gaps can be filled by interviewing experts familiar with the topic. This stage can be relatively expensive compared to secondary research, depending on the scale of the task and on the resources involved: companies often bring in third-party contractors for primary-source research.

4. Analysis

After information has been collected from the various sources, it must be interpreted in light of the initial needs analysis and the task at hand. Again, depending on the scope of the task, this stage can be quite costly: it includes at least the time spent by internal and sometimes external resources and, possibly, additional verification of the analysis results through further interviews.

5. Delivery of results

The format in which results are presented at the end of an information analysis task matters a great deal to end users. Decision-makers typically have no time to hunt for the key findings in a large volume of material, so the main content should be translated into an easy-to-read format based on their requirements, while easy access to additional background data is provided for those who want to dig deeper. These ground rules apply regardless of the presentation format: database software, a newsletter, a PowerPoint presentation, a personal meeting, or a seminar. There is also another reason why we separate the delivery stage from end use and from the collection of feedback on the delivered analytical information. Sometimes decisions are made immediately, in the same setting in which the analytical information is delivered; more often, however, background and reference material is provided before the actual decision-making situation arises, so the format, channel, and manner of presentation affect how the information will be perceived.

6. Use and feedback

The use phase serves as a litmus test for the success of the information analysis task: it shows whether the results meet the needs identified at the very beginning of the process. Whether or not all of the original questions have been answered, the use phase tends to raise new questions and the need for a new needs analysis, especially if the need for information analysis is ongoing. Moreover, when end users and analysts have created the content together, the end users may already have contributed to the expected deliverable by the time it moves into the use phase. Conversely, those who did most of the analysis can be actively involved in drawing conclusions and interpreting the results on which the final decisions will be based. Ideally, thoughtful feedback gathered during the use phase already serves as input for assessing the needs of the next information analysis task. With that, the information analysis cycle is complete.
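To summarize the structure described above, here is a minimal sketch (purely illustrative; the stage names simply mirror the section headings) that walks a task through the six stages and loops feedback back into a new needs analysis:

```python
from enum import Enum


class Stage(Enum):
    NEEDS_ANALYSIS = 1
    SECONDARY_SOURCES = 2
    PRIMARY_SOURCES = 3
    ANALYSIS = 4
    DELIVERY = 5
    USE_AND_FEEDBACK = 6


def run_cycle(task, passes=2):
    """Walk an information analysis task through the six stages.

    Feedback collected in stage 6 of one pass feeds the needs analysis
    of the next pass, which is what makes the process cyclical.
    """
    for n in range(1, passes + 1):
        for stage in Stage:
            print(f"pass {n}: {stage.name.lower().replace('_', ' ')} for '{task}'")


run_cycle("competitor pricing review")
```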

Getting Started: Developing an Information Analysis Process

Identifying the decision-making stages in business processes that require analytical market research

The term "information analysis for the decision-making stage" is gaining popularity as companies that already run information analysis programs look for ways to integrate them more effectively into decision-making processes. How abstract or, conversely, concrete the measures to "improve the link between information analysis deliverables and business processes" will be depends largely on whether those business processes have been formally defined, and on whether the information analysis team understands the specific information needs associated with the decision-making stages of those processes.

As mentioned in Chapter 1, the methods and techniques discussed in this book are best suited to companies that already have structured business processes, such as strategy development. Companies with less formally structured management may need to be somewhat creative in applying world-class market analysis methodologies to their governance arrangements. The basic principles discussed here, however, will work for any company.

Information analysis needs assessment: why is it so important?

Given that understanding the key information analysis requirements early on affects the quality of the deliverables more strongly than any other stage of the process, it is striking how often the needs assessment phase is overlooked. Even with resource constraints elsewhere in the information analysis process, close attention to needs assessment alone would in many cases significantly increase the value and applicability of the end results, thereby justifying the time and resources spent on the task. Specific ways to improve the quality of needs assessment are discussed below.

It is often assumed automatically that management knows what information the company needs. In reality, top management tends to be aware of only a fraction of the organization's information needs, and even then it may not be best placed to determine exactly what information is needed, let alone where it can be found.

As a result, the same situation repeats itself: information analysis tasks are performed without a clearly formulated problem statement or business context. Those most familiar with the sources and methods of analysis waste time on what amounts to messy data processing and lose sight of the big picture and of the approaches that matter most to the company. Unsurprisingly, decision-makers end up receiving far more information than they need, which is counterproductive: they soon begin to ignore not only the useless information but the important information as well. They do not need more information; they need better, more accurate information.

At the same time, decision-makers may have unrealistic expectations about the availability and accuracy of information if they did not consult information analysis experts before setting the task. Ideally, therefore, analysts and decision-makers should stay in constant contact and work together so that both sides share the same understanding of the primary information needs. Managing this process requires a number of skills from the analysts working in this area:

  • The analyst must understand how to identify and define the information needs of decision makers.
  • The analyst should develop skills for effective communication, interviews and presentations.
  • Ideally, the analyst should understand personality types in order to take account of the different orientations of decision-makers.
  • The analyst needs to know the organizational structure, culture and environment, as well as the key interviewees.
  • The analyst must maintain objectivity.

Working within the information analysis cycle and eliminating "bottlenecks" in the process

In the early stages of implementing an information analysis program, the target group is usually limited, as are the deliverables the program produces. Likewise, various bottlenecks often arise in producing those deliverables: even the simple collection of disparate data from secondary and primary sources can require knowledge and experience the company does not have, and once collection is complete there may not be enough time and resources for a detailed analysis of the collected data, let alone for preparing informative, well-crafted presentations that decision-makers can use. Moreover, in the early stages of an information analysis program almost no company has dedicated tools for storing and disseminating the results; typically, the results end up being sent to the target groups as ordinary email attachments.

The complexity of an analytical task within the information analysis cycle can be described using the standard project management triangle: the task must be completed and the result delivered under three main constraints of budget, time, and scope of work. These three constraints often compete with each other. In a typical information analysis task, increasing the scope requires more time and budget; a tight deadline is likely to mean a larger budget and a simultaneously reduced scope; and a tight budget is likely to mean both a limited scope and a shortened project timeline.
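A back-of-the-envelope sketch of how the three constraints interact is shown below; the linear effort model and the numbers are assumptions made purely for illustration, not a planning method taken from the text.

```python
def required_days(scope_units, analysts, days_per_unit=1.5):
    """Crude effort model: time grows with scope and shrinks as more
    analyst capacity (a simple proxy for budget) is applied."""
    return scope_units * days_per_unit / analysts


baseline = required_days(scope_units=20, analysts=2)        # 15.0 days
bigger_scope = required_days(scope_units=30, analysts=2)    # 22.5 days: more scope, more time
tight_deadline = required_days(scope_units=20, analysts=3)  # 10.0 days: faster, but costs budget

print(baseline, bigger_scope, tight_deadline)
```

Fixing any two of the constraints effectively determines the third, which is why a tight deadline and a tight budget together usually force the scope down.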

In the early stages of developing an information analysis program, bottlenecks in the process usually cause significant friction in carrying out research tasks within the cycle. Since resources are limited, the most critical bottlenecks should be addressed first. Does the team have sufficient analytical capacity? Is additional training needed? Or is the problem rather that the analysts lack valuable information to work with, in other words, is information gathering the most critical bottleneck? Or does the team simply lack time, so that it cannot respond promptly to urgent requests?

There are two ways to improve the efficiency of an analytical task within the information analysis cycle: increasing the "throughput" of the cycle, that is, the thoroughness with which the team can handle analytical tasks at each stage, and increasing the speed with which a question is answered. Fig. 2 shows the difference between these approaches and, more generally, the difference between strategic analysis tasks and research requests requiring a rapid response.

Although both approaches take the analytical task through all stages of the cycle, a team tasked with rapid research will study secondary and primary sources in parallel (sometimes a single phone call to a specialist yields the answers to the questions posed in the research request). In many cases the analysis and the delivery of information are also combined, for example in a brief summary that the analyst hands to the manager who requested the information.

The throughput of the information analysis cycle can be improved by adding either internal (hired) or external (contracted) resources where they are needed, in order to achieve better results and serve a growing number of user groups within the organization.

The same principle applies to the responsiveness of the workflow, that is, how quickly an urgent research task moves through the stages of the cycle. Traditionally, companies have focused on ensuring stable throughput through long-term resource planning and staff training schemes. However, as information analysis develops into a specialized field and global professional resources become more readily available from outside, case-by-case temporary arrangements that provide the necessary flexibility are becoming more common.

Fig. 3 shows two types of outcomes of the information analysis cycle, namely strategic analysis and research requiring a rapid response (see the graph of information analysis outcomes). Although rapid-response research tasks are usually tied to business processes, their level of analysis is not very deep, simply for lack of time. Strategic analysis tasks, on the other hand, usually involve a high degree of co-creation at the analysis and delivery stages, which places them practically at the top of the triangle, where the information obtained is interpreted and applied.

Continuous development: striving for world-class information analysis

A smoothly running information analysis process can be pictured as a cycle of uniform thickness (Fig. 2), in the sense that a mature process has no "weak links" or significant "bottlenecks" in its sequence of operations. This uniformity requires appropriate resourcing of each step, which in turn is achieved by iterating through the cycle in full detail. For example, the initial needs assessment can be gradually improved as decision-makers and users of the results point out shortcomings and recurring discrepancies at the start of an analytical market research task. Similarly, cooperation between information gatherers and analysts (if the two functions are separated) can develop over time, with questions that previously went unnoticed and surfaced only during analysis being passed back to the gatherers for additional data collection. Experience will show over time what resources each step needs to achieve optimal results.

Which outcomes are ultimately "optimal" is determined by how closely the resulting information meets the needs of the decision-makers in the business process. This brings us back to the uniform thickness of the information analysis cycle: a world-class information analysis process does not begin with a needs assessment as such, but with a clear definition of where and how the resulting information will be applied. Indeed, communication between decision-makers and information analysts should be constant, informative, and two-way throughout the analytical process.

One way to strengthen the link between decision-making and market research is to enter into service level agreements with the key stakeholders served by the market intelligence program. Agreeing the required level of market research services with senior leaders in strategic planning, sales, marketing, and R&D clearly defines the deliverables and activities for each stakeholder group for the next 6-12 months, including the market research budget, the people involved, milestones, and interactions throughout the process.

Service level agreements have several benefits:

  • Time is set aside to sit down with those responsible for key business processes and discuss their main goals and decision milestones; as a result, the market research team gains a better understanding of what matters to management, while personal relationships improve.
  • The risk of unanticipated overload on special projects is reduced by identifying areas for regular review, strategic analysis of information, etc.
  • Time is reserved for co-creation in the information analysis process: meetings and seminars on analytical market research involving busy managers often need to be scheduled several months in advance.
  • Clear goal-setting and evaluation of results streamline market research activities and raise the level of analysis.
  • In general, organizational silos and so-called "stewing in one's own juice" diminish, and cooperation between managers and market research specialists becomes more fruitful.

The two examples below illustrate how, with a streamlined information analysis process, an analytical team can respond to different requirements of a task depending on the geographic region being analyzed. In the "Western world", secondary sources yield a large amount of reliable information on almost any topic, so the analysts' job comes down to finding the best sources for cost-effective information collection, subsequent analysis, and reporting.

Emerging markets, on the other hand, often lack reliable secondary sources, or the required data is not available in English. Analysts therefore need to turn quickly to primary sources and conduct interviews, usually in the local language. In this situation it is important to rely on a sufficiently large number of sources to verify the research results before proceeding to analysis.

Example. Business Cycle Study for a Chemical Industry Enterprise

A chemical company needed extensive information about past, current, and future business cycles across several chemical industry product lines in the North American market. The information was to be used to assess future growth in certain areas of chemical production and to plan business development based on an understanding of business cycles in the industry.

Only secondary sources of information were used. The business cycle analysis was carried out both quantitatively and qualitatively, using statistical methods, including regression and visual analysis, and taking into account the views of industry experts on long-term growth. The result was a detailed analytical report describing the duration and nature of the business cycles and assessing the future prospects of the company's key product lines (ethylene, polyethylene, styrene, ammonia, and butyl rubber).
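As a minimal sketch of the quantitative side of such a study (the demand figures, the product, and the simple linear trend model are all hypothetical; the actual project may have used different techniques), one might separate a long-term trend from a cyclical component like this:

```python
import numpy as np

# Hypothetical yearly demand for one product line (made-up numbers, arbitrary units).
years = np.arange(2008, 2024)
demand = np.array([100, 104, 96, 110, 118, 112, 125, 133, 124, 140,
                   150, 142, 158, 168, 160, 175], dtype=float)

# Fit a linear long-term trend with ordinary least squares (the regression step).
slope, intercept = np.polyfit(years, demand, 1)
trend = slope * years + intercept

# The residual around the trend is a crude proxy for the business cycle.
cycle = demand - trend

# Local maxima of the residual suggest cycle peaks; in practice a plot of
# `cycle` would be inspected visually alongside expert commentary.
peak_years = [int(years[i]) for i in range(1, len(cycle) - 1)
              if cycle[i] > cycle[i - 1] and cycle[i] > cycle[i + 1]]

print(f"long-term trend: {slope:.1f} units/year; residual peaks near {peak_years}")
```

The same decomposition idea carries over to richer models; here the point is only that the quantitative step combines a regression fit with a visual reading of the residual cycle.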

Example. Assessment of the market of ammonium bifluoride and hydrofluoric acid in Russia and the CIS

One of the world's largest nuclear centers needed to study the market for two by-products of its operations, ammonium bifluoride and hydrofluoric acid, in Russia and the CIS. Given the insufficient capacity of this market, the center would have to invest in building facilities to dispose of these products.

Secondary sources were studied both at the level of Russia and the CIS and at the global level. Because of the highly specialized nature of the market and the high level of in-house consumption of the by-products, the emphasis was placed on primary-source research: in preparation for the subsequent analysis, 50 detailed interviews were conducted with potential clients, competitors, and industry experts.

The final report presented an estimate of market size excluding in-house consumption, a segment analysis, an import analysis, a value chain analysis, an analysis of replacement technologies and products for each industrial segment, a market development forecast, a pricing analysis and, finally, an assessment of potential market opportunities in Russia and the CIS.

Example. An efficient information analysis process based on trend assessment for management reporting

A leading energy and petrochemical company successfully improved its information analysis process by basing the collection, analysis, and delivery of information on strategic scenario analysis.

By integrating information analysis activities into key business processes at the planning stage, the company was able to identify its true strategic needs clearly and convey them to the analytical team, which could then organize the analysis so that the focus was on strategy and action. The company's information analysis process begins with an examination of prevailing trends and ends with illustrative examples of how to respond to risks, together with recommendations for management.

The key to improving the effectiveness of the information analysis program was a successful needs assessment in terms of the company's strategic goals. Decision-makers took part in the information analysis process from the earliest stage (discussions, meetings, seminars), which fostered a two-way dialogue and a fuller integration of the program into other areas of the company.

Example. A global biotech company has developed an information analysis cycle to deliver timely insights and proactive decision making.

The purpose of the information analysis program was to provide early warnings so that actionable and achievable strategies could be put in place in all markets in which the company operates. An information analysis cycle was established in which the stakeholders (both as providers and consumers of information) and numerous information sources were involved at several stages.

The stakeholders represented four key functions in the company (the strategy group, marketing and sales, finance, and investor relations together with the directors). Activity was most intense during the planning and implementation stages. The successful implementation of an information analysis cycle that brought internal stakeholders (for needs assessment) and multiple information sources together in a well-defined process for delivering analysis results meant that the program had a real impact on strategy development and proactive decision-making.

 
