Data classification and analysis


Revolutionary discoveries in natural science were often made under the influence of experiments staged by talented experimenters. Great experiments in biology, chemistry, and physics changed our idea of the world we live in, of the structure of matter, and of the mechanisms by which heredity is transmitted. Many theoretical and technological discoveries, in turn, were made on the basis of the results of great experiments.

§ 9. Theoretical research methods

Lesson-lecture

There are things in the world more important than the most beautiful discoveries: it is the knowledge of the methods by which they were made.

Leibniz

Method. Classification. Systematization. Systematics. Induction. Deduction.

Observation and description of physical phenomena. Physical laws. (Physics, grades 7 - 9).

What is a method. In science, a method is a way of constructing knowledge, a form of the practical and theoretical mastering of reality. Francis Bacon compared the method to a lamp illuminating a traveler's path in the dark: "Even a lame man walking along a road is ahead of one who walks without a road." A correctly chosen method should be clear and logical, lead to a specific goal, and give results. The doctrine of the system of methods is called methodology.

The methods of cognition used in scientific activity are the empirical (practical, experimental) methods: observation and experiment, and the theoretical (logical, rational) methods: analysis, synthesis, comparison, classification, systematization, abstraction, generalization, modeling, induction, and deduction. In real scientific work these methods are always used in unity. For example, developing an experiment requires a preliminary theoretical understanding of the problem and the formulation of a research hypothesis, while after the experiment the results must be processed with mathematical methods. Let us consider the features of some theoretical methods of cognition.

Classification and systematization. Classification allows you to order the material under study by dividing the set (class) of objects under study into subsets (subclasses) according to a selected feature.

For example, all students in a school can be divided into the subclasses "girls" and "boys". You can also choose another characteristic, such as height. In that case the classification can be carried out in different ways: for example, set a height limit of 160 cm and divide students into the subclasses "short" and "tall", or split the height scale into 10 cm segments, making the classification more detailed. If the results of such a classification are compared over several years, trends in the physical development of students can be established empirically. Consequently, classification as a method can be used to gain new knowledge and can even serve as a basis for constructing new scientific theories.

In science, the same objects are usually classified according to different criteria, depending on the goals. However, only one characteristic (the basis of the classification) is selected at a time. For example, chemists subdivide the class of "acids" into subclasses by the degree of dissociation (strong and weak), by the presence of oxygen (oxygen-containing and oxygen-free), by physical properties (volatile and non-volatile, soluble and insoluble), and by other characteristics.

The classification can change in the course of the development of science.

In the middle of the 20th century, the study of various nuclear reactions led to the discovery of elementary (indivisible) particles. They were initially classified by mass, which gave rise to leptons (light), mesons (intermediate), baryons (heavy), and hyperons (superheavy). Further development of physics showed that classification by mass has little physical meaning, but the terms were preserved; as a result, there are now leptons that are much more massive than baryons.

It is convenient to present a classification in the form of tables or diagrams (graphs). For example, the classification of the planets of the solar system, represented as a graph diagram, may look like this:

MAJOR PLANETS OF THE SOLAR SYSTEM

Terrestrial planets: Mercury, Venus, Mars

Giant planets: Jupiter, Saturn, Uranus

Pluto

Note that in this classification the planet Pluto forms a separate subclass: it belongs neither to the terrestrial planets nor to the giant planets. Scientists note that in its properties Pluto resembles an asteroid, of which there may be many on the periphery of the solar system.

In the study of complex natural systems, classification is in effect the first step toward constructing a natural-scientific theory. The next, higher level is systematization. Systematization is carried out on the basis of the classification of a fairly large amount of material; the most essential features are singled out, which makes it possible to present the accumulated material as a system reflecting all the various relationships between objects. It is needed where there is a great variety of objects and the objects themselves are complex systems. The result of the systematization of scientific data is a systematics, otherwise called a taxonomy. Systematics as a field of science developed in such areas of knowledge as biology, geology, linguistics, and ethnography.

The unit of taxonomy is called a taxon. In biology, taxa are, for example, the phylum, class, order, family, and genus. They are combined into a unified system of taxa of various ranks according to the hierarchical principle. Such a system includes a description of all existing and previously extinct organisms and clarifies the paths of their evolution. If scientists find a new species, they must confirm its place in the common system. Changes can also be made to the system itself, which remains developing and dynamic. Taxonomy makes it easy to navigate the whole diversity of organisms: about 1.5 million species of animals alone are known, and more than 500 thousand species of plants, not counting other groups of organisms. Modern biological systematics reflects Saint-Hilaire's law: "All the diversity of life forms a natural taxonomic system consisting of hierarchical groups of taxa of various ranks."

Induction and deduction. The path of cognition in which, on the basis of systematizing the accumulated information, a conclusion about an existing regularity is drawn from the particular to the general is called induction. This method of studying nature was developed by the English philosopher F. Bacon. He wrote: "It is necessary to take as many cases as possible - both those where the phenomenon under investigation is present and those where it is absent but where it could be expected to be encountered; then you have to arrange them methodically ... and give the most likely explanation; finally, try to verify this explanation by further comparison with the facts."

Thought and image

Portraits of F. Bacon and S. Holmes

Why are the portraits of a scientist and a literary hero located next to each other?

Induction is not the only way of obtaining scientific knowledge about the world. While experimental physics, chemistry, and biology were built as sciences mainly through induction, theoretical physics and modern mathematics were founded on a system of axioms: consistent, speculative assertions regarded as reliable from the point of view of common sense and of the historical level of development of science. Knowledge can then be built on these axioms by deriving inferences from the general to the particular, passing from premises to consequences. This method is called deduction. It was developed by René Descartes, the French philosopher and scientist.

A striking example of gaining knowledge about the same subject in different ways is the discovery of the laws of motion of celestial bodies. At the beginning of the 17th century, on the basis of a large amount of observational data on the motion of the planet Mars, Kepler discovered by induction the empirical laws of planetary motion in the solar system. At the end of the same century, Newton deduced the generalized laws of motion of celestial bodies from the law of universal gravitation.

In real research activity, the methods of scientific investigation are interconnected.

  1. ○ Explain what a research method is. What is the methodology of natural science?

All these approximations should be justified and the errors introduced by each of them should be numerically estimated.

The development of science shows that every natural-scientific law has its limits of applicability. For example, Newton's laws turn out to be inapplicable to the study of processes in the microworld. To describe these processes, the laws of quantum theory are formulated, which become equivalent to Newton's laws when applied to the motion of macroscopic bodies. From the point of view of modeling, this means that Newton's laws are a kind of model that follows, under certain approximations, from a more general theory. However, the laws of quantum theory are not absolute either and have their own limits of applicability. More general laws have already been formulated and more general equations obtained, which in turn also have limitations, and this chain has no end in sight. So far no absolute laws have been obtained that describe everything in nature and from which all particular laws could be derived, and it is not clear whether such laws can be formulated at all. But this means that any natural-scientific law is in fact a kind of model. The difference from the models considered in this section is only that a natural-scientific law is a model applicable to describing not one specific phenomenon but a wide class of phenomena.

Propositions arrived at by purely logical means, when compared with reality, turn out to be completely empty.

A. Einstein

How to analyze and classify data correctly? Why do we need graphs and charts?

Lesson-workshop

Purpose of the work. Learn to classify and analyze data obtained from a text.

Work plan. 1. Analyze the text in order to determine the essential properties of the object it discusses. 2. Structure the content of the text in order to single out the classes of objects being discussed. 3. Understand the role of logical schemes, graphs, and diagrams in understanding the studied material, establishing logical connections, and systematization.

Analyze the text. To do this, you need to mentally single out the subject of the text - the essential thing in it - and break it down into its component parts in order to find individual elements, features, and aspects of this object.

Ivan Kramskoy. D. I. Mendeleev

Whose portraits of scientists-systematizers would you add to this series?

PORTRAIT OF BALL LIGHTNING. "A portrait of a mysterious natural phenomenon - ball lightning - was compiled by specialists of the A. I. Voeikov Main Geophysical Observatory using computers and the methods of forensic science. The "composite image" of the mysterious stranger was compiled on the basis of data published in the press over three centuries, the results of research surveys, and eyewitness reports from different countries.

Which of its secrets did the hovering clot of energy tell the scientists?

It is mostly noticed during thunderstorms. At all times, eyewitnesses have reported four shapes of ball lightning: a sphere, an oval, a disk, and a rod. This product of atmospheric electricity most often appeared in the air. However, according to American surveys, lightning was seen just as frequently settled on various objects - telegraph poles, trees, houses. The size of this amazing companion of thunderstorms is from 15 to 40 cm. Its color? Three-quarters of the eyewitnesses watched sparkling balls of red, yellow, and pink.

The life of this clot of electric plasma is truly a moth's life - as a rule, about five seconds. Up to 36% of eyewitnesses saw it live longer than that, but no more than 30 s. Its death was almost always the same: it exploded spontaneously, sometimes after bumping into an obstacle. The "collective portraits" made by observers of different times and peoples coincided."

If, after reading the text, you were able to answer what it is about and what the main features, elements, aspects, and properties of the subject of discussion are, then you have analyzed it. In this case, the subject and main content of the text is the idea of ball lightning, and its properties are its appearance - size, shape, color - as well as its lifetime and behavior.

Based on the analysis of the text, determine its logical structure. Suggest ways of working with this text in order to assimilate and memorize it and to use it as interesting, unusual material in your future educational work - in discussions and presentations.

PROMPT. You can draw up a plan of this text, a synopsis, or theses (the generalizations and conclusions that you consider the main thoughts of the text). It is useful to highlight what in the material is new and unfamiliar to you. You can also outline the material: after analyzing the text, highlight the information that is meaningful to you, try to combine it into groups, and show the connections between these groups.

The use of tables, graphs, diagrams helps us organize the study of natural science subjects. Suppose we have at our disposal data on the average monthly daytime temperatures for one year for St. Petersburg and for Sochi. It is required to analyze and systematize this material in order to identify any patterns.

Let us represent this disparate set of data first in the form of a table, then as a graph and as a chart (Fig. 5, 6). Find patterns in the temperature distribution. Answer the questions:

  1. What are the features of the temperature distribution by months in different cities? How do these distributions differ?
  2. What is the reason for the processes that lead to this distribution?
  3. Did organizing the material with a graph and a chart help you complete the task?

Average monthly daytime temperatures for one year for St. Petersburg and Sochi

Fig. 5. Graph of the average monthly daytime temperatures over one year for St. Petersburg and Sochi

Fig. 6. Chart of the average monthly daytime temperatures over one year for St. Petersburg and Sochi
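
As a sketch of how such a table, graph, and chart could be produced programmatically (the temperature values below are made-up placeholders, not the actual data for the two cities), one could use pandas and matplotlib:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical average monthly daytime temperatures, degrees Celsius
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
data = pd.DataFrame({
    "St. Petersburg": [-5, -5, 0, 7, 14, 19, 22, 20, 14, 7, 1, -3],
    "Sochi":          [10, 10, 12, 16, 20, 24, 27, 27, 24, 19, 15, 12],
}, index=months)

print(data)                      # the table

data.plot(marker="o")            # the graph of monthly temperatures
plt.ylabel("Temperature, °C")
plt.title("Average monthly daytime temperatures")
plt.show()

data.plot.bar()                  # the same data as a bar chart
plt.ylabel("Temperature, °C")
plt.show()
```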

Important steps to mastering the methods of scientific knowledge are:

  1. Logical analysis of the text.
  2. Drawing up a plan, diagrams, highlighting the structure of the material.
  3. Summarizing the text or writing theses.
  4. Singling out new knowledge and using it in discussions, presentations, and in solving new problems.

Additional reading literature

  1. Einstein A. Without Formulas / A. Einstein; comp. K. Kedrov; trans. from English. Moscow: Mysl, 2003.
  2. Methodology of Science and Scientific Progress. Novosibirsk: Nauka, 1981.
  3. Feyerabend P. Selected Works on the Methodology of Science / P. Feyerabend. Moscow: Progress, 1986.

Last year the Avito company held a number of competitions, including one on recognizing car makes, whose winner, Evgeny Nizhibitsky, talked about his solution at the training session.


Formulation of the problem... It is necessary to determine the make and model of a car from its images. The metric was prediction accuracy, that is, the proportion of correct answers. The sample consisted of three parts: the first part was available for training from the start, the second was released later, and the third was the one for which final predictions had to be submitted.


Computing resources... I used my home computer, which was heating my room all this time, and the servers provided at work.

Model overview... Since our task is recognition, the first thing we want to do is take advantage of the progress in image classification quality on the well-known ImageNet. As you know, modern architectures can achieve even higher quality than a human. So I started with a review of recent articles and put together a summary table of ImageNet architectures, implementations, and quality figures.


Note the architectures on which the best quality is achieved.

Fine-tuning networks... Training a deep neural network from scratch is rather time-consuming and, moreover, not always effective in terms of results. Therefore the technique of fine-tuning is often used: a network already trained on ImageNet is taken, its last layer is replaced with a layer with the required number of classes, and then the network is trained further with a low learning rate, but on the competition data. This scheme lets you train the network faster and to a higher quality.
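
The recipe can be sketched as follows (a minimal PyTorch/torchvision illustration of the idea, not the author's original Torch or TensorFlow code; the number of classes and the learning rate are placeholders):

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 100  # placeholder: the number of car make/model classes

# Take a network pretrained on ImageNet and replace its last layer
model = models.resnet34(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Fine-tune the whole network with a low learning rate on the competition data
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```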

The first attempt at fine-tuning GoogLeNet gave approximately 92% accuracy on validation.

Crop predictions... Prediction quality on the test sample can be improved by averaging over crops. To do this, fragments of a suitable size are cut out at different places of the original image, and the results are averaged. A 1x10 crop means taking the center of the image and its four corners, and then the same five crops reflected horizontally. As you can see, the quality increases, but so does the prediction time.
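
A sketch of that 10-crop averaging with torchvision (the crop sizes are placeholders; the original pipeline differed in details):

```python
import torch
from torchvision import transforms

# Center + four corners, each also reflected horizontally = 10 crops
tta = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),
    transforms.Lambda(lambda crops: torch.stack(
        [transforms.ToTensor()(c) for c in crops])),
])

def predict_tta(model, pil_image):
    model.eval()
    crops = tta(pil_image)                  # shape: (10, 3, 224, 224)
    with torch.no_grad():
        probs = torch.softmax(model(crops), dim=1)
    return probs.mean(dim=0)                # average over the 10 crops
```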

Validation of results... After the labels for the second part of the sample were released, I re-split the data into several parts. All further results are reported on this split.

ResNet-34, Torch... You can use the architecture authors' ready-made repository, but to get test predictions in the required format you have to fix some scripts. You also have to deal with the high memory consumption of the dumps. The validation accuracy is about 95%.


Inception-v3, TensorFlow... A ready-made implementation was used here as well, but the image preprocessing was changed and the image cropping during batch generation was restricted. The result is almost 96% accuracy.


Ensemble of models... We ended up with two ResNet models and two Inception-v3 models. What validation quality can be obtained by mixing the models? The class probabilities were averaged using a geometric mean; the weights (in this case, the exponents) were selected on a hold-out sample.
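
Such mixing can be sketched as a weighted geometric mean of the class-probability matrices (the weights here are illustrative and would be tuned on the hold-out sample):

```python
import numpy as np

def geometric_blend(prob_list, weights):
    """Weighted geometric mean of (n_samples, n_classes) probability matrices."""
    log_mix = sum(w * np.log(p + 1e-15) for p, w in zip(prob_list, weights))
    mix = np.exp(log_mix / sum(weights))
    return mix / mix.sum(axis=1, keepdims=True)   # renormalize each row

# blended = geometric_blend([probs_resnet1, probs_resnet2,
#                            probs_incep1, probs_incep2],
#                           weights=[1.0, 1.0, 1.5, 1.5])
```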


Results... Training ResNet took 60 hours on a GTX 980, and Inception-v3 took 48 hours on a Titan X. During the competition we managed to try out new frameworks with new architectures.


The problem of classification of bank clients

Link to Kaggle.

Stanislav Semyonov tells how he and other top-ranked Kaggle members teamed up and won a prize in the competition on classifying client requests for a large bank, BNP Paribas.


Formulation of the problem... Based on obfuscated data about insurance claims, it is necessary to predict whether a request can be approved without additional manual checks. For the bank this means automating the processing of applications; for data analysts it is simply a binary classification task. There are about 230 thousand objects and 130 features. The metric is LogLoss. It is worth noting that the winning team decrypted the data, which helped them win the competition.

Getting rid of artificial noise in the features... The first step is to look at the data. Several things are immediately apparent. Firstly, all features take values from 0 to 20. Secondly, if you look at the distribution of any of the features, you see the following picture:

Why is that? At the anonymization and noising stage, random noise was added to all values, which were then scaled to the segment from 0 to 20. The reverse transformation was carried out in two steps: first the values were rounded to a certain decimal place, and then a denominator was selected ... Was this necessary, given that a tree will pick up the threshold when splitting anyway? Yes: after the reverse transformation, differences of variables begin to make more sense, and categorical variables can be one-hot encoded.

Removing linearly dependent features... We also noticed that some features are sums of others. Clearly they are not needed. To find them, subsets of features were taken and a regression was built on each subset to predict some other variable. If the predicted values were close to the true ones (bearing in mind the artificial noise), the feature could be removed. But the team did not bother with this and used a ready-made set of filtered features prepared by someone else. One of the features of Kaggle is its forum and public solutions, through which participants share their findings.
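
The check itself can be sketched like this (an illustration of the idea, not the actual filtered set): regress a feature on a subset of the others and drop it if it is reconstructed almost exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def nearly_dependent(X, col, others, tol=1e-2):
    """True if column `col` of array X is (almost) a linear combination of `others`."""
    reg = LinearRegression().fit(X[:, others], X[:, col])
    pred = reg.predict(X[:, others])
    # the tolerance accounts for the artificial noise added during anonymization
    return np.mean((pred - X[:, col]) ** 2) < tol
```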

How do you know what to use? There is a small hack. Suppose you know that in past competitions someone used a technique that helped them rank high (short solution write-ups are usually posted on the forums). If that participant is among the leaders of the current competition as well, the same technique will most likely work here too.

Encoding categorical variables... It was striking that a certain variable, V22, has a large number of values, but if we take a subsample for one particular value, the number of levels (distinct values) of the other variables decreases noticeably. It also correlates well with the target variable. What can be done? The simplest solution is to build a separate model for each value of V22, but that is the same as making the first split of a tree over all values of this variable.

There is another way to use this information: encoding with the mean of the target variable. In other words, each value of the categorical variable is replaced by the mean of the target over the objects for which the feature takes that value. Such encoding cannot be performed directly over the entire training set: we would implicitly leak information about the target variable into the features, and that is a leak almost any model will find.

Therefore, these statistics are computed on folds. Here is an example:

Let's assume the data is split into three parts. For each fold of the training set we calculate the new feature from the other two folds, and for the test set we calculate it over the entire training set. Then information about the target variable does not enter the sample so explicitly, and the model can use the knowledge gained.
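
A sketch of this out-of-fold mean encoding with pandas and scikit-learn (the column names and number of folds are placeholders):

```python
import pandas as pd
from sklearn.model_selection import KFold

def mean_target_encode(train, test, col, target, n_splits=3):
    """Out-of-fold mean encoding of a categorical column (sketch)."""
    train, test = train.copy(), test.copy()
    new_col = col + "_mean_enc"
    train[new_col] = float("nan")
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fit_idx, enc_idx in kf.split(train):
        fold = train.iloc[fit_idx]
        means = fold.groupby(col)[target].mean()
        encoded = train.iloc[enc_idx][col].map(means).fillna(fold[target].mean())
        train.iloc[enc_idx, train.columns.get_loc(new_col)] = encoded.values
    # the test set is encoded with statistics from the whole training set
    global_means = train.groupby(col)[target].mean()
    test[new_col] = test[col].map(global_means).fillna(train[target].mean())
    return train, test
```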

Are there problems with anything else? Yes, with rare categories and with cross-validation.

Rare categories... Suppose a certain category has been encountered only a few times and the corresponding objects belong to class 0. Then the average value of the target variable will also be zero. However, a completely different situation may arise on the test sample. The solution is the smoothed average (or smoothed likelihood), which is calculated using the following formula:
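
A common way to write this smoothed estimate, consistent with the description that follows (a sketch; the exact formula used by the team is an assumption):

```python
smoothed_mean = (nrows * category_mean + alpha * global_mean) / (nrows + alpha)
```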

Here global mean is the mean of the target variable over the entire sample, nrows is the number of times a particular value of the categorical variable was encountered, and alpha is the regularization parameter (for example, 10). Now, if a value is rare, the global mean has more weight, and if it occurs often enough, the result is close to the original category mean. Incidentally, this formula also makes it possible to handle previously unseen values of a categorical variable.

Cross-validation... Let's say we have calculated all the smoothed means for the categorical variables on the other folds. Can we assess the quality of the model with standard k-fold cross-validation? No. Let's look at an example.

For example, we want to evaluate the model on the third fold. We train the model on the first two folds, but they contain a new variable with the mean of the target, which we calculated using, among other data, the third fold - the test fold of this evaluation. This does not allow us to evaluate the results correctly, but the problem can be solved by computing the statistics on folds within folds. Let's look at the example again:

We still want to evaluate the model on the third fold. Let's split the first two folds (the training sample of our evaluation) into three further folds, compute the new feature in them according to the scheme already described, and for the third fold (the test sample of our evaluation) compute it over the first two folds together. Then no information from the third fold is used when training the model, and the evaluation is fair. In the competition we are discussing, only such cross-validation made it possible to assess the quality of the model correctly. Of course, the "outer" and "inner" numbers of folds can be anything.

Building features... We used not only the already mentioned smoothed means of the target variable but also weights of evidence, which are almost the same thing but with a logarithmic transformation. In addition, features such as the difference between the number of objects of the positive and negative classes in a group, without any normalization, turned out to be useful. The intuition is as follows: the scale reflects the degree of confidence in the class; but what do you do with quantitative indicators? If you process them the same way, all values get "hammered" toward the global mean by the regularization. One option is to split the values into bins, which are then treated as separate categories. Another is simply to build some kind of linear model on one feature with the same target. In total, we got about two thousand features out of the 80 filtered ones.

Stacking and blending... As in most competitions, model stacking was an important part of the solution. In short, the essence of stacking is that we pass the predictions of one model as a feature to another model. It is important, however, not to overfit yet again. Let's just look at an example:


Taken from the blog of Alexander Dyakonov

For example, suppose we decided to split our sample into three folds at the stacking stage. As when calculating the statistics, we must train the model on two folds and add the predicted values for the remaining fold. For the test sample, the predictions of the models from each pair of folds can be averaged. The process of adding a group of new prediction features built on the existing dataset is called a stacking level.
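
One stacking level can be sketched as follows (the base model and the number of folds are placeholders; X_train and X_test are assumed to be numpy arrays and the base model a scikit-learn classifier):

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def stacking_feature(model, X_train, y_train, X_test, n_splits=3):
    """Out-of-fold predictions for train; averaged fold-model predictions for test."""
    oof = np.zeros(len(X_train))
    test_preds = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fit_idx, pred_idx in kf.split(X_train):
        m = clone(model).fit(X_train[fit_idx], y_train[fit_idx])
        oof[pred_idx] = m.predict_proba(X_train[pred_idx])[:, 1]
        test_preds.append(m.predict_proba(X_test)[:, 1])
    return oof, np.mean(test_preds, axis=0)
```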

At the first level the team had 200-250 different models, at the second another 20-30, and at the third a few more. The final step is blending, that is, mixing the predictions of different models. A variety of algorithms were used: gradient boosting with different parameters, random forests, neural networks. The main idea is to use the most diverse models possible, with different parameters, even if they do not give the highest quality.

Teamwork... Usually participants form teams closer to the end of a competition, when everyone already has their own work in hand. We teamed up with the other Kagglers from the very beginning. Each team member had a folder in a shared cloud where datasets and scripts were kept. A common cross-validation procedure was agreed in advance so that results could be compared. The roles were distributed as follows: I came up with new features, the second participant built models, the third selected them, and the fourth managed the whole process.

Where to get computing power... Testing a large number of hypotheses, building multilevel stacking, and training models can take too long on a laptop, so many participants use computing servers with many cores and a lot of RAM. I usually use AWS servers, while my teammates, it turns out, use machines at work for competitions while they are idle.

Communication with the organizing company... After a successful performance in the competition, communication with the company takes the form of a joint conference call. Participants talk about their solutions and answer questions. At BNP people were not surprised by multilevel stacking, but they were, of course, interested in feature construction, teamwork, result validation - everything that could help them improve their own system.

Do you need to decrypt the dataset... The winning team noticed one peculiarity in the data. Some features have missing values and some do not; that is, some characteristics did not depend on specific people. In addition, there were 360 unique values. It is logical to assume that these are time stamps. It turned out that if you take the difference between two such features and sort the entire sample by it, zeros appear more often at first and then ones. This is exactly what the winners took advantage of.

Our team took third place. In total, almost three thousand teams participated.

The task of recognizing an ad category

Link to DataRing.

This is another Avito competition. It took place in several stages, the first of which (as well as the third, by the way) was won by Arthur Kuzin.


Formulation of the problem... It is necessary to determine the category based on the photos from the ad. Each ad had one to five images. The metric took into account the overlap of categories at different levels of the hierarchy - from general to narrower ones (the last level contains 194 categories). In total, there were almost a million images in the training sample, which is close to the ImageNet size.


Difficulties of recognition... It would seem that you just need to learn to distinguish a TV from a car and a car from shoes. But there is, for example, a category "British cats" and a category "other cats", and some of their images are very similar, even if you can still tell them apart. And what about tires, rims, and wheels? A human cannot always manage that. These difficulties are the reason all participants' results hit a certain ceiling.


Resources and framework... I had three computers with powerful video cards at my disposal: my home machine, one provided by a laboratory at MIPT, and a computer at work. So it was possible (and necessary) to train several networks at the same time. MXNet was chosen as the main framework for training the neural networks; it was created by the same people who wrote the well-known XGBoost, which by itself was a reason to trust their new product. The advantage of MXNet is that an efficient iterator with standard augmentation is available out of the box, which is sufficient for most tasks.


Network architectures... Experience from one of the past competitions showed that the Inception family of architectures gives the best quality, and I used them here as well. GoogLeNet was added because it made training the model faster. We also used the Inception-v3 and Inception BN architectures from the Model Zoo library, with a dropout layer added before the last fully connected layer. Due to technical problems it was not possible to train the networks with stochastic gradient descent, so Adam was used as the optimizer.



Data augmentation... To improve the quality of the networks, augmentation was used: adding distorted images to the sample to increase the variety of the data. The transformations included random crops, flips, rotations by a small angle, changes of aspect ratio, and shifts.
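
Such a set of transformations can be sketched with torchvision (the parameters are illustrative; the original solution used MXNet's built-in iterator):

```python
from torchvision import transforms

train_augmentation = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0),
                                 ratio=(0.8, 1.25)),        # random crop + aspect ratio change
    transforms.RandomHorizontalFlip(),                       # flip
    transforms.RandomRotation(10),                           # rotation by a small angle
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # shift
    transforms.ToTensor(),
])
```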

Accuracy and speed of training... At first I divided the sample into three parts, but later gave up one of the validation parts intended for model mixing, so the second part of the sample was added to the training set, which improved the quality of the networks. In addition, GoogLeNet was originally trained on a Titan Black, which has half the memory of a Titan X, so this network was retrained with a larger batch size and its accuracy increased. Looking at the training times, one can conclude that under a limited time frame it is not worth using Inception-v3, since the other two architectures train much faster. The reason is the number of parameters. Inception BN trains the fastest.

Making predictions...

Like Evgeny in the car-makes competition, Arthur used crop predictions, but with 24 crops rather than 10: the corners, their reflections, the center, rotations of the central parts, and ten more random crops.

If you save the state of the network after each epoch, you end up with many different models, not just the final network. Taking into account the time remaining before the end of the competition, I could use the predictions of 11 model-epochs, since generating predictions with a network also takes considerable time. All these predictions were averaged according to the following scheme: first an arithmetic mean within each crop group, then a geometric mean over the three groups with weights selected on the validation set; the operation is then repeated for all epochs. Finally, the class probabilities of all images of one ad are averaged with an unweighted geometric mean.
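
Roughly, that averaging scheme can be sketched in numpy (the grouping of crops and the weights are placeholders):

```python
import numpy as np

def geo_mean(probs, weights=None):
    """Weighted geometric mean over a stack of probability vectors (first axis)."""
    probs = np.asarray(probs, dtype=float)
    w = np.ones(len(probs)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.exp((w[:, None] * np.log(probs + 1e-15)).sum(axis=0))

def combine_epoch(crop_groups, group_weights):
    # crop_groups: list of (n_crops_in_group, n_classes) arrays for one image and epoch
    group_means = [g.mean(axis=0) for g in crop_groups]   # arithmetic mean inside each group
    return geo_mean(group_means, group_weights)           # weighted geometric mean over groups

# epoch_probs = [combine_epoch(groups, weights) for groups in epochs]  # one vector per epoch
# image_probs = geo_mean(epoch_probs)                                  # unweighted over epochs
# ad_probs    = geo_mean(list_of_image_probs)                          # unweighted over the ad's images
```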


Results... When selecting the weights at the validation stage, the competition metric was used, since it did not correlate all that well with plain accuracy. Predicting on different parts of the images gives only a small quality gain compared to a single prediction, but it is precisely this gain that makes it possible to show the best result. At the end of the competition it turned out that the first three places differed by thousandths. Zhenya Nizhibitsky, for example, had a single model that was only slightly inferior to my ensemble of models.


Learning from scratch vs. fine-tuning... After the end of the competition, it turned out that, despite the large sample size, it was worth training the network not from scratch, but using a pre-trained network. This approach shows better results.

Reinforcement learning problem

The Black Box Challenge was not quite like an ordinary Kaggle competition. The point is that marking up some "test" sample was not enough for a solution. You had to program an "agent" and upload its code into the system, where it was placed in an environment unknown to the participant and made decisions in it on its own. Such tasks belong to the field of reinforcement learning.

Mikhail Pavlov from the 5vision company spoke about the approaches to the solution. In the competition, he took second place.


Formulation of the problem... For an environment with unknown rules, it was necessary to write an "agent" that would interact with that environment. Schematically, it is a kind of brain that receives a state and a reward from the black box, decides on an action, and then receives a new state and a reward for the action performed. Actions are repeated one after another throughout the game. The current state is described by a vector of 36 numbers, and the agent can take one of four actions. The goal is to maximize the sum of rewards over the entire game.


Environment analysis... A study of the distributions of the environment's state variables showed that the first 35 components do not depend on the chosen action and only the 36th component changes with it. Different actions influenced it in different ways: some increased or decreased it, some did not change it at all. But it cannot be said that the entire environment depends on one component: there may be hidden variables in it. In addition, experiments showed that if you perform more than 100 identical actions in a row, the reward becomes negative, so strategies like "always take the same action" were ruled out immediately. Some participants noticed that the reward is proportional to that same 36th component. It was suggested on the forum that the black box imitates a financial market, where the 36th component is the portfolio and the actions are buying, selling, and doing nothing. These options correlated with changes in the portfolio, and the meaning of one of the actions remained unclear.


Q-learning... The main goal of participating was to try out various reinforcement learning techniques. One of the simplest and best-known methods is Q-learning. Its essence is to construct a function Q that depends on the state and the chosen action and estimates how "good" it is to choose that action in that state, where "good" includes not only the reward we receive now but also future rewards. The function is trained iteratively: at each iteration we move it closer to its own estimate at the next step of the game, taking into account the reward just received. You can read more about it elsewhere. Q-learning assumes fully observable Markov processes (in other words, the current state should contain all the information from the environment). Although, according to the organizers, the environment did not meet this requirement, Q-learning could still be used quite successfully.
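
The core update can be sketched as follows (a minimal one-step Q-learning illustration with a linear model per action; the actual solution used n-step updates and a mix of regressions and neural networks):

```python
import numpy as np

N_FEATURES, N_ACTIONS = 36, 4
GAMMA, LR, EPS = 0.99, 1e-3, 0.1

# One linear Q-function per action: Q(s, a) = w[a] . s
w = np.zeros((N_ACTIONS, N_FEATURES))

def q_values(state):
    return w @ state

def choose_action(state):
    if np.random.rand() < EPS:                 # epsilon-greedy exploration
        return np.random.randint(N_ACTIONS)
    return int(np.argmax(q_values(state)))

def update(state, action, reward, next_state):
    # One-step temporal-difference target: r + gamma * max_a' Q(s', a')
    target = reward + GAMMA * np.max(q_values(next_state))
    td_error = target - w[action] @ state
    w[action] += LR * td_error * state         # gradient step for the linear model
```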

Adaptation to the black box... It was found experimentally that n-step Q-learning suited the environment best: the reward is taken not for the single last action but over n steps ahead. The environment allowed saving the current state and rolling back to it, which made collecting the sample easier: from a given state you could try every action rather than just one. At the very beginning of training, while the Q-function was not yet able to evaluate actions, the strategy was "always take action 3". It was assumed that this action changed nothing, so training could start on data without noise.

Learning process... Training proceeded as follows: we play a whole episode with the current policy (the agent's strategy), accumulating a sample, then use the collected sample to update the Q-function, and the sequence repeats for a certain number of epochs. This worked better than updating the Q-function during the game. Other methods - the replay-memory technique (with a common bank of training data into which new game episodes are written) and training several agents playing asynchronously at the same time - also turned out to be less effective.

Models... The solution used three regressions (each one per action) and two neural networks, with some quadratic features and interactions added. The resulting model is a mixture of all five models (five Q-functions) with equal weights. In addition, online fine-tuning was used: during testing, the weights of the old regressions were mixed with new weights obtained on the test sample. This was done only for the regressions, since their solutions can be written out analytically and recalculated rather quickly.


More ideas... Naturally, not all ideas improved the bottom line. For example, reward discounting (when we do not simply maximize the total reward but consider each subsequent move less valuable), deep networks, and a dueling architecture (which evaluates the usefulness of the state and of each action separately) did not improve the results. Due to technical problems it was not possible to apply recurrent networks, although in an ensemble with other models they might have provided some benefit.


Outcomes... The 5vision team took second place, but with a very small margin over the bronze medalists.


So why do you need to compete in data science competitions?

  • Prizes. A successful performance in most competitions is rewarded with cash prizes or other valuable gifts. Over seven million dollars have been awarded on Kaggle in seven years.
  • Career. Sometimes a prize-winning place helps your career.
  • Experience. This is, of course, the most important thing. You can explore a new area and start tackling problems you have not encountered before.

The machine learning trainings are now held on Saturdays every other week. The venue is the Moscow office of Yandex, and the usual audience (guests plus Yandex employees) is 60-80 people. The main feature of the trainings is their topicality: each time, a competition that ended one or two weeks earlier is analyzed. This makes precise planning difficult, but the competition is still fresh in everyone's memory, and many people who tried their hand at it gather in the hall. The trainings are supervised by Emil Kayumov, who, by the way, helped with writing this post.

In addition, there is another format: joint solving sessions, where novice specialists take part together in ongoing competitions. These sessions are held on the Saturdays when there is no training. Anyone can attend events of either type; announcements are published in the groups

Basically, data mining is about processing information and identifying patterns and trends in it that help you make decisions. The principles of data mining have been known for many years, but with the advent of big data they have become even more widespread.

Big data has led to an explosive growth in the popularity of broader data mining techniques, in part because there is so much more information, and by its very nature and content, it is becoming more diversified and expansive. When dealing with large datasets, relatively simple and straightforward statistics are no longer enough. With 30 million or 40 million detailed purchase records, it’s not enough to know that two million of them are from the same location. To better meet the needs of customers, you need to understand if the two million are in a particular age group and know their average earnings.

These business requirements have moved from simple search and statistical analysis of the data to more sophisticated data mining. Solving business problems requires data analysis that builds a model to describe the information and ultimately leads to the creation of a resulting report. This process is illustrated in Figure 1.

Figure 1. Process flow diagram

The process of analyzing data, searching for, and building a model is often iterative, since you need to track down and identify the various pieces of information that can be extracted. You also need to understand how to link, transform, and combine them with other data to obtain a result. After new elements and aspects of the data are discovered, the approach to identifying data sources and formats, and then matching that information to a given result, may change.

Data mining tools

Data mining is not only about the tools used or software databases. Data mining can be done with relatively modest database systems and simple tools, including creating your own, or using off-the-shelf software packages. Sophisticated data mining draws on past experience and algorithms defined with existing software and packages, with different specialized tools associated with different methods.

For example, IBM SPSS®, which has its roots in statistical analysis and surveys, lets you build effective predictive models from past trends and make accurate forecasts. IBM InfoSphere® Warehouse provides data source discovery, preprocessing, and mining in a single package, letting you take information from the source database straight into the final report.

In recent years, it has become possible to work with very large datasets and cluster / large-scale data processing, which allows for even more complex generalizations of data mining results across groups and comparisons of data. A completely new range of tools and systems are available today, including combined storage and data processing systems.

You can analyze a wide variety of datasets, including traditional SQL databases, raw text data, key/value stores, and document databases. Clustered data stores such as Hadoop, Cassandra, CouchDB, and Couchbase Server store and access data in ways that do not follow a traditional tabular structure.

In particular, a more flexible format for storing a document base gives information processing a new focus and complicates it. SQL databases are highly structured and adhere to the schema, making it easy to query and parse data with a known format and structure.

Document databases that follow a standard structure such as JSON, or files with some machine-readable structure, are also easy to handle, although matters may be complicated by a varied and fluid structure. For example, in Hadoop, which processes completely raw data, it can be difficult to identify and extract information before processing and correlating it.

Basic methods

Several basic methods used for data mining describe the type of analysis and the data-retrieval operation. Unfortunately, different companies and solutions do not always use the same terms, which can add to the confusion and perceived complexity.

Let's take a look at some of the key techniques and examples of how to use specific data mining tools.

Association

Association (or relation) is probably the most well-known, familiar, and simple data mining technique. To identify patterns, a simple comparison is made between two or more elements, often of the same type. For example, by tracking shopping habits, you may notice that cream is usually bought with strawberries.

It is not difficult to create data mining tools based on associations or relations. For example, InfoSphere Warehouse provides a wizard that guides you through information flow configurations for creating associations by examining the input source, the decision basis, and the output information. Figure 2 shows an example for the sample database.

Figure 2. Information flow used in the association approach
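
Outside such packages, the underlying calculation (the support and confidence of an association) can be sketched with pandas on hypothetical basket data:

```python
import pandas as pd

# Hypothetical market-basket data: one row per purchased item
transactions = pd.DataFrame({
    "basket": [1, 1, 2, 2, 3, 3, 4],
    "item":   ["strawberries", "cream", "strawberries", "cream",
               "strawberries", "bread", "bread"],
})

baskets = transactions.groupby("basket")["item"].apply(set)
both = baskets.apply(lambda s: {"strawberries", "cream"} <= s).mean()   # support of the pair
straw = baskets.apply(lambda s: "strawberries" in s).mean()
print("support:", both, "confidence:", both / straw)   # P(cream | strawberries)
```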

Classification

Classification can be used to form an idea of the type of customer, product, or object by describing several attributes that identify a particular class. For example, cars can easily be classified by type (sedan, SUV, convertible) based on attributes such as the number of seats, body shape, and drive wheels. When examining a new car, you can assign it to a particular class by comparing its attributes with a known definition. The same principles can be applied to customers, for example by categorizing them by age and social group.

In addition, the classification can be used as input to other methods. For example, decision trees can be used to define a classification. Clustering allows you to use the common attributes of different classifications in order to identify clusters.

Clustering

By examining one or more attributes or classes, you can group individual data items together to obtain a structured conclusion. At a simple level, clustering uses one or more attributes as the basis for identifying a cluster of similar results. Clustering is useful for identifying distinct pieces of information because it correlates each item with other examples, so you can see where similarities and ranges coincide.

The clustering method works both ways. You can assume that there is a cluster around a certain point and then use your identification criteria to verify this. The graph shown in Figure 3 is an illustrative example. Here the buyer's age is compared with the purchase price. It is reasonable to expect that people between twenty and thirty (before marriage and children) and those in their 50s and 60s (when the children have left home) have higher disposable income.

Figure 3. Clustering

In this example two clusters are visible, one around $2,000 / 20-30 years and the other around $7,000-8,000 / 50-65 years. In this case we formed a hypothesis and tested it on a simple graph that can be plotted with any suitable graphing software. For more complex combinations a full analytical package is required, especially if decisions are to be based automatically on nearest-neighbor information.
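
The same hypothesis could also be checked automatically with a clustering algorithm; a sketch with scikit-learn's KMeans on synthetic age/spend data (the numbers are placeholders mimicking the figure):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
young = np.column_stack([rng.normal(25, 3, 50), rng.normal(2000, 300, 50)])
older = np.column_stack([rng.normal(58, 5, 50), rng.normal(7500, 500, 50)])
X = np.vstack([young, older])          # columns: age, purchase amount

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)         # roughly (25, ~2000) and (58, ~7500)
```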

This clustering is a simplified example of so-called nearest-neighbor matching. Individual buyers can be distinguished by their literal proximity to each other on the chart. It is highly likely that customers from the same cluster share other common attributes, and this assumption can be used to search for, classify, and otherwise analyze members of a dataset.

The clustering method can also be applied in reverse: given certain input attributes, identify various artifacts. For example, a recent study of four-digit PIN codes found clusters of numbers in the ranges 1-12 and 1-31 for the first and second pairs. By plotting these pairs on a graph, you can see clusters associated with dates (birthdays, anniversaries).

Forecasting

Forecasting is a broad topic that ranges from predicting component failures to detecting fraud and even predicting a company's profit. Combined with other data mining techniques, forecasting involves trend analysis, classification, pattern matching, and relations. By analyzing past events or instances, you can predict the future.

For example, using credit card authorization data, you can combine decision tree analysis of a person's past transactions with classification and comparison against historical patterns to identify fraudulent transactions. If a purchase of airline tickets to the United States is followed by transactions in the United States, those transactions are likely genuine.

Sequential models

Sequential models, which are often used to analyze long-term data, are a useful technique for identifying trends, or regular recurrences of similar events. For example, by looking at customer data, you can tell that they buy certain sets of products at different times of the year. Based on this information, the shopping basket prediction application can automatically assume that certain products will be added to the shopping cart based on the frequency and history of purchases.

Decision trees

The decision tree, which is related to most of the other methods (mainly classification and forecasting), can be used either as part of the selection criteria or to support the selection of specific data within the overall structure. A decision tree starts with a simple question that has two (sometimes more) answers. Each answer leads to a further question, helping to classify and identify the data or to make predictions.
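
A sketch of such a tree for the earlier car example, using scikit-learn and made-up attribute values:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic car attributes: [number of seats, number of doors]
X = [[2, 2], [5, 4], [7, 5], [2, 2], [5, 4], [7, 5]]
y = ["convertible", "sedan", "SUV", "convertible", "sedan", "SUV"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["seats", "doors"]))  # the questions/splits
print(tree.predict([[5, 4]]))                               # -> ['sedan']
```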

Figure 5. Data preparation

The data source, location, and database all affect how information is processed and combined.

Reliance on SQL

The simplest of all approaches is often reliance on SQL databases. SQL (and the corresponding table structure) is well understood, but the structure and format of the information cannot be completely ignored. For example, when studying user behavior on sales data in the SQL Data Model (and data mining in general), there are two main formats that you can use: transactional and behavioral-demographic.

With InfoSphere Warehouse, building a demographic-behavioral model for analyzing customer data and understanding customer behavior involves taking raw SQL data based on transaction information and known customer parameters and organizing it into a predefined tabular structure. InfoSphere Warehouse can then mine this information using clustering and classification techniques to obtain the desired result. Customer demographic and transactional data can be combined and then converted into a format that allows analysis of specific data, as shown in Figure 6.

Figure 6. Custom data analysis format

For example, sales data can be used to identify sales trends for specific products. The raw sales data for individual items can be converted into transaction information that maps customer IDs to transaction data and item codes. Using this information, it is easy to identify sequences and relations for individual products and individual buyers over time. This allows InfoSphere Warehouse to compute sequence information, determining, for example, when a customer is likely to purchase the same item again.

From the original data, you can create new data analysis points. For example, you can expand (or refine) product information by matching or classifying individual products into broader groups, and then analyze the data for those groups instead of individual customers.

Figure 7. MapReduce structure

In the previous example, we processed (in this case through MapReduce) the original data in a document database and converted it to a tabular format in an SQL database for data mining purposes.

Working with such complex and even unstructured information may require more preparation and processing. Some complex data types and structures cannot be processed and prepared in the form you need in a single step. In this case you can chain the MapReduce output: either transform it sequentially to obtain the required data structure, as shown in Figure 8, or produce multiple output tables separately.

Figure 8. Consecutive output chain of MapReduce processing results

For example, in a single pass you can take raw information from a document database and perform a MapReduce operation to obtain an overview of that information by date. A good example of a sequential process is to regenerate the information and combine the results with a decision matrix (created in a second MapReduce stage), then simplify it further into a sequential structure. During the MapReduce processing phase, the whole dataset must support the individual data-processing steps.
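
In plain Python, the idea of chaining such stages can be sketched on toy data (an illustration of the concept, not actual Hadoop/MapReduce code):

```python
from collections import Counter
from datetime import date

# Stage 1: map raw "documents" to their dates and reduce to counts per date
docs = [{"date": date(2024, 1, 1)}, {"date": date(2024, 1, 1)},
        {"date": date(2024, 1, 2)}]
per_date = Counter(d["date"] for d in docs)

# Stage 2: feed the stage-1 output into a further reduction step,
# e.g. apply a decision rule and simplify the structure
busy_days = {day: count for day, count in per_date.items() if count > 1}
print(per_date, busy_days)
```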

Regardless of the source data, many tools can use flat files, CSVs, or other data sources. For example, InfoSphere Warehouse can parse flat files in addition to directly connecting to the DB2 data warehouse.

Conclusion

Data mining is about more than just running complex queries on the data stored in a database. Whether you use SQL, document-based stores such as Hadoop, or simple flat files, you have to work with, format, or restructure the data. You need to define the format of the information on which your method and analysis will be based. Then, once the information is in the right format, you can apply the different methods (individually or together), independently of the underlying data structure or dataset required.

Despite the fact that the "information analysis process" is more of a technical term, its content is 90% related to human activities.

Understanding the needs at the heart of any information analysis task is closely related to understanding a company's business. Collecting data from suitable sources requires experience in collecting them, regardless of how the final data collection process may be automated. Turning the collected data into insights and applying them effectively in practice requires a deep knowledge of business processes and the availability of consulting skills.

The information analysis process is a cyclical flow of events that begins with an analysis of the needs in the area under consideration. This is followed by the collection of information from secondary and (or) primary sources, its analysis and preparation of a report for decision-makers who will use it, as well as give their feedback and prepare proposals.

A world-class information analysis process can be characterized as follows:

  • First, the decision stages are defined in the key business processes and compared with the standard end-results of the information analysis.
  • The information analysis process begins with a world-class needs assessment, that is, with identifying and verifying future decision-making needs.
  • The stage of collecting information is automated, which allows you to allocate time and resources for the primary analysis of information and, accordingly, increase the value of the existing secondary information.
  • Much time and resources are spent analyzing information, drawing conclusions and interpreting.
  • The resulting analytical information is brought to the attention of each decision-maker on an individual basis, tracking the process of its further use.
  • The members of the information analysis team have a mindset for continuous improvement.

Introduction: the cycle of information analysis

The term "information analysis process" refers to an ongoing, cyclical process that begins with identifying the information needs of decision-makers and ends with the provision of the amount of information that meets these needs. In this regard, an immediate distinction must be made between the volume of information and the process of analyzing the information. Determining the amount of information is aimed at identifying the goals and needs for information resources for the entire information analysis program, while the information analysis process begins with determining the needs for one, albeit insignificant, end result of such an analysis.

The information analysis process should always be tied to the company's existing processes, that is, to strategic planning, sales, marketing, or product management, in which this information will be used. In practice, the use of the resulting information should either be directly linked to decision-making situations, or the information should help raise the organization's awareness in those areas of operating activity that relate to various business processes.

Fig. 1 shows the stages of the cyclical information analysis process (described in more detail below). The right side of the diagram shows the specific results of the process when decisions are made on the basis of general market research, as well as the results directly related to various business processes and projects.


The cycle of information analysis consists of six stages. Their detailed description is given below.

1. Analysis of needs

A thorough needs assessment allows you to define the goals and scope of the information analysis task. Even when those who carry out the task collect information for their own use, it makes sense to clearly identify its key directions so that resources are concentrated in the most appropriate areas. In the vast majority of cases, however, those conducting the research are not the end users of its results. They must therefore fully understand what the end results will be used for, in order to avoid collecting and analyzing data that may ultimately be irrelevant to users. Various templates and questionnaires have been developed for the needs analysis stage; they set a high quality standard at the very start of the task.

Most importantly, however, the organization's information analysis needs must be fully understood and internalized for the information analysis program to deliver real value. Templates and questionnaires alone cannot achieve this. They can certainly be useful, but there have been cases where an excellent needs analysis was carried out simply on the basis of an informal conversation with company leaders. This, in turn, requires that the information analysis team take a consulting approach, or at least be able to negotiate effectively with those responsible for making decisions.

2. Coverage of secondary sources of information

Within the information analysis cycle, we deliberately separate the collection of information from secondary and primary sources. There are several reasons for this. First, collecting information from publicly available sources is less expensive than going directly to primary sources. Second, it is easier, provided, of course, that the people facing the task have sufficient experience in working with the available secondary sources. In fact, managing information sources and optimizing the related costs is a separate area of expertise in itself. Third, covering secondary sources before conducting interview-based research gives the researchers valuable general background information, which can be verified and used when responding to information from interviewees. In addition, if studying the secondary sources answers some of the questions, this reduces the cost of the primary-source stage and sometimes even eliminates the need for it.

3. Research of primary sources

However large the amount of publicly available information may be today, not everything can be found by studying secondary sources. After the secondary sources have been explored, the remaining gaps can be filled by interviewing experts familiar with the research topic. This stage can be relatively expensive compared to studying secondary sources; the cost depends on the scale of the task and on the resources involved, and companies often bring in third-party contractors for primary-source research.

4. Analysis

After collecting information from various sources, you need to determine what exactly the initial needs analysis requires for the task at hand. Again, depending on the scope of the task, this stage can turn out to be quite costly, since it includes at least the time spent by internal, and sometimes external, resources and possibly some additional verification of the analysis results through further interviews.

5. Delivering results

The format in which the results of an information analysis task are presented matters a great deal to end users. Decision-makers typically do not have time to search for the key findings in a large body of material. The main content needs to be put into an easy-to-read format that matches their requirements, while easy access to additional background data should be provided for those who want to "dig deeper." These basic rules apply regardless of the delivery format, whether it is database software, a newsletter, a PowerPoint presentation, a personal meeting, or a seminar. There is also another reason why we separate the delivery stage from end use and from receiving feedback and suggestions on the analytical information provided. Sometimes decisions are made in the same sequence in which the analytical information is delivered. More often, however, basic reference materials are provided before the actual decision-making situation arises, so the format, channel, and manner of presenting the information affect how it is perceived.

6. Use and provision of comments / remarks

The use phase serves as a kind of litmus test for the success of an information analysis task: it shows whether the results meet the needs identified at the very beginning of the process. Whether or not all of the original questions have been answered, the use phase tends to raise new questions and a need for a new needs analysis, especially when the need for information analysis is ongoing. Moreover, when end users and information analysts create content collaboratively, by the time the work reaches the use phase the end users may already have contributed to the expected end result. Conversely, those who carried out most of the analysis can be actively involved in forming the conclusions and interpreting the results on which the final decisions will be based. Ideally, well-considered remarks and comments gathered during the use phase can already serve as the basis for the needs assessment of the next information analysis task. This completes the cycle of the information analysis process.

Getting Started: Developing an Information Analysis Process

Determination of decision-making stages in business processes that require analytical market research

The term "information analysis for the decision-making stage" is gaining popularity as companies that already run information analysis programs look for ways to integrate them more effectively into decision-making processes. How abstract or, on the contrary, how specific the measures to "improve the link between the end results of information analysis and business processes" turn out to be depends largely on whether those business processes have been formally defined, and on whether the information analysis group understands the specific information needs associated with the decision-making stages of those processes.

As we mentioned in Chapter 1, the techniques discussed in this book are best suited to companies that already have structured business processes, such as strategy development. Firms with less formalized management may need to be somewhat creative in applying international-level market analysis approaches to their own governance structures. However, the basic principles considered here will work for any company.

Information analysis needs assessment: why is it so important?

Given that understanding the key information analysis requirements early in the process affects the quality of the deliverables more strongly than anything done at later stages, it is striking how often the needs assessment phase is overlooked. Even with resource constraints at other stages of the information analysis process, close attention to needs assessment alone would in many cases significantly increase the value and applicability of the outcomes, thereby justifying the time and resources spent on the task. Below we look at specific ways to improve the quality of needs assessment.

It is often automatically assumed that management knows what information the company needs. In reality, however, top management tends to be aware of only a fraction of the organization's information needs, and even then it may not be in the best position to determine exactly what information is needed, let alone where it can be found.

As a result, the same situation repeats itself: information analysis tasks are performed without either a clearly formulated problem statement or its business context. Those who know the information sources and analysis methods best waste time on what amounts to messy data processing and lose sight of the big picture and of the approaches that matter most to the company. Not surprisingly, decision-makers end up receiving far more information than they need, which is largely counterproductive: they soon begin to ignore not only the useless but also the important information. They do not need more information; they need better and more accurate information.

At the same time, decision-makers may have unrealistic expectations about the availability and accuracy of information if they did not consult information analysis experts before setting the task. Ideally, therefore, information analysts and decision-makers should stay in constant contact and work together so that both sides share the same understanding of the primary information needs. Managing this process requires that analysts working in this area have a number of skills:

  • The analyst must understand how to identify and define the information needs of decision makers.
  • The analyst should develop skills for effective communication, interviews and presentations.
  • Ideally, the analyst should understand the psychological types of personality in order to take into account the different orientations of the people responsible for making decisions.
  • The analyst needs to know the organizational structure, culture and environment, as well as the key interviewees.
  • The analyst must maintain objectivity.

Work within the cycle of information analysis and elimination of bottlenecks in the process

In the initial stages of implementing an information analysis program, the target group for its activities is usually limited, as are the end results the program produces. Various difficulties, so-called "bottlenecks," also arise in producing those results: even the simple collection of scattered data from secondary and primary sources can require knowledge and experience the company does not have, and once the information has been collected it may turn out that there is not enough time and resources for a detailed analysis of the data, let alone for preparing informative, well-crafted presentations for decision-makers. Moreover, at the early stages of developing an information analysis program, almost no company has dedicated tools for storing and disseminating the results of such analysis; typically, the results end up being sent to target groups as ordinary e-mail attachments.

The complexity of the analytical task within the information analysis cycle can be described using the standard project management triangle, i.e. it is necessary to complete the task and deliver the result under three main constraints: budget, timeline, and scope of work. In many cases, these three constraints compete with each other: in a standard information analysis task, increasing the workload will require an increase in time and budget; a tight deadline is likely to mean an increase in budget and a simultaneous reduction in the amount of work, and a tight budget is likely to mean both a limitation on the amount of work and a reduction in the time frame for the project.

Bottlenecks in the information analysis process usually cause significant friction in executing research tasks within the information analysis cycle during the early stages of the program's development. Since resources are limited, the most critical bottlenecks should be addressed first. Does the information analysis team have sufficient capacity to carry out the analysis? Is additional training needed? Or is the problem rather that analysts lack valuable information to work with, in other words, that the most critical bottleneck is information gathering? Or perhaps the information analysis team simply lacks time, that is, it cannot respond to urgent requests promptly?

There are two ways to improve the efficiency of the analytical task within the information analysis cycle: increasing the "throughput" of the cycle, that is, the thoroughness with which the information analysis team can handle analytical tasks at each stage, and increasing the speed with which a question is answered. Fig. 2 shows the difference between these approaches and, more generally, the difference between strategic analysis tasks and research requests that require a prompt response.

Although both approaches take the analytical task through all stages of the information analysis cycle, a team tasked with rapid research will study secondary and primary sources in parallel (sometimes a single phone call to a specialist provides the answers to the questions posed in the research request). In addition, in many cases analysis and delivery are combined, for example in a synopsis that the analyst sends to the manager who requested the information.

You can improve the performance of your analysis cycle by adding either internal (hired) or external (acquired) resources where you need them to deliver better results and expand your ability to serve an increasing number of user groups within your organization.

The same principle applies to the responsiveness of the workflow, that is, how quickly an urgent research task moves through the stages of the cycle. Traditionally, companies have focused mainly on ensuring stable throughput through long-term resource planning and staff training. However, as information analysis has matured into a specialized discipline and globally available external professional resources have become easier to engage, case-by-case temporary arrangements that provide the necessary flexibility are becoming more common.

Fig. 3 shows two types of outcomes of the information analysis cycle: strategic analysis and research requiring a rapid response (see the graph of information analysis outcomes). Although research tasks requiring a prompt response are usually tied to business processes, their level of analysis is not very deep, simply because there is no time for deeper analysis. Strategic analysis tasks, on the other hand, usually involve a high level of co-creation at the analysis and delivery stages, which places them practically at the top of the triangle, where the information obtained is interpreted and applied.

Continuous development: striving for an international level of information analysis

The smooth running of the information analysis process can be visualized as a cycle of uniform thickness (Fig. 2), in the sense that a mature process has no "weak links" or significant "bottlenecks" in its sequence of operations. This uniformity requires appropriate resource planning at each stage, which in turn is achieved by iterating through the cycle in full detail. For example, the initial needs assessment can be progressively improved as decision-makers and users of the results notice gaps and typical discrepancies at the initial stage of market research tasks. Similarly, cooperation between information gatherers and analysts (where the two functions are separated) can develop over time, with questions that previously went unnoticed and surfaced only during analysis being passed back to the gatherers for additional data collection. Experience will show over time what resources each of these stages needs to achieve optimal results.

What results are ultimately "optimal" is determined by how closely the resulting information meets the needs of decision-makers in the business process. This again brings us back to the uniform thickness of the information analysis cycle: an information analysis process at the international level does not begin with a needs assessment as such, but with a clear definition of where and how the information obtained will be applied. In fact, communication between decision-makers and information analysts must be constant, informative, and two-way throughout the analytical process.

One way to strengthen the link between decision-making and market research is to conclude service level agreements with the key stakeholders served by the market intelligence program. Agreeing on the required level of market research services with senior leaders in strategic planning, sales, marketing, and R&D clearly defines the end results of such analytical studies and activities for each stakeholder group for the next 6 to 12 months, including the market research budget, the people involved, the milestones, and the interactions throughout the process.

Service level agreements have several benefits:

  • Taking the time to sit down and discuss the main goals and decision milestones with those responsible for key business processes means the market research team gains a better understanding of what matters to management, while personal relationships improve.
  • The risk of unexpected overload from special projects is reduced, because the areas for regular reviews, strategic analysis deliverables, and so on are identified in advance.
  • Time is created for co-creation in the information analysis process: meetings and workshops on analytical market research involving senior managers often need to be scheduled several months in advance.
  • Clearly set goals and evaluated results streamline market research activities and raise the level of analysis.
  • Overall, organizational isolation and the so-called "stewing in one's own juice" decrease, and cooperation between managers and market research specialists becomes more fruitful.

The two examples at the end illustrate how a streamlined information analysis process lets the analytical team respond to the different requirements of a task depending on the geographic region being analyzed. In the "Western world," a large amount of reliable information on almost any topic can be obtained from secondary sources. The task of information analysts is therefore reduced to finding the best sources for cost-effective collection of information for subsequent analysis and reporting.

On the other hand, emerging markets often lack reliable secondary sources or lack the required data in English. Consequently, information analysts need to quickly turn to primary sources and conduct interviews, usually in the language of a given country. In this situation, it is important to rely on a sufficiently large number of sources to assess the correctness of the research results before proceeding with their analysis.

Example. Business Cycle Study for a Chemical Industry Enterprise

A chemical company required extensive information on past, current, and future business cycles across several product lines of the chemical industry in the North American market. This information was intended for assessing future growth in certain areas of chemical production and for planning business development based on an understanding of the industry's business cycles.

Business cycle analysis was carried out both quantitatively and qualitatively, taking into account the views of industry experts on long-term growth. Only secondary sources of information were used for the task, and the analysis relied on statistical methods, including regression and visual analysis. As a result, a detailed analytical report was delivered describing the duration and nature of the business cycles, together with an assessment of the future prospects for the company's key product areas (ethylene, polyethylene, styrene, ammonia, and butyl rubber).
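For illustration only, here is a minimal sketch of the kind of regression analysis mentioned above: fitting a linear trend to a yearly demand series so that the cyclical deviations around the trend can be examined. The figures are invented and are not taken from the study.

    # Ordinary least squares trend fit on a hypothetical yearly demand series.
    years = [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007]
    demand = [100, 108, 103, 115, 121, 117, 130, 138]   # invented numbers

    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(demand) / n

    # Slope and intercept for a single-predictor least squares line.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, demand)) / \
            sum((x - mean_x) ** 2 for x in years)
    intercept = mean_y - slope * mean_x

    trend = [intercept + slope * x for x in years]
    cycle = [y - t for y, t in zip(demand, trend)]   # deviations from the trend

    print(f"trend: {slope:.2f} units per year")
    print("cyclical component:", [round(c, 1) for c in cycle])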

Example. Assessment of the market for ammonium bifluoride and hydrofluoric acid in Russia and the CIS

One of the world's largest nuclear centers needed a study of the market for two by-products of its production, ammonium bifluoride and hydrofluoric acid, in Russia and the CIS. If the capacity of this market proved insufficient, the center would have to invest in building facilities for disposing of these products.

Secondary sources were studied both at the level of Russia and the CIS and at the global level. Because of the highly specialized nature of the market and the high internal consumption of the by-products, the focus was on primary research. In preparation for the subsequent analysis, 50 detailed interviews were conducted with potential customers, competitors, and industry experts.

The final report presented an estimate of the market size excluding internal consumption, an analysis of segments, an analysis of imports, an analysis of the value chain, an analysis of replacement technologies and products for each industrial segment, a forecast of market development, an analysis of pricing and, finally, an assessment of the potential market opportunities in Russia and the CIS.

Example. An efficient process for analyzing information based on an assessment of prevailing trends for reporting to managers

A leading energy and petrochemical company successfully improved its information analysis process by basing the collection, analysis, and delivery of information on strategic scenario analysis.

By integrating information analysis activities into key business processes at the planning stage, it was possible to clearly identify the true strategic needs of the organization and bring them to the analytical team, which, accordingly, was able to organize the analysis process in such a way that the focus was on strategy and actions. The process of analyzing information in a company begins with an examination of prevailing trends and ends with illustrative examples of how to respond to risks with recommendations for management.

The key to improving the effectiveness of the information analysis program was a successful needs assessment in terms of the company's strategic goals. At the same time, people responsible for making decisions participated in the process of analyzing information already at the initial stage (discussions, meetings, seminars). This contributed to the establishment of a two-way dialogue and a more complete integration of the information analysis program into other areas of the company.

Example. A global biotech company developed an information analysis cycle to deliver timely insights and support proactive decision-making

The purpose of the information analysis program was to provide early-warning information that would make it possible to put actionable and achievable strategies in place in all markets in which the company operates. An information analysis cycle was established in which the stakeholders of the analysis (on both the input and the output side) and numerous sources of information were involved at several stages.

The stakeholders of the information analysis represented four key functions in the company (strategy, marketing and sales, finance and investor relations, and the board of directors). Activity was most intense during the planning and implementation stages. The successful implementation of an information analysis cycle that brought together internal stakeholders (for the needs assessment) and multiple sources of information in a well-defined process for delivering analysis results meant that the program had a real impact on strategy development and proactive decision-making.

 
