Fully automated data mining. Possible or not?

Nowadays a huge amount of raw data is available. At the same time, many businesses need fast (real-time) decisions in order to survive and grow in a dynamic, competitive environment.

Handling such high-dimensional problems in time requires automated analytical solutions, that is, solutions that run without human intervention. In many cases, like the two examples below, the data mining task and the subsequent decision making should run as an automated workflow.

Is it possible to create fully automated data mining workflows?

Strictly speaking, only part of the overall data mining process can be automated. Some actions require human intervention: the initial problem definition, the transition from the practical problem to a mathematical one, accounting for the important specifics of the system and its environment, the selection of the model class, of appropriate estimators and of techniques for determining the model structure, and so on. These actions produce the key decisions that define the particular data mining workflow and adapt it to the original real-life problem. Such decisions fall into two groups: domain decisions and math decisions. Domain-oriented decisions concern the business logic and requirements and the environment specifics that have to be accounted for during the data mining process. Math-oriented decisions concern statistical soundness, numerical computation, the convergence of iterative procedures and so on.

So humans play a fundamental role in the data mining process. Fortunately, it is still possible to design data mining workflows that solve real-life problems without human intervention. Once the initial decisions above have been made, the remaining data mining stages, adapted to the particular problem, are very likely to be ready for complete automation. Usually these stages are: conducting experiments, data preparation, model development and model validation.

To summarize: we need domain knowledge to perform the transition from the practical domain to its abstract, mathematical representation. Math knowledge and development skills are also needed in order to design the automated workflow, which incorporates both business and math requirements.

Who can be the workflow designer?

More precisely, do we need data scientists, or can a domain expert construct such a workflow without one?

In fact there are many tools for data processing and analysis, model development and so on that can be used without any knowledge of multicollinearity, convergence or numerical issues. These and other specifics are already accounted for inside the tools. From this point of view the answer could be: domain experts can build their solutions without deep technical knowledge of data science. But this is true only if one accepts that the workflow designer is a user who connects black boxes and hopes that all potential issues are handled well inside them. Also, many tools need tuning before they can be used, and adjusting their many parameters requires math sense and in some cases business sense.

But this is not all. As discussed above, real-life business problems very often require custom solutions adapted to the business needs. The following examples show that selecting and adjusting the right building blocks of a workflow is not enough to obtain a reliable and accurate automated analytical solution that meets the business requirements.

Example 1: Automated demand modelling and forecast

This analytical solution is used in the retail and wholesale sectors. The demand forecast is essential for optimizing the future actions of the retailer or wholesaler in order to increase business efficiency. For this purpose the solution automatically builds a multiple-input multiple-output (MIMO) dynamic model.

The main input from the business side is the assumption that the input and output variables directly associated with a product are the most likely to affect its sales. Variables within the same category are less likely to be significant factors in the model, and variables from other categories are the least likely to be related to the sales of the given product. Another important fact about market systems is that very long periods of missing data are common; they correspond to periods when a given product is not on the market.

Some of the more technical assumptions are: periodic (seasonal) and trend components are likely to be present; a log transformation may be appropriate (as slow-moving items usually have a Poisson-like distribution); the ranges and standard deviations can differ significantly across variables; and so on.
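Two of these assumptions can be illustrated with a minimal sketch: a log transformation for count-like sales, followed by standardization so that variables with very different ranges become comparable. The function name `log_standardize`, the use of `None` for periods when the product is not on the market, and the sample numbers are all assumptions made for illustration; they are not taken from the solution described here.

```python
import math
from statistics import mean, pstdev

def log_standardize(series):
    """Apply log(1 + x), then z-score the result, leaving missing values (None) untouched."""
    logged = [None if x is None else math.log1p(x) for x in series]
    observed = [v for v in logged if v is not None]
    mu = mean(observed)
    sigma = pstdev(observed) or 1.0  # guard against a constant series
    return [None if v is None else (v - mu) / sigma for v in logged]

# Hypothetical weekly sales of a slow-moving item; None marks weeks off the market.
weekly_sales = [0, 1, 0, 3, None, None, 2, 7, 1, 0]
scaled = log_standardize(weekly_sales)
```

The statistics are computed on the observed values only, so long off-market periods do not distort the scaling.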

To account for the business specifics, custom algorithms that handle the missing data were developed. Standard modelling techniques are also not appropriate: there are many (~10^3) potential factors, so with the high level of data uncertainty there is a large chance that some factors enter the model spuriously. For this reason a multistage modelling approach is used, which selects the model structure in accordance with the expected significance of the factors. To improve the performance of the solution, cross-validation tests were also incorporated into the factor selection process. There are other modifications of the standard solutions; a good example is seasonality determination, where the expected periods of the oscillating components help to cope with the noise in the data.
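The idea of staged factor selection guarded by cross-validation can be sketched as follows. This is only an illustration under stated assumptions, not the algorithm used in the solution: the stage ordering (own-product variables, then same-category, then other categories), the greedy forward search, the 5% minimum gain, the plain least-squares fit and all variable names are hypothetical, and the data is synthetic.

```python
import random

def fit_ols(X, y, ridge=1e-8):
    """Least squares with intercept via the normal equations (fine for a few columns)."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented matrix [X'X + ridge*I | X'y]; the tiny ridge only guards against singularity.
    A = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, p), key=lambda k: abs(A[k][c]))
        A[c], A[piv] = A[piv], A[c]
        for k in range(p):
            if k != c:
                f = A[k][c] / A[c][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def cv_mse(data, y, cols, k=5):
    """Mean squared prediction error over k interleaved folds."""
    n, err = len(y), 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        beta = fit_ols([[data[c][i] for c in cols] for i in train], [y[i] for i in train])
        for i in test:
            pred = beta[0] + sum(b * data[c][i] for b, c in zip(beta[1:], cols))
            err += (y[i] - pred) ** 2
    return err / n

def staged_selection(data, y, stages, min_gain=0.05):
    """Visit stages in order of expected relevance; add a factor only if CV error drops enough."""
    selected, best = [], cv_mse(data, y, [])
    for stage in stages:
        while True:
            scores = {c: cv_mse(data, y, selected + [c]) for c in stage if c not in selected}
            if not scores:
                break
            cand = min(scores, key=scores.get)
            if scores[cand] >= best * (1 - min_gain):
                break
            selected.append(cand)
            best = scores[cand]
    return selected, best

# Synthetic data: sales depend on the product's own price and a category promotion.
rng = random.Random(0)
n = 200
names = ["own_price", "own_promo", "cat_promo", "other_price"]
data = {name: [rng.gauss(0, 1) for _ in range(n)] for name in names}
y = [2.0 * data["own_price"][i] + 0.5 * data["cat_promo"][i] + rng.gauss(0, 0.5) for i in range(n)]
stages = [["own_price", "own_promo"], ["cat_promo"], ["other_price"]]
selected, cv_error = staged_selection(data, y, stages)
```

Because later stages are only searched after the earlier ones are exhausted, and every addition must pay for itself out of sample, a factor from a distant category has to clear the same cross-validated bar with less room left to explain.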

Example 2: Automated scorecard development and potential customers’ classification

This solution is designed to predict the credit risk associated with individuals applying for credit. For this purpose an application scorecard is automatically developed.

The input from the business consists of the expected logical trends between the independent and the dependent characteristics. For instance, the predicted probability that the applicant is good (the dependent characteristic) should increase monotonically with ‘age’ (an independent characteristic), but should decrease as ‘number of searches’ (another independent characteristic) increases. There may also be business rules such as: individuals with certain post codes are riskier than others. Financial organizations may also want the model to be as cheap as possible to operate, which means minimizing the number of expensive (e.g. bureau) characteristics in the model. This last requirement is especially important when the scorecard has many outputs (associated with clients’ performance w.r.t. different financial products). In this case the best bureau characteristics for the different outputs may well be similar. So the question here is: can the number of expensive independent characteristics be reduced without an appreciable loss of model accuracy?

To produce reasonable models, some statistical requirements should be introduced, for example: when dummy variables are produced, the good-bad index calculated for neighbouring bins should differ by more than a minimum acceptable threshold; the variables in the model should be sufficiently significant in terms of a predefined p-value, BIC and other criteria.
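The neighbouring-bin requirement can be sketched as a simple merging loop. The exact definition of the good-bad index is not given here, so the sketch assumes it is the log odds of goods to bads in a bin; the function names, the 0.5 smoothing term, the threshold and the counts are likewise hypothetical.

```python
import math

def gb_index(goods, bads):
    """Assumed definition: log odds of good to bad, with 0.5 added to avoid log(0)."""
    return math.log((goods + 0.5) / (bads + 0.5))

def merge_close_bins(bins, min_diff):
    """bins: ordered list of (label, goods, bads).
    Merge the closest adjacent pair until all neighbours differ by at least min_diff."""
    bins = list(bins)
    while len(bins) > 1:
        diffs = [abs(gb_index(*bins[i][1:]) - gb_index(*bins[i + 1][1:]))
                 for i in range(len(bins) - 1)]
        i = min(range(len(diffs)), key=diffs.__getitem__)
        if diffs[i] >= min_diff:
            break
        (l1, g1, b1), (l2, g2, b2) = bins[i], bins[i + 1]
        bins[i:i + 2] = [(f"{l1}|{l2}", g1 + g2, b1 + b2)]
    return bins

# Hypothetical counts of good and bad applicants per age bin.
age_bins = [("18-25", 80, 20), ("26-35", 85, 20), ("36-50", 180, 20), ("51+", 190, 10)]
merged = merge_close_bins(age_bins, min_diff=0.3)
```

With these counts the first two age bins have almost identical odds, so they collapse into one dummy variable while the other bins stay separate.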

As in the previous example, custom logic was designed to account for the business requirements: it decides which variable(s) to re-class and how to redefine the dummies (the bin ranges) so that the model reflects the required logical trends. This business requirement also reduces the statistical errors caused by data uncertainty. In addition, once the model is built, the score distribution may turn out not to be monotonic (for a preselected class granularity), or the discriminative power of the scorecard may not be acceptable for some score ranges, and so on. The automated analyses performed in this analytical solution are again not trivial.
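One possible way to re-class a variable so that it follows a required trend is to pool adjacent bins that violate it. The sketch below is an assumption about how such logic might look, not the custom logic described above; it compares the good rate of neighbouring bins, and the function name and the counts are hypothetical.

```python
def enforce_monotone(bins, increasing=True):
    """bins: ordered list of (label, goods, bads).
    Merge adjacent bins whose good rate violates the expected (strict) trend."""
    sign = 1 if increasing else -1
    merged = []
    for label, goods, bads in bins:
        merged.append((label, goods, bads))
        # Pool backwards while the last two bins break the required direction.
        while len(merged) > 1:
            (l1, g1, b1), (l2, g2, b2) = merged[-2], merged[-1]
            if sign * (g2 / (g2 + b2) - g1 / (g1 + b1)) > 0:
                break
            merged[-2:] = [(f"{l1}|{l2}", g1 + g2, b1 + b2)]
    return merged

# Good rate is expected to rise with age and fall with the number of searches.
age = [("18-25", 70, 30), ("26-35", 65, 35), ("36-50", 80, 20), ("51+", 90, 10)]
searches = [("0", 90, 10), ("1-2", 80, 20), ("3-5", 85, 15), ("6+", 60, 40)]
age_bins = enforce_monotone(age, increasing=True)
search_bins = enforce_monotone(searches, increasing=False)
```

Merging trades granularity for a trend the business can accept, so in practice such a step would be followed by a check that the merged variable is still significant.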

So, when data mining is considered as a general way of extracting knowledge from data, the process cannot be fully automated. Domain knowledge is needed in order to make the right decisions before proceeding with the automation of the data mining workflow. Nevertheless, data mining can be fully automated for many practical problems by incorporating the business logic and the statistical sense into the automated analytical solution. The data scientist’s role is essential when designing custom automated solutions for a particular business case. And finally: both domain experts and data scientists are needed to build high-quality analytical solutions.

