ABOUT QUANTDISQOVERY

Product Description:

QuantDisqovery is a SaaS tool for feature engineering for predictive or explanatory modeling.
It is oriented towards the data science community.

The software supports data scientists by:

  • preparing their data for machine learning, boosting model performance and explainability
  • allowing visual inspection of predictive or explanatory data patterns, for in-depth discussion with subject matter experts
  • automatically detecting underlying predictive / explanatory data patterns
  • offering immediate treatment of missing values and outliers (no need for imputation)
  • accommodating highly non-linear data patterns
  • handling binary and continuous dependent variables (the web version currently supports a binary target only)

WHAT IS QUANTDISQOVERY?

QuantDisqovery is an easy-to-use, second-level ‘feature engineering’ software for predictive or explanatory model development. It is designed to provide an effective and efficient solution to an important component in the development of machine learning models: the transformation and visual interpretation of the raw predictive or explanatory variables to be used in a statistical/machine learning model. The software offers built-in treatment of missing data as well as of outliers.

WHAT DO YOU GET?

  • 1 year license
  • Web based
  • 1 MB maximum file size
  • Unlimited number of projects
  • Automatic data field format detection
  • Binary target variable only
  • Automated feature visualisation and transformation
  • Feature transformation code immediately available for R, Python, SAS, C#
  • Feature transformation with no code: upload a file and download the transformed one
  • Visual Excel file download for easy sharing with colleagues

WHAT BENEFITS WILL IT BRING TO YOU?

For you as a data scientist, the key benefits QuantDisqovery brings are as follows:

  • insight and understanding: easy, powerful and fast visualisation of predictive or explanatory data patterns
  • productivity: increased speed of model development
  • quality and explainability of your model, through QuantDisqovery’s optimised WOE transformation
  • allows intensive interaction with the business users, connecting your models with their everyday reality

For you as a business user, the key benefits are as follows:

  • finally understand the hidden predictive power in your data, without the complexity of the classical statistical approaches
  • allows intensive interaction with the data scientist, so you can coach and guide them, and ensure they build models which are reliable, transparent and produce understandable and useful outcomes

WHY NOT GIVE IT A TRY?

  • QuantDisqovery can be tried free of charge during a 14 day period. It has a number of example data files which you can choose from, to test the software.
  • These will allow you to get familiar with the concept of WOE transformation, and will allow you to see the WOE automated binning of QuantDisqovery in action.
  • On YouTube, you can find a video demonstrating how to use the software. In this video, use is made of the test datasets.
  • The data sets can be downloaded for free, as can the WOE transformation code. This will allow you to test your machine learning models using these WOE transformations.



Get a quick impression with the video below



WHAT'S BEHIND THE PRODUCT?

QuantDisqovery uses the concept of ‘binning’ and Weight-of-Evidence (WOE) feature engineering to visualise and prepare the underlying data for machine learning or AI algorithms. This WOE data transformation technique is widely used for the development of credit scorecards in the financial services industry, where transparency and explainability of the outcome of the machine learning model is a must-have.

This binning approach was developed some decades ago, and has been tried, tested and refined over the years. Initially the binning, and the visualisation accompanying it, was performed manually. Even though some automated tools exist today, it remains very much a manual effort.

It was the desire to reduce this manual effort which inspired the team behind Quantforce to develop a special algorithm that automates the binning in an efficient (fast), effective (predictive pattern detection) and logical manner, while at the same time generating a visual inspection that allows a user to influence the binning performed by the algorithm.

Based upon their combined experience in developing models for credit scoring, prospect scoring and churn prediction, the founding fathers of Quantforce succeeded in their mission. The binning algorithm they developed has already proven its worth, powering a whole range of scoring systems used around the world, across a wide range of industries.

Where the original implementation was desktop/laptop based, QuantDisqovery is an entry-level, web-based solution putting the power of binning-based feature engineering at your fingertips.

Enjoy the visualisation of data patterns, enjoy the insights it gives and the collaboration with your domain experts, and enjoy the increased power and transparency you will be able to get from your AI and ML techniques.


PREDICTIVE MODELING

Predictive modeling is the area of data science which focuses on using ML and AI techniques to extract insights from data in order to predict future events. To this end, predictive modeling uses historical relationships between data at a given point in time and an outcome some time after the data was drawn. By analysing the relationships between the historical data and the outcome, predictive models can detect which variables are most correlated with, or influential on, the outcome. Once such a model is built, it can be used to derive, from today's data, what a likely future outcome will be.

Commonly known examples of predictive modeling include credit scoring (predicting whether a borrower is likely to default on payment in the future), churn scoring (predicting whether a customer is likely to stop doing business with you) and prospect scoring (predicting whether a prospect is likely to respond to your marketing campaign and become a customer). It has also been successfully applied in health care (predicting whether a person is likely to develop an illness) and IoT (predicting a machine failure).

Econometricians may know this also as a form of intertemporal modeling.

EXPLANATORY MODELING

In essence this is similar to predictive modeling, but whereas for predictive modeling it is mandatory to have a time lag between the data and the outcome, there is no such condition for explanatory modeling. Unlike the future orientation of predictive modeling, explanatory modeling purely seeks which variables explain a certain outcome.

Econometricians may also know this as a form of contemporaneous modeling.

THE DIFFERENCE BETWEEN EXPLANATORY AND PREDICTIVE MODELING
An example:

The difference between a predictive model and an explanatory model may not always be very straightforward. Using a simple example, however, the difference may become a bit clearer.

Let us take the situation of a medical doctor in a hospital, who has two types of models:

  • A predictive model, which the doctor uses to predict if a patient will become ill in the future
  • An explanatory model, which the doctor uses to determine whether a patient admitted to the hospital has an illness (or not).

When a patient is admitted to the hospital, a medical doctor will run a series of tests to determine what is wrong with the patient. In addition to his or her own assessment of the test results, the doctor may well pass the test results through a model to determine the presence of a certain illness. A commonly known area of this way of working is the detection of lung cancer, where AI and ML models assist medical doctors in identifying the presence of cancer in the lungs. As there is no time lag in the data (i.e. the radiology images are analysed immediately after being taken), the models used are an example of explanatory modeling.

However, the same medical doctor may well run some tests on a person who is not yet ill, and then pass these results through a model in order to determine the likelihood that this person may develop a certain illness over time, for example when a medical doctor would like to proactively assess the likelihood of a cancer developing in a person's lungs. The person may not yet be ill, but there may be signals in the test results pointing towards an increased risk of developing lung cancer over time.

In this case a predictive model is used, as the doctor is looking to predict a certain future condition of the person using today's data.

BUSINESS INTELLIGENCE AND PREDICTIVE MODELING - A PERFECT COMBINATION

One may wonder in which way predictive modeling is different from Business Intelligence. After all, BI also tries to uncover relationships between data variables, and trends in these data variables, and certain outcomes or results. Some BI software can even extrapolate trends and apply some form of ML or AI on this trend data.

The key difference is that BI is often concerned with reporting on the current and recent past, whereas predictive modeling uses this current and past data to provide a look into the future. So, even before it has happened, one may claim that predictive modeling can tell us what the world will look like in the future.

Some BI software tools will provide ML or AI, but most often they rely on extrapolation of trends. In contrast, predictive modeling will extract from the data which variables are predictive of a certain outcome, and use this to produce its prediction. It therefore probes more deeply into the data, and does not merely extrapolate.

Additionally, predictive modeling will automatically combine and weight the predictive variables. It may hence combine several variables which are present in a BI system, in the most optimal form.

However, the two fields of expertise can form a very powerful combination: derivations of the time series can be used as variables in the predictive model, making the resulting predictions even better, whilst BI can benefit from the insights of predictive modeling to make its reporting 'predictive-based', rather than just current- or past-based, or derived only from extrapolations.

THE CONCEPT OF WEIGHT OF EVIDENCE [WOE] AND ITS USE IN PREDICTIVE MODELING

The Weight of Evidence (WoE) is a feature transformation technique, which is used to analyse the relationship or predictive power of a data field, relative to an outcome. It originates from the credit scoring area, where it is commonly used for building ML and AI models.

Mathematically, it is calculated by taking the natural logarithm (log to base e) of the ratio of the % of non-outcome-events to the % of outcome-events:

WOE = ln(% of non-events / % of events)

In credit scoring, the 'non-events' may be customers who did not default on repayment, whilst the 'events' may be customers who did default on their repayment.

Similarly, for other areas of use, the 'outcome events' and the 'non outcome events' will relate to that specific area (e.g. customers who churned versus customers who did not churn).
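As a purely illustrative worked example: if a given bin contains 8% of all non-events but only 4% of all events, its WoE equals ln(0.08 / 0.04) = ln(2) ≈ 0.69, whereas a bin containing more events than non-events gets a negative WoE.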

For each variable, the WoE is calculated on subsets of observations within the variable. This means that before the WoE is calculated, the variable is split into pieces, or ranges. These pieces, or ranges, of the variable are the so-called 'bins'. For each bin, the WoE can then be calculated. Once the variable is 'binned' and the WoE is calculated for each bin, this WoE is what enters the ML or AI model development. It is therefore no longer the raw data which is used, but the WoE-transformed, binned variable.
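As a minimal sketch (not the QuantDisqovery implementation itself), the Python snippet below shows how the WoE per bin could be computed once a variable has been binned; the pandas-based approach and the function name woe_per_bin are assumptions made purely for this illustration.

    import numpy as np
    import pandas as pd

    def woe_per_bin(binned: pd.Series, y: pd.Series) -> pd.DataFrame:
        """Compute the Weight of Evidence for each bin of an already-binned variable.

        `binned` holds the bin label of each observation; `y` holds the binary
        outcome (1 = event, 0 = non-event).
        """
        counts = pd.crosstab(binned, y)                     # non-events / events per bin
        pct_non_event = counts[0] / counts[0].sum()         # share of all non-events in the bin
        pct_event = counts[1] / counts[1].sum()             # share of all events in the bin
        counts["woe"] = np.log(pct_non_event / pct_event)   # WOE = ln(% non-events / % events)
        return counts

In practice a small adjustment is usually applied to the counts so that a bin with zero events or zero non-events does not produce an infinite WoE.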

A number of data transformation techniques are available for a data scientist preparing his/her data for the ML or AI modeling, so in a way the WoE transformation is just one of them.

It does, however, have some very practical advantages over other feature transformation techniques (the first few are illustrated in the sketch after this list):

  • it automatically provides a fact-based treatment of missing observations within a variable, so there is no need to impute an average value, a median or some extreme value, any of which may influence your ML or AI model

  • it automatically allows the treatment of extreme values. Extremely high or extremely low value observations can be binned or grouped with other observations. So, rather than excluding these factual observations, they can be included and properly treated.

  • it provides a fact-based numerical encoding for nominal (categorical) variables.

  • it helps to avoid overfitting, as it provides a smoothing of 'saw-tooth' data patterns.

  • alternatively, it allows highly non-linear data patterns to be modelled, as the WoE can be assessed separately for each bin the user defines. A monotonic relationship can hence be achieved without violating the underlying pattern.

  • for some implementations, it has clear advantages over using a variable in its continuous form
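As a rough, generic sketch of the first few advantages above (the toy data and column names are invented for the example; this is not the product's own code):

    import numpy as np
    import pandas as pd

    # Toy data purely for illustration: a numeric field with gaps and an outlier,
    # a nominal field, and a binary outcome.
    df = pd.DataFrame({
        "income": [1200, 1500, np.nan, 2100, 99999, 1800, np.nan, 1650],
        "region": ["N", "S", None, "S", "E", "N", "E", "S"],
        "default": [1, 0, 1, 0, 0, 1, 1, 0],
    })

    # Missing values: kept as a bin of their own, so no mean/median imputation is needed.
    income_bins = pd.qcut(df["income"], q=4, duplicates="drop")
    income_bins = income_bins.cat.add_categories(["MISSING"]).fillna("MISSING")

    # Outliers: the extreme income of 99999 simply falls into the highest bin,
    # so the observation is kept and treated rather than excluded.

    # Nominal variables: every category (plus 'MISSING') becomes its own group,
    # which can then be replaced by its WoE value.
    region_groups = df["region"].fillna("MISSING")

    # Both groupings can be fed into the same WoE calculation,
    # e.g. the woe_per_bin() sketch shown earlier.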


As a result, the data patterns fed into the ML or AI model will be much more reflective of the actual underlying predictive or explanatory patterns, and variables which are hard to transform can be used without any problem. Additionally, it ensures much better transparency and explainability of your model. It is for this reason that the technique was developed in the credit scoring industry, where these are two very important traits of any model used for actual decision making, and it is also for this reason that it has been endorsed by regulators.

However, as indicated above, in other industries too the innate characteristics of transparency and explainability will to a large extent drive the confidence in, and subsequent adoption of, ML or AI models.

THE PRACTICAL CHALLENGE WITH WOE

A key practical challenge with this binning transformation, however, is how to split a variable into relevant bins, and how many bins to use. For many years this was a manual effort, involving many iterations per variable.

It was hence time-consuming, and as a data scientist the question was always whether the correct number of bins had been chosen, and whether the way the variable was 'cut' into bins was actually optimal. Choosing the bins in such a way that the most predictive data pattern in a variable was captured turned into a bit of an 'art': often a beautiful piece of 'art', but mostly quite time-consuming.

Over time, some algorithms were developed to automate the 'binning' effort, but most of them fell a bit short. Most often this was either because the resulting bins were too broad, or because the calculation was simply too heavy computationally. Also, the resulting bins sometimes seemed not very well chosen, and the options to make changes were often limited.

HOW QUANTDISQOVERY CAN HELP OVERCOME THE CHALLENGES OF FEATURE ENGINEERING AND FEATURE TRANSFORMATION FOR PREDICTIVE MODELING

As experts in predictive modeling, with many years of experience in developing credit scoring, churn scoring and prospect scoring models, the Quantforce team developed a special, proprietary algorithm to ensure that the WoE-based binning can:

  • be performed fast and computationally efficient
  • extract the underlying data patterns in a logical manner
  • provide a way for the user to influence and adapt/adjust the binning as performed by the algorithm
  • allow the user to adjust the number of bins manually, and provide a visual interface so the user can explore a variety of emerging predictive or explanatory data patterns

This proprietary algorithm has already proven its worth in the market. It was first used around 10 years ago for a large project on ML, and has since been used to develop many ML and AI models.

Quantforce first implemented this binning algorithm as part of a broader PC-based software package (desktop or laptop), but has now made it available through an online, internet-based solution. In line with Quantforce's philosophy, the price has been kept at a very democratic level, so as to give a wide body of data scientists access to it.

The PC based software can deal with two types of dependent variables:

- a binary target variable: either 1 or 0

- a continuous target variable

The QuantDisqovery solution currently supports only a binary target. It does not yet support a continuous dependent variable.

WHY DOES THE SOFTWARE ALLOW CREATING A LARGER NUMBER OF BINS?

Traditionally, the number of bins was kept limited. In some instances this was due to statistical reasoning (every bin should have a sufficient number of observations) or stability reasons (larger bins tend to have a more stable pattern when gauged over time, whilst smaller bins may be more volatile).

In line with this reasoning, when the binning is first prepared on a new dataset uploaded to the QuantDisqovery software, the coarse classifying is by default performed with 20 bins. This means that each bin covers about 5% of the observations, ensuring that the two main considerations above are satisfied.
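As a rough illustration of this default (assuming simple equal-frequency binning; the product's own coarse classifying algorithm is proprietary), 20 quantile bins give each bin roughly 5% of the observations:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    x = pd.Series(rng.normal(size=10_000), name="some_variable")   # stand-in numeric variable

    coarse = pd.qcut(x, q=20, duplicates="drop")                   # 20 equal-frequency bins
    print(coarse.value_counts(normalize=True).round(3))            # each bin holds about 5%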

From there, the proprietary algorithm will perform a first automated pattern detection, and suggest an ‘optimal’ binning. Often, after this step, the number of bins may well be less than 20.

The binning performed by the algorithm has been made ‘smart’: whilst it keeps volume and statistical stability in mind, it pays special attention to non-linearities in the data pattern. This means it is able to detect bins with a disproportionate WoE, even if the volume in the bin is somewhat smaller. This is particularly useful for variables whose fill rate (or coverage) may be rather low, but which are very powerful predictors.

At the same time, the algorithm will apply a logical fitting function, such that a smooth, consistent and logical feature engineering and transformation is achieved.

For variables with a very granular distribution, the algorithm will make use of this granularity and may set up to 20 bins, again observing the non-linearities and the logic. The main advantage of doing so is that the resulting predicted value will be more granularly distributed, avoiding what is known as ‘clumping’ in the distribution of the predicted or estimated outcome. Whilst this may not necessarily improve the predictive performance of the model, it will greatly enhance its usefulness to the user. In some instances it will provide both: it will help to enhance the predictive performance as well as the granularity of the estimation, in which case it offers a great addition to your modeling efforts.

An additional feature built into the QuantDisqovery software is that it allows the user to interactively explore the data, and the underlying predictive or explanatory patterns, in more detail.

Starting from the initial binning provided by the algorithm, as a user you can manually increase the number of bins, up to a maximum of 100, i.e. with each bin holding at minimum 1% of the observations. Note that the number of bins you will see on screen depends on the granularity of the variable under consideration. If the variable does not allow 100 bins, the software will show the maximum possible (which may well be less than 100).

Using a sliding bar, or by manually entering the number of bins, the software will display the granular view alongside a trend or smoothing line. The latter is the algorithm's first attempt to visualise the underlying predictive data pattern. From here, a user may set the bins manually, or use the ‘automated’ binning algorithm. The latter functions in exactly the same way as described above, but rather than starting from the standard 20 bins, it starts from the number of bins chosen by the user.
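The interactive re-binning and the smoothing line can be mimicked in a very rough way with a sketch like the one below; the quantile re-binning, the 1%-per-bin floor and the rolling-mean trend are stand-ins chosen for illustration, not the product's own (proprietary) fitting logic.

    import numpy as np
    import pandas as pd

    def rebin_with_trend(x: pd.Series, y: pd.Series, requested_bins: int) -> pd.DataFrame:
        """Re-bin a numeric variable at a user-chosen granularity and add a crude trend line.

        The bin count is capped at 100 (i.e. at least ~1% of observations per bin);
        pd.qcut returns fewer bins if the variable is not granular enough.
        """
        n_bins = min(requested_bins, 100)                    # max 100 bins / min ~1% per bin
        binned = pd.qcut(x, q=n_bins, duplicates="drop")     # fewer bins for less granular data
        counts = pd.crosstab(binned, y)
        woe = np.log((counts[0] / counts[0].sum()) / (counts[1] / counts[1].sum()))
        trend = woe.rolling(window=3, center=True, min_periods=1).mean()  # simple smoothing line
        return pd.DataFrame({"woe": woe, "trend": trend})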

As a user, you therefore have full control at variable level, and can decide how a variable will be prepared to be entered into an AI or ML model. Expert domain knowledge can then effectively be applied at the variable level, which is the very best way to ensure that the resulting model, and the output it produces, will be logical and meaningful to the user.

In practice, a combination of automated binning and manual adjustments to the binning is often used. This is particularly useful for variables with a non-monotonic relationship between the variable under observation and the dependent variable (for example, U-shaped data patterns).

An additional consideration is that the WoE transformation effectively replaces the actual values of a variable with new values. Due to the binning, there will be fewer distinct WoE values than original values inside a variable, but the number of variables does not change. This removes certain statistical limitations for the ML or AI, as each variable after the WoE transformation is effectively a series of numbers. For example, it alleviates some of the issues of running predictive models on datasets with a limited number of observations but quite a few variables present as potential predictors.
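What this replacement looks like in practice can be sketched as follows; again this is a generic illustration with assumed names, not the software's generated code.

    import numpy as np
    import pandas as pd

    def woe_transform(x: pd.Series, y: pd.Series, n_bins: int = 20) -> pd.Series:
        """Replace each raw value of x by the WoE of the bin it falls into."""
        binned = pd.qcut(x, q=n_bins, duplicates="drop")
        counts = pd.crosstab(binned, y)
        woe = np.log((counts[0] / counts[0].sum()) / (counts[1] / counts[1].sum()))
        return binned.map(woe).astype(float)   # many raw values become at most n_bins WoE values

    # The transformed column has far fewer distinct values than the raw one,
    # while the number of variables in the dataset stays the same.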

AN EXAMPLE OF HOW THE WOE TRANSFORMATION IS HELPFUL IN EXTRACTING PREDICTIVE OR EXPLANATORY DATA PATTERNS

As a data scientist, you will be well aware that data visualisation can really help to explore the data at hand. Several ways exist to create a graphical view of the relationship that a variable in your dataset may have to your outcome variable; scatterplots and distribution plots are two frequently used examples.

In addition, you may run classical measures for testing relationships, such as correlation analysis, single-variable regression-based measures, and the like.

However, these measures provide you with just one single output number and may ignore the actual underlying predictive data pattern. Partial or non-linear correlations will therefore remain largely undetected.

Without going into too much detail (as this will be very familiar to any data scientist), the graph below shows how the binning can help to make sense of an underlying data pattern. The top graph shows a typical visualisation of a data variable for predictive modeling. The yellow x's show the ‘goods’ (or ‘non-events’), whereas the blue crosses are the ‘bads’ (or ‘events’). From this graph, it may be difficult to see whether this variable is predictive of the ‘events’.


Running this data field through the binning algorithm in QuantDisqovery produces the visualisation below. Here, the raw data is transformed into an easy-to-read data pattern, and the predictive pattern ‘hidden’ inside the data immediately becomes very clear.
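A generic way to produce a comparable binned view yourself is a simple bar chart of the WoE per bin, as in the matplotlib sketch below (an illustration only; the screenshots referred to above come from the software itself):

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    def plot_binned_pattern(x: pd.Series, y: pd.Series, n_bins: int = 20) -> None:
        """Bar chart of the WoE per bin, making the pattern in the raw data visible."""
        binned = pd.qcut(x, q=n_bins, duplicates="drop")
        counts = pd.crosstab(binned, y)
        woe = np.log((counts[0] / counts[0].sum()) / (counts[1] / counts[1].sum()))
        woe.plot(kind="bar")
        plt.ylabel("Weight of Evidence")
        plt.xlabel(x.name or "binned variable")
        plt.tight_layout()
        plt.show()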


From here, a user may leave this transformation as is, or choose to explore the data field further (e.g. by increasing the number of bins/granularity).



WHERE DOES QUANTDISQOVERY FIT IN YOUR DATA SCIENCE TOOLBOX?

QuantDisqovery essentially provides a feature transformation solution using the WoE concept.

It is therefore situated between the data preparation and feature engineering phase (such as the derivation of averages, maximum values, …) and the ML or AI algorithm part.




QuantDisqovery provides fast detection of underlying predictive or explanatory data patterns, in a visually intuitive form, so that domain experts can be much more involved in the modeling process. For the data scientist, it largely helps to overcome the repetitive task of visualising the data in an insightful format.

The software will provide the feature transformation code in a number of commonly used languages (e.g. R, Python, C#, SAS). By simply copying and pasting this code, the data scientist can continue the algorithm development without having to spend time coding the data transformation.
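For illustration only, such a generated Python snippet could look roughly like the hypothetical example below; the actual code produced by QuantDisqovery, and its cut points and WoE values, will of course differ.

    # Hypothetical example of generated feature-transformation code for one variable.
    # The cut points and WoE values are invented purely for illustration.
    def transform_income(value):
        """Map a raw 'income' value to its WoE-transformed value."""
        if value is None:            # missing values get their own WoE
            return -0.42
        if value <= 1500:
            return -0.87
        if value <= 2500:
            return -0.15
        if value <= 4000:
            return 0.31
        return 0.94                  # highest bin, which also catches extreme values

    # Usage: incomes_woe = [transform_income(v) for v in raw_incomes]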

BENEFITS FROM PREDICTIVE MODELING

Predictive modeling is all about showing the future, even though it has not happened yet. Its aim is to use historical data patterns to find a meaningful, robust and generally stable relationship between a set of data or variables at one point in time, and an outcome at a later stage.

As with all things in life, nothing is perfect, but predictive modeling is all about envisioning future outcomes so we can act now or be prepared, rather than facing the unexpected and being surprised. It is basically the reason why, as humanity, we spend time on weather forecasting, on predicting volcano eruptions, or on finding clues to how we can predict the development of cancer.

Predictive modelling can be applied to almost all aspects of life, and the benefits are important.

Some examples:

  • In industry, predictive modeling can avoid breakdowns in production processes. If it can be predicted when machinery will break down due to wear and tear, proactive maintenance will avoid costly disruptions in the production process.

  • In IoT in general, the same concepts can optimise maintenance and avoid negative surprises. The example of the elevator manufacturer who can proactively send out maintenance teams will stop many people from being stuck in an elevator.

  • Churn prediction can avoid a loss of customers, and improve customer retention. It typically will add significantly to the top and bottom line.

  • Prospect scoring can help to increase the ROI on every marketing Euro spent. Conversion rates may double or even triple from the more traditional approaches.
  • Credit scoring models have been around for a while and have proven their worth. Nowadays, they prove critical not just for optimising internal workflows or calculating expected losses, but also for digitising the customer interaction process.
  • In medicine, predictive modeling is gaining ground, as it can help doctors to uncover relationships between certain diagnostics and the development of a disease. Day-to-day examples are the algorithms in smart watches, which get better every day at helping you stay healthy.


For all its benefits, predictive modeling used to be within reach only of large companies or institutions with big budgets. Now that data science is spreading fast across our modern world, predictive modeling is coming within reach of most of us.

Still, as described in the section on WOE, some real challenges remained. Against this background, Quantforce is proud to present QuantDisqovery, a tool which helps to largely overcome these challenges and puts this capability within reach of all organisations, whether large or small.