WHY DOES THE SOFTWARE ALLOW CREATING A LARGER NUMBER OF BINS?
Traditionally, the number of bins would be kept limited. In some instances this was for statistical reasons (every bin should have a sufficient number of observations), in others for stability reasons (larger bins tend to show a more stable pattern when gauged over time, whilst smaller bins may be more volatile).
In line with this reasoning, when the binning is first prepared on a new dataset uploaded to the QuantDisqovery software, the coarse classifying is by default performed on 20 bins. This ensures that each bin covers 5% of the observations, so that the two main considerations above are satisfied.
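This default starting point is a straightforward equal-frequency split. A minimal sketch of the idea, using pandas quantile binning (this is an illustration of the 5%-per-bin starting point, not the proprietary algorithm itself):

```python
import numpy as np
import pandas as pd

def coarse_bins(values, n_bins=20):
    # Equal-frequency bins: each bin holds roughly 1/n_bins of the observations.
    # duplicates="drop" merges bins when tied values collapse quantile edges.
    return pd.qcut(pd.Series(values), q=n_bins, duplicates="drop")

# 1,000 continuous observations -> 20 bins of exactly 50 observations (5%) each
rng = np.random.default_rng(0)
x = rng.normal(size=1_000)
counts = coarse_bins(x).value_counts()
```

With a truly continuous variable every bin ends up with the same volume; with heavily tied data, some quantile edges coincide and fewer bins result.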
From there, the proprietary algorithm will perform a first automated pattern detection and suggest an ‘optimal’ binning. After this step, the number of bins may well be fewer than 20.
The binning performed by the algorithm has been made ‘smart’: whilst it keeps volume and statistical stability in mind, it pays special attention to non-linearities in the data pattern. This means it is able to detect bins with a disproportionate WoE (Weight of Evidence), even if the volume in the bin is somewhat smaller. This is particularly useful for variables whose fill rate (or coverage) is rather low, but which are very powerful predictors.
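For reference, the WoE of a bin is the log-ratio of the bin's share of goods to its share of bads. A self-contained sketch of the standard formula (the toy bins and target below are purely illustrative):

```python
import numpy as np
import pandas as pd

def weight_of_evidence(bins, target):
    # WoE per bin: ln( share of goods in bin / share of bads in bin ).
    df = pd.DataFrame({"bin": bins, "y": target})
    g = df.groupby("bin", observed=True)["y"]
    bads = g.sum()                # target == 1 treated as the "bad" event
    goods = g.count() - bads
    return np.log((goods / goods.sum()) / (bads / bads.sum()))

# Toy example: bin "A" holds mostly goods, bin "B" mostly bads
bins = ["A"] * 10 + ["B"] * 10
target = [0] * 8 + [1] * 2 + [0] * 2 + [1] * 8
woe = weight_of_evidence(bins, target)
```

A bin whose WoE is far from zero carries a disproportionate signal, which is exactly what the detection described above looks for, even in smaller-volume bins.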
At the same time, the algorithm will apply a logical fitting function, so that a smooth, consistent and logical feature engineering and transformation is achieved.
For variables with a very granular distribution, the algorithm will make use of this granularity and may set up to 20 bins, again observing the non-linearities and the logic. The main advantage of doing so is that the resulting predicted value will be more granularly distributed, avoiding what is known as ‘clumping’ in the distribution of the predicted or estimated outcome. Whilst this may not necessarily improve the predictive performance of the model, it will greatly enhance its usefulness to the user. In some instances it will provide both: enhanced predictive performance and a more granular estimation, in which case it offers a great addition to your modelling efforts.
An additional feature built into the QuantDisqovery software is that it allows the user to interactively explore the data, and the underlying predictive or explanatory patterns, in more detail.
Starting from the initial binning provided by the algorithm, as a user you can manually increase the number of bins, up to a maximum of 100, i.e. with each bin holding at minimum 1% of the observations. Note that the number of bins you will see on the screen depends on the granularity of the variable under consideration. If the variable does not allow 100 bins, the software will show the maximum possible (which may well be fewer than 100).
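Why a variable may not allow 100 bins can be seen directly from quantile binning: when a variable has few distinct values, many quantile edges coincide and the achievable bin count drops. A hedged sketch (the helper name and data are illustrative, not part of the software):

```python
import numpy as np
import pandas as pd

def feasible_bins(values, requested=100):
    # Quantile edges collapse when a variable has few distinct values,
    # so the achievable bin count can be well below the requested one.
    binned = pd.qcut(pd.Series(values), q=requested, duplicates="drop")
    return binned.cat.categories.size

granular = np.random.default_rng(1).normal(size=10_000)  # effectively continuous
coarse = np.repeat([1, 2, 3, 4, 5], 2_000)               # only 5 distinct values
```

The continuous variable supports the full 100 bins, while the five-valued variable yields far fewer, mirroring what the screen would show.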
Using a sliding bar, or by manually entering the number of bins, the software will display the granular view alongside a trend or smoothing line. The latter is the algorithm's first attempt to visualise the underlying predictive data pattern. From here, a user may set the bins manually, or use the ‘automated’ binning algorithm. The latter functions in exactly the same way as described above, but rather than starting from the standard 20 bins, it starts from the number of bins chosen by the user.
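The granular view plus trend line can be approximated as the per-bin event rate with a simple smoother on top. The centred rolling mean below is an assumption for illustration only; the software's own smoothing method is proprietary:

```python
import numpy as np
import pandas as pd

def bin_trend(values, target, n_bins=50, window=5):
    # Event rate per granular bin, plus a centred rolling mean as a simple
    # stand-in for the trend line shown on screen.
    binned = pd.qcut(pd.Series(values), q=n_bins, duplicates="drop")
    rate = pd.Series(target).groupby(binned, observed=True).mean()
    smooth = rate.rolling(window, center=True, min_periods=1).mean()
    return rate, smooth

# Synthetic data where the event rate rises with x
rng = np.random.default_rng(2)
x = rng.normal(size=2_000)
y = (rng.random(2_000) < 1 / (1 + np.exp(-x))).astype(int)
rate, smooth = bin_trend(x, y)
```

Plotting `rate` as bars and `smooth` as a line gives a rough analogue of the granular view with its smoothing line.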
As a user, you therefore have full control at variable level, and can decide how a variable will be prepared before it enters an AI or ML model. Expert domain knowledge can then effectively be applied at the variable level, which is the very best way to ensure that the resulting model, and the output it produces, will be logical and meaningful to the user.
Often in practice, a combination of automated binning and manual adjustments to the binning is used. This is particularly useful for variables with a non-monotonic relationship between the variable under observation and the dependent variable (for example, U-shaped data patterns).
An additional consideration is that the WoE transformation effectively replaces the actual values of a variable with new values. Due to the binning, there will be fewer distinct WoE values than original values inside a variable, but the number of variables does not change. This removes certain statistical limitations for the ML or AI algorithms, as each variable after the WoE transformation is effectively a series of numbers. For example, this alleviates some of the issues of running predictive models on data sets with a limited number of observations but with quite a few variables present as potential predictors.
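The effect described above can be sketched end to end: bin the variable, compute the WoE per bin, and map every raw value to its bin's WoE. The result stays a single numeric column, but with only as many distinct values as there are bins (a simplified sketch using equal-frequency bins, not the software's own binning):

```python
import numpy as np
import pandas as pd

def woe_transform(values, target, n_bins=10):
    # Replace each raw value with the WoE of its bin: the variable remains
    # one numeric series, but with far fewer distinct values.
    binned = pd.qcut(pd.Series(values), q=n_bins, duplicates="drop")
    y = pd.Series(target)
    bads = y.groupby(binned, observed=True).sum()
    goods = y.groupby(binned, observed=True).count() - bads
    woe = np.log((goods / goods.sum()) / (bads / bads.sum()))
    return binned.map(woe).astype(float)

rng = np.random.default_rng(3)
raw = rng.normal(size=1_000)        # 1,000 distinct raw values
flag = rng.integers(0, 2, size=1_000)
transformed = woe_transform(raw, flag)
```

After the transformation the 1,000 distinct raw values collapse to at most 10 WoE values, one per bin, while the length of the series is unchanged.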