LumenAI is developing an innovative, interactive solution for real-time unsupervised learning that allows data scientists to understand the structure of their data with minimal prior expertise.
Users will be able to upload their datasets in .csv format, and LumenAI's clustering algorithms will try to find the optimal number of clusters and their positions. At the end of the task, a summary of each cluster is returned together with an interactive visualization that data scientists can use to fully understand their dataset.
For UX reasons, we need to reduce the number of actions required from the user before launching the clustering task. To do so, automatic preprocessing of the uploaded dataset starts as soon as the upload finishes. Operations like handling missing data and transforming non-numeric columns are performed in this first step of the clustering pipeline.
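As an illustration, such an automatic preprocessing step could look like the following sketch (pandas; the median imputation and one-hot encoding choices are our assumptions, not necessarily LumenAI's actual pipeline):

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Automatic preprocessing: impute missing values, encode non-numeric columns."""
    df = df.copy()
    numeric = df.select_dtypes(include="number").columns
    # Replace missing numeric values with the column median
    df[numeric] = df[numeric].fillna(df[numeric].median())
    # One-hot encode the remaining (non-numeric) columns, keeping a NaN indicator
    return pd.get_dummies(df, dummy_na=True, dtype=float)
```

The result is a fully numeric, NaN-free table ready for the clustering step.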
The preprocessing step remains incomplete, however, as it does not handle meaningless numeric columns such as the well-known nominal (digital) identifiers. These columns should be removed from the dataset, as they can seriously degrade the relevance of the clustering results!
So we need a model that can detect nominal identifiers, so that we can eliminate them after user confirmation.
Caution: even though confirming the exclusion of identifiers from the clustering task is still a required action, the automatic detection of IDs saves the user time, as no manual checking/unchecking of columns is needed.
What's an ID?
An identifier is a name or a number that identifies a person, a category or an object. Datasets often have at least one column that identifies the object described by each row of attributes.
An identifier should be unique for each object. So why is detecting identifiers a problem, since simply counting the distinct values taken by a variable should tell us whether it is an identifier?
Good question! Unfortunately, the problem is harder in our case, since this rule is not always respected: in many datasets, several rows can concern the same object.
For example, with data describing visitor behavior on an e-commerce website, several actions can be assigned to the same visitor, which is reflected by an identifier value repeated several times throughout the dataset.
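A toy example (column and value names are ours) shows why the naive uniqueness check fails on such data:

```python
import pandas as pd

# Hypothetical clickstream extract: one row per action, so visitor_id repeats
events = pd.DataFrame({
    "visitor_id": [101, 101, 102, 103, 103, 103],
    "action": ["view", "cart", "view", "view", "cart", "buy"],
})

# Naive rule: a column is an ID only if all its values are distinct
ratio = events["visitor_id"].nunique() / len(events)
print(ratio)  # 0.5 -- the naive uniqueness test would miss this identifier
```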
A machine-learning-based model can be the solution if we frame the problem as a classification task (1: identifier, 0: non-identifier attribute).
To find features to use, we tried to identify the specificities of an identifier within a dataset; in other words, how an identifier may differ from any other attribute of the dataset.
To answer this question, we made some assumptions about the specific characteristics of identifiers:
The attributes proposed above are based on assumptions that must be verified using real data. But how can we find a dataset that contains information about other datasets?
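For illustration, per-column features of the kind discussed below (unique-value ratio, entropy, position, minimum value) could be computed like this (a sketch; the function and key names are ours):

```python
import numpy as np
import pandas as pd

def column_features(df: pd.DataFrame, col: str) -> dict:
    """Candidate features describing one column of a dataset."""
    s = df[col]
    p = s.value_counts(normalize=True)
    return {
        "unique_ratio": s.nunique() / len(s),                   # IDs take many distinct values
        "entropy": float(-(p * np.log2(p)).sum()),              # IDs are close to uniformly distributed
        "position": df.columns.get_loc(col) / len(df.columns),  # IDs often come first
        # IDs often start at 0 or 1
        "minimum": float(s.min()) if pd.api.types.is_numeric_dtype(s) else float("nan"),
    }
```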
After a long and unsuccessful search in open-access databases, we decided to create our own columns dataset through the following steps:
Caution: to avoid any confusion, it is important to clarify that our objective here is to build a dataset where each row corresponds to a column of another dataset.
The dataset we obtained contains 2,000 rows and 14 columns, thanks to a data augmentation trick: each collected dataset was divided into 10 sub-datasets.
The first problem we faced was label imbalance, which was expected, as datasets often contain few identifiers among their columns.
This imbalance can be reduced by downsampling the non-id rows.
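A minimal downsampling sketch (assuming a pandas DataFrame `data` whose binary label column is named `is_id`; keeping 3 non-id rows per id row is an arbitrary choice for illustration):

```python
import pandas as pd

def downsample_non_ids(data: pd.DataFrame, ratio: int = 3, seed: int = 0) -> pd.DataFrame:
    """Keep every identifier row and at most `ratio` times as many non-id rows."""
    ids = data[data["is_id"] == 1]
    non_ids = data[data["is_id"] == 0]
    keep = non_ids.sample(n=min(len(non_ids), ratio * len(ids)), random_state=seed)
    # Shuffle so that the classes are interleaved again
    return pd.concat([ids, keep]).sample(frac=1, random_state=seed)
```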
After reducing the number of non-id rows, the label distribution is much more balanced.
To validate (or invalidate) the assumptions made above, let's look at the correlations between the variables and the label:
The correlations validate many of our assumptions, particularly about the importance of the number of unique values, the entropy, the position, and the minimum value of identifiers.
From the list of variables, we will choose the five most correlated with our target (is_id).
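This selection can be sketched as follows (assuming the columns dataset is a DataFrame `features` with a binary `is_id` column; names are ours):

```python
import pandas as pd

def top_correlated(features: pd.DataFrame, target: str = "is_id", k: int = 5) -> list:
    """Return the k features most correlated (in absolute value) with the target."""
    corr = features.corr(numeric_only=True)[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()
```

Using the absolute value matters: a feature like the minimum value is expected to correlate negatively with is_id but is still informative.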
Algorithm choice and results
Regression or classification?
Even though this is a classification problem, we prefer to treat it as a regression problem, since our objective is to associate with each column of the dataset a score that reflects the probability that it is an identifier. If the model assigns a high score to a column, we will ask the user to confirm this "intuition".
The probability threshold above which we consider that the model has detected an identifier is a parameter to be set according to the desired sensitivity.
Missing an identifier means that a meaningless attribute may be used in the clustering task, which would seriously degrade the algorithm's output. To avoid this risk, false negatives have to be minimized.
Minimizing false negatives means maximizing recall (the proportion of true identifiers detected out of all true identifiers in the dataset), but it is important to watch the model's precision too (the proportion of true identifiers among the model's positive predictions) so that we avoid sending the user too many confirmation requests.
Test and validation
Given the small size of our dataset (2k rows), the traditional TRAIN: 0.8 / TEST: 0.2 split may lead to an overfitted model and an unreliable performance estimate. Cross-validation should be used instead.
Cross-validation consists in dividing the learning set into k sub-sets (folds); at each iteration, the model is trained on k-1 folds and validated on the remaining fold. The final score of the model is the average of the scores over all iterations.
In order to use the cross-validation utilities of the scikit-learn package, we need to override the prediction method of the regression algorithms we are using so that they return binary results, since the validation phase expects binary labels like those in the test set.
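A sketch of such a wrapper (the class name and the 0.5 default threshold are our choices; any scikit-learn regressor can be plugged in):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

class ThresholdedRegressor(BaseEstimator, ClassifierMixin):
    """Wraps a regressor so that predict() returns binary labels,
    as scikit-learn's cross-validation scorers expect."""

    def __init__(self, regressor=None, threshold=0.5):
        self.regressor = regressor
        self.threshold = threshold

    def fit(self, X, y):
        base = self.regressor if self.regressor is not None else LinearRegression()
        self.regressor_ = clone(base).fit(X, y)
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        # Binarize the regression score using the sensitivity threshold
        return (self.regressor_.predict(X) >= self.threshold).astype(int)

# Recall is the score we want to maximize (few false negatives), e.g.:
# scores = cross_val_score(ThresholdedRegressor(), X, y, cv=5, scoring="recall")
```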
The table below presents the results obtained for the four regression algorithms used so far: