
K-bMOM is an algorithm that was first introduced by Adrien Saumard, Camille Brunet-Saumard and Edouard Genetay in February 2020 in this article: https://arxiv.org/pdf/2002.03899.pdf
K-bMOM is a clustering algorithm inspired by both K-means and Median-Of-Means (MOM), and it benefits from the advantages of each. On the one hand, K-means is an unsupervised clustering algorithm. On the other hand, it has been proved recently (2015) that MOM is a much better mean estimator than the empirical mean because it is robust to outliers (it also has a better convergence rate). As a result, K-bMOM is a robust, unsupervised clustering algorithm. We describe K-bMOM in the next section; the following paragraphs give some context first.
In principle, K-means is a clustering algorithm that clusters vector data in $\mathbb{R}^p$. More precisely, K-means searches for K points $\left( c_1,c_2,\dots,c_K\right)$ (called centroids) that minimize the quantity $\text{Crit} := \frac{1}{n} \sum_{i=1}^{n} \underset{1\leq j \leq K}{\min} \| x_i - c_j\|^2$.
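To make the criterion concrete, here is a short sketch (our own illustration, not taken from the notebook's K-bMOM code) that evaluates $\text{Crit}$ for a dataset X and a given array of centroids:
import numpy as np
from scipy.spatial.distance import cdist
def kmeans_criterion(X, centers):
    # squared Euclidean distance from every datapoint to every centroid, shape (n, K)
    sq_dists = cdist(X, centers, metric='sqeuclidean')
    # keep, for each datapoint, the distance to its closest centroid, then average over the n datapoints
    return sq_dists.min(axis=1).mean()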
Since the essence of clustering is mostly the search for the "most relevant" summary of the data, K-means often fails to find the most relevant centroids. That is largely due to the criterion $\text{Crit}$: optimizing it leads to centroids that are the most relevant geometrically speaking, and even if this geometric optimum sometimes coincides with what one calls the "most relevant summary of the data", it often does not.
A solution to this issue could come from using a better "mean operator" than the "empirical mean" in the criterion $\text{Crit}$. The "empirical mean" simply sums a bunch of data and divides the result by the number of summed elements (an example is given in the next paragraph; formally, it is an operation of the form $\frac{1}{n} \sum_{i=1}^{n}\dots$). Consequently, the recent theoretical discoveries about estimating the mean value of a probability distribution, such as the work on MOM, made K-bMOM possible. Let us recall a few of these properties.
The "empirical mean" of $n$ observations $\left(x_1,x_2,\dots,x_n \right)$ of a random phenomenon is $\frac{1}{n} \sum_{i=1}^{n}x_i$. It is the most basic statistic that everyone knows. For example: take the height (in cm) of each of your close colleagues, $\left\{ 168,166,191,160,179 \right\}$, sum them and divide by the number of observations, $5$, and you get an average height of $172.8$ cm (the empirical mean of the observed heights is 172.8 cm). The empirical mean has been studied by mathematicians since the 18th century.
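As a quick check of the numbers above (a throwaway snippet of our own, not part of the K-bMOM code):
import numpy as np
heights_cm = [168, 166, 191, 160, 179]
# empirical mean: sum the observations and divide by their number
print(sum(heights_cm) / len(heights_cm))  # 172.8
print(np.mean(heights_cm))                # same value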
Concerning its convergence: for an i.i.d. sample with mean $\mu$ and variance $\sigma^2$, Chebyshev's inequality gives $\mathbb{P}\left( \left| \frac{1}{n}\sum_{i=1}^{n}x_i - \mu \right| \geq \epsilon \right) \leq \frac{\sigma^2}{n\epsilon^2}$. This convergence rate is not satisfying compared to the one achieved by MOM, because $\frac{\sigma^2}{\epsilon^2}$ vanishes only polynomially fast as $\epsilon$ grows, while the deviation bound of MOM vanishes exponentially fast as $\epsilon$ grows. This property was proved by Lugosi et al. in 2015 (https://projecteuclid.org/download/pdfview_1/euclid.aos/1479891632). Last but not least, MOM is robust to the presence of observations that are very far from the mean value: the MOM estimator still gives an output close to the mean value of the phenomenon of interest even when some observations do not follow the average behaviour.
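To fix ideas, here is a minimal sketch of a median-of-means estimator (our own illustration, not the authors' implementation): shuffle the sample, split it into blocks, compute the empirical mean of each block and return the median of these block means.
import numpy as np
def median_of_means(x, n_blocks=10, seed=None):
    # shuffle the sample and split it into n_blocks (almost) equal blocks
    rng = np.random.default_rng(seed)
    x = rng.permutation(np.asarray(x, dtype=float))
    blocks = np.array_split(x, n_blocks)
    # empirical mean of each block, then median of the block means
    return np.median([block.mean() for block in blocks])
# the estimate barely moves when a huge outlier corrupts the sample
sample = [168, 166, 191, 160, 179] * 20
print(median_of_means(sample, n_blocks=5))             # close to 172.8
print(median_of_means(sample + [10_000], n_blocks=5))  # still close to 172.8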
Inspired by these two procedures, K-means and MOM, we propose K-bMOM.
The principle of K-bMOM is to create $B$ blocks from the dataset. The initialization already works this way: it retains the centers of the block whose empirical risk is the median over all blocks.
Input: the dataset $\{x_1, \dots , x_n\}$, $B$ the number of blocks and $S$ the size of the blocks, with $S > K$
- Iterate from 1 to $B$ blocks:
  (a) Select at random, uniformly and with replacement, $S$ datapoints
  (b) Run a k-means++ initialisation on the block
  (c) Compute the empirical risk of the block
- Select the centers of the block having the median empirical risk
Output: Centers
Let's generate a set of 600 data points:
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
#To avoid the plt.show bug
%matplotlib inline
X, y = make_blobs(n_samples=600, centers=4, n_features=2, random_state=1)
plt.scatter(X[:,0],X[:,1],1)
plt.show()
The first step in initialization is to create blocks with the sampling_init function.
It takes 3 hyperparameters: $B$ the number of blocks, $S$ their size and $K$ the number of clusters to find.
In the following example, $B$=15 and $S$=20. We will explain later how to tune these values.
(Note that the functions presented below have been simplified and adapted to the needs of the notebook)
import numpy as np
import random
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
def sampling_init(X, B, S, K, random_state, n_local_trials=1):
    '''
    # Initialisation function: create B blocks, initialise each of them with k-means++ and compute its partition
    # X : dataset
    # B : number of blocks
    # S : number of data points per block
    # K : number of clusters
    '''
    # variable that is only used in this tutorial to explain how the algorithm works
    all_blocks_before_clustering = []
    # body of code
    init_blocks = [0]*B
    # instantiation of kmeans++
    x_squared = X**2
    x_squared_norms = x_squared.sum(axis=1)
    # Blocks creation
    for i in range(B):
        idx = random.choices(np.arange(X.shape[0]), k=int(S))
        all_blocks_before_clustering.append(idx)
        kmeans_obj = KMeans(n_clusters=K, random_state=random_state, n_init=n_local_trials, init='k-means++')
        kmeans_obj.fit(X=X[idx,:])
        centers = kmeans_obj.cluster_centers_
        km_dist = cdist(X[idx,:], centers)
        km_labels_ = km_dist.argmin(axis=1)
        init_blocks[i] = [[j for j, x in zip(idx, km_labels_) if x == k] for k in range(K)]
    return init_blocks, all_blocks_before_clustering
K = 4
B = 15
S = 20
nb_columns = 5
nb_rows = 3
seed = 11061109
random.seed(seed)
So, let’s execute sampling_init
all_blocks_after_init_step, all_blocks_before_clustering = sampling_init(X,B=B,S=S,K=K,random_state=seed)
For example, here’s the 1st block
all_blocks_after_init_step[0]
Here’s the 2nd block
all_blocks_after_init_step[1]
Here’s the 8th block
all_blocks_after_init_step[7]
Now, let's display the plot of a block:
from matplotlib.cm import get_cmap
one_block_ = all_blocks_before_clustering[0]
plt.plot(X[one_block_][:,0],X[one_block_][:,1],'.', markersize=10)
plt.title("The plot of a block")
plt.show()
one_block_ = all_blocks_after_init_step[0]
cmap = get_cmap('tab20')
for cluster in range(K):
    colors = cmap(cluster)
    plt.plot(X[one_block_[cluster]][:,0], X[one_block_[cluster]][:,1], '.', c=colors, markersize=10)
plt.title('The same block after k-means++ initialization')
plt.show()
That was for a single block, but keep in mind that there are 15 of them:
marge = 1
prop = 2
fig, axs = plt.subplots(nb_rows, nb_columns, figsize=(prop*5, prop*3))
xlim_ = (np.min(X[:,0])-marge, np.max(X[:,0])+marge)
ylim_ = (np.min(X[:,1])-marge, np.max(X[:,1])+marge)
for row in range(nb_rows):
    for column in range(nb_columns):
        one_block_ = all_blocks_before_clustering[column+row*nb_columns]
        axs[row,column].plot(X[one_block_][:,0], X[one_block_][:,1], '.')
        axs[row,column].set_xlim(xlim_)
        axs[row,column].set_ylim(ylim_)
        axs[row,column].set_title(str(column+row*nb_columns+1)+': ')
plt.subplots_adjust(wspace=0.5, hspace=0.6)
plt.suptitle('All blocks of data randomly uniformly picked among the dataset with replacement', fontsize=15)
plt.show()
cmap = get_cmap('tab20')
fig, axs = plt.subplots(nb_rows, nb_columns, figsize=(prop*5, prop*3))
for row in range(nb_rows):
    for column in range(nb_columns):
        one_block_ = all_blocks_after_init_step[column+row*nb_columns]
        for cluster in range(K):
            colors = cmap(cluster)
            axs[row,column].plot(X[one_block_[cluster]][:,0], X[one_block_[cluster]][:,1], '.', c=colors)
        axs[row,column].set_xlim(xlim_)
        axs[row,column].set_ylim(ylim_)
        axs[row,column].set_title(str(column+row*nb_columns+1)+': ')
plt.suptitle('All blocks after k-means++ initialization', fontsize=20)
plt.subplots_adjust(wspace=0.5, hspace=0.6)
plt.show()
Then we will compute the risk of each block.
First, the function within_var computes the within-cluster inertia of each cluster of a block.
def within_var(one_block, X):
    '''
    # Function which returns the list of within-cluster inertias of the given block
    # one_block : list of the ids of the block's datapoints, grouped by cluster
    '''
    n_b = 0
    var_b = [0]*K
    for key, val in enumerate(one_block):
        X_block = X[val,:]
        var_b[key] = len(val)*np.sum(np.var(X_block, axis=0))/S
    return var_b
within_var(one_block_,X)
Then Qrisk computes the risk of each block and returns the block with the median risk.
quantile = 0.5 # this corresponds to taking the median
def Qrisk(all_blocks, X):
    '''
    # Function which computes the sum of the within-cluster variances of every block and returns:
    - the list of within-cluster variances of the q-quantile block
    - the q-quantile block itself
    - the index of that block
    ```parameters```
    . all_blocks : output of sampling_init (or sampling_all_blocks)
    . X : matrix of datapoints
    '''
    dict_within = [0]*B
    Qb_risks = [0]*B
    for key, one_block_ in enumerate(all_blocks):
        dict_within[key] = within_var(one_block_, X)
        Qb_risks[key] = sum(dict_within[key])
    risks_argsort = np.argsort(np.array(Qb_risks), kind="mergesort")
    quantile_index = risks_argsort[round(quantile*B)-1]
    quantile_value = Qb_risks[quantile_index]
    return dict_within[quantile_index], all_blocks[quantile_index], quantile_index
Let's take a look at all the risks:
cmap = get_cmap('tab20')
fig, axs = plt.subplots(nb_rows, nb_columns, figsize=(prop*5, prop*3))
xlim_ = (np.min(X[:,0])-3, np.max(X[:,0])+3)
for row in range(nb_rows):
    for column in range(nb_columns):
        one_block_ = all_blocks_after_init_step[column+row*nb_columns]
        for cluster in range(K):
            colors = cmap(cluster)
            axs[row,column].plot(X[one_block_[cluster]][:,0], X[one_block_[cluster]][:,1], '.', c=colors)
        axs[row,column].set_title('{}: risk={}'.format(column+row*nb_columns+1, round(sum(within_var(one_block_, X)), 2)))
        axs[row,column].set_xlim(xlim_)
        axs[row,column].set_ylim(ylim_)
plt.suptitle('All risks', fontsize=20)
plt.subplots_adjust(wspace=0.5, hspace=0.6)
plt.show()
dict_within, Q_block, id_of_the_Q_block = Qrisk(all_blocks_after_init_step,X)
phrase = 'The {}th block was the one to achieve the median value of the risk among all blocks'
print(phrase.format(id_of_the_Q_block+1))
print('median risk = {}'.format(round(sum(dict_within),2)))
We can also see that the 8th block is an occurrence of an unlikely event: no datapoint from the top-right cluster has been picked, contrary to the other clusters (as you can see on the plot below).
block_ = all_blocks_after_init_step[7]
fig = plt.figure(figsize=(prop*7, prop*2))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2, sharex=ax1)
ax1.scatter(X[:,0], X[:,1], 3, c="Blue")
ax1.set_title('Complete dataset')
for cluster in block_:
    ax2.scatter(X[cluster][:,0], X[cluster][:,1], 3, c="Blue")
ax2.set_title('8th block')
plt.ylim(ylim_)
plt.xlim(xlim_)
plt.show()
Fortunately, such particular blocks will not be retained, because their risk will not be the median one.
Let's plot the median block:
cmap = get_cmap('tab20')
plt.subplots(figsize=(8,5))
for cluster in range(K):
    colors = cmap(cluster)
    plt.plot(X[Q_block[cluster]][:,0], X[Q_block[cluster]][:,1], '.', c=colors, markersize=10)
plt.xlim(xlim_)
plt.ylim(ylim_)
plt.title('The block with the median risk')
plt.show()
At this point, initialization is almost complete. All we have to do now is determine the centers and the partition.
The update_loop function (which we will talk about again later) takes care of this finalization.
def Qb_centers(Q_block, X):
    '''
    # compute the mean vector of each cluster of the q-quantile block and return them
    '''
    centers_ = []
    for k in range(K):
        center_k = X[Q_block[k],:].mean(axis=0)
        centers_.append(center_k)
    return np.array(centers_)
def update_loop(X, Qb):
    '''
    # Qb : q-quantile block
    '''
    # update the centers
    centers = Qb_centers(Qb, X)
    # retrieve the partition of the whole dataset
    D_nk = cdist(XA=X, XB=centers)
    partition_array = D_nk.argmin(axis=1)
    partition_list = []
    for k in range(K):
        partition_list.append(np.where(partition_array == k)[0].tolist())
    # distance of each datapoint to its assigned centroid (used to compute the empirical risk in the full class)
    D_nk_min_rows = D_nk[np.arange(X.shape[0]), partition_array]
    return centers, partition_array, partition_list
centers, partition,partition_list = update_loop(X,Q_block)
That's it: the initialization is complete. Let's display the result:
plt.figure(figsize=(15,10))
plt.scatter(X[:,0],X[:,1],10,c=tuple(partition))
plt.scatter(centers[:,0],centers[:,1],100,c="red")
plt.title("Initialization completed : initial clustering", fontsize=30)
plt.autoscale()
plt.show()
Below is the initialization code present in the K-bMOM class:
def initialisation_without_init_centers(self, X):
    # Initialisation step: initialisation per block, sampling B blocks and init via kmeans++
    init_blocks = self.sampling_init(X)
    # compute the empirical risk of every block and select the Q-quantile block
    Q_within, Q_b = self.Qrisk(init_blocks, X)
    # update all the global variables
    self.update_loop(X, Q_b, Q_within)
    # save results
    self.id_Qblock_init = Q_b
    #self.res_Qb_risk.append(Q_risk)
Earlier in the notebook, we told you that we were going to explain how to find the $B$ and $S$ values.
They are calculated using mathematical formulas.
from math import log, ceil, floor, inf
def block_size_max(n_sample, n_outliers):
    '''
    Function which returns the maximum size of a block before the breakdown point
    ```prms```
    n_sample: number of datapoints
    n_outliers: number of outliers
    '''
    if n_outliers == 0:
        return inf
    else:
        bloc_size_max = log(2.)/log(1/(1 - (n_outliers/n_sample)))
        return floor(bloc_size_max)
def minimal_number_of_blocks(n_sample, n_outliers, b_size=None, alpha=0.05):
    '''
    Function which returns the minimum number of blocks, for a given block size, before the breakdown point
    ```prms```
    n_sample: number of datapoints
    n_outliers: number of outliers
    b_size: block size
    alpha: confidence threshold
    '''
    if n_outliers/n_sample >= 0.5:
        print('too much noise')
        return None
    else:
        if n_outliers == 0:
            return 1
        elif b_size is None:
            t = block_size_max(n_sample, n_outliers)
        else:
            t = b_size
        bloc_nb_min = log(1/alpha) / (2*((1 - n_outliers/n_sample)**t - 1/2)**2)
        return ceil(bloc_nb_min)
First you have to choose the size of the blocks, then the number of blocks. Notice that the bigger a block is, the larger the number of blocks must be; so, in order to keep the computational cost reasonable, it is better to choose a block size smaller than the maximum allowed.
Look at the curve below to see how $B$ behaves as $S$ grows.
S = block_size_max(n_sample=600, n_outliers=5)
list_block_size = []
list_nb_blocks = []
abscisse = np.round(np.linspace(K, S-2, 100))
ticks = np.round(np.linspace(K, S-2, 10))
if S < inf:
    for k in abscisse:
        list_block_size = list_block_size + [k]
        list_nb_blocks = list_nb_blocks + [
            minimal_number_of_blocks(n_sample=600, n_outliers=5, b_size=k, alpha=0.05)
        ]
    plt.plot(list_block_size, list_nb_blocks)
    plt.xticks(ticks)
    plt.xlabel('S : Block sizes')
    plt.ylabel('B : Number of Blocks')
    plt.show()
block_size_max(n_sample=600,n_outliers=5)
(Note that the parameter n_outliers is actually an estimated upper bound on the number of outliers in your dataset.)
The maximum size of a block is therefore 82, but, as explained above, we will take a lower value.
In this case you could take $S$ equal to 60 (for our example we used $S$=20 for pedagogical reasons).
minimal_number_of_blocks(n_sample=600,n_outliers=5, b_size=20)
Now you can run minimal_number_of_blocks: 13 is the minimum number of blocks you can use, and that is why we chose $B$=15 in our previous example.
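As a sanity check, plugging these numbers into the formula implemented by minimal_number_of_blocks gives $\left\lceil \frac{\log(1/\alpha)}{2\left((1-o/n)^{S}-\frac{1}{2}\right)^2} \right\rceil = \left\lceil \frac{\log(20)}{2\left((1-5/600)^{20}-\frac{1}{2}\right)^2} \right\rceil \approx \lceil 12.5 \rceil = 13$, which matches the value returned above.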
The main iterations are actually very close to the initialization process.
Input: the dataset $\{x_1, \dots , x_n\}$, $B$ the number of blocks and $S$ the size of the blocks ($S > K$)
Main loop: WHILE the stopping criterion is not reached and $iter < iter_{max}$:
- Create $B$ blocks of size $S$ by drawing datapoints randomly and uniformly with replacement
- In each block, assign each datapoint to its closest centroid
- Keep the block with the median empirical risk
- Compute the new centers
Output: Partition
The only major difference is the construction of the blocks.
They are built using the sampling_all_blocks function which is very close to sampling_init.
However, thanks to the nb_min_repr_by_cluster parameter, each cluster in each block will contain at least nb_min_repr_by_cluster datapoints. This value is set to 1 by default.
def sampling_all_blocks(self, X, nb_min_repr_by_cluster=1):
    '''
    # Function which creates B blocks conditionally on the current partition and returns,
    # per block, the list of the ids of the selected datapoints grouped by cluster
    #
    # nb_min_repr_by_cluster : minimal number of datapoints drawn from each current cluster
    '''
    dict_blocks = [0]*B
    for i in range(B):
        # if nb_min_repr_by_cluster is set, the first branch guarantees that each of the K clusters
        # is represented in the block; otherwise the block is drawn purely at random ('else' branch)
        list_of_points = []
        if nb_min_repr_by_cluster is not None:
            for k in range(K):
                list_of_points = list_of_points + random.choices(np.array(self.partition_dict[k]), k=nb_min_repr_by_cluster)
            idx_indep = random.choices(np.arange(X.shape[0]), k=S-nb_min_repr_by_cluster*K)
        else:
            idx_indep = random.choices(np.arange(X.shape[0]), k=S)
        idx = list_of_points + idx_indep
        km_dist = cdist(X[idx,:], centers)
        km_labels_ = km_dist.argmin(axis=1)
        dict_blocks[i] = [[j for j, x in zip(idx, km_labels_) if x == k] for k in range(self.K)]
    return dict_blocks
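To see how these pieces fit together, here is a minimal sketch of the main loop (our own simplification for this notebook, not the KbMOM class itself). It reuses the simplified Qrisk and update_loop functions defined above, runs a fixed number of iterations, and skips the stopping criterion, the nb_min_repr_by_cluster safeguard and the risk bookkeeping; it therefore assumes that every cluster receives at least one datapoint in every block.
def kbmom_main_loop(X, centers, max_iter=10):
    # hypothetical helper: one K-bMOM iteration = sample B blocks, keep the median-risk one, update the centers
    partition_list = None
    for _ in range(max_iter):
        all_blocks = [0]*B
        for i in range(B):
            # draw S datapoints uniformly with replacement
            idx = random.choices(np.arange(X.shape[0]), k=S)
            # assign each drawn datapoint to its closest current centroid
            labels_block = cdist(X[idx,:], centers).argmin(axis=1)
            all_blocks[i] = [[j for j, lab in zip(idx, labels_block) if lab == k] for k in range(K)]
        # keep the block whose empirical risk is the median, then update the centers and the partition
        _, Q_block, _ = Qrisk(all_blocks, X)
        centers, _, partition_list = update_loop(X, Q_block)
    return centers, partition_list
# for instance, starting from the centers obtained at the end of the initialization:
# final_centers, final_partition = kbmom_main_loop(X, centers, max_iter=10)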
import numpy as np
from Modules.KbMOM_v8_4_before_5 import KbMOM # KbMOM stands for K-bootstrap Median-Of-Means
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
#To avoid the plt.show bug
%matplotlib inline
X, y = make_blobs(n_samples=600, centers=4, n_features=2, random_state=1)
plt.scatter(X[:,0],X[:,1],1)
plt.show()
instance_KbMOM = KbMOM(K=4,X=X,coef_ech=20,nbr_blocks=15,nb_min_repr_by_cluster=1)
results_KbMOM = instance_KbMOM.fit(X=X)
results_KbMOM
output_keys = results_KbMOM.keys()
output_keys
centroids contains the $K$ vectors that represent the clusters
centroids = results_KbMOM['centroids']
centroids
plt.scatter(X[:,0],X[:,1],1)
plt.scatter(centroids[:,0],centroids[:,1],color=plt.get_cmap('tab10')(1))
plt.title('Plot of the final centroids (position highly variable)')
plt.show()
labels contains the $n$ integers indicating which cluster each datapoint belongs to
labels = results_KbMOM['labels']
labels
plt.scatter(X[:,0],X[:,1],color=plt.get_cmap('tab10')(labels))
plt.title('Plot of the final partition')
plt.show()
id_Qblock_init is the composition of the initial block that achieved the median value of the empirical risk among all blocks.
Qblock stands for "quantile block" because taking the median is not compulsory; it is possible to take another quantile.
results_KbMOM['id_Qblock_init']
plt.scatter(X[:,0],X[:,1],s=3,alpha=0.5)
for num,cluster_sample in enumerate(results_KbMOM['id_Qblock_init']):
plt.scatter(X[cluster_sample,0],X[cluster_sample,1],color=plt.get_cmap('tab10')(num+1))
plt.title('Composition of the initial selected block \n One color for each initial cluster')
plt.show()
convergence is the list of all Aitken criterion values. The first two values are None because the Aitken criterion is based on the two previous iterations.
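The exact formula lives inside the KbMOM class and is not reproduced in this notebook. As an indication only (our own note, to be checked against the implementation), Aitken-type criteria are usually of the form $c_t = \frac{r_t - r_{t-1}}{r_{t-1} - r_{t-2}}$ applied to a sequence of risks $r_t$, which indeed requires the two previous values and explains why the first two entries are None.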
results_KbMOM['convergence']
scores are quantities that mimic a data depth. The lower the score of a datapoint, the further it is from all the centroids across the iterations, and the more it looks like an outlier within its cluster.
scores = results_KbMOM['scores']
np.round(scores,2)
# as shown on the plot, the biggest datapoints are the ones whose scores are lowest
# and this seems to coincide with a data depth
# warning /!\ the sizes on the plot are not proportional to the scores
plt.scatter(X[:,0],X[:,1],color=plt.get_cmap('tab10')(labels),s=1/scores**3)
plt.title('The size of the points are related to their \n score (data-depth like value)')
plt.show()
Qb_risk is the list of all values taken by the median risk throughout the iterations.
As one can see, the risk of the median block is not monotonic. The algorithm nevertheless shows good empirical performance. (Qb stands for "quantile block" because one can choose another quantile than the median.)
results_KbMOM['Qb_risk']
Qb_centers is simply the list of all the sets of centroids encountered at each step of the algorithm.
Qb stands for "quantile block" because taking the median is not compulsory; it is possible to take another quantile.
Qb_centers = results_KbMOM['Qb_centers']
Qb_centers
plt.scatter(X[:,0], X[:,1], 1)
for num, centroids in enumerate(Qb_centers):
    plt.scatter(centroids[:,0], centroids[:,1], color=plt.get_cmap('tab10')(1))
plt.title('Trace of all encountered centroids')
plt.show()
ncols = 2
nb_graphs = len(Qb_centers)
nrows = int(nb_graphs/ncols + 1)
size_1_plot = 5
fig, axes = plt.subplots(ncols=ncols, nrows=nrows, figsize=(size_1_plot*ncols, size_1_plot*nrows))
for num, centroids in enumerate(Qb_centers):
    axes[int(num/ncols), num%ncols].scatter(X[:,0], X[:,1], s=3, alpha=0.5)
    axes[int(num/ncols), num%ncols].scatter(centroids[:,0], centroids[:,1], color=plt.get_cmap('tab10')(1), s=50)
    axes[int(num/ncols), num%ncols].set_title(f'Centroids at iteration number {num}')
for num_graph in range(ncols*nrows):
    if nb_graphs <= num_graph:
        axes[int(num_graph/ncols), num_graph%ncols].axis('off')
st = fig.suptitle('All sets of centroids throughout the algorithm', fontsize=20)
st.set_y(0.9)
plt.show()
B is the number of blocks created at each step of the algorithm
results_KbMOM['B']
t is the size of the created blocks
results_KbMOM['t']