
Claire Brécheteau
https://perso.univ-rennes2.fr/claire.brecheteau
We consider $K$, an unknown compact subset of the Euclidean space $\mathbb{R}^d$. We have at our disposal a sample $\mathbb{X} = \{X_1, X_2, \ldots, X_n\}$ of $n$ points generated uniformly on $K$, or generated in a neighborhood of $K$. The sample of points may be corrupted by outliers, that is, by points lying far from $K$.
Given $\mathbb{X}$, we aim at recovering $K$. More precisely, we construct approximations of $K$ as unions of $k$ balls or ellipsoids, for $k$ possibly much smaller than the sample size $n$.
Note that $K$ coincides with $d_K^{-1}(\{0\})$, the $0$-sublevel set of the distance-to-compact function $d_K : x \mapsto \min_{y \in K} \|x - y\|$. Accordingly, the approximations we propose for $K$ are sublevel sets of approximations of the distance function $d_K$, based on $\mathbb{X}$. Since the sample may be corrupted by outliers and the points may not lie exactly on the compact set, approximating $d_K$ by $d_{\mathbb{X}}$, the distance to the sample, may give terrible results. Therefore, we construct approximations of $d_K$ that are robust to noise.
On this page, we present three methods to construct approximations of $d_K$ from a possibly noisy sample $\mathbb{X}$. The first one is the well-known distance-to-measure (DTM) function of [Chazal11]. The last two methods are new. They are based on the following approximations, whose sublevel sets are unions of balls or ellipsoids: the $k$-PDTM [Brecheteau19a] and the $k$-PLM [Brecheteau19b].
The code and some toy examples are available on this page. In particular, the three methods are implemented via the functions DTM, kPDTM and kPLM.
For a sample X of size $n$, these functions compute the distance approximations at the points in query_pts. The parameter q is a regularity parameter in $\{1, 2, \ldots, n\}$, and k is the number of balls or ellipsoids defining the sublevel sets of the distance approximations. The procedures keep sig points of the sample and treat the remaining $n - $ sig points as outliers, cf. the section « Detecting outliers ».
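As a quick overview of the interface, here is a minimal usage sketch (our own illustration, assuming the signatures used later on this page; kPLM is defined in the section on the $k$-PLM and is assumed to follow the same pattern):

import numpy as np

X = np.random.rand(100, 2)  # a sample of n = 100 points in R^2
Q = np.random.rand(5, 2)    # query points at which the approximations are evaluated

vals_DTM = DTM(X, Q, 10)                 # DTM with regularity parameter q = 10
vals_kPDTM = kPDTM(X, Q, 10, 20, 90)[0]  # k-PDTM: k = 20 balls, n - sig = 10 points trimmed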
We consider as a compact set $K$ the infinity symbol:
The target is the distance function $d_K$. The graph of $d_K$ is the following:
We have generated a noisy sample $\mathbb{X}$. Then, $d_{\mathbb{X}}$ is a terrible approximation of $d_K$. Indeed, the graph of $d_{\mathbb{X}}$ is the following:
Nonetheless, there exist robust approximations of the distance-to-compact function, such as the distance-to-measure (DTM) function $d_{\mathbb{X},q}$ (which depends on a regularity parameter $q$) [Chazal11]. The graph of $d_{\mathbb{X},q}$ for some $q$ is the following:
On this page, we define two functions, the $k$-PDTM $d_{\mathbb{X},q,k}$ and the $k$-PLM. The sublevel sets of the $k$-PDTM are unions of $k$ balls. The sublevel sets of the $k$-PLM are unions of $k$ ellipsoids. The graphs of the $k$-PDTM and of the $k$-PLM for some $q$ and $k$ are the following:
The distance-to-measure function (DTM) is a surrogate for the distance-to-compact function that is robust to noise. It was introduced in 2009 [Chazal11]. It depends on a regularity parameter $q \in \{1, 2, \ldots, n\}$. The distance-to-measure function is defined by
$$d_{\mathbb{X},q}(x) = \sqrt{\frac{n}{q} \int_0^{q/n} \delta_{\mathbb{X},l}^2(x) \, \mathrm{d}l},$$
where $\delta_{\mathbb{X},l}(x)$ is the smallest radius $r \geq 0$ such that the closed ball $\overline{B}(x, r)$ contains more than $l n$ points of $\mathbb{X}$.
Equivalently, the DTM coincides with the square root of the mean squared distance between $x$ and its $q$ nearest neighbours $X^{(1)}(x), X^{(2)}(x), \ldots, X^{(q)}(x)$ in $\mathbb{X}$:
$$d_{\mathbb{X},q}(x) = \sqrt{\frac{1}{q} \sum_{i=1}^{q} \left\| x - X^{(i)}(x) \right\|^2}.$$
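For instance, if $\mathbb{X} = \{0, 1, 3\} \subset \mathbb{R}$ and $q = 2$, the two nearest neighbours of $x = 0$ in $\mathbb{X}$ are $0$ and $1$, so $d_{\mathbb{X},2}(0) = \sqrt{(0^2 + 1^2)/2} = 1/\sqrt{2}$.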
The following implementation of the DTM is due to Raphaël Tinarrage; it comes from his page DTM-based filtrations: demo.
import numpy as np
from sklearn.neighbors import KDTree

def DTM(X, query_pts, q):
    '''
    Compute the values of the DTM of the point cloud X.
    Requires sklearn.neighbors.KDTree to search nearest neighbors.
    Input:
        X: an nxd numpy array representing n points in R^d
        query_pts: an sxd numpy array of query points
        q: parameter of the DTM in {1,2,...,n}
    Output:
        DTM_result: an sx1 numpy array containing the DTM of the query points
    Example:
        X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
        Q = np.array([[0, 0], [5, 5]])
        DTM_values = DTM(X, Q, 3)
    '''
    n = X.shape[0]
    if 0 < q <= n:
        kdt = KDTree(X, leaf_size=30, metric='euclidean')
        NN_Dist, NN = kdt.query(query_pts, q, return_distance=True)
        # Root mean squared distance to the q nearest neighbours
        DTM_result = np.sqrt(np.sum(NN_Dist * NN_Dist, axis=1) / q)
    else:
        raise AssertionError("Error: q should be in {1,2,...,n}")
    return DTM_result
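As a sanity check, this KDTree-based implementation can be compared with a direct brute-force computation of the same quantity. The helper DTM_bruteforce below is our own illustration (not part of the original page), suitable only for small point clouds:

def DTM_bruteforce(X, query_pts, q):
    # Pairwise distances between query points and sample points
    dists = np.sqrt(np.sum((query_pts[:, None, :] - X[None, :, :]) ** 2, axis=2))
    # Keep the q smallest distances for each query point
    nearest = np.sort(dists, axis=1)[:, :q]
    # Root mean squared distance to the q nearest neighbours
    return np.sqrt(np.mean(nearest ** 2, axis=1))

X_test = np.array([[-1., -1.], [-2., -1.], [-3., -2.], [1., 1.], [2., 1.], [3., 2.]])
Q_test = np.array([[0., 0.], [5., 5.]])
print(np.allclose(DTM(X_test, Q_test, 3), DTM_bruteforce(X_test, Q_test, 3)))  # True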
The points are generated on the circle according to the following function SampleOnCircle. Again, this whole example is taken from the page DTM-based filtrations: demo of Raphaël Tinarrage.
import matplotlib.pyplot as plt

def SampleOnCircle(N_obs=100, N_out=0, is_plot=False):
    '''
    Sample N_obs points (observations) from the uniform distribution on the unit circle in R^2,
    and N_out points (outliers) from the uniform distribution on the unit square.
    Input:
        N_obs: number of sample points on the circle
        N_out: number of sample points on the square
        is_plot: True or False, draw a plot of the sampled points
    Output:
        data: an (N_obs + N_out)x2 matrix, the sampled points concatenated
    '''
    rand_uniform = np.random.rand(N_obs) * 2 - 1
    X_obs = np.cos(2 * np.pi * rand_uniform)
    Y_obs = np.sin(2 * np.pi * rand_uniform)
    X_out = np.random.rand(N_out) * 2 - 1
    Y_out = np.random.rand(N_out) * 2 - 1
    X = np.concatenate((X_obs, X_out))
    Y = np.concatenate((Y_obs, Y_out))
    data = np.stack((X, Y)).transpose()
    if is_plot:
        fig, ax = plt.subplots()
        plt_obs = ax.scatter(X_obs, Y_obs, c='tab:cyan')
        plt_out = ax.scatter(X_out, Y_out, c='tab:orange')
        ax.axis('equal')
        ax.set_title(str(N_obs) + '-sampling of the unit circle with ' + str(N_out) + ' outliers')
        ax.legend((plt_obs, plt_out), ('data', 'outliers'), loc='lower left')
    return data
' Sampling on the circle with outliers '
N_obs = 150 # number of points sampled on the circle
N_out = 100 # number of outliers
X = SampleOnCircle(N_obs, N_out, is_plot=True) # sample points with outliers
' Compute the DTM on X '
# compute the values of the DTM with parameter q
q = 40
DTM_values = DTM(X,X,q)
# plot of the opposite of the DTM
fig, ax = plt.subplots()
plot=ax.scatter(X[:,0], X[:,1], c=-DTM_values)
fig.colorbar(plot)
ax.axis('equal')
ax.set_title('Values of -DTM on X with parameter q='+str(q));
The $k$-PDTM is an approximation of the DTM whose sublevel sets are unions of $k$ balls. It was introduced and studied in [Brecheteau19a].
According to the previous expression of the DTM, the DTM rewrites as
$$d_{\mathbb{X},q}(x) = \sqrt{\|x - m(x, \mathbb{X}, q)\|^2 + v(x, \mathbb{X}, q)},$$
where $m(x, \mathbb{X}, q) = \frac{1}{q} \sum_{i=1}^{q} X^{(i)}(x)$ is the barycenter of the $q$ nearest neighbours of $x$ in $\mathbb{X}$, and $v(x, \mathbb{X}, q) = \frac{1}{q} \sum_{i=1}^{q} \|X^{(i)}(x) - m(x, \mathbb{X}, q)\|^2$ is their variance. This is nothing but the bias-variance decomposition of the mean squared distance to the $q$ nearest neighbours.
The $k$-PDTM is obtained by replacing the local means and variances at $x$ with those at the best of $k$ centers $c_1^*, c_2^*, \ldots, c_k^*$:
$$d_{\mathbb{X},q,k}(x) = \min_{i \in \{1, \ldots, k\}} \sqrt{\|x - m(c_i^*, \mathbb{X}, q)\|^2 + v(c_i^*, \mathbb{X}, q)}.$$
These centers are chosen such that the criterion
$$R(c_1, c_2, \ldots, c_k) = \sum_{X \in \mathbb{X}} \min_{i \in \{1, \ldots, k\}} \left( \|X - m(c_i, \mathbb{X}, q)\|^2 + v(c_i, \mathbb{X}, q) \right)$$
is minimal. Note that these centers are not necessarily uniquely defined. The following algorithm provides local minimisers of the criterion $R$ (a direct evaluation of $R$ is sketched after the auxiliary function mean_var below).
def mean_var(X, x, q, kdt):
    '''
    An auxiliary function.
    Input:
        X: an nxd numpy array representing n points in R^d
        x: an sxd numpy array representing s points; for each of these points,
           we compute the mean and variance of its q nearest neighbors in X
        q: parameter of the DTM in {1,2,...,n} - number of nearest neighbors to consider
        kdt: a KDTree obtained from X via the expression KDTree(X, leaf_size=30, metric='euclidean')
    Output:
        Mean: an sxd numpy array containing the means of the nearest neighbors
        Var: an sx1 numpy array containing the variances of the nearest neighbors
    Example:
        X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
        x = np.array([[2, 3], [0, 0]])
        kdt = KDTree(X, leaf_size=30, metric='euclidean')
        Mean, Var = mean_var(X, x, 2, kdt)
    '''
    NN = kdt.query(x, q, return_distance=False)
    Mean = np.zeros((x.shape[0], x.shape[1]))
    Var = np.zeros(x.shape[0])
    for i in range(x.shape[0]):
        Mean[i, :] = np.mean(X[NN[i], :], axis=0)
        Var[i] = np.mean(np.sum((X[NN[i], :] - Mean[i, :]) ** 2, axis=1))
    return Mean, Var
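To make the criterion concrete, here is a direct evaluation of $R$ using mean_var. The function kPDTM_criterion is our own sketch (not part of the original code); the algorithm below minimises a trimmed version of this quantity, keeping only the sig smallest terms:

def kPDTM_criterion(X, centers, q):
    # R(c_1, ..., c_k): for each sample point, take the smallest weighted
    # distance ||X_j - m(c_i, X, q)||^2 + v(c_i, X, q) over the centers, then sum.
    kdt = KDTree(X, leaf_size=30, metric='euclidean')
    Mean, Var = mean_var(X, centers, q, kdt)
    weighted_sq_dist = np.sum((X[:, None, :] - Mean[None, :, :]) ** 2, axis=2) + Var[None, :]
    return np.sum(np.min(weighted_sq_dist, axis=1))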
import random  # for the random centers from which the algorithm starts

def optima_for_kPDTM(X, q, k, sig, iter_max=10, nstart=1):
    '''
    Compute locally optimal centers for the k-PDTM criterion R for the point cloud X.
    Requires sklearn.neighbors.KDTree to search nearest neighbors.
    Input:
        X: an nxd numpy array representing n points in R^d
        q: parameter of the DTM in {1,2,...,n}
        k: number of centers
        sig: number of sample points that the algorithm keeps (the other ones are
             considered as outliers -- cf. the section "Detecting outliers")
        iter_max: maximum number of iterations for the optimisation algorithm
        nstart: number of starts for the optimisation algorithm
    Output:
        centers: a kxd numpy array containing the optimal centers c^*_i computed by the algorithm
        means: a kxd numpy array containing the local means m(c^*_i, \mathbb X, q)
        variances: a kx1 numpy array containing the local variances v(c^*_i, \mathbb X, q)
        colors: a size-n numpy array containing the colors of the sample points in X;
                points in the same weighted Voronoi cell (with centers in means and
                weights in variances) have the same color
        cost: the mean, over the sig points X[j,] considered as signal, of their smallest
              weighted distance to a center in centers, that is, min_i ||X[j,]-means[i,]||^2 + variances[i]
    Example:
        X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
        sig = X.shape[0]  # no trimming: all sample points are assigned to a cluster
        centers, means, variances, colors, cost = optima_for_kPDTM(X, 3, 2, sig)
    '''
    n = X.shape[0]
    d = X.shape[1]
    opt_cost = np.inf
    opt_centers = np.zeros([k, d])
    opt_colors = np.zeros(n)
    opt_kept_centers = np.zeros(k, dtype=bool)
    if q <= 0 or q > n:
        raise AssertionError("Error: q should be in {1,2,...,n}")
    elif k <= 0 or k > n:
        raise AssertionError("Error: k should be in {1,2,...,n}")
    else:
        kdt = KDTree(X, leaf_size=30, metric='euclidean')
        for starts in range(nstart):
            # Initialisation
            colors = np.zeros(n)
            min_distance = np.zeros(n)  # weighted distance between a point and its nearest center
            kept_centers = np.ones(k, dtype=bool)
            first_centers_ind = random.sample(range(n), k)  # indices of the centers from which the algorithm starts
            centers = X[first_centers_ind, :].astype(float)  # float copy, so that the mean updates are not truncated
            old_centers = np.ones([k, d]) * np.inf
            mv = mean_var(X, centers, q, kdt)
            Nstep = 1
            while np.sum(old_centers != centers) > 0 and Nstep <= iter_max:
                Nstep = Nstep + 1
                # Step 1: update colors and min_distance
                for j in range(n):
                    cost = np.inf
                    best_ind = 0
                    for i in range(k):
                        if kept_centers[i]:
                            newcost = np.sum((X[j, :] - mv[0][i, :]) ** 2) + mv[1][i]
                            if newcost < cost:
                                cost = newcost
                                best_ind = i
                    colors[j] = best_ind
                    min_distance[j] = cost
                # Step 2: trimming step - give color -1 to the (n - sig) points with the largest cost
                index = np.argsort(-min_distance)
                colors[index[:(n - sig)]] = -1
                ds = min_distance[index[(n - sig):]]
                costt = np.mean(ds)
                # Step 3: update centers and mv
                old_centers = np.copy(centers)
                old_mv = mv
                for i in range(k):
                    pointcloud_size = np.sum(colors == i)
                    if pointcloud_size >= 1:
                        centers[i, :] = np.mean(X[colors == i, :], axis=0)
                    else:
                        kept_centers[i] = False
                mv = mean_var(X, centers, q, kdt)
                if costt <= opt_cost:
                    opt_cost = costt
                    opt_centers = np.copy(old_centers)
                    opt_mv = old_mv
                    opt_colors = np.copy(colors)
                    opt_kept_centers = np.copy(kept_centers)
        centers = opt_centers[opt_kept_centers, :]
        means = opt_mv[0][opt_kept_centers, :]
        variances = opt_mv[1][opt_kept_centers]
        colors = np.zeros(n)
        for i in range(n):
            # renumber the colors so that they index the kept centers only
            colors[i] = np.sum(opt_kept_centers[range(int(opt_colors[i] + 1))]) - 1
        cost = opt_cost
    return centers, means, variances, colors, cost
def kPDTM(X, query_pts, q, k, sig, iter_max=10, nstart=1):
    '''
    Compute the values of the k-PDTM of the empirical measure of a point cloud X.
    Requires sklearn.neighbors.KDTree to search nearest neighbors.
    Input:
        X: an nxd numpy array representing n points in R^d
        query_pts: an sxd numpy array of query points
        q: parameter of the DTM in {1,2,...,n}
        k: number of centers
        sig: number of points considered as signal in the sample (the other points are trimmed)
        iter_max: maximum number of iterations for the optimisation algorithm
        nstart: number of starts for the optimisation algorithm
    Output:
        kPDTM_result: an sx1 numpy array containing the k-PDTM of the query points
        centers, means, variances, colors, cost: the outputs of optima_for_kPDTM
    Example:
        X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
        Q = np.array([[0, 0], [5, 5]])
        kPDTM_values, centers, means, variances, colors, cost = kPDTM(X, Q, 3, 2, X.shape[0])
    '''
    n = X.shape[0]
    if q <= 0 or q > n:
        raise AssertionError("Error: q should be in {1,2,...,n}")
    elif k <= 0 or k > n:
        raise AssertionError("Error: k should be in {1,2,...,n}")
    elif X.shape[1] != query_pts.shape[1]:
        raise AssertionError("Error: X and query_pts should contain points with the same number of coordinates.")
    else:
        centers, means, variances, colors, cost = optima_for_kPDTM(X, q, k, sig, iter_max=iter_max, nstart=nstart)
        kPDTM_result = np.zeros(query_pts.shape[0])
        for i in range(query_pts.shape[0]):
            kPDTM_result[i] = np.inf
            for j in range(means.shape[0]):
                # weighted distance to the j-th center: sqrt(||x - m_j||^2 + v_j)
                aux = np.sqrt(np.sum((query_pts[i, :] - means[j, :]) ** 2) + variances[j])
                if aux < kPDTM_result[i]:
                    kPDTM_result[i] = aux
    return kPDTM_result, centers, means, variances, colors, cost
We compute the $k$-PDTM on the same sample of points.
Note that when we take $k = n$, that is, when $k$ is equal to the sample size, the DTM and the $k$-PDTM coincide on the points of the sample.
The sublevel sets of the $k$-PDTM are unions of $k$ balls, whose centers are represented by triangles.
' Compute the k-PDTM on X '
# compute the values of the k-PDTM with parameters q and k
q = 40
k = 250
sig = X.shape[0]
iter_max = 100
nstart = 10
kPDTM_values, centers, means, variances, colors, cost = kPDTM(X,X,q,k,sig,iter_max,nstart)
# plot of the opposite of the k-PDTM
fig, ax = plt.subplots()
plot = ax.scatter(X[:,0], X[:,1], c=-kPDTM_values)
fig.colorbar(plot)
for i in range(means.shape[0]):
    ax.scatter(means[i, 0], means[i, 1], c="black", marker="^")
ax.axis('equal')
ax.set_title('Values of -kPDTM on X with parameter q='+str(q)+' and k='+str(k)+'.');
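Since $k = 250 = n$ and sig $= n$ in this example, the earlier remark can be checked numerically: the $k$-PDTM values should be close to the DTM values computed above with the same parameter q. This check is our own addition; the match need not be exact, since the algorithm only returns local minimisers of the criterion $R$.

# compare the k-PDTM (with k = n, no trimming) to the DTM on the sample points
print(np.max(np.abs(DTM_values - kPDTM_values)))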
Sublevel sets of the $k$-PDTM are unions of $k$ balls. The