Notebook_kPDTM_kPLM

Robust approximations of compact sets¶

Claire Brécheteau
https://perso.univ-rennes2.fr/claire.brecheteau

We consider $\mathcal{K}$$\mathcal{K}$, an unknown compact subset of the Euclidean space $\left({\mathbb{R}}^{d},‖\cdot ‖\right)$$(\mathbb{R}^d,\|\cdot\|)$. We dispose of a sample of $n$$n$ points $\mathbb{X}=\left\{{X}_{1},{X}_{2},\dots ,{X}_{n}\right\}$$\mathbb{X} = \{X_1, X_2,\ldots, X_n\}$ generated uniformly on $\mathcal{K}$$\mathcal{K}$ or generated in a neighborhood of $\mathcal{K}$$\mathcal{K}$. The sample of points may be corrupted by outliers. That is, by points lying far from $\mathcal{K}$$\mathcal{K}$.

Given ${X}_{1},{X}_{2},\dots ,{X}_{n}$$X_1, X_2,\ldots, X_n$, we aim at recovering $\mathcal{K}$$\mathcal{K}$. More precisely, we construct approximations of $\mathcal{K}$$\mathcal{K}$ as unions of $k$$k$ balls or $k$$k$ ellipsoids, for $k$$k$ possibly much smaller than the sample size $n$$n$.

Note that $\mathcal{K}$$\mathcal{K}$ coincides with ${d}_{\mathcal{K}}^{-1}\left(\left(-\mathrm{\infty },0\right]\right)$$d_{\mathcal{K}}^{-1}((-\infty,0])$, the sublevel set of the distance-to-compact function ${d}_{\mathcal{K}}$$d_{\mathcal{K}}$. Then, the approximations we propose for $\mathcal{K}$$\mathcal{K}$ are sublevel sets of approximations of the distance function ${d}_{\mathcal{K}}$$d_{\mathcal{K}}$, based on $\mathbb{X}$$\mathbb{X}$. Since the sample may be corrupted by outliers and the points may not lie exactly on the compact set, approximating ${d}_{\mathcal{K}}$$d_{\mathcal{K}}$ by ${d}_{\mathbb{X}}$$d_{\mathbb{X}}$ may be terrible. Therefore, we construct approximations of ${d}_{\mathcal{K}}$$d_{\mathcal{K}}$ that are robust to noise.

In this page, we present three methods to construct approximations of ${d}_{\mathcal{K}}$$d_{\mathcal{K}}$ from a possibly noisy sample $\mathbb{X}$$\mathbb{X}$. The first approximation is the well-known distance-to-measure (DTM) function of [Chazal11]. The two last methods are new. They are based on the following approximations which sublevel sets are unions of $k$$k$ balls or ellispoids: the $k$$k$-PDTM [Brecheteau19a] and the $k$$k$-PLM [Brecheteau19b].

The codes and some toy examples are available in this page. In particular, they are implemented via the functions:

• DTM(X,query_pts,q)
• kPDTM(X,query_pts,q,k,sig,iter_max = 10,nstart = 1)
• kPLM(X,query_pts,q,k,sig,iter_max = 10,nstart = 1)

For a sample X of size $n$$n$, these functions compute the distance approximations at the points in query_pts. The parameter q is a regularity parameter in $\left\{0,1,\dots ,n\right\}$$\{0,1,\ldots,n\}$, k is the number of balls or ellispoids for the sublevel sets of the distance approximations. The procedures remove $n-$$n-$sig points of the sample, cf Section « Detecting outliers ».

Example¶

We consider as a compact set $\mathcal{K}$$\mathcal{K}$, the infinity symbol:

The target is the distance function ${d}_{\mathcal{K}}$$d_{\mathcal{K}}$. The graph of $-{d}_{\mathcal{K}}$$-d_{\mathcal{K}}$ is the following:

We have generated a noisy sample $\mathbb{X}$$\mathbb X$. Then, ${d}_{\mathbb{X}}$$d_{\mathbb X}$ is a terrible approximation of ${d}_{\mathcal{K}}$$d_{\mathcal{K}}$. Indeed, the graph of $-{d}_{\mathbb{X}}$$-d_{\mathbb X}$ is the following:

Nonetheless, there exist robust approximations of the distance-to-compact function, such as the distance-to-measure (DTM) function ${d}_{\mathbb{X},q}$$d_{\mathbb X,q}$ (that depends on a regularity parameter $q$$q$) [Chazal11]. The graph of $-{d}_{\mathbb{X},q}$$-d_{\mathbb X,q}$ for some $q$$q$ is the following:

In this page, we define two functions, the $k$$k$-PDTM ${d}_{\mathbb{X},q,k}$$d_{\mathbb X,q,k}$ and the $k$$k$-PLM ${d}_{\mathbb{X},q,k}^{\prime }$$d'_{\mathbb X,q,k}$. The sublevel sets of the $k$$k$-PDTM are unions of $k$$k$ balls. The sublevel sets of the $k$$k$-PLM are unions of $k$$k$ ellipsoids. The graphs of $-{d}_{\mathbb{X},q,k}$$-d_{\mathbb X,q,k}$ and $-{d}_{\mathbb{X},q,k}^{\prime }$$-d'_{\mathbb X,q,k}$ for some $q$$q$ and $k$$k$ are the following:

and

The distance-to-measure (DTM)¶

The distance-to-measure function (DTM) is a surrogate for the distance-to-compact, robust to noise. It was introduced in 2009 [Chazal11]. It depends on some regularity parameter $q\in \left\{0,1,\dots ,n\right\}$$q\in\{0,1,\ldots,n\}$. The distance-to-measure function ${d}_{\mathbb{X},q}$$d_{\mathbb{X},q}$ is defined by

where $m\left(x,\mathbb{X},q\right)=\frac{1}{q}\sum _{i=1}^{q}{X}^{\left(i\right)}$$m(x,\mathbb{X},q) = \frac{1}{q}\sum_{i=1}^qX^{(i)}$ is the barycenter of the $q$$q$ nearest neighbours of $x$$x$ in $\mathbb{X}$$\mathbb{X}$, ${X}^{\left(1\right)},{X}^{\left(2\right)},\dots ,{X}^{\left(q\right)}$$X^{(1)}, X^{(2)},\ldots, X^{(q)}$, and $v\left(x,\mathbb{X},q\right)$$v(x,\mathbb{X},q)$ is their variance $\frac{1}{q}\sum _{i=1}^{q}‖m\left(x,\mathbb{X},q\right)-{X}^{\left(i\right)}{‖}^{2}$$\frac{1}{q}\sum_{i=1}^q\|m(x,\mathbb{X},q)-X^{(i)}\|^2$.

Equivalently, the DTM coincides with the mean distance between $x$$x$ and its $q$$q$ nearest neighbours:

${d}_{\mathbb{X},q}^{2}\left(x\right)=\frac{1}{q}\sum _{i=1}^{q}‖x-{X}^{\left(i\right)}{‖}^{2}.$

The following implementation of the DTM is due to Raphaël Tinarrage, in his page DTM-based filtrations: demo.

In [1]:
import numpy as np
from sklearn.neighbors import KDTree

def DTM(X,query_pts,q):
'''
Compute the values of the DTM of the point cloud X
Require sklearn.neighbors.KDTree to search nearest neighbors

Input:
X: a nxd numpy array representing n points in R^d
query_pts:  a sxd numpy array of query points
q: parameter of the DTM in {1,2,...,n}

Output:
DTM_result: a sx1 numpy array contaning the DTM of the
query points

Example:
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Q = np.array([[0,0],[5,5]])
DTM_values = DTM(X, Q, 3)
'''
n = X.shape[0]
if(q>0 and q<=n):
kdt = KDTree(X, leaf_size=30, metric='euclidean')
NN_Dist, NN = kdt.query(query_pts, q, return_distance=True)
DTM_result = np.sqrt(np.sum(NN_Dist*NN_Dist,axis=1) / q)
else:
raise AssertionError("Error: q should be in {1,2,...,n}")
return(DTM_result)


Example – DTM computation for a noisy sample on a circle¶

The points are generated on the circle accordingly to the following function SampleOnCircle. Again, this whole example was picked from the page DTM-based filtrations: demo of Raphaël Tinarrage.

In [2]:
import matplotlib.pyplot as plt

def SampleOnCircle(N_obs = 100, N_out = 0, is_plot = False):
'''
Sample N_obs points (observations) points from the uniform distribution on the unit circle in R^2,
and N_out points (outliers) from the uniform distribution on the unit square

Input:
N_obs: number of sample points on the circle
N_noise: number of sample points on the square
is_plot = True or False : draw a plot of the sampled points

Output :
data : a (N_obs + N_out)x2 matrix, the sampled points concatenated
'''
rand_uniform = np.random.rand(N_obs)*2-1
X_obs = np.cos(2*np.pi*rand_uniform)
Y_obs = np.sin(2*np.pi*rand_uniform)

X_out = np.random.rand(N_out)*2-1
Y_out = np.random.rand(N_out)*2-1

X = np.concatenate((X_obs, X_out))
Y = np.concatenate((Y_obs, Y_out))
data = np.stack((X,Y)).transpose()

if is_plot:
fig, ax = plt.subplots()
plt_obs = ax.scatter(X_obs, Y_obs, c='tab:cyan');
plt_out = ax.scatter(X_out, Y_out, c='tab:orange');
ax.axis('equal')
ax.set_title(str(N_obs)+'-sampling of the unit circle with '+str(N_out)+' outliers')
ax.legend((plt_obs, plt_out), ('data', 'outliers'), loc='lower left')
return data

In [23]:
' Sampling on the circle with outlier '
N_obs = 150                                     # number of points sampled on the circle
N_out = 100                                     # number of outliers
X = SampleOnCircle(N_obs, N_out, is_plot=True)  # sample points with outliers

In [4]:
' Compute the DTM on X '
# compute the values of the DTM of parameter q
q = 40
DTM_values = DTM(X,X,q)

# plot of  the opposite of the DTM
fig, ax = plt.subplots()
plot=ax.scatter(X[:,0], X[:,1], c=-DTM_values)
fig.colorbar(plot)
ax.axis('equal')
ax.set_title('Values of -DTM on X with parameter q='+str(q));


Approximating K$\mathcal{K}$$\mathcal{K}$ with a union of k$k$$k$ balls – or – the k$k$$k$-power-distance-to-measure (k$k$$k$-PDTM)¶

The $k$$k$-PDTM is an approximation of the DTM, which sublevel sets are unions of $k$$k$ balls. It was introduced and studied in [Brecheteau19a].

According to the previous expression of the DTM, the DTM rewrites as:

These centers are chosen such that the criterion
$R:\left({c}_{1},{c}_{2},\dots ,{c}_{k}\right)↦\sum _{X\in \mathbb{X}}\underset{i\in \left\{1,2,\dots ,k\right\}}{min}‖X-m\left({c}_{i},\mathbb{X},q\right){‖}^{2}+v\left({c}_{i},\mathbb{X},q\right)$
is minimal.

Note that these centers ${c}_{1}^{\ast },{c}_{2}^{\ast },\dots ,{c}_{k}^{\ast }$$c^*_1,c^*_2,\ldots,c^*_k$ are not necessarily uniquely defined. The following algorithm provides local minimisers of the criterion $R$$R$.

In [7]:
def mean_var(X,x,q,kdt):
'''
An auxiliary function.

Input:
X: an nxd numpy array representing n points in R^d
x: an sxd numpy array representing s points,
for each of these points we compute the mean and variance of the nearest neighbors in X
q: parameter of the DTM in {1,2,...,n} - number of nearest neighbors to consider
kdt: a KDtree obtained from X via the expression KDTree(X, leaf_size=30, metric='euclidean')

Output:
Mean: an sxd numpy array containing the means of nearest neighbors
Var: an sx1 numpy array containing the variances of nearest neighbors

Example:
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
x = np.array([[2,3],[0,0]])
kdt = KDTree(X, leaf_size=30, metric='euclidean')
Mean, Var = mean_var(X,x,2,kdt)
'''
NN = kdt.query(x, q, return_distance=False)
Mean = np.zeros((x.shape[0],x.shape[1]))
Var = np.zeros(x.shape[0])
for i in range(x.shape[0]):
Mean[i,:] = np.mean(X[NN[i],:],axis = 0)
Var[i] = np.mean(np.sum((X[NN[i],:] - Mean[i,:])*(X[NN[i],:] - Mean[i,:]),axis = 1))
return Mean, Var

import random # For the random centers from which the algorithm starts

def optima_for_kPDTM(X,q,k,sig,iter_max = 10,nstart = 1):
'''
Compute local optimal centers for the k-PDTM-criterion $R$ for the point cloud X
Require sklearn.neighbors.KDTree to search nearest neighbors

Input:
X: an nxd numpy array representing n points in R^d
query_pts:  an sxd numpy array of query points
q: parameter of the DTM in {1,2,...,n}
k: number of centers
sig: number of sample points that the algorithm keeps (the other ones are considered as outliers -- cf section "Detecting outliers")
iter_max : maximum number of iterations for the optimisation algorithm
nstart : number of starts for the optimisation algorithm

Output:
centers: a kxd numpy array contaning the optimal centers c^*_i computed by the algorithm
means: a kxd numpy array containing the local centers m(c^*_i,\mathbb X,q)
variances: a kx1 numpy array containing the local variances v(c^*_i,\mathbb X,q)
colors: a size n numpy array containing the colors of the sample points in X
points in the same weighted Voronoi cell (with centers in opt_means and weights in opt_variances)
have the same color
cost: the mean, for the "sig" points X[j,] considered as signal, of their smallest weighted distance to a center in "centers"
that is, min_i\|X[j,]-means[i,]\|^2+variances[i].

Example:
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
sig = X.shape[0] # There is no trimming, all sample points are assigned to a cluster
centers, means, variances, colors, cost = optima_for_kPDTM(X, 3, 2, sig)
'''
n = X.shape[0]
d = X.shape[1]
opt_cost = np.inf
opt_centers = np.zeros([k,d])
opt_colors = np.zeros(n)
opt_kept_centers = np.zeros(k)
if(q<=0 or q>n):
raise AssertionError("Error: q should be in {1,2,...,n}")
elif(k<=0 or k>n):
raise AssertionError("Error: k should be in {1,2,...,n}")
else:
kdt = KDTree(X, leaf_size=30, metric='euclidean')
for starts in range(nstart):

# Initialisation
colors = np.zeros(n)
min_distance = np.zeros(n) # Weighted distance between a point and its nearest center
kept_centers = np.ones((k), dtype=bool)
first_centers_ind = random.sample(range(n), k) # Indices of the centers from which the algorithm starts
centers = X[first_centers_ind,:]
old_centers = np.ones([k,d])*np.inf
mv = mean_var(X,centers,q,kdt)
Nstep = 1
while((np.sum(old_centers!=centers)>0) and (Nstep <= iter_max)):
Nstep = Nstep + 1

# Step 1: Update colors and min_distance
for j in range(n):
cost = np.inf
best_ind = 0
for i in range(k):
if(kept_centers[i]):
newcost = np.sum((X[j,:] - mv[0][i,:])*(X[j,:] - mv[0][i,:])) + mv[1][i]
if(newcost < cost):
cost = newcost
best_ind = i
colors[j] = best_ind
min_distance[j] = cost

# Step 2: Trimming step - Put color -1 to the (n-sig) points with largest cost
index = np.argsort(-min_distance)
colors[index[range(n-sig)]] = -1
ds = min_distance[index[range(n-sig,n)]]
costt = np.mean(ds)

# Step 3: Update Centers and mv
old_centers = np.copy(centers)
old_mv = mv
for i in range(k):
pointcloud_size = np.sum(colors == i)
if(pointcloud_size>=1):
centers[i,] = np.mean(X[colors==i,],axis = 0)
else:
kept_centers[i] = False
mv = mean_var(X,centers,q,kdt)

if(costt <= opt_cost):
opt_cost = costt
opt_centers = np.copy(old_centers)
opt_mv = old_mv
opt_colors = np.copy(colors)
opt_kept_centers = np.copy(kept_centers)

centers = opt_centers[opt_kept_centers,]
means = opt_mv[0][opt_kept_centers,]
variances = opt_mv[1][opt_kept_centers]
colors = np.zeros(n)
for i in range(n):
colors[i] = np.sum(opt_kept_centers[range(int(opt_colors[i]+1))])-1
cost = opt_cost

return(centers, means, variances, colors, cost)

def kPDTM(X,query_pts,q,k,sig,iter_max = 10,nstart = 1):
'''
Compute the values of the k-PDTM of the empirical measure of a point cloud X
Require sklearn.neighbors.KDTree to search nearest neighbors

Input:
X: a nxd numpy array representing n points in R^d
query_pts:  a sxd numpy array of query points
q: parameter of the DTM in {1,2,...,n}
k: number of centers
sig: number of points considered as signal in the sample (other signal points are trimmed)

Output:
kPDTM_result: a sx1 numpy array contaning the kPDTM of the
query points

Example:
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Q = np.array([[0,0],[5,5]])
kPDTM_values = kPDTM(X, Q, 3, 2,X.shape[0])
'''
n = X.shape[0]
if(q<=0 or q>n):
raise AssertionError("Error: q should be in {1,2,...,n}")
elif(k<=0 or k>n):
raise AssertionError("Error: k should be in {1,2,...,n}")
elif(X.shape[1]!=query_pts.shape[1]):
raise AssertionError("Error: X and query_pts should contain points with the same number of coordinates.")
else:
centers, means, variances, colors, cost = optima_for_kPDTM(X,q,k,sig,iter_max = iter_max,nstart = nstart)
kPDTM_result = np.zeros(query_pts.shape[0])
for i in range(query_pts.shape[0]):
kPDTM_result[i] = np.inf
for j in range(means.shape[0]):
aux = np.sqrt(np.sum((query_pts[i,]-means[j,])*(query_pts[i,]-means[j,]))+variances[j])
if(aux<kPDTM_result[i]):
kPDTM_result[i] = aux

return(kPDTM_result, centers, means, variances, colors, cost)


We compute the $k$$k$-PDTM on the same sample of points.

Note that when we take $k=250$$k=250$, that is, when $k$$k$ is equal to the sample size, the DTM and the $k$$k$-PDTM coincide on the points of the sample.

The sub-level sets of the $k$$k$-PDTM are unions of $k$$k$ balls which centers are represented by triangles.

In [8]:
' Compute the k-PDTM on X '
# compute the values of the DTM of parameter q
q = 40
k = 250
sig = X.shape[0]
iter_max = 100
nstart = 10
kPDTM_values, centers, means, variances, colors, cost = kPDTM(X,X,q,k,sig,iter_max,nstart)
# plot of  the opposite of the DTM
fig, ax = plt.subplots()
plot = ax.scatter(X[:,0], X[:,1], c=-kPDTM_values)
fig.colorbar(plot)
for i in range(means.shape[0]):
ax.scatter(means[i,0],means[i,1],c = "black",marker = "^")
ax.axis('equal')
ax.set_title('Values of -kPDTM on X with parameter q='+str(q)+' and k='+str(k)+'.');


Approximating K$\mathcal{K}$$\mathcal{K}$ with a union of k$k$$k$ ellipsoids – or – the k$k$$k$-power-likelihood-to-measure (k$k$$k$-PLM)¶

Sublevel sets of the $k$$k$-PDTM are unions of