Notebook_kPDTM_kPLM

Robust approximations of compact sets

Claire Brécheteau
https://perso.univ-rennes2.fr/claire.brecheteau

This article was originally published at https://github.com/GUDHI/TDA-tutorial, where you will find more information about the whole GUDHI package for topological inference. In this article we will focus on the estimator named k-PDTM.

We consider $K$, an unknown compact subset of the Euclidean space $(\mathbb{R}^d, \|\cdot\|)$. We are given a sample of $n$ points $\mathbb{X} = \{X_1, X_2, \ldots, X_n\}$ generated uniformly on $K$, or generated in a neighborhood of $K$. The sample of points may be corrupted by outliers, that is, by points lying far from $K$.

Given $X_1, X_2, \ldots, X_n$, we aim at recovering $K$. More precisely, we construct approximations of $K$ as unions of $k$ balls or $k$ ellipsoids, for $k$ possibly much smaller than the sample size $n$.

Note that $K$ coincides with $d_K^{-1}((-\infty,0])$, the sublevel set of the distance-to-compact function $d_K$. The approximations we propose for $K$ are therefore sublevel sets of approximations of the distance function $d_K$, based on $\mathbb{X}$. Since the sample may be corrupted by outliers and the points may not lie exactly on the compact set, approximating $d_K$ by $d_{\mathbb{X}}$, the distance to the sample, may be terrible. Therefore, we construct approximations of $d_K$ that are robust to noise.


In this page, we present three methods to construct approximations of $d_K$ from a possibly noisy sample $\mathbb{X}$. The first approximation is the well-known distance-to-measure (DTM) function of [Chazal11]. The last two methods are new. They are based on the following approximations, whose sublevel sets are unions of $k$ balls or ellipsoids: the k-PDTM [Brecheteau19a] and the k-PLM [Brecheteau19b].

The code and some toy examples are available on this page. In particular, they are implemented via the functions:

  • DTM(X,query_pts,q)
  • kPDTM(X,query_pts,q,k,sig,iter_max = 10,nstart = 1)
  • kPLM(X,query_pts,q,k,sig,iter_max = 10,nstart = 1)

For a sample $\mathbb{X}$ of size $n$, these functions compute the distance approximations at the points in query_pts. The parameter q is a regularity parameter in $\{0, 1, \ldots, n\}$, and k is the number of balls or ellipsoids defining the sublevel sets of the distance approximations. The procedures remove n - sig points from the sample, considered as outliers; cf. the section « Detecting outliers ».
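As a quick illustration of this shared interface (not part of the original notebook; the sample X and the values of q, k and sig below are placeholders, and the functions themselves are defined further down this page), the calls look like:

In [ ]:
q, k = 40, 250
sig = X.shape[0]                            # keep all points (no trimming)
DTM_values = DTM(X, X, q)                   # DTM evaluated at the sample points
kPDTM_values = kPDTM(X, X, q, k, sig)[0]    # first element of the returned tuple
# kPLM is called the same way: kPLM(X, X, q, k, sig)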

Example

We consider, as the compact set $K$, the infinity symbol:

The target is the distance function $d_K$. The graph of $d_K$ is the following:

We have generated a noisy sample $\mathbb{X}$. Then, $d_{\mathbb{X}}$ is a terrible approximation of $d_K$. Indeed, the graph of $d_{\mathbb{X}}$ is the following:

Nonetheless, there exist robust approximations of the distance-to-compact function, such as the distance-to-measure (DTM) function $d_{\mathbb{X},q}$ (which depends on a regularity parameter $q$) [Chazal11]. The graph of $d_{\mathbb{X},q}$ for some $q$ is the following:

In this page, we define two functions: the k-PDTM $d_{\mathbb{X},q,k}$ and the k-PLM. The sublevel sets of the k-PDTM are unions of $k$ balls. The sublevel sets of the k-PLM are unions of $k$ ellipsoids. The graphs of these two functions, for some $q$ and $k$, are the following:

and

The distance-to-measure (DTM)

The distance-to-measure function (DTM) is a surrogate for the distance-to-compact function that is robust to noise. It was introduced in 2009 [Chazal11]. It depends on a regularity parameter $q \in \{0, 1, \ldots, n\}$. The distance-to-measure function $d_{\mathbb{X},q}$ is defined by

$$d_{\mathbb{X},q}^2(x) = \|x - m(x,\mathbb{X},q)\|^2 + v(x,\mathbb{X},q),$$

where $m(x,\mathbb{X},q) = \frac{1}{q}\sum_{i=1}^{q} X^{(i)}$ is the barycenter of the $q$ nearest neighbours of $x$ in $\mathbb{X}$, denoted $X^{(1)}, X^{(2)}, \ldots, X^{(q)}$, and $v(x,\mathbb{X},q) = \frac{1}{q}\sum_{i=1}^{q} \|m(x,\mathbb{X},q) - X^{(i)}\|^2$ is their variance.

Equivalently, the squared DTM coincides with the mean squared distance between $x$ and its $q$ nearest neighbours:

$$d_{\mathbb{X},q}^2(x) = \frac{1}{q}\sum_{i=1}^{q} \|x - X^{(i)}\|^2.$$
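A quick numerical sanity check of this equivalence (not in the original notebook; the random toy cloud and the query point are arbitrary choices):

In [ ]:
import numpy as np

X_toy = np.random.rand(100, 2)                      # toy point cloud
x = np.array([0.3, 0.7])                            # query point
q = 10

dist2 = np.sum((X_toy - x)**2, axis=1)              # squared distances from x to all points
nn = np.argsort(dist2)[:q]                          # indices of the q nearest neighbours
m = np.mean(X_toy[nn], axis=0)                      # barycenter m(x,X,q)
v = np.mean(np.sum((X_toy[nn] - m)**2, axis=1))     # variance v(x,X,q)

lhs = np.mean(dist2[nn])                            # (1/q) * sum ||x - X^(i)||^2
rhs = np.sum((x - m)**2) + v                        # ||x - m(x,X,q)||^2 + v(x,X,q)
print(np.isclose(lhs, rhs))                         # True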

The following implementation of the DTM is due to Raphaël Tinarrage, from his page DTM-based filtrations: demo.

In [1]:
import numpy as np
from sklearn.neighbors import KDTree

def DTM(X,query_pts,q):
    '''
    Compute the values of the DTM of the point cloud X
    Require sklearn.neighbors.KDTree to search nearest neighbors
    
    Input:
    X: a nxd numpy array representing n points in R^d
    query_pts:  a sxd numpy array of query points
    q: parameter of the DTM in {1,2,...,n}
    
    Output: 
    DTM_result: a sx1 numpy array containing the DTM of the 
    query points
    
    Example:
    X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    Q = np.array([[0,0],[5,5]])
    DTM_values = DTM(X, Q, 3)
    '''
    n = X.shape[0]     
    if(q>0 and q<=n):
        kdt = KDTree(X, leaf_size=30, metric='euclidean')
        NN_Dist, NN = kdt.query(query_pts, q, return_distance=True)
        DTM_result = np.sqrt(np.sum(NN_Dist*NN_Dist,axis=1) / q)
    else:
        raise AssertionError("Error: q should be in {1,2,...,n}")
    return(DTM_result)
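
As a quick check (not in the original notebook), the docstring example can be run directly; for this toy cloud the values can be verified by hand:

In [ ]:
X_toy = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Q = np.array([[0, 0], [5, 5]])
print(DTM(X_toy, Q, 3))
# For (0,0) the three smallest squared distances are 2, 2 and 5, so the DTM is sqrt(3) ≈ 1.73;
# for (5,5) they are 13, 25 and 32, so the DTM is sqrt(70/3) ≈ 4.83.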

Example – DTM computation for a noisy sample on a circle

The points are generated on the circle according to the following function SampleOnCircle. Again, this whole example was taken from the page DTM-based filtrations: demo by Raphaël Tinarrage.

In [2]:
import matplotlib.pyplot as plt

def SampleOnCircle(N_obs = 100, N_out = 0, is_plot = False):
    '''
    Sample N_obs points (observations) from the uniform distribution on the unit circle in R^2, 
        and N_out points (outliers) from the uniform distribution on the unit square  
        
    Input: 
    N_obs: number of sample points on the circle
    N_out: number of sample points (outliers) on the square
    is_plot = True or False : draw a plot of the sampled points            
    
    Output : 
    data : a (N_obs + N_out)x2 matrix, the sampled points concatenated 
    '''
    rand_uniform = np.random.rand(N_obs)*2-1    
    X_obs = np.cos(2*np.pi*rand_uniform)
    Y_obs = np.sin(2*np.pi*rand_uniform)

    X_out = np.random.rand(N_out)*2-1
    Y_out = np.random.rand(N_out)*2-1

    X = np.concatenate((X_obs, X_out))
    Y = np.concatenate((Y_obs, Y_out))
    data = np.stack((X,Y)).transpose()

    if is_plot:
        fig, ax = plt.subplots()
        plt_obs = ax.scatter(X_obs, Y_obs, c='tab:cyan');
        plt_out = ax.scatter(X_out, Y_out, c='tab:orange');
        ax.axis('equal')
        ax.set_title(str(N_obs)+'-sampling of the unit circle with '+str(N_out)+' outliers')
        ax.legend((plt_obs, plt_out), ('data', 'outliers'), loc='lower left')
    return data
In [23]:
' Sampling on the circle with outliers '
N_obs = 150                                     # number of points sampled on the circle
N_out = 100                                     # number of outliers 
X = SampleOnCircle(N_obs, N_out, is_plot=True)  # sample points with outliers 
In [4]:
' Compute the DTM on X ' 
# compute the values of the DTM of parameter q
q = 40                      
DTM_values = DTM(X,X,q)             

# plot of  the opposite of the DTM
fig, ax = plt.subplots()
plot=ax.scatter(X[:,0], X[:,1], c=-DTM_values)
fig.colorbar(plot)
ax.axis('equal')
ax.set_title('Values of -DTM on X with parameter q='+str(q));

Approximating K with a union of k balls – or – the k-power-distance-to-measure (k-PDTM)

The k-PDTM is an approximation of the DTM whose sublevel sets are unions of k balls. It was introduced and studied in [Brecheteau19a].

According to the previous expression of the DTM, the DTM rewrites as

$$d_{\mathbb{X},q}^2(x) = \min_{c \in \mathbb{R}^d} \|x - m(c,\mathbb{X},q)\|^2 + v(c,\mathbb{X},q).$$

The k-PDTM $d_{\mathbb{X},q,k}$ is obtained by replacing the minimum over $\mathbb{R}^d$ with a minimum over a set of $k$ centers $c^*_1, c^*_2, \ldots, c^*_k$:

$$d_{\mathbb{X},q,k}^2(x) = \min_{i \in \{1,2,\ldots,k\}} \|x - m(c^*_i,\mathbb{X},q)\|^2 + v(c^*_i,\mathbb{X},q).$$

These centers are chosen such that the criterion

$$R: (c_1, c_2, \ldots, c_k) \mapsto \sum_{X \in \mathbb{X}} \min_{i \in \{1,2,\ldots,k\}} \|X - m(c_i,\mathbb{X},q)\|^2 + v(c_i,\mathbb{X},q)$$

is minimal.
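
Before turning to the optimisation algorithm, here is a minimal sketch (not in the original notebook) of a direct, naive evaluation of the criterion R for a given family of candidate centers; the helper name criterion_R is an assumption made for illustration:

In [ ]:
import numpy as np
from sklearn.neighbors import KDTree

def criterion_R(X, centers, q):
    '''Naive evaluation of R(c_1,...,c_k) for a kxd array of candidate centers.'''
    kdt = KDTree(X, leaf_size=30, metric='euclidean')
    NN = kdt.query(centers, q, return_distance=False)          # q nearest neighbours of each center c_i
    means = np.array([np.mean(X[idx], axis=0) for idx in NN])  # m(c_i, X, q)
    variances = np.array([np.mean(np.sum((X[idx] - m)**2, axis=1))
                          for idx, m in zip(NN, means)])       # v(c_i, X, q)
    # weighted distances ||X_j - m(c_i,X,q)||^2 + v(c_i,X,q), minimised over the k centers
    power = np.sum((X[:, None, :] - means[None, :, :])**2, axis=2) + variances[None, :]
    return np.sum(np.min(power, axis=1))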

Note that these centers $c_1, c_2, \ldots, c_k$ are not necessarily uniquely defined. The following algorithm provides local minimisers of the criterion $R$.

In [7]:
def mean_var(X,x,q,kdt):
    '''
    An auxiliary function.
    
    Input:
    X: an nxd numpy array representing n points in R^d
    x: an sxd numpy array representing s points, 
        for each of these points we compute the mean and variance of the nearest neighbors in X
    q: parameter of the DTM in {1,2,...,n} - number of nearest neighbors to consider
    kdt: a KDTree obtained from X via the expression KDTree(X, leaf_size=30, metric='euclidean')
    
    Output:
    Mean: an sxd numpy array containing the means of nearest neighbors
    Var: an sx1 numpy array containing the variances of nearest neighbors
    
    Example:
    X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    x = np.array([[2,3],[0,0]])
    kdt = KDTree(X, leaf_size=30, metric='euclidean')
    Mean, Var = mean_var(X,x,2,kdt)
    '''
    NN = kdt.query(x, q, return_distance=False)
    Mean = np.zeros((x.shape[0],x.shape[1]))
    Var = np.zeros(x.shape[0])
    for i in range(x.shape[0]):
        Mean[i,:] = np.mean(X[NN[i],:],axis = 0)
        Var[i] = np.mean(np.sum((X[NN[i],:] - Mean[i,:])*(X[NN[i],:] - Mean[i,:]),axis = 1))
    return Mean, Var

import random # For the random centers from which the algorithm starts


def optima_for_kPDTM(X,q,k,sig,iter_max = 10,nstart = 1):
    '''
    Compute local optimal centers for the k-PDTM-criterion $R$ for the point cloud X
    Require sklearn.neighbors.KDTree to search nearest neighbors
    
    Input:
    X: an nxd numpy array representing n points in R^d
    q: parameter of the DTM in {1,2,...,n}
    k: number of centers
    sig: number of sample points that the algorithm keeps (the other ones are considered as outliers -- cf section "Detecting outliers")
    iter_max : maximum number of iterations for the optimisation algorithm
    nstart : number of starts for the optimisation algorithm
    
    Output: 
    centers: a kxd numpy array containing the optimal centers c^*_i computed by the algorithm
    means: a kxd numpy array containing the means m(c^*_i,\mathbb X,q)
    variances: a kx1 numpy array containing the local variances v(c^*_i,\mathbb X,q)
    colors: a size n numpy array containing the colors of the sample points in X
        points in the same weighted Voronoi cell (with centers in means and weights in variances)
        have the same color
    cost: the mean, for the "sig" points X[j,] considered as signal, of their smallest weighted distance to a center in "centers",
        that is, min_i \|X[j,]-means[i,]\|^2+variances[i].
        
    
    Example:
    X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    sig = X.shape[0] # There is no trimming, all sample points are assigned to a cluster
    centers, means, variances, colors, cost = optima_for_kPDTM(X, 3, 2, sig)
    '''
    n = X.shape[0]
    d = X.shape[1]
    opt_cost = np.inf
    opt_centers = np.zeros([k,d])
    opt_colors = np.zeros(n)
    opt_kept_centers = np.zeros(k)
    if(q<=0 or q>n):
        raise AssertionError("Error: q should be in {1,2,...,n}")
    elif(k<=0 or k>n):
        raise AssertionError("Error: k should be in {1,2,...,n}")
    else:
        kdt = KDTree(X, leaf_size=30, metric='euclidean')
        for starts in range(nstart):
            
            # Initialisation
            colors = np.zeros(n)
            min_distance = np.zeros(n) # Weighted distance between a point and its nearest center
            kept_centers = np.ones((k), dtype=bool)
            first_centers_ind = random.sample(range(n), k) # Indices of the centers from which the algorithm starts
            centers = X[first_centers_ind,:]
            old_centers = np.ones([k,d])*np.inf
            mv = mean_var(X,centers,q,kdt)
            Nstep = 1
            while((np.sum(old_centers!=centers)>0) and (Nstep <= iter_max)):
                Nstep = Nstep + 1
                
                # Step 1: Update colors and min_distance
                for j in range(n):
                    cost = np.inf
                    best_ind = 0
                    for i in range(k):
                        if(kept_centers[i]):
                            newcost = np.sum((X[j,:] - mv[0][i,:])*(X[j,:] - mv[0][i,:])) + mv[1][i]
                            if(newcost < cost):
                                cost = newcost
                                best_ind = i
                    colors[j] = best_ind
                    min_distance[j] = cost
                    
                # Step 2: Trimming step - Put color -1 to the (n-sig) points with largest cost
                index = np.argsort(-min_distance)
                colors[index[range(n-sig)]] = -1
                ds = min_distance[index[range(n-sig,n)]]
                costt = np.mean(ds)
                
                # Step 3: Update Centers and mv
                old_centers = np.copy(centers)
                old_mv = mv
                for i in range(k):
                    pointcloud_size = np.sum(colors == i)
                    if(pointcloud_size>=1):
                        centers[i,] = np.mean(X[colors==i,],axis = 0)
                    else:
                        kept_centers[i] = False
                mv = mean_var(X,centers,q,kdt)
                
            if(costt <= opt_cost):
                opt_cost = costt
                opt_centers = np.copy(old_centers)
                opt_mv = old_mv
                opt_colors = np.copy(colors)
                opt_kept_centers = np.copy(kept_centers)
                
        centers = opt_centers[opt_kept_centers,]
        means = opt_mv[0][opt_kept_centers,]
        variances = opt_mv[1][opt_kept_centers]
        colors = np.zeros(n)
        for i in range(n):
            colors[i] = np.sum(opt_kept_centers[range(int(opt_colors[i]+1))])-1
        cost = opt_cost
        
    return(centers, means, variances, colors, cost)


def kPDTM(X,query_pts,q,k,sig,iter_max = 10,nstart = 1):
    '''
    Compute the values of the k-PDTM of the empirical measure of a point cloud X
    Require sklearn.neighbors.KDTree to search nearest neighbors
    
    Input:
    X: a nxd numpy array representing n points in R^d
    query_pts:  a sxd numpy array of query points
    q: parameter of the DTM in {1,2,...,n}
    k: number of centers
    sig: number of points considered as signal in the sample (the other points are considered as outliers and trimmed)
    
    Output: 
    kPDTM_result: a sx1 numpy array containing the kPDTM of the 
    query points
    
    Example:
    X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    Q = np.array([[0,0],[5,5]])
    kPDTM_values = kPDTM(X, Q, 3, 2, X.shape[0])[0]
    '''
    n = X.shape[0]     
    if(q<=0 or q>n):
        raise AssertionError("Error: q should be in {1,2,...,n}")
    elif(k<=0 or k>n):
        raise AssertionError("Error: k should be in {1,2,...,n}")
    elif(X.shape[1]!=query_pts.shape[1]):
        raise AssertionError("Error: X and query_pts should contain points with the same number of coordinates.")
    else:
        centers, means, variances, colors, cost = optima_for_kPDTM(X,q,k,sig,iter_max = iter_max,nstart = nstart)
        kPDTM_result = np.zeros(query_pts.shape[0])
        for i in range(query_pts.shape[0]):
            kPDTM_result[i] = np.inf
            for j in range(means.shape[0]):
                aux = np.sqrt(np.sum((query_pts[i,]-means[j,])*(query_pts[i,]-means[j,]))+variances[j])
                if(aux<kPDTM_result[i]):
                    kPDTM_result[i] = aux 
                    
    return(kPDTM_result, centers, means, variances, colors, cost)

We compute the k-PDTM on the same sample of points.

Note that when we take k=250, that is, when k is equal to the sample size, the DTM and the k-PDTM coincide on the points of the sample.

The sublevel sets of the k-PDTM are unions of k balls whose centers are represented by triangles.

In [8]:
' Compute the k-PDTM on X ' 
# compute the values of the k-PDTM with parameters q and k
q = 40
k = 250
sig = X.shape[0]
iter_max = 100
nstart = 10
kPDTM_values, centers, means, variances, colors, cost = kPDTM(X,X,q,k,sig,iter_max,nstart)  
# plot of the opposite of the k-PDTM
fig, ax = plt.subplots()
plot = ax.scatter(X[:,0], X[:,1], c=-kPDTM_values)
fig.colorbar(plot)
for i in range(means.shape[0]):
    ax.scatter(means[i,0],means[i,1],c = "black",marker = "^")
ax.axis('equal')
ax.set_title('Values of -kPDTM on X with parameter q='+str(q)+' and k='+str(k)+'.');
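
To visualise that these sublevel sets are indeed unions of balls, one can draw, for a level t, the ball of centre means[i] and radius sqrt(t**2 - variances[i]) whenever t**2 >= variances[i]. The following sketch (not in the original notebook) reuses the variables of the previous cell; the choice of the level t is arbitrary:

In [ ]:
t = np.quantile(kPDTM_values, 0.5)    # an arbitrary level, here the median k-PDTM value on the sample
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c='tab:cyan', s=10)
for i in range(means.shape[0]):
    if t**2 >= variances[i]:
        radius = np.sqrt(t**2 - variances[i])
        ax.add_patch(plt.Circle((means[i, 0], means[i, 1]), radius, color='tab:orange', fill=False))
ax.axis('equal')
ax.set_title('Sublevel set of the k-PDTM at level t = '+str(round(float(t), 3)));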

Approximating K with a union of k ellipsoids – or – the k-power-likelihood-to-measure (k-PLM)

Sublevel sets of the k-PDTM are unions of