TClust class

class tclust._tclust.TClust(object)[source]

General Trimming Approach to Robust Cluster Analysis

TClust searches for k (or fewer) clusters with different covariance structures in a data matrix X.

To make the estimation robust, a proportion alpha of observations may be trimmed.

This iterative algorithm initializes k clusters randomly and performs “concentration steps” to improve the current cluster assignment. The maximum number of concentration steps is given by ksteps. To approximate the global optimum, the procedure is initialized n_inits times, and concentration steps are performed until convergence or until ksteps is reached. More complex data sets require higher values of n_inits and ksteps, at the cost of extra computation time. If more than half of the initializations do not converge, a warning is issued, indicating that n_inits has to be increased.
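
The loop described above can be sketched in a few lines of NumPy. This is a simplified illustration based on trimmed k-means (Euclidean distances only, no covariance estimation and no restrictions), not the library's implementation. The names `concentration_step` and `trimmed_kmeans` are hypothetical, and the label 0 for trimmed observations is an assumption of this sketch.

```python
import numpy as np

def concentration_step(X, centers, alpha):
    """One simplified concentration step: assign, trim, re-estimate centers."""
    # Squared Euclidean distance from every observation to every center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    best = d2.min(axis=1)
    # Clusters are labelled 1..k; trimmed observations get label 0 (assumption).
    labels = nearest + 1
    n_trim = int(np.ceil(alpha * len(X)))
    if n_trim > 0:
        labels[np.argsort(best)[-n_trim:]] = 0
    # Re-estimate each center from its retained members.
    new_centers = centers.copy()
    for j in range(len(centers)):
        members = X[labels == j + 1]
        if len(members) > 0:
            new_centers[j] = members.mean(axis=0)
    # Objective: squared distances of retained points to the centers used for assignment.
    objective = best[labels > 0].sum()
    return labels, new_centers, objective

def trimmed_kmeans(X, k, alpha=0.05, n_inits=20, ksteps=40, seed=0):
    """Run n_inits random starts and keep the solution with the lowest objective."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_inits):
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        labels = None
        for _ in range(ksteps):
            new_labels, centers, obj = concentration_step(X, centers, alpha)
            if labels is not None and np.array_equal(new_labels, labels):
                break  # converged: the assignment no longer changes
            labels = new_labels
        if best is None or obj < best[2]:
            best = (labels, centers, obj)
    return best

# Two tight groups plus one gross outlier, which should end up trimmed.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [100, 100]], dtype=float)
labels, centers, obj = trimmed_kmeans(X, k=2, alpha=0.1)
```

Keeping the best of several random starts is what guards against poor local optima: a start that picks the outlier as an initial center converges to a much higher objective and is discarded.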

The parameter restr_cov_value defines the restrictions on cluster shape, which are applied to all clusters during each iteration. The options “eigen” and “deter” restrict the ratio between the maximum and minimum eigenvalue (respectively determinant) of all clusters’ covariance structures to the constraint level maxfact_e (respectively maxfact_d). Setting this level to 1 yields the strongest restriction, forcing all eigenvalues (respectively determinants) to be equal, so that the method looks for spherical (respectively similarly scattered) clusters. The option “sigma” is a simpler restriction, which averages the covariance structures during each iteration (weighted by cluster sizes) in order to get similar (equal) cluster scatters.
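
To make the two ratios concrete, the following sketch computes them for a set of covariance matrices. The helper `scatter_ratios` is hypothetical and not part of the package; it only measures the ratios that the “eigen” and “deter” restrictions bound, and does not apply any restriction.

```python
import numpy as np

def scatter_ratios(covariances):
    """Return (eigenvalue ratio, determinant ratio) across all clusters."""
    # Eigenvalues of each symmetric covariance matrix, one row per cluster.
    eigvals = np.array([np.linalg.eigvalsh(S) for S in covariances])
    # The determinant of a matrix is the product of its eigenvalues.
    dets = eigvals.prod(axis=1)
    return eigvals.max() / eigvals.min(), dets.max() / dets.min()

# An elongated cluster and a spherical one with the same determinant.
covs = [np.diag([1.0, 4.0]), np.diag([2.0, 2.0])]
eig_ratio, det_ratio = scatter_ratios(covs)
```

In this example the determinants are equal while the eigenvalues are not, so the two clusters would already satisfy a determinant restriction at level 1 but would need an eigenvalue level of at least 4.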

Note

The trimmed k-means method (tkmeans) can be obtained by setting restr_cov_value=“eigen”, maxfact_e=1 and equal_weights=True.

Parameters
  • k (int, default=2) – The number of clusters

  • alpha (float, default=0.05) – The proportion of observations to be trimmed

  • n_inits (int, default=20) – The number of random initializations to be performed

  • ksteps (int, default=40) – The maximum number of concentration steps to be performed

  • restr_cov_value (string, default='eigen') – The type of restriction to be applied on the cluster scatter matrices. Valid values are {“eigen”, “deter”, “sigma”}.

  • equal_weights (bool, default=False) – Specifying whether cluster weights are equal.

  • zero_tol (float, default=1e-16) – The zero tolerance used.

  • maxfact_e (float, default=5) – Level of eigen constraints.

  • maxfact_d (float, default=5) – Level of determinant constraints.

  • m (float, default=2.) – Fuzzy power parameter.

  • opt (string, default='hard') – Type of assignment. Accepted values are {‘hard’, ‘mixture’, ‘fuzzy’}

  • sol_ini (object of class Iteration, default=None) – Initial solution provided by the user.

  • tk (bool, default=False) – Whether to use tkmeans initialization.

  • verbose (bool, default=True) – Whether to print the progress of the objective function throughout the iterations.

Example

>>> from tclust import TClust
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> clustering = TClust(k=2).fit(X)
>>> clustering.labels_
array([2, 2, 2, 1, 1, 1], dtype=int64)
>>> clustering.iter.center
array([[10.,  2.],
       [ 1.,  2.]])

calc_obj_function(X)[source]

Calculates the objective function value for mixture, hard, and fuzzy assignments

Parameters

X – array of shape=(nsamples, nfeatures) containing the data

Returns

N/A

estimClustPar(X)[source]

Function to estimate model parameters

Parameters

X – array of shape=(nsamples, nfeatures) containing the data

Returns

N/A

findClusterLabels(X)[source]

Obtain the cluster assignment and trimming in the non-fuzzy case (i.e., mixture and hard assignments)

Parameters

X – array of shape=(nsamples, nfeatures) containing the data

Returns

N/A

findFuzzyClusterLabels(X)[source]

Obtain assignment and trimming in the fuzzy case

Parameters

X – array of shape=(nsamples, nfeatures) containing the data

Returns

N/A

fit(X, y=None)[source]

Compute tclust clustering.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training instances to cluster. The observations should be given row-wise.

  • y (Ignored) – Not used, present for API consistency with scikit-learn by convention.

Returns

Fitted estimator.

Return type

self

fit_predict(X, y=None)[source]

Compute cluster centers and predict cluster index for each sample.

Convenience method; equivalent to calling fit(X) followed by predict(X).

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to transform.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns

labels_ – Index of the cluster each sample belongs to.

Return type

ndarray of shape (n_samples,)

get_ll(X)[source]

Helper that computes ll for every sample and every cluster; factored out to avoid repetition in the code.

Parameters

X – array; 2D array of data, of shape [nsamp, nfeat]

Returns

array of shape (nsamp, k); ll

getini()[source]

Calculates the initial cluster sizes

Returns

array, number of samples in each cluster

init_clusters(X)[source]

Calculates the initial cluster assignment and initial values for the parameters

Parameters

X – 2D array of data [samples, dimensions]

Returns

N/A

predict(X)[source]

Predict the closest cluster each sample in X belongs to.

Parameters

X – {array-like, sparse matrix} of shape (n_samples, n_features) New data to predict.

Returns

labels_: ndarray of shape (n_samples, ) Index of the cluster each sample belongs to.

restr2_deter_(autovalues, ni_ini, factor_d, factor_e, zero_tol=1e-16)[source]

Function for applying constraints to the determinants.

Used when p > 1 (the multivariate case); in the univariate case the constraints can be obtained with restr2_eigenv(). To avoid the instability of the current release of this function as implemented on CRAN, it is better to apply these constraints, at the desired level, jointly with eigenvalue constraints at a very loose level (factor_e=1e10). In this way the eigenvalues are not constrained in practice, but numerical issues are avoided.

Parameters
  • autovalues – matrix containing eigenvalues

  • ni_ini – current sample size of the clusters

  • factor_d – constraint level for the determinants

  • factor_e – constraint level for the eigenvalues

  • zero_tol – tolerance level

Returns

?

restr2_eigenv(autovalues, ni_ini, factor_e, zero_tol)[source]

Function for applying eigen constraints. These are the typical constraints.

Parameters
  • autovalues – matrix containing eigenvalues

  • ni_ini – current sample size of the clusters

  • factor_e – level of the constraints

  • zero_tol – tolerance level

Returns

?
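
As a rough illustration of what an eigenvalue restriction does, the sketch below clamps all eigenvalues into an interval [m, factor_e * m], so that their maximum-to-minimum ratio cannot exceed factor_e. This is an assumption-laden simplification: the threshold m is taken as given, whereas choosing it is the substantive part of the method and is not shown, and the function name `truncate_eigenvalues` is hypothetical.

```python
import numpy as np

def truncate_eigenvalues(autovalues, m, factor_e):
    """Clamp every eigenvalue to [m, factor_e * m]; the threshold m is assumed given."""
    return np.clip(np.asarray(autovalues, dtype=float), m, factor_e * m)

# Eigenvalues of two clusters (layout chosen for illustration only).
eigvals = np.array([[0.1, 5.0],
                    [1.0, 50.0]])
restricted = truncate_eigenvalues(eigvals, m=1.0, factor_e=10.0)
```

Eigenvalues already inside the interval are left untouched, so only the extreme ones are modified.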

restr_avgcov(p)[source]

Restricts the clusters’ covariance matrices to be equal by computing the pooled within-group covariance matrix.

Parameters

p – int, number of dimensions of the data

Returns

N/A
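
A pooled within-group covariance matrix is the size-weighted average of the per-cluster covariance matrices. The sketch below shows that computation on its own; `pooled_covariance` and its arguments are hypothetical names, and the method's actual internals may differ.

```python
import numpy as np

def pooled_covariance(covariances, sizes):
    """Average k covariance matrices of shape (p, p), weighted by cluster size."""
    covariances = np.asarray(covariances, dtype=float)  # shape (k, p, p)
    weights = np.asarray(sizes, dtype=float)
    weights = weights / weights.sum()
    # Contract the cluster axis: sum_j weights[j] * covariances[j].
    return np.tensordot(weights, covariances, axes=1)

# A large tight cluster and a small diffuse one.
covs = [np.eye(2), 3.0 * np.eye(2)]
pooled = pooled_covariance(covs, sizes=[30, 10])
```

Here the result is 1.5 times the identity matrix, closer to the covariance of the larger cluster.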

restr_diffax(p)[source]

Function which manages the application of constraints (deter, eigen)

Parameters

p – int, number of features of the data

Returns

N/A

treatSingularity()[source]

To manage singular situations.

Returns

N/A

tclust._tclust.tkmeans(X, k, alpha=0.05, niter=20, ksteps=10, equal_weights=False, maxfact_d=5, m=2.0, zero_tol=1e-16, opt='hard', sol_ini=None, verbose=False)[source]

Convenience function to run trimmed k-means on the data.

Parameters
  • X (array) – Data to be clustered. shape=(n_observations, n_dimensions)

  • k (int) – The number of clusters

  • alpha (float, default=0.05) – The proportion of observations to be trimmed

  • niter (int, default=20) – The number of random initializations to be performed

  • ksteps (int, default=10) – The maximum number of concentration steps to be performed

  • equal_weights (bool, default=False) – Specifying whether cluster weights are equal.

  • maxfact_d (float, default=5) – Level of determinant constraints.

  • m (float, default=2.) – Fuzzy power parameter.

  • zero_tol (float, default=1e-16) – The zero tolerance used.

  • opt (string, default='hard') – Type of assignment. Accepted values are {‘hard’, ‘mixture’, ‘fuzzy’}

  • sol_ini (object of class Iteration, default=None) – Initial solution provided by the user. None is used for random initializations.

  • verbose (bool, default=False) – Whether to print the progress of the objective function throughout the iterations.

Returns

Fitted TClust estimator.

Return type

TClust

tclust._tclust.dmnorm(x, mu, sigma)[source]

Multivariate normal density

Parameters
  • x – array of shape=(nsamples, nfeatures) containing the data

  • mu – center of the cluster [features, ]

  • sigma – covariance matrix of the cluster [features, features]

Returns

?
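
For reference, the multivariate normal density can be evaluated row by row as in the following sketch. It is written from the textbook formula, not taken from the package; `mvn_density` is a hypothetical name, and returning one density value per row of x is an assumption.

```python
import numpy as np

def mvn_density(x, mu, sigma):
    """Multivariate normal density of each row of x, given mean mu and covariance sigma."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    p = x.shape[1]
    diff = x - mu
    # Squared Mahalanobis distance per row: diff_i^T sigma^{-1} diff_i.
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(sigma, diff.T).T)
    # Normalizing constant: sqrt((2*pi)^p * det(sigma)).
    norm_const = np.sqrt((2 * np.pi) ** p * np.linalg.det(sigma))
    return np.exp(-0.5 * maha) / norm_const

# Density at the mean of a standard bivariate normal is 1 / (2*pi).
density_at_mean = mvn_density([[0.0, 0.0]], mu=[0.0, 0.0], sigma=np.eye(2))[0]
```

Solving the linear system instead of inverting sigma explicitly is the usual choice for numerical stability.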

tclust._tclust.dmnorm_tk(x, mu, lambd)[source]

Multivariate normal density with spherical covariance sigma = lambd * I, where I is the identity matrix

Parameters
  • x – array of shape=(nsamples, nfeatures) containing the data

  • mu – center of the cluster [features, ]

  • lambd – scalar; the common diagonal value of the covariance matrix, so that sigma = lambd * I (used by tkmeans)

Returns

?
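
With a covariance of the form lambd * I the density simplifies, because the Mahalanobis distance reduces to the squared Euclidean distance divided by lambd and the determinant is lambd to the power p. The sketch below writes out that special case; `spherical_density` is a hypothetical name and the per-row return value is an assumption.

```python
import numpy as np

def spherical_density(x, mu, lambd):
    """Normal density of each row of x with covariance lambd * I."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    p = x.shape[1]
    # Squared Euclidean distance from each row to the center.
    sq_dist = ((x - np.asarray(mu, dtype=float)) ** 2).sum(axis=1)
    return np.exp(-0.5 * sq_dist / lambd) / (2 * np.pi * lambd) ** (p / 2)

value = spherical_density([[1.0, 0.0]], mu=[0.0, 0.0], lambd=2.0)[0]
```

No matrix inversion or determinant is needed in this case, which is why a dedicated routine for trimmed k-means can be cheaper than the general density.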