TClust class
class tclust._tclust.TClust(object)
General Trimming Approach to Robust Cluster Analysis
TClust searches for k (or fewer) clusters with different covariance structures in a data matrix x.
To make the estimation robust, a proportion alpha of observations may be trimmed.
This iterative algorithm initializes k clusters randomly and performs “concentration steps” to improve the current cluster assignment. The maximum number of concentration steps is given by ksteps. To approximate the global optimum, the algorithm is initialized n_inits times, and concentration steps are performed until convergence or until ksteps is reached. More complex data sets require higher values of n_inits and ksteps, at the cost of extra computation time. If more than half of the runs do not converge, a warning is issued, indicating that n_inits has to be increased.
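The idea of a concentration step can be sketched for the simplest case, trimmed k-means on 2D points. This is a minimal illustration under stated assumptions, not the library's implementation: the function name `concentration_step`, the use of Euclidean distance, the `ceil(alpha * n)` trimming count, and the label -1 for trimmed points are all choices made for this sketch.

```python
import math

def concentration_step(points, centers, alpha):
    """One simplified concentration step for trimmed k-means.

    points  : list of (x, y) tuples
    centers : list of (x, y) tuples, the current cluster centers
    alpha   : proportion of observations to trim
    Returns (labels, new_centers); label -1 marks a trimmed point.
    """
    n = len(points)

    # For each point, find the distance to its nearest center and that center's index.
    nearest = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        j = min(range(len(centers)), key=dists.__getitem__)
        nearest.append((dists[j], j))

    # Trim the ceil(alpha * n) points lying farthest from their nearest center.
    n_trim = math.ceil(alpha * n)
    order = sorted(range(n), key=lambda i: nearest[i][0])
    kept = set(order[: n - n_trim])

    labels = [nearest[i][1] if i in kept else -1 for i in range(n)]

    # Recompute each center as the mean of its untrimmed members.
    new_centers = []
    for j, old in enumerate(centers):
        members = [points[i] for i in sorted(kept) if nearest[i][1] == j]
        if members:
            new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
        else:
            new_centers.append(old)  # an empty cluster keeps its previous center
    return labels, new_centers


points = [(1, 2), (1, 4), (1, 0), (10, 2), (10, 4), (10, 0), (50, 50)]
labels, centers = concentration_step(points, [(0, 0), (9, 0)], alpha=0.1)
print(labels)   # the outlier (50, 50) is trimmed
print(centers)
```

Repeating this step until the labels stop changing (or a step limit is hit), and restarting from several random initial centers, mirrors the roles that ksteps and n_inits play in the description above.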
The parameter restr_cov_value defines the shape restrictions, which are applied to all clusters during each iteration. Options “eigen” and “deter” restrict the ratio between the maximum and minimum eigenvalue (respectively determinant) of all clusters’ covariance structures to the restriction factor (maxfact_e for “eigen”, maxfact_d for “deter”). Setting the factor to 1 yields the strongest restriction, forcing all eigenvalues (respectively determinants) to be equal, so the method looks for spherical clusters (“eigen”) or similarly scattered clusters (“deter”). Option “sigma” is a simpler restriction, which averages the covariance structures during each iteration (weighted by cluster sizes) in order to get equal cluster scatters.
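To make the “eigen” restriction concrete, the following sketch computes the quantity being restricted: the ratio between the largest and smallest eigenvalue over all cluster covariance matrices. It only checks the ratio and does not enforce it. The helper names are hypothetical, and the closed-form eigenvalue formula is limited to symmetric 2x2 matrices to keep the example dependency-free.

```python
import math

def eigenvalues_2x2(cov):
    """Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]], largest first."""
    (a, b), (_, d) = cov
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    return mean + radius, mean - radius

def eigen_ratio(covs):
    """Ratio of the largest to the smallest eigenvalue across all clusters."""
    eigs = [e for cov in covs for e in eigenvalues_2x2(cov)]
    return max(eigs) / min(eigs)

covs = [
    [[4.0, 0.0], [0.0, 1.0]],  # elongated cluster: eigenvalues 4 and 1
    [[2.0, 0.0], [0.0, 2.0]],  # spherical cluster: eigenvalues 2 and 2
]
ratio = eigen_ratio(covs)
print(ratio)        # 4.0
print(ratio <= 5)   # True: a restriction factor of 5 is satisfied
print(ratio <= 1)   # False: a factor of 1 would require all eigenvalues to be equal
```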
Note
The trimmed k-means method (tkmeans) can be obtained by setting parameters restr_cov_value=“eigen”, maxfact_e=1 and equal_weights=True.
- Parameters
k (int, default=2) – The number of clusters
alpha (float, default=0.05) – The proportion of observations to be trimmed
n_inits (int, default=20) – The number of random initializations to be performed
ksteps (int, default=40) – The maximum number of concentration steps to be performed
restr_cov_value (string, default='eigen') – The type of restriction to be applied on the cluster scatter matrices. Valid values are {“eigen”, “deter”, “sigma”}.
equal_weights (bool, default=False) – Specifying whether cluster weights are equal.
zero_tol (float, default=1e-16) – The zero tolerance used.
maxfact_e (float, default=5) – Level of eigen constraints.
maxfact_d (float, default=5) – Level of determinant constraints.
m (float, default=2.) – Fuzzy power parameter.
opt (string, default='hard') – Type of assignment. Accepted values are {‘hard’, ‘mixture’, ‘fuzzy’}
sol_ini (object of class Iteration, default=None) – Initial solution provided by the user.
tk (bool, default=False) – Whether to use tkmeans initialization.
verbose (bool, default=True) – Whether to print the progress of the objective function throughout the iterations.
Example
>>> from tclust import TClust
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> clustering = TClust(k=2).fit(X)
>>> clustering.labels_
array([2, 2, 2, 1, 1, 1], dtype=int64)
>>> clustering.iter.center
array([[10.,  2.],
       [ 1.,  2.]])
-
calc_obj_function(X)
Calculates the objective function value for mixture, hard, and fuzzy assignments
- Parameters
X – array of shape=(nsamples, nfeatures) containing the data
- Returns
N/A
-
estimClustPar(X)
Function to estimate model parameters
- Parameters
X – array of shape=(nsamples, nfeatures) containing the data
- Returns
N/A
-
findClusterLabels(X)
Obtain the cluster assignment and trimming in the non-fuzzy case (i.e., mixture and hard assignments)
- Parameters
X – array of shape=(nsamples, nfeatures) containing the data
- Returns
N/A
-
findFuzzyClusterLabels(X)
Obtain assignment and trimming in the fuzzy case
- Parameters
X – array of shape=(nsamples, nfeatures) containing the data
- Returns
N/A
-
fit(X, y=None)
Compute tclust clustering.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Training instances to cluster. The observations should be given row-wise.
y (Ignored) – Not used, present for API consistency with scikit-learn by convention.
- Returns
Fitted estimator.
- Return type
self
-
fit_predict(X, y=None)
Compute cluster centers and predict cluster index for each sample.
Convenience method; equivalent to calling fit(X) followed by predict(X).
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – New data to transform.
y (Ignored) – Not used, present here for API consistency by convention.
- Returns
labels_ – Index of the cluster each sample belongs to.
- Return type
ndarray of shape (n_samples,)
-
get_ll(X)
Helper function, extracted to avoid repetition in the code, that computes the array ll for the data X.
- Parameters
X – array; 2D array of data, of shape [nsamp, nfeat]
- Returns
array of shape (nsamp, k); ll
-
getini()
Calculates the initial cluster sizes
- Returns
array, number of samples in each cluster
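One plausible way to draw random initial cluster sizes is sketched below. This is an assumption for illustration only, since the procedure used by getini() is not described here: the function `random_initial_sizes`, its arguments, and the weighted random assignment are all hypothetical. The sketch distributes the untrimmed observations over k clusters using randomly drawn cluster weights.

```python
import random
from collections import Counter

def random_initial_sizes(n_kept, k, seed=None):
    """Split n_kept untrimmed observations over k clusters at random.

    Hypothetical helper: draws one random weight per cluster, then assigns
    each observation to a cluster with probability proportional to its weight.
    """
    rng = random.Random(seed)
    weights = [rng.random() for _ in range(k)]
    draws = rng.choices(range(k), weights=weights, k=n_kept)
    counts = Counter(draws)
    return [counts[j] for j in range(k)]  # clusters never drawn get size 0

sizes = random_initial_sizes(n_kept=95, k=3, seed=0)
print(sizes, sum(sizes))  # three sizes that add up to 95
```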
-
init_clusters(X)
Calculates the initial cluster assignment and initial values for the parameters
- Parameters
X – 2D array of data [samples, dimensions]
- Returns
N/A
-
predict(X)
Predict the closest cluster each sample in X belongs to.
- Parameters
X – {array-like, sparse matrix} of shape (n_samples, n_features) New data to predict.
- Returns
labels_: ndarray of shape (n_samples, ) Index of the cluster each sample belongs to.
-
restr2_deter_(autovalues, ni_ini, factor_d, factor_e, zero_tol=1e-16)
Function for applying constraints to the determinants.
Used when p>1 (multivariate case); in the univariate case the constraints can be obtained with restr2_eigenv(). To avoid the instability of the current CRAN release of this function, it is better to apply these constraints, at the desired level, jointly with eigenvalue constraints at a very weak level (factor_e=1e10). In this way the eigenvalues are not constrained in practice, but numerical issues are avoided.
- Parameters
autovalues – matrix containing eigenvalues
ni_ini – current sample size of the clusters
factor_d – constraint level for the determinants
factor_e – constraint level for the eigenvalues
zero_tol – tolerance level
- Returns
?
-
restr2_eigenv(autovalues, ni_ini, factor_e, zero_tol)
Function for applying eigen constraints. These are the typical constraints.
- Parameters
autovalues – matrix containing eigenvalues
ni_ini – current sample size of the clusters
factor_e – level of the constraints
zero_tol – tolerance level
- Returns
?
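The effect of an eigenvalue constraint can be illustrated by clamping eigenvalues into an interval whose endpoints differ by the factor factor_e. This is a simplified sketch, not the routine documented above: how restr2_eigenv selects the truncation level (and how it uses ni_ini and zero_tol) is not described here, so the sketch takes the lower bound as an explicit, hypothetical argument `lower`, and stores eigenvalues as one list per cluster.

```python
def clip_eigenvalues(eigenvalues, factor_e, lower):
    """Clamp every eigenvalue into [lower, lower * factor_e].

    eigenvalues : list of per-cluster eigenvalue lists (layout assumed here)
    factor_e    : maximum allowed ratio between largest and smallest eigenvalue
    lower       : truncation level, supplied by the caller in this sketch
    """
    upper = lower * factor_e
    return [[min(max(ev, lower), upper) for ev in cluster] for cluster in eigenvalues]

eigs = [[9.0, 0.5], [2.0, 1.0]]           # ratio 9.0 / 0.5 = 18 violates factor_e = 4
clipped = clip_eigenvalues(eigs, factor_e=4, lower=1.0)
print(clipped)                             # [[4.0, 1.0], [2.0, 1.0]]

flat = [ev for cluster in clipped for ev in cluster]
print(max(flat) / min(flat))               # 4.0, so the constraint now holds
```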
-
restr_avgcov(p)
Restricts the clusters’ covariance matrices to be equal. Simple function to get the pooled within group covariance matrix.
- Parameters
p – int, number of dimensions of the data
- Returns
N/A
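The pooled within-group covariance is a size-weighted average of the per-cluster covariance matrices. The sketch below shows that computation on plain nested lists; the function name `pooled_covariance` and its arguments are hypothetical and do not reflect how restr_avgcov accesses the estimator's internal state.

```python
def pooled_covariance(covs, sizes):
    """Average p x p covariance matrices, weighting each by its cluster size."""
    total = sum(sizes)
    p = len(covs[0])
    return [
        [sum(n * cov[r][c] for cov, n in zip(covs, sizes)) / total for c in range(p)]
        for r in range(p)
    ]

covs = [
    [[4.0, 0.0], [0.0, 1.0]],  # cluster with 30 observations
    [[1.0, 0.5], [0.5, 2.0]],  # cluster with 10 observations
]
pooled = pooled_covariance(covs, sizes=[30, 10])
print(pooled)  # [[3.25, 0.125], [0.125, 1.25]]
```

Every cluster would then use this single matrix as its scatter estimate, which is what makes the clusters equally scattered.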
-
tclust._tclust.tkmeans(X, k, alpha=0.05, niter=20, ksteps=10, equal_weights=False, maxfact_d=5, m=2.0, zero_tol=1e-16, opt='hard', sol_ini=None, verbose=False)
Convenience function to run trimmed k-means on the data.
- Parameters
X (array) – Data to be clustered. shape=(n_observations, n_dimensions)
k (int) – The number of clusters
alpha (float, default=0.05) – The proportion of observations to be trimmed
niter (int, default=20) – The number of random initializations to be performed
ksteps (int, default=10) – The maximum number of concentration steps to be performed
equal_weights (bool, default=False) – Specifying whether cluster weights are equal.
maxfact_d (float, default=5) – Level of determinant constraints.
m (float, default=2.) – Fuzzy power parameter.
zero_tol (float, default=1e-16) – The zero tolerance used.
opt (string, default='hard') – Type of assignment. Accepted values are {‘hard’, ‘mixture’, ‘fuzzy’}
sol_ini (object of class Iteration, default=None) – Initial solution provided by the user. None is used for random initializations.
verbose (bool, default=False) – Whether to print the progress of the objective function throughout the iterations.
- Returns
Fitted TClust estimator.
- Return type
TClust