damina::SVClustering Class Reference

An implementation of the Support Vector Clustering originally proposed by Ben-Hur et al (2001). More...

#include <SVClustering.h>

Inheritance diagram for damina::SVClustering:

Inheritance graph
[legend]
Collaboration diagram for damina::SVClustering:

Collaboration graph
[legend]
List of all members.

Public Member Functions

virtual void clusterize ()
 Run the clustering process.
virtual std::vector< int > & getClustersAssignment ()
 Returns the clustering assignments.
virtual int getKernelType ()
 Returns the current kernel type.
virtual double getKernelWidth ()
 Returns the current value of the Kernel Width.
virtual struct svm_model * getModel ()
 Returns the trained model.
virtual unsigned long int getNumberOfClusters ()
 Returns the number of clusters found.
virtual unsigned long int getNumberOfNonSingletonClusters ()
 Returns the number of clusters containing more than one point.
virtual unsigned long int getNumberOfValidClusters (unsigned long int)
 Returns the number of clusters with a number of elements greater than or equal to a certain threshold.
virtual std::vector< int > & getOriginalClasses (int)
 Returns the original classes assignments for the points in input.
virtual struct svm_parameter * getParameters ()
 Returns the SVM paramters.
virtual std::vector< unsigned
long int > & 
getPointPerCluster ()
 Returns the number of points per cluster.
virtual struct svm_problem * getProblem ()
 Returns the current input data set in the libsvm format.
virtual double getSoftConstraint ()
 Returns the current value of the Soft Margin Constraint.
virtual double getSoftConstraintEstimate ()
 Return the estimate of a more appropriate Soft Margin parameter.
virtual double getSphereRadius ()
 Returns the current value of the sphere radius.
virtual DataSetgetTrainingSet ()
 Returns the current input data set.
virtual double initialKernelWidth ()
 Computes the initial gaussian kernel width, as proposed in.
virtual void learn (struct svm_problem *)
 The learning process.
virtual void learn (DataSet *)
 The learning process.
virtual void learn ()
 The learning process.
virtual void setEuclideanDistance (EuclideanDistance *)
 Set a vector-distance measure valid for the Euclidean Space.
virtual void setKernelType (int)
 Set the kernel type for the SVM.
virtual void setKernelWidth (double)
 Set the new width for Gaussian Kernel.
virtual void setProblem (struct svm_problem *)
 Set the input data set in libsvm format.
virtual void setSoftConstraint (double)
 Set the new value for the Soft Margin Constant.
virtual void setTrainingSet (DataSet *)
 Set the input data set.
 SVClustering (double, double, struct svm_problem *)
 Constructor.
 SVClustering (double, double, DataSet *)
 Constructor.
 SVClustering (double, struct svm_problem *)
 Constructor.
 SVClustering (double, DataSet *)
 Constructor.
 SVClustering (struct svm_problem *)
 Constructor.
 SVClustering (DataSet *)
 Constructor.
virtual void useClassicClusteringPolicyForBSV ()
 Use the classic clustering policy to clusterize BSVs.
virtual void useSpecialClusteringPolicyForBSV ()
 Use a special clustering policy to clusterize BSVs.
virtual bool usingSpecialClusteringPolicyForBSV ()
 Return true if the SVC is using the special policy to clusterize BSV.
virtual ~SVClustering ()
 Destructor.

Protected Member Functions

virtual double beta (int, bool)
 Returns the i-th lagrangian multiplier.
virtual double beta (int)
 Returns the i-th lagrangian multiplier.
virtual double calculateDistanceFromCenter (struct svm_node *)
 Calculate the distance of a point from the center of the sphere.
virtual void calculateSphereRadius ()
 Calculate the sphere radius.
virtual struct svm_node * pointOnThePath (struct svm_node *, struct svm_node *, double)
 Samples a point on the path between two points.
virtual double rho (bool)
 Returns the rho value of the decision function computed by One Class SVM.
virtual double rho ()
 Returns the rho value of the decision function computed by One Class SVM.
virtual void separateClusters ()
 Run the clustering process.

Protected Attributes

double C_estimation
 An estimation of a more appropriate C value, based on the heuristics proposed in the last paragraph of.
EuclideanDistanceeucDist
 Euclidean Measure for compute distance between vectors.
std::vector< int > labels
 Clustering assignments.
OneClassSVMtrainer
 The One Class SVM for the first step of the SV Clustering process.

Private Member Functions

void calculateQuadraticPartOfDistanceFromCenter ()
 Calculate the quadratic part of the point distance equation.
virtual void clusterize (DataSet *)
virtual DataSetgetTestSet ()
virtual void setTestSet (DataSet *)

Private Attributes

double C
 The soft constraint in the original SV Clustering formulation.
bool clusterizeBSV
long int numberOfClusters
 The number of clusters found.
std::vector< int > originalClasses
 This could be contains the right assignments.
std::vector< unsigned long
int > 
pointPerCluster
 Number of points per cluster.
double * quadratic
 The quadratic part of the distance from the sphere's center It's the same for all points.
double sphereRadius
 The radius of the Sphere in the Feature space.

Detailed Description

An implementation of the Support Vector Clustering originally proposed by Ben-Hur et al (2001).

This class implements the Support Vector Clustering as proposed in

A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "Support Vector Clustering," Journal of Machine Learning Research, vol. 2, pp. 125-137, 2001.

The Support Vector Clustering is among the few attempts to use the SVMs in an unsupervised way.

It consists mainly of two steps:

1. Cluster description
=======================

Find the Minimum Enclosing Sphere in the feature space, exactly as SVDD described in

D. M. J. Tax and R. P. W. Duin, "Data Domain Description using Support Vectors," in European Symposium on Artificial Neural Network, Bruges (Belgium), 1999.

D. M. J. TAX, "One-class classification: concept learning in the absence of counter-examples," PhD Thesis , 2001.

This leads to a Quadratic Programming problem like in the classic SVM training problem

This sphere, mapped back to the data space, split in various contours, each representing a cluster.

Here, we use One-class classification by Schoelkopf et al (2001) instead of SVDD by Tax.

The two methods are showed to be equivalent under following conditions:
a. The data point must be normalized to the unit norm vectors (implicitly done using the Gaussian Kernel)
b. The kernel width must be the same
c. The C parameter in SVDD must be equal to 1/(nu*N), where nu is the parameter in One-class SVM and N is the number of input elements.

2. Cluster labeling
===================

The cluster description algorithm does not differentiate between points that belong to differ- ent clusters. To do so, we use a geometric approach involving R(x), based on the following observation: given a pair of data points that belong to different components (clusters), any path that connects them must exit from the sphere in feature space.

So, we build a graph such that two vertex are connected if their images in feature space are connected by a path which lie entirely inside the sphere. The connected components algorithm on this graph returns the clusters.

See also:
OneClassSVM
Author:
Vincenzo Russo - vincenzo.russo@neminis.org

Definition at line 98 of file SVClustering.h.


Constructor & Destructor Documentation

damina::SVClustering::SVClustering ( DataSet train  ) 

Constructor.

Parameters:
train The input data set

Definition at line 12 of file SVClustering.cpp.

References clusterizeBSV, eucDist, damina::KernelType::gaussian, initialKernelWidth(), quadratic, damina::AbstractSVM::setCacheSize(), damina::AbstractSVM::setKernel(), damina::AbstractSVM::setKernelWidth(), damina::AbstractSVM::setTolerance(), damina::OneClassSVM::setTrainingSet(), and trainer.

00012                                                  {
00013                 this->eucDist = new L2Distance();
00014                 this->trainer = new OneClassSVM();
00015                 this->trainer->setKernel(KernelType::gaussian);
00016                 this->trainer->setCacheSize(100);
00017                 this->trainer->setTolerance(0.000001);
00018                 this->trainer->setTrainingSet(train);
00019                 this->trainer->setKernelWidth(initialKernelWidth());
00020                 this->quadratic = NULL;
00021                 this->clusterizeBSV = true;
00022         }

Here is the call graph for this function:

damina::SVClustering::SVClustering ( struct svm_problem *  p  ) 

Constructor.

Parameters:
p The input data in libsvm format

Definition at line 29 of file SVClustering.cpp.

References clusterizeBSV, eucDist, damina::KernelType::gaussian, initialKernelWidth(), quadratic, damina::AbstractSVM::setCacheSize(), damina::AbstractSVM::setKernel(), damina::AbstractSVM::setKernelWidth(), damina::OneClassSVM::setProblem(), damina::AbstractSVM::setTolerance(), and trainer.

00029                                                         {
00030                 this->eucDist = new L2Distance();
00031                 this->trainer = new OneClassSVM();
00032                 this->trainer->setKernel(KernelType::gaussian);
00033                 this->trainer->setCacheSize(100);
00034                 this->trainer->setTolerance(0.000001);
00035                 this->trainer->setProblem(p);
00036                 this->trainer->setKernelWidth(initialKernelWidth());
00037                 this->quadratic = NULL;
00038                 this->clusterizeBSV = true;
00039         }

Here is the call graph for this function:

damina::SVClustering::SVClustering ( double  C,
DataSet train 
)

Constructor.

Parameters:
C The soft margin constant
train The input data set

Definition at line 47 of file SVClustering.cpp.

References clusterizeBSV, eucDist, damina::KernelType::gaussian, damina::DataSet::getSize(), initialKernelWidth(), quadratic, damina::AbstractSVM::setCacheSize(), damina::AbstractSVM::setKernel(), damina::AbstractSVM::setKernelWidth(), damina::OneClassSVM::setNU(), damina::AbstractSVM::setTolerance(), damina::OneClassSVM::setTrainingSet(), and trainer.

00047                                                            {
00048                 this->eucDist = new L2Distance();
00049                 this->trainer = new OneClassSVM();
00050                 this->trainer->setKernel(KernelType::gaussian);
00051                 this->trainer->setCacheSize(100);
00052                 this->trainer->setTolerance(0.000001);
00053                 
00054                 this->trainer->setTrainingSet(train);
00055                 
00056                 this->trainer->setKernelWidth(initialKernelWidth());
00057 
00058                 
00059                 this->trainer->setNU((1 / (train->getSize() * C)));
00060                 this->C = C;
00061                 this->quadratic = NULL;
00062                 this->clusterizeBSV = true;
00063         }

Here is the call graph for this function:

damina::SVClustering::SVClustering ( double  C,
struct svm_problem *  prob 
)

Constructor.

Parameters:
C The soft margin constant
prob The input data set in libsvm format

Definition at line 71 of file SVClustering.cpp.

References clusterizeBSV, eucDist, damina::KernelType::gaussian, initialKernelWidth(), quadratic, damina::AbstractSVM::setCacheSize(), damina::AbstractSVM::setKernel(), damina::AbstractSVM::setKernelWidth(), damina::OneClassSVM::setNU(), damina::OneClassSVM::setProblem(), damina::AbstractSVM::setTolerance(), and trainer.

00071                                                                      {
00072                 this->eucDist = new L2Distance();
00073                 this->trainer = new OneClassSVM();
00074                 this->trainer->setKernel(KernelType::gaussian);
00075                 this->trainer->setCacheSize(100);
00076                 this->trainer->setTolerance(0.000001);
00077                 
00078                 this->trainer->setProblem(prob);
00079                 
00080                 this->trainer->setKernelWidth(initialKernelWidth());
00081                 
00082                 this->trainer->setNU((1 / (prob->l * C)));
00083                 this->C = C;
00084                 this->quadratic = NULL;
00085                 this->clusterizeBSV = true;
00086         }

Here is the call graph for this function:

damina::SVClustering::SVClustering ( double  C,
double  w,
DataSet train 
)

Constructor.

Parameters:
C The soft margin constant
w The gaussian kernel width
train The input data set

Definition at line 95 of file SVClustering.cpp.

References clusterizeBSV, eucDist, damina::KernelType::gaussian, damina::DataSet::getSize(), quadratic, damina::AbstractSVM::setCacheSize(), damina::AbstractSVM::setKernel(), damina::AbstractSVM::setKernelWidth(), damina::OneClassSVM::setNU(), damina::AbstractSVM::setTolerance(), damina::OneClassSVM::setTrainingSet(), and trainer.

00095                                                                      {
00096                 this->eucDist = new L2Distance();
00097                 this->trainer = new OneClassSVM();
00098                 this->trainer->setKernel(KernelType::gaussian);
00099                 this->trainer->setCacheSize(100);
00100                 this->trainer->setTolerance(0.000001);
00101                 
00102                 this->trainer->setTrainingSet(train);
00103                 
00104                 this->trainer->setNU((1 / (train->getSize() * C)));
00105                 this->C = C;
00106                 
00107                 this->trainer->setKernelWidth(w);
00108                 this->quadratic = NULL;
00109                 this->clusterizeBSV = true;
00110         }

Here is the call graph for this function:

damina::SVClustering::SVClustering ( double  C,
double  w,
struct svm_problem *  prob 
)

Constructor.

Parameters:
C The soft margin constant
w The gaussian kernel width
prob The input data set in libsvm format

Definition at line 119 of file SVClustering.cpp.

References clusterizeBSV, eucDist, damina::KernelType::gaussian, quadratic, damina::AbstractSVM::setCacheSize(), damina::AbstractSVM::setKernel(), damina::AbstractSVM::setKernelWidth(), damina::OneClassSVM::setNU(), damina::OneClassSVM::setProblem(), damina::AbstractSVM::setTolerance(), and trainer.

00119                                                                                {
00120                 this->eucDist = new L2Distance();
00121                 this->trainer = new OneClassSVM();
00122                 this->trainer->setKernel(KernelType::gaussian);
00123                 this->trainer->setCacheSize(100);
00124                 this->trainer->setTolerance(0.000001);
00125                 
00126                 this->trainer->setProblem(prob);
00127                 
00128                 this->trainer->setNU((1 / (prob->l * C)));
00129                 this->C = C;
00130                 
00131                 this->trainer->setKernelWidth(w);
00132                 this->quadratic = NULL;
00133                 this->clusterizeBSV = true;
00134         }

Here is the call graph for this function:

damina::SVClustering::~SVClustering (  )  [virtual]

Destructor.

Definition at line 140 of file SVClustering.cpp.

References eucDist, and trainer.

00140                                     {
00141                 delete trainer;
00142                 delete eucDist;
00143                 // if (quadratic != NULL) free(quadratic);
00144         }


Member Function Documentation

double damina::SVClustering::beta ( int  i,
bool  scaled 
) [protected, virtual]

Returns the i-th lagrangian multiplier.

The One Class SVM solves the Quadratic Programming problem and computes the lagrangians "beta". LibSVM solves a scaled version of the One Class SVM problem, so this method returns either the scaled value of a beta or the normal one, depeding on the 'scaled' param.

To obtain the normal value of a lagrangian, we multiply it by C.

The betas equal to zero (non-support vector ones) are not in the solution.

Parameters:
i The index of the lagrangian
scaled True if you want the scaled version, false otherwise
Returns:
The value of the i-th beta (scaled or not, depending on 'scaled' param)

Definition at line 590 of file SVClustering.cpp.

References damina::OneClassSVM::getModel(), getSoftConstraint(), and trainer.

00590                                                     {
00591                 if (scaled) {
00592                         return this->trainer->getModel()->sv_coef[0][i];
00593                 }
00594                 else {
00595                         return this->trainer->getModel()->sv_coef[0][i] * getSoftConstraint();
00596                 }
00597         }

Here is the call graph for this function:

double damina::SVClustering::beta ( int  i  )  [protected, virtual]

Returns the i-th lagrangian multiplier.

The One Class SVM solves the Quadratic Programming problem and computes the lagrangians "beta". This method returns the i-th beta, multiplied by C, because LibSVM solves a scaled version of the One Class SVM problem. The betas equal to zero (non-support vector ones) are not in the solution.

Parameters:
i The index of the lagrangian
Returns:
The value of the i-th beta

Definition at line 569 of file SVClustering.cpp.

References C, damina::OneClassSVM::getModel(), and trainer.

Referenced by calculateDistanceFromCenter(), calculateQuadraticPartOfDistanceFromCenter(), damina::CCLSVClustering::experimentalSeparateClusters(), and damina::CCLSVClustering::separateClusters().

00569                                         {
00570                 return this->trainer->getModel()->sv_coef[0][i] * C; 
00571          }

Here is the call graph for this function:

Here is the caller graph for this function:

double damina::SVClustering::calculateDistanceFromCenter ( struct svm_node *  x  )  [protected, virtual]

Calculate the distance of a point from the center of the sphere.

Reference:

A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "Support Vector Clustering," Journal of Machine Learning Research, vol. 2, pp. 125-137, 2001.

The Equation nr. 13 in the paper above is the distance of a point from the center of Sphere

We sum over support vectors only (including Bounded SVs), because the Betas (lagrangian multipliers) are zero for non-SVs.

Furthermore, we replace K(x,x) with 1, because with the Gaussian Kernel, K(x,x) = 1 for all x.

See also:
calculateQuadraticPartOfDistanceFromCenter()
Parameters:
x The point

Definition at line 695 of file SVClustering.cpp.

References beta(), calculateQuadraticPartOfDistanceFromCenter(), damina::OneClassSVM::getModel(), and trainer.

Referenced by calculateSphereRadius(), and separateClusters().

00695                                                                            {
00696                 double sum = 0;
00697                 
00698                 struct svm_node **SV = this->trainer->getModel()->SV;
00699                 int nSV = this->trainer->getModel()->l;
00700                 
00701                 
00702                 for (int i = 0; i < nSV; i++) {
00703                         sum += beta(i) * kernelfunction(SV[i], x, this->trainer->getModel()->param);
00704                 }
00705                 
00706                 
00707                 calculateQuadraticPartOfDistanceFromCenter();
00708                 
00709                 // 1 is because (in case of RBF kernel used here) K(x,x) is equal to 1 
00710                 // (therotically provable and pratically tested with libsvm) 
00711                 //return sqrt(1 - (2 * sum) + *(this->quadratic));
00712                 return 1 - (2 * sum) + *(this->quadratic);
00713         }

Here is the call graph for this function:

Here is the caller graph for this function:

void damina::SVClustering::calculateQuadraticPartOfDistanceFromCenter (  )  [private]

Calculate the quadratic part of the point distance equation.

Reference:

A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "Support Vector Clustering," Journal of Machine Learning Research, vol. 2, pp. 125-137, 2001.

The Equation nr. 13 in the paper above is the distance of a point from the center of Sphere

In that equation, the quadratic part (sum over i, j beta_i beta_j K(x_i, x_j)) is the same for all points, so we compute it once.

Furthermore, we sum over support vectors only (including Bounded SVs), because the Betas (lagrangian multipliers) are zero for non-SVs

Definition at line 652 of file SVClustering.cpp.

References beta(), damina::OneClassSVM::getModel(), quadratic, and trainer.

Referenced by calculateDistanceFromCenter().

00652                                                                       {
00653                 
00654                 if (quadratic != NULL) {
00655                         return;
00656                 }
00657                 
00658                 struct svm_node **SV = this->trainer->getModel()->SV;
00659                 int nSV = this->trainer->getModel()->l;
00660                 double tmp = 0;
00661                 
00662                 
00663                 for (int i = 0; i < nSV; i++) {
00664                         for (int j = 0; j < nSV; j++) {
00665                                 tmp += beta(i) * beta(j) * kernelfunction(SV[i], SV[j], this->trainer->getModel()->param);
00666                         }
00667                 }
00668                 
00669                 this->quadratic = (double *)malloc(sizeof(double));
00670                 *(this->quadratic) = tmp;
00671         } 

Here is the call graph for this function:

Here is the caller graph for this function:

void damina::SVClustering::calculateSphereRadius (  )  [protected, virtual]

Calculate the sphere radius.

Reference:

A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "Support Vector Clustering," Journal of Machine Learning Research, vol. 2, pp. 125-137, 2001.

Equation nr. 14 in the paper above is radius of the Sphere.

As suggested in

J. Yang, V. Estivill-Castro, and S. K. Chalup, "Support vector clustering through proximity graph modelling," in Neural Information Processing, 2002. ICONIP ‘02. Proceedings of the 9th International Conference on, 2002, pp. 898-903.

we can use the average among SVs' distances from center to calculate the radius.

However, theoretically, all the SVs must have the same distance from center, so we have tested and found that using the distance of just one SV is a good approximation as much as to compute the aforementioned average.

This solution avoids to compute the distance for all SVs.

Definition at line 741 of file SVClustering.cpp.

References calculateDistanceFromCenter(), and sphereRadius.

Referenced by getSphereRadius().

00741                                                  {
00742                 
00743                 // first tecnique
00744                 //
00745                 // assuming SVs have all the same distanace from center (theoretically a correct assumption)
00746                 // to avoid a check on the distance of all SVs
00747                 //
00748                 this->sphereRadius = this->calculateDistanceFromCenter(this->trainer->getModel()->SV[0]);
00749                 
00750                 //if (this->sphereRadius >= 1) {
00751                 //      free(quadratic);
00752                 //      this->sphereRadius = this->calculateDistanceFromCenter(this->trainer->getModel()->SV[0]);
00753                 //}
00754 
00755                 // second tecnique
00756                 //
00757                 // we can use the average among SVs' distances to calculate the radius
00758                 
00759                 //int nSV = this->trainer->getModel()->l;
00760                 //this->sphereRadius = 0;
00761                 //for (int i = 0; i < nSV; i++) {
00762                 //      this->sphereRadius += this->calculateDistanceFromCenter(this->trainer->getModel()->SV[i]);                      
00763                 //}
00764                 //this->sphereRadius = this->sphereRadius / nSV;
00765 
00766                 //      third tecnique
00767                 //      Using One Class SVM, Radius = sqrt(1 - rho/2)
00768                 //sphereRadius = sqrt(1 - (rho()/2));
00769         }

Here is the call graph for this function:

Here is the caller graph for this function:

void damina::SVClustering::clusterize (  )  [virtual]

Run the clustering process.

In this case it runs the Step 2 of Support Vector Clustering, called "Cluster Labeling" by its authors.

The implementation reflects the first solution proposed in

A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "Support Vector Clustering," Journal of Machine Learning Research, vol. 2, pp. 125-137, 2001.

namely, Complete Graph Cluster Labeling.

Implements damina::ClusteringEngine.

Reimplemented in damina::CCLSVClustering.

Definition at line 376 of file SVClustering.cpp.

References separateClusters().

00376                                       {
00377                 this->separateClusters();
00378         }

Here is the call graph for this function:

virtual void damina::SVClustering::clusterize ( DataSet  )  [inline, private, virtual]

Implements damina::ClusteringEngine.

Definition at line 153 of file SVClustering.h.

00153 {};

std::vector< int > & damina::SVC