Archive for October, 2007

[OT] Star galaxies separation via SVM/CVM classification

We have used some astrophysics star/galaxies datasets for our clustering problems, because they have heavily overlapping clusters.

Here we present some results of an SVM classification performed on the same datasets. In fact, S/G separation is usually tackled in a supervised way.

We have used a simple SVM/CVM classifier with a linear kernel (K(x,y) = x’ * y).
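As a minimal sketch of what the linear kernel computes (plain Python, toy vectors that are not taken from the Longo datasets; the function names are illustrative, not those of the SVM/CVM software used for the experiments):

```python
def linear_kernel(x, y):
    """Linear kernel K(x, y) = x' * y, i.e. the plain dot product."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(x, y))


def gram_matrix(points, kernel=linear_kernel):
    """Matrix of pairwise kernel values between all points."""
    return [[kernel(p, q) for q in points] for p in points]


# Toy vectors, for illustration only.
points = [[1.0, 2.0], [3.0, 0.0], [0.0, 1.0]]
G = gram_matrix(points)
```

With this kernel the feature space is the input space itself, so the resulting decision boundary is a hyperplane.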

For each dataset, we used 5% of the items as the training set; the rest is the test set.
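The evaluation protocol can be sketched as follows. This is only an illustration: the post does not say whether the 5% was drawn at random or stratified by class, so the random draw, the seed, and the helper names below are assumptions.

```python
import random


def split_train_test(n_items, train_fraction=0.05, seed=0):
    """Return (train_idx, test_idx): a random train_fraction of the indices, and the rest."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    n_train = round(n_items * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def accuracy(true_labels, predicted_labels):
    """Fraction of items whose predicted label matches the true one."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Longo 01 has 2500 items, so 5% gives 125 training items and 2375 test items.
train_idx, test_idx = split_train_test(2500)
```

The accuracy would then be computed on the test indices only, comparing the classifier's predictions with the known star/galaxy labels.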

Datasets:

Longo 01, 2500 items, 2000 stars, 500 galaxies
Longo 02, 9816 items, 2935 stars, 6883 galaxies
Longo 03, 10940 items, 2978 stars, 7964 galaxies

Accuracy results:

Longo 01: 95%
Longo 02: 98.0746%
Longo 03: 97.925%

Accuracy results with CVM:

Longo 01: 94.98%
Longo 02: 97.5%
Longo 03: 95.2%

Other kernels could probably lead to better results, but we first need to understand how to tune the hyperparameters, such as the kernel width and the soft margin constant.
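One common way to approach this tuning is an exhaustive grid search over the two hyperparameters. The sketch below is generic and is not what was used in the experiments above: `score_fn` is a hypothetical callable standing in for "train with these values and measure accuracy on held-out data", and the grids are illustrative.

```python
from itertools import product


def grid_search(score_fn, widths, soft_margins):
    """Try every (kernel width, soft margin constant) pair and return the best one.

    score_fn(width, c) is a hypothetical callable that should train a classifier
    with those hyperparameters and return its accuracy on held-out validation data.
    """
    best_params, best_score = None, float("-inf")
    for width, c in product(widths, soft_margins):
        score = score_fn(width, c)
        if score > best_score:
            best_params, best_score = (width, c), score
    return best_params, best_score


# Logarithmic grids are a common starting point; these values are illustrative only.
widths = [10.0 ** e for e in range(-3, 4)]
soft_margins = [10.0 ** e for e in range(-2, 3)]
```

The validation data used inside `score_fn` should be kept separate from the final test set, otherwise the reported accuracy is optimistic.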

New talk on SVC and MBI Principle

The slides entitled “Novel Clustering Techniques: Support Vector Methods and Minimum Bregman Information principle” are available in the Documents section.

SVC is explained in more detail because it is still a very experimental technique.

SVDD and kernel functions

Support Vector Domain Description (SVDD) is the basis of Support Vector Clustering. The nonlinear version of SVDD uses the Gaussian kernel, and apparently no other kernel has been investigated except the polynomial one, which is an example of a kernel type that works badly with SVDD.
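To make the role of the kernel concrete: SVDD encloses the data in a sphere in feature space, and a point is described by its kernel-induced distance from the sphere center. The sketch below only evaluates that distance for given coefficients; it does not solve the SVDD optimization problem. The `betas` are hypothetical inputs, and the parameterization of the Gaussian kernel with a width `q` is an assumption.

```python
import math


def gaussian_kernel(x, y, q=1.0):
    """Gaussian kernel exp(-q * ||x - y||^2); q plays the role of the kernel width."""
    return math.exp(-q * sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_distance_to_center(x, support_vectors, betas, kernel=gaussian_kernel):
    """Squared feature-space distance from the image of x to the sphere center.

    The center is a = sum_i beta_i * Phi(x_i), so
    R^2(x) = K(x, x) - 2 * sum_i beta_i K(x_i, x) + sum_ij beta_i beta_j K(x_i, x_j).
    """
    cross = sum(b * kernel(sv, x) for sv, b in zip(support_vectors, betas))
    const = sum(
        bi * bj * kernel(si, sj)
        for si, bi in zip(support_vectors, betas)
        for sj, bj in zip(support_vectors, betas)
    )
    return kernel(x, x) - 2.0 * cross + const
```

Swapping `kernel` for another function changes the shape of the description in input space, which is why the choice of kernel matters for SVDD.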

I wrote an e-mail message to the SVDD author, Dr. David M. J. Tax (Delft University of Technology, Netherlands). Here is the “core” of the message:

Even though the Gaussian kernel is the one with the best average performance, some experiments conducted on a specific application domain have given better results with a Laplacian kernel or an Exponential kernel.

What does this mean theoretically from an SVDD perspective? Have you ever tried kernels other than the Gaussian and polynomial ones?

Dr. Tax replied:

To be honest the number of kernels I used is relatively limited. For some cases I used a correlation between image patches, and it seemed to work well. Also, some people have used a modified Hausdorff distance to compare shapes. I don’t have a lot of experience with it.

The big problem is that we can only say something about generalization for a given representation. By changing the kernel, the representation changes. And what happens then is completely dependent on the data, so it is extremely hard to say something general about what kernel to use (the same as with what features to use). For some applications there may be features ‘proven by experience’ (like the RBF kernel for the simple UCI datasets), but theoretically you cannot really prove it, I think.

So, the conclusion is that a deeper investigation of other kernels with SVDD is needed. I already have some experimental results in this direction (albeit from a clustering perspective), and in the future we could explore the question in depth by analyzing the shape of the data description and the behavior of various (exponential-based) kernels at different kernel width values.
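For reference, here is a minimal sketch of the three exponential-based kernels mentioned in this post. Definitions vary in the literature, so the convention used here is an assumption: a single width parameter `q`, the Euclidean norm for the Exponential kernel, and the L1 norm for the Laplacian kernel.

```python
import math


def gaussian_kernel(x, y, q):
    """exp(-q * ||x - y||_2^2)"""
    return math.exp(-q * sum((a - b) ** 2 for a, b in zip(x, y)))


def exponential_kernel(x, y, q):
    """exp(-q * ||x - y||_2)"""
    return math.exp(-q * math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))))


def laplacian_kernel(x, y, q):
    """exp(-q * ||x - y||_1)"""
    return math.exp(-q * sum(abs(a - b) for a, b in zip(x, y)))


# Compare how fast each kernel decays for one toy pair of points at several widths.
x, y = [0.0, 0.0], [3.0, 4.0]  # Euclidean distance 5, L1 distance 7
for q in (0.01, 0.1, 1.0):
    values = {k.__name__: k(x, y, q)
              for k in (gaussian_kernel, exponential_kernel, laplacian_kernel)}
    print(q, values)
```

Because the Gaussian kernel depends on the squared distance, it decays much faster for distant points than the other two at the same `q`, which is one reason the kernels can behave differently at a given width.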

Euclidean Co-clustering Scheme 2 without Feature Clustering is K-means

In the previous posts, we presented the results of some experiments on the robustness of SVC and Co-clustering to missing values.

A note about Co-clustering is in order: Scheme 2 of Bregman Co-clustering, without feature clustering and with the Euclidean distance, is equivalent to the K-means algorithm.
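A small sketch can illustrate why. It assumes that Scheme 2 is the co-clustering basis that approximates each entry by the mean of its co-cluster, and it only alternates the two steps on toy data; it is not the implementation used in the experiments. When every feature is its own column cluster, the co-cluster means of a row cluster are exactly its centroid, and reassigning rows by squared Euclidean distance is the nearest-centroid step of K-means.

```python
def cocluster_means(data, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Mean of every co-cluster (row cluster g, column cluster h)."""
    sums = [[0.0] * n_col_clusters for _ in range(n_row_clusters)]
    counts = [[0] * n_col_clusters for _ in range(n_row_clusters)]
    for row, g in zip(data, row_labels):
        for value, h in zip(row, col_labels):
            sums[g][h] += value
            counts[g][h] += 1
    return [[s / c if c else 0.0 for s, c in zip(srow, crow)]
            for srow, crow in zip(sums, counts)]


def reassign_rows(data, means, col_labels):
    """Move each row to the row cluster whose co-cluster means approximate it best."""
    def cost(row, g):
        return sum((v - means[g][h]) ** 2 for v, h in zip(row, col_labels))
    return [min(range(len(means)), key=lambda g: cost(row, g)) for row in data]


# Toy data with two obvious groups; not one of the datasets from the posts.
data = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
row_labels = [0, 0, 0, 1]   # deliberately poor initial row clustering
col_labels = [0, 1]         # one cluster per feature = no feature clustering

for _ in range(5):
    means = cocluster_means(data, row_labels, col_labels, 2, len(col_labels))
    row_labels = reassign_rows(data, means, col_labels)
```

With `col_labels = [0, 1]`, each row of `means` is the centroid of one row cluster, so the loop is Lloyd's K-means iteration; grouping features together (for example `col_labels = [0, 0]`) is what makes the co-clustering differ from K-means.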

Induced Missing Values Experiments - Stage 2

This is the continuation of the experiments started a few days ago.

Two other datasets have been involved in this type of experiment. Both of them are astrophysics datasets, more precisely two datasets containing stars and galaxies.

Star/galaxy separation is a problem usually tackled with supervised learning methodologies. In our work, several clustering tests are conducted on this type of data.

These two datasets were chosen because they are quite simple to separate, since we are interested in the robustness with respect to missing values.

Starting from the original datasets, I created eight variants of each, as follows:

  • 4 variants affecting only 3 features out of 15, with 5, 10, 20, 30 percent of objects reporting missing values for all of the 3 features, respectively
  • 4 variants affecting 6 features out of 15, with 5, 10, 20, 30 percent of objects reporting missing values for all of the 6 features, respectively
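The eight variants form a 2 x 4 grid of settings, which can be written down as a small configuration. The post does not say which features were affected, so the feature indices below are hypothetical placeholders.

```python
from itertools import product

N_FEATURES = 15
FRACTIONS = (0.05, 0.10, 0.20, 0.30)
# Hypothetical choice: the post does not say which features were affected.
FEATURE_SETS = {
    3: list(range(3)),
    6: list(range(6)),
}

# One entry per variant: how many and which features, and the fraction of objects.
variants = [
    {"n_affected_features": k, "features": FEATURE_SETS[k], "fraction": f}
    for k, f in product(FEATURE_SETS, FRACTIONS)
]
```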

The experiments were done with Euclidean Co-clustering (Information-theoretic Co-clustering cannot work with negative values) and SVC.

An archive with all results is available for download (it also contains the results of the previous stage).

In the files above:

- “MV” stands for “Missing Values”
- “FC” stands for “Feature Clusters”
- FC1 means no feature clustering
- FC2 means two feature clusters requested
- FC3 means three feature clusters requested
- CC stands for Co-clustering

Induced Missing Values Experiments - Stage 1

A few days ago I finished a tool to induce pseudo-random missing values in datasets. This tool allows us to test the robustness of both Bregman Co-clustering and SVC with respect to missing values.

The tool accepts two parameters: the fraction of objects that will be affected by the process, and the list of features involved.
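A tool with this interface could look roughly like the sketch below. It is not the original tool: the function name, the `seed` argument, and the use of `None` as the missing-value marker are assumptions made for illustration.

```python
import random


def induce_missing_values(data, fraction, features, seed=None, missing=None):
    """Return a copy of data where a random fraction of the rows has the given features blanked.

    data     -- list of rows (each row a list of feature values)
    fraction -- fraction of objects (rows) affected, between 0 and 1
    features -- indices of the features to blank in every affected row
    missing  -- marker written in place of a value (None here; an assumption)
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    rng = random.Random(seed)
    n_affected = round(fraction * len(data))
    affected = set(rng.sample(range(len(data)), n_affected))
    result = []
    for i, row in enumerate(data):
        new_row = list(row)
        if i in affected:
            for f in features:
                new_row[f] = missing
        result.append(new_row)
    return result


# Toy stand-in for IRIS: 150 objects with 4 features (values are placeholders).
data = [[float(i), 1.0, 2.0, 3.0] for i in range(150)]
# "IRIS 10b"-style variant: 10% of objects, features #3 and #4 (0-based indices 2 and 3).
variant = induce_missing_values(data, 0.10, [2, 3], seed=42)
```

Every affected object loses all of the listed features at once, and the original data is left untouched so that several variants can be generated from the same source.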

As is my custom, I started this series of experiments with the IRIS data. So, I created these IRIS dataset variants:

- IRIS 5a: 5% of objects with missing values. One feature (#3) involved.
- IRIS 5b: 5% of objects with missing values. Two features (#3, #4) involved.
- IRIS 10a: 10% of objects with missing values. One feature (#3) involved.
- IRIS 10b: 10% of objects with missing values. Two features (#3, #4) involved.
- IRIS 20a: 20% of objects with missing values. One feature (#3) involved.
- IRIS 20b: 20% of objects with missing values. Two features (#3, #4) involved.
- IRIS 30a: 30% of objects with missing values. One feature (#3) involved.
- IRIS 30b: 30% of objects with missing values. Two features (#3, #4) involved.

Recall that the IRIS data have 4 features.

Here you can download the results.

The experiments were done with Co-clustering and SVC. Information-theoretic co-clustering results are not in the files above because they were irrelevant (very poor performance).

In the files above:

- “MV” stands for “Missing Values”
- “FC” stands for “Feature Clusters”
- FC1 means no feature clustering
- FC2 means two feature clusters requested
- CC stands for Co-clustering