Support Vector Clustering Code

Here is the preliminary alpha source code for Support Vector Clustering (SVC). It implements Cone Cluster Labeling for the cluster assignment step:

  • S. Lee and K. M. Daniels, "Cone Cluster Labeling for Support Vector Clustering," in Proceedings of 6th SIAM Conference on Data Mining, 2006, pp. 484-488.
    @inproceedings{cone2006,
      author = {Sei-Hyung Lee and Karen M. Daniels},
      Booktitle = {Proceedings of 6th SIAM Conference on Data Mining},
      Date-Added = {2007-04-29 16:58:13 +0200},
      Date-Modified = {2007-06-19 18:52:22 +0200},
      Keywords = {SVM, clustering},
      Month = {May},
      Pages = {484–488},
      Title = {Cone Cluster Labeling for Support Vector Clustering},
      Url = {http://www.siam.org/meetings/sdm06/proceedings/046lees.pdf},
      Year = {2006},
    }

It also implements the Secant-like kernel width generator.

  • S. Lee and K. M. Daniels, "Gaussian Kernel Width Selection and Fast Cluster Labeling for Support Vector Clustering," Department of Computer Science, University of Massachusetts Lowell, 2005.
    @techreport{kernwidthsvc2005,
      author = {Sei-Hyung Lee and Karen M. Daniels},
      Date-Added = {2007-05-18 10:44:22 +0200},
      Date-Modified = {2007-06-20 08:28:06 +0200},
      Institution = {Department of Computer Science, University of Massachusetts Lowell},
      Keywords = {svm, clustering, kernel machines},
      Title = {Gaussian Kernel Width Selection and Fast Cluster Labeling for Support Vector Clustering},
      Url = {http://www.cs.uml.edu/~kdaniels/papers/SeiTechReport2005.pdf},
      Year = {2005},
    }

SVM training is performed by the LIBSVM library, whereas the graph utilities are provided by the Boost Graph Library. The licenses of both libraries allow redistribution of their source code, so the package you download contains everything you need to compile the code: just type “make” in the source root directory.

For more information, take a look at the README directory you will find once you have unpacked the tarball.

Download

SVC Source Code - SVC Doxygen documentation

25 Comments so far »

  1. Lawrence said

    on February 8 2008 @ 10:20 am

    Dear Sir,

    I can’t compile the code successfully under Windows using Microsoft Visual C++ 6.0. Some header files can’t be included, as the error messages show: Cannot open include file: ‘getopt.h’: No such file or directory
    SVClustering.cpp OR Cannot open include file: ‘bits/stl_pair.h’: No such file or directory
    svm.cpp. Do these header files really exist? Why can’t they be included? It seems that there are still some syntax errors as well, and in any case I can’t compile the code. Which operating system do you use? Since you mentioned “just to type “make” in the source root directory”, is it a Unix operating system?
    Look forward to your reply.

  2. Lawrence said

    on February 8 2008 @ 10:22 am

    Can you compile the code successfully under the Windows operating system?

  3. Vincenzo Russo said

    on February 8 2008 @ 10:45 am

    Dear Lawrence,

    I have only used Linux and Mac OS X. I did not try to compile the code under Windows, because I do not have such a system.

    The header file “getopt.h” is used by “svc.c” for parsing the command line, and it is a POSIX header file usually present on all Unix-like systems. I guess the software needs to be augmented with some preprocessor directives that make the compiler on Windows include the right header files, but I don’t know what those header files are.

    Anyway, this is a very embryonic release of this software and I am working (together with other people) on a new release, more complete and more cross-platform. I’ll ask some Windows developers about this problem and I’ll update you.

  4. Lawrence said

    on February 11 2008 @ 4:24 pm

    Dear Vincenzo Russo,

    Thank you very very much for your quick reply. Since you used Linux and Mac OS X, that explains why I can’t compile the code successfully under MS Windows. I will try to compile the code again on Linux. However, I think the README file is not written in detail, as you said: “No time for a “real” read me file”.
    To my knowledge, there is still no mature SVC code which can be used directly. You are doing a very meaningful thing. If there is an updated release of this software, please let me know.

    Many thanks and best regards

  5. Vincenzo Russo said

    on February 11 2008 @ 10:25 pm

    Dear Lawrence,

    thank you for your kind words. A totally new release is planned for this year, in order to produce more stable, clear, clean and extensible code and finally release ready-for-production SVC software. We will provide a complete implementation of the Kernel Width Generator, different cluster labeling algorithms, new and more robust stopping criteria, and so on.

    I probably will go to London for a PhD, but I will remain in touch with my Italian university team for developing this idea.

    Rest assured I’ll update this blog when I release something.

    My best,

    VR

  6. Lawrence said

    on February 17 2008 @ 4:37 pm

    Dear Vincenzo Russo,

    The code compiles successfully under Linux. The question I want to ask is how your code deals with the Bounded Support Vectors (BSVs), which lie outside the cluster boundaries. Do you leave them unlabeled or assign them to the cluster they are closest to? I find that if they are left unlabeled, there will be many minor clusters.

    My best,

    Lawrence

  7. Vincenzo Russo said

    on February 18 2008 @ 10:41 am

    Dear Lawrence,

    I don’t leave BSVs unlabeled, for several reasons:

    - the number of BSVs could be large when we deal with strongly overlapping clusters. In such cases, BSVs are not actually outliers, and need to be assigned to a cluster.

    - if we leave them unclassified, we don’t know whether they are really outliers or not.

    Enhanced methods to neglect minor clusters are planned. Meanwhile, a trivial method is available through the -t switch, which establishes a minimum accepted cardinality for clusters.

    My Best,

    VR.
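A minimum-cardinality filter like the one described for the -t switch can be sketched as follows. This is only an illustration in Python, not the code of the SVC package: the function name, the `noise_label` convention, and the decision to mark points of dropped clusters as noise are assumptions, since the thread does not say what the software does with those points.

```python
from collections import Counter


def drop_minor_clusters(labels, min_size, noise_label=-1):
    """Relabel every point whose cluster has fewer than min_size members.

    labels      -- one cluster id per data point
    min_size    -- minimum accepted cardinality for a cluster
    noise_label -- hypothetical marker for points of rejected clusters
    """
    sizes = Counter(labels)  # cluster id -> number of members
    return [lab if sizes[lab] >= min_size else noise_label for lab in labels]


if __name__ == "__main__":
    labels = [0, 0, 0, 1, 1, 2]
    # Cluster 2 has a single member, so it is rejected when min_size is 2.
    print(drop_minor_clusters(labels, min_size=2))
```

Reassigning the rejected points to their nearest accepted cluster would be an equally plausible policy; the filter itself stays the same.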

  8. Lawrence said

    on April 4 2008 @ 4:29 pm

    Dear Vincenzo Russo,

    I am Lawrence. I used your SVC code in a Linux environment to test the famous Iris data. Here is a part of the output which I can’t understand:

    ==================================
    Clustering process finished
    ==================================
    Check out the last run and second last run.
    Kernel Width Found: 26.0991
    Soft Constraint Estimate: 0.000680272
    Overall time elapsed: 0.09 secs
    Original command line: ./svc -q 6 -C 0.01 -b 1 -c 3 -f iris.txt -D output.txt
    Segmentation fault (core dumped)

    1. Does the segmentation fault have any influence on the clustering output? Can I say that I am now using your code successfully?

    2. The parameters Kernel Width Found (26.0991) and Soft Constraint Estimate (0.000680272) are different from my original ones. Are they suggested by the code? What is the meaning of their values?

  9. Lawrence said

    on April 4 2008 @ 4:31 pm

    Many thanks and best regards

    Lawrence

  10. Vincenzo Russo said

    on April 4 2008 @ 5:06 pm

    Dear Lawrence,

    I am replying point by point

    1. No, it does not affect the clustering output. At least, not at that point of the execution. Those strings are the final output of the process.

    2. The software implements the pseudo-hierarchical execution of the SVC. So it starts from your q/C values and then iterates until it reaches the number of clusters specified in input (see NOTE below).
    The q value is explored in a non-decreasing way, so each iteration uses a value of ‘q’ that is greater than or equal to the ‘q’ value of the previous iteration. In this manner, the right ‘q’ value is auto-detected, and it is the one reported as ‘Kernel Width Found’. More information about the Kernel Width Selector is in my master’s thesis.
    You can also omit the first ‘q’ value in input, and it will start with an auto-detected first-value for the q parameter.

    Things are different for the C parameter. You need to specify a value in input (I suggest trying C=10/N, where N is the number of patterns) unless you want to use C=1. The ‘Soft Constraint Estimate’ is only a value that you can try in case your own value produces no meaningful clustering.

    Please note that the values of the q/C parameters you find in the literature usually do not work well with my software. This is probably due to the slightly different SVM model underlying the implementation. Please refer to my master’s thesis for more information (chapter 6 for the execution model of the SVC, and chapter 9 for experimental results and the related q/C values).

    NOTE: SVC can theoretically do without the number of clusters in input, but in the version of the software you are using that is not possible. I recently implemented a new stopping criterion, already discussed in my master’s thesis, which allows auto-detecting the number of clusters that best fits the problem at hand. Anyway, I will post here when we release a new version of the software.
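The pseudo-hierarchical execution described in this reply (start from an initial q, move through non-decreasing q values, stop when the requested number of clusters is reached or another limit is hit) can be sketched as a small driver loop. Everything named here is hypothetical: `run_svc` stands in for one SVC training and labeling pass, and the multiplicative `growth` step is a placeholder chosen for brevity, not the Secant-like kernel width generator used by the software.

```python
def explore_q(run_svc, q_start, target_clusters, max_iters=20,
              max_sv=None, growth=1.5):
    """Explore non-decreasing q values until a stopping condition holds.

    run_svc(q) is a hypothetical callback that returns the pair
    (number_of_clusters, number_of_support_vectors) for kernel width q.
    max_iters is assumed to be at least 1.
    Returns (last q tried, reason for stopping, history of all runs).
    """
    history = []
    q = q_start
    for _ in range(max_iters):
        n_clusters, n_sv = run_svc(q)
        history.append((q, n_clusters, n_sv))
        if n_clusters == target_clusters:
            return q, "target reached", history
        if max_sv is not None and n_sv > max_sv:
            return q, "too many support vectors", history
        q *= growth  # growth >= 1 keeps the sequence non-decreasing
    return history[-1][0], "iteration limit", history


def toy_run(q):
    """Made-up stand-in for SVC: more clusters and more SVs as q grows."""
    n_clusters = 1 if q < 0.05 else 2 if q < 0.08 else 3
    return n_clusters, int(100 * q) + 3


if __name__ == "__main__":
    q, reason, history = explore_q(toy_run, q_start=0.02, target_clusters=3)
    print(reason, q, len(history))
```

Setting `max_iters=1` reproduces the single-run behaviour: the loop evaluates only the starting q and never looks for other values.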

  11. Vincenzo Russo said

    on April 4 2008 @ 5:09 pm

    I forgot a little detail: if you want to run SVC ONLY with your q/C values, you can specify them on the command line and add the switch “-r 1”, which limits the number of iterations to just one and avoids searching for other q values. It also skips the check that the number of clusters obtained matches the “-c” switch.

  12. Lawrence said

    on April 5 2008 @ 1:32 pm

    Dear Vincenzo Russo,

    Thank you very very much. You are always so kind to reply quickly and in detail. Your explanation, as well as your thesis, gives me a lot of illumination. I don’t major in support vector methods; I only want to do some applications. You also used the Iris dataset to test the SVC code and obtained a good output in chapter 9. However, I still can’t separate the two nonlinearly separable classes. I think it is mainly because I can’t use the SVC code properly.
    After your explanation, I think the clustering output finally used the Kernel Width Found (26.0991) and Soft Constraint Estimate (0.000680272), starting from my q/C values. If I want to run SVC ONLY with my q/C values, I should add the command line switch “-r 1” to limit the number of iterations to just one. Am I right?

    Many thanks and Best Regards

    Lawrence

  13. Vincenzo Russo said

    on April 5 2008 @ 2:52 pm

    Dear Lawrence,

    you are welcome.

    Your execution ended with a q value (26.0991) that is definitely greater than the right value. This is because you started the process with a q value that is too high (I know, you probably found it in the literature, but as I already stated, it does not work with my software), and the software stopped for other reasons (too many SVs), not because it found the right clustering.

    The command line for obtaining a good separation using the Gaussian kernel is

    svc -c 3 -f /path/to/iris-file -q 0.0891501 -C 0.0666667 -s 0.5 -r 1

    I was able to find the q value thanks to the kernel width selector I implemented (if you want to try, run the svc in this way: svc -c 3 -f /path/to/iris-file).

    The C value is estimated by my own (simple) heuristic: C = 10/N = 10/150

    The -s switch enables another heuristic of mine, called ‘softening strategy’, which often results in higher accuracy.

    Finally, the answer to your last question is: yes, to run svc with (and only with) a specific q/C pair, you need to use the “-r 1” switch.

    If you do not obtain good results even with the command line I supplied above, please let me know where you downloaded the IRIS dataset as there are two versions of this dataset available on the Web.

    Kind Regards,

    VR.
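The arithmetic behind the C = 10/N heuristic and the single-run command line quoted in this reply can be checked with a few lines of Python. This is just a sketch: the helper names are made up for illustration, and only the switches that appear in the thread (-c, -f, -q, -C, -s, -r) are used.

```python
import shlex


def soft_constraint(n_patterns, factor=10.0):
    """Heuristic from the thread: C = 10 / N, with N the number of patterns."""
    return factor / n_patterns


def single_run_command(data_file, q, n_patterns, n_clusters, softening=0.5):
    """Build the command line for one run with fixed q/C ("-r 1")."""
    c = soft_constraint(n_patterns)
    args = [
        "svc",
        "-c", str(n_clusters),
        "-f", data_file,
        "-q", str(q),
        "-C", f"{c:.7f}",      # 10/150 is printed as 0.0666667
        "-s", str(softening),
        "-r", "1",             # one iteration only: no search for other q values
    ]
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    print(single_run_command("/path/to/iris-file", 0.0891501,
                             n_patterns=150, n_clusters=3))
```

For the 150 Iris patterns this gives C = 0.0666667, the value used in the command above.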

  14. Lawrence said

    on April 6 2008 @ 11:23 am

    Dear Vincenzo Russo,

    I am very excited now. Thank you very very much. The q/C values you provided are quite good, and the two nonlinearly separable classes of the Iris dataset can be separated now. The accuracy is up to 92.667%, which outperforms the traditional clustering algorithms. The Iris dataset was downloaded from Wikipedia: http://en.wikipedia.org/wiki/Iris_flower_data_set

    I can use your software on my own dataset now. I still have two questions for you:
    1. Do we need to pre-process the dataset? In other words, do we need to standardize all of the attributes? Will the clustering be dominated by attributes with large values, such as age or income? I think you may have already included this in your code, as I didn’t standardize the Iris data and the output is quite good.
    2. My own dataset has no class label information, unlike the Iris data with its three known classes; that is the reason I want to partition the observations into several clusters. On the other hand, the input data for your software is in the LIBSVM format, where the first column is the class label. What should I put in the first column of the input data?

    Many thanks and Best Regards

    Lawrence

  15. Vincenzo Russo said

    on April 6 2008 @ 2:39 pm

    Dear Lawrence,

    I am glad you finally got a good result.

    About your questions:

    1. The SVC software does not perform any normalization and/or scaling on data. SVC is able to deal with multi-modal/multi-variate data. In other words, you can use data whose attributes have different variances. Anyway, if you want to scale your data, I suggest using the svm-scale utility supplied with LIBSVM.

    2. Well, you can specify the same class label for all patterns in the .scale files, and then you can simply ignore the quality measures values (accuracy, macro-averaging, etc.).

    Please note that if you have no idea about the number of clusters in your data, you can edit the main.c file to drop the stop condition based on the number of clusters. The remaining stop condition is based on the number of SVs. In addition, you can specify a maximum number of iterations with the “-r” switch. In this way, you can exploit the ability of SVC to analyze the data at different levels of detail, by specifying no initial q value and letting the software find the most suitable one. However, since there is no solid stopping criterion in the version of the software you are using, this process could be long and boring.

    Out of curiosity, when do you expect to graduate? I will be glad to take a look at your thesis (which, obviously, will cite my thesis in the bibliography, won’t it? ;-)) at the end of the work.

    Anyway, Good Luck.

    Best regards,

    VR.
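For unlabeled data, the advice above amounts to writing the same dummy class label in the first column of every LIBSVM-format line. A minimal sketch in Python, assuming dense rows of numeric attributes (the function name and the dummy label value 1 are illustrative choices):

```python
def to_libsvm_lines(rows, label=1):
    """Format dense feature rows as LIBSVM lines: "<label> 1:v1 2:v2 ...".

    Every pattern gets the same dummy label, so any quality measure that
    depends on the labels (accuracy, macro-averaging, ...) is meaningless
    and should be ignored.
    """
    lines = []
    for row in rows:
        # LIBSVM feature indices start at 1.
        feats = " ".join(f"{i}:{v}" for i, v in enumerate(row, start=1))
        lines.append(f"{label} {feats}")
    return lines


if __name__ == "__main__":
    data = [[5.1, 3.5, 1.4, 0.2],
            [6.3, 3.3, 6.0, 2.5]]
    for line in to_libsvm_lines(data):
        print(line)
```

The lines can then be written to a text file and passed to the software with the -f switch.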

  16. Lawrence said

    on April 6 2008 @ 5:40 pm

    Dear Vincenzo Russo,

    I think I can use your software well now. I am supposed to graduate in one year. I want to partition individuals into several clusters, which is part of my thesis work. At present, I am writing a paper using SVC. You really made a great contribution. Not only will your thesis be cited in the bibliography, but you will also be mentioned in the acknowledgements. I’d like to give you a copy when it is accepted for publication. You are a really nice and kind man, like no one I have met before. Please accept my sincere thanks once again.
    You mentioned that you will probably go to London for a PhD. Where are you now?

    Best Regards

    Lawrence

  17. Vincenzo Russo said

    on April 6 2008 @ 6:28 pm

    Dear Lawrence,

    thank you for the kind words, I’m glad to be of help.

    My opinion is that research activity should be open and collaborative. I can’t understand a closed research environment, because I think it is a model in contrast with the idea of research itself. While I was working on my master’s thesis, I encountered many people with no intention to share anything at all. Unbelievable, from my perspective. I like to share, I need to share, I want to share.

    This is why I wrote the SVC software and put it online. And this is why I will put future versions online too.

    The PhD in London will start in October. In the meantime, I am in my city (Naples, Italy) and I collaborate with my University (Federico II, Naples, Italy).

    Oh, thanks also for your promises of citation and acknowledgement. I’m glad of this.

    Well, keep in touch.
    For any question, feel free to write me again.

    Best regards,

    VR

  18. Lawrence said

    on April 6 2008 @ 8:01 pm

    Dear Vincenzo Russo,

    I quite agree with you that research activity should be open and collaborative. Although many researchers have worked on SVC and its applications, you can’t find any code or software on the internet except yours. I really appreciate your work and kindness.

    One more thing. I checked by myself according to your instructions. I ran svc in this way: svc -c 3 -f /path/to/iris-file, and obtained a q value of 0.163274. When I ran svc in another way: svc -c 3 -s 0.5 -f /path/to/iris-file, I obtained the q/C values you suggested. The problem is how to choose the s value. I also tested other s values in this way, and the output was not as good as with s=0.5. Should I also use this s value when I apply svc to my own dataset? Or can you obtain the q value (0.0891501) with just svc -c 3 -f /path/to/iris-file?

    Best Regards

    Lawrence

  19. Lawrence said

    on April 7 2008 @ 5:29 am

    Dear Vincenzo Russo,

    Why is the number of clusters usually not equal to the specified c value? If I already know the number of clusters from past literature, say 6, I want to obtain that kind of segmentation. Even when the c value is set to 6, the result usually does not have 6 clusters. What’s the problem? Do you have any suggestions for adjusting the parameters q/C or s to obtain the specified number of clusters?

    Many thanks and Best Regards

    Lawrence

  20. Vincenzo Russo said

    on April 7 2008 @ 4:44 pm

    Dear Lawrence,

    the -s switch is supposed to always be used with the value 0.5. It’s just a heuristic that allows exploring the ‘q’ value sequence in a smoother way. After a number of tries, 0.5 turned out to be the most suitable value.

    My advice is to perform the experiments both with and without the softening strategy heuristic, even though the softening should result in more accurate clustering.

    The second question: SVC is a hierarchical-like clustering, so it is a so-called non-parametric clustering algorithm. It DOES NOT use the number of clusters as an input parameter for determining the clustering. Actually, it should not use it at all. In this version of my software such a parameter is a ‘dirty trick’ that serves as a stopping criterion for datasets whose exact number of clusters we know.

    However, SVC may not find the number of clusters you expect. This is because it auto-detects the latent structure, and it is not perfect.

    In such cases you need to try different parameter combinations: different values of ‘C’ set manually, different kernels (with the -k switch you can use the Gaussian, the Laplace, or the Exponential kernel in my software), and different distance metrics (with the -l switch you can use either the L1 or the L2 distance).

    Only the ‘q’ value has to be found automatically by the SVC software in any of the setups you try.

    Best Regards,

    VR

  21. Lawrence said

    on April 17 2008 @ 9:36 am

    Dear Vincenzo Russo,

    I notice that in your thesis you compared the performance of different clustering methods using real-life benchmarks such as the Iris data, the Wisconsin breast cancer database, and the Wine Recognition Database. Camastra (2005) also compared the performance of current clustering methods, including K-means, Neural Gas, Self-Organizing Map (SOM), the Spectral clustering algorithm and SVC, on three real-life benchmarks. However, class information already exists for these datasets; that is, we already know the class each observation belongs to. It’s quite OK and necessary that these datasets are used to compare the results of clustering methods. If I want to compare the performance of SVC and K-means on a dataset without class information, do you have any suggestions?

    Many thanks and Best Regards

    Lawrence

  22. Vincenzo Russo said

    on April 17 2008 @ 10:14 am

    Dear Lawrence,

    welcome back.

    Since your datasets are unlabeled, the first thing you need is some criterion to evaluate the quality of clustering results. The most effective way to evaluate clustering results when data are unlabeled is a relative criterion (aka validity index).

    In section 3.3.3 of my thesis I present some classical validity indices, and in

    J. Wang and J. Chiang, “A cluster validity measure with a hybrid parameter search method for the support vector clustering algorithm,” Pattern Recognition, vol. 41, iss. 2, pp. 506-520, 2008.

    you can find a validity index specific to SVC. I used it to develop a new stopping criterion for SVC (not available in the software you are using).

    So, you have to run SVC and K-means several times, each with different parameter settings, and then evaluate the results. The better the index value, the better the clustering. Since you probably don’t know the number of clusters and K-means needs it, a classical way is to run K-means with different numbers of clusters in input and then choose the instance that yields the best validity index value. As far as SVC is concerned, you can try different combinations of q/C/kernel and choose the instance that yields the best value of the validity index.

    I hope I was clear.

    Best,
    VR.

  23. Lawrence said

    on April 17 2008 @ 3:25 pm

    Dear Vincenzo Russo,

    Thank you very very much. It seems clear to me. I’d like to read the relevant parts of your thesis and the paper you suggested first. You have really given me great inspiration on how to evaluate the quality of clustering results when datasets are unlabeled. Your help is greatly appreciated.

    Best Regards

    Lawrence

  24. Lawrence said

    on April 20 2008 @ 5:59 pm

    Dear Vincenzo Russo,

    I now have some idea about the evaluation of clustering results. As you mentioned, and from a review of the literature, three approaches are generally used for the quantitative evaluation of clustering results: internal criteria, external criteria, and relative criteria.

    Internal criteria are the only means to evaluate the clustering quality of a completely new domain.

    External criteria imply clustering evaluation by means of external pre-specified structure information of a dataset. Generally, several real-life or synthetic datasets with prior class information such as Iris data, Wisconsin’s breast cancer database, and Spam database, are used as evaluation benchmark for clustering results.

    Relative criteria (also known as validity indices) evaluate clustering results obtained with different input parameter settings of the same clustering algorithm. A number of validity indices have been developed and proposed in the literature.

    If I want to compare the performance of K-means and SVC in a completely new domain without prior class information, there are two methods, as follows:

    1. While running K-means and SVC, a validity index is embedded in each for parameter selection. Once the optimal clustering results are found for each, the clustering results are evaluated by internal criteria.

    2. There is a validity index specific to SVC which is quite useful for parameter selection, as you mentioned (I don’t know if it can be used for K-means). If different validity indices are used for K-means and SVC, I think they can’t be compared. If the same validity index is used for both clustering algorithms, even if it is not appropriate for SVC, can the index value work as a criterion for the performance of these two clustering algorithms?

    Maybe I didn’t well express my question. I hope you can understand. I am a little confused here.

    Many thanks and Best regards

    Lawrence

  25. Vincenzo Russo said

    on April 29 2008 @ 9:58 am

    Dear Lawrence,

    Sorry for the late reply. I was not able to connect for a week because I was not at home.

    In short, this is what I would do:

    1. Run K-means several times with different ‘k’ values and choose the best instance. “Best” here means the instance that produces the best value for some validity index (the C-index is supposed to be a good choice for K-means).

    2. Run SVC several times with different parameter settings (kernel, C, q, etc.) and choose the best instance according to a validity index (the best choice is the index specifically developed for SVC, I guess).

    3. Compare the results of the “best k-means instance” and the “best SVC instance”.

    That’s all.

    And no, I am sure that the validity index developed for SVC does not fit K-means, because the index relies on specific characteristics of SVC.

    I hope this helps.

    Best,

    VR.
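The "run several times and keep the best instance" procedure above can be sketched in Python with the C-index as the validity index. The formula used here is the usual Hubert and Levin definition as I understand it, C = (S - S_min) / (S_max - S_min), where S is the sum of within-cluster pairwise distances and S_min / S_max are the sums of the same number of smallest / largest pairwise distances overall; lower values are better. The function names and the toy data are illustrative assumptions, and the candidate labelings would in practice come from separate K-means or SVC runs.

```python
from itertools import combinations
from math import dist


def c_index(points, labels):
    """C-index of a labeling; lower values indicate a better clustering."""
    pairs = list(combinations(range(len(points)), 2))
    all_d = sorted(dist(points[i], points[j]) for i, j in pairs)
    within = [dist(points[i], points[j])
              for i, j in pairs if labels[i] == labels[j]]
    n_w = len(within)                      # number of within-cluster pairs
    s = sum(within)
    s_min = sum(all_d[:n_w])               # n_w smallest distances overall
    s_max = sum(all_d[len(all_d) - n_w:])  # n_w largest distances overall
    if s_max == s_min:
        return 0.0  # degenerate case: the index carries no information
    return (s - s_min) / (s_max - s_min)


def best_instance(points, candidates):
    """Return the name of the candidate labeling with the lowest C-index."""
    return min(candidates, key=lambda name: c_index(points, candidates[name]))


if __name__ == "__main__":
    points = [(0.0,), (1.0,), (10.0,), (11.0,)]
    candidates = {
        "run A": [0, 0, 1, 1],  # groups nearby points together
        "run B": [0, 1, 0, 1],  # groups distant points together
    }
    for name, labels in candidates.items():
        print(name, c_index(points, labels))
    print("best:", best_instance(points, candidates))
```

The same selection step would be applied once to the K-means runs and once to the SVC runs, each with the index suited to that algorithm, before comparing the two winners.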
