June 22 2007

Co-clustering - Real World Dataset Test #2

Macchina usata:
PowerPC G4, 1.5GHz, 768MB RAM, Mac OS X

Software usato:

  • H. Cho, Y. Guan, and S. Sra, Co-cluster (v 1.1), 2004.
    @misc{coclus-software,
      author = {Hyuk Cho and Yuqiang Guan and Suvrit Sra},
      Date-Added = {2007-04-29 15:15:55 +0200},
      Date-Modified = {2007-06-25 17:10:33 +0200},
      Howpublished = {Bregman co-clustering software},
      Keywords = {co-clustering, relative entropy, euclidean distance, software},
      Title = {Co-cluster (v 1.1)},
      Url = {http://www.cs.utexas.edu/users/dml/Software/cocluster.html},
      Year = {2004},
      Bdsk-Url-1 = {http://www.cs.utexas.edu/users/dml/Software/cocluster.html}
    }

Dataset Usato:
Mushrooms Database
Number of instances: 8124
Number of Attributes: 22
2480 missing values for attribute #12
Original Class Distribution: edible: 4208 (51.8%), poisonous: 3916 (48.2%)
Mushroom records drawn from The Audubon Society Field Guide to North
American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf
Donor: Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
Date: 27 April 1987

Algoritmo di co-clustering usato: Minimum Sum Squared Residue

Prova #1
Richiesti 2 cluster di riga e 1 di colonna. Totale: 2 co-cluster

Tempo impiegato: User = 2 second(s) 127370 ms, System = 0 second(s) 40949 ms, Time/Run = 2.12737 second(s)

Risultato: 3670 elementi nella classe “poisonous”, 4454 elementi nella classe “edible”.

Percentuale d’errore (elementi non classificati correttamente): ~3%

Prova #2
Richiesti 2 cluster di riga e 2 di colonna. Totale: 4 co-cluster

Tempo impiegato: User = 2 second(s) 158490 ms, System = 0 second(s) 40654 ms, Time/Run = 2.15849 second(s)

Risultato: 3915 elementi nella classe “poisonous”, 4209 elementi nella classe “edible”.

Percentuale d’errore: ~1.23 x 10^-4 (1 solo elemento è stato classificato erroneamente)

June 19 2007

Dataset sintetici per Clustering Benchmark

Molto spesso, nell’eseguire i test di algoritmi di clustering, è molto utile avere a disposizione degli insiemi di dati campione sintetici, ovvero creati artificialmente e che non rispecchiano dei dati reali.

A tale scopo molto utile si rivela il lavoro fatto dal Center for Data Engineering, International Institute of Information Technology, Hyderabad, INDIA

  • J. R. Vennam and S. Vadapalli, "SynDECA: A Tool to Generate Synthetic Datasets for Evaluation of Clustering Algorithms," in 11th International Conference on Management of Data (COMAD 2005), Goa, India, 2005.
    @conference{syndeca2005, Address = {Goa, India},
      Author = {Jhansi Rani Vennam and Soujanya Vadapalli},
      Booktitle = {11th International Conference on Management of Data (COMAD 2005)},
      Date-Added = {2007-06-18 16:18:49 +0200},
      Date-Modified = {2007-07-03 18:34:02 +0200},
      Keywords = {clustering, tool, synthetic, dataset, generator},
      Month = {January},
      Organization = {http://cde.iiit.ac.in/syndeca},
      Title = {SynDECA: A Tool to Generate Synthetic Datasets for Evaluation of Clustering Algorithms},
      Url = {http://comad2005.persistent.co.in/COMAD2005Proc/pages027-036.pdf},
      Year = {2005},
      Bdsk-File-1 = {YnBsaXN0MDDUAQIDBAUGBwpZJGFyY2hpdmVyWCR2ZXJzaW9uVCR0b3BYJG9iamVjdHNfEA9OU0tleWVkQXJjaGl2ZXISAAGGoNEICVRyb290gAGoCwwXGBkaHiVVJG51bGzTDQ4PEBMWWk5TLm9iamVjdHNXTlMua2V5c1YkY2xhc3OiERKABIAFohQVgAKAA4AHXHJlbGF0aXZlUGF0aFlhbGlhc0RhdGFfEHAuLi8uLi8uLi9QYXBlcnMvVmVubmFtL1N5bkRFQ0EgQSBUb29sIHRvIEdlbmVyYXRlIFN5bnRoZXRpYyBEYXRhc2V0cyBmb3IgRXZhbHVhdGlvbiBvZiBDbHVzdGVyaW5nIEFsZ29yaXRobXMucGRm0hsPHB1XTlMuZGF0YU8RApwAAAAAApwAAgAACURvY3VtZW50cwAAAAAAAAAAAAAAAAAAAAAAAL7OeK5IKwAAADk5AR9TeW5ERUNBIEEgVG9vbCB0byBHZSMzOTM4RjcucGRmAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOTj3wrBG9QAAAAAAAAAAAAMAAwAACQAAAAAAAAAAAAAAAAAAAAAGVmVubmFtABAACAAAvs5cjgAAABEACAAAwrAq1QAAAAEAFAA5OQEANxuAAACy8gAAEsYAABKtAAIAUURvY3VtZW50czpuZW1vOkRvY3VtZW50czpVbml2ZXJzaXRhOlBhcGVyczpWZW5uYW06U3luREVDQSBBIFRvb2wgdG8gR2UjMzkzOEY3LnBkZgAADgC0AFkAUwB5AG4ARABFAEMAQQAgAEEAIABUAG8AbwBsACAAdABvACAARwBlAG4AZQByAGEAdABlACAAUwB5AG4AdABoAGUAdABpAGMAIABEAGEAdABhAHMAZQB0AHMAIABmAG8AcgAgAEUAdgBhAGwAdQBhAHQAaQBvAG4AIABvAGYAIABDAGwAdQBzAHQAZQByAGkAbgBnACAAQQBsAGcAbwByAGkAdABoAG0AcwAuAHAAZABmAA8AFAAJAEQAbwBjAHUAbQBlAG4AdABzABIAgi9uZW1vL0RvY3VtZW50cy9Vbml2ZXJzaXRhL1BhcGVycy9WZW5uYW0vU3luREVDQSBBIFRvb2wgdG8gR2VuZXJhdGUgU3ludGhldGljIERhdGFzZXRzIGZvciBFdmFsdWF0aW9uIG9mIENsdXN0ZXJpbmcgQWxnb3JpdGhtcy5wZGYAEwASL1ZvbHVtZXMvRG9jdW1lbnRzABUAAgAX//8AAIAG0h8gISJYJGNsYXNzZXNaJGNsYXNzbmFtZaMiIyRdTlNNdXRhYmxlRGF0YVZOU0RhdGFYTlNPYmplY3TSHyAmJ6InJFxOU0RpY3Rpb25hcnkACAARABsAJAApADIARABJAEwAUQBTAFwAYgBpAHQAfACDAIYAiACKAI0AjwCRAJMAoACqAR0BIgEqA8oDzAPRA9oD5QPpA/cD/gQHBAwEDwAAAAAAAAIBAAAAAAAAACgAAAAAAAAAAAAAAAAAAAQc},
      Bdsk-Url-1 = {http://comad2005.persistent.co.in/COMAD2005Proc/pages027-036.pdf}
    }

Lo strumento riesce a produrre dataset sintetici molto rapidamente; in genere un insieme con spazio delle feature 2D, con un milione di punti e centinaia di cluster, viene prodotto in pochi secondi.

Per ogni insieme prodotto, viene fornito dettagli sul clustering, come:

- quali punti appartengono a quali cluster
- quanti cluster
- quanti punti per cluster
- forma dei cluster
- etc.

May 19 2007

Inizio sviluppo SVC software

Basandomi sulla libSVM, ho iniziato lo sviluppo del software per SVC.

Devo dire che la disponibilità dei vari ricercatori nel fornire il codice da loro usato per effettuare i test è stata abbastanza scarna, a partire da Vapnik et al. e passando tra tutti coloro che negli ultimi anni hanno lavorato per eliminare il collo di bottiglia di questo approccio (cluster labeling).

Soltanto gli autori di

  • J. Lee and D. Lee, "An Improved Cluster Labeling Method for Support Vector Clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, iss. 3, pp. 461-464, 2005.
    @article{svcimproved, Address = {Washington, DC, USA},
      Author = {Jaewook Lee and Daewon Lee},
      Date-Added = {2007-04-28 18:28:29 +0200},
      Date-Modified = {2007-06-19 15:19:40 +0200},
      Doi = {http://dx.doi.org/10.1109/TPAMI.2005.47},
      Issn = {0162-8828},
      Journal = {IEEE Trans. Pattern Anal. Mach. Intell.},
      Keywords = {clustering, SVM},
      Note = {Member-Jaewook Lee},
      Number = {3},
      Pages = {461–464},
      Publisher = {IEEE Computer Society},
      Title = {An Improved Cluster Labeling Method for Support Vector Clustering},
      Url = {http://doi.ieeecomputersociety.org/10.1109/TPAMI.2005.47},
      Volume = {27},
      Year = {2005},
      Bdsk-File-1 = {YnBsaXN0MDDUAQIDBAUGBwpZJGFyY2hpdmVyWCR2ZXJzaW9uVCR0b3BYJG9iamVjdHNfEA9OU0tleWVkQXJjaGl2ZXISAAGGoNEICVRyb290gAGoCwwXGBkaHiVVJG51bGzTDQ4PEBMWWk5TLm9iamVjdHNXTlMua2V5c1YkY2xhc3OiERKABIAFohQVgAKAA4AHXHJlbGF0aXZlUGF0aFlhbGlhc0RhdGFfEFkuLi8uLi8uLi9QYXBlcnMvTGVlL0FuIEltcHJvdmVkIENsdXN0ZXIgTGFiZWxpbmcgTWV0aG9kIGZvciBTdXBwb3J0IFZlY3RvciBDbHVzdGVyaW5nLnBkZtIbDxwdV05TLmRhdGFPEQJYAAAAAAJYAAIAAAlEb2N1bWVudHMAAAAAAAAAAAAAAAAAAAAAAAC+zniuSCsAAAA3JUEfQW4gSW1wcm92ZWQgQ2×1c3RlciAjMjdDMDVELnBkZgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACfAXcIke4QAAAAAAAAAAAADAAMAAAkAAAAAAAAAAAAAAAAAAAAAA0xlZQAAEAAIAAC+zlyOAAAAEQAIAADCJG10AAAAAQAUADclQQA3G4AAALLyAAASxgAAEq0AAgBORG9jdW1lbnRzOm5lbW86RG9jdW1lbnRzOlVuaXZlcnNpdGE6UGFwZXJzOkxlZTpBbiBJbXByb3ZlZCBDbHVzdGVyICMyN0MwNUQucGRmAA4AjABFAEEAbgAgAEkAbQBwAHIAbwB2AGUAZAAgAEMAbAB1AHMAdABlAHIAIABMAGEAYgBlAGwAaQBuAGcAIABNAGUAdABoAG8AZAAgAGYAbwByACAAUwB1AHAAcABvAHIAdAAgAFYAZQBjAHQAbwByACAAQwBsAHUAcwB0AGUAcgBpAG4AZwAuAHAAZABmAA8AFAAJAEQAbwBjAHUAbQBlAG4AdABzABIAay9uZW1vL0RvY3VtZW50cy9Vbml2ZXJzaXRhL1BhcGVycy9MZWUvQW4gSW1wcm92ZWQgQ2×1c3RlciBMYWJlbGluZyBNZXRob2QgZm9yIFN1cHBvcnQgVmVjdG9yIENsdXN0ZXJpbmcucGRmAAATABIvVm9sdW1lcy9Eb2N1bWVudHMAFQACABf//wAAgAbSHyAhIlgkY2xhc3Nlc1okY2xhc3NuYW1loyIjJF1OU011dGFibGVEYXRhVk5TRGF0YVhOU09iamVjdNIfICYnoickXE5TRGljdGlvbmFyeQAIABEAGwAkACkAMgBEAEkATABRAFMAXABiAGkAdAB8AIMAhgCIAIoAjQCPAJEAkwCgAKoBBgELARMDbwNxA3YDfwOKA44DnAOjA6wDsQO0AAAAAAAAAgEAAAAAAAAAKAAAAAAAAAAAAAAAAAAAA8E=},
      Bdsk-Url-1 = {http://doi.ieeecomputersociety.org/10.1109/TPAMI.2005.47},
      Bdsk-Url-2 = {http://dx.doi.org/10.1109/TPAMI.2005.47}
    }

mi hanno gentilmente risposto inviandomi il codice (putroppo Matlab) da loro usato per i test.

Sto perciò sviluppando questo software (che avrei dovuto fare comunque) con una visione più ampia in mente. Per ora la struttura è essenziale e fortemente legata a libSVM:

Classi fondamentali

  • una classe astratta AbstractTrainer, che fornisce la fornisce l’interfaccia per l’implementazione della parte di training per il Support Vector Clustering.
  • una classe AbstractLabeler, che fornisce la fornisce l’interfaccia per l’implementazione della parte di cluster labeling per il SVC

Prime concretizzazioni

  • una classe OneClassTrainer, che implementa il training per il SVC utilizzando la One-class classification della libSVM
  • [da terminare] una classe CGLabeler, che implementa nel modo più banale (e inefficiente) l’algoritmo cluster labeling proposto in
    • A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "Support Vector Clustering," Journal of Machine Learning Research, vol. 2, pp. 125-137, 2001.
      @article{svc,
        author = {A. Ben-Hur and D. Horn and H. T. Siegelmann and V. Vapnik},
        Date-Modified = {2007-06-19 14:44:40 +0200},
        Journal = {Journal of Machine Learning Research},
        Keywords = {clustering, SVM, gaussian kernel},
        Pages = {125-137},
        Title = {Support Vector Clustering},
        Url = {http://citeseer.ist.psu.edu/hur01support.html},
        Volume = 2, Year = 2001, Bdsk-File-1 = {YnBsaXN0MDDUAQIDBAUGBwpZJGFyY2hpdmVyWCR2ZXJzaW9uVCR0b3BYJG9iamVjdHNfEA9OU0tleWVkQXJjaGl2ZXISAAGGoNEICVRyb290gAGoCwwXGBkaHiVVJG51bGzTDQ4PEBMWWk5TLm9iamVjdHNXTlMua2V5c1YkY2xhc3OiERKABIAFohQVgAKAA4AHXHJlbGF0aXZlUGF0aFlhbGlhc0RhdGFfEDUuLi8uLi8uLi9QYXBlcnMvQmVuLUh1ci9TdXBwb3J0IFZlY3RvciBDbHVzdGVyaW5nLnBkZtIbDxwdV05TLmRhdGFPEQHqAAAAAAHqAAIAAAlEb2N1bWVudHMAAAAAAAAAAAAAAAAAAAAAAAC+zniuSCsAAAA3IEUdU3VwcG9ydCBWZWN0b3IgQ2×1c3RlcmluZy5wZGYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACIbMsH5WY9QREYgAAAAAAADAAMAAAkAAAAAAAAAAAAAAAAAAAAAB0Jlbi1IdXIAABAACAAAvs5cjgAAABEACAAAwflLfwAAAAEAFAA3IEUANxuAAACy8gAAEsYAABKtAAIAUERvY3VtZW50czpuZW1vOkRvY3VtZW50czpVbml2ZXJzaXRhOlBhcGVyczpCZW4tSHVyOlN1cHBvcnQgVmVjdG9yIENsdXN0ZXJpbmcucGRmAA4APAAdAFMAdQBwAHAAbwByAHQAIABWAGUAYwB0AG8AcgAgAEMAbAB1AHMAdABlAHIAaQBuAGcALgBwAGQAZgAPABQACQBEAG8AYwB1AG0AZQBuAHQAcwASAEcvbmVtby9Eb2N1bWVudHMvVW5pdmVyc2l0YS9QYXBlcnMvQmVuLUh1ci9TdXBwb3J0IFZlY3RvciBDbHVzdGVyaW5nLnBkZgAAEwASL1ZvbHVtZXMvRG9jdW1lbnRzABUAAgAX//8AAIAG0h8gISJYJGNsYXNzZXNaJGNsYXNzbmFtZaMiIyRdTlNNdXRhYmxlRGF0YVZOU0RhdGFYTlNPYmplY3TSHyAmJ6InJFxOU0RpY3Rpb25hcnkACAARABsAJAApADIARABJAEwAUQBTAFwAYgBpAHQAfACDAIYAiACKAI0AjwCRAJMAoACqAOIA5wDvAt0C3wLkAu0C+AL8AwoDEQMaAx8DIgAAAAAAAAIBAAAAAAAAACgAAAAAAAAAAAAAAAAAAAMv},
        Bdsk-Url-1 = {http://citeseer.ist.psu.edu/hur01support.html}
      }

Questa struttura permette di implementare agevolmente le diverse versioni di cluster labeling proposte in letteratura e al contempo di sviluppare diversi trainer, compreso quello per la formulazione Least Squares delle SVM, magari implementando l’algoritmo proposto in

  • S. S. Keerthi and S. K. Shevade, "SMO algorithm for Least Squares SVM formulations," Neural Computation, vol. 15, pp. 487-507, 2003.
    @article{smo4lssvm03,
      author = {S.S. Keerthi and S.K. Shevade},
      Date-Added = {2007-05-19 19:11:38 +0200},
      Date-Modified = {2007-06-19 14:48:12 +0200},
      Journal = {Neural Computation},
      Keywords = {svm, ls-svm},
      Month = {February},
      Pages = {487–507},
      Title = {SMO algorithm for Least Squares SVM formulations},
      Url = {http://ieeexplore.ieee.org/iel5/8672/27485/01223730.pdf},
      Volume = {15},
      Year = {2003},
      Bdsk-File-1 = {YnBsaXN0MDDUAQIDBAUGBwpZJGFyY2hpdmVyWCR2ZXJzaW9uVCR0b3BYJG9iamVjdHNfEA9OU0tleWVkQXJjaGl2ZXISAAGGoNEICVRyb290gAGoCwwXGBkaHiVVJG51bGzTDQ4PEBMWWk5TLm9iamVjdHNXTlMua2V5c1YkY2xhc3OiERKABIAFohQVgAKAA4AHXHJlbGF0aXZlUGF0aFlhbGlhc0RhdGFfEEwuLi8uLi8uLi9QYXBlcnMvS2VlcnRoaS9TTU8gYWxnb3JpdGhtIGZvciBMZWFzdCBTcXVhcmVzIFNWTSBmb3JtdWxhdGlvbnMucGRm0hsPHB1XTlMuZGF0YU8RAjAAAAAAAjAAAgAACURvY3VtZW50cwAAAAAAAAAAAAAAAAAAAAAAAL7OeK5IKwAAADcjpB9TTU8gYWxnb3JpdGhtIGZvciBMZSMzMThDOTYucGRmAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAMYyWwm+fUwAAAAAAAAAAAAMAAwAACQAAAAAAAAAAAAAAAAAAAAAHS2VlcnRoaQAAEAAIAAC+zlyOAAAAEQAIAADCb4MzAAAAAQAUADcjpAA3G4AAALLyAAASxgAAEq0AAgBSRG9jdW1lbnRzOm5lbW86RG9jdW1lbnRzOlVuaXZlcnNpdGE6UGFwZXJzOktlZXJ0aGk6U01PIGFsZ29yaXRobSBmb3IgTGUjMzE4Qzk2LnBkZgAOAGoANABTAE0ATwAgAGEAbABnAG8AcgBpAHQAaABtACAAZgBvAHIAIABMAGUAYQBzAHQAIABTAHEAdQBhAHIAZQBzACAAUwBWAE0AIABmAG8AcgBtAHUAbABhAHQAaQBvAG4AcwAuAHAAZABmAA8AFAAJAEQAbwBjAHUAbQBlAG4AdABzABIAXi9uZW1vL0RvY3VtZW50cy9Vbml2ZXJzaXRhL1BhcGVycy9LZWVydGhpL1NNTyBhbGdvcml0aG0gZm9yIExlYXN0IFNxdWFyZXMgU1ZNIGZvcm11bGF0aW9ucy5wZGYAEwASL1ZvbHVtZXMvRG9jdW1lbnRzABUAAgAX//8AAIAG0h8gISJYJGNsYXNzZXNaJGNsYXNzbmFtZaMiIyRdTlNNdXRhYmxlRGF0YVZOU0RhdGFYTlNPYmplY3TSHyAmJ6InJFxOU0RpY3Rpb25hcnkACAARABsAJAApADIARABJAEwAUQBTAFwAYgBpAHQAfACDAIYAiACKAI0AjwCRAJMAoACqAPkA/gEGAzoDPANBA0oDVQNZA2cDbgN3A3wDfwAAAAAAAAIBAAAAAAAAACgAAAAAAAAAAAAAAAAAAAOM},
      Bdsk-Url-1 = {http://ieeexplore.ieee.org/iel5/8672/27485/01223730.pdf}
    }
This blog is multi language by p.osting.it's Babel