June 22 2007

Co-clustering - Synthetic Dataset Test #1

Machine used:
PowerPC G4, 1.5GHz, 768MB RAM, Mac OS X

Software used:

  • H. Cho, Y. Guan, and S. Sra, Co-cluster (v 1.1), 2004.
    @misc{coclus-software,
      author = {Hyuk Cho and Yuqiang Guan and Suvrit Sra},
      Howpublished = {Bregman co-clustering software},
      Keywords = {co-clustering, relative entropy, euclidean distance, software},
      Title = {Co-cluster (v 1.1)},
      Url = {http://www.cs.utexas.edu/users/dml/Software/cocluster.html},
      Year = {2004},
    }

Dataset used:
The dataset used in this test is synthetic, generated with

  • J. R. Vennam and S. Vadapalli, "SynDECA: A Tool to Generate Synthetic Datasets for Evaluation of Clustering Algorithms," in 11th International Conference on Management of Data (COMAD 2005), Goa, India, 2005.
    @conference{syndeca2005,
      Address = {Goa, India},
      Author = {Jhansi Rani Vennam and Soujanya Vadapalli},
      Booktitle = {11th International Conference on Management of Data (COMAD 2005)},
      Keywords = {clustering, tool, synthetic, dataset, generator},
      Month = {January},
      Organization = {http://cde.iiit.ac.in/syndeca},
      Title = {SynDECA: A Tool to Generate Synthetic Datasets for Evaluation of Clustering Algorithms},
      Url = {http://comad2005.persistent.co.in/COMAD2005Proc/pages027-036.pdf},
      Year = {2005},
    }

The dataset is composed as follows:
Objects: 1000
Attributes: 10
Classes: 5, for a total of 888 points (Cluster 0: 327, Cluster 1: 134, Cluster 2: 162, Cluster 3: 132, Cluster 4: 133)
Noise points: 112 (points that cannot be classified)
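
As a quick sanity check on the figures above, the per-cluster counts and the noise points can be added up. This is only an arithmetic sketch; the variable names are illustrative and do not come from SynDECA or the Co-cluster software.

```python
# Cluster sizes and noise count as reported for the synthetic dataset.
cluster_sizes = {0: 327, 1: 134, 2: 162, 3: 132, 4: 133}
noise_points = 112

# Points that belong to one of the 5 classes.
classified = sum(cluster_sizes.values())

# All objects in the dataset, classified or not.
total = classified + noise_points

# Share of the dataset that is noise.
noise_fraction = noise_points / total

print(classified, total, noise_fraction)
```

So roughly one object in nine is noise.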

Co-clustering algorithm used: Euclidean Distance Based, Minimum Sum Squared, Information Theoretic

Problems: In this first test, run on a noisy dataset, the co-clustering scheme does not appear to be designed to identify noise and separate it from the rest of the classification. As a result, every co-clustering instance tends to assign the noise points to one of the five requested classes, skewing the results.
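
The behaviour described above is typical of partitioning schemes in general: every point is assigned to its closest cluster, and there is no "none of these" outcome. The following is a minimal sketch of that idea using a plain nearest-centroid rule. It is not the Co-cluster software's code, and the function name, centroids and sample point are all made up for illustration.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

# Two hypothetical cluster centres and one point that is far from both.
centroids = [(0.0, 0.0), (10.0, 10.0)]
outlier = (100.0, -50.0)

# The outlier still receives a cluster label, because the rule has no way
# to say "this point belongs to no cluster".
label = nearest_centroid(outlier, centroids)
print(label)
```

Under this kind of rule, a noise point always ends up in some class, which is consistent with the skewed results observed in the test.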

Removing the noise points: After removing the noise points, we obtained a dataset of 888 points, and the algorithm (Euclidean Distance Based, with 5 co-clusters requested) separated the 5 classes perfectly, with no errors, in the following time:
User = 0 second(s) 138552 ms
System = 0 second(s) 6630 ms
Time/Run = 0.138552 second(s)
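
The noise-removal step can be sketched as a simple filter over labelled rows. This assumes each row is a pair (features, label) and that noise points carry a special label. The value -1 and the names below are hypothetical, since the actual layout of the SynDECA output is not shown here.

```python
NOISE_LABEL = -1  # hypothetical marker for noise points; an assumption, not SynDECA's format

def drop_noise(rows, noise_label=NOISE_LABEL):
    """Keep only the rows whose label differs from the noise marker."""
    return [(features, label) for features, label in rows if label != noise_label]

# Tiny made-up sample: two classified points and one noise point.
rows = [
    ((0.1, 0.2), 0),
    ((9.8, 9.9), 1),
    ((55.0, -3.0), NOISE_LABEL),
]

clean = drop_noise(rows)
print(len(rows), len(clean))
```

Applied to the dataset above, a filter of this kind would keep the 888 classified points and discard the 112 noise points.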
