June 19 2007

Synthetic Datasets for Clustering Benchmarks

Very often, when testing clustering algorithms, it is useful to have synthetic sample datasets at hand, that is, datasets created artificially rather than drawn from real data.

For this purpose, the work done at the Center for Data Engineering, International Institute of Information Technology, Hyderabad, India, proves very useful:

  • J. R. Vennam and S. Vadapalli, "SynDECA: A Tool to Generate Synthetic Datasets for Evaluation of Clustering Algorithms," in 11th International Conference on Management of Data (COMAD 2005), Goa, India, 2005.
    @conference{syndeca2005, Address = {Goa, India},
      Author = {Jhansi Rani Vennam and Soujanya Vadapalli},
      Booktitle = {11th International Conference on Management of Data (COMAD 2005)},
      Date-Added = {2007-06-18 16:18:49 +0200},
      Date-Modified = {2007-07-03 18:34:02 +0200},
      Keywords = {clustering, tool, synthetic, dataset, generator},
      Month = {January},
      Organization = {http://cde.iiit.ac.in/syndeca},
      Title = {SynDECA: A Tool to Generate Synthetic Datasets for Evaluation of Clustering Algorithms},
      Url = {http://comad2005.persistent.co.in/COMAD2005Proc/pages027-036.pdf},
      Year = {2005},
      Bdsk-Url-1 = {http://comad2005.persistent.co.in/COMAD2005Proc/pages027-036.pdf}
    }

The tool produces synthetic datasets very quickly: a dataset with a 2D feature space, one million points, and hundreds of clusters is typically generated in a few seconds.

For each dataset produced, the tool also provides details about the clustering, such as:

- which points belong to which cluster
- the number of clusters
- the number of points per cluster
- the shape of the clusters
- etc.
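
To make the idea of a synthetic dataset with known ground truth concrete, here is a minimal sketch in Python. It is not SynDECA's method or output format: it only draws Gaussian blobs around random centers, and the function name `make_gaussian_blobs` and its parameters (`spread`, `seed`) are hypothetical, chosen for illustration. What it shows is the kind of information listed above: the cluster of each point, the number of clusters, and the number of points per cluster.

```python
import random
from collections import Counter


def make_gaussian_blobs(n_points, n_clusters, spread=0.02, seed=0):
    """Return (points, labels, centers) for a 2D synthetic dataset.

    Hypothetical helper for illustration only; not part of SynDECA.
    """
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    # Cluster centers drawn uniformly in the unit square.
    centers = [(rng.random(), rng.random()) for _ in range(n_clusters)]
    points, labels = [], []
    for _ in range(n_points):
        k = rng.randrange(n_clusters)  # ground-truth cluster of this point
        cx, cy = centers[k]
        # Point scattered around its center with standard deviation `spread`.
        points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
        labels.append(k)
    return points, labels, centers


points, labels, centers = make_gaussian_blobs(n_points=10_000, n_clusters=20)
sizes = Counter(labels)  # number of points per cluster
print(len(centers), "clusters,", len(points), "points")
print("largest clusters:", sizes.most_common(3))
```

Because the labels are known in advance, the output of a clustering algorithm run on `points` can be compared directly against `labels`.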
