May 10 2007

Missing values, co-clustering, and missing value prediction

The problem of missing values is apparently keenly felt, especially in astrophysics, where, as Prof. Longo can testify, several thousand incompletely described data points get thrown away. Co-clustering seems to offer a way to tackle this tedious problem.

As explicitly stated in

  • A. B. Tchagang and A. H. Tewfik, "Robust biclustering algorithm (ROBA) for DNA microarray data analysis," in 13th IEEE Workshop on Statistical Signal Processing, 2005, pp. 984-989.
    @conference{roba2005,
      author = {Alan B. Tchagang and Ahmed H. Tewfik},
      Booktitle = {13th IEEE Workshop on Statistical Signal Processing},
      Date-Added = {2007-05-10 13:07:21 +0200},
      Date-Modified = {2007-07-15 11:14:28 +0200},
      Keywords = {co-clustering, bioinformatics, missing values},
      Pages = {984--989},
      Title = {Robust biclustering algorithm ({ROBA}) for {DNA} microarray data analysis},
      Url = {http://ieeexplore.ieee.org/iel5/10843/34164/01628738.pdf},
      Year = {2005},
      Bdsk-Url-1 = {http://ieeexplore.ieee.org/iel5/10843/34164/01628738.pdf}
    }
  • Y. Cheng and G. M. Church, "Biclustering of Expression Data," in Intelligent Systems for Molecular Biology, 2000, pp. 93-103.
    @inproceedings{cheng-biclustering00,
      author = {Yizong Cheng and George M. Church},
      Booktitle = {Intelligent Systems for Molecular Biology},
      Date-Added = {2007-05-09 22:25:18 +0200},
      Date-Modified = {2007-06-29 08:47:17 +0200},
      Keywords = {clustering, co-clustering, bioinformatics, biclustering},
      Pages = {93--103},
      Publisher = {AAAI Press},
      Title = {Biclustering of Expression Data},
      Url = {http://citeseer.ist.psu.edu/cheng00biclustering.html},
      Year = {2000},
      Bdsk-Url-1 = {http://citeseer.ist.psu.edu/cheng00biclustering.html}
    }

co-clustering groups objects that are similar to one another with respect to a subset of attributes, rather than with respect to all the attributes that describe the objects. Since these subsets are obtained through a feature clustering carried out together with the data clustering, the process should, by construction, not be affected by the presence of missing values.
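To make "similar with respect to a subset of attributes" concrete, here is a minimal Python sketch of the mean squared residue, the coherence score Cheng and Church use to rate a bicluster (lower is more coherent). The function name and the toy matrix are made up for illustration; they do not come from the papers.

```python
import numpy as np

def mean_squared_residue(data, rows, cols):
    """Mean squared residue of the submatrix data[rows, cols].

    A score of 0 means the submatrix follows a perfect additive pattern
    (row effect + column effect), i.e. a perfectly coherent bicluster.
    """
    sub = data[np.ix_(rows, cols)]
    row_means = sub.mean(axis=1, keepdims=True)
    col_means = sub.mean(axis=0, keepdims=True)
    overall_mean = sub.mean()
    residue = sub - row_means - col_means + overall_mean
    return float(np.mean(residue ** 2))

# Toy data (hypothetical): rows 0-2 are coherent on columns 0, 1 and 3 only.
data = np.array([
    [1.0, 2.0, 9.0, 3.0],
    [2.0, 3.0, 0.0, 4.0],
    [4.0, 5.0, 7.0, 6.0],
    [8.0, 1.0, 5.0, 2.0],
])

score_subset = mean_squared_residue(data, [0, 1, 2], [0, 1, 3])
score_full = mean_squared_residue(data, [0, 1, 2, 3], [0, 1, 2, 3])
print(score_subset, score_full)
```

The three rows look unrelated when all four columns are considered, but on the selected column subset they differ only by a constant offset, which is exactly the kind of structure a bicluster captures.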

Indeed, in

  • A. Banerjee, I. S. Dhillon, J. Ghosh, S. Merugu, and D. Modha, "A generalized Maximum Entropy approach to Bregman co-clustering and matrix approximation," UTCS TR04-24, UT, Austin, 2004.
    @techreport{banerjee04generalized,
      Address = {UT, Austin},
      Author = {A. Banerjee and I. S. Dhillon and J. Ghosh and S. Merugu and D. Modha},
      Date-Modified = {2007-07-15 11:15:53 +0200},
      Institution = {UTCS TR04-24},
      Keywords = {bregman, clustering, co-clustering, sparse data, missing values},
      Rating = {4},
      Title = {A generalized {Maximum Entropy} approach to {Bregman} co-clustering and matrix approximation},
      Url = {http://www.cs.utexas.edu/ftp/pub/techreports/tr04-24.ps.gz},
      Year = {2004},
      Bdsk-Url-1 = {http://www.cs.utexas.edu/ftp/pub/techreports/tr04-24.ps.gz}
    }
  • A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha, "A generalized Maximum Entropy approach to Bregman co-clustering and matrix approximation," in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2004, pp. 509-514.
    @inproceedings{banerjee04generalizedkdd,
      author = {A. Banerjee and I. Dhillon and J. Ghosh and S. Merugu and D. Modha},
      Booktitle = {Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)},
      Date-Added = {2007-04-16 10:48:17 +0200},
      Date-Modified = {2007-07-15 11:15:39 +0200},
      Keywords = {clustering, co-clustering, bregman, sparse data, missing values},
      Month = {August},
      Pages = {509--514},
      Title = {A generalized {Maximum Entropy} approach to {Bregman} co-clustering and matrix approximation},
      Url = {http://citeseer.ist.psu.edu/banerjee04generalized.html},
      Year = {2004},
      Bdsk-Url-1 = {http://citeseer.ist.psu.edu/banerjee04generalized.html}
    }

“Missing Value Prediction” is also discussed (Sect. 5.3 and Sect. 4.2, respectively): co-clustering is used to predict missing values by setting them to 0 and running the co-clustering algorithm. The algorithm proceeds regardless of the missing data; once the co-clustering has been found, the approximate matrix built from it can be used to predict the missing values with a reasonably low error.
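The procedure can be sketched in Python as follows. This is only a rough illustration under several assumptions of mine, not the algorithm from the papers: it uses the plain squared-error case with one mean per co-cluster block, stores missing entries as 0 and additionally gives them zero weight so that they do not pull the block means, and takes the initial row and column labels from the caller. All function and variable names are hypothetical.

```python
import numpy as np

def block_means(Z, W, r, c, k, l):
    """Weighted mean of every (row cluster, column cluster) block."""
    R = np.eye(k)[r]                      # one-hot row memberships, shape (m, k)
    C = np.eye(l)[c]                      # one-hot column memberships, shape (n, l)
    sums = R.T @ (W * Z) @ C              # sum of observed values per block
    counts = R.T @ W @ C                  # number of observed values per block
    fallback = (W * Z).sum() / max(W.sum(), 1.0)  # for blocks with no observed entry
    return np.where(counts > 0, sums / np.maximum(counts, 1.0), fallback)

def cocluster_impute(X, observed, r, c, k, l, n_iter=20):
    """Alternate row/column reassignment, then fill the missing entries."""
    W = observed.astype(float)            # assumption: zero weight on missing entries
    Z = np.where(observed, X, 0.0)        # missing entries set to 0
    r, c = np.asarray(r).copy(), np.asarray(c).copy()
    for _ in range(n_iter):
        # Move each row to the row cluster whose block means fit it best.
        M = block_means(Z, W, r, c, k, l)
        diff = Z[:, None, :] - M[:, c][None, :, :]
        new_r = (W[:, None, :] * diff ** 2).sum(axis=2).argmin(axis=1)
        # Same for the columns, using the updated row clusters.
        M = block_means(Z, W, new_r, c, k, l)
        diff = Z[:, :, None] - M[new_r, :][:, None, :]
        new_c = (W[:, :, None] * diff ** 2).sum(axis=0).argmin(axis=1)
        converged = np.array_equal(new_r, r) and np.array_equal(new_c, c)
        r, c = new_r, new_c
        if converged:
            break
    approx = block_means(Z, W, r, c, k, l)[np.ix_(r, c)]  # approximate matrix
    return np.where(observed, X, approx), r, c

# Toy example (hypothetical data): a 6x5 matrix with a 2x2 block structure.
row_true = np.array([0, 0, 0, 1, 1, 1])
col_true = np.array([0, 0, 1, 1, 1])
blocks = np.array([[1.0, 5.0], [9.0, 3.0]])
X = blocks[np.ix_(row_true, col_true)]
observed = np.ones_like(X, dtype=bool)
observed[0, 1] = observed[2, 3] = observed[4, 0] = False  # hide three entries

rng = np.random.default_rng(0)
filled, r, c = cocluster_impute(
    X, observed, rng.integers(0, 2, 6), rng.integers(0, 2, 5), k=2, l=2
)
print(filled)
```

Alternating updates of this kind can get stuck in a poor local optimum depending on the initial labels, so in practice one would probably try several random initializations and keep the one with the lowest error on the observed entries.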
