Abstract
Clustering is a machine learning technique that groups a set of objects into subsets, or clusters. The goal is to create clusters that are internally coherent but substantially different from each other. In plain words, objects in the same cluster should be as similar as possible, whereas objects in different clusters should be as dissimilar as possible.
Clustering is an unsupervised learning technique because it groups objects into clusters without any additional information: only the information carried by the data is used, and no human supervision (such as labeled examples) is supplied to guide the learning.
The application domains are manifold. One example is the grouping of text documents, where the goal is to build groups of related documents, i.e. documents dealing with the same topic.
The goal of this thesis is an in-depth study of state-of-the-art and experimental clustering techniques. We consider two techniques. The first is known as the Minimum Bregman Information principle. This principle generalizes the classic relocation scheme already adopted by K-means so that it can be used with a wide range of divergence functions known as Bregman divergences. A new, more general clustering scheme was developed on top of this principle. Moreover, a co-clustering scheme is formulated as well. This leads to an important generalization, as we will see in the sequel.
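As a rough illustration of the relocation scheme mentioned above, the following Python sketch alternates an assignment step and a centroid update step, with the divergence passed in as a parameter. It is a minimal sketch under stated assumptions, not the implementation developed in this thesis: the function names (`squared_euclidean`, `generalized_kl`, `bregman_hard_clustering`) and the parameters `n_iter` and `seed` are hypothetical, and the co-clustering variant is not covered. The sketch relies on the known property that, for any Bregman divergence, the best representative of a cluster is the arithmetic mean of its members, so only the assignment step depends on the chosen divergence. With the squared Euclidean distance the procedure reduces to ordinary K-means.

```python
import numpy as np


def squared_euclidean(x, mu):
    # Bregman divergence generated by phi(x) = ||x||^2 (the K-means case).
    return np.sum((x - mu) ** 2, axis=-1)


def generalized_kl(x, mu, eps=1e-12):
    # Generalized I-divergence, generated by phi(x) = sum(x * log(x)).
    # Assumes non-negative inputs; eps (hypothetical) guards against log(0).
    x = np.maximum(x, eps)
    mu = np.maximum(mu, eps)
    return np.sum(x * np.log(x / mu) - x + mu, axis=-1)


def bregman_hard_clustering(X, k, divergence, n_iter=100, seed=0):
    """Relocation scheme with a pluggable Bregman divergence (sketch)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(n_iter):
        # Assignment step: divergence of every point to every centroid,
        # computed by broadcasting to an (n_points, k) matrix.
        d = divergence(X[:, None, :], centroids[None, :, :])
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable, so the scheme has converged
        labels = new_labels
        # Update step: the arithmetic mean is the optimal representative
        # for every Bregman divergence; empty clusters keep their centroid.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

Swapping `squared_euclidean` for `generalized_kl` changes only how points are assigned to centroids; the update step stays the same, which is the sense in which the scheme generalizes K-means.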
The second approach is Support Vector Clustering, a clustering process that relies on state-of-the-art learning machines: the Support Vector Machines. Support Vector Clustering is currently a subject of active research, as it is still at an early stage of development. We have carefully analyzed this clustering method and have also provided some contributions that reduce the number of iterations and the computational complexity and improve accuracy.
The main application domains we have dealt with are text mining and astrophysics data mining. Within these domains we have verified and carefully analyzed the properties of both methodologies by means of dedicated experiments.
The results are reported in terms of robustness to missing values, dimensionality reduction, robustness to noise and outliers, and the ability to describe clusters of arbitrary shape.