Comparison of Hierarchical Agglomerative Clustering Utilizing an Apriori Itemset Lattice with Term Based Vectors versus Bisec
Jamal R. Alsabbagh, firstname.lastname@example.org
The response to a query against the web or an enterprise’s electronic data can overwhelm the user, since it often includes thousands of documents such as articles, correspondence, and emails. One commonly used approach to this problem is to automatically group the returned documents into clusters such that similar documents are assigned to the same cluster.
A major challenge in document clustering is how to define the very notion of similarity and then how to apply it practically. A common approach is to identify a set of terms derived from the corpus of documents and then represent each document as a vector that reflects the presence or absence in the document of each one of the identified terms. Clusters are then created based upon similarity among documents as derived from the respective vectors of the different documents.
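As an illustration of the vector representation described above, the following minimal sketch builds presence/absence vectors and compares them. The whitespace tokenizer, the sample documents, and the choice of cosine similarity are assumptions made for the example; the source does not name a specific similarity measure.

```python
import math

def build_vocabulary(docs):
    """Collect the sorted set of distinct lowercase terms across all documents."""
    return sorted({term for doc in docs for term in doc.lower().split()})

def to_binary_vector(doc, vocabulary):
    """Return a 0/1 vector marking which vocabulary terms occur in the document."""
    present = set(doc.lower().split())
    return [1 if term in present else 0 for term in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical sample documents, not taken from the study's corpus.
docs = ["oil prices rise", "oil prices fall", "wheat harvest grows"]
vocab = build_vocabulary(docs)
vectors = [to_binary_vector(d, vocab) for d in docs]
```

Here the first two documents share two of their three terms, so they score higher than either does against the third, which shares none.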
Several clustering algorithms have been proposed and evaluated in the literature. Two such algorithms are the Bisecting K-Means (a variation of K-Means) and Hierarchical Agglomerative Clustering (HAC). Bisecting K-Means is top-down whereby smaller clusters are created by repeatedly partitioning larger ones. In contrast, HAC is bottom-up whereby large clusters are created by successively merging smaller ones. Empirical comparative studies reported in the literature show that while K-Means is superior in terms of run time, HAC has the advantage of producing better clusters.
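The bottom-up behavior of HAC can be sketched as follows. This is a simplified illustration, not the implementation evaluated in the cited studies: it assumes documents are represented as term sets, uses Jaccard similarity with average linkage, and stops at a caller-supplied number of clusters `k`, all of which are choices made for the example.

```python
def jaccard(a, b):
    """Jaccard similarity between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def average_link(c1, c2, docs):
    """Mean pairwise similarity between the documents of two clusters."""
    total = sum(jaccard(docs[i], docs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def hac(docs, k):
    """Start with one cluster per document and merge the closest pair until k remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = average_link(clusters[x], clusters[y], docs)
                if best is None or sim > best[0]:
                    best = (sim, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge y into x
        del clusters[y]
    return clusters

# Hypothetical term sets for four documents.
sample_docs = [
    {"oil", "prices"},
    {"oil", "prices", "opec"},
    {"wheat", "harvest"},
    {"wheat", "crop"},
]
merged = hac(sample_docs, 2)
```

Each pass scans every pair of clusters, which is the main reason this style of algorithm is slower than K-Means variants on large collections.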
Work performed by the author (to be presented at DMIN’06 in June 2006) investigated clustering based on the Boolean intersection of document term sets. In this algorithm, documents are associated with terms and the resulting sets of terms are intersected. If an intersection produces a set whose size is equal to or larger than a predefined minimum support level, that new set is considered a relevant cluster. The algorithm’s first intersections were carried out at the three-term level, allowing clusters to overlap at this level. Documents that were clustered were then removed from further consideration, and the process was repeated at the two-term level.
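One possible reading of this two-pass procedure is sketched below. It is an interpretation for illustration only, not the author's implementation: it assumes that a cluster is the set of documents containing every term of a given term combination, that support is the number of such documents, and that the function names and sample data are hypothetical.

```python
from itertools import combinations

def clusters_at_level(doc_terms, size, min_support):
    """Group document ids by each term combination of the given size,
    keeping only combinations shared by at least min_support documents."""
    groups = {}
    for doc_id, terms in doc_terms.items():
        for combo in combinations(sorted(terms), size):
            groups.setdefault(combo, set()).add(doc_id)
    return {combo: ids for combo, ids in groups.items() if len(ids) >= min_support}

def two_pass_clustering(doc_terms, min_support):
    """Cluster at the three-term level (overlap allowed), drop the clustered
    documents, then cluster the remainder at the two-term level."""
    three = clusters_at_level(doc_terms, 3, min_support)
    clustered = set().union(*three.values())
    remaining = {d: t for d, t in doc_terms.items() if d not in clustered}
    two = clusters_at_level(remaining, 2, min_support)
    return three, two

# Hypothetical documents keyed by id.
doc_terms = {
    1: {"oil", "prices", "opec"},
    2: {"oil", "prices", "opec", "rise"},
    3: {"wheat", "harvest"},
    4: {"wheat", "harvest", "crop"},
    5: {"gold"},
}
three_level, two_level = two_pass_clustering(doc_terms, min_support=2)
```

Enumerating every term combination per document grows quickly with document length, which is manageable for titles but becomes costly for full article bodies.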
Clustering a collection of the titles of 21,578 news articles took from five to seven minutes, depending on the minimum support for cluster size. Cluster quality was evaluated manually and appeared reasonable.
The proposed study will evaluate the author’s previous algorithm on the text body of the same news articles. An Apriori approach to generating the term intersection sets will be used in an attempt to improve performance.
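The general Apriori idea is to build frequent term sets level by level and to discard any candidate that has an infrequent subset before counting it. The sketch below shows that idea in its textbook form; how the proposed study applies it to the itemset lattice is not specified here, so the function name, the `max_size` cutoff, and the sample data are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support, max_size):
    """Return a dict mapping each frequent itemset (a frozenset) to its support count."""
    # Level 1: count single terms.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    size = 2
    while frequent and size <= max_size:
        prev = set(frequent)
        # Join step: union pairs of frequent (size - 1)-itemsets into size-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == size}
        # Prune step: every (size - 1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, size - 1))
        }
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= min_support}
        result.update(frequent)
        size += 1
    return result

# Hypothetical term sets, one per document.
transactions = [
    {"oil", "prices", "opec"},
    {"oil", "prices", "opec", "rise"},
    {"oil", "prices"},
    {"wheat", "harvest"},
]
itemsets = apriori(transactions, min_support=2, max_size=3)
```

Because rare terms are eliminated at the first level, far fewer multi-term combinations need to be counted than with exhaustive enumeration.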
The quality of the clusters produced will be compared to the results of the Bisecting K-Means Algorithm by examining and contrasting the cohesion and F values of the clusters created, as well as by manual examination.
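The abstract does not define its cohesion or F values, so the following sketch uses common formulations as assumptions: F is taken to be the harmonic mean of precision and recall of a cluster against a labeled class, with the overall score weighting each class by size and taking its best-matching cluster, and cohesion is taken to be the mean pairwise Jaccard similarity of the documents in a cluster.

```python
from itertools import combinations

def f_measure(cluster, klass):
    """Harmonic mean of precision and recall of a cluster with respect to a class."""
    overlap = len(cluster & klass)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(klass)
    return 2 * precision * recall / (precision + recall)

def overall_f(clusters, classes):
    """Size-weighted average, over classes, of the best F achieved by any cluster."""
    total = sum(len(k) for k in classes)
    return sum(
        len(k) / total * max(f_measure(c, k) for c in clusters) for k in classes
    )

def cohesion(cluster, doc_terms):
    """Mean pairwise Jaccard similarity among a cluster's documents (1.0 for singletons)."""
    pairs = list(combinations(sorted(cluster), 2))
    if not pairs:
        return 1.0
    sims = []
    for i, j in pairs:
        union = doc_terms[i] | doc_terms[j]
        sims.append(len(doc_terms[i] & doc_terms[j]) / len(union) if union else 0.0)
    return sum(sims) / len(sims)

# Hypothetical clustering result and reference classes over document ids 1..5.
clusters = [{1, 2, 3}, {4, 5}]
classes = [{1, 2}, {3, 4, 5}]
score = overall_f(clusters, classes)

terms = {1: {"a", "b"}, 2: {"a", "b"}, 3: {"a", "c"}}
tightness = cohesion({1, 2, 3}, terms)
```

F requires reference labels for the documents, whereas cohesion can be computed from the clusters alone, so the two measures complement each other.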
Bartman, Casey R., "Comparison of Hierarchical Agglomerative Clustering Utilizing an Apriori Itemset Lattice with Term Based Vectors versus Bisec" (2006). Technical Library. 70.