TIC: Term Intersection Clustering of Text Documents

Document Type

Thesis

Advisors

Dr. Jamal Alsabbagh, alsabbaj@gvsu.edu

Embargo Period

8-17-2010

Abstract

Preliminary work by the author [3] investigated the clustering of text documents based on the Boolean intersection of document term sets. In that algorithm, documents were associated with terms and the resulting sets were intersected. If an intersection produced a set whose size was equal to or larger than a predefined minimum support level, that new set was considered a relevant cluster. The first intersections were carried out at the three-term level, and clusters were allowed to overlap at this level. Documents that were clustered were then removed from further consideration, and the process was repeated at the two-term level. In this study, the previously described algorithm was adapted to create a more robust and scalable implementation. The modified algorithm, Term Intersection Clustering (TIC), was evaluated against the Bisecting K-Means algorithm using the body text of the articles in the Reuters-21578 news corpus [13]. Cohesion, as defined and implemented by the author, was superior for the Bisecting K-Means algorithm, but the practical value of the clusters, as judged by manual review, was superior for the TIC algorithm. Run times were similar for the two algorithms. Furthermore, the output of the TIC algorithm was found to be better suited to indexing and recall than the output of the Bisecting K-Means algorithm.
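
The two-level procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation. It assumes that each term is mapped to the set of documents containing it, that a cluster is the set of documents shared by a combination of terms, and that the minimum support level is a document count. The function name term_intersection_clusters and its parameters are hypothetical, and the brute-force enumeration of term combinations is only practical for small vocabularies.

```python
from itertools import combinations


def term_intersection_clusters(docs, min_support=2, levels=(3, 2)):
    """Cluster documents by intersecting per-term document sets.

    docs maps a document id to its set of terms. The result maps a
    sorted tuple of terms to the set of document ids sharing all of them.
    Assumption: "support" is the number of documents in the intersection.
    """
    remaining = dict(docs)  # documents not yet assigned to any cluster
    clusters = {}
    for k in levels:
        # Build an inverted index over the documents still unclustered.
        postings = {}
        for doc_id, terms in remaining.items():
            for term in terms:
                postings.setdefault(term, set()).add(doc_id)

        # A term below the support level cannot appear in any cluster.
        frequent = sorted(
            t for t, ids in postings.items() if len(ids) >= min_support
        )

        clustered = set()
        for combo in combinations(frequent, k):
            members = set.intersection(*(postings[t] for t in combo))
            if len(members) >= min_support:
                # Clusters found at the same level may share documents.
                clusters[combo] = members
                clustered |= members

        # Clustered documents are excluded from the next, smaller level.
        for doc_id in clustered:
            del remaining[doc_id]
    return clusters
```

With a minimum support of 2, two documents sharing the terms oil, opec, and price would form a cluster at the three-term level, while two other documents sharing only wheat and export would be picked up at the two-term level. A document that shares no qualifying term combination is left unclustered.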

This document is currently not available here.
