A SQL-Based Implementation for the Discovery of Frequent Itemsets
Document Type
Thesis
Advisors
Dr. Jamal Alsabbagh, alsabbaj@gvsu.edu
Committee Members
Dr. Yonglei Tao, taoy@gvsu.edu; Dr. Christian Trefftz, trefftzc@gvsu.edu
Embargo Period
8-13-2010
Abstract
An important problem in data mining is the discovery of frequent itemsets. Briefly, given a large set of transactions, each of which contains items drawn from a possible set of N item types, we are interested in counting those subsets (from among the possible 2^N subsets) that occur frequently (i.e., above a given threshold) in the transactions. The problem is computationally expensive since the value of N can be in the thousands (e.g., words in a document or items in a large retail chain). In order to manage the computational complexity, algorithms for discovering frequent itemsets rely, in one way or another, on the Apriori principle, which states that all the subsets of a frequent itemset must themselves be frequent.
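To make the Apriori principle concrete, the following is a minimal in-memory sketch of level-wise candidate generation and pruning. It is an illustration only, not the implementation described in the thesis: the function name `apriori`, the use of an absolute support count, and the choice to treat "frequent" as a count greater than or equal to `min_support` are all assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for every frequent itemset.

    Assumption: an itemset is frequent when it occurs in at least
    `min_support` transactions (an absolute count, not a fraction).
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count each single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori principle): drop any candidate that has an
        # infrequent (k-1)-subset, before scanning the transactions.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count the surviving candidates against the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical toy data, used only to exercise the sketch.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent_itemsets = apriori(baskets, min_support=2)
```

In this toy data the pair {milk, butter} occurs only once, so the three-item candidate {bread, milk, butter} is pruned without ever being counted, which is the saving the principle provides.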
The vast majority of published algorithms take as input a flat file of transactions and use algorithm-specific data structures and optimization techniques. This research explores an implementation using SQL on relational data. The motivation is to capitalize on the data storage and query optimization capabilities of a typical relational database management system.
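As a rough illustration of what support counting on relational data can look like, the sketch below stores transactions as (tid, item) rows and finds frequent 1-itemsets and 2-itemsets with `GROUP BY` and a self-join. The schema, the table names `transactions` and `f1`, the helper `frequent_pairs`, and the use of SQLite through Python's `sqlite3` module are assumptions for this example; the thesis's actual schema, queries, and database system are not given in the abstract.

```python
import sqlite3


def frequent_pairs(rows, min_support):
    """Return (frequent 1-itemsets, frequent 2-itemsets) computed in SQL.

    `rows` is an iterable of (tid, item) tuples; each pair is assumed unique.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions (tid INTEGER, item TEXT, PRIMARY KEY (tid, item))"
    )
    conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)

    # Frequent 1-itemsets: items appearing in at least min_support transactions.
    conn.execute("CREATE TABLE f1 (item TEXT PRIMARY KEY, support INTEGER)")
    conn.execute(
        """
        INSERT INTO f1
        SELECT item, COUNT(*)
        FROM transactions
        GROUP BY item
        HAVING COUNT(*) >= ?
        """,
        (min_support,),
    )
    singles = conn.execute("SELECT item, support FROM f1 ORDER BY item").fetchall()

    # Frequent 2-itemsets: self-join on tid, restricted to items already in f1
    # (the Apriori pruning step), with a.item < b.item to list each pair once.
    pairs = conn.execute(
        """
        SELECT a.item, b.item, COUNT(*) AS support
        FROM transactions AS a
        JOIN transactions AS b ON a.tid = b.tid AND a.item < b.item
        JOIN f1 AS fa ON fa.item = a.item
        JOIN f1 AS fb ON fb.item = b.item
        GROUP BY a.item, b.item
        HAVING COUNT(*) >= ?
        ORDER BY a.item, b.item
        """,
        (min_support,),
    ).fetchall()
    conn.close()
    return singles, pairs


# Hypothetical toy data, used only to exercise the sketch.
toy_rows = [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "butter"),
    (3, "bread"), (3, "milk"), (3, "butter"),
    (4, "milk"),
]
singles, pairs = frequent_pairs(toy_rows, min_support=2)
```

Expressing the counting step as a query leaves join ordering and index use to the database's optimizer, which is the kind of capability the paragraph above refers to.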
A tightly coupled implementation of the Apriori and Apriori TID algorithms was developed as part of this research. Its performance was compared empirically with that of a classical published implementation using several available standard datasets.
ScholarWorks Citation
Reader, Christopher R., "A SQL-Based Implementation for the Discovery of Frequent Itemsets" (2010). Technical Library. 1.
https://scholarworks.gvsu.edu/cistechlib/1