A SQL-Based Implementation for the Discovery of Frequent Itemsets

Document Type

Thesis Proposal

Advisors

Dr. Jamal Alsabbagh, alsabbaj@gvsu.edu

Committee Members

Dr. Yonglei Tao, taoy@gvsu.edu; Dr. Christian Trefftz, trefftzc@gvsu.edu

Embargo Period

8-16-2010

Comments

An important problem in data mining is the discovery of frequent itemsets. Briefly, given a large set of transactions, each of which contains items drawn from a set of N possible item types, we are interested in counting those subsets (from among the 2^N possible subsets) that occur frequently (i.e., above a given threshold) in the transactions. The problem is computationally expensive since the value of N can be in the thousands (e.g., words in a document or items in a large retail chain). To manage the computational complexity, algorithms for discovering frequent itemsets rely, in one way or another, on the Apriori principle, which states that all the subsets of a frequent itemset must themselves be frequent; consequently, any candidate itemset that contains an infrequent subset can be discarded without being counted. The vast majority of published algorithms take as input a flat file of transactions and use algorithm-specific data structures and optimization techniques. This research proposes an implementation using SQL on relational data. The motivation is to capitalize on the data storage and query optimization capabilities of a typical relational database management system. A prototype implementation has been completed as part of this research. When the final implementation is completed, its performance will be compared empirically with a classical published implementation using several available standard datasets.
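
The general idea of Apriori-style counting in SQL can be illustrated with a minimal sketch. This is not the implementation proposed in the thesis, which is not available here; it is only one plausible way to express the first two passes on relational data. The table names (transactions, f1), the toy rows, and the support threshold are all hypothetical, the threshold is treated as a minimum count (at least 2 transactions), and Python's built-in sqlite3 module is used only as a convenient way to run the SQL.

```python
import sqlite3

# Hypothetical toy data: one row per (transaction id, item). Not from the proposal.
ROWS = [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "butter"), (2, "milk"),
    (3, "bread"), (3, "butter"),
    (4, "eggs"), (4, "milk"),
]
MIN_SUPPORT = 2  # assumed threshold: an itemset must occur in at least 2 transactions

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (tid INTEGER, item TEXT, PRIMARY KEY (tid, item))"
)
conn.executemany("INSERT INTO transactions VALUES (?, ?)", ROWS)

# Pass 1: frequent 1-itemsets, stored in a table so later passes can reuse them.
conn.execute("CREATE TABLE f1 (item TEXT PRIMARY KEY, support INTEGER)")
conn.execute(
    """
    INSERT INTO f1
    SELECT item, COUNT(*)
    FROM transactions
    GROUP BY item
    HAVING COUNT(*) >= ?
    """,
    (MIN_SUPPORT,),
)
frequent_items = conn.execute(
    "SELECT item, support FROM f1 ORDER BY item"
).fetchall()

# Pass 2: count pairs with a self-join on the transaction id.
# Apriori pruning: only pairs whose two items are both in f1 are considered.
# The condition a.item < b.item keeps each unordered pair exactly once.
frequent_pairs = conn.execute(
    """
    SELECT a.item, b.item, COUNT(*) AS support
    FROM transactions a
    JOIN transactions b ON a.tid = b.tid AND a.item < b.item
    WHERE a.item IN (SELECT item FROM f1)
      AND b.item IN (SELECT item FROM f1)
    GROUP BY a.item, b.item
    HAVING COUNT(*) >= ?
    ORDER BY a.item, b.item
    """,
    (MIN_SUPPORT,),
).fetchall()

print(frequent_items)
print(frequent_pairs)
```

In this toy data, "eggs" occurs in only one transaction, so it is excluded in the first pass and no pair containing it is ever counted in the second. Larger itemsets would follow the same pattern, with each pass joining against the frequent itemsets found in the previous one.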

This document is currently not available here.
