A SQL-Based Implementation for the Discovery of Frequent Itemsets

Dr. Jamal Alsabbagh, alsabbaj@gvsu.edu

Committee Members

Dr. Yonglei Tao, taoy@gvsu.edu; Dr. Christian Trefftz, trefftzc@gvsu.edu

An important problem in data mining is the discovery of frequent itemsets. Briefly, given a large set of transactions, each of which contains items drawn from a set of N possible item types, we are interested in finding and counting those subsets (from among the 2^N possible subsets) that occur frequently (i.e., in at least a given threshold number of transactions). The problem is computationally expensive since the value of N can be in the thousands (e.g., words in a document or items in a large retail chain). To manage the computational complexity, algorithms for discovering frequent itemsets rely, in one way or another, on the Apriori principle, which states that all subsets of a frequent itemset must themselves be frequent. Equivalently, any itemset that has an infrequent subset cannot be frequent, so it can be discarded without being counted.
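The level-wise search that the Apriori principle enables can be illustrated with a minimal in-memory sketch. This is not the implementation developed in this research; the function name `frequent_itemsets`, the parameter `min_support` (an absolute transaction count), and the representation of transactions as Python sets are all assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for every itemset found in at least
    min_support transactions. Hypothetical helper, for illustration only."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count every single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori principle): keep a candidate only if every
        # one of its (k-1)-item subsets is itself frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count step: scan the transactions for the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        result.update(current)
        k += 1
    return result
```

In this sketch the prune step is what keeps the search away from the full 2^N space: a candidate is counted against the transactions only after all of its smaller subsets have already been shown to be frequent.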

The vast majority of published algorithms take as input a flat file of transactions and use algorithm-specific data structures and optimization techniques. This research explores an implementation using SQL on relational data. The motivation is to capitalize on the data storage and query optimization capabilities of a typical relational database management system.
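To give a flavor of what counting itemsets in SQL can look like, the sketch below finds frequent 2-itemsets with a self-join. It is only an assumed illustration and does not reproduce the queries used in this research: the table `transactions(tid, item)`, the function name `frequent_pairs`, and the use of an in-memory SQLite database through Python's standard `sqlite3` module are all hypothetical choices.

```python
import sqlite3


def frequent_pairs(rows, min_support):
    """rows: iterable of (tid, item) pairs. Returns a list of
    (item1, item2, support) for pairs that occur together in at least
    min_support transactions. Hypothetical schema, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        " tid INTEGER, item TEXT, PRIMARY KEY (tid, item))"
    )
    conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)

    # f1 holds the frequent single items. Restricting the self-join to
    # items in f1 applies the Apriori principle: a pair can be frequent
    # only if both of its items are frequent.
    query = """
        WITH f1 AS (
            SELECT item
            FROM transactions
            GROUP BY item
            HAVING COUNT(*) >= :minsup
        )
        SELECT a.item, b.item, COUNT(*) AS support
        FROM transactions AS a
        JOIN transactions AS b
          ON a.tid = b.tid AND a.item < b.item
        WHERE a.item IN (SELECT item FROM f1)
          AND b.item IN (SELECT item FROM f1)
        GROUP BY a.item, b.item
        HAVING COUNT(*) >= :minsup
        ORDER BY a.item, b.item
    """
    result = conn.execute(query, {"minsup": min_support}).fetchall()
    conn.close()
    return result
```

Expressing the count as a join with `GROUP BY` and `HAVING` leaves the choice of join order and access paths to the database's query optimizer, which is the kind of capability the approach described above aims to exploit.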

As part of this research, the Apriori and AprioriTid algorithms were implemented in a form tightly coupled with the database. The performance of this implementation was compared empirically with that of a classical published implementation on several publicly available standard datasets.
