Implementation of ODscan: An Algorithm for Mining Frequent Itemsets

Dr. Jamal Alsabbagh, alsabbaj@gvsu.edu

Association rules are an important knowledge representation model in data mining. Deriving them involves two subtasks: frequent itemset counting and rule formation. It has been shown that frequent itemset counting is, by far, the more computationally expensive of the two. In principle, given a set of items (e.g. all retail items in a store) and a set of transactions (e.g. customer purchase receipts), the problem is to identify all subsets of the items that appear frequently in the transactions. The complexity of the problem is due to two factors. First, the transaction database is very large and therefore must be stored on disk. Second, the set of items can number in the thousands, making it practically impossible to generate and count every possible subset. The various algorithms reported in the literature differ mainly in how they handle the tradeoff between the cost of disk I/O and memory usage. ODscan is unique in the way it balances I/O cost and memory requirements. In the first database scan, it counts the itemsets of lengths one and two. It then iterates through the following steps: (1) generate candidate (i.e. promising) itemsets of length greater than two until some predefined constraint on memory usage is reached; (2) scan the database and count the candidate itemsets generated in the previous step; (3) purge the infrequent itemsets identified in the previous step in order to free memory. In this project, ODscan was implemented in Java 1.6 using the NetBeans 5.5 IDE. The implementation involved seven main functions: preprocessing, the first database scan, initialization of the candidate itemsets, creation of candidate itemsets, resetting the candidate itemsets, rescanning the database, and removing the infrequent candidate itemsets. The relative efficiency of different Java containers (LinkedList, ArrayList, HashSet, TreeSet, and LinkedHashSet) was investigated empirically on four widely used synthetic and real-world datasets. Results indicate that LinkedList consistently outperformed the other containers. Furthermore, the implementation was instrumented in order to collect various statistics that should help in understanding the problem characteristics. Based upon the results, several avenues for future improvement were identified.
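
The scan, generate, count, and purge cycle described above can be illustrated with a minimal Java sketch. This is not the project's implementation. The class name `ODscanSketch`, the methods `itemset`, `firstScan`, and `mine`, and the parameters `minSupport` and `maxCandidates` are all hypothetical. The sketch also makes several assumptions: the database is an in-memory list rather than a file on disk, the memory constraint is a soft cap on the number of candidates, and candidates are formed by extending a frequent itemset with a lexicographically larger frequent item, which may differ from the candidate generation ODscan actually uses.

```java
import java.util.*;

// Hypothetical sketch; names and the candidate-generation rule are assumptions.
class ODscanSketch {

    /** Builds a sorted itemset so that "extend by a larger item" is well defined. */
    static TreeSet<String> itemset(String... items) {
        return new TreeSet<>(Arrays.asList(items));
    }

    /** First database scan: count every itemset of length one and two. */
    static Map<TreeSet<String>, Integer> firstScan(List<Set<String>> db) {
        Map<TreeSet<String>, Integer> counts = new HashMap<>();
        for (Set<String> transaction : db) {
            List<String> items = new ArrayList<>(new TreeSet<>(transaction));
            for (int i = 0; i < items.size(); i++) {
                counts.merge(itemset(items.get(i)), 1, Integer::sum);
                for (int j = i + 1; j < items.size(); j++) {
                    counts.merge(itemset(items.get(i), items.get(j)), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Returns every itemset whose count is at least minSupport, with its count. */
    static Map<TreeSet<String>, Integer> mine(List<Set<String>> db,
                                              int minSupport, int maxCandidates) {
        if (maxCandidates < 1) {
            throw new IllegalArgumentException("maxCandidates must be >= 1");
        }
        Map<TreeSet<String>, Integer> frequent = new HashMap<>();
        TreeSet<String> frequentItems = new TreeSet<>();
        // Frequent itemsets that have not yet been extended into candidates.
        Deque<TreeSet<String>> frontier = new ArrayDeque<>();
        for (Map.Entry<TreeSet<String>, Integer> e : firstScan(db).entrySet()) {
            if (e.getValue() < minSupport) continue;
            frequent.put(e.getKey(), e.getValue());
            if (e.getKey().size() == 1) frequentItems.add(e.getKey().first());
            else frontier.add(e.getKey());
        }
        while (!frontier.isEmpty()) {
            // Step 1: generate candidates until the (soft) budget is reached.
            Map<TreeSet<String>, Integer> candidates = new HashMap<>();
            while (!frontier.isEmpty() && candidates.size() < maxCandidates) {
                TreeSet<String> base = frontier.poll();
                for (String item : frequentItems.tailSet(base.last(), false)) {
                    TreeSet<String> candidate = new TreeSet<>(base);
                    candidate.add(item);
                    candidates.put(candidate, 0);
                }
            }
            if (candidates.isEmpty()) break; // nothing left to count
            // Step 2: rescan the database and count the candidates.
            for (Set<String> transaction : db) {
                for (Map.Entry<TreeSet<String>, Integer> e : candidates.entrySet()) {
                    if (transaction.containsAll(e.getKey())) e.setValue(e.getValue() + 1);
                }
            }
            // Step 3: purge infrequent candidates; survivors may be extended later.
            for (Map.Entry<TreeSet<String>, Integer> e : candidates.entrySet()) {
                if (e.getValue() >= minSupport) {
                    frequent.put(e.getKey(), e.getValue());
                    frontier.add(e.getKey());
                }
            }
        }
        return frequent;
    }

    public static void main(String[] args) {
        List<Set<String>> db = new ArrayList<>();
        db.add(itemset("a", "b", "c"));
        db.add(itemset("a", "b", "c"));
        db.add(itemset("a", "b"));
        db.add(itemset("a", "c"));
        db.add(itemset("b", "d"));
        // Print each frequent itemset with its support count (order is unspecified).
        mine(db, 2, 1).forEach((set, count) -> System.out.println(set + " -> " + count));
    }
}
```

In this sketch, a smaller `maxCandidates` holds fewer candidates in memory at once but requires more passes over the database, while a larger value does the opposite. This is the I/O versus memory tradeoff the abstract refers to. The candidate map here is a `HashMap` chosen for brevity and does not reflect the container comparison reported in the results.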
