CS685: Special Topics in Data Mining (Spring 2009)

Homework 1: Due Feb 5th

Goal: This homework reinforces your understanding of the computational complexity of frequent itemset mining, the difference between Apriori-based and depth-first frequent itemset mining algorithms, and the output and interpretation of frequent patterns and association rules. It also gives you hands-on experience in using existing frequent itemset mining algorithms and applying them to datasets.

Description of the homework:

You will experiment with the algorithms provided below on a number of datasets. Only the report needs to be submitted.

Algorithms:

There are many implementations of frequent itemset mining algorithms available on the web. For this assignment, download the following two implementations from http://www.borgelt.net/fpm.html:

(1) Apriori.

(2) FPgrowth.
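To make the level-wise idea behind Apriori concrete, here is a minimal sketch in Python. It is not the downloaded implementation and is far too slow for the assignment datasets; the function name `apriori` and its arguments are chosen here for illustration only. It shows the two steps that drive the cost: candidate generation with subset pruning, and one pass over the data per level to count candidates. FPgrowth avoids both by growing patterns depth-first over a compressed prefix tree.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets.

    min_support is a fraction of the number of transactions (0.5 means 50%).
    """
    transactions = [frozenset(t) for t in transactions]
    min_count = min_support * len(transactions)

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join frequent (k-1)-itemsets into k-item candidates, and keep a
        # candidate only if every (k-1)-subset is frequent (Apriori property).
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # One pass over the data to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result
```

Lowering `min_support` lets more candidates survive at each level, which is why both the number of patterns and the running time can grow sharply at low thresholds.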

Datasets:

(1) Synthetic transaction databases with embedded frequent itemsets, produced with the Quest synthetic data generator:

Quest: http://www.almaden.ibm.com/cs/projects/iis/hdb/Projects/data_mining/datasets/syndata.html

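Generate the following two datasets. In the Quest naming convention, T is the average transaction length, I is the average size of the maximal potentially frequent itemsets, and D is the number of transactions: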

T25.I20.D50k

T10.I4.D50k

(2) Mushroom dataset: http://archive.ics.uci.edu/ml/datasets/Mushroom
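The Mushroom data is a table of categorical attributes rather than a transaction database, so it typically has to be converted before mining. The sketch below assumes the downloaded file is comma-separated with one mushroom per line, that the same single-letter code can mean different things in different columns, and that `?` marks a missing value; the function names and the `column=value` item encoding are illustrative choices, not part of the assignment. Check the input format expected by the programs you downloaded before relying on it.

```python
import csv

def rows_to_transactions(rows):
    """Turn rows of categorical values into lists of 'column=value' items.

    Prefixing each value with its column index keeps identical codes from
    different attributes apart. Values equal to '?' are treated as missing
    and skipped (an assumption about the data file).
    """
    transactions = []
    for row in rows:
        if not row:
            continue  # skip blank lines
        items = [f"{col}={val}" for col, val in enumerate(row) if val != "?"]
        transactions.append(items)
    return transactions

def convert(in_path, out_path):
    """Write one transaction per line with space-separated items."""
    with open(in_path, newline="") as src, open(out_path, "w") as dst:
        for items in rows_to_transactions(csv.reader(src)):
            dst.write(" ".join(items) + "\n")
```

Because every row contributes exactly one item per non-missing attribute, the resulting transactions are long and dense, which is worth keeping in mind when interpreting the results on this dataset.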

Reports of experimentation:

0) Please describe the configuration (CPU and memory) of the computer on which you run the experiments.

Make sure to run all experiments under the same configuration.

A) For the two synthetic datasets, please report the following results for both algorithms.

Please present each set of results as a plot.

(1) Running time as the support threshold varies over 0.5%, 2%, 5%, 10%, and 50%.

(2) Number of patterns as the support threshold varies over 0.5%, 2%, 5%, 10%, and 50%.
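One way to collect these measurements is a small timing harness such as the sketch below. It treats the mining program as an external command and makes no assumption about its options; the helper names `time_command` and `count_patterns` are hypothetical, and counting one pattern per non-empty output line is an assumption about the output format that you should verify.

```python
import subprocess
import time

def time_command(cmd, repeats=3):
    """Run cmd (a list of arguments) several times and return the best
    wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        best = min(best, time.perf_counter() - start)
    return best

def count_patterns(lines):
    """Count non-empty lines, assuming the miner writes one pattern per line."""
    return sum(1 for line in lines if line.strip())

# Hypothetical usage. The executable name, the support flag, and the file
# names are assumptions; check the usage message of the program you built.
# seconds = time_command(["./apriori", "-s0.5", "T10I4D50k.dat", "out.txt"])
# with open("out.txt") as f:
#     n_patterns = count_patterns(f)
```

Taking the best of several runs reduces noise from other processes; whichever convention you choose, use the same one for both algorithms and state it in the report.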

B) For the synthetic dataset T25.I20.D50k, please report the following results for both algorithms.

Please present each set of results as a plot.

(1) With the support threshold fixed at 0.3%, report the running time when using 20%, 40%, 60%, 80%, and 100% of the transactions.

(2) With the support threshold fixed at 0.3%, report the number of patterns when using 20%, 40%, 60%, 80%, and 100% of the transactions.
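The scaled-down inputs for this experiment can be produced with a few lines of Python. The sketch below takes the first part of the file as the subset; the assignment does not say whether to take a prefix or a random sample, so that choice, the helper names, and the output file naming are all assumptions made for illustration.

```python
def take_fraction(transactions, fraction):
    """Return the first `fraction` of the transactions (0 < fraction <= 1)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n = round(len(transactions) * fraction)
    return transactions[:n]

def write_subsets(in_path, percents=(20, 40, 60, 80, 100)):
    """Write one file per percentage, e.g. 'data.dat.20pct' (hypothetical naming)."""
    with open(in_path) as f:
        lines = [line for line in f if line.strip()]
    for p in percents:
        out_path = f"{in_path}.{p}pct"
        with open(out_path, "w") as out:
            out.writelines(take_fraction(lines, p / 100))
```

Since the threshold is relative, the absolute minimum count scales with the subset: 0.3% of 10,000 transactions is 30, while 0.3% of 50,000 is 150.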

C) For each figure you generated above, explain the trends it shows based on your understanding of the complexity of the algorithms.

E) Apply both algorithms to the mushroom dataset and report the following.

(1) Running time as the support threshold varies over 0.5%, 2%, 5%, 10%, and 50%.

(2) Number of patterns as the support threshold varies over 0.5%, 2%, 5%, 10%, and 50%.

(3) Report the top 10 patterns with the highest support and discuss whether they are useful.

(4) Report 5 association rules with confidence above 20% and explain whether they are important.
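As a reminder of what the reported numbers mean, the confidence of a rule X -> Y is supp(X union Y) / supp(X). The sketch below computes it from a table of itemset supports and also ranks patterns by support; the dictionary layout and the function names are assumptions for illustration and are not tied to the output format of any particular program.

```python
def confidence(support, antecedent, consequent):
    """conf(X -> Y) = supp(X union Y) / supp(X).

    `support` maps frozenset itemsets to their support (count or fraction).
    """
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    return support[xy] / support[x]

def top_patterns(support, n=10):
    """Return the n (itemset, support) pairs with the highest support."""
    return sorted(support.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

A rule can have high confidence simply because its consequent is present in almost every transaction, so it is worth comparing a rule's confidence with the support of its consequent when judging whether the rule is important.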

F) Describe your experience with the homework.

Grading:

Each problem will be graded on a letter scale: A (Excellent), B (Good), C (Fair), D (Failed). Each student must work on this homework independently.