CS685: Special Topics in Data Mining (Spring 2009)
Homework 1:
Due Feb 5th
Goal:
This homework reinforces your understanding of the computational
complexity of frequent itemset mining, the difference between
Apriori-based and depth-first frequent itemset mining algorithms, and
the output and interpretation of frequent patterns and association
rules. It also gives you hands-on experience using existing frequent
itemset mining implementations and applying them to datasets.
Description of the homework:
You will experiment with the algorithms provided below on a number of
datasets. Only the report needs to be submitted.
Algorithms:
There are many implementations of frequent itemset
mining algorithms available on the web. In this assignment, you need to
download two implementations of itemset mining
algorithms from http://www.borgelt.net/fpm.html.
(1) Apriori.
(2) FPgrowth.
Datasets:
(1) Synthetic data: use the Quest synthetic data generator for transaction databases with embedded frequent itemsets:
http://www.almaden.ibm.com/cs/projects/iis/hdb/Projects/data_mining/datasets/syndata.html
Generate the following two datasets:
T25.I20.D50k
T10.I4.D50k
(2) Mushroom dataset: http://archive.ics.uci.edu/ml/datasets/Mushroom
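The Mushroom data is a table of categorical attributes, while itemset miners expect one transaction per line. A minimal sketch of one possible conversion follows. It assumes the UCI file is comma-separated with one short code per attribute, that a `?` marks a missing value (skipped here, which is a choice and not a requirement), and that the miner accepts space-separated items. The function name `rows_to_transactions` and the sample rows are made up for illustration.

```python
import csv
import io


def rows_to_transactions(rows):
    """Turn categorical rows into transactions of 'column=value' items."""
    transactions = []
    for row in rows:
        # Prefix each value with its column index so that the same code
        # in two different columns becomes two different items.
        # Assumption: '?' means "missing" and is left out of the transaction.
        items = [f"{col}={value}" for col, value in enumerate(row) if value != "?"]
        transactions.append(items)
    return transactions


# Two made-up rows in the same comma-separated style as the UCI file.
sample = "p,x,s,n\ne,x,y,?\n"
rows = list(csv.reader(io.StringIO(sample)))
transactions = rows_to_transactions(rows)

# One transaction per line, items separated by spaces.
for items in transactions:
    print(" ".join(items))
```

Check the documentation of the implementation you downloaded for the input format it actually expects before relying on this layout.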
Reports of experimentation:
0) Describe the configuration (CPU and memory) of the computer on which you run the
experiments. Make sure to run all experiments under the same configuration.
A) For the two synthetic datasets, report the following results for both
algorithms. Put each set of results into a plot.
(1) Running time, varying the support threshold over 0.5%, 2%, 5%, 10%, 50%
(2) Number of patterns, varying the support threshold over 0.5%, 2%, 5%, 10%, 50%
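One way to collect the numbers for these plots is a small harness that loops over the thresholds and records the time and pattern count for each. The sketch below is only an illustration: `measure` and `count_frequent_items` are hypothetical names, and the stand-in miner merely counts frequent single items in a five-transaction toy database. In a real run you would replace it with a function that invokes the downloaded Apriori or FPgrowth program and returns the number of patterns it reports.

```python
import time

SUPPORTS_PERCENT = [0.5, 2, 5, 10, 50]

# Toy database used only to make the sketch runnable.
TOY_DB = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}, {"b"}]


def count_frequent_items(support_percent):
    """Stand-in miner: count single items that meet the threshold in TOY_DB."""
    min_count = support_percent / 100 * len(TOY_DB)
    counts = {}
    for transaction in TOY_DB:
        for item in transaction:
            counts[item] = counts.get(item, 0) + 1
    return sum(1 for c in counts.values() if c >= min_count)


def measure(run_miner, supports):
    """Call run_miner(support) per threshold; return (support, seconds, n_patterns)."""
    results = []
    for support in supports:
        start = time.perf_counter()
        n_patterns = run_miner(support)
        elapsed = time.perf_counter() - start
        results.append((support, elapsed, n_patterns))
    return results


rows = measure(count_frequent_items, SUPPORTS_PERCENT)
for support, seconds, n_patterns in rows:
    print(f"{support:>5}%  {seconds:.6f} s  {n_patterns} patterns")
```

Keeping the threshold list in one place makes it easy to run both algorithms over exactly the same settings.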
B) For the synthetic dataset T25.I20.D50k, report the following results for
both algorithms. Put each set of results into a plot.
(1) With the support threshold fixed at 0.3%, report the running time when using 20%, 40%, 60%, 80%,
100% of the transactions.
(2) With the support threshold fixed at 0.3%, report the number of patterns when using 20%, 40%, 60%,
80%, 100% of the transactions.
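Because the 0.3% threshold is relative, the absolute count a pattern must reach grows with the size of the subset. The sketch below shows one way to cut the subsets and compute that count. It assumes the subsets are taken as prefixes of the file (a random sample is an equally reasonable choice) and that the threshold is rounded up to the next whole transaction; an implementation may round differently. `take_fraction` and `min_count` are hypothetical helper names, and the integer list stands in for real transactions.

```python
import math
from fractions import Fraction


def take_fraction(transactions, percent):
    """Return the first `percent` percent of the transactions."""
    keep = len(transactions) * percent // 100
    return transactions[:keep]


def min_count(n_transactions, support_percent):
    """Smallest absolute count that meets a relative support given in percent."""
    # Fraction avoids floating-point surprises such as 0.3 not being exact.
    exact = Fraction(str(support_percent)) * n_transactions / 100
    return math.ceil(exact)


# Placeholder for a 50k-transaction dataset; real code would read the file.
transactions = list(range(50000))

sizes = []
for percent in (20, 40, 60, 80, 100):
    subset = take_fraction(transactions, percent)
    sizes.append((len(subset), min_count(len(subset), 0.3)))
    print(f"{percent:>3}% -> {len(subset)} transactions, min count {sizes[-1][1]}")
```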
C) For each figure you generated above, explain the trends it shows based
on your understanding of the complexity of the algorithms.
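When reasoning about these trends, it can help to watch a level-wise miner on a database small enough to check by hand. The following is a deliberately naive sketch of the Apriori idea (join, prune, count, one database pass per level). It is not the downloaded implementation and makes no attempt at efficiency. The toy database is made up, and thresholds are given as absolute counts.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune: every (k-1)-subset of a candidate must itself be frequent.
            if all(frozenset(sub) in frequent for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Count: one pass over the database for this level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result


TOY_DB = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c", "d"},
]

pattern_counts = [len(apriori(TOY_DB, m)) for m in (4, 3, 2, 1)]
print(pattern_counts)  # -> [3, 6, 7, 15]
```

Even on five transactions, lowering the threshold from 4 to 1 takes the output from 3 patterns to all 15 non-empty subsets of the four items, which is the kind of growth the running-time and pattern-count plots should reflect.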
E) Apply both algorithms to the Mushroom dataset. Report the following.
(1) Running time, varying the support threshold over 0.5%, 2%, 5%, 10%, 50%
(2) Number of patterns, varying the support threshold over 0.5%, 2%, 5%, 10%, 50%
(3) Report the top 10 patterns with the highest support and discuss whether they are useful.
(4) Report 5 association rules with confidence above 20% and explain whether they are
important.
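When judging whether a rule is important, recall that the confidence of X -> Y is the support of X and Y together divided by the support of X. A rule can clear the 20% bar simply because its consequent is very common. The sketch below computes confidence on a made-up toy database and also computes lift (confidence divided by the relative support of the consequent), a common yardstick that this assignment does not require. The helper names are hypothetical.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(transactions, antecedent, consequent):
    """conf(X -> Y) = count(X union Y) / count(X)."""
    body = support_count(transactions, antecedent)
    if body == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    return support_count(transactions, antecedent | consequent) / body


def lift(transactions, antecedent, consequent):
    """Confidence divided by the relative support of the consequent."""
    head = support_count(transactions, consequent) / len(transactions)
    return confidence(transactions, antecedent, consequent) / head


# Made-up toy database: y is common on its own.
TOY_DB = [{"x", "y"}, {"x", "y"}, {"x"}, {"y"}, {"y", "z"}]

conf_xy = confidence(TOY_DB, {"x"}, {"y"})
lift_xy = lift(TOY_DB, {"x"}, {"y"})
print(f"conf(x -> y) = {conf_xy:.3f}, lift = {lift_xy:.3f}")
```

Here the rule x -> y has confidence 2/3, well above 20%, yet its lift is below 1, because y appears in 4 of the 5 transactions regardless of x.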
F) Your experience with the homework.
Grading:
Each problem will be graded on a letter scale: A (Excellent), B (Good), C (Fair),
D (Failed). Each student must work on this homework independently.