CS685: Special Topics in Data Mining (Spring 2009)
Homework 2:
Due Feb 19th
Goal:
This homework reinforces your understanding of two basic clustering
algorithms: k-means and hierarchical clustering. Through the homework, you
will learn how to use both algorithms and how to evaluate clustering results,
including how to determine the number of clusters.
Description of the homework:
Your homework will again be an application of existing algorithms. Only
the report needs to be submitted. Your report should contain answers to the
following questions.
Algorithms:
You will work with two algorithms: k-means and hierarchical clustering.
Both algorithms are available as functions in MATLAB. (You can use other
software, such as R or Weka, as long as you do the same analysis.)
1) K-means
Datasets:
Synthetic Control Chart Time Series Data Set.
http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series
Please read through the data description carefully to understand the particular
application.
For this dataset, the real clusters underlying the data are listed below. You
will need to compare your clustering results with these real clusters.
1-100     Normal
101-200   Cyclic
201-300   Increasing trend
301-400   Decreasing trend
401-500   Upward shift
501-600   Downward shift
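The row ranges above can be turned into reference clusters for later comparison. The following is a minimal Python sketch, assuming rows are numbered from 1 in file order as in the list above; the names `CLASS_NAMES`, `true_clusters`, and `true_label` are made up for illustration and are not part of the data set or of MATLAB.

```python
# Class names in the order listed in the handout; each covers 100 consecutive rows.
CLASS_NAMES = [
    "Normal",
    "Cyclic",
    "Increasing trend",
    "Decreasing trend",
    "Upward shift",
    "Downward shift",
]
ROWS_PER_CLASS = 100  # assumption taken from the ranges 1-100, 101-200, ...


def true_clusters():
    """Return the real clusters as a list of sets of 1-based row numbers."""
    return [
        set(range(i * ROWS_PER_CLASS + 1, (i + 1) * ROWS_PER_CLASS + 1))
        for i in range(len(CLASS_NAMES))
    ]


def true_label(row):
    """Return the class name for a 1-based row number between 1 and 600."""
    if not 1 <= row <= ROWS_PER_CLASS * len(CLASS_NAMES):
        raise ValueError("row number out of range")
    return CLASS_NAMES[(row - 1) // ROWS_PER_CLASS]
```

Representing each real cluster as a set of row numbers makes it easy to compare against computed clusters with set operations.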
Questions to answer in the report:
1. Write a script to compare two sets of clustering results
Assume C1 and C2 are two sets of clustering results on the same
set of data. C1 = (S1, S2, …, Sn) contains n clusters, where each S represents
a cluster, and C2 = (T1, T2, …, Tn) also contains n clusters, where each T is
a cluster.
To compute the similarity between C1 and C2, you will need to do the
following.
sim = 0
For each S in C1,
    sim = sim + the highest similarity between S and any cluster T in C2,
          measured by the Jaccard coefficient.
End
avgsim = sim / n;
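The procedure above could be sketched in Python as follows. This is only one possible reading, not a reference solution: it assumes each cluster is represented as a set of item identifiers, uses the standard Jaccard coefficient |S ∩ T| / |S ∪ T|, and treats two empty clusters as having similarity 0 (an assumption, since the handout does not cover that case). The helper `clusters_from_labels` is a hypothetical convenience for converting a label vector, such as the one a clustering function returns, into sets.

```python
def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty clusters count as dissimilar
    return len(a & b) / len(union)


def average_similarity(c1, c2):
    """Average, over clusters S in c1, of the best Jaccard match among c2."""
    total = 0.0
    for s in c1:
        total += max(jaccard(s, t) for t in c2)
    return total / len(c1)


def clusters_from_labels(labels):
    """Group 0-based item positions by their cluster label."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, set()).add(index)
    return list(groups.values())
```

For example, with C1 = ({1, 2, 3}, {4, 5, 6}) and C2 = ({1, 2}, {3, 4, 5, 6}), the best matches are 2/3 and 3/4, so `average_similarity` returns (2/3 + 3/4) / 2 = 17/24.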
2. Please follow the instructions in MATLAB for k-means clustering to do the following.
a) Write the script to do k-means clustering. Check out the following function.
--kmeans
b) Run the program 10 times with k = 6. Check how the clustering result
changes from run to run by examining sumdistance. Please report your result.
c) Use k = 2, 4, 6, 8, 10 and check how the sumdistance changes. Please
report your result.
d) Pick one of the results as the best (please argue why), and compare it
with the real clusters.
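For readers working outside MATLAB, the experiment in (b) and (c) could be sketched in plain Python as below. This is an illustrative implementation of Lloyd's k-means algorithm, not MATLAB's kmeans function: it assumes squared Euclidean distance, picks the initial centroids at random from the data points, and reports a single total of point-to-centroid distances. The name `total_distance_over_runs` is hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, max_iter=100):
    """Lloyd's algorithm.

    Returns (labels, centroids, total), where total is the sum over all
    points of the squared distance to the centroid of their cluster.
    """
    # Assumption: initial centroids are k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    total = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, total


def total_distance_over_runs(points, k, runs, seed=0):
    """Run k-means `runs` times from different random starts; return the totals."""
    rng = random.Random(seed)
    return [kmeans(points, k, rng)[2] for _ in range(runs)]
```

Calling `total_distance_over_runs(points, 6, 10)` gives ten totals for k = 6, and repeating the call for k in (2, 4, 6, 8, 10) shows how the total changes with k. Here `points` stands for the rows of the data set as a list of numeric sequences, loaded however you prefer.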
3. Please follow the instructions in MATLAB for hierarchical clustering to do the following.
a) Write the script to do hierarchical clustering. The following functions
might be useful:
--load
--pdist
--linkage
--squareform
--dendrogram
--cluster
b) Choose one of the existing dissimilarity measures to compute the
dissimilarity matrix for hierarchical clustering. Please explain whether it is
appropriate for this particular application and suggest another measure that
might also work.
c) Run the hierarchical clustering on the dataset using single linkage,
complete linkage, and average linkage. Generate 6 clusters for each run and
compare how the clustering results change.
d) Pick one of the results as the best (please argue why), and compare it
with the real clusters.
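To make the role of the linkage rule concrete, here is a naive agglomerative clustering sketch in plain Python. It is not MATLAB's linkage or cluster function: it assumes Euclidean distance as the dissimilarity measure, supports three common linkage rules (single, complete, average), and stops merging once the requested number of clusters remains. It recomputes cluster distances from scratch at every step, so it is meant for understanding and small inputs rather than speed.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# How to combine the pairwise distances between two clusters into one number.
LINKAGES = {
    "single": min,                                # closest pair
    "complete": max,                              # farthest pair
    "average": lambda ds: sum(ds) / len(ds),      # mean over all pairs
}


def agglomerative(points, n_clusters, linkage="single"):
    """Merge the two closest clusters until n_clusters remain.

    Returns a list of sets of 0-based point indices.
    """
    combine = LINKAGES[linkage]
    dist = [[euclidean(p, q) for q in points] for p in points]
    clusters = [{i} for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = combine([dist[i][j] for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]  # merge cluster b into cluster a
        del clusters[b]
    return clusters
```

Running `agglomerative(points, 6, linkage=name)` for each linkage name gives cluster sets that can be compared with each other and with the real clusters.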
4. Compare the best of the hierarchical clustering results generated in 3(c),
as chosen in 3(d), with the best k-means clustering generated with k = 6 (2(d)).
5. Please include all your MATLAB scripts for submission.
6. Please report your impressions of the clustering algorithms and your
experience with the assignment.