CS685: Special Topics in Data Mining
Homework 1:
Due March 25th
Goal:
This homework will reinforce your understanding of two classification
algorithms: Decision Trees and Naïve Bayes.
Description of the homework:
In this homework you will apply an existing implementation of the C4.5
algorithm and implement a Naïve Bayes Classifier yourself. Submit a report
that answers the questions below and includes the code you wrote for the
Naïve Bayes Classifier.
Algorithms:
1) C4.5: a decision tree implementation can be downloaded at
http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/tutorial.html
2) Naïve Bayes Classifier: implement it using the method we talked about in
class.
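As a point of reference for item 2, the following is a minimal sketch of a categorical Naïve Bayes classifier in Python. It assumes add-one (Laplace) smoothing and log-probabilities, which may differ from the method presented in class. The names `train_nb` and `predict_nb` are illustrative and not part of any library.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class frequencies and per-attribute value frequencies.

    rows   -- list of tuples of categorical attribute values
    labels -- list of class labels, one per row
    """
    class_counts = Counter(labels)
    # value_counts[(attr_index, class)][value] = number of occurrences
    value_counts = defaultdict(Counter)
    # distinct values seen for each attribute (needed for smoothing)
    domains = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains


def predict_nb(model, row):
    """Return the class with the highest log posterior for one row."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(row):
            # Assumption: add-one smoothing to avoid zero probabilities
            num = value_counts[(i, label)][value] + 1
            den = count + len(domains[i])
            score += math.log(num / den)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Summing logarithms rather than multiplying probabilities avoids numeric underflow when there are many attributes.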
Datasets:
Car Evaluation Dataset
http://archive.ics.uci.edu/ml/datasets/Car+Evaluation
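The data file on the UCI page (`car.data`) is comma-separated with no header row: six categorical attributes followed by the class label. A minimal sketch of loading it in Python, assuming the file has been downloaded locally; the helper name `parse_car_lines` and the local path in the usage comment are hypothetical.

```python
# Attribute names as listed on the UCI dataset page; the last column is the class.
ATTRIBUTES = ["buying", "maint", "doors", "persons", "lug_boot", "safety"]


def parse_car_lines(lines):
    """Split comma-separated lines into (attribute rows, class labels)."""
    rows, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        *values, label = line.split(",")
        if len(values) != len(ATTRIBUTES):
            raise ValueError(f"unexpected number of fields: {line!r}")
        rows.append(tuple(values))
        labels.append(label)
    return rows, labels


# Usage, assuming car.data was saved in the working directory (hypothetical path):
# with open("car.data") as f:
#     rows, labels = parse_car_lines(f)
```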
Questions to answer in the report:
1. Apply C4.5 to the Car Evaluation dataset.
a) Build one decision tree using information gain as the selection criterion
and one using information gain ratio. Check whether the two trees are the
same. If not, give one example of a difference.
b) Using gain ratio as the criterion, generate the list of rules.
c) Use 1/10 of the dataset to build the classifier and the rest as the test
set, and compute the accuracy of the result.
d) Use 1/2 of the dataset to build the classifier and the rest as the test
set, and compute the accuracy of the result. Compare with c) and explain the
result.
e) Use 4-fold cross-validation to assess the accuracy of the classification
algorithm.
2. Develop the Naïve Bayes Classifier.
a) Use 4-fold cross-validation to assess the accuracy of the algorithm, and
compare your result with C4.5.
3. Research the issue of overfitting: when it occurs and how it can be resolved.
4. Please include your code in your submission.
5. Please report your impressions of the two classification algorithms in terms
of how easy they are to use and interpret, and your experience with the
assignment.
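Question 1(a) contrasts two split criteria. As a reminder of the difference, here is a small Python sketch using the standard definitions: information gain is the reduction in class entropy after splitting on an attribute, and gain ratio divides that gain by the split information (the entropy of the attribute's own value distribution). The function names are illustrative, and this is not the C4.5 source code.

```python
import math
from collections import Counter, defaultdict


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(items).values())


def information_gain(values, labels):
    """Entropy of labels minus the weighted entropy after splitting on values."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain normalised by split information."""
    split_info = entropy(values)
    if split_info == 0:  # attribute has a single value; no useful split
        return 0.0
    return information_gain(values, labels) / split_info
```

For four rows with two balanced classes, an attribute that takes a distinct value on every row has information gain 1.0 bit but gain ratio only 0.5, which illustrates how gain ratio penalises many-valued attributes.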
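Questions 1(e) and 2(a) both ask for 4-fold cross-validation. A minimal Python sketch of the procedure follows. `train_fn` and `predict_fn` are placeholders for whichever classifier is being evaluated, and the fixed shuffle seed is an assumption made for reproducibility.

```python
import random


def k_fold_accuracy(rows, labels, train_fn, predict_fn, k=4, seed=0):
    """Average accuracy over k folds; each fold is the test set exactly once."""
    if len(rows) < k:
        raise ValueError("need at least k examples")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k near-equal, disjoint folds
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        model = train_fn([rows[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, rows[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```

Each example is used for testing exactly once and for training in the other k - 1 rounds, so the averaged accuracy depends less on one particular split than the single hold-out estimates in 1(c) and 1(d).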