With the unprecedented rate at which data
is being collected today in almost all fields of human endeavor, there is
an emerging economic and scientific need to extract useful information from
it. Data mining is the process of automatic discovery of patterns, changes,
associations, and anomalies in massive databases, and is a highly
interdisciplinary field representing the confluence of several
disciplines, including database systems, data warehousing, machine
learning, statistics, algorithms, data visualization, and high-performance
computing. This seminar will provide an introductory survey of the main
topics in data mining and knowledge discovery (including but not limited to
classification, regression, clustering, association rules, trend detection,
feature selection, similarity search, data cleaning, and privacy and
security issues), as well as a wide spectrum of data mining applications
such as biomedical informatics, bioinformatics, financial market study,
image processing, network monitoring, and social service analysis.
For each topic, a few of the most relevant research papers will be selected
as the main teaching material. Students are expected to read the assigned
papers before each class and to participate in the class discussion.
Prerequisite:
Some background in algorithms, data structures,
statistics, machine learning, artificial intelligence, and databases is
helpful.
References: (No required textbook)
1). Data Mining: Concepts and Techniques, by Han and Kamber, Morgan Kaufmann, 2006. (ISBN: 1-55860-901-6)
2). Principles of Data Mining, by Hand, Mannila, and Smyth, MIT Press, 2001. (ISBN: 0-262-08290-X)
3). Introduction to Data Mining, by Tan, Steinbach, and Kumar, Addison Wesley, 2006. (ISBN: 0-321-32136-7)
4). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Hastie, Tibshirani, and Friedman, Springer, 2001. (ISBN: 0-387-95284-5)
5). Pattern Recognition and Machine Learning, by Christopher M. Bishop, 2006.
Grading
Each student in CS685 is expected to present a paper, lead the discussion
following the presentation, and complete a project on selected topics.
4 Homeworks     40%
Exam            15%
Presentation    15%
Project         30%
Tentative Course Outline
1. Introduction
· What is data mining?
2. Data Preprocessing
· Data sampling, data cleaning, feature selection, and dimensionality reduction
3. Classification
· Tree-based, rule-based, and instance-based methods
· Bayesian methods (naive Bayes and Bayesian belief networks)
· Neural networks, linear discriminant analysis, support vector machines, and ensemble methods
· Model evaluation
4. Association Analysis
· Apriori algorithm and its extensions
· Pattern evaluation (subjective and objective interestingness measures)
· Sequential patterns and graph mining
5. Clustering
· Partitional and hierarchical clustering methods
· Graph-based and density-based methods
· Cluster evaluation