CS 685:
Final Project Abstract and Talk Schedule
1.
Title: Data mining of Clothing World*
Name: Cindy
Abstract:
Clothing World* is a local retail
clothing store that has been collecting data on their sales, inventory, and customers
via a point-of-sale system for the past year. The store owners would like to have a
better understanding of their business and their customer base. The objective of this project is to
utilize various data mining approaches and techniques to gain valuable
information in order to answer some key business questions. The models employed in this project are
Association Rules (Market Basket Analysis), Decision
Trees, and Clustering.
*The name of the store has been masked to
respect the privacy of the business.
2.
Title: Predictive Modeling via Data
Mining: The NetFlix Prize
Name: Casey, Nick, Mary
Abstract:
As developing technologies embrace the
benefits of predictive modeling and data mining, entrepreneurs are finding new
ways of using these technologies for the benefit of customers, companies and
science. In this report, we follow the efforts of one such company, NetFlix, and its sponsorship of the now well-known NetFlix Prize - a competition where participants develop
data mining algorithms for predicting movie rentals based on past movie ratings
expressed by NetFlix users. We participate as
contestants in this contest to better understand how an industry-sponsored
competition can lead to advancements in data mining and a better understanding
of predictive modeling. Participation in the contest involves the use of
off-the-shelf data mining programs as well as the development of custom
software. These programs, exercised on the NetFlix
Prize data set, attempt to produce movie rating predictions which beat the
ratings observed by NetFlix when they collect movie
ratings from their users. The algorithms used in these programs involve
association rule mining, clustering, and time sequencing. We conclude by
providing closing thoughts on the successes and failures of each algorithm
type.
3.
Title: Classification of Pattern
Recognition Using Neural Network
Name: Arunava
Abstract:
Artificial neural networks are quite robust
in handling highly noisy, nonlinear and complex data with amazing ease. In this
pattern recognition application I have trained a Neural Network so that it can
distinguish among three species of iris flower (iris-setosa,
iris-versicolor, iris-virginica).
Since the inputs are classified into three classes, a Multilayer Perceptron with
the backpropagation-of-error algorithm is used instead of a single-layer Perceptron.
The program has four inputs, one
for each of the four attributes of a flower. The neural network produces
output for three classes, so the output layer has three neurons.
The number of hidden layers is less than the number of output neurons, so here
we restrict ourselves to one or two hidden layers. Initially the connection
weights between each pair of neurons are set to small random values lying in the
range [-0.5, 0.5]. The target output must be specified first; the weights are
then updated by backpropagation until the actual output reaches the target
output. At each update step the weight adjustment w can be restricted to the
range -0.1 <= w <= 0.1, so that the possibility of overshooting is
minimized while smoothly approaching a minimum-error solution. For
better results, the weight-update loop is executed for each pattern a large
number of times (say, 2000 times).
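The training procedure described above can be sketched as follows. This is a minimal illustration, not the author's program: the single hidden layer of two neurons, the sigmoid activation, the learning rate, the helper names, and the handful of sample rows (written in the style of the classic Iris measurements) are all assumptions made for the example.

```python
import math
import random

# A few rows in the style of the classic Iris data (illustrative only).
# Labels: 0 = iris-setosa, 1 = iris-versicolor, 2 = iris-virginica.
SAMPLES = [
    ([5.1, 3.5, 1.4, 0.2], 0), ([4.9, 3.0, 1.4, 0.2], 0), ([4.7, 3.2, 1.3, 0.2], 0),
    ([7.0, 3.2, 4.7, 1.4], 1), ([6.4, 3.2, 4.5, 1.5], 1), ([6.9, 3.1, 4.9, 1.5], 1),
    ([6.3, 3.3, 6.0, 2.5], 2), ([5.8, 2.7, 5.1, 1.9], 2), ([7.1, 3.0, 5.9, 2.1], 2),
]

def clip(delta, limit=0.1):
    """Restrict one weight adjustment to the range [-limit, limit]."""
    return max(-limit, min(limit, delta))

def sigmoid(x):
    x = max(-60.0, min(60.0, x))  # guard against overflow in exp()
    return 1.0 / (1.0 + math.exp(-x))

def init_layer(n_out, n_in, rng):
    """One weight row per neuron; the extra column is the bias weight."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]

def forward(layer, inputs):
    ext = inputs + [1.0]  # append the constant bias input
    return [sigmoid(sum(w * x for w, x in zip(row, ext))) for row in layer]

def train(samples, n_hidden=2, epochs=2000, rate=0.5, seed=0):
    rng = random.Random(seed)
    hidden = init_layer(n_hidden, 4, rng)   # 4 inputs -> hidden layer
    output = init_layer(3, n_hidden, rng)   # hidden layer -> 3 class neurons
    for _ in range(epochs):
        for x, label in samples:
            target = [1.0 if k == label else 0.0 for k in range(3)]
            h = forward(hidden, x)
            y = forward(output, h)
            # Error terms for the output layer, then propagated back to the hidden layer.
            d_out = [(t - o) * o * (1 - o) for t, o in zip(target, y)]
            d_hid = [hj * (1 - hj) * sum(d_out[k] * output[k][j] for k in range(3))
                     for j, hj in enumerate(h)]
            for k in range(3):
                for j, v in enumerate(h + [1.0]):
                    output[k][j] += clip(rate * d_out[k] * v)
            for j in range(n_hidden):
                for i, v in enumerate(x + [1.0]):
                    hidden[j][i] += clip(rate * d_hid[j] * v)
    return hidden, output

def predict(hidden, output, x):
    y = forward(output, forward(hidden, x))
    return y.index(max(y))  # index of the most active output neuron
```

Calling `train(SAMPLES)` returns the two weight matrices, and `predict(hidden, output, [5.0, 3.4, 1.5, 0.2])` returns a class index from 0 to 2. Clipping each adjustment trades convergence speed for stability, which is why many passes over the patterns are needed.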
4.
Title: Visual Object Retrieval System
Name: Jizhou
Abstract:
In this talk, we present a visual object
retrieval system. Given a user-specified query object, the system returns a
ranked list of images which might contain the query object from a relatively
large image dataset. The scheme builds upon the bag-of-words model. Affine-invariant
image features are extracted from local image regions and quantized into
"visual words" using Hierarchical K-Means (HKM), which allows us to generate
a relatively large "visual vocabulary"
effectively. Furthermore, our system enforces
spatial consistency between corresponding features to filter out erroneous
candidates in the retrieved list. Potential applications of our system include
3D model completion for building facades.
5.
Title: A survey on Anomaly Detection
systems and approaches towards reducing false positives
Name: Onur
Abstract:
In this talk, I will discuss the most recent techniques employed by anomaly
detection algorithms. Anomaly detection systems are a category of intrusion
detection systems in which a model of normal or acceptable data is defined and the
system then attempts to detect deviations from the normal model in the observed
data. The major drawback of anomaly detection systems is their high
false-positive rate. Recent work on reducing false positives will also be
addressed in this talk.
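The idea of defining a model of normal data and flagging deviations from it can be illustrated with a very small sketch. The model used here (mean and standard deviation of a single numeric feature), the function names, and the threshold of three standard deviations are assumptions chosen for illustration; they are not taken from the surveyed systems.

```python
import statistics

def fit_normal_model(values):
    """Summarize 'normal' observations by their mean and standard deviation."""
    return statistics.mean(values), statistics.stdev(values)

def is_anomaly(value, model, threshold=3.0):
    """Flag a value that deviates from the mean by more than threshold * std."""
    mean, std = model
    return abs(value - mean) > threshold * std

# Hypothetical feature, e.g. connections per minute observed during normal operation.
normal_traffic = [10, 12, 11, 9, 10, 11, 12, 10]
model = fit_normal_model(normal_traffic)
```

Lowering `threshold` catches more deviations but also flags more legitimate observations, which is the false-positive trade-off the abstract refers to.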
------------------------------
BREAK 5 Minutes
------------------------------
6.
Title: A Fast Biclustering
Algorithm
Name: Yuan
Abstract:
Biclustering, a new class of clustering techniques to
simultaneously cluster two dimensions of the data matrix, was introduced for
some potential applications, e.g., analysis of gene expression data and
targeted marketing. Distinct Biclustering algorithms
have been proposed by researchers. However, since the
general Biclustering problem is computationally
intractable (NP-Hard), those algorithms are either heuristic or customized for specific applications.
In this project, I try to design an algorithm that serves as a more general
solution for Biclustering problems than
existing algorithms. Furthermore, the proposed algorithm could be used as
a technique to improve the quality of existing algorithms without adding
complexity.
7.
Title: Classification of Horse Racing
Data
Name: Apurv
Abstract:
Over the centuries horse racing has
transformed from a royal sport into a thriving multi-billion dollar industry. In fact, according to NTRA (National
Thoroughbred Racing Association) estimates, over 100
billion dollars are wagered on racing horses in 53 countries annually. This
project aims at mining horse racing records to develop a model for predicting
the outcome of a race. The dataset
has been provided by Equix Biomechanics, a Lexington-based
organization that analyzes the physiology of thoroughbred racing horses.
I have applied Dimensionality Reduction and CBA (Classification Based on
Association) tools to generate a set of rules which can be used for making
predictions. The developed rules were verified using cross validation and their
performance was better than the current industry prediction rate of
15-17%. This exercise also identified redundancies in the current
procedure followed at Equix.
8.
Title: Comparing Trees Via Crossing
Minimization
Name: Wenbin
Abstract:
In areas such as bioinformatics, compiler
construction, and text databases, the comparison of trees can help researchers
analyze the data. Suppose there is a one-to-one correspondence between the
leaves of two binary trees, while the architectures of the two trees are
different. One important step in evaluating the difference between the
two trees is finding an order of the leaves that minimizes the crossings among
the matching edges between corresponding leaves.
Since minimizing crossings when both trees may be changed
is an NP-complete problem, this project changes only one tree
while using the other tree as a fixed standard. For the tree that is changed,
the algorithm swaps the left child and the right child of the root and
compares the crossings in both cases to decide which case is better. The same
process is then applied to all the subtrees until an
architecture that minimizes the crossings is obtained. By using a dynamic
programming approach, the running time of the algorithm can be reduced to
O(n log^2 n).
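The swap-or-keep decision described above can be sketched as follows. This is a simple quadratic-time illustration, not the faster dynamic programming algorithm mentioned in the abstract; the nested-tuple tree representation and the function name `untangle` are assumptions made for the example.

```python
def untangle(tree, pos):
    """Reorder children of a binary tree to minimize crossings with a fixed leaf order.

    tree: a leaf label, or a 2-tuple (left_subtree, right_subtree).
    pos:  dict mapping each leaf label to its position in the fixed tree.
    Returns (reordered_tree, leaf_order, number_of_crossings).
    """
    if not isinstance(tree, tuple):
        return tree, [tree], 0
    left, l_leaves, l_cross = untangle(tree[0], pos)
    right, r_leaves, r_cross = untangle(tree[1], pos)
    # Crossings between the two subtrees if the current child order is kept:
    # a leaf on the left whose partner lies to the right of a right-hand leaf's partner.
    keep = sum(1 for a in l_leaves for b in r_leaves if pos[a] > pos[b])
    # Swapping the children turns every non-crossing pair into a crossing and vice versa.
    swap = len(l_leaves) * len(r_leaves) - keep
    if swap < keep:
        return (right, left), r_leaves + l_leaves, l_cross + r_cross + swap
    return (left, right), l_leaves + r_leaves, l_cross + r_cross + keep

# Fixed tree lists its leaves in the order a, b, c, d.
fixed_pos = {"a": 0, "b": 1, "c": 2, "d": 3}
result = untangle((("d", "c"), ("b", "a")), fixed_pos)
```

The decision at each node can be made independently because swapping inside a subtree does not change which of its leaves lie to the left or right of the sibling subtree's leaves.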
9.
Title: Finding Orthologous
Relations across Multiple Species
Name: Ellisaveta
Abstract:
This project finds orthologous
relations in a set of genes by implementing the algorithm for ortholog
and in-paralog clustering described in the paper by Maido Remm, Christian Storm and
Erik Sonnhammer, “Automatic clustering of orthologs and in-paralogs from
pairwise species comparisons”.
This is a distance-based method that uses similarity scores between every
two sequences from two different genomes (A and B) as distances to perform the
clustering procedure. The similarity scores are calculated in pairwise
sequence comparisons of the two genomes to each other (A->B and B->A)
and of the two genomes to themselves (A->A and B->B). For each best-matching
pair from the two genomes (bi-directional best match), additional sequences from
each genome separately are clustered to the pair according to certain criteria.
Clusters are merged or divided using a set of rules. The algorithm is applied
to a set of genes from the Salamander species. The data sets used for the
pairwise comparisons are from the Human, Mouse, Chicken and Zebrafish species.
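The bi-directional best match step can be sketched as follows. This covers only the seeding of ortholog pairs, not the later in-paralog clustering or the merge and divide rules; the function names, sequence identifiers, and similarity scores are hypothetical values made up for illustration.

```python
def best_hits(scores):
    """Map each query sequence to its highest-scoring subject sequence.

    scores: dict mapping (query_id, subject_id) to a similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def bidirectional_best_matches(a_to_b, b_to_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_to_b)
    best_ba = best_hits(b_to_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Hypothetical similarity scores from the A->B and B->A comparisons.
a_to_b = {("a1", "b1"): 90, ("a1", "b2"): 40, ("a2", "b1"): 70, ("a2", "b2"): 60}
b_to_a = {("b1", "a1"): 88, ("b1", "a2"): 72, ("b2", "a2"): 55, ("b2", "a1"): 30}
pairs = bidirectional_best_matches(a_to_b, b_to_a)
```

In this toy input only `a1` and `b1` choose each other; `a2` prefers `b1`, so `a2` and `b2` do not form a seed pair.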
10.
Title: Classifying Secondary Medical
Expenses based on Offender Demographics
Name: Arthur
Abstract:
As part of CorrectCare
- Integrated Health's service to the Kentucky Department of Corrections, data is
gathered concerning secondary medical expenses for the state inmate
population. This data contains a
wealth of information, and could potentially yield more information when
combined with the inmates' demographics.
This paper explores the combined data to determine whether
future secondary medical claims can be classified given an offender's
demographics. Of particular
interest is whether secondary medical costs can be predicted based on
the inmate's conviction. Data mining is done using a Classification Based on
Associations (CBA) implementation from the University of Liverpool, Department
of Computer Science.
11.
Title: Privacy Preserving of Social
Network Against Sensitive Edge Disclosure
Name: Lian
Abstract:
With the development of emerging social
networks like Facebook and MySpace,
security and privacy threats arising from social network analysis bring a risk
of disclosure of confidential knowledge when the data is shared or made public.
Therefore, in addition to current social network anonymity methods such as
de-identification techniques, we
focus on an underdeveloped research area in which (as in business transaction
networks and intelligence community terrorism networks) the edges (transaction
cost and terrorist contact frequency) as well as the corresponding weights
are considered to be private. We initiate research towards
preserving the weights (data privacy) of some edges, while trying to preserve
similar shortest path lengths and exactly the same shortest paths (data utility) for
some pairs of nodes. We develop two privacy-preserving strategies to achieve
the above two goals. The first strategy is based on Gaussian randomization
multiplication, and graph theory plays a theoretical role in the second
one, a greedy algorithm. In particular, the second strategy not only keeps a
close shortest path length and exactly the same shortest path, but also probably
maximizes the weight privacy preservation, as supported by our mathematical analysis
and experiments.
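The first strategy can be illustrated with a small sketch. It assumes that "Gaussian randomization multiplication" means multiplying each edge weight by a random factor drawn from a normal distribution centered at 1; that reading, the value of `sigma`, the lower bound on the factor, and the toy graph are assumptions, not details taken from the abstract.

```python
import heapq
import random

def perturb(weights, sigma=0.1, seed=0):
    """Multiply every edge weight by a Gaussian factor with mean 1 (assumed scheme)."""
    rng = random.Random(seed)
    # The factor is bounded below so that weights stay positive for Dijkstra.
    return {edge: w * max(0.01, rng.gauss(1.0, sigma)) for edge, w in weights.items()}

def shortest_path(weights, source, target):
    """Dijkstra's algorithm on an undirected graph given as {(u, v): weight}."""
    graph = {}
    for (u, v), w in weights.items():
        graph.setdefault(u, []).append((v, w))
        graph.setdefault(v, []).append((u, w))
    heap = [(0.0, source, [source])]
    done = set()
    while heap:
        dist, node, path = heapq.heappop(heap)
        if node == target:
            return dist, path
        if node in done:
            continue
        done.add(node)
        for nxt, w in graph.get(node, []):
            if nxt not in done:
                heapq.heappush(heap, (dist + w, nxt, path + [nxt]))
    return float("inf"), []

# Toy weighted network (hypothetical transaction costs).
original = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 5.0}
released = perturb(original)
```

Comparing `shortest_path(original, "a", "c")` with `shortest_path(released, "a", "c")` shows the utility side of the trade-off: the individual weights differ, while the path and its length stay close when `sigma` is small.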
------------------------------
BREAK 5 Minutes
------------------------------
12.
Title: Clustering Model for NCI's
Cancer Survey Data
Name: Satya
Abstract:
The National Cancer Institute (NCI)
developed the Health Information National Trends Survey (HINTS) to be repeated
on a routine basis to provide scientists and practitioners with a continuing
source of surveillance data from which to examine trends in health
communication over time. HINTS is a nationally
representative telephone survey of the general population. This project is a
real time project which aims at developing a clustering model for cancer
information seekers using NCI’s HINTS 2005 data. The dataset called
"Health Information National Trends Survey" (HINTS) has been provided by the National
Cancer Institute, U.S. National Institutes of Health. As a part of this project
I try to propose an approach for dealing with the missing data.
13.
Title: Classifying ASes by AdaBoost
Name: Yinfang
Abstract:
An AS (Autonomous System) node can represent a wide variety of
organizations, e.g., a large ISP, a small private business, or a university, with
vastly different network characteristics, external connectivity patterns,
network growth tendencies and other properties. We first select
features that are considered most representative of each kind
of AS node, and then apply the well-known AdaBoost
classifier to these feature vectors. The result serves as a valuable
addition to further understanding of the structure and evolution of the Internet.
14.
Title: Illumination and
Person-Insensitive Head Pose Estimation Using Distance Metric Learning
Name: XianWang
Abstract:
Head pose estimation is an important task
for many face analysis applications, such as face recognition systems and
human-computer interaction. In this paper we aim to address the pose estimation problem
under some challenging conditions, e.g., from a single image, large pose
variation, and uneven illumination conditions. The approach we developed
combines non-linear dimension reduction techniques with a learned distance
metric transformation. The learned distance metric provides better intra-class
clustering, therefore preserving a smooth low-dimensional manifold in the presence
of large variation in the input images due to illumination changes. Experiments
show that our method improves the performance, achieving accuracy within 2-3
degrees for face images with varying poses and within 3-4 degrees error for
face images with varying pose and illumination changes.
15.
Title: Email Traffic Analysis
Name: Phani
Abstract:
Email is a primary means of communication
between employees of a company. Analysis of email data gives a lot of useful
information. This project aims at finding the times at which email traffic is high.
This helps in predicting traffic on the server and taking the necessary measures to
distribute load to standby servers. In this project we analyze the email
traffic of the Enron email data.
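Finding high-traffic times can be sketched by counting messages per hour of day from their Date headers. This is a minimal illustration; the function names and the three sample headers are assumptions, and a real analysis would read the headers from the message files.

```python
from collections import Counter
from email.utils import parsedate_to_datetime

def hourly_traffic(date_headers):
    """Count messages per hour of day (0-23) from RFC 2822 Date header strings."""
    counts = Counter()
    for header in date_headers:
        counts[parsedate_to_datetime(header).hour] += 1
    return counts

def peak_hours(counts, top=3):
    """Return the hours with the most messages, busiest first."""
    return [hour for hour, _ in counts.most_common(top)]

# Hypothetical Date headers in the usual email format.
headers = [
    "Mon, 14 May 2001 16:39:00 -0700",
    "Mon, 14 May 2001 16:05:00 -0700",
    "Tue, 15 May 2001 09:12:00 -0700",
]
counts = hourly_traffic(headers)
```

The hour used here is the sender's local hour as recorded in the header; converting all timestamps to the server's time zone first would be the better choice for load planning.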
16.
Title: An Enhanced Algorithm of
Association Mining
Name: Sandeep
Abstract:
Frequent pattern mining can be done in
several ways, which are in turn associated with sequences. Here, this association
mining technique is enhanced with two other association mining techniques to
fetch the related data and correlations. My algorithm solves the problem using
constraint-based association mining techniques. This combination of two
association mining techniques emphasizes the constraints and yields faster mining of the
result than what we observe with the Apriori algorithm.
This helps in mining more frequent sequences in less time.
17.
Title: DNA Sequence Analysis
Name: Banda
Abstract:
There are some common sequences in the DNA
of humans and horses. This project aims at finding
the matching DNA sequences of various parts of horses, such as brain, liver, heart
and skeletal muscle, in the corresponding human DNA sequences. The
data set consists of records which contain information
about the horse genome sequence, human genome sequence, number
of hits, etc.