CS 685: Final Project Abstract and Talk Schedule

1. Title: Data mining of Clothing World*

Name: Cindy

Abstract:

Clothing World* is a local retail clothing store that has been collecting data on their sales, inventory, and customers via a point-of-sale system for the past year. The store owners would like to have a better understanding of their business and their customer base. The objective of this project is to utilize various data mining approaches and techniques to gain valuable information in order to answer some key business questions. The models employed in this project are Association Rules (Market Basket Analysis), Decision Trees, and Clustering.

*The name of the store has been masked to respect the privacy of the business.

2. Title: Predictive Modeling via Data Mining: The NetFlix Prize

Name: Casey, Nick, Mary

Abstract:

As developing technologies embrace the benefits of predictive modeling and data mining, entrepreneurs are finding new ways of using these technologies for the benefit of customers, companies and science. In this report, we follow the efforts of one such company, NetFlix, and its sponsorship of the now famous NetFlix Prize - a competition where participants develop data mining algorithms for predicting movie rentals based on past movie ratings expressed by NetFlix users. We participate as contestants in this contest to better understand how an industry-sponsored competition can lead to advancements in data mining and a better understanding of predictive modeling. Participation in the contest involves the use of off-the-shelf data mining programs as well as the development of custom software. These programs, exercised on the NetFlix Prize data set, attempt to produce movie rating predictions which beat the ratings observed by NetFlix when they collect movie ratings from their users. The algorithms used in these programs involve association rule mining, clustering, and time sequencing. We conclude with closing thoughts on the successes and failures of each algorithm type.

3. Title: Classification of Pattern Recognition Using Neural Network

Name: Arunava

Abstract:

Artificial neural networks are quite robust in handling highly noisy, nonlinear and complex data with amazing ease. In this pattern recognition application I have trained a neural network so that it can distinguish among three species of iris flower (iris-setosa, iris-versicolor, iris-virginica). Since the inputs are classified into three classes, a multilayer perceptron with the error backpropagation algorithm is used instead of a single-layer perceptron.

The network has four inputs, one for each of the four attributes of a flower, and three outputs, one for each class, so the output layer has three neurons. The number of hidden layers is kept below the number of output neurons, which restricts the network to one or two hidden layers. The network is a multilayer perceptron trained with backpropagation of error. Initially, the connection weights between each pair of neurons are set to small random values in the range [-0.5, 0.5]. The target output for each pattern must be specified first; the weights are then updated by backpropagation until the actual output reaches the target output. At each step the weight update w can be restricted to the range -0.1 <= w <= 0.1, which reduces the chance of overshooting while smoothly approaching a minimum-error solution. For better results, the weight-update loop is executed for each pattern a large number of times (say 2000 times).
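The training procedure described above can be sketched roughly as follows. This is a minimal illustration, not the author's program: the single hidden layer of two neurons, the sigmoid activation, the learning rate, and the three sample measurement vectors are all assumptions made for the example. The initial weight range [-0.5, 0.5], the update clipping to [-0.1, 0.1], and the 2000 passes come from the abstract.

```python
import math
import random

def sigmoid(x):
    # Numerically safe logistic function (assumed activation).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def clip(delta, limit=0.1):
    # Restrict a single weight update to [-limit, limit].
    return max(-limit, min(limit, delta))

def init_layer(n_out, n_in, rng):
    # One weight per input plus one bias weight, drawn from [-0.5, 0.5].
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]

def forward(layer, inputs):
    x = inputs + [1.0]  # constant 1.0 feeds the bias weight
    return [sigmoid(sum(w * v for w, v in zip(neuron, x))) for neuron in layer]

def train(patterns, n_hidden=2, epochs=2000, rate=0.5, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(patterns[0][0]), len(patterns[0][1])
    hidden = init_layer(n_hidden, n_in, rng)
    output = init_layer(n_out, n_hidden, rng)
    for _ in range(epochs):
        for inputs, target in patterns:
            h = forward(hidden, inputs)
            o = forward(output, h)
            # Error terms for the output layer, then propagated back to the hidden layer.
            d_out = [(t - y) * y * (1 - y) for t, y in zip(target, o)]
            d_hid = [hj * (1 - hj) * sum(d_out[k] * output[k][j] for k in range(n_out))
                     for j, hj in enumerate(h)]
            for k, neuron in enumerate(output):
                for j, v in enumerate(h + [1.0]):
                    neuron[j] += clip(rate * d_out[k] * v)
            for j, neuron in enumerate(hidden):
                for i, v in enumerate(inputs + [1.0]):
                    neuron[i] += clip(rate * d_hid[j] * v)
    return hidden, output

# Illustrative patterns only: four attributes in, one-of-three target out.
patterns = [
    ([5.1, 3.5, 1.4, 0.2], [1.0, 0.0, 0.0]),
    ([7.0, 3.2, 4.7, 1.4], [0.0, 1.0, 0.0]),
    ([6.3, 3.3, 6.0, 2.5], [0.0, 0.0, 1.0]),
]
hidden, output = train(patterns, epochs=200)
prediction = forward(output, forward(hidden, patterns[0][0]))
```

The predicted class would be the index of the largest value in `prediction`; how well this converges depends on the assumed learning rate and hidden layer size.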

4. Title: Visual Object Retrieval System

Name: Jizhou

Abstract:

In this talk, we present a visual object retrieval system. Given a user-specified query object, the system returns a ranked list of images which might contain the query object from a relatively large image dataset. The scheme builds upon the bag-of-words model. Affine-invariant image features are extracted from local image regions and quantized into ``visual words'' using Hierarchical K-Means (HKM), which allows us to generate a relatively large ``visual vocabulary'' efficiently. Furthermore, our system enforces spatial consistency between corresponding features to filter out erroneous candidates in the retrieved list. Potential applications of our system include 3D model completion for building facades.
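As a rough illustration of the bag-of-words step, the sketch below quantizes feature descriptors to their nearest visual word and ranks database images by histogram similarity. It assumes a small flat vocabulary and cosine similarity for ranking; the hierarchical vocabulary (HKM) and the spatial consistency check from the abstract are not reproduced, and all function names are hypothetical.

```python
import math
from collections import Counter

def nearest_word(descriptor, vocabulary):
    # Index of the closest vocabulary centre (squared Euclidean distance).
    return min(range(len(vocabulary)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(descriptor, vocabulary[i])))

def bag_of_words(descriptors, vocabulary):
    # Histogram of visual-word occurrences for one image.
    return Counter(nearest_word(d, vocabulary) for d in descriptors)

def cosine(h1, h2):
    dot = sum(h1[w] * h2[w] for w in h1)
    n1 = math.sqrt(sum(v * v for v in h1.values()))
    n2 = math.sqrt(sum(v * v for v in h2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def rank_images(query_descriptors, database, vocabulary):
    # Return image names ordered from most to least similar to the query.
    query = bag_of_words(query_descriptors, vocabulary)
    scores = {name: cosine(query, bag_of_words(descs, vocabulary))
              for name, descs in database.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy 2-D descriptors; real descriptors would be high-dimensional.
vocabulary = [(0.0, 0.0), (10.0, 10.0)]
database = {
    "image_a": [(0.0, 1.0), (1.0, 0.0)],
    "image_b": [(9.0, 9.0), (10.0, 11.0)],
}
ranking = rank_images([(0.5, 0.5)], database, vocabulary)
```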

5. Title: A survey on Anomaly Detection systems and approaches towards reducing false positives

Name: Onur

Abstract:

In this talk, I will discuss the most recent techniques employed by anomaly detection algorithms. Anomaly detection systems are a category of intrusion detection systems in which a model of normal or acceptable data is defined and the system then attempts to detect deviations from the normal model in the observed data. The major drawback of anomaly detection systems is their high false positive rate. A recent work on reducing false positives will also be addressed in this talk.
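The idea of modeling normal data and flagging deviations can be illustrated with a deliberately simple sketch. It assumes a single numeric feature modeled by its mean and standard deviation, with a hypothetical threshold parameter; it is not one of the surveyed systems. Raising the threshold lowers the false positive rate at the cost of missing milder anomalies.

```python
import statistics

def fit_normal_model(values):
    # The "normal" model is just the mean and sample standard deviation.
    return statistics.mean(values), statistics.stdev(values)

def is_anomaly(x, model, threshold=3.0):
    # Flag observations more than `threshold` standard deviations from the mean.
    mean, std = model
    return abs(x - mean) > threshold * std

# Made-up training data, e.g. requests per minute under normal operation.
normal_traffic = [10, 11, 9, 10, 12, 8, 10]
model = fit_normal_model(normal_traffic)
```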

------------------------------

BREAK 5 Minutes

------------------------------

6. Title: A Fast Biclustering Algorithm

Name: Yuan

Abstract:

Biclustering, a new class of clustering techniques that simultaneously cluster both dimensions of a data matrix, was introduced for applications such as the analysis of gene expression data and targeted marketing. Researchers have proposed a number of distinct biclustering algorithms. However, since the general biclustering problem is computationally intractable (NP-hard), those algorithms are either heuristic or customized for specific applications. In this project, I try to design an algorithm that serves as a more general solution for biclustering problems than existing algorithms. Furthermore, the proposed algorithm could be used as a technique to improve the quality of existing algorithms without adding complexity.

7. Title: Classification of Horse Racing Data

Name: Apurv

Abstract:

Over the centuries horse racing has transformed from a royal sport into a thriving multi-billion dollar industry. In fact, according to NTRA (National Thoroughbred Racing Association) estimates, over 100 billion dollars are wagered on racing horses in 53 countries annually. This project aims at mining horse racing records to develop a model for predicting the outcome of a race. The dataset has been provided by Equix Biomechanics, a Lexington-based organization that analyzes the physiology of thoroughbred racing horses. I have applied dimensionality reduction and CBA (Classification Based on Association) tools to generate a set of rules which can be used for making predictions. The resulting rules were verified using cross-validation, and their performance was better than the current industry prediction rate of 15-17%. This exercise also identified redundancies in the current procedure followed at Equix.

8. Title: Comparing Trees Via Crossing Minimization

Name: Wenbin

Abstract:

In areas such as bioinformatics, compiler construction, and text databases, the comparison of trees can help researchers analyze the data. Suppose there is a one-to-one correspondence between the leaves of two binary trees, while the architectures of the two trees are different. One important step in evaluating the difference between the two trees is finding an order of the leaves that minimizes the crossings among the matching edges between corresponding leaves.

Since the problem is NP-complete when the architecture of both trees may be changed, this project changes only one tree while using the other tree as a fixed standard. For the tree that is changed, the algorithm swaps the left and right children of the root and compares the number of crossings in both cases to decide which is better. The same process is then applied to all subtrees until an architecture that minimizes the crossings is obtained. By using a dynamic programming approach, the running time of the algorithm can be reduced to O(n log² n).
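The one-tree procedure can be sketched as a simple recursion. This is a straightforward quadratic-time illustration under the assumption that the tree is given as nested 2-tuples with leaf labels, and that the fixed tree is represented only by the left-to-right position of each leaf; it does not implement the O(n log² n) dynamic programming variant mentioned above.

```python
def minimize_crossings(tree, pos):
    """Return (leaf order, crossing count) for the best child orderings.

    tree: a leaf label, or a 2-tuple (left_subtree, right_subtree).
    pos:  maps each leaf label to its position in the fixed tree.
    """
    if not isinstance(tree, tuple):
        return [tree], 0
    left, right = tree
    l_leaves, l_cross = minimize_crossings(left, pos)
    r_leaves, r_cross = minimize_crossings(right, pos)
    # Crossings between the two subtrees if the current child order is kept:
    # a leaf on the left whose partner lies to the right of a right leaf's partner.
    keep = sum(1 for a in l_leaves for b in r_leaves if pos[a] > pos[b])
    # Swapping the children turns every non-crossing pair into a crossing and vice versa.
    swap = len(l_leaves) * len(r_leaves) - keep
    if keep <= swap:
        return l_leaves + r_leaves, l_cross + r_cross + keep
    return r_leaves + l_leaves, l_cross + r_cross + swap

# Fixed tree has leaves in the order a, b, c, d.
pos = {"a": 0, "b": 1, "c": 2, "d": 3}
order, crossings = minimize_crossings((("d", "c"), ("b", "a")), pos)
```

Because a swap at one node does not affect the crossings counted inside its subtrees, each node can be decided independently, which is what makes the greedy per-node choice safe here.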

9. Title: Finding Orthologous Relations across Multiple Species

Name: Ellisaveta

Abstract:

This project finds orthologous relations in a set of genes by implementing the algorithm for clustering orthologs and in-paralogs described in the paper by Maido Remm, Christian Storm and Erik Sonnhammer, “Automatic clustering of orthologs and in-paralogs from pairwise species comparisons”. This is a distance-based method that uses similarity scores between every two sequences from two different genomes (A and B) as distances to perform the clustering procedure. The similarity scores are calculated in pairwise sequence comparisons of the two genomes to each other (A->B and B->A) and of the two genomes to themselves (A->A and B->B). For each best-matching pair from the two genomes (bi-directional best match), additional sequences from each genome are separately clustered to the pair according to certain criteria. Clusters are merged or divided using a set of rules. The algorithm is applied to a set of genes from the salamander species. The data sets used for the pairwise comparisons are from the human, mouse, chicken and zebrafish species.
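The bi-directional best match step can be sketched as follows. This is only an illustration of that one step, assuming similarity scores are already available as dictionaries keyed by (query, hit) pairs; the gene names and scores are made up, and the later clustering of in-paralogs and the merge/divide rules are not shown.

```python
def best_hit(scores, gene, candidates):
    # Candidate with the highest similarity score to `gene` (missing pairs score 0).
    return max(candidates, key=lambda c: scores.get((gene, c), 0.0))

def bidirectional_best_matches(scores_ab, scores_ba, genes_a, genes_b):
    # Keep (a, b) only when b is a's best hit in B and a is b's best hit in A.
    pairs = []
    for a in genes_a:
        b = best_hit(scores_ab, a, genes_b)
        if best_hit(scores_ba, b, genes_a) == a:
            pairs.append((a, b))
    return pairs

# Hypothetical genes and similarity scores for genomes A and B.
genes_a = ["a1", "a2"]
genes_b = ["b1", "b2"]
scores_ab = {("a1", "b1"): 90, ("a1", "b2"): 40, ("a2", "b1"): 70, ("a2", "b2"): 30}
scores_ba = {("b1", "a1"): 88, ("b1", "a2"): 65, ("b2", "a1"): 42, ("b2", "a2"): 35}
seed_pairs = bidirectional_best_matches(scores_ab, scores_ba, genes_a, genes_b)
```

Here `a2` also hits `b1` best, but `b1` prefers `a1`, so only the pair (`a1`, `b1`) would seed a cluster.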

10. Title: Classifying Secondary Medical Expenses based on Offender Demographics

Name: Arthur

Abstract:

As part of CorrectCare - Integrated Health's service to the Kentucky Department of Corrections, data is gathered concerning secondary medical expenses for the state inmate population. This data contains a wealth of information, and could potentially yield more when combined with the inmates' demographics. This paper explores the combined data to determine whether future secondary medical claims can be classified given an offender's demographics. Of particular interest is whether secondary medical costs can be predicted based on the inmate's conviction. Data mining is done using a Classification Based on Associations (CBA) implementation from the University of Liverpool, Department of Computer Science.

11. Title: Privacy Preserving of Social Network Against Sensitive Edge Disclosure

Name: Lian

Abstract:

With the development of emerging social networks like Facebook and MySpace, security and privacy threats arising from social network analysis bring a risk of disclosing confidential knowledge when the data is shared or made public. In addition to current social network anonymization approaches such as de-identification techniques, we focus on an underdeveloped research area in which the edges and their corresponding weights are considered private, as in business transaction networks (transaction cost) and intelligence community terrorism networks (a terrorist's contact frequency). We initiate research towards preserving the privacy of the weights of some edges (data privacy) while trying to preserve similar shortest path lengths and exactly the same shortest paths (data utility) between some pairs of nodes. We develop two privacy-preserving strategies to achieve these two goals. The first strategy is based on Gaussian randomization multiplication; the second is a greedy algorithm grounded in graph theory. In particular, the second strategy not only keeps the shortest path length close and the shortest path exactly the same, but also tends to maximize weight privacy preservation, as supported by our mathematical analysis and experiments.
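One plausible reading of the first strategy is sketched below: each edge weight is multiplied by a random factor drawn from a Gaussian distribution centred at 1, and the shortest path length is then compared before and after perturbation. The mean of 1, the `sigma` parameter, the clamping of the factor to keep weights positive, and the toy graph are all assumptions for illustration; the abstract does not give these details, and the greedy second strategy is not shown.

```python
import heapq
import random

def perturb_weights(edges, sigma=0.1, seed=None):
    # Multiply each weight by a Gaussian factor around 1 (clamped to stay positive).
    rng = random.Random(seed)
    return {edge: w * max(rng.gauss(1.0, sigma), 0.01) for edge, w in edges.items()}

def shortest_path_length(edges, source, target):
    # Dijkstra's algorithm on an undirected graph given as {(u, v): weight}.
    graph = {}
    for (u, v), w in edges.items():
        graph.setdefault(u, []).append((v, w))
        graph.setdefault(v, []).append((u, w))
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

# Toy network: weights could stand for transaction costs.
edges = {("a", "b"): 1.0, ("b", "c"): 2.0, ("a", "c"): 5.0}
noisy = perturb_weights(edges, sigma=0.05, seed=1)
original_length = shortest_path_length(edges, "a", "c")
perturbed_length = shortest_path_length(noisy, "a", "c")
```

A small `sigma` keeps `perturbed_length` close to `original_length` (better utility) but hides the true weights less; a larger `sigma` does the opposite.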

------------------------------

BREAK 5 Minutes

------------------------------

12. Title: Clustering Model for NCI's Cancer Survey Data

Name: Satya

Abstract:

The National Cancer Institute (NCI) developed the Health Information National Trends Survey (HINTS), a nationally representative telephone survey of the general population that is repeated on a routine basis, to provide scientists and practitioners with a continuing source of surveillance data from which to examine trends in health communication over time. This project works with real-world data and aims at developing a clustering model for cancer information seekers using NCI’s HINTS 2005 data. The dataset has been provided by the National Cancer Institute, U.S. National Institutes of Health. As part of this project, I also propose an approach for dealing with the missing data.

13. Title: Classifying ASes by AdaBoost

Name: Yinfang

Abstract:

An AS (Autonomous System) node can represent a wide variety of organizations, e.g., a large ISP, a small private business, or a university, with vastly different network characteristics, external connectivity patterns, network growth tendencies and other properties. We first select the features considered most representative of each kind of AS node, and then apply the well-known AdaBoost classifier to these feature vectors. The result serves as a valuable addition to further understanding of the structure and evolution of the Internet.
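For readers unfamiliar with AdaBoost, a compact sketch follows. It assumes binary labels (+1/-1) and single-feature threshold stumps as weak learners, and uses a made-up one-feature dataset; classifying several AS types would need a multi-class extension, and the actual features used in the project are not shown here.

```python
import math

def train_stump(X, y, w):
    # Exhaustively pick the (feature, threshold, sign) with the lowest weighted error.
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] <= thr else -sign for row in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] <= thr else -sign

def adaboost(X, y, rounds=5):
    n = len(X)
    w = [1.0 / n] * n  # start with uniform example weights
    model = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        # Increase the weight of misclassified examples, decrease the rest.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1

# Hypothetical feature (e.g. a scaled connectivity measure) with two classes.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [1, 1, -1, -1]
model = adaboost(X, y, rounds=3)
```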

14. Title: Illumination and Person-Insensitive Head Pose Estimation Using Distance Metric Learning

Name: XianWang

Abstract:

Head pose estimation is an important task for many face analysis applications, such as face recognition systems and human-computer interaction. In this paper we aim to address the pose estimation problem under some challenging conditions, e.g., from a single image, with large pose variation, and under uneven illumination. The approach we developed combines non-linear dimension reduction techniques with a learned distance metric transformation. The learned distance metric provides better intra-class clustering, therefore preserving a smooth low-dimensional manifold in the presence of large variation in the input images due to illumination changes. Experiments show that our method improves performance, achieving an error within 2-3 degrees for face images with varying poses and within 3-4 degrees for face images with varying pose and illumination.

15. Title: Email Traffic Analysis

Name: Phani

Abstract:

Email is a primary means of communication between employees of a company, and analysis of email data yields a lot of useful information. This project aims at finding the times at which email traffic is high. This helps in predicting the load on the server and in taking the necessary measures to distribute load to standby servers. In this project we analyze the email traffic of the Enron email dataset.
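Finding high-traffic times can be illustrated by counting messages per hour of day. The sketch assumes the timestamps have already been extracted from the messages and normalized to ISO format; the sample timestamps are made up, and parsing the raw message headers of the actual dataset is left out.

```python
from collections import Counter
from datetime import datetime

def hourly_traffic(timestamps):
    # Count messages by hour of day (0-23).
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)

def peak_hours(timestamps, top=3):
    # Hours with the most messages, busiest first.
    return [hour for hour, _ in hourly_traffic(timestamps).most_common(top)]

# Made-up sample of message send times.
timestamps = [
    "2001-05-14 09:15:00",
    "2001-05-14 09:47:00",
    "2001-05-14 14:02:00",
    "2001-05-15 09:30:00",
    "2001-05-15 14:20:00",
    "2001-05-15 17:00:00",
]
busiest = peak_hours(timestamps, top=2)
```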

16. Title: An Enhanced Algorithm of Association Mining

Name: Sandeep

Abstract:

Frequent pattern mining can be done in several ways, which in turn can be associated with sequences. In this project, the association mining technique is enhanced with two other association mining techniques to fetch related data and correlations. My algorithm solves the problem using constraint-based association mining techniques. This combination of two association mining techniques emphasizes constraints and mines results faster than what we observe with the Apriori algorithm. This helps in mining more frequent sequences in less time.

17. Title: DNA Sequence Analysis

Name: Banda

Abstract:

There are some common sequences in the DNA of humans and horses. This project aims at finding DNA sequences from various parts of horses, such as brain, liver, heart and skeletal muscle, that match the corresponding human DNA sequences. The dataset consists of records which contain information about the horse genome sequence, the human genome sequence, the number of hits, etc.