Type 1: Novel Data Mining Methods
1. Approximate frequent pattern mining
Problem and Motivation: Frequent itemset mining is a popular and important first step in the analysis of data arising in a broad range of applications. The traditional “exact” model for frequent itemsets requires that every item occur in each supporting transaction. However, real data is typically subject to noise and measurement error. To date, the effect of noise on exact frequent pattern mining algorithms has been addressed primarily through simulation studies, and there has been limited attention to the development of noise-tolerant algorithms.
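To make the contrast between exact and noise-tolerant support concrete, here is a minimal sketch. It assumes one simple relaxation, in which a transaction still supports an itemset if it is missing at most `max_missing` of its items. The function names, the `max_missing` parameter, and the toy transactions are illustrative assumptions, not a published algorithm.

```python
def exact_support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def tolerant_support(transactions, itemset, max_missing=1):
    """Fraction of transactions missing at most `max_missing` items.

    This is one possible relaxation (an assumption for illustration);
    other noise-tolerant models bound errors per item as well.
    """
    itemset = set(itemset)
    return sum(len(itemset - t) <= max_missing for t in transactions) / len(transactions)


# Toy data: the pattern {a, b, c} with single items dropped by "noise".
transactions = [
    {"a", "b", "c"},
    {"a", "b"},        # "c" missing
    {"a", "c"},        # "b" missing
    {"b", "c", "d"},   # "a" missing
    {"d"},             # unrelated transaction
]

pattern = {"a", "b", "c"}
exact = exact_support(transactions, pattern)
tolerant = tolerant_support(transactions, pattern, max_missing=1)
print(exact, tolerant)
```

Under the exact model only one of the five transactions supports the pattern, while the relaxed count recovers the three noisy ones as well. The trade-off is that a loose tolerance can also admit spurious patterns, which is part of what makes the problem interesting.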
Reference:
2. Mining from unstructured data, such as text and images
Problem and Motivation: There is a plentiful
supply of images available at the typing of a single word using Internet image
search engines such as Google, and the question is how to learn visual models
directly from this source.
Reference:
Learning Object Categories from Google’s Image Search
http://people.csail.mit.edu/fergus/papers/fergus_google.pdf
From frequent itemsets to semantically meaningful visual patterns, KDD06
http://portal.acm.org/citation.cfm?id=1281192.1281284
3. Mining Social Networks
Problem and Motivation: In recent years, social network research has advanced significantly, thanks to the prevalence of online social websites and the availability of a variety of large-scale offline social network systems such as collaboration networks. These systems are usually characterized by complex network structures and rich accompanying contextual information. Researchers are increasingly interested in a wide range of challenges in these disparate systems, including identifying common static topological properties, identifying dynamic properties during the formation and evolution of the networks, and determining how contextual information can help in analyzing them. These issues have important implications for community discovery, anomaly detection, and trend prediction, and can enhance applications in multiple domains such as information retrieval, recommendation systems, and security.
Reference:
Aris Anagnostopoulos, Ravi Kumar, Mohammad Mahdian: Influence and correlation in social networks. SIGKDD, 7-15, 2008.
Justin Brickell, Vitaly Shmatikov: The cost of privacy: destruction of data-mining utility in anonymized data publishing. SIGKDD, 70-78, 2008.
http://www.slideshare.net/xllora/mining-social-networks-in-message-boards
4. Transfer Learning
Problem and Motivation: All machine learning algorithms require data to learn, and often the amount of data available is a limiting factor. Classification requires labeled data, which may be expensive to obtain. Reinforcement learning requires samples from an environment, which takes time to gather. Recently, transfer learning (TL) approaches have been gaining in popularity as a way to increase learning performance. Rather than learning a novel target task in isolation, transfer approaches make use of data from one or more source tasks in order to learn the target task with less data or to achieve a higher performance level.
In many real-world applications, however, we wish to make use of the labeled data from one domain (called in-domain) to classify the unlabeled data in a different domain (out-of-domain). This problem often arises when obtaining labeled data in one domain is difficult while there is plenty of labeled data from a related but different domain. In general, this is a transfer learning problem in which we wish to classify the unlabeled data using the labeled data, even though the two sets of data are not from the same domain.
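The in-domain / out-of-domain setting can be illustrated with a small sketch. It uses a deliberately simple stand-in technique, a nearest-centroid classifier adapted by self-training on the unlabeled data, which is not the method of any reference listed here. The one-dimensional data, the shift of +1.5 between domains, and all function names are assumptions chosen for illustration.

```python
def centroids(xs, ys, fallback=None):
    """Mean feature value per class; classes absent from ys keep the fallback."""
    out = dict(fallback or {})
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        out[label] = sum(members) / len(members)
    return out


def predict(cents, xs):
    """Assign each point to the class with the nearest centroid."""
    return [min(cents, key=lambda c: abs(x - cents[c])) for x in xs]


def adapt(cents, unlabeled_xs, rounds=10):
    """Self-training: pseudo-label the unlabeled data, then refit centroids."""
    for _ in range(rounds):
        pseudo = predict(cents, unlabeled_xs)
        new = centroids(unlabeled_xs, pseudo, fallback=cents)
        if new == cents:
            break
        cents = new
    return cents


# Labeled data from one domain.
labeled_xs = [0.0, 1.0, 2.0, 4.0, 5.0, 6.0]
labeled_ys = [0, 0, 0, 1, 1, 1]

# Unlabeled data from a related domain: same classes, features shifted by +1.5.
unlabeled_xs = [1.5, 2.5, 3.5, 5.5, 6.5, 7.5]
true_ys = [0, 0, 0, 1, 1, 1]  # used only to evaluate the sketch

base = centroids(labeled_xs, labeled_ys)
before = predict(base, unlabeled_xs)
after = predict(adapt(base, unlabeled_xs), unlabeled_xs)
print(before, after)
```

Applied directly, the classifier trained on the labeled domain mislabels the point at 3.5 because the decision boundary no longer matches the shifted data. After a few rounds of refitting on pseudo-labels the boundary moves with the data. Self-training of this kind can also reinforce its own mistakes when the domains differ more strongly, which is one motivation for more principled transfer methods.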
Reference:
A webpage covering recent developments in transfer learning research: http://www.cse.ust.hk/~sinnopan/conferenceTL.htm
György J. Simon, Vipin Kumar, Zhi-Li Zhang: Semi-supervised approach to rapid and reliable labeling of large data sets. SIGKDD, 641-649, 2008.
Jing Gao, Wei Fan, Jing Jiang, Jiawei Han: Knowledge Transfer via Multiple Model Local Structure Mapping. SIGKDD, 2008.
Spectral domain transfer learning.
Type 2: Bioinformatics Applications
1. Mining Co-Expressed Genes or Gene Regulatory Networks across Multiple Gene Expression Datasets.
Problem and Motivation: In any species, the set of genes is limited. But the number of experiments that study the expression of these genes, in order to dissect gene functions in different tissues or under different environmental conditions, has increased dramatically in recent years. Each gene expression dataset is like a snapshot of gene activity. How to assemble a bigger picture from the increasingly large number of gene expression datasets is a challenging and interesting question. New methodologies are needed to combine multiple gene expression datasets in a single analysis.
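One simple way to combine several expression datasets, sketched below, is to compute the Pearson correlation of a gene pair within each dataset and count the datasets in which the pair is strongly correlated. This is only a baseline for illustration. The gene names, expression values, the 0.9 threshold, and the function names are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def coexpression_support(datasets, gene1, gene2, threshold=0.9):
    """Number of datasets in which the two genes correlate at or above threshold."""
    return sum(pearson(d[gene1], d[gene2]) >= threshold for d in datasets)


# Two hypothetical datasets: gene name -> expression values across samples.
datasets = [
    {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1]},
    {"A": [5, 1, 3, 2], "B": [10, 2, 7, 4], "C": [1, 2, 1, 2]},
]

support_ab = coexpression_support(datasets, "A", "B")
support_ac = coexpression_support(datasets, "A", "C")
print(support_ab, support_ac)
```

Counting supporting datasets treats each dataset as one vote, so it is insensitive to differences in scale or platform between experiments. It ignores sample size and negative correlation, both of which a real analysis would need to handle.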
2. Mining Significant Co-Expression Patterns across a Large Number of Treatment Groups.
Problem and Motivation: Differential analysis of co-expressed genes has typically been done across a defined number of groups. However, given a large number of treatment groups, finding a gene that differentiates across all treatment groups is probably not relevant. Instead, it is important to find a subset of genes and a subset of treatment groups where significant and meaningful co-expression patterns exist.
3. Expression Quantitative Trait Analysis.
a. Algorithm for eQTL analysis
Transcriptional control is one of the most important steps by which an organism expresses the genetic information stored in its sequence and responds to environmental changes (Ihmels et al., 2004). Recent advances in genomic technology have made it possible to quantify transcript abundance systematically, as well as to genotype genetic markers covering the whole genome in a segregating population. With these tools, expression quantitative trait locus (eQTL) analysis has been applied to study the inheritance of thousands of similar traits, in the hope of finding general rules for the genetic control of transcriptional regulation.
b. Building Gene Expression Network on top of eQTL Result.
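As a starting point for the eQTL analysis in item a, the sketch below runs a single-marker scan: for one transcript, it regresses expression on the genotype at each marker and ranks markers by the fraction of variance explained. It assumes genotypes coded as 0/1/2 allele counts. The marker names, data values, and function names are hypothetical. A real analysis would add significance testing and multiple-testing correction across thousands of transcripts and markers.

```python
def linear_fit(x, y):
    """Least-squares slope of y on x and the squared correlation (R^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0  # uninformative marker or constant trait
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


def scan_markers(genotypes, expression):
    """Fit expression ~ genotype for each marker; returns marker -> (slope, R^2)."""
    return {m: linear_fit(g, expression) for m, g in genotypes.items()}


# Hypothetical data for six individuals: genotypes as allele counts (0, 1, 2).
genotypes = {
    "m1": [0, 0, 1, 1, 2, 2],
    "m2": [0, 2, 1, 0, 2, 1],
}
expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]  # abundance of one transcript

results = scan_markers(genotypes, expression)
best_marker = max(results, key=lambda m: results[m][1])
print(best_marker, results[best_marker])
```

Here expression rises by about one unit per allele at marker `m1` and is only weakly related to `m2`, so `m1` is the candidate eQTL for this transcript. Repeating the scan for every transcript yields the marker-to-transcript associations on which a gene expression network (item b) could be built.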
Type 3: Other Applications
1. Collaborative Filtering
2. Information Retrieval
Type 4: Public Competitions.
1. KDD Cup 2008 Challenges and previous KDD Cups
2. Netflix Prize
3. UCI Machine Learning Repository.
http://archive.ics.uci.edu/ml/