Type 1: Novel Data Mining Methods

1. Approximate frequent pattern mining

Problem and Motivation: Frequent itemset mining is a popular and important first step in the analysis of data arising in a broad range of applications. The traditional “exact” model for frequent itemsets requires that every item occur in each supporting transaction. However, real data is typically subject to noise and measurement error. To date, the effect of noise on exact frequent pattern mining algorithms has been addressed primarily through simulation studies, and there has been limited attention to the development of noise-tolerant algorithms.
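The difference between the exact model and a noise-tolerant one can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `max_missing` parameter, and the toy transactions are hypothetical, and "allow up to k missing items per transaction" is just one simple relaxation, not the definition used by any particular paper.

```python
from typing import Iterable, List, FrozenSet


def exact_support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of the itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def approximate_support(
    itemset: Iterable[str],
    transactions: List[FrozenSet[str]],
    max_missing: int = 1,  # hypothetical noise tolerance: items allowed to be absent
) -> float:
    """Fraction of transactions that miss at most `max_missing` items of the itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if len(items - t) <= max_missing) / len(transactions)


# Toy data (made up for illustration only).
transactions = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"b", "c", "d"}),
    frozenset({"d"}),
]

pattern = {"a", "b", "c"}
exact = exact_support(pattern, transactions)           # only the first transaction qualifies
relaxed = approximate_support(pattern, transactions)   # the first four transactions qualify
```

With a single dropped item per transaction, the pattern {a, b, c} looks rare under the exact model but common under the relaxed one, which is the kind of effect a noise-tolerant algorithm has to handle.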

Reference:

http://www.google.com/search?q=approximate+frequent+itemset+mining&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a

2. Mining from unstructured data, such as text and images

Problem and Motivation: Internet image search engines such as Google return a plentiful supply of images for a single typed word. The question is how to learn visual models directly from this source.

Reference:

Learning Object Categories from Google’s Image Search

http://people.csail.mit.edu/fergus/papers/fergus_google.pdf

From frequent itemsets to semantically meaningful visual patterns, KDD06

http://portal.acm.org/citation.cfm?id=1281192.1281284

3. Mining Social Networks

Problem and Motivation: In recent years, social network research has advanced significantly, thanks to the prevalence of online social websites and the availability of a variety of large-scale offline social network systems such as collaboration networks. These systems are usually characterized by complex network structures and rich accompanying contextual information. Researchers are increasingly interested in the challenges these disparate systems pose, including identifying the common static topological properties of social networks, identifying the dynamic properties that emerge as they form and evolve, and determining how contextual information can help in analyzing them. These issues have important implications for community discovery, anomaly detection, and trend prediction, and can enhance applications in domains such as information retrieval, recommendation systems, and security.

Reference:

Aris Anagnostopoulos, Ravi Kumar, Mohammad Mahdian: Influence and correlation in social networks. SIGKDD, 7-15, 2008.

Justin Brickell, Vitaly Shmatikov: The cost of privacy: destruction of data-mining utility in anonymized data publishing. SIGKDD, 70-78, 2008.

http://www.slideshare.net/xllora/mining-social-networks-in-message-boards

http://www.arnetminer.org/

4. Transfer Learning

Problem and Motivation: All machine learning algorithms require data to learn, and the amount of data available is often a limiting factor. Classification requires labeled data, which may be expensive to obtain. Reinforcement learning requires samples from an environment, which take time to gather. Recently, transfer learning (TL) has been gaining popularity as a way to increase learning performance. Rather than learning a novel target task in isolation, transfer approaches use data from one or more source tasks in order to learn the target task with less data or to achieve a higher performance level.

In many real-world applications, however, we wish to use the labeled data from one domain (called in-domain) to classify the unlabeled data in a different domain (out-of-domain). This problem often arises when labeled data is difficult to obtain in one domain but plentiful in a related but different domain. In general, this is a transfer learning problem: we wish to classify the unlabeled data using the labeled data even though the two do not come from the same domain.

Reference:

A webpage tracking recent developments in transfer learning research: http://www.cse.ust.hk/~sinnopan/conferenceTL.htm

György J. Simon, Vipin Kumar, Zhi-Li Zhang: Semi-supervised approach to rapid and reliable labeling of large data sets. SIGKDD, 641-649, 2008.

Jing Gao, Wei Fan, Jing Jiang, Jiawei Han: Knowledge Transfer via Multiple Model Local Structure Mapping. SIGKDD, 2008

Spectral domain transfer learning.

http://delivery.acm.org/10.1145/1410000/1401951/p488-ling.pdf?key1=1401951&key2=7855003321&coll=GUIDE&dl=GUIDE&CFID=19758775&CFTOKEN=26314845

Type 2: Bioinformatics Applications

1. Mining Co-Expressed Genes or Gene Regulatory Networks across Multiple Gene Expression Datasets.

Problem and Motivation: In any species, the set of genes is limited. But the number of experiments that study the expression of these genes, in order to dissect gene function in different tissues or under different environmental conditions, has increased dramatically in recent years. Each gene expression dataset is like a snapshot of gene activity. How to assemble a bigger picture from the increasingly large number of gene expression datasets is a challenging and interesting question. New methodologies are needed to combine multiple gene expression datasets in a single analysis.

2. Mining Significant Co-Expression Patterns across a Large Number of Treatment Groups.

Problem and Motivation: Differential analysis of co-expressed genes has typically been done across a defined number of groups. However, given a large number of treatment groups, finding a gene that differentiates across all treatment groups is probably not relevant. Instead, it is important to find a subset of genes and a subset of treatment groups where significant and meaningful co-expression patterns exist.

3. Expression Quantitative Trait Analysis.

a. Algorithm for eQTL analysis

Transcriptional control is one of the most important steps by which an organism expresses the genetic information stored in its sequence and responds to environmental changes (Ihmels et al., 2004). Recent advances in genomic technology have made it possible to quantify transcript abundance systematically, as well as to genotype genetic markers covering the whole genome in a segregating population. With these tools, expression quantitative trait locus (eQTL) analysis has been applied to study the inheritance of thousands of similar traits, in the hope of finding general rules for the genetic control of transcriptional regulation.
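As a starting point for the algorithm part, the basic idea of relating genotypes to transcript abundance can be sketched as a single-marker scan. This is a deliberately minimal illustration, not a full eQTL method: it assumes genotypes coded as 0/1 per individual, uses plain Pearson correlation as the association score, and omits significance testing and multiple-testing correction. The names `pearson`, `single_marker_scan`, and the toy data are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def single_marker_scan(
    genotypes: Dict[str, List[float]],   # marker -> genotype code per individual
    expression: Dict[str, List[float]],  # gene -> expression level per individual
) -> List[Tuple[str, str, float]]:
    """Score every (marker, gene) pair and sort by absolute correlation."""
    scores = [
        (marker, gene, pearson(geno, expr))
        for marker, geno in genotypes.items()
        for gene, expr in expression.items()
    ]
    return sorted(scores, key=lambda s: abs(s[2]), reverse=True)


# Toy data for six individuals (made up for illustration only).
genotypes = {
    "marker1": [0, 0, 0, 1, 1, 1],
    "marker2": [0, 1, 0, 1, 0, 1],
}
expression = {
    "geneA": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.1, 1.9],
}

hits = single_marker_scan(genotypes, expression)
top_marker, top_gene, top_score = hits[0]  # strongest candidate association
```

In a real project, each expression trait would be tested against thousands of markers, so the scoring function and the correction for multiple comparisons are where the algorithmic work lies.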

b. Building Gene Expression Network on top of eQTL Result.

Type 3: Other Applications

1. Collaborative Filtering

2. Information Retrieval

Type 4: Public Competitions

1. KDD Cup 2008 challenges and previous KDD Cups

http://www.kddcup2008.com/

2. Netflix Prize

http://www.netflixprize.com/

3. UCI Machine Learning Repository.

http://archive.ics.uci.edu/ml/