Fall 2008 Schedule
Coordinator: Tao Cheng, tcheng3 AT illinois.edu

Tuesday, August 19
SC 3403
4 PM
Title: Graph-Based Document Clustering
Speaker: Prof. Rafal A. Angryk (Montana State University)

Abstract: In this work we investigate a new technique for document clustering based on co-occurrence of frequent graphs, which reflect semantic relationships between our keywords. The proposed system, named GDClust (Graph-Based Document Clustering), works with frequent senses rather than with frequent keywords used in traditional text mining. GDClust transforms documents to document-graphs and uses an Apriori approach to find frequent subgraphs. Discovered frequent subgraphs are then utilized to generate sense-based document clusters. In this talk, we are going to introduce some novel, domain-specific constraints that have been incorporated into our system. In particular, we plan to discuss an abstraction-based Gaussian minimum support strategy for candidate subgraph generation, and a Subgraph-Extension Generator ¨C a mechanism that directly utilizes constraints from the natural language to reduce the number of candidates and the overhead imposed by the traditional Apriori-based candidate generation mechanism. GDClust utilizes an English language thesaurus (WordNet) to construct document-graphs and exploits graph-based data mining techniques for word sense disambiguation and clustering. It is a fully automated system and requires minimal human interaction for the clustering process.
Tuesday, August 26

SC 3403
4 PM
No Seminar
Tuesday, September 2

SC 3403
4 PM
Title: Generating Impact-Based Summaries for Scientific Literature
Speaker: Qiaozhu Mei

Abstract: In this talk, we present a study of a novel summarization problem, i.e., summarizing the impact of a scientific publication. Given a paper and its citation context, we study how to extract sentences that can represent the most influential content of the paper. We propose language modeling methods for solving this problem, and study how to incorporate features such as authority and proximity to accurately estimate the impact language model. Experiment results on a SIGIR publication collection show that the proposed methods are effective for generating impact-based summaries.
Tuesday, September 9

SC 3403
4 PM
Title: Knowledge Transfer via Multiple Model Local Structure Mapping
Speaker: Jing Gao

Abstract: In this talk, we consider the problem of transfer learning, where training and test data are drawn from two different domains and their distributions are different. We first give an overview of the existing transfer learning methods, which aim at adapting classifiers learnt from the training domain to the new domain. The task can be especially difficult when the training examples are from several domains different from the test domain. In this talk, we present a locally weighted ensemble framework to combine multiple models for transfer learning, where the weights are dynamically assigned according to a model's predictive power on each test example. It can integrate the advantages of various learning algorithms and the labeled information from multiple training domains into one unified classification model, which can then be applied on a different domain. Importantly, different from many previously proposed methods, none of the base learning method is required to be specifically designed for transfer learning. We show the optimality of this locally weighted ensemble framework as a general approach to combine multiple models for domain transfer. We then propose an implementation of the local weight assignments by mapping the structures of a model onto the structures of the test domain, and then weighting each model locally according to its consistency with the neighborhood structure around the test example. Experimental results on text classification, spam filtering and intrusion detection data sets show significant improvements in classification accuracy gained by the framework. On a transfer learning task of newsgroup message categorization, the proposed locally weighted ensemble framework achieves 97% accuracy when the best single model predicts correctly only on 73% of the test examples. In general, the improvement in accuracy is over 10% and up to 30% across different problems.
Thursday, September 18

SC 3403
4 PM
Please note the time change
Title: PNUTS: Yahoo!'s Hosted Data Serving Platform
Speaker: Brian Cooper (Senior Research Scientist of Yahoo!)

Bio: Brian Cooper is a research scientist at Yahoo! Research. Before that he was an assistant professor at Georgia Tech, and before that he was a PhD student at Stanford. His interests are in building distributed systems, and in particular, distributed systems that do database-style management and processing of data. At Yahoo! he works on building very large distributed data storage and processing systems. In previous lives he has worked on self-adaptive peer-to-peer systems, distributed streaming event processing, reliable distributed archival data storage, and XML indexing.

Abstract: I'll describe PNUTS, a massively parallel and geographically distributed database system for Yahoo!¡¯s web applications. PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees. It is a hosted, centrally managed, and geographically distributed service, and utilizes automated load-balancing and failover to reduce operational complexity. The first version of the system is currently serving in production. I'll describe the motivation for PNUTS and the design and implementation of its table storage and replication layers, and then present experimental results. I'll also discuss experiences building a real production system out of research ideas, and the hard realities that can shatter our happy, optimistic research view of the world.
Tuesday, September 23

SC 3403
4 PM
Title: A Study of Methods for Negative Relevance Feedback
Speaker: Xuanhui Wang

Abstract: Negative relevance feedback is a special case of relevance feedback where we do not have any positive example; this often happens when the topic is difficult and the search results are poor. Although in principle any standard relevance feedback technique can be applied to negative relevance feedback, it may not perform well due to the lack of positive examples. In this paper, we conduct a systematic study of methods for negative relevance feedback. We compare a set of representative negative feedback methods, covering vector-space models and language models, as well as several special heuristics for negative feedback. Evaluating negative feedback methods requires a test set with sufficient difficult topics, but there are not many naturally difficult topics in the existing test collections. We use two sampling strategies to adapt a test collection with easy topics to evaluate negative feedback. Experiment results on several TREC collections show that language model based negative feedback methods are generally more effective than those based on vector-space models, and using multiple negative models is an effective heuristic for negative feedback. Our results also show that it is feasible to adapt test collections with easy topics for evaluating negative feedback methods through sampling.
Tuesday, September 30

SC 3403
4 PM
Title: SCAN: A Structural Clustering Algorithm for Networks
Speaker: Prof. Xiaowei Xu

Bio: Dr. Xu is a professor in the Department of Information Science. His research focuses on data mining and database management systems. Dr. Xu is the author of over 40 papers published in prestigious journals and conference proceedings including IEEE Transactions on Knowledge & Data Engineering (TKDE), International Journal on Data Ming & Knowledge Discovery (DAMI), ACM SIGKDD International Conferences in Knowledge Discovery & Data Mining (KDD), International Conferences in Very Large Databases (VLDB). His research is supported by federal funding agencies including National Institute of Health (NIH), Federal Food and Drug Administration (FDA). In addition, he serves as consultant for industrial companies including Siemens AG and Acxiom Corporation, as well as FDA National Center for Toxicological Research (NCTR). Furthermore, he serves as program committee members and session chairs for premier forums including ACM SIGKDD, IEEE International Conferences on Data Mining (ICDM), and SIAM International Conference on Data Mining (SDM).

Abstract: Network clustering (or graph partitioning) is an important task for the discovery of underlying structures in networks. Many algorithms find clusters by maximizing the number of intra-cluster edges and minimizing the inter-cluster edges. While such algorithms find useful and interesting structures, they tend to fail to identify and isolate two kinds of vertices that play special roles ¨C vertices that bridge clusters (hubs) and vertices that are marginally connected to clusters (outliers). Identifying hubs is useful for applications such as viral marketing and epidemiology since hubs are responsible for spreading ideas or disease. In contrast, outliers have little or no influence, and may be isolated as noise in the data. Recently, we proposed a novel algorithm called SCAN (Structural Clustering Algorithm for Networks), which detects clusters, hubs and outliers in networks. It clusters vertices based on a structural similarity measure. The algorithm is fast and efficient, visiting each vertex only once. An empirical evaluation of the method using both synthetic and real datasets demonstrates superior performance over other methods such as the modularity-based algorithms. In this talk, we will present the algorithm and demonstrate its wide applications in the analysis of large networks including social and biological networks.
Tuesday, October 7

SC 3403
4 PM
Title: Unsupervised Rank Aggregation with Distance-Based Models and Extensions
Speaker: Alexandre Klementiev

Abstract: Consider the scenario where each member of a panel of judges independently generates a (partial) ranking over a set of items while attempting to reproduce a true underlying ranking according to their level of expertise. This setting motivates a fundamental machine learning and information retrieval problem - the necessity to meaningfully aggregate preference rankings into a joint ranking. For example, in meta-search the aim is to aggregate Web search query results from several engines into a more accurate ranking. In many natural language processing applications, such as machine translation, there has been an increased interest in combining the results of multiple systems built on different principles in an effort to improve performance. Although a number of heuristic and supervised learning approaches to rank aggregation exist, they require domain knowledge or supervised ranked data both of which are expensive to acquire. In order to address these limitations, we propose a mathematical and algorithmic framework for learning to aggregate (partial) rankings without supervision. We instantiate the framework for the cases of combining permutations and combining top-k lists, and propose a novel metric for the latter. Experiments in both scenarios demonstrate the effectiveness of the proposed formalism. We extend these ideas further to consider the cases where the constituent judge's expertise depends on either the type of the query or the type of items being ranked.
Tuesday, October 14

SC 3403
4 PM
Title: Indexing Land Surface for Efficient kNN Query
Speaker: Lu An Tang

Abstract: The class of k Nearest Neighbor (kNN) queries is frequently used in geospatial applications. Many studies focus on processing kNN in Euclidean and road network spaces. Meanwhile, with the recent advances in remote sensory devices that can acquire detailed elevation data, the new geospatial applications heavily operate on this third dimension, i.e., land surface. Hence, for the field of databases to stay relevant, it should be able to efficiently process spatial queries given this constrained third dimension. However, online processing of the surface k Nearest Neighbor (skNN) queries is quite challenging due to the huge size of land surface models which renders any accurate distance computation on the surface extremely slow. In this paper, for the first time, we propose an index structure on land surface that enables exact and fast responses to skNN queries. Two complementary indexing schemes, namely Tight Surface Index (TSI) and Loose Surface Index (LSI), are constructed and stored collectively on a single novel data structure called Surface Index R-tree (SIR-tree). With those indexes, we can process skNN query efficiently by localizing the search and minimizing the invocation of the costly surface distance computation and hence incurring low I/O and computation costs. Our algorithm does not need to know the value of k a priori and can incrementally expand the search region using SIR-tree and report the query result progressively. It also reports the exact shortest surface paths to the query results. We show through experiments with real world data sets that our algorithm has better performance than the competitors in both efficiency and accuracy.
Tuesday, October 21

SC 3403
4 PM
Title: Modeling Online Reviews: Exploiting User Annotations for Sentiment Summarization
Speaker: Ivan Titov

Abstract: User generated content represents a unique source of information in which user interface tools have facilitated the creation of an abundance of labeled content, e.g., topics in blogs, numerical product and service ratings in user reviews, and helpfulness rankings in online discussion forums. Many previous studies on user generated content have attempted to predict these labels automatically from the associated text. However, these annotations are often present in the data already, which opens another interesting line of research: designing models leveraging these labelings to improve a wide variety of applications. In this talk I will be considering the sentiment summarization problem. I will present statistical models which exploit user generated aspect ratings to discover corresponding topics and are therefore able to extract fragments of text discussing these aspects without the need of annotated data. Joint work with Ryan McDonald.
Tuesday, October 28

SC 3403
4 PM
Title: Efficient Mining of Closed Repetitive Gapped Subsequences from a Sequence Database
Speaker: Bolin Ding

Abstract: There is a huge wealth of sequence data available, for example, customer purchase histories, program execution traces, DNA, and protein sequences. Analyzing this wealth of data to mine important knowledge is certainly a worthwhile goal. In this paper, as a step forward to analyzing patterns in sequences, we introduce the problem of mining closed repetitive gapped subsequences and propose efficient solutions. Given a database of sequences where each sequence is an ordered list of events, the pattern we would like to mine is called repetitive gapped subsequence, which is a subsequence (possibly with gaps between two successive events within it) of some sequences in the database. We introduce the concept of repetitive support to measure how frequently a pattern repeats in the database. Different from the sequential pattern mining problem, repetitive support captures not only repetitions of a pattern in different sequences but also the repetitions within a sequence. Given a user-specified support threshold min_sup, we study finding the set of all patterns with repetitive support no less than min_sup. To obtain a compact yet complete result set and improve the efficiency, we also study finding closed patterns. Efficient mining algorithms to find the complete set of desired patterns are proposed based on the idea of instance growth. Our performance study on various datasets shows the efficiency of our approach. A case study is also performed to show the utility of our approach.
Online Video: Link
Tuesday, November 4

SC 3403
4 PM
Title: Analytic Opportunities in State Farm Systems Environment
Speaker: Scott Farris (State Farm Research Center)

Abstract: State Farm is the United States largest property and casualty insurer. Founded in 1922, today State Farm insures over 80 million policyholders in automobile, fire and life coverages and is a leader in financial service products. State Farm has over 67,000 employees nation wide and 17,000 agents The State Farm Systems department has 5900 employees and employs technologies ranging from; Websphere, J2EE and .net development, to Data Warehouse technology, to cryptography and biometrics. Advanced Analytics is an emerging area of emphasis within the State Farm Systems department. Advanced Analytics provides; better business solutions, improved Systems department processes and cost optimization. Our presentation will focus on three recent efforts that utilized advanced analytics to produce more effective solutions. In effort one, geo-spatial concepts are used to identify expected hurricane loss by location. Hurricane model results are combined with spatial data to effectively manage coastal exposure. Effort two combines ARIMA forecasting techniques and hurricane simulation models to produce a strategy aimed at reducing support costs for large catastrophic events. Annual claim forecasts are combined with estimates of seasonality and hurricane model results to provide a hardware deployment strategy. Costs are minimized using cost optimization techniques. Our final effort employs Cluster Analysis to determine optimal project deployment strategies. Geographic based distances are calculated for minimization objectives. Density algorithms are then employed to provide optimal strategies. State Farm is engaged in expanding its analytic capabilities in all departments including the Systems department. For those interested in Systems careers the can broaden into advanced analytics, State Farm provides opportunities and programs to explore those interests.
Tuesday, November 11

SC 3403
4 PM
Title: Domain Independent Semantic Ranking of {XML} Search Results
Speaker: Arash Termehchy

Abstract: The steady increase in the use of XML as a format for data from many different domains has exacerbated the need for an easy-to-use, high precision query interface for XML data. When traditional document-oriented keyword search techniques do not suffice, natural language interfaces and keyword search techniques that take advantage of XML structure make it very easy for ordinary users to query XML databases. Unfortunately, current approaches to processing these queries rely heavily on heuristics that are intuitively appealing but ultimately ad hoc. As a result, these approaches often retrieve false positive answers, overlook correct answers, and cannot rank answers appropriately. To address these problems for data-centric XML, we propose {\it coherency ranking}, a domain- and structure-independent semantic ranking method for XML keyword queries that is based on an extension of the concepts of data dependencies and mutual information. We analyze the way in which previous approaches to XML keyword search approximate coherency ranking, and present efficient algorithms to process queries and rank their answers using coherency ranking. Our empirical evaluation with two real-world XML data sets shows that coherency ranking has better precision and recall than all previous approaches. Coherency ranking can also be used for keyword queries over relational and graph data.
Tuesday, November 18

SC 3403
4 PM
Title: Multi-Aspect Expertise Matching for Review Assignment
Speaker: Maryam Karimzadehgan

Abstract: Review assignment is a common task that many people such as conference organizers, journal editors, and grant administrators would have to do routinely. As a computational problem, it involves matching a set of candidate reviewers with a paper or proposal to be reviewed. A common deficiency of all existing work on solving this problem is that they do not consider the multiple aspects of topics or expertise and all match the entire document to be reviewed with the overall expertise of a reviewer. As a result, if a document contains multiple subtopics, which often happens, existing methods would not attempt to assign reviewers to cover all the subtopics; instead, it is quite possible that all the assigned reviewers would cover the major subtopic quite well, but not covering any other subtopic. In this paper, we study how to model multiple aspects of expertise and assign reviewers so that they together can cover all subtopics in the document well. We propose three general strategies for solving this problem and propose new evaluation measures for this task. We also create a multi-aspect review assignment test set using ACM SIGIR publications. Experiment results on this data set show that the proposed methods are effective for assigning reviewers to cover all topical aspects of a document.
Online Video: Link
Tuesday, November 25

SC 3403
4 PM
Thanksgiving, no seminar.
Tuesday, December 2

SC 3403
4 PM
Title: Non-negative Matrix Factorization on Manifold
Speaker: Deng Cai

Abstract: Recently Non-negative Matrix Factorization (NMF) has received a lot of attentions in information retrieval, computer vision and pattern recognition. NMF aims to find two non-negative matrices whose product can well approximate the original matrix. The sizes of these two matrices are usually smaller than the original matrix. This results in a compressed version of the original data matrix. The solution of NMF yields a natural parts-based representation for the data. When NMF is applied for data representation, a major disadvantage is that it fails to consider the geometric structure in the data. In this paper, we develop a graph based approach for parts-based data representation in order to overcome this limitation. We construct an affinity graph to encode the geometrical information and seek a matrix factorization which respects the graph structure. We demonstrate the success of this novel algorithm by applying it on real world problems.
Online Video: Link


DAIS - Database and Information Systems Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave., Urbana, IL 61801, USA.  Fax: 217-265-6494, Phone: 217-244-6241.