Spring 2007 Schedule
Coordinator: Deng Cai, dengcai2 AT uiuc.edu

Tuesday, January 16

SC 3403
4 PM
No talk this week


Tuesday, January 23

SC 3403
4 PM
Nonparametric Kernel Models
Speaker: Feng Liang

Abstract:
Reproducing kernel Hilbert space (RKHS) methods are a popular tool in high-dimensional data analysis. For example, smoothing splines and SVMs (support vector machines) are special cases of RKHS methods. In this talk, I'll present a fully Bayesian framework and theory that coherently embed kernel regression/classification in a general nonparametric model. Key practical features of our approach include the use of shrinkage priors to address problems of "large p", the use of mixture priors for feature selection, coherent updating as sample sizes change, and an understanding of so-called "unlabelled" data.

Tuesday, January 30

SC 3403
4 PM
Discriminative Frequent Pattern Analysis for Effective Classification
Speaker: Hong Cheng

Abstract:
The application of frequent patterns in classification appeared in sporadic studies and achieved initial success in the classification of relational data, text documents and graphs. In this paper, we conduct a systematic exploration of frequent pattern-based classification, and provide solid reasons supporting this methodology. It is well known that feature combinations (patterns) can capture more underlying semantics than single features. However, inclusion of infrequent patterns may not significantly improve the accuracy due to their limited predictive power. By building a connection between pattern frequency and discriminative measures such as information gain and Fisher score, we develop a strategy to set minimum support in frequent pattern mining for generating useful patterns. Based on this strategy, coupled with a proposed feature selection algorithm, discriminative frequent patterns can be generated for building high quality classifiers. We demonstrate that the frequent pattern-based classification framework can achieve good scalability and high accuracy in classifying large datasets. Empirical studies indicate that significant improvement in classification accuracy is achieved (up to 12% in UCI datasets) using the so-selected discriminative frequent patterns.
Tuesday, February 6

SC 3403
4 PM
No talk this week

We will have two talks in the week of Feb. 19
Tuesday, February 13

SC 3403
4 PM
Cancelled due to bad weather.

Tuesday, February 20

SC 3403
4 PM
An Axiomatic Approach to Information Retrieval
Speaker: Hui Fang

Abstract:
The effectiveness of any search engine is determined by the underlying information retrieval model. A common deficiency of existing retrieval models is that they are all developed based on indirect modeling of the relevance of a document w.r.t. a query. In this talk, I will present a novel axiomatic approach to developing retrieval models based on direct modeling of relevance with formalized retrieval constraints defined at the level of terms. I will discuss three benefits of such an axiomatic framework. First, the formalized constraints allow us to predict the performance of a retrieval function analytically without needing experimentation. Second, the axiomatic framework serves as a roadmap for us to develop new effective retrieval functions, and several new retrieval functions are derived in this way, which are shown to be more robust and less sensitive to parameter settings than the existing retrieval functions with comparable optimal performance. Third, it allows us to develop a novel general evaluation methodology for IR models, which can pinpoint the weaknesses and strengths of retrieval functions and give hints on how a retrieval function should be modified to further improve the performance.
Thursday, February 22

SC 3403
4 PM
Learning from User Query Reformulations
Speaker: Rosie Jones (senior research scientist from Yahoo Research)

Abstract:
Web search users tend to rewrite their queries, frequently reformulating to try a related phrase or expression. We describe the distribution of rewrites in sequential user query pairs, and show techniques for predicting which terms a searcher is likely to delete from a query. User query sessions are a noisy source of related phrases in which only half of sequential query pairs are related. However, we show that applying a statistical test allows us to filter out the noise: given a query, we can find a related query with 90% precision. We also define a four-point scale of similarity, and show that we can classify query and phrase pairs into finer grained levels of semantic similarity using supervised learning. We also look at the distribution of reformulations across WordNet-like semantic categories and show some features which show promise for automatic classification into these classes.

Bio:
Rosie Jones is a Senior Research Scientist at Yahoo! where her recent work includes research on query log analysis. Her research interests include information retrieval, machine learning, statistical natural language processing and time series analysis. She received her PhD from the School of Computer Science at Carnegie Mellon University in 2005 under the supervision of Tom Mitchell, where her doctoral thesis was titled Learning to Extract Entities from Labeled and Unlabeled Text. In 2005 she co-organized the SIGIR workshop on lexical cohesion and information retrieval, and in 2003 she co-organized the ICML workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining.

Tuesday, February 27

SC 3403
3 PM
Integrating OLAP and Ranking: The Ranking-Cube Methodology
Speaker: Dong Xin

Abstract:
Recent years have witnessed an enormous growth of data in business, industry, and Web applications. Database search often returns a large collection of results, which poses challenges to both efficient query processing and effective digest of the query results. To address this problem, ranked search has been introduced to database systems. We study the problem of On-Line Analytical Processing (OLAP) of ranked queries, where ranked queries are conducted in the arbitrary subset of data defined by multi-dimensional selections. While pre-computation and multi-dimensional aggregation is the standard solution for OLAP, materializing dynamic ranking results is unrealistic because the ranking criteria are not known until the query time. To overcome such difficulty, we first develop a new ranking cube method that performs semi off-line materialization and semi online computation, and then extend it to high-dimensional data. Its complete life cycle, including cube construction, incremental maintenance, and query processing, will also be discussed. Our performance studies show that Ranking-Cube is orders of magnitude faster than previous approaches. In this talk, I will also provide a brief overview of my other research work in the areas of data mining, data warehousing and database systems.
Tuesday, February 27

SC 3403
4 PM
Enabling Data Retrieval: by Ranking and Beyond
Speaker: Chengkai Li

Abstract:
Database management systems (DBMSs) are facing new challenges in supporting non-traditional data retrieval, in contrast to Boolean queries, in many emerging applications. Even for structured data, we need a retrieval system, much like a "Google" for relational databases, parallel to the well-established information retrieval over unstructured text. In the talk, I will discuss this challenging and exciting research area and introduce my thesis work in this direction. In particular, I will present our work in building RankSQL, a DBMS that provides a systematic and principled framework for efficiently supporting ranking. By extending relational algebra, RankSQL provides an algebraic foundation for treating ranking as a first-class query construct and seamlessly integrating it with Boolean constructs. I will further discuss our efficient framework and algorithms for processing ad-hoc ranking aggregate queries. My work goes beyond ranking in supporting data retrieval. I will introduce our proposal of generalizing Group-By to clustering, parallel to the generalization from Order-By to ranking, and combining them in fuzzy and explorative data retrieval. Moreover, I will briefly discuss our study of inverse ranking queries that obtain the ranks of given objects in a database.
Tuesday, March 6

SC 3403
4 PM
User-Centered Adaptive Information Retrieval
Speaker: Xuehua Shen

Abstract:
A major limitation of current retrieval models and systems is that the retrieval decision is generally made based solely on the query and document collection; information about the actual user and the search context is largely ignored. This limitation makes the retrieval performance of existing retrieval systems inherently non-optimal since different users (and the same user in different situations) may use exactly the same query to search for different information. In this talk, I will present my dissertation work on personalized search that aims to break this limitation and to develop a new retrieval strategy called user-centered adaptive information retrieval (UCAIR). In UCAIR, the retrieval process is modeled generally as a sequential decision process, in which the system responds to each user action by choosing an optimal system action, and all the available user information and search context would be exploited to optimize each retrieval decision. I will present a decision-theoretic framework for optimal interactive information retrieval, several statistical language models for exploiting both short-term and long-term personal search history to improve retrieval accuracy, algorithms for seeking active feedback from a user, and a client-side personalized search agent UCAIR. Evaluation of UCAIR shows that it can improve search accuracy over Google significantly through personalization.

Bio:
Xuehua Shen is a Ph.D. candidate in the Computer Science Department of the University of Illinois at Urbana-Champaign. His main research interests are in information retrieval, databases, and data mining. In addition, he is broadly interested in some other areas of computer science, especially machine learning, human-computer interaction, and information privacy. He has published papers in the ACM SIGIR, ACM CIKM, ACM SIGKDD, WWW, and IEEE ICDE conferences. He received his B.S. in Computer Science from Nanjing University, China in 1999. He completed summer internships at Microsoft Research Redmond and IBM T. J. Watson Research Center. Recently, he served on the program committee of the Symposium on Information Interaction in Context (IIiX'2006). He has two patents related to personalized search filed or in preparation.

Tuesday, March 13

SC 3403
4 PM
A Systematic Exploration of the Feature Space for Relation Extraction
Speaker: Jing Jiang

Abstract:
Relation extraction is the task of finding semantic relations between entities from text. The state-of-the-art methods for relation extraction are mostly based on statistical learning, and thus all have to deal with feature selection, which can significantly affect the classification performance. In this talk, I will present our recent work on systematic exploration of a large space of features for relation extraction and evaluation of the effectiveness of different feature subspaces. We present a general definition of feature spaces based on a graphic representation of relation instances, and explore three different representations of relation instances and features of different complexities within this framework. Our experiments show that using only basic unit features is generally sufficient to achieve state-of-the-art performance, while over-inclusion of complex features may hurt the performance. A combination of features of different levels of complexity and from different sentence representations, coupled with task-oriented feature pruning, gives the best performance.
Tuesday, March 20

SC 3403
4 PM
Spring Break


Tuesday, March 27

SC 3403
4 PM
Mining Colossal Frequent Patterns by Core Pattern Fusion
Speaker: Feida Zhu

Abstract:
Extensive research on frequent-pattern mining in the past decade has brought forth a number of pattern mining algorithms that are both effective and efficient. However, the existing frequent-pattern mining algorithms encounter challenges in mining rather large patterns, called colossal frequent patterns, in the presence of an explosive number of frequent patterns. Colossal patterns are critical to many applications, especially in domains like bioinformatics. In this study, we investigate a novel mining approach called Pattern-Fusion to efficiently find a good approximation to the colossal patterns. With Pattern-Fusion, a colossal pattern is discovered by fusing its small core patterns in one step, whereas the incremental pattern-growth mining strategies, such as those adopted in Apriori and FP-growth, have to examine a large number of mid-sized ones. This property distinguishes Pattern-Fusion from all the existing frequent pattern mining approaches and draws a new mining methodology. Our empirical studies show that, in cases where current mining algorithms cannot proceed, Pattern-Fusion is able to mine a result set which is a close enough approximation to the complete set of the colossal patterns, under a quality evaluation model proposed in this paper.
Tuesday, April 3

SC 3403
4 PM
This talk is cancelled because our speaker is sick.

Bayesian Logistic Regression for Text Classification
Speaker: David Lewis

Abstract:
Bayesian logistic regression allows incorporating task knowledge through model structure and priors on parameters. I will discuss content-based text categorization and authorship attribution using 1) priors that control sparsity and sign of parameters, 2) priors that incorporate domain knowledge from reference books and other texts, and 3) the use of polytomous (1-of-k) dependent variables. All experiments were performed with our open source programs, BBR and BMR, which can fit models with millions of parameters. (Joint work with David Madigan, Alex Genkin, Aynur Dayanik, Dmitriy Fradkin, and Vladimir Menkov at Rutgers University and DIMACS.)

Bio:
Dave Lewis is an entrepreneur and consulting computer scientist based in Chicago, IL. He consults at, and is building another company at, the intersection of information retrieval, data mining, and natural language processing. He previously held research positions at AT&T Labs, Bell Labs, and the University of Chicago, and received his Ph.D. in Computer Science from the University of Massachusetts. He was recently elected a Fellow of the American Association for the Advancement of Science.

Thursday, April 12

SC 3403
4 PM
This talk is cancelled because our speaker's flight to Champaign was cancelled.

A General Model for Retrieval and Filtering
Speaker: Javed Mostafa (from Indiana Univ.)

Abstract:
Information retrieval and filtering (IR/IF) over dynamic environments such as the WWW poses many challenges. At the root of the challenges lie the twin factors of data diversity and data evolution. These two factors are especially problematic for content representation and classification. With a growing number of people relying on web search applications to support everyday tasks, another emerging challenge is providing highly focused results (i.e., personalization) while accommodating diverse user-interests and interest drifts. In fact, the key challenges associated with representation, classification, and personalization are interdependent and the likelihood of achieving progress in any one of these areas can be increased by focusing on them together -- as part of a single information services framework.

In this talk, I will present and discuss such a framework, known as the multilevel information services model. I will demonstrate how the modular nature of the framework permits the isolation and study of key functions that are more basic than IR/IF and establish their individual and combined impact on the effectiveness and efficiency of information services. I will also present results from a series of experiments examining how factors such as the representation method (e.g., automated feature generation vs. manual schemes), classifiers (e.g., supervised vs. unsupervised), nature and size of content streams, level of user interaction, and rate of interest change influence system performance.

Bio:
Dr. Javed Mostafa is the Victor H. Yngve Associate Professor of Information Science and Associate Professor of Informatics at Indiana University, Bloomington. He also has formal faculty affiliations with the Cognitive Science Program in Bloomington (core faculty) and the Computer & Information Science Department (adjunct) in Indianapolis. He is widely published in the information retrieval area. His current research focuses on secure and personalized access to health information. Dr. Mostafa is an associate editor of the ACM Transactions on Information Systems.

Tuesday, April 17

SC 3403
4 PM
Entity Search: Search Directly and Holistically
Speaker: Tao Cheng

Abstract:
As the Web has evolved into a data-rich repository, with the standard "page view", current search engines are increasingly inadequate. While we often search for various data "entities" (e.g., phone number, paper PDF, date), today's engines only take us indirectly to pages. While entities usually appear in many pages, current engines only find each page individually. Toward searching directly and holistically for finding information of finer granularity, we propose the concept of entity search, a significant departure from traditional document retrieval. In particular, we focus on the core challenge of ranking entities, and develop EntityRank, a probabilistic model that integrates both local and global information in ranking.
Tuesday, April 24

SC 3403
4 PM
Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs
Speaker: Qiaozhu Mei

Abstract:
In this paper, we define the problem of topic-sentiment analysis on Weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. The proposed Topic-Sentiment Mixture (TSM) model can reveal the latent topical facets in a Weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. It can also provide general sentiment models that are applicable to any ad hoc topic. With a specifically designed HMM structure, the sentiment models and topic models estimated with TSM can be utilized to extract topic life cycles and sentiment dynamics. Empirical experiments on different Weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from Weblog collections. The TSM model is quite general; it can be applied to any text collection with a mixture of topics and sentiments, and thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction.
Monday, April 30

SC 3403
4 PM
Classification Using Statistically Significant Rules
Speaker: Sanjay Chawla

Abstract:
Classification based on association rule mining has lately become a popular technique within the data mining community. However, it has now been emphatically shown that association rules generated solely on the basis of support and confidence are often not statistically significant, i.e., the rules generated are artifacts of the particular data set being mined rather than a relationship inherent in the underlying population (process). This is not surprising because the use of support is driven purely by its computational and not statistical properties. In this talk, I will show that mining for statistically significant rules, in a classification setting, by "forcing" the Fisher Exact Test or its continuous approximations to be "anti-monotonic" results in: (i) the vast majority of the rules being statistically significant by definition, (ii) a smaller number of rules, and (iii) comparable classification performance on balanced datasets and higher performance on imbalanced datasets. All of this is achieved while examining on average only 0.5% of the search space, using 0.4% of the time, and finding 0.06% of the number of rules found by techniques relying on support and confidence. This is joint work with Florian Verhein.

Bio:
Sanjay Chawla is an Associate Professor in the School of Information Technologies, University of Sydney, Australia. He currently serves as an Associate Head of the School. Sanjay's research work is in the area of data mining with an emphasis on spatial data and more recently on outlier detection. Sanjay is a co-author of the text "Spatial Databases: A Tour." Before joining the University of Sydney, Sanjay worked for Vignette Corporation in Boston (2000-2002); was a postdoc in the Department of Computer Science, University of Minnesota (1997-2000); and was an industrial postdoc at the Institute for Mathematics and its Applications (IMA), University of Minnesota (1995-1997). He received his PhD from the University of Tennessee, Knoxville in 1995. He is a member of ACM, SIAM and AMS.

Tuesday, May 1

SC 3403
4 PM
Toward Summarizing Information Graphics in Multimodal Documents
Speaker: Sandra Carberry (University of Delaware)

Abstract:

Information is the key to effective knowledge and decision-making. However, in order to be useful, information must be accessible. Information graphics (non-pictorial graphics such as bar charts, line graphs, and pie charts) pose accessibility problems:
1) although much attention has been devoted to the summarization, categorization, and retrieval of text documents and images, almost no attention has been given to information graphics,
2) individuals with impaired eyesight lack effective methods for accessing the information provided by information graphics.

The overwhelming majority of information graphics that appear in newspapers, magazines, and formal reports have a communicative goal or message that they are intended to convey. This talk will first present our corpus study showing that the message conveyed by an information graphic in a multimodal document is generally not contained in the article's text. The talk will then present our research on identifying the primary intended message of an information graphic; this message can then be used as the core of a summary of the graphic for digital libraries or as part of an interactive natural language system for blind individuals.

Our approach is to view information graphics as a form of language containing communicative signals. We will discuss the kinds of communicative signals appearing in information graphics, and describe how we exploit these in a Bayesian network for recognizing the graphic's intended message. We will present our implemented system along with the results of evaluation experiments that demonstrate the effectiveness of our methodology.

This research has been done jointly with Stephanie Elzer (former graduate student at the University of Delaware and currently an assistant professor at Millersville University). Other contributors to the work include Dan Chester, Seniz Demir, and Peng Wu.

Short Bio:
Sandra Carberry is a Professor and former Chair of the Department of Computer Science at the University of Delaware. She received a BS degree from Cornell University, an MS degree from Rice University, and a PhD from the University of Delaware, and she was a Member of Technical Staff at Bell Laboratories. She served as Secretary and Member of the Executive Board of the Association for Computational Linguistics from 2001-2006. She is currently on the Editorial Board of the "Journal of Dialog Systems" and "User Modeling and User-Adapted Interaction". Her research focuses on issues of effective communication.

Wednesday, May 2

SC 4405
4 PM

Bayesian Logistic Regression for Text Classification
Speaker: David Lewis

Abstract:
Bayesian logistic regression allows incorporating task knowledge through model structure and priors on parameters. I will discuss content-based text categorization and authorship attribution using 1) priors that control sparsity and sign of parameters, 2) priors that incorporate domain knowledge from reference books and other texts, and 3) the use of polytomous (1-of-k) dependent variables. All experiments were performed with our open source programs, BBR and BMR, which can fit models with millions of parameters. (Joint work with David Madigan, Alex Genkin, Aynur Dayanik, Dmitriy Fradkin, and Vladimir Menkov at Rutgers University and DIMACS.)

Bio:
Dave Lewis is an entrepreneur and consulting computer scientist based in Chicago, IL. He consults at, and is building another company at, the intersection of information retrieval, data mining, and natural language processing. He previously held research positions at AT&T Labs, Bell Labs, and the University of Chicago, and received his Ph.D. in Computer Science from the University of Massachusetts. He was recently elected a Fellow of the American Association for the Advancement of Science.



DAIS - Database and Information Systems Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave., Urbana, IL 61801, USA.  Fax: 217-265-6494, Phone: 217-244-6241.