| Tuesday, January 16 SC 3403 4 PM |
No talk this week
|
| Tuesday, January 23 SC 3403 4 PM |
Nonparametric Kernel Models Speaker: Feng Liang
Abstract
|
| Tuesday, January 30 SC 3403 4 PM |
Discriminative Frequent Pattern Analysis for Effective Classification
Speaker: Hong Cheng Abstract: The application of frequent patterns in classification appeared in sporadic studies and achieved initial success in the classification of relational data, text documents and graphs. In this paper, we conduct a systematic exploration of frequent pattern-based classification, and provide solid reasons supporting this methodology. It was well known that feature combinations (patterns) could capture more underlying semantics than single features. However, inclusion of infrequent patterns may not significantly improve the accuracy due to their limited predictive power. By building a connection between pattern frequency and discriminative measures such as information gain and Fisher score, we develop a strategy to set minimum support in frequent pattern mining for generating useful patterns. Based on this strategy, coupled with a proposed feature selection algorithm, discriminative frequent patterns can be generated for building high quality classifiers. We demonstrate that the frequent pattern-based classification framework can achieve good scalability and high accuracy in classifying large datasets. Empirical studies indicate that significant improvement in classification accuracy is achieved (up to 12% in UCI datasets) using the so-selected discriminative frequent patterns. |
| Tuesday, February 6 SC 3403 4 PM |
No talk this week
We will have two talks in the week of Feb. 19 |
| Tuesday, February 13 SC 3403 4 PM |
Cancelled due to bad weather.
|
| Tuesday, February 20 SC 3403 4 PM |
An Axiomatic Approach to Information Retrieval
Speaker: Hui Fang Abstract: The effectiveness of any search engine is determined by the underlying information retrieval model. A common deficiency of existing retrieval models is that they are all developed based on indirect modeling of the relevance of a document w.r.t. a query. In this talk, I will present a novel axiomatic approach to developing retrieval models based on direct modeling of relevance with formalized retrieval constraints defined at the level of terms. I will discuss three benefits of such an axiomatic framework. First, the formalized constraints allow us to predict the performance of a retrieval function analytically without needing experimentation. Second, the axiomatic framework serves as a roadmap for us to develop new effective retrieval functions, and several new retrieval functions are derived in this way, which are shown to be more robust and less sensitive to parameter settings than the existing retrieval functions with comparable optimal performance. Third, it allows us to develop a novel general evaluation methodology for IR models, which can pinpoint the weaknesses and strengths of retrieval functions and give hints on how a retrieval function should be modified to further improve the performance. |
| Thursday, February 22 SC 3403 4 PM |
Learning from User Query Reformulations
Speaker: Rosie Jones (senior research scientist from Yahoo Research) Abstract: Web search users tend to rewrite their queries, frequently reformulating to try a related phrase or expression. We describe the distribution of rewrites in sequential user query pairs, and show techniques for predicting which terms a searcher is likely to delete from a query. User query sessions are a noisy source of related phrases in which only half of sequential query pairs are related. However, we show that applying a statistical test allows us to filter out the noise: given a query, we can find a related query with 90% precision. We also define a four-point scale of similarity, and show that we can classify query and phrase pairs into finer grained levels of semantic similarity using supervised learning. We also look at the distribution of reformulations across WordNet-like semantic categories and show some features which show promise for automatic classification into these classes.
Bio: |
| Tuesday, February 27 SC 3403 3 PM |
Integrating OLAP and Ranking: The Ranking-Cube Methodology
Speaker: Dong Xin Abstract: Recent years have witnessed an enormous growth of data in business, industry, and Web applications. Database search often returns a large collection of results, which poses challenges to both efficient query processing and effective digest of the query results. To address this problem, ranked search has been introduced to database systems. We study the problem of On-Line Analytical Processing (OLAP) of ranked queries, where ranked queries are conducted in the arbitrary subset of data defined by multi-dimensional selections. While pre-computation and multi-dimensional aggregation is the standard solution for OLAP, materializing dynamic ranking results is unrealistic because the ranking criteria are not known until the query time. To overcome such difficulty, we first develop a new ranking cube method that performs semi off-line materialization and semi online computation, and then extend it to high-dimensional data. Its complete life cycle, including cube construction, incremental maintenance, and query processing, will also be discussed. Our performance studies show that Ranking-Cube is orders of magnitude faster than previous approaches. In this talk, I will also provide a brief overview of my other research work in the areas of data mining, data warehousing and database systems. |
| Tuesday, February 27 SC 3403 4 PM |
Enabling Data Retrieval: by Ranking and Beyond
Speaker: Chengkai Li Abstract: Database management systems (DBMSs) are facing new challenges in supporting non-traditional data retrieval, in contrast to Boolean queries, in many emerging applications. Even for structured data, we need a retrieval system, much like a "Google" for relational databases, parallel to the well-established information retrieval over unstructured text. In the talk, I will discuss this challenging and exciting research area and introduce my thesis work in this direction. In particular I will present our work in building RankSQL, a DBMS that provides a systematic and principled framework for efficiently supporting ranking. By extending relational algebra, RankSQL provides an algebraic foundation for treating ranking as a first-class query construct and seamlessly integrating with Boolean constructs. I will further discuss our efficient framework and algorithms for processing ad-hoc ranking aggregate queries. My work goes beyond ranking in supporting data retrieval. I will introduce our proposal of generalizing Group-By to clustering, parallel to the generalization from Order-By to ranking, and combining them in fuzzy and explorative data retrieval. Moreover, I will briefly discuss our study of inverse ranking queries that obtain the ranks of given objects in a database. |
| Tuesday, March 6 SC 3403 4 PM |
User-Centered Adaptive Information Retrieval
Speaker: Xuehua Shen Abstract: A major limitation of current retrieval models and systems is that the retrieval decision is generally made based solely on the query and document collection; information about the actual user and the search context is largely ignored. This limitation makes the retrieval performance of existing retrieval systems inherently non-optimal since different users (and the same user in different situations) may use exactly the same query to search for different information. In this talk, I will present my dissertation work on personalized search that aims to break this limitation and to develop a new retrieval strategy called user-centered adaptive information retrieval (UCAIR). In UCAIR, the retrieval process is modeled generally as a sequential decision process, in which the system responds to each user action by choosing an optimal system action, and all the available user information and search context would be exploited to optimize each retrieval decision. I will present a decision-theoretic framework for optimal interactive information retrieval, several statistical language models for exploiting both short-term and long-term personal search history to improve retrieval accuracy, algorithms for seeking active feedback from a user, and a client-side personalized search agent UCAIR. Evaluation of UCAIR shows that it can improve search accuracy over Google significantly through personalization.
Bio: |
| Tuesday, March 13 SC 3403 4 PM |
A Systematic Exploration of the Feature Space for Relation Extraction
Speaker: Jing Jiang Abstract: Relation extraction is the task of finding semantic relations between entities from text. The state-of-the-art methods for relation extraction are mostly based on statistical learning, and thus all have to deal with feature selection, which can significantly affect the classification performance. In this talk, I will present our recent work on systematic exploration of a large space of features for relation extraction and evaluation of the effectiveness of different feature subspaces. We present a general definition of feature spaces based on a graphic representation of relation instances, and explore three different representations of relation instances and features of different complexities within this framework. Our experiments show that using only basic unit features is generally sufficient to achieve state-of-the-art performance, while over-inclusion of complex features may hurt the performance. A combination of features of different levels of complexity and from different sentence representations, coupled with task-oriented feature pruning, gives the best performance. |
| Tuesday, March 20 SC 3403 4 PM |
Spring Break
|
| Tuesday, March 27 SC 3403 4 PM |
Mining Colossal Frequent Patterns by Core Pattern Fusion
Speaker: Feida Zhu Abstract: Extensive research for frequent-pattern mining in the past decade has brought forth a number of pattern mining algorithms that are both effective and efficient. However, the existing frequent-pattern mining algorithms encounter challenges in mining rather large patterns, called colossal frequent patterns, in the presence of an explosive number of frequent patterns. Colossal patterns are critical to many applications, especially in domains like bioinformatics. In this study, we investigate a novel mining approach called Pattern-Fusion to efficiently find a good approximation to the colossal patterns. With Pattern-Fusion, a colossal pattern is discovered by fusing its small core patterns in one step, whereas the incremental pattern-growth mining strategies, such as those adopted in Apriori and FP-growth, have to examine a large number of mid-sized ones. This property distinguishes Pattern-Fusion from all the existing frequent pattern mining approaches and draws a new mining methodology. Our empirical studies show that, in cases where current mining algorithms cannot proceed, Pattern-Fusion is able to mine a result set which is a close enough approximation to the complete set of the colossal patterns, under a quality evaluation model proposed in this paper. |
| Tuesday, April 3 SC 3403 4 PM |
This talk is cancelled because our speaker is sick.
Bayesian Logistic Regression for Text Classification
Bio: |
| Thursday, April 12 SC 3403 4 PM |
This talk is cancelled since our speaker's flight to Champaign was cancelled
A General Model for Retrieval and Filtering
Bio:
|
| Tuesday, April 17 SC 3403 4 PM |
Entity Search: Search Directly and Holistically
Speaker: Tao Cheng Abstract: As the Web has evolved into a data-rich repository, with the standard "page view", current search engines are increasingly inadequate. While we often search for various data "entities" (e.g., phone number, paper PDF, date), today's engines only take us indirectly to pages. While entities usually appear in many pages, current engines only find each page individually. Toward searching directly and holistically for finding information of finer granularity, we propose the concept of entity search, a significant departure from traditional document retrieval. In particular, we focus on the core challenge of ranking entities, and develop EntityRank, a probabilistic model that integrates both local and global information in ranking. |
| Tuesday, April 24 SC 3403 4 PM |
Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs
Speaker: Qiaozhu Mei Abstract: In this paper, we define the problem of topic-sentiment analysis on Weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. The proposed Topic-Sentiment Mixture (TSM) model can reveal the latent topical facets in a Weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. It could also provide general sentiment models that are applicable to any ad hoc topics. With a specifically designed HMM structure, the sentiment models and topic models estimated with TSM can be utilized to extract topic life cycles and sentiment dynamics. Empirical experiments on different Weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from Weblog collections. The TSM model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, and thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction. |
| Monday, April 30 SC 3403 4 PM |
Classification Using Statistically Significant Rules
Speaker: Sanjay Chawla Abstract: Classification based on association rule mining has lately become a popular technique within the data mining community. However, it has now been emphatically shown that association rules generated solely on the basis of support and confidence are often not statistically significant - i.e., the rules generated are artifacts of the particular data set being mined rather than a relationship inherent in the underlying population (process). This is not surprising because the use of support is driven purely by its computational and not statistical properties. In this talk I will show that mining for statistically significant rules, in a classification setting, by "forcing" the Fisher Exact Test or its continuous approximations to be "anti-monotonic" results in: (i) the vast majority of the rules being statistically significant - by definition, (ii) fewer rules, and (iii) comparable classification performance on balanced datasets and higher performance on imbalanced datasets. All this is achieved while examining on average only 0.5% of the search space, using 0.4% of the time, and finding 0.06% of the number of rules found by techniques relying on support and confidence. This is joint work with Florian Verhein.
Bio |
| Tuesday, May 1 SC 3403 4 PM |
Toward Summarizing Information Graphics in Multimodal Documents
Speaker: Sandra Carberry (University of Delaware) Abstract:
Information is the key to effective knowledge and decision-making.
However, in order to be useful, information must be accessible.
Information graphics (non-pictorial graphics such as bar charts, line graphs, and pie charts) pose accessibility problems: The overwhelming majority of information graphics that appear in newspapers, magazines, and formal reports have a communicative goal or message that they are intended to convey. This talk will first present our corpus study showing that the message conveyed by an information graphic in a multimodal document is generally not contained in the article's text. The talk will then present our research on identifying the primary intended message of an information graphic; this message can then be used as the core of a summary of the graphic for digital libraries or as part of an interactive natural language system for blind individuals. Our approach is to view information graphics as a form of language containing communicative signals. We will discuss the kinds of communicative signals appearing in information graphics, and describe how we exploit these in a Bayesian network for recognizing the graphic's intended message. We will present our implemented system along with the results of evaluation experiments that demonstrate the effectiveness of our methodology. This research has been done jointly with Stephanie Elzer (former graduate student at the University of Delaware and currently an assistant professor at Millersville University). Other contributors to the work include Dan Chester, Seniz Demir, and Peng Wu.
Short Bio: |
| Wednesday, May 2 SC 4405 4 PM |
Bayesian Logistic Regression for Text Classification
Bio: |
DAIS - Database and Information Systems Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave., Urbana, IL 61801, USA. Fax: 217-265-6494, Phone: 217-244-6241. |