Spring 2006 Schedule
Coordinator: Mayssam Sayyadian, sayyadia AT uiuc.edu

Wednesday, January 19
SC2405
11 AM
Discovering Interesting Subsets of Data in Cube Space
Raghu Ramakrishnan, University of Wisconsin

Data Cubes have been widely studied and implemented, and so we researchers shouldn't be thinking about them anymore, right? Wrong. In this talk, I'll try to convince you that the multidimensional model of data ("cube" sounds so much cooler) provides the right perspective for addressing many challenging tasks, including dealing with imprecision, mining for interesting subsets of data, analysis of historical stream data, and world peace. The talk will touch upon results from a couple of VLDB 2005 papers, and some recent ongoing work.

Friday, January 27
SC3405
1 PM
ConQuer: Managing Inconsistency in Data
Renee Miller, University of Toronto

We present the ConQuer system for efficient and scalable management of data that may be inconsistent or uncertain. ConQuer permits a form of dynamic (query-time) data cleaning where users may retrieve data that is consistent with a set of postulated constraints or preferences. An inconsistent database may be modeled as one that represents several alternative possible worlds. Based on this observation, we draw analogies between solutions for querying inconsistent databases and solutions for querying integrated databases, which are also traditionally modeled as a set of possible worlds. The ConQuer project is joint work with Ariel Fuxman, Diego Fuxman, Elham Fazli, and Periklis Andritsos. In this talk, we will primarily focus on the work of Ariel Fuxman. The ConQuer project was created in 2003 and has been supported in part by NSERC, CITO, and IBM.

Wednesday, February 1
SC3405
11 AM
Using System Support to Improve Software Dependability
Feng Qin, University of Illinois

High dependability is demanded by many applications, such as mission-critical programs, Internet servers, and financial applications. Unfortunately, software bugs can slip through even the strictest testing and greatly affect software dependability during production runs. The scope of my thesis research is using system support to improve system dependability during production runs by detecting the triggering of software bugs, detecting the exploitation of software bugs, and surviving software bugs and their effects, such as software failures.

During this talk, I will give an overview of my thesis research and mainly talk about my recent work, Rx, which is a novel technique to survive software failures and improve system availability. In addition, I will briefly introduce another piece of my work, SafeMem, which leverages ECC memory technology to provide fine-grained memory monitoring with very low overhead. SafeMem has been used for detecting memory leak and memory corruption bugs. Finally, I will mention my current ongoing project, LIFT, which uses dynamic information flow tracking to secure your software.

Wednesday, February 8
SC3405
11 AM
Applying Data Mining Techniques to Computer Systems
Zhenmin Li, University of Illinois

As computer systems become increasingly complex, it is challenging to provide the high performance, reliability, and manageability demanded by enterprise customers. In order to provide a solution, it is necessary to first analyze and characterize such systems and identify the critical issues. System data such as source code and running traces provide a valuable asset for targeting the solution. The huge amount of system data, however, makes analysis a tedious and difficult task for managers and developers, and hence the hidden information is difficult to extract for system characterization. In this talk, I will present a novel approach to analyzing various system data by applying data mining techniques. This approach can effectively obtain useful information hidden in huge amounts of system data, and we can then exploit such information to improve system performance, reliability, and manageability. Specifically, we apply different data mining algorithms to different types of system data, such as access traces, system call traces, and source code, to achieve different goals including automated debugging and system behavior characterization. The results demonstrate that using data mining techniques is an efficient and effective approach to solving computer system problems.

Wednesday, February 15
DAIS adjunct seminar SC3405
10-10:45 AM
Improving Software Dependability with Hardware Support
Pin Zhou, University of Illinois

As software permeates our daily life and becomes increasingly complex, software dependability becomes critically important. The main obstacle to building dependable software is software bugs, which often account for 40% of computer system failures and over 50% of security vulnerabilities. Existing dynamic bug detection techniques suffer from high overhead and can greatly benefit from the recent impressive advances in computer architecture.

In this talk, I will present my thesis work on using general, novel hardware support for detecting software bugs in production runs. First, I will introduce a simple and general hardware framework called iWatcher that enables efficient memory monitoring and can be used to detect a wide variety of bugs with very low overhead, orders of magnitude smaller than software-only approaches. Next, I will present an innovative bug detection technique called PC-based invariants and an efficient novel hardware implementation of this technique to detect general memory corruptions during production runs. Finally, I will briefly describe my recent work on using the iWatcher hardware support for efficient data structure consistency checks and my future research plan.

Wednesday, February 15
SC3405
11 AM-12 PM
Automatic Schema Matching and The Monotonicity Principle
Avigdor Gal, Technion - Israel Institute of Technology

Given two database schemata, the task of schema matching involves the identification of a mapping between attributes of the two schemata. New challenges such as that of Web service discovery and composition call for a fully-automatic schema matching process. In this talk we shall present (both intuitively and formally) several challenges in devising algorithms for fully-automatic schema matching, primarily the uncertainty in the outcome of such a process. We shall then outline the monotonicity principle as a tool in schema matching classification and show its usefulness in devising matching heuristics, based on top-K best schema mappings.

This is joint work with Ateret Anaby-Tavor, Alberto Trombetta, and Danilo Montesi.

Wednesday, February 22
DAIS adjunct seminar SC3405
10-10:45 AM
Applying Data Mining Techniques to Computer Systems
Zhenmin Li, University of Illinois

Zhenmin will continue his unfinished talk from Feb 8. See that slot for its abstract.

Wednesday, February 22
SC3405
11 AM
Large-Scale Legal and Business Information Management: Research & Development at Thomson Legal & Regulatory
Khalid Al-Kofahi, Director of Research
Jack G. Conrad, Sr. Research Scientist,
Thomson, Inc.

Thomson is a leading provider of integrated information solutions for knowledge professionals around the world (www.thomson.com). It employs over 40,000 individuals in 45 countries. Thomson consists of four primary market sectors: Financial, Legal, Scientific & Health Care, and Educational. Thomson Legal & Regulatory (TLR) focuses on providing productivity and work-flow solutions to practitioners in the legal domain (www.tlrg.com).

TLR R&D performs applied research and development to help drive the sustainable operation and growth of TLR companies by providing technological innovations that meet its customers' needs. It specializes in information retrieval, information extraction, text categorization, text clustering, and data mining, with an emphasis on finding novel ways to organize and present legal and business information. This presentation will first provide an overview of TLR's business and its customers. It will then describe the work of TLR R&D in the fields of research mentioned above, in particular, detailing its efforts in areas of highly granular text classification [using taxonomies possessing O(10^5) nodes] and entity extraction & linking [harnessing authority resources consisting of O(10^6) entities].

Tuesday, Feb 28
DAIS adjunct seminar SC2407
2-4 PM
Mining and Searching Massive Graph Databases
Xifeng Yan, University of Illinois

Graphs are ubiquitous but divergent, with critical applications in domains ranging from software engineering to computational biology. However, it is challenging to analyze any reasonably large collection of graphs due to its high computational complexity. Development of scalable methods for the analysis of massive graph databases thus becomes one major thrust in data mining and database research.

At the core of many graph-related applications, there are two fundamental problems: how to mine graph patterns and how to process graph queries. My initial study revealed a tight connection between these two seemingly parallel areas. In this talk, I will first present a novel graph canonical labeling system that is able to speed up the discovery of frequent subgraph patterns. Next, discriminative pattern analysis is introduced for constructing compact yet high-quality graph indices, which are then applied to exact and approximate graph search in large graph databases. Such an index mechanism is shown to be very effective in processing graph queries. The finding of graph pattern-based indexing is profound and yet to be fully explored, since the same concept can also be applied to graph classification and clustering. At the end of my talk, I am going to examine broader applications of graph patterns, such as biological network analysis for functional annotation and program flow classification for software bug isolation.

Wednesday, March 1
SC3405
11 AM
Towards Natural Language Queries for Databases
H V Jagadish, Univ. of Michigan

Database query languages can be intimidating to the non-expert, leading to the immense recent popularity of keyword-based search in spite of its significant limitations. The holy grail has been the development of a natural language query interface.

Whereas there has been a great deal of work in natural language understanding and question answering, our focus is on what happens once a natural language query has been "understood". To be able to pose this query against a database, one still requires knowledge of the database schema, and a mapping of the query to this schema.

In this talk, I will present our work on Schema-Free XQuery, permitting the expression of complex queries against XML databases with minimal schema knowledge. I will then show how these schema-free querying facilities are being used in our ongoing work on NaLIX, a natural language query interface for an XML database.

Wednesday, March 8
SC3405
11 AM
Efficiently Integrating Heterogeneous Data: Collaboration, Automation, and Relaxation
Robert McCann, University of Illinois

Data integration systems provide a one-stop interface to multiple disparate data sources. Despite a lot of interest from both academia and industry over the last decade, these systems are still built and maintained in a manual, labor-intensive, and error-prone process. As a result, deployment has been limited. In this talk I will describe my work in the AIDA project at Illinois. Namely, I present three approaches to reduce integration costs: collaboration, automation, and relaxation. First I discuss the application of mass collaboration techniques to perform critical integration tasks at little cost to the system builder (reminiscent of open-source software efforts). I then discuss the use of automatic techniques (e.g. anomaly detection, machine learning) to reduce the high cost of system maintenance. Lastly, I describe initial work on relaxing one fundamental cause of integration costs - rigid structural requirements - resulting in IR-style best-effort integration scenarios. These techniques can help significantly reduce integration costs, promoting system deployment and offering solutions to build integration systems where not previously possible.

Wednesday, March 15
SC3405
11 AM
VLDB Deadline - No Seminar

Friday, March 17
SC3405
11 AM
Alan Stacklin - Chief Technology Officer, Akoya

Alan Stacklin has more than 20 years of experience in developing and commercializing software technologies. Stacklin has worked in the fields of business intelligence, data mining, and product management. His proven ability to translate market needs into commercial software applications complements Akoya's mission. Stacklin's successful business ventures include: ZDU.com, an online technology education platform, MetFAbCity.com, a supply chain automation exchange for the metal fabrication industry, and CPGNetwork.com, a business intelligence platform for the consumer packaged goods industry. His leading-edge software applications assisted in the growth of these companies.

Stacklin earned an MBA from the University of Rochester's William E. Simon Graduate School of Business, an MS from Union College and a BS from the University of Pittsburgh.

Wednesday, March 22
SC3405
11 AM
Spring Break - No Seminar

Wednesday, April 19
SC3405
11 AM
Mining Minimal Distinguishing Subsequence Patterns with Gap Constraints
Guozhu Dong, Wright State University

Discovering contrasts between collections of data is an important task in data mining. In this talk, we consider a new type of contrast pattern, called a Minimal Distinguishing Subsequence (MDS). An MDS is a minimal subsequence that occurs frequently in one class of sequences and infrequently in sequences of another class. It is a natural way of representing strong and succinct contrast information between two sequential datasets and can be useful in applications such as protein comparison, document comparison and building sequential classification models. Mining MDS patterns is a challenging task and is significantly different from mining contrasts between relational/transactional data. One particularly important type of constraint that can be integrated into the mining process is the gap constraint. We present an efficient algorithm called ConSGapMiner (Contrast Sequences with Gap Miner) to mine all MDSs satisfying a minimum and maximum gap constraint, plus a maximum length constraint. It employs highly efficient bitset and boolean operations for powerful gap-based pruning within a prefix growth framework. A performance evaluation with both sparse and dense datasets demonstrates the scalability of ConSGapMiner and shows its ability to mine patterns from high dimensional datasets at low supports.

This talk is based on a paper that received the 2005 IEEE ICDM Best Paper Award and was co-authored with Xiaonan Ji and James Bailey of the University of Melbourne.
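
To make the definition in the abstract above concrete, here is a minimal sketch of testing whether one candidate subsequence distinguishes two classes of sequences under a maximum gap constraint. It is not ConSGapMiner: it does not enumerate candidates, check minimality, or use bitset operations. The function names, the thresholds `min_pos` and `max_neg`, and the reading of "gap" as the number of skipped items between consecutive matched positions are all assumptions made for illustration.

```python
from typing import Sequence


def occurs_with_gap(pattern: str, seq: str, max_gap: int) -> bool:
    """True if `pattern` is a subsequence of `seq` in which at most
    `max_gap` items are skipped between consecutive matched items."""
    if not pattern:
        return True
    # Positions in seq where the prefix matched so far may end.
    ends = {i for i, item in enumerate(seq) if item == pattern[0]}
    for item in pattern[1:]:
        ends = {
            j
            for i in ends
            # j - i - 1 skipped items must not exceed max_gap
            for j in range(i + 1, min(len(seq), i + max_gap + 2))
            if seq[j] == item
        }
        if not ends:
            return False
    return bool(ends)


def support(pattern: str, seqs: Sequence[str], max_gap: int) -> float:
    """Fraction of sequences that contain the pattern under the gap limit."""
    return sum(occurs_with_gap(pattern, s, max_gap) for s in seqs) / len(seqs)


def is_distinguishing(pattern: str, pos: Sequence[str], neg: Sequence[str],
                      max_gap: int, min_pos: float, max_neg: float) -> bool:
    """Frequent in the positive class and infrequent in the negative class.
    Minimality (no distinguishing proper subsequence) is NOT checked here."""
    return (support(pattern, pos, max_gap) >= min_pos
            and support(pattern, neg, max_gap) <= max_neg)


pos = ["ACBD", "AXB", "AB"]   # hypothetical positive-class sequences
neg = ["AXXXB", "BA"]         # hypothetical negative-class sequences
print(is_distinguishing("AB", pos, neg, max_gap=1, min_pos=0.6, max_neg=0.2))
```

With `max_gap=1` the pattern `AB` matches every positive sequence and no negative one; loosening the limit to `max_gap=3` lets it match `AXXXB` as well, so the gap constraint itself changes which patterns count as contrasts.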

Wednesday, April 26
SC3405
11 AM
Scalable Continuous Query Processing and Result Dissemination
Jun Yang, Duke University

In contrast to traditional database queries that run once against a database snapshot, continuous queries continuously generate new results (or changes to results) as new data and updates arrive in streams. Many applications, e.g., publish/subscribe systems, need to handle a large number of long-standing continuous queries whose results are needed across a wide-area network. The naive approach, which checks each incoming data item against every query and sends a separate notification for each affected query, is not scalable. While there has been a considerable amount of work on continuous filters, more complex queries, such as joins and aggregates, are more challenging.

In this talk, I will first describe our recent results on processing a large number of continuous joins. Our techniques are input-sensitive: they exploit patterns in queries and data for efficient processing. We demonstrate the advantage of this approach over previously known techniques both theoretically and experimentally.

The second problem I will address is how to disseminate the results of many continuous queries efficiently to users over a network. Traditional solutions are either database- or network-centric, but we argue that there is a previously unexplored design space between these two extremes, and we show how to achieve better scalability by incorporating both database- and network-side considerations.
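
For reference, the naive approach that the abstract above calls unscalable can be sketched as follows. This illustrates only the baseline (every incoming item checked against every query, one notification per affected query), not the techniques presented in the talk. The name `naive_dispatch`, the dictionary-based items, and the example filter queries are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Item = Dict[str, float]
Query = Callable[[Item], bool]


def naive_dispatch(queries: Dict[str, Query],
                   stream: Iterable[Item]) -> List[Tuple[str, Item]]:
    """Check each incoming item against every registered query and emit
    a separate notification for each query the item affects."""
    notifications: List[Tuple[str, Item]] = []
    for item in stream:                       # one pass per incoming item
        for qid, predicate in queries.items():  # ...over every query
            if predicate(item):
                notifications.append((qid, item))
    return notifications


# Hypothetical continuous filter queries and a short input stream.
queries = {
    "cheap": lambda it: it["price"] < 10,
    "bulk": lambda it: it["qty"] >= 100,
}
stream = [
    {"price": 5, "qty": 200},
    {"price": 50, "qty": 1},
    {"price": 20, "qty": 150},
]
for qid, item in naive_dispatch(queries, stream):
    print(qid, item)
```

The work is proportional to the number of items times the number of queries, and the first item alone triggers two separate notifications, which is the cost the abstract refers to.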

Wednesday, May 3
SC3405
11 AM
Supporting Ad-hoc Ranking Aggregates
Chengkai Li, UIUC

This paper presents a principled framework for efficient processing of ad-hoc top-k (ranking) aggregate queries, which provide the k groups with the highest aggregates as results. Essential support of such queries is lacking in current systems, which process the queries in a naive materialize-group-sort scheme that can be prohibitively inefficient. Our framework is based on three fundamental principles. The Upper-Bound Principle dictates the requirements of early pruning, and the Group-Ranking and Tuple-Ranking Principles dictate group-ordering and tuple-ordering requirements. They together guide the query processor toward a provably optimal tuple schedule for aggregate query processing. We propose a new execution framework to apply the principles and requirements. We address the challenges in realizing the framework and implementing new query operators, enabling efficient group-aware and rank-aware query plans. The experimental study validates our framework by demonstrating orders of magnitude performance improvement in the new query plans, compared with the traditional plans.
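
As a point of comparison, the naive materialize-group-sort scheme mentioned in the abstract above can be sketched in a few lines. This shows the baseline only, not the proposed framework or its principles. The function name `topk_groups_naive`, the choice of SUM as the aggregate, and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple


def topk_groups_naive(rows: Iterable[Tuple[Hashable, float]],
                      k: int) -> List[Tuple[Hashable, float]]:
    """Return the k groups with the highest SUM aggregate.

    Every tuple is read and every group is fully aggregated before any
    ranking happens, so no group is pruned early."""
    # Materialize and group: aggregate all tuples of all groups.
    totals: Dict[Hashable, float] = defaultdict(float)
    for group, value in rows:
        totals[group] += value
    # Sort all groups by aggregate, then keep only the first k.
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


# Hypothetical (group, value) tuples.
rows = [("a", 3), ("b", 10), ("a", 4), ("c", 1), ("b", 2)]
print(topk_groups_naive(rows, k=2))
```

Even when k is small, this plan pays for aggregating and sorting every group, which is the inefficiency the abstract attributes to current systems.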