Schedule of DAIS Seminars
Spring 2005
3405 Siebel Center
Thursdays, 4:00-5:00
Coordinator:
Lars Olson
Contact: leolson1
Schedule:
- Jan. 20
Topic: User-Centered Adaptive Information Retrieval
Presenter: ChengXiang Zhai
Abstract: A major limitation of existing retrieval models and systems
is that the retrieval decision is, in general, based solely on the query and
document collection; information about the actual user and the search context
is largely ignored. This limitation makes the retrieval performance of
existing IR systems inherently non-optimal, as seen clearly in the following
two cases: (1) Different users may use exactly the same query to search for
different information, but existing IR systems return the same results for
these users. For example, the query "IR applications" on Google
returns a mixture of documents about "information retrieval"
applications and "infrared" applications, as "IR" can be
an acronym for both information retrieval and infrared. Without considering
the actual user it is inherently impossible to know which sense "IR"
refers to. (2) A user's information needs may change over time. The same user
may sometimes use "java" to mean the Java island and some other
times use "java" to mean the programming language. Without
recognizing the search context, it would be again inherently impossible to
recognize the correct sense. Clearly, an optimal retrieval system must
incorporate both user information and search context into the retrieval
decision process.
The UCAIR project seeks to break this limitation of the existing retrieval
methods and formally develop a new retrieval strategy called user-centered
adaptive information retrieval (UCAIR), in which user information and search
context are both exploited to improve retrieval performance. We emphasize
client-side personalization, which helps alleviate the problem of privacy. Our
main idea is to develop a client-side search agent that can collect implicit
feedback information (e.g., previous queries, clickthrough information) and
exploit such information to customize and improve search results from a general
search engine.
In this talk, I will discuss the UCAIR project, and in particular, present
some recent progress that two DAIS students (Xuehua Shen and Bin Tan) have
made in studying the UCAIR search strategy. I will make a demo of the UCAIR
toolbar that they developed and show how it can use the previous queries and
clickthrough information to immediately improve ranking results from Google.
More information about the UCAIR project and toolbar is available at
http://sifaka.cs.uiuc.edu/ir/ucair/.
- Jan. 27
Topic: IncSpan: Incremental Mining of Sequential Patterns in Large Database (KDD'04)
Presenter: Hong Cheng
Abstract: Many real life sequence databases, such as customer shopping
sequences, medical treatment sequences, etc., grow incrementally. It is
undesirable to mine sequential patterns from scratch each time when a small
set of sequences grow, or when some new sequences are added into the database.
Incremental algorithm should be developed for sequential pattern mining so that
mining can be adapted to frequent and incremental database updates, including
both insertions and deletions. However, it is nontrivial to mine sequential
patterns incrementally, especially when the existing sequences grow
incrementally because such growth may lead to the generation of many new
patterns due to the interactions of the growing subsequences with the original
ones. In this study, we develop an efficient algorithm, IncSpan, for
incremental mining of sequential patterns, by exploring some interesting
properties. Our performance study shows that IncSpan outperforms some
previously proposed incremental algorithms as well as a non incremental one
with a wide margin.
- Feb. 3 room change: 2405 SC
Topic: Research Overview
Presenter: Sunil Prabhakar (Purdue)
Abstract: This talk will provide a very high-level summary of a number
of different research projects that the speaker is currently involved in. The
talk will revolve around three main directions: data management for sensor
and moving objects databases, privacy and rights protection for databases, and
database support for biological data. The focus of the work presented will be
on the database challenges in these areas that are currently being explored.
- Feb. 10
Topic: The AIM Project
Presenter: Seung-won Hwang
Abstract: This talk presents the two recent publications (ICDE 2005)
of the AIM project. The AIM project aims at supporting ranking for data
retrieval. Toward this goal, there are two major challenges, as each of the
two papers addresses respectively. First, rank formulation: For ranking to
be useful, what is the right mechanism to express their desired ranking?
Second, rank processing: For ranking to be efficient in a wide range of
access scenarios, what is the unified framework to optimize response time?
In this talk, I will discuss our developments, in particular, our
learner-based rank formulation front-end and unified cost-based rank
processing framework. More information about the AIM project is available at
http://aim.cs.uiuc.edu/.
- Feb. 17
Topic: Scalable Continuous Query Processing in Location-aware Database Servers
Presenter: Mohamed Mokbel (Purdue)
Abstract: The wide spread use of cellular phones, handheld devices, and GPS-like
technology enables location-aware environments where virtually all objects
are aware of their locations. Location-aware environments and location-aware
services are characterized by the large number of moving objects and large
number of continuous moving queries (also known as spatio-temporal queries).
Such environments call for new query processing techniques that deal with
the continuous movement and frequent updates of both spatio-temporal objects
and spatio-temporal queries. This talk presents a novel paradigm and
algorithms for efficient processing and scalable execution of continuous
spatio-temporal queries in pipelined query execution plans. The proposed
paradigm and algorithms are implemented in the context of the PLACE server
(Pervasive Location-Aware Computing Environments); a data stream management
system that provides support for both spatio-temporal queries and
spatio-temporal data. Within the PLACE server, I will highlight several
challenges that I address in my research including managing the scarce
memory resource, uncertainty areas of moving queries, and self-adapting the
system behavior to intervals of high workload.
- Feb. 24 Note time change: 3:00
Topic: Automatic Obligation Enforcement for Privacy Policy Compliance
Presenter: Mukesh Mohania (IBM)
Abstract: It is becoming increasingly important for enterprises to
have a well-defined privacy policy, to establish customer trust and to prevent
misuse of privacy data and avoid litigation. Many standards exist for the
publication of the privacy policy of an enterprise, which may be more
complex than a simple 'allow' or 'deny' rule. The privacy policy may have
rules that specify obligations to be executed in the case of certain data
access. Currently these obligations are executed manually, with the
inherent defects of not being scaleable or auditable. In this paper, we
discuss the architecture and technology for the automated execution of the
obligations associated with a privacy policy that has been developed at IBM
India Research Lab. We also present a prototype solution for the obligation
enforcement, using IBM Content Manager as the data repository and IBM
Record Manager for the obligation enforcement. The system also logs audit
information and generates audit trails, to support auditing of the
obligation enforcements. We also present a generic architecture for
obligation execution associated with different kinds of policies.
- Mar. 3
Topic:
Presenter:
Abstract:
- Mar. 10
Topic: Privacy Preserving Data Mining on Horizontally Partitioned Data
Presenter: Murat Kantarcioglu (Purdue)
Abstract: Data mining can extract important knowledge from large data
collections -- but sometimes these collections are split among various
parties. Privacy concerns may prevent the parties from directly
sharing the data, and some types of information about the data. My
talk addresses secure mining of data over horizontally partitioned
data. The methods that will be discussed incorporate cryptographic
techniques to minimize the information shared, while adding little
overhead to the mining task. Under reasonable assumptions techniques
presented are proven to reveal nothing other than the final data
mining result. I will also address the related privacy issues
including how to use learned data mining models securely, and the
potential privacy effect of the data mining results.
- Mar. 17
Topic: Mining Patterns from Protein Structures
Presenter: Wei Wang (UNC)
Abstract: One of the next great frontiers in molecular biology is to
understand, and predict protein function. Proteins are simple linear chains
of polymerized amino acids (residues) whose biological functions are
determined by the three-dimensional shapes that they fold into. Hence,
understanding proteins requires a unique combination of chemical and geometric
analysis. A popular approach to understanding proteins is to break them down
into structural sub-components called motifs. Motifs are recurring structural
and spatial units that are frequently correlated with specific protein
functions. Traditionally, the discovery of motifs has been a laborious task
of scientific exploration.
In this talk, I will discuss recent data-mining algorithms that we have
developed for automatically identifying potential spatial motifs. Our methods
automatically find frequently occurring substructures within graph-based
representations of proteins. We represent each protein's structure as a
graph, where vertices correspond to residues. Two types of edges connect
residues: sequence edges connect pairs of adjacent residues in the primary
sequence, and proximity edges represent physical distances, which are
indicative of intra-molecular interactions. Such interactions are believed to
be key indicators of the protein's function.
This representation allows us to apply innovative graph mining techniques to
explore protein databases and associated protein families. The complexity of
protein structures and corresponding graphs poses significant computational
challenges. The kernel of our approach is an efficient subgraph-mining
algorithm that detects all (maximal) frequent subgraphs from a graph database
with a user-specified minimal frequency. Our algorithm uses the pattern
growth paradigm with an efficient depth-first enumeration scheme, searching
through the graph space for frequent subgraphs. Our most recent algorithms
incorporate several improvements that take into account specific properties
of protein structures.
- Mar. 24
No seminar -- spring break
- Mar. 30 Note date change: Wed. Mar. 30 at 11am
Topic: Exporting and Utilizing Database Interfaces on the Web
Presenter: Michalis Petropoulos (University of California, San Diego)
Abstract: Database interfaces define the way database functionality is exported to and utilized by end users, developers and programs. Today's prominent architectures, such as publishing, integration and service-oriented, demand capable interfaces and a higher degree of database functionality utilization in order to realize their potential.
In service-oriented architectures, applications need to provide integrated access to the data of multiple sources, which restrict the access to their content by exporting only a limited set of supported queries as web services. A mediator offers an integrated view of the underlying sources by aggregating their exported services. In such scenarios, applications pose queries against the integrated view and it follows that not all queries can be answered, since the sources export limited query capabilities. To avoid sending an application developer into a frustrating trial-and-error loop, I developed an interactive query formulation interface called CLIDE, which allows developers to build only those queries which are executable by the mediator. The interface employs a color scheme to visually guide the developer, who does not need to be aware of the individual data sources and their limited query capabilities.
In addition, applications need to publish powerful interfaces to end users who query and browse the underlying information sources. Such interfaces require extensive coding and are expensive to maintain. I developed the QURSED system, which semi-automatically generates web-based query form and report pages for semistructured XML data. The system drives the generation of the pages from the XML Schema describing the structure of the underlying data and offers to developers an authoring tool that does not require any coding. The resulting forms and reports encode large sets of parameterized queries and are powerful in the sense that heterogeneity, nesting and optionality are tackled declaratively. The system reduces the cost of developing and maintaining web applications by decoupling the query aspects from the visual ones.
- Apr. 7
Topic: CP-Miner: A Tool for Finding Copy-Paste and Related Bugs in Operating System Code
Presenter: Zhenmin Li
Abstract: Copy-pasted code is very common in large software because programmers prefer
reusing code via copy-paste in order to reduce programming effort. Recent
studies show that copy-paste is prone to introducing bugs and a significant
portion of operating system bugs concentrate in copy-pasted code.
Unfortunately, it is challenging to efficiently identify copy-pasted code in
large software. Existing copy-paste detection tools are either not scalable
to large software, or cannot handle small modifications in copy-pasted code.
Furthermore, few tools are available to detect copy-paste related bugs.
In this paper we propose a tool, CP-Miner, that uses data mining techniques
to efficiently identify copy-pasted code in large software including
operating systems, and detects copy-paste related bugs. Specifically, it
takes less than 20 minutes for CP-Miner to identify 190,000 copy-pasted
segments in Linux and 150,000 in FreeBSD. Moreover, CP-Miner has detected 28
copy-paste related bugs in the latest version of Linux and 23 in FreeBSD. In
addition, we analyze some interesting characteristics of copy-paste in Linux
and FreeBSD, including the distribution of copy-pasted code across different
length, granularity, modules, degrees of modification, and various software
versions.
- Apr. 14
Topic: Query Algebra and Optimization for Relational Top-k Queries
(accepted for SIGMOD 2005)
Presenter: Chengkai Li
Abstract: We introduce a systematic and principled framework, by
extending relational algebra and query optimization to support
efficient evaluations of ranking (top-k) queries in relational
database systems (RDBMS). Previously, top-k query processing
is studied in the middleware scenario or in RDBMS in a
"piecemeal" fashion, i.e., focusing on specific operator or
sitting outside the core of query engines. In contrast, we aim
to support ranking as a first-class database construct. As a
key insight, ranking relationship can be viewed as another
logical property of data, parallel to the "membership"
property of relational data model. While membership is
essentially supported in RDBMS, the same support for ranking
is clearly lacking. We address the fundamental integration of
ranking in RDBMS in a way similar to how membership, i.e.,
Boolean query, is supported. We extend relational algebra by
proposing rank-relation model to capture the ranking property,
and introducing new and extended operators to support ranking
as a first-class construct. Enabled by the extended algebra,
we present a pipelined and incremental execution model of
ranking query plans (that cannot be expressed traditionally)
based on a fundamental ranking principle. To optimize top-k
queries, we propose a dimensional enumeration algorithm to
explore the extended plan space by enumerating plans along two
dual dimensions: ranking and membership. We also propose a
sampling-based cardinality estimation method to estimate the
cardinality of rank-aware operators, for costing plans. Our
experiments show the validity of our framework and the
accuracy of the proposed estimation model.
- Apr. 21 Note time change: 3-4pm
Topic: Tesla: On Demand Information Systems
Presenter: Shankar Raman (IBM Almaden)
Abstract: Information systems become complex and brittle as the number of components
increases. The Tesla project is building middleware to give application
programs flexible and predictable access to a large numbers of autonomous
and heterogeneous data sources and compute resources. We provide
flexibility for data access through a novel query formulation and query
processing scheme that gives the abstraction of a highly available and
performant virtual data store over logical data domains. This abstraction
masks from application programs events such as addition or removal of data
sources, subsitution of replicas for performance and availability, and
source unavailability, by converting them into degradations (or
improvements) in the information quality. We also aim to provide
predictable quality of service for access to this virtual data store,
without relying on dedicated resources. Instead, Tesla uses a variety of
resource managers to dynamically provision compute resources as needed, to
meet the quality of service (QoS) needs of the application workload.
In this talk I will present some of the work we have done in Tesla, as well
as an overview of some of the other projects in our group.
- Apr. 28
Topic: Implementing XQuery 1.0: The Story of Galax
Presenter: Mary Fernandez (AT&T)
Abstract: XQuery 1.0 and its sister language XPath 2.0 have set a fire
underneath database vendors and researchers alike. More than thirty
commercial and research XQuery implementations are listed on the XML
Query working group home page.
Galax (www.galaxquery.org) is an open-source, general-purpose XQuery
engine, designed to be complete, efficient, and extensible. During
Galax's development, we have focused on each of these three
requirements in turn, while never losing sight of the other two. In
this talk, I will describe how these requirements have impacted
Galax's evolution and our own research interests. Along the way,
I will show how Galax's architecture supports these three requirements.
Galax is joint work with Jerome Simeon, IBM T.J. Watson Research
Center.
To schedule:
- Stephen Downie (GSLIS)
- Isabel Cruz (UIC)
- Ouri Wolfson (UIC)
- Chip Bruce/Caroline Haythornthwaite (GSLIS) (next semester)
- Qingbo Zhu