Schedule of DAIS Seminars

Spring 2005
3405 Siebel Center
Thursdays, 4:00-5:00


Coordinator:

Lars Olson
Contact: leolson1

Schedule:

  1. Jan. 20
    Topic: User-Centered Adaptive Information Retrieval
    Presenter: ChengXiang Zhai
    Abstract: A major limitation of existing retrieval models and systems is that the retrieval decision is, in general, based solely on the query and document collection; information about the actual user and the search context is largely ignored. This limitation makes the retrieval performance of existing IR systems inherently non-optimal, as seen clearly in the following two cases: (1) Different users may use exactly the same query to search for different information, but existing IR systems return the same results for these users. For example, the query "IR applications" on Google returns a mixture of documents about "information retrieval" applications and "infrared" applications, as "IR" can be an acronym for both information retrieval and infrared. Without considering the actual user it is inherently impossible to know which sense "IR" refers to. (2) A user's information needs may change over time. The same user may sometimes use "java" to mean the Java island and some other times use "java" to mean the programming language. Without recognizing the search context, it would be again inherently impossible to recognize the correct sense. Clearly, an optimal retrieval system must incorporate both user information and search context into the retrieval decision process.

    The UCAIR project seeks to break this limitation of the existing retrieval methods and formally develop a new retrieval strategy called user-centered adaptive information retrieval (UCAIR), in which user information and search context are both exploited to improve retrieval performance. We emphasize client-side personalization, which helps alleviate the problem of privacy. Our main idea is to develop a client-side search agent that can collect implicit feedback information (e.g., previous queries, clickthrough information) and exploit such information to customize and improve search results from a general search engine.

    In this talk, I will discuss the UCAIR project, and in particular, present some recent progress that two DAIS students (Xuehua Shen and Bin Tan) have made in studying the UCAIR search strategy. I will make a demo of the UCAIR toolbar that they developed and show how it can use the previous queries and clickthrough information to immediately improve ranking results from Google. More information about the UCAIR project and toolbar is available at http://sifaka.cs.uiuc.edu/ir/ucair/.

  2. Jan. 27
    Topic: IncSpan: Incremental Mining of Sequential Patterns in Large Database (KDD'04)
    Presenter: Hong Cheng
    Abstract: Many real life sequence databases, such as customer shopping sequences, medical treatment sequences, etc., grow incrementally. It is undesirable to mine sequential patterns from scratch each time when a small set of sequences grow, or when some new sequences are added into the database. Incremental algorithm should be developed for sequential pattern mining so that mining can be adapted to frequent and incremental database updates, including both insertions and deletions. However, it is nontrivial to mine sequential patterns incrementally, especially when the existing sequences grow incrementally because such growth may lead to the generation of many new patterns due to the interactions of the growing subsequences with the original ones. In this study, we develop an efficient algorithm, IncSpan, for incremental mining of sequential patterns, by exploring some interesting properties. Our performance study shows that IncSpan outperforms some previously proposed incremental algorithms as well as a non incremental one with a wide margin.

  3. Feb. 3 room change: 2405 SC
    Topic: Research Overview
    Presenter: Sunil Prabhakar (Purdue)
    Abstract: This talk will provide a very high-level summary of a number of different research projects that the speaker is currently involved in. The talk will revolve around three main directions: data management for sensor and moving objects databases, privacy and rights protection for databases, and database support for biological data. The focus of the work presented will be on the database challenges in these areas that are currently being explored.

  4. Feb. 10
    Topic: The AIM Project
    Presenter: Seung-won Hwang
    Abstract: This talk presents the two recent publications (ICDE 2005) of the AIM project. The AIM project aims at supporting ranking for data retrieval. Toward this goal, there are two major challenges, as each of the two papers addresses respectively. First, rank formulation: For ranking to be useful, what is the right mechanism to express their desired ranking? Second, rank processing: For ranking to be efficient in a wide range of access scenarios, what is the unified framework to optimize response time?

    In this talk, I will discuss our developments, in particular, our learner-based rank formulation front-end and unified cost-based rank processing framework. More information about the AIM project is available at http://aim.cs.uiuc.edu/.

  5. Feb. 17
    Topic: Scalable Continuous Query Processing in Location-aware Database Servers
    Presenter: Mohamed Mokbel (Purdue)
    Abstract: The wide spread use of cellular phones, handheld devices, and GPS-like technology enables location-aware environments where virtually all objects are aware of their locations. Location-aware environments and location-aware services are characterized by the large number of moving objects and large number of continuous moving queries (also known as spatio-temporal queries). Such environments call for new query processing techniques that deal with the continuous movement and frequent updates of both spatio-temporal objects and spatio-temporal queries. This talk presents a novel paradigm and algorithms for efficient processing and scalable execution of continuous spatio-temporal queries in pipelined query execution plans. The proposed paradigm and algorithms are implemented in the context of the PLACE server (Pervasive Location-Aware Computing Environments); a data stream management system that provides support for both spatio-temporal queries and spatio-temporal data. Within the PLACE server, I will highlight several challenges that I address in my research including managing the scarce memory resource, uncertainty areas of moving queries, and self-adapting the system behavior to intervals of high workload.

  6. Feb. 24 Note time change: 3:00
    Topic: Automatic Obligation Enforcement for Privacy Policy Compliance
    Presenter: Mukesh Mohania (IBM)
    Abstract: It is becoming increasingly important for enterprises to have a well-defined privacy policy, to establish customer trust and to prevent misuse of privacy data and avoid litigation. Many standards exist for the publication of the privacy policy of an enterprise, which may be more complex than a simple 'allow' or 'deny' rule. The privacy policy may have rules that specify obligations to be executed in the case of certain data access. Currently these obligations are executed manually, with the inherent defects of not being scaleable or auditable. In this paper, we discuss the architecture and technology for the automated execution of the obligations associated with a privacy policy that has been developed at IBM India Research Lab. We also present a prototype solution for the obligation enforcement, using IBM Content Manager as the data repository and IBM Record Manager for the obligation enforcement. The system also logs audit information and generates audit trails, to support auditing of the obligation enforcements. We also present a generic architecture for obligation execution associated with different kinds of policies.

  7. Mar. 3
    Topic:
    Presenter:
    Abstract:

  8. Mar. 10
    Topic: Privacy Preserving Data Mining on Horizontally Partitioned Data
    Presenter: Murat Kantarcioglu (Purdue)
    Abstract: Data mining can extract important knowledge from large data collections -- but sometimes these collections are split among various parties. Privacy concerns may prevent the parties from directly sharing the data, and some types of information about the data. My talk addresses secure mining of data over horizontally partitioned data. The methods that will be discussed incorporate cryptographic techniques to minimize the information shared, while adding little overhead to the mining task. Under reasonable assumptions techniques presented are proven to reveal nothing other than the final data mining result. I will also address the related privacy issues including how to use learned data mining models securely, and the potential privacy effect of the data mining results.

  9. Mar. 17
    Topic: Mining Patterns from Protein Structures
    Presenter: Wei Wang (UNC)
    Abstract: One of the next great frontiers in molecular biology is to understand, and predict protein function. Proteins are simple linear chains of polymerized amino acids (residues) whose biological functions are determined by the three-dimensional shapes that they fold into. Hence, understanding proteins requires a unique combination of chemical and geometric analysis. A popular approach to understanding proteins is to break them down into structural sub-components called motifs. Motifs are recurring structural and spatial units that are frequently correlated with specific protein functions. Traditionally, the discovery of motifs has been a laborious task of scientific exploration.

    In this talk, I will discuss recent data-mining algorithms that we have developed for automatically identifying potential spatial motifs. Our methods automatically find frequently occurring substructures within graph-based representations of proteins. We represent each protein's structure as a graph, where vertices correspond to residues. Two types of edges connect residues: sequence edges connect pairs of adjacent residues in the primary sequence, and proximity edges represent physical distances, which are indicative of intra-molecular interactions. Such interactions are believed to be key indicators of the protein's function.

    This representation allows us to apply innovative graph mining techniques to explore protein databases and associated protein families. The complexity of protein structures and corresponding graphs poses significant computational challenges. The kernel of our approach is an efficient subgraph-mining algorithm that detects all (maximal) frequent subgraphs from a graph database with a user-specified minimal frequency. Our algorithm uses the pattern growth paradigm with an efficient depth-first enumeration scheme, searching through the graph space for frequent subgraphs. Our most recent algorithms incorporate several improvements that take into account specific properties of protein structures.

  10. Mar. 24
    No seminar -- spring break

  11. Mar. 30 Note date change: Wed. Mar. 30 at 11am
    Topic: Exporting and Utilizing Database Interfaces on the Web
    Presenter: Michalis Petropoulos (University of California, San Diego)
    Abstract: Database interfaces define the way database functionality is exported to and utilized by end users, developers and programs. Today's prominent architectures, such as publishing, integration and service-oriented, demand capable interfaces and a higher degree of database functionality utilization in order to realize their potential.

    In service-oriented architectures, applications need to provide integrated access to the data of multiple sources, which restrict the access to their content by exporting only a limited set of supported queries as web services. A mediator offers an integrated view of the underlying sources by aggregating their exported services. In such scenarios, applications pose queries against the integrated view and it follows that not all queries can be answered, since the sources export limited query capabilities. To avoid sending an application developer into a frustrating trial-and-error loop, I developed an interactive query formulation interface called CLIDE, which allows developers to build only those queries which are executable by the mediator. The interface employs a color scheme to visually guide the developer, who does not need to be aware of the individual data sources and their limited query capabilities.

    In addition, applications need to publish powerful interfaces to end users who query and browse the underlying information sources. Such interfaces require extensive coding and are expensive to maintain. I developed the QURSED system, which semi-automatically generates web-based query form and report pages for semistructured XML data. The system drives the generation of the pages from the XML Schema describing the structure of the underlying data and offers to developers an authoring tool that does not require any coding. The resulting forms and reports encode large sets of parameterized queries and are powerful in the sense that heterogeneity, nesting and optionality are tackled declaratively. The system reduces the cost of developing and maintaining web applications by decoupling the query aspects from the visual ones.

  12. Apr. 7
    Topic: CP-Miner: A Tool for Finding Copy-Paste and Related Bugs in Operating System Code
    Presenter: Zhenmin Li
    Abstract: Copy-pasted code is very common in large software because programmers prefer reusing code via copy-paste in order to reduce programming effort. Recent studies show that copy-paste is prone to introducing bugs and a significant portion of operating system bugs concentrate in copy-pasted code. Unfortunately, it is challenging to efficiently identify copy-pasted code in large software. Existing copy-paste detection tools are either not scalable to large software, or cannot handle small modifications in copy-pasted code. Furthermore, few tools are available to detect copy-paste related bugs. In this paper we propose a tool, CP-Miner, that uses data mining techniques to efficiently identify copy-pasted code in large software including operating systems, and detects copy-paste related bugs. Specifically, it takes less than 20 minutes for CP-Miner to identify 190,000 copy-pasted segments in Linux and 150,000 in FreeBSD. Moreover, CP-Miner has detected 28 copy-paste related bugs in the latest version of Linux and 23 in FreeBSD. In addition, we analyze some interesting characteristics of copy-paste in Linux and FreeBSD, including the distribution of copy-pasted code across different length, granularity, modules, degrees of modification, and various software versions.

  13. Apr. 14
    Topic: Query Algebra and Optimization for Relational Top-k Queries (accepted for SIGMOD 2005)
    Presenter: Chengkai Li
    Abstract: We introduce a systematic and principled framework, by extending relational algebra and query optimization to support efficient evaluations of ranking (top-k) queries in relational database systems (RDBMS). Previously, top-k query processing is studied in the middleware scenario or in RDBMS in a "piecemeal" fashion, i.e., focusing on specific operator or sitting outside the core of query engines. In contrast, we aim to support ranking as a first-class database construct. As a key insight, ranking relationship can be viewed as another logical property of data, parallel to the "membership" property of relational data model. While membership is essentially supported in RDBMS, the same support for ranking is clearly lacking. We address the fundamental integration of ranking in RDBMS in a way similar to how membership, i.e., Boolean query, is supported. We extend relational algebra by proposing rank-relation model to capture the ranking property, and introducing new and extended operators to support ranking as a first-class construct. Enabled by the extended algebra, we present a pipelined and incremental execution model of ranking query plans (that cannot be expressed traditionally) based on a fundamental ranking principle. To optimize top-k queries, we propose a dimensional enumeration algorithm to explore the extended plan space by enumerating plans along two dual dimensions: ranking and membership. We also propose a sampling-based cardinality estimation method to estimate the cardinality of rank-aware operators, for costing plans. Our experiments show the validity of our framework and the accuracy of the proposed estimation model.

  14. Apr. 21 Note time change: 3-4pm
    Topic: Tesla: On Demand Information Systems
    Presenter: Shankar Raman (IBM Almaden)
    Abstract: Information systems become complex and brittle as the number of components increases. The Tesla project is building middleware to give application programs flexible and predictable access to a large numbers of autonomous and heterogeneous data sources and compute resources. We provide flexibility for data access through a novel query formulation and query processing scheme that gives the abstraction of a highly available and performant virtual data store over logical data domains. This abstraction masks from application programs events such as addition or removal of data sources, subsitution of replicas for performance and availability, and source unavailability, by converting them into degradations (or improvements) in the information quality. We also aim to provide predictable quality of service for access to this virtual data store, without relying on dedicated resources. Instead, Tesla uses a variety of resource managers to dynamically provision compute resources as needed, to meet the quality of service (QoS) needs of the application workload.

    In this talk I will present some of the work we have done in Tesla, as well as an overview of some of the other projects in our group.

  15. Apr. 28
    Topic: Implementing XQuery 1.0: The Story of Galax
    Presenter: Mary Fernandez (AT&T)
    Abstract: XQuery 1.0 and its sister language XPath 2.0 have set a fire underneath database vendors and researchers alike. More than thirty commercial and research XQuery implementations are listed on the XML Query working group home page.

    Galax (www.galaxquery.org) is an open-source, general-purpose XQuery engine, designed to be complete, efficient, and extensible. During Galax's development, we have focused on each of these three requirements in turn, while never losing sight of the other two. In this talk, I will describe how these requirements have impacted Galax's evolution and our own research interests. Along the way, I will show how Galax's architecture supports these three requirements.

    Galax is joint work with Jerome Simeon, IBM T.J. Watson Research Center.

To schedule: