Schedule of DAIS Seminars

Fall 2004
3405 Siebel Center
Thursdays, 3:15-4:15


Coordinator:

Lars Olson
Contact: leolson1

Schedule:

  1. Sept. 9
    Topic: DAIS Overview
    Presenter: DAIS Faculty

  2. Sept. 16
    Topic: Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach
    Presenter: Bin He
    Abstract: To enable information integration, schema matching is a critical step for discovering semantic correspondences of attributes across heterogeneous sources. While complex matchings are common, because of their far more complex search space, most existing techniques focus on simple 1:1 matchings. To tackle this challenge, this paper takes a conceptually novel approach by viewing schema matching as correlation mining, for our task of matching Web query interfaces to integrate the myriad databases on the Internet. On this deep Web, query interfaces generally form complex matchings between attribute groups (e.g., author corresponds to first name, last name in the Books domain). We observe that the cooccurrences patterns across query interfaces often reveal such complex semantic relationships: grouping attributes (e.g., first name, last name) tend to be co-present in query interfaces and thus positively correlated. In contrast, synonym attributes are negatively correlated because they rarely co-occur. This insight enables us to discover complex matchings by a correlation mining approach. In particular, we develop the DCM framework, which consists of data preparation, dual mining of positive and negative correlations, and finally matching selection. Unlike previous correlation mining algorithms, which mainly focus on finding strong positive correlations, our algorithm cares both positive and negative correlations, especially the subtlety of negative correlations, due to its special importance in schema matching. This leads to the introduction of a new correlation measure, H-measure, distinct from those proposed in previous work. We evaluate our approach extensively and the results show good accuracy for discovering complex matchings.

  3. Sept. 23 No meeting (qual exams)

  4. Sept. 30
    Topic: Managing Uncertainty in Moving-Object and Sensor Databases
    Presenter: Reynold Cheng (Purdue)
    Abstract: In a moving-object database system, locations of objects are constantly reported to the database. These location values are subsequently used to answer user queries. Due to continuous changes in locations, as well as limited resources (e.g., network bandwidth and battery power), it is infeasible for the database to keep track of the actual location of every moving object. Queries that use the stale values provided by the database can produce incorrect answers. However, if the degree of uncertainty between the actual location value and the database value is limited, one can place more confidence in answers to the queries. More generally, query answers can be augmented with probabilistic guarantees of the validity of the answers. To answer these probabilistic queries, different solutions are required, depending on the moving pattern of an object, as well as the nature of the query. We investigate the interesting issue of modeling "uncertainty" for a moving-object. Based on the uncertainty models, I will discuss how uncertain location data can be queried, where algorithms for range queries and nearest-neighbor queries will be presented.

    We observe that uncertainty management techniques for a moving-object database can be generalized to a vast class of sensor-based applications, which typically involve the monitoring of continuously changing entities (e.g., temperature and pressure). I will talk about a taxonomy of uncertainty models for general sensor data. These classes of models describe uncertainty in different levels of precision. Next, we will present a classification scheme for probabilistic queries for uncertain sensor data, and briefly discuss how probabilistic queries are executed in each class. We also examine the important issue of measuring the quality of answers to probabilistic queries.

  5. Oct. 7 (tentatively at 11-12 instead of 3:15-4:15)
    Topic: FATES: Automatically-tuned Database Storage Management
    Presenter: Natassa Ailamaki (Carnegie Mellon)
    Abstract: Database system performance heavily depends on data access efficiency on all levels of the memory hierarchy. CPU cache performance depends on in-memory data layout, whereas I/O performance depends on data placement on the disks. Throughout the memory hierarchy, however, the workload behavior is a function of the access patterns the application dictates. On-line transaction processing applications, for instance, tend to access full data records in a random fashion, whereas decision-support systems typically scan large tables sequentially, reading small fractions of the records. Unfortunately, current database storage systems are static and fragile. Low-level system parameters are fully configured by human administrators, whereas there is typically a single data organization used at all levels of the memory hierarchy, and data organization is specialized to a single workload. This results in frequent errors, suboptimal performance, and inability to take advantage of the storage technology at its full potential.

    This talk describes Fates, a dynamic, robust, and automated system for database storage management. Borrowing from the Greek mythology, Fates includes three components that establish proper abstractions in the database storage system: Clotho decouples in-memory data layout from on-disk storage layout, providing the opportunity to design efficient data placement at each storage level separately. Lachesis provides device-specific hints and automatically specializes the system to extracted device characteristics. Finally, Atropos provides logical-to-physical volume mapping and arranges storage to offer robust performance across workloads. The talk will describe the Fates system design and implementation, and will demonstrate its efficiency through experimental results with application benchmarks.

  6. Oct. 14
    Topic: DNA Sequencing-by-Hybridization with Gapped-Probes in the Presence of Hybridization Noise
    Presenter: Hon Wai Leong (National University of Singapore)
    Abstract: DNA sequencing-by-hybridization (SBH) is a powerful potential alternative to current sequencing by electrophoresis. Different SBH methods have been compared under the hypothesis of error-free hybridization. However both false negatives and false positive are likely to occur in practice. In this talk, we describe how to successfully adapt the sequence reconstruction algorithms of Preparata and Upfal [PU00], (which are asymptotically optimal in the error-free case, to noisy conditions with hybridization errors. Our algorithm exhibits a graceful degradation of performance as the error-rate increases. As a downside, the computational cost of sequence reconstruction rises noticeably under noisy conditions. We also describe recent work that addresses this shortcoming using an adaptive approach which is both effective and efficient.

  7. Oct. 18 (Monday, departmental seminar)
    Topic: Aggregation Queries at Streaming Speeds
    Presenter: Divesh Srivastava (AT&T Labs)
    Abstract:Measuring and monitoring complex dynamic phenomena, such as IP network traffic evolution, produces highly detailed and voluminous data streams. The monitoring applications that analyze these massive data streams require sophisticated, user-defined aggregation queries (which summarize the data distributions in these streams) to identify, for example, anomalous activity.

    Motivated by the nature of IP network traffic data, we present two families of aggregation queries: hierarchical heavy hitters (hierarchically organized large-valued regions) and biased quantiles (totally ordered skewed ranks). We present deterministic algorithms to approximate these aggregates in a single pass over the data stream. We study the performance implications of using user-defined aggregate functions (UDAFs) to incorporate such algorithms into the Gigascope data stream management system's query processing architecture. Our experimental results highlight the importance of fast, space-efficient, non-blocking implementations to make them useful in practice.

  8. Oct. 21 (no meeting, attend seminar on the 18th instead)

  9. Oct. 28
    Topic: Graph Indexing: A Frequent Structure-based Approach (presented at SIGMOD 04)
    Presenter: Xifeng Yan
    Abstract: Graph has become increasingly important in modelling complicated structures and schemaless data. In this presentation, I will introduce the issues of indexing graphs and our indexing model based on discriminative frequent structures that are identified through a graph mining process. We show that the compact index built under this model can achieve better performance in processing graph queries. Since discriminative frequent structures capture the intrinsic characteristics of the data, they are relatively stable to database updates, thus facilitate sampling-based feature extraction and incremental index maintenance. Through our study, we demonstrate how database indexing and query processing can benefit from data mining, especially frequent pattern mining. Furthermore, the concepts developed here can be generalized and applied to indexing sequences, trees, and other complicated structures as well.

  10. Nov. 4 (joint DB/AI seminar, Room 2405 SC)
    Topic: Link Mining
    Presenter: Lise Getoor (University of Maryland)
    Abstract: A key challenge for data mining is tackling the problem of mining richly structured datasets, where the objects are linked in some way. Links among the objects may demonstrate certain patterns, which can be helpful for many data mining tasks and are usually hard to capture with traditional statistical models. Recently there has been a surge of interest in this area, fueled largely by interest in web and hypertext mining, but also by interest in mining social networks, security and law enforcement data, bibliographic citations and epidemiological records. Link mining includes both descriptive and predictive modeling of link data. Classification and clustering in linked relational domains require new data mining models and algorithms. Furthermore, with the introduction of links, new predictive tasks come to light. Examples include predicting the numbers of links, predicting the type of link between two objects, inferring the existence of a link, inferring the identity of an object, finding co-references, and discovering subgraph patterns. In this talk, I will give an overview of this newly emerging research area. I will describe novel aspects of the modeling, learning and inference tasks and I will give an introduction to a few of the many proposed frameworks. I will spend the majority of the time discussing commonalities in the issues that must be address in any statistical link mining framework.

  11. Nov. 11
    Topic: Audio Segment Retrieval Using a Short Duration Example Query (presented at ICME 04)
    Presenter: Atulya Velivelli (Dept. of Electrical and Computer Engineering)
    Abstract: In this paper, we propose a general approach to audio segment retrieval using a synthesized HMM. The approach allows a user to query audio data by an example audio segment of a short duration and find similar segments. The basic idea of our approach is to first train a theme HMM using the given example and a general background HMM using all the audio data, and then combine these individual HMMs to form a synthesized "Background-Theme-Background" HMM. This synthesized HMM can then be applied to any audio stream as a parser to detect the most likely theme segment. We overcome the problem of a short duration being used to train a theme HMM, by using the MAP rule with the Background model as a prior model. Evaluation of the proposed retrieval scheme using short duration example audio clips of narration as queries gives quite promising results.

  12. Nov. 18
    Topic: Purpose Based Database Access Control for Privacy Protection
    Presenter: Elisa Bertino (Purdue University)
    Abstract: As privacy becomes a major concern for both consumers and enterprises, many privacy protecting access control models have been proposed. Privacy protection however cannot be achieved by traditional access control models. The main reason is that the nature of privacy does not fit into traditional access control models. While traditional access control policies mainly concern which users perform which action on which data object, privacy policies only concern which data object is used for what purpose(s). For example, a typical privacy policy such as "we will collect and use customer identifiable information for billing purposes and to anticipate and resolve problems with your service" does not state who can access the customer information, but states that the information can be accessed for the purposes of billing, customer service, and possibly some analysis. Another difficulty of privacy protection is that definitions of privacy vary from individual to individual. In this talk we explore issues related to annotating data with purpose-related metadata and to using these metadata to support new forms of access control.

  13. Nov. 25
    Topic: Thanksgiving dinner
    Presenter: Butterball Turkeys, Inc.

  14. Dec. 2 (joint DB/AI seminar, Room 2405 SC)
    Topic: Social Network Analysis of Text
    Presenter: Dragomir Radev (Univ. of Michigan)
    Abstract: Textual data is everywhere, in email and scientific papers, in online newspapers and e-commerce sites. The Web contains more than 200 terabytes of text not even counting the contents of dynamic textual databases. This enormous source of knowledge is seriously underexploited. Textual documents on the Web are very hard to model computationally: they are unstructured, time-dependent, collectively authored, and of uneven importance. Traditional grammar-based techniques don't scale up to address such problems. Novel representations and analytical tools are needed.

    NewsInEssence (www.newsinessence.com) is a system that crawls the Web for news, automatically clusters them by topic, and produces user-defined extractive summaries of each cluster. A recent addition to the battery of summarization algorithms available to NewsInEssence is the Cosine Centrality method. In this talk I will describe how one can apply the theory of social networks and stochastic processes (in particular rank-based prestige and random walks on undirected graphs) to multi-document text summarization. (I will begin my talk with a short tutorial on the mathematics needed for the rest of the talk.)

    If time permits, at the end of the talk, I will quickly describe two recent ongoing projects in my research group: one on machine learning for object classification using random walks on bipartite (feature-object) graphs and another on using phylogenetic techniques for fact tracking in evolving multi-document summarization.

  15. Dec. 3 (Friday)
    Topic: Trio: A System for Integrated Management of Data, Accuracy, and Lineage
    Presenter: Jennifer Widom (Stanford University)
    Abstract: The recently-launched Trio project at Stanford aims to develop a new database system that manages not only data, but also the accuracy and lineage of the data. Inexact (uncertain, probabilistic, fuzzy, approximate, incomplete, and imprecise!) databases have been proposed in the past, and the lineage problem also has been studied. In the Trio project we plan to combine and distill previous work into a simple and usable model, design a query language as an understandable extension to SQL, and most importantly build a working system&emdash;a system that augments conventional data management with both accuracy and lineage as an integral part of the data. This talk will be a relatively informal. I will provide numerous motivating applications for Trio and lay out preliminary plans for the data model, query language, and prototype system. I hope to elicit feedback and discussion throughout the presentation.

  16. Dec. 9
    Topic: The Generalized Mapping Model
    Presenter: Everett Sherwood (Motorola)
    Abstract: The Generalized Mapping Model is a sequence of transformations that accepts one or more databases as inputs and maps them into a single set of very simple, canonical data structures. From this new representation, individual users may autonomously restructure both the domain model and the data. The model has utility for meta-data, data security, and MOLAP.

To schedule: