Schedule of DAIS Seminars
Fall 2004
3405 Siebel Center
Thursdays, 3:15-4:15
Coordinator:
Lars Olson
Contact: leolson1
Schedule:
- Sept. 9
Topic: DAIS Overview
Presenter: DAIS Faculty
- Sept. 16
Topic: Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach
Presenter: Bin He
Abstract: To enable information integration, schema matching is a
critical step for discovering semantic correspondences of
attributes across heterogeneous sources. While complex
matchings are common, because of their far more complex search
space, most existing techniques focus on simple 1:1 matchings.
To tackle this challenge, this paper takes a conceptually
novel approach by viewing schema matching as correlation
mining, for our task of matching Web query interfaces to
integrate the myriad databases on the Internet. On this deep
Web, query interfaces generally form complex matchings between
attribute groups (e.g., author corresponds to first name, last
name in the Books domain). We observe that the cooccurrences
patterns across query interfaces often reveal such complex
semantic relationships: grouping attributes (e.g., first name,
last name) tend to be co-present in query interfaces and thus
positively correlated. In contrast, synonym attributes are
negatively correlated because they rarely co-occur. This
insight enables us to discover complex matchings by a
correlation mining approach. In particular, we develop the DCM
framework, which consists of data preparation, dual mining of
positive and negative correlations, and finally matching
selection. Unlike previous correlation mining algorithms,
which mainly focus on finding strong positive correlations,
our algorithm cares both positive and negative correlations,
especially the subtlety of negative correlations, due to its
special importance in schema matching. This leads to the
introduction of a new correlation measure, H-measure, distinct
from those proposed in previous work. We evaluate our approach
extensively and the results show good accuracy for discovering
complex matchings.
- Sept. 23 No meeting (qual exams)
- Sept. 30
Topic: Managing Uncertainty in Moving-Object and Sensor Databases
Presenter: Reynold Cheng (Purdue)
Abstract: In a moving-object database system, locations of objects are
constantly reported to the database. These location values are subsequently
used to answer user queries. Due to continuous changes in locations, as well
as limited resources (e.g., network bandwidth and battery power), it is
infeasible for the database to keep track of the actual location of every
moving object. Queries that use the stale values provided by the database can
produce incorrect answers. However, if the degree of uncertainty between the
actual location value and the database value is limited, one can place more
confidence in answers to the queries. More generally, query answers can be
augmented with probabilistic guarantees of the validity of the answers. To
answer these probabilistic queries, different solutions are required,
depending on the moving pattern of an object, as well as the nature of the
query. We investigate the interesting issue of modeling "uncertainty" for a
moving-object. Based on the uncertainty models, I will discuss how uncertain
location data can be queried, where algorithms for range queries and
nearest-neighbor queries will be presented.
We observe that uncertainty management techniques for a moving-object database
can be generalized to a vast class of sensor-based applications, which
typically involve the monitoring of continuously changing entities (e.g.,
temperature and pressure). I will talk about a taxonomy of uncertainty models
for general sensor data. These classes of models describe uncertainty in
different levels of precision. Next, we will present a classification scheme
for probabilistic queries for uncertain sensor data, and briefly discuss how
probabilistic queries are executed in each class. We also examine the
important issue of measuring the quality of answers to probabilistic queries.
- Oct. 7 (tentatively at 11-12 instead of 3:15-4:15)
Topic: FATES: Automatically-tuned Database Storage Management
Presenter: Natassa Ailamaki (Carnegie Mellon)
Abstract: Database system performance heavily depends on data access efficiency on
all levels of the memory hierarchy. CPU cache performance depends on
in-memory data layout, whereas I/O performance depends on data placement
on the disks. Throughout the memory hierarchy, however, the workload
behavior is a function of the access patterns the application dictates.
On-line transaction processing applications, for instance, tend to
access full data records in a random fashion, whereas decision-support
systems typically scan large tables sequentially, reading small
fractions of the records. Unfortunately, current database storage
systems are static and fragile. Low-level system parameters are fully
configured by human administrators, whereas there is typically a single
data organization used at all levels of the memory hierarchy, and data
organization is specialized to a single workload. This results in
frequent errors, suboptimal performance, and inability to take advantage
of the storage technology at its full potential.
This talk describes Fates, a dynamic, robust, and automated system for
database storage management. Borrowing from the Greek mythology, Fates
includes three components that establish proper abstractions in the
database storage system: Clotho decouples in-memory data layout from
on-disk storage layout, providing the opportunity to design efficient
data placement at each storage level separately. Lachesis provides
device-specific hints and automatically specializes the system to
extracted device characteristics. Finally, Atropos provides
logical-to-physical volume mapping and arranges storage to offer robust
performance across workloads. The talk will describe the Fates system
design and implementation, and will demonstrate its efficiency through
experimental results with application benchmarks.
- Oct. 14
Topic: DNA Sequencing-by-Hybridization with Gapped-Probes
in the Presence of Hybridization Noise
Presenter: Hon Wai Leong (National University of Singapore)
Abstract: DNA sequencing-by-hybridization (SBH) is a powerful potential alternative
to current sequencing by electrophoresis. Different SBH methods have been
compared under the hypothesis of error-free hybridization. However both
false negatives and false positive are likely to occur in practice.
In this talk, we describe how to successfully adapt the sequence
reconstruction algorithms of Preparata and Upfal [PU00], (which are
asymptotically optimal in the error-free case, to noisy conditions
with hybridization errors. Our algorithm exhibits a graceful degradation
of performance as the error-rate increases. As a downside, the computational
cost of sequence reconstruction rises noticeably under noisy conditions.
We also describe recent work that addresses this shortcoming using an
adaptive approach which is both effective and efficient.
- Oct. 18 (Monday, departmental seminar)
Topic: Aggregation Queries at Streaming Speeds
Presenter: Divesh Srivastava (AT&T Labs)
Abstract:Measuring and monitoring complex dynamic phenomena, such as IP network
traffic evolution, produces highly detailed and voluminous data
streams. The monitoring applications that analyze these massive data
streams require sophisticated, user-defined aggregation queries (which
summarize the data distributions in these streams) to identify, for
example, anomalous activity.
Motivated by the nature of IP network traffic data, we present two
families of aggregation queries: hierarchical heavy hitters
(hierarchically organized large-valued regions) and biased quantiles
(totally ordered skewed ranks). We present deterministic algorithms to
approximate these aggregates in a single pass over the data stream. We
study the performance implications of using user-defined aggregate
functions (UDAFs) to incorporate such algorithms into the Gigascope data
stream management system's query processing architecture. Our
experimental results highlight the importance of fast, space-efficient,
non-blocking implementations to make them useful in practice.
- Oct. 21 (no meeting, attend seminar on the 18th instead)
- Oct. 28
Topic: Graph Indexing: A Frequent Structure-based Approach (presented
at SIGMOD 04)
Presenter: Xifeng Yan
Abstract: Graph has become increasingly important in modelling complicated
structures and schemaless data. In this presentation, I will
introduce the issues of indexing graphs and our indexing model
based on discriminative frequent structures that are identified
through a graph mining process. We show that the compact index
built under this model can achieve better performance in
processing graph queries. Since discriminative frequent structures
capture the intrinsic characteristics of the data, they are
relatively stable to database updates, thus facilitate
sampling-based feature extraction and incremental index
maintenance. Through our study, we demonstrate how database
indexing and query processing can benefit from data mining,
especially frequent pattern mining. Furthermore, the concepts
developed here can be generalized and applied to indexing
sequences, trees, and other complicated structures as well.
- Nov. 4 (joint DB/AI seminar, Room 2405 SC)
Topic: Link Mining
Presenter: Lise Getoor (University of Maryland)
Abstract:
A key challenge for data mining is tackling the problem of mining richly
structured datasets, where the objects are linked in some way. Links among the
objects may demonstrate certain patterns, which can be helpful for many data
mining tasks and are usually hard to capture with traditional statistical
models. Recently there has been a surge of interest in this area, fueled
largely by interest in web and hypertext mining, but also by interest in
mining social networks, security and law enforcement data, bibliographic
citations and epidemiological records. Link mining includes both descriptive
and predictive modeling of link data. Classification and clustering in linked
relational domains require new data mining models and algorithms. Furthermore,
with the introduction of links, new predictive tasks come to light. Examples
include predicting the numbers of links, predicting the type of link between
two objects, inferring the existence of a link, inferring the identity of an
object, finding co-references, and discovering subgraph patterns. In this
talk, I will give an overview of this newly emerging research area. I will
describe novel aspects of the modeling, learning and inference tasks and I will
give an introduction to a few of the many proposed frameworks. I will spend
the majority of the time discussing commonalities in the issues that must be
address in any statistical link mining framework.
- Nov. 11
Topic: Audio Segment Retrieval Using a Short Duration Example Query (presented at ICME 04)
Presenter: Atulya Velivelli (Dept. of Electrical and Computer Engineering)
Abstract: In this paper, we propose a general approach to audio segment
retrieval using a synthesized HMM. The approach allows a user to query audio
data by an example audio segment of a short duration and find similar segments.
The basic idea of our approach is to first train a theme HMM using the given
example and a general background HMM using all the audio data, and then combine
these individual HMMs to form a synthesized "Background-Theme-Background" HMM.
This synthesized HMM can then be applied to any audio stream as a parser to
detect the most likely theme segment. We overcome the problem of a short
duration being used to train a theme HMM, by using the MAP rule with the
Background model as a prior model. Evaluation of the proposed retrieval scheme
using short duration example audio clips of narration as queries gives quite
promising results.
- Nov. 18
Topic: Purpose Based Database Access Control for Privacy Protection
Presenter: Elisa Bertino (Purdue University)
Abstract: As privacy becomes a major concern for both consumers and enterprises, many privacy protecting access control models have been proposed. Privacy protection however cannot be achieved by traditional access control models. The main reason is that the nature of privacy does not fit into traditional access control models. While traditional access control policies mainly concern which users perform which action on which data object, privacy policies only concern which data object is used for what purpose(s). For example, a typical privacy policy such as "we will collect and use customer identifiable information for billing purposes and to anticipate and resolve problems with your service" does not state who can access the customer information, but states that the information can be accessed for the purposes of billing, customer service, and possibly some analysis. Another difficulty of privacy protection is that definitions of privacy vary from individual to individual. In this talk we explore issues related to annotating data with purpose-related metadata and to using these metadata to support new forms of access control.
- Nov. 25
Topic: Thanksgiving dinner
Presenter: Butterball Turkeys, Inc.
- Dec. 2 (joint DB/AI seminar, Room 2405 SC)
Topic: Social Network Analysis of Text
Presenter: Dragomir Radev (Univ. of Michigan)
Abstract: Textual data is everywhere, in email and scientific papers, in online newspapers and e-commerce sites. The Web contains more than 200 terabytes of text not even counting the contents of dynamic textual databases. This enormous source of knowledge is seriously underexploited. Textual documents on the Web are very hard to model computationally: they are unstructured, time-dependent, collectively authored, and of uneven importance. Traditional grammar-based techniques don't scale up to address such problems. Novel representations and analytical tools are needed.
NewsInEssence (www.newsinessence.com) is a system that crawls the Web for news, automatically clusters them by topic, and produces user-defined extractive summaries of each cluster. A recent addition to the battery of summarization algorithms available to NewsInEssence is the Cosine Centrality method. In this talk I will describe how one can apply the theory of social networks and stochastic processes (in particular rank-based prestige and random walks on undirected graphs) to multi-document text summarization. (I will begin my talk with a short tutorial on the mathematics needed for the rest of the talk.)
If time permits, at the end of the talk, I will quickly describe two recent ongoing projects in my research group: one on machine learning for object classification using random walks on bipartite (feature-object) graphs and another on using phylogenetic techniques for fact tracking in evolving multi-document summarization.
- Dec. 3 (Friday)
Topic: Trio: A System for Integrated Management of Data, Accuracy, and Lineage
Presenter: Jennifer Widom (Stanford University)
Abstract: The recently-launched Trio project at Stanford aims to develop a new database system that manages not only data, but also the accuracy and lineage of the data. Inexact (uncertain, probabilistic, fuzzy, approximate, incomplete, and imprecise!) databases have been proposed in the past, and the lineage problem also has been studied. In the Trio project we plan to combine and distill previous work into a simple and usable model, design a query language as an understandable extension to SQL, and most importantly build a working system&emdash;a system that augments conventional data management with both accuracy and lineage as an integral part of the data. This talk will be a relatively informal. I will provide numerous motivating applications for Trio and lay out preliminary plans for the data model, query language, and prototype system. I hope to elicit feedback and discussion throughout the presentation.
- Dec. 9
Topic: The Generalized Mapping Model
Presenter: Everett Sherwood (Motorola)
Abstract: The Generalized Mapping Model is a sequence of transformations that accepts one or more databases as inputs and maps them into a single set of very simple, canonical data structures. From this new representation, individual users may autonomously restructure both the domain model and the data. The model has utility for meta-data, data security, and MOLAP.
To schedule:
- Isabel Cruz (UIC) (next semester)
- Ouri Wolfson (UIC) (next semester?)