Fall 2007 Schedule
|
| Tuesday, August 28 SC 3403 4 PM |
Bitmap Indexing Speaker: Rishi Sinha
Abstract:
|
| Tuesday, September 4 SC 3403 4 PM |
Term Feedback for Information Retrieval with Language Models
Speaker: Bin Tan Abstract: This work studies term-based feedback for information retrieval in the language modeling approach. With term feedback a user {\em directly} judges the relevance of individual terms without interaction with feedback documents, taking full control of the query expansion process. We propose a cluster-based method for selecting terms to present to the user for judgment, as well as effective algorithms for constructing refined query language models from user term feedback. Our algorithms are shown to bring significant improvement in retrieval accuracy over a non feedback baseline, and achieve comparable performance to relevance feedback. They are helpful even when there are no relevant documents in the top. |
| Tuesday, September 11 SC 3403 4 PM |
VLDB Practice Talks
1. Adaptive Fastest Path Computation on a Road Network: A Traffic Mining Approach Speaker: Hector Gonzalez Abstract: Efficient fastest path computation in the presence of varying speed conditions on a large scale road network is an essential problem in modern navigation systems. Factors affecting road speed, such as weather, time of day, and vehicle type, need to be considered in order to select fast routes that match current driving conditions. Most existing systems compute fastest paths based on road Euclidean distance and a small set of predefined road speeds. However, \History is often the best teacher". Historical traffic data or driving patterns are often more useful than the simple Euclidean distance-based computation because people must have good reasons to choose these routes, e.g., they may want to avoid those that pass through high crime areas at night or that likely encounter accidents, road construction, or traffic jams. In this paper, we present an adaptive fastest path algorithm capable of efficiently accounting for important driving and speed patterns mined from a large set of traffic data. The algorithm is based on the following observations: (1) The hierarchy of roads can be used to partition the road network into areas, and different path pre-computation strategies can be used at the area level, (2) we can limit our route search strategy to edges and path segments that are actually frequently traveled in the data, and (3) drivers usually traverse the road network through the largest roads available given the distance of the trip, except if there are small roads with a significant speed advantage over the large ones. Through an extensive experimental evaluation on real road networks we show that our algorithm provides desirable (short and well-supported) routes, and that it is significantly faster than competing methods. 2. Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional Time-Series Data Speaker: Xiaolei Li Abstract: Market analysis is a representative data analysis process with many applications. In such an analysis, critical numerical measures, such as profit and sales, fluctuate over time and form time-series data. Moreover, the time series data correspond to market segments, which are described by a set of attributes, such as age, gender, education, income level, and product-category, that form a multi-dimensional structure. To better understand market dynamics and predict future trends, it is crucial to study the dynamics of time-series in multi-dimensional market segments. This is a topic that has been largely ignored in time series and data cube research. In this study, we examine the issues of anomaly detection in multi-dimensional time-series data. We propose time-series data cube to capture the multi-dimensional space formed by the attribute structure. This facilitates the detection of anomalies based on expected values derived from higher level, \more general?¡À time-series. Anomaly detection in a time-series data cube poses computational challenges, especially for high-dimensional, large data sets. To this end, we also propose an efficient search algorithm to iteratively select subspaces in the original high-dimensional space and detect anomalies within each one. Our experiments with both synthetic and real-world data demonstrate the effectiveness and efficiency of the proposed solution. |
| Tuesday, September 18 SC 3403 4 PM |
VLDB Practice Talks
3. EntityRank: Searching Entities Directly and Holistically Speaker: Tao Cheng Abstract: As the Web has evolved into a data-rich repository, with the standard "page view," current search engines are becoming increasingly inadequate for a wide range of query tasks. While we often search for various data "entities" (e.g., phone number, paper PDF, date), today's engines only take us indirectly to pages. While entities appear in many pages, current engines only find each page individually. Toward searching directly and holistically for finding information of ner granularity, we study the problem of entity search, a signicant departure from traditional document retrieval. We focus on the core challenge of ranking entities, by distilling its underlying conceptual model Impression Model and developing a probabilistic ranking framework, EntityRank, that is able to seamlessly integrate both local and global information in ranking. We evaluate our online prototype over a 2TB Web corpus, and show that EntityRank performs effectively. 4. Context-Aware Wrapping: Synchronized Data Extraction Speaker: Shui-lung Chuang Abstract: The deep Web presents a pressing need for integrating large numbers of dynamically evolving data sources. To be more automatic yet accurate in building an integration system, we observe two problems: First, across sequential tasks in integration, how can a wrapper (as an extraction task) consider the peer sources to facilitate the subsequent matching task? Second, across parallel sources, how can a wrapper leverage the peer wrappers or domain rules to enhance extraction accuracy? These issues, while seemingly unrelated, both boil down to the lack of "context awareness": Current automatic wrapper induction approaches generate a wrapper for one source at a time, in isolation, and thus inherently lack the awareness of the peer sources or domain knowledge in the context of integration. We propose the concept of context-aware wrappers that are amenable to matching and that can leverage peer wrappers or prior domain knowledge. Such context awareness inspires a synchronization framework to construct wrappers consistently and collaboratively across their mutual context. We draw the insight from turbo codes and develop the turbo syncer to interconnect extraction with matching, which together achieve context awareness in wrapping. Our experiments show that the turbo syncer can, on the one hand, enhance extraction consistency and thus increase matching accuracy (from 17-83% to 78-94% in F-measure) and, on the other hand, incorporate peer wrappers and domain knowledge seamlessly to reduce extraction errors (from 09-60% to 01-11%). |
| Tuesday, September 25 SC 3403 4 PM |
Title: Automatic Labeling of Multinomial Topic Models (KDD'07 Runner-up Best Student Paper)
Speaker: Qiaozhu Mei Abstract: Multinomial distributions over words are frequently used to model topics in text collections. A common, major challenge in applying all such topic models to any text mining problem is to label a multinomial topic model accurately so that a user can interpret the discovered topic. So far, such labels have been generated manually in a subjective way. In this paper, we propose probabilistic approaches to automatically labeling multinomial topic models in an objective way. We cast this labeling problem as an optimization problem involving minimizing Kullback-Leibler divergence between word distributions and maximizing mutual information between a label and a topic model. Experiments with user study have been done on two text data sets with different genres. The results show that the proposed labeling methods are quite effective to generate labels that are meaningful and useful for interpreting the discovered topic models. Our methods are general and can be applied to labeling topics learned through all kinds of topic models such as PLSA, LDA, and their variations. |
| Thursday, October 2 SC 3403 4 PM |
What Can Database Technology Do for Trust Management?
Speaker: Klemens Boehm (Professor from University of Karlsruhe) Abstract: Trust models have been touted to facilitate cooperation among unknown entities. In our current work, we are interested in behavior-based trust models, i.e., models that derive the trustworthiness of an entity from its behavior in previous interactions. Existing proposals in this field typically feature one specific trust model. Further, various publications exist which have proposed different centrality measures to rank individuals, i.e., compute their reputation based on feedback. While these various publications have demonstrated the effectiveness of 'their' measure in certain specific situations, no measure is a clear winner, as we will explain. All this calls for a higher degree of flexibility. Consequently, this presentation proposes a trust-policy algebra allowing for the specification of a wide range of behavior-based trust policies. Since the evaluation of the standing of an entity requires centrality indices, we propose a first-class operator of our algebra for their computation. The presentation will then say how we intend to tackle the various related economic problems that are in the air and will point to open research questions
Bio: |
| Tuesday, October 9 SC 3403 4 PM |
Please note that there is a location change: 3405 Siebel Center Title: Trustworthy Management of Compliant Records (based on VLDB'06 Best Paper) Speaker: Soumyadeb Mitra Abstract: Recent litigation and intense regulatory focus on secure retention of electronic records have spurred a rush to introduce Write-Once-Read-Many (WORM) storage devices for retaining business records such as electronic mail. However, simply storing records in WORM storage is insufficient to ensure that the records are trustworthy, i.e., able to provide irrefutable proof and accurate details of past events. For example, some form of index is needed for timely access to the records, but unless the index is maintained securely, the records can in effect be hidden or altered. The index structure hence must also be maintained on WORM. Unfortunately, the dynamic nature of index structures makes it a non-trivial task to maintain them on write-once media. Furthermore, it's not adequate to secure a record on a single WORM device. Many practical considerations will mandate the migration of documents from one storage server to another. Unless this migration process is secured, an adversary can tamper with the records while they are being moved from one server to another. Finally, document retention is only one component of its lifecycle. The ability to delete electronic records once their retention period is over is as important as the act of securely maintaining them. It is relatively simple to delete a document, but much harder to remove its index entries from WORM. Yet if these entries are not obliterated, the contents of the deleted document can often be reconstructed. In this presentation, I will talk about each of these problems in detail and will give an overview of the solutions that we have developed to address them. |
| Tuesday, October 16 SC 3403 4 PM |
Title: Efficient Mining of Iterative Patterns for Software Specification Discovery
Speaker: David Lo (National University of Singapore) Abstract: Studies have shown that program comprehension takes up to 45% of software development costs. Such high costs are caused by the lack-of documented specification and further aggravated by the phenomenon of software evolution. There is a need for automated tools to extract specifications to aid program comprehension. In this paper, a novel technique to efficiently mine common software temporal patterns from traces is proposed. These patterns shed light on program behaviors, and are termed iterative patterns. They capture unique characteristic of software traces, typically not found in arbitrary sequences. Specifically, due to loops, interesting iterative patterns can occur multiple times within a trace. Furthermore, an occurrence of an iterative pattern in a trace can extend across a sequence of indefinite length. Since a program behavior can be manifested in numerous ways, analyzing a single trace will not be sufficient. Iterative pattern mining extends sequential pattern and episode minings to discover frequent iterative patterns which occur repetitively both within a program trace and across multiple traces. In this paper, we present CLIPER (CLosed Iterative Pattern minER) to efficiently mine a closed set of iterative patterns. A performance study on several simulated and real datasets shows the efficiency of our mining algorithm and effectiveness of our pruning strategy. Our case study on JBoss Application Server confirms the usefulness of mined patterns in discovering interesting software behavioral specification. Bio: David is a third year PhD candidate at the National University of Singapore. He is working on Software Specification Discovery, namely automated extraction of software specifications from software execution traces and other software artifacts. Mined specification can be used to aid program comprehension and also to detect bugs. It is an interesting research area in the intersection of data mining, software engineering and programming language. You can find his work published in major conferences of data mining, software engineering, and programming languages. |
| Tuesday, October 23 SC 3403 4 PM |
The Inauguration Yahoo!-Dais Seminar Location Change: Please note that we are moving to SC2405 this time Title: Managing Scientific Data: New Challenges for Database Researchers Speaker: Prof. Marianne Winslett Abstract: The database research community's appetite for new applications has led to increased interest in the data management needs of scientists. This area encompasses a huge range of applications, extending from public repositories of observational data such as the popular Sloan Digital Sky Survey to one-of-a-kind runs of simulation codes crafted by individual scientists. In this talk, we will survey the most common data management needs found in the hard sciences, describe the new database research challenges that arise from these needs, and outline ways to address some of these challenges. Bio: Marianne Winslett has been a professor at the University of Illinois at Urbana-Champaign since 1987. Her current research interests include security in open systems and data management for scientific applications. She has served on the editorial boards of ACM Transactions on Database Systems and IEEE Transactions on Knowledge and Data Engineering, and is currently on the board of ACM Transactions on the Web. She is an ACM Fellow, a past vice-chair of ACM SIGMOD and the recipient of an NSF Presidential Young Investigator Award. |
| Tuesday, October 30 SC 3403 4 PM |
Title: iComment: Bugs or Bad Comments?
Speaker: Lin Tan Abstract: Commenting source code has long been a common practice in software development. Compared to source code, comments are more direct, descriptive and easy-to-understand. Comments and source code provide relatively redundant and independent information regarding a program's semantic behavior. As software evolves, they can easily grow out-of-sync, indicating two problems: (1) bugs - the source code does not follow the assumptions and requirements specified by correct program comments; (2) bad comments - comments that are inconsistent with correct code, which can confuse and mislead programmers to introduce bugs in subsequent versions. Unfortunately, as most comments are written in natural language, no solution has been proposed to automatically analyze comments and detect inconsistencies between comments and source code. This paper takes the first step in automatically analyzing comments written in natural language to extract implicit program rules and use these rules to automatically detect inconsistencies between comments and source code, indicating either bugs or bad comments. Our solution, iComment, combines Natural Language Processing (NLP), Machine Learning, Statistics and Program Analysis techniques to achieve these goals. We evaluate iComment on four large code bases: Linux, Mozilla, Wine and Apache. Our experimental results show that iComment automatically extracts 1832 rules from comments with 90.8-100% accuracy and detects 60 comment-code inconsistencies, 33 new bugs and 27 bad comments, in the latest versions of the four programs. Nineteen of them (12 bugs and 7 bad comments) have already been confirmed by the corresponding developers while the others are currently being analyzed by the developers. |
| Tuesday, November 6 SC 3403 4 PM |
Title: R&D Challenges for Information Processing Software Technology: From the Usability Perspective
Speaker: Won Kim (Professor from Sungkyunkwan University) Abstract: Information processing software technology represents one of the three pillars of information technology today, along with semiconductor technology and communications technology. Information processing software technology includes system software, a huge variety of applications, a wide variety of middleware and tools, and all the enabling techniques. It has become an indispensable engine of the daily lives of people around the globe. Today major R&D challenges confront the information processing software technology due to the advent of the ubiquitous computing era and the rapidly expanding volumes of digitized information and reliance on it. I will cast these challenges in terms of ¡°usability of information¡±. Usability challenges include broadly the human computer interaction and the information accessibility. The information accessibility challenges in turn consist of issues posed by the information explosion (i.e., too much information, too much irrelevant information), unlocatable information (i.e., the locations or even the existence of large volumes of information being unknown), and unprocessable information (i.e., large volumes of information being unprocessable due to software inadequacy or semantics being unknown to software). I will examine these issues in turn, and also outline R&D agendas to address them. I will then outline some of the areas of nations¡¯ economies and mankind¡¯s well-being that solutions to these challenges will positively impact. Bio: Dr. Won Kim is a professor and university fellow with School of Information and Communication Engineering, Sungkyunkwan University. Before joining Sungkyunkwan University, he held positions such as the Senior Advisor of Samsung Electronics, Founder and CEO of Cyber Database Solutions Inc., Dean of Engineering Graduate School and Distinguished Professor of Computer Science with Ewha Women's University, Founder and CEO of UniSQL Inc., Director of Microelectronics and Computer Corporation, and Research Staff Member of IBM Almaden Research Center. Dr. Kim was elected as ACM Fellow in 1995. He was the chair of ACM SIGMOD from 1989-1997, the founder and chair of ACM SIGKDD (1998-2005), and the Editor-in-Chief of ACM Transactions on Database Systems from 1992 to 2001. He is also the founder and Editor-in-Chief of ACM Transactions on Internet Technology since 2000. Dr. Kim is an alumni of our department, who received his Ph.D. degree here in 1980. |
| Tuesday, November 13 SC 3403 4 PM |
Title: Two Perspectives on Domain Adaptation in Natural Language Processing
Speaker: Jing Jiang Abstract: The problem of domain adaptation for statistical classifiers arises when our labeled training examples and unlabeled test examples come from different domains. This problem is commonly encountered in natural language processing (NLP) tasks. In this talk, I will present our recent work addressing the domain adaptation problem. We have proposed two frameworks, corresponding to two different perspectives on this problem: feature selection and instance weighting. In the feature selection framework, we seek to identify ¡°generalizable features¡± that behave similarly across domains; in the instance weighting framework, our idea is to re-weight the examples in order to minimize the expected loss on the test domain. In both frameworks, we have also incorporated semi-supervised learning to make use of the unlabeled test domain examples. Experiment results on a number of NLP tasks, including NER, part-of-speech (POS) tagging, and spam filtering, show the effectiveness of both frameworks. |
| Tuesday, November 20 SC 3403 4 PM |
Thanksgiving Break. No meeting
|
| Thursday, November 27 SC 3403 4 PM |
Title: Training Linear Discriminant Analysis in Linear Time
Speaker: Deng Cai Abstract: Linear Discriminant Analysis (LDA) has been a popular method for extracting features which preserve class separability. It has been widely used in many fields of information processing, such as machine learning, data mining, information retrieval, and pattern recognition. However, the computation of LDA involves dense matrices eigen-decomposition which can be computationally expensive both in time and memory. Specifically, LDA has $O(mnt+t^3)$ time complexity and requires $O(mn+mt+nt)$ memory, where $m$ is the number of samples, $n$ is the number of features and $t=\min(m,n)$. When both $m$ and $n$ are large, it is infeasible to apply LDA. In this paper, we propose a novel algorithm for discriminant analysis, called {\em Spectral Regression Discriminant Analysis} (SRDA). By using spectral graph analysis, SRDA casts discriminant analysis into a regression framework which facilitates both efficient computation and the use of regularization techniques. Our theoretical analysis shows that SRDA can be computed with $O(ms)$ time and $O(ms)$ memory, where $s (\leq n)$ is the average number of non-zero features in each sample. Extensive experimental results on four real world data sets demonstrate the effectiveness and efficiency of our algorithm. |
| Tuesday, December 4 SC 3403 4 PM |
Title: Topics on supply chain management and freeway surveillance
Speaker: Prof. Yanfeng Ouyang Abstract: This talk will introduce research topics in the fields of transportation and supply chain systems that may be of interest to the data mining research community. We will first focus on supply chain management. In typical supply chains, multiple suppliers place orders to one another to fill customer demands. Many supply chains are known to suffer from instability problems; i.e., fluctuations in order sequences are usually much larger for suppliers farther away from the customer. We derive robust analytical conditions to predict the presence and magnitude of order instability; we further prove in theory that supply chains can be stabilized by proper information strategies, independent of customer demand processes. However, the performance of such strategies under realistic business settings needs careful verification. A role-playing Internet-based computer simulation game is recently developed with information provision capabilities to simulate the operations of a multi-stage serial supply chain, where human players each manages one stage of the chain. The effects of the players' behaviors on instability will be recorded and analyzed to extract knowledge on practical performances of the proposed information strategies. We will also discuss an ongoing research effort on freeway surveillance and incident detection. Incidents (e.g., crashes) contribute to more than 50% of total urban freeway congestion. Prompt and reliable incident detection is vital in reducing post-incident delay and the potential for additional incidents. Despite the emergence of various real-time algorithms in past decades, incident detection remains one of the weakest links in implementing advanced traffic control and management concepts. A few challenges and possible solutions will be presented and discussed at the talk. Bio: Yanfeng Ouyang holds a B.Eng. in civil engineering from Tsinghua University, Beijing, 2000, M.S. in civil engineering from the University of Washington, 2001, M.S. in industrial engineering and operations research from the University of California at Berkeley, 2005, and Ph.D. in civil engineering from the University of California at Berkeley, 2005. He has been an assistant professor in the Department of Civil and Environmental Engineering at the University of Illinois at Urbana-Champaign since August 2005. His research mainly focuses on supply chain management, transportation network operations, logistics, infrastructure management, and safety. |
| Tuesday, December 11 SC 3403 4 PM |
Title: TBA
Speaker: TBA Abstract: TBA |