Schedule of DAIS Seminars
Fall 2005
3405 Siebel Center
Thursdays, 4:00-5:00
Coordinator:
Deng Cai
Contact: dengcai2
Schedule:
- Aug. 25 (3403 SC)
Topic: Frequent Pattern Mining: Algorithms, Framework and Extension
Presenter: Anthony K. H. Tung (National University of Singapore)
Abstract: Research on frequent pattern mining have lead to many fruitful
result for the data mining community. However, due to the large
number of algorithms being developed, it is often not easy to make
comparison between these algorithms and identify their merits over
each other. In this talk, I will first introduce a framework which I
have been using to organize and compare the various frequent pattern
mining algorithms that I know. I will then explain how this
framework lead to my recent work on 1) recycling/reusing frequent
pattens and 2) frequent pattern mining based on object-wise
enumeration (Owner). At the end of the talk, I will explain why I
think these work have provide interesting extension to the frequent
pattern mining framework.
- Sept. 1
Topic: DAIS group current research
Presenter: Marianne Winslett
- Sept. 8
Topic: Summarizing Itemset Patterns: A Profile-Based Approach (presented at KDD '05)
Presenter: Xifeng Yan
Abstract: Frequent-pattern mining has been studied extensively on
scalable methods for mining various kinds of patterns including itemsets,
sequences, and graphs. However, the bottleneck of frequent-pattern mining is
not at the efficiency but at the interpretability, due to the huge number of
patterns generated by the mining process. In this study, we examine how to
summarize a collection of itemset patterns using only K representatives, a
small number of patterns that a user can handle easily. The K representatives
should not only cover most of the frequent patterns but also approximate their
supports. A generative model is built to extract and profile these
representatives, under which the supports of the patterns can be easily
recovered without consulting the original dataset. Based on the restoration
error, we propose a quality measure function to determine the optimal value of
parameter K. Polynomial time algorithms are developed together with several
optimization heuristics for efficiency improvement. Empirical studies
indicate that we can obtain compact summarization in real datasets.
- Sept. 15
Topic: Axiomatic Approaches to Information Retrieval
Presenter: Hui Fang
Abstract: Existing retrieval models generally do not offer any
guarantee for optimal retrieval performance. Indeed, it is even difficult, if
not impossible, to predict a model's empirical performance analytically. This
Limitation is at least partly caused by the way existing retrieval models are
developed where relevance is only coarsely modeled at the level of documents
and queries as opposed to a finer granularity level of terms. In this paper,
we present a new axiomatic approach to developing retrieval models based on
direct modeling of relevance with formalized retrieval constraints defined at
the level of terms. The basic idea of this axiomatic approach is to search in
a space of candidate retrieval functions for one that can satisfy a set of
reasonable retrieval constraints. To constrain the search space, we propose
to define a retrieval function inductively and decompose a retrieval function
into three component functions. Inspired by the analysis of the existing
retrieval functions with the inductive definition, we derive several new
retrieval functions using the axiomatic retrieval framework. Experiment
results show that the derived new retrieval functions are more robust and less
sensitive to parameter settings than the existing retrieval functions with
comparable optimal performance.
- Sept. 22
Topic:Making Holistic Schema Matching Robust: An Ensemble Approach
Presenter: Bin He
Abstract:
The Web has been rapidly ``deepened" by myriad searchable databases online, where data are hidden behind query interfaces. As an essential task toward integrating these massive ``deep Web" sources, large scale schema matching (i.e., discovering semantic correspondences of attributes across many query interfaces) has been actively studied recently. In particular, many works have emerged to address this problem by ``holistically" matching many schemas at the same time and thus pursuing ``mining" approaches in nature.
However, while holistic schema matching has built its promise upon the large quantity of input schemas, it also suffers the robustness problem caused by noisy data quality. Such noises often inevitably arise in the automatic extraction of schema data, which is mandatory in large scale integration. For holistic matching to be viable, it is thus essential to make it robust against noisy schemas. To tackle this challenge, we propose a data-ensemble framework with sampling and voting techniques, which is inspired by bagging predictors. Specifically, our approach creates an ensemble of matchers, by randomizing input schema data into many independently downsampled trials, executing the same matcher on each trial and then aggregating their ranked results by taking majority voting. As a principled basis, we provide analytic justification of the effectiveness of this data-ensemble framework. Further, empirically, our experiments on real Web data show that the ``ensemblization"
indeed significantly boosts the matching accuracy under noisy schema input, and thus maintains the desired robustness of a holistic matcher.
- Sept. 29
Topic:
Presenter: Zhen Zhang
Abstract:
The Web has been rapidly "deepened" by myriad searchable databases online, where data are hidden behind query forms. Helping users query alternative "deep Web" sources in the same domain (e.g., Books, Airfares) is an important task with broad applications. As a core component of those applications, dynamic query translation (i.e., translating a user's query across dynamically selected sources) has not been extensively explored.
While existing works focus on isolated subproblems (e.g., schema matching, query rewriting) to study, we target at building a complete query translator and thus face new challenges:
1) To complete the translator, we need to solve the predicate mapping problem (i.e., map a source predicate to target predicates), which is largely unexplored by existing works; 2) To satisfy our application requirements, we need to design a customizable system architecture to assemble various components addressing respective subproblems (i.e., schema matching, predicate mapping, query rewriting). Tackling these challenges, we develop a light-weight domain-based form assistant, which can generally handle alternative sources in the same domain and is easily customizable to new domains. Our experiment shows the effectiveness of our form assistant in translating queries for real Web sources.
- Oct. 6
Topic:Toward Fast, Collaborative, and Best-Effort Data Integration
Presenter: AnHai Doan
Abstract:
Data integration is the problem of retrieving desired information from multiple disparate data sources. Despite decades of research, today data integration systems are still extremely labor intensive to build and maintain. In this talk I will describe the AIDA project in the Hungarian Goulash research group that addresses this problem, using a synthesis of database, AI, IR, and open-source software techniques. I discuss three related directions: (1) develop automatic tools for labor-intensive tasks (e.g., schema matching, wrapper maintenance), (2) leverage multiple users to improve tool accuracy (reminiscent of mass collaboration efforts to build software such as Linux), and (3) simplify integration tasks, leading to IR-style best-effort integration scenarios. These directions can potentially help speed up data integration processes, build integration systems where previously not possible, and enable on-the-fly data integration. Finally, I will describe community information management, a long-term research direction that builds upon our current integration work, with implications to Web search and the Semantic Web.
- Oct. 13
Topic:Cross-Relational Clustering with User's Guidance
Presenter: Xiaoxin Yin
Abstract:
Clustering is an essential data mining task with numerous applications.
However, data in most real-life applications are high-dimensional in nature, and the related information often spreads across multiple relations. To ensure effective and efficient high-dimensional, cross-relational clustering, we propose a new approach, called CrossClus, which performs cross-relational clustering with user's guidance. We believe that user's guidance, even likely in very simple forms, could be essential for effective high-dimensional clustering since a user knows well the application requirements and data semantics. CrossClus is carried out as follows: A user specifies a clustering task and selects one or a small set of features pertinent to the task. CrossClus extracts the set of highly relevant features in multiple relations connected via linkages defined in the database schema, evaluates their effectiveness based on user's guidance, and identifies interesting clusters that fit user's needs. This method takes care of both quality in feature extraction and efficiency in clustering. Our comprehensive experiments demonstrate the effectiveness and scalability of this approach.
- Oct. 20
Topic: User-Centered Adaptive Information Retrieval
Presenter:Xuehua Shen
Abstract:
Information retrieval systems (e.g., web search engines) are critical for overcoming information overload. A major deficiency of existing retrieval systems is that they generally lack user modeling and are not adaptive to individual users, resulting in inherently non-optimal retrieval performance.
In UCAIR (User-Centered Adaptive Information Retrieval) project, we study how to infer a user's interest from the user's search context and use the inferred implicit user model for personalized search.
In this talk, I will present some recent work of UCAIR project from framework, model, system and evaluation perspectives.
We introduce a decision theoretic framework, in which the system responds to each user action by choosing an optimal system action and all the available user information and search context would be exploited to optimize each retrieval decision. We propose several context-sensitive retrieval models based on statistical language models to combine the preceding queries and clicked document summaries with the current query for better ranking of documents. We develop an intelligent client-side web search agent (UCAIR) that can perform eager implicit feedback, e.g., query expansion based on previous queries and immediate result reranking based on clickthrough information. We use the TREC AP data to create a test collection with search context information, and quantitatively evaluate our models using this test set.
Experiment results show that using implicit feedback, especially the clicked document summaries, can improve retrieval performance substantially.
Moreover, the user study on web search using the UCAIR search agent shows that our UCAIR search agent can improve search accuracy over the popular Google search engine.
The contents of the talk mostly come from SIGIR 2005 paper and CIKM 2005 paper, which can be downloaded from UCAIR project web site http://sifaka.cs.uiuc.edu/ir/ucair/. The UCAIR search agent can also be downloaded from the web site.
- Oct. 27
Topic:Mining Compressed Frequent-Pattern Sets
Presenter: Dong Xin
Abstract:
A major challenge in frequent-pattern mining is the sheer size of its mining results. In many cases, a high min sup threshold may discover only commonsense patterns but a low one may generate an explosive number of output patterns, which severely restricts its usage.
In this paper, we study the problem of compressing frequent-pattern sets. Typically, frequent patterns can be clustered with a tightness measure delta (called delta-cluster), and a representative pattern can be selected for each cluster.
Unfortunately, finding a minimum set of
representative patterns is NP-Hard. We develop two greedy methods, RPglobal and RPlocal.
The former has the guaranteed compression bound but higher computational complexity.
The latter sacrifices the theoretical
bounds but is far more efficient. Our performance study shows that the compression quality using RPlocal is very close to RPglobal, and both can reduce the number of closed frequent patterns by almost two orders of magnitude.
Furthermore, RPlocal mines even faster
than FPClose[11], a very fast closed frequentpattern mining method. We also show that RPglobal and RPlocal can be combined together to balance the quality and efficiency.
- Nov. 3
Topic:The Music-to-Knowlege (M2K) Project: A visual programming environment for the development and evaluation of music information retrieval techniques
Presenter: J. Stephen Downie and Andreas F. Ehmann
Abstract:
The objective of the International Music Information Retrieval Systems Evaluation Laboratory (IMIRSEL) project is the creation of a large, secure corpus of audio and symbolic music data accessible to the music information retrieval (MIR) community for the testing and evaluation of various MIR techniques. As part of the IMIRSEL project, a cross-platform JAVA based visual programming environment called Music to Knowledge
(M2K) is being developed for a variety of music information retrieval related tasks. The primary objective of M2K is to supply the MIR community with a toolset that provides the ability to rapidly prototype algorithms, as well as foster the sharing of techniques within the MIR community through the use of a standardized set of tools. Due to the relatively large size of audio data and the computational costs associated with some digital signal processing and machine learning techniques, M2K is also designed to support distributed computing across computing clusters. In addition, facilities to allow the integration of non-JAVA based (e.g., C/C++, MATLAB, etc.) algorithms and programs are provided within M2K. [Work supported by the Andrew W. Mellon Foundation and NSF Grants No. IIS-0340597 and No. IIS-0327371.]
- Nov. 10
Topic: Pragmatic design of information and communication Technologies
Presenter:Bertram C. Bruce
Abstract:
his presentation explores two senses of pragmatic ICT design. One is the common language notion of how to create ICTs that work to meet real human needs, accommodate to users, and situation. The second is a conception of ICT design from pragmatist theory, which sees technologies as developed within a community of inquiry and embodying both means of action and forms of understanding. The talk provides an introduction to the theory and practice of the American Pragmatism movement, notably the work of John Dewey (1859-1952) and Jane Addams (1860-1935), whose influence has pervaded realms of philosophy, education, psychology, aesthetics, ethics, and social and political action. It also examines recent, related research on the social uses and implications of ICTs.
Also, the talk will cover some "Inquiry-Based Learning":
The essence of inquiry-based learning is the realization that learning is inseparable from life. When learning activities are divorced from ordinary experience, fragmented into short blocks of time, and framed within narrowly-defined disciplines, the learner is unlikely to engage, remember, or apply the knowledge supposedly conveyed. But there are institutional pressures and practical challenges that constrain learning in exactly those ways. How can we provide learning opportunities that are connected to the learner's needs and interests, relevant to life both within and beyond schoooling, challenging and engaging? How can we conceive disciplines in a way that enlarges learning rather than limits it? How can we make the best use of texts, multimedia, computers, field experiences, dialogues, and all the other media for learning? This presentation explores these questions, all within a larger framework that considers the relation of schooling to the larger society.
- Nov. 17
Topic: No regular seminar on this week
Presenter:
Abstract:
- Dec. 1
Topic:Privacy-Preserving Database Union
Presenter:Alberto Maria Segre
Abstract:
This talk describes a cryptographic protocol for merging two or more data sets without divulging identifying information about individual records; technically, the protocol computes a "blind set-theoretic union." Applications for this protocol arise, for example, in data analysis for biomedical application areas, where identifying fields (e.g., patient names) are protected by governmental privacy regulations or by institutional research board policies.
- Dec. 8
Topic: Efficient k-anonymization using Clustering Techniques
Presenter: Ji-Won Byun
Abstract:
k-anonymization techniques are a key component of any comprehensive solution to data privacy and have been the focus of intense research in the last few years. An important requirement for such techniques is to ensure anonymization of data while at the same time minimizing the information loss resulting from datamodifications such as generalization and suppression.
In this talk we propose a new approach that addresses the limitations of existing solutions. Our approach uses the idea of clustering to minimize information
loss and thus ensure good data quality. The key observation is that data records that are naturally close to each other should be part of the same equivalence class. Current clustering techniques, however, are not directly applicable because they do not consider the requirement that each cluster should contain at least k records. We thus formulate a specific clustering problem, referred to as k-member clustering problem. We prove that this problem is NP-hard and present a greedy algorithm, the complexity of which is in O(n^2). As part of our approach we develop a suitable metric to estimate the information loss introduced by generalizations, which works on both numeric and categorical data. We also present extensions to our proposed algorithm that minimize the classification errors in the anonymized data and eliminate the inference channels due to the lack of diversity.
To schedule:
- Isabel Cruz (UIC)
- Ouri Wolfson (UIC)