Schedule of DAIS Seminars

Fall 2005
3405 Siebel Center
Thursdays, 4:00-5:00


Coordinator:

Deng Cai
Contact: dengcai2

Schedule:

  1. Aug. 25 (3403 SC)
    Topic: Frequent Pattern Mining: Algorithms, Framework and Extension
    Presenter: Anthony K. H. Tung (National University of Singapore)
    Abstract: Research on frequent pattern mining have lead to many fruitful result for the data mining community. However, due to the large number of algorithms being developed, it is often not easy to make comparison between these algorithms and identify their merits over each other. In this talk, I will first introduce a framework which I have been using to organize and compare the various frequent pattern mining algorithms that I know. I will then explain how this framework lead to my recent work on 1) recycling/reusing frequent pattens and 2) frequent pattern mining based on object-wise enumeration (Owner). At the end of the talk, I will explain why I think these work have provide interesting extension to the frequent pattern mining framework.

  2. Sept. 1
    Topic: DAIS group current research
    Presenter: Marianne Winslett

  3. Sept. 8
    Topic: Summarizing Itemset Patterns: A Profile-Based Approach (presented at KDD '05)
    Presenter: Xifeng Yan
    Abstract: Frequent-pattern mining has been studied extensively on scalable methods for mining various kinds of patterns including itemsets, sequences, and graphs. However, the bottleneck of frequent-pattern mining is not at the efficiency but at the interpretability, due to the huge number of patterns generated by the mining process. In this study, we examine how to summarize a collection of itemset patterns using only K representatives, a small number of patterns that a user can handle easily. The K representatives should not only cover most of the frequent patterns but also approximate their supports. A generative model is built to extract and profile these representatives, under which the supports of the patterns can be easily recovered without consulting the original dataset. Based on the restoration error, we propose a quality measure function to determine the optimal value of parameter K. Polynomial time algorithms are developed together with several optimization heuristics for efficiency improvement. Empirical studies indicate that we can obtain compact summarization in real datasets.

  4. Sept. 15
    Topic: Axiomatic Approaches to Information Retrieval
    Presenter: Hui Fang
    Abstract: Existing retrieval models generally do not offer any guarantee for optimal retrieval performance. Indeed, it is even difficult, if not impossible, to predict a model's empirical performance analytically. This Limitation is at least partly caused by the way existing retrieval models are developed where relevance is only coarsely modeled at the level of documents and queries as opposed to a finer granularity level of terms. In this paper, we present a new axiomatic approach to developing retrieval models based on direct modeling of relevance with formalized retrieval constraints defined at the level of terms. The basic idea of this axiomatic approach is to search in a space of candidate retrieval functions for one that can satisfy a set of reasonable retrieval constraints. To constrain the search space, we propose to define a retrieval function inductively and decompose a retrieval function into three component functions. Inspired by the analysis of the existing retrieval functions with the inductive definition, we derive several new retrieval functions using the axiomatic retrieval framework. Experiment results show that the derived new retrieval functions are more robust and less sensitive to parameter settings than the existing retrieval functions with comparable optimal performance.

  5. Sept. 22
    Topic:Making Holistic Schema Matching Robust: An Ensemble Approach
    Presenter: Bin He
    Abstract: The Web has been rapidly ``deepened" by myriad searchable databases online, where data are hidden behind query interfaces. As an essential task toward integrating these massive ``deep Web" sources, large scale schema matching (i.e., discovering semantic correspondences of attributes across many query interfaces) has been actively studied recently. In particular, many works have emerged to address this problem by ``holistically" matching many schemas at the same time and thus pursuing ``mining" approaches in nature. However, while holistic schema matching has built its promise upon the large quantity of input schemas, it also suffers the robustness problem caused by noisy data quality. Such noises often inevitably arise in the automatic extraction of schema data, which is mandatory in large scale integration. For holistic matching to be viable, it is thus essential to make it robust against noisy schemas. To tackle this challenge, we propose a data-ensemble framework with sampling and voting techniques, which is inspired by bagging predictors. Specifically, our approach creates an ensemble of matchers, by randomizing input schema data into many independently downsampled trials, executing the same matcher on each trial and then aggregating their ranked results by taking majority voting. As a principled basis, we provide analytic justification of the effectiveness of this data-ensemble framework. Further, empirically, our experiments on real Web data show that the ``ensemblization" indeed significantly boosts the matching accuracy under noisy schema input, and thus maintains the desired robustness of a holistic matcher.

  6. Sept. 29
    Topic:
    Presenter: Zhen Zhang
    Abstract:
    The Web has been rapidly "deepened" by myriad searchable databases online, where data are hidden behind query forms. Helping users query alternative "deep Web" sources in the same domain (e.g., Books, Airfares) is an important task with broad applications. As a core component of those applications, dynamic query translation (i.e., translating a user's query across dynamically selected sources) has not been extensively explored. While existing works focus on isolated subproblems (e.g., schema matching, query rewriting) to study, we target at building a complete query translator and thus face new challenges: 1) To complete the translator, we need to solve the predicate mapping problem (i.e., map a source predicate to target predicates), which is largely unexplored by existing works; 2) To satisfy our application requirements, we need to design a customizable system architecture to assemble various components addressing respective subproblems (i.e., schema matching, predicate mapping, query rewriting). Tackling these challenges, we develop a light-weight domain-based form assistant, which can generally handle alternative sources in the same domain and is easily customizable to new domains. Our experiment shows the effectiveness of our form assistant in translating queries for real Web sources.

  7. Oct. 6
    Topic:Toward Fast, Collaborative, and Best-Effort Data Integration
    Presenter: AnHai Doan
    Abstract:
    Data integration is the problem of retrieving desired information from multiple disparate data sources. Despite decades of research, today data integration systems are still extremely labor intensive to build and maintain. In this talk I will describe the AIDA project in the Hungarian Goulash research group that addresses this problem, using a synthesis of database, AI, IR, and open-source software techniques. I discuss three related directions: (1) develop automatic tools for labor-intensive tasks (e.g., schema matching, wrapper maintenance), (2) leverage multiple users to improve tool accuracy (reminiscent of mass collaboration efforts to build software such as Linux), and (3) simplify integration tasks, leading to IR-style best-effort integration scenarios. These directions can potentially help speed up data integration processes, build integration systems where previously not possible, and enable on-the-fly data integration. Finally, I will describe community information management, a long-term research direction that builds upon our current integration work, with implications to Web search and the Semantic Web.

  8. Oct. 13
    Topic:Cross-Relational Clustering with User's Guidance
    Presenter: Xiaoxin Yin
    Abstract:
    Clustering is an essential data mining task with numerous applications. However, data in most real-life applications are high-dimensional in nature, and the related information often spreads across multiple relations. To ensure effective and efficient high-dimensional, cross-relational clustering, we propose a new approach, called CrossClus, which performs cross-relational clustering with user's guidance. We believe that user's guidance, even likely in very simple forms, could be essential for effective high-dimensional clustering since a user knows well the application requirements and data semantics. CrossClus is carried out as follows: A user specifies a clustering task and selects one or a small set of features pertinent to the task. CrossClus extracts the set of highly relevant features in multiple relations connected via linkages defined in the database schema, evaluates their effectiveness based on user's guidance, and identifies interesting clusters that fit user's needs. This method takes care of both quality in feature extraction and efficiency in clustering. Our comprehensive experiments demonstrate the effectiveness and scalability of this approach.

  9. Oct. 20
    Topic: User-Centered Adaptive Information Retrieval
    Presenter:Xuehua Shen
    Abstract: Information retrieval systems (e.g., web search engines) are critical for overcoming information overload. A major deficiency of existing retrieval systems is that they generally lack user modeling and are not adaptive to individual users, resulting in inherently non-optimal retrieval performance. In UCAIR (User-Centered Adaptive Information Retrieval) project, we study how to infer a user's interest from the user's search context and use the inferred implicit user model for personalized search. In this talk, I will present some recent work of UCAIR project from framework, model, system and evaluation perspectives. We introduce a decision theoretic framework, in which the system responds to each user action by choosing an optimal system action and all the available user information and search context would be exploited to optimize each retrieval decision. We propose several context-sensitive retrieval models based on statistical language models to combine the preceding queries and clicked document summaries with the current query for better ranking of documents. We develop an intelligent client-side web search agent (UCAIR) that can perform eager implicit feedback, e.g., query expansion based on previous queries and immediate result reranking based on clickthrough information. We use the TREC AP data to create a test collection with search context information, and quantitatively evaluate our models using this test set. Experiment results show that using implicit feedback, especially the clicked document summaries, can improve retrieval performance substantially. Moreover, the user study on web search using the UCAIR search agent shows that our UCAIR search agent can improve search accuracy over the popular Google search engine. The contents of the talk mostly come from SIGIR 2005 paper and CIKM 2005 paper, which can be downloaded from UCAIR project web site http://sifaka.cs.uiuc.edu/ir/ucair/. The UCAIR search agent can also be downloaded from the web site.

  10. Oct. 27
    Topic:Mining Compressed Frequent-Pattern Sets
    Presenter: Dong Xin
    Abstract:
    A major challenge in frequent-pattern mining is the sheer size of its mining results. In many cases, a high min sup threshold may discover only commonsense patterns but a low one may generate an explosive number of output patterns, which severely restricts its usage. In this paper, we study the problem of compressing frequent-pattern sets. Typically, frequent patterns can be clustered with a tightness measure delta (called delta-cluster), and a representative pattern can be selected for each cluster. Unfortunately, finding a minimum set of representative patterns is NP-Hard. We develop two greedy methods, RPglobal and RPlocal. The former has the guaranteed compression bound but higher computational complexity. The latter sacrifices the theoretical bounds but is far more efficient. Our performance study shows that the compression quality using RPlocal is very close to RPglobal, and both can reduce the number of closed frequent patterns by almost two orders of magnitude. Furthermore, RPlocal mines even faster than FPClose[11], a very fast closed frequentpattern mining method. We also show that RPglobal and RPlocal can be combined together to balance the quality and efficiency.

  11. Nov. 3
    Topic:The Music-to-Knowlege (M2K) Project: A visual programming environment for the development and evaluation of music information retrieval techniques
    Presenter: J. Stephen Downie and Andreas F. Ehmann
    Abstract:
    The objective of the International Music Information Retrieval Systems Evaluation Laboratory (IMIRSEL) project is the creation of a large, secure corpus of audio and symbolic music data accessible to the music information retrieval (MIR) community for the testing and evaluation of various MIR techniques. As part of the IMIRSEL project, a cross-platform JAVA based visual programming environment called Music to Knowledge (M2K) is being developed for a variety of music information retrieval related tasks. The primary objective of M2K is to supply the MIR community with a toolset that provides the ability to rapidly prototype algorithms, as well as foster the sharing of techniques within the MIR community through the use of a standardized set of tools. Due to the relatively large size of audio data and the computational costs associated with some digital signal processing and machine learning techniques, M2K is also designed to support distributed computing across computing clusters. In addition, facilities to allow the integration of non-JAVA based (e.g., C/C++, MATLAB, etc.) algorithms and programs are provided within M2K. [Work supported by the Andrew W. Mellon Foundation and NSF Grants No. IIS-0340597 and No. IIS-0327371.]

  12. Nov. 10
    Topic: Pragmatic design of information and communication Technologies
    Presenter:Bertram C. Bruce
    Abstract:
    his presentation explores two senses of pragmatic ICT design. One is the common language notion of how to create ICTs that work to meet real human needs, accommodate to users, and situation. The second is a conception of ICT design from pragmatist theory, which sees technologies as developed within a community of inquiry and embodying both means of action and forms of understanding. The talk provides an introduction to the theory and practice of the American Pragmatism movement, notably the work of John Dewey (1859-1952) and Jane Addams (1860-1935), whose influence has pervaded realms of philosophy, education, psychology, aesthetics, ethics, and social and political action. It also examines recent, related research on the social uses and implications of ICTs. Also, the talk will cover some "Inquiry-Based Learning": The essence of inquiry-based learning is the realization that learning is inseparable from life. When learning activities are divorced from ordinary experience, fragmented into short blocks of time, and framed within narrowly-defined disciplines, the learner is unlikely to engage, remember, or apply the knowledge supposedly conveyed. But there are institutional pressures and practical challenges that constrain learning in exactly those ways. How can we provide learning opportunities that are connected to the learner's needs and interests, relevant to life both within and beyond schoooling, challenging and engaging? How can we conceive disciplines in a way that enlarges learning rather than limits it? How can we make the best use of texts, multimedia, computers, field experiences, dialogues, and all the other media for learning? This presentation explores these questions, all within a larger framework that considers the relation of schooling to the larger society.

  13. Nov. 17
    Topic: No regular seminar on this week
    Presenter:
    Abstract:

  14. Dec. 1
    Topic:Privacy-Preserving Database Union
    Presenter:Alberto Maria Segre
    Abstract:
    This talk describes a cryptographic protocol for merging two or more data sets without divulging identifying information about individual records; technically, the protocol computes a "blind set-theoretic union." Applications for this protocol arise, for example, in data analysis for biomedical application areas, where identifying fields (e.g., patient names) are protected by governmental privacy regulations or by institutional research board policies.

  15. Dec. 8
    Topic: Efficient k-anonymization using Clustering Techniques
    Presenter: Ji-Won Byun
    Abstract: k-anonymization techniques are a key component of any comprehensive solution to data privacy and have been the focus of intense research in the last few years. An important requirement for such techniques is to ensure anonymization of data while at the same time minimizing the information loss resulting from datamodifications such as generalization and suppression. In this talk we propose a new approach that addresses the limitations of existing solutions. Our approach uses the idea of clustering to minimize information loss and thus ensure good data quality. The key observation is that data records that are naturally close to each other should be part of the same equivalence class. Current clustering techniques, however, are not directly applicable because they do not consider the requirement that each cluster should contain at least k records. We thus formulate a specific clustering problem, referred to as k-member clustering problem. We prove that this problem is NP-hard and present a greedy algorithm, the complexity of which is in O(n^2). As part of our approach we develop a suitable metric to estimate the information loss introduced by generalizations, which works on both numeric and categorical data. We also present extensions to our proposed algorithm that minimize the classification errors in the anonymized data and eliminate the inference channels due to the lack of diversity.

To schedule: