| Thursday, August 24 DAIS adjunct seminar SC 3405 4 PM |
Anonymity in Data Publishing and Distribution
Kristen LeFevre, University of Wisconsin - Madison Abstract: Numerous organizations distribute non-aggregate microdata for purposes such as demographic and public health research, and it is important that the released data protect the identities of individuals. At the same time, concern for individual privacy must be balanced with the need to maintain the quality and utility of the data. In this talk, I will describe several new techniques for generalizing data to preserve anonymity. First, I will describe a "multidimensional recoding" paradigm and greedy partitioning algorithm. An extensive theoretical and empirical evaluation indicates that, in addition to being fast and scalable, this algorithm often produces higher-quality data than what would be considered optimal under previous recoding paradigms. In the second part of this talk, I observe that the ultimate indicator of data utility is the task or workload for which the data will be used. Following this intuition, I will describe several extensions to the basic generalization framework that allow us to incorporate a robust set of target workloads, including classification, regression, and selection.
Bio: |
| Thursday, August 31 SC 3403 11 AM |
Generating Semantic Annotation for Frequent Patterns with Context Analysis Qiaozhu Mei Abstract: As a fundamental data mining task, frequent pattern mining has widespread applications in many different domains. Research in frequent pattern mining has so far mostly focused on developing efficient algorithms to discover various kinds of frequent patterns, but little attention has been paid to the important next step -- interpreting the discovered frequent patterns. Although some recent work has studied the compression and summarization of frequent patterns, the proposed techniques can only annotate a frequent pattern with non-semantical information (e.g. support), which provides only limited help for a user to understand the patterns. In this paper, we propose the novel problem of generating semantic annotations for frequent patterns. The goal is to annotate a frequent pattern with in-depth, concise, and structured information that can better indicate the hidden meanings of the pattern. We propose a general approach to generate such an annotation for a frequent pattern by constructing its context model, selecting informative context indicators, and extracting representative transactions and semantically similar patterns. This general approach has potentially many applications such as generating a dictionary-like description for a pattern, finding synonym patterns, discovering semantic relations, and summarizing semantic classes of a set of frequent patterns. Experiments on different datasets show that our approach is effective in generating semantic pattern annotations. |
| Wednesday, September 6 SC 4405 4:00 PM |
Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach
Speaker: Dong Xin Abstract: Observed in many real applications, a top-k query often consists of two components to reflect a user's preference: a selection condition and a ranking function. A user may not only propose ad hoc ranking functions, but also use different interesting subsets of the data. In many cases, a user may want to have a thorough study of the data by initiating a multi-dimensional analysis of the top-k query results. Previous work on top-k query processing mainly focuses on optimizing data access according to the ranking function only. The problem of efficient answering top-k queries with multi-dimensional selections has not been well addressed yet. This paper proposes a new computational model, called ranking cube, for efficient answering top-k queries with multi-dimensional selections. We define a rank-aware measure for the cube, capturing our goal of responding to multi-dimensional ranking analysis. Based on the ranking cube, an efficient query algorithm is developed which progressively retrieves data blocks until the top-k results are found. The curse of dimensionality is a well-known challenge for the data cube and we cope with this difficulty by introducing a new technique of ranking fragments. Our experiments on Microsoft's SQL Server 2005 show that our proposed approaches have significant improvement over the previous methods.
|
| Tuesday, September 12 SC 3403 4 PM |
Latent Semantic Analysis for Multiple-Type Interrelated Data Objects Speaker: Xuanhui Wang Abstract Co-occurrence data is quite common in many real applications. Latent Semantic Analysis (LSA) has been successfully used to identify semantic relations in such data. However, LSA can only handle a single co-occurrence relationship between two types of objects. In practical applications, there are many cases where multiple types of objects exist and any pair of these objects could have a pairwise co-occurrence relation. All these co-occurrence relations can be exploited to alleviate data sparseness or to represent objects more meaningfully. In this paper, we propose a novel algorithm, \emph{M-LSA}, which conducts latent semantic analysis by incorporating all pairwise co-occurrences among multiple types of objects. Based on the mutual reinforcement principle, M-LSA identifies the most salient concepts among the co-occurrence data and represents all the objects in a unified semantic space. M-LSA is general and we show that several variants of LSA are special cases of our algorithm. Experiment results show that M-LSA outperforms LSA on multiple applications, including collaborative filtering, text clustering, and text categorization. |
| Thursday, September 21 DAIS adjunct seminar SC 3124 2 PM |
Towards Quality- and Cost-based Models for Multimedia Service Composition Speaker: Wolf-Tilo Balke Abstract: When moving from monolithic applications towards service-oriented Multimedia frameworks, the composition of Web services to form complex multimedia workflows becomes a demanding problem. Especially networked or mobile devices require a flexible composition strategy as they often have to move computationally complex or power-demanding tasks to powerful servers. Such a strategy also has to consider the changing environment due to movements of the device and it has to adapt to device-specific characteristics, e.g., the current battery level. Hence, the selection, execution and monitoring of chains of Web services and graceful recovery from failures of individual services and network-specific or device-specific alarms are important aspects to realize service-oriented multimedia applications.
Short Bio: |
| Wednesday, September 27 SC 4403 4 PM |
A Framework for Analysis of Dynamic Social Networks Speaker: Tanya Berger-Wolf Abstract Finding patterns of social interaction within a population has wide-ranging applications including: disease modeling, cultural and information transmission, phylogeography, conservation, and behavioral ecology. Social interactions are often modeled with networks. A key characteristics of social interactions is their continual change. However, most past analyses of social networks are essentially static in that all information about the time that social interactions take place is discarded. I will present a new mathematical and computational framework that enables analysis of dynamic social networks and that explicitly makes use of information about when social interactions occur. I will discuss several algorithms for obtaining information about the structure of dynamic social networks in this framework and pose many open questions. The research is joint work with J. Saia (UNM), D.I.Rubenstein, S. Sundaresan, and I. Fischoff (Princeton) Bio: Dr. Tanya Berger-Wolf is an assistant professor at the Department of Computer Science at the University of Illinois at Chicago. Her research is in applications of algorithmic and data mining techniques to population biology, both human (epidemiology) and animal, from genetics to social interactions. Dr. Berger-Wolf has received her B.Sc. in Computer Science and Mathematics from Hebrew University (Jerusalem, Israel) in 1995 and her Ph.D. in Computer Science from University of Illinois at Urbana-Champaign in 2002. She has spent two years as a postdoctoral fellow at the University of New Mexico working in computational phylogenetics and a year at the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS) doing research in computational epidemiology. |
| Tuesday, October 3 SC 3403 4 PM |
Semantic Term Matching in Axiomatic Approaches to Information Retrieval Speaker: Hui Fang Abstract A common limitation of many retrieval models, including the recently proposed axiomatic approaches, is that retrieval scores are solely based on exact (i.e., syntactic) matching of terms in the queries and documents, without allowing distinct but semantically related terms to match each other and contribute to the retrieval score. In this paper, we show that semantic term matching can be naturally incorporated into the axiomatic retrieval model through defining the primitive weighting function based on a semantic similarity function of terms. We define several desirable retrieval constraints for semantic term matching and use such constraints to extend the axiomatic model to directly support semantic term matching based on the mutual information of terms computed on some document set. We show that such extension can be efficiently implemented as query expansion. Experiment results on several representative data sets show that, with mutual information computed over the documents in either the target collection for retrieval or an external collection such as the Web, our semantic expansion consistently and substantially improves retrieval accuracy over the baseline axiomatic retrieval model. As a pseudo feedback method, our method also outperforms a state-of-the-art language modeling feedback method. |
| Tuesday, October 10 SC 3403 4 PM |
LinkClus: Efficient Clustering via Heterogeneous Semantic Links Speaker: Xiaoxin Yin Abstract Data objects in a relational database are cross-linked with each other via multi-typed links. Links contain rich semantic information that may indicate important relationships among objects. Most current clustering methods rely only on the properties that belong to the objects per se. However, the similarities between objects are often indicated by the links, and desirable clusters cannot be generated using only the properties of objects. In this paper we explore linkage-based clustering, in which the similarity between two objects is measured based on the similarities between the objects linked with them. In comparison with a previous study (SimRank) that computes links recursively on all pairs of objects, we take advantage of the power law distribution of links, and develop a hierarchical structure called SimTree to represent similarities in multi-granularity manner. This method avoids the high cost of computing and storing pairwise similarities but still thoroughly explore relationships among objects. An efficient algorithm is proposed to compute similarities between objects by avoiding pairwise similarity computations through merging computations that go through the same branches in the SimTree. Experiments show the proposed approach achieves high efficiency, scalability, and accuracy in clustering multi-typed linked objects. |
| Tuesday, October 17 SC 3403 4 PM |
Statistical Debugging and Automated Program Failure Analysis
Speaker: Chao Liu Abstract Abstract of FSE'05]: Automated localization of software bugs is one of the essential issues in debugging aids. Previous studies indicated that the evaluation history of program predicates may disclose important clues about underlying bugs. In this paper, we propose a new statistical model-based approach, called SOBER, which localizes software bugs without any prior knowledge of program semantics. Unlike existing statistical debugging approaches that select predicates correlated with program failures, SOBER models evaluation patterns of predicates in both correct and incorrect runs respectively and regards a predicate as bug-relevant if its evaluation pattern in incorrect runs differs significantly from that in correct ones. SOBER features a principled quantification of the pattern difference that measures the bug-relevance of program predicates. We systematically evaluated our approach under the same setting as previous studies. The result demonstrated the power of our approach in bug localization: SOBER can help programmers locate 68 out of 130 bugs in the Siemens suite when programmers are expected to examine no more than 10% of the code, whereas the best previously reported is 52 out of 130. Moreover, with the assistance of SOBER, we found two bugs in bc 1.06 (an arbitrary precision calculator on UNIX/Linux), one of which has never been reported before. [Abstract of FSE'06]: Recent software systems usually feature an automated fail- ure reporting system, with which a huge number of failing traces are collected every day. In order to prioritize fault diagnosis, failing traces due to the same fault are expected to be grouped together. Previous methods, by hypothesiz- ing that similar failing traces imply the same fault, cluster failing traces based on the literal trace similarity, which we call trace proximity. However, since a fault can be trig- gered in many ways, failing traces due to the same fault can be quite different. Therefore, previous methods actu- ally group together traces exhibiting similar behaviors, like similar branch coverage, rather than traces due to the same fault. In this paper, we propose a new type of failure prox- imity, called R-Proximity, which regards two failing traces as similar if they suggest roughly the same fault location. The fault location each failing case suggests is automati- cally obtained with Sober, an existing statistical debugging tool. We show that with R-Proximity, failing traces due to the same fault can be grouped together. In addition, we find that R-Proximity is helpful for statistical debugging: It can help developers interpret and utilize the statistical debugging result. We illustrate the usage of R-Proximity with a case study on the grep program and some experiments on the Siemens suite, and the result clearly demonstrates the advantage of R-Proximity over trace proximity. |
| Tuesday, October 24 SC 3403 4 PM |
Contextual Text Mining through Probabilistic Theme Analysis Speaker: Qiaozhu Mei Abstract Contextual text mining is concerned with extracting topical themes from a text collection with context information (e.g., time and location) and comparing/analyzing the variations of themes over different contexts. Since the topics covered in a document are usually related to the context of the document, analyzing topical themes within context can potentially reveal many interesting theme patterns. Previous work has addressed a few specific forms of the contextual text mining problem. We propose a new general probabilistic model for contextual text mining that can cover several existing models as special cases. Specifically, we extend the probabilistic latent semantic analysis (PLSA) model by introducing context variables to model the context of a document. The proposed mixture model, called contextual probabilistic latent semantic analysis (CPLSA) model, can be applied to many interesting mining tasks, such as temporal text mining, spatiotemporal text mining, author-topic analysis, event impact analysis, and cross-collection comparative analysis. Empirical experiments show that the proposed mixture model can discover themes and their contextual variations effectively. In this talk, I will introduce the contextual text mining problem, the proposed probabilistic model, and show how it can be regularized to solve different specific mining problems such as spatiotemporal theme pattern mining from weblogs and event impact analysis. |
| Monday, October 30 SC 2405 11 AM - 12 AM |
Speaker: Raghuram Krishnapuram (Manager, Information Management, IBM India Research Lab)
Search Result Summarization and Disambiguation via Contextual Dimensions
Efficient Named Entity Recognition using Inverse Index Operations
Bio |
| Tuesday, November 7 SC 3403 4 PM |
0-1 SDP for Clustering Based on Graph-Cut Models Speaker: Jiming Peng Abstract Graph cut optimization provides a plausible approach for the data clustering problem. In this paper, we present a unified framework for several graph-cut optimization problems arising from cluster analysis: the so-called 0-1 semidefinite programming (0-1 SDP). We show that the new model covers the classical K-means (kernel) clustering, constrained K-means clustering, normalized cut and ensemble clustering. Secondly, we consider the issue of how to solve the underlying 0-1 SDP problem. We consider two approximation methods based on singular value decomposition (SVD) of the coefficient matrix and the SVD of the coefficient matrix in a projected subspace, respectively and prove that both algorithms can provide a 2-approximation to the original clustering problem. The complexity of these approximation algorithms are discussed and preliminary numerical results based on the new algorithm are reported.
bio: |
| Tuesday, November 14 DAIS adjunct seminar SC 2405 4 PM |
Network Biclustering: Identify Condition-Specific Network Modules across Massive Biological networks
Speaker: Jasmine Zhou Abstract Motivation: Current approaches for biological network analysis focus mainly on the analysis of a single biological network, providing static descriptions of network features. With the rapid accumulation of biological networks generated under different conditions, there is an urgent need to analyze the dynamics of biological networks, i.e. how network features change with conditions. Such dynamic descriptions will provide deeper insights into cellular organization and pathway coordination. Results: We developed a novel algorithm, NETBICLUSTER, to identify condition-specific network modules from a large number of biological networks. The NETBICLUSTER algorithm not only identifies network modules corresponding to particular biological processes, but also mines non-dense network sub-graph patterns and more importantly, provides network dynamics information such as condition-specific activation and pathway cross-talking. Applying NETBICLUSTER to 97 mouse gene co-expression networks derived from micro-array datasets, we discovered a large number of functionally homogeneous network modules and dataset conditions under which they are activated and deactivated. To see more about Prof. Zhou, please visit her website at http://www.cmb.usc.edu/people/xjzhou/. |
| Tuesday, November 21 SC 3403 4 PM |
Thanksgiving Break - No Seminar |
| Tuesday, November 28 SC 3403 4 PM |
Towards Robust Indexing for Ranked Queries
Speaker: Dong Xin Abstract: Top-k query asks for k tuples ordered according to a specific ranking function that combines the values from multiple participating attributes. The combined score function is usually linear. To efficiently answer top-k queries, preprocessing and indexing the data have been used to speed up the run time performance. Many indexing methods allow the online query algorithms progressively retrieve the data and stop at a certain point. However, in many cases, the number of data accesses is sensitive to the query parameters (i.e. , linear weights in the score functions). In this paper, we study the sequentially layered indexing problem where tuples are put into multiple consecutive layers and any top-$k$ query can be answered by at most k layers of tuples. We propose a new criterion for building the layered index. A layered index is robust if for any k, the number of tuples in the top k layers is minimal in comparison with all the other alternatives. The robust index guarantees the worst case performance for arbitrary query parameters. We derive a necessary and sufficient condition for robust index. The problem is shown solvable within $O(n^d\log n)$ (where $d$ is the number of dimensions, and $n$ is the number of tuples). To reduce the high complexity of the exact solution, we develop an approximate approach, which has time complexity $O(2^{d}n(\log n)^{r(d)-1})$, where $r(d)=\lceil {{\frac d 2}} \rceil+ \lfloor {\frac d 2} \rfloor \lceil {\frac d 2} \rceil$. Our experimental results show that our proposed method outperforms the best known previous methods. |
| Tuesday, December 5 SC 3403 4 PM |
FlowCube: Constructing RFID FlowCubes for Multi-Dimensional Analysis of Commodity Flows Speaker: Hector Gonzalez Abstract: With the advent of RFID (Radio Frequency Identification) technology, manufacturers, distributors, and retailers will be able to track the movement of individual objects throughout the supply chain. The volume of data generated by a typical RFID application will be enormous as each item will generate a complete history of all the individual locations that it occupied at every point in time, possibly from a specific production line at a given factory, passing through multiple warehouses, and all the way to a particular checkout counter in a store. The movement trails of such RFID data form gigantic commodity flowgraph representing the locations and durations of the path stages traversed by each item. This commodity flow contains rich multi-dimensional information on the characteristics, trends, changes and outliers of commodity movements. In this paper, we propose a method to construct a warehouse of commodity flows, called flowcube. As in standard OLAP, the model will be composed of cuboids that aggregate item flows at a given abstraction level. The flowcube differs from the traditional data cube in two major ways. First, the measure of each cell will not be a scalar aggregate but a commodity flowgraph that captures the major movement trends and significant deviations of the items aggregated in the cell. Second, each flowgraph itself can be viewed at multiple levels by changing the level of abstraction of path stages. In this paper, we motivate the importance of the model, and present an efficient method to compute it by (1) performing simultaneous aggregation of paths to all interesting abstraction levels, (2) pruning low support path segments along the item and path stage abstraction lattices, and (3) compressing the cube by removing rarely occurring cells, and cells whose commodity flows can be inferred from higher level cells. |