Fall 2006 Reading List for the DAIS Qualifying
Examination
I. Information Retrieval
- Basic concepts
- Vector-space
retrieval model, TF-IDF weighting, relevance/pseudo feedback,
(non-interpolated) average precision, query-likelihood retrieval model,
language model smoothing, PageRank, inverted
index
- Background
- Modern Information
Retrieval: A Brief Overview, Singhal, IEEE
Data Engineering Bulletin 24(4), pages 35-43, 2001. [ps]
- Link Analysis in
Web Information Retrieval, Henzinger, IEEE
Data Engineering Bulletin 23 (3), pages 3-8, 2000. [pdf]
- Probabilistic
relevance models based on document and query generation, Lafferty and
Zhai, Language Modeling and Information Retrieval, Kluwer
International Series on Information Retrieval, Vol. 13, 2003. [pdf]
- A study of
smoothing methods for language models applied to information retrieval,
Zhai and Lafferty, ACM Transactions on Information Systems, Vol. 2, No.
2, pp. 179-214, April 2004. [acm]
- More advanced topics
-
Formal models for expert finding in enterprise corpora, Krisztian Balog, Leif Azzopardi, Maarten de Rijke, SIGIR 2006. [
acm]
-
A web-based kernel function for measuring the similarity of short text snippets,
Mehran Sahami, Timothy D. Heilman, WWW 2006. [
acm]
-
Generating query substitutions,
Rosie Jones,
Benjamin Rey,
Omid Madani,
Wiley Greiner, WWW 2006. [
acm]
-
Query expansion using random walk models,
Kevyn Collins-Thompson, Jamie Callan, CIKM 2005.
[
acm]
II. Data Mining and Data Warehousing
- Basic Concepts
- Data warehousing:
star schema, data cube (be able to list half a dozen typical data cube computation
methods), multi-dimensional analysis (OLAP)
- Data mining:
frequent pattern mining (be able to list half a dozen typical methods),
sequential pattern mining (be able to list at four or five typical
methods), correlation analysis, classification (be able to list half a
dozen typical methods), clustering (be able to list half a dozen
typical methods)
- Background
- J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd
edition. Chapters 3 & 4 (for data warehousing); Chapters 2, 5-7
(for data mining). Morgan Kaufmann 2006.
- More advanced topics
- Data Warehousing:
- Prediction cubes,
Chen, Chen, Lin, and Ramakrishnan, VLDB 2005. [pdf]
- High-dimensional
OLAP: A minimal cubing approach, Li, Han, and Gonzalez, VLDB 2004. [pdf]
- Data Mining:
- Approximate
frequency counts over data streams, Manku
and Motwani, VLDB 2002. [pdf]
- Graphs over time:
Densification laws, shrinking diameters and possible explanations, Leskovec, Kleinberg, and Faloutsos, KDD 2005. [pdf]
- Discriminative
Frequent Pattern Analysis for Effective Classification, Cheng, Yan, Han,
and Hsu, ICDE 2007. [pdf]
III. Database Management Systems
- Basic concepts
- Hardware: disk
sector, track, block, seek, latency, how to lay out a database page
- Data modeling: ER,
OO, and Object-Relational approaches
- Concurrency control
and recovery: ACID, serializability,
two-phase locking, two-phase commit, logging and recovery, the impact
of data replication
- Theory:
normalization, dependencies
- Queries: access
methods (hashing, B-trees, multidimensional access methods), how to
optimize a query, SQL
- Benchmarks: TPC-C
and TPC-H
- Background
You can use any database textbook you like to
study the most basic of the concepts listed above; for example, CS411
teaches these concepts. (Note that you will be expected to be able to
demonstrate your understanding of the concepts by applying them (as
opposed to simply being able to define them).) In the remaining entries,
"RDS" refers to Stonebraker's Readings in
Database Systems, currently in its 4th edition.
- Generalized Search
Trees for Database Systems, Hellerstein et al., VLDB 1995 and RDS. [pdf] We include this paper as the reference for
multidimensional access methods; access methods based on B-trees and
hashing should be covered in any database textbook.
- New TPC Benchmarks
for Decision Support and Web Commerce, Poess
and Floyd, SIGMOD Record 29(4), December 2000. [pdf]
- Inclusion of New
Types in Relational Data Base Systems, Stonebraker,
ICDE 1986 and RDS. [acm]
We include this paper as your reference for understanding the impact of
extensibility (as, for example, intended by the object-relational
model) on a DBMS.
- More advanced topics
Please note that databases are a very broad field. The papers listed
here will be changed frequently, to reflect this breadth.
- Query processing
- AutoAdmin 'What-if' Index Analysis Utility,
Chaudhuri and Narasayya, SIGMOD 1998 and
RDS. [acm]
- RankSQL: Query Algebra and Optimization for
Relational Top-k Queries, Li, Chang, Ilyas, and Song, SIGMOD 2005. [pdf]
- Shoring up
persistent applications, Carey et al., SIGMOD 1994. [acm]
- Information
integration
- Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang, Wei-Ying
Ma: Simultaneous record detection and attribute labeling in web data
extraction. KDD 2006: 494-503. [pdf]
- Sriram Raghavan, Hector Garcia-Molina:
Complex Queries over Web Repositories. VLDB 2003: 33-44. [pdf]
IV. Bioinformatics
- Basic Concepts
- Sequence alignment
- Motif finding and regulatory
sequence analysis
- Gene prediction
- DNA sequencing
- Phylogenetic tree
reconstruction
- Gene expression analysis
- Clustering of biological data
- Background
- Biological sequence analysis - probabilistic
models of proteins and nucleic acids, by Durbin, Eddy, Krogh, and Mitchison. Read Chapters 2 (Pairwise
alignment), 3 (Markov chains and hidden Markov models), and 7.1-7.4 (Building phylogenetic
trees).
- More advanced topics
- Genome-wide
computational prediction of transcriptional regulatory modules reveals
new insights into human gene expression. Blanchette et al. Genome
Research 2006. 16(5):656-68.
- CE Lawrence, SF
Altschul, MS Boguski,
JS Liu, AF Neuwald, and JC Wootton Detecting subtle sequence signals: a Gibbs
sampling strategy for multiple alignment Science, Vol
262, Issue 5131, 208-214
- Segal E, Yelensky R, Koller D.
Genome-wide
discovery of transcriptional modules from DNA sequence and gene
expression. Bioinformatics. 2003;19 Suppl 1:i273-82.
|