Spring 2009 Reading List for the DAIS
Qualifying Examination
I. Information Retrieval
- Basic concepts
- Vector-space
retrieval model, TF-IDF weighting, relevance/pseudo feedback, query
expansion, mean average precision (MAP), normalized discounted
cumulative gain (NDCG), query-likelihood retrieval model, language
model smoothing, PageRank, inverted index
- Background
- Amit Singhal, Modern
Information Retrieval: A Brief Overview, IEEE Data Engineering Bulletin
24(4), pages 35-43, 2001.
Link: http://singhal.info/ieee2001.pdf
- Chris Manning,
Prabhakar Raghavan, Hinrich Schütze, Introduction to Information
Retrieval, Cambridge University Press, 2008. (Chapter 8 Evaluation in
IR and Chapter 21 Link Analysis)
Link: http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
- ChengXiang Zhai,
Statistical Language Models for Information Retrieval, Morgan and
Claypool Publishers, 2008. (Chapter 1 Introduction, Chapter 2 Overview
of IR Models, and Chapter 3 Simple Query Likelihood Retrieval Model).
Link: http://www.morganclaypool.com/doi/abs/10.2200/S00158ED1V01Y200811HLT001
- More advanced topics
- G. Cong et al.,
Finding question-answer pairs from online forums, Proceedings of ACM
SIGIR 2008, pages 467-474.
Link: http://doi.acm.org/10.1145/1390334.1390415
- Y. Wang et al.,
Exploring traversal strategy for web forum crawling, Proceedings of ACM
SIGIR 2008, pages 459-466.
Link: http://doi.acm.org/10.1145/1390334.1390413
- Z. Xu and R. Akella,
A new probabilistic retrieval model based on the Dirichlet compound
multinomial distribution, Proceedings of ACM SIGIR 2008, pages
427-434.
Link: http://doi.acm.org/10.1145/1390334.1390408
- S. R. K. Branavan et
al., Generating a Table-of-Contents, Proceedings of ACL 2007, pages
544-551.
II. Data Mining and Data Warehousing
- Basic Concepts
- Data warehousing:
star schema, data cube (be able to list half a dozen typical data cube
computation methods), multi-dimensional analysis (OLAP)
- Data mining:
frequent pattern mining (be able to list half a dozen typical methods),
sequential pattern mining (be able to list four or five typical
methods), correlation analysis, classification (be able to list half a
dozen typical methods), clustering (be able to list half a dozen
typical methods)
- Background
- J. Han and M.
Kamber, Data Mining: Concepts and Techniques, 2nd edition. Chapters 3
& 4 (for data warehousing); Chapters 2, 5-7 (for data mining).
Morgan Kaufmann 2006.
- More advanced topics
- Data Warehousing:
- Prediction cubes.
Chen, Chen, Lin, and Ramakrishnan. VLDB 2005. [pdf]
- ARCube: Supporting Ranking
Aggregate Queries in Partially Materialized Data Cubes. Wu, Xin, and
Han. SIGMOD 2008.
[pdf]
- Data Mining:
- Mining Colossal
Frequent Patterns by Core Pattern Fusion. Zhu et al. ICDE 2007. [pdf]
- SCAN: A Structural
Clustering Algorithm for Networks. Xu et al. KDD 2007. [acm]
- Direct
Discriminative Pattern Mining for Effective Classification. Cheng,
Yan, Han, and Yu. ICDE 2008. [pdf]
III. Database Management Systems
- Basic concepts
- Hardware: disk
sector, track, block, seek, latency, how to lay out a database page
- Data modeling: ER,
OO, and Object-Relational approaches
- Concurrency control
and recovery: ACID, serializability, two-phase locking, two-phase
commit, logging and recovery, the impact of data replication
- Theory:
normalization, dependencies
- Queries: access
methods (hashing, B-trees, multidimensional access methods), how to
optimize a query, SQL
- Benchmarks: TPC-C
and TPC-H
- Background
You can use any database textbook you like to study the most basic of
the concepts listed above; for example, CS411 teaches these concepts.
Note that you will be expected to demonstrate your
understanding of the concepts by applying them, not simply
by defining them. In the remaining entries, "RDS"
refers to Stonebraker's Readings in Database Systems.
- Generalized Search
Trees for Database Systems. Hellerstein et al. VLDB 1995 and RDS. [pdf] We
include this paper as the reference for multidimensional access
methods; access methods based on B-trees and hashing should be covered
in any database textbook.
- New TPC Benchmarks
for Decision Support and Web Commerce. Poess and Floyd. SIGMOD Record
29(4), December 2000. [pdf]
- Inclusion of New
Types in Relational Data Base Systems. Stonebraker. ICDE 1986 and RDS.
[acm] We
include this paper as your reference for understanding the impact of
extensibility (as, for example, intended by the object-relational
model) on a DBMS.
- More advanced topics
Please note that databases are a very broad field. The papers listed
here will be changed frequently, to reflect this breadth.
- Database Core
- Scalable
Approximate Query Processing with the DBO Engine. Jermaine, Arumugam,
Pol, and Dobra. SIGMOD 2007. [acm]
- Compiling Mappings
to Bridge Applications and Databases. Melnik, Adya, and
Bernstein. SIGMOD 2007. [acm]
- Information Systems
- Scalable Semantic
Web Data Management Using Vertical Partitioning. Abadi, Marcus, Madden, and
Hollenbach. VLDB 2007. [pdf]
- iTrails: Pay-as-you-go Information
Integration in Dataspaces. Salles et al. VLDB 2007. [pdf]
IV. Bioinformatics
- Basic Concepts
- Sequence alignment
- Motif finding and regulatory
sequence analysis
- Gene prediction
- DNA sequencing
- Phylogenetic tree reconstruction
- Gene expression analysis
- Clustering of biological data
- Background
- Biological sequence analysis: probabilistic
models of proteins and nucleic acids, by Durbin, Eddy, Krogh,
and Mitchison. Read Chapters 2 (Pairwise alignment), 3 (Markov chains
and hidden Markov models), and 8.1-8.5 (Probabilistic approaches to
phylogeny).
- More advanced topics
- Combining
phylogenetic and hidden Markov models in biosequence analysis. Siepel
and Haussler. RECOMB 2003.
[acm]
- Predicting
expression patterns from regulatory sequence in Drosophila
segmentation. Segal, Raveh-Sadka, Schroeder, Unnerstall & Gaul. Nature 451, pp. 535-540 (31
January 2008). [html]
- Informative
priors based on transcription factor structural class improve de novo
motif discovery. Narlikar et al. Bioinformatics.
2006. [pdf]