Department of Computer Science Unversity of Illinois at Urbana-Champaign
Home People Research Seminars Education Photos Links

Fall 2006 Reading List for the DAIS Qualifying Examination

I. Information Retrieval

  • Basic concepts
    • Vector-space retrieval model, TF-IDF weighting, relevance/pseudo feedback, (non-interpolated) average precision, query-likelihood retrieval model, language model smoothing, PageRank, inverted index
  • Background
    • Modern Information Retrieval: A Brief Overview, Singhal, IEEE Data Engineering Bulletin 24(4), pages 35-43, 2001. [ps]
    • Link Analysis in Web Information Retrieval, Henzinger, IEEE Data Engineering Bulletin 23 (3), pages 3-8, 2000. [pdf]
    • Probabilistic relevance models based on document and query generation, Lafferty and Zhai, Language Modeling and Information Retrieval, Kluwer International Series on Information Retrieval, Vol. 13, 2003. [pdf]
    • A study of smoothing methods for language models applied to information retrieval, Zhai and Lafferty, ACM Transactions on Information Systems, Vol. 2, No. 2, pp. 179-214, April 2004. [acm]
  • More advanced topics
    • A Markov random field model for term dependencies, Metzler and Croft, SIGIR 2005. [acm]
    • MailRank: using ranking for spam detection, Chirita, Diederich, and Nejdl, CIKM 2005. [acm]
    • Query chains: learning to rank from implicit feedback, Radlinski and Joachims, KDD 2005.[acm ]

II. Data Mining and Data Warehousing

  • Basic Concepts
    • Data warehousing: star schema, data cube (be able to list half a dozen typical data cube computation methods), multi-dimensional analysis (OLAP)
    • Data mining: frequent pattern mining (be able to list half a dozen typical methods), sequential pattern mining (be able to list at four or five typical methods), correlation analysis, classification (be able to list half a dozen typical methods), clustering (be able to list half a dozen typical methods)
  • Background
    • J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd edition. Chapters 3 & 4 (for data warehousing); Chapters 2, 5-7 (for data mining). Morgan Kaufmann 2005. (Prepublication chapters can be found as CS412Han Class Notes.)
  • More advanced topics
    • Data Warehousing:
      1. Prediction cubes, Chen, Chen, Lin, and Ramakrishnan, VLDB 2005. [pdf]
      2. High-dimensional OLAP: A minimal cubing approach, Li, Han, and Gonzalez, VLDB 2004. [pdf]
    • Data Mining:
      1. Approximate frequency counts over data streams, Manku and Motwani, VLDB 2002. [pdf]
      2. Graphs over time: Densification laws, shrinking diameters and possible explanations, Leskovec, Kleinberg, and Faloutsos, KDD 2005. [pdf]
      3. Mining compressed frequent-pattern sets, Xin, Han, Yan, and Cheng, VLDB 2005. [pdf]

III. Database Management Systems

  • Basic concepts
    • Hardware: disk sector, track, block, seek, latency, how to lay out a database page
    • Data modeling: ER, OO, and Object-Relational approaches
    • Concurrency control and recovery: ACID, serializability, two-phase locking, two-phase commit, logging and recovery, the impact of data replication
    • Theory: normalization, dependencies
    • Queries: access methods (hashing, B-trees, multidimensional access methods), how to optimize a query, SQL
    • Benchmarks: TPC-C and TPC-H
  • Background
    You can use any database textbook you like to study the most basic of the concepts listed above; for example, CS411 teaches these concepts. (Note that you will be expected to be able to demonstrate your understanding of the concepts by applying them (as opposed to simply being able to define them).) In the remaining entries, "RDS" refers to Stonebraker's Readings in Database Systems, currently in its 4th edition.
    • Generalized Search Trees for Database Systems, Hellerstein et al., VLDB 1995 and RDS. [pdf] We include this paper as the reference for multidimensional access methods; access methods based on B-trees and hashing should be covered in any database textbook.
    • New TPC Benchmarks for Decision Support and Web Commerce, Poess and Floyd, SIGMOD Record 29(4), December 2000. [pdf]
    • Inclusion of New Types in Relational Data Base Systems, Stonebraker, ICDE 1986 and RDS. [acm] We include this paper as your reference for understanding the impact of extensibility (as, for example, intended by the object-relational model) on a DBMS.
  • More advanced topics
    Please note that databases are a very broad field. The papers listed here will be changed frequently, to reflect this breadth.
    • Query processing
      • AutoAdmin 'What-if' Index Analysis Utility, Chaudhuri and Narasayya, SIGMOD 1998 and RDS. [acm]
      • RankSQL: Query Algebra and Optimization for Relational Top-k Queries, Li, Chang, Ilyas, and Song, SIGMOD 2005. [pdf]
      • Shoring up persistent applications, Carey et al., SIGMOD 1994. [acm]
    • Information integration
      • Accelerated focused crawling through online relevance feedback, Chakrabarti, Punera, and Subramanya, WWW 2002. [citeseer]
      • Answering Queries Using Views: A Survey, Levy. [ps]

IV. Bioinformatics

  • Basic Concepts
    • A list of 5-10 basic concepts will go here---we are still deciding which terms to include. The "background" required question of the qual tests knowledge of this set of concepts.
  • Background
    • Biological sequence analysis - probabilistic models of proteins and nucleic acids, by Durbin, Eddy, Krogh, and Mitchison. Read Chapters 2 (Pairwise alignment), 3 (Markov chains and hidden Markov models), and 7 (Building phylogenetic trees).
  • More advanced topics
    • Regulatory Element Detection Using a Probabilistic Segmentation Model, Bussemaker, Li, and Siggia, International Conference on Intelligent Systems in Molecular Biology, 2000. [acm]
    • Model-based clustering and data transformations for gene expression data, Yeung, Fraley, Murua, Raftery, and Ruzzo, Bioinformatics, October 2001. [citeseer]
    • MCALIGN: stochastic alignment of noncoding DNA sequences based on an evolutionary model of sequence evolution, Keightley and Johnson, Genome Research, March 2004. [gr]


DAIS - Database and Information Systems Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave., Urbana, IL 61801, USA.  Fax: 217-265-6494, Phone: 217-244-6241.