Huge volumes of (semi-)structured data, in formats such as graph, relational, XML, and RDF, are available in domains ranging from science to business.
Web search engines are leveraging (semi-)structured data sources to provide users with more effective and usable search experiences.
These data sets often have extremely complex schemas. With thousands of entity types and relationships, each
with a hundred or so attributes, ordinary users find it very difficult to explore the data and formulate queries.
As most users in these domains are not familiar with concepts such as schemas and
query languages, they need an easy-to-use query interface, such as keyword search.
However, it is hard to provide good search results for keyword queries over (semi-)structured data.
Since the query is not framed in terms of the data's actual structure, the challenge is how to find the data
most closely related to the user's query. A usable query interface should perform reasonably well over different
data sets and schemas, and process queries efficiently.
My research has been a quest to find principled approaches to keyword search over
combinations of large-scale (semi-)structured and unstructured data. Overall, I have identified
a set of intuitively desirable general principles that keyword search should obey, shown that
previous approaches do not obey those principles, and devised a new approach that does obey
them, based on statistical properties of the data.