Tutorial T10-F - Extracting, Searching and Mining Semantic Annotations on the Web

Soumen Chakrabarti, IIT Bombay


Automatic, large-scale semantic annotation of unstructured Web sources is vital to next-generation search and mining over entities and relations. The last decade has seen substantial research advances, starting from the early WebKB vision of developing a probabilistic, symbolic knowledge base mirroring the contents of the Web. WebKB started with the relatively modest goal of classifying whole pages into categories such as student, faculty, department, and course. The WebKB proposal was followed by a rich literature on identifying not whole pages but segments of text tokens as references to real-world entities from a predetermined set of entity types. The set of types may include persons, places, organizations, dates, paper titles, authors, and conference venues. Progress was also made on the task of identifying whether entities of specified types are related in a given way, e.g., author A wrote book B, company C1 acquired company C2, or person P joined organization O. The next wave of innovations involved open-domain identification and typing of entity mentions, exploiting Hearst patterns of the form “Jordan and other statisticians” as evidence that some person named Jordan is a statistician. The next step was to identify open-domain binary relations from simple lexical patterns such as noun phrase, verb phrase, noun phrase, as in “John Baird invented the television”. More recently, efforts have been under way to associate token segments on Web pages with unique entity IDs, such as Wikipedia URNs. This tutorial will give an overview of these recent technologies, with emphasis on the machine learning and data mining techniques involved. Then we will explore how such entity annotations can assist search, both free-form and especially semistructured.


Soumen Chakrabarti is Associate Professor of Computer Science at IIT Bombay. He received a PhD from the University of California, Berkeley, in 1996, and was a Research Staff Member at IBM Almaden from 1996 to 1999. During Spring 2004 he was Visiting Associate Professor in the School of Computing, Carnegie Mellon University. He has published extensively at conferences like WWW, SIGIR, EMNLP/HLT, SIGKDD, VLDB, and SIGMOD, and has also served frequently as vice chair or program committee member. He is program chair of WSDM 2008 and WWW 2010. His paper on Focused Crawling got the Best Paper award at WWW 1999. He coauthored the best student paper at ECML 2008. Other papers have been invited to Scientific American, IEEE Computer, and VLDB Journal. As of 2006, his papers have over 1000 citations in CiteSeer. Google Scholar shows 790, 665, and 503 citations to selected papers. He has presented many tutorials at WWW, SIGMOD, VLDB, and SIGKDD. These have led to a successful textbook, Mining the Web, published in 2002, with a second edition currently in progress. He has eight granted US patents. At IIT Bombay, he has obtained research funding and/or gifts from Yahoo Research, HP Labs, Microsoft Research, IBM Research, GE Research, University of California, Tata Consultancy Services, and NEC Research, and delivered working software systems in several cases. He has (co)advised two PhD students and over 20 master's students.
For more details, please visit http://www.cse.iitb.ac.in/~soumen/ and http://www.cse.iitb.ac.in/~soumen/main/ceevee.html