Snowball and QXtract: Scalable Information Extraction over Large Document Collections
Text documents (e.g., newspaper articles) often contain valuable
structured data that is "hidden" in regular English sentences. For many
applications, this data is best exploited if available as a relational
table that could be used to answer precise queries or to run data mining tasks.
Our project explores scalable techniques for extracting such tables
from large collections of (unstructured) documents (e.g., from a newspaper
archive). Specifically, we focus on developing unsupervised (or partially supervised) methods that require
minimal human effort to train:
- Snowball is a bootstrapping-based information extraction
system that requires only a handful of training examples of tuples of
interest. These examples are used to generate extraction patterns, which in turn result in new tuples
being extracted from the document collection (DL'00 paper; DMKD'00 paper; SIGMOD'01 demo; ISMB'03 paper).
- QXtract addresses the complementary problem of scaling
information extraction to large document collections. Information
extraction systems are often expensive (e.g., they might require several seconds to
process a single document), so it might not be feasible to sequentially scan all of the documents in
a collection (or on the web) and "feed" them to the extraction system.
Furthermore, such exhaustive inspection of all documents might not be needed,
since only a few documents might be relevant to a particular extraction task.
QXtract receives as input an arbitrary information extraction system,
together with a handful of training tuples. QXtract then uses
machine-learning tools to automatically generate queries that retrieve the
relevant documents in a collection (ICDE'03 paper; SIGMOD'03 demo).
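The Snowball bootstrapping loop described above can be illustrated with a minimal Python sketch. This is not the Snowball implementation: the corpus is invented, the regular expression `ENTITY` is a crude stand-in for a named-entity tagger, and a "pattern" is simply the literal text between the two fields of a known tuple. All function names (`find_patterns`, `apply_patterns`, `bootstrap`) are hypothetical.

```python
import re

# Toy corpus; the sentences are invented for illustration only.
DOCS = [
    "Microsoft is headquartered in Redmond.",
    "Exxon is headquartered in Irving.",
    "Boeing, based in Seattle, builds aircraft.",
    "Intel, based in Santa Clara, makes chips.",
    "Apple is headquartered in Cupertino.",
]

# Assumption: a run of capitalized words approximates a named entity.
ENTITY = r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"


def find_patterns(docs, tuples):
    """Collect the text found between the two fields of each known tuple."""
    patterns = set()
    for org, loc in tuples:
        for doc in docs:
            m = re.search(re.escape(org) + r"(.{1,30}?)" + re.escape(loc), doc)
            if m:
                patterns.add(m.group(1))
    return patterns


def apply_patterns(docs, patterns):
    """Extract (organization, location) pairs that match any pattern."""
    found = set()
    for middle in patterns:
        regex = re.compile(ENTITY + re.escape(middle) + ENTITY)
        for doc in docs:
            for org, loc in regex.findall(doc):
                found.add((org, loc))
    return found


def bootstrap(docs, seeds, max_rounds=5):
    """Alternate between pattern generation and tuple extraction."""
    tuples = set(seeds)
    for _ in range(max_rounds):
        new = apply_patterns(docs, find_patterns(docs, tuples)) - tuples
        if not new:  # stop when a round yields nothing new
            break
        tuples |= new
    return tuples


seed_tuples = {("Microsoft", "Redmond"), ("Boeing", "Seattle")}
result = bootstrap(DOCS, seed_tuples)
print(sorted(result))
```

Starting from two seed tuples, the loop learns two patterns and uses them to pick up the remaining three pairs in the toy corpus. A realistic system would also need to score patterns and tuples so that one bad pattern does not pollute later rounds; this sketch omits that step entirely.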
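The idea behind QXtract (learn queries that retrieve only the documents worth sending to an expensive extractor) can be sketched in Python as follows. This is an illustration under strong assumptions, not QXtract itself: the extractor is a hypothetical regular expression, the labeled sample is given directly rather than retrieved using training tuples, and a simple difference in document frequency replaces the machine-learning tools mentioned above.

```python
import re
from collections import Counter


def tokenize(text):
    """Return the set of lowercase words in a document."""
    return set(re.findall(r"[a-z]+", text.lower()))


def toy_extractor(doc):
    """Hypothetical stand-in for an expensive extraction system."""
    return re.findall(r"([A-Z][a-z]+) is headquartered in ([A-Z][a-z]+)", doc)


def learn_queries(sample_docs, extractor, num_queries=1):
    """Pick terms that are frequent in useful documents and rare in the rest."""
    useful = [d for d in sample_docs if extractor(d)]
    useless = [d for d in sample_docs if not extractor(d)]
    if not useful:
        return []
    useful_df = Counter(t for d in useful for t in tokenize(d))
    useless_df = Counter(t for d in useless for t in tokenize(d))

    def score(term):
        # Swapped-in scoring: difference of document-frequency fractions.
        return useful_df[term] / len(useful) - useless_df[term] / max(len(useless), 1)

    ranked = sorted(useful_df, key=lambda t: (-score(t), t))
    return ranked[:num_queries]


def search(collection, query):
    """Toy keyword search standing in for a real search interface."""
    return [d for d in collection if query in tokenize(d)]


# Invented sample and collection, for illustration only.
SAMPLE = [
    "Microsoft is headquartered in Redmond.",
    "Exxon is headquartered in Irving.",
    "The weather in Redmond is mild.",
    "Stocks fell sharply on Monday.",
]
COLLECTION = SAMPLE + [
    "Apple is headquartered in Cupertino.",
    "Rain is expected in Boston.",
]

queries = learn_queries(SAMPLE, toy_extractor)
candidates = [d for q in queries for d in search(COLLECTION, q)]
tuples = {t for d in candidates for t in toy_extractor(d)}
print(queries, len(candidates), sorted(tuples))
```

In this toy run the extractor processes only the three documents returned by the learned query instead of all six in the collection, which is the saving the approach aims for when each extraction call is expensive.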
Eugene Agichtein
(contact)
Luis Gravano
Mayer Crystal
Jeff Pavel
Viktoriya Sokolova
Vijay Sundaram
Aleksandr Voskoboynik
- Extracting Relations From Large Text Collections, Eugene Agichtein, Ph.D. Thesis, 2005
- Querying Text Databases for Efficient Information Extraction, Eugene Agichtein and Luis Gravano, in Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE), 2003 [errata]
- Modeling Query-Based Access to Text Databases, Eugene Agichtein, Panagiotis Ipeirotis, and Luis Gravano, in Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB), 2003
- QXtract: A Building Block for Efficient Information Extraction from Text Databases (demo), Eugene Agichtein and Luis Gravano, in Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, 2003
- Extracting Synonymous Gene and Protein Terms from Biological Literature, Hong Yu and Eugene Agichtein, in Proceedings of the 11th International Conference on Intelligent Systems for Molecular Biology (ISMB), 2003
- Snowball: A Prototype System for Extracting Relations from Large Text Collections (demo), Eugene Agichtein, Luis Gravano, Jeff Pavel, Viktoriya Sokolova, and Aleksandr Voskoboynik, in Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, 2001
- Snowball: Extracting Relations from Large Plain-Text Collections, Eugene Agichtein and Luis Gravano, in Proceedings of the 5th ACM International Conference on Digital Libraries (DL), 2000
- Combining Strategies for Extracting Relations from Text Collections, Eugene Agichtein, Eleazar Eskin, and Luis Gravano, in Proceedings of the 2000 ACM SIGMOD Workshop on Data Mining and Knowledge Discovery (DMKD), 2000
eugene@cs.columbia.edu