Snowball and QXtract: Scalable Information Extraction over Large Document Collections

Text documents (e.g., newspaper articles) often contain valuable structured data that is "hidden" in regular English sentences. For many applications, this data is best exploited if available as a relational table that could be used to answer precise queries or to run data mining tasks.
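To make the idea of a table "hidden" in sentences concrete, here is a toy sketch in Python. It is not the project's method: the example sentences, the single hand-written pattern, and the name `extract_rows` are all made up for illustration. It only shows the kind of output meant here, namely rows of a relational table that a precise query can then be run against.

```python
import re

# Hypothetical example sentences; not taken from the project's data.
SENTENCES = [
    "Acme Corp, based in Springfield, reported higher earnings.",
    "Analysts expect Globex, based in Shelbyville, to expand.",
    "The weather was mild on Tuesday.",
]

# One hand-written pattern (an assumption for this sketch):
# "<Organization>, based in <Location>,"
# where each name is a run of capitalized words.
PATTERN = re.compile(
    r"((?:[A-Z][a-z]+ ?)+), based in ((?:[A-Z][a-z]+ ?)+),"
)


def extract_rows(sentences):
    """Return (organization, location) rows found by the toy pattern."""
    rows = []
    for sentence in sentences:
        for match in PATTERN.finditer(sentence):
            rows.append((match.group(1).strip(), match.group(2).strip()))
    return rows


rows = extract_rows(SENTENCES)

# A precise query over the extracted table:
# which organizations are located in Springfield?
in_springfield = [org for org, loc in rows if loc == "Springfield"]
```

A single fixed pattern like this one misses most phrasings, which is why it should be read only as an illustration of the target table, not as an extraction technique.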

Our project explores scalable techniques for extracting such tables from large collections of unstructured documents (e.g., a newspaper archive). Specifically, we focus on developing unsupervised, or partially supervised, methods that require minimal human effort to train.

People

Eugene Agichtein (contact)
Luis Gravano

Mayer Crystal
Jeff Pavel
Viktoriya Sokolova
Vijay Sundaram
Aleksandr Voskoboynik  

Publications
  1. Extracting Relations From Large Text Collections, Eugene Agichtein, Ph.D. Thesis, 2005

  2. Querying Text Databases for Efficient Information Extraction, Eugene Agichtein and Luis Gravano, in Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE), 2003 [errata]

  3. Modeling Query-Based Access to Text Databases, Eugene Agichtein, Panagiotis Ipeirotis, and Luis Gravano, in Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB), 2003

  4. QXtract: A Building Block for Efficient Information Extraction from Text Databases (demo), Eugene Agichtein and Luis Gravano, in Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, 2003

  5. Extracting Synonymous Gene and Protein Terms from Biological Literature, Hong Yu and Eugene Agichtein, in Proceedings of the 11th International Conference on Intelligent Systems for Molecular Biology (ISMB), 2003

  6. Snowball: A Prototype System for Extracting Relations from Large Text Collections (demo), Eugene Agichtein, Luis Gravano, Jeff Pavel, Viktoriya Sokolova, and Aleksandr Voskoboynik, in Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, 2001

  7. Snowball: Extracting Relations from Large Plain-Text Collections, Eugene Agichtein and Luis Gravano, in Proceedings of the 5th ACM International Conference on Digital Libraries (DL), 2000

  8. Combining Strategies for Extracting Relations from Text Collections, Eugene Agichtein, Eleazar Eskin, and Luis Gravano, in Proceedings of the 2000 ACM SIGMOD Workshop on Data Mining and Knowledge Discovery (DMKD), 2000

eugene@cs.columbia.edu