Open Information Extraction

NestIE: Open-Domain Information Extraction for Nested Assertions

Open Information Extraction (OpenIE) systems extract tuple-assertions of natural language expressions from massive textual corpora without requiring a vocabulary or relation-specific training data. The first-generation OpenIE systems suffer from two major drawbacks: a) binary tuples are often insufficient to capture relations that have additional attributes such as reification, conditionals, and multiple arguments, and b) argument and relation boundaries are determined heuristically, often resulting in over-specific or under-specific assertions.

Our system, NestIE, addresses both issues: the limited expressiveness and the imprecise granularity of OpenIE tuple-assertions. It uses more expressive nest-tuples to represent the basic propositions asserted by a sentence, and it employs bootstrapping techniques to learn domain-independent extraction patterns for the nest-tuples.

See:
Nikita Bhutani, H. V. Jagadish, Dragomir Radev, Nested Propositions in Open Information Extraction. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
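
To make the idea of a nest-tuple concrete, here is a minimal sketch in which an argument of an assertion may itself be an assertion. The class name, field names, and example sentence are illustrative assumptions; this is not NestIE's actual data format or extraction code.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Assertion:
    """Hypothetical nested tuple: each argument is a string or another Assertion."""
    subject: Union[str, "Assertion"]
    relation: str
    obj: Union[str, "Assertion"]

    def depth(self) -> int:
        # A flat binary tuple has depth 1; each level of nesting adds 1.
        inner = [a.depth() for a in (self.subject, self.obj)
                 if isinstance(a, Assertion)]
        return 1 + max(inner, default=0)

    def render(self) -> str:
        def show(arg):
            return arg.render() if isinstance(arg, Assertion) else arg
        return f"({show(self.subject)}; {self.relation}; {show(self.obj)})"


# Made-up sentence: "The study reported that the drug reduces fever."
inner = Assertion("the drug", "reduces", "fever")
outer = Assertion("the study", "reported that", inner)
print(outer.render())  # (the study; reported that; (the drug; reduces; fever))
```

A flat binary tuple would have to either drop the reporting context or fold the whole clause into one over-specific argument string; the nested form keeps both propositions addressable.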

NeurON: Knowledge Base Construction using Conversational Question-Answer Datasets

One interesting source of unstructured data that has been largely ignored for constructing knowledge bases (KBs) is conversational question-answer (cQA) data. A number of e-commerce websites, such as Amazon and eBay, provide community question answering systems where users can ask and answer product-related questions. Research on extracting information from cQA datasets is sparse. Rule-based methods that operate over individual sentences ignore the context and discourse needed to understand a question-answer pair.

We develop NeurON, an end-to-end system for extracting tuples from question-answer pairs. It uses a novel multi-encoder constrained-decoder that explicitly models both the question and the answer of a QA pair. It incorporates vocabulary and syntax as hard constraints and prior knowledge as soft constraints in the decoder.

See:
Nikita Bhutani, Yoshihiko Suhara, Wang-Chiew Tan, Alon Halevy, H. V. Jagadish, Open Information Extraction from Question-Answer Pairs. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019.
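
As a rough illustration of what a hard vocabulary constraint in a decoder can look like, the sketch below masks every candidate token outside an allowed set before picking the best one. It assumes the allowed set is simply the words of the question and answer, which the description above does not state; the function names and scores are hypothetical, and this is not NeurON's decoder.

```python
import math


def allowed_vocabulary(question, answer, extra=()):
    """Assumed constraint: output tokens must come from the QA pair (plus extras)."""
    return set(question.lower().split()) | set(answer.lower().split()) | set(extra)


def constrained_argmax(scores, allowed):
    """Greedy decoding step: pick the best-scoring token among the allowed ones."""
    if not any(tok in allowed for tok in scores):
        raise ValueError("no candidate token satisfies the constraint")
    # Disallowed tokens get a score of -inf, so they can never be selected.
    masked = {tok: (s if tok in allowed else -math.inf)
              for tok, s in scores.items()}
    return max(masked, key=masked.get)


# Made-up decoder scores for one output position.
scores = {"mattress": 2.5, "fits": 1.9, "queen": 1.2}
allowed = allowed_vocabulary("does it fit a queen bed", "yes it fits")
print(constrained_argmax(scores, allowed))  # fits
```

The unconstrained choice would be "mattress", which appears in neither the question nor the answer; the mask rules it out regardless of its score. A soft constraint would instead adjust the scores without forbidding any token.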

Canonicalization

LUSTRE: Learning Structured Representations of Named Entities using Active Learning

Fundamental to knowledge-centric applications, such as knowledge base population and question answering, is the need to identify named entities from their textual mentions. However, entities lack a unique representation and their mentions can differ greatly. These variations arise in complex ways that cannot be captured using textual similarity metrics. Entities, however, have underlying structures, typically shared by entities of the same entity type, that can help reason over their name variations.

Our system, LUSTRE, is an active-learning-based framework that drastically reduces the labeled data required to learn the structures of entities. From human-comprehensible labels, it automatically synthesizes programs for mapping entity mentions to their structures.

See:
Nikita Bhutani, Kun Qian, Yunyao Li, H. V. Jagadish, Mauricio A. Hernández, Mitesh Vasa, Exploiting Structure in Representation of Named Entities using Active Learning. In 27th International Conference on Computational Linguistics (COLING), 2018.
Kun Qian, Nikita Bhutani, Yunyao Li, H. V. Jagadish, Mauricio A. Hernández, LUSTRE: An Interactive System for Entity Structured Representation and Variant Generation. In 34th IEEE International Conference on Data Engineering (ICDE), 2018.
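
To show what mapping a mention to a structure can mean in practice, the following hand-written sketch parses company-name mentions into a name and a normalized legal suffix. LUSTRE synthesizes such mapping programs from labels; the suffix table, field names, and example mentions here are illustrative assumptions, not output of the system.

```python
import re

# Hypothetical normalization table for legal suffixes.
SUFFIXES = {
    "inc": "Inc", "incorporated": "Inc",
    "corp": "Corp", "corporation": "Corp",
    "ltd": "Ltd", "limited": "Ltd",
}


def parse_company(mention):
    """Map a company mention to a small structured representation."""
    tokens = re.sub(r"[.,]", "", mention).split()
    suffix = None
    if tokens and tokens[-1].lower() in SUFFIXES:
        suffix = SUFFIXES[tokens.pop().lower()]
    return {"name": " ".join(tokens), "suffix": suffix}


# Two surface variants map to the same structure.
print(parse_company("Acme Corporation"))
print(parse_company("Acme Corp."))
```

Once mentions share a structure, variants can be matched by comparing fields rather than relying on string similarity between the raw mentions.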

Open Knowledge-Based Question Answering

Online Schemaless Querying of Heterogeneous Open Knowledge Bases

Advances in open information extraction (OpenIE) have provided an alternative source of knowledge, which is otherwise curated manually or collaboratively. However, being automatically acquired, such knowledge is often unnormalized and noisy. Since there is no unique representation of knowledge in open KBs and in questions, open KB-QA systems typically can only support questions with limited semantics.

Our system answers multi-constraint questions using heterogeneous open KBs with varied representations. We devise a generic alignment-based algorithm that extracts answers from heterogeneous tuple-assertions by resolving lexical/structural inconsistencies in queries and assertions at runtime.

Hybrid knowledge-based question answering over open and curated knowledge bases

KB-QA on open KBs derived from unstructured data and KB-QA on curated, structured KBs have evolved rather independently. Much work either assumes a single query model or a uniform knowledge representation for inference over the two types of KBs. Consequently, the information from a specific knowledge source gets under-utilized or under-represented.

We develop TextRay, which decouples the two types of querying mechanisms and fully exploits the two knowledge sources through effective query decomposition and query planning. We devise methods to decompose a complex question into smaller sub-questions, identify a target knowledge source to execute each sub-question, and efficiently evaluate and combine the results of different sub-questions.
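
The decompose, route, and combine steps can be sketched with toy data as follows. The two dictionaries standing in for a curated KB and an open KB, the entity names, the plan format, and the use of set intersection to combine results are all assumptions made for illustration; this is not TextRay's query planner.

```python
# Toy stand-ins for the two knowledge sources (all entries are made up).
curated_kb = {
    ("directed_by", "Director X"): {"Film A", "Film B", "Film C"},
}
open_kb = {
    ("was filmed in", "City Y"): {"Film B", "Film D"},
}
sources = {"curated": curated_kb, "open": open_kb}

# A complex question decomposed into sub-questions, each routed to one source:
# "Which films directed by Director X were filmed in City Y?"
plan = [
    ("curated", ("directed_by", "Director X")),
    ("open", ("was filmed in", "City Y")),
]


def execute(plan, sources):
    """Evaluate each sub-question on its source and intersect the answer sets."""
    results = None
    for source_name, sub_query in plan:
        answers = sources[source_name].get(sub_query, set())
        results = answers if results is None else results & answers
    return results if results is not None else set()


print(execute(plan, sources))  # {'Film B'}
```

Neither source can answer the full question alone in this toy setup: the curated KB lacks filming locations, and the open KB lacks director information.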

On-demand curation and integration

Structured data such as in relational databases and knowledge bases can often be incomplete. Consequently, queries outside their semantic coverage will fail. We investigate how to efficiently fill these knowledge gaps in an on-demand manner using the abundant textual resources typically available on the Web. Unlike in open-domain extraction and reasoning, the information needs are more specific. The extracted knowledge must be compatible with the structured data. Our goal is to design curation methods that are highly accurate and efficient.

Selected Publications

  • Open Information Extraction from Question-Answer Pairs
    Nikita Bhutani, Yoshihiko Suhara, Wang-Chiew Tan, Alon Halevy, H. V. Jagadish
    Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019
  • Exploiting Structure in Representation of Named Entities using Active Learning
    Nikita Bhutani, Kun Qian, Yunyao Li, H. V. Jagadish, Mauricio A. Hernández, Mitesh Vasa
    27th International Conference on Computational Linguistics (COLING), 2018
  • LUSTRE: An Interactive System for Entity Structured Representation and Variant Generation
    Kun Qian, Nikita Bhutani, Yunyao Li, H. V. Jagadish, Mauricio A. Hernández
    34th IEEE International Conference on Data Engineering (ICDE), 2018
  • Nested Propositions in Open Information Extraction
    Nikita Bhutani, H. V. Jagadish, Dragomir Radev
    Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016

Patents

  • Entity Structured Representation and Variant Generation
    Nikita Bhutani, Yunyao Li, Mauricio A. Hernández, Kun Qian, Min Li
    Patent Pending

Working Papers

  • Online Schemaless Querying of Heterogeneous Open Knowledge Bases
    Nikita Bhutani, H. V. Jagadish
    Under Review
  • Hybrid Knowledge-Based Question Answering for Compositional Questions
    Nikita Bhutani, Xinyi Zheng, H. V. Jagadish
    In Submission