KB Population
OKBQA-7 Hackathon
2018.08.07 - 2018.08.08
Task 4. KB Population
Introduction
A knowledge base is a database that stores the expert knowledge, rules, and facts accumulated from intellectual activities and experience in the domain where an AI agent is used for problem solving. It is an important component that affects the performance of many application systems based on natural language processing. Typical knowledge bases such as Wikidata, Freebase, WordNet, YAGO, Cyc, and BabelNet are widely used for English.
However, building and maintaining such a massive knowledge base manually is very difficult in practice. Entity Linking and Discovery (ELD), which detects entities (people, events, places) in texts such as Wikipedia articles and daily news, and Relation Extraction (RE), which extracts the relationships between those entities, are therefore essential technologies for automatically expanding a knowledge base (KB Population).
In this task, our goal is to develop models that learn to extract knowledge from a given sentence. In particular, this hackathon aims to solve the following problems:
[SubTask 4.1] Entity Linking and Discovery
[SubTask 4.2] Noise Reduction Methods for Distant Supervision
Goals
[SubTask 4.1]
Entity linking is the task of finding mentions in text and connecting them to the knowledge base, where a mention is a span of text that refers to a specific entity. Continuously discovering new entities (dark entities) in news, blogs, and other written material is also important for a more complete knowledge base. This subtask has three key challenges: 1) entity mention detection, 2) entity disambiguation and linking, and 3) entity discovery and registration.
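For illustration only, the sketch below walks through challenges 2) and 3), assuming that challenge 1) has already produced mention spans (e.g. from a BILOU sequence tagger). The alias table, entity identifiers, and mention list are made up and are not part of the task data.

# Toy sketch of entity linking and dark-entity discovery (hypothetical data).
ALIAS_TABLE = {
    "서울": ["ko.dbpedia.org/resource/서울특별시"],
    "오바마": ["ko.dbpedia.org/resource/버락_오바마"],
}

def link_or_discover(surface: str) -> str:
    """2) Disambiguation and linking, 3) dark-entity discovery.
    Takes the first candidate; a real linker would score candidates
    against the sentence context. Unknown surfaces become new (dark)
    entities to be registered in the knowledge base."""
    candidates = ALIAS_TABLE.get(surface, [])
    return candidates[0] if candidates else f"NEW_ENTITY:{surface}"

# Detected mention spans: (start token, end token, surface form).
mentions = [(0, 1, "오바마"), (3, 4, "서울"), (6, 7, "포켓몬고")]  # last one unknown
for start, end, surface in mentions:
    print((start, end), surface, "->", link_or_discover(surface))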
[SubTask 4.2]
Learning a knowledge extraction model that can automatically build and expand a knowledge base requires large-scale training data. However, building such massive training data by hand is uneconomical and practically impossible. In this subtask, we consider how to generate appropriate training data automatically and/or how to use such automatically labeled data to improve the performance of existing relation extraction models.
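As an illustration of how distant supervision produces (noisy) labels, the sketch below applies the usual assumption that any sentence mentioning both entities of a KB triple expresses that triple's relation. The triples and sentences are made up.

# Minimal distant supervision sketch (hypothetical triples and sentences).
KB_TRIPLES = [
    ("버락_오바마", "출생지", "호놀룰루"),
    ("서울특별시", "국가", "대한민국"),
]

SENTENCES = [
    "버락 오바마 는 호놀룰루 에서 태어났다 .",
    "버락 오바마 는 호놀룰루 를 방문 했다 .",   # mentions both entities but not the relation
]

def distant_label(sentence: str):
    """Label the sentence with every KB triple whose subject and object
    both appear in it; such labels are inherently noisy."""
    return [
        (subj, rel, obj)
        for subj, rel, obj in KB_TRIPLES
        if subj.replace("_", " ") in sentence and obj in sentence
    ]

for s in SENTENCES:
    print(s, "->", distant_label(s))
# The second sentence receives the 출생지 label even though it does not
# express the birthplace relation; this noise is what SubTask 4.2 targets.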
Data Formats
Training data for Entity Linking and Discovery
Training data is automatically labeled using the linked (anchor) text of Korean Wikipedia (To be announced); we therefore call it the Silver Standard Entity Corpus. A hypothetical sketch of this labeling idea is given below.
Testing data is manually labeled by human crowdsourcing over 500 paragraphs of Korean Wikipedia.
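Since the corpus format is still to be announced, the following is only a hypothetical sketch of how silver-standard entity labels can be derived from Wikipedia link markup ([[target|surface]]); it is not the actual corpus-building code.

import re

# Wiki link: [[target]] or [[target|surface form]].
WIKILINK = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]")

def silver_labels(wikitext: str):
    """Turn each wiki link into a (surface form, linked entity) pair."""
    labels = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1)
        surface = match.group(2) or target
        labels.append((surface, target))
    return labels

text = "[[버락 오바마|오바마]]는 [[호놀룰루]]에서 태어났다."
print(silver_labels(text))
# [('오바마', '버락 오바마'), ('호놀룰루', '호놀룰루')]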
Training data for Relation Extraction
Training data is automatically labeled using the Distant Supervision method (To be announced); we therefore call it the Noisy Relation Corpus.
Testing data is manually labeled by human crowdsourcing over 500 paragraphs of Korean Wikipedia.
Data Location
To use the data below, you need to apply for an account with the OKBQA server.
Please contact ...(To be announced)
Development
There is no restriction on the programming language used for model development.
Evaluation
Each subtask is evaluated in the following two settings; a minimal scoring sketch follows the list.
- Training with the Silver Standard Entity Corpus, evaluation with the Gold Standard Entity Corpus
- Training with the Noisy Relation Corpus, evaluation with the Gold Standard Relation Corpus
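The official scoring script is not specified here; as a reference point, the sketch below computes micro-averaged precision, recall, and F1 over exact matches against a gold-standard corpus. The spans and entity identifiers are made up, and the actual metric may differ.

def prf(predicted, gold):
    """Micro-averaged precision/recall/F1 over exact-match items."""
    pred, gold = set(predicted), set(gold)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [(0, 1, "ko.dbpedia.org/resource/삼성전자"), (2, 3, "ko.dbpedia.org/resource/서울특별시")]
pred = [(0, 1, "ko.dbpedia.org/resource/삼성전자"), (4, 5, "ko.dbpedia.org/resource/대한민국")]
print(prf(pred, gold))   # (0.5, 0.5, 0.5)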
Reference
Sentence information analyzed with a natural language processing tool: LBoxed sentences (JSON)
DBpedia
Entity-type information RDF triples: instance_types_ko.ttl
Sentence information with BILOU tagging (see the example after this list).
FrameNet
1.5, 1.6, 1.7
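For reference, a short illustration of the BILOU scheme (Begin / Inside / Last / Outside / Unit) on a made-up tokenized sentence; this is not actual corpus data.

# BILOU tags on a toy sentence (illustrative only).
tagged = [
    ("버락",     "B-PER"),  # first token of a multi-token entity
    ("오바마",   "L-PER"),  # last token of that entity
    ("는",       "O"),      # outside any entity
    ("호놀룰루", "U-LOC"),  # single-token (unit) entity
    ("에서",     "O"),
    ("태어났다", "O"),
]
for token, tag in tagged:
    print(f"{token}\t{tag}")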