OKBQA-7 Hackathon

2018.08.07 - 2018.08.08

Task 4. KB Population


A knowledge base is a database that stores the expert knowledge, rules, and facts accumulated from intellectual activity and experience in the field where an AI agent is used for problem solving. It is an important element affecting the performance of many application systems built on natural language processing. Typical knowledge bases such as Wikidata, Freebase, WordNet, YAGO, Cyc, and BabelNet are widely used for English.

However, building and maintaining such a massive knowledge base manually is very difficult in practice. Entity Linking and Discovery (ELD), which detects entities (people, events, places) in text such as Wikipedia articles and daily news, and Relation Extraction (RE), which extracts the relationships between those entities, are therefore key technologies for automatically expanding a knowledge base (KB Population).

In this task, our goal is to develop models that learn and extract knowledge from a given sentence. In particular, this hackathon addresses the following problems.

  • [SubTask 4.1] Entity Linking and Discovery

  • [SubTask 4.2] Noise Reduction Methods for Distant Supervision


[SubTask 4.1]

Entity linking is the task of finding mentions and connecting them to the knowledge base. A mention is a span of text that refers to a specific entity. In addition, continuously discovering new entities (dark entities) in news, web blogs, and other written material is important for making the knowledge base more complete. This subtask has three key challenges: 1) entity mention detection, 2) entity disambiguation and linking, and 3) entity discovery and registration.
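
The three challenges above can be sketched as a toy pipeline. This is only an illustration under stated assumptions: the dictionary-style knowledge base, the capitalization heuristic for mention detection, the prior-score disambiguation, and the `NEW_` identifier prefix are all hypothetical and are not part of the task definition.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical KB layout: surface form -> {candidate entity ID: prior score}.
KnowledgeBase = Dict[str, Dict[str, float]]


def detect_mentions(tokens: List[str]) -> List[Tuple[int, int]]:
    """Step 1 (mention detection): treat each capitalized token as a mention span."""
    return [(i, i + 1) for i, tok in enumerate(tokens) if tok[:1].isupper()]


def link_mention(surface: str, kb: KnowledgeBase) -> Optional[str]:
    """Step 2 (disambiguation and linking): pick the candidate with the highest prior."""
    candidates = kb.get(surface)
    if not candidates:
        return None  # no candidate in the KB: possibly a dark entity
    return max(candidates, key=candidates.get)


def register_dark_entity(surface: str, kb: KnowledgeBase) -> str:
    """Step 3 (discovery and registration): add an unseen entity to the KB."""
    new_id = f"NEW_{surface}"  # hypothetical ID scheme
    kb[surface] = {new_id: 1.0}
    return new_id


def run_pipeline(tokens: List[str], kb: KnowledgeBase) -> List[Tuple[str, str]]:
    """Run the three steps and return (surface form, entity ID) pairs."""
    results = []
    for start, end in detect_mentions(tokens):
        surface = " ".join(tokens[start:end])
        entity = link_mention(surface, kb)
        if entity is None:
            entity = register_dark_entity(surface, kb)
        results.append((surface, entity))
    return results
```

For example, with a KB that knows only `Seoul` and `KAIST`, running the pipeline on the tokens of "Mina visited Seoul and KAIST" links the two known mentions and registers `Mina` as a new entity. A real system would replace each step with a learned model.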

[SubTask 4.2]

Training a knowledge extraction model that automatically builds and expands a knowledge base requires large-scale training data. However, building massive training data manually is uneconomical and nearly impossible. In this subtask, we consider how to generate appropriate training data automatically and/or how to use such automatically labeled training data to improve the performance of existing relation extraction models.
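
To illustrate where the noise comes from, here is a minimal sketch of distant supervision labeling, assuming the common heuristic that any sentence containing both entities of a KB triple expresses that triple's relation. The triple, the relation name `capitalOf`, and the substring matching are illustrative assumptions, not the procedure used to build the corpus for this task.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]            # (subject, relation, object)
LabeledSentence = Tuple[str, str, str, str]  # (sentence, subject, relation, object)


def distant_label(sentences: List[str], triples: List[Triple]) -> List[LabeledSentence]:
    """Label every sentence that mentions both arguments of a KB triple."""
    labeled = []
    for sentence in sentences:
        for subj, rel, obj in triples:
            # Naive substring match stands in for real entity linking.
            if subj in sentence and obj in sentence:
                labeled.append((sentence, subj, rel, obj))
    return labeled


triples = [("Seoul", "capitalOf", "South Korea")]
sentences = [
    "Seoul is the capital of South Korea.",       # expresses the relation
    "Seoul is the largest city in South Korea.",  # labeled too, but noisy
]
labeled = distant_label(sentences, triples)
```

Both sentences receive the `capitalOf` label, although only the first one states it. Noise reduction methods aim to detect or down-weight instances like the second.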

Data Formats

  • Training data for Entity Linking and Discovery

The training data is labeled automatically from the linked text of Korean Wikipedia (to be announced). We therefore call it the Silver Standard Entity Corpus.

The testing data is labeled manually through crowdsourcing on 500 paragraphs of Korean Wikipedia.

  • Training data for Relation Extraction

The training data is labeled automatically using the Distant Supervision method (to be announced). We therefore call it the Noisy Relation Corpus.

The testing data is labeled manually through crowdsourcing on 500 paragraphs of Korean Wikipedia.

Data Location

  • To use the data below, you need to apply for an account with the OKBQA server.

  • Please contact ...(To be announced)




Entity Linking & Discovery

Silver Standard Entity Corpus (Training)


Entity Linking & Discovery

Gold Standard Entity Corpus (Testing)


Relation Extraction

Noisy Relation Corpus (Training)


Relation Extraction

Gold Standard Relation Corpus (Testing)



  • There is no restriction on the language type used for model development.


  • Models for each task are evaluated in the following two settings.

    - Learning with Silver Standard Entity Corpus, and Evaluation with Gold Standard Entity Corpus

    - Learning with Noisy Relation Corpus, and Evaluation with Gold Standard Relation Corpus
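
The evaluation metric is not stated here. As a sketch only, and assuming precision, recall, and F1 over exact matches against the gold standard corpus, scoring could look like the following. The function name and the tuple layout of the annotations are hypothetical.

```python
from typing import Hashable, Set, Tuple


def precision_recall_f1(predicted: Set[Hashable], gold: Set[Hashable]) -> Tuple[float, float, float]:
    """Score predictions against gold annotations by exact match."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotation layout: (sentence ID, start token, end token, entity ID).
gold = {(1, 0, 1, "Seoul"), (1, 4, 5, "KAIST"), (2, 0, 1, "Busan"), (2, 3, 4, "Korea")}
predicted = {(1, 0, 1, "Seoul"), (1, 4, 5, "KAIST"), (2, 0, 1, "Pusan")}
p, r, f1 = precision_recall_f1(predicted, gold)
```

The same function applies to relation extraction if each item is a (sentence ID, subject, relation, object) tuple.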


  • Sentence information analyzed with a natural language processing tool: LBoxed sentences (JSON)

  • DBpedia

    • Entity-type information RDF triples: instance_types_ko.ttl

  • Sentence information with BILOU tagging.

  • FrameNet

    • 1.5, 1.6, 1.7
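
The BILOU tagging mentioned in the resource list marks each token as Beginning, Inside, or Last of a multi-token mention, Outside any mention, or a Unit-length mention. The file layout of the provided data is not described here, so the following is only a sketch of the scheme itself, assuming mentions are given as hypothetical (start, end, label) token spans with an exclusive end index.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, label)


def spans_to_bilou(num_tokens: int, spans: List[Span]) -> List[str]:
    """Convert non-overlapping mention spans into one BILOU tag per token."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"        # single-token mention
        else:
            tags[start] = f"B-{label}"        # first token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"        # middle tokens
            tags[end - 1] = f"L-{label}"      # last token
    return tags


tokens = ["Mina", "visited", "New", "York", "City"]
tags = spans_to_bilou(len(tokens), [(0, 1, "PER"), (2, 5, "LOC")])
```

Compared with plain BIO tags, the explicit L and U tags give a sequence model a direct signal for where a mention ends.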