Multimodal Character Identification on Videos

OKBQA-7 Hackathon

2018.08.07 - 2018.08.08

Task 3. Multimodal Character Identification on Videos

Task Definition

This task aims to link each mention in a dialogue to the character it refers to, based on the dialogue text and the corresponding video. A mention is a nominal referring to a person (e.g., she, mom, Judy), and an entity is a character in the dialogue.


Character identification on text has been studied on the Friends dataset and has shown practical performance for identifying main characters (Chen et al., 2017; Choi & Chen, 2018). However, these studies framed the problem as entity linking over pre-defined characters, so the resulting modules cannot be applied to scripts other than Friends unless they are re-trained on newly constructed data. To be applicable to arbitrary dialogue or video scripts, this task should instead be approached as coreference resolution. One study introduces a coreference-resolution-based approach for this task (Chen et al., 2017), but coreference resolution is a difficult problem in NLP, and the reported performance is not yet practical (F1: 57.46% for 9 main characters).

Therefore, we expand the task to take not only dialogue text but also video as input; by exploiting these richer features, performance should improve to a practical level. This task is an extension of SemEval-2018 Task 4, with two main extensions. First, it adds multimodality by using video as an input. Second, the final module of this task can be applied to arbitrary dialogue or video scripts.

Task Organizers


Datasets

The first two seasons of the TV show Friends are annotated for this task. Each season consists of episodes, each episode comprises scenes, and each scene is segmented into sentences. The following datasets are distributed:

    • friends.train.episode_delim.conll: the training data where each episode is considered a document.

    • friends.test.episode_delim.conll: the test data where each episode is considered a document.

No dedicated development set is distributed for this task; feel free to create your own development set for training or perform cross-validation on the training set. The data can be downloaded from GitHub.

Data Format

All datasets follow the CoNLL 2012 Shared Task data format. Documents are delimited by comments in the following format:

#begin document (<Document ID>)[; part ###]


#end document

Each sentence is delimited by a new line ("\n") and each column indicates the following:

1. Document ID: /<name of the show>-<season ID><episode ID> (e.g., /friends-s01e01).

2. Scene ID: the ID of the scene within the episode.

3. Token ID: the ID of the token within the sentence.

4. Word form: the tokenized word.

5. Part-of-speech tag: the part-of-speech tag of the word (auto-generated).

6. Constituency tag: the Penn Treebank style constituency tag (auto-generated).

7. Lemma: the lemma of the word (auto-generated).

8. Frameset ID: not provided (always _).

9. Word sense: not provided (always _).

10. Speaker: the speaker of this sentence.

11. Named entity tag: the named entity tag of the word (auto-generated).

12. Start time: the start time of the sentence in the video, in milliseconds.

13. End time: the end time of the sentence in the video, in milliseconds.

14. Video file: the file name of the pickle object containing the pre-processed sequence of image frames from the video corresponding to the sentence (the pickle object will be released on 08/01).

15. Entity ID: the entity ID of the mention, which is consistent across all documents.

You can check details and examples on GitHub.
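As a rough illustration, the 15-column layout above can be read with a short Python sketch. This is not an official reader; it assumes whitespace-separated columns and the #begin/#end document delimiters described under Data Format:

```python
def read_conll(path):
    """Yield one document at a time as a list of sentences,
    where each sentence is a list of 15-column token rows."""
    doc, sent = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("#begin document"):
                # Start of a new document: reset the buffers.
                doc, sent = [], []
            elif line.startswith("#end document"):
                if sent:
                    doc.append(sent)
                yield doc
            elif not line.strip():
                # Blank line: end of the current sentence.
                if sent:
                    doc.append(sent)
                sent = []
            else:
                sent.append(line.split())
```

Each token row is then a list of 15 strings, so, for example, row[3] is the word form and row[14] is the entity ID column.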


You can use friends.train.episode_delim.conll as the training input and friends.test.episode_delim.conll as the test input.

Output and Evaluation

Your output must consist of the entity ID of each mention, one per line, in sequential order. You can check details and examples on GitHub.
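A minimal writer for this output format might look like the following (write_predictions is a hypothetical helper name, not part of any distributed evaluation kit):

```python
def write_predictions(entity_ids, path):
    """Write the predicted entity ID for each mention,
    one per line, in mention order."""
    with open(path, "w", encoding="utf-8") as f:
        for eid in entity_ids:
            f.write(f"{eid}\n")
```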

Given this output, the evaluation script will measure:

    1. The label accuracy considering only 7 entities: the 6 main characters (Chandler, Joey, Monica, Phoebe, Rachel, and Ross), with all other characters grouped as one entity.

    2. The macro average of the F1 scores of the 7 entities.

    3. The label accuracy considering all entities, where characters not appearing in the training data are grouped as one entity, "others".

    4. The macro average of the F1 scores of all entities.

    5. The F1 scores for the 7 entities.

    6. The F1 scores for all entities.
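Metrics 1 and 2 can be sketched as follows. This is only an illustration, not the official evaluation script; the set of six main-character entity IDs (main_ids) is an assumption supplied by the caller:

```python
def to_seven(labels, main_ids):
    """Collapse entity IDs to the 7-class setting: the six main
    characters keep their own ID, everyone else becomes 'others'."""
    return [l if l in main_ids else "others" for l in labels]

def accuracy(gold, pred):
    """Label accuracy: fraction of mentions whose predicted ID matches gold."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Macro average of per-class F1 scores over all observed labels."""
    labels = set(gold) | set(pred)
    f1s = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(p == c and g != c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

The all-entity metrics (3 and 4) follow the same pattern, except that only characters unseen in the training data are collapsed into "others".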