NTCIR-15 Dialogue Evaluation Task (DialEval-1)

Training Data of DialEval-1 Task

Recently, many researchers have been trying to build automatic helpdesk systems. However, there are very few methods for evaluating such systems. In DialEval-1, we aim to explore methods to evaluate task-oriented, multi-round, textual dialogue systems automatically. This dataset has the following features:

In DialEval-1, we treat the annotations as the ground truth, and participants are required to predict the nugget type of each turn (Nugget Detection, or ND) and the dialogue quality of each dialogue (Dialogue Quality, or DQ).

Links

Registration

To register and obtain the dataset, please send an email to dialeval1org@list.waseda.jp with the following information so that we can send you the training data.

Later, NII will require you to register for NTCIR tasks through their website, but please contact us by email first.

Leaderboard

Coming Soon

Training Data Overview

The Chinese training dataset contains 4,090 customer-helpdesk dialogues (3,700 for training + 390 for dev) crawled from Weibo. All of these dialogues are annotated by 19 annotators.

The English dataset contains 2,251 dialogues for training + 390 for dev. They were manually translated from a subset of the Chinese dataset, so the English dataset shares the same annotations as the Chinese dataset.

Annotators

We hired 19 Chinese students from the Department of Computer Science at Waseda University to annotate this dataset.

Format of the JSON file

Each file is in JSON format with UTF-8 encoding.

Following are the top-level fields:

Each element of the turns field contains the following fields:

Each element of annotations contains the following fields:
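Purely as an illustration (not the official schema), the sketch below loads one dialogue file and prints the field names it finds. It assumes that turns and annotations are top-level fields of each dialogue object; the file name and the single-object-vs-list handling are guesses, not part of the documented format.

```python
import json

# Minimal sketch: load one UTF-8 JSON file and inspect its structure.
# "dialogue.json" is a placeholder path; everything except the "turns" and
# "annotations" fields mentioned above is discovered at run time, not assumed.
with open("dialogue.json", encoding="utf-8") as f:
    data = json.load(f)

# The file may hold a single dialogue object or a list of dialogues.
dialogues = data if isinstance(data, list) else [data]

for dialogue in dialogues:
    print("top-level fields:", sorted(dialogue.keys()))
    for turn in dialogue["turns"]:               # one element per turn
        print("  turn fields:", sorted(turn.keys()))
    for annotation in dialogue["annotations"]:   # elements as described above
        print("  annotation fields:", sorted(annotation.keys()))
```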

Nugget Types

(Figure: definitions of the nugget types)

Dialogue Quality

Scale: [2, 1, 0, -1, -2]
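Quality scores are given on this five-point scale. As described in the Evaluation section below, the 19 annotators' scores for a dialogue are kept as a distribution rather than collapsed into a single gold label; a minimal sketch of building such a distribution (the example scores are made up):

```python
from collections import Counter

QUALITY_SCALE = [2, 1, 0, -1, -2]

def score_distribution(annotator_scores):
    """Convert per-annotator quality scores into a probability distribution over QUALITY_SCALE."""
    counts = Counter(annotator_scores)
    return [counts[s] / len(annotator_scores) for s in QUALITY_SCALE]

# Hypothetical scores from the 19 annotators for one dialogue.
scores = [1, 1, 0, 2, 1, 0, 0, -1, 1, 1, 0, 0, 1, 2, 0, 1, -1, 0, 1]
print(score_distribution(scores))  # -> approximately [0.105, 0.421, 0.368, 0.105, 0.0]
```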

Evaluation

Metrics

During the data annotation, we noticed that annotators' assessments of dialogues are highly subjective and hard to consolidate into a single gold label. Thus, we propose to preserve the diverse views in the annotations "as is" and leverage them when calculating the evaluation measures.

Instead of judging whether the estimated label equals the gold label, we compare the estimated distributions with the gold distributions calculated from the 19 annotators' annotations. Specifically, we employ the following metrics for the quality sub-task and the nugget sub-task:

For the details about the metrics, please visit:
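As an unofficial illustration of this distribution-matching idea (a sketch, not the task's evaluation scripts), the snippet below scores a hypothetical estimated distribution against a gold distribution using the Jensen-Shannon divergence:

```python
import numpy as np

def jensen_shannon_divergence(p, q, eps=1e-12):
    """Symmetric divergence (in bits) between two distributions over the same bins; 0 means identical."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical gold distribution (from the 19 annotators) over the scale [2, 1, 0, -1, -2],
# and a system's estimated distribution for the same dialogue.
gold = [2/19, 8/19, 7/19, 2/19, 0.0]
estimated = [0.10, 0.45, 0.35, 0.08, 0.02]
print(jensen_shannon_divergence(gold, estimated))  # smaller is better
```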

Test and Submission

Coming Soon

Tentative Schedule

Jul 2019: Test data crawling [DONE]
Aug-Oct 2019: Adding more English translations to the training data [DONE]
Oct 2019: Task registrations open
Oct-Dec 2019: Test data annotation
Jun 2020: Test data released / Task registrations due
Jul 2020: Run submissions due
Aug 2020: Evaluation results released
Dec 2020: NTCIR-15 Conference at NII, Tokyo, Japan

Timezone: Japan (UTC+9)

Have questions?

Please contact: dialeval1org@list.waseda.jp

Conditions and Terms

See https://dialeval-1.github.io/dataset/terms