The specific description of Task 1 is as follows: build a system that classifies a user's input into the most relevant category, covering chit-chat and task subcategories, e.g.,
• What have you done recently? (你最近干嘛呢?) - chat
• What's the big news? (有什么重大新闻?) - news
• I want to read free novels. (我要看免费的小说) - novel
In Task 1, participating teams need not consider the overall intent across multiple rounds of a task-based dialogue; they only need to handle a single round of dialogue. In addition, a template of an example system is provided to facilitate the unification of the interface.
Many text categorization tasks use the F1-measure as the evaluation metric, e.g., [14], [15], [16]. To mitigate the effect of imbalanced category distribution while taking every category into account, we also evaluate submitted systems with the F1-measure derived from precision and recall. Specifically, we first construct a confusion matrix to compute the precision \(P_{i}\) and recall \(R_{i}\) of each category, then take the average precision \(\bar{P}=\frac{1}{N}\sum_{i=1}^{N}P_{i}\) and the average recall \(\bar{R}=\frac{1}{N}\sum_{i=1}^{N}R_{i}\); the F1-measure is calculated by Equation (1):
\(F1=\frac{2\bar{P}\bar{R}}{\bar{P}+\bar{R}}\), (1)
where N denotes the total number of categories.
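The computation above can be sketched as follows; this is a minimal illustration, assuming a confusion matrix stored as a list of rows indexed [true category][predicted category] (the function name and example counts are hypothetical, not part of the evaluation system).

```python
# Macro-averaged F1 from a confusion matrix, following Equation (1):
# per-category precision P_i and recall R_i are averaged over all N
# categories before combining them into a single F1 score.
# conf[i][j] = number of examples of true category i predicted as category j.

def macro_f1(conf):
    n = len(conf)
    precisions, recalls = [], []
    for i in range(n):
        tp = conf[i][i]
        pred_i = sum(conf[j][i] for j in range(n))  # column sum: predicted as i
        true_i = sum(conf[i])                       # row sum: truly i
        precisions.append(tp / pred_i if pred_i else 0.0)
        recalls.append(tp / true_i if true_i else 0.0)
    p_bar = sum(precisions) / n  # average precision over categories
    r_bar = sum(recalls) / n     # average recall over categories
    return 2 * p_bar * r_bar / (p_bar + r_bar) if (p_bar + r_bar) else 0.0

# Hypothetical 3-category example (e.g. chat, news, novel):
conf = [[8, 1, 1],
        [0, 9, 1],
        [1, 0, 9]]
print(macro_f1(conf))
```

Note that this macro-averaged form weights every category equally regardless of its frequency, which is exactly why it is robust to the imbalanced category distribution mentioned above.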
2.2 Task 2: Online Testing of Task-Oriented Dialogues
Task 2 of the Evaluation is described as follows: for a complex task of booking a flight, a train ticket, or a hotel room, build a system that guides the user to complete the corresponding task based on the given database. In this evaluation, submitted systems are evaluated online and manually. Research in [17] suggests that crowdsourcing is feasible and can provide reliable results, and our reviewers are professional testers from iFLYTEK Corporation, which makes accurate results even more likely. A complete intent of a flight reservation task is described as:
“帮我订一张从北京到上海的飞机票,早上或者中午都行”
“Booking a flight from Beijing to Shanghai in the morning or at noon”.
The whole dialogue process of this flight reservation task is shown in Table 1, where U denotes the utterance of the user and R denotes the response of the agent.
Table 1. An example of air ticket booking.

| Role | Questions and answers |
| U | 查询明天从北京去上海的机票。 (Check out tickets from Beijing to Shanghai tomorrow.) |
| R | 请问您只要机票吗? (Do you only need an air ticket?) |
| U | 是的! (Yes!) |
| R | 请问您要明天什么时候出发呢? (When are you leaving tomorrow?) |
| U | 上午或中午吧。 (Morning or noon.) |
| R | 以下是帮您查询到的机票信息,是否需要预定? (Here is the ticket information found for you; would you like to book one?) |
| U | 也行,就订这个吧。 (OK, I'll take this one.) |
| R | 已经帮您预订该航班机票,将跳转至付款页面! (The flight ticket has been booked for you; now redirecting to the payment page.) |
Considering the various factors important to evaluating a task-oriented dialogue system, we use the following indicators to evaluate the submitted systems in Task 2:
• Task completion ratio: the number of tasks completed during the test divided by the total number of tasks.
• Average number of dialogue turns: the number of utterances exchanged in completing a task.
• Satisfaction score: the tester's subjective rating of the system, an integer from -2 to 2.
• Fluency of response: a subjective rating, an integer from -1 to 1.
• Uncovered-data guidance capability: a subjective rating, either 0 or 1.
The core purpose of a task-oriented dialogue system is to help users complete a specific task, so the two most direct indicators for evaluating such a system are the task completion ratio and the average number of dialogue turns [18], [19]. The task completion ratio reflects whether tasks are accomplished and is the most important indicator of a system's capabilities. In Task 2, a complete intent may contain multiple subtasks, such as booking a flight, then a train ticket, and finally a hotel room. To assess a system's ability to handle such composite tasks, we mark a task as completed only when all of its subtasks are completed. The average number of dialogue turns is counted by the evaluation system; when the task completion ratio is the same, the smaller the number of dialogue turns, the better the system performs. To guarantee that an unfinished subtask always costs at least as many turns as a completed one, we charge each unfinished subtask the theoretical maximum number of dialogue turns; if this maximum is exceeded during the test, the current round of testing is terminated. The remaining indicators are the subjective scores of the three reviewers, averaged, and they reflect the performance of the dialogue system in the three respective aspects.
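The scoring rules above can be sketched as follows; this is an illustrative reconstruction under stated assumptions, not the official evaluation system. The function names, the data layout, and the value of the theoretical maximum number of turns are all hypothetical.

```python
# Sketch of Task 2 scoring: a composite task counts as completed only when
# every subtask is completed, and each unfinished subtask is charged the
# theoretical maximum number of turns so it can never appear cheaper than
# a completed one.

MAX_TURNS = 20  # assumed theoretical maximum; the real value is not given


def score_session(subtasks):
    """subtasks: list of (completed: bool, turns: int), one per subtask.

    Returns (task_completed, total_turns_charged) for one composite task.
    """
    task_done = all(done for done, _ in subtasks)
    turns = sum(t if done else MAX_TURNS for done, t in subtasks)
    return task_done, turns


def aggregate(sessions):
    """sessions: list of (task_completed, turns) results from score_session.

    Returns (task completion ratio, average number of dialogue turns).
    """
    completed = sum(1 for done, _ in sessions if done)
    ratio = completed / len(sessions)
    avg_turns = sum(t for _, t in sessions) / len(sessions)
    return ratio, avg_turns
```

For example, a session whose two subtasks finish in 6 and 8 turns is scored as completed with 14 turns, while a session that finishes one subtask in 5 turns but abandons the second is scored as not completed and charged 5 + MAX_TURNS turns.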
The test method of this evaluation is applicable not only to the Chinese Human-Computer Dialogue Technology Evaluation but also, with little modification beyond the corpus, to the same evaluation tasks in other languages.