Abstract: One of the major challenges in building a task-oriented dialogue system is that dialogue state transitions frequently happen across multiple domains, such as booking hotels or restaurants. Recently, the encoder-decoder model based on end-to-end neural networks has become an attractive approach to meet this challenge. However, it usually requires a sufficiently large amount of training data and is not flexible enough to handle dialogue state transitions. This paper addresses these problems by proposing a simple but practical framework called Multi-Domain KB-BOT (MDKB-BOT), which leverages both neural networks and rule-based strategies in natural language understanding (NLU) and dialogue management (DM). Experiments on the data set of the Chinese Human-Computer Dialogue Technology Evaluation Campaign show that MDKB-BOT achieves competitive performance on several evaluation metrics, including task completion rate and user satisfaction.
Keywords: Dialogue system; Knowledge base; Natural language understanding; Slot filling; Natural language generation
In the past decade, dialogue systems have become an attractive research topic; they can be classified into open-domain dialogue systems and task-oriented dialogue systems. One general approach to dialogue system design is to treat it as a retrieval problem by learning a relevance matching score between user queries and system responses. Inspired by the recent advances in deep learning, building an end-to-end dialogue system has become a popular approach for its flexibility and extendibility. For example, encoder-decoder models based on recurrent neural networks (RNNs) directly maximize the likelihood of the desired responses when dialogue history data are available. However, one of the major drawbacks of such systems is that large training corpora are required and generic responses such as "I do not know" are likely to be generated. These drawbacks limit the generalization ability, especially for a task-oriented system in which knowledge from multiple domains is needed to understand users' underlying intents.
Compared with the end-to-end approach, designing a task-oriented dialogue system as a modularized pipeline is more feasible: each essential component is trained individually, including 1) Natural Language Understanding (NLU), to specify the task domain and user intent and extract slot-value pairs; 2) Dialogue Management (DM), to track the dialogue state and guide users toward a desired goal; and 3) Natural Language Generation (NLG), to generate responses. One of the challenges for a task-oriented dialogue system is that dialogue state transitions frequently happen between multiple domains. If earlier components make mistakes in slot-value extraction and errors accumulate, the functionality of the entire system will be severely impaired.
To address the complex dialogue state transition problem, we adopt the architecture of modularized pipeline and propose a multi-domain KB-BOT (MDKB-BOT), which leverages both rule extraction and neural networks. We run the evaluation experiments on the data set of the Chinese Human-Computer Dialogue Technology Evaluation Campaign and experimental results show that MDKB-BOT can robustly fulfill the frequent changes of user intent among three domains (flight, train and hotel) and achieve competitive scores based on human evaluation metrics.
As mentioned before, there have been many research efforts in applying deep learning to task-oriented dialogue systems. One of the most effective approaches is to build a modularized pipeline system by connecting NLU, DM and NLG together. The traditional approach to NLU is to model domain classification and intent detection as sentence classification, while treating slot-value pair extraction as a sequence labeling task. A desirable NLU system should not be sensitive to intent errors and slot errors, especially in slot filling. For example, Xu and Sarikaya applied an RNN to perform contextual domain classification and used a triangular conditional random field (CRF) based on a convolutional neural network for intent detection and slot filling. Jaech, Heck and Ostendorf applied multi-task learning to leverage knowledge from a source domain and improve model performance in a target domain with little target data. Bapna et al. explored the role context information plays in NLU by injecting it into an RNN-based encoder and a memory network.
On the other hand, many attempts have been made to improve the architecture of DM. Recent research indicates that reinforcement learning (RL) holds promise for planning dialogue policy based on the current dialogue state. Williams, Asadi, and Zweig proposed a model called Hybrid Code Networks (HCNs), which is a mixture of supervised learning and reinforcement learning. HCNs select a dialogue action at every step by optimizing the reward for completing a task with policy gradient. Faced with the sparse nature of the reward signal in RL, Peng et al. designed an end-to-end framework for hierarchical RL, where a MANAGER chooses the current goal (e.g., a specific domain task) and a WORKER takes actions and helps users finish the current subtask. For dialogue state tracking, Mrkšić et al. introduced a belief tracker that overcomes the drawback of requiring a large amount of hand-crafted lexicons to capture the linguistic variation in users' language. Their Neural Belief Tracking (NBT) models can reason over pre-trained word embeddings of the system output, the user utterance and candidate pairs in databases.
As for NLG, most current work applies information retrieval techniques over a large query-response database, uses template-based methods with a set of rules that map frames to natural language, or uses generation models. Dušek and Jurčíček encoded frames based on syntax trees and used a seq2seq model for generation.
The proposed framework, which includes NLU, DM and NLG, is illustrated in Figure 1. The implementation of these components is described in Sections 3.1 to 3.3.
3.1 Natural Language Understanding (NLU)
The main tasks for NLU involve domain classification, intent detection and slot filling as illustrated in Figure 1.
3.1.1 Domain Classification
A convolutional neural network (CNN) proposed by Kim was adopted for domain classification. Let W ∈ R^{v×d} be the word embedding table, where v is the vocabulary size and d is the embedding dimension. The semantic representation X ∈ R^{n×d} of a user query is obtained by looking up each of its words in W, where n is the number of words in the query. A 1-D convolutional layer is then applied to X to extract n-gram features. However, a CNN may misclassify queries that describe several domains or a domain transition, e.g., "The train is cheaper, but for the time, give me the information on the plane ticket." Thus, for our online model, rule strategies are used to handle this misclassification problem by constructing keyword lists from both the corpora and the databases, e.g., a city-name list.
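As a minimal sketch, the keyword-rule fallback could be implemented as follows. The keyword lists and the "last-mentioned domain wins" heuristic are illustrative assumptions, not the exact rules of the online system:

```python
# Hypothetical keyword-rule fallback applied alongside the CNN domain
# classifier. The real system builds keyword lists from the corpora and
# databases (e.g., city names); the English keywords below are stand-ins.

DOMAIN_KEYWORDS = {
    "flight": ["plane", "flight", "ticket", "airport"],
    "train":  ["train", "railway", "seat"],
    "hotel":  ["hotel", "room", "check-in"],
}

def rule_override(query, cnn_domain):
    """Return a domain label: keyword hits take priority over the CNN.

    When a query mentions several domains (a transition utterance), the
    last-mentioned domain keyword wins, reflecting the user's final
    intent in sentences such as "The train is cheaper, but give me the
    information on the plane ticket."
    """
    last_hit, last_pos = None, -1
    lowered = query.lower()
    for domain, words in DOMAIN_KEYWORDS.items():
        for w in words:
            pos = lowered.rfind(w)
            if pos > last_pos:
                last_pos, last_hit = pos, domain
    # No keyword matched: fall back to the CNN prediction.
    return last_hit if last_hit is not None else cnn_domain
```

When no keyword fires, the CNN prediction is kept unchanged, so the rules only intervene on the ambiguous multi-domain cases described above.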
3.1.2 Slot Filling
Slot filling is treated as a named entity recognition task, where the popular begin-inside-outside (BIO) format is used to represent the tag of each word in a query. Then a Long Short-Term Memory (LSTM) network scans the word embeddings and outputs a hidden representation at each step:

h_t = LSTM(x_t, h_{t-1}), t = 1, ..., n.
To enhance the ability to extract slot-value pairs, a CRF layer is connected to the output of the LSTM or bidirectional LSTM (BLSTM). The score of a sentence X along a path of tags Y = (y_1, ..., y_n) can then be calculated as the sum of the transition scores A and the LSTM network scores f:

s(X, Y; θ) = Σ_{i=1}^{n} (A_{y_{i-1}, y_i} + f_θ(y_i, i)),

where θ denotes the trainable parameters of the LSTM network, A_{y_{i-1}, y_i} is the score of transitioning from tag y_{i-1} to tag y_i, and f_θ(y_i, i) is the LSTM score of tag y_i at the i-th word.
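The path score above can be sketched in a few lines of Python; the emission and transition matrices are illustrative stand-ins for the trained LSTM outputs and CRF parameters:

```python
# Sketch of the CRF sentence score: the score of a tag path Y for a
# sentence X is the sum of tag-transition scores A and per-word
# emission scores f produced by the (B)LSTM.

def path_score(emissions, transitions, tags):
    """emissions[i][t]: LSTM score of tag t at word i;
    transitions[s][t]: score of moving from tag s to tag t;
    tags: the candidate tag path Y."""
    score = emissions[0][tags[0]]
    for i in range(1, len(tags)):
        score += transitions[tags[i - 1]][tags[i]] + emissions[i][tags[i]]
    return score
```

At decoding time the tag path maximizing this score is found with Viterbi search; the sketch only shows how a single path is scored.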
Figure 2 shows a bidirectional LSTM network enhanced with a CRF layer on top. For our online model, we apply both keyword matching and BLSTM-CRF to handle diverse or nonstandard expressions.
3.1.3 Intent Detection
Based on slots extracted from BLSTM-CRF, we update the maintained dialogue state template. Then user intent is inferred by comparing the predefined dialogue template with the new state template.
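A hedged sketch of this template comparison is given below; the state representation as a flat slot dictionary and the "chitchat" fallback label are assumptions made for illustration:

```python
# Hypothetical intent inference by template comparison: the user's
# intent is read off the difference between the maintained dialogue
# state template before and after slot filling.

def infer_intent(prev_state, new_state):
    """Compare the previous and updated state templates."""
    changed = {k: v for k, v in new_state.items()
               if prev_state.get(k) != v}
    if not changed:
        return "chitchat"  # no slot touched: treated as purposeless talk
    return {"act": "inform", "slots": changed}
```

In the running system the inferred act is matched against the predefined dialogue templates to decide which intent the user is pursuing.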
3.2 Dialogue Management (DM)
The NLU module outputs the domain, the user intent and the slot values of the current turn.
In order to avoid too many unnecessary dialogue turns about information that many users find insignificant, we divide all of the slots into two categories: required slots and extra slots. Required slots, such as <departCity>, are necessary for the task; extra slots, such as <trainValue> and <countRate>, may make the dialogue tedious for users who do not care about them. Therefore, we only ask the user to complete the required slots, but if a user mentions extra slots in the dialogue, we take them into account when retrieving information from our database.
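The required/extra split can be sketched as a small schema; the slot names follow the paper, but the exact lists per domain are illustrative assumptions:

```python
# Illustrative slot schema for the required/extra split. Required slots
# are asked for proactively; extra slots are only used as filters when
# the user volunteers them.

REQUIRED = {"train": ["departCity", "arriveCity", "departDate"]}
EXTRA = {"train": ["trainValue"]}  # never asked for, only filtered on

def next_to_ask(domain, filled):
    """Return the first unfilled required slot, or None when the
    booking request is complete."""
    for slot in REQUIRED[domain]:
        if slot not in filled:
            return slot
    return None
```

The dialogue manager keeps requesting `next_to_ask` until it returns None, at which point the system can move to the recommendation state.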
To regulate the dialogue course when our system interacts with users, the DM module updates the conversation state and decides the next dialogue action. Accordingly, we divide this module into three states, described in detail as follows.
Initial state: At the beginning of the conversation, an utterance with no explicit intention is considered purposeless talk. After identifying the user's intent, the system turns to the slot filling state. Note that the system stores slot information even before domain prediction, and this information is distributed to the corresponding slots afterward.
Slot filling state: The main task in this state is to interact with the user to obtain the required slot information for generating responses.
Recommendation state: Our bot lists the results that match the user's demands by retrieving and extracting from the database. In case of failure, we apply a series of strategies for similar recommendations: (1) remove the limitations of extra slots; (2) make appropriate adjustments to the departure time; (3) change the cabin or train type; (4) increase the price range. In addition, users can change their requests and return to the slot filling state or the recommendation state again.
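The fallback loop might look like the following sketch, which tries the exact query first and then relaxes constraints step by step. The `search` callable, the `maxPrice` slot and the relaxation order shown here are assumptions for illustration; the real system applies the four strategies listed above:

```python
# Hypothetical relaxation loop for the recommendation state. `search`
# stands in for the database query; "maxPrice" is an assumed slot name.

def drop_extra_slots(query, extra):
    """Strategy (1): remove the limitations of extra slots."""
    return {k: v for k, v in query.items() if k not in extra}

def widen_price(query, factor=1.5):
    """Strategy (4): increase the price range."""
    q = dict(query)
    if "maxPrice" in q:
        q["maxPrice"] = q["maxPrice"] * factor
    return q

def recommend(query, search, extra=("trainValue", "countRate")):
    """Try the exact query, then progressively relaxed variants."""
    for relax in (lambda q: q,
                  lambda q: drop_extra_slots(q, extra),
                  widen_price):
        results = search(relax(query))
        if results:
            return results
    return []  # nothing found even after relaxation
```

Each relaxation is applied to the original query independently, so the system can report which constraint it loosened when presenting a similar recommendation.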
Throughout the whole dialogue, when the system finds that an intent cannot be completed, it promptly informs the user to avoid wasting time. For example, when a user wants to book a flight to a city where no flight service is available, it is unwise to continue the dialogue, and the system instead recommends other means of travel. A user can also change his or her intent at any time, and our system automatically carries over the common information slots during the transition.
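The slot hand-over on an intent change can be sketched as follows; the particular set of shared slot names is an illustrative assumption:

```python
# Sketch of the slot carry-over when the user switches intent (e.g.,
# from flight to train): slots meaningful in both domains are kept so
# the user does not have to repeat them.

COMMON_SLOTS = {"departCity", "arriveCity", "departDate"}

def switch_domain(old_slots, shared=COMMON_SLOTS):
    """Keep only the slots shared with the new domain."""
    return {k: v for k, v in old_slots.items() if k in shared}
```

Domain-specific slots such as a flight cabin class are dropped, while the shared itinerary information seeds the new domain's slot filling state.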
3.3 Natural Language Generation (NLG)
So far, we have obtained both the category of the query's intention and the next dialogue action for each turn, which guide the NLG module to generate natural language text and reply to the user's query. Given the user's slot list, we convert it into an SQL statement, retrieve from the database (which stores information on trains, flights and accommodations) to check whether there are items matching the user's goal, and select the appropriate template for the reply.
Because of the shortage of large-scale dialogue corpora in these domains, we generate utterances with template-based NLG, which means we need to cover every case of the different slot states in the predefined dialogue templates. In this way, once a user dialogue action matches a predefined sentence template, we fill the slot values with the user's history information. One advantage of this approach is that it ensures the controllability of the responses given to users.
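The retrieval-and-template step described above might be sketched as follows; the table and column names, and the example template text, are assumptions for illustration:

```python
# Hypothetical slots-to-SQL conversion and template filling. Column
# names mirror the slot names; parameterized placeholders avoid SQL
# injection from user-provided values.

def slots_to_sql(table, slots):
    """Build a parameterized SELECT from the filled slot dictionary."""
    cols = sorted(slots)  # deterministic column order
    where = " AND ".join(f"{c} = ?" for c in cols)
    sql = f"SELECT * FROM {table} WHERE {where}"
    return sql, [slots[c] for c in cols]

# One predefined sentence template; the real system keeps one template
# per combination of dialogue action and slot state.
TEMPLATE = "I found a {vehicle} from {departCity} to {arriveCity}."

def fill_template(template, slots):
    """Replace template placeholders with the user's slot values."""
    return template.format(**slots)
```

The SQL pair would be passed to a parameterized query API (e.g., `sqlite3`'s `execute(sql, params)`), and the retrieved row's values are merged into the slot dictionary before template filling.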
In the Chinese Human-Computer Dialogue Technology Evaluation Campaign, a task-oriented dialogue system is developed to help users book flights, trains and hotels.
4.1 Data sets
Since only three databases are provided, we extend the data set of task 1 for domain classification and rule extraction. We also annotate a 300-dialogue corpus (about 1,500) with slot labels for evaluating the LSTM-CRF and BLSTM-CRF models. Table 1 and Table 2 show the details of the data set.
For slot filling in NLU, the entity-level F1 score common in named entity recognition is adopted. However, dialogue evaluation remains a difficult task. We use the evaluation metrics of the Chinese Human-Computer Dialogue Technology Evaluation Campaign, including task completion rate, user satisfaction score, dialogue naturalness, number of turns and robustness to uncovered cases.
Table 3 shows the results of slot filling. We compare the performance of LSTM-CRF and BLSTM-CRF with unigram and unigram-plus-bigram features separately. As illustrated, accuracy increases by 1.98% when word sequence order is considered with BLSTM. The bigram feature helps LSTM-CRF, though it hurts BLSTM-CRF, as the average length of the bigram sequence is short. One possible improvement is to use character embeddings instead of word embeddings.
Table 4 shows the performance of our system according to the evaluation metrics mentioned above. Most of the metrics are annotated manually, except for the average number of dialogue turns. Our system obtained the best scores on the user satisfaction, naturalness and guidance ability metrics, thanks to the reasonable dialogue intent transition templates we predefined. However, this comes at the cost of a decline in task completion, especially when the user intent is not identified or an important slot value is not extracted correctly.
In this paper, we proposed a simple but practical framework for multi-domain task-oriented dialogue systems. Our model leverages both neural networks and a rule-based strategy to handle the domain transition problem. It achieves competitive results in the Chinese Human-Computer Dialogue Technology Evaluation Campaign, especially on the user-friendliness and utterance guidance metrics. For future work, we plan to apply end-to-end neural networks to NLG based on the information extracted and maintained in NLU and DM to improve system performance.
P. Xu, & R. Sarikaya. Contextual domain classification in spoken language understanding systems using recurrent neural network. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 136–140. doi: 10.1109/ICASSP.2014.6853573.
A. Jaech, L. Heck, & M. Ostendorf. Domain adaptation of recurrent neural networks for natural language understanding. In: Interspeech, 2016. doi: 10.21437/Interspeech.2016-1598.
A. Bapna, G. Tür, D. Hakkani-Tür, & L. Heck. Sequential dialogue context modeling for spoken language understanding. In: Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, 2017, pp. 103–114. doi: 10.18653/v1/W17-5514.
J.D. Williams, K. Asadi, & G. Zweig. Hybrid code networks: Practical and efficient end-to-end dialogue control with supervised and reinforcement learning. arXiv preprint. arXiv: 1702.03274, 2017.
R.J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8(3-4)(1992), 229–256. doi: 10.1007/BF00992696.
B. Peng, X. Li, L. Li, J. Gao, A. Celikyilmaz, S. Lee, & K.-F. Wong. Composite task-completion dialogue policy learning via hierarchical deep reinforcement learning. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2221–2230. doi: 10.18653/v1/D17-1237.
N. Mrkšić, D. Ó Séaghdha, T.-H. Wen, B. Thomson, & S. Young. Neural belief tracker: Data-driven dialogue state tracking. arXiv preprint. arXiv: 1606.03777, 2016.
O. Dušek, & F. Jurčíček. Sequence-to-sequence generation for spoken dialogue via deep syntax trees and strings. arXiv preprint. arXiv: 1606.05491, 2016.