Published Versions 1 Vol 3 (3) : 402-417 2021
Medical Named Entity Recognition from Un-labelled Medical Records based on Pre-trained Language Models and Domain Dictionary
: 2021 - 02 - 15
: 2021 - 03 - 29
: 2021 - 04 - 22
Abstract & Keywords
Abstract: Medical named entity recognition (NER) recognizes medical named entities, such as diseases, drugs, surgery reports, anatomical parts, and examination documents, from medical texts. Conventional medical NER methods do not make full use of the un-labelled medical texts embedded in medical documents. To address this issue, we propose a medical NER approach based on pre-trained language models and a domain dictionary. First, we construct a medical entity dictionary by extracting medical entities from labelled medical texts and collecting medical entities from other resources, such as the Yidu-N4K data set. Second, we use this dictionary to train domain-specific pre-trained language models on un-labelled medical texts. Third, we apply a pseudo labelling mechanism to the un-labelled medical texts to annotate them automatically with pseudo labels. Fourth, we use the BiLSTM-CRF sequence tagging model to fine-tune the pre-trained language models. Our experiments on un-labelled medical texts extracted from Chinese electronic medical records show that the proposed NER approach achieves strict and relaxed F1 scores of 88.7% and 95.3%, respectively.
Keywords: Medical named entity recognition; Pre-trained language model; Domain dictionary; Pseudo labelling; Un-labelled medical data.
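The dictionary-based pseudo labelling step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how dictionary entries are matched against text, so the greedy longest-match strategy, the BIO tag scheme, the function name `pseudo_label`, and the toy dictionary below are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List, Sequence, Tuple


def pseudo_label(
    tokens: Sequence[str],
    dictionary: Dict[Tuple[str, ...], str],
) -> List[str]:
    """Assign BIO pseudo labels by greedy longest-match dictionary lookup.

    `dictionary` maps a tuple of tokens (an entity mention) to an entity type.
    This matching strategy is an assumption, not taken from the paper.
    """
    max_len = max((len(term) for term in dictionary), default=0)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first so longer entities win.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            entity_type = dictionary.get(tuple(tokens[i:i + length]))
            if entity_type is not None:
                tags[i] = "B-" + entity_type
                for j in range(i + 1, i + length):
                    tags[j] = "I-" + entity_type
                i += length
                break
        else:
            # No dictionary entry starts at position i; leave it as "O".
            i += 1
    return tags


# Hypothetical toy dictionary and sentence, for illustration only.
toy_dictionary = {
    ("type", "2", "diabetes"): "DISEASE",
    ("diabetes",): "DISEASE",
    ("insulin",): "DRUG",
}
toy_tokens = ["type", "2", "diabetes", "treated", "with", "insulin"]
toy_tags = pseudo_label(toy_tokens, toy_dictionary)
```

For Chinese records the same idea would typically run over characters rather than whitespace-separated tokens. The resulting tag sequences would then serve as pseudo labels when training a sequence tagger.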
Acknowledgements
This work is supported in part by the Guangdong Science and Technology grant (No. 2016A010101033) and the Hong Kong and Macao joint research and development grant with Wuyi University (No. 2019WGAH21).
Article and author information
Cite As
Citation: Wen, C.J., et al.: Medical named entity recognition from un-labelled medical records based on pre-trained language models and domain dictionary. Data Intelligence 3(3), 402-417 (2021). doi: 10.1162/dint_a_00105
Chaojie Wen
C.J. Wen initiated the proposed medical NER approach and conducted the experiments. C.J. Wen, T. Chen, and X.D. Jia contributed to the final version of the manuscript. C.J. Wen and Jiang Zhu contributed to data preparation.
Chaojie Wen received his B.S. degree in Computer Science and Technology from Wuyi University in 2019. He is currently pursuing his M.S. degree in Electronic and Communication Engineering at Wuyi University, Guangdong, China. His research interests include natural language processing, named entity recognition, and relation extraction.
0000-0002-9325-9147
Tao Chen
C.J. Wen, T. Chen, and X.D. Jia contributed to the final version of the manuscript. C.J. Wen and Jiang Zhu contributed to data preparation. T. Chen and X.D. Jia provided guidance to the project.
chentao1999@gmail.com
Tao Chen received his B.Eng. degree in Communication Engineering in 2003 from Nanjing Institute of Communication Engineering, China, his M.Eng. degree in Computer Application in 2013 from Wuyi University, Guangdong, China, and his PhD degree in Computer Application from Harbin Institute of Technology, China, in 2018. He joined Wuyi University, China, as a lecturer in 2018. He is the author of one book, 17 articles and 3 patents. His research interests include natural language processing, deep learning, knowledge acquisition and reasoning, and sentiment analysis.
0000-0002-3634-0854
Xudong Jia
C.J. Wen, T. Chen, and X.D. Jia contributed to the final version of the manuscript. T. Chen and X.D. Jia provided guidance to the project.
Xudong Jia is a Visiting Scholar of Wuyi University, China. He is also a Professor and the Associate Dean of the College of Engineering and Computer Science, California State University, Northridge. He received his B.S. in 1983 and M.S. in 1986 from Beijing Jiaotong University, his M.S. in 1992 from the University of Toronto, Canada, and his PhD in 1996 from Georgia Institute of Technology. His research interests include intelligent transportation systems (ITS) standards, geographic information system (GIS) applications in transportation, traffic safety, transportation information systems, travel demand management, and air quality. He is an associate editor of IEEE Intelligent Transportation Systems Society and IEEE Open Journal of Intelligent Transportation Systems.
Jiang Zhu
C.J. Wen and Jiang Zhu contributed to data preparation.
Jiang Zhu received his B.S. degree in Electronic Information Science and Technology from Inner Mongolia University for Nationalities in 2018. He is currently pursuing his M.S. degree in Pattern Recognition and Intelligent System at Wuyi University, Guangdong, China. His research interests include natural language processing, named entity recognition, and data mining.
Publication records
Published: Sept. 16, 2021 (Versions 1)
Data Intelligence