Published: Vol 2 (1): 96–107, 2020
Distributed Analytics on Sensitive Medical Data: The Personal Health Train
Abstract & Keywords
Abstract: In recent years, as new technologies have evolved around the healthcare ecosystem, ever more data have been generated. Advanced analytics could draw on data collected from numerous sources, whether held by healthcare institutions or generated by individuals themselves via apps and devices, and lead to innovations in the treatment and diagnosis of diseases, improve the care given to patients, and empower citizens to participate in decisions regarding their own health and well-being. However, the sensitive nature of health data prevents healthcare organizations from sharing it. The Personal Health Train (PHT) is a novel approach that aims to establish a distributed data analytics infrastructure enabling the (re)use of distributed healthcare data while data owners stay in control of their own data. The main principle of the PHT is that data remain in their original location, and analytical tasks visit the data sources and execute there. The PHT provides a distributed, flexible approach to using data in a network of participants, incorporating the FAIR principles. It facilitates the responsible use of sensitive and/or personal data by adopting international principles and regulations. This paper presents the concepts and main components of the PHT and demonstrates how it complies with the FAIR principles.
Keywords: Distributed analytics; Data reuse; FAIR; Health data; Ethics and privacy
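The principle stated in the abstract, that data stay where they are and the analytical task travels to them, can be illustrated with a minimal sketch. This is not the PHT implementation: the names `Station`, `local_summary` and `run_train`, as well as the toy values, are hypothetical and chosen only for illustration. The sketch assumes that each data source returns aggregates only (a count and a sum), which are then combined into a global mean.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Station:
    """Hypothetical data station: its records are never handed to the caller."""
    name: str
    _records: List[float] = field(repr=False)

    def execute(self, task: Callable[[List[float]], Dict[str, float]]) -> Dict[str, float]:
        # The task runs where the data live; only its result is returned.
        return task(self._records)


def local_summary(records: List[float]) -> Dict[str, float]:
    """Hypothetical task payload: returns aggregates, not row-level data."""
    return {"n": float(len(records)), "total": float(sum(records))}


def run_train(stations: List[Station]) -> float:
    """Visit every station in turn and combine the aggregates into a global mean."""
    n = 0.0
    total = 0.0
    for station in stations:
        result = station.execute(local_summary)
        n += result["n"]
        total += result["total"]
    return total / n


# Toy values for illustration only.
stations = [
    Station("hospital_a", [60.0, 70.0]),
    Station("hospital_b", [50.0]),
]
global_mean = run_train(stations)
print(global_mean)
```

In this sketch the caller only ever sees the per-station count and sum. A real deployment would additionally need authentication, task vetting and disclosure control, none of which are modelled here.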
Article and author information
Cite As
O. Beyan, A. Choudhury, J van Soest, O. Kohlbacher, L. Zimmermann, H. Stenzhorn, Md. R. Karim, M. Dumontier, S. Decker, L.O. Bonino da Silva Santos & A. Dekker. Distributed analytics on sensitive medical data: The Personal Health Train. Data Intelligence 2(2020), 96–107. doi: 10.1162/dint_a_00032
Oya Beyan
O. Beyan conceived and designed the concept and wrote the paper.
Oya Beyan is a researcher at Fraunhofer Institute for Applied Information Technology and at the Department of Computer Science at RWTH Aachen University. Her research focuses on methods of data reusability and FAIR data, data-driven transformation and distributed analytics. Her expertise is in Semantic Web technologies and their application in health care and life sciences. She actively contributes to national and international initiatives to enable the adoption of FAIR principles and develops tools and infrastructures supporting FAIR data. With her interdisciplinary background in informatics, medical informatics and sociology, she developed a focus on societal reflections of data-driven change.
Ananya Choudhury
A. Choudhury wrote the manuscript and is developing the infrastructure.
Ananya Choudhury is a researcher and PhD student in the Clinical Data Science Group, Maastro Clinic, Maastricht University. Her research focuses on methods and infrastructure for privacy-preserving distributed learning on clinical data, tools and methods for data FAIR-ification, and learning models on FAIR data for improving patient care.
Johan van Soest
J. van Soest reviewed the manuscript and is working on PHT infrastructure development and implementations.
Johan van Soest holds a PhD from Maastricht University on centralized and distributed learning of prognostic/predictive models in radiation oncology focusing on knowledge representation, methods for validation of existing models and translation into clinical practice. He is currently active as Postdoctoral Researcher in the Department of Radiation Oncology at MAASTRO clinic and the university’s Institute of Data Science.
Oliver Kohlbacher
O. Kohlbacher reviewed the paper and conceived core components of the architecture.
Oliver Kohlbacher holds the Chair for Applied Bioinformatics at the University of Tübingen and is Director of the Institute for Translational Bioinformatics at University Hospital Tübingen and a Fellow at the Max Planck Institute for Developmental Biology. The lab’s current research focus is on developing methods and tools for the analysis of biomedical high-throughput data and their application in translational research.
Lukas Zimmermann
L. Zimmermann reviewed the paper and works on components of the PHT architecture.
Lukas Zimmermann is a research assistant and software developer at the Institute for Translational Bioinformatics at the University Hospital Tübingen with a background in Bioinformatics. His research interests currently focus on data integration and software design and quality in medical informatics.
Holger Stenzhorn
H. Stenzhorn reviewed the paper and participates in the PHT development.
Holger Stenzhorn works at the Saarland University Medical Center, coordinating the development and organizational set-up of a medical data integration center (meDIC), and supports the Tübingen University Hospital in its meDIC work. His particular interest lies in the seamless integration of the multimodal, multilevel and multisource data from the plethora of clinical and research systems found within hospitals and medical centers to facilitate further biomedical research.
Md. Rezaul Karim
R. Karim reviewed the manuscript and is working on PHT infrastructure development and implementations.
Md. Rezaul Karim is a researcher at Fraunhofer FIT, Germany and a PhD candidate at RWTH Aachen University, Germany. He is developing a distributed knowledge pipeline with knowledge graphs and neural networks, with the aim of making them explainable and interpretable. His research interests include machine learning, knowledge graphs, bioinformatics, and explainable artificial intelligence (XAI).
Michel Dumontier
M. Dumontier conceived and reviewed the paper.
Michel Dumontier is the Distinguished Professor of Data Science at Maastricht University and co-founder of the FAIR (Findable, Accessible, Interoperable and Reusable) data principles. His research focuses on the development of computational methods for scalable and responsible discovery science. Previously at Stanford University, Dr. Dumontier now leads the interfaculty Institute of Data Science at Maastricht University to develop socio-technological systems for accelerating scientific discovery, improving human health and well-being, and empowering communities with ethical data-driven decision making.
Stefan Decker
S. Decker and A. Dekker conceived and reviewed the paper.
Stefan Decker is the director of Fraunhofer FIT, Germany and Professor of Computer Science at RWTH Aachen University, Germany. He was the director of Insight Centre for Data Analytics, and professor of informatics at NUI Galway, Ireland between 2006 and 2015. His research interests include Semantic Web and linked data, knowledge representation, and neural networks.
Luiz Olavo Bonino da Silva Santos
L.O. Bonino da Silva Santos reviewed the paper and works on the design of the Personal Health Train architecture.
Luiz Olavo Bonino da Silva Santos is the International Technology Coordinator of the GO FAIR International Support and Coordination Office, and Associate Professor of the BioSemantics group at the Leiden University Medical Centre in Leiden, The Netherlands. His background is in ontology-driven conceptual modelling, semantic interoperability, service-oriented computing, requirements engineering and context-aware computing. In the last five years Luiz has been involved in a number of activities to realize the FAIR principles, including the development of a number of technologies and tools to support making, publishing, indexing, searching and annotating FAIR (meta)data.
Andre Dekker
S. Decker and A. Dekker conceived and reviewed the paper.
Andre Dekker is a board-certified medical physicist at MAASTRO Clinic and full professor at Maastricht UMC+ and Maastricht University, where he holds the chair “Clinical Data Science”. His research focuses on three main themes: 1) building global FAIR data sharing infrastructures; 2) machine learning outcome prediction models from the data; 3) applying outcome prediction models to improve the lives of patients. The main scientific breakthrough has been the development of a Semantic Web and ontology-based data sharing and distributed learning infrastructure that does not require data to leave the hospital. This has reduced many of the ethical and other barriers to sharing data.
Publication records
Data Intelligence