Vol 2 (4): 443–486, 2020
The Semantic Data Dictionary – An Approach for Describing and Annotating Data
: 2020 - 03 - 10
: 2020 - 03 - 21
: 2020 - 03 - 25
Abstract & Keywords
Abstract: It is common practice for data providers to include text descriptions for each column when publishing data sets in the form of data dictionaries. While these documents are useful in helping an end-user properly interpret the meaning of a column in a data set, existing data dictionaries are typically not machine-readable and do not follow a common specification standard. We introduce the Semantic Data Dictionary, a specification that formalizes the assignment of a semantic representation of data, enabling standardization and harmonization across diverse data sets. In this paper, we present the Semantic Data Dictionary in the context of our work with biomedical data; however, the approach can be, and has been, used in a wide range of domains. The rendition of data in this form helps promote improved discovery, interoperability, reuse, traceability, and reproducibility. We present the associated research and describe how the Semantic Data Dictionary can help address existing limitations in the related literature. We discuss our approach, present an example by annotating portions of the publicly available National Health and Nutrition Examination Survey data set, present modeling challenges, and describe the use of this approach in sponsored research, including our work on a large NIH-funded exposure and health data portal and in the RPI-IBM collaborative Health Empowerment by Analytics, Learning, and Semantics project. We evaluate this work in comparison with traditional data dictionaries, mapping languages, and data integration tools.
Keywords: Semantic Data Dictionary; Dictionary mapping; Codebook; Knowledge modeling; Data integration; Data dictionary; Mapping language; Metadata standard; Semantic Web; Semantic ETL; FAIR; Data
This work is supported by the National Institute of Environmental Health Sciences (NIEHS) Award 0255-0236-4609/1U2CES026555-01, IBM Research AI through the AI Horizons Network, and the CAPES Foundation Senior Internship Program Award 88881.120772 / 2016-01. We acknowledge the members of the Tetherless World Constellation (TWC) and the Institute for Data Exploration and Applications (IDEA) at Rensselaer Polytechnic Institute (RPI) for their contributions, including Rebecca Cowan, John Erickson, and Oshani Seneviratne.
Article and author information
Cite As
S.M. Rashid, J.P. McCusker, P. Pinheiro, M.P. Bax, H. Santos, J.A. Stingone, A.K. Das & D.L. McGuinness. The semantic data dictionary – an approach for describing and annotating data. Data Intelligence 2(2020), 443–486. doi: 10.1162/dint_a_00058
Sabbir M. Rashid
S.M. Rashid, in drafting the paper, introduced the research, motivation, and claims of this article in Section 1, conducted the majority of the literature review presented in Section 2, summarized the methodology associated with the approach in Section 3, formulated the example of Section 4, detailed the case studies presented in Section 5, performed the evaluation of Sections 7 and 8, helped with the discussion in Section 9, and summarized the conclusions of the article in Section 10.
Sabbir M. Rashid is a PhD student at Rensselaer Polytechnic Institute (RPI) working with Professor Deborah L. McGuinness on research related to data annotation and harmonization, ontology engineering, knowledge representation, and various forms of reasoning. Prior to attending RPI, Mr. Rashid completed a double major at Worcester Polytechnic Institute, where he received B.S. degrees in both Physics and Electrical & Computer Engineering. Much of his graduate studies at RPI have involved the research discussed in this article. His current work includes the application of deductive and abductive inference techniques over linked health data, such as in the context of chronic diseases like diabetes.
James P. McCusker
J.P. McCusker contributed to the content of Section 3 and aided in the formulation of the evaluation of Section 7.
James P. McCusker is the Director of Data Operations at the Tetherless World Constellation at Rensselaer Polytechnic Institute. He works with Deborah McGuinness on using knowledge graphs to further scientific research, especially in biomedical domains. He has worked on applying semantics to numerous projects, including drug repurposing using systems biology, cancer genome resequencing, childhood health and environmental exposure, analysis of sea ice conditions and materials science. He is the architect of the open source Whyis knowledge graph development and management framework, which has been used across many of these domains.
Paulo Pinheiro
P. Pinheiro helped scope the example of Section 4.
Paulo Pinheiro is a data scientist and software engineer managing projects at the frontier between artificial intelligence and databases. His areas of expertise include the following: data policies and information assurance, such as security and privacy; data operation including curation, quality monitoring, semantic integration, provenance management and uncertainty assessment; data visualization; and data analytics including automated reasoning. Paulo holds a PhD in Computer Science from the University of Manchester, UK.
Marcello P. Bax
M.P. Bax helped conduct the literature review of Section 2 and aided in the formulation of the evaluation of Section 7.
Marcello P. Bax is a professor and researcher in the Postgraduate Program in Knowledge Management and Organization (PPG-GOC) at the School of Information Science at Federal University of Minas Gerais, Brazil. Prior to joining the School of Information Science, Dr. Bax was a postdoctoral fellow in the Computer Science Department at UFMG, a leading Computer Science research group in Latin America. Dr. Bax spent a year on sabbatical with Professor McGuinness’ group and the Tetherless World Constellation at Rensselaer Polytechnic Institute, during which he worked with the coauthors on the research described in this article. His research seeks to develop methods for curating scientific data, with a focus on semantic annotation and the goal of building curatorial repositories for data reuse and reproduction of scientific research results.
Henrique Santos
H.O. Santos helped synthesize the related literature in Section 2 and presented some limitations of our approach in Section 10.
Henrique O. Santos is a Research Scientist in the Tetherless World Constellation at Rensselaer Polytechnic Institute, where he researches and applies Semantic Web technologies in multidisciplinary domains for supporting more flexible, more efficient, and improved solutions in comparison with traditional approaches. His research interests include data integration, knowledge representation, domain-specific reasoning and explainable artificial intelligence. He has over 10 years of experience working with Semantic Web technologies and holds a PhD in Applied Informatics from Universidade de Fortaleza, Brazil.
Jeanette A. Stingone
J.A. Stingone conducted the experiment and drafted the content presented in Section 6.
Jeanette A. Stingone is an Assistant Professor in the Department of Epidemiology at Columbia University’s Mailman School of Public Health. She couples data science techniques with epidemiologic methods to address research questions in children’s environmental health. She currently leads the Data Science Translation and Engagement Group of the Human Health and Exposure Analysis Resource Data Center. In this role, she supports the use of metadata standards and ontologies for data harmonization efforts across disparate studies of environmental health. Dr. Stingone’s interests also include the use of collective science initiatives to advance public health research.
Amar K. Das
A.K. Das led the proposal of the research problems associated with the HEALS projects mentioned in Section 5.
Amar K. Das is the Program Director of Integrated Care Research at IBM Research and an Adjunct Associate Professor of Biomedical Data Science at Dartmouth College. His research activities include the development of biomedical ontologies and Semantic Web technologies for clinical decision support, information retrieval and machine learning. In his role in the RPI-IBM HEALS initiative, Dr. Das is the IBM technical lead for advancing knowledge representation and reasoning in healthcare. Dr. Das holds an MD and PhD in Biomedical Informatics from Stanford University, and has completed a residency in Psychiatry and a postdoctoral fellowship in Clinical Epidemiology at Columbia University/New York State Psychiatric Institute.
Deborah L. McGuinness
D.L. McGuinness has guided the overall direction of this research. All the authors have made meaningful and valuable contributions in revising and proofreading the resulting manuscript.
Deborah L. McGuinness is the Tetherless World Senior Constellation Chair and Professor of Computer and Cognitive Science. She is also the founding director of the Web Science Research Center at Rensselaer Polytechnic Institute. Dr. McGuinness has been recognized with awards as a fellow of the American Association for the Advancement of Science (AAAS) for contributions to the Semantic Web, knowledge representation, and reasoning environments and as the recipient of the Robert Engelmore award from the Association for the Advancement of Artificial Intelligence (AAAI) for leadership in Semantic Web research and in bridging Artificial Intelligence (AI) and eScience, significant contributions to deployed AI applications, and extensive service to the AI community. Deborah is a leading authority on the Semantic Web and has been working in knowledge representation and reasoning environments for over 30 years. She leads the research group that designed and implemented the research presented in this paper.
Publication records
Published: Dec. 14, 2020
Data Intelligence