Published: Vol 2 (3): 323-352, 2020
Constructing and Cleaning Identity Graphs in the LOD Cloud
: 2019 - 08 - 31
: 2019 - 09 - 20
: 2019 - 09 - 30
Abstract & Keywords
Abstract: In the absence of a central naming authority on the Semantic Web, it is common for different datasets to refer to the same thing by different names. Whenever multiple names are used to denote the same thing, owl:sameAs statements are needed in order to link the data and foster reuse. Studies dating back as far as 2009 have observed that the owl:sameAs property is sometimes used incorrectly. In our previous work, we presented an identity graph containing over 500 million explicit and 35 billion implied owl:sameAs statements, together with a scalable approach for automatically calculating an error degree for each identity statement. In this paper, we generate subgraphs of the overall identity graph that correspond to certain error degrees. We show that even though the Semantic Web contains many erroneous owl:sameAs statements, it is still possible to use Semantic Web data while at the same time minimising the adverse effects of misusing owl:sameAs.
Keywords: Linked Open Data; Identity; Quality; Reasoning
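The gap between explicit and implied statements in the abstract follows from owl:sameAs being reflexive, symmetric, and transitive: a few explicit links can imply many more. The Python sketch below is illustrative only and is not the authors' implementation. It groups terms into identity sets with a union-find structure and counts the implied statements. The function names, the `ex:` terms, and the counting convention (ordered pairs of distinct terms, reflexive statements excluded) are assumptions made for this example.

```python
from collections import defaultdict


def identity_sets(pairs):
    """Group terms into identity sets from explicit (subject, object) sameAs pairs."""
    parent = {}

    def find(x):
        # Register unseen terms as their own root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node directly at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for term in list(parent):
        groups[find(term)].add(term)
    return list(groups.values())


def implied_statement_count(sets):
    """Count ordered pairs of distinct terms per identity set (assumed convention)."""
    return sum(len(s) * (len(s) - 1) for s in sets)


# Hypothetical explicit statements: three links over five terms.
explicit = [("ex:a", "ex:b"), ("ex:b", "ex:c"), ("ex:d", "ex:e")]
sets = identity_sets(explicit)
print(sorted(sorted(s) for s in sets))  # [['ex:a', 'ex:b', 'ex:c'], ['ex:d', 'ex:e']]
print(implied_statement_count(sets))    # 8
```

Here three explicit statements imply eight under the assumed convention. Because the count grows quadratically with the size of an identity set, a single erroneous link that merges two large sets can add a large number of incorrect implied statements.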
Article and author information
Cite As
J. Raad, W. Beek, F. van Harmelen, J. Wielemaker, N. Pernelle & F. Saïs. Constructing and cleaning identity graphs in the LOD cloud. Data Intelligence 2(2020), 323–352. doi: 10.1162/dint_a_00057
Joe Raad
J. Raad was co-responsible for the design of the research, the implementation of the approach, the analysis of the results, and the writing of the manuscript.
Joe Raad is a post-doctoral researcher in the Knowledge Representation & Reasoning group at the Vrije Universiteit Amsterdam in The Netherlands. His research interests focus on identity management in Knowledge Representation systems, large-scale empirical study of semantics, and semantic technologies deployment for Digital Humanities. As part of his research, he has co-developed and MetaLink.
Wouter Beek
W. Beek was co-responsible for the design of the research, the implementation of the approach, the analysis of the results, and the writing of the manuscript.
Wouter Beek is a post-doctoral researcher in the Knowledge Representation & Reasoning group at the Vrije Universiteit Amsterdam and co-founder of Triply. He is interested in the Semantic Web as a platform for knowledge-intensive applications, the deployment of large-scale knowledge bases for innovative reuse, and the interaction between Web semantics and pragmatics, including the empirical study of semantics. As part of his research, he has co-developed the LOD Laundromat, LOD Search, and
Frank van Harmelen
F. van Harmelen has contributed to the design of the research, the analysis of the results, and the writing of the manuscript.
Frank van Harmelen is a Professor in Knowledge Representation & Reasoning in the Computer Science department (Faculty of Science) at the Vrije Universiteit Amsterdam, The Netherlands. Since 2000, he has played a leading role in the development of the Semantic Web. He was co-PI on the first European Semantic Web project (OnToKnowledge, 1999), which laid the foundations for the Web Ontology Language OWL. OWL has become a worldwide standard, is in wide commercial use, and has become the basis for an entire research community. He co-authored the Semantic Web Primer, the first academic textbook of the field and now in its third edition, which is in worldwide use (translations in 5 languages, and 10,000 copies sold of the English edition alone). He was one of the architects of Sesame, an RDF storage and retrieval engine, which is in wide academic and industrial use with over 200,000 downloads. This work received the 10-year impact award at the 11th International Semantic Web Conference in 2012, which is the most prestigious award in the field. In recent years, he pioneered the development of large-scale reasoning engines. He was scientific director of the 10m euro EU-funded Large Knowledge Collider, a platform for distributed computation over semantic graphs with billions of edges. The prize-winning work with his student Jacopo Urbani has improved the state of the art by two orders of magnitude. He is scientific director of The Network Institute. In this interdisciplinary research institute, some 150 researchers from the Faculties of Social Science, Humanities and Computer Science collaborate on research topics in computational Social Science and e-Humanities.
He is a fellow of the European AI Society ECCAI (membership limited to 3% of all European AI researchers). In 2014, he was admitted as a member of the Academia Europaea (limited to the top 5% of researchers in each field), and in 2015 he was admitted as a Member of the Royal Netherlands Society of Sciences and Humanities (450 members across all sciences).
Jan Wielemaker
J. Wielemaker has contributed to the implementation of the approach.
Jan Wielemaker is a senior researcher at the Vrije Universiteit Amsterdam (VUA) and the Centrum voor Wiskunde en Informatica (CWI, The Netherlands). He is the lead developer of SWI-Prolog and responsible for the development and maintenance of the linked data infrastructure libraries for SWI-Prolog.
Nathalie Pernelle
N. Pernelle has contributed to the design of the research, the analysis of the results, and the writing of the manuscript.
Nathalie Pernelle is a Professor of Computer Science and a member of the LIPN (Laboratoire d'Informatique de Paris Nord) at the University of Paris in France, where she moved from Paris-Saclay University in 2019. Her research interests are related to knowledge discovery in data graphs. In particular, she has studied models and algorithms for data linking, rule mining, and the semantic annotation of unstructured documents. She has been involved in many academic and industrial projects related to various domains such as biological or geographical data, bibliographical knowledge bases, asbestos diagnoses, or problems related to the General Data Protection Regulation.
Fatiha Saïs
F. Saïs has contributed to the design of the research, the analysis of the results, and the writing of the manuscript.
Fatiha Saïs is currently an Associate Professor (HDR) at the Computer Science Research Laboratory (LRI) of Paris-Saclay University, France. She is currently the co-head of the LaHDAK group (Large-scale Heterogeneous Data and Knowledge). Her research focuses on identity management in the Web of data, knowledge graph fusion, knowledge discovery from RDF graphs, and, more recently, veracity assessment in knowledge graphs. Her work has been included in more than 20 national, industrial and European research projects. She has published more than 60 research papers in national and international conferences and journals such as ISWC (International Semantic Web Conference), Journal of Web Semantics and Journal of Data Semantics. She has served as a PC member for international conferences (ECAI, ESWC, K-Cap, ICCS, etc.) and national conferences (EGC, IC, BDA), and has organized and chaired several national and international workshops and conferences (WebToTouch, EGC, Verita, SoWedo, JDSE, etc.).
Data Intelligence