Multilingual Entity Spelling Variants as Linked Data on the EU’s Open Data Portal
Keynote: Guillaume Jacquet
The European Commission’s Joint Research Centre (JRC) has been analysing online versions of printed media in over twenty languages since 2004. In doing so, it has automatically recognised and collected large numbers of ‘named entities’ (names of persons, organisations and more) and their many spelling variants, including variants across different languages and scripts. By semi-automatically mining Wikipedia, this collection was extended so that it covers up to hundreds of spelling variants for the same name, all occurring in real-life text. These named entity spelling variant lists, known as JRC-Names, have been available for download since 2011. In this talk, we report on our efforts to publish JRC-Names as linked open data (LOD). The LOD version goes beyond the initial data release in that it now includes lists of titles found next to the names, as well as the date ranges in which the titles and the name variants were found. It also provides links to existing datasets, including DBpedia and Talk-Of-Europe. We use lemon, the lexicon model for ontologies, to represent and interlink the name variants, their titles and their equivalent entries in DBpedia. This LOD version of JRC-Names can help bridge the gap between structured data repositories and multilingual natural-language text, thus supporting large-scale data integration, web-based content processing and cross-lingual mapping. JRC-Names is publicly available in the dataset catalogue of the European Union’s Open Data Portal.
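As a rough illustration of the lemon-based representation mentioned in the abstract, the Python sketch below groups the spelling variants of one entity under a single lexical entry and links it to DBpedia, printing the result as N-Triples. It is not the actual JRC-Names schema. The base URI, the identifier scheme, the class and function names, and the choice of one canonical form plus "other" forms are assumptions made for this example. The sample names are illustrative only, and titles and date ranges are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional

LEMON = "http://lemon-model.net/lemon#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/jrc-names/"  # hypothetical base URI, not the real one


@dataclass
class NameVariant:
    written_rep: str  # the spelling as found in text
    lang: str         # language tag of the source text


@dataclass
class Entity:
    entity_id: str                      # hypothetical identifier
    canonical: NameVariant              # assumed "preferred" spelling
    variants: List[NameVariant] = field(default_factory=list)
    dbpedia_uri: Optional[str] = None   # equivalent DBpedia resource, if known


def uri(value: str) -> str:
    return f"<{value}>"


def literal(text: str, lang: str) -> str:
    # Escape backslashes and double quotes for N-Triples string literals.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def to_triples(entity: Entity) -> List[str]:
    """Serialise one entity and its spelling variants as N-Triples lines."""
    entry = uri(f"{BASE}entry/{entity.entity_id}")
    triples = [(entry, uri(RDF_TYPE), uri(LEMON + "LexicalEntry"))]

    forms = [("canonicalForm", entity.canonical)]
    forms += [("otherForm", v) for v in entity.variants]
    for index, (prop, variant) in enumerate(forms):
        form = uri(f"{BASE}form/{entity.entity_id}/{index}")
        triples.append((entry, uri(LEMON + prop), form))
        triples.append((form, uri(LEMON + "writtenRep"),
                        literal(variant.written_rep, variant.lang)))

    if entity.dbpedia_uri:
        # lemon links an entry to an external resource through a sense.
        sense = uri(f"{BASE}sense/{entity.entity_id}")
        triples.append((entry, uri(LEMON + "sense"), sense))
        triples.append((sense, uri(LEMON + "reference"), uri(entity.dbpedia_uri)))

    return [f"{s} {p} {o} ." for s, p, o in triples]


example = Entity(
    entity_id="example-1",
    canonical=NameVariant("Angela Merkel", "en"),
    variants=[NameVariant("Ангела Меркель", "ru")],
    dbpedia_uri="http://dbpedia.org/resource/Angela_Merkel",
)
lines = to_triples(example)
for line in lines:
    print(line)
```

Because every variant hangs off the same lexical entry, a tool that finds any one spelling in text can follow the entry to the other spellings and to the linked DBpedia resource.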
Guillaume Jacquet is a research scientist in Natural Language Processing. He currently works at the Joint Research Centre (JRC) of the European Commission, in the Institute for the Protection and Security of the Citizen (Ispra, Italy). He is the Scientific Project Officer of the Europe Media Monitor (EMM) team, which develops innovative solutions for retrieving and extracting information from the internet, especially from online news and social media, and serves many Commission Services, EU agencies and some EU Member State authorities. His current research interests include:
– Developing and improving NLP tools to structure and analyse the news.
– Multilingual Named Entity Guessing: detecting new or unknown Named Entities.
– Acronym resolution: linking short and long forms related to the same entity in a multilingual environment.
– Developing Linked Open Data and integrating it into existing NLP tools.