Extension of the Talk of Europe data set with terminological and semantic information

Wim Peters and Adam Funk

This is a repost of the original post, which can be read here.

We enjoyed the hospitality of the Meertens Institute in Amsterdam during Clarin’s Creative Camp in March 2015, which provided an excellent environment for a large number of groups from different countries to work with a common data set.

The pivot of our work was the Talk of Europe (ToE) data set, which consists of plenary debates of the European Parliament as Linked Open Data (http://linkedpolitics.ops.few.vu.nl/). The ToE data set is available in RDF and therefore readily extendable through this standard format.

The main aim of the Creative Camp was for the participating teams to work on a creative extension, interpretation or visualisation of this data set. Given the diversity of the groups’ activities and the different levels of automation involved, another important underlying aspect of the Camp was the exploration of the scope of automation in humanities research. Within the parameters of individual e-humanities researchers’ research questions and requirements, the exploratory activities and outcomes of the Camp illustrate strategies for bringing humanities researchers closer to automation and digitalisation in their research, and for assisting them in this transition.

The overall task for computational involvement in eHumanities is to ensure that humanities researchers, a considerable part of whom still remain to be convinced of the advantages of this digital revolution for their research, will embrace technology across the board, from digitisation to manual inspection and annotation to fully automated analysis.

Scope of our work
Our work produces a ToE extension by means of automatic natural language processing techniques. Below we present a general description. Details can be found on our slides.

Our main tool for the analysis of the parliamentary texts and the production of linguistic, terminological and semantic metadata is the General Architecture for Text Engineering (GATE; http://www.gate.ac.uk). GATE is a framework for language engineering applications, which supports efficient and robust text processing including functionality for both manual and automatic annotation. It is highly scalable and has been applied in many large text processing projects. It is an open source desktop application written in Java that provides a user interface for professional linguists and text engineers to bring together a wide variety of natural language processing tools and apply them to a set of documents.

Our activities concentrated on the extraction and linking of English terminology from ToE and UK parliamentary speeches (http://www.parliament.uk/business/publications/parliamentary-archives/).

This work involved the following subtasks (see slides for more information):

  • Linguistic data pre-processing, such as tokenization, part-of-speech tagging and sentence detection
  • Sentence-based sentiment analysis
  • Term extraction: determining important domain-specific vocabulary by assigning termhood scores to terms
  • Relation extraction: finding related terms internal to each data set using pointwise mutual information
  • RDF production using, for example, standard SKOS relations, which provides the enrichment of the ToE data set

The resulting RDF and what you can do with it
The RDF resource that is the result of our efforts slots into the ToE data model, which makes it an integral part of the ToE data set. It is not an end in itself. It represents partial semantic analysis and does not offer final research results at the push of a button. Its main function is that it offers opportunities for further scholarly analysis.

The main function of our output is to enable researchers to further explore the content of the parliamentary speeches by means of querying semantic metadata in SPARQL. The semantic metadata we offer in service of this exploration and search capability are domain concepts, relations between these concepts (semantic context) and sentiment. The provided concepts, relations and sentiment allow for more in-depth interpretation and search for material that is relevant to research interests. SPARQL queries enable the customized selection of potentially interesting material.

For instance:

  • Extraction and inspection of the semantic context of identical terms in both data sets by means of the extracted relations between terms and the sentiment with which they are mentioned.
  • Related terms enable semantic traversal through the data.
  • Sentiment can provide insight into the attitude of members of parliament and their political parties regarding certain issues.

In our opinion, the combination of automated content analysis and manual metadata exploration assists the scholarly exploration of larger amounts of data and suggests material for further close reading.

In our RDF extension, the information is captured by two additional concepts: Sentence and Term (see schema below).

Relations between terms are covered by:

  • skos:narrowerTransitive (hyponymy)
  • skos:related (internal to each parliamentary speech set)
  • crossRelated (between EU and UK terms)
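
As a rough illustration of how these three relation types look as triples, the sketch below writes them out in N-Triples syntax using only the Python standard library. The SKOS namespace is the standard one. Everything else is hypothetical: the `EXT` namespace for crossRelated, the term URI scheme and the example terms are placeholders, not the identifiers used in our RDF output.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"  # standard SKOS namespace
EXT = "http://example.org/toe-ext#"            # hypothetical namespace for crossRelated


def term_uri(dataset, label):
    """Build a placeholder URI for a term (hypothetical URI scheme)."""
    return f"http://example.org/term/{dataset}/{label.replace(' ', '_')}"


def triple(subject, predicate, obj):
    """Serialise one triple of URIs as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def relation_triples():
    """Return one example triple per relation type (illustrative terms only)."""
    eu_energy = term_uri("eu", "energy")
    eu_nuclear = term_uri("eu", "nuclear energy")
    eu_climate = term_uri("eu", "climate change")
    uk_energy = term_uri("uk", "energy")
    return [
        # Hyponymy: the object is the narrower term.
        triple(eu_energy, SKOS + "narrowerTransitive", eu_nuclear),
        # Relatedness within one parliamentary speech set.
        triple(eu_energy, SKOS + "related", eu_climate),
        # Link between an EU term and a UK term.
        triple(eu_energy, EXT + "crossRelated", uk_energy),
    ]


if __name__ == "__main__":
    print("\n".join(relation_triples()))
```

Note the direction of skos:narrowerTransitive: the subject is the broader term and the object the narrower one.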

[Figure: schema of the RDF output]

More information and downloads

  • Slides detailing our methodology and results
  • The RDF output
  • The OWL ontological model describing our extension of the ToE model

SPARQL

As an illustration of the querying capabilities, the example below selects terms from UK parliamentary speeches together with their related terms. Only terms with a termhood score greater than 70 out of 100 are selected.