The second Creative Camp (TCC2) took place from 23 to 27 March 2015 in Amsterdam, hosted by the Meertens Institute. From the responses to the call for participation, which was circulated in January, 13 proposals were selected, and 24 participants travelled to the Netherlands to work with the data curated as part of Talk of Europe.
The most prevalent theme of this edition was the discovery and analysis of topics in the debate speeches. Although the proceedings of the EP contain titles for the debates, the content of the debates is not annotated any further. The only way to find speeches about a specific topic is by combining keywords. Annotating the debates with concepts from a thesaurus or taxonomy would greatly facilitate analysis of the dataset. Still, the participants’ agendas for TCC2 were quite diverse in how they approached the dataset. Some teams used the dataset to answer specific questions, while others examined the dataset as a whole; some participants exploited the data, while others enriched it with new insights or connections to other data. There was even a team that was interested in the proceedings as a linguistic resource rather than as an account of events.
The teams that entered the Camp with a well-defined information need were driven by a variety of research interests:
- Education policy researcher Julie Birkholz (UGhent) investigated what the EP has declared about higher education over the years.
- Members of the Georgian Institute of Public Affairs collected information about Balkan countries to embed in a research tool for journalists.
- Political and social scientists from the Technological Educational Institute of Western Macedonia and from Vytautas Magnus University in Lithuania teamed up to investigate in what terms the economic crisis has been discussed in Parliament.
- The delegation from the University of Malaga and University of Sevilla detected occurrences of debates about cultural heritage.
Other teams explored the speech texts and their metadata more broadly, looking for patterns and contrasts:
- The participants from the University of Tartu developed and analysed an index of all speech terms, so that they could determine whether the use of a term over a given time period differed significantly between any two countries, parties, or debates.
- Researchers from the Baltic Institute of Advanced Technology applied stylometric analysis to see whether MEPs differ in their style of speaking, for instance in the typical length of the sentences they produce.
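To give an idea of what such a term comparison can look like, here is a minimal Python sketch. It is not the Tartu team's actual method, which is not detailed here. As an assumption, it uses a log-likelihood ratio (G²) test on a 2×2 table of term counts, a common choice for comparing word frequencies between two corpora. The function names and the toy token lists are hypothetical.

```python
import math
from collections import Counter

# Chi-square critical value for 1 degree of freedom at p = 0.05.
CRITICAL_VALUE_P05 = 3.841


def g2_statistic(count_a, total_a, count_b, total_b):
    """Log-likelihood ratio (G^2) for one term in two groups of speeches.

    count_x is how often the term occurs in group x,
    total_x is the total number of tokens in group x.
    """
    observed = [
        [count_a, total_a - count_a],
        [count_b, total_b - count_b],
    ]
    row_totals = [total_a, total_b]
    grand_total = total_a + total_b
    col_totals = [count_a + count_b, grand_total - (count_a + count_b)]

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            # Cells with an observed count of zero contribute nothing.
            if observed[i][j] > 0:
                expected = row_totals[i] * col_totals[j] / grand_total
                g2 += observed[i][j] * math.log(observed[i][j] / expected)
    return 2.0 * g2


def differing_terms(tokens_a, tokens_b, threshold=CRITICAL_VALUE_P05):
    """Return the terms whose G^2 score exceeds the threshold."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    scores = {}
    for term in set(freq_a) | set(freq_b):
        score = g2_statistic(freq_a[term], total_a, freq_b[term], total_b)
        if score > threshold:
            scores[term] = score
    return scores


# Toy data standing in for the tokenised speeches of two groups.
group_a = ["crisis"] * 50 + ["europe"] * 50
group_b = ["crisis"] * 5 + ["europe"] * 50 + ["growth"] * 45
flagged = differing_terms(group_a, group_b)
print(sorted(flagged))
```

In this toy example "europe" is used equally often by both groups and is therefore not flagged. When many terms are tested at once, as with a full index, the threshold would need a correction for multiple comparisons.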
Several teams aimed to generate new data and so facilitate new research possibilities:
- Chaohai Ding (University of Southampton) detected concepts and sentiments in the speech texts with the help of NLP software modules. By doing the same for the UK Parliament, he was able to illustrate parallels and differences between the UK and the EP proceedings, supported by the interactive visualisations he made.
- Wim Peters and Adam Funk (University of Sheffield) used their in-house infrastructure GATE to annotate the speeches with the concepts in them and their degree of occurrence across the proceedings. They also annotated pairs of these concepts based on their semantic relationship. The work of both UK teams has been consolidated in RDF data, which was added to our dataset.
- A (second) team of researchers from various universities in Lithuania annotated the debates with topics from a recognised thematic taxonomy. The availability of Lithuanian translations of the debate titles enabled them to use their in-house classification tool designed for their national parliamentary proceedings, and so to thematically connect the Lithuanian proceedings to the European ones.
- Students from the University of Crete interlinked the EP proceedings and the Greek Parliament proceedings based on the Members of Parliament.
- Finally, researchers from the University of Vienna were interested in the Talk of Europe dataset as a linguistic (and multilingual) resource. They are investigating how parallel texts in the different languages can be used for automatically assisted translation.
The Creative Camp was not merely a hacking event. About 60% of the schedule consisted of break-out sessions and 40% of plenary activities, in which participants kept each other up to date, exchanged ideas, or simply took a well-deserved break from all that hacking.
On Thursday the members of the KNAW eHumanities Group at Meertens joined in on an afternoon of talks and discussion. Guillaume Jacquet from the Joint Research Centre (JRC) was invited to give a talk about the JRC’s activities and their relationship to Talk of Europe.
The JRC collects information about ‘named entities’, e.g. persons and organisations, and publishes this as linked data on the Open Data Portal of the EU. The main challenge in this process is disambiguation, as the same person or organisation is often referred to under different names. Jacquet showed how the JRC has set up a system that monitors media for mentions of such entities, disambiguates them, and promptly incorporates new information about their referents into its Named Entity Resource.
On the same occasion, two teams from the Camp were asked to talk in more detail about their TCC2 activities and their own research interests. Wim Peters (University of Sheffield) introduced us to GATE, the text engineering architecture made by the University of Sheffield, and explained how he and his colleague Adam Funk had applied it to the ToE speech texts. Alexander Tkachenko, Konstantin Tretyakov and Ilya Kuzovkin (University of Tartu) illustrated the statistics they had used to decide whether two groups of interest differ significantly in the terms they use. They demonstrated this method with a nice visualisation of some amusing example words.
Looking back at this event, we as the organisers feel that the Camp was very fruitful, in the first place because of the contributions of the participants, but also as a test environment. Jan Wielemaker from VU University introduced the participants to SWISH, an interactive SWI-Prolog environment embedded in the ClioPatria web service that we use to host our data. SWI-Prolog provides an alternative to SPARQL: the data can be queried (and the results downloaded) in Prolog, which is equipped with some RDF-specific functions. Prolog makes it easy to reuse parts of a query in a new query, and is well suited to manipulating and analysing data, for instance to perform calculations on times and dates. SWISH is a collaborative editor for Prolog, which means you can store queries and scripts and share them with others. We were happy to see that SWISH was used intensively during the Camp.
Participants’ experiences will be used for upcoming updates of the data and for future plans. Amongst other things, we learned that it is difficult to aggregate the EP parties over the years due to name changes and coalitions. One possible solution could be to connect our dataset to a taxonomy of EP parties over the years. Also, the groups that were using Natural Language Processing applications were hindered by erroneous language tags associated with speech texts, caused by (unflagged) missing translations in the source data. This is a known shortcoming that is now at the top of our list of future plans. In this context we would like to thank Žygimantas Medelis from TokenMill for his proof-of-concept language tag corrections and for pointing out the difficulties involved.
We at the Talk of Europe team think the second Creative Camp was a rewarding event, and we are looking forward to the next and final edition.