Solveiga Inokaitytė, Dainius Jocas, Justina Madravickaitė, Vytautas Mickevičius and Rūta Užupytė
Idea / Plan
The European Parliament (EP) has been and remains an active institutional actor in the formulation of EU foreign and external policies. However, with the entry into force of the Lisbon Treaty, the EP’s functions and policy scope in foreign policy and diplomacy have widened significantly. The Parliament examines developments in the Common Security and Defence Policy (CSDP) in terms of institutions, capabilities and operations and, more importantly, it ensures that security and defence issues respond to concerns expressed by the EU’s citizens. As a result, it debates not only European matters but also topics that reflect the global agenda of international politics. In this way the EP can fulfil its role in the process of building an EU ‘representative democracy’. To evaluate how effectively the EP communicates global events to the wider European public, it is necessary to analyze to what extent and how quickly the EP responds to the changing international context. To do this, researchers need linked data in which events on the global agenda are linked with debates of the EP. We proposed to link debates of the European Parliament with the GDELT Event Database (http://gdeltproject.org).
Large amounts of information must be preprocessed appropriately before they can be used in research. We therefore carefully selected the useful (and usable) fields in the GDELT database. The raw GDELT database describes the following features:
- Date of the event;
- Source (country);
- Target (actor) – a country, an institution, a person etc;
- Number of mentions in the media;
- Number of minor related/consecutive events;
- Exact geographical location (coordinates) of the event;
- Several CAMEO categories in which the events are classified (economics, health issues, government operations etc);
- URL to the article where the event was first mentioned.
The most important task in this step was to find a way to connect the events with EP speeches. We determined that the most efficient and practical way to do so (one not requiring extensive resources and research) is to find keywords that describe the events and use them to form queries.
We generated a set of keywords for each event considered in this research using the following techniques:
- Most of the URLs contain the title of the article in which the event was first mentioned – these titles were split into words that were used as keywords (after eliminating non-informative words, such as “and”, “but”, “is” etc);
- Each event was assigned a default set of keywords according to its CAMEO category.
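The two techniques above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stop-word list is deliberately tiny, and the `CAMEO_DEFAULTS` mapping, the function names and the example URL are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Small illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "but", "is", "the", "of", "in", "to", "for", "on", "at"}

# Hypothetical default keywords per CAMEO root code (placeholders only).
CAMEO_DEFAULTS = {
    "14": ["protest", "demonstration"],
    "19": ["fight", "attack"],
}

def keywords_from_url(url):
    """Split the last path segment of an article URL into informative words."""
    path = urlparse(url).path.rstrip("/")
    slug = path.rsplit("/", 1)[-1]
    slug = re.sub(r"\.[A-Za-z0-9]+$", "", slug)  # drop an extension such as .html
    words = re.split(r"[^A-Za-z]+", slug.lower())
    # Keep words longer than two letters that are not stop words.
    return [w for w in words if len(w) > 2 and w not in STOP_WORDS]

def event_keywords(url, cameo_root_code):
    """Combine URL-derived keywords with the category defaults, without duplicates."""
    combined = keywords_from_url(url) + CAMEO_DEFAULTS.get(cameo_root_code, [])
    return list(dict.fromkeys(combined))  # preserves first-seen order

example = event_keywords(
    "http://example.com/news/2014/ebola-outbreak-is-spreading-in-guinea.html", "14"
)
```

URLs whose last segment is an opaque identifier rather than a title yield few or no keywords, which is why the category defaults serve as a fallback.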
The main challenge in data preprocessing was the vagueness of certain very well-known events. The crisis in Ukraine, for example, has no strict beginning date, nor can it be unambiguously classified into one of the CAMEO categories. The analysis of such events was therefore left for future work, and only clearly defined events were examined further.
The goal of the technical work was to link events from the GDELT event database with debates of the European Parliament. To achieve this goal we split the work into three tasks:
- Build a mechanism to execute ad-hoc analytical queries on GDELT;
- Convert the results of GDELT queries into a query to the database of European Parliament debates;
- Visualise the results.
While working on the first task we found that there are three options for querying GDELT:
- GDELT Analysis Service: its main disadvantage is that results of the query are sent via email;
- Google BigQuery: the overall setup process seems to be complicated;
- Raw Data Files: zipped CSV files.
We therefore proceeded with the third option, the raw data files. Our solution was to download the raw data files for the defined time period, unzip them, and index the CSV files into an ElasticSearch index. ElasticSearch is well suited to ad-hoc querying, and its API returns results in JSON format, which is convenient for further processing. The one-time data loading was implemented with a bash script.
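The loading step was done with a bash script that is not reproduced here. As a rough Python sketch of the same idea, the function below turns tab-delimited rows (the raw GDELT files are tab-delimited despite the CSV extension) into the newline-delimited body expected by the ElasticSearch bulk API. The index name `gdelt`, the field names and the column positions are assumptions for illustration; real positions must be taken from the GDELT codebook, and the action line without a `_type` assumes a recent ElasticSearch version.

```python
import csv
import io
import json

def to_bulk_ndjson(tsv_text, columns, index_name="gdelt"):
    """Turn tab-delimited rows into the newline-delimited body of a bulk index request.

    `columns` maps a field name to a zero-based column position (an assumption
    supplied by the caller, not something read from the file).
    """
    lines = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        doc = {name: row[pos] for name, pos in columns.items() if pos < len(row)}
        lines.append(json.dumps({"index": {"_index": index_name}}))  # action line
        lines.append(json.dumps(doc))                                # document line
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""

# Toy input with four columns; the values are placeholders, not GDELT data.
sample = "1\t20140113\t14\thttp://example.com/a\n2\t20140115\t19\thttp://example.com/b\n"
toy_columns = {"event_id": 0, "date": 1, "event_root_code": 2, "source_url": 3}
body = to_bulk_ndjson(sample, toy_columns)
```

The resulting string can be sent as the body of a bulk request with any HTTP client.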
For the second task we needed a database of the debates of the European Parliament. Conveniently, we were offered the use of the http://search.politicalmashup.nl database (many thanks to Maarten Marx) to search for debates of the European Parliament. With this database available, our task boiled down to transforming the results of a query against our ElasticSearch index of GDELT events into a query against the http://search.politicalmashup.nl database, which is also stored as an ElasticSearch index. The task was implemented as a custom application written in the Clojure programming language.
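The transformation from event to debate query might look like the sketch below, written in Python rather than the language of the original application. It builds a standard ElasticSearch `bool` query from an event's keywords and date. The field names `text` and `date`, the 30-day window and the requirement that all keywords match are assumptions for illustration; the actual schema of the debates index is not described here.

```python
from datetime import date, timedelta

def build_debate_query(keywords, event_date, window_days=30,
                       text_field="text", date_field="date"):
    """Build a query body that finds speeches mentioning all keywords
    within `window_days` after the event date."""
    end = event_date + timedelta(days=window_days)
    return {
        "query": {
            "bool": {
                # Full-text match on the speech text; "and" requires every keyword.
                "must": [
                    {"match": {text_field: {"query": " ".join(keywords), "operator": "and"}}}
                ],
                # Restrict to speeches given on or after the event.
                "filter": [
                    {"range": {date_field: {"gte": event_date.isoformat(),
                                            "lte": end.isoformat()}}}
                ],
            }
        },
        "sort": [{date_field: "asc"}],
    }

query = build_debate_query(["ebola", "outbreak"], date(2014, 1, 1))
```

Requiring every keyword keeps the matches precise but may miss speeches that use different wording; switching the operator to `or` trades precision for recall.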
The third task was to visualise the link between a GDELT event and the debates of the European Parliament. The link is a JSON document containing event metadata, such as the first mention in the media and the keywords describing the event, together with a list of debate speeches that mention the event. The visualisation was implemented using R and its package openair. The result is a PDF file or image with a calendar heatmap that depicts mentions in debates of the European Parliament (the first mention of the event by the media is highlighted by a green circle).
To sum up, we performed a one-time load of GDELT data into the ElasticSearch index and implemented an application with a command line interface that executes a query for GDELT events, searches the http://search.politicalmashup.nl database, and then draws a calendar heatmap.
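To make the shape of such a link document concrete, the sketch below shows a hypothetical example and the per-day aggregation that a calendar heatmap needs as input. All field names and values are placeholders rather than the actual output format, and the plotting itself (done in R with openair in the original work) is not covered.

```python
from collections import Counter
from datetime import date

# Illustrative link document; every field name and value is a placeholder.
EXAMPLE_LINK = {
    "event": {"first_mention": "2014-01-02", "keywords": ["ebola", "outbreak"]},
    "speeches": [
        {"id": "speech-1", "date": "2014-01-13"},
        {"id": "speech-2", "date": "2014-01-15"},
        {"id": "speech-3", "date": "2014-01-15"},
    ],
}

def daily_mention_counts(link):
    """Count how many linked speeches fall on each calendar day."""
    return Counter(date.fromisoformat(s["date"]) for s in link["speeches"])

counts = daily_mention_counts(EXAMPLE_LINK)
```

Each key of `counts` is a day and each value is the number of speeches on that day, which corresponds to the colour intensity of one calendar cell.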
Results are presented as a calendar heatmap. This visualization represents values as colors in a calendar-like view, making it easy to identify the beginning, duration and intensity of the debates. Here yellow indicates a low rate of discussion (a small number of mentions during debates) and red a high rate; the green dot marks the event date obtained from the GDELT database. For example, the outbreak of Ebola in West Africa (see Figure 1) occurred at the beginning of January 2014. Discussion of this event in debates of the European Parliament started on the 13th of January and reached its peak on the 15th of January. Another example, shown in Figure 2, indicates that the Edward Snowden scandal occurred on the 21st of June and that the topic was debated by the European Parliament on the 2nd of July. These two examples suggest that it takes the European Parliament approximately two weeks to react to global events, which is a relatively short period of time considering the size of the institution.
Considering the challenges that emerged in ToE CC3, we plan to extend our research with the goal of making the linking of events to EP speeches as efficient as possible. This includes more comprehensive event definitions, improved information extraction from the sources and more fruitful keyword generation. An infrastructure that allows users to automatically browse and link events with EP speeches is also under consideration.
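The reaction time read off the heatmaps can also be computed directly from the dates. The sketch below is a hypothetical helper, not part of the original application; it uses the Snowden dates quoted above, with the year 2013 filled in as an assumption since the text gives only day and month.

```python
from datetime import date

def reaction_lag_days(event_date, speech_dates):
    """Days from the event to the first debate mention on or after it.

    Returns None when no speech follows the event.
    """
    later = [d for d in speech_dates if d >= event_date]
    if not later:
        return None
    return (min(later) - event_date).days

# Event on 21 June, first debate on 2 July (year assumed to be 2013).
lag = reaction_lag_days(date(2013, 6, 21), [date(2013, 7, 2)])
```

For these dates the lag is 11 days, consistent with the rough two-week estimate.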
Figure 1: Outbreaks of Ebola in West Africa
Figure 2: Edward Snowden Disclosures About NSA