Modelling the European debates

The EU publishes the full transcripts of the plenary meetings in all current official languages of the EU on its website. Available online are the date and title of each agenda item, the names of the speakers and the transcript of every speech, including all translations. We took this information as the main source for our semantic dataset.

The structure of the EU debate organization, as derived from the (online) official agenda, was taken as the skeleton of the data model. As such, the monthly sessions form the highest level of the hierarchy, consisting of (four or two) daily sessions. A session day in turn is broken down into agenda items such as debates or statements, which consist of speeches, which each feature a single speaker.
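As a minimal sketch of this skeleton, the hierarchy could be written down as nested Python data classes. The class and field names, as well as the sample values, are assumptions made for illustration; they are not the identifiers used in the actual dataset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Speech:
    speaker: str  # every speech features exactly one speaker
    text: str


@dataclass
class AgendaItem:
    title: str
    kind: str  # e.g. "debate" or "statement"
    speeches: List[Speech] = field(default_factory=list)


@dataclass
class SessionDay:
    label: str
    agenda_items: List[AgendaItem] = field(default_factory=list)


@dataclass
class Session:
    month: str
    days: List[SessionDay] = field(default_factory=list)


def count_speeches(session: Session) -> int:
    """Walk the hierarchy top-down and count all speeches in a session."""
    return sum(
        len(item.speeches)
        for day in session.days
        for item in day.agenda_items
    )


# Hypothetical sample data, for illustration only.
session = Session(
    month="2000-01",
    days=[
        SessionDay(
            label="day 1",
            agenda_items=[
                AgendaItem(
                    title="Example debate",
                    kind="debate",
                    speeches=[
                        Speech("Speaker A", "..."),
                        Speech("Speaker B", "..."),
                    ],
                )
            ],
        )
    ],
)
```

Calling count_speeches(session) walks from the session down to the individual speeches, which mirrors the "consists of" relations described above.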

The Resource Description Framework

The Resource Description Framework (RDF), which we adhere to, refers to things in the world in an unambiguous way. To this end, RDF uses unique identifiers called IRIs (Internationalized Resource Identifiers). These identifiers look like web addresses and ideally resolve to actual web pages that contain information on what they denote. Not only entities receive such an identifier, but also semantic relationships. For now, it is important to understand only a few aspects of IRIs. First, they are built up hierarchically: the first part, or namespace, embeds the resource in a vocabulary, and the last part is the most discriminative; for instance, the last part may identify a particular EU member or a has-role relationship. Second, the namespace of an IRI is commonly abbreviated to a predefined prefix, such as polivoc:. Finally, identity relationships between any two resources with different IRIs can be added to the web at any point in time, so in the initial modelling phase one need not be overly concerned with existing identifiers.
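To make the prefix mechanism concrete, here is a small sketch in Python that expands a prefixed name into a full IRI. The polivoc namespace below is a placeholder invented for this example, not the real one; the rdf namespace is the standard one defined by the W3C. The function name expand is likewise hypothetical.

```python
# Mapping from prefix to namespace IRI.
PREFIXES = {
    # ASSUMPTION: placeholder namespace, not the actual polivoc namespace.
    "polivoc": "http://example.org/polivoc/",
    # Standard RDF namespace.
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(prefixed_name: str) -> str:
    """Replace a known prefix (the part before ':') by its namespace."""
    prefix, separator, local_part = prefixed_name.partition(":")
    if not separator or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local_part
```

With these definitions, expand("rdf:type") yields the full IRI of the rdf:type predicate, while a name with an undeclared prefix raises a ValueError.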

Another aspect of RDF you should know is that it expresses information in a tripartite, sentence-like format: a subject and an object, connected by a predicate that expresses the semantic relationship between the two. The predicate is always a resource with an IRI, and so, typically, is the subject; objects can be either resources or literal data values such as text. Note that an object can also be represented as a resource without an IRI, a so-called “blank node”. This is done when the object owes its meaning to its specifications and does not need to be referred to by itself. The RDF syntax thus allows for statements about the relationship between two resources (e.g., EUmember_64 represents EUcountry_DE) or between a resource and a piece of textual information that is not an IRI (e.g., EUmember_64 has-familyname "Markov"). As new predicates can be defined ad hoc by any user, the possibilities of expression are endless.
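The three kinds of object (resource, literal, blank node) can be sketched in plain Python as follows. This is an illustration only: the has-affiliation and has-role triples and the blank node label b1 are hypothetical, and the polivoc: prefix on the predicates is assumed.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class Literal:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str  # only meaningful locally; never referenced from outside


@dataclass(frozen=True)
class Triple:
    # A blank node can also be the subject of the triples that specify it.
    subject: Union[IRI, BlankNode]
    predicate: IRI
    obj: Union[IRI, BlankNode, Literal]


member = IRI("polivoc:EUmember_64")
affiliation = BlankNode("b1")  # hypothetical blank node

triples: List[Triple] = [
    # resource -> resource
    Triple(member, IRI("polivoc:represents"), IRI("polivoc:EUcountry_DE")),
    # resource -> literal
    Triple(member, IRI("polivoc:has-familyname"), Literal("Markov")),
    # resource -> blank node, which is then specified further (hypothetical)
    Triple(member, IRI("polivoc:has-affiliation"), affiliation),
    Triple(affiliation, IRI("polivoc:has-role"), Literal("member")),
]


def objects(subject, predicate):
    """Return all objects stated for a given subject and predicate."""
    return [
        t.obj for t in triples
        if t.subject == subject and t.predicate == predicate
    ]
```

Using separate classes keeps the literal "Markov" distinct from a resource that happens to carry the same string, which is the distinction the paragraph above draws.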

Finally, RDF relies heavily on hierarchical concept modelling. Similar resources are grouped together in classes, which can be taken from existing classifications (ontologies) or newly defined, and which again carry an IRI. The predicate rdf:type is conventionally used to connect a resource to the class to which it is assigned. For instance, the triple polivoc:EUmember_64 rdf:type polivoc:Politician assigns a specific EU member to the class Politician.
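As a rough sketch, class membership can be read off a set of triples by collecting the subjects of all rdf:type statements. Triples are simplified to tuples of strings here, and polivoc:EUmember_12 is a made-up resource added only to make the grouping visible.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

triples = [
    ("polivoc:EUmember_64", RDF_TYPE, "polivoc:Politician"),
    ("polivoc:EUmember_64", "polivoc:has-familyname", "Markov"),
    # Hypothetical second member, for illustration only.
    ("polivoc:EUmember_12", RDF_TYPE, "polivoc:Politician"),
]


def members_by_class(triples):
    """Group resources by the class they are assigned to via rdf:type."""
    classes = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            classes[obj].add(subject)
    return dict(classes)
```

Triples with other predicates, such as the family name, are ignored by the grouping.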

Hopefully, this digression about the syntax of RDF allows you to understand that a modelling task like the one at hand involves the following: identifying meaningful information elements, denoting them by meaningful and clear IRIs, and structuring them in categories; and expressing the debate events in terms of relationships between these elements.

The data model

Figure 1 illustrates the preliminary result of this process, using the following notation. Resources are indicated by uncoloured rounded boxes, classes by grey rounded boxes, and literals by square boxes. For readability, the namespace (prefix) is omitted in the classes’ IRIs. Each arrow denotes a relationship that we defined between the resource from which it originates (the subject) and the resource or data element it points to (the object); read from subject to object, they form a triple. For instance, the leftmost vertically oriented arrow expresses, by means of the triple linkedpolitics:Session_1 dc:date "2000-01", that this particular session was held in January 2000. Note that the depicted data elements are merely examples: in reality, the date, for instance, is declared for each session featured in the dataset. When multiple resources are depicted as a bundle, this indicates one of two things: they are objects of the same subject-predicate relation, as indicated by the separate arrows; or they are equivalents, in the way that debate, vote, statement, and question are all types (subclasses) of AgendaItem. Note that the model does not express any information about mutual exclusivity or combinatorial restrictions of such equivalents; it only aims to represent the full range of options beyond the chosen example.

Figure 1. The data model
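The subclass bundles in the figure can be illustrated with a short sketch. It assumes that the four agenda item types are declared with the standard rdfs:subClassOf predicate and that the classes are named Debate, Vote, Statement and Question; the two AgendaItem resources are hypothetical.

```python
SUBCLASS_OF = "rdfs:subClassOf"
RDF_TYPE = "rdf:type"

triples = [
    # ASSUMPTION: class names as suggested by the text, prefixes omitted.
    ("Debate", SUBCLASS_OF, "AgendaItem"),
    ("Vote", SUBCLASS_OF, "AgendaItem"),
    ("Statement", SUBCLASS_OF, "AgendaItem"),
    ("Question", SUBCLASS_OF, "AgendaItem"),
    # Hypothetical instances, for illustration only.
    ("AgendaItem_1", RDF_TYPE, "Debate"),
    ("AgendaItem_2", RDF_TYPE, "Question"),
]


def subclasses(cls):
    """Return cls together with all its direct and indirect subclasses."""
    found = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if predicate == SUBCLASS_OF and obj == current and subject not in found:
                found.add(subject)
                frontier.append(subject)
    return found


def instances(cls):
    """Return all resources typed as cls or as one of its subclasses."""
    wanted = subclasses(cls)
    return sorted(
        subject for subject, predicate, obj in triples
        if predicate == RDF_TYPE and obj in wanted
    )
```

Asking for instances("AgendaItem") returns both example items, whereas instances("Debate") returns only the first; nothing in the triples says whether an item could be a debate and a question at once, in line with the remark on mutual exclusivity above.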

The figure takes as a (hypothetical) example the January part-session of the year 2000, which, realistically, comprises four consecutive days. Each day covers the items scheduled in the agenda, which can be debates, votes, statements or questions. These items are often backed up by official legislative or declarative documents; however, since these data are available to us only at the day level, we connected them to this higher level of the hierarchy. Represented are the committee to which the report was assigned, and hence the domain, and the ‘rapporteur’, the committee member responsible for the report. Returning to the spinal structure, formed by the polivoc:hasPart relations, every agenda item consists of speeches. These are the most prevalent and semantically rich information units. The speeches are categorized by the capacity in which the speaker was assigned the speaking turn. They are specified by a number of relationships, the main one being the speaker, who is characterized by a name, a link to his or her record on the EU website, a country of representation, and, if applicable, one or more political affiliations. The latter are defined by a role in a political institution for a given time interval, and are represented as blank nodes. Also, for every speech, the spoken text is expressed separately from its translations, and for pragmatic reasons, the language tag that corresponds to the spoken text is also added as a direct feature of a speech item.
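The affiliation structure (a role in an institution for a given time interval) can be sketched as follows. All names, roles and dates below are invented for illustration, and the assumption that an open-ended affiliation has no end date is ours.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Affiliation:
    """Counterpart of a blank node: defined only by its specifications."""
    role: str
    institution: str
    start: date
    end: Optional[date] = None  # ASSUMPTION: None means still ongoing

    def active_on(self, day: date) -> bool:
        return self.start <= day and (self.end is None or day <= self.end)


@dataclass
class Speaker:
    name: str
    country: str
    affiliations: List[Affiliation] = field(default_factory=list)

    def affiliations_on(self, day: date) -> List[Affiliation]:
        """Return the affiliations that were in force on the given day."""
        return [a for a in self.affiliations if a.active_on(day)]


# Hypothetical sample data.
speaker = Speaker(
    name="Example Speaker",
    country="DE",
    affiliations=[
        Affiliation("member", "Example Party", date(1999, 7, 20)),
        Affiliation("member", "Example Committee", date(1999, 7, 21), date(1999, 12, 31)),
    ],
)
```

Given the date of a speech, affiliations_on returns the affiliations that applied at that moment, which is what the time interval on each affiliation makes possible.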

Modelling choices

The assumptions that we made in naming, classifying and specifying elements were validated by a domain expert. For instance, examining a number of debates, we noticed that the first speaker after the chair’s introduction was the author of the report under discussion; the expert confirmed this as a regularity and pointed out other fixed-role speeches, such as the speech by a European Commission member, which usually follows the rapporteur’s speech, and the shadow rapporteur’s speech with which every party’s speaking block starts. However, to allow for irregular courses of action, we did not fix the usual order of speakers in the data model; instead, the role of the speaker was made a property of the speech, independent of the position of the speech in the debate. The consultations with the domain expert led us to change some of the terminology as well. Rather than referring to the monthly multi-day sitting as “part-session”, as the EU website does, we opted for the term “session”, which was said to be the term used in practice.

The characteristics of the plenary sessions, as outlined in the previous blog post, led to a number of modelling choices. First, since the debates do not display a dialogue structure, subsequent speeches were defined in neutral terms of order (hasSubsequent) rather than in terms of action and reaction. Second, because the relationship between them is opaque, roles and political affiliations are clearly separated. That is, a speaker’s momentary role in a speech, if any and if known, is distinguished from the forces that may influence him or her, that is, the conglomerate of political and geographical affiliations related to the speaker. Finally, and trivially, the legislative aspects are represented only to the extent to which they are reflected in the available plenary data, which is through the official documents under discussion.
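To illustrate the neutral ordering, the following sketch reconstructs the sequence of speeches from a set of hasSubsequent triples. The speech identifiers are made up, and the assumption that each speech has at most one successor is ours.

```python
HAS_SUBSEQUENT = "hasSubsequent"

# Hypothetical triples, deliberately listed out of order.
triples = [
    ("Speech_2", HAS_SUBSEQUENT, "Speech_3"),
    ("Speech_1", HAS_SUBSEQUENT, "Speech_2"),
]


def ordered_speeches(triples):
    """Follow the hasSubsequent links from the first speech to the last."""
    successor = {s: o for s, p, o in triples if p == HAS_SUBSEQUENT}
    if not successor:
        return []
    followers = set(successor.values())
    # The first speech is the only one that is nobody's successor.
    starts = [s for s in successor if s not in followers]
    if len(starts) != 1:
        raise ValueError("expected exactly one first speech")
    order = [starts[0]]
    seen = {starts[0]}
    while order[-1] in successor:
        nxt = successor[order[-1]]
        if nxt in seen:
            raise ValueError("cycle in hasSubsequent links")
        order.append(nxt)
        seen.add(nxt)
    return order
```

The links only state which speech follows which; they carry no claim that a speech responds to the one before it.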

For pragmatic reasons, we tried to distinguish potentially meaningful elements as separate data entities. For example, the actual spoken text in the original language is kept apart from its translations, so that search actions can be restricted to the literal text when phrasing plays a role. Speeches that are held in the capacity of a specific role in a political body, such as the speech by the chair and the rapporteur’s speech, are also distinguished. These speeches differ from regular member speeches in that they may not reflect the speaker’s individual opinion. Because every type of speech has its own class, users can search directly for speeches of a certain type. This makes it easier to study the characteristics of each role and to compare roles among themselves as well as to non-role-specific speeches of members of parliament.
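The kind of search this separation enables might look as follows. This is a sketch over in-memory objects rather than a query on the actual dataset; the speech type names RapporteurSpeech and MemberSpeech and the sample texts are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Speech:
    speech_type: str   # ASSUMPTION: e.g. "RapporteurSpeech", "MemberSpeech"
    language: str      # language tag of the spoken text
    spoken_text: str   # the literal text in the original language
    translations: Dict[str, str] = field(default_factory=dict)


def search(
    speeches: List[Speech],
    phrase: str,
    speech_type: Optional[str] = None,
    include_translations: bool = False,
) -> List[Speech]:
    """Find speeches containing phrase, by default in the spoken text only."""
    needle = phrase.lower()
    hits = []
    for speech in speeches:
        if speech_type is not None and speech.speech_type != speech_type:
            continue
        texts = [speech.spoken_text]
        if include_translations:
            texts.extend(speech.translations.values())
        if any(needle in text.lower() for text in texts):
            hits.append(speech)
    return hits


# Hypothetical sample data.
speeches = [
    Speech("RapporteurSpeech", "de", "Herr Präsident, der Bericht ...",
           {"en": "Mr President, the report ..."}),
    Speech("MemberSpeech", "en", "The report deserves support."),
]
```

Searching for "report" in the spoken text alone finds only the second speech; with include_translations=True the rapporteur's speech is found as well, and speech_type narrows the result to one role.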

The depicted model is an ‘idealized’ version, in the sense that it contains information that is implicit in the debate logs or represented elsewhere in the official EU information, but that might be a valuable addition. One example is the duration of an agenda item, which is listed in the agenda but not in the proceedings of the EU; it might be valuable for quantifying the relative time that parliament invests in specific topics or domains. Another example is the data on the report, such as the author and the identification code by which it can be retrieved from the document registry. These data need to be parsed from the chair’s introductory speech or collected from the agenda. It also seems valuable to include information about the current and past political affiliations of every speaker. This information is available through a search facility on the official website of the European Parliament and has been collected in a database by Høyland et al. We intend to enrich the dataset with the aforementioned data sources as the project proceeds.

If you have any suggestions, questions or remarks, please comment on this post.