Hacking the Style: Who Talks the Same in EP?

Justina Mandravickaite, Žygimantas Medelis, Tomas Krilavičius, Vaidas Morkevičius, Vytautas Mickevičius

We joined the Talk of Europe Creative Camp at the Meertens Institute in Amsterdam, the Netherlands, on March 23-27, 2015, with the intention of learning who talks how in the European Parliament (EP).

The project “Hacking the Style: Who Talks the Same in EP?” addressed the following questions:

  1. How much does the rhetoric of individual EP members differ from (or resemble) the rhetoric of their own and other factions (political groups)?
  2. Who talks the same at the individual level as well as at the faction (political group) level?

To address these questions, we applied stylometric analysis to the transcribed speeches of EP members. Stylometry, or computational stylistics, is mostly used to analyze texts (it can also be applied to images, music, gene sequences, etc.), focusing on their style and structure. It relies mostly on shallow features (e.g., character count, average characters per word, letters, function words, punctuation). These features, or vectors built from them, are then classified using supervised or unsupervised methods.
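To make the idea of shallow features concrete, here is a minimal Python sketch that computes a few of the features listed above for a single text. The function name `shallow_features` and the short `FUNCTION_WORDS` list are hypothetical and chosen only for illustration; they are not taken from the project, which used the Stylo package for R.

```python
import string

# Hypothetical, deliberately tiny list of English function words (illustration only).
FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "that", "it", "is", "for"}


def shallow_features(text):
    """Return a few shallow stylometric features of a text as a dict."""
    # Split on whitespace, strip surrounding punctuation, and lowercase each token.
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were pure punctuation
    n_words = len(words)
    return {
        "char_count": len(text),
        "avg_chars_per_word": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "function_word_ratio": sum(w in FUNCTION_WORDS for w in words) / n_words if n_words else 0.0,
        "punctuation_count": sum(ch in string.punctuation for ch in text),
    }


features = shallow_features("Mr President, the report is clear.")
print(features)
```

Each text is thus reduced to a small numeric vector, which is the kind of input a supervised or unsupervised method can work with.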

Computational stylometry develops techniques for inferring information about the authors of texts through automatic linguistic analysis of those texts. Common applications are authorship attribution and the exploration of personal style (stylistic differences). The parameters that matter most in a stylometric analysis are the frequent words, phrases, and n-grams used as features; the number of texts and authors; and the length and selection of the word list.

The intended outcome was a tree visualization of the EP members’ speech transcripts, arranged according to their rhetorical similarity or dissimilarity to their own parliamentary faction (political group) as well as to other factions. In this way we planned to reveal rhetorical similarities and dissimilarities in the EP at the faction (political group) level.

The data set included English transcripts of speeches by 712 of the 751 members of the EP (MEPs), covering the period 2008-2014; the speeches of the remaining MEPs were too short for such analysis. Each speech fragment was stored in a separate file. The files were identified by country, EP faction (political group), name of the speaker, session ID, and speech ID, for example: IT_EFD_Speroni_9_71.txt. All in all we had ~300 MB of textual data.
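The file naming scheme can be unpacked with a few lines of Python. The sketch below is an illustration only: the `SpeechFile` type and the `parse_speech_filename` function are hypothetical names, and the code assumes that country and faction codes contain no underscores and that both IDs are integers, as in the example file name.

```python
from pathlib import Path
from typing import NamedTuple


class SpeechFile(NamedTuple):
    """Metadata encoded in a speech file name (hypothetical helper type)."""
    country: str
    faction: str
    speaker: str
    session_id: int
    speech_id: int


def parse_speech_filename(filename):
    """Split a name such as IT_EFD_Speroni_9_71.txt into its five fields."""
    stem = Path(filename).stem  # drop the .txt extension
    # Assumption: the first two fields (country, faction) contain no underscores.
    country, faction, rest = stem.split("_", 2)
    # Split the two IDs off from the right, so a speaker name may contain underscores.
    speaker, session_id, speech_id = rest.rsplit("_", 2)
    return SpeechFile(country, faction, speaker, int(session_id), int(speech_id))


info = parse_speech_filename("IT_EFD_Speroni_9_71.txt")
print(info.country, info.faction, info.speaker, info.session_id, info.speech_id)
```

With the metadata available as fields, speeches can be grouped or filtered by country, faction, or speaker before analysis.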

For our analysis we used the most frequent words (MFW) as features, a very common choice in stylometric research. The number of MFW used in the analysis was chosen experimentally.
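As a rough illustration of MFW features, the following Python sketch picks the most frequent words across a small corpus and represents each text by the relative frequencies of those words. It is a simplified stand-in, not the Stylo implementation: the names `tokenize` and `mfw_features`, the regular-expression tokenizer, and the toy texts are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and extract word tokens (assumed, very simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def mfw_features(texts, n_mfw):
    """Return the n_mfw most frequent words in the corpus and, for each text,
    a vector with the relative frequencies of those words.

    texts: dict mapping a text name to its content.
    """
    tokens = {name: tokenize(content) for name, content in texts.items()}
    corpus_counts = Counter()
    for toks in tokens.values():
        corpus_counts.update(toks)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(corpus_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    mfw = [word for word, _ in ranked[:n_mfw]]
    vectors = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty texts
        vectors[name] = [counts[word] / total for word in mfw]
    return mfw, vectors


# Toy corpus, invented for this example.
texts = {
    "a": "the house and the garden",
    "b": "the vote and the report and the debate",
}
mfw, vectors = mfw_features(texts, n_mfw=2)
print(mfw, vectors)
```

The parameter `n_mfw` plays the role of the number of MFW, which can be varied to see how the results change.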

Stylometric analysis was performed with the Stylo package for R (M. Eder and J. Rybicki, 2011). Stylo not only allows a flexible choice of features and dissimilarity measures, but also offers different methods for visualizing the results (Principal Components Analysis, Cluster Analysis, Multidimensional Scaling, and Bootstrap Consensus Trees).
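To illustrate what a dissimilarity measure does, here is a minimal Python sketch of Burrows’s Delta, a well-known stylometric distance. The text above does not say which measure was used in the project, so Delta is chosen here purely as an example; the function name `burrows_delta` and the toy frequency vectors are hypothetical, and the sketch uses the population standard deviation for simplicity.

```python
from statistics import mean, pstdev


def burrows_delta(vectors):
    """Compute pairwise Burrows's Delta between texts.

    vectors: dict mapping a text name to a list of relative word frequencies
             (all lists have the same length and word order).
    Returns a dict mapping (name_a, name_b) to the Delta distance.
    """
    names = sorted(vectors)
    n_features = len(vectors[names[0]])
    # Step 1: z-score every feature (word) across all texts.
    z = {name: [] for name in names}
    for j in range(n_features):
        column = [vectors[name][j] for name in names]
        mu = mean(column)
        sigma = pstdev(column)  # assumption: population standard deviation
        for name in names:
            z[name].append((vectors[name][j] - mu) / sigma if sigma else 0.0)
    # Step 2: Delta is the mean absolute difference of the z-scores.
    distances = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            distances[(a, b)] = mean(abs(x - y) for x, y in zip(z[a], z[b]))
    return distances


# Toy relative-frequency vectors, invented for this example.
toy = {"x": [0.1, 0.2], "y": [0.1, 0.2], "z": [0.4, 0.5]}
for pair, delta in burrows_delta(toy).items():
    print(pair, round(delta, 3))
```

A table of pairwise distances like this one is the typical input for cluster analysis or multidimensional scaling, which then place texts with small distances close to each other.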

English is not the mother tongue of many MEPs, so not all of the English texts analyzed were “originals”. For MEPs whose mother tongue was not English, we used transcripts translated into English. This may have influenced our results to some extent, as these texts reflect not only the speakers’ style but also the translators’ style.


We performed stylometric analysis on transcripts of many MEPs, and the visualization techniques did not reveal consistent patterns. Possible solutions, some of which we explored during the Camp, are to visualize only the speeches of certain MEPs, of selected factions, or of MEPs belonging to a certain country, region, age group, etc. It would also be interesting to explore further how factions in the EP correspond to parties in the MEPs’ home countries. For example, in the Lithuanian case, Donskis and Uspaskich both belong to ALDE in the EP, while in Lithuania they are on opposing sides.



We plan to continue our stylometric experiments and to explore different characteristics and cross-sections of the EP transcripts, as well as diverse parameters and visualization techniques. We also plan to explore gender differences in language use in the EP.

Overall, it was a rewarding week with interesting insights and useful networking. We thank the organizers for this opportunity.