Polish Political News Dataset

Ewa Kowalczuk

During the third Talk of Europe Creative Camp I worked with Wim Peters’s and Adam Funk’s semantic extension:
http://www.talkofeurope.eu/extension-of-the-talk-of-europe-data-set-with-terminological-and-semantic-information/.
The extension includes a comprehensive list of frequent terms appearing in EU Parliament and UK Parliament speeches.

In order to facilitate analysis of speeches and enable linkage to other datasets (for example my Polish Political News dataset), I identified keywords among the terms occurring in EU parliamentary speeches and linked them to DBpedia concepts.

I developed a micro-ontology to describe the different phenomena pertaining to the terms. It uses the following prefix:
prefix stoe: http://semantic.cs.put.poznan.pl/toe#

Most of the keywords are linked using the “stoe:denotes” property, which indicates an exact match. Some keywords are linked using the “stoe:closelyMatches” property, which is used when no corresponding concept exists in DBpedia but a broader or related concept does. In such cases, the related Wikipedia article mentions the original keyword explicitly and is the best source of knowledge about the original concept. For example, the article on “dbr:Banking_Union” mentions and defines the Bank Recovery and Resolution Directive, but no “dbr:Banking_Recovery_and_Resolution_Directive” concept exists in DBpedia. When no matching concept was found in DBpedia, but the concept denoted by the keyword was still uniquely identified in another source, the keyword was linked using “foaf:page” to the webpage describing it.
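As an illustration only (this is not the downloadable script), the three kinds of links could be serialised as N-Triples roughly as follows. The term URIs under example.org, the example webpage and the function name link_triple are hypothetical; the stoe prefix is the one given above, and the dbr and foaf namespaces are the standard ones.

```python
# Namespaces: STOE is the prefix defined in this text,
# DBR and FOAF are the usual DBpedia resource and FOAF namespaces.
STOE = "http://semantic.cs.put.poznan.pl/toe#"
DBR = "http://dbpedia.org/resource/"
FOAF = "http://xmlns.com/foaf/0.1/"

# Which predicate is used for which kind of link.
PREDICATES = {
    "exact": STOE + "denotes",         # exact match in DBpedia
    "close": STOE + "closelyMatches",  # only a broader/related concept exists
    "page": FOAF + "page",             # no DBpedia concept, external webpage
}


def link_triple(term_uri, kind, target_uri):
    """Return one N-Triples line linking a term to its target."""
    return "<%s> <%s> <%s> ." % (term_uri, PREDICATES[kind], target_uri)


# Hypothetical term URIs, for illustration only.
print(link_triple("http://example.org/term/banking-union",
                  "exact", DBR + "Banking_Union"))
print(link_triple("http://example.org/term/bank-recovery-and-resolution-directive",
                  "close", DBR + "Banking_Union"))
print(link_triple("http://example.org/term/some-term",
                  "page", "http://example.org/page-describing-the-term"))
```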

The linking process differed for single-word and multi-word terms.

Single-word terms proved somewhat more problematic, and therefore a manual review of all terms was performed. A few classes of non-keywords were identified and explicitly marked in the resulting dataset. The class “stoe:NonKeyword” was used for automatic extraction artifacts, like “doesn” and “quo” (the word “quo” can be used in English, but only within specific collocations).

The class “stoe:NonEnglishTerm” indicates terms that were mistakenly marked as English words although they actually come from other languages, mainly German and Polish. The class “stoe:NonNounTerm” indicates single-word terms that belong to a part of speech other than noun. While these words, e.g. science-related adjectives, are very important building blocks of multi-word keywords, linking them separately to related noun-based concepts would be rather artificial and superfluous (it is better to link the whole multi-word expression, which probably also appears in the term list).

The class “stoe:CommonTerm” was used for terms that are too general to constitute good keywords, e.g. “tomorrow”, “activities”, “usage”, “nobody”, “difficulty”. In many cases it would also be improper to link them to a single concept, as they can be used in many different contexts in different speeches.

Another class of unlinked terms is “stoe:EUPTechTerm”. These technical terms are very likely to be used in EU Parliament speeches with a very specific meaning, and therefore linking them to general concepts might be inappropriate. They are also likely to appear very often, which would make them domain-specific stopwords rather than good keywords.

Surnames formed a very specific and frequent class of keywords. These keywords were marked as “stoe:Surname” and not linked, because they could denote different people in different speeches. Other named entities were marked as “stoe:NEKeyword” and linked to DBpedia. Named entity keywords included a lot of acronyms. Where an acronym denoted several different institutions, the one related to the European Union was used.
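The outcome of the manual review described above can be pictured as a lookup table from terms to classes, which is then used to decide which terms are left for automatic linking. The sketch below is only an illustration under that assumption: the table layout, the function name terms_for_auto_linking and the term “fisheries” are hypothetical, while the class names and the other example terms are taken from the text.

```python
# Manually assigned classes (a tiny hypothetical excerpt of the review).
MANUAL_CLASSES = {
    "doesn": "stoe:NonKeyword",
    "quo": "stoe:NonKeyword",
    "tomorrow": "stoe:CommonTerm",
    "usage": "stoe:CommonTerm",
}


def terms_for_auto_linking(terms, manual_classes):
    """Keep only the terms that received no class during manual review.

    Terms marked with any class (non-keywords, surnames, named entities,
    and so on) are handled separately and are skipped here.
    """
    return [term for term in terms if term not in manual_classes]


# "fisheries" is a made-up example of a term without a manual class.
remaining = terms_for_auto_linking(
    ["doesn", "tomorrow", "fisheries", "quo"], MANUAL_CLASSES)
print(remaining)
```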

After removing all non-keyword and specific keyword cases, the rest of the terms were linked automatically, based on string similarity. DBpedia resources corresponding to redirection and disambiguation pages were removed from the results, so that terms are linked only to actual concepts.
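A minimal sketch of such similarity-based linking is given below. It is not the downloadable script: the exact similarity measure is not specified here, so the sketch substitutes Python's standard difflib.SequenceMatcher, and the candidate record layout, the 0.9 threshold and the name best_match are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def best_match(term, candidates, threshold=0.9):
    """Return the URI of the most similar candidate label, or None.

    Each candidate is a dict with "label" and "uri", plus optional
    boolean flags "redirect" and "disambiguation" (assumed layout).
    """
    best_uri, best_score = None, 0.0
    for cand in candidates:
        # Skip redirection and disambiguation pages, as described above.
        if cand.get("redirect") or cand.get("disambiguation"):
            continue
        score = SequenceMatcher(
            None, term.lower(), cand["label"].lower()).ratio()
        if score >= threshold and score > best_score:
            best_uri, best_score = cand["uri"], score
    return best_uri


# Hypothetical candidate list for one term.
candidates = [
    {"label": "Banking union (disambiguation)",
     "uri": "http://dbpedia.org/resource/Banking_union_(disambiguation)",
     "disambiguation": True},
    {"label": "Banking Union",
     "uri": "http://dbpedia.org/resource/Banking_Union"},
]
print(best_match("banking union", candidates))
```

With a high threshold, terms that have no close label among the candidates simply stay unlinked, since best_match returns None for them.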

The same automatic analysis was performed for multi-word terms. I observed that for multi-word expressions, automatic linkage is characterised by very high recall and precision. Terms that would not constitute proper keywords, for example “advanced implementation by the EU of the liberalisation”, are not linked and can be eliminated. In general, multi-word terms often constitute very good keywords, especially for general topic analysis.

The results (linked keywords, marked non-keywords and the Python 2 script for the automatic linkage process) can be downloaded here:
http://www.cs.put.poznan.pl/ekowalczuk/toe/