Finding Cross-Lingual Similarity Measures using Distributional Semantics

Bartholomaeus Wloka, Vesna Lušicky, University of Vienna

In the Talk of Europe Creative Camp #2 we explored the Talk of Europe (ToE) dataset, which consists of the proceedings of the European Parliament debates, using a distributional semantics approach. We computed co-occurrence statistics for terms and represented each term as a vector of its relative distributional properties, so that terms can be compared through their vectors.
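To illustrate the general idea, the following is a minimal sketch of co-occurrence vectors, not the pipeline used at the Creative Camp. The window size, the function names, the toy sentences and the use of cosine similarity are all assumptions made for this example.

```python
from collections import Counter, defaultdict
import math


def cooccurrence_vectors(sentences, window=2):
    """Build relative co-occurrence vectors from tokenized sentences.

    `window` (an assumed parameter) is the number of neighbouring tokens
    on each side that count as context for a word.
    """
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1

    # Normalize raw counts to relative frequencies, so that each vector
    # describes the relative distribution of a word's contexts.
    vectors = {}
    for word, contexts in counts.items():
        total = sum(contexts.values())
        vectors[word] = {ctx: n / total for ctx, n in contexts.items()}
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy input invented for illustration; it is not taken from the ToE corpus.
toy_sentences = [
    ["the", "parliament", "adopted", "the", "resolution"],
    ["the", "council", "adopted", "the", "directive"],
]
vectors = cooccurrence_vectors(toy_sentences)
similarity = cosine(vectors["parliament"], vectors["council"])
```

In this toy input "parliament" and "council" appear in identical contexts, so their vectors coincide. On real debate data the similarity would be graded rather than exact.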

The idea behind this approach is to cluster associated words and multi-word expressions into topological structures of closely related expressions, to explore whether such clusters can be connected across languages, and to examine whether concept frameworks can be created without prior knowledge or annotation. Comparing clusters between languages can potentially yield cross-lingual association clouds, which could be used first for further scholarly analysis and later as translation support.

Initial difficulties with inaccurately labeled data in the corpus prevented us from completing our work. However, thanks to excellent collaborative work during the Creative Camp, we were able to resolve the labeling issue and obtain a first result: co-occurrence statistics for the English language.

A small selection of the output is shown in Fig. 1. Each word is associated with a numerical value, which denotes the value of its vector. Words whose values are close together can be considered semantically clustered. Fig. 1 shows two such clusters. A representation in a Cartesian coordinate system is given in Fig. 2.
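One simple way to read clusters off such a list of values can be sketched as follows: sort the words by value and start a new cluster wherever the gap between neighbouring values exceeds a threshold. The function name, the threshold and the example values are hypothetical and are not taken from Fig. 1.

```python
def cluster_by_gap(values, max_gap):
    """Group words whose numerical values lie close together.

    `values` maps each word to its number; `max_gap` (an assumed
    parameter) is the largest difference between neighbouring values
    that still keeps two words in the same cluster.
    """
    ordered = sorted(values.items(), key=lambda item: item[1])
    clusters = []
    current = []
    previous = None
    for word, value in ordered:
        # A large jump from the previous value closes the current cluster.
        if previous is not None and value - previous > max_gap:
            clusters.append(current)
            current = []
        current.append(word)
        previous = value
    if current:
        clusters.append(current)
    return clusters


# Values invented for illustration only.
example_values = {
    "budget": 0.12,
    "deficit": 0.15,
    "fisheries": 0.81,
    "quota": 0.84,
}
clusters = cluster_by_gap(example_values, max_gap=0.1)
```

The threshold decides how fine-grained the clusters are. A smaller value splits the list into more groups and a larger one merges them.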

In the next step, we plan to extend the approach to the other languages represented in the ToE dataset and to explore the possibilities of cross-language visualization.

We thank the organizers for this opportunity.

Figure 1

Figure 2