The Kontatto corpus

The Kontatto Corpus was collected between 2010 and 2014. Its aim was to document bilingual language use in South Tyrol. To do this, we carried out our fieldwork mainly within the German-speaking community, above all in the area known as Bassa Atesina, south of Bolzano, where language contact is more intense and has a longer history than elsewhere in the region.

The corpus has a hierarchical structure. The largest linguistic unit is the utterance, which may, but need not, correspond to a conversational turn. Each utterance has been tokenized, and a series of analytical annotations has been added to each token, manually or semi-automatically: the language of the token, its part of speech, and its lemma in Standard German (for Tyrolean), in Italian (for Italian and Trentino), or in Gardenese Ladin (for the few occurrences in this language).
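The utterance-and-token hierarchy described above can be pictured with a minimal Python sketch. The class names, field names, tag labels and the sample tokens are all illustrative assumptions; they are not taken from the corpus files or from its actual annotation scheme.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Token:
    """One token with the three annotation layers named in the text."""
    form: str      # the word as transcribed
    language: str  # language of the token (label set is an assumption)
    pos: str       # part of speech (tag set is an assumption)
    lemma: str     # lemma in the reference language for that token


@dataclass
class Utterance:
    """The largest unit: an ordered list of annotated tokens."""
    speaker: str
    tokens: List[Token] = field(default_factory=list)

    def languages(self) -> Set[str]:
        # Set of languages attested among the tokens of this utterance.
        return {t.language for t in self.tokens}

    def is_mixed(self) -> bool:
        # True when tokens from more than one language co-occur.
        return len(self.languages()) > 1


# Invented example, not corpus data: a Tyrolean pronoun lemmatized in
# Standard German, followed by an Italian conjunction.
utt = Utterance(
    speaker="SPK1",
    tokens=[
        Token(form="i", language="Tyrolean", pos="PRON", lemma="ich"),
        Token(form="però", language="Italian", pos="CONJ", lemma="però"),
    ],
)
```

Keeping the language label on the token rather than on the utterance is what would make it possible to find utterances in which material from two languages co-occurs.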

This is what it looks like in an ELAN transcript:

This process, involving 146,719 tokens, resulted in a list of 6,437 lemmas. It is no surprise that the 20 most frequent lemmas in the corpus are all German and that they are grammatical words with a very general meaning. Personal pronouns are prominent among them, first of all the pronoun ich.

The list of the 20 most frequent Italian lemmas in the corpus (generally in a proportion of 1:10 relative to the German list) is partly similar. Pronouns, in particular, are much less prominent, whereas conjunctions and discourse markers rank very high. This is because they play a meaningful role within Tyrolean speech.
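Per-language frequency lists of the kind discussed here could be derived from the token annotations roughly as follows. This is only a sketch: the function name top_lemmas, the (language, lemma) input format and the sample pairs are assumptions made for illustration, and the counts are invented rather than drawn from the corpus.

```python
from collections import Counter, defaultdict


def top_lemmas(tokens, n=20):
    """Return the n most frequent lemmas for each language.

    tokens: iterable of (language, lemma) pairs (hypothetical format).
    Result: dict mapping language -> list of (lemma, count) pairs,
    sorted by decreasing count.
    """
    counts = defaultdict(Counter)
    for language, lemma in tokens:
        counts[language][lemma] += 1
    return {lang: c.most_common(n) for lang, c in counts.items()}


# Invented toy input, not corpus data.
sample = [
    ("German", "ich"), ("German", "ich"), ("German", "und"),
    ("Italian", "però"), ("Italian", "però"), ("Italian", "ma"),
]
result = top_lemmas(sample, n=2)
```

Here the language key stands for the language in which the lemma is given, so Tyrolean tokens would be counted under their Standard German lemmas.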