Multilingual Stylometry Showcase

Introduction to the Visualization

This showcase accompanies our study “Multilingual Stylometry: The influence of language on the performance of authorship attribution using corpora from the European Literary Text Collection (ELTeC)”, presented at CHR 2024, which examines the impact of language and translation on authorship attribution using stylometric methods. It allows users to interact with data and parameters to explore how stylistic consistency across languages and text features influences attribution accuracy. The visualization is designed for researchers, students, and educators interested in the multilingual dimensions of stylometric analysis and in the methodological considerations of cross-linguistic authorship studies within the digital humanities.

Research Question and Purpose of the Showcase

The showcase addresses the central research question of our study: To what extent, and under what conditions, does language affect the performance of stylometric authorship attribution? While the full article covers various theoretical and methodological aspects of multilingual stylometry, this interactive tool is designed to allow users to engage with some of the primary factors impacting authorship attribution across languages and translations.

The visualization supports users in:

Investigating language effects: By analyzing both original and machine-translated corpora, users can explore how different languages influence attribution accuracy;

Examining the influence of stylistic features: Users can experiment with diverse parameters, such as sample length, n-gram size, feature type, and feature level, to see how these variations influence stylometric performance.

Data

The dataset used for this showcase is derived from the European Literary Text Collection (ELTeC) and represents four languages – English, French, Hungarian, and Ukrainian – each chosen to represent a different language group (Germanic, Romance, Finno-Ugric, and Slavic, respectively). Each corpus includes works by 8 to 10 authors, with at least three novels per author, ensuring balanced author representation across languages.

To facilitate cross-linguistic comparisons, each corpus was translated into the other three languages using DeepL Pro machine translation. This resulted in a total of 16 sub-corpora (4 original corpora plus 12 translated ones), enabling users to explore how attribution accuracy differs between original and translated texts.
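The count of 16 follows from pairing each of the four source languages with each of the four target languages. A minimal sketch of that enumeration, assuming the "source-target" label pattern used in the showcase (for example “eng-ukr”); the code "hun" for Hungarian and the function name are assumptions made for illustration:

```python
from itertools import product

# "eng", "fra" and "ukr" appear in the showcase labels;
# "hun" for Hungarian is an assumption.
LANGUAGES = ["eng", "fra", "hun", "ukr"]


def sub_corpus_labels(languages):
    """Return one 'source-target' label per (source, target) language pair."""
    return [f"{source}-{target}" for source, target in product(languages, repeat=2)]


labels = sub_corpus_labels(LANGUAGES)

# A label such as "fra-fra" marks an original corpus; all others are translations.
originals = [label for label in labels if label.split("-")[0] == label.split("-")[1]]
translations = [label for label in labels if label not in originals]

print(len(originals), len(translations), len(labels))  # prints: 4 12 16
```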

Metadata and Corpus Diversity

Each novel in three of the corpora (English, French, and Ukrainian) includes rich metadata covering:

Author identity and publication details;

Genre and subgenre (e.g., social, historical, adventure);

Narrative perspective (e.g., heterodiegetic, homodiegetic, mixed).

This metadata allows for the calculation of a corpus diversity score to assess how variations in genre, narrative style, and other factors correlate with attribution accuracy.
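The text above does not say how the corpus diversity score is computed. Purely as an illustration of the idea, the sketch below averages a normalized Shannon entropy over two metadata fields; the measure, the field names ("subgenre", "perspective"), and the function names are all assumptions and should not be read as the study's actual method.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Shannon entropy of the category distribution, scaled to the range [0, 1]."""
    counts = Counter(values)
    if len(counts) < 2:
        return 0.0  # a single category (or no data) means no diversity
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    # Dividing by the maximum possible entropy for the observed categories
    return entropy / math.log2(len(counts))


def diversity_score(novels, fields=("subgenre", "perspective")):
    """Average the normalized entropy over the chosen metadata fields (hypothetical)."""
    scores = [normalized_entropy([novel[field] for novel in novels]) for field in fields]
    return sum(scores) / len(scores)
```

Under this assumed measure, a corpus in which every novel shares the same subgenre and narrative perspective scores 0, and a corpus whose observed categories are evenly balanced scores 1.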

Data Preparation

After machine translation with DeepL Pro, we extracted features such as word forms, lemmas, POS tags, and character sequences (as n-grams of length 1 to 5) from each text. Texts were sampled at various lengths – from 5,000-word chunks to full novels – to evaluate how sample size influences attribution accuracy.
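The sampling and n-gram steps described above can be sketched as follows. This is a simplified illustration, not the study's pipeline: it assumes texts are already tokenized (lemmas and POS tags would come from a separate tagger, which is not shown), and all function names are hypothetical.

```python
def chunk_words(words, size):
    """Split a token list into consecutive samples of `size` tokens, dropping the short tail."""
    return [words[i:i + size] for i in range(0, len(words) - size + 1, size)]


def word_ngrams(tokens, n):
    """Contiguous n-grams over a token sequence (word forms, lemmas, or POS tags)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Contiguous character n-grams, with spaces counted as characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

For a 5,000-word sample size, `chunk_words(tokens, 5000)` would yield the samples, and `word_ngrams(sample, 2)` the word bigrams of one sample.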

How to Use the Visualization

The showcase uses a heatmap interface that reveals trends in attribution accuracy across various parameter configurations, allowing for a comparative study of parameter effects. Here’s how to navigate and interpret the key elements:

Axes

X-Axis: Shows the Number of Most Frequent Features (MFF) used in the analysis, that is, how many of the most commonly occurring stylistic elements (such as specific word forms or character sequences) are taken into account;

Y-Axis: Represents Sample Sizes, ranging from smaller text segments (5,000 words) to entire novels.
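To make the MFF axis concrete, the sketch below selects the N most frequent features across a set of samples and expresses each sample as relative frequencies of those features. It is a generic illustration of the most-frequent-features idea under simple assumptions (samples given as lists of features, hypothetical function names), not the code behind the showcase.

```python
from collections import Counter


def most_frequent_features(samples, n_features):
    """Return the n_features most frequent features across all samples."""
    totals = Counter()
    for sample in samples:
        totals.update(sample)
    return [feature for feature, _ in totals.most_common(n_features)]


def relative_frequencies(sample, features):
    """Relative frequency of each selected feature within one sample."""
    counts = Counter(sample)
    total = len(sample)
    return [counts[feature] / total if total else 0.0 for feature in features]
```

Moving right along the X-axis corresponds to raising `n_features`, so that each sample is described by a longer frequency vector.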

Color Gradient

The heatmap visually encodes attribution accuracy:

Cooler colors (purple/blue/green) indicate lower accuracy, suggesting challenging configurations;

Warmer colors (red/orange/yellow) represent higher accuracy, highlighting configurations with robust attribution.

Mouseover Details

Hover over any cell to see its accuracy, its Cohen’s Kappa (a measure of agreement adjusted for chance), and the feature settings selected for that cell. This allows for detailed, cell-by-cell comparison of parameter performance.
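Cohen's Kappa is conventionally defined as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (here, the accuracy) and p_e is the agreement expected by chance given the label frequencies. The sketch below computes it for lists of true and predicted author labels; it is a plain-Python illustration with a hypothetical function name, not the code used to produce the values shown in the showcase.

```python
import math
from collections import Counter


def cohens_kappa(true_labels, predicted_labels):
    """Cohen's Kappa between true and predicted labels of equal length."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(true_labels)
    # Observed agreement: share of samples attributed to the correct author
    observed = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n
    # Chance agreement: sum over labels of P(true = label) * P(predicted = label)
    true_counts = Counter(true_labels)
    pred_counts = Counter(predicted_labels)
    expected = sum(true_counts[label] * pred_counts[label] for label in true_counts) / (n * n)
    if expected == 1.0:
        return math.nan  # undefined when only a single label occurs in both lists
    return (observed - expected) / (1 - expected)


print(cohens_kappa(["A", "A", "B", "B"], ["A", "A", "B", "A"]))  # prints: 0.5
```

In this toy case, accuracy is 0.75 but chance agreement is 0.5, so Kappa is 0.5. This is why Kappa can be a more cautious indicator than raw accuracy.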

Parameter Selectors

To enable customized exploration, the visualization includes four selectors:

Corpus Selector: Choose from 16 sub-corpora to analyze how language impacts authorship accuracy. Each sub-corpus is labeled by its source and target languages. For instance, “eng-ukr” represents the English corpus translated into Ukrainian;

N-gram Size Selector: Select n-gram lengths from 1 to 5, representing different levels of linguistic detail, from single words/characters to sequences that capture more nuanced patterns;

Feature Level Selector: Toggle between 'words' and 'characters' to see how larger language units (words) compare to a more granular character-level analysis;

Feature Type Selector: Choose from plain text, lemmas, or POS tags. Each type captures distinct layers of linguistic structure, helping to determine which feature type enhances attribution accuracy.

These selectors allow users to explore interactions between corpus language, text granularity, and linguistic features, contributing to a comprehensive understanding of authorship attribution across languages.

Example: Interpreting Results

To illustrate practical usage, consider analyzing authorship attribution for the French corpus (in its original language) using word-based features:

Select “fra-fra” from the Corpus Selector;
Set N-gram Size to 2 and choose “words” as the Feature Level;
Observe how accuracy varies across the heatmap for different combinations of MFF (Most Frequent Features) and sample size:

Smaller MFF values often show less accurate results;
Larger MFF values improve accuracy, but only up to a certain point; beyond this point, accuracy declines because too many features are used.

Hover over a cell to view detailed metrics, such as the exact accuracy and Cohen’s Kappa values.

This example demonstrates how selecting too many features can reduce accuracy. Adjusting different selectors makes it possible to explore the impact of each parameter on attribution across other languages and corpora.
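The rise-and-fall pattern described in this example can also be located programmatically. The sketch below assumes the heatmap is available as a grid with one row per sample size and one column per MFF value; that layout, the function name, and the numbers in the usage lines are all assumptions. The numbers are toy values invented for illustration, not results from the study.

```python
def best_mff_per_sample_size(accuracy, mff_values, sample_sizes):
    """For each sample size (row), return the MFF value (column) with the highest accuracy."""
    best = {}
    for size, row in zip(sample_sizes, accuracy):
        peak_index = max(range(len(row)), key=row.__getitem__)
        best[size] = mff_values[peak_index]
    return best


# Toy values invented for illustration only; NOT results from the study.
mff_values = [50, 100, 500, 1000, 2000]
sample_sizes = [5000, 10000]
accuracy = [
    [0.40, 0.55, 0.70, 0.65, 0.50],  # row for 5,000-word samples
    [0.50, 0.65, 0.80, 0.82, 0.70],  # row for 10,000-word samples
]

print(best_mff_per_sample_size(accuracy, mff_values, sample_sizes))
# prints: {5000: 500, 10000: 1000}
```

In both toy rows the peak lies in the middle of the MFF range rather than at its upper end, which mirrors the pattern described above.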

Further Information

Full paper

Presented at CHR 2024: Computational Humanities Research Conference, December 4–6, 2024, Aarhus, Denmark.

Reference:

Schöch, C., Dudar, J., Fileva, E. & Šeļa, A. (2024). Multilingual Stylometry: The Influence of Language on the Performance of Authorship Attribution using Corpora from the European Literary Text Collection (ELTeC). Proceedings of the Computational Humanities Research Conference 2024. 386-408.

Explanatory Paper

A more extensive explanation of the methods and tools used in Computational Literary Studies (CLS).

Reference:

Schöch, C. (2024). CLS INFRA D3.3 Showcases for the application of CLS methods and tools. Zenodo. DOI: 10.5281/zenodo.10912517.

Background on Stylometric Methods

For an overview of authorship attribution in Computational Literary Studies, see the relevant chapters in the Survey of Methods in CLS.

Reference:

Schöch, C., Dudar, J., & Fileva, E. (2023). CLS INFRA D3.2: Series of Five Short Survey Papers on Methodological Issues (= Survey of Methods in Computational Literary Studies) (v1.1.0). Zenodo. DOI: 10.5281/zenodo.7892112.