Introduction to the Visualization
This showcase supports our study “Multilingual Stylometry: The influence of language on the performance of authorship attribution using corpora from the European Literary Text Collection (ELTeC)”, presented at CHR2024, which examines the impact of language and translation on authorship attribution using stylometric methods. It allows users to interact with the data and parameters to explore how stylistic consistency across different languages and text features influences attribution accuracy. The visualization is designed for researchers, students, and educators interested in the multilingual dimensions of stylometric analysis and the methodological considerations of cross-linguistic authorship studies within the digital humanities.
Research Question and Purpose of the Showcase
The showcase addresses the central research question of our study: To what extent, and under what conditions, does language affect the performance of stylometric authorship attribution? While the full article covers various theoretical and methodological aspects of multilingual stylometry, this interactive tool is designed to allow users to engage with some of the primary factors impacting authorship attribution across languages and translations.
The visualization supports users in exploring how attribution accuracy varies with corpus language, with original versus machine-translated texts, and with parameter choices such as feature level, n-gram size, number of most frequent features (MFF), and sample size.
Data
The dataset used for this showcase is derived from the European Literary Text Collection (ELTeC) and covers four languages – English, French, Hungarian, and Ukrainian – each representing a distinct language family (Germanic, Romance, Finno-Ugric, and Slavic, respectively). Each corpus includes works by 8–10 authors, with at least three novels per author, ensuring balanced author representation across languages.
To facilitate cross-linguistic comparisons, each corpus was translated into the other three languages using DeepL Pro machine translation. This resulted in a total of 16 sub-corpora, enabling users to explore how attribution accuracy differs between original and translated texts.
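The sub-corpora arithmetic follows from the design: 4 source languages × 4 text languages (the original plus three translations) = 16. As an orientation aid, the snippet below enumerates them using source-target labels such as “fra-fra” (French originals), the naming scheme that reappears in the Corpus Selector example further down; the exact label format is our assumption.

```python
# Enumerate the 16 sub-corpora as source-target language pairs.
# Labels where both codes match (e.g. "fra-fra") denote originals;
# differing codes (e.g. "fra-eng") denote machine translations.
LANGUAGES = ["eng", "fra", "hun", "ukr"]

sub_corpora = [f"{src}-{tgt}" for src in LANGUAGES for tgt in LANGUAGES]
print(len(sub_corpora))  # 16
print(sub_corpora[:4])   # ['eng-eng', 'eng-fra', 'eng-hun', 'eng-ukr']
```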
Metadata and Corpus Diversity
Each novel in three of the corpora (English, French, and Ukrainian) includes rich metadata covering, among other aspects, genre and narrative style.
This metadata allows for the calculation of a corpus diversity score to assess how variations in genre, narrative style, and other factors correlate with attribution accuracy.
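The scoring procedure itself is defined in the full paper; purely as an illustration, a diversity score over categorical metadata fields could be computed as the mean normalized Shannon entropy of each field's value distribution. The field names and values below are hypothetical.

```python
import math
from collections import Counter

def field_entropy(values):
    """Normalized Shannon entropy (0-1) of one categorical field."""
    counts = Counter(values)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    max_entropy = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return entropy / max_entropy

def diversity_score(metadata):
    """Mean normalized entropy across all metadata fields.

    `metadata` maps field names to per-novel values, e.g.
    {"genre": [...], "narrative_perspective": [...]}.
    """
    return sum(field_entropy(v) for v in metadata.values()) / len(metadata)

# Hypothetical example: a small corpus with two metadata fields.
corpus_meta = {
    "genre": ["historical", "social", "historical", "adventure"],
    "narrative_perspective": ["first", "third", "third", "third"],
}
print(round(diversity_score(corpus_meta), 3))
```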
Data Preparation
After the automated translation via DeepL Pro, we extracted features such as word forms, lemmas, POS tags, and character sequences (n-grams of length 1 to 5) from each text. Texts were sampled at various lengths – from 5,000-word chunks to full novels – to evaluate how sample size influences attribution accuracy.
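As a minimal sketch of this preparation step (not the authors' exact pipeline; lemmas and POS tags would additionally require a tagger such as spaCy or UDPipe), character n-grams and fixed-length word chunks can be produced like this:

```python
import re

def char_ngrams(text, n):
    """All overlapping character sequences of length n."""
    return [text[i : i + n] for i in range(len(text) - n + 1)]

def word_chunks(text, chunk_size=5000):
    """Split a text into consecutive chunks of `chunk_size` words;
    a trailing remainder shorter than chunk_size is dropped."""
    words = re.findall(r"\w+", text.lower())
    return [
        words[i : i + chunk_size]
        for i in range(0, len(words) - chunk_size + 1, chunk_size)
    ]

sample = "Stylometry measures style through countable features."
print(char_ngrams(sample, 3)[:5])  # first five character trigrams
print(len(word_chunks(sample)))    # 0: the sample is shorter than 5,000 words
```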
How to Use the Visualization
The showcase uses a heatmap interface that reveals trends in attribution accuracy across various parameter configurations, allowing for a comparative study of parameter effects. Here’s how to navigate and interpret the key elements:
Axes
The two axes of the heatmap correspond to the two parameters varied in each run: the number of most frequent features (MFF) and the sample size.
Color Gradient
The heatmap visually encodes attribution accuracy: each cell's color reflects the accuracy achieved with the corresponding parameter combination, so that strong and weak configurations can be spotted at a glance.
Mouseover Details
Hover over any cell to see specific accuracy metrics, Cohen’s Kappa (a measure of reliability adjusted for chance), and the selected feature settings for that cell. This allows for detailed, cell-by-cell comparison of parameter performance.
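Cohen's Kappa is computed as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement (here, raw accuracy) and p_e is the agreement expected by chance. A minimal check with scikit-learn, using hypothetical author labels:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical attribution results: true vs. predicted authors.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]

print(accuracy_score(y_true, y_pred))     # observed agreement p_o
print(cohen_kappa_score(y_true, y_pred))  # chance-corrected agreement
```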
Parameter Selectors
To enable customized exploration, the visualization includes four parameter selectors, three of which – the Corpus Selector, Feature Level, and N-gram Size – appear in the worked example below.
These selectors allow users to explore interactions between corpus language, text granularity, and linguistic features, contributing to a comprehensive understanding of authorship attribution across languages.
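As a rough, purely illustrative analogue of the display (not the showcase's actual implementation, and with invented accuracy values), such a grid can be rendered with matplotlib:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical accuracy grid: rows = sample sizes, columns = MFF settings.
mff_values = [100, 500, 1000, 2000, 5000]
sample_sizes = ["5k words", "10k words", "full novel"]
accuracy = np.array([
    [0.55, 0.68, 0.74, 0.72, 0.65],
    [0.60, 0.73, 0.80, 0.78, 0.70],
    [0.66, 0.79, 0.86, 0.84, 0.75],
])

fig, ax = plt.subplots()
im = ax.imshow(accuracy, cmap="viridis", vmin=0, vmax=1)
ax.set_xticks(range(len(mff_values)), labels=[str(m) for m in mff_values])
ax.set_yticks(range(len(sample_sizes)), labels=sample_sizes)
ax.set_xlabel("Most Frequent Features (MFF)")
ax.set_ylabel("Sample size")
fig.colorbar(im, ax=ax, label="Attribution accuracy")
plt.show()
```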
Example: Interpreting Results
To illustrate practical usage, consider analyzing authorship attribution for the French corpus (in its original language) using word-based features:
- Select “fra-fra” from the Corpus Selector;
- Set N-gram Size to 2 and choose “words” as the Feature Level;
- Observe the heatmap to identify cells where results vary across combinations of MFF (Most Frequent Features) and sample size:
  - Smaller MFF values often yield less accurate results;
  - Larger MFF values improve accuracy, but only up to a point; beyond it, accuracy declines as too many features are used.
- Hover over a cell to view detailed metrics, such as the exact accuracy and Cohen’s Kappa values.
This example demonstrates how selecting too many features can reduce accuracy. Adjusting different selectors makes it possible to explore the impact of each parameter on attribution across other languages and corpora.
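For readers who want to see how such accuracies arise mechanically, the sketch below shows one standard stylometric baseline – keep the MFF most frequent words, z-score their relative frequencies, and attribute a disputed text to the author of the nearest training text under Burrows' Delta (mean absolute z-score difference). This is a common reference method, not necessarily the exact classifier behind the showcase, and the toy frequencies are invented.

```python
import numpy as np

def delta_attribute(train_freqs, train_authors, test_freq):
    """Burrows' Delta attribution on relative word frequencies.

    train_freqs: (n_texts, mff) relative frequencies of the MFF words
    train_authors: author label per training text
    test_freq: (mff,) relative frequencies for the disputed text
    """
    mu = train_freqs.mean(axis=0)
    sigma = train_freqs.std(axis=0) + 1e-12  # avoid division by zero
    z_train = (train_freqs - mu) / sigma
    z_test = (test_freq - mu) / sigma
    # Delta: mean absolute difference of z-scores, per training text.
    deltas = np.abs(z_train - z_test).mean(axis=1)
    # Attribute to the author of the closest training text.
    return train_authors[int(np.argmin(deltas))]

# Hypothetical toy data: 4 training texts, 3 MFF words.
train = np.array([[0.05, 0.02, 0.01],
                  [0.06, 0.02, 0.01],
                  [0.02, 0.05, 0.03],
                  [0.03, 0.05, 0.02]])
authors = ["Balzac", "Balzac", "Sand", "Sand"]
test = np.array([0.055, 0.021, 0.012])
print(delta_attribute(train, authors, test))  # expected: "Balzac"
```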
Further Information
Full paper
Presented at CHR 2024: Computational Humanities Research Conference, December 4–6, 2024, Aarhus, Denmark.
Reference:
Explanatory Paper
A more extensive explanation of the methods and tools used in Computational Literary Studies (CLS).
Reference:
Background on Stylometric Methods
For an overview of authorship attribution in Computational Literary Studies, see the relevant chapters in the Survey of Methods in CLS.
Reference: