Introduction to the Visualization
Welcome to our Multilingual Stylometry Visualization, a showcase designed for academics, linguists, and digital humanities scholars interested in the intersection of language, corpus composition, and stylometric methods of authorship attribution. This visualization uses a curated dataset from the ELTeC corpus, including texts in English, French, Hungarian, and Ukrainian, each translated into the other three languages to facilitate comparative analysis across languages. By presenting the results of stylometric analysis as interactive heatmaps, the visualization lets users interact directly with the data and explore how linguistic features and corpus characteristics influence the identification of authorial style.
Research Question
The primary research question this showcase addresses is to what extent, how, and under what conditions the language and the composition of a corpus influence the performance of stylometric methods of authorship attribution.
The primary objective of this showcase, therefore, is to provide detailed data that allows us to evaluate the effectiveness of the stylometric method for corpora of different compositions and in different languages. Specifically, the study aims to determine how the results of stylometry are influenced by the language of the texts (using translations) and the composition of the corpus, considering the degree of similarity both within an author's works and between texts of different authors.
An additional, secondary research question is the following: what makes a corpus “difficult” in terms of authorship attribution? We know that this is likely to depend both on corpus composition and on language; but predicting the difficulty based on corpus composition is particularly complex, since it is usually unrealistic to control for every potential source of variation. E.g. if we are attributing a gothic novel, we are unlikely to find a reference corpus that consists only of gothic novels written with the same narrative techniques by authors from the same cohort and social background.
Several key aspects likely need to be considered. The degree of distinctness or overlap between two similarity distributions, one for pairs of works known to be written by the same author and one for pairs known to be written by two different authors, is likely to be a good indicator of attribution difficulty. However, the properties of the works that determine this similarity a priori, that is, before a particular stylometric test is performed to establish it based e.g. on word frequencies, are not entirely clear: time period, literary genre and subgenre, narrative perspective or metric form, author gender, author age at the time of writing, and several others may have some relevance. Nor is it clear how to quantify the various metadata configurations for a sensible prediction of overall corpus difficulty. This showcase makes some preliminary suggestions on this issue but leaves most of it to future work.
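The idea of distinct versus overlapping similarity distributions can be made concrete with a small sketch. The function below is one possible measure, not the one used in this showcase: it estimates the probability that a randomly chosen same-author pair is closer (has a smaller stylometric distance) than a randomly chosen different-author pair. A value near 1.0 indicates well-separated distributions (an "easy" corpus), while a value near 0.5 indicates heavy overlap (a "difficult" corpus). The function name and the distance values are hypothetical and serve only as illustration.

```python
def separation_score(same_author_distances, different_author_distances):
    """Probability that a same-author pair is closer than a different-author pair.

    Assumption: inputs are plain lists of pairwise stylometric distances
    (smaller means more similar). Ties count as half a win.
    """
    wins = 0.0
    for same in same_author_distances:
        for different in different_author_distances:
            if same < different:
                wins += 1.0
            elif same == different:
                wins += 0.5
    total_pairs = len(same_author_distances) * len(different_author_distances)
    return wins / total_pairs


# Invented distances, for illustration only.
same_author = [0.2, 0.3, 0.5]
different_author = [0.4, 0.6, 0.7, 0.9]
score = separation_score(same_author, different_author)
```

With these invented values, 11 of the 12 possible comparisons favour the same-author pair, so the score is 11/12: the two distributions overlap only slightly.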
Data
The text corpus was built from the ELTeC corpora in four languages: French, English, Hungarian, and Ukrainian. The languages were chosen as representatives of their respective language groups: with a Romance, a Germanic, a Finno-Ugric, and a Slavic language, one can illustrate the effectiveness of the stylometric method and see how language affects the analysis results. The original ELTeC corpora each include one hundred novels. For our research, a selection was made, retaining novels by 8 to 10 authors, each represented with three novels from ELTeC. This resulted in the following collections of texts: the English corpus includes 31 novels, the French one 30, the Hungarian one 27, and the Ukrainian one 24. Each of the corpora was translated into the other three languages, so the entire corpus includes 16 sub-corpora. The dataset also includes a metadata table for the novels in each language. The metadata table includes information about the author, year of publication of digital and print editions, identification number, language, number of words in the novel, as well as subgenre (social, historical, adventure, detective or sentimental novel, bildungsroman or other) and narrative perspective of the novel (heterodiegetic, homodiegetic, epistolary, dialogue or mixed).
How to Use the Visualization
The visualization, comprising two heatmaps, offers numerous interaction possibilities. User interaction is enhanced through selectors, providing the flexibility to filter and concentrate on specific data facets. For each heatmap, users can select different options including:
Based on user interest, contrasting corpora or feature settings can be chosen to explore diverse perspectives.
Corpus selection: a corpus name is composed of two language abbreviations. The first abbreviation gives the original language of the novels in the corpus, while the second gives the language of analysis, that is, the translation language. When the analysis is performed on the original texts, the second abbreviation is the same as the first. For example, eng-eng means that the original language of the corpus is English and the analysis is based on the original novels, while eng-fra means that the analysis is based on the English novels translated into French.
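As a minimal sketch of this naming convention, the following Python snippet splits a corpus name into its two parts and enumerates all 16 sub-corpora. Only the codes eng and fra appear in the text above; the codes hun and ukr, as well as the function and variable names, are assumptions made for illustration.

```python
from itertools import product

# Assumption: three-letter codes. Only "eng" and "fra" are attested in the
# text; "hun" and "ukr" are guesses that follow the same pattern.
LANGUAGES = {
    "eng": "English",
    "fra": "French",
    "hun": "Hungarian",
    "ukr": "Ukrainian",
}


def parse_corpus_name(name):
    """Split a name such as 'eng-fra' into (original, analysis, is_translated)."""
    original, analysis = name.split("-")
    for code in (original, analysis):
        if code not in LANGUAGES:
            raise ValueError(f"unknown language code: {code!r}")
    return original, analysis, original != analysis


# Four original languages times four analysis languages gives 16 sub-corpora.
ALL_CORPORA = [f"{orig}-{target}" for orig, target in product(LANGUAGES, repeat=2)]
```

Here parse_corpus_name("eng-eng") reports an untranslated corpus, while parse_corpus_name("eng-fra") reports English originals analysed in French translation.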
What the Visualization Shows
The visualization consists of two heatmaps, enabling users to compare and contrast two sets of results simultaneously. The x-axis represents the number of most frequent features (MFF) used in a given analysis, while the y-axis represents the sample size, ranging from small text snippets to entire novels (“Full novel”). Words and characters are treated differently and analyzed with different sample sizes (from 5000 to 10000 for words and from 10000 to 50000 for characters). Certain MFF values are excluded from the x-axis because no results are available for them. Each heatmap cell corresponds to one combination of MFF setting and sample size, with the color intensity indicating the accuracy achieved, which gives a visual measure of attribution performance. A mouseover provides further information, such as the features used as well as the numerical values of accuracy and Cohen’s Kappa. When no data are available for the selected criteria, the plots display an empty data message.
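For readers unfamiliar with the two scores shown in the mouseover, the sketch below computes accuracy and Cohen's Kappa using their standard definitions. It is not taken from the showcase's own pipeline, whose implementation is not described here; the function names and the toy author labels are hypothetical. Accuracy is the share of correctly attributed samples, while Kappa corrects that share for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e).

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Share of samples whose predicted author matches the true author."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


def cohens_kappa(true_labels, predicted_labels):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = len(true_labels)
    p_observed = accuracy(true_labels, predicted_labels)
    true_counts = Counter(true_labels)
    predicted_counts = Counter(predicted_labels)
    # Chance agreement: sum over classes of the product of marginal proportions.
    p_expected = sum(
        true_counts[label] * predicted_counts[label] for label in true_counts
    ) / (n * n)
    if p_expected == 1.0:
        return float("nan")  # undefined when only a single class occurs
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy example with two hypothetical authors, A and B.
true_authors = ["A", "A", "A", "B", "B", "B"]
predicted_authors = ["A", "A", "B", "B", "B", "A"]
acc = accuracy(true_authors, predicted_authors)
kappa = cohens_kappa(true_authors, predicted_authors)
```

In this toy example four of six samples are attributed correctly, so accuracy is about 0.67, while chance agreement is 0.5 and Kappa is therefore about 0.33. The gap illustrates why Kappa is informative alongside accuracy when the number of candidate authors is small.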
Further Background Information
A more extensive explanatory paper can be accessed here: CLS INFRA D3.3 Showcases for the application of CLS methods and tools.
DOI: 10.5281/zenodo.10912516
For background on stylometric methods of authorship attribution in the context of Computational Literary Studies, see the relevant chapters on Authorship in the Survey of Methods in CLS.
DOI: 10.5281/zenodo.7782363