Introduction to Authorship Analysis as a Text Classification/Clustering Problem

Author: Vincent Granville

Guest blog post by Nabanita Roy.

Introduction:

Authorship Analysis is the art and science of discriminating between the writing styles of authors by examining the texts they have written and identifying the characteristics of their persona. It aims to determine characteristics of an individual, such as age, gender, native language and personality traits, based on “available information” pertaining to that individual.

In this article, “available information” refers to textual data only. In general, however, the information used for authorship analysis can go beyond text and include multi-modal observations. Multi-modal observations capture characteristic features such as voice, intonation, gestures, body posture and other physical behavioral aspects of an individual. Together, these characteristics reflect the persona of an individual and consequently help in profiling that individual. In most cases, multi-modal data are sourced from videos, which are then converted into a machine-readable, processable format.

Application Areas:

Why is authorship analysis important? It plays a crucial role in forensic analysis and crime investigation. Besides, social media and open web resources have given rise to a wide range of cyber crimes: fake profile creation, fake reviews by bots, plagiarism, dark web websites facilitating networked and organised terror, terrorist proclamations, and harassment and intimidation through social media messaging, to name a few. [1]

Understanding consumer profiles and analyzing feedback are paramount to market analysis, where authorship analysis is used to examine the demographics of the authors of anonymous feedback. The raw texts could come from blogs, online product reviews or social media forums. [3]

Other application areas include resolving disputes in authorship of novels, plagiarism detection, document dating, examining socio-economic factors and mental health examination.

Text Classification Tasks involved in Authorship Analysis:

Different tasks work towards the common goal of authorship analysis. The three major tasks are Author Attribution, Author Verification and Author Profiling.

i) Author Attribution: Author attribution determines, given a collection of texts of undisputed authorship from multiple candidate authors, which of those authors wrote a previously unseen text. This is ideally a closed-set multi-class text classification problem. [2]

ii) Author Verification: This task determines whether or not an individual has authored a piece of text by studying a corpus of texts by that author. This is a binary, single-label text classification problem. Although this task seems easy, author verification is far more complicated in practice.

iii) Author Profiling: Author profiling can also be described as identifying the personality of an author by studying the texts they have authored. This involves predicting demographic features such as gender, age, native language and personality traits of an author by examining their writing style [1]. Author profiling can be viewed as both a multi-class, multi-label text classification problem and a clustering problem. It is potentially a clustering problem because we aim to identify homogeneous writing styles in the given corpus and cluster them together for similarity analysis.
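
To make the difference between the first two tasks concrete, here is a minimal sketch in Python. It is only an illustration, not a method taken from the cited papers: it assumes character trigram counts as the style feature and cosine similarity as the comparison, and the function names, the toy corpus and the 0.5 threshold are all made up for this example. Attribution picks the closest of a fixed set of candidate authors, while verification makes a yes/no decision against a single author's profile.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lower-cased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profiles(corpus):
    """Merge each author's known texts into a single n-gram profile."""
    profiles = {}
    for author, texts in corpus.items():
        profile = Counter()
        for known_text in texts:
            profile.update(char_ngrams(known_text))
        profiles[author] = profile
    return profiles


def attribute(text, profiles):
    """Closed-set attribution: return the candidate author with the most similar profile."""
    query = char_ngrams(text)
    return max(profiles, key=lambda author: cosine(query, profiles[author]))


def verify(text, profile, threshold=0.5):
    """Verification: binary decision against one author's profile.

    The threshold is an arbitrary illustrative value, not a recommended one.
    """
    return cosine(char_ngrams(text), profile) >= threshold


# Toy corpus of known authorship (invented for illustration only).
corpus = {
    "author_a": ["the cat sat on the mat", "the cat ate the rat"],
    "author_b": ["stock markets rallied sharply", "bond markets fell sharply"],
}
profiles = build_profiles(corpus)

print(attribute("the rat sat on the cat", profiles))
print(verify("markets rallied", profiles["author_a"]))
```

In practice the threshold in verify() would have to be tuned on held-out data, which is one reason verification is harder than it looks: unlike attribution, there is no competing candidate to compare against.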

Each of these tasks is extensible depending on the real-world problem it is applied to. Sometimes the objectives of these tasks overlap.

Automatic authorship analysis is not limited to English. Computerized applications have also been developed for other languages such as Greek, French, Dutch, Spanish and Italian. [2, 3]

References:

[1] Reddy, T. Raghunadha, B. Vishnu Vardhan, and P. Vijaypal Reddy. “A survey on authorship profiling techniques.” International Journal of Applied Engineering Research 11.5 (2016): 3092-3102.

[2] Stamatatos, Efstathios, et al. “Overview of the author identification task at PAN 2014.” CLEF 2014 Evaluation Labs and Workshop Working Notes Papers, Sheffield, UK, 2014.

[3] Stamatatos, Efstathios, et al. “Overview of the PAN/CLEF 2015 evaluation lab.” International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, Cham, 2015.

[4] Rangel, Francisco, et al. “Overview of the author profiling task at PAN 2013.” CLEF Conference on Multilingual and Multimodal Information Access Evaluation. CELCT, 2013.
