Stylometry

Stylometry is the application of the study of linguistic style, usually to written language, but it has also been applied successfully to music [1] and to fine-art paintings [2]. [3]

Stylometry is often used to attribute authorship to anonymous or disputed documents. It has legal as well as academic and literary applications, ranging from the question of the authorship of Shakespeare’s works to forensic linguistics.

History

Stylometry grew out of earlier techniques of analyzing texts for authenticity, author identity, and other questions.

The modern practice of the discipline received major impetus from the study of authorship problems in English Renaissance drama. Researchers and readers observed that some playwrights of the era had distinctive patterns of language preferences, and attempted to use those patterns to identify the authors of uncertain or collaborative works. Early efforts were not always successful: in 1901, one researcher attempted to use John Fletcher’s preference for “’em”, the contractional form of “them”, as a marker to distinguish between Fletcher and Philip Massinger in their collaborations, but he mistakenly employed an edition of Massinger’s works in which the editor had expanded all instances of “’em” to “them”. [4]

The basics of stylometry were set out by Polish philosopher Wincenty Lutosławski in Principles of Stylometry (1890). Lutosławski used this method to build a chronology of Plato’s Dialogues.

The development of computers and their capabilities for analyzing large quantities of data greatly enhanced this type of effort. The great capacity of computers for data analysis, however, did not guarantee quality output. In the early 1960s, Rev. A. Q. Morton produced a computer analysis of the fourteen Epistles of the New Testament attributed to St. Paul, which indicated that six different authors had written that body of work. A check of his method, applied to the works of James Joyce, gave the result that Ulysses, Joyce’s multi-perspective, multi-style masterpiece, was written by five separate individuals, none of whom apparently had any part in writing Joyce’s first novel, A Portrait of the Artist as a Young Man. [5]

In time, however, and with practice, researchers and scholars have refined their approaches and methods to yield better results. One notable early success was the resolution of disputed authorship in twelve of the Federalist Papers by Frederick Mosteller and David Wallace. [6] While questions about initial assumptions and methodology can still arise (and, perhaps, always will), few now dispute the basic premise that linguistic analysis of written texts can produce valuable information and insight. (Indeed, this was apparent even before the advent of computers: the textual and linguistic study of the Fletcher canon by Cyrus Hoy in the late 1950s and early ’60s yielded clear results.)

Applications

Applications of stylometry include literary studies, historical studies, social studies, gender studies, and many forensic cases and studies. [7] [8]

Current research

Modern stylometry draws heavily on computers for statistical analysis, on artificial intelligence, and on access to the growing corpus of texts available via the Internet. [9] Software systems such as Signature [10] (freeware produced by Dr. Peter Millican of Oxford University), JGAAP [11] (the Java Graphical Authorship Attribution Program, freeware produced by Patrick Juola of Duquesne University), stylo [12] [13] (an open-source R package for a variety of stylometric analyses, including authorship attribution) and Stylene [14] (for Dutch, by Prof. Dr. Walter Daelemans of the University of Antwerp and Dr. Véronique Hoste of the University of Ghent) make its use increasingly practicable, even for the non-expert.

Academic Venues and Events

Stylometric methods are discussed in several academic fields, mainly as an application of machine learning, natural language processing, or lexicography.

Forensic Linguistics

The International Association of Forensic Linguists (IAFL) organizes the Biennial Conference of the International Association of Forensic Linguists (13th edition in 2016 in Porto) and publishes The International Journal of Speech, Language and the Law, with forensic stylistics as one of its central topics.

AAAI

The Association for the Advancement of Artificial Intelligence (AAAI) has hosted several events on subjective and stylistic analysis of text. [15] [16] [17]

PAN

PAN workshops (originally Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection) are held in conjunction with conferences such as ACM SIGIR, FIRE, and CLEF. PAN formulates shared challenge tasks for plagiarism detection, [18] authorship identification, [19] author gender identification, [20] author profiling, [21] vandalism detection, [22] and other related text analysis tasks, many of which hinge on stylometry.

Case studies of interest

  1. Around 1370-1070 BC, as recorded in the Book of Judges, one tribe identified members of another tribe in order to kill them by asking them to say the word “Shibboleth”, which in the dialect of the intended victims sounded like “sibboleth”. [23]
  2. In 1439, Lorenzo Valla showed that the Donation of Constantine was a forgery, an argument based on a comparison of the Latin with that used in authentic 4th-century documents.
  3. In 1952, the Swedish priest Dick Helander was elected bishop of Strängnäs. The campaign was competitive, and Helander was accused of writing a series of more than a hundred anonymous libelous letters about other candidates, sent to the electorate of the bishopric of Strängnäs. Helander was first convicted of writing the letters and lost his position as bishop, but was later partially exonerated. The letters were studied using a number of stylometric measures as well as typewriter characteristics. [24] [25]
  4. In 1975, after Ronald Reagan had served as governor of California, he began presenting weekly radio commentaries syndicated to hundreds of stations. After his personal notes were made public in 2001, a study used stylostatistical methods to determine which of the talks he had written himself and which were written by various aides. [26]
  5. In 1996, the stylometric analysis of the controversial, pseudonymously authored book Primary Colors, performed by Vassar professor Donald Foster, [27] brought the field to the attention of a wider audience after identifying the author as Joe Klein. (This case was only resolved after a handwriting analysis confirmed the authorship.)
  6. In April 2015, researchers using stylometry techniques identified a play, Double Falsehood, as being the work of William Shakespeare. [28] The researchers analyzed 54 plays by Shakespeare and John Fletcher; they compared average sentence length, studied the use of unusual words, and quantified the complexity and psychological valence of the language.
  7. In 2017, a group of linguists, computer scientists, and scholars analyzed the authorship of the novels published under the name Elena Ferrante. Using a corpus created at the University of Padua containing 150 novels written by 40 authors, they analyzed Ferrante’s style based on seven of her novels. They were able to compare her writing style with that of 39 other novelists using, for example, stylo. [12] The conclusion was the same for all of them: Domenico Starnone is the secret hand behind Elena Ferrante. [29]

Data and Methods

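Stylometry has both descriptive use cases, which characterize the content of a collection, and identificatory use cases, such as identifying the authors or categories of texts. Accordingly, the methods used range from those built to classify items into sets to those that distribute items in a space of feature variation.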

Most methods are statistical in nature.

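Whereas in the past stylometry emphasized the rarest or most striking elements of a text, contemporary techniques can isolate identifying patterns even in common parts of speech. Most systems are based on lexical statistics, i.e. they use the frequencies of words and terms in the text to characterize the text (or its author). In this context, unlike in information retrieval, the occurrence patterns of the most common words are more informative than those of the rarer topical terms. [30] [31]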

The primary stylometric method is the writer invariant: a property held in common by all texts, or at least all texts long enough to admit of statistically significant analysis, written by a given author. An example of a writer invariant is the frequency of function words used by the writer.

In one such method, the text is analyzed to find the 50 most common words. The text is then broken into chunks of a thousand words, and each chunk is analyzed to find the frequency of those 50 words in it. This generates a unique 50-number identifier for each chunk. These numbers place each chunk of text at a point in a 50-dimensional space. This 50-dimensional space is flattened into a plane using principal component analysis (PCA). This results in a display of points that corresponds to an author’s style. If two literary works are placed on the same plane, the resulting pattern may show whether both works were by the same author or by different authors.
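The first steps of this procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `tokenize` and `chunk_profiles`, the simple regular-expression tokenizer, and the choice to drop a short final chunk are all hypothetical. The PCA projection itself is left out; in practice it would normally be done with a numerical library.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def chunk_profiles(text, n_words=50, chunk_size=1000):
    """Return (vocabulary, profiles): one relative-frequency vector per chunk."""
    tokens = tokenize(text)
    # Step 1: the n most common words in the whole text define the axes.
    vocabulary = [word for word, _ in Counter(tokens).most_common(n_words)]
    profiles = []
    # Step 2: walk over full-size chunks only; a short tail is dropped.
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        counts = Counter(tokens[start:start + chunk_size])
        # Step 3: one number per vocabulary word, so each chunk becomes
        # a point in an n_words-dimensional space.
        profiles.append([counts[word] / chunk_size for word in vocabulary])
    return vocabulary, profiles


# Toy usage: a repetitive 2400-word text, two axes, 800-word chunks.
vocab, points = chunk_profiles(
    "the cat and the dog and the bird " * 300, n_words=2, chunk_size=800
)
```

Each row of `points` is the identifier of one chunk. Feeding these rows to a PCA routine and plotting the first two components would give the kind of display described above.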

Statistical data analysis

Methods used include cluster analysis and discriminant analysis.

Neural networks

Neural networks have been used to analyze authorship of texts. Text of undisputed authorship is used to train the neural network through processes such as backpropagation, where the training error is calculated and used to update the network to increase accuracy. Through a process akin to non-linear regression, the network gains the ability to generalize its recognition to new texts to which it has not yet been exposed. Such techniques were applied to the long-standing claims of collaboration of Shakespeare with his contemporaries Fletcher and Christopher Marlowe, [32] [33] and confirmed the view, based on more conventional scholarship, that such collaboration had indeed taken place.

A 1999 study showed that a neural network program reached 70% accuracy in determining authorship of poems it had not yet analyzed. This study from Vrije Universiteit examined identification of poems by three Dutch authors using only letter sequences such as “den”. [34]
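To make the idea of letter-sequence features concrete, the following sketch counts three-letter sequences inside words and turns the counts into relative frequencies. It only illustrates the kind of input such a classifier might receive; the function name `letter_sequence_profile` and the preprocessing choices (lowercasing, ignoring non-letters, not crossing word boundaries) are assumptions, and the features and network actually used in the study are not described here.

```python
from collections import Counter


def letter_sequence_profile(text, n=3):
    """Relative frequency of each n-letter sequence found inside words."""
    counts = Counter()
    for word in text.lower().split():
        # Keep letters only, so punctuation does not create spurious sequences.
        letters = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {seq: count / total for seq, count in counts.items()}


# Toy usage: "den" occurs in both words, "end" and "nde" once each.
profile = letter_sequence_profile("den dende")
```

A vector of such frequencies, one entry per sequence, could then serve as the input layer of a classifier.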

A study used deep belief networks (DBN) to build an authorship verification model applicable to continuous authentication (CA). [35]

One problem with this method of analysis is that the network can become biased by its training set, possibly favoring authors whose texts it has analyzed more often. [34]

Genetic algorithms

The genetic algorithm is another artificial intelligence technique used in stylometry. This involves a method that starts with a set of rules. An example rule might be, “If ‘but’ appears more than 1.7 times in every thousand words, then the text is by author X”. The program is presented with text and uses the rules to determine authorship. The rules are tested against a set of known texts, and each rule is given a fitness score. The 50 rules with the lowest scores are thrown out. The remaining 50 rules are given small changes, and 50 new rules are introduced. This is repeated until the evolved rules correctly attribute the texts.
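The loop described above can be sketched as follows. This is a simplified illustration, assuming only two candidate authors (labelled "X" and "Y") and rules of the form (word, threshold); all function names, the threshold range of 0 to 100 occurrences per thousand words, the mutation size, and the fixed random seed are assumptions made for the example.

```python
import random


def rate_per_thousand(tokens, word):
    """Occurrences of `word` per thousand tokens."""
    return 1000 * tokens.count(word) / len(tokens)


def predict(rule, tokens):
    """A rule is (word, threshold): a rate above the threshold means author 'X'."""
    word, threshold = rule
    return "X" if rate_per_thousand(tokens, word) > threshold else "Y"


def fitness(rule, known_texts):
    """Fraction of the known texts that the rule attributes correctly."""
    hits = sum(predict(rule, tokens) == author for tokens, author in known_texts)
    return hits / len(known_texts)


def random_rule(words, rng):
    """Draw a fresh rule: a random word and a random threshold."""
    return (rng.choice(words), rng.uniform(0, 100))


def mutate(rule, rng):
    """Give a rule a small change by nudging its threshold."""
    word, threshold = rule
    return (word, max(0.0, threshold + rng.gauss(0, 2)))


def evolve(known_texts, words, size=100, generations=200, seed=0):
    """Evolve rules until one attributes every known text correctly."""
    rng = random.Random(seed)
    population = [random_rule(words, rng) for _ in range(size)]
    for _ in range(generations):
        population.sort(key=lambda r: fitness(r, known_texts), reverse=True)
        if fitness(population[0], known_texts) == 1.0:
            break
        survivors = population[: size // 2]  # the lowest-scoring half is dropped
        changed = [mutate(rule, rng) for rule in survivors]
        fresh = [random_rule(words, rng) for _ in range(size - len(changed))]
        population = changed + fresh
    return max(population, key=lambda r: fitness(r, known_texts))


# Toy data: author X uses "but" 50 times per thousand words, author Y 10 times.
known = [
    (["but"] * 5 + ["so"] * 95, "X"),
    (["but"] * 1 + ["so"] * 99, "Y"),
]
best = evolve(known, words=["but", "so"])
```

With this toy data any rule on "but" with a threshold between 10 and 50 separates the two authors, so the search ends quickly. Real applications would need many more texts, held-out data for evaluation, and a richer rule language.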

Rare pairs

One method for identifying style is called “rare pairs”, and relies on individual habits of collocation. The use of certain words may, for a particular author, idiosyncratically entail the use of other, predictable words.
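The text does not say how such collocation habits are measured, so the following is only one hypothetical way to tabulate them: for a chosen trigger word, it records which words follow within a small window and how large a share of all followers each one takes. The function name `collocates` and the `window` parameter are assumptions for the sketch.

```python
from collections import Counter


def collocates(tokens, trigger, window=3):
    """Share of each word among the words that follow `trigger` within `window`."""
    followers = Counter()
    for i, token in enumerate(tokens):
        if token == trigger:
            # Count every word in the window right after the trigger.
            followers.update(tokens[i + 1:i + 1 + window])
    total = sum(followers.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in followers.items()}


# Toy usage: which word directly follows "to" in this phrase?
tokens = "to be or not to be that is to say".split()
habits = collocates(tokens, "to", window=1)
```

Comparing such tables between a disputed text and texts by candidate authors would show whether the same trigger words bring the same companions with them.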

Authorship attribution in instant messaging

The spread of the Internet has shifted the attention of authorship attribution towards online texts (web pages, blogs, etc.), electronic messages (e-mails, tweets, posts, etc.), and other types of written information that are far shorter than an average book, much less formal, and more diverse in terms of expressive elements such as colors, layout, fonts, graphics, emoticons, etc. Efforts to take such aspects into account at the level of both structure and syntax have been reported. [36] In addition, content-specific and idiosyncratic cues have been introduced to unveil deliberate stylistic choices. [37]

Standard stylometric features have been employed to categorize the content of a chat over instant messaging [38] or the behavior of the participants, [39] but attempts at identifying chat participants are still few and early. Moreover, the similarity between spoken conversations and chat interactions has been neglected, although it is a key difference between chat and other types of written information.

See also

  • Forensic stylistics
  • Linguistics and the Book of Mormon, Stylometry (Wordprint Studies)
  • Moshe Koppel
  • Writeprint

Notes

  1. Westcott, Richard (15 June 2006). “Making hit music into a science”. BBC News.
  2. “Internet Archive Wayback Machine”. Web.archive.org. 2006-06-30. Archived from the original on June 30, 2006. Retrieved 2012-10-15.
  3. Argamon, Shlomo, Kevin Burns, and Shlomo Dubnov, eds. The Structure of Style: Algorithmic Approaches to Understanding Manner and Meaning. Springer Science & Business Media, 2010.
  4. Samuel Schoenbaum, Internal Evidence and Elizabethan Dramatic Authorship: An Essay in Literary History and Method, p. 171.
  5. Samuel Schoenbaum, Internal Evidence and Elizabethan Dramatic Authorship: An Essay in Literary History and Method, p. 196.
  6. F. Mosteller & D. Wallace (1964). Inference and Disputed Authorship: The Federalist. Reading, MA: Addison-Wesley.
  7. Chaski, Carole (2012). Author Identification in the Forensic Setting. The Oxford Handbook of Language and Law. Oxford University Press. doi: 10.1093/oxfordhb/9780199572120.001.0001. ISBN 9780199572120.
  8. Chaski, Carole (22 December 2005). Wecht, Cyril H.; Rago, John T., eds. Forensic Science and Law: Investigative Applications in Criminal, Civil and Family Justice. CRC Press. ISBN 978-1-4200-5811-6.
  9. Argamon, Shlomo, Jussi Karlgren, and James G. Shanahan. Stylistic Analysis of Text for Information Access. Papers from the workshop held in conjunction with the 28th Annual International ACM Conference on Research and Development in Information Retrieval, August 13-19, 2005, Salvador, Bahia, Brazil. Swedish Institute of Computer Science, 2005.
  10. “The Signature Stylometric System”. PhiloComp. Retrieved 2014-01-03.
  11. “JGAAP”. JGAAP. 2012-09-04. Retrieved 2012-10-15.
  12. “The stylo for R package”. Computational Stylistics Group at Jagiellonian University. 2014-10-24. Retrieved 2014-10-24.
  13. Eder, Maciej, Rybicki, Jan and Kestemont, Mike (2016). Stylometry with R: a package for computational text analysis. R Journal, 8 (1): 107-121. https://journal.r-project.org/archive/2016-1/eder-rybicki-kestemont.pdf
  14. Daelemans, Walter & Hoste, Véronique (2013). STYLENE: An Environment for Stylometry and Readability Research for Dutch (Technical report). CLiPS Technical Report Series. ISSN 2033-3544.
  15. Yan Qu, James Shanahan, and Janyce Wiebe. “Exploring attitude and affect in text: Theories and applications.” AAAI Spring Symposium Technical Report SS-04-07. AAAI Press, Menlo Park, CA. 2004.
  16. Jussi Karlgren, Björn Gambäck, and Pentti Kanerva. “Acquiring (and Using) Linguistic (and World) Knowledge for Information Access.” AAAI Spring Symposium. Technical report SS-02-09. AAAI Press, Menlo Park, CA. 2002.
  17. Shlomo Argamon, Shlomo Dubnov, and Julie Jupp. “Style and Meaning in Language, Art, Music, and Design” (2004). AAAI Fall Symposium. Technical report FS-04-07.
  18. Martin Potthast, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. “An evaluation framework for plagiarism detection.” In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 997-1005. Association for Computational Linguistics, 2010.
  19. Stamatatos, Efstathios, Walter Daelemans, Ben Verhoeven, Patrick Juola, Aurelio Lopez-Lopez, Martin Potthast, and Benno Stein. “Overview of the Author Identification Task at PAN 2014.” In CLEF (Working Notes), pp. 877-897. 2014.
  20. Rangel, Francisco, Paolo Rosso, Martin Potthast, and Benno Stein. “Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter.” Working Notes Papers of the CLEF (2017).
  21. Rangel Pardo, Manuel Francisco, Fabio Celli, Paolo Rosso, Martin Potthast, Benno Stein, and Walter Daelemans. “Overview of the 3rd Author Profiling Task at PAN 2015.” In CLEF 2015 Evaluation Labs and Workshop Working Notes Papers, pp. 1-8. 2015.
  22. Martin Potthast, Benno Stein, and Teresa Holfeld. “Overview of the 1st International Competition on Wikipedia Vandalism Detection.” In CLEF (Notebook Papers / LABs / Workshops). 2010.
  23. Judges 12:5-6.
  24. Text Processing: Text Analysis and Generation, Text Typology and Attribution. Proceedings of Nobel Symposium 51. Ed. by Sture Allén. Stockholm: Almqvist & Wiksell International, 1982. 653 pp. Data linguistica 16; Nobel symposium 51. ISBN 91-22-00594-3.
  25. Karlgren, Jussi (2003). “Helander: An Authorship Attribution Case”. Retrieved 4 October 2017.
  26. Edoardo M. Airoldi, Stephen E. Fienberg, Kiron K. Skinner (July 2007). “Whose Ideas? Whose Words? Ronald Reagan’s Authorship of Radio Addresses” (PDF). PS: Political Science & Politics. 40 (3): 501-506. doi: 10.1017/S1049096507070874.
  27. McNett, Gavin. “Author Unknown”. Salon. November 2, 2000.
  28. “Study finds disputed Shakespeare play bears the master’s mark”. LATimes.com. 2015-04-10. Retrieved 2015-04-13.
  29. Jacques Savoy. Elena Ferrante Unmasked. https://www.researchgate.net/publication/320131096_Elena_Ferrante_Unmasked
  30. Biber, Douglas. Variation across Speech and Writing. Cambridge University Press, 1991.
  31. Karlgren, Jussi; Cutting, Douglass (1994). “Recognizing Text Genres with Simple Metrics Using Discriminant Analysis”. Proceedings of the International Conference on Computational Linguistics.
  32. Matthews, R. A. J. & Merriam, T. V. N. (1993). “Neural Computation in Stylometry I: An Application to the Works of Shakespeare and Fletcher”. Lit Linguist Computing. 8 (4): 203-209. doi: 10.1093/llc/8.4.203.
  33. Merriam, T. V. N. & Matthews, R. A. J. (1994). “Neural Computation in Stylometry II: An Application to the Works of Shakespeare and Marlowe”. Lit Linguist Computing. 9 (1): 1-6.
  34. J. F. Hoorn, S. L. Frank, W. Kowalczyk and F. van der Ham (2012-09-03). “Neural network identification of poets using letter sequences”. Llc.oxfordjournals.org. Retrieved 2012-10-15.
  35. Brocardo, M. L.; Traore, I.; Woungang, I.; Obaidat, M. S. (2017). “Authorship verification using deep belief network systems”. Int J Commun Syst. 30: e3259. doi: 10.1002/dac.3259.
  36. de Vel, O.; Anderson, A.; Corney, M.; Mohay, G. (2001-12-01). “Mining e-Mail Content for Author Identification Forensics”. SIGMOD Rec. 30 (4): 55-64. doi: 10.1145/604264.604272. ISSN 0163-5808.
  37. Argamon, Shlomo; Koppel, Moshe; Pennebaker, James W.; Schler, Jonathan (2009-02-01). “Automatically Profiling the Author of an Anonymous Text”. Commun. ACM. 52 (2): 119-123. doi: 10.1145/1461928.1461959. ISSN 0001-0782.
  38. “Classification of Instant Messaging Communications for Forensics Analysis – TechRepublic”. TechRepublic. Retrieved 2016-01-26.
  39. Zhou, L.; Zhang, Dongsong (2004-01-01). Can online behavior unveil deceivers? – an exploratory investigation of deception in instant messaging. Proceedings of the 37th Annual Hawaii International Conference on System Sciences, 2004: 9 pp. doi: 10.1109/HICSS.2004.1265079.

References

  • Brocardo, Marcelo Luiz; Issa Traore; Sherif Saad; Isaac Woungang (2013). Authorship Verification for Short Messages Using Stylometry. IEEE Intl. Conference on Computer, Information and Telecommunication Systems (CITS).
  • Can F, Patton JM (2004). “Change of writing style with time”. Computers and the Humanities. 38 (1): 61-82. doi: 10.1023/b:chum.0000009225.28847.77.
  • Brennan, Michael Robert; Greenstadt, Rachel. “Practical Attacks Against Authorship Recognition Techniques”. Innovative Applications of Artificial Intelligence.
  • Hope, Jonathan (1994). The Authorship of Shakespeare’s Plays. Cambridge: Cambridge University Press.
  • Hoy C (1956-62). “The Shares of Fletcher and His Collaborators in the Beaumont and Fletcher Canon”. Studies in Bibliography. 7-15.
  • Juola, Patrick (2006). “Authorship Attribution” (PDF). Foundations and Trends in Information Retrieval. 1: 3. doi: 10.1561/1500000005.
  • Kenny, Anthony (1982). The Computation of Style: An Introduction to Statistics for Students of Literature and Humanities. Oxford: Pergamon Press.
  • Romaine, Suzanne (1982). Socio-Historical Linguistics. Cambridge: Cambridge University Press.
  • Samuels, ML (1972). Linguistic Evolution: With Special Reference to English. Cambridge: Cambridge University Press.
  • Schoenbaum, Samuel (1966). Internal Evidence and Elizabethan Dramatic Authorship: An Essay in Literary History and Method. Evanston, IL, USA: Northwestern University Press.
  • Van Droogenbroeck, Frans J. (2016). “Handling the Zipf distribution in computerized authorship attribution”.
  • Zenkov, A. V. (2017). “A Method of Text Attribution Based on the Statistics of Numerals”. Journal of Quantitative Linguistics. http://dx.doi.org/10.1080/09296174.2017.1371915

Further reading

See also the academic journal Literary and Linguistic Computing (published by the University of Oxford) and the journal Language Resources and Evaluation.