Computational Stylometry for Deepening Scholarly Engagements in the Humanities: Intelligent Cyberinfrastructure Resources, Community-development Tools, and Learning Applications
Background: Stylometry and its relation to Computational Humanities
Although significant advances have been made in computational linguistics, natural language processing, and text-mining, and a variety of associated applications have been demonstrated in the domain of computational humanities[1], the closely allied area of stylometry and its relevance to computational humanities have not received sufficient attention. Stylometry’s modern roots can be traced back to the late 15th century, when scholars relied on manual, labor-intensive methods of literary analysis and counting. It was not until the late 19th century that statistical methods started to be incorporated into stylometry, with quantitative analysis methods proposed and demonstrated as a means for author profiling [Calle-Martin & Miranda-Garcia, 2012]. In the mid-20th century, the development of computational stylometry was strongly influenced by early computational humanities projects such as the one jointly led by the Jesuit priest Roberto Busa and IBM, which created an index of all the works by St. Thomas Aquinas [Busa, 1951].
A stylome, as related to human authorship, has been described as a unique signature that identifies an author based on characteristic patterns or motifs that appear in the author’s writing [Hai-Jew, 2015]. Recently, authorship determination based on stylometric methods has gained new relevance and even urgency with the increase in fraudulently produced documents and with the growing popularity of publication forums that encourage “paper mill” enterprises [Odri and Yoon, 2023].
Stylometric methods, however, offer numerous other, more positive uses beyond the field’s founding goal of fraud detection. Particularly interesting are the promises surrounding the analysis of creative artifacts, especially texts, to understand styles. Stylometry can be highly useful when it comes to unraveling the structures of core stylistic motifs, such as the use of figures of speech to produce a particular effect or impression.
For example, scholars have applied basic log-linear classifiers to recognize figures of word repetition (chiasmus, epanaphora, and epiphora) that writers use to achieve different types of effects [Dubremetz and Nivre, 2018]. Another branch of stylometry analyzes the stylome at the individual-author and author-cohort (or peer-group) levels, based on lexical, morphological, syntactic, and semantic attributes of the text. For example, in one study, individual authorship motifs were established in terms of the usage of non-content-bearing words such as “by”, “to”, and “or” and content-bearing words such as “war”, “innovation”, and “commonly” to distinguish between Madison’s and Hamilton’s contributions to the Federalist Papers [Holmes, 1998]. In stylometric studies that draw upon machine-learning methods, the text features are converted to numerical vectors, and supervised learning methods are then used to train classifiers that automatically categorize (or predict) the authorship or influences of individuals. One example of a supervised approach is the application of support vector machines (SVM) to a set of collaboratively produced texts, with the goal of identifying shifts in authorial influence across individual works [Maciej, 2016]. Using a similar vector-based conversion of stylistic features together with unsupervised methods such as clustering, author cohorts (groups of authors that represent similar styles) can be detected [Hai-Jew, 2015]. It is expected that authors from relatively narrow time intervals will demonstrate certain similarities in style. However, a study covering a long historical period has shown that the length of such homogeneous periods has gradually shortened in the recent era, due to the volume and diversity of scholarly texts produced now as compared to the past [Hughes, et al, 2012].
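The vector-based workflow described above can be illustrated with a minimal sketch in Python. It is not the method of any of the cited studies: the function-word list, the toy texts, and the author labels are all invented for illustration, and a simple nearest-profile rule based on cosine similarity stands in for a trained classifier such as an SVM.

```python
import math
import re
from collections import Counter

# Hypothetical marker list: a handful of English function words.
FUNCTION_WORDS = ["by", "to", "and", "or", "the", "of", "upon", "on"]


def feature_vector(text):
    """Relative frequency of each marker word among all tokens in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attribute(disputed_text, known_samples):
    """Return the author label whose sample profile is most similar to the disputed text."""
    target = feature_vector(disputed_text)
    return max(
        known_samples,
        key=lambda author: cosine(target, feature_vector(known_samples[author])),
    )


# Toy data, invented for this sketch (not drawn from any real corpus).
KNOWN_SAMPLES = {
    "author_a": "upon the whole upon the matter upon the plan",
    "author_b": "on the whole and on the matter and on the plan",
}
DISPUTED = "upon the question and upon the answer"

print(attribute(DISPUTED, KNOWN_SAMPLES))
```

With real corpora, the marker list would typically be much longer, and the similarity rule would be replaced by a classifier evaluated on held-out texts of known authorship.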
Large-scale applications of stylometry such as the study by Hughes et al. have become possible due to the creation and availability of big data sets and open resources, for example Project Gutenberg, which offer the opportunity to analyze the evolution of authorship styles over time and shifting cultural trends.
With the advent of generative AI (GA) methods, computational stylometry has found a new value and application domain: detection of fraudulent synthetic content [Odri and Yoon, 2023]. However, viewed from a positive angle, GA opens the possibility of engaging with ancient or classic texts in creative and exciting ways. Using GA, Martin Puchner, a Harvard scholar, has developed novel methods for engaging with classic texts and literature, based on dialogical interactions with historically critical figures, for example, Socrates, Aristotle, Nietzsche, Montaigne, Du Bois, and Virginia Woolf, who are represented as bots [Sachi, 2025]. The stylometric methods utilized to build such bots require identifying and incorporating textual attributes that make the dialog (i.e., the language of the speaker) age- or time-sensitive, indicate emphasis or de-emphasis, express emotions, and individualize the articulation based on the personality of the historical figure[2]. The attribute elucidation and specification required in training the bots offer new opportunities for students to understand important thinkers, philosophers, and writers more deeply and to advance computational stylometry methods.
Footnotes
[1] We will encourage one or more groups to conduct systematic reviews of the field of computational stylometry (or its major components).
[2] The stylometric attributes identified for the various critical humanities figures need to be made explicit and deliberately manipulated for such bots to be effective. Careful manipulation of the attributes associated with each figure is important for two reasons: 1) to capture the “voice” or personality of the figure and 2) to personalize the experience for the interlocutors so that they remain “convinced” and engaged in the dialog.
Computational Stylometry: An Opportunity to Invigorate Interest in the Humanities
Surprisingly, at a time in history when we are witnessing rapid growth in the digital storytelling, gaming, and animation industries, we are also experiencing a significant drop in interest in the humanities and consequent reductions in humanities offerings in higher education [Schmidt, 2018]. This contradiction deserves a broader and more serious examination. Here, however, we have a more modest proposal. Based on a survey of recent advances, we see strong potential for encouraging learners to engage in the humanities by exposing them to cutting-edge computational stylometry methods. Therefore, we aim to explore the opportunities and barriers associated with applying computational stylometry methods for deepening interest and advancing learning in the humanities.
The 2025 Documentsociety conference[1] has received commitments from a highly knowledgeable set of educators, scholars, and academic professionals who are engaged in humanities endeavors. Several of the scholars have a solid reservoir of knowledge of and interest in stylometry, and several key participants are involved in campus-wide projects that aim to develop stylometry applications for humanities research and learning (as part of university institutes, consortia, and programs/units in research libraries).
On October 19th, 2025, the day preceding the main event, we will hold a separate session, called the Document Society Computational Stylometry (DCS) forum, focused on methods, tools, and applications of stylometry for deepening engagement with and promoting learning in the humanities (i.e., the theme of this proposal). Beyond the current invitees, we are in contact with additional experts with strong backgrounds in stylometry and its applications in pedagogy, learning, and research, and we will add them to the roster for the DCS forum as panelists, speakers, and participants. A three-pronged approach will be taken to define the critical opportunities and challenges. The first dimension is tools and cyberinfrastructure for supporting computational stylometry learning. The second dimension is sustaining the initiative beyond the first DCS forum, in the form of a consortium focused exclusively on the ongoing nurturing and support of computational stylometry learning. The third dimension is the pedagogy and learning of computational stylometry.
The Oct. 19th, 2025, Document Society Computational Stylometry (DCS) meeting will be a full-day event. DCS will be widely advertised through key scholarly conference platforms and social media channels. All registered and invited participants will be requested to prepare a short position paper of 2–3 pages and submit it about a month before the meeting. A selected set of the position papers will be featured as “keynote” presentations, and all participants will be given an opportunity to share their ongoing work on computational stylometry in a lightning-round session. The day will conclude with two sessions: 1) an hour-long panel session that delves more deeply into the three dimensions, with audience participation, and 2) a manuscript “workshopping” session in which each DCS participant who contributed a position paper and is interested in submitting a journal paper will be given feedback by experts to expand their work. To make DCS a more engaging and relevant forum, we are exploring a potential special themed issue with the editors of Computational Humanities Research. A set of concrete areas and topics will form the scope of both the DCS and the special themed issue. They are:
1. Methods and tools for defining features/attributes, annotating corpora and maintaining open computational stylometry data sets from specific humanities domains and associated use-cases. Methods based on human expertise, machine learning or GA, and hybrid approaches will be considered.
2. Secure and scalable cyberinfrastructure for supporting online development, testing, sharing, and open publishing of computational stylometry software and data.
3. Establishment of “gold standard” training data sets and metric-driven evaluation protocols for computational stylometry that are audited and maintained through fully automated and semi-automated means.
4. Development and deployment of “community” tools for exchanging computational stylometry data, software, and information among learners engaged in computational stylometry projects or courses.
5. Landscape analyses, systematic reviews, or surveys of state-of-the-art computational stylometry methods[1] and their applications in the humanities domains.
6. Learning and/or pedagogical strategies for integrating computational stylometry into humanities curricula in university-level courses.
Main Conference (20 October 2025)
Documentality as a Lens for Analyzing Scholarly Practices in the age of AI: Perspectives from the Humanities, Social Sciences, and Information Science
Goals of the Event: Examine Three Aspects of Documentality
The critical processes carried out during the creation of documents and the processes executed to engage with documents can be succinctly described as documentality. A major goal of the planned event is to consider three primary aspects of documentality from the perspectives of adoption, use, and manipulation of digital platforms in handling documents. We describe the three critical aspects below, drawing upon the interplay among the critical areas in the humanities, social sciences, and information science fields.
Representation
How is the representation of a document’s content interpreted and how does the content representation influence its receiver?
The core issues from humanistic and social science contexts that are relevant here have to do with style, creativity, authenticity, authority, and trust. Areas such as poesis and hermeneutics from the humanities, stylometry and knowledge classification from information science, and semiotics from the social sciences can certainly expand our understanding of the representational aspects of documentality.
Coordination
What are the organizational and human coordination level activities that are affected by and in turn affect documents?
The human-centric disciplinary perspectives of concern here have to do with socialization and the social dimensions of document use, and with frameworks and theories for understanding the agency and power associated with documentality[4]. From information science, the emerging and growing areas of computer-supported cooperative work (CSCW) and human-information interaction could provide helpful concepts and frameworks for understanding coordination.
Transformations
When and why do humans transform objects into meaning-bearing or emotion-generating artifacts?
A closely associated question is how humans interpret objects as documents and what role context plays. With regard to the former question, areas such as anthropology, archeology, science & technology studies, and library and archival practices associated with specialized scholarly collections (e.g., geological or archaeological evidence) could be highly beneficial in identifying answers. The obvious human-centric discipline that can expand our understanding of the second question is psychology (particularly cognitive psychology). Other areas, such as 3-D rendering, virtual and augmented reality, chatbot design, and digital twinning, could aid in understanding transformations as they relate to digital platforms.
Michael Buckland, in a seminal paper on document theory[1], argued that the popular term “information society” should be replaced with the more appropriate term “document society”. A summary of his key arguments is outlined here as background to the event’s scope. Viewed from the vantage point of its essence, its content, a document can be defined as an artifact that humans rely on to gain new insights or vicariously experience emotions. There are, however, other critical dimensions associated with documents that do not have to do with their semantic essence. In an organization, who creates the critical documents, who gets to edit them, who interprets and analyzes them, and who disseminates them determine authority, remit of roles, and compensation. Thus, the association of humans with documents in organizations influences the evolution of documents, and the changing trajectory of documents in organizations impacts the humans who handle them. Sometimes we turn a thing into a document [1].
Consider a knife introduced as evidence in a trial, a cup discovered in an archeological dig and displayed in a museum, or a rose preserved as a “pressed flower” and added to a well-known author’s archival materials. In all these instances, the actual origin of the object has no bearing on its being turned to documentary use. Viewed through these expanded and more nuanced dimensions, a document takes on the role of a highly complex and powerful social artifact, whose stature, structure, evolution, and influence demand closer scrutiny.
Digitization of Documents
In the current milieu of scholarship, the notion of documents and their various roles has become even more complex because digital platforms have become pervasive and are routinely deployed to produce and use most documents. Given the ease of storing and editing and the demands of publishers, virtually all books written today are born digital. Additionally, digital platforms and technologies for creating different types of documents, beyond the conventional types, have become easier to use, to the point that numerous humanities and social science scholars are seriously engaging with them to create dynamic documents[2] and are also critically reflecting on non-conventional document formats to identify their limitations and possibilities[3].
The organizing committee feels that, while many forums have been held in the past to reflect on the impact of digitization on scholarly practices, few, if any, engaged humanities and social science scholars who are both deeply rooted in their individual disciplines and thoughtfully grappling with digital platforms to explore their impact on scholarly practices. We are also not aware of any effort that has brought seasoned scholars from the social sciences, humanities, and information science together to examine the impact of digital platforms specifically on documents and the impact of evolving document formats on scholarly practices in human-centric disciplines.
Outcome
Many additional colleagues have shown strong interest in joining the conference as potential participants or speakers.
We will target an audience of about 45-50 participants. The University of Toronto has provided a small grant to support the conference. The venue of the conference will be the University of Barcelona. An agreement has been reached with the host institution to hold the meeting on October 20th, 2025.
Audience and Participation
For the planned conference we will be seeking short position papers, 2-3 pages long, that we will then discuss in the meeting. Some of the key contributors will be invited to develop their core ideas further into book chapters. The chapters will be aggregated into an edited volume (we are currently discussing a partnership with an academic publisher). The published edited volume will be a key outcome of the meeting.
Footnotes
[1] Buckland, M. (2013). Document Theory: An Introduction, pp. 223-237 in: Records, Archives and Memory: Selected Papers from the Conference and School on Records, Archives and Memory Studies, University of Zadar, Croatia. Ed. by Mirna Willer, Anne J. Gilliland and Marijana Tomić. Zadar: University of Zadar. (Michael is a co-organizer of this event.)
[2] For example, hypertextual content, avatars, games, and bots. As one specific example developed by one of the co-organizers, see Martin Puchner’s page on bots here: https://www.martinpuchner.com/custom-gpts-and-online-education.html.
[3] Tenen, D. (2024). Literary Theory for Robots. Norton. (Dennis is a co-organizer of this event.)
[4] Day, R. (2014). Indexing it All: The Subject in the Age of Documentation, Information, and Data. MIT. (Ron is a co-organizer of this event.)