Category Archives: Uncategorized

An ontology for critical editions of variant text

In 2020 I had a paper accepted to the DH conference, concerning my attempts to create an ontology for the Stemmarest data model. The motivation for doing this was simply to see how far I could bring my work on Stemmarest in line with community norms for data modelling standards, and certainly in 2019 OWL vocabularies were all the rage.

The Stemmaweb model has been a graph model from the beginning, and it was a natural step in 2015 to move to a graph database to back it. Since we thought (and still think) about the problem domain in terms of texts, witnesses, and versions rather than storehouses of individual tiny facts, we didn’t seriously consider RDF as the backing graph model. Neo4j was our preferred solution in the end, because it provided the graph traversal and path finding algorithms we had needed from the beginning to do the validation of our data.

Since the pandemic wiped away the opportunity to present the work in Ottawa, and since the consequences of the pandemic also wiped away the time and energy I would have needed to present this work as part of the online event, I have published here the full abstract (which is, unaccountably, not available from the DH2020 website or book of abstracts) and a link to the ontology, bugs and all, that I had created as of July 2020. I have meanwhile continued this work after a fashion, which I hope to be able to talk about in 2023.

dh2020-ontology

An explanation of the data model behind Stemmaweb

I post here the slides and the abstract for a presentation I gave most recently at the conference of the European Society for Textual Scholarship in 2019. This is intended as a guide to the data model and thinking that informs Stemmaweb and its tools.

This content (both abstract and presentation) is released under a CC-BY 4.0 license, so feel free to download and share!

Abstract

What are the consequences for data modelling when we think of critical edition not as a document, but as a process? Our aim in this talk is to open a discussion on the difference between treating a critical edition as a text, and treating it as an intellectual endeavour whose result is a text.

The typical digital representation of a critical edition takes the form of a document, whether it is prepared with a word processor, the Classical Text Editor, LaTeX, or TEI-based tools and specifications. While these formats can certainly represent the features of a published critical edition, there is very little that makes explicit the editorial logic behind the product.

Here we will consider a different approach, adopted in the recently-concluded SNSF-funded project “The Chronicle of Matthew of Edessa Online”, in which the logic of edition is modelled not merely in the data format, but also in the associated computer code, embedding logic that allows the editor to define custom answers to questions such as the following:

What constitutes a reading, in what context(s)? A lemma reading? A variant?
How should variants be classified? What implicit hierarchy, if any, does the editor’s classification scheme have and what are the implications?
How should the text be subdivided, and in what order(s) should these subdivisions be read?
What kind of information is carried within the text, and how can that be expressed?

Most crucially, the process model allows the answers to these questions to be enforced consistently within the project, with the useful side effect of compelling the editor to reconsider assumptions that turn out not to be adequate. The result, as we hope to demonstrate, is a digital critical edition that inherently captures, not only the resulting text, but also the intellectual process by which it is produced.

Presentation

ests_2019_presentation

A HOWTO for using Stemmaweb

I have been asked for a guide to using the tools on Stemmaweb – there is documentation under ‘About/Help’, of course, but it would be useful to give an overview for where to start and how you go on from there. So this guide is meant to be an introduction, of sorts.

The first step is to create yourself a user account. You can do this by clicking ‘Sign In/Register’ at the top, where you will have three options:

Use your Google account, if you have one.
Use any other OpenID account, if you have one.
Use an account created especially for Stemmaweb. To get one of these you must first click the ‘Sign in with Stemmaweb’ bar, and then follow the ‘Register’ link. Once you are registered you can sign in back on the ‘Sign in with Stemmaweb’ tab.

Once you have done one of these three things, you will find that Stemmaweb looks just the same, but the ‘Sign In’ link will be replaced with a greeting to you (or to your email address anyway), and you will see a new button:

Now you can upload texts of your own to work with!

Stemmaweb operates on collated text. Someday, I hope, there will be an integrated collation tool that will do this first step for you, but as of today we are not there. So the first thing you need, if you want to work on Stemmaweb, is a collation of some text. You can provide this collation in, broadly speaking, one of four ways:

Do it yourself. Align the text in a spreadsheet, one witness per column, with the sigla in the first row. The spreadsheet can be saved as comma-separated format with the extension .csv, tab-separated format with the extension .txt, or as a Microsoft Excel spreadsheet (either XLS or XLSX).
Do it yourself, TEI style. Create a TEI parallel-segmentation file for your text and its witnesses. This is somewhat less recommended because there are a million different ways to apply the TEI guidelines and I have only had time and energy to support a few of them. PLEASE review these guidelines if you want to use this option.
Do it yourself, CTE style. If you have been preparing your witnesses in Classical Text Editor, you can export your work in TEI double-endpoint-attachment format, and Stemmaweb will make a best-effort attempt at reconstructing the witnesses from your apparatus. There are a lot of caveats about doing this; you can read more here. The upload may well fail due to some confusion encountered by the Stemmaweb parser; I am working on a tool to validate CTE input, but it is not yet generally available.
Get a collation tool to do it for you. CollateX is a good option for this; I recommend that you request CollateX results in its GraphML format, in order to preserve any detected reading transpositions. Do NOT use CollateX’s TEI-style parallel-segmentation output.
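To make the first option concrete, here is a minimal sketch of the spreadsheet layout: one witness per column, sigla in the first row, saved as comma-separated text. The sigla and words are invented for illustration, and treating an empty cell as "this witness has nothing here" is my assumption for the example rather than a statement of how the Stemmaweb parser behaves.

```python
import csv
import io

# Hypothetical collation: sigla in the first row, one witness per column.
ROWS = [
    ["A", "B", "C"],
    ["the", "the", "the"],
    ["quick", "swift", "quick"],
    ["fox", "fox", ""],  # assumption: an empty cell means the witness has no reading here
]


def collation_to_csv(rows):
    """Serialise the aligned table as comma-separated text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def witness_texts(csv_text):
    """Read the table back and reassemble each witness's running text."""
    table = list(csv.reader(io.StringIO(csv_text)))
    sigla, body = table[0], table[1:]
    return {
        siglum: " ".join(row[col] for row in body if row[col])
        for col, siglum in enumerate(sigla)
    }


csv_text = collation_to_csv(ROWS)
texts = witness_texts(csv_text)
```

Reading the file back into per-witness text is a cheap sanity check before uploading: if a witness comes out garbled here, its column is probably misaligned.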

Now that you have a collation uploaded, if you click on that text in the list you will see some extra buttons. Let’s look at what each of them does.

This is where you should probably start. Clicking on this button will load your text into the ‘relationship mapper’; you will see the collation in the form of a graph, running in one direction from beginning to end. Each reading in the text is a node in the graph somewhere; each witness is essentially a single long string of these reading nodes, collecting its words from beginning to end. Wherever witnesses agree, they are ‘strung’ through the same node. Wherever they disagree, their ‘strings’ will diverge, and the variant readings will appear stacked roughly on top of each other in the graph.
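If it helps to see the idea in code, the following is a rough sketch of how such a graph can be built from an aligned table. It is not the Stemmaweb implementation; the node representation (a pair of row number and word) and the "#START#" / "#END#" markers are assumptions made for the example.

```python
from collections import defaultdict


def variant_graph(sigla, rows):
    """Build a variant graph from an aligned table.

    Nodes are (rank, word) pairs, so witnesses that agree at a given rank
    pass through the same node. Returns {(source, target): set of sigla}.
    """
    edges = defaultdict(set)
    for col, siglum in enumerate(sigla):
        previous = "#START#"
        for rank, row in enumerate(rows):
            word = row[col]
            if not word:  # this witness has no reading at this rank
                continue
            node = (rank, word)
            edges[(previous, node)].add(siglum)
            previous = node
        edges[(previous, "#END#")].add(siglum)
    return dict(edges)


sigla = ["A", "B", "C"]
rows = [
    ["the", "the", "the"],
    ["quick", "swift", "quick"],
    ["fox", "fox", "fox"],
]
graph = variant_graph(sigla, rows)
```

In this toy tradition all three witnesses are strung through "the" and "fox", while the strings diverge at the second word: A and C pass through "quick" and B through "swift".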

The purpose of this tool is not only to look at the pretty picture made by your variant graph, but also to annotate the variants in relation to each other. Are the variant readings synonyms? Spelling variants? Link them as such. Has a word been shifted to a different location in one witness? Link the two occurrences of the word as a transposition. Do you think the variation in question is stemmatically significant? Make a note of it in the dialog when you create the link. More instructions for using the tool can be found here.

At the moment the available types of links are mostly limited to syntactical relationships between words. Someday, the users of Stemmaweb (that’s you, the scholar) will be able to define their own relationship categorization, but that day is not today. If none of the syntactical categories apply, you are very welcome to make liberal use of the ‘Other’ categorization and leave yourself a note in the ‘Annotation’ field.

If you click this button, you will find a fairly arcane (but hopefully well-explained) way of defining a stemma for your witnesses. This is meant to be used for the definition of any stemma at all, so long as it has a root (an archetype) and doesn’t have a cycle (e.g. A->B, B->C, C->A. Unless you are working on the New Testament, you probably won’t have this.)
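The two conditions above can be checked mechanically. Here is a small sketch, assuming a stemma is given as a list of (exemplar, copy) pairs and that "has a root" means exactly one witness with no exemplar; both of those are my simplifications for the example, not a description of what Stemmaweb does internally.

```python
def is_valid_stemma(edges):
    """Return True if the directed graph has exactly one root and no cycle."""
    nodes = {node for edge in edges for node in edge}
    incoming = {node: 0 for node in nodes}
    children = {node: [] for node in nodes}
    for exemplar, copy in edges:
        incoming[copy] += 1
        children[exemplar].append(copy)

    roots = [node for node in nodes if incoming[node] == 0]
    if len(roots) != 1:
        return False

    # Kahn's algorithm: repeatedly remove nodes with no remaining exemplar.
    # If a cycle exists, some nodes can never be removed.
    queue = list(roots)
    removed = 0
    while queue:
        node = queue.pop()
        removed += 1
        for child in children[node]:
            incoming[child] -= 1
            if incoming[child] == 0:
                queue.append(child)
    return removed == len(nodes)


simple = [("A", "B"), ("A", "C")]
contaminated = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
cyclic = [("A", "B"), ("B", "C"), ("C", "A")]
```

Note that the contaminated example, where D was copied from both B and C, still passes: a witness with two exemplars is not a cycle.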

Once you add a stemma, there will be a button labelled ‘Edit this stemma’; that brings up the same stemma definition box, but with the stemma in question pre-loaded there. The left- and right-arrow buttons will allow you to page through the stemmata you have defined. There is no limit to how many you can define!

Stemmaweb is connected to the Stemweb service provided by the Helsinki Institute of Information Technology, which can run one of a variety of phylogenetic algorithms on a collated text tradition. If you click this button, you will get a dialog box that asks you which algorithm you want to run. Clicking on ‘What is this?’ will bring up a description of the selected algorithm. Some of the algorithms take parameters; if you choose one of these, you will be asked to fill in the parameter.

If you have marked up the relationships between variants in the graph viewer / relationship mapper, then you will also be able to discount selected categories of relationship, if you wish – for example, it is fairly common to want to disregard spelling variation, and this is the option that lets you do it.
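One way to picture what discounting a category means: before the algorithm runs, readings linked by an ignored relationship type are treated as the same reading. The sketch below illustrates that idea only; the function, the data shapes, and the type names "spelling" and "lexical" are hypothetical, not taken from the Stemmaweb code.

```python
def collapse_readings(readings, relations, ignored_types):
    """Merge readings linked by an ignored relationship type.

    readings: {siglum: reading} at one location in the text
    relations: [(reading_a, reading_b, relationship_type)]
    Returns a new {siglum: reading} with merged readings made identical.
    """
    canonical = {}

    def find(reading):
        # Follow the chain of merges to the representative reading.
        while canonical.get(reading, reading) != reading:
            reading = canonical[reading]
        return reading

    for a, b, kind in relations:
        if kind in ignored_types:
            rep_a, rep_b = find(a), find(b)
            if rep_a != rep_b:
                # Keep the alphabetically first reading as the representative.
                canonical[max(rep_a, rep_b)] = min(rep_a, rep_b)

    return {siglum: find(reading) for siglum, reading in readings.items()}


readings = {"A": "colour", "B": "color", "C": "hue"}
relations = [("colour", "color", "spelling"), ("colour", "hue", "lexical")]
collapsed = collapse_readings(readings, relations, {"spelling"})
```

With spelling variation discounted, A and B now agree, and only C's lexical variant remains as evidence for the tree-building algorithm.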

The algorithms offered by Stemweb all return unrooted trees; depending on the algorithm you select and the size or amount of variation of your tradition, you may have multiple trees returned. (At present there is no good way to delete a tree, or to reorder the trees that are returned.)

An unrooted tree is not, by this definition, a stemma until it has been oriented by selecting a root. In Stemmaweb you can orient/root (or re-root!) a tree by clicking on the witness node that you wish to treat as the archetype, and selecting the green checkbox to ‘Use this node to root the stemma’.
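Orienting a tree is simple enough to sketch in a few lines: starting from the chosen root, every edge is pointed away from it. The following is an illustration under the assumption that the tree arrives as a list of undirected node pairs; the node names are invented, with "x" standing in for a hypothetical (lost) witness.

```python
from collections import deque


def root_tree(undirected_edges, root):
    """Orient an unrooted tree so that every edge points away from `root`."""
    neighbours = {}
    for a, b in undirected_edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    if root not in neighbours:
        raise ValueError(f"{root!r} is not a node of the tree")

    directed = []
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for other in sorted(neighbours[node]):
            if other not in seen:
                seen.add(other)
                directed.append((node, other))
                queue.append(other)
    return directed


tree = [("A", "x"), ("B", "x"), ("x", "C")]
rooted_at_a = root_tree(tree, "A")
rooted_at_c = root_tree(tree, "C")
```

Re-rooting is just a matter of calling the function again with a different node, which is why the same unrooted tree can yield several quite different stemma hypotheses.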

Now you have your text uploaded and marked, and you have your stemma hypothesis (or maybe several hypotheses) – you are ready to click the most exciting button!

The Stexaminer is a program and algorithm that was developed in concert with the DTAI group at KU Leuven; its job is to tell you, for each location in the text where variation occurs, whether that variation fits the stemma. In the case of stemmas without contamination / conflation this is pretty easy to calculate, but when you have traditions where contamination is known or suspected to have occurred, up to now it has been difficult to say for sure whether a given pattern of variation can be explained by the stemma. The Stexaminer can handle as complex a stemma as you care to throw at it.

Like the graph viewer / relationship mapper tool, the Stexaminer has its own help documentation for you to consult. The basic idea is that you can generate an overview of how well your stemma seems to match the textual evidence, and you can also drill down variant-by-variant to see which witnesses carry which reading. Where the pattern of variants does not match the stemma, the Stexaminer deduces where in the stemma the change might have been introduced, so that the number of ‘coincidences’ is kept to a minimum. It will also try to detect reading reversion – that is, when a scribe might have altered a reading in the exemplar to restore an ancestral reading. This is a highly experimental feature, and not one to rest a philological argument on without a lot of caution.

So there you have it! An overview of how Stemmaweb’s tools fit together and pointers to how you might use them. Go wild, and if something goes wrong, get in touch!

New home, new features

Along with my own recent institutional move to the University of Bern, I have taken the opportunity to organize a new home (and a new domain) for Stemmaweb. Now that the migration is finished I am very happy to announce two new major features:

Editing and correction of collation data, as developed in collaboration with Huygens ING and previewed at DH2013.
Integration with the Stemweb service for algorithmic generation of stemma hypotheses, in cooperation with researchers at Huygens ING and the HIIT Helsinki Institute for Information Technology, and sponsored through a grant from the European Association for Digital Humanities.

These features are not yet as well documented as I would like, but up and running and ready to use. As ever, use of the Stemmaweb tools is free for any scholar anywhere, and if you have any questions or difficulties, don’t hesitate to contact me.

White Paper: Interoperability between Stemmatological Microservices

White Paper
Interoperability between Stemmatological Microservices

Tara Andrews¹, Simo Linkola², Teemu Roos², Joris van Zundert³

¹ KU Leuven (BE)
² Helsinki Institute for Information Technology HIIT, University of Helsinki (FI)
³ Huygens Institute for the History of the Netherlands, Royal Netherlands Academy of Arts and Sciences (NL)

7 May 2013

In 2012 two online tools for text stemmatology research were published to the web: Stemweb [source code available at https://github.com/Stemweb/Stemweb], under development by members of the Helsinki Institute for Information Technology HIIT, and Stemmaweb [http://byzantini.st/stemmaweb/; source code available at https://github.com/tla/stemmaweb/] developed by the Tree of Texts project at KU Leuven in collaboration with members of the Interedition project [http://www.interedition.eu]. Stemweb is a resource for many sorts of phylogenetic and other statistical analysis of a text, including the RHM and SemStem methods developed specifically for the case of recovering manuscript text stemmata. Stemmaweb is a complementary resource for the visualization, regularization, and analysis of the variation within a text using graph search methods, and allows the scholar to define as many hypothetical stemmata as she or he would like to explore. On 25-26 April 2013 a meeting was held, with funding generously provided by a Small Project Grant of the European Association for Digital Humanities (ALLC), to create an open blueprint for integration. Our blueprint is meant not only to be open to the two existing tools but also to provide a framework for interoperability with any other tool for stemmatological research that may appear in future. The blueprint is provided here in the form of a white paper; comments are invited from interested developers and scholars, and those received by 31 May will be taken under consideration in time for the first implementation.

Background

Stemmatology is the study of how to derive computationally the copying relationships between ancient and medieval manuscripts. Various statistical and algorithmic approaches have been adapted from the field of evolutionary biology for this over the past few decades, such as maximum parsimony and neighbour joining; others, such as the RHM method and the more recent SemStem, have been developed specifically with the problem of text genealogy in mind. Many of the methods require the scholar to become familiar with software packages for evolutionary biology, and are in that sense not particularly approachable (or even, in the case of non-free software, not easily available). One of the greatest advantages of Stemweb is precisely the collection of several applicable algorithms in one place, so that the scholar can use different methods on the same dataset without having to devise a different technical working process for each.

To allow practical application of and reflection on the various algorithms by as wide a community of scholars as possible, our aim is to provide both open GUI and API access to these tools. The web-based user interfaces allow the integrated facilitation of the various methods for stemmatology developed by different researchers in various locations. Founding this integration on the basis of (web) APIs will allow anyone who develops additional approaches to stemmatology to let their solutions be interoperable with the current published ones.

Where Stemweb provides a collection of algorithms for the creation of stemmatic hypotheses, Stemmaweb is a collection of tools designed to examine and analyse collated sets of texts, the relationships between variant readings, and the logical consequences of one or more stemmatic hypotheses. In this context Stemmaweb is a consumer of the hypotheses that Stemweb can generate.

Aims

The aim of this proposal is thus to allow any scholar to:

have open web access to our current technology for stemmatology
provide an open interoperable way to contribute stemmatological algorithms to the framework
provide an open API that allows integration of stemmatological services in any web-enabled GUI

Throughout future stages of the project, we wish to continue an active engagement with the scholarly community concerning the direction and functionality of the ecosystem of tools for stemmatological and text-genetic research. As far as possible given the resources available to us, we will provide an open communication space for scholars to reflect, comment on, and participate in our work. We also intend to provide or collect information on the properties of different stemmatological methods as well as online guidelines for their use and other documentation.

Request for Comments

At this point in time we have provisionally agreed on the primary APIs to connect and interoperate the Stemweb and Stemmaweb solutions. We present in this white paper the proposed protocol and data formats that will allow the interconnection of these two solutions. We have striven to keep this protocol for interoperability lightweight, extensible and open so that any developer, researcher, or contributor may scrutinize, comment and suggest changes and enhancements to this protocol.

Description of the proposed framework

The protocol proposed is based on the idea of microservices. These are very small web services, with a REST-like API as far as possible. The API normally uses JSON as its means of exchange and communication between server and client. Microservices have been defined in the context of the Interedition project (http://interedition.eu/wiki/index.php/About_microservices). Individual microservices may be combined in various ways to drive the functionality of multiple web applications. The interoperability of Stemweb and Stemmaweb follows this model as depicted in Figure 1.

Figure 1: Microservices architecture

In this instance the Stemmaweb visualisation service will be a client of the microservice interface provided by Stemweb, sending collation data and receiving one or more stemmatic tree hypotheses in return. The aim is nevertheless to define an API whose ‘server’ functions can be implemented by any other microservice that provides a stemmatological algorithm, and the ‘client’ functions can be implemented by any consumer of such an algorithm.

Server-side API

Discovery: The server must provide a discovery resource, e.g.

  GET /algorithms/available

which will return a JSON response that lists the stemmatology algorithms available on the server, along with descriptions for display in a user interface and option parameters that are required or recommended for their use. The response takes the following form:

[ { model: argument
    pk: <ID in server database>
    value: <argument value’s type, see below>
    verbose_name: <human readable name of the argument>
    description: <longer description of argument’s behaviour>
  },
  { model: argument
    ...
  },
  ...
  { model: algorithm
    pk: <ID in server database>
    name: <human readable name>
    desc: <longer description of the algorithm>
    url: <link to original article, if available>
    args: [<list of argument pk’s for input arguments>]
  },
  { model: algorithm
    ...
  },
  ...
]

Valid argument values at present are: positive_integer, integer, boolean, input_file, float and String.
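As an illustration of how a client might consume the discovery response, the sketch below joins each algorithm record to its argument records via the pk values. The sample payload is invented: the algorithm name, argument name, and descriptions are placeholders, not real Stemweb data, and the helper function is hypothetical.

```python
import json

# Invented sample in the shape described above; not real server output.
SAMPLE_RESPONSE = json.dumps([
    {
        "model": "argument",
        "pk": 1,
        "value": "positive_integer",
        "verbose_name": "Iterations",
        "description": "How many iterations to run",
    },
    {
        "model": "algorithm",
        "pk": 10,
        "name": "Example method",
        "desc": "A placeholder algorithm for this sketch",
        "url": None,
        "args": [1],
    },
])


def index_algorithms(payload):
    """Map each algorithm name to a list of (argument name, value type)."""
    records = json.loads(payload)
    arguments = {r["pk"]: r for r in records if r["model"] == "argument"}
    catalogue = {}
    for record in records:
        if record["model"] == "algorithm":
            catalogue[record["name"]] = [
                (arguments[pk]["verbose_name"], arguments[pk]["value"])
                for pk in record["args"]
            ]
    return catalogue


catalogue = index_algorithms(SAMPLE_RESPONSE)
```

A user interface could build its algorithm picker and parameter form directly from a structure like this one.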

Request: The server must implement an address to listen for incoming requests, e.g.

  POST /algorithms/calculate

The client should send the following request data to accompany the POST request:

{ userid: <email / ID of user>
  algorithm: <ID of algorithm>
  parameters: { ... }
  data: <string containing data in specified format> }

The server will indicate its response via appropriate HTTP status codes, e.g.:

200 OK (job was accepted and will now be processed)
  { jobid: id }
400 Bad Request (e.g. request was malformed)
  { error: <error message> }
403 Forbidden (e.g. client not authorized)
  { error: <error message> }

The algorithms for stemmatology calculation can take an arbitrarily long time to run, and we therefore propose an asynchronous callback method for the return of results. For this reason the initial server response to a successful request will consist only of a job ID.

When the calculation is finished, the server must make a POST request to a location implemented on the client side with the results. The return of a 200 response implies a commitment that the job will run and the results, whether success or error, will be returned. See below, ‘Client-side API’, for more information.
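The request and its immediate response could be handled on the client side roughly as follows. This is a sketch only: the function names are hypothetical, the parameter key "iterations" and the example values are invented, and the actual HTTP transport is deliberately left out so that the example stays self-contained.

```python
import json


def build_request(userid, algorithm, parameters, data):
    """Assemble the JSON body to accompany POST /algorithms/calculate."""
    return json.dumps({
        "userid": userid,
        "algorithm": algorithm,
        "parameters": parameters,
        "data": data,
    })


def interpret_response(status, body):
    """Return the job ID on 200; raise with the server's message otherwise."""
    payload = json.loads(body)
    if status == 200:
        return payload["jobid"]
    message = payload.get("error", "no error message supplied")
    raise RuntimeError(f"request rejected ({status}): {message}")


body = build_request(
    userid="user@example.org",
    algorithm=10,                     # hypothetical algorithm ID
    parameters={"iterations": 100},   # hypothetical parameter name
    data="A,B\nthe,the\n",
)
jobid = interpret_response(200, '{"jobid": 7}')
```

Since a 200 response carries only the job ID, the client needs to remember that ID in order to match it against the results that arrive later.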

Job status: The server should accept requests for job status, e.g.:

GET /algorithms/jobstatus?jobid=<ID number>
  { jobid: <id number>
    statuscode: < 1 = running / >1 = failed / 0 = success >
    *result: <data>
    *result_format: <format> }

The “result” keys should only be included, where applicable, if the job is no longer running. This is intended primarily as a fallback interface in case the server was unable to make the initial POST reporting results to the client as described above.
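A client might interpret the job status report along these lines. The function name and the returned labels are hypothetical; only the meaning of the status codes (0 for success, 1 for running, greater than 1 for failure) comes from the description above.

```python
def describe_status(report):
    """Translate a job status report into a (state, result) pair."""
    code = report["statuscode"]
    if code == 0:
        return ("success", report.get("result"))
    if code == 1:
        # No result keys are expected while the job is still running.
        return ("running", None)
    if code > 1:
        return ("failed", report.get("result"))
    raise ValueError(f"unexpected status code: {code}")


running = describe_status({"jobid": 7, "statuscode": 1})
finished = describe_status({
    "jobid": 7,
    "statuscode": 0,
    "result": "(A,B,C);",
    "result_format": "newick",  # hypothetical format name
})
failed = describe_status({"jobid": 8, "statuscode": 2, "result": "bad input"})
```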

Client-side API

In addition to ensuring that requests to the server API are well-formed as documented above, a client to the stemmatological algorithm microservice must implement a URL to which the server can post results:

POST /stemmatology/result
 (Success):
 { jobid: <ID number>
   statuscode: 0
   result_format: <format>
   result: <data> }
 (Failure):
 { jobid: <ID number>
   statuscode: >1
   result: <error message> }

In case no results are received in a reasonable time, or it is otherwise suspected that the results of a job failed to be returned, the client may request a job status from the server as documented above.
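To show how the callback and the fallback fit together, here is a sketch of the bookkeeping a client might do: it remembers which jobs it is waiting for, files results as they are posted, and can report which jobs are still outstanding and therefore candidates for a job status request. The class and its method names are hypothetical, as is the "newick" format name in the usage lines.

```python
import json


class ResultInbox:
    """Track submitted jobs and file the results the server posts back."""

    def __init__(self):
        self.pending = set()
        self.results = {}
        self.errors = {}

    def expect(self, jobid):
        """Record a job ID returned by the server on submission."""
        self.pending.add(jobid)

    def receive(self, body):
        """Handle the body of a POST to the result URL.

        Returns False if the job ID is unknown or was already handled.
        """
        payload = json.loads(body)
        jobid = payload["jobid"]
        if jobid not in self.pending:
            return False
        self.pending.discard(jobid)
        if payload["statuscode"] == 0:
            self.results[jobid] = (payload["result_format"], payload["result"])
        else:
            self.errors[jobid] = payload["result"]
        return True


inbox = ResultInbox()
inbox.expect(7)
inbox.expect(8)
accepted = inbox.receive(
    '{"jobid": 7, "statuscode": 0, "result_format": "newick", "result": "(A,B,C);"}'
)
duplicate = inbox.receive(
    '{"jobid": 7, "statuscode": 0, "result_format": "newick", "result": "(A,B,C);"}'
)
```

After these calls job 8 is still in `inbox.pending`, which is exactly the situation in which the client would fall back on asking the server for the job status.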

Crediting issues

When building web services based on this framework, the author(s) should in each case ensure that the providers of the various solutions and components on which the services are built are given due credit – for instance, that a user using Stemmaweb will be notified that the stemmatological algorithms are provided by Stemweb. A general technical solution to the problem of “giving credit where credit is due” is beyond the scope of this white paper; comments on the issue are nevertheless welcome for the purpose of creating a suitable policy in the future. At this point we can only recommend that web service authors make liberal use of logos.

Implementation plan

We are presently releasing this white paper for comments on the Digital Medievalist mailing list (http://www.digitalmedievalist.org/), Humanist (http://dhhumanist.org/), and the Textual Scholarship blog (http://www.textualscholarship.eu/). Comments are invited on the initial phase until the end of May 2013. The teams at KU Leuven, HIIT, and Huygens ING will begin implementation of the functionality required by the framework and the API in June 2013, with the intent of releasing the complete and tested web services by the end of November 2013.

We hope particularly to solicit comments concerning the API from the point of view of potential future extensions of the services outlined above, such as other tools and resources (both front-end and back-end) that can be implemented in the framework. For instance, it will probably be beneficial to integrate a collation tool such as CollateX into the system, with seamless data storage or sharing between the microservices so that the user need not repeatedly upload and download his or her data in order to perform a full analysis cycle. Another desideratum might be integration with an informational resource such as the Parvum lexicon stemmatologicum [https://wiki.uib.no/stemmatology/index.php] produced by members of the Studia Stemmatologica group, which is a collection of definitions of terms used in stemmatology. The API should be able to accommodate any such extensions as seamlessly as possible.

The Stemmaweb Project

Tools and techniques for empirical stemmatology
