Talking data

Sustaining the richness of the humanities

At Digital Humanities 2010 I noticed quite a bit of focused attention on sustainability. My own contribution was a challenge to think differently about how sustainability is likely to be realised, using ideas from biological evolution and multi-agent games. In the digital world, every access involves the making of a copy. If we could tweak things in such a way that all these copies contributed positively to the preservation of the corresponding resource, we would wield tremendous power for the good of preservation. The abstract is online, and the presentation should appear there shortly as well.

In several other lectures I heard two very different approaches to sustainability. In The Specimen Case and the Garden: Preserving Complex Digital Objects, Sustaining Digital Projects, Lewis Ulman eloquently advocated an approach that I have always considered right, but have lately begun to doubt. In (very) short: a humanities project typically delivers a rich, interconnected set of materials in many media, accessible through a web interface that lets you explore everything in its connectedness. That web interface is likely to die off at some point, and the best you can do is to document all the relationships and store the documentation with the data, so that posterity will have little trouble redeveloping an interface for the web of materials.

What is wrong with this? Well, it is not really wrong; in fact it is infinitely better than taking no measures at all. But here is what I see happening in the distant future: a researcher issues a query and finds, among the search results, an XML document from the project's frozen output. He wants to explore the related materials of the project, but there is no interface. He sees that there is documentation on how to build a web interface, but the documentation refers to obsolete architectures and systems, and the researcher is not in a position to dedicate the kind of effort needed to do it. On to the next search result ...

Can we do better?

At least two talks pointed to another way. In The Open Annotation Collaboration: A Data Model to Support Sharing and Interoperability of Scholarly Annotations, Jane Hunter proposed a new way to handle annotations: Open Annotation. A web-based, collaborative, transparent, explicit, standardized paradigm for representing scholarly annotations of just about any resource that can be represented on the web, be it a piece of text, a section of an image, a cut of a video, or a set of other resources. And in the same room, Peter Robinson and Federico Meschini presented Works, Documents, Texts and Related Resources for Everyone and made proposals that might help the paradigm of the Semantic Web and Linked Data come true in the library world.
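To make the Open Annotation idea a bit more concrete, here is a minimal sketch (my own illustration, not taken from the talk) of what such an annotation could look like as plain JSON-LD-style data; the oa: terms approximate the Open Annotation vocabulary, while the URIs, the offsets, and the comment text are invented.

```python
import json

# An annotation as shareable web data: a body (the scholar's note) linked
# to a target (a fragment of some resource on the web). All identifiers
# and values below are made up for illustration.
annotation = {
    "@context": {"oa": "http://www.w3.org/ns/oa#"},
    "@type": "oa:Annotation",
    "oa:hasBody": {
        "@type": "oa:TextualBody",
        "value": "This passage quotes an earlier charter.",
    },
    "oa:hasTarget": {
        "@type": "oa:SpecificResource",
        "oa:hasSource": "http://example.org/editions/charter-42.xml",
        "oa:hasSelector": {
            "@type": "oa:TextPositionSelector",
            "start": 1024,  # character offsets into the source text
            "end": 1180,
        },
    },
}

# Because the annotation is just data with a shared vocabulary, any tool
# that understands that vocabulary can store, exchange, and display it.
print(json.dumps(annotation, indent=2))
```

The point is not the exact syntax, but that the annotation lives independently of any particular project interface.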

Now reconsider the typical output of a humanities project. If we manage to do the annotations in the Open Annotation way, and if we manage to express the web of relations through the formalisms of Linked Data, then we do not need special-purpose interfaces for the output materials. These formalisms are built directly on top of mainstream web technology, but are more basic than the formalisms found in specific disciplines. As such, I expect that at this level we will see a new layer of infrastructure emerging. Part of that infrastructure will also be the facility of temporal browsing, i.e. directing your browsing to points in the past, in order to surf the archived web as it was at that moment. See the Memento idea by Herbert van de Sompel and others.
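As a sketch of what temporal browsing looks like at the protocol level, based on my reading of Memento: a client asks a TimeGate for a resource as it existed around a given date, via the Accept-Datetime header. The target URL and the Internet Archive TimeGate below are merely examples of services that speak the protocol at the time of writing.

```python
import requests  # third-party HTTP library

# Ask a Memento TimeGate for the version of a page as it existed in mid-2010.
target = "http://www.dans.knaw.nl/"
timegate = "http://web.archive.org/web/" + target  # Internet Archive TimeGate (example)

response = requests.get(
    timegate,
    headers={"Accept-Datetime": "Thu, 01 Jul 2010 12:00:00 GMT"},
    allow_redirects=True,
)

# The archive redirects to a snapshot close to the requested date.
print(response.url)                              # URI of the archived copy
print(response.headers.get("Memento-Datetime"))  # when the copy was captured
```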

There will be very generic interfaces based on Linked Data, and these interfaces will be flexible enough to absorb new relationships and data structures.
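As a sketch of what such a generic, data-driven interface could mean in practice (my illustration, with a placeholder URL): a Linked Data browser does not need to know a project's schema in advance; it can simply follow whatever predicates the data exposes.

```python
from rdflib import Graph  # third-party RDF library

# Read whatever RDF a resource exposes and present the relationships found
# in the data itself; new kinds of relationships simply show up as new
# predicates, without any change to this code. The URL is a placeholder.
graph = Graph()
graph.parse("http://example.org/project/edition.rdf")

for subject in set(graph.subjects()):
    print(subject)
    for predicate, obj in graph.predicate_objects(subject):
        print("  ", predicate, "->", obj)
```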

The data is now freed from the interface, the interfaces are (meta)data driven, and our future researcher has an instant view of the richness of an old resource.

One challenge remains, in my opinion: how do we make sure that this new infrastructure is really sustainable? I admit that I have only a clouded vision of this: let there be a new kind of cloud, with workspaces in which users collect the works that interest them. Let the cloud be smart, in that it implements the new infrastructure. Let the cloud be efficient, in that it optimises the copies of works for storage economy, but also for access economy. And let the cloud be honest and creative in dividing the costs over all workspaces, charging users for storage, but also paying users for their contribution to the preservation of the works they copy.

This is truly revolutionary: to be paid for storing a copy of a preservation-worthy work.

Let the data speak

It sounds like a reduction in complexity when someone says: let the facts speak. We are invited to move away from the complex world of opinions, interpretations, conjectures, and comments on conjectures to the plain and simple world of facts, outlined by numbers and yes/no statements.

There are many objections to this artificial division, as anybody who has tried to make sense of a new set of primary data knows. There is the statistical analysis, the modelling, the choice of representation, and the addition of metadata needed to explain the significance and value of the data. All these issues point from the realm of plain facts to the realm of complex human understanding.

So when the data speak, they speak a language, and we have to deal with that language. Sometimes quite literally: when the data in question are language data, recording how we humans speak. This gives rise to a whole new set of challenges: listening to the data. What do you do if you have sound recordings of a thousand different languages?

Yet, there is another layer to the picture, and that is where the data are meaningful representations of human culture. Texts are the prime example, texts that discuss the events of history, the works of art, the states of mind, the structure of knowledge. In order to research texts, three kinds of language must be dealt with: (i) the language of data, i.e. the problem of appreciating a text among the family of all its variations, with a critical reflection on the origins of that family; (ii) the language of language, i.e. the world of spelling variation, part-of-speech analysis, syntactical analysis, up to the first stages of semantics; (iii) the language of meaning, i.e. the world of concepts and relationships as studied by the many disciplines of the humanities and social sciences.

To me, the excitement is often in the middle: things always happen at boundaries. Dealing with the language of language amounts to building bridges between the high-level world of human discourse and the low-level world of data and computing. The CLARIN project explicitly takes this position. With the preparatory stage almost finished, it has already generated a lot of activity. And I am excited that here, in the Netherlands, the funding is in place to carry on.

Here at DANS, we are involved in quite a number of projects with a CLARIN connection. We are not linguists ourselves; our natural language is the language of data. But we are aware that in order to give the data an audience, we have to facilitate the language of language. So we will work on the machinery required for language resources, taking care of the dreary details. And already we see some projects taking the next step: speaking the language of meaning after listening to the language of language.