Talking data

Let the data speak

It seems like a reduction in complexity when someone says: let the facts speak. We are invited to move away from the complex world of opinions, interpretations, conjectures, and comments on conjectures to the plain and simple world of facts, outlined by numbers and yes/no statements.

There are many objections to this artificial division, as anybody knows who has tried to make sense of a new set of primary data. There is the statistical analysis, the modelling, the choice of representation, and the addition of metadata needed to explain the significance and value of the data. All these issues point from the realm of plain facts to the realm of complex human understanding.

So when the data speak, they speak a language, and we have to deal with that language. Sometimes quite literally: when the data in question are language data, recording how we humans speak. This gives rise to a whole new set of challenges, all of which come down to listening to the data. What do you do if you have sound recordings of a thousand different languages?

Yet, there is another layer to the picture, and that is where the data are meaningful representations of human culture. Texts are the prime example, texts that discuss the events of history, the works of art, the states of mind, the structure of knowledge. In order to research texts, three kinds of language must be dealt with: (i) the language of data, i.e. the problem of appreciating a text among the family of all its variations, with a critical reflection on the origins of that family; (ii) the language of language, i.e. the world of spelling variation, part-of-speech analysis, syntactical analysis, up to the first stages of semantics; (iii) the language of meaning, i.e. the world of concepts and relationships as studied by the many disciplines of the humanities and social sciences.

To me, the excitement is often in the middle. Things always happen at boundaries. Dealing with the language of language amounts to building bridges between the high-level world of human discourse and the low-level world of data and computing. The CLARIN project explicitly takes this position. With its preparatory stage almost finished, it has already generated a lot of activity. And I am excited that here, in the Netherlands, the funding is in place to carry on.

Here at DANS, we are involved in quite a number of projects with a CLARIN connection. We are not linguists ourselves; our natural language is the language of data. But we are aware that in order to give the data an audience, we have to facilitate the language of language. So we will work on the machinery required for language resources, taking care of the dreary details. And already we see some projects taking the next step: speaking the language of meaning after listening to the language of language.