Talking data

Let the data speak

It seems like a reduction in complexity when someone says: let the facts speak. We are invited to move away from the complex world of opinions, interpretations, conjectures, and comments on conjectures to the plain and simple world of facts, outlined by numbers and yes/no statements.

There are many objections to this artificial division, as anybody knows who has tried to make sense of a new set of primary data. There is the statistical analysis, the modelling, the choice of representation, and the addition of metadata to explain the significance and value of the data. All these issues point from the realm of plain facts to the realm of complex human understanding.

So when the data speak, they speak a language, and we have to deal with that language. Sometimes quite literally: when the data in question are language data, recording how we humans speak. This gives rise to a whole new set of challenges: listening to the data. What do you do if you have sound recordings of a thousand different languages?

Yet, there is another layer to the picture, and that is where the data are meaningful representations of human culture. Texts are the prime example, texts that discuss the events of history, the works of art, the states of mind, the structure of knowledge. In order to research texts, three kinds of language must be dealt with: (i) the language of data, i.e. the problem of appreciating a text among the family of all its variations, with a critical reflection on the origins of that family; (ii) the language of language, i.e. the world of spelling variation, part-of-speech analysis, syntactical analysis, up to the first stages of semantics; (iii) the language of meaning, i.e. the world of concepts and relationships as studied by the many disciplines of the humanities and social sciences.

To me, the excitement is often in the middle. Things always happen at boundaries. Dealing with the language of language amounts to building bridges between the high-level world of human discourse and the low-level world of data and computing. The CLARIN project explicitly takes this position. With the preparatory stage almost finished, it has already generated a lot of activity. And I am excited that here, in the Netherlands, the funding is in place to carry on.

Here at DANS, we are involved in quite a number of projects with a CLARIN connection. We are not linguists ourselves; our natural language is the language of data. But we are aware that in order to give the data an audience, we have to facilitate the language of language. So we will work on the machinery required for language resources, taking care of the dreary details. And already we see some projects taking the next step: speaking the language of meaning after listening to the language of language.

Comments:

Remco van Veenendaal said...

Thanks DANS and Dirk for starting this blog! It may prove to be a valuable resource for discussion about (language and text) data.

Making sure facts speak and (continue to) have a listening audience is one of the main tasks of the Institute for Dutch Lexicology (INL; www.inl.nl) and in particular the Flemish-Dutch Human Language Technology Agency (TST-Centrale; www.inl.nl/tst-centrale).

The mission of the TST-Centrale is to manage, maintain and make available Dutch digital Language Resources for research, education and commercial purposes. The Dutch language has to keep playing a key role in the information society.

We are also actively involved in the CLARIN project (www.clarin.eu) and are close colleagues of, for example, DANS, the Meertens Institute and the Max Planck Institute in Nijmegen – all “CLARIN Centers” in the (Dutch branch of the) CLARIN infrastructure.

We – and in particular our (computational) linguists – are mostly involved in Dirk’s “language of language”, but our work also touches the “language of data” and the “language of meaning”. It would be great to have more comments from these perspectives and to look at language and text from all sides.

The TST-Centrale is an initiative of, and is funded by, the Dutch Language Union (Nederlandse Taalunie; www.taalunieversum.org/taalunie), and is housed at the INL.

