Snaut – accessible distributional semantics

A big obstacle to the widespread use of distributional semantics in psycholinguistics has been the gap between the producers and potential consumers of semantic spaces.

Although several packages for training various kinds of semantic spaces have been published, many psycholinguists lack the large corpora and computational infrastructure they require, as well as the technical know-how needed to train and evaluate semantic spaces.

To encourage the exchange and use of semantic spaces trained by different research groups, we therefore release a simple interface for measuring relatedness between words on the basis of a semantic space.
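Relatedness in such spaces is typically quantified as the cosine similarity between word vectors. A minimal sketch of the idea (with toy three-dimensional vectors standing in for a real, high-dimensional semantic space; not snaut's actual code):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors; a real semantic space assigns each word a
# high-dimensional vector learned from corpus co-occurrences.
space = {
    "cat": [0.9, 0.1, 0.3],
    "dog": [0.8, 0.2, 0.4],
    "car": [0.1, 0.9, 0.2],
}

print(cosine_similarity(space["cat"], space["dog"]))  # high: related words
print(cosine_similarity(space["cat"], space["car"]))  # lower: unrelated words
```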

The source code for snaut and a related Python module are available on GitHub. You can also find more information on this website.


This tool implements an efficient algorithm for detecting pairs of near-duplicate documents. It is for you if you have a large corpus and suspect that it contains documents which are not identical, but are similar enough that one of them should probably be excluded from your workflow.
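The core idea behind near-duplicate detection can be illustrated with shingling and Jaccard similarity; a scalable implementation like this tool's would add techniques such as MinHash/LSH to avoid comparing every pair. A simplified sketch in Python (illustrative only, not the tool's actual algorithm):

```python
def shingles(text, n=3):
    """Set of character n-grams (shingles) of a document."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b)

def near_duplicates(docs, threshold=0.8, n=3):
    """Return index pairs of documents whose shingle overlap
    meets the threshold.  Quadratic in the number of documents;
    MinHash/LSH would avoid the all-pairs comparison."""
    sets = [shingles(d, n) for d in docs]
    return [(i, j)
            for i in range(len(docs))
            for j in range(i + 1, len(docs))
            if jaccard(sets[i], sets[j]) >= threshold]

docs = ["the cat sat on the mat",
        "the cat sat on the mat!",
        "an entirely unrelated sentence"]
print(near_duplicates(docs))  # only the first two are near-duplicates
```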

The tool is implemented in Scala. You can view the source code on GitHub or read a tutorial.

Vocabulary tests

I have the privilege of working with Marc Brysbaert and Emmanuel Keuleers on massive online vocabulary tests, in which huge numbers of participants decide whether a string they see on the screen is a valid word or not. These projects are a lot of fun!

The participants get an estimate of their vocabulary size, and we get measurements of which words are known, by whom, and where. We also measure reaction times, allowing us to build a huge database of psycholinguistic data.
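Because the test mixes real words with made-up nonwords, a raw "yes" rate would reward guessing. One standard correction (a simplified sketch, not necessarily the exact scoring used in our tests) subtracts the false-alarm rate on nonwords from the hit rate on real words:

```python
def vocabulary_estimate(hits, n_words, false_alarms, n_nonwords):
    """Guessing-corrected proportion of words known:
    hit rate on real words minus false-alarm rate on nonwords."""
    hit_rate = hits / n_words
    fa_rate = false_alarms / n_nonwords
    return max(0.0, hit_rate - fa_rate)

# A participant says "yes" to 55 of 70 words and to 3 of 30 nonwords.
print(vocabulary_estimate(55, 70, 3, 30))
```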

So far, I have written the code for and maintained the tests for English and Dutch. In collaboration with the BCBL, we have also run similar tests for Spanish and Basque.

For some fun, look at a list of words known by men and women.

We have already tested more than 1.5 million participants!

Semantic spaces and extrapolation of psycholinguistic variables

Can we predict whether most people find ‘jump’ concrete or abstract? Is there a method to find out at what age the word ‘distance’ is acquired?

Subjective ratings for words are ubiquitous in psycholinguistic research, but collecting them is time consuming and expensive.

At the same time, based on large text corpora, we can now estimate how similar meanings of two words are. I investigate whether machine learning techniques together with such semantic similarity measures derived from text corpora can improve our estimates of variables such as age of acquisition, concreteness, valence, arousal or dominance.
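One simple instance of this idea (a sketch of the general approach, not our exact method) predicts the rating of an unrated word as the mean rating of its nearest neighbours in a semantic space:

```python
import math

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def knn_rating(target_vec, rated, k=2):
    """Predict a rating (e.g. concreteness) for an unrated word as the
    mean rating of its k most similar rated words.
    `rated` maps word -> (vector, rating); toy 2-d vectors below."""
    neighbours = sorted(rated.values(),
                        key=lambda vr: cosine(target_vec, vr[0]),
                        reverse=True)[:k]
    return sum(rating for _, rating in neighbours) / k

rated = {
    "stone": ([1.0, 0.0], 5.0),   # concrete
    "rock":  ([0.9, 0.1], 4.8),   # concrete
    "idea":  ([0.0, 1.0], 1.5),   # abstract
}
# An unrated word whose vector sits near "stone" and "rock":
print(knn_rating([0.95, 0.05], rated, k=2))
```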

Here is our paper in which we investigate this topic.


Multiple publications have now shown that text corpora based on movie subtitles provide particularly good word frequency estimates.

We have created such a resource for Polish as well. You can explore the frequencies through a web interface, or download them here.
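Raw counts are usually converted to a corpus-size-independent scale before use; one common choice in the subtitle-frequency literature is the Zipf scale, the base-10 logarithm of frequency per billion tokens. A minimal sketch (illustrative; not necessarily the exact transformation used in our resource):

```python
import math

def zipf(count, corpus_tokens):
    """Zipf value: log10 of frequency per billion tokens.
    Zipf 3 corresponds to one occurrence per million tokens;
    very common words typically score around 6-7."""
    return math.log10(count / corpus_tokens * 1e9)

# A word occurring 1,000 times in a 10-million-token subtitle corpus:
print(zipf(1000, 10_000_000))
```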

Some information in Polish can be found in this blog post.