Snaut – accessible distributional semantics
A big obstacle to the widespread use of distributional semantics in psycholinguistics has been the gap between the producers and potential consumers of semantic spaces.
Although several packages have been published that allow users to train various kinds of semantic spaces, the large corpora, the computational infrastructure, and the technical know-how required to train and evaluate semantic spaces are not available to many psycholinguists.
To encourage the exchange and use of semantic spaces trained by various research groups, we release a simple interface that can be used to measure relatedness between words on the basis of a semantic space.
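Relatedness in a semantic space is typically computed from distances between word vectors. As a minimal sketch (not Snaut's actual implementation, and with toy vectors invented purely for illustration), cosine similarity between two vectors can be computed like this:

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two word vectors:
    # close to 1.0 for related words, close to 0.0 for unrelated ones.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors standing in for rows of a real semantic space,
# which would typically have hundreds of dimensions.
vectors = {
    "cat": np.array([0.9, 0.1, 0.2]),
    "dog": np.array([0.8, 0.2, 0.3]),
    "car": np.array([0.1, 0.9, 0.7]),
}

print(cosine_similarity(vectors["cat"], vectors["dog"]))  # high
print(cosine_similarity(vectors["cat"], vectors["car"]))  # much lower
```

In a real space the vectors would be loaded from a trained model rather than written by hand, but the relatedness measure itself stays this simple.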
This tool implements an efficient algorithm for detecting pairs of near-duplicate documents. It is for you if you have a large corpus and suspect that it contains documents which are not identical but similar enough that one of them should probably be excluded from your workflow.
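One common way to detect near-duplicates (a sketch of the general technique, not necessarily the algorithm this tool uses) is to represent each document as a set of character n-grams and compare those sets with Jaccard similarity; pairs scoring above a threshold are candidate duplicates:

```python
def shingles(text, n=3):
    # Set of character n-grams ("shingles") of the lowercased text.
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    # Overlap between two shingle sets relative to their union:
    # 1.0 for identical sets, 0.0 for disjoint ones.
    return len(a & b) / len(a | b)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over a lazy dog"
doc3 = "completely unrelated text about semantic spaces"

print(jaccard(shingles(doc1), shingles(doc2)))  # high: near-duplicates
print(jaccard(shingles(doc1), shingles(doc3)))  # low: unrelated
```

Comparing all document pairs this way is quadratic in corpus size, which is why practical tools for large corpora rely on approximations such as MinHash with locality-sensitive hashing to avoid examining every pair.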
I have the privilege of working with Marc Brysbaert and Emmanuel Keuleers on massive online vocabulary tests, in which huge numbers of participants decide whether the string they see on the screen is a valid word or not. These projects are really a lot of fun!
The participants get an estimate of their vocabulary size, and we get measurements of which words are known by whom, and where. We also measure reaction times, so that we can create a huge database of psycholinguistic data.
For some fun, look at a list of words known by men and women.
We have already tested more than 1.5 million participants!
Semantic spaces and extrapolation of psycholinguistic variables
Can we predict whether most people find ‘jump’ concrete or abstract? Is there a method to find out at what age the word ‘distance’ is acquired?
Subjective ratings for words are ubiquitous in psycholinguistic research, but collecting them is time consuming and expensive.
At the same time, based on large text corpora, we can now estimate how similar the meanings of two words are. I investigate whether machine learning techniques, combined with such corpus-derived semantic similarity measures, can improve our estimates of variables such as age of acquisition, concreteness, valence, arousal, or dominance.
Here is our paper in which we investigate this topic.
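The basic idea can be illustrated with a simple nearest-neighbours sketch (the vectors and ratings below are made up for illustration, and this is only one of several possible machine learning approaches): predict the rating of an unrated word from the ratings of the words closest to it in the semantic space.

```python
import numpy as np

def cos(u, v):
    # Cosine similarity between two word vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def knn_extrapolate(target_vec, rated, k=2):
    # Predict a rating for an unrated word as the mean rating of its
    # k nearest rated neighbours in the semantic space.
    nearest = sorted(rated, key=lambda w: cos(target_vec, rated[w][0]),
                     reverse=True)
    return sum(rated[w][1] for w in nearest[:k]) / k

# Toy 2-dimensional vectors with invented concreteness ratings
# (1 = very abstract, 5 = very concrete).
rated = {
    "table":  (np.array([0.9, 0.1]), 4.9),
    "chair":  (np.array([0.8, 0.2]), 4.8),
    "idea":   (np.array([0.1, 0.9]), 1.5),
    "theory": (np.array([0.2, 0.8]), 1.4),
}

# A hypothetical unrated word whose vector lies near the concrete words:
# its predicted rating is the mean of the "table" and "chair" ratings.
print(knn_extrapolate(np.array([0.85, 0.15]), rated))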
There are now multiple publications showing that text corpora based on movie subtitles provide particularly good word frequency estimates.
Some information in Polish can be found in this blog post.