<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Paweł Mandera</title>
    <description>Paweł's homepage</description>
    <link>https://www.pawelmandera.com</link>
    <atom:link href="https://www.pawelmandera.com/feed.xml" rel="self" type="application/rss+xml" />
    
      
      
      <item>
        <title>Semantic spaces relocated</title>
        <description>&lt;p&gt;Semantic spaces that were originally hosted by Ghent University have been moved here.&lt;/p&gt;

&lt;p&gt;Here are the web interfaces for the semantic spaces published by Mandera,
Keuleers, &amp;amp; Brysbaert (2017):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/snaut-en/&quot;&gt;CBOW space for English lemmas, based on UKWAC and subtitles&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/snaut-dutch/&quot;&gt;CBOW space for Dutch lemmas, based on SONAR-500 and subtitles&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here are the semantic spaces from that paper, available for download:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/download/snaut-downloads/english-lemmas-cbow-window.6-dimensions.300-ukwac_subtitle_en.w2v.gz&quot;&gt;English, lemmas - CBOW model trained on a concatenation of UKWAC and subtitle corpus, 300 dimensions, window size 6&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/download/snaut-downloads/english-all.words-cbow-window.6-dimensions.300-ukwac_subtitle_en.w2v.gz&quot;&gt;English, all words - CBOW model trained on a concatenation of UKWAC and subtitle corpus, 300 dimensions, window size 6&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/download/snaut-downloads/dutch-lemmas-cbow-window.10-dimensions.200-sonar_subtitle_nl.w2v.gz&quot;&gt;Dutch, lemmas - CBOW model trained on a concatenation of SONAR-500 and subtitle corpus, 200 dimensions, window size 10&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/download/snaut-downloads/dutch-all.words-cbow-window.10-dimensions.200-sonar_subtitle_nl.w2v.gz&quot;&gt;Dutch, all words - CBOW model trained on a concatenation of SONAR-500 and subtitle corpus, 200 dimensions, window size 10&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
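&lt;p&gt;A quick note on using the downloads: judging by the &lt;code&gt;.w2v.gz&lt;/code&gt; extension, the files should be in the plain-text word2vec format (an assumption on my part): a header line with vocabulary size and dimensionality, then one word and its vector per line. A minimal Python sketch of reading such a file and comparing two vectors with cosine similarity (the demo file and its vectors are made up for illustration):&lt;/p&gt;

```python
import gzip
import os
import tempfile
import numpy as np

def load_w2v_text(path):
    """Load vectors from a gzipped word2vec text file:
    a 'vocab_size dim' header, then one word plus its floats per line."""
    vectors = {}
    with gzip.open(path, "rt", encoding="utf-8") as f:
        n_words, dim = (int(x) for x in f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

def cosine(u, v):
    """Cosine similarity, the usual relatedness measure in semantic spaces."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# tiny demo file in the same layout (illustrative vectors, not real ones)
path = os.path.join(tempfile.mkdtemp(), "demo.w2v.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("2 3\ncat 1.0 0.0 0.0\ndog 0.9 0.1 0.0\n")

vecs = load_w2v_text(path)
sim = cosine(vecs["cat"], vecs["dog"])
```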

&lt;p&gt;If you are looking for the Italian semantic spaces published by Marelli (2017),
please contact me via email and I’ll try to help.&lt;/p&gt;
</description>
        <pubDate>Mon, 18 Dec 2023 00:00:00 +0100</pubDate>
        <link>https://www.pawelmandera.com/2023/12/18/snaut-move/</link>
        <guid isPermaLink="true">https://www.pawelmandera.com/2023/12/18/snaut-move/</guid>
      </item>
       
       
     
      
      
      <item>
        <title>Introducing a Polish Semantic Priming Dataset for Researchers</title>
        <description>&lt;p&gt;We are pleased to &lt;a href=&quot;https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0284801&quot;&gt;present our recent
study&lt;/a&gt;
that introduces a Polish semantic priming dataset and semantic similarity
ratings based on native Polish speakers. This resource provides a useful tool
for researchers interested in the Polish language and linguistics.&lt;/p&gt;

&lt;p&gt;Our study involved two experiments. The first experiment aimed to
create and validate the dataset, which includes strongly related, weakly
related, and semantically unrelated word pairs. The results confirmed that the
three conditions could be distinguished by their semantic relatedness.&lt;/p&gt;

&lt;p&gt;In the second experiment, we used a subset of stimuli to
investigate lexical decision performance in relation to the priming effect. We
observed a semantic priming effect for strongly related word pairs, while a
smaller yet still significant effect was found for weakly related pairs when
compared to unrelated pairs.&lt;/p&gt;

&lt;p&gt;The dataset incorporates findings from both experiments as well as SimLex-999 for
Polish, and allows semantic models to be selected from existing and newly
trained semantic spaces. By making this database of &lt;a href=&quot;/pl-vectors-codes/&quot;&gt;semantic
vectors&lt;/a&gt;, semantic
relatedness ratings, and collected behavioral data available, we aim to support
researchers in their exploration of the Polish language and linguistics.&lt;/p&gt;

&lt;p&gt;We believe that this dataset could prove helpful by allowing researchers to
benchmark new vectors and investigate the Polish language in more detail. It is
worth noting that this is the first freely available database for Polish that
combines measures of semantic distance and human data, potentially contributing
to the ongoing study of the language and related fields.&lt;/p&gt;
</description>
        <pubDate>Wed, 14 Jun 2023 00:00:00 +0200</pubDate>
        <link>https://www.pawelmandera.com/2023/06/14/pl-vectors/</link>
        <guid isPermaLink="true">https://www.pawelmandera.com/2023/06/14/pl-vectors/</guid>
      </item>
       
       
     
      
      
      <item>
        <title>A sandpile model implementation and visualizations</title>
<description>&lt;p&gt;I had some time to spare, so I implemented the Abelian sandpile
model, a classic example of a system displaying self-organized criticality.&lt;/p&gt;

&lt;p&gt;You can find the code &lt;a href=&quot;https://github.com/pmandera/sandpile-model&quot;&gt;here&lt;/a&gt;. It’s
flexible but also reasonably fast, as the main loop is implemented in Cython.&lt;/p&gt;
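&lt;p&gt;For readers new to the model, the rule is simple: grains are dropped onto a grid, and any cell holding four or more grains topples, passing one grain to each of its four neighbours; grains that fall off the edge are lost. A minimal NumPy sketch of that rule (my own illustration, not the Cython code from the repository above):&lt;/p&gt;

```python
import numpy as np

def topple(grid):
    """Relax the grid: every cell holding 4 or more grains topples,
    sending one grain to each of its 4 neighbours.
    Grains pushed past the edge are lost (open boundary)."""
    while True:
        unstable = grid >= 4
        if not unstable.any():
            break
        grid[unstable] -= 4
        # each toppled cell contributes one grain to each neighbour
        grid[1:, :] += unstable[:-1, :]
        grid[:-1, :] += unstable[1:, :]
        grid[:, 1:] += unstable[:, :-1]
        grid[:, :-1] += unstable[:, 1:]
    return grid

def drop_grain(grid, row, col):
    """Add one grain and relax the pile back to a stable state."""
    grid[row, col] += 1
    return topple(grid)

# drop 1000 grains on the centre of an 11x11 grid
grid = np.zeros((11, 11), dtype=int)
for _ in range(1000):
    grid = drop_grain(grid, 5, 5)
```

After every drop the pile relaxes to a stable configuration in which no cell holds more than three grains; avalanche sizes during that relaxation are what follow the power-law statistics characteristic of self-organized criticality.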

&lt;div class=&quot;video-container&quot;&gt;
&lt;iframe src=&quot;https://www.youtube.com/embed/5T26OEQLf5s&quot; frameborder=&quot;0&quot; gesture=&quot;media&quot; allow=&quot;encrypted-media&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;Here’s a video of dropping about 10 million grains on the sandpile.&lt;/p&gt;

&lt;p&gt;I find the ideas around self-organized criticality really amazing.
Self-organized critical systems have a critical point as an attractor, and many
interesting properties originate from that. If you are interested, I highly
recommend reading &lt;a href=&quot;https://www.goodreads.com/book/show/869836.How_Nature_Works&quot;&gt;How Nature Works: The Science of Self-Organized
Criticality&lt;/a&gt; by Per
Bak.&lt;/p&gt;
</description>
        <pubDate>Sun, 24 Dec 2017 00:00:00 +0100</pubDate>
        <link>https://www.pawelmandera.com/2017/12/24/sandpile/</link>
        <guid isPermaLink="true">https://www.pawelmandera.com/2017/12/24/sandpile/</guid>
      </item>
       
       
     
      
      
      <item>
        <title>How to find near-duplicate files with Duometer</title>
        <description>&lt;p&gt;This tutorial shows how to detect near-duplicates using duometer. I assume here
that duometer is installed and can be called using the &lt;code&gt;duometer&lt;/code&gt; command. See
&lt;a href=&quot;https://github.com/pmandera/duometer&quot;&gt;this page&lt;/a&gt; for information
about how to install duometer.&lt;/p&gt;

&lt;p&gt;If you want to follow the examples in this tutorial, you can download
&lt;a href=&quot;/duometer/downloads/duometer-tutorial.zip&quot;&gt;this archive&lt;/a&gt;. In &lt;code&gt;texts/&lt;/code&gt;, it contains
a tiny corpus of excerpts from &lt;a href=&quot;http://www.gutenberg.org/cache/epub/11/pg11.txt&quot;&gt;Alice in
Wonderland&lt;/a&gt; by Lewis Carroll.
Importantly, some of the excerpts are not entirely unique. One of them is
repeated in exactly the same form, the copy of the second contains only part of
the original text, and the copy of the third has some phrases changed.
Duometer will help us identify these duplicate excerpts!&lt;/p&gt;

&lt;h3 id=&quot;quickstart&quot;&gt;Quickstart&lt;/h3&gt;

&lt;p&gt;If you just want to identify duplicates in a folder without customizing any
settings, you can run:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;duometer -i texts/ -o texts-duplicates.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;After running this command you can examine &lt;code&gt;texts-duplicates.txt&lt;/code&gt;. It should
contain three lines with tab-separated values similar to these:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;texts/after-duplicate.txt       texts/after.txt 1.0
texts/hedgehog.txt      texts/hedgehog-part.txt 0.5238095238095238
texts/curioser-nearduplicate.txt        texts/curioser.txt      0.7380952380952381
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first two columns list a pair of potentially duplicate files, and the third
column shows a measure of similarity between the two files: the higher the
value, the more similar the files (0.0 is the minimum and 1.0 the maximum
similarity).&lt;/p&gt;
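&lt;p&gt;As a rough mental model of where such scores come from, one can treat each document as a set of overlapping word n-grams (shingles) and compute the Jaccard similarity of the two sets; duometer estimates this kind of similarity efficiently with a MinHash-based method. The sketch below illustrates the underlying measure, not duometer’s actual implementation:&lt;/p&gt;

```python
def shingles(text, n=3):
    """The set of overlapping n-word sequences in a text."""
    words = text.split()
    return set(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: size of the intersection
    divided by the size of the union, in [0.0, 1.0]."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa.intersection(sb)) / len(sa.union(sb))

print(jaccard("the quick brown fox jumps", "the quick brown fox leaps"))  # prints 0.5
```

Identical texts score 1.0, texts sharing no shingles score 0.0, and a near-duplicate with a few substituted words lands somewhere in between, just like the scores in the output above.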

&lt;p&gt;If you examine the pairs, you will see that &lt;code&gt;after.txt&lt;/code&gt; and
&lt;code&gt;after-duplicate.txt&lt;/code&gt; have the exact same content, &lt;code&gt;hedgehog-part.txt&lt;/code&gt; contains
most of the text from &lt;code&gt;hedgehog.txt&lt;/code&gt;, and &lt;code&gt;curioser-nearduplicate.txt&lt;/code&gt; contains
almost the same text as &lt;code&gt;curioser.txt&lt;/code&gt; with a few differences: the word
&lt;em&gt;English&lt;/em&gt; was substituted with &lt;em&gt;Persian&lt;/em&gt;, &lt;em&gt;Christmas&lt;/em&gt; with &lt;em&gt;Easter&lt;/em&gt;, and the
order of &lt;em&gt;stockings and shoes&lt;/em&gt; was changed.&lt;/p&gt;

&lt;p&gt;By default, duometer lists only pairs of documents with a similarity value of
at least 0.2, but it is up to you to decide on the threshold for considering
two files to be duplicates.&lt;/p&gt;

&lt;h3 id=&quot;alternative-ways-of-specifying-input&quot;&gt;Alternative ways of specifying input&lt;/h3&gt;

&lt;p&gt;Alternatively, you can look for duplicates across two different directories. For
illustration, you can create a second directory named &lt;code&gt;texts-2/&lt;/code&gt; and copy
&lt;code&gt;curioser-nearduplicate.txt&lt;/code&gt; to that directory, giving it a new name,
&lt;code&gt;curioser-nearduplicate-2.txt&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can now specify two input directories for duometer:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;duometer -i texts -i texts-2 -o texts-texts-2-duplicates.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you now look at &lt;code&gt;texts-texts-2-duplicates.txt&lt;/code&gt;, you will find two lines:
one listing &lt;code&gt;curioser-nearduplicate-2.txt&lt;/code&gt; as an exact duplicate of
&lt;code&gt;curioser-nearduplicate.txt&lt;/code&gt;, and the other listing it as a near-duplicate of
&lt;code&gt;curioser.txt&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The third way to specify input for duometer is a file listing the files to
consider when looking for duplicate pairs, with one path per line. For example,
you can create a file &lt;code&gt;list.txt&lt;/code&gt; like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;texts/after.txt
texts/hedgehog.txt
texts/hedgehog-part.txt
texts/curioser.txt
texts-2/curioser-nearduplicate-2.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and then run:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;duometer -i list.txt -o list-duplicates.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&quot;further-information&quot;&gt;Further information&lt;/h3&gt;

&lt;p&gt;This tutorial does not cover more advanced settings. Please run &lt;code&gt;duometer
--help&lt;/code&gt; for more options.&lt;/p&gt;
</description>
        <pubDate>Fri, 06 Nov 2015 00:00:00 +0100</pubDate>
        <link>https://www.pawelmandera.com/2015/11/06/en-duometer-tutorial/</link>
        <guid isPermaLink="true">https://www.pawelmandera.com/2015/11/06/en-duometer-tutorial/</guid>
      </item>
       
       
     
      
      
      <item>
        <title>SUBTLEX-PL - Polish word frequencies based on film subtitles</title>
        <description>&lt;p&gt;&lt;em&gt;In short: carefully validated frequencies for Polish words are available
&lt;a href=&quot;https://osf.io/5a76z/&quot;&gt;here&lt;/a&gt;.
You can find more information in this post, and the details in
&lt;a href=&quot;https://doi.org/10.3758/s13428-014-0489-4&quot;&gt;this&lt;/a&gt; article.&lt;/em&gt;&lt;/p&gt;

&lt;h1 id=&quot;frekwencje-słów&quot;&gt;Word frequencies&lt;/h1&gt;

&lt;p&gt;How often are individual words used? This question is interesting in its own
right, but from the perspective of psycholinguistics the answer is absolutely
crucial, because frequency very strongly determines how, and how quickly,
participants respond in most experiments.&lt;/p&gt;

&lt;p&gt;The Center for Reading Research has long been preparing statistics of this
kind for individual languages, carefully checking the quality of frequencies
computed from various text corpora. A regularly recurring result is that
frequencies based on a corpus of film subtitles reflect participants’ behavior
in experiments much better than those computed from other, even much larger,
text corpora.&lt;/p&gt;

&lt;h1 id=&quot;subtlex-pl&quot;&gt;SUBTLEX-PL&lt;/h1&gt;

&lt;p&gt;As part of my PhD work at Ghent University, I prepared a frequency list for
Polish words. To this end, we downloaded about 100,000 subtitle files and,
after careful cleaning, built a corpus of about 146 million words. The whole
corpus was tagged using
&lt;a href=&quot;http://nlp.pwr.wroc.pl/takipi/&quot;&gt;TaKIPI&lt;/a&gt;, which made it possible to
determine the part of speech and the base form of every word in the corpus.&lt;/p&gt;

&lt;p&gt;We then counted how many times each word occurs in the corpus, in what
proportion of films it appears, how often it occurs as a particular part of
speech, how often a given base form can be assigned to it, and several other
statistics. We publish all of these data, so SUBTLEX-PL contains much more
information than a typical frequency list.&lt;/p&gt;
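
&lt;p&gt;The counts described above can be sketched in a few lines of Python (a toy illustration with made-up data, not the actual SUBTLEX-PL pipeline):&lt;/p&gt;

```python
from collections import Counter

# toy corpus: each inner list is the word sequence of one film's subtitles
films = [
    ["to", "jest", "kot"],
    ["kot", "i", "pies"],
    ["to", "jest", "dom"],
]

# raw frequency: total number of occurrences across the corpus
freq = Counter(word for film in films for word in film)

# contextual diversity: proportion of films in which a word appears
doc_counts = Counter(word for film in films for word in set(film))
cd = {word: n / len(films) for word, n in doc_counts.items()}
```

Contextual diversity (the share of films a word appears in) is counted over the set of words per film, so repeating a word within one film raises its raw frequency but not its diversity.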

&lt;p&gt;For details, see our
&lt;a href=&quot;https://doi.org/10.3758/s13428-014-0489-4&quot;&gt;article&lt;/a&gt; published in
&lt;em&gt;Behavior Research Methods&lt;/em&gt;.
The frequencies are available for download in several formats
&lt;a href=&quot;https://osf.io/5a76z/&quot;&gt;on this page&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Tue, 03 Mar 2015 00:00:00 +0100</pubDate>
        <link>https://www.pawelmandera.com/2015/03/03/pl-subtlex-pl/</link>
        <guid isPermaLink="true">https://www.pawelmandera.com/2015/03/03/pl-subtlex-pl/</guid>
      </item>
       
       
     
  </channel>
</rss>
