<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Paweł Mandera</title>
    <description>Paweł's homepage</description>
    <link>https://www.pawelmandera.com</link>
    <atom:link href="https://www.pawelmandera.com/feed.xml" rel="self" type="application/rss+xml" />
    
      
      
      <item>
        <title>Semantic spaces relocated</title>
        <description>&lt;p&gt;Semantic spaces that were originally hosted by Ghent University have been moved here.&lt;/p&gt;

&lt;p&gt;Here are the web interfaces for the semantic spaces published by Mandera,
Keuleers, &amp;amp; Brysbaert (2017):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/snaut-en/&quot;&gt;CBOW space for English lemmas, based on UKWAC and subtitles&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/snaut-dutch/&quot;&gt;CBOW space for Dutch lemmas, based on SONAR-500 and subtitles&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here are the semantic spaces from that paper, available for download:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/download/snaut-downloads/english-lemmas-cbow-window.6-dimensions.300-ukwac_subtitle_en.w2v.gz&quot;&gt;English, lemmas - CBOW model trained on a concatenation of UKWAC and subtitle corpus, 300 dimensions, window size 6&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/download/snaut-downloads/english-all.words-cbow-window.6-dimensions.300-ukwac_subtitle_en.w2v.gz&quot;&gt;English, all words - CBOW model trained on a concatenation of UKWAC and subtitle corpus, 300 dimensions, window size 6&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/download/snaut-downloads/dutch-lemmas-cbow-window.10-dimensions.200-sonar_subtitle_nl.w2v.gz&quot;&gt;Dutch, lemmas - CBOW model trained on a concatenation of SONAR-500 and subtitle corpus, 200 dimensions, window size 10&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/download/snaut-downloads/dutch-all.words-cbow-window.10-dimensions.200-sonar_subtitle_nl.w2v.gz&quot;&gt;Dutch, all words - CBOW model trained on a concatenation of SONAR-500 and subtitle corpus, 200 dimensions, window size 10&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
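&lt;p&gt;A quick note on using the downloads: judging by the &lt;code&gt;.w2v.gz&lt;/code&gt; extension, the files should be in the plain-text word2vec format (an assumption on my part): a header line with vocabulary size and dimensionality, then one word and its vector per line. A minimal Python sketch of reading such a file and comparing two vectors with cosine similarity (the demo file and its vectors are made up for illustration):&lt;/p&gt;

```python
import gzip
import os
import tempfile
import numpy as np

def load_w2v_text(path):
    """Load vectors from a gzipped word2vec text file:
    a 'vocab_size dim' header, then one word plus its floats per line."""
    vectors = {}
    with gzip.open(path, "rt", encoding="utf-8") as f:
        n_words, dim = (int(x) for x in f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

def cosine(u, v):
    """Cosine similarity, the usual relatedness measure in semantic spaces."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# tiny demo file in the same layout (illustrative vectors, not real ones)
path = os.path.join(tempfile.mkdtemp(), "demo.w2v.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("2 3\ncat 1.0 0.0 0.0\ndog 0.9 0.1 0.0\n")

vecs = load_w2v_text(path)
sim = cosine(vecs["cat"], vecs["dog"])
```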

&lt;p&gt;If you are looking for the Italian semantic spaces published by Marelli (2017),
please contact me via email and I’ll try to help.&lt;/p&gt;
</description>
        <pubDate>Mon, 18 Dec 2023 00:00:00 +0100</pubDate>
        <link>https://www.pawelmandera.com/2023/12/18/snaut-move/</link>
        <guid isPermaLink="true">https://www.pawelmandera.com/2023/12/18/snaut-move/</guid>
      </item>
       
       
     
      
      
      <item>
        <title>Introducing a Polish Semantic Priming Dataset for Researchers</title>
        <description>&lt;p&gt;We are pleased to &lt;a href=&quot;https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0284801&quot;&gt;present our recent
study&lt;/a&gt;
that introduces a Polish semantic priming dataset and semantic similarity
ratings based on native Polish speakers. This resource provides a useful tool
for researchers interested in the Polish language and linguistics.&lt;/p&gt;

&lt;p&gt;Our study involved two experiments. The first experiment aimed to
create and validate the dataset, which includes strongly related, weakly
related, and semantically unrelated word pairs. The results confirmed that the
three conditions could be distinguished by their semantic relatedness.&lt;/p&gt;

&lt;p&gt;In the second experiment, we used a subset of stimuli to
investigate lexical decision performance in relation to the priming effect. We
observed a semantic priming effect for strongly related word pairs, while a
smaller yet still significant effect was found for weakly related pairs when
compared to unrelated pairs.&lt;/p&gt;

&lt;p&gt;The dataset incorporates findings from both experiments as well as SimLex-999 for
Polish, and allows semantic models to be selected from existing and newly
trained semantic spaces. By making this database of &lt;a href=&quot;/pl-vectors-codes/&quot;&gt;semantic
vectors&lt;/a&gt;, semantic
relatedness ratings, and collected behavioral data available, we aim to support
researchers in their exploration of the Polish language and linguistics.&lt;/p&gt;

&lt;p&gt;We believe that this dataset could prove helpful by allowing researchers to
benchmark new vectors and investigate the Polish language in more detail. It is
worth noting that this is the first freely available database for Polish that
combines measures of semantic distance and human data, potentially contributing
to the ongoing study of the language and related fields.&lt;/p&gt;
</description>
        <pubDate>Wed, 14 Jun 2023 00:00:00 +0200</pubDate>
        <link>https://www.pawelmandera.com/2023/06/14/pl-vectors/</link>
        <guid isPermaLink="true">https://www.pawelmandera.com/2023/06/14/pl-vectors/</guid>
      </item>
       
       
     
      
      
      <item>
        <title>A sandpile model implementation and visualizations</title>
<description>&lt;p&gt;I had some time to spare, so I implemented the Abelian sandpile
model, a classic example of a system displaying self-organized criticality.&lt;/p&gt;

&lt;p&gt;You can find the code &lt;a href=&quot;https://github.com/pmandera/sandpile-model&quot;&gt;here&lt;/a&gt;. It’s
flexible but also reasonably fast, as the main loop is implemented in Cython.&lt;/p&gt;
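&lt;p&gt;For readers new to the model, the rule is simple: grains are dropped onto a grid, and any cell holding four or more grains topples, passing one grain to each of its four neighbours; grains that fall off the edge are lost. A minimal NumPy sketch of that rule (my own illustration, not the Cython code from the repository above):&lt;/p&gt;

```python
import numpy as np

def topple(grid):
    """Relax the grid: every cell holding 4 or more grains topples,
    sending one grain to each of its 4 neighbours.
    Grains pushed past the edge are lost (open boundary)."""
    while True:
        unstable = grid >= 4
        if not unstable.any():
            break
        grid[unstable] -= 4
        # each toppled cell contributes one grain to each neighbour
        grid[1:, :] += unstable[:-1, :]
        grid[:-1, :] += unstable[1:, :]
        grid[:, 1:] += unstable[:, :-1]
        grid[:, :-1] += unstable[:, 1:]
    return grid

def drop_grain(grid, row, col):
    """Add one grain and relax the pile back to a stable state."""
    grid[row, col] += 1
    return topple(grid)

# drop 1000 grains on the centre of an 11x11 grid
grid = np.zeros((11, 11), dtype=int)
for _ in range(1000):
    grid = drop_grain(grid, 5, 5)
```

After every drop the pile relaxes to a stable configuration in which no cell holds more than three grains; avalanche sizes during that relaxation are what follow the power-law statistics characteristic of self-organized criticality.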

&lt;div class=&quot;video-container&quot;&gt;
&lt;iframe src=&quot;https://www.youtube.com/embed/5T26OEQLf5s&quot; frameborder=&quot;0&quot; gesture=&quot;media&quot; allow=&quot;encrypted-media&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;Here’s a video of dropping about 10 million grains on the sandpile.&lt;/p&gt;

&lt;p&gt;I find the ideas around self-organized criticality really amazing.
Self-organized critical systems have a critical point as an attractor, and many
interesting properties originate from that. If you are interested, I highly
recommend reading &lt;a href=&quot;https://www.goodreads.com/book/show/869836.How_Nature_Works&quot;&gt;How Nature Works: The Science of Self-Organized
Criticality&lt;/a&gt; by Per
Bak.&lt;/p&gt;
</description>
        <pubDate>Sun, 24 Dec 2017 00:00:00 +0100</pubDate>
        <link>https://www.pawelmandera.com/2017/12/24/sandpile/</link>
        <guid isPermaLink="true">https://www.pawelmandera.com/2017/12/24/sandpile/</guid>
      </item>
       
       
     
      
      
      <item>
        <title>How to find near-duplicate files with Duometer</title>
        <description>&lt;p&gt;This tutorial shows how to detect near-duplicates using duometer. I assume here
that duometer is installed and can be called using the &lt;code&gt;duometer&lt;/code&gt; command. See
&lt;a href=&quot;https://github.com/pmandera/duometer&quot;&gt;this page&lt;/a&gt; for information
about how to install duometer.&lt;/p&gt;

&lt;p&gt;If you want to follow the examples in this tutorial, you can download
&lt;a href=&quot;/duometer/downloads/duometer-tutorial.zip&quot;&gt;this archive&lt;/a&gt;. In &lt;code&gt;texts/&lt;/code&gt;, it contains
a tiny corpus of excerpts from &lt;a href=&quot;http://www.gutenberg.org/cache/epub/11/pg11.txt&quot;&gt;Alice in
Wonderland&lt;/a&gt; by Lewis Carroll.
Importantly, some of the excerpts are not entirely unique. One of them is
repeated in exactly the same form, the copy of the second contains only part of
the original text, and the copy of the third has some phrases changed.
Duometer will help us identify these duplicate excerpts!&lt;/p&gt;

&lt;h3 id=&quot;quickstart&quot;&gt;Quickstart&lt;/h3&gt;

&lt;p&gt;If you just want to identify duplicates in a folder without customizing any
settings, you can run:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;duometer -i texts/ -o texts-duplicates.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;After running this command you can examine &lt;code&gt;texts-duplicates.txt&lt;/code&gt;. It should
contain three lines with tab-separated values similar to these:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;texts/after-duplicate.txt       texts/after.txt 1.0
texts/hedgehog.txt      texts/hedgehog-part.txt 0.5238095238095238
texts/curioser-nearduplicate.txt        texts/curioser.txt      0.7380952380952381
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first two columns list a pair of potentially duplicate files, and the third
column shows a measure of similarity between the two files: the higher the
value, the more similar the files (0.0 is the minimum and 1.0 the maximum
similarity).&lt;/p&gt;
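&lt;p&gt;As a rough mental model of where such scores come from, one can treat each document as a set of overlapping word n-grams (shingles) and compute the Jaccard similarity of the two sets; duometer estimates this kind of similarity efficiently with a MinHash-based method. The sketch below illustrates the underlying measure, not duometer’s actual implementation:&lt;/p&gt;

```python
def shingles(text, n=3):
    """The set of overlapping n-word sequences in a text."""
    words = text.split()
    return set(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: size of the intersection
    divided by the size of the union, in [0.0, 1.0]."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa.intersection(sb)) / len(sa.union(sb))

print(jaccard("the quick brown fox jumps", "the quick brown fox leaps"))  # prints 0.5
```

Identical texts score 1.0, texts sharing no shingles score 0.0, and a near-duplicate with a few substituted words lands somewhere in between, just like the scores in the output above.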

&lt;p&gt;If you examine the pairs, you will see that &lt;code&gt;after.txt&lt;/code&gt; and
&lt;code&gt;after-duplicate.txt&lt;/code&gt; have the exact same content, &lt;code&gt;hedgehog-part.txt&lt;/code&gt; contains
most of the text from &lt;code&gt;hedgehog.txt&lt;/code&gt;, and &lt;code&gt;curioser-nearduplicate.txt&lt;/code&gt; contains
almost the same text as &lt;code&gt;curioser.txt&lt;/code&gt; with a few differences: the word
&lt;em&gt;English&lt;/em&gt; was substituted with &lt;em&gt;Persian&lt;/em&gt;, &lt;em&gt;Christmas&lt;/em&gt; with &lt;em&gt;Easter&lt;/em&gt;, and the
order of &lt;em&gt;stockings and shoes&lt;/em&gt; was changed.&lt;/p&gt;

&lt;p&gt;By default, duometer lists only pairs of documents with a similarity value of
at least 0.2, but it is up to you to decide on the threshold for considering
two files to be duplicates.&lt;/p&gt;

&lt;h3 id=&quot;alternative-ways-of-specifying-input&quot;&gt;Alternative ways of specifying input&lt;/h3&gt;

&lt;p&gt;Alternatively, you can look for duplicates across two different directories. For
illustration, you can create a second directory named &lt;code&gt;texts-2/&lt;/code&gt; and copy
&lt;code&gt;curioser-nearduplicate.txt&lt;/code&gt; to that directory, giving it a new name,
&lt;code&gt;curioser-nearduplicate-2.txt&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can now specify two input directories for duometer:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;duometer -i texts -i texts-2 -o texts-texts-2-duplicates.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you now look at &lt;code&gt;texts-texts-2-duplicates.txt&lt;/code&gt;, you will find two lines:
one listing &lt;code&gt;curioser-nearduplicate-2.txt&lt;/code&gt; as an exact duplicate of
&lt;code&gt;curioser-nearduplicate.txt&lt;/code&gt;, and the other listing it as a near-duplicate of
&lt;code&gt;curioser.txt&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The third way to specify input for duometer is a file listing the files to
consider when looking for duplicate pairs, with one path per line. For example,
you can create a file &lt;code&gt;list.txt&lt;/code&gt; like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;texts/after.txt
texts/hedgehog.txt
texts/hedgehog-part.txt
texts/curioser.txt
texts-2/curioser-nearduplicate-2.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;and then run:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;duometer -i list.txt -o list-duplicates.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&quot;further-information&quot;&gt;Further information&lt;/h3&gt;

&lt;p&gt;This tutorial does not cover more advanced settings. Please run &lt;code&gt;duometer
--help&lt;/code&gt; for more options.&lt;/p&gt;
</description>
        <pubDate>Fri, 06 Nov 2015 00:00:00 +0100</pubDate>
        <link>https://www.pawelmandera.com/2015/11/06/en-duometer-tutorial/</link>
        <guid isPermaLink="true">https://www.pawelmandera.com/2015/11/06/en-duometer-tutorial/</guid>
      </item>
       
       
     
      
      
      <item>
        <title>SUBTLEX-PL - Polish word frequencies based on film subtitles</title>
        <description>&lt;p&gt;&lt;em&gt;In short: carefully validated frequencies for Polish words are available
&lt;a href=&quot;https://osf.io/5a76z/&quot;&gt;here&lt;/a&gt;.
You can find more information in this post, and the details in
&lt;a href=&quot;https://doi.org/10.3758/s13428-014-0489-4&quot;&gt;this&lt;/a&gt; article.&lt;/em&gt;&lt;/p&gt;

&lt;h1 id=&quot;frekwencje-słów&quot;&gt;Word frequencies&lt;/h1&gt;

&lt;p&gt;How often are individual words used? This question is interesting in its own
right, but from the perspective of psycholinguistics the answer is absolutely
crucial, because frequency very strongly determines how, and how quickly,
participants respond in most experiments.&lt;/p&gt;

&lt;p&gt;The Center for Reading Research has long been preparing statistics of this
kind for individual languages, carefully checking the quality of frequencies
computed from various text corpora. A regularly recurring result is that
frequencies based on a corpus of film subtitles reflect participants’ behavior
in experiments much better than those computed from other, even much larger,
text corpora.&lt;/p&gt;

&lt;h1 id=&quot;subtlex-pl&quot;&gt;SUBTLEX-PL&lt;/h1&gt;

&lt;p&gt;As part of my PhD work at Ghent University, I prepared a frequency list for
Polish words. To this end, we downloaded about 100,000 subtitle files and,
after careful cleaning, built a corpus of about 146 million words. The whole
corpus was tagged using
&lt;a href=&quot;http://nlp.pwr.wroc.pl/takipi/&quot;&gt;TaKIPI&lt;/a&gt;, which made it possible to
determine the part of speech and the base form of every word in the corpus.&lt;/p&gt;

&lt;p&gt;We then counted how many times each word occurs in the corpus, in what
proportion of films it appears, how often it occurs as a particular part of
speech, how often a given base form can be assigned to it, and several other
statistics. We publish all of these data, so SUBTLEX-PL contains much more
information than a typical frequency list.&lt;/p&gt;
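
&lt;p&gt;The counts described above can be sketched in a few lines of Python (a toy illustration with made-up data, not the actual SUBTLEX-PL pipeline):&lt;/p&gt;

```python
from collections import Counter

# toy corpus: each inner list is the word sequence of one film's subtitles
films = [
    ["to", "jest", "kot"],
    ["kot", "i", "pies"],
    ["to", "jest", "dom"],
]

# raw frequency: total number of occurrences across the corpus
freq = Counter(word for film in films for word in film)

# contextual diversity: proportion of films in which a word appears
doc_counts = Counter(word for film in films for word in set(film))
cd = {word: n / len(films) for word, n in doc_counts.items()}
```

Contextual diversity (the share of films a word appears in) is counted over the set of words per film, so repeating a word within one film raises its raw frequency but not its diversity.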

&lt;p&gt;For details, see our
&lt;a href=&quot;https://doi.org/10.3758/s13428-014-0489-4&quot;&gt;article&lt;/a&gt; published in
&lt;em&gt;Behavior Research Methods&lt;/em&gt;.
The frequencies are available for download in several formats
&lt;a href=&quot;https://osf.io/5a76z/&quot;&gt;on this page&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Tue, 03 Mar 2015 00:00:00 +0100</pubDate>
        <link>https://www.pawelmandera.com/2015/03/03/pl-subtlex-pl/</link>
        <guid isPermaLink="true">https://www.pawelmandera.com/2015/03/03/pl-subtlex-pl/</guid>
      </item>
       
       
     
  </channel>
</rss>
