using_your_own_vectors.Rmd
To run the statistics you’ll need your own items and your own vectors. The process has two parts, one for each.
To see the format required for items (words), take a look at the item information from the first WEAT study. The top of this data frame looks like:
      Study Condition     Word   Role
    1 WEAT1   Flowers    aster target
    2 WEAT1   Flowers   clover target
    3 WEAT1   Flowers hyacinth target
    4 WEAT1   Flowers marigold target
    5 WEAT1   Flowers    poppy target
    6 WEAT1   Flowers   azalea target
Here, the study is called `WEAT1`. The conditions are `Flowers` and `Insects`, which have the `target` role, and `Pleasant` and `Unpleasant`, which have the `attribute` role. The helper function `cbn_make_items` can help you create this structure for your own words. Naturally, the words you have vectors for should match the words you have item information for.
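As a rough illustration of the structure described above, here is a minimal base R sketch that builds a data frame with the same four columns by hand. It does not use `cbn_make_items`. The study name `MyStudy`, the `make_block` helper and the word lists are made up for the example.

```r
# Illustrative word lists (assumptions, not the WEAT item sets).
flowers    <- c("aster", "clover", "hyacinth")
insects    <- c("ant", "flea", "wasp")
pleasant   <- c("love", "peace")
unpleasant <- c("hatred", "filth")

# Hypothetical helper: one block of rows per condition.
make_block <- function(study, condition, words, role) {
  data.frame(Study = study, Condition = condition, Word = words, Role = role,
             stringsAsFactors = FALSE)
}

# Stack the four conditions: two target conditions, two attribute conditions.
items <- rbind(
  make_block("MyStudy", "Flowers",    flowers,    "target"),
  make_block("MyStudy", "Insects",    insects,    "target"),
  make_block("MyStudy", "Pleasant",   pleasant,   "attribute"),
  make_block("MyStudy", "Unpleasant", unpleasant, "attribute")
)

head(items)
```

The `Word` column is what has to line up with the words in your vector file.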
The `cbn` package bundles all the vectors you will need to replicate the paper analyses using the GloVe 840B 300-dimensional Common Crawl data. If, however, you want to work with different items, you’ll need to point the package at your own file of word vectors, as follows.
In the following I’ll assume that you still want to use the Common Crawl, but these instructions will work for any word vectors that arrive in the same file format. That format is essentially:

    sausage 0.1234 -0.5555 1.4149

i.e. word, space, float, space, float, space, float, …, newline. This is what the code will assume when attempting to read things in.
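To make the format concrete, the following base R sketch parses a couple of lines of this shape into a small matrix. It only illustrates the file layout and is not the package's reader. The second line (`bacon`) and the names `parse_vector_line` and `toy` are invented for the example.

```r
# Two lines in the "word float float ..." layout; the second is made up.
vector_lines <- c("sausage 0.1234 -0.5555 1.4149",
                  "bacon 0.5 0.25 -1.0")

# Hypothetical helper: split on single spaces, the first field is the word
# and the remaining fields are the vector components.
parse_vector_line <- function(line) {
  fields <- strsplit(line, " ", fixed = TRUE)[[1]]
  list(word = fields[1], vector = as.numeric(fields[-1]))
}

parsed <- lapply(vector_lines, parse_vector_line)

# One row per word, one column per vector dimension.
toy <- do.call(rbind, lapply(parsed, `[[`, "vector"))
rownames(toy) <- vapply(parsed, `[[`, character(1), "word")

toy
```

A file that uses tabs or repeated spaces as separators would need a different split, so it is worth checking the first line of your own file before relying on it.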
If you want to use the GloVe Common Crawl data, go to its homepage and download one of the files under ‘Download pre-trained word vectors’, e.g. http://nlp.stanford.edu/data/glove.840B.300d.zip
When the download is complete, unzip the file. This should create a roughly 5 GB file called `glove.840B.300d.txt`. I’ll assume you downloaded it to `~/Documents`.
Load the package and assign this location.
You can retrieve this location using `cbn_get_vectorfile_location()`. If you change your preferred vectors, just set the location again with the new path. If you’d like this location to be remembered across R sessions, add `persist = TRUE` to the function call.
Next, get a matrix of vectors for your words.
By default there is no reporting, but for a couple of hundred words this function should return in around a minute for the 840B Common Crawl vectors.
If you want to watch progress, set `verbose` to `TRUE`. A second argument, `report_every`, controls how often a progress dot appears. It defaults to 100000 (lines).
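For intuition about what this kind of progress reporting involves, here is a small base R sketch that scans a vector file in chunks and prints a dot per chunk. It is an assumption-laden illustration, not the package's implementation. The function name `scan_vectors` is hypothetical, and it reuses the argument names `verbose` and `report_every` only to mirror the description above.

```r
# Hypothetical scanner: read `report_every` lines at a time, keep the
# vectors for the requested words, and optionally print one dot per chunk.
scan_vectors <- function(path, words, verbose = FALSE, report_every = 100000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  found <- list()
  repeat {
    chunk <- readLines(con, n = report_every, warn = FALSE)
    if (length(chunk) == 0) break
    if (verbose) cat(".")
    # The word is everything before the first space on each line.
    first_space <- regexpr(" ", chunk, fixed = TRUE)
    chunk_words <- substr(chunk, 1, first_space - 1)
    for (i in which(chunk_words %in% words)) {
      fields <- strsplit(chunk[i], " ", fixed = TRUE)[[1]]
      found[[chunk_words[i]]] <- as.numeric(fields[-1])
    }
  }
  if (verbose) cat("\n")
  found
}

# Try it on a tiny temporary file rather than the 5 GB download.
tmp <- tempfile(fileext = ".txt")
writeLines(c("sausage 0.1234 -0.5555 1.4149",
             "bacon 0.5 0.25 -1",
             "egg 1 2 3"), tmp)
res <- scan_vectors(tmp, c("egg", "sausage"), verbose = TRUE, report_every = 2)
```

Reading in chunks keeps memory use bounded, which matters for a file of this size. The whole file still has to be scanned once, and that scan is why extraction takes a noticeable amount of time.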
`mat` is a matrix with as many rows as `words` and as many columns as the length of the vectors. If you are using the vectors above, that will be 300. The matrix has `words` as row names and no column names. In the event that one of your words is not found in the vector file, the corresponding row of `mat` is filled with NAs.
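Since missing words show up as all-NA rows, it is worth checking for them before running any statistics. The base R sketch below uses a small hand-built stand-in for `mat` (three dimensions rather than 300, with made-up values) to show one way of listing and dropping those rows.

```r
# Toy stand-in for the matrix described above; the values are invented and
# "notaword" plays the role of a word absent from the vector file.
words <- c("aster", "clover", "notaword")
mat <- rbind(c(0.1, 0.2, 0.3),
             c(0.4, 0.5, 0.6),
             rep(NA_real_, 3))
rownames(mat) <- words

# Rows containing NA correspond to words that were not found.
missing_words <- rownames(mat)[!complete.cases(mat)]
missing_words

# Keep only the words that have vectors; drop = FALSE preserves the matrix shape.
mat_found <- mat[complete.cases(mat), , drop = FALSE]
dim(mat_found)
```

If words are missing, remember to remove them from your item information as well, so that items and vectors still match.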