To run the statistics you’ll need your own items and your own vectors. The process has two parts.
To see the format required for items (words), take a look at the item information from the first WEAT study:
weat1 <- cbn_get_items(type = "WEAT")
The top of this data frame looks like:
  Study Condition     Word   Role
1 WEAT1   Flowers    aster target
2 WEAT1   Flowers   clover target
3 WEAT1   Flowers hyacinth target
4 WEAT1   Flowers marigold target
5 WEAT1   Flowers    poppy target
6 WEAT1   Flowers   azalea target
Here, the study is called WEAT1, and each condition (Flowers, in the rows shown) is assigned either a target role or an attribute role. The helper function cbn_make_items can help you create this structure for your own words. Naturally, the words you have vectors for should match the words you have item information for.
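As a rough illustration of the item structure, the sketch below builds a data frame with the same four columns using base R only. The study name, conditions, and words are made up for the example, and it deliberately does not call cbn_make_items, whose arguments are not shown here.

```r
# Hypothetical items for a made-up study; none of this ships with cbn.
items <- data.frame(
  Study = "MyStudy",
  Condition = rep(c("Fruit", "Tools", "Good", "Bad"), each = 2),
  Word = c("apple", "pear", "hammer", "wrench",
           "joy", "love", "agony", "filth"),
  # The first two conditions act as targets, the last two as attributes.
  Role = rep(c("target", "attribute"), each = 4),
  stringsAsFactors = FALSE
)

head(items)
```

Each word gets one row, and every condition carries exactly one role, mirroring the layout of the WEAT1 rows shown earlier.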
The cbn package bundles all the vectors you will need to replicate the paper analyses using the GloVe 840B 300-dimensional Common Crawl data. If, however, you want to work with different items, you’ll need to point the package at your own file of word vectors. The process is:
In the following I’ll assume that you still want to use the Common Crawl, but these instructions will work for any word vectors that arrive in the same file format. That format is essentially:
sausage 0.1234 -0.5555 1.4149
i.e. word, space, float, space, float, space, float, …, newline. This is what the code will assume when reading the file.
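To make the format concrete, here is a minimal base R sketch that splits one such line into a word and a numeric vector. The function name parse_vector_line is hypothetical and only illustrates the layout; it is not how the package itself reads the file.

```r
# Hypothetical helper: split a "word float float ..." line on single spaces.
parse_vector_line <- function(line) {
  parts <- strsplit(line, " ", fixed = TRUE)[[1]]
  # The first token is the word; the remaining tokens are the vector entries.
  list(word = parts[1], vector = as.numeric(parts[-1]))
}

parsed <- parse_vector_line("sausage 0.1234 -0.5555 1.4149")
parsed$word
parsed$vector
```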
If you want to use the GloVe Common Crawl data, go to its homepage and download one of the files under ‘Download pre-trained word vectors’, e.g. http://nlp.stanford.edu/data/glove.840B.300d.zip
When the download is complete, unzip the file. This should create a roughly 5 GB file called
glove.840B.300d.txt. I’ll assume you downloaded it to
Load the package and assign this location
You can retrieve this location using
cbn_get_vectorfile_location(). If you change your preferred vectors, just repeat the assignment call with the new location. If you’d like this location to be remembered across R sessions, add
persist = TRUE to that call.
To get a matrix of vectors for your words:
words <- c("Hugh", "Pugh", "Barney", "McGrew")
mat <- cbn_extract_word_vectors(words)
By default there is no reporting, but for a couple of hundred words this function should return in around a minute for the 840B Common Crawl vectors.
If you want to watch progress, set
verbose to TRUE. A second argument
report_every controls how often a progress dot appears. It defaults to 100000 (lines).
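For intuition about why extraction takes a while and what the progress dots count, here is a rough base R sketch of a line-by-line scan over a vector file. It is an assumption about the general approach, not the package's implementation; extract_vectors_sketch and the tiny temporary file are hypothetical stand-ins.

```r
# Hypothetical sketch: scan a vector file once, keeping only wanted words.
extract_vectors_sketch <- function(path, words, verbose = FALSE,
                                   report_every = 100000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  found <- list()
  n <- 0
  while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    n <- n + 1
    # Print one dot per report_every lines read.
    if (verbose && n %% report_every == 0) cat(".")
    parts <- strsplit(line, " ", fixed = TRUE)[[1]]
    if (parts[1] %in% words) found[[parts[1]]] <- as.numeric(parts[-1])
  }
  if (verbose) cat("\n")
  dims <- if (length(found) > 0) length(found[[1]]) else 0
  # Words that never appear in the file keep their NA rows.
  out <- matrix(NA_real_, nrow = length(words), ncol = dims,
                dimnames = list(words, NULL))
  for (w in names(found)) out[w, ] <- found[[w]]
  out
}

# A tiny stand-in for a real vector file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Hugh 0.1 0.2 0.3", "Barney -1 0 1"), tmp)
toy <- extract_vectors_sketch(tmp, c("Hugh", "Pugh", "Barney"))
toy
```

Because the whole file is read once regardless of how many words are requested, the running time depends mostly on the size of the vector file.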
mat is a matrix with as many rows as
words and as many columns as the length of the vectors. If you are using the vectors above, that will be 300. The matrix has
words as rownames and no column names. In the event that one of your words is not found in the vector file, the corresponding row of
mat is filled with NAs.
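Since missing words show up as all-NA rows, it can be worth checking for them before running any statistics. The sketch below uses a small hand-built matrix (mat_toy, a hypothetical stand-in for the result of the extraction step) and base R only.

```r
# Hypothetical stand-in for an extracted matrix; "Pugh" was not found.
mat_toy <- matrix(c(0.1, 0.2,
                    NA,  NA,
                    -1,  1),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("Hugh", "Pugh", "Barney"), NULL))

# Words whose entire row is NA were absent from the vector file.
missing_words <- rownames(mat_toy)[apply(is.na(mat_toy), 1, all)]
missing_words

# Keep only the rows for words that were found.
mat_found <- mat_toy[!rownames(mat_toy) %in% missing_words, , drop = FALSE]
```

Any word listed in missing_words should also be dropped from, or replaced in, the item information so that items and vectors stay in step.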