Using Your Own Items and Vectors

To run the statistics you’ll need your own items and your own vectors. The process has two parts:

Acquire a list of words and associate them with a condition
Extract vectors for each word

Associate words with conditions

To see the format required for items (words), take a look at the item information from the first WEAT study:

weat1 <- cbn_get_items(type = "WEAT")

The top of this data frame looks like:

  Study Condition     Word   Role
1 WEAT1   Flowers    aster target
2 WEAT1   Flowers   clover target
3 WEAT1   Flowers hyacinth target
4 WEAT1   Flowers marigold target
5 WEAT1   Flowers    poppy target
6 WEAT1   Flowers   azalea target

Here, the study is called WEAT1 and there are four conditions: Flowers and Insects, which have the target role, and Pleasant and Unpleasant, which have the attribute role. The helper function cbn_make_items can help you create this structure for your own words. Naturally, the words you have vectors for should match the words you have item information for.
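To make the structure concrete, here is a minimal base R sketch that builds a data frame with the same four columns by hand. It deliberately does not call cbn_make_items, whose arguments are not shown in this vignette. The study name MyStudy, the condition names and the words are all invented for illustration.

```r
# Hypothetical items: two target conditions and two attribute conditions.
# Every name and word here is made up for illustration.
conditions <- list(
  Fruit      = c("apple", "pear", "plum"),
  Weapons    = c("sword", "rifle", "dagger"),
  Pleasant   = c("joy", "love", "peace"),
  Unpleasant = c("agony", "filth", "grief")
)
roles <- c(Fruit = "target", Weapons = "target",
           Pleasant = "attribute", Unpleasant = "attribute")

# One row per word, with the same columns as the WEAT1 items above
my_items <- data.frame(
  Study     = "MyStudy",
  Condition = rep(names(conditions), lengths(conditions)),
  Word      = unlist(conditions, use.names = FALSE),
  Role      = rep(unname(roles[names(conditions)]), lengths(conditions)),
  stringsAsFactors = FALSE
)

head(my_items)
```

Whether a hand-built data frame like this is accepted everywhere a cbn_make_items result would be is an assumption, so check it against the package documentation before relying on it.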

Extract vectors for each word

The cbn package bundles all the vectors you will need to replicate the paper analyses using the GloVe 840B 300-dimensional Common Crawl data. If, however, you want to work with different items you’ll need to point the package at your own file of word vectors. The process is:

Download and unpack a text file of word vectors
Point the package at the file of word vectors
Extract vectors for your choice of words
Analyze your vectors

In the following I’ll assume that you still want to use the Common Crawl, but these instructions will work for any word vectors that arrive in the same file format. That format is essentially

sausage 0.1234 -0.5555 1.4149

i.e. word, space, float, space, float, space, float, …, newline. This is what the code will assume when attempting to read things in.
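As an illustration of that format, a single line can be split into a word and a numeric vector with base R. This is only a sketch of the idea, not the reader the package uses, and parse_vector_line is a hypothetical helper name.

```r
# Split one line of a word-vector file into its word and its numbers.
# Hypothetical helper for illustration; not part of the cbn package.
parse_vector_line <- function(line) {
  parts <- strsplit(line, " ", fixed = TRUE)[[1]]
  list(word = parts[1], vector = as.numeric(parts[-1]))
}

parsed <- parse_vector_line("sausage 0.1234 -0.5555 1.4149")
parsed$word            # "sausage"
length(parsed$vector)  # 3
```

If your own file uses tabs or repeated spaces as separators, this simple split will not work as written, which is a quick way to check whether the file matches the expected format.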

Download and unpack a text file of word vectors

If you want to use the GloVe Common Crawl data, then go to its homepage and download one of the files under ‘Download pre-trained word vectors’, e.g. http://nlp.stanford.edu/data/glove.840B.300d.zip

When the download is complete, unzip the file. This should create a file of roughly 5 GB called glove.840B.300d.txt. I’ll assume you downloaded it to ~/Documents.

Point the package at the file of word vectors

Load the package and assign this location:

library(cbn)

cbn_set_vectorfile_location("~/Documents/glove.840B.300d.txt")

You can retrieve this location using cbn_get_vectorfile_location(). If you change your preferred vectors, just call cbn_set_vectorfile_location() again with a new location. If you’d like this location to be remembered across R sessions, add persist = TRUE to the function call.

Extract vectors for your choice of words

To get a matrix of vectors for your words:

words <- c("Hugh", "Pugh", "Barney", "McGrew")
mat <- cbn_extract_word_vectors(words)

By default there is no reporting, but for a couple of hundred words this function should return in around a minute for the 840B Common Crawl vectors.

If you want to watch progress, set verbose to TRUE. A second argument report_every controls how often a progress dot appears. It defaults to 100000 (lines).
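The running time comes from the size of the file, which has to be scanned for the words you asked for. The following base R sketch shows one way such a scan could work, using a tiny temporary file in the same format. It is an assumption about the general approach, not the package's implementation, and extract_vectors_sketch and chunk_size are hypothetical names.

```r
# Scan a word-vector file in chunks and keep only the requested words.
# Illustrative sketch only; not how cbn_extract_word_vectors is written.
extract_vectors_sketch <- function(path, words, chunk_size = 1000) {
  result <- NULL
  con <- file(path, open = "r")
  on.exit(close(con))
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    # The word is everything before the first space
    line_words <- sub(" .*$", "", lines)
    for (i in which(line_words %in% words)) {
      values <- as.numeric(strsplit(lines[i], " ", fixed = TRUE)[[1]][-1])
      if (is.null(result)) {
        # Allocate on first hit, once the vector length is known;
        # rows for words that are never found stay NA
        result <- matrix(NA_real_, nrow = length(words), ncol = length(values),
                         dimnames = list(words, NULL))
      }
      result[line_words[i], ] <- values
    }
  }
  result  # NULL if none of the words were found
}

# Try it on a three-line file written to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("sausage 0.1234 -0.5555 1.4149",
             "egg 0.5 0.25 -1.0",
             "chips -0.3 0.0 2.5"), tmp)
small <- extract_vectors_sketch(tmp, c("chips", "sausage", "beans"))
small
```

Here the row for beans stays filled with NA because that word is not in the file, and the rows come back in the order the words were requested rather than the order they appear in the file.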

mat is a matrix with as many rows as words and as many columns as the length of the vectors. If you are using the vectors above, that will be 300. The matrix has words as row names and no column names. In the event that one of your words is not found in the vector file, the corresponding row of mat is filled with NAs.
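Because missing words show up as rows of NAs, it is worth checking for them before any analysis. The sketch below uses a small hand-made matrix as a stand-in for the real output, with made-up numbers and three columns instead of 300, so that it runs without the vector file.

```r
# Toy stand-in for the result of cbn_extract_word_vectors():
# 4 words, 3 dimensions, with "McGrew" playing the word that was not found.
mat <- matrix(c(0.1, 0.2, 0.3,
                0.4, 0.5, 0.6,
                0.7, 0.8, 0.9,
                NA,  NA,  NA),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("Hugh", "Pugh", "Barney", "McGrew"), NULL))

# Rows that are entirely NA correspond to words missing from the file
not_found <- rownames(mat)[rowSums(is.na(mat)) == ncol(mat)]
not_found       # "McGrew"

# Keep only the words that were found
mat_found <- mat[!rownames(mat) %in% not_found, , drop = FALSE]
dim(mat_found)  # 3 3
```

If you drop words this way, remember to drop the same words from your item information so that the two stay matched.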

Analyze your vectors

With item information and corresponding vectors for your own words (or your own vectors) you can now use all the statistical functions described in the other vignettes.

Will Lowe

2019-06-25