GloVe Encoder

I was looking on the forum for ways to encode words as SDRs, and this comment by @Ed_Pell gave me an idea for a GloVe word-vector encoder.

The GloVe word vectors are powerful representations of words based on semantic relationships, which is perfect for HTM. But they consist of lists of signed scalar values, so they need to be transformed into SDRs somehow.

Here’s my idea:

Designate x bits for each dimension for a total of n = d * x bits, where d is the dimension of the word vectors.

Half of the x bits for each dimension will correspond to a negative number, the other half to a positive one.

When encoding a word, take its word vector and divide the w on-bits among the dimensions according to the relative magnitude of the value in each dimension, i.e. bits_in_d1 = w * abs(d_1) / sum(abs(d_i) for i in 1…d).
Or perhaps use the squared values instead: bits_in_d1 = w * d_1^2 / sum(d_i^2 for i in 1…d).

Activate the appropriate number of bits in each dimension’s designated space, taking into account the +/- sign of the original value.
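As a toy illustration of the allocation step (the vector and w below are made-up values, not from GloVe):

```python
import numpy as np

# Hypothetical 3-dimensional "word vector" with w = 8 on-bits to distribute.
vector = np.array([0.5, -1.0, 1.5])
w = 8

weights = np.abs(vector)                  # relative magnitude per dimension
bits_per_dim = np.round(w * weights / weights.sum()).astype(int)

print(bits_per_dim)     # [1 3 4] -- the largest dimension gets the most on-bits
print(np.sign(vector))  # the sign decides which half of each dimension's zone lights up
```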

-Since the GloVe word vectors do not appear to be normalized (I’d call this odd, but I’m sure they know what they’re doing better than I do), a word vector with a given value in one place may have a different number of assigned bits than a different word vector with the same value in the same dimension. This is because bits are assigned by relative weight, not absolute weight, in order to preserve sparsity.

-Enforces sparsity - only w bits will be on in each encoding.
-Partially preserves semantic similarity. Two similar word vectors will have similar weight distributions among the dimensions, and therefore the SDRs will have similar numbers of on-bits in each dimension’s designated space.
-I get to use the really cool GloVe dataset for my project


Update: Here’s the code for the encoder in case anyone else is using the GloVe datasets. Turns out that ensuring exact sparsity even with possible rounding errors was a bit tedious, even though the basic idea is simplicity itself.

import numpy as np

class GloVeEncoder():

    def __init__(self, d, b = 20, w = 40, ID = 'GE1'):
        #Constructor method.
        #d -> Dimension of the GloVe word vectors.
        #b -> Number of bits assigned to each sign (+ and -) of each vector dimension.
        #w -> Number of on-bits for the encoding.
        #Assign the basic variables
        self.d = d
        self.b = b
        self.w = w
        self.n = 2*b*d
        self.ID = ID
        self.output_dim = (self.n,)

    def encode(self, vector, weighting = 'abs'):
        #Encodes a GloVe word-vector as an SDR.
        #Each dimension is assigned 2*self.b bits, half for positive and half for negative.
        #Then some of the bits for each dimension are turned on, according to the relative
        #magnitude of the value in that position, until self.w bits have been activated.
        #Weight each dimension by either the absolute value or the square of its element.
        if weighting == 'abs':
            weights = np.abs(vector)
        elif weighting == 'squares':
            weights = np.square(vector)
        else:
            print('Invalid weighting argument. Defaulting to abs.')
            weights = np.abs(vector)
        num_active_bits = np.round(self.w*weights/np.sum(weights)).astype('int')
        #Make sure there isn't a single zone with more than self.b active bits.
        sorted_bits = np.argsort(num_active_bits)
        overflow = 0
        for index in range(self.d - 1, -1, -1):
            bits = num_active_bits[sorted_bits[index]]
            if bits > self.b:
                #Increment the overflow, and scale back the active bits to self.b
                overflow += bits - self.b
                num_active_bits[sorted_bits[index]] = self.b
            elif bits < self.b:
                #Use up some of the overflow, and scale up the active bits (capped at self.b)
                transfer = min(overflow, self.b - bits)
                num_active_bits[sorted_bits[index]] += transfer
                overflow -= transfer
        #Make sure the rounding process didn't produce an incorrect bit total.
        #We can accomplish this by scanning through the zones, starting at the
        #highest one, and adding/subtracting one bit to/from each.
        increment_index = len(sorted_bits)
        while np.sum(num_active_bits) != self.w:
            increment_index -= 1
            #Start the cycle over again if the index passed 0
            if increment_index < 0:
                increment_index = len(sorted_bits) - 1
            step = np.sign(self.w - np.sum(num_active_bits))
            bits = num_active_bits[sorted_bits[increment_index]]
            #Skip this index if it can't absorb the change.
            if (step > 0 and bits == self.b) or (step < 0 and bits == 0):
                continue
            #Update the bit counter for this index
            num_active_bits[sorted_bits[increment_index]] += step
        #Now we'll define the output SDR and fill in the active bits for each zone.
        SDR = np.zeros(self.n,)
        for index, bits in enumerate(num_active_bits):
            if bits > self.b:
                print("Error! Bits = {}".format(bits))
            if vector[index] < 0:
                #Activate bits in the negative half of the zone, starting at index*(2*b)
                start = index*2*self.b
            else:
                #Activate bits in the positive half of the zone, starting at index*(2*b) + b
                start = (1 + index*2)*self.b
            SDR[start:start + bits] = 1
        if np.sum(SDR) != self.w:
            print("Error compiling SDR: Total bits = {} and not {}.".format(np.sum(SDR), self.w))
        return SDR
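For anyone feeding this encoder from the raw GloVe text files, here's a minimal loader sketch. Each line in those files holds a word followed by its d space-separated float components; the 3-d sample lines below are toy values, not real GloVe entries.

```python
import numpy as np

def load_glove(lines):
    # Parse "word v1 v2 ... vd" lines into a word -> vector dict.
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

# Toy 3-dimensional sample lines standing in for a real GloVe file.
sample = ["the 0.418 0.24968 -0.41242",
          "cat -0.15164 0.30177 -0.16763"]
vecs = load_glove(sample)
print(vecs["cat"])
```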

@Andrew_Stephan thanks for sharing your work. I'm very interested in seeing the SDR results for two correlated words, e.g. "you" and "your".


Here's a sampling of similarity scores (the percentage of overlapping bits in a pair of sample encodings). Disclaimer: I made a slight change to the code above for this sample, removing the part that enforces the exact number of on-bits; I felt that made the encodings a bit artificial. I also pre-normalize the vectors now.

{'page_read': 56.098,
'book_library': 47.5,
'frog_lizard': 68.571,
'frog_tree': 51.429,
'man_woman': 76.316,
'chef_cook': 54.054,
'cat_laptop': 42.857,
'cat_refrigerator': 36.585,
'cold_ice': 53.659,
'cold_velcro': 23.684,
'husband_man': 60.526,
'wife_woman': 60.526,
'plastic_audiovisual': 32.432,
'algebra_blanket': 17.5,
'you_your': 72.973}
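For clarity, the score can be sketched as the shared on-bits divided by the on-bit count. I'm assuming the denominator is the larger of the two counts, since with exact-sparsity enforcement removed the two encodings may have slightly different numbers of on-bits.

```python
import numpy as np

def overlap_score(sdr_a, sdr_b):
    # Percentage of on-bits shared between two SDRs.
    shared = np.sum(np.logical_and(sdr_a > 0, sdr_b > 0))
    w = max(np.sum(sdr_a > 0), np.sum(sdr_b > 0))
    return 100.0 * shared / w

# Two toy 20-bit SDRs with 4 on-bits each, sharing 2 of them.
a = np.zeros(20); a[[0, 1, 2, 3]] = 1
b = np.zeros(20); b[[2, 3, 4, 5]] = 1
print(overlap_score(a, b))   # 50.0
```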

Based on this (very small) sample, it seems that overlap percentages translate to semantic similarity roughly as:

< 30% : Totally unrelated
30 - 40% : Mostly unrelated
40 - 50% : Related
50 - 60% : Closely related
> 60% : Structurally similar words?

For comparison, using the original (normalized) word-vectors, the geometric distance between the closely related words you and your with an overlap of ~73% is 0.44. The distance between the unrelated words algebra and blanket with an overlap of ~18% is 1.52.

edit: This used the 50-dimensional word-vectors. GloVe also has versions with many more dimensions.


Upon further consideration, I've realized that this encoder has a big flaw: an uneven distribution of active bits. Any encoder should, given a random input, have an equal chance of setting any given bit to 1. This encoder has a disproportionately high chance of turning on the first bit of each zone (i.e., if b is 10, every 10th bit). The second bit of each zone is slightly less likely to be 1, the third less likely still, and so on.

So even if the SDR is 2000 bits long, most of the 1's in any encoding can be found in a smaller subset. Ultimately this results in inflated overlap scores. Two random inputs should have a very low expected overlap score (on the order of the sparsity, w/n), but the GloVeEncoder clearly fails in this regard.
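The random-input baseline is easy to check empirically. This sketch scores overlap as shared bits over w, for n = 2000 and w = 40 as in the encoder's defaults:

```python
import numpy as np

rng = np.random.default_rng(0)
n, w, trials = 2000, 40, 1000

overlaps = []
for _ in range(trials):
    # Two independent random SDRs, each with w on-bits out of n.
    a = rng.choice(n, size=w, replace=False)
    b = rng.choice(n, size=w, replace=False)
    overlaps.append(len(np.intersect1d(a, b)) / w)

print(np.mean(overlaps))   # ~0.02, i.e. roughly w/n -- far below the scores above
```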

I'm racking my brain to solve this issue in a way that doesn't destroy the overlap of similar words, but so far I'm drawing a blank.


Update: I’ve discovered that I can mitigate the issue by modifying the weighting scheme. By setting the weights to np.abs(vector)**beta, and sweeping beta, I can change how ‘distributed’ the resulting SDRs are. The larger beta is, the more the active bits become concentrated in the top 1 or 2 dimensions of the word vector. A smaller beta distributes the activity over more and more dimensions. The best value for beta seems to be between 2 and 4. This extends the range of overlap scores for my test cases ‘you’ vs ‘your’ and ‘blanket’ vs ‘algebra’ to ~97% and ~7% overlap respectively.
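A sketch of the beta sweep (with a made-up 4-d vector) showing how a larger beta concentrates the allocation in the top dimensions:

```python
import numpy as np

# Hypothetical 4-dimensional vector and w = 8 on-bits to distribute.
vector = np.array([0.1, -0.2, 0.4, 0.8])
w = 8

for beta in (1, 2, 4):
    weights = np.abs(vector) ** beta
    bits = np.round(w * weights / weights.sum()).astype(int)
    print(beta, bits)
# 1 [1 1 2 4]   -- activity spread over all dimensions
# 2 [0 0 2 6]   -- concentrated in the top two
# 4 [0 0 0 8]   -- everything in the single largest dimension
```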


I'm curious if you plan to make this encoder open source in the future. It would be nice to have an alternative to Cortical.io's word fingerprints now that they are no longer supporting the low-level APIs to query them.


I will add it to my github tomorrow.