The GloVe word vectors are powerful representations of words based on semantic relationships, which is perfect for HTM. But they consist of lists of signed scalar values, so they need to be transformed into SDRs somehow.
Here’s my idea:
Designate x bits for each dimension for a total of n = d * x bits, where d is the dimension of the word vectors.
Half of the x bits for each dimension will correspond to a negative number, the other half to a positive one.
When encoding a word, consider its word vector and divide up the w on-bits among the dimensions according to the relative magnitude of the value in that dimension. I.e., bits_in_d1 = w * abs(d_1) / sum(abs(d_i) for i = 1…d), rounded to a whole number of bits.
Or perhaps use the squared values instead: bits_in_d1 = w * d_1^2 / sum(d_i^2 for i = 1…d).
Activate the appropriate number of bits in each dimension’s designated space, taking into account the +/- sign of the original value.
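Here's a rough Python sketch of the steps above, just to make the idea concrete. A few things in it are my own assumptions and not settled parts of the idea: the name `encode` is made up, the negative half of each dimension's block comes first, the active bits within a half are simply the first ones in that half, the bit counts are rounded with a largest-remainder rule so they add up to w, and a dimension can never get more than x/2 bits.

```python
import numpy as np

def encode(vec, x=20, w=40, squared=False):
    """Sketch: turn a dense word vector into a binary SDR of length d * x."""
    vec = np.asarray(vec, dtype=float)
    d = vec.size
    half = x // 2  # assumption: first half = negative, second half = positive
    weights = vec ** 2 if squared else np.abs(vec)
    total = weights.sum()
    sdr = np.zeros(d * x, dtype=np.uint8)
    if total == 0:
        return sdr  # an all-zero vector has no weight to distribute

    # Fractional share of the w on-bits for each dimension.
    share = w * weights / total
    counts = np.floor(share).astype(int)

    # Assumption: largest-remainder rounding, so the counts sum to w.
    leftover = w - counts.sum()
    order = np.argsort(-(share - counts), kind="stable")
    counts[order[:leftover]] += 1

    # Assumption: cap at the space available for one sign. If a dimension
    # is capped, fewer than w bits end up on.
    counts = np.minimum(counts, half)

    for i, (c, v) in enumerate(zip(counts, vec)):
        start = i * x + (half if v > 0 else 0)
        sdr[start:start + c] = 1  # assumption: fill each half from its start
    return sdr

example = encode([0.5, -0.25, 0.25, 0.0], x=8, w=4)
active = np.flatnonzero(example).tolist()
```

With x = 8 and w = 4, the example vector gets 2, 1, 1 and 0 bits in its four dimensions, and the sign decides which half of each 8-bit block they land in.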
-Since the GloVe word vectors do not appear to be normalized (I’d call this odd, but I’m sure they know what they’re doing better than I do), two word vectors with the same value in the same dimension may be assigned different numbers of bits in that dimension. This is because bits are assigned by relative weight, not absolute weight, in order to preserve sparsity.
-Enforces sparsity - only w bits will be on in each encoding.
-Partially preserves semantic similarity. Two similar word vectors will have similar weight distributions among the dimensions, and therefore the SDRs will have similar numbers of on-bits in each dimension’s designated space.
-I get to use the really cool GloVe dataset for my project
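To illustrate the first point about relative weight, here is a tiny Python example with made-up three-dimensional vectors (not real GloVe data). The helper name `bit_shares` is hypothetical, and it only computes the share of on-bits per dimension before any rounding.

```python
import numpy as np

def bit_shares(vec, w):
    """Share of the w on-bits each dimension would get, before rounding."""
    weights = np.abs(np.asarray(vec, dtype=float))
    return w * weights / weights.sum()

# Both toy vectors have the value 1.0 in the first dimension.
a = [1.0, 1.0, 2.0]
b = [1.0, 3.0, 4.0]

shares_a = bit_shares(a, w=8)
shares_b = bit_shares(b, w=8)
```

Both results sum to w = 8, but the first dimension gets 2 bits for `a` and only 1 bit for `b`, because `b` has more total magnitude to share the bits with.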