Open-source text SDR encoder?

Does anyone know of an open-source SDR encoder for text (ideally multilingual)?

I am cooking something up myself using bag-of-words and self-organizing maps, but would like to see existing methods.

Hello,

There was a company named cortical.io that built text-to-SDR encoders, but IIRC they never released their code as open source. They used to post on this forum; you can search for them discussing their research. Ex: IRIS: New version of free Cortical.io Demo now available!


Some years ago I researched text encoding using a modified HTM. I made a video explaining my findings and I'd be happy to answer any questions about it. Video Lecture of Kropff & Treves, 2008

Thanks for the pointers.

I am looking for guidance on how to optimize the text SDR encoding for optimal text retrieval properties. My intuition is: maximize the entropy of the SDR cells, i.e. every cell in the SDR should represent a very unique / niche semantic concept, and ideally the sets of words the cells represent should not overlap (even if the words are close neighbors).
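To make that entropy intuition concrete, here is a small sketch (my own illustration, not an established method): treat each SDR cell as a binary variable over the vocabulary and compute its binary entropy from its activation frequency. A cell that fires for a niche but non-trivial subset of words scores high; dead or always-on cells score zero. In practice you would maximize this subject to a fixed sparsity constraint.

```python
import numpy as np

def cell_entropies(sdrs):
    """Per-cell binary entropy of activation frequency.

    sdrs: (n_words, n_bits) binary matrix, one SDR per word.
    Returns an (n_bits,) array in [0, 1] bits.
    """
    p = sdrs.mean(axis=0)                     # activation frequency per cell
    p = np.clip(p, 1e-12, 1 - 1e-12)          # avoid log(0) for dead/saturated cells
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# toy example: 4 words, 8 cells, ~25% sparsity
rng = np.random.default_rng(0)
sdrs = (rng.random((4, 8)) < 0.25).astype(int)
print(cell_entropies(sdrs).mean())
```

Maximizing the mean of this quantity pushes cells toward diverse, informative usage across the vocabulary.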

Just found these related threads:

Close.

At best, the cells should have overlapping words that are all adjacent to the same semantic concept.

For instance, a cell could represent the concept of “cat”. So all words that are cat-adjacent will include this cell as part of their representation.

In practice, your cell won't represent something as clearly defined as the above concept. Furthermore, you won't have a single cell representing a concept, but multiple redundant cells that represent the concept in different ways.

I have thought about how to make my own SDR encoding like the one you are describing. I think the best way is to use one of the existing word embedding methods, such as word2vec, and generate an SDR encoding whose similarity metric matches the vector distance metric of the embedding.

The question then remains, how do you generate that encoding? I haven’t thought too hard on that yet.

There has been a lot of work on "word embeddings" showing that they capture concepts such as male vs. female in words like prince and princess or king and queen. The objection for what you are doing is that such an embedding is not sparse. However, since some of the implementations use real positive values between 0 and 1 for the embeddings, it might be possible to create a sparse code by selecting the largest values from the code and using the vector positions of those values to map to locations in the temporal memory. The memory would have to be sized to allow for the width that the embedding used for its code. If the embedding used the range -1 to 1, I haven't thought about how to deal with the large negative numbers.
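The "select the largest values" idea above can be sketched as a top-k binarization (a minimal illustration, not an established library routine; the function name and sizes are my own). It works directly for [0, 1] embeddings; for [-1, 1] embeddings, ranking by raw value means large negatives are simply never selected, which is one possible answer to the negative-number question.

```python
import numpy as np

def dense_to_sdr(vec, n_active):
    """Binarize a dense embedding by keeping its n_active largest values.

    The positions of the top values become the active bits, so the SDR
    width equals the embedding width (the memory must be sized to match).
    """
    sdr = np.zeros(len(vec), dtype=np.uint8)
    sdr[np.argsort(vec)[-n_active:]] = 1      # indices of the top-k values
    return sdr

v = np.array([0.1, 0.9, 0.3, 0.7, 0.05, 0.6])
print(dense_to_sdr(v, 3))                      # → [0 1 0 1 0 1]
```

The active bits land at the positions of 0.9, 0.7, and 0.6, the three largest components.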

Simply converting pre-trained dense word embeddings to SDRs is no problem (see this article for image SDRs: Sparse Distributed Representations) but will result in worse downstream performance (for classification / retrieval), since the resulting SDRs will be a sparse binary quantization of the original dense embeddings (thus losing information).

I am working on a training scheme that directly generates text SDRs from a text corpus (e.g. Wikipedia articles), skipping any pre-trained dense embeddings.


Good article.

Although, just a quick note about your article: I don't believe that cells physically close to each other need to be semantically similar. That's a major constraint, and it would make any embedding very difficult, since I don't think semantics can really be fit onto a 2D plane. I also don't think this is how biology does it. I usually imagine an SDR encoding as a finite set: an active bit means an element is present in the set. This also means that it doesn't matter where the cell is that activates this bit. They are all completely unordered, with no topology. (I would change this assumption when working with spatial data like images, though.)

Now onto your claim that the performance will be worse because of loss of information. I’m not sure your intuition here is correct.

It’s true that information will be lost, but I don’t think this is such a bad thing. If you have an embedding of 1000s of points in a high-dimensional vector space, arranging all those points so that they are only near to other semantically similar points is a big challenge. In a tight high-dimensional box like this, every point is related to every other point, even when it practically shouldn’t be. The difference between semantically similar and dissimilar words could be a normalized distance of 0.45 vs. 0.55. That’s bad when you’re trying to organize and represent your knowledge by semantic similarity.

If you convert this dense embedding to a sparse SDR encoding, it’s like you’re taking the points out of the tight high-dimensional box and unspooling them onto a flat surface as a sparse (possibly disconnected) graph. All those points that were previously at distance 0.5 or higher (if this is the threshold we choose) would have no edge between them on this graph. That cuts the number of edges from O(n^2) to something like O(k·n), where k is the average number of similar points per point. You would calibrate this threshold based on your dense-to-sparse conversion process.

As part of your dense-to-sparse conversion, a graph can be created from this encoding with the following rules: an edge exists between two points if their SDR encodings share at least one bit in common; if no edge exists between two points, they share no bit in common. The similarity characterized by an edge is the number of common bits between the two nodes’ encodings. Then you can visualize your semantic embedding as a graph (by embedding it AGAIN in a 2D space for plotting).
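Those graph rules are simple enough to sketch directly (my own toy illustration, assuming binary SDRs stacked in a matrix): pairwise bit overlaps come from a single matrix product, and pairs with zero overlap get no edge.

```python
import numpy as np

def sdr_overlap_graph(sdrs):
    """Build the graph described above: an edge joins two points iff
    their SDRs share at least one bit; the edge weight is the overlap."""
    overlaps = sdrs @ sdrs.T                  # pairwise common-bit counts
    edges = {}
    n = len(sdrs)
    for i in range(n):
        for j in range(i + 1, n):
            if overlaps[i, j] > 0:            # no shared bit -> no edge
                edges[(i, j)] = int(overlaps[i, j])
    return edges

sdrs = np.array([[1, 1, 0, 0],
                 [0, 1, 1, 0],
                 [0, 0, 0, 1]])
print(sdr_overlap_graph(sdrs))                # → {(0, 1): 1}
```

Point 2 shares no bit with the others, so it is simply disconnected, exactly the "beyond the threshold means no edge at all" behavior.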

So you see, you do throw away information, but it’s information that adds noise and confusion to any classification algorithm. It also eliminates things like catastrophic forgetting if you keep adding new data points to the representation space, since you aren’t worried about trying to prevent creation of unintended relations to existing points.

See my old post when I try to visualize this similarity space for different types of encodings.


The 2D topographic property (nearby cells being semantically similar) is useful for subsampling and visualization. It can always be applied as a post-processing step with no impact on SDR performance, since SDR similarity metrics like Hamming / Jaccard are invariant to how the dims are arranged.

This is incorrect AFAIK - randomly sampled points in a high dim. space tend to be orthogonal (unrelated) by default. Dense self-supervised learning algorithms exploit this property to only place related samples near each other.

In general I agree with your points though - I see the benefits of SDR properties, but want to avoid a slow two-step inference pipeline where I first have to run a dense model and then transform the dense output into an SDR.


It’s true that orthogonality of your randomly sampled points is nearly guaranteed, but that doesn’t help you if you’re using Euclidean distance between points to compare their similarity. The problem with any distance function is that any two points in the hyperspace have some finite distance, and hence some nominal similarity.

If instead you use a similarity metric, such as cosine similarity or bit overlap, there is a horizon: only points within a neighborhood have a non-zero value. Anything beyond that horizon is zero, irrelevant, and excluded from consideration. This is a nice feature that distance metrics don’t have. It allows you to narrow down the number of points to compare against by many orders of magnitude.
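A quick numerical illustration of that contrast (my own sketch, with arbitrary sizes): bit overlap between unrelated random SDRs is almost always zero or tiny, giving a hard "irrelevant" horizon, while Euclidean distances between random dense vectors are always some finite value with no clean cutoff.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_sdr(n_bits=2048, n_active=40):
    """A random SDR: n_active bits set out of n_bits (~2% sparsity)."""
    sdr = np.zeros(n_bits)
    sdr[rng.choice(n_bits, n_active, replace=False)] = 1
    return sdr

# Overlap between unrelated SDRs: almost always zero or a few bits.
overlaps = [int(random_sdr() @ random_sdr()) for _ in range(200)]
print(max(overlaps))                      # a handful of shared bits at most

# Euclidean distance between random dense vectors: every pair has
# *some* finite distance, so nothing is ever cleanly "irrelevant".
dense = rng.standard_normal((200, 128))
dists = np.linalg.norm(dense[:100] - dense[100:], axis=1)
print(dists.min() > 0)                    # True
```

With a small overlap threshold, the vast majority of SDR pairs drop out of consideration entirely; no analogous cutoff falls out of the dense distances.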

To avoid the two-step inference pipeline, why not pre-convert all possible inputs and outputs offline from the dense embedding model into a sparse representation? The resulting table or “model” becomes your new 1-step SDR embedding model. Just use someone else’s well-known published model and unravel it into a sparse embedding model.
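A sketch of that offline unraveling (illustrative only; the function name and toy vectors are mine, and the top-k conversion is one possible dense-to-sparse scheme): run every vocabulary word through the conversion once, and the resulting lookup table is the whole one-step "model".

```python
import numpy as np

def build_sdr_table(dense_vectors, n_active=3):
    """Pre-convert every word's dense embedding to an SDR, offline.

    dense_vectors: {word: 1-D float array}, e.g. from any published
    embedding model. Returns a plain {word: SDR} lookup table.
    """
    table = {}
    for word, vec in dense_vectors.items():
        sdr = np.zeros(len(vec), dtype=np.uint8)
        sdr[np.argsort(vec)[-n_active:]] = 1  # top-k binarization
        table[word] = sdr
    return table

# toy vocabulary with made-up 8-d vectors
vocab = {"cat": np.array([.9, .1, .8, .0, .2, .1, .7, .3]),
         "dog": np.array([.8, .2, .9, .1, .1, .0, .6, .2])}
table = build_sdr_table(vocab)
print(int(table["cat"] @ table["dog"]))       # shared bits → 3
```

At inference time, encoding a word is then a single dictionary lookup, with no dense model in the loop.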


This approach feels wrong: I believe one needs to train a native sparse model to truly realize the benefits of SDRs, rather than just converting a dense model into a sparse one for inference. Are you aware of an algorithm to convert any dense neural network into a sparse one?


Wait, converting the representations (e.g. the outputs) of a model from float vectors to SDRs is different from converting the model itself into an "SDR model", which is rather tricky, if even possible.
