Future development of fork + how 'far behind' are we?

I’ve been coding through htm.core’s example and feel like I’m getting the hang of the encoder-SP-TM flow, but I’ve got some questions regarding future development.

  1. What is the htm.core fork currently “missing” that NuPIC has implemented?

I looked through numenta’s docs; while the networkAPI and modelFactory seem like nice QoL features, it seems the same task can be achieved by putting together the parts manually in core.
They have a wider array of specialized encoders, of course - geo-coordinate, adaptive scalar, category_string, logarithmic (float on log scale?), among others. But a large number of these seem like fairly close extensions of the base ScalarEncoder which has been ported to core already - though that makes it sound much easier than it surely was to create them, of course.

For example, I wonder if you could ‘mimic’ the geo-coordinate encoder with a series of smaller scalars with fixed minimums/maximums, concatenated into one SDR, perhaps allotting more bits for the values of lat/long that represent larger real-world distances.
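To make the idea concrete, here is a minimal pure-Python sketch of two fixed-range scalar encodings concatenated into one SDR. It is not the NuPIC geo-coordinate encoder and does not use the htm.core API; `encode_scalar`, `encode_lat_lon` and the bit budgets are hypothetical names and values chosen for illustration.

```python
def encode_scalar(value, minimum, maximum, size, active_bits):
    """Toy scalar encoder: a contiguous run of `active_bits` ON bits whose
    position within `size` bits tracks where `value` sits in [minimum, maximum]."""
    value = min(max(value, minimum), maximum)  # clip to the fixed range
    span = size - active_bits                  # number of possible start offsets
    start = round((value - minimum) / (maximum - minimum) * span)
    return [1 if start <= i < start + active_bits else 0 for i in range(size)]


def encode_lat_lon(lat, lon):
    # Hypothetical bit budget: longitude spans 360 degrees versus 180 for
    # latitude, so it gets twice as many bits to keep a similar resolution.
    lat_bits = encode_scalar(lat, -90.0, 90.0, size=200, active_bits=21)
    lon_bits = encode_scalar(lon, -180.0, 180.0, size=400, active_bits=21)
    return lat_bits + lon_bits  # concatenate into a single SDR


def overlap(a, b):
    """Count the positions where both SDRs are ON."""
    return sum(x & y for x, y in zip(a, b))


# Two nearby points should share more bits than two distant ones.
near = overlap(encode_lat_lon(48.85, 2.35), encode_lat_lon(48.86, 2.36))
far = overlap(encode_lat_lon(48.85, 2.35), encode_lat_lon(-33.87, 151.21))
```

One caveat with this toy version: a degree of longitude covers less ground near the poles, so a fixed bit budget per degree does not correspond to a fixed real-world distance.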

Core’s MNIST example on github is a fine example of image encoding as well; it’s only black & white so far, but I don’t see why we couldn’t just multiply the encoded SDR length x3 for RGB 3-channels - stacking it linearly instead of stacking vertically like you would an RGB image for a convolutional neural net. Or you could have a same-size array of tuples (R,G,B) instead of an array of scalars (B&W), but I feel like that might not work out so well.
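A rough sketch of the "stack the channels linearly" option, in plain Python. It assumes each channel is binarized with a simple fixed threshold, which may differ from what the MNIST example actually does; `encode_channel`, `encode_rgb` and the threshold value are hypothetical.

```python
def encode_channel(pixels, threshold=128):
    """Binarize one colour channel (a flat list of 0-255 ints)."""
    return [1 if p >= threshold else 0 for p in pixels]


def encode_rgb(image, threshold=128):
    """`image` is a flat list of (R, G, B) tuples; the output is the three
    binarized channels laid end to end, so it is 3x the pixel count."""
    sdr = []
    for channel in range(3):
        sdr.extend(encode_channel([px[channel] for px in image], threshold))
    return sdr


# A hypothetical 2x2 image, flattened row by row.
image = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (200, 200, 200)]
sdr = encode_rgb(image)
```

With this layout a pixel's red, green and blue bits sit a full channel-length apart, so any topology the SP relies on would have to account for that.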

I don’t quite yet understand how category-strings are encoded - docs say it’s a scalar encoder with radius 1 - but Cortical.io had a really cool “retina” semantic fingerprint for a huge vocabulary. I can’t find the page on their website (might have taken it down?) but it allowed encoding of words while capturing semantic meaning and ‘distance’ for bitwise comparison.

The toughest thing I foresee encoding is graphs. Sort of arbitrary in terms of size, connectivity, what data is stored where. I also remember seeing an audio encoder somewhere for base nupic.

  2. I know NuPIC is foreseeably still in maintenance mode, but support for Python 2 ended ~9 months ago. I’m quite biased towards wanting to develop anything new in Python 3; is the current state of affairs “we’re applying existing NuPIC and theorizing/writing and sharing research for future developments”, or some other paradigm? Do you figure the ‘future’ of NuPIC is along these Python 3 fork rails?

@mcleverley nice review.
We tried to keep the content close to NuPIC and in many cases apps written against NuPIC may also run with the htm.core library.

Since I am one of the maintainers of htm.core I am a bit biased but I do feel that Python 3 (and C++) is the future and htm.core does support Python 3. The objective was to move everything that was important from NuPIC to htm.core and there are a few new things as well. We do have networkAPI but we did not include the OPF model factory. Instead the networkAPI will accept a JSON configuration string that defines the entire test layout.

I don’t quite yet understand how category-strings are encoded - docs say it’s a scalar encoder with radius 1

The category encoding of the ScalarEncoder is not for encoding strings. It’s more like an enum, a static list of values and each value has an index. The encoder just encodes the index. The SimHashDocumentEncoder is more along the lines of the proprietary Cortical.io encoder for encoding words and sentences.
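As a plain-Python illustration of "encode the index, not the string" (this is a sketch of the idea, not the htm.core ScalarEncoder itself; `CATEGORIES`, `BITS_PER_CATEGORY` and `encode_category` are hypothetical):

```python
CATEGORIES = ["cat", "dog", "parrot"]  # hypothetical static list
BITS_PER_CATEGORY = 5                  # hypothetical width of each block


def encode_category(name):
    """Encode the index of `name`, not its characters: each category owns
    its own non-overlapping block of bits."""
    index = CATEGORIES.index(name)  # raises ValueError for unknown names
    size = len(CATEGORIES) * BITS_PER_CATEGORY
    start = index * BITS_PER_CATEGORY
    return [1 if start <= i < start + BITS_PER_CATEGORY else 0 for i in range(size)]


cat, dog = encode_category("cat"), encode_category("dog")
shared = sum(a & b for a, b in zip(cat, dog))  # distinct categories share no bits
```

Because the blocks do not overlap, "cat" is no closer to "dog" than to "parrot", which is why this kind of encoding carries no semantics.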

We are open to any new algorithms that anyone might want to implement. In particular more relating to the 1000Brains theory, Grid Cells, etc. We can help you get them folded into the library for everyone to use.

Perhaps taking this a bit off topic, but if I understand the sim hash doc encoder, it doesn’t encode language semantics (such as overlap between the words “apple” and “computer”). Just pointing that out, because as such it isn’t (IMO) a great alternative to Cortical IO’s word encodings, which unfortunately are part of their lower-level APIs that are no longer publicly available. This project looks like a promising alternative for word encodings that include semantics.

Appreciate the insight! Will look into the networkAPI. No complaints about the lack of model factory - that’s like Keras few-lines-quick-implement compared to PyTorch more-lines-more-granular-control. It’s heartening to see that people who very much know what they’re doing are into the Py3 line.

One thing I’d like to look into is a very inelegant way of implementing spindle (von economo) neurons, cells with only one dendrite that appear in humans, cetaceans, elephants and other species that seem to display ‘empathy’.
I imagine I’d have to tinker with synapse.py and other source files, perhaps add a method that randomly caps N% of cells in the SP and/or TM at 2 synapses (simulating one distal dendrite).
But then, that seems… bad, compared to an equivalent size and parameter network without the N% max_1_synapse constraint. Dunno if there’s a way to “add” spindle cells into an existing SP/TM setup without increasing the actual number of cells and columns, since these spindle cells are arguably part of cortical columns just like regular neurons.
As to why, it’s following the thread of “mimic form to get function” without a ton of evidence behind my hunch. Not sure a TM model predicting scalar values needs to understand emotion in a way that lets chimpanzees bond better than snakes. But understanding and remembering empathy requires delicate connection and sensitive handling of specific inputs - specialization, in a sense - that could maybe be useful for some ML problems.
Not… sure where to start, here, exactly. Will have to look into the source classes and methods to understand what exactly happens when I call tm = TemporalMemory(params={...}), then see how I can alter what happens in that instantiation.

I think I see what you mean about categories not being strings, just an index - for discrete class/category prediction. 1=cat, 2=dog, 3=parrot.


On what @Paul_Lamb wrote, now that I think more on it, string encoding in the manner of Cortical’s proprietary algo seems tricky indeed.

My only idea regarding ‘how they could have done it’ is based on their old example page, where they showed SDR semantic fingerprints for “mice”, “rat” and “animal”.
Mice and rat had more overlap with each other than either did with animal, but they both had more overlap with animal than, say, ‘transistor’. Perhaps rat has less with transistor than mouse (computers don’t have rats).

My theory on how to embed this sort of semantic closeness would be to feed an HTM algo a huge corpus of literature sentence by sentence. Words that appear in the same sentence begin to ‘associate’ each other, developing SDR/synaptic overlap / closeness (how to encode/reinforce this is another issue entirely).

This could be how they solve the plural issue that stemming/lemmatization runs into:
“mice ran through the kitchen”, “a mouse ran through the kitchen”, and “a rat ran through the kitchen” could lead to mice and mouse occupying almost identical fingerprints, because the words would likely be used in almost identical sentence contexts.

Is this approach similar to what the GloVe encoder does? That proposal looks quite fascinating, will check it out.
EDIT: Looked into it. “Word occurrence” and “context” - seems to be along the same lines, generally. Can’t imagine how you could learn word closeness by looking at one word at a time, people learn through sentences. Someone in the thread mentions that GloVe vectors aren’t sparse - but an encoded scalar isn’t that sparse (sort of…? it’s less sparse than SP/TM) and becomes more sparse through SP. Maybe I’ve got that entirely wrong, though.
If they accurately convert words to vectors of nonbinary values based on context… haven’t we encoded a vector of values perfectly well before with MNIST images?

Cortical IO has a lot of material out there on how their algorithm works. At a high level, this video is quite good. Of course they hold patents on their technique, and it is not open source, so even though it is relatively straightforward to understand and reverse engineer (I’ve done so myself), you would be quite limited in your ability to use or distribute the result.

GloVe creates vectors, which are of course a very different type of global encoding for words. I don’t know the inner workings of the algorithm, but I do know that it captures semantics. Various examples from their documentation demonstrate this, such as the vector difference from “woman” to “man” being similar to that of “queen” to “king”. @Andrew_Stephan’s project is an attempt to convert these vectors into bit arrays. It may not be the best way (TBH, I haven’t had a chance to do any benchmarks yet), but the general idea I think is promising.

The video is quite helpful, thanks - I love good animation used to explain complex computing topics, very underrated.
It seems quite elegant in many ways, especially the way they preserve sparsity for multi-word fingerprints by simply choosing a cutoff threshold so as to not encode too much or too little semantic info.
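My reading of the cutoff idea, as a small Python sketch: add up the word fingerprints position by position and keep only the strongest positions, so the combined fingerprint stays sparse. This is an assumption about the general approach, not Cortical's actual algorithm; `combine_fingerprints` and `max_active` are hypothetical names.

```python
from collections import Counter


def combine_fingerprints(fingerprints, max_active):
    """Merge several word fingerprints (sets of ON-bit positions) and keep
    only the `max_active` positions shared by the most words."""
    counts = Counter()
    for fp in fingerprints:
        counts.update(fp)
    # Sort by count (descending), then by position so ties break deterministically.
    ranked = sorted(counts, key=lambda pos: (-counts[pos], pos))
    return set(ranked[:max_active])


# Hypothetical toy fingerprints, not real Cortical.io output.
words = [{1, 2, 3, 4}, {3, 4, 5, 6}, {4, 6, 7, 8}]
text_fp = combine_fingerprints(words, max_active=4)
```

A plain union of the three sets would have 8 ON bits; the cutoff keeps the result at 4 no matter how many words go in.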

I sort of get the idea of taking each word in a snippet, but I’m a little fuzzy on aspects of the process:

When they cut the cleaned training text into “meaning based slices” called snippets, how are these snippets chosen, I wonder? The video shows division based on paragraphs in wikipedia articles.

Their next step is ‘distributing snippets over a 2d grid s.t. snippets with similar meanings are placed close to one another’. This I would assume to be like a self-organizing map of some sort - they mention that it’s a proprietary Hebbian algo similar to Kohonen networks.

I started looking into Kohonen maps just now, and they seem quite interesting. I wonder if a SOM takes care of my previous question of ‘splitting snippets based on meaning’ or, if the mentioned Hebbian algo takes care of that.

Found a neat study that uses a SOM to visualize and map a staggering corpus of medical journals:

Keep in mind that although MeSH-based input vectors contain
binary weights (i.e., presence or absence) and are very sparse (i.e.,
with few non-zero values for a given document), training will
generate continuous weights for all dimensions at each neuron.

MeSH meaning ‘PubMed Medical Subject Headings’, sort of a keyword that they focused on extracting and used as their ‘word’ input. I thought it very interesting that their vectors contain binary weights in a sparse manner, before moving to continuous weight generation.

That you’ve reverse engineered it yourself is quite impressive! This is less technical and more legal, but I also wonder about that ‘ability to use or distribute your result’ - I can understand patenting an algorithm/model you design, if nothing could be proprietary there would be little incentive to create business-use tech.
But let’s say someone were to create their own rather shoddy algorithm without access to Cortical’s proprietary code, informed only by their high-level strategy overview and (in my own case) a haphazard understanding of low-level architecture. This copycat model, which arguably copied the task and direction, yet knew nothing of implementation details, can solve the same problems as the original.

My limited reading on “prior art” tells me it depends on things such as industry similarity, technical complexity of both patents, and other details that seem designed for mechanical inventions more so than code in 2020.

I suppose I’m wondering how a company protects its proprietary algo from people who can reverse engineer it. If, say, you recreated Cortical’s code on your own but never filed a patent for this recreation, kept it quite hidden and sold document-sorting services to customers, they could perhaps claim infringement and legally force you to reveal your source code. Then it gets a little messy where lines are drawn constituting ‘similar’ and overlapping design or function, I believe.

But processes are constantly improved on, even using the same languages and libraries. It seems quite possible that someone in the future will devise an HTM approach for semantic encoding, converting words to SDRs, that the law could consider different from Cortical’s patented process.

Such is the fate of complex, multi-level models, I suppose. The more moving parts to your invention, the more chances for someone else to substitute or radically alter one of those parts and create a valid new invention, so to speak.
I chatted with the guys at Intelletic the other day to glean the non-proprietary details of their tech; they state that they also use a multi-stage model, in which HTM plays a large role, with non-biological ML components - so I assume there’s some regular old ANNs in there as well as input or output to the HTM.

I’ve rambled far off topic, but this has been quite helpful for me - I appreciate the insight very much.

In my original implementation, I just used the text blocks placed by the authors (this seems logical since the folks who wrote the text felt like these were logical breaks). My latest implementation allows online learning by borrowing the concept of eligibility traces from RL, removing the need to define text snippets up front.

This bit took me the longest to figure out. Given the properties of SDRs, however, it isn’t actually necessary to encode word semantics with this type of topology. My initial implementation did not have the “similar meanings closer to one another” topology, but was still able to replicate some of the frequently-cited word SDR math, like “Jaguar - Lion = Porsche”.

I’ve since discovered that hex grids can be used to distill topology from semantics. I’m working on a project to generalize this concept into a “universal encoder” algorithm. The idea originally started from a conversation with @jordan.kay a couple years ago.

Ah, that’s interesting that both the logical author-written breaks and eligibility traces work well. I’ll have to look more into traces, but probably need a better grounding in RL first to do so.

“Similar meanings closer to one another” like GloVe vectors, or close as in “appearing closer in sentences / over time”? I think I’m misunderstanding exactly how you converted words to SDRs here, and how you fed your model - one word or one sentence at a time.

Hex grids deriving topology from semantics - that is fascinating stuff. I’ll look through Jordan’s thread you linked. Is that connected much with Bitking’s thread? You discuss several interesting potential uses with him in that one, I quite like the idea of less sparse borders from lateral input.

In my initial implementation of semantic folding (similar to how Cortical IO describes their process, minus the topological component), each block of text is given a position in a large 2D semantic map. Representations for words are simply comprised of ON bits in the positions of the semantic map corresponding to blocks of text which contain that word.
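A minimal sketch of that non-topological version, assuming the position of a block in the semantic map is simply its index in the corpus (a flat stand-in for the 2D map). `build_word_sdrs` and the toy corpus are hypothetical, and SDRs are stored as sets of ON positions.

```python
def build_word_sdrs(blocks):
    """Give each text block one position in the semantic map (here simply its
    index) and turn ON, for every word, the positions of blocks containing it."""
    word_sdrs = {}
    for position, block in enumerate(blocks):
        for word in set(block.lower().split()):
            word_sdrs.setdefault(word, set()).add(position)
    return word_sdrs


# Hypothetical toy corpus; a real one would have many thousands of blocks.
blocks = [
    "mice ran through the kitchen",
    "a rat ran through the kitchen",
    "mice and a rat hid in the barn",
    "the transistor switches current",
]
sdrs = build_word_sdrs(blocks)
mice_rat = len(sdrs["mice"] & sdrs["rat"])                # share one block
mice_transistor = len(sdrs["mice"] & sdrs["transistor"])  # share none
```

Words that show up in the same blocks end up sharing ON bits, which is where the semantic overlap comes from.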

The difference between this and Cortical IO’s implementation is that they ensure the positions for blocks of text sharing more of the same words are placed physically closer to each other on the 2D semantic map, and those sharing fewer words are placed farther away from each other. Having this topology allows them to easily scale the semantic map to any size using 2D scaling algorithms. In a non-topological semantic map, scaling can be done either through spatial pooling (depending on the dimensions and available resources), or more crudely through random drop-out (see Numenta’s math about sparsity and cell death to see why this works).
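The random drop-out option can be sketched in a few lines of Python. The one thing that matters is that every word is shrunk with the same subset of positions; `random_dropout` and its parameters are hypothetical names, and SDRs are again sets of ON positions.

```python
import random


def random_dropout(sdr, map_size, new_size, seed=0):
    """Shrink a non-topological semantic map by keeping a random subset of
    positions. The same subset (same seed) must be used for every word."""
    rng = random.Random(seed)
    kept = sorted(rng.sample(range(map_size), new_size))
    remap = {old: new for new, old in enumerate(kept)}  # old position -> new position
    return {remap[pos] for pos in sdr if pos in remap}


# Hypothetical fingerprints on a 1000-position map, sharing 50 positions.
a = set(range(0, 100))
b = set(range(50, 150))
small_a = random_dropout(a, 1000, 500)
small_b = random_dropout(b, 1000, 500)
```

On average about half of each word's bits and about half of the shared bits survive a 1000 to 500 reduction, so relative overlaps are roughly preserved.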

The eligibility trace implementation is a little more involved. Prior to employing hex grids, I started each word with a random sparse weighted representation (where each bit has a range of values rather than being a Boolean). Then each word in the training material is processed in the order that it appears in the text, adjusting its representation and the representations in its eligibility trace via logarithmic decay. Finally, when training is complete (or on the fly if used in an online learning scenario), the (now dense) weighted representations are used to generate SDRs by sparsification and conversion to bit arrays.
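A very rough sketch of how such a loop could look, with several assumptions filled in: the update rule (pull co-occurring words toward each other) is a guess, a simple geometric decay stands in for the logarithmic decay mentioned above, and `train_word_sdrs` and all its parameters are hypothetical.

```python
import random


def train_word_sdrs(tokens, size=64, active=8, decay=0.5, rate=0.1, seed=0):
    rng = random.Random(seed)
    weights = {}  # word -> list of float weights (the weighted representation)
    trace = {}    # word -> eligibility, higher means seen more recently

    for word in tokens:
        if word not in weights:
            # Random sparse weighted start: a few positions get a nonzero weight.
            vec = [0.0] * size
            for pos in rng.sample(range(size), active):
                vec[pos] = rng.random()
            weights[word] = vec
        # Pull the current word and each recently seen word toward each other,
        # scaled by how eligible (recent) the other word still is.
        for other, eligibility in trace.items():
            if other == word:
                continue
            step = rate * eligibility
            cur, oth = weights[word], weights[other]
            for i in range(size):
                delta = step * (oth[i] - cur[i])
                cur[i] += delta
                oth[i] -= delta
        # Decay old traces (geometric decay, an assumption), then mark the
        # current word as fully eligible.
        trace = {w: e * decay for w, e in trace.items()}
        trace[word] = 1.0

    # Sparsify: keep the `active` strongest positions of each word as ON bits.
    return {
        word: set(sorted(range(size), key=lambda i: (-vec[i], i))[:active])
        for word, vec in weights.items()
    }


sdrs = train_word_sdrs("mice ran through the kitchen".split() * 3)
```

Since the trace carries the context forward, no text snippets need to be defined up front; the weighted vectors grow denser during training and are only cut back to a fixed number of ON bits at the end.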

The conversation with Jordan is not about topology and semantics, but rather it is what first got me interested in the possibility of an actual “universal encoder” (prior to the conversation, I considered that term to be in the realm of fantasy and unicorns). The realization about topology and its relation to semantics came from a completely different direction, while experimenting with different ideas for temporal pooling using hex grids.

Perhaps not directly, but that is certainly the thread (and the many, many conversations which spawned from it) which initially sparked my interest in the topic of hex grids.

Thank you for the detailed replies! Fantastic explanation of semantic folding in topological and non-topological methods. Sparse Weighted Representations seem interesting as well.
A universal encoder seems like a philosopher’s stone in some ways, but perhaps more achievable. Would certainly open many doors.
I now have many fascinating roads to investigate, and much reading to do.
