I’m not sure if my implementation is incorrect, or if I don’t understand something about the RDSE. It seems really wasteful to have more buckets than values.
I think Matt will like to hear my perspective here.
I agree that the sampling linear encoder might be the ideal stateless scalar encoder.
By default, it’s a good bounded scalar encoder, i.e. with a min and max.
If you hash its output bits, it’s a good unbounded scalar encoder.
This is what you’re referring to.
The RDSE doesn’t use a hash. It is stateful, storing a memory of encodings that it has used. So it’s going to use up more memory and will potentially be slower, and its output will vary from trial to trial. Given the same parameters, two RDSEs will likely encode 42 differently, even if their parameters are the same.
On the other hand, the RDSE guarantees the overlap properties of its encodings. You can’t get that anywhere else. So I think the RDSE has a niche, maybe a large one. In experiments it’s comforting to know that hash collisions aren’t causing weirdness, and in real-world applications the number of input scalars might be low enough to avoid memory issues, especially when compared to an HTM’s synapses.
(We could talk about the “middle out” varying precision of the sampling linear encoder, but I’ll save for another day.)
Hi @rhyolight as @floybix and @mrcslws say, this is because the RDSE is stateful. The way it works is that each bucket’s encoding differs from its neighbours by exactly one bit. If you have a new input value greater than your current biggest bucket, you need to add buckets one by one (changing one bit “at random” each time) until you have a bucket for the new value. The same applies in the negative direction.
This means that a naive implementation would have an encoding which depends on the entire and exact sequence of inputs seen, and the which is a really bad idea. Even worse, if you use the global random number generator, your encoding will change depending on who else is pulling numbers out of the RNG. The only option is to ensure that the bucket creation process is deterministic, and that means that you have to use a private RNG and a starting “center”, and you must build out your buckets in both positive and negative directions whenever you need to extend the range.
Using your example in the video, let’s say you start off with a center of 500 and a resolution of 0.25. If the first input is 600, you’ll have to grow 400 buckets up from 500 to 600, and 400 down from 500 to 400. So you’ll end up with 800 (or 801) buckets, many of which might never be used (in your example all the non-integer buckets are waste).
NuPIC does this by using its offset parameter to identify the center of the RDSE, and it holds a private stateful RNG in its random field. But NuPIC appears to have a bug because it doesn’t do the growing in both directions (see the createBucket() method. This means that the encoding will differ depending on whether you grow upwards or downwards when you need a new bucket.
For a very early study of how this works, see this page.
I noticed this, but I don’t understand why it matters as long as encodings are not “shared” between different models? Even though each RDSE might encode data differently depending on the order it sees the data, in isolation every model will get a solid representation from the RDSE.
That’s correct, @rhyolight, but it’s important to have a deterministic encoding of each input value, regardless of the order of presentation. This allows you to change out the HTM network structure, use different sized layers with different parameters, etc, turn on and off learning, and have reproducible results. More importantly, you can aggregate inputs over time, use the median or mean of the aggregates, change the aggregation size, etc. And you can start your sequence earlier or later, repeat parts of the sequence, and so on. If your encoding depends on the presentation sequence this becomes very difficult or impossible.
On a similar note, why is it that the number of bucktes is restricted to 1000?
I tried playing around with it (changing the constant in the source code) and already at 1700 there was a noticible slow down.
What should I do if I wish to have descent resolution and descent range of values at the same time?
A 10,000 buckets could do. Now what encoder should I go for?
It should be possible to extend an encoder to arbitrary ranges and resolutions if you use something like a grid cell representation for the encoding. Choose some number of cells to use per module (say 64), and a unit length scale (resolution) associated with each module. Like for example, you may choose a unit length of one for the first module and thus you would have a representation of all integer values modulo 64 (i.e. the pattern would wrap around when you go above 64 or below 0). To extend the range, you would add another module with a larger unit value (like say 64). You now have the capacity to represent up to 64^2 unique integers. To increase the resolution, you would choose a smaller length scale (like say 1/64). Of course the length scales don’t have to be powers of one another, and there may be a robustness argument to make for allowing some overlap in length scales, or multiple modules with similar length scales. The point is that it should always be possible to extend the range and/or resolution of the representation by simply adding more modules. There may be some adjustment needed in how the downstream layers connect to the encoder, but all previously learned representations should be unaffected. At worst, there may simply be additional bits that are now available from the new modules.
EDIT: Upon further reflection, I realize that what I have proposed here is similar to how we add digits to either end of a number to increase the capacity and/or resolution of the value. However, we are typically not able to attend to all of those scales at the same time. So, it may be that we are biologically constrained to only a finite number of grid cell modules and therefore we can only consider/compare values on scales of roughly two or three orders of magnitude. If two-values are not on roughly the same scale, the values are essentially incomparable to one-another (besides one being “much larger” or “much smaller”).
I’m still thinking about how this notion of extensibility (in precision and capacity) can be adapted to other non-scalar and/or higher-dimensional properties.