Intelligence is embarrassingly simple. Part 2

Artie in Shrek 3: Parties, princesses, castles. Princesses.

Marvin in RED 2: Frank is a very simple creature with very simple needs. Okay?
It’s killing, eating, sexting, eating… Killing, I guess.

I’m kind of following with: trucking, hacking, thinking, conferencing. Coding, I guess.

This season I’m out there trucking, so I cannot get back as promptly as I probably should.

I suggest moving from complex AI speculations to simple experiments, followed by
analysis from a math/AI/neuroscience point of view. Some daredevils will code,
and other good people will make sure the snake oil is not marketed as science.

Here is exercise #1:

Consider the input as three unbounded sequences of integer tokens. The nature of the sequences
is unknown to the inference engine.
For example: the sequences could be quantized readings of rounded y = 100 * sin(x * r),
or character (word) codes of long texts, or heartbeats, or whatever.
Two of the sequences must be similar to each other and differ from the third.
Like feeding in two Twain novels and one Stowe. Or two healthy heartbeats and
one sick.

The problem: suggest an algorithm/method to compute similarities between the sequences
as every next token is fed in - continually.
Absolute values do not matter; the goal is to calculate the similarity of the sequences continually, and the method must
discover the similar ones.

Example: sin(x * 1.05) is similar to sin(x * 1.15), and both are dissimilar to sin(x * 1.5).
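For concreteness, here is a minimal sketch of such a token generator (the step size, the quantization and the class name are just my illustration, not part of the task):

```java
import java.util.function.IntSupplier;

public class SineTokenStream {
    // Emits quantized readings of y = 100 * sin(x * r) as integer tokens,
    // one token per call; x advances by a fixed step.
    static IntSupplier stream(double r, double step) {
        double[] x = {0.0};
        return () -> {
            int token = (int) Math.round(100.0 * Math.sin(x[0] * r));
            x[0] += step;
            return token;
        };
    }

    public static void main(String[] args) {
        IntSupplier s1 = stream(1.05, 0.1);   // similar to s2
        IntSupplier s2 = stream(1.15, 0.1);
        IntSupplier s3 = stream(1.50, 0.1);   // the odd one out
        for (int i = 0; i < 10; i++) {
            System.out.printf("%4d %4d %4d%n",
                s1.getAsInt(), s2.getAsInt(), s3.getAsInt());
        }
    }
}
```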

Then we increase the complexity of the problem step by step and see where it takes us…
Good luck!

3 Likes

You define a metric of similarity between input data points first (MSP), then a metric of similarity between sequences of points (MSS), and if some data doesn’t like it you define another! Now which metric is liked by most data, aka is general?

PS: in the sin() case, oscillations at close frequencies produce interference beats (pulses) when the two waves overlap (are added). This is very data-specific, but it’s the most obvious observation.
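For reference, the identity behind those beats (taking the two close frequencies as $\omega_1$ and $\omega_2$; the envelope oscillates at the slow difference frequency):

$$\sin(\omega_1 x) + \sin(\omega_2 x) = 2\,\cos\!\left(\frac{(\omega_1 - \omega_2)\,x}{2}\right)\sin\!\left(\frac{(\omega_1 + \omega_2)\,x}{2}\right)$$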

3 Likes

Good, a very general approach. Take a look at an example: s1=“1234”, s2=“1254”, s3=“2567”. Obviously, s1 and s2 are similar. My university project, circa the 1980s.
What metrics are there? String metric - Wikipedia

Finding that one could be a problem. With every next token you must either compare inputs [strings] of ever-increasing length, or compare an ever-increasing number of substrings if you decide to put a limit on the “sensor” size - the window.
Both approaches lead into hyper-dimensional computing territory (Hyperdimensional computing - Wikipedia), with limits of up to 10^4 dimensions (vector length) AFAIK. The Dimensionality Curse rules here.
What approach would you take?

1 Like

A limit on the sensor window size is a requirement. All sensors have limited bandwidth in reality.

2 Likes

Good observation. So that leaves us with an unlimited number of substrings [frames] generated by the sensors. How do we handle that unlimited number?

2 Likes

Compression, fixed-window buffering, a hierarchy from small features to large features.

I’ve looked at some of your other posts (Linkedin etc). Seems to me you already do this :slight_smile:

2 Likes

Another good observation:

There is a working POC at https://github.com/MasterAlgo/Simply-Spiking: just start the jar after editing the properties file (number of sequences, size of a sensor). The best shot at looking into the sources is https://github.com/MasterAlgo/Simply-Spiking/blob/main/TheCurse.zip , particularly C:\Projects\TheCurseGit\TheCurse\src\workers\ContinualNeuromorphicTrainerInferer.java , but it seems to be too complex for some good people.
I’m trying to see if there is a better way to reproduce it. The Net against The Fish.

Sounds about right. What is interesting - The Curse could be beaten with its own medicine - by a drastic increase of the dimensionality. An example: s1=“1234” (“abcd”) comprises tons of subsequences of different lengths (not to be confused with substrings).
Those subsequences, collected into some kind of container (array, tree, hashtable), make a perfect foundation for comparison. If high-complexity features (substrings, patterns) do not match, some simple ones will.
There are two problems here: building a reasonably efficient container, and a similarity measure between containers.
In my experiments, successful comparison is done on sets of features with up to a million components, on a commodity PC. Components of the sets could be represented as dots in a 10^6-dimensional space, with values being frequencies of occurrence (as integers or adjusted (deflated) real numbers).
There are many interesting consequences of that type of memorizing-comparing, and I intend to advise on building a novel similarity measure.
But that should really be a community project.
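A minimal sketch of that container idea (class names and the length-based weighting are my own illustration, not the POC’s actual code): collect the subsequences of a token window into a frequency map, then score two maps by their weighted overlap, so that when complex features don’t match the simple ones still contribute.

```java
import java.util.HashMap;
import java.util.Map;

public class SubsequenceContainer {
    final Map<String, Integer> counts = new HashMap<>();

    // Collect every non-empty subsequence of the window (2^n - 1 of them)
    // into the frequency container.
    void addWindow(String window) {
        int n = window.length();
        for (int mask = 1; mask < (1 << n); mask++) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < n; i++)
                if ((mask & (1 << i)) != 0) sb.append(window.charAt(i));
            counts.merge(sb.toString(), 1, Integer::sum);
        }
    }

    // Weighted overlap: shared features add to the score,
    // and longer (more complex) shared features count more.
    static long similarity(SubsequenceContainer a, SubsequenceContainer b) {
        long score = 0;
        for (Map.Entry<String, Integer> e : a.counts.entrySet()) {
            Integer other = b.counts.get(e.getKey());
            if (other != null)
                score += (long) e.getKey().length() * Math.min(e.getValue(), other);
        }
        return score;
    }
}
```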

2 Likes

Thank you, I’ll take a deeper look at the source and have a play with your POC. I don’t have any Java experience but I’ll see if I can make some sense of it.

Thanks for the ideas. I’m still in the process of putting together a design of my own so it’s good to hear your thoughts.

Indeed, I was planning on using something like this: GitHub - facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors. for similarity comparison against a hashtable of the base chars or substrings.

Efficiency is the big challenge with all of this. I’d like it to have real-time levels of responsiveness.

2 Likes

If you want to sell stuff, efficiency is important. If you want to understand the source of your curiosity - not so much.
A features container implemented as a hashtable has time complexity O(number of features); implemented as a hierarchical net on something like an Intel Loihi (not affiliated), the complexity is O(number of layers), plus it is very sparse and energy efficient.

1 Like

Curiosity is my main driver - but I also enjoy making efficient systems :slight_smile: Otherwise I may as well just use other people’s software. It’s fun to create something that performs well.

Indeed, the complexity is O(features), although parallelism will help here.

I have a couple of questions I’m curious to hear your thoughts on:

  • Is there perhaps a way to utilize the von Neumann architecture towards its own strengths, such that we can find a different approach to AI that is fitted to this architecture, as opposed to using large vectors? The key difference here is larger (but disconnected) memory storage and a more linear compute model (although with a faster clock rate for each core). So how can we take advantage of the strengths of this architecture by leaning more heavily on the larger memory storage and faster single-core processing? Clearly vector dot products and the like are not well suited to this; they’re suited to parallel architectures.

  • Another thing I’ve been struggling to figure out: the actual sorting mechanism for the features. Going from single letters, to words, to phrases and concepts, and so forth. An example: “greeting” is a word, but it’s also an abstract concept that relates to “an action of giving a welcome”, and so forth. There’s no clear organization or hierarchy here to me; it’s all very jumbled and chaotic. So I’m struggling to figure out a good way for my system to organize its own hierarchies.

Another factor is encoding time; time is very relevant. Even if a good strategy is found for sorting, it also needs to encode the concept of time. We intuitively know which memories are older than other memories. It’s obviously not perfect, but there’s some conception of time involved. And we also have the ability to reason on time as a factor, such as “this event happened too far after this event, so it’s likely unrelated, so it should go in a different sorting category than the category that links to that event”.

1 Like

Got to go, so I will be short. Like many complex concepts, some are overrated(?) - what is the right word?!
Look, consider a sensor of length 5 [tokens] and a sequence of 11 tokens. It looks like we need a concept of time to encode the long 11 in multiple short 5s. But :slight_smile: to encode “abracadabra” we need just 3 overlapping shorts: “abrac”, “acada” & “dabra”. If those shorts are represented by dedicated nodes (Instance Based Learning), and if those nodes employ “sustained spiking” - broadcasting their activation to the next level for some time - then only one specific node, representing “abracadabra”, will be activated.
Meaning, a static [hierarchical] dictionary - or a dimensionless set of features - easily memorizes and recognizes both spatial and temporal patterns.
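A toy sketch of that sustained-spiking trick (all names and the sustain length are illustrative, not the POC’s code): each short gets a dedicated node that stays “hot” for a few steps after it matches, and the higher-level node fires only when all of its shorts are hot at once.

```java
import java.util.List;

public class SustainedSpiking {
    static class ShortNode {
        final String pattern;
        int hotFor = 0;                                   // remaining steps of sustained spiking
        ShortNode(String pattern) { this.pattern = pattern; }
        void step(String window, int sustain) {
            if (pattern.equals(window)) hotFor = sustain; // start broadcasting
            else if (hotFor > 0) hotFor--;                // keep broadcasting, decaying
        }
        boolean isHot() { return hotFor > 0; }
    }

    public static void main(String[] args) {
        String input = "abracadabra";                     // 11 tokens, sensor length 5
        List<ShortNode> shorts = List.of(
            new ShortNode("abrac"), new ShortNode("acada"), new ShortNode("dabra"));

        // Slide the 5-token sensor over the 11-token sequence.
        for (int i = 0; i + 5 <= input.length(); i++) {
            String window = input.substring(i, i + 5);
            // A sustain of 7 steps bridges the gap between the first and the last short.
            shorts.forEach(n -> n.step(window, 7));
            // The "abracadabra" node fires only when all three shorts are hot at once.
            if (shorts.stream().allMatch(ShortNode::isHot))
                System.out.println("abracadabra recognized at position " + i);
        }
    }
}
```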
Do some more thinking :slight_smile: you are on the right path; just consider upgrading your nick “null” into something “people would like to do business with”(c). Gone for a few days, best.

3 Likes

Remember the prior input sequences and forward-propagate a decaying signal through the prior memory, decayed also relative to the original temporal input (i.e. store the temporal dimension in memory). Then find the top 2 signals (closest correlation). Then create a new memory with zero temporal offset for the last points in the two relative memory sequences.
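A rough sketch of the decaying-memory part of that idea, under my own simplifying assumptions (one shifting, exponentially decaying trace per sequence; a plain dot product as the correlation; frequencies scaled down for readable sampling):

```java
public class DecayingTrace {
    // One shifting, exponentially decaying trace of recent tokens per sequence:
    // on each step everything fades and shifts back, the newest token enters at full strength.
    static void update(double[] trace, int token, double decay) {
        for (int i = trace.length - 1; i > 0; i--) trace[i] = trace[i - 1] * decay;
        trace[0] = token;
    }

    // Correlation of two traces as a plain dot product (larger = more alike).
    static double similarity(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    public static void main(String[] args) {
        double[] t1 = new double[32], t2 = new double[32], t3 = new double[32];
        double s12 = 0, s13 = 0, s23 = 0;
        for (int x = 0; x < 200; x++) {               // feed three sine-derived token streams
            update(t1, (int) Math.round(100 * Math.sin(x * 0.105)), 0.9);
            update(t2, (int) Math.round(100 * Math.sin(x * 0.115)), 0.9);
            update(t3, (int) Math.round(100 * Math.sin(x * 0.150)), 0.9);
            s12 += similarity(t1, t2);                // running similarity, updated with every token
            s13 += similarity(t1, t3);
            s23 += similarity(t2, t3);
        }
        System.out.println("t1~t2: " + s12);          // the close frequencies accumulate the largest score
        System.out.println("t1~t3: " + s13);
        System.out.println("t2~t3: " + s23);
    }
}
```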

Harmonics could make 1.05 closer to 1.5 depending on perception of decay and input frequency of repetition (Theta wave replay ?).

Still reading the code when I have some spare time - you could make it a lot more efficient… removing repeated calculations, or reusing variables to avoid the garbage-collection overhead.

3 Likes

I don’t get you here; aren’t there existing methods that can already do this trivially? Or is this some rhetorical question to warm up towards associative memory, or some other variant?

For example, any gradient-based method would excel here. NNs would work fine as well.

But those are for robust representations. For your task, I could solve it with simple thresholding (use a DT if you’re fancy) and an MSE metric. So I’m still confused about what the aim of this question was, exactly…

2 Likes

One thing that (I assume) is required for intelligence is the discovery of sub-clusters in a spatio-temporal stream.

e.g. if we have 1000 consecutive frames, each a 1000-long vector - either SDRs or more detailed (1000 x 8/16/32-bit) vectors.

Normal clustering methods will compare full frames and group them by a distance metric.

Sub-clustering within a single frame space will discover sub-frames that tend to be correlated, e.g. “patterns” of 10…100 bits active together within a single frame. That’s simple spatial sub-clustering. It requires some compute, but it searches only a 1000-value space.

A spatio-temporal sub-clustering would search for correlations across multiple frames. The search space expands with the number of frames.

Which means the computational challenge of discovering correlations is massively more difficult. Instead of 1000 values there are now 1 million values to search for sub-clusters.

Here are two assumptions about how this problem is solved:

  1. Our biological networks have learned some tricks in order to reduce the compute of searching for correlations.
  2. Even with those tricks, it’s possible that (one of) the basic functions of each minicolumn is to search for a sub-correlation within all perceived data. Which means the search is distributed across all (or a significant part of) the minicolumns. Massively parallel searches for potentially new correlations.

But we still need to discover those tricks or find new ones.
That’s probably why backpropagating NN learning needs so much data: it’s a brute-force search. Besides a few architectural “breakthroughs”, there aren’t many tricks there.
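A brute-force sketch of the spatial part (pairwise co-occurrence counting over boolean frames; the threshold rule is just a placeholder, not a full clustering), to show the size of the search even before time is added:

```java
public class SpatialSubClusters {
    // Count how often each pair of bits is active in the same frame.
    // frames: [frameCount][width] boolean SDR-like vectors.
    static int[][] coactivation(boolean[][] frames) {
        int width = frames[0].length;
        int[][] together = new int[width][width];
        for (boolean[] frame : frames)
            for (int i = 0; i < width; i++)
                if (frame[i])
                    for (int j = i + 1; j < width; j++)
                        if (frame[j]) together[i][j]++;
        return together;
    }

    // Report pairs that co-fire in more than minFraction of the frames -
    // seeds for spatial sub-clusters.
    static void report(int[][] together, int frameCount, double minFraction) {
        for (int i = 0; i < together.length; i++)
            for (int j = i + 1; j < together.length; j++)
                if (together[i][j] > frameCount * minFraction)
                    System.out.println("bits " + i + " & " + j
                        + " co-fire " + together[i][j] + " times");
    }
}
```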

3 Likes

What if some of those tricks are inherently embedded within the sensory stream(s) themselves and not in the architecture?

Try to keep this very abstract and not get drawn into the particular sensory stream we call words… the sensory input “and” is an inherent correlator (shortening concept distances between the closest concepts - not necessarily the closest raw sensory inputs), which may (within an attention replay cycle) alter the relative timings of the replayed concepts, such that distances are altered within the replay and not in the original sensory stream. The replay is of the hierarchically derived concepts and not the original sensory stream - this, I think, is a critical consideration for scaling properly.

Touch is already physically distance-aligned, so those raw streams may not need to pass through any temporal shifting, which biologically could be why nerve endings don’t pass through particular areas. Nerves are always raw…

Occipital depth, colour and pattern differentials as we read an environment.

To me, this temporal shifting in an attention replay cycle could fit, and it removes some of the distance issues a raw, temporally aligned stream creates. Replaying the hierarchical concepts (whatever they have been activated from/by) also reduces the data complexity and should scale much better.

We biologically “read” images as a story; we don’t process full frames. So the basic question is: why are we processing full frames and thinking it’s right? It gets results, yes, but is it actually efficient, or just the horse we have at the moment? We now process full pages/short books at a time with the likes of GPT.

Where and how this shifting may occur biologically, even if it does, I’m still figuring out. It’s just my hypothesis and experiments. In code it appears to work for any language, as they all share the same attention and alignment patterns.

The code from bullbash is interesting and could do with an attention mechanism to allow for a longer refresh to help persist what is significant - attention as feedback from the memory - focusing on incremental additions to existing memory rather than completely new exposure to the unknown. The code also assumes a flat time, so the attention needs to be longer to allow the distances to be resolved with more iterations.

One thing does not do everything.

1 Like

Not continually, no.
See the local thread “Continual Lifelong Learning Paper Review / Jeff Hawkins on Grid Cell Modules” (February), or google “continual learning” and check how it goes.
There was “vowpal wabbit” once, working with text continually, but generally - continual learning is not solved.
Don’t even start with your beloved gradients. There are about 100 metrics for comparing strings (google them); none would work continually and/or on long strings.

I answer that at the bottom of the thread.

1 Like

Sounds simple, feels like an easy implementation - why wouldn’t you try to code it?

You got me here :frowning: so embarrassed … of course I could. To make it really efficient on the von Neumann architecture I’d have to fully redo something like the Java GC, because in the present implementation the process generates and kills tens of millions of nodes and billions of synapses [an hour].

Anyways, I’ll try to generalize some short suggestions below.

2 Likes

See the numbers below the line “Last SuperCells producers”? 138, 1, 43, … - those are the numbers of sub-clusters for each handwritten digit. If I remember properly, every MNIST class is defined by a number of 4-pixel BLOBs augmented with “saccades”. The next image shows yellow [empty] squares - those are the BLOBs. There is a huge number of combinations of around 10-20(?) pixels out of 28x28 - on the order of C(784, 20). Not an easy task.
But from those low-accuracy, simple 20-pixel combinations my net grew stable “granny cells” of around 40 pixels (yellow squares with green cores) - which gives a search space of C(784, 40). Incomputable, but there they are. Same GPT-Teaser platform… just a different [image] driver.
I saw them - grannies, (so beautiful(c)) - and gave up further work. This should not be conducted in a garage lab.

1 Like

Ok, good people, I started with a simple question: “Take a look at an example: s1=“1234”, s2=“1254”, s3=“2567”.”
I was kind of expecting to see somebody come up with:
create dictionaries of 2^4 - 1 = 15 entries for each string, like: “***4”, “**3*”, …, “1234”.
Compare the matching patterns-entries; you might weight them by complexity or whatever. That’s the metric.
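A literal, throwaway sketch of that answer (my own code, not the repo’s): build the 2^4 - 1 = 15 masked entries for each 4-character string, count the entries two dictionaries share, and s1 vs s2 scores visibly higher than either vs s3.

```java
import java.util.HashSet;
import java.util.Set;

public class MaskedDictionary {
    // All 2^n - 1 masked entries of a string: kept positions stay, the rest become '*'.
    static Set<String> entries(String s) {
        Set<String> dict = new HashSet<>();
        int n = s.length();
        for (int mask = 1; mask < (1 << n); mask++) {
            char[] e = new char[n];
            for (int i = 0; i < n; i++)
                e[i] = ((mask & (1 << i)) != 0) ? s.charAt(i) : '*';
            dict.add(new String(e));
        }
        return dict;
    }

    // The metric: how many entries two dictionaries share
    // (entries could also be weighted by complexity, e.g. the number of kept positions).
    static int shared(Set<String> a, Set<String> b) {
        int count = 0;
        for (String e : a) if (b.contains(e)) count++;
        return count;
    }

    public static void main(String[] args) {
        Set<String> s1 = entries("1234"), s2 = entries("1254"), s3 = entries("2567");
        System.out.println("s1~s2: " + shared(s1, s2));   // 7 shared entries - high
        System.out.println("s1~s3: " + shared(s1, s3));   // 0 - low
        System.out.println("s2~s3: " + shared(s2, s3));   // 0 - low
    }
}
```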

It solves:

  • continual learning - the [hierarchical] dictionaries can be updated with every new token
  • knowledge transfer - those dictionaries are “mergeable”, no catastrophic forgetting
  • local learning, plasticities, multimodality, something else…

That moves you from “model” territory into the undiscovered terrain of “Instance-based learning”.
Told you, embarrassingly simple. A naive, super-vanilla implementation is here:
GitHub - MasterAlgo/Simply-Spiking (TheCurse.zip - sources; the essence is ~500 lines in ContinualNeuromorphicTrainerInferer.java).

I was going to elevate the initial task to an arbitrary number of sequences, and to replace a sequence with a “channel” of a number of sequences - to model multimodality, introduce irregular sampling and noise.
Grow complex patterns from simple ones.

I was gonna make you come up with a neural architecture implementing those dictionaries.
I was also gonna explain the emerging “psychology” of such architectures.
But we did not get through the first step, and I’m tired of motivating people, and I’m just off my week-long truck trip, and “the topic has been solved”(c)? What does that mean - it was not.
Anyways, I demand some beer from Jeff Hawkins to continue! Consider it bribery! Good luck guys, later.

2 Likes

Your code with the hash lookup just needs some threading, rather than anything to do with the GC, since with single-threaded code the GC easily works in the background. I found that with a hash / memory-lookup approach the issue is more to do with memory channels and access latencies, which throttle the CPU at a low utilisation rate if you don’t take the memory allocation into account. I spent several months just writing code to scale across machines, to use a couple of hundred memory channels with just over 1 TB across 20 machines for my experiments. I can get about a billion updates a second (read/calc/update) with my stack of room heaters. That is incredibly slow compared to a GPU route, but this is with a hash-and-lookup approach. Still, it’s an experiment.

2 Likes