To cut it short, there is this contradiction:
The smaller a model, the faster it can learn from fewer data points. Very small learners are not only compute-efficient during learning and inference, they are also sample-efficient.
The larger a model, the more "general" it becomes: it is able to handle more complex input within more contexts/use cases. Yet the cost increases once with model size (quadratically) for any given training data set, AND it increases again with the larger data set needed to cover the new particular cases.
An important breakthrough would be models that can be both large and small at the same time.
I think there are lots of reasons to approach intelligence from the perspective of a very large collective of micro agents (or micro experts), and my purpose with this topic is to:
- expose the why-s: justify this choice of architecture and its powerful advantages over "monolithic" models, or even over other existing collective models like MoE, random forests, etc.
- discuss the key properties such an idea should implement.
- expose some important how-s: what an agent should look like and what it should do, in order to achieve an emergent, incremental improvement of the collective "hive".
- and motivate myself and others to attempt an actual implementation of these ideas.
Here I close this first message; a more general overview will follow.
Conceptually there are two types of agents - experts and scouts, with different roles.
An expert’s purpose is to:
1. handle a restricted perspective/view into some global spatial-temporal input. This is the expert's narrow scope.
2. provide a reliable opinion on its input when that input matches its domain of expertise. In general, each opinion is a recognized/learned pattern.
3. assess (or learn) a level of confidence regarding whether its response at (2.) is correct. Unlike other ML evaluators, a micro-expert can (and often does) say "I don't know what I am looking at".
4. learn to recommend a handful of different experts that might have a higher level of confidence (at 3. above) about the current data frame/context.
Most experts are normally dormant; only a few, presumably the most useful ones, are active at any given time in order to confidently handle the current context or "problem at hand". The point is to maximize global confidence by pausing low-confidence experts and activating dormant ones based on the most recent recommendations.
Scouts are simply new agents instantiated to handle situations where the expert processing above fails to produce the expected level of confidence. So while an expert agent is meant to handle what is (recognized as) known, scouts are meant to handle the unknown, the unpredictable.
Every new scouting agent is instantiated with a new restricted scope (1. above) and tries to learn/detect a pattern where existing experts fail to see one. Since the learning capacity of any single agent is very restricted, it also learns very fast from very few samples (milliseconds and dozens of samples), so the learning/scouting process can be evolutionary and massively parallelized.
One important property: a scout's input (aka scope) is not restricted to "raw" input but can be the output (opinion) of any established expert.
For this process to be reliable, it is important that once a scout is promoted to expert (because it discovered a new pattern) it no longer updates the weights for the pattern itself, although an expert continues to update and improve its recommendation weights.
TLDR: a HoµA system maintains a pool of sparsely active expert agents to handle the known, and when it encounters the unknown it invokes many scouting agents to search for and discover new patterns.
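To pin down the loop described so far, here is a minimal toy sketch in Python. Everything in it (the class names, the one-field scopes, the confidence threshold) is an illustrative assumption of mine; real experts would of course wrap actual learned models rather than a single memorized pattern.

```python
import random

CONF_THRESHOLD = 0.5

class MicroExpert:
    """One tiny learner with a fixed narrow scope (a single context field here)."""
    def __init__(self, key, pattern):
        self.key = key            # the expert's scope: one field of the context
        self.pattern = pattern    # the single pattern it has learned (frozen)
        self.recommended = []     # experts to try when this one is unsure

    def opine(self, context):
        """Return (opinion, confidence); low confidence means "I don't know"."""
        if context.get(self.key) == self.pattern:
            return self.pattern, 1.0
        return None, 0.0

class Hive:
    def __init__(self):
        self.experts = []         # established (frozen) experts
        self.active = []          # a full system would keep only a sparse
                                  # active subset, swapped via recommendations

    def step(self, context):
        best_conf, best_opinion = 0.0, None
        for e in self.active or self.experts:
            opinion, conf = e.opine(context)
            if conf > best_conf:
                best_conf, best_opinion = conf, opinion
        if best_conf < CONF_THRESHOLD:
            self.scout(context)   # the unknown: spawn a scouting agent
            return None
        return best_opinion

    def scout(self, context):
        # a scout picks a new narrow scope; if it finds a pattern it is
        # promoted to an expert whose pattern weights no longer update
        key = random.choice(list(context))
        self.experts.append(MicroExpert(key, context[key]))

hive = Hive()
frame = {"shape": "circle", "color": "red"}
assert hive.step(frame) is None       # unknown: a scout is spawned
assert hive.step(frame) is not None   # the new expert now recognizes the frame
```

The point of the sketch is the asymmetry: the expert path is a cheap lookup over a sparse pool, while `scout` is where all the expensive search would happen.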
Not entirely related, but… I thought it would be interesting if an intelligent system were able to provision small fast learners on demand and throw them away as needed… or save and enhance them, expanding them to use more resources as they grow (or as required) over time, perhaps starting with a small set of neurons and growing later.
I've expressed something like this in one of my posts in the past. Maybe not exactly this, but it sounds like it. I thought of a learning Spatial Pooler with many states. These states are created and frozen the moment they mature, primed for some set of patterns (experts). The Spatial Pooler continues to learn while keeping these ancestor states, and a consensus algorithm is used to decide the overall outcome of the Spatial Pooler model. The main motivation was that SPs are relatively simple but can easily hit catastrophic forgetting, because they have fewer parameters and are not differentiable (not smooth); I discovered this by experimentation. So why not freeze these primed SP states and use them together with other frozen SP states (we can also call them agents) to generate a potentially better model?
Yes, that's one goal of this (sketch of an) architecture. "Scouting", which is searching for and learning correlates in the available data, requires vastly more computational resources than expert inference.
A few important consequences here:
- the usage pattern of these two groups is asymmetrical: we spend most of our time in expert mode, doing the "right" moves effortlessly, and when our inner experts fail we get sluggish, trying to figure out a solution among many possibilities. We stop to ponder.
But we spend relatively little time in that sluggish scouting mode. Well, children's play likely involves more scouting than adults' activities do.
- therefore one can hypothesize that most of our brain's capacity is kept in reserve for the huge search over correlative patterns needed to reach a conclusion in reasonable time whenever the surprise, the unexpected, is encountered. Doing what we do when we don't know what to do.
- which, if true, means a practical machine intelligence might not be that hard: run the cheap experts locally (on the "edge") and the expensive scouts in a cluster which, since a single "head" doesn't need it frequently, can serve multiple heads in parallel.
The same can be applied to Temporal Memory too. That's even simpler (to me at least) to visualize:
Start with a "shallow" TM, e.g. columns only one cell deep. Every time a cell proves itself by correctly activating a few times, just freeze it:
- from that point on, don't even attempt learning for that cell;
- add a fresh, "virgin" cell at the top of the column stack;
- during normal operation, evaluate cells from bottom to top (frozen experts first). In non-learning operation, don't even bother to query unfrozen cells.
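The frozen-cell column scheme can be sketched minimally like this. The names (`Cell`, `Column`, `FREEZE_AFTER`) and the trivial one-pattern "prediction" are my own placeholder assumptions, not real TM machinery; only the freeze-and-stack logic follows the description above.

```python
FREEZE_AFTER = 3   # correct activations needed before a cell freezes

class Cell:
    def __init__(self):
        self.pattern = None     # the context this cell learns to predict from
        self.hits = 0
        self.frozen = False

    def predicts(self, context):
        return self.pattern == context

class Column:
    def __init__(self):
        self.stack = [Cell()]   # start "shallow": one learning cell

    def step(self, context, learn=True):
        # evaluate bottom-to-top: frozen experts first
        for cell in self.stack:
            if cell.frozen and cell.predicts(context):
                return cell     # expert hit: no need to query the learner
        top = self.stack[-1]
        if learn:
            if top.predicts(context):
                top.hits += 1
                if top.hits >= FREEZE_AFTER:
                    top.frozen = True
                    self.stack.append(Cell())   # fresh "virgin" cell on top
            else:
                top.pattern, top.hits = context, 1   # scout relearns
        return top if top.predicts(context) else None

col = Column()
for _ in range(4):
    col.step("A")
assert col.stack[0].frozen and len(col.stack) == 2
```

After a few confirmations the bottom cell freezes, learning is never attempted on it again, and a new scouting cell appears on top of it.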
The above has a couple of caveats:
- it should account for future cases where an expert cell makes a bad prediction. Then the cell above it should learn to inhibit the activation of its underlying cells, so the TM can handle exceptions in a quite straightforward manner.
- a learning cell should be allowed to freeze only when its input comes from experts. Probably the simplest thing to try here is to have a scouting cell feed only from expert activations.
This way it is ensured that new learning is stacked upon the reliable behaviour of established experts/knowledge.
A couple of variations on the above:
- instead of one learning cell, maintain a relatively short stack (e.g. 3-4) of "scouting" cells on top of each column.
- the whole learning process can be massively parallelized: replicate the experts on a dozen or thousands of nodes, each node "experimenting" with different scout connections. When one node discovers a new expert, simply broadcast it to all nodes and keep going.
PS: while I found this approach interesting, the HoµA idea I started this topic with, though similar in some ways, is NOT a refurbishing of the TM mechanism. I think it is more general: it does not rely on a specific learner, and it includes the concept of scopes, which I'll detail further.
PS2: before dwelling on the notion of scopes, I would note that TM performance improves not only because learning can be massively parallelized (if needed) but also because expert inference can be accelerated by a few tricks:
- an expert cell can keep only its useful synapses/segments and discard all others. No need to look further.
- once an expert cell activates, we don't need to query the cells on top of it, except those that were marked/trained for column inhibition (which should be much fewer).

Since the most frequent patterns are likely to be learned first, this speedup should follow a Pareto distribution: most patterns will be recognized by the first few "basic" experts.

- the whole frozen-cell sub-network can be traversed in reverse: instead of each cell looking back through each synapse to see whether its input cells were active in the previous time step (dendritic synapses), we can have the active cells use forward (axonal) synapses to increment a counter in their downstream cells, a counter that gets checked in the following time step. Thus expert processing gets sparsified too, by switching from dendritic to axonal operation.
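A minimal sketch of this dendritic-to-axonal switch, assuming a toy frozen network where each cell fires once a counter reaches its threshold. The structure names are illustrative; the point is that only the sparse set of active cells does any work each step.

```python
from collections import defaultdict

# frozen network: each active source cell pushes along its axonal links
axons = {
    "a": ["c"],
    "b": ["c", "d"],
}
threshold = {"c": 2, "d": 1}   # per-cell matching threshold (assumption)

def step(active_cells):
    """Axonal pass: active cells increment downstream counters; the counters
    are then compared against thresholds for the next time step."""
    counters = defaultdict(int)
    for src in active_cells:               # only sparse active cells do work
        for dst in axons.get(src, []):
            counters[dst] += 1
    return {c for c, n in counters.items() if n >= threshold[c]}

assert step({"a", "b"}) == {"c", "d"}
assert step({"a"}) == set()   # "c" needs 2 active inputs, so nothing fires
```

Compare with the dendritic direction, where every cell would have to scan all of its input synapses every step regardless of how sparse the activity is.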
An agent's Scope
A scope is an instrument through which an agent gets a narrow, particular perspective of the world.
In the physical world there are all kinds of scopes - telescopes, microscopes, stethoscopes, periscopes, endoscopes and so on - each optimized for a specific, simpler perspective. The human fovea is another type of scope.
In HTM terminology a scope is an SDR encoder. But instead of trying to encode all available inputs into a single (and potentially very large) input SDR that captures all details and feeds a large learning network, it selects only a handful of details and encodes them into a relatively small, fixed-size SDR that is passed to a scout's relatively limited learning model.
Unlike their physical counterparts, a micro agent's scope provides a much more simplified representation - let's say a 128 or 256 bit SDR.
There are a few advantages to a tiny agent with a tiny scope focused on a few features/input values in its small field of view:
- the local complexity to which an agent is exposed is very low;
- hence the computing resources needed for learning, inference and the scope itself are also tiny. Instantiating, training and testing a new agent with a new scope can take around a second on a single core;
- this allows a massive parallel search - evolutionary, systematic or even hand-crafted - for relevant scopes;
- whenever a new scope proves useful, since it focuses on only a very few features of the context, it is relatively easy to expose the relevant ones and their significant correlations.
The available context
The context has two sources:
- external data: recent raw data from sensors - the "true" input;
- internal data: recent activations of established experts - internal, or processed, input.
A particular scope tested by a scout exposes input from a handful of points in the context. The scout then learns within that scope and we check whether its model "discovers" a significant pattern; if it does not, the scope and its model are discarded and a different scope is instantiated and tried.
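A deliberately tiny sketch of this scout trial loop. The context samples mix "raw" sensor fields with one established expert's output; a candidate scope is just an ordered pair of context points, and the "significant pattern" test (a consistent, reused mapping) is a placeholder assumption standing in for the real learning model.

```python
from itertools import permutations

def try_scout(samples):
    """Systematically try tiny two-point scopes over the context samples."""
    for a, b in permutations(samples[0], 2):     # each pair = a candidate scope
        mapping, ok = {}, True
        for s in samples:                        # fast learning over few samples
            if mapping.setdefault(s[a], s[b]) != s[b]:
                ok = False                       # no stable pattern: discard scope
                break
        if ok and len(mapping) < len(samples):   # require reuse, not memorization
            return a, b, mapping                 # promote: scope found a pattern
    return None                                  # all candidate scopes failed

# raw input ("x", "noise") plus an established expert's opinion ("expert1")
samples = [{"x": i % 2, "noise": i * 0.37, "expert1": (i % 2) * 10}
           for i in range(8)]
found = try_scout(samples)
assert found[:2] == ("x", "expert1")             # the correlated pair is found
```

The scopes involving `noise` are discarded because no stable pattern appears there; the scope relating raw input to the expert's opinion survives, which also illustrates that a scout may feed from expert outputs, not just raw data.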
Let's say you have 24x24 input image data - could you please explain the scopes in action here, and would they need to overlap with each other?
The reason for asking is that I am interested in experimenting with this micro agents/experts concept.
24x24 is a bit extreme; I imagine you got that by cropping out the 2-pixel white border of MNIST images.
I’ll stick to grays (no colors).
Before continuing: my plan is to experiment with fixed 128-long vectors.
Any encoder (aka scope) produces a 128-long dense vector which can be fed to an attached agent;
the dense vectors from a few scopes are added up by a… correlator scope.
I touched on the rationale for the simple addition of a few dense embeddings, and for the term "correlator", in the PS here.
In order to allow potentially many agents to have different perspectives on the same image, I assume a useful approach is to use patches,
and to keep things simple I'd start with squares (aka windows of focus) of various sizes.
And I assume the scope embedding should encode both where and what.
where: the x, y patch coordinates and its size -
three 128-long scalar embeddings, added together and normalized to produce a 128-long "where" vector. Here's some arguing for that.
what: a 128-long scalar vector obtained somewhat as here from whatever the small window contains.
And the output of this particular scope is obtained by adding the above where and what vectors.
This 128-long dense output can be used alone by its containing agent's learning/estimator ML model(s), OR can be used by a compounding scope that adds it to the outputs from:
- another scope looking at a different patch in the same image,
- or a patch from the recent past, to emphasize motion,
- or a scope looking at an entirely different source, e.g. a sound, a pressure, or the contents of a pocket.
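To make the where/what construction concrete, here is a minimal pure-Python sketch for the 24x24 case. The sinusoidal scalar embedding and the random-projection "what" embedding are my own placeholder assumptions standing in for whatever the linked posts describe; only the add-and-normalize structure follows the text above.

```python
import math
import random

D = 128
random.seed(0)
# fixed random projection for flattened 8x8 patches (an assumption)
PROJ = [[random.gauss(0, 1) / 8 for _ in range(64)] for _ in range(D)]

def embed_scalar(v):
    """One scalar -> 128-long dense vector (sinusoidal encoding, an assumption)."""
    out = []
    for i in range(D // 2):
        f = math.exp(-6 * i / (D // 2 - 1))
        out += [math.sin(v * f), math.cos(v * f)]
    return out

def add(*vecs):
    return [sum(c) for c in zip(*vecs)]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def where_vector(x, y, size):
    # three scalar embeddings, added together and normalized
    return normalize(add(embed_scalar(x), embed_scalar(y), embed_scalar(size)))

def what_vector(patch_flat):
    # project the flattened patch content to 128 dims
    return [sum(w * p for w, p in zip(row, patch_flat)) for row in PROJ]

def scope_output(image, x, y, size=8):
    patch = [image[r][c] for r in range(y, y + size) for c in range(x, x + size)]
    return add(where_vector(x, y, size), what_vector(patch))

image = [[random.random() for _ in range(24)] for _ in range(24)]
# a "correlator" scope simply adds the outputs of two patch scopes
out = add(scope_output(image, 0, 0), scope_output(image, 12, 4))
assert len(out) == D
```

Everything stays 128-long throughout, so any agent or compounding (correlator) scope can consume the result without caring which patch, or how many patches, produced it.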
Regarding your question on overlapping.
At least apparently, it makes little sense:
- to have two different scopes looking at the same patch. A potential exception would be if they use entirely different algorithms; maybe a small CNN trained to output a 128-long embedding on low-resolution images could be more useful, or several CNNs trained with different criteria, different hyperparameters or different datasets.
- to have two agents fed from the same scope. A possible exception would be if they are needed in different contexts, but I think such a case is pretty far down the line.
It should make sense to have many different correlative scopes that all share the feed from one wide-view scope (e.g. the full 24x24 image), each combined independently with different smaller patches from different positions.
This is pretty similar to the attention mechanism in vision transformers, which computes a correlation value between two patches; unlike there, there is no restriction on patch size and position in what I described here.