NN, centroid clustering, search

From Is billions of parameters (or more) the only solution? - #105 by bkaz. @neel_g, sorry for the late reply, and your post has disappeared for some reason.

  • Re perceptron learning, it’s not technically Hebbian; I meant qualitatively the same type: weight increase / decrease in proportion to some measure of similarity between the weighted input and the normalized sum. The difference is a higher order of quantization in the perceptron: grey-scale vs. binary in Hebbian learning.

  • Re centroid clustering, I think you meant K-means, not KNN, but that’s just one implementation. The general principle is the same as in the perceptron: elements are included / excluded from the cluster according to some measure of their numerical similarity to the centroid, which is a normalized sum of the elements, recursively updated.
    Again, the difference is one of degree: in the perceptron this inclusion / exclusion is done by increasing / decreasing the weights: grey-scale vs. binary in conventional clustering. And the clusters, which are connections for the perceptron, are fuzzy, with overlapping receptive fields (a minimal sketch contrasting hard and soft inclusion follows this list).

  • Re deviation vs. similarity in backprop, deviation is commonly used as an inverse measure of numeric similarity: similarity = average_deviation - instance_deviation. In the common-sense meaning of these terms, similarity is inverse or complementary to distance / difference / deviation.

  • Re search in NN, I guess it’s not directly one-to-one but mediated by middle nodes: query to category and then to instances.
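
A minimal sketch of the contrast drawn above, on a toy 1-D dataset (my illustration, not code from the thread): hard, binary inclusion of elements in a centroid cluster versus soft, grey-scale inclusion via weights, with the centroid as a recursively updated normalized sum.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random(20)          # toy 1-D "receptive field"
centroid = points.mean()

for _ in range(10):
    # binary inclusion, as in conventional centroid clustering:
    # an element is either in the cluster or not
    hard_members = points[np.abs(points - centroid) < 0.25]

    # grey-scale inclusion, as in the perceptron analogy:
    # every element gets a weight by similarity (inverse deviation) to the centroid
    weights = 1.0 / (1e-6 + np.abs(points - centroid))

    # centroid = normalized sum of the weighted elements, recursively updated
    centroid = np.dot(weights, points) / weights.sum()

print(round(float(centroid), 3), hard_members.size)
```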

Re your rant on diffusion: sure, I am curious.

2 Likes

The reason I don’t usually describe SGD with an MLP as “Hebbian” is that Hebbian learning implicitly assumes that computation in a neural network (biological or otherwise) is done locally in clumps of neurons. That’s not how backprop works. It’s true that you can see similar behavior in LLMs, but it’s not clear whether that localization is for compartmentalization purposes or a side effect of the scale they’re operating at. We know surprisingly little about quite a bit of stuff.

I think what you’re saying is fuzzily closer to the manifold hypothesis (this blog explains it visually).

The perceptron does, to an extent, cluster semantically close information together. The problem is, well, we don’t really have any concrete evidence for that hypothesis being true at scale. Which is why I often cite the Anthropic papers: AFAIK they’re the latest investigations into how interesting stuff like superposition works.

It may definitely be there to an extent (because GPT-3 is able to recall facts). But the thing is, it also demonstrates meta-learning and few-shot capabilities. DL as a field is powered by those discoveries. For the first time, you have a model that can truly adapt to new information in a limited but interesting way. It’s why they can break the Turing test trivially - even tech-savvy people are unable to differentiate. The only way right now to disambiguate LLMs from humans is through specific abilities - addition, extremely complex logical reasoning that requires too many steps, etc.

Clearly, meta-learning is definitely an aspect. To make it more concrete, check out this recent paper from DeepMind (paper)

They have an agent that is able to rival human sample efficiency, despite being a measly 250M-parameter model. Additionally, the model remains frozen but is still able to meta-learn to such a level that in a few trials (where the environment is reset, but memory isn’t) it’s able to improve on the policies it crafted. None of this is taught - it learnt the process of improving.

I like this paper because it concretely showcases the phenomenon of what it means to meta-learn without employing abstract results and hypotheses. It’s a dirt-simple model in a simple open-ended environment. One can immediately grasp that since the model doesn’t update, the policy it learnt has to update/improve. The policy itself is how to explore the hidden dynamics of the environment (the rules are NOT visible to the agent), so effectively you have a meta-learner, an agent, and a world model all in one without ANY explicit training.

It’s this kind of behavior that discredits the claim that NNs are just gigantic search engines.

I don’t remember, in what context was this? :sweat_smile:

3 Likes

Here I am talking about a single node, so there is no backprop or SGD.

I meant that each node can be seen as a cluster of its connections.

I didn’t mean that it’s a simple search, but search is how you form clusters, and I believe all learning is some form of clustering. Because clustering is partly lossless compression: there are fewer clusters than their elements. And without compression, it’s not learning but recording.

Not sure what exactly meta-learning means here though, need to look into it.

Oh, you just said you’d like to rant about diffusion models.

Can it effectively compartmentalize the network though? Similar to cortical columns in TBT, Hinton’s capsules, or experts in MoE models?
Or these new-fangled hyper-transformers: [2301.04584v2] Continual Few-Shot Learning Using HyperTransformers, what do you make of them?

1 Like

Interesting, but still… Those 10**40 possible tasks are still generated by mixing a much more limited set of rules in the environment generator. The large model (a transformer, is it?) is trained on a very large number of generated “unique” environment-task rules. But it might learn the underlying patterns of the generator itself - XLand 2.0 - so after a huge number of samples it can successfully apply the learned algorithms to quickly solve any new problem generated by that tool.

Which is a very interesting result in itself, but a bit off from the claim of “human-like performance”.

1 Like

I don’t get what you mean here tbh. Could you elaborate on what exactly you think NNs do?

I take it you mean individual networks? Not really - the only analogy you can draw here is that individual attention heads are themselves compartmentalized like cortical columns - but they still have a shared, distributed representation, because that’s what SGD empirically converges to.

Seems like the usual meta-learning (in this case, they take inspiration from MAML and perform the meta-learning explicitly. I was talking about how LLMs meta-learn implicitly, without any external effort).

Which is why every evaluation task is human-crafted (the video above mentions that) and the humans know how the environment works beforehand (so it’s akin to the LLM having similar priors to humans, except ours are more general and AdA’s are more specific - hence why GATO was so interesting, because it demonstrated that if you provide an LLM with multiple envs, the priors it learns will be general enough to solve all of them simultaneously).

It’s also why the evaluation tasks have a name :wink: we haven’t got a model that can consume a 3D environment and describe the task in natural language yet - though that would be an interesting task in itself for multi-modal models I suppose.

1 Like

Talking about NNs in general means talking about their lowest common denominator. That means a perceptron or something similar.
At the bottom of it all, the core process in any NN is centroid-based clustering of the receptive field within each node.

1 Like

Yep, I meant elaboration on this statement:

1 Like
  • node connections are potential elements of its cluster, and weighting them is soft inclusion.
  • node output is a version of the centroid of that cluster: a normalized sum of the elements.
  • in a single-layer version this output is compared to some template, forming the error.
  • weight per input is adjusted in proportion to something like: average - error * input.
    Which modulates the degree of soft inclusion of the input connection in this single-node cluster, inversely proportional to its contribution to the error (see the sketch below).
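
A minimal sketch of this single-node view, under my reading of the list above (illustrative only, not the poster’s code): connections hold the elements, the output is a normalized sum of weighted inputs, and a delta-rule-like update softly includes or excludes each connection in proportion to its share of the error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(8)                        # inputs on the node's connections
w = rng.random(8)                        # weights = degrees of soft inclusion
template = 0.5 * (x.min() + x.max())     # target the output is compared to
lr = 0.1

for _ in range(200):
    output = np.dot(w, x) / w.sum()      # centroid: normalized sum of weighted inputs
    error = output - template            # deviation from the template
    w -= lr * error * x                  # soft inclusion adjusted per input's contribution
    w = np.clip(w, 1e-6, None)           # keep weights positive so normalization stays valid

print(round(float(output), 3), round(float(template), 3))   # output approaches the template
```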

Ok, here is my take on basic explicit meta-learning:

  • each connection has an input, a weight, and a "wweight",
  • the node sums and normalizes the weighted inputs and the wweighted weights separately,
  • these two separate outputs are sent to a separate higher input layer and a higher weight layer, etc.,
  • the top layers of both threads are compared to corresponding templates, form their errors, then separately backpropagate them to adjust the weights and wweights.
    Does that make sense? (A rough sketch of the forward pass is below.)
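
A rough, hedged sketch of how I read the forward pass of this proposal (purely illustrative; "wweight" is the poster’s term, the array names are mine): each connection carries an input, a weight, and a wweight, and the node emits two separately normalized sums.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(8)         # inputs on the connections
w = rng.random(8)         # ordinary weights
ww = rng.random(8)        # "wweights": weights over the weights

input_stream = np.dot(w, x) / w.sum()      # normalized sum of weighted inputs
weight_stream = np.dot(ww, w) / ww.sum()   # normalized sum of wweighted weights

# in the full proposal these two outputs would feed separate higher layers,
# be compared to their own templates, and backpropagate separate errors
print(round(float(input_stream), 3), round(float(weight_stream), 3))
```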

How about activation pruning? I think Hawkins talked about it. That is, you first match the input to a column / expert, then that expert learns from the input and adjusts its weights. Other experts are not affected, and they don’t affect the input?

1 Like

It’s an odd way to think about NNs. Firstly,

Is a bit weird, because every node from the preceding layer is connected to the current layer’s node. So by that analogy you’re clustering the same set of elements into distinctly different clusters. That makes little sense.

Where do centroids come from? I assumed you were talking about centroids in the context of clusters (like KNN), but there are zero parallels between neurons and clusters, so that’s confusing me even more.

No… I’m honestly confused at this point. Nothing is sent to separate layers (there is no routing here). “Top layers of both threads are compared to corresponding templates” - again, no.

Explicit meta-learning implies that you train an NN such that it produces the weights for another network. In that sense, the parent network needs to somehow meta-learn such that it can adapt to a novel task by generating weights or updating them.
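
A toy sketch of that idea, assuming a made-up task embedding and layer sizes (not from the thread, and not any particular published architecture): a parent network emits the weights of a small child layer.

```python
import torch

task_embedding = torch.randn(1, 8)               # describes the novel task (made up)
hyper = torch.nn.Linear(8, 4 * 16 + 4)           # parent net: emits the child's parameters
params = hyper(task_embedding)

W = params[:, : 4 * 16].reshape(4, 16)           # child layer weight matrix (4 outputs, 16 inputs)
b = params[:, 4 * 16 :].reshape(4)               # child layer bias

x = torch.randn(5, 16)                           # data from the novel task
child_out = torch.nn.functional.linear(x, W, b)  # child network forward pass
print(child_out.shape)                           # torch.Size([5, 4])
```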

That’s more like an MoE architecture. Activation pruning is a regularization where you penalize activations - similar to how you often penalize the grad norm or weight norm. It’s not used very much in practice - Dropout is much simpler and cheaper to implement.
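
A hedged sketch of an activation penalty in that spirit (an L1 term on hidden activations added to the task loss, analogous to penalizing the weight norm; the model and sizes are illustrative).

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 4),
)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))

hidden = model[1](model[0](x))                   # hidden activations to penalize
logits = model[2](hidden)
task_loss = torch.nn.functional.cross_entropy(logits, y)
loss = task_loss + 1e-3 * hidden.abs().mean()    # activation penalty pushes activations toward zero
loss.backward()
```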

However, it doesn’t implicitly guide the network to compartmentalize information. As I said, like the brain, large NNs have distributed representations. They may compartmentalize information, but it’s always diffuse - so you can chop off quite a bit of the network and still get meaningful results.

1 Like

We may also adjust weighing(s) mereotopologically . . . . .

Prof. REZA SANAYE

What the heck is that?!?!?

1 Like

That’s called fuzzy clustering, nothing weird.

This is how a perceptron works; I am just showing that it’s a version of centroid clustering.

The centroid is the node’s output: a normalized sum of the weighted inputs, which are the cluster elements.

Actually, pure centroid clustering of connections is a grey-scale version of Hebbian learning,
by numeric similarity of their inputs to the output (node centroid).
Directly-defined similarity is something like min(input, norm_sum) / (input + norm_sum).

Perceptron learning is connection clustering by similarity_to_centroid * similarity_of_top_centroid_to_template.
The first-term similarity is defined directly as above; the second-term similarity is the inverted error, backpropagated over multiple layers. So it’s a compound centroid clustering.
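
A small sketch of the directly-defined similarity above, taking the formula literally (my reading; norm_sum is the node’s output / centroid, and positive inputs are assumed):

```python
import numpy as np

def similarity_to_centroid(inputs, weights):
    norm_sum = np.dot(weights, inputs) / weights.sum()         # node output / centroid
    return np.minimum(inputs, norm_sum) / (inputs + norm_sum)  # peaks at 0.5 when input == centroid

x = np.array([0.2, 0.9, 0.5, 0.7])
w = np.array([1.0, 0.5, 1.5, 1.0])
print(similarity_to_centroid(x, w))   # per-connection similarity to the node centroid
```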

Yes, but to me “explicit” means training a separate set of such meta-weights: wweights in my scheme. These wweights would then be used to initialize the weights for a new task / network?
The training would be through backprop of weight errors, but continued across tasks.
So OK, we don’t need separate layers and templates, but there should be separate wweights, both on links and on node activations?

I am just speculating here; NNs are not my thing, natural or artificial.

Do you see any parallels between MoE and hyper-transformers, where the experts form a hyper-network?

1 Like

In a mereological topos, the systematics is not precisely like “nodes”…

1 Like

I am very interested but have absolutely no experience with this mereological thing.
Would you mind giving an example that is related to the topic of the thread?

2 Likes

Oh, sure, Sir… We have been experimenting with this brand-new methodology for more than one year. Traditional data mining and data clustering (node-forming) methods rely heavily on clustering algorithms, most commonly unsupervised machine learning techniques. Additionally, multiple sliding windows, a clustering algorithm in the initialization phase, and a centroid tracking method for maintaining enough knowledge about centroid behavior give an estimate of the positions, velocities and accelerations of the centroids for the next point in time, generally BUT NOT ALWAYS through maintaining an up-to-date centroid movement model. Our teams, however, avoid recomputations of the clustering model.

Instead, mereotopology teaches us that:

A larger latency to read the data means that borderline elements stand in as a proxy for a separate cluster and are merged into successively more massive clusters: since the centroids change, the algorithm then re-assigns the points to the closest centroid. Therefore, the inherent ambiguity of the time coordinates of the observed ‘events’ calls for some specific intelligence, as in the example above, further underlining the necessity of thinking of the topological approach in terms of a contextually situated analysis. In the specific example of the Etruscan towns, we do not of course consider our analysis a scientific contribution to Meros versus topos in data bundles.

Thence, we now introduce another quantity which is helpful in the analysis of the distribution of points in space in epidemic and quasi-epidemic processes. When considering distances on the two-dimensional plane, we generally refer to Euclidean distances as measured by a straight line connecting the two points. This notion of distance is clearly insensitive to the distances of either of the two points to other points in the plane. Therefore, if the position of another point C in the space changes, the Euclidean distance between A and B remains unaffected. However, in terms of our topological approach, as the position of each point in the spatial distribution carries a strong meaning in terms of the global organization, we have to take into account how variations in the positions of certain points may influence the whole spatial structure from the perspective of any single point belonging to it, in its relations with the other points.

To this purpose, we re-define the weights for the computation of the TWC in terms of a new free parameter γ, which acts as a modifier of the Euclidean distance between points, so as to suitably ‘tune’ the optimal deformation needed to capture the actual spatial organization of the points, according to a logic that is similar to the one already followed for the construction of the Alpha and Beta Maps. By computing a sequence of values of the γ parameter, ranging from 1 (no modification of the Euclidean distance) to infinity, we see how, as γi(t) (that is, the deformation parameter assigned to the i-th point in the distribution) increases, the influence of the i-th point in affecting the global organization of distances increases.

In other words, a high γi(t) signals that the position of that point is highly critical in determining the expected evolution of the spatial distribution. Think, for instance, in quasi-epidemic terms, of how the geography of trade and social exchange determining a certain distribution of urban settlements would be modified if one of them suddenly acquired the status of a State Capital. All of a sudden, the relative distances of all the other points with respect to the new Capital would matter much more than in the past in determining whether another given settlement is now considered ‘far’ or ‘central’ in the now re-defined spatial organization…

Sincerely Yours

Prof. REZA SANAYE

1 Like

Why are you re-sending my material to me??

And that’s my issue - it has nothing to do with clusters or centroids in any sense of what those terms mean. If somehow nodes from adjacent layers are considered centroids, then you would actually be maximizing the weighted sum w.r.t. the adjacent node for some arbitrary kernel - which makes no sense mathematically speaking. You won’t be converging to any stable state.

So any interpretation involving similarity between nodes is quite deceptive. In reality, NNs approximate an enormous number of piecewise manifolds - they are not somehow optimizing for similarity between different dimensions.

Well, that’s the same as reusing those weights for every task - few-shot learning, the way static LLMs do it. “Explicit meta-learning” as I called it isn’t a technical term; I meant it as in the researchers making meta-learning an explicit goal to optimize towards (since there is pretty much no way for the network other than to meta-learn. Not so for LLMs, which is why it’s so interesting that they still learn to meta-learn).

Sure, you can continue using those weights and backprop over other tasks - transfer learning, which is quite old and used everywhere (in fact, there are extremely few NNs in production that are not transferred; transfer learning provides huge accuracy and sample-efficiency boosts irrespective of the domain).

Not really… I suppose you could say they form meta-gradients, according to some interesting new research in this area ([2209.11895] In-context Learning and Induction Heads - I think Oswald also has a Twitter thread on this). So the experts might be performing some sort of hyper-networking mechanism. Of course, the experts are NOT directly generating the weights or anything; rather, they probably meta-learned over the intermediate results of other experts.

But MoEs don’t scale well, nor do they exhibit strong few-shot learning, so I’m doubtful they do that. Weirdly enough, that sort of phenomenon seems to appear only in large homogeneous networks. GPT-4 might be sparse, so if OAI have researched MoEs I guess we can get some answers from there…

1 Like

On the forum (discourse.numenta.org), it says your post arrived via email. I guess it emails you your posts or something.

1 Like

I am talking about similarity (contribution, if you will) between the weighted input and the output / activation / centroid in the same node, not between adjacent nodes. The keyword is weighted: it’s not the same as the lower node’s output. And it’s not one adjacent node, so the weight adjustment is proportional to the relative contribution of the weighted input. This part is basically a grey-scale version of Hebbian learning.

This is a macro-view, not analytic enough for me. I am conceptualizing single-node operation.

I don’t mean reusing task-specific weights; rather, it would be the wweights, trained on weights across multiple tasks. These generic weights would only be used for initialization, then the network would convert them into new task-specific weights. Same training process, but a lot faster than initializing with random weights. I guess it’s like an MoE hypernetwork training meta-weights, but still keeping expert-specific weights. Maybe you meant the same thing; it’s unlikely I just invented something new.

Do you really think they are a leader in core technology, vs UI?

1 Like

I think “mereology” is just another word for clustering, just replace “part” with “element”. All learning is some form of clustering, but it doesn’t have to be centroid-based. I think connectivity-based clustering can ultimately be far more intelligent, but it takes a lot more work to design right.

Centroid-based is top-down, simpler because there is more brute-forcing and info-destruction.
Connectivity-based is bottom-up, you can evaluate elements at every step, instead of randomly trashing almost all input data. That’s my thing.

1 Like