NN, centroid clustering, search

Which is what’s confusing me - because it’s anything but Hebbian. It doesn’t represent a significant value, but rather a dimension. The change in weights isn’t determined by some simple relationship like in HL but through gradient descent. “Centroid” is simply the wrong terminology here.

That is not a macro view - that’s literally what a single neuron accomplishes. And the behavior of a single neuron also doesn’t necessarily transfer to the properties of the NN.

Yea, sounds like transfer learning.

Nah, I meant it in the context of something different. MoEs just discretize a giant homogeneous net into hundreds of smaller networks.

By UI, are you implying that they’re mostly oriented towards serving existing research to customers?
Pretty much. OAI doesn’t research towards anything new or groundbreaking. The only time they did was with GPT-3 and emergent abilities + scaling.

That’s not to say they don’t have talent - they have some of the best researchers in the field. They’re simply not at the extreme cutting edge because it’s too expensive. So they make a tradeoff between being at the cutting edge and monetizability.

The leader in research is definitely Google, though it has also started getting outshone by other independent labs. But I guess that’s just research in general :person_shrugging: highly decentralized

You are looking at weight adjustment as a whole, where there is a top-layer template instead of a centroid. But I see two distinct components / phases in this adjustment:

  • node-external gradient (inverse similarity), backpropagated from the top layer, then:
  • node-internal gradient distribution, by connection’s contribution (similarity) to node output.

It’s the 2nd phase that I call grey-scale Hebbian: clustering by relative contribution to the centroid. This distribution is where most of the processing in a perceptron is done, something like:
weight += internal_deviation * external_gradient, where internal_deviation = input - average_input.
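
Taken literally, that update could be sketched in NumPy as below. This is only an illustration of the formula above, not an established algorithm: the function name `grey_scale_hebbian_update`, the learning rate `lr`, and reading `average_input` as the mean over the node's current inputs are all assumptions.

```python
import numpy as np

def grey_scale_hebbian_update(weights, inputs, external_gradient, lr=0.1):
    """One update of a single node's input weights.

    weights, inputs: 1-D arrays, one entry per input connection.
    external_gradient: scalar signal arriving at the node from above.
    lr: hypothetical learning rate, not part of the original formula.
    """
    # Node-internal part: how far each input deviates from the node's average input
    # (assumption: the average is taken over the node's current inputs).
    internal_deviation = inputs - inputs.mean()
    # Distribute the external signal across connections in proportion to that deviation.
    return weights + lr * internal_deviation * external_gradient

w = np.zeros(3)
x = np.array([1.0, 2.0, 3.0])
w_new = grey_scale_hebbian_update(w, x, external_gradient=0.5)
print(w_new)
```

Inputs above the average get their weights raised and inputs below it get them lowered, scaled by whatever signal comes from the layers above.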

BTW, do you know of anyone using stand-alone grey-scale Hebbian?
I still don’t understand how backprop improves on it, except for supervised learning?

Have you looked into Hinton’s new forward-forward algo: [2212.13345] The Forward-Forward Algorithm: Some Preliminary Investigations?
To me, it sounds like a combination of grey-scale Hebbian with grey-scale anti-Hebbian, except that the goodness of fit (similarity) is summed across the whole layer?
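
For reference, the layer-wide goodness in that paper is the sum of squared activities in a layer, passed through a logistic function after subtracting a threshold. A minimal sketch of just that quantity; the activity vectors and the value of `theta` are made up for illustration, and the weight update itself is not shown:

```python
import numpy as np

def goodness(activations):
    # Layer-wide goodness: sum of squared activities across the whole layer.
    return float(np.sum(np.square(activations)))

def prob_positive(activations, theta):
    # Probability that the input is "positive" data: logistic of (goodness - threshold).
    return 1.0 / (1.0 + np.exp(-(goodness(activations) - theta)))

h_pos = np.array([1.0, 2.0, 0.0])   # hypothetical layer activities for positive data
h_neg = np.array([0.1, 0.2, 0.0])   # hypothetical layer activities for negative data
theta = 2.0                         # hypothetical threshold

print(goodness(h_pos), prob_positive(h_pos, theta))
print(goodness(h_neg), prob_positive(h_neg, theta))
```

Training pushes goodness above the threshold for positive data and below it for negative data, which is where the Hebbian / anti-Hebbian reading comes from.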

Ok, but can you train meta-weights on the sets of weights in those sub-networks?

Ok, here is a single-centroid version of backprop, modeled after a dendritic tree:
[2211.11378] Learning on tree architectures outperforms a convolutional feedforward network,
Is Brain Learning Weaker Than Artificial Intelligence? - Neuroscience News

“We’ve shown that efficient learning on an artificial tree architecture, where each weight has a single route to an output unit, can achieve better classification success rates than previously achieved by DL architectures consisting of more layers and filters. This finding paves the way for efficient, biologically-inspired new AI hardware and algorithms,” said Prof. Ido Kanter, of Bar-Ilan’s Department of Physics and Gonda (Goldschmied) Multidisciplinary Brain Research Center, who led the research.

That’s an interesting paper, thanks. What it misses is a comparison of the various trees against the CNN not only in terms of accuracy but also relative to compute/memory cost.
When comparing Arch1 with M parameters against Arch2 with N, P, and Q parameters, and then claiming Arch2 is the better one, the questions that must be addressed are: does the number of parameters matter, and how? What about the learning cost, inference cost, or sample efficiency of each variant?
Otherwise, since increasing either architecture’s number of parameters increases its performance, I can cherry-pick some A1s and a bunch of A2s to “prove” that A1 outperforms A2, or vice versa.

In the paper, they specifically say that a custom processor will be needed to compete on speed.
I think it won’t parallelize as well as backprop, but that can be compensated by more efficient higher-level multi-threading or something.

@bkaz this paper looks interesting and also has MATLAB code for testing on GitHub.
Unfortunately, I did not find any link to the GitHub repository.
Do you know where it is? Thanks

It can also be proven algorithmically that codes have their own hidden variables, too …

No, sorry.

Sorry for the wall of text, couldn’t help it. Whatever its faults, ChatGPT is great for stimulation:

Me:

Is there a relationship between centroid-based clustering and backprop in neural nets?

GPT:

Centroid-based clustering and backpropagation in neural networks are two different machine learning techniques used for different purposes, but there can be a relationship between them depending on the context.

Centroid-based clustering is an unsupervised learning technique used for grouping data points into clusters based on their similarity. The most common example of this is the k-means algorithm, which iteratively adjusts the centroids’ positions to minimize the within-cluster sum of squares.

Backpropagation, on the other hand, is a supervised learning algorithm used for training artificial neural networks. It is an optimization technique that minimizes the error between the network’s predicted outputs and the actual outputs by adjusting the network’s weights and biases. It does this by applying the chain rule to compute gradients of the loss function with respect to each weight and then updating the weights using gradient descent.

The relationship between centroid-based clustering and backpropagation in neural networks can be established in a few ways:

  1. Feature extraction: Centroid-based clustering can be used as a preprocessing step to reduce the dimensionality of the input data or to extract meaningful features for a neural network. The output of the clustering algorithm can be fed as input to the neural network, making it easier for the network to learn patterns and relationships within the data.
  2. Initialization: The centroids obtained from clustering algorithms can be used to initialize the weights of a neural network. This can potentially speed up the training process or improve the performance of the network by providing a better starting point for backpropagation.
  3. Hybrid models: Some machine learning models combine clustering and neural networks, such as the self-organizing map (SOM), which is a type of neural network that uses unsupervised learning techniques like clustering to organize the input data. In this case, the clustering algorithm’s concepts are integrated with the neural network’s learning process, creating a relationship between the two techniques.

In summary, while centroid-based clustering and backpropagation are different machine learning techniques, they can be related or combined depending on the specific use case or problem being addressed.
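
The k-means loop mentioned in the reply can be sketched as follows. This is a bare-bones Lloyd's iteration on toy data; the function name `kmeans`, the initialization from the first `k` points, and the fixed iteration count are assumptions chosen for brevity.

```python
import numpy as np

def kmeans(points, k, n_iters=10):
    # Assumption: initialize centroids from the first k points (real code usually randomizes).
    centroids = points[:k].copy()
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids, labels = kmeans(data, k=2)
print(centroids)
print(labels)
```

Note that the centroid here is recomputed from its members on every pass, so the "template" drifts with the data rather than being supplied from outside.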

Me:

Ok, how about error gradient distribution in each node of perceptron, from output connections, to adjust the weights of node’s input connections (synapses). Is that analogous to downward weight distribution for node inclusion/exclusion into cluster, in soft fuzzy centroid-based clustering?

GPT:

Yes, there is an analogy between the two concepts you mentioned: error gradient distribution in a perceptron and downward weight distribution for node inclusion/exclusion in soft fuzzy centroid-based clustering. Both techniques involve adjusting weights to optimize a particular objective.

In the case of a perceptron, backpropagation is used to compute the error gradients for each node in the network. These error gradients represent the partial derivatives of the loss function with respect to the weights of each node. The weights are then updated using gradient descent to minimize the loss function. The perceptron adjusts the weights of its input connections (synapses) based on the error gradient distribution from the output connections.

In soft fuzzy centroid-based clustering, each data point has a degree of membership to each cluster, rather than belonging to just one cluster as in hard clustering. The objective is to minimize an objective function that takes into account both the distance between data points and cluster centroids and the degree of membership of each data point to each cluster. The downward weight distribution for node inclusion/exclusion represents the adjustment of the degree of membership of a data point to a particular cluster, based on its distance to the cluster centroids.

While both techniques involve adjusting weights to optimize a particular objective, they do so in different contexts and for different purposes. The analogy between the two can be observed in the way they distribute and adjust weights. However, it is important to remember that they are still distinct techniques applied in different machine learning scenarios – perceptrons are used in supervised learning for classification and regression tasks, while soft fuzzy centroid-based clustering is an unsupervised learning technique for grouping data points based on similarity.
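
To make the "degree of membership" concrete, here is a small sketch of the standard fuzzy c-means membership and centroid updates. The function names, the fuzzifier default `m=2.0`, the `eps` guard against division by zero, and the toy data are assumptions for illustration.

```python
import numpy as np

def fuzzy_memberships(points, centroids, m=2.0, eps=1e-12):
    # Distances from every point to every centroid, shape (n_points, n_clusters).
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2) + eps
    # Fuzzy c-means membership: u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1)).
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

def update_centroids(points, u, m=2.0):
    # Each centroid is the mean of all points, weighted by membership^m.
    w = u ** m
    return (w.T @ points) / w.sum(axis=0)[:, None]

points = np.array([[0.0], [1.0], [4.0]])
centroids = np.array([[0.0], [4.0]])
u = fuzzy_memberships(points, centroids)
new_centroids = update_centroids(points, u)
print(u)
print(new_centroids)
```

Each row of `u` sums to 1, so a point's inclusion is shared across clusters in inverse proportion to its distance from each centroid, which is the "soft" counterpart of hard assignment.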

Me:

That’s not quite right, perceptron / MLP can also be used for unsupervised learning, and inverse error is often used as a measure of similarity. So, can the weights on the inputs of perceptron node be considered as a measure of their inclusion in the output of that node? In this interpretation, can each node be considered a centroid-based cluster of its input connections?

GPT:

Apologies for the confusion. You are correct that perceptrons and multi-layer perceptrons (MLPs) can be used for unsupervised learning as well, with autoencoders being a prime example.

In this context, we can consider the weights on the inputs of a perceptron node as a measure of their inclusion in the output of that node. The higher the weight, the stronger the connection between the input and the output, and the more the input contributes to the output. Thus, the weights can be seen as a measure of the “importance” or “inclusion” of the input in generating the output of the node.

Regarding the interpretation of each node as a centroid-based cluster of its input connections, this analogy can be drawn to some extent. In centroid-based clustering, the centroids represent the “centers” of the clusters, and the data points are assigned to the clusters based on their similarity to the centroids. In a perceptron, the weights connecting the inputs to a node can be considered as a representation of the “center” of the node’s receptive field, which influences how the node responds to different input patterns.

However, there are still some differences between the two concepts. In centroid-based clustering, the objective is to minimize the within-cluster distances, while in a perceptron, the objective is typically to minimize a loss function that depends on the task being solved (e.g., classification or regression). Moreover, the weights in a perceptron are learned through a supervised or unsupervised learning process, while centroids in clustering are determined based on the distribution of the data points.

In summary, while it is possible to draw some analogies between the weights on the inputs of a perceptron node and the concept of centroid-based clustering, it’s important to note that the two techniques still have different objectives, learning processes, and applications. The analogy may provide some insights into the behavior of perceptrons, but the relationship is not a direct one-to-one mapping.

Me:

“In a perceptron, the weights connecting the inputs to a node can be considered as a representation of the “center” of the node’s receptive field”. This is actually backwards, the “center” of receptive field would be normalized sum of weighted inputs, which becomes node’s output. That output propagates through higher layers, and then its error (inverse similarity) backpropagates to the node. In general, isn’t the objective in both backprop and centroid-based clustering basically fitting (weighting) the inputs to some template, which can be considered a centroid?

GPT:

You are correct in pointing out that the “center” of the node’s receptive field would be better described as the normalized sum of the weighted inputs, which becomes the node’s output.

In both backpropagation and centroid-based clustering, the general objective is to optimize some measure based on the input data. In the case of backpropagation, the goal is to minimize the error between the predicted output and the target output by adjusting the weights of the connections in the network. In centroid-based clustering, the goal is to group data points into clusters based on their similarity, minimizing the within-cluster distances.

Although both methods involve optimizing weights and can be considered as fitting the inputs to some template, there are fundamental differences in their objectives and the way they operate.

  1. Learning type: Backpropagation is used in supervised learning (and unsupervised learning in some cases, such as autoencoders), where there is a target output or a ground truth to compare the predictions against. Centroid-based clustering is an unsupervised learning technique that doesn’t rely on labeled data.
  2. Optimization objective: In backpropagation, the objective is to minimize a loss function, which depends on the specific task (e.g., classification, regression). In centroid-based clustering, the objective is to minimize within-cluster distances, which is a measure of data point similarity within each cluster.
  3. Weight adjustments: In backpropagation, weights are adjusted throughout the entire network, and the error is propagated through multiple layers. In centroid-based clustering, weights (or memberships) are adjusted based on the distance between data points and cluster centroids.
  4. Output representation: In backpropagation, the output is typically a continuous value or a class label, depending on the problem. In centroid-based clustering, the output is a set of cluster assignments for each data point.

While there are similarities between backpropagation and centroid-based clustering, the differences in their objectives, learning types, optimization goals, and output representations make them distinct techniques with different applications. The analogy between the two can help in understanding certain aspects of their behavior, but it is essential to recognize the fundamental differences between them as well.

Me:

So, as I understand the main difference is that the template / center in centroid clustering is “drifting” with each weight adjustment on the inputs, while in MLP that top-layer template is fixed. Are there implementations of MLP or related NN where the top layer also changes in each backprop cycle, in unsupervised fashion?

GPT:

Yes, your understanding is correct. In centroid-based clustering, the template or center (the centroid) changes with each weight adjustment, while in a typical MLP, the top-layer template or target output is fixed during training.

There are implementations of neural networks where the top layer can also change in an unsupervised fashion during the training process. One example is a type of unsupervised learning called Competitive Learning, which includes the Self-Organizing Map (SOM) and the Neural Gas algorithm.

In a Self-Organizing Map (SOM), the top layer consists of nodes (also called neurons) arranged in a grid structure. During the training process, the weights of the nodes in the top layer are adjusted based on the input data. SOMs do not use backpropagation for learning; instead, they employ a competitive learning process. When an input is presented to the network, the node whose weights are closest to the input vector (the “winning” node) and its neighbors are updated to become more similar to the input. This process continues for all input samples, resulting in a top layer that has learned to represent the input data distribution in an unsupervised manner.

Another example is the Neural Gas algorithm, which is also an unsupervised competitive learning technique. Like SOM, Neural Gas adjusts the weights of the nodes in the top layer based on the input data. The key difference between SOM and Neural Gas is that Neural Gas does not have a fixed grid structure for the nodes in the top layer, allowing for a more flexible representation of the input data distribution.

Both Self-Organizing Maps and Neural Gas algorithms can be seen as a form of unsupervised learning in which the top layer of the network adapts and changes during the training process, unlike traditional MLPs. While they do not rely on backpropagation for learning, they illustrate how neural networks can be implemented with top layers that change in an unsupervised fashion during the training process.
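
The SOM update described in the reply can be sketched as a single competitive-learning step. This is a simplified version on a 1-D chain of nodes; the function name `som_step`, the Gaussian neighborhood, and the fixed `lr` and `sigma` values are assumptions (real SOMs typically use a 2-D grid and decay both over time).

```python
import numpy as np

def som_step(weights, x, lr=0.5, sigma=1.0):
    """One competitive-learning step on a 1-D map.

    weights: (n_nodes, dim) array, one weight vector per map node.
    x: (dim,) input vector.
    """
    # Winner: the node whose weights are closest to the input.
    winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
    # Gaussian neighborhood over node positions on the map (assumed 1-D chain here).
    positions = np.arange(len(weights))
    influence = np.exp(-((positions - winner) ** 2) / (2.0 * sigma ** 2))
    # Winner and its neighbors move toward the input; no backprop is involved.
    return weights + lr * influence[:, None] * (x - weights), winner

w = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
x = np.array([2.0, 2.0])
new_w, winner = som_step(w, x)
print(winner)
print(new_w)
```

Every node's weight vector acts as a drifting centroid here: it is pulled toward each input in proportion to how close the node sits to the winner on the map.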

It’s absolute nonsense, isn’t it? Or perhaps it’s gospel truth. Or a mixture of the two. Who knows?

Problem is, there is absolutely no way to tell unless you’re already an expert in the subject. ChatGPT delivers answers with all the confidence of the bright student who only did half the homework, but isn’t going to let that stand in the way of delivering a brilliant wrong answer with full confidence.

ChatGPT delivers fragments of text that relate to a topic with enough glue to make it sound good, but with absolutely no idea as to truth or any relationship with the real world.

I think that makes it more useful to bad people than good people, but I guess time will tell. One thing for sure: ChatGPT is not your friend.

It’s neither. You still need to use your brain, but this feedback is better than what you get on most related forums. And it won’t stop getting better. As for confidence, yeah, it does need to quantify how reliable the training data is, plus how many “generative” levels the reply is removed from it. But it’s not rocket science, they will add it eventually. Frankly, most people are already worse at it.

And it’s even better for coding, I just started using it.

I’ve read some of the code. I’ve not seen anything yet I’d trust or be happy to leave untouched, but it gets close and churning out boilerplate certainly looks like a win. IOW it has to be code you could have written yourself, but this way is faster. The monkey still needs that organ grinder.

ChatGPT directly, or Pilot etc?

ChatGPT Plus ($20/mo); you can set it to use GPT-4. If you point out the error, chances are it will fix it. Haven’t used anything else.
