NN, centroid clustering, search

Which is what's confusing me - because it's anything but Hebbian. It doesn't represent a significant value, but rather a dimension. The change in weights isn't determined by some simple relationship like in Hebbian learning, but through gradient descent. "Centroid" is simply the wrong terminology here.

That is not a macro view - that's literally what a single neuron accomplishes. And the behavior of a single neuron also doesn't necessarily transfer to the properties of the NN.

Yea, sounds like transfer learning.

Nah, I meant it in the context of something different. MoEs just discretize a giant homogeneous net into hundreds of smaller networks.

By UI, are you implying that they're mostly oriented towards serving existing research to customers? Pretty much. OAI doesn't do research aimed at anything new or groundbreaking. The only time they did was with GPT-3 and emergent abilities + scaling.

That's not to say they don't have talent - they have some of the best researchers in the field. They're simply not at the extreme cutting edge because it's too expensive, $$$-wise. So they make a tradeoff between being at the cutting edge and monetizability.

The leader in research is definitely Google, though it's also started getting outshined by other independent labs. But I guess that's just research in general :person_shrugging: highly decentralized.

2 Likes

You are looking at weight adjustment as a whole, where there is a top-layer template instead of a centroid. But I see two distinct components / phases in this adjustment:

  • node-external gradient (inverse similarity), backpropagated from the top layer, then:
  • node-internal gradient distribution, by each connection's contribution (similarity) to the node's output.

It's the 2nd phase that I call grey-scale Hebbian: clustering by relative contribution to the centroid. This distribution is where most of the processing in a perceptron is done, something like:
weight += internal_deviation * external_gradient, where internal_deviation = input - average_input.
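In numpy, roughly (a minimal sketch of that 2nd-phase update as I read it; the learning rate and the function name are just my placeholders):

```python
import numpy as np

# Phase 2 only: a scalar "external gradient" has already arrived from the layer
# above; it is distributed over the node's input weights in proportion to each
# input's deviation from the average input (the "internal deviation").
def grey_scale_hebbian_update(weights, inputs, external_gradient, lr=0.01):
    internal_deviation = inputs - inputs.mean()
    return weights + lr * internal_deviation * external_gradient

weights = np.zeros(4)
inputs = np.array([0.2, 0.9, 0.4, 0.5])
weights = grey_scale_hebbian_update(weights, inputs, external_gradient=0.3)
```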

BTW, do you know of anyone using stand-alone grey-scale Hebbian?
I still don’t understand how backprop improves on it, except for supervised learning?

Have you looked into Hinton’s new forward-forward algo: [2212.13345] The Forward-Forward Algorithm: Some Preliminary Investigations?
To me, it sounds like a combination of grey-scale Hebbian with grey-scale anti-Hebbian, except that the goodness of fit (similarity) is summed across the whole layer?
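For reference, here's my rough reading of a single FF layer as a numpy sketch (the ReLU, threshold, and learning rate are my own choices, not necessarily the paper's setup): goodness is the sum of squared activations across the layer, pushed above a threshold for positive data and below it for negative data, with a purely layer-local update and no backprop between layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One forward-forward layer update (sketch): increase goodness for positive
# (real) samples, decrease it for negative (contrastive) samples.
def ff_layer_update(W, x, positive, theta=2.0, lr=0.03):
    pre = W @ x
    h = np.maximum(pre, 0.0)            # ReLU activations
    goodness = np.sum(h ** 2)           # summed across the whole layer
    p = sigmoid(goodness - theta)       # prob. the sample is "positive"
    dg = -(1.0 - p) if positive else p  # d(loss)/d(goodness)
    dpre = dg * 2.0 * h * (pre > 0)     # chain rule, local to this layer
    W -= lr * np.outer(dpre, x)
    return W, goodness

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 4))
W, g = ff_layer_update(W, rng.normal(size=4), positive=True)
```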

Ok, but can you train meta-weights on the sets of weights in those sub-networks?

1 Like

Ok, here is a single-centroid version of backprop, modeled after dendritic tree:
[2211.11378] Learning on tree architectures outperforms a convolutional feedforward network,
Is Brain Learning Weaker Than Artificial Intelligence? - Neuroscience News

“We’ve shown that efficient learning on an artificial tree architecture, where each weight has a single route to an output unit, can achieve better classification success rates than previously achieved by DL architectures consisting of more layers and filters. This finding paves the way for efficient, biologically-inspired new AI hardware and algorithms,” said Prof. Ido Kanter, of Bar-Ilan’s Department of Physics and Gonda (Goldschmied) Multidisciplinary Brain Research Center, who led the research.
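For intuition, here's a toy numpy illustration of that wiring (my own construction, not the paper's actual tree code): the inputs are split into disjoint groups, each group feeds exactly one hidden unit, and the hidden units feed a single output, so every weight lies on exactly one route to the output unit.

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, group_size = 4, 3
W1 = rng.normal(scale=0.1, size=(n_groups, group_size))  # one weight vector per input group
w2 = rng.normal(scale=0.1, size=n_groups)                # hidden -> output weights

def forward(x):
    groups = x.reshape(n_groups, group_size)             # disjoint receptive fields
    h = np.tanh(np.einsum("gi,gi->g", W1, groups))       # one hidden unit per group
    return w2 @ h                                        # single output unit

y = forward(rng.normal(size=n_groups * group_size))
```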

2 Likes

That's an interesting paper, thanks. What it misses is comparing the various trees vs. the CNN not only in terms of accuracy but also relative to the compute/memory cost.
Because when you compare Arch1 with M parameters against Arch2 with N parameters, Arch2 with P parameters, and Arch2 with Q parameters, and then claim Arch2 is the better one, the questions that must be addressed are: does the number of parameters matter, and how? What about the learning cost, inference cost, or sample efficiency of each variant?
Because otherwise, since increasing either architecture's number of parameters increases its performance, I can cherry-pick some A1s and a bunch of A2s to "prove" A1 outperforms A2 or vice versa.

2 Likes

In the paper, they specifically say that a custom processor will be needed to compete on speed.
I think it won't parallelize as well as backprop, but that could be compensated for by more efficient higher-level multi-threading or something.

1 Like

@bkaz this paper looks interesting and also has MATLAB code for testing on GitHub.
Unfortunately I did not find any links to the GitHub repository.
Do you know of it? Thanks

1 Like

It can also be proven algorithmically that codes have their own hidden variables, too…

1 Like

No, sorry.

1 Like

Sorry for all the text, couldn't help it. Whatever its faults, ChatGPT is great for stimulation:

Me:

Is there a relationship between centroid-based clustering and backprop in neural nets?

GPT:

Centroid-based clustering and backpropagation in neural networks are two different machine learning techniques used for different purposes, but there can be a relationship between them depending on the context.

Centroid-based clustering is an unsupervised learning technique used for grouping data points into clusters based on their similarity. The most common example of this is the k-means algorithm, which iteratively adjusts the centroids’ positions to minimize the within-cluster sum of squares.

Backpropagation, on the other hand, is a supervised learning algorithm used for training artificial neural networks. It is an optimization technique that minimizes the error between the network’s predicted outputs and the actual outputs by adjusting the network’s weights and biases. It does this by applying the chain rule to compute gradients of the loss function with respect to each weight and then updating the weights using gradient descent.

The relationship between centroid-based clustering and backpropagation in neural networks can be established in a few ways:

  1. Feature extraction: Centroid-based clustering can be used as a preprocessing step to reduce the dimensionality of the input data or to extract meaningful features for a neural network. The output of the clustering algorithm can be fed as input to the neural network, making it easier for the network to learn patterns and relationships within the data.
  2. Initialization: The centroids obtained from clustering algorithms can be used to initialize the weights of a neural network. This can potentially speed up the training process or improve the performance of the network by providing a better starting point for backpropagation.
  3. Hybrid models: Some machine learning models combine clustering and neural networks, such as the self-organizing map (SOM), which is a type of neural network that uses unsupervised learning techniques like clustering to organize the input data. In this case, the clustering algorithm’s concepts are integrated with the neural network’s learning process, creating a relationship between the two techniques.

In summary, while centroid-based clustering and backpropagation are different machine learning techniques, they can be related or combined depending on the specific use case or problem being addressed.
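For concreteness, here are the two update rules described above as a minimal numpy sketch of my own (toy data and step size are mine): the k-means centroid step next to a gradient-descent step on a linear model.

```python
import numpy as np

# k-means: each centroid moves to the mean of the points currently assigned to it.
def kmeans_step(X, centroids):
    assign = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    return np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else centroids[k]
                     for k in range(len(centroids))])

# gradient descent: weights move against the gradient of a loss (here, the mean
# squared error of a linear model), computed via the chain rule.
def sgd_step(w, X, y, lr=0.1):
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
centroids = kmeans_step(X, X[:3].copy())
w = sgd_step(np.zeros(2), X, X @ np.array([1.0, -2.0]))
```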

Me:

Ok, how about error gradient distribution in each node of a perceptron, from its output connections, to adjust the weights of the node's input connections (synapses)? Is that analogous to the downward weight distribution for node inclusion/exclusion into a cluster, in soft fuzzy centroid-based clustering?

GPT:

Yes, there is an analogy between the two concepts you mentioned: error gradient distribution in a perceptron and downward weight distribution for node inclusion/exclusion in soft fuzzy centroid-based clustering. Both techniques involve adjusting weights to optimize a particular objective.

In the case of a perceptron, backpropagation is used to compute the error gradients for each node in the network. These error gradients represent the partial derivatives of the loss function with respect to the weights of each node. The weights are then updated using gradient descent to minimize the loss function. The perceptron adjusts the weights of its input connections (synapses) based on the error gradient distribution from the output connections.

In soft fuzzy centroid-based clustering, each data point has a degree of membership to each cluster, rather than belonging to just one cluster as in hard clustering. The objective is to minimize an objective function that takes into account both the distance between data points and cluster centroids and the degree of membership of each data point to each cluster. The downward weight distribution for node inclusion/exclusion represents the adjustment of the degree of membership of a data point to a particular cluster, based on its distance to the cluster centroids.

While both techniques involve adjusting weights to optimize a particular objective, they do so in different contexts and for different purposes. The analogy between the two can be observed in the way they distribute and adjust weights. However, it is important to remember that they are still distinct techniques applied in different machine learning scenarios – perceptrons are used in supervised learning for classification and regression tasks, while soft fuzzy centroid-based clustering is an unsupervised learning technique for grouping data points based on similarity.
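Again for concreteness, a minimal sketch of that soft-membership update in the style of fuzzy c-means (my own toy version, with the fuzziness exponent m as an assumed parameter): each point's degree of membership falls off with its distance to a centroid relative to the other centroids, and centroids are recomputed as membership-weighted means.

```python
import numpy as np

def fuzzy_memberships(X, centroids, m=2.0):
    d = np.linalg.norm(X[:, None, :] - centroids[None], axis=-1) + 1e-12
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)            # memberships sum to 1 per point

def update_centroids(X, U, m=2.0):
    W = U ** m
    return (W.T @ X) / W.sum(axis=0)[:, None]              # membership-weighted means

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
C = X[:3].copy()
for _ in range(10):
    U = fuzzy_memberships(X, C)
    C = update_centroids(X, U)
```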

Me

That’s not quite right, perceptron / MLP can also be used for unsupervised learning, and inverse error is often used as a measure of similarity. So, can the weights on the inputs of perceptron node be considered as a measure of their inclusion in the output of that node? In this interpretation, can each node be considered a centroid-based cluster of its input connections?

GPT:

Apologies for the confusion. You are correct that perceptrons and multi-layer perceptrons (MLPs) can be used for unsupervised learning as well, with autoencoders being a prime example.

In this context, we can consider the weights on the inputs of a perceptron node as a measure of their inclusion in the output of that node. The higher the weight, the stronger the connection between the input and the output, and the more the input contributes to the output. Thus, the weights can be seen as a measure of the “importance” or “inclusion” of the input in generating the output of the node.

Regarding the interpretation of each node as a centroid-based cluster of its input connections, this analogy can be drawn to some extent. In centroid-based clustering, the centroids represent the “centers” of the clusters, and the data points are assigned to the clusters based on their similarity to the centroids. In a perceptron, the weights connecting the inputs to a node can be considered as a representation of the “center” of the node’s receptive field, which influences how the node responds to different input patterns.

However, there are still some differences between the two concepts. In centroid-based clustering, the objective is to minimize the within-cluster distances, while in a perceptron, the objective is typically to minimize a loss function that depends on the task being solved (e.g., classification or regression). Moreover, the weights in a perceptron are learned through a supervised or unsupervised learning process, while centroids in clustering are determined based on the distribution of the data points.

In summary, while it is possible to draw some analogies between the weights on the inputs of a perceptron node and the concept of centroid-based clustering, it’s important to note that the two techniques still have different objectives, learning processes, and applications. The analogy may provide some insights into the behavior of perceptrons, but the relationship is not a direct one-to-one mapping.

Me:

"In a perceptron, the weights connecting the inputs to a node can be considered as a representation of the "center" of the node's receptive field". This is actually backwards: the "center" of the receptive field would be the normalized sum of weighted inputs, which becomes the node's output. That output propagates through higher layers, and then its error (inverse similarity) backpropagates to the node. In general, isn't the objective in both backprop and centroid-based clustering basically fitting (weighting) the inputs to some template, which can be considered a centroid?

GPT:

You are correct in pointing out that the “center” of the node’s receptive field would be better described as the normalized sum of the weighted inputs, which becomes the node’s output.

In both backpropagation and centroid-based clustering, the general objective is to optimize some measure based on the input data. In the case of backpropagation, the goal is to minimize the error between the predicted output and the target output by adjusting the weights of the connections in the network. In centroid-based clustering, the goal is to group data points into clusters based on their similarity, minimizing the within-cluster distances.

Although both methods involve optimizing weights and can be considered as fitting the inputs to some template, there are fundamental differences in their objectives and the way they operate.

  1. Learning type: Backpropagation is used in supervised learning (and unsupervised learning in some cases, such as autoencoders), where there is a target output or a ground truth to compare the predictions against. Centroid-based clustering is an unsupervised learning technique that doesn’t rely on labeled data.
  2. Optimization objective: In backpropagation, the objective is to minimize a loss function, which depends on the specific task (e.g., classification, regression). In centroid-based clustering, the objective is to minimize within-cluster distances, which is a measure of data point similarity within each cluster.
  3. Weight adjustments: In backpropagation, weights are adjusted throughout the entire network, and the error is propagated through multiple layers. In centroid-based clustering, weights (or memberships) are adjusted based on the distance between data points and cluster centroids.
  4. Output representation: In backpropagation, the output is typically a continuous value or a class label, depending on the problem. In centroid-based clustering, the output is a set of cluster assignments for each data point.

While there are similarities between backpropagation and centroid-based clustering, the differences in their objectives, learning types, optimization goals, and output representations make them distinct techniques with different applications. The analogy between the two can help in understanding certain aspects of their behavior, but it is essential to recognize the fundamental differences between them as well.

Me:

So, as I understand it, the main difference is that the template / center in centroid clustering "drifts" with each weight adjustment on the inputs, while in an MLP that top-layer template is fixed. Are there implementations of MLPs or related NNs where the top layer also changes in each backprop cycle, in an unsupervised fashion?

GPT:

Yes, your understanding is correct. In centroid-based clustering, the template or center (the centroid) changes with each weight adjustment, while in a typical MLP, the top-layer template or target output is fixed during training.

There are implementations of neural networks where the top layer can also change in an unsupervised fashion during the training process. One example is a type of unsupervised learning called Competitive Learning, which includes the Self-Organizing Map (SOM) and the Neural Gas algorithm.

In a Self-Organizing Map (SOM), the top layer consists of nodes (also called neurons) arranged in a grid structure. During the training process, the weights of the nodes in the top layer are adjusted based on the input data. SOMs do not use backpropagation for learning; instead, they employ a competitive learning process. When an input is presented to the network, the node whose weights are closest to the input vector (the “winning” node) and its neighbors are updated to become more similar to the input. This process continues for all input samples, resulting in a top layer that has learned to represent the input data distribution in an unsupervised manner.

Another example is the Neural Gas algorithm, which is also an unsupervised competitive learning technique. Like SOM, Neural Gas adjusts the weights of the nodes in the top layer based on the input data. The key difference between SOM and Neural Gas is that Neural Gas does not have a fixed grid structure for the nodes in the top layer, allowing for a more flexible representation of the input data distribution.

Both Self-Organizing Maps and Neural Gas algorithms can be seen as a form of unsupervised learning in which the top layer of the network adapts and changes during the training process, unlike traditional MLPs. While they do not rely on backpropagation for learning, they illustrate how neural networks can be implemented with top layers that change in an unsupervised fashion during the training process.
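A minimal sketch of that competitive-learning update, as I understand it (my own toy SOM, not a library implementation): find the best-matching unit for each input and pull it, and more weakly its grid neighbours, toward the input. No backprop; the "top layer" drifts with the data.

```python
import numpy as np

def som_step(weights, grid, x, lr=0.1, sigma=1.0):
    bmu = np.argmin(((weights - x) ** 2).sum(axis=1))      # best-matching ("winning") unit
    grid_dist = np.linalg.norm(grid - grid[bmu], axis=1)   # distance on the map grid
    h = np.exp(-grid_dist ** 2 / (2 * sigma ** 2))         # neighbourhood factor
    return weights + lr * h[:, None] * (x - weights)       # pull winner + neighbours toward x

rng = np.random.default_rng(0)
grid = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)  # 5x5 map
weights = rng.normal(size=(25, 3))
for x in rng.normal(size=(500, 3)):
    weights = som_step(weights, grid, x)
```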

2 Likes

It’s absolute nonsense, isn’t it? Or perhaps it’s gospel truth. Or a mixture of the two. Who knows?

Problem is, there is absolutely no way to tell unless you’re already an expert in the subject. ChatGPT delivers answers with all the confidence of the bright student who only did half the homework, but isn’t going to let that stand in the way of delivering a brilliant wrong answer with full confidence.

ChatGPT delivers fragments of text that relate to a topic with enough glue to make it sound good, but with absolutely no idea as to truth or any relationship with the real world.

I think that makes it more useful to bad people than good people, but I guess time will tell. One thing for sure: ChatGPT is not your friend.

3 Likes

It's neither. You still need to use your brain, but this feedback is better than what you get on most related forums. And it won't stop getting better. As for confidence, yeah, it does need to quantify how reliable the training data is, plus how many "generative" levels the reply is removed from it. But it's not rocket science; they will add it eventually. Frankly, most people are already worse at it.

And it’s even better for coding, I just started using it.

1 Like

I’ve read some of the code. I’ve not seen anything yet I’d trust or be happy to leave untouched, but it gets close and churning out boilerplate certainly looks like a win. IOW it has to be code you could have written yourself, but this way is faster. The monkey still needs that organ grinder.

ChatGPT directly, or Copilot, etc.?

1 Like

ChatGPT Plus ($20/mo); you can set it to use GPT-4. If you point out an error, chances are it will fix it. Haven't used anything else.

1 Like

It seems that GPT-4 is already an MoE!

GPT-4 was the most anticipated AI model in history.

Yet when OpenAI released it in March they didn’t tell us anything about its size, data, internal structure, or how they trained and built it. A true black box.

As it turns out, they didn’t conceal those critical details because the model was too innovative or the architecture too moat-y to share. The opposite seems to be true if we’re to believe the latest rumors:

GPT-4 is, technically and scientifically speaking, hardly a breakthrough.

That’s not necessarily bad—GPT-4 is, after all, the best language model in existence—just… somewhat underwhelming. Not what people were expecting after a 3-year wait.

This news, yet to be officially confirmed, reveals key insights about GPT-4 and OpenAI and raises questions about AI’s true state-of-the-art—and its future.

GPT-4: A mixture of smaller models

On June 20th, George Hotz, founder of self-driving startup Comma.ai leaked that GPT-4 isn’t a single monolithic dense model (like GPT-3 and GPT-3.5) but a mixture of 8 x 220-billion-parameter models. Later that day, Soumith Chintala, co-founder of PyTorch at Meta, reaffirmed the leak. Just the day before, Mikhail Parahkin, Microsoft Bing AI lead, had also hinted at this.

GPT-4 is not one big >1T model but eight smaller ones cleverly put together. The mixture of experts paradigm OpenAI supposedly used for this “hydra” model is neither new nor invented by them. In this article, I’ll explain why this is very relevant for the field and how OpenAI masterfully executed its plan to achieve three key goals.

Two caveats.

First, this is a rumor. The explicit sources (Hotz and Chintala) are robust but not OpenAI staff. Parahkin holds an executive position at Microsoft but he never confirmed it explicitly. For these reasons, it’s worth taking this with a grain of salt. The story is nevertheless very plausible.

Second, let’s give credit where credit’s due. GPT-4 is exactly as impressive as users say. The details of the internal architecture can’t change that. If it works, it works. It doesn’t matter whether it’s one model or eight tied together. Its performance and ability on writing and coding tasks are legit. This article is not a dunk on GPT-4—just a warning that we may want to update our priors.

The secrecy around GPT-4

I have to applaud OpenAI’s mastery in dealing with the unreasonably high expectations that surrounded GPT-4 by covering up the more unsatisfactory aspects of the model while remaining at the top of the conversation.

In January, when Connie Loizos of StrictlyVC mentioned the ridiculous 100-trillion GPT-4 graphs that were making the rounds on Twitter, Altman told her that “people are begging to be disappointed and they will be.” He knew GPT-4, which had finished training in the summer of 2022, wouldn’t meet people’s expectations.

But he didn’t want to kill OpenAI’s almost-mystical reputation. So they hid GPT-4 from public scrutiny, further fueling its mysterious aura.

OpenAI had already crystallized its status with ChatGPT by then. They were leaders in the space in the eyes of the majority (despite Google’s longer and richer history of AI R&D). As such, they couldn’t admit explicitly that GPT-4 wasn’t the anticipated breakthrough—and the huge leap from GPT-3—that people wanted.

So they focused on hinting and implying it was really powerful (e.g., sparks of AGI, superintelligence is near, and all that) and defended their decision to not disclose GPT-4’s specs by alluding to increased competitive pressures, as Ilya Sutskever told The Verge.

With this on the table, the mainstream reading of OpenAI’s secrecy was along the lines of: “They won’t disclose the specs because they can’t afford Google or open source initiatives to copy them due to business survival and safety reasons. Also, GPT-4’s SOTA performance implies the architecture must be a scientific feat.”

OpenAI got what it wanted. Altman was honest—GPT-4 would’ve been disappointing—but, at the same time, the subliminal signals suggested something else: GPT-4 is magical. And people believed it.

It is magical in a way, though. We’ve all seen it in action. It’s just not what most people would perceive as a revolutionary achievement. It seems to be just an old trick reimagined. Combining several expert models into one, with each expert trained to specialize in separate areas, tasks, or data was a technique first successfully implemented in 2021. Two years ago. Who did it? You guessed it, Google engineers (some of them, like William Fedus and Trevor Cai, were later hired by OpenAI).

OpenAI surely added engineering ingenuity on top (otherwise Google would have their own GPT-4, or better), but the very key to the model’s absolute dominance across benchmarks is simply that it’s not one model but eight.

So, yes, GPT-4 is magic, but OpenAI made it into the kind we see in shows. A clever mix of skillful misdirection and smooth sleight of hand. And the trick is merely a remake.

The 3 goals OpenAI achieved by hiding GPT-4

First, they freed people’s imagination. Although skeptics saw this as an unscientific practice, it fueled speculation about the model’s power. This, in turn, allowed them to establish their preferred narrative—AGI and the need to plan for it—convincing the government that safety requirements (especially for others) and regulation (that which fits their goals) are paramount. The illusion was complete: GPT-4 had a shiny appearance so it had to be equally shiny inside—and shiny can be dangerous.

In actuality, if we go for the snarky analogy, GPT-4 is better portrayed as a gaze of “raccoons in a trenchcoat.”

Second, they effectively prevented open source initiatives, as well as competitors like Google or Anthropic, from copying the techniques they had supposedly invented or discovered. But OpenAI had no moat in GPT-4. LLaMA is unable to compete with GPT-4, but maybe 8 LLaMAs tied together could—people were comparing apples to oranges but they didn’t know. So maybe I was mistaken and open source wasn’t so far behind after all.

The moat was making GPT-4 appear more impressive than it was.

Finally, they concealed the truth that GPT-4 is actually not that much of an AI breakthrough, effectively preventing witnesses, outsiders, and users from losing faith in the apparently breakneck pace of progress in the field. If we’re nitpicky, GPT-4 is the result of having, on the one hand, enough money and GPUs to train and run eight ~GPT-3.5 models stacked together and, on the other, the audacity to dust an old technique invented by another company without telling anyone.

GPT-4 was a business marketing masterclass.

A final thought

Maybe OpenAI—and the industry at large—are out of ideas, as Hotz suggests. Maybe AI isn’t really going that fast milestone after milestone as companies, media, marketers, and arXiv make it seem. Maybe GPT-4 isn’t as huge a leap from GPT-3 as it should’ve been.

A rumor is still a rumor until we get an official version (I reached out to OpenAI but didn’t hear back yet). It’s hard to deny the plausibility of the story, though. Besides the value of the sources, there’s an overall coherence to it. That’s why I’m giving this news a high credibility.

Quoting Hotz’s conclusion: “Whenever a company is secretive is because they are hiding something that’s not that cool.” Maybe GPT-4 is not that cool after all.

2 Likes

As someone who was quite fond of MoEs, I don't think they're the best direction for LLMs/DL in general. They save on inference costs, but training is rather… fragile.

I think this is the only part of the article I disagree with - MoEs are a bit more complicated than just ensembling a bunch of LLaMAs together.
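To illustrate what I mean, here's a minimal top-k routing sketch (my own simplification, nothing like the production systems): a learned router picks a couple of experts per token and mixes only those, rather than averaging every expert over every input the way an ensemble would.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, router_W, expert_Ws, k=2):
    logits = x @ router_W                        # (tokens, n_experts) routing scores
    topk = np.argsort(-logits, axis=-1)[:, :k]   # chosen experts per token
    out = np.zeros((x.shape[0], expert_Ws[0].shape[1]))
    for t in range(x.shape[0]):
        gates = softmax(logits[t, topk[t]])      # renormalise over the chosen experts
        for g, e in zip(gates, topk[t]):
            out[t] += g * (x[t] @ expert_Ws[e])  # gate-weighted mix of selected experts
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))                     # 4 tokens, d_model = 16
router_W = rng.normal(size=(16, 8))              # 8 experts
expert_Ws = [rng.normal(size=(16, 16)) for _ in range(8)]
y = moe_forward(x, router_W, expert_Ws)
```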

Open-source models would never achieve the scale, engineering effort, and ingenuity that OAI/Google/DM produce. I applaud the community's drive, but honestly they're hopelessly out of their depth.

The core bottleneck is researchers - the good ones are usually snapped up by big labs, and some that do play both sides aren't able to devote as much time. Stability tried to make a new business model that would be OSS and still attractive enough to draw them in, but…

3 Likes

How about The Bitter Lesson?
Hinton himself said much earlier that progress in hardware contributed 10 times more than his algorithms to the progress in AI.

1 Like

The bitter lesson isn't about throwing more compute at the problem - rather, it's about finding compute-scalable architectures and algorithms. Lots of techniques that are scalable are cutting-edge - thus only big tech can afford to experiment with them and utilize them.

Open source by nature can’t afford such experiments; thus they lag behind - badly.

2 Likes

So the bitter lesson is the success of statistical learning, AKA fitting. It’s bitter because fitting is patently dumb, but easy to parallelize, so can utilize lots of hardware.

Sutton contrasts it with domain-specific methods, but the real alternative is instance-based learning. That's clustering / partitioning / encapsulation, including MoE. It's a lot more complex to do well, thus harder to parallelize, but still domain-general. And I think it will ultimately scale better than fitting, because it can be selective in search, avoiding combinatorial explosion.

1 Like

MoEs are instance-based learners? Domain-specific? :thinking:

1 Like

They have domain-specific experts, right? The types of expertise or their training sets can be and usually are defined manually. But I am talking about the case where this training set segmentation is automatic and instance-based / non-parametric. Although I don’t know if anyone actually tried that.

Anyway, once these separate training sets are defined, they can be considered instances. But of course that's only one aspect of MoEs; their training proper is still fitting. A purely instance-based alternative is connectivity-based clustering: it's done bottom-up, vs. top-down in fitting. But the proper way to parameterize the clusters seems to be very complex.

1 Like