Yann LeCun on GI vs. current DL

His recent lecture: https://drive.google.com/file/d/1BU5bV3X5w65DwSMapKcsr0ZvrMRU_Nbi/view


  • Abandon generative models in favor of joint-embedding architectures
  • Abandon Auto-Regressive generation
  • Abandon probabilistic models in favor of energy-based models
  • Abandon contrastive methods in favor of regularized methods
  • Abandon Reinforcement Learning in favor of model-predictive control
  • Use RL only when planning doesn’t yield the predicted outcome, to adjust the world model or
    the critic.

This is an elaboration on his previous views, discussed here: Yann LeCun on GI and much ballyhooed "consciousness" - #83 by azat_n

I am starting a new thread to lose all that nonsense about “consciousness”.


You needed Consciousness to do that.


Some of this is also in here - including a few mentions of Forward Forward.


He mentioned trying to combine initial layers of a CNN with self-attention on higher layers, because the former is better for images and the latter for language.
I was thinking they are actually quite similar in basic results: CNN kernels generally converge to some sort of edge-detection kernel, which usually performs lateral matrix multiplication, same as transformers, only there the kernel is called a context window. And it's a lot larger, I think up to 32K currently, vs. a typical 3x3 in a CNN.

The difference is that edge detection kernels compute a lateral gradient, while self-attention computes a lateral query-key dot product, which serves as a measure of similarity. So it seems that short-range gradient is a good measure of salience in contiguous sensory (image) data, while similarity is better for abstract (implicitly compressed) symbolic data, at longer range.
I think that’s because gradient (dissimilarity) is rare in raw data, while similarity is rare in generalized data, discontinuous due to some prior compression. Does that make sense, @neel_g?
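As a toy illustration of that contrast (the arrays below are made up for the example), a lateral gradient picks out the rare step in contiguous intensity data, while a query-key dot product picks out rare similarity between token vectors:

```python
import numpy as np

# Toy 1-D "image": a step edge in raw intensity data.
signal = np.array([1.0, 1.0, 1.0, 8.0, 8.0, 8.0])

# Edge detection: lateral gradient (difference between neighbors).
# The rare high-magnitude value marks the salient event in contiguous data.
gradient = np.diff(signal)
print(gradient)  # [0. 0. 7. 0. 0.] - peak at the step

# Self-attention: query-key dot product as a lateral similarity measure.
# Toy token vectors; a high score marks rare similarity between items.
tokens = np.array([[1.0, 0.0],   # "cat"
                   [0.9, 0.1],   # "kitten" - similar to "cat"
                   [0.0, 1.0]])  # unrelated token
scores = tokens @ tokens.T       # pairwise query*key similarity
print(scores[0])                 # "cat" matches itself and "kitten", not the third
```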


sense only in so far as gridding does not lead to overlapping of crux-energy de-quantification …

Interesting contrast. I like attempts to relate different techniques like this. It helps us get to the bottom of what is missing.

But I’m confused when you say dissimilarity is rare in raw data, while similarity is rare in generalized data. Don’t dissimilarity and similarity come to the same thing? Aren’t they just different values on a similarity scale?

Maybe you could argue dissimilarity and similarity are qualitatively different. Is zero similarity the same as dissimilarity? Or does dissimilarity only start with negative values of similarity? It’s an interesting question. How is negative similarity different to zero similarity? If two things are completely different (zero similarity?) how does the difference then increase yet more, to make them negatively similar?

I think the contrast you are really reaching for is that the similarity measure to use for images is obvious. It’s just light intensity. But it is less obvious for language. What qualifies as “edges” in language data is less clear. What defines the “edge” between one phrase and another? Not obvious! (HTM struggled for years with this. It still struggles AFAIK. It seems still stuck on vague ideas about prediction failing and columns bursting…)

I think that maps to my comment in one of my posts that historically ANN research just didn’t know what definition of “similarity” to use in many problems:

I think the content of the contrast you are trying to make is that the similarity measure is more obvious for images.

So, no need to contrast “dissimilarity” and similarity. The contrast you are really reaching for is between obvious similarity measures, and less obvious similarity measures.

Transformers have to train to find what to “attend” to for their similarity measures. For language the similarity measure is less obvious. That was the revolution of “attention”. It allowed similarity measures to reach back along a sequence to find what to “attend” to when defining “similarity”. There is no “attention” equivalent in CNNs. The contrast is light intensity. It’s built in.

I would say the real revolution for transformers is that they allow the system to search for the right similarity measure to use. And they do this by making similarity subordinate to prediction. Prediction is an obvious measure to use for language, because language is a clear sequence.

Language leads us to sequence and prediction as a foundation for “objects”. I think that’s the advance of transformers. (Not some regressive idea being pushed by LeCun about language being niche data so we have to go back to CNNs for the real world… Sheesh, everyone wants to go back to the last thing they were doing when they don’t know how to move forward. Some want to go back to Bayes, LeCun wants to go back to CNNs? :-b )

In transformers “similarity” is no longer the base measure for what defines an “object”. That solves the dilemma that no-one knew what measure of similarity to use for problems other than images. The base measure for what defines an object becomes prediction.

So I think the best way to contrast CNNs and transformers might be that CNNs are based on ad-hoc defined similarity gradients, for problems where an ad-hoc similarity gradient seems obvious. While transformers are based on prediction. And prediction allows a more generalized definition of “similarity” (trained using “attention”), and more generalized similarity gradients.

I actually think we could do a “prediction” based system for images. And that might improve on CNN performance for images too. (Perhaps that happens already. Are transformers already achieving better results than CNNs for images?)

Of course I think the real pay dirt comes when you allow these new, prediction based similarity metrics, to vary dynamically. So that’s another step away from CNNs. In that context you could view our progress as follows:

  1. CNNs require a human specified measure of “similarity”. Works OK for images.
  2. Transformers allow us to take a step back from “similarity” to use prediction as the more fundamental definition of an object, which allows us to search (“attention”) for the best similarity measure. Works better.
  3. I say the next step forward is to take a step back from static prediction measures, and use dynamic prediction measures. (So from “similarity”, to “prediction”, to “dynamic prediction”. Moving from “similarity” to prediction allows the similarity measure to become dynamic like this, not only the product of something else, but dynamically the product of something else.)

But I do like that you’re relating CNNs and transformers in terms of the energy surfaces they rely on. Both coming down to…

I just ask, why do we assume the matrix product is static and not dynamic?

(Sorry to push my agenda into your thread! I was just intrigued by your contrast of “dissimilar” and similar! To summarize again, I think the contrast CNN/transformer you were looking for is between apparent similarity and prediction derived similarity.)


They may be the same parameter, the difference is in which side of its spectrum is valuable / salient. This distinction is obvious in clustering, which is sparse. It’s far more obscure with backprop, there it would look like fine-grained RL, but built-in vs. external.

Right, everything is more obvious in images, because they are not implicitly encoded by prior cognitive processes that produced symbolic data. As you know, I agree with LeCun that world model should be built bottom-up from sensory experience, forming abstract representations organically, vs. importing them as a separate modality. Linguistic data should map to those higher-level representations as mere labels for external communication, not as independent concepts.

When you do that, the similarity measure remains the same across the whole hierarchy of the world model, which is essential for any sort of integrity. But this similarity can be defined directly or indirectly. Direct is binary or grey-scale coincidence, vertical in Hebbian learning or lateral in self-attention. Indirect is inverted lateral gradient in images; direct similarity is not useful there because visual perception is decidedly indirect: we perceive reflected light.

Yes, that’s due to second-hand nature of symbolic data that I mentioned.

I think this is backwards: you need a similarity measure to quantify prediction in the first place: similarity between prediction and the actual future input. Transformers don’t know what a primary similarity is, so they are forced to use second-order similarity: co-occurrence of second-hand query-key pairs. It’s not more general, just more oblique and opaque.

If it’s obvious, why do you call it ad hoc? I call it first principles; ad hoc is what transformers do.

Not CNNs, but the edge-detection kernels they converge on. A naive CNN is just as ad hoc as transformers.
Anyway, all artificial systems are initially human-specified :). That’s better than specified by dumb and blind evolution. You can think of NI as a self-improvement of evolution, and true AI will be self-improvement of NI.

You need a selection criterion to do that. And the more direct the better: coincidence is better than reproductive fitness in evolution, and similarity is better than coincidence.

The keyword is lateral, multiplication here is used as generic arithmetic operation. In edge-detection it’s actually performing weighted subtraction, multipliers are used simply because they can do both addition/subtraction and multiplication.


I think you’re addressing something like the continuity of the similarity parameter there.

I don’t want to sweat it, anyway. The contrast I want to make is that in transformers the similarity measure is derivative. The more fundamental thing is prediction on sequences. Prediction on sequences defines what “similarity” means, for transformers.

It’s funny. Your interpretation is completely the reverse from mine. You’re saying the fundamental processes are more obvious in images. I’m saying superficial properties of images actually obscure the more fundamental processes.

I think that both language and wider sensory data will end up being structured in the same way. So I agree ultimately “meaning” will come from the world. But I think language provides a more simple place to see that process.

I think language is a more simple place to see that process, because it is something produced by the brain, for itself. So naturally it will have just those characteristics which the brain responds to, and as little as possible of anything else.

And what “learning” over those characteristics is telling us, I believe, is that characteristics the brain responds to are predictions over sequences.

Having gained that insight you can dump language if you like. We can analyse other sensory structuring processes as predictions over sequences.

I’m guessing that will explain why our visual system, for instance, uses saccades. And as I recall Jeff Hawkins pointed out, back in the day, that the proper perception of touch also requires you to move back and forth over a texture to be sensed.

IIRC that’s why Jeff conceived his direction of research to be a “temporal” model.

So I think we’re coming back to that. We’re coming back to the “temporality” of (Hierarchical) Temporal meaning. And we need to start thinking again of meaning creation from sensory perception generally as a “temporal” process.

That can be a place HTM got things right.

But you’re saying, because images have this property of visual contrast edges, that somehow this is the fundamental system, and we need to seek it elsewhere too. Even though it was kind of a dead end for a long time.

Transformers have been great. But we’ll not learn from them. We’ll just dismiss them as a niche data set, and go back to what we were doing before?

I just don’t think a generalization of the image abstraction by edges idea has generalized to other systems. If the continuing admonishment to keep a bland expression for my passport photo is any indicator, or even Tesla’s continuing agonizing transit of the asymptote of infinite visual road novelty, it hasn’t even worked completely for images either. That’s why transformers have been an advance over CNNs.

Once again, the reverse of me. I see this ability to make transformers find what to attend to as a feature not a bug. A feature which is more apparent for language because of the “second-hand” nature of language being something made by the brain for itself, maybe. But that’s not niche. It’s just more revealing.

It’s because transformers are led by language to define similarity in terms of prediction, that they provide a more flexible framework to address the meaning problem more generally. We can get away from visual “edge” contrast.

Visual edge contrast is OK. It was a step up from full image supervised “learning”. But it’s not enough. CNNs have trouble with broader structure. Training to edges is barely an advance over the old supervised “learning” of images. It generalizes them a bit. Which is why CNNs became popular. But the convolution window is small, they generalize by combining these small parts in different ways, and it’s not obvious how to parameterize those higher (hierarchical) structural principles. Often for CNNs it’s just random? Or maybe a deformation (I think I saw something glancing at LeCun’s lecture talking about deformations…)

By contrast, taking a step back to define similarity as derivative on prediction gives us more power. Most important of all, it allows us to start having concepts of novel meaning. Not just deformations. Flat out novelty. Something can be novel and still meaningful, if it still predicts usefully.

Exactly! Right! We’re seeing this in exactly the opposite way!

You’re saying a prediction based measure still meets the similarity problem? Yes, everything needs its foundation. For the prediction measure I think that will come down to nerve firings. They’re the ultimate resolution of perception.

I don’t think that’s hard to do. You’ll have to do it too.

What’s hard, is figuring out how nerve firings need to be combined to create higher (hierarchical) structure. If you’re saying the mechanism of meaningful structuring will be “edge contrast” all the way down, you need to explain how edge contrast will generalize beyond visual images.

In particular, if you want it to apply to language, the success of transformers suggests you’ll need something like prediction. So unless you want your structural principle to be ad-hoc to images only, it’ll need to incorporate prediction anyway.

Which will be the more fundamental? I can build edges from prediction. Can you build prediction from edges? Can you build novel meaning from edges?

I call light intensity an ad-hoc similarity measure for cognition, because it is appropriate to one special circumstance of cognition, which is images, but not more generally.

Dictionary: ad hoc

  1. For the specific purpose, case, or situation at hand and for no other.

Prediction over sequences I think is more fundamental (once again suggestive of saccades, and the “temporal” original motivation for HTM.)

You think visual edge contrast is more fundamental. Well, I hope the success of transformers can provide an argument against that. Transformers have simply proven more effective than CNNs.

And edge contrast just has no way to deal with novelty. How do you create new meaning with edge contrast? It’s only one step of abstraction above supervised learning from a fixed set of meaningful “objects”. That lame failure to generalize meaning beyond examples was what I stated here:

Prediction over sequences solves this problem that we can’t think of any definition of meaning except listing examples we consider meaningful. And opens the door to novel meaning. It opens the door to meaning beyond just a big list of examples. Ha, “meaning” defined as the biggest list of examples you can find (c.f. Tesla building simulations to create things to train to!!)

And once again a more fundamental definition of meaning in terms of prediction is suggestive of saccades, and the “temporal” original motivation for HTM.

Was that a reason why the operation can’t be dynamic?


Those already exist - they’re often called hybrid convnets in the field. They aren’t used for scaling (though they scale better than pure CNNs) because it still introduces an inductive bias and is computationally a bit expensive and slow - the worst of both worlds IMHO. The real advantage is that the inductive prior of convolutions helps sample efficiency, the downsides being weaker robustness to adversarial attacks and generally not being as competitive with pure ViTs.

AFAIK CNN filters converge to detect textures more; the earlier filters do converge to edges but most processing happens in the later layers anyways.

I’m not really sure what that means, to be honest. Lateral gradient? Lateral matrix multiplication? Do you mean the inductive bias of convolutions to weigh neighboring elements more?


I meant gradient as computed by Sobel, etc: 2D version of difference, the intensity of which defines an edge. Yes, proximity-first because that’s where gradient is the lowest / rarest, especially in 1st layer.

My point is, rarity is “high energy” or salience; that’s why CNNs operating on contiguous images converge on detecting it.
Transformers, on the other hand, generally operate on discontinuous symbolic data, where co-occurrence (similarity represented by the query-key dot product) is rare or high energy, so that’s what they detect.
Both end up performing lateral cross-comparison, defined as an operation that computes difference and/or similarity between input items within a kernel. But CNNs converge on computing the difference, while transformers are pre-designed to compute similarity.

I think they should compute both difference and similarity; ignoring the difference is probably why they make shit up. But meaningful representation of difference is only possible if you know what the tokens actually represent: generalised physical objects. To get there we have to learn bottom-up: from images.
Note that I am not proposing using CNNs, but directly what they converge to: local cross-comp, starting with pixels in the Sobel operator, although weighted by feedback.
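Roughly, in code, what local cross-comp via the Sobel operator looks like (a made-up step-edge image; the explicit loop is for clarity, not speed):

```python
import numpy as np

def sobel_gradient(img):
    """Local cross-comparison of pixels: 3x3 weighted lateral differences."""
    gx_k = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    gy_k = gx_k.T
    h, w = img.shape
    g = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx = (patch * gx_k).sum()   # horizontal weighted difference
            gy = (patch * gy_k).sum()   # vertical weighted difference
            g[i, j] = np.hypot(gx, gy)  # gradient magnitude = "edge" intensity
    return g

# Vertical step edge: gradient magnitude peaks along the boundary,
# and is zero in the flat (low-information) regions on either side.
img = np.zeros((5, 6))
img[:, 3:] = 10.0
print(sobel_gradient(img))
```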

No, what I am saying is that the fundamental cognitive process is lateral cross-comparison. Which is what transformers, and the retina BTW, do by design, but CNNs need to be trained to converge to. And you won’t fully understand the logic of it unless you start from the beginning: the pixels.

You are right in the sense that spikes represent a binary version of similarity: coincidence. And spike trains are something like integer-valued similarity. I define such direct similarity as min, which is an equivalent of overlap between two spike trains. That doesn’t actually work for raw images; there I define similarity as inverse deviation of gradient.

So we may not be that far on the definition of similarity, but I emphasize that difference should be computed along with it. And isolated instances of similarity won’t tell you much, you have to understand the process that ultimately produced them, which always starts from raw sensory data.

Of course it should be dynamic. Let’s just try to focus on the logic behind it.


I’m unable to resolve a difference between “visual contrast edges” and “lateral cross-comparison”.

We’re back with this distinction you’re drawing between “similarity” and “difference”.

You used this to distinguish between transformers and CNNs.

I tried to get away from it by arguing that the distinction between transformers and CNNs you were trying to describe came down to the fact that transformers have “similarity” measures spread out along a sequence, and identified by training “attention”. (Whereas “similarity” for CNNs is more direct, depending on a human specified convolution filter?)

So I suggested a contrast between “similarity” (based on something different each time) and “similarity” based on prediction, instead of a contrast between “similarity” and “difference”.

My argument was that it was more informative to contrast transformers and CNNs by saying that transformers “learn” similarity, as derivative of a more fundamental principle of prediction. (Where CNNs are more tied to superficial visual qualities of images. In itself closer to the “meaning by example” idea of original ANNs, because it’s tied to the “example” of images. To apply the same idea to any other data type you would have to define the convolution operator in a way specific to that data, and it’s not clear how that generalizes.)

Especially I was arguing that making the distinction transformer to CNN one of similarity based on prediction, was more informative, because making “similarity” derivative on prediction like this, means you can have measures of “similarity” which are dynamic.

To embrace your language, I guess I’m saying the process which produces measures of “similarity” relevant to human cognition is actually external to the raw sensory data. So a hard contrast with you there. I’m suggesting it’s external to the data in the first instance, internal firstly to the process of the organism. I’m saying it’s an evolved sensitivity to prediction. An evolved sensitivity to prediction will act on raw sensory data. But it doesn’t start from that data. Which is actually a power, because it means you don’t need to find a new sense of “similarity” for each new set of data you address.

This maximization of prediction might reduce to the sense of “difference” you want to make. To use your language. But it will provide a unifying principle to build that sense of “difference” on, between many sets of data.

If you insist the sense of “difference” must always be internal to the data only, and come from each new set of data each time, without any systematic principle like predictive efficiency to guide it, I think your task is steeper.

At the very least having a different sense of “difference”, dependent on each new type of data, complicates the cognitive system. The cognitive system will then consist of many (ad-hoc) similarity measures, one for each type of data, like many different CNN convolution filters (making “difference” distinctions?) Making the basis of “difference” predictive effectiveness, provides a single unified process.

Basing many senses of “difference” for many different types of data, all on predictive efficiency, also suggests a path forward. A single process can be made dynamic more easily. And shared context suggests itself very simply as a way to identify predictive groupings using resonances on a network of observed sequences. Much simpler than conjuring new “difference” measures for new kinds of data. Which might be what LeCun’s proposal reduces to as well.


So, comparison should compute both similarity and difference. I prefer match and miss because there are multiple dimensions and derivation orders for each. Gradient is 1st-order 2D difference between pixels, a specific type of partial miss.
In vision (reflected light) we don’t need distinct similarity: match is just average gradient - instance gradient. But we do need it for higher-level representations, where magnitude and predictive value correlate. In comparison by subtraction between integer-valued a and b, partial match is their shared quantity / partial coincidence we just discussed: min(a,b), and miss is a-b. Those are obviously different: min is not directional while a-b is. Each has independent predictive value: the difference represents change in the input, and change may persist when raw input doesn’t. And so on.
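A minimal sketch of those two derivatives (toy integer inputs, made up for illustration):

```python
# Pairwise comparison yielding both derivatives: match (shared quantity)
# and miss (signed difference), per the min / subtraction definitions.
def compare(a, b):
    match = min(a, b)  # shared quantity: non-directional
    miss = a - b       # signed change: directional
    return match, miss

print(compare(7, 3))  # (3, 4): 3 units shared, a exceeds b by 4
print(compare(3, 7))  # (3, -4): same match, but miss flips sign

# For raw images, match is defined indirectly instead:
# average gradient minus instance gradient (low deviation = high match).
avg_gradient, inst_gradient = 5.0, 8.0
img_match = avg_gradient - inst_gradient  # negative: a strong edge, low match
```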

You keep getting it backwards. CNNs don’t do lateral cross-comparison (edge detection) by design, they learn to do it through backprop. Transformers do it by design, that’s what query*key is, a measure of similarity between the two. Nevermind sequences and matrices, try to understand what happens pairwise.

You can’t even begin to define or discover “prediction” unless you already have some measure of similarity to quantify it with. Just try.


We agreed the baseline similarity is neural firings in response to stimulation. But that is not “meaningful” similarity. The same neuron firing all the time may be “similar”. But it is not meaningful. “Meaning” must be built on top of that. That’s a structuring process.

We’re both talking about a structuring process. Ultimately a structuring process on top of neuron firings.

I’m saying the structuring process is most fundamentally parameterized by prediction. That’s a single unifying principle.

You are saying the structuring principle must come from the data somehow.

Actually, I don’t disagree. To an extent. The animal will need to find what aspects of a given set of data are most usefully predictive for it. But without the unifying principle of prediction around which to frame its attempts to structure, I think your task is harder.

You need some guiding principle. There are just too many ways to structure anything. Typically we minimize “energy”. Which is to go for the simplest description.

But we don’t know which simplicity is “meaningful”.

This was my characterization of early ANNs, where their only guess at “meaning” was supervised labeling. The simplicity is to throw out anything which does not distinguish the label.

Repeated structure is another attempt to find a unifying principle which defines “meaning”. If something repeats, perhaps it is “meaningful”? The implicit simplicity there is to throw out the repetition.

That is still the state of HTM. The only measure of structure in HTM appears to be still repetition (predicted sequence until “burst”.) (Even TBT? Is TBT just a way to resolve the repetition more clearly?)

I think repetition is the basis of most “unsupervised learning” today, too. Like the “discovery” of cats, which first propelled ANNs into the Google mainstream. It will have been because cats, repeat.

I think repetition is still the basis for CNN “learning” too. The innovation of CNNs was just to generalize the repetition a bit by making it local only.

Perhaps that’s one way to say it. I think your “meaning from data” is really “meaning from repetition”. The hidden definition of “meaning” in your assertion that meaning comes from the data, is actually an assertion that meaning equates to repeated structure, again.

In that case you might summarize my suggestion to move the foundation of meaning to be prediction, to be that it allows us to generalize the structuring principle away from repetition. So seeing the history of AI as moving first away from point labels (GOFAI) then away from labeled data (more flexible with the labels, but still equating meaning to human defined labels), then to simple repetition as a basis for “unsupervised learning”, now away from simple repetition, to be structure which predicts, even if it does not repeat. We are not limited to a structuring principle based on repetition anymore. We can now have other structure that predicts. That’s something you would not guess unless you made prediction your deeper definition of “meaning”.

It characterizes the advance of AI to be: labels → repetition → prediction.

It liberates us from a definition of “meaning” which is limited to either labels, or, now, repetition.

So perhaps the contrast I want to make in response to your characterization of “similarity” vs. “difference”, is not “prediction” vs. “similarity”. They are both similarity. It is similarity based on prediction vs. similarity based on repetition.

Are you sure the hidden definition of “meaning” in your assertion that meaningful structure can be learned from data, is not that meaning = repetition?


Let’s frame this discussion in terms of operations, not their results. We don’t know what the inputs or outputs will be, this is supposed to be a general learning system.
Operations in a neuron are definitely not a baseline, obviously not principled, and not even fully understood.

The simplest comparison is Boolean: match = AND, miss = XOR. AND is the coincidence we talked about. But that’s not very useful because real-world input is at least integer-valued, etc.
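A sketch of that Boolean baseline (pure Python, toy values):

```python
# Boolean baseline comparison: match = AND (coincidence), miss = XOR.
def bool_compare(a: bool, b: bool):
    return a & b, a ^ b  # (match, miss)

print(bool_compare(True, True))    # (True, False): coincidence, no miss
print(bool_compare(True, False))   # (False, True): no match, full miss
print(bool_compare(False, False))  # (False, False): nothing present to compare
```

For integer-valued input, the same two derivatives generalize to min(a,b) and a-b, as discussed above.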

Define prediction as an operation.

The structure of input to each level of processing comes from operations performed on primary sensory input in all lower levels of processing. The higher the level of generalization, the deeper the structure. The bottom level is pixels. That structure is meaning, but we don’t know what that’s going to be; the operations must be defined a priori.

Repetition is binary match: AND. I am talking about matches + misses of operations of incremental complexity, performed on inputs that should also have incremental complexity, derived by lower-level comparisons.

Again, define prediction in terms of operations. Until you do, it’s just a meaningless label.


Ah, the Internet bread and butter. An argument about definitions!

That Internet debate typically ends (or neverendingly never ends!) this way is central to my thesis that a complete definition of meaning is impossible.

I used the words “prediction” and “repetition” as a gloss for the contrast between the objective function of transformers and convolutional NNs. The reference is already concrete!

For the purposes of argument I’m pretty much accepting the objective function of transformers. I only want to allow it to become dynamic. As it can. The objective function of transformers, predicting the next element in a sequence, can become dynamic. It’s not dynamic at the moment, but it can become dynamic (by seeking resonances in a network instead of slopes by back-prop.) That’s its great power. The objective function of CNNs, characterized as repetition, is inherently static. Being static is its content. Repetition: something which doesn’t change.

Further generalizations of the references to the words are surely infinite. This is an opportunity for you never to be satisfied, and the direction you are taking the conversation. Eternally. That’s my broader point. Meaning is infinite. We can only define it in terms of a principle which generates it. Attempting to “learn” all meanings, will only result in an infinity. Let alone attempting to state them all. Though demands to state them all, and state them completely, will no doubt provide grist to the mill of Internet dispute as long as there are people who wish to disagree!


Bread and butter of internet is loose talk.

Ok, let me do this for you:

Tokens in self-attention are not compared directly, that would be meaningless because they are just labels (although they may map to embeddings).
What is cross-compared by multiplication is their weights, which are computed over the whole vector. Pairwise, I guess the weight is composed of binary precedence: 1 for WQ if query precedes the key, the reverse for WK. That would be an atomic instance of your prediction in transformers.

So testing for such prediction is binary version of comparison: AND(key_instance, key_expectation).
It compares presence of specific key token between instance [query,key] and expectation [query,key], so that’s a higher-derivation comparison, of occurrence rather than of the content. Still fits in my definition, it covers comparison of any derivation order.

If you are really interested in figuring these things out, try chatbots, they are already better than forums. Here is an example:

Me: Is each element in value vector a single embedding or a full set?


It depends on the model. In some models, each element in the value vector is a single embedding. In other models, each element in the value vector is a full set.

In the case of a single embedding, the value vector would be a vector of the same size as the query and key vectors. In the case of a full set, the value vector would be a matrix, where each row would represent a different embedding.

The decision of whether to use a single embedding or a full set depends on the task that the model is trying to perform. For example, in a machine translation task, a single embedding might be sufficient, since the model is only trying to translate each word in the source sentence. However, in a text summarization task, a full set might be necessary, since the model is trying to summarize the entire text passage.

Here are some examples of models that use single embeddings in their value vectors:

The Transformer model, which is a popular model for machine translation and text summarization, uses a single embedding for each word in the input sequence.

The Bidirectional Encoder Representations from Transformers (BERT) model, which is a popular model for natural language understanding, uses a single embedding for each token in the input sequence.

Here are some examples of models that use full sets in their value vectors:

The GPT-3 model, which is a large language model from OpenAI, uses a full set of embeddings for each token in the input sequence.

The RoBERTa model, which is a robustly optimized BERT pretraining approach, uses a full set of embeddings for each token in the input sequence.

It is important to note that the decision of whether to use a single embedding or a full set is not always clear-cut. In some cases, it may be necessary to experiment with both approaches to see which one works best for a particular task.

1 Like

“Loose talk”? I haven’t heard that one in a while. Exclusively in the sense of revealing secrets:

You honour my comments as the revelation of secrets? :slight_smile:

Sorry to razz you. Do you mean I wasn’t affording your request for definitions the dignity it deserved?

You’ve gone into the transformer algorithm to find the sense of similarity I mean?


Yes, for the “prediction” algorithm of transformers, as in a recursive NN, or in any grammar, you form a kind of class around the current token to predict the next token. So if the current token is “the”, then you form a class with weights, say {the - 1, a - 0.5}, to represent the observation that the word “a” tends to share 1/2 of the following-word contexts of “the” (for argument’s sake.)

That generalizes your predictions, so you’re not limited to sequences observed with “the”; you also have a prediction for sequences which have only been observed with “a”. And so on for every token in the data.

The weighted vector was an improvement over symbolic grammars because it captured partial similarities.

But it didn’t work well for years, because the prediction depends on the prior sequence. The obvious solution would be an RNN, but RNNs couldn’t converge over that history. LSTMs helped by carrying forward bits of the history. Transformers caused a jump by making the learning of weights over that history the whole thing (“attention”.)

But it’s all weighted vectors (tensors in context) of shared predictions in context.

Then the “cross comparison” might be a multiplication of this shared prediction in context vector by a vector for the observed sequences of each word.

The similarity calculation here might be seen to be in the weights of the “shared prediction in context” vector.
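The {the - 1, a - 0.5} weight above can be reproduced with a toy sketch (the corpus and numbers are made up for illustration; no real model computes similarity this crudely): count the words that immediately follow each word, and take the overlap of those following-word contexts as the similarity weight.

```python
from collections import Counter

# Made-up corpus, chosen so that "a" shares half the following-word
# contexts of "the".
corpus = ("the cat sat on the mat and a cat sat on a rug "
          "the dog ran and a dog ran").split()

def following_contexts(word, tokens):
    """Count the words that immediately follow `word` in the token list."""
    return Counter(t2 for t1, t2 in zip(tokens, tokens[1:]) if t1 == word)

ctx_the = following_contexts("the", corpus)   # {cat: 1, mat: 1, dog: 1}
ctx_a = following_contexts("a", corpus)       # {cat: 1, rug: 1, dog: 1}

# Overlap of following-word contexts, as a crude similarity weight:
shared = sum((ctx_the & ctx_a).values())      # contexts both words share
total = sum((ctx_the | ctx_a).values())       # all contexts either has
similarity = shared / total
print(similarity)                             # 0.5, i.e. {the - 1, a - 0.5}
```

This is “similarity based on prediction” in its most naive form; attention replaces the raw counts with learned, context-dependent weights.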

I’m glad we’ve got that clear.

As you’ve detailed, it was implicit in my reference to the algorithm of transformers.

Then you go on to distinctions between “single embedding” and “full set”. We’re getting deep into definitions land here. Does it matter? We already agreed the transformer is collating weight vectors as a similarity measure based on shared predictions/contexts.

You’ve sketched here a sense of “similarity” in the transformer context of predicting word sequences. This is indeed the reference I was making when I contrasted transformers as “similarity based on prediction”, instead of “similarity based on repetition”.

What’s your point?

(Re. querying BARD to produce sequences around your prompt words. I don’t know what your prompt history was for that session with BARD, so I don’t know what context it was using for the term “single embedding”. Or if it is capable of holding such context.

But the sequence “in a machine translation task … the model is only trying to translate each word in the source sentence” seems odd. I can’t imagine where it found anyone saying machine translation attempts to translate word-by-word. If that’s the “meaning” of that sequence. Maybe there was an MT system in history which attempted this comically simple approach. Or it might have made it up on its own. No way to tell.

Anyway, to associate the (spurious) idea that people want translation to proceed word-by-word with a justification for a “single embedding” in transformers, while impressive in a sequence-continuation sense (though perhaps only chained on the word “transformer”??), seems unjustified. The whole point of a transformer (attention) is surely that it will generally not be attempting predictions based on a single token. So whatever BARD is conjuring up as a reference for your distinction between “single embedding” and “full set”, I don’t think the association of transformers with translation will justify it.

Using transformer chatbots would seem the very embodiment of endlessly propagating disputes about word meanings. Because lacking any grounding beyond a little “attention” context, they will surely free associate any sequences containing the same words.

I hope my future doesn’t increasingly hold the need to psychoanalyze vast quantities of word sequences liberated from meaning, generated by transformers!)

1 Like

Get used to it. This will become a dominant force in science and engineering in the future.
If you want to get extra meta, at some point the internet will be full of chatbot utterances, to be sampled in new training sets and further propagated.

Why do I hold such a view?

As part of my job, I deal with a raft of regulations, each with required tasks and reporting requirements. I use ChatGPT now and it has been very helpful in preparing official documents that are as good or better than I can draft. And much faster. I also have been using it to dig up obscure references in official standards and it has been spot on. ChatGPT has been doing an excellent job supporting me in these areas.

I am just learning how these things can help me and I am sure that as I continue it will only get better.
I have not found any work-related uses for DALL-E, but if I did more with anything even remotely image-related I can see how this would be a fantastic tool.

I dabble in music (mostly playing jazz to entertain myself) but the tools to write music in any genre, tempo, voicing, instrument choice, and mood, are already better than I have ever been.

Much of engineering has areas where multiple disciplines overlap (Examples: legal, compliance, component EOL and availability, documentation, basic physics, statics, dynamics, component sizing, best practices, system interactions, inventory management, fabrication/construction techniques, cost, life-cycle management, ITAR, marketing, competitive product research, Etc.) and many engineers have blind spots that end up compromising designs. A good design needs to optimize all of these items simultaneously. When I was consulting I was hired to fix design problems in many of the areas I just listed. An AI engineering assistant tool should help prevent blunders and make better products overall.

As others learn how to use these tools I expect that new subscription models will emerge to monetize this market and the various purveyors will keep innovating to capture users. This will emerge as an indispensable feature for every company. I can barely imagine how enterprise suites will work but I am certain that ERP and CRM software will get much better than they are now. This should be very helpful in supply chain, configuration management, and compliance areas.

Much of what I was expecting AI to do is now being done and it is better than I hoped. And it will get even better soon.


What constitutes a “good” official document I suggest is open to debate. I’m inclined to think of sanatory utility. If voluminous and vacuous are your criteria, it probably fulfils that well, yes.

For recombining boilerplate in documents no-one wants to read, it’s probably good. It might put together a very nice example of one of those software user agreements that serve only to remove all your rights in 10 or 20 pages that no-one reads, notices, or understands until too late.

Digging up obscure references, I don’t doubt it might have true utility. Massive memory is where all contemporary AI tech already excels. Humans can’t beat it for memory. (That’s how they do better with image recognition, they can remember all the different breeds of dogs, for instance, even if they struggle with dogs in the wrong light, etc…)

So, yeah, digging up obscure references, they might do well. The issue comes when they “dig up” entirely false references.

I tested this with a friend who thought of using GPT-3 to check the “novelty” of hypotheses.

As “tested” by GPT-3, nothing was new. If there wasn’t an attested example in the past, it created one.

Here’s one of the outputs I looked at more closely:

Idea: The fact that substitution patterns for language sequences grow without bound is a source of creativity.
The capacity to create an infinite number of sentences from a finite set of words is a source of creativity.

Novelty: 0

Similar Ideas:

  • Chomsky’s work in linguistics
  • Popper’s ‘Abduction and the Logic of Scientific Discovery’

The idea that the fact that substitution patterns grow without bound is a source of creativity is not new. Chomsky wrote about it in his work in linguistics. This idea is rated at 0 because it is not a new idea.

Now, this is my claim, so I was interested in it! :slight_smile:

The mapping to Chomsky is equivocal. Yes, Chomsky emphasized in his writings that language is infinite. Maybe GPT-3 picked up that superficial correspondence of words.

But Chomsky himself totally rejected the deeper sense of this claim. Specifically. He read it, and rejected it. So GPT-3 is saying Chomsky has priority. But Chomsky himself completely disagreed with the same claim!

Why? Because Chomsky addressed the full meaning, and didn’t just trigger on superficial words.

Note the deeper claim is that substitution patterns are the source of this creativity. The substance of my claim is that substitution patterns work to find “grammar”, but are chaotic. Chomsky’s thesis had always been that substitution patterns do not work to find grammar (but he concluded this before the “discovery” of mathematical chaos, on evidence that looks to modern eyes very much like mathematical chaos!) That same, chaotic looking, evidence, was why he hypothesized Universal Grammar. Same evidence. Different interpretation.

He was kind enough to listen to my interpretation. But he utterly rejected it. He characterized it as something like the “most egregious misconstrual he had ever heard”! (About 1997, this. Funnily enough I found later he and Syd Lamb appeared to have disagreed in much the same way about the interpretation of the same evidence. Chomsky thought it meant Universal Grammar, Lamb thought it indicated grammatical categories were “non-linear”. Lamb, Funknet, '00s: '…perhaps his most celebrated argument concerns the Russian obstruents. He correctly pointed out that the usual solution incorporates a loss of generality, but he misdiagnosed the problem. The problem was the criterion of linearity.)

Anyway, GPT-3 was at best superficial in this rejection of novelty, probably triggering on the words “language” and “infinite”.

Re. Popper and abduction. I searched for this, and the best I could find was an article contrasting Popper’s falsificationism and Peirce’s abduction. As far as I know abduction is purely Peirce. And abduction does have something of the infinite creativity sense (though nothing that I know talking about substitution patterns!) But falsificationism has much the opposite flavour, again. So, GPT-3 asserting similarity in opposites, again.

In any case, the book attributed to Popper does not exist, and saying it does exist is kind of an annoying error, because you have to spend much more time checking it doesn’t exist, than you would have spent finding genuinely relevant work with a simple key word search.

Important point: The program performs well enough to help me in my work.

I am satisfied with the quality of the work performed and since this is regulatory and compliance paperwork I always double-check the references provided. Once I know where to look the verification is fast and easy.

So far, with this class of references it has been correct on every single standard and regulatory reference it has offered. This has reduced my task of finding and comparing references from ASTM, SAE, CE, UKCA, UL, FM, CCC, FCC, FARS, Prop 65, and other regulatory paperwork from hours to minutes. Perhaps you find reading standards and regulations to be deeply rewarding, but I do not. That is why they call it work. It’s a task that has to be done to offer products in the current highly regulated marketplace.

The letters generated may not be Shakespeare but they are suitable for the routine responses that I require. “Good enough” has a place in the commercial workspace.

Now your needs may be different, but the use of ChatGPT has made me more productive and lightened my workload. I think that many people will have the same experience.

1 Like