Interesting contrast. I like attempts to relate different techniques like this. It helps us get to the bottom of what is missing.
But I’m confused when you say dissimilarity is rare in raw data, while similarity is rare in generalized data. Don’t dissimilarity and similarity come to the same thing? Aren’t they just different values on a similarity scale?
Maybe you could argue dissimilarity and similarity are qualitatively different. Is zero similarity the same as dissimilarity? Or does dissimilarity only start with negative values of similarity? It’s an interesting question. How is negative similarity different from zero similarity? If two things are completely different (zero similarity?), how does the difference then increase yet more, to make them negatively similar?
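For what it’s worth, on one common scale the two really are distinct: under cosine similarity, zero means the vectors are orthogonal (they share no direction at all), while negative means they actively point away from each other. A toy sketch with made-up vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # +1 = aligned, 0 = orthogonal (share nothing), -1 = actively opposed
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
print(cosine_similarity(a, np.array([0.0, 1.0])))   # 0.0  -> "zero similarity"
print(cosine_similarity(a, np.array([-1.0, 0.0])))  # -1.0 -> "negative similarity"
```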
I think the contrast you are really reaching for is that the similarity measure to use for images is obvious: it’s just light intensity. But it is less obvious for language. What qualifies as “edges” in language data is less clear. What defines the “edge” between one phrase and another? Not obvious! (HTM struggled for years with this. It still struggles, AFAIK. It seems to be stuck on vague ideas about prediction failing and columns bursting…)
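To put numbers on “obvious”: for images you can write the similarity measure down by hand before the system sees any data. A minimal NumPy sketch (the tiny image and edge kernel are invented for illustration):

```python
import numpy as np

# A hand-written horizontal edge detector (Sobel-style). No training:
# because an "edge" in an image just IS an intensity gradient, the
# kernel can be specified in advance.
kernel = np.array([[-1., 0., 1.],
                   [-2., 0., 2.],
                   [-1., 0., 1.]])

image = np.zeros((5, 5))
image[:, 3:] = 1.0  # a dark region meeting a bright region

# Valid cross-correlation: each output value is a similarity score
# between the kernel and a patch of raw pixel intensities.
out = np.array([[np.sum(kernel * image[i:i+3, j:j+3])
                 for j in range(image.shape[1] - 2)]
                for i in range(image.shape[0] - 2)])
print(out)  # strong responses only in patches containing the edge
```

There is no equivalent hand-written kernel for “the edge between one phrase and another”.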
That maps to my comment in one of my posts that historically ANN research just didn’t know what definition of “similarity” to use for many problems. So there’s no need to contrast “dissimilarity” and “similarity”. The contrast you are really reaching for is between obvious similarity measures and less obvious similarity measures.
Transformers, by contrast, have to train to find what to “attend” to for their similarity measures. For language the similarity measure is less obvious, and that was the revolution of “attention”: it allowed the similarity measure to reach back along a sequence to find what to “attend” to. There is no “attention” equivalent in CNNs. There the measure is light intensity, built in.
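Roughly, in code, a single attention head looks like this (a minimal untrained NumPy sketch; all the shapes and names are mine). The point is that the dot product q·k is the similarity measure, and the projections Wq and Wk that define it are learned, not specified:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                          # embedding width (arbitrary)
x = rng.normal(size=(5, d))    # a 5-token sequence

# The projections are what training adjusts: they DEFINE what
# counts as "similar". Nothing here is built in.
Wq = rng.normal(size=(d, d))
Wk = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))

q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)   # learned similarity: every token
                                # scored against every other token
                                # back along the sequence
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax
out = weights @ v               # each token mixes in what it "attends" to
```

Train Wq and Wk against a prediction loss and the similarity measure itself is what gets learned.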
I would say the real revolution for transformers is that they allow the system to search for the right similarity measure to use. And they do this by making similarity subordinate to prediction. Prediction is an obvious measure to use for language, because language is a clear sequence.
Language leads us to sequence and prediction as a foundation for “objects”. I think that’s the advance of transformers. (Not some regressive idea being pushed by LeCun about language being niche data, so we have to go back to CNNs for the real world… Sheesh, everyone wants to go back to the last thing they were doing when they don’t know how to move forward. Some want to go back to Bayes, LeCun wants to go back to CNNs? :-b )
In transformers “similarity” is no longer the base measure for what defines an “object”. That solves the dilemma that no-one knew what measure of similarity to use for problems other than images. The base measure for what defines an object becomes prediction.
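To be concrete about prediction becoming the base measure: the objective a transformer trains against never mentions similarity at all. It only scores next-token prediction, and the similarity projections take whatever shape lowers that score. A hedged sketch of that objective (shapes and names are my assumptions):

```python
import numpy as np

def next_token_loss(logits, targets):
    """Mean cross-entropy on next-token prediction.

    logits:  (seq_len, vocab_size) scores the model emits
    targets: (seq_len,) ids of the tokens that actually came next
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()
```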
So I think the best way to contrast CNNs and transformers might be this: CNNs are based on ad hoc, hand-defined similarity gradients, for problems where an ad hoc similarity gradient seems obvious, while transformers are based on prediction. And prediction allows a more generalized definition of “similarity” (trained using “attention”), and more generalized similarity gradients.
I actually think we could do a “prediction”-based system for images, and that might improve on CNN performance for images too. (Perhaps that happens already. Are transformers already achieving better results than CNNs for images?)
Of course I think the real pay dirt comes when you allow these new, prediction-based similarity metrics to vary dynamically. So that’s another step away from CNNs. In that context you could view our progress as follows:
- CNNs require a human-specified measure of “similarity”. Works OK for images.
- Transformers allow us to take a step back from “similarity” to use prediction as the more fundamental definition of an object, which allows us to search (“attention”) for the best similarity measure. Works better.
- I say the next step forward is to take a step back from static prediction measures and use dynamic prediction measures. (So from “similarity”, to “prediction”, to “dynamic prediction”. Moving from “similarity” to prediction is what allows the similarity measure to become dynamic like this: not only the product of something else, but dynamically the product of something else.)
But I do like that you’re relating CNNs and transformers in terms of the energy surfaces they rely on. Both coming down to…
I just ask, why do we assume the matrix product is static and not dynamic?
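Attention already half-answers this: the trained projections are static, but the attention matrix they produce is recomputed from the input on every pass. A toy contrast with untrained random numbers, just to show where the input enters (the question then becomes whether the projections themselves should be dynamic too):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))        # a 5-token sequence (arbitrary shapes)

# Static: fixed after training, identical for every input.
# A plain linear layer -- or a convolution, unrolled -- works this way.
W = rng.normal(size=(8, 8))
y_static = x @ W

# Dynamic: attention recomputes its mixing matrix FROM the input
# on every forward pass. Change x and A changes with it.
scores = x @ x.T / np.sqrt(8)
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)  # row-wise softmax
y_dynamic = A @ x
```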
(Sorry to push my agenda into your thread! I was just intrigued by your contrast of “dissimilarity” and “similarity”! To summarize again: I think the CNN/transformer contrast you were looking for is between apparent similarity and prediction-derived similarity.)