Wolfram and ChatGPT

The only thing even more interesting than ChatGPT itself may be the range of views about it.

I would seriously recommend looking at Wolfram’s description, which includes (AFAIK) still-unpublished architectural details:

He seems convinced that there is real science to be learned.

“The success of ChatGPT is, I think, giving us evidence of a fundamental and important piece of science: it’s suggesting that we can expect there to be major new “laws of language”—and effectively “laws of thought”—out there to discover. In ChatGPT—built as it is as a neural net—those laws are at best implicit. But if we could somehow make the laws explicit, there’s the potential to do the kinds of things ChatGPT does in vastly more direct, efficient—and transparent—ways.”

However, he has plenty to say about the lack of theory (but he has a theory for that too!):

“Why does one just add the token-value and token-position embedding vectors together? I don’t think there’s any particular science to this. It’s just that various different things have been tried, and this is one that seems to work. And it’s part of the lore of neural nets that—in some sense—so long as the setup one has is “roughly right” it’s usually possible to home in on details just by doing sufficient training, without ever really needing to “understand at an engineering level” quite how the neural net has ended up configuring itself.”

His extension of this with ‘real computation’ (via Wolfram Alpha) is the latest thing:


uhh… but there is a good reason for that?

There are other things that are inexplicable about LLMs, but this is definitely not one of them.


Could you explain? I mean, no pun intended. Could you explain to someone who is not up to speed?

Thanks for sharing; this has fixed a lot of my broken intuition about LLMs. I love how he inspected the model using Wolfram Alpha. It’s very intuitive and seems like a great tool for model explainability.

I hope we start using this kind of interactive, objective method on this forum when explaining or proving things, especially when we are trying to strongly assert something. That way we can minimize the “your word against my word” kind of situation.


Why do we add positional encodings? Well, the attention mechanism is permutation invariant, i.e. it doesn’t propagate any information about the relative positioning of the input sequence. You could shuffle the words and attention would still see them exactly the same.

So we add positional encodings: a multi-dimensional vector composed of periodic functions (sines and cosines) with varying frequencies, which injects some positional information. Effectively, you can think of it as pushing the token at index 0 into the “0th index club”, where it now resides in a vector space carrying that specific positional information too.

This can lead to problems: feeding the model sequences longer than anything it saw during training causes errors, since those positional encodings are OOD. So fixes have been devised that encode the same information relatively for each token, whereas the scheme above is absolute (the 0th token only ever gets pushed into the 0th “club” of vectors; relative information is implicit and has to be extracted).

So the idea is pretty well considered and well researched; it’s not anything new. In fact, it’s quite a hacky and obvious solution. But it works best empirically, so we stick with it.
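As a concrete sketch of the absolute scheme described above, here is the standard sinusoidal formulation from the original transformer paper in plain numpy. This is generic textbook material, not anything specific to ChatGPT’s internals:

```python
import numpy as np

def sinusoidal_pe(n_pos, d_model):
    """PE[p, 2i] = sin(p / 10000^(2i/d)); PE[p, 2i+1] = cos(same angle)."""
    pos = np.arange(n_pos)[:, None]           # (n_pos, 1)
    i = np.arange(d_model // 2)[None, :]      # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)              # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)              # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(512, 64)

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Nearby positions get similar encodings; distant ones drift apart.
# That is exactly the ordering signal that bare attention lacks.
print(cos_sim(pe[10], pe[11]) > cos_sim(pe[10], pe[400]))  # True

# And the same token embedding becomes distinguishable across positions:
tok = np.random.default_rng(0).random(64)     # stand-in token embedding
print(np.allclose(tok + pe[10], tok + pe[400]))  # False
```

The “0th index club” picture corresponds to row `pe[0]`: every token that appears at position 0 gets that same vector added to it.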

Just keep in mind that Wolfram isn’t really an expert on these things, so take some of his ideas with a grain of salt :wink: At the end of the day, this is just a marketing spiel for WA to show they’re still relevant in the AI game.


Thank you. That was crystal!


In the brain you have the arcuate fasciculus, which connects the speech-production part (think grammar) in the frontal lobe to the posterior region where the “here & now” temporal-lobe part meets the top of the WHAT/WHERE streams in the parietal lobe. Think of that posterior region as the “meaning” part.

The recognition and production of speech depend on both parts working together, and there are well-known deficits that occur if either end is damaged.

This is much like the mixing of the word with its location in the text string: the position roughly stands for grammar, and the word roughly stands for the meanings that can occupy that slot in a correct grammatical construction. Abstracting to a higher level, you get words that tend to be found together for context and meaning, plus sugar words for grammatically correct speech. Since these are deep networks, there is room for interaction between the word and grammar fragments and the similarly deconstructed user-entered prompts.

It is not surprising to me that there are many parallels between the internals of the GPT transformer architecture and what is known of the gross architecture of the brain.

It is also unsurprising that the first round of GPT-4 was utterly unaligned until the creators put in some additional work to make the program a bit more palatable for human interaction.

In humans, these social constraints (alignment) are performed by training the limbic system. In humans that are raised without suitable human contact, the resulting humans are often clinically psychopathic. If they were a program we would consider it “unaligned.”



Can you please explain your proof of this? What are “these things”, and which of them should we take with a grain of salt? The part where he said “there is no science” makes sense to me, and you even explained it above very nicely: it’s not completely science; yes, there are empirical aspects to it.


I don’t accept that for a minute. This is one of the smartest guys writing software today, with a phenomenal background in maths and a vast range of tools at his disposal. I find his comments clearly stated, carefully thought out, and perfectly credible. If he says “we do this because we find it works, not because we know why”, I tend to accept that.

The problem with experts in the field is that they’re experts in “we do it because it works”, and pretty bad at explaining why, or at any kind of underlying theory.

If he’s wrong, it’s up to you to show where and why he’s wrong, not to write him off because he’s not one of you.


From what I could follow, what he proposes is to pair ChatGPT with Wolfram Alpha as a tool in order to provide better answers. The main reason he considers Wolfram Alpha the right tool is that it too can “understand” queries in natural language, and natural language is, duh… natural to ChatGPT.

A few issues with this:

  • Somehow ChatGPT must recognize that the current prompt would be better answered by integrating output from a specific external tool.
  • GPT-x can generate scripts in many languages, e.g. programming languages or web/database queries, which falsifies the implicit assumption that a tool accepting natural-language queries is what is required (or better) to improve GPT output. In reality it can use an arbitrary number of tools to externally check its output.
  • Personally, I have issues with proprietary systems. Now that GPT went proprietary, Wolfram’s idea is like “Why don’t you strap yourself tighter to the proprietary bandwagon? Here’s my proprietary ball & chain to do so!”

PS, regarding the “why” question:

@neel_g already provided the “because it was proven to work” answer.
Here’s a take on why it works. Since this is a Numenta forum, maybe a ScalarEncoder analogy will help.

Imagine you have a 2000-value token embedding (values between 0 and 1) for “cat”, and a 3000-token history window.

The position of the “cat” token is 525 in the history. Use a scalar encoder that takes values from 1 to 3000 (the window size) and outputs an SDR of size 2000 (the token embedding size).
The SDR doesn’t actually need to be sparse (NNs don’t care much about sparsity); let’s say 40% of its bits are active. Add the two vectors, subtract 1 from the result, and apply a ReLU.

This is equivalent to gating the token by the position SDR: every value that lines up with a 0 bit in the SDR is cut out.
The resulting vector keeps only 40% of the token values, but it preserves the similarity property: a “cat” token seen at position 500 will be pretty similar to “cat” at position 450 and very dissimilar to the same token at position 2000. The farther apart the positions, the greater the difference. It simply forces the transformer to treat the token differently according to its position.

The fact that the position embedding isn’t forced to be an actual all-or-nothing scalar encoder only enhances resolution. Where the two added values are both high, the result is highest (enhanced); where they are both low, their sum is… annihilated/ignored; everything in between sums accordingly, and the closer a value is to the maximum, the higher its impact on the following layer.
Each scalar in the token vector becomes relevant to different positions in a way that is evenly distributed.
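For what it’s worth, the gating picture can be sketched in a few lines of numpy. The contiguous-block encoder and the cosine check below are my own illustrative stand-ins; only the sizes (2000-value embedding, 3000-token window, 40% active bits) come from the description above:

```python
import numpy as np

EMB_SIZE = 2000   # token embedding size
WINDOW = 3000     # history window: positions 1..3000
ACTIVE = 800      # 40% of the position SDR's bits are active

def position_sdr(pos):
    """Toy scalar encoder: position 1..WINDOW -> EMB_SIZE-bit vector
    with a contiguous run of ACTIVE ones whose start slides with pos."""
    start = round((pos - 1) / (WINDOW - 1) * (EMB_SIZE - ACTIVE))
    sdr = np.zeros(EMB_SIZE)
    sdr[start:start + ACTIVE] = 1.0
    return sdr

def gate(token, pos):
    """Add token + position SDR, subtract 1, apply ReLU: token values
    paired with a 0 bit fall below zero and are cut out."""
    return np.maximum(token + position_sdr(pos) - 1.0, 0.0)

cat = np.random.default_rng(0).random(EMB_SIZE)  # stand-in "cat" embedding

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

near = cos_sim(gate(cat, 500), gate(cat, 450))
far = cos_sim(gate(cat, 500), gate(cat, 2000))
print(near > far)  # True: close positions stay similar, far ones diverge
```

At most 40% of the token values survive the gate, yet `near > far` holds, which is exactly the similarity property described above.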

I have encountered the same kind of improvement in a cartpole RL agent that, instead of concatenating four short SDR embeddings, first added four large, different “dense” embeddings and then thresholded the sum to obtain an SDR encoding an “overlap” of four scalars. The resulting encoding was ~3 times more sample efficient with the same downstream learner.

This scheme also works for more general spatial embeddings. Assume that instead of text, the transformer trains on real-world scenes, where the “cat” token needs a 3D position in space attached: we have three coordinate vectors which can be added and softmax-ed to produce a 3D embedding that is finally added to the “cat” token, representing both “what” and “where”. Then attention can highlight the pairing and relative position of “cat” and “mouse” in the scene.

So the future of transformer-based or transformer-inspired technology is going to be… interesting.

Finally, whenever we want to use some apparently… weird overlapping of different properties (embeddings) and our intuition says “wtf?”, there’s a much simpler way to test that it works than throwing millions at training a full-fledged transformer.

You can train a simple classifier to recognize “pure” cats, train a regressor to predict only “clean” positions, then overlap (add & softmax) the “cat” and “position” vectors; the resulting vector should be recognized, within reasonable bounds, by both the token classifier and the position regressor.
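Here is a minimal numpy version of that sanity check. I use plain addition for the overlap (leaving out the softmax step), a nearest-prototype “classifier”, and a nearest-encoding “regressor”; every size and name is illustrative, not from the post:

```python
import numpy as np

rng = np.random.default_rng(1)
D, K, N_POS = 256, 5, 100   # embedding size, token classes, positions

def pos_enc(p, d=D):
    """Sinusoidal encoding of a single position (textbook formula)."""
    i = np.arange(d // 2)
    angles = p / (10000 ** (2 * i / d))
    v = np.empty(d)
    v[0::2] = np.sin(angles)
    v[1::2] = np.cos(angles)
    return v

# "Pure" token prototypes and "clean" position vectors: the classifier
# and regressor only ever see these un-overlapped training vectors.
protos = rng.normal(scale=0.5, size=(K, D))
positions = np.stack([pos_enc(p) for p in range(N_POS)])

def classify(v):
    return int(np.argmax(protos @ v))      # nearest pure prototype

def locate(v):
    return int(np.argmax(positions @ v))   # nearest clean position

# Overlap token and position by addition, then check both survive.
hits, abs_err, trials = 0, 0, 20
for _ in range(trials):
    c, p = int(rng.integers(K)), int(rng.integers(N_POS))
    v = protos[c] + positions[p]
    hits += classify(v) == c
    abs_err += abs(locate(v) - p)
print(hits / trials, abs_err / trials)  # class accuracy stays high,
                                        # position error stays small
```

The point of the check: a classifier and regressor trained only on the pure vectors still decode both properties from their sum, within reasonable bounds.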


Proof of what exactly? The Wolfram Alpha collab being a giant marketing spiel for them?

I’m saying that Stephen draws many analogies and conceptual reductions which strip all the nuance from the discussion. In this complicated field of DL, nuance matters a lot. He’s obviously not an expert, hence my advice not to take his analogies too far, or one would end up with very incorrect conclusions.

Well, I just did above :slight_smile: Positional encodings are a very simple concept, so much so that they’re a footnote in most papers. The concept is not purely empirical; there is some empirical scaffolding for certain aspects, yet those are also theoretically grounded. In the end, it’s about as empirical as any other science. We don’t ask why the strong nuclear force exists in physics; it just does. That’s how the universe works. But it’s exactly those kinds of unanswerable questions that intrigue me the most.

I’m not sure what impression you have of me, but I can assure you I’m not putting myself above anyone here, or trying to pull anyone down.

All I’m saying is: don’t take the analogies and simplifications too far; otherwise you’ll often end up confused and assume “oh, they used positional encodings because it just worked” rather than understanding the crucial theory behind them.

No… that’s exactly the opposite of what I was saying. You need positional encodings because self-attention makes no assumptions about token order.

But even if you don’t include positional encodings in your model, it’s been demonstrated that transformer circuits will learn them anyway. This, however, incurs a small drop in performance, because the model now has to spend extra compute and circuits learning this encoding.

That’s because, in the end, transformers are Turing complete: they’ll evolve whatever circuits are necessary to do the task as well as they can. We apply this external process (like positional encoding) because we wish to save on compute and let the model focus on modelling the important stuff.

Again, nuance matters here, because even for a simple query you can keep delving into the nuts and bolts, which is needed if you want to form an accurate mental model of what these LLMs do and how they manage to do it.


You said that he’s not an expert. I followed his work during my MS in AI and my undergrad CS. He has unique achievements that helped the world, which few people have matched, hence his global awards; not to mention his work overarches DL, meaning DL is just a tiny instance of it. He was probably born before neural networks were discovered.

I find that your post summarizes his extraordinary wisdom into a single page of a DL book. That’s not fair, but it’s your opinion. Only someone like him, or better, could summarize his wisdom, so I assume you are one of them.

When I said prove it, I meant: prove that he is not an expert.


I’m intrigued by your antagonism here more than anything. From a rational viewpoint, does it matter whether someone is an expert when their conclusion is illogical? Or are we somehow conflating experts with infallible, omniscient gods who can never be wrong? At that point, I might as well talk with the “experts” on Fox News who think global warming is a scam.

He might have unique achievements. He might have a better idea than I do about that topic. Yet I don’t see any solid conclusive evidence for or against that - I simply presented my commentary on the concept, yet that has inflamed you for some reason.

So I need to be a messiah like Jesus to doubt his word? Or a billionaire like Trump to prove that climate change is real?

I might remind you that we’re on a scientific forum. This isn’t a place for fangirling over celebrities and famous personalities and following them like sheep.

If you actually have anything substantive to contribute to the thread, you’re welcome. However, if your intention is to troll with weird arguments and strawmen to prove that Wolfram is some reincarnation of God who is never wrong, then my only advice is not to ping me anymore.


You’ve said too much, unnecessarily. I just asked for proof, because your opinion about Wolfram needs “substantial” backup, since it undermined his work (e.g. the original post). That’s it.

On second thought, what is antagonizing about asking for proof? Isn’t this a scientific forum?
