From what I could follow, what he proposes is to pair ChatGPT with Wolfram|Alpha as a tool in order to provide better answers. The main reason he considers Wolfram|Alpha the right tool is that it can also “understand” queries in natural language, and natural language is, duh… natural to ChatGPT.
A few issues with this:
- somehow ChatGPT must recognize that its answer to the current prompt would be better if it integrated output from a specific external tool
- GPT-x can generate scripts in many languages, e.g. programming languages or web/database queries, which falsifies the implicit assumption that a tool accepting natural language queries is what is required (or better) to improve GPT output. In reality it can use an arbitrary number of tools to externally check its output.
- personally I have issues with proprietary systems. Now that GPT went proprietary, Wolfram’s idea is like “Why don’t you strap on tighter to the proprietary bandwagon - here’s my proprietary ball & chain to do so!”
@neel_g provided the “because it was proven to work” answer.
Here’s a take on why it works. Since this is a Numenta forum, maybe a ScalarEncoder analogy would help.
Imagine you have a token embedding for “cat” with 2000 values (each between 0 and 1), and a history window of 3000 tokens.
The “cat” token sits at position 525 in the history. Use a scalar encoder that takes in values from 1 to 3000 (the window size) and outputs an SDR of size 2000 (the token embedding size).
The SDR does not actually need to be sparse (NNs don’t care much about sparsity); let’s say 40% of its bits are on. Add the two vectors, subtract 1 from the result, and apply ReLU.
This is equivalent to gating the token by the position SDR, which means all values that overlap with a 0 bit in the SDR are cut out.
The resulting vector keeps only 40% of the token values but it preserves the similarity property: a “cat” token seen at position 500 will be pretty similar to “cat” at 450 and very dissimilar to the same token at position 2000. The farther apart the positions, the greater the difference. It simply forces the transformer to treat the token differently according to its position.
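To make the gating concrete, here is a minimal sketch in plain Python. The encoder is my own simplified stand-in (a contiguous run of on bits that slides with the value), not Numenta’s actual ScalarEncoder, and the “cat” embedding is just random numbers, so treat every name and parameter here as an assumption.

```python
import math
import random

def scalar_encode(value, min_val=1, max_val=3000, size=2000, active=800):
    """Simplified scalar encoder (assumed): a contiguous run of `active`
    1-bits whose start index slides linearly with `value`."""
    last_start = size - active
    start = round((value - min_val) / (max_val - min_val) * last_start)
    return [1 if start <= i < start + active else 0 for i in range(size)]

def gate(token, position_sdr):
    """Add, subtract 1, ReLU. With token values in [0, 1) and 0/1 bits this
    keeps the token value where the bit is 1 and zeroes it elsewhere."""
    return [max(0.0, t + b - 1) for t, b in zip(token, position_sdr)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

rng = random.Random(0)
cat = [rng.random() for _ in range(2000)]  # made-up "cat" embedding

cat_450 = gate(cat, scalar_encode(450))
cat_500 = gate(cat, scalar_encode(500))
cat_2000 = gate(cat, scalar_encode(2000))

near = cosine(cat_500, cat_450)    # same token, nearby positions
far = cosine(cat_500, cat_2000)    # same token, distant positions
print(near, far)
```

With these assumed parameters the first similarity should come out close to 1 and the second much lower. How fast similarity falls off with distance depends on how many bits are on and how the encoder spreads them, so a narrower run of on bits would separate distant positions more sharply.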
The fact that they don’t force the position embedding into an actual all-or-nothing scalar encoder only enhances resolution. Overall, where the two added values are both high the result is highest (enhanced), where they are both low their sum is … annihilated/ignored, and for sums anywhere in between, the closer a sum is to the maximum, the higher its impact on the following layer.
Each scalar in the token vector becomes relevant to different positions in a way that is evenly distributed.
I have encountered the same… improvement in a cartpole RL agent that, instead of concatenating four short SDR embeddings, first added four large, different “dense” embeddings and then thresholded the sum to obtain an SDR encoding an “overlap” of the 4 scalars. The resulting encoding was ~3 times more sample efficient, with the same downstream learner.
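For anyone who wants to play with the add-then-threshold idea, here is a rough sketch. It is not the cartpole agent’s actual encoder: the random cosine features used as “dense” embeddings, the sizes, and the top-k threshold are all assumptions I picked to keep it short.

```python
import math
import random

def make_encoder(n_inputs=4, size=512, active=40, seed=0):
    """Build an encoder that sums one dense embedding per input scalar and
    keeps the `active` largest entries as the on bits of an SDR."""
    rng = random.Random(seed)
    # One (frequencies, phases) pair per input scalar: hypothetical random
    # features, so each scalar gets its own large, different dense embedding.
    params = [
        ([rng.gauss(0, 2.0) for _ in range(size)],
         [rng.uniform(0, 2 * math.pi) for _ in range(size)])
        for _ in range(n_inputs)
    ]

    def encode(state):
        total = [0.0] * size
        for x, (freqs, phases) in zip(state, params):
            for i in range(size):
                total[i] += math.cos(freqs[i] * x + phases[i])
        # Threshold: the indices of the largest sums form the SDR.
        top = sorted(range(size), key=lambda i: total[i], reverse=True)[:active]
        return frozenset(top)

    return encode

encode = make_encoder()
state = [0.10, 0.20, 0.00, -0.10]    # made-up cartpole-like observation
nudged = [0.12, 0.20, 0.00, -0.10]   # one scalar changed slightly
other = [1.50, -1.00, 0.80, 2.00]    # a very different observation

sdr = encode(state)
overlap_near = len(sdr & encode(nudged))
overlap_far = len(sdr & encode(other))
print(len(sdr), overlap_near, overlap_far)
```

The point of the sketch is only that every on bit depends on all four scalars at once, while similar observations still share most of their bits.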
Also, this scheme works for more general spatial embeddings. Assume that instead of text the transformer trains on real-world scenes, where the “cat” token would need a 3D position in space attached: we have three coordinate vectors which can be added & softmax-ed to produce a 3D embedding that is finally added to the “cat” token to represent both “what” and “where”. Attention can then highlight the pairing and relative position between “cat” and “mouse” in the scene.
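A bare-bones sketch of that “what + where” construction follows. The sinusoidal per-axis code, the vector size, and the flat “cat” embedding are assumptions made purely for illustration.

```python
import math

DIM = 64  # embedding size (assumed, kept small)

def axis_vector(coord, axis):
    """Hypothetical sinusoidal code for one coordinate; the per-axis phase
    offset makes the x, y and z codes differ from each other."""
    return [math.sin(coord * (i + 1) / DIM + axis) for i in range(DIM)]

def softmax(vec):
    peak = max(vec)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in vec]
    total = sum(exps)
    return [e / total for e in exps]

def where_embedding(x, y, z):
    """Add the three coordinate vectors, then softmax the sum."""
    summed = [a + b + c for a, b, c in
              zip(axis_vector(x, 0), axis_vector(y, 1), axis_vector(z, 2))]
    return softmax(summed)

def what_and_where(token, x, y, z):
    """Add the 3D position embedding to the token embedding."""
    return [t + w for t, w in zip(token, where_embedding(x, y, z))]

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

cat = [0.5] * DIM  # made-up "cat" embedding
here = what_and_where(cat, 1.0, 2.0, 3.0)
nearby = what_and_where(cat, 1.1, 2.0, 3.0)
far_away = what_and_where(cat, 8.0, -5.0, 0.0)
print(sq_dist(here, nearby), sq_dist(here, far_away))
```

One thing the sketch makes visible: a softmax output sums to 1, so each entry is small next to a token value in the 0 to 1 range. A real model would presumably rescale or learn that balance.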
So the future of transformer-based or transformer-inspired technology is going to be… interesting.
Finally, whenever we want to use some apparently… weird overlapping (embeddings) of different properties and our intuition says “wtf?”, there’s a much simpler way to test that it works than throwing millions at training a full-fledged transformer.
You can train a simple classifier to recognize “pure” cats, train a regressor to predict only “clean” positions, overlap (add & softmax) the “cat” and “position” vectors, and the resulting vector should be recognized - within reasonable boundaries - by both the token classifier and the position regressor.
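Here is a toy version of that test. To keep it dependency-free I made several simplifying assumptions: a nearest-prototype lookup stands in for the trained classifier, a nearest-template lookup stands in for the trained regressor, the position vectors are transformer-style sinusoids, the token embeddings are zero-mean random vectors, and the overlap is plain addition with the softmax step left out.

```python
import math
import random

DIM = 256      # embedding size (assumed, kept small so this runs fast)
WINDOW = 3000  # history window from the example above
rng = random.Random(0)

def position_vector(pos):
    """Sinusoidal position embedding (an assumed, transformer-style choice)."""
    vec = []
    for i in range(DIM // 2):
        freq = 1.0 / (10000 ** (2 * i / DIM))
        vec.append(math.sin(freq * pos))
        vec.append(math.cos(freq * pos))
    return vec

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# "Pure" token embeddings; the names and values are made up.
tokens = {name: [rng.gauss(0, 1) for _ in range(DIM)]
          for name in ("cat", "mouse", "dog", "house", "tree")}

def classify_token(vec):
    """Probe 1: only ever saw pure token vectors."""
    return min(tokens, key=lambda name: sq_dist(vec, tokens[name]))

# Probe 2 only ever saw clean position vectors, on a grid of 25 steps.
templates = {pos: position_vector(pos) for pos in range(0, WINDOW, 25)}

def decode_position(vec):
    return min(templates, key=lambda pos: sq_dist(vec, templates[pos]))

# Overlap "cat" with position 525 by plain addition, then ask both probes.
combined = [t + p for t, p in zip(tokens["cat"], position_vector(525))]
token_guess = classify_token(combined)
position_guess = decode_position(combined)
print(token_guess, position_guess)
```

If both probes still recover roughly the right answer from the combined vector, the overlap has not destroyed either property, which is the cheap sanity check suggested above. Swapping in real trained models and the softmax step is the obvious next experiment.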