SDR Transformer

My POC for replacing the dense embedding layer with SDRs. It works by being trained on math and logic, transforming a Qwen transformer into an SDR Qwen transformer.

How does a randomly projected embedding (from a 2k-bit SDR to Qwen's 896-dimensional vector) carry any meaning to the pretrained transformer?
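One way to see why this can work at all: a random linear projection approximately preserves inner products (the Johnson–Lindenstrauss intuition), so SDRs that share active bits land near each other in the dense space. The sketch below is illustrative only, not the repo's code; the bit width (2048), active-bit count (40), and dense size (896) are assumptions taken from the numbers in this thread.

```python
import numpy as np

# Illustrative sketch (NOT the repo's actual code): project a 2048-bit
# SDR with a handful of active bits down to an 896-dim dense vector
# using a fixed random matrix. Overlap between SDRs survives as cosine
# similarity in the dense space.
rng = np.random.default_rng(0)
SDR_BITS, DENSE_DIM, ACTIVE = 2048, 896, 40

P = rng.normal(0, 1 / np.sqrt(DENSE_DIM), size=(SDR_BITS, DENSE_DIM))

def random_sdr(rng):
    sdr = np.zeros(SDR_BITS)
    sdr[rng.choice(SDR_BITS, ACTIVE, replace=False)] = 1.0
    return sdr

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = random_sdr(rng)
b = a.copy()                          # b shares 35 of a's 40 active bits
b[np.flatnonzero(b)[:5]] = 0.0
b[rng.choice(np.flatnonzero(a == 0), 5, replace=False)] = 1.0
c = random_sdr(rng)                   # unrelated SDR, near-zero overlap

da, db, dc = a @ P, b @ P, c @ P      # dense 896-dim embeddings
print(da.shape)                       # (896,)
print(cosine(da, db) > cosine(da, dc))
```

So even an untrained random projection carries relational structure; whether it carries enough for a pretrained transformer is exactly the question asked above.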

Access the code on GitHub and try it out yourself; that is why the code is out in public. The secret is the projection layer. Here is a preview for you:

"PHASE 2 SDR BRAIN ONLINE.
Temp: 0.7 | Rep_Penalty: 1.1
Type ‘quit’ to exit.

User: if 20 + 20 is 40 what is 60 + 60
Assistant: The sum of 60 and 60 equals 120.”

Experiments are now underway to prove SDR superposition; results look very promising, and another repo is coming soon. Running log:

"Step 108580 | Loss: 0.27578 | Time: 40.17s
Step 108780 | Loss: 0.29431 | Time: 29.79s
Step 108980 | Loss: 0.28164 | Time: 25.02s
Step 109180 | Loss: 0.28785 | Time: 40.03s
Step 109380 | Loss: 0.29527 | Time: 39.75s
Step 109580 | Loss: 0.27939 | Time: 39.43s
Step 109780 | Loss: 0.27578 | Time: 39.61s
Step 109980 | Loss: 0.28134 | Time: 39.69s
Step 110180 | Loss: 0.28636 | Time: 39.66s
Step 110380 | Loss: 0.28139 | Time: 39.46s
Step 110580 | Loss: 0.27595 | Time: 39.73s
Step 110780 | Loss: 0.27321 | Time: 34.40s

I just want to understand: is the projection ("retina"?) layer, which maps SDRs to dense vectors, random or trained? I ask because the GitHub README states it is a random projection.

Do you use the reverse projection (dense → SDR) on the output too? I think that would make an important difference in compute, since it substitutes the huge last layer of 896 × 130k weights (embedding size × number of tokens) with a much smaller one, e.g. 896 × 2k.
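The savings the comment describes are easy to check with back-of-envelope arithmetic. The sizes below are taken from this thread (896 hidden, ~130k vocab, 2k SDR bits), not from the repo itself:

```python
# Parameter count of a dense LM head vs. a dense->SDR output projection.
# Sizes are the ones quoted in the discussion above, not verified
# against the actual Qwen checkpoint.
hidden_size = 896        # Qwen's hidden dimension
vocab_size = 130_000     # approximate token count mentioned above
sdr_bits = 2048          # "2k" SDR width

dense_head = hidden_size * vocab_size   # 116,480,000 weights
sdr_head = hidden_size * sdr_bits       # 1,835,008 weights
print(dense_head // sdr_head)           # roughly 63x fewer parameters
```

That is where most of the claimed compute saving would come from: the output matmul shrinks by the same factor as the parameter count.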

The projection layer is trained; if you go through the included training script, it is part of Phase 1. The bottom line is that it replaces the dense embedding layer with SDR projection layers. In fact, SDRs work so well in a transformer that SDR LoRA adapters (my earlier experiments) worked on a transformer which had never seen an SDR. Once we get rid of the embedding layer, everything is an SDR: images, video, audio, and text. They all share the same space.
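As I read the reply, the Phase-1 idea is that the SDR→dense projection is a trainable matrix, learned so the (frozen) pretrained transformer receives useful inputs. Here is a minimal mechanical sketch of training such a matrix with plain SGD; the toy target (matching a stand-in teacher embedding) and all dimensions are my assumptions, and the real script's loss and targets will differ:

```python
import numpy as np

# Minimal sketch: learn a trainable SDR->dense projection W by SGD.
# The "teacher" matrix here is just a stand-in target to demonstrate
# the mechanics; it is not the repo's actual Phase-1 objective.
rng = np.random.default_rng(0)
SDR_BITS, DENSE_DIM, ACTIVE = 2048, 896, 40

W = rng.normal(0, 0.02, (SDR_BITS, DENSE_DIM))    # trainable projection
teacher = rng.normal(0, 1, (SDR_BITS, DENSE_DIM)) # hypothetical target

sdr = np.zeros(SDR_BITS)
sdr[rng.choice(SDR_BITS, ACTIVE, replace=False)] = 1.0

lr = 0.01
for _ in range(200):
    pred = sdr @ W                        # dense vector fed downstream
    target = sdr @ teacher
    grad = np.outer(sdr, pred - target)   # dL/dW for 0.5*||pred-target||^2
    W -= lr * grad                        # only active-bit rows update

print(np.allclose(sdr @ W, sdr @ teacher, atol=1e-2))  # True
```

Note that because the SDR is sparse, each step only updates the rows of W corresponding to active bits, which keeps the training of the projection cheap.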

There is already a working SDR encoder for images, FYI. Have you ever thought about how humans can drive a car in a city where they have never been before? Or sit in a noisy restaurant and still hold a conversation? The secret of biological intelligence… | Aamir Mirza