Combining multiple small width layers into a larger layer

There is a cost advantage (computational, parameter count) to combining multiple small-width layers into one large-width fully connected layer using a suitable combining algorithm.
There is a blog about it here:


Can you please update the link? The current link requires me to log in and set up my own blog.


I think it was supposed to be this: Switch Net 4 - reducing the cost of a neural network layer.


Sorry for the error. I fixed the link.
There are a few unusual things going on in Switch Net 4.

* Random or sub-random projection pre-processing of the input vector.

* Making a single large layer out of multiple small-width layers using a combining algorithm.

* 2-siding ReLU via its forward weights.
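
This is not the exact Switch Net 4 construction (that is in the linked post), but the general idea of the second bullet can be sketched: apply small dense blocks block-diagonally, then use one Walsh-Hadamard transform (WHT) to mix all the blocks, so every output depends on every input for only O(n log n) extra cost. The block sizes and random weights here are hypothetical, for illustration only:

```python
import numpy as np

def wht(x):
    """Unnormalized fast Walsh-Hadamard transform; len(x) must be a power of 2."""
    x = np.asarray(x, dtype=np.float64).copy()
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            x[i:i + h] += x[i + h:i + 2 * h]      # top half:    a + b
            x[i + h:i + 2 * h] = a - x[i + h:i + 2 * h]  # bottom half: a - b
        h *= 2
    return x

rng = np.random.default_rng(0)
k, b = 4, 4  # four small layers of width 4 combined into one 16-wide layer
W = [rng.normal(size=(b, b)) for _ in range(k)]  # hypothetical small weight blocks

def combined_layer(x):
    # Apply the small layers block-diagonally, then mix the blocks with one WHT
    # so every output depends on every input.
    y = np.concatenate([W[i] @ x[i * b:(i + 1) * b] for i in range(k)])
    return wht(y) / np.sqrt(k * b)
```

The 1/sqrt(k*b) scaling just keeps the output magnitude comparable to the input; any fixed normalization would do.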

I’ll maybe do some more blog posts about some other things, since I’m feeling chatty for some reason.


I wonder if the same can be done with self-attention windows in transformers: exponentially longer windows in higher sub-networks, where each sub-network becomes something like an expert in an MoE?

I ran it by ChatGPT:

As of my knowledge cutoff in September 2021, there is no direct research on a hierarchical transformer hyper-network where self-attention windows become longer in higher sub-networks, and sub-networks function as experts in a Mixture of Experts (MoE) setting. However, there are related works that explore concepts similar to what you described, such as hierarchical transformers and MoE in transformers.

Hierarchical Transformers: Hierarchical approaches have been explored in the context of transformers, although not specifically with longer self-attention windows in higher sub-networks. For example, the Hierarchical Transformer (H-Transformer-1D) proposed by Bai et al. (2021) organizes self-attention heads in a hierarchical manner to reduce computational complexity while still capturing long-range dependencies effectively.

Mixture of Experts in Transformers: MoE has been applied to transformer architectures to improve performance and efficiency. The Switch Transformer by Fedus et al. (2021) combines transformers and MoE to create an architecture that selectively routes tokens to different experts for processing. This approach reduces computational complexity while maintaining or even improving model performance compared to traditional transformers.

Adaptive Attention Span: Another related line of research is adaptive attention span in transformers. For example, the Sparse Transformer by Child et al. (2019) uses a combination of fixed-length attention and adaptive attention spans, which allows the model to attend to longer contexts when necessary. This can be seen as a way to balance shorter and longer self-attention windows based on the input sequence’s requirements.

Although there is no direct research on the exact concept you mentioned, the ideas from these related works could potentially be combined or extended to create a hierarchical transformer hyper-network with longer self-attention windows in higher sub-networks and MoE-like behavior. Such a model could benefit from the strengths of hierarchical transformers, adaptive attention spans, and MoE, potentially leading to improved performance and efficiency.

The first ref was hallucinated, but there is a similar paper: [2107.11906] H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences


There are practical limits to how wide you can make a neural network using the Walsh Hadamard transform (WHT) as a combiner.
Those limits mainly depend on the size of the CPU/GPU cache hierarchy. A width of 65536 (2^16) is easy on lower-cost CPUs; you should be able to do about 5000 65536-point WHTs per second with good code.
A width of 1048576 (2^20) should be possible on a mid-range CPU.
Outside the cache limits the algorithm slows down a lot, as data has to be moved to and from DRAM external to the CPU.
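
For reference, a minimal Python sketch of the unnormalized in-place fast WHT. Real code aiming for the throughput mentioned above would use SIMD and cache-blocked passes; the widening stride of the `h` loop is exactly what starts missing the cache at large widths:

```python
def wht_inplace(x):
    """Unnormalized fast Walsh-Hadamard transform, in place.
    len(x) must be a power of 2; O(n log n) additions, no multiplies."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of 2"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j] = a + b
                x[j + h] = a - b
        h *= 2
    return x

# Applying the transform twice scales the input by n:
# wht_inplace(wht_inplace([1, 2, 3, 4])) -> [4, 8, 12, 16]
```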

I stick to low level aspects of neural networks for various reasons.
I don’t have any meaningful comments about higher level aspects like transformers. I have to leave that to others.


Regarding this diagram: I see there are 48 weights instead of the 64 of a fully connected 8x8 layer/matrix.

That is 6 weights per input point instead of 8.

For a 256-wide input/output there should be 256 * 16 = 4096 weights, or 16 weights per input point, instead of 256 in a fully connected layer. 16 times fewer.

Then a layer of width W will have 2*log2(W) connections per input point, instead of the W of a fully connected WxW layer.
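
To make the count concrete, a quick sketch assuming the 2*log2(W)-weights-per-point scheme described above:

```python
import math

def combiner_weights(w):
    # 2 * log2(w) weights per input point in the combining scheme
    return w * 2 * int(math.log2(w))

def dense_weights(w):
    # w weights per input point in a fully connected w x w layer
    return w * w

for w in (8, 256, 65536):
    ratio = dense_weights(w) / combiner_weights(w)
    print(f"width {w}: {combiner_weights(w)} vs {dense_weights(w)} ({ratio:.1f}x fewer)")
```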

Which, of course, is significantly faster to compute the larger W gets.
But doesn't that lose learning capacity? If each weight is a learnable parameter, the question is: how does your connection scheme compare to a fully connected layer with the same number of weights?

e.g. compare the learning capacity of your 65536-wide scheme with a 1448 (or 1449) wide fully connected layer, which has almost the same number of weights, about 2M.