IRAM Chips and Sparse Representation

Have you talked to the group at Berkeley about their IRAM chip? IRAM is an emerging architecture that appears promising for sparse representation implementations because it places DRAM on the same die as the processing elements, for very low latency. Sparse boolean vectors stored on-chip, operated on by many parallel sparse boolean vector units, could radically speed up Numenta’s implementations.
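For concreteness, here is a minimal sketch (in Python, with made-up numbers) of the kind of sparse boolean operation in question: an SDR stored as a list of active bit indices, whose “overlap” needs only integer comparisons and adds, no floating point:

```python
# Illustrative sketch only: a sparse boolean vector (SDR) kept as a sorted
# list of active bit positions; "overlap" is the size of the intersection.

def overlap(sdr_a, sdr_b):
    """Count shared active bits of two sparse boolean vectors."""
    return len(set(sdr_a) & set(sdr_b))

a = [3, 17, 42, 99, 250]   # 5 active bits out of, say, 2048
b = [5, 17, 42, 180, 250]
print(overlap(a, b))       # -> 3
```

This is exactly the sort of compare-and-count kernel that many small on-die vector units could run in parallel.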


The real advantage for HTM with processing-in-memory (PIM) such as IRAM, I think, is not in sparsity but in the simplicity of the operations. PIM can be combined with 3D stacking (as in HBM2-PIM [1]). The “compute” layer will be less power hungry than in conventional DL (integer compare-and-add operations instead of floating-point multiply-adds). With 3D stacking the biggest issue is heat dissipation. This can be a game changer in the long term.

If you replace the FP units in [1] with INT units, the number of ops per second might be much more than 1.2×10^12. You need a “fast” on-chip network though (communication, even on chip, is a bit of a pain in HTM).

[1] Y.-C. Kwon et al., “25.4 A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications,” in 2021 IEEE International Solid-State Circuits Conference (ISSCC), 2021, vol. 64, pp. 350–352.


I also don’t give much second thought to concepts such as “sparsity” being handled in hardware… but yeah, having large amounts of memory per core is my (and IMHO “the”) grail for further development and breakthroughs in AI.
I’d hope those things become mainstream, and probably like @vpuente, also dream of some simple, integer-oriented, yet heavily vectorized ALU on those cores, instead of the floating point focus and complex hardware logic we find on most current hardware.

With such a device one could do unimaginable HTM computations, true, but also run any other mind-blowing model, closer and closer to the real thing, until we can figure out what really matters there… Our brains are immensely parallel, and the “information” representing our wiring (the fact that neuron A connects to neuron B rather than to neuron C) is immensely expensive in terms of bit count. => more cores, more parallel cores, more (even isolated) memory per core (and, yes, some way to do “fast” on-chip communication between them, as vpuente puts it) is the way to go to simulate that. Any deviation from that (ultimately quite simple) baseline would be spending silicon on something that will never be as great or as useful for a brain sim as more cores and more per-core memory.

Consider a simple HTM operation to determine a neuron’s activation. The neuron’s data structure has two components: its activation level (say, an integer), and a linked list of pointers to other such neurons’ data structures. To compute its activation level, loop through the linked list and access the current activation level of each presynaptic neuron, asking if it is above some threshold and, if so, adding one to this neuron’s activation level. Computing such an activation makes intensive use of pointer indirection. The speed of pointer indirection is bounded by the memory’s latency. Latency is what you must minimize.
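A minimal sketch of that data structure and loop, assuming an illustrative threshold of 2 (the names and numbers here are mine, not anyone’s actual code):

```python
# Each neuron holds an activation level plus a list of references to its
# presynaptic neurons; computing the activation costs one memory hop
# (one pointer indirection, i.e. one latency hit) per reference followed.

THRESHOLD = 2  # assumed value, for illustration

class Neuron:
    def __init__(self):
        self.activation = 0
        self.presynaptic = []   # the "linked list" of neuron references

def compute_activation(neuron):
    neuron.activation = 0
    for pre in neuron.presynaptic:        # pointer indirection per synapse
        if pre.activation > THRESHOLD:
            neuron.activation += 1
    return neuron.activation

a, b, c = Neuron(), Neuron(), Neuron()
a.activation, b.activation = 5, 1
c.presynaptic = [a, b]
print(compute_activation(c))  # -> 1 (only a is above threshold)
```

In hardware terms, every iteration of that loop is a dependent load whose cost is dominated by memory latency, which is the point being made above.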

You may wish to argue with this particular data structure and computation, but no matter how you slice it, you end up with a lot of pointer indirection operations if you want sparse representations.

Moreover, if you want high degrees of parallelism in your computation, you need to have the CPU cores sharing the same address space – all with very low latency pointer indirections.


It would be nice to have some limited processing on the DRAM. You send an ordered list of relative addresses as a block of data; it loads an SRAM cache on the DRAM module with the sparse data, then sends the content in a burst. The same could work with an SSD.


I may wish to argue with the structure yes, but let’s switch the perspective instead…

Consider a more biologically-based approach where you broadcast an AP down an axonal tree, whose nodes are heavily localized. Each reception site (a dendritic segment in my model, or a whole cell for a simpler attempt) would periodically sample from that tree to determine its own reaction. The bulk of the actual computation thus stays very local to the receiver, and would tremendously benefit from hardware with lots of local memory per computation unit.

In this scheme, the only thing requiring data transfer from “global” storage to the localized computation sites is the AP transmission events (conceptually “events”; this need not be synchronous with the AP per se), and for those you need the “fast” on-chip network advocated by @vpuente. AP transmission events are many orders of magnitude less frequent than individual synaptic computations themselves. IMHO that’s a computing-power win by every metric.

You’d never want to go through 2 or 3 indirections per synapse in the middle of your heavily parallel synaptic computation, and you’d never want to pay 2 or 3 pointer-sizes per synapse anyway (even on a 32-bit or even 16-bit arch, I might say), no matter how well the hardware helps you perform those hops. In the model above, a handful of bits per synapse is required instead: enough to index into the localized axonal array, plus an insanely tiny number of bits to store synaptic persistence/weight/whatever. Even so, you’ve blown past current “regular” hardware RAM (amount and bandwidth) for any serious attempt at a vaguely scalable cortical sheet. We need mem and mem and mem and more mem, and as close to the (otherwise dead-simple) synaptic-crunching CPUs as possible.
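A back-of-envelope comparison of the two layouts, with purely illustrative bit counts (the pointer width, indirection count, and field sizes below are assumptions, not measurements):

```python
# Assumed figures: storage per synapse for a pointer-chasing layout
# versus a localized axonal-array layout as sketched above.

POINTER_BITS = 64          # one global pointer, 64-bit architecture
INDIRECTIONS = 2           # the "2 or 3 indirections" case, low end
pointer_layout = POINTER_BITS * INDIRECTIONS       # bits per synapse

LOCAL_INDEX_BITS = 8       # index into a small local axonal array
PERMANENCE_BITS = 4        # tiny fixed-point persistence/weight value
local_layout = LOCAL_INDEX_BITS + PERMANENCE_BITS  # bits per synapse

print(pointer_layout, local_layout)    # 128 vs 12 bits per synapse
print(pointer_layout // local_layout)  # roughly a 10x memory saving
```

Under these (debatable) numbers the localized layout is an order of magnitude cheaper per synapse, before even counting the latency of the avoided hops.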


IRAM is an interesting approach. It looks like a possible next step in the long running computer trend of “consolidate everything into as few chips as possible”.

So the core idea behind IRAM is, to some extent, already implemented in graphics cards (GPUs).

NVIDIA GPUs have an on-chip shared memory, which they claim is 100x faster than off-chip RAM.
The user has explicit control over the shared memory. You can load your data into the shared memory, work on it, and then write it back out to RAM.

This is a bit off topic, but have you all heard about the Mill CPU?
The headline is “a clean-sheet rethink of general-purpose CPU architectures”.
It looks very promising but alas is still in the prototyping phase of development.

You would certainly like that sort of integration with memory for Fast Transform neural networks, mainly because the energy costs of transferring information start to outweigh the numeric calculation costs and also become the key rate-limiting step.
I hope you forgive me for linking to a special report I have created:
(Edited to a landing page for the resource)
Especially as it is not free. There is an amusement at the end of the page if you like playing with sequencies. With an s.

You may be right depending on the actual numbers involved, but part of my intuition is that there is an enormous gap in machine learning left by the absence of sparsity hardware, and that it may therefore be premature to pursue optimizations such as you recommend – better to support topological flexibility at the sacrifice of some of the speed gains. Once we get into the nitty-gritty details of implementation there will be feedback to the theory, and flexibility then becomes more important than optimal speed. A factor of 10 decrease in latency is a huge deal.

Having said that, perhaps a compression technique not dependent on pointer indirection would support your position as set forth in the recent paper “A Fast Spatial Pool Learning Algorithm of Hierarchical Temporal Memory Based on Minicolumn’s Self-Nomination”. I can’t say I understand their compression scheme’s tradeoffs sufficiently to recommend it as a compromise position, nor can I say that I feel comfortable with early hardware partitioning (CPU cache memories) even given an effective compression technique, but it does strike me as a promising direction.


Yes, I’m familiar with the Mill architecture, and yes, it is a tragedy that it’s been hanging around in limbo for so many years. It is a superior allocation of chip real estate compared to ordinary cache memories, but if I had to choose between new processor chip developments, I’d go with on-chip memory banks shared between simple (integer-only) CPUs without cache-coherency overhead.


I would tend to agree with that… that’s in fact why I’m not tailoring the numbers to current-HTM models (except recognizing the need for more synapses than in traditional models, and allowing that a large part of the computations is happening locally to segments and not at the soma…), instead trying to get a feeling for the actual scales involved… My guess is simply that any brain-inspired model would need parallelism and memory locality, and there could be ways to make it happen in hardware as of today (if we were to reconsider current market preference for global RAM and fancy CPUs).

At home, my current hardware has lots of silicon, but it’s ill-tailored for the kind of sims I’d like to see happening. Like you, I’m also aiming for a lot of flexibility in topological matters. Local memory and local computation do not at all run contrary to that. Quite the opposite, in my view: the model requires handling part of the axonal signals as globally transported from one locality to another, and there is a lot of potential for all kinds of networks and global wiring schemes right there, for my/your/everybody’s experimentation attempts, until we get it right.



I don’t know if Micron are still offering some kind of compute in memory chip, or if they have dropped it?
Anyway, some extremely poor sales and marketing was involved.

What do you think of the low-electric-power MIMD GreenArrays chip?

As of Spring 2021, shipments of the EVB002 evaluation kit and of G144A12 chips continue to be made. The arrayForth 3 integrated development system is in use with no reported problems. Design of a new chip, G144A2x, continues; this will be upward compatible with the G144A12, with significant improvements. Development of Application Notes, including that of a software defined GPS receiver, continues.

Its computational standard cell is an 18-bit processor that I was looking at widening to 36 for my approach, but the GA is, out of the box, much closer to your MIMD approach. The processor itself is under 10,000 transistors – probably more like 5,000. I suspect the 144-processor G144A12 doesn’t provide enough RAM per processor to do what you want, but that is easily remedied.

Static RAM costs 4 transistors per bit minimum, but let’s kick that up to 6 transistors to be conservative. NVIDIA’s latest GPU chip sports 50 billion transistors.
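Putting those two figures together, a rough (and deliberately naive) budget for how much SRAM such a transistor count could buy, ignoring all logic, wiring, and yield considerations:

```python
# Naive upper bound: spend the entire transistor budget of a large GPU die
# on 6-transistor SRAM cells and see how many bytes that yields.

TRANSISTORS = 50_000_000_000   # NVIDIA-class transistor count, from above
T_PER_BIT = 6                  # conservative SRAM cell

bits = TRANSISTORS // T_PER_BIT
gigabytes = bits / 8 / 1e9
print(round(gigabytes, 2))     # -> 1.04, i.e. about 1 GB of on-die SRAM
```

So even an all-SRAM die at today’s transistor counts tops out around a gigabyte, which is worth keeping in mind for the per-core memory discussion.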

Since communication between the processors is the critical vulnerability of your MIMD approach, it should be noted that there is software defined interprocessor communication called “ethercode”.


That looks interesting. The fast Walsh–Hadamard transform only needs add and subtract operations, though sometimes bit shifts are needed for rescaling.
I’m sure multiplies would need to be implemented using the shift-and-add method; those are needed if you want to make a neural network.
I haven’t read the data sheet yet, so I don’t know the memory arrangements.
Still, I am sure I could code 144 concurrent neural networks into the thing. Fast Transform fixed-filter-bank neural networks that is.
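For reference, a minimal in-place fast Walsh–Hadamard transform sketch, showing that the inner loop really is only adds and subtracts (which is why it maps so well onto simple integer-only cores):

```python
# Unnormalized in-place FWHT; input length must be a power of two.
# The butterfly uses only integer addition and subtraction.

def fwht(x):
    """In-place fast Walsh-Hadamard transform of list x."""
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b   # add/subtract butterfly
        h *= 2
    return x

print(fwht([1, 0, 1, 0]))  # -> [2, 2, 0, 0]
```

Any rescaling (e.g. dividing by n, or by sqrt(n) per stage) can indeed be done with bit shifts when n is a power of two.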

Well, I guess I won’t be doing that. The amount of memory is far too tiny.

It’s funny that Micron decided to put cellular automata on DRAM chips and not a bunch of far more useful microcontroller cores. The 8/16-bit Z80 CPU needed only 8,500 transistors (per Wikipedia), which is tiny. I thought it was 55,000 transistors. Maybe false memory syndrome.

Re memory: as I said, that’s easily remedied. That’s why I provided the 6-transistor SRAM figure in conjunction with the CPU transistor count.

Re the Z80 transistor count: the processor portion of the F18A standard cell is probably more like the 6502, even though it is 18 bits wide. The Forth virtual machine is an exceedingly parsimonious architecture, ideal for compact silicon design.

It’s easy to imagine a semicustom run of a chip with 144 F24A or even F36A standard cells each sporting enough SRAM to satisfy gmirey’s approach – and very low risk of design flaws.


It looks like the compression is just encoding active bit positions to represent a sparse vector.
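If that reading is right, the scheme amounts to something like this (an illustrative sketch, not the paper’s actual code):

```python
# Compress a dense boolean vector to its active bit positions and back.
# For a 2048-bit SDR with 40 active bits, that's 40 x 11-bit indices
# instead of 2048 raw bits.

def encode(dense):
    """Dense boolean vector -> sorted list of active bit positions."""
    return [i for i, bit in enumerate(dense) if bit]

def decode(indices, n):
    """Active bit positions -> dense boolean vector of length n."""
    dense = [0] * n
    for i in indices:
        dense[i] = 1
    return dense

v = [0, 1, 0, 0, 1, 0]
idx = encode(v)
print(idx)                  # -> [1, 4]
print(decode(idx, 6) == v)  # -> True
```

The win depends on sparsity: with k active bits out of n, index encoding pays k·log2(n) bits versus n bits raw, so it only helps when the vector is genuinely sparse.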

I have a hard time following the spec, to be honest; I’m not a hardware guy.
As for the SRAM thing, 50G/6 on a high-end chip suggests we won’t be able to cut it as of the 2020s, but I don’t think all hope is lost here: I could imagine DRAM on the same package (even local-to-single-processor DRAM on a board of many, why not?), since one can work out very predictable access schemes for the synaptic-crunching parts, so a not-so-high amount of SRAM acting as a cache could work wonders.
Yet I still don’t think a single-bus DRAM like we have on today’s hardware could service many of those cores. The required throughput is insane, and access from many cores would essentially bring us back to random access (instead of being highly predictable).
