We need new hardware

Hi. As some of you may know, I’ve been interested for quite some time in getting the bits-per-synapse down to a very low level, in the hope of running large-scale simulations of brain parts.
I have a tendency to hop erratically from one project to the next, and haven’t posted for quite some time about this… but what brought me to a halt last time was the realization that our current general-purpose computing hardware is not fit for the task.

We need new hardware.
I don’t think we’d need super advanced and exotic new hardware of y2035, mind you. We just need something a little different. Some parallelism that we already have. Some good integer compute that we already have (we don’t really need fancy floating-point nonsense, imho). Some fair clock speed that we already have. But much, much, much more memory per computing unit than we currently care to throw at it.
And I don’t see this happening in what I know of the current serious attempts at making deep-learning-specialized hardware.

I’m a software guy. I know almost nothing about FPGAs, or what customization options we’d currently get out of the SoC industry if we were to think about this seriously.

What I do know is this:

  • An 8x8-minicolumn chunk would require some 16b integer processing power with decent speed, about 100MB of memory, and connectors to be clipped onto a board (one board spanning a brain area?), an intra-board bus to neighboring chunks, and a board-wide long-range IO bus linking all of them.
  • One square-centimeter sim requires ~1000 such chunks. Ten thousand of them, maybe (only a 100x100 matrix, after all), for 10cm² (i.e. roughly a human V1 on one hemisphere).
  • Yes, that’s one terabyte of memory.
  • Yes, I’ve compressed that as much as I could already.
  • Yes, it is high. But it’s in range of something feasible, isn’t it? Maybe not by me, but by… someone?
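For what it’s worth, the arithmetic in those bullets checks out; here’s a quick sanity-check sketch (all figures are the ones given above):

```python
# Sanity-check the chunk/memory figures from the bullets above.
chunk_mem_mb = 100        # ~100MB per 8x8-minicolumn chunk
chunks_per_cm2 = 1000     # ~1000 chunks per square centimeter
area_cm2 = 10             # roughly a human V1 on one hemisphere

total_chunks = chunks_per_cm2 * area_cm2                # the 100x100 matrix
total_mem_tb = total_chunks * chunk_mem_mb / 1_000_000  # MB -> TB

print(total_chunks, total_mem_tb)  # 10000 1.0
```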

If we had that, we could bring such central units together in a network, as many as we need, to get to a full-brain sim of the kind envisioned by my friend @bitking.
Even if we don’t have all the answers to “how” they should communicate, we can be pretty confident the required processing power lies in this kind of range, and that we could play with such hardware and simulate stuff happening, until we get there.

Short of that, none of our current PCs would be able to function as an independent “brain-area” processor for that goal.

6 Likes

I don’t pretend to know anything about HTM or neuroscience, as I am just interested in it passively, but what about neuromorphic chips? Here is a machine that might be worth looking at:

It’s a sub-project of the EU’s Human Brain Project and has aspirations to run whole mouse-brain models. I am unfamiliar with the architecture, but it might be interesting to look at :smiley:

3 Likes

This seems to be going in a promising direction, esp. given the integer and 16b focus of this machine (seems so, from a quick look at it). Maybe they’ve nailed that part of the problem the same way I envision it. However, I’d argue that 18 cores per 128MB of RAM is still too high a computation/memory ratio compared to what would be optimal. I believe we wouldn’t need a 10-rack machine and a multi-million-dollar setup for the same effect if it were done with a memory-first model in mind. Dunno, really. We’d need those kinds of attempts to evolve towards affordable machines, of the scale of our personal computers. Just… built differently.

Good find, still :slight_smile:

1 Like

I just emailed that project (SpiNNaker) this week, and got a kind reply back from Simon Davidson, a research fellow there, who generously shared the following:


As you may have read, the SpiNNaker software stack has been designed for networks specified using a Python-based API called PyNN. So if you are simulating spiking neural networks and can express them in this framework, the task is straightforward. Doing your own framework means working at a much lower level with the machine and is a more serious undertaking. We have knowledge of the HTM approach, and members of the team have discussed the theory with Jeff Hawkins at various conferences. Implementing an HTM on SpiNNaker is certainly possible, but you would not be able to make use of much of our software infrastructure beyond the low-level APIs and hardware management libraries (SARK and SCAMP).

To get right to the heart of your questions: we do sell the 48-node SpiNNaker board, and we also grant free access to our one-million-core machine in Manchester (UK) via the HBP portal. While this latter route would limit your networks to those expressible in the PyNN language, it would allow you to get a feel for SpiNNaker and help you decide whether it is for you before committing anything further. Here is a link to some of our training material, which also contains information on the HBP portal and how to register to access the hardware for free:

http://spinnakermanchester.github.io/workshops/eighth.html

For a non-academic entity, the boards themselves cost about £10K plus your local sales tax. That’s about $12.5K (plus tax) at today’s exchange rate. Not something one would normally invest in for personal use. But if you want to go that route, we can give you a quote.

So, if one were to use that hardware, we’d either have to implement HTM using the PyNN framework (doesn’t look like a quick or easy fit), or design a new, specific framework on top of that hardware.

I think it’s worth checking out custom implementations of hardware acceleration via FPGAs first, before going down the rabbit hole of trying to use SpiNNaker :slight_smile:, but let’s get some more people united on this topic and see what we can come up with.

@marty1885, @jacobeverist, @gmirey and others, let’s try to come up with an agreed (mostly) upon implementation idea, and see if we can flesh this out.

I outlined some ideas here, but let’s get this knocked down.

The main goal is 10Hz, which gives us a time budget of 100ms to complete a full timestep of processing. If we can process pools in 50-70ms, then we’ll have another 30-50ms for data transmission between machines. While not easy, it seems doable.
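As a sketch of that budget (the 10Hz target and the processing window are the figures above; nothing else is assumed):

```python
# Time budget for one full timestep at the 10Hz target.
target_hz = 10
timestep_ms = 1000 / target_hz     # ms available per timestep
processing_ms = (50, 70)           # assumed processing window
comms_ms = tuple(timestep_ms - p for p in processing_ms)

print(timestep_ms, comms_ms)  # 100.0 (50.0, 30.0)
```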

5 Likes

As for my own FPGA setup:

At the moment I’m using an ICE40LP8K, as provided on a TinyFPGA BX board, but I have a DE10-Nano on the way with a Cyclone V SoC (with an embedded ARM hard core), which will give a nice jump up from 8K LUTs to 110K LEs. That board allows easy/fast communication between the ARM CPU and the FPGA, including direct memory access. Using the minimal HTM design I’ve outlined here, we should be able to fit reasonably sized pools of columns onto even this tiny machine.

My idea would be to design an HTM coprocessor… we wouldn’t need to keep all the column/cell state in the HTM processor; just use it for operating on some local information, then manipulating data in memory. If everything is pre-allocated, memory should be contiguous for most objects.

1 Like

Hi, I definitely agree with you, we need new hardware!
However I think we are very far from what we actually need.

If we look at the brain, we see the following: processing capacity and memory are mixed together. They are not separated as in today’s computers!
Information does not have to travel over a limited bus; it is where it is needed! The algorithms work locally, not globally.


In this post the argument is made that an external DDR3 memory needs to be connected to the FPGA. That makes total sense! They need to connect that memory to the FPGA. But as they do so, they immediately get the problem of limited memory bandwidth.

More brain like would be to mix the memory and the computing power. Then there is no memory bus and the information does not need to be transferred.

This would also solve another problem PCs currently have: heat! Heat is generated when fast-clocked operations are executed (assuming that static power consumption is not the source of heat).
If we mix computing power and memory, as well as run the algorithms locally, then we would get two things:

  • more space between the parts that get hot (as memory parts usually stay cold)
  • fewer parts that need a fast clock (as we could disable clocking for unused parts)

But until we get there I think we have to stick to our memory bus, inherent in all PCs/micro-controllers/etc.

Regarding your thoughts:

  • If memory is the only limiting factor, then we can use typical server hardware. Computers with 1TB of RAM are already available (even though costly).
  • Projects like SpiNNaker (http://apt.cs.manchester.ac.uk/projects/SpiNNaker/) push in the direction of having smaller processing units, more densely connected. This allows for a MUCH bigger memory bandwidth. However, they are still bound by their architecture.

Even though the hardware is interesting, I am not convinced that we actually need those big simulations to discover new things.
Small-scale simulations usually do the trick, and even if such a simulation needs to run for a week on our old-style PCs, that is still OK.

Adam

2 Likes

Say we have memory running at 2GHz, or even 1333MHz… all we need to accomplish to simulate the brain is 10Hz. Memory doesn’t become the limit; lack of distribution does. I think we could even have two memories that alternate between read/write states, just to speed that up if necessary. Let the same objects share the same addresses between memory modules, with a simple mux selecting which one to read and which to write for a given timestep.
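A minimal software sketch of that alternating scheme, assuming a plain double buffer where the “mux” is just a bit selecting which bank is read and which is written each timestep (the class and method names here are hypothetical, for illustration only):

```python
class DoubleBuffer:
    """Two memory banks sharing the same addresses; a mux bit picks
    which bank is read and which is written during a given timestep."""
    def __init__(self, size):
        self.banks = [bytearray(size), bytearray(size)]
        self.mux = 0  # index of the bank currently selected for reads

    def read(self, addr):
        return self.banks[self.mux][addr]

    def write(self, addr, value):
        self.banks[self.mux ^ 1][addr] = value  # writes go to the other bank

    def flip(self):
        self.mux ^= 1  # end of timestep: swap read/write roles

buf = DoubleBuffer(16)
buf.write(3, 42)           # lands in the write bank
assert buf.read(3) == 0    # read bank still holds the previous state
buf.flip()
assert buf.read(3) == 42   # new state becomes visible after the swap
```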

This would also solve another problem, PCs currently have: Heat! Heat is generated when fast clocking operations are executed (Assuming that the static power consumption is not the source of heat).

If I’m not mistaken, what generally causes heat is the power lost as we try to get gates to flip faster and faster. But for our purposes, we don’t need fast chips. We just need REALLY parallelized ones. A cheap bottom-barrel FPGA typically has a native clock speed of 16MHz. So if we have to read in data, even a few million bits’ worth (and I don’t think we need that, with the minimal HTM implementation I mentioned), that’s only a couple hundred ms, and we can of course scale that up. Once that data is loaded, operations should be able to occur at least once per three clock cycles (giving margin for hold time) across all the processing blocks (binary comparators + accumulators acting as counters).
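Checking that back-of-the-envelope figure (the 16MHz clock is from the paragraph above; fully serial loading, one bit per cycle, is my worst-case assumption):

```python
clock_hz = 16_000_000     # cheap bottom-barrel FPGA native clock
bits_to_load = 2_000_000  # "a few million bits" of input data

# Worst case: fully serial load, one bit per clock cycle.
load_ms = bits_to_load / clock_hz * 1000

print(load_ms)  # 125.0 -- indeed "a couple hundred ms" at worst
```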

I’m actually more in favor of a board with a distribution of smaller FPGA chips, like the iCE40LP, that just mass-distribute binary operations on bitstrings (checking for overlap) or find winners (cell selection). What makes the CPU slow at this right now is linearization, rather than any memory-access limit. I imagine sharding a pool across several smaller chips, rather than trying to get the whole thing to fit inside some larger chip… although there’s still a certain appeal to the SpiNNaker hardware, which would allow pools to fit inside a single core of one of their chips. That’s what originally caught my attention about it.

1 Like

@adam : I both agree and disagree with some things you said. I acknowledge that we need more coupling between mem and processing. But I don’t think it is out of our current technology league to do so… simply that current consumer hardware has not pushed the bar in that direction.

I think that view is misguided. We do not need to run longer or have more computing power to perform more realistic computations. We need more RAM. And yes, maybe also closer to the ALU that will use it, as you stated.
And the benefit of having a sim running in real time is enormous for research on all things “temporal”, which Jeff would perhaps argue is “everything”. Simulating vision or audio on real streams of visual or sound data enables real-time interaction, and that’s arguably a big deal.

My implementation ideas are software-oriented, and depart somewhat from HTM proper… I’m using some of Numenta’s insights (lots of significant synapses per cell, and possibly minicolumnar integration), cross-checking that with what I know of biologically relevant connectivity schemes… and hammering at the problem of getting the bits per synapse to a very, very low level. Which is roughly 12b in my tightest schemes:

  • 0b for the implicit “location” on a dendritic subsegment
  • 4b for “permanence” or weight (requires stochastic updates instead of small floating-point increments)
  • 8b for an address to an “axon id” in range of this location (tied to a scheme for carrying cell activation to “axon” terminals, ready to be sampled by those locations using as few as those 8b)
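A sketch of how such a 12b record could be packed and updated in software (the field widths are the ones listed above; the bit layout and the exact stochastic-update rule are my own illustration, not part of the scheme):

```python
PERM_BITS, AXON_BITS = 4, 8  # 4b permanence + 8b axon id = 12b per synapse

def pack(perm, axon_id):
    """Pack a synapse as [4b permanence | 8b axon id]."""
    assert 0 <= perm < (1 << PERM_BITS) and 0 <= axon_id < (1 << AXON_BITS)
    return (perm << AXON_BITS) | axon_id

def unpack(syn):
    return syn >> AXON_BITS, syn & ((1 << AXON_BITS) - 1)

def stochastic_update(perm, delta, rng):
    """With only 16 permanence levels, apply a fractional increment
    probabilistically instead of as a small floating-point step."""
    step = int(delta) + (1 if rng() < (delta - int(delta)) else 0)
    return max(0, min((1 << PERM_BITS) - 1, perm + step))

syn = pack(perm=9, axon_id=200)
assert unpack(syn) == (9, 200)
assert stochastic_update(9, 0.3, rng=lambda: 0.9) == 9   # draw above 0.3: no change
assert stochastic_update(9, 0.3, rng=lambda: 0.1) == 10  # draw below 0.3: +1
```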

This would give anyone the ability to implement a HTM-based model, or a more generic SNN-based model, with what I believe are biologically plausible figures.

What are those figures, then?
Minicolumns of about 100 cells, on a square grid, 40µm apart.
The max dendritic extent (PC or inhibitory alike) is thus about 8 of those minicolumns. Chunk them 8x8, so that any synapse there targets some cell in one of the 9 chunks around.
If you accept thousands of synapses per cell, some additional memory for space losses (filling standard-sized segments, for example), some additional space for the axonal part of the problem, some additional space for managing index dictionaries and such, and some “working memory” carrying the current activation state of axonal afferents or synapses… you’ll find that the 100MB figure per 8x8-minicolumn chunk is not far from the mark.
8x8 minicolumns (= 64) is close to a tenth of a square millimeter (since 25x40µm = 1mm, and 25x25 = 625 minicolumns per mm²). => You’d need ten thousand times that amount to simulate a 10cm² area (V1-sized, roughly). That’s 1 terabyte.
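Spelling out that geometry as a quick check (all figures are the ones given above):

```python
spacing_um = 40                 # minicolumn grid spacing
per_mm = 1000 // spacing_um     # 25 minicolumns per mm
per_mm2 = per_mm ** 2           # 625 minicolumns per square mm
chunk_cols = 8 * 8              # 64 minicolumns per chunk

chunk_area_mm2 = chunk_cols / per_mm2  # ~ a tenth of a square mm
area_mm2 = 10 * 100                    # 10 cm^2, V1-sized, in mm^2
n_chunks = area_mm2 / chunk_area_mm2   # chunks needed for the whole area
mem_tb = n_chunks * 100 / 1_000_000    # at 100MB per chunk, in TB

print(per_mm2, chunk_area_mm2)            # 625 0.1024
print(round(n_chunks), round(mem_tb, 2))  # 9766 0.98 -> "ten thousand", ~1TB
```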

The amount of computing power needed to run that in real time is, I believe, not that high, especially for HTM models with 1 computation per synapse per 100ms… but I think even going towards more SNN-based stuff is doable without involving 18 freaking cores per 100MB… We’d “simply” need to discretize this time step somewhat in between.

2 Likes

Another aspect of this would be to keep input encoding spaces smaller when possible, rather than larger. Everything regarding choosing winner columns scales with the encoding space × the number of columns in a pool.

A large input space means larger arrays for holding connection strengths, longer binary comparison operations, longer memory read/writes, etc.

If you’re okay with slightly inefficient modelling, I’d love people to test or provide feedback on ElixirHTM. Looking to make it easy to distribute an HTM model across different networked machines.

Digital design (FPGA or custom) can endow the HTM network with the necessary parallelism, but network scalability and power consumption are always a matter of concern. Thus, mixed-signal design can be a better alternative: it is faster, more compact, and also more power-efficient.

Ref:
[1] Zyarah, Abdullah M., and Dhireesha Kudithipudi. “Neuromorphic architecture for the hierarchical temporal memory.” IEEE Transactions on Emerging Topics in Computational Intelligence 3.1 (2019): 4-14.
[2] Zyarah, Abdullah M., and Dhireesha Kudithipudi. “Neuromemristive Architecture of HTM with On-Device Learning and Neurogenesis.” ACM Journal on Emerging Technologies in Computing Systems (JETC) 15.3 (2019): 1-24.

3 Likes

I just replied to @MaxLee in another thread, and gave some similar advice on the assumptions and simplifications I’ve made for efficient implementations. Here’s what I wrote:

For an efficient implementation, you need to make several assumptions that don’t cost you too much.

  • represent activations at the bit level, 8 neuron winners per byte. Not only does this reduce your memory footprint to 1/8, but it also allows you to use efficient binary AND operations, like you previously identified
  • take advantage of wide AND operations. If you can use 32- or 64-bit ANDs when applying a bitmask representing synapse connectivity with the input space, you’ve got your neuron’s synapse input very quickly
  • use POPCOUNT algorithms for computing a neuron’s overlap score. You can find a lot of discussion of this on Wikipedia under Hamming weight
  • use unsigned integer words for your overlap scores. The size of the word can be determined by the size of your “potential pool” of synapses; you shouldn’t need more than 16 bits
  • use unsigned 8-bit integers for your synapse permanences. There’s no reason you should need any more resolution than that
  • unlike the nupic implementation, which has a potential pool of 80-90%, your potential pool should be much smaller, at around 10-20%. You can make up for it by adding more neurons
  • the big challenge is managing the synaptic addresses. You may actually find that it is faster and cheaper to represent the synaptic connectivity as a bit string instead of a list of addresses. Even the potential pool of permanences may be better represented as an array of uint8. It depends on the tradeoff of a larger memory footprint vs. the compute cost of pointer dereferencing.
  • avoid boosting. I’ve never figured out a way to make it stable, and it just adds floating point into the mix.
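Several of these points combine into one tight loop: pack activations as bits, AND them with each neuron’s connectivity mask, and popcount the result for the overlap score. A minimal sketch (Python’s arbitrary-width ints stand in for the 32/64-bit words):

```python
def overlap(input_bits: int, connectivity_mask: int) -> int:
    """Overlap score = popcount(input AND connected-synapse bitmask)."""
    return bin(input_bits & connectivity_mask).count("1")

inp = 0b1111_0000_1111_0000     # active input bits, packed
masks = [                       # per-neuron connectivity bitmasks
    0b1111_0000_0000_0000,
    0b1010_0000_0101_0000,
    0b0000_0000_0000_1111,
]
scores = [overlap(inp, m) for m in masks]
assert scores == [4, 4, 0]

# choose winners: top-k neurons by overlap score
winners = sorted(range(len(masks)), key=scores.__getitem__, reverse=True)[:2]
assert set(winners) == {0, 1}
```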
2 Likes

I haven’t found the neuromorphic architectures built for deep learning or spiking neural networks to be terribly useful or interesting for HTM algorithms since they address completely different problems. If you try to do HTM on them, you’re gonna have a bad time.

HTM algorithms do primarily bitmask operations and integer array operations. The bottleneck is always in the addressing and reachability of synaptic connectivity to other neurons, and how that is managed. Either you limit each neuron to sparse connectivity to its input block (with a list of addresses), or you limit the size of the blocks of input it has access to (and represent them with a bit mask).

If you crack that nut, then you have a blazing fast implementation.
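To put rough numbers on that tradeoff, here is a sketch comparing the two representations (the 16-bit addresses and the 4096-bit input block are my assumptions, chosen only to make the comparison concrete):

```python
def bitmask_bytes(block_bits):
    """Dense scheme: one bit per potential target in the input block."""
    return (block_bits + 7) // 8

def address_list_bytes(n_synapses, addr_bits=16):
    """Sparse scheme: one explicit address per actual synapse."""
    return n_synapses * addr_bits // 8

block = 4096                # bits a neuron can reach in its input block
sparse = int(block * 0.02)  # 2% connectivity -> 81 synapses

assert bitmask_bytes(block) == 512
assert address_list_bytes(sparse) == 162  # the list wins at 2% connectivity
# With 16-bit addresses the list stays cheaper below 256 synapses,
# i.e. below 6.25% connectivity; above that, the bitmask wins.
```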

2 Likes

Very cool @AMZ! Thanks for the link.

Personally I’d go for implementing an HTM accelerator on something like a Lattice ECP5, as they come with a DRAM/DDR3 controller built in. Then connect multiples of them into a mesh, and get the mesh to do the calculation.

And since we have DRAM on each FPGA, we could ideally handle each HTM layer on a single FPGA, to minimize communication between chips. This would fit HTM.core’s and Etaler’s programming models quite well. Now the problem becomes how to save the synapses and weights back to a PC or some other place.

2 Likes

Pulling your whole reply from the other thread here and replying to it :slight_smile:.

I think I really should put my conceptual design into a graphic, pointing out the different pieces and the simplifications I’ve made in the overall approach. For temporal connections, I don’t care about proximal strengths at all. I don’t track them, and apparently they aren’t needed, as it turns out. Thus we can do away with an entire array per cell per minicolumn. Instead, for each “cell” in the minicol, we have a single votebox/counter and a fixed-size ring-buffer array, so that as t-1’s winning cells are chosen, they take a look at the previous timestep’s (t-2’s?) winning-cell pointers, then insert themselves into THOSE buffers. They then take a look at their OWN ring buffers, look at those pointers, and increment the voteboxes of those cells, so that come cell selection, those counts determine the winners.

I wouldn’t manage it. If we pretend that a single pool is a full column, we’ll size that instead to limit connection ranges between synapses.

Agreed.


I’d have to be convinced by evidence that would be a better approach consistently over ANDing and summing. But the nice thing is that we could test it.

Agreed.

Also agreed. We could even do smaller subdivisions if we wanted. After all, for learning rate, you can either increase the size of the step, or shrink the scale.

For a couple of reasons, I don’t like “adding” more neurons, because we’ll eventually run into a saturated-memory state anyway. I feel it’s much better to pre-allocate all the possibly-used memory. By pre-allocating all that space ahead of time, your application is predictable, and if you need to pull down an entire array at a time (or send it off to some co-processor to pull), the memory read is contiguous, with a far lower chance of cache misses. Memory usage, application hardware budgeting, and data locations then become predictable, increasing both speed and development manageability/maintainability. The alternative is some variety of messy tree structure (including linked lists). I’d rather be more artificially constrained than deal with unpredictability.

At least for connections from the minicols to the input space, current toy examples I’ve written are using bitstrings. The corresponding strengths then are arrays of unsigned 8 bit values.

I never got why people are trying to boost… it seems like introducing unnecessary noise (though I imagine there are some people who will disagree, and that’s fine).

It seems each chip has 3744 kbits, or ~468KB… I’m not sure we’ll be able to get around having to call out to external memory, though that memory controller would go a long way toward direct memory access from different independent FPGA chips.

Here’s where I like the idea of having a hard CPU core in the FPGA, so that both can directly access memory. Whatever we do, it has to be relatively easy even for non-technical people to use, i.e. “flash an SD card, plug it in, add power, etc.”, just like a Raspberry Pi or clones of it. For people who want to get deeper into it, we can make sure the core can run on something larger and/or more expensive, such as the high-end FPGAs that will be coming out in the next couple of years with a few million LUTs/LEs. If we have something that runs in a basic way on 8K LUTs, then we know we have something that can scale up from that (just duplicate elements and expand the registers).

Let’s get a v0.1 of something rolling, then test implementations on different chips. Main goal is that it needs to be inexpensive, and the Lattice stuff generally fits the bill. We can experiment between using a mesh and/or hard/softcore to coordinate the logic of the overall application.

1 Like

May I suggest that you ask @gmirey for his write-up on data requirements. I have been talking with him about some of my ideas for a simulation project and found his insights very informative, and a little discouraging.
He may keep you from reinventing the wheel.

3 Likes

I just want to make you aware of the existence of an older thread in this forum, which I initiated back in January 2017. It already received a lot of attention and interesting suggestions. We should pool both threads together, and check whether anything has further materialized from that older thread.

Please consider: @Bitking

3 Likes

I did similar stuff with bitwise ops, popcount, and such things for an HTM spatial pooler, yes.

What I’m after with that “hardware” is something a little more general-purpose than that. If you only devise your ops with simple masks at the synapse level, you’re assuming HTM 2016 hit the mark, and you’d prevent any tweaks to this very fundamental step.
Again, I don’t think perf is such an issue vs. memory amount on our current hardware.

This.

@bitking: thanks. I hope it won’t be too misguided, but I’ll try to come up with some diagrams, too, of how I envision things.
Depending on the amount of “customization” you’d wish for in such hardware, we may come up with pretty different designs. I believe at this stage we’d still need a quite open design… so it won’t map straightforwardly onto a pure HTM SP or TM, but it will be compatible, at least.

1 Like

I agree, and think we’ll need to come up with an abstraction layer for expressing our models that can then distill down into hardware (i.e. Verilog representations) + interface code (C).

…I think I might eventually come up with a domain-specific language (small… very small) to parse a specification and compile that into verilog components and their C-code interface. At the very least, it would allow us to generate some generic module elements, then allow the flexibility for picking and choosing depending on one’s preferred implementation requirements/hardware.

Some declarative syntax, so that we could have something that looks somewhat Haskell/Elm/Elixir-like. First we’d declare some basic constraints that would inform the Verilog and code generation.

# Input encoding size (in bits)
@encoding_size 100

# SP pool size
@poolsize 150

# Cell depth per minicol
@minicol_depth 120

# Connections per cell in minicolumns (max number of connections per cell) --> TM
@minicol_cell_cons 512

A collection of declarations like this would then set up a single hardware “device” in Verilog, which would compose the correct elements together from a set of pre-existing hardware templates. Then, to declare an interface with the “hardware”, generating C code, we could write a function like the one below:

# SP handles a 100-bit encoding
def handleencoding(encoding, 100) do
  encoding
  |> pool_column_overlap_scores
  |> choose_winners(0.02)
  |> strength_overlap_connections
end

Above, the “ |> ” symbol is like the pipe when working in a system shell: the results from one function flow into the next, which makes pipelines in hardware a natural thing to express. A function like the above, being aware of the hardware setup, would then send encodings (bitstrings, or 1d bitmaps if you like) into the hardware, controlling its registers to get overlap scores between the columns and the encoding, then choose winners (top 2% of the pool) either via the CPU or via customized hardware, if that has been specified.

Something like the above would allow us to write portable and quick-to-change HTM implementations… we could essentially declare graphs, then have the system spit out both the verilog and C interface for that hardware (assuming we have a system where both CPU and FPGA share access to the memory).

Then for temporal aspects:

def growmemory(winning_columns) do
  winning_columns
  |> choose_winning_cells
  |> grow_connections # previous timestep's winners would already be loaded into known registers
  |> add_winning_cells
end

The advantage this would offer over compiling from OpenCL is that we’d already have our hardware constructs (Verilog code templates) that use a minimal number of resources and are tested, and we’d be able to get maximum parallelization out of our compilation target (whichever brand of FPGA dev board).