We need new hardware

I really can’t follow this explanation. I think you need to draw a diagram or write some pseudocode for me to understand.

1 Like

I don’t think it’s too much of a problem. We are talking about FPGAs, after all.

Personally, I’d argue the problem is bandwidth, not capacity. Getting 2 PB of memory is easy if there’s enough money, but getting 2 PB/s of effective bandwidth is an engineering challenge.

But I want to put a few points out so we have a baseline that I think is reasonable.

  • The design should be scalable
    • We should be able to run HTM on 1 accelerator as well as 1000 accelerators
    • Otherwise debugging will be very expensive and time-consuming.
  • We have to think about communication between chips early
    • RJ45 and fiber are expensive and add latency
    • High-speed networking IP cores are also expensive (~20K/year)
    • We could use a portion of the FPGA’s I/O to drive differential pairs, but that takes careful electrical engineering.
  • The final design has to be cheap
    • Deploying and experimenting gets very expensive very quickly
    • Adding eMMC/SD and networking costs money

Let me know if any of these points seem invalid to you.


So we are building this thing in SystemVerilog? And targeting an Altera Cyclone V? Count me in.

2 Likes

@marty1885 : I don’t know if I’m talking about FPGAs ^^’ I know almost nothing about them. If you’re confident it can be done with them, I’d trust you on this… But when I say I don’t see speed as the most difficult point, I have current “real” hardware in mind, so that FPGA thing has to offer decent speed.

bandwidth: The thing with that neuromorphic design, which would be different from a truly general-purpose PC or GPU, is that the computation parallelism matches the memory parallelism. In the SpiNNaker example, they have racks with lots of cores per 128 MB of memory. I think it’s too high a ratio, but it’s still the way to go: one processor can be responsible for some specific neurons’ synapses, and the memory associated with those synapses is completely local to it.
If you’re willing to embrace that specificity, and keep some computation parallelism in your design, then the memory bandwidth problem “kinda” vanishes, because it gets distributed as well.
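To make that “the bandwidth gets distributed too” point concrete, here is a tiny back-of-the-envelope sketch in C. The node count and per-node figures are purely hypothetical, not taken from SpiNNaker or from anyone’s actual design.

```c
#include <stdio.h>

/* Hypothetical cluster: N small nodes, each with its own local memory bank.
 * None of these numbers come from SpiNNaker or any real design. */
int main(void) {
    const int    nodes           = 1000;  /* processing nodes, each owning its synapses */
    const double mem_per_node_gb = 2.0;   /* local memory per node, GB                  */
    const double bw_per_node_gbs = 2.0;   /* local memory bandwidth per node, GB/s      */

    /* Each node only ever touches its own synapses, so capacity and
     * bandwidth both scale linearly with the node count. */
    printf("total capacity     : %.0f GB\n",   nodes * mem_per_node_gb);
    printf("aggregate bandwidth: %.0f GB/s\n", nodes * bw_per_node_gbs);
    return 0;
}
```

A single shared memory bus would have to deliver that aggregate figure on its own; a distributed design never has to.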

Having been schooled on this by @bitking, in my mind it is quite clear that we should try to get all the computing and memory required for an entire brain “area” (a few cm² worth of cortex) onto one single board. Then inter-area communication can be sent over a regular network, as it requires a reasonable bandwidth (imho). Scalability is thus achieved by adding more “single-area” machines together.
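To get a feel for “reasonable bandwidth”, here is an illustrative estimate in C. The cell density matches the 62,500/mm² figure that comes up later in this thread, but the board area, sparsity, projection fraction, and update rate are all assumptions I made up for the example.

```c
#include <stdio.h>

/* Illustrative inter-area traffic estimate. All parameters are assumptions. */
int main(void) {
    const double cells_per_mm2   = 62500.0; /* excitatory cells per mm^2 (figure used later in the thread) */
    const double area_mm2        = 400.0;   /* a "few cm^2" board: 4 cm^2               */
    const double sparsity        = 0.02;    /* fraction of cells active per step        */
    const double projecting      = 0.10;    /* fraction of active cells projecting out  */
    const double steps_per_sec   = 100.0;   /* update rate                              */
    const double bytes_per_event = 4.0;     /* one 32-bit cell id per active event      */

    double events_per_sec = cells_per_mm2 * area_mm2 * sparsity * projecting * steps_per_sec;
    double mb_per_sec     = events_per_sec * bytes_per_event / 1e6;
    printf("inter-area traffic ~ %.0f MB/s (~%.2f Gbit/s)\n",
           mb_per_sec, mb_per_sec * 8.0 / 1000.0);
    return 0;
}
```

With these made-up parameters it lands around 20 MB/s, comfortably within commodity Ethernet, though the assumptions can easily swing it by an order of magnitude either way.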

1 Like

I’m officially confused now. Are we talking about building an HTM accelerator, an HTM supercomputer, or an HTM accelerator on an FPGA? My apologies.

If we are talking about building real chips, I have to point out how expensive they are. Taping out a single 4 mm² prototype on TSMC 28 nm costs 100K USD (with academic discount), and the chip design tools have ridiculous license fees.


How do you think we (as a community) should approach building such a system?

2 Likes

If you guys have great ideas for an HTM accelerator on an FPGA, then so be it. I’m not the one to say what should or shouldn’t be discussed here.

If you’re asking about my original question, it is quite like what I stated initially:
From afar, I feel there is some design space for cheap hardware, which would be incredibly useful to AI research, that hasn’t been fully explored yet.

I don’t know much about the configuration possibilities or performance profiles of FPGAs. I don’t know much about the configuration possibilities or costs of custom SoCs. I don’t know much about the engineering possibilities or costs of bringing together some existing microchips on a board. But I do feel there is some space there for the design of something with some definite properties (rough totals are sketched in code after the list):

  • simple computing units (mostly 16-bit ops)
  • optimally wide computing units (SIMD over those 16-bit values)
  • numerous computing units (one or a few per chunk of 1/10th of a mm² of cortical sheet)
  • lots of memory per computing unit (100 to 128 MB per chunk), distributed locally, close to each unit
  • the possibility to stack them on a board, communicating over a fast bus, with enough of them to reach some critically relevant area of cortex (several cm² would be nice)
  • a network/distribution chip on that board for bringing several boards together in the end
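As promised, the rough totals implied by that list. The chunk size, memory per chunk, and board area below are my reading of the bullets above, so treat them as assumptions rather than a spec.

```c
#include <stdio.h>

/* Rough totals implied by the wish list above. The chunk size, memory per
 * chunk, and board area are my reading of those bullets, not a spec. */
int main(void) {
    const double chunk_area_mm2   = 0.1;    /* one computing unit per 1/10 mm^2 of cortex */
    const double mem_per_chunk_mb = 128.0;  /* 100-128 MB per chunk; take the high end    */
    const double board_area_cm2   = 4.0;    /* "several cm^2" per board                   */

    double chunks = board_area_cm2 * 100.0 / chunk_area_mm2;  /* 1 cm^2 = 100 mm^2 */
    printf("computing units per board : %.0f\n", chunks);
    printf("memory per board          : %.0f GB\n", chunks * mem_per_chunk_mb / 1024.0);
    return 0;
}
```

With those assumptions you end up around half a terabyte of distributed memory per board, which is exactly why the per-synapse footprint matters so much.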

The reason I throw those properties out here is fundamentally that they’d bring some juice back into my hope-and-energy jar. They correspond to some numbers I crunched for my dream machine and my dream design, but I feel they’d let anyone play with really tailored hardware for all kinds of neural models.
I’ve been trying to get more specific than that since I promised some diagrams… for the moment I have failed at it. I have more details available, but I’m struggling with how to present or simplify my different designs and ideas.

So what this was about is a cry in the dark, in the hope that someone would show up and state “hell yeah, it is possible and can be done and here’s how, in hardware”. And also to reach the point where we could discuss the whys of that overall design and whether it is indeed a good idea and how it could be improved/optimized/done at all.

(…“still working on it” © famous last words)

1 Like

“All models are wrong, but some are useful”

The 2016 HTM model is fascinating, but it ignores many important aspects of the brain. It was created for the purpose of demonstrating Numenta’s discoveries. I think AGI will require many models of the brain, including, among others, the 2016 HTM.

1 Like

@dmac : we seem to agree here on all points :slight_smile:

1 Like

For starters, yes and yes. The DE10-Nano is the dev board I’ll be using.

and

This is doable on FPGA, if you’re willing to change the concept of what “synaptic strengths and connections” are, and how they are tracked.

Currently, I get the impression those connections are “grown” by adding a cell_id (based on the minicolumn it’s from) and then increasing/decreasing the strength of that connection; this is then multiplied across every single cell-to-cell connection, leading to (I think stupidly) high memory requirements. I get that it is useful for introspection of a minicolumn’s and a cell’s synapses… but it’s not what our brain does at all, and by doing this in software we’re adding complexity where none is needed.

Instead, what I keep trying to propose (and perhaps need to illustrate) is that each cell within a minicolumn has only TWO things:

  1. one incrementable counter value
  2. one fixed-length list/array, which stores only the cell_ids of those cells that our given cell forward-connects to (stored as pointers to memory locations or as values, whichever proves more efficient for a given situation).

Not only is this a more biologically plausible model, it’s also a ton more memory-efficient, which makes sense, since our brains have been optimizing this process for millions of years. Having a cell composed of only these two memory elements then lets us do forward progressions through temporal space.
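Here is a minimal C sketch of that two-element cell. The fan-out of 32 and the 16-bit cell ids are placeholders for illustration, not a proposal for the real sizes.

```c
#include <stdint.h>

#define FANOUT 32   /* fixed-length forward-connection list; placeholder size */

/* Each cell carries only the two things described above:
 *   1. an incrementable counter, bumped by cells that connect to it
 *   2. a fixed-length list of the cell ids it forward-connects to */
typedef struct {
    uint8_t  counter;           /* excitation counter, reset each timestep */
    uint16_t targets[FANOUT];   /* ids of cells this cell projects to      */
} Cell;

/* When this cell fires, it simply bumps the counter of every target. */
static void fire(Cell *cells, const Cell *self) {
    for (int i = 0; i < FANOUT; ++i)
        cells[self->targets[i]].counter++;
}
```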

So why do we only need those two elements? Let’s pretend we’re this cell. We’re just sitting in the brain, isolated and completely oblivious to the outside world, or anything at all beyond the input we’re getting from other connections. We don’t necessarily care about the number of those connections, or their strength. We just care that they’re there, and whether or not they’re firing, because the more they fire, the more likely it is that we will fire. We don’t know that we’re part of a layer or a minicolumn. We know only what we’re connected to, and that those connections are firing.

Now let’s imagine we’re at time step 1. Inputs from “somewhere” (we don’t know or even really care where) are pulsing through our minicolumn. We see maybe a couple of our connections activate, but our neighbors are quite active, firing, and in doing so stretching/reaching themselves out randomly, bumping into their neighbors (us included), and just happening to share information about their active state. A few more timesteps occur, and this activity continues all around us, with cells activating, firing, growing semi-randomly, forming connections, etc. Over time, our neighboring cells have pushed quite a few random connections to us, so that when a certain number of them finally activate and fire, we activate and fire too, and in doing so reciprocate those connections which activated us, while also stretching ourselves out into the area around us, making random almost-connections. A few more times, those same connections to us activate, causing us to fire and establish more connections with our other neighbors, so that as we activate and fire, WE signal out to THEM, putting them into a potentially active/predictive state to fire.

Circling back to real-world constraints, simulating this dense 3D environment and its connections is simply too intensive. It can be and has been done, but nowhere close to real time, and nowhere covering a region large enough to be useful for applications. But from the above cell’s perspective, we can draw two things:

  1. Those cells which activated and fired in the timesteps previous to US called on us (as a consequence of their random connections).
  2. When we fired, sending signals down our own synapses, we didn’t necessarily care who caused us to fire. It didn’t matter at all. It didn’t matter if one cell’s connection to us was stronger than another. We just know that we crossed the magic threshold of excitation and fired away. As we fired, we might have been sending signals right back down those synapses that activated us, but those same cells were likely not listening, as they needed to cool down, clean out their chemical leftovers, and were otherwise inhibited from their connection with us. So communication is one-way (not entirely true, but it’s safe to say this in general).

So when we go to model this in our systems, we need a counter, which cells connecting to us increment as they fire, and a list of cells that we’re connected to, so that we know which counters to increment.

For the purposes of introspecting cell connections and judging relative strengths between us and some other cell, you could certainly take that list, iterate over it, and sort connections into buckets, giving a general strength indicator. But as far as designing an operational and useful system goes, keeping track of information like that in memory is a waste of space, and not needed. So I suggest we throw it out entirely, which then makes our minicolumn take only ~200 KB or less in memory, potentially even less if we keep pool sizes below 256 (so that those connection indexes can get away with only 8 bits per connection).
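To sanity-check that footprint claim, here is the arithmetic with some placeholder sizes; the cell count per minicolumn, fan-out, and index width are assumptions, and the total moves with them.

```c
#include <stdio.h>

int main(void) {
    /* All of these are assumptions for illustration, not fixed design values. */
    const int cells_per_minicol    = 100;  /* cells stacked in one minicolumn      */
    const int connections_per_cell = 255;  /* pool kept below 256 ...              */
    const int bytes_per_index      = 1;    /* ... so an 8-bit index per connection */
    const int bytes_per_counter    = 1;    /* one small counter per cell           */

    int per_cell    = bytes_per_counter + connections_per_cell * bytes_per_index;
    int per_minicol = cells_per_minicol * per_cell;
    printf("bytes per cell      : %d\n", per_cell);
    printf("bytes per minicolumn: %d (~%.0f KB)\n", per_minicol, per_minicol / 1024.0);
    return 0;
}
```

With these placeholder numbers a minicolumn lands around 25 KB, comfortably under the ~200 KB ceiling mentioned above.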

I think the 2016 HTM model hit the basic distilled-down formula on the head, and further work will now have to go into how to combine this relatively simple algorithm with itself and others to produce more complicated outcomes, such as connecting it to state machines and modulating agents towards objectives (as defined and monitored by those state machines). There’s definitely some room for point neurons in here, and for mathematical shortcuts when we can find them.

2 Likes

Given that I’ve not mentioned synapses in the parts you quoted, I do not follow what you think is doable in that proposal, and what isn’t.

Seems like implementation details of current htm.core or something. I’m not concerned by that at all, for my part.

I’m not sure how that idea (forward indexing) would save significant amounts of memory. In the end, you’re still faced with a very large number of potential candidates.
And I’ve thought about this a lot: pre- or post-indexing takes about the same amount. You may save something like 1 bit of address, one way or the other, depending on your scheme, but not much.

My take on this issue: by modelling the two parts of the problem equally (pre-synaptic levels carried to axonal terminals, then post-synaptic segments sampling from those terminals locally), I’m able to bring the sampling space down to a dozen bits, give or take, while it still feels biologically realistic (to me).

Note that, strictly speaking, I’m not calling any of those figures HTM. I’m simply concerned with who may connect to what, in a sufficiently realistic, sufficiently plastic, sufficiently performant, and sufficiently model-agnostic way, so that we can do all sorts of real-time simulations with it (if we had the necessary hardware, which we do not, to my despair; yet it appeared to me to be really close to our fingertips… hence that whole thread).

2 Likes

I have an image in my mind of what you mean now, but I’m not sure it’s correct (pictures would be nice :slight_smile: )… but it seems you’d be trading memory for the processing required to translate/decompress the information buried in your bit scheme… certainly doable, and if you could illustrate it, we could probably model it out (FPGAs let us determine things down to the bit level, logic-wise). I think we’re both trying to drive down to a compact representation, whereas I’m trying to avoid extraneous computation as well… but who am I to say which would work better.

I love the gathering of ideas here. It should give us enough material to try out.

For people who haven’t dealt with digital logic, and have an interest in getting to know the basics of how it all works, I’d highly recommend Ben Eater’s 8-bit breadboard computer series on YouTube, where he goes from first principles all the way up to designing a fully implemented 8-bit machine using only logic gates.

If you learn the concepts he discusses and teaches, you can take those same ideas and apply them to FPGAs, which are like a customizable collection of tens or hundreds of thousands of logic gates (as opposed to the 20-30 he uses). You tell the chip how to configure its hardware through a hardware description language (VHDL, SystemVerilog, and others).

2 Likes

At its most basic level, this is about coming up with a flexible scheme to connect two sparse data structures, with quirky requirements as to the local quantization at the dendrite-segment and cell level. There is a small data structure at each of these pre/post synaptic intersections, and another at the cell level.

Depending on which operation you are doing in the current phase, you need an efficient method of traversing these intersections.
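For illustration only, not the specific scheme anyone here has in mind: a CSR-style adjacency layout is one conventional way to store and traverse such pre/post intersections, and the field names below are placeholders.

```c
#include <stdint.h>
#include <stddef.h>

/* Generic CSR-style layout: for each presynaptic axon terminal, a contiguous
 * run of synapse records pointing at postsynaptic segments. row_start has
 * (number of axons + 1) entries so each row is delimited by its neighbor. */
typedef struct {
    uint32_t segment_id;   /* postsynaptic dendrite segment                */
    uint8_t  permanence;   /* small per-synapse state, quantized to 8 bits */
} Synapse;

typedef struct {
    uint32_t *row_start;   /* row_start[i]..row_start[i+1]: synapses of axon i */
    Synapse  *synapses;    /* all synapse records, stored contiguously         */
    uint16_t *seg_overlap; /* per-segment accumulator for the current phase    */
} Connectome;

/* One phase of traversal: for every active axon, bump the overlap counter of
 * every segment it synapses onto (only synapses above threshold count). */
static void propagate(const Connectome *c,
                      const uint32_t *active_axons, size_t n_active,
                      uint8_t threshold) {
    for (size_t a = 0; a < n_active; ++a) {
        uint32_t axon = active_axons[a];
        for (uint32_t s = c->row_start[axon]; s < c->row_start[axon + 1]; ++s)
            if (c->synapses[s].permanence >= threshold)
                c->seg_overlap[c->synapses[s].segment_id]++;
    }
}
```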

4 Likes

I’m not forgetting about your question, MaxLee, but I still haven’t produced any visuals.

A little aside from what I’m personally after, but here’s a very interesting presentation relevant to the subject.

3 Likes

Has anyone looked at Intel Optane (aka 3D XPoint phase-change persistent memory) technology?
https://www.intel.com/content/www/us/en/architecture-and-technology/intel-optane-technology.html

The narrative sounds like it would solve several HTM hardware problems.

1 Like

I’ve thought about the potential application of Optane… but dang, it’s expensive at the moment. I’d need to rule out other hardware before I sink money I might not recover into an Optane SSD.

1 Like

Development is often expensive, but it is a “deductible” expense, and the equipment is a capital expense, all of which may not be applicable to us. The thing we need to focus on is potential solutions to the key problems of HTM systems. Obviously, existing technology provides “first mover” advantages that completely dominate the development of applications or demonstrations – in my opinion.

1 Like

The spiking hardware world is moving fast, but HTM theory does not currently accommodate spiking neurodynamics, does it?

If/when it does, there is probably something to be said for implementing SDRs on the Nengo platform as it provides compilation to a variety of back ends including spiking hardware chips.

Getting there may require a bit of theoretical work for consilience with Nengo’s Neural Engineering Framework.

3 Likes

@MaxLee: didn’t know where to start… so, I tried to lay out the basics for anybody to grasp.

My first question is, how to model that thing…

The HTM model, following Mountcastle, proposes that this cortical sheet is everywhere self-similar, revolving around the minicolumn as its building block (at least developmentally, but probably also functionally). And Numenta’s presentation material hints at a model where we could lay out each of these minicolumns (mCols) regularly on a 2D plane. Okay, I’ll stick to that.

Where each mCol (a tiny square on the grid) amounts to about a hundred cells stacked vertically. Let’s say 100 excitatory (pyramidal or spiny stellate) and 20 inhibitory of various kinds. Following HTM, you may have some amount of structure or special communication between the cells of a single minicolumn, as in the “Temporal Memory” model. Assuming that reaching the end goal of AGI research requires discovering a function for all layers, HTM will eventually need to simulate at least all 100 excitatory ones. So let’s take that count as a basis. You then have 62,500 excitatory cells simulated in the square millimeter of cortex represented above.

That 40 µm x 40 µm figure is somewhat arbitrary, although if you do the math from current overall estimates of cortical sizes and cell counts, you’d find it not far from the mark as a global average.
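The 62,500 figure falls straight out of that grid; a quick check in code:

```c
#include <stdio.h>

int main(void) {
    const double mcol_side_um   = 40.0;   /* one minicolumn footprint: 40 um x 40 um */
    const int    excit_per_mcol = 100;    /* excitatory cells per minicolumn         */

    double mcols_per_mm2 = (1000.0 / mcol_side_um) * (1000.0 / mcol_side_um); /* 25 x 25 */
    printf("minicolumns per mm^2     : %.0f\n", mcols_per_mm2);                  /* 625   */
    printf("excitatory cells per mm^2: %.0f\n", mcols_per_mm2 * excit_per_mcol); /* 62500 */
    return 0;
}
```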


Now, how and where do all these cells connect to each other?
Classical ANNs in current deep learning do not really model this. They idealize the network as several intermediate layers, where each cell in a layer connects to all cells of the next layer, and use a scalar “weight” on those many connections (edges in the graph) as an abstraction of all the messy stuff biological neurons would do to influence one another.

HTM tries to get more biologically inspired with respect to synapses and activations, but the 2016 model still resorts to determining a functional abstraction for a bunch of cells (in a layer), connected by spec to another, fully determined layer. The fundamental problem I see in “who connects to what” is kinda overshadowed by this current approach, and by the fact that current models focus on quite small problems (and bunches of cells).

[image: spatialpooler]

One insight of HTM, however, is that by using some signals as “context” and some activations as “modulatory”, we get a powerful model for learning stuff “in context”. This can be used, as the TM algorithm shows, to implement the learning and recognition of a temporal sequence of items.

Our wetware, however, is far messier than this simple picture, and it’s not as clear-cut which layer connects to which. But even if it were, I suspect that wherever recent Numenta developments in the Thousand Brains theory may lead, they will progressively add more and more of these driving and contextual inputs to the model, expanding the cardinality of the “who connects to what” problem each time. And unless there are very strict developmental constraints, complex identity chemicals, and/or a way to temporally distinguish signal origins by means of clear temporal phases, I suspect that these many more “input layers”, expected to arise in future developments of the theory, would be samplable by a single one, and could all be merged together into the same HTM-like segments.

In the end, we’ll be back to the problem of addressing a large sea of potential input sheets needing to communicate with a large sea of potential output sheets. And as we’ll see in a minute, it is not a trivial problem, performance-wise, for current hardware.


Each neuron has many, many synapses along its dendritic tree; that is where the connections between neurons form. The basal dendrites arborize extensively around the main body (the soma).

[image: Distal]

Simply by the geometrical nature of it, up to about 95% of all synapses are on the so-called “distal” parts of dendrites. In the lab, input potentials applied to these distal parts seemed to contribute very little, or not at all, to the overall response of the cell. Again following HTM, it is quite reasonable to assume that this is not “nature needlessly wasting 95% of its space and energy”, and that they do have a function very relevant to us. Following current knowledge about NMDA spikes, HTM proposes that, by acting as coincidence detectors, those distal segments can in effect serve as very robust pattern detectors in a sparsely activated sea of input signals.

So, taking into account the whole dendritic tree, the synaptic count per cell is literally “several thousand”. What exactly is “several thousand”, you may ask. Well, a minimum of 3K for that “several” label seems sensible… I’m confident I saw 10K on average proposed somewhere. As many as 30K was Jeff’s account in a podcast with Lex last year. Let’s say we could devise a model with no hard bound for any particular cell, yet an average (and overall max) of 8192 per cell.

That’s huge. And accounting for it needs two realizations:

  • (A) First is that the vast majority of the overall computation will be taken by those distal segments, which account for most synapses. Even if you somehow could perfectly solve the signal-preparation issue and had perfectly laid out a 64-bit “input signal” per 64 “connected” synapses, using the HTM binary scheme, ready for a single bitwise-AND + popcount… well, given the staggering number of synapses, the overall bulk of the computation would already be dominated by this part.

  • (B) Second is that the required memory per square millimeter of cortical simulation, if we’re simulating whole minicolumns, is huge. Given that we settled on 62,500 excitatory cells per square millimeter of cortex, with 8192 synapses per cell, we’re talking about 512 million synapses already. So, whatever byte count ‘B’ per synapse you come up with, you have half a gigabyte per unit of B. That’s per square millimeter of your model (i.e. about the proposed size of a functional macrocolumn). And you’d need more than a thousand times that amount if you had any dream of simulating an area the size of V1. Per hemisphere. That also has a performance impact on current hardware, requiring unrealistic amounts of bandwidth to transmit those gigabytes from far-away main memory to your processor… over and over again. (A quick footprint calculation in code follows this list.)
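Here is that footprint calculation spelled out, with the bytes-per-synapse value left as the free parameter it is above; the 5-byte example (4-byte id + 1-byte permanence) is only an illustration.

```c
#include <stdio.h>

int main(void) {
    const double cells_per_mm2 = 62500.0;  /* excitatory cells per mm^2 of cortex */
    const double syn_per_cell  = 8192.0;   /* average synapses per cell           */
    const double bytes_per_syn = 5.0;      /* e.g. 4-byte id + 1-byte permanence; illustration only */

    double syn_per_mm2 = cells_per_mm2 * syn_per_cell;        /* 512 million                  */
    double gb_per_mm2  = syn_per_mm2 * bytes_per_syn / 1e9;   /* ~0.5 GB per byte of synapse  */
    printf("synapses per mm^2 : %.0f million\n", syn_per_mm2 / 1e6);
    printf("memory per mm^2   : %.2f GB\n", gb_per_mm2);
    printf("V1-sized area     : %.0f GB (x1000, per hemisphere)\n", gb_per_mm2 * 1000.0);
    return 0;
}
```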

(B) is worsened by the fact that, as identified above, we’ll ultimately deal with many more potential inputs than the current HTM “single sheet”. The HTM model is already well past the point where we could realistically keep a dense synaptic connectome, with one permanence (or weight) value per potential target. So, rather than that dense thing, we need to spend some additional bytes on a sparse indexing scheme, adding an id or address to the synapse data (either forward, keeping the id of the postsynaptic cell in presynaptic storage, or backwards, from postsynaptic storage to the presynaptic signal emitter… it doesn’t really matter for minimizing footprint).

(A)'s bitwise-op budget needs to be taken with a grain of salt… since perfectly laying out those 64 inputs requires some preparatory computation anyway… and we’re not sure HTM will forever stick to binary. Well… stepping back a little, whatever the amount of preparation or the straightforwardness of your scheme, the amount of indexing + transport + potential unpacking + synaptic computation required is huge. So in the end, we’re talking about several orders of magnitude more computation happening in the synapses or segments than in integration at the soma level (and that would stay true even if you came up with a very complex chemical soma simulation… which could be good or bad news depending on your mood).
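For reference, the per-segment inner loop in the binary scheme is just a handful of word-wide operations. A minimal sketch, assuming the bitmasks are already packed 64 synapses per word (which is exactly the preparation cost mentioned above):

```c
#include <stdint.h>

/* Overlap of one dendrite segment with the current input, binary HTM style:
 * count the connected synapses whose presynaptic cell is active. Assumes both
 * bitmasks are already packed 64 synapses per word.
 * __builtin_popcountll is a GCC/Clang builtin. */
static int segment_overlap(const uint64_t *active_inputs,
                           const uint64_t *connected_synapses,
                           int n_words) {
    int overlap = 0;
    for (int w = 0; w < n_words; ++w)
        overlap += __builtin_popcountll(active_inputs[w] & connected_synapses[w]);
    return overlap;  /* compare against the segment's activation threshold */
}
```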

So… how does our brain deal with this?
By composing with (and taking advantage of) topological constraints.

[image: AxonalAndDendriticExtent]

You’re right: any scheme would be a tradeoff. Yet I believe there is an exploitable saddle point on this curve.

Do not spend too much time trying to decipher the image above… time has passed since I was messing around with those values and I’m not so happy with them anymore. But I’m still kinda confident there is a sweet spot in the middle.

But this post is getting long already… so…
[to be continued]

5 Likes

I very much appreciate the time you took to put your thoughts together. You’ve expressed yourself very well so far, and I’m eager to see the rest. I think it’s fairest for me to wait for your fully expressed idea before I respond to the various points.

My FPGA dev board (DE10-Nano) finally arrived in the mail, so I’ll be trying to start my experiments later this week and creating a GitHub repo to centralize some of this.

I hope everyone remains well and in good health.

3 Likes

Not sure if bumping an old topic is the recommended approach, but I searched and did not see a more appropriate topic. I just wanted to ask this community whether you had seen this massive FPGA-based neuromorphic computer and get some takes on whether it would be appropriate for implementing some of Numenta’s ideas.

Neuromorphic Computing at Human Scale on Reconfigurable Hardware

It would be really interesting if anyone had first hand experience with it.

1 Like

They mention Intel Loihi technology, but not specifically whether they’re planning on using it.
Here is a great presentation of what the Loihi chip is capable of, by Mike Davies, director of Intel’s Neuromorphic Computing Lab, which designed it.

And if you want a nice-looking demo, here’s an application of how it will track and kill people in the near future. Errr… I mean… rescue operations. Right.

1 Like