That’s approximately right. The only correction I’d make to that characterization is that while this mutex is still proportional to log(N), the base of the log is much greater than 2 and the constant of proportionality is probably smaller as well. It is also likely to take a lot less real estate.
The technical risks I see are making sure:
Kirchhoff’s Laws apply at the frequencies encountered, and
the “noise” doesn’t overwhelm the digital signals.
Both of these relate to a bad scaling law (N^2) for electrical power on the shared line between diodes:
As the number of diodes increases the capacitance increases.
As the frequency of the analogue signal increases (to speed up contention resolution), the cusps on the waveform of the shared line will contain frequency components that are higher still.
Power has to be drained off the shared line by a resistor at a rate proportional to N^2, so the constant of proportionality had better be very small to begin with or the number of contending cores will be impractically small.
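Here is a back-of-envelope sketch of that scaling in Python, using the standard dynamic-power relation P = C·V²·f. Every constant below (per-diode capacitance, voltage swing, per-contender resolution bandwidth) is an illustrative assumption, not a measured value; the point is only that with C and f each growing roughly linearly in N, power grows as N².

```python
# Back-of-envelope for the N^2 power scaling on the shared line.
# Every constant below is an illustrative assumption, not a measured value.

def shared_line_power(n_cores: int,
                      c_per_diode: float = 5e-15,  # farads added per attached diode (assumed)
                      v_swing: float = 0.8,        # volts of swing on the line (assumed)
                      f_per_core: float = 1e8) -> float:  # Hz of resolution bandwidth per contender (assumed)
    """Dynamic power P = C * V^2 * f, with C and f each growing ~linearly in N."""
    c_total = n_cores * c_per_diode       # line capacitance grows with diode count
    f_total = n_cores * f_per_core        # resolution frequency grows with contenders
    return c_total * v_swing ** 2 * f_total  # net effect: P ~ N^2

for n in (8, 64, 512):
    print(f"N = {n:3d}: P ~ {shared_line_power(n) * 1e3:.2f} mW")
```

Each 8x increase in cores gives a 64x increase in power under these assumptions, which is why the constant has to start out tiny.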
I think the key to this idea’s speed is that only the cores which are contending for access to a memory bank contribute to the time complexity, as opposed to a naive digital implementation, which considers all of the cores regardless of whether they are trying to access the bank.
What if the condition is in memory (i.e. an L1/L2/L3 miss)? That can be a 1000-cycle stall for the pipeline. The “predication” trick works for a few instructions… not hundreds (the waste of energy is unsustainable). Itanium proved that.
Sorry for being negative, but I couldn’t find any comparison between Mill and conventional architectures able to change my mind.
I mean physical contention. When you have shared media (e.g., a memory controller) and the requests start to pile up, latency grows steeply; in simple queueing models it diverges as utilization approaches 1. If the available bandwidth doesn’t scale with the number of computing elements, saturation kills your performance pretty quickly. This is well known from old shared-bus systems.
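The classic back-of-envelope for that blow-up is the M/M/1 queue, where mean latency scales as 1/(1 − ρ). A minimal sketch, with purely illustrative numbers:

```python
# Minimal M/M/1 sketch of why latency diverges near saturation on shared media.
# Service time is normalized to 1; the numbers are illustrative only.

def mm1_latency(rho: float, service_time: float = 1.0) -> float:
    """Mean time in system W = s / (1 - rho); unstable at rho >= 1."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("queue saturates at rho >= 1")
    return service_time / (1.0 - rho)

for rho in (0.50, 0.90, 0.99):
    print(f"utilization {rho:.0%}: latency x{mm1_latency(rho):.0f}")
```

Going from 50% to 99% utilization multiplies latency by 50x, so the knee of the curve arrives well before the link is nominally full.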
Well yes, but out-of-order execution (OoO) computers have the exact same problem. If the data you need is in RAM then your current thread is definitely going to stall. Neither static nor dynamic scheduling will prevent such a stall. OoO computers will try to schedule around the missing data, but there is no way they’re going to fill anywhere near 1000 cycles of work with partially missing data.
Instead, OoO computers will use “hyper-threading” to swap a different thread of execution into the CPU and hope that the other thread has all of its data in cache. Hyper-threading comes with its own trade-offs, mainly that it needs 2x the cache size to hold two threads’ worth of data. And since only one of the two threads can execute at a time, half of the cache is holding data for a thread which is in the “stalled” or “waiting-to-run” state.
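A rough model of what that buys (my numbers, purely illustrative): with 2-way hyper-threading the core only idles when both threads are stalled at once. Assuming independent stalls and deliberately ignoring the extra misses from splitting the cache:

```python
# Rough utilization model for 2-way hyper-threading: the core only idles
# when both threads are stalled at once. Stall fraction is an assumption,
# and the model deliberately ignores the extra misses from splitting the cache.

def busy_fraction(stall_frac: float, threads: int = 1) -> float:
    """Fraction of cycles with runnable work, assuming independent stalls."""
    return 1.0 - stall_frac ** threads

s = 0.30  # assumed fraction of time a single thread waits on memory
print(f"1 thread : core busy {busy_fraction(s, 1):.0%}")  # ~70%
print(f"2 threads: core busy {busy_fraction(s, 2):.0%}")  # ~91%
```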
Edit: the Itanium was revised in 2010 to implement hyper-threading. Itanium - Wikipedia
Sorry but no, that’s incorrect. First, you can speculate on which path to follow using the branch predictor (branch predictors, which are populated with runtime information, are pretty good today: 98%+ accuracy). Then you have all the machinery (ROB, LSQ, etc.) to execute that path speculatively, i.e. to roll back if the speculation was wrong. Current ROBs are 600+ entries, so you have room to keep going. Also, prefetchers can speculatively fetch memory contents ahead of use.
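For scale, a quick estimate of how many cycles of a miss a large ROB can absorb: with retirement blocked at the head, the core can keep issuing until the ROB fills. Entry count and issue width here are representative figures, not tied to any specific product:

```python
# Quick estimate: with a miss blocking retirement at the ROB head, the core
# can keep issuing until the ROB fills. Entry count and width are
# representative figures, not tied to any specific product.

rob_entries = 600    # reorder buffer capacity (assumed)
issue_width = 4      # instructions entering the ROB per cycle (assumed)
miss_cycles = 1000   # DRAM miss latency from the discussion above

run_ahead = rob_entries / issue_width
print(f"ROB absorbs ~{run_ahead:.0f} of {miss_cycles} stall cycles "
      f"({run_ahead / miss_cycles:.0%}) before it fills")
```

That run-ahead window, plus prefetchers launching later misses early, is what lets the machine overlap a good chunk of the memory latency.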
It’s called SMT, or simultaneous multithreading, and again, what you’re saying is not accurate. I think this thread is derailing from the initial discussion and is not the place for these debates. Current processors are really complex things. In any case, respectfully, if you are interested in computer architecture I suggest you take a look at J.L. Baer [1]. A great little book, not as recognized as the H&P brick.
Yes, I know, and that’s why I pursued a single-conductor, high-frequency, analog resolution circuit. Did you bother to look at that circuit? What you are trying to do with OoO execution, speculative branching, etc. takes chip area that can instead be devoted to additional, simple CPUs that go idle for a cycle or two in the worst case. Adding memory banks to keep the ratio of CPUs to memory banks low is cheap with this architecture. Moreover, a lot of your OoO concern can be addressed along the lines of the CDC 6600 scoreboarding, which takes very little chip area. (BTW, I worked at the Arden Hills CDC operations, where I met Thornton, and a lot of my thinking about the mutex circuit arose from Cray’s approach to memory controllers.)
When you have a large number of CPUs and a larger number of memory banks with fast, low-area mutex crossbars, your effective memory latency is low and bandwidth is high.
Some round numbers: the Cray-1S vector computer had about 250k transistors in the CPU, not counting vector registers. This is a ridiculously small number given the 50B-transistor chips nowadays: 1000 Cray-1S CPUs would be half a percent of the area. Add in the vector registers and you’re still probably talking an insignificant area. The rest is memory banks – lots and lots of memory banks, with phased access resulting from vectorization.

What happens when you have lots of memory banks phased like that is that the CPUs tend to synchronize their vector streams once they hit a halt condition on a memory access – so they’re all marching across the memory banks at different phases and maximum bandwidth is approached. I don’t know what the ratio of CPU clock to memory latency would be, but you can issue a bunch of memory accesses in sequence – one per CPU clock – and not have to worry too much about the already very low latency. As I recall from the national labs’ use of these systems, the tradeoff between vector and scalar operations was around length 3 or so – which is one reason you need the low latency: to service the short vectors and scalars.
If you have a memory bank to CPU ratio of about 5 to 1 on 50e9 transistors, you have 5000 memory banks each with about 10M transistors.
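Those round numbers worked through, using only the estimates from the two paragraphs above (none of these are die-accurate figures):

```python
# The round numbers above, worked through. All transistor counts are the
# post's own estimates, not die-accurate figures.

chip_budget = 50e9          # transistors on a modern large die
n_cpus = 1000
cpu_transistors = 250_000   # per Cray-1 class CPU, sans vector registers
banks_per_cpu = 5

cpu_total = n_cpus * cpu_transistors
n_banks = n_cpus * banks_per_cpu
per_bank = (chip_budget - cpu_total) / n_banks

print(f"CPUs use {cpu_total / chip_budget:.1%} of the budget")        # ~0.5%
print(f"{n_banks} banks at ~{per_bank / 1e6:.0f}M transistors each")  # ~10M
```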