I’m not an architecture expert, but there’s a great MIMD architecture that may be suitable for HTM: https://www.graphcore.ai/
Or this: https://cerebras.net/
You know the manufacturing cost of the wafer is only around 15,000 dollars. I think it is about 2 million to buy the system. With 2.6 Trillion transistors you can’t go wrong, though.
The graphcore chip is the GC200 IPU. (You have to go hunting for that link. They need to fire their head of sales.)
They say each chip has 1472 cores and each core has 45TB/s bandwidth to its own 900MB RAM.
??REALLY?? Did they mean “Mb” rather than “MB”?
The 1472x1472 intercore network supports 8TB/s communication rates (although they don’t say what the latency is).
If this doesn’t make gmirey happy I don’t know what will.
The cerebras website is more opaque than the graphcore website (with graphcore’s questionable 900M"B"/core figure) – making you register to download their “whitepaper” which is apparently where they’ve hidden the critical information not available on their website. Otherwise I’d say it looks promising.
PS: The sales people for these ML hardware companies are batting 0 of 1000. You don’t hide information on the core like that.
It says:
- 1472 independent IPU-Tiles ™
- “900MB In-Processor-Memory ™ per IPU”
I think this means that the whole chip is a single “IPU” and the 900-MB is somehow shared or distributed between the 1472 “IPU-Tiles”.
Otherwise the chip would have: 900MB X 1472-tiles = 1.3248 Tera Bytes of memory…
That makes sense of the 900MB figure which would mean, 600kB/core – still quite adequate for trying out gmirey’s approach.
The cerebras figure of “40GB/chip” and “850,000 cores” seems to indicate 47kB/core, which doesn’t strike me as adequate for gmirey’s approach, even with the bit vector compression described in the aforelinked paper. (To call their wafer a “chip” seems a bit of a stretch.)
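For anyone who wants to check the per-core numbers above, here is a quick back-of-envelope sketch. It assumes decimal units (1 MB = 10**6 bytes) and an even split of memory across cores, which the vendors may not actually do. The 40 GB Cerebras total is my assumption: it is the value that reproduces 47 kB/core, whereas 45 GB would give roughly 53 kB/core.

```python
# Back-of-envelope memory per core, in decimal units (1 MB = 10**6 bytes).
# Totals are the vendor figures quoted in this thread. The 40 GB Cerebras
# total is an assumption: it is the value consistent with 47 kB/core.

MB, GB = 10**6, 10**9

def bytes_per_core(total_bytes, cores):
    """Evenly divide on-chip memory across cores (an idealization)."""
    return total_bytes / cores

graphcore = bytes_per_core(900 * MB, 1472)        # 900 MB shared by 1472 tiles
cerebras_40 = bytes_per_core(40 * GB, 850_000)    # assumed 40 GB total
cerebras_45 = bytes_per_core(45 * GB, 850_000)    # for comparison: 45 GB total

# What the chip would hold if 900 MB really were per tile rather than per IPU.
if_900mb_per_tile = 900 * MB * 1472

print(f"Graphcore:       {graphcore / 1e3:.0f} kB per tile")
print(f"Cerebras (40GB): {cerebras_40 / 1e3:.0f} kB per core")
print(f"Cerebras (45GB): {cerebras_45 / 1e3:.0f} kB per core")
print(f"900 MB x 1472:   {if_900mb_per_tile / 1e12:.4f} TB")
```

So "600kB/core" is a slight round-down of about 611 kB per tile.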
Itanium showed us that the compiler can do what it can do. Static scheduling can’t compete with hardware (because the amount of information you have at compile time is limited). Intel/HP now know that.
Cache coherence is required to keep programmer sanity. Cell is a good example of what happens without CC. IBM/Toshiba/Sony know that. Even if you have local memory, you need to “synchronize” … either the hardware does it or the programmer does.
Perhaps we miscommunicated regarding my too-laconic phrase “without cache coherency overhead”. What I should have said was “without cache, thereby obviating the need for cache coherency overhead”.
Now, of course, the question arises: How can one get by without cache?
First of all, lest we lose our bearing – bear in mind we’re talking about minimizing latency by moving shared memory on-chip.
Secondly, the question arises: Even so, how are you going to get the latency to on-chip shared memory as low as that to on-core cache?
You will grant, I hope, that maintaining instruction cache coherence is qualitatively different from data cache coherence, and that I may therefore be permitted to restrict myself to eliminating data caching in what follows. Moreover, I hope I may be further permitted to restrict myself to writable data (i.e., we’re not dealing here with program literals, which are data treated more like instructions).
Such RW data that is shared between processes running on different cores requires software-level coherence in the form of locks/semaphores even if at the OS level of software. So we’re stuck going out to shared memory anyway.
So, now we’re ready to compare apples-to-apples in trading off cache-coherence circuitry real estate against shared memory real estate – and their respective latencies.
What it boils down to is if you can get the mutex circuitry for sharing memory banks between cores small enough in terms of real estate and in terms of latency, it makes sense to dispense with RW cache real estate and, with it, RW cache coherency real estate, and use that real estate for the shared memory banks and associated mutex circuitry.
PS: I don’t know if my radical (analog-digital hybrid) idea for such mutex circuitry would actually work in practice, or if it would be necessary to make the above tradeoff viable, but it arose from my thinking about the above tradeoff.
The problem of that idea is bandwidth. Contention can be a huge pain.
I think coherency overhead (both in memory access time and area required for directories + filters) is not that significant.
I can’t see how you can get MBs of memory close to the processor at <1ns access times. Even assuming SRAM, that seems not achievable. Moreover, with simple cores you don’t have room for OOO … then a memory latency of a few clock cycles will be terrible.
Not necessarily. The Mill is not the same architecture as the Itanium. They have some very novel and interesting ideas. Furthermore they’ve recorded lectures where they explain how all of it works, on their web page.
Edit. For example, on the mill cpu: when you read a byte of memory which is not accessible (because the user does not have permission to read that memory), the mill returns data which is marked as “Not a Result” (NaR). NaRs are enforced in hardware (by tagging every register with an extra “NaR” bit) so that the user can’t accidentally confuse them with actual data. In an OOO machine, such a memory violation would immediately trigger an interrupt, but on the mill, the violation is reported to the user within the normal flow of control (no exception, unless they try to use the NaR like a normal memory read). Because of NaRs, the mill can safely break the dependency between a pointer bounds check and the subsequent pointer load.
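To make the NaR idea concrete, here is a toy software model of the behavior described above. It is only a sketch: the names `spec_load`, `spec_add`, `pick` and `store` are hypothetical helpers I made up for illustration, not Mill instructions or any real API, and the real hardware tags registers rather than passing around a sentinel object.

```python
class _NaR:
    """Toy stand-in for a 'Not a Result' tagged value (hypothetical model)."""
    def __repr__(self):
        return "NaR"

NAR = _NaR()

def spec_load(memory, index):
    """Speculative load: yields NaR instead of faulting on a bad address."""
    if 0 <= index < len(memory):
        return memory[index]
    return NAR

def spec_add(a, b):
    """Speculable arithmetic: a NaR operand simply propagates."""
    if a is NAR or b is NAR:
        return NAR
    return a + b

def pick(cond, if_true, if_false):
    """Select one of two values that have already been computed."""
    return if_true if cond else if_false

def store(value):
    """Non-speculative use: only here does a NaR turn into a fault."""
    if value is NAR:
        raise RuntimeError("fault: NaR reached a non-speculative operation")
    return value

def checked_read_plus_one(memory, index, default=0):
    in_bounds = 0 <= index < len(memory)  # the program's bounds check
    loaded = spec_load(memory, index)     # load issued without waiting for the check
    result = spec_add(loaded, 1)          # may be NaR, which is harmless so far
    return store(pick(in_bounds, result, default))

print(checked_read_plus_one([10, 20, 30], 1))  # in bounds
print(checked_read_plus_one([10, 20, 30], 7))  # out of bounds, NaR is discarded
```

The point of the model is that the load no longer has to wait for the bounds check: a bad load produces a NaR that gets thrown away by the select, and nothing faults unless the NaR is actually consumed.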
Of course not, but like the Mill, Itanium trusted the compiler to get good VLIW packs (Itanium still was OOO).
The mill is at its core a digital signal processor (DSP). As with all other DSPs, it is designed for doing loops of arithmetic, and it has a bunch of extra hardware features for accelerating looping structures. One of the primary DSP acceleration techniques is software pipelining, which executes a loop as a pipeline. Essentially a single VLIW connects all of the functional units and registers into one big hardware pipeline. Then, as the instruction repeats for every iteration of the loop, the DSP feeds all of your data into/out of that hardware pipeline. The advantage of this technique is that each iteration of the loop is executed sequentially, and all of the parallelism is between different iterations of the loop. Software pipelining exposes a lot of instruction-level parallelism. However, this only works if the loop does not reference prior iterations of the loop.
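A rough way to picture software pipelining in plain code (a sketch only, with a made-up three-stage loop body; a real DSP or the Mill would run the three stages of each trip in the same cycle, which Python obviously does not):

```python
def sequential(xs, k):
    """Reference loop: each iteration does load, multiply, add+store in order."""
    out = []
    for x in xs:
        a = x              # stage 1: load
        b = a * k          # stage 2: multiply
        out.append(b + 1)  # stage 3: add and store
    return out

def software_pipelined(xs, k):
    """Same loop, restructured so one trip holds three different iterations.

    Trip t runs stage 3 of iteration t-2, stage 2 of iteration t-1 and
    stage 1 of iteration t. The first two trips form the prologue and the
    last two the epilogue.
    """
    n = len(xs)
    out = []
    loaded = None  # pipeline register between stage 1 and stage 2
    scaled = None  # pipeline register between stage 2 and stage 3
    for t in range(n + 2):
        # Stages are written in reverse order so that each one reads the
        # value its predecessor produced on the previous trip.
        if t >= 2:
            out.append(scaled + 1)  # stage 3, iteration t-2
        if 1 <= t <= n:
            scaled = loaded * k     # stage 2, iteration t-1
        if t < n:
            loaded = xs[t]          # stage 1, iteration t
    return out

print(sequential([1, 2, 3], 2))
print(software_pipelined([1, 2, 3], 2))
```

Within a trip the three stages touch three different iterations and do not depend on each other, which is where the parallelism comes from. If iteration t needed the stored result of iteration t-1, the stages could not be overlapped like this.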
The problem with DSPs is that actually using them is hard. To get the compiler to emit the instructions to actually use all of those special hardware features, you need to re-write your program to explicitly use them, often with C pragmas, macros, or other custom magic. And those special hardware features are usually not very flexible in what they can do.
The mill’s promise is to run unmodified C programs with all of the DSP acceleration techniques, including to automatically convert loops into software pipelines. They did a “clean sheet rethink” of the ISA with this goal in mind. So to answer your doubts about “getting good VLIW packs”: for software loops, the mill cpu should be able to expose a lot of parallelism via software pipelining and then pack the entire pipeline into a single or a few VLIWs.
I wonder if the Nengo software (which is signal processing oriented) would provide a reasonable platform for HTM’s sparse representations? (Not that Mill will be available before Hell freezes over.)
I agree, static scheduling works just fine in signal processing. But it is very limited in general-purpose computing. The compiler has no clue about input-dependent values, so it cannot re-schedule the code properly. For example, how do you schedule a loop in advance if its branch condition depends on a certain value in memory? The programmer has a limited capability to make it easier (for example, SIMD extensions in current processors are really hard to use efficiently).
There are a lot of dead bodies in the road for static scheduling. I’m not saying it won’t ever work (it can be especially appealing given the current speculative side-channel attack debacle)… but it will require some serious rethinking of the compilation process. Until then, I don’t see how dynamic scheduling will go away (on the contrary, I see it being exploited more and more aggressively; current behemoths such as the M1 have a massive ROB to exploit it, 630+ entries).
Yes, they did do some serious rethinking of the compilation process.
The mill cpu puts forth some pretty radical ideas, which get a lot of attention and naysayers. But they also put forth long-format videos explaining most of their ideas, and no one who has seriously looked into their work has raised any real issues. The ideas have been public for years and no one has said “ah-ha! look here, this is why it will never work!”
So again, here is that link to all of those video explanations: docs / videos / slides – Mill Computing, Inc
I’m not sure if you mean “the looping condition”, or “a loop which contains an if statement”.
Looping conditions are easy, the VLIW’s can contain branch operations. The decision to branch (or not to branch) is decided at run time (not statically).
Loops containing if statements are harder. One optimization strategy is to convert the if to use the pick assembly instruction. pick is a hardware implementation of the C ternary statement (condition ? true : false). One caveat of using pick is that first you must compute both arms of the if statement even though you’re only supposed to execute one of them, so if there are globally visible side-effects hidden in the if statement then you can’t use pick like this.
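Here is roughly what that conversion looks like, sketched in Python rather than Mill assembly. The `pick` function is a hypothetical stand-in for the instruction, and the loop body is made up for illustration.

```python
def branchy(xs):
    """Original loop: control flow depends on the data."""
    out = []
    for x in xs:
        if x < 0:
            y = -x
        else:
            y = x * 2
        out.append(y)
    return out

def pick(cond, if_true, if_false):
    """Stand-in for a select: both inputs are already computed."""
    return if_true if cond else if_false

def if_converted(xs):
    """Same loop after if-conversion: no branch inside the body."""
    out = []
    for x in xs:
        negated = -x      # arm 1, always computed
        doubled = x * 2   # arm 2, always computed
        out.append(pick(x < 0, negated, doubled))
    return out

print(branchy([-3, 0, 4]))
print(if_converted([-3, 0, 4]))
```

Both arms here are pure arithmetic, so computing the one that gets thrown away is harmless. If an arm wrote to memory or called something with side effects, this rewrite would change the program's behavior.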
“Mutex” is mutual exclusion, which is the resolution of contention. The timing of the linked mutex circuit is a lot better than you might imagine - it is that radical of an improvement. The bottleneck is the speed of light on a large die. Cache (including coherence) overhead goes up more than linearly with the number of cores. It matters.
“…all multi-core Mills will be single-chip designs sharing the same cache.”
“shared cache” is another phrase for “on-chip shared memory” differing perhaps only in the trade-off between speed and size, e.g. one would likely use SRAM over DRAM if one were less concerned about hit rate (size of shared cache) than about speed of hits.
So the Mill guys are obviating the need for cache coherency real estate, as am I. However, I don’t think they are doing anything particularly radical in going to shared memory. The Mill architecture is looking more and more like a fit for my radically faster mutex circuitry to expand on-chip memory shared between all of the large number of cores on-chip.
lol, yea they aren’t inventing any radical new electric circuits. During one of Ivan’s talks he mentioned that his hardware ppl keep coming up with new transistor level circuit designs but that they’re probably not going to use them BC they think that their competitive advantage is in the CPU architecture, not in the transistor level circuit designs.
This is a neat idea: you want to make an analog circuit which does arithmetic faster than a digital circuit can, by using Kirchhoff’s Laws to find the maximum value of a list of numbers in near-constant time, as opposed to the traditional digital circuits which take log(N) time per N numbers. And then to apply this to improving shared memory bank circuits.
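For reference, the log(N) digital baseline is just a comparator tree. A small sketch that counts the levels (this models only the digital side; it says nothing about how the analog circuit would behave, and `tree_max` is a name I made up):

```python
import math

def tree_max(values):
    """Find the maximum with rounds of pairwise comparisons.

    Returns (maximum, rounds). Each round models one level of a
    comparator tree, where all comparisons in the level run in parallel.
    """
    level = list(values)
    if not level:
        raise ValueError("need at least one value")
    rounds = 0
    while len(level) > 1:
        # Compare neighbours pairwise; an odd element passes straight through.
        nxt = [max(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    return level[0], rounds

for n in (2, 8, 1000):
    biggest, rounds = tree_max(range(n))
    print(n, biggest, rounds, math.ceil(math.log2(n)))
```

The number of rounds comes out as ceil(log2(N)), so the tree depth (and with it the latency) grows slowly but does grow with the number of requesters, which is what a near-constant-time analog maximum would be trying to beat.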