2015-08-12

piotr5 wrote:
It seems you're missing the point. We're talking about heterogeneous system architecture: two machine languages sharing a single memory.

The point that you're missing is the expectation that you can always plan up front which code should end up on which processor.
It doesn't work that way, and you're underestimating the complexity of the code that needs parallelising. That's probably because you're used to complex CPUs which parallelize for you at runtime.

Optimisation is an empirical process: you can make a reasonable guess, but you don't know where the bottlenecks are until you actually measure.

Epiphany or the Cell SPU is not like GPGPU: it's fully capable of running the same complexity of code as a CPU. And that's the point of having it, versus the existing solutions of CPU, GPGPU and FPGA.

piotr5 wrote:
Examples of this kind of architecture are Parallella, AMD's Kaveri and Godavari, and the ODROID-XU3 with its Samsung processor.

The closest example is the Sony/Toshiba/IBM CELL; Epiphany has more in common with that than with any of the other examples you've posted, and I'm speaking from experience on that architecture.
Epiphany has further enhancements though, e.g. the ability to generate off-chip transactions directly from loads & stores, which makes it more feasible for it to do a CPU's job.

One thing Sony explicitly had to go around telling people was 'don't think of them as accelerators - they are your main processors'. Coming from the PS2 we had something more restricted (DSPs, basically: VU0 and VU1), and that was a mindset they needed to break.

piotr5 wrote:
All 3 are available below $200 and offer about the same parallel performance (more or less, in terms of order of magnitude: AMD offers 4x4GHz plus 8x4x<1GHz wavefronts with 16 SIMD cores each -- you could see this as 512 SIMD or as 32 MIMD; I prefer the latter and call them high-bit-width cores. The ODROID-XU3 has 4x2GHz and 4x1.4GHz cores, both programmed with the same machine language. Parallella has 2 ARM cores and 16 slower cores.) I.e. you don't need to actually upload the program; a lot of complexity is lifted in comparison to the 10-year-old combination of CPU and GPGPU. Therefore you can use the same source code for both architectures if you take care that shared data is actually compatible. No JIT,

The thing that GPGPU can't do, which CELL and CPU can (and which a combination of ARM+Epiphany should also be able to do), is *nested parallelism* - multiple levels of fork & join - a bigger outer task (itself parallel) with inner loops parallelised. On GPGPU the tasks can't interact at the right granularity.
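To make that concrete, here's a minimal sketch of the shape I mean, just using std::async for illustration - the chunking and the sums are placeholders for real work, and on Epiphany/CELL the inner level would fan out across neighbouring cores rather than OS threads:

    #include <algorithm>
    #include <cstddef>
    #include <future>
    #include <numeric>
    #include <vector>

    // Inner, fine-grain level: split one chunk across a few parallel tasks.
    float sum_chunk_parallel(const std::vector<float>& chunk, std::size_t parts = 4) {
        std::vector<std::future<float>> inner;
        std::size_t step = (chunk.size() + parts - 1) / parts;
        for (std::size_t i = 0; i < chunk.size(); i += step) {
            std::size_t end = std::min(chunk.size(), i + step);
            inner.push_back(std::async(std::launch::async, [&chunk, i, end] {
                return std::accumulate(chunk.begin() + i, chunk.begin() + end, 0.0f);
            }));
        }
        float s = 0.0f;
        for (auto& f : inner) s += f.get();      // inner join
        return s;
    }

    // Outer, coarse-grain level: one task per chunk, each task itself parallel.
    float sum_scene(const std::vector<std::vector<float>>& chunks) {
        std::vector<std::future<float>> outer;
        for (const auto& c : chunks)
            outer.push_back(std::async(std::launch::async,
                                       [&c] { return sum_chunk_parallel(c); }));
        float total = 0.0f;
        for (auto& f : outer) total += f.get();  // outer join
        return total;
    }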

They're not going to carve out new space if developers restrict their thinking to what already happens on CPU+GPGPU, CPU+SIMD, CPU+FPGA. And we need to solve the problem that defeated Sony's attempt.
I firmly believe there IS a niche there - the same one Sony went for. The fact that we use GPGPU for it now restricts its potential; I view it as an evolutionary mistake. You have to choose between complexity and throughput, and straddling that boundary is awkward.

This is how innovation works - you look for new opportunities, the things that look crazy now because they haven't been done yet. All our technology comes from this. It used to be crazy to claim that people could fly, or that a high-level language could be as fast as assembler. I remember when assembly programmers didn't take C seriously (I was one), and ditto when C programmers didn't take C++ seriously.

You spot the opportunity, fill a new space, and the paradigm shifts.

This is what Adapteva needs to do to succeed with this.

I do believe their hardware is a good idea.

For yet another take, see what Intel is doing with 'Knights Landing' - merging complex cores & throughput capability. It's on their roadmap.

You mentioned previously that "no one would do vertex transformations on the CPU"
Guess what.

Sony CELL did actually have sufficient throughput to do vertex transformations - AND was sufficiently versatile to handle any task the CPU could do (because it could step through large datasets).

And with some shoe-horning to encourage more use, Sony did actually end up supplying libraries for this: vertex shading, and 'backface culling' (a speedup was possible by getting the SPUs to transform vertices, assemble primitives and do the backface cull ahead of the GPU - then you rewrote an index list and fed *that* to the GPU, so that it would only run the full vertex shader for vertices that touched forward-facing primitives. A modern GPU makes some compromises over what's possible, to shoe-horn rendering into a simplified paradigm.)
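That SPU pre-cull pass was roughly this shape - a from-memory sketch, not Sony's actual library API; the types and the transform step are stand-ins:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Vec3 { float x, y, z; };

    // Stand-in for the real view/projection transform (identity here).
    static Vec3 transform(const Vec3& v) { return v; }

    // Cross product z of the projected triangle edges; the sign gives the winding.
    static float signed_area_2d(const Vec3& a, const Vec3& b, const Vec3& c) {
        return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
    }

    // Build a new index list containing only front-facing triangles; the GPU
    // then runs its full vertex shader over far fewer primitives.
    std::vector<uint16_t> cull_backfaces(const std::vector<Vec3>& positions,
                                         const std::vector<uint16_t>& indices) {
        std::vector<uint16_t> out;
        out.reserve(indices.size());
        for (std::size_t i = 0; i + 2 < indices.size(); i += 3) {
            Vec3 a = transform(positions[indices[i]]);
            Vec3 b = transform(positions[indices[i + 1]]);
            Vec3 c = transform(positions[indices[i + 2]]);
            if (signed_area_2d(a, b, c) > 0.0f) {        // front-facing (CCW)
                out.push_back(indices[i]);
                out.push_back(indices[i + 1]);
                out.push_back(indices[i + 2]);
            }
        }
        return out;
    }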

It was even sometimes used for pixel work, e.g. the last stages of deferred lighting.
And yet its main role was running the same code that ran on an SMP CPU on the rival platform.

This is the kind of space you'd be back in if Adapteva succeeds - throughput AND greater complexity - whereas with the existing options you must decide between either extreme.

When I look at the Adapteva Epiphany I see 'version 2.0' of an experiment Sony performed - we have experience of what they were trying to achieve and what went wrong.

During its development, CELL was originally intended to be the graphics chip, an evolution of how the PS2 worked (where the CPU had one DSP to help, and where vertex transformations were done by a second DSP; there were going to be 2 CELL chips). In the end they couldn't compete with GPUs for pixel operations (raster & texture sampling), and throwing in a GPU with vertex shaders led to an awkward hardware setup that restricted its potential. But the horsepower was definitely there for vertex transformations.

GPUs have subsequently gained extra complexity to handle more techniques (geometry shaders), but CELL was already capable of those tasks: advanced tessellation etc., more scope in how to encode & decode vertex data, more scope to use information between vertices.

piotr5 wrote:
You need to make a decision about which architecture your code will be compiled for though, and you need to split the source code

THAT is the part that prevented this architecture from reaching its full potential - this is the lesson we learned between the gamedev community & Sony over the past 10 years.
If you have to manually split, you can't handle the full range of complexity and fine-grain interactions that it's capable of, and the other options (SIMD, CPU/GPGPU) are superior: the development cost becomes prohibitive.

piotr5 wrote:
... in a way that a single file is compiled for a single machine language. If your program was well-designed this is no problem. Parallelising loops that way makes no sense though; the administrative overhead and program complexity don't make the shared computation worth the hassle.
Parallel loops are a very cheap shot at automatic parallelisation: easy to implement, but rarely useful.

They're *nested*: fine and coarse grain, 2 levels of fork and join. And it DOES work - this is how game engine code works already. It does take effort reworking data structures... but it's proven.
On a complex out-of-order (OOOE) CPU the 'automatic parallelization' is done by the hardware, so you take it for granted. On the Xbox360 we had to do it manually with loop unrolling, which made it very visible. The plus side was that it encouraged us to rework data to maximise the amount of fine-grain parallelism possible.
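For example, the sort of manual unrolling I mean (a generic sketch, not actual Xbox360 code): splitting a dependent accumulation into independent accumulators so an in-order core has several operations in flight at once.

    #include <cstddef>

    // Naive loop: every iteration depends on the previous 'sum', so an
    // in-order core stalls on the add latency every time around.
    float dot_naive(const float* a, const float* b, std::size_t n) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            sum += a[i] * b[i];
        return sum;
    }

    // Manually unrolled with independent accumulators: the four multiply-adds
    // per iteration don't depend on each other, exposing the fine-grain
    // parallelism an OOOE CPU would otherwise have found for you at runtime.
    float dot_unrolled(const float* a, const float* b, std::size_t n) {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; ++i)               // remainder
            s0 += a[i] * b[i];
        return (s0 + s1) + (s2 + s3);
    }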

For Adapteva to match what Intel does, you will need to make that parallelism explicit, the same as on the Xbox360, and leverage inter-core transactions to handle the fine-grain interactions, minimizing the overhead of starting & stopping the group of parallel tasks - the difference between making deliveries with many cars, with a few 18-wheeler trucks, or with 'something in the middle' (pickup trucks). E.g. imagine trying to make door-to-door deliveries with the 18-wheeler.

If you don't, you will waste the full potential of MIMD to do what GPGPU can't.

The opportunity to exceed Intel is that handling it at compile time saves runtime cost (heat, transistors). (And Intel has spotted a similar opportunity, which is why they're doing 'Knights Landing'.)

piotr5 wrote:
More useful is an approach encompassing loop parallelisation too: investigate which pieces of data processing are independent of each other and do those in parallel. Here the risk is overlooking some hidden independence, detection of which is quite a new and complicated problem for advanced algebra. So, until we teach computers to do the algebra for us, we need to optimise and parallelise by hand, in addition to some AI.

Yes, and the established pattern for this is higher-order functions; 'map-reduce' is a famous example. I'm talking about writing pretty much all your source this way: wherever you have an iteration, you write it as a higher-order function taking a lambda. Then you can toggle a simple switch to a _par version, and you can empirically establish the best granularity.
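A minimal sketch of the idea - for_each_indexed and the PARALLEL_BUILD switch are names I'm making up here, not an existing library:

    #include <algorithm>
    #include <cstddef>
    #include <future>
    #include <thread>
    #include <vector>

    // Every iteration in the codebase goes through this one entry point,
    // so switching between serial and parallel is a single toggle.
    template <typename F>
    void for_each_indexed(std::size_t count, F&& body) {
    #ifdef PARALLEL_BUILD
        // coarse fork/join over a handful of worker tasks
        std::size_t workers = std::max(1u, std::thread::hardware_concurrency());
        std::size_t step = (count + workers - 1) / workers;
        std::vector<std::future<void>> tasks;
        for (std::size_t begin = 0; begin < count; begin += step) {
            std::size_t end = std::min(count, begin + step);
            tasks.push_back(std::async(std::launch::async, [=, &body] {
                for (std::size_t i = begin; i < end; ++i) body(i);
            }));
        }
        for (auto& t : tasks) t.get();
    #else
        for (std::size_t i = 0; i < count; ++i) body(i);
    #endif
    }

    // Usage: the call site never changes, only the build switch does.
    void scale_all(std::vector<float>& v, float k) {
        for_each_indexed(v.size(), [&](std::size_t i) { v[i] *= k; });
    }

The call sites stay identical either way; the granularity (how many iterations per task) is the thing you'd then tune empirically.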

I have looked with great interest at the world of functional programming. It's ruined by garbage collection, but I am convinced lambdas and internal iterators are the way to go.

piotr5 wrote:
Keep in mind, all this is just about MIMD; SIMD can be happy with mere loop parallelisation. Although I've heard SIMD compilers also look for similar instructions that can be parallelised, that rarely happens. Isn't that how MMX and SSE optimisation works?

Well, in gamedev it's very common to use SIMD for the 'Vector3'/'Vector4' datatypes - most of the calculations (at every level) use 3D vector maths, it's ubiquitous - but the newer architectures are more suited to what you describe, with wider SIMD (8-wide) and 'gather' instructions making it easier to apply SIMD to general problems.
And it was actually possible to work like that on Sony CELL in a very convoluted way (again, really needing fancy compiler magic that never appeared): you could load independent structure data into registers, permute it into SIMD lanes, work on that, then permute back and store to the independent structures again.
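In SSE terms the pattern looks roughly like this (a sketch with an invented struct; on CELL you'd be doing the equivalent with its own shuffle/permute instructions rather than these intrinsics):

    #include <xmmintrin.h>   // SSE intrinsics

    struct Particle { float x, y, z, mass; };

    // Gather one field from four independent structs into SIMD lanes,
    // work on all four at once, then scatter the results back.
    void scale_masses(Particle& p0, Particle& p1, Particle& p2, Particle& p3,
                      float factor) {
        // gather: pack the four scalar 'mass' fields into one register
        __m128 m = _mm_set_ps(p3.mass, p2.mass, p1.mass, p0.mass);

        // the actual work happens across all four lanes in one instruction
        m = _mm_mul_ps(m, _mm_set1_ps(factor));

        // scatter: unpack the lanes back into the independent structs
        float out[4];
        _mm_storeu_ps(out, m);
        p0.mass = out[0];
        p1.mass = out[1];
        p2.mass = out[2];
        p3.mass = out[3];
    }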

piotr5 wrote:
As has been mentioned already, complexity comes from all the little peculiarities of the various systems. Epiphany requires the program to be loaded into local memory,

Loading into local memory is the same as a cache, except that instead of the machine figuring it out at runtime, you have the opportunity (and responsibility) to use compile-time information up front.
In gamedev we had machines with in-order processors, which meant they ground to a halt on a cache miss, which meant we needed to manually put prefetch instructions in for optimum performance. Moving from Intel to the Xbox360 was an education in what Intel does behind the scenes, i.e. by taking it away.
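The manual prefetching looked something like this - a generic sketch using the GCC/Clang __builtin_prefetch hint rather than the console compilers' own intrinsics; the lookahead distance is a made-up tuning value:

    #include <cstddef>

    // Walk a large array, requesting data a few iterations ahead so the
    // in-order core isn't stalled on a cache miss when it gets there.
    void accumulate_prefetched(const float* data, std::size_t n, float* out) {
        const std::size_t lookahead = 16;   // tuned empirically on real hardware
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + lookahead < n)
                __builtin_prefetch(&data[i + lookahead]);  // hint, no side effects
            sum += data[i];
        }
        *out = sum;
    }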

What happened was that CELL and Xbox360 actually needed the same data-layout tweaks; it's just that the actual coding for CELL was more convoluted, and you needed to do things slightly differently depending on whether data was in local or global memory.
Epiphany has a step forward that can simplify this: namely the ability to use the same address space for 'on-core' and 'out-of-core' addressing. (I know on the current board that's restricted to a 32MB window, but a strong recommendation I would give is to maximize the amount of space that is addressed the same.)

piotr5 wrote:
... AMD works best if you make use of the cache, and so on. Traditionally the compiler is responsible for these details, but if the compiler doesn't do its work the source code becomes more complicated with all the exceptional stuff. Obviously compiler developers are overburdened with optimising for the various possible architectures.

piotr5 wrote:
IMHO a program should be composed of 2 layers: the source code expressing the intention of the programmer, and some sort of plugin for the compiler to help implement those intentions on the various architectures. Templates are one possibility for doing these 2 layers.

Yes, this makes perfect sense; now we're on the same page. Templates and macros can sort of do it, but the analysis is whole-program and beyond what they can do, which is why we must do things manually at the moment.

One thing I did on CELL was a set of smart pointers that handled the localstore DMA: given any pointer to main memory, you just assigned it to some local temporary and dereferenced from that, and via constructor/destructor it would do a DMA load, then a store on going out of scope. With some #ifdefs the same code would compile for the Xbox360 (the smart pointer would just pass the original pointer across, and the optimiser would just use the original value).
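Roughly this shape - a from-memory sketch rather than the original code; dma_get/dma_put and SPU_BUILD stand in for whichever DMA primitives and build flags the platform actually provides, and it assumes plain-old-data types:

    #include <cstddef>
    #include <cstring>

    // Stand-ins for the platform's DMA-to-localstore primitives; on real
    // hardware these would issue the actual DMA commands and wait on them.
    inline void dma_get(void* local_dst, const void* main_src, std::size_t bytes) {
        std::memcpy(local_dst, main_src, bytes);
    }
    inline void dma_put(const void* local_src, void* main_dst, std::size_t bytes) {
        std::memcpy(main_dst, local_src, bytes);
    }

    template <typename T>
    class LocalCopy {
    #ifdef SPU_BUILD
        T  local_;     // localstore copy of the object
        T* remote_;    // original address in main memory
    public:
        explicit LocalCopy(T* main_mem_ptr) : remote_(main_mem_ptr) {
            dma_get(&local_, remote_, sizeof(T));                  // pull in on construction
        }
        ~LocalCopy() { dma_put(&local_, remote_, sizeof(T)); }     // write back on scope exit
        T* operator->() { return &local_; }
        T& operator*()  { return local_; }
    #else
        // On a cache-based CPU (e.g. Xbox360) just pass the pointer through;
        // the optimiser strips the wrapper entirely.
        T* ptr_;
    public:
        explicit LocalCopy(T* main_mem_ptr) : ptr_(main_mem_ptr) {}
        T* operator->() { return ptr_; }
        T& operator*()  { return *ptr_; }
    #endif
    };

    // Usage: identical source on both platforms.
    struct Enemy { float health; };
    void damage(Enemy* main_mem_enemy, float amount) {
        LocalCopy<Enemy> e(main_mem_enemy);   // DMA load (or no-op) here
        e->health -= amount;
    }                                         // DMA store (or no-op) at scope exit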

This was a reasonable 'better than nothing' way of adapting code for portability - but it would be so much better if those wrappers could be swapped in by a compiler plugin, and done empirically, i.e. from measurement (run the code on the CPU first with extra instrumentation, then decide the granularity of the switch).

An analogy I would use is registers, stack and machine code vs a modern compiler.
You used to have to manually manage which data goes on the stack and in registers.
Then compilers advanced, and they could reliably and automatically (a) allocate registers and (b) track for you which variables move between stack & registers when the registers spill - even though from the perspective of a 1985 assembly-language programmer, those would be chalk & cheese.

Think of 'fine' and 'coarse' grain parallelism in the same way. At the minute we do it manually, but a compiler should be able to take the same code and figure out where each part is to be done... As you shuffle code around between loops and functions, how it best maps to the underlying hardware might change, even though it's logically the same.

Statistics: Posted by dobkeratops — Wed Aug 12, 2015 11:30 am