2015-08-14

Interesting discussion.. thanks for taking part.

For example, I've observed an interesting trend towards using source code instead of libs.

Right. nail/head.
... that's because it's easier than splitting code into strict 'domains' (which is what was tried first)... you need fine-grained interaction.
Also, C++ lets you inline and specialise a lot with template libraries (the best bit!) - so even if your code can be split at source level, by the time it's actually compiled it's intermingled.
You miss this if you use binaries. It's this specific ability that I'm after here. (When I say lambdas, I do want them to inline into a stub on the Epiphany.)

I have high respect for your experience, but times change; maybe some new ideas are available now which you didn't have back then?

I've always been highly theoretical, and of course I've seen a lot along the way. I'm even still here because of speculation, not experience (experience would say "don't touch it with a barge pole!"). And yes, things have changed, a lot.

It was the experience of the PS3 and Xbox 360 that got me to research functional programming. (The experience there, of course, was 'C++03 was not sufficient..')

This way it becomes possible to move the 3rd-party code onto another processor architecture and code for all the different kinds of phones and tablets. I suspect the reason is multicore systems, as started by AMD's 8-core system and the highly energy-efficient Intel versions.

So to refine this picture: what happened in 2005-2006 was, Microsoft came to us with the 360 - a multicore design - which extrapolated ahead of the existing PC curve by going multicore-SMP. Sony came to us with the PS3 - very similar to the Epiphany - an extrapolation of the dual DSPs the PS2 had. More power on paper: more on-chip memory, more computational power.

What Sony said was, "we hope middleware will use the SPUs" (we need a word - 'little-cores', whatever) - but what happened in practice is we had all that hugely complex code being developed already, and we could adapt to new ideas >2x as fast working on the Xbox 360. So it became the lead platform. Around that time there was the Ageia PhysX chip too, for PCs (again, 'little-cores' on an expansion card). I remember explaining back then how encouraging the parallel evolution was.. and I was wrong: the future was SMP, not little-cores.

The problem with SPUs is, you don't have enough time to refine everything - you have deadlines.. things *must* be released by a certain date. And whilst the SPUs handled fully optimised code amazingly well, on the 360 you could get '80%' of the performance with maybe 1/4 of the effort (whatever the real numbers are - that was the strategic effect). Out of the box, SPUs couldn't even run most programs. I wrote a set of lazy smart-pointer DMA wrappers to facilitate a quick port; looking around, it seems there were eventually attempts to compile 'normal' code with a literal software-emulated cache.. they were that desperate to get any use out of those cores.

PCs then evolved down the multicore route. The PhysX was squeezed out between SMP x SIMD on one side and GPGPU on the other - a great shame - and this is the exact space the Epiphany could be going for (re: a PCI card).

I'm stubborn; I still think the world did something wrong - shoe-horning general-purpose power into a GPU, and stretching the Intel architecture so far (those ancient 8-bit instructions driving a deeply pipelined vector machine.. ?!)

Btw, why do you keep talking so much about sorting?

Good question.
When you look at any code that uses random access through a cache, and you think "how the hell do we do this with DMA" - the answer is sorting.
Instead of random accesses, you emit some sort of buffer of requests, then sort them, then traverse the data along with the requests for each piece of data.

(Today something similar has been popularised as "MapReduce" on clusters: take a data set, map it producing key-value pairs, sort the intermediates by key, then reduce the values for each key.)
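
A minimal sketch of that pattern in C++ (my own illustration, not code from any existing library - all names are hypothetical): buffer the requests, sort them by the data they touch, then stream through the table in order instead of accessing it randomly.

Code:

#include <algorithm>
#include <cstddef>
#include <vector>

struct Request { std::size_t table_index; std::size_t requester; };

template<typename T, typename OnResult>
void process_sorted(std::vector<Request> requests,
                    const std::vector<T>& table,
                    OnResult on_result)            // lambda: (requester, const T&) -> void
{
    // Sort the buffered requests by the data they touch..
    std::sort(requests.begin(), requests.end(),
              [](const Request& a, const Request& b) { return a.table_index < b.table_index; });

    // ..so the table is now visited in ascending order (streamable, DMA-friendly),
    // instead of hopping around it randomly through a cache.
    for (const Request& r : requests)
        on_result(r.requester, table[r.table_index]);
}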

Example in graphics: z-buffer vs tile-based deferred rendering (not to be confused with deferred shading).
Z-buffer rendering - random writes into a big frame buffer, which might use a cache.
How to get rid of that? Accumulate primitives with 'key' = on-screen grid cell, sort them into tiles, then 'reduce': gather the primitives for a tile, render it on-chip, and spit out the finished tile.

MapReduce is very general; the tiled rendering example is very specific.

This is what I want to do with cross-platform templates... build up a library of slightly specialised functions like that, and adapt them to different situations by plugging lambdas in (see the sketch below).
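
As a hedged example of what one of those 'slightly specialised' functions might look like (purely my illustration, hypothetical names): one lambda computes a key per item, the template sorts by it, and another lambda handles each bucket - the tiled-rendering shape from above.

Code:

#include <algorithm>
#include <cstddef>
#include <vector>

// 'map': compute a key per item (e.g. on-screen tile) and sort by it;
// 'reduce': hand each run of equal keys to a lambda that processes one bucket.
template<typename Item, typename KeyFn, typename PerBucket>
void sort_and_reduce(std::vector<Item> items, KeyFn key, PerBucket per_bucket)
{
    std::sort(items.begin(), items.end(),
              [&](const Item& a, const Item& b) { return key(a) < key(b); });

    for (std::size_t begin = 0; begin < items.size(); ) {
        std::size_t end = begin + 1;
        while (end < items.size() && key(items[end]) == key(items[begin])) ++end;
        per_bucket(key(items[begin]), &items[begin], end - begin);   // one tile / grid cell
        begin = end;
    }
}
// Usage in the tiled-rendering spirit: key = screen grid cell of a primitive,
// per_bucket = 'render these primitives on-chip, spit out the finished tile'.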

Accumulating & sorting requests - that's what this FPGA memory might be good for?
I wonder if the FPGA could actually sort for you... could you assist the Epiphany with a sorted, indexed DMA channel... almost a hardware assist for something like MapReduce..

In graphics the POV changes smoothly, slowly moving along a fixed line, and so does the z-distance data.

Right, I'm talking about sorting *in many different contexts* to get around the absence of a cache: sorting objects by grid cells for collisions, sorting game-script messages sent to actors,
.. I just talk about graphics a lot because it's been my interest.

I think you can apply the Epiphany architecture to any problem, eventually, with this approach. It's just that today most software leans on the cache to absorb random access - it effectively does that reordering for you - and all our source is written that way.

As he mentioned, separate memories scale much further. Future.. Moore's law is slowing.. but optical interconnect is a potential tech.. maybe it will become easier to cluster huge numbers of cores with optical interconnect between them. (I also wondered if they'd extend the Epiphany grid in 3D, or just use layers for stacked memories to make the 2D cores smaller..)

So you could pass code in a message, in the same way you command your video card to compile some C program into a pixel shader. But isn't that a big overhead? An external level of cache is a nice idea.

Basically the complexity appears somewhere.. whether it's a temporary buffer for sorting, or L1 caches with idle threads to hide latencies - the cost must be paid - though of course one approach might be better than the other.

GPGPU uses an L2 cache to communicate between threads, and for coherence when writing to frame buffers (so it's not writing individual pixels; rather it caches a small tile, draws nearby pixel fragments into it, etc..)

What else do you suggest is required for parallel programming? Why do you keep insisting on pragmas to steer which software will be responsible for compiling my source code?

[1] My own opinion is that higher-order functions & lambdas are the way to go, which C++ does quite well now. And the issue I see here is 'how to take that from SMP to the Epiphany'.
[2] Something to automatically generate pipelines, perhaps automatically splitting expensive functions off (e.g. a piece of game actor code doing raytraces.. these would be deferred, and the code either side split into 2 stages, the second taking the ray result).
Perhaps this can be done with compiler plugins, or by analyzing IR, ..

Without [2] we can do this semi-manually, e.g. write a function 'map_defer_map(data, first_stage, expensive_function, second_stage)'. That's why I'd go for [1] first.
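
Only the name 'map_defer_map' is given above, so here is a rough sketch of how such a function might look - my assumption, written serially just to show the data flow; the batched middle stage is the part you'd pipeline or farm out to other cores.

Code:

#include <cstddef>
#include <vector>

template<typename Data, typename FirstStage, typename Expensive, typename SecondStage>
void map_defer_map(std::vector<Data>& items,
                   FirstStage first_stage,        // lambda: Data& -> Request
                   Expensive  expensive_function, // lambda: Request -> Result (e.g. a raycast)
                   SecondStage second_stage)      // lambda: (Data&, Result) -> void
{
    // Stage 1: instead of calling the expensive function inline,
    // each item emits one deferred request.
    std::vector<decltype(first_stage(items[0]))> requests;
    requests.reserve(items.size());
    for (Data& d : items)
        requests.push_back(first_stage(d));

    // Batched middle stage: the part that could be sorted, pipelined,
    // or farmed out to other cores.
    std::vector<decltype(expensive_function(requests[0]))> results;
    results.reserve(requests.size());
    for (const auto& r : requests)
        results.push_back(expensive_function(r));

    // Stage 2: resume each item's code with its result.
    for (std::size_t i = 0; i < items.size(); ++i)
        second_stage(items[i], results[i]);
}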

I speculate Coarse x Fine x Pipelines would saturate 100s of cores just fine for the kind of complex CPU code in today's game engines (e.g. 8x8x4 = 256.. numbers in that ballpark), and of course with 100s-1000s of cores you'd do more elaborate physics and AI in different ways (more like GPGPU code).

That is of course open to question, others may have different ideas.

I see a GPU as a big hardware 'higher-order function' specialised for rendering, taking a vertex shader and a pixel shader like lambdas. Its success is down to the simplicity of slotting shaders in - hence many people could use it.

Why do you keep insisting on pragmas to steer which software will be responsible for compiling my source code?

OpenMP is similar but, as you say, limited: too much is built into the compiler, and it only gives you one pattern. Building templates for a traditional SMP threaded machine, you have a lot of control. (I think it's great that someone's done OpenMP support here.)

I take a lot of inspiration from what I see in the Haskell world & LISP, but I don't personally like Haskell enough to use it. (LISP is usually too dynamic, but it's the origin of the whole 'higher-order function' approach.. the term 'map-reduce' comes from their use of 'map' and 'reduce'..)

I see objects as non-hierarchical, therefore I don't understand your concept of coarse-grained and fine-grained loops. Of course I have an arm, and I have fingers on its end.

There's still locality to exploit here. E.g. on the PS3, you could fit the intermediate workspace for a character on one SPU. 'For each character..' - it's very natural to get a whole SPU working on each. (Epiphany cores are smaller, 32KB instead of 256KB, so perhaps '1 character' would actually be handled by 4-8 cores - overspill into each other's memory? Or arbitrating via the external buffer? Or one master pointing at data in subordinates?)

Calculations for one character don't affect the next, but there are many interactions *inside* one.
Then finer-grained parallelism is used within the local store (some of the work between the individual joints is independent).
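
A hedged illustration of that coarse/fine split (hypothetical names; std::thread just stands in for 'one little-core per character'):

Code:

#include <thread>
#include <vector>

struct Joint     { float pose[16]; };
struct Character { std::vector<Joint> joints; };

void update_joint(Joint& j) { (void)j; /* independent per-joint work */ }

void update_characters(std::vector<Character>& characters)
{
    std::vector<std::thread> workers;
    workers.reserve(characters.size());

    // Coarse grain: one worker per character - characters don't affect each other,
    // and one character's working set could live in a core's local store.
    for (Character& c : characters)
        workers.emplace_back([&c] {
            // Fine grain: independent per-joint work inside that local data.
            for (Joint& j : c.joints)
                update_joint(j);
        });

    for (std::thread& t : workers)
        t.join();
}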

Hierarchies usually mean pointer-chasing, which is bad. But what I'm referring to here is data locality, which is something good that you can exploit.

Trees are not so good now (too much pointer-chasing), but a 2-level split into 'major' and 'minor' is (tiles & subtiles, objects sorted into grid chunks.. whatever).

In physics engines you might encounter the phrases 'broad and narrow phase' and 'islands'.
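
For what it's worth, a hypothetical layout sketch of such a 2-level 'major/minor' structure (my own illustration): a flat grid of cells, each owning a contiguous run of object indices, rather than a pointer-chasing tree.

Code:

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct GridBuckets {
    std::size_t width = 0, height = 0;
    std::vector<std::size_t>   cell_begin; // width*height + 1 offsets into object_ids ('major')
    std::vector<std::uint32_t> object_ids; // object indices, sorted by cell ('minor')

    // Each cell's objects are one contiguous span, not a linked structure;
    // the narrow phase only tests objects within a cell (and its neighbours).
    std::pair<const std::uint32_t*, std::size_t> cell(std::size_t x, std::size_t y) const {
        std::size_t i = y * width + x;
        return { object_ids.data() + cell_begin[i], cell_begin[i + 1] - cell_begin[i] };
    }
};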

Of course the object contains the triangles, but wouldn't it be better to treat a container of triangles as something distinct from the actual object?

So as you correctly observe, these days on PCs/phones the actual vertices & triangles stay with the GPU; CPU data structures just own and reference them.

Code:

types of processing on horizontal axis.

high versatility,complex interactions   <-------------------> high throughput, specialised, simple interactions
serial                                     <--------------------------------->                         easier parallelism

control code     AI       collision/physics      animation                 vertices       texels/pixels
(not exact, these functions spread along this axis, roughly in this order)

Distant past PS1..
[Mips CPU  ]                                ['GTE' vector coprocessor, works like fpu]                         [tinyGPU]

PS2
[Mips CPU]                              [  VU0 DSP        ][VU1 DSP               ]   [         simple GPU   ]

past - PS3:-
[PowerPC ]        [------------- Cell SPU ----------------------]                [---------- GPU------------]

present day PC, Phones, APU, PS4:-

[-------------------CPU  SIMD--------------][  ---------------------------------GPGPU  GPU-----------------]

manycore future??   <<< This is what I want :)
[legacyCPU][ ------------------------...epiphany cores....???-----------------------------------][simple GPU]

Statistics: Posted by dobkeratops — Fri Aug 14, 2015 9:10 pm
