The Mali-G710: Doubling up per-core performance

As a continuation of the Valhall GPU architecture, the cornerstone characteristics of the new G710’s execution engines are roughly the same as what we covered in the past-generation Mali-G77 and Mali-G78.

Amongst the larger changes we saw with Valhall was the shift from a wavefront/warp size of 8 to 16, with dual datapaths (clusters) per execution engine, resulting in the 32 FMA/core design that we saw in the G77 and G78.

The ISA is said to have seen larger improvements and was designed with modern APIs such as Vulkan in mind, although it’s always quite hard to quantify the impact such changes have on the overall performance and efficiency of a GPU.

What’s new in the Mali-G710 is the addition of a second execution engine, effectively doubling the compute performance per shader core of the Valhall architecture. In a sense, Arm here is re-adopting a scaling approach we had seen in past-generation Mali architectures; the Mali-G76, for example, had three execution engines per shader core.
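The per-core arithmetic described above can be sketched in a few lines. This is only an illustrative calculation: the function names are made up for this example, and the 10-core, 850 MHz configuration at the end is a hypothetical figure chosen for demonstration, not something Arm or a vendor has disclosed.

```python
def fma_per_core(lanes_per_cluster: int, clusters_per_engine: int,
                 engines_per_core: int) -> int:
    """FMA operations one shader core can issue per clock."""
    return lanes_per_cluster * clusters_per_engine * engines_per_core


def peak_gflops(fma_per_clock: int, cores: int, clock_ghz: float) -> float:
    """Peak GFLOPS, counting each FMA as two floating-point operations."""
    return fma_per_clock * 2 * cores * clock_ghz


# G77/G78: 16-wide warps, two clusters per engine, one engine per core.
g77_fma = fma_per_core(16, 2, 1)
# G710: same engine layout, but two execution engines per core.
g710_fma = fma_per_core(16, 2, 2)
print(g77_fma, g710_fma)

# Hypothetical configuration (assumed core count and clock, for illustration only).
print(peak_gflops(g710_fma, cores=10, clock_ghz=0.85))
```

The point of the sketch is that the doubling comes purely from the engine count; the warp width and cluster count per engine are unchanged from the previous generation.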

In the above slide, the “8x” and “4x” metrics refer to throughput per cycle per core, and they show that other functional blocks of the GPU have also doubled in throughput to keep up with the doubled compute throughput of the execution engines.

The new G710 includes a brand-new texture unit that is now able to handle up to 8 bilinear texels per clock, and Arm has generally optimised the new design to be significantly more area efficient, giving the new TMU a +50% performance density advantage.

Within the execution engine, Arm continues to employ two processing units, or clusters of processing elements, and in that regard we don’t see much difference between the generations. However, if we look deeper into the actual processing unit, there are changes to the blocks:

In the simplest terms, what we’re seeing is a shift from a single instance of 16-wide (warp-wide) processing elements and execution units to four instances of 4-wide execution units. The throughput between the designs doesn’t change, but the new microarchitecture gives more dedicated resources to the processing elements and allows for better structuring and thus better efficiency.
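A toy model can make the "same throughput, different organisation" point concrete. The sketch below assumes that a 16-thread warp is simply divided into contiguous groups of four, one per 4-wide unit; Arm has not described the actual thread-to-unit mapping, so that split and the helper names are assumptions for illustration.

```python
def split_warp(thread_ids: list[int], unit_width: int) -> list[list[int]]:
    """Partition a warp's threads into groups sized for one execution unit."""
    return [thread_ids[i:i + unit_width]
            for i in range(0, len(thread_ids), unit_width)]


def lanes_per_cycle(unit_widths: list[int]) -> int:
    """Total lanes a processing unit can execute per cycle."""
    return sum(unit_widths)


warp = list(range(16))            # one 16-thread warp

old_layout = split_warp(warp, 16)  # previous design: one 16-wide unit
new_layout = split_warp(warp, 4)   # G710: four 4-wide units (assumed mapping)

print(len(old_layout), lanes_per_cycle([len(g) for g in old_layout]))
print(len(new_layout), lanes_per_cycle([len(g) for g in new_layout]))
```

Both layouts cover all 16 threads per cycle; what differs is how many independent units share the work, which is where the article's "more dedicated resources" comes in.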

Overall, the new execution engine design doubles the FMAs per clock per core, which is somewhat obvious, but it also has the benefit of lowering the energy consumed by the execution engine within the shader core by 20%.

A further major highlight of the G710 is the replacement of the traditional “Job Manager” with the new “Command Stream Frontend” (CSF), which handles the scheduling and processing of draw calls. The CSF introduces a new CPU of undisclosed nature, and for the first time also brings a firmware layer to Mali GPUs.

The goal of the design is to achieve more flexible and scalable performance for more complex graphical workloads, while at the same time improving system CPU power efficiency by reducing driver overhead through a very lightweight submission path. It simplifies support for API features such as state inheritance and secondary buffers, and helps with timing-sensitive applications such as VR or time-warp. Synchronisation events also greatly benefit from the move closer to the hardware and the reduction in latency that this enables.

The firmware is closely coupled to the hardware. It handles requests from the host and command buffer completion notifications, reduces the overhead of operations such as protected entry and exit, and even allows for emulation of API features that don’t yet exist in the hardware through additional instructions.

The new hardware has been redesigned from the ground up to keep up with modern content and to sustain the job submission throughput needed to feed the other GPU units. Arm here claims that the new CSF allows for up to 5 million draw calls per second.
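To put the 5 million draw calls per second claim into per-frame terms, here is a small back-of-the-envelope calculation. The frame rates are example values picked for illustration, and the function names are invented for this sketch; the only figure taken from Arm is the 5 million per second ceiling.

```python
CLAIMED_DRAW_CALLS_PER_SECOND = 5_000_000  # Arm's stated upper bound for the CSF


def draw_calls_per_frame(calls_per_second: int, fps: int) -> int:
    """Whole draw calls that fit into one frame at the given frame rate."""
    return calls_per_second // fps


def ns_per_draw_call(calls_per_second: int) -> float:
    """Average time budget per draw call, in nanoseconds."""
    return 1e9 / calls_per_second


for fps in (30, 60, 120):  # example frame rates, not from the article
    print(fps, draw_calls_per_frame(CLAIMED_DRAW_CALLS_PER_SECOND, fps))

print(ns_per_draw_call(CLAIMED_DRAW_CALLS_PER_SECOND))
```

This treats the claim as a sustained peak rate; real workloads would likely land well below it once shader and memory costs are included.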

Overall, the new G710 microarchitecture seems very interesting, and in particular seems to want to address some of the API-overhead-related weaknesses of Arm’s Mali GPUs. How this plays out remains to be seen, but with advertised performance and power efficiency gains of 20% this generation, it seems like a solid improvement, although these figures wouldn’t be quite sufficient to alter the competitive landscape in the mobile market.

The Mali-G610 is the same microarchitecture as the G710, only under a different name for configurations with fewer than 7 cores.

29 Comments

  • mode_13h - Saturday, May 29, 2021

    Good points, but Imagination demo'd ray tracing many years ago. It'd be interesting to know if they had some unique tech, or if the implementation was just too limited for modern games.
  • t.s - Tuesday, May 25, 2021

    "Conclusion & 1st Impressions" Last paragraph: "If those DTV markets hare numbers are accurate" -> 'hare' minus s.
  • James5mith - Tuesday, May 25, 2021

    I think it was supposed to be market share, not markets hare.
  • eastcoast_pete - Tuesday, May 25, 2021

    Thanks Andrei. Question about Imagination: did they have any design wins recently? It's been awfully quiet.
  • ToTTenTranz - Tuesday, May 25, 2021

    Did Nvidia give up on purchasing ARM?

    If not, there's a bit of a white elephant in the room.
  • eastcoast_pete - Tuesday, May 25, 2021

    I believe that NVIDIA wouldn't want to have its precious CUDA cores "slumming" it in $ 100 smartphones, so they probably hang on to Mali just for this.
  • brucethemoose - Tuesday, May 25, 2021

    If it make them money, why not?

    But there is the question of how low it can go. Would something much, much smaller than a Switch Tegra be competitive with low end Mali?
  • mode_13h - Wednesday, May 26, 2021

    You heard Mediatek is licensing Nvidia graphics IP? Granted, probably not in $100 phones, but maybe $250? I guess we'll see.

    BTW, you know CUDA cores are in their $100 Jetson boards and Nintendo Switch, right?
  • Wereweeb - Wednesday, May 26, 2021

    That's not how this kind of business works. No one knows what the future holds.
  • Dahak - Wednesday, May 26, 2021

    Its still going through the regulatory checks, it will be at least a year before its finalized if its approved and even more it will be a few years on top of that before we see any nvidia IP offered / licensed like ARM's if Nvidia decides to do that.

    Just because they are acquiring ARM does not mean magically Nvidia's GPU IP will replace the current if at all
