Avantek's Arm Workstation: Ampere eMAG 8180 32-core Arm64 Review
by Andrei Frumusanu on May 22, 2020 8:00 AM EST
The eMAG 8180: AppliedMicro's Legacy Skylark Core
While you’re reading this in 2020, and the eMAG Workstation was released in 2019, the CPU powering the system is actually quite ancient, tracing its roots back to AppliedMicro, defunct since 2017. Originally meant to be called the X-Gene3, the chip had been planned for the second half of 2017, but AppliedMicro went through several changes of ownership before the IP and designs ended up with Ampere Computing.
In that sense, the eMAG 8180 is more of a legacy design and quite distantly related to Ampere’s newer Altra system processors.
The Skylark cores in the eMAG 8180 are a custom core design with the X-Gene processor pedigree. It’s a 4-wide out-of-order (OOO) processor that’s relatively narrow by today’s standards, characterised by quite high operating frequencies of 3-3.3GHz and quite an unusual cache hierarchy, in which pairs of cores share a single 256KB L2 cache.
On a chip level, the CPU is characterised by a large coherent network tying together all the CPU modules, the memory controllers, and a large 32MB L3 cache.
What’s surprising here is that the core-to-core latency across the whole chip isn’t bad at all, ranging from 68-73ns. While this certainly doesn’t keep up with more recent monolithic designs, this is an Armv8.0 core lacking CAS atomic operations, so these figures were measured via regular exclusive-load / exclusive-store sequences, which aren’t as fast. The coherency here going over the 32MB L3 cache certainly helps the system punch above its weight for a design of its time.
The CPU cores have 32KB L1 instruction and data caches, with access latencies of 5 cycles. The 256KB L2 cache has a 13-cycle access latency, while the 32MB L3 cache has massive 45ns+ access latencies that are much slower than those of any other comparable design out there.
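The L1 and L2 figures are quoted in cycles while the L3 figure is quoted in nanoseconds, so a quick conversion helps put them on the same scale. The following is a minimal sketch, assuming a fixed 3.0GHz clock (the low end of the 3-3.3GHz range quoted earlier); the function and constant names are illustrative only.

```python
# Sketch: convert between core cycles and nanoseconds at an assumed clock.
ASSUMED_CLOCK_GHZ = 3.0  # assumption: low end of the quoted 3-3.3GHz range


def cycles_to_ns(cycles, clock_ghz=ASSUMED_CLOCK_GHZ):
    """One cycle lasts 1 / clock_ghz nanoseconds."""
    return cycles / clock_ghz


def ns_to_cycles(ns, clock_ghz=ASSUMED_CLOCK_GHZ):
    """Inverse conversion: nanoseconds back to core cycles."""
    return ns * clock_ghz


if __name__ == "__main__":
    for name, cycles in (("L1", 5), ("L2", 13)):
        print(f"{name}: {cycles} cycles = {cycles_to_ns(cycles):.2f} ns")
    print(f"L3: 45 ns = {ns_to_cycles(45):.0f} cycles")
```

Under that assumed clock, 45ns corresponds to 135 cycles, roughly ten times the L2 latency.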
We note the core’s L1 TLB ends at 48 pages (192KB) and the L2 TLB at 1024 pages (4MB), beyond which TLB misses result in increasingly worse access latencies.
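The TLB reach figures follow directly from multiplying the number of entries by the page size. A small sketch of that arithmetic, assuming 4KB pages (which is what the 48 pages = 192KB figure implies); the helper name is hypothetical.

```python
# Sketch: TLB reach = number of entries * page size.
PAGE_SIZE_BYTES = 4 * 1024  # assumption: 4KB pages, implied by 48 pages = 192KB


def tlb_reach_bytes(entries, page_size=PAGE_SIZE_BYTES):
    """Amount of memory a TLB level can map without missing."""
    return entries * page_size


if __name__ == "__main__":
    print(tlb_reach_bytes(48) // 1024, "KB")          # L1 TLB reach
    print(tlb_reach_bytes(1024) // (1024 * 1024), "MB")  # L2 TLB reach
```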
In contrast with the quite large cache access latencies, the DRAM access latency isn’t all that bad at around 137ns full random at 128MB depth.
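A "full random" latency test is typically built as a chain of dependent loads visiting every element of a buffer in random order. The sketch below shows one common way to construct such a chain, using Sattolo's algorithm to produce a single cycle; this is an assumption about the general technique, not a description of the tool used for these measurements, and Python itself is far too slow to measure real memory latency.

```python
import random


def build_random_cycle(n, seed=0):
    """Return nxt where nxt[i] is the element visited after i.

    Sattolo's algorithm yields a permutation made of exactly one cycle,
    so a chase starting anywhere touches all n elements before repeating.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Follow the chain; each step depends on the result of the previous one."""
    idx = start
    for _ in range(steps):
        idx = nxt[idx]
    return idx


if __name__ == "__main__":
    chain = build_random_cycle(1024)
    print(chase(chain, len(chain)))  # back at the start after n steps
```

In a real test each element would be padded to a cache line, and the time per step over a buffer of the chosen depth (128MB here) gives the latency.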
Single-core bandwidth of the Skylark cores isn’t too pretty: load and store bandwidth into the L1 and L2 seem to be limited to 8B/cycle each, with a combined 16B/cycle for concurrent loads and stores. The dip between the L2 and L3 is usually a sign of a bandwidth bottleneck when evicting/replacing a cacheline, and the load bandwidth at the DRAM level is also quite disappointing.
Overall, the performance here is only half that of a more modern Arm core, but again, this is a 2015-2016 core design.
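To translate the per-cycle limits into absolute numbers, multiply the bytes per cycle by the clock frequency. A minimal sketch, assuming a 3.0GHz clock and decimal gigabytes; the names are illustrative.

```python
# Sketch: peak bandwidth in GB/s = bytes per cycle * clock in GHz.
ASSUMED_CLOCK_GHZ = 3.0  # assumption: low end of the quoted 3-3.3GHz range


def peak_bandwidth_gbs(bytes_per_cycle, clock_ghz=ASSUMED_CLOCK_GHZ):
    """Bytes/cycle times 1e9 cycles/s per GHz, expressed in 1e9 bytes/s."""
    return bytes_per_cycle * clock_ghz


if __name__ == "__main__":
    print("load or store only:", peak_bandwidth_gbs(8), "GB/s")
    print("concurrent load + store:", peak_bandwidth_gbs(16), "GB/s")
```

At the assumed clock this works out to 24GB/s for loads or stores alone and 48GB/s combined.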