What is needed to get to Exascale Computing?

10^18 Floating Point Operations per Second

As of 2014, the most powerful supercomputer in the world (Tianhe-2) clocks in at 33 petaflops (1 petaflop = 10^15 FLOPs), or just over 3% of the requirement for exascale, while using 17 megawatts of electricity, which comes out to just under 2 GFLOPs/Watt.

10x Power Efficiency

The most power-efficient petascale supercomputer (Piz Daint) is a 7 petaflop system using 2.3MW of electricity, which puts it at 3 GFLOPs/Watt. The most efficient single processors alone reach only 4 to 5 GFLOPs/Watt. The US Department of Energy's power budget for an exascale system is targeted at no more than 20MW. Scaling a modern system to exascale would require over 250MW and would cost over $300 million annually to operate.
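The efficiency figures above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python that uses only the numbers quoted in the text. The function names are illustrative, and the 4 GFLOPs/Watt input is the low end of the single-processor range quoted above.

```python
EXAFLOP = 1e18  # floating point operations per second


def gflops_per_watt(flops: float, watts: float) -> float:
    """Performance divided by power draw, in GFLOPs/Watt."""
    return flops / watts / 1e9


def exascale_power_mw(efficiency_gflops_per_watt: float) -> float:
    """Power in MW needed to reach one exaflop at a given efficiency."""
    return EXAFLOP / (efficiency_gflops_per_watt * 1e9) / 1e6


# Figures quoted in the text (performance in FLOPs, power in watts).
tianhe2 = gflops_per_watt(33e15, 17e6)       # just under 2 GFLOPs/Watt
piz_daint = gflops_per_watt(7e15, 2.3e6)     # about 3 GFLOPs/Watt
share_of_exascale = 33e15 / EXAFLOP          # just over 3%

# Power needed at 4 GFLOPs/Watt, and the efficiency a 20MW budget implies.
power_at_4 = exascale_power_mw(4.0)
required = gflops_per_watt(EXAFLOP, 20e6)

print(f"Tianhe-2:  {tianhe2:.2f} GFLOPs/Watt")
print(f"Piz Daint: {piz_daint:.2f} GFLOPs/Watt")
print(f"Exascale at 4 GFLOPs/Watt: {power_at_4:.0f} MW")
print(f"Efficiency needed for 20MW: {required:.0f} GFLOPs/Watt")
```

At 4 GFLOPs/Watt an exaflop needs 250MW, and a 20MW budget implies 50 GFLOPs/Watt, roughly ten times the best single-processor figure quoted above.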

10x Cost Reduction

Current estimates for building an exascale system using today's components come to well over $1 billion. When electricity and operating costs over the 5 to 6 year lifetime of a system are included, more than twice the cost of the initial system would be spent on just operating it.

A hyper-efficient core targeting floating-point workloads.

The Neo compute core is optimized for high-efficiency floating-point operations. Each core includes:

  • lightweight independent execution logic
  • an arithmetic logic unit
  • an IEEE 754-2008 compliant FPU (capable of fused multiply-add, with hardware support for single and double precision operations)
  • 64 general purpose registers
  • local scratchpad memory
  • a Network on Chip (NoC) interface.

Each core is a full superscalar RISC core, capable of executing two instructions per cycle. The FPU is a pipelined fused multiply-add unit, capable of either one IEEE 754-2008 double precision (binary64) operation per cycle or two single precision (binary32) operations per cycle.

Single core performance: 1 Double Precision GFLOPs, 2 Single Precision GFLOPs

The Neo chip: Extreme parallelism at ultra low-power.

The Neo processor incorporates 256 cores arranged in a 2D mesh network. The cores communicate via a low-latency NoC subsystem. The RISC-inspired load/store architecture enables memory operations to occur:

  • 1) locally, using the core's own designated scratchpad memory;
  • 2) using scratchpad memory on another on- or off-chip core via message passing or DMA;
  • 3) with external memory via an off-chip memory management unit.

Local scratchpad memories are physically addressed as part of a flat global address space.
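To illustrate what a flat global address space over per-core scratchpads could look like, here is a minimal sketch in Python. The address layout, the scratchpad size, and every name in it (Location, encode, decode, classify) are hypothetical and chosen only for illustration. The source does not specify Neo's actual address map; only the 256 cores per chip comes from the text.

```python
from typing import NamedTuple

CORES_PER_CHIP = 256           # from the text
SCRATCHPAD_BYTES = 64 * 1024   # hypothetical size, not stated in the source


class Location(NamedTuple):
    chip: int    # index of the compute chip
    core: int    # index of the core within its chip
    offset: int  # byte offset within that core's scratchpad


def encode(loc: Location) -> int:
    """Map (chip, core, offset) to one flat address (hypothetical layout)."""
    if not 0 <= loc.core < CORES_PER_CHIP:
        raise ValueError("core index out of range")
    if not 0 <= loc.offset < SCRATCHPAD_BYTES:
        raise ValueError("offset outside scratchpad")
    return (loc.chip * CORES_PER_CHIP + loc.core) * SCRATCHPAD_BYTES + loc.offset


def decode(address: int) -> Location:
    """Inverse of encode: recover (chip, core, offset) from a flat address."""
    scratchpad_index, offset = divmod(address, SCRATCHPAD_BYTES)
    chip, core = divmod(scratchpad_index, CORES_PER_CHIP)
    return Location(chip, core, offset)


def classify(me: Location, address: int) -> str:
    """Say which kind of scratchpad access an address implies for core `me`."""
    target = decode(address)
    if (target.chip, target.core) == (me.chip, me.core):
        return "local"
    if target.chip == me.chip:
        return "remote on-chip"
    return "remote off-chip"


me = Location(chip=0, core=5, offset=0)
print(classify(me, encode(Location(0, 5, 128))))   # local
print(classify(me, encode(Location(0, 77, 0))))    # remote on-chip
print(classify(me, encode(Location(3, 5, 0))))     # remote off-chip
```

The point of the sketch is that with physical, flat addressing a core can tell from the address alone whether a load or store stays in its own scratchpad or has to travel over the NoC.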

Unlike GPUs and other SIMD accelerators, Neo's MIMD processor design leverages independent program counters and instruction registers in each core to allow different operations to be performed in parallel on separate pieces of data.

With 256 cores, estimated power consumption per compute chip is 3 Watts.

Estimated peak performance (Double Precision): 256 GFLOPs
Estimated peak performance (Single Precision): 512 GFLOPs
Estimated peak performance efficiency (Double Precision): 85 GFLOPs/Watt
Estimated peak performance efficiency (Single Precision): 170 GFLOPs/Watt
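The chip-level estimates follow directly from the per-core figures. A minimal sketch in Python, using only numbers quoted in the text (the variable names are illustrative):

```python
CORES_PER_CHIP = 256
CHIP_WATTS = 3.0          # estimated power per compute chip
CORE_GFLOPS_DP = 1.0      # single core, double precision
CORE_GFLOPS_SP = 2.0      # single core, single precision

# Peak chip performance is the per-core peak times the core count.
chip_dp = CORES_PER_CHIP * CORE_GFLOPS_DP
chip_sp = CORES_PER_CHIP * CORE_GFLOPS_SP

# Peak efficiency is peak performance divided by estimated power.
dp_per_watt = chip_dp / CHIP_WATTS
sp_per_watt = chip_sp / CHIP_WATTS

print(f"DP: {chip_dp:.0f} GFLOPs, {dp_per_watt:.1f} GFLOPs/Watt")
print(f"SP: {chip_sp:.0f} GFLOPs, {sp_per_watt:.1f} GFLOPs/Watt")
```

The unrounded results are roughly 85.3 and 170.7 GFLOPs/Watt, which the estimates above quote as 85 and 170.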

The Neo interconnect: Fast, efficient, and intelligent.

A Neo node includes 16+ compute chips arranged in a two-dimensional grid. Every chip can independently communicate with any other chip in the grid via an extremely high-bandwidth parallel interconnect, allowing for direct connections between adjacent peers. Each compute chip on the external edge of the grid has at least one parallel connection to the Grid and Memory Management Unit.
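The grid layout described above can be modeled in a few lines. This is a minimal sketch in Python that assumes the smallest configuration, 16 chips laid out as a 4x4 grid. The 4x4 shape, the function names, and the use of Manhattan distance as a hop count are all assumptions for illustration; the source does not describe the routing scheme.

```python
ROWS, COLS = 4, 4  # assumed layout for a 16-chip node


def neighbors(row: int, col: int) -> list[tuple[int, int]]:
    """Adjacent peers that a chip connects to directly."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < ROWS and 0 <= c < COLS]


def is_edge(row: int, col: int) -> bool:
    """Edge chips are the ones with a link to the Grid and Memory Management Unit."""
    return row in (0, ROWS - 1) or col in (0, COLS - 1)


def hops(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Chip-to-chip hop count, assuming traffic moves one grid step at a time."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


edge_chips = [(r, c) for r in range(ROWS) for c in range(COLS) if is_edge(r, c)]

print(len(edge_chips), "of", ROWS * COLS, "chips sit on the edge")
print("corner chip peers:", neighbors(0, 0))
print("corner-to-corner hops:", hops((0, 0), (3, 3)))
```

In this assumed 4x4 layout, 12 of the 16 chips sit on the edge and therefore have a direct path to the Grid and Memory Management Unit, while the 4 interior chips reach it through a neighbor.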

The node's Grid and Memory Management Unit (GaMMU) is a distinct chip designed to intelligently manage node-level resources and optionally direct data and control flow across this grid. The GaMMU integrates:

  • Externally accessible Linux host
  • Four DDR4 memory controllers (for up to 256GB) with ECC
  • PCIe controller (for I/O expandability)
  • High-speed connections to the grid of compute chips

The GaMMU provides a common interface for the compute grid to access node-level DRAM and handles external I/O requests. The included Linux host serves as a familiar interface for user administration and debugging. It also allows for the optional execution of supplemental application logic that can accelerate the grid's access to memory and I/O resources. This logic may be specified by application developers explicitly or may be generated by REX development tools.

The GaMMU's external I/O capabilities support low-latency connections between nodes, as well as with external storage devices and other network-attached peripherals.

Neo: The world's most power-efficient HPC platform

By removing legacy components unnecessary to HPC, and by developing a fresh compute architecture focused on power optimization, Neo is anticipated to become the most power-efficient HPC platform available upon its release. A single Neo compute rack based on the (non-density optimized) reference design, incorporating 90 Neo nodes, is expected to deliver 360 DP TFLOPS at only 7.2kW, equivalent to 50 GFLOPS/W. This will make Neo at least 10x more power efficient than today’s most efficient HPC installation (GSIC Center at the Tokyo Institute of Technology).

Moreover, an installation of only ten Neo compute racks would rank among the Top 10 highest performance HPC installations, according to data from top500.org (June 2014).
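The rack-level numbers can be broken down per node and cross-checked against the chip estimates. The following minimal sketch in Python uses only figures quoted in the text, with 16 chips per node taken as the minimum of the "16+" configuration. The variable names are illustrative.

```python
RACK_TFLOPS_DP = 360.0    # expected double precision performance per rack
RACK_KW = 7.2             # expected power per rack
NODES_PER_RACK = 90
CHIPS_PER_NODE = 16       # minimum of the "16+" quoted in the text
CHIP_GFLOPS_DP = 256.0    # estimated chip peak
CHIP_WATTS = 3.0          # estimated chip power

# Rack efficiency in GFLOPs/Watt.
rack_gflops_per_watt = (RACK_TFLOPS_DP * 1e3) / (RACK_KW * 1e3)

# Per-node share of rack performance and power.
node_tflops = RACK_TFLOPS_DP / NODES_PER_RACK
node_watts = RACK_KW * 1e3 / NODES_PER_RACK

# The same node built up from the chip-level estimates.
node_tflops_from_chips = CHIPS_PER_NODE * CHIP_GFLOPS_DP / 1e3
chip_watts_per_node = CHIPS_PER_NODE * CHIP_WATTS

# Ten racks, as in the comparison above.
ten_racks_pflops = 10 * RACK_TFLOPS_DP / 1e3

print(f"Rack efficiency: {rack_gflops_per_watt:.0f} GFLOPs/Watt")
print(f"Per node: {node_tflops:.1f} TFLOPS at {node_watts:.0f} W")
print(f"From chips: {node_tflops_from_chips:.2f} TFLOPS, {chip_watts_per_node:.0f} W for compute chips")
print(f"Ten racks: {ten_racks_pflops:.1f} PFLOPS")
```

Each node works out to 4 TFLOPS at 80 W, which lines up with 16 chips at 256 GFLOPs each. The compute chips account for 48 W of that under the 3 Watt estimate; the text does not break down the remainder.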

The Neo software environment: Familiar, flexible, and fast.

The Neo porting environment is designed to simplify and accelerate the process of migrating applications from other HPC platforms to Neo. The porting environment includes a set of commonly used libraries, like BLAS, that are optimized and pre-compiled for the Neo environment.

The full development environment includes a complete GCC and LLVM compiler toolchain, as well as standard C libraries optimized for Neo. It also provides a hardware abstraction layer API allowing developers to orchestrate and optimize data and control flow across the compute grid. Alternatively, developers can leverage supported programming models - such as the Actor model, CSP, and PGAS - which abstract and automate the deployment and execution of concurrent or parallel processes.
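For readers unfamiliar with the programming models named above, here is a minimal sketch of the CSP style, in which independent processes share no state and communicate only through channels. It is written in plain Python with the standard library and is not Neo's API or toolchain. The function names are illustrative, and bounded queues stand in for channels as an approximation.

```python
import queue
import threading

DONE = None  # sentinel marking the end of a stream


def producer(out: queue.Queue, values: list[float]) -> None:
    """Send each value down the channel, then signal completion."""
    for value in values:
        out.put(value)
    out.put(DONE)


def scaler(inp: queue.Queue, out: queue.Queue, factor: float) -> None:
    """Receive values, multiply each by a factor, and pass them on."""
    while (value := inp.get()) is not DONE:
        out.put(value * factor)
    out.put(DONE)


def run_pipeline(values: list[float], factor: float) -> list[float]:
    """Wire two processes together with channels and collect the output."""
    first, second = queue.Queue(maxsize=1), queue.Queue(maxsize=1)
    workers = [
        threading.Thread(target=producer, args=(first, values)),
        threading.Thread(target=scaler, args=(first, second, factor)),
    ]
    for worker in workers:
        worker.start()
    results = []
    while (value := second.get()) is not DONE:
        results.append(value)
    for worker in workers:
        worker.join()
    return results


print(run_pipeline([1.0, 2.0, 3.0], 2.0))  # [2.0, 4.0, 6.0]
```

Each stage runs its own control flow and touches only the data it receives, which is the property that lets this style map onto many independent cores.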

Neo's 'bare metal' runtime environment does not require a grid operating system, and will optionally support OpenMP, OpenCL and OpenACC.

Have specific questions?

Contact Us

Send an email to:

info@rexcomputing.com