Reconfigurable Superscalar RISC-V Processor

In this project, we are developing a new RISC-V core with a superscalar out-of-order architecture. The processor core is designed in VHDL, which is the hardware description language of choice for many European design projects. VHDL is also widely used in computer architecture and digital design lectures at universities, which provides a good basis for continuous development of the core and for community support. Key features of the architecture include a wide range of configuration options at design time and support for hardware reconfiguration at runtime to further improve the resource efficiency of the core. Furthermore, the processor supports user-defined instructions, with a current focus on accelerating machine learning algorithms.

Architecture of the Core

The general structure of the core follows the common architectural approach for superscalar processors. Internal instruction processing starts with a fetch stage that fetches a configurable number of instructions per clock cycle from the instruction cache. The address of the next packet of instructions to be fetched is predicted by the branch prediction unit, enabling speculative execution. The instructions are then decoded, and an entry in the reorder buffer is reserved for each instruction. The subsequent register renaming resolves the dependencies between the instructions and enables out-of-order execution. The instructions are then dispatched to the reservation stations of the execution units, where they wait for the operands required for execution, which are provided via the Common Data Bus. The addresses for load and store operations are calculated in the memory execution unit and then transferred to the load-store unit, which determines the order in which loads and stores are issued to the data cache.
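
The effect of the renaming step described above can be illustrated with a small software model. The following Python sketch is not taken from the core's VHDL sources; it assumes a Tomasulo-style scheme in which each destination register is renamed to the tag of its reorder buffer entry, and the names RenameTable and rename_packet are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RenameTable:
    """Toy register alias table (assumption: registers are renamed to ROB tags)."""
    alias: dict = field(default_factory=dict)  # arch. register -> tag of latest producer
    next_tag: int = 0

    def rename(self, dest, srcs):
        # Sources read the current mapping: a ROB tag if the producer is still
        # in flight, otherwise the architectural register name itself.
        renamed_srcs = [self.alias.get(s, s) for s in srcs]
        tag = f"rob{self.next_tag}"  # one ROB entry is reserved per instruction
        self.next_tag += 1
        if dest != "x0":  # x0 is hard-wired to zero in RISC-V and is never renamed
            self.alias[dest] = tag
        return tag, renamed_srcs


def rename_packet(table, packet):
    """Rename one fetch packet of (dest, [srcs]) tuples in program order."""
    return [table.rename(dest, srcs) for dest, srcs in packet]
```

For a packet such as [("x1", ["x2", "x3"]), ("x4", ["x1", "x5"]), ("x1", ["x6", "x7"])], the second instruction receives the tag of the first as its source (a true dependency is preserved), while the third writes x1 under a fresh tag, so it no longer conflicts with the first and can execute out of order.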

Block diagram of the developed RISC-V processor

Configuration at Design Time

Many aspects of the core are configurable at design time. These include the architecture bit-width, namely 32 or 64 bit, and the number of instructions processed in parallel. The latter directly determines the number of instructions fetched per clock cycle (issue width) at the start of the data path and carries through to decoding and renaming. The size of buffers, such as the reorder buffer and the register file, can be configured according to the requirements resulting, e.g., from the selected issue width of the core, while the number of interfaces of the reorder buffer and the register files is automatically adapted to the selected issue width. A further design-time parameter is the maximum number of execution units available at power-up; these units can be reconfigured at runtime.
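
The relationship between the freely chosen parameters and the automatically derived interface counts can be sketched as follows. This Python model is only an illustration: the class CoreConfig, its field names, the default values, and the port formulas (two source operands and one result per instruction) are assumptions and do not reflect the actual VHDL generics of the core.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreConfig:
    """Hypothetical design-time parameter set; all names and defaults are illustrative."""
    xlen: int = 32            # architecture bit-width: 32 or 64
    issue_width: int = 2      # instructions fetched, decoded, and renamed per clock cycle
    rob_entries: int = 32     # reorder buffer size (assumed default)
    max_exec_units: int = 4   # upper bound on execution units (assumed default)

    def __post_init__(self):
        if self.xlen not in (32, 64):
            raise ValueError("xlen must be 32 or 64")
        if self.issue_width < 1:
            raise ValueError("issue_width must be at least 1")
        if self.rob_entries < self.issue_width:
            raise ValueError("reorder buffer must hold at least one fetch packet")

    # Derived values: adapted automatically to the issue width, not set by the user.
    @property
    def regfile_read_ports(self):
        return 2 * self.issue_width  # assumption: up to two source operands per instruction

    @property
    def regfile_write_ports(self):
        return self.issue_width      # assumption: one result per instruction

    @property
    def rob_alloc_ports(self):
        return self.issue_width      # one ROB entry is reserved per decoded instruction
```

With these assumptions, CoreConfig(xlen=64, issue_width=4) yields eight register file read ports and four write ports, whereas an unsupported bit-width such as CoreConfig(xlen=16) is rejected when the configuration is created.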

Reconfiguration at Runtime

Partial dynamic reconfiguration of the core can be used to further increase the resource efficiency of an FPGA-based implementation. Here, dynamic reconfiguration of the core mainly focuses on the execution units. The superscalar architecture enables an exchange of the execution units with limited architectural changes. Hence, the core can adapt itself to the requirements of the currently executed applications. For example, programs with many floating-point operations would benefit from additional floating-point units, while other programs may profit more from an increased number of multiply or divide units. By means of dynamic reconfiguration, it is possible to free hardware resources of execution units that are currently not utilized. These resources can then be used for more important execution units or for application-specific accelerators.
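
How an instruction mix could translate into a choice of execution units is sketched below. The policy shown is a simple greedy heuristic chosen for illustration only; the reconfiguration strategy actually used by the core is not described here, and the function name plan_execution_units, the unit type names, and the notion of a fixed number of reconfigurable slots are assumptions.

```python
def plan_execution_units(demand, slots, required=("alu",)):
    """Assign `slots` reconfigurable regions to execution unit types.

    demand: dict mapping a unit type to the number of instructions of that
    type observed in a recent window. Every type in `required` gets one slot
    first; each remaining slot goes to the type that would still carry the
    highest load per unit after receiving it.
    """
    if slots < len(required):
        raise ValueError("not enough slots for the required unit types")
    plan = {unit: 1 for unit in required}
    if not demand:
        return plan  # nothing observed: leave the remaining slots free
    for _ in range(slots - len(required)):
        # Load per unit if one more unit of this type were added; sorting the
        # candidates makes the choice deterministic when two types tie.
        best = max(sorted(demand), key=lambda u: demand[u] / (plan.get(u, 0) + 1))
        plan[best] = plan.get(best, 0) + 1
    return plan
```

For a floating-point-heavy window such as {"alu": 500, "fpu": 900, "mul": 50, "div": 10} and four slots, this heuristic keeps one ALU and assigns the other three slots to floating-point units. For an integer workload with {"alu": 600, "mul": 450, "div": 100}, it selects two ALUs and two multipliers instead.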

System-on-Chip (SoC) Integration

For system integration, we use the LiteX SoC generator, which enables the design of complete soft SoCs. LiteX provides a large set of IP cores for the peripheral components of the processing system, such as DDR memory controllers, as well as various general-purpose and high-bandwidth interfaces. The integration with LiteX makes it easier to deploy the core and its peripherals on different FPGA platforms and in different use-case scenarios. Furthermore, our deep learning accelerators based on STANN can be integrated into the soft SoC.