In addition to the work on programming models, we research methods to transparently translate services or parts of services that are only available as binary code for CPU into implementations for reconfigurable hardware (FPGAs). This translation is to take place at service runtime and transparently, that is without manual intervention of a service developer. After this translation process, the generated implementations are made available to the heterogeneous scheduler as alternative, functionally equivalent implementations. Just like implementations obtained with developer interaction through the programming models of the other project area, they can be used to adapt and optimize the heterogeneous system behavior.
One particular challenge for on-the-fly hardware acceleration are the high tool runtimes for the FPGA synthesis process, which can inhibit timely acceleration and conflict with the actual workload of a compute node. We approach this challenge with domain-specific overlays, which can be reconfigured at runtime without repeating the low-level FPGA synthesis process. We are working on methods to select from a set of available overlays a suitable one, to refine an existing overlay architecture by specialization or generalization with re-synthesis for later use, and to generate problem-specific configurations or code for the selected overlay.
A central aspect of this research area are the architectures of such overlays with regard to the specific application domains they are targeting. Overlays can put varying emphasis on exploitation of data parallelism, pipelining or latency hiding through simultaneous multithreading, leading to different architectural approaches. Accordingly, their suitability for different target applications differs a lot. We want to investigate, how a set of overlay architectures can effectively cover a broad range of applications and put a special emphasis on the stress field between specialization of the architecture as prerequisite for high efficiency and the generality of the architecture that is required for good reusability.
The second important aspect in this area are overlay-specific translation methods. On the one hand, each different overlay architecture requires different approaches, but on the other hand, the architectural features of the overlay can guide the detection and translation of suitable application patterns. For this investigation, we consider three overlay architectures. Application-specific reconfigurable vector processors use data parallelism through vector instructions. Customization of instruction formats for specific applications can offer additional efficiency gains. As second target, we consider domain-specific architectures, for example an overlay for finite-difference applications with stencil computations. As third and most ambitious translation target, we consider configurable dataflow architectures. This requires transformation of binary code to a pipelined dataflow representation, for which no established techniques exist yet.