In this work, we propose a design space exploration workflow and tool for generating reconfigurable deep learning hardware models for FPGAs. The workflow is divided into two main parts: Offline Design Exploration (ODE) and Online Design Reconfiguration (ODR). ODE is automated: the designer provides a Convolutional Neural Network (CNN) architecture and, optionally, additional design constraints on latency and space. From these inputs, the tool automatically generates multiple designs that trade off latency for resource utilization by dedicating varying numbers of processing elements through loop reordering, unrolling and pipelining. The second part of the process introduces online reconfiguration to the design space. Online reconfiguration is the ability to modify the design at runtime after upload by selectively running it partially or fully according to an application’s immediate requirements. This provides the design with flexibility which, we believe, is highly beneficial for future autonomous on-board applications. We validate our work on the Xilinx Zynq-7100 board at 200 MHz and use custom-trained network architectures on three image classification datasets: MNIST, SVHN and CIFAR-10. ODE-generated designs achieved latency trade-offs of 95x for MNIST, 71x for CIFAR-10 and 18x for SVHN. Trade-offs in resource utilization, measured in DSP slices, were 44x for MNIST, 52x for SVHN and 24x for CIFAR-10. For ODR, a 0.7% accuracy loss was traded for a 13x speedup and a 25% reduction in power on MNIST, a 2% accuracy loss for a 14x speedup and a 28% power reduction on SVHN, and a 4% accuracy loss for a 50x speedup and a 32.5% power reduction on CIFAR-10.
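The latency-versus-resource trade-off explored offline can be illustrated with a minimal sketch. It assumes a deliberately simplified cost model that is not taken from the paper: one DSP slice per parallel multiply-accumulate, one MAC per DSP per cycle, and no memory or pipeline overhead. The function names, the candidate unroll factors and the example layer shape are all hypothetical.

```python
from math import ceil

def conv_macs(out_h, out_w, out_c, in_c, k):
    """Total multiply-accumulates of one convolution layer."""
    return out_h * out_w * out_c * in_c * k * k

def explore(macs, unroll_factors, clock_hz=200e6, max_dsp=None, max_latency_s=None):
    """Enumerate candidate designs under an ASSUMED cost model:
    unroll factor u uses u DSP slices and needs ceil(macs / u) cycles."""
    designs = []
    for u in unroll_factors:
        cycles = ceil(macs / u)
        latency_s = cycles / clock_hz
        # Optional designer constraints (space and latency).
        if max_dsp is not None and u > max_dsp:
            continue
        if max_latency_s is not None and latency_s > max_latency_s:
            continue
        designs.append({"unroll": u, "dsp": u, "cycles": cycles, "latency_s": latency_s})
    return designs

def pareto_front(designs):
    """Keep designs that no other design beats on both DSP count and cycles."""
    front = []
    for d in designs:
        dominated = any(
            o["dsp"] <= d["dsp"] and o["cycles"] <= d["cycles"]
            and (o["dsp"] < d["dsp"] or o["cycles"] < d["cycles"])
            for o in designs
        )
        if not dominated:
            front.append(d)
    return front

# Hypothetical first layer: 24x24 output, 6 filters, 1 input channel, 5x5 kernel.
macs = conv_macs(24, 24, 6, 1, 5)
for design in pareto_front(explore(macs, [1, 2, 4, 8], max_dsp=4)):
    print(design)
```

Under this model every unroll factor lies on the Pareto front, since more parallelism always buys lower latency. A real tool would replace the cost model with synthesis estimates, where loop ordering and pipelining change both axes.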
In this work, we propose an adaptive and scalable hardware implementation of Convolutional Neural Networks (CNNs). The adaptive hardware model is the result of a design loop that starts with a software implementation relying on standard scanning-window and multiply-accumulate (MAC) operations. This design is developed into a deterministic, hardware-friendly model which introduces timing, fixed-point representation and a pixel streaming interface. Finally, HDL code is generated and an RTL description of the system is created. Each step is analyzed and validated against pre-set objectives using a golden reference from the previous step. The proposed system is capable of selectively executing different data-paths and their outputs. It allows real-time trade-offs of accuracy against execution time and power. This is achieved by implementing a CNN as a number of sequential layer blocks. Layer blocks can effectively be considered standalone networks of differing complexity. Each layer block branches off into an output that is independent of the block that follows it. This allows the system to execute partially or fully according to performance requirements. This reconfigurable model trades off accuracy for speed and power; results show that accuracy can be traded for gains of 50% and 70% in speed and power, respectively.
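The layer-block structure with independent branch outputs can be sketched in software as follows. This is an illustration only: `run_blocks`, the toy blocks and the toy branch heads are hypothetical stand-ins, not the paper's hardware blocks.

```python
def run_blocks(x, blocks, exits, depth):
    """Run the first `depth` layer blocks, then the branch output of the last one run.

    blocks[i] maps features to features; exits[i] is the independent output
    branch attached to block i. Smaller depth means less work and, in a real
    network, typically lower accuracy.
    """
    if not 1 <= depth <= len(blocks):
        raise ValueError("depth must be between 1 and the number of blocks")
    for block in blocks[:depth]:
        x = block(x)
    return exits[depth - 1](x)

# Toy stand-ins for two layer blocks and their branch outputs.
blocks = [
    lambda v: [a + 1 for a in v],
    lambda v: [a * 2 for a in v],
]
exits = [sum, max]

partial = run_blocks([1, 2], blocks, exits, depth=1)  # only the first block runs
full = run_blocks([1, 2], blocks, exits, depth=2)     # both blocks run
```

Choosing `depth` at call time mirrors partial versus full execution: blocks after the chosen depth are never evaluated, which is where the speed and power savings would come from.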
Breakthroughs in deep convolutional neural networks for new vision applications in image classification and object detection have pushed forward precision and speed performance indicators in both domains. The future of space exploration relies on the development of novel systems for autonomous operations and onboard data handling, especially for computer vision and deep learning. However, previous works on object detection and image classification operate on the rigid assumption that representative data is available and reliable, and focus only on offline optimization of architectures for accuracy. This assumption cannot be extended to onboard processing, especially in a space environment where unknown changes in the visual scene directly affect the performance of machine vision systems. The performance of a deep neural network is as dependent on the input data as it is on the network itself. We propose a multi-sensory computer vision system that accounts for data reliability and availability using an adaptive input policy. We use custom datasets containing RGB and depth images of a reference satellite mission for training and testing deep convolutional neural network models for object detection. Our simulation testbed generates the datasets, which cover all poses and a variety of ranges, lighting conditions and visual environments. The trained models use multi-sensory input data from both an optical sensor (RGB data) and a time-of-flight (ToF) sensor (depth data). The multi-sensory input data is passed through the adaptive input layer, where the two sensors complement each other to provide the most reliable output in a harsh space environment that does not tolerate missing or unreliable data. For instance, the ToF sensor provides visual data that reliably covers close ranges and, most importantly, can operate regardless of ambient light. The optical sensor provides RGB data at farther ranges and, unlike ToF sensors, is not susceptible to saturation from Earth's infrared emissions. This selective multi-sensory input approach ensures that the CNN model receives reliable input data regardless of changes in the visual environment, in keeping with the strict operational requirements of space missions. Our work is validated using a sensory-data reliability assessment and state-of-the-art object detection models based on the Faster R-CNN and YOLO detection techniques. Average precision on the validation dataset improved significantly with our approach, rising from 50% and 40% using RGB and depth respectively to 80% using the input-selective system.
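As a rough sketch of what an input-selection policy of this kind could look like, the snippet below picks the more reliable of two sensor frames and skips any frame that is missing. The `Frame` type, the `select_input` function, the scalar reliability score and the 0.5 threshold are all assumptions made for illustration; the abstract does not specify how reliability is computed or how the adaptive input layer is implemented.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Frame:
    data: Optional[list]   # None when the sensor delivered nothing
    reliability: float     # ASSUMED score in [0, 1] from a reliability assessment

def select_input(rgb: Frame, depth: Frame, min_reliability: float = 0.5) -> Optional[str]:
    """Return "rgb" or "depth" for the most reliable available frame, or None.

    Ties are resolved in favour of RGB because it is listed first.
    """
    candidates = [
        (name, frame)
        for name, frame in (("rgb", rgb), ("depth", depth))
        if frame.data is not None and frame.reliability >= min_reliability
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1].reliability)[0]

# Hypothetical close-range, low-light case: RGB is dark, ToF depth is still usable.
choice = select_input(Frame(data=[0, 0, 0], reliability=0.2),
                      Frame(data=[5, 6, 7], reliability=0.9))
```

A hard switch is only one possible policy; a weighted fusion of both inputs would be another, and nothing here indicates which one the authors use.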
Training of convolutional neural networks (CNNs) on embedded platforms to support on-device learning has become essential for the future deployment of CNNs on autonomous systems. In this work, we present an automated CNN training pipeline compilation tool for Xilinx FPGAs. We automatically generate multiple hardware designs from high-level CNN descriptions using a multi-objective optimization algorithm that explores the design space by exploiting CNN parallelism. These designs, which trade off resources for throughput, allow users to tailor implementations to their hardware and applications. The training pipeline is generated from the backpropagation (BP) equations of convolution, which highlight an overlap in computation with the forward pass (FP). We translate this overlap into hardware by reusing most of the FP pipeline, reducing the resource overhead. The implementation uses a streaming interface that lends itself well to data streams and live feeds rather than static data reads from memory. That is, we do not use the standard approach of an array of processing elements (PEs), which is efficient for offline inference; instead, we translate the architecture into a pipeline through which data is streamed, allowing new samples to be read as they become available. We validate the results using the Zynq-7100 on three datasets and architectures of varying size against CPU and GPU implementations. GPUs consistently outperform FPGAs in training time in batch processing scenarios, but in data stream scenarios, FPGA designs achieve a significant speedup over GPU and CPU when enough resources are dedicated to the learning task. Speedups of 2.8x, 5.8x and 3x over GPU were achieved on three architectures trained on MNIST, SVHN and CIFAR-10, respectively.
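The overlap between the forward and backward computations of convolution can be seen in a small one-dimensional sketch. It uses the standard identities for a valid cross-correlation layer: the weight gradient is a correlation of the input with the output gradient, and the input gradient is a correlation of the zero-padded output gradient with the flipped kernel. All three therefore call the same routine. The function names are hypothetical and this software model says nothing about how the authors map the reuse onto their pipeline.

```python
def correlate_valid(x, w):
    """Valid 1-D cross-correlation: y[i] = sum_j x[i + j] * w[j]."""
    n, k = len(x), len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(n - k + 1)]

def forward(x, w):
    """Forward pass of a single 1-D convolution layer (no bias, stride 1)."""
    return correlate_valid(x, w)

def backward(x, w, dy):
    """Gradients of the loss w.r.t. weights and input, given dy = dL/dy.

    Both gradients reuse correlate_valid, the same routine as the forward pass.
    """
    # dL/dw[j] = sum_i x[i + j] * dy[i]
    dw = correlate_valid(x, dy)
    # dL/dx[n] = sum_j dy[n - j] * w[j]: pad dy and correlate with the flipped kernel.
    pad = [0.0] * (len(w) - 1)
    dx = correlate_valid(pad + list(dy) + pad, w[::-1])
    return dw, dx

x = [1.0, 2.0, 3.0, 4.0]
w = [1.0, 0.0, -1.0]
y = forward(x, w)
dw, dx = backward(x, w, dy=[1.0, 1.0])
```

Because the backward pass is expressed through the same multiply-accumulate kernel, a hardware design can in principle share that datapath between passes instead of duplicating it.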