FlexCore at USENIX NSDI 2017
Watch FlexCore's video presentation at the 2017 NSDI symposium.
Complexity and throughput
Based on a novel Metric of Promise, MultiSphere estimates the probability that a node in the search tree is part of the transmitted vector. Then, depending on the number of available processing elements (PEs), it partitions the Sphere Decoding tree search into sub-trees that preserve ML optimality and can be processed nearly concurrently.
MultiSphere's single-carrier latency and complexity
In single-carrier detection with 32 available processing elements, MultiSphere can reduce latency by more than an order of magnitude compared to state-of-the-art sequential Sphere Decoders while maintaining a similar computational complexity. Additionally, in the high-SNR regime, MultiSphere's latency with 32 processing elements is close to that of linear detection methods (i.e., the number of transmitting antennas).
(Complexity is measured in Partial Distance calculations and latency in Visited Nodes, for 8 x 8 MIMO, 10-16 dB SNR, uncoded 16-QAM transmission, and SD partitioning based on the exact transmission channel.)
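To make the two metrics above concrete, the following is a minimal sketch of a plain sequential depth-first sphere decoder that counts partial-distance calculations and visited nodes. It is not MultiSphere and does not come from the paper; it assumes a real-valued system whose channel has already been reduced to an upper-triangular matrix `R` (for example by QR decomposition), and the names `sphere_decode`, `pd_calcs`, and `visited`, as well as the toy values of `R` and `y`, are illustrative assumptions.

```python
def sphere_decode(R, y, symbols):
    """Sequential depth-first sphere decoder for y = R s + noise.

    R is upper triangular (assumed precomputed). Returns the best vector,
    its squared distance, and counters for the two metrics of interest.
    """
    n = len(y)
    best = {"dist": float("inf"), "s": None}
    stats = {"pd_calcs": 0, "visited": 0}
    s = [None] * n

    def search(level, dist_so_far):
        # Interference from symbols already fixed at the levels above.
        interference = sum(R[level][j] * s[j] for j in range(level + 1, n))
        children = []
        for x in symbols:
            pd = (y[level] - interference - R[level][level] * x) ** 2
            stats["pd_calcs"] += 1  # one partial-distance calculation
            children.append((dist_so_far + pd, x))
        # Explore children from the smallest to the largest distance.
        for dist, x in sorted(children):
            if dist >= best["dist"]:
                break  # prune: the remaining children cannot improve the best leaf
            stats["visited"] += 1  # one visited node
            s[level] = x
            if level == 0:
                best["dist"], best["s"] = dist, list(s)
            else:
                search(level - 1, dist)

    search(n - 1, 0.0)
    return best["s"], best["dist"], stats


# Toy example (hypothetical values): 3 real dimensions, 4-PAM symbols.
R = [[2.0, 0.5, -0.3],
     [0.0, 1.5, 0.4],
     [0.0, 0.0, 1.0]]
y = [-0.3, -3.5, 3.05]  # R applied to [1, -3, 3] plus small noise
s_hat, dist, stats = sphere_decode(R, y, [-3, -1, 1, 3])
print(s_hat, stats)
```

In this low-noise toy case the decoder walks straight down one path, so it visits only as many nodes as there are dimensions, which illustrates why high-SNR latency approaches the number of transmitting antennas.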
MultiSphere's multi-carrier latency and complexity
In multi-carrier detection, when the number of available processing elements is at most the number of subcarriers, one could allocate a sequential Sphere Decoder per subcarrier and process the subcarriers in parallel. We can instead employ a single MultiSphere detector that processes the subcarriers sequentially. In this case, an 8-PE MultiSphere detector decreases latency by a factor of 3, while a 32-PE MultiSphere detector reduces latency by more than one order of magnitude.
(Latency is measured in Visited Nodes, for 8 x 8 MIMO, 10-16 dB SNR, and uncoded 16-QAM transmission over 64 subcarriers.)
Related research output: K. Nikitopoulos, D. Chatzipanagiotis, C. Jayawardena and R. Tafazolli, "MultiSphere: Massively Parallel Tree Search for Large Sphere Decoders," in IEEE Global Communications Conference (GLOBECOM), Washington, DC, 2016, pp. 1-6.
Approximate MultiSphere: FlexCore
While MultiSphere can obtain the exact ML solution by visiting significantly fewer nodes than sequential SDs, its latency can vary significantly depending on the SNR and the channel condition. MultiSphere's search can be terminated at any time, which provides a flexible tradeoff between ML optimality and detection latency. FlexCore is the sub-optimal version of MultiSphere that considers only as many paths as there are available processing elements.
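The idea of evaluating only as many tree paths as there are processing elements can be sketched as follows. This is an illustrative toy, not FlexCore's implementation: the path-selection rule used here (prefer paths whose per-level ranks sum to the smallest value) is a placeholder assumption standing in for FlexCore's own path allocation, and the names `evaluate_path` and `detect_with_budget` and the toy values are hypothetical. Each path is identified by a tuple of ranks, where rank 0 at a level means "the closest symbol at that level", and each path can be evaluated independently of the others.

```python
from itertools import product


def evaluate_path(R, y, symbols, ranks):
    """Follow one tree path: at each level take the ranks[level]-th closest symbol."""
    n = len(y)
    s = [None] * n
    dist = 0.0
    for level in range(n - 1, -1, -1):
        interference = sum(R[level][j] * s[j] for j in range(level + 1, n))
        scored = sorted(
            ((y[level] - interference - R[level][level] * x) ** 2, x)
            for x in symbols
        )
        pd, s[level] = scored[ranks[level]]
        dist += pd
    return dist, s


def detect_with_budget(R, y, symbols, num_pes):
    """Evaluate at most num_pes paths and return the best (distance, vector)."""
    n = len(y)
    # Placeholder ordering (assumption): small rank sums first.
    # Enumerating every rank tuple is only feasible for toy sizes.
    all_paths = sorted(product(range(len(symbols)), repeat=n),
                       key=lambda r: (sum(r), r))
    chosen = all_paths[:max(1, num_pes)]
    # Paths are independent, so each could be assigned to its own PE.
    results = [evaluate_path(R, y, symbols, r) for r in chosen]
    return min(results)


# Toy example (hypothetical values): 3 real dimensions, 4-PAM symbols.
R = [[2.0, 0.5, -0.3],
     [0.0, 1.5, 0.4],
     [0.0, 0.0, 1.0]]
y = [-0.3, -3.5, 3.05]
symbols = [-3, -1, 1, 3]
d_one, s_one = detect_with_budget(R, y, symbols, 1)    # single path
d_full, s_full = detect_with_budget(R, y, symbols, 64)  # all 4**3 paths
print(s_one, s_full)
```

With a budget of one path the sketch reduces to successive interference cancellation, and with a budget covering every rank tuple it becomes an exhaustive search, so the number of processing elements directly sets where the detector sits between those two extremes.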
FlexCore has been evaluated using both over-the-air experiments and simulations on channel traces, using the WARP v3 SDR platform and WARPLab (REF). Experiments have been conducted in an indoor environment employing 20 MHz carrier bandwidth within the 5 GHz ISM band.
In contrast to similar state-of-the-art solutions such as the Fixed-Complexity Sphere Decoder, FlexCore can flexibly scale the achievable throughput depending on the number of available processing elements. Moreover, it can reach near-optimal performance with more than one order of magnitude fewer processing elements.
Processor speed and lithography
Related research output: C. Husmann, G. Georgis, K. Nikitopoulos and K. Jamieson, "FlexCore: Massively Parallel and Flexible Processing for Large MIMO Access Points," in Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI), March 2017.
FlexCore on GPUs
FlexCore is flexible enough to achieve real-time detection for all LTE modes on a commercially available desktop general-purpose Graphics Processing Unit (GPU), whereas the Fixed-Complexity Sphere Decoder requires a fixed number of tree paths. Given a more powerful GPU, FlexCore can employ more tree paths and thus achieve performance even closer to that of optimal detection. (All results include host-to-GPU and GPU-to-host memory transfers, assuming a 10 ms frame duration.)
FlexCore on FPGAs
FlexCore's probabilistic tree path allocation and its direct control over hardware resources allow it to achieve more than one order of magnitude higher energy efficiency than the Fixed-Complexity Sphere Decoder, i.e., the similarly parallelizable state of the art. (Xilinx Virtex Ultrascale 440-flga2892-3-e, Xilinx Power Estimator under worst-case static power and 100% activity, 75% maximum logic slice utilization.)