A. Dhar, M. Yu, W. Zuo, X. Wang, N. S. Kim, D. Chen
33rd International Conference on VLSI Design and 19th International Conference on Embedded Systems (VLSID),
2020
The rapidly growing demands for powerful AI algorithms in many application domains have motivated massive investment in both high-quality deep neural network (DNN) models and high-efficiency implementations. In this position paper, we argue that a simultaneous DNN/implementation co-design methodology, named Neural Architecture and Implementation Search (NAIS), deserves more research attention to boost the development productivity and efficiency of both DNN models and implementation optimization. We propose a stylized design methodology that can drastically cut down the search cost while preserving the quality of the end solution. As an illustration, we discuss this DNN/implementation methodology in the context of both FPGAs and GPUs. We take autonomous driving as a key use case, as it is one of the most demanding areas for high-quality AI algorithms and accelerators. We discuss how such a co-design methodology can impact the autonomous driving industry significantly. We identify several research opportunities in this exciting domain.
C. Hao, Y. Chen, X. Liu, A. Sarwari, D. Sew, A. Dhar, B. Wu, D. Fu, J. Xiong, W. Hwu, J. Gu, D. Chen
International Conference on Computer-Aided Design,
2019
Fine-grained action detection is an important task with numerous applications in robotics and human-computer interaction. Existing methods typically utilize a two-stage approach including extraction of local spatio-temporal features followed by temporal modeling to capture long-term dependencies. While most recent papers have focused on the latter (long-temporal modeling), here, we focus on producing features capable of modeling fine-grained motion more efficiently. We propose a novel locally-consistent deformable convolution, which utilizes the change in receptive fields and enforces a local coherency constraint to capture motion information effectively. Our model jointly learns spatio-temporal features (instead of using independent spatial and temporal streams). The temporal component is learned from the feature space instead of pixel space, e.g., optical flow. The produced features can be flexibly used in conjunction with other long-temporal modeling networks, e.g., ST-CNN, DilatedTCN, and ED-TCN. Overall, our proposed approach robustly outperforms the original long-temporal models on two fine-grained action datasets: 50 Salads and GTEA, achieving F1 scores of 80.22% and 75.39%, respectively.
Khoi-Nguyen C. Mac, Dhiraj Joshi, Raymond A. Yeh, Jinjun Xiong, Rogerio S. Feris, Minh N. Do
International Conference on Computer Vision,
2019
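The locally-consistent deformable convolution described above is easiest to see in code. The NumPy sketch below is illustrative only, not the authors' implementation: it shows a coherency penalty that encourages neighbouring sampling offsets to agree, and a motion feature taken as the frame-to-frame change in offsets (receptive fields); the function names and array shapes are assumptions made for this example.

```python
import numpy as np

def local_coherency_penalty(offsets):
    """Sum-of-squares penalty pushing spatially adjacent sampling offsets
    toward each other (a local coherency constraint).

    offsets: (H, W, 2) array of learned (dy, dx) sampling offsets, as a
    deformable convolution's offset branch would produce.
    """
    d_vert = offsets[1:, :, :] - offsets[:-1, :, :]   # vertical neighbours
    d_horz = offsets[:, 1:, :] - offsets[:, :-1, :]   # horizontal neighbours
    return float((d_vert ** 2).sum() + (d_horz ** 2).sum())

def motion_feature(offsets_t, offsets_t_plus_1):
    """Motion cue taken in feature space: the change in receptive fields
    (offsets) between consecutive frames, rather than pixel-space optical flow."""
    return offsets_t_plus_1 - offsets_t
```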
Multi-scale context modules and single-stage encoder-decoder structures are commonly employed for semantic segmentation. The multi-scale context module refers to the operations that aggregate feature responses from a large spatial extent, while the single-stage encoder-decoder structure encodes the high-level semantic information in the encoder path and recovers the boundary information in the decoder path. In contrast, multi-stage encoder-decoder networks have been widely used in human pose estimation and show superior performance to their single-stage counterparts. However, few attempts have been made to bring this effective design to semantic segmentation. In this work, we propose a Semantic Prediction Guidance (SPG) module which learns to re-weight the local features through guidance from pixel-wise semantic prediction. We find that by carefully re-weighting features across stages, a two-stage encoder-decoder network coupled with our proposed SPG module can significantly outperform its one-stage counterpart with a comparable parameter and computation budget. Finally, we report experimental results on the semantic segmentation benchmark Cityscapes, on which our SPGNet attains 81.1% on the test set using only ‘fine’ annotations.
B. Cheng, L-C. Chen, Y. Wei, Y. Zhu, Z. Huang, J. Xiong, T. Huang, W-M. Hwu, H. Shi
International Conference on Computer Vision,
2019
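One plausible reading of the SPG re-weighting idea is sketched below in NumPy: the first stage's pixel-wise semantic scores pass through a hypothetical 1x1 projection and a sigmoid to produce per-channel gates that re-weight the local features fed to the next stage. The projection weights `w` and the exact gating function are assumptions; the paper's formulation may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spg_reweight(features, semantic_logits, w):
    """Re-weight stage-one features using pixel-wise semantic predictions.

    features:        (H, W, C) local features from the first stage.
    semantic_logits: (H, W, K) per-pixel class scores from the first stage.
    w:               (K, C) hypothetical 1x1-convolution weights mapping
                     class scores to per-channel gates.
    """
    gates = sigmoid(semantic_logits @ w)   # (H, W, C) gates in [0, 1]
    return features * gates                # element-wise re-weighting
```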
Recent advancements in deep learning techniques facilitate intelligent query support in diverse applications, such as content-based image retrieval and audio texturing. Unlike conventional key-based queries, these intelligent queries lack efficient indexing and require complex compute operations for feature matching. To achieve high-performance intelligent querying against massive datasets, modern computing systems employ GPUs in conjunction with solid-state drives (SSDs) for fast data access and parallel data processing. However, our characterization with various intelligent-query workloads developed with deep neural networks (DNNs) shows that the storage I/O bandwidth is still the major bottleneck, contributing 56%–90% of the query execution time.
To this end, we present DeepStore, an in-storage accelerator architecture for intelligent queries. It consists of (1) energy-efficient in-storage accelerators designed specifically to support DNN-based intelligent queries under the resource constraints of modern SSD controllers; (2) a similarity-based in-storage query cache that exploits the temporal locality of user queries for further performance improvement; and (3) a lightweight in-storage runtime system that works as the query engine and provides a simple software abstraction to support different types of intelligent queries. DeepStore exploits SSD parallelism through design space exploration to achieve maximal energy efficiency for the in-storage accelerators. We validate the DeepStore design with an SSD simulator and evaluate it with a variety of vision-, text-, and audio-based intelligent queries. Compared with the state-of-the-art GPU+SSD approach, DeepStore improves query performance by up to 17.7× and energy efficiency by up to 78.6×.
V. S. Mailthody, Z. Qureshi, W. Liang, Z. Feng, S. Gonzalo, Y. Li, H. Franke, J. Xiong, J. Huang, W. Hwu
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’19),
2019
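The similarity-based in-storage query cache lends itself to a toy software model. The Python sketch below is a host-side approximation, not the in-SSD design: a query "hits" when its embedding lies within a cosine-similarity threshold of a previously cached query's embedding, exploiting the temporal locality the abstract mentions. The class name, FIFO eviction policy, and threshold are illustrative assumptions.

```python
import numpy as np

class SimilarityQueryCache:
    """Toy similarity-based query cache: reuse a cached result when a new
    query's embedding is a near-duplicate of a cached one."""

    def __init__(self, threshold=0.95, capacity=128):
        self.threshold = threshold
        self.capacity = capacity
        self.entries = []  # list of (unit-norm embedding, result) pairs

    def lookup(self, emb):
        emb = emb / np.linalg.norm(emb)
        for cached_emb, result in self.entries:
            if float(cached_emb @ emb) >= self.threshold:
                return result          # near-duplicate query: reuse result
        return None                    # miss: run the full DNN query

    def insert(self, emb, result):
        emb = emb / np.linalg.norm(emb)
        if len(self.entries) >= self.capacity:
            self.entries.pop(0)        # simple FIFO eviction (assumption)
        self.entries.append((emb, result))
```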
Deep neural networks (DNNs) have been widely adopted in many domains, including computer vision, natural language processing, and medical care. Recent research reveals that sparsity in DNN parameters can be exploited to reduce inference computational complexity and improve network quality. However, sparsity also introduces irregularity and extra complexity in data processing, which makes accelerator design challenging. This work presents the design and implementation of a highly flexible sparse DNN inference accelerator on FPGA. Our proposed inference engine can be easily configured for both mobile-computing and high-performance-computing scenarios. Evaluation shows that our proposed inference engine effectively accelerates sparse DNNs and outperforms a CPU solution by up to 4.7x in terms of energy efficiency.
S. Huang, C. Pearson, R. Nagi, J. Xiong, W. Hwu, D. Chen
IEEE High Performance Extreme Computing Conference,
2019
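The irregularity the abstract refers to is visible in even the simplest sparse kernel. Below is a minimal reference CSR (compressed sparse row) matrix-vector product in NumPy, not the FPGA engine itself: only the nonzero weights are stored and multiplied, which is the source of the compute savings, while the indirect `col_idx` lookups are the irregular accesses that complicate hardware design.

```python
import numpy as np

def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a sparse weight matrix (CSR format) by a dense vector.

    values:  nonzero weights, concatenated row by row.
    col_idx: column index of each nonzero weight.
    row_ptr: row_ptr[i]..row_ptr[i+1] delimits row i's nonzeros.
    """
    y = np.zeros(len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]   # indirect, irregular access
    return y

# Example: the 2x3 matrix [[1, 0, 2], [0, 3, 0]] times [1, 1, 1].
y = csr_matvec(np.array([1.0, 2.0, 3.0]),
               np.array([0, 2, 1]),
               np.array([0, 2, 3]),
               np.ones(3))
assert np.allclose(y, [3.0, 3.0])
```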
In automatic speech recognition (ASR), wideband (WB) and narrowband (NB) speech signals with different sampling rates typically use separate acoustic models. Mixed-bandwidth (MB) acoustic modeling therefore has important practical value for ASR system deployment. In this paper, we extensively investigate large-scale MB deep neural network acoustic modeling for ASR using 1,150 hours of WB data and 2,300 hours of NB data. We study various MB strategies, including downsampling, upsampling, and bandwidth extension, and evaluate their performance on 8 diverse WB and NB test sets from various application domains. To deal with the large amounts of training data, distributed training is carried out on multiple GPUs using synchronous data parallelism.
Khoi-Nguyen C. Mac, Xiaodong Cui, Wei Zhang, Michael Picheny
Annual Conference of the International Speech Communication Association (INTERSPEECH),
2019
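Two of the mixed-bandwidth strategies studied, downsampling WB audio to the NB rate and upsampling NB audio to the WB rate, reduce to standard sample-rate conversion. A minimal sketch using SciPy's polyphase resampler follows; the signals are synthetic stand-ins, and bandwidth extension (which requires a learned model) is omitted.

```python
import numpy as np
from scipy.signal import resample_poly

# Synthetic stand-ins for 1 s of 16 kHz wideband and 8 kHz narrowband speech.
wb = np.random.randn(16000)
nb = np.random.randn(8000)

wb_down = resample_poly(wb, up=1, down=2)  # 16 kHz -> 8 kHz: train one NB model on all data
nb_up   = resample_poly(nb, up=2, down=1)  # 8 kHz -> 16 kHz: train one WB model on all data
```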
Unlike traditional PCIe-based FPGA accelerators, heterogeneous SoC-FPGA devices provide tighter integration between software running on CPUs and hardware accelerators. Modern heterogeneous SoC-FPGA platforms support multiple I/O cache coherence options between CPUs and FPGAs, but these options can have inadvertent effects on the achieved bandwidths depending on the application and its data access patterns. To provide the most efficient communication between CPUs and accelerators, understanding the data transaction behaviors and selecting the right I/O cache coherence method is essential. In this paper, we use the Xilinx Zynq UltraScale+ as the SoC platform to show how a given I/O cache coherence method can perform better or worse in different situations, ultimately affecting overall accelerator performance as well. Based on our analysis, we further explore possible software and hardware modifications to improve I/O performance under the different I/O cache coherence options. With our proposed modifications, the overall performance of an SoC design can be improved by an average of 20%.
S. Min, S. Huang, M. El-Hadedy, J. Xiong, D. Chen, W. Hwu
International Conference on Field Programmable Logic and Applications (FPL),
2019
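The paper's point that no single I/O cache coherence option wins everywhere can be summarized as a decision rule. The sketch below is a hypothetical heuristic, not taken from the paper: the port names follow Zynq UltraScale+ conventions (I/O-coherent HPC ports snoop the CPU caches, while non-coherent HP ports require explicit software cache maintenance), and the size threshold is an invented placeholder to be tuned per platform.

```python
def pick_io_coherence(buffer_bytes, cpu_reuses_buffer):
    """Hypothetical heuristic for choosing a CPU<->FPGA coherence option.

    cpu_reuses_buffer: whether the CPU touches the buffer between
    accelerator invocations (so stale cache lines would matter).
    """
    if not cpu_reuses_buffer:
        # CPU never re-reads the data: skip coherence traffic entirely.
        return "HP (non-coherent) port, flush caches once at setup"
    if buffer_bytes <= 64 * 1024:  # invented threshold, tune per platform
        # Small, frequently shared buffers: hardware snooping is cheap.
        return "HPC (I/O-coherent) port, no software cache maintenance"
    # Large shared buffers: snoop traffic can cost more than explicit flushes.
    return "HP (non-coherent) port with per-transfer cache flush/invalidate"
```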
This work presents an update to the triangle-counting portion of the subgraph isomorphism static graph challenge, motivated by a desire to understand the impact of CUDA unified memory on the triangle-counting problem. First, CUDA unified memory is used to overlap reading large graph data from disk with building graph data structures in GPU memory. Second, we use CUDA unified memory hints to solve the multi-GPU performance scaling challenges present in our last submission. Finally, we improve the single-GPU kernel performance of our past submission by introducing a dynamic, work-stealing GPU kernel with persistent threads, which makes performance adaptive for large graphs without requiring a graph analysis phase.
Carl Pearson, Mohammad Almasri, Omer Anjum, Vikram S. Mailthody, Zaid Qureshi, Rakesh Nagi, Jinjun Xiong, Wen-Mei Hwu
IEEE High Performance Extreme Computing Conference,
2019
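Setting aside unified memory and work stealing, which are GPU-side concerns, the per-edge neighbour-list intersection at the heart of such triangle-counting implementations looks like the following reference Python. The oriented adjacency-list representation (each vertex keeps only higher-numbered neighbours, so each triangle is counted once) is a standard trick, not necessarily the submission's exact data structure.

```python
def count_triangles(adj):
    """Count triangles by intersecting neighbour lists along each edge.

    adj: dict mapping each vertex to the set of its higher-numbered
    neighbours (an oriented adjacency list).
    """
    total = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            # Common higher-numbered neighbours of u and v close triangles.
            total += len(nbrs & adj.get(v, set()))
    return total

# Example: a 4-clique on vertices 0..3 contains exactly 4 triangles.
adj = {0: {1, 2, 3}, 1: {2, 3}, 2: {3}, 3: set()}
assert count_triangles(adj) == 4
```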
In this paper, we present an update to our previous submission on k-truss decomposition from Graph Challenge 2018. For the single-GPU k-truss implementation, we propose multiple algorithmic optimizations that significantly improve performance, by up to 35.2x (6.9x on average) compared to our previous GPU implementation. In addition, we present a scalable multi-GPU implementation in which each GPU handles a different 'k' value. Compared to our prior multi-GPU implementation, the proposed approach is faster by up to 151.3x (78.8x on average). When only the edges in the maximal k-truss are sought, incrementing the 'k' value in each iteration is inefficient, particularly for graphs with a large maximum k-truss. Thus, we propose a binary search over the 'k' value to find the maximal k-truss. The binary-search approach on a single GPU is up to 101.5x (24.3x on average) faster than our 2018 k-truss submission. Lastly, we show that the proposed binary search finds the maximum k-truss of the "Twitter" graph dataset, with 2.8 billion bidirectional edges, in just 16 minutes on a single V100 GPU.
Mohammad Almasri, Omer Anjum, Carl Pearson, Vikram S. Mailthody, Zaid Qureshi, Rakesh Nagi, Jinjun Xiong, Wen-Mei Hwu
IEEE High Performance Extreme Computing Conference,
2019
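The binary search over 'k' works because k-truss existence is monotone: any non-empty (k+1)-truss is also a non-empty k-truss, since its edges already have support at least k-1 > k-2. Below is a small reference sketch in Python with a naive peeling check (repeatedly drop edges whose support falls below k-2); it mirrors the search strategy only, not the GPU implementation, and the function names are illustrative.

```python
from itertools import combinations

def ktruss_exists(edge_list, k):
    """True if the graph has a non-empty k-truss: a subgraph in which
    every edge participates in at least k-2 triangles."""
    edges = set(frozenset(e) for e in edge_list)
    changed = True
    while changed and edges:
        # Rebuild adjacency from the surviving edges, then peel one round.
        adj = {}
        for e in edges:
            u, v = tuple(e)
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        changed = False
        for e in list(edges):
            u, v = tuple(e)
            support = len(adj[u] & adj[v])   # triangles through this edge
            if support < k - 2:
                edges.discard(e)
                changed = True
    return bool(edges)

def max_ktruss(edge_list, k_hi):
    """Binary-search the largest k (up to k_hi) with a non-empty k-truss."""
    lo, hi = 2, k_hi
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if ktruss_exists(edge_list, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

# Example: a 4-clique is a 4-truss (every edge closes 2 triangles).
clique = list(combinations(range(4), 2))
assert max_ktruss(clique, k_hi=10) == 4
```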