High-level Synthesis for Semi-global Matching: Is the juice worth the squeeze?
Author(s): Affaq Qamar; Fahad Muslim; Francesco Gregoretti; Luciano Lavagno; Mihai T. Lazarescu
Publisher: IEEE - Institute of Electrical and Electronics Engineers, Inc.
High-level synthesis (HLS) based design methodologies are attractive for industries that are sensitive to production costs. To gain a competitive advantage, it is highly desirable to have several implementations of the same algorithm that satisfy a diverse range of resolution, cost, and performance constraints. In this article, we present multiple hardware implementations of the Semi-global Matching (SGM) algorithm, which is used in stereo vision systems, e.g., for automotive applications. The hardware platform considered in this work is a Xilinx® Zynq™ System-on-Chip. We also compare the HLS-based designs with a manual RTL design in terms of quality of results (QoR), flexibility, and design time. SGM consists mainly of a sequence of three processing steps: the "cost cube calculation", followed by the "path cost computation", and finally the "disparity approximation and minimization". The path cost processor performs pixel-wise processing of the cost cube data along eight distinct path orientations. The baseline algorithmic model, usually called the "golden" model, uses large arrays that must be mapped to an external DRAM and brought into on-chip RAM when required. This requires adding memory transfer loops and inserting calls to the AXI transactors that access the DRAM through the on-chip DDR slave. Furthermore, the initial algorithm (typically single-threaded) must be parallelized to fully exploit the concurrency offered by the target hardware platform. The design space exploration was therefore performed by making several considerably different micro-architectural choices. Eventually, we obtained an implementation comparable to the manual RTL design. Both the manual RTL and the HLS designs achieved the target real-time performance of 30 fps at an image resolution of 640x480 with a disparity depth of 128 pixels per frame.
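To make the three SGM steps concrete, the following is a minimal software sketch in pure Python. It is not the authors' golden model or either hardware design. The absolute-difference matching cost, the smoothness penalties `p1` and `p2`, the `invalid_cost` value, and all function names are assumptions chosen for illustration. The cube holds one cost per pixel per candidate disparity, so at 640x480 with 128 disparity levels it would contain 640 × 480 × 128 = 39,321,600 entries per frame, which is consistent with the abstract's remark that the large arrays have to live in external DRAM.

```python
# Illustrative SGM sketch on small images; names and parameters are assumptions.

# Eight path orientations, expressed as (dy, dx) steps.
DIRECTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (1, 1), (1, -1), (-1, 1), (-1, -1)]
BIG = 10**9  # stands in for "no neighbouring disparity" at the ends of the range


def cost_cube(left, right, max_disp, invalid_cost=255):
    """Step 1: matching cost for every pixel and every candidate disparity."""
    h, w = len(left), len(left[0])
    cube = [[[invalid_cost] * max_disp for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for d in range(max_disp):
                if x - d >= 0:  # right-image pixel must lie inside the image
                    cube[y][x][d] = abs(left[y][x] - right[y][x - d])
    return cube


def path_cost(cube, dy, dx, p1, p2):
    """Step 2: aggregate costs along a single path orientation (dy, dx)."""
    h, w, nd = len(cube), len(cube[0]), len(cube[0][0])
    agg = [[None] * w for _ in range(h)]
    # Scan so that the predecessor pixel (y - dy, x - dx) is always done first.
    ys = range(h) if dy >= 0 else range(h - 1, -1, -1)
    xs = range(w) if dx >= 0 else range(w - 1, -1, -1)
    for y in ys:
        for x in xs:
            py, px = y - dy, x - dx
            if not (0 <= py < h and 0 <= px < w):
                agg[y][x] = list(cube[y][x])  # path starts at the image border
                continue
            prev = agg[py][px]
            prev_min = min(prev)
            out = []
            for d in range(nd):
                best = min(
                    prev[d],                                  # same disparity
                    prev[d - 1] + p1 if d > 0 else BIG,       # disparity - 1
                    prev[d + 1] + p1 if d < nd - 1 else BIG,  # disparity + 1
                    prev_min + p2,                            # larger jump
                )
                # Subtracting prev_min keeps the values bounded along the path.
                out.append(cube[y][x][d] + best - prev_min)
            agg[y][x] = out
    return agg


def disparity_map(left, right, max_disp, p1=1, p2=4):
    """Step 3: sum the eight path costs and take the minimizing disparity."""
    cube = cost_cube(left, right, max_disp)
    h, w = len(left), len(left[0])
    total = [[[0] * max_disp for _ in range(w)] for _ in range(h)]
    for dy, dx in DIRECTIONS:
        agg = path_cost(cube, dy, dx, p1, p2)
        for y in range(h):
            for x in range(w):
                for d in range(max_disp):
                    total[y][x][d] += agg[y][x][d]
    return [
        [min(range(max_disp), key=lambda d: total[y][x][d]) for x in range(w)]
        for y in range(h)
    ]
```

Calling `disparity_map(left, right, max_disp)` on two equally sized lists of pixel rows returns one disparity per pixel. The sketch processes the eight orientations one after another and keeps the whole cube in memory, which is the single-threaded, large-array style the abstract attributes to the baseline model. A hardware mapping would instead have to stream tiles of the cube between DRAM and on-chip RAM and run the path computations concurrently.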