Investigation of area and speed trade-offs in FPGA [602372]

Investigation of area and speed trade-offs in FPGA
implementation of an image correlation algorithm

Z. Kincses1, Zs. Vörösházi2,
1Institute of Informatics, University of Szeged
Szeged, H-6701
[anonimizat]
2Dept. of Electrical Engineering, University of Pannonia
Veszprém, H-8200
[anonimizat]

Z. Nagy3,4, P. Szolgay3,4, T. Laviniu5 A. Gacsádi5
3Cellular Sensory and Optical Wave Computing Laboratory,
Hungarian Academy of Sciences
Budapest, H-1111
4 Faculty of Information Technology,
Pázmány Péter Catholic University
Budapest, H-1083
5Department of Electronics and Telecommunications,
University of Oradea, Romania
410087

Abstract—In this paper an image correlation algorithm is
implemented on FPGA architecture for assisted movements of visually impaired persons or automotive driving systems. Taking
into account the limitations of FPGA devices and the special
requirements of the correlation based image matching algorithm a semi-parallel approach is proposed. This provides an optimal trade-off between area and speed of the implemented algorithm. Several
key issues are investigated and discussed related to the speed and
area.
I. INTRODUCTION
In general, determination of a correlation coefficient
between two images requires high computing power, which is
proportional to the size of the template image (kernel). It is desirable that the template image size should be large enough to contain relevant information. For real-time computation of the correlation coefficients, the majority of the operations can be executed on a CNN algorithm as described in [1]. However, analog CNN VLSI implementations have relatively low precision (7-8 bits) and very sensitive to the small changes of temperature and supply voltage. For these reasons an FPGA-based implementation is chosen.
II. I
MAGE CORRELATION : MATHEMATICAL
BACKGROUNDS
Similarity matching between two gray-scale images can be
classified into feature-based and intensity based-methods. In case of the intensity based methods different metrics or procedures can be applied, such as: Euclidean distance, Sum of Absolute Differences (SAD), Mean Absolute Differences (MAD), Sum of Squared Differences (SSD) or Normalized Cross Correlation (NCC) [2].
The NCC in equation (1) can be easily derived from the
SSD: maximizing the correlation is equivalent to minimizing the sum of squared differences, which effectively gives more weight to large differences. Let us consider an input test image Φ(m,n): R² →R with dimension M×N. Correlation values can
be computed as follows [1]:
11
22
11 11(,) (,)
(, )
(,) (,)Q P
pq
QQPP
pq pqKp q K p q
CORR i j
Kp q K p q==
== ==⎡⎤ ⎡⎤ −⋅ Λ − Λ⎣⎦ ⎣⎦
=
⎡ ⎤⎡ ⎤−⋅ Λ − Λ⎣ ⎦⎣ ⎦∑∑
∑∑ ∑∑ (1)
where K(p,q): R² →R denotes the template image
(correlation kernel) with dimension P×Q (p ∈[1,P], q ∈[1,Q]),
Λ(p,q): R²→R represents the actual image region from test
image Φ(m,n) compared to the kernel K(p,q), with dimension
P×Q, (p ∈[1,P], q ∈[1,Q]), respectively.
K: is the mean intensity value of the template image
Λ: is the mean intensity value of the test image.
The gray-scale input test image is scanned pixel-by-pixel
and overlapped with a kernel as a sliding window to calculate the matching degree for each pixel (i,j) as can be seen in Figure 1. The matching degree between the template K(p,q) and an actual image region Λ(p,q) from the test image Φ(m,n)
is obtained by computing the square of correlation coefficient
(CORR
2(i,j)), which indicates how well the pattern matches
the contents of that region (compared image). Equation (1) can
be rewritten as follows:
{ }2
2
22(,) (,)
(, )
(,) (,)Kp q K p q
CORR i j
Kp q K p q⎡ ⎤⎡ ⎤−⋅ Λ − Λ⎣ ⎦⎣ ⎦=
⎡ ⎤⎡ ⎤−⋅ Λ − Λ⎣ ⎦⎣ ⎦ (2)

Using the transformation described in (2), the square root
operation can be eliminated by replacing with a multiplication
of the two images, while the division can be replaced by a multiplication of reciprocal expression. In this case we must consider that high values of correlation coefficient will result such a situation where the two images are completely anti-correlated.
(m,n)Φ2CORR (i, j)
(p,q)ΛK(p,q)
(, )ij (, )ij(, )kl

Figure 1. Image correlation operation
III. IMPLEMENTATION
In this section the FPGA-based implementation of the
correlation expression will be examined and described according to (2). Investigating the implementation of the image correlation algorithm three different ways are possible: a sequential, a semi-parallel and a massively/fully-parallel approaches.
In the first case, the correlation coefficients CORR
2(i,j) are
computed serially by utilizing one sequential processing element. Although it has the smallest area requirement, it will provide the slowest solution. Therefore it cannot be applied in real-time implementations. In case of fully-parallel solution, the correlation coefficients are computed by a massively parallel array processor [3]. Hereby, this solution gives the highest computing performance, but its area requirement will be quadratically increasing depending on the size of the kernel image. Therefore, this solution [3] can only be applied in such cases where the size of the kernel image is sufficiently small (e.g. between 12×12 and 20×20). The main limitation in this direct array implementation is the number of dedicated MAC (Multiply-Accumulate) DSP blocks. Larger templates, such as 32×32, need to be partitioned, multiplying the processing time by the number of partitions.
In third case, our proposed semi-parallel solution provides
a good trade-off between the serial and fully parallel implementations. In this case several processing MAC units with additional logic elements are linearly arranged and they can calculate the correlation coefficients parallel in a row-wise order. Using this method the area requirement of the architecture increases proportionally to the template size, so relatively large kernel images (e.g. 64×64, 128×128) could be handled in real-time, as well.
For these reasons to implement the correlation formula
described in (2) the semi-parallel solution was chosen. The overall architecture of this solution is shown in Fig. 3. The four main components are the Memory Controller Unit, the Kernel Memory Unit, the Mean Calculator Unit, and finally the Correlator Core Unit. The Memory Controller Unit provides an interface for an external memory, and transfer data to and from Mean Calculator and Correlator Core Units. The Kernel Memory Unit stores the
(,)Kpq K− and
2
(,)Kpq K ⎡ ⎤−⎣ ⎦ values which are considered to be constant
expressions. The Kernel Memory Unit can be implemented as an internal memory utilizing dedicated on-chip BlockRAM (BRAM) memories of the FPGA circuit. Supposing an 8–bit gray-scale kernel image only 1-, or 4 dedicated BRAM memories are occupied depending on the kernel size at resolution of 64×64 or 128×128, respectively.
The Mean Calculator Unit (Fig. 4) computes the mean
intensity of the (i, j)
th element of the compared image ( Λ) by
applying the “cumulative sum” method, while the final correlation results CORR
2(i,j) are calculated by the help of
Correlator Core Unit (Fig. 5).
A. General operation of the semi-parallel architecture
To elaborate the semi-paralle l architecture according to the
Equation (2) it is assumed, that the (,)Kpq K− and the
2
(,)Kpq K⎡ ⎤−⎣ ⎦ were calculated before the computation of the
correlation coefficients begins. Therefore, the architecture
should perform computation of (),pq Λ Λ− , multiplication by
kernel values, squaring of the partial results, mean
calculations and a division operation. The computation of
numerator and denominator parts can be carried out in a parallel way.
In the numerator the mean intensity value of the (i, j)
th
element of the compared image ( Λ) calculated first, then
(),-pqΛΛ difference is computed. In the next step the P×Q-
sized matrix of (),pqΛ− Λ values are multiplied by the
(,)Kpq K− values and the mean of these partial results are
computed.
In the denominator the previously evaluated (),pqΛ− Λ
expression can be utilized and squared. Finally, the mean
intensity value of this P×Q –sized matrix should be calculated
and multiplied by 2
(,)Kpq K⎡ ⎤−⎣ ⎦ which is a constant
expression.

Figure 2. Overall system architecture

B. Mean Calculator Unit
The first step of the calculation of CORR2(i,j) is the
computation of the mean intensity values of the compared
image (Λ) on P×Q-sized windows of the entire M×N-sized
input test image Φ(m,n).
For this task a multi-level adder tree can be used, however
its width increases linearly depending on the image size.
Therefore the “cumulative sum” method is applied instead of implementing such a complex adder tree circuit.
The Mean Calculator Unit (Fig. 4) is responsible for the
calculation of mean intensity values by using cumulative sum
method. In this method the value of
Λ(, )ij is calculated in
two consecutive steps. In first, the row-wise sum of the
compared image is computed. In this case an accumulator unit ACC is continuously fed from the external memory where the input image is stored, while incoming pixel elements in each consecutive row are summed by an Adder. In the first line after reaching the M
th pixel the sum of first row of Φ has been
calculated. In the second line the first partial result from the
first row is required, which is stored in a Delay Line (1×M).
During the correlation coefficient computation the average value of a P×Q-sized window for each pixel is required, which can be easily computed from the integral image by using the following expression (3):
() ()
() ()1234
22 22
22 221(, )
,,
,,QQ PP
QQ PPij ij
ij ijSSSSijPQ
SS
SSPQ−−+Λ= =×
⎡=− − − + − −⎣
⎤ −− + + + +×⎦ × (3)
Using this method the integral of the whole input image
can be computed as shown in Figure 3.
(, )ij(p,q)Λ
(m,n)Φ

Figure 3. Integral image computation
The mean value of each pixel can be computed by using
only four elements of the integral image, which are located at the corners of the P×Q-sized window around the actual pixel. The size of the window can be even 128×128 pixels; therefore 128 lines should be stored on the FPGA to compute the average value in one step. Storing such a wide belt from the image is not efficient. To save on-chip memory on an FPGA the average computation is decomposed to horizontal and vertical parts. The horizontal part can be directly connected to the output of the integral computation unit. The last P values are stored in a shift register to generate the left and right value of the window to the subtractor unit. These partial results are written back to the external memory. During the final computation step of the average values the partial results are loaded in column-wise order, and stored in a Q-length shift register to provide the top and bottom partial results for a subtractor unit. Finally, these values should be divided by P×Q to get the mean intensity of Λ(p,q). Size of the template
image is constant therefore division in (3) can be done by multiplication with the reciprocal of this value. Moreover, assuming that the reciprocal (e.g. image dimensions) is a power of two the division can be replaced by a simple shift operation. Using this solution the off-chip memory bandwidth can be traded for reduced on-chip memory.

Figure 4. Structure of the Mean Calculator Unit
C. Correlator Core Unit
Final correlation values CORR2(i,j) are calculated by the
Correlator Core Unit (Fig. 5). In the equation (2) the numerator and denominator are computed in a parallel way by applying this Correlator Core Unit. According to the size of the kernel image a new correlation value is computed in every Q
th clock cycles.
The Correlator Core Unit is basically built up from four
main components: a Register Array unit, a Subtractor and MAC Tree unit, a Shift Register unit, and a Final Result unit. The Shift Register Array is a P×Q-sized register array. The incoming input values are loaded into the upper-right register element. In each clock cycle the old values are shifted down by 1 position in the register array, while the content of the bottom register elements is moved to the upper register of the adjacent column on the left. The output values stored in the bottom line of the Register Array are loaded into the Subtractor and MAC unit. These MAC units are working in parallel and each pair of dedicated multiplier blocks computes
a different pixel. The result of each pixel is computed in P×Q
clock cycles. Upper MAC units compute the
(,) (,)Kp q K p q⎡ ⎤⎡ ⎤−⋅ Λ − Λ⎣ ⎦⎣ ⎦ expression in numerator,
while the 2
(,)pq⎡ ⎤ Λ− Λ⎣ ⎦expression is calculated by the
lower MAC units.

×
×
×
×
×
×
×

Figure 5. Structure of the Correlator Core Unit
The final correlation results CORR2(i,j) are computed by
the Final Result unit, which contains two multipliers and a
divider. Owing to the large resource requirement of a division operation only one divider unit is applied. On the other hand, P consecutive pixels are computed during P×Q clock cycle. Therefore, computation of the final step of these pixels can be calculated serially. Serialization is carried out by the Shift Register unit, which is fed from partial results in every P×Q clock cycle. The Shift Register is loaded alternately with the numerator and the denominator part and its contents are sent to the Final Result unit serially. Final result of the P consecutive pixels is computed in 2×P clock cycles. Currently, the last step of the final result computation is performed by using a parallel pipe-lined divider, which is underutilized because P divisions must be carried out in P×Q clock cycles. In the future, the area requirement of the division is intended to be decreased by using a serial divider unit.
IV.
RESULTS
As the main component of the elaborated architecture the
Correlator Core unit is investigated in two key aspects, like area consumption (both in case of dedicated and logic elements) and achievable computing performance . During in
the implementation and verification procedures the Xilinx ISE Design Suite version 13.4 has been utilized [4]. The area requirement and the performance of the proposed architecture were measured using the map report and the static timing analysis generated by the ISE integrated development tool.
Area requirements are determined on the state-of-the-art
Xilinx Series-7 FPGA family where a 511×511-sized gray-scale (8-bit) input test image is used, while template image sizes are scaled between 4×4 to 128×128 pixels. The size of test image must be adjusted to the pixel size of template according to the following expression (n×P–1) where n is integer. In this case all the MAC units can be fully utilized during the computation. In addition, the size of the template image should be set to power of two, thus the division during the computation of average values can be replaced by a simple shift operation. The general resource requirements (e.g. Logic Slices, Flip-flops and 6-input Look-Up-Tables) of the Correlator Core Unit are depicted in Figure 6.
05000100001500020000250003000035000400004500050000
4×4 8×8 16×16 32×32 64×64 128×128Number of resources
Template image sizeGeneral resource requirement
Slice Flip-Flop 6-input LUT

Figure 6. General resource requirements of the entire Correlator Core Unit
As deduced from Figure 6, the general resource
requirements of the entire Correlator Core Unit is linearly
increased depending on the template image size.
The dedicated resource utilization of the Correlator Core
Unit is also examined and the results are shown in Figure 7.
0100200300400500600
4×4 8×8 16×16 32×32 64×64 128×128Number of dedicate d blocks
Template image sizeDedicated resource (DSP48E1 Slice)
requirement
DSP48E1

Figure 7. Dedicated resource requirment (e.g. DSP48E1 Slice) of the entire
Correlator Core Unit
As Figure 7 shows, the number of dedicated DSP48E1
slices is also increased linearly depending on the size of the
template image. Assuming a MAC operation on a 4×4–sized

template image at least 8 multipliers are required inside the
MAC Tree unit. In addition, independently from the template size (see Figure 5.), 18 DSP multipliers are also required to calculate the correlation result in the last phase. In case of a 4×4–sized kernel 34 DSP48E1 slices will be occupied, because the input of the multipliers is 29 bit wide, therefore two 25×18-bit multipliers from the DSP48E1 slice are cascaded during the synthesis process.
During the synthesis tests not only the total resource
requirements of the entire Correlator Core Unit are investigated, but its internal submodules are also verified. As previously stated, the Core Correlator Unit is basically built up from 4 main components (according to the hierarchy of the VHDL description). These components having a modular structure are the following: the Register Array unit, the Subtractor and MAC Tree unit, the Shift Register and finally the Final Result unit. General and dedicated resource requirements of the Final Result unit are essentially independent of the size of the template. Therefore only the resource requirements of the remaining three parts are shown in Fig. 8. (the Register Array unit, the Subtractor and MAC Tree unit, and the Shift register unit).
Examining the area requirements of the different units in
case of different sized templates shows linear dependence in each case similarly to the global resource requirements. The Shift Register unit has the highest resource requirement due to its wide data path. The Subtractor and MAC Unit also contains DSP48E1 slices for MAC units, the number of slices is increasing linearly according to the template size.

110100100010000100000
4×4 8×8 16×16 32×32 64×64 128x128Number of resources
Template image sizeResource requirement of Subtractor and
MAC tree unit
Slice Flip-Flop 6-input LUT110100100010000
4×4 8×8 16×16 32×32 64×64 128x128Number of resources
Template image sizeGeneral resource requirement of
Register Array unit
Slice Flip-Flop 6-input LUT110100100010000100000
4×4 8×8 16×16 32×32 64×64 128x128Number of resources
Template image sizeGeneral resource requirement of Shift
Register unit
Slice Flip-Flop 6-input LUT
a.) b.) c.)
Figure 8. General resource requirment of the main parts of Correlator Core Unit a.) Subtractor and MAC tree resource requirement, b.) Reg ister
Array unit resource requirement, c.) Shift Register unit resource requirement
After examining the area requirements of the Correlator
Core Unit its maximum computing performance is also investigated. Our tests showed that the clock frequency of the unit is just slightly depending on the size of the template size and 300MHz clock frequency can be achieved in each case.
Our results are compared to an existing fully-parallel
implementation [3]. The comparison is made with a test image having 511×511 pixel size and a 64×64 pixel-sized kernel. In our proposed semi-parallel implementation the required processing time is about 57 ms, while the architecture in [3] the processing time is only 0.652 ms. However, our solution consumes only 274 DSP MAC slices, while in case of fully-parallel implementation [3] 14-times more dedicated MAC blocks are required.
V. C
ONCLUSIONS
In the paper design tradeoffs of a configurable image
correlation architecture is represented. The semi parallel operation of the proposed architecture provides a reasonable compromise between the completely serial and completely parallel solutions in terms of implementation area and computing performance. The proposed architecture can be implemented on a mid-sized Virtex-7 series FPGA due to the moderate area requirements. On the other hand the high computing performance of the architecture makes it possible real-time processing of 511×511 sized images using even 64×64 sized kernels. Currently, the size of the kernel should be a power of two, and the size of the image is also restricted, in a future version these limitations will be eliminated to get freely configurable image correlation architecture.
A
CKNOWLEDGMENT
This research project is supported by the János Bolyai
Research Scholarship of the Hungarian Academy of Sciences and OTKA Grant No. K84267. The support of the grants TÁMOP-4.2.1.B-11/2/KMR-2011-0002 and TÁMOP-4.2.2/B-10/1-2010-0014 is gratefully acknowledged.
R
EFERENCES
[1] T. Laviniu, A. Gacsádi, I Gavrilu ț V. Tiponu ț “A CNN Based
Correlation Algorithm to Assist Visually Impaired Persons” Signals,
Circuits and Systems (ISSCS), Iasi, Romania, 10th International
Symposium on pp. 1–4, June 2011
[2] Donald G. Bailey “Design for Embedded Image Processing on FPGAs” 2011, Wiley-IEEE Press
[3] Almudena L., Luis E. “High performance FPGA-based image
correlation” J. Real-Time Image Proc., Vol. 2, Special Issue, pp. 223-
233, Springer, 2007
[4] Xilinx Inc [Webpage], 2012: www.xilinx.com

Similar Posts