
Optimal location estimation using a metaheuristic algorithm and deep learning for visual tracking

Djamel Eddine TOUIL
Energy Systems Modeling Laboratory, University of Biskra, Algeria
[anonimizat]

Nadjiba TERKI
Department of Electrical Engineering, University of Biskra, Algeria
[anonimizat]

Abstract — Over the last decade, visual tracking has been one of the most active topics in computer vision, which has led to the appearance of numerous visual trackers. Despite their efficiency, existing trackers fail to handle challenging appearance changes of the tracked object in complex image sequences, such as significant deformation, fast motion and occlusion. In this paper, we address these problems and propose a novel method for robust location estimation in tracking. The proposed method is based on two main steps. First, we adaptively learn three correlation filters in the spatial domain in order to improve tracking accuracy and efficiency; the particle swarm optimization (PSO) algorithm is a suitable tool for solving the update equation of the correlation filter models. Second, we learn these models on hierarchical convolutional features to encode the target appearance. Extensive experiments on a reference dataset justify the robustness of the proposed method.

Keywords — CNN; visual tracking; particle swarm optimization.
I. INTRODUCTION
Object tracking is one of the most active topics in computer vision, with various applications [1-3]. The common tracking process starts from the initial position of the target in the first frame and then estimates its state parameters in each subsequent frame. The last decade has witnessed significant progress in visual tracking through the appearance of many tracking methods [3, 4]. Nevertheless, these methods suffer from several challenges such as background clutter, occlusion, shape deformation and camera motion [5]. Recently, deep learning, and especially the convolutional neural network (CNN) [6, 7], has proved its efficiency in extracting semantic and detailed features of targets, compared to other techniques, on a wide range of visual tasks [8], e.g., image classification [9], object detection [10] and saliency detection [11], as a way to create a model that can discriminate between a large variety of foreground and background samples.

The CNN technique has gone through several stages of improvement. At the beginning, CNN-based trackers trained the model on large-scale data using raw pixels with a fixed input size. However, such a model is limited to specific targets, since it is trained offline before tracking and corrected afterward. Fan et al. [12] trained a CNN for the human tracking task. The DLT method [13] proposed a multilayer network for feature generation; this network is pre-trained on a part of the 80M Tiny Images dataset [14]. In [13], the CNN model is learned by treating each auxiliary sequence as a specific domain. Li et al. [14] worked on removing noisy samples during model updates by constructing multiple classifiers on different cases of tracked objects. In [15], a pre-trained CNN model is used to learn the saliency map of a specific target. These methods use only the last CNN layer for feature extraction, which explains their weak results on large-scale datasets [7, 16].
Otherwise, correlation-filter-based methods, and especially the discriminative correlation filter (DCF) tracking framework, have shown considerable efficiency through the use of fast Fourier transforms, which improve computational efficiency. Nevertheless, their drawback is the dependence on a grayscale image as the feature channel [17]. Recently, significant works have improved the DCF framework. In [17], learning spatially regularized discriminative correlation filters is proposed on a larger set of negative training samples, without distorting the positive ones. Ma et al. [18] proposed exploiting convolutional features within the kernelized correlation filter (KCF) framework [4]: three correlation filter models are learned adaptively on features extracted from three CNN layers. Experimental results justified the effectiveness of their approach against the state-of-the-art methods. However, their approach fails to maintain long-term tracking, being limited in handling heavy occlusion [17]. To solve these issues, our main contributions can be summarized as follows:
 We employ the particle swarm optimization algorithm [19, 20] to learn three correlation filter models in the spatial domain, in order to handle the challenges of fast motion, deformation and occlusion effectively.
 To justify our contributions, our method is evaluated on a reference benchmark dataset.
This paper is organized as follows. Section 2 presents the proposed method in detail, and Section 3 presents the use of the particle swarm optimization algorithm. Section 4 demonstrates the efficiency of our tracker through a clear comparison with several reference trackers. Section 5 presents the failure cases. Finally, Section 6 gives the conclusion and future work.
II. PROPOSED ALGORITHM
In this section, the main steps of the proposed method are presented. First, we estimate the position of the target in each frame using three correlation filter models learned on hierarchical convolutional features, and we produce the final confidence output by summing the multilevel correlation response maps. Second, we propose the use of the particle swarm optimization algorithm to learn these correlation filter models. Fig. 1 illustrates the flowchart of the proposed method.
A. Correlation filters
In the field of object tracking, methods based on correlation filters have performed favorably against other methods [17, 18, 21, 22], owing to the efficiency of the correlation filter in encoding the target appearance, i.e., in generating an informative correlation map. We convolve the target's template with the learned filter model in every frame. Similarly to [18], we search for the maximum value of the correlation map, which indicates the estimated target position. We adopt a hierarchical convolutional feature vector x of size M×N×D, obtained from the conv5-4, conv4-4 and conv3-4 layers of the VGG-Net-19 model [23], to learn three correlation filter models w, where M, N and D denote the width, the height and the number of channels. The size of each correlation filter is M×N.
We regress each circularly shifted sample x(m, n), (m, n) ∈ {0, 1, …, M−1} × {0, 1, …, N−1}, to the soft Gaussian label

y(m, n) = exp( −((m − M/2)² + (n − N/2)²) / (2σ²) ),

where σ is the kernel width. To learn the correlation filter models, we solve equation (1):

w* = argmin_w Σ_{m,n} | w · x(m, n) − y(m, n) |² + λ ‖w‖²,   (1)

where λ is a regularization parameter.
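For illustration, the Gaussian label and the cost of (1) can be sketched in NumPy. This is a minimal single-channel sketch (the method itself evaluates the cost over multichannel CNN features), and the zero-padded spatial correlation is our assumption of how the boundary-respecting response is computed:

```python
import numpy as np

def gaussian_label(M, N, sigma=2.0):
    # soft label y(m, n): a Gaussian peaked at the window centre (M/2, N/2)
    m = np.arange(M).reshape(-1, 1)
    n = np.arange(N).reshape(1, -1)
    return np.exp(-((m - M / 2) ** 2 + (n - N / 2) ** 2) / (2 * sigma ** 2))

def cf_cost(w, x, y, lam=1e-4):
    # cost of Eq. (1): squared error between the filter response and the
    # label, plus L2 regularisation; the response is a zero-padded spatial
    # correlation, so image boundaries are respected (unlike cyclic FFT)
    M, N = x.shape
    pad = np.pad(x, ((M // 2, M // 2), (N // 2, N // 2)))
    resp = np.empty_like(x)
    for m in range(M):
        for n in range(N):
            resp[m, n] = np.sum(w * pad[m:m + M, n:n + N])
    return np.sum((resp - y) ** 2) + lam * np.sum(w ** 2)
```

A candidate filter w proposed by a PSO particle would be scored with `cf_cost`, lower being better.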

We use the particle swarm optimization algorithm to optimize (1) in the spatial domain, because the fast Fourier transform (FFT) assumes periodicity and does not respect the image boundaries: the large discontinuity between opposite edges of a non-periodic image results in a poor Fourier representation [4]. We adopt the FFT only for computing (2):

f^l = F^{-1}( Σ_{d=1}^{D} W̄^{l,d} ⊙ Z^{l,d} ),   (2)

The tracking process is executed on the feature vector Z^l to compute the l-th correlation score f^l, where F^{-1} denotes the inverse FFT, capital letters denote the Fourier transforms of the corresponding spatial quantities, the bar denotes complex conjugation and ⊙ is the element-wise product. We derive the new location (x̂, ŷ) of the target by searching for the maximum of the response map f of size M×N.
B. Hierarchical convolutional features
Analysis of deep representations is important for understanding the mechanism of deep learning; however, it is still rare in visual tracking. We present the main characteristics of the CNN features that facilitate the visual tracking task. As shown in Fig. 1, features are extracted from three layers of the VGG-Net-19 model [23]. The process starts from a given image frame with a search window of size M×N. We resize the search window using bilinear interpolation, because the CNN input requires a resolution of 224×224. Afterward, we accumulate the outputs of layers conv3-4, conv4-4 and conv5-4 to generate a multichannel feature [24]. In addition, boundary discontinuities are removed from each feature channel by weighting it with a cosine window [21]. Then, we rescale each feature map to a fixed larger size of M/4 × N/4 using (3), to alleviate the resolution differences between the layers.

Fig. 1. Flowchart of the proposed method. The correlation filter models w (green rectangle) are generated by the particle swarm optimization algorithm; the response map is used to estimate the target location by finding the position of its maximum value (red rectangles).
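A NumPy-only sketch of the bilinear rescaling of a single feature channel to the common grid (illustrative only; the paper's implementation relies on MatConvNet):

```python
import numpy as np

def bilinear_resize(fmap, out_h, out_w):
    # map each output pixel to fractional source coordinates, then blend
    # the four surrounding samples with bilinear weights
    in_h, in_w = fmap.shape
    r = np.linspace(0, in_h - 1, out_h)
    c = np.linspace(0, in_w - 1, out_w)
    r0 = np.floor(r).astype(int); r1 = np.minimum(r0 + 1, in_h - 1)
    c0 = np.floor(c).astype(int); c1 = np.minimum(c0 + 1, in_w - 1)
    wr = (r - r0).reshape(-1, 1)
    wc = (c - c0).reshape(1, -1)
    top = fmap[np.ix_(r0, c0)] * (1 - wc) + fmap[np.ix_(r0, c1)] * wc
    bot = fmap[np.ix_(r1, c0)] * (1 - wc) + fmap[np.ix_(r1, c1)] * wc
    return top * (1 - wr) + bot * wr
```

Each channel of conv3-4, conv4-4 and conv5-4 would be passed through such a resizer so that the three layers share one response-map resolution.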
C. Target location estimation
Note that we use the particle swarm optimization algorithm to generate a correlation filter model for each resized feature channel based on (1), obtaining the correlation response maps {f^l} using (2). Each correlation response map is weighted by a regularization parameter γ_l, and we sum the correlation outputs of the layers to produce the final confidence output. The estimated target position (x*, y*) is obtained by solving (4), i.e.,

(x*, y*) = argmax_{m,n} Σ_{l ∈ {3,4,5}} γ_l f^l(m, n),   (4)

where the weights γ_l are experimentally set to 0.019, 0.4 and 1.
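A sketch of the per-layer correlation score (2) and the weighted fusion (4); here W and Z stand for one layer's filter and feature tensors, and the default weights are the γ values reported above (a simplified NumPy illustration, not the full tracker):

```python
import numpy as np

def layer_response(W, Z):
    # Eq. (2): inverse FFT of the conjugated filter spectrum times the
    # feature spectrum, summed over the D channels (inputs are M x N x D)
    F = np.fft.fft2(Z, axes=(0, 1))
    H = np.conj(np.fft.fft2(W, axes=(0, 1)))
    return np.real(np.fft.ifft2(np.sum(H * F, axis=2)))

def fuse_and_locate(responses, gammas=(0.019, 0.4, 1.0)):
    # Eq. (4): weighted sum of the conv3/4/5 response maps, then an
    # argmax to read off the estimated target position (row, col)
    f = sum(g * r for g, r in zip(gammas, responses))
    return np.unravel_index(np.argmax(f), f.shape)
```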
D. Updating models
In order to obtain a correlation filter model that can handle most appearance changes, we update the appearance template and the correlation filter models in each frame with a learning rate η:

x̂^t = (1 − η) x̂^{t−1} + η x^t,   (5.a)
Ŵ^t = (1 − η) Ŵ^{t−1} + η W^t,   (5.b)

where x^t and W^t are the feature template and the correlation filter model learned in frame t.
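The update (5) is a running linear interpolation between the previous model and the one learned in the current frame; as a one-line sketch:

```python
import numpy as np

def update_model(prev, new, eta=0.01):
    # Eq. (5): exponential moving average with learning rate eta,
    # applied to both the appearance template and the filter model
    return (1 - eta) * prev + eta * new
```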
III. PSO ALGORITHM
PSO, or Particle Swarm Optimization, is an optimization metaheuristic inspired by the movement of a flock of birds. This optimization method is based on the collaboration of individuals with each other, which explains its ability to solve non-linear, multimodal and high-dimensional optimization problems, especially in the visual tracking topic [25]. In this section we present a detailed explanation of the use of particle swarm optimization in our method [26]. The idea is to optimize (1) using the PSO dynamics in order to reach the appropriate correlation filter models. Algorithm 1 summarizes the basic steps of particle swarm optimization; more implementation details are discussed in the following.

Algorithm 1. Pseudo-code of the PSO algorithm
Input: problem size, CNN features, correlation filter models of the previous frame
Output: M_g,best
for i = 1 to n do
    p_i <- rand(Problem size)
    v_i <- rand(Problem size)
    Generate M_i(n, Problem size, CNN features)
    Estimate M_i,best according to Eq. (1)
    p_i,best <- p_i
    if cost(p_i,best) < cost(p_g,best) then
        p_g,best <- p_i
        M_g,best <- M_i
    end
end
while iteration number < 200 do
    for i = 1 to n do
        Update v_i according to Eq. (6)
        Update p_i according to Eq. (7)
        Update M_i according to Eq. (8)
        if cost(p_i) < cost(p_i,best) then
            p_i,best <- p_i
            M_i,best <- M_i
            if cost(p_i,best) < cost(p_g,best) then
                p_g,best <- p_i,best
            end
        end
    end
end
return M_g,best

Fig. 2. Comparison of our proposed method with and without the PSO algorithm on the challenging CarScale and FaceOcc1 sequences. It is clear that our tracker is more accurate in handling occlusion and fast motion.
TABLE I. PSO PARAMETERS
Parameter              Value
c1                     1.3
c2                     0.9
w (inertia)            0.9
n (population size)    20
A. Procedure
The particle swarm optimization algorithm consists of a collection of particles that move in the search space based on their own past best location and the best location of the entire swarm or of near neighbors [26]. Each particle's velocity is updated by:

v_i(t+1) = w v_i(t) + c1 r1 (p_i,best − p_i(t)) + c2 r2 (p_g,best − p_i(t)),   i = 1, 2, …, n,   (6)

where v_i(t+1) is the new velocity of the i-th particle, n is the population size, r1 and r2 are random numbers drawn uniformly in [0, 1], c1 and c2 are the weighting coefficients for the personal best and global best positions respectively, w is the inertia weight, p_i(t) and M_i(t) are the i-th particle's position and correlation filter models at time t, p_i,best and M_i,best are the i-th particle's best position and correlation filter models, and p_g,best and M_g,best are the best position and correlation filter models known to the swarm. Using equations (7) and (8), the particle's position and correlation filter models are updated respectively [26]:

p_i(t+1) = p_i(t) + v_i(t+1),   (7)

M_i(t+1) = M_i(t) + v_i(t+1).   (8)
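One PSO step over the particle positions, per (6)-(7), can be sketched as follows (we assume the velocity update carries the inertia weight w of Table I; the model update (8) applies the same displacement to M_i):

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_step(p, v, p_best, g_best, c1=1.3, c2=0.9, w=0.9):
    # Eq. (6): inertia term plus the cognitive pull (towards each
    # particle's own best) and the social pull (towards the swarm best),
    # with random factors r1, r2 drawn uniformly in [0, 1]
    r1 = rng.random(p.shape)
    r2 = rng.random(p.shape)
    v = w * v + c1 * r1 * (p_best - p) + c2 * r2 * (g_best - p)
    # Eq. (7): move each particle along its new velocity
    return p + v, v
```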

B. Correlation filter models
We initialize each particle with a set of candidate models for each layer, with sizes adapted to those of the CNN features, as shown in Algorithm 1. Ten correlation filter models are used for each particle. Note that this initialization is done using the correlation filter models of the previous frame in the spatial domain.

Fig. 3. Distance precision and overlap success for four tracking challenges: fast motion, deformation, low resolution and occlusion. It is clear that our proposed method outperforms the state-of-the-art trackers.

Fig. 4. Distance precision and overlap success on the OTB-50 [7] dataset using OPE. The precision legend reports the score at a threshold of 20 pixels, while the overlap success legend reports the score at a threshold of 0.5 for each tracker.

Once the models are initialized, we evaluate equation (1) in order to choose the best three models. Within 200 iterations, the PSO dynamics can effectively reach three suitable correlation filter models, which gives an accurate localization of the target in most frames.

Fig. 2 shows that our PSO-based method is effective in handling object appearance changes, especially fast motion, deformation and occlusion. In fact, an efficient learning of the correlation filter models as the solution of (1) gives an appropriate model that allows our tracker to recover from tracking failure cases in which the fast Fourier transform formulation does not.
IV. EXPERIMENTAL RESULTS
A. Setups and methodology
We evaluate the experimental results based on: distance precision (DP), which is the percentage of frames whose estimated location lies within a threshold of 20 pixels of the ground truth [7], and overlap success (OS), which is defined as the percentage of frames in which the bounding box overlap exceeds a given threshold t ∈ [0, 1]. OS results are reported at t = 0.5 [7]. We implement the proposed method on a PC with 32 GB of RAM and a 4.00 GHz Intel Core i7-6700K CPU. We use the MatConvNet toolbox [27] to extract the convolutional features. For translation estimation, we use a search window of 1.8 times the target size. In (5), the parameter η is set to 0.01, and the regularization parameter λ in (1) is set to 10^-4. Table I summarizes the parameters of the particle swarm optimization algorithm.
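The two metrics can be sketched as follows (centres as (x, y) pairs, boxes as (x, y, w, h); an illustrative NumPy implementation, not the official benchmark toolkit):

```python
import numpy as np

def distance_precision(pred_centers, gt_centers, thresh=20.0):
    # DP: fraction of frames whose predicted centre lies within
    # `thresh` pixels of the ground-truth centre
    d = np.linalg.norm(np.asarray(pred_centers, float)
                       - np.asarray(gt_centers, float), axis=1)
    return float(np.mean(d <= thresh))

def overlap_success(pred_boxes, gt_boxes, thresh=0.5):
    # OS: fraction of frames whose intersection-over-union between the
    # predicted and ground-truth boxes exceeds `thresh`
    p = np.asarray(pred_boxes, float)
    g = np.asarray(gt_boxes, float)
    x1 = np.maximum(p[:, 0], g[:, 0])
    y1 = np.maximum(p[:, 1], g[:, 1])
    x2 = np.minimum(p[:, 0] + p[:, 2], g[:, 0] + g[:, 2])
    y2 = np.minimum(p[:, 1] + p[:, 3], g[:, 1] + g[:, 3])
    inter = np.maximum(x2 - x1, 0) * np.maximum(y2 - y1, 0)
    union = p[:, 2] * p[:, 3] + g[:, 2] * g[:, 3] - inter
    return float(np.mean(inter / union >= thresh))
```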
B. Overall comparison
Our proposed tracking method is evaluated on a large benchmark dataset [7] that contains 50 videos. To make the evaluation more credible, an extensive comparison is carried out with 9 reference methods: KCF [4], DLT [13], HCF [18], MEEM [28], TGPR [29], Struck [30], SCM [31], TLD [32] and LSHT [33]. The results are reported using the one-pass evaluation (OPE) metric. Overall, it is clear that the proposed tracker performs more efficiently than the state-of-the-art methods. Fig. 4 illustrates that, among the state-of-the-art methods, the HCF tracker attains the best results, with a distance precision of 89.1% and an overlap success rate of 74% on the OTB-50 dataset. Nevertheless, the proposed method proves its accuracy with an improvement of 0.1% in DP. To demonstrate the efficiency of the proposed approach in handling fast motion, occlusion and deformation, Fig. 2 illustrates a comparison of tracking results for these attributes on two challenging sequences. Our PSO-based tracker overcomes these challenging attributes thanks to the optimal correlation filters obtained from the particle swarm optimization algorithm; without this module, our tracker drifts. In order to give more credibility to our work, we report results for the challenging tracking attributes in Fig. 3, which show that our contributions improve the HCF tracker in handling deformation, fast motion, low resolution and occlusion. Among the existing methods, our tracker performs best, with DP scores of 80.2% for fast motion, 94.5% for deformation, 92.8% for low resolution and 88.1% for occlusion, and success rates of 71.3%, 81.2%, 68.9% and 74.7% respectively.
Meanwhile, the HCF tracker outperforms our method in the overlap success part because we have not treated scale estimation. In fact, the favorable performance of our method is due to two main reasons. First, we use the PSO algorithm to learn three correlation filter models in the spatial domain. Second, we update these models to handle appearance changes.
V. FAILURE CASE
We show the failure cases on the Singer2 and Lemming sequences in Fig. 5. In the Lemming sequence, a heavy occlusion occurs between the 370th and the 380th frames, making the proposed tracker drift away from the target, in contrast to the Struck [30] tracker, which uses a re-detection module. On the other hand, our tracker drifts in most frames of the Singer2 sequence, in contrast to the KCF [4] tracker. Consequently, this affects the overlap success part shown in Fig. 4.
VI. CONCLUSION
In this paper, we propose a robust method for visual tracking. We learn three discriminative correlation filter models using the particle swarm optimization algorithm. Experimental results show that our tracker outperforms the state-of-the-art trackers. Although our method has proved effective, it is still limited in recovering the target in cases of occlusion and illumination variation. Our future work will focus on keeping our tracker from drifting in these cases by employing a robust detection module.
REFERENCES
[1] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Computing Surveys, vol. 38, no. 4, pp. 13-57, 2006.
[2] Q. Zhao, W. Yao, and J. Xie, "Research on object tracking with occlusion," WIT Transactions on Information and Communication Technologies, vol. 51, pp. 537-545, 2014.
[3] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 564-577, May 2003.
[4] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," IEEE Trans. Pattern Anal. Mach. Intell., 2014.
[5] Q. Hu, Y. Guo, Z. Lin, W. An, and H. Cheng, "Robust and real-time object tracking using scale-adaptive correlation filters," in Proc. Int. Conf. Digital Image Computing: Techniques and Applications (DICTA), Gold Coast, QLD, 2016, pp. 1-8.
[6] X. Li, W. Hu, C. Shen, Z. Zhang, A. Dick, and A. van den Hengel, "A survey of appearance models in visual object tracking," ACM Trans. Intell. Syst. Technol., vol. 4, no. 4, pp. 478-488, 2013.
[7] Y. Wu, J. Lim, and M.-H. Yang, "Online object tracking: A benchmark," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2013, pp. 2411-2418.
[8] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. NIPS, 2012.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. CVPR, 2014.
[11] R. Zhao, W. Ouyang, H. Li, and X. Wang, "Saliency detection by multi-context deep learning," in Proc. CVPR, 2015.
[12] J. Fan, W. Xu, Y. Wu, and Y. Gong, "Human tracking using convolutional neural networks," IEEE Trans. Neural Netw., vol. 21, no. 10, pp. 1610-1623, Oct. 2010.
[13] N. Wang and D. Y. Yeung, "Learning a deep compact image representation for visual tracking," in Proc. NIPS, 2013, pp. 809-817.
[14] H. Li, Y. Li, and F. Porikli, "DeepTrack: Learning discriminative feature representations by convolutional neural networks for visual tracking," in Proc. BMVC, 2014.
[15] S. Hong, T. You, S. Kwak, and B. Han, "Online tracking by learning discriminative saliency map with convolutional neural network," in Proc. 32nd Int. Conf. Mach. Learn. (ICML), 2015, pp. 597-606.
[16] Y. Wu, J. Lim, and M.-H. Yang, "Object tracking benchmark," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1834-1848, Sep. 2015.
[17] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg, "Learning spatially regularized correlation filters for visual tracking," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2015, pp. 4310-4318.
[18] C. Ma, J. Huang, X. Yang, and M.-H. Yang, "Hierarchical convolutional features for visual tracking," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2015, pp. 3074-3082.
[19] J. Kennedy, "Particle swarm optimization," in Encyclopedia of Machine Learning, Springer US, 2011, pp. 760-766.
[20] R. Poli, J. Kennedy, and T. Blackwell, "Particle swarm optimization," Swarm Intelligence, vol. 1, no. 1, pp. 33-57, 2007.
[21] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, "Visual object tracking using adaptive correlation filters," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2010, pp. 2544-2550.
[22] C. Ma, Y. Xu, B. Ni, and X. Yang, "When correlation filters meet convolutional neural networks for visual tracking," IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1454-1458, 2016.
[23] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[24] H. Kiani Galoogahi, T. Sim, and S. Lucey, "Multi-channel correlation filters," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2013, pp. 3072-3079.
[25] C. Mollaret, F. Lerasle, I. Ferrané, and J. Pinquier, "A particle swarm optimization inspired tracker applied to visual tracking," in Proc. IEEE Int. Conf. Image Process. (ICIP), 2014, pp. 426-430.
[26] J. Brownlee, Clever Algorithms: Nature-Inspired Programming Recipes, 2011.
[27] A. Vedaldi and K. Lenc, "MatConvNet: Convolutional neural networks for MATLAB," in Proc. 23rd ACM Int. Conf. Multimedia, 2015, pp. 689-692.
[28] J. Zhang, S. Ma, and S. Sclaroff, "MEEM: Robust tracking via multiple experts using entropy minimization," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014, pp. 188-203.
[29] J. Gao, H. Ling, W. Hu, and J. Xing, "Transfer learning based visual tracking with Gaussian processes regression," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014, pp. 188-203.
[30] S. Hare, S. Golodetz, A. Saffari, V. Vineet, M. M. Cheng, S. L. Hicks, and P. H. Torr, "Struck: Structured output tracking with kernels," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 10, pp. 2096-2109, 2016.
[31] W. Zhong, H. Lu, and M.-H. Yang, "Robust object tracking via sparse collaborative appearance model," IEEE Trans. Image Process., vol. 23, no. 5, pp. 2356-2368, 2014.
[32] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1409-1422, 2012.
[33] S. He, Q. Yang, R. W. Lau, J. Wang, and M.-H. Yang, "Visual tracking via locality sensitive histograms," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2013, pp. 2427-2434.
