AUTOMATION AND COMPUTER SCIENCE FACULTY
eng. VAJDA TAMÁS
PHD THESIS
HUMAN BEHAVIOR RECOGNITION
IN VIDEO SEQUENCES
Advisor,
Prof. dr. eng. SERGIU NEDEVSCHI
2013
Doctoral thesis evaluation committee:
CHAIR: – Prof.dr.ing. Liviu Miclea, Universitatea Tehnică din Cluj-Napoca
MEMBERS: – Prof.dr.ing. Sergiu Nedevschi – scientific advisor, Universitatea Tehnică din Cluj-Napoca;
– Prof.dr.ing. Ștefan-Gheorghe Pentiuc – reviewer, Universitatea Ștefan cel Mare Suceava;
– Prof.dr.ing. Vladimir-Ioan Crețu – reviewer, Universitatea „Politehnica” din Timișoara;
– Prof.dr.ing. Gheorghe Sebestyen – reviewer, Universitatea Tehnică din Cluj-Napoca.
Acknowledgements
I owe great thanks to my advisor, Prof. dr. eng. Sergiu Nedevschi, for his patience and
generosity. It’s a pleasure to have known him and I consider myself lucky for having been his
student.
Thanks to my colleagues and collaborators. Thanks to my family for their support all
these years.
To my wife
Table of Contents

ABBREVIATIONS
1. INTRODUCTION
1.1 PROBLEM STATEMENT
1.2 MAIN CONTRIBUTIONS
1.3 THESIS OVERVIEW
2. PROBLEM OVERVIEW
2.1 PREPROCESSING
2.2 HUMAN DETECTION
2.2.1 “Single detection window” analysis
2.2.2 “Component-based” methods
2.3 ACTIVITY AND BEHAVIOR RECOGNITION
2.3.1 Single-layered approaches
2.3.2 Hierarchical approaches
2.4 CONCLUSIONS
3. PREPROCESSING
3.1 PROBLEM STATEMENT
3.2 OPTICAL FLOW
3.3 BACKGROUND SUBTRACTION
3.4 INTEGRAL IMAGE BASED BACKGROUND SUBTRACTION
3.4.1 Integral Image
3.4.2 Estimating the integral background image and shadow detection
3.5 DISCUSSIONS AND EXPERIMENTS
3.6 CONCLUSIONS
4. HUMAN DETECTION AND POSE ESTIMATION
4.1 INTRODUCTION
4.2 HUMAN DETECTION AND POSE ESTIMATION WITH “EXAMPLE-BASED” METHOD
4.2.1 Haar feature
4.2.2 The AdaBoost algorithm
4.2.3 The cascade classifier
4.2.4 Human body detector and pose estimation tree
4.2.5 Building the training set
4.2.6 Training the tree
4.2.7 Experiments and discussion
4.3 TEMPLATE BASED HUMAN DETECTION METHODS
4.3.1 Distance transformation
4.3.2 Chamfer distance
4.3.3 Fast distance transform computation
4.3.4 Pseudo parallel computation of the distance transform
4.3.5 Chamfer matching
4.3.6 Matching a high number of templates: template space
4.3.7 Human detection and tracking system with pose estimation and experiments
4.3.8 Results and discussion
4.4 PICTORIAL STRUCTURES
4.4.1 Definition of pictorial structures
4.4.2 Statistical framework
4.4.3 Body parts and connections
4.4.4 Learning parameters
4.4.5 Finding the optimal configuration
4.4.6 Sampling from the posterior
4.4.7 Framework extension
4.4.8 System framework and experiments
4.4.9 Results
4.5 COMPARISON OF THE HUMAN DETECTION METHODS
4.6 CONCLUSIONS
5. RECOGNIZING HUMAN ACTION AND BEHAVIOR
5.1 RECOGNIZING ACTIONS
5.1.1 Dynamic time warping
5.1.2 Dimensionality reduction and motion decomposition
5.1.3 Heuristic fast dynamic time warping methods
5.1.4 Classification using neural networks
5.1.5 Results
5.2 RECOGNIZING BEHAVIOR
5.2.1 Hierarchical Probabilistic Petri Net
5.2.2 Experiment
5.3 CONCLUSIONS
6. CONCLUSIONS AND FUTURE WORK
6.1 SUMMARY OF RESULTS
6.2 FUTURE RESEARCH
BIBLIOGRAPHY
LIST OF FIGURES
LIST OF TABLES
APPENDIX
Abbreviations
2D – two-dimensional
3D – three-dimensional
4D – four-dimensional
CFG – Context-Free Grammars
CHMM – Coupled Hidden Markov Model
CHMS – Chamfer Matching using Hierarchical and Motion Space Templates database
DBN – Dynamic Bayesian Networks
DTW – Dynamic Time Warping
EM – Expectation-Maximization
HCI – Human-Computer Interface
HMM – Hidden Markov Model
HOG – Histograms of Oriented Gradients
HPPN – Hierarchical Probabilistic Petri Net
KDE – Kernel Density Estimator
LHMM – Layered Hidden Markov Model
LTI – Linear Time Invariant
LVQ – Learning Vector Quantization
MAP – Maximum A Posteriori probability
MCMC – Markov Chain Monte Carlo
MEI – Motion Energy Image
MHI – Motion History Image
MLE – Maximum Likelihood Estimation
MLP – Multi-Layer Perceptron
NN – Neural Network
NTS – Negative Training Set
OPSF – Optimized Pictorial Structure based Framework
PCA – Principal Component Analysis
PDF – probability density function
PETC – Pose Estimation Tree Classifier
PLSA – Probabilistic Latent Semantic Analysis
PLSA-ISM – Probabilistic Latent Semantic Analysis – Implicit Shape Model
P-net – propagation network
PNF-network – Past, Now, Future network
PPCDT – Pseudo Parallel Computation of the Distance Transform
PTS – Positive Training Set
RBF – Radial Basis Function
ROC – Receiver Operating Characteristic
SCFG – Stochastic Context-Free Grammars
SIFT – Scale-Invariant Feature Transform
SSD – sum-of-squared difference
STR – Spatio-Temporal Relationship
SVM – Support Vector Machines
XYT – coordinate system with space axes X, Y (plane) and time axis T
XYZT – coordinate system with space axes X, Y, Z (3D) and time axis T
1. Introduction
„Privacy remains one of the ethically imperative issues of the information age” [84], and
it will remain so because of human nature. We desire safety (physical and spiritual) and the
privacy of our lives, but our actions and behaviors are driven by our own interests, which
sometimes can be satisfied only by violating other people's safety and/or privacy. To guard our
safety, thousands of surveillance cameras have been installed all over the world. Human
operators process the information gathered by the cameras slowly and inefficiently and,
moreover, they intrude on others' privacy. One possible solution to this issue is to build an
automatic system that would do the human operator's job better and faster. This problem is
difficult. First, the number of people, if any, has to be determined; then their location and pose
(body and limb configuration) need to be estimated. Detecting people and estimating their pose
is challenging, because people move fast, wear different clothes and appear in different poses.
This thesis addresses a number of key issues that arise in building an automatic
system that understands human behaviors from videos and images.
In recent years, many researchers have intensively investigated this topic. Each
solution given by the research community has solved only a particular part of the issue, and most
of them are not accurate enough. Two important questions need to be answered before such a
system is developed. The first question is: what level of understanding is needed? The second:
what level of understanding is possible, given the quantity of information in the images or
videos?
The simplest approach is to handle humans as blobs, points or rectangular regions. In
this case, “understanding” simply means tracking the blobs as they move across the image
[149]. This approach can be enough if we want to understand human movement in public places
such as parks. However, people are more than blobs: they are capable of gestures and actions
with their limbs and arms.
A more complex approach to human motion recognition tracks the human contour.
The human contour can be obtained from images, but it does not allow every body part to be
tracked separately. To handle this problem, key positions or specific features extracted from
simple or repetitive motions can be used for recognition.
The highest level of understanding is achieved when we can track all body parts
individually. Of course, these methods need the highest resolution and the most processing
power.
1.1 Problem Statement
Given a sequence of images, we propose to detect the human presence in the
frames, estimate the body configuration of the people and, based on the body configuration,
recognize the human activity and behavior. In this thesis, we present a general framework and its
components, which detect the people in the images with maximum accuracy and offer the
highest level of understanding that the image resolution allows.
1.2 Main contributions
The main contributions to the field of human behavior recognition include:
– We carried out a survey of the most important methods in the field of human behavior
recognition systems.
– We studied and compared the most important foreground detection techniques, searching
for a generally good foreground detection technique that could reduce the search space for
human bodies and speed up the detection process. The study tries to identify which
foreground detection technique should be used in different situations to obtain the best
result.
– Using the integral image, a fast and reliable foreground detection technique was proposed,
which is comparable in speed with the running Gaussian average while its precision is
among the highest.
– We proposed new tree classifiers, which can achieve better recognition rates than similar
methods, as well as a categorization of human poses or attitudes. The training of the
Haar-based classifier can be further enhanced by feeding the misclassified images back
during the training phase.
– We introduced a new method to represent templates, with an associated metric, which
speeds up the matching of human body contours using Chamfer matching.
– In the case of pictorial-structure-based human detection, we introduce a new term in the
framework. This new term takes into account the relation between consecutive frames and
speeds up the entire matching and detection process. Another benefit of this new term is
the increased efficiency of the recognition.
– For activity recognition we proposed an improved Dynamic Time Warping (DTW)
method. We observed that the motion time series can be shrunk without affecting the
precision of DTW matching.
– For human behavior recognition we employed Petri Nets and showed that, using a
hierarchical construction, they are a very powerful and general tool for this purpose.
1.3 Thesis overview
The dissertation is organized into six chapters.
Chapter 2 describes the background of the human behavior recognition field. The
components of the system are detailed in this chapter, which also presents the most significant
work in this field.
Chapter 3 presents the preprocessing components of the system and seeks the answer
to the question of what kind of low-level vision processing should be used to speed up human
detection. For this purpose, we present the most important background subtraction methods and
compare them. Based on the conclusions resulting from the comparison, we present a rapid
background subtraction method.
Chapter 4 presents the most important part of the thesis. Organized in three subchapters,
it presents the results in human detection and pose estimation techniques covering different
approaches. The first technique is an artificial intelligence method that uses Haar features. The
second is a template-based method. The third is a component-based method that uses pictorial
structures to detect the humans in the image and estimate their body configuration. The
subchapters contain the description of the methods and experiments.
Chapter 5 presents the human behavior recognition component. This component has two
parts: the action recognition part and the activity or behavior recognition part. Both parts are
described in separate subchapters, in which we present a Heuristic FastDTW based action
recognition algorithm and a Petri Net based human behavior recognition algorithm.
Chapter 6 summarizes the main results of the thesis and provides conclusions. The last
chapter also presents some future research topics as extensions of this work.
2. Problem Overview
Researchers have been interested in the automatic recognition of human behavior for a
long time. Successful recognition of human behavior would allow several applications to reach
their full potential: visual surveillance – from automatically detecting suspicious behavior to
medical tracking and analysis of patients – and human-computer interaction (HCI), where
gesturing interfaces could make it easier to create smart offices or homes. There has been
progress in multiple directions in recognizing complex human actions, resulting in an enormous
number of algorithms. However, challenges remain in developing more robust and more general
algorithms. In this thesis, we investigate the existing methods and try to build reliable human
behavior recognition systems.
In system development, a significant step is to define the system requirements and the
working environment. Our system's goal is to increase the efficiency of video surveillance
applications. We identify some tasks to be handled and some restrictions that can be exploited in
developing this type of system. First, we need to work with 2D image sequences, because the
cameras widely used in this field provide only 2D images. Consequently, we work with the 2D
projection of a 3D object, which introduces plenty of uncertainty into the system: in some cases,
more than one 3D position of the human body can correspond to the same 2D projection. In the
case of a visual surveillance system, we can assume that the video acquisition system uses static
cameras or that the cameras move slowly. This fact is crucial because, in many cases, we can
then consider that we are dealing with a static background. We also need to handle some
significant problems. First, we have to deal with illumination changes: in many cases the change
is gradual, but there are situations when it happens suddenly. Another task to be solved is the
human object size, which can vary during detection. An important problem is occlusion and
self-occlusion, as we only see a part of the human body. If occlusion occurs, the performance of
the system is heavily affected.
Figure 1. General human behavior recognition system (pipeline: image acquisition system → preprocessing → human detection and pose estimation → activity and behavior recognition)
A general human behavior recognition system (Figure 1) has the following components:
– The image acquisition system, which can be online (one or more video cameras) or
offline (video servers).
– The preprocessing component, whose goal is to filter out errors from the image and
extract information using low-level tools. In our case, we try to extract some useful
information about the background, or the full background, to define or reduce the area
of the image we are interested in.
– The human detection and pose estimation component, which represents the most
valuable part of the system. The performance of this component highly influences the
overall performance of the system.
– The activity and behavior recognition component, which processes the response
provided by the detection and pose estimation algorithms and classifies the human
actions.
Our contribution mainly focuses on the following three parts of the system described
above: the preprocessing algorithms, the human detection and pose estimation component, and
the activity and behavior recognition part.
2.1 Preprocessing
In the preprocessing phase, we focus on reducing the area of interest by using
background subtraction or optical flow. Both directions have a vast number of algorithms.
Background subtraction methods are used in the context of moving object detection from
static cameras. One way to obtain the background is to acquire a background image that does
not include any moving object. In some situations, the background is not available, and the
acquired background can change under critical situations, such as illumination changes, or
objects can be introduced into or removed from the scene. Many background modeling methods
have been developed [51, 33] that try to deal with these problems. The methods mostly differ in
the way the backgrounds are modeled. One of the simplest methods is basic background
modeling, which uses an average, a histogram or a median [102, 123, 242] over time. Another
category is the statistical methods: a single Gaussian [226], a mixture of Gaussians [188] or a
kernel density estimation [50]. In this case, statistical variables are used to classify the pixels as
foreground or background. The background can also be computed with clustering methods
using K-means [24] or the Codebook algorithm [95]. This approach supposes that each pixel in
the frame can be temporally represented by clusters. The pixels from a new image are matched
against the corresponding cluster group and classified as foreground or background according to
whether the matching cluster is considered part of the background or not. A group of methods
uses wavelets and DTW to create the background model [16]. The background can be estimated
using filters (Kalman filter [124], Wiener filter [213], Tchebychev filter [31]). Any pixel of the
current image that deviates significantly from its predicted value is declared foreground. In the
case of the artificial-intelligence-based methods, the background is modeled using the weights
of a neural network that is trained using N clean frames. The network is trained to classify each
pixel as background or foreground [34, 122]. Fuzzy logic can also be employed for background
modeling, by using a fuzzy running average [186] or a Type-2 fuzzy mixture of Gaussians [49].
Foreground detection can use fuzzy inferences [182] such as the Sugeno integral [240] or the
Choquet integral [48]. More detailed surveys can be found in [190, 191].
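To make the single-Gaussian idea above concrete, the following sketch maintains a per-pixel running mean and variance and flags pixels that deviate from the model. It is a minimal illustration in the spirit of the running-average and single-Gaussian methods cited above, not a reproduction of any one of them; the parameter values (alpha, k, the initial variance) are illustrative assumptions.

```python
import numpy as np

class RunningGaussianBackground:
    """Per-pixel running Gaussian background model for grayscale frames."""

    def __init__(self, first_frame, alpha=0.02, k=2.5):
        self.mean = first_frame.astype(np.float64)
        self.var = np.full(first_frame.shape, 15.0 ** 2)  # initial variance guess
        self.alpha = alpha  # learning rate of the running averages
        self.k = k          # foreground threshold, in standard deviations

    def apply(self, frame):
        frame = frame.astype(np.float64)
        diff = np.abs(frame - self.mean)
        # A pixel deviating by more than k standard deviations is foreground.
        foreground = diff > self.k * np.sqrt(self.var)
        # Update the model only where the pixel looks like background, so
        # foreground objects do not corrupt the model too quickly.
        bg = ~foreground
        self.mean[bg] += self.alpha * (frame[bg] - self.mean[bg])
        self.var[bg] += self.alpha * (diff[bg] ** 2 - self.var[bg])
        return foreground
```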
The optical flow methods [11, 81, 4, 22, 23, 117, 116] are used in the context of moving
objects and cameras. Finding the displacement field between the subsequent frames of an image
sequence has become a classical computer vision problem. This displacement field is called
optical flow. Horn and Schunck introduced the first variational method for computing the
optical flow field in an image sequence [11]. This method is based on two assumptions: a
brightness constancy assumption and a smoothness assumption. These two assumptions are
characteristic of many variational optical flow methods. The variational methods are among the
best-performing techniques for computing the optical flow field [81, 4]. Improvements to these
methods include refined model assumptions with discontinuity-preserving constraints
[96, 75, 52, 82] or spatiotemporal regularization [117, 69, 86], improved data terms with
modified constraints [97, 22, 173] or nonquadratic penalization [22, 117, 222, 47], and efficient
multigrid algorithms [21, 23, 166, 38, 57] for minimizing these energy functionals.
2.2 Human detection
Human detection and pose estimation are the most sensitive and most important
components of the system. Much research in this field focuses only on detecting people in the
image, but behavior recognition systems capable of recognizing complex behaviors also need to
know the pose or attitude of the humans. The term detection in this case means localization: it
yields a rectangle in the image where the human is. The term human pose refers to the body and
limb configuration relative to a coordinate system.
The people detection systems [5, 170, 130, 114] fall into two main categories:
– “Component-based” methods
– “Single detection window” analysis.
2.2.1 “Single detection window” analysis
First, we survey the “single detection window” methods. The first class of methods uses
2D or 3D human models and tries to match the model to image parts. It is difficult to create a
model that is general enough, simple, and capable of capturing every particular human motion.
If we succeed in matching the model, we also obtain the pose, but overall the matching
procedure is very time-consuming.
A particularly attractive class of “single detection window” methods comprises the
shape-based techniques. These methods are attractive because of their property of reducing
variations in human appearance due to lighting or clothing. The methods can use continuous or
discrete representations of the shape.
Discrete approaches represent the shapes by a set of exemplar shapes [42, 41, 189, 92].
These methods require a large number of example shapes (many thousands) to sufficiently cover
the shape space under transformations and intra-class variance. Exemplar-based models have to
strike a balance between specificity and compactness. When the shape set is too large, the
methods work extremely slowly, while if the shape set is too small, the performance is low as
well. Efficient matching techniques based on distance transforms have been combined with
hierarchical structures to allow matching thousands of exemplars [42, 41, 189].
The continuous approach involves a parametric representation of the shapes, learned
from examples, given the existence of an appropriate manual [195, 193, 194] or automatic
[15, 112, 115, 121, 169] shape registration method. One direction is the linear shape
representation. In this case, a single Gaussian is used to model the class-conditional density
[15, 195]. An extension of the linear model space uses conditional density models [195, 115].
Further, nonlinear extensions have been introduced at the cost of requiring a larger number of
training shapes to cope with the higher model complexity [195, 115, 193, 194, 169]. Another
approach breaks the shape space into subspaces, via the EM algorithm with a mixture of
Gaussians [195] or K-means clustering [115, 193, 15, 169], each of which can be modeled
linearly. To achieve better performance, some approaches combine the shape and the texture
information into a compound parametric appearance model [196, 195, 115, 121, 98]. These
approaches have separate statistical models for shape and intensity variations.
The second type contains the “example-based” methods, which use a labeled training set
to learn to recognize human objects. These are discriminative methods and usually use
predefined image features and their relationships in order to determine the human objects. One
of the key elements in these methods is the training set, and the other is the training algorithm.
The “example-based” methods can be categorized into three classes based on the classifier
architecture – techniques which determine an optimal decision boundary between pattern
classes in a feature space:
– Feed-forward multilayer neural networks
– AdaBoost classifiers
– Support Vector Machines (SVMs)
The most common class of methods, feed-forward multilayer neural networks [1],
implements linear discriminant functions in a feature space into which input patterns have been
mapped nonlinearly using feature sets. The decision boundary is computed by minimizing an
error criterion with respect to the network parameters. The most common criterion is the mean
squared error [1]. Multilayer neural networks have been applied in conjunction with adaptive
local receptive field features as nonlinearities in the hidden network layer [89, 41, 168, 120,
224]. This architecture unifies feature extraction and classification within a single model.
The second class of classifiers is AdaBoost [228], which is used to construct strong
classifiers as weighted linear combinations of selected weak classifiers, each involving a
threshold on a single feature [144, 178]. The best known and most used classifier is the cascade
classifier introduced by Viola et al. [144] and adopted by many others [90, 91, 140, 9, 99, 153].
The idea behind the cascade classifier is that the majority of detection windows in an image are
non-humans. The cascade structure is tuned to reject non-human objects as early as possible.
AdaBoost has also been applied as an automatic feature selection procedure [228]. Each layer is
constructed iteratively using AdaBoost to create a strong classifier guided by user-specified
performance criteria. Each layer focuses on the errors the previous layers make, and the result
of the layers together is a complex detector. The cascade classifier was later modified into a tree
classifier. The classifier eliminates the majority of the negative inputs in an early stage of the
classification, and only the searched objects are classified by all stages. Lienhart et al.
[103, 104, 105] expanded Viola's approach by extending the feature types and using detection
trees to handle in-class variances. Most AdaBoost classifiers use Haar-like features.
Non-adaptive Haar wavelet features were introduced by Papageorgiou and Poggio [25] and
adapted by many others [125, 66, 144]. The Haar features are local filters operating on pixel
intensities; they represent local intensity differences at various locations, scales, and
orientations. The Haar features are popular because they are simple and, using integral images,
can be computed extremely fast, with the same computation cost at every scale [176, 144]. The
Haar features are frequently redundant, due to overlapping spatial shifts, and they require
mechanisms to select the most appropriate subset out of the vast number of possible features.
This selection can be done manually, using prior knowledge about the geometric configuration
of the human body [125, 25, 66], or automatically [228, 144].
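To illustrate why integral images make Haar-like features so cheap, the sketch below builds the integral image once and then evaluates a two-rectangle feature with a handful of array lookups, at the same cost for every scale. The function names are ours, and the vertical two-rectangle feature shown is just one member of the family described above.

```python
import numpy as np

def integral_image(img):
    # ii[y, x] = sum of img over the rectangle [0..y-1, 0..x-1]; a zero
    # row and column are prepended so rectangle sums need no special cases.
    return np.pad(img.astype(np.int64), ((1, 0), (1, 0))).cumsum(0).cumsum(1)

def rect_sum(ii, y, x, h, w):
    # Sum of the h-by-w rectangle with top-left corner (y, x), obtained
    # with four lookups regardless of the rectangle size.
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_two_rect_vertical(ii, y, x, h, w):
    # Two-rectangle Haar-like feature: upper half minus lower half
    # (h must be even); larger scales cost exactly the same.
    half = h // 2
    return rect_sum(ii, y, x, half, w) - rect_sum(ii, y + half, x, half, w)
```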
Another powerful tool for solving pattern classification problems is the Support Vector
Machine (SVM) [168]. SVMs maximize the margin of a linear decision boundary (a hyperplane)
to achieve maximum separation between the object classes. To solve the human detection
problem, linear SVM classifiers have been used in combination with various (nonlinear) feature
sets [128, 129, 132, 215, 66, 99, 153]. Some methods use nonlinear SVM classification, with
polynomial or radial basis function kernels for implicit mapping of the samples into a
higher-dimensional space. The nonlinear SVM is characterized by a significant increase in
computational cost and memory requirements [132, 76, 126, 168, 25, 120].
Another type of feature used in classification tasks is codebook feature patches, extracted
around interest points in the image [196, 7, 6, 176]. During training, a codebook of distinctive
object feature patches with geometrical relations is learned, followed by clustering in the space
of feature patches to obtain a compact representation of the human class. Besides the
intensity-based features, some features use the discontinuities in the image. Such features are
the histograms of oriented gradients (HOG) [128, 178, 215, 99, 153] and the scale-invariant
feature transform (SIFT) [39]. These use well-normalized image gradient orientation histograms
computed over local image blocks. HOG features were initially computed using local image
blocks at a single fixed scale [128, 178] but were later extended to variable-sized blocks
[215, 99, 153]. To increase robustness to illumination changes, local spatial variation and
correlation of gradient-based features have been encoded using covariance matrix descriptors
[140]. Another category of features is represented by the edgelets – shape filters that explicitly
incorporate the spatial configuration of dominant edge-like structures [90]. Manually selected
sets of edgelets, which can be local line or curve segments, have been used to capture edge
structures [144, 9].
A common problem for all the presented examples is the high in-class variability, which
makes the classifier very complex. Many recent human detection approaches attempt to break
down the complex appearance of the human class into subclasses with low in-class variability.
Usually the breakdown is made to create subclasses that correspond to a specific human pose or
appearance, using manual [132, 178, 66, 9] or automatic clustering [41, 99]. Without a safe
clustering method, this significant amount of work is done manually. Since the data is not
collected in a controlled environment, manual categorization can become prohibitively
expensive and, because of the fundamental ambiguity in labeling different poses and views, the
complexity of the work grows linearly with the number of classes. Manual categorization is also
an error-prone procedure that may introduce significant bias into the training process.
2.2.2 “Component-based” methods
The “component-based” methods detect the invariant object parts separately and check
whether they are present in a geometrically natural configuration. These parts are either
semantically motivated as body parts [76, 90, 126, 178, 67, 9] or concern codebook
representations [9, 165, 7, 6]. The components should be small enough to capture articulated
motion, but sufficiently large to contain discriminative visual structure that allows reliable
detection. These types of systems usually use a hierarchical detection framework. The body
parts are detected in order of their importance; if one of the basic parts is not detected, the other
parts are not searched. The “component-based” methods require assembly techniques to
integrate the local part responses into a final detection, constrained by spatial relations or a
model among the parts. Methods that put together the part-based detection responses into a final
classification can use a combination of classifiers [76, 126, 178] or probabilistic inference to
determine the most likely object configuration given the observed image features [90, 9, 67].
The “component-based” systems can more easily address partial occlusion [9, 7, 6].
They do not need a huge training set to adequately cover the set of possible appearances.
However, because they need to detect more than one component, they are slower than
single-detection-window-based methods. Their applicability to lower-resolution images is
limited, since each component detector requires a certain spatial support for robustness.
2.3 Activity and behavior recognition
The objective of this subchapter is to provide an overview of state-of-the-art human
activity and behavior recognition methodologies.
Figure 2. Relation between activity and behavior recognition (a continuum from simple activity recognition to complex behavior recognition)
The term "activity " refers to basic human action s like walking , standing and we will use
the term "behavior" to refer complex activities with a longer duration in time. T he behavior can
be composed by a sequence of activities. Both the activity and behavior recognition techniques
can be classified [ 84] into two categories:
– Single-layered approaches
– Hierarchical approaches.
However, the single-layered approaches are more suitable for activity recognition, and
the hierarchical methods are used for behavior recognition.
2.3.1 Single-layered approaches
Single-layered approaches are techniques that represent and recognize human activities
directly from sequences of images. Due to their nature, single-layered approaches are suitable
for the recognition of actions with sequential characteristics [84]. Two types of single-layered
approaches can be distinguished, depending on how they model human actions:
– Space-time approaches
– Sequential approaches.
The space-time approaches view an input video as a 3-dimensional volume (the image
plane plus time as the third dimension) and can be divided [84] into three categories based on
the features they use:
– Trajectory-based approaches
– Space-time-volume-based techniques
– Space-time local-feature-based techniques
The trajectory-based techniques interpret an activity as a set of space-time trajectories.
In these techniques, the human objects are represented as a set of 2D or 3D points corresponding
to the joint or body part positions. Some of the work in this field tracked the joint positions and
used them directly [157, 158, 180, 234] or constructed 3D XYT or 4D XYZT representations of
the motion to recognize the action [223, 139, 180, 235].
Instead of using the trajectories directly in human action recognition, some methods
extract significant curvature patterns from the trajectories. The actions are represented by a set
of peaks and the intervals between them. Using different learning algorithms, prototypes for
actions are created; these prototypes serve as the action templates. The recognition is done with
a template matching technique [160].
Some approaches transform the human action trajectory into a low-dimensional phase
space. In the phase space, a person's static state at each frame corresponds to a point and an
action corresponds to a set of points [84]. The scenes are also converted into the phase space,
and it is checked whether the points lie on the maintained curves [30]. Using these methods, we
are able to analyze human movements at a detailed level, and most of these methods are
view-invariant.
The space-time-volume-based techniques measure the similarity between two volumes.
One approach tracks the foreground shape changes and compares this volume to the saved ones
[18, 179, 94]. In order to match volumes more reliably, some techniques use filters to capture
characteristics of the volumes [162]. Instead of maintaining the 3-dimensional space-time
volume of each action, some approaches represent each action with a template composed of two
2-dimensional images – a binary motion-energy image (MEI) and a scalar-valued motion-history
image (MHI) – and compare them using template matching [18] (a minimal MHI update rule is
sketched below). Hierarchical space-time volume correlations were also introduced: at every
location of the volume, a small space-time patch is extracted around the location and correlated
with the saved templates [179]. Another system applies a hierarchical mean shift to cluster
similarly colored volumetric pixel elements, from which it obtains several segmented volumes;
at the recognition phase, an SVM is used to classify the volumes [90, 174].
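The MEI/MHI representation mentioned above can be sketched in a few lines: the history image is refreshed where motion is detected and decays elsewhere, and the energy image is simply its nonzero support. This is a minimal illustration of the idea; the decay-by-one rule and the duration parameter tau are illustrative choices.

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau=30):
    # Pixels moving in the current frame receive the maximal timestamp tau;
    # all others decay by one step, so recent motion stays brightest.
    mhi = np.where(motion_mask, float(tau), np.maximum(mhi - 1.0, 0.0))
    # The motion-energy image (MEI) is the support of the history image.
    mei = mhi > 0
    return mhi, mei
```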
These techniques use a sliding-window algorithm, which requires a huge number of
computations to solve the recognition problem accurately. Scenes with multiple people or
actions that cannot be spatially segmented can create a problem for these systems [84].
Space-time local-feature-based techniques extract local features from 3D space-time
volumes to represent and recognize activities. According to these techniques, the 3D space-time
volume created by an action is essentially a rigid 3D object, and by extracting appropriate
features, action recognition becomes an object matching problem. These techniques are
characterized by the features they extract, how they describe the action with these features, and
what methodology they use to classify activities.
There are two approaches [84]: the first extracts the local features at every frame and
concatenates them temporally to describe the overall motion of human activities [17, 32, 239],
and the other extracts sparse spatio-temporal local features from 3D volumes [101, 46, 138, 235,
163].
One of the earliest techniques that extract local features at the frame level to characterize
an action uses motion energy receptive fields together with Gabor filters. The detected local
features are used to build a multidimensional histogram, and then, using the Bayes rule, the
posterior probability of an action is computed [32]. A variant of this method uses normalized
local intensity gradients extracted at multiple temporal scales and applies an unsupervised
clustering algorithm to the histograms to learn actions [239].
Another approach extracts appearance-based local features at each pixel of every frame
and constructs a space-time volume. Using these features and the Poisson equation, space-time
saliency and space-time orientation can be extracted. The actions were recognized using
nearest-neighbor classification with a Euclidean distance [17].
Some approaches extend local feature (corner) detectors [74] to be applicable in 3D
space. These extended corner detectors capture various types of non-constant motion patterns,
such as direction change. The features are used in an SVM to recognize activity [100, 174].
Cuboid-type spatio-temporal features were proposed [46] to capture the pixel appearance
values of the interest points' neighborhoods. For each dataset, a set of cuboids was used, and the
actions were modeled as a histogram of the cuboids detected in the 3D space-time volume. This
method was extended with unsupervised learning and classification methods [138]. These
methods were further extended by pruning the cuboid features [106], by using the extended 3D
SIFT method, which is remarkably similar to the cuboid features [175], or by using color
information [161]. These techniques use successful bag-of-words approaches for basic periodic
actions.
Recently, action recognition approaches have attempted to model the spatio-temporal
distribution of the extracted features, considering the spatial configurations of the local features
for better recognition of actions. The PLSA introduced by Niebles [138] was extended to use an
implicit shape model (PLSA-ISM) [225], which captures the relative spatio-temporal location
information of the features with respect to the activity center. An extension of these approaches
also considers the correlation between the features [172, 100, 106].
Ryoo and Aggarwal [164] introduced the spatio-temporal relationship (STR) match
method, which measures the structural similarity between two videos, considering spatial and
temporal relationships between the detected features. This method is also capable of recognizing
basic behaviors and interactions between two people.
The sequential approaches interpret an input video as a sequence of observations. They
extract features from the frames describing the status of a person in the image and analyze the
sequence of features to measure how likely it is that a person performs an activity. The
sequential approaches can be classified into two categories, depending on whether they use:
– Exemplar-based recognition methodologies
– Model-based recognition methodologies
Exemplar-based sequential techniques use samples directly in the recognition or training
phase. The new image sequences are compared to a template sequence composed of feature
vectors extracted from the training video. These techniques must be invariant to different styles
and/or different rates of the performed activity. One approach is DTW adapted for matching two
sequences with variations [45, 62, 217], used mostly for gesture recognition; a minimal DTW
sketch is given after this paragraph. Another approach uses principal component analysis (PCA)
to represent an activity as a linear combination of a set of activity bases that is essentially a set
of eigenvectors. The coefficients are then computed for the input activity and compared to the
template coefficients [217, 232]. In a recent approach [107], the human activity is represented as
a linear time-invariant (LTI) system. The system tries to use the dynamics of changes in
silhouette features. The new frames are converted to LTI parameters and classified using an
SVM.
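A minimal sketch of the DTW matching mentioned above: a dynamic-programming table whose last cell holds the cost of the best warping path aligning two sequences of feature vectors, tolerating differences in execution rate. The Euclidean per-frame distance is a configurable assumption.

```python
import numpy as np

def dtw_distance(a, b, dist=lambda x, y: np.linalg.norm(x - y)):
    # D[i, j] = cost of the best alignment of a[:i] with b[:j]; either
    # sequence may repeat elements, which absorbs rate variations.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(a[i - 1], b[j - 1])
            D[i, j] = c + min(D[i - 1, j],      # stretch b
                              D[i, j - 1],      # stretch a
                              D[i - 1, j - 1])  # advance both
    return D[n, m]
```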
State-model-based sequential techniques are approaches that represent a human activity
as a model composed of a set of states. This model is trained statistically so that it corresponds
to the sequences of features that define the activity class. For every activity class, a statistical
model is constructed. To measure the likelihood between the action model and the input image
sequence, maximum likelihood estimation (MLE) or the maximum a posteriori probability
(MAP) can be used. The most used model-based sequential approaches are the Hidden Markov
Models (HMMs) [241] and the dynamic Bayesian networks (DBNs) [147].
In both approaches, the human is assumed to be in one state at each frame. At each state,
we have an observation: the feature vector. The transition between states is evaluated
considering the observations. The activity is represented as a set of hidden states. Activity
recognition is then the problem of computing the probability that a given sequence is generated
by a particular state model; a sketch of this computation is given below.
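The likelihood computation just described can be sketched with the standard forward algorithm, assuming discretized observation symbols; at classification time, the value below would be computed under each per-activity model and the largest one chosen. The variable names are ours, and a practical implementation would rescale alpha (or work in log space) to avoid underflow on long sequences.

```python
import numpy as np

def sequence_likelihood(pi, A, B, observations):
    # pi[i]   : initial probability of hidden state i
    # A[i, j] : transition probability from state i to state j
    # B[i, o] : probability of emitting observation symbol o in state i
    # Returns P(observations | model) via the forward recursion.
    alpha = pi * B[:, observations[0]]
    for o in observations[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()
```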
The first approaches used standard HMMs to recognize activities [233, 187, 19]. These
approaches use shapes, positions or trajectories as observations and usually use the Viterbi
algorithm [216] to compute an approximation of the likelihood distance for classifying an
observation sequence into an activity class. The new generation of HMM approaches [141, 147,
134] also constructs one model (HMM) for each activity to be recognized and uses appearance
features from the scene as observations, but relies on extended HMMs. These extended HMMs
are designed to handle complex activities. One of these extended HMMs is the coupled HMM
(CHMM) [141]. The CHMM is constructed by coupling multiple HMMs; the hidden states of
two different HMMs are coupled by specifying their dependencies. The CHMM makes it
possible to model more than one actor (more than one state can be active) and the interactions
between them. This method was later extended by explicitly modeling the duration of an activity
staying in each state, using coupled hidden semi-Markov models (CHSMMs) [108, 134].
Another extended HMM is the DBN, which is built from multiple conditionally independent
hidden nodes that generate the observations at each frame [147]. This algorithm was used to
recognize gestures.
2.3.2 Hierarchical approaches
The hierarchical approaches recognize high-level human activities, mostly behaviors.
These systems are composed of multiple layers and recognize high-level activities and behaviors
based on the recognition results of other, simpler activities. The hierarchical approaches are:
– Statistical approaches
– Syntactic approaches
– Description-based approaches
Statistical approaches construct statistical state-based models, such as HMMs and DBNs,
related hierarchically to each other to represent and recognize high-level human activities. The
first layers recognize atomic actions from sequences of feature vectors using single-layered
sequential approaches. The upper layers treat the previous layer's “atomic actions” as
observations. Using these observations, new statistical models can be built to recognize complex
actions or behaviors. One of the most common approaches is the layered Hidden Markov Model
(LHMM) [141, 137], constructed from two layers. The bottom-layer HMMs recognize atomic
actions of a single person, and the upper layer recognizes the behavior. This approach is also
suitable for the recognition of group behavior [241, 43]. A variant of the 2-layered HMMs is the
block-based HMM [236]. An extension of the traditional HMMs is the dynamic probabilistic
network (DPN), suitable for representing activities of multiple participants [64]. Bayesian
networks that use Markov chain Monte Carlo (MCMC) were also used to recognize activity
[44]. In this case, the relations between basic actions were modeled using Bayesian networks.
The network was iteratively updated using MCMC to find the best model for the current
sequence of observations. Another hierarchical approach is the propagation network (P-net)
[181]. In the P-net, the activity is represented by multiple state nodes and their transition and
observation probabilities. The main advantage of this approach is that multiple states are active,
which makes it possible to model composed activities as well.
The syntactic approaches model human activities as strings of symbols. To model human
activity, these approaches use grammar syntax such as context-free grammars (CFGs) or
stochastic context-free grammars (SCFGs) [78, 125, 88]. The high-level activities are modeled
as strings of atomic-level activities. Similarly to the other hierarchical approaches, these methods
use two layers. The first recognizes the atomic activities using any of the previous methods, and
the second layer uses a set of rules that generates strings of atomic actions, which are recognized
by a parsing technique. One of the basic syntactic approaches [78] uses HMMs for the
recognition of simple actions; this is the first layer. The second layer uses SCFGs to recognize
complex actions based on the outcome of the first layer. The lower layer generates a string of
actions. The upper layer parses the string generated by the first layer using the Earley-Stolcke
algorithm, extended to handle uncertain observations, and a large number of stochastic
production rules, which should be able to explain all activity possibilities. These methods were
extended [127] to handle multitask activities by introducing more reliable error detection and
recovery techniques for the recognition. Other extensions [125] of the basic syntactic approach
introduce CFGs to improve segmentation and object tracking. This method also introduces the
concept of hallucinations, to compensate for the failures of atomic-level recognition. Recent
work focuses on grammar extension [88] to attach semantic tags and conditions to the
production rules of the SCFG. These extensions are able to recognize activities of higher
complexity.
Description-based methods are hierarchical approaches that represent complex human
activities by describing them using simpler activities or sub-events and their temporal, spatial,
and logical structures. The human action is modeled as an occurrence of the simpler activities or
events composing the action and certain relations between these activities. In description-based
approaches, the relations between sub-events are described using predicates – before, meets,
overlaps, during, starts, finishes, and equals [3, 2, 185, 150, 136, 65, 163, 220, 155] – for
temporal, sequential, and concurrent relationships. In the description-based approaches, the
activities' semantics are encoded like programming languages, and a CFG is used to verify
whether the representation fits its grammar [164, 135].
The description-based algorithms differ mostly in how they describe temporal structures
and in the way the recognition is made. One of the first approaches [150] uses the Past, Now,
Future network (PNF-network) based on Allen's interval algebra constraint network
(IA-network) [3], where sub-events are nodes and their temporal relationships are described
with edges between them. Other earlier approaches [77, 136, 70, 220] use a format similar to
those of programming languages. The main differences between these methods are the temporal
predicates they use for relations.
In the Bayesian-belief-network-based approach, the root node corresponds to the
high-level activity, and the other nodes correspond to the sub-events or describe the temporal
relationships between the sub-events [150].
Petri Nets were also used to represent human activities [238, 133, 63]. Petri Nets are
suitable for representing the temporal ordering of the sub-event relations. The recognition is
done by passing tokens through the graph.
2.4 Conclusions
This chapter is a survey of the most representative human behavior recognition systems.
The first part of the chapter presents a general framework for building a human behavior
recognition system, emphasizing all its important steps.
In the second part, we made a synthesis of the current achievements, results and
conclusions in the field of human behavior systems. With the survey we have covered all three
components of the general framework:
– Preprocessing
– Human detection
– Behavior recognition.
During the survey we have identified the following issues, which are worth further study:
– In the literature, we have found a large number of background detection algorithms that
perform well only in specific situations. Usually they are either fast or accurate, and most of
them cannot handle shadows. Nor have we found a robust method for deciding which of the
algorithms should be used in a specific situation.
– In the case of the human detection algorithms, the situation is the same. We have identified a
large number of algorithms, but they work only in special situations, and there is no
information or guideline about how to choose the correct one for a given task or scenario.
Another shortcoming of the human detection algorithms is that they cannot retrieve both the
position of the humans in the image and the pose (attitude) of the people at that moment. Or,
if they do (very few of them), they are very slow.
– In the literature we have found behavior recognition methods that recognize only simple
actions. Those that can recognize complex behaviors need complex training, and usually
they are not general enough to recognize arbitrary behaviors. Many of the algorithms do not
process body-part motions, which results in a high degree of uncertainty.
3. Preprocessing
The images, captured by a camera or loaded from a video file, in most cases need to be
prepared for later processing. These preparations are qualitative enhancements, normalizations,
or dimensionality reduction algorithms. Qualitative enhancement means noise reduction and
contrast or color enhancement. Normalization is a linear process that changes the range of a
feature (e.g., color, intensity, size, orientation) so that it falls within a predefined range. The
dimensionality reduction algorithms use some restrictions or some heuristic information to
reduce the complexity of the problem.
In this chapter, we focus in particular on the dimensionality reduction algorithms, by
presenting and comparing the most common methods. The methods are evaluated from the point
of view of human detection and behavior recognition techniques. We also propose a novel
method for background detection, capable of removing shadows as well.
3.1 Problem statement
Human detection is an extremely complex task. In this phase, the main goal is to reduce
the complexity of the task. We can do this by considering all the acquirable information and
then using it to build an algorithm that will make the following processes more reliable and
efficient.
In our case, the most significant information about humans is that they are moving.
Although their motion is not continuous, considered over a long enough period, they will make
some movements. This information can be used successfully except for still images. For our
case, an efficient algorithm can be built to narrow the search space. These kinds of algorithms
are the foreground segmentation algorithms.
When choosing or designing an algorithm for this task, we need to take into
consideration the following cases:
– Static background
– Periodically static background
– Dynamic background
In real environments, all approaches have to deal with several problems [87, 152], as
follows:
– Gradual illumination changes: gradual illumination changes slowly alter the color
characteristics of the image.
– Quick illumination changes: quick illumination changes entirely alter the color
characteristics of the image.
– Relocation of a background object: relocation of a background object induces changes in
two different regions of the image, its newly acquired position and its previous position.
– Camera oscillations: camera oscillation generates a repeated slow shifting of the image.
– High-frequency background objects (such as tree branches, sea waves, and similar): we
have to deal with these mostly in outdoor environments, but they can also be caused by a
flickering lamp.
– Initialization with moving objects: if moving objects are present during initialization,
then part of the background will be occluded by them.
– Camouflage: a foreground object's pixels may have the same intensity and color as the
background.
– Shadows: objects cast shadows that might also be classified as foreground due to the
illumination change in the shadow region.
There are two kinds of widespread approaches for foreground segmentation: optical flow
computation and background subtraction.
3.2 Optical Flow
Optical flow reflects the image changes due to motion between consecutive frames. The
optical flow field is the velocity field that represents the three-dimensional motion of object
points across a two-dimensional image. The optical flow can tell us about the relative distances
of objects of equal speed: closer moving objects will have more apparent motion than moving
objects that are further away. There are many approaches to compute the optical flow. Despite
the differences between the approaches, most of them have three stages [13]. The first stage is
pre-filtering or smoothing with low-pass or band-pass filters in order to extract the signal
structure of interest and to enhance the signal-to-noise ratio. The second stage is the extraction
of basic measurements, such as spatio-temporal derivatives or local correlation surfaces. The
last stage is the integration of these measurements to produce a 2D flow field, which often
involves assumptions about the smoothness of the underlying flow field.
The most prominent optical flow methods are:
Differential techniques
Region-based methods
Frequency-based methods
Phase-based methods.
Differential techniques: These approaches compute the optical flow from spatio-temporal
derivatives of the image. A frame from an image sequence is written as a function of position
and time, and this representation of the image is expanded using a Taylor series. This category
of techniques uses three constraints:
Brightness constancy: the observed brightness of any object point is
constant over time
The velocity smoothness
The temporal persistence or "small movements".
The first assumption is expressed in equation (3.1):

I(x, y, t) = I(x + \Delta x, y + \Delta y, t + \Delta t) (3.1)
Equation (3.1) can be expanded using a Taylor series, leading to the 2D motion
constraint equation:

I_x u + I_y v + I_t = 0 (3.2)

where (u, v) is the flow vector and I_x, I_y, I_t are the partial derivatives of the image.
Equation (3.2) has two unknowns and therefore no unique solution. The absence of a unique
solution is because, in a small area, there is not enough information to determine the motion
(the aperture problem) [13].
To overcome the problem from equation (3.2), Horn and Schunck [11] introduced a
new global smoothness constraint. Their assumption is that large, rigid objects are moving in the
image, so over relatively large areas the optical flow will be smooth. Horn and Schunck
minimize the square of the magnitude of the gradient of the optical flow by using the equation:

E = \iint (I_x u + I_y v + I_t)^2 + \alpha^2 (\|\nabla u\|^2 + \|\nabla v\|^2) \, dx \, dy (3.3)
In contrast, the Lucas and Kanade [10, 183, 184] approach assumes that, in a local
neighborhood of a pixel, the flow is constant. The optical flow equation is solved using a local
least-squares criterion in the neighborhood \Omega surrounding the pixel:

\min_{v} \sum_{x \in \Omega} W^2(x) [\nabla I(x, t) \cdot v + I_t(x, t)]^2 (3.4)
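As an illustration of this local least-squares solution, the following sketch estimates the flow at a single pixel with plain NumPy; the window size, the uniform weighting W = 1 and the helper name lucas_kanade_at are assumptions of the example, not the thesis implementation.

import numpy as np

def lucas_kanade_at(I0, I1, x, y, win=7):
    """Estimate the flow vector v = (u, v) at pixel (x, y).

    Sketch only: assumes the window lies fully inside the image.
    """
    half = win // 2
    # Spatial derivatives (central differences) and temporal derivative
    Iy, Ix = np.gradient(I0.astype(float))
    It = I1.astype(float) - I0.astype(float)
    # Collect the derivatives over the local neighborhood Omega
    sl = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)  # n x 2
    b = -It[sl].ravel()                                     # n
    # Least-squares solution of A v = b (normal equations A^T A v = A^T b)
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return v  # (u, v)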
The Horn and Schunck constraint can be extended using a global smoothness constraint and
second-order derivatives to measure the optical flow [131]. Nagel suggested an oriented-smoothness
constraint in which smoothness is not imposed across steep intensity gradients, in an attempt to
handle occlusion. The problem was formulated as the minimization of the following functional
[131]:
(3.5)
The Region-Based Matching approach uses region matching to compute the optical flow. In
some cases numerical differentiation may be impractical because of noise. The region-based
matching method has three steps: region extraction, region matching and optical flow smoothing.
The velocity v is defined as the shift d that gives the best fit between image regions at
different times. For region matching, a similarity measure (over d) has to be used. The most
used similarity measures are the normalized cross-correlation and the sum-of-squared
difference (SSD) [84]:

SSD(x, d) = \sum_{j} \sum_{i} W(i, j) [I(x + (i, j), t) - I(x + d + (i, j), t + 1)]^2 (3.6)

where W denotes a discrete 2-d window function and d takes on integer values [171].
Frequency-Based Methods compute the optical flow using the output energy of
velocity-tuned filters in the Fourier domain. The Fourier transform of a translating
2-d pattern is:

\hat{I}(k, \omega) = \hat{I}_0(k) \, \delta(\omega + v^T k) (3.7)

where \hat{I}_0(k) is the Fourier transform of I(x, 0), \delta is a Dirac delta function, \omega denotes the
temporal frequency, and k the spatial frequency.
Phase-Based Techniques, or filter-based techniques, employ phase information to compute
the velocity. These techniques deliver precise and reliable estimates without complex parameter
tuning or optimization [13], mainly because the phase information is robust to changes in scale,
orientation and speed. It has been demonstrated that phase-based techniques are more accurate
than local methods, but the filtering operations have a high computational cost. They can be used
in real-time applications only with dedicated hardware.
3.3 Background subtraction
Background subtraction is a commonly used class of techniques for segmenting out
objects of interest from a scene using a static camera. It involves comparing an observed image
with an estimate of the static background. The areas of the image plane where there is a
significant difference between the observed and estimated background images indicate the
location of the objects of interest. The name "background subtraction" comes from the simple
technique of subtracting the observed image from the estimated image and thresholding the
result to generate the objects of interest.
We compared several representative techniques of this class along some of their
fundamental attributes: how the object areas are distinguished from the background; how
the background is maintained over time; and how the segmented object areas are post-processed
to reject false positives.
Figure 3. General framework for background segmentation
Every variant of the background subtraction method starts from the following
equation:

|I_t(x, y) - B_t(x, y)| > \tau (3.8)

where B_t is the estimated background, I_t is the current frame, and \tau is a "predefined"
threshold. If the absolute difference between the current frame and the background estimate at a
pixel is greater than \tau, that pixel is identified as a foreground pixel. The methods mainly differ
from each other in the way they define and maintain the background.
One of the simplest variants is frame differencing, where the background estimate is
the previous frame:

B_t(x, y) = I_{t-1}(x, y) (3.9)
This technique is fast and robust to all illumination changes, but identifies only the
contour of the moving objects, works only for a particular combination of object speed and frame
rate [229], and is extremely sensitive to the threshold.
To get a solid moving region, we need to compute the background as the average or the
median of the previous images. This approach is known as the temporal median filter. The
background estimate is defined as the median of the last N frames, with typical values of N
ranging between 50 and 200:

B_t(x, y) = M(I_{t-1}(x, y), \ldots, I_{t-N}(x, y)) (3.10)
where M(·) represents the median (or average) of the last N frames. This method is fast, but
extremely memory consuming (size of memory needed: N·Size(I) [156]).
Cucchiara et al. [118] proposed computing the median on a special set of n sub-sampled
values of the last N images, and using the computed median value for a fixed period. Computed
in this way, the background is more stable.
Cutler [221] uses color images because they give better segmentations than
monochrome ones, especially for areas with low contrast, such as objects in dark shadows. A
pixel has to be marked as foreground if

|I_t(x, y) - B_t(x, y)| > K\sigma (3.11)

where \sigma is an offline-generated estimate of the noise standard deviation, and K is an a priori
selected constant (typically 10). This method also uses template matching to help in selecting
candidate matches.
To minimize memory usage, the background is computed as a running average:

B_t(x, y) = \alpha I_t(x, y) + (1 - \alpha) B_{t-1}(x, y) (3.12)

where \alpha is the learning rate, kept small (typically 0.05) to prevent artificial shadows
forming behind moving objects. This method does not need extra memory [118]. Several
improvements exist for this method.
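The following minimal sketch shows how equations (3.8) and (3.12) fit together in practice; the class name and the concrete learning-rate and threshold values are illustrative assumptions, not the thesis settings.

import numpy as np

class RunningAverageBackground:
    def __init__(self, first_frame, alpha=0.05, tau=25.0):
        self.B = first_frame.astype(float)  # background estimate B_t
        self.alpha = alpha                  # learning rate
        self.tau = tau                      # foreground threshold

    def apply(self, frame):
        frame = frame.astype(float)
        # Equation (3.8): pixels far from the background are foreground
        mask = np.abs(frame - self.B) > self.tau
        # Equation (3.12): blend the new frame into the background model
        self.B = self.alpha * frame + (1.0 - self.alpha) * self.B
        return mask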
After the thresholding, Heikkila and Olli [80] use the closing morphological operation
with a 3×3 structuring element to discard the small regions. Two background corrections have to
be applied: one is computed with equation (3.12), and the second one is applied for pixels
marked as foreground for more than m of the last M frames, in which case the background is
updated as
(3.13)
This correction is designed to compensate for sudden illumination changes and newly
appearing static objects. If a pixel's state changes frequently between foreground and background,
it is masked out from inclusion in the foreground. This is designed to compensate for fluctuating
illumination, like swinging branches.
In LOTS [192], three background models are kept simultaneously: a primary, a
secondary, and an old background. Both the primary and the secondary backgrounds are
updated with equation (3.12) when the pixels belong to the background; the parameter \alpha is
smaller than 0.25. If the pixels belong to the foreground, the primary background is updated with
equation (3.12) using a different \alpha value.
When the pixels differ significantly from the background, the secondary background is
updated with equation (3.13). The third background, named the old background, is a copy of
the incoming images from 9000 to 18000 frames ago.
The method uses adaptive thresholding with hysteresis to classify pixels as foreground
or background. Several conditions are then applied to the classified pixels. If the foreground
pixels come only from a small region, they are classified back as background. If the size of the
foreground increases significantly in consecutive frames, the global threshold is temporarily
increased, because this is interpreted as a rapid lighting change. To resolve local lighting changes,
the foreground pixels are compared with the primary and secondary background images.
The small foreground regions and noise are removed with the Halevy [55] method,
where the background is updated by

B_t(x, y) = (1 - \alpha) B_{t-1}(x, y) + \alpha \tilde{I}_t(x, y) (3.14)

at all pixels, where \tilde{I}_t is a smoothed version of I_t and the value of \alpha is in the [0.3...0.5]
interval. This method does not use thresholding for foreground detection; instead it tracks the
maxima of the resulting distribution. They also note that (1 - \alpha)^t gives an indication of the
number of frames t needed for the background to settle down after initialization.
In the methods presented above, the background was represented by a mean value. An
extension [56, 118] of this approach is to model every pixel of the background model with a
Gaussian distribution (\mu, \sigma). By this, the Gaussian distribution is fitted over the temporal
histogram, resulting in the background PDF. The most convenient way to update the background
is using the running Gaussian average, to avoid fitting the PDF to each new frame from scratch.
The background pixels in this case are explicitly modeled by a mean (\mu_t) and a variance (\sigma_t^2),
which are updated recursively:

\mu_t = \alpha I_t + (1 - \alpha) \mu_{t-1} (3.15)

\sigma_t^2 = \alpha (I_t - \mu_t)^2 + (1 - \alpha) \sigma_{t-1}^2 (3.16)

The classification is made by thresholding the difference between the mean and the
image:

|I_t - \mu_t| > K \sigma_t (3.17)

The threshold value is computed using the standard deviation and a chosen constant K; the
value of K is usually 2.5.
In practice, the selective update model is used in order to maintain the background PDF:

\mu_t = M \mu_{t-1} + (1 - M)(\alpha I_t + (1 - \alpha) \mu_{t-1}) (3.18)

where M is a binary value, equal to 1 if the pixel is a foreground pixel and 0 otherwise.
The most significant limitation of the previous methods is that they do not cope with
multimodal backgrounds. The adaptive Mixture of Gaussians methods are meant to provide a
solution for this limitation. With this approach, we can model a multimodal background with a
mixture of K Gaussians (\mu_i, \sigma_i, \omega_i) [27, 28], with an arbitrarily pre-defined number of modes,
usually between 3 and 5.
Each pixel is modeled separately by a mixture of K Gaussians:
P(I_t) = \sum_{i=1}^{K} \omega_{i,t} \, \eta(I_t; \mu_{i,t}, \Sigma_{i,t}) (3.19)

where K is usually between 3 and 5 [221, 229, 27]. In the case of color images, the Gaussians are
multi-variate. Assuming the channel values are independent, the covariance matrix simplifies to a
diagonal one. To simplify further, it is considered [221, 27] that the standard deviations of the
three channels are equal, so the covariance matrices simplify as well:

\Sigma_{i,t} = \sigma_{i,t}^2 I (3.20)
To update the model, first we identify the best matching distribution, and then check the
distance to this distribution. If the distance is smaller than 2.5 standard deviations [221,
229, 27], then the matched component is updated with the following equations:

\omega_{i,t} = (1 - \alpha) \omega_{i,t-1} + \alpha (3.21)

\mu_{i,t} = (1 - \rho) \mu_{i,t-1} + \rho I_t (3.22)

\sigma_{i,t}^2 = (1 - \rho) \sigma_{i,t-1}^2 + \rho (I_t - \mu_{i,t})^2 (3.23)

where

\rho = \alpha \, \eta(I_t; \mu_i, \sigma_i) (3.24)

The other components are updated as follows:

\omega_{i,t} = (1 - \alpha) \omega_{i,t-1} (3.25)

\mu_{i,t} = \mu_{i,t-1} (3.26)

\sigma_{i,t} = \sigma_{i,t-1} (3.27)

If the distance to the best matching distribution is bigger than the threshold, then
the least likely component is replaced with a new one, which has \mu equal to the current value, a
large \sigma value, and a low \omega value. After the updates, the weights are renormalized.
All components of the mixture are sorted in decreasing order of \omega / \sigma.
Using a threshold T, it is decided how many of the first components form the
background. This threshold can be different for every pixel. After the component classification,
it is verified which pixels belong to a background component, in order to decide which pixels
represent a foreground object and which a background one [29].
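A hedged sketch of this per-pixel update logic (equations (3.19)-(3.27)) for a grayscale pixel is given below; the parameter values (K, alpha, T), the initial variance and the simplified background selection are typical choices made for the example, not the exact settings used in the thesis.

import numpy as np

class MoGPixel:
    def __init__(self, K=3, alpha=0.01, T=0.7):
        self.w = np.ones(K) / K            # component weights omega_i
        self.mu = np.random.rand(K) * 255  # component means
        self.var = np.full(K, 900.0)       # component variances sigma_i^2
        self.alpha, self.T = alpha, T

    def _eta(self, x, k):
        # Gaussian density of component k at value x
        return np.exp(-0.5 * (x - self.mu[k]) ** 2 / self.var[k]) / \
               np.sqrt(2 * np.pi * self.var[k])

    def update(self, x):
        """Update the model with pixel value x; return True if foreground."""
        d = np.abs(x - self.mu) / np.sqrt(self.var)
        k = int(np.argmin(d))
        if d[k] < 2.5:                     # matched: equations (3.21)-(3.24)
            rho = self.alpha * self._eta(x, k)
            self.w = (1 - self.alpha) * self.w   # (3.25) for the others
            self.w[k] += self.alpha              # (3.21) for the match
            self.mu[k] = (1 - rho) * self.mu[k] + rho * x
            self.var[k] = (1 - rho) * self.var[k] + rho * (x - self.mu[k]) ** 2
        else:                              # replace the least likely component
            k = int(np.argmin(self.w))
            self.mu[k], self.var[k], self.w[k] = x, 900.0, 0.05
        self.w /= self.w.sum()             # renormalize the weights
        # Components with the largest w/sigma form the background (simplified:
        # keep components while the cumulative weight stays below T)
        order = np.argsort(-self.w / np.sqrt(self.var))
        bg = order[np.cumsum(self.w[order]) <= self.T]
        return not any(abs(x - self.mu[i]) < 2.5 * np.sqrt(self.var[i]) for i in bg)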
Another way to detect the foreground is to estimate the background with Kernel Density
Estimators, by computing the background PDF using the histogram of the n most recent pixel
values. Since histograms may provide poor modeling of the true unknown PDFs, in [50] the
background is modeled non-parametrically with a KDE over the n most recent pixel values. To
detect the foreground objects, the following test is used:

P(I_t) = \frac{1}{n} \sum_{i=1}^{n} K(I_t - I_i) < T (3.29)

where T is a threshold value. The background is selectively updated based on the threshold.
The Mean-shift based background estimation [118] is a gradient-ascent method able to
detect the modes of a multimodal distribution together with their covariance matrices. The
method has high computational costs and requires a study of convergence [123]. The mean shift
vector can be computed with the following equation:

m(x) = \frac{\sum_i x_i \, g(\|(x - x_i)/h\|^2)}{\sum_i g(\|(x - x_i)/h\|^2)} - x (3.30)

where x is an arbitrary point of the data space, h is the analysis bandwidth, and g(u) is the first
derivative of the kernel profile with bounded support. The iterative implementation is far too
slow and requires high memory usage (n*size(frame)). Computational optimizations can be done,
but usually it is used only at initialization, for detecting the background PDF modes; later,
other computationally lighter methods are used (mode propagation).
The combined background estimation and propagation [156] is also known as Sequential
Kernel Density Approximation. In this case, the mean-shift mode detection from samples is used
only at initialization time, and later the modes are propagated by adapting them with the new
samples:
(3.31)
Heuristic procedures are used for merging the existing modes, without fixing the number of
modes upfront. These methods are faster than KDE and have low memory requirements.
For building a background model there is also the minimum and maximum method. In
[73, 72, 71], a pixel is marked as foreground if

|Max(x, y) - I_t(x, y)| > D (3.32)

or

|Min(x, y) - I_t(x, y)| > D (3.33)

where Min and Max represent the minimum and maximum values observed during training,
while D is the largest absolute difference between the background frames. These parameters
have to be initialized, usually during the first few seconds of the video, and are periodically
updated for the scene parts without foreground objects.
After the foreground detection, morphological erosion is applied to remove the noise and
all small regions. At the end, the holes in the foreground regions are removed by a closing
operation.
A different approach compared to the previous methods is the Eigenbackgrounds method,
introduced by A. P. Pentland. The main idea of this method is to use Principal Component
Analysis (PCA) [29, 119] to reduce the dimensionality of the space. PCA is applied to a
sequence of n frames to compute the eigenbackgrounds.
The eigenbackgrounds are computed using n frames. These frames are re-arranged as
columns of a matrix A. Using this matrix we can compute the covariance matrix C = A A^T.
From C, the diagonal matrix of its eigenvalues, L, and the eigenvector matrix, \Phi, are computed.
From these structures only the first M eigenvectors, representing the eigenbackgrounds, are kept.
To detect the foreground, we first project the image I onto the M-eigenvector sub-space
and then reconstruct it as I'. The difference I - I' is computed: since the sub-space
represents well only the static parts of the scene, the outcome of this difference is the set of
foreground objects.
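The following sketch illustrates the eigenbackground procedure just described (PCA over n training frames, projection onto M eigenvectors, thresholding of the reconstruction error); the SVD shortcut and the threshold value are assumptions of the example.

import numpy as np

def train_eigenbackground(frames, M=3):
    """Return the mean background and the M eigenbackgrounds."""
    A = np.stack([f.ravel().astype(float) for f in frames], axis=1)  # pixels x n
    mean = A.mean(axis=1, keepdims=True)
    # SVD of the centered data gives the eigenvectors of C = A A^T
    U, _, _ = np.linalg.svd(A - mean, full_matrices=False)
    return mean.ravel(), U[:, :M]

def detect_foreground(frame, mean, Phi, tau=30.0):
    x = frame.ravel().astype(float) - mean
    x_rec = Phi @ (Phi.T @ x)        # project onto the subspace, reconstruct I'
    diff = np.abs(x - x_rec)         # |I - I'| is large where moving objects are
    return (diff > tau).reshape(frame.shape)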
3.4 Integral Image based background subtraction
The Integral Image based foreground detection method builds on the idea of using
features instead of working directly on the raw image. In this method, similarly to the
eigenbackground method, the background model is a feature image, and for every frame the
feature image has to be computed and compared to the background. The advantage of working
directly on raw images is speed, while the deficiency is that only one pixel is considered at
subtraction, without dealing with its neighborhood. Starting from the fact that in most cases we
are not interested in single pixels or small foreground objects, we developed a method which
looks at the current pixel and its neighborhood, and handles shadows as well.
The Integral Image based foreground detection technique considers a pixel as
foreground if the majority of the pixels in its neighborhood are also foreground. To
determine this easily and quickly, we use rectangular features over the
neighborhood pixels. The simplest feature that satisfies all conditions is the sum of the pixel and
its neighborhood values, similar to an average. This sum can be computed very fast using an
intermediate representation: the integral image, which was introduced by Viola and Jones
[219].
3.4.1 Integral Image
The integral image [211, 199] at location (x, y) contains the sum of the pixels of the
rectangle with corners (0, 0) and (x, y).
Figure 4. The Integral Image
Figure 4 shows the integral image, where every pixel is computed using the following equation:

II(x, y) = \sum_{x' \le x, \, y' \le y} I(x', y') (3.34)
The integral image can be computed in one pass over the image, from left to right and
from top to bottom:

s(x, y) = s(x, y - 1) + I(x, y) (3.35)

and

II(x, y) = II(x - 1, y) + s(x, y) (3.36)

where s(x, y) is the cumulative column sum, with s(x, -1) = 0 and II(-1, y) = 0.
The integral image can also be computed quickly for rectangles rotated by 45
degrees. The rotated integral image is shown in Figure 5.
Figure 5. Rotated Integral Image

RII(x, y) = \sum_{y' \le y, \, |x - x'| \le y - y'} I(x', y') (3.37)
The rotated integral image is computed with two passes over the image. The first pass goes
from left to right and top to bottom:
(3.38)
and
(3.39)
The second pass over the image goes from right to left and bottom to top:
(3.40)
The feature image (SumI), which contains the rectangle sum features, is defined by the
following equation:

SumI_z(x, y) = \sum_{i=-z}^{z} \sum_{j=-z}^{z} I(x + i, y + j) (3.41)

where z is the neighborhood radius, so (2z + 1)^2 is the number of pixels in the neighborhood.
SumI can be calculated more efficiently with four sums using the integral image, for every
size z:

SumI_z(x, y) = II(x + z, y + z) - II(x - z - 1, y + z) - II(x + z, y - z - 1) + II(x - z - 1, y - z - 1) (3.42)
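As a concrete illustration of the recurrences (3.35)-(3.36) and of the four-reference lookup of equation (3.42), consider the following sketch; the zero-padding trick used to handle the -1 indices is an implementation choice of the example.

import numpy as np

def integral_image(I):
    # Equivalent to the one-pass recurrences: cumulative sums over rows and
    # columns. A row and a column of zeros are prepended so that lookups with
    # index -1 (e.g. II(x - z - 1, .)) are well defined.
    II = np.cumsum(np.cumsum(I.astype(np.int64), axis=0), axis=1)
    return np.pad(II, ((1, 0), (1, 0)))

def sum_feature(II, y, x, z):
    # SumI_z(x, y): sum over the (2z+1) x (2z+1) neighborhood, eq. (3.42).
    # The caller must keep the window inside the image bounds.
    y0, y1 = y - z, y + z + 1   # the +1 offsets account for the padding
    x0, x1 = x - z, x + z + 1
    return II[y1, x1] - II[y0, x1] - II[y1, x0] + II[y0, x0]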
Figure 6. Rectangular features. The AR represents the area of the SumI feature
The distances between L1-L2 and L3-L1 are equal to 2z. This can be generalized to
rectangles of arbitrary length. The SumI feature can be computed for rotated versions too.
Figure 7. The rotated RSumI feature.
The rotated SumI feature is computed with the following equation:
(3.43)
where h is the projection of L1-L2 and w is the projection of L2-L4 onto the x axis. The (x, y)
point is at the L2 position.
3.4.2 Estimating the Integral background image and shadow
detection
The background image in our case is substituted with a background sum-feature image.
Every background feature is modeled by a Gaussian distribution described by its average and
standard deviation. This is maintained using the running average method:

\mu_t(x, y) = \alpha \, SumI_t(x, y) + (1 - \alpha) \mu_{t-1}(x, y) (3.44)

\sigma_t^2(x, y) = \alpha (SumI_t(x, y) - \mu_t(x, y))^2 + (1 - \alpha) \sigma_{t-1}^2(x, y) (3.45)

In order to determine the foreground mask image, we compare the current image features
to the background model:

D_t(x, y) = SumI_t(x, y) - \mu_t(x, y) (3.46)
If the absolute value of D_t is bigger than a threshold (Th2), then the pixel is a
foreground pixel; if its absolute value is between Th1 and Th2, it has to be verified whether it is
a shadow or a foreground object; in case it is below Th1, it is a background pixel.
Th1 is calculated as follows:

Th1 = k \sigma_t (3.47)

where k is a constant with a value between 2 and 3; the Integral Image based foreground
detection approach uses a fixed value from this range. Using only the standard deviation to
compute the threshold value covers only gradual illumination changes, which is not sufficient.
To improve the threshold value calculation and prepare for sudden light changes as well, we
have introduced an additional component. This component reflects the global changes of the
illumination and is computed using equation (3.48):
(3.48)
The second threshold:
(3.49)
where g is a gate value used in the system.
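A rough sketch of this three-way decision is given below; it follows equations (3.44)-(3.47) as reconstructed here, chooses k = 2.5 only as an example, and omits the global illumination term of equation (3.48) for brevity, so it should be read as an approximation of the method, not its exact implementation.

import numpy as np

def classify_features(sum_img, mu, var, alpha=0.05, k=2.5, gate=15.0):
    """Classify SumI features and update the background feature model."""
    d = sum_img - mu                       # equation (3.46)
    th1 = k * np.sqrt(var)                 # equation (3.47)
    th2 = th1 + gate                       # simplified stand-in for (3.49)
    foreground = np.abs(d) > th2
    maybe_shadow = (np.abs(d) > th1) & ~foreground  # still needs the texture test
    # Running-average model maintenance, equations (3.44)-(3.45)
    mu = alpha * sum_img + (1 - alpha) * mu
    var = alpha * (sum_img - mu) ** 2 + (1 - alpha) * var
    return foreground, maybe_shadow, mu, var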
Detecting shadows is an extremely hard task and there are no perfect algorithms for
doing it. The assumption for shadow detection is that the regions of shadow are semi-transparent,
without any texture change. To check the texture, we simply look at the gradient direction in the
vertical and horizontal directions and compare them to the background model. This is done
using equation (3.50):
(3.50)
If the fraction is bigger than 0, it means that the gradient directions were maintained in both the
vertical and horizontal directions. The optimal value of the fraction is 1. Considering that the
square features are approximated by a Gaussian, it can differ from the optimal value within a
tolerance. It would be more exact to compute this tolerance using the standard deviation of the
rectangle feature, but that introduces more computational cost without much benefit.
3.5 Discussions and experiments
In order to compare systematically the methods presented above, to measure their
performance and to decide which one is the most suitable for our purpose, we need to identify
some quality measures. The quality measures in most cases are task dependent. It is important to
identify the future usage of the algorithm's results and, based on that, decide which parameters
are the most important ones.
With respect to our aim of speeding up human detection and reducing the false
detection rate, we have several cases. These cases are influenced by the type of human detection
technique used and by the scene properties.
From the human detection techniques' point of view, we have two directions.
The first one is when we use the foreground detection to reduce the search area of
possible humans. In this case, the detected foreground only defines a rectangular window which
could contain a human. The human detection techniques search inside that window for the
presence of a human, without taking into account the shape of the foreground.
The second direction is when the human detection technique relies also on the shape
boundaries of the foreground. In this case, very precise foreground detection is needed, because
this precision directly influences the outcome of the human detection module.
We can state, even before showing the results of the performance analysis of the
foreground detection techniques, that there are no generally good foreground detection methods
suitable for every type of scene scenario. The different scene properties influence the
performance of the methods. Based on background model changes we can identify several
classes of scenes. The first and worst case is a dynamic background. In this case the background
is moving, but with a different velocity than the foreground. Here there are two different cases as
well: the first is when the motion of the background is generally parallel with the image plane
(fixed, rotating cameras), and the second is when the background can also change along the third
axis, in depth (moving or zooming camera). The second class is the static background. Here
there are several cases as well: completely static background, periodically changing background
(tree leaves, waves), and static or periodically changing background with moments when the
scene changes completely, without any relation to previous backgrounds or scenes (movies).
Figure 8. Test examples for frame differencing in outdoor and indoor environment a) actual
image, b) background image, c) foreground mask
The performance of a foreground detection method has several components. One
component is the precision, or discriminative capability, of the foreground detection. The
precision has two components: detection precision, which refers to a low probability of
misclassifying a foreground pixel, and discriminative power, meaning that background will
not be classified as foreground. These two measures are known in the artificial intelligence
literature as false negatives and false positives. The highest precision is achieved when both
values are minimized. Minimizing them is time consuming and sometimes not even possible.
That is why it is important to measure them separately and precisely.
Figure 9. Test examples for Running average foreground detection in outdoor and indoor
environment a) actual image, b) background image, c) foreground mask
To measure the detection precision, we manually labeled video sequences from indoor
and outdoor environments. Frame by frame, we marked precisely the moving objects, and
then we computed the detection precision as the ratio between the detected foreground pixels
and the labeled foreground pixels.
36
Figure 10. Test examples for Running Gaussian Average foreground detection in outdoor and
indoor environment a) actual image, b) background image, c) foreground mask
P = \frac{|F \cap F_d|}{|F|} (3.51)

where P is the precision, F is the set of labeled foreground pixels, F_d is the set of detected
foreground pixels, and |F| is the number of elements of set F, which can be computed as
follows:
(3.52)
Figure 11. Test examples for the Min-Max method in outdoor and indoor environment a) actual
image, b) background image, c) foreground mask
The discriminative power of the foreground detector is computed as the rate between the
incorrectly and correctly classified pixels:
(3.53)
where D_p is the discriminative power and N is the number of pixels of the image; the number of
incorrectly classified pixels can be computed as follows:
(3.54)
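The two measures can be computed directly from a detected mask and a labeled ground-truth mask, as in the sketch below; the discriminative-power formula is our reading of the "rate between incorrectly and correctly classified pixels", since equations (3.53)-(3.54) are not reproduced here.

import numpy as np

def detection_precision(detected, labeled):
    # Equation (3.51): fraction of labeled foreground pixels actually detected
    return np.logical_and(detected, labeled).sum() / labeled.sum()

def discriminative_power(detected, labeled):
    # Our interpretation: 1 minus the rate of background pixels that were
    # wrongly marked as foreground (false positives)
    false_pos = np.logical_and(detected, ~labeled).sum()
    background = (~labeled).sum()
    return 1.0 - false_pos / background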
Figure 12. Test examples for Gaussian Mixture based foreground detection in outdoor and
indoor environment a) actual image, b) background image, c) foreground mask
From our point of view, high detection precision is more important than the
discriminative power, in order not to miss any humans. However, a very low discriminative
power increases the computational demand, because in this case additional filtering has to be
used to eliminate small foreground objects and to decide if the objects are humans or not.
Figure 13. Test examples for Eigenbackground based foreground detection in outdoor and
indoor environment a) actual image, b) background image, c) foreground mask
In some situations even the best foreground detection methods fail. A good
example is when the foreground pixels are similar to the background pixels. To eliminate
the possible errors in these types of situations, a top-down approach has to be used. Usually
these methods have higher computational costs than benefits. Unfortunately, for the Integral
Image based foreground detection method, the computational speed could not be increased in
this way.
40
Figure 14. Test examples for integral image based foreground detection in outdoor and indoor
environments a) actual image, b) background image, c) foreground mask.
Examples of the implemented foreground techniques' results are presented in Figure 8 to
Figure 14. All figures have three parts: the actual image, its computed foreground mask, and the
background image. In some cases the presented background image is misleading, because the
background models cannot be illustrated as an image. In these cases we applied a reverse
transformation to see how it would look. Even so, the background images cannot encode the
whole information of the models.
The performance of the foreground detection techniques is influenced by several factors
which cannot be measured using only pictures. One such event is gradual illumination change.
We can state that all techniques try to address gradual illumination change by periodically
updating the background. The question is how often the background should be updated. If the
updates are too frequent, the response to illumination change in the background will be good,
but some ghost objects could appear in the foreground mask. If the updates are made rarely, the
background might not change as fast as the illumination is changing.
The measured values obtained for the foreground detection techniques are summarized in
the following table. The tests were made on indoor and outdoor image sequences and we
computed an average of the measured values. The tests were made on a 2.66 GHz Intel Core 2
Duo PC.
Table 1. The comparison of foreground detection techniques with respect to the performance
measures: speed, detection precision, discriminative power

Method                                        Speed (ms/100 frames)  Detection precision  Discriminative power
Frame differencing                            1.12                   0.09                 0.72
Running average foreground                    10.25                  0.65                 0.75
Running Gaussian Average                      10.49                  0.72                 0.73
Min-Max method                                145.4                  0.71                 0.77
Mean-shift based foreground detection         160.45                 0.77                 0.80
Gaussian Mixture based foreground detection   80.56                  0.84                 0.79
Eigenbackground based foreground detection    127.47                 0.87                 0.81
Integral Image based foreground detection     20.56                  0.83                 0.88

We have also analyzed the memory requirements of the different methods, but we noticed
that the numeric results are not relevant, since they are highly dependent on the image size and
the background model update variants, so we have categorized them into three classes: Low,
Medium and High. The results are presented in the following table.

Table 2. Memory requirement categorization

Method                                        Memory requirements
Frame differencing                            LOW
Running average foreground                    LOW
Running Gaussian Average                      LOW
Min-Max method                                MEDIUM
Mean-shift based foreground detection         HIGH
Gaussian Mixture based foreground detection   MEDIUM
Eigenbackground based foreground detection    MEDIUM
Integral image based foreground detection     LOW
42
Figure 15. The effect of quick illumination changes
We have tested the methods against quick illumination changes as well. This is an
important aspect of foreground detection methods, because it also reflects the robustness of the
method. On a cloudy day, or next to a changing light source (a lamp being switched on/off), the
background characteristics will be seriously altered. This results in a drastically increased
number of falsely detected foreground regions and, in the worst case, the whole image could
appear as foreground. Figure 15 shows some examples of how the different methods respond
to these conditions.
The fastest reaction among the reviewed methods certainly belongs to Frame Differencing,
because in that case no extra background model maintenance is needed, only a differencing and
a thresholding. In this case the precision is low, because the method finds differences mostly on
the contour of the moving object. The size of the contour is highly dependent on the velocity of
the moving object. However, the discriminative power of the method is good, because the
background model updates instantaneously to the environmental changes. In Figure 15 it can be
seen that this method has the best response to quick illumination changes. The method cannot
deal with very slow or still foreground objects, even if the objects are motionless for a few
frames only. Another inefficiency of this method is that it does not deal with camera oscillation
and high-frequency background changes.
The Running Average and the Running Gaussian Average have about the same complexity.
The difference between them is that in the case of the Gaussian Average method the threshold
value is computed automatically and differentiated for different regions of the image. The results
prove that the Running Gaussian Average is more precise, but it requires more memory, is a
little slower, and responds weakly to quick illumination changes. The response to slow
illumination changes depends on the value of the learning rate parameter. These methods cannot
deal with high-frequency background changes or camera oscillations.
The Minimum-Maximum model is slower than the Running Gaussian Average because it
involves many nonlinear computations. One of its strengths is the automatic threshold value
computation. Similarly to the Running Gaussian Average method, it does not cope with
multimodal background distributions. The response to fast illumination changes is also weak.
The Mean-shift method can effectively model a multimodal distribution without the need
for assuming the modes a priori, but it has a very high computational cost. Although there are
several methods to reduce the computational cost, it is considered one of the slowest methods. It
handles camera oscillations well, while it adapts very slowly to quick illumination changes.
The Gaussian Mixture Model is one of the best methods. It is not too slow and in some
cases achieves really good results. It handles multimodal backgrounds, so it can cope with
high-frequency background changes, and can compute the threshold automatically. It deals with
quick illumination changes only if those are repetitive. Another disadvantage is that the number
of Gaussians needs to be given a priori. The speed and the memory requirements depend on the
number of Gaussians.
We have made experiments with the eigenbackground method as well, with a training set
of the 20 most recent images and 3 eigenbackgrounds. The quality of the results was good, but
significantly dependent on the images used for the training set. When the current image
contained a moving object in the same position as in a training image, the eigenspace could not
remove the moving object completely, and it appeared as a ghost in the foreground. The method
does not cope with fast illumination changes.
The Integral Image based foreground detection method is comparable in speed with the
Running Gaussian Average method. The method does not deal with multimodal backgrounds,
but if the changes in the background are small, it is invariant to them, since it works with
features. The method has a good response to fast illumination changes and computes the
threshold automatically. The disadvantages of the method are that small objects are not detected
and that all detected objects appear smaller than in reality; on the other hand, no extra filtering
of very small objects is needed. An extra advantage of the method is that it detects and removes
the shadows. In the examples presented in Figure 8 to Figure 14 it can be seen, especially on the
outdoor test image, how this shadow detection technique works.
Figure 16. Results of the optical flow algorithm
We left the conclusions about optical flow to the end, since it is different from the other
foreground detection methods. The result of the optical flow gives us the objects' velocities.
Segmenting the image into foreground and background with this method requires further
processing of the result. Computing the optical flow and then segmenting the result costs far too
much, but in some cases, when all other methods fail to work, the optical flow can be suitable.
Such a case is when we have to deal with a non-static background.
To test the optical flow algorithm for foreground detection, we have implemented the
Lucas-Kanade optical flow variant.
3.6 Conclusions
The topic of this chapter was the preprocessing component of human behavior recognition
systems. The scope of the preprocessing component is to speed up the human detection process
and to reduce the false positive results without influencing the detection rate. Based on our
studies and tests, we have concluded that the best way to achieve this purpose is to use a
foreground or moving object detection method.
Our contributions to this field are the following:
Identification of the challenges of this component
Definition of the performance measurements suitable for comparing the different
methods [210]
Implementation and comparison of 9 different foreground detection techniques
[201]
A new foreground detection algorithm based on the Integral Image [210]
The first part of this chapter describes the list of features and characteristics we have
identified that a foreground detection method needs to possess. The list of features was selected
based on the conclusions described in the related literature and on the results of our
experiments:
Detection precision
Discriminative power
Shadow removal capability
Memory requirement
Computational power requirement
The detection precision and discriminative power of a foreground detection method are
influenced by situations like sudden illumination changes, background relocation and
high-frequency background objects [210].
The second part presents the 9 methods we have implemented and tested:
Frame differencing,
Running average foreground,
Running Gaussian Average,
Min-Max method,
Mean-shift based foreground detection,
Gaussian Mixture based foreground detection,
Eigenbackground based foreground detection,
Optical flow (Lucas-Kanade),
Integral Image based foreground detection (our method).
The tests were performed on both indoor and outdoor image sequences to cover all
challenging cases. The test results were compared based on the performance measurement model
defined using the features identified in the previous phase.
We concluded after the tests that there are no generally good methods that can perform
well in every situation. The best solution is to combine multiple methods and always use the
best one for that specific scene.
Another finding was that the foreground detection methods can be used successfully in
scene change detection. To do so, we only need to determine the acceptable maximum
percentage of foreground in the image. If the foreground exceeds this threshold, we consider
that the scene is changing. If we encounter changing scenes very often, we can use optical flow
detection to track the motion in the scene [201].
The last part of the chapter presents the novel method we have elaborated: a new way of
detecting the foreground objects. After defining it, we implemented this method and performed
all the tests we had done with the other methods as well. The results were compared and
analyzed using the same performance measurement model.
Based on this comparison, we have shown that our Integral Image based foreground
detection method is suitable for foreground detection with static backgrounds. It has the same
precision as the Gaussian Mixture Method but is 4 times faster. In combination with the
Haar-feature based human detection method, the integral image is reusable for human detection
as well, which speeds up the human detection process even more. The method has one of the
best discriminative performances [210].
4. Human Detection and Pose Estimation
The most important task of the human behavior recognition framework is human
detection and pose estimation. The task can be defined as follows: in a given image or image
sequence (video), identify the position of the humans and their pose or body configuration. The
way this can be done depends on the image quality (information quantity) and the restrictions
applied to the system. This chapter presents three types of human detection techniques, together
with extensions that make them suitable for human pose estimation as well.
4.1 Introduction
The human pose detection task is one of the most important tasks, because it represents
the measurement stage of the human behavior recognition process. The behavior understanding
accuracy is strongly related to the human pose detection task: the more accurate the results of
this process are, the more accurate the behavior understanding will be. It is also the most
complicated task of the system, due to the high importance of its accuracy and the huge
variability of human appearance.
There are several approaches to detect, in a frame, the human position and its
configuration. These methods can be categorized in different ways. The most common
categories are the "component-based" methods and the "single detection window" analysis
[62, 61].
The "component-based" methods detect the invariant object parts separately and check if
they are present in a geometrically natural configuration. This type of method also has many
variants. The part detection and the configuration matching can be done consecutively [109, 110,
142, 143] or by using a hierarchical detection framework. In the case of a hierarchical approach,
the body parts are detected in their order of importance; if one of the basic parts is missed, the
other parts are not searched for [62]. These systems have the ability to explicitly deal with
partial occlusion. They are slower than single detection window based methods, because they
have to detect more than one component. Some variants of the approach use a fixed model and
do not handle multi-view and multi-pose cases, so they are fairly limited, similarly to single
detection window methods. Mostly the same algorithms are used to detect the human
components as in the single detection window analysis approach. For component-based human
detection systems, see Mohan's work [126].
The other approach is the "single detection window" analysis. Its most significant feature
is its speed, while its drawback is a limited partial occlusion handling. There are three major
types of "single detection window" methods [126, 154], similarly to the component detection in
"component-based" approaches.
The first type of methods uses 2D or 3D human models and tries to match them with
parts of the image. It is difficult to create models that are general and simple enough, but are
also capable of capturing every particular human motion. After the successful matching (which
is a very time consuming process), the pose detection follows.
The second type uses predefined image features and their relationships, which uniquely
determine the human objects. This is applicable mostly to rigid objects.
The third type is the "example-based" methods, which use a labeled training set to learn
to recognize human objects. The key elements of this method are the training set and the
training algorithm.
4.2 Human detection and pose estimation with the "example-based" method
The "example-based" method is one of the most popular methods. These methods are
very popular because they appear to be very simple: only some examples need to be selected
and a neural network has to be trained, then everything will work correctly. Despite this
appearance, creating a trainable structure that learns the task from examples is hard. Even
creating a good training set is complicated and requires a large amount of time and knowledge.
One of the most promising "example-based" techniques was introduced by Viola [219],
who applied a rapid object detection method to face detection and extended it to handle
multi-view face detection as well [218]. They use Haar-like features for rapid feature extraction
and the AdaBoost learning algorithm for feature selection and strong classifier creation. The
first classifier was a cascade classifier, modified later into a tree classifier by Lienhart [103, 105,
104, 154].
The idea behind this approach is that the classifier should eliminate the majority of the
negative inputs in the earlier stages of the classification, and only the searched objects will be
classified in all stages.
Initially the method was developed to detect human faces, then extended to human
bodies as well. But there are significant differences between detecting human faces and human
bodies. The cause of the differences originates from the nature of the two types of "objects".
Faces are rigid objects, because their features/parts, like the nose, eyes and mouth, are situated
approximately in the same relative position to each other, while the human body is a
semi-deformable object. In a way it has some rigidity, since body parts are fixed to the body, but
they are deformable. This semi-deformable nature of the human body significantly increases its
in-class variability, resulting in very distinctive human class elements.
A common problem of all the techniques presented before is that they do not handle this
degree of in-class variability. A solution to this problem is to use more than one classifier for
every type of human appearance. This solution requires a categorization of the training data.
Without a reliable clustering method, this significant amount of work needs to be done
manually. Since the data are not collected in a controlled environment, manual categorization
can become prohibitively expensive, and because of the fundamental ambiguity in labeling
different poses and views, the complexity of the work grows linearly with the number of classes.
Manual categorization is also an error-prone procedure that may introduce significant bias into
the training process.
Shan et al. [177] recently presented a novel framework which unifies the categorization
and the classification. The drawback of the method is that in the first step it categorizes the
input into a number of classes, and after that secondary classifiers are used to decide whether
the categorized inputs are humans or not.
Another common problem of these methods is that they have only two outputs. If we
want to use one of these object detectors to detect and also estimate the human pose, we need to
introduce another classifier for that purpose.
4.2.1 Haar feature
The Haar wavelets form a robust set of basis functions, representing differences of
intensity between neighboring regions [219]. Using the integral image, the value of a Haar
feature can be computed very fast. The Haar function has two important properties:
The value of a Haar function is the same if the picture is reduced or enlarged by
a factor.
The computation of a Haar feature using the integral image takes the same time
for every size.
These two properties make it suitable for use in classifiers that need to work in real time.
Two sets of Haar features are used extensively: the basic set and the 45-degree rotated features,
which can be calculated using the integral image and the 45-degree rotated integral image [103].
Figure 17. Edge, line and center-surround Haar features.
Left: basic set; right: extended set [103]
4.2.2 The AdaBoost algorithm
The AdaBoost algorithm was proposed by Freund and Schapire [54]. The idea of boosting
is to use weak classifiers to form a highly accurate prediction rule by calling the weak classifier
repeatedly on different distributions over the training examples. A weak classifier is only
required to be better than chance, and thus can be very simple and computationally inexpensive.
Initially, all weights are initialized equally, but in each round the weights of incorrectly
classified examples are increased; this way the images which were poorly predicted by a
previous classifier receive greater weight in the next iteration.
From every Haar feature a very efficient weak classifier can be built. To build a weak
classifier h_j(x) using the Haar feature f_j(x), we need a threshold value \theta_j and a parity p_j
to indicate the direction of the inequality (equation (4.1)):

h_j(x) = 1 if p_j f_j(x) < p_j \theta_j, and 0 otherwise (4.1)
The most significant propriet ies of AdaBoost focus on combining a set of weak
classif iers into a strong classifier b y its ability to efficiently reduce the training error.
Different variants of boosting are known, like: Discrete Adaboost, Real AdaBoost, and
Gentle AdaBoost [ 54]. All of them are identical with respect to computational complexity from a
classification perspective, but have different learning algorithm.
The classifier for human detection is built by using the AdaBoost algorithm and
Haar-feature based weak classifiers [103].
The first step is creating the training set, consisting of significant human images (around
15,000), various non-human images (around 100,000) and a file which contains the category of
every image (the expected output). First, the optimal threshold for every weak classifier needs
to be computed. Using algorithm 1 we can compute the weights and select the best weak
classifiers to achieve the desired performance.
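The threshold search for a single weak classifier can be done in one pass over the sorted feature values, as the following sketch illustrates; this is a generic decision-stump selection under the current AdaBoost weights, not the exact algorithm 1 referenced above.

import numpy as np

def best_stump(feature_values, labels, weights):
    """labels in {0, 1}; returns (threshold, parity, weighted_error)."""
    order = np.argsort(feature_values)
    f, y, w = feature_values[order], labels[order], weights[order]
    total_pos = w[y == 1].sum()
    total_neg = w[y == 0].sum()
    pos_below = np.cumsum(w * y)          # weight of positives at or below the cut
    neg_below = np.cumsum(w * (1 - y))    # weight of negatives at or below the cut
    # Error if we predict "human" below the cut, or "human" above the cut
    err_below = neg_below + (total_pos - pos_below)
    err_above = pos_below + (total_neg - neg_below)
    errs = np.minimum(err_below, err_above)
    i = int(np.argmin(errs))
    parity = 1 if err_below[i] <= err_above[i] else -1
    return f[i], parity, errs[i]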
4.2.3 The cascade classifier
The computational cost for a monolithic classifier sele ction is the same for every input
image. In real life, we can decide easily in which categories some images belong, but selecting a
category for them needs more attention. The cascade classifier is based on this idea: It reject s
rapidly the most part of th e negative images, then evaluates a part of the images with a few weak
classifiers and evaluates only complex images with the whole classifier. Direct consequence of
this approach is different evaluation time for different images and an overall speed up of the
whole classification procedure.
The cascade design process is driven by a set of detection performances [105]. If each
stage classifier is taught for low performances (f<0.6, false detection rate/stage and d>0.999, hit
rate/stage), then the whole cas cade will have the same classification performances as a
monolithic classifier, but more than 10 times faster [167].
In the training stage, we need to set the desired performance used as the stop condition.
To test the performance we also need a test set. Each stage is trained using only the negative
images that are classified incorrectly, while all the positives are left as they are. In the case of
the cascade classifier the trade-off is between filtering performance and processing speed. A
high number of weak classifiers and stages gives the cascade classifier a higher filtering
performance but a lower processing speed.
Figure 18. Cascade classifier
4.2.4 Human body Detector and Pose Estimation Tree
The classical cascade classifier was developed and tested for face recognition and,
without changes, is not suitable for human body detection, because of the body's high in-class
variability. Since the human body is semi-deformable, can take various poses and can appear in
diverse views, the usage of a single classifier will result in overtraining, while the usage of a
weak classifier will make it impossible to detect specific body states. This difficulty can be
overcome by dividing one of the classes into subclasses with specific features, to be able to
distinguish specific patterns, and then using a specialized classifier for every subclass. Using this
approach, the classification complexity will grow linearly with the number of subclasses. To
keep the real-time behavior, the detector has to be very fast. This approach tries to resolve the
contradiction between improving the detection performance (more subclasses are needed, so
more processing time) and decreasing the processing time. Our starting point is Viola and
Jones's [219] work, which uses a cascade classifier to preserve classification accuracy, while
using the "coarse-to-fine" strategy to reduce the calculation complexity. The cascade classifier
was extended later [103] by adding a pose estimator before the cascade classifiers to handle
multi-view cases. This estimator estimates the pose of the possible face and chooses the correct
face detector.
The human body detector and pose estimator tree tries to merge these two steps: the pose
estimation and the classification. Using a pose estimator before classification would not
preserve the coarse-to-fine strategy, and its pose estimation for "non-object" patterns would take
too much time. Instead, we first eliminate all certainly "non-human" patterns and split the class
into subclasses (subclasses are groups of poses and views) only when it is necessary. By
necessary we mean that further classification performance cannot be achieved otherwise, or it
can, but at too high a cost.
Figure 19. Tree classifier for detection and pose estimation
We used a binary tree classifier to realize this purpose. This tree can be viewed as a
cascade classifier which merges the earlier common stages of multiple specialized cascade
classifiers and inserts pose estimation stages after every stage where it is needed to choose
between specialized classification stages. With this merging we can achieve a very fast detector,
because the majority of the false patterns are eliminated in the earlier stages of the classification.
Another advantage of this method is the automated estimation of the detected object's pose.
The classification and the pose estimation use the same Haar-like features introduced by
Viola and Jones [219] and extended by Lienhart [103] as the original cascade classifier, because
they can be calculated very fast using integral images.
The tree is built using two types of nodes: classification and estimation nodes. Both of
them use Haar-like features to classify the input pattern. One difference between them is that
the estimation nodes use only one Haar-like feature, while the classification nodes can use more
than one feature to classify the input data. The second difference is how their outputs have to be
processed. If a classification node has a negative response, the input pattern is dropped and no
further processing is made. If the node's response is positive, its output is an output of the tree
when the node is a leaf; otherwise it is an input of its child node. In the case of estimation
nodes, their outputs are processed by one of the child nodes, left or right, or, in the case of a leaf
node, they are also outputs of the tree. We refer to our classifier as the Pose Estimation Tree
Classifier (PETC).
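The traversal logic described in this paragraph can be sketched as follows; the node classes and the callables stored in them are placeholders, so this is only a schematic rendering of the PETC idea, not the trained tree.

class ClassificationNode:
    def __init__(self, stage, child=None, pose=None):
        # stage: callable returning True/False for a pattern
        self.stage, self.child, self.pose = stage, child, pose

class EstimationNode:
    def __init__(self, feature, left, right):
        # feature: callable returning a scalar Haar-like feature value
        self.feature, self.left, self.right = feature, left, right

def petc_classify(node, pattern):
    """Return the estimated pose label, or None if the pattern is rejected."""
    while node is not None:
        if isinstance(node, ClassificationNode):
            if not node.stage(pattern):
                return None            # negative response: pattern dropped
            if node.child is None:
                return node.pose       # leaf: tree output with its pose label
            node = node.child
        else:                          # estimation node: route by one feature
            node = node.left if node.feature(pattern) < 0 else node.right
    return None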
4.2.5 Building the training set
To obtain an efficient classifier, one of the main tasks is to create a proper training set.
To get a proper training set we need to decide about the following:
input pattern size
human image preparation
the background image size
how to choose the significant images
how big the necessary training set is
The input size of the images is important because it determines the number of features
used in the learning process. For example, in the case of face recognition, for a pattern of 24×24
pixels, 84,848 features (BASE) in the basic set, 111,360 (CORE) in the extended set, and
138,694 (ALL) in the entire set [167] have to be evaluated. The relation between performance
and image size is the following: larger images are more detailed and contain more information,
but they need more memory and more features for evaluation. A higher number of features
requires more computational power and a bigger training set.
Figure 20. Relation between input pattern size (pixels) and processing time (ms)
There is no formula for computing the ideal size of the input pattern; the only way is to
perform an experimental analysis. The ideal size is the smallest size at which correct
classifications can still be made. The search for the smallest input pattern size has two reasons.
The first reason is that if the training images are smaller, fewer features are needed for
classification, therefore the classification will be faster. The second reason is that by using small
images we do not have many details, so the classifier will be more general, having less chance
of being overtrained. According to the experiments reported in [104], for face detection the best
pattern size is 24×24, because it has the lowest false alarm rate at the same hit rate. According
to our experimental results, the optimal pattern size for human detection is 128×64. We also
concluded that the optimal pattern size depends on the variety of the database used for training.
Other human detector approaches have obtained other optimal dimensions for the training
image pattern. Face normalization is easy, because heads are rigid objects, while normalizing
the human body is more difficult, since it can appear in different postures. There are many
possibilities to crop human images. Cropping only the significant part of the images results in
images of different sizes, because humans can appear in a high variety of poses.
To normalize the human body, a point has to be selected that is constant for every
position of the body. Normalizing based on the height of the person proved ineffective, because
a person's height differs when sitting, stepping or standing. For the same reason, normalizing
based on width was also unsuccessful; both introduce extra in-class variability into the training
set, and such normalization works only if the training set contains a single human pose. We
finally concluded that the most stable feature is the height of the head. The head's height also
changes across poses, but its variation was the lowest of all the candidates we considered.
During normalization we resized the human body so that the height of the head is always
the same, then translated the person in the image so that the head coordinates are in the same
position. The positive training images also contain background around the person, so it is
important to have many images of every pose with different backgrounds.
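As an illustration, the head-height normalization step can be sketched as follows (a minimal sketch in Python; the target head height, the anchor position and the use of OpenCV are our assumptions, not values fixed by the thesis):

import cv2
import numpy as np

PATTERN_W, PATTERN_H = 64, 128    # pattern size used in the thesis (width x height)
TARGET_HEAD_H = 20                # assumed normalized head height in pixels
HEAD_ANCHOR = (32, 12)            # assumed fixed head-center position (x, y)

def normalize_example(image, head_box):
    # head_box = (x, y, w, h): annotated head bounding box in the source image
    x, y, w, h = head_box
    scale = TARGET_HEAD_H / float(h)              # make every head the same height
    resized = cv2.resize(image, None, fx=scale, fy=scale)
    cx, cy = (x + w / 2.0) * scale, (y + h / 2.0) * scale
    # translate so the head center lands on the anchor, then crop the pattern
    M = np.float32([[1, 0, HEAD_ANCHOR[0] - cx],
                    [0, 1, HEAD_ANCHOR[1] - cy]])
    return cv2.warpAffine(resized, M, (PATTERN_W, PATTERN_H))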
The size of the background image does not seem to be as important; it is never explicitly
specified in the literature. It can be taken to be the same as the positive pattern size, namely
128×64 pixels.
The example-based learning method needs a huge training set with a high number of
positive and negative examples. To achieve good performance, the examples in the training set
should be representative elements of the class. The positive examples are usually cropped
manually; in the case of the human body it is important to have positive images for every body
configuration. Practice shows that an amount of 15,000 human images is enough, but the
difficult question is the number of backgrounds. The images of our own database were collected
from public annotated databases such as INRIA and MIT and from surveillance videos, completed
with images we cropped and annotated ourselves. The background images are cropped from
surveillance and documentary videos without human presence. One problem with the background
selection is that we have too many similar backgrounds, which increase the learning time without
increasing the performance. To eliminate the similar backgrounds we normalized the negative
examples and compared them using a simple similarity metric, keeping only one image from each
group of similar ones. To increase the learning capabilities we also filtered the background
images using the last trained stage of the classifier. To increase the generality of the
classification we applied distortions to the training set (random small translations, rotations and
resizings).
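A minimal sketch of the similarity-based deduplication of the negative examples (the normalization steps and the threshold are assumptions; the thesis does not specify the exact similarity metric):

import cv2
import numpy as np

SIM_THRESHOLD = 0.05   # assumed: mean absolute difference below this means "similar"

def deduplicate_negatives(images, size=(64, 128)):
    # keep only one representative from every group of similar backgrounds
    kept, normalized = [], []
    for img in images:
        g = cv2.cvtColor(cv2.resize(img, size), cv2.COLOR_BGR2GRAY)
        g = cv2.equalizeHist(g).astype(np.float32) / 255.0
        if all(np.mean(np.abs(g - n)) > SIM_THRESHOLD for n in normalized):
            normalized.append(g)
            kept.append(img)
    return kept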
Figure 21. Examples from the training set (positive images)
Another issue is the size of the training set, which is proportional to the learning time and
affects the performance of the classifier. For face recognition Lienhart proposed a training set
with 5,000 positive and 3,000 negative images [104]. Viola and Jones built their classifier with
4,916 faces and 10,000 non-faces selected randomly from a set of 9,500 images that did not
contain faces [219]. Considering the possible face configurations and the much larger diversity
of human poses, the size of the training set should be increased.
First we used 10,000 positive images and 27,000 negative ones. At this training set size
the program used approximately 4 GB of memory and 240 s for the selection of one Haar feature.
We used an Intel Core 2 Duo computer with a 2.6 GHz CPU and 3 GB of RAM; it needed more
than 4 days for the selection of 1,500 features.
The detector was built to scan the input at multiple scales and locations; we used a step
size of one pixel and a scale factor of 1.3. In order to achieve a false alarm rate of around
5×10⁻⁶, based on other published experiments [218], we would need millions of different
background pictures.
If we try to use millions of different backgrounds, the processing time of one feature
selection increases. A huge training set needs a huge amount of memory, which exceeds the
usable RAM. If we use virtual memory created on the HDD to solve the memory problem,
training the classifier takes more than 50 times longer, approximately one more month.
To reduce the training time of the cascade classifier we tried to reduce the training set
without negatively influencing its performance. The starting point for our idea was the behavior
of the cascade classifier: at the earlier stages it eliminates the majority of backgrounds without
eliminating the humans. A direct consequence of this behavior is that we cannot eliminate any
image from the positive set, but we can remove those backgrounds that are already rejected by
an earlier stage. Even with this filtering technique, the early stages of the classifier still see a
huge number of background images, which makes the learning process very time consuming. To
reduce the learning time we use only a randomly selected part of the backgrounds. In theory this
choice reduces the performance of the classifier, because the randomly chosen backgrounds
influence the selected weak classifiers; with this approach the earlier stages eliminate fewer
backgrounds than if we had trained with the whole set. Algorithm 2 presents the process of
selecting the background images for training. Based on experimental results, the number of
selected backgrounds always needs to be larger than the number of positive examples.
Algorithm 2: The background selection algorithm
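A minimal sketch in Python of the selection step the listing describes (the predict method of the partially trained cascade and the oversampling ratio are assumptions, not part of the thesis):

import random

def select_backgrounds(negatives, cascade_so_far, n_positives, oversample=2):
    # keep only negatives that still pass every trained stage, i.e. the
    # false positives of the current cascade ...
    survivors = [img for img in negatives if cascade_so_far.predict(img)]
    # ... then subsample them randomly, keeping more negatives than positives,
    # as required by the experimental rule stated above
    n_keep = min(len(survivors), oversample * n_positives)
    return random.sample(survivors, n_keep)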
4.2.6 Training the tree
We are given a feature set and a training set (positive and negative examples). There are
two main tasks to be solved: pose estimation and classification. Since the training set contains
humans in different postures and poses, we concluded that the positive training set can be
clustered. The question is how many clusters there are in the positive training set. Setting the
number of clusters manually may not always give good results: if the number is too small, the
desired detection performance may not be achieved, while for a higher number of clusters the
detection process becomes considerably slower.
It is therefore preferable to use a criterion for deciding the number of clusters needed for
optimal detection performance and for fast processing. The criterion we use is the lowest
computational complexity needed to achieve a given hit and false alarm rate, also used by
Lienhart [103, 105]. It is a recursive procedure, and the final number of clusters is decided
only at the end of the training. To use this criterion we need a hierarchical clustering method.
This method always splits the set into two subsets using only one feature, because the feature
value tells us how strongly a pattern is present in the image. There are several cases of how the
features behave statistically:
I. The pattern is typical for the whole set. In this case the feature values are concentrated
near one value and have a Gaussian distribution.
II. There are two groups: some of the training examples contain the pattern and some do
not. In this case the feature values are concentrated around two values (the mean values)
and can be easily delimited.
III. There are more than two groups; every group contains the pattern to a different degree.
In this case the feature values are grouped around the centroid values, but they cannot be
easily delimited.
IV. The pattern is present only incidentally and the feature is not characteristic. In this case
the feature values have a uniform distribution.
Features used for splitting the clusters are selected from a subset of the whole feature set.
This subset consists of the features from case II and some from case III (only if there exist two
neighboring groups that are well delimited from each other). The other features are not used for
clustering.
The node training has two stages. In the first stage a node is trained using the AdaBoost
method [146, 188]. The parent node determines the training set of the child node; in the case of
the root node the whole training set is used. The result of the training is a strong classifier with
given false alarm and hit rates.
The second stage investigates whether a pose estimation step decreases the computational
complexity or not. For that, we need to choose a feature for clustering the input training data. In
the case of root training, every feature is computed for all positive samples; at that moment the
pose estimation feature set is the same as the complete Haar-like feature set. We verify every
feature separately, and if its values are distributed uniformly over the samples we remove it from
the set. An ordinary node receives from its parent a set of features that are eligible for
clustering and have not been used yet; this "eligible" feature set is constructed at root training as
described above. The remaining features are used to cluster the positive training set into two
groups (k = 2) using the k-means algorithm. We also compute the variance of each cluster. The
next step is choosing the best feature based on the relevance criterion:
$J = w_1 d - w_2 \bar{\sigma}^2 - w_3 \Delta n^2$  (4.2)
where $w_1$, $w_2$ and $w_3$ are weights and $d$ is the relative distance between the cluster centroids,
calculated with equation (4.3):
$d = \frac{|c_1 - c_2|}{L}$  (4.3)
where $L$ is a constant representing the length of the feature's value domain (it depends on the size
of the feature), and $c_1$ and $c_2$ are the two cluster centroids. The $\bar{\sigma}^2$ term in equation (4.2) is the
mean of the clusters' variances, and $\Delta n^2$ is the squared difference between the numbers of the
clusters' elements.
The feature selection algorithm is presented in Algorithm 3 (a code sketch follows the
listing). The clustering feature set is used to build the pose estimation stages: the most
discriminative feature is always used to split the training set into two subsets, and the used
feature is then removed from the pose estimation feature set, so every feature is used only once
on a root-to-leaf path.
After creating the pose estimation step, the training set is split using the selected feature,
and for each subset a new classification stage is trained.
The computational complexity depends linearly on the number of weak classifiers and the
number of pose estimation stages. We use an estimation stage only if the total number of features
used by the resulting nodes is smaller than the number used by the monolithic classifier.
Each branch receives the corresponding subset of the training data. This procedure is
applied until the target depth of the tree is reached. The classifier node training algorithm is
presented in Algorithm 4.
Algorithm 3: Clustering feature set creation
Input: Positive Training Set (PTS)
       Feature types
{
  Compute the feature values for every positive example
  Remove the features with uniform distribution
  Cluster the training set using k-means, k = 2
  Remove the features that do not have two well defined clusters
  Compute J (equation 4.2)
  Search for the highest J
}
Output: Pose estimation feature set, best feature
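A minimal sketch of this procedure, assuming the reconstructed form of equations (4.2)-(4.3) above and using scikit-learn's k-means (the weights, the uniformity test and the normalization of the cluster-size term are our assumptions):

import numpy as np
from scipy.stats import kstest
from sklearn.cluster import KMeans

W1, W2, W3 = 1.0, 1.0, 1.0   # assumed relevance weights of equation (4.2)

def clustering_feature_set(values):
    # values: (n_samples, n_features) array of Haar feature responses on the PTS
    eligible, scores = [], []
    for j in range(values.shape[1]):
        v = values[:, j]
        lo, hi = float(v.min()), float(v.max())
        if hi == lo:
            continue
        u = (v - lo) / (hi - lo)
        # drop features whose responses look uniformly distributed (case IV)
        if kstest(u, 'uniform').pvalue > 0.05:
            continue
        km = KMeans(n_clusters=2, n_init=10).fit(v.reshape(-1, 1))
        c = km.cluster_centers_.ravel()
        lab = km.labels_
        d = abs(c[0] - c[1]) / (hi - lo)                        # equation (4.3)
        var_mean = np.mean([v[lab == k].var() for k in (0, 1)])
        dn2 = ((np.sum(lab == 0) - np.sum(lab == 1)) / len(v)) ** 2
        scores.append(W1 * d - W2 * var_mean - W3 * dn2)        # equation (4.2)
        eligible.append(j)
    best = eligible[int(np.argmax(scores))]
    return eligible, best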
When we reach the target depth and the pose estimation does not yet have enough
resolution, we can continue adding extra pose estimation stages until the desired resolution is
reached. In the tree, every classification node has a positive leaf or branch and a negative one
associated with the corresponding set. Both outputs of a pose estimation stage are considered
positive. At the end of the training we have to label every positive leaf manually. If both leaves
of a pose estimation node have the same label, the node is removed and the remaining leaf keeps
the common label; this is repeated until the tree has no more pose estimation nodes whose two
leaves share the same label.
4.2.7 Experiments and discussion
In order to prove the performance of the proposed PETC we tested it on our own videos
and on different video sequences from public databases. The tests were performed on a Core 2
Duo 2.6 GHz computer with 3 GB of memory.
We tested the algorithm using different structures and then compared them with other
existing methods. The classifier structure used by Viola and Lienhart can be deduced from the
publicly available classifiers stored in XML files, but it was not directly useful for us, because
those classifiers were trained for faces; we used only some of their performance parameters to
retrain our own classifier.
Analyzing the published classifiers stored in XML files we found that the number of
stages varies between 16 and 46, with a mean of about 22-23 stages. The first stages contain
2-10 features, whereas the last stages contain 100-200 features.
Algorithm 4: Classification stage classifier training
Input: Positive Training Set (PTS)
       Negative Training Set (NTS)
       Pose estimation feature set
{
  1. Train a standard stage classifier with the PTS and the NTS; compute the features
     needed to achieve the desired performance
  2. Split the training set using the first feature from the set
  3. Train the resulting two stages using the appropriate sub-PTS and sub-NTS; compute
     the features needed for the stages to achieve the desired performance
  4. Choose the lower-complexity solution (complexity = number of features used)
  5. If the chosen solution uses pose estimators, remove that feature from the set
}
Output: Stage or stages
We concluded that a good classifier should have more than 1,000 features [167]. Even
though Viola and Jones mentioned [219] that during training they elaborated a high number of
improvements to decrease the training time considerably, these results were never published.
The available classifiers were trained for face detection, so we modified the size of the
detection window to make it suitable for human detection as well. The direct consequence of the
increased window is an increased number of features: the classifier trained for human detection
contains more than 3,000 features with the same structure.
At the beginning we were interested in the performance of the classifiers. In most
publications the detection results are presented in tables that capture the best value of the
Receiver Operating Characteristic (ROC) curve. Using the full ROC curve we can compare not
only a single value representing a "one moment" behavior of the classifier, but also the variation
of the detection rate with the variation of the false alarm rate [167]. Publishers usually do not
attach the database and the measurement methodology used, so we had to train the classifiers
using our own database and methodology. The performance of a detector can be very good while
also implying a huge number of false alarms; when we decrease the false alarms, the
performance of the system decreases too.
The ROC curves of the three classifiers (Figure 22) show their performance very well.
The simple cascade classifier has its limitations: even if we permit more false detections, its
sensitivity saturates at 70%, because the cascade classifier cannot handle high intra-class
variation.
Figure 22. The ROC curves of the three classifiers (false positive rate (%) versus true positive rate (%); curves: Viola cascade classifier, Lienhart's tree, our classifier)
The difference between Lienhart's tree detector and PETC seems small, but even though
the two curves look similar there are important differences. In the case of PETC, a detection
counted as correct means more than that the input is classified correctly: it also means that the
pose estimation is done well. The inputs correctly classified as humans are almost the same for
the two detectors, while the incorrectly estimated poses amount to only 1-2%; if we add this
percentage to the sensitivity, PETC is more sensitive than Lienhart's method. This can be
explained by the fact that we used only a limited number of poses: if we increase the number of
poses in the training set, the intra-class variability also increases, Lienhart's tree's sensitivity
decreases, and its ROC curve moves closer to the diagonal. Another difference is that PETC has
a lower false alarm rate at the same sensitivity than Lienhart's tree detector. The test results on
our database for the three classifiers are the following:
Table 3. Performance parameters on our database

Type of Classifier    Correct detection rate    False alarm rate    Processing time/frame
Cascade Classifier    70.2%                     0.012%              30.1 ms
Lienhart's Tree       86.7%                     0.009%              33 ms
PETC                  86.1%                     0.005%              28 ms
In Figure 23 we present some images labeled by the classifiers. We can see in the images
that the simple cascade classifier does not detect all the people and also produces more false
detections. In this case the number of false detections for Lienhart's tree detector and for PETC
is equal, but PETC has detected all the people in the image.
Figure 23. Processed images, our database: a) cascade classifier, b) Lienhart's tree, c) PETC
A second test was made on the INRIA database. In this case the humans' positions were
labeled, so the detection results were automatically compared to the labels. A result was
accepted if the detection window contained more than 90% of the labeled region and the area of
the window was less than 120% of the area of the labeled region. For pose detection we labeled
the test set and then compared the results to the labels.
Table 4. Performance parameters on the INRIA database

Type of Classifier    Correct detection rate    False alarm rate    Processing time/frame
Cascade Classifier    60.2%                     0.027%              29.1 ms
Lienhart's Tree       80.5%                     0.013%              40 ms
PETC                  85.1%                     0.007%              26 ms
The difference between the performance parameters measured on our database and on
the INRIA database is that INRIA contains many more poses than our database. It is important
to mention that we trained all three detectors with the same database. Another interesting aspect
of the results is the speed of recognition. Even though the cascade classifier has the simplest
structure, it is not faster than PETC; an explanation is that during training the simple structure
was compensated with more weak classifiers, so within a stage the cascade classifier evaluates
more features than PETC. The second interesting aspect is that Lienhart's detector is slower than
PETC despite the fact that PETC has extra nodes in the tree. To find an explanation we tracked
the behavior of the tree during detection and observed that when the tree has many branches, the
depth-first search is slower than evaluating a few extra features to decide on which branch to
continue the evaluation of the input image region.
Figure 24. Processed images, INRIA database: a) cascade classifier, b) Lienhart's tree, c) PETC
Another aspect we observed is that, using depth-first search, Lienhart's tree detector does
not always evaluate the same human pose on the same branch, which leads to the conclusion that
this detector is not suitable for human pose estimation.
During the creation of the training set we observed that, when training with different
versions of the database, the performance of PETC is strongly influenced by the degree of
normalization of the training set. One unresolved problem for this detector remains that some
human poses cannot fit the image ratio we chose. Changing the ratio is not always a good
choice, because it could introduce unnecessary information into the training set, which can lead
to the loss of the training convergence.
4.3 Template-based human detection methods
In this subchapter we present a template-based human detection method. The basic
idea behind every template-based method is to choose one or more representative templates
and compare every part of the image with them. Compared to the "example-based" methods,
the information about the class is not coded in the structure and parameters of a classifier, but
is kept in the templates. On the one hand we do not have to create a training set, but we do need
to select representative templates. The hardest aspect of selecting the templates is that they
should be representative; this demand is very hard to fulfill, since people can appear in many
poses and different clothes, resulting in a high variety of possible appearances of the human
body, which makes it necessary to use numerous templates. To reduce the number of templates,
the matching techniques should not be applied directly to the image pixels but to one or more
features that are preferably invariant to some of the appearance variations. Even with this kind
of features, the dimension of the template set remains high. Although matching one template to
the image is faster than applying a trained classifier, matching all templates repeatedly becomes
much slower; to make the approach comparable in speed with the "example-based" techniques,
the only possibility is to reduce the set of templates using a filtering method.
In the literature there are several attempts to reduce the template set algorithmically.
Gavrila [40, 41] uses a hierarchical structure to match the templates. For crowded scenes Leibe
et al. [6] used chamfer matching to detect pedestrians, combining it with segmentation to
prevent false alarms. A hierarchical template representation was also used by Stenger et al. [8],
who applied bottom-up clustering based on the chamfer distance. The majority of researchers
have used chamfer matching techniques to match the templates to the image. The matching
process is presented below:
Figure 25. Matching process (image → distance transform → chamfer matching against the template set → result)
4.3.1 Distance transformation
Distance transforms are an important tool for computer vision, image processing and
pattern recognition. A distance transform of a binary image specifies the distance from each
pixel to the nearest non-zero pixel.
Depending on the distance metric there are various ways of computing the distance
transform. In image processing the information is propagated using $L_1$, $L_2$ or chamfer
metrics, with $L_2$ (Euclidean) being the most common one. Chamfer metrics approximate the
Euclidean distance. There are also more complex distance metrics than the chamfer, which can
make the distance computation more robust to noise, but these complex metrics are slower than
the chamfer metrics.
Distance transforms play a central role in the comparison of binary images, particularly
for images resulting from local feature detection techniques such as edge or corner detection.
For example, both the Chamfer [20, 58] and Hausdorff [36] matching approaches make use of
distance transforms in comparing binary images.
The general form of the distance transform is [148]:
$D_f(p) = \min_{q \in \mathcal{G}} \left( d(p, q) + f(q) \right)$  (4.3)
where $\mathcal{G}$ is a regular grid, $f$ is a function over the grid and $d$ is a distance metric.
4.3.2 Chamfer distance
The Chamfer distance gives a reasonably good approximation of the Euclidean distance
using elementary displacements [58]. Chamfer distance and many other traditional distance
transforms use a set of points $O$ on a grid $\mathcal{G}$, where the grid represents the pixels of the
image; with each grid location the distance to the nearest point of $O$ is associated:
$D(p) = \min_{q \in O} d(p, q)$  (4.4)
In this case the distance transform uses the following alternative definition, which enables the
sequential computation of the distance transform:
$D(p) = \min_{q \in \mathcal{G}} \left( d(p, q) + \mathbf{1}_O(q) \right)$  (4.5)
where $\mathbf{1}_O(q)$ is an indicator function of membership in $O$ that initializes the starting distances:
$\mathbf{1}_O(q) = \begin{cases} 0 & \text{if } q \in O \\ \infty & \text{otherwise} \end{cases}$  (4.6)
The idea of this method is that instead of explicitly computing the minimum distance of
every image point from all object points in $O$, we repeat simple increment operations without
computing any distance explicitly. With these simple increment operations the local distances
propagate and yield the global distance. The sequential computation of the distance is also
known as the "chamfer distance".
The propagation is made by fixing a set of elementary displacements and applying a
weight to each of them so that the result approximates the Euclidean distance. For the
elementary displacements we can use 3×3 or 5×5 neighborhoods (masks).
Figure 26. General masks: a) the 3×3 mask, b) the 5×5 mask

 b  a  b
 a  0  a
 b  a  b

2b  c  2a  c  2b
 c  b   a  b   c
2a  a   0  a  2a
 c  b   a  b   c
2b  c  2a  c  2b
The parameters in the masks have the following linear constraints:
3×3 mask: a < b < 2a
5×5 mask: 2a < c, 3b < 2c and c < a + b
Borgefors demonstrated in [58] that using a = 3 and b = 4, the maximum deviation of this
approximation from the Euclidean distance is about 8 percent.
The sequential distance transformation algorithm starts from the 0/infinity image, in
which the edge pixels are set to 0 and all other pixels to infinity; then two scanning passes are
made over the image. The first pass starts from the top-left and finishes at the bottom-right:
$D(p) = \min_{q \in AL(p)} \left( D(q) + d(p, q) \right)$  (4.7)
where the local distance $d(p, q)$ is defined by the 3×3 or 5×5 mask. The second pass starts from
the bottom-right and ends at the top-left corner of the image:
$D(p) = \min\left( D(p),\ \min_{q \in BR(p)} \left( D(q) + d(p, q) \right) \right)$  (4.8)
$D$ keeps track of the local distances and only at the end is it equal to the real distances.
AL and BR are defined by the following masks: AL (the forward neighborhood) contains
the already visited neighbors, i.e. the rows above the current pixel p and the positions to its left
in the current row, while BR (the backward neighborhood) is its mirror image, containing the
rows below and the positions to the right. For the 3×3 case:

AL AL AL        .  .  .
AL  p  .        .  p BR
 .  .  .       BR BR BR

Figure 27. Forward and backward neighborhoods
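A minimal sketch of the two-pass 3×3 chamfer transform with the Borgefors weights a = 3, b = 4 (a direct implementation of equations (4.7)-(4.8), not the thesis code):

import numpy as np

def chamfer_dt(edges, a=3, b=4):
    # edges: 2D boolean array, True at edge pixels
    h, w = edges.shape
    INF = 10 ** 9
    D = np.where(edges, 0, INF).astype(np.int64)
    # forward pass (AL mask), top-left to bottom-right
    for y in range(h):
        for x in range(w):
            for dy, dx, c in ((0, -1, a), (-1, -1, b), (-1, 0, a), (-1, 1, b)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and D[ny, nx] + c < D[y, x]:
                    D[y, x] = D[ny, nx] + c
    # backward pass (BR mask), bottom-right to top-left
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            for dy, dx, c in ((0, 1, a), (1, -1, b), (1, 0, a), (1, 1, b)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and D[ny, nx] + c < D[y, x]:
                    D[y, x] = D[ny, nx] + c
    return D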
4.3.3 Fast distance transform computation
It is important to compute the distance transform quickly and accurately. In this
subchapter we present some methods for the fast computation of the distance transform; in the
following subchapter we present a novel method and compare its performance with the fast
methods described here.
The dual scan line propagation method [151] splits the sequential (chamfer) distance
computation and computes the distance sequentially and separately for every direction. The
method takes advantage of the fact that, on a line passing through an object point, the minimum
distances associated with two neighboring data points $p_i$ and $p_{i+1}$ differ by exactly one step,
$f(p_{i+1}) = f(p_i) \pm 1$; the distances cannot be equal [151].
Algorithm 5: Distance Transform
First step: forward scan
Second step: backward scan
where D is the distance image.
In other words, we can compute the distance in one direction using a counter that counts
the distance from the last object point, resetting the counter at every object point. The distance
can also be computed in the inverse direction; the only difference is that the new counter value
has to be compared with the previously stored distance, and we keep the minimum of the two as
the new minimum distance of the current point. The algorithm is presented in Algorithm 6 and
Figure 28.
Algorithm 6: Dual scan lines
Left to right:
origin flag = no
for i = 1 to image width do
    if I(p_i) = 1 then
        origin flag = yes
        counter = 0
    else if origin flag then
        counter = counter + 1
        f(p_i) = counter
    end if
end for

Right to left:
origin flag = no
for i = image width to 1 do
    if I(p_i) = 1 then
        origin flag = yes
        counter = 0
    else if origin flag then
        counter = counter + 1
        if f(p_i) exists then
            f(p_i) = min(counter, f(p_i))
        end if
    end if
end for
Figure 28. Back-and-forth scanning in one direction [151]
To guarantee the minimum distance from every direction, we need to apply the dual scan
in every direction normal to the object data point surface (Figure 29).
Figure 29. Multiple scanning directions: either the scan direction is changed or the data space
is rotated [151]
In practice there is a limited number of normal directions, due to the rasterization of the
surface and the confined size of images. Each additional direction refines the estimated distance
transformation values [151].
The wave-propagation method has as its starting point wave propagation based
segmentation [53]. The distance computation starts from each object point and moves in the
direction normal to the surface. The method uses three labels to group the data points at each
step: processed, active and unprocessed. The distance computation starts from the object
boundary points by marking them as active and setting their distances to zero, f(p) = t = 0. Then
the wave-front is propagated using the active set of points until no data points remain in the
active set. At each step a counter t is updated to t + 1 and set as the distance of all points in the
active set, f(p) = t. The immediate unprocessed neighbors of the points in the active set are
collected to construct the new active set; after marking the points of the old active set as
processed, the next step is iterated [151].
Figure 30. Fast wave-propagation method [151]
4.3.4 Pseudo Parallel Computation of the Distance Transform
We can extend this algorithm to a pseudo-parallel algorithm that can be executed on
multi-processor or multi-core systems. The parallel algorithm for computing the distance
transformation proposed by Borgefors [20] is not optimal on a multi-processor system, while the
algorithm we propose splits the sequential distance transformation into a number of equal tasks.
The number of tasks has to be equal to or smaller than the number of processors or processor
cores.
The basic idea of our approach, the Pseudo Parallel Computation of the Distance
Transform (PPCDT), is to split the image into equal vertical regions, compute the distance
transformation independently for every region using the sequential algorithm, and at the end
merge these regions into one distance image. Splitting the image amounts to defining the
regions. For calculating the distance transformation of the defined regions we use Algorithm 6,
with adjusted start and stop conditions of its four loops. For merging two split regions at
column z we use Algorithm 7.
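A minimal sketch of the PPCDT idea on top of the chamfer transform defined above (the strip-wise computation follows the description; the row-wise boundary relaxation is only our reading of the merge in Algorithm 7 and, like PPCDT itself, is approximate near the seams):

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def ppcdt(edges, n_tasks=4, a=3):
    h, w = edges.shape
    cols = np.linspace(0, w, n_tasks + 1, dtype=int)
    strips = [edges[:, cols[k]:cols[k + 1]] for k in range(n_tasks)]
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        parts = list(pool.map(chamfer_dt, strips))   # chamfer_dt defined above
    D = np.hstack(parts)
    # merge at every internal boundary column: push improved distances
    # rightwards and leftwards along each row until no change occurs
    for zb in cols[1:-1]:
        for y in range(h):
            z = zb
            while z < w and D[y, z] > D[y, z - 1] + a:
                D[y, z] = D[y, z - 1] + a
                z += 1
            z = zb - 1
            while z >= 0 and D[y, z] > D[y, z + 1] + a:
                D[y, z] = D[y, z + 1] + a
                z -= 1
    return D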
To verify the performance of the pseudo-parallel computation of the chamfer distance we
made experiments and compared the results with those of the existing methods: the basic
chamfer distance, the dual scan line propagation method and the wave propagation method. For
testing we used different images; the main difference between the test images was the number of
edges.
During the tests we compared the correctness of the distances (Table 5) and the
processing time (Table 6). For the distance correctness we used the basic chamfer distance
transform as the reference; the comparison error was computed as the mean of the distance
differences per pixel.

Table 5. Distance transforms precision (mean distance error per pixel)

Edges in the image (%)   Chamfer distance   Dual scan line (4 directions)   Wave propagation   PPCDT
5                        –                  1.94                            1.98               0.97
20                       –                  1.64                            1.66               0.68
30                       –                  1.43                            1.44               0.17
Algorithm 7: Merging two split regions at column z
do {
    z = z + 1;
} while (distances in column z change)
do {
    z = z + 1;
} while (distances in column z change)
Table 6. Distance transformation computation time

Chamfer distance   Dual scan line (4 directions)   Wave propagation   PPCDT
4 ms               3 ms                            3 ms               2 ms*
* depends on image size
4.3.5 Chamfer matching
Chamfer matching is a technique proposed by Barrow et al. [14] for finding the best fit
of edge points between the templates and the images by minimizing the distance between them.
From the template image the edge pixels are extracted and converted to a list of coordinate pairs,
and from this list those edge points are selected which cover the edge uniformly. The selected
list of coordinate pairs is called a polygon. The matching measure used to search for the best fit
is an average of the distance image values that the polygon hits; in our case we used the root
mean square average:
$d_{rms} = \frac{1}{3} \sqrt{\frac{1}{n} \sum_{i=1}^{n} v_i^2}$  (4.9)
where $v_i$ is the value of the i-th pixel hit by the polygon on the distance image and n is the
number of coordinates of the polygon. The average is divided by 3 to compensate for the unit
distance 3 used in the distance transformation. We used this measure because it gives fewer
local minima than other average measures [59].
The position of the polygon is determined by the parametric transformation in equation
(4.10), which translates and rotates the polygon:
$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \cos(rot) & -\sin(rot) \\ \sin(rot) & \cos(rot) \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}$  (4.10)
where rot is the rotation angle and $t_x$ and $t_y$ are the translation parameters.
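A minimal sketch of the matching score of equation (4.9), evaluated for one placement of a template polygon (the transform follows equation (4.10); treating out-of-image points as misses with a large penalty is our assumption):

import numpy as np

def chamfer_score(dist_img, polygon, tx, ty, rot):
    # polygon: (n, 2) array of template edge coordinates (x, y)
    c, s = np.cos(rot), np.sin(rot)
    R = np.array([[c, -s], [s, c]])
    pts = polygon @ R.T + np.array([tx, ty])          # equation (4.10)
    xs = np.round(pts[:, 0]).astype(int)
    ys = np.round(pts[:, 1]).astype(int)
    h, w = dist_img.shape
    inside = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
    v = np.full(len(pts), float(dist_img.max()))      # penalty for misses
    v[inside] = dist_img[ys[inside], xs[inside]]
    # RMS average divided by 3 to compensate the unit distance 3, equation (4.9)
    return np.sqrt(np.mean(v ** 2)) / 3.0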
4.3.6 Matching high number of templates: template space
Human detection is difficult because people can appear in a variety of poses and views.
The direct consequence of using chamfer matching in the detection process is that we need to
deal with a high number of templates.
We use templates like the ones in [201], shown in Figure 31.
Figure 31. Human templates
Looking at the templates in Figure 31, it is clearly visible that there is a similarity
relation between them. To speed up the search, many researchers organize these templates in a
hierarchical structure and, based on it, perform a coarse-to-fine type of matching.
We propose a more general ordering of the human templates, which speeds up the
process by reducing the number of matches. Considering that the high variability of human
shapes derives from motion and from the angle of view, our hypothesis is that one can create a
template space in which the templates representing consecutive movements are always
neighbors.
In this space the templates represent discrete states and they are not uniformly
distributed. We noticed that the template density around some states is higher than around
others, meaning that from some positions humans can move in more directions than from other
positions. This observation helped us create an optimal hierarchy of templates from this
template space by always choosing the central template of the dense zones; the distance between
the chosen templates determines the hierarchy level.
Figure 32. Template splitting regions
To take advantage of the high correlation between the templates, we represent the
template space as a finite state machine, where every template is a state. After we have
identified the current state (the template with the smallest distance) we no longer need to search
the entire space, only the neighboring states. This is already a big improvement, but to reduce
the number of selectable states even more, we define a transition criterion between every state
and its neighbors. For this purpose we split the template into six regions (see Figure 32) and
check the modification of each region separately. This approach considerably reduces the
number of templates that need to be checked.
Figure 33. Example of transition criteria parameters
In every region the force magnitude in x and y needed to move the contour from the
initial position to the current position is measured separately, so the transition criterion has
twelve parameters. Even with this high number of parameters there are cases when more than
one neighboring state is eligible to be checked.
The number of templates in the space is an issue that needs to be addressed: if it is too
high the technique becomes very memory consuming, while if it is too low the decision
mechanism becomes more complex.
4.3.7 Human detection and tracking system with pose estimation and experiments
To better illustrate our method, Chamfer Matching using a Hierarchical and Motion
Space template database (CHMS), we apply it to people detection. In Figure 34 we present the
architecture of the human detection and tracking system.
First we extract the edges from the current frame with a Canny edge detector, then we
apply a distance transform to the edge image.
In the memory module the last matched positions, sizes and template space states are
saved. Based on the information provided by the memory module, the region chooser selects the
region of interest. The new region provided by the memory module is the extended region of the
human location from the previous cycle; the extension is made in every direction and is
proportional to the frame rate and the maximum motion speed.
At system startup the entire image is searched for people. If we have a static background
we use the background subtraction module, which shows whether new people or other moving
objects appear in the image. If we work with a moving camera we search only at the image
border.
Using the distance image region, the chamfer matching module matches the template
provided by the template chooser module. Then the decision maker decides whether the match
is acceptable. If the result is not accepted, a command is sent to the template chooser to choose
another template using the result set provided by the chamfer matching module. If the template
match is accepted, the result is saved by the memory module. The starting point of every search
is the previous cycle's best matching template. At initialization or at border search, the
coarse-to-fine strategy with templates from the top level is used.
Figure 34. Human detection system
We tested the system on a Pentium 4 processor at 2.6 GHz with 512 MB memory,
running the Windows XP operating system. For image acquisition, high quality IP cameras
were used with an image resolution of 640×480 pixels, securing a minimum 15 fps detection
rate; the maximum detection rate depends on the number of tracked people. We used around
100 human templates to construct the template space. The tests covered various indoor and
outdoor environments; an example of the results is presented in Figure 35.
Figure 35. System output image
We also investigated the speed of the matching process and compared the results with
those of the coarse-to-fine hierarchical chamfer matching method. The results are presented in
Table 7.
Table 7. Performance parameters for template matching on our database

Type of Classifier                                            Correct detection rate   False positive rate   Pose estimation   Processing time/frame
Chamfer matching, hierarchical template database              77.3%                    8.3%                  49.7%             72 ms
Chamfer matching, hierarchical and motion space template DB   77.2%                    6.1%                  76.4%             13 ms
The human detection rate is almost the same for our template database representation and
for the hierarchical representation; the differences are in the false positive rate and in the pose
estimation correctness.
Since this representation works only in the case of continuous motion, when we have to
detect people without prior information in non-consecutive frames CHMS also uses the
hierarchical approach. The speed boost of CHMS is notable.
4.3.8 Results and discussion
In this subchapter we described our study of the template matching approach to human
detection and pose estimation. We chose the chamfer matching technique to accomplish our
goal, human detection and pose estimation. During the implementation we identified two parts
of the technique to which we contributed: the distance transform computation and the template
set representation.
We studied some of the best known distance transform computation methods and
proposed PPCDT for the distance transform computation, which we compared with the existing
methods. The results of this comparison are presented in Tables 5 and 6. Based on the results we
can say that the PPCDT method performs like the existing methods, but computes the distance
with lower accuracy than the basic chamfer distance transformation; the error occurs where the
sub-images are merged. We can also observe that in a cluttered scene the errors decrease.
Another interesting method is the dual line scan, a more complex and time consuming method
than the basic distance transform, yet faster due to its intrinsic parallelism.
The template-based methods have two weaknesses: the selection of the features used for
the comparison, and the selection of the templates together with the search algorithm for finding
the best match.
We proposed a more general ordering of the human templates, which speeds up the
process by reducing the number of matches. It is based on a template space in which the
templates representing consecutive movements are always neighbors.
Based on the experiments, the chamfer matching method performs well in human
detection applications. However, the technique suffers from mismatching on cluttered
backgrounds: the main negative effect of using the chamfer distance is the potential risk of
increased false alarms in backgrounds with a high level of clutter noise.
During the experiments we measured the influence of the image structure on the
detection performance. We measured the homogeneity of the image by computing the
percentage of edges in the image. This was computed using 5×5 overlapping image regions,
resulting in an array that shows how cluttered each region of the image is. In every image region
we also computed the average homogeneity value relative to a template; if this value is high, the
region is cluttered and contains many edges.
Figure 36. Chamfer matching performance related to image homogeneity (image region homogeneity (%) versus false positive, true positive and pose estimation rates (%))
Figure 36 presents some interesting aspects of the performance evaluation related to
image homogeneity. We can see that the recognition accuracy increases when the image
contains more edges, but the false positives increase at the same time, because the distance
between templates and image regions erroneously decreases. Another aspect of cluttered scenes
is that the pose estimation becomes unreliable, producing random results. From these tests we
can conclude that in this form the method cannot be applied to cluttered scenes, but it can be
used with success when a background subtraction method can be applied to the image.
One of the main advantages of the chamfer matching method is that at the end of the
matching process, besides the position, we also know the human's pose (attitude); the attitude
estimation and the matching are done at the same time. To get the pose we only have to
categorize the templates.
The experiments demonstrated that the novel representation of the human templates and
the matching process outperform the hierarchical method when prior information is available,
and perform similarly to the hierarchical method when no prior information is available.
4.4 Pictorial Structures
The last type of human detector investigated in this thesis is a part-based detector. To
achieve reliable detection across a wide variety of poses we use strong body-part detection and
a Pictorial Structure [142] based body representation introduced by Felzenszwalb [143]. The
detection process has two steps: the first is the detection of all body parts; the second is
matching the parts to a model, which is the Pictorial Structure.
We applied a strong discriminative detector for body part detection that uses dense
appearance representations based on shape context descriptors [110] and AdaBoost [109]. Such
detectors have been used in the literature for pedestrian detection [109, 145, 6], but in those
cases the appearance model is more simplistic [200].
4.4.1 Definition of pictorial structures
Objects are modeled by a collection of parts in a deformable configuration [142], with
"spring-like" connections between pairs of parts; these connections model the spatial relations
between the parts. For detecting an object one can use the appearances and spatial relationships
of the individual parts. The best match of a pictorial structure depends on how well the parts
match at their locations and how well the locations agree with the deformable model. Matching
a pictorial structure does not involve taking decisions about the locations of individual parts; it
is about finding the global minimum of an energy function, without initialization.
The pictorial structure model can be represented as an undirected graph $G = (V, E)$:
The parts are the vertices $V = \{v_1, \dots, v_n\}$.
The edges $E$ are the relationships between parts (the springs); $(v_i, v_j) \in E$ indicates a
connection between parts $v_i$ and $v_j$.
An instance of the object is given by its configuration $L = (l_1, \dots, l_n)$, where $l_i$ is
the location of part $v_i$.
$m_i(l_i)$ measures the cost of mismatching part $v_i$ at location $l_i$.
$d_{ij}(l_i, l_j)$ measures the cost of deforming the model when placing $v_i$ at location $l_i$
and $v_j$ at $l_j$.
To find the best match of an object configuration within an image we find the $L^*$ that
minimizes [251]:
$L^* = \arg\min_L \left( \sum_{i=1}^{n} m_i(l_i) + \sum_{(v_i, v_j) \in E} d_{ij}(l_i, l_j) \right)$  (4.11)
To solve the minimization problem efficiently we need the graph G to be acyclic, i.e. a
tree; the search can then be finished in polynomial time $O(nh^2)$, whereas in general, with n parts
and h different locations, there are $h^n$ possible configurations. The deformation cost has to
be restricted to the Mahalanobis distance between the transformed locations, represented as
positions on a grid:
$d_{ij}(l_i, l_j) = \left( T_{ij}(l_i) - T_{ji}(l_j) \right)^{T} M_{ij}^{-1} \left( T_{ij}(l_i) - T_{ji}(l_j) \right)$  (4.12)
where $T_{ij}(l_i)$ and $T_{ji}(l_j)$ are the transformed locations on the grid [142]; they represent
the ideal relative locations of parts $v_i$ and $v_j$. The distance between $T_{ij}(l_i)$ and $T_{ji}(l_j)$,
weighted by $M_{ij}^{-1}$, measures the deformation between the two parts.
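A minimal sketch of evaluating the energy of equations (4.11)-(4.12) for a given configuration (the data layout and names are ours; the transforms are taken as precomputed functions):

import numpy as np

def deformation_cost(t_ij, t_ji, M_ij):
    # Mahalanobis deformation cost between two transformed part locations
    diff = np.asarray(t_ij, float) - np.asarray(t_ji, float)
    return float(diff @ np.linalg.inv(M_ij) @ diff)

def configuration_energy(match_cost, edges, L, T, M):
    # match_cost[i][l]: mismatch cost m_i(l) of part i at location l
    # edges: list of (i, j) pairs; T[(i, j)]: function l -> T_ij(l)
    # L: configuration, L[i] is the location of part i
    e = sum(match_cost[i][L[i]] for i in range(len(L)))
    for i, j in edges:
        e += deformation_cost(T[(i, j)](L[i]), T[(j, i)](L[j]), M[(i, j)])
    return e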
4.4.2 Statistical Framework
Pictorial structures can also be viewed, in terms of statistical estimation, as an energy
minimization problem [142]. Using this approach the pictorial structure can be trained from
examples and can also be used to find multiple good matches.
We use the following notation: $\theta$ for the parameters of the object model, $I$ for the
image, and $L$ for a configuration.
According to Bayes' rule, the posterior of a configuration $L$ given an image $I$ and a model
$\theta$ is [142]:
$p(L \mid I, \theta) \propto p(I \mid L, \theta)\, p(L \mid \theta)$  (4.13)
where $p(I \mid L, \theta)$ is the likelihood of seeing a particular image with an object located at a given
location, and $p(L \mid \theta)$ is the prior probability that an object is at a given location; the prior
contains the information about the positions of the body parts relative to a coordinate system and
can be both informative and general. $p(L \mid I, \theta)$ represents the probability of the configuration
given the model parameters and the image.
The model is parameterized by $\theta = (u, E, c)$:
$u = \{u_1, \dots, u_n\}$ are the appearance parameters,
$E$ consists of the edges indicating connections between parts,
$c = \{c_{ij} \mid (v_i, v_j) \in E\}$ represents the connection parameters.
The likelihood is given by the product of the probabilities of each part being observed at a
specific location in the image [142] (it assumes that the parts are independent and not occluded):
$p(I \mid L, \theta) = \prod_{i=1}^{n} p(I \mid l_i, u_i)$  (4.14)
The prior distribution over object configurations is captured by a Markov random field:
$p(L \mid \theta) = \frac{\prod_{(v_i, v_j) \in E} p(l_i, l_j \mid c_{ij})}{\prod_{v_i \in V} p(l_i)^{\deg(v_i) - 1}}$  (4.15)
where the denominator is unnecessary (absolute locations are not modeled), so we get:
$p(L \mid \theta) = \prod_{(v_i, v_j) \in E} p(l_i, l_j \mid c_{ij})$  (4.17)
The posterior will be:
$p(L \mid I, \theta) \propto \left( \prod_{i=1}^{n} p(I \mid l_i, u_i) \right) \prod_{(v_i, v_j) \in E} p(l_i, l_j \mid c_{ij})$  (4.18)
Observing that the negative logs of the likelihood and the prior give us the match and
deformation costs, we get:
$L^* = \arg\min_L \left( -\sum_{i=1}^{n} \log p(I \mid l_i, u_i) - \sum_{(v_i, v_j) \in E} \log p(l_i, l_j \mid c_{ij}) \right)$  (4.19)
Since $d_{ij}(l_i, l_j)$ is restricted to the Mahalanobis distance, we model the pairwise term here
[142, 143] with a zero mean normal distribution with diagonal covariance matrix $D_{ij}$:
$p(l_i, l_j \mid c_{ij}) = \mathcal{N}\!\left( T_{ij}(l_i) - T_{ji}(l_j);\, 0,\, D_{ij} \right)$  (4.20)
4.4.3 Body parts and connections
The human body is a deformable object, but it consists of rigid parts connected to each
other by elastic connections. The best way to model the projection of a human body part is a
rectangle with the following parameters:
x, y  the center of the rectangle
s     the foreshortening
θ     the orientation of the part.
The body part pairs are connected by "spring-like" connections. Every connection has its
correspondence in the parts' local coordinate systems: $(x_{ij}, y_{ij})$ and $(x_{ji}, y_{ji})$. In the ideal
case these connection points overlap, as in Figure 37. The ideal orientation $\theta_{ij}$ of a
connection is given by the difference of the two parts' orientations.
Figure 37. Connections
For two parts $v_i$ and $v_j$ with $l_i = (x_i, y_i, s_i, \theta_i)$ and $l_j = (x_j, y_j, s_j, \theta_j)$, the joint
probability is given by a combination of zero mean Gaussians modeling the displacements:
$p(l_i, l_j \mid c_{ij}) = \mathcal{N}(x'_j - x'_i;\, 0, \sigma_x^2)\; \mathcal{N}(y'_j - y'_i;\, 0, \sigma_y^2)\; \mathcal{N}(s_j - s_i;\, 0, \sigma_s^2)\; M(\theta_j - \theta_i;\, \theta_{ij}, k)$  (4.21)
The first two factors measure the vertical and horizontal displacements, the third factor is the
difference in foreshortening, and the last factor measures the difference between orientations
(modeled in [142] by a von Mises distribution M with mean $\theta_{ij}$ and concentration k). The
parameters of a connection are therefore $c_{ij} = (x_{ij}, y_{ij}, x_{ji}, y_{ji}, \sigma_x^2, \sigma_y^2, \sigma_s^2, \theta_{ij}, k)$.
The transformation of a body part location $l_i = (x_i, y_i, s_i, \theta_i)$ into the connection
coordinate system can be obtained using the expression:
$T_{ij}(l_i) = \begin{pmatrix} x_i + s_i d_x^{ij} \cos\theta_i - s_i d_y^{ij} \sin\theta_i \\ y_i + s_i d_x^{ij} \sin\theta_i + s_i d_y^{ij} \cos\theta_i \\ s_i \\ \theta_i + \tilde{\theta}_{ij} \end{pmatrix}$  (4.22)
where $d^{ij} = (d_x^{ij}, d_y^{ij})^T$ is the relative position of the connection between the i-th and j-th
body parts in the coordinate system of the i-th body part, and $\tilde{\theta}_{ij}$ is the relative angle
between the two body parts.
The joint distribution of the body part pairs must be a Gaussian distribution with zero
mean and diagonal covariance in the transformed space:
$p(l_i, l_j \mid c_{ij}) = \mathcal{N}\!\left( T_{ij}(l_i) - T_{ji}(l_j);\, 0,\, D_{ij} \right)$  (4.23)
The joint probability will then be:
$p(l_i, l_j \mid c_{ij}) = \mathcal{N}\!\left( T_{ij}(l_i) - T_{ji}(l_j);\, 0,\, D_{ij} \right)$  (4.25)
i.e. the Gaussian of equation (4.23) evaluated on the transformed locations.
4.4.4 Learning Parameters
Given a set of training images $\{I^1, \dots, I^m\}$ and corresponding object configurations
$\{L^1, \dots, L^m\}$, we can estimate the model parameters $\theta = (u, E, c)$ by finding the $\theta$ that
maximizes, based on Felzenszwalb's algorithm [142]:
$\theta^* = \arg\max_{\theta} \prod_{k=1}^{m} p(I^k, L^k \mid \theta)$  (4.26)
Figure 38. The training set
$\theta^* = \arg\max_{u, E, c} \left( \prod_{k=1}^{m} p(I^k \mid L^k, u) \right) \left( \prod_{k=1}^{m} p(L^k \mid E, c) \right)$  (4.27)
The first factor of the equation depends on the appearance parameters and the second depends
only on the connections and the connection parameter set [142]. The appearance part can be
solved for each $u_i$ independently:
$u_i^* = \arg\max_{u_i} \prod_{k=1}^{m} p(I^k \mid l_i^k, u_i)$  (4.28)
Estimating the dependencies is similar to estimating the appearance parameters [142].
The connection parameters can be estimated (where $p(l_i, l_j \mid c_{ij})$ is specified separately for
different models) with:
$c_{ij}^* = \arg\max_{c_{ij}} \prod_{k=1}^{m} p\!\left(l_i^k, l_j^k \mid c_{ij}\right)$  (4.29)
The quality $q(v_i, v_j)$ of a connection can be estimated [142] as the probability of the two
locations given the ML estimate of their joint distribution:
$q(v_i, v_j) = \prod_{k=1}^{m} p\!\left(l_i^k, l_j^k \mid c_{ij}^*\right)$  (4.30)
Using these quantities we can find a tree connecting the vertices $V$ of the graph by finding
the edges:
$E^* = \arg\max_{E} \prod_{(v_i, v_j) \in E} q(v_i, v_j)$  (4.31)
$E^*$ can be found by building a graph with the vertices $V$, setting the weight of edge $(v_i, v_j)$
to $-\log q(v_i, v_j)$, and solving for the minimum spanning tree of the graph [142].
Figure 39. Learning process
Figure 40. The learnt Pictorial structures (frontal and side view)
4.4.5 Finding the optimal configuration
Now that we have the model, we can find the configuration $L^* = (l_1, \dots, l_n)$ that
minimizes the original equation:
$L^* = \arg\min_L \left( \sum_{i=1}^{n} m_i(l_i) + \sum_{(v_i, v_j) \in E} d_{ij}(l_i, l_j) \right)$  (4.32)
We can compute this starting from the leaf vertices of the tree. The best achievable cost $B_j(l_i)$
for a leaf vertex $v_j$, given location $l_i$ for its parent $v_i$, is obtained by the $l_j$ that minimizes:
$B_j(l_i) = \min_{l_j} \left( m_j(l_j) + d_{ij}(l_i, l_j) \right)$  (4.33)
This leads to a recursive solution for internal vertices, where $C_j$ are the children of $v_j$:
$B_j(l_i) = \min_{l_j} \left( m_j(l_j) + d_{ij}(l_i, l_j) + \sum_{v_c \in C_j} B_c(l_j) \right)$  (4.34)
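A minimal sketch of this min-sum recursion over the part tree with discrete location sets (an O(nh²) dynamic program in the number of parts n and locations h; it omits the generalized distance transform speedup of [143], and the data layout is ours):

def pictorial_match(root, children, m, d, locations):
    # m[j][lj]: appearance cost of part j at location lj
    # d[(i, j)]: function (li, lj) -> deformation cost for edge (i, j)
    # locations[j]: candidate locations of part j
    B = {}   # B[j][li] = best cost of the subtree rooted at j, given parent at li

    def compute(j, parent):
        for c in children.get(j, []):          # children first (leaves upward)
            compute(c, j)
        B[j] = {}
        for li in locations[parent]:           # equations (4.33)-(4.34)
            B[j][li] = min(
                m[j][lj] + d[(parent, j)](li, lj)
                + sum(B[c][lj] for c in children.get(j, []))
                for lj in locations[j])

    for c in children.get(root, []):
        compute(c, root)
    # equation (4.32), minimized over the root locations
    return min(m[root][lr] + sum(B[c][lr] for c in children.get(root, []))
               for lr in locations[root])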
4.4.6 Sampling from the Posterior
We want to sample from the posterior:
$p(L \mid I, \theta) \propto \left( \prod_{i=1}^{n} p(I \mid l_i, u_i) \right) \prod_{(v_i, v_j) \in E} p(l_i, l_j \mid c_{ij})$  (4.39)
The steps of a recursive solution: first sample a location $l_r$ for the root, and then repeat this
for each child of the root. The marginal distribution of the root is:
$p(l_r \mid I, \theta) = \sum_{L \setminus l_r} p(L \mid I, \theta)$  (4.40)
We can formulate this recursively as:
$p(l_r \mid I, \theta) \propto p(I \mid l_r, u_r) \prod_{v_c \in C_r} S_c(l_r)$  (4.41)
$p(l_c \mid l_r, I, \theta) \propto p(I \mid l_c, u_c)\, p(l_c, l_r \mid c_{rc}) \prod_{v_{c'} \in C_c} S_{c'}(l_c)$  (4.42)
where r is the root, c is a child node and S is specified as:
$S_c(l_r) = \sum_{l_c} p(I \mid l_c, u_c)\, p(l_c, l_r \mid c_{rc}) \prod_{v_{c'} \in C_c} S_{c'}(l_c)$  (4.43)
4.4.7 Framework extension
The evaluation of the criterion in equation (4.11) is very time consuming. To reduce the
detection time and improve the performance of the system we can extend the energy function
with a new term, the distance from the previous position in the video sequence:
$L^* = \arg\min_L \left( \sum_{i=1}^{n} m_i(l_i) + \sum_{(v_i, v_j) \in E} d_{ij}(l_i, l_j) + \sum_{i=1}^{n} dt_i(l_i, l_i^{t-1}) \right)$  (4.44)
where $dt_i(l_i, l_i^{t-1})$ measures the distance between the current position $l_i$ and the previous
position $l_i^{t-1}$ of body part $v_i$.
By introducing the $dt_i$ term in the energy function we use motion information to obtain
a more precise detection and to reduce the detection time [227]. There are two cases, with or
without a relation between consecutive frames. If the current frame is not related to a previous
frame, the $dt_i$ term is 0 for every location. In the second case we have the body location and
configuration from the previous frame, and the new location and configuration should be similar
to the previous ones.
The $dt_i$ term in equation (4.44) can be treated as a constraint. This constraint limits the
search space of the possible body part locations, orientations and scales, and can be viewed as an
adaptive top-down pre-filtering method; the temporal constraint limits the appearance model
parameters.
The consequence of using the $dt_i$ term as a constraint is that the derivation of the
detection algorithm is not changed; the term's effect is visible in the pictorial structure
parameters, reducing the search space.
Another variant of this statistical framework uses a predicted measurement instead of the
temporal constraint:
$L^* = \arg\min_L \left( \sum_{i=1}^{n} m_i(l_i) + \sum_{(v_i, v_j) \in E} d_{ij}(l_i, l_j) + \sum_{i=1}^{n} dp_i(l_i, \hat{l}_i) \right)$  (4.45)
where $dp_i(l_i, \hat{l}_i)$ measures the distance between the predicted location $\hat{l}_i$ and the
measured one. The previous idea is thus continued and the criterion is extended with a
prediction: we track the motion of every body part with a Kalman filter and search for the body
parts in the neighborhood of the predicted position.
By introducing the $dp_i$ term in the energy function we use motion information to obtain
a more precise detection, and in this way we can reduce the searching time [227]. There are two
cases, with or without a relation between consecutive frames. If the current frame is not related
to a previous frame, the term takes a high value for every location. In the second case the new
location and configuration can be computed starting from the prediction based on the
information from the previous frame, and the result should be close to the starting data.
Comparing the two approaches presented in equations (4.44) and (4.45), we can conclude
that the second energy function can use information from previous measurements, which makes
the system more robust to occlusion and self-occlusion. We denote the new framework as the
Optimized Pictorial Structure based Framework (OPSF).
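A minimal sketch of the per-part prediction used in such a scheme, with a constant-velocity Kalman filter (the state layout, the noise levels and the search radius are assumptions; OpenCV's KalmanFilter is used for brevity):

import numpy as np
import cv2

def make_part_tracker():
    # state (x, y, vx, vy), measurement (x, y)
    kf = cv2.KalmanFilter(4, 2)
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)
    kf.measurementMatrix = np.eye(2, 4, dtype=np.float32)
    kf.processNoiseCov = 1e-2 * np.eye(4, dtype=np.float32)      # assumed noise
    kf.measurementNoiseCov = 1e-1 * np.eye(2, dtype=np.float32)  # assumed noise
    return kf

def predicted_search_window(kf, radius=20):
    # predict the next part position and return the neighborhood to search
    p = kf.predict()
    x, y = float(p[0, 0]), float(p[1, 0])
    return (x - radius, y - radius, x + radius, y + radius)

# after matching, feed the measured part center back:
# kf.correct(np.array([[mx], [my]], np.float32))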
4.4.8 Systems framework and Experiments
In this subchapter we present a framework for human detection and pose estimation that
uses part-based detection. The process has two steps: the first is the detection of the body parts
and the second is the model matching. The detection can be done by creating an occurrence
probability map for every body part separately, and then matching them to a Pictorial Structure
model.
To create the occurrence probability map we use a strong discriminative detector based
on dense appearance representations with shape context descriptors [128] and AdaBoost [54].
These kinds of detectors have been used in the literature for pedestrian detection [109, 6, 145],
but in those cases the appearance models are simpler.
Our purpose is to create a generic method for building this occurrence probability map in
unconstrained environments. Because the space of possible body configurations is very large,
the reduction of the search space is important [214]. One way to do this is to use
discriminatively learned detectors [143, 237]. It is also important to apply the pre-filtering of the
possible part locations very carefully in the part detection phase, and to postpone the final
decision until evidence from all body parts is available. Entirely eliminating the pre-filtering is
inadequate, because it contradicts the intention of reducing the search space. The question
remains: what kind of pre-filtering preserves the generality and the performance of the
framework?
The dense evaluation of the search space considers all possible part positions, orientations
and scales, in contrast to bottom-up appearance models (e.g. [109]) based on a sparse set of local
features. The search space is limited by an adaptive pre-filter that uses information from
previous detections and from frame differencing. The pre-filtering is used only when this
information is available; otherwise the entire space is considered.
The part detectors use a shape context descriptor based method previously employed for
pedestrian detection. In this descriptor the distribution of locally normalized gradient
orientations is captured in a log-polar histogram. For classification we used an AdaBoost
classifier [54] based on a feature vector obtained by concatenating all shape context descriptors
whose centers fall inside the part bounding box.
We matched these maps to the Pictorial Structure model using the energy function of
equation (4.44).
Human Behavior Recognition in Video Sequences
The framework used is presented in Figure 41. The diagram shows that the temporal constraint is activated by the frame differencing module. We used frame differencing because it is fast and provides enough information about the temporal relations between consecutive frames.
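A minimal sketch of such a frame differencing gate is given below; the pixel threshold and the changed-pixel fraction used to decide whether two frames are related are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def frame_difference(prev_gray, curr_gray, pixel_thresh=25):
    # Binary change mask between two consecutive grayscale frames.
    diff = np.abs(curr_gray.astype(np.int16) - prev_gray.astype(np.int16))
    return (diff > pixel_thresh).astype(np.uint8)

def frames_related(change_mask, max_changed_fraction=0.5):
    # If most of the image changed, treat the frames as unrelated and fall
    # back to the full, unconstrained search described in the text.
    return change_mask.mean() < max_changed_fraction
```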
The frame differencing module activates the search space and model control unit. Based on the previous detection result and the frame differencing result, this unit reduces the search space to the neighborhood of the previous location, and the Pictorial Structure model parameters are configured in concordance with the previous match. Considering the possibility of losing synchronization between the tracked human and the model, we introduced verification steps in which the constrained detection result is compared to an unconstrained detection based on a densely sampled search over the full image.
Figure 41. Framework implementation diagram
To prove the performance of our OPSF method we compared it to another Pictorial Structure based method, presented by Andriluka [110]. We tested them on the same videos and on the same computer. For testing we used videos from indoor and outdoor environments, as well as movies downloaded from the internet.
Figure 42. Output of the systems. Left: Andriluka's method; right: ours
The first test was the generality test. We made a video in which every frame was an arbitrary image from the internet, without relations between the frames. To have a correct
comparison between them, instead of training them separately we used the same Pictorial Structure trained by one of the frameworks. Run on the same video, this method managed to detect people 99% of the time, while the processing speed remained the same.
The second test was a speed comparison of the frameworks on videos with related consecutive frames.
Table 8 shows the result of the speed test. Again, for both systems we used the same
Pictorial Structure model.
Table 8. Experimental results for human detection using Pictorial structure
Frame Number    Time Optimized (s)    Time Search Space Reduction (s)    Time Normal (s)
0     43.547    43.612    383.296
1     17.837    59.676    377.278
2     16.396    55.699    378.589
3     18.495    73.024    385.136
4     21.991    74.010    380.477
5     25.081    66.064    379.240
6     17.252    66.331    380.325
7     20.498    81.480    380.057
8     19.511    78.279    380.040
9     21.922    80.325    374.803
10    25.971    96.062    374.873
11    29.177    92.588    374.534
12    28.472    78.271    375.080
13    19.664    52.025    381.701
14    15.541    60.633    384.964
15    24.919    67.419    381.989
16    20.931    74.955    376.557
17    25.930    70.112    381.565
18    19.737    67.382    376.294
19    17.040    57.728    374.867

Comparing the experimental results, OPSF was not only faster but also had a higher detection rate:
The average time needed in the optimized case: 18.912 s
The average time needed without optimization: 379.792 s ≈ 6 minutes
In Figure 43 we show some differences between the two system outputs. The time constraint in OPSF has two consequences: one is the search space reduction in the image plane and the other is the adaptive model configuration modification correlated with the previous match.
Figure 43. The responses of the systems with and without time constraint optimization.
Figure 44 shows the speed of the system when only space reduction is applied and when both optimizations are applied.
Figure 44. The speed of the system for the two kinds of optimization
The average time needed in the optimized case: 18.912 seconds
The average time needed with search space reduction only: 64.892 seconds
Using both optimizations the system became three times faster than the one using only the space reduction method. The difference in detection precision is not significant in this case.
Figure 45. Performance of the pose estimation
Figure 45 presents the performance evaluation of the pose estimation part. The average ratio of detected parts is 66.9% when a match is required to overlap more than 95% of the annotated part area. If only a 75% overlap between the annotated and the detected areas is required, the recognition rate is 75.4%.
Figure 46. Difference in performance of the pose estimation
Figure 46 presents the difference in body configuration detection: 66.9% when using search space and configuration reduction, versus 65.3% with search space reduction only.
In the literature many systems have to use two models: one for the frontal view and one for the lateral view of the human body. We tested the necessity of these two pictorial structure models in OPSF.
Figure 47. Output of the system with frontal and lateral body model
As Figure 47 shows, the detection precision is almost the same with both models; more exactly, the difference between the two models is below 2%. Considering these results we conclude that it is unnecessary to use two models.
Figure 48. Output of the system with search space reduction and with search space reduction
and configuration reduction
Figure 49. Output of the systems: a) OPSF, b) original framework
We also compared our second framework, OPSF2, to the original framework. The results are presented in Table 9. The detection rate increases compared to OPSF1, but not significantly. We achieved a performance increase only in the pose estimation domain, and the use of prediction has also increased the processing time.
Table 9. Performance parameters on our database

Type of Classifier    Correct detection rate    Pose estimation (75% overlap of body parts)    Processing time/frame
Original framework    89%                       70%                                            385 s
OPSF1                 92.2%                     75%                                            16.9 s
OPSF2                 92.2%                     80.4%                                          17 s
4.4.9 Results
In the previous subchapter we presented a component-based method that uses the Pictorial Structure model. Starting from the statistical framework proposed by Felzenszwalb we extended this model to use prior information during detection, while remaining a generic model for human detection and pose estimation even when prior information is not available. We have demonstrated that OPSF, although not working in real time, is significantly faster than other similar systems: while the original framework needs more than 6 minutes for recognition, OPSF detects the human positions and attitudes in around 17 seconds.
With our experiments we have shown that the generality of the system is very high, and that using the time constraint leaves this generality unchanged. Using the same videos for the experiments we observed that every human detected by the original framework was also detected by OPSF. This is expected, because OPSF acts the same way as the original framework when prior information is missing.
We also demonstrated that using the time constraint we obtain higher precision by eliminating many of the false positive detections. We achieved a 92% recognition rate compared to the 89% of the original algorithm, while the processing is 20 times faster.
Based on the test results we can say that it is unnecessary to use both lateral and frontal Pictorial Structure models for detection, because the effect on the recognition precision is not significant, while the processing time increases considerably.
4.5 Comparison of the human detection methods
In the previous sections we presented three types of human detection methods. In this section we compare them. Our initial observation was that in some cases the presented methods work well, while in others their performance is weak.
First, we tested the performance of the methods on humans of different sizes.
Figure 50 presents the performance of the different methods. The Pictorial Structure method performs better when the people in the image are larger, while the Haar based method performs better at lower resolutions. The performance of the Haar based method is more stable across different human sizes, but its body pose estimation is poor. The Pictorial Structure method has the best recognition rate and the best pose estimation, but it is very slow compared to the other methods and cannot be used in real-time systems.
Figure 50. Human detection system performance for different human sizes
The chamfer method's detection rate is not the best, but its pose estimation is better than that of the Haar based method.
Based on our measurements, the best choice at lower resolutions is the Haar based method. At medium resolutions both the Haar and the chamfer methods can be used, and their results can be merged. If better pose estimation performance is needed, the Pictorial Structure method should be used, but only in offline systems.
4.6 Conclusions
The focus of this chapter is human detection and pose estimation. It presents the three most significant human detection and pose estimation methods together with our contributions to them. These methods represent different classes of the most promising approaches. To evaluate their performance we compared them in diverse cases.
The first described method is a single window approach. We have built a novel classifier [198, 212] based on the work of Viola and Lienhart, and we have created a new classifier structure to detect multi-view and multi-pose human bodies. In order to simplify the classifier structure and to speed up the training procedure we have introduced a novel background selection algorithm [199].
After comparing the Viola and Lienhart classifiers with our PETC classifier [198, 212] the conclusions were:
– our PETC classifier is faster than the other two classifiers;
– its correct detection rate is comparable with that of the Lienhart classifier, but PETC has a lower false positive rate.
The comparisons of the classifiers were performed on multiple databases.
The second presented approach is a template based method. We used chamfer matching to detect humans and estimate their poses.
We proposed a novel pseudo-parallel approach to compute the distance transformation. Based on experiments this method is 25% faster than other methods [209].
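For reference, the classical sequential two-pass chamfer distance transform is sketched below; this is the standard baseline that such methods improve upon, not the pseudo-parallel variant proposed in the thesis.

```python
import numpy as np

def chamfer_distance_transform(edges, d_ortho=1.0, d_diag=1.4):
    # Two-pass (forward/backward) chamfer distance transform of a binary
    # edge map: every pixel receives its approximate distance to the
    # nearest edge pixel.
    h, w = edges.shape
    dt = np.where(edges > 0, 0.0, 1e9)
    # forward pass: top-left to bottom-right
    for y in range(h):
        for x in range(w):
            if y > 0:
                dt[y, x] = min(dt[y, x], dt[y - 1, x] + d_ortho)
                if x > 0:
                    dt[y, x] = min(dt[y, x], dt[y - 1, x - 1] + d_diag)
                if x < w - 1:
                    dt[y, x] = min(dt[y, x], dt[y - 1, x + 1] + d_diag)
            if x > 0:
                dt[y, x] = min(dt[y, x], dt[y, x - 1] + d_ortho)
    # backward pass: bottom-right to top-left
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            if y < h - 1:
                dt[y, x] = min(dt[y, x], dt[y + 1, x] + d_ortho)
                if x < w - 1:
                    dt[y, x] = min(dt[y, x], dt[y + 1, x + 1] + d_diag)
                if x > 0:
                    dt[y, x] = min(dt[y, x], dt[y + 1, x - 1] + d_diag)
            if x < w - 1:
                dt[y, x] = min(dt[y, x], dt[y, x + 1] + d_ortho)
    return dt
```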
We also proposed a new way to store the templates and to find the most probable templates very quickly [201].
We proposed the CHMS framework to detect human presence in the image and estimate the pose. Using the CHMS framework we obtained a five times faster detector with the same detection performance but with fewer false positives and better pose estimation [201].
We also performed several tests to investigate the performance of chamfer matching in relation to image homogeneity, and we conclude that in this form the method cannot be applied in cluttered scenes, but it can be used successfully when a background subtraction method can be applied to the image.
One of the main advantages of the chamfer matching method is that at the end of the matching process, besides the position, we also know the human's pose (attitude). The attitude estimation and the matching are done at the same time; to get the pose we only have to categorize the templates [199].
The third studied approach was the component based method. Our starting point was the Felzenszwalb Pictorial Structure based framework [204, 208].
First we implemented the Felzenszwalb algorithm and ran some tests. Based on these tests we identified the weaknesses of the method [204, 208].
We proposed two new methods, OPSF1 and OPSF2, to increase the performance of the Pictorial Structure based method [197, 200].
The first method uses motion information to modify the Pictorial Structure parameters and to speed up the recognition process [197, 200].
The second method uses tracking information to eliminate the ambiguity caused by occlusions or self-occlusions [199].
The experiments showed that our OPSF variants are considerably better than the original framework: they are 20 times faster, have a 13% higher detection rate, 5-10% higher accuracy, and a considerably reduced false positive detection rate [200].
The proposed OPSF uses motion and tracking information without reducing the framework's ability to operate on still images or when motion information is not available [200].
The last part of the chapter compares the three improved methods using video sequences with different human sizes in the image. We showed that all of them have their uses: the first two methods are fast and work well at lower resolutions, while the Pictorial Structure method gives the best detection results but works very slowly even with our speed-up [199].
5. Recognizing Human Action and Behavior
In this chapter we present the last component of the human behavior recognition system. Many researchers focus on behavior recognition, as already presented in chapter 2; here we present our research results in this field.
Behavior recognition is the last step in the human behavior recognition process. This component mainly processes the "measurements" provided by the human detection and pose (attitude) estimation systems, which is why its performance is highly influenced by the quality of the data provided by the previous components.
Based on the nature of human behaviors, the recognition process is divided into two steps: activity or action recognition and behavior recognition. The term activity is defined as a sequence of movements that cannot be described using other simple human motions. The duration of activities is usually short. Behaviors last longer and are composed of sequences of activities.
The activity and behavior recognition algorithms are categorized [84] into single-layered approaches and hierarchical approaches; for action recognition the single-layered approaches are more suitable, while for behavior recognition the hierarchical approaches achieve better results.
In the following subchapters we present the activity and the behavior recognition separately, together with our contributions to these two fields.
5.1 Recognizing Actions
By definition, the action is the simplest behavior, one that cannot be described in terms of other actions. The action recognition process takes as input a sequence of positions and body configurations, and its output is the recognized action.
The action recognition algorithms have to deal with the following issues:
Multitude of human forms
The same actions are performed in different ways
The duration of the actions differs from person to person
The input information can be incomplete or erroneous.
During recognition this sequence is compared to examples or to a model in order to categorize the input. The most suitable recognition technique for this purpose is the single-layer approach. Two types of single-layer approaches can be distinguished based on the way the input information is handled: we can look at the input as a multidimensional unit or as a sequence of information. Accordingly, there are the space-time techniques and the sequential approaches.
Since the space-time techniques treat the input as a unit, this approach is suitable mostly for repetitive motions and gesture recognition. The techniques that treat the input as volumes provide a straightforward solution, but they mostly fail at handling view, speed and motion variation. The trajectory based approaches try to eliminate the view problems but usually introduce others, especially when used for joint position tracking; they can be applied successfully only to very simple actions.
The most promising direction for the spatio-temporal techniques is given by the local feature based approaches, because they are robust to illumination changes and to some degree of noise. Another benefit is that they may be used directly on the image, without background subtraction or body part modeling. Compared to the volumetric approaches, these methods are capable of recognizing non-periodic actions by using algorithms that can model the relations between features. This approach is not suitable for recognizing complex actions and has difficulties handling multiple views.
The sequential approaches use sequential relationships between features, and are thus able to detect and recognize more complex activities. The state model based sequential approaches compute the posterior probability of the action occurring; the probabilistic approach makes it possible to incorporate other factors into the decision process. To build a model, the state-based approaches require training videos. The required quantity of training videos depends on the action complexity: for complex actions the amount of training data has to be large. When there is enough training data to properly train the model, the system will be flexible and able to recognize an activity even from completely different action sequences.
The example based methods require less training data than the model based ones and provide an execution rate similar to the non-linear matching techniques. Their main disadvantage is that they require a template for every different action sequence.
5.1.1 Dynamic time warping
Dynamic time warping is a template-based dynamic programming matching technique for measuring similarity between two sequences which may vary in time or speed. For instance, similarities in walking patterns would be detected for slowly walking persons, for quickly walking persons, or even with accelerations and decelerations during a course.
In general, DTW is a method that allows a computer to find an optimal match between two given sequences (e.g. time series) with certain restrictions. The sequences are "warped" non-linearly in the time dimension to determine a measure of their similarity independent of certain non-linear variations in the time dimension. This sequence alignment method is often used in the context of hidden Markov models.
One example of the restrictions imposed on the matching of the sequences concerns the monotony of the mapping in the time dimension. Continuity is less important in DTW than in other pattern matching algorithms; DTW is particularly suited to matching sequences with missing information, provided there are enough segments for matching. Dynamic time warping compares two time series and computes the distance between them, even if the two series are shifted on the time axis. Given two series X and T, of lengths |X| and |T|:
$X = x_1, x_2, \ldots, x_i, \ldots, x_{|X|}$
$T = t_1, t_2, \ldots, t_j, \ldots, t_{|T|}$    (5.1)
To align the two time series we construct an |X|-by-|T| distance matrix. The (i, j) element of the matrix corresponds to the distance between the x_i and t_j elements of the series. To get the distance between the two time series we search for the warping path W presented in equation (5.2).
$W = w_1, w_2, \ldots, w_K, \qquad \max(|X|, |T|) \le K < |X| + |T|$    (5.2)
where K is the length of the warp. Every element of the warping path is a pair of coordinates or
indexes, which represent a relation between the two time series.
$w_k = (i, j)$    (5.3)
where i, j represent the indexes of the two time series. There are three constraints on the warp path: the boundary condition, the continuity and the monotony. The boundary condition, $w_1 = (1, 1)$ and $w_K = (|X|, |T|)$, means that the warping path must start at the first elements and must end at the last elements of both time series. The starting point should be the bottom-left corner, and the end the opposite corner of the distance matrix. The continuity and monotony constraints are merged in equation (5.4).
$w_m = (i, j), \; w_{m+1} = (i', j'), \qquad i \le i' \le i + 1, \; j \le j' \le j + 1$    (5.4)
The restrictions from equation (5.4) control the allowable steps to adjacent cells, so that i and j increase monotonically in the warping path. All indexes, from both time series, must be used.
Many warping paths satisfy these three conditions, but we are interested in the path that optimizes the cumulative distance of the path elements (equation (5.5)).
$DTW(X, T) = \min\!\left( \frac{1}{K} \sum_{l=1}^{K} w_l \right)$    (5.5)
The factor 1/K normalizes the distance for warping paths with different lengths. The best way to construct the optimal warping path is dynamic programming. First, the task should be split into subtasks over portions of the time series; by finding the optimal solutions to these subtasks we obtain the optimal solution of the entire problem. To achieve this we construct a cumulative distance matrix D using equation (5.6).
$D(i, j) = Dist(x_i, t_j) + \min\left[ D(i-1, j), \; D(i, j-1), \; D(i-1, j-1) \right]$    (5.6)
Every cell is computed as the sum of the distance (Euclidean or another type of distance) between the current elements, $Dist(x_i, t_j)$, and the minimum of the cumulative distances of the adjacent cells.
The cost matrix is computed bottom-up, from left to right. After the entire cost matrix is filled, a warp path must be found, starting at the lower-left corner D(1, 1) and ending at the top-right corner D(|X|, |T|). The warp path is actually computed in reverse order, starting at D(|X|, |T|), using a greedy search that evaluates the cells to the left, below, and diagonally to the bottom-left. The coordinate of the smallest-valued cell is added to the beginning of the warp path found so far. The search continues from the last added cell and stops when D(1, 1) is reached.
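A compact sketch of the cumulative-matrix construction and path backtracking described above (equations (5.5)-(5.6)) follows; the optional Sakoe-Chiba band parameter anticipates the constraint used later in the Final DTW step, and the absolute difference stands in for any local distance.

```python
import numpy as np

def dtw_distance(X, T, band=None):
    # Cumulative cost matrix, equation (5.6); D[0, 0] is the virtual start.
    n, m = len(X), len(T)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        lo, hi = 1, m
        if band is not None:  # optional Sakoe-Chiba band: |i - j| <= band
            lo, hi = max(1, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            dist = abs(X[i - 1] - T[j - 1])          # Dist(x_i, t_j)
            D[i, j] = dist + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Recover the warp path length K by the greedy backtracking described
    # in the text, then apply the 1/K normalization of equation (5.5).
    i, j, K = n, m, 1
    while (i, j) != (1, 1):
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)],
                   key=lambda c: D[c])
        K += 1
    return D[n, m] / K
```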
5.1.2 Dimensionality reduction and motion decomposition
Because most approaches in action recognition need to deal with very high-dimensional data spaces, they often suffer from the 'curse of dimensionality': the feature space becomes exponentially sparser with the dimension, and therefore a larger number of samples is required to build efficient class-conditional models. The simplest way to reduce dimensionality is Principal Component Analysis (PCA), which assumes that the data lie on a linear subspace. Except in very special cases, data do not lie on a linear subspace, thus requiring methods that can learn the intrinsic geometry of the manifold from a large number of samples. Nonlinear dimensionality reduction techniques allow data points to be represented based on their proximity to each other on nonlinear manifolds. Several dimensionality reduction methods such as PCA, locally linear embedding (LLE) [171], Laplacian eigenmaps [111], and Isomap [111] have been applied to reduce the high dimensionality of video data in action recognition tasks [111, 79, 231, 113]. Specific recognition algorithms such as template matching or dynamical modeling can be performed more efficiently once the dimensionality of the data has been reduced.
Because human motion can be compositional or concurrent, the global trajectories are not the best choice. Some actions involve only the legs, for example walking, running and jumping, while others involve only the hands: handshaking, waving. For this reason, we decomposed the action into its basic elements: the body part motions. To make the recognition easier we track every body part individually and relative to its parent body part. Using this approach we can use in classification only those basic motions (body part motions) that are relevant for an action, so we can easily recognize composed motions as well.
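A minimal sketch of this decomposition, assuming 2D joint positions are available from the pose estimator; the part names and the kinematic tree below are hypothetical simplifications of the thesis model.

```python
import numpy as np

# Hypothetical kinematic tree; the torso is the root of the body parts.
PARENT = {"upper_arm": "torso", "lower_arm": "upper_arm",
          "upper_leg": "torso", "lower_leg": "upper_leg"}

def segment_angle(p_from, p_to):
    # Image-plane orientation of one limb segment, in degrees.
    return np.degrees(np.arctan2(p_to[1] - p_from[1], p_to[0] - p_from[0]))

def relative_angles(absolute):
    # Express each part's angle relative to its parent, so the resulting
    # time series describe basic motions independently of the torso pose.
    return {part: absolute[part] - absolute[PARENT[part]] for part in PARENT}

# absolute = {"torso": 90.0, "upper_leg": 70.0, "lower_leg": 60.0, ...}
```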
In some cases, when we have low-resolution images, we cannot track all body part motions separately, only the global motion. There are several possibilities in this case: we can use the Haar based detector [198] or chamfer matching [201] to detect the humans and their poses.
Our goal is to get the most detailed information about the human body configuration and its relation to the other moving objects and to the environment of the current frame. To achieve this goal, for low-resolution images we used a bottom-up approach, the chamfer matching [201], while for higher-resolution images we used a top-down approach, the Pictorial Structure method introduced by Felzenszwalb [142] and extended by Ramanan [159].
For higher-resolution frames, the Pictorial Structure approach models the human body as a collection of parts in a deformable configuration, with 'spring-like' connections between pairs of parts. These connections model spatial relations between parts. Appearances and spatial relationships of individual parts can be used to detect an object. The best match of the pictorial structure depends on how well each part matches its location and how well the locations agree with the deformable model. The main advantage of this approach is that the motions of the human body parts are tracked individually and relative to the parent body part. Using this approach, we can use in classification only those basic motions (body part motions) that are relevant, and we can easily recognize composed motions too. The first and most significant motion is the torso motion. Here we look at two elements: the motion relative to the image (global motion) and the angular motion. The torso represents the root of the body parts in the pictorial structure. The upper legs and upper arms are connected to the torso and we analyze their angular motions between -270 and +270 degrees only. The absolute motions are tracked between -180 and +180; the ranges between 180 and 270 in absolute value represent a buffer zone. If the motion angle goes above 180 or below -180, we keep two possible time series. Three events can reset one of the time series: the angular motion quickly returns to the range between -180 and 180 degrees; the DTW matching for one of the series gives a strong result; or the angle increases above 270 or decreases below -270.
Figure 51. Relative motion of the upper leg relative to the torso
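The buffer-zone bookkeeping can be sketched as follows for a single track; keeping two parallel candidate series and the DTW-based reset are omitted here, so this is a simplified approximation of the rule described above, not the thesis code.

```python
def unwrap_step(prev_angle, measured):
    # 'measured' is the raw angle in (-180, 180]; pick the 360-degree
    # branch closest to the previous unwrapped value.
    candidates = (measured, measured + 360.0, measured - 360.0)
    best = min(candidates, key=lambda c: abs(c - prev_angle))
    # Buffer zone: beyond +/-270 degrees the track is reset to the base
    # range, mirroring one of the reset events described in the text.
    if abs(best) > 270.0:
        best = measured
    return best
```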
The lower parts of the legs are connected to the upper parts, and their angular motions are relative to the upper parts of the legs. Likewise, the lower arm angular motions are tracked relative to the upper arm. We do not track the motion of the head. Figure 52 presents a time series of upper arm motion representing the waving action.
The most important points in a motion series are the peaks, because they mark a change in the motion direction, and the still (constant) points and the zero crossings, because they are the stable or typical positions of the human body. Because of this, the speed of the actions is not relevant.
In the case of low-resolution frames, we use the chamfer matching method [201] to track the human body and to detect its pose. Using the fast template search method introduced by us, we can always track the human body and measure the distance from the closest template class.
Figure 52. Full resolution time series of waving – upper arm
We can approximate the motion series using the key positions. There is an unequivocal mapping from a key position to the relative positions of all body parts. We map a key position whenever the current match has the lowest distance from the template. We count the number of frames between two consecutive best-match key positions and then interpolate the intermediate points.
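A minimal sketch of this reconstruction, assuming linear interpolation between the best-match key positions (the interpolation scheme itself is not specified in the text):

```python
import numpy as np

def reconstruct_series(key_frames, key_values, n_frames):
    # Fill the frames between two consecutive best-match key positions
    # by linear interpolation.
    return np.interp(np.arange(n_frames), key_frames, key_values)

# e.g. key positions matched at frames 0, 7 and 15 of a 16-frame clip:
series = reconstruct_series([0, 7, 15], [-30.0, 45.0, -10.0], 16)
```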
Connecting these two approaches, we can provide a general framework based on the Heuristic FastDTW to recognize the human actions.
5.1.3 Heuristic Fast Dynamic Time Warping Methods
The quadratic time and space complexity of DTW creates the need for methods that speed up the dynamic time warping. The most common approach is the use of constraints, which limit the search area in the cost matrix. Such constraints are important not only for speeding up the DTW but also for eliminating the problem of singularities [68, 26, 35].
There are also other methods to speed up the computation of the DTW. One is FastDTW [230], which uses recursive shrinking and refinement to get the best warping path.
To compare the time series of the human motions we use an improved version of the DTW algorithm, a multilevel approach with the following key operations:
Shrink the time series into smaller time series that contain only the peak and constant values;
Coarse DTW: find a minimum-distance warping path for the shrunk series and use it as an initial guess for the full-resolution minimum-distance warp path;
Final DTW: refine the warping path projected from the lower resolution through local adjustments, using the Sakoe-Chiba constraint.
The first step is the coarsening step, during which we shrink the time series. The most significant moments of human body part motions are the direction changes. In the Heuristic FastDTW approach we do not average the time series; instead we use a heuristic selection of the data in which only the peaks and constant values are kept. In other words, an element x_i of X is kept only if it satisfies one of the following two conditions:
$\big( (x_i \ge x_{i-2}) \wedge (x_i \ge x_{i+2}) \big) \vee \big( (x_i \le x_{i-2}) \wedge (x_i \le x_{i+2}) \big)$    (5.7)
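A sketch of this coarsening step, applying condition (5.7) with a two-sample lookaround; the handling of the series endpoints is an assumption.

```python
import numpy as np

def shrink_series(x):
    # Keep only peaks, valleys and constant runs (equation 5.7); the first
    # and last samples are always kept so the series stays anchored.
    x = np.asarray(x, dtype=float)
    keep = [0]
    for i in range(2, len(x) - 2):
        left, right = x[i - 2], x[i + 2]
        if (x[i] >= left and x[i] >= right) or \
           (x[i] <= left and x[i] <= right):
            keep.append(i)
    keep.append(len(x) - 1)
    return x[keep], np.array(keep)
```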
Figure 53. The original and the shrunken time series of waving – upper arm
Figure 53 presents the original time series of waving the upper arm and the shrunken series. As the second step, we perform a classical DTW comparison between the shrunken templates and the shrunken inputs. Using this comparison we can eliminate the majority of the templates, leaving only a few for the higher-resolution comparison.
Figure 54 shows the cost matrix of the shrunk time series and its projection onto the original-resolution cost matrix. The projection takes a warping path calculated at a lower resolution and determines the corresponding cells of the warping path in the higher-resolution time series. This projected path is then used as a heuristic during solution refinement, to find a warping path at the higher resolution. To make this faster we used the Sakoe-Chiba band constraint.
Figure 54. The coarse and the full resolution cost matrix with warping path
The Final DTW step is a refinement that finds the optimal warping path in the neighborhood of the projected path, where the size of the neighborhood is determined locally by the distance between two consecutive points in the shrunk series and by the difference between the lengths of the template series and the input series.
This finds the optimal warping path within the area around the path that was projected from the lower resolution.
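Combining the sketches above (shrink_series and dtw_distance), the overall matching can be outlined as follows; the coarse-score rejection threshold is an assumed parameter, and the band-constrained full-resolution pass is a simplification of the path-projection refinement.

```python
def heuristic_fast_dtw(template, series, reject_thresh, band=5):
    # 1) coarsen both series to their peaks/constants (equation 5.7)
    t_coarse, _ = shrink_series(template)
    s_coarse, _ = shrink_series(series)
    # 2) coarse DTW: cheap early elimination of unlikely templates
    if dtw_distance(t_coarse, s_coarse) > reject_thresh:
        return None
    # 3) final DTW at full resolution, restricted by a Sakoe-Chiba band
    #    (the thesis refines around the projected coarse path instead)
    return dtw_distance(template, series, band=band)
```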
In the case of low-resolution images the poses are recognized as key positions. To these positions we can always associate a set of points in the body part series. Because the image resolution does not permit a fine detection of the body part motions, one key position is recognized several times before the position changes. This kind of motion series is very similar to the shrunken version of a full-resolution time series.
5.1.4 Classification using Neural Networks
Because we have a separate motion series for every body part, we need to synchronize the matches and make a final decision about the overall human action. We use the result of the Heuristic FastDTW as input for neural networks. Neural networks (NNs) are nonlinear models, which makes them flexible in modeling complex real-world relationships. Furthermore, NNs are data-driven, self-adaptive methods, able to adjust themselves without any explicit functional or distributional specification of the underlying model.
Although many types of neural networks could be used for classification, we focused on the following three network types: the Learning Vector Quantisation (LVQ), the Radial Basis Function (RBF) and the feed-forward multilayer network or MultiLayer Perceptron (MLP) NNs, which are the most widely studied and used neural network classifiers.
Using the Heuristic FastDTW's output, a dataset has been compiled. In order to follow the proper steps of designing a test bench system, the dataset has been divided into a training subset (75% of the samples) and a testing subset (25%). Five different behaviors are represented in the compiled dataset, with about twenty measurements for each (values corresponding to eight body part motion time series). The NN training has been done in Matlab, using the embedded functions of this environment.
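An equivalent experiment can be sketched in Python as below; the data here are random placeholders standing in for the DTW-based features, and the layer sizes mirror the MLP row of Table 8 (the original training was done in Matlab, so this is only an illustrative analogue).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 40))    # placeholder: 40 DTW-based features per sample
y = rng.integers(0, 5, 100)  # placeholder: 5 behavior classes

# 75% / 25% train-test split, as in the thesis experiments
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(40,), max_iter=100)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```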
The evolution of the classification accuracy of the RBF NN and the LVQ NN during the training phase is presented in Figure 55 and Figure 56, respectively.
Figure 55. Training chart of the RBF NN
Figure 56. Training chart of the LVQ NN
Table 8 summarizes the results obtained with these methods, showing that using NNs to recognize human actions from monocular video after the Heuristic FastDTW processing is a viable solution.
Table 8. NN classification results

NN type    Neurons (Inp./Hidd./Out.)    Classification accuracy after max. 100 epochs
LVQ        40/80/5                      98%
RBF        40/5                         80%
MLP        40/40/5                      82%
5.1.5 Results
For testing purposes we used the detection results from the three types of human detection and pose estimation methods. The results were converted to time series, then the torso speed and the motion of each body part relative to its parent were measured. Naturally, the resolution differs for the different detection methods.
These parameters are compared with the saved templates using the Heuristic FastDTW, and some of them are eliminated at an early stage if the distance between the coarse variants of the series is bigger than a threshold.
To construct the template database we annotated and saved 4 different actions from 10 different videos. For every body part we compared the saved motion series with the Heuristic FastDTW: if the difference between them was too big we dropped them, and if they were similar we chose the median series.
We also used a feed-forward neural network to classify the actions. The number of network inputs equals the number of templates and the number of outputs equals the number of trained actions. We used the saved templates to train the neural network.
For the experiments we used indoor scenes with simple and composed actions.
Table 9. Comparative results of the experiment using the three types of pose estimators

Pose estimation type                          FastDTW processing time    Heuristic FastDTW processing time    FastDTW accuracy    Heuristic FastDTW accuracy
Haar-based tree classifier                    9.6 ms                     5.2 ms                               56%                 61%
Chamfer matching                              9.4 ms                     4.9 ms                               72%                 83%
Pictorial structure based pose recognition    9.8 ms                     5.1 ms                               98%                 98%
Using the Heuristic FastDTW, the recognition is twice as fast as with FastDTW, because many templates can be eliminated early, in the coarse comparison step. Using the motion decomposition we were able to recognize composite actions as well, such as standing and handshaking.
The second part of the experiment shows that the accuracy is influenced by the resolution of the human pose estimation: at lower resolutions the method was not able to recognize composite actions like standing and handshaking.
In this subchapter we presented two improvements for human action recognition: an efficient representation of motions by decomposition into basic elements, and a FastDTW algorithm adapted for human motion recognition. Both ideas can be improved further. In the case of body part motions, the angular motion can be decomposed into two time series, one with low-frequency variations and one with high-frequency variations: the low-frequency series represents the position, while the high-frequency series represents the short actions of the body part.
We can introduce more constraints in the coarsening step of the Heuristic FastDTW, reducing the length of the time series and thereby speeding up the recognition procedure.
We can also improve the recognition framework by introducing a Self-Organizing Incremental Neural Network instead of the feed-forward neural network.
Figure 57. Output of the action recognition system
5.2 Recognizing Behavior
By definition, human behavior is a sequence of simple or complex actions. By its nature, behavior is complex and highly variable. For this reason, single-layer approaches cannot recognize behaviors; a hierarchical structure is more appropriate and performs better in human behavior recognition.
The low-level recognition approaches are responsible for recognizing the atomic actions. Their results serve as observations or measurements for the higher-level recognition. The hierarchical techniques represent tractable and conceptually understandable models of human behavior; using this approach, the human behavior model can be built by human experts. The hierarchical techniques also reduce redundancy in the recognition process by reusing the recognized actions.
The hierarchical techniques are able to recognize complex activities and behaviors with complex structures, which is one of their major advantages. Their capability to integrate semantic processing makes these approaches suitable for analyzing complex behaviors and group and object interactions, and for integrating prior information.
One of the most important problems of behavior recognition systems is the large training set requirement. The hierarchical structure of the human behavior recognition system considerably reduces the required size of the training set.
There are three groups of hierarchical techniques: statistical approaches, syntactic approaches, and description-based approaches. The most relevant techniques in this field are presented in chapter 2.
The statistical approaches use state-based models to recognize the behavior. Given enough training data, they can recognize sequentially constructed behaviors even in a noisy environment, but they cannot recognize well complex behaviors constructed from concurrent actions.
The syntactic approaches use strings of symbols to model actions and use a grammar in the recognition process. Their major limitation is the same as for the statistical techniques: limited capability to recognize behaviors constructed from concurrent atomic actions. The syntactic approach uses production rules provided by the user, which must cover all possible events of a large domain. These rules are used to parse the observations, and the system can become unstable when an unknown observation interacts with it.
The last group of techniques uses spatiotemporal structures to model and recognize the behavior. These methods describe, in structured form, the spatial, logical and temporal relations between the atomic or lower-level actions of the behavior. Behavior recognition is then a search in the structure of the model. The major weakness of this technique is that it does not handle errors from the low-level recognition and cannot compensate for them.
In the following subchapter we show that by using Petri Nets we can recognize behaviors described by concurrent atomic actions, and compensate for some errors in the low-level recognition.
5.2.1 Hierarchical Probabilistic Petri Net
A Petri Net is a mathematical tool that can also be used for the high-level interpretation of image sequences, by describing relations between events and conditions. This tool has been used previously in computer vision to model simple human behaviors [63]. Petri Nets are also useful for modeling and visualizing behaviors. Vision-based systems have to deal with ambiguities and inaccuracies in the lower-level detection and tracking systems, while the base form of the Petri Net is deterministic.
To resolve this, we have defined a Hierarchical Probabilistic Petri Net (HPPN) as a tuple $(P, T, F, \Delta)$, where:
P is a set of states called places;
T is a set of transitions; the sets of places and transitions are disjoint;
$F$ is the flow relation between places and transitions;
$\Delta$ associates to each place a local probability distribution defined on the transitions from P to T.
In the Hierarchical Probabilistic Petri Net there exists at least one terminal node and one starting node.
Markings are used to represent the Petri Net in action; a Petri Net's operation is controlled by its current marking. A transition fires if and only if a predefined number of its input places hold a token. When a transition fires, the tokens that enabled it are removed and one token is placed in each of the output places of the transition. A terminal marking is reached when one of the terminal nodes contains at least one token.
To keep track of the probability of the dynamic behavior of the Petri Net, the tokens carry probabilities. At the beginning all tokens are initialized with a probability score of 1 or with the value provided by the low-level processing algorithm. The token probability score changes when a transition fires. The simplest case is when there are two places and one transition between them: after the transition fires, the token is placed in the new place with a probability score computed by multiplying the previous token probability score by the probability associated with the output transition:
$score_{new} = score_{old} \cdot \Delta_p(t)$    (5.8)
Figure 58. Transition and places
We introduced the concurrence connection between the places. The idea behind this is
that some behaviors can have different action sequences.
Figure 59. Concurrency
The score of the input places will be the same and can be computed with equation (5.8). With the same consideration we introduced the opposite operation as well: the synchronization.
Figure 60. Synchronization
The new token score is the product of the input token scores and the places' probability distribution functions:
$score_{new} = \prod_i score_i \cdot \Delta_{p_i}(t)$    (5.9)
We can remove two or more tokens from the net and replace them by a single token. We
will use the final probability of a token in a terminal node as the probability that the activity is
satisfied.
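The token score updates of equations (5.8) and (5.9) can be sketched with a single firing rule, since synchronization is simply the multi-token case; the class and function names are illustrative, not the thesis implementation.

```python
class Token:
    def __init__(self, score=1.0):
        # Initial score: 1, or the confidence from the low-level recognizer.
        self.score = score

def fire(tokens, transition_prob):
    # Firing consumes the enabling tokens; the output token's score is the
    # product of the input scores and the transition probability Delta(t),
    # covering both the simple case (one token, eq. 5.8) and the
    # synchronization of several concurrent branches (eq. 5.9).
    score = transition_prob
    for tok in tokens:
        score *= tok.score
    return Token(score)

# simple firing, then synchronization of two concurrent branches:
walk = fire([Token(0.9)], 0.8)
both = fire([Token(0.9), Token(0.6)], 0.8)
```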
5.2.2 Experiment
To verify the usability of the proposed Hierarchical Probabilistic Petri Net we performed several experiments.
We used the detected positions and the configurations of the pictorial structure to measure the speed of the torso and to track the motion of each body part relative to its parent. These parameters were compared with the saved templates using the Heuristic FastDTW, and templates were eliminated at an early stage if the distance between the coarse variants of the series was bigger than a given threshold.
The Heuristic FastDTW comparisons only categorize the body part motions into classes. Every class has an associated place in the network; when the Heuristic FastDTW categorizes a body part motion into a class, the associated place gets a token. Using the Petri Net
synchronization procedure we can decide on the actual basic activity. The first layer of the Petri Net is used to recognize the atomic actions.
To exemplify this, we model a simple walking behavior. The step state is activated by the Heuristic FastDTW, and the repeated activation of the step state activates the walking state. In the Petri Net a state may or may not represent a behavior; by adding new labeled states we can extend the Petri Net to recognize new motions.
The figure below shows an example of simple activity recognition:
Figure 61. Recognizing simple activities
where LUL is the left upper leg, LLL is the left lower leg, RUL is the right upper leg, RLL is the
right lower leg.
To recognize more complex behaviors we use multiple layers.
Figure 62. Activity recognition using HPPN
For testing we used the waiting behavior. This behavior can be composed of different simple actions, and these actions are concurrent. Another problem is handling the different durations of the actions. We solved this by introducing skip transitions: a skip transition is a feedback to the same place through a transition. Each of these transitions is penalized by a reduced token probability; the probabilities assigned to the skip transitions control the model's tolerance to deviations from the base activity pattern.
Figure 63. HPPN for a single motion pattern
The figure above exemplifies the HPPN of a single motion pattern. Abbreviations used: RA - right arm, LA - left arm, T - torso, RL - right leg and LL - left leg. The boxes are reusable Petri Net parts, replaceable with a single place. The network contains multiple instances of these parts; every instance has its own parameterization to fulfill its scope and to recognize different actions or behaviors.
The Hierarchical Probabilistic Petri Net, extended with the probabilistic concurrency and synchronization operations, is able to represent human behaviors in a more realistic way. The time delay between the ideal sequence and the real input is modeled using skip transitions, which decrease the probability of tokens. The concurrency operation allows modeling behaviors that can have different action sequences.
The input for the Petri Net was provided by the Heuristic FastDTW matching algorithm presented in the previous subchapter. Using the hierarchical approach and concurrency we are able to synchronize the body part movements and to recognize simple basic actions.
5.3 Conclusions
This chapter presented the behavior recognition module of the system. The module contains the action recognition and the behavior recognition. To create an efficient behavior recognition module we studied algorithms from both areas and proposed new approaches to recognize actions and behaviors.
We proposed an action recognition approach which uses motion decomposition and the Heuristic FastDTW.
We proposed a motion decomposition to represent the body part motions efficiently and to simplify the task, together with a compression of the motion signal [203].
To recognize the human actions we proposed the Heuristic FastDTW method. This method uses the body part angular motions to identify the basic motions of the body parts [206].
We proved that the proposed Heuristic FastDTW method is suitable for classifying the human body part motions even when the data provided by the human detection and pose estimation component has a lower resolution [207, 211].
To recognize the human actions we used three neural networks: the Learning Vector Quantisation (LVQ), the Radial Basis Function (RBF) and the feed-forward multilayer network or MultiLayer Perceptron (MLP). Based on the comparison, the LVQ has the best performance in categorizing the human actions based on the recognized body part motions [202].
The second part of the chapter addressed behavior recognition, having as input the recognized actions and motions and as output the recognized behavior. For this purpose we used a description based approach: we proposed a Hierarchical Probabilistic Petri Net which uses concurrency and probability to increase the generality of the approach. By introducing concurrency, the method becomes capable of describing concurrent behaviors and handling uncertainty in the input data flow [197, 205].
Based on experiments, the hierarchical HPPN is able to take over the role of the neural network from the action recognition step, classifying the results of the Heuristic FastDTW and integrating them into the behavior recognition process. The HPPN is faster than the neural network and capable of handling missing data [197, 205].
6. Conclusions and future work
This chapter summarizes the thesis, highlighting my contributions to the field of human behavior recognition from video sequences. The final part presents my conclusions regarding this diversified and highly researched field of computer vision.
6.1 Summary of Results
This dissertation is composed of four parts.
Chapter 2 contains a survey of the current state of the art of human behavior recognition systems.
In this chapter we formulated the aim of the thesis and, based on the literature, we introduced a general framework for human behavior recognition.
The second part of the chapter presents a synthesis of the most important achievements in the field of human behavior recognition, covering all the needed components.
We took all the components of the system step by step and presented the most significant works for each. The first component is the preprocessing component; here we enumerated the most important background subtraction methods and optical flow approaches. The most important component of the system is the human detection and pose estimation module; here we presented the state of the art in single detection window approaches and component based approaches. Finally, we presented the most important techniques in the field of human activity and behavior recognition.
Chapter 3 presents the preprocessing components of the system. In this chapter we described our investigation into reducing the search space for the human detection module. For this purpose we studied the foreground detection algorithms: background subtraction and optical flow.
Our contributions to this field are the following:
Identification of the challenges of this component
Definition of the performance measurements suitable for comparing the different methods [210]
Implementation and comparison of 9 different foreground detection techniques [201]
A new foreground detection algorithm based on the Integral Image [210]
We introduced a list of features and a methodology to compare the foreground detection methods. The identified features are: detection precision, discriminative power, shadow removal capability, memory requirement, and computational power requirement.
We studied and compared the following methods:
Frame differencing,
Running average foreground,
Running Gaussian average,
Min-Max method,
Meanshift based foreground detection,
Gaussian mixture based foreground detection,
Eigenbackground based foreground detection,
Optical flow (Lucas-Kanade),
Integral Image based foreground detection (our method).
We implemented the above-mentioned methods and performed several tests using the performance measurement model we defined. We used indoor and outdoor image sequences to cover all challenging cases.
We elaborated a novel Integral Image based method to detect the foreground objects. We also implemented this method, performed all the tests, and compared its performance with the performance of the existing methods [210].
Based on this comparison we have shown that the proposed Integral Image based foreground detection method is suitable for foreground detection with static backgrounds: it is four times faster than the Gaussian mixture method with the same precision, and it has one of the best discriminative performances among the studied techniques [201, 210].
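The integral image underlying the method is the standard summed-area table, sketched below; the block-sum query is an illustrative use, not the full foreground detection algorithm of the thesis.

```python
import numpy as np

def integral_image(img):
    # Summed-area table: ii[y, x] is the sum of all pixels above and to the
    # left of (y, x), so any rectangular sum costs four lookups.
    return img.astype(np.float64).cumsum(axis=0).cumsum(axis=1)

def block_sum(ii, y0, x0, y1, x1):
    # Sum of img[y0:y1+1, x0:x1+1] from the integral image in O(1).
    total = ii[y1, x1]
    if y0 > 0:
        total -= ii[y0 - 1, x1]
    if x0 > 0:
        total -= ii[y1, x0 - 1]
    if y0 > 0 and x0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total
```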
Chapter 4 presents the human detection and pose estimation component, which is the most important part of the system. In this chapter we presented three different approaches for human detection and pose estimation, each representing a major research direction in this field.
The first method belongs to the "single window" category. We proposed a novel classifier, PETC, based on the work of Viola and Lienhart. We created a new classifier structure and training algorithm [199] to detect multi-view and multi-pose human bodies and to estimate their body configuration [198, 212].
We introduced novel background selection algorithms to achieve better results. Based on experiments, this algorithm speeds up the training procedure and simplifies the classifier structure [212].
We implemented and trained the Viola and Lienhart classifiers for comparison with PETC. We showed that PETC is faster than the other two classifiers; its correct detection rate is higher than Viola's classifier and comparable with the Lienhart classifier, while PETC produces fewer false positives [212].
The comparisons of the classifiers were performed on the INRIA database and on our own databases.
The next method is the chamfer matching method, a template based approach used to detect the people in the image and estimate their pose. Our contribution to this approach is a fast pseudo-parallel computation of the distance transform and a new, efficient technique to store the templates [209].
We proposed a novel pseudo-parallel approach to compute the distance transformation. Based on experiments this method is 25% faster than other methods [207, 201].
We also suggested a new technique to store the templates that allows us to find the most probable matching template very quickly [203, 201].
We proposed a framework (CHMS) to detect human presence in the image and estimate the pose. Using CHMS we obtained a five times faster detector with the same detection performance, fewer false positives and a better pose estimation rate [199].
We also performed several tests to investigate the performance of the chamfer matching in relation to image homogeneity [199].
Finally, the third method belongs to the component based approaches and uses the Pictorial Structure to detect the people in the image and to estimate their body configuration.
We implemented the Pictorial Structure based method (the Felzenszwalb algorithm) and tested its performance [197, 199, 206, 200, 208].
We proposed two new methods to increase the performance of the Pictorial Structure based method:
The first method uses motion information to modify the Pictorial Structure parameters and to speed up the recognition process [199, 200].
The second method uses tracking information to eliminate the ambiguity caused by occlusions or self-occlusions [199].
The experiments show that the methods we proposed are considerably better than the original framework: they are 20 times faster, have a 13% higher detection rate, 5-10% higher accuracy, and a considerably reduced false positive detection rate [199, 206, 200].
The proposed method uses motion and tracking information without reducing the framework's ability to operate on still images or in cases when this information is not available [199, 200].
The last part of the chapter compares the performance of the three methods using video sequences with different human sizes in the image. We proved that all of them have their uses. The first two methods are fast and work well at lower resolutions; the Pictorial Structure based method gives the best detection and pose estimation results, but works very slowly even with our speed-up adjustments [199, 204].
Chapter 5 presents the behavior recognition component of the system. The component
recognizes the human activity and behavior s. Our contributions to the field of activity
recognition are a new method for human motion representation and a n improved matching
algorithm. We also int roduced a Hierarchical HPPN which is suitable to recognize complex
behaviors.
We proposed a motion decomposition method to represent the body part motions more
efficiently and to reduce the dimensionality of the activity recognition task. We also proposed a
new motion time series compression method which compresses the motion time series
efficiently without losing the most valuable information they carry.
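The compression scheme itself is presented in Chapter 5; purely to illustrate the idea of shrinking a motion time series while keeping its coarse shape, a standard piecewise aggregate approximation can be written as follows (an illustrative sketch with names of our choosing, not the proposed method).

import numpy as np

def paa_compress(series, n_segments):
    # Piecewise aggregate approximation: split the series into nearly
    # equal chunks and keep each chunk's mean as the compressed sample.
    series = np.asarray(series, dtype=float)
    chunks = np.array_split(series, n_segments)
    return np.array([chunk.mean() for chunk in chunks])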
We proposed an improved version of DTW which shrinks the body part motions into
shorter time series, promoting a faster categorization of the motions [202, 203].
We showed that the proposed Heuristic FastDTW method is suitable for classifying the
human body part motions even when the data provided by the human detection and pose
estimation component has a lower resolution [211, 207].
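For reference, the standard DTW recurrence that such optimizations build on can be written down in a few lines. The sketch below is a minimal Python version for 1-D series and deliberately omits the shrinking step and the heuristics of the proposed Heuristic FastDTW.

import numpy as np

def dtw_distance(s, t):
    # Fill the (n+1) x (m+1) cumulative cost matrix of the classic
    # dynamic programming recurrence and return the alignment cost.
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

The full recurrence costs O(nm) time and memory, which is why shrinking the series beforehand pays off directly in classification speed.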
To recognize human actions we used three neural networks: Learning Vector
Quantisation (LVQ), Radial Basis Function (RBF), and feedforward multilayer (Multi-Layer
Perceptron, MLP) networks. Based on our comparison, LVQ performs best at categorizing
human actions from the recognized body part motions [210].
To recognize complex behaviors we propose a Hierarchical Concurrent Probabilistic Petri
Net. By introducing concurrency into the Probabilistic Petri Net we increase the generality of the
approach and make it suitable for describing concurrent actions and for handling uncertainty in the
input data flow [197, 205].
Based on experiments, the hierarchical HPPN is suitable to take over the role of the neural
network from the action recognition step and to be integrated into the behavior recognition,
speeding up the process and handling missing data [197, 205].
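To make the mechanism concrete, the sketch below shows a minimal probabilistic Petri net in Python; the class, the place names, and the probabilities are illustrative assumptions, not the HPPN implementation of the thesis. Tokens in places encode the state of an activity, and every firing multiplies the likelihood of the current behavior hypothesis by the transition probability, which is how uncertainty in the input data flow can be carried along.

class ProbabilisticPetriNet:
    def __init__(self, marking):
        self.marking = dict(marking)   # place name -> token count
        self.transitions = []          # (label, inputs, outputs, probability)

    def add_transition(self, label, inputs, outputs, prob):
        self.transitions.append((label, inputs, outputs, prob))

    def enabled(self):
        # A transition is enabled when all its input places hold a token.
        # Transitions with disjoint input places can be enabled together,
        # which is what models concurrent actions.
        return [t for t in self.transitions
                if all(self.marking.get(p, 0) > 0 for p in t[1])]

    def fire(self, transition):
        label, inputs, outputs, prob = transition
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] = self.marking.get(p, 0) + 1
        return prob

# Toy example: the likelihood of "approach then leave" accumulates
# over the firing sequence (0.8 * 0.6 = 0.48).
net = ProbabilisticPetriNet({"idle": 1})
net.add_transition("approach", ["idle"], ["near"], 0.8)
net.add_transition("leave", ["near"], ["idle"], 0.6)
likelihood = 1.0
for name in ("approach", "leave"):
    transition = next(t for t in net.enabled() if t[0] == name)
    likelihood *= net.fire(transition)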
The thesis is based on the following publications:
Journals B+
1. Tamás Vajda: Action Recognition Using DTW and Petri Nets, Studia Universitatis Babes-Bolyai Series Informatica, Volume LV, Number 2 (June 2010), pp. 69-78, ISSN: 1224-869x.
2. Tamás Vajda, Sergiu Nedevschi: Articulated Pose Estimation in Surveillence Videos, ACAM Scientific Journal, Vol. 20, No. 2, 2011, pp. 111-118, ISSN: 1221-437X.
3. Tamás Vajda: "Using Dynamic Time Warping Algorithm Optimization For Fast Human Action Recognition", Acta Technica Napocensis - Electronics and Telecommunication, Volume 51, Number 2/2010, pp. 32-37, ISSN 1221-6542.
ISI Proceedings
4. Tamás Vajda, Emőke Szatmári, Sergiu Nedevschi: Human Body Detection and Tracking in Video Sequences Using Chamfer Matching, IEEE 3rd International Conference on Intelligent Computer Communication and Processing, ICCP 2007, Sept. 6-8, 2007, Cluj-Napoca, pp. 141-146, ISBN: 978-1-4244-1491-8.
5. Tamás Vajda: Action Recognition Based on Fast Dynamic-Time Warping Method, IEEE 5th International Conference on Intelligent Computer Communication and Processing, ICCP 2009, Aug. 27-29, 2009, Cluj-Napoca, pp. 127-131, ISBN: 978-1-4244-5007-7.
Proceedings indexed in databases (IEEE Xplore, CPCI)
6. Tamás Vajda: Behavior Recognition Based on Dynamic Programming and Concurrence Probabilistic Petri Nets, IEEE 6th International Conference on Intelligent Computer Communication and Processing, ICCP 2010, Aug. 26-28, 2010, Cluj-Napoca, pp. 179-184, ISBN: 978-1-4244-8229-0.
7. Tamás Vajda: Behavior Recognition Using Pictorial Structures and DTW, 2010 IEEE International Conference on Automation, Quality and Testing, Robotics, May 28-29, 2010, Cluj-Napoca, Vol. 3, pp. 198-201.
8. Tamás Vajda and Lőrinc M.: General framework for human object detection and pose estimation in video sequences, 5th IEEE International Conference on Industrial Informatics, June 23-27, 2007, Vienna, pp. 467-472, ISSN: 1935-4576.
9. Tamás Vajda, Ábrám Zoltán: Pictorial Structure Based People Detection and Pose Estimation in Videos, International Conference on Intelligent Computer Communication and Processing, ICCP 2011, Aug. 25-27, 2011, Cluj-Napoca, pp. 315-318.
10. Tamás Vajda: Behavior Recognition Using Template Matching, The 4th edition of the Interdisciplinarity in Engineering International Conference, November 12-13, 2009, Tg. Mures, pp. 283-288, ISSN 1843-780X.
Other International Proceedings
11. Tamás Vajda, László Bakó, Sándor Tihamér Brassai: Using dynamic programming and Neural Network to Match Human Action, 11th International Carpathian Control Conference, ICCC 2010, May 26-29, 2010, Eger, Hungary, pp. 231-234.
12. Tamás Vajda, Sergiu Nedevschi: Fast Multi-View Human detection and attitude estimation, CSCS16 - The 16th International Conference on Control Systems and Computer Science, May 22-25, 2007, Bucuresti, Romania, Vol. 2, pp. 17-23.
13. Tamás Vajda: Moving object detection in video sequences using Integral Image, 20th International Conference on Computer Science and Education, October 2010, Satu Mare, Romania, pp. 225-228, ISSN 1842-4546.
14. Tamás Vajda: Hierarchical human behavior recognition, 8th International Conference on Computer Science and Energetics-Electrical Engineering, Sumuleu-Ciuc, Romania, October 2008, pp. 139-144, ISSN 1842-4546.
15. Tamás Vajda: Attitude detection methods usability in behavior recognition, 9th International Conference on Computer Science and Education, October 2009, Tg-Mures, Romania, pp. 139-144, ISSN 1842-4546.
16. Tamás Vajda: Human Body Detection and Tracking in Video Sequences Using Chamfer Matching, 7th International Conference on Computer Science and Education, October 2007, Oradea, Romania, pp. 54-58, ISSN 1842-4546.
Research contracts related to the thesis:
1. Sisteme bazate pe recunoașterea automată a numerelor de înmatriculare a autovehiculelor (systems based on automatic recognition of vehicle license plates); beneficiary: SC Napa-Impex SRL; 2004-2008.
2. Sistem informatic integrat, bazat pe inteligență artificială, pentru examinarea cererilor de brevet de invenție - EXAMBREV (integrated information system, based on artificial intelligence, for examining patent applications - EXAMBREV); beneficiary: CNCSIS, PN II, Programme 4 "Parteneriate in domeniile prioritare" (partnerships in priority domains), project no. 2859; 2007-2010.
3. Reconfigurable control of robotic systems over networks; funding: PN-II-RU-TE-2011-3-0005; 2011-2014.
6.2 Future research
Possible future directions and research areas:
- add multi-camera support to the human behavior detection algorithms;
- extend our current work to track patients in hospitals and to recognize certain behaviors and scenarios related to their diseases;
- integrate the chamfer matching method as the part detection stage of the Pictorial Structure based method, improving the prediction phase of the next model configuration;
- improve the Pictorial Structure based statistical model to handle uncertainties and occlusions more efficiently, using more sophisticated tracking and estimation algorithms;
- develop a training algorithm for the Petri net parameters using its inputs.
Bibliography
1. A.K. Jain, R.P.W. Duin, and J. Mao: “Statistical Pattern Recognition: A Review”, IEEE
Trans. Pattern Analysis and Machine Intelligence, Vol. 22, No. 1, pp. 4 -37, 2000.
2. Allen, J. F. and Ferguson, G.: "Actions and events in interval temporal logic", Journal of
Logic and Computation, Vol. 4, No 5, pp. 531 -579, 1994.
3. Allen, J. F.: "Maintaining knowledge about temporal intervals", Communications of the
ACM, Vol. 26, No. 11, pp. 832-843, 1983.
4. B. Galvin, B. McCane, K. Novins, D. Mason, and S. Mills: “Recovering motion fields: an
analysis of eight optical flow algorithms”, In Proc. 1998 British Machine Vision
Conference, Southampton, England, 1998.
5. B. Han, D. Comaniciu, and L. Davis: "Sequential kernel density approximation through
mode propagation: applications to background modeling", Proc. ACCV, Asian Conf. on
Computer Vision, 2004.
6. B. Leibe, E. Seemann, and B. Schiele: “Pedestrian Detection in Crowded Scenes”, Proc.
IEEE Int’l Conf. Computer Vision and Pattern Recognition, pp. 878 -885, 2005.
7. B. Leibe, N. Cornelis, K. Cornelis, and L.V. Gool: "Dynamic 3D Scene Analysis from a
Moving Vehicle”, Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition,
2007.
8. B. Stenger, A. Thayananthan, P. H. S. Torr, and R. Cipolla: “Model -based hand tracking
using a hierarchical bayesian filter”, IEEE Trans. on Pattern Analysis and Machine
Intelligence, Vol. 28, No. 9, pp. 1372 -1384, 2006.
9. B. Wu and R. Nevatia: “Detection and Tracking of Multiple, Partially Occluded Humans
by Bayesian Combination of Edgelet Based Part Detectors”, Int’l J. Computer Vision,
Vol. 75, No. 2, pp. 247 -266, 2007.
10. B.D. Lucas and T. Kanade: “An Iterative Image Registration Technique with an
Application to Stereo Vision”, DARPA Image Understanding Workshop, pp. 121 -130,
1981.
11. B.K.P. Horn and B.G. Schunck: “Determining Optical Flow”, Artificial Intelligence, Vol.
17, pp. 185 -204, 1981.
12. B.P.L. Lo and S.A. Velastin: “Automatic congestion detection system for underground
platforms", Proc. of Int. Symp. on Intell. Multimedia, Video and Speech Processing, pp.
158-161, 2001.
13. Barron, J.L., Fleet, D.J., Beauchemin, S.S., Burkitt, T.A.: “Performance of Optical Flow
Techniques”, Computer Vision and Pattern Recognition, Proceedings '92, IEEE
Computer Society Conference, pp. 236 – 242, 1992.
14. Barrow, H.G., Tenenbaum, J.M., Bolles, R.C. and Wolf, H.C.: "Parametric correspondence
and chamfer matching: Two new techniques for image matching”, Proc 5th Int. Joint
Conf Artificial Intelligence, Cambridge, 1977.
15. Baumberg: "Hierarchical Shape Fitting Using an Iterated Linear Filter", Proc. British
Machine Vision Conf., pp. 313 -323, 1996.
16. Biswas S., Sil J., Sengupta N.: “Background Modeling and Implementation using
Discrete Wavelet Transform: a Review", ICGST-GVIP, Vol. 11, Issue 1, pp. 29-42,
2011.
17. Blank, M., Gorelick, L., Shechtman, E., Irani, M., and Basri, R.: "Actions as space-time
shapes”, In IEEE International Conference on Computer Vision, pp. 1395 -1402, 2005.
18. Bobick, A. and Davis, J.: “The recognition of human movement using temporal
templates", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23,
No.3, pp. 257 -267, 2001.
19. Bobick, A. F. and Wilson, A. D.: “A state -based approach to the representation and
recognition of gesture”, IEEE Transactions on Pattern Analysis and Machine Intelligence,
Vol. 19, No. 12, pp. 1325 -1337, 1997.
20. Borgefors, G.: "Distance transformations in digital images", Computer Vision,
Graphics and Image Processing, Vol. 34, No. 3, pp. 344 –371, 1986.
21. Borzi, K. Ito, and K. Kunisch: "Optimal control formulation for determining optical
flow", SIAM Journal on Scientific Computing, Vol. 24, No. 3, pp. 818-847, 2002.
22. Bruhn, J. Weickert, and C. Schnorr: “Lucas/Kanade meets Horn/Schunck: Combining
local and global optic flow methods”, International Journal of Computer Vision, Vol. 61,
Issue 3, pp. 211 -231, 2005.
23. Bruhn, J. Weickert, C. Feddern, T. Kohlberger, and C. Schnorr: “Real -time optic flow
computation with variational methods”, In N. Petkov and M. A. Westenberg, editors,
Computer Analysis of Images and Patterns, Vol. 2756 of Lecture Notes in Computer
Science, pp. 222 -229. Springer, Berlin, 2003.
24. Butler D., Sridharan S.: “Real -Time Adaptive Background Segmentation”, ICASSP,
2003.
25. C. Papageorgiou and T. Poggio: “A Trainable System for Object Detection”, Int’l J.
Computer Vision, Vol. 38, pp. 15 -33, 2000.
26. C. Stauffer and W. E. L. Grimson, “Learning patterns of activity using real -time
tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no.
8, pp. 747 –757, 2000.
27. C. Stauffer, W.E.L. Grimson: "Adaptive background mixture models for real-time
tracking”, Proceedings IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, pp. 246 -252, 1999.
28. C. Stauffer, W.E.L. Grimson: "Learning patterns of activity using real-time tracking",
IEEE Trans. on Patt. Anal. and Machine Intell., Vol. 22, No. 8, pp. 747 -757, 2000.
29. C. Wren, A. Azabayejani, T. Darrell and A. Pentland: “Pfinder: Real -time tracking of the
human body”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19,
pp. 780 -785, 1997.
30. Campbell, L. W. and Bobick, A. F.: “Recognition of human body motion using phase
space constraints”, In IEEE International Conference on Computer Vision, pp. 624 -630,
1995.
31. Chang R., Gandhi T., Trivedi M.: "Vision modules for a multi sensory bridge monitoring
approach”, ITSC 2004, pp. 971 -976, 2004.
32. Chomat, O. and Crowley, J.: “Probabilistic recognition of activity using local
appearance”, IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2,
1999.
33. Cristani M., Farenzena M., Bloisi D., Murino V.: “Background Subtraction for
Automated Multisensor Surveillance: a Comprehensive Review”, EURASIP Journal on
Advances in Signal Processing, Vol. 2010, pages 24, 2010.
34. Culbrik D., Marques O., Socek D., Kalva H., Furht B.: “Neural network approach to
background modeling for video object segmentation”, IEEE Transaction on Neural
Networks, Vol. 18, No. 6, pp. 1614 –1627, 2007.
35. D. A. Forsyth, O. Arikan, L. Ikemoto, J. O’Brien, and D. Ramanan, “Computational
studies of human motion: part 1, tracking and motion synthesis,” Foundations and Trends
in Computer Graphics and Vision, vol. 1, no. 2 -3, pp. 77 –254, 2005
36. D. Huttenlocher, G. Klanderman, and W. Rucklidge: "Comparing images using the
Hausdorff distance", IEEE Transactions on Pattern Analysis and Machine Intelligence,
Vol. 15, No. 9, pp. 850-863, 1993.
37. D. Koller, J. Weber, T. Huang, J. Malik, G. Ogasawara, B. Rao, and S. Russel: “Towards
Robust Automatic Traffic Scene Analysis in Real -Time”, Proceedings of Int’l Conference
on Pattern Recognition, pp . 126 –131, 1994.
38. D. Terzopoulos: “Image analysis using multigrid relaxation”, IEEE Transactions on
Pattern Analysis and Machine Intelligence,Vol. 8, No. 2, pp. 129 -139, 1986.
39. D.G. Lowe: “Distinctive Image Features from Scale Invariant Keypoints”, Int’l J.
Computer Vision, Vol. 60, No. 2, pp. 91 -110, 2004.
40. D.M. Gavrila and J. Giebel: “Shape -based pedestrian detection and tracking”, IEEE
Intelligent Vehicle Symposium, pp. 8 -14, 2002.
41. D.M. Gavrila and S. Munder: “Multi -cue Pedestrian Detection and Tracking fr om a
Moving Vehicle”, International Journal of Computer Vision, Vol. 73, No. 1, pp. 41 -9,
2007.
42. D.M. Gavrila: “A Bayesian Exemplar -Based Approach to Hierarchical Shape Matching”,
IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 29, No. 8, pp. 14 08-1421,
2007.
43. Dai, P., Di, H., Dong, L., Tao, L., and Xu, G.: "Group interaction analysis in dynamic
context”, IEEE Transactions on Systems, Man, and Cybernetics, Part B, Vol. 38, No. 1,
pp. 275 -282, 2008.
44. Damen, D. and Hogg, D.: "Recognizing linked events: Searching the space of feasible
explanations”, IEEE Conference on Computer Vision and Pattern Recognition, 2009.
45. Darrell, T. and Pentland, A.: “Space -time gestures”, IEEE Conference on Computer
Vision and Pattern Recognition, pp. 335 -340, 1993.
46. Dollar, P., Rabaud, V., Cottrell, G., and Belongie, S.: "Behavior recognition via sparse
spatio -temporal features”, 2nd Joint IEEE International Workshop on Visual Surveillance
and Performance Evaluation of Tracking and Surveillance, pp. 65 -72, 2005.
47. E. Memin and P. Perez: "Dense estimation and object-based segmentation of the optical
flow with robust techniques”, IEEE Transactions on Image Processing, Vol. 7, No.5,pp.
703-719, 1998.
48. El Baf F., Bouwmans T., Vachon B.: "Fuzzy Integral for Moving Object Detection",
FUZZ -IEEE 2008, pp. 1729 -1736, Hong -Kong, China, 2008.
49. El Baf F., Bouwmans T., Vachon B.: "Type-2 fuzzy mixture of Gaussians model:
Application to background modeling”, ISVC 2008, pp. 772 -781, Las Vegas, USA, 2008.
50. Elgammal A., Harwood D., Davis L.: "Non-parametric Model for Background
Subtraction”, ECCV 2000, pp. 751 -767, Dublin, Ireland, 2000.
51. Elhabian S. Y., El -Sayed K. M., Ahmed S.: “Moving Object Detection in Spatial Domain
using Background Removal Techniques - State-of-Art", Recent Patents on Computer
Science, Vol. 1, No 1, pp. 32 – 54, 2008.
52. F. Heitz and P. Bouthemy: “Multimodal estimation of discontinuous optical flow using
Markov random fields”, IEEE Transactions on Pattern Analysis and Machine
Intelligence,Vol. 15, Issue 12, pp. 1217 -1232, 1993.
53. F. Porikli: “Automatic image segmentation by Wave Propagation”, Proceedings of
IS&T/SPIE Symposium on Electronic Imaging, San Jose, 2004
54. Freund, Y. and R. Schapire: "A short introduction to Boosting", Journal of Japanese
Society for Artificial Intelligence, Vol. 14, No. 5, pp. 771-780, 1999.
55. G. Halevy and D.Weinshall: “Motion of disturbances: detection and tracking of
multibody non -rigid motion”, Machine Vision and Applications, Vol. 11, pp. 122 -137,
1999.
56. G. P. Stein: "Tracking from multiple view points: Self-calibration of space and time",
Computer Vision and Pattern Recognition Fort Collins, pp. 521 -527, 1999.
57. G. Zini, A. Sarti, and C. Lamberti: “Application of continuum theory and multi -grid
methods to motion evaluation from 3D echocardiography”, IEEE Transactions on
Ultrasonics, Ferroelectrics, and Frequency Control,Vol. 44, No. 2, pp. 297 -308, 1997.
58. Borgefors, G.: "Hierarchical chamfer matching: A parametric edge matching
algorithm", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 10,
No. 6, pp. 849-865, 1988.
59. Borgefors, G.: "Improved version of chamfer matching algorithm", 7th Int. Conf.
Pattern Recognition, 1984.
60. Gavrila, D. and Davis, L.: “Towards 3 -D model -based tracking and recognition of human
movement”, International Workshop on Face and Gesture Recognition, pp. 272 -277,
1995.
61. Gavrila, D. and V. Philomin: “Real -time object detection for “smart” vehicles”,
International Conference on Computer Vision (ICCV99), pp 87 –93, 1999.
62. Gavrila, D. M.: “The visual analysis of human movement: A survey”, Computer Vision
and Image Understanding, Vol. 73,No. 1, pp. 82 -98, 1999.
63. Ghanem, N., DeMenthon, D., Doermann, D., and Davis, L.: “Representation and
recognition of events in surveillance video using Petri nets”, IEEE Conference on
Computer Vision and Pattern Recognition Workshop, 2004.
64. Gong, S. and Xiang, T.: “Recognition of group activities using dynamic probabilistic
networks”, IEEE International Conference on Computer Vision, p. 742, 2003.
65. Gupta, A., Srinivasan, P., Shi, J., and Davis, L. S.: “ Understanding videos, constructing
plots learning a visually grounded storyline model from annotated videos”, IEEE
Conference on Computer Vision and Pattern Recognition, 2009.
66. H. Shimizu and T. Poggio: "Direction Estimation of Pedestrian from Multiple Still
Images”, Proc. IEEE Intelligent Vehicles Symposium, pp. 596 -600, 2004.
67. H. Sidenbladh and M.J. Black: "Learning the Statistics of People in Images and Video",
Int’l J. Computer Vision, Vol. 54, Nos. 1 -3, pp. 183 -209, 2003.
68. H. Zhong, J. Shi, and M. Visontai, "Detecting unusual activity in video," Proceedings of
IEEE Conference on Computer Vision and Pattern Recognition, pp. 819 –826, 2004.
69. H.-H. Nagel: “Extending the 'oriented smoothness constraint' into the temporal domain
and the estimation of derivatives of optical flow”, In O. Faugeras, ed., Computer Vision
ECCV '90, Vol. 427 of Lecture Notes in Computer Science, pp. 139 -148. Springer,
Berlin, 1990.
70. Hakeem, A., Sheikh, Y., and Shah, M.: “CASEE: A hierarchical event representation for
the analysis of videos", Proceedings of the 20th National Conference on Artificial
Intelligence, pp. 263 -268, 2004.
71. Haritaoglu, D. Harwood and L. S. Davis: “Hydra: Multiple people detection and tracking
using silhouettes", Second IEEE Workshop on Visual Surveillance, Fort Collins, pp. 6-13,
1999.
72. Haritaoglu, D. Harwood and L. S. Davis: “W4: Who? when? where? what? a real time
system for detecting and tracking people”. Third Face and Gesture Recognition
Conference, pp. 222 -227. 1998.
73. Haritaoglu, R. Cutler, D. Harwood and L. S. Davis: "Backpack: Detection of people
carrying objects using silhouettes”. International Conference on Computer Vision, pp.
102-107, 1999.
74. Harris, C. and Stephens, M.: “A combined corner and edge detector”, Alvey Vision
Conference, pp. 147 -152, 1988.
75. I. Cohen: "Nonlinear variational method for optical flow computation", In Proc. Eighth
Scandinavian Conference on Image Analysis, Vol. 1, pp. 523 -530, Tromso, Norway,
1993.
76. I.P. Alonso et al: “Combination of Feature Extraction Methods for SVM Pedestrian
Detection”, IEEE Trans. Intelligent Transportation Systems, Vol. 8, No. 2, pp. 292 -307,
2007.
77. Intille, S. S. and Bobick, A. F.: “A framework for recognizing multi -agent action from
visual evidence”, AAAI/IAAI, pp. 518 -525, 1999.
78. Ivanov, Y. A. and Bobick, A. F.: "Recognition of visual activities and interactions by
stochastic parsing”, IEEE Transactions on Pattern Analysis and Machine Intelligence
Vol. 22, No. 8, pp. 852 -872, 2000.
79. J. B. Tenenbaum, V. D. Silva, and J. C. Langford, “A global geometric framework for
nonlinear dimensionality reduction.” Science, vol. 290, no. 5500, pp. 2319 –2323, 2000.
80. J. Heikkila and O. Silven: “A real -time system for monitoring of cyclists and
pedestrians”, Second IEEE Workshop on Visual Surveillance Fort Collins, pp. 74 -81,
1999.
81. J. L. Barron, D. J. Fleet, and S. S. Beauchemin: “Performance of optical flow
techniques”, International Journal of Computer Vision, Vol. 12, No.1, pp. 43 -77, 1994.
82. J. Weickert and C. Schnorr: “A theoretical framework for convex regularizes in PDE –
based computat ion of image motion”, International Journal of Computer Vision, Vol. 45,
Issue 3, pp. 245 -264, 2001.
83. J.D. Rymel, J.R. Renno, D. Greenhill, J. Orwell, G.A. Jones: "Adaptive Eigen-
Backgrounds for Object Detection", IEEE International Conference on Image Processing,
2004.
84. J.K. Aggarwal, M.S. Ryoo: "Human activity analysis: A review", ACM Computing
Surveys, Vol.43, No.3, pp.1 -43, 2011.
85. J.L. Barron and N.A. Thacker: “Tutorial: Computing 2D and 3D Optical Flow”, Tina
Memo No. 2004 -012, 2005.
86. J.Weickert and C. Schnorr: “Variational optic flow computation with a spatio -temporal
smoothness constraint”, Journal of Mathematical Imaging and Vision, Vol.14, No. 3, pp.
245-255, 2001.
87. Javed, O., Shafique, K.: "A hierarchical approach to robust background subtraction using
color and gradient information”, Proceedings of the IEEE Workshop Motion and Video
Computing, pp. 22 – 27, 2002.
88. Joo, S. -W. and Chellappa, R.: “Attribute grammar -based event recognition and anomaly
detection", IEEE Conference on Computer Vision and Pattern Recognition Workshop, p.
107, 2006.
89. K. Fukushima, S. Miyake, and T. Ito: “Neocognitron: A Neural Network Model for a
Mechanism of Visual Pattern Recognition”, IEEE Trans. Systems, Man, and Cybernetics,
Vol. 13, pp. 826 -834, 1983.
90. K. Mikolajczyk, C. Schmid, and A. Zisserman: "Human Detection Based on a
Probabilistic Assembly of Robust Part Detectors”, Proc. European Conf. Computer
Vision, pp. 69 -81, 2004.
91. K. Okuma, A. Taleghani, N. de Freitas, J. Little, and D. Lowe: “A Boosted Particle Filter:
Multitarget Detection and Tracking”, Proc. European Conf. Computer Vision, pp. 28 -39,
2004.
92. K. Toyama and A. Blake: “Probabilistic Tracking with Exemplars in a Metric Space”,
Int’l J. Computer Vision, Vol. 48, No. 1, pp. 9 -19, 2002.
93. K. Toyama, J. Krumm, B. Brumitt and B. Meyers: "Wallflower: Principles and practice of
background maintenance”, International Conference on Computer Vision, pp. 255 -261,
1999.
94. Ke, Y., Sukthankar, R., and Hebert, M.: “Spatio -temporal shape and flow correlation for
action recognition", IEEE Conference on Computer Vision and Pattern Recognition,
2007.
95. Kim K., Chalidabhongse T., Harwood D., Davis L.: “Real -time Foreground -Background
Segmentation using Codebook Model”, Real -Time Imaging, Vol.11, 2005.
96. L. Alvarez, J. Esclarín, M. Lefébure, and J. Sánchez: "A PDE model for computing the
optical flow”, In Proc. XVI Congreso de Ecuaciones Diferenciales y Aplicaciones, pp.
1349 -1356, Las Palmas de Gran Canaria, Spain, 1999.
97. L. Alvarez, J. Weickert, and J. Sanchez: "Reliable estimation of dense optical flow fields
with large displacements”, International Journal of Computer Vision, Vol. 39, Issue 1, pp.
41-56, 2000.
98. L. Fan, K. -K. Sung, and T. -K. Ng: “Pedestrian Registration in Static Images with
Unconstrained Background”, Pattern Recognition, Vol. 36, pp. 1019 -1029, 2003.
99. L. Zhang, B. Wu, and R. Nevatia: “Detection and Tracking of Multiple Humans with
Extensive Pose Articulation”, Proc. Int’l Conf. Computer Vision, 2007.
100. Laptev, I. and Lindeberg, T.: "Space-time interest points", IEEE International Conference
on Computer Vision, p. 432, 2003.
101. Laptev, I., Marszalek, M., Schmid, C., and Rozenfeld, B.: “Learning realistic human
actions from movies”, IEEE Conference on Computer Vision and Pattern Recognition,
2008.
102. Lee B., Hedley M.: “Background Estimation for Video Surveillance”, IVCNZ 2002,
Vol.1, pp. 315 -320, 2002.
103. Lienhart Rainer and Jochen Maydt: "An Extended Set of Haar-like Features for Rapid
Object Detection”, IEEE ICIP2002, Vol. 1, pp. 900 –903, 2002.
104. Lienhart Rainer, Alexander Kuranov, Vadim Pisarevsky: "Empirical Analysis of
Detection Cascades of Boosted Classifiers for Rapid Object Detection", DAGM'03, 25th
Pattern Recognition Symposium, Magdeburg, pp. 297-304, 2003.
105. Lienhart Rainer, Luhong Liang, and Alexander Kuranov: “A Detector Tree of Boosted
Classifiers for Real -time Object Detection and Tracking”, IEEE ICME2003, Vol. 2, pp.
277–280, 2003.
106. Liu, J., Luo, J., and Shah, M.: “Recognizing realistic actions from videos in the wild”,
IEEE Conference on Computer Vision and Pattern Recognition, 2009.
107. Lublinerman, R., Ozay, N., Zarpalas, D., and Camps, O.: “Activity recognition from
silhouettes using linear systems and model (in) validation techniques”, International
Conference on Pattern Recognition, pp. 347 -350, 2006.
108. Lv, F. and Nevatia, R.: “Single view human action recognition using key pose matching
and Viterbi path searching”, IEEE Conference on Computer Vision and Pattern
Recognition, 2007.
109. M. Andriluka, S. Roth, and B. Schiele: "People-tracking-by-detection and people-
detection-by-tracking", CVPR, pp. 1-8, 2008.
110. M. Andriluka, Stefan Roth, and Bernt Schiele: “Pictorial structures revisited: People
detection and articulated pose estimation”, In Proc. of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
111. M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding
and clustering,” Advances in Neural Information Processing Systems, pp. 585 –591, 2001.
112. M. Bergtholdt, D. Cremers, and C. Schnörr: "Variational Segmentation with Shape Priors",
Handbook of Math. Models in Computer Vision, N. Paragios, Y. Chen, and O. Faugeras,
eds., Springer, 2005.
113. M. Elgammal and C. S. Lee, “Inferring 3D body pose from silhouettes using activity
manifold learning," Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition, pp. 681 –688, 2004.
114. M. Enzweiler and D. M. Gavrila: “Monocular Pedestrian Detection: Survey and
Experiments”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 31,
No. 12, pp. 2179 -2195, 2009.
115. M. Enzweiler and D.M. Gavrila: "A Mixed Generative-Discriminative Framework for
Pedestrian Classification”, Proc. IEEE Int’l Conf. Computer Vision and Pattern
Recognition, 2008.
116. M. Drulea and S. Nedevschi, "Total variation regularization of local -global optical flow,"
in Intelligent Transportation Systems (ITSC), 2011 14th International IEEE Conference
on, 2011, pp. 318 -323
117. M. J. Black and P. Anandan: "Robust dynamic motion estimation over time", In Proc.
1991 IEEE Computer Society Conference on Computer Vision and Pattern R ecognition,
pp 292 -302, Maui, HI, IEEE Computer Society Press (1991)
118. M. Piccardi, T. Jan: “Mean -shift background image modeling”, Proc. of IEEE
International Conference on Image Processing, Singapore, 2004.
119. M. Seki, T. Wada, H. Fujiwara, K. Sumi: "Background detection based on the
cooccurrence of image variations”, Proc. of CVPR, Vol. 2, pp. 65 -72, 2003.
120. M. Szarvas, A. Yoshizawa, M. Yamamoto, and J. Ogata: “Pedestrian Detection with
Convolutional Neural Networks,” Proc. IEEE Intelligent Vehicles Symposium, pp. 223 –
228, 2005.
121. M.J. Jones and T. Poggio: “Multidimensional Morphable Models”, Proc. Int’l Conf.
Computer Vision, pp. 683 -688, 1998.
122. Maddalena L., Petrosino A.: “A self organizing approach to background subtraction for
visual surveillance applications”, IEEE Transactions on Image Processing, Vol.17, No. 7,
pp. 1729 –1736, 2008.
123. McFarlane N., Schofield C.: "Segmentation and tracking of piglets in images", BMVA
1995, pp. 187 -193, 1995.
124. Messelodi S., Modena C., Segata N., Zanin M.: "A Kalman filter based background
updating algorithm robust to sharp illumination changes”, ICIAP 2005, Vol. 3617, pp.
163-170, Cagliari, Italy, 2005.
125. Minnen, D., Essa, I. A., and Starner, T.: “Expectation grammars: Leveraging high -level
expectations for activity recognition”, IEEE Conference on Computer Vision and Pattern
Recognition, Vol. 2, pp. 626 -632, 2003.
126. Mohan, C. Papageorgiou, and T. Poggio: “Example -Based Object Detection in Images by
Components”, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 23, No. 4,
pp. 349-361, 2001.
127. Moore, D. J. and Essa, I. A.: “Recognizing multitasked activities from video using
stochastic context -free grammar”, AAAI/IAAI, pp. 770 -776, 2002.
128. N. Dalal and B. Triggs: “Histograms of Oriented Gradients for Human Detection”, Proc.
IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 886-893, 2005.
129. N. Dalal, B. Triggs, and C. Schmid: “Human Detection Using Oriented Histograms of
Flow and Appearance”, Proc. European Conf. Computer Vision, pp. 428 -441, 2006.
130. N. M. Oliver, B. Rosario, and A. P. Pentland: "A Bayesian Computer Vision System for
Modeling Human Interactions”, IEEE Trans. on Patt. Anal. and Machine Intell., Vol. 22,
No. 8, pp. 831 -843, 2000.
131. Nagel H.H.: “Displacement vectors derived from second -order intensity variations in
image sequences”, CGIP, Vol. 21, pp. 85 -117, 1983.
132. Nakajima, M. Pontil, B. Heisele, and T. Poggio: “Full -Body Recognition System”,
Pattern Recognition, Vol. 36, pp. 1997 -2006, 2003.
133. Nam, Y., Wohn, K., and Lee-Kwang, H.: "Modeling and recognition of hand gesture
using colored Petri nets”, IEEE Transactions on Systems, Man and Cybernetics, Vol. 29,
No. 5, pp. 514 -521, 1999.
134. Natarajan, P. and Nevatia, R.: “Coupled hidden semi Markov models for activity
recognition”, IEEE Workshop on Motion and Video Computing, 2007.
135. Nevatia, R., Hobbs, J., and Bolles, B.: “An ontology for video event representation”,
IEEE Conference on Computer Vision and Pattern Recognition Workshop, Vol. 7, 2004.
136. Nevatia, R., Zhao, T., and Hongeng, S.: "Hierarchical language-based representation of
events in video streams”, In IEEE Workshop on Event Mining, 2003.
137. Nguyen, N. T., Phung, D. Q., Venkatesh, S., and Bui, H. H.: “Learning and detecting
activities from movement trajectories using the hierarchical hidden Markov models”,
IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 955 -960,
2005.
138. Niebles, J. C., Wang, H., and Fei -Fei, L.: “Unsupervised learning of human action
categories using spatial -temporal words”, International Journal of Computer Vision, Vol.
79, No. 3, 2008.
139. Niyogi, S. and Adelson, E.: “Analyzing and recognizing walking figures in XYT”, IEEE
Conference on Computer Vision and Pattern Recognition, pp. 469 -474, 1994.
140. O. Tuzel, F. Porikli, and P. Meer: “Human Detection via Classification on Riemannian
Manifolds ”, Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2007.
141. Oliver, N. M., Rosario, B., and Pentland, A. P.: “A Bayesian computer vision system for
modeling human interactions”, IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. 22, No. 8, pp. 831-843, 2000.
142. P. F. Felzenszwalb and D. P. Huttenlocher: “Pictorial structures for object recognition”,
IJCV, Vol. 61, No. 1, pp. 55 –79, 2005.
143. P. F. Felzenszwalb, D. McAllester, and D. Ramanan: “A discriminatively trained,
multiscale, deformable part model", CVPR, pp. 1-8, 2008.
144. P. Sabzmeydani and G. Mori: “Detecting Pedestrians by Learning Shapelet Features”,
Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2007.
145. P. Viola, M. Jones, and D. Snow: "Detecting Pedestrians Using Patterns of Motion and
Appearance”, Int’l J. Computer Vision, Vol. 63, No. 2, pp. 153 -161, 2005.
146. Papageorgiou, C., M. Oren, and T. Poggio: "A general framework for object detection",
International Conference on Computer Vision, pp. 555 –562, 1998.
147. Park, S. and Aggarwal, J. K.: “A hierarchical Bayesian network for event recognition of
human actions and interactions”, Multimedia Systems, Vol.10, No. 2, pp. 164 -179, 2004.
148. Pedro F. Felzenszwalb, Daniel P. Huttenlocher: "Pictorial Structures for Object
Recognition”, Intl. Journal of Computer Vision, 2005.
149. Peter J. Carew, Larry Stapleton and Gabriel J. Byrne: “Implications of an ethic of privacy
for human -centered systems engineering”, AI & Society, Vol. 22, No 3, pp 385 -403,
2008
150. Pinhanez, C. S. and Bobick, A. F.: "Human action detection using PNF propagation of
temporal constraints”, IEEE Conference on Computer Vision and Pattern Recognition,
pp. 898, 1998.
151. Porikli, F., Kocak, T.: "Fast Distance Transform Computation Using Dual Scan Line
Propagation", SPIE Conference Real-Time Image Processing, 2007.
152. Prati, I. Mikic, M. M. Trivedi, and R. Cucchiara: “Detecting moving shadows: algorithms
and evaluation”, Pattern Analysis and Machine Intelligence, IEEE Transactions, Vol. 25,
No. 7, pp. 918 –923, 2003 .
153. Q. Zhu, S. Avidan, M. Yeh, and K. Cheng: “Fast Human Detection Using a Cascade of
Histograms of Oriented Gradients”, Proc. IEEE Int’l Conf. Computer Vision and Pattern
Recognition, 2006, pp. 1491 -1498.
154. R. Brehar, S. Nedevschi, "A comparative study of pedestrian detection methods using
classical Haar and HoG features versus bag of words model computed from Haar and
HoG features", Proceedings of 2011 IEEE Intelligent Computer Communication and
Processing , Cluj -Napoca, August 25 -27, 2011, pp. 299 -306.
155. R. Borca, S. Nedevschi: "Correlation Between Features and Classifiers for Semantic
Understanding of Pedestrian Attitudes in Traffic Scenes", in Proceedings of 2009 IEEE
Intelligent Computer Communication and Processing, Cluj-Napoca, August 27-29, 2009,
pp. 149-152.
156. R. Cucchiara, C. Grana, M. Piccardi, and A. Prati: “Detecting moving objects, ghosts and
shadows in video streams”, IEEE Trans. on Patt. Anal. and Machine Intell., Vol. 25, No.
10, pp. 1337 -1342, 2003.
157. R. Cutler and L. Davis: "View-based detection and analysis of periodic motion",
International Conference on Pattern Recognition Brisbane, pp. 495 -500, 1998.
158. R. Pless, “Image spaces and video trajectories: Using isomap to explore video
sequences," Proceedings of IEEE International Conference on Computer Vision, pp.
1433-1440, 2003.
159. Ramanan, D.; Forsyth, D. A.; Zisserman, A.: “Tracking People by Learning Their
Appearance”, Pattern Analysis and Machine Intelligence, IEEE Transactions on, Vol. 29,
pp. 65 -81, 2007.
160. Rao, C. and Shah, M.: “View -invariance in action r ecognition”, IEEE Conference on
Computer Vision and Pattern Recognition, Vol. 2, pp. 316 -322, 2001.
161. Rapantzikos, K., Avrithis, Y., and Kollias, S.: “Dense saliency -based spatiotemporal
feature points for action recognition”, IEEE Conference on Computer Vis ion and Pattern
Recognition, 2009.
162. Rodriguez, M. D., Ahmed, J., and Shah, M.: “Action MACH: A spatio -temporal
maximum average correlation height filter for action recognition”, IEEE Conference on
Computer Vision and Pattern Recognition, 2008.
163. Ryoo, M. S. and Aggarwal, J. K.: “Recognition of composite human activities through
context -free grammar based representation”, IEEE Conference on Computer Vision and
Pattern Recognition, pp. 1709 -1718, 2006.
164. Ryoo, M. S. and Aggarwal, J. K.: “Semantic understanding of continued and recursive
human activities”, International Conference on Pattern Recognition, pp. 379 -382, 2006.
165. S. Agarwal, A. Awan, and D. Roth: “Learning to Detect Objects in Images via a Sparse,
Part-Based Representation”, IEEE Trans. Pattern Analysis a nd Machine Intelligence, Vol.
26, No. 11, pp. 1475 -1490, 2004.
166. S. Ghosal and P. C. Vanek: “Scalable algorithm for discontinuous optical flow
estimation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18,
No. 2, pp. 181 -194, 1996.
167. S. Lefkovits: “Performance analysis of face detection systems based on haar features”, In
Complexity and Intelligence of the Artificial and Neural Complex Systems, Vol. 1, pp.
184–192, 2008.
168. S. Munder and D.M. Gavrila: "An Experimental Study on Pedestrian Classification",
IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 28, No. 11, pp. 1863 -1868,
2006.
169. S. Munder, C. Schnörr, and D.M. Gavrila: "Pedestrian Detection and Tracking Using a
Mixture of View -Based Shape -Texture Models”, IEEE Trans. Intelli gent Transportation
Systems, Vol. 9, No. 2, pp. 333 -343, 2008.
170. S. Nedevschi, S. Bota, C. Tomiuc, “Stereo -Based Pedestrian Detection for Collision –
Avoidance Applications”, in IEEE Transactions on Intelligent Transportation Systems ,
vol. 10, no. 3, 2009, pp. 380-391
171. S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear
embedding,” Science, vol. 290, no. 5500, pp. 2323 –2326, 2000.
172. Savarese, S., DelPozo, A., Niebles, J., and Fei -Fei, L.: “Spatial -temporal correlations for
unsupervised action classification", IEEE Workshop on Motion and Video Computing,
2008.
173. Schnorr: “Unique reconstruction of piecewise smooth images by minimizing strictly
convex non -quadratic functional”, Journal of Mathematical Imaging and Vision, Vol. 4,
pp. 189 -198, 1994.
174. Schuldt, C., Laptev, I., and Caputo, B.: “Recognizing human actions: A local SVM
approach”, International Conference on Pattern Recognition, Vol. 3, pp. 32 -36, 2004.
175. Scovanner, P., Ali, S., and Shah, M.: "A 3-dimensional sift descriptor and its application
to action recognition”, ACM International Conference on Multimedia, pp. 357 -360, 2007.
176. Seemann, M. Fritz, and B. Schiele: “Towards Robust Pedestrian Detection in Crowded
Image Sequences", Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition,
2007.
177. Shan Ying; Feng Han; Sawhney, H.S.; Kumar, R.: “Learning Exemplar -Based
Categorization for the Detection of Multi -View Multi -Pose Objects”, Computer Vision
and Pattern Recognition, IEEE Computer Society Conference, Vol. 2, pp. 1431 –1438,
2006.
178. Shashua, Y. Gdalyahu, and G. Hayon: “Pedestrian Detection for Driving Assistance
Systems: Single -Frame Classification and System Level Performance”, Proc. IEEE
Intelligent Vehicles Symp., pp. 1 -6, 2004.
179. Shechtman, E. and Irani, M.: "Space-time behavior based correlation", IEEE Conference
on Computer Vision and Pattern Recognition, Vol. 1, pp. 405 -412, 2005.
180. Sheikh, Y., Sheikh, M., and Shah, M.: “Exploring the space of a human action”, IEEE
International Conference on Computer Vision, Vol. 1, pp. 144-149, 2005.
181. Shi, Y., Huang, Y., Minnen, D., Bobick, A. F., and Essa, I. A.: “Propagation networks for
recognition of partially ordered sequential action”, IEEE Conference on Computer Vision
and Pattern Recognition, Vol. 2, pp. 862 -869, 2004.
182. Sigari M., Mozayani N., Pourreza H.: "Fuzzy Running Average and Fuzzy Background
Subtraction: Concepts and Application”, International Journal of Computer Science and
Network Security, Vol. 8, No. 2, pp. 138 -143, 2008.
183. Simoncelli E.P., Adelson E.H. and Heeger D.J.: “Probability distributions of optical
flow”, IEEE Proc. of CVPR, pp. 310 -315, 1991.
184. Simoncelli E.P.: “Distributed Representation and Analysis of Visual Motion”, PhD
Dissertation, Dept. of Electrical Engineering and Computer Science MIT, 1993.
185. Siskind, J. M.: "Grounding the lexical semantics of verbs in visual perception using force
dynamics and event logic”, Journal of Artificial Intelligence Research, Vol. 15, pp. 31 –
90, 2001.
186. Sivabalakrishnan M., Manjula D.: "Adaptive Background subtraction in Dynamic
Environments Using Fuzzy Logic", International Journal on Computer Science and
Engineering, Vol. 02, No. 2, pp. 270 – 273, 2010.
187. Starner, T. and Pentland, A.: “Real -time American Sign Language recognition from video
using hidden Markov models”, International Symposium on Computer Vision, pp. 265 –
270, 1995.
188. Stauffer C, Grimson W. E. L.: “Adaptive background mixture models for real -time
tracking.”, Proceedings IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (Cat. No. PR00149), IEEE Computer Society, Vol. 2, 1999.
189. Stenger, A. Thayananthan, P.H.S. Torr, and R. Cipolla: “Model -Based Hand Tracking
Using a Hierarchical Bayesian Filter”, IEEE Trans. Pattern Analysis and Machine
Intelligence, Vol. 28, No. 9, pp. 1372 -1385, 2006.
190. T. Bouwmans, F. El Baf, B. Vachon: “Statistical Background Modeling for Foreground
Detection: A Survey”, Handbook of Pattern Recognition and Computer Vision, World
Scientific Publishing, Vol. 4, Part 2, pp. 181 -199, 2010.
191. T. Bouwmans: "Recent Advanced Statistical Background Modeling for Foreground
Detection: A Systematic Survey", Recent Patents on Computer Science, Vol. 4, No. 3, pp.
147-176, 2011.
192. T. E. Boult, R. Micheals, X. Gao, P. Lewis, C. Power, W. Yin and A. Erkan: “Frame rate
omnidirectional surveillance and tracking of camouflaged and occluded targets", Second
IEEE Workshop on Visual Surveillance Fort Collins, Colorado, pp. 48 -55, 1999.
193. T. Heap and D. Hogg: “Improving Specificity in PDMs Using a Hierarchical Approach”,
Proc. British Machine Vision Conf., pp. 80 -89, 1997.
194. T. Heap and D. Hogg: “Wormholes in Shape Space: Tracking through Discontinuous
Changes in Shape”, Proc. Int’l Conf. Computer Vision, pp. 344 -349, 1998.
195. T.F. Cootes and C.J. Taylor: “Statistical Models of Appearance for Computer Vision”,
Technical report, Univ. of Manchester, 2004.
196. T.F. Cootes, S. Marsland, C.J. Twining, K. Smith, and C.J. Taylor: “Groupwise
Diffeomorphic Non -Rigid Registration for Automatic Model Building”, Proc. European
Conf. Computer Vision, pp. 316 -327, 2004.
197. Tamás Vajda: Action Recognition Using DTW and Petri Nets, Studia Universitatis
Babes -Bolyai Series Informatica, Volume LV, Number 2 (June 2010), pp 69 -78, ISSN:
1224 -869x.
198. Tamás Vajda and Lőrinc M.: General framework for human object detection and pose
estimation in video sequences, In 5th IEEE International Conference on Industrial
Informatics, June 23-27, 2007, Vienna, pp. 467-472, ISSN: 1935-4576.
199. Tamás Vajda, Sergiu Nedevschi: Articulated Pose Estimation in Surveillence Videos,
ACAM Scientific Journal, Vol. 20 no.2, 2011, pp. 111 -118, ISSN:1221 -437X
200. Tamás Vajda, Ábrám Zoltán – Pictorial Structure Based People Detection and Pose
Estimation in Videos. International Conference on Intelligent Computer Communication
and Processing, ICCP 2011, Aug. 25 -27, 2011, Cluj -Napoca, pp. 315 – 318.
201. Tamás Vajda, Emőke Szatmári, Sergiu Nedevschi: Human Body Detection and Tracking
in Video Sequences Using Chamfer Matching, IEEE 3rd International Conference on
Intelligent Computer Communication and Processing, ICCP 2007, Sept. 6-8, 2007, Cluj-
Napoca, pp. 141-146, ISBN: 978-1-4244-1491-8.
202. Tamás Vajda, László Bakó, Sándor Tihamér Brassai: Using dynamic programming and
Neural Network to Match Human Action, 11th International Carpathian Control
Conference ICCC 2010, May 26 -29, 2010, Eger, Hungary, pp 231 -234.
203. Tamás Vajda: Action Recognition Based on Fast Dynamic-Time Warping Method, IEEE
5th International Conference on Intelligent Computer Communication and Processing,
ICCP 2009, Aug 27 -29, 2009, Cluj -Napoca, pp. 127 – 131, ISBN: 978 -1-4244 -5007 -7
204. Tamás Vajda: Attitude detection methods usability in behavior recognition, 9th
International Conference on Computer Science and Education, October 2009, Tg -Mures,
Romania, pp. 139 -144, ISSN 1842 -4546
205. Tamás Vajda: Behavior Recognition Based on Dynamic Programming and Concurrence
Probabilistic Petri Nets IEEE 6th International Conference on Intelligent Computer
Communication and Processing, ICCP 2010, Aug 26 -28, 2010, Cluj -Napoca, pp. 179 –
184, ISBN: 978 -1-4244 -8229 -0
206. Tamás Vajda: Behavior Recognition Using Pictorial Structures and DTW, 2010 IEEE
International Conference on Automation, Quality and Testing, Robotics, Mai 28 -29,
2010, Cluj -Napoca, vol3, pp 198 -201
207. Tamás Vajda, Behavior Recognition Using Template Matching, The 4th edition of the
Interdisciplinarity in Engineering International Conference, November 12 -13, 2009, Tg.-
Mures, pp. 283 -288, ISSN 1843 -780X
208. Tamás Vajda: Hierarchical human behavior recognition, 8th International Conference on
Computer Science and Energetics -Electrical Engineering, Sumuleu -Ciuc, October 2008 ,
Romania, pp. 139 -144, ISSN 1842 -4546
209. Tamás Vajda : Human Body Detection and Tracking in Video Sequences Using Chamfer
Matching, 7th International Conference on Computer Science and Education, October
2007, Oradea, Romania, pp. 54 -58, ISSN 1842 -4546.
210. Tamás Vajda: Moving object detection in video sequences using Integral Image, 20th
International Conference on Computer Science and Education, October 2010, Satu Mare,
Romania, pp. 225 -228, ISSN 1842 -4546
211. Tamás Vajda, "Using Dynamic Time Warping Algorithm Optimization For Fast Human
Action Recognition", Acta Technica Napocensis - Electronics and Telecommunication,
Volume 51, Number 2/2010 pp.32 -37, ISSN 1221 -6542
212. Vajda Tamás, Sergiu Nedevschi : Fast Multi -View Human detection and attitude
estimation , CSCS16 – The 16th International Conference on Control Systems and
Computer Science, May 22-25, 2007, Bucuresti, Romania, Vol. 2, pp. 17-23.
213. Toyama K., Krumm J. Brumitt B., Meyers B.: “Wallflower: Principles and Practice of
Background Maintenance”, International Conference on Computer Vision, Corfu,
Greece, 1999, pp. 255 -261.
214. V. Ferrari, M. Marin, and A. Zisserman: “Progressive search space reduction for human
pose estimation”, CVPR 2008.
215. V.D. Shet, J. Neumann, V. Ramesh, and L.S. Davis: “Bilattice – Based Logical Reasoning
for Human Detection”, Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition,
2007.
216. V.N. Vapnik: “The Nature of Statistical Learning Theory”. Springer, 1995.
217. Veeraraghavan, A., Chellappa, R., and Roy-Chowdhury, A.: "The function space of an
activity”, IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp.
959-968, 2006.
218. Viola, P. and M. Jones: “Fast multi -view face detection”, In Merl Technical Report
TR2003-96, 2003.
219. Viola, P. and M. Jones: “Rapid object detection using a boosted cascade of simple
features”, IEEE CVPR, pp. 511 –518, 2001.
220. Vu, V. -T., Bremond, F., and Thonnat, M.: “Automatic video interpretation: A novel
algorithm for temporal scenario recognition”, International Joint Conference on Artificial
Intelligence, p p. 1295 -1302, 2003.
221. W. E. L. Grimson, C. Stauffer, R. Romano and L. Lee: "Using adaptive tracking to
classify and monitor activities in a site”. Computer Vision and Pattern Recognition, pp. 1 –
8, 1998.
222. W. Hinterberger, O. Scherzer, C. Schnorr, and J. Weickert: "Analysis of optical flow
models in the framework of calculus of variations”, Numerical Functional Analysis and
Optimization, Vol. 23, Nos. 1 -2, pp. 69 -89, 2002.
223. Webb, J. A. and Aggarwal, J. K.: “Structure from motion of rigid and jointed objects”,
Artificial Intelligence, Vol.19, pp. 107 -130, 1982.
224. Wöhler and J. Anlauf: "An Adaptable Time-Delay Neural-Network Algorithm for
Image Sequence Analysis”, IEEE Trans. Neural Networks, Vol. 10, No. 6, pp. 1531 –
1536, 1999.
225. Wong, S. -F., Kim, T. -K., and Cipolla, R.: “Learning motion categories using both
semantic and structural information”, IEEE Conference on Computer Vision and Pattern
Recognition, 2007.
226. Wren C., Azarbayejani A., Darrell T., Pentland A.: “Pfinder: Real -Time Tracking of the
Human Body", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.
19, No. 7, pp. 780 -785, 1997.
227. X. Ren, A. C. Berg, and J. Malik: “Recovering human body configurations using pairwise
constraints between parts”, ICCV 2005
228. Y. Freund and R. Schapire: “A decision -theoretic generalization of on -line learning and
an application to boosting”, Journal of Computer and System Sciences, Vol. 55, No. 1,
pp. 119 –139, 1997.
229. Y. Ivanov, C. Stauffer, A. Bobick and W. E. L. Grimson: "Video surveillance of
interactions”, Second IEEE Workshop on Visual Surveillance Fort Collins, pp. 82 -90,
1999.
230. Y. Rui, T. S. Huang, and S. F. Chang, “Image retrieval: current techniques, promising
directions and open issues,” Journal of Visual Communication and Image Representation,
vol. 10, no. 4, pp. 39–62, 1999
231. Y. Yacoob and M. J. Black, “Parameterized modeling and recognition of activities,”
Computer Vision and Image Understanding, vol. 73, no. 2, pp. 232 –247, 1999.
232. Yacoob, Y. and Black, M.: "Parameterized modeling and recognition of activities", IEEE
International Conference on Computer Vision, pp. 120 -127. 1998.
233. Yamato, J., Ohya, J., and Ishii, K.: “Recognizing human action in time -sequential images
using hidden Markov model”, IEEE Conference on Computer Vision and Pattern
Recognition, pp. 379 -385, 1992.
234. Yilmaz, A. and Shah, M.: "Actions sketch: a novel action representation", IEEE
Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 984 -989, 2005.
235. Yilmaz, A. and Shah, M.: “Recognizing human actions in videos acquired by
uncalibrated moving cameras", IEEE International Conference on Computer Vision,
2005.
236. Yu, E. and Aggarwal, J. K.: “Detection of fence climbing from monocular video”,
International Conference on Pattern Recognition, pp. 375 -378, 2006.
237. Z. Tu, X. Chen, A. L. Yuille, and S. -C. Zhu: “Image parsing: Unifying segmentation,
detection, and recognition”, IJCV, Vol. 63, No. 2, pp. 113 –140, 2005.
238. Zaidi, A. K.: “On temporal logic programming using Petri nets”, IEEE Transactions on
Systems, Man and Cybernetics, Vol 29, No. 3, pp. 245 -254. 1999.
239. Zelnik -Manor, L. and Irani, M.: “Event -based analysis of video”, IEEE Conference on
Computer Vision and Pattern Recognition, 2001.
240. Zhang H., Xu D.: “Fusing Color and Texture Features for Background Model”,
International Conference on Fuzzy Systems and Knowledge Discovery, Vol. 4223, No. 7,
pp. 887 -893, 2006.
241. Zhang, D., Gatica -Perez, D., Bengio, S., and McCowan, I.: “Modeling individual and
group actions in meetings with layered HMMs”, IEEE Transactions on Multimedia, Vol.
8, No. 3, pp. 509-520, 2006.
242. Zheng J., Wang Y., Nihan N., Hallenbeck, E.: “Extracting Roadway Background Image:
a Mode Based Approach”, Journal of Transportation Research Report, No. 1944, pp. 82 –
88, 2006.
List of figures
Figure 1. General human behavior recognition system ..... 8
Figure 2. Relation between activity and behavior recognition ..... 13
Figure 3. General framework for background segmentation ..... 24
Figure 4. The Integral Image ..... 30
Figure 5. Rotated Integral Image ..... 30
Figure 6. Rectangular features. The AR represents the area of the SumI feature ..... 31
Figure 7. The rotated RSumI feature ..... 31
Figure 8. Test examples for frame differencing in outdoor and indoor environments: a) actual image, b) background image, c) foreground mask ..... 34
Figure 9. Test examples for running average foreground detection in outdoor and indoor environments: a) actual image, b) background image, c) foreground mask ..... 35
Figure 10. Test examples for running Gaussian average foreground detection in outdoor and indoor environments: a) actual image, b) background image, c) foreground mask ..... 36
Figure 11. Test examples for the Min-Max method in outdoor and indoor environments: a) actual image, b) background image, c) foreground mask ..... 37
Figure 12. Test examples for Gaussian mixture based foreground detection in outdoor and indoor environments: a) actual image, b) background image, c) foreground mask ..... 38
Figure 13. Test examples for eigenbackground based foreground detection in outdoor and indoor environments: a) actual image, b) background image, c) foreground mask ..... 39
Figure 14. Test examples for integral image based foreground detection in outdoor and indoor environments: a) actual image, b) background image, c) foreground mask ..... 40
Figure 15. The effect of quick illumination changes ..... 42
Figure 16. Results of the optical flow algorithm ..... 44
Figure 17. Edge, line, and center-surround Haar features. Left: basic set; right: extended set [103] ..... 49
Figure 18. Cascade classifier ..... 51
Figure 19. Tree classifier for detection and pose estimation ..... 52
Figure 20. Relation between input pattern size, performance and processing time ..... 53
Figure 21. Examples from the training set (positive images) ..... 55
Figure 22. The ROC curve for the three classifiers ..... 60
Figure 23. Processed images, our database: a) cascade classifier, b) Lienhart's tree, c) PETC ..... 61
Figure 24. Processed images, INRIA database: a) cascade classifier, b) Lienhart's tree, c) PETC ..... 62
Figure 25. Matching process ..... 64
Figure 26. General masks ..... 66
Figure 27. Forward and backward neighborhoods ..... 66
Figure 28. Back-and-forth scanning in one direction [151] ..... 69
Figure 29. Multiple scanning directions: either the scan direction is changed or the data space is rotated [151] ..... 69
Figure 30. Fast wave-propagation method [151] ..... 70
Figure 31. Human templates ..... 73
Figure 32. Template splitting regions ..... 73
Figure 33. Example of transition criteria parameter ..... 74
Figure 34. Human detection system ..... 75
Figure 35. System output image ..... 76
Figure 36. Chamfer matching performance related to the image homogeneity ..... 77
Figure 37. Connections ..... 82
Figure 38. The training set ..... 83
Figure 39. Learning process ..... 84
Figure 40. The learnt pictorial structures (frontal and side view) ..... 84
Figure 41. Framework implementation diagram ..... 88
Figure 42. Output of the system: left, Andriluka's method; right, ours ..... 88
Figure 43. The responses of the systems with and without time constraint optimization ..... 90
Figure 44. The speed of the system for the two kinds of optimization ..... 90
Figure 45. Performance of the pose estimation ..... 91
Figure 46. Difference in performance of the pose estimation ..... 91
Figure 47. Output of the system with frontal and lateral body models ..... 92
Figure 48. Output of the system with search space reduction, and with search space reduction plus configuration reduction ..... 92
Figure 49. Output of the systems: a) OPSF, b) original framework ..... 93
Figure 50. Human detection system performance for different human sizes ..... 95
Figure 51. Relative motion of the upper leg relative to the torso ..... 101
Figure 52. Full resolution time series of waving, upper arm ..... 102
Figure 53. The original and the shrunk time series of waving, upper arm ..... 103
Figure 54. The coarse and the full resolution cost matrix with warping path ..... 103
Figure 55. Training chart of the RBF NN ..... 104
Figure 56. Training chart of the LVQ NN ..... 105
Figure 57. Output of the action recognition system ..... 106
Figure 58. Transitions and places ..... 108
Figure 59. Concurrency ..... 109
Figure 60. Synchronization ..... 109
Figure 61. Recognizing simple activities ..... 110
Figure 62. Activity recognition using HPPN ..... 110
Figure 63. HPPN for a single motion pattern ..... 111
List of Tables
Table 1. The comparison of foreground detection techniques with the performance: speed, detection precision, discriminative power ..... 41
Table 2. Memory requirement categorization ..... 41
Table 3. Performance parameters on our database ..... 61
Table 4. Performance parameters on the INRIA database ..... 62
Table 5. Distance transform precision ..... 71
Table 6. Performance parameters for template matching on our database ..... 76
Table 7. Performance parameters on our database ..... 93
Table 8. NN classification results ..... 105
Table 9. Comparative results of the experiment using the three types of pose estimator ..... 106
Appendix
1. Tamás Vajda: Action Recognition Using DTW and Petri Nets, Studia Universitatis Babes-Bolyai Series Informatica, Volume LV, Number 2 (June 2010), pp. 69-78, ISSN: 1224-869x.
2. Tamás Vajda, Sergiu Nedevschi: Articulated Pose Estimation in Surveillence Videos, ACAM Scientific Journal, Vol. 20, No. 2, 2011, pp. 111-118, ISSN: 1221-437X.