AUTOMATION AND COMPUTER SCIENCE FACULTY

eng. VAJDA TAMÁS

PHD THESIS

HUMAN BEHAVIOR RECOGNITION IN VIDEO SEQUENCES

Advisor: Prof. dr. eng. SERGIU NEDEVSCHI

AUTOMATION AND COMPUTER SCIENCE FACULTY

eng. VAJDA TAMÁS

Advisor: Prof. dr. eng. SERGIU NEDEVSCHI

2013

AUTOMATION AND COMPUTER SCIENCE FACULTY

eng. Vajda Tamás

Advisor: Prof. dr. eng. Sergiu Nedevschi

Doctoral thesis evaluation committee:

CHAIR: Prof. dr. eng. [anonymized], Cluj-Napoca

MEMBERS: Prof. dr. eng. [anonymized], [anonymized]; Prof. dr. eng. Ștefan-[anonymized], Ștefan cel Mare University of Suceava; Prof. dr. eng. Vladimir-[anonymized], Politehnica University of Timișoara; Prof. dr. eng. [anonymized], [anonymized].

Acknowledgements

I owe a great debt of gratitude to my advisor, Prof. dr. eng. Sergiu Nedevschi, for his patience and generosity. It has been a pleasure to have known him and I consider myself lucky for having been his student.

Thanks to my colleagues and collaborators. Thanks to my family for their support all these years.

To my wife

Abbreviations

2D – two dimensional

3D – three dimensional

4D – four dimensional

CFG – context-free grammar

CHMM – coupled Hidden Markov model

CHSMM – coupled hidden semi-Markov model

DBN – dynamic Bayesian network

DPN – dynamic probabilistic network

DTW – Dynamic Time Warping

EM – expectation-maximization

HCI – human-computer interaction

HMM – Hidden Markov model

HOG – histogram of oriented gradients

IA-network – interval algebra constraint network

ISM – implicit shape model

KDE – Kernel Density Estimation

LHMM – layered Hidden Markov model

LTI – linear time invariant

MAP – maximum a posteriori probability

MCMC – Markov chain Monte Carlo

MEI – motion-energy image

MHI – motion-history image

MLE – maximum likelihood estimation

PCA – principal component analysis

PDF – probability density function

PLSA – probabilistic latent semantic analysis

P-net – propagation network

PNF – Past, Now, Future

SCFG – stochastic context-free grammar

SIFT – scale-invariant feature transform

SSD – sum-of-squared difference

STR – spatio-temporal relationship

SVM – Support Vector Machine

XYT – coordinate system with space axes X, Y (image plane) and a time axis

XYZT – coordinate system with space axes X, Y, Z (3D) and a time axis

Introduction

"Privacy remains one of the ethically imperative issues of the information age" [84], and it will remain so because of human nature. We desire safety (physical and spiritual), which sometimes can be achieved only by violating other humans' safety and/or privacy. Video surveillance systems are a typical example: while they are deployed to increase our safety, they intrude on others' privacy. A human operator cannot watch every camera all the time, so there is a need for automatic systems which would fulfill the human operator's job better and faster. This problem is difficult. Firstly, the number of people has to be determined, if there are any; then their location and pose need to be estimated (body and limb configuration). Detecting people and estimating their pose is challenging, because people move fast, wear different clothes and appear in different poses.

This thesis addresses a number of key issues that are needed to build an automatic system, which understands human behaviors through videos and images.

During the last years, many researchers have intensively investigated this topic. Every solution given by the research community has solved only a particular part of the problem, and most of them were not accurate enough. Two important questions need to be answered before the system is developed. The first question is: what level of understanding is needed? The second question is: what level of understanding is possible, given the information content of the images or videos?

The simplest approach is to handle the humans as blobs, points or rectangular regions. In this case, the "understanding" simply means tracking the blobs as they move across the image [149]. This approach can be enough if we want to understand human movement in public places such as parks. However, people are more than blobs: they are capable of gestures and actions with their limbs and arms.

A more complex approach to human motion recognition is to track the human contour. The human contour can be obtained from images, but it does not allow every body part to be tracked separately. To handle this problem, key positions or specific features extracted from simple or repetitive motions can be used for recognition.

The highest level of understanding is when we can track all body parts individually. Of course, these methods need the highest resolution and the most processing power for understanding.

Problem Statement

Given a sequence of images, we propose to detect the human presence in the frames, to estimate the body configuration of the people and, based on the body configuration, to recognize the human activity and behavior. In this thesis, we present a general framework and its components that detect the people in the images with maximum accuracy and offer the highest understanding level that the image resolution allows.

Main contributions

The main contributions to the field of the human behavior recognition include:

We carried out a survey of the most important methods in the field of human behavior recognition systems.

We studied and compared the most important foreground detection techniques, searching for a generally good foreground detection technique which could reduce the search space for human bodies and speed up the detection process. The study identifies which foreground detection technique should be used in different situations to obtain the best result.

Using the integral image, we proposed a fast and reliable foreground detection technique which is comparable in speed with the running Gaussian average while its precision is among the highest.

We proposed new tree classifiers which can achieve better recognition rates than similar methods, together with a categorization of human poses or attitudes. The training of the Haar-based classifier can be further enhanced by feeding back the misclassified images during the training phase.

We introduced a new template representation and an associated metric which speed up the matching of human body contours using Chamfer matching.

In the case of pictorial structure based human detection, we introduce a new term in the framework. This term takes into account the relation between consecutive frames and speeds up the entire matching and detection process. Another benefit of this new term is the increased recognition efficiency.

For activity recognition we proposed an improved Dynamic Time Warping (DTW) method.

We observed that the motion time series can be shrunk without affecting the precision of DTW matching.

For human behavior recognition we employed Petri Nets and showed that, using a hierarchical construction, they are a very powerful and general tool for this purpose.

Thesis overview

The dissertation is organized into six chapters.

Chapter 2 describes the background of the human behavior recognition systems field. The components of the system are detailed in this chapter, which also presents the most significant work in this field.

Chapter 3 presents the preprocessing components of the system and answers the question of what kind of low-level vision processing should be used to speed up human detection. For this purpose, we present the most important background subtraction methods and compare them. Based on the conclusions drawn from the comparison, we present a rapid background subtraction method.

Chapter 4 presents the most important part of the thesis. Organized in three subchapters, it presents the results in human detection and pose estimation, covering different approaches. The first technique is an artificial intelligence method that uses Haar features. The second is a template based method. The third is a component based method which uses pictorial structures to detect the human in the image and estimate the body configuration. The subchapters contain the description of the methods and the experiments.

Chapter 5 contains the presentation of human behavior recognition component. This component has two parts: the action recognition part and the activity or behavior recognition part. Both parts are described in separate subchapters. In these subchapters we present a Heuristic FastDTW based action recognition algorithm and a Petri Net based human behavior recognition algorithm.

Chapter 6 summarizes the main results of the thesis and provides conclusions. The last chapter also presents some future research topics as extension of this work.

Problem Overview

Researchers have been interested in the automatic recognition of human behavior for a long time. Successful recognition of human behavior would push several applications beyond their current limits: visual surveillance, from automatically detecting suspicious behavior to medical tracking and analysis of patients, and human-computer interaction (HCI), where gesturing interfaces could make it easier to create smart offices or homes. There has been progress in multiple directions in recognizing complex human actions, resulting in an enormous quantity of algorithms. However, challenges remain in developing more robust and more general algorithms. In this thesis, we investigate the existing methods and try to build reliable human behavior recognition systems.

In system development, a significant step is to define the system requirements and the working environment. Our system's goal is to increase the efficiency of video surveillance applications. We identify some tasks to be handled and some restrictions that can be used when developing this type of system. First, we need to work with 2D image sequences, because the cameras widely used in this field provide only 2D images. As a consequence, we work with the 2D projection of a 3D object, which introduces plenty of uncertainty into the system: in some cases more than one 3D position of the human body can correspond to the same 2D projection. In the case of a visual surveillance system, we can suppose that the video acquisition system uses static cameras or that the cameras are moving slowly. This fact is crucial because, in many cases, we can consider that we are dealing with a static background. We also need to handle some significant problems. Firstly, we have to deal with illumination changes; in many cases this is a gradual change, but there are situations when the change happens suddenly. Another task to be resolved is the varying size of the human object during detection. An important problem is occlusion and self-occlusion, as we only see a part of the human body. If occlusion occurs, the performance of the system is heavily affected.

Figure 1. General human behavior recognition system

A general human behavior recognition system (Figure 1) has the following components:

Image acquisition system can be:

Online systems: one or more video cameras

Offline systems: video servers.

The preprocessing component, whose goal is to filter out the errors from the image and to extract information using low-level tools. In our case, we try to extract some useful information about the background, or the full background, to define or reduce the area of the image we are interested in.

The human detection and pose estimation component, which represents the most valuable part of the system. The performance of this component highly influences the overall performance of the system.

The activity and behavior recognition component processes the response provided by the detection and pose estimation algorithms and classifies the human actions.

Our contribution mainly focuses on the following three parts of the system described above: the preprocessing algorithms, the human detection and estimation component, and the activity and behavior recognition part.

Preprocessing

In the preprocessing phase, we focus on reducing the area of interest by using background subtraction or optical flow. Both directions have a vast number of algorithms.

Background subtraction methods are used in the context of moving object detection from static cameras. One way to obtain the background is to acquire a background image which does not include any moving object. In some situations, such a background is not available, and the acquired background can change under critical situations such as illumination changes, or objects can be introduced into or removed from the scene. Many background modeling methods have been developed [51, 33] that try to deal with these problems. The methods mostly differ in the way the background is modeled. One of the simplest methods is basic background modeling, which uses the temporal average, histogram or median [102, 123, 242]. Another category contains the statistical methods: a single Gaussian [226], a Mixture of Gaussians [188] or Kernel Density Estimation [50]; in this case statistical variables are used to classify the pixels as foreground or background. The background can also be computed with clustering methods using K-means [24] or the Codebook algorithm [95]. This approach supposes that each pixel in the frame can be temporally represented by clusters. The pixels from a new image are matched against the corresponding cluster group and are classified as foreground or background according to whether the matching cluster is considered part of the background or not. A group of methods uses Wavelets and DTW to create the background model [16]. The background can also be estimated using filters (Kalman filter [124], Wiener filter [213], Tchebychev filter [31]); any pixel of the current image that deviates significantly from its predicted value is declared foreground. In the case of the artificial intelligence based methods, the background is modeled using the weights of a neural network that is trained using N clean frames; the network is trained to classify each pixel as background or foreground [34, 122]. Fuzzy logic can also be employed for background modeling, by using a fuzzy running average [186] or a Type-2 fuzzy mixture of Gaussians [49]. Foreground detection can use fuzzy inferences [182] such as the Sugeno integral [240] or the Choquet integral [48]. More detailed surveys can be found in [190, 191].

The optical flow methods [11, 81, 4, 22, 23, 117, 116] are used in the context of moving objects and cameras. Finding the displacement field between subsequent frames of an image sequence has become a classical computer vision problem; this displacement field is called optical flow. Horn and Schunck introduced the first variational method for computing the optical flow field in an image sequence [11]. This method is based on two assumptions: a brightness constancy assumption and a smoothness assumption, which are characteristic of many variational optical flow methods. The variational methods belong to the best performing techniques for computing the optical flow field [81, 4]. Improvements to these methods include refined model assumptions with discontinuity-preserving constraints [96, 75, 52, 82] or spatiotemporal regularization [117, 69, 86], improved data terms with modified constraints [97, 22, 173] or nonquadratic penalization [22, 117, 222, 47], and efficient multigrid algorithms [21, 23, 166, 38, 57] for minimizing these energy functionals.

Human detection

The most sensitive and most important components of the system are human detection and pose estimation. Much research in this field focuses only on detecting people in the image, but behavior recognition systems capable of recognizing complex behaviors also need to know the pose or attitude of the humans. The term detection in this case means localization: it yields a rectangle in the image where the human is. The term human pose refers to the body and limb configuration relative to a coordinate system.

The people detection systems [5,170, 130,114] fall into two main categories:

“Component-based” methods

“Single detection window” analysis.

“Single detection window” analysis

First, we survey the “single detection window” methods. The first class of methods uses 2D or 3D human models and tries to match this model to image parts. It is difficult to create a model that is general and simple enough and still capable of capturing every particular human motion. If we succeed in matching the model we also obtain its pose, but overall the matching procedure is very time consuming.

A particularly attractive class of “single detection window” methods contains the shape based techniques. These methods are attractive because they reduce variations in human appearance due to lighting or clothing. The methods can use a continuous or a discrete representation of the shape.

Discrete approaches represent the shapes by a set of exemplar shapes [42, 41, 189, 92]. These methods require a large number of example shapes (many thousands) to sufficiently cover the shape space, due to transformations and intra-class variance. Exemplar-based models have to strike a balance between specificity and compactness: when the shape set is too large, the methods work extremely slowly, while if the shape set is too small, the performance is low as well. Efficient matching techniques based on distance transforms have been combined with hierarchical structures to allow matching thousands of exemplars [42, 41, 189].

The continuous approach involves a parametric representation of the shapes, learned from examples, given the existence of an appropriate manual [195, 193, 194] or automatic [15, 112, 115, 121, 169] shape registration method. One direction is the linear shape representation, where a single Gaussian is used to model the class-conditional density [15, 195]. An extension of the linear model space uses conditional density models [195, 115]. Further nonlinear extensions have been introduced at the cost of requiring a larger number of training shapes, to cope with the higher model complexity [195, 115, 193, 194, 169]. Another approach breaks the shape space into subspaces, using a mixture of Gaussians via the EM algorithm [195] or K-means clustering [115, 193, 15, 169], each of which can be modeled linearly. To achieve better performance, some approaches combine the shape and the texture information into a compound parametric appearance model [196, 195, 115, 121, 98]. These approaches have separate statistical models for shape and intensity variations.

The second type contains the “example-based” methods, which use a labeled training set to learn to recognize human objects. These are discriminative methods and usually use predefined image features and their relationships in order to determine the human objects. One of the key elements of these methods is the training set, the other is the training algorithm. The “example-based” methods can be categorized into three classes based on the classifier architecture, i.e. the technique which determines an optimal decision boundary between pattern classes in a feature space:

Feed -forward multilayer neural networks

AdaBoost classifier

Support Vector Machines (SVMs)

The most common class of methods, the feed-forward multilayer neural networks [1], implements linear discriminant functions in a feature space into which the input patterns have been mapped nonlinearly using feature sets. The decision boundary is computed by minimizing an error criterion with respect to the network parameters; the most common criterion is the mean squared error [1]. Multilayer neural networks have been applied in conjunction with adaptive local receptive field features as nonlinearities in the hidden network layer [89, 41, 168, 120, 224]. This architecture unifies feature extraction and classification within a single model.

The second class of classifiers is AdaBoost [228], which is used to construct strong classifiers as weighted linear combinations of selected weak classifiers, each involving a threshold on a single feature [144, 178]. The best known and most used classifier is the cascade classifier introduced by Viola et al. [144] and adopted by many others [90, 91, 140, 9, 99, 153]. The idea behind the cascade classifier is that the majority of detection windows in an image are non-humans, so the cascade structure is tuned to reject non-human objects as early as possible. AdaBoost has also been applied as an automatic feature selection procedure [228]. Each layer is constructed iteratively using AdaBoost to create a strong classifier guided by user-specified performance criteria. Each layer focuses on the errors the previous layers make, and the result is a complex detector. The cascade classifier was later modified into a tree classifier. The classifier eliminates the majority of the negative inputs in an early stage of the classification, and only the searched objects are classified by all stages. Lienhart et al. [103, 104, 105] expand Viola's approach by extending the feature types and using detection trees to handle in-class variances. Most AdaBoost classifiers use Haar-like features. Non-adaptive Haar wavelet features have been introduced by Papageorgiou and Poggio [25] and adapted by many others [125, 66, 144]. The Haar features are local filters operating on pixel intensities and represent local intensity differences at various locations, scales and orientations. The Haar features are popular because they are simple and, using integral images, can be computed extremely fast with the same computational cost for every scale [176, 144]. The Haar features are frequently redundant, due to overlapping spatial shifts, and require mechanisms to select the most appropriate subset of features out of the vast amount of possible features. This selection can be manual, using prior knowledge about the geometric configuration of the human body [125, 25, 66], or automatic [228, 144].
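As an illustration of why Haar-like features are cheap to evaluate, the following minimal Python/NumPy sketch computes an integral image and evaluates a rectangle sum with four lookups; the function names and the particular two-rectangle feature are illustrative choices, not taken from the cited works.

import numpy as np

def integral_image(img):
    # ii[y, x] holds the sum of img[0:y, 0:x]
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    # sum of the h x w rectangle with top-left corner (y, x): four lookups at any scale
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_two_rect_vertical(ii, y, x, h, w):
    # simple two-rectangle Haar-like feature: upper half minus lower half
    return rect_sum(ii, y, x, h // 2, w) - rect_sum(ii, y + h // 2, x, h // 2, w)

# usage (illustrative): img = np.random.randint(0, 256, (128, 64))
# value = haar_two_rect_vertical(integral_image(img), 8, 8, 24, 12)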

Another powerful tool for solving pattern classification problems is the Support Vector Machine (SVM) [168]. SVMs maximize the margin of a linear decision boundary (hyperplane) to achieve maximum separation between the object classes. To solve the human detection problem, linear SVM classifiers have been used in combination with various (nonlinear) feature sets [128, 129, 132, 215, 66, 99, 153]. Other methods use nonlinear SVM classification, with polynomial or radial basis function kernels for implicitly mapping the samples into a higher dimensional space. The nonlinear SVM is characterized by a significant increase in computational cost and memory requirements [132, 76, 126, 168, 25, 120].

Another type of feature used in classification tasks is the codebook of feature patches, extracted around interest points in the image [196, 7, 6, 176]. During training, a codebook of distinctive object feature patches with geometrical relations is learned, followed by clustering in the space of feature patches to obtain a compact representation of the human class. Besides the intensity based features, some features use the discontinuities in the image. Such features are the histograms of oriented gradients (HOG) [128, 178, 215, 99, 153] and the scale-invariant feature transform (SIFT) [39]. These use well-normalized image gradient orientation histograms computed over local image blocks. HOGs were initially computed using local image blocks at a single fixed scale [128, 178], but were later extended to variable-sized blocks [215, 99, 153]. To increase robustness to illumination changes, local spatial variation and correlation of gradient based features have been encoded using covariance matrix descriptors [140]. Another category of features is represented by the edgelets, shape filters that explicitly incorporate the spatial configuration of dominant edge-like structures [90]. Manually selected sets of edgelets, which can be local line or curve segments, have been used to capture edge structures [144, 9].

A common problem for all the examples presented is the high in-class variability, which makes the classifier very complex. Many recent human detection approaches attempt to break down the complex appearance of the human class into subclasses with low in-class variability. Usually the breakdown is made so that the subclasses correspond to a specific human pose or appearance, and it is performed using manual [132, 178, 66, 9] or automatic clustering [41, 99]. Without a reliable clustering method, this significant amount of work is done manually. Since the data is not collected in a controlled environment, manual categorization can become prohibitively expensive, and because of the fundamental ambiguity in labeling different poses and views, the complexity of the work grows linearly with the number of classes. Manual categorization is also an error-prone procedure that may introduce significant bias into the training process.

“Component-based” methods

The “component-based” methods detect the invariant object parts separately and check whether they are present in a geometrically plausible configuration. These parts are either semantically motivated, as body parts [76, 90, 126, 178, 67, 9], or concern codebook representations [9, 165, 7, 6]. The components should be small enough to capture articulated motion, but sufficiently large to contain discriminative visual structure that allows reliable detection. These systems usually use a hierarchical detection framework: the body parts are detected in order of their importance, and if one of the basic parts is not detected, the other parts are not searched for.

The “component-based” methods require assembly techniques to integrate the local part responses into a final detection, constrained by spatial relations or a model among the parts. Methods that combine the part-based detection responses into a final classification use either a combination of classifiers [76, 126, 178] or probabilistic inference to determine the most likely object configuration given the observed image features [90, 9, 67].

The “component-based” systems can address partial occlusion more easily [9, 7, 6] and do not need a huge training set to adequately cover the set of possible appearances. On the other hand, they are slower than single detection window based methods, because they need to detect more than one component. Their applicability to lower resolution images is also limited, since each component detector requires a certain spatial support for robustness.

Activity and behavior recognition

The objective of this subchapter is to provide an overview of state-of-the-art human activity and behavior recognition methodologies.

Figure 2. Relation between activity and behavior recognition

The term "activity" refers to basic human actions like walking or standing, and we will use the term "behavior" to refer to complex activities with a longer duration in time. A behavior can be composed of a sequence of activities. Both the activity and the behavior recognition techniques can be classified [84] into two categories:

Single-layered approaches

Hierarchical approaches.

However, the single-layered approaches are more suitable for activity recognition, while the hierarchical methods are used for behavior recognition.

Single-layered approaches

Single-layered approaches are techniques that represent and recognize human activities directly from sequences of images. Due to their nature, single-layered approaches are suitable for the recognition of actions with sequential characteristics [84]. Two types of single-layered approaches can be distinguished, depending on how they model human actions:

Space-time approaches

Sequential approaches.

The space-time approaches view an input video as a 3-dimensional volume (the image plane plus time as the third dimension) and can be divided [84] into three categories based on the features they use:

Trajectory-based approaches,

Space-time volumes based techniques,

Space time-Local feature based techniques

The trajectory-based techniques interpret an activity as a set of space-time trajectories. In these techniques, the human objects are represented as a set of 2D or 3D points corresponding to the joint or body part positions. Some of the work in this field tracked the joint position and used them directly [157, 158, 180, 234] or constructed 3D XYT or 4D XYZT representations of the motion, to recognize the action [223, 139, 180, 235].

Instead of using the trajectories directly for human action recognition, some methods extract significant curvature patterns from the trajectories. The actions are represented by a set of peaks and the intervals between them. Using different learning algorithms, prototypes for the actions are created, and these prototypes serve as action templates. The recognition is done with a template matching technique [160].

Some approaches transform the human action trajectory into a low-dimensional phase space. In the phase space, a person's static state at each frame corresponds to a point and an action corresponds to a set of points [84]. The scenes are also converted into the phase space, and it is checked whether the points lie on the maintained curves [30]. Using these methods we are able to analyze detailed levels of human movement, and most of these methods are view invariant.

The space-time volume based techniques measure the similarity between two volumes. One approach tracks the foreground shape changes and compares this volume to the saved ones [18, 179, 94]. In order to match volumes more reliably, some techniques use filters to capture characteristics of the volumes [162]. Instead of maintaining the 3-dimensional space-time volume of each action, some approaches represent each action with a template composed of two 2-dimensional images, a binary motion-energy image (MEI) and a scalar-valued motion-history image (MHI), and compare them using template matching [18]. Hierarchical space-time volume correlations were also introduced: at every location of the volume, a small space-time patch is extracted around the location and correlated with the saved templates [179]. Another system applies a hierarchical mean shift to cluster similarly colored volumetric pixel elements, from which it obtains several segmented volumes; in the recognition phase, an SVM is used to classify the volumes [90, 174].

These techniques use a sliding window algorithm that requires a huge number of computations to accurately solve the recognition problem. Scenes with multiple people, or actions that cannot be spatially segmented, can create problems for these systems [84].

Space-time local feature based techniques extract local features from 3D space-time volumes to represent and recognize activities. According to these techniques, the 3D space-time volume created by an action is essentially a rigid 3D object, and by extracting appropriate features, action recognition becomes an object matching problem. These techniques are characterized by the features they extract, how they describe the action with these features, and what methodology they use to classify activities.

There are two approaches [84]: the first extracts local features at every frame and concatenates them temporally to describe the overall motion of human activities [17, 32, 239], while the other extracts sparse spatio-temporal local features from 3D volumes [101, 46, 138, 235, 163].

One of the earliest techniques that extracts local features at the frame level to characterize an action uses motion energy receptive fields together with Gabor filters. The detected local features are used to build a multidimensional histogram, and then the posterior probability of an action is computed using Bayes' rule [32]. A variant of this method uses normalized local intensity gradients extracted at multiple temporal scales and applies an unsupervised clustering algorithm to the histograms to learn actions [239].

Another approach extracts appearance-based local features at each pixel of every frame and constructs a space-time volume. Using these features and the Poisson equation, space-time saliency and space-time orientation can be extracted. The actions are recognized using nearest neighbor classification with a Euclidean distance [17].

Some approaches extend a local feature (corner) detector [74] to be applicable in 3D space. These extended corner detectors capture various types of non-constant motion patterns, like direction changes. The features are then used with an SVM to recognize activities [100, 174].

Cuboid-type spatio-temporal features were proposed [46] to capture pixel appearance values of the interest point's neighborhoods. For each dataset, a set of cuboids was used, and the actions were modeled as a histogram of cuboids detected in 3D space-time volume. This method was extended using unsupervised learning and classification methods [138]. These methods were extended by pruning the cuboid features [106], to use the extended 3D SIFT method which is remarkably similar to the cuboid features [175], or to use the color information [161]. These techniques use successful bag-of-words approaches for basic periodic actions.

More recent action recognition approaches attempt to model the spatio-temporal distribution of the extracted features, considering the spatial configuration of the local features for better recognition of actions. The PLSA introduced by Niebles [138] was extended to use an implicit shape model (PLSA-ISM) [225], which captures the relative spatio-temporal location information of the features with respect to the activity center. These approaches are further extended when the correlation between the features is also considered [172, 100, 106].

Ryoo and Aggarwal [164] introduced the spatio-temporal relationship (STR) match method, which measures the structural similarity between two videos by considering the spatial and temporal relationships between the detected features. This method is also capable of recognizing basic behaviors and interactions between two people.

The sequential approaches interpret an input video as a sequence of observations. They extract features from the frames describing the status of a person in the image, and analyze the sequence of features to measure how likely it is that the person is performing a given activity. The sequential approaches can be classified into two categories, depending on whether they use:

Exemplar-based recognition methodologies

Model-based recognition methodologies

Exemplar-based sequential techniques use samples directly in the recognition or training phase. The new image sequences are compared to template sequences composed of feature vectors extracted from the training video. These techniques must be invariant to different styles and/or different rates of the performed activity. One approach is DTW, adapted for matching two sequences with variations [45, 62, 217] and used mostly for gesture recognition. Another approach uses principal component analysis (PCA) to represent an activity as a linear combination of a set of activity bases, essentially a set of eigenvectors; the coefficients of the input activity are computed and compared to the template coefficients [217, 232]. In a more recent approach [107], human activities are represented as linear time invariant (LTI) systems. The system tries to use the dynamics of changes in silhouette features: the new frames are converted to LTI parameters and classified using an SVM.
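To make the DTW-based matching concrete, here is a minimal Python sketch of the classic dynamic programming distance between two 1-D feature sequences; the example sequences and the absolute-difference local cost are illustrative assumptions, not the actual feature vectors used in the cited works.

import numpy as np

def dtw_distance(a, b):
    # D[i, j] = minimal accumulated cost of aligning a[:i] with b[:j]
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])            # local distance between frames
            D[i, j] = cost + min(D[i - 1, j],          # stretch the template
                                 D[i, j - 1],          # stretch the observation
                                 D[i - 1, j - 1])      # one-to-one match
    return D[n, m]

# e.g. compare an observed joint-angle series against an action template
# print(dtw_distance([0.1, 0.4, 0.9, 1.2, 0.8], [0.1, 0.5, 1.0, 0.9]))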

State model-based sequential techniques are approaches which represent a human activity as a model composed of a set of states. This model is trained statistically so that it corresponds to sequences of features which define the activity class. For every activity class a statistical model is constructed. To measure the likelihood between the action model and the input image sequence, maximum likelihood estimation (MLE) or the maximum a posteriori probability (MAP) can be used. The most used model-based sequential approaches are the Hidden Markov models (HMMs) [241] and the dynamic Bayesian networks (DBNs) [147].

In the case of both approaches, the human is assumed to be in one state at each frame, and at each state we have an observation: the feature vector. The transitions between states are evaluated considering the observations, and the activity is represented as a set of hidden states. Activity recognition then becomes the problem of computing the probability that a given observation sequence was generated by a particular state model.

The first approaches use standard HMMs to recognize activities [233, 187, 19]. These approaches use shapes, positions or trajectories as observations and usually use the Viterbi algorithm [216] to compute an approximation of the likelihood for classifying an observation sequence into an activity class. The newer generation of HMM approaches [141, 147, 134] also constructs one model (HMM) for each activity to be recognized and uses appearance features from the scene as observations, but relies on extended HMMs designed to handle complex activities. One of these extended HMMs is the coupled HMM (CHMM) [141]. The CHMM is constructed by coupling multiple HMMs: hidden states of two different HMMs are coupled by specifying their dependencies. The CHMM makes it possible to model more than one actor (more than one state can be active) and the interactions between them. This method was later extended by explicitly modeling the duration for which an activity stays in each state, using coupled hidden semi-Markov models (CHSMMs) [108, 134]. Another extended HMM is the DBN, which is built from multiple conditionally independent hidden nodes that generate the observations at each frame [147]; this algorithm was used to recognize gestures.
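A minimal sketch of the evaluation step described above, assuming a discrete-observation HMM with hypothetical toy parameters (the numbers are purely illustrative): the forward algorithm returns the probability that a model generated the observation sequence, and in a recognizer this value would be compared across the per-activity models.

import numpy as np

def sequence_likelihood(pi, A, B, obs):
    # forward algorithm: P(obs | model) for a discrete-observation HMM
    # pi: initial state probabilities, A: transition matrix, B[s, o]: P(symbol o | state s)
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# toy two-state activity model with three observation symbols (hypothetical values)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
print(sequence_likelihood(pi, A, B, obs=[0, 1, 2, 2]))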

Hierarchical approaches

The hierarchical approaches recognize high-level human activities, mostly behaviors. These systems are composed of multiple layers. These techniques recognize high-level activities, behaviors based on the recognition results of other simpler activities. The hierarchical approaches are:

Statistical approaches

Syntactic approaches

Description-based approaches

Statistical approaches construct statistical state-based models, such as HMMs and DBNs, related hierarchically to each other to represent and recognize high-level human activities. The first layers recognize atomic actions from sequences of feature vectors using single-layered sequential approaches. The upper layers treat the previous layer's atomic actions as observations; using these observations, new statistical models are built to recognize complex actions or behaviors. One of the most common approaches is the layered Hidden Markov model (LHMM) [141, 137], constructed from two layers: the bottom-layer HMMs recognize atomic actions of a single person, and the upper layer recognizes the behavior. This approach is also suitable for the recognition of group behavior [241, 43]. A variant of the two-layered HMM is the block-based HMM [236]. An extension of the traditional HMMs is the dynamic probabilistic network (DPN), suitable for representing activities of multiple participants [64]. Bayesian networks using Markov chain Monte Carlo (MCMC) were also used to recognize activities [44]: the relations between basic actions are modeled using Bayesian networks, and the network is iteratively updated using MCMC to find the best model for the current sequence of observations. Another hierarchical approach is the propagation network (P-net) [181], in which the activity is represented by multiple state nodes and their transition and observation probabilities. The main advantage of this approach is that multiple states can be active, which makes it possible to model composite activities as well.

The syntactic approaches model human activities as strings of symbols, using grammar formalisms such as context-free grammars (CFGs) or stochastic context-free grammars (SCFGs) [78, 125, 88]. The high-level activities are modeled as strings of atomic-level activities. Similarly to the other hierarchical approaches, this method uses two layers: the first recognizes the atomic activities using any of the previous methods, and the second layer uses a set of rules that generate strings of atomic actions, which are recognized by a parsing technique. One of the basic syntactic approaches [78] uses HMMs for the recognition of simple actions in the first layer; the second layer uses SCFGs to recognize complex actions based on the outcome of the first layer. The lower layer generates a string of actions, and the upper layer parses this string using the Earley-Stolcke algorithm, extended to handle uncertain observations, with a large number of stochastic production rules which should be able to explain all activity possibilities. These methods have been extended [127] to handle multitask activities by introducing more reliable error detection and recovery techniques for the recognition. Other extensions [125] of the basic syntactic approach introduce a CFG to improve segmentation and object tracking; this method also introduces the concept of hallucinations, to compensate for failures of the atomic-level recognition. Recent work focuses on grammar extensions [88] that attach semantic tags and conditions to the production rules of the SCFG. These extensions are able to recognize activities of higher complexity.

Description-based methods are hierarchical approaches that represent complex human activities by describing them in terms of simpler activities or sub-events and their temporal, spatial and logical structure. A human action is modeled as an occurrence of the simpler activities or events composing it and certain relations between these activities. In description-based approaches, the relations between sub-events are described using predicates such as before, meets, overlaps, during, starts, finishes and equals [3, 2, 185, 150, 136, 65, 163, 220, 155] for temporal, sequential and concurrent relationships. The activities' semantics are encoded similarly to programming languages, and a CFG is used to verify whether the representation fits the activity grammar [164, 135].

The description based algorithms differ mostly in how they describe temporal structures and in the way they perform the recognition. One of the first approaches [150] uses a Past, Now, Future network (PNF-network) based on Allen's interval algebra constraint network (IA-network) [3], where sub-events are nodes and their temporal relationships are described by edges between them. Other earlier approaches [77, 136, 70, 220] use a format similar to that of programming languages; the main difference between these methods is the set of temporal predicates they use for the relations.

In the Bayesian belief network based approach, the root node corresponds to the high-level activity and the other nodes correspond to the sub-events or describe the temporal relationships between the sub-events [150].

Petri Nets have also been used to represent human activities [238, 133, 63]. Petri Nets are suitable for representing the temporal ordering of the sub-events and their relations. The recognition is done by passing tokens through the graph.
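The token-passing idea can be sketched with a minimal condition/event-style Petri Net in Python; the "luggage abandoned" behavior, the place names and the transitions below are hypothetical examples, not taken from the cited systems.

class PetriNet:
    # minimal Petri Net: a transition fires when all of its input places hold a token;
    # places model completed sub-events, and a token in the final place means the
    # high-level behavior has been recognized
    def __init__(self, transitions):
        self.transitions = transitions   # {name: (input_places, output_places)}
        self.marking = set()

    def add_token(self, place):
        self.marking.add(place)
        self._fire()

    def _fire(self):
        changed = True
        while changed:
            changed = False
            for ins, outs in self.transitions.values():
                if set(ins) <= self.marking:
                    self.marking -= set(ins)
                    self.marking |= set(outs)
                    changed = True

# hypothetical "luggage abandoned" behavior: enter, then put down a bag, then walk away
net = PetriNet({
    "t1": (["enter"], ["p1"]),
    "t2": (["p1", "put_down_bag"], ["p2"]),
    "t3": (["p2", "walk_away"], ["luggage_abandoned"]),
})
for atomic_action in ["enter", "put_down_bag", "walk_away"]:   # output of the lower layer
    net.add_token(atomic_action)
print("luggage_abandoned" in net.marking)   # True once the ordered sub-events were observed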

Conclusions

This chapter is a survey of the most representative human behavior recognition systems. The first part of the chapter presents a general framework for building a human behavior recognition system, emphasizing all of its important steps.

In the second part, we synthesized the current achievements, results and conclusions in the field of human behavior recognition systems. The survey covered all three components of the general framework:

Preprocessing

Human detection

Behavior recognition.

During the survey we identified the following issues which are worth studying further:

In the literature, we have found a large number of background detection algorithms that perform well only in specific situations. Usually they are either fast or accurate, and most of them cannot handle shadows. Neither have we found a robust method for deciding which of the algorithms should be used in a specific situation.

In the case of the human detection algorithms, the situation is the same. We have identified a large number of algorithms, but they work only in special situations and there is no information or guideline about how to choose the correct one for a given task or scenario. Another shortcoming of the human detection algorithms is that they cannot retrieve both the position of the humans in the image and their pose (attitude) at that moment; the very few that can do this are very slow.

In the literature we have found mostly behavior recognition methods that recognize only simple actions. Those which can recognize complex behaviors need complex training and are usually not general enough to recognize arbitrary behaviors. Many of the algorithms do not process body part motions, which results in a high degree of uncertainty.

Preprocessing

The images, captured by a camera or loaded from a video file, in most cases need to be prepared for later processing. These preparations are qualitative enhancements, normalizations or dimensionality reduction algorithms. Qualitative enhancements include noise reduction and contrast or color enhancement. Normalization is a linear process that changes the range of a feature (e.g. color, intensity, size, orientation) to a predefined range. The dimensionality reduction algorithms use restrictions or heuristic information to reduce the complexity of the problem.

In this chapter, we focus in particular on the dimensionality reduction algorithms, by presenting and comparing the most common methods. The methods are evaluated from the point of view of human detection and behavior recognition techniques. We also propose a novel method for background detection, capable of removing the shadows as well.

Problem statement

Human detection is an extremely complex task. In this phase, the main goal is to reduce the complexity of the task. We can do this by considering all the available information and using it to build an algorithm that makes the subsequent processing steps more reliable and efficient.

In our case, the most significant information about the humans is that they are moving. Although their motion is not continuous, over a long enough period they will make some movements. This information can be used successfully in all cases except still images. For our case, an efficient algorithm can be built to narrow the search space: these are the foreground segmentation algorithms.

When choosing or designing an algorithm for this task, we need to take the following cases into consideration:

Static background

Periodically static background

Dynamic background

In real environments, all approaches have to deal with several problems [87, 152] as follows:

Gradual illumination changes: Gradual illumination changes alter slowly the color characteristic of the image.

Quick illumination changes: Quick illumination changes entirely alter the color characteristics of the image.

Relocation of the background Object: Relocation of a background object induces changes in two different regions in the image, its newly acquired position and its previous position.

Camera oscillations: Camera oscillation generates repeated slow shifting of the image.

High-frequency background objects (such as tree branches, sea waves, and similar): these appear mostly in outdoor environments, but can also be caused by a flickering lamp.

Initialization with moving objects: If moving objects are present during initialization, then part of the background will be occluded by the moving objects.

Camouflage: A foreground object's pixels may have the same intensity and color as the background.

Shadows: Objects cast shadows that might also be classified as foreground due to the illumination change in the shadow region.

There are two kinds of widespread approaches for foreground segmentation: optical flow computation and background subtraction.

Optical Flow

Optical flow reflects the image changes due to motion during consecutive frames. The optical flow field is the velocity field that represents the three-dimensional motion of object points across a two-dimensional image. The optical flow can tell us about the relative distances of equal speed objects: closer moving objects will have more apparent motion than moving objects that are further away. There are many approaches to compute the optical flow. Despite the differences between the approaches, most of them have three stages [13]. The first stage is the pre-filtering or smoothing with low-pass or band-pass filters in order to extract signal structure of interest and to enhance the signal to noise ratio. The second stage represents the extraction of basic measurements such as spatio-temporal derivatives or local correlation surfaces. The last stage is the integration of these measurements to produce a 2-d flow field which often involves assumptions about the smoothness of the underlying flow field.

The most prominent optical flow methods are:

Differential techniques

Region-based methods

Frequency-based methods

Phase based methods.

Differential techniques: These approaches compute the optical flow from spatio-temporal derivatives of the image. A frame from an image sequence is written as a function of position and time. This representation form of the image is expressed using a Taylor series. This category of techniques uses three constraints:

Brightness constancy: the observed object brightness of any object point is constant over time

The velocity smoothness

The temporal persistence or „small movements”.

The first assumption can be expressed as equation (3.1):

I(x, y, t) = I(x + dx, y + dy, t + dt)   (3.1)

Equation (3.1) can be expanded using a Taylor series, which leads to the 2D motion constraint equation:

I_x·u + I_y·v + I_t = 0   (3.2)

where I_x, I_y and I_t are the partial derivatives of the image brightness and (u, v) is the optical flow vector.

The equation (3.2) has two unknowns without a unique solution. The absence of a unique solution is because, in a small area, there is not enough information to determine the motion [13].

To overcome the problem of equation (3.2), Horn and Schunck [11] introduced a new global smoothness constraint. In their assumption, large rigid objects are moving in the image, so the optical flow will be smooth over relatively large areas. Horn and Schunck minimize the squared magnitude of the gradient of the optical flow by using the functional:

E = ∫∫ [ (I_x·u + I_y·v + I_t)² + λ·(|∇u|² + |∇v|²) ] dx dy

In contrast, the Lucas and Kanade approach [10, 183, 184] assumes that the flow is constant in a local neighborhood of the pixel. The optical flow equation is solved using a local least squares criterion over the neighborhood surrounding the pixel:

min over (u, v) of Σ over the neighborhood of W²(x, y)·(I_x·u + I_y·v + I_t)²

The Horn and Schunck constraint can be extended using a global smoothness constraint and second-order derivatives to measure the optical flow [131]. Nagel suggested an oriented-smoothness constraint, in which smoothness is not imposed across steep intensity gradients, in an attempt to handle occlusion; the problem is formulated as the minimization of a corresponding functional [131].
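As a concrete illustration of the local least squares idea, the following sketch estimates the flow at a single pixel from two grayscale frames; the window size and the use of np.gradient for the image derivatives are implementation assumptions, not prescribed by [10, 183, 184].

import numpy as np

def lucas_kanade_point(I1, I2, y, x, win=7):
    # assume constant flow (u, v) inside a win x win neighborhood around (y, x)
    half = win // 2
    I1f, I2f = I1.astype(float), I2.astype(float)
    Ix = np.gradient(I1f, axis=1)[y - half:y + half + 1, x - half:x + half + 1].ravel()
    Iy = np.gradient(I1f, axis=0)[y - half:y + half + 1, x - half:x + half + 1].ravel()
    It = (I2f - I1f)[y - half:y + half + 1, x - half:x + half + 1].ravel()
    A = np.stack([Ix, Iy], axis=1)        # each row: [Ix, Iy] at one neighborhood pixel
    b = -It                               # brightness constancy: Ix*u + Iy*v = -It
    flow, *_ = np.linalg.lstsq(A, b, rcond=None)
    return flow                           # least squares estimate of (u, v)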

The region-based matching approach uses region matching to compute the optical flow; in some cases numerical differentiation may be impractical because of noise. The region based matching method has three steps: region extraction, region matching and optical flow smoothing. The velocity v is defined as the shift d that gives the best fit between image regions at different times. For region matching, a similarity measure (over d) has to be used. The most used similarity measures are the normalized cross-correlation and the sum-of-squared difference (SSD):

SSD(x, d) = Σ_j Σ_i W(i, j)·[ I1(x + (i, j)) − I2(x + d + (i, j)) ]²

where W denotes a discrete 2-D window function and the displacement d takes on integer values [171].

Frequency-based methods compute the optical flow using the output energy of velocity-tuned filters in the Fourier domain. The Fourier transform of a translating 2-D pattern is:

Î(k, ω) = Î0(k)·δ(ω + vᵀk)

where Î0(k) is the Fourier transform of I(x, 0), δ is a Dirac delta function, ω denotes the temporal frequency and k the spatial frequency.

Phase-based techniques, or filter-based techniques, employ phase information to compute the velocity. These techniques deliver precise and reliable estimates without complex parameter tuning or optimization [13], mainly because the phase information is robust to changes in scale, orientation and speed. It has been demonstrated that phase-based techniques are more accurate than local methods, but the filtering operations have a high computational cost, so they can be used in real-time applications only with dedicated hardware.

Background subtraction

Background subtraction is a commonly used class of techniques for segmenting out objects of interest from a scene filmed with a static camera. It involves comparing an observed image with an estimate of the static background. The areas of the image plane where there is a significant difference between the observed and estimated background images indicate the location of the objects of interest. The name "background subtraction" comes from the simple technique of subtracting the observed image from the estimated background and thresholding the result to generate the objects of interest.

We compared several representative techniques of this class on the basis of some fundamental attributes: how the object areas are distinguished from the background, how the background is maintained over time, and how the segmented object areas are post-processed to reject false positives.

Figure 3. General framework for background segmentation

Every variant of the background subtraction method starts from the following decision rule:

|I_t(x, y) − B_t(x, y)| > T

where B_t is the estimated background, I_t is the current frame and T represents a predefined threshold. If the absolute difference between the current frame and the background estimate at a pixel is larger than T, that pixel is identified as a foreground pixel. The methods mainly differ from each other in the way they define and maintain the background.

One of the simplest variants is frame differencing, where the background estimate is simply the previous frame:

B_t(x, y) = I_{t−1}(x, y), so the test becomes |I_t(x, y) − I_{t−1}(x, y)| > T

This technique is fast and robust to all illumination changes, but it identifies only the contours of the moving objects, works only for particular combinations of object speed and frame rate [229], and is extremely sensitive to the threshold T.
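A minimal Python/NumPy sketch of this rule on grayscale frames; the threshold value is an arbitrary illustrative choice.

import numpy as np

def frame_difference_mask(prev_frame, frame, threshold=25):
    # foreground where |I_t - I_{t-1}| > T
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold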

To get a solid moving region, we need to compute the background as the average or the median of the previous images. This method is usually implemented as a temporal median filter: the background estimate is defined as the median of the last N frames, with typical values of N ranging between 50 and 200. The foreground test becomes:

|I_t(x, y) − M_t(x, y)| > T

where M_t represents the average or median of the last N frames. This method is fast, but extremely memory consuming (memory needed: N × Size(I)) [156].
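A sketch of the temporal median background, assuming grayscale frames; the buffer length and the threshold are illustrative values, and the buffer of N frames is what makes the method memory hungry.

import numpy as np
from collections import deque

class MedianBackground:
    # background = per-pixel median of the last N frames
    def __init__(self, n_frames=100, threshold=25):
        self.buffer = deque(maxlen=n_frames)
        self.threshold = threshold

    def apply(self, frame):
        self.buffer.append(frame.astype(np.uint8))
        background = np.median(np.stack(self.buffer, axis=0), axis=0)
        foreground = np.abs(frame.astype(float) - background) > self.threshold
        return foreground, background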

Cucchiara et al. [118] proposed to compute the median on a special set of n sub-sampled values of the last N images, and to use the computed median value for a certain period. Computed this way, the background is more stable.

Cutler [221] uses color images because they give better segmentations than monochrome ones, especially for areas with low contrast, such as objects in dark shadows. A pixel is marked as foreground if

|I_t(x, y, c) − B_t(x, y, c)| > K·σ_c   for any color channel c

where σ_c is an offline-generated estimate of the noise standard deviation and K is an a priori selected constant (typically 10). This method also uses template matching to help select candidate matches.

To minimize memory usage, the background can be computed as a running average:

B_t(x, y) = α·I_t(x, y) + (1 − α)·B_{t−1}(x, y)   (3.12)

where α is the learning rate, kept small (typically 0.05) to prevent artificial shadows from forming behind moving objects. This method does not need extra memory [118]. Several improvements exist for this method.
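A sketch of equation (3.12) with unconditional updating, using illustrative parameter values; only the current background image needs to be stored.

import numpy as np

class RunningAverageBackground:
    # B_t = alpha * I_t + (1 - alpha) * B_{t-1}, as in equation (3.12)
    def __init__(self, alpha=0.05, threshold=25):
        self.alpha, self.threshold, self.background = alpha, threshold, None

    def apply(self, frame):
        frame = frame.astype(float)
        if self.background is None:
            self.background = frame.copy()           # bootstrap from the first frame
        foreground = np.abs(frame - self.background) > self.threshold
        self.background = self.alpha * frame + (1 - self.alpha) * self.background
        return foreground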

After thresholding, Heikkila and Olli [80] use the closing morphological operation with a 3×3 structuring element to discard the small regions. Two background corrections are applied. The first is the update computed with equation (3.12). The second is applied to pixels that are marked as foreground for more than m of the last M frames, in which case the background is reset to the current frame:

B_{t+1}(x, y) = I_t(x, y)

This correction is designed to compensate for sudden illumination changes and newly appeared static objects. If a pixel changes state from foreground to background frequently, it is masked out from inclusion in the foreground; this is designed to compensate for fluctuating illumination, such as swinging branches.

In LOTS [192], three background models are kept simultaneously: a primary, a secondary and an old background. Both the primary and the secondary backgrounds are updated with equation (3.12) when the pixels belong to the background; the learning parameter is kept small (below 0.25). If the pixels belong to the foreground, the primary background is still updated with equation (3.12), while pixels that differ significantly from the background update the secondary background with equation (3.13). The third background, the old background, is a copy of the incoming images from 9000 to 18000 frames ago.

The method uses adaptive thresholding with hysteresis to classify pixels as foreground or background, and several conditions are then applied to the classified pixels. If the foreground pixels form only a small region, they are classified back as background. If the size of the foreground increases significantly in consecutive frames, the global threshold is temporarily increased, because this is interpreted as a rapid lighting change. To resolve local lighting changes, the foreground pixels are compared with the primary and secondary background images.

Small foreground regions and noise are removed with the method of Halevy [55], where the background is updated by

at all pixels, where the update term is a smoothed version of the current frame and the learning rate lies in the [0.3…0.5] interval. This method does not use thresholding for foreground detection; instead, it tracks the corresponding maxima. The authors also note that the learning rate gives an indication of the number of frames t needed for the background to settle down after initialization.

In the methods presented above, the background was represented by a mean value. An extension [56, 118] of this approach is to model every pixel of the background with a Gaussian distribution (μ, σ). In this way, a Gaussian distribution is fitted over the temporal histogram, resulting in the background PDF. The most convenient way to update the background is the running Gaussian average, which avoids fitting the PDF to each new frame from scratch. The background pixels in this case are explicitly modeled by a mean μ and a variance σ², which are updated recursively.

The classification is made by thresholding the difference between the mean and the current image:

The threshold value is computed from the standard deviation and a chosen constant K; the value of K is usually 2.5.

In practice, the selective update model is used in order to maintain the background PDF.

where M is a binary value, equal to 1 if the pixel is a foreground pixel and 0 otherwise.
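As an illustration of the running Gaussian average with selective update described above, the following minimal Python sketch (not the thesis implementation; the parameter names alpha and K and the initial variance are assumptions) processes grayscale frames:

import numpy as np

class RunningGaussianBackground:
    def __init__(self, first_frame, alpha=0.05, K=2.5, init_var=30.0**2):
        self.mu = first_frame.astype(np.float32)
        self.var = np.full(first_frame.shape, init_var, dtype=np.float32)
        self.alpha, self.K = alpha, K

    def apply(self, frame):
        frame = frame.astype(np.float32)
        diff = np.abs(frame - self.mu)
        # a pixel is foreground when it deviates more than K standard deviations
        foreground = diff > self.K * np.sqrt(self.var)
        # selective update: adapt mean and variance only where the pixel is background
        bg = ~foreground
        self.mu[bg] = self.alpha * frame[bg] + (1 - self.alpha) * self.mu[bg]
        self.var[bg] = (self.alpha * (frame[bg] - self.mu[bg]) ** 2
                        + (1 - self.alpha) * self.var[bg])
        return foreground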

The most significant limitation of the previous methods is that they cannot cope with multimodal backgrounds. The adaptive Mixture of Gaussians methods are meant to overcome this limitation. With this approach, we can model a multimodal background with a mixture of K Gaussians (μi, σi, ωi) [27, 28], with an arbitrarily pre-defined number of modes, usually between 3 and 5.

Each pixel is modeled separately by a mixture of K Gaussians
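The usual form of this per-pixel mixture model, to which the missing equation presumably corresponds, is:

P(I_t) = Σ_{i=1..K} ω_{i,t} · η(I_t; μ_{i,t}, σ_{i,t})

where η denotes a Gaussian density and ω_{i,t} are the mixture weights.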

with parameter choices as in [221, 229, 27]. In the case of color images the Gaussians are multivariate; assuming the channel values are independent, the covariance matrix simplifies to a diagonal one. To simplify further, the standard deviations of the three channels are considered equal [221, 27], so the Gaussians are simplified as well:

To update the model, we first identify the best matching distribution and then check the distance to this distribution. If the distance is smaller than a multiple of the standard deviation [221, 229, 27], then the component is updated with the following equations:

where

The other components are updated as follows:

If the distance to the best matching distribution is bigger than the standard deviation threshold, then the least likely component is replaced with a new one, which has a large variance and a low weight. After the updates, the weights are renormalized.

All components of the mixture are sorted in decreasing order of significance. Using a threshold T, the number of components that form the background is decided; this threshold can be different for every pixel. After the component classification, it is verified which pixels belong to a background component, in order to decide which pixels represent a foreground object and which a background one [29].
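For reference, OpenCV ships an adaptive Mixture-of-Gaussians background subtractor (MOG2) that follows the same idea; the sketch below shows typical usage and is not the exact algorithm of [27, 28] (the input file name is hypothetical):

import cv2

cap = cv2.VideoCapture("surveillance.avi")
mog2 = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=16,
                                           detectShadows=True)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = mog2.apply(frame)          # 255 = foreground, 127 = shadow, 0 = background
    # small opening removes isolated foreground pixels
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN,
                            cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3)))
    cv2.imshow("foreground", mask)
    if cv2.waitKey(1) == 27:
        break
cap.release()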

Another way to detect the foreground is to estimate the background PDF with Kernel Density Estimators (KDE) from the n most recent pixel values. Since plain histograms may provide a poor model of the true, unknown PDF, in [50] the background is modeled non-parametrically with a KDE over the histogram of the n most recent pixel values. To detect the foreground objects, the following test is used:

where T is a threshold value. The background is selectively updated based on the threshold.

The mean-shift based background estimation [118] is a gradient-ascent method able to detect the modes of a multimodal distribution together with their covariance matrices. The method has a high computational cost and requires a study of convergence [123]. The mean shift vector can be computed with the following equation:

where x is an arbitrary point of the data space, h is the analysis bandwidth, and g(u) is the first derivative of a kernel profile with bounded support. The iterative implementation is far too slow and requires a lot of memory: n*size(frame). Computational optimizations exist, but the method is usually used only at initialization, for detecting the background PDF modes; later, computationally lighter methods are used (mode propagation).

The combined background estimation and propagation [156] is also known as Sequential Kernel Density Approximation. In this case, the mean-shift mode detection from samples is used only at initialization time and later modes are propagated by adapting them with the new samples:

Heuristic procedures are used for merging the existing modes, without fixing the number of modes up front. These methods are faster than KDE and have low memory requirements.

Another way to build a background model is the minimum-maximum method. In [73, 72, 71], a pixel is marked as foreground if

or

where Max and Min represent the maximum and minimum pixel values, while D is the largest absolute difference between consecutive background frames. These parameters have to be initialized, usually over the first few seconds of the video, and are periodically updated for the scene parts without foreground objects.
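The two missing conditions plausibly take the W4-style form suggested by these definitions:

|Max(x, y) - I_t(x, y)| > D    or    |Min(x, y) - I_t(x, y)| > D

This is a reconstruction based on the surrounding text, not necessarily the exact notation of [73, 72, 71].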

After the foreground detection, morphological erosion is applied to remove noise and small regions. Finally, the holes in the foreground regions are filled by a closing operation.

A different approach is the eigenbackgrounds method introduced by A. P. Pentland. The main idea of this method is to use Principal Component Analysis (PCA) [29, 119] to reduce the dimensionality of the data. PCA is applied to a sequence of n frames to compute the eigenbackgrounds.

The eigenbackgrounds are computed using n frames, which are re-arranged as columns of a matrix A. Using this matrix we compute the covariance matrix C. From C, the diagonal matrix of its eigenvalues, L, and the eigenvector matrix, Φ, are computed. From these structures only the M most significant eigenvectors, representing the eigenbackgrounds, are kept.

To detect the foreground, the image I is first projected onto the M-eigenvector sub-space and then reconstructed as I'. The difference I-I' is computed: since the sub-space represents well only the static parts of the scene, the outcome of this difference highlights the foreground objects.
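A minimal numpy sketch of the eigenbackground idea follows; it is illustrative only, and the number of eigenbackgrounds and the threshold are assumed values:

import numpy as np

def train_eigenbackground(frames, n_eig=3):
    # stack the n grayscale training frames as columns of A
    A = np.stack([f.ravel().astype(np.float32) for f in frames], axis=1)
    mean = A.mean(axis=1, keepdims=True)
    # eigenvectors of the covariance, obtained via SVD of the centered data
    U, _, _ = np.linalg.svd(A - mean, full_matrices=False)
    return mean.ravel(), U[:, :n_eig]

def detect_foreground(frame, mean, eigvecs, threshold=25.0):
    x = frame.ravel().astype(np.float32) - mean
    x_proj = eigvecs @ (eigvecs.T @ x)      # projection onto the eigen-subspace (I')
    residual = np.abs(x - x_proj)           # |I - I'|: large for moving objects
    return (residual > threshold).reshape(frame.shape)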

Integral Image based background subtraction

The Integral Image based foreground detection method builds on the idea of using features instead of working directly on the raw image. As in the eigenbackground method, the background model is a feature image: for every frame the feature image has to be computed and compared to the background. The advantage of working directly on raw images is speed, while the drawback is that the subtraction considers only one pixel at a time, without taking its neighborhood into account. Starting from the fact that in most cases we are not interested in single pixels or very small foreground objects, we developed a method which looks at the current pixel and at its neighborhood, and handles shadows as well.

The Integral Image based foreground detection technique considers a pixel as foreground if the majority of the pixels in its neighborhood are also foreground. To determine this easily and quickly, we use rectangular features that aggregate the neighborhood pixels. The simplest feature that satisfies all conditions is the sum of the pixel and its neighborhood values, similar to an average. This sum can be computed very fast using an intermediate representation: the integral image, which was introduced by Viola and Jones [219].

Integral Image

The integral image [211, 199] at location (x, y) contains the sum of the pixels of the rectangle with coordinates (0, 0, x, y).

Figure 4. The Integral Image

Figure 4 shows the Integral Image, where every pixel is computed using the following equation:

The integral image can be computed in one pass over the image from left to right and from top to bottom:

and
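In the Viola-Jones formulation, the missing one-pass recurrences are usually written as follows (a reconstruction, not necessarily the thesis's notation):

s(x, y) = s(x, y - 1) + i(x, y)
ii(x, y) = ii(x - 1, y) + s(x, y)

where i is the input image, s is the cumulative row sum, s(x, -1) = 0 and ii(-1, y) = 0.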

The Integral Image can also be computed quickly for rectangles rotated by 45 degrees. The rotated Integral Image is shown in Figure 5.

Figure 5. Rotated Integral Image

The rotated Integral Image is computed with two passes over the image. The first pass runs from left to right and top to bottom:

and

The second pass over the image runs from right to left and bottom to top:

The feature image, which contains the rectangle sum features, is defined by the following equation:

where the normalizing term is the number of pixels in the neighborhood. The SumI feature can be calculated efficiently, for every size z, with four sums using the integral image.
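As an illustration (variable names are my own), the integral image and a rectangle sum with four look-ups can be computed as follows in Python:

import numpy as np

def integral_image(img):
    # zero-padded so that ii[y, x] is the sum of img[0:y, 0:x]
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, x, y, w, h):
    # sum of the w x h rectangle whose top-left corner is (x, y), via four look-ups
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]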

Figure 6. Rectangular features. The AR represents the area of the SumI feature

The distances L1-L2 and L3-L1 are equal to 2z. This can be generalized to rectangles of arbitrary length. The SumI feature can be computed for rotated versions too.

Figure 7. The rotated RSumI feature.

The rotated SumI feature is computed with the following equation:

where h is the projection of L1L2 and w is the projection of L2L4 onto the x axis. The point (x, y) is at the L2 position.

Estimating the Integral background image and shadow detection

In our case, the background image is substituted with a background sum-feature image. Every background feature is modeled by a Gaussian distribution described by its average and standard deviation. This model is maintained using the running average method:

In order to determine the foreground mask image we compare the current image features to the background model.

If the absolute value of the difference is bigger than a threshold Th2, the pixel is a foreground pixel; if its absolute value is between Th1 and Th2, it has to be verified whether it is a shadow or a foreground object; if it is below Th1, it is considered background.

Th1 is calculated as follows:

where the multiplier is a constant with a value between 2 and 3; a fixed value in this interval is used in the Integral Image based foreground detection approach. Using only the standard deviation to compute the threshold covers only gradual illumination changes, which is not sufficient. To improve the threshold calculation and to cope with sudden light changes as well, we introduced an additional component. This component reflects the global changes of the illumination and is computed using equation (3.48):

where the parameter acts as a gate; a fixed value is used in the system.

Detecting shadows is an extremely hard task and there are no perfect algorithms for it. The assumption for shadow detection is that shadow regions are semi-transparent, without any texture change. To check the texture, we simply look at the gradients in the vertical and horizontal directions and compare them to the background model. This is done using equation (3.50):

If the fraction is bigger than 0, the gradients were maintained in both the vertical and horizontal directions. The optimal value of the fraction is 1. Since the square features are approximated by a Gaussian, the fraction can differ from the optimal value by a tolerance; a fixed tolerance is used in our case. It would be more exact to compute this tolerance from the standard deviation of the rectangle feature, but that introduces more computational cost without much benefit.
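The following Python sketch summarizes the decision logic described above under stated assumptions (Th1 < Th2 and a gradient-ratio shadow test with a fixed tolerance); it is an illustration, not the exact thesis implementation:

def classify_feature_pixel(d, th1, th2, grad_ratio, tol=0.3):
    """d: |current sum feature - background sum feature|;
    grad_ratio: ratio of current to background gradient features (optimal value 1)."""
    if abs(d) > th2:
        return "foreground"
    if abs(d) > th1:
        # semi-transparent region with preserved texture is treated as shadow
        if grad_ratio > 0 and abs(grad_ratio - 1.0) <= tol:
            return "shadow"
        return "foreground"
    return "background"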

Discussions and experiments

In order to systematically compare the methods presented above, to measure their performance and to decide which one is the most suitable for our purpose, we need to identify some quality measures. The quality measures are in most cases task dependent. It is important to identify the future usage of the algorithm's results and, based on that, to decide which parameters are the most important.

In line with our aim of speeding up human detection and reducing the false detection rate, several cases can be distinguished. These cases are influenced by the type of human detection technique used and by the scene properties.

From the human detection techniques point of view we have two directions.

The first direction is to use the foreground detection to reduce the search area for possible humans. In this case, the detected foreground only defines a rectangular window which could contain a human. The human detection technique searches inside that window for the presence of a human, without taking into account the shape of the foreground.

The second direction is when the human detection technique also relies on the shape boundaries of the foreground. In this case, very precise foreground detection is needed, because this precision directly influences the outcome of the human detection module.

Even before showing the results of the performance analysis of the foreground detection techniques, we can state that there is no generally good foreground detection method suitable for every type of scene. The different scene properties influence the performance of the methods. Based on how the background model changes, we can identify several classes of scenes. The first and worst case is a dynamic background: the background is moving, but with a velocity different from that of the foreground. Here there are two sub-cases: in the first, the motion of the background is roughly parallel with the image plane (fixed or rotating cameras); in the second, the background can also change in depth (a moving camera or camera zoom). The second class is the static background, which again has several sub-cases: a completely static background, a periodically changing background (tree leaves, waves), and a static or periodically changing background with moments when the scene changes completely, without any relation to the previous background (e.g. scene cuts in movies).

Figure 8. Test examples for frame differencing in outdoor and indoor environment a) actual image, b) background image, c) foreground mask

The performance of a foreground detection method has several components. One component is the precision, or discriminative capability, of the foreground detection. The precision itself has two components: detection precision, which refers to a low probability of misclassifying a foreground pixel, and discriminative power, which means that a background pixel will not be classified as foreground. These two error types are known in the artificial intelligence literature as false negatives and false positives. The highest precision is achieved when both values are minimized; minimizing them is time consuming and sometimes not even possible. That is why it is important to measure them separately and precisely.

Figure 9. Test examples for Running average foreground detection in outdoor and indoor environment a) actual image, b) background image, c) foreground mask

To measure the detection precision we created manually labeled videos from indoor and outdoor environments. Frame by frame, we precisely marked the moving objects, and then computed the detection precision as the percentage of detected foreground pixels relative to the labeled foreground pixels.

Figure 10. Test examples for Running Gaussian Average foreground detection in outdoor and indoor environment a) actual image, b) background image, c) foreground mask

where P is the precision, F is the set of the labeled foreground pixels, and |F| is the number of elements in the set F; the remaining term can be computed as follows:
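A plausible form of the precision measure, consistent with the description above (detected foreground pixels as a percentage of the labeled ones), is:

P = |F ∩ D| / |F| · 100

where D denotes the set of pixels detected as foreground and F ∩ D is the term computed frame by frame. This is a reconstruction, not necessarily the thesis's exact notation.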

Figure 11. Test examples Min-Max method in outdoor and indoor environment a) actual image, b) background image, c) foreground mask

The discriminative power of the foreground detector is computed as the ratio between the incorrectly and correctly classified pixels.

where the first term is the discriminative power, N is the number of pixels of the image, and the remaining term can be computed as follows:

Figure 12. Test examples for Gaussian Mixture based foreground detection in outdoor and indoor environment a) actual image, b) background image, c) foreground mask

From our point of view, a high detection precision is more important than the discriminative power, in order not to miss any humans. However, a very low discriminative power increases the computational demand, because in that case additional filtering has to be used to eliminate small foreground objects and to decide whether the objects are humans or not.

Figure 13. Test examples for Eigenbackground based foreground detection in outdoor and indoor environment a) actual image, b) background image, c) foreground mask

In some situations even the best foreground detection methods fail. A good example is when the foreground pixels are similar to the background pixels. To eliminate the possible errors in these situations, a top-down approach has to be used. Usually these methods have higher computational costs than benefits; unfortunately, for the Integral Image based foreground detection method the computational speed could not be increased in this way.

Figure 14. Test examples for integral image based foreground detection in outdoor and indoor environments a) actual image, b) background image, c) foreground mask.

Examples of the results of the implemented foreground detection techniques are presented in Figure 8 to Figure 14. All figures have three parts: the actual image, the background image, and the computed foreground mask. In some cases the presented background image is only illustrative, because the background model is not stored as an image; in these cases we applied a reverse transformation to show how it would look. Even so, such background images cannot encode the whole information of the models.

The performance of the foreground detection techniques is influenced by several factors which cannot be assessed from single pictures, such as gradual illumination changes. All techniques try to address gradual illumination change by periodically updating the background. The question is how often the background should be updated. If the updates are too frequent, the response to illumination change will be good, but ghost objects may appear in the foreground mask. If the updates are too rare, the background might not change as fast as the illumination. The measured values obtained for the foreground detection techniques are summarized in the following table. The tests were made on indoor and outdoor image sequences and we computed an average of the measured values. The tests were performed on a 2.66 GHz Pentium Core2Duo PC.

Table 1 The comparison of foreground detection techniques with the performance: speed, detection precision, discriminative power

We also analyzed the memory requirements of the different methods, but noticed that the numerical results are not very meaningful, since they depend heavily on the image size and on the background model update variant, so we categorized them into three classes: Low, Medium and High. The results are presented in the following table.

Table 2. Memory Requirement categorization

Figure 15. The effect of quick illumination changes

We also tested the methods against quick illumination changes. This is an important aspect of foreground detection methods, because it reflects the robustness of the method. On a cloudy day, or next to a changing light source (a lamp switched on/off), the background characteristics are seriously altered. This results in a drastically increased number of falsely detected foreground regions and, in the worst case, the whole image could appear as foreground. Figure 15 shows some examples that demonstrate how the different methods respond to these conditions.

The fastest reaction among the reviewed methods certainly belongs to frame differencing, because no extra background model maintenance is needed, only a differencing and a thresholding. Its precision is low, because the method finds differences mostly along the contour of the moving object, and the size of that contour depends strongly on the velocity of the moving object. However, the discriminative power of the method is good, because the background model adapts instantaneously to the environmental changes. Figure 15 shows that this method has the best response to quick illumination changes. The method cannot deal with very slow or still foreground objects, even if the objects are motionless for only a few frames. Another shortcoming is that it does not deal with camera oscillation and high-frequency background changes.

The Running Average and the Running Gaussian Average have about the same complexity. The difference between them is that in the Gaussian Average method the threshold value is computed automatically and differentiated for different regions of the image. The results show that the Running Gaussian Average is more precise, but it requires more memory, is a little slower, and responds weakly to quick illumination changes. The response to slow illumination changes depends on the value of the learning rate parameter. Neither method can deal with high-frequency background changes or camera oscillations.

The Minimum-Maximum model is slower than the Running Gaussian Average because it involves many non-linear computations. One of its strengths is the automatic threshold computation. Similarly to the Running Gaussian Average method, it does not cope with multimodal background distributions. Its response to fast illumination changes is also weak.

The mean-shift method can effectively model a multimodal distribution without the need to assume the modes a priori, but it has a very high computational cost. Although there are several ways to reduce this cost, it is still considered one of the slowest methods. It handles camera oscillations well, while it adapts very slowly to quick illumination changes.

The Gaussian Mixture Model is one of the best methods. It is not too slow and in some cases achieves very good results. It handles multimodal backgrounds, so it can cope with high-frequency background changes, and it computes the threshold automatically. It deals with quick illumination changes only if they are repetitive. Another disadvantage is that the number of Gaussians has to be given a priori; the speed and the memory requirements depend on this number.

We also experimented with the eigenbackground method, using a training set of the 20 most recent images and 3 eigenbackgrounds. The quality of the results was good, but it depended significantly on the images used for the training set. When the current image contained a moving object in the same position as in a training image, the eigenspace could not completely remove the moving object and it appeared as a ghost in the foreground. The method does not cope with fast illumination changes.

The Integral Image based foreground detection method is comparable in speed with the Running Gaussian Average method. The method does not deal with multimodal backgrounds, but if the changes in the background are small it is invariant to them, since it works with features. The method has a good response to fast illumination changes and computes the threshold automatically. Its disadvantages are that small objects are not detected and that all detected objects appear smaller than in reality, although this removes the need for extra filtering of very small objects. An extra advantage of the method is that it detects and removes shadows. In the examples presented in Figure 14 it can be seen, especially on the outdoor test image, how this shadow detection technique works.

Figure 16. Results of the optical flow algorithm

We left the conclusions about optical flow to the end, since it differs from the other foreground detection methods. The optical flow gives us the objects' velocities; segmenting the image into foreground and background with this method requires further processing of the result. Computing the optical flow and then segmenting the result costs far too much, but in some cases, when all other methods fail, optical flow can be suitable. Such a case is when we have to deal with a non-static background.

To test the optical flow algorithm for foreground detection we have implemented the Lucas-Kanade optical flow variant.
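A minimal OpenCV sketch of sparse pyramidal Lucas-Kanade tracking is shown below; it only visualizes moving points (the input file name is hypothetical), while grouping them into a foreground mask would require the extra processing mentioned above:

import cv2
import numpy as np

cap = cv2.VideoCapture("test_sequence.avi")
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                              qualityLevel=0.01, minDistance=7)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    # keep only points that were tracked and actually moved
    moved = (status.ravel() == 1) & \
            (np.linalg.norm(new_pts - pts, axis=2).ravel() > 1.0)
    for p in new_pts[moved]:
        x, y = p.ravel()
        cv2.circle(frame, (int(x), int(y)), 3, (0, 255, 0), -1)
    cv2.imshow("moving points", frame)
    if cv2.waitKey(1) == 27:
        break
    prev_gray = gray
    pts = new_pts[moved].reshape(-1, 1, 2)
    if len(pts) == 0:                      # re-detect features when tracks are lost
        pts = cv2.goodFeaturesToTrack(gray, 500, 0.01, 7)
cap.release()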

Conclusions

The topic of this chapter was the preprocessing component of human behavior recognition systems. The scope of the preprocessing component is to speed up the human detection process and to reduce the false positive results without influencing the detection rate. Based on our studies and tests, we concluded that the best way to achieve this is to use a foreground or moving-object detection method.

Our contributions to this field are the following:

Identification of the challenges of this component

Definition of the performance measurements suitable to compare the different methods [210]

Implementation and comparison of 9 different foreground detection techniques [201]

A new foreground detection algorithm based on Integral Image [210]

The first part of this chapter describes the list of features and characteristics we have identified that a foreground detection method needs to possess. The list of features was selected based on the conclusions described in the related literature and based on the results of our experiments:

Detection precision

Discriminative power

Shadow removal capability

Memory requirement

Computational power requirement

The detection precision and discriminative power of a foreground detection method are influenced by situations like sudden illumination changes, background relocation and high-frequency background objects [210].

The second part presents the 9 methods we have implemented and tested:

Frame differencing,

Running average foreground,

Running Gaussian Average,

Min-Max method,

Meanshift based foreground detection,

Gaussian Mixture based foreground detection

Eigenbackground based foreground detection

Optical flow (Lucas-Kanade)

Integral image based foreground detection (our method)

The tests were performed on both indoor and outdoor image sequences to cover all challenging cases. The test results were compared based on the performance measurement model defined using the features identified in the previous phase.

We concluded after the tests that there is no generally good method that performs well in every situation. The best solution is to combine multiple methods and always use the one best suited for the specific scene.

Another finding was that foreground detection methods can be used successfully for scene change detection. To do so, we only need to determine the maximum acceptable percentage of foreground in the image; if the foreground exceeds this threshold, we consider that the scene is changing. If we encounter changing scenes very often, we can use optical flow to track the motion in the scene [201].

The last part of the chapter presents the novel method we elaborated. We worked out a new way of detecting foreground objects; after defining it, we implemented the method and performed all the tests done with the other methods as well. The results were compared and analyzed using the same performance measurement model.

Based on this comparison we showed that our Integral Image based foreground detection method is suitable for foreground detection with static backgrounds. It has the same precision as the Gaussian Mixture method, but it is 4 times faster. In combination with the Haar-feature based human detection method, the integral image is reusable for human detection as well, which further speeds up the detection process. The method also has one of the best discriminative performances [210].

Human Detection and Pose Estimation

The most important task of the human behavior recognition framework is human detection and pose estimation. The task can be defined as follows: in a given image or image sequence (video), identify the positions of the humans and their pose or body configuration. How this can be done depends on the image quality (information quantity) and on the restrictions applied to the system. This chapter presents three types of human detection techniques and the extensions which make them suitable for human pose estimation as well.

Introduction

The human pose detection task is one of the most important tasks, because it represents the measurement stage of the human behavior recognition process. The behavior understanding accuracy is strongly related to the human pose detection task: the more accurate the results of this process are, the more accurate the behavior understanding will be. It is also the most complicated task of the system, due to the high importance of its accuracy and the huge variability of human appearance.

There are several approaches to detect, in a frame, the human position and configuration. These methods can be categorized in different ways; the most common categories are the “component-based” methods and the “single detection window” analysis [62, 61].

The “component-based” methods detect the invariant object parts separately and check whether they are present in a geometrically natural configuration. This type of method has many variants. The part detection and the configuration matching can be done consecutively [109, 110, 142, 143] or by using a hierarchical detection framework. In a hierarchical approach, the body parts are detected in the order of their importance: if one of the basic parts is missed, the other parts are not searched for [62]. These systems have the ability to explicitly deal with partial occlusion. They are slower than single detection window based methods, because they have to detect more than one component. Some variants of the approach use a fixed model and do not handle multi-view and multi-pose cases, so they are fairly limited, similar to single detection window methods. Mostly the same algorithms are used to detect the human components as in the single detection window analysis approach. For component-based human detection systems see Mohan's work [126].

The other approach is the “single detection window” analysis. Its most significant feature is its speed, while its drawback is limited handling of partial occlusion. There are three major types of “single detection window” methods [126, 154], and likewise for the component detection in “component-based” approaches:

The first type of method uses 2D or 3D human models and tries to match them with parts of the image. It is difficult to create models that are general and simple enough, yet still capable of capturing every particular human motion. After the successful matching (a very time-consuming process), the pose detection follows.

The second type uses predefined image features and their relationships, which uniquely determine the human objects. This is applicable mostly to rigid objects.

The third type is the “example-based” methods, which use a labeled training set to learn to recognize humans. The key elements of this method are the training set and the training algorithm.

Human detection and pose estimation with “example-based” method

The “example-based” methods are among the most popular ones. They are popular because they appear to be very simple: only some examples need to be selected and a neural network has to be trained, then everything will work correctly. Despite appearances, creating a trainable structure that learns the task from examples is hard; even creating a good training set is complicated and requires a large amount of time and knowledge.

One of the most promising “example-based” techniques was introduced by Viola [219]. He applied a rapid object detection method to face detection and extended it to handle multi-view face detection as well [218]. They use Haar-like features for rapid feature extraction and the AdaBoost learning algorithm for feature selection and strong classifier creation. The first classifier was a cascade classifier, modified later into a tree classifier by Lienhart [103, 105, 104, 154].

The idea behind this approach is that the classifier should eliminate the majority of the negative inputs in the earlier stages of the classification, so that only the searched objects pass through all stages.

Initially the method was developed to detect human faces and was then extended to human bodies as well. However, there are significant differences between detecting human faces and human bodies, originating from the nature of the two types of “objects”. Faces are rigid objects, because their features (nose, eyes, mouth) are situated approximately in the same relative position to each other, while the human body is a semi-deformable object: it has some rigidity, since the body parts are fixed to the body, but the parts themselves are deformable. This semi-deformable nature of the human body significantly increases its in-class variability, resulting in very distinctive human class elements.

A common problem of all techniques presented before is that they do not handle this degree of in-class variability. A solution is to use more than one classifier, one for every type of human appearance. This solution requires a categorization of the training data. Without a reliable clustering method, this significant amount of work needs to be done manually. Since the data are not collected in a controlled environment, manual categorization can become prohibitively expensive and, because of the fundamental ambiguity in labeling different poses and views, the complexity of the work grows linearly with the number of classes. Manual categorization is also an error-prone procedure that may introduce significant bias into the training process.

Shan et al. [177] recently presented a novel framework which unifies categorization and classification. The drawback of the method is that in the first step it categorizes the input into a number of classes, and only after that are secondary classifiers used to decide whether the categorized inputs are humans or not.

Another common problem of these methods is that they have only two outputs. If we want to use one of these object detectors to detect and also estimate the human pose, we need to introduce another classifier for that purpose.

Haar feature

The Haar wavelets form a robust set of basis functions; a Haar feature represents the difference of intensity between neighboring regions [219]. Using the Integral Image, the value of a Haar feature can be computed very fast. The Haar features have two important properties:

The value of a Haar feature is unchanged if the picture is scaled down or up by a factor.

The computation of a Haar feature using the Integral Image takes the same time for every size.

These two properties make it suitable for classifiers that need to work in real time. Two sets of Haar features are used extensively: the basic set and the 45-degree rotated set, which can be calculated using the Integral Image and the 45-degree rotated Integral Image, respectively [103].

Figure 17. Edge, line and center-surround Haar features. Left: basic set; right: extended set [103]

The AdaBoost algorithm

The AdaBoost algorithm was proposed by Freund and Schapire [54]. The idea of boosting is to form a highly accurate prediction rule from weak classifiers by calling the weak learner repeatedly on different distributions over the training examples. A weak classifier is only required to be better than chance, and thus can be very simple and computationally inexpensive. Initially, all weights are equal; in each round the weights of incorrectly classified examples are increased, so that the examples which were poorly predicted by the previous classifier receive greater weight in the next iteration.

From every Haar feature a very efficient weak classifier can be built. To build a weak classifier hj(x) using the Haar feature fj(x), we need a threshold value and a parity pj to indicate the direction of the inequality (Equation 4.1):
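The standard form of this weak classifier in [219], to which Equation 4.1 presumably corresponds, is:

h_j(x) = 1 if p_j · f_j(x) < p_j · θ_j, and h_j(x) = 0 otherwise

where θ_j is the threshold and x is a detection sub-window.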

The threshold value is different for every feature and the optimal value is computed using the training algorithm.

The most significant property of AdaBoost is its ability to combine a set of weak classifiers into a strong classifier while efficiently reducing the training error.

Different variants of boosting are known, such as Discrete AdaBoost, Real AdaBoost, and Gentle AdaBoost [54]. All of them are identical with respect to computational complexity from a classification perspective, but have different learning algorithms.

The classifier for human detection is built using the AdaBoost algorithm and Haar-feature based weak classifiers [103].

The first step is creating the training set, consisting of significant human images (around 15,000), various non-human images (around 100,000) and a file which contains the category of every image (the expected output). First, the optimal threshold for every weak classifier needs to be computed. Using Algorithm 1 we can compute the weights and select the best weak classifiers to achieve the desired performance.
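A compact Python sketch of discrete AdaBoost with threshold (stump) weak learners is given below; it assumes a precomputed matrix F of Haar-feature responses (n_samples x n_features) and labels y in {0, 1}, and mirrors the scheme described above rather than the exact thesis code or Algorithm 1:

import numpy as np

def train_adaboost(F, y, n_rounds=200):
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)                  # uniform initial weights
    strong = []                              # list of (feature, threshold, parity, alpha)
    for _ in range(n_rounds):
        w /= w.sum()
        best = None
        for j in range(F.shape[1]):          # try each feature as a weak classifier
            for thr in np.percentile(F[:, j], [10, 25, 50, 75, 90]):
                for parity in (1, -1):
                    pred = (parity * F[:, j] < parity * thr).astype(int)
                    err = np.sum(w * (pred != y))
                    if best is None or err < best[0]:
                        best = (err, j, thr, parity, pred)
        err, j, thr, parity, pred = best
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))
        # increase the weight of misclassified examples, decrease the others
        w *= np.exp(-alpha * (2 * y - 1) * (2 * pred - 1))
        strong.append((j, thr, parity, alpha))
    return strong

def classify(strong, f):
    # f: feature vector of one detection window
    score = sum(a * (1 if p * f[j] < p * t else -1) for j, t, p, a in strong)
    return int(score > 0)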

The cascade classifier

The computational cost of a monolithic classifier is the same for every input image. In real life, most images can be rejected easily, while only some need more attention. The cascade classifier is based on this idea: it rapidly rejects most of the negative images, evaluates part of the images with only a few weak classifiers, and evaluates only the difficult images with the whole classifier. A direct consequence of this approach is a different evaluation time for different images and an overall speed-up of the whole classification procedure.

The cascade design process is driven by a set of detection performance targets [105]. If each stage classifier is trained for modest performance (f < 0.6 false detection rate per stage and d > 0.999 hit rate per stage), then the whole cascade will have the same classification performance as a monolithic classifier, but will be more than 10 times faster [167].
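As a rough, hedged illustration of why this works: for a cascade of K stages with per-stage rates f and d, the composite false alarm and hit rates are approximately F = f^K and D = d^K; with the values above and K = 20 stages, F ≈ 0.6^20 ≈ 3.7·10^-5 and D ≈ 0.999^20 ≈ 0.98.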

In the training stage, we need to set the desired performance used as a stop condition; to test the performance we also need a test set. Each stage is trained using only the negative images that were classified incorrectly by the previous stages, while all the positives are kept. In the case of the cascade classifier the trade-off is between filtering performance and processing speed: a high number of weak classifiers and stages gives the cascade classifier higher filtering performance but lower processing speed.

Figure 18. Cascade classifier

Human body Detector and Pose Estimation Tree

The classical cascade classifier was developed and tested for face detection and, without changes, is not suitable for human body detection, because of the high in-class variability of the human body. Since the human body is semi-deformable, can take various poses and can appear from diverse views, using a single strong classifier results in overtraining, while using a weak classifier makes it impossible to detect specific body states. This difficulty can be overcome by dividing the class into subclasses with specific features, so that specific patterns can be distinguished, and then using a specialized classifier for every subclass. With this approach, the classification complexity grows linearly with the number of subclasses. To keep real-time behavior, the detector has to be very fast. Our approach tries to resolve the contradiction between improving the detection performance (which needs more subclasses and therefore more processing time) and decreasing the processing time. Our starting point is the work of Viola and Jones [219], which uses a cascade classifier to preserve classification accuracy while the “coarse-to-fine” strategy reduces the computational complexity. The cascade classifier was later extended [103] by adding a pose estimator before the cascade classifiers to handle multi-view cases: this estimator estimates the pose of the possible face and chooses the correct face detector.

The human body detector and pose estimator tree tries to merge these two steps: pose estimation and classification. Using a pose estimator before classification would not preserve the coarse-to-fine strategy, and its pose estimation for “non-object” patterns would take too much time. Instead, we first eliminate all certainly “non-human” patterns and split the class into subclasses (groups of poses and views) only when it is necessary. By necessary we mean that the desired classification performance cannot otherwise be achieved, or only at a too high cost.

Figure 19. Tree classifier for detection and pose estimation

We used a binary tree classifier to realize this purpose. This tree can be viewed as a cascade classifier which merges the early common stages of multiple specialized cascade classifiers and inserts pose estimation stages after every stage where it is necessary to choose between specialized classification stages. With this merging we can obtain a very fast detector, because the majority of the false patterns are eliminated in the early stages of the classification. Another advantage of this method is the automatic estimation of the detected object's pose.

The classification and the pose estimation use the same Haar-like features as the original cascade classifier, introduced by Viola and Jones [219] and extended by Lienhart [103], because they can be calculated very fast using integral images.

The tree is built from two types of nodes: classification and estimation nodes. Both of them use Haar-like features to classify the input pattern. One difference between them is that the estimation nodes use only one Haar-like feature, while the classification nodes can use more than one feature to classify the input data. The second difference is how their outputs are processed. If a classification node gives a negative response, the input pattern is dropped and no further processing is done. If the node response is positive, its output is an output of the tree when the node is a leaf; otherwise it is the input of its child node. The outputs of estimation nodes are processed by one of the child nodes, left or right, or, in the case of a leaf node, they are also outputs of the tree. We refer to our classifier as the Pose Estimation Tree Classifier (PETC).

Building the training set

To obtain an efficient classifier, one of the main tasks is to create a proper training set. To get a proper training set we need to decide about the following:

input pattern size

human images preparation

the background image size

how to choose the significant images

how big is the necessary training set size

The input size of the images is important, because it determines the number of features used in the learning process. For example, in the case of face recognition, for a pattern of 24×24 pixels, 84,848 (BASE) features in the basic set, 111,360 (CORE) in the extended set, and 138,694 (ALL) features in the entire set [167] have to be evaluated. The relation between performance and image size is the following: larger images are more detailed and contain more information, but they need more memory and more features for evaluation. A higher number of features requires more computational power and a bigger training set.

Figure 20. Relation between of input pattern size, performance and processing time

There is no formula for computing the ideal size of the input pattern; the only way is to perform experimental analysis. The ideal size is the smallest size for which correct classifications can still be made. The search for the smallest input pattern size has two reasons. The first is that if the training images are smaller, fewer features are needed for classification, so the classification will be faster. The second is that small images do not contain many details, so the classifier will be more general and less prone to overtraining. According to the experiments reported in [104], for face detection the best pattern size is 24×24, because it has the lowest false alarm rate at the same hit rate. According to our experimental results, the optimal pattern size for human detection is 128×64. We also concluded that the optimal pattern size depends on the variety of the database used for training; other human detector approaches have obtained other optimal dimensions for the training image pattern. Face normalization is easy, because heads are rigid objects, while normalizing the human body is more difficult, since it can appear in different postures. There are many possibilities to crop human images; cropping only the significant part of the images results in images of different sizes, because humans can appear in a high variety of poses.

To normalize the human body, a reference has to be selected which is constant for every position of the body. Normalizing based on the height of the people proved to be ineffective, because a human's height can differ when sitting, making a step or standing (the height of the same human can be different in these situations). For the same reason, normalizing based on the width was also unsuccessful. Both tend to introduce extra in-class variability into the training set, and such normalization works only if the training set contains a single human pose. Finally, we concluded that the most constant feature in this respect is the height of the head. The head's height also changes across poses, but its variation was the lowest compared to all the other options.

During normalization we resized the human body so that the height of the head is always the same, then moved the human in the image so that the head coordinates are in the same position. The positive training images also contain a considerable amount of background; it is important to have many images of every pose with different backgrounds.

The size of the background (negative) images does not seem to be very important and is never explicitly specified; it can be taken to be the same as the positive pattern size, namely 128×64 pixels.

The example-based learning method needs a huge training set with a high number of positive and negative examples. To achieve good performance, the examples from the training set should be representative elements of the class. The positive examples are usually cropped manually. In the case of the human body it is important to have positive images for every human body configuration. Practice shows that about 15,000 human images are enough, but the difficult question is the number of backgrounds. The images of our own database were collected from public marked databases like INRIA and MIT and from surveillance videos, completed with manually cropped images. The background images were cropped from surveillance and documentary videos without human presence. One of the problems with the background selection is that too many similar backgrounds increase the learning time without increasing the performance. To eliminate similar backgrounds we normalized the negative examples and compared them using a simple similarity metric; from each group of similar images we kept only one. To increase the learning capabilities we also filtered the background images using the last trained stage of the classifier. To increase the generality of the classification we applied distortions to the training set (small random translations, rotations and resizing).

Figure 21. Examples from the training set (positive images)

Another issue is the size of the training set, which is proportional to the learning time and affects the performance of the classifier. For face recognition, Lienhart proposed a training set with 5,000 positive and 3,000 negative images [104]. Viola and Jones built their classifier with 4,916 faces and 10,000 non-faces selected randomly from a set of 9,500 images which did not contain faces [219]. Considering the possible face configurations and the much larger diversity of human poses, the size of the training set should be increased.

First we used 10,000 positive images and 27,000 negative ones. With this training set size the program used approximately 4 GB of memory and 240 s for the selection of one Haar feature. We used an Intel Core 2 Duo computer with a 2.6 GHz CPU and 3 GB of RAM; it needed more than 4 days for the selection of 1,500 features.

The detector was built to scan the input at multiple scales and locations. We used a step size of one pixel and a scale factor of 1.3. In order to achieve a false alarm rate of around 5×10-6, based on other researchers' experiments [218], we need millions of different background pictures.

If we try to use millions of different backgrounds, the processing time of one feature selection increases. A huge training set needs a huge amount of memory, which exceeds the usable RAM. If we use virtual memory created on the HDD to solve the memory problem, training the classifier takes more than 50 times longer, approximately one more month.

To reduce the training time of the cascade classifier we tried to reduce the training set without negatively influencing its performance. The starting point for our idea was the behavior of the cascade classifier: at its earlier stages it eliminates the majority of the backgrounds without eliminating the humans. A direct consequence of this behavior is that we cannot eliminate any of the images from the positive set, but we can remove those backgrounds that are already rejected by an earlier stage. Even using this filtering technique, at the early stages of the classifier we have a huge amount of background images that make the learning process very time consuming. To reduce the time needed for learning, we use only a randomly selected part of the backgrounds. This choice theoretically reduces the performance of the classifier, because the randomly chosen backgrounds influence the selected weak classifiers, so an early stage will eliminate fewer backgrounds than if it were trained with the whole set. Algorithm 2 presents the process of selecting the background images for the training process. Based on experimental results, the number of selected backgrounds always needs to be bigger than the number of positive examples.

Training the tree

We are given a feature set and a training set (positive and negative examples). There are two main tasks to be solved: pose estimation and classification. Since the training set contains humans in different postures and poses, the positive training set can be clustered. The question is how many clusters the positive training set has. Manually setting the number of clusters may not always give good results: if the number is too small, the desired detection performance may not be achieved, while for a higher number of clusters the detection process will be considerably slower.

So it is preferable to use a criterion for deciding the number of clusters needed for optimal detection performance and fast processing. The criterion we use is the lowest computational complexity needed to achieve a given hit and false alarm rate, also used by Lienhart [103, 105]. It is a recursive procedure, and the final number of clusters is decided only at the end of the training. To use this criterion we need a hierarchical clustering method. This method always splits the set into two subsets using only one feature, because the feature tells us how strongly a pattern is present in the image. There are several cases of how the features behave statistically:

The patterns are typical for the whole set. In this case the feature values are concentrated near a value and have a Gaussian distribution.

There are two groups: some of the training examples own the pattern and some do not. In this case the feature values are concentrated around two values (the mean values) and can be easily delimited.

There are more than two groups; every group owns the pattern to a different degree. In this case the feature values are grouped around the centroid values, but they cannot be easily delimited.

The pattern is present only incidentally and is not a characteristic feature. In this case the feature values have a uniform distribution.

The features used for splitting the clusters are selected from a subset of the whole feature set. This subset consists of features from case II and some from case III (only if there exist two neighboring groups that are well delimited from each other). The other features are not used for clustering.

The node training has two stages. In the first stage a node is trained using the AdaBoost method [146, 188]. The parent node determines the training set of the child node; in the case of the root node, the whole training set is used. The result of the training is a strong classifier with given false alarm and hit rates.

The second stage investigates whether a pose estimation step decreases the computational complexity or not. For that, we need to choose a feature for clustering the input training data. In the case of the root training, the value of every feature is computed for all the positive samples; at that moment the pose estimation feature set is the same as the complete Haar-like feature set. We verify every feature separately and, if the samples are distributed uniformly, the feature is removed from the set. An ordinary node receives from its parent node a set of features that are eligible for clustering and have not been used yet; this “eligible” feature set is constructed at the root training described above. The remaining features are used to cluster the positive training set into two groups (k=2) using the k-means algorithm, and we also compute the variance of each cluster. The next step is choosing the best feature based on the relevance criterion:

where the coefficients are weights, and the first term is the relative distance between the cluster centroids, calculated with equation (4.3):

where the constant represents the length of the domain and depends on the size of the feature, and the other two symbols are the two cluster centroids. The second term in equation (4.2) is the mean of the cluster variances, and the third term in (4.2) is the squared difference between the numbers of elements of the two clusters.

The feature selection algorithm is presented in Algorithm 3. The clustering feature set is used to build the pose estimation stages. The most discriminative feature is always used first to split the training set into two subsets; the used feature is then removed from the pose estimation feature set, so every feature is used only once on a root-to-leaf path.
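The following Python sketch illustrates the kind of scalar two-means split used here; the separation score is my own simplified stand-in, not equation (4.2):

import numpy as np

def split_by_feature(values):
    """values: 1-D array of one feature's responses over the positive set."""
    c = np.array([values.min(), values.max()], dtype=float)   # initial centroids
    labels = np.zeros(len(values), dtype=int)
    for _ in range(50):
        labels = np.abs(values[:, None] - c[None, :]).argmin(axis=1)
        if (labels == labels[0]).all():
            return c.mean(), labels, 0.0       # degenerate case: no real split
        new_c = np.array([values[labels == k].mean() for k in (0, 1)])
        if np.allclose(new_c, c):
            break
        c = new_c
    spread = np.mean([values[labels == k].var() for k in (0, 1)])
    score = abs(c[1] - c[0]) / (np.sqrt(spread) + 1e-6)        # larger = better split
    threshold = c.mean()
    return threshold, labels, score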

After creating the pose estimation step, the training set is split using the selected feature, and for every subset a new classification stage is trained.

The computational complexity depends linearly on the number of weak classifiers and the number of pose estimation stages. We use an estimation stage only if the total number of features used by the resulting nodes is less than the number used by the monolithic classifier.

Each branch receives the corresponding subset of the training data. This procedure is used until the target depth of the tree is reached. The classifier node training algorithm is presented in algorithm 4.

When we reach the target depth and the pose estimation does not have enough resolution, we can continue adding extra pose estimation stages until the desired resolution is reached. In the tree, every classification node has a positive leaf or branch and a negative one associated with the corresponding set; both outputs of a pose estimation stage are considered positive. At the end of the training we have to label every positive leaf manually. If both leaves of a pose estimation node have the same label, one node is removed and the remaining leaf receives the common label; this is repeated until the tree has no more pose estimation nodes with two identically labeled leaves.

Experiments and discussion

In order to prove the performance of the proposed PETC we tested it on our videos and on different video sequences from public databases. The tests were performed on a Pentium Core 2 Duo 2.6 GHz computer with 3GB memory.

We tested the algorithm using different structures and then compared them with other existing methods. The classifier structures used by Viola and Lienhart can be deduced from the publicly available classifiers stored in XML files, but they were not directly useful for us, because those classifiers were trained for faces; we used from them only some performance parameters to retrain the classifier.

Analyzing the published classifiers from the stored in xml files we found out that: the number of used stages of a classifier varies between 16 and 46, the mean value is about 22-23 stages. The first stages contain 2-10 features, whereas the last stages contain 100-200 features. We have concluded that a good classifier should have more than 1,000 features [167]. Even if Viola and Jones mentioned [219] that during training they elaborated a high number of improvements to decrease considerably the training time these results were never published.

The available classifiers were trained for face detection, and then we have modified the size of the detection window to be suitable for human detection as well. The direct consequence of the increased window is the increased number of features. The classifier trained for the human detection contains more than 3000 features with the same structure.

At the beginning we are interested in the performance of the classifiers. In most publications the detection results are presented in tables. Those tables catch the best value of the Receiver Operating Characteristic (ROC) curve. Using the ROC curve we are able to compare not only a value representing a “one moment” behavior of the classifier, but also the variation of the detection rate related to the variation of false alarms [167]. Usually the publishers are not attaching the used database and the measure methodology so we should train the classifier using our database and methodology. The performance of a detector can be very good but this also implies a huge amount of false alarms. When we decrease the false alarms the performance of the system will decrease, too.

The ROC curve of the three classifiers () shows very well the performance of the classifiers. The simple cascade classifier has its limitation and even if we permit more false detections its sensitivity saturates at 70%, because the cascade classifier cannot handle high intra-class variation.

Figure 22. The ROC curve for the three classifiers

The difference between Lienhart's tree detector and PETC seems to be small. Even though the two curves look similar, there are important differences. In the case of PETC, a detection counted as correct means more than the input being classified correctly: it also means that the pose was estimated correctly. The proportion of inputs correctly classified as human is almost the same for the two detectors, but the incorrectly estimated poses amount to only 1-2%; if we add this percentage to the sensitivity, PETC is more sensitive than Lienhart's method. This can be explained by the fact that we used only a limited number of poses. If we increase the number of poses in the training set, the intra-class variability also increases, the sensitivity of Lienhart's tree decreases, and its ROC curve moves closer to the diagonal line. Another difference is that PETC has a lower false alarm rate at the same sensitivity as Lienhart's tree detector. The test results on our database for the three classifiers are the following:

Table 3. Performance parameters on our database

In Figure 23 we present some images labeled by the classifiers. We can see that the simple cascade classifier does not detect all the people in the image and also produces more false detections. In this case the false detections of Lienhart's tree detector and of PETC are equal, but PETC detected all the people in the image.

Figure 23. Processed images – our database: a) cascade classifier, b) Lienhart's Tree, c) PETC

A second test was made on the INRIA database. In this case the humans' positions were labeled, so the detection results were compared to the labels automatically. A result was accepted if the detection window contained more than 90% of the labeled region and the area of the window was less than 120% of the area of the labeled region. For pose estimation we labeled the test set and compared the labels to the results.
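
This acceptance rule can be written directly as a box test; a minimal sketch, assuming boxes given in (x, y, w, h) form:

    def detection_accepted(det, gt):
        # det, gt: (x, y, w, h); accept if det covers >90% of gt and det's area is <120% of gt's area.
        ix = max(0, min(det[0] + det[2], gt[0] + gt[2]) - max(det[0], gt[0]))
        iy = max(0, min(det[1] + det[3], gt[1] + gt[3]) - max(det[1], gt[1]))
        inter = ix * iy
        return inter > 0.9 * gt[2] * gt[3] and det[2] * det[3] < 1.2 * gt[2] * gt[3]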

Table 4. Performance parameters on INRIA database

The difference between the performance parameters measured on our database and on the INRIA database comes from the fact that INRIA contains many more poses than our database. It is important to mention that we trained all three detectors with the same database. Another interesting aspect of the results is the recognition speed. Although the cascade classifier has the simplest structure, it is not faster than PETC; the explanation is that during training the simple structure was compensated with more weak classifiers, so within a stage the cascade classifier evaluates more features than PETC. The second interesting aspect is that Lienhart's detector is slower than PETC despite PETC having extra nodes in the tree. To explain this, we tracked the behavior of the tree during detection and observed that when the tree has many branches, depth-first search is slower than evaluating a few extra features to decide on which branch to continue evaluating the input image region.

Figure 24. Processed images – INRIA database: a) cascade classifier, b) Lienhart's Tree, c) PETC

Another aspect we observed is that, with depth-first search in Lienhart's tree detector, the same human pose is not always evaluated on the same branch, which leads to the conclusion that this detector is not suitable for human pose estimation.

While creating the training set we observed, by training with different versions of the database, that the performance of PETC is strongly influenced by the degree of normalization of the training set. One unresolved problem for this detector is that some human poses do not fit the image aspect ratio we chose. Changing the ratio is not always a good choice, because it may introduce unnecessary information into the training set, which can cause the training to lose convergence.

Template based human detection methods

In this subchapter we present a template-based human detection method. The basic idea behind every template-based method is that we choose one or more representative templates and compare every part of the image with them. Compared to the "example-based" methods, the information about the class is not coded in the structure and parameters of a classifier, but kept in the templates. On the one hand we do not have to create a training set; on the other hand we need to select representative templates, and this requirement is very hard to fulfill, because people can appear in many poses and in different clothes, resulting in a high variety of possible appearances of the human body and making numerous templates necessary. To reduce the number of templates, the matching should not be performed directly on the image pixels, but on one or more features that are preferably invariant to some of the appearance variations. Even so, the number of required templates remains high. Although matching a single template to the image is faster than evaluating a trained classifier, matching all templates repeatedly becomes much slower; to make the method comparable in speed with the "example-based" techniques, the set of templates to be matched must be reduced by a filtering method.

In the literature there are several attempts to use algorithms that reduce the template set. Gavrila [40, 41] uses a hierarchical structure to match the templates. For crowded scenes, Leibe et al. [6] used chamfer matching to detect pedestrians, combining it with segmentation to prevent false alarms. A hierarchical template representation was also used by Stenger et al. [8], based on bottom-up clustering with the chamfer distance. The majority of researchers have used chamfer matching to match the templates to the image. The matching process is presented in Figure 25.

Figure 25. Matching process

Distance transformation

Distance transforms represent an important tool for computer vision, image processing and pattern recognition. A distance transform of a binary image specifies the distance from each pixel to the nearest non-zero pixel.

Depending on the distance metric, there are various ways of computing the distance transform. In image processing the information is propagated using L1, L2 or chamfer metrics, the L2 (Euclidean) metric being the most common one. Chamfer metrics approximate the Euclidean distance. There are also more complex distance metrics than the chamfer, which can make the distance computation more robust to noise, but they are slower than chamfer metrics.

Distance transforms play a central role in the comparison of binary images, particularly for images resulting from local feature detection techniques, such as edge or corner detection. For example, both the Chamfer [20, 58] and Hausdorff [36] matching approaches make use of distance transforms in comparing binary images.

The general form of the distance transform is [148]:

where the points lie on a regular grid and the distance between them is measured with a given metric.

Chamfer distance

The chamfer distance gives a reasonably good approximation of the Euclidean distance using elementary displacements [58]. The chamfer distance and many other traditional distance transforms use a set of object points on a grid, where the grid represents the pixels of the image; each grid location is associated with the distance to the nearest object point.

In this case the distance transform uses the following alternative definition, which enables its sequential computation:

where an indicator function of membership in the object set is used to initialize the starting distances.

The idea of this method is that instead of explicitly computing and minimizing the distances of all image points from all object points, we repeat simple increment operations. Through these increments the local distances propagate and yield the global distance. The sequential computation of the distance is also known as the "chamfer distance".

The propagation is made by fixing a set of elementary displacements and applying a weight to them in each step to approximate the Euclidean distance. For the elementary displacements we can use 3×3 or 5×5 neighborhoods (masks).

Figure 26. General masks

The parameters in the masks have the following linear constraints:

3×3 mask: b<2a and b>a

5×5 mask: 2a<c, 3b<2c and c<a+b

Borgefors demonstrated in [58] that if we use a=3 and b=4, the maximum difference between the Euclidean distance and this approximation is about 8 percent.

The sequential distance transformation algorithm starts from a 0/infinity image, where the edge pixels are set to 0 and all other pixels to infinity; then two scanning passes are made over the image. The first pass starts at the top-left and finishes at the bottom-right:

where the local distances are defined by the 3×3 or 5×5 masks. The second pass starts at the bottom-right and ends at the top-left corner of the image:

The image keeps track of the local distances and only at the end of the second pass becomes equal to the real distances.

AL and BR are defined in the following masks:

Figure 27. Forward and backward neighborhoods
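
For reference, a compact sketch of the two-pass 3×3 chamfer transform with the a=3, b=4 weights (assuming a binary edge image with non-zero values at edge pixels):

    import numpy as np

    def chamfer_dt_3x3(edges, a=3, b=4):
        # Two-pass chamfer distance transform with Borgefors weights a=3, b=4.
        h, w = edges.shape
        INF = 10**9
        d = np.where(edges > 0, 0, INF).astype(np.int64)
        # forward pass: top-left to bottom-right (AL mask)
        for y in range(h):
            for x in range(w):
                if y > 0:
                    d[y, x] = min(d[y, x], d[y-1, x] + a)
                    if x > 0:     d[y, x] = min(d[y, x], d[y-1, x-1] + b)
                    if x < w - 1: d[y, x] = min(d[y, x], d[y-1, x+1] + b)
                if x > 0:
                    d[y, x] = min(d[y, x], d[y, x-1] + a)
        # backward pass: bottom-right to top-left (BR mask)
        for y in range(h - 1, -1, -1):
            for x in range(w - 1, -1, -1):
                if y < h - 1:
                    d[y, x] = min(d[y, x], d[y+1, x] + a)
                    if x > 0:     d[y, x] = min(d[y, x], d[y+1, x-1] + b)
                    if x < w - 1: d[y, x] = min(d[y, x], d[y+1, x+1] + b)
                if x < w - 1:
                    d[y, x] = min(d[y, x], d[y, x+1] + a)
        return d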

Fast distance transform computation

It is important to compute the distance transform fast and accurately. In this subchapter we present some methods for the fast computation of the distance transform. In the following subchapter we present a novel method and compare its performance with the fast methods described here.

The dual scan line propagation method [151] splits the sequential (chamfer) distance computation and computes the distance sequentially and separately for every direction.

This method takes advantage of the fact that, for two neighboring data points on a scan line passing through an object point, the associated minimum distances differ by one step; they cannot be equal [151].

In other words, we can compute the distance in one direction using a counter that counts the distance from the last object point and is reset at every object point. The distance can also be computed in the inverse direction; the only difference is that the new counter value has to be compared to the previously stored distance, and the minimum of the two is kept as the new minimum distance of the current point. The algorithm is presented in Algorithm 6 and Figure 28.

Figure 28. Back-and-forth scanning on one direction [151]
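
A simplified sketch of the counter-based dual scan is given below; it uses only the horizontal and vertical scan directions and unit steps, whereas the full method adds further directions and proper distance weighting:

    import numpy as np

    def dual_scan_1d(line, inf=10**9):
        # Back-and-forth counter scan along one line of a binary edge mask.
        n = len(line)
        dist = np.full(n, inf, dtype=np.int64)
        counter = inf
        for i in range(n):                      # forward scan
            counter = 0 if line[i] else min(counter + 1, inf)
            dist[i] = counter
        counter = inf
        for i in range(n - 1, -1, -1):          # backward scan, keep the minimum
            counter = 0 if line[i] else min(counter + 1, inf)
            dist[i] = min(dist[i], counter)
        return dist

    def dual_scan_dt(edges):
        # Apply the dual scan along rows and columns and keep the minimum (two-direction sketch).
        rows = np.vstack([dual_scan_1d(r) for r in edges])
        cols = np.vstack([dual_scan_1d(c) for c in edges.T]).T
        return np.minimum(rows, cols)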

To guarantee the minimum distance from every direction, we need to apply the dual scan in every direction normal to the object data point surface (Figure 29).

Figure 29. Multiple scanning directions. Either the scan direction is changed or the data space is rotated. [151]

In practice, there is a limited number of normal directions due to the rasterization of the surface and the confined size of images. We emphasize that each additional direction refines the estimated distance transformation values [151].

The wave-propagation method has as its starting point the wave-propagation based segmentation [53]. The distance computation starts from each object point and moves in the direction normal to the surface. The method uses three labels to group the data points at each step: processed, active, and unprocessed. The computation starts from the object boundary by marking the boundary points as active and setting their distances to zero, f(p) = t = 0. Then it propagates the wave front using the active set of points until no data points remain in the active set. At each step the counter t is incremented to t+1 and assigned as the distance of all points in the new active set, f(p) = t. The immediate unprocessed neighbors of the points in the active set are collected into a new active set; the points of the old active set are marked as processed, and the next step is iterated [151].

Figure 30. Fast wave-propagation method [151]
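
A minimal sketch of the wave-front propagation with unit (chessboard) steps over the 8-neighborhood is shown below; the actual method weights the propagation to better approximate the Euclidean distance:

    from collections import deque
    import numpy as np

    def wave_propagation_dt(edges):
        # BFS-style wave-front distance transform: unit steps over 8-neighbors.
        h, w = edges.shape
        dist = np.full((h, w), -1, dtype=np.int64)
        active = deque((y, x) for y in range(h) for x in range(w) if edges[y, x])
        for y, x in active:
            dist[y, x] = 0
        t = 0
        while active:
            t += 1
            nxt = deque()
            for y, x in active:
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx2 = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx2 < w and dist[ny, nx2] < 0:
                            dist[ny, nx2] = t          # unprocessed neighbor joins the new front
                            nxt.append((ny, nx2))
            active = nxt                               # old front becomes processed
        return dist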

Pseudo Parallel Computation of the Distance Transform

We can extend the sequential algorithm to a pseudo parallel algorithm which can be executed on multi-processor or multi-core systems. The parallel algorithm for computing the distance transformation proposed by Borgefors [20] is not optimal on such systems, while the algorithm proposed by us splits the sequential distance transformation into a number of equal tasks. The number of tasks has to be equal to or smaller than the number of processors or processor cores.

The basic idea of our approach, the Pseudo Parallel Computation of the Distance Transform (PPCDT), is to split the image into equal vertical regions, compute the distance transformation independently for every region using the sequential algorithm, and at the end merge these regions into one distance image. Splitting the image amounts to defining the regions. For calculating the distance transformation of the defined regions we use Algorithm 6, with adjusted start and stop conditions of the four scanning statements. For merging two split regions at column z we use Algorithm 7.
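
The split-and-merge idea can be sketched as below, reusing chamfer_dt_3x3 from the earlier sketch; the merge here is a simplified horizontal sweep standing in for Algorithm 7, which is not reproduced:

    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    def ppcdt(edges, n_strips=4, a=3, b=4):
        # Pseudo-parallel chamfer DT sketch: independent per-strip transforms, then a merge.
        h, w = edges.shape
        bounds = np.linspace(0, w, n_strips + 1, dtype=int)
        strips = [edges[:, bounds[i]:bounds[i + 1]] for i in range(n_strips)]
        with ThreadPoolExecutor(max_workers=n_strips) as pool:
            parts = list(pool.map(chamfer_dt_3x3, strips))     # per-strip, in parallel
        d = np.hstack(parts)
        # propagate distances across the strip boundaries: left-to-right, then right-to-left
        for x in range(1, w):
            d[:, x]   = np.minimum(d[:, x],   d[:, x - 1] + a)
            d[1:, x]  = np.minimum(d[1:, x],  d[:-1, x - 1] + b)
            d[:-1, x] = np.minimum(d[:-1, x], d[1:, x - 1] + b)
        for x in range(w - 2, -1, -1):
            d[:, x]   = np.minimum(d[:, x],   d[:, x + 1] + a)
            d[1:, x]  = np.minimum(d[1:, x],  d[:-1, x + 1] + b)
            d[:-1, x] = np.minimum(d[:-1, x], d[1:, x + 1] + b)
        return d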

To verify the performance of the pseudo parallel computation of the chamfer distance we carried out experiments and compared the results with those of other existing methods: the basic chamfer distance, the dual scan line propagation method and the wave propagation method. For testing we used different images; the main difference between the test images was the number of edges.

Table 5. Distance transforms precision

During the tests we compared the correctness of the distances (Table 5) and the processing time (Table 6). For the distance correctness we used the basic chamfer distance transform as reference; the comparison error was computed as the mean distance difference per pixel.

Table 6. Distance transformation computation time

Chamfer matching

Chamfer matching is a technique proposed by Barrow et al. [14] for finding the best fit between the edge points of a template and of an image by minimizing the distance between them. From the template image the edge pixels are extracted and converted into a list of coordinate pairs. From this list, edge points that uniformly cover the edge are selected; the selected list of coordinate pairs is called a polygon. The matching measure used to find the best fit is an average of the distance-image values that the polygon hits. In our case we used the root mean square average:

where the summed values are the distance-image values at the pixels hit by the polygon and n is the number of coordinates of the polygon. The average is divided by 3 to compensate for the unit distance 3 used in the distance transformation. We used this measure because it gives fewer local minima than other averages [59].

The position of the polygon is determined by the transformation in equation (4.10), a parametric equation that translates and rotates the polygon.

where rot is the rotation angle and the remaining two parameters are the translations along the two axes.
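
A minimal sketch of the matching measure, combining the transformation of equation (4.10) with the RMS average of equation (4.9) (nearest-pixel sampling and clipping at the image border are simplifying assumptions):

    import numpy as np

    def chamfer_score(dist_img, polygon, rot=0.0, tx=0.0, ty=0.0):
        # RMS chamfer score of a template polygon placed on a distance image.
        # polygon: (n, 2) array of (x, y) template edge coordinates.
        c, s = np.cos(rot), np.sin(rot)
        x = c * polygon[:, 0] - s * polygon[:, 1] + tx       # rotate, then translate
        y = s * polygon[:, 0] + c * polygon[:, 1] + ty
        xi = np.clip(np.round(x).astype(int), 0, dist_img.shape[1] - 1)
        yi = np.clip(np.round(y).astype(int), 0, dist_img.shape[0] - 1)
        vals = dist_img[yi, xi].astype(float)
        return np.sqrt(np.mean(vals ** 2)) / 3.0             # divide by 3 for the a=3 unit distance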

Matching high number of templates: template space

The human detection is difficult because people can appear in a variety of poses and views. The direct consequence of using chamfer matching in the detection process is that we need to deal with a high number of templates.

We use templates like the ones in [201], shown in Figure 31.

Figure 31. Human templates

Looking at the templates in Figure 31, it is clearly visible that there are similarity relations between them. To speed up the search, many researchers organize these templates into a hierarchical structure and perform a coarse-to-fine matching based on it.

We propose a more general ordering of the human templates which speeds up the process by reducing the number of matches. Considering that the high variability of human shapes derives from motion and from the viewing angle, our hypothesis is that one can create a template space in which templates representing consecutive movements are always neighbors.

In this space the templates represent discrete states and they are not uniformly distributed. We noticed that the template density around some states is higher than around others, meaning that from some positions humans can move in more directions than from other positions. This observation helped us to create an optimal hierarchy of templates from this template space by always choosing the center template of the dense zones. The distance between the chosen templates determines the hierarchy level.

Figure 32. Template splitting regions

To take advantage of the high correlation between the templates, we represent the template space as a finite state machine, where every template is a state. After we have identified the current state (that is, the template with the smallest distance) we do not need to search the entire space anymore, but only the neighboring states. This is already a big improvement, but to reduce the number of selectable states even more, we define a transition criterion between every state and its neighbors. For this purpose we split the template into six regions (see Figure 32) and check the modification of each region separately. This approach considerably reduces the number of templates which need to be checked.

Figure 33. Example of transition criteria parameter

In every region, the magnitude of the force along x and y needed to move the contour from the initial position to the current position is measured separately. This way the transition criterion has twelve parameters. Even with this many parameters there are cases when more than one neighboring state is eligible to be checked.
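
The state-machine view of the template space can be sketched as below (the twelve-parameter transition criterion that further prunes the candidate list is omitted; chamfer_score is the sketch given earlier):

    class TemplateSpace:
        # Finite-state template space: after a match, only neighbor states are evaluated.
        def __init__(self, templates, neighbors):
            self.templates = templates          # state id -> (n, 2) polygon array
            self.neighbors = neighbors          # state id -> list of neighboring state ids

        def best_next_state(self, dist_img, current_state, rot=0.0, tx=0.0, ty=0.0):
            candidates = [current_state] + self.neighbors[current_state]
            scores = {s: chamfer_score(dist_img, self.templates[s], rot, tx, ty)
                      for s in candidates}
            best = min(scores, key=scores.get)  # template (state) with the smallest distance
            return best, scores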

The number of templates in the space is an issue which needs to be addressed: if it is high, the technique becomes very memory-consuming, while if it is too low, the decision mechanism becomes more complex.

Human detection and tracking system with pose estimation and experiments

To better illustrate our method, Chamfer Matching using Hierarchical and Motion Space templates (CHMS), we apply it to people detection. In Figure 34 we present the architecture of the human detection and tracking system.

First we extract the edges from the current frame with a Canny edge detector, then on the edge image we apply a distance transform.

In the memory module the last matched positions, sizes, and template space states are saved. Based on the information provided by the memory module, the region chooser selects the region of interest. The new region provided by the memory module is the extended region of the human location from the previous cycle; the extension is made in every direction and is proportional to the frame rate and the maximum motion speed.

At system startup, the entire image is searched for people. If we have a static background we use the background subtraction module, which shows whether new people or other moving objects appear in the image. If we work with a moving camera we search only at the image borders.

Using the distance image region, the chamfer matching module matches the template provided by the template chooser module. The decision maker then decides whether the match is acceptable. If the result is not accepted, a command is sent to the template chooser to select another template using the result set provided by the chamfer matching module. If the template match is accepted, the result is saved by the memory module. The starting point of every search is the previous cycle's best matching template. At initialization or during border search, the coarse-to-fine strategy with templates from the top level of the hierarchy is used.

Figure 34. Human detection system

We tested the system on a Pentium 4 processor at 2.6 GHz with 512 MB memory, running the Windows XP operating system. For image acquisition, high quality IP cameras were used with an image resolution of 640 x 480 pixels, securing a minimum detection rate of 15 fps; the maximum detection rate depends on the number of tracked people. We used around 100 human templates to construct the template space. The tests covered various indoor and outdoor environments. An example of the results is presented in Figure 35.

Figure 35. System output image

We have also investigated the speed of the matching process, and then the results were compared to the results of the coarse-to-fine hierarchical chamfer matching method. The results are presented in Table 7.

Table 7. Performance parameters for template matching on our database

The human detection rate is almost the same for our template database representation and the hierarchical representation. Differences are in the false positive rate and in the pose estimation correctness.

Since this representation works only in the case of continuous motion, when we have to detect people without prior information or in non-consecutive frames, CHMS falls back to the hierarchical approach. The speed boost of CHMS is notable.

Results and discussion

In this subchapter we described our study of the template matching approach to human detection and pose estimation. We chose the chamfer matching technique to accomplish our goal of human detection and pose estimation. During the implementation we identified two parts of the technique to which we contributed: the distance transform computation and the template set representation.

We studied some of the best known distance transform computation methods and proposed the PPCDT method, which we compared to the existing ones. The results of this comparison are presented in Tables 5 and 6. Based on the results we can say that the PPCDT method performs like the existing methods, but computes the distance with lower accuracy than the basic chamfer distance transformation; the error occurs where the sub-images are merged. We can also observe that in cluttered scenes the errors decrease. Another interesting method is the dual scan line method, which is more complex than the basic distance transform but faster due to its intrinsic parallelism.

The template-based methods have two weak points: the selection of the features used for comparison, and the selection of the templates together with the search algorithm for finding the best match.

A more general ordering of the human templates which speeds up the process by reducing the number of matches was proposed. It is based on a template space where the templates representing consecutive movements are always neighbors.

Based on the experiments, the chamfer matching method performs well in human detection applications. However, the technique suffers from mismatches on cluttered backgrounds: the main negative effect of using the chamfer distance is the risk of increased false alarms in backgrounds with a high level of clutter.

During the experiments we measured the influence of the image structure on the detection performance. We measured the homogeneity of the image by computing the percentage of edge pixels. This was computed over 5×5 overlapping image regions, resulting in an array that shows how cluttered each region of the image is. For every image region we also computed the average homogeneity value relative to a template; the higher this value, the more cluttered the region and the more edges it contains.
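
The clutter measurement can be sketched as a sliding-window edge density (the window size and the plain nested loops are illustrative; the comparison against a template is omitted):

    import numpy as np

    def clutter_map(edges, win=5):
        # Per-region clutter measure: fraction of edge pixels in overlapping win x win windows.
        e = (edges > 0).astype(float)
        h, w = e.shape
        out = np.zeros((h - win + 1, w - win + 1))
        for y in range(out.shape[0]):
            for x in range(out.shape[1]):
                out[y, x] = e[y:y + win, x:x + win].mean()
        return out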

Figure 36. Chamfer matching performance related to the image homogeneity.

Figure 36 presents some interesting aspects of the performance evaluation related to image homogeneity. We can see that the recognition accuracy increases when the image contains more edges, but the false positives increase too, because the distance between templates and image regions erroneously decreases. Another aspect of cluttered scenes is that the pose estimation becomes unreliable, giving random results. From these tests we can already conclude that in this form the method cannot be applied to cluttered scenes, but it can be used successfully when a background subtraction method can be applied to the image.

One of the main advantages of the chamfer matching method is that at the end of the matching process, besides the position we also know the human's pose (attitude). The attitude estimation and the matching are done at the same time: to get the pose we only have to categorize the templates.

Experiments demonstrated that the novel representation of the human templates outperforms the hierarchical method when prior information is available, and performs similarly to the hierarchical method when no prior information is available.

Pictorial Structures

The last type of human detection method investigated in this thesis is a part-based detector. To achieve reliable detection across a wide variety of poses we used strong body part detection and a Pictorial Structure [142] based body representation introduced by Felzenszwalb [143]. The detection process has two steps: the first is the detection of all body parts; the second is matching the parts to a model, which is the Pictorial Structure.

We applied a strong discriminative detector for body part detection that uses dense appearance representations based on shape context descriptors [110] and AdaBoost [109]. These kinds of detectors have been used in the literature for pedestrian detection [109, 145, 6], but in those cases the appearance model is simpler [200].

Definition of pictorial structures

Objects are modeled by a collection of parts in a deformable configuration [142], with "spring-like" connections between pairs of parts. These connections model the spatial relations between the parts. To detect an object one can use the appearance and the spatial relationships of the individual parts. The best match of a pictorial structure depends on how well each part matches at its location and on how well the locations agree with the deformable model. Matching a pictorial structure does not involve separate decisions about the locations of individual parts; it is about finding the global minimum of an energy function without initialization.

The pictorial structure model can be represented as an undirected graph G = (V, E).

The parts are the vertices V = {v1, …, vn}.

The edges E represent the relationships between parts (springs); an edge (vi, vj) ∈ E indicates a connection between parts vi and vj.

An instance of the object is given by its configuration L = (l1, …, ln), where li is the location of part vi.

mi(li) measures the cost of a mismatch when part vi is placed at location li.

dij(li, lj) measures the cost of deforming the model when placing vi at location li and vj at location lj.

To find the best match of an object configuration within an image we find the configuration L* that minimizes [251]:
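
Written out, following the standard formulation of [142, 143], the minimized energy is

    L^{*} = \arg\min_{L} \Big( \sum_{i=1}^{n} m_i(l_i) + \sum_{(v_i, v_j) \in E} d_{ij}(l_i, l_j) \Big)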

To solve the minimization problem efficiently, the graph G must be acyclic, that is, a tree; the search can then be finished in polynomial time O(nh²), whereas with n parts and h possible locations there would otherwise be h^n possible configurations. The deformation cost has to be restricted to the Mahalanobis distance between transformed locations mapped to points on a grid,

where the transformed locations are represented as positions on a grid [142]; they encode the ideal relative locations of parts vi and vj. The distance between them, weighted by Mij⁻¹, measures the deformation between the two parts.

Statistical Framework

The pictorial structure matching can be viewed, in terms of statistical estimation, as an energy minimization problem [142]. Using this approach the pictorial structure can be trained from examples and can also be used to find multiple good matches.

We use the following notations:

θ – the parameters of the object model,

I – the image,

L – a configuration.

According to Bayes' rule, the posterior of a configuration L given an image I and model θ is [142]
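
In the form used in [142] this reads

    p(L \mid I, \theta) \propto p(I \mid L, \theta)\, p(L \mid \theta)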

where p(I | L, θ) is the likelihood of seeing a particular image given that the object is placed at a given configuration, and p(L | θ) is the prior probability of a configuration; it contains the information about the positions of the body parts relative to a coordinate system, and it can be both informative and general. p(L | I, θ) represents the probability of the configuration given the model parameters and the image.

The model is parameterized by θ = (u, E, c):

u = {u1, …, un} are the appearance parameters,

E consists of the edges indicating connections between parts,

c represents the connection parameters.

The likelihood is given by the product of the probabilities of each part being observed at its location in the image [142] (assuming that the parts are independent and not occluded).

The prior distribution over object configurations is captured by a Markov random field,

where the denominator is unnecessary (absolute locations are not modeled), so we get

The posterior will be:

Observing that the negative logarithms of the likelihood and of the prior give us the match and deformation costs, we get

Since dij(li, lj) is restricted to the Mahalanobis distance, it must be modeled accordingly. This can be done [142, 143] with a zero-mean normal distribution with diagonal covariance matrix Dij.

Body parts and connections

The human body is a deformable object, but it is composed of rigid parts connected to each other by elastic connections. The projection of a human body part is best modeled by a rectangle with the following parameters:

x, y the center of the rectangle

s the foreshortening

θ the orientation of the part.

The body part pairs are connected with "spring-like" connections. Every connection has its correspondence in the parts' local coordinate systems: (xij, yij) and (xji, yji). In the ideal case these connection points overlap, as in Figure 37. The ideal relative orientation θij of a connection is given by the difference of the two parts' orientations.

Figure 37. Connections

For two parts vi and vj with locations li = (xi, yi, si, θi) and lj = (xj, yj, sj, θj), the joint probability is given by a combination of zero-mean Gaussians modeling the displacements.

The first two expressions measure the horizontal and vertical displacements, the third is the difference in foreshortening, and the last measures the difference between orientations. The parameters of a connection are therefore cij = (xij, yij, xji, yji, σx², σy², σs², θij, k).

The transformation of a body part into the connection coordinate system can be obtained using the expression:

where (xij, yij) is the relative position of the connection between body parts i and j in the coordinate system of the i-th body part, and θij is the relative angle between the two body parts.

The joint distribution of the body part pairs must be a Gaussian distribution with zero mean and diagonal covariance in the transformed space.

The joint probability will be:

Learning Parameters

Given a set of training images {I1…Im} and the corresponding object configurations {L1…Lm}, we can estimate the model parameters θ = (u, E, c) by finding the θ that maximizes the likelihood of the training examples, following Felzenszwalb's algorithm [142].

Figure 38. The training set

The first part of the equation depends on the appearance parameters and the second part depends only on the connections and the connection parameter set [142]. The appearance part can be solved independently for each ui, i.e., for each part.

Estimating the dependencies is similar to estimating the appearance parameters [142]. The connection parameters can be estimated (they are specified separately for different models) with

The quality of a connection can be estimated [142] as the probability of two locations given the ML estimate of their joint distribution.

Using these estimates we can find a tree connecting the vertices of the graph by selecting the edges E.

The edge set can be found by building a graph over the vertices, setting the edge weights from the estimated connection qualities, and solving for the minimum spanning tree of the graph [142].

Figure 39. Learning process

Figure 40. The learnt Pictorial structures (frontal and side view)

Finding the optimal configuration

Now that we have the model, we can find the configuration L = (l1, …, ln) that minimizes the original energy function.

We can compute this for all leaf vertices of the tree. The best location Bj for a leaf vertex vj given location li for its parent vi is the lj that minimizes

This leads to a recursive solution:

Sampling from the Posterior

We want to sample from:

The steps of the recursive solution are the following: first sample a location lr for the root, then repeat the procedure for each child of the root. The marginal distribution of the root is

We can formulate this recursively as

where r is the root, c is a child node and S is specified as

Framework extension

The evaluation of the criterion from equation (4.11) is very time-consuming. To reduce the detection time and improve the performance of the system we extend the energy function by adding a new term, the distance from the previous position in the video sequence.

The new term measures the distance between the current position and the previous position of the body part.

By introducing this term into the energy function we use motion information to obtain a more precise detection and to reduce the detection time [227]. There are two cases: with or without a relation between consecutive frames. If the current frame is not related to a previous frame, the term is 0 for every location. In the second case we have the body location and configuration from the previous frame, and the new location and configuration should be similar to the previous ones.

The new term in equation (4.44) can be treated as a constraint. This constraint limits the search space of the possible body part locations, orientations and scales, and can be viewed as an adaptive top-down pre-filtering method. The temporal constraint also limits the appearance model parameters.

The consequence of using the term as a constraint is that the derivation of the detection algorithm does not change; the effect of the term is visible only in the pictorial structure parameters, reducing the search space.

Another variant of this statistical framework uses a predicted measurement instead of the time constraint,

where the new term measures the distance between the predicted and the measured location. The previous idea is continued: we extend the criterion with a prediction, track the motion of every body part with a Kalman filter, and search for the body parts in the neighborhood of the predicted position.

By introducing this term into the energy function we use motion information to obtain a more precise detection and to reduce the search time [227]. There are again two cases: with or without a relation between consecutive frames. If the current frame is not related to a previous frame, the term takes a high value for every location. In the second case the new location and configuration can be predicted from the information in the previous frame, and the result should be close to the predicted data.
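
A minimal sketch of the per-part prediction (a constant-velocity model with illustrative noise parameters; the filter settings actually used are not reproduced here):

    import numpy as np

    class PartPredictor:
        # Constant-velocity Kalman filter for one body part center (x, y).
        # Used only to predict where to search in the next frame.
        def __init__(self, q=1.0, r=4.0):
            self.x = np.zeros(4)                      # state: [x, y, vx, vy]
            self.P = np.eye(4) * 100.0                # state covariance
            self.F = np.array([[1, 0, 1, 0],
                               [0, 1, 0, 1],
                               [0, 0, 1, 0],
                               [0, 0, 0, 1]], float)  # constant-velocity transition
            self.H = np.array([[1, 0, 0, 0],
                               [0, 1, 0, 0]], float)  # only the position is measured
            self.Q = np.eye(4) * q
            self.R = np.eye(2) * r

        def predict(self):
            self.x = self.F @ self.x
            self.P = self.F @ self.P @ self.F.T + self.Q
            return self.x[:2]                         # predicted (x, y): center of the search window

        def update(self, z):
            z = np.asarray(z, float)
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.x = self.x + K @ (z - self.H @ self.x)
            self.P = (np.eye(4) - K @ self.H) @ self.P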

Comparing the two approaches presented in equations (4.44) and (4.45), we can conclude that the second energy function can use information from previous measurements, which makes the system more robust to occlusion and self-occlusion. We will denote the new frameworks as Optimized Pictorial Structure based Frameworks (OPSF).

Systems framework and Experiments

In this subchapter we present a framework for human detection and pose estimation that uses part-based detection. The process has two steps: body part detection and model matching. Detection is done by creating an occurrence probability map for every body part separately, and then matching these maps to a Pictorial Structure model.

To create the occurrence probability map we use a strong discriminative detector that uses dense appearance representations based on shape context descriptors [128], and AdaBoost [54]. These kinds of detectors have been used in the literature for pedestrian detection [109, 6, 145] but for these cases the appearance models are simpler.

Our purpose is to create a generic method for building this occurrence probability map in unconstrained environments. Because the space of possible body configurations is very large, reducing the search space is important [214]. One way to do this is to use discriminatively learned detectors [143, 237]. It is also important to apply the pre-filtering of possible part locations very carefully in the part detection phase, and to postpone the final decision until evidence from all body parts is available. Eliminating the pre-filtering entirely is inadequate, because it contradicts the intention of reducing the search space. The question remains: what kind of pre-filtering preserves both the generality and the performance of the framework?

The dense evaluation of the search space considers all possible part positions, orientations, and scales. It is in contrast to bottom-up appearance models (e.g. [109]) based on a sparse set of local features. The search space is limited by an adaptive pre-filter that uses information from previous detections and from frame differencing. The pre-filtering is used only when the information is available, otherwise the entire space is considered.

The part detector is based on the shape context descriptor method used for pedestrian detection. In this descriptor, the distribution of locally normalized gradient orientations is captured in a log-polar histogram. For classification we used an AdaBoost classifier [54] applied to a feature vector obtained by concatenating all shape context descriptors whose centers fall inside the part bounding box.

We matched these maps to the Pictorial Structure model using the energy function (equation (4.44)).

The framework is presented in Figure 41. In this diagram it can be seen that the temporal constraint is activated by the frame differencing module. We used frame differencing because it is fast and provides enough information about the temporal relations between consecutive frames.

The frame differencing module activates the search space and model control unit. Based on the previous detection result and the frame differencing result, the search space is reduced to the neighborhood of the previous location, and the Pictorial Structure model parameters are configured in accordance with the previous match. Considering the possibility of losing synchronization between the tracked human and the model, we introduced verification steps in which the constrained detection result is compared to an unconstrained, densely sampled search over the full image.

Figure 41. Framework implementation diagram

To prove the performance of our OPSF method we compared it to another Pictorial Structure based method presented by Andriluka [110]. We tested them on the same videos and on the same computer, using videos from indoor and outdoor environments as well as movies downloaded from the internet.

Figure 42. Output for the system. Left for Andriluka’s method and right for ours

The first test was the generality test. We made a video in which every frame was an arbitrary image from the internet, with no relation between frames. To have a correct comparison, instead of training the two frameworks separately we used the same Pictorial Structure trained by one of them. Run on this video, the method managed to detect people 99% of the time, while the processing speed remained the same.

The second test we made was the speed comparison of the frameworks, having relations between the consecutive frames.

Table 8 shows the result of the speed test. Again, for both systems we used the same Pictorial Structure model.

Table 8. Experimental results for human detection using Pictorial structure

Comparing the experimental results, OPSF was not only faster but also had a higher detection rate:

The average time needed in the optimized case: 18.912 s

The average time needed without optimization: 379.792 s ≈ 6 minutes

In Figure 43 we show some differences between the two systems' outputs. The time constraint in OPSF has two consequences: one is the reduction of the search space on the image plane, and the other is the adaptive modification of the model configuration correlated with the previous match.

Figure 43. The responses of the systems with and without time constraint optimization.

Figure 44 shows the speed of the system when only the space reduction is applied and when both optimizations are applied.

Figure 44. The speed of the system for the two kind of optimization

The average time needed in the optimized case: 18.912 seconds

The average time needed in search space reduction: 64.892 seconds

Using both optimizations the system becomes three times faster than when using only the space reduction method; the difference in detection precision is not significant in this case.

Figure 45. Performance of the pose estimation

In Figure 45 the performance evaluation of the pose estimation is presented. The average ratio of detected parts is 66.9% when a match must overlap more than 95% of the annotated part area; if the required overlap between the annotated and detected areas is lowered to 75%, the recognition rate is 75.4%.

Figure 46. Difference in performance of the pose estimation

Figure 46 presents the difference in body configuration detection between using both search space and configuration reduction (66.9%) and using search space reduction only (65.3%).

In the literature many systems have to use two models: one for frontal view and one for the lateral view of the human body. We tested the necessity of these two pictorial structure models in OPSF.

Figure 47. Output of the system with frontal and lateral body model

As Figure 47 shows, the detection precision is almost the same with both models; more exactly, the difference between the two models is below 2%. Considering these results, we conclude that it is unnecessary to use two models.

Figure 48. Output of the system with search space reduction and with search space reduction and configuration reduction

Figure 49. Output of the systems a) OPSF, b) original framework

We also compared our second framework, OPSF2, to the original framework. The results are presented in Table 9. It is visible that the detection rate, compared to OPSF1, increases, but not significantly. We achieved a performance increase only in the pose estimation domain, while the use of prediction also increased the processing time.

Table 9. Performance parameters on our database

Results

In the previous subchapter we presented a component-based method that uses the Pictorial Structure model. Starting from the statistical framework proposed by Felzenszwalb, we extended this model to use prior information during detection while remaining a generic model for human detection and pose estimation even when prior information is not available. We have demonstrated that OPSF, although not working in real time, is significantly faster than other similar systems: while the original framework needs more than 6 minutes for recognition, OPSF detects the human positions and attitudes in around 17 seconds.

With our experiments we have shown that the generality of the system is very high and that using the time constraint leaves this generality unchanged. Using the same videos for the experiments, we observed that every human detected by the original framework was also detected by OPSF. This is expected, because OPSF acts in the same way as the original framework when prior information is missing.

We also demonstrated with our experiments that using the time constraint we obtain higher precision by eliminating many of the false positive detections. We achieved a 92% recognition rate compared to the 89% of the original algorithm, while the processing is 20 times faster.

Based on the test results we can say that it is unnecessary to use separate lateral and frontal Pictorial Structure models for detection, because the effect on the recognition precision is not significant, while the processing time increases considerably.

Comparison of the human detection methods

In the previous sections we presented three types of human detection methods. In this section we compare them. Our initial observation was that in some cases the presented methods work well, while in others their performance is weak.

First, we tested the performance of the methods on humans of different sizes.

Figure 50 presents the performance of the different methods. We can see that the Pictorial Structure method performs better when the people in the image are larger, while the Haar-based method performs better at lower resolutions. The performance of the Haar-based method is more constant over different human sizes, but its body pose estimation is poor. The Pictorial Structure method has the best recognition rate and the best pose estimation, but it is very slow compared to the other methods and cannot be used in real-time systems.

Figure 50. Human detection system performance for different human size

The chamfer method’s detection rate is not the best, but the pose estimation is better than the Haar based method’s estimation.

Based on the measurements, the best choice at low resolution is the Haar-based method. At medium resolution both the Haar and the chamfer methods can be used, and their results can be merged. If we need better pose estimation we should use the Pictorial Structure method, but only in offline systems.

Conclusions

The focus of this chapter was human detection and pose estimation. It presented three significant human detection and pose estimation methods together with our contributions to them. These methods represent different classes of the most promising approaches. To evaluate their performance we compared them in diverse cases.

The first described method is a single-window approach. We built a novel classifier [198, 212] based on the work of Viola and Lienhart, and created a new classifier structure to detect multi-view and multi-pose human bodies. To simplify the classifier structure and speed up the training procedure we introduced a novel background selection algorithm [199].

After comparing the Viola and Lienhart classifiers with our PETC classifier [198, 212], the conclusions were:

our PETC classifier is faster than the other two classifiers.

the correct detection rate is comparable with that of the Lienhart classifier, but PETC has a lower false positive rate.

The comparisons of the classifiers were performed on multiple databases.

The second presented approach is a template based method. We used chamfer matching to detect and estimate the human poses.

We proposed a novel pseudo parallel approach to compute the distance transformation. Based on experiments this method is 25% faster than other methods [209].

We also proposed a new way to store the templates and to find the most probable templates very quickly [201].

We proposed the CHMS framework to detect human presence in the image and estimate the pose. Using the CHMS framework we obtained a detector 5 times faster, with the same detection performance but fewer false positives and better pose estimation [201].

We also performed several tests to investigate the performance of chamfer matching in relation to image homogeneity, and we concluded that in this form the method cannot be applied to cluttered scenes, but can be used successfully when a background subtraction method can be applied to the image.

One of the main advantages of the chamfer matching method is that at the end of the matching process, besides the position we also know the human's pose (attitude). The attitude estimation and the matching are done at the same time; to get the pose we only have to categorize the templates [199].

The third studied approach was the Component based method. Our starting point was the Felzenszwalb Pictorial Structure based framework [204, 208].

First we implemented the Felzenszwalb algorithm and made some tests. Based on these tests we identified the weaknesses of the method [204, 208].

We proposed two new methods OPSF1 and OPSF2 to increase the performance of the Pictorial Structure based method [197, 200].

The first method uses the motion information to modify the Pictorial Structure parameters, and speed up the recognition process [197, 200].

The second method uses tracking information to eliminate the ambiguity, caused by the occlusions or self occlusions [199].

The experiments showed that our OPSF frameworks are considerably better than the original one: they are 20 times faster, have a 13% higher detection rate, a 5-10% higher accuracy, and a considerably reduced false positive rate [200].

The proposed OPSF uses motion and tracking information without reducing the framework's ability to be used on still images or when motion information is not available [200].

The last part of the chapter contains a comparison of the three improved methods using video sequences with humans of different sizes in the image. We showed that each of them has its uses: the first two methods are fast and work well at lower resolutions, while the Pictorial Structures method gives the best detection results but works very slowly even with our speed-up [199].

Recognizing Human Action and Behavior

In this chapter we present the last component of the human behavior recognition system. Many researchers focus on behavior recognition, as already presented in Chapter 2; here we present our research results in this field.

Behavior recognition is the last step of the human behavior recognition process. This component mainly processes the "measurements" provided by the human detection and pose (attitude) estimation systems, which is why its performance is highly influenced by the quality of the data provided by the previous components.

Based on the nature of human behaviors, the recognition process is divided into two steps: activity (action) recognition and behavior recognition. An activity is defined as a sequence of movements that cannot be described using other simple human motions; the duration of activities is usually short. Behaviors last longer and are composed of a sequence of activities.

The activity and behavior recognition algorithms are categorized [84] into single-layered and hierarchical approaches. For action recognition the single-layered approaches are more suitable, while for behavior recognition the hierarchical approaches achieve better results.

In the following subchapters we will present separately the activity and the behavior recognitions together with our contributions to these two fields.

Recognizing Actions

By definition, an action is the simplest behavior, one that cannot be described in terms of other actions. The action recognition process has as input a sequence of positions and body configurations, and its output is the recognized action.

The action recognition algorithms have to treat the following issues:

Multitude of human forms

The same actions are performed in different ways

The duration of the actions differs from person to person

The input information can be incomplete or erroneous.

During recognition this sequence is compared to examples or to a model in order to categorize the input. The most suitable recognition technique for this purpose is the single-layered approach. Two types of single-layered approaches can be distinguished based on the way the input information is handled: we can look at the input as a multidimensional unit or as a sequence of information. Accordingly, there are space-time techniques and sequential approaches.

Since the space-time techniques treat the input as a unit, this approach is suitable mostly for repetitive motions and gesture recognition. Techniques that treat the input as volumes provide a straightforward solution, but mostly fail at handling view, speed and motion variations. The trajectory-based approaches try to eliminate the view problems, but usually introduce others, especially when they are used for joint position tracking; they can be applied successfully only to very simple actions.

The most promising direction among the spatio-temporal techniques is given by the local feature-based approaches, because they are robust to illumination changes and to some degree of noise. Another benefit is that they can be used directly on the image, without background subtraction or body part modeling. Compared to the volumetric approach, these methods are capable of recognizing non-periodic actions by using algorithms that can model the relations between features. However, this approach is not suitable for recognizing complex actions and has difficulties handling multiple views.

The sequential approaches use sequential relationships between features and are able to detect and recognize more complex activities. The state-model based sequential approaches compute the posterior probability that an action occurs; the probabilistic formulation makes it possible to incorporate other factors into the decision process. To build a model, the state-based approaches require training videos, and the required quantity depends on the action complexity: for complex actions the amount of training data has to be large. When there is enough training data to properly train the model, the system becomes flexible and able to recognize an activity even from quite different action sequences.

The example-based methods require less training data than the model-based ones, and provide an execution rate similar to that of the non-linear matching techniques. Their main disadvantage is that they require a template for every different action sequence.

Dynamic time warping

Dynamic time warping (DTW) is a template-based dynamic programming matching technique for measuring the similarity between two sequences which may vary in time or speed. For instance, similarities in walking patterns would be detected for slowly walking persons, for quickly walking persons, or even for accelerations and decelerations during the walk.

In general, DTW is a method that allows a computer to find an optimal match between two given sequences (e.g. time series) with certain restrictions. The sequences are "warped" non-linearly in the time dimension to determine a measure of their similarity independent of certain non-linear variations in the time dimension. This sequence alignment method is often used in the context of hidden Markov models.

One example of the restrictions imposed on the matching of the sequences is on the monotony of the mapping in the time dimension. Continuity is less important in DTW than in other pattern matching algorithms; DTW is an algorithm particularly suited to match sequences with missing information, if there are enough segments for matching. The Dynamic Time Warping compares two time series and computes the distance between them, even if the two series are shifted on time axis. Given two series X, and T, of lengths |X| and |T|,

To align the two time series we construct an |X|-by-|T| distance matrix, whose (i-th, j-th) element is the distance d(x_i, t_j) between element x_i of X and element t_j of T. To get the distance between the two time series we search for the warping path W given in equation (5.2),

W = w_1, w_2, …, w_K,   max(|X|, |T|) ≤ K < |X| + |T|,   (5.2)

where K is the length of the warping path. Every element of the warping path is a pair of coordinates (indexes) that represents a correspondence between the two time series,

w_k = (i, j),   (5.3)

where i and j are indexes into the two time series. There are three constraints on the warp path: the boundary condition, continuity and monotonicity. The boundary condition, w_1 = (1, 1) and w_K = (|X|, |T|), means that the warping path must start at the first elements and end at the last elements of both time series; the starting point is the bottom-left corner of the distance matrix and the end point is the opposite corner. The continuity and monotonicity constraints are merged in equation (5.4),

w_k = (i, j),  w_{k+1} = (i', j')  with  i ≤ i' ≤ i + 1  and  j ≤ j' ≤ j + 1.   (5.4)

The restrictions in equation (5.4) control the allowable steps to adjacent cells, in such a way that i and j increase monotonically along the warping path. All indexes of both time series must be used.

Many warping paths satisfy these three conditions, but we are interested in the path that minimizes the cumulative distance of the path elements, as in equation (5.5),

DTW(X, T) = min over W of (1/K) · Σ_{k=1..K} d(w_k).   (5.5)

The factor 1/K normalizes the distance across warping paths of different lengths. The best way to construct the optimal warping path is the dynamic programming method: the task is split into subtasks over portions of the time series, and by finding the optimal solution to these subtasks we obtain the optimal solution of the entire problem. To achieve this we construct a cumulative distance matrix D using equation (5.6),

D(i, j) = d(x_i, t_j) + min{ D(i-1, j), D(i, j-1), D(i-1, j-1) }.   (5.6)

Every cell is computed as the sum of the distance d(x_i, t_j) between the current elements (Euclidean or another type of distance) and the minimum of the cumulative distances of the adjacent cells. The cost matrix is computed bottom-up, from left to right. After the entire cost matrix is filled, a warp path must be found starting at the bottom-left corner D(1, 1) and ending at the top-right corner D(|X|, |T|). The warp path is actually computed in reverse order, starting at D(|X|, |T|) and using a greedy search that evaluates the cells to the left, below, and diagonally to the bottom-left. The coordinates of the smallest-valued cell are added to the beginning of the warp path found so far. The search continues from the last added cell and stops when D(1, 1) is reached.
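To make equations (5.5) and (5.6) and the backtracking step concrete, a minimal Python sketch of the procedure described above is given below; the function and variable names are illustrative and do not come from the thesis implementation.

```python
import numpy as np

def dtw(x, t):
    """Classical DTW between 1-D series x and t (equations 5.5-5.6).

    Returns the normalized DTW distance and the warping path.
    """
    n, m = len(x), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    # Fill the cumulative distance matrix bottom-up, from left to right.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - t[j - 1])          # 1-D Euclidean distance
            D[i, j] = cost + min(D[i - 1, j],
                                 D[i, j - 1],
                                 D[i - 1, j - 1])
    # Backtrack greedily from D(|X|, |T|) to D(1, 1).
    i, j, path = n, m, [(n, m)]
    while (i, j) != (1, 1):
        _, (i, j) = min((D[i - 1, j - 1], (i - 1, j - 1)),
                        (D[i - 1, j],     (i - 1, j)),
                        (D[i, j - 1],     (i, j - 1)))
        path.append((i, j))
    path.reverse()
    return D[n, m] / len(path), path

# Example: two speed variants of the same motion series.
slow = [0.0, 0.1, 0.1, 0.4, 0.8, 1.0, 0.8, 0.4, 0.1]
fast = [0.0, 0.4, 1.0, 0.4, 0.0]
dist, path = dtw(slow, fast)
```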

Dimensionality reduction and motion decomposition

Because most approaches in action recognition need to deal with very high-dimensional data spaces, they often suffer from the 'curse of dimensionality'. The feature space becomes exponentially sparser with the dimension, thereby requiring a larger number of samples to build efficient class-conditional models. The simplest way to reduce dimensionality is Principal Component Analysis (PCA), which assumes that the data lies on a linear subspace. Except in very special cases, data does not lie on a linear subspace, thus requiring methods that can learn the intrinsic geometry of the manifold from a large number of samples. Nonlinear dimensionality reduction techniques allow the representation of data points based on their proximity to each other on nonlinear manifolds. Several methods for dimensionality reduction, such as PCA, locally linear embedding (LLE) [171], Laplacian eigenmap [111], and Isomap [111], have been applied to reduce the high dimensionality of video data in action recognition tasks [111, 79, 231, 113]. Specific recognition algorithms such as template matching, dynamical modeling, etc. can be performed more efficiently once the dimensionality of the data has been reduced.

Because human motion can be compositional or concurrent, global trajectories are not the best choice. Some actions involve only the legs, for example walking, running and jumping, while other actions involve only the hands: handshaking, waving. For this reason, we decompose the action into its basic elements, the body part motions. To make the recognition easier we track every body part individually and relative to its parent body part. Using this approach, we can use in classification only those basic motions (body part motions) that are relevant for an action, so we can easily recognize composed motions as well.

In some cases, when we have low-resolution images, we cannot track all body part motions separately, only the global motions. There are several possibilities for this: we can use a Haar-based detector [198] or we can use chamfer matching [201] to detect the humans and their poses.

Our goal is to obtain the most detailed information about the human body configuration and its relation to other moving objects and to the environment in the current frame. To achieve this goal, for low-resolution images we used a bottom-up approach, the chamfer matching [201], while for higher-resolution images we used a top-down approach, the Pictorial Structure method introduced by Felzenszwalb [142] and extended by Ramanan [159].

In the case of higher-resolution frames, the Pictorial Structure approach models the human body as a collection of parts in a deformable configuration, with 'spring-like' connections between pairs of parts. These connections model the spatial relations between parts. The appearances and spatial relationships of the individual parts can be used to detect an object. The best match of the pictorial structure depends on how well each part matches its location and how well the locations agree with the deformable model. The main advantage of this approach is that the motions of the human body parts are tracked individually, each relative to its parent body part. Using this approach, we can use in classification only those basic motions (body part motions) that are relevant, and we can easily recognize composed motions too. The first and most significant motion is the torso motion. Here we look at two elements: the motion relative to the image (global motion) and the angular motion. The torso represents the root of the body parts in the pictorial structure. The upper legs and upper arms are connected to the torso and we analyze their angular motions only between -270 and +270 degrees. The absolute motions are tracked between -180 and +180 degrees; the ranges between 180 and 270 degrees represent buffer zones. If the motion angles go above 180 or below -180 degrees, we keep two possible time series. Three events can reset one of the time series: the angular motion quickly returns to the range between -180 and 180 degrees; the DTW matching for one of the series produces a strong result; or the angle increases above 270 or decreases below -270 degrees.

Figure 51. Relative motion of the upper leg relative to the torso
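As an illustration of this decomposition, the following sketch computes the angular motion of an upper leg relative to its parent, the torso, from 2D joint positions; the joint representation and the wrapping convention are assumptions made for the example only, not the exact implementation used in the thesis.

```python
import math

def segment_angle(p_proximal, p_distal):
    """Orientation (degrees) of a limb segment given its two 2-D joint positions."""
    dx = p_distal[0] - p_proximal[0]
    dy = p_distal[1] - p_proximal[1]
    return math.degrees(math.atan2(dy, dx))

def relative_angle_series(torso_frames, upper_leg_frames):
    """Angle of the upper leg relative to its parent (the torso), frame by frame.

    Each *_frames element is a (proximal, distal) pair of (x, y) joint positions.
    The result is wrapped into [-180, 180); the buffer zone up to +/-270 degrees
    described above would be handled by keeping two candidate series.
    """
    series = []
    for torso, leg in zip(torso_frames, upper_leg_frames):
        rel = segment_angle(*leg) - segment_angle(*torso)
        rel = (rel + 180.0) % 360.0 - 180.0   # wrap into [-180, 180)
        series.append(rel)
    return series
```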

The lower parts of the legs are connected to the upper parts, and their angular motions are measured relative to the upper parts of the legs. Likewise, the lower arm angular motions are tracked relative to the upper arms. We do not track the motion of the head. Figure 52 presents a time series of upper arm motion for the waving action.

The most important points in a motion series are the peaks, because they mark a change in the motion direction, and the still (constant) points and the zero-crossing points, because they correspond to the stable or typical positions of the human body. As a consequence, the speed of the actions is not relevant.

In the case of low-resolution frames, we use the chamfer matching method [201] to track the human body and to detect its pose. Using the fast template search method we introduced, we can always track the human body and measure the distance to the closest template class.

Figure 52. Full resolution time series of waving – upper arm

We can approximate the motion series using the key positions. There is an unambiguous mapping from each key position to the relative positions of all body parts. We record this position whenever the current match has the lowest distance from the template. We count the number of frames between two consecutive best-match key positions and then interpolate the intermediate points.
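A small sketch of this approximation is given below, assuming a linear interpolation between the angles associated with consecutive best-match key positions; the helper name and data layout are illustrative.

```python
import numpy as np

def interpolate_key_positions(key_frames, key_angles, n_frames):
    """Rebuild a per-frame motion series from sparse best-match key positions.

    key_frames: frame indices where a key position was the best match
    key_angles: the body part angle associated with each key position
    n_frames:   total number of frames to reconstruct
    """
    frames = np.arange(n_frames)
    return np.interp(frames, key_frames, key_angles)

# e.g. key positions matched at frames 0, 7 and 15 with angles -20, 35 and -20 degrees
series = interpolate_key_positions([0, 7, 15], [-20.0, 35.0, -20.0], 16)
```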

These two approaches can be connected into a general framework, based on the Heuristic FastDTW, to recognize human actions.

Heuristic Fast Dynamic Time Warping Methods

The quadratic time and space complexity of DTW generates the need for methods that speed up the dynamic time warping. The most common method is the use of constraints, which limit the search area in the cost matrix. These constraints are important not only for speeding up the DTW but also for eliminating the problem of singularities [68, 26, 35].

There are also other methods to speed up the computation of the DTW. One is FastDTW [230], which uses recursive shrinking and refinement to approximate the best warping path.

To compare the time series of human motions we use an improved version of the DTW algorithm, a multilevel approach with the following key operations:

Coarsening – shrinks the time series into a smaller time series that keeps only the peak and constant values,

Coarse DTW – finds a minimum-distance warping path for the shrunken series and uses that warping path as an initial guess for the full-resolution minimum-distance warp path,

Final DTW – refines the warping path projected from the lower resolution through local adjustments, using the Sakoe-Chiba constraint.

The first step is the coarsening step, during which we shrink the time series. The most significant moments of human body part motions are the direction changes. In the Heuristic FastDTW approach we do not average the time series; instead, we use a heuristic selection of the data in which only the peaks and the constant values are kept from the series. In other words, an element x_i is kept from X only if it satisfies one of the following two conditions:

Figure 53. The original and the shrunken time series of waving – upper arm

Figure 53 presents the original time series of waving with the upper arm and the shrunken series. As the second step, we perform a classical DTW comparison between the shrunken templates and the shrunken inputs. Using this comparison, we can eliminate the majority of the templates, leaving only a few templates for the higher-resolution comparison.
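A minimal sketch of such a coarsening step is shown below; it assumes that "peaks and constant values" means local extrema and plateau points, which is one plausible reading of the two conditions rather than the exact rule used in the thesis implementation.

```python
def shrink_series(x, eps=1e-6):
    """Keep only the first point, local extrema, plateau points and the last point."""
    if len(x) <= 2:
        return list(x)
    kept = [x[0]]
    for i in range(1, len(x) - 1):
        left, mid, right = x[i - 1], x[i], x[i + 1]
        is_peak = (mid - left) * (right - mid) < 0                 # direction change
        is_constant = abs(mid - left) < eps and abs(right - mid) < eps
        if is_peak or is_constant:
            kept.append(mid)
    kept.append(x[-1])
    return kept
```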

Figure 54 shows the cost matrix of the shrunken time series and its projection onto the original-resolution cost matrix. Projection takes a warping path calculated at the lower resolution and determines which cells of the higher-resolution cost matrix it passes through. This projected path is then used as a heuristic during solution refinement, to find a warping path at the higher resolution. To make the refinement faster we used the Sakoe-Chiba band constraint.

Figure 54. The coarse and the full resolution cost matrix with warping path

The Final DTW step is a refinement that finds the optimal warping path in the neighborhood of the projected path, where the size of the neighborhood is determined locally by the distance between two consecutive points in the shrunken series and by the difference between the lengths of the template series and the input series.

This finds the optimal warping path within the region around the path that was projected from the lower resolution.
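The following sketch illustrates the constrained refinement under simplifying assumptions: the coarse warping path is assumed to be already projected to full-resolution cell coordinates, and the cumulative distances are evaluated only inside a band of a given radius around the projected cells; the locally varying neighborhood size used in the thesis is replaced here by a fixed radius.

```python
import numpy as np

def band_around(path, n, m, radius):
    """All full-resolution cells within `radius` of the projected coarse path."""
    allowed = set()
    for (i, j) in path:
        for di in range(-radius, radius + 1):
            for dj in range(-radius, radius + 1):
                if 1 <= i + di <= n and 1 <= j + dj <= m:
                    allowed.add((i + di, j + dj))
    return allowed

def constrained_dtw(x, t, allowed):
    """DTW restricted to a set of allowed (i, j) cells (1-based), e.g. a band
    around the warping path projected from the coarse resolution."""
    n, m = len(x), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if (i, j) not in allowed:
                continue                      # cell outside the search window
            cost = abs(x[i - 1] - t[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```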

In the case of low-resolution images the poses are recognized as key positions, to which we can always associate a set of points in the body part series. Because the image resolution does not allow a fine detection of the body part motions, one key position is recognized several times before the position changes. This kind of motion series is very similar to the shrunken version of the time series.

Classification using Neural Networks

Because we have a separate motion series for every body part, we need to synchronize the matches and make a final decision about the overall human action. We use the result of the Heuristic FastDTW as the input of Neural Networks. Neural networks (NNs) are nonlinear models, which makes them flexible in modeling complex real-world relationships. Furthermore, NNs are data-driven, self-adaptive methods, able to adjust themselves without any explicit functional or distributional specification of the underlying model.

Although many types of neural networks could be used for classification, we have focused on the following three network types: the Learning Vector Quantisation (LVQ), the Radial Basis Function (RBF) and the feedforward multilayer network or MultiLayer Perceptron (MLP) NNs, which are the most widely studied and used neural network classifiers.

Using the Heuristic FastDTW’s output, a dataset has been compiled. In order to follow the proper steps of designing a test-bench system, the dataset has been divided into a training subset (75% of the samples) and a testing subset (25%). Five different behaviors are represented in the compiled dataset, with about twenty measurements for each (values corresponding to the motion time series of eight body parts). The NN training has been done in Matlab, using the built-in functions of this environment.
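For illustration, an equivalent training procedure can be sketched outside Matlab as well; the example below assumes the Heuristic FastDTW distances are arranged in a feature matrix and uses scikit-learn's MLP classifier as a stand-in for the Matlab toolbox functions, with placeholder data in place of the real measurements.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# X: one row per sample, one column per (body part, template) DTW distance.
# y: the behavior label of each sample (five behaviors in the experiment).
rng = np.random.default_rng(0)
X = rng.random((100, 8))        # placeholder features; real ones come from Heuristic FastDTW
y = rng.integers(0, 5, 100)     # placeholder labels for the five behaviors

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```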

The evolution of the classification accuracy of the RBF NN and the LVQ NN during the training phase is presented in Figure 55 and Figure 56, respectively.

Figure 55. Training chart of the RBF NN

Figure 56. Training chart of the LVQ NN

Table 8 summarizes the results obtained with these methods, proving that using a NN to recognize human actions from monocular video, after the Heuristic FastDTW processing, is a viable solution.

Table 8. NN classification results

Results

For testing purposes we used the detection results from the three types of human detection and pose estimation methods. The results were converted to time series, then the torso speed and the motion of each body part relative to its parent were measured. Naturally, the different detection methods produce different resolutions.

These parameters are compared with the saved templates using the Heuristic FastDTW, and some of them are eliminated already at the early stage if the distance between the coarse variants of the series is larger than a threshold.

To construct the template database we annotated and saved 4 different actions from 10 different videos. For every body part we compared the saved motion series using the Heuristic FastDTW. If the difference between them was too large we dropped them, while if they were similar we chose the median series among them.

We also used a feedforward neural network to classify the actions. The number of network inputs is equal to the number of templates and the number of outputs is equal to the number of trained actions. We used the saved templates to train the neural network.

For experiments we have used indoor scenes with simple and composed actions.

Table 9. Comparative results of the experiment using the three types of pose estimator.

Using the Heuristic FastDTW, the recognition is 2 times faster than with FastDTW, because many templates can be eliminated in the early, coarse comparison steps. Using the motion decomposition we were able to recognize composite actions as well, such as standing while handshaking.

The second part of the experiment shows that the accuracy is influenced by the resolution of the human pose estimation. In this case the method was not able to recognize composite actions such as standing while handshaking.

In this subchapter we presented two improvements for human action recognition: an efficient representation of motions obtained by decomposing them into their basic elements, and a FastDTW algorithm adapted for human motion recognition. Both ideas can still be improved. In the case of body part motions, the angular motion can be decomposed into two time series: one with low-frequency variations and one with high-frequency variations. The low-frequency series would represent the position, and the high-frequency series would represent the short actions of the body part.

We can introduce more constraints in the coarsening step of the Heuristic FastDTW, reducing the length of the time series and thus speeding up the recognition procedure.

We can also improve the recognition framework by introducing a Self-Organizing Incremental Neural Network instead of the feedforward neural network.

Figure 57. Output of the action recognition system

Recognizing Behavior

By definition, human behavior is a sequence of simple or complex actions. By its nature, behavior is complex and highly variable. For this reason, single-layer approaches cannot recognize behaviors; it is more appropriate to use a hierarchical structure, which performs better in human behavior recognition.

The low-level recognition approaches are responsible for recognizing the atomic actions, and their results serve as observations or measurements for the higher-level recognition. The hierarchical techniques represent tractable and conceptually understandable models of human behavior; using this approach the human behavior model can be built by human experts. The hierarchical techniques also reduce the redundancy of the recognition process by reusing the recognized actions.

The hierarchical techniques are able to recognize complex activities and behaviors with complex structures, which is one of their major advantages. The capability to integrate and use semantic-based processing makes these approaches suitable for analyzing complex behaviors and group and object interactions, and for integrating prior information.

One of the most important problems of behavior recognition systems is the large training set requirement. The hierarchical structure of the human behavior recognition system considerably reduces the required size of the training set.

There are three groups of hierarchical techniques: statistical approaches, syntactic approaches, and description-based approaches. The most relevant techniques in this field are presented in chapter 2.

The statistical approach uses state-based models to recognize the behavior. If there is enough training data, this approach can recognize sequentially constructed behaviors even in a noisy environment. However, the method cannot properly recognize complex behaviors constructed from concurrent actions.

The syntactical approaches use strings of symbols to model actions and use a grammar in the recognition process. Their major limitation is the same as for the statistical techniques: a limited capability of recognizing behaviors constructed from concurrent atomic actions. The syntactical approach uses production rules provided by the user, which must cover all possible events of a large domain; these rules are used to parse the observations. The system can become unstable when an unknown observation interacts with it.

The last group of techniques uses spatio-temporal structures to model and recognize the behavior. These methods describe, in structures, the spatial, logical and temporal relations between the atomic or lower-level actions of the behavior. Recognizing a behavior amounts to searching in the structure of the model. The major deficiency of this technique is that it does not handle errors coming from the low-level recognition and it is not capable of compensating for them.

In the following subchapter we will show that by using Petri Nets we are able to recognize behaviors which can be described using concurrent atomic actions, and to compensate for some errors in the low-level recognition.

Hierarchical Probabilistic Petri Net

The Petri Net is a mathematical tool that can also be used for high-level interpretation of image sequences by describing relations between events and conditions. This tool was previously used in computer vision to model simple human behaviors [63]. Petri Nets are also useful for modeling and visualizing behaviors. Vision-based systems have to deal with ambiguities and inaccuracies in the lower-level detection and tracking systems, while the basic form of the Petri Net is deterministic.

To address this issue, we have defined a Hierarchical Probabilistic Petri Net (HPPN) consisting of the following elements:

P, a set of states called places;

T, a set of transitions, where the sets of places and transitions are disjoint;

a flow relation between places and transitions;

a function that associates to each place a local probability distribution defined on the transitions leaving that place.

In the Hierarchical Probabilistic Petri Net there exists at least one terminal node and one starting node.

Markings are used to represent the Petri Net in action; the operation of a Petri Net is controlled by its current marking. A transition fires if and only if a predefined number of its input places contain a token. When a transition fires, the tokens that enabled the transition are removed and one token is placed in each of the output places of the transition. A terminal marking is reached when one of the terminal nodes contains at least one token.

To keep track of the probability of the dynamic behavior of the Petri Net, the tokens carry probabilities. At the beginning, all tokens are initialized with a probability score of 1 or with the value provided by the low-level processing algorithm. The token's probability score changes when a transition fires. The simplest case is when there are two places and one transition between them: after the transition fires, the token is placed in the new place with a probability score computed by multiplying the previous token probability score by the probability assigned by the place's distribution to the output transition.

Figure 58. Transition and places

We introduced the concurrency connection between places. The idea behind this is that some behaviors can be realized through different action sequences.

Figure 59. Concurrency

The score of the input places will be the same and can be computed with equation (5.8). Following the same considerations, we also introduced the opposite operation: synchronization.

Figure 60. Synchronization

The new token score is the product of the input token scores and the places' probability distribution functions.

We can remove two or more tokens from the net and replace them by a single token. We will use the final probability of a token in a terminal node as the probability that the activity is satisfied.
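A minimal sketch of this token probability bookkeeping is given below; the class structure and function names are illustrative, and only the sequential firing, the concurrency split and the synchronization product described above are covered.

```python
class Place:
    def __init__(self, name, transition_probs):
        self.name = name
        # Local probability distribution over the outgoing transitions of this place.
        self.transition_probs = transition_probs
        self.tokens = []                      # probability scores of resident tokens

def fire_sequential(src, dst, transition):
    """Move a token from src to dst, scaling it by the transition probability."""
    token = src.tokens.pop()
    dst.tokens.append(token * src.transition_probs[transition])

def fire_concurrent(src, dsts, transition):
    """Concurrency: one token activates several branches with the same score."""
    token = src.tokens.pop()
    score = token * src.transition_probs[transition]
    for dst in dsts:
        dst.tokens.append(score)

def fire_synchronized(srcs, dst, transition_probs):
    """Synchronization: merge one token from each input place into a single token
    whose score is the product of the input scores and the transition probabilities."""
    score = 1.0
    for src, p in zip(srcs, transition_probs):
        score *= src.tokens.pop() * p
    dst.tokens.append(score)
```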

Experiment

To verify the usability of the proposed Hierarchical Probabilistic Petri Net we performed several experiments.

We used the detected positions and the configuration of the pictorial structure to measure the speed of the torso and to track the motion of each body part relative to its parent. These parameters were compared with the saved templates using the Heuristic FastDTW, and they were eliminated at the early stage if the distance between the coarse variants of the series was larger than a specific threshold.

The Heuristic FastDTW comparisons only categorize the body part motions into classes. Every class has an associated place in the network. If the Heuristic FastDTW categorizes a body part motion into a class, the associated place gets a token. Using the Petri Net synchronization procedure we can decide on the actual basic activity. The first layer of the Petri Net is used to recognize the atomic actions.

To exemplify this we model a simple walking behavior. The step state is activated by the Heuristic FastDTW, and the repeated activation of the step state activates the walking state. In the Petri Net, a state may or may not represent a behavior. By adding new labeled states we can extend the Petri Net to recognize new motions.

The figure below shows an example of simple activities recognition:

Figure 61. Recognizing simple activities

where LUL is the left upper leg, LLL is the left lower leg, RUL is the right upper leg, RLL is the right lower leg.

To recognize more complex behaviors we use multiple layers (Figure 62).

Figure 62. Activity recognition using HPPN

For testing we used the waiting behavior, which can be composed of different simple, concurrent actions. Another problem is handling the different durations of the actions. We solved this problem by introducing skip transitions: a skip transition feeds a token back to the same place through a transition, and each such firing is penalized by a reduced token probability. The probabilities assigned to the skip transitions control the model's tolerance to deviations from the base activity pattern.

Figure 63. HPPN for a single motion pattern

The above figure exemplifies the HPPN of a single motion pattern. The abbreviations used are: RA – right arm, LA – left arm, T – torso, RL – right leg and LL – left leg. The boxes are reusable Petri Net parts that can be replaced by a single place. In the network we have multiple instances of these parts; every instance has its own parameterization to fulfill its purpose and to recognize different actions or behaviors.

The Hierarchical Probabilistic Petri Net, extended with probabilistic concurrency and synchronization operations, is able to represent human behaviors in a more realistic way. The time delay between the ideal sequence and the real input is modeled using skip transitions, which decrease the probability of the tokens. The concurrency operation allows modeling behaviors that can be realized through different action sequences.

The input for the Petri Net was provided by the Heuristic FastDTW matching algorithm presented in the previous subchapter. Using the hierarchical approach and concurrency, we are able to synchronize the body parts' movements and to recognize simple basic actions.

Conclusions

This chapter presents the behavior recognition module of the system. The module contains the action recognition and the behavior recognition. To create an efficient behavior recognition module we studied algorithms from both areas and proposed new approaches to recognize actions and behaviors.

We proposed an action recognition approach which uses motion decomposition and Heuristic FastDTW.

We proposed a motion decomposition to represent the body part motions efficiently and to simplify the task, together with a compression of the motion signal [203].

To recognize the human action we proposed a Heuristic FastDTW method. This method uses the body part angular motions to identify the basic motion of the body parts [206].

We proved that the proposed Heuristic FastDTW method is suitable to classify the human body part motion even if the data provided by the human detection and pose estimation component has a lower resolution [207, 211].

To recognize the human action we used three neural networks: the Learning Vector Quantisation (LVQ), the Radial Basis Function (RBF) and the feedforward multilayer network or MultiLayer Perceptron (MLP) NNs. Based on the comparison, the LVQ has the best performance in categorizing the human actions from the recognized body part motions [202].

The second part of the chapter addresses behavior recognition, having as input the recognized actions and motions and as output the recognized behavior. For this purpose we used a description-based approach: we proposed a Hierarchical Probabilistic Petri Net which uses concurrency and probability to increase the generality of the approach. By introducing concurrency, the method becomes capable of describing concurrent behaviors and of handling uncertainty in the input data flow [197, 205].

Based on the experiments, the HPPN is suitable to take over the role of the Neural Network in the action recognition step, classifying the results of the Heuristic FastDTW and integrating them into the behavior recognition process. The HPPN is faster than the Neural Network and capable of handling missing data [197, 205].

Conclusions and future work

This chapter is a summary of the thesis, highlighting my contributions to the field of human behavior recognition from video sequences. The final part presents my conclusions related to this diverse and highly researched field of computer vision.

Summary of Results

This dissertation is composed of four parts.

Chapter 2 contains a survey of the current state of the art of human behavior recognition systems.

In this chapter we formulated the aim of the thesis and based on the literature we introduced a general framework for human behavior recognition.

The second part of the chapter presents a synthesis of the most important achievements in the field of human behavior recognition, covering all the components needed.

We went step by step through all the components of the system and presented the most significant works in this field. The first component is the preprocessing component; here we enumerated the most important background subtraction methods and optical flow approaches. The most important component of the system is the human detection and pose estimation module; here we presented the state of the art in single detection window approaches and in component-based approaches. Finally, we presented the most important techniques in the field of human activity and behavior recognition.

Chapter 3 presents the preprocessing components of the system. In this chapter we described our investigation into reducing the search space for the human detection module. For this purpose we studied the foreground detection algorithms: background subtraction and optical flow.

Our contributions to this field are the following:

Identification of the challenges of this component

Definition of the performance measurements suitable to compare the different methods [210]

Implementation and comparison of 9 different foreground detection techniques [201]

A new foreground detection algorithm based on Integral Image [210]

We introduced a list of features and a methodology to compare the foreground detection methods. The identified features are: detection precision, discriminative power, shadow removal capability, memory requirement, and computational power requirement.

We studied and compared the following methods:

Frame differencing,

Running average foreground,

Running Gaussian Average,

Min-Max method,

Meanshift based foreground detection,

Gaussian Mixture based foreground detection

Eigenbackground based foreground detection

Optical flow (Lucas-Kanade)

Integral Image based foreground detection (our method)

We implemented the above mentioned methods. We performed several tests using the performance measurement model defined by us. We used indoor and outdoor image sequences to cover all challenging cases.

We elaborated a novel Integral Image based method to detect the foreground objects. We implemented this method, performed all the tests and compared its performance with that of the existing methods [210].

Based on the comparison we have shown that the proposed Integral Image based foreground detection method is suitable for foreground detection in the case of static backgrounds: it is four times faster than the Gaussian Mixture Method, with the same precision. The method has one of the best discriminative performances among the studied techniques [201, 210].

Chapter 4 presents the human detection and pose estimation component, which represents the most important part of the system. We present in this chapter three different approaches for human detection and pose estimation, each of them representing a major research direction in this field.

The first method belongs to the "single window" category. We proposed a novel classifier, PETC, based on the work of Viola and Lienhart. We created a new classifier structure and training algorithm [199] to detect multi-view and multi-pose human bodies and to estimate their body configuration [198, 212].

We introduced novel background selection algorithms to achieve better results. Based on experiments, these algorithms speed up the training procedure and simplify the classifier structure [212].

We implemented and trained the Viola and Lienhart classifiers to compare them with PETC. We proved that PETC is faster than the other two classifiers. Its correct detection rate is higher than that of Viola's classifier and comparable with that of Lienhart's classifier, but PETC has a lower false positive rate [212].

The comparisons of the classifiers were performed on the INRIA database and on our own databases.

The next method is the chamfer matching method, a template-based approach used to detect people in the image and estimate their pose. Our contribution to this approach is a fast pseudo-parallel computation of the distance transform and a new, efficient technique to store the templates [209].

We proposed a novel pseudo-parallel approach to compute the distance transformation. Based on experiments, this method is 25% faster than other methods [207, 201].

We also suggested a new technique to store the templates, which allows us to find the most probable matching template very quickly [203, 201].

We proposed a framework (CHMS) to detect human presence in the image and estimate the pose. Using the CHMS we obtained a detector that is 5 times faster, with the same detection performance, fewer false positives and a better pose estimation rate [199].

We also performed several tests to investigate the performance of the chamfer matching in relation to image homogeneity [199].

Finally, the third method belongs to the component-based approaches and uses the Pictorial Structure to detect people in the image and to estimate their body configuration.

We implemented the Pictorial Structure (Felzenszwalb algorithm) based method and tested the performance [197,199, 206, 200, 208].

We proposed two new methods to increase the performance of the Pictorial Structure based method:

The first method uses the motion information to modify the Pictorial Structure parameters and to speed up the recognition process [199, 200].

The second method uses tracking information to eliminate the ambiguity caused by occlusions or self-occlusions [199].

The experiments show that the methods proposed by us are considerably better than the original ones: they are 20 times faster, have a 13% higher detection rate, a 5-10% higher accuracy, and a considerably reduced false positive detection rate [199, 206, 200].

The proposed method uses motion and tracking information without reducing the framework's applicability to still images or to cases when this information is not available [199, 200].

The last part of the chapter contains a comparison of the three methods' performance, using video sequences with different human sizes in the image. We showed that each of them has its use: the first two methods are fast and work well at lower resolutions, while the Pictorial Structure based method gives the best detection and pose estimation results, but works very slowly even with our speed-up adjustments [199, 204].

Chapter 5 presents the behavior recognition component of the system. The component recognizes the human activities and behaviors. Our contributions to the field of activity recognition are a new method for human motion representation and an improved matching algorithm. We also introduced a Hierarchical Probabilistic Petri Net (HPPN), which is suitable for recognizing complex behaviors.

We proposed a motion decomposition method to represent the body part motions more efficiently and to reduce the dimensionality of the activity recognition task. We also proposed a new motion time series compression method, which compresses the motion time series efficiently without losing the most valuable information.

We proposed an improved version of the DTW which shrinks the body part motions to shorter time series, enabling a faster categorization of the motions [202, 203].

We proved that the proposed Heuristic FastDTW method is suitable to classify the human body part motion even if the data provided by the human detection and pose estimation component has a lower resolution [211, 207].

To recognize the human action we used three neural networks: the Learning Vector Quantisation (LVQ), the Radial Basis Function (RBF) and the feedforward multilayer network or MultiLayer Perceptron (MLP) NNs. Based on the comparison, the LVQ has the best performance in categorizing the human actions from the recognized body part motions [210].

To recognize complex behaviors we proposed a Hierarchical Concurrent Probabilistic Petri Net. By introducing concurrency into the Probabilistic Petri Net, we increased the generality of the approach and made it suitable for describing concurrent actions and for handling uncertainty in the input data flow [197, 205].

Based on the experiments, the HPPN is suitable to take over the role of the Neural Network in the action recognition step and to integrate it into the behavior recognition, speeding up the process and working with missing data [197, 205].

The thesis is based on the following publications:

Journals B+

Tamás Vajda, Action Recognition Using DTW and Petri Nets, Studia Universitatis Babes-Bolyai Series Informatica, Volume LV, Number 2 (June 2010), pp 69-78, ISSN: 1224-869x.

Tamás Vajda, Sergiu Nedevschi: Articulated Pose Estimation in Surveillance Videos, ACAM Scientific Journal, Vol. 20, No. 2, 2011, pp. 111-118, ISSN: 1221-437X

Tamás Vajda, "Using Dynamic Time Warping Algorithm Optimization For Fast Human Action Recognition",Acta Technica Napocensis – Electronics and Telecommunication, Volume 51, Number 2/2010 pp.32-37, ISSN 1221-6542

ISI Proceedings

Tamás Vajda, Emőke Szatmári, Sergiu Nedevschi -Human Body Detection and Tracking in Video Sequences Using Chamfer Matching. IEEE 3th International Conference on Intelligent Computer Communication and Processing, ICCP 2007, Sept. 6-8, 2007, Cluj-Napoca, pp. 141-146, ISBN:978-1-4244-1491-8

Tamás Vajda Action Recognition Based on Fast Dynamic-Time Warping Method IEEE 5th International Conference on Intelligent Computer Communication and Processing, ICCP 2009, Aug 27-29, 2009, Cluj-Napoca, pp. 127 – 131, ISBN: 978-1-4244-5007-7

Proceedings indexed in databases (IEEE Xplore, CPCI)

Tamás Vajda Behavior Recognition Based on Dynamic Programming and Concurrence Probabilistic Petri Nets IEEE 6th International Conference on Intelligent Computer Communication and Processing, ICCP 2010, Aug 26-28, 2010, Cluj-Napoca, pp. 179 – 184, ISBN: 978-1-4244-8229-0

Tamás Vajda Behavior Recognition Using Pictorial Structures and DTW 2010 IEEE International Conference on Automation, Quality and Testing, Robotics, Mai 28-29, 2010, Cluj-Napoca, vol3, pp 198-201

Tamás Vajda and Lőrinc M.: General framework for human object detection and pose estimation in video sequences, In 5th IEEE International Conference on Industrial Informatics, June 23-27, 2007, Vienna, pp. 467 – 472, ISSN: 1935-4576

Tamás Vajda, Ábrám Zoltán – Pictorial Structure Based People Detection and Pose Estimation in Videos. International Conference on Intelligent Computer Communication and Processing, ICCP 2011, Aug. 25-27, 2011, Cluj-Napoca, pp. 315 – 318.

Tamás Vajda, Behavior Recognition Using Template Matching, The 4th edition of the Interdisciplinarity in Engineering International Conference, November 12-13, 2009, Tg.- Mures, pp. 283-288, ISSN 1843-780X

Other International Proceedings

Tamás Vajda, László Bakó, Sándor Tihamér Brassai. Using dynamic programing and Neural Network to Match Human Action, 11th International Carpathian Control Conference ICCC 2010, May 26-29, 2010, Eger, Hungary, pp 231-234.

Tamás Vajda, Sergiu Nedevschi : Fast Multi-View Human detection and attitude estimation , CSCS16 – The 16th International Conference on Control Systems and Computer Science, May22-25, 2007, Bucuresti, Romania, , vol. 2, pp.17-23

Tamás Vajda : Moving object detection in video sequences using Integral Image 20th International Conference on Computer Science and Education, October 2010, Satu Mare, Romania, pp. 225-228, ISSN 1842-4546

Tamás Vajda : Hierarchical human behavior recognition 8th International Conference on Computer Science and Energetics-Electrical Engineering, Sumuleu-Ciuc, October 2008, Romania, pp. 139-144, ISSN 1842-4546

Tamás Vajda : Attitude detection methods usability in behavior recognition 9th International Conference on Computer Science and Education, October 2009, Tg-Mures, Romania, pp. 139-144, ISSN 1842-4546

Tamás Vajda : Human Body Detection and Tracking in Video Sequences Using Chamfer Matching, 7th International Conference on Computer Science and Education, October 2007, Oradea, Romania, pp. 54-58, ISSN 1842-4546.

Research contracts related to the thesis:

Future research

Possible future directions and research areas:

– add multi-camera support to the human behavior detection algorithms

– extend our current work to be appropriate to track patients in hospitals and to be able to recognize certain behaviors and scenarios related to their diseases

– integrate the chamfer matching method as a base part detector for the Pictorial Structure based method, improving the prediction phase of the next model configuration

– improve the Pictorial Structure based statistical model to handle uncertainties and occlusions more efficiently using more sophisticated tracking and estimation algorithms

– develop a training algorithm for the Petri Net parameters using its inputs

Bibliography

A.K. Jain, R.P.W. Duin, and J. Mao: “Statistical Pattern Recognition: A Review”, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 22, No. 1, pp. 4-37, 2000.

Allen, J. F. and Ferguson, G.: “Actions and events in interval temporal logic”, Journal of Logic and Computation, Vol. 4, No 5, pp. 531-579, 1994.

Allen, J. F: “Maintaining knowledge about temporal intervals”, Communications of the ACM 26, Vol.11, pp. 832-843, 1983.

B. Galvin, B. McCane, K. Novins, D. Mason, and S. Mills: “Recovering motion fields: an analysis of eight optical flow algorithms”, In Proc. 1998 British Machine Vision Conference, Southampton, England, 1998.

B. Han, D. Comaniciu, and L. Davis: "Sequential kernel density approximation through mode propagation: applications to background modeling”, Proc. ACCV -Asian Conf. on Computer Vision, 2004.

B. Leibe, E. Seemann, and B. Schiele: “Pedestrian Detection in Crowded Scenes”, Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, pp. 878-885, 2005.

B. Leibe, N. Cornelis, K. Cornelis, and L.V. Gool: “Dynamic 3D Scene Analysis from a Moving Vehicle”, Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2007.

B. Stenger, A. Thayananthan, P. H. S. Torr, and R. Cipolla: “Model-based hand tracking using a hierarchical bayesian filter”, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 28, No. 9, pp. 1372-1384, 2006.

B. Wu and R. Nevatia: “Detection and Tracking of Multiple, Partially Occluded Humans by Bayesian Combination of Edgelet Based Part Detectors”, Int’l J. Computer Vision, Vol. 75, No. 2, pp. 247-266, 2007.

B.D. Lucas and T. Kanade: “An Iterative Image Registration Technique with an Application to Stereo Vision”, DARPA Image Understanding Workshop, pp. 121-130, 1981.

B.K.P. Horn and B.G. Schunck: “Determining Optical Flow”, Artificial Intelligence, Vol. 17, pp. 185-204, 1981.

B.P.L. Lo and S.A. Velastin: “Automatic congestion detection system for underground platforms”, Proc. of Int. Symp. on Intell. Multimedia, Video and Speech Processing, pp. 158-161, 2001.

Barron, J.L., Fleet, D.J., Beauchemin, S.S., Burkitt, T.A.: “Performance of Optical Flow Techniques”, Computer Vision and Pattern Recognition, Proceedings '92, IEEE Computer Society Conference, pp. 236 – 242, 1992.

Barrow, H.G., Tenenbaum, J.M., Bolles, R.C. and Wolf, H.C.: “Parametric correspondence and chamfer matching: Two new techniques for image matching”, Proc. 5th Int. Joint Conf. on Artificial Intelligence, Cambridge, 1977.

Baumberg: “Hierarchical Shape Fitting Using an Iterated Linear Filter”, Proc. British Machine Vision Conf., pp. 313-323, 1996.

Biswas S., Sil J., Sengupta N.: “Background Modeling and Implementation using Discrete Wavelet Transform: a Review”, JICGST-GVIP, Vol. 11, Issue 1, pp. 29-42, 2011.

Blank, M., Gorelick, L., Shechtman, E., Irani, M., and Basri, R.: “Actions as space-time shapes”, In IEEE International Conference on Computer Vision, pp. 1395-1402, 2005.

Bobick, A. and Davis, J.: “The recognition of human movement using temporal templates”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No.3, pp. 257-267, 2001.

Bobick, A. F. and Wilson, A. D.: “A state-based approach to the representation and recognition of gesture”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 12, pp. 1325-1337, 1997.

Borgefors, G.: “Distance transformations in digital images”, Computer Vision, Graphics and Image Processing, Vol. 34, No. 3, pp. 344–371, 1986.

Borzi, K. Ito, and K. Kunisch: “Optimal control formulation for determining optical flow”, SIAM Journal on Scientic Computing, Vol. 24, No.3, pp. 818-847, 2002.

Bruhn, J. Weickert, and C. Schnorr: “Lucas/Kanade meets Horn/Schunck: Combining local and global optic flow methods”, International Journal of Computer Vision, Vol. 61, Issue 3, pp. 211-231, 2005.

Bruhn, J. Weickert, C. Feddern, T. Kohlberger, and C. Schnorr: “Real-time optic flow computation with variational methods”, In N. Petkov and M. A. Westenberg, editors, Computer Analysis of Images and Patterns, Vol. 2756 of Lecture Notes in Computer Science, pp. 222-229. Springer, Berlin, 2003.

Butler D., Sridharan S.: “Real-Time Adaptive Background Segmentation”, ICASSP, 2003.

C. Papageorgiou and T. Poggio: “A Trainable System for Object Detection”, Int’l J. Computer Vision, Vol. 38, pp. 15-33, 2000.

C. Stauffer and W. E. L. Grimson, “Learning patterns of activity using real-time tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 747–757, 2000.

C. Stauffer, W.E.L. Grimson: “Adaptive background mixture models for real-time tracking”, Proceedings IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 246-252, 1999.

C. Stauffer, W.E.L. Grimson: “Learning patterns of activity using real-time tracking”, IEEE Trans. on Patt. Anal. and Machine Intell., Vol. 22, No. 8, pp. 747-757, 2000.

C. Wren, A. Azabayejani, T. Darrell and A. Pentland: “Pfinder: Real-time tracking of the human body”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, pp. 780-785, 1997.

Campbell, L. W. and Bobick, A. F.: “Recognition of human body motion using phase space constraints”, In IEEE International Conference on Computer Vision, pp. 624-630, 1995.

Chang R., Ghandi T., Trivedi M.: “Vision modules for a multi sensory bridge monitoring approach”, ITSC 2004, pp. 971-976, 2004.

Chomat, O. and Crowley, J.: “Probabilistic recognition of activity using local appearance”, IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, 1999.

Cristani M., Farenzena M., Bloisi D., Murino V.: “Background Subtraction for Automated Multisensor Surveillance: a Comprehensive Review”, EURASIP Journal on Advances in Signal Processing, Vol. 2010, pages 24, 2010.

Culbrik D., Marques O., Socek D., Kalva H., Furht B.: “Neural network approach to background modeling for video object segmentation”, IEEE Transaction on Neural Networks, Vol. 18, No. 6, pp. 1614–1627, 2007.

D. A. Forsyth, O. Arikan, L. Ikemoto, J. O’Brien, and D. Ramanan, “Computational studies of human motion: part 1, tracking and motion synthesis,” Foundations and Trends in Computer Graphics and Vision, vol. 1, no. 2-3, pp. 77–254, 2005

D. Huttenlocher, G. Klanderman, and W. Rucklidge: “Comparing images using the Hausdorff distance”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 15, No. 9, pp. 850–863, 1993.

D. Koller, J. Weber, T. Huang, J. Malik, G. Ogasawara, B. Rao, and S. Russel: “Towards Robust Automatic Traffic Scene Analysis in Real-Time”, Proceedings of Int’l Conference on Pattern Recognition, pp. 126–131, 1994.

D. Terzopoulos: “Image analysis using multigrid relaxation”, IEEE Transactions on Pattern Analysis and Machine Intelligence,Vol. 8, No. 2, pp. 129-139, 1986.

D.G. Lowe: “Distinctive Image Features from Scale Invariant Keypoints”, Int’l J. Computer Vision, Vol. 60, No. 2, pp. 91-110, 2004.

D.M. Gavrila and J. Giebel: “Shape-based pedestrian detection and tracking”, IEEE Intelligent Vehicle Symposium, pp. 8-14, 2002.

D.M. Gavrila and S. Munder: “Multi-cue Pedestrian Detection and Tracking from a Moving Vehicle”, International Journal of Computer Vision, Vol. 73, No. 1, pp. 41-59, 2007.

D.M. Gavrila: “A Bayesian Exemplar-Based Approach to Hierarchical Shape Matching”, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 29, No. 8, pp. 1408-1421, 2007.

Dai, P., H. Di, H., Dong, L., Tao, L., and Xu, G.: “Group interaction analysis in dynamic context”, IEEE Transactions on Systems, Man, and Cybernetics, Part B, Vol. 38, No. 1, pp. 275-282, 2008.

Damen, D. and Hogg, D.: “Recognizing linked events: Searching the space of feasible explanations”, IEEE Conference on Computer Vision and Pattern Recognition, 2009.

Darrell, T. and Pentland, A.: “Space-time gestures”, IEEE Conference on Computer Vision and Pattern Recognition, pp. 335-340, 1993.

Dollar, P., Rabaud, V., Cottrell, G., and Belongie, S.: “Behavior recognition via sparse spatio-temporal features”, 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65-72, 2005.

E. Memin and P. Perez: “Dense estimation and object-based segmentation of the optical flow with robust techniques”, IEEE Transactions on Image Processing, Vol. 7, No.5,pp. 703-719, 1998.

El Baf F., Bouwmans T., Vachon B.: “Fuzzy Integral for Moving Object Detection”, FUZZ-IEEE 2008, pp. 1729-1736, Hong-Kong, China, 2008.

El Baf F., Bouwmans T., Vachon B.: “Type-2 fuzzy mixture of Gaus- sians model: Application to background modeling”, ISVC 2008, pp. 772-781, Las Vegas, USA, 2008.

Elgammal A., Harwood D., Davis L.: “Non-parametric Model for Background Subtraction”, ECCV 2000, pp. 751-767, Dublin, Ireland, 2000.

Elhabian S. Y., El-Sayed K. M., Ahmed S.: “Moving Object Detection in Spatial Domain using Background Removal Techniques – State-of-Art”, Recent Patents on Computer Science, Vol. 1, No 1, pp. 32- 54, 2008.

F. Heitz and P. Bouthemy: “Multimodal estimation of discontinuous optical flow using Markov random fields”, IEEE Transactions on Pattern Analysis and Machine Intelligence,Vol. 15, Issue 12, pp. 1217-1232, 1993.

F. Porikli: “Automatic image segmentation by Wave Propagation”, Proceedings of IS&T/SPIE Symposium on Electronic Imaging, San Jose, 2004

Freund, Y. and R. Schapire: “A short introduction to Boosting”, J. of Japanese Society for AI, Vol. 14, No. 5, pp. 771–780, 1999.

G. Halevy and D.Weinshall: “Motion of disturbances: detection and tracking of multibody non-rigid motion”, Machine Vision and Applications, Vol. 11, pp. 122-137, 1999.

G. P. Stein: “Tracking from multiple view points: Self-calibration of space and time”, Computer Vision and Pattern Recognition Fort Collins, pp. 521-527, 1999.

G. Zini, A. Sarti, and C. Lamberti: “Application of continuum theory and multi-grid methods to motion evaluation from 3D echocardiography”, IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control,Vol. 44, No. 2, pp. 297-308, 1997.

Borgefors, G.: “Hierarchical chamfer matching: A parametric edge matching algorithm”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 10, No. 6, pp. 849-865, 1988.

Borgefors, G.: “Improved version of chamfer matching algorithm”, 7th Int. Conf. Pattern Recognition, 1984.

Gavrila, D. and Davis, L.: “Towards 3-D model-based tracking and recognition of human movement”, International Workshop on Face and Gesture Recognition, pp. 272-277, 1995.

Gavrila, D. and V. Philomin: “Real-time object detection for “smart” vehicles”, International Conference on Computer Vision (ICCV99), pp 87–93, 1999.

Gavrila, D. M.: “The visual analysis of human movement: A survey”, Computer Vision and Image Understanding, Vol. 73,No. 1, pp. 82-98, 1999.

Ghanem, N., DeMenthon, D., Doermann, D., and Davis, L.: “Representation and recognition of events in surveillance video using Petri nets”, IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2004.

Gong, S. and Xiang, T.: “Recognition of group activities using dynamic probabilistic networks”, IEEE International Conference on Computer Vision, p. 742, 2003.

Gupta, A., Srinivasan, P., Shi, J., and Davis, L. S.: “Understanding videos, constructing plots learning a visually grounded storyline model from annotated videos”, IEEE Conference on Computer Vision and Pattern Recognition, 2009.

H. Shimizu and T. Poggio: “Direction Estimation of Pedestrian from Multiple Still Images”, Proc. IEEE Intelligent Vehicles Symposium, pp. 596-600, 2004.

H. Sidenbladh and M.J. Black: “Learning the Statistics of People in Images and Video”, Int’l J. Computer Vision, Vol. 54, Nos. 1-3, pp. 183-209, 2003.

H. Zhong, J. Shi, and M. Visontai, “Detecting unusual activity in video,” Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 819–826, 2004.

H.-H. Nagel: “Extending the 'oriented smoothness constraint' into the temporal domain and the estimation of derivatives of optical flow”, In O. Faugeras, ed., Computer Vision ECCV '90, Vol. 427 of Lecture Notes in Computer Science, pp. 139-148. Springer, Berlin, 1990.

Hakeem, A., Sheikh, Y., and Shah, M.: “CASEE: A hierarchical event representation for the analysis of videos”, Proceedings of the 20th National Conference on Artificial Intelligence, pp. 263-268, 2004.

Haritaoglu, D. Harwood and L. S. Davis: “Hydra: Multiple people detection and tracking using silhouettes”, Second IEEE Workshop on Visual Surveillance Fort Collins, pp. 6-13, 1999.

Haritaoglu, D. Harwood and L. S. Davis: “W4: Who? when? where? what? a real time system for detecting and tracking people”. Third Face and Gesture Recognition Conference, pp. 222-227. 1998.

Haritaoglu, R. Cutler, D. Harwood and L. S. Davis: “Backpack: Detection of people carrying objects using silhouettes”. International Conference on Computer Vision, pp. 102-107, 1999.

Harris, C. and Stephens, M.: “A combined corner and edge detector”, Alvey Vision Conference, pp. 147-152, 1988.

I. Cohen: “Nonlinear variation method for optical flow computation”, In Proc. Eighth Scandinavian Conference on Image Analysis, Vol. 1, pp. 523-530, Tromso, Norway, 1993.

I.P. Alonso et al: “Combination of Feature Extraction Methods for SVM Pedestrian Detection”, IEEE Trans. Intelligent Transportation Systems, Vol. 8, No. 2, pp. 292-307, 2007.

Intille, S. S. and Bobick, A. F.: “A framework for recognizing multi-agent action from visual evidence”, AAAI/IAAI, pp. 518-525, 1999.

Ivanov, Y. A. and Bobick, A. F.: “Recognition of visual activities and interactions by stochastic parsing”, IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 22, No. 8, pp. 852-872, 2000.

J. B. Tenenbaum, V. D. Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction.” Science, vol. 290, no. 5500, pp. 2319–2323, 2000.

J. Heikkila and O. Silven: “A real-time system for monitoring of cyclists and pedestrians”, Second IEEE Workshop on Visual Surveillance Fort Collins, pp. 74-81, 1999.

J. L. Barron, D. J. Fleet, and S. S. Beauchemin: “Performance of optical flow techniques”, International Journal of Computer Vision, Vol. 12, No.1, pp. 43-77, 1994.

J. Weickert and C. Schnorr: “A theoretical framework for convex regularizers in PDE-based computation of image motion”, International Journal of Computer Vision, Vol. 45, Issue 3, pp. 245-264, 2001.

J.D. Rymel, J.R. Renno, D. Greenhill, J. Orwell, G.A. Jones: “Adaptive Eigen-Backgrounds for Object Detection”, IEEE International Conference on Image Processing, 2004.

J.K. Aggarwal , M.S. Ryoo: “Human activity analysis: A review”, ACM Computing Surveys, Vol.43, No.3, pp.1-43, 2011.

J.L. Barron and N.A. Thacker: “Tutorial: Computing 2D and 3D Optical Flow”, Tina Memo No. 2004-012, 2005.

J.Weickert and C. Schnorr: “Variational optic flow computation with a spatio-temporal smoothness constraint”, Journal of Mathematical Imaging and Vision, Vol.14, No. 3, pp. 245-255, 2001.

Javed, O., Shafique, K: “A hierarchical approach to robust background subtraction using color and gradient information”, Proceedings of the IEEE Workshop Motion and Video Computing, pp. 22- 27, 2002.

Joo, S.-W. and Chellappa, R.: “Attribute grammar-based event recognition and anomaly detection”, IEEE Conference on Computer Vision and Pattern Recognition Workshop, p. 107, 2006.

K. Fukushima, S. Miyake, and T. Ito: “Neocognitron: A Neural Network Model for a Mechanism of Visual Pattern Recognition”, IEEE Trans. Systems, Man, and Cybernetics, Vol. 13, pp. 826-834, 1983.

K. Mikolajczyk, C. Schmid, and A. Zisserman: “Human Detection Based on a Probabilistic Assembly of Robust Part Detectors”, Proc. European Conf. Computer Vision, pp. 69-81, 2004.

K. Okuma, A. Taleghani, N. de Freitas, J. Little, and D. Lowe: “A Boosted Particle Filter: Multitarget Detection and Tracking”, Proc. European Conf. Computer Vision, pp. 28-39, 2004.

K. Toyama and A. Blake: “Probabilistic Tracking with Exemplars in a Metric Space”, Int’l J. Computer Vision, Vol. 48, No. 1, pp. 9-19, 2002.

K. Toyama, J. Krumm, B. Brumitt and B. Meyers: “Wallflower: Principles and practice of background maintenance”, International Conference on Computer Vision, pp. 255-261, 1999.

Ke, Y., Sukthankar, R., and Hebert, M.: “Spatio-temporal shape and flow correlation for action recognition”, IEEE Conference on Computer Vision and Pattern Recognition, 2007.

Kim K., Chalidabhongse T., Harwood D., Davis L.: “Real-time Foreground-Background Segmentation using Codebook Model”, Real-Time Imaging, Vol.11, 2005.

L. Alvarez, J. Esclarín, M. Lefébure, and J. Sánchez: “A PDE model for computing the optical flow”, In Proc. XVI Congreso de Ecuaciones Diferenciales y Aplicaciones, pp. 1349-1356, Las Palmas de Gran Canaria, Spain, 1999.

L. Alvarez, J. Weickert, and J. Sanchez: “Reliable estimation of dense optical flow fields with large displacements”, International Journal of Computer Vision, Vol. 39, Issue 1, pp. 41-56, 2000.

L. Fan, K.-K. Sung, and T.-K. Ng: “Pedestrian Registration in Static Images with Unconstrained Background”, Pattern Recognition, Vol. 36, pp. 1019-1029, 2003.

L. Zhang, B. Wu, and R. Nevatia: “Detection and Tracking of Multiple Humans with Extensive Pose Articulation”, Proc. Int’l Conf. Computer Vision, 2007.

Laptev, I. and Lindeberg, T.: “Space-time interest points”, IEEE International Conference on Computer Vision, p. 432, 2003.

Laptev, I., Marszalek, M., Schmid, C., and Rozenfeld, B.: “Learning realistic human actions from movies”, IEEE Conference on Computer Vision and Pattern Recognition, 2008.

Lee B., Hedley M.: “Background Estimation for Video Surveillance”, IVCNZ 2002, Vol.1, pp. 315-320, 2002.

Lienhart, R. and Maydt, J.: “An Extended Set of Haar-like Features for Rapid Object Detection”, IEEE ICIP 2002, Vol. 1, pp. 900-903, 2002.

Lienhart, R., Kuranov, A., and Pisarevsky, V.: “Empirical Analysis of Detection Cascades of Boosted Classifiers for Rapid Object Detection”, DAGM'03, 25th Pattern Recognition Symposium, Magdeburg, pp. 297-304, 2003.

Lienhart, R., Liang, L., and Kuranov, A.: “A Detector Tree of Boosted Classifiers for Real-time Object Detection and Tracking”, IEEE ICME 2003, Vol. 2, pp. 277-280, 2003.

Liu, J., Luo, J., and Shah, M.: “Recognizing realistic actions from videos in the wild”, IEEE Conference on Computer Vision and Pattern Recognition, 2009.

Lublinerman, R., Ozay, N., Zarpalas, D., and Camps, O.: “Activity recognition from silhouettes using linear systems and model (in)validation techniques”, International Conference on Pattern Recognition, pp. 347-350, 2006.

Lv, F. and Nevatia, R.: “Single view human action recognition using key pose matching and Viterbi path searching”, IEEE Conference on Computer Vision and Pattern Recognition, 2007.

M. Andriluka, S. Roth, and B. Schiele: “People-tracking-by-detection and people-detection-by-tracking”, CVPR, pp.1-8, 2008.

M. Andriluka, Stefan Roth, and Bernt Schiele: “Pictorial structures revisited: People detection and articulated pose estimation”, In Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” Advances in Neural Information Processing Systems, pp. 585–591, 2001.

M. Bergtholdt, D. Cremers, and C. Schnörr: “Variational Segmentation with Shape Priors”, Handbook of Math. Models in Computer Vision, N. Paragios, Y. Chen, and O. Faugeras, eds., Springer, 2005.

A. Elgammal and C. S. Lee, “Inferring 3D body pose from silhouettes using activity manifold learning,” Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 681-688, 2004.

M. Enzweiler and D. M. Gavrila: “Monocular Pedestrian Detection: Survey and Experiments”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 31, No. 12, pp. 2179-2195, 2009.

M. Enzweiler and D.M. Gavrila: “A Mixed Generative-Discriminative Framework for Pedestrian Classification”, Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2008.

M. Drulea and S. Nedevschi: “Total variation regularization of local-global optical flow”, 14th International IEEE Conference on Intelligent Transportation Systems (ITSC), pp. 318-323, 2011.

M. J. Black and P. Anandan: “Robust dynamic motion estimation over time”, In Proc. 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 292-302, Maui, HI, 1991.

M. Piccardi, T. Jan: “Mean-shift background image modeling”, Proc. of IEEE International Conference on Image Processing, Singapore, 2004.

M. Seki, T. Wada, H. Fujiwara, K. Sumi: “Background detection based on the cooccurrence of image variations”, Proc. of CVPR, Vol. 2, pp. 65-72, 2003.

M. Szarvas, A. Yoshizawa, M. Yamamoto, and J. Ogata: “Pedestrian Detection with Convolutional Neural Networks,” Proc. IEEE Intelligent Vehicles Symposium, pp. 223-228, 2005.

M.J. Jones and T. Poggio: “Multidimensional Morphable Models”, Proc. Int’l Conf. Computer Vision, pp. 683-688, 1998.

Maddalena L., Petrosino A.: “A self organizing approach to background subtraction for visual surveillance applications”, IEEE Transactions on Image Processing, Vol.17, No. 7, pp. 1729–1736, 2008.

McFarlane, N., Schofield, C.: “Segmentation and tracking of piglets in images”, BMVA 1995, pp. 187-193, 1995.

Messelodi, S., Modena, C., Segata, N., Zanin, M.: “A Kalman filter based background updating algorithm robust to sharp illumination changes”, ICIAP 2005, Vol. 3617, pp. 163-170, Cagliari, Italy, 2005.

Minnen, D., Essa, I. A., and Starner, T.: “Expectation grammars: Leveraging high-level expectations for activity recognition”, IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 626-632, 2003.

A. Mohan, C. Papageorgiou, and T. Poggio: “Example-Based Object Detection in Images by Components”, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 23, No. 4, pp. 349-361, 2001.

Moore, D. J. and Essa, I. A.: “Recognizing multitasked activities from video using stochastic context-free grammar”, AAAI/IAAI, pp. 770-776, 2002.

N. Dalal and B. Triggs: “Histograms of Oriented Gradients for Human Detection”, Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, pp. 886-893, 2005.

N. Dalal, B. Triggs, and C. Schmid: “Human Detection Using Oriented Histograms of Flow and Appearance”, Proc. European Conf. Computer Vision, pp. 428-441, 2006.

N. M. Oliver, B. Rosario, and A. P. Pentland: “A Bayesian Computer Vision System for Modeling Human Interactions”, IEEE Trans. on Patt. Anal. and Machine Intell., Vol. 22, No. 8, pp. 831-843, 2000.

Nagel H.H.: “Displacement vectors derived from second-order intensity variations in image sequences”, CGIP, Vol. 21, pp. 85-117, 1983.

C. Nakajima, M. Pontil, B. Heisele, and T. Poggio: “Full-Body Person Recognition System”, Pattern Recognition, Vol. 36, pp. 1997-2006, 2003.

Nam, Y., Wohn, K., and Lee-Kwang, H.: “Modeling and recognition of hand gesture using colored Petri nets”, IEEE Transactions on Systems, Man and Cybernetics, Vol. 29, No. 5, pp. 514-521, 1999.

Natarajan, P. and Nevatia, R.: “Coupled hidden semi Markov models for activity recognition”, IEEE Workshop on Motion and Video Computing, 2007.

Nevatia, R., Hobbs, J., and Bolles, B.: “An ontology for video event representation”, IEEE Conference on Computer Vision and Pattern Recognition Workshop, Vol. 7, 2004.

Nevatia, R., Zhao, T., and Hongeng, S.: “Hierarchical language-based representation of events in video streams”, In IEEE Workshop on Event Mining, 2003.

Nguyen, N. T., Phung, D. Q., Venkatesh, S., and Bui, H. H.: “Learning and detecting activities from movement trajectories using the hierarchical hidden Markov models”, IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 955-960, 2005.

Niebles, J. C., Wang, H., and Fei-Fei, L.: “Unsupervised learning of human action categories using spatial-temporal words”, International Journal of Computer Vision, Vol. 79, No. 3, 2008.

Niyogi, S. and Adelson, E.: “Analyzing and recognizing walking figures in XYT”, IEEE Conference on Computer Vision and Pattern Recognition, pp. 469-474, 1994.

O. Tuzel, F. Porikli, and P. Meer: “Human Detection via Classification on Riemannian Manifolds”, Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2007.

Oliver, N. M., Rosario, B., and Pentland, A. P.: “A Bayesian computer vision system for modeling human interactions”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 8, pp. 831-843, 2000.

P. F. Felzenszwalb and D. P. Huttenlocher: “Pictorial structures for object recognition”, IJCV, Vol. 61, No. 1, pp. 55–79, 2005.

P. F. Felzenszwalb, D. McAllester, and D. Ramanan: “A discriminatively trained, multiscale, deformable part model”, CVPR, pp. 1 – 8, 2008.

P. Sabzmeydani and G. Mori: “Detecting Pedestrians by Learning Shapelet Features”, Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2007.

P. Viola, M. Jones, and D. Snow: “Detecting Pedestrians Using Patterns of Motion and Appearance”, Int’l J. Computer Vision, Vol. 63, No. 2, pp. 153-161, 2005.

Papageorgiou, C., M. Oren, and T. Poggio: “A general framework for object detection”, International Conference on Computer Vision, pp. 555-562, 1998.

Park, S. and Aggarwal, J. K.: “A hierarchical Bayesian network for event recognition of human actions and interactions”, Multimedia Systems, Vol.10, No. 2, pp. 164-179, 2004.

Pedro F. Felzenszwalb, Daniel P. Huttenlocher: “Pictorial Structures for Object Recognition”, Intl. Journal of Computer Vision, 2005.

Peter J. Carew, Larry Stapleton and Gabriel J. Byrne: “Implications of an ethic of privacy for human-centered systems engineering”, AI & Society, Vol. 22, No 3, pp 385-403, 2008

Pinhanez, C. S. and Bobick, A. F.: “Human action detection using PNF propagation of temporal constraints”, IEEE Conference on Computer Vision and Pattern Recognition, pp. 898, 1998.

Porikli, F., Kocak, T.: “Fast Distance Transform Computation Using Dual Scan Line Propagation”, SPIE Conference Real-Time Image Processing, 2007.

A. Prati, I. Mikic, M. M. Trivedi, and R. Cucchiara: “Detecting moving shadows: algorithms and evaluation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 7, pp. 918-923, 2003.

Q. Zhu, S. Avidan, M. Yeh, and K. Cheng: “Fast Human Detection Using a Cascade of Histograms of Oriented Gradients”, Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2006, pp. 1491-1498.

R. Brehar, S. Nedevschi, "A comparative study of pedestrian detection methods using classical Haar and HoG features versus bag of words model computed from Haar and HoG features", Proceedings of 2011 IEEE Intelligent Computer Communication and Processing, Cluj-Napoca, August 25-27, 2011, pp. 299-306.

R. Borca, S. Nedevschi: “Correlation Between Features and Classifiers for Semantic Understanding of Pedestrian Attitudes in Traffic Scenes”, in Proceedings of 2009 IEEE Intelligent Computer Communication and Processing, Cluj-Napoca, August 27-29, 2009, pp. 149-152.

R. Cucchiara, C. Grana, M. Piccardi, and A. Prati: “Detecting moving objects, ghosts and shadows in video streams”, IEEE Trans. on Patt. Anal. and Machine Intell., Vol. 25, No. 10, pp. 1337-1342, 2003.

R. Cutler and L. Davis: “View-based detection and analysis of periodic motion”, International Conference on Pattern Recognition, Brisbane, pp. 495-500, 1998.

R. Pless, “Image spaces and video trajectories: Using isomap to explore video sequences,” Proceedings of IEEE International Conference on Computer Vision, pp. 1433–1440, 2003

Ramanan, D., Forsyth, D. A., Zisserman, A.: “Tracking People by Learning Their Appearance”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, pp. 65-81, 2007.

Rao, C. and Shah, M.: “View-invariance in action recognition”, IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 316-322, 2001.

Rapantzikos, K., Avrithis, Y., and Kollias, S.: “Dense saliency-based spatiotemporal feature points for action recognition”, IEEE Conference on Computer Vision and Pattern Recognition, 2009.

Rodriguez, M. D., Ahmed, J., and Shah, M.: “Action MACH: A spatio-temporal maximum average correlation height filter for action recognition”, IEEE Conference on Computer Vision and Pattern Recognition, 2008.

Ryoo, M. S. and Aggarwal, J. K.: “Recognition of composite human activities through context-free grammar based representation”, IEEE Conference on Computer Vision and Pattern Recognition, pp. 1709-1718, 2006.

Ryoo, M. S. and Aggarwal, J. K.: “Semantic understanding of continued and recursive human activities”, International Conference on Pattern Recognition, pp. 379-382, 2006.

S. Agarwal, A. Awan, and D. Roth: “Learning to Detect Objects in Images via a Sparse, Part-Based Representation”, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 26, No. 11, pp. 1475-1490, 2004.

S. Ghosal and P. C. Vanek: “Scalable algorithm for discontinuous optical flow estimation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, No. 2, pp. 181-194, 1996.

S. Lefkovits: “Performance analysis of face detection systems based on Haar features”, In Complexity and Intelligence of the Artificial and Neural Complex Systems, Vol. 1, pp. 184-192, 2008.

S. Munder and D.M. Gavrila: “An Experimental Study on Pedestrian Classification”, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 28, No. 11, pp. 1863-1868, 2006.

S. Munder, C. Schnörr, and D.M. Gavrila: “Pedestrian Detection and Tracking Using a Mixture of View-Based Shape-Texture Models”, IEEE Trans. Intelligent Transportation Systems, Vol. 9, No. 2, pp. 333-343, 2008.

S. Nedevschi, S. Bota, C. Tomiuc, “Stereo-Based Pedestrian Detection for Collision-Avoidance Applications”, in IEEE Transactions on Intelligent Transportation Systems, vol. 10, no. 3, 2009, pp. 380-391

S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.

Savarese, S., DelPozo, A., Niebles, J., and Fei-Fei, L.: “Spatial-temporal correlations for unsupervised action classification”, IEEE Workshop on Motion and Video Computing, 2008.

C. Schnorr: “Unique reconstruction of piecewise smooth images by minimizing strictly convex non-quadratic functionals”, Journal of Mathematical Imaging and Vision, Vol. 4, pp. 189-198, 1994.

Schuldt, C., Laptev, I., and Caputo, B.: “Recognizing human actions: A local SVM approach”, International Conference on Pattern Recognition, Vol. 3, pp. 32-36, 2004.

Scovanner, P., Ali, S., and Shah, M.: “A 3-dimensional sift descriptor and its application to action recognition”, ACM International Conference on Multimedia, pp. 357-360, 2007.

E. Seemann, M. Fritz, and B. Schiele: “Towards Robust Pedestrian Detection in Crowded Image Sequences”, Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2007.

Shan Ying; Feng Han; Sawhney, H.S.; Kumar, R.: “Learning Exemplar-Based Categorization for the Detection of Multi-View Multi-Pose Objects”, Computer Vision and Pattern Recognition, IEEE Computer Society Conference, Vol. 2, pp. 1431–1438, 2006.

A. Shashua, Y. Gdalyahu, and G. Hayon: “Pedestrian Detection for Driving Assistance Systems: Single-Frame Classification and System Level Performance”, Proc. IEEE Intelligent Vehicles Symp., pp. 1-6, 2004.

Shechtman, E. and Irani, M.: “Space-time behavior based correlation”, IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 405-412, 2005.

Sheikh, Y., Sheikh, M., and Shah, M.: “Exploring the space of a human action”, IEEE International Conference on Computer Vision, Vol. 1, pp. 144-149, 2005.

Shi, Y., Huang, Y., Minnen, D., Bobick, A. F., and Essa, I. A.: “Propagation networks for recognition of partially ordered sequential action”, IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 862-869, 2004.

Sigari M., Mozayani N., Pourreza H.: “Fuzzy Running Average and Fuzzy Background Subtraction: Concepts and Application”, International Journal of Computer Science and Network Security, Vol. 8, No. 2, pp. 138-143, 2008.

Simoncelli E.P., Adelson E.H. and Heeger D.J.: “Probability distributions of optical flow”, IEEE Proc. of CVPR, pp. 310-315, 1991.

Simoncelli E.P.: “Distributed Representation and Analysis of Visual Motion”, PhD Dissertation, Dept. of Electrical Engineering and Computer Science MIT, 1993.

Siskind, J. M.: “Grounding the lexical semantics of verbs in visual perception using force dynamics and event logic”, Journal of Artificial Intelligence Research, Vol. 15, pp. 31-90, 2001.

Sivabalakrishnan M., Manjula D.: “Adaptive Background subtraction in Dynamic Environments Using Fuzzy Logic”, International Journal on Computer Science and Engineering, Vol. 02, No. 2, pp. 270- 273, 2010.

Starner, T. and Pentland, A.: “Real-time American Sign Language recognition from video using hidden Markov models”, International Symposium on Computer Vision, pp. 265-270, 1995.

Stauffer, C. and Grimson, W. E. L.: “Adaptive background mixture models for real-time tracking”, IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, 1999.

B. Stenger, A. Thayananthan, P.H.S. Torr, and R. Cipolla: “Model-Based Hand Tracking Using a Hierarchical Bayesian Filter”, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 28, No. 9, pp. 1372-1385, 2006.

T. Bouwmans, F. El Baf, B. Vachon: “Statistical Background Modeling for Foreground Detection: A Survey”, Handbook of Pattern Recognition and Computer Vision, World Scientific Publishing, Vol. 4, Part 2, pp. 181-199, 2010.

T. Bouwmans: “Recent Advanced Statistical Background Modeling for Foreground Detection: A Systematic Survey", Recent Patents on Computer Science, Vol. 4, No. 3, pp. 147-176, 2011.

T. E. Boult, R. Micheals, X. Gao, P. Lewis, C. Power, W. Yin and A. Erkan: “Frame rate omnidirectional surveillance and tracking of camouflaged and occluded targets”, Second IEEE Workshop on Visual Surveillance, Fort Collins, Colorado, pp. 48-55, 1999.

T. Heap and D. Hogg: “Improving Specificity in PDMs Using a Hierarchical Approach”, Proc. British Machine Vision Conf., pp. 80-89, 1997.

T. Heap and D. Hogg: “Wormholes in Shape Space: Tracking through Discontinuous Changes in Shape”, Proc. Int’l Conf. Computer Vision, pp. 344-349, 1998.

T.F. Cootes and C.J. Taylor: “Statistical Models of Appearance for Computer Vision”, Technical report, Univ. of Manchester, 2004.

T.F. Cootes, S. Marsland, C.J. Twining, K. Smith, and C.J. Taylor: “Groupwise Diffeomorphic Non-Rigid Registration for Automatic Model Building”, Proc. European Conf. Computer Vision, pp. 316-327, 2004.

Tamás Vajda: “Action Recognition Using DTW and Petri Nets”, Studia Universitatis Babes-Bolyai Series Informatica, Volume LV, Number 2 (June 2010), pp. 69-78, ISSN: 1224-869x.

Tamás Vajda and Lőrinc M.: “General framework for human object detection and pose estimation in video sequences”, In 5th IEEE International Conference on Industrial Informatics, June 23-27, 2007, Vienna, pp. 467-472, ISSN: 1935-4576.

Tamás Vajda, Sergiu Nedevschi: “Articulated Pose Estimation in Surveillance Videos”, ACAM Scientific Journal, Vol. 20, No. 2, 2011, pp. 111-118, ISSN: 1221-437X.

Tamás Vajda, Ábrám Zoltán: “Pictorial Structure Based People Detection and Pose Estimation in Videos”, International Conference on Intelligent Computer Communication and Processing, ICCP 2011, Aug. 25-27, 2011, Cluj-Napoca, pp. 315-318.

Tamás Vajda, Emőke Szatmári, Sergiu Nedevschi: “Human Body Detection and Tracking in Video Sequences Using Chamfer Matching”, IEEE 3rd International Conference on Intelligent Computer Communication and Processing, ICCP 2007, Sept. 6-8, 2007, Cluj-Napoca, pp. 141-146, ISBN: 978-1-4244-1491-8.

Tamás Vajda, László Bakó, Sándor Tihamér Brassai: “Using Dynamic Programming and Neural Network to Match Human Action”, 11th International Carpathian Control Conference ICCC 2010, May 26-29, 2010, Eger, Hungary, pp. 231-234.

Tamás Vajda: “Action Recognition Based on Fast Dynamic-Time Warping Method”, IEEE 5th International Conference on Intelligent Computer Communication and Processing, ICCP 2009, Aug. 27-29, 2009, Cluj-Napoca, pp. 127-131, ISBN: 978-1-4244-5007-7.

Tamás Vajda: “Attitude detection methods usability in behavior recognition”, 9th International Conference on Computer Science and Education, October 2009, Tg-Mures, Romania, pp. 139-144, ISSN 1842-4546.

Tamás Vajda: “Behavior Recognition Based on Dynamic Programming and Concurrence Probabilistic Petri Nets”, IEEE 6th International Conference on Intelligent Computer Communication and Processing, ICCP 2010, Aug. 26-28, 2010, Cluj-Napoca, pp. 179-184, ISBN: 978-1-4244-8229-0.

Tamás Vajda: “Behavior Recognition Using Pictorial Structures and DTW”, 2010 IEEE International Conference on Automation, Quality and Testing, Robotics, May 28-29, 2010, Cluj-Napoca, Vol. 3, pp. 198-201.

Tamás Vajda: “Behavior Recognition Using Template Matching”, The 4th edition of the Interdisciplinarity in Engineering International Conference, November 12-13, 2009, Tg.-Mures, pp. 283-288, ISSN 1843-780X.

Tamás Vajda: “Hierarchical human behavior recognition”, 8th International Conference on Computer Science and Energetics-Electrical Engineering, Sumuleu-Ciuc, Romania, October 2008, pp. 139-144, ISSN 1842-4546.

Tamás Vajda: “Human Body Detection and Tracking in Video Sequences Using Chamfer Matching”, 7th International Conference on Computer Science and Education, October 2007, Oradea, Romania, pp. 54-58, ISSN 1842-4546.

Tamás Vajda: “Moving object detection in video sequences using Integral Image”, 20th International Conference on Computer Science and Education, October 2010, Satu Mare, Romania, pp. 225-228, ISSN 1842-4546.

Tamás Vajda, "Using Dynamic Time Warping Algorithm Optimization For Fast Human Action Recognition",Acta Technica Napocensis – Electronics and Telecommunication, Volume 51, Number 2/2010 pp.32-37, ISSN 1221-6542

Tamás Vajda, Sergiu Nedevschi: “Fast Multi-View Human Detection and Attitude Estimation”, CSCS16 – The 16th International Conference on Control Systems and Computer Science, May 22-25, 2007, Bucuresti, Romania, Vol. 2, pp. 17-23.

Toyama, K., Krumm, J., Brumitt, B., Meyers, B.: “Wallflower: Principles and Practice of Background Maintenance”, International Conference on Computer Vision, Corfu, Greece, pp. 255-261, 1999.

V. Ferrari, M. Marin, and A. Zisserman: “Progressive search space reduction for human pose estimation”, CVPR 2008.

V.D. Shet, J. Neumann, V. Ramesh, and L.S. Davis: “Bilattice- Based Logical Reasoning for Human Detection”, Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2007.

V.N. Vapnik: “The Nature of Statistical Learning Theory”. Springer, 1995.

Veeraraghavan, A., Chellappa, R., and Roy-Chowdhury, A.: “The function space of an activity”, IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 959-968, 2006.

Viola, P. and M. Jones: “Fast multi-view face detection”, MERL Technical Report TR2003-96, 2003.

Viola, P. and M. Jones: “Rapid object detection using a boosted cascade of simple features”, IEEE CVPR, pp. 511–518, 2001.

Vu, V.-T., Bremond, F., and Thonnat, M.: “Automatic video interpretation: A novel algorithm for temporal scenario recognition”, International Joint Conference on Artificial Intelligence, pp. 1295-1302, 2003.

W. E. L. Grimson, C. Stauffer, R. Romano and L. Lee: “Using adaptive tracking to classify and monitor activities in a site”, Computer Vision and Pattern Recognition, pp. 1-8, 1998.

W. Hinterberger, O. Scherzer, C. Schnorr, and J. Weickert: “Analysis of optical flow models in the framework of calculus of variations”, Numerical Functional Analysis and Optimization, Vol. 23, Nos. 1-2, pp. 69-89, 2002.

Webb, J. A. and Aggarwal, J. K.: “Structure from motion of rigid and jointed objects”, Artificial Intelligence, Vol.19, pp. 107-130, 1982.

C. Wöhler and J. Anlauf: “An Adaptable Time-Delay Neural-Network Algorithm for Image Sequence Analysis”, IEEE Trans. Neural Networks, Vol. 10, No. 6, pp. 1531-1536, 1999.

Wong, S.-F., Kim, T.-K., and Cipolla, R.: “Learning motion categories using both semantic and structural information”, IEEE Conference on Computer Vision and Pattern Recognition, 2007.

Wren C., Azarbayejani A., Darrell T., Pentland A.: “Pfinder: Real-Time Tracking of the Human Body”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, pp. 780-785, 1997.

X. Ren, A. C. Berg, and J. Malik: “Recovering human body configurations using pairwise constraints between parts”, ICCV 2005

Y. Freund and R. Schapire: “A decision-theoretic generalization of on-line learning and an application to boosting”, Journal of Computer and System Sciences, Vol. 55, No. 1, pp. 119–139, 1997.

Y. Ivanov, C. Stauffer, A. Bobick and W. E. L. Grimson: “Video surveillance of interactions”, Second IEEE Workshop on Visual Surveillance, Fort Collins, pp. 82-90, 1999.

Y. Rui, T. S. Huang, and S. F. Chang, “Image retrieval: current techniques, promising directions and open issues,” Journal of Visual Communication and Image Representation, vol. 10, no. 4, pp. 39–62, 1999

Y. Yacoob and M. J. Black, “Parameterized modeling and recognition of activities,” Computer Vision and Image Understanding, vol. 73, no. 2, pp. 232–247, 1999.

Yacoob, Y. and Black, M.: “Parameterized modeling and recognition of activities”, IEEE International Conference on Computer Vision, pp. 120-127. 1998.

Yamato, J., Ohya, J., and Ishii, K.: “Recognizing human action in time-sequential images using hidden Markov model”, IEEE Conference on Computer Vision and Pattern Recognition, pp. 379-385, 1992.

Yilmaz, A. and Shah, M.: “Actions sketch: a novel action representation”, IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 984-989, 2005.

Yilmaz, A. and Shah, M.: “Recognizing human actions in videos acquired by uncalibrated moving cameras”, IEEE International Conference on Computer Vision, 2005.

Yu, E. and Aggarwal, J. K.: “Detection of fence climbing from monocular video”, International Conference on Pattern Recognition, pp. 375-378, 2006.

Z. Tu, X. Chen, A. L. Yuille, and S.-C. Zhu: “Image parsing: Unifying segmentation, detection, and recognition”, IJCV, Vol. 63, No. 2, pp. 113–140, 2005.

Zaidi, A. K.: “On temporal logic programming using Petri nets”, IEEE Transactions on Systems, Man and Cybernetics, Vol. 29, No. 3, pp. 245-254, 1999.

Zelnik-Manor, L. and Irani, M.: “Event-based analysis of video”, IEEE Conference on Computer Vision and Pattern Recognition, 2001.

Zhang H., Xu D.: “Fusing Color and Texture Features for Background Model”, International Conference on Fuzzy Systems and Knowledge Discovery, Vol. 4223, No. 7, pp. 887-893, 2006.

Zhang, D., Gatica-Perez, D., Bengio, S., and McCowan, I.: “Modeling individual and group actions in meetings with layered HMMs”, IEEE Transactions on Multimedia, Vol. 8, No. 3, pp. 509-520, 2006.

Zheng J., Wang Y., Nihan N., Hallenbeck, E.: “Extracting Roadway Background Image: a Mode Based Approach”, Journal of Transportation Research Report, No. 1944, pp. 82-88, 2006.

List of figures

Figure 1. General human behavior recognition system 8

Figure 2. Relation between activity and behavior recognition 13

Figure 3. General framework for background segmentation 24

Figure 4. The Integral Image 30

Figure 5. Rotated Integral Image 30

Figure 6. Rectangular features. The AR represents the area of the SumI feature 31

Figure 7. The rotated RSumI feature. 31

Figure 8. Test examples for frame differencing in outdoor and indoor environments: a) actual image, b) background image, c) foreground mask 34

Figure 9. Test examples for Running average foreground detection in outdoor and indoor environments: a) actual image, b) background image, c) foreground mask 35

Figure 10. Test examples for Running Gaussian Average foreground detection in outdoor and indoor environments: a) actual image, b) background image, c) foreground mask 36

Figure 11. Test examples for the Min-Max method in outdoor and indoor environments: a) actual image, b) background image, c) foreground mask 37

Figure 12. Test examples for Gaussian Mixture based foreground detection in outdoor and indoor environments: a) actual image, b) background image, c) foreground mask 38

Figure 13. Test examples for Eigenbackground based foreground detection in outdoor and indoor environments: a) actual image, b) background image, c) foreground mask 39

Figure 14. Test examples for integral image based foreground detection in outdoor and indoor environments a) actual image, b) background image, c) foreground mask. 40

Figure 15. The effect of quick illumination changes 42

Figure 16. Results of the optical flow algorithm 44

Figure 17. Edge, line and center-surround Haar features. Left: the basic set, right: the extended set [103] 49

Figure 18. Cascade classifier 51

Figure 19. Tree classifier for detection and pose estimation 52

Figure 20. Relation between input pattern size, performance and processing time 53

Figure 21. Examples from the training set (positive images) 55

Figure 22. The ROC curve for the three classifiers 60

Figure 23. Processed images – our database: a) cascade classifier, b) Lienhart’s tree, c) PETC 61

Figure 24. Processed images – INRIA database: a) cascade classifier, b) Lienhart’s tree, c) PETC 62

Figure 25. Matching process 64

Figure 26. General masks 66

Figure 27. Forward and backward neighborhoods 66

Figure 28. Back-and-forth scanning in one direction [151] 69

Figure 29. Multiple scanning directions. Either the scan direction is changed or the data space is rotated. [151] 69

Figure 30. Fast wave-propagation method [151] 70

Figure 31. Human templates 73

Figure 32. Template splitting regions 73

Figure 33. Example of transition criteria parameter 74

Figure 34. Human detection system 75

Figure 35. System output image 76

Figure 36. Chamfer matching performance related to the image homogeneity. 77

Figure 37. Connections 82

Figure 38. The training set 83

Figure 39. Learning process 84

Figure 40. The learnt Pictorial structures (frontal and side view) 84

Figure 41. Framework implementation diagram 88

Figure 42. Output of the systems. Left: Andriluka’s method, right: our method 88

Figure 43. The responses of the systems with and without time constraint optimization. 90

Figure 44. The speed of the system for the two kinds of optimization 90

Figure 45. Performance of the pose estimation 91

Figure 46. Difference in performance of the pose estimation 91

Figure 47. Output of the system with frontal and lateral body model 92

Figure 48. Output of the system with search space reduction only, and with both search space reduction and configuration reduction 92

Figure 49. Output of the systems a) OPSF, b) original framework 93

Figure 50. Human detection system performance for different human size 95

Figure 51. Motion of the upper leg relative to the torso 101

Figure 52. Full resolution time series of waving – upper arm 102

Figure 53. The original and the shrunk time series of waving – upper arm 103

Figure 54. The coarse and the full resolution cost matrix with warping path 103

Figure 55. Training chart of the RBF NN 104

Figure 56. Training chart of the LVQ NN 105

Figure 57. Output of the action recognition system 106

Figure 58. Transitions and places 108

Figure 59. Concurrency 109

Figure 60. Synchronization 109

Figure 61. Recognizing simple activities 110

Figure 62. Activity recognition using HPPN 110

Figure 63. HPPN for a single motion pattern 111

List of Tables

Table 1. Comparison of foreground detection techniques in terms of speed, detection precision and discriminative power 41

Table 2. Memory Requirement categorization 41

Table 3. Performance parameters on our database 61

Table 4. Performance parameters on INRIA database 62

Table 5. Distance transform precision 71

Table 6. Performance parameters for template matching on our database 76

Table 7. Performance parameters on our database 93

Table 8. NN classification results 105

Table 9. Comparative results of the experiment using the three types of pose estimator. 106

Appendix

Tamás Vajda: “Action Recognition Using DTW and Petri Nets”, Studia Universitatis Babes-Bolyai Series Informatica, Volume LV, Number 2 (June 2010), pp. 69-78, ISSN: 1224-869x.

Tamás Vajda, Sergiu Nedevschi: “Articulated Pose Estimation in Surveillance Videos”, ACAM Scientific Journal, Vol. 20, No. 2, 2011, pp. 111-118, ISSN: 1221-437X.
