
Chapter 1

INTRODUCTION

Over the past decade, Unmanned Aerial Vehicles (UAVs) have seen an uptrend in popularity and have received increasing attention in the research community [2], where a significant number of seminal results and applications have been proposed and experimentally validated. Despite this progress, flight autonomy is still considered an open research topic, involving control and orientation, obstacle avoidance and even trajectory optimisation.

Machine Learning is currently applied to numerous types of applications and systems. Autonomous flight is an interesting topic in this area that opens many challenges: precision and accuracy, robustness and adaptation, and reward engineering when using Reinforcement Learning.

1.1 Motivation

If we think in terms of evolutionary history, all animals exhibit some kind of behavior: they do something in response to the inputs they receive from the environment they exist in. Some animals change the way they behave over time: given the same input, they may respond differently later on than they did earlier. Such learning behavior is vital to the survival of species. As technology develops, enabling machines to mimic this learning ability has become one of the long-standing challenges for scientists. Reinforcement learning is an area of machine learning, inspired by behaviorist psychology, that has revolutionized our understanding of learning.

Drones are not a new phenomenon, as they date back to the mid-1800s, but their development in the civil market is relatively recent. 2013 was described by some experts as the year of drones [3] and marked the moment when the need for such autonomous systems, performing tasks that humans used to coordinate, became apparent.

The applications of drones are no longer limited to military use; they have spread into virtually every field, from agriculture to the internet [5]. One of the biggest achievements of drone technology is performing monitoring and precision tasks on large farmland, irrigation systems, wind farms, highway traffic and construction sites. Strongly related is the domain of safety surveillance, reporting and rescue: drones are very useful for watching over large public gatherings and border areas, or for reporting accidents and natural disasters in real time. In some countries they are used to perform difficult search and rescue operations.

Research on environmental compliance, the atmosphere, wildlife protection and even 3D geographical mapping is carried out with semi-autonomous drones. In Germany [4], experiments on carrying small cargo are ongoing, which marks the beginning of drone use for shipping and delivery tasks.

Autonomous drones could also be used to carry a wireless internet signal to remote locations; in the near future we may be able to access the internet through a drone flying somewhere overhead.

In the absence of obstacles, satellite-based navigation coordinated by a human from departure to destination is a simple task. When obstacles lie in the path, are not in known or fixed positions, or are too numerous, the drone pilot must build a flight plan to avoid them, and the task becomes very challenging [7]. Going further, in an environment with a weak satellite signal the task is almost impossible. Moreover, in both surveillance and reporting tasks the areas to be watched over are so vast that an extremely high number of human agents would be required to coordinate the systems, not to mention that some activities demand high precision or permanent supervision.

Autonomous drones are a solution to the many challenges traditional systems encounter in every field in which they are used. For monitoring, surveillance, delivery and research tasks, autonomy is a key concept in the development of drones and a deciding factor in their contribution to industry. These arguments give rise to the idea of applying intelligent agents to accomplish our tasks in any area, with lower costs and higher accuracy.

1.2 Problem formulation

The goal of this project is to create an agent that can learn a control policy that maximizes the rewards or reaches the proposed goals. Suppose there is an agent that can interact with an environment by executing a sequence of actions and receiving rewards. More precisely, at each time step the agent selects an action a from the set of legal actions and transmits it to the environment. Thereafter, the environment returns the updated reward and possibly an observation such as a screenshot of the environment. The environment may use stochastic transitions with a certain success rate. Therefore, the agent can only learn how to behave from direct experience with the environment.

The objective is to autonomously navigate a drone [1] in a simulated unknown environment, without any human help, without any prior knowledge about the terrain and using only default drone sensors. The predefined task of the drone is not important, nor are its construction or behaviour details; the focus is on the training method. Machine learning techniques will be used to train the drone to fly autonomously among obstacles.

The drone is initially spawned at a random location on the map, and its goal is to fly as far as possible from its initial position without any collision.

1.3 Main results

1.4 Thesis structure

The structure of the thesis is organised as follows. Chapter two is divided into three parts: the first consists of a general overview of Reinforcement Learning elements, while the following ones drive the reader through deep reinforcement learning and autonomous flight preliminaries, bringing these concepts together. In the third chapter, five experimental works focused on drone navigation and obstacle avoidance are discussed. Chapter four presents the system requirements and offers details regarding the implementation and the user manual. The fifth chapter describes the conducted experiments along with their results, and chapter six offers some conclusions, stating in the end the encountered difficulties and the possible improvements.

Chapter 2

BACKGROUND

2.1 Elements of Reinforcement Learning

Reinforcement learning differs from supervised learning in the absence of labelled training data and of a supervisor. In supervised learning the objective is to generalize from the given data in order to make accurate predictions on previously unseen data; the given data therefore contain both input and output values.

In reinforcement learning, however, the model is given only a set of input data and has to determine the hidden patterns in the data by which it can maximize its rewards. By relying on a reward-based learning strategy, reinforcement learning therefore differs from both the supervised and the unsupervised learning paradigms, and it yields a completely different interface.

The main elements used in reinforcement learning are the agent and the environment and, in the context of these two, the model, the value function, the reward and the policy. Briefly, the agent acts in the environment based on its policy, and the environment sends rewards back based on those actions. This process is known as the agent-environment loop and is presented in Figure 2.1. In the following, the most relevant elements of the agent-environment loop are presented in more detail.
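
To make the agent-environment loop concrete, the sketch below shows the standard way it is written in code using the OpenAI Gym-style reset/step interface. The environment name and the random action choice are illustrative assumptions, not part of this project.

```python
import gym  # classic Gym API assumed; newer Gym/Gymnasium versions return (obs, info) and a 5-tuple from step()

env = gym.make("CartPole-v1")        # illustrative environment, not the drone simulator
observation = env.reset()

done, total_reward = False, 0.0
while not done:
    # a real agent would query its policy here; we simply sample a random action
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)  # the environment answers with reward and next state
    total_reward += reward

print("episode return:", total_reward)
```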

Figure 2.1: The agent-environment interface

2.1.1 The agent

The agent is the intelligent (software) component of any reinforcement learning system. Agents are the learner components of the system: their task is to act in the environment in such a way that the reward they receive is maximized. Agents act on behalf of a so-called policy, a concept that will be detailed later in this chapter.

Based on which key component the agent contains, one can distinguish value-based, policy-based and actor-critic agents. A value-based agent's policy is implicit, as it can directly read out the best action from its value function. In contrast, a policy-based agent stores only a policy and selects actions without ever explicitly storing the value function. Actor-critic agents combine the previous two and try to get the best out of both the value function and the policy.

The agent's knowledge of the environment yields another fundamental distinction: agents can be categorized as model-based or model-free. A model-based agent has full knowledge of the environment's dynamics and states, so its model of the environment is complete, whereas model-free agents have an incomplete model of the environment (or none at all).

Figure 2.2: David Silver’s taxonomy of RL agents

2.1.2 The environment

Everything inside the system but outside the agent is said to be part of the environment. The agent holds a representation of the environment in the form of a model. There are multiple types of environments, and the accuracy of the agent's model depends on the type. Russell and Norvig propose the following environment properties:
● deterministic/non-deterministic: in a deterministic environment the outcome can be precisely predicted based on the current state. Simply put, if an agent performs the same task in two identical deterministic environments, the outcome will be the same in both cases. In a non-deterministic or stochastic environment the outcome cannot be determined from the current state; this type of environment introduces a greater level of uncertainty.
● discrete/continuous: in a discrete environment, there is a finite number of possible actions for moving from one state to another. In contrast, in a continuous environment, there is an infinite number of possible actions for one transition.
● episodic/non-episodic: episodic environments are also called non-sequential. In an episodic environment, the agent's action will not affect any future action; the agent's performance is the result of a series of independently performed tasks. In the case of non-episodic or sequential environments, the agent's current action will affect future actions; in other words, the agent's actions are not independent of each other.
● static/dynamic: environments that are modified exclusively by the agent are called static. Environments that can change without the interaction of the agent are called dynamic.
● accessible/inaccessible: in the reinforcement learning vocabulary, these two categories are also known as observable/partially observable environments. Observable environments are those for which the agent can hold a complete and accurate representation of the state space, whereas the knowledge of an agent in a partially observable environment is limited to its observations.

2.1.3 Main Reinforcement Learning problems

The succession of actions taken by the agent follows the procedural approach called sequential decision making. Fundamental problems that arise in these kinds of processes are the problem of learning and the problem of planning. The main distinction between the two is based on the agent's initial knowledge of the environment.

The method by which the agent balances exploration and exploitation is another key part of what it means to do reinforcement learning. Exploration means choosing to give up some reward the agent knows about for the sake of gaining information about the environment instead, while exploitation means maximizing reward based on the knowledge the agent already possesses. One key point in designing RL agents is to come up with an algorithm that exploits as well as it explores.
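
A common way of balancing the two is the epsilon-greedy rule: with probability epsilon the agent explores a random action, otherwise it exploits the action with the highest estimated value, with epsilon annealed over time. The Python sketch below is a minimal illustration; the number of actions and the decay schedule are assumptions, not values used in this thesis.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Return a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniformly random action
    return int(np.argmax(q_values))               # exploit: best current estimate

rng = np.random.default_rng(0)
epsilon = 1.0
for step in range(10_000):
    q_row = np.zeros(4)                   # placeholder Q-values for a 4-action problem
    action = epsilon_greedy(q_row, epsilon, rng)
    epsilon = max(0.05, epsilon * 0.999)  # slowly shift from exploration towards exploitation
```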

Another distinction within reinforcement learning is between the problem of prediction and the problem of control. When solving the prediction problem, the agent has a fixed policy and its purpose is to predict how well it will do in the future. Solving the control problem implies finding the best policy. Thus, solving the problem of control also implies solving the prediction problem.

2.2 Deep Reinforcement Learning

2.2.1 Deep Learning

Deep learning is a supervised learning technique that finds patterns in data provided by a supervisor. The patterns the algorithm learns from the labeled data form a predictive model that is able to recognize previously unseen data. Deep learning is a key technology in self-driving cars, voice control systems, various disease detection and prediction algorithms, etc. Many times, deep learning models can achieve higher accuracy and better precision in certain tasks than human experts. In the vast majority of cases, deep learning models are used to solve supervised learning problems.

The key components of deep learning models are multi-layer perceptrons (MLPs). An MLP can be viewed as a mathematical function that is itself just a chaining (composition) of simpler mathematical functions, mapping some input values to some output values. This simple idea allows deep learning models to build complex concepts out of simpler ones. Obtaining high-level features of the data in such a way seems to be promising for solving other types of problems too.
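
As a small illustration of this composition-of-functions view, the sketch below chains three dense layers into an MLP using Keras (available through the TensorFlow stack used later in this project); the layer sizes and input dimension are arbitrary choices for the example.

```python
import tensorflow as tf

# An MLP is a composition of simple functions: f(x) = f3(f2(f1(x))).
mlp = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),  # f1: affine map + ReLU
    tf.keras.layers.Dense(64, activation="relu"),                     # f2
    tf.keras.layers.Dense(1),                                         # f3: scalar output
])

mlp.compile(optimizer="adam", loss="mse")
mlp.summary()   # shows how the simple blocks stack into one deeper function
```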

The deep neural networks underlying these models, when viewed as function approximators, prove to be useful in other classes of tasks too. In reinforcement learning problems where the size of the state space makes it infeasible to calculate the value function with direct solution methods, one resorts to approximations. Reinforcement learning models that use deep neural networks as value function approximators are called deep reinforcement learning models.

2.2.2 Experience Replay

Once a connection between supervised learning and reinforcement learning with function approximators is made as described previously, one might find that reinforcement learning methods do not exploit the potential of the data as much as supervised learning methods do. Reinforcement learning takes the steps of an episode, updates the value function and then takes another episode with other steps. However, this does not fully exploit the potential hidden in the data, and reinforcement learning methods based on value function approximators can be extremely sensitive to this, mainly because of bias-variance problems. The method called experience replay aims to solve some of these issues.

The core idea of experience replay is that at each timestep the agent stores a tuple containing the current state, the action taken, the reward and the next state in a so-called replay buffer. This buffer is accessed when performing the updates of the weights: the updates are performed on batches sampled randomly from the experience replay buffer. Random sampling breaks the dependencies between successive experiences coming from the same episode, and hence one important source of instability is eliminated. There are several modified versions of experience replay, such as prioritized and hindsight experience replay. However, the vast majority of these methods are usually used hand in hand with off-policy methods such as Q-learning.
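
A replay buffer itself is simple to implement. The minimal sketch below stores (state, action, reward, next state, done) tuples in a fixed-size deque and samples uniform random mini-batches; the capacity and batch size are illustrative values.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)     # oldest transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # uniform random sampling breaks the correlation between consecutive steps of an episode
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```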

2.2.3 Deep Q Learning

The variant of Q-learning in which a deep neural network is used as function approximator is called Deep Q-Learning. The basic steps of the usual Q-learning are still preserved. Yet, to avoid some frequent issues with the algorithm's convergence, some modifications are made (for example, balancing exploration and exploitation with epsilon-greedy, dealing with the correlations among the 'training data' by using experience replay, and decoupling the target network from the policy network). The final steps of Deep Q-Learning are as follows (a code sketch is given after the list):

1. initialize parameters for Q(s,a) and Q'(s,a), set epsilon = 1 and start with an empty replay buffer
2. select action a with epsilon-greedy
3. execute action a, collect reward r, observe next state s'
4. store transition (s,a,r,s') in the replay buffer
5. sample a batch of transitions from the replay buffer
6. compute the target y for each transition in the batch
7. compute the loss
8. update the parameters using stochastic gradient descent
9. repeat from step 2 until convergence
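
The sketch below shows how these steps fit together in code. It is a condensed, illustrative DQN loop written with TensorFlow/Keras and a classic Gym-style environment, reusing the ReplayBuffer sketched earlier; the network sizes, hyperparameters and synchronisation interval are placeholders, not the settings of this thesis.

```python
import numpy as np
import tensorflow as tf

def build_q_net(state_dim, n_actions):
    # small fully connected Q-network; the sizes are illustrative
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_actions),                 # one Q-value per action
    ])

def train_dqn(env, episodes=500, gamma=0.99, batch_size=64, sync_every=1000):
    q_net = build_q_net(env.observation_space.shape[0], env.action_space.n)   # step 1
    target_net = build_q_net(env.observation_space.shape[0], env.action_space.n)
    target_net.set_weights(q_net.get_weights())
    optimizer = tf.keras.optimizers.Adam(1e-3)
    buffer, epsilon, steps = ReplayBuffer(), 1.0, 0

    for _ in range(episodes):
        state, done = env.reset(), False                  # classic Gym API assumed
        while not done:
            # step 2: epsilon-greedy action selection
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(q_net(state[None, :])[0]))
            # steps 3-4: act in the environment and store the transition
            next_state, reward, done, _ = env.step(action)
            buffer.push(state, action, reward, next_state, done)
            state, steps = next_state, steps + 1
            epsilon = max(0.05, epsilon * 0.999)

            if len(buffer) >= batch_size:
                # steps 5-8: sample a batch, build targets with the target network, minimise the loss
                s, a, r, s2, d = map(np.array, buffer.sample(batch_size))
                next_q = target_net(s2).numpy().max(axis=1)
                y = (r + gamma * next_q * (1.0 - d)).astype(np.float32)
                with tf.GradientTape() as tape:
                    q = tf.reduce_sum(q_net(s) * tf.one_hot(a, env.action_space.n), axis=1)
                    loss = tf.reduce_mean(tf.square(y - q))
                grads = tape.gradient(loss, q_net.trainable_variables)
                optimizer.apply_gradients(zip(grads, q_net.trainable_variables))

            if steps % sync_every == 0:
                target_net.set_weights(q_net.get_weights())   # periodic target synchronisation
    return q_net
```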

One important note regarding the above algorithm is that there are actually two networks at work, called the main network and the target network respectively. The target network is a copy of the main network that produces the targets Q'(s',a') and is periodically synchronized with the main network. This tweak is introduced because Q-learning is originally an off-policy control algorithm, and if the targets were produced by the same network that dictates the policy, then due to the high bias there would be a high risk that the algorithm diverges.
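
For reference, the target computed in step 6 of the algorithm above and the loss minimised in step 7 take the standard form below, where θ denotes the parameters of the main network (notation added here for clarity) and the bootstrap term is dropped for terminal transitions, leaving y = r:

y = r + \gamma \max_{a'} Q'(s', a'), \qquad L(\theta) = \big( y - Q(s, a; \theta) \big)^2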

While experience replay is meant to break the correlations between training examples, using two networks targets the decoupling of the target and behaviour policies. Deep Q-Networks are particularly useful in Markov Decision Processes where the observations are given as images. In such cases, on the one hand, it is convenient and straightforward to use convolutional neural networks as approximators; on the other hand, the nature of image observations gives rise to exactly the issues described in the previous paragraphs, issues that Deep Q-Learning was designed to solve.

2.3 Autonomous Flight

This section offers a brief introduction to autonomous flight. Basic notions about drones are introduced, then an overview of quadrotors is given. The last part is a discussion regarding autonomous drones.

2.3.1 General view

Unmanned aircraft systems (UAS) consist of an aircraft and its associated elements, operated with no pilot on board. A subset of UAS, remotely piloted aircraft systems (RPAS), are a set of configurable elements consisting of a remotely piloted aircraft, its associated remote pilot station(s), the required command and control links and any other system elements that may be required at any point during flight operations.

In the past decade, the field of aerial systems has exploded. Tremendous progress has been made in the design and autonomy of flight robotics, which can roughly be categorized as either man-made or bio-inspired. In the former class, fixed-pitch rotorcrafts have become particularly popular for their mechanical simplicity, but several more complex vehicles have also been developed. For example, decoupled rotational and translational degrees of freedom can be achieved with variable-pitch or omnidirectional rotor configurations.

Fixed-wing aircraft are mechanically simple, but are still able to glide and execute birdlike maneuvers, such as perching. Compared with ornithopters (aircraft that fly by flapping their wings), they are easier to model and are typically better able to carry onboard sensors, allowing for the successful development of planning, perception and formation control algorithms.

Multirotor aircraft, particularly quadrotors, have been the most capable vehicles in terms of accessibility, maneuverability, capacity for onboard sensors and applicability to a breadth of applications. Great research progress has been made across multiple areas, including the control of agile maneuvers, planning and perception in unknown, unstructured environments and collaboration in multiagent teams, sparking a surge of industry investment.

In common language, UAS and RPAS are replaced by the word drone, and this document will accordingly use drone to refer to both.

2.3.2 Quadrotors

Due to their ease of construction, lightweight flight controllers and durability, quadrotors are the most frequent configuration of unmanned aerial vehicles. With their small size and maneuverability, these quadcopters can be flown indoors as well as outdoors.

As opposed to fixed-wing aircraft, quadcopters are classified as rotorcraft, because their lift is generated by a set of rotors (vertically oriented propellers). Generally, quadrotors use two pairs of identical fixed-pitch propellers, two spinning clockwise and two counterclockwise, and achieve control by independently varying the speed of each rotor. By changing the speed of each rotor it is possible to generate a desired total thrust and total torque. Thrust is produced when air is pushed by the spinning propeller blades in the direction opposite to flight. Torque, or turning force, is the rotational equivalent of linear force.

Figure 2.3 Clockwise and counterclockwise propellers

Each rotor produces both a thrust and a torque about its center of rotation, as well as a drag force opposite to the vehicle's direction of flight. There is no need for a tail rotor as on conventional helicopters because, if all rotors spin at the same angular velocity, with rotors A and C rotating clockwise and rotors B and D counterclockwise as in Figure 2.3, the net aerodynamic torque, and hence the angular acceleration about the yaw (vertical) axis, is exactly zero.

The attitude of a quadrotor provides information about its orientation with respect to the local level frame, the horizontal plane and true north. As depicted in Figure 2.4, the attitude has three components: Roll, Pitch and Yaw. The easiest way to understand them is to consider a drone with three linear axes running through it.

Figure 2.4 Attitude components of a quadrotor

A quadrotor performs its maneuvers in the following way (a minimal motor-mixing sketch is given after this list):

● it adjusts altitude by applying equal thrust to all four rotors
● it adjusts its yaw by applying more thrust to the rotors rotating in one direction; for example, turning right implies more thrust on propellers A and C from Figure 2.3
● it adjusts its pitch or roll by applying more thrust to one rotor and less thrust to its diametrically opposite rotor
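
The sketch below turns this intuition into a simple motor-mixing rule for the rotor layout of Figure 2.3 (A and C clockwise, B and D counterclockwise). The rotor placement, signs and gains are illustrative assumptions, not taken from a particular flight controller.

```python
def mix_motors(thrust, roll_cmd, pitch_cmd, yaw_cmd):
    """Map collective thrust and attitude commands to the four rotor speeds.

    Assumed layout: A (front) and C (rear) spin clockwise on the pitch axis,
    B (right) and D (left) spin counterclockwise on the roll axis.
    """
    motor_a = thrust + pitch_cmd + yaw_cmd   # speeding up A and C adds yaw reaction torque
    motor_c = thrust - pitch_cmd + yaw_cmd
    motor_b = thrust + roll_cmd - yaw_cmd
    motor_d = thrust - roll_cmd - yaw_cmd
    return motor_a, motor_b, motor_c, motor_d

# equal commands change only altitude; adding yaw_cmd speeds up A and C and slows B and D,
# turning the vehicle without changing the total lift
print(mix_motors(thrust=0.6, roll_cmd=0.0, pitch_cmd=0.0, yaw_cmd=0.05))
```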

2.3.3 Autonomous Drones

An Unmanned Aerial Vehicle (UAV, the term used by The United States Department of Defense) is defined by the US Dictionary as a "powered, aerial vehicle that does not carry a human operator, uses aerodynamic forces to provide vehicle lift, can fly autonomously or be piloted remotely, can be expendable or recoverable, and can carry a payload".

Quadcopters and other multicopters can often fly autonomously. Many modern flight controllers use software that allows the user to mark waypoints on a map, to which the quadcopter will fly and perform tasks, such as landing or gaining altitude. The PX4 autopilot system, an open-source software/hardware combination in development since 2009, has been adopted by both hobbyists and drone manufacturing companies to give their quadcopter projects flight-control capabilities. Another project that enables hobbyists to use small remotely piloted aircraft such as micro air vehicles is ArduCopter, the multicopter unmanned aerial vehicle version of the open-source ArduPilot autopilot platform. Other flight applications include crowd monitoring with several quadcopters, where visual data from the device is used to predict where the crowd will move next and, in turn, direct the quadcopter to the next corresponding waypoint.

The use of drones is developing at a quick pace worldwide and in particular in European Aviation Safety Agency (EASA) countries. The use of drones is extremely varied. Some examples are: precision agriculture, infrastructure inspection, wind energy monitoring, pipeline and power inspection, highway monitoring, natural resources monitoring, environmental compliance, atmospheric research, media and entertainment, sport photos, filming, wildlife protection and research, accident reporting, disaster relief. Experiments to carry small cargo are ongoing in Germany and France.

Initially focused on robots and now mostly applied to ground vehicles, RL has also begun to be used to train drones. Reinforcement learning is inspired by the human way of learning, based on trial-and-error experiences. Using the information about the environment, the agent makes decisions and takes actions at discrete intervals known as steps. Actions produce changes in the environment and also yield a reward.

The application of reinforcement learning to drones will provide them with more intelligence, eventually turning drones into fully autonomous machines. Deep RL proposes the use of neural networks in the decision algorithm. In conjunction with the experience replay memory, deep RL has been able to achieve a super-human level when playing video and board games. Even though typical deep RL applications concern the optimal sensorimotor control of autonomous robotic agents in immersive environments, this learning technique is a promising approach for drones as well.

Chapter 3

RELATED WORK

A wide variety of techniques for drone navigation and obstacle avoidance can be found in the literature. In this section we analyse five experimental works that tackle the same problem: achieving autonomous flight when obstacles are encountered in the environment. Some use predefined paths, some fixed obstacles or indoor environments, but independently of that, our focus is on the machine learning algorithms used to obtain an autonomous aerial agent.

3.1. Deep Convolutional Neural Network-Based Autonomous Drone Navigation by K. Amer et al. [8]

This paper presents an approach for autonomous aerial navigation of a drone using a monocular camera and without reliance on a Global Positioning System (GPS). For systems that require regular trips to the same locations along the same trajectories, the human agent can be replaced to reduce costs and to provide a precision that cannot otherwise be achieved in GPS-denied environments. The proposed algorithm focuses on autonomous drone navigation along predefined paths, which is a suitable solution for environmental monitoring, shipment and delivery, and carrying a wireless internet signal to remote areas.

They use a deep Convolutional Neural Network (CNN) combined with a regressor to output the drone steering commands. Onboard information is immediately available, which is an advantage because external signals are not needed, but also a disadvantage because it exposes the system to many real-life deployment scenarios. To overcome this and make the system adaptable, they use a 'navigation envelope' for data augmentation.

The combined training dataset consists of simulator-generated data along with data collected during manually operated or GPS-based trips. They use the Unreal Engine physics environment with the AirSim plugin, which provides an API for drone control and data generation. Two synthetically generated scenarios were used in the training dataset: Blocks and Landscape. The first is an abstract environment containing different cubes, used in the initial training of the drone, and the second is a realistic scene with frozen lakes, trees and mountains, meant to cover more complex environments.

A neural network pre-trained for a classification task is used as feature extractor. Then, as shown in Figure 3.1, these features are processed by the regressor, which is either a Fully Connected Neural Network (FCNN) or a Gated Recurrent Unit (GRU) network, whose purpose is to indicate the yaw angle by which the drone should rotate when navigating at fixed forward velocity. This generalisation is helpful when deployed in a realistic, complex setting. They use the Adam optimizer with a learning rate of 2.7 x 10^-4 and a batch size of 64 for 100 epochs, with the learning rate halved every 25 epochs. The loss function used to correlate the visual input, the waypoints and the deviation angle is the Mean Squared Error (MSE).

Figure 3.1 Sketch of the training system
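
A hedged Keras sketch of the training setup described above is given below: a frozen pre-trained backbone as feature extractor, a fully connected regressor that outputs the yaw angle, the Adam optimizer at 2.7 x 10^-4 with the rate halved every 25 epochs, MSE loss, batch size 64 and 100 epochs. The backbone choice and input size are assumptions; the paper's exact architecture may differ.

```python
import tensorflow as tf

# pre-trained backbone used as a frozen feature extractor (MobileNetV2 is an illustrative choice)
backbone = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                             input_shape=(224, 224, 3))
backbone.trainable = False

model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1),                 # regressed yaw angle
])

def halve_every_25(epoch, lr):
    # halve the learning rate every 25 epochs, as described in the paper
    return lr * 0.5 if epoch > 0 and epoch % 25 == 0 else lr

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2.7e-4), loss="mse")
# model.fit(train_images, train_yaw_angles, batch_size=64, epochs=100,
#           callbacks=[tf.keras.callbacks.LearningRateScheduler(halve_every_25)])
```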

For simplicity, the drone flies only forward, at a height of 5 meters and with a velocity of 1 m/s. The only parameter that has to be adjusted in real time is the yaw angle, calculated between the next waypoint and the drone's heading.
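
As an illustration of that calculation, the snippet below computes the yaw correction from the drone's 2D position, its current heading and the next waypoint. The names and coordinate conventions are assumptions made for the example.

```python
import math

def yaw_to_waypoint(drone_xy, heading_rad, waypoint_xy):
    """Angle (in radians) by which the drone should rotate so its heading points at the waypoint."""
    dx = waypoint_xy[0] - drone_xy[0]
    dy = waypoint_xy[1] - drone_xy[1]
    bearing = math.atan2(dy, dx)                          # direction of the waypoint in the world frame
    error = bearing - heading_rad
    return math.atan2(math.sin(error), math.cos(error))   # wrap the difference into (-pi, pi]

# drone at the origin heading along +x, waypoint ahead and to its left: ~0.785 rad (45 degrees)
print(yaw_to_waypoint((0.0, 0.0), 0.0, (1.0, 1.0)))
```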

During training each path was treated separately, in order to teach the network to follow the path by correlating input images with the yaws required to advance. The alternative would be joint training, which can lead to conflicting decisions in a complex environment. The 'navigation envelope' consists of augmenting each path with 100 auxiliary paths obtained by adding noise to the optimal shortest path used in training.

To measure the results, two error metrics are tracked: Mean Waypoints Minimum Distance and Mean Cross Track Distance. The first is the distance between each waypoint position and the nearest position reached by the drone over the entire path, averaged over all waypoints; the second is the shortest distance between the drone position and the segment through the next two closest waypoints.
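
A possible NumPy formulation of the two metrics is sketched below; it assumes 2D positions and measures the cross-track term against the segment through the two nearest waypoints, which is one reasonable reading of the paper's description.

```python
import numpy as np

def mean_waypoint_min_distance(waypoints, trajectory):
    # for every waypoint: distance to the closest position the drone actually reached, then averaged
    d = np.linalg.norm(waypoints[:, None, :] - trajectory[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def cross_track_distance(position, wp_a, wp_b):
    # perpendicular distance from the drone position to the segment wp_a -> wp_b
    ab = wp_b - wp_a
    t = np.clip(np.dot(position - wp_a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return np.linalg.norm(position - (wp_a + t * ab))

waypoints = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0]])             # illustrative path
trajectory = np.array([[0.5, 0.2], [5.0, 0.4], [9.8, 0.1], [10.2, 5.0]])  # illustrative flight log
print(mean_waypoint_min_distance(waypoints, trajectory))
print(cross_track_distance(trajectory[1], waypoints[0], waypoints[1]))
```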

To test path following, the FCNN and GRU regressors with 2 and 4 time steps are then compared. In contrast to similar results from self-driving cars, the FCNN achieves a lower error here in both environments. All the models were able to follow the path until the end, except for the GRU with 4 time steps. The best generalisation in finding the direction and pursuing the right path was achieved with the FCNN regressor, the model they trained indicating robust autonomous navigation.

3.2. Reactive Model Predictive Control for Autonomous MAV Navigation in Indoor Cluttered Environments: Flight Experiments by Julien Marzat et al. [9]

This paper describes an integrated perception-control loop for trajectory tracking with obstacle avoidance by micro air vehicles (MAVs) in unknown indoor environments. For the perception part, a stereo-vision camera is used to build a 3D model of the environment.

An Asctec Pelican quadrotor controlled in the diamond configuration shown in Figure 3.2 is the aerial vehicle chosen for the flight experiments. It includes an Inertial Measurement Unit (IMU) and a low-level controller, which sends the acceleration control inputs to the MAV: roll angle, pitch angle and yaw rate. A quad-core i7 CPU is also embedded, to run the algorithmic chain and make the MAV fully autonomous.

Figure 3.2 Diamond configuration quadrotor

The technique used to tackle the trajectory problem is Model Predictive Control (MPC). The authors make use of the current estimated states of the vehicle and of the environment and use reactive strategies for obstacle avoidance. These systematic search procedures are combined with a linear analytical solution for trajectory tracking using the vision-based environment model.

The eVO algorithm (Sanfourche 2013) is used for localisation and operates on a map of isolated landmarks, which is continuously updated. Position and attitude are estimated by the localisation function, which tracks previously mapped landmarks, while the mapping function associates the 3D landmarks in a local map.

The stereo images are used to compute a depth map with the ELAS (Efficient Large-scale Stereo Matching) algorithm [13], to achieve a robust matching of image features. The probabilistic propagation of the local depth information is then included in a 3D occupancy grid, exploited by the OctoMap model [14], which deals with probabilities of occupancy and free space. To check for possible collisions, the algorithm sends a query for a position, and the map returns the distance and the direction unit vector to the nearest obstacle. The search takes into account all possible avoidance plans, and if no solution can be found the MAV enters emergency mode and starts hovering.

To evaluate the 3D model and the perception and control architecture, experiments were conducted in a flying arena and in an industrial warehouse. In the first environment a Matthews Correlation Coefficient [15] of 0.94 was obtained with respect to the ground-truth model. For the second, several scenarios were run, such as following a reference trajectory and crossing paths, and the MAV demonstrated autonomy in finding a safe trajectory.

3.3. Path Planning for Quadrotor UAV Using Genetic Algorithm by Reagan L. Galvez et al. [10]

This paper focuses on path planning for quadrotor-type UAV navigation from an initial point to a destination point in an environment with fixed obstacles. The goal is to save energy; hence they have to determine the shortest path that the vehicle must travel from the starting point to the target, minimising the power consumed without hitting any obstacle.

To find the minimum-cost path, they use a Genetic Algorithm (GA), which is effective for searching for the optimal solution within a sample space or population. Here, a chromosome represents a possible path among the given points. The algorithm creates a population of individuals and lets them evolve over multiple generations to find better and better solutions. The evolution happens through recombination and mutation processes, and the quality of each individual is measured by its fitness. When building a new generation, natural selection is applied: weaker members of a population are less likely to survive and produce offspring.

The process starts with the given initial point (IP) and target point (TP) in 3-dimensional coordinates, the input parameters of the GA. The output, an ordered set of path coordinates, represents the shortest path. The fitness is computed as the summation of the distances between the points, or routes, of a given path; the best fitness for a chromosome is given by the lowest value of the distance travelled. For the selection process, the half of the generation with the highest fitness is used as parents for the new offspring. The crossover swaps the genes of the parent chromosomes around a random cut point. Mutation then randomly alters individuals with a small probability (0.2) to guarantee a small amount of random search, ensuring that each point has a chance of being examined and helping the search escape local optima.
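
The sketch below outlines a GA of this kind for ordering a fixed set of intermediate waypoints between IP and TP. The fitness is simply the total path length and the crossover and mutation operators are simplified illustrative choices, so obstacle handling and the exact operators of the paper are not reproduced.

```python
import math
import random

POINTS = [(0, 0, 0), (3, 1, 2), (5, 5, 1), (2, 7, 3), (8, 8, 2)]    # IP, intermediate points, TP (illustrative)

def path_length(order):
    # fitness: total Euclidean distance along the path; lower is better
    pts = [POINTS[0]] + [POINTS[i] for i in order] + [POINTS[-1]]
    return sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))

def crossover(p1, p2):
    # order-preserving one-point crossover over the intermediate points
    cut = random.randrange(1, len(p1))
    head = p1[:cut]
    return head + [g for g in p2 if g not in head]

def mutate(order, rate=0.2):
    # swap two genes with a small probability to keep a little random search going
    if random.random() < rate and len(order) > 1:
        i, j = random.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
    return order

population = [random.sample(range(1, len(POINTS) - 1), len(POINTS) - 2) for _ in range(30)]
for generation in range(500):
    population.sort(key=path_length)                       # shortest (fittest) paths first
    parents = population[:len(population) // 2]            # the fitter half produces the offspring
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(len(population) - len(parents))]
    population = parents + children

print("best order:", population[0], "length:", round(path_length(population[0]), 2))
```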

To simulate the possible routes for the quadrotor, they used integer values for the coordinates. The boundary of the XYZ space is 10x10x10 units. The number of individuals in a generation and the number of epochs run were not specified. The stated result is a quite acceptable near-optimal path that avoids obstacles, obtained after 500 generations.

3.4. Reinforcement Learning for UAV Attitude Control by William Koch et al. [11]

Autopilot systems are composed of an inner loop providing stability and control, and an outer loop responsible for mission-level objectives such as waypoint navigation. Such systems are usually implemented with Proportional Integral Derivative (PID) controllers, which have demonstrated exceptional performance in stable environments. However, more sophisticated control is required to operate in unpredictable and harsh environments. This paper focuses on the inner control loop, using intelligent flight control systems to achieve its accuracy.
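
For context, the classical inner loop that the RL controllers are compared against is a PID controller per axis. The minimal sketch below shows the standard discrete PID update; the gains and timestep are arbitrary placeholders.

```python
class PID:
    """Textbook discrete PID controller for a single axis (for example, roll rate)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt                     # accumulated error term
        derivative = (error - self.prev_error) / self.dt     # rate of change of the error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# illustrative use: drive the measured roll rate towards the commanded roll rate (deg/s)
roll_pid = PID(kp=0.9, ki=0.05, kd=0.02, dt=0.002)
command = roll_pid.update(setpoint=30.0, measurement=27.5)
```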

They use Reinforcement Learning techniques such as Deep Deterministic Policy Gradient (DDPG), Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) to identify whether they are appropriate for high-precision, time-critical flight control. Two simulation environments are used, one for training a flight controller and the other for comparing the obtained results with those of a PID.

The learning environment, GymFC, allows the agent to learn attitude control of an aircraft through both episodic tasks and continuous tasks. While episodic tasks are not very reflective of realistic flight conditions, because the agent is required to learn with respect to individual velocity commands, continuous tasks are more intense, with commands of random width and amplitude being continuously generated.

They consider an RL architecture consisting of a neural network flight controller acting as an agent that interacts with an Iris quadcopter in the Gazebo simulator. The quadcopter's inertial measurement unit (IMU) provides an observation from the environment at each timestep t. Once the observation is received, the agent executes an action within the environment and receives a single numerical reward indicating the performance of this action.

Each learning agent was trained with an RL algorithm for a total of 10 million simulation steps, equivalent to 10,000 episodes or 2.7 simulation hours. The RL algorithm used for training and the memory size m define a configuration. Training took approximately 33 hours for DDPG, 9 hours for PPO and 13 hours for TRPO. The training results clearly show that PPO converges faster and accumulates higher rewards than TRPO and DDPG. One fact they noticed is that a large memory size actually decreases convergence speed and stability for all trained algorithms, perhaps because it causes the algorithm to take longer to learn the mapping to the optimal action.

3.5. Autonomous UAV Navigation Using Reinforcement Learning by Mudassar Liaq et al. [12]

The paper states that existing work on the autonomous navigation of UAVs is restricted to ideal environments instead of realistic ones. The aim is to overcome these limitations by providing a model compatible with every drone, using only standard drone sensors and taking into account environmental factors such as wind and rain. The sensors used are a camera, a global positioning system (GPS), an inertial measurement unit (IMU), a magnetometer and a barometer.

This paper describes a deep RL-based framework for the UAV motion planning task. An efficient policy-based Deep Q-Network (DQN) is designed, comprising convolutional and fully connected layers that extract important features from images taken with a depth-vision camera. These features are then combined with inputs taken directly from the other sensors and passed to a fully connected layer. This is followed by a policy-based Q-learning approach to handle UAV navigation.
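
A possible Keras sketch of such a two-branch network is shown below: a small convolutional stack over the depth image and a dense branch over the remaining sensor readings, concatenated and followed by fully connected layers that output one Q-value per action. All layer sizes, input shapes and the number of actions are assumptions made for the illustration.

```python
import tensorflow as tf

n_actions = 5                                    # hypothetical discrete action set

# branch 1: convolutional feature extractor over the depth image
image_in = tf.keras.Input(shape=(84, 84, 1), name="depth_image")
x = tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu")(image_in)
x = tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu")(x)
x = tf.keras.layers.Flatten()(x)

# branch 2: readings from GPS, IMU, magnetometer and barometer
sensor_in = tf.keras.Input(shape=(12,), name="sensors")
s = tf.keras.layers.Dense(64, activation="relu")(sensor_in)

# merge both branches and regress the Q-values
merged = tf.keras.layers.Concatenate()([x, s])
h = tf.keras.layers.Dense(256, activation="relu")(merged)
q_values = tf.keras.layers.Dense(n_actions, name="q_values")(h)

dqn = tf.keras.Model(inputs=[image_in, sensor_in], outputs=q_values)
dqn.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
dqn.summary()
```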

The simulation tool used is AirSim, which enables using different APIs, retrieving images and controlling the vehicle. The system specifications are Linux, Python 3.6, TensorFlow 1.13, CUDA 10.0 and cuDNN 7.4.2. The simulation environment is 3D, the quadcopter can navigate along all three axes, and the speed and altitude of the UAV range between 0 and 10 m/s and between 5 and 200 m respectively. The episode length can be up to 200 steps. The network parameters are learned using the Adam optimizer with a learning rate of 10^-3. The discount factor γ for the experiments is varied in the range 0.91 to 0.99.

When run for 500 epochs, the network was learning very slowly. This is a known behavior, as the network is experimenting with different outputs to check which strategy works. Figure 3.3 shows the trend of the reward function values when run for 1000 epochs. After a few small tweaks the reward maxes out, being capped at 200, and as the iterations increase the reward values show a rapid increase.

Figure 3.3 Trend of reward function

The result they achieved is a model that can be used in a realistic environment, does not need specially featured hardware (only drones with standard sensors), uses only onboard processing, learns on the fly and is able to navigate for more than five minutes.

Chapter 4

UAVEL DEVELOPMENT

This chapter focuses on the software development part of the application. First of all, it gives an overview of the platforms, plugins and libraries required to set up the system. Then, details concerning the design and implementation decisions are presented. In the last section a user manual is specified, along with the features for the training and inference modes.

4.1 System requirements

The system consists of three major parts:


● a 3D simulation platform, Unreal Engine, to create and run simulated environments
● an interface platform, AirSim Py, to simulate drone physics and interface between Unreal and Python
● Python code, based on TensorFlow


Unreal Engine is an advanced real-time 3D creation platform for photoreal visuals and immersive experiences. Developed by Epic Games and built on a C++ code base, it includes a wide variety of platform types, for driving, flying and more, and offers a high degree of portability across all established operating systems.


AirSim Py is an open-source plugin developed by Microsoft that interfaces Unreal Engine with Python. It provides basic Python functionality for accessing the sensory inputs and control signals of the drone. Our project is built on top of the low-level Python modules provided by AirSim, creating higher-level Python modules for the purpose of drone RL applications.
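
To give a feeling for this low-level layer, the snippet below uses the standard AirSim Python client to connect to the simulator, take off and fly to a position. It is a generic usage sketch, not code from this project, and it assumes the simulator is already running.

```python
import airsim

client = airsim.MultirotorClient()       # connects to the AirSim instance running inside Unreal
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)

client.takeoffAsync().join()                                 # blocking take-off
client.moveToPositionAsync(10, 0, -5, velocity=3).join()     # NED frame: negative z is up

state = client.getMultirotorState()      # kinematics and sensor data usable by the RL loop
print(state.kinematics_estimated.position)

client.landAsync().join()
client.armDisarm(False)
client.enableApiControl(False)
```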

Python is the chosen programming language to interface with the environments and carry out the deep reinforcement learning process. TensorFlow is an artificial intelligence library that uses data flow graphs to build models. It allows creating large-scale neural networks with many layers, which serve as function approximators for our RL algorithms.

Anaconda, an open-source distribution of Python for data science, machine learning and large-scale analytics, is used to simplify package management and deployment.

The pygame library is an open-source module for the Python programming language specifically intended to help build games and other multimedia applications. Here, the PyGame screen can be used to control simulation parameters: pausing the simulation, modifying algorithmic or training parameters, overwriting the config file and saving the current state of the simulation.

4.2 System design

4.3 User manual

Running this project requires having Unreal Engine installed on your computer. Then, to avoid conflicts and pursue a smooth installation process, it is advisable to create a new virtual environment for this project, clone the code into it and install the dependencies from requirements.txt. More detailed steps can be found in the provided readme.

The UAVEL engine takes its input from a config file used to define the problem and the algorithm for solving it. There one can specify the environment type, the drone-specific parameters along with its IP address and type, and also the parameters for the camera. An example of a main configuration file can be found in the figure below.

Figure 4.1 Example of a main configuration file

The user can select the following simulation modes for the drone:


● move_around: when the mode is set to move_around, the simulation starts the environment in free mode. In this mode, the keyboard can be used to navigate across the environment, which can help the user get an idea of the environment dynamics. The keys a, w, s, d, left, right, up and down can be used to navigate around. This can also be helpful when identifying the initial positions of the drone.


● train: signifies the training mode, used as an input flag for the selected algorithm
● infer: signifies the inference mode, used as an input flag for the chosen algorithm

After setting up the desired mode and parameters, one can run the main script of the project from the command line or from an IDE. This will open the simulation environment without rendering. By pressing Fn+F1 the AirSim keyboard commands panel will be shown on the screen. Using these commands one can activate rendering, change the type of view or switch to manual control.

For the training and inference modes a PyGame screen is available, listing the keyboard commands that control the simulation. We can use it to pause the simulation, modify algorithmic or training parameters, overwrite the configuration file and save the current state of the simulation. The system generates a number of output files. The log file keeps track of the simulation state per iteration, listing useful algorithmic parameters; this is particularly useful when troubleshooting the simulation. TensorBoard can be used to visualize the training plots at run-time and to monitor the training parameters, and the input parameters can be changed through the PyGame screen if needed.

The simulation updates two graphs in real time. The first graph is the altitude variation of the drone, while the other is the drone trajectory mapped onto the environment floorplan. The trajectory graph also reports the total distance traveled by the drone before the crash.

Chapter 5

CASE STUDY

5.1 Experiments and results

Chapter 6

CONCLUSIONS

6.1 Main contribution

6.2 Limitations and future work

Chapter 7

BIBLIOGRAPHY

1. Article about Unmanned aerial vehicles,

https://en.wikipedia.org/wiki/Unmanned_aerial_vehicle

2. Article about drone research,

https://blogs.ei.columbia.edu/2017/06/16/how-drones-are-advancing-scientific-research/

3. Article about The Year of drones,

https://newatlas.com/year-drone-2013/30102/

4. Article about small cargo drones in Germany,

http://unmannedcargo.org/cargo-drones-deliver-blood-samples-in-germany/

5. Article about drones applications,

https://filmora.wondershare.com/drones/drone-applications-and-uses-in-future.html?gclid=Cj
0KCQjwka_1BRCPARIsAMlUmEpZ8TrrZDnoHCkVxmL_vV55IsidzY1YEXXkqfonH-ND
eoqrXF-XCcQaAvnbEALw_wcB

6. Article about drone technology applications,

https://www.allerin.com/blog/10-stunning-applications-of-drone-technology

7. Article about challenges with autonomous drones,

https://www.ansys.com/blog/challenges-developing-fully-autonomous-drone-technology

8. Deep Convolutional Neural Network-Based Autonomous Drone Navigation by K. Amer, M. Samy, M. Shaker and M. ElHelw, Center for Informatics Science, Nile University, Giza, Egypt, 2019

9. Reactive MPC for Autonomous MAV Navigation in Indoor Cluttered Environments: Flight Experiments by Julien Marzat, Sylvain Bertrand, Alexandre Eudes, Martial Sanfourche, Julien Moras, The French Aerospace Lab, Palaiseau, France, 2017

10. Path Planning for Quadrotor UAV Using Genetic Algorithm by Reagan L. Galvez, Elmer P. Dadios, Argel A. Bandala, De La Salle University, Manila, Philippines, 2016

11. Reinforcement Learning for UAV Attitude Control by William Koch, Renato Mancuso, Richard West, Azer Bestavros, Boston University, USA, 2019

12. Autonomous UAV Navigation Using Reinforcement Learning (RL) by Mudassar Liaq and Yungcheol Byun, International Journal of Machine Learning and Computing, Vol. 9, No. 6, 2019

13. Paper on ELAS algorithm,

http://w.cvlibs.net/publications/Geiger2010ACCV.pdf

14. Paper about OctoMap,

https://courses.cs.washington.edu/courses/cse571/16au/slides/hornung13auro.pdf

15. Article about Matthews Correlation Coefficient,

https://en.wikipedia.org/wiki/Matthews_correlation_coefficient

16. ICAO circular 328-AN/190.
