Chapter 1
INTRODUCTION
Over the past decade, Unmanned Aerial Vehicles (UAVs) have grown steadily in popularity and have received considerable attention in the research community [2], where a significant number of seminal results and applications have been proposed and tested experimentally. Despite this progress, flight autonomy is still considered an open research topic, encompassing control and orientation, obstacle avoidance and even trajectory optimisation.
Machine Learning is currently applied to numerous types of applications and systems. Autonomous flight is an interesting topic in this area, opening many challenges: precision and accuracy, robustness and adaptation, and reward engineering when Reinforcement Learning is used.
1.1 Motivation
If we think in terms of evolutionary history, all animals exhibit some kind of behavior: they do something in response to the inputs that they receive from the environment they exist in. Some animals change the way they behave over time: given the same input, they may respond differently later on than they did earlier. Such learning behavior is vital to the survival of species.
As technology develops, enabling machines to mimic this learning ability has become one of the long-standing challenges for scientists. Reinforcement learning is an area of machine learning inspired by behaviorist psychology, which has revolutionized our understanding of learning.
Drones are not a new phenomenon, as they date back to the mid-1800s, but their development in the civil market is relatively recent. Some experts have called 2013 the year of drones [3]; it was the moment when the need arose for autonomous systems to perform tasks that humans used to coordinate.
The applications of drones are no longer limited to military use, and they have spread into many fields, from agriculture to the internet [5]. One of the biggest achievements of drone technology is performing monitoring and precision tasks on large farmlands, irrigation systems, wind farms, highway traffic and construction sites. Closely related is the domain of safety surveillance, reporting and rescue. Drones are very useful for watching over large public gatherings and border areas, or for reporting accidents and natural disasters in real time. In some countries they are used to perform difficult search and rescue operations.
Research on environmental compliance, the atmosphere, wildlife protection and even 3D geographical mapping is carried out with semi-autonomous drones. In Germany [4], experiments in carrying small cargo are ongoing, marking the beginning of drone use for shipping and delivery tasks. Autonomous drones could also carry wireless internet signal to remote locations; perhaps in the near future we will access the internet through a drone flying somewhere overhead.
In the absence of obstacles, satellite-based navigation coordinated by a human from departure to destination is a simple task. When obstacles are in the path, are not in known or fixed positions, or are too numerous, the drone pilot must build a flight plan to avoid them, and the task becomes very challenging [7]. Going further, in an environment with a weak satellite signal the task is almost impossible. Moreover, in both surveillance and reporting tasks the areas to be watched over are so vast that an extremely large number of human operators would be needed to coordinate the systems, not to mention that some activities require high precision or permanent monitoring.
Autonomous drones are a solution to the many challenges that traditional systems encounter in any field where they are used. For monitoring, surveillance, delivery and research tasks, autonomy is a key concept in the development of drones and a deciding factor in their contribution to industry. These arguments give rise to the idea of applying intelligent agents to achieve our tasks in any area, at lower cost and with higher accuracy.
1.2 Problem formulation
The goal of this project is to create an agent that can learn a control policy that maximizes the rewards it receives or reaches the proposed goals. Suppose there is an agent that can interact with an environment by executing a sequence of actions and receiving rewards. More precisely, at each time step the agent selects an action a from the set of legal actions and transmits it to the environment. Thereafter, the environment may return an updated reward or an observation (for example, a screenshot) of the environment. The environment may also use stochastic transitions with a certain success rate. Therefore, the agent can only learn how to behave from direct experience with the environment.
The objective is to autonomously navigate a drone [1] in a simulated unknown environment, without any human help, without any prior knowledge about the terrain and using only the default drone sensors. The predefined task of the drone is not important, nor are its construction or behaviour details; what matters is the training method. Machine learning techniques will be used to train the drone to fly autonomously among obstacles.
The drone is initially spawned at a random location on the map, and its goal is to fly as far as possible from its initial position without any collisions.
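To make this objective concrete, a minimal reward-shaping sketch is given below. It is only illustrative: the `compute_reward` helper, the collision flag and the numeric values are assumptions, not the exact function used later in the experiments.

```python
import numpy as np

def compute_reward(position, start_position, collided):
    """Illustrative reward: reward distance flown from the spawn point,
    penalize any collision heavily (hypothetical shaping, not the exact
    reward used in the experiments)."""
    if collided:
        return -100.0  # large penalty; the episode ends on collision
    distance = np.linalg.norm(np.asarray(position) - np.asarray(start_position))
    return float(distance)  # farther from the spawn point -> higher reward
```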
1.3 Main results
1.4 Thesis structure
The structure of the thesis is organised as follows. Chapter two is divided into three parts: the first gives a general overview of Reinforcement Learning elements, the second covers deep reinforcement learning, and the third guides the reader through autonomous flight preliminaries, bringing these concepts together. In chapter three, five experimental works focused on drone navigation and obstacle avoidance are discussed. Chapter four presents the system requirements and offers details regarding the implementation and the user manual. The fifth chapter describes the experiments carried out along with their results, and chapter six offers some conclusions, stating at the end the difficulties encountered and the possible improvements.
Chapter 2
BACKGROUND
2.1 Elements of Reinforcement Learning
Reinforcement learning deviates from supervised learning in that it lacks labelled training data and a supervisor. In supervised learning the objective is to generalize from the given data in order to make accurate predictions on previously unseen data; the given data therefore has both input and output values.
In reinforcement learning, however, the model is given a set of input data and determines the hidden patterns in the data by which it can maximize its rewards. By relying on a reward-based learning strategy, reinforcement learning therefore differs from both the supervised and unsupervised learning paradigms and yields a completely different interface.
The main elements used in reinforcement learning are the agent, the environment and, in the context of these two, the model, the value function, the reward and the policy. Briefly, the agent acts in the environment based on its policy, and the environment sends rewards back based on these actions. This process is known as the agent-environment loop and is presented in Figure 2.1. In the following, the most relevant elements of the agent-environment loop are presented in more detail.
Figure 2.1: The agent-environment interface
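The loop in Figure 2.1 can be written down in a few lines. The sketch below assumes a Gym-style environment object exposing `reset` and `step`; the agent simply picks random legal actions to show the flow of observations and rewards, not any particular learning method.

```python
import random

def run_episode(env, legal_actions, max_steps=200):
    """Minimal agent-environment loop: observe, act, receive reward, repeat."""
    observation = env.reset()                      # initial observation
    total_reward = 0.0
    for t in range(max_steps):
        action = random.choice(legal_actions)      # placeholder policy
        observation, reward, done, info = env.step(action)  # environment reacts
        total_reward += reward
        if done:                                   # e.g. collision or goal reached
            break
    return total_reward
```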
2.1.1 The agent
The agent is the intelligent (software) component of any reinforcement learning system. Agents are the learner components of the system: their task is to act in the environment in such a way that the reward they receive is maximized. Agents act on behalf of a so-called policy, a concept detailed later in this chapter.
Based on which key component the agent contains, one can distinguish value-based, policy-based and actor-critic agents. A value-based agent's policy is implicit, as it can directly read off the best action from its value function. In contrast, a policy-based agent stores only a policy and selects actions without ever explicitly storing a value function. Actor-critic agents combine the previous two and try to get the best of both the value function and the policy.
The agent's knowledge of the environment yields another fundamental distinction: agents can be categorized as model-based or model-free. A model-based agent has full knowledge of the environment's dynamics and states, so its model of the environment is complete, whereas model-free agents have an incomplete model of the environment (or none at all).
Figure 2.2: David Silver’s taxonomy of RL agents
2.1.2 The environment
Everything inside the system but outside the agent is said to be part of the environment. The agent holds a representation of the environment in the form of a model. There are multiple types of environments, and the accuracy of the agent's model also depends on this type. Russell and Norvig propose the following environment properties:
• deterministic/non-deterministic: in a deterministic environment the outcome can be precisely predicted from the current state. Simply put, if an agent performs the same task in two identical deterministic environments, the outcome will be the same in both cases. In a non-deterministic or stochastic environment the outcome cannot be determined from the current state; this type of environment introduces a greater level of uncertainty.
• discrete/continuous: in a discrete environment, there is a finite number of possible actions for moving from one state to another. In contrast, in a continuous environment there is an infinite number of possible actions for one transition.
• episodic/non-episodic: episodic environments are also called non-sequential. In an episodic environment, the agent's action will not affect any future action; the agent's performance is the result of a series of independently performed tasks. In non-episodic or sequential environments, the agent's current action will affect future actions; in other words, the agent's actions are not independent of each other.
• static/dynamic: environments that are modified exclusively by the agent are called static. Environments that can change without the interaction of the agent are called dynamic.
• accessible/inaccessible: in the reinforcement learning vocabulary these two categories are also known as observable/partially observable environments. Observable environments are those whose state space the agent can represent completely and accurately, whereas the agent's knowledge of a partially observable environment is limited to its observations.
2.1.3 Main Reinforcement Learning problems
The succession of actions taken by the agent follows the procedural approach called sequential decision making. Fundamental problems that arise in this kind of process are the problem of learning and the problem of planning; the main distinction between the two is based on the agent's initial knowledge of the environment.
The method by which the agent balances exploration and exploitation is another key part of what it means to do reinforcement learning. Exploration means giving up some reward the agent already knows about for the sake of gaining information about the environment, while exploitation means maximizing reward based on the knowledge the agent already possesses. One key point of designing RL agents is to come up with an algorithm that exploits as well as it explores.
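A common way to balance the two is epsilon-greedy action selection: with probability epsilon the agent explores a random action, otherwise it exploits its current value estimates. The sketch below is a generic illustration of this idea, not tied to any specific algorithm used later.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon (explore), otherwise
    the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```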
Another distinction within reinforcement learning is between the problem of prediction and the problem of control. When solving the prediction problem, the agent has a fixed policy and its purpose is to predict how well it will do in the future. Solving the control problem means finding the best policy. Thus, solving the problem of control also implies solving the prediction problem.
2.2 Deep Reinforcement Learning
2.2.1 Deep Learning
Deep learning is a learning technique for finding patterns in data provided by a supervisor. The patterns the algorithm learns from the labeled data form a predictive model that is able to recognize previously unseen data. Deep learning is a key technology in self-driving cars, voice control systems, various disease detection/prediction algorithms, etc. Many times, deep learning models can achieve higher accuracy and better precision in certain tasks than human experts. In the vast majority of cases, deep learning models are used to solve supervised learning problems.
The key components of deep learning models are multi-layer perceptrons (MLPs). An MLP can be viewed as a mathematical function that is itself just a chaining (composition) of simpler mathematical functions, mapping some input values to some output values. This simple idea allows deep learning models to build complex concepts out of simpler ones. Obtaining high-level features of the data in such a way seems promising for solving other types of problems too.
The deep neural networks underlying them, when viewed as function approximators, prove to be useful in other classes of tasks too. In reinforcement learning problems where the size of the state space makes it infeasible to calculate the value function with direct solution methods, one resorts to approximations. Reinforcement learning models that use deep neural networks as value function approximators are called deep reinforcement learning models.
2.2.2 Experience Replay
Once a connection between supervised learning and reinforcement learning with function approximators is made as described previously, one might find that reinforcement learning methods do not exploit the potential of the data as much as supervised learning methods do. Reinforcement learning takes the steps of one episode, updates the value function and then moves on to another episode with other steps. However, this does not fully exploit the potential hidden in the data, and reinforcement learning methods based on value function approximators can be extremely sensitive to this, mainly because of bias-variance problems. The method called experience replay aims to solve some of these issues.
The core idea of experience replay is that at each timestep the agent stores a tuple containing the current state, the action taken, the reward and the next state in a so-called replay buffer. This buffer is accessed when performing the weight updates, which are based on batches sampled randomly from the replay buffer. Random sampling breaks the dependencies between successive experiences coming from the same episode, and hence one important source of instability is eliminated.
There are several modified versions of experience replay, such as prioritized and hindsight experience replay. However, the vast majority of these methods are usually used hand in hand with off-policy methods such as Q-learning.
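A minimal replay buffer along the lines described above might look as follows; the capacity and batch size are arbitrary illustrative values.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) tuples and returns
    uniformly sampled batches, breaking the ordering within episodes."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are discarded

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```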
2.2.3 Deep Q Learning
The variant of Q-learning in which a deep neural network is used as the function approximator is called Deep Q-Learning. The basic steps of the usual Q-learning algorithm are still preserved. Yet, to avoid some frequent issues with the algorithm's convergence, several modifications are made (for example, balancing exploration and exploitation with epsilon-greedy action selection, dealing with the correlations among the 'training data' by using experience replay, and decoupling the target network from the policy network). The final steps of Deep Q-Learning are as follows (a minimal sketch of the update step is given after the list):
1. initialize parameters for Q(s,a) and Q'(s,a), set epsilon = 1 and an empty replay buffer
2. select action a with epsilon-greedy
3. execute action a, collect reward r, observe next state s'
4. store the transition (s,a,r,s') in the replay buffer
5. sample a batch of transitions from the replay buffer
6. compute the target y for each transition in the batch
7. compute the loss
8. update the parameters using stochastic gradient descent
9. repeat from step 2 until convergence
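A compact sketch of steps 5-8 is shown below. The `main_net` and `target_net` objects are assumed interfaces (a `predict` method returning Q-values and a `train_step` method performing one SGD update), so this is an outline of the update rule rather than a full implementation.

```python
import numpy as np

def dqn_update(main_net, target_net, replay_buffer, gamma=0.99, batch_size=64):
    """Steps 5-8 of the listed algorithm (illustrative sketch).
    main_net / target_net are assumed to expose predict(states) -> Q-values
    and main_net.train_step(states, targets) performing one SGD step."""
    states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)
    states, next_states = np.asarray(states), np.asarray(next_states)
    rewards = np.asarray(rewards, dtype=np.float32)
    dones = np.asarray(dones, dtype=np.float32)

    q_next = target_net.predict(next_states)       # Q'(s', a') from the target network
    max_q_next = q_next.max(axis=1)
    targets = main_net.predict(states).copy()
    # y = r for terminal transitions, y = r + gamma * max_a' Q'(s', a') otherwise
    targets[np.arange(batch_size), list(actions)] = rewards + gamma * max_q_next * (1.0 - dones)
    loss = main_net.train_step(states, targets)    # e.g. MSE between Q(s,a) and y
    return loss
```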
One important note regarding the above algorithm is that there are actually two networks at work, called the main network and the target network respectively. The target network is a copy of the main network that produces the targets Q(s',a') and is periodically synchronized with the main network. This tweak is introduced because Q-learning is an off-policy control algorithm, and if the targets were produced by the same network that dictates the policy, the resulting bias would create a high risk that the algorithm diverges.
While experience replay is meant to break the correlations between training examples, using two networks targets the decoupling of the target and behaviour policies. Deep Q-Networks are particularly useful in Markov Decision Processes where the observations are given as images. In such cases, on the one hand, it is convenient and straightforward to use convolutional neural networks as approximators; on the other hand, the nature of image observations gives rise to exactly the issues described in the previous paragraphs, issues that Deep Q-Learning was designed to solve.
2.3 Autonomous Flight
This section offers a brief introduction to autonomous flight. Basic notions about drones are introduced, then an overview of quadrotors is given. The last part is a discussion regarding autonomous drones.
2.3.1 General view
An unmanned aircraft system (UAS) is an aircraft and its associated elements operated with no pilot on board. A subset of UAS, remotely piloted aircraft systems (RPAS), are sets of configurable elements consisting of a remotely piloted aircraft, its associated remote pilot station(s), the required command and control links and any other system elements that may be required at any point during flight operations.
In the past decade, the field of aerial systems has exploded. Tremendous progress has been made in the design and autonomy of flight robotics, which can roughly be categorized as either man-made or bio-inspired. In the former class, fixed-pitch rotorcraft have become particularly popular for their mechanical simplicity, but several more complex vehicles have also been developed. For example, decoupled rotational and translational degrees of freedom can be achieved with variable-pitch or omnidirectional rotor configurations.
Fixed-wing aircraft are mechanically simple, but are still able to glide and execute birdlike maneuvers, such as perching. Compared with ornithopters (aircraft that fly by flapping their wings), they are easier to model and are typically better able to carry onboard sensors, allowing for the successful development of planning, perception and formation control algorithms.
Multirotor aircraft, particularly quadrotors, have been the most capable vehicles in terms of accessibility, maneuverability, capacity for onboard sensors and applicability to a breadth of applications. Great research progress has been made across multiple areas, including the control of agile maneuvers, planning and perception in unknown, unstructured environments, and collaboration in multiagent teams, sparking a surge of industry investment.
In common language, RPAS are referred to simply as drones, and this document will accordingly use the word drone to refer to both UAS and RPAS.
2.3.2 Quadrotors
Due to their ease of construction, lightweight flight controllers and durability, quadrotors are the most frequent configuration of unmanned aerial vehicles. With their small size and maneuverability, these quadcopters can be flown indoors as well as outdoors.
As opposed to fixed-wing aircraft, quadcopters are classified as rotorcraft because their lift is generated by a set of rotors, i.e. vertically oriented propellers. Generally, quadrotors use two pairs of identical fixed-pitch propellers, two spinning clockwise and two counterclockwise, and achieve control by varying the speed of each rotor independently. By changing the speed of each rotor it is possible to generate a desired total thrust and total torque. Thrust is produced when air is pushed in the direction opposite to flight by the spinning blades of a propeller, while torque, the turning force, is the rotational equivalent of linear force.
Figure 2.2 Clockwise and counterclockwise propellers
Each rotor produces both a thrust and a torque about its center of rotation, as well as a drag force opposite to the vehicle's direction of flight. There is no need for a tail rotor as on conventional helicopters, because if all rotors spin at the same angular velocity, with rotors A and C rotating clockwise and rotors B and D counterclockwise as in Figure 2.2, the net aerodynamic torque, and hence the angular acceleration about the yaw (vertical) axis, is exactly zero.
The attitude of a quadrotor provides information about its orientation with respect to the local level frame, the horizontal plane and true north. As depicted in Figure 2.3, the attitude has three components: roll, pitch and yaw. The easiest way to understand them is to consider a drone with three linear axes running through it.
Figure 2.3 Attitude components of a quadrotor
A quadrotor performs its maneuvers in the following way (a minimal motor-mixing sketch follows the list):
➢ it adjusts altitude by applying equal thrust to all four rotors
➢ it adjusts yaw by applying more thrust to the rotors rotating in one direction; for example, turning right implies more thrust on propellers A and C from Figure 2.2
➢ it adjusts pitch or roll by applying more thrust to one rotor and less thrust to its diametrically opposite rotor
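These three adjustments can be summarised in a simple motor-mixing rule. The sketch below, with rotors ordered A, B, C, D as in Figure 2.2 (A and C clockwise, B and D counterclockwise), uses an illustrative sign convention and rotor placement; it is not a flight-ready controller.

```python
def mix_motors(base_thrust, roll_cmd, pitch_cmd, yaw_cmd):
    """Illustrative quadrotor motor mixer for rotors A, B, C, D.
    Equal base thrust changes altitude; the differential terms produce
    roll, pitch and yaw (signs depend on the chosen frame convention)."""
    a = base_thrust + pitch_cmd + yaw_cmd   # front rotor, clockwise
    b = base_thrust - roll_cmd  - yaw_cmd   # right rotor, counterclockwise
    c = base_thrust - pitch_cmd + yaw_cmd   # rear rotor, clockwise
    d = base_thrust + roll_cmd  - yaw_cmd   # left rotor, counterclockwise
    return a, b, c, d
```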
2.3.3 Autonomous Drones
An Unmanned Aerial Vehicle (UAV, a term used by the United States Department of Defense) is defined in the US Department of Defense dictionary as a "powered, aerial vehicle that does not carry a human operator, uses aerodynamic forces to provide vehicle lift, can fly autonomously or be piloted remotely, can be expendable or recoverable, and can carry a payload".
Quadcopters and other multicopters can often fly autonomously. Many modern flight controllers use software that allows the user to mark waypoints on a map, to which the quadcopter will fly and perform tasks such as landing or gaining altitude. The PX4 autopilot system, an open-source software/hardware combination in development since 2009, has been adopted by both hobbyists and drone manufacturing companies to give their quadcopter projects flight-control capabilities.
Another project that enables hobbyists to use small remotely piloted aircraft such as micro air vehicles is ArduCopter, the multicopter version of the open-source ArduPilot autopilot platform. Other flight applications include crowd control with several quadcopters, where visual data from the device is used to predict where the crowd will move next and, in turn, to direct the quadcopter to the next corresponding waypoint.
The use of drones is developing at a quick pace worldwide and in particular in European Aviation Safety Agency (EASA) countries, and it is extremely varied. Some examples are: precision agriculture, infrastructure inspection, wind energy monitoring, pipeline and power inspection, highway monitoring, natural resources monitoring, environmental compliance, atmospheric research, media and entertainment, sports photography, filming, wildlife protection and research, accident reporting, and disaster relief. Experiments to carry small cargo are ongoing in Germany and France.
Initially focused on robots and now mostly applied to ground vehicles, RL has also begun to be used to train drones. Reinforcement learning is inspired by the human way of learning, based on trial-and-error experiences. Using information about the environment, the agent makes decisions and takes actions at discrete intervals known as steps. Actions produce changes in the environment and also a reward.
The application of reinforcement learning to drones will provide them with more intelligence, eventually turning drones into fully autonomous machines. Deep RL proposes the use of neural networks in the decision algorithm. In conjunction with experience replay memory, deep RL has been able to achieve super-human performance when playing video and board games. Even if typical deep RL applications focus on optimal sensorimotor control of autonomous robotic agents in immersive environments, this learning technique is a promising approach for drones as well.
Chapter 3
RELATED WORK
A wide variety of techniques for drone navigation and obstacle avoidance can be found in the literature. In this chapter we analyse five experimental works that tackle the same problem: achieving autonomous flight when obstacles are encountered in the environment. Some use predefined paths, some fixed obstacles or indoor environments, but independently of that, our focus is on the machine learning algorithms used to obtain an autonomous aerial agent.
3.1. Deep Convolutional Neural Network-Based Autonomous Drone Navigation by K. Amer et al. [8]
This paper presents an approach for autonomous aerial navigation of a drone using a monocular camera, without reliance on a Global Positioning System (GPS). For systems that require regular trips to the same locations along the same trajectories, the human operator can be replaced to reduce costs and to provide a level of precision that cannot be achieved in GPS-denied environments. The proposed algorithm focuses on autonomous drone navigation along predefined paths, which is a suitable solution for environmental monitoring, shipment and delivery, and carrying wireless internet signal to remote areas.
They use a deep Convolutional Neural Network (CNN) combined with a regressor to output the drone steering commands. Onboard information is immediately available, which is an advantage because external signals are not needed, but also a disadvantage because it exposes the system to the variability of many real-life deployment scenarios. To overcome this and make the system adaptable, they use a 'navigation envelope' for data augmentation.
The combined training dataset consists of simulator-generated data along with data collected during manually operated or GPS-based trips. They use the Unreal Engine physics environment with the AirSim plugin, which provides an API for drone control and data generation. Two synthetically generated scenarios were used in the training dataset: Blocks and Landscape. The first is an abstract environment containing different cubes and is used for the initial training of the drone; the second is a realistic scene with frozen lakes, trees and mountains, used to cover more complex environments.
A neural network pre-trained for a classification task is used as feature extractor. Then, as shown in Figure 3.1, these features are processed by the regressor, which is either a Fully Connected Neural Network (FCNN) or a Gated Recurrent Unit (GRU) network, whose purpose is to indicate the yaw angle by which the drone should rotate while navigating at a fixed forward velocity. This generalisation is helpful when the system is deployed in a realistic, complex setting. They use the Adam optimizer with a learning rate of 2.7 × 10^-4 and a batch size of 64 for 100 epochs, and the learning rate is halved every 25 epochs. The loss function used to correlate the visual input, the waypoints and the deviation angle is the Mean Squared Error (MSE).
Figure 3.1 Sketch of the training system
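A hedged sketch of how such a feature-extractor-plus-regressor setup could be wired in Keras is shown below, using the hyperparameters reported above (Adam, learning rate 2.7e-4, MSE loss, learning rate halved every 25 epochs); the backbone choice, layer sizes and the `images`/`yaw_angles` placeholders are illustrative assumptions, not the exact network of [8].

```python
import tensorflow as tf

# Pre-trained CNN used as a frozen feature extractor (backbone is an assumption).
backbone = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                             input_shape=(224, 224, 3))
backbone.trainable = False

# FCNN regressor head predicting a single yaw angle.
model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1)  # yaw angle output
])

def schedule(epoch, lr):
    """Halve the learning rate every 25 epochs, as described in the paper."""
    return lr * 0.5 if epoch > 0 and epoch % 25 == 0 else lr

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2.7e-4), loss="mse")
# model.fit(images, yaw_angles, batch_size=64, epochs=100,
#           callbacks=[tf.keras.callbacks.LearningRateScheduler(schedule)])
```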
For simplicity, the drone flies only forward, at a height of 5 meters and a velocity of 1 m/s. The only parameter that has to be adjusted in real time is the yaw angle, calculated between the next waypoint and the drone heading.
During training, each path was treated separately in order to teach the network to follow the path by correlating input images with the yaw angles required to advance. The alternative would be joint training, which can lead to conflicting decisions in a complex environment. The 'navigation envelope' consists of augmenting each path with 100 auxiliary paths obtained by adding noise to the optimal shortest path used in training.
To measure the results, two error metrics are tracked: Mean Waypoints Minimum Distance and Mean Cross Track Distance. The first is the distance between each waypoint position and the nearest position reached by the drone over the entire path, averaged over all waypoints; the second is the shortest distance between the drone position and the two next closest waypoints.
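Under these definitions, the two metrics can be computed roughly as in the sketch below; this is our reading of the metric descriptions (treating the cross-track distance as the distance to the segment between the two closest waypoints), not code from [8].

```python
import numpy as np

def mean_waypoint_min_distance(waypoints, trajectory):
    """For each waypoint, distance to the closest position reached by the
    drone, averaged over all waypoints."""
    w, t = np.asarray(waypoints, float), np.asarray(trajectory, float)
    dists = np.linalg.norm(w[:, None, :] - t[None, :, :], axis=-1)
    return dists.min(axis=1).mean()

def point_to_segment(p, a, b):
    """Shortest distance from point p to the segment a-b."""
    ab, ap = b - a, p - a
    s = np.clip(np.dot(ap, ab) / max(np.dot(ab, ab), 1e-12), 0.0, 1.0)
    return np.linalg.norm(p - (a + s * ab))

def mean_cross_track_distance(waypoints, trajectory):
    """For each drone position, shortest distance to the segment between its
    two closest waypoints, averaged over the trajectory (our interpretation)."""
    w, t = np.asarray(waypoints, float), np.asarray(trajectory, float)
    vals = []
    for p in t:
        i, j = np.argsort(np.linalg.norm(w - p, axis=1))[:2]
        vals.append(point_to_segment(p, w[i], w[j]))
    return float(np.mean(vals))
```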
To test path following, FCNN and GRU regressors with 2 and 4 time steps are then compared. In contrast to similar results from self-driving cars, the FCNN achieves a lower error in both environments. All the models were able to follow the path until the end, except for the GRU with 4 time steps. The best generalisation in finding the direction and pursuing the right path was achieved with the FCNN regressor, the trained model indicating robust autonomous navigation.
3.2. Reactive Model Predictive Control for Autonomous MAV Navigation in Indoor Cluttered Environments: Flight Experiments by Julien Marzat et al. [9]
This paper describes an integrated perception-control loop for trajectory tracking with obstacle avoidance by micro-air vehicles (MAVs) in unknown indoor environments. For the perception part, a stereo-vision camera is used to build a 3D model of the environment. An Asctec Pelican quadrotor controlled in the diamond configuration shown in Figure 3.2 is the aerial vehicle chosen for the flight experiments. It includes an Inertial Measurement Unit (IMU) and a low-level controller, which sends acceleration control inputs to the MAV: roll angle, pitch angle and yaw rate. A quad-core i7 CPU is also embedded to run the algorithmic chain and make the MAV fully autonomous.
Figure 3.2 Diamond configuration quadrotor
The technique used to tackle the trajectory problem is Model Predictive Control (MPC). The authors make use of the current estimated states of the vehicle and environment and use reactive strategies for obstacle avoidance. These systematic search procedures are combined with a linear analytical solution for trajectory tracking using the vision-based environment model.
The eVO algorithm (Sanfourche 2013) is used for localisation and operates on a continuously updated map of isolated landmarks. Position and attitude are estimated by the localization function, which tracks previously mapped landmarks, while the mapping function associates the 3D landmarks in a local map.
The stereo images are used to compute a depth map with the ELAS (Efficient Large-scale Stereo Matching) algorithm [13] to achieve a robust matching of image features. The probabilistic propagation of the local depth information is then included in a 3D occupancy grid, exploited by the Octomap model [14], which deals with probabilities of occupancy and free space. To check for possible collisions, the algorithm sends a query for a position, for which the map returns the distance and the direction unit vector to the nearest obstacle. The search takes into account all possible avoidance plans, and if no solution can be found the MAV enters emergency mode and starts hovering.
To evaluate the 3D model and the perception and control architecture, experiments were carried out in a flying arena and in an industrial warehouse. In the first environment, a Matthews Correlation Coefficient [15] of 0.94 was obtained with respect to the ground truth model. In the second, several scenarios were run, such as following a reference and crossing paths, and the MAV demonstrated autonomy in finding a safe trajectory.
3.3. Path Planning for Quadrotor UAV Using Genetic Algorithm by Reagan L. Galvez et al. [10]
This paper focuses on path planning for quadrotor-type UAV navigation from an initial point to a destination point in an environment with fixed obstacles. Their goal is saving energy, hence they have to determine the shortest path that the vehicle must travel from the starting point to the target, minimising the power consumed without hitting any obstacle.
To find the minimum-cost path, they use a Genetic Algorithm (GA), which is effective for searching for the optimal solution over a sample space or population. Here, a chromosome represents a possible path among the given points. The algorithm creates a population of individuals and lets them evolve over multiple generations to find better and better solutions. Evolution happens through recombination and mutation, and the quality of each individual is measured by its fitness. When building a new generation, natural selection is applied: weaker members of the population are less likely to survive and produce offspring.
The process starts with the given initial point (IP) and target point (TP) in 3-dimensional coordinates, which are the input parameters for the GA. The output, an ordered sequence of path coordinates, represents the shortest path. The fitness is computed as the sum of the distances between the points (routes) of a given path. For the selection process, the half of the generation's chromosomes with the highest fitness is used as parents for the new offspring; the best fitness for a chromosome corresponds to the lowest total distance traveled. Crossover swaps the genes of the parent chromosomes around a random cut point. Mutation then randomly alters individuals with a small probability of 0.2 to guarantee a small amount of random search, giving each point a chance of being examined and helping escape local optima.
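A minimal sketch of such a GA, with Euclidean path length as the fitness cost, top-half selection, single-point crossover and a 0.2 mutation rate, is given below; the chromosome encoding and constraint handling in [10] may differ, so this is only an illustration of the procedure.

```python
import math
import random

def path_length(points, order):
    """Total Euclidean distance of a path visiting the points in the given order."""
    path = [points[i] for i in order]
    return sum(math.dist(path[k], path[k + 1]) for k in range(len(path) - 1))

def crossover(p1, p2):
    """Single cut-point order crossover: keep a prefix of p1, fill from p2."""
    cut = random.randrange(1, len(p1))
    head = p1[:cut]
    return head + [g for g in p2 if g not in head]

def mutate(order, rate=0.2):
    """With small probability swap two genes to keep some random search."""
    order = order[:]
    if random.random() < rate:
        i, j = random.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
    return order

def genetic_path_planner(points, pop_size=50, generations=500):
    """Evolve orderings of the given waypoints; lower total distance = fitter."""
    pop = [random.sample(range(len(points)), len(points)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda o: path_length(points, o))   # shortest paths first
        parents = pop[: pop_size // 2]                   # top half survives
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return min(pop, key=lambda o: path_length(points, o))
```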
To simulate the possible routes for the quadrotor, they used integer values for the coordinates, with the XYZ space bounded to 10x10x10 units. The number of individuals in a generation and the number of epochs run were not specified. The stated result is an acceptable, near-optimal path that avoids obstacles, obtained after 500 generations.
3.4. Reinforcement Learning for UAV Attitude Control by William Koch et al. [11]
Autopilot systems are composed of an inner loop providing stability and control, and an outer loop responsible for mission-level objectives such as waypoint navigation. Such systems are usually implemented with Proportional Integral Derivative (PID) controllers, which have demonstrated exceptional performance in stable environments. However, more sophisticated control is required to operate in unpredictable and harsh environments. This paper focuses on the inner control loop, using intelligent flight control systems to achieve accuracy.
They use Reinforcement Learning techniques such as Deep Deterministic Policy Gradient (DDPG), Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) to determine whether they are appropriate for high-precision, time-critical flight control. Two simulation environments are used: one for training a flight controller and the other for comparing the obtained results with those of a PID controller.
The learning environment, GymFC, allows the agent to learn attitude control of an aircraft through both episodic and continuous tasks. While episodic tasks are not very reflective of realistic flight conditions, because the agent is required to learn with respect to individual velocity commands, continuous tasks are more demanding, with commands of random widths and amplitudes being continuously generated.
They consider an RL architecture consisting of a neural-network flight controller acting as an agent that interacts with an Iris quadcopter in the Gazebo simulator. The quadcopter's inertial measurement unit (IMU) provides an observation from the environment at each timestep t. Once the observation is received, the agent executes an action within the environment and receives a single numerical reward indicating the performance of that action.
Each learning agent was trained with an RL algorithm for a total of 10 million simulation steps, equivalent to 10,000 episodes or 2.7 simulation hours. The RL algorithm used for training and the memory size m define a configuration. Training took approximately 33 hours for DDPG, 9 hours for PPO and 13 hours for TRPO. The training results show clearly that PPO converges faster and accumulates higher rewards than TRPO and DDPG. One fact they noticed is that a large memory size actually decreases convergence and stability for all trained algorithms, possibly because it causes the algorithm to take longer to learn the mapping to the optimal action.
3.5. Autonomous UAV Navigation Using Reinforcement Learning by Mudassar Liaq et al. [12]
The paper states that existing work on autonomous UAV navigation is restricted to ideal environments rather than realistic ones. The aim is to overcome these limitations by providing a model compatible with every drone, using only standard drone sensors and taking into account environmental factors such as wind and rain. The sensors used are a camera, a global positioning system (GPS), an inertial measurement unit (IMU), a magnetometer and a barometer.
The paper describes a deep-RL-based framework for the UAV motion-planning task. An efficient policy-based Deep Q-Network (DQN) is designed, comprising convolutional and fully connected layers that extract important features from images taken by a depth-vision camera. These features are then combined with inputs taken directly from the other sensors and passed to a fully connected layer. This is followed by a policy-based Q-learning approach to handle UAV navigation.
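A hedged TensorFlow sketch of such an architecture, a convolutional branch for the depth image fused with a dense branch for the remaining sensor readings, is shown below; the layer counts, filter sizes, input shapes and action count are illustrative assumptions, not the exact network from [12].

```python
import tensorflow as tf

def build_dqn(image_shape=(84, 84, 1), sensor_dim=9, num_actions=5):
    """Illustrative DQN: CNN features from the depth image are concatenated
    with raw sensor inputs and passed through fully connected layers."""
    image_in = tf.keras.Input(shape=image_shape, name="depth_image")
    x = tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu")(image_in)
    x = tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu")(x)
    x = tf.keras.layers.Flatten()(x)

    sensors_in = tf.keras.Input(shape=(sensor_dim,), name="sensors")  # GPS, IMU, ...
    fused = tf.keras.layers.Concatenate()([x, sensors_in])
    fused = tf.keras.layers.Dense(256, activation="relu")(fused)
    q_values = tf.keras.layers.Dense(num_actions, name="q_values")(fused)

    model = tf.keras.Model([image_in, sensors_in], q_values)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="mse")  # learning rate 10^-3 as reported in the paper
    return model
```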
The simulation tool used is AirSim, which enables using different APIs, retrieving images and controlling the vehicle. The system specifications are Linux OS, Python 3.6, TensorFlow 1.13, CUDA 10.0 and cuDNN 7.4.2. The simulation environment is 3D, the quadcopter can navigate along all three axes, and the speed and altitude of the UAV range between 0 to 10 m/s and 5 to 200 m respectively. The episode length can be up to 200 steps. The network parameters are learned using the Adam optimizer with a learning rate of 10^-3, and the discount factor γ is varied in the range 0.91 to 0.99 across the experiments.
When run for 500 epochs, the network learned very slowly. This is a known behavior, as the network is experimenting with different outputs to check which strategy works. Figure 3.3 shows the trend of the reward function values when run for 1000 epochs. After a few small tweaks the reward maxes out, being capped at 200, and as the iterations increase the reward values show a rapid increase.
Figure 3.3 Trend of reward function
The result they achieved is a model that can be used in a realistic environment, does not need specially featured hardware (only drones with standard sensors), uses only onboard processing, learns on the fly and is able to navigate for more than five minutes.
Chapter 4
UAVEL DEVELOPMENT
This chapter focuses on the software development part of the application. First, it gives an overview of the platforms, plugins and libraries required to set up the system. Then, details concerning the design and implementation decisions are presented. In the last section a user manual is provided, along with the features available for the training and inference modes.
4.1 System requirements
The system consists of three major parts:
➢ 3D Simulation Platform: Unreal Engine, used to create and run the simulated environments
➢ Interface Platform: AirSim Py, used to simulate the drone physics and to interface between Unreal and Python
➢ Python code, based on TensorFlow
Unreal Engine is an advanced real-time 3D creation platform for photorealistic visuals and immersive experiences. Developed by Epic Games and built on a C++ code base, it includes a wide variety of platform types, for driving and flying, and offers a high degree of portability across all mainstream operating systems.
AirSim Py is an open-source plugin developed by Microsoft that interfaces Unreal Engine with Python. It provides basic Python functionality for accessing the drone's sensory inputs and control signals. Our project is built on top of the low-level Python modules provided by AirSim, creating higher-level Python modules for drone RL applications.
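As an example of the kind of low-level functionality the plugin exposes, the sketch below connects to a simulated multirotor, takes off, requests a depth image and reads the collision state through the public airsim Python client. It is a generic usage sketch, not code from our higher-level modules.

```python
import airsim

# Connect to the AirSim drone running inside Unreal Engine.
client = airsim.MultirotorClient()
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)

# Take off, then fly forward for a few seconds at 1 m/s.
client.takeoffAsync().join()
client.moveByVelocityAsync(1.0, 0.0, 0.0, duration=5).join()

# Request a depth image from the front camera (camera "0").
responses = client.simGetImages([
    airsim.ImageRequest("0", airsim.ImageType.DepthPerspective, pixels_as_float=True)
])
print("Received image of size", responses[0].width, "x", responses[0].height)

# Collision information can be used, for example, to terminate an RL episode.
collided = client.simGetCollisionInfo().has_collided
```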
Python is the programming language chosen to interface with the environments and carry out the deep reinforcement learning process. TensorFlow is a machine learning library that uses data flow graphs to build models. It allows the creation of large-scale neural networks with many layers, which serve as function approximators for our RL algorithms.
Anaconda, an open-source distribution of Python for data science, machine learning and large-scale analytics, is used to simplify package management and deployment.
The pygame library is an open-source module for the Python programming language, specifically intended for making games and other multimedia applications. Here, the PyGame screen can be used to control the simulation: pausing it, modifying algorithmic or training parameters, overwriting the config file and saving the current state of the simulation.
4.2 System design
4.3 User manual
Running this project requires having Unreal Engine installed on your computer. Then, to avoid conflicts and ensure a smooth installation process, it is advisable to create a new virtual environment for this project, clone the code into it and install the dependencies listed in requirements.txt. More detailed steps can be found in the provided readme.
The UAVEL engine takes its input from a config file used to define the problem and the algorithm for solving it. There, one can specify the environment type, the drone-specific parameters along with its IP address and type, and also the parameters of the camera. An example of a main configuration file can be found in the figure below.
Figure 4.1 Example of a main configuration file
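For readability, a hypothetical excerpt mirroring the fields described above is given in Python-dict form below; the real file format and the authoritative field names are those shipped with the code and shown in Figure 4.1.

```python
# Hypothetical configuration mirroring the fields described above;
# the actual file format and field names are defined by the shipped config.
config = {
    "mode": "train",  # move_around / train / infer
    "environment": {"type": "indoor"},
    "drone": {"ip_address": "127.0.0.1", "type": "quadrotor"},
    "camera": {"width": 320, "height": 240, "fov_degrees": 90},
}
```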
The user can select the following simulation modes for the drone:
➢ move_around: when the mode is set to move_around, the simulation starts the environment in free mode. In this mode the keyboard can be used to navigate across the environment, which helps the user get an idea of the environment dynamics. The keys a, w, s, d, left, right, up and down can be used to move around. This can also be helpful when identifying initial positions for the drone.
➢ train: signifies the training mode, used as an input flag for the selected algorithm
➢ infer: signifies the inference mode, used as an input flag for the chosen algorithm
After setting up the desired mode and parameters, one can run the main script of the project from the command line or from an IDE. This opens the simulation environment without rendering. By pressing Fn+F1, the AirSim keyboard commands panel is shown on the screen. Using these commands one can activate rendering, change the type of view or switch to manual control.
For the training and inference modes a PyGame screen is available, listing the keyboard commands that control the simulation. It can be used to pause the simulation, modify algorithmic or training parameters, overwrite the configuration file and save the current state of the simulation.
The system generates a number of output files. The log file keeps track of the simulation state per iteration, listing useful algorithmic parameters; this is particularly useful when troubleshooting the simulation. TensorBoard can be used to visualize the training plots at run time and to monitor the training parameters, while the input parameters can be changed through the PyGame screen if needed.
The simulation updates two graphs in real time. The first graph shows the altitude variation of the drone, while the other shows the drone trajectory mapped onto the environment floor plan. The trajectory graph also reports the total distance traveled by the drone before a crash.
Chapter 5
CASE STUDY
5.1 Experiments and results
Chapter 6
CONCLUSIONS
6.1 Main contribution
6.2 Limitations and future work
Chapter 7
BIBLIOGRAPHY
1. Article about Unmanned aerial vehicles,
https://en.wikipedia.org/wiki/Unmanned_aerial_vehicle
2. Article about drone research,
https://blogs.ei.columbia.edu/2017/06/16/how-drones-are-advancing-scientific-research/
3. Article about The Year of drones,
https://newatlas.com/year-drone-2013/30102/
4. Article about small cargo drones in Germany,
http://unmannedcargo.org/cargo-drones-deliver-blood-samples-in-germany/
5. Article about drones applications,
https://filmora.wondershare.com/drones/drone-applications-and-uses-in-future.html?gclid=Cj0KCQjwka_1BRCPARIsAMlUmEpZ8TrrZDnoHCkVxmL_vV55IsidzY1YEXXkqfonH-NDeoqrXF-XCcQaAvnbEALw_wcB
6. Article about drone technology applications,
https://www.allerin.com/blog/10-stunning-applications-of-drone-technology
7. Article about challenges with autonomous drones,
https://www.ansys.com/blog/challenges-developing-fully-autonomous-drone-technology
8. Deep Convolutional Neural Network-Based Autonomous Drone Navigation by K. Amer, M. Samy, M. Shaker and M. ElHelw, Center for Informatics Science, Nile University, Giza, Egypt, 2019
9. Reactive MPC for Autonomous MAV Navigation in Indoor Cluttered Environments: Flight Experiments by Julien Marzat, Sylvain Bertrand, Alexandre Eudes, Martial Sanfourche, Julien Moras, The French Aerospace Lab, Palaiseau, France, 2017
10. Path Planning for Quadrotor UAV Using Genetic Algorithm by Reagan L. Galvez, Elmer P. Dadios, Argel A. Bandala, De La Salle University, Manila, Philippines, 2016
11. Reinforcement Learning for UAV Attitude Control by William Koch, Renato Mancuso, Richard West, Azer Bestavros, Boston University, USA, 2019
12. Autonomous UAV Navigation Using Reinforcement Learning (RL) by Mudassar Liaq and Yungcheol Byun, International Journal of Machine Learning and Computing, Vol. 9, No. 6, 2019
13. Paper on ELAS algorithm,
http://w.cvlibs.net/publications/Geiger2010ACCV.pdf
14. Paper about OctoMap,
https://courses.cs.washington.edu/courses/cse571/16au/slides/hornung13auro.pdf
15. Article about Matthews Correlation Coefficient,
https://en.wikipedia.org/wiki/Matthews_correlation_coefficient
16. ICAO circular 328-AN/190.