Interaction Based State Representation - uBot Impedance Control Basis
Introduction
The uBot platform is being used to carry out reinforcement learning experiments with
an interaction-based state representation. To accomplish this, a set
of impedance controllers (described below) is used in which each
controller provides a qualitatively different tracking behavior.
Moreover, the set of impedance controllers is also used to observe the
driving (path) function. When a particular impedance controller is
driving the uBot, it produces a sequence of position and velocity
errors on the other controllers dependent on their respective impedance
parameters and the driving (path) function. If we assess the
equilibrium status (see below) of each discrete impedance controller
under the actions of controller i, the result is a
coarse spectral decomposition of the path function. We can
represent this spectral decomposition of the path function as a bit
vector called an impedance code or "i-code". A spectral
decomposition of the path function should be useful in selecting the
best next impedance controller in the context of a control objective. Our hypothesis is
that a state representation consisting of the active controller id
concatenated with the i-code is a sufficient state estimate on which to
base optimal control decisions.
Impedance Controller Design
A control model is employed that attaches a "control yoke" to the front of
the uBot. The control point for a given controller is located on
the control yoke at distance X0 from the center of the uBot. This
is conceptually the point at which forces are applied to the robot by
that controller. This effectively solves the nonholonomic motion
planning problem since the robot will stably track behind the control
point. The path function may be defined in a variety of
different ways (we have used both the streamline passing through the
center of the robot in a harmonic flow field and a path defined by a
cubic spline). A controller's reference point is the closest point
to the controller's control point along the path function.
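As an illustration of the geometry described above, the following is a minimal sketch that places a control point at distance X0 ahead of the robot center and finds its reference point on a path. It assumes the path function has been sampled into a piecewise-linear list of 2-D points and that the control point lies along the robot's heading; the function names `control_point` and `reference_point` are hypothetical and are not taken from the uBot software.

```python
import math

def control_point(x, y, heading, x0):
    """Control point at distance x0 in front of the robot center (assumed along the heading)."""
    return (x + x0 * math.cos(heading), y + x0 * math.sin(heading))

def reference_point(path, p):
    """Closest point to p on a piecewise-linear path, with the unit tangent there.

    path is a list of (x, y) samples of the path function (an assumption;
    the source uses a flow-field streamline or a cubic spline).
    """
    best = None
    for (ax, ay), (bx, by) in zip(path, path[1:]):
        dx, dy = bx - ax, by - ay
        seg_len2 = dx * dx + dy * dy
        if seg_len2 == 0.0:
            continue  # skip degenerate segments
        # Project p onto the segment and clamp to its endpoints.
        t = ((p[0] - ax) * dx + (p[1] - ay) * dy) / seg_len2
        t = max(0.0, min(1.0, t))
        qx, qy = ax + t * dx, ay + t * dy
        d2 = (p[0] - qx) ** 2 + (p[1] - qy) ** 2
        if best is None or d2 < best[0]:
            seg_len = math.sqrt(seg_len2)
            best = (d2, (qx, qy), (dx / seg_len, dy / seg_len))
    if best is None:
        raise ValueError("path needs at least one non-degenerate segment")
    return best[1], best[2]

# Example: an L-shaped path and a robot heading along +x.
path = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
cp = control_point(0.5, 0.1, 0.0, 0.3)
ref, tangent = reference_point(path, cp)
```

A larger X0 moves the control point further ahead of the robot, so its reference point also moves further along the path.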
A controller is
defined by the parameter vector [K, B, X0, Vref]. The K and B parameters are gains
applied to the position and velocity errors when determining the force
exerted on the control point by the controller. The position
error is the distance between a controller's reference point and
control point. The velocity error is the difference between the
control point's current velocity and a velocity of magnitude Vref tangent to the path
function at the reference point. The parameter X0 (the distance from
the center of the robot to the control point) dominates the bandpass
character of the impedance controller, with cutoff frequency
proportional to 1/X0. Furthermore,
the robot looks ahead in time along the path function an amount
proportional to X0. This implies that
a controller with a large X0 will cut corners off
the path and one with a small X0 will track more
precisely. Both properties can be useful in the context of
different objective functions.
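The force computation described above can be sketched as a spring-damper law on the two errors. This is only an illustrative sketch: the vector form of the errors, the sign conventions (errors point from the control point toward the reference), and the names `ImpedanceController`, `tracking_errors`, and `control_force` are assumptions, not the uBot implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ImpedanceController:
    K: float     # gain on position error
    B: float     # gain on velocity error
    X0: float    # distance from robot center to control point
    Vref: float  # reference speed along the path tangent

def tracking_errors(ctrl, control_pt, control_vel, ref_pt, tangent):
    """Position and velocity errors as 2-D vectors (sign convention assumed)."""
    pos_err = (ref_pt[0] - control_pt[0], ref_pt[1] - control_pt[1])
    vel_err = (ctrl.Vref * tangent[0] - control_vel[0],
               ctrl.Vref * tangent[1] - control_vel[1])
    return pos_err, vel_err

def control_force(ctrl, pos_err, vel_err):
    """Force applied at the control point: K * position error + B * velocity error."""
    return (ctrl.K * pos_err[0] + ctrl.B * vel_err[0],
            ctrl.K * pos_err[1] + ctrl.B * vel_err[1])

# Example with illustrative parameter values.
ctrl = ImpedanceController(K=2.0, B=0.5, X0=0.3, Vref=1.0)
pos_err, vel_err = tracking_errors(ctrl, (0.0, 0.0), (0.5, 0.0), (0.0, 1.0), (1.0, 0.0))
force = control_force(ctrl, pos_err, vel_err)
```

The same error computation can be run for every controller in the set, not only the one driving the robot, which is what allows the inactive controllers to act as observers of the path function.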
Below is an example Bode plot of nine controllers. The Bode plots
of the amplitude and phase response of the impedance controllers
(determined in simulation) indicate that the controllers are, in fact,
linearly independent in the frequency domain (although it may be
possible to use fewer controllers without significant loss of
expressive power).
I-Code State Representation
The equilibrium status of a candidate impedance controller is assessed in a phase
portrait. A particular controller is near equilibrium when it is in the
neighborhood of the origin. The neighborhood we have employed is an envelope
defined by the weighted squared error in position and velocity, where the weights are the
impedance parameters, K and B, respectively.
Controllers inside of this envelope assert a status bit, and those that
are not inside this envelope do not assert this bit. For n controllers, this
leads to a vector of n bits that could
assume any one of 2^n possible values.
We determine the state of our robot by concatenating this bit
vector with the index of the active controller. Thus the size of
the state space for learning policies is n*2^n.
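The construction of the state from the status bits can be sketched as follows. The sketch assumes scalar position and velocity error magnitudes per controller and a single envelope threshold (the value of `threshold` is hypothetical; the source does not give one); the function names are illustrative only.

```python
def near_equilibrium(K, B, pos_err, vel_err, threshold):
    """True when the K/B-weighted squared error lies inside the envelope."""
    return K * pos_err ** 2 + B * vel_err ** 2 <= threshold

def i_code(gains, errors, threshold):
    """One status bit per controller; gains is [(K, B), ...], errors is [(e, e_dot), ...]."""
    return tuple(int(near_equilibrium(K, B, e, e_dot, threshold))
                 for (K, B), (e, e_dot) in zip(gains, errors))

def state_index(active, bits):
    """Flat index in [0, n * 2**n): active controller id concatenated with the i-code."""
    code = 0
    for b in bits:
        code = (code << 1) | b
    return active * (1 << len(bits)) + code

# Example with n = 3 controllers, so the state space has 3 * 2**3 = 24 entries.
gains = [(1.0, 1.0), (4.0, 1.0), (1.0, 4.0)]
errors = [(0.1, 0.1), (0.5, 0.0), (0.0, 0.1)]
bits = i_code(gains, errors, threshold=0.1)
state = state_index(2, bits)
```

In this example the second controller falls outside its envelope, so its bit is not asserted, and the resulting index can be used directly as a row of a tabular value function.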
Experimental Results
A standard Q-learning backup with a tabular value function was used to
acquire policies using value iteration. A stepwise reward reflecting
time and energy expended was employed. In each case, the policy was
acquired on the blue (training) racetracks illustrated and then
evaluated independently on the red (evaluation) racetrack.
During evaluation, the policy acquired on the training tracks was held
fixed and applied to the test track without prior exposure. We,
therefore, evaluated the ability of the policy based on the i-code
representation to generalize to novel path functions. The plots with
red and blue learning curves compare the policy acquired from the
training tracks (blue) with the policy acquired by training on the test
track (red). These results suggest that policies based on the
i-code representation can generalize well to novel situations. It
is interesting to note that the policy trained on the training tracks
performs better on the evaluation track than the policy trained on the
evaluation track. We believe this is because the path functions of
the training tracks expose the agent to a wider variety of experience
resulting in a better general policy.
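For reference, a standard tabular Q-learning backup over this state space can be sketched as below. The learning rate `alpha`, discount `gamma`, and reward value are illustrative assumptions (the source does not report them), and actions are assumed to be the choice of the next active controller.

```python
def make_q_table(n):
    """Tabular value function: n * 2**n states by n actions (one per controller)."""
    return [[0.0] * n for _ in range(n * 2 ** n)]

def q_backup(Q, s, a, reward, s_next, alpha=0.1, gamma=0.95):
    """Standard Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = reward + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]

def greedy_action(Q, s):
    """Index of the controller with the highest value in state s."""
    return max(range(len(Q[s])), key=lambda a: Q[s][a])

# Example: one backup with a stepwise cost expressed as a negative reward.
Q = make_q_table(2)
updated = q_backup(Q, s=0, a=1, reward=-1.0, s_next=3, alpha=0.5, gamma=0.9)
best = greedy_action(Q, 0)
```

During evaluation on a held-out track, only `greedy_action` would be called and the table would be left unchanged, matching the fixed-policy evaluation described above.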
Min-Energy Learning Curve
Min-Time Learning Curve
Evaluation of Min-Energy Policy During Learning (on Evaluation Track)
Evaluation of Min-Time Policy During Learning (on Evaluation Track)