Interaction Based State Representation - uBot Impedance Control Basis

Learning Optimal Switching Policies for Path Tracking Tasks on a Mobile Robot

Introduction

The uBot platform is being used to carry out reinforcement learning experiments with an interaction-based state representation. To accomplish this, we use a set of impedance controllers (described below), each of which provides a qualitatively different tracking behavior. The same set of impedance controllers is also used to observe the driving (path) function. When a particular impedance controller is driving the uBot, it produces a sequence of position and velocity errors on the other controllers that depends on their respective impedance parameters and on the driving (path) function. If we assess the equilibrium status (see below) of each discrete impedance controller under the actions of controller i, the result is a coarse spectral decomposition of the path function. We represent this decomposition as a bit vector called an impedance code, or "i-code". A spectral decomposition of the path function should be useful for selecting the best next impedance controller in the context of a control objective. Our hypothesis is that a state representation consisting of the active controller id concatenated with the i-code is a sufficient state estimate on which to base optimal control decisions.

Impedance Controller Design

A control model is employed that attaches a "control yoke" to the front of the uBot. The control point for a given controller is located on the control yoke at distance X0 from the center of the uBot. This is conceptually the point at which that controller applies forces to the robot. This effectively solves the nonholonomic motion planning problem, since the robot will stably track behind the control point. The path function may be defined in a variety of ways (we have used both the streamline passing through the center of the robot in a harmonic flow field and a path defined by a cubic spline). A controller's reference point is the point on the path function closest to that controller's control point.
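The geometry above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the path is approximated by a polyline of sampled vertices (the experiments used a streamline or a cubic spline), the robot pose is planar (x, y, theta), and the function names are hypothetical.

```python
import math

def control_point(x, y, theta, x0):
    """Point on the control yoke at distance x0 ahead of the robot center."""
    return (x + x0 * math.cos(theta), y + x0 * math.sin(theta))

def closest_point_on_segment(p, a, b):
    """Closest point to p on the line segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return a  # degenerate segment
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)

def reference_point(p, path):
    """Closest point to p on a polyline path (assumed sampled approximation)."""
    best, best_dist = None, math.inf
    for a, b in zip(path, path[1:]):
        q = closest_point_on_segment(p, a, b)
        dist = math.hypot(q[0] - p[0], q[1] - p[1])
        if dist < best_dist:
            best, best_dist = q, dist
    return best
```

Because each controller has its own X0, each one has its own control point and therefore its own reference point on the same path.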




 

A controller is defined by the parameter vector [K, B, X0, Vref]. The K and B parameters are gains applied to the position and velocity errors, respectively, when determining the force exerted on the control point by the controller. The position error is the distance between a controller's reference point and its control point. The velocity error is the difference between the control point's current velocity and a velocity of magnitude Vref tangent to the path function at the reference point. The parameter X0 (the distance from the center of the robot to the control point) dominates the bandpass character of the impedance controller, with a cutoff frequency proportional to 1/X0. Furthermore, the robot looks ahead along the path function by an amount proportional to X0. This implies that a controller with a large X0 will cut corners off the path, while one with a small X0 will track more precisely. Both properties can be useful in the context of different objective functions.
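As a minimal sketch of how the parameter vector might be used, the following Python assumes a linear spring-damper force law, F = K * (position error) + B * (velocity error), with errors taken as planar vectors. The text above only states that K and B are gains on the two errors, so the exact law, the sign conventions, and the class name are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ImpedanceController:
    K: float     # gain on position error
    B: float     # gain on velocity error
    X0: float    # distance from robot center to control point
    Vref: float  # reference speed along the path tangent

    def force(self, ref_pt, tangent, ctrl_pt, ctrl_vel):
        """Force on the control point (assumed spring-damper form).

        ref_pt   -- reference point on the path
        tangent  -- unit tangent to the path at ref_pt
        ctrl_pt  -- current control point position
        ctrl_vel -- current control point velocity
        """
        pos_err = (ref_pt[0] - ctrl_pt[0], ref_pt[1] - ctrl_pt[1])
        vel_err = (self.Vref * tangent[0] - ctrl_vel[0],
                   self.Vref * tangent[1] - ctrl_vel[1])
        return (self.K * pos_err[0] + self.B * vel_err[0],
                self.K * pos_err[1] + self.B * vel_err[1])
```

For example, with K=2.0, B=1.0, Vref=0.5, a stationary control point at the origin, and a reference point one unit ahead along the tangent (1, 0), the sketch yields a force of (2.5, 0.0).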

Below is an example Bode plot of nine controllers.  The Bode plots of the amplitude and phase response of the impedance controllers (determined in simulation) indicate that the controllers are, in fact, linearly independent in the frequency domain (although it may be possible to use fewer controllers without significant loss of expressive power).



 

I-Code State Representation


The equilibrium status of a candidate impedance controller is assessed in a phase portrait. A particular controller is near equilibrium when it is in the neighborhood of the origin. The neighborhood we have employed is defined by the weighted squared error in position and velocity, with the impedance parameters K and B as the respective weights. Controllers inside this envelope assert a status bit; those outside it do not. For n controllers, this leads to a vector of n bits that can assume any one of 2^n possible values. We determine the state of our robot by concatenating this bit vector with the index of the active controller. Thus the size of the state space for learning policies is n*2^n.
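The encoding described above can be sketched as follows. The envelope size (`threshold`) and the bit ordering are assumptions made for illustration; the text specifies only that the test is a weighted squared error with weights K and B.

```python
def equilibrium_bit(K, B, pos_err, vel_err, threshold):
    """1 if the controller is inside the equilibrium envelope, else 0.

    The envelope is the weighted squared error K*e^2 + B*edot^2;
    `threshold` is an assumed envelope size, not a value from the text.
    """
    return 1 if K * pos_err ** 2 + B * vel_err ** 2 < threshold else 0

def i_code(gains, errors, threshold):
    """Pack one status bit per controller into an integer (bit j = controller j)."""
    code = 0
    for j, ((K, B), (pos_err, vel_err)) in enumerate(zip(gains, errors)):
        code |= equilibrium_bit(K, B, pos_err, vel_err, threshold) << j
    return code

def state_index(active, code, n):
    """Concatenate the active controller index with the n-bit i-code."""
    return active * (1 << n) + code

def state_space_size(n):
    """Number of distinct states: n active controllers times 2^n i-codes."""
    return n * (1 << n)
```

With the nine controllers shown in the Bode plot, this gives 9 * 2^9 = 4608 states, small enough for a tabular value function.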



Experimental Results

A standard Q-learning backup with a tabular value function was used to acquire policies using value iteration. A stepwise reward reflecting time and energy expended was employed. In each case, the policy was acquired on the blue (training) racetracks illustrated and then evaluated independently on the red (evaluation) racetrack.
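For reference, a tabular Q-learning backup of the kind mentioned above can be sketched as below. The learning rate, discount factor, and epsilon-greedy exploration are assumptions (the text does not state them). States are taken to be the active-controller/i-code indices, and actions are the choice of next controller.

```python
import random

def make_q_table(n_states, n_actions):
    """Tabular value function initialized to zero."""
    return [[0.0] * n_actions for _ in range(n_states)]

def q_backup(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Standard Q-learning backup; alpha and gamma are assumed values.

    r is the stepwise reward (e.g. a negative cost for time or energy).
    """
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]

def epsilon_greedy(Q, s, epsilon, rng=random):
    """Pick a random controller with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q[s]))
    row = Q[s]
    return max(range(len(row)), key=row.__getitem__)
```

During evaluation the table would be held fixed and actions chosen greedily, which corresponds to calling `epsilon_greedy` with `epsilon=0.0`.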


During evaluation, the policy acquired on the training tracks was held fixed and applied to the test track without prior exposure. We therefore evaluated the ability of the policy based on the i-code representation to generalize to novel path functions. The plots with red and blue learning curves compare the policy acquired from the training tracks (blue) with the policy acquired by training on the test track (red). These results suggest that policies based on the i-code representation can generalize well to novel situations. It is interesting to note that the policy trained on the training tracks performs better on the evaluation track than the policy trained on the evaluation track itself. We believe this is because the path functions of the training tracks expose the agent to a wider variety of experience, resulting in a better general policy.

Min-Energy Learning Curve


Min-Time Learning Curve


Evaluation of Min-Energy Policy During Learning (on Evaluation Track)


Evaluation of Min-Time Policy During Learning (on Evaluation Track)