ppo

package
v0.0.0-...-225e849
Published: Oct 22, 2020 License: Apache-2.0 Imports: 14 Imported by: 1

README

Proximal Policy Optimization

In Progress ⚠️ blocked on https://github.com/gorgonia/gorgonia/issues/373

Implementation of the Proximal Policy Optimization algorithm.

How it works

PPO is an on-policy method that aims to solve the step-size problem of policy gradient methods. Policy gradient algorithms are typically very sensitive to step size: too large a step and the agent can fall into an unrecoverable state; too small a step and training takes a very long time. PPO addresses this by ensuring that the agent's policy never deviates too far from the previous policy.

The probability ratio between the new policy and the old policy is clipped so that each policy update remains within a bound:

    r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)

    L^CLIP(θ) = E_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t ) ]
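As a rough illustration of the clipping step, the per-timestep surrogate term can be sketched in plain Go. This is a standalone sketch with hypothetical names (clippedSurrogate, r, adv, eps); the package itself builds the equivalent expression as a gorgonia graph inside its Loss type.

package main

import (
	"fmt"
	"math"
)

// clippedSurrogate computes min(r*A, clip(r, 1-eps, 1+eps)*A) for a single
// timestep, where r is the probability ratio and A the advantage estimate.
func clippedSurrogate(r, adv, eps float64) float64 {
	clipped := math.Min(math.Max(r, 1-eps), 1+eps)
	return math.Min(r*adv, clipped*adv)
}

func main() {
	// A ratio far above 1 is clipped to 1+eps, so the update stays bounded.
	fmt.Println(clippedSurrogate(1.8, 1.0, 0.2)) // 1.2
}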

Examples

See the experiments folder for example implementations.

Roadmap

References

Documentation

Overview

Package ppo is an agent implementation of the Proximal Policy Optimization algorithm.

Index

Constants

This section is empty.

Variables

var DefaultActorConfig = &ModelConfig{
	Optimizer:    g.NewAdamSolver(),
	LayerBuilder: DefaultActorLayerBuilder,
	BatchSize:    20,
}

DefaultActorConfig is the default configuration for the actor (policy) model.

var DefaultActorLayerBuilder = func(env *envv1.Env) []layer.Config {
	return []layer.Config{
		layer.FC{Input: env.ObservationSpaceShape()[0], Output: 24},
		layer.FC{Input: 24, Output: 24},
		layer.FC{Input: 24, Output: envv1.PotentialsShape(env.ActionSpace)[0], Activation: layer.Softmax},
	}
}

DefaultActorLayerBuilder is a default fully connected layer builder.

var DefaultAgentConfig = &AgentConfig{
	Hyperparameters: DefaultHyperparameters,
	Base:            agentv1.NewBase("PPO"),
	ActorConfig:     DefaultActorConfig,
	CriticConfig:    DefaultCriticConfig,
}

DefaultAgentConfig is the default config for a PPO agent.

var DefaultCriticConfig = &ModelConfig{
	Loss:         modelv1.MSE,
	Optimizer:    g.NewAdamSolver(),
	LayerBuilder: DefaultCriticLayerBuilder,
	BatchSize:    20,
}

DefaultCriticConfig is the default configuration for the critic model.

var DefaultCriticLayerBuilder = func(env *envv1.Env) []layer.Config {
	return []layer.Config{
		layer.FC{Input: env.ObservationSpaceShape()[0], Output: 24},
		layer.FC{Input: 24, Output: 24},
		layer.FC{Input: 24, Output: 1, Activation: layer.Tanh},
	}
}

DefaultCriticLayerBuilder is a default fully connected layer builder.

var DefaultHyperparameters = &Hyperparameters{
	Gamma:  0.99,
	Lambda: 0.95,
}

DefaultHyperparameters are the default hyperparameters.

Functions

func GAE

func GAE(values, masks, rewards []*t.Dense, gamma, lambda float32) (returns, advantage *t.Dense, err error)

GAE is generalized advantage estimation.
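As an illustrative sketch of the computation (using plain float32 slices rather than the *tensor.Dense values the package operates on; the function and variable names are hypothetical, and the terminal value is assumed to be zero):

package main

import "fmt"

// gae computes advantages and discounted returns backwards over a trajectory.
// masks are 1 while an episode is running and 0 at terminal steps, which cuts
// the bootstrap across episode boundaries.
func gae(values, masks, rewards []float32, gamma, lambda float32) (returns, advantages []float32) {
	n := len(rewards)
	returns = make([]float32, n)
	advantages = make([]float32, n)
	var running float32
	for i := n - 1; i >= 0; i-- {
		var nextValue float32
		if i+1 < n {
			nextValue = values[i+1]
		}
		// TD residual for this step.
		delta := rewards[i] + gamma*nextValue*masks[i] - values[i]
		// Exponentially weighted sum of residuals.
		running = delta + gamma*lambda*masks[i]*running
		advantages[i] = running
		returns[i] = running + values[i]
	}
	return returns, advantages
}

func main() {
	returns, advantages := gae(
		[]float32{0.5, 0.6, 0.7},
		[]float32{1, 1, 0},
		[]float32{1, 1, 1},
		0.99, 0.95,
	)
	fmt.Println(returns, advantages)
}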

func MakeActor

func MakeActor(config *ModelConfig, base *agentv1.Base, env *envv1.Env) (modelv1.Model, error)

MakeActor makes the actor which chooses actions based on the policy.

func MakeCritic

func MakeCritic(config *ModelConfig, base *agentv1.Base, env *envv1.Env) (modelv1.Model, error)

MakeCritic makes the critic, which produces a Q-value estimate based on the outcome of the action taken.

func WithClip

func WithClip(val float64) func(*Loss)

WithClip sets the clipping value. Defaults to 0.2.

func WithCriticDiscount

func WithCriticDiscount(val float32) func(*Loss)

WithCriticDiscount sets the critic discount. Defaults to 0.5.

func WithEntropyBeta

func WithEntropyBeta(val float32) func(*Loss)

WithEntropyBeta sets the entropy beta. Defaults to 0.001.

Types

type Agent

type Agent struct {
	// Base for the agent.
	*agentv1.Base

	// Hyperparameters for the agent.
	*Hyperparameters

	// Actor chooses actions.
	Actor modelv1.Model

	// Critic updates params.
	Critic modelv1.Model

	// Memory of the agent.
	Memory *Memory
	// contains filtered or unexported fields
}

Agent is a PPO agent.

func NewAgent

func NewAgent(c *AgentConfig, env *envv1.Env) (*Agent, error)

NewAgent returns a new PPO agent.
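A minimal construction sketch, assuming env is an *envv1.Env that has already been created (environment setup and import paths are omitted):

// env is assumed to be an already-constructed *envv1.Env.
agent, err := ppo.NewAgent(ppo.DefaultAgentConfig, env)
if err != nil {
	log.Fatal(err)
}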

func (*Agent) Action

func (a *Agent) Action(state *tensor.Dense) (action int, event *Event, err error)

Action selects the best known action for the given state.

func (*Agent) Learn

func (a *Agent) Learn(event *Event) error

Learn the agent.
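A sketch of a single interaction step, assuming agent was built with NewAgent, state is the current observation as a *tensor.Dense, and outcome is the *envv1.Outcome obtained by stepping the environment with the chosen action (the environment calls themselves are not shown):

// Choose an action and record the event that produced it.
action, event, err := agent.Action(state)
if err != nil {
	log.Fatal(err)
}

// Step the environment with `action` to obtain `outcome` (not shown), then
// fold the outcome into the event and learn from it.
event.Apply(outcome)
if err := agent.Learn(event); err != nil {
	log.Fatal(err)
}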

type AgentConfig

type AgentConfig struct {
	// Base for the agent.
	Base *agentv1.Base

	// Hyperparameters for the agent.
	*Hyperparameters

	// ActorConfig is the actor model config.
	ActorConfig *ModelConfig

	// CriticConfig is the critic model config.
	CriticConfig *ModelConfig
}

AgentConfig is the config for a PPO agent.

type BatchedEvents

type BatchedEvents struct {
	States, ActionProbs, ActionOneHots, QValues, Masks, Rewards *tensor.Dense
	Len                                                         int
}

BatchedEvents are the events concatenated into batched tensors.

type Event

type Event struct {
	State, ActionProbs, ActionOneHot, QValue, Mask, Reward *tensor.Dense
}

Event is an event that occurred when interacting with an environment.

func NewEvent

func NewEvent(state, actionProbs, actionOneHot, qValue *tensor.Dense) *Event

NewEvent returns a new event.

func (*Event) Apply

func (e *Event) Apply(outcome *envv1.Outcome)

Apply an outcome to an event.

type Events

type Events struct {
	States, ActionProbs, ActionOneHots, QValues, Masks, Rewards []*tensor.Dense
}

Events are the recorded events as slices of tensors, prior to batching.

func (*Events) Batch

func (e *Events) Batch() (events *BatchedEvents, err error)

Batch the events.

type Hyperparameters

type Hyperparameters struct {
	// Gamma is the discount factor (0≤γ≤1). It determines how much importance to give to future
	// rewards: a value close to 1 captures the long-term return, whereas a value of 0 makes the
	// agent consider only the immediate reward, i.e. act greedily.
	Gamma float32

	// Lambda is the GAE smoothing factor, used to reduce variance and stabilize training.
	Lambda float32
}

Hyperparameters for the PPO agent.
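For example, a config that discounts future rewards more aggressively could be assembled as in the sketch below (the values are arbitrary):

config := &ppo.AgentConfig{
	Base: agentv1.NewBase("PPO"),
	Hyperparameters: &ppo.Hyperparameters{
		Gamma:  0.9,  // discount future rewards more heavily than the default 0.99
		Lambda: 0.95, // GAE smoothing factor
	},
	ActorConfig:  ppo.DefaultActorConfig,
	CriticConfig: ppo.DefaultCriticConfig,
}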

type LayerBuilder

type LayerBuilder func(env *envv1.Env) []layer.Config

LayerBuilder builds layers.

type Loss

type Loss struct {
	// contains filtered or unexported fields
}

Loss is the custom PPO loss. It is designed to ensure that the policy is never updated too aggressively.

func NewLoss

func NewLoss(oldProbs, advantages, rewards, values *modelv1.Input, opts ...LossOpt) *Loss

NewLoss returns a new PPO loss.
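A sketch of constructing the loss with non-default options; oldProbs, advantages, rewards, and values are assumed to be *modelv1.Input values defined elsewhere:

loss := ppo.NewLoss(oldProbs, advantages, rewards, values,
	ppo.WithClip(0.2),
	ppo.WithCriticDiscount(0.5),
	ppo.WithEntropyBeta(0.001),
)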

func (*Loss) CloneTo

func (l *Loss) CloneTo(graph *g.ExprGraph, opts ...modelv1.CloneOpt) modelv1.Loss

CloneTo another graph.

func (*Loss) Compute

func (l *Loss) Compute(yHat, y *g.Node) (loss *g.Node, err error)

Compute the loss.

func (*Loss) Inputs

func (l *Loss) Inputs() modelv1.Inputs

Inputs returns any inputs the loss function utilizes.

type LossOpt

type LossOpt func(*Loss)

LossOpt is an option for PPO loss.

type Memory

type Memory struct {
	// contains filtered or unexported fields
}

Memory for the PPO agent.

func NewMemory

func NewMemory() *Memory

NewMemory returns a new Memory store.

func (*Memory) Len

func (m *Memory) Len() int

Len is the number of events in the memory.

func (*Memory) Pop

func (m *Memory) Pop() (e *Events)

Pop the values out of the memory.

func (*Memory) Remember

func (m *Memory) Remember(event *Event) error

Remember an event.

func (*Memory) Reset

func (m *Memory) Reset()

Reset the memory.
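A sketch of the memory lifecycle; state, actionProbs, actionOneHot, and qValue are assumed to be *tensor.Dense values and outcome an *envv1.Outcome produced elsewhere, and the threshold of 20 simply mirrors the default batch size:

memory := ppo.NewMemory()

// Record one interaction.
event := ppo.NewEvent(state, actionProbs, actionOneHot, qValue)
event.Apply(outcome)
if err := memory.Remember(event); err != nil {
	log.Fatal(err)
}

// Once enough events have accumulated, drain the memory and batch the
// events into tensors for an update.
if memory.Len() >= 20 {
	events := memory.Pop()
	batch, err := events.Batch()
	if err != nil {
		log.Fatal(err)
	}
	_ = batch
}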

type ModelConfig

type ModelConfig struct {
	// Loss function to evaluate network performance.
	Loss modelv1.Loss

	// Optimizer to optimize the weights with regards to the error.
	Optimizer g.Solver

	// LayerBuilder is a builder of layers.
	LayerBuilder LayerBuilder

	// BatchSize of the updates.
	BatchSize int

	// Track is whether to track the model.
	Track bool
}

ModelConfig contains the hyperparameters for a model.
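A custom ModelConfig can supply its own LayerBuilder; the sketch below mirrors the default actor builder but widens the hidden layers (the widths and batch size are arbitrary):

wideActorConfig := &ppo.ModelConfig{
	Optimizer: g.NewAdamSolver(),
	BatchSize: 32,
	LayerBuilder: func(env *envv1.Env) []layer.Config {
		return []layer.Config{
			layer.FC{Input: env.ObservationSpaceShape()[0], Output: 64},
			layer.FC{Input: 64, Output: 64},
			layer.FC{Input: 64, Output: envv1.PotentialsShape(env.ActionSpace)[0], Activation: layer.Softmax},
		}
	},
}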

Directories

Path Synopsis
experiments
