Documentation ¶
Overview ¶
Package optimizers implements a collection of ML optimizers that can be used by train.Trainer or on their own. They all implement optimizers.Interface.
Index ¶
- Constants
- Variables
- func IncrementGlobalStepGraph(ctx *context.Context, g *Graph, dtype shapes.DType) *Node
- func LearningRateVar(ctx *context.Context, dtype shapes.DType, defaultValue float64) *context.Variable
- func LearningRateVarWithValue(ctx *context.Context, dtype shapes.DType, value float64) *context.Variable
- type AdamConfig
- func (c *AdamConfig) Adamax() *AdamConfig
- func (c *AdamConfig) Betas(beta1, beta2 float64) *AdamConfig
- func (c *AdamConfig) Done() Interface
- func (c *AdamConfig) Epsilon(epsilon float64) *AdamConfig
- func (c *AdamConfig) LearningRate(value float64) *AdamConfig
- func (c *AdamConfig) Scope(name string) *AdamConfig
- func (c *AdamConfig) WeightDecay(weightDecay float64) *AdamConfig
- type CosineAnnealingOptions
- func (opt *CosineAnnealingOptions) Done()
- func (opt *CosineAnnealingOptions) LearningRate(learningRate float64) *CosineAnnealingOptions
- func (opt *CosineAnnealingOptions) MinLearningRate(minLearningRate float64) *CosineAnnealingOptions
- func (opt *CosineAnnealingOptions) PeriodInSteps(periodSteps int) *CosineAnnealingOptions
- type Interface
Constants ¶
const (
	// AdamDefaultLearningRate is used by Adam if no learning rate is set.
	AdamDefaultLearningRate = 0.001

	// AdamDefaultScope is the default scope name for moments and step used by Adam.
	AdamDefaultScope = "AdamOptimizer"
)
const GlobalStepVariableName = "global_step"
GlobalStepVariableName is the name under which the global step counter is stored in context.Context -- usually in the root scope, but that depends on the caller.
const LearningRateKey = "learning_rate"
LearningRateKey is the string key for learning rate in Context.Params.
const SgdDefaultLearningRate = 0.1
SgdDefaultLearningRate is the default learning rate used by the StochasticGradientDescent optimizer.
Variables ¶
var (
	// KnownOptimizers is a map of known optimizers by name to their default constructors.
	// This provides an easy quick start point. One can hyperparameter-tune the optimizers
	// for usually slightly better results.
	KnownOptimizers = map[string]func() Interface{
		"sgd":    StochasticGradientDescent,
		"adam":   func() Interface { return Adam().Done() },
		"adamax": func() Interface { return Adam().Adamax().Done() },
		"adamw":  func() Interface { return Adam().WeightDecay(0.004).Done() },
	}
)
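KnownOptimizers follows a common Go pattern: a registry of names mapped to zero-argument constructors, so every lookup yields a fresh instance. A rough, self-contained sketch of the same pattern (the Optimizer interface, sgd type, and byName helper below are illustrative stand-ins, not part of this package):

```go
package main

import "fmt"

// Optimizer is a stand-in for this package's optimizers.Interface;
// the real interface has an UpdateGraph method instead of Name.
type Optimizer interface{ Name() string }

type sgd struct{}

func (sgd) Name() string { return "sgd" }

// registry mirrors the KnownOptimizers pattern: names map to
// zero-argument constructors, so each lookup builds a new optimizer.
var registry = map[string]func() Optimizer{
	"sgd": func() Optimizer { return sgd{} },
}

// byName is a hypothetical helper that, unlike MustOptimizerByName,
// returns an error instead of terminating the program.
func byName(name string) (Optimizer, error) {
	ctor, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown optimizer %q", name)
	}
	return ctor(), nil
}

func main() {
	opt, err := byName("sgd")
	fmt.Println(opt.Name(), err == nil)
}
```

Using constructors rather than shared instances matters because optimizers hold per-training state (e.g., Adam's moments), which must not be shared across trainers.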
Functions ¶
func IncrementGlobalStepGraph ¶
func IncrementGlobalStepGraph(ctx *context.Context, g *Graph, dtype shapes.DType) *Node
IncrementGlobalStepGraph creates a global step counter (if one does not exist yet) and returns its incremented value -- on the first call the returned value will be 1.
It only builds the computation graph, no actual values are generated.
Typically, this is called by the optimizer's UpdateGraph method.
func LearningRateVar ¶
func LearningRateVar(ctx *context.Context, dtype shapes.DType, defaultValue float64) *context.Variable
LearningRateVar returns the learning rate variable -- a scalar value of the given dtype.
If the variable doesn't exist yet, it is created using the LearningRateKey parameter if that is set, or the provided defaultValue (which must be a scalar convertible to dtype) otherwise.
Types ¶
type AdamConfig ¶
type AdamConfig struct {
// contains filtered or unexported fields
}
AdamConfig holds the configuration for an Adam optimizer. Create it using Adam(), and once configured, call Done to build an optimizers.Interface that implements Adam.
func Adam ¶
func Adam() *AdamConfig
Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments. According to [Kingma et al., 2014](http://arxiv.org/abs/1412.6980), the method is "*computationally efficient, has little memory requirement, invariant to diagonal rescaling of gradients, and is well suited for problems that are large in terms of data/parameters*".
It returns a configuration object that can be used to set its parameters. Once configured, call Done, and it will return an optimizers.Interface.
func (*AdamConfig) Adamax ¶
func (c *AdamConfig) Adamax() *AdamConfig
Adamax configures Adam to use the L-infinity norm (i.e., max, which gives the method its name) for the second moment, instead of the L2 norm, as described in the same Adam paper.
func (*AdamConfig) Betas ¶
func (c *AdamConfig) Betas(beta1, beta2 float64) *AdamConfig
Betas sets the two moving averages constants (exponential decays). They default to 0.9 and 0.999.
func (*AdamConfig) Done ¶
func (c *AdamConfig) Done() Interface
Done finishes the configuration and constructs an optimizers.Interface that implements Adam to specification.
func (*AdamConfig) Epsilon ¶
func (c *AdamConfig) Epsilon(epsilon float64) *AdamConfig
Epsilon sets the small constant added to the denominator for numerical stability.
func (*AdamConfig) LearningRate ¶
func (c *AdamConfig) LearningRate(value float64) *AdamConfig
LearningRate sets the base learning rate as a floating point value -- eventually converted to the same dtype as the loss.
Default is either the value of LearningRateKey ("learning_rate") global parameter in Context if defined, or 0.001 if not.
func (*AdamConfig) Scope ¶
func (c *AdamConfig) Scope(name string) *AdamConfig
Scope defines the top-level scope used to store the 1st and 2nd order moments of the gradients and the step number used by the Adam optimizer. Generally this doesn't need to be changed, but if one is using multiple schedules, potentially with different loss functions (so the moments should be kept separate), one can change it.
It defaults to AdamDefaultScope.
func (*AdamConfig) WeightDecay ¶
func (c *AdamConfig) WeightDecay(weightDecay float64) *AdamConfig
WeightDecay configures the optimizer to work as AdamW, applying the given static weight decay. This exists because plain L2 regularization doesn't work well with Adam. TODO: (1) allow certain variables (e.g., biases) to be excluded from weight decay; (2) allow dynamically calculated weight decay.
type CosineAnnealingOptions ¶
type CosineAnnealingOptions struct {
// contains filtered or unexported fields
}
CosineAnnealingOptions is returned by CosineAnnealingSchedule to configure the cosine annealing schedule strategy. When finished configuring, call `Done`.
func CosineAnnealingSchedule ¶
func CosineAnnealingSchedule(ctx *context.Context, graph *Graph, dtype shapes.DType) *CosineAnnealingOptions
CosineAnnealingSchedule allows one to set up a cosine annealing schedule for the learning rate. See details https://paperswithcode.com/method/cosine-annealing.
It returns a CosineAnnealingOptions that can be configured. When finished configuring, call `Done`, and it will generate the computation graph that updates the learning rate at every training step.
Example with only one cycle (assuming `*flagNumSteps` is the number of training steps):
```
func modelGraph(ctx *context.Context, inputs []*Node) *Node {
	graph := inputs[0].Graph()
	if *flagUseCosineSchedule {
		optimizers.CosineAnnealingSchedule(ctx, graph, types.Float32).PeriodInSteps(*flagNumSteps).Done()
	}
	// ... build and return the model's output node ...
}
```
func (*CosineAnnealingOptions) Done ¶
func (opt *CosineAnnealingOptions) Done()
Done finalizes the configuration of CosineAnnealingSchedule and generates the computation graph code to implement it.
If invalid options are given, an error is raised in the Graph.
func (*CosineAnnealingOptions) LearningRate ¶
func (opt *CosineAnnealingOptions) LearningRate(learningRate float64) *CosineAnnealingOptions
LearningRate sets the learning rate at the start of the cosine cycle. If not given, it will try to read it from the context params (keyed by LearningRateKey). If neither is set, it will fail and report an error in the context and graph.
func (*CosineAnnealingOptions) MinLearningRate ¶
func (opt *CosineAnnealingOptions) MinLearningRate(minLearningRate float64) *CosineAnnealingOptions
MinLearningRate at the end of the cosine cycle. Defaults to 10^-3 * initial learning rate.
func (*CosineAnnealingOptions) PeriodInSteps ¶
func (opt *CosineAnnealingOptions) PeriodInSteps(periodSteps int) *CosineAnnealingOptions
PeriodInSteps sets the number of steps for one period of the cosine schedule. The effective learning rate decreases over the given period of training steps, and then is restarted at each new period.
It's common to use only one period (so no annealing, just a cosine schedule), in which case just set to the number of steps that will be used for training.
There is no default: this value must be given, or an error will be issued in the graph and context.
type Interface ¶
type Interface interface {
// UpdateGraph is called during computation graph building; it calculates and
// applies the updates to the variables (weights) of the model needed for one
// training step.
//
// Variable values can be updated in graph building time (inside UpdateGraph) using Variable.SetValueGraph,
// and the trainer (train.Trainer) will make sure these values are returned from the graph execution
// and the materialized values used to update the variables (Variable.SetValue).
//
// ctx holds the variables to train (marked as trainable), the hyperparameters
// used by the optimizer (in `ctx.Params`) and non-trainable variables
// that the optimizer itself may create. One should scope it (context.Context.In("<some scope name>"))
// to avoid naming conflicts on the variables created -- notice that
// some complex training scheduling scheme may have more than one optimizer
// on the same Context object.
//
// loss must be a scalar value.
UpdateGraph(ctx *context.Context, graph *Graph, loss *Node)
}
Interface is implemented by all optimizers in this package.
func MustOptimizerByName ¶
MustOptimizerByName returns the optimizer registered under the given name, or calls log.Fatal if it does not exist. It looks up KnownOptimizers -- use that map directly if you want to handle invalid values more gracefully.
Example usage:
```
var flagOptimizer = flag.String("optimizer", "adamw",
	fmt.Sprintf("Optimizer, options: %q", types.SortedKeys(optimizers.KnownOptimizers)))
...
trainer := train.NewTrainer(manager, ctx, ModelGraph,
	losses.SomeLoss,
	optimizers.MustOptimizerByName(*flagOptimizer),
	[]metrics.Interface{someMetric},  // trainMetrics
	[]metrics.Interface{otherMetric}) // evalMetrics
```
func StochasticGradientDescent ¶
func StochasticGradientDescent() Interface
StochasticGradientDescent creates an optimizer that performs SGD. It looks for "learning_rate" in Context.Params for the initial learning rate, otherwise it defaults to SgdDefaultLearningRate.
It decays the learning rate according to: `learning_rate = initial_learning_rate / Sqrt(global_step)`