Admittedly, the notion of an optimizer is somewhat fuzzy, and so is the `Optimizer` class.
We should attempt to clarify the definitions we are going to use in the library.
### Definitions (to be added in a wiki)

- `Trainer`: manages the optimization procedure of a model on a particular dataset.
- `Optimizer`: optimizes a certain objective function (e.g. a loss) by updating some parameters (the ones used in the computation of the objective function, i.e. the parameters of the model).
- `UpdateRule`: something that modifies a direction (often the gradient) used to update some parameters.
- `BatchScheduler`: manages the batches (number of examples, order of the examples) to give to the learn function.
- `Loss`: is responsible for building the Theano graph corresponding to the loss function to be optimized by the `Optimizer`. It takes a `Model` and a `Dataset` as inputs.
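To make the division of responsibilities concrete, here is a minimal interface sketch of how these pieces could fit together; the method names and signatures are assumptions for illustration, not the library's actual API.

```python
# Minimal interface sketch -- method names and signatures are hypothetical.

class Loss:
    """Builds the Theano graph of the objective from a model and a dataset."""
    def __init__(self, model, dataset):
        self.model = model
        self.dataset = dataset

    def get_graph(self):
        # Would return a symbolic Theano expression for the loss.
        raise NotImplementedError


class UpdateRule:
    """Modifies a direction (often the gradient) used to update parameters."""
    def apply(self, directions):
        # `directions` maps each parameter to its symbolic update direction.
        raise NotImplementedError


class BatchScheduler:
    """Decides how many examples go in each batch and in which order."""
    def __iter__(self):
        # Would yield the batch indices (or slices) for one epoch.
        raise NotImplementedError

    @property
    def nb_updates_per_epoch(self):
        raise NotImplementedError


class Optimizer:
    """Optimizes an objective (a Loss) by updating the model's parameters."""
    def __init__(self, loss, update_rules=()):
        self.loss = loss
        self.update_rules = list(update_rules)

    def gather_updates(self):
        # Would return the Theano updates used to compile the learn function.
        raise NotImplementedError


class Trainer:
    """Manages the optimization procedure of a model on a particular dataset."""
    def __init__(self, optimizer, batch_scheduler):
        self.optimizer = optimizer
        self.batch_scheduler = batch_scheduler

    def train(self, nb_epochs):
        raise NotImplementedError
```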
### Some questions

- What should an optimizer take as inputs? The loss function to optimize, now a `Loss` class.
- What are the different kinds of optimizers? (bold: already available in the library)
  - Zeroth order (needs only the function value) (not used in practice)
  - First order (needs only the gradient):
    - GD, SGD, Adam, AdaDelta, AdaGrad, NAG, SVRG, SDCA, SAG, SAGVR, ...
  - Quasi-Newton (needs only the gradient, builds a Hessian approximation):
    - L-BFGS, ...
  - Second order (needs the gradient and the Hessian, or a Hessian-vector product):
    - Newton, Newton trust region, Hessian-free, ARC, ...
- Should an optimizer be agnostic to the notion of batch, batch size, batch ordering, etc.? Yes, we created a `BatchScheduler` for that.
- How do we call ADAGRAD, Adam, AdaDelta, etc.? Right now those are called `UpdateRule`.
- Should we trivially allow multiple `UpdateRule`s, or create a special `UpdateRule` that combines them as the user wants? Right now, we blindly apply them one after the other (see the sketch after this list).
- Is SGD really something in our framework? Yes, otherwise we would need a `SMART-optim` module.
- Is L-BFGS simply what we call an update rule? No. It requires the current and past parameters, as well as the past gradients.
- Can using the Hessian (e.g. in Newton's method) be seen as an update rule? No: using exact second-order information should be done in a dedicated subclass of `Optimizer`, which would then call the necessary method of the model (e.g. the Hessian, or the Rop for the Hessian-vector product).
- Should `Optimizer` be the one computing `nb_updates_per_epoch`? No, a `BatchScheduler` should do it.
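As a sketch of two of the answers above: chaining several `UpdateRule`s by blindly applying them one after the other, and letting a `BatchScheduler` (not the `Optimizer`) compute `nb_updates_per_epoch`. The class and method names here are hypothetical, only meant to illustrate the idea.

```python
class ScaleDirection:
    """Hypothetical UpdateRule that rescales every direction by a constant
    (this is how a learning rate could be expressed as an UpdateRule)."""
    def __init__(self, factor):
        self.factor = factor

    def apply(self, directions):
        return {param: self.factor * direction
                for param, direction in directions.items()}


class ChainedUpdateRule:
    """Hypothetical composite rule: applies several UpdateRules in sequence,
    i.e. the "blindly applied one after the other" behaviour described above."""
    def __init__(self, rules):
        self.rules = rules

    def apply(self, directions):
        for rule in self.rules:
            directions = rule.apply(directions)
        return directions


class MiniBatchScheduler:
    """Hypothetical BatchScheduler: nb_updates_per_epoch lives here,
    not in the Optimizer."""
    def __init__(self, nb_examples, batch_size, shuffle=True):
        self.nb_examples = nb_examples
        self.batch_size = batch_size
        self.shuffle = shuffle

    @property
    def nb_updates_per_epoch(self):
        # Ceiling division: number of mini-batches needed to cover the dataset.
        return (self.nb_examples + self.batch_size - 1) // self.batch_size
```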
### Suggestions

- We could define a `Loss` class that will be provided to the optimizer. This class could know about the model and the dataset, and provide the necessary symbolic variables (maybe it should build the `givens` for the Theano function).
- Currently, all calls to `update_rules.apply` in `SGD` should be moved inside `Optimizer`. The same goes for calls to `param_modifier.apply`.
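A sketch of what the second suggestion could look like: a first-order `Optimizer` base class that applies the `UpdateRule`s and `ParamModifier`s itself, so a subclass like `SGD` would no longer call them directly. The attributes and signatures used here (`loss.get_graph()`, `model.parameters`, `modifier.apply(...)`) are assumptions for illustration, not the library's actual API.

```python
import theano.tensor as T

class Optimizer:
    """Hypothetical base Optimizer: applying UpdateRules and ParamModifiers
    happens here, so subclasses such as SGD do not have to."""
    def __init__(self, loss, update_rules=(), param_modifiers=()):
        self.loss = loss
        self.update_rules = list(update_rules)
        self.param_modifiers = list(param_modifiers)

    def _get_directions(self, objective, params):
        # Default: first-order directions, i.e. the gradients. A second-order
        # subclass would override this and ask the model for its Hessian
        # (or a Hessian-vector product via the R-operator) instead.
        grads = T.grad(objective, params)
        return dict(zip(params, grads))

    def gather_updates(self):
        objective = self.loss.get_graph()     # symbolic loss (assumed API)
        params = self.loss.model.parameters   # shared variables (assumed API)
        directions = self._get_directions(objective, params)

        # Apply every UpdateRule in turn (previously done inside SGD).
        # The step size can itself be an UpdateRule that rescales the direction.
        for rule in self.update_rules:
            directions = rule.apply(directions)

        updates = {param: param - direction
                   for param, direction in directions.items()}

        # Apply ParamModifiers to the new parameter values (e.g. constraints).
        for modifier in self.param_modifiers:
            updates = modifier.apply(updates)

        return updates
```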