# lamb
**Repository Path**: mirrors_deepmind/lamb
## Basic Information
- **Project Name**: lamb
- **Description**: LAnguage Modelling Benchmarks
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-02-27
- **Last Updated**: 2025-09-14
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# What is this?
LAnguage Modelling Benchmarks is a tool to tune and test TensorFlow language
models. It was used in the following papers (also see [citations](#citations)):
- [On the state of the art of evaluation in neural language models](https://arxiv.org/abs/1707.05589)
See [./experiment/on-the-state/README.md](./experiment/on-the-state/README.md)
for more.
- [Pushing the bounds of dropout](https://arxiv.org/abs/1805.09208)
See
[./experiment/pushing-the-bounds/README.md](./experiment/pushing-the-bounds/README.md)
for more.
- [Mogrifier LSTM](https://arxiv.org/abs/1909.01792)
See [./experiment/mogrifier/README.md](./experiment/mogrifier/README.md) for
more.
# Overview
The default dataset locations are under `~/data/`; see
`lib/config/{ptb,wikitext-2,wikitext-103,enwik8}.sh` for the per-dataset
defaults.
To train a small LSTM on Penn Treebank, run this script:

    experiment/train_ptb_10m_lstm_d1.sh
In the script, the model configuration, data files, etc. are specified by
setting shell variables:

    training_file="ptb.train.txt"
    validation_file="ptb.valid.txt"
    model="lstm"
    hidden_size=500
These shell variables are passed as command line arguments to the python
program. These options are documented in the [reference](#reference) section.
To test a trained model:

    experiment/test.sh run "mymodel" "experiment_dir_of_training_run"
In the output, lines matching `final valid* xe:` carry the validation set
cross-entropy. Evaluation results are printed as they happen (see the section on
[evaluation](#evaluation)). Lines of special interest in the output are those
with `final {valid,test}` in them. The format is the following:

    final ${dataset}_${eval_method}[_d${dropout_multiplier}][_t${softmax_temp}]
For [`eval_method=arithmetic`](#eval_method) with
[`eval_dropout_multiplier=0.8`](#eval_dropout_multiplier) and
[`eval_softmax_temperature=0.9`](#eval_softmax_temperature) results may look
like this after 200 optimization steps and 2 evaluations:
    turn: 2 (eval), step: 200 (opt) (5.29/s)
    final valid_mca_d0.8_t0.9 xe: 5.315
    final test_mca_d0.8_t0.9 xe: 5.289
... except that training runs normally don't have the test set results (see
[`eval_on_test`](#eval_on_test)). Test runs are pretty much training runs with
no optimization steps.
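For illustration, here is a small Python sketch of how such a result label
might be assembled from the evaluation options (`result_label` is a
hypothetical helper for exposition, not lamb's actual logging code):

    # Hypothetical helper illustrating the naming scheme described above.
    def result_label(dataset, eval_method, dropout_multiplier=1.0,
                     softmax_temperature=1.0):
        label = dataset + '_' + eval_method
        if dropout_multiplier != 1.0:
            label += '_d{}'.format(dropout_multiplier)
        if softmax_temperature != 1.0:
            label += '_t{}'.format(softmax_temperature)
        return label

    print('final {} xe: {:.3f}'.format(
        result_label('valid', 'mca', 0.8, 0.9), 5.315))
    # -> final valid_mca_d0.8_t0.9 xe: 5.315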
# Installation
For example:

    conda create -n tfp3.7 python=3.7 numpy scipy
    conda activate tfp3.7
    conda install cudatoolkit
    conda install cudnn
    conda install tensorflow-gpu=1.15
    conda install tensorflow-probability
    pip install -e .
# Reference
A value given for an option gets converted to the data type corresponding to the
option in question. In the following, options are listed with their data type
and default value (e.g. `model (string, lstm)` means that the variable `model`
has type `string` and default value `lstm`). If there is no default value
listed, then the option is mandatory.
## Data
- `training_file` (string)
The file with the training data, one line per example. Newlines are translated
to an end-of-sentence token.
- `validation_file` (string)
A file of the same format as [`training_file`](#training_file). During
training, the model is evaluated periodically on data from `validation_file`.
Most notably, early stopping and hyperparameter tuning are based on performance
on this set of examples. This must not be specified when doing cross-validation,
as in that case the evaluation set is constructed from the training set.
- `test_file` (string, '')
A file of the same format as [`training_file`](#training_file). During
training, the model is evaluated periodically on data from `test_file` and the
results are logged. As opposed to [`validation_file`](#validation_file), this
dataset has no effect on training or tuning. The empty string (the default)
turns off evaluation on the test set.
- `file_encoding` (string, utf-8)
The encoding of [`training_file`](#training_file), [`validation_file`](#validation_file)
and [`test_file`](#test_file).
- `word_based` (boolean, false)
Whether to do word or character based modelling. If word based, lines are
split at whitespace into tokens. Else, lines are simply split into characters.
- `episodic` (boolean, false)
If true, iterate over examples (lines in the data files) in random order. If
false, iterate mostly sequentially, carrying over model state from the previous
example to the next.
## Model
- `num_params` (float, -1)
An upper bound on the total number of trainable parameters over all parts of
the model (including the recurrent cell and input/output embeddings). If this
is set to a meaningful value (i.e. not -1, the default), then
[`hidden_size`](#hidden_size) is set to the largest possible value such that
the parameter budget is not exceeded.
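For intuition, here is a much simplified sketch of such a budget search. It
assumes a single-layer LSTM with input and output embeddings of size
`hidden_size`; lamb's actual deduction accounts for all model options:

    # Simplified sketch of deducing hidden_size from a parameter budget.
    def count_params(vocab_size, h):
        embeddings = 2 * vocab_size * h      # input and output embeddings
        lstm = 4 * (h * (h + h) + h)         # i, j, f, o weights and biases
        return embeddings + lstm

    def deduce_hidden_size(num_params, vocab_size):
        h = 1
        while count_params(vocab_size, h + 1) <= num_params:
            h += 1
        return h

    print(deduce_hidden_size(10e6, vocab_size=10000))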
- `share_input_and_output_embeddings` (boolean, false)
Whether the input and output embeddings are the same matrix (transposed) or
independent (the default). If true, then `input_embedding_size` and
[`output_embedding_size`](#output_embedding_size) must be the same.
- `input_embedding_size` (integer, -1)
The length of the vector that represents an input token. If -1 (the default),
then it's determined by [`input_embedding_ratio`](#input_embedding_ratio).
- `output_embedding_size` (integer, -1)
The length of the vector that represents an output token. If -1 (the default),
then it's determined by [`output_embedding_ratio`](#output_embedding_ratio).
If, after applying the defaulting rules, `output_embedding_size` is not equal to
[`hidden_size`](#hidden_size), then the cell output is linearly transformed to
`output_embedding_size` before the final linear transform into the softmax.
- `input_embedding_ratio` (float, 1.0)
If [`input_embedding_size`](#input_embedding_size) is not specified (i.e.
-1), then it's set to `round(input_embedding_ratio*hidden_size)`.
- `output_embedding_ratio` (float, -1.0)
If [`output_embedding_size`](#output_embedding_size) is not specified (i.e.
-1), then it's set to `round(output_embedding_ratio*hidden_size)`. The default
value of -1 makes `output_embedding_ratio` default to the value of
[`input_embedding_ratio`](#input_embedding_ratio) so that one can tune easily
with [`share_input_and_output_embeddings`](#share_input_and_output_embeddings)
`=true`.
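Putting the four embedding options together, here is a small Python sketch of
the defaulting rules as described above (an illustration, not lamb's actual
code):

    def resolve_embedding_sizes(hidden_size,
                                input_embedding_size=-1,
                                output_embedding_size=-1,
                                input_embedding_ratio=1.0,
                                output_embedding_ratio=-1.0):
        # output_embedding_ratio defaults to input_embedding_ratio.
        if output_embedding_ratio < 0:
            output_embedding_ratio = input_embedding_ratio
        if input_embedding_size == -1:
            input_embedding_size = int(round(
                input_embedding_ratio * hidden_size))
        if output_embedding_size == -1:
            output_embedding_size = int(round(
                output_embedding_ratio * hidden_size))
        return input_embedding_size, output_embedding_size

    # With the defaults, both embedding sizes equal hidden_size.
    print(resolve_embedding_sizes(500))  # -> (500, 500)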
- `mos_num_components` (integer, 0)
See [Breaking the softmax bottleneck](https://arxiv.org/abs/1711.03953). The
default of 0 turns this feature off.
- `embedding_dropout` (float, 0.0)
The probability that all occurrences of a word are dropped from a batch.
- `token_dropout` (float, 0.0)
The probability that a token will be dropped (i.e. the input at that step
becomes zero). This can be thought of as a version of
[`embedding_dropout`](#embedding_dropout) that has different masks per time
step.
- `input_dropout` (float, 0.0)
The dropout rate (here and elsewhere, 0 means deterministic operation) for the
input to the first layer (i.e. just after the input embeddings). This drops
out individual elements of the embedding vector.
- `output_dropout` (float, 0.0)
The dropout rate for just after the cell output.
- `downprojected_output_dropout` (float, -1.0)
The dropout rate for the projection of the cell output. Only used if
`output_embedding_size` is different from [`hidden_size`](#hidden_size) or if
[`mos_num_components`](#mos_num_components)
is not 1. Defaults to `output_dropout` if set to -1.
- `shared_mask_dropout` (boolean, false)
Whether to use the same dropout mask for all time steps for
[`input_dropout`](#input_dropout),
[`inter_layer_dropout`](#inter_layer_dropout),
[`output_dropout`](#output_dropout) and
[`downprojected_output_dropout`](#downprojected_output_dropout).
- `output_once` (boolean, true)
Whether to compute the logits from the cell output in a single operation or
per time step. The single operation is faster but uses more GPU memory. Also,
see [`swap_memory`](#swap_memory).
### Cell
- `model` (string, lstm)
One of `lstm`, `rhn` (Recurrent Highway Network), `nas`.
- `num_layers` (integer, 1)
The number of same-sized LSTM cells stacked on top of each other, or the
number of processing steps per input an RHN does. Has no effect on NAS.
- `lstm_skip_connection` (boolean, true)
If true, for multi-layer (num_layers>1) LSTMs, the output is computed as the
sum of the outputs of the individual layers.
- `feature_mask_rounds` (integer, 0)
The Mogrifier LSTM is implemented in terms of the feature masking option. For
LSTMs, feature masking gates the input and the state before they are used in
any other computation (i.e. that of `i`, `j`, `o`, `f`). This allows input
features to be reweighted based on the state, and state features to be
reweighted based on the input. See the [Mogrifier
LSTM](https://arxiv.org/abs/1909.01792) paper for details.
When `feature_mask_rounds` is 0, there is no extra gating in the LSTM.
When it is 1 or higher, the input is gated: `x *= 2*sigmoid(affine(h))`.
When it is 2 or higher, the state is also gated: `h *= 2*sigmoid(affine(x))`.
For higher numbers of rounds, the alternating gating continues (see the sketch
after the next option).
- `feature_mask_rank` (integer, 0)
If 0, the linear transforms described above are full rank, dense matrices. If
>0, then the matrix representing the linear transform is factorized as the
product of two low rank matrices (`[*, rank]` and `[rank, *]`). This reduces
the number of parameters greatly.
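Here is a minimal NumPy sketch of the alternating gating together with the
optional low-rank factorization (random placeholder weights, biases omitted;
not the actual cell implementation):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def affine(v, out_dim, rank, rng):
        # Stand-in random weights; in the model these are trained parameters.
        in_dim = v.shape[-1]
        if rank == 0:
            return v @ rng.standard_normal((in_dim, out_dim))  # full rank
        # Low-rank factorization: [in_dim, rank] times [rank, out_dim].
        a = rng.standard_normal((in_dim, rank))
        b = rng.standard_normal((rank, out_dim))
        return v @ a @ b

    def feature_mask(x, h, rounds, rank, rng):
        # Odd rounds gate the input from the state, even rounds gate the
        # state from the input.
        for i in range(1, rounds + 1):
            if i % 2 == 1:
                x = x * 2 * sigmoid(affine(h, x.shape[-1], rank, rng))
            else:
                h = h * 2 * sigmoid(affine(x, h.shape[-1], rank, rng))
        return x, h

    rng = np.random.default_rng(0)
    x = rng.standard_normal((1, 8))
    h = rng.standard_normal((1, 8))
    x, h = feature_mask(x, h, rounds=4, rank=2, rng=rng)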
- `hidden_size` (string, "-1")
A comma-separated list of integers representing the number of units in the
state of the recurrent cell per layer. Must not be longer than
[`num_layers`](#num_layers). If it's shorter, then the missing values are
assumed to be equal to the last specified one. For example, for a 3 layer
network `"512,256"` results in the first layer having 512 units, the second
and the third having 256. If "-1" (the default), an attempt is made to deduce
it from [`num_params`](#num_params) assuming all layers have the same size.
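For illustration, a sketch of the expansion rule (hypothetical helper):

    def per_layer_sizes(hidden_size, num_layers):
        sizes = [int(s) for s in hidden_size.split(',')]
        # Missing values are assumed equal to the last specified one.
        sizes += [sizes[-1]] * (num_layers - len(sizes))
        return sizes

    print(per_layer_sizes('512,256', 3))  # -> [512, 256, 256]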
- `layer_norm` (boolean, false)
Whether to perform Layer Normalization (currently only implemented for LSTMs).
- `activation_fn` (string, tf.tanh)
The non-linearity for the update candidate ('j') and the output ('o') in an
LSTM, or the output ('h') in an RHN.
- `tie_forget_and_input_gates` (boolean, false)
In an LSTM, whether the input gate ('i') is set to 1 minus the forget gate
('f'). In an RHN, whether the transform gate ('t') is set to 1 minus the carry
gate ('c').
- `cap_input_gate` (boolean, true)
Whether to cap the input gate at 1-f if
[`tie_forget_and_input_gates`](#tie_forget_and_input_gates) is off. Currently
only affects LSTMs. This makes learning more stable, especially at the early
stages of training.
- `trainable_initial_state` (boolean, true)
Whether the initial state of the recurrent cells is allowed to be learnt or is
set to a fixed zero vector. In non-episodic mode, this switch is forced off.
- `inter_layer_dropout` (float, 0.0)
The input dropout for layers other than the first one. Defaults to no dropout,
but setting it to -1 makes it inherit [`input_dropout`](#input_dropout). It
has no effect on RHNs, since the input is not fed to their higher layers.
- `state_dropout` (float, 0.0)
This is the dropout rate for the recurrent state from the previous time step
('h' in an LSTM, 's' in an RHN). See Yarin Gal's "A Theoretically Grounded
Application of Dropout in Recurrent Neural Networks". The dropout mask is the
same for all time steps of a specific example in one batch.
- `update_dropout` (float, 0.0)
This is the Recurrent Dropout (see "Recurrent Dropout without Memory Loss")
rate on the update candidate ('j' in an LSTM, 'h' in an RHN). Arguably, the
technique should have been named update dropout, hence the option name.
- `cell_clip` (float, -1.0)
If set to a positive value, the cell state ('c' in an LSTM, 's' in an RHN) is
clipped to the `[-cell_clip, cell_clip]` range after each iteration.
## Training
### Objective
- `model_average` (string, arithmetic)
[Pushing the bounds of dropout](https://arxiv.org/abs/1805.09208) makes the
point that the actual dropout objective being optimized is a lower bound of
the true objectives of many different models. If we construct the lower bound
from multiple samples though (a'la IWAE), the lower bound will get tighter.
`model_average` is the training time equivalent of `eval_method` and
determines what kind of model (and consequently, averaging) is to be used. One
of `geometric`, `power` and `arithmetic`. Only in effect if
[`num_training_samples`](#num_training_samples) `> 1`.
- `num_training_samples` (integer, 1)
The number of samples from which to compute the objective (see
[`model_average`](#model_average)). Each training example being presented is
run through the network `num_training_samples` times so the effective batch
size is [`batch_size`](#batch_size) `* num_training_samples`. Increasing the
number of samples doesn't seem to help generalization, though.
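To make the averaging concrete, here is a sketch of how the three averages
could combine per-sample log probabilities of the target (one per dropout
mask). This is one reading of the paper, not lamb's exact objective:

    import numpy as np

    def multi_sample_log_prob(sample_log_probs, model_average, power=1.0):
        p = np.asarray(sample_log_probs, dtype=np.float64)
        if model_average == 'geometric':
            return p.mean()                       # mean log probability
        if model_average == 'arithmetic':
            return np.log(np.exp(p).mean())       # log mean probability
        if model_average == 'power':
            return np.log(np.exp(power * p).mean()) / power
        raise ValueError(model_average)

    print(multi_sample_log_prob([-2.0, -3.0], 'arithmetic'))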
- `l2_penalty` (float, 0.0)
The L2 penalty on all trainable parameters.
- `l1_penalty` (float, 0.0)
The L1 penalty on all trainable parameters.
- `activation_norm_penalty` (float, 0.0)
Activation Norm Penalty (Regularizing and optimizing LSTM language models by
Merity et al).
- `drop_state_probability` (float, 0.0)
In non-episodic mode, model state is carried over from batch to batch. Not
feeding back the state with probability `drop_state_probability` encourages the
model to work well starting from the zero state, which brings it closer to the
test regime.
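A sketch of the mechanism (illustrative; the granularity at which lamb drops
state may differ):

    import numpy as np

    def maybe_drop_state(state, drop_state_probability, rng):
        # Zero the carried-over state of each example in the batch
        # independently with the given probability.
        keep = rng.random(state.shape[0]) >= drop_state_probability
        return state * keep[:, None]

    rng = np.random.default_rng(0)
    print(maybe_drop_state(np.ones((4, 3)), 0.5, rng))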
### Initialization
- `embedding_init_factor` (float, 1.0)
All input embedding weights are initialized with a truncated normal
distribution with mean 0 and:

    stddev=sqrt(embedding_init_factor/input_embedding_size)
- `scale_input_embeddings` (boolean, false)
This is not strictly an initialization option, but it serves a similar
purpose. Input embeddings are initialized from a distribution whose variance
is inversely proportional to [`input_embedding_size`](#input_embedding_size).
Since every layer in the network is initialized to produce output with
approximately the same variance as its input, changing the embedding size has
a potentially strong, undesirable effect on optimization. Set
`scale_input_embeddings` to `true` to multiply input embeddings by
`sqrt(input_embedding_size)` to cancel this effect.
As opposed to just changing `embedding_init_factor`, this multiplication has
the benefit that the input embedding matrix is of the right scale for use as
the output embedding matrix should
[`share_input_and_output_embeddings`](#share_input_and_output_embeddings) be
turned on.
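A NumPy sketch of the initialization and the optional scaling (a plain normal
stands in for the truncated normal used in practice; illustrative only):

    import numpy as np

    def embed(ids, vocab_size, input_embedding_size,
              embedding_init_factor=1.0, scale_input_embeddings=False,
              seed=0):
        rng = np.random.default_rng(seed)
        stddev = np.sqrt(embedding_init_factor / input_embedding_size)
        embeddings = stddev * rng.standard_normal(
            (vocab_size, input_embedding_size))
        looked_up = embeddings[ids]
        if scale_input_embeddings:
            # Cancel the initializer's 1/input_embedding_size variance so
            # the scale of the looked-up vectors is independent of the
            # embedding size.
            looked_up = looked_up * np.sqrt(input_embedding_size)
        return looked_up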
- `cell_init_factor` (float, 1.0)
The various weight matrices in the recurrent cell (of which there are 8 in an
LSTM, 4/2 in an RHN) are initialized independently with

    stddev=sqrt(cell_init_factor/fan_in)

while biases are initialized with

    stddev=sqrt(cell_init_factor/hidden_size)
- `forget_bias` (float, 1.0)
Sometimes initializing the biases of the forget gate ('f') in the LSTM (or
that of the carry gate ('c') in an RHN) to a small positive value (typically
1.0, the default) makes the initial phase of optimization faster. Higher
values make the network forget _less_ of its state over time. With deeper
architectures and no skip connections (see [`num_layers`](#num_layers) and
[`lstm_skip_connection`](#lstm_skip_connection)), this may actually make
optimization harder.
The value of `forget_bias` is used as the mean of the distribution used for
initialization with unchanged variance.
- `output_init_factor` (float, 1.0)
If [`share_input_and_output_embeddings`](#share_input_and_output_embeddings)
is false, then the output projection (also known as the output embeddings) is
initialized with

    stddev=sqrt(output_init_factor/fan_in)
If [`share_input_and_output_embeddings`](#share_input_and_output_embeddings)
is true, then this only affects the linear transform of the cell output (see
[`output_embedding_size`](#output_embedding_size)).
### Schedule
- `steps_per_turn` (integer, 1000)
The number of optimization steps between two successive evaluations. After
this many steps performance is evaluated and logged on the training,
validation and test sets (if specified). A so-called turn consists of
`steps_per_turn` optimization steps.
- `turns` (integer)
The number of evaluations beyond which training cannot continue (also see
early stopping).
- `print_training_stats_every_num_steps` (integer, 1000)
Debug printing frequency.
### Optimization
- `optimizer_type` (string, rmsprop)
The optimizer algorithm. One of `rmsprop`, `adam`, `adagrad`, `adadelta` and
`sgd`.
- `rmsprop_beta2` (float, 0.999)
Here, RMSProp is actually Adam with `beta1=0.0`, so Adam's highly useful bias
correction of the computed statistics is in effect, which allows higher initial
learning rates. Only applies when [`optimizer_type`](#optimizer_type) is
`rmsprop`.
- `rmsprop_epsilon` (float, 1e-8)
Similar to [`adam_epsilon`](#adam_epsilon). Only applies when
[`optimizer_type`](#optimizer_type) is `rmsprop`.
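In TensorFlow 1.x terms, an optimizer with these semantics can be built from
Adam directly (a sketch; lamb's own optimizer setup may differ):

    import tensorflow as tf  # TensorFlow 1.15

    def make_rmsprop(learning_rate, rmsprop_beta2=0.999,
                     rmsprop_epsilon=1e-8):
        # "RMSProp" as described above: Adam with beta1=0, i.e. no
        # momentum, but with Adam's bias correction of the second moment.
        return tf.train.AdamOptimizer(learning_rate=learning_rate,
                                      beta1=0.0,
                                      beta2=rmsprop_beta2,
                                      epsilon=rmsprop_epsilon)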
- `adam_beta1` (float, 0.9)
- `adam_beta2` (float, 0.999)
- `adam_epsilon` (float, 1e-8)
- `max_grad_norm` (float, 1.0)
If non-zero, gradients are rescaled so that their norm does not exceed
`max_grad_norm`.
- `batch_size` (integer)
Batch size for training. Also, the evaluation batch size unless
[`min_non_episodic_eval_examples_per_stripe`](#min_non_episodic_eval_examples_per_stripe)
overrides it.
- `accum_batch_size` (integer, -1)
The number of examples that are fed to the network at the same time. Set this
to a divisor of [`batch_size`](#batch_size) to reduce memory usage at the cost
of possibly slower training. Using `accum_batch_size` does not change the
results.
- `max_time_steps` (integer, 100)
For episodic operation, examples that have more tokens than this are truncated
when the data files are loaded. For non-episodic operation, this is the window
size of truncated backpropagation.
- `trigger_averaging_turns` (integer, -1)
The number of turns of no improvement on the validation set, after which
weight averaging is turned on. Weight averaging is a trivial generalization of
the idea behind Averaged SGD: it keeps track of the average weights, updating
the average after each optimization step. Weight averaging does not affect
training directly, only through evaluation. This feature is an alternative to
[dropping the learning rate](#drop_learning_rate_turns).
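A sketch of the running average (over NumPy arrays, for illustration):

    import numpy as np

    class WeightAverager:
        # Running average of the weights. Training continues on the raw
        # weights; evaluation can use `average` instead.
        def __init__(self, weights):
            self.average = [np.array(w, dtype=np.float64) for w in weights]
            self.num_updates = 1

        def update(self, weights):
            self.num_updates += 1
            for avg, w in zip(self.average, weights):
                avg += (w - avg) / self.num_updates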
- `trigger_averaging_at_the_latest` (integer, -1)
If optimization reaches turn `trigger_averaging_at_the_latest`, averaging is
turned on regardless of validation progress. Set this to be somewhat smaller
than [`turns`](#turns) so that all runs get at least some averaging, which
should make the results more comparable.
#### Learning rate
- `learning_rate` (float, 0.001)
- `drop_learning_rate_turns` (integer, -1)
If the validation score doesn't improve for `drop_learning_rate_turns` number
of turns, then the learning rate is multiplied by
[`drop_learning_rate_multiplier`](#drop_learning_rate_multiplier), possibly
repeatedly.
- `drop_learning_rate_multiplier` (float, 1.0)
Set this to a value less than 1.0.
- `drop_learning_rate_at_the_latest` (integer, -1)
If optimization reaches turn `drop_learning_rate_at_the_latest` without the
learning rate having been dropped yet, then it is dropped regardless of whether
the validation curve is still improving. Set this to be somewhat smaller than
[`turns`](#turns) so that all runs get at least one drop, which should make the
results more comparable.
#### Early stopping
- `early_stopping_turns` (integer, -1)
Maximum number of turns without improvement in validation cross-entropy before
stopping.
- `early_stopping_rampup_turns` (integer, 0)
The effective `early_stopping_turns` starts out at 1 and is increased linearly
to the specified [`early_stopping_turns`](#early_stopping_turns) in
`early_stopping_rampup_turns` turns.
- `early_stopping_worst_xe_target` (float, '')
If the estimated best possible validation cross-entropy (extrapolated from the
progress made in the most recent
[`early_stopping_turns`](#early_stopping_turns) turns, subject to rampup) is
worse than `early_stopping_worst_xe_target`, then training is stopped. This is
actually a string of comma-separated floats. The first value is in effect while
the learning rate has not been dropped yet, the second value after it has been
dropped once, and so on. The last element of the list applies to any further
learning rate drops.
- `early_stopping_slowest_rate` (float, 0.0)
The rate is defined as the average improvement in validation cross-entropy
over the effective `early_stopping_turns` (see
[`early_stopping_rampup_turns`](#early_stopping_rampup_turns)). If the rate is
less than `early_stopping_slowest_rate`, then stop early.
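A sketch of the rampup and the rate check (one interpretation of the options
above, not lamb's exact logic):

    def effective_window(turn, early_stopping_turns,
                         early_stopping_rampup_turns):
        # Ramp up linearly from 1 to early_stopping_turns.
        if (early_stopping_rampup_turns <= 0
                or turn >= early_stopping_rampup_turns):
            return early_stopping_turns
        return 1 + ((early_stopping_turns - 1) * turn
                    // early_stopping_rampup_turns)

    def too_slow(valid_xes, window, early_stopping_slowest_rate):
        # Average per-turn improvement in validation cross-entropy over
        # the effective window.
        if len(valid_xes) <= window:
            return False
        rate = (valid_xes[-window - 1] - valid_xes[-1]) / window
        return rate < early_stopping_slowest_rate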
### Cross-validation
- `crossvalidate` (boolean, false)
If true, randomly split the training set into
[`crossvalidation_folds`](#crossvalidation_folds) folds, evaluate performance
on each fold and average the cross-entropies. Repeat the entire process
[`crossvalidation_rounds`](#crossvalidation_rounds) times and average the
averages.
- `crossvalidation_folds` (integer, 10)
The number of folds to split the training set into.
- `crossvalidation_rounds` (integer, 1)
If [`crossvalidate`](#crossvalidate) is true, then do this many rounds of
`crossvalidation_folds`-fold cross-validation. Set this to a value larger than
one if the variance of the cross-validation score over the random splits is
too high.
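A sketch of how the splits could be generated (an illustrative helper, not
lamb's actual code); the final score is the average of the per-split
cross-entropies:

    import random

    def crossvalidation_splits(examples, folds, rounds, seed=0):
        # `rounds` random reshuffles of the training set, each cut into
        # `folds` folds that serve as the evaluation set in turn.
        rng = random.Random(seed)
        for _ in range(rounds):
            shuffled = list(examples)
            rng.shuffle(shuffled)
            for i in range(folds):
                evaluation = shuffled[i::folds]
                training = [x for j, x in enumerate(shuffled)
                            if j % folds != i]
                yield training, evaluation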
## Evaluation
The model being trained is evaluated periodically (see [`turns`](#turns) and
[`steps_per_turn`](#steps_per_turn)) on the validation set (see
[`validation_file`](#validation_file)) and also on the training set (see
[`training_file`](#training_file)). Evaluation on the training set is different
from the loss as it does not include regularization terms such as
[`l2_penalty`](#l2_penalty) and is performed the same way as evaluation on the
validation set (see [`eval_method`](#eval_method)).
To evaluate a saved model one typically wants to do no training, disable saving
of checkpoints and evaluate on the test set which corresponds to this:

    turns=0
    save_checkpoints=false
    eval_on_test=true
Furthermore, [`load_checkpoint`](#load_checkpoint), and in all likelihood
[`config_file`](#config_file) must be set. This is all taken care of by the
`experiment/test.sh` script.
- `max_training_eval_batches` (integer, 100)
When evaluating performance on the training set, it is enough to get a rough
estimate. If specified, at most `max_training_eval_batches` number of batches
will be evaluated. Set this to zero to turn off evaluation on the training
set entirely. Set it to -1 to evaluate on the entire training set.
- `max_eval_eval_batches` (integer, -1)
Evaluation can be pretty expensive with large datasets. For expediency, one
can impose a limit on the number of batches of examples to work with on the
validation set.
- `max_test_eval_batches` (integer, -1)
Same as [`max_eval_eval_batches`](#max_eval_eval_batches) but for the test
set.
- `min_non_episodic_eval_examples_per_stripe` (integer, 100)
By default, evaluation is performed using the training batch size, causing
each "stripe" in a batch to cover roughly `dataset_size/batch_size` examples.
With a small dataset in a non-episodic setting, that may make the evaluation
quite pessimistic. This flag ensures that the batch size for evaluation is
small enough that at least this many examples are processed in the same
stripe.
- `eval_on_test` (boolean, false)
Even if [`test_file`](#test_file) is provided, evaluation on this test dataset
is not performed by default. Set this to true to do that. Flipping this switch
makes it easy to test a model by loading a checkpoint and its saved
configuration without having to remember what the dataset was.
- `eval_method` (string, deterministic)
One of `deterministic`, `geometric`, `power` and `arithmetic`. This determines
how dropout is applied at evaluation.
- `deterministic` is also known as standard dropout: dropout is turned off at
evaluation time and a single deterministic pass propagates the expectation
of each unit through the network.
- `geometric` performs a renormalized geometric average of predicted
probabilities over randomly sampled dropout masks.
- `power` computes the power mean with exponent
[`eval_power_mean_power`](#eval_power_mean_power).
- `arithmetic` computes the arithmetic average.
See [Pushing the bounds of dropout](https://arxiv.org/abs/1805.09208) for a
more detailed discussion.
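One way to realize the three stochastic averages over dropout samples (a
sketch; `deterministic` needs no samples at all):

    import numpy as np

    def average_predictions(sampled_probs, eval_method, power=1.0):
        # sampled_probs: [num_eval_samples, vocab_size] predicted
        # distributions under independently sampled dropout masks.
        p = np.asarray(sampled_probs, dtype=np.float64)
        if eval_method == 'arithmetic':
            avg = p.mean(axis=0)
        elif eval_method == 'geometric':
            avg = np.exp(np.log(p).mean(axis=0))
        elif eval_method == 'power':
            avg = (p ** power).mean(axis=0) ** (1.0 / power)
        else:
            raise ValueError(eval_method)
        return avg / avg.sum()  # renormalize to a proper distribution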
- `num_eval_samples` (integer, 0)
The number of samples to average probabilities over at evaluation time. Needs
some source of stochasticity (currently only dropout) to be meaningful. When
it's zero, the model is run in deterministic mode. Training evaluation is
always performed in deterministic mode for expediency.
- `eval_softmax_temperature` (float, 1.0)
Set this to a value lower than 1 to smooth the distribution a bit at
evaluation time to counter overfitting. Set it to a value between -1 and 0 to
search for the optimal temperature between the absolute value of the setting
and 1 on the validation set. For example, `eval_softmax_temperature=-0.8`
searches for the optimal temperature between 0.8 and 1.0.
- `eval_power_mean_power` (float, 1.0)
The exponent of the renormalized power mean to compute predicted
probabilities. Only has an effect if [`eval_method=power`](#eval_method).
- `eval_dropout_multiplier` (float, 1.0)
At evaluation time all dropout probabilities used for training are multiplied
by this. Does not affect the [`eval_method=deterministic`](#eval_method) case.
See [Pushing the bounds of dropout](https://arxiv.org/abs/1805.09208) for
details.
- `validation_prediction_file` (string)
The name of the file where log probabilities for the validation set are
written. The file is overwritten each time the model is evaluated. It lists
tokens and predicted log probabilities on alternating lines. Currently only
implemented for deterministic [evaluation](#eval_method).
- `dyneval` (boolean, false)
Whether model weights shall be updated at evaluation time (see [Dynamic
Evaluation of Neural Sequence Models](https://arxiv.org/abs/1709.07432) by
Krause et al.). This forces the batch size at evaluation time to 1, which makes
it very slow, so it is best to leave it off until the final evaluation.
Whereas RMSProp maintains an online estimate of gradient variance, dynamic
evaluation bases its estimate on training statistics which are affected by
[`max_training_eval_batches`](#max_training_eval_batches) and
[`batch_size`](#batch_size).
Also, when doing dynamic evaluation it might make sense to turn off some
regularizers such as [`l2_penalty`](#l2_penalty), or hacks like
[`max_grad_norm`](#max_grad_norm).
- `dyneval_learning_rate` (float, 0.001)
The learning rate for dynamic evaluation.
- `dyneval_decay_rate` (float, 0.02)
The rate with which weights revert to the _mean_ which is defined as what was
trained.
- `dyneval_epsilon` (float, 1e-5)
This serves a similar purpose to [`rmsprop_epsilon`](#rmsprop_epsilon), but
for dynamic evaluation.
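Putting the dyneval options together, a schematic of a single update (based on
the description above and the cited paper; not necessarily lamb's exact
formula):

    import numpy as np

    def dyneval_step(w, grad, ms, w_trained,
                     dyneval_learning_rate=0.001,
                     dyneval_decay_rate=0.02,
                     dyneval_epsilon=1e-5):
        # RMSProp-like update using gradient statistics `ms` collected on
        # the training set, plus a pull back towards the trained weights
        # (the "mean").
        w = w - dyneval_learning_rate * grad / (np.sqrt(ms)
                                                + dyneval_epsilon)
        w = w - dyneval_decay_rate * (w - w_trained)
        return w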
## Experiments
- `name` (string, see below)
The name of the experiment. Defaults to the git version concatenated with the
basename of the script (without the `.sh`). See
[`experiment_dir`](#experiment_dir).
- `experiment_dir` (string, ./ + `$name`)
Directory for saving configuration, logs and checkpoint files.
Lamb's git version is saved in `lamb_version` along with any uncommitted
changes in the checkout (if in a git tree). `stdout` and `stderr` are also
captured.
If [`save_checkpoints`](#save_checkpoints) is true, checkpoints are saved
here. Also see [`save_config`](#save_config).
- `save_config` (boolean, true)
All options are saved in `$experiment_dir/config`, except those for which it
doesn't make sense:
- [`load_checkpoint`](#load_checkpoint)
- [`load_optimizer_state`](#load_optimizer_state)
- [`load_averaged`](#load_averaged)
- [`ensure_new_experiment`](#ensure_new_experiment)
- [`config_file`](#config_file)
- [`save_config`](#save_config)
- `ensure_new_experiment` (boolean, true)
If `ensure_new_experiment` is true, a random suffix is appended to
[`experiment_dir`](#experiment_dir) to ensure the experiment starts afresh.
When `ensure_new_experiment` is false and `experiment_dir` exists, the last
checkpoint will be loaded on startup from that directory.
- `config_file` (string, '')
This is to load `$experiment_dir/config` that gets saved automatically when
[`save_checkpoints`](#save_checkpoints) is true. It is not needed if one uses
`experiment/test.sh` for evaluation. If a configuration option is set
explicitly and is also in the configuration file, then the explicit version
overrides the one in the configuration file.
### Checkpoints
- `save_checkpoints` (boolean, true)
Whether to save any checkpoints. `save_checkpoints` also affects saving of the
configuration (see [`config_file`](#config_file)).
If `save_checkpoints` is true, then two checkpoints are saved:
`$experiment_dir/best` and `$experiment_dir/last`. If `last` exists, it will
be loaded automatically on startup and training will continue from that state.
If that's undesirable, use a different [`experiment_dir`](#experiment_dir) or
delete the checkpoint manually. The `best` checkpoint corresponds to the best
validation result seen during periodic model evaluation over the course of
training.
- `load_checkpoint` (string, '')
The name of the checkpoint file to load instead of loading
`$experiment_dir/last` or randomly initializing. Absolute or relative to
[`experiment_dir`](#experiment_dir).
- `load_optimizer_state` (boolean, true)
Set this to `false` to prevent [`load_checkpoint`](#load_checkpoint) from
attempting to restore optimizer state. This effectively reinitializes the
optimizer and also allows changing the optimizer type. It does not affect
automatic loading of the latest checkpoint (see
[`experiment_dir`](#experiment_dir)).
### Misc options
- `seed` (integer, 0)
The random seed. Both python and tensorflow seeds are initialized with this
value. Due to non-determinism in tensorflow, training runs are not exactly
reproducible even with the same seed.
- `swap_memory` (boolean, false)
Transparently swap the tensors produced in forward inference but needed for
back prop from GPU to CPU. This allows training RNNs which would typically not
fit on a single GPU, but slows things down a bit.
- `log_device_placement` (boolean, false)
Log tensorflow device placement.
# Notes
This is not an official Google product.
# Citations
- [On the state of the art of evaluation in neural language models](https://arxiv.org/abs/1707.05589)
    @inproceedings{melis2018on,
      title={On the State of the Art of Evaluation in Neural Language Models},
      author={G{\'a}bor Melis and Chris Dyer and Phil Blunsom},
      booktitle={International Conference on Learning Representations},
      year={2018},
      url={https://openreview.net/forum?id=ByJHuTgA-},
    }
- [Pushing the bounds of dropout](https://arxiv.org/abs/1805.09208)
    @article{melis2018pushing,
      title={Pushing the bounds of dropout},
      author={Melis, G{\'a}bor and Blundell, Charles and Ko{\v{c}}isk{\'y}, Tom{\'a}{\v{s}} and Hermann, Karl Moritz and Dyer, Chris and Blunsom, Phil},
      journal={arXiv preprint arXiv:1805.09208},
      year={2018}
    }
- [Mogrifier LSTM](https://arxiv.org/abs/1909.01792)
    @inproceedings{melis2020mogrifier,
      title={Mogrifier LSTM},
      author={Melis, G{\'a}bor and Ko{\v{c}}isk{\'y}, Tom{\'a}{\v{s}} and Blunsom, Phil},
      booktitle={International Conference on Learning Representations},
      year={2020},
      url={https://openreview.net/forum?id=SJe5P6EYvS},
    }