The image shows the smoothed loss as a function of training epochs.
The orange curve shows the standard procedure, where training uses a simple momentum-based scheme.
The blue curve uses the same scheme, but multiplies the accumulated gradients by static positive weights, a so-called 'preconditioner', before applying them to the parameters of the neural network.
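The mechanics of such a static preconditioner can be sketched on a toy problem. The following is an illustrative NumPy example, not the actual setup from the figure: it replaces the CIFAR10 network with a small ill-conditioned quadratic loss, and the `train` function, learning rates, and preconditioner values are all hypothetical choices made for the demonstration.

```python
import numpy as np

# Hypothetical stand-in for the network loss: 0.5 * theta^T A theta
# with an ill-conditioned diagonal curvature matrix A.
A = np.diag([100.0, 1.0])

def grad(theta):
    return A @ theta

def train(theta0, precond, lr, beta=0.9, steps=200):
    """Momentum scheme; `precond` holds the static positive weights
    that scale the accumulated gradient before the parameter update."""
    theta = theta0.copy()
    v = np.zeros_like(theta)
    losses = []
    for _ in range(steps):
        v = beta * v + grad(theta)       # accumulate gradients (momentum)
        theta = theta - lr * precond * v # scale by static positive weights
        losses.append(0.5 * theta @ A @ theta)
    return losses

theta0 = np.array([1.0, 1.0])
# Standard procedure: no preconditioning (weights are all ones).
plain = train(theta0, precond=np.ones(2), lr=0.005)
# Preconditioned: weights chosen as the inverse of the initial
# curvature, which equalizes the effective step size per coordinate.
pre = train(theta0, precond=np.array([0.01, 1.0]), lr=0.5)
```

On this quadratic, the preconditioned run drives the loss down faster early on, mirroring the steeper initial descent of the blue curve. A fixed quadratic cannot reproduce the later overtaking, since its curvature never changes during training.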
The two networks are also initialized differently.
Both networks are trained on image recognition using the CIFAR10 dataset.
Initially, the blue curve falls much more steeply, an indication of faster training.
However, the standard procedure (orange) overtakes the blue curve before the first training epoch is finished.
This is because the statistical assumptions about the network parameters that were used to derive the preconditioner weights are violated as the parameters change during training.
If one could find a good way to adjust the weights during training, the entire training could be sped up.