If your input data is sparse, methods such as SGD, NAG, and momentum tend to perform poorly, because they apply one learning rate to every parameter and so barely move the parameters tied to rarely seen features. For sparse data sets, use one of the adaptive learning-rate methods. An additional benefit is that you usually won’t need to tune the learning rate and will likely achieve good results with the default value. If you want fast convergence when training a deep or highly complex neural network, Adam or another adaptive learning-rate method is usually the best choice, because these methods typically converge faster than the other optimization algorithms.
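To make the sparse-data point concrete, here is a minimal sketch in NumPy, not a reference implementation. It assumes a toy setup with two parameters: one receives a gradient at every step (a dense feature) and the other only at every tenth step (a sparse feature). The function names, the gradient values, and the learning rate of 0.01 are illustrative assumptions. It compares plain SGD with an Adagrad-style update, which scales each parameter's step by the square root of its accumulated squared gradients.

```python
import numpy as np


def sgd_step(w, grad, lr=0.01):
    """Plain SGD: one global learning rate for every parameter."""
    return w - lr * grad


def adagrad_step(w, grad, cache, lr=0.01, eps=1e-8):
    """Adagrad-style update: each parameter's step is scaled by the
    square root of its own accumulated squared gradients."""
    cache = cache + grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache


def run(steps=100):
    """Toy setup (an assumption for illustration): parameter 0 gets a
    gradient every step, parameter 1 only every tenth step."""
    w_sgd = np.zeros(2)
    w_ada = np.zeros(2)
    cache = np.zeros(2)
    for t in range(steps):
        sparse_grad = 1.0 if t % 10 == 0 else 0.0
        grad = np.array([1.0, sparse_grad])
        w_sgd = sgd_step(w_sgd, grad)
        w_ada, cache = adagrad_step(w_ada, grad, cache)
    return w_sgd, w_ada


if __name__ == "__main__":
    w_sgd, w_ada = run()
    # How far the sparse parameter moved relative to the dense one.
    print("SGD     sparse/dense movement:", abs(w_sgd[1] / w_sgd[0]))
    print("Adagrad sparse/dense movement:", abs(w_ada[1] / w_ada[0]))
```

Under SGD, the sparse parameter moves in direct proportion to how often its feature appears. The Adagrad-style update shrinks the step for the frequently updated parameter, so the rarely updated one covers relatively more ground, which is the behaviour that makes adaptive methods attractive for sparse inputs.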
One thing about Machine Learning: the overall depth of the topics and algorithms makes it so easy to totally ‘sink’ yourself into it, and there is always something more to dig into. This article provides a view from higher ground and compares different optimization algorithms and their application areas, thus pulling you out of the deep hole of deep learning.
For a more visual comparison of these algorithms, see these two beautifully crafted animations: