The derivations below show how we arrive at the mathematical formulas above.
Stochastic gradient descent is commonly referred to as SGD; from here on I will use the short form only. The gradient descent method described above is called batch gradient descent: the whole training set is given as input to the network and backpropagated in one go. This works well for smaller datasets, but as we scale up the model we face the following problems:
Computational infeasibility: You will likely be designing a model to solve a real-world problem where there is an ample amount of data and conditions, and it is not possible to load the complete dataset into the RAM of the hardware. Because real-life datasets are huge, the computational cost of running training on the complete dataset is prohibitive.
Slow: Since batch gradient descent runs every update on the complete dataset, training is very slow, as each backpropagation pass takes time.
What SGD offers is that it takes mini-batches of the data for training: it updates the weights after backpropagating the first mini-batch, then the second, and so on. This idea solves many of the hardware resource problems described above, since only one mini-batch has to be held in memory at a time.
Note: A typical minibatch size is 256, although the optimal size of the minibatch can vary for different applications and architectures.
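The mini-batch idea can be sketched in plain Python. This is only an illustration under assumptions of my own: the helper names minibatches and sgd_epoch, the toy data drawn from y = 3x + 1, and the learning rate are all hypothetical and chosen for a tiny linear model rather than a neural network.

```python
import random

def minibatches(data, batch_size):
    """Yield consecutive slices of `data`, each with at most `batch_size` items."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

def sgd_epoch(data, w, b, lr, batch_size):
    """Run one pass over `data`, updating (w, b) once per mini-batch."""
    for batch in minibatches(data, batch_size):
        # Gradient of 0.5 * (w*x + b - y)^2, averaged over this mini-batch only.
        grad_w = sum((w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum((w * x + b - y) for x, y in batch) / len(batch)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data from y = 3x + 1 (illustrative values, not from a real dataset).
rng = random.Random(0)
data = [(x, 3.0 * x + 1.0) for x in (rng.uniform(-1, 1) for _ in range(1024))]

w, b = 0.0, 0.0
for epoch in range(100):
    # 1024 examples with a batch size of 256 give 4 weight updates per epoch.
    w, b = sgd_epoch(data, w, b, lr=0.1, batch_size=256)
print(w, b)
```

Batch gradient descent would make a single update per pass over the data; here each pass makes one update per mini-batch, and only that mini-batch is needed to compute the gradient.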
In SGD the learning rate α is typically much smaller than a corresponding learning rate in batch gradient descent because there is much more variance in the update.
Choosing the proper learning rate and schedule (i.e. changing the value of the learning rate as learning progresses) is a fairly difficult task. One standard method that works well in practice is to use a small enough constant learning rate that gives stable convergence in the initial epoch (full pass through the training set) or two of training and then halve the value of the learning rate as convergence slows down.
An even better approach is to evaluate a held-out set after each epoch and anneal the learning rate when the change in objective between epochs is below a small threshold. This tends to give good convergence to a local optimum. Another commonly used schedule is to anneal the learning rate at each iteration t as a/(b+t), where a and b dictate the initial learning rate and when the annealing begins, respectively.
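Both schedules are small enough to write as functions. The sketch below is a rough illustration, not a reference implementation: the function names, the default threshold of 1e-3, and the halving factor are assumptions picked for the example.

```python
def inverse_time_lr(a, b, t):
    """Learning rate at iteration t under the a / (b + t) schedule."""
    return a / (b + t)

def anneal_on_plateau(lr, prev_loss, curr_loss, threshold=1e-3, factor=0.5):
    """Scale lr by `factor` when the held-out loss changed by less than `threshold`."""
    if abs(prev_loss - curr_loss) < threshold:
        return lr * factor
    return lr

# With a = 1 and b = 100 the rate starts at a / b = 0.01
# and has halved by iteration t = b = 100.
print(inverse_time_lr(1.0, 100.0, 0))
print(inverse_time_lr(1.0, 100.0, 100))
```

In the a/(b+t) form, the ratio a/b sets the initial learning rate, and b controls how many iterations pass before the rate drops noticeably.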
If the data is given in some meaningful order, this can bias the gradient and lead to poor convergence. Generally a good method to avoid this is to randomly shuffle the data prior to each epoch of training.
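Shuffling before each epoch might look like the following sketch. The generator name shuffled_epochs and its seed parameter are hypothetical, and the data is assumed to fit in a Python list.

```python
import random

def shuffled_epochs(data, num_epochs, seed=None):
    """Yield a freshly shuffled copy of `data` for each epoch."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        order = list(data)   # copy, so the caller's list is left untouched
        rng.shuffle(order)   # new random order for this epoch
        yield order

samples = list(range(20))
for epoch_data in shuffled_epochs(samples, num_epochs=3, seed=0):
    # Every epoch visits the same examples, just in a different order.
    print(epoch_data[:5])
```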
The objective of optimization is always to reach the optimal minimum, but standard SGD often tends to oscillate across a narrow ravine, because the negative gradient points down one of the steep sides rather than along the ravine towards the optimum.
The objectives of deep architectures have this form near local optima and thus standard SGD can lead to very slow convergence particularly after the initial steep gains. Momentum is one method for pushing the objective more quickly along the shallow ravine. The momentum update is given by,
$$v = \gamma v + \alpha \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})$$
$$\theta = \theta - v$$
In the above equation v is the current velocity vector which is of the same dimension as the parameter vector θ. The learning rate α is as described above, although when using momentum α may need to be smaller since the magnitude of the gradient will be larger. Finally γ∈(0,1] determines for how many iterations the previous gradients are incorporated into the current update. Generally γ is set to 0.5 until the initial learning stabilizes and then is increased to 0.9 or higher.
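The momentum update can be tried on a toy ravine-shaped objective. This is a minimal sketch under my own assumptions: the quadratic J(θ) = 0.5·(100·θ₀² + θ₁²), the step count, and the names momentum_step, ravine_grad and run are illustrative only, and the gradient is exact rather than estimated from a mini-batch.

```python
def momentum_step(theta, v, grad, lr, gamma):
    """Apply v = gamma*v + lr*grad, then theta = theta - v, elementwise."""
    v = [gamma * vi + lr * gi for vi, gi in zip(v, grad)]
    theta = [ti - vi for ti, vi in zip(theta, v)]
    return theta, v

def ravine_grad(theta):
    """Gradient of J(theta) = 0.5 * (100 * theta[0]**2 + theta[1]**2)."""
    return [100.0 * theta[0], theta[1]]

def run(gamma, steps=200, lr=0.01):
    """Start at (1, 1) with zero velocity and take `steps` momentum updates."""
    theta, v = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        theta, v = momentum_step(theta, v, ravine_grad(theta), lr, gamma)
    return theta

plain = run(gamma=0.0)          # gamma = 0 reduces to plain gradient descent
with_momentum = run(gamma=0.9)
print(plain, with_momentum)
```

The first coordinate is the steep side of the ravine and the second is the shallow direction along it. With gamma set to 0 the shallow coordinate shrinks slowly, while with gamma at 0.9 the accumulated velocity carries it much closer to the optimum in the same number of steps.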
Create a file named sample_sgd.py
Copy and paste the code below:
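```python
# Import the Keras deep learning library
import numpy as np
from keras import optimizers
from keras.models import Sequential
from keras.layers import Dense, Activation

# Placeholder random data so the script runs; swap in a real dataset
x_train, y_train = np.random.rand(1000, 10), np.random.rand(1000, 64)
x_test, y_test = np.random.rand(200, 10), np.random.rand(200, 64)

# Create a new model object
model = Sequential()
# Dense: a regular densely connected NN layer
# (Keras 2 renamed the old `init` argument to `kernel_initializer`)
model.add(Dense(64, kernel_initializer='uniform', input_shape=(10,)))
# Apply the other stacked layers
model.add(Activation('tanh'))
model.add(Activation('softmax'))

# Define the SGD optimizer with a learning rate of 0.01, a decay of 1e-6
# and Nesterov momentum of 0.9. Newer Keras releases spell `lr` as
# `learning_rate` and may not accept `decay`.
sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd)

# x_train and y_train are NumPy arrays: the training dataset
model.fit(x_train, y_train, epochs=5, batch_size=32)
# or
# model.train_on_batch(x_batch, y_batch)

# Evaluate the loss on the test set
loss_and_metrics = model.evaluate(x_test, y_test, batch_size=128)

# Predict the class scores for new inputs
classes = model.predict(x_test, batch_size=128)
```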