Sinusoid Regression: The Stats (Receipts)

Frank Lanke Fu Tarimo
Feb 10, 2021

In my previous post, I covered how performing gradient descent with a small batch size (e.g. 2 samples at a time) yielded better results than running over the entire dataset. Well, there are still a few "knobs" to account for before I can say I've made a fair comparison between the small mini-batch optimisation and the full-batch optimisation. But for now, here are some statistics taken from a few different settings.

A plot of the final regression error vs. batch size. (Note: some intermediate batch sizes aren't visible because their final errors are negligible.)

In the bar plot on the left we see that the final regression error is high for the larger batch sizes and, interestingly, so is the error for the batch-size-of-one case. In addition to the high regression error, the batch-size-of-one case also has a large variance (depicted by the vertical line).

The batch sizes between 2 and 8 (even 15, maybe?) show much lower final regression errors and moderate variance. To get a better picture of the results, here's a histogram of the same data.

Histograms of the final regression error. Each subplot shows a different batch size used during training.

Interestingly, the batch sizes 2 to 8 show almost unimodal behaviour, terminating at the optimal 0.0 regression error most of the time. The larger batch sizes and the batch size of one both seem to pick up some bimodal behaviour, where most of the time the model terminates at a regression error that's relatively high.
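
To give a concrete (hypothetical) picture of how statistics like these can be gathered, here's a rough sketch: repeat training several times per batch size, record the final regression error of each run, and plot per-batch-size histograms. The `train_once` stub, the batch sizes, and the repeat count are all assumptions for illustration, not the notebook's actual code.

```python
import numpy as np
import matplotlib.pyplot as plt

batch_sizes = [1, 2, 4, 8, 15, 30, 60]   # hypothetical settings
n_repeats = 20                            # independent training runs per setting
rng = np.random.default_rng(0)

def train_once(batch_size):
    # Stand-in for a full training run on the sinusoid dataset; replace
    # with real training code that returns the final regression error.
    return rng.random()

# Collect the final error of every repeated run, per batch size.
errors = {b: [train_once(b) for _ in range(n_repeats)] for b in batch_sizes}

# One histogram per batch size, mirroring the figure above.
fig, axes = plt.subplots(1, len(batch_sizes), figsize=(14, 2), sharey=True)
for ax, b in zip(axes, batch_sizes):
    ax.hist(errors[b], bins=10)
    ax.set_title(f"batch size {b}")
    ax.set_xlabel("final error")
axes[0].set_ylabel("count")
plt.tight_layout()
plt.show()
```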

I mentioned above that there are still some "knobs" to be tweaked before we can consider this a fair comparison of different batch sizes. Let me share the training details. For each batch-size setting, a preset number of epochs is run, where an epoch is one pass through the entire dataset. As a result, runs with larger batch sizes see fewer iterations per epoch (in the extreme case where the entire dataset is fed in as a single batch, there is only one iteration per epoch). This means the optimiser is run a varying number of times on the network, depending on the batch size used. Runs with smaller batch sizes see more iterations per epoch and so run back-propagation more times than the larger-batch-size runs do.
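
To make the point concrete, here's a tiny back-of-the-envelope calculation. The dataset size and epoch count below are made up for illustration; the arithmetic is what matters: with a fixed epoch budget, the number of optimiser steps grows as the batch size shrinks.

```python
import math

dataset_size = 60   # hypothetical number of training samples
n_epochs = 100      # hypothetical fixed epoch budget

for batch_size in [1, 2, 4, 8, 15, 30, 60]:
    steps_per_epoch = math.ceil(dataset_size / batch_size)
    total_steps = steps_per_epoch * n_epochs
    print(f"batch size {batch_size:>2}: {steps_per_epoch:>3} iterations/epoch, "
          f"{total_steps:>5} back-prop steps in total")
```

Under these assumed numbers, the batch-size-of-one run performs 60 times as many parameter updates as the full-batch run, which is exactly the imbalance I want to control for.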

In an upcoming post I will evaluate the effect of the number of back-prop iterations on the regression error statistics. For now though, it's interesting to see that for the same number of epochs, runs with smaller batch sizes (but still larger than 1) can end up with better regression results.

For full details on this experiment, check out my updated Jupyter notebook.

