Sinusoid Regression Using Stochastic Gradient Descent (For real this time)

Frank Lanke Fu Tarimo
2 min read · Feb 8, 2021

In a previous post (https://fulkast.medium.com/a-simple-exercise-in-regressing-a-sinusoidal-function-9e2932031155) I wrote about regressing a sinusoid from appropriately sampled data (satisfying the Nyquist sampling requirement). For the optimisation framework I chose PyTorch (for convenience), and for the optimiser I picked the "SGD" algorithm. This, however, was a lie: I was not actually using SGD, and as you could see from the results, my optimisation wasn't getting anywhere. In this post we are going to fix that and actually learn some waves!

My mistake in the previous post was that I was actually running "batch optimisation" over the entire dataset, instead of "mini-batch optimisation" over small chunks of the data. Because of the periodic nature of the sinusoid, running optimisation over the entire dataset meant that my function (which starts out looking almost linear) was very likely to over-estimate some parts of the data and under-estimate others. The resulting gradients counteracted each other and left my optimisation stuck, fitting a slanted straight line to the periodic sinusoid.
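To make that failure mode concrete, here is a minimal sketch of the full-batch setup (not the original notebook code; the parameterisation a·sin(b·x + c) + d, the initial values, the learning rate, and the iteration count are illustrative assumptions):

```python
import math
import torch

torch.manual_seed(0)

# Full dataset: samples of sin(x) over a few periods.
x = torch.linspace(0, 4 * math.pi, 200)
y = torch.sin(x)

# Parameters of a*sin(b*x + c) + d, initialised so the model starts out almost linear.
params = torch.nn.Parameter(torch.tensor([0.5, 0.5, 0.0, 0.0]))
optimiser = torch.optim.SGD([params], lr=1e-2)

for step in range(2000):
    a, b, c, d = params
    pred = a * torch.sin(b * x + c) + d    # evaluated on the entire dataset at once
    loss = torch.mean((pred - y) ** 2)     # over- and under-estimation errors largely cancel
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```

Because each step averages the error over full periods of the target, the gradient signal stays weak and the fit tends to stall near a straight line.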

In this post I tackle this issue by actually performing SGD: optimising over a small, randomly selected batch of the dataset at each iteration. The hope is to let the model fit small chunks of the dataset, one at a time, without being held back by its average performance over the entire dataset. Below is the updated notebook:
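In sketch form (again, not the exact notebook code; the batch size, learning rate, and parameterisation are illustrative assumptions), the only structural change from the full-batch version above is drawing a small random subset of indices at every step:

```python
import math
import torch

torch.manual_seed(0)

x = torch.linspace(0, 4 * math.pi, 200)
y = torch.sin(x)

params = torch.nn.Parameter(torch.tensor([0.5, 0.5, 0.0, 0.0]))
optimiser = torch.optim.SGD([params], lr=1e-2)

batch_size = 8
for step in range(2000):
    # Stochastic part: a small, randomly selected chunk of the data per iteration.
    idx = torch.randint(0, x.numel(), (batch_size,))
    xb, yb = x[idx], y[idx]

    a, b, c, d = params
    pred = a * torch.sin(b * xb + c) + d
    loss = torch.mean((pred - yb) ** 2)

    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```

With only a handful of points per step, the gradient from one chunk of the wave is no longer averaged away by the opposite error on another chunk, which is exactly what lets the fit escape the straight-line solution.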

In the column on the left, the crosshairs show the mini-batch samples used at each iteration of the optimisation (only a subset of the iterations is plotted, for brevity). The column on the right shows the current model's prediction over the entire dataset. With the new (mini-batch) approach we converge nicely to the true form of the sinusoid by the end of the epoch: after having effectively "seen" the same amount of data as in the first approach, we now actually learn the sine function.

To be fair, not every run of this second approach converges. I cherry-picked an example that worked well, but good runs were not rare. In an upcoming post I will explore the convergence statistics as a function of batch size and of noise-corrupted inputs. Stay tuned!

In the meantime, here's an interesting read on good, small batch sizes ;)

https://arxiv.org/abs/1804.07612

