Above: Graph of m with respect to iterations for the brute force, (sfjb) step forward jump back, (ss) split search and (lps) linear proportional solvers.
Basic Math Via Algorithm
The most basic thing that I could think of when I hit rewind on machine learning and try to get down to the basics is a solver. A solver that solves basic math using an algorithm. This is so simple that if we imagine going back to the days before machine learning was invented, this could have been created.
It could have been created before computers. I could have been created on mechanical devices, an abacus or even on paper. It is something that is fairly obvious and my intent is to push the obvious to the limit.
y = mx, with y as 5 and x as 2, what m do we need to solve this. Normal mathematicians would just rearrange it to be y/x = m or 5/2 = m.
But, the real question is if we had to use a machine how would it attempt it and how can we start to generalize from a first attempt and warm up to better ways of having a machine learn how to do it?
https://github.com/erickclasen/PMLC/tree/master/solvers
Brute Force
First try, just brute force it. Create a for loop and just iterate until the proper m is found that will produce a guess at y which is 5. Here the code just sets m to zero. The only hyperparameters are n for the number of loops that the for loop will execute and l the learning rate, which is how big will each step be for every loop. y1 is the predicted value which is compared to y to see if the guess is good or not.
''' No brains, just brute force search solver. ''' # Output, input and initial m y = 5 x = 2 m = 0 # Hyper parameters # Learning rate and max number of loops. l = 0.25 n = 20 print("i,m,y1") for i in range(0,n): # Predict y1 = m*x print(i,m,y1) # Update weight m = m + l # Done solving, break out of the loop! if y1 ==y: break
The output is as follows…
As can be seen from the image it solves the problem at loop 10 with m at 2.5 and y1 at 5.0. Great, but this is a special case. If the learning rate was not set to 0.25, the results would have been different, it could have gone on forever. Also if y was negative, the loop would never find it. If it was too large the loop would quit long before getting to the correct predicted value.
Step Up and Jump Down
The step up and jump down approach attempts to overcome a few of the limitations of the brute force method by moving up along the number line and then jumping back when the y1 predicted value exceeds y. This ensures that given a large enough n for a loop count it will at least always find the proper m for solving for y. The key difference in this code is highlighted in blue.
''' Simple Almost Machine Learning Linear Solver. ''' # Output, input and initial m y = 5 x = 2 m = 0 # Hyper parameters # Learning rate and number of loops. l = 0.25 n = 20 print("i,m,y1") for i in range(0,n): # Predict y1 = m*x print(i,m,y1) # Update weight, by one step of the learning rate. m = m + l # If the error is less than zero, going the wrong direction, jump back 2 learning rate units. if y - y1 < 0: m = m - 2*l
The output…
In this code there could have been a break out for when y=y1. I left it out to show the behavior that it has in that it oscillates back and forth in the vicinity of the solution. This is kind of a small but key point. If for instance the learning rate was something other than 0.25 and it could not get to the exact m = 2.5 to predict y from x, it would stay in the neighborhood of the correct answer. This edges close to the concept of machine learning. Given the right hyper-parameters, large enough n and a decent enough l, it gets in the neighborhood of the answer. It doesn’t get stuck if y is negative for example. It will reach it exactly or get close.
Setting m to 0.33 for example it will dither back and forth in the vicinity of the correct answer.
(17, 2.31, 4.62)
(18, 2.64, 5.28)
(19, 2.31, 4.62)
You might notice that the average of 4.62 and 5.28 is very close to the answer of 5 at 4.95. As a matter of fact if m was set to many decimal places such as 0.3333333, the average approaches 5.
It seems like taking the average could give the answer. How about bringing the values that are dithering back and forth closer together? Wouldn’t that work as a general case.
Split Search
Split search does what the name says. It applies one type of search in one direction and another in the opposite direction. If the guess y1 is lower than y, m gets bumped up by the learning rate. If the guess y1 is higher, the learning rate is cut in half and then applied by subtracting it. Instead of stepping forward and jumping back too much like the last example, now it closes the gap as time goes on. Bringing the values together, much like closing in on the average.
The loop is no longer a for loop that runs for n number of times. Now it is a loop that depends on the size of the learning rate. When it becomes small enough the loop exits. I also did this for variety to show that it is not always necessary to use a for loop, especially when we have a condition that can guarantee and ext from a while loop.
'' Simple Almost Machine Learning Linear Solver with Split Search. ''' # Output, input and initial m y = 5 x = 2 m = 0 # Hyper parameter # Learning rate. l = 1.0 print("i,m,y1") while l > 0.01: y1= m*x if y1 <= y: m += l elif y1 > y: l = l/2 m -= l print(l,m,y1)
The output of the code is as follows. It can be seen that it can get arbitrarily close to the correct value and if the learning rate is allowed to get lower than 0.01 to lets say 0.0000001. You could say that it has pretty much found a solution. This example now has what it takes to be a general solver. It can solve for y, no matter if it is negative or the seed of m is not zero but some other value. It will always converge to a solution.
At this point one can think of a way to make it a little cleaner. Is there some kind of math that can be used instead of the split search that uses two kinds of search, one for if it is above the target and another for when it is below? Can this if/elif just be folded into a one-liner equation.
Linear Proportional Solver
The name says it all. It works by solving an equation by a linear equation that works in proportion to the error between the target and prediction. The clunky if/elif logic of the split search is replaced by a one line equation….
# Update weight
m = m + l*(y-y1)
…this introduces the concept of backpropagation (backprop). It is not quite the full blow version of it but, it gives a small linear preview. Where a forward prediction is made and the weight is updated by backpropagating the error backwards to adjust the weight. As this is just a linear solver the derivative of the output is a constant and the difference between the y and y1 is simply fed back and multiplied by the learning rate and applied to the weight. This makes the change in m as time goes on, proportional the the error. More error means that the equation converges quicker, unlike some of the previous examples. Even though the split-solver trims the learning rate down, it still take small steps that are wholly governed by the size of the learning rate as it descends. This proportional solver can approach the target faster when the error is larger, something the other solvers won’t do. This in effect makes it self adjusting. Now it is not only general but it is efficient as well. Making it suitable for solving equations with large y and x values in a reasonable amount of time.
''' Linear Proportional Solver, has all the basic elements of machine learning. ''' # Output, input and initial m y = 5 x = 2 m = 0.5 # Hyper parameters # Learning rate and number of loops l = 0.25 n = 10 print("i,m,y1") for i in range(0,n): # Predict y1 = m*x print(i,m,y1) # Update weight m = m + l*(y-y1)
The following is the output….
Root Solver
The inspiration for this chapter on solvers came from the idea of using Newton’s Method to solve for a square root. It is interesting to note that the linear proportional solver is capable of solving for a square root by making a trivial change to the code. I am showing this here to make the point that solver can do something other than just solve for y = mx and to tie it back to Newton’s Method.
# The proportional linear solver can be converted to a root finder by replacing one line. # y1 = m*m instead of y1=m*x will allow m to converge to the square root of m # Output, input and initial m y = 5 #x = 2 m = 0.5 # Hyper parameters # Learning rate l = 0.25 n = 10 print("i,m,y1") for i in range(0,n): # Predict the square y1 = m*m print(i,m,y1) # Update weight, the weight will converge to the square root of y m = m + l*(y-y1)
The output shows m converging to the square root of y. As it has been converted to generate the square of m for a y1 prediction.
Newton’s Method
The following code has been modified to use Newton’s Method to solve for a square root. This is the method that I remember learning on a calculator by fiddling around with it. It is a recursive type of operation which works well on a calculator as you keep feeding in the results over and over again.
This time a seed value of 0.5 is used as the math needs some starting guess to start with. If it were 0, the x/m would go nowhere and the code would error out with a divide by zero. It takes the guess m and divides it into the input x, 2 in this case. This result y1 is then averaged with the m that was input. The result is fed back into m. In the code y1 is calculated as an intermediate result, keeping the code consistent with the other examples and for showing the output as well. But the math can be simplified.
m = (m + y1)/2 replacing y1 with x/m gives m = (m + x/m)/2
Looking at m = (m + x/m)/2 the method for using this recursively on a calculator is clear below as it is worked for a few iterations.
# Output, input and initial m #y = 5 x = 2 m = 0.5 # Hyper parameters # Learning rate l = 0.25 n = 10 print("i,m,y1") for i in range(0,n): # Predict y1 = x/m print(i,m,y1) # Recursive with respect to m. m = (m + y1)/2
The output is as follows
Bonus: Differential Evolution
I have decided to include this section to show an alternative method. It is not machine learning but, very similar. Differential evolution is modeled after genetic evolution and starts with a random population of numbers which get mutated using the difference between two added to a third which gives the name differential, nothing to do with the calculus term. Differential evolution is good in cases where regular machine learning would be hard or impossible to implement. Machine learning using backprop implies using derivatives of activation functions of which the derivatives must be continuous. If you have some monster problem that is not calculus differentiable or just discontinuous somehow, differential evolution (DE) can help. I discovered it while looking for a better way to train a trading algorithm that I was working on. Yes, it’s an algorithm that trains another algorithm! The trading algorithm has 6 tunable values of which several were moving average constants and the others were sort of hyperparameters in a way. Other that brute force, using 6 nested for loops and trimming the ranges down there was no reasonable way to tune it using something like gradient descent, except for DE which came to the rescue and reduced the training time by a few orders of magnitude. It also allow for some help with regularization as well a topic that will be covered later in this course.
DE Code
The following code does the same operation as the previous examples, except using DE. Have a good look at it and by this point in the article, especially if you have played around with the code in the other examples, you might be able to decode what is going on with it. It is quite lengthy for what it is doing in this example as compared to the other examples. It is an overkill way to find m but, it gives an idea of what DE can do.
This was the initial example that I worked through while I was trying to learn DE. I put in a decent amount of comments to keep my head clear during the learning process. It should be fairly understandable. Once I understood the code, I was able to move onto a multivariate version that I could port to the trading algorithm that I was working on. I took a few steps in that direction from the code below to warm up to DE along the way.
''' Differential Evolution - Create a random population which is mean zero and std dev of 1, as in a normal distribution. Take it and scale and shift it via a weight and bias to the range of the bounds, therefore normalizing it. Copy the population to popcp and select from three random places in the population. Take these values and form a new trial mutation value by adding and differencing, trial = a * mut + (b-c) Try this value out in the y = m * x formula. Also try the value pulled from the population, which is looped through in the inner loop. If the mutation trial does better save it at the location that is at the array position that is currently being looped through. Repeat until the outer loop ends. Hyper parameters. bounds - A range within the value is expected to fall. mut - Mutation constant. From 0.5-2.0. Sets how aggressive the mutation process is per iteration. pop_size - The population size. ''' import numpy as np # Trial is an array of 3 to create the mutation from. trial = [0,0,0] # Output, input y = 5 x = 2 # An estimate of the bounds of the problem, the random numbers are generated within this range. # Clipping of the pop values can be done to this range as well. bounds = [0,5] norm_bias = np.mean(bounds)+bounds[0] norm_weight = bounds[1]-bounds[0] #print(norm_weight,norm_bias) # Mutation constant mut = 1 pop_size = 10 #dimensions = 1 # Generate a starter random population normalized to the middle of the range pop = (np.random.rand(pop_size, 1) * norm_weight) + norm_bias print("Initial Population") print(pop) print() # Outer loop of epochs of differential evolution. for m in range(0,50): # Iterate through population. for i in range(0,pop_size): # Make a copy to work with and remove values used popcp = pop[:] # Predict with the i-th population value. y1 = pop[i]*x # Remove it from the copy. popcp = np.delete(popcp,i,0) for s in range(0,3): #print(s,len(popcp),popcp) select = np.random.randint(0,len(popcp)) #print(select) # Remove the trial from the list at the select point. trial[s] = popcp[select] popcp = np.delete(popcp,select,0) # These are the trials to mutate. #print(trial[0],trial[1],trial[2]) # Generate the mutation to run a trial with. muta = mut*trial[0]+(trial[1] - trial[2]) # Trial prediction using the mutation. y2 = muta * x # Errors of the population at the idx and the trial mutation. e1 = abs(y - y1) e2 = abs(y - y2) # If the mutation is better, keep it and store it in the population. if e1 > e2: pop[i] = muta print(m,muta) # How good is it so far? print("Index and y1 prediction:",m,y1)
The output from the code is a bit different from the other examples. On the top the population that was randomly created is shown. Then the run proceeds showing which population index on the left is under mutation and the results of the trial on the right. The numbers grind around seemingly randomly and then start to converge to the correct result for m = 2.50. The more runs the closer it gets. It is possible to either run the code for a preset amount of loops or use a while loop and exit when the changes to m become smaller than some preset threshold.
There are several other solvers using DE up on my Github, they were used as stepping stones on the way to the actual tuning problem I was trying to solve and looking at the code, running them and hacking them will make for good exercises for those that are curious about DE.
https://github.com/erickclasen/PMLC/tree/master/solvers/de
The resource that I used to get started in DE is…
https://pablormier.github.io/2017/09/05/a-tutorial-on-differential-evolution-with-python/#