# Stanford Machine Learning: (6).Large Scale Machine Learning

Learning with large datasets
• This set of notes looks at large scale machine learning - how do we deal with big datasets?
• If you look back at the last 5-10 years of machine learning, ML works much better now largely because we have much more data
• However, with this increase in data comes great responsibility? No - it comes with a much more significant computational cost
• New and exciting problems are emerging that need to be dealt with on both the algorithmic and architectural level
Why large datasets?
Stochastic gradient descent

• Define our cost function slightly differently, as
• cost(θ, (xi, yi)) = (1/2)(hθ(xi) - yi)²
• So the function represents the cost of θ with respect to a specific example (xi, yi)
• And we calculate this value as one half times the squared error on that example
• Measures how well the hypothesis works on a single example
• The overall cost function can now be re-written in the following form:
• Jtrain(θ) = (1/m) Σi cost(θ, (xi, yi))
• This is equivalent to the batch gradient descent cost function
• With this slightly modified (but equivalent) view of linear regression we can write out how stochastic gradient descent works
• So what's going on here?
• What stochastic gradient descent algorithm is doing is scanning through each example
• The inner for loop does something like this...
• Looking at example 1, take a step with respect to the cost of just the 1st training example
• Having done this, we go on to the second training example
• Now take a second step in parameter space to try and fit the second training example better
• Now move onto the third training example
• And so on...
• Until it gets to the end of the data
• We may now repeat this whole procedure and take multiple passes over the data
• Randomly shuffling the data at the start ensures the examples are visited in a random order, so we don't bias the movement
• Randomization should speed up convergence a little bit
• Although stochastic gradient descent is a lot like batch gradient descent, rather than waiting to sum up the gradient terms over all m examples, we take just one example and make progress in improving the parameters already
• Means we update the parameters on EVERY step through data, instead of at the end of each loop through all the data
• What does the algorithm do to the parameters?
• One final implementation note
• May need to loop over the entire dataset 1-10 times
• If you have a truly massive dataset it's possible that by the time you've taken a first pass through the dataset you may already have a perfectly good hypothesis
• In which case the inner loop might only need to happen once if m is very very large
• If we contrast this to batch gradient descent
• We have to make k passes through the entire dataset, where k is the number of gradient descent steps we take - and every single step requires summing over all m examples
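The loop described above can be sketched in code. This is a minimal illustration for linear regression; the learning rate, number of passes, and function name are illustrative choices, not from the notes:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, passes=5, seed=0):
    """One-example-at-a-time gradient descent for linear regression.

    A sketch of the algorithm described above: shuffle, then update the
    parameters using a single example at a time.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(passes):            # typically 1-10 passes over the data
        for i in rng.permutation(m):   # randomly shuffle the examples
            error = X[i] @ theta - y[i]        # h_theta(x_i) - y_i
            theta -= alpha * error * X[i]      # update using ONE example
    return theta
```

Note that the parameters move after every single example, rather than after a full sweep over all m examples as in batch gradient descent.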

Mini-batch gradient descent
• Mini-batch gradient descent is an additional approach which can work even faster than stochastic gradient descent
• To summarize our approaches so far
• Batch gradient descent: Use all m examples in each iteration
• Stochastic gradient descent: Use 1 example in each iteration
• Mini-batch gradient descent: Use b examples in each iteration
• b = mini-batch size
• So just like batch gradient descent, except we use tiny batches
• Typical range for b = 2-100 (10 maybe)
• For example
• b = 10
• Get 10 examples from training set
• Perform gradient descent update using the ten examples
Mini-batch algorithm
• We for-loop through the m examples in batches of size b, updating the parameters after each batch
• Compared to batch gradient descent this allows us to get through the data in a much more efficient way
• After just b examples we begin to improve our parameters
• Don't have to update parameters after every example, and don't have to wait until you cycled through all the data
• Why should we use mini-batch?
• Allows you to have a vectorized implementation
• Means implementation is much more efficient
• Can partially parallelize your computation (i.e. do 10 at once)
• A disadvantage of mini-batch gradient descent is the extra parameter b, which you have to tune
• But this is often worth it!
• To be honest, stochastic gradient descent and batch gradient descent are just specific forms of mini-batch gradient descent, with b = 1 and b = m respectively
• For mini-batch gradient descent, b is somewhere in between 1 and m and you can try to optimize for it!
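A sketch of the mini-batch variant, again for linear regression. The batch size, learning rate, and number of passes are illustrative; the key point is that each update is a vectorized computation over b examples:

```python
import numpy as np

def minibatch_gradient_descent(X, y, b=10, alpha=0.3, passes=200, seed=0):
    """Mini-batch gradient descent for linear regression (squared-error cost).

    Updates the parameters once per batch of b examples, so the inner
    update is a single vectorized matrix operation.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(passes):
        order = rng.permutation(m)
        for start in range(0, m, b):
            batch = order[start:start + b]        # next b shuffled examples
            errors = X[batch] @ theta - y[batch]  # vectorized over the batch
            theta -= (alpha / len(batch)) * (X[batch].T @ errors)
    return theta
```

With b = 1 this reduces to stochastic gradient descent, and with b = m to batch gradient descent.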

• But how do you know when it's done!?
• How do you tune learning rate alpha (α)?
Checking for convergence
• During learning, compute cost(θ, (xi, yi)) on each example just before updating θ with it
• Then, every 1000 iterations or so, plot the average of those costs over the last 1000 examples processed - if the curve flattens out, the algorithm has (roughly) converged
Learning rate
• We saw that with stochastic gradient descent we get this wandering around the minimum
• In most implementations the learning rate is held constant
• However, if you want to converge to a minimum you can slowly decrease the learning rate over time
• A typical way to do this is to set α = const1 / (iterationNumber + const2)
• This guarantees the steps get smaller and smaller, at the cost of two more parameters to choose
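Both ideas - a decaying learning rate and convergence monitoring - can be combined in one sketch. The constants and window size below are illustrative assumptions, not values from the notes:

```python
import numpy as np

def sgd_with_monitoring(X, y, const1=1.0, const2=100.0, window=1000, seed=0):
    """SGD with a decaying learning rate and convergence monitoring.

    alpha = const1 / (iteration + const2) shrinks the step size over time.
    We record the cost on each example *before* updating theta with it,
    and average over the last `window` examples; plotting those averages
    shows whether the algorithm is converging.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    recent_costs, avg_history = [], []
    for it, i in enumerate(rng.permutation(m)):
        error = X[i] @ theta - y[i]
        recent_costs.append(0.5 * error ** 2)   # cost before the update
        alpha = const1 / (it + const2)          # decaying learning rate
        theta -= alpha * error * X[i]
        if (it + 1) % window == 0:              # plot these to check convergence
            avg_history.append(np.mean(recent_costs))
            recent_costs = []
    return theta, avg_history
```

If the averaged costs stop decreasing, the parameters are wandering close to the minimum and training can stop.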
Online learning
• New setting
• Allows us to model problems where you have a continuous stream of data you want an algorithm to learn from
• Similar in idea to stochastic gradient descent, in that you update the parameters a little at a time, one example at a time
• Web companies use various types of online learning algorithms to learn from traffic
• Can (for example) learn about user preferences and hence optimize your website
• Example - Shipping service
• Users come and tell you origin and destination
• You offer to ship the package for some amount of money ($10 - $50)
• Based on the price you offer, sometimes the user uses your service (y = 1), sometimes they don't (y = 0)
• Build an algorithm to optimize what price we offer to the users
• Capture
• Origin and destination
• Work out
• What the probability of a user selecting the service is
• We want to optimize the price
• To model this probability we have something like
• p(y = 1|x; θ)
• Probability that y =1, given x, parameterized by θ
• Build this model with something like
• Logistic regression
• Neural network
• If you have a website that runs continuously, an online learning algorithm would do something like this:
• Repeat forever:
• Get the (x, y) pair corresponding to the current user
• Update θ using just this (x, y): θj := θj - α(hθ(x) - y)xj (for every j)
• Discard the example
• While in previous examples we might have described a data example as (xi, yi), for an online learning problem we discard this idea of a data "set" - instead we have a continuous stream of data, so indexing is largely irrelevant because you're not storing the data (although presumably you could store it)
• If you have a major website where you have a massive stream of data then this kind of algorithm is pretty reasonable
• You're free of the need to deal with all your training data
• If you had a small number of users you could save their data and then run a normal algorithm on a dataset
• An online algorithm can adapt to changing user preferences
• So over time users may become more price sensitive
• The algorithm adapts and learns to this
• So your system is dynamic
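The shipping-service example above can be sketched as an online logistic regression model. The class name, feature layout, and learning rate are illustrative assumptions; the key point is that each (x, y) pair is used for one update and then discarded:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class OnlinePricingModel:
    """Online logistic regression for p(y = 1 | x; theta).

    Each incoming user gives one (x, y) pair - features include the
    offered price, and y = 1 if they used the service. We take one
    gradient step on that example, then forget it.
    """
    def __init__(self, n_features, alpha=0.1):
        self.theta = np.zeros(n_features)
        self.alpha = alpha

    def predict_proba(self, x):
        return sigmoid(x @ self.theta)          # estimated p(y = 1 | x)

    def update(self, x, y):
        # one stochastic gradient step on the logistic loss, then discard x
        gradient = (self.predict_proba(x) - y) * x
        self.theta -= self.alpha * gradient
```

Because the model keeps updating on fresh traffic, it tracks changing user preferences (e.g. growing price sensitivity) automatically.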
Another example - product search
• Run an online store that sells cellphones
• You have a UI where the user can type in a query like, "Android phone 1080p camera"
• We want to offer the user 10 phones per query
• How do we do this?
• For each phone, and given a specific user query, we create a feature vector (x) which contains data like:
• Features of the phone
• How many words in the user query match the name of the phone
• How many words in the user query match the description of the phone
• Basically how well does the phone match the user query
• We want to estimate the probability of a user selecting a phone
• So define
• y = 1 if a user clicks on a link
• y = 0 otherwise
• So we want to learn
• p(y = 1|x ; θ) <- this is the problem of learning the predicted click through rate (CTR)
• If we can estimate the CTR for any phone, we can use it to show the highest-probability phones first
• If we display 10 phones per search, it means for each search we generate 10 training examples of data
• i.e. user can click through one or more, or none of them, which defines how well the prediction performed
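A sketch of this ranking step, assuming some model estimating p(y = 1 | x; θ) is already available (e.g. logistic regression, as mentioned above). The feature extraction and function names are illustrative:

```python
import numpy as np

def query_match_features(query, phone):
    """Feature vector for one (query, phone) pair, as described above:
    a bias term plus counts of query words matching the name/description."""
    query_words = set(query.lower().split())
    name_matches = len(query_words & set(phone["name"].lower().split()))
    desc_matches = len(query_words & set(phone["description"].lower().split()))
    return np.array([1.0, float(name_matches), float(desc_matches)])

def top_phones(query, phones, ctr_model, k=10):
    """Return the k phones with the highest predicted click-through rate."""
    scored = [(ctr_model(query_match_features(query, p)), p) for p in phones]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]
```

Each displayed result then yields a fresh (x, y) training example - clicked or not - which can be fed back into the CTR model.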
• Other things you can do
• Special offers to show the user
• Show news articles - learn what users like
• Product recommendation
• These problems could have been formulated using standard techniques, but they are the kinds of problems where you have so much data that this is a better way to do things

Map reduce and data parallelism