CS229 Lecture Notes
Andrew Ng (slightly updated by Tengyu Ma, 2019)

Supervised learning

Let's start by talking about a few examples of supervised learning problems. Recall the problem of predicting the price of a house (denoted by y) from its living area (denoted by x), where we fit a linear function of x to the training data. A list of n such input-output pairs {(x(i), y(i)); i = 1, ..., n} is called a training set. We will also use X to denote the space of input values and Y the space of output values. Our goal is, given a training set, to learn a function h : X → Y so that h(x) is a good predictor for the corresponding value of y; the function h is called a hypothesis. Seen pictorially, the process is: a training set is fed to a learning algorithm, which outputs h; a new input x is then fed to h, which outputs a predicted y (for example, a predicted price of a house).

One reasonable method is to make h(x) close to y, at least for the training examples we have. To formalize this for linear regression, we define the cost function

    J(θ) = (1/2) Σᵢ (hθ(x(i)) − y(i))²,

and we want to choose θ so as to minimize J(θ). A natural question is why, specifically, the least-squares cost function J might be a reasonable choice. One answer comes from a probabilistic view: given a model relating the y(i)'s and the x(i)'s through Gaussian noise, least-squares regression is derived as a very natural algorithm.
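As a concrete sketch of the hypothesis and cost function above (a minimal NumPy sketch; the tiny dataset and function names are made up for illustration):

```python
import numpy as np

def h(theta, x):
    """Linear hypothesis h_theta(x) = theta^T x."""
    return x @ theta

def J(theta, X, y):
    """Least-squares cost J(theta) = (1/2) sum_i (h_theta(x_i) - y_i)^2."""
    residuals = X @ theta - y
    return 0.5 * residuals @ residuals

# Tiny made-up training set; the first column is the intercept term x0 = 1.
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])

theta_exact = np.array([1.0, 2.0])   # fits y = 1 + 2x exactly
print(J(theta_exact, X, y))          # cost is 0 at a perfect fit
```

Minimizing J means driving every residual hθ(x(i)) − y(i) toward zero, which is why a perfect fit has cost exactly 0.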
Gradient descent. To minimize J(θ), let's use a search algorithm that starts with an initial guess for θ and repeatedly changes θ in the direction of steepest decrease of J. For a single training example, this gives the update rule

    θj := θj + α (y(i) − hθ(x(i))) x(i)j,

which is called the LMS (least mean squares) rule and is also known as the Widrow-Hoff learning rule. It has several properties that seem natural and intuitive: for instance, the magnitude of the update is proportional to the error term (y(i) − hθ(x(i))).

To handle more than one example, there are two approaches. Batch gradient descent sums the partial derivative term on the right-hand side over the entire training set on every step. Stochastic gradient descent (also called incremental gradient descent) instead updates the parameters according to the gradient of the error with respect to that single training example only. Whereas batch gradient descent has to scan the entire training set before taking a single step, stochastic gradient descent can start making progress right away, and often gets θ "close" to the minimum much faster; when the training set is large, it is therefore often preferred over batch gradient descent. (Note that stochastic gradient descent may never converge, and the parameters θ will keep oscillating around the minimum of J(θ); but in practice most of the values near the minimum will be reasonably good approximations. Also, by slowly letting the learning rate α decrease to zero as the algorithm runs, it is possible to ensure that θ converges.)
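The two update schemes can be sketched as follows (a minimal NumPy sketch on a tiny made-up dataset; function names and hyperparameters are illustrative choices, not part of the notes):

```python
import numpy as np

def batch_gd(X, y, alpha=0.01, iters=2000):
    """Batch gradient descent: each step scans the entire training set."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * X.T @ (y - X @ theta)   # LMS update summed over all examples
    return theta

def stochastic_gd(X, y, alpha=0.01, epochs=2000):
    """Stochastic gradient descent: update after every single example."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(X.shape[0]):
            theta += alpha * (y[i] - X[i] @ theta) * X[i]
    return theta

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])          # lies exactly on y = 1 + 2x
print(batch_gd(X, y))                  # approaches [1, 2]
print(stochastic_gd(X, y))             # also approaches [1, 2]
```

On this noise-free data both variants converge to the same θ; with noisy data and a fixed α, the stochastic version oscillates around the minimum as described above.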
Probabilistic interpretation. Assume that the target variables and the inputs are related via y(i) = θᵀx(i) + ǫ(i), where the ǫ(i)'s are error terms distributed IID according to a Gaussian distribution (also called a Normal distribution) with mean zero and some variance σ². We can write this assumption as "ǫ(i) ∼ N(0, σ²)". This implies a distribution on the y(i)'s given the x(i)'s and parameterized by θ. When we wish to view this probability of the data explicitly as a function of θ, we will instead call it the likelihood function, L(θ) = p(y | X; θ). Note that by the independence assumption on the ǫ(i)'s (and hence also the y(i)'s given the x(i)'s), assuming that the n training examples were generated independently, L(θ) factors as a product over the examples. The principle of maximum likelihood says we should choose θ to maximize L(θ); since we can also maximize any strictly increasing function of L(θ), it is easier to maximize the log likelihood ℓ(θ). Working this out, maximizing ℓ(θ) gives the same answer as minimizing J(θ): under these probabilistic assumptions, least-squares regression corresponds to maximum likelihood estimation.

Locally weighted linear regression (LWR). To make a prediction at a query point x, LWR solves a least-squares problem in which training example i receives weight

    w(i) = exp(−(x(i) − x)ᵀ Σ⁻¹ (x(i) − x)/2),

for an appropriate choice of τ or Σ (in the one-dimensional case, w(i) = exp(−(x(i) − x)²/(2τ²))). The bandwidth parameter τ controls how quickly the weight of a training example falls off with distance from the query point x; you will investigate further properties of the LWR algorithm yourself in the homework. The (unweighted) linear regression algorithm is parametric: once we have fit the parameters, we no longer need to keep the training data around to make future predictions. LWR is our first example of a non-parametric algorithm: the amount of stuff we need to keep in order to represent the hypothesis grows with the size of the training set, so we must keep the entire training set around.
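The LWR weighting scheme can be sketched as follows (a minimal one-dimensional sketch with made-up data; the weighted normal equations used here follow directly from minimizing the weighted squared error):

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Locally weighted linear regression prediction at a scalar query point.

    Each training example gets weight w_i = exp(-(x_i - x)^2 / (2 tau^2));
    we then solve the weighted normal equations for theta and return theta^T x.
    """
    A = np.column_stack([np.ones_like(X), X])          # intercept + feature
    w = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))   # weights fall off with distance
    W = np.diag(w)
    theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)  # fit around the query point
    return np.array([1.0, x_query]) @ theta

X = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * X + 1                      # data on an exact line
print(lwr_predict(1.5, X, y))      # ~ 4.0: LWR recovers the line locally
```

Note that a fresh θ is fit for every query point, which is exactly why the whole training set must be kept around.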
The normal equations. Gradient descent gives one way of minimizing J. Let's discuss a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm. Armed with the tools of matrix derivatives (using, along the way, the fact that aᵀb = bᵀa), we set ∇θJ(θ) = 0 and obtain the normal equations, XᵀXθ = Xᵀy, whose solution is θ = (XᵀX)⁻¹Xᵀy. If the number of bedrooms were included as one of the input features as well as the living area, running this procedure on the housing data gives θ0 = 89.60, θ1 = 0.1392, θ2 = −8.738. [Figure: housing prices plotted against living area (roughly 500 to 5000 square feet), together with the fitted line.]
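A minimal sketch of the closed-form solve (the handful of housing-style rows below are illustrative stand-ins, not the full dataset behind the θ values quoted above):

```python
import numpy as np

def normal_equations(X, y):
    """Closed-form least squares: solve X^T X theta = X^T y.

    Solved with np.linalg.solve rather than forming an explicit inverse.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Made-up housing-style rows: [1, living area (ft^2), #bedrooms] -> price (1000s).
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1600.0, 3.0],
              [1.0, 2400.0, 3.0],
              [1.0, 1416.0, 2.0],
              [1.0, 3000.0, 4.0]])
y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])
theta = normal_equations(X, y)
print(theta)   # one parameter per feature: intercept, area, bedrooms
```

The defining property of the solution is that the residual y − Xθ is orthogonal to every feature column, which is just the normal equations restated.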
Classification and logistic regression. Let's now talk about the classification problem. We will focus on the binary classification problem, in which y can take on only two values, 0 and 1. (Most of what we say here will also generalize to the multiple-class case.) We could approach the classification problem ignoring the fact that y is discrete-valued and use linear regression, but it is easy to construct examples where this method performs very poorly. Instead, we change the form of our hypothesis to hθ(x) = g(θᵀx) = 1/(1 + e^(−θᵀx)), the logistic function, and assume

    P(y = 1 | x; θ) = hθ(x),
    P(y = 0 | x; θ) = 1 − hθ(x).

Note that this can be written more compactly as p(y | x; θ) = (hθ(x))^y (1 − hθ(x))^(1−y). Assuming that the n training examples were generated independently, the likelihood of the parameters is L(θ) = p(y | X; θ) = Πᵢ p(y(i) | x(i); θ); as before, it will be easier to maximize the log likelihood ℓ(θ). In other words, we choose θ so as to make the data as high-probability as possible.
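A minimal sketch of maximizing this log likelihood by batch gradient ascent (the tiny separable dataset, step size, and iteration count are made-up illustrative choices):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """ell(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]."""
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def grad_ascent_step(theta, X, y, alpha=0.1):
    """One batch gradient-ascent step on ell(theta)."""
    h = sigmoid(X @ theta)
    return theta + alpha * X.T @ (y - h)

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.zeros(2)
for _ in range(500):
    theta = grad_ascent_step(theta, X, y)
print(sigmoid(X @ theta))   # predicted probabilities move toward the labels
```

Each step pushes every predicted probability hθ(x(i)) toward its label y(i), so ℓ(θ) increases monotonically for a small enough α.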
Maximizing ℓ(θ) one example at a time gives the stochastic gradient ascent rule θj := θj + α (y(i) − hθ(x(i))) x(i)j. (Here, ":=" denotes the operation in which we set the value of a variable a to be equal to the value of b.) If we compare this to the LMS update rule, we see that it looks identical; but it is not the same algorithm, because hθ(x(i)) is now defined as a non-linear function of θᵀx(i). Relatedly, consider modifying the logistic regression method to "force" it to output values that are exactly 0 or 1: doing so yields the perceptron learning algorithm.

Newton's method. Let's discuss a second way of maximizing ℓ(θ). Newton's method gives a way of getting to a point where a function f is zero: it starts with an initial guess and repeatedly performs the update θ := θ − f(θ)/f′(θ). Since the maximizers of ℓ correspond to points where its first derivative ℓ′(θ) is zero, by letting f(θ) = ℓ′(θ) we can use the same algorithm to maximize ℓ. In the worked figure in the notes, running one more iteration updates θ to about 1.8, rapidly approaching the point with f(θ) = 0. To generalize Newton's method to the multidimensional setting (also called the Newton-Raphson method), the update becomes θ := θ − H⁻¹∇θℓ(θ), where ∇θℓ(θ) is, as usual, the vector of partial derivatives of ℓ(θ) with respect to the θj's, and H is the Hessian.
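The one-dimensional update can be sketched as follows (the concave objective ℓ(θ) = −(θ − 3)⁴ is a made-up stand-in whose maximizer is obvious, so the iteration is easy to check):

```python
def newton_maximize(dl, d2l, theta0, iters=40):
    """Maximize ell via Newton's method on f = ell': find f(theta) = 0
    by repeating theta := theta - ell'(theta) / ell''(theta)."""
    theta = theta0
    for _ in range(iters):
        theta -= dl(theta) / d2l(theta)
    return theta

# Made-up objective ell(theta) = -(theta - 3)^4, maximized at theta = 3.
dl = lambda t: -4 * (t - 3) ** 3     # ell'(theta)
d2l = lambda t: -12 * (t - 3) ** 2   # ell''(theta)
print(newton_maximize(dl, d2l, theta0=4.5))   # converges toward 3
```

For this objective each step is θ := θ − (θ − 3)/3, so the distance to the maximizer shrinks by a factor of 2/3 per iteration; on objectives with a non-degenerate maximum, Newton's method converges quadratically instead.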
Generalized linear models. Both of the models above are special cases of a broader family. Say a class of distributions is in the exponential family if it can be written in the form

    p(y; η) = b(y) exp(ηᵀT(y) − a(η)).

Here, η is called the natural parameter of the distribution; T(y) is the sufficient statistic (for the distributions we consider, it will often be the case that T(y) = y); and a(η) is the log partition function. A fixed choice of T, a and b defines a family (or set) of distributions that is parameterized by η; as we vary η, we then get different distributions within this family (for the Gaussian case, for example, a set of Gaussian distributions with different means). The Bernoulli and the Gaussian distributions are both examples of exponential family distributions. Constructing GLMs on top of the exponential family recovers logistic regression and least-squares regression in a unified way, and lets new models be derived and applied to other classification and regression problems.
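The Bernoulli case can be checked numerically (a small sketch using the standard identities η = log(φ/(1 − φ)), T(y) = y, b(y) = 1, a(η) = log(1 + e^η)):

```python
import math

def bernoulli_pmf(y, phi):
    """Standard Bernoulli: p(y; phi) = phi^y (1 - phi)^(1 - y)."""
    return phi ** y * (1 - phi) ** (1 - y)

def exp_family_pmf(y, eta):
    """Exponential-family form p(y; eta) = b(y) exp(eta * T(y) - a(eta))
    with T(y) = y, b(y) = 1, and log partition a(eta) = log(1 + e^eta)."""
    return math.exp(eta * y - math.log(1 + math.exp(eta)))

phi = 0.3
eta = math.log(phi / (1 - phi))   # natural parameter for Bernoulli(phi)
for y in (0, 1):
    print(bernoulli_pmf(y, phi), exp_family_pmf(y, eta))   # the two forms agree
```

Inverting the natural parameter mapping gives φ = 1/(1 + e^(−η)), which is exactly the logistic function; this is where the logistic regression hypothesis comes from in the GLM construction.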
Looking ahead. Later sets of notes build on these foundations: Gaussian discriminant analysis (which is like logistic regression, but makes stronger assumptions about the data); the Support Vector Machine, widely considered one of the best (and many believe is indeed the best) "off-the-shelf" supervised learning algorithms, together with kernel methods and feature maps; neural networks, where we start small and slowly build up a network step by step, then discuss vectorization and training neural networks with backpropagation; the EM algorithm, first as applied to fitting a mixture of Gaussians and then in a broader view covering a large family of estimation problems with latent variables; and the k-means clustering algorithm.

Course information. Lectures are Monday and Wednesday; depending on the offering, either 10:00-11:20 AM on Zoom or 4:30-5:50pm with links to lectures on Canvas. The schedule is being updated for Spring 2020, and dates are subject to change. Class videos for the current quarter are available to SCPD and non-SCPD students (the publicly available 2008 versions are great as well). The material is organized in "weeks": plan to watch around 10 videos (more or less 10 minutes each) every week, stay up to date so as not to lose the pace of the class, and check the course website at the end of every week for updates. Stay truthful, maintain the Honor Code, and keep learning.