1 00:00:00,000 --> 00:00:00,530 2 00:00:00,530 --> 00:00:02,960 The following content is provided under a Creative 3 00:00:02,960 --> 00:00:04,370 Commons license. 4 00:00:04,370 --> 00:00:07,410 Your support will help MIT OpenCourseWare continue to 5 00:00:07,410 --> 00:00:11,060 offer high-quality educational resources for free. 6 00:00:11,060 --> 00:00:13,960 To make a donation or view additional materials from 7 00:00:13,960 --> 00:00:19,790 hundreds of MIT courses, visit MIT OpenCourseWare at 8 00:00:19,790 --> 00:00:21,040 ocw.mit.edu. 9 00:00:21,040 --> 00:00:22,775 10 00:00:22,775 --> 00:00:24,130 PROFESSOR: All right. 11 00:00:24,130 --> 00:00:27,580 So we've got three main topics to talk about. 12 00:00:27,580 --> 00:00:29,320 One is distributions. 13 00:00:29,320 --> 00:00:30,980 The other is Monte Carlo methods. 14 00:00:30,980 --> 00:00:33,340 And one is on regression. 15 00:00:33,340 --> 00:00:38,720 So for distributions, which distributions have we learned 16 00:00:38,720 --> 00:00:39,970 about in class? 17 00:00:39,970 --> 00:00:43,336 18 00:00:43,336 --> 00:00:44,340 Hmm? 19 00:00:44,340 --> 00:00:46,340 AUDIENCE: Normal. 20 00:00:46,340 --> 00:00:46,720 PROFESSOR: OK. 21 00:00:46,720 --> 00:00:47,970 So we have normal. 22 00:00:47,970 --> 00:00:51,990 23 00:00:51,990 --> 00:00:53,382 What's another one? 24 00:00:53,382 --> 00:00:54,326 AUDIENCE: Uniform. 25 00:00:54,326 --> 00:00:55,576 PROFESSOR: OK. 26 00:00:55,576 --> 00:00:59,530 27 00:00:59,530 --> 00:01:01,160 And there's one more that he's kind of 28 00:01:01,160 --> 00:01:03,542 mentioned, I think, in passing. 29 00:01:03,542 --> 00:01:04,760 AUDIENCE: Exponential? 30 00:01:04,760 --> 00:01:05,780 PROFESSOR: Yes. 31 00:01:05,780 --> 00:01:07,030 So exponential. 32 00:01:07,030 --> 00:01:13,430 33 00:01:13,430 --> 00:01:17,010 So for uniform, what would this look like if I were to 34 00:01:17,010 --> 00:01:23,315 plot this as a histogram, and I have endpoints A and B? 
35 00:01:23,315 --> 00:01:26,358 36 00:01:26,358 --> 00:01:30,354 Someone clue me in? 37 00:01:30,354 --> 00:01:30,852 Hmm? 38 00:01:30,852 --> 00:01:32,350 AUDIENCE: [INAUDIBLE] straight line. 39 00:01:32,350 --> 00:01:34,140 PROFESSOR: So it's going to be a horizontal line, right? 40 00:01:34,140 --> 00:01:37,790 41 00:01:37,790 --> 00:01:39,200 And if we were to look at the function for 42 00:01:39,200 --> 00:01:40,470 this, it would be-- 43 00:01:40,470 --> 00:01:46,890 44 00:01:46,890 --> 00:01:52,430 the probability would be 1 over b minus a for all points 45 00:01:52,430 --> 00:01:54,690 between a and b. 46 00:01:54,690 --> 00:01:57,335 So let's look at this graphically. 47 00:01:57,335 --> 00:02:02,160 48 00:02:02,160 --> 00:02:05,010 So this chunk of code should not be too difficult to 49 00:02:05,010 --> 00:02:07,280 understand at this point, right? 50 00:02:07,280 --> 00:02:09,639 All we're doing is we're using the random 51 00:02:09,639 --> 00:02:11,580 number generator, randint. 52 00:02:11,580 --> 00:02:14,720 It's going to return us an integer, random integer from a 53 00:02:14,720 --> 00:02:17,410 uniform distribution between a and b. 54 00:02:17,410 --> 00:02:20,810 Is there anyone that's puzzled by that? 55 00:02:20,810 --> 00:02:21,640 All right. 56 00:02:21,640 --> 00:02:24,280 We're going to do that for numpoints, and then we're 57 00:02:24,280 --> 00:02:25,810 going to plot a histogram. 58 00:02:25,810 --> 00:02:27,980 The only parameter that I don't think you've seen here 59 00:02:27,980 --> 00:02:29,230 is this normed=True. 60 00:02:29,230 --> 00:02:32,100 61 00:02:32,100 --> 00:02:37,460 What this does is, normally, when you use the hist command 62 00:02:37,460 --> 00:02:41,260 in Python, it's going to give you raw frequency counts on 63 00:02:41,260 --> 00:02:42,130 the y-axis. 
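The sampling step just described can be sketched without any plotting library. This is a minimal illustration, not the code shown in class: the helper names `sample_uniform_ints` and `bin_proportions` are hypothetical, and the bounds 1 to 100 and the seed are assumptions chosen so that every integer gets its own bin. It contrasts raw counts per value with the proportion of points per value.

```python
import random
from collections import Counter


def sample_uniform_ints(a, b, num_points, seed=None):
    """Draw num_points integers uniformly from a..b (both ends included)."""
    rng = random.Random(seed)
    return [rng.randint(a, b) for _ in range(num_points)]


def bin_proportions(samples):
    """Map each observed value to the fraction of samples equal to it."""
    counts = Counter(samples)  # raw frequency counts, one bin per integer
    total = len(samples)
    return {value: counts[value] / total for value in sorted(counts)}


# Assumed example values: 100 possible integers, 100,000 draws.
samples = sample_uniform_ints(1, 100, 100_000, seed=0)
proportions = bin_proportions(samples)
```

With 100 equally likely values, each proportion should sit near 0.01 while each raw count sits near 1,000. If you plot with matplotlib instead, recent releases use `density=True` where this lecture uses `normed=True`, to my knowledge.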
64 00:02:42,130 --> 00:02:45,380 What normed=True does is it gives you the proportion of 65 00:02:45,380 --> 00:02:47,890 the points that wound up in a particular bin. 66 00:02:47,890 --> 00:02:52,350 So I can actually show you both ways. 67 00:02:52,350 --> 00:02:56,430 68 00:02:56,430 --> 00:02:59,400 So does that look about right? 69 00:02:59,400 --> 00:03:05,770 For 100 bins, got, what, 100,000 points? 70 00:03:05,770 --> 00:03:11,620 Each one has about 0.01, so it looks right, 1% in each bin. 71 00:03:11,620 --> 00:03:13,330 So that was normed. 72 00:03:13,330 --> 00:03:14,580 If we do it un-normed-- 73 00:03:14,580 --> 00:03:21,990 74 00:03:21,990 --> 00:03:24,620 see how the y-axis here has changed? 75 00:03:24,620 --> 00:03:29,230 Before, it was from like 0 to 0.12, or [? 0-1 ?] 76 00:03:29,230 --> 00:03:33,060 Now, it's from 0 to like 1,000. 77 00:03:33,060 --> 00:03:34,870 That's all that normed parameter does. 78 00:03:34,870 --> 00:03:38,150 But this is what we would expect. 79 00:03:38,150 --> 00:03:39,380 This is for integers. 80 00:03:39,380 --> 00:03:46,700 And then, of course, Python also has a way of doing it for 81 00:03:46,700 --> 00:03:47,880 floating point. 82 00:03:47,880 --> 00:03:51,210 So here, we are going to use the uniform command. 83 00:03:51,210 --> 00:03:54,650 And then when I say, show continuous uniform, going to 84 00:03:54,650 --> 00:03:59,920 give it the a and b, 0 and 1.0. 85 00:03:59,920 --> 00:04:03,660 And it's really not going to look all that much different. 86 00:04:03,660 --> 00:04:08,265 It's just that the x-axis is from 0 to 1. 87 00:04:08,265 --> 00:04:15,440 88 00:04:15,440 --> 00:04:15,632 OK. 89 00:04:15,632 --> 00:04:19,329 So uniform is easy. 90 00:04:19,329 --> 00:04:21,594 What does a Gaussian look like, or a normal? 91 00:04:21,594 --> 00:04:27,710 92 00:04:27,710 --> 00:04:31,880 Like if I were to plot it, what should this look like? 93 00:04:31,880 --> 00:04:33,700 AUDIENCE: Bell curve. 
94 00:04:33,700 --> 00:04:36,040 PROFESSOR: OK, it'll be a bell curve. 95 00:04:36,040 --> 00:04:37,661 Where is its peak going to be? 96 00:04:37,661 --> 00:04:39,425 AUDIENCE: Exactly in the middle? 97 00:04:39,425 --> 00:04:40,310 AUDIENCE: At the mean. 98 00:04:40,310 --> 00:04:40,850 PROFESSOR: At the mean. 99 00:04:40,850 --> 00:04:42,200 Thank you. 100 00:04:42,200 --> 00:04:44,190 So the peak is going to be at the mean. 101 00:04:44,190 --> 00:04:46,430 We usually denote it with mu. 102 00:04:46,430 --> 00:04:49,540 And then it's going to fall off asymmetrically or 103 00:04:49,540 --> 00:04:52,370 symmetrically off on either side? 104 00:04:52,370 --> 00:04:53,620 Symmetrically. 105 00:04:53,620 --> 00:04:57,330 106 00:04:57,330 --> 00:05:01,200 Now, a Gaussian can be specified fully using two 107 00:05:01,200 --> 00:05:01,820 parameters. 108 00:05:01,820 --> 00:05:04,730 What are they? 109 00:05:04,730 --> 00:05:07,920 You have one here, and then you have standard deviation. 110 00:05:07,920 --> 00:05:12,976 So mean and sigma. 111 00:05:12,976 --> 00:05:17,040 112 00:05:17,040 --> 00:05:20,020 Now, the function for this is not something you're going to 113 00:05:20,020 --> 00:05:20,530 have to know. 114 00:05:20,530 --> 00:05:24,770 But I wanted to show it to you. 115 00:05:24,770 --> 00:05:26,810 And the stats major can correct me if I'm wrong. 116 00:05:26,810 --> 00:05:45,580 117 00:05:45,580 --> 00:05:47,760 So it might be a little scary. 118 00:05:47,760 --> 00:05:48,300 I don't know. 119 00:05:48,300 --> 00:05:50,130 It intimidated me the first time I saw it. 120 00:05:50,130 --> 00:05:51,540 Does that look about right to you? 121 00:05:51,540 --> 00:05:51,830 AUDIENCE: Yes. 122 00:05:51,830 --> 00:05:52,830 PROFESSOR: All right. 
123 00:05:52,830 --> 00:05:56,170 So the reason why I threw that out there is because what I 124 00:05:56,170 --> 00:06:03,380 want to do is show you the ideal form when we plot out 125 00:06:03,380 --> 00:06:07,960 this function, versus a bunch of random samples we've drawn 126 00:06:07,960 --> 00:06:10,280 from a distribution that is Gaussian. 127 00:06:10,280 --> 00:06:15,200 128 00:06:15,200 --> 00:06:19,580 So, I have a function make Gaussian plot. 129 00:06:19,580 --> 00:06:23,240 All it takes is the mean, standard deviation, how many 130 00:06:23,240 --> 00:06:26,480 points we want to draw from the distribution. 131 00:06:26,480 --> 00:06:28,720 And then I have a parameter here, show ideal. 132 00:06:28,720 --> 00:06:33,520 And we'll get to that in a second. 133 00:06:33,520 --> 00:06:38,160 The function that we use is called dot Gauss. 134 00:06:38,160 --> 00:06:41,410 And it just takes a mean and the standard deviation. 135 00:06:41,410 --> 00:06:44,930 136 00:06:44,930 --> 00:06:48,860 We're also going to compute the ideal points. 137 00:06:48,860 --> 00:06:54,010 So if I take the mean, and I go a couple of standard 138 00:06:54,010 --> 00:06:58,590 deviations in either direction on the x-axis, then I can plot 139 00:06:58,590 --> 00:07:01,270 out what the y should be according to this function 140 00:07:01,270 --> 00:07:06,690 here, and then just do a histogram. 141 00:07:06,690 --> 00:07:09,710 142 00:07:09,710 --> 00:07:12,030 If I want to show this plot, that's what 143 00:07:12,030 --> 00:07:13,920 that parameter controls. 144 00:07:13,920 --> 00:07:15,940 It'll plot out the function. 145 00:07:15,940 --> 00:07:19,120 And if not, then it'll just plot the histogram. 146 00:07:19,120 --> 00:07:21,340 So let's see what this looks like with just the histogram. 147 00:07:21,340 --> 00:07:34,370 148 00:07:34,370 --> 00:07:37,290 So it looks like what we would expect. 149 00:07:37,290 --> 00:07:38,610 We have the nice bell shape. 
150 00:07:38,610 --> 00:07:42,350 It's centered at 0, and it's got a standard deviation of 1. 151 00:07:42,350 --> 00:07:46,760 152 00:07:46,760 --> 00:07:49,680 These are the relative frequencies of a random 153 00:07:49,680 --> 00:07:54,150 sampling of points from a Gaussian distribution. 154 00:07:54,150 --> 00:07:59,950 And we can see that if we look at the ideal version or the 155 00:07:59,950 --> 00:08:14,595 actual function, it matches very closely. 156 00:08:14,595 --> 00:08:20,200 157 00:08:20,200 --> 00:08:25,850 And then for various shapes, standard deviation of 2, 158 00:08:25,850 --> 00:08:28,520 different mean, different standard deviation. 159 00:08:28,520 --> 00:08:31,500 So it's pretty easy, right? 160 00:08:31,500 --> 00:08:34,200 Are there any questions on Gaussian distributions or 161 00:08:34,200 --> 00:08:35,450 normal distributions? 162 00:08:35,450 --> 00:08:45,760 163 00:08:45,760 --> 00:08:45,790 Ok. 164 00:08:45,790 --> 00:08:48,410 So, the last one we have-- 165 00:08:48,410 --> 00:08:49,660 AUDIENCE: [INAUDIBLE]. 166 00:08:49,660 --> 00:08:52,170 167 00:08:52,170 --> 00:08:53,030 PROFESSOR: Oh -- 168 00:08:53,030 --> 00:08:55,060 frange is a custom function. 169 00:08:55,060 --> 00:08:56,890 So we actually define it up here. 170 00:08:56,890 --> 00:08:57,320 AUDIENCE: [INAUDIBLE]. 171 00:08:57,320 --> 00:09:01,640 PROFESSOR: Was kind of hoping I could slip that past you. 172 00:09:01,640 --> 00:09:05,930 It's just like range, except instead of integers, it 173 00:09:05,930 --> 00:09:09,010 returns a list of floating point numbers separated by 174 00:09:09,010 --> 00:09:10,270 step argument. 175 00:09:10,270 --> 00:09:18,400 So it starts at a lower-end range start, and stops at the 176 00:09:18,400 --> 00:09:24,950 stop, and then increments by step, until it returns a bunch 177 00:09:24,950 --> 00:09:26,200 of floating point numbers. 
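The pieces discussed here (the ideal Gaussian curve, the random draws, and the frange helper) can be sketched as follows. This is a hedged reconstruction rather than the code on screen: the formula written on the board is not captured in the transcript, so `gaussian_pdf` assumes the standard normal density, and "dot Gauss" is assumed to mean `random.gauss`. The names `draw_gaussian_samples` and `ideal_curve`, and the choice to exclude the stop value in `frange`, are illustrative assumptions.

```python
import math
import random


def frange(start, stop, step):
    """Like range, but for floats: start, start+step, ... below stop."""
    values = []
    i = 0
    # Multiply by the index instead of adding repeatedly to limit drift.
    while start + i * step < stop:
        values.append(start + i * step)
        i += 1
    return values


def gaussian_pdf(x, mean, std_dev):
    """Standard normal density with the given mean and standard deviation."""
    coefficient = 1.0 / (std_dev * math.sqrt(2.0 * math.pi))
    exponent = -((x - mean) ** 2) / (2.0 * std_dev ** 2)
    return coefficient * math.exp(exponent)


def draw_gaussian_samples(mean, std_dev, num_points, seed=None):
    """Random draws from a Gaussian, one call to gauss per point."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std_dev) for _ in range(num_points)]


def ideal_curve(mean, std_dev, num_std_devs=4.0, step=0.5):
    """(x, y) pairs of the ideal curve a few std devs either side of the mean."""
    xs = frange(mean - num_std_devs * std_dev,
                mean + num_std_devs * std_dev, step)
    return [(x, gaussian_pdf(x, mean, std_dev)) for x in xs]
```

Plotting a normalized histogram of `draw_gaussian_samples(0, 1, 100000)` against `ideal_curve(0, 1)` should reproduce the close match between samples and the ideal curve that the lecture describes.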
178 00:09:26,200 --> 00:09:37,240 179 00:09:37,240 --> 00:09:39,710 The last one is the exponential distribution. 180 00:09:39,710 --> 00:09:42,490 And I don't know-- did he really explain what the shape 181 00:09:42,490 --> 00:09:47,350 looked like for this at all? 182 00:09:47,350 --> 00:09:50,740 So we can go really quickly through it, because it doesn't 183 00:09:50,740 --> 00:09:55,850 sound like he actually expects you to know it too deeply. 184 00:09:55,850 --> 00:09:59,250 Basically, it'll look like that. 185 00:09:59,250 --> 00:10:01,270 And the function is-- 186 00:10:01,270 --> 00:10:13,410 187 00:10:13,410 --> 00:10:14,840 you don't need to know it. 188 00:10:14,840 --> 00:10:16,865 It's just there for your edification. 189 00:10:16,865 --> 00:10:21,030 190 00:10:21,030 --> 00:10:24,400 Lambda is greater than 0. 191 00:10:24,400 --> 00:10:27,340 So I'm just going to show you what it looks like, and then 192 00:10:27,340 --> 00:10:28,590 we'll move on. 193 00:10:28,590 --> 00:10:38,990 194 00:10:38,990 --> 00:10:44,250 So here, the blue are the sample points, and the red is 195 00:10:44,250 --> 00:10:45,500 the ideal curve. 196 00:10:45,500 --> 00:10:47,940 197 00:10:47,940 --> 00:10:50,559 Just different values of lambda. 198 00:10:50,559 --> 00:10:52,924 AUDIENCE: Does it always have a downward slope like that for 199 00:10:52,924 --> 00:10:55,290 it to be exponential? 200 00:10:55,290 --> 00:10:58,940 PROFESSOR: Yeah, in this case. 201 00:10:58,940 --> 00:11:02,180 There's another family of distributions that we're not 202 00:11:02,180 --> 00:11:03,430 going to touch on. 203 00:11:03,430 --> 00:11:07,310 204 00:11:07,310 --> 00:11:12,660 But that is it for distributions for today. 205 00:11:12,660 --> 00:11:14,110 Unless anyone has any questions, I'm 206 00:11:14,110 --> 00:11:17,150 going to move on. 207 00:11:17,150 --> 00:11:19,310 OK. 208 00:11:19,310 --> 00:11:28,950 So the next big topic is Monte Carlo methods. 
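For the exponential distribution covered just before this point, a short sketch may help. The function written on the board is not in the transcript, so this assumes the standard density, lambda times e to the minus lambda x for x at or above 0, with lambda greater than 0 as stated. The names `exponential_pdf` and `draw_exponential_samples` are hypothetical; `random.expovariate` is the standard-library sampler.

```python
import math
import random


def exponential_pdf(x, lam):
    """Assumed standard form: lam * exp(-lam * x) for x >= 0, else 0."""
    if lam <= 0:
        raise ValueError("lambda must be greater than 0")
    return lam * math.exp(-lam * x) if x >= 0 else 0.0


def draw_exponential_samples(lam, num_points, seed=None):
    """Random draws from an exponential distribution with rate lam."""
    rng = random.Random(seed)
    return [rng.expovariate(lam) for _ in range(num_points)]
```

Because the density only decreases as x grows, the curve always slopes downward from its value of lambda at x = 0, which matches the answer given to the audience question; larger values of lambda make it start higher and fall faster.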
209 00:11:28,950 --> 00:11:34,670 So can someone give me an informal definition of what a 210 00:11:34,670 --> 00:11:36,265 Monte Carlo method is? 211 00:11:36,265 --> 00:11:40,971 212 00:11:40,971 --> 00:11:46,350 AUDIENCE: Really roughly, is it based on using a random 213 00:11:46,350 --> 00:11:48,900 method to try to approximate something that's not random, 214 00:11:48,900 --> 00:11:52,370 by doing it many, many times over? 215 00:11:52,370 --> 00:11:53,550 PROFESSOR: Yeah, more or less. 216 00:11:53,550 --> 00:11:57,470 It's trying to arrive at a solution by repeated sampling, 217 00:11:57,470 --> 00:11:58,720 or random sampling. 218 00:11:58,720 --> 00:12:00,950 219 00:12:00,950 --> 00:12:05,040 And we've seen many different applications of this. 220 00:12:05,040 --> 00:12:10,240 But we're going to review them and kind of try and get a 221 00:12:10,240 --> 00:12:11,820 better understanding. 222 00:12:11,820 --> 00:12:15,840 So the Monty Hall problem. 223 00:12:15,840 --> 00:12:18,670 This is a Monte Carlo simulation. 224 00:12:18,670 --> 00:12:23,582 So, one, what's the action that a person should take? 225 00:12:23,582 --> 00:12:24,430 AUDIENCE: [INAUDIBLE]. 226 00:12:24,430 --> 00:12:24,800 PROFESSOR: All right. 227 00:12:24,800 --> 00:12:27,430 And does anyone remember what proportion of the time if they 228 00:12:27,430 --> 00:12:28,754 switch they won? 229 00:12:28,754 --> 00:12:29,640 AUDIENCE: 2/3. 230 00:12:29,640 --> 00:12:31,700 PROFESSOR: Two-thirds, Ok. 231 00:12:31,700 --> 00:12:34,770 So I happen to know this works-- 232 00:12:34,770 --> 00:12:37,470 233 00:12:37,470 --> 00:12:38,934 maybe. 234 00:12:38,934 --> 00:12:41,419 I think my program died. 235 00:12:41,419 --> 00:12:48,380 236 00:12:48,380 --> 00:12:52,640 OK, so it works. 237 00:12:52,640 --> 00:12:57,710 Is this code confusing to anyone or cryptic? 
238 00:12:57,710 --> 00:13:00,420 I tried to make it a little bit simpler than the code that 239 00:13:00,420 --> 00:13:01,770 was in the handout for class. 240 00:13:01,770 --> 00:13:06,020 241 00:13:06,020 --> 00:13:07,620 We have a number of trials. 242 00:13:07,620 --> 00:13:10,750 We're going to pick a door for the prize. 243 00:13:10,750 --> 00:13:12,170 The player's going to choose a door. 244 00:13:12,170 --> 00:13:14,820 245 00:13:14,820 --> 00:13:18,720 If they choose to stay, and the prize is in the door that 246 00:13:18,720 --> 00:13:21,400 they chose, then stay wins. 247 00:13:21,400 --> 00:13:27,680 And if they choose to switch, and the prize door is not the 248 00:13:27,680 --> 00:13:31,060 door that they originally chose, then switch wins. 249 00:13:31,060 --> 00:13:34,447 250 00:13:34,447 --> 00:13:36,350 So it's easy. 251 00:13:36,350 --> 00:13:40,380 What I wanted to try and do is look at an intuitive 252 00:13:40,380 --> 00:13:41,730 explanation for this. 253 00:13:41,730 --> 00:13:45,120 254 00:13:45,120 --> 00:13:47,870 At office hours, we were kicking around different ways 255 00:13:47,870 --> 00:13:49,200 of explaining this. 256 00:13:49,200 --> 00:13:53,550 And we went to Wikipedia, and we found this explanation. 257 00:13:53,550 --> 00:13:59,700 So the idea is let's say that the contestant 258 00:13:59,700 --> 00:14:01,130 chooses door One. 259 00:14:01,130 --> 00:14:04,890 So there's a 1/3 probability that they've chosen the door 260 00:14:04,890 --> 00:14:07,210 that has the prize behind it. 261 00:14:07,210 --> 00:14:10,860 And then there's a 1/3 probability that it's behind 262 00:14:10,860 --> 00:14:13,120 door number Two, 1/3 probability it's behind door 263 00:14:13,120 --> 00:14:14,790 number Three. 
264 00:14:14,790 --> 00:14:18,080 The key to this kind of explanation is that if you 265 00:14:18,080 --> 00:14:21,080 consider both Two and Three together, then there's a 2/3 266 00:14:21,080 --> 00:14:25,600 probability that the prize is behind one of those two doors. 267 00:14:25,600 --> 00:14:28,880 268 00:14:28,880 --> 00:14:31,210 So the player chooses, and then Monty opens a door. 269 00:14:31,210 --> 00:14:34,390 There's a goat behind door number Three. 270 00:14:34,390 --> 00:14:37,140 This new knowledge doesn't change, though, the 271 00:14:37,140 --> 00:14:41,180 probability that you chose the correct door. 272 00:14:41,180 --> 00:14:45,520 So you still have 1/3 chance that One was the correct door. 273 00:14:45,520 --> 00:14:49,300 And there's still 2/3 chance on this side. 274 00:14:49,300 --> 00:14:52,100 But you know this one is 0, because you see the goat. 275 00:14:52,100 --> 00:14:57,542 So this door has to have a 2/3 chance of having the prize. 276 00:14:57,542 --> 00:14:58,960 Does that agree with you? 277 00:14:58,960 --> 00:15:01,550 278 00:15:01,550 --> 00:15:04,960 So it's one way of explaining it. 279 00:15:04,960 --> 00:15:05,370 I don't know. 280 00:15:05,370 --> 00:15:09,950 I had problems getting this into my head. 281 00:15:09,950 --> 00:15:12,170 Does anyone want me to try again? 282 00:15:12,170 --> 00:15:14,066 All right. 283 00:15:14,066 --> 00:15:15,316 AUDIENCE: [INAUDIBLE] 284 00:15:15,316 --> 00:15:17,538 285 00:15:17,538 --> 00:15:22,498 two doors the probability that your goat is going to be 286 00:15:22,498 --> 00:15:25,308 [INAUDIBLE] behind the door you chose [INAUDIBLE], so it's 287 00:15:25,308 --> 00:15:26,820 basically the same [INAUDIBLE]? 288 00:15:26,820 --> 00:15:29,490 PROFESSOR: Same idea, but kind of negating it, and thinking 289 00:15:29,490 --> 00:15:31,073 of it from the negative direction. 
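The Monty Hall simulation walked through above can be sketched in a few lines. This is an assumed reconstruction from the spoken description, not the class handout: the function name `simulate_monty_hall` and the door numbering 1 to 3 are hypothetical. It relies on the same observation as the explanation: staying wins exactly when the first pick was the prize door, and switching wins in every other case.

```python
import random


def simulate_monty_hall(num_trials, seed=None):
    """Return (stay win fraction, switch win fraction) over num_trials games."""
    rng = random.Random(seed)
    stay_wins = 0
    switch_wins = 0
    for _ in range(num_trials):
        prize_door = rng.randint(1, 3)   # where the prize is hidden
        chosen_door = rng.randint(1, 3)  # the player's first pick
        if chosen_door == prize_door:
            stay_wins += 1    # staying keeps the prize
        else:
            switch_wins += 1  # Monty removes the other goat, so switching wins
    return stay_wins / num_trials, switch_wins / num_trials
```

Over many trials the two fractions should settle near 1/3 for staying and 2/3 for switching.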
290 00:15:31,073 --> 00:15:34,900 291 00:15:34,900 --> 00:15:37,780 Another explanation that was good was if you had a million 292 00:15:37,780 --> 00:15:43,670 doors, and you had 999,999 goats, and you had one prize, 293 00:15:43,670 --> 00:15:45,000 you have a one in a million chance of 294 00:15:45,000 --> 00:15:46,510 choosing the right door. 295 00:15:46,510 --> 00:15:49,770 So now imagine Monty walking down and opening up 296 00:15:49,770 --> 00:15:54,400 999,998 doors, each with a goat behind it. 297 00:15:54,400 --> 00:15:57,560 Well, now you have your door that's still closed, and the 298 00:15:57,560 --> 00:16:02,230 door that's a mystery, also closed. 299 00:16:02,230 --> 00:16:04,993 The probability that you chose the correct door is still one 300 00:16:04,993 --> 00:16:06,310 in a million. 301 00:16:06,310 --> 00:16:11,380 So if you see 999,998 goats, and one closed door, and you 302 00:16:11,380 --> 00:16:14,080 know that your door only has a one in a million chance, you 303 00:16:14,080 --> 00:16:15,770 want to switch to the other door, because that probably 304 00:16:15,770 --> 00:16:18,440 has the prize. 305 00:16:18,440 --> 00:16:20,860 So different ways of thinking about it. 306 00:16:20,860 --> 00:16:23,960 The probability problems and statistics problems, it always 307 00:16:23,960 --> 00:16:26,410 helps to-- or at least, I think it does-- to have an 308 00:16:26,410 --> 00:16:28,850 intuitive idea of what's going on. 309 00:16:28,850 --> 00:16:33,670 So with that said, let's talk about pi. 310 00:16:33,670 --> 00:16:39,870 Because this is one of my favorite Monte Carlo methods. 311 00:16:39,870 --> 00:16:42,250 Because it's got a nice explanation. 312 00:16:42,250 --> 00:16:48,500 So does anyone need me to talk about the idea behind this, 313 00:16:48,500 --> 00:16:52,435 like how this method works, or to go through it? 314 00:16:52,435 --> 00:16:56,200 315 00:16:56,200 --> 00:16:57,450 Someone's nodding. 
316 00:16:57,450 --> 00:17:00,710 317 00:17:00,710 --> 00:17:05,705 So the idea is we have a square. 318 00:17:05,705 --> 00:17:10,140 319 00:17:10,140 --> 00:17:17,630 And its side is 2r units long. 320 00:17:17,630 --> 00:17:19,020 So what's the area of the square? 321 00:17:19,020 --> 00:17:22,450 322 00:17:22,450 --> 00:17:23,700 So Asq ... 323 00:17:23,700 --> 00:17:27,390 324 00:17:27,390 --> 00:17:29,970 squared, right? 325 00:17:29,970 --> 00:17:32,170 Now, we still have a circle that's 326 00:17:32,170 --> 00:17:33,420 inscribed in the square. 327 00:17:33,420 --> 00:17:37,100 328 00:17:37,100 --> 00:17:39,850 And it's got a radius of r. 329 00:17:39,850 --> 00:17:41,150 So area of circle. 330 00:17:41,150 --> 00:17:46,490 331 00:17:46,490 --> 00:17:51,800 If we take the ratio of the circle to the area of the 332 00:17:51,800 --> 00:17:57,410 square, then we have pi over 4. 333 00:17:57,410 --> 00:18:01,875 Now, let's assume that I throw darts at this. 334 00:18:01,875 --> 00:18:04,430 335 00:18:04,430 --> 00:18:07,630 Wakes people up. 336 00:18:07,630 --> 00:18:12,670 And there's a uniform probability that the point 337 00:18:12,670 --> 00:18:15,856 will land somewhere in the square here. 338 00:18:15,856 --> 00:18:23,800 If I throw N of these, then I can expect pi over 4 of them, 339 00:18:23,800 --> 00:18:25,865 times N, to wind up in the circle. 340 00:18:25,865 --> 00:18:28,500 341 00:18:28,500 --> 00:18:31,130 And since I find this number and this number, and I want to 342 00:18:31,130 --> 00:18:33,326 find pi, I can just rearrange this. 343 00:18:33,326 --> 00:18:36,295 344 00:18:36,295 --> 00:18:37,545 That's how we get pi. 345 00:18:37,545 --> 00:18:40,590 346 00:18:40,590 --> 00:18:42,500 So let's go to the code. 347 00:18:42,500 --> 00:18:45,800 348 00:18:45,800 --> 00:18:47,830 We just have some easy code. 
349 00:18:47,830 --> 00:18:51,280 It gets a random point within a square that's from minus r 350 00:18:51,280 --> 00:18:55,580 to r, so 2r units long. 351 00:18:55,580 --> 00:18:57,840 I have a function that makes a whole bunch of points. 352 00:18:57,840 --> 00:19:00,470 353 00:19:00,470 --> 00:19:02,760 And then I have a function that checks if a point is 354 00:19:02,760 --> 00:19:09,740 within a circle of radius r and another function that 355 00:19:09,740 --> 00:19:11,910 looks at a bunch of points and counts how many 356 00:19:11,910 --> 00:19:15,330 are within the circle. 357 00:19:15,330 --> 00:19:17,290 And then I have my compute pi function here. 358 00:19:17,290 --> 00:19:19,900 359 00:19:19,900 --> 00:19:23,500 And all it does is you can either pass it some points 360 00:19:23,500 --> 00:19:29,200 that are already made, or just say, I want to have 100,000 361 00:19:29,200 --> 00:19:31,990 darts thrown at this square. 362 00:19:31,990 --> 00:19:36,230 And it'll make a whole bunch of those random points, figure 363 00:19:36,230 --> 00:19:38,470 out how many are in the circle. 364 00:19:38,470 --> 00:19:40,380 And then we have-- 365 00:19:40,380 --> 00:19:48,080 this would be m and numpoints N. If we multiply it by 4, 366 00:19:48,080 --> 00:19:53,190 that gives us pi, more or less. 367 00:19:53,190 --> 00:19:57,893 So let's look at a couple of plots. 368 00:19:57,893 --> 00:20:00,530 I have a function here, runtrials. 369 00:20:00,530 --> 00:20:02,920 And what it's going to do is it's going to run a number of 370 00:20:02,920 --> 00:20:07,020 trials for a given number of points. 371 00:20:07,020 --> 00:20:15,900 So what I want to do is I'm going to run 50 trials for 372 00:20:15,900 --> 00:20:17,900 each number of points. 373 00:20:17,900 --> 00:20:21,590 And I'm going to have a points list that goes from 10 to 374 00:20:21,590 --> 00:20:23,950 10,000 in 1000-point increments. 
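The dart-throwing estimate described above can be sketched as follows. The function names mirror the roles named in the lecture (get a random point, check whether it is in the circle, compute pi), but the exact names and signatures here are assumptions, not the code on screen. The estimate is 4 times m over N, where m is the number of darts inside the circle and N is the total number thrown.

```python
import random


def get_random_point(r, rng):
    """Uniform random point in the square from -r to r on both axes."""
    return rng.uniform(-r, r), rng.uniform(-r, r)


def in_circle(point, r):
    """True if the point lies within the circle of radius r at the origin."""
    x, y = point
    return x * x + y * y <= r * r


def compute_pi(num_points, r=1.0, seed=None):
    """Estimate pi as 4 * (darts inside the circle) / (darts thrown)."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(num_points)
        if in_circle(get_random_point(r, rng), r)
    )
    return 4.0 * inside / num_points
```

Calling `compute_pi(10)` repeatedly gives widely scattered values, while `compute_pi(100000)` clusters much more tightly around pi, which is the behavior the trial plots in the lecture illustrate.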
375 00:20:23,950 --> 00:20:26,770 376 00:20:26,770 --> 00:20:28,470 I'm going to run the trials and get the results. 377 00:20:28,470 --> 00:20:31,880 And then I'm going to plot my results. 378 00:20:31,880 --> 00:20:33,380 And why don't we just throw that out there? 379 00:20:33,380 --> 00:20:48,410 380 00:20:48,410 --> 00:20:48,650 OK. 381 00:20:48,650 --> 00:20:52,480 So on the plot, the blue, horizontal line, that's 382 00:20:52,480 --> 00:20:55,750 the actual value of pi, as near as a computer can 383 00:20:55,750 --> 00:20:58,240 approximate it. 384 00:20:58,240 --> 00:21:00,800 On the x-axis, we have the number of darts that we threw 385 00:21:00,800 --> 00:21:03,320 at the square. 386 00:21:03,320 --> 00:21:07,260 And each red dot represents the result of one trial of 387 00:21:07,260 --> 00:21:12,120 throwing however many darts at a board. 388 00:21:12,120 --> 00:21:16,945 So when you're down here, and you're only throwing 10 darts, 389 00:21:16,945 --> 00:21:19,130 you tend to have a very wide spread for the 390 00:21:19,130 --> 00:21:21,120 estimated value of pi. 391 00:21:21,120 --> 00:21:28,770 As you increase the number of darts, you get much closer-- 392 00:21:28,770 --> 00:21:31,330 I would say shot group, but grouping is probably more 393 00:21:31,330 --> 00:21:34,060 appropriate. 394 00:21:34,060 --> 00:21:38,646 And it's much closer to the actual value of pi. 395 00:21:38,646 --> 00:21:41,980 There's nothing really unusual about this, right? 396 00:21:41,980 --> 00:21:44,760 Nothing confusing? 397 00:21:44,760 --> 00:21:53,020 So another way of visualizing this is to actually, well, 398 00:21:53,020 --> 00:21:55,380 look at the darts that are thrown. 399 00:21:55,380 --> 00:22:02,880 400 00:22:02,880 --> 00:22:06,600 So I have a function here, plot pi scatter. 401 00:22:06,600 --> 00:22:10,520 And this is actually just going to plot this. 
402 00:22:10,520 --> 00:22:13,320 403 00:22:13,320 --> 00:22:17,280 And it's going to do it for 10 points, 100 points, 1,000 404 00:22:17,280 --> 00:22:19,450 points, and 10,000 points. 405 00:22:19,450 --> 00:22:26,500 And we'll see why we can start converging on pi. 406 00:22:26,500 --> 00:22:30,990 So this is with only 10 darts thrown at the square. 407 00:22:30,990 --> 00:22:34,430 The value for pi is really pretty off. 408 00:22:34,430 --> 00:22:36,140 And it doesn't really look very compelling. 409 00:22:36,140 --> 00:22:42,810 410 00:22:42,810 --> 00:22:47,360 In fact, one of the darts actually 411 00:22:47,360 --> 00:22:49,150 fell outside the circle. 412 00:22:49,150 --> 00:22:50,730 Nine of the darts fell inside the circle. 413 00:22:50,730 --> 00:22:54,310 So you're not going to get a real good estimate there. 414 00:22:54,310 --> 00:22:57,240 The blue dots there represent being in the circle. 415 00:22:57,240 --> 00:22:59,810 Red is outside. 416 00:22:59,810 --> 00:23:02,250 So if we do it with 100 points, it starts getting a 417 00:23:02,250 --> 00:23:04,832 little better. 418 00:23:04,832 --> 00:23:08,770 If we do it with 1,000 points, starts getting better. 419 00:23:08,770 --> 00:23:12,690 420 00:23:12,690 --> 00:23:20,720 If we do it with 10,000 points. 421 00:23:20,720 --> 00:23:21,970 Anyone confused? 422 00:23:21,970 --> 00:23:25,330 423 00:23:25,330 --> 00:23:29,840 So I'm going to move on and show you how we can use the 424 00:23:29,840 --> 00:23:32,895 same method to do numeric integration. 425 00:23:32,895 --> 00:23:39,740 426 00:23:39,740 --> 00:23:41,880 So here we go. 427 00:23:41,880 --> 00:23:44,570 Here's that frange function again, so it's 428 00:23:44,570 --> 00:23:48,360 not confusing anyone. 429 00:23:48,360 --> 00:23:53,170 What we're going to do is we're going to use a Monte 430 00:23:53,170 --> 00:23:58,946 Carlo method to integrate a polynomial. 
431 00:23:58,946 --> 00:24:01,790 432 00:24:01,790 --> 00:24:04,030 So let's say that I have-- 433 00:24:04,030 --> 00:24:11,310 434 00:24:11,310 --> 00:24:12,560 what I want to find. 435 00:24:12,560 --> 00:24:19,730 436 00:24:19,730 --> 00:24:22,070 I'm going to do it for-- 437 00:24:22,070 --> 00:24:26,780 because this is a numeric method, let's say do it from 438 00:24:26,780 --> 00:24:27,820 negative 5 to 5. 439 00:24:27,820 --> 00:24:29,155 So I want to do this. 440 00:24:29,155 --> 00:24:37,180 441 00:24:37,180 --> 00:24:39,990 If you haven't had calculus or anything like that, don't 442 00:24:39,990 --> 00:24:41,010 worry about this. 443 00:24:41,010 --> 00:24:45,006 But I think a lot of people have, with a couple of 444 00:24:45,006 --> 00:24:46,710 exceptions. 445 00:24:46,710 --> 00:24:53,350 So this is an easy function to integrate, right? 446 00:24:53,350 --> 00:24:55,570 But there are also some functions that are really hard 447 00:24:55,570 --> 00:24:56,540 or impossible to. 448 00:24:56,540 --> 00:25:02,360 So that's where a lot of software packages actually use 449 00:25:02,360 --> 00:25:06,740 Monte Carlo methods to do a numeric integration for you. 450 00:25:06,740 --> 00:25:14,080 But the idea is the same. I'm going to take a function. 451 00:25:14,080 --> 00:25:17,000 And this is going to be x-squared. 452 00:25:17,000 --> 00:25:19,576 And then I'm going to take an x-min and an x-max. 453 00:25:19,576 --> 00:25:23,750 454 00:25:23,750 --> 00:25:26,020 These become my left and right boundaries. 455 00:25:26,020 --> 00:25:28,310 And then I'm going to find the minimum of the function 456 00:25:28,310 --> 00:25:34,000 between these limits and the maximum of the function. 457 00:25:34,000 --> 00:25:35,460 So you see what I'm doing? 458 00:25:35,460 --> 00:25:37,715 I'm defining a rectangle. 459 00:25:37,715 --> 00:25:40,270 460 00:25:40,270 --> 00:25:41,975 So again, same thing. 
461 00:25:41,975 --> 00:25:47,260 462 00:25:47,260 --> 00:25:48,250 Same principle. 463 00:25:48,250 --> 00:25:49,500 I have the area of the rectangle. 464 00:25:49,500 --> 00:25:54,070 465 00:25:54,070 --> 00:25:57,260 I don't have the area of this guy. 466 00:25:57,260 --> 00:25:59,080 That's what I'm trying to find. 467 00:25:59,080 --> 00:26:04,350 But I know that if I find the ratio, the number of points 468 00:26:04,350 --> 00:26:07,740 that land in the square-- 469 00:26:07,740 --> 00:26:13,220 or the ratio that land in this curve versus the total in the 470 00:26:13,220 --> 00:26:17,565 square, then I can find this area pretty easily. 471 00:26:17,565 --> 00:26:20,360 472 00:26:20,360 --> 00:26:28,940 So this function, find function, y-min, y-max. 473 00:26:28,940 --> 00:26:30,200 Does exactly what it says. 474 00:26:30,200 --> 00:26:33,750 475 00:26:33,750 --> 00:26:37,540 Just goes between x-min and x-max, and then finds where 476 00:26:37,540 --> 00:26:43,150 the function is a minimum and where it's a maximum. 477 00:26:43,150 --> 00:26:46,546 So the function I'm calling f. 478 00:26:46,546 --> 00:26:49,350 It's one of the few single-letter variable names 479 00:26:49,350 --> 00:26:52,850 I'll use that isn't an index counter. 480 00:26:52,850 --> 00:26:56,610 481 00:26:56,610 --> 00:26:58,910 My random point generator, it's 482 00:26:58,910 --> 00:27:00,840 going to take the bounds-- 483 00:27:00,840 --> 00:27:03,090 x-min, x-max, y-min, y-max. 484 00:27:03,090 --> 00:27:05,640 So it's going to uniformly produce a point that falls 485 00:27:05,640 --> 00:27:06,890 within this rectangle. 486 00:27:06,890 --> 00:27:10,240 487 00:27:10,240 --> 00:27:11,520 My make-points -- 488 00:27:11,520 --> 00:27:14,590 it just makes a whole bunch of these. 489 00:27:14,590 --> 00:27:17,790 Then I have this function between curve. 
490 00:27:17,790 --> 00:27:21,610 What this tells me is if I have a point here, it'll 491 00:27:21,610 --> 00:27:24,220 return true, because it's between the 492 00:27:24,220 --> 00:27:28,540 curve and the x-axis. 493 00:27:28,540 --> 00:27:31,510 If it's up here, it's false, right? 494 00:27:31,510 --> 00:27:34,460 495 00:27:34,460 --> 00:27:37,130 Does anyone not understand how that works? 496 00:27:37,130 --> 00:27:38,380 Ah, you're all smart. 497 00:27:38,380 --> 00:27:40,990 498 00:27:40,990 --> 00:27:49,335 So here is our estimate of our main function, estimate area. 499 00:27:49,335 --> 00:27:52,780 You give it a function, x-min, x-max. 500 00:27:52,780 --> 00:27:55,690 I'm going to tell it how many points to toss. 501 00:27:55,690 --> 00:27:58,600 And optionally, we can tell it that we already have points 502 00:27:58,600 --> 00:28:01,160 that have been tossed. 503 00:28:01,160 --> 00:28:04,220 And the first thing we do is find the y-min and the y-max. 504 00:28:04,220 --> 00:28:07,010 505 00:28:07,010 --> 00:28:10,960 And then if we don't have points, we make them. 506 00:28:10,960 --> 00:28:14,910 And then point counter counts how many times a point wound 507 00:28:14,910 --> 00:28:18,030 up between the curve and the x-axis. 508 00:28:18,030 --> 00:28:21,110 509 00:28:21,110 --> 00:28:24,150 And we just iterate through the points. 510 00:28:24,150 --> 00:28:27,185 If it's between the curve, that means it's here. 511 00:28:27,185 --> 00:28:30,150 512 00:28:30,150 --> 00:28:32,340 Then, if it's above the x-axis, we're going to 513 00:28:32,340 --> 00:28:33,590 increment the point counter. 514 00:28:33,590 --> 00:28:37,910 And then if it's below the x-axis, we're going to 515 00:28:37,910 --> 00:28:38,800 decrement the point counter. 516 00:28:38,800 --> 00:28:40,820 So we're accounting for signs here. 517 00:28:40,820 --> 00:28:46,770 So if we had a function that did this, we'd be able to 518 00:28:46,770 --> 00:28:48,020 properly handle it. 
519 00:28:48,020 --> 00:28:51,170 520 00:28:51,170 --> 00:28:55,190 Now we get the rectangular area. 521 00:28:55,190 --> 00:29:00,110 And then all we do is we multiply the rectangular area 522 00:29:00,110 --> 00:29:04,240 by the ratio of the number of points between the curve and 523 00:29:04,240 --> 00:29:08,060 the x-axis and the total number of points thrown. 524 00:29:08,060 --> 00:29:10,250 And that gives us the function area. 525 00:29:10,250 --> 00:29:13,060 526 00:29:13,060 --> 00:29:18,680 So here's my function, x-squared. 527 00:29:18,680 --> 00:29:21,040 And this is just a plot function scatter. 528 00:29:21,040 --> 00:29:23,810 All this is going to do is just do the same thing I did 529 00:29:23,810 --> 00:29:25,060 with the circle. 530 00:29:25,060 --> 00:29:27,070 531 00:29:27,070 --> 00:29:31,920 And I am going to do this for-- 532 00:29:31,920 --> 00:29:35,910 if I tossed 10 points, 100 points, 1,000, 533 00:29:35,910 --> 00:29:38,540 10,000, or a 100,000. 534 00:29:38,540 --> 00:29:39,815 So let's see what this looks like. 535 00:29:39,815 --> 00:29:47,230 536 00:29:47,230 --> 00:29:49,260 Assuming that Python doesn't crash. 537 00:29:49,260 --> 00:29:57,190 538 00:29:57,190 --> 00:29:59,485 So not too nice. 539 00:29:59,485 --> 00:30:06,690 540 00:30:06,690 --> 00:30:26,230 100 points, 1,000 points, 10,000 points. 541 00:30:26,230 --> 00:30:27,480 And then a whole mess of points. 542 00:30:27,480 --> 00:30:36,956 543 00:30:36,956 --> 00:30:40,372 Oh, I crashed it. 544 00:30:40,372 --> 00:30:41,348 Hm? 545 00:30:41,348 --> 00:30:42,812 AUDIENCE: Can't we just [INAUDIBLE]? 546 00:30:42,812 --> 00:30:49,160 547 00:30:49,160 --> 00:30:49,580 PROFESSOR: I'm sorry. 548 00:30:49,580 --> 00:30:50,323 Say that again? 
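A minimal Python sketch of the dart-throwing integration walked through above. The lecture code is not shown in the transcript, so the names `find_y_bounds` and `estimate_area`, the grid scan used to find the minimum and maximum, and the choice to stretch the rectangle so that it always touches the x-axis are assumptions made for illustration, not the code from the recitation.

```python
import random


def find_y_bounds(f, x_min, x_max, steps=1000):
    """Scan f on an evenly spaced grid to approximate its min and max."""
    ys = [f(x_min + (x_max - x_min) * i / steps) for i in range(steps + 1)]
    # Assumption: include 0 so the rectangle always reaches the x-axis.
    return min(min(ys), 0.0), max(max(ys), 0.0)


def estimate_area(f, x_min, x_max, num_points, rng=None):
    """Estimate the signed area between f and the x-axis by tossing points."""
    rng = rng or random.Random()
    y_min, y_max = find_y_bounds(f, x_min, x_max)
    point_counter = 0
    for _ in range(num_points):
        # Uniform point inside the bounding rectangle.
        x = rng.uniform(x_min, x_max)
        y = rng.uniform(y_min, y_max)
        fx = f(x)
        if 0 <= y <= fx:
            point_counter += 1   # between the curve and the axis, above it
        elif fx <= y < 0:
            point_counter -= 1   # between the curve and the axis, below it
    rect_area = (x_max - x_min) * (y_max - y_min)
    return rect_area * point_counter / num_points


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000, 100000):
        print(n, estimate_area(lambda x: x * x, -5, 5, n))
```

For x-squared between negative 5 and 5 the exact area is 250/3, so the printed estimates should settle near 83.3 as the number of points grows.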
549 00:30:50,323 --> 00:30:51,573 AUDIENCE: Calculate [INAUDIBLE] 550 00:30:51,573 --> 00:30:54,187 551 00:30:54,187 --> 00:30:58,534 split up the x-axis to a lot of points, and then multiply 552 00:30:58,534 --> 00:31:01,432 those by the value function [INAUDIBLE] 553 00:31:01,432 --> 00:31:02,420 add them up? 554 00:31:02,420 --> 00:31:04,490 PROFESSOR: You're talking about doing a Riemann 555 00:31:04,490 --> 00:31:05,390 approximation? 556 00:31:05,390 --> 00:31:07,000 AUDIENCE: Yeah, [INAUDIBLE]. 557 00:31:07,000 --> 00:31:09,630 PROFESSOR: Or a Riemann sum? 558 00:31:09,630 --> 00:31:16,640 So his question is, why don't you do something like this? 559 00:31:16,640 --> 00:31:23,650 560 00:31:23,650 --> 00:31:39,610 Divide up the x-axis into very small portions, like that, and 561 00:31:39,610 --> 00:31:41,265 then sum up the areas of these rectangles. 562 00:31:41,265 --> 00:31:43,991 563 00:31:43,991 --> 00:31:46,426 Yeah, you could do that. 564 00:31:46,426 --> 00:31:47,676 AUDIENCE: [INAUDIBLE]? 565 00:31:47,676 --> 00:31:51,310 566 00:31:51,310 --> 00:31:53,460 PROFESSOR: You know, I don't have an answer for that. 567 00:31:53,460 --> 00:31:57,644 I can't say which one would work better. 568 00:31:57,644 --> 00:31:59,000 Do you know, Serena? 569 00:31:59,000 --> 00:32:03,230 570 00:32:03,230 --> 00:32:08,920 I would say that right now, whichever one you prefer. 571 00:32:08,920 --> 00:32:11,810 572 00:32:11,810 --> 00:32:15,090 But I'll see if there's any actual research on whether or 573 00:32:15,090 --> 00:32:16,620 not one is better than the other. 574 00:32:16,620 --> 00:32:19,950 It might turn out that there are certain instances where 575 00:32:19,950 --> 00:32:22,500 doing this sort of approximation is better than 576 00:32:22,500 --> 00:32:24,340 doing the approximation I'm talking about. 577 00:32:24,340 --> 00:32:27,570 578 00:32:27,570 --> 00:32:30,832 But I don't know. 
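The Riemann sum raised in this question can be sketched in a few lines of Python. The function name `riemann_sum` and the use of the midpoint of each slice are assumptions for illustration; the transcript only says to divide the x-axis into small portions and add up the rectangle areas.

```python
def riemann_sum(f, x_min, x_max, num_slices):
    """Midpoint Riemann sum of f over [x_min, x_max]."""
    width = (x_max - x_min) / num_slices
    total = 0.0
    for i in range(num_slices):
        # Evaluate the function at the middle of each slice.
        midpoint = x_min + (i + 0.5) * width
        total += f(midpoint) * width
    return total


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, riemann_sum(lambda x: x * x, -5, 5, n))
```

For a smooth one-dimensional function such as x-squared, this sum gets close to the exact value of 250/3 with relatively few slices.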
579 00:32:30,832 --> 00:32:34,095 Yeah, for this problem, you could definitely use that. 580 00:32:34,095 --> 00:32:38,800 581 00:32:38,800 --> 00:32:41,280 Is everyone good with this? 582 00:32:41,280 --> 00:32:42,260 Anyone confused? 583 00:32:42,260 --> 00:32:45,080 Any questions? 584 00:32:45,080 --> 00:32:45,370 Yeah? 585 00:32:45,370 --> 00:32:49,314 AUDIENCE: I think my concern is that you need a 586 00:32:49,314 --> 00:32:52,765 fantastically large number of darts to get a reasonably good 587 00:32:52,765 --> 00:32:54,750 integration [INAUDIBLE]. 588 00:32:54,750 --> 00:32:56,090 PROFESSOR: Yeah. 589 00:32:56,090 --> 00:32:59,970 That is one issue with Monte Carlo methods, is that they do 590 00:32:59,970 --> 00:33:02,660 rely on large numbers. 591 00:33:02,660 --> 00:33:09,218 So, yeah, sometimes they can take a while. 592 00:33:09,218 --> 00:33:11,628 AUDIENCE: At least for the purposes of this class, we 593 00:33:11,628 --> 00:33:15,002 don't need to be able to quantify the error or anything 594 00:33:15,002 --> 00:33:16,930 like that, right? 595 00:33:16,930 --> 00:33:18,180 PROFESSOR: No. 596 00:33:18,180 --> 00:33:22,080 597 00:33:22,080 --> 00:33:26,480 You do need to understand that there can be error. 598 00:33:26,480 --> 00:33:29,750 And you should also understand stuff like confidence 599 00:33:29,750 --> 00:33:31,190 intervals and confidence levels. 600 00:33:31,190 --> 00:33:34,200 601 00:33:34,200 --> 00:33:35,448 Are you OK with that? 602 00:33:35,448 --> 00:33:37,938 AUDIENCE: Mostly. 603 00:33:37,938 --> 00:33:41,175 But in order to get a confidence interval, you'd 604 00:33:41,175 --> 00:33:44,412 have to do several trials at, say, 605 00:33:44,412 --> 00:33:46,420 100,000 points, and then-- 606 00:33:46,420 --> 00:33:47,670 PROFESSOR: Right, exactly. 607 00:33:47,670 --> 00:33:51,260 608 00:33:51,260 --> 00:33:56,380 You could estimate the error. 609 00:33:56,380 --> 00:33:58,220 Like you could estimate it. 
610 00:33:58,220 --> 00:34:00,860 But in order to really get a good sense for how much 611 00:34:00,860 --> 00:34:03,820 variance there is, you'd have to do repeated trials. 612 00:34:03,820 --> 00:34:05,810 So yeah. 613 00:34:05,810 --> 00:34:08,260 AUDIENCE: What I guess I was getting at was in order to get 614 00:34:08,260 --> 00:34:10,220 a sense of how big the error is relative to 615 00:34:10,220 --> 00:34:11,989 the number of trials-- 616 00:34:11,989 --> 00:34:13,430 PROFESSOR: Yeah. 617 00:34:13,430 --> 00:34:14,855 AUDIENCE: --without sort of analytically. 618 00:34:14,855 --> 00:34:17,710 But I guess that's probably [INAUDIBLE]. 619 00:34:17,710 --> 00:34:19,113 PROFESSOR: I'm sorry, what? 620 00:34:19,113 --> 00:34:20,965 AUDIENCE: That's not something that we're going to be asked 621 00:34:20,965 --> 00:34:22,360 to do, at least in this course? 622 00:34:22,360 --> 00:34:24,475 PROFESSOR: Yeah, no. 623 00:34:24,475 --> 00:34:27,400 The purpose is we want you to understand that when you do 624 00:34:27,400 --> 00:34:32,280 things like this, that there is some thought that has to go 625 00:34:32,280 --> 00:34:33,929 into, well, how many trials do I need to do? 626 00:34:33,929 --> 00:34:36,280 How many points do I need to throw? 627 00:34:36,280 --> 00:34:39,174 And you have to ask yourself, how much error am 628 00:34:39,174 --> 00:34:41,040 I willing to tolerate? 629 00:34:41,040 --> 00:34:45,940 There's the joke that mathematicians call pi pi, and 630 00:34:45,940 --> 00:34:54,810 then engineers call it 3.14. 631 00:34:54,810 --> 00:34:59,770 OK, so if everyone's done with integration, I'm going to move 632 00:34:59,770 --> 00:35:01,020 on to regression. 633 00:35:01,020 --> 00:35:07,370 634 00:35:07,370 --> 00:35:08,260 Oh, wait, now. 635 00:35:08,260 --> 00:35:11,870 There's one thing I wanted to touch on. 636 00:35:11,870 --> 00:35:20,880 So we kind of looked at some toy problems with Monte Carlo.
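The repeated-trials idea from this exchange can be sketched as follows: run the same dart-throwing estimate many times and look at how much the answers spread. The function names, the number of trials, and the fixed seed are assumptions for illustration. The integrand is x-squared between negative 5 and 5, as in the integration example.

```python
import random
import statistics


def estimate_area_x_squared(num_points, rng):
    """One dart-throwing estimate of the area under x**2 on [-5, 5]."""
    hits = 0
    for _ in range(num_points):
        x = rng.uniform(-5, 5)
        y = rng.uniform(0, 25)
        if y <= x * x:
            hits += 1
    return 250 * hits / num_points  # bounding rectangle area is 10 * 25


def spread_of_estimates(num_points, num_trials, seed=0):
    """Mean and sample standard deviation over repeated trials."""
    rng = random.Random(seed)
    estimates = [estimate_area_x_squared(num_points, rng)
                 for _ in range(num_trials)]
    return statistics.mean(estimates), statistics.stdev(estimates)


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        mean, sd = spread_of_estimates(n, num_trials=20)
        # If the estimates are roughly normal, about 95% of single runs
        # should fall within mean +/- 1.96 * sd.
        print(n, mean, sd)
```

The standard deviation should shrink as the number of points per trial grows, which is one way to decide how many points are enough for the error you are willing to tolerate.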
637 00:35:20,880 --> 00:35:24,070 And this is, I guess, a toy problem too, because it has to 638 00:35:24,070 --> 00:35:24,670 do with a toy. 639 00:35:24,670 --> 00:35:28,110 Is everyone familiar with the game of Monopoly? 640 00:35:28,110 --> 00:35:32,780 So I don't have to explain the rules too much in depth? 641 00:35:32,780 --> 00:35:33,550 OK. 642 00:35:33,550 --> 00:35:41,150 So let's assume that there are no factors that modify this 643 00:35:41,150 --> 00:35:43,590 distribution. 644 00:35:43,590 --> 00:35:49,390 If I roll the die twice, then each one of these spaces has 645 00:35:49,390 --> 00:35:51,940 an equal probability of being landed on. 646 00:35:51,940 --> 00:35:53,890 It's about 2 and 1/2%. 647 00:35:53,890 --> 00:35:57,800 But there are certain rules that distort this 648 00:35:57,800 --> 00:35:58,700 distribution. 649 00:35:58,700 --> 00:36:01,200 So you can land on Go To Jail. 650 00:36:01,200 --> 00:36:05,070 You can roll three doubles, and get sent to Jail. 651 00:36:05,070 --> 00:36:08,660 You can draw a Chance card, and get sent to Jail, sent to 652 00:36:08,660 --> 00:36:12,070 Go, or sent anywhere on the board. 653 00:36:12,070 --> 00:36:15,660 And there are 10 out of 16 Chance cards that modify this 654 00:36:15,660 --> 00:36:17,560 distribution. 655 00:36:17,560 --> 00:36:19,230 And for Community Chest, same thing. 656 00:36:19,230 --> 00:36:22,340 There's 2 out of the 16 cards that distort the distribution. 657 00:36:22,340 --> 00:36:27,570 So the question is, how do you do this analytically? 658 00:36:27,570 --> 00:36:28,710 And I've tried. 659 00:36:28,710 --> 00:36:30,700 It's hard. 660 00:36:30,700 --> 00:36:33,020 I'm actually not sure if it's possible. 661 00:36:33,020 --> 00:36:36,270 Well, this is a perfect example of where you would use 662 00:36:36,270 --> 00:36:39,940 a Monte Carlo simulation in order to arrive at the answer. 
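A heavily simplified Python sketch of how such a simulation might start. It models only two of the distortions mentioned above, the Go To Jail square and the three-doubles rule, and leaves out Chance and Community Chest entirely. The square indices (Go at 0, Jail at 10, Go To Jail at 30) follow the standard board layout; the function name and the assumption that a player leaves Jail on the very next roll are illustrative choices, not the rules of the actual problem.

```python
import random

BOARD_SIZE = 40
JAIL = 10
GO_TO_JAIL = 30


def simulate_landings(num_rolls, seed=None):
    """Fraction of rolls that end on each square (cards are ignored)."""
    rng = random.Random(seed)
    counts = [0] * BOARD_SIZE
    position = 0
    doubles_in_a_row = 0
    for _ in range(num_rolls):
        die1, die2 = rng.randint(1, 6), rng.randint(1, 6)
        doubles_in_a_row = doubles_in_a_row + 1 if die1 == die2 else 0
        if doubles_in_a_row == 3:
            # Third consecutive double: go straight to Jail.
            position = JAIL
            doubles_in_a_row = 0
        else:
            position = (position + die1 + die2) % BOARD_SIZE
            if position == GO_TO_JAIL:
                position = JAIL
        counts[position] += 1
    return [count / num_rolls for count in counts]


if __name__ == "__main__":
    freqs = simulate_landings(200000, seed=0)
    ranked = sorted(range(BOARD_SIZE), key=lambda s: freqs[s], reverse=True)
    print(ranked[:3])
```

Even with only these two rules, Jail ends up well above the roughly 2 and 1/2% that a square would get on an undistorted board.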
663 00:36:39,940 --> 00:36:44,780 So if you actually want to take a whack at this problem, 664 00:36:44,780 --> 00:36:48,090 you can go to this site called projecteuler.net. 665 00:36:48,090 --> 00:36:50,880 They have a whole bunch of mathy questions on there that 666 00:36:50,880 --> 00:36:55,740 are meant to get people to think about math and computer 667 00:36:55,740 --> 00:36:58,550 programming. 668 00:36:58,550 --> 00:37:01,890 And you get little rankings the more questions you answer 669 00:37:01,890 --> 00:37:02,890 correctly, and stuff like that. 670 00:37:02,890 --> 00:37:04,790 So there's a little competition. 671 00:37:04,790 --> 00:37:10,650 But the question in this particular case was, what are 672 00:37:10,650 --> 00:37:14,650 the top three places you'll land on 673 00:37:14,650 --> 00:37:16,500 with all these factors? 674 00:37:16,500 --> 00:37:20,950 And if you represent them as a number that is concatenated 675 00:37:20,950 --> 00:37:23,110 one after the other, what is the number? 676 00:37:23,110 --> 00:37:26,140 What is the six-digit number? 677 00:37:26,140 --> 00:37:28,190 But that's a fun problem. 678 00:37:28,190 --> 00:37:30,990 679 00:37:30,990 --> 00:37:34,075 So going onto something that's less fun, regression. 680 00:37:34,075 --> 00:37:37,280 681 00:37:37,280 --> 00:37:43,030 So can someone tell me what purposes we would use 682 00:37:43,030 --> 00:37:44,280 regression for? 683 00:37:44,280 --> 00:37:52,190 684 00:37:52,190 --> 00:37:52,960 Take a stab. 685 00:37:52,960 --> 00:37:53,300 AUDIENCE: Sure. 686 00:37:53,300 --> 00:37:58,100 If you have experimental data which you believe to fit some 687 00:37:58,100 --> 00:37:59,540 type of theoretical model. 688 00:37:59,540 --> 00:38:00,980 But experiments being 689 00:38:00,980 --> 00:38:04,010 experiments, they're not perfect. 
690 00:38:04,010 --> 00:38:06,710 You can't-- 691 00:38:06,710 --> 00:38:09,690 the data points exactly fall in the model, so you have to 692 00:38:09,690 --> 00:38:14,078 find which parameters from the model to pick so that your 693 00:38:14,078 --> 00:38:16,874 experiment [UNINTELLIGIBLE] best fits [INAUDIBLE]. 694 00:38:16,874 --> 00:38:17,810 PROFESSOR: Uh-huh. 695 00:38:17,810 --> 00:38:21,940 So the idea is that you have a bunch of experimental data 696 00:38:21,940 --> 00:38:24,000 that has error. 697 00:38:24,000 --> 00:38:28,260 And you want to be able to maybe find the underlying 698 00:38:28,260 --> 00:38:32,610 function of those observations. 699 00:38:32,610 --> 00:38:35,420 And you would do that using regression. 700 00:38:35,420 --> 00:38:39,880 So we have a couple of nice tools in 701 00:38:39,880 --> 00:38:41,130 Python for doing that. 702 00:38:41,130 --> 00:38:44,070 703 00:38:44,070 --> 00:38:47,710 Actually, before I move on, another reason is you can find 704 00:38:47,710 --> 00:38:48,090 the function. 705 00:38:48,090 --> 00:38:50,310 But you can also then, once you find that function, you 706 00:38:50,310 --> 00:38:52,560 can use it to predict additional values. 707 00:38:52,560 --> 00:38:56,460 So say you have a gap in your data, or you want to predict 708 00:38:56,460 --> 00:38:58,510 values beyond the range that you collected 709 00:38:58,510 --> 00:39:00,050 observations for. 710 00:39:00,050 --> 00:39:03,270 If you do a regression, you find the function, find the 711 00:39:03,270 --> 00:39:05,510 parameters for the function, then you can use it to predict 712 00:39:05,510 --> 00:39:08,450 those values. 713 00:39:08,450 --> 00:39:14,130 And what we mainly want you to understand here are the 714 00:39:14,130 --> 00:39:17,120 functions that you would use to do it, and how you would 715 00:39:17,120 --> 00:39:21,650 tell if you have a good fit or not a good fit, and the idea 716 00:39:21,650 --> 00:39:23,190 of overfitting.
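To make the "find the parameters, then predict" idea concrete, here is a sketch of the simplest case: fitting a straight line by least squares in plain Python and using it to predict a value beyond the observed range. The names `fit_line` and `predict`, the true line y = 2x + 1, and the noise level are all assumptions chosen for illustration.

```python
import random


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [i / 10 for i in range(50)]                      # observed on 0 to 4.9
    ys = [2.0 * x + 1.0 + rng.gauss(0, 0.1) for x in xs]  # noisy observations
    slope, intercept = fit_line(xs, ys)
    # Predict outside the range the observations were collected over.
    print(slope, intercept, predict(slope, intercept, 10.0))
```

Predictions far outside the observed range are only as trustworthy as the assumed model, which is part of why the choice of model matters.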
717 00:39:23,190 --> 00:39:30,710 So we have a little bit of code that demonstrates this, 718 00:39:30,710 --> 00:39:36,980 so a couple of helper functions that compute various 719 00:39:36,980 --> 00:39:40,330 values that you've seen before. 720 00:39:40,330 --> 00:39:44,170 So MSE is the sum of the residual squares. 721 00:39:44,170 --> 00:39:48,230 And then you have the total sum of squares. 722 00:39:48,230 --> 00:39:57,160 So these will help you compute the coefficient of 723 00:39:57,160 --> 00:39:58,410 determination. 724 00:39:58,410 --> 00:40:00,280 725 00:40:00,280 --> 00:40:06,680 And what I'm going to show is let's say I 726 00:40:06,680 --> 00:40:07,930 define a function here. 727 00:40:07,930 --> 00:40:10,380 728 00:40:10,380 --> 00:40:12,470 In this case, I have it defined as x-cubed 729 00:40:12,470 --> 00:40:13,820 plus 5x plus 3. 730 00:40:13,820 --> 00:40:16,580 731 00:40:16,580 --> 00:40:22,310 I am going to, for a certain number of x values, apply the 732 00:40:22,310 --> 00:40:25,860 function and get the y value. 733 00:40:25,860 --> 00:40:28,800 And then to simulate observational data, I'm going 734 00:40:28,800 --> 00:40:33,250 to perturb it using a Gaussian distribution. 735 00:40:33,250 --> 00:40:34,910 So it's going to jitter the points. 736 00:40:34,910 --> 00:40:39,650 737 00:40:39,650 --> 00:40:42,220 And that's what the make observations function does, is 738 00:40:42,220 --> 00:40:47,040 it just adds noise to the y values. 739 00:40:47,040 --> 00:40:49,470 And then I'm going to-- 740 00:40:49,470 --> 00:40:55,730 this function here plots out the measured or observed 741 00:40:55,730 --> 00:41:00,430 values, the simulated. 742 00:41:00,430 --> 00:41:07,750 It computes a fit for one degree. 743 00:41:07,750 --> 00:41:10,297 So in this case, I have two parameters, fit degree 1 and 744 00:41:10,297 --> 00:41:13,000 fit degree 2, because I want to do comparisons.
745 00:41:13,000 --> 00:41:19,080 So it'll compute fit using the first degree and predict some 746 00:41:19,080 --> 00:41:20,500 values for the curve. 747 00:41:20,500 --> 00:41:23,320 748 00:41:23,320 --> 00:41:27,590 And then it'll compute the residual error and the 749 00:41:27,590 --> 00:41:32,290 coefficient of determination and plot it out. 750 00:41:32,290 --> 00:41:36,790 And then it'll do the same thing for the second degree. 751 00:41:36,790 --> 00:41:42,290 752 00:41:42,290 --> 00:41:44,410 Let's see what this looks like. 753 00:41:44,410 --> 00:41:52,460 754 00:41:52,460 --> 00:41:56,885 Let's see Python not behave badly. 755 00:41:56,885 --> 00:41:59,795 756 00:41:59,795 --> 00:42:01,250 There we go. 757 00:42:01,250 --> 00:42:10,640 758 00:42:10,640 --> 00:42:15,130 The function that we plotted was, what, x-squared 759 00:42:15,130 --> 00:42:17,310 something, 5x-squared? 760 00:42:17,310 --> 00:42:18,560 Let me see. 761 00:42:18,560 --> 00:42:22,100 762 00:42:22,100 --> 00:42:24,830 x-cubed plus 5x plus 3. 763 00:42:24,830 --> 00:42:27,960 764 00:42:27,960 --> 00:42:30,780 And we're plotting it from negative 2 to 2. 765 00:42:30,780 --> 00:42:34,170 So this is what I'm talking about with the noise. 766 00:42:34,170 --> 00:42:37,840 So each of these red dots represents some observation 767 00:42:37,840 --> 00:42:41,280 that's been disturbed a little bit. 768 00:42:41,280 --> 00:42:45,800 And then I try to fit this with a first degree 769 00:42:45,800 --> 00:42:48,380 polynomial, and then a second degree. 770 00:42:48,380 --> 00:42:51,570 And I see-- 771 00:42:51,570 --> 00:42:57,304 actually, my residual error is lower for my first degree fit. 772 00:42:57,304 --> 00:42:58,620 That's interesting. 773 00:42:58,620 --> 00:43:01,310 774 00:43:01,310 --> 00:43:04,340 So I don't know. 775 00:43:04,340 --> 00:43:06,690 At this point, I'd say just stop and 776 00:43:06,690 --> 00:43:07,450 don't proceed further. 
777 00:43:07,450 --> 00:43:09,940 But we know that that's not the right function. 778 00:43:09,940 --> 00:43:16,260 So let's look at what we have for a third degree fit. 779 00:43:16,260 --> 00:43:19,025 It's actually worse, huh. 780 00:43:19,025 --> 00:43:24,470 781 00:43:24,470 --> 00:43:26,530 This is the problem with random programs, is that 782 00:43:26,530 --> 00:43:27,780 sometimes they fail you. 783 00:43:27,780 --> 00:43:34,950 784 00:43:34,950 --> 00:43:37,440 I would say that these are nice pretty plots, but they're 785 00:43:37,440 --> 00:43:41,000 not really telling me much, other than I can fit some 786 00:43:41,000 --> 00:43:43,892 lines to some points. 787 00:43:43,892 --> 00:43:45,620 AUDIENCE: What should it look like? 788 00:43:45,620 --> 00:43:48,540 What are you looking for that's not there? 789 00:43:48,540 --> 00:43:51,590 PROFESSOR: So we know that the function that we made the 790 00:43:51,590 --> 00:43:55,900 observations on is a third degree polynomial. 791 00:43:55,900 --> 00:44:06,850 So it's a little puzzling why this first degree fit is 792 00:44:06,850 --> 00:44:14,120 better than our third degree fit. 793 00:44:14,120 --> 00:44:18,800 That's the conundrum. 794 00:44:18,800 --> 00:44:20,150 So maybe-- 795 00:44:20,150 --> 00:44:23,350 I wonder what would happen if I expanded the x range. 796 00:44:23,350 --> 00:44:27,570 So let's say I go from negative 5 to 5. 797 00:44:27,570 --> 00:44:29,820 Maybe it's just too little data. 798 00:44:29,820 --> 00:44:34,150 799 00:44:34,150 --> 00:44:35,400 That's looking a little better. 800 00:44:35,400 --> 00:44:43,790 801 00:44:43,790 --> 00:44:45,040 Now I feel better. 802 00:44:45,040 --> 00:44:48,080 803 00:44:48,080 --> 00:44:51,190 So the issue was that we just were going from negative 2 to 804 00:44:51,190 --> 00:44:54,070 2, and basically it looked linear there. 805 00:44:54,070 --> 00:44:57,370 So the first degree polynomial was doing fine.
806 00:44:57,370 --> 00:45:01,610 But as soon as we go out and get a little curvy in there, 807 00:45:01,610 --> 00:45:04,820 we see that both the first and the second degree fits, they 808 00:45:04,820 --> 00:45:07,770 have pretty high error. 809 00:45:07,770 --> 00:45:09,460 Their R-squared is pretty good. 810 00:45:09,460 --> 00:45:14,940 But when you compare them with, say, a third degree fit, 811 00:45:14,940 --> 00:45:18,240 you see that the error drops down dramatically. 812 00:45:18,240 --> 00:45:22,050 And it's got higher coefficient of determination. 813 00:45:22,050 --> 00:45:25,130 So what we would say in this case is that this third degree 814 00:45:25,130 --> 00:45:29,800 fit here is a lot better than the first or 815 00:45:29,800 --> 00:45:32,660 second degree fit. 816 00:45:32,660 --> 00:45:35,970 And then we can also look at, say, a fourth degree fit, 817 00:45:35,970 --> 00:45:39,040 which in this case happens to have a higher error. 818 00:45:39,040 --> 00:45:41,580 So that's a good thing. 819 00:45:41,580 --> 00:45:45,080 And then if we look at a fifth degree fit, it also has a 820 00:45:45,080 --> 00:45:45,630 higher error. 821 00:45:45,630 --> 00:45:50,810 So we'd say in this case that the third degree fit is 822 00:45:50,810 --> 00:45:53,650 probably our best bet, and we probably have a pretty good 823 00:45:53,650 --> 00:45:56,600 idea of what the function is for the underlying model. 824 00:45:56,600 --> 00:45:59,960 825 00:45:59,960 --> 00:46:02,780 AUDIENCE: Which part of this is regression? 826 00:46:02,780 --> 00:46:06,220 PROFESSOR: Well, the part of this that is regression is-- 827 00:46:06,220 --> 00:46:10,170 828 00:46:10,170 --> 00:46:12,560 the part that actually does the regression is this poly 829 00:46:12,560 --> 00:46:15,200 fit method here.
830 00:46:15,200 --> 00:46:19,030 And what you do is you pass it in the x values, the y values, 831 00:46:19,030 --> 00:46:20,860 and the degree of the polynomial that you 832 00:46:20,860 --> 00:46:22,110 want to fit to it. 833 00:46:22,110 --> 00:46:29,950 834 00:46:29,950 --> 00:46:32,270 I've hit the end of my material, 835 00:46:32,270 --> 00:46:33,785 unless someone has questions. 836 00:46:33,785 --> 00:46:36,950 837 00:46:36,950 --> 00:46:41,476 Comments, fears, trepidations? 838 00:46:41,476 --> 00:46:42,853 AUDIENCE: Just [INAUDIBLE] 839 00:46:42,853 --> 00:46:46,266 having done some stuff-- like in Excel, you can fit curves 840 00:46:46,266 --> 00:46:47,238 with the R-squares? 841 00:46:47,238 --> 00:46:47,724 PROFESSOR: Yeah. 842 00:46:47,724 --> 00:46:50,640 AUDIENCE: The R-squared values are really, really high, like 843 00:46:50,640 --> 00:46:51,890 really, really [? wanting ?] 844 00:46:51,890 --> 00:46:55,240 these fits, even though the fits are pretty terrible. 845 00:46:55,240 --> 00:46:55,830 PROFESSOR: Yeah. 846 00:46:55,830 --> 00:46:57,510 AUDIENCE: So that's weird to me. 847 00:46:57,510 --> 00:47:00,150 PROFESSOR: That is puzzling. 848 00:47:00,150 --> 00:47:04,925 And it's quite possible that I have a bug. 849 00:47:04,925 --> 00:47:07,350 AUDIENCE: I wonder whether there were different 850 00:47:07,350 --> 00:47:09,290 definitions for R-squared that are maybe floating around in 851 00:47:09,290 --> 00:47:10,430 different places? 852 00:47:10,430 --> 00:47:12,200 PROFESSOR: No. 853 00:47:12,200 --> 00:47:14,130 I made a correction to this earlier. 854 00:47:14,130 --> 00:47:16,360 And like I said, maybe I introduced a bug. 855 00:47:16,360 --> 00:47:20,140 So I'm going to have to double-check my math. 856 00:47:20,140 --> 00:47:21,803 Unfortunately, I'm not perfect. 857 00:47:21,803 --> 00:47:23,053 I wish I was. 858 00:47:23,053 --> 00:47:26,058
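A sketch of the regression experiment described in this section, using NumPy's `polyfit` and `polyval`. The helper names `r_squared` and `fit_and_score`, the noise level, the number of points, and the fixed seed are assumptions for illustration and are not the recitation code. The coefficient of determination is computed as 1 minus the residual sum of squares divided by the total sum of squares.

```python
import numpy as np


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    residual_ss = np.sum((observed - predicted) ** 2)
    total_ss = np.sum((observed - np.mean(observed)) ** 2)
    return 1.0 - residual_ss / total_ss


def fit_and_score(xs, ys, degree):
    """Fit a polynomial of the given degree and score it on the same data."""
    coefficients = np.polyfit(xs, ys, degree)   # x values, y values, degree
    predicted = np.polyval(coefficients, xs)
    return coefficients, r_squared(ys, predicted)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xs = np.linspace(-5, 5, 50)
    # Underlying function x**3 + 5x + 3, jittered with Gaussian noise.
    ys = xs ** 3 + 5 * xs + 3 + rng.normal(0, 5.0, xs.size)
    for degree in (1, 2, 3, 4, 5):
        _, score = fit_and_score(xs, ys, degree)
        print(degree, score)
```

When every degree is fitted to the same observations, the residual error on those observations cannot go up as the degree increases, so a higher-degree fit that looks worse usually means the fits were scored against different noisy samples or that there is a bug in the scoring code.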