The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: Last Tuesday, we ended the lecture talking about knapsack problems. We talked about the continuous knapsack problem and the fact that you could solve it optimally with a greedy algorithm. And we looked at the 0-1 knapsack problem and discussed the fact that while we could write greedy algorithms that would solve the problem quickly, we have to be careful what we mean by "solve": while those algorithms would choose a set of items that we could indeed carry away, there was no guarantee that they would choose the optimal items, that is to say, a set that would maximize the objective function of total value.

After that, we looked at a brute force algorithm, on the board only, for finding a guaranteed optimal solution, but observed that on even a moderately sized set of items it might take a decade or so to run. We decided that wasn't very good. Nevertheless, I want to start today by looking at some code that implements a brute force algorithm, not because I expect anyone to actually run it on a real example, but because a bit later in the term we'll see how we could modify it into something that would be practical. And there are some things to learn by looking at it.

So let's look at some code here. I don't expect you to understand all the details of this code in real time. It's more that I want you to understand the basic idea behind it and then the result we get. You'll remember that we looked at the complexity by saying, well, really, it's like binary numbers. So the first helper subroutine I'm going to use is something that generates binary numbers. It takes n, some natural number, and a number of digits, and returns a binary string of that length representing the decimal number n.
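A minimal sketch of such a helper. The name and exact interface here are my assumption, not necessarily the code shown in lecture:

```python
def d_to_b(n, num_digits):
    """Return a binary string of length num_digits representing
    the natural number n, zero-padded on the left."""
    assert 0 <= n < 2 ** num_digits
    result = ''
    while n > 0:
        result = str(n % 2) + result
        n = n // 2
    # Pad with leading zeros so the string always has num_digits digits.
    return result.rjust(num_digits, '0')

print(d_to_b(5, 4))  # → '0101'
```

The left padding is the point of the second argument, as explained next.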
Why am I giving it this number of digits? Because I need to zero pad it. If I want a vector that represents whether or not I take each item, and I take only one item, say the first one, I don't want just a binary string with one digit in it, because I need all those zeros to indicate that I'm not taking the other items. And so the second argument tells me, in effect, how many zeros I'm going to need. There's nothing mysterious about the way it does it.

OK. The next helper function generates the power set of the items. What is a power set? If you take a set, you can ask the question: what are all the subsets of that set? What's the smallest subset of a set? It's the empty set, no items. What's the largest subset of a set? All of the items. And then we have everything in between: the set that contains the first item, the set that contains the second item, et cetera; the set that contains the first and the second, the first and the third. There are a lot of them. And of course, how many is a lot? Well, 2 to the n is a lot.

But now we're going to generate every possible subset of items. And we're going to do this simply by using the decimal-to-binary function to tell us whether or not we keep each item, so we can enumerate them. We can generate them all. And now we have the set of all possible collections of items one might take, irrespective of whether they obey the constraint of not weighing too much.

The next function is the one that does the work. This is the interesting one: chooseBest. It takes a power set, the constraint, and two functions. One, getValue, tells me the value of an item. The other, getWeight, tells me the weight of an item. Then it just goes through and enumerates all possibilities and eventually chooses, I won't say the best set, because it might not be unique (there might be more than one optimal answer), but it finds at least one optimal answer, and then it returns that. Again, it's a very straightforward implementation of the brute force algorithm I sketched on the board.
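A sketch of how those two functions might look. The names, the bit-twiddling in place of the string helper, and the item representation are my assumptions, not the lecture's code:

```python
def gen_powerset(items):
    """Generate all 2**len(items) subsets of items, by binary
    counting: bit j of the counter says whether item j is included."""
    powerset = []
    for i in range(2 ** len(items)):
        subset = []
        for j in range(len(items)):
            if (i >> j) & 1:  # is bit j of i set?
                subset.append(items[j])
        powerset.append(subset)
    return powerset

def choose_best(powerset, constraint, get_value, get_weight):
    """Brute force: examine every subset, keep the most valuable
    one whose total weight does not exceed the constraint."""
    best_value = 0.0
    best_set = None
    for subset in powerset:
        value = sum(get_value(item) for item in subset)
        weight = sum(get_weight(item) for item in subset)
        if weight <= constraint and value > best_value:
            best_value = value
            best_set = subset
    return best_set, best_value
```

With items represented as (name, value, weight) tuples, get_value and get_weight would just be `lambda item: item[1]` and `lambda item: item[2]`.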
And then we can run it with testBest, which is going to build the items using the function we looked at last time. It's then going to get the power set of the items, call chooseBest, and then print the result.

So let's see what happens if we run it. We get an error. Oh dear, I hadn't expected that. And it says... oh, testBest is not defined. All right, let's try that again. It sure looks like it's defined to me. There it is. OK.

And you may recall that this is a better answer than anything that was generated by the greedy algorithm on Tuesday. You may not recall it, but believe me, it is. It happened to have found a better solution. And not surprisingly, that's because I contrived the example to make sure that would happen. Why does it work better, or rather, why might it find a better answer? Well, because the greedy algorithm chose something that was locally optimal at each step. But there was no guarantee that a sequence of locally optimal decisions would reach a global optimum.

What this algorithm does is find a global optimum by looking at all solutions. And that's something we'll see again and again as we go forward: there's always a temptation to do things one step at a time, finding local optima, because it's fast and it's easy. But there's no guarantee it will work well.

Now the problem, of course, with finding the global optimum is, as we discussed, that it is prohibitively expensive. Now, you could ask: is it prohibitively expensive because I chose a stupid algorithm, the brute force algorithm? Well, it is a stupid algorithm. But in fact, this is a problem that is what we would call inherently exponential. We've looked at this concept before: in addition to talking about the complexity of an algorithm, we can talk about the complexity of a problem, in which we ask the question, how fast can the absolute best, fastest solution to this problem be?
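A small illustration of that point, with a contrived instance of my own (not the example from lecture): a greedy algorithm that always takes the remaining item with the best value-to-weight ratio can miss the global optimum.

```python
def greedy_by_density(items, constraint):
    """Repeatedly take the item with the best value/weight ratio
    that still fits. Locally optimal at each step, but not
    guaranteed to be globally optimal."""
    taken, total_value, remaining = [], 0, constraint
    for name, value, weight in sorted(items, key=lambda it: it[1] / it[2],
                                      reverse=True):
        if weight <= remaining:
            taken.append(name)
            total_value += value
            remaining -= weight
    return taken, total_value

# Capacity 10: greedy takes x first (best ratio, 7/6), and then
# neither y nor z fits, for a total value of 7. Taking y and z
# instead gives value 10, the global optimum brute force would find.
items = [('x', 7, 6), ('y', 5, 5), ('z', 5, 5)]
print(greedy_by_density(items, 10))  # → (['x'], 7)
```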
And here you can construct a mathematical proof that says the problem is inherently exponential. No matter what we do, we're not going to be able to find something that's guaranteed to find the optimal answer and that is faster than exponential. Well, now, let's be careful about that statement. What it means is that the worst case is inherently exponential. As we will see in a couple of weeks (it'll take us a while to get there), there are actually algorithms that people use to solve these inherently exponential problems, and to solve them fast enough to be useful.

So, for example, when you go to look at airline fares on Kayak to try and find the best fare from A to B, that is an inherently exponential problem, but you get an answer pretty quickly. And that's because there are techniques you can use. Now, in fact, one of the reasons you get it quickly is that they don't guarantee that you actually get an optimal solution. But there are also techniques that guarantee to give you an optimal solution and that almost all the time will run quickly. And we'll look at one of those a bit later in the term.

Before we do that, however, I want to leave the whole question of complexity behind for a while and look at another class of optimization problems. We'll look at several different kinds of optimization problems as the term goes forward. The kind I want to look at today belongs to what is probably, I would say, the most exciting branch of computer science today. And of course I might have a bias. And that's machine learning.

It's a term you'll hear a lot about. And it's a technique that many of you will apply. You might not write your own code. But I guarantee you will either be the beneficiary or the victim of machine learning almost every time you log on to the web these days.

I should probably start by defining what machine learning is. But that's hard to do; I really don't know how to do it. Superficially, you could say that machine learning deals with the question of how to build programs that learn. However, I think in a very real sense every program we write learns something. If I implement Newton's method, it's learning what the roots of the polynomial are. Certainly when we looked at curve fitting, fitting curves to data, we were learning a model of the data. That's what regression is.

Wikipedia says (and of course, it must be true if Wikipedia says it) that machine learning is a scientific discipline that is concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data. I'm not sure how helpful this definition is, but it was the best I could find. And it doesn't really matter. It does, though, get at the issue that a major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data.

This whole process is called inductive inference. The basic idea is that one observes... actually, one doesn't. The program observes examples that represent incomplete information about some statistical phenomenon, and then tries to generate a model, just as with curve fitting, that summarizes some statistical properties of that data and can be used to predict the future, for example, to give you information about unseen data.

There are, roughly speaking, two distinct approaches to machine learning, called supervised learning and unsupervised learning.

Let's first talk about supervised learning; it's a little easier to appreciate how it might work. In supervised learning, we associate a label with each example in a training set. So think of that as an answer to a query about an example. If the labels are discrete, we typically call it a classification problem. So we would try to classify, for example, a transaction on a credit card as belonging to the owner of that credit card or not belonging to the owner, i.e., with some probability, a stolen credit card. So it's discrete: it belongs to the owner, or it doesn't belong to the owner.

If the labels are real valued, we think of it as a regression problem. And so indeed, when we did the curve fitting, we were doing machine learning, and we were handling a regression problem.

Based on the examples in the training set, the goal is to build a program that can predict the answer for other cases before they are explicitly observed. So we're trying to generalize from the statistical properties of a training set to be able to make predictions about things we haven't seen.

Let's look at an example. Here, I've got red and blue circles. And I'm trying to learn what makes a circle red, or what's the difference between red and blue, other than the color. Think of my information as the (x, y) values and the label as the color, red or blue. So I've labeled each one, and now I'm trying to learn something. Well, it's kind of tricky. What are the questions I need to answer to think about this? And then we'll look at how we might do it.
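To make the supervised setup concrete, here is a minimal sketch of my own (not a method from the lecture): a nearest-neighbor classifier that predicts the label of a new point by copying the label of the closest labeled training example.

```python
import math

def nearest_neighbor(training, point):
    """training: list of ((x, y), label) pairs.
    Predict the label of point by finding the closest
    training example under Euclidean distance."""
    def dist(p, q):
        return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)
    closest = min(training, key=lambda example: dist(example[0], point))
    return closest[1]

training = [((1, 1), 'red'), ((1, 2), 'red'),
            ((8, 8), 'blue'), ((9, 7), 'blue')]
print(nearest_neighbor(training, (2, 2)))  # → 'red'
```

A point near the red group gets labeled red; whether that generalizes depends on exactly the questions raised next.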
So, the first question I need to ask is: are the labels accurate? And in fact, in a lot of real-world examples, in most real-world examples, there's no guarantee that the labels are accurate. So you have to assume that, well, maybe some of the labels are wrong. How do we deal with that?

Perhaps the most fundamental question is: is the past representative of the future? We've seen many examples where people have learned things, for example, to predict the price of housing. And it turns out you hit some singularity, which means the past is not a very good predictor of the future. And even if all of your learning is good, you get the wrong answer. So you sort of always have to ask that question.

Do you have enough data to generalize? And by this, I mean enough training data. If your training set is very small, you shouldn't have a lot of confidence in what you learn.

A big issue here is feature extraction. As we'll see when we look at real examples, the world is a pretty complex place, and we need to decide what features we're going to use. If I were to ask 25% of you to come up to the front of the room and then try to separate you based upon some feature, if I were to say, all right, I'm going to separate the good students from the bad students, but the only features I have available are the clothes you're wearing, it might not work so well.

And very importantly, how tight should the fit be?

So now let's go back to our example. We can look at two different ways we might generalize from this data. And indeed, when we're looking at classification problems in supervised learning, what we're typically doing is trying to find some way of dividing our training data. In this case, I've given you a two-dimensional projection. As we'll see, it's not always two-dimensional; it's not usually two-dimensional. So I might choose this rather eccentric shape and say that's great. And why is that great?

It's great because it minimizes training error. So if we look at it as an optimization problem, we might say that our objective function is how many points are correctly classified in the training data as red or blue. And this triangular shape has no training error: every point in the training data is perfectly classified.

If I choose this linear separator instead, I have some training error: this red point is misclassified in the training data. Does that mean that the triangle is better than the line? Not necessarily, right? Because my goal is to predict future points. And maybe that point is mislabeled, an experimental error. Maybe it's accurately labeled but an outlier, something very unusual, and the eccentric shape will not generalize well. This is analogous to what we talked about as overfitting when we looked at curve fitting. And that's a very big problem in machine learning: if you overfit to your training data, it might not generalize well, and it might give you bogus answers going forward.

OK. So that's a very quick look at supervised learning.
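That objective function, the count of misclassified training points, is easy to write down. A sketch, using a hypothetical linear separator of my own invention (points to the left of the vertical line x = 5 are called red):

```python
def training_error(classifier, labeled_points):
    """Count how many labeled training points the classifier
    gets wrong. labeled_points: list of ((x, y), label) pairs."""
    return sum(1 for point, label in labeled_points
               if classifier(point) != label)

# A hypothetical linear separator: the vertical line x = 5.
def linear_separator(point):
    return 'red' if point[0] < 5 else 'blue'

data = [((1, 1), 'red'), ((2, 3), 'red'), ((8, 8), 'blue'),
        ((6, 2), 'blue'), ((7, 4), 'red')]  # the last point is an outlier
print(training_error(linear_separator, data))  # → 1
```

A more contorted boundary could drive that count to zero on this data, but, as just discussed, that is exactly the overfitting risk.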
We'll come back to that. I now want to talk about unsupervised learning. The big difference here is that we have training data, but we don't have labels. So I just give you a bunch of points. It's as if we looked at this picture and I didn't tell you which were the red points and which were the blue points. They were just all points.

So what can I learn? What you're typically learning in unsupervised learning is the regularities of the data. So if we look at this and think away the red and the blue, we might well say, well, at least there is some structure to this data. And maybe what I should do is divide it this way; that gives me a kind of nice, clean separation. But maybe I should divide it that way instead. Or maybe I should put a circle around each of these four groupings. It's complicated, deciding what to do. But what we see is that there is clearly some structure here. And the idea of unsupervised learning is to discover that structure.

Far and away the dominant form of unsupervised learning is clustering. And that's what I was just talking about: finding the clusters in this data. So we'll move forward here. There it is, with everything the same color. But here I've labeled the x- and y-axes as height and weight.

What does clustering mean? It's the process of organizing the objects, or the points, into groups whose members are similar in some way. A key issue is: what do we mean by similar? What's the metric we want to use? And we can see that here. If I tell you that, really, I want to cluster people by height, say, people are similar if they're the same height, then it's pretty clear how I should divide this, right, what my clusters should be. My clusters should probably be this group of shorter people and this group of taller people. If I tell you I'm interested in weight, then probably I want to cluster with the divider here, between the heavier people and the lighter people.
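One way to express that choice of metric is as a weighted distance; the feature weights below are my own illustration. Setting a feature's weight to zero makes similarity depend only on the other feature, which is exactly the height-only or weight-only clustering just described.

```python
import math

def weighted_distance(p, q, weights):
    """Euclidean distance between feature vectors p and q,
    with each feature dimension scaled by a weight."""
    return math.sqrt(sum(w * (a - b) ** 2
                         for a, b, w in zip(p, q, weights)))

alice = (1.6, 90)  # (height in meters, weight in kilograms)
bob   = (1.9, 90)
carol = (1.6, 60)

# Caring only about height: Alice is far from Bob, identical to Carol.
print(weighted_distance(alice, bob, (1, 0)))    # ≈ 0.3
print(weighted_distance(alice, carol, (1, 0)))  # 0.0
# Caring only about weight reverses which pair is similar.
print(weighted_distance(alice, bob, (0, 1)))    # 0.0
```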
428 00:26:18,950 --> 00:26:23,670 Or if I say well, I'm interested in some combination 429 00:26:23,670 --> 00:26:27,090 of those two, then maybe I'll get four clusters as I 430 00:26:27,090 --> 00:26:28,340 discussed before. 431 00:26:28,340 --> 00:26:35,840 432 00:26:35,840 --> 00:26:38,680 Clustering algorithms are used all over the place. 433 00:26:38,680 --> 00:26:43,800 For example, in marketing, they're used to find groups of 434 00:26:43,800 --> 00:26:47,910 customers with similar behavior. 435 00:26:47,910 --> 00:26:52,380 Walmart is famous for using that clustering to find that. 436 00:26:52,380 --> 00:26:56,670 They did a clustering to determine when people bought 437 00:26:56,670 --> 00:26:58,010 the same thing. 438 00:26:58,010 --> 00:27:00,240 And then they would rearrange their shelves to encourage 439 00:27:00,240 --> 00:27:02,230 people to buy things. 440 00:27:02,230 --> 00:27:05,360 And sort of the most famous example they discovered was 441 00:27:05,360 --> 00:27:08,170 there was a strong correlation between people between people 442 00:27:08,170 --> 00:27:11,790 who bought beer and people who bought diapers. 443 00:27:11,790 --> 00:27:13,630 And so there was a period where if you walked in a 444 00:27:13,630 --> 00:27:16,160 Walmart store, you would find the beer and the diapers next 445 00:27:16,160 --> 00:27:17,550 to each other. 446 00:27:17,550 --> 00:27:19,720 And I leave it to you to speculate on 447 00:27:19,720 --> 00:27:21,420 why that was true. 448 00:27:21,420 --> 00:27:23,215 It just happened to be true in Walmart. 449 00:27:23,215 --> 00:27:26,170 450 00:27:26,170 --> 00:27:30,600 Amazon uses clustering to find people who like similar books. 451 00:27:30,600 --> 00:27:33,820 So every time you buy a book on Amazon, they're running a 452 00:27:33,820 --> 00:27:37,090 clustering algorithm to find out who looks like you. 453 00:27:37,090 --> 00:27:39,530 Said, oh, this person looks just like you. 
454 00:27:39,530 --> 00:27:42,130 So if they buy a book, maybe you'll get an email suggesting 455 00:27:42,130 --> 00:27:46,430 you buy that book, or a recommendation the next time you log into Amazon. 456 00:27:46,430 --> 00:27:48,800 Or when you look at a book, they tell you here are some 457 00:27:48,800 --> 00:27:50,400 similar books. 458 00:27:50,400 --> 00:27:52,670 And then they've done a clustering to group books as 459 00:27:52,670 --> 00:27:56,380 similar based on buying habits. 460 00:27:56,380 --> 00:28:01,550 Netflix uses that to recommend movies, et cetera. 461 00:28:01,550 --> 00:28:07,420 Biologists spend a lot of time these days doing clustering. 462 00:28:07,420 --> 00:28:09,630 They classify plants or animals 463 00:28:09,630 --> 00:28:10,780 based on their features. 464 00:28:10,780 --> 00:28:14,890 We'll shortly see an example of that, as in right after 465 00:28:14,890 --> 00:28:16,760 Patriots' Day. 466 00:28:16,760 --> 00:28:19,640 But they also use it a lot in genetics. 467 00:28:19,640 --> 00:28:24,140 So clustering is used to try to find groups of 468 00:28:24,140 --> 00:28:27,080 genes that look alike. 469 00:28:27,080 --> 00:28:30,930 Insurance companies use that to decide how much to charge 470 00:28:30,930 --> 00:28:33,340 you for your automobile insurance. 471 00:28:33,340 --> 00:28:36,990 They cluster drivers based upon-- and use that to predict 472 00:28:36,990 --> 00:28:40,420 who's going to have an accident. 473 00:28:40,420 --> 00:28:45,830 Document classification on the web uses it all the time. 474 00:28:45,830 --> 00:28:47,200 It's used a lot in medicine. 475 00:28:47,200 --> 00:28:49,580 Just used all over the place. 476 00:28:49,580 --> 00:28:51,650 So what is it exactly? 477 00:28:51,650 --> 00:28:56,200 Well, the nice thing is we can define it very 478 00:28:56,200 --> 00:28:59,840 straightforwardly as an optimization problem. 
479 00:28:59,840 --> 00:29:03,670 And so we can ask what properties does a good 480 00:29:03,670 --> 00:29:04,920 clustering have? 481 00:29:04,920 --> 00:29:07,610 482 00:29:07,610 --> 00:29:16,900 Well, it should have low intra-cluster dissimilarity. 483 00:29:16,900 --> 00:29:26,100 484 00:29:26,100 --> 00:29:29,600 So in a good clustering, all of the points in the same 485 00:29:29,600 --> 00:29:34,060 cluster should be similar, by whatever metric you're using 486 00:29:34,060 --> 00:29:35,800 for similarity. 487 00:29:35,800 --> 00:29:38,910 As we'll see, there are a lot of choices there. 488 00:29:38,910 --> 00:29:41,960 But that's not enough. 489 00:29:41,960 --> 00:29:55,300 We'd also like to have high inter-cluster dissimilarity. 490 00:29:55,300 --> 00:29:57,790 So we'd like the points within a cluster to be a lot like 491 00:29:57,790 --> 00:29:58,600 each other. 492 00:29:58,600 --> 00:30:01,730 But if points are in different clusters, we'd like them to be 493 00:30:01,730 --> 00:30:04,900 quite different from each other. 494 00:30:04,900 --> 00:30:07,894 That tells us that we have a good cluster. 495 00:30:07,894 --> 00:30:09,570 All right, let's look at it. 496 00:30:09,570 --> 00:30:18,640 497 00:30:18,640 --> 00:30:22,160 How might we model dissimilarity? 498 00:30:22,160 --> 00:30:26,760 Well, using a concept we've already seen-- variance. 499 00:30:26,760 --> 00:30:37,940 So we can talk about the variance of some cluster C as 500 00:30:37,940 --> 00:30:45,775 equal to the sum of all elements x in C, of the mean 501 00:30:45,775 --> 00:30:53,210 of C minus x squared. 502 00:30:53,210 --> 00:30:56,310 Or maybe we can take the square root of it, if we want. 503 00:30:56,310 --> 00:30:59,990 But it's exactly the idea we've seen before, right? 504 00:30:59,990 --> 00:31:02,610 Then we say what's the average value of the cluster? 505 00:31:02,610 --> 00:31:06,000 And then we look at how far is each point from the average. 
506 00:31:06,000 --> 00:31:07,070 We sum them. 507 00:31:07,070 --> 00:31:09,140 And that tells us how much variance we 508 00:31:09,140 --> 00:31:10,770 have within the cluster. 509 00:31:10,770 --> 00:31:13,810 510 00:31:13,810 --> 00:31:15,060 Make sense? 511 00:31:15,060 --> 00:31:17,560 512 00:31:17,560 --> 00:31:19,680 So that's variance. 513 00:31:19,680 --> 00:31:22,660 514 00:31:22,660 --> 00:31:26,190 So we can use that to talk about how similar or 515 00:31:26,190 --> 00:31:28,430 dissimilar the elements in the cluster are. 516 00:31:28,430 --> 00:31:31,860 517 00:31:31,860 --> 00:31:36,490 We can use the same idea to compare points in separate 518 00:31:36,490 --> 00:31:40,660 clusters and compute various different ways-- and we'll 519 00:31:40,660 --> 00:31:42,300 look at different ways-- 520 00:31:42,300 --> 00:31:46,510 to look at the distance between clusters. 521 00:31:46,510 --> 00:31:50,890 So combining these two things, we could get, say, a metric 522 00:31:50,890 --> 00:31:52,140 we'll call badness-- 523 00:31:52,140 --> 00:31:55,000 524 00:31:55,000 --> 00:31:57,480 not a technical word. 525 00:31:57,480 --> 00:32:03,470 And now I'll ask the question: is the optimization problem 526 00:32:03,470 --> 00:32:05,690 that we're solving in clustering 527 00:32:05,690 --> 00:32:07,540 finding a set of clusters-- 528 00:32:07,540 --> 00:32:09,230 capital C-- 529 00:32:09,230 --> 00:32:13,120 such that badness of that set of clusters is minimized? 530 00:32:13,120 --> 00:32:16,080 531 00:32:16,080 --> 00:32:19,600 Is that a sufficient definition of the problem 532 00:32:19,600 --> 00:32:22,710 we're trying to solve? 533 00:32:22,710 --> 00:32:25,760 Find a set of clusters C, such that the 534 00:32:25,760 --> 00:32:29,130 badness of C is minimized. 535 00:32:29,130 --> 00:32:32,835 Is that good enough? 536 00:32:32,835 --> 00:32:33,310 AUDIENCE: No. 537 00:32:33,310 --> 00:32:35,010 PROFESSOR: No, why not? 
538 00:32:35,010 --> 00:32:37,007 AUDIENCE: Just imagine a case where you view cluster-- if 539 00:32:37,007 --> 00:32:38,855 you make a single cluster, every cluster has 540 00:32:38,855 --> 00:32:40,170 one element in it. 541 00:32:40,170 --> 00:32:41,700 And the variance is 0. 542 00:32:41,700 --> 00:32:42,680 PROFESSOR: Exactly. 543 00:32:42,680 --> 00:32:46,890 So that has a trivial solution, which is probably 544 00:32:46,890 --> 00:32:50,330 not the one we want, of putting each 545 00:32:50,330 --> 00:32:54,040 point in its own cluster. 546 00:32:54,040 --> 00:32:54,810 Badness-- 547 00:32:54,810 --> 00:32:55,520 it won't be bad. 548 00:32:55,520 --> 00:32:58,790 It'll be a perfect clustering in some sense. 549 00:32:58,790 --> 00:33:02,640 But it doesn't do us any good really. 550 00:33:02,640 --> 00:33:05,570 So what do we do to fix that? 551 00:33:05,570 --> 00:33:07,270 What do we usually do when we formulate an 552 00:33:07,270 --> 00:33:08,360 optimization problem? 553 00:33:08,360 --> 00:33:09,990 What's missing? 554 00:33:09,990 --> 00:33:11,620 I've given you the objective function. 555 00:33:11,620 --> 00:33:13,620 What have I not given you? 556 00:33:13,620 --> 00:33:14,100 AUDIENCE: Constraints. 557 00:33:14,100 --> 00:33:16,020 PROFESSOR: A constraint. 558 00:33:16,020 --> 00:33:22,620 So we need to add some constraint here that will 559 00:33:22,620 --> 00:33:27,230 prevent us from finding a trivial solution. 560 00:33:27,230 --> 00:33:31,700 So what kind of constraints might we look at? 561 00:33:31,700 --> 00:33:33,940 There are different ways of doing it. 562 00:33:33,940 --> 00:33:36,950 563 00:33:36,950 --> 00:33:38,380 A couple of usual ones: 564 00:33:38,380 --> 00:33:43,520 sometimes you might have as a constraint the maximum number 565 00:33:43,520 --> 00:33:44,770 of clusters. 
566 00:33:44,770 --> 00:33:46,965 567 00:33:46,965 --> 00:33:52,530 Say, all right, cluster my data, but I want at most K 568 00:33:52,530 --> 00:33:53,460 clusters -- 569 00:33:53,460 --> 00:33:56,160 10 clusters. 570 00:33:56,160 --> 00:33:58,600 That would be my constraint, like the weight for the 571 00:33:58,600 --> 00:34:01,680 knapsack problem. 572 00:34:01,680 --> 00:34:07,740 Or maybe I'll want to put something on the maximum 573 00:34:07,740 --> 00:34:10,469 distance between clusters. 574 00:34:10,469 --> 00:34:13,110 So I don't want the distance between any two clusters to be 575 00:34:13,110 --> 00:34:14,360 more than something. 576 00:34:14,360 --> 00:34:19,540 577 00:34:19,540 --> 00:34:23,420 In general, solving this optimization problem is 578 00:34:23,420 --> 00:34:25,690 computationally prohibitive. 579 00:34:25,690 --> 00:34:30,190 So once again, in practice, what people typically resort 580 00:34:30,190 --> 00:34:32,699 to is greedy algorithms. 581 00:34:32,699 --> 00:34:36,600 And I want to look at two kinds of greedy algorithms, 582 00:34:36,600 --> 00:34:41,020 probably the two most common approaches to clustering. 583 00:34:41,020 --> 00:34:42,270 One is called k-means. 584 00:34:42,270 --> 00:34:47,145 585 00:34:47,145 --> 00:34:52,385 In k-means clustering, you say I want exactly k clusters. 586 00:34:52,385 --> 00:34:55,000 587 00:34:55,000 --> 00:34:57,950 And find the best k clustering. 588 00:34:57,950 --> 00:35:00,070 We'll talk about how it does that. 589 00:35:00,070 --> 00:35:03,170 And again, it's not guaranteed to find the best. 590 00:35:03,170 --> 00:35:05,000 And the other is hierarchical clustering. 591 00:35:05,000 --> 00:35:14,210 592 00:35:14,210 --> 00:35:15,790 We'll come back to that shortly. 593 00:35:15,790 --> 00:35:22,750 Both are simple to understand and widely used in practice. 594 00:35:22,750 --> 00:35:28,380 So let's first talk about how we do this. 
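Before turning to the algorithms, the variance-based objective just described on the board can be sketched in Python. The `badness` function here is one plausible way of combining per-cluster variances into a single score; the lecture deliberately leaves the exact combination open, so treat it as an illustrative assumption.

```python
def variance(cluster):
    """The board's definition: sum over all x in cluster C of
    (mean(C) - x) squared."""
    mean = sum(cluster) / len(cluster)
    return sum((mean - x) ** 2 for x in cluster)

def badness(clustering):
    """One plausible combination (an assumption, not the lecture's
    exact definition): total intra-cluster variance over all clusters."""
    return sum(variance(c) for c in clustering)

# Tight clusters give low badness; mixed-up clusters give higher badness.
print(badness([[180, 175, 178], [150, 152]]))
print(badness([[180, 150], [175, 152, 178]]))

# The trivial solution from the discussion: every point in its own
# cluster drives badness to 0 -- hence the need for a constraint,
# such as allowing at most k clusters.
print(badness([[180], [175], [178], [150], [152]]))  # 0
```

Note that without a constraint, minimizing this objective always picks the trivial all-singletons clustering, exactly as the audience member pointed out.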
595 00:35:28,380 --> 00:35:30,555 Let's first look at hierarchical clustering. 596 00:35:30,555 --> 00:35:38,200 597 00:35:38,200 --> 00:35:43,060 So we have a set of n items to be clustered. 598 00:35:43,060 --> 00:35:54,670 And let's assume we have an n by n distance matrix that 599 00:35:54,670 --> 00:35:59,790 tells me for each pair of items how far they are from 600 00:35:59,790 --> 00:36:01,040 each other. 601 00:36:01,040 --> 00:36:03,140 602 00:36:03,140 --> 00:36:05,970 So we can look at an example. 603 00:36:05,970 --> 00:36:10,400 So here's an n by n distance matrix for the airline 604 00:36:10,400 --> 00:36:14,230 distance between some cities in the United States. 605 00:36:14,230 --> 00:36:17,480 The distance from Boston to Boston is 0 miles. 606 00:36:17,480 --> 00:36:20,450 Distance from New York is 206. 607 00:36:20,450 --> 00:36:23,420 The distance from Chicago to San Francisco 608 00:36:23,420 --> 00:36:26,480 is 2,142, et cetera. 609 00:36:26,480 --> 00:36:27,370 All right? 610 00:36:27,370 --> 00:36:30,020 So I have my n by n distance matrix there. 611 00:36:30,020 --> 00:36:33,200 612 00:36:33,200 --> 00:36:37,360 Now let's go through how hierarchical clustering would 613 00:36:37,360 --> 00:36:40,540 relate these things to each other. 614 00:36:40,540 --> 00:36:51,490 So we start by assigning each item to its own cluster. 615 00:36:51,490 --> 00:36:59,910 616 00:36:59,910 --> 00:37:03,710 So if we have n items, we now have n clusters. 617 00:37:03,710 --> 00:37:08,250 618 00:37:08,250 --> 00:37:09,070 All right? 619 00:37:09,070 --> 00:37:13,430 That's the trivial solution that you suggested before. 620 00:37:13,430 --> 00:37:34,360 The next step is to find the most similar pair of clusters 621 00:37:34,360 --> 00:37:35,610 and merge them. 622 00:37:35,610 --> 00:37:42,310 623 00:37:42,310 --> 00:37:49,150 So if we look here and we just-- we start, we'll have 624 00:37:49,150 --> 00:37:52,050 six clusters, one for each city. 
625 00:37:52,050 --> 00:37:56,185 And we would merge the two most similar, which I guess in 626 00:37:56,185 --> 00:37:59,450 this case is New York and Boston. 627 00:37:59,450 --> 00:38:01,920 Hard to believe that those are the most similar cities. 628 00:38:01,920 --> 00:38:05,800 But at least by this distance metric they're the closest. 629 00:38:05,800 --> 00:38:07,450 So we would merge those two. 630 00:38:07,450 --> 00:38:14,530 631 00:38:14,530 --> 00:38:24,680 And then you just continue the process in principle, until 632 00:38:24,680 --> 00:38:27,420 all items are in one cluster. 633 00:38:27,420 --> 00:38:30,810 So now you have a whole hierarchy of clusters. 634 00:38:30,810 --> 00:38:33,920 And you can cut it off where you want. 635 00:38:33,920 --> 00:38:36,020 If you want to have six clusters, you could look at 636 00:38:36,020 --> 00:38:37,480 where in the hierarchy you have six. 637 00:38:37,480 --> 00:38:41,160 You can look where you have two, where you have three. 638 00:38:41,160 --> 00:38:43,610 Of course, you don't have to go all the way to finish it if 639 00:38:43,610 --> 00:38:45,670 you don't want to. 640 00:38:45,670 --> 00:38:48,450 This kind of hierarchical clustering is called 641 00:38:48,450 --> 00:38:49,700 agglomerative. 642 00:38:49,700 --> 00:38:57,880 643 00:38:57,880 --> 00:38:58,420 Why? 644 00:38:58,420 --> 00:39:00,850 Well, because we're combining things. 645 00:39:00,850 --> 00:39:02,100 We're agglomerating them. 646 00:39:02,100 --> 00:39:07,890 647 00:39:07,890 --> 00:39:09,310 So this is pretty 648 00:39:09,310 --> 00:39:12,025 straightforward, except for step (2). 649 00:39:12,025 --> 00:39:15,690 650 00:39:15,690 --> 00:39:22,460 The complication in step (2) is we have to define what it 651 00:39:22,460 --> 00:39:26,010 means to find the two most similar clusters. 
652 00:39:26,010 --> 00:39:28,560 653 00:39:28,560 --> 00:39:32,520 Now it's pretty easy when the clusters each contain one 654 00:39:32,520 --> 00:39:36,590 element, because, well, we have our metric-- in this 655 00:39:36,590 --> 00:39:37,400 case, distance-- 656 00:39:37,400 --> 00:39:41,180 and we can just do that as I did. 657 00:39:41,180 --> 00:39:44,810 But it's not so obvious what you do when they 658 00:39:44,810 --> 00:39:46,740 have multiple elements. 659 00:39:46,740 --> 00:39:49,940 660 00:39:49,940 --> 00:39:55,250 And in fact, different metrics can be used to get different 661 00:39:55,250 --> 00:39:56,940 properties. 662 00:39:56,940 --> 00:39:58,540 So I want to talk about some of the 663 00:39:58,540 --> 00:40:00,690 metrics we use for that. 664 00:40:00,690 --> 00:40:03,645 These are typically called linkage criteria. 665 00:40:03,645 --> 00:40:13,990 666 00:40:13,990 --> 00:40:18,380 So one popular one is what's called single linkage. 667 00:40:18,380 --> 00:40:24,170 668 00:40:24,170 --> 00:40:25,190 It's also called 669 00:40:25,190 --> 00:40:28,040 connectedness, or minimum method. 670 00:40:28,040 --> 00:40:31,160 In this, we consider the distance between a pair of 671 00:40:31,160 --> 00:40:36,680 clusters to be equal to the shortest distance from any 672 00:40:36,680 --> 00:40:37,975 member to any other member. 673 00:40:37,975 --> 00:41:00,220 674 00:41:00,220 --> 00:41:04,080 So we take the two points in each cluster that are closest 675 00:41:04,080 --> 00:41:06,780 to each other and say that's the distance 676 00:41:06,780 --> 00:41:08,030 between the two clusters. 
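The agglomerative procedure just described — start with n singleton clusters, then repeatedly merge the closest pair under the chosen linkage criterion — can be sketched as follows, using single linkage. The four-item distance matrix here is a small hypothetical one for illustration, not the city data from the slides.

```python
from itertools import combinations

def single_linkage(c1, c2, dist):
    """Distance between two clusters = shortest distance from any
    member of c1 to any member of c2 (the 'minimum' method)."""
    return min(dist[frozenset((a, b))] for a in c1 for b in c2)

def agglomerate(items, dist, linkage=single_linkage):
    """Start with each item in its own cluster; repeatedly merge the
    closest pair, recording every level of the hierarchy so it can
    be cut off wherever you want."""
    clusters = [frozenset([x]) for x in items]
    levels = [list(clusters)]
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: linkage(pair[0], pair[1], dist))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        levels.append(list(clusters))
    return levels

# Hypothetical symmetric distance matrix over four items.
d = {frozenset(p): v for p, v in {
    ('A', 'B'): 2, ('A', 'C'): 9, ('A', 'D'): 10,
    ('B', 'C'): 8, ('B', 'D'): 9, ('C', 'D'): 3,
}.items()}

# A and B merge first (distance 2), then C and D (distance 3),
# then everything collapses into one cluster.
for level in agglomerate('ABCD', d):
    print(sorted(sorted(c) for c in level))
```

Swapping in a `max(...)` for the `min(...)` over member pairs would give complete linkage instead; averaging the pairwise distances would give average linkage.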
677 00:41:08,030 --> 00:41:14,490 678 00:41:14,490 --> 00:41:17,335 People also use something called complete linkage-- 679 00:41:17,335 --> 00:41:21,376 680 00:41:21,376 --> 00:41:25,420 It's also called diameter or maximum-- 681 00:41:25,420 --> 00:41:30,010 where we consider the distance between any two clusters to be 682 00:41:30,010 --> 00:41:32,710 the distance between the points that are furthest from 683 00:41:32,710 --> 00:41:33,960 each other. 684 00:41:33,960 --> 00:41:40,060 685 00:41:40,060 --> 00:41:44,940 So in one case, essentially, single linkage was looking at 686 00:41:44,940 --> 00:41:47,150 the best case. 687 00:41:47,150 --> 00:41:49,390 Complete-- 688 00:41:49,390 --> 00:41:50,870 in English, not French-- 689 00:41:50,870 --> 00:41:53,450 is looking at the worst case. 690 00:41:53,450 --> 00:41:58,110 691 00:41:58,110 --> 00:42:01,870 And you won't be surprised to hear that you could also look 692 00:42:01,870 --> 00:42:11,175 at the average case, where you take all of the distances. 693 00:42:11,175 --> 00:42:13,950 694 00:42:13,950 --> 00:42:16,160 So you take all of the pairwise things. 695 00:42:16,160 --> 00:42:17,020 You add them up. 696 00:42:17,020 --> 00:42:18,860 You take the average. 697 00:42:18,860 --> 00:42:20,460 You can also take the 698 00:42:20,460 --> 00:42:23,640 median, if you want, instead. 699 00:42:23,640 --> 00:42:27,130 None of these is necessarily best. 700 00:42:27,130 --> 00:42:29,790 But they do give you different answers. 701 00:42:29,790 --> 00:42:32,605 And so I want to look at that now with our example here. 702 00:42:32,605 --> 00:42:35,370 703 00:42:35,370 --> 00:42:38,500 So let's look at it and run it. 704 00:42:38,500 --> 00:42:41,740 So the first step is independent of what linkage 705 00:42:41,740 --> 00:42:42,850 we're using. 706 00:42:42,850 --> 00:42:46,580 We get these six clusters. 707 00:42:46,580 --> 00:42:51,790 All right, now let's look at the second step. 
708 00:42:51,790 --> 00:42:55,030 Well, also pretty simple since we only have one 709 00:42:55,030 --> 00:42:57,400 element in each one. 710 00:42:57,400 --> 00:42:59,170 We're going to get that clustering. 711 00:42:59,170 --> 00:43:02,750 712 00:43:02,750 --> 00:43:03,300 All right. 713 00:43:03,300 --> 00:43:08,960 Now, what about the next step? 714 00:43:08,960 --> 00:43:13,750 What do I get if I'm using the minimal 715 00:43:13,750 --> 00:43:15,350 single linkage distance? 716 00:43:15,350 --> 00:43:16,895 What gets merged here? 717 00:43:16,895 --> 00:43:23,640 718 00:43:23,640 --> 00:43:24,980 Somebody? 719 00:43:24,980 --> 00:43:27,070 AUDIENCE: Boston, New York and Chicago. 720 00:43:27,070 --> 00:43:28,550 PROFESSOR: Boston, New York, and Chicago. 721 00:43:28,550 --> 00:43:31,910 722 00:43:31,910 --> 00:43:35,495 And it turns out we'll get the same thing if we use other 723 00:43:35,495 --> 00:43:37,270 linkages in this case. 724 00:43:37,270 --> 00:43:38,745 Let's continue to the next step. 725 00:43:38,745 --> 00:43:43,060 726 00:43:43,060 --> 00:43:45,935 Now we'll end up merging San Francisco and Seattle. 727 00:43:45,935 --> 00:43:52,840 728 00:43:52,840 --> 00:43:55,840 Now we get a difference. 729 00:43:55,840 --> 00:43:57,540 What does the red represent and what 730 00:43:57,540 --> 00:43:58,520 does the blue represent? 731 00:43:58,520 --> 00:44:00,980 Which linkage criteria? 732 00:44:00,980 --> 00:44:04,590 We're saying, we could either merge Denver with Boston, New 733 00:44:04,590 --> 00:44:06,480 York, and Chicago. 734 00:44:06,480 --> 00:44:10,185 Or we could merge Denver with San Francisco and Seattle. 735 00:44:10,185 --> 00:44:14,920 736 00:44:14,920 --> 00:44:16,790 Which is which? 737 00:44:16,790 --> 00:44:20,660 Which linkage criterion has put Denver in which cluster? 738 00:44:20,660 --> 00:44:30,910 739 00:44:30,910 --> 00:44:33,640 Well, suppose we're using single linkage. 
740 00:44:33,640 --> 00:44:37,520 Where are we getting it from? 741 00:44:37,520 --> 00:44:39,460 AUDIENCE: Boston, New York and Chicago? 742 00:44:39,460 --> 00:44:39,945 PROFESSOR: Yes. 743 00:44:39,945 --> 00:44:42,370 Because it's not so far from Chicago. 744 00:44:42,370 --> 00:44:45,924 Even though it's pretty far from Boston or New York. 745 00:44:45,924 --> 00:44:50,744 But if we use average linkage, we see on average, it's closer 746 00:44:50,744 --> 00:44:55,180 to San Francisco and Seattle than it is to the average of 747 00:44:55,180 --> 00:44:57,300 Boston, New York, or Chicago. 748 00:44:57,300 --> 00:45:00,920 So we get a different answer. 749 00:45:00,920 --> 00:45:03,860 And then finally, at the last step, 750 00:45:03,860 --> 00:45:08,270 everything gets merged together. 751 00:45:08,270 --> 00:45:15,940 So you can see, in this case, without having labels, we have 752 00:45:15,940 --> 00:45:22,200 used a feature to produce things and, say, if we wanted 753 00:45:22,200 --> 00:45:25,840 to have three clusters, we would maybe stop here. 754 00:45:25,840 --> 00:45:28,300 And we'd say, all right, these things are one cluster. 755 00:45:28,300 --> 00:45:29,170 This is a cluster. 756 00:45:29,170 --> 00:45:31,340 And this is a cluster. 757 00:45:31,340 --> 00:45:33,640 And that's not a bad geographical clustering, 758 00:45:33,640 --> 00:45:36,990 actually, for deciding how to relate these 759 00:45:36,990 --> 00:45:38,240 things to each other. 760 00:45:38,240 --> 00:45:40,890 761 00:45:40,890 --> 00:45:43,160 This technique is used a lot. 762 00:45:43,160 --> 00:45:46,260 It does have some weaknesses. 763 00:45:46,260 --> 00:45:50,050 One weakness is it's very time consuming. 764 00:45:50,050 --> 00:45:53,310 It doesn't scale well. 765 00:45:53,310 --> 00:45:58,370 The complexity is at least order n-squared, where n is 766 00:45:58,370 --> 00:46:03,900 the number of points to be clustered. 
767 00:46:03,900 --> 00:46:06,060 And in fact, in many implementations, it's worse 768 00:46:06,060 --> 00:46:08,230 than n-squared. 769 00:46:08,230 --> 00:46:13,010 And of course, it doesn't necessarily find the optimal 770 00:46:13,010 --> 00:46:17,240 clustering, even given these criteria. 771 00:46:17,240 --> 00:46:21,190 It might never at any level have the optimal clustering, 772 00:46:21,190 --> 00:46:24,770 because, again, at each step, it's making a locally optimal 773 00:46:24,770 --> 00:46:28,435 decision, not guaranteed to find the best solution. 774 00:46:28,435 --> 00:46:33,040 775 00:46:33,040 --> 00:46:38,950 I should point out that a big issue in getting these 776 00:46:38,950 --> 00:46:42,940 clusters was 777 00:46:42,940 --> 00:46:46,920 my choice of features. 778 00:46:46,920 --> 00:46:50,580 And this is something we're going to come back to in 779 00:46:50,580 --> 00:46:54,700 spades, because I actually think it is the most important 780 00:46:54,700 --> 00:46:56,980 issue in machine learning-- 781 00:46:56,980 --> 00:47:01,310 if we're going to say which points are similar to each 782 00:47:01,310 --> 00:47:06,840 other, we need to understand our feature space. 783 00:47:06,840 --> 00:47:10,500 So, for example, the feature I'm using here 784 00:47:10,500 --> 00:47:14,340 is distance by air. 785 00:47:14,340 --> 00:47:19,010 Suppose, instead, I added distance by air and distance 786 00:47:19,010 --> 00:47:22,900 by road and distance by train. 787 00:47:22,900 --> 00:47:25,900 Well, particularly given the sparsity of railroads in this 788 00:47:25,900 --> 00:47:29,180 country, we might get very different clustering, 789 00:47:29,180 --> 00:47:33,090 depending upon where the trains ran. 790 00:47:33,090 --> 00:47:36,590 And suppose I throw in a totally different feature like 791 00:47:36,590 --> 00:47:39,570 population. 
792 00:47:39,570 --> 00:47:42,040 Well, I might get another different clustering, 793 00:47:42,040 --> 00:47:43,395 depending on how I use that. 794 00:47:43,395 --> 00:47:46,780 795 00:47:46,780 --> 00:47:54,910 What we typically need to do in these situations, dealing 796 00:47:54,910 --> 00:47:57,660 with multi-dimensional data-- and most data is 797 00:47:57,660 --> 00:47:59,230 multi-dimensional-- 798 00:47:59,230 --> 00:48:06,790 is we construct something called a feature vector that 799 00:48:06,790 --> 00:48:10,830 incorporates multiple features. 800 00:48:10,830 --> 00:48:14,870 So we might have for each city-- 801 00:48:14,870 --> 00:48:16,805 we'll just take something like the distance. 802 00:48:16,805 --> 00:48:22,600 803 00:48:22,600 --> 00:48:26,720 Or let's say, instead of distance, we'll compute the 804 00:48:26,720 --> 00:48:33,810 distance by having for each city its GPS coordinates, 805 00:48:33,810 --> 00:48:39,110 where it is on the globe, and its population. 806 00:48:39,110 --> 00:48:42,840 And let's say that's how we define a city. 807 00:48:42,840 --> 00:48:45,070 And that would be our feature vector. 808 00:48:45,070 --> 00:48:47,560 And then we would cluster it, say, using hierarchical 809 00:48:47,560 --> 00:48:51,140 clustering to determine which cities are most like which 810 00:48:51,140 --> 00:48:53,340 other cities. 811 00:48:53,340 --> 00:48:56,280 Well, it's a little bit complicated. 812 00:48:56,280 --> 00:49:01,760 I have to ask how do I compare feature vectors? 813 00:49:01,760 --> 00:49:04,990 What distance metric do I use there? 814 00:49:04,990 --> 00:49:09,600 Do I get confused that GPS coordinates and populations 815 00:49:09,600 --> 00:49:12,700 are essentially unrelated? 816 00:49:12,700 --> 00:49:16,240 And I wouldn't like to compare those to each other. 
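The worry just raised — that GPS coordinates and population live on wildly different scales — is easy to see if you put them in one feature vector and apply a plain Euclidean distance. The city values below are hypothetical round numbers for illustration only.

```python
import math

def euclidean(v, w):
    """Plain Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

# Hypothetical feature vectors: (latitude, longitude, population).
city_a = (42.4, -71.1, 650_000)
city_b = (40.7, -74.0, 8_400_000)   # geographically nearby, huge population
city_c = (47.6, -122.3, 750_000)    # geographically far, similar population

# Population swamps the coordinates: the far-away city with a similar
# population ends up "closer" than the geographically nearby one.
print(euclidean(city_a, city_b) > euclidean(city_a, city_c))  # True
```

This is exactly why comparing unrelated features without some kind of scaling or weighting is dangerous, and it motivates the feature-space discussion the lecture is heading toward.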
817 00:49:16,240 --> 00:49:19,240 Lots of issues there, and that's what we're going to 818 00:49:19,240 --> 00:49:22,500 talk about when we come back from Patriots' Day-- 819 00:49:22,500 --> 00:49:26,440 how in real-world problems, we go from the large 820 00:49:26,440 --> 00:49:30,430 number of features associated with objects or things in the 821 00:49:30,430 --> 00:49:34,680 real world to feature vectors that allow us to automatically 822 00:49:34,680 --> 00:49:37,830 deduce which things are quote "most similar" 823 00:49:37,830 --> 00:49:39,080 to which other things. 824 00:49:39,080 --> 00:49:42,485