1 00:00:00,000 --> 00:00:00,080 2 00:00:00,080 --> 00:00:01,770 The following content is provided 3 00:00:01,770 --> 00:00:04,010 under a Creative Commons license. 4 00:00:04,010 --> 00:00:06,860 Your support will help MIT OpenCourseWare continue 5 00:00:06,860 --> 00:00:10,720 to offer high quality educational resources for free. 6 00:00:10,720 --> 00:00:13,340 To make a donation or view additional materials 7 00:00:13,340 --> 00:00:17,207 from hundreds of MIT courses, visit MIT OpenCourseWare 8 00:00:17,207 --> 00:00:17,832 at ocw.mit.edu. 9 00:00:17,832 --> 00:00:20,749 10 00:00:20,749 --> 00:00:23,040 VICTOR COSTAN: So I'm excited about today's recitation, 11 00:00:23,040 --> 00:00:25,560 because if I do this right and you guys get it, 12 00:00:25,560 --> 00:00:28,750 then I can mess up every other recitation after it. 13 00:00:28,750 --> 00:00:31,720 And you'll still get the gist of 6.006. 14 00:00:31,720 --> 00:00:34,700 So all I have to do is get this working. 15 00:00:34,700 --> 00:00:36,760 So most of the time in the real world 16 00:00:36,760 --> 00:00:39,580 you're probably not going to be coming up with new algorithms 17 00:00:39,580 --> 00:00:42,080 to do something, but rather you'll have some code 18 00:00:42,080 --> 00:00:43,960 and you want to make it faster. 19 00:00:43,960 --> 00:00:45,760 And the first step in making it faster 20 00:00:45,760 --> 00:00:48,510 is you realize, how does it do right now? 21 00:00:48,510 --> 00:00:51,210 How does it run, which lines are slow, which lines are fast, 22 00:00:51,210 --> 00:00:53,030 and where you can make improvements. 23 00:00:53,030 --> 00:00:56,190 So in lecture we talked about the Python Cost Model 24 00:00:56,190 --> 00:00:59,100 which is what you use to look at the code 25 00:00:59,100 --> 00:01:01,590 and figure out how much time it takes to run. 26 00:01:01,590 --> 00:01:04,069 And we talked about document distance, 27 00:01:04,069 --> 00:01:05,530 which is a problem that we'll use 28 00:01:05,530 --> 00:01:08,180 to practice our analysis skills. 29 00:01:08,180 --> 00:01:09,860 And this entire recitation is all 30 00:01:09,860 --> 00:01:12,610 about looking at versions of document distance 31 00:01:12,610 --> 00:01:14,810 and analyzing them. 32 00:01:14,810 --> 00:01:16,860 So that's what we'll do, look at Python code, 33 00:01:16,860 --> 00:01:19,010 look at Python code, look at Python code. 34 00:01:19,010 --> 00:01:21,320 So you better have handouts, because I can't project. 35 00:01:21,320 --> 00:01:25,637 OK, how many people remember the document distance problem? 36 00:01:25,637 --> 00:01:27,345 You guys said you went to lecture, right? 37 00:01:27,345 --> 00:01:30,540 38 00:01:30,540 --> 00:01:33,470 OK, so very, very fast, document distance. 39 00:01:33,470 --> 00:01:34,650 I have two documents. 40 00:01:34,650 --> 00:01:37,720 41 00:01:37,720 --> 00:01:41,780 The fox is in the hat. 42 00:01:41,780 --> 00:01:45,990 43 00:01:45,990 --> 00:01:49,930 And the fox is outside. 44 00:01:49,930 --> 00:01:54,250 45 00:01:54,250 --> 00:01:56,820 Document 1, document 2. 46 00:01:56,820 --> 00:01:58,580 What's the first thing I want to do? 47 00:01:58,580 --> 00:02:01,940 So there are three operations that Eric mentioned in lecture. 48 00:02:01,940 --> 00:02:07,410 Operation one, take each document, 49 00:02:07,410 --> 00:02:08,820 break it up into words. 50 00:02:08,820 --> 00:02:10,220 Right? 51 00:02:10,220 --> 00:02:12,610 This is a string. 52 00:02:12,610 --> 00:02:15,640 When I read it, then it becomes word one, word two, word three, 53 00:02:15,640 --> 00:02:18,060 word four, so on and so forth. 54 00:02:18,060 --> 00:02:20,930 Operation two, build document vectors 55 00:02:20,930 --> 00:02:22,840 out of the two documents. 56 00:02:22,840 --> 00:02:25,790 So the documents are D1 and D2. 57 00:02:25,790 --> 00:02:29,080 58 00:02:29,080 --> 00:02:30,950 A document vector is basically a list 59 00:02:30,950 --> 00:02:35,670 of the words in the documents with a count of how many times 60 00:02:35,670 --> 00:02:37,870 each word appears in the document. 61 00:02:37,870 --> 00:02:43,112 So let's build a document vector for document one. 62 00:02:43,112 --> 00:02:44,570 I'm not going to write it formally, 63 00:02:44,570 --> 00:02:47,120 so can anyone tell me what it should look like, 64 00:02:47,120 --> 00:02:49,110 and I'll sort of write it down as a list. 65 00:02:49,110 --> 00:02:52,780 66 00:02:52,780 --> 00:02:55,440 So for all the words here, I want to list the words 67 00:02:55,440 --> 00:02:57,640 and how many times they show up. 68 00:02:57,640 --> 00:03:01,376 Somebody, please. 69 00:03:01,376 --> 00:03:03,840 AUDIENCE: The is in there twice? 70 00:03:03,840 --> 00:03:04,970 VICTOR COSTAN: OK. 71 00:03:04,970 --> 00:03:07,735 The, twice. 72 00:03:07,735 --> 00:03:10,410 AUDIENCE: Fox, once. 73 00:03:10,410 --> 00:03:11,710 VICTOR COSTAN: One. 74 00:03:11,710 --> 00:03:13,340 AUDIENCE: Is, once. 75 00:03:13,340 --> 00:03:15,090 VICTOR COSTAN: Is, one. 76 00:03:15,090 --> 00:03:16,970 AUDIENCE: [INAUDIBLE] in once. 77 00:03:16,970 --> 00:03:18,200 VICTOR COSTAN: In, one. 78 00:03:18,200 --> 00:03:18,991 AUDIENCE: Hat once. 79 00:03:18,991 --> 00:03:21,242 80 00:03:21,242 --> 00:03:22,200 VICTOR COSTAN: Awesome. 81 00:03:22,200 --> 00:03:24,760 Thank you very much. 82 00:03:24,760 --> 00:03:25,910 Second one. 83 00:03:25,910 --> 00:03:27,500 Another volunteer. 84 00:03:27,500 --> 00:03:28,270 Yes, go for it. 85 00:03:28,270 --> 00:03:30,231 AUDIENCE: The, once. 86 00:03:30,231 --> 00:03:31,230 VICTOR COSTAN: The, one. 87 00:03:31,230 --> 00:03:32,740 AUDIENCE: Fox, once. 88 00:03:32,740 --> 00:03:33,857 VICTOR COSTAN: Fox, one 89 00:03:33,857 --> 00:03:35,020 AUDIENCE: Is, one. 90 00:03:35,020 --> 00:03:36,708 VICTOR COSTAN: Is, one. 91 00:03:36,708 --> 00:03:38,604 AUDIENCE: Outside, one. 92 00:03:38,604 --> 00:03:41,890 VICTOR COSTAN: Outside, one. 93 00:03:41,890 --> 00:03:44,850 OK, so this is a document vector. 94 00:03:44,850 --> 00:03:47,190 Notice two small details. 95 00:03:47,190 --> 00:03:49,880 Here, they is capitalized, here it's not, 96 00:03:49,880 --> 00:03:52,570 and yet I bundle them together. 97 00:03:52,570 --> 00:03:54,160 I know my grammar, so I put periods 98 00:03:54,160 --> 00:03:56,024 at the end of the sentences, and yet they 99 00:03:56,024 --> 00:03:57,190 don't show up anywhere here. 100 00:03:57,190 --> 00:03:58,800 So we got rid of the punctuation, 101 00:03:58,800 --> 00:04:00,185 and we made all words lowercase. 102 00:04:00,185 --> 00:04:02,476 103 00:04:02,476 --> 00:04:04,100 These are details, but they are details 104 00:04:04,100 --> 00:04:05,808 that you'll see in the code, so if you're 105 00:04:05,808 --> 00:04:07,850 wondering why, this is why. 106 00:04:07,850 --> 00:04:11,180 So step one, read the document, make it a list of words. 107 00:04:11,180 --> 00:04:13,250 Step two, compute the document vector. 108 00:04:13,250 --> 00:04:15,940 Step three, take the two document vectors, 109 00:04:15,940 --> 00:04:17,610 and compute the angle. 110 00:04:17,610 --> 00:04:20,560 What is the angle of two document vectors? 111 00:04:20,560 --> 00:04:21,980 Big ugly math formula. 112 00:04:21,980 --> 00:04:25,570 The only thing that's relevant is that it takes these vectors 113 00:04:25,570 --> 00:04:27,860 and computes an inner product. 114 00:04:27,860 --> 00:04:33,580 So if we look at the code for angle vector, or vector angle, 115 00:04:33,580 --> 00:04:37,400 you'll see that because numerator denominator lines two 116 00:04:37,400 --> 00:04:40,450 and three, it calls inner product three times 117 00:04:40,450 --> 00:04:42,562 and then it does some math with it. 118 00:04:42,562 --> 00:04:43,770 We don't care about the math. 119 00:04:43,770 --> 00:04:45,170 We assume the math is order one. 120 00:04:45,170 --> 00:04:48,514 We only care about inner product. 121 00:04:48,514 --> 00:04:49,680 How does inner product work? 122 00:04:49,680 --> 00:04:52,210 Can anyone help me compute the inner product for these guys? 123 00:04:52,210 --> 00:04:56,960 124 00:04:56,960 --> 00:04:57,920 Yes? 125 00:04:57,920 --> 00:04:58,920 AUDIENCE: It's like the dot product? 126 00:04:58,920 --> 00:04:59,672 VICTOR COSTAN: OK. 127 00:04:59,672 --> 00:05:03,445 AUDIENCE: So, if we take the vectors and you multiply them, 128 00:05:03,445 --> 00:05:05,870 like, you're adding to the components, right? 129 00:05:05,870 --> 00:05:08,300 Because they're so thick-- 130 00:05:08,300 --> 00:05:10,482 VICTOR COSTAN: OK, this is too complicated, then. 131 00:05:10,482 --> 00:05:11,940 I'm seriously depressed, so give me 132 00:05:11,940 --> 00:05:15,600 some clear instructions step by step. 133 00:05:15,600 --> 00:05:17,120 AUDIENCE: Like, I know you divide 134 00:05:17,120 --> 00:05:18,860 by the length of each of the vectors-- 135 00:05:18,860 --> 00:05:20,280 VICTOR COSTAN: Let's not worry about that. 136 00:05:20,280 --> 00:05:22,420 I have these vectors, and I want an inner product. 137 00:05:22,420 --> 00:05:25,454 I don't care about the angle, just the inner product. 138 00:05:25,454 --> 00:05:27,750 AUDIENCE: OK, well do 2 times 1 for the right. 139 00:05:27,750 --> 00:05:28,500 VICTOR COSTAN: OK. 140 00:05:28,500 --> 00:05:30,620 So I take the here, shows up twice. 141 00:05:30,620 --> 00:05:32,690 I take the here, shows up once. 142 00:05:32,690 --> 00:05:33,970 2 times 1, right? 143 00:05:33,970 --> 00:05:34,640 AUDIENCE: Mhm. 144 00:05:34,640 --> 00:05:35,850 VICTOR COSTAN: OK. 145 00:05:35,850 --> 00:05:36,758 And then? 146 00:05:36,758 --> 00:05:38,927 AUDIENCE: I would do the same for fox. 147 00:05:38,927 --> 00:05:41,510 VICTOR COSTAN: OK, fox shows up here once, shows up here once, 148 00:05:41,510 --> 00:05:43,739 so what I do? 149 00:05:43,739 --> 00:05:45,070 AUDIENCE: 1 times 1. 150 00:05:45,070 --> 00:05:47,380 VICTOR COSTAN: OK. 151 00:05:47,380 --> 00:05:49,240 AUDIENCE: And do the same for is. 152 00:05:49,240 --> 00:05:50,816 VICTOR COSTAN: OK. 153 00:05:50,816 --> 00:05:53,400 AUDIENCE: And in should be 0. 154 00:05:53,400 --> 00:05:54,150 VICTOR COSTAN: OK. 155 00:05:54,150 --> 00:05:55,730 AUDIENCE: [INAUDIBLE] in. 156 00:05:55,730 --> 00:05:56,480 VICTOR COSTAN: OK. 157 00:05:56,480 --> 00:05:58,820 AUDIENCE: And then outside would also be 0, 158 00:05:58,820 --> 00:06:00,750 and hat would also be 0. 159 00:06:00,750 --> 00:06:01,580 VICTOR COSTAN: OK. 160 00:06:01,580 --> 00:06:04,020 So it turns out you don't have to go through both lists. 161 00:06:04,020 --> 00:06:06,330 It's sufficient to go through one of the vectors 162 00:06:06,330 --> 00:06:08,336 and look up the words in the other vector. 163 00:06:08,336 --> 00:06:10,710 Because if the words don't show up in any of the vectors, 164 00:06:10,710 --> 00:06:12,670 their contribution is going to be 0. 165 00:06:12,670 --> 00:06:16,670 So my algorithm is go through each of the elements 166 00:06:16,670 --> 00:06:20,010 here, look up each of the words there, look up at the word 167 00:06:20,010 --> 00:06:20,780 here. 168 00:06:20,780 --> 00:06:23,500 And if there's a word here and here, 169 00:06:23,500 --> 00:06:26,130 take out the number of times it shows up in each document, 170 00:06:26,130 --> 00:06:31,470 multiply them, and then add everything up. 171 00:06:31,470 --> 00:06:33,440 So this is inner product. 172 00:06:33,440 --> 00:06:35,890 Everything else is good if you're writing a search engine 173 00:06:35,890 --> 00:06:38,020 or if you're using the scenario application, 174 00:06:38,020 --> 00:06:40,880 but we're not really concerned with it. 175 00:06:40,880 --> 00:06:42,730 OK, so now we have the three steps, 176 00:06:42,730 --> 00:06:45,630 read the document, break it up into words, 177 00:06:45,630 --> 00:06:47,990 compute document vectors, compute our inner product. 178 00:06:47,990 --> 00:06:49,870 So this is what we want to do. 179 00:06:49,870 --> 00:06:53,650 And document distance 1 does it in a painfully slow way, 180 00:06:53,650 --> 00:06:57,010 and we're probably not going to cover everything in recitation. 181 00:06:57,010 --> 00:06:59,460 But if you go all the way up to document distance 1, 182 00:06:59,460 --> 00:07:00,610 that's really, really fast. 183 00:07:00,610 --> 00:07:03,960 It's 1,000 times faster. 184 00:07:03,960 --> 00:07:06,892 So this is our job for the day. 185 00:07:06,892 --> 00:07:07,850 Let's look at the code. 186 00:07:07,850 --> 00:07:11,440 Did anyone look at the code beforehand? 187 00:07:11,440 --> 00:07:12,400 Nope. 188 00:07:12,400 --> 00:07:15,130 OK, so when I look at a big piece of code, 189 00:07:15,130 --> 00:07:19,820 I like to look at it from top down. 190 00:07:19,820 --> 00:07:22,080 So that means I start to the main function, 191 00:07:22,080 --> 00:07:24,370 I see who is it calling, I see what everything 192 00:07:24,370 --> 00:07:27,720 is trying to do, and then I go into the sub functions 193 00:07:27,720 --> 00:07:30,300 and recurves and basically do the same thing. 194 00:07:30,300 --> 00:07:32,580 So I build a tree of who's calling what, 195 00:07:32,580 --> 00:07:34,720 and that helps me figure out what's going on. 196 00:07:34,720 --> 00:07:37,224 197 00:07:37,224 --> 00:07:38,265 So let's start with main. 198 00:07:38,265 --> 00:07:44,230 199 00:07:44,230 --> 00:07:46,140 And let's look at main. 200 00:07:46,140 --> 00:07:49,150 Lines 1 through 6 look at the arguments. 201 00:07:49,150 --> 00:07:51,040 We don't really care. 202 00:07:51,040 --> 00:07:54,365 Line 7 and 8 call word frequencies for file. 203 00:07:54,365 --> 00:08:01,520 204 00:08:01,520 --> 00:08:04,770 I am abbreviating liberally. 205 00:08:04,770 --> 00:08:11,950 And then line 9 calls vector angle. 206 00:08:11,950 --> 00:08:17,210 So line 7 and 8 read the two documents, 207 00:08:17,210 --> 00:08:20,490 do steps one and two, and then 9 does step three. 208 00:08:20,490 --> 00:08:31,010 209 00:08:31,010 --> 00:08:31,700 OK. 210 00:08:31,700 --> 00:08:33,220 Word frequencies for files. 211 00:08:33,220 --> 00:08:35,679 So the point of this is to read a file 212 00:08:35,679 --> 00:08:41,100 and to produce a word document vector out of it. 213 00:08:41,100 --> 00:08:45,370 And it does it in three steps. 214 00:08:45,370 --> 00:08:48,840 Reads the file, line two. 215 00:08:48,840 --> 00:08:52,300 Breaks up the file into words, so operation one, 216 00:08:52,300 --> 00:08:56,170 this is line 3, and then line 4, it takes up the list of words 217 00:08:56,170 --> 00:08:58,790 and computes a document vector out of it. 218 00:08:58,790 --> 00:09:01,110 I don't care about reading files because I'll 219 00:09:01,110 --> 00:09:03,420 assume this is somehow done for me. 220 00:09:03,420 --> 00:09:05,290 We care about the algorithms. 221 00:09:05,290 --> 00:09:07,050 So as far as I'm concerned, this function 222 00:09:07,050 --> 00:09:08,700 is calling get words from line list. 223 00:09:08,700 --> 00:09:11,810 224 00:09:11,810 --> 00:09:16,695 Get words from line list, and count frequency. 225 00:09:16,695 --> 00:09:26,100 226 00:09:26,100 --> 00:09:29,130 And if we skip all the way to vector angle-- 227 00:09:29,130 --> 00:09:32,440 we already talked a little bit about how all it does 228 00:09:32,440 --> 00:09:34,580 is it calls inner product three times 229 00:09:34,580 --> 00:09:37,010 and then in does some fancy math of it. 230 00:09:37,010 --> 00:09:44,510 231 00:09:44,510 --> 00:09:46,590 So this is how the code looks like big picture. 232 00:09:46,590 --> 00:09:56,370 233 00:09:56,370 --> 00:09:58,530 OK, so to figure out the running time for main, 234 00:09:58,530 --> 00:10:00,780 we need to figure out the running time for these two 235 00:10:00,780 --> 00:10:03,314 functions and add them up, right? 236 00:10:03,314 --> 00:10:04,980 To figure out the running time for this, 237 00:10:04,980 --> 00:10:06,354 we need to figure out the running 238 00:10:06,354 --> 00:10:09,710 time for these functions and add them up, so on and so forth. 239 00:10:09,710 --> 00:10:12,960 So as you go through each of the document distance versions, 240 00:10:12,960 --> 00:10:17,214 you want keep a scorecard of the implementation that shows you 241 00:10:17,214 --> 00:10:18,630 what the running time is, and this 242 00:10:18,630 --> 00:10:20,772 helps you follow what was improved 243 00:10:20,772 --> 00:10:21,730 in each implementation. 244 00:10:21,730 --> 00:10:25,830 245 00:10:25,830 --> 00:10:29,430 So let's look at to get words from line lists. 246 00:10:29,430 --> 00:10:31,940 What does it seem like its doing? 247 00:10:31,940 --> 00:10:34,920 Without reading the get words from string, 248 00:10:34,920 --> 00:10:38,540 can anyone tell me what it seems like it's doing? 249 00:10:38,540 --> 00:10:40,515 If you just read lines 1 through 6. 250 00:10:40,515 --> 00:10:43,896 251 00:10:43,896 --> 00:10:46,950 AUDIENCE: [INAUDIBLE] through the list. 252 00:10:46,950 --> 00:10:47,700 VICTOR COSTAN: OK. 253 00:10:47,700 --> 00:10:49,940 So it's getting an input list. 254 00:10:49,940 --> 00:10:52,400 And if you look at word frequencies for files 255 00:10:52,400 --> 00:10:56,430 at line 2, it names a variable line list. 256 00:10:56,430 --> 00:11:00,360 So it seems like what's happening is, 257 00:11:00,360 --> 00:11:02,590 reads a file into a list of lines. 258 00:11:02,590 --> 00:11:07,140 And then that list of lines goes to get words from line lists. 259 00:11:07,140 --> 00:11:12,590 So this is L in get words from line lists. 260 00:11:12,590 --> 00:11:15,820 So it takes a list of lines which is the entire document, 261 00:11:15,820 --> 00:11:16,320 and then? 262 00:11:16,320 --> 00:11:20,900 263 00:11:20,900 --> 00:11:23,514 AUDIENCE: Basically it removes the new lines. 264 00:11:23,514 --> 00:11:30,178 It sticks it into one giant list rather than a list of lines, 265 00:11:30,178 --> 00:11:31,591 is that right? 266 00:11:31,591 --> 00:11:34,090 VICTOR COSTAN: Almost, so you seem to get words from string. 267 00:11:34,090 --> 00:11:35,798 Maybe we need to go through the function, 268 00:11:35,798 --> 00:11:40,180 but do see the get words from string function name? 269 00:11:40,180 --> 00:11:42,330 So I will assume that it does something 270 00:11:42,330 --> 00:11:44,510 with each of the words. 271 00:11:44,510 --> 00:11:49,090 And if the overall goal is to get a list of words, 272 00:11:49,090 --> 00:11:52,200 then I would assume that what that does is it takes a line 273 00:11:52,200 --> 00:11:54,200 and it breaks it up into words. 274 00:11:54,200 --> 00:11:55,950 Because this way, if you take up each line 275 00:11:55,950 --> 00:11:57,980 and break it up into words, then when 276 00:11:57,980 --> 00:11:59,970 we put all the words together we get the words 277 00:11:59,970 --> 00:12:01,053 that make up the document. 278 00:12:01,053 --> 00:12:03,430 279 00:12:03,430 --> 00:12:04,210 Do people follow? 280 00:12:04,210 --> 00:12:04,870 Any questions? 281 00:12:04,870 --> 00:12:07,369 I like that people are nodding, by the way, keep doing that. 282 00:12:07,369 --> 00:12:09,454 That helps me go at the right speed. 283 00:12:09,454 --> 00:12:11,870 If you're not nodding, I'll keep explaining the same thing 284 00:12:11,870 --> 00:12:12,710 over and over again. 285 00:12:12,710 --> 00:12:18,870 286 00:12:18,870 --> 00:12:20,280 OK, so get words from string. 287 00:12:20,280 --> 00:12:24,590 Get words from string takes up a single line, that's a string, 288 00:12:24,590 --> 00:12:27,060 and produces a list of words. 289 00:12:27,060 --> 00:12:29,920 And we saw in the example there that it 290 00:12:29,920 --> 00:12:33,230 has to take care of a few details such as making 291 00:12:33,230 --> 00:12:35,520 all the letters lowercase and ignoring 292 00:12:35,520 --> 00:12:38,870 punctuation and skipping spaces. 293 00:12:38,870 --> 00:12:42,210 So let's look at this code and figure out its running time. 294 00:12:42,210 --> 00:12:44,030 And the way we're going to do that is we're 295 00:12:44,030 --> 00:12:46,840 going to look at each line, and we're 296 00:12:46,840 --> 00:12:49,960 going to see what's the cost for that line 297 00:12:49,960 --> 00:12:51,732 and how many times does it run. 298 00:12:51,732 --> 00:12:53,190 And once we have those two numbers, 299 00:12:53,190 --> 00:12:56,700 we multiply them together and we see how much time 300 00:12:56,700 --> 00:13:01,270 does the program spend on that line in total. 301 00:13:01,270 --> 00:13:04,030 So I'm going to write down line numbers here. 302 00:13:04,030 --> 00:13:09,110 9, 10, 11, 12, 13, 14, 15. 303 00:13:09,110 --> 00:13:10,118 All the way to 23. 304 00:13:10,118 --> 00:13:18,180 305 00:13:18,180 --> 00:13:18,680 Too low. 306 00:13:18,680 --> 00:13:26,930 307 00:13:26,930 --> 00:13:29,960 20, 21, 22, 23. 308 00:13:29,960 --> 00:13:33,910 309 00:13:33,910 --> 00:13:35,670 OK, so let's start with something easy, 310 00:13:35,670 --> 00:13:40,650 lines 9 and 10 How many times are they run? 311 00:13:40,650 --> 00:13:41,730 AUDIENCE: Once. 312 00:13:41,730 --> 00:13:42,966 VICTOR COSTAN: OK. 313 00:13:42,966 --> 00:13:47,739 AUDIENCE: [INAUDIBLE] Once in this method? 314 00:13:47,739 --> 00:13:48,530 VICTOR COSTAN: Yep. 315 00:13:48,530 --> 00:13:51,330 So I'm only looking at this method. 316 00:13:51,330 --> 00:13:57,500 So assuming that the method gets one line, and the line has, 317 00:13:57,500 --> 00:14:07,380 I don't know, say, one line in characters, 318 00:14:07,380 --> 00:14:08,830 and we need another variable which 319 00:14:08,830 --> 00:14:11,140 we're going to figure out later. 320 00:14:11,140 --> 00:14:14,040 But for now, one line in characters. 321 00:14:14,040 --> 00:14:17,181 So how many times does line 9 run? 322 00:14:17,181 --> 00:14:19,950 AUDIENCE: [INAUDIBLE] 323 00:14:19,950 --> 00:14:21,450 VICTOR COSTAN: OK. 324 00:14:21,450 --> 00:14:23,660 Runs once. 325 00:14:23,660 --> 00:14:25,528 How about line 10? 326 00:14:25,528 --> 00:14:26,355 AUDIENCE: Once. 327 00:14:26,355 --> 00:14:26,980 AUDIENCE: Once. 328 00:14:26,980 --> 00:14:27,950 VICTOR COSTAN: OK. 329 00:14:27,950 --> 00:14:29,240 What do they do? 330 00:14:29,240 --> 00:14:32,360 Create new lists and assign them to variables. 331 00:14:32,360 --> 00:14:35,121 What's the cause for that? 332 00:14:35,121 --> 00:14:36,462 AUDIENCE: Constant [INAUDIBLE] 333 00:14:36,462 --> 00:14:38,420 VICTOR COSTAN: Constant, excellent. 334 00:14:38,420 --> 00:14:40,140 So I'll be skipping the order of so 335 00:14:40,140 --> 00:14:43,030 that I don't have to write it 23 times. 336 00:14:43,030 --> 00:14:45,290 So 1, 1. 337 00:14:45,290 --> 00:14:48,730 OK, line 11. 338 00:14:48,730 --> 00:14:51,600 It's iterates over all the characters in a line. 339 00:14:51,600 --> 00:14:54,750 So how many times is it going to run? 340 00:14:54,750 --> 00:14:56,250 AUDIENCE: Like, the line? 341 00:14:56,250 --> 00:14:57,526 VICTOR COSTAN: OK, which is? 342 00:14:57,526 --> 00:14:59,270 AUDIENCE: Line end characters. 343 00:14:59,270 --> 00:15:01,510 VICTOR COSTAN: Awesome. 344 00:15:01,510 --> 00:15:07,290 And just the fact of iterating takes constant time. 345 00:15:07,290 --> 00:15:09,220 I'm not sure we covered that. 346 00:15:09,220 --> 00:15:14,520 So for each character, test if it's an alphanumeric character. 347 00:15:14,520 --> 00:15:17,430 Does anyone know what alphanumeric means? 348 00:15:17,430 --> 00:15:19,050 AUDIENCE: It's a letter and a number. 349 00:15:19,050 --> 00:15:21,300 VICTOR COSTAN: OK, so fancy word for letter or number, 350 00:15:21,300 --> 00:15:23,970 A through Z, 0 through 9. 351 00:15:23,970 --> 00:15:25,920 So how much time does it take to test 352 00:15:25,920 --> 00:15:29,200 if a character is alphanumeric? 353 00:15:29,200 --> 00:15:30,602 Guesses? 354 00:15:30,602 --> 00:15:31,940 AUDIENCE: Constant. 355 00:15:31,940 --> 00:15:34,300 VICTOR COSTAN: OK, so constant time. 356 00:15:34,300 --> 00:15:37,160 You compare it to the range A, Z and 0, 9. 357 00:15:37,160 --> 00:15:39,109 How many times am I doing it? 358 00:15:39,109 --> 00:15:39,984 AUDIENCE: [INAUDIBLE] 359 00:15:39,984 --> 00:15:42,022 AUDIENCE: [INAUDIBLE] 360 00:15:42,022 --> 00:15:43,480 VICTOR COSTAN: Thank you guys, this 361 00:15:43,480 --> 00:15:45,650 is going much faster than the last recitation. 362 00:15:45,650 --> 00:15:48,800 You guys are active, I like it. 363 00:15:48,800 --> 00:15:50,030 So, now for line 13. 364 00:15:50,030 --> 00:15:52,550 That only gets executed when the character 365 00:15:52,550 --> 00:15:54,362 is an alphanumeric character. 366 00:15:54,362 --> 00:15:56,320 So we're going to have to make some assumptions 367 00:15:56,320 --> 00:15:57,530 about the document. 368 00:15:57,530 --> 00:16:00,400 And to make my life easier, we're 369 00:16:00,400 --> 00:16:02,380 going to make the following assumption. 370 00:16:02,380 --> 00:16:05,576 If this is a natural language like, say, English, 371 00:16:05,576 --> 00:16:07,450 words are going to be a constant size, right? 372 00:16:07,450 --> 00:16:10,680 How many 500-character words do you see in English? 373 00:16:10,680 --> 00:16:14,910 So let's say 5 to 10 characters per word. 374 00:16:14,910 --> 00:16:17,170 And since the difference is so small, 375 00:16:17,170 --> 00:16:19,960 I'm going to say all the words have the same size 376 00:16:19,960 --> 00:16:22,040 W. And if you want to be more formal, 377 00:16:22,040 --> 00:16:24,800 you can replace word length with average length, 378 00:16:24,800 --> 00:16:26,810 and the math works out. 379 00:16:26,810 --> 00:16:31,330 So each line has a number of words, 380 00:16:31,330 --> 00:16:34,290 and the words are separated by exactly one space, 381 00:16:34,290 --> 00:16:37,130 and the word has W characters. 382 00:16:37,130 --> 00:16:40,942 So how many words do I have, by the way? 383 00:16:40,942 --> 00:16:44,330 AUDIENCE: N divided by W. 384 00:16:44,330 --> 00:16:45,640 VICTOR COSTAN: OK, good. 385 00:16:45,640 --> 00:16:47,740 Someone's paying close attention. 386 00:16:47,740 --> 00:16:50,960 N divided by W plus 1. 387 00:16:50,960 --> 00:16:53,820 And the reason that is, is a line 388 00:16:53,820 --> 00:16:56,340 would look like this, word, space, word, space, word, 389 00:16:56,340 --> 00:16:56,840 space. 390 00:16:56,840 --> 00:16:59,850 So W, characters, one space, W, characters, one space, W, 391 00:16:59,850 --> 00:17:01,110 characters, one space. 392 00:17:01,110 --> 00:17:04,294 That's why you have W plus 1 there. 393 00:17:04,294 --> 00:17:05,960 When we look at asymptotics it turns out 394 00:17:05,960 --> 00:17:08,910 that it doesn't really matter because W's a constant, 395 00:17:08,910 --> 00:17:13,800 W plus 1 is a constant, so order and words. 396 00:17:13,800 --> 00:17:17,930 But for now, let's keep track of W's to seem a bit more formal. 397 00:17:17,930 --> 00:17:19,680 So line 13. 398 00:17:19,680 --> 00:17:21,180 How many times is it going to run? 399 00:17:21,180 --> 00:17:29,374 400 00:17:29,374 --> 00:17:31,629 AUDIENCE: W times 10 over W plus one. 401 00:17:31,629 --> 00:17:32,670 VICTOR COSTAN: Excellent. 402 00:17:32,670 --> 00:17:38,345 403 00:17:38,345 --> 00:17:39,220 Let me pull them out. 404 00:17:39,220 --> 00:17:43,880 405 00:17:43,880 --> 00:17:46,160 How much time does it take to run [INAUDIBLE]. 406 00:17:46,160 --> 00:17:49,872 407 00:17:49,872 --> 00:17:50,800 AUDIENCE: Constant? 408 00:17:50,800 --> 00:17:52,341 VICTOR COSTAN: Constant time, append, 409 00:17:52,341 --> 00:17:54,240 covered in lecture, constant time. 410 00:17:54,240 --> 00:17:57,680 So this is a bit tricky because if you have an array 411 00:17:57,680 --> 00:18:00,680 implementation that's naive, it's not constant time. 412 00:18:00,680 --> 00:18:03,320 But Python does some magic called table doubling, which 413 00:18:03,320 --> 00:18:05,190 we'll cover later in the course. 414 00:18:05,190 --> 00:18:11,470 And this is why you can say that append takes constant time. 415 00:18:11,470 --> 00:18:12,230 OK. 416 00:18:12,230 --> 00:18:16,560 Else, so if the character is not alphanumeric, 417 00:18:16,560 --> 00:18:20,050 than what's going on here? 418 00:18:20,050 --> 00:18:23,428 Can anyone see what's happening there? 419 00:18:23,428 --> 00:18:26,410 AUDIENCE: If its like, [INAUDIBLE]. 420 00:18:26,410 --> 00:18:29,278 VICTOR COSTAN: OK, so let's say if it's a space. 421 00:18:29,278 --> 00:18:30,153 AUDIENCE: [INAUDIBLE] 422 00:18:30,153 --> 00:18:33,450 423 00:18:33,450 --> 00:18:35,620 VICTOR COSTAN: Yeah, this the harder part. 424 00:18:35,620 --> 00:18:37,490 I think you need to run this on an example 425 00:18:37,490 --> 00:18:39,950 to figure out what's going on. 426 00:18:39,950 --> 00:18:42,310 I have to run it on an example in my head. 427 00:18:42,310 --> 00:18:46,610 So let's take this small example here, the fox is outside. 428 00:18:46,610 --> 00:18:48,610 And this is a single line, right? 429 00:18:48,610 --> 00:18:49,390 Nice and handy. 430 00:18:49,390 --> 00:18:52,180 So this can be the input for get words from string. 431 00:18:52,180 --> 00:18:54,090 And let's see what happens. 432 00:18:54,090 --> 00:19:01,600 So first I start with word list which is empty list, character, 433 00:19:01,600 --> 00:19:08,710 lists, empty list. 434 00:19:08,710 --> 00:19:11,330 Take the first character, it's alphanumeric, 435 00:19:11,330 --> 00:19:14,770 gets appended here, the second character, alphanumeric, 436 00:19:14,770 --> 00:19:17,140 appended here, third character, alphanumeric, 437 00:19:17,140 --> 00:19:19,010 gets appended here. 438 00:19:19,010 --> 00:19:20,980 Fourth character, not alphanumeric, 439 00:19:20,980 --> 00:19:25,910 so I get to run lines 15 through 18. 440 00:19:25,910 --> 00:19:26,920 OK, I did the easy part. 441 00:19:26,920 --> 00:19:28,700 Someone walk me through the hard part. 442 00:19:28,700 --> 00:19:33,600 What happens in lines 15 through 18? 443 00:19:33,600 --> 00:19:34,858 Yes. 444 00:19:34,858 --> 00:19:38,316 AUDIENCE: First, it takes that list and joins it 445 00:19:38,316 --> 00:19:40,790 into a string. [INAUDIBLE] 446 00:19:40,790 --> 00:19:43,550 VICTOR COSTAN: OK, so this is a list of characters. 447 00:19:43,550 --> 00:19:47,840 And join takes the list and makes a string out of it. 448 00:19:47,840 --> 00:19:50,840 So I'll have the string the. 449 00:19:50,840 --> 00:19:53,390 OK, excellent. 450 00:19:53,390 --> 00:19:55,570 AUDIENCE: And it converts it all to lower case. 451 00:19:55,570 --> 00:19:56,320 VICTOR COSTAN: OK. 452 00:19:56,320 --> 00:20:00,208 453 00:20:00,208 --> 00:20:03,620 AUDIENCE: End up [INAUDIBLE] that to the word list. 454 00:20:03,620 --> 00:20:05,620 VICTOR COSTAN: The world list is up here, right? 455 00:20:05,620 --> 00:20:10,413 So this is going to have the. 456 00:20:10,413 --> 00:20:13,836 AUDIENCE: And then it clears the character list, [INAUDIBLE]. 457 00:20:13,836 --> 00:20:18,855 458 00:20:18,855 --> 00:20:19,605 VICTOR COSTAN: OK. 459 00:20:19,605 --> 00:20:23,670 460 00:20:23,670 --> 00:20:31,680 So now as I go through the next word, I have F-O-X. 461 00:20:31,680 --> 00:20:34,000 Then this becomes the word, and it gets added here. 462 00:20:34,000 --> 00:20:40,680 463 00:20:40,680 --> 00:20:42,470 So on and so forth for everything. 464 00:20:42,470 --> 00:20:46,840 Do people see how this method works now? 465 00:20:46,840 --> 00:20:50,360 I'm not getting that many nods, so questions. 466 00:20:50,360 --> 00:20:52,440 If I don't get nods, I'll stop and you guys 467 00:20:52,440 --> 00:20:54,530 have to ask what you're confused about. 468 00:20:54,530 --> 00:20:57,230 AUDIENCE: I think it's a little tricky because instead 469 00:20:57,230 --> 00:21:00,052 of saying if it's not an alphanumeric character, 470 00:21:00,052 --> 00:21:02,738 it's just like well, if the length of the list 471 00:21:02,738 --> 00:21:04,792 is greater than 0, which threw me off initially, 472 00:21:04,792 --> 00:21:07,610 but then I realized it was just, like, omission. 473 00:21:07,610 --> 00:21:09,360 VICTOR COSTAN: OK, so why does it do this? 474 00:21:09,360 --> 00:21:12,316 What is the point of the length of the character list? 475 00:21:12,316 --> 00:21:15,600 AUDIENCE: So that there are two spaces. 476 00:21:15,600 --> 00:21:18,080 VICTOR COSTAN: Excellent. 477 00:21:18,080 --> 00:21:23,270 So here I was nice and I had one space, one space, one space. 478 00:21:23,270 --> 00:21:26,410 But if I'm sloppy when I'm typing and I have two spaces 479 00:21:26,410 --> 00:21:31,950 here, then suppose this is space, space-- kind a small, 480 00:21:31,950 --> 00:21:33,060 but pretend. 481 00:21:33,060 --> 00:21:34,550 Go with me here. 482 00:21:34,550 --> 00:21:37,020 So we got here. 483 00:21:37,020 --> 00:21:38,470 We got the fox is. 484 00:21:38,470 --> 00:21:41,720 485 00:21:41,720 --> 00:21:45,620 And then this list is empty because line 18 just 486 00:21:45,620 --> 00:21:48,180 made it empty. 487 00:21:48,180 --> 00:21:50,520 If I run the code the lines 15 through 18, 488 00:21:50,520 --> 00:21:53,930 it's going to add an empty word up here. 489 00:21:53,930 --> 00:21:57,280 And empty words aren't very useful. 490 00:21:57,280 --> 00:21:59,360 You'll see how many times the documents have 491 00:21:59,360 --> 00:22:01,661 too many spaces in them, so that doesn't really help. 492 00:22:01,661 --> 00:22:03,411 AUDIENCE: I mean, isn't that not an issue, 493 00:22:03,411 --> 00:22:07,470 because you call if C is L1 before you actually 494 00:22:07,470 --> 00:22:09,350 get to that. 495 00:22:09,350 --> 00:22:12,400 So you'd run through it again, but you would still 496 00:22:12,400 --> 00:22:14,950 just skip over that. 497 00:22:14,950 --> 00:22:18,030 That would fail, I mean it would not 498 00:22:18,030 --> 00:22:19,280 do anything for that equation. 499 00:22:19,280 --> 00:22:21,350 VICTOR COSTAN: So first space. 500 00:22:21,350 --> 00:22:22,750 C as L now fails. 501 00:22:22,750 --> 00:22:24,717 I run lines 15 through 18. 502 00:22:24,717 --> 00:22:25,300 AUDIENCE: Yep. 503 00:22:25,300 --> 00:22:26,175 VICTOR COSTAN: Right? 504 00:22:26,175 --> 00:22:27,170 I have is here. 505 00:22:27,170 --> 00:22:29,095 This becomes empty. 506 00:22:29,095 --> 00:22:29,970 AUDIENCE: Yep. 507 00:22:29,970 --> 00:22:33,557 AUDIENCE: Second space, C as L now fails again. 508 00:22:33,557 --> 00:22:34,140 AUDIENCE: Yep. 509 00:22:34,140 --> 00:22:36,470 VICTOR COSTAN: And if I wouldn't have the length check, 510 00:22:36,470 --> 00:22:40,080 it would run lines 15 through 18 again. 511 00:22:40,080 --> 00:22:40,961 AUDIENCE: Oh, OK. 512 00:22:40,961 --> 00:22:43,910 [INAUDIBLE] 513 00:22:43,910 --> 00:22:46,950 VICTOR COSTAN: OK, so this is what it's trying to prevent. 514 00:22:46,950 --> 00:22:49,284 So you can see that this code looks complicated, right? 515 00:22:49,284 --> 00:22:51,450 It's trying to do a lot of things, it's complicated, 516 00:22:51,450 --> 00:22:53,810 it's hard to analyze. 517 00:22:53,810 --> 00:22:55,100 Oh, well, let's go with it. 518 00:22:55,100 --> 00:22:59,590 Let's try to finish it up quickly. 519 00:22:59,590 --> 00:23:02,570 So now that we know what it does, let's 520 00:23:02,570 --> 00:23:04,960 try to figure out how many times each line runs 521 00:23:04,960 --> 00:23:06,650 and what's the cost? 522 00:23:06,650 --> 00:23:07,980 Yes. 523 00:23:07,980 --> 00:23:12,830 AUDIENCE: So I think the total cost is N times 1 524 00:23:12,830 --> 00:23:17,035 minus W over W plus 1. 525 00:23:17,035 --> 00:23:18,243 VICTOR COSTAN: Wait, so here? 526 00:23:18,243 --> 00:23:19,170 AUDIENCE: Yeah. 527 00:23:19,170 --> 00:23:25,031 VICTOR COSTAN: OK, so you're saying N times 1 minus. 528 00:23:25,031 --> 00:23:25,531 OK. 529 00:23:25,531 --> 00:23:28,810 530 00:23:28,810 --> 00:23:31,790 Why do you say that? 531 00:23:31,790 --> 00:23:33,183 I like it, but why? 532 00:23:33,183 --> 00:23:36,017 OK, it's because it's everything that is in the character, 533 00:23:36,017 --> 00:23:38,471 and the line above it was characters-- 534 00:23:38,471 --> 00:23:39,220 VICTOR COSTAN: OK. 535 00:23:39,220 --> 00:23:40,220 AUDIENCE: --all alphanumeric, [INAUDIBLE] 536 00:23:40,220 --> 00:23:41,970 VICTOR COSTAN: So basically spaces, right? 537 00:23:41,970 --> 00:23:44,160 If we have word, space, word, space, word, space, 538 00:23:44,160 --> 00:23:46,400 this happens for all the spaces. 539 00:23:46,400 --> 00:23:47,500 Cool. 540 00:23:47,500 --> 00:23:48,967 So this is good. 541 00:23:48,967 --> 00:23:50,425 I'm going to make it a bit simpler. 542 00:23:50,425 --> 00:23:54,120 543 00:23:54,120 --> 00:23:56,975 Same thing, it's just that it's slightly less intimidating. 544 00:23:56,975 --> 00:23:59,040 AUDIENCE: Oh, yeah. 545 00:23:59,040 --> 00:24:00,650 VICTOR COSTAN: Cool, thank you. 546 00:24:00,650 --> 00:24:03,000 Very brave, come up first. 547 00:24:03,000 --> 00:24:05,050 What's the running time for line 14? 548 00:24:05,050 --> 00:24:06,995 So, cost for running it once. 549 00:24:06,995 --> 00:24:10,320 550 00:24:10,320 --> 00:24:12,110 AUDIENCE: Constant. 551 00:24:12,110 --> 00:24:12,610 Excellent. 552 00:24:12,610 --> 00:24:14,850 VICTOR COSTAN: I like you guys. 553 00:24:14,850 --> 00:24:15,780 Nice. 554 00:24:15,780 --> 00:24:19,020 Line 15, how much time does it to take 555 00:24:19,020 --> 00:24:21,880 to take characters and put them into a list? 556 00:24:21,880 --> 00:24:24,110 AUDIENCE: N? 557 00:24:24,110 --> 00:24:24,860 VICTOR COSTAN: N-- 558 00:24:24,860 --> 00:24:25,120 AUDIENCE: [INAUDIBLE] 559 00:24:25,120 --> 00:24:27,340 VICTOR COSTAN: --where N is the size of the list, right? 560 00:24:27,340 --> 00:24:27,800 AUDIENCE: Yeah. 561 00:24:27,800 --> 00:24:28,550 VICTOR COSTAN: OK. 562 00:24:28,550 --> 00:24:30,814 So what's the size of the list now? 563 00:24:30,814 --> 00:24:33,154 AUDIENCE: [INAUDIBLE] 564 00:24:33,154 --> 00:24:34,090 AUDIENCE: [INAUDIBLE] 565 00:24:34,090 --> 00:24:36,430 VICTOR COSTAN: Yep. 566 00:24:36,430 --> 00:24:39,342 OK, so when you're using more than one letter, 567 00:24:39,342 --> 00:24:41,550 the problem is you have to pay attention to which one 568 00:24:41,550 --> 00:24:42,091 you're using. 569 00:24:42,091 --> 00:24:44,080 Because when we teach algorithms, 570 00:24:44,080 --> 00:24:46,910 we say oh, this is N, this is N squared, so on and so forth. 571 00:24:46,910 --> 00:24:49,010 You have to replace it to the right letter. 572 00:24:49,010 --> 00:24:51,415 And I get confused about this all the time, so-- 573 00:24:51,415 --> 00:24:51,810 AUDIENCE: [INAUDIBLE] 574 00:24:51,810 --> 00:24:52,630 VICTOR COSTAN: --a serious problem. 575 00:24:52,630 --> 00:24:53,614 AUDIENCE: --columns? 576 00:24:53,614 --> 00:24:56,080 What are the two columns? 577 00:24:56,080 --> 00:24:59,270 VICTOR COSTAN: So this is the cost of running a line once, 578 00:24:59,270 --> 00:25:01,296 and this is how many times it's run. 579 00:25:01,296 --> 00:25:02,130 AUDIENCE: Oh, OK. 580 00:25:02,130 --> 00:25:02,930 VICTOR COSTAN: Thanks for the question. 581 00:25:02,930 --> 00:25:04,260 I should have said that in the beginning. 582 00:25:04,260 --> 00:25:04,760 Thank you. 583 00:25:04,760 --> 00:25:07,480 584 00:25:07,480 --> 00:25:09,230 OK, let's make this a little bit faster 585 00:25:09,230 --> 00:25:12,490 and notice that lines 15 through 18 586 00:25:12,490 --> 00:25:14,789 all run the same number of times, right? 587 00:25:14,789 --> 00:25:16,580 They're in the if, and there's nothing else 588 00:25:16,580 --> 00:25:19,670 that's changes the control flow there. 589 00:25:19,670 --> 00:25:28,320 So lines 15 through 18 are O and divided by W plus 1. 590 00:25:28,320 --> 00:25:29,740 All right, line 16. 591 00:25:29,740 --> 00:25:30,930 Take a word. 592 00:25:30,930 --> 00:25:33,600 So take a string and make another string 593 00:25:33,600 --> 00:25:37,070 where each character is the lowercase version. 594 00:25:37,070 --> 00:25:38,360 AUDIENCE: [INAUDIBLE] 595 00:25:38,360 --> 00:25:39,610 VICTOR COSTAN: OK, cool. 596 00:25:39,610 --> 00:25:41,643 Why W, intuitively? 597 00:25:41,643 --> 00:25:44,884 AUDIENCE: Because [INAUDIBLE] has to check to make sure 598 00:25:44,884 --> 00:25:46,929 [INAUDIBLE] 599 00:25:46,929 --> 00:25:47,720 VICTOR COSTAN: Yep. 600 00:25:47,720 --> 00:25:48,750 AUDIENCE: [INAUDIBLE] 601 00:25:48,750 --> 00:25:50,999 VICTOR COSTAN: Yeah, so if you have a 10,000 character 602 00:25:50,999 --> 00:25:53,190 string you, have to go through 10,000 characters. 603 00:25:53,190 --> 00:25:55,210 Very good. 604 00:25:55,210 --> 00:25:58,032 Append 917. 605 00:25:58,032 --> 00:26:00,512 AUDIENCE: [INAUDIBLE] 606 00:26:00,512 --> 00:26:02,000 VICTOR COSTAN: Sweet. 607 00:26:02,000 --> 00:26:06,565 And line 18, we said the character list of length list. 608 00:26:06,565 --> 00:26:07,531 AUDIENCE: [INAUDIBLE] 609 00:26:07,531 --> 00:26:13,170 VICTOR COSTAN: [INAUDIBLE] OK, how many times 610 00:26:13,170 --> 00:26:18,505 do lines 19 through 23 run? 611 00:26:18,505 --> 00:26:19,400 AUDIENCE: Once. 612 00:26:19,400 --> 00:26:20,608 VICTOR COSTAN: At most, once. 613 00:26:20,608 --> 00:26:24,540 614 00:26:24,540 --> 00:26:26,727 AUDIENCE: [INAUDIBLE] 615 00:26:26,727 --> 00:26:29,310 VICTOR COSTAN: Can anyone figure out what's the point of them? 616 00:26:29,310 --> 00:26:33,099 617 00:26:33,099 --> 00:26:36,649 AUDIENCE: Catch any trailing [INAUDIBLE] 618 00:26:36,649 --> 00:26:37,482 VICTOR COSTAN: Good. 619 00:26:37,482 --> 00:26:40,404 If you ended on the last letter of a word, 620 00:26:40,404 --> 00:26:42,360 you want to make sure you catch that word. 621 00:26:42,360 --> 00:26:42,870 VICTOR COSTAN: All right. 622 00:26:42,870 --> 00:26:43,820 AUDIENCE: [INAUDIBLE] 623 00:26:43,820 --> 00:26:44,861 VICTOR COSTAN: Very good. 624 00:26:44,861 --> 00:26:46,030 So I find it here. 625 00:26:46,030 --> 00:26:48,980 Then after I'm done with the loop at line 19 626 00:26:48,980 --> 00:26:52,895 what the word list would have, the fox is. 627 00:26:52,895 --> 00:26:54,270 And then the character list would 628 00:26:54,270 --> 00:26:56,150 have the characters for outside. 629 00:26:56,150 --> 00:26:58,970 If I return the word list, woops, I just missed a word. 630 00:26:58,970 --> 00:27:07,430 So lines 20 through 22 are a copy of lines 15 through 17, 631 00:27:07,430 --> 00:27:11,110 and they take care of the last word. 632 00:27:11,110 --> 00:27:14,720 So line 19 is an if, and it takes the length of a list 633 00:27:14,720 --> 00:27:16,250 and compares it to the number. 634 00:27:16,250 --> 00:27:19,344 What's the cost of that? 635 00:27:19,344 --> 00:27:20,200 AUDIENCE: Constant. 636 00:27:20,200 --> 00:27:21,730 VICTOR COSTAN: OK, very good. 637 00:27:21,730 --> 00:27:24,980 Checking list length in Python is constant time. 638 00:27:24,980 --> 00:27:27,340 We did that in lecture. 639 00:27:27,340 --> 00:27:29,490 How about lines 20 through 22? 640 00:27:29,490 --> 00:27:32,262 641 00:27:32,262 --> 00:27:33,720 I just gave it away, guys, come on. 642 00:27:33,720 --> 00:27:34,530 Someone-- 643 00:27:34,530 --> 00:27:36,380 AUDIENCE: The same as 15 through 17. 644 00:27:36,380 --> 00:27:38,890 VICTOR COSTAN: OK, same as 15 through 17. 645 00:27:38,890 --> 00:27:41,460 W, W, 1. 646 00:27:41,460 --> 00:27:46,480 Line 23, return constant time. 647 00:27:46,480 --> 00:27:50,480 OK, so now we know how much it takes to run a line once, 648 00:27:50,480 --> 00:27:52,640 how many times each line runs. 649 00:27:52,640 --> 00:27:55,360 So we're going to do a dot product of these guys. 650 00:27:55,360 --> 00:27:57,920 See, dot products are useful. 651 00:27:57,920 --> 00:28:00,520 And if we do a dot product of these guys, 652 00:28:00,520 --> 00:28:03,180 we're going to get the total running time for the function. 653 00:28:03,180 --> 00:28:05,105 So let's compute the partial terms. 654 00:28:05,105 --> 00:28:06,350 AUDIENCE: [INAUDIBLE] 655 00:28:06,350 --> 00:28:07,310 VICTOR COSTAN: I'm not going to write them down. 656 00:28:07,310 --> 00:28:09,730 Let's just go through them and figure out what they are. 657 00:28:09,730 --> 00:28:13,494 So you guys say them. 658 00:28:13,494 --> 00:28:17,964 AUDIENCE: 1, 1, N, N, weird equation-- 659 00:28:17,964 --> 00:28:19,380 VICTOR COSTAN: OK, weird equation, 660 00:28:19,380 --> 00:28:21,177 what was the important part? 661 00:28:21,177 --> 00:28:22,010 [INTERPOSING VOICES] 662 00:28:22,010 --> 00:28:23,550 VICTOR COSTAN: Yeah, the important part. 663 00:28:23,550 --> 00:28:24,841 The important part is N, right? 664 00:28:24,841 --> 00:28:27,784 This is some constant times N, so N. 665 00:28:27,784 --> 00:28:35,670 AUDIENCE: N, N, N, N, N, N, 1, 1. 666 00:28:35,670 --> 00:28:37,406 VICTOR COSTAN: Pay attention. 667 00:28:37,406 --> 00:28:39,280 AUDIENCE: 1, N. 668 00:28:39,280 --> 00:28:41,100 VICTOR COSTAN: Pay attention. 669 00:28:41,100 --> 00:28:42,495 It's not N, it's not 1. 670 00:28:42,495 --> 00:28:43,370 AUDIENCE: [INAUDIBLE] 671 00:28:43,370 --> 00:28:45,130 VICTOR COSTAN: OK, actually is 1 I guess, 672 00:28:45,130 --> 00:28:46,740 if you think that W is a constant. 673 00:28:46,740 --> 00:28:47,364 Sorry. 674 00:28:47,364 --> 00:28:48,530 AUDIENCE: You're testing us. 675 00:28:48,530 --> 00:28:49,800 VICTOR COSTAN: OK. 676 00:28:49,800 --> 00:28:52,331 1, 1. 677 00:28:52,331 --> 00:28:54,580 VICTOR COSTAN: So I heard two numbers, N and 1, right? 678 00:28:54,580 --> 00:28:59,770 So this is 0 of N plus 1, which is order N, 679 00:28:59,770 --> 00:29:04,370 because as N goes to infinity, 1 becomes really tiny. 680 00:29:04,370 --> 00:29:07,660 OK, so this is how you analyze a function. 681 00:29:07,660 --> 00:29:10,700 Big functions are horribly painful to analyze because you 682 00:29:10,700 --> 00:29:14,760 have to look at each line and do this kind of reasoning. 683 00:29:14,760 --> 00:29:16,640 And it's not even a top level function here, 684 00:29:16,640 --> 00:29:19,340 so I don't even get to write anything here yet. 685 00:29:19,340 --> 00:29:22,490 So get words from string takes order and time 686 00:29:22,490 --> 00:29:24,980 where N is the length of a line. 687 00:29:24,980 --> 00:29:28,100 Let's look at get words from line list. 688 00:29:28,100 --> 00:29:29,289 AUDIENCE: I have a question. 689 00:29:29,289 --> 00:29:30,080 VICTOR COSTAN: Yes. 690 00:29:30,080 --> 00:29:33,545 AUDIENCE: So [INAUDIBLE] is W characters long? 691 00:29:33,545 --> 00:29:37,699 Like, does it matter if the [INAUDIBLE] 692 00:29:37,699 --> 00:29:38,990 VICTOR COSTAN: Does it matter-- 693 00:29:38,990 --> 00:29:41,470 AUDIENCE: [INAUDIBLE] make that assumption of that? 694 00:29:41,470 --> 00:29:45,760 VICTOR COSTAN: So that I can reason for lines 15 and 16. 695 00:29:45,760 --> 00:29:49,640 I can reason through them easily if I have a content length. 696 00:29:49,640 --> 00:29:52,410 It turns out that if you have an average length, 697 00:29:52,410 --> 00:29:54,580 the results are going to be the same. 698 00:29:54,580 --> 00:30:03,110 Like overall, if you look at the running time as a sum of what's 699 00:30:03,110 --> 00:30:05,730 the running time for converting all the words to lowercase 700 00:30:05,730 --> 00:30:07,490 and then appending them to the list. 701 00:30:07,490 --> 00:30:10,140 The sum of those is still going to be n N, 702 00:30:10,140 --> 00:30:12,230 but that takes a bit more time to reason through 703 00:30:12,230 --> 00:30:13,200 so I took a shortcut. 704 00:30:13,200 --> 00:30:17,202 705 00:30:17,202 --> 00:30:19,790 Are you a math major, by the way? 706 00:30:19,790 --> 00:30:21,790 You're very rigorous. 707 00:30:21,790 --> 00:30:22,450 OK. 708 00:30:22,450 --> 00:30:24,550 So this is good, it's always good to try 709 00:30:24,550 --> 00:30:26,150 to keep this in the back of your head 710 00:30:26,150 --> 00:30:31,260 to make sure you don't fall for a trap. 711 00:30:31,260 --> 00:30:33,790 So get words from string order N, 712 00:30:33,790 --> 00:30:36,150 and we're trying to figure out get words from line list. 713 00:30:36,150 --> 00:30:39,090 Any more questions before I do that? 714 00:30:39,090 --> 00:30:42,530 Or does anyone want to tell me I'm wrong? 715 00:30:42,530 --> 00:30:44,610 OK, good. 716 00:30:44,610 --> 00:30:47,320 So get words from line list. 717 00:30:47,320 --> 00:30:50,890 Lines 2 through 6. 718 00:30:50,890 --> 00:30:53,100 2 3, 4, 5, 6. 719 00:30:53,100 --> 00:30:55,690 720 00:30:55,690 --> 00:30:58,034 Line 2. 721 00:30:58,034 --> 00:30:59,860 AUDIENCE: 1. 722 00:30:59,860 --> 00:31:02,851 VICTOR COSTAN: OK, cost 1, how many times does it run? 723 00:31:02,851 --> 00:31:03,476 AUDIENCE: Once. 724 00:31:03,476 --> 00:31:05,290 VICTOR COSTAN: Cool. 725 00:31:05,290 --> 00:31:07,990 Line 3. 726 00:31:07,990 --> 00:31:09,170 We need a new number, right? 727 00:31:09,170 --> 00:31:12,000 We need the number of lines in a document. 728 00:31:12,000 --> 00:31:13,825 Let's say we have Z lines. 729 00:31:13,825 --> 00:31:19,010 730 00:31:19,010 --> 00:31:25,710 So line 3 runs Z times, and 4 and 5 are in a loop 731 00:31:25,710 --> 00:31:30,692 so they also run Z times What's the cost for line 4? 732 00:31:30,692 --> 00:31:33,524 733 00:31:33,524 --> 00:31:34,204 AUDIENCE: 1. 734 00:31:34,204 --> 00:31:35,245 VICTOR COSTAN: Excellent. 735 00:31:35,245 --> 00:31:38,870 736 00:31:38,870 --> 00:31:41,934 What's the cost for line 3? 737 00:31:41,934 --> 00:31:42,790 AUDIENCE: 1. 738 00:31:42,790 --> 00:31:44,950 VICTOR COSTAN: 1. 739 00:31:44,950 --> 00:31:46,590 And what is the cost for line 5? 740 00:31:46,590 --> 00:31:54,398 741 00:31:54,398 --> 00:31:55,880 AUDIENCE: Looks constant. 742 00:31:55,880 --> 00:31:58,125 VICTOR COSTAN: Looks constant, OK. 743 00:31:58,125 --> 00:31:59,000 AUDIENCE: [INAUDIBLE] 744 00:31:59,000 --> 00:32:03,030 VICTOR COSTAN: Does anyone else think it looks constant? 745 00:32:03,030 --> 00:32:04,618 Yeah. 746 00:32:04,618 --> 00:32:06,100 AUDIENCE: It's a trap. 747 00:32:06,100 --> 00:32:07,450 VICTOR COSTAN: It's a trap. 748 00:32:07,450 --> 00:32:08,948 It's a trap. 749 00:32:08,948 --> 00:32:10,310 [INTERPOSING VOICES] 750 00:32:10,310 --> 00:32:11,810 AUDIENCE: --length of the two lists. 751 00:32:11,810 --> 00:32:12,650 VICTOR COSTAN: OK. 752 00:32:12,650 --> 00:32:14,880 Good. 753 00:32:14,880 --> 00:32:17,080 You paid attention in lecture, right? 754 00:32:17,080 --> 00:32:17,990 AUDIENCE: I try. 755 00:32:17,990 --> 00:32:19,810 VICTOR COSTAN: Nice. 756 00:32:19,810 --> 00:32:25,830 OK, so we have plus as an operator, 757 00:32:25,830 --> 00:32:29,280 and suppose we work with two lists. 758 00:32:29,280 --> 00:32:34,410 The first list is 1, 2, 3, all the way through 1,000. 759 00:32:34,410 --> 00:32:39,380 And the second list is 1, 2, 3. 760 00:32:39,380 --> 00:32:42,010 So when you code plus to combine them, 761 00:32:42,010 --> 00:32:46,170 if you say something like C equals A plus B, 762 00:32:46,170 --> 00:32:49,160 you would expect that-- if this is A, by the way 763 00:32:49,160 --> 00:32:53,380 and this is B-- you would expect that after you call this A is 764 00:32:53,380 --> 00:32:56,120 still this, B is still this, and C 765 00:32:56,120 --> 00:32:58,740 is a list that contains everything. 766 00:32:58,740 --> 00:33:04,070 So because of that, what plus has to do is make a new list, 767 00:33:04,070 --> 00:33:07,350 append all the elements here, append all the elements here. 768 00:33:07,350 --> 00:33:10,630 So the cost of this if this list is 1,000 and this list is 3 769 00:33:10,630 --> 00:33:11,940 is 1,003. 770 00:33:11,940 --> 00:33:17,340 Or if you have two lists of length, L1 and L2 771 00:33:17,340 --> 00:33:22,580 the cost is order of L1 plus L2. 772 00:33:22,580 --> 00:33:24,920 Now there's another Python method called extend, 773 00:33:24,920 --> 00:33:28,432 which does what I think you would expect plus 774 00:33:28,432 --> 00:33:29,640 to do in terms of efficiency. 775 00:33:29,640 --> 00:33:33,020 776 00:33:33,020 --> 00:33:36,670 So what extend does is you call it a 1 or A on one list, 777 00:33:36,670 --> 00:33:38,610 give it the other list, and it's going 778 00:33:38,610 --> 00:33:40,260 to take each element in the second list 779 00:33:40,260 --> 00:33:43,020 and append it to the first list. 780 00:33:43,020 --> 00:33:47,050 So for each element here, it calls append on this list. 781 00:33:47,050 --> 00:33:48,746 So what's the running time for extend? 782 00:33:48,746 --> 00:33:49,621 AUDIENCE: [INAUDIBLE] 783 00:33:49,621 --> 00:33:52,920 784 00:33:52,920 --> 00:33:55,226 VICTOR COSTAN: OK, there are too many directions and-- 785 00:33:55,226 --> 00:33:56,200 AUDIENCE: Length of the second list. 786 00:33:56,200 --> 00:33:58,366 VICTOR COSTAN: Length of the second list, excellent. 787 00:33:58,366 --> 00:34:03,210 So two lists, L1, L2, order of L2. 788 00:34:03,210 --> 00:34:05,360 So it doesn't matter this is 1,000 elements 789 00:34:05,360 --> 00:34:08,130 are a million elements, appending three elements is 790 00:34:08,130 --> 00:34:11,739 going to take time proportional to three. 791 00:34:11,739 --> 00:34:14,860 OK now, let's see what's going on here. 792 00:34:14,860 --> 00:34:19,100 So we have Z lines and characters in a line. 793 00:34:19,100 --> 00:34:22,520 794 00:34:22,520 --> 00:34:24,730 I think I want a nicer constant. 795 00:34:24,730 --> 00:34:28,069 796 00:34:28,069 --> 00:34:29,360 No, let's go with this for now. 797 00:34:29,360 --> 00:34:32,240 798 00:34:32,240 --> 00:34:34,650 AUDIENCE: [INAUDIBLE] lines. 799 00:34:34,650 --> 00:34:38,020 VICTOR COSTAN: So this is the length of a word. 800 00:34:38,020 --> 00:34:40,020 Let's see, how many words will I have in a line? 801 00:34:40,020 --> 00:34:47,530 Let's say I have K words in a line, which is N divided by W. 802 00:34:47,530 --> 00:34:49,989 So I know that to get words from string 803 00:34:49,989 --> 00:34:55,219 returns a list of size K. So if that is the case, then 804 00:34:55,219 --> 00:34:59,820 the first time line 5 runs, word list is empty. 805 00:34:59,820 --> 00:35:01,580 And it's going to get K elements. 806 00:35:01,580 --> 00:35:05,310 The second time it runs, word list has K elements 807 00:35:05,310 --> 00:35:06,530 and gets K more. 808 00:35:06,530 --> 00:35:09,590 Third time, it has 2K elements, it gets K more. 809 00:35:09,590 --> 00:35:12,420 So the running time for this looks like this. 810 00:35:12,420 --> 00:35:19,150 K plus 2K plus 3K plus 4K all the way 811 00:35:19,150 --> 00:35:23,010 until when I'm at the last line, if I have Z lines. 812 00:35:23,010 --> 00:35:27,720 I had Z minus 1 times K elements in the list, 813 00:35:27,720 --> 00:35:30,000 because I have Z minus 1 lines and I put all the words 814 00:35:30,000 --> 00:35:35,080 in the list, and I'm adding K more words. 815 00:35:35,080 --> 00:35:43,760 So total, Z times K running time. 816 00:35:43,760 --> 00:35:46,010 So this is the total running time for this guy. 817 00:35:46,010 --> 00:35:50,510 And this is not constant, so it's complicated. 818 00:35:50,510 --> 00:35:52,910 What is the sum come down to, asymptotically? 819 00:35:52,910 --> 00:36:00,210 820 00:36:00,210 --> 00:36:04,990 AUDIENCE: Z plus 1K times Z over 2. 821 00:36:04,990 --> 00:36:05,740 VICTOR COSTAN: Ok. 822 00:36:05,740 --> 00:36:17,000 Z plus 1K, ZK over 2. 823 00:36:17,000 --> 00:36:19,770 Slow because I care about asymptotics, 824 00:36:19,770 --> 00:36:31,180 this is order of Z squared times K, right? 825 00:36:31,180 --> 00:36:33,820 So now any one more natural number to work with 826 00:36:33,820 --> 00:36:36,500 would be the number of words in a document. 827 00:36:36,500 --> 00:36:38,940 And the number of words in a document 828 00:36:38,940 --> 00:36:50,150 is W, which is Z times K. So Z is W divided by K. 829 00:36:50,150 --> 00:36:53,930 And if I substitute this, I get that this 830 00:36:53,930 --> 00:37:05,170 is equal to 0 of W squared over K. Now in a reasonable document 831 00:37:05,170 --> 00:37:08,840 that I see, there tends to be a limited number of words 832 00:37:08,840 --> 00:37:12,860 per line because the document has to fit on a page. 833 00:37:12,860 --> 00:37:15,580 So K's pretty much a constant. 834 00:37:15,580 --> 00:37:18,820 So this comes down to order of W squared. 835 00:37:18,820 --> 00:37:21,790 836 00:37:21,790 --> 00:37:27,830 So if I go down here and look at get word from line list, 837 00:37:27,830 --> 00:37:31,679 this is W squared, where W is how many words I 838 00:37:31,679 --> 00:37:32,470 have in a document. 839 00:37:32,470 --> 00:37:35,130 840 00:37:35,130 --> 00:37:38,830 How many of you guys are still with me? 841 00:37:38,830 --> 00:37:39,770 Half. 842 00:37:39,770 --> 00:37:41,400 OK. 843 00:37:41,400 --> 00:37:43,460 Does anyone else want to ask questions, 844 00:37:43,460 --> 00:37:46,360 so that you can get back on track? 845 00:37:46,360 --> 00:37:48,424 Yes, no? 846 00:37:48,424 --> 00:37:49,873 AUDIENCE: It makes sense so far. 847 00:37:49,873 --> 00:37:50,914 VICTOR COSTAN: Thank you. 848 00:37:50,914 --> 00:37:52,455 AUDIENCE: I think I didn't understand 849 00:37:52,455 --> 00:37:55,201 the part of [INAUDIBLE] 850 00:37:55,201 --> 00:37:55,950 VICTOR COSTAN: OK. 851 00:37:55,950 --> 00:37:58,060 Thank you. 852 00:37:58,060 --> 00:38:02,280 So let's see what's going on lines 2 through 5. 853 00:38:02,280 --> 00:38:09,360 So I have a word list, which at the beginning is empty. 854 00:38:09,360 --> 00:38:12,640 Then in line 4, words in line gets K words. 855 00:38:12,640 --> 00:38:15,300 856 00:38:15,300 --> 00:38:21,840 And those K words in line five are added to word list. 857 00:38:21,840 --> 00:38:25,420 So after that, word list has K words. 858 00:38:25,420 --> 00:38:26,880 Then I run through the loop again. 859 00:38:26,880 --> 00:38:29,880 Get the words from string gives me K new words. 860 00:38:29,880 --> 00:38:33,770 They get added to the list, which now has 2K words. 861 00:38:33,770 --> 00:38:35,470 Next time I get K more words, they 862 00:38:35,470 --> 00:38:39,890 get that added to the list, which has 3K. 863 00:38:39,890 --> 00:38:41,530 So on and so forth until the end. 864 00:38:41,530 --> 00:38:44,380 I have ugly numbers. 865 00:38:44,380 --> 00:38:50,820 Z minus 1 times K words and I add the last K words. 866 00:38:50,820 --> 00:38:53,570 867 00:38:53,570 --> 00:38:56,480 I'm getting confused here. 868 00:38:56,480 --> 00:38:59,690 And I get Z times K words. 869 00:38:59,690 --> 00:39:02,840 So the word list is eventually going to have Z times K words, 870 00:39:02,840 --> 00:39:04,710 and it gets them K at a time. 871 00:39:04,710 --> 00:39:08,450 The thing that does this addition is the plus operator. 872 00:39:08,450 --> 00:39:10,370 And the running time for the plus operator 873 00:39:10,370 --> 00:39:14,100 is the size of the two lists, so it's this plus this. 874 00:39:14,100 --> 00:39:17,440 So that's why the running time is first K, then 2K, then 3K, 875 00:39:17,440 --> 00:39:23,387 then-- make sense now? 876 00:39:23,387 --> 00:39:23,970 AUDIENCE: Yes. 877 00:39:23,970 --> 00:39:26,290 VICTOR COSTAN: OK. 878 00:39:26,290 --> 00:39:30,620 So this is a subtle bug because if you change plus to extend, 879 00:39:30,620 --> 00:39:33,050 you get [? bug ?] disk two, which runs a lot faster. 880 00:39:33,050 --> 00:39:37,265 881 00:39:37,265 --> 00:39:37,765 OK. 882 00:39:37,765 --> 00:39:42,270 883 00:39:42,270 --> 00:39:45,790 So for everything else, we want to be 884 00:39:45,790 --> 00:39:47,330 able to do this sort of analysis, 885 00:39:47,330 --> 00:39:49,027 but we want to do it faster. 886 00:39:49,027 --> 00:39:51,110 So you guys should look through [? bug list ?] one 887 00:39:51,110 --> 00:39:55,110 through eight and do the same analysis for all the functions. 888 00:39:55,110 --> 00:39:58,760 And we're going to post recitation notes where 889 00:39:58,760 --> 00:40:01,130 we tell you this is the function that changed, 890 00:40:01,130 --> 00:40:02,642 and this is the total running time. 891 00:40:02,642 --> 00:40:04,100 And you should go through the lines 892 00:40:04,100 --> 00:40:07,610 and convince yourself that this is the right running time. 893 00:40:07,610 --> 00:40:10,290 And you should do that until it becomes second nature, 894 00:40:10,290 --> 00:40:12,002 because when you're writing Python code, 895 00:40:12,002 --> 00:40:13,460 you want to have this in your head. 896 00:40:13,460 --> 00:40:14,880 You don't want to have to write it down, 897 00:40:14,880 --> 00:40:17,450 because if you have to write it down, you're going to be lazy 898 00:40:17,450 --> 00:40:19,158 and you're not going to do it, and you're 899 00:40:19,158 --> 00:40:20,850 going to use plus instead of extend, 900 00:40:20,850 --> 00:40:23,280 and your code is going to be horribly slow. 901 00:40:23,280 --> 00:40:25,114 So practice until this gets in your head, 902 00:40:25,114 --> 00:40:27,530 and then you'll be able to see the running time for things 903 00:40:27,530 --> 00:40:28,155 really quickly. 904 00:40:28,155 --> 00:40:31,070 905 00:40:31,070 --> 00:40:35,820 OK, do we have time for once more let me see. 906 00:40:35,820 --> 00:40:37,120 OK. 907 00:40:37,120 --> 00:40:39,310 Let's look at the running time for inner products, 908 00:40:39,310 --> 00:40:40,780 because this is nice and easy. 909 00:40:40,780 --> 00:40:44,700 910 00:40:44,700 --> 00:40:53,030 2, 3, 4, 5, 6, 7. 911 00:40:53,030 --> 00:40:57,900 2 is 1, 1, very nice and easy. 912 00:40:57,900 --> 00:41:05,200 3 looks at the first document list and iterates through it. 913 00:41:05,200 --> 00:41:09,430 Iteration is constant time, but if the first document vector 914 00:41:09,430 --> 00:41:15,100 has L1 elements, it's going to run L1 times. 915 00:41:15,100 --> 00:41:18,270 How about line 4, words 2 count 2 in L2. 916 00:41:18,270 --> 00:41:26,330 This is iteration again, so it's constant time to run it once, 917 00:41:26,330 --> 00:41:28,146 but how many times will it run? 918 00:41:28,146 --> 00:41:30,130 AUDIENCE: L2 times L1 times. 919 00:41:30,130 --> 00:41:33,420 VICTOR COSTAN: L2 times the 1, excellent. 920 00:41:33,420 --> 00:41:35,440 So these two loops are nested inside each other 921 00:41:35,440 --> 00:41:39,170 so that means that lines 4 through 6 922 00:41:39,170 --> 00:41:44,060 are going to run once every time line 3 iterates. 923 00:41:44,060 --> 00:41:45,590 So sorry, actually line 4 is going 924 00:41:45,590 --> 00:41:49,110 to run once every time line 3 iterates. 925 00:41:49,110 --> 00:41:53,130 And then everything inside the second 4 926 00:41:53,130 --> 00:41:56,980 is going to run L1 times L2 times. 927 00:41:56,980 --> 00:42:02,705 So lines 5 and 6 are also going to run L1, L2 times. 928 00:42:02,705 --> 00:42:06,430 L1, L2, L1, L2. 929 00:42:06,430 --> 00:42:11,716 How much time does it take to do that if check there? 930 00:42:11,716 --> 00:42:13,040 AUDIENCE: [INAUDIBLE] 931 00:42:13,040 --> 00:42:15,040 VICTOR COSTAN: Why does it take a constant time? 932 00:42:15,040 --> 00:42:19,304 933 00:42:19,304 --> 00:42:21,345 AUDIENCE: I was going to say, it wasn't constant, 934 00:42:21,345 --> 00:42:25,680 so you don't have to pair each character with no word. 935 00:42:25,680 --> 00:42:26,680 VICTOR COSTAN: OK, good. 936 00:42:26,680 --> 00:42:28,360 So we have two words, and equal, equal 937 00:42:28,360 --> 00:42:31,430 tells me are the words equal or not, right? 938 00:42:31,430 --> 00:42:35,450 So the way you do that, is you have words like the and fox. 939 00:42:35,450 --> 00:42:37,270 You go through each character, and you 940 00:42:37,270 --> 00:42:40,640 stop whenever you see different characters. 941 00:42:40,640 --> 00:42:46,440 But if you have something like, if you have a fake word 942 00:42:46,440 --> 00:42:50,704 F-O-I and fox, then go through the first character, 943 00:42:50,704 --> 00:42:53,370 they're equal, second character, they're equal, third character, 944 00:42:53,370 --> 00:42:54,620 they're different. 945 00:42:54,620 --> 00:42:57,390 So if you have length W words that 946 00:42:57,390 --> 00:42:59,220 are different only in the last character, 947 00:42:59,220 --> 00:43:02,660 this is going to be order W, right? 948 00:43:02,660 --> 00:43:04,210 So the real-- 949 00:43:04,210 --> 00:43:05,620 AUDIENCE: [INAUDIBLE] 950 00:43:05,620 --> 00:43:08,750 VICTOR COSTAN: --yep, equals, equals 4 strings not constant. 951 00:43:08,750 --> 00:43:13,470 It takes W time where W is the length of a word. 952 00:43:13,470 --> 00:43:15,662 Now here we said that the length of a word 953 00:43:15,662 --> 00:43:17,620 is constant because we're dealing with English. 954 00:43:17,620 --> 00:43:19,890 So you could tell me it is constant because of that. 955 00:43:19,890 --> 00:43:22,181 But I would like to hear the argument before I take it. 956 00:43:22,181 --> 00:43:24,630 957 00:43:24,630 --> 00:43:26,050 How about line 6? 958 00:43:26,050 --> 00:43:31,330 959 00:43:31,330 --> 00:43:32,770 AUDIENCE: Well, if the plus equals 960 00:43:32,770 --> 00:43:36,140 is going to be the same thing before when we were, 961 00:43:36,140 --> 00:43:39,270 every new time your plus equals, so it's 962 00:43:39,270 --> 00:43:41,940 going to be like how the word list before we were adding it, 963 00:43:41,940 --> 00:43:43,548 where we have to create that object, 964 00:43:43,548 --> 00:43:45,524 and then add it to the length. 965 00:43:45,524 --> 00:43:46,018 I mean, its going to be length of sum. 966 00:43:46,018 --> 00:43:46,518 Sorry. 967 00:43:46,518 --> 00:43:48,488 And then you add in the new one. 968 00:43:48,488 --> 00:43:50,572 So every time its going to be increasing, correct? 969 00:43:50,572 --> 00:43:51,488 VICTOR COSTAN: Almost. 970 00:43:51,488 --> 00:43:52,557 It's a trap again. 971 00:43:52,557 --> 00:43:53,390 [INTERPOSING VOICES] 972 00:43:53,390 --> 00:43:55,020 VICTOR COSTAN: Yep. 973 00:43:55,020 --> 00:43:56,770 Yeah, so this time they're not lists. 974 00:43:56,770 --> 00:44:00,460 So if you look at what's going on inside there, 975 00:44:00,460 --> 00:44:03,840 you have count one and count two are 976 00:44:03,840 --> 00:44:08,780 these numbers in the document vector, so they're numbers. 977 00:44:08,780 --> 00:44:11,124 And then some starts out at 0, and then it 978 00:44:11,124 --> 00:44:12,040 keeps getting numbers. 979 00:44:12,040 --> 00:44:14,050 So sum is going to be a number. 980 00:44:14,050 --> 00:44:16,240 And multiplying numbers is constant time, 981 00:44:16,240 --> 00:44:19,150 adding numbers is constant time, so plus for numbers 982 00:44:19,150 --> 00:44:20,587 is order 1 indeed. 983 00:44:20,587 --> 00:44:22,420 AUDIENCE: You're reassigning sum every time? 984 00:44:22,420 --> 00:44:24,003 VICTOR COSTAN: Which is also constant. 985 00:44:24,003 --> 00:44:24,545 AUDIENCE: OK. 986 00:44:24,545 --> 00:44:26,711 VICTOR COSTAN: Because you're copying a number over. 987 00:44:26,711 --> 00:44:28,660 So as long as you're copying one element over, 988 00:44:28,660 --> 00:44:29,535 that's constant time. 989 00:44:29,535 --> 00:44:32,370 If you're adding two elements together-- two elements, 990 00:44:32,370 --> 00:44:36,070 not two lists-- that's constant time. 991 00:44:36,070 --> 00:44:39,090 So this is constant. 992 00:44:39,090 --> 00:44:42,010 And the last line is returned. 993 00:44:42,010 --> 00:44:43,750 So what's the running time for this? 994 00:44:43,750 --> 00:44:46,630 995 00:44:46,630 --> 00:44:49,040 AUDIENCE: L2 times L1. 996 00:44:49,040 --> 00:44:50,250 VICTOR COSTAN: Excellent. 997 00:44:50,250 --> 00:44:52,880 So I assume this is a constant. 998 00:44:52,880 --> 00:44:55,860 So this lets me say this is 1, and then 999 00:44:55,860 --> 00:45:00,260 if we do the partial products we get 1L, 1L, 1, and L2. 1000 00:45:00,260 --> 00:45:01,510 L1, L2, L1, L2. 1001 00:45:01,510 --> 00:45:03,780 And if you add them up, you get L1 and L2. 1002 00:45:03,780 --> 00:45:06,380 1003 00:45:06,380 --> 00:45:11,290 So this is going to be L1, L2. 1004 00:45:11,290 --> 00:45:15,410 Vector angle calls inner product three times, right? 1005 00:45:15,410 --> 00:45:18,895 So what's it's running time? 1006 00:45:18,895 --> 00:45:19,877 AUDIENCE: L1, L2. 1007 00:45:19,877 --> 00:45:23,699 1008 00:45:23,699 --> 00:45:24,740 VICTOR COSTAN: Excellent. 1009 00:45:24,740 --> 00:45:27,390 1010 00:45:27,390 --> 00:45:29,090 Count frequency. 1011 00:45:29,090 --> 00:45:31,130 You're going to have to take my word for it 1012 00:45:31,130 --> 00:45:36,870 that this is order of W squared. 1013 00:45:36,870 --> 00:45:39,270 And if that's the case, what's the running 1014 00:45:39,270 --> 00:45:41,490 time for a word frequency for file? 1015 00:45:41,490 --> 00:45:44,983 1016 00:45:44,983 --> 00:45:45,981 AUDIENCE: W squared? 1017 00:45:45,981 --> 00:45:49,973 1018 00:45:49,973 --> 00:45:50,980 VICTOR COSTAN: Cool. 1019 00:45:50,980 --> 00:45:51,030 So. 1020 00:45:51,030 --> 00:45:52,770 What's the running time for main now? 1021 00:45:52,770 --> 00:45:55,835 1022 00:45:55,835 --> 00:45:56,335 Last trick. 1023 00:45:56,335 --> 00:45:57,000 AUDIENCE: [INAUDIBLE] 1024 00:45:57,000 --> 00:45:58,833 VICTOR COSTAN: Yep, If you just add them up, 1025 00:45:58,833 --> 00:46:00,932 except there is one last trick there. 1026 00:46:00,932 --> 00:46:04,299 AUDIENCE: If W is constant, [INAUDIBLE] 1027 00:46:04,299 --> 00:46:05,261 VICTOR COSTAN: No. 1028 00:46:05,261 --> 00:46:08,160 AUDIENCE: [INAUDIBLE] W's constant, right? 1029 00:46:08,160 --> 00:46:09,390 VICTOR COSTAN: No. 1030 00:46:09,390 --> 00:46:11,793 So W is the number of words in a document. 1031 00:46:11,793 --> 00:46:12,660 AUDIENCE: Oh. 1032 00:46:12,660 --> 00:46:14,440 VICTOR COSTAN: So it's huge. 1033 00:46:14,440 --> 00:46:16,190 If that's constant, then the whole problem 1034 00:46:16,190 --> 00:46:18,065 should run in order one time, and we're done. 1035 00:46:18,065 --> 00:46:19,790 We're going home. 1036 00:46:19,790 --> 00:46:23,940 AUDIENCE: W squared because it beats out L1 and L2. 1037 00:46:23,940 --> 00:46:25,460 VICTOR COSTAN: OK, so-- 1038 00:46:25,460 --> 00:46:26,110 AUDIENCE: L1-- 1039 00:46:26,110 --> 00:46:28,130 VICTOR COSTAN: --you're going faster than me. 1040 00:46:28,130 --> 00:46:31,190 You're going too fast, but you're right. 1041 00:46:31,190 --> 00:46:35,490 So word frequency for file is called twice. 1042 00:46:35,490 --> 00:46:38,220 The first document is going to have W1 words. 1043 00:46:38,220 --> 00:46:41,460 The second document is going to have W2 words. 1044 00:46:41,460 --> 00:46:44,470 So you can just copy W because this is called twice 1045 00:46:44,470 --> 00:46:46,940 for different files. 1046 00:46:46,940 --> 00:46:51,410 So this is order of W1 squared plus W2 1047 00:46:51,410 --> 00:46:54,290 squared, different documents. 1048 00:46:54,290 --> 00:46:59,870 1049 00:46:59,870 --> 00:47:03,760 And then I have plus L1, L2. 1050 00:47:03,760 --> 00:47:07,550 1051 00:47:07,550 --> 00:47:12,960 And you said that W1 and W2 dominate L1 and L2, right? 1052 00:47:12,960 --> 00:47:16,120 Because W's the total number of words in a document, 1053 00:47:16,120 --> 00:47:19,640 whereas L the is the number of unique words, 1054 00:47:19,640 --> 00:47:22,810 because it the length of the vector. 1055 00:47:22,810 --> 00:47:24,500 So that is true, but I'm not sure 1056 00:47:24,500 --> 00:47:28,000 how to reduce this here to make use of that. 1057 00:47:28,000 --> 00:47:31,962 However, I made use of what you said already when I wrote this. 1058 00:47:31,962 --> 00:47:35,740 1059 00:47:35,740 --> 00:47:37,300 You see why? 1060 00:47:37,300 --> 00:47:39,240 Can anyone else see why? 1061 00:47:39,240 --> 00:47:42,600 1062 00:47:42,600 --> 00:47:52,460 So let's look at the vector angle again, lines 2 and 3. 1063 00:47:52,460 --> 00:47:58,330 So line 2, it calls inner product with L1 and L2. 1064 00:47:58,330 --> 00:48:00,670 But if you look at line 3, it calls inner product 1065 00:48:00,670 --> 00:48:05,670 with L1, L1 and then L2, L2 So the total running time 1066 00:48:05,670 --> 00:48:10,880 for vector angle is actually L1, L2 plus L1 1067 00:48:10,880 --> 00:48:12,540 squared plus L2 squared. 1068 00:48:12,540 --> 00:48:17,880 1069 00:48:17,880 --> 00:48:20,550 So if the first document has 1,000 words 1070 00:48:20,550 --> 00:48:22,810 and the second document as one word, 1071 00:48:22,810 --> 00:48:25,680 computing the inner product between L1 and L1 1072 00:48:25,680 --> 00:48:27,830 is going to take a lot more time than computing 1073 00:48:27,830 --> 00:48:30,050 the inner product between L1 and L2. 1074 00:48:30,050 --> 00:48:32,910 So I can't leave out these terms. 1075 00:48:32,910 --> 00:48:34,440 They have to be here. 1076 00:48:34,440 --> 00:48:37,130 However, when I add them up here-- 1077 00:48:37,130 --> 00:48:41,270 if I would write W1 squared plus W2 squared plus L1 squared 1078 00:48:41,270 --> 00:48:44,650 plus L2 squared plus this-- in that case, 1079 00:48:44,650 --> 00:48:47,340 I can use the fact that W1 is bigger than L1, 1080 00:48:47,340 --> 00:48:50,735 and it cancels it out. 1081 00:48:50,735 --> 00:48:51,610 Does this make sense? 1082 00:48:51,610 --> 00:48:52,420 Did I lose people? 1083 00:48:52,420 --> 00:48:55,490 1084 00:48:55,490 --> 00:48:58,188 Ask questions, please. 1085 00:48:58,188 --> 00:49:02,751 1086 00:49:02,751 --> 00:49:04,584 AUDIENCE: But you can't get rid of L1 and L2 1087 00:49:04,584 --> 00:49:07,577 and not an [INAUDIBLE]. 1088 00:49:07,577 --> 00:49:08,660 VICTOR COSTAN: You can't-- 1089 00:49:08,660 --> 00:49:09,710 AUDIENCE: [INAUDIBLE] 1090 00:49:09,710 --> 00:49:11,500 VICTOR COSTAN: Oh, so I can't get rid of this term-- 1091 00:49:11,500 --> 00:49:12,541 AUDIENCE: --those, right? 1092 00:49:12,541 --> 00:49:17,335 So this should be the sum of this and this, right? 1093 00:49:17,335 --> 00:49:18,200 AUDIENCE: Right. 1094 00:49:18,200 --> 00:49:22,080 VICTOR COSTAN: So it should be W1 squared plus W2 squared 1095 00:49:22,080 --> 00:49:26,382 plus L1 squared plus L2 squared plus L1, L2. 1096 00:49:26,382 --> 00:49:28,110 AUDIENCE: Right. 1097 00:49:28,110 --> 00:49:30,680 L1 is strictly smaller than W1. 1098 00:49:30,680 --> 00:49:31,620 AUDIENCE: Yeah. 1099 00:49:31,620 --> 00:49:35,402 Goes away, L2 smaller than W2 goes away, and I get this. 1100 00:49:35,402 --> 00:49:36,326 Correct. 1101 00:49:36,326 --> 00:49:40,796 So L1L2 isn't smaller than W [INAUDIBLE] squared? 1102 00:49:40,796 --> 00:49:41,670 VICTOR COSTAN: Is it? 1103 00:49:41,670 --> 00:49:43,086 If you know more math than me, you 1104 00:49:43,086 --> 00:49:44,530 might be able to prove that it is, 1105 00:49:44,530 --> 00:49:47,422 but I don't, so I'm just leaving it in there. 1106 00:49:47,422 --> 00:49:47,963 AUDIENCE: Ok. 1107 00:49:47,963 --> 00:49:49,367 VICTOR COSTAN: Yeah. 1108 00:49:49,367 --> 00:49:51,200 I think there is some relation, but I really 1109 00:49:51,200 --> 00:49:53,940 don't remember what it this, so let's 1110 00:49:53,940 --> 00:49:55,070 leave it like that for now. 1111 00:49:55,070 --> 00:50:00,854 1112 00:50:00,854 --> 00:50:02,770 Yeah, I think it should be the case that these 1113 00:50:02,770 --> 00:50:06,250 are bigger than this, but I'm not sure. 1114 00:50:06,250 --> 00:50:07,463 OK, yes. 1115 00:50:07,463 --> 00:50:12,200 AUDIENCE: How do you get the line for vector angle? 1116 00:50:12,200 --> 00:50:15,020 VICTOR COSTAN: How do I get the running time for it? 1117 00:50:15,020 --> 00:50:19,390 So vector angle gets two vectors, right? 1118 00:50:19,390 --> 00:50:22,250 The vector for document one and the vector for document two. 1119 00:50:22,250 --> 00:50:24,190 The length of the first vector is L1. 1120 00:50:24,190 --> 00:50:26,590 The length of the second vector is L2. 1121 00:50:26,590 --> 00:50:29,260 Now, line, where is it? 1122 00:50:29,260 --> 00:50:32,550 1123 00:50:32,550 --> 00:50:38,050 Line 2, for numerator calls inner product with L1 and L2. 1124 00:50:38,050 --> 00:50:43,350 So we know that the running time is L1, L2 up here. 1125 00:50:43,350 --> 00:50:46,080 Now the next line, line 3 in vector angle, 1126 00:50:46,080 --> 00:50:49,990 calls inner product with L1 and L1. 1127 00:50:49,990 --> 00:50:53,700 So the running time is L1 times L1 which is L1 squared. 1128 00:50:53,700 --> 00:50:54,892 OK. 1129 00:50:54,892 --> 00:50:56,600 AUDIENCE: Can we say that because there's 1130 00:50:56,600 --> 00:51:02,814 a bounded number of words in the English language, L1's bounded? 1131 00:51:02,814 --> 00:51:04,287 And as the length of the document 1132 00:51:04,287 --> 00:51:08,215 gets really, really big, that [INAUDIBLE] constant? 1133 00:51:08,215 --> 00:51:11,180 1134 00:51:11,180 --> 00:51:15,300 VICTOR COSTAN: Yeah, you might be able to do that. 1135 00:51:15,300 --> 00:51:19,150 Yes, I think for the cases that we give you, that is true. 1136 00:51:19,150 --> 00:51:21,036 Yeah, I never thought of that, that's cool. 1137 00:51:21,036 --> 00:51:24,012 AUDIENCE: It doesn't work if it's not a language, right? 1138 00:51:24,012 --> 00:51:25,580 If you just have gibberish? 1139 00:51:25,580 --> 00:51:32,760 VICTOR COSTAN: Yes, also, to say that its constant is useful 1140 00:51:32,760 --> 00:51:35,050 when the number of words in English 1141 00:51:35,050 --> 00:51:37,660 is much smaller than your input size. 1142 00:51:37,660 --> 00:51:40,180 So if, say, English has 50,000 words 1143 00:51:40,180 --> 00:51:43,850 and your input is 3,000 words, then the input is much smaller. 1144 00:51:43,850 --> 00:51:45,910 But if you're input is a million words, which 1145 00:51:45,910 --> 00:51:48,330 I think is what we use, then yeah, 1146 00:51:48,330 --> 00:51:49,709 it comes down to constant. 1147 00:51:49,709 --> 00:51:51,000 So yeah, that's a good insight. 1148 00:51:51,000 --> 00:51:51,791 That's really nice. 1149 00:51:51,791 --> 00:51:54,572 1150 00:51:54,572 --> 00:51:55,536 Anything else? 1151 00:51:55,536 --> 00:52:02,780 1152 00:52:02,780 --> 00:52:06,410 OK, so you get to go through document distance 3 to 8. 1153 00:52:06,410 --> 00:52:08,690 We'll tell you what's changed, and we'll 1154 00:52:08,690 --> 00:52:11,020 give you a chance to help you analyze it. 1155 00:52:11,020 --> 00:52:13,850 But you have to analyze it, then update the scorecard 1156 00:52:13,850 --> 00:52:19,000 for each algorithm to see how things improve. 1157 00:52:19,000 --> 00:52:20,067 Thanks. 1158 00:52:20,067 --> 00:52:20,567