1 00:00:00,000 --> 00:00:00,060 2 00:00:00,060 --> 00:00:01,770 The following content is provided 3 00:00:01,770 --> 00:00:04,010 under a Creative Commons license. 4 00:00:04,010 --> 00:00:06,860 Your support will help MIT OpenCourseWare continue 5 00:00:06,860 --> 00:00:10,720 to offer high-quality educational resources for free. 6 00:00:10,720 --> 00:00:13,340 To make a donation or view additional materials 7 00:00:13,340 --> 00:00:17,226 from hundreds of MIT courses, visit MIT OpenCourseWare 8 00:00:17,226 --> 00:00:17,851 at ocw.mit.edu. 9 00:00:17,851 --> 00:00:22,030 10 00:00:22,030 --> 00:00:25,270 PROFESSOR: So did everyone turn in PSET1? 11 00:00:25,270 --> 00:00:25,770 Yes? 12 00:00:25,770 --> 00:00:27,220 Good. 13 00:00:27,220 --> 00:00:30,760 OK, so there is a PSET1 critique due in a few days. 14 00:00:30,760 --> 00:00:32,189 My advice? 15 00:00:32,189 --> 00:00:33,215 You already did PSET1. 16 00:00:33,215 --> 00:00:35,090 You remember what you wrote out on the proof. 17 00:00:35,090 --> 00:00:36,506 Look at the solution. 18 00:00:36,506 --> 00:00:38,630 Write one paragraph today, and you're done with it. 19 00:00:38,630 --> 00:00:40,800 Then you go focus on PSET2. 20 00:00:40,800 --> 00:00:42,334 If you leave it off until Tuesday, 21 00:00:42,334 --> 00:00:44,250 you're going to have to read your proof again, 22 00:00:44,250 --> 00:00:46,370 remember what you're thinking. 23 00:00:46,370 --> 00:00:47,577 It's a lot more work. 24 00:00:47,577 --> 00:00:49,160 Just do it now, get it out of the way, 25 00:00:49,160 --> 00:00:51,466 and put PSET1 behind you. 26 00:00:51,466 --> 00:00:53,340 AUDIENCE: Is the critique only for the proof? 27 00:00:53,340 --> 00:00:54,784 Or is this for all of them? 28 00:00:54,784 --> 00:00:55,450 PROFESSOR: Nope. 29 00:00:55,450 --> 00:00:56,399 Just the proof. 30 00:00:56,399 --> 00:00:58,440 So you have to compare your proof with our proof? 31 00:00:58,440 --> 00:01:01,128 32 00:01:01,128 --> 00:01:02,880 AUDIENCE: Is there an assignment for that? 33 00:01:02,880 --> 00:01:04,876 Or do we just know to do it? 34 00:01:04,876 --> 00:01:05,610 PROFESSOR: Uh. 35 00:01:05,610 --> 00:01:07,475 PSET1 be Stellar. 36 00:01:07,475 --> 00:01:09,990 37 00:01:09,990 --> 00:01:10,580 Oh, no, sorry. 38 00:01:10,580 --> 00:01:11,413 It's not in Stellar. 39 00:01:11,413 --> 00:01:13,580 It's on our new grading site that just went out. 40 00:01:13,580 --> 00:01:15,330 So you have to go to our new grading site, 41 00:01:15,330 --> 00:01:17,500 and you have to type in your critique there. 42 00:01:17,500 --> 00:01:18,790 And it's one paragraph. 43 00:01:18,790 --> 00:01:20,220 You should aim for one paragraph. 44 00:01:20,220 --> 00:01:21,760 If you're doing more than that, then you're 45 00:01:21,760 --> 00:01:22,676 doing something wrong. 46 00:01:22,676 --> 00:01:25,300 And it's LATEX Plus math mode. 47 00:01:25,300 --> 00:01:29,460 So you can use math mode, and that's about it. 48 00:01:29,460 --> 00:01:31,852 OK, any more questions about the critique? 49 00:01:31,852 --> 00:01:32,560 It's a new thing. 50 00:01:32,560 --> 00:01:35,160 We care about it because it will make our grading life easier. 51 00:01:35,160 --> 00:01:37,360 And because it'll force you to look at the solutions and see 52 00:01:37,360 --> 00:01:39,130 what you understood and what you didn't. 53 00:01:39,130 --> 00:01:40,060 So we care about it. 54 00:01:40,060 --> 00:01:41,960 Don't ignore it. 55 00:01:41,960 --> 00:01:43,638 Yes? 56 00:01:43,638 --> 00:01:45,304 AUDIENCE: Like, how much is it weighted? 57 00:01:45,304 --> 00:01:48,294 How much does it count toward the grade? 58 00:01:48,294 --> 00:01:49,960 PROFESSOR: If you don't have a critique, 59 00:01:49,960 --> 00:01:53,394 we will most likely give you a 0 for proof. 60 00:01:53,394 --> 00:01:54,310 AUDIENCE: [INAUDIBLE]. 61 00:01:54,310 --> 00:01:57,350 62 00:01:57,350 --> 00:02:00,070 PROFESSOR: If your proof is bad and your critique of the proof 63 00:02:00,070 --> 00:02:02,070 is good, then you might get something. 64 00:02:02,070 --> 00:02:04,370 If your proof is bad and you have no critique-- 65 00:02:04,370 --> 00:02:06,370 actually if your proof is whatever it is and you 66 00:02:06,370 --> 00:02:07,619 have no critique, you get a 0. 67 00:02:07,619 --> 00:02:10,032 AUDIENCE: Yeah. [CHUCKLE] 68 00:02:10,032 --> 00:02:11,740 PROFESSOR: Any more questions about that? 69 00:02:11,740 --> 00:02:14,870 70 00:02:14,870 --> 00:02:17,080 OK. 71 00:02:17,080 --> 00:02:20,740 Who needs help remembering what the document distance 72 00:02:20,740 --> 00:02:21,960 problem is? 73 00:02:21,960 --> 00:02:24,910 74 00:02:24,910 --> 00:02:25,410 OK. 75 00:02:25,410 --> 00:02:28,350 Everyone who went to lecture or to [INAUDIBLE] remembers. 76 00:02:28,350 --> 00:02:29,840 That's good. 77 00:02:29,840 --> 00:02:31,360 Who went to lecture last time? 78 00:02:31,360 --> 00:02:34,720 79 00:02:34,720 --> 00:02:35,290 Cool. 80 00:02:35,290 --> 00:02:36,010 That's good. 81 00:02:36,010 --> 00:02:38,560 So we did insertion sort and merge sort 82 00:02:38,560 --> 00:02:39,970 from a theoretical standpoint. 83 00:02:39,970 --> 00:02:42,851 Today we're going to look at the code for the insertion sort 84 00:02:42,851 --> 00:02:45,100 and, if we have time, look at the code for merge-sort, 85 00:02:45,100 --> 00:02:48,660 and use the same strategy as we did last time to analyze them, 86 00:02:48,660 --> 00:02:51,290 look at the running time, make sure the running time matches 87 00:02:51,290 --> 00:02:55,410 the theory, and see how pseudocode turns into Python. 88 00:02:55,410 --> 00:02:59,010 89 00:02:59,010 --> 00:03:01,690 So you all have your listings. 90 00:03:01,690 --> 00:03:05,790 Last time in document distance, we covered Main, 91 00:03:05,790 --> 00:03:07,570 and we covered most of the functions 92 00:03:07,570 --> 00:03:10,450 except for count frequency. 93 00:03:10,450 --> 00:03:15,200 Can anyone remind me what the call graph looked like? 94 00:03:15,200 --> 00:03:16,700 So the call graph is the tree that I 95 00:03:16,700 --> 00:03:18,990 had up on the left, and it started at Main. 96 00:03:18,990 --> 00:03:28,650 97 00:03:28,650 --> 00:03:29,150 Thank you. 98 00:03:29,150 --> 00:03:31,810 99 00:03:31,810 --> 00:03:40,610 So Main calls word frequencies for file, which in turn calls? 100 00:03:40,610 --> 00:03:42,470 AUDIENCE: Well, it's probably line list. 101 00:03:42,470 --> 00:03:43,090 PROFESSOR: OK. 102 00:03:43,090 --> 00:03:45,042 AUDIENCE: And count frequency. 103 00:03:45,042 --> 00:03:51,390 104 00:03:51,390 --> 00:03:54,150 PROFESSOR: So we pretend we don't see the read file called. 105 00:03:54,150 --> 00:03:57,360 We assume that the data is already in memory 106 00:03:57,360 --> 00:03:59,970 or that the call takes time that's 107 00:03:59,970 --> 00:04:03,730 proportional to the running-- to the length of the file. 108 00:04:03,730 --> 00:04:07,036 And we only look at get-word from line list and count 109 00:04:07,036 --> 00:04:07,535 frequency. 110 00:04:07,535 --> 00:04:13,340 111 00:04:13,340 --> 00:04:14,390 OK. 112 00:04:14,390 --> 00:04:17,072 Who else does Main call? 113 00:04:17,072 --> 00:04:18,030 AUDIENCE: Vector angle? 114 00:04:18,030 --> 00:04:25,760 115 00:04:25,760 --> 00:04:28,624 PROFESSOR: And the vector angle? 116 00:04:28,624 --> 00:04:29,624 AUDIENCE: Inner product. 117 00:04:29,624 --> 00:04:38,017 118 00:04:38,017 --> 00:04:38,600 PROFESSOR: OK. 119 00:04:38,600 --> 00:04:43,370 Let's put up the constants for-- for the document distance 120 00:04:43,370 --> 00:04:46,720 problem that we used last time. 121 00:04:46,720 --> 00:04:51,780 So we said that the document has W words. 122 00:04:51,780 --> 00:04:54,170 And then when you take that list of words 123 00:04:54,170 --> 00:04:57,330 and you turn it into a distance vector, 124 00:04:57,330 --> 00:04:59,850 you will get assigned to a document vector. 125 00:04:59,850 --> 00:05:04,180 You will get L elements, which basically means L unique words. 126 00:05:04,180 --> 00:05:09,730 So L is the document vector length. 127 00:05:09,730 --> 00:05:12,530 And we assume we're using a natural language like English, 128 00:05:12,530 --> 00:05:15,090 so all the words are bounded in size. 129 00:05:15,090 --> 00:05:17,470 Like 5 to 10 characters, for example. 130 00:05:17,470 --> 00:05:20,210 And to make our life easier, we say 131 00:05:20,210 --> 00:05:30,430 all the words have the same size W. So w is the word length. 132 00:05:30,430 --> 00:05:32,810 Using these numbers, can anyone remind me 133 00:05:32,810 --> 00:05:36,190 what we said the costs for these methods are? 134 00:05:36,190 --> 00:05:38,690 And we didn't analyze count frequency, 135 00:05:38,690 --> 00:05:40,740 so it's OK to not take my word for it 136 00:05:40,740 --> 00:05:42,800 and not tell me what I said last time. 137 00:05:42,800 --> 00:05:46,010 But I would like numbers for word frequencies from file, 138 00:05:46,010 --> 00:05:50,084 get word from line list, vector angle, and inner product. 139 00:05:50,084 --> 00:05:51,915 AUDIENCE: Was that mu squared? 140 00:05:51,915 --> 00:05:53,160 That was last time. 141 00:05:53,160 --> 00:05:55,380 PROFESSOR: OK. 142 00:05:55,380 --> 00:05:57,150 Which-- which one? 143 00:05:57,150 --> 00:06:01,410 Does anyone-- does anyone else want to try? 144 00:06:01,410 --> 00:06:03,780 Let's not do guessing. 145 00:06:03,780 --> 00:06:08,290 I'll pull them up if nobody remembers. 146 00:06:08,290 --> 00:06:10,910 I spent an entire hour on that, and you guys did too. 147 00:06:10,910 --> 00:06:11,574 It was painful. 148 00:06:11,574 --> 00:06:12,282 AUDIENCE: I know. 149 00:06:12,282 --> 00:06:14,924 But we had to, like, you know, add them up, and then-- 150 00:06:14,924 --> 00:06:15,590 PROFESSOR: Yeah. 151 00:06:15,590 --> 00:06:17,340 So we did a lot of work for those numbers. 152 00:06:17,340 --> 00:06:20,280 153 00:06:20,280 --> 00:06:24,350 So get word from line list was order of W squared. 154 00:06:24,350 --> 00:06:26,490 Does anyone remember why? 155 00:06:26,490 --> 00:06:29,220 What made it take so much time? 156 00:06:29,220 --> 00:06:34,020 157 00:06:34,020 --> 00:06:36,900 AUDIENCE: 'Cause you append it to the word list. 158 00:06:36,900 --> 00:06:39,467 For, like, you add it to the end to go through. 159 00:06:39,467 --> 00:06:40,050 PROFESSOR: OK. 160 00:06:40,050 --> 00:06:43,820 161 00:06:43,820 --> 00:06:44,530 So you add it. 162 00:06:44,530 --> 00:06:45,920 So what? 163 00:06:45,920 --> 00:06:47,922 AUDIENCE: So, like, every time you need it, 164 00:06:47,922 --> 00:06:52,260 you, like, do word list equals word list plus words in line. 165 00:06:52,260 --> 00:06:53,070 PROFESSOR: OK. 166 00:06:53,070 --> 00:06:55,300 Excellent. 167 00:06:55,300 --> 00:06:58,200 So get words from line list, line five. 168 00:06:58,200 --> 00:06:59,580 There's a plus there. 169 00:06:59,580 --> 00:07:03,630 And that plus sign messes up the performance. 170 00:07:03,630 --> 00:07:06,890 So that's why it's w squared. 171 00:07:06,890 --> 00:07:08,800 Count frequency, we didn't cover it, 172 00:07:08,800 --> 00:07:10,160 you had to take my word for it. 173 00:07:10,160 --> 00:07:12,400 So we will cover it now. 174 00:07:12,400 --> 00:07:15,145 And, inner product. 175 00:07:15,145 --> 00:07:26,060 176 00:07:26,060 --> 00:07:29,610 So suppose you have two vectors of line-- length L1 and L2. 177 00:07:29,610 --> 00:07:32,700 How much time does it take to compute the inner product? 178 00:07:32,700 --> 00:07:34,492 AUDIENCE: L1, L2 time. 179 00:07:34,492 --> 00:07:35,075 PROFESSOR: OK. 180 00:07:35,075 --> 00:07:38,880 181 00:07:38,880 --> 00:07:40,970 So how much time does vector angle take? 182 00:07:40,970 --> 00:07:48,335 183 00:07:48,335 --> 00:07:49,317 AUDIENCE: L1 L2 time. 184 00:07:49,317 --> 00:07:53,750 185 00:07:53,750 --> 00:07:56,509 PROFESSOR: L1 L2. 186 00:07:56,509 --> 00:07:57,800 So it makes three calls, right? 187 00:07:57,800 --> 00:07:59,270 You get the two things. 188 00:07:59,270 --> 00:08:02,450 First, it computes the inner product of the two lists. 189 00:08:02,450 --> 00:08:05,890 And then it has to compute the inner product of each vector-- 190 00:08:05,890 --> 00:08:08,359 of each document vector with itself. 191 00:08:08,359 --> 00:08:10,525 Because that's what's on the bottom of the fraction. 192 00:08:10,525 --> 00:08:13,330 193 00:08:13,330 --> 00:08:15,210 So what's the running time for that? 194 00:08:15,210 --> 00:08:18,710 AUDIENCE: Plus L1 squared, plus L2 squared. 195 00:08:18,710 --> 00:08:22,880 PROFESSOR: Plus L1 squared, plus L2 squared. 196 00:08:22,880 --> 00:08:25,380 So someone was really helpful last time and asked me, 197 00:08:25,380 --> 00:08:27,600 can you make this simpler with some math? 198 00:08:27,600 --> 00:08:30,080 And I said, I don't know so I don't think I will. 199 00:08:30,080 --> 00:08:32,820 But I looked at-- my, I looked at my high school math 200 00:08:32,820 --> 00:08:37,549 afterwards, and it turns out that these-- so L1 squared 201 00:08:37,549 --> 00:08:40,830 plus L2 squared are guaranteed to be greater than L1 L2 202 00:08:40,830 --> 00:08:43,009 as long as these numbers are positive. 203 00:08:43,009 --> 00:08:44,550 And we're working with document list, 204 00:08:44,550 --> 00:08:46,350 so they're always positive. 205 00:08:46,350 --> 00:08:49,155 So this will go away. 206 00:08:49,155 --> 00:08:54,320 207 00:08:54,320 --> 00:09:00,080 So let's assume count frequency is w squared or smaller. 208 00:09:00,080 --> 00:09:03,310 So the total running time for word frequencies from file 209 00:09:03,310 --> 00:09:06,580 is w squared. 210 00:09:06,580 --> 00:09:08,300 What's the running time for everything? 211 00:09:08,300 --> 00:09:13,114 212 00:09:13,114 --> 00:09:14,780 AUDIENCE: Would just be, like, w squared 213 00:09:14,780 --> 00:09:17,410 plus L1 squared plus L2 squared? 214 00:09:17,410 --> 00:09:18,160 PROFESSOR: Almost. 215 00:09:18,160 --> 00:09:20,330 AUDIENCE: Actually, it depends on which one's greater, right? 216 00:09:20,330 --> 00:09:21,330 PROFESSOR: Well, almost. 217 00:09:21,330 --> 00:09:27,580 So here W works, assuming you get one document with W words. 218 00:09:27,580 --> 00:09:30,230 But word frequencies from files is called twice, once 219 00:09:30,230 --> 00:09:31,870 for each document. 220 00:09:31,870 --> 00:09:36,060 First document has W1 words, second document has W2 words. 221 00:09:36,060 --> 00:09:38,974 So what's the running time? 222 00:09:38,974 --> 00:09:42,750 AUDIENCE: W1 squared plus W2 squared. 223 00:09:42,750 --> 00:09:48,000 PROFESSOR: W1 squared plus W2 squared. 224 00:09:48,000 --> 00:09:50,200 So I take this, and I add this. 225 00:09:50,200 --> 00:09:51,590 Right? 226 00:09:51,590 --> 00:09:55,380 Except when I add this, if I want to add L1 squared, 227 00:09:55,380 --> 00:09:58,290 I know L1 is the number of unique documents-- 228 00:09:58,290 --> 00:10:00,600 of unique words in the document. 229 00:10:00,600 --> 00:10:04,450 And W1 is the total number of words in the document. 230 00:10:04,450 --> 00:10:07,177 W1 guaranteed to be greater or equal than L1, 231 00:10:07,177 --> 00:10:08,260 so it's going to dominate. 232 00:10:08,260 --> 00:10:09,440 I don't need to add it. 233 00:10:09,440 --> 00:10:11,360 Same for L2. 234 00:10:11,360 --> 00:10:12,178 This is it. 235 00:10:12,178 --> 00:10:15,200 236 00:10:15,200 --> 00:10:15,700 OK. 237 00:10:15,700 --> 00:10:18,054 You guys don't seem to remember the numbers for these. 238 00:10:18,054 --> 00:10:20,220 So that means I didn't torture you enough last time. 239 00:10:20,220 --> 00:10:21,830 So let's do more. 240 00:10:21,830 --> 00:10:23,270 Let's look at count frequency. 241 00:10:23,270 --> 00:10:27,750 And let's compute the cost for that. 242 00:10:27,750 --> 00:10:36,830 243 00:10:36,830 --> 00:10:40,730 So what we did last time was, we went through each line of code. 244 00:10:40,730 --> 00:10:42,820 We thought, how much time does it 245 00:10:42,820 --> 00:10:44,550 take to execute the line once? 246 00:10:44,550 --> 00:10:46,710 And how many times does the line run? 247 00:10:46,710 --> 00:10:49,819 And then we compute the product of that, add everything up, 248 00:10:49,819 --> 00:10:51,360 and that's the cost for the function. 249 00:10:51,360 --> 00:10:55,940 250 00:10:55,940 --> 00:10:58,260 First off, before I put numbers here, 251 00:10:58,260 --> 00:10:59,540 what does the method to do? 252 00:10:59,540 --> 00:11:09,480 253 00:11:09,480 --> 00:11:12,040 AUDIENCE: It takes a list of words-- 254 00:11:12,040 --> 00:11:12,730 PROFESSOR: OK. 255 00:11:12,730 --> 00:11:16,150 AUDIENCE: For each item in that list, 256 00:11:16,150 --> 00:11:17,870 checks to see if it's-- you know, 257 00:11:17,870 --> 00:11:25,407 list of words that it's-- counted, right? 258 00:11:25,407 --> 00:11:25,990 PROFESSOR: OK. 259 00:11:25,990 --> 00:11:28,290 So you're telling me what the code does. 260 00:11:28,290 --> 00:11:29,010 AUDIENCE: Yeah. 261 00:11:29,010 --> 00:11:31,070 PROFESSOR: Try to look at Main or try 262 00:11:31,070 --> 00:11:33,420 to look at word frequencies for files. 263 00:11:33,420 --> 00:11:36,540 So look at it top-down, and tell me what the purpose of it is. 264 00:11:36,540 --> 00:11:38,310 What's the goal? 265 00:11:38,310 --> 00:11:45,108 AUDIENCE: Making a list of-- and each object 266 00:11:45,108 --> 00:11:47,400 is a list with a word and a number. 267 00:11:47,400 --> 00:11:48,220 PROFESSOR: OK. 268 00:11:48,220 --> 00:11:49,100 Excellent. 269 00:11:49,100 --> 00:11:51,790 So big picture, I have the first document. 270 00:11:51,790 --> 00:11:52,640 I read it in. 271 00:11:52,640 --> 00:11:54,106 I break it up into words. 272 00:11:54,106 --> 00:11:55,230 And I have a list of words. 273 00:11:55,230 --> 00:11:58,380 That's what word frequencies for file 274 00:11:58,380 --> 00:12:00,920 gives me-- sorry, that's what you get words from line list 275 00:12:00,920 --> 00:12:02,190 gives me. 276 00:12:02,190 --> 00:12:02,890 List of words. 277 00:12:02,890 --> 00:12:08,720 278 00:12:08,720 --> 00:12:16,000 The fox is in the hat. 279 00:12:16,000 --> 00:12:18,680 280 00:12:18,680 --> 00:12:21,600 And this gets passed to count frequency, 281 00:12:21,600 --> 00:12:24,700 and count frequency gives me, you said, 282 00:12:24,700 --> 00:12:27,430 an object, which is a list. 283 00:12:27,430 --> 00:12:30,180 Of lists, where each of them has the word 284 00:12:30,180 --> 00:12:31,590 and how many times it shows up. 285 00:12:31,590 --> 00:12:39,510 So I would have "the" shows up twice, "fox" shows up 286 00:12:39,510 --> 00:12:51,110 once, "is" shows up once, "in" shows up once, 287 00:12:51,110 --> 00:12:56,110 and-- I need a shorter example-- "hat" shows up once. 288 00:12:56,110 --> 00:12:58,155 So it takes this and turns it into that. 289 00:12:58,155 --> 00:13:01,840 290 00:13:01,840 --> 00:13:05,090 So on line 2, I have a list L that's initialized. 291 00:13:05,090 --> 00:13:06,770 And then, at the end, it's returned. 292 00:13:06,770 --> 00:13:09,840 So I'm going to guess that L is going to look like this. 293 00:13:09,840 --> 00:13:13,350 294 00:13:13,350 --> 00:13:17,050 Line 3, for new word in word list, iterates over the input. 295 00:13:17,050 --> 00:13:19,430 So iterates over this. 296 00:13:19,430 --> 00:13:23,340 And then, line 4 checks to see, for each new word, 297 00:13:23,340 --> 00:13:26,150 it looks at the list that I have under construction. 298 00:13:26,150 --> 00:13:30,950 So exam-- for example, if I ran through all the words 299 00:13:30,950 --> 00:13:33,360 and then I'm trying to put in hat right now, 300 00:13:33,360 --> 00:13:35,250 I wouldn't have it here. 301 00:13:35,250 --> 00:13:39,890 What line 4 does is, it looks at all the entries. 302 00:13:39,890 --> 00:13:42,710 And it says, if I can find the words-- 303 00:13:42,710 --> 00:13:45,180 so if I can find the word hat somewhere here-- 304 00:13:45,180 --> 00:13:47,090 then increment the number. 305 00:13:47,090 --> 00:13:49,642 If I can't, then make a new entry 306 00:13:49,642 --> 00:13:52,100 and say that the word shows up once, because it's the first 307 00:13:52,100 --> 00:13:52,600 I see it. 308 00:13:52,600 --> 00:13:56,340 309 00:13:56,340 --> 00:13:57,590 So this is what the code does. 310 00:13:57,590 --> 00:14:01,640 Now let's see how fast it does that. 311 00:14:01,640 --> 00:14:07,450 So line 2 initialize the output to an empty list. 312 00:14:07,450 --> 00:14:10,071 What's the cost for that? 313 00:14:10,071 --> 00:14:10,988 AUDIENCE: [INAUDIBLE]. 314 00:14:10,988 --> 00:14:11,862 PROFESSOR: Very good. 315 00:14:11,862 --> 00:14:12,580 How many times? 316 00:14:12,580 --> 00:14:18,370 317 00:14:18,370 --> 00:14:19,810 For new word in word lists. 318 00:14:19,810 --> 00:14:20,310 Cost? 319 00:14:20,310 --> 00:14:23,726 320 00:14:23,726 --> 00:14:26,680 AUDIENCE: [INAUDIBLE]. 321 00:14:26,680 --> 00:14:29,070 I know the cost is 1. 322 00:14:29,070 --> 00:14:32,330 PROFESSOR: OK, so we are-- here, it's a bit confusing, right? 323 00:14:32,330 --> 00:14:35,610 We're saying that, oh, there's does iteration over a list. 324 00:14:35,610 --> 00:14:37,920 And each step of the iteration is constant time, 325 00:14:37,920 --> 00:14:42,210 but the iteration happens L times. 326 00:14:42,210 --> 00:14:43,510 I'm sorry, not L times. 327 00:14:43,510 --> 00:14:45,400 The length of the list times. 328 00:14:45,400 --> 00:14:48,120 How many-- how many elements are in word lists? 329 00:14:48,120 --> 00:14:53,090 330 00:14:53,090 --> 00:14:57,730 I heard a very low W, so I will pretend I heard it. 331 00:14:57,730 --> 00:15:01,350 Or I hope I heard W. So word list, the words I 332 00:15:01,350 --> 00:15:06,690 got from the document, W. How about the if. 333 00:15:06,690 --> 00:15:13,570 So, it looks at the word that I have-- oh. 334 00:15:13,570 --> 00:15:16,680 This code is confusing because I forgot a line, right? 335 00:15:16,680 --> 00:15:20,580 Pretend that between line-- oh, no, sorry, I didn't. 336 00:15:20,580 --> 00:15:23,512 New word is-- new word is assigned in line 3. 337 00:15:23,512 --> 00:15:29,030 So new word in line 3 is compared to the first element 338 00:15:29,030 --> 00:15:31,280 of the entry that's assigned in line 4. 339 00:15:31,280 --> 00:15:36,960 So hat is compared with the, fox, so on, so forth. 340 00:15:36,960 --> 00:15:41,530 And if the comparison is true, it runs line 6 and 7. 341 00:15:41,530 --> 00:15:45,230 And if not, it keeps looping. 342 00:15:45,230 --> 00:15:50,300 So line 5, the if how, many times does it run? 343 00:15:50,300 --> 00:15:52,498 Just the comparison. 344 00:15:52,498 --> 00:15:53,810 AUDIENCE: W. 345 00:15:53,810 --> 00:15:55,107 PROFESSOR: W. 346 00:15:55,107 --> 00:15:56,148 AUDIENCE: Oh, no, no, no. 347 00:15:56,148 --> 00:15:57,072 No. 348 00:15:57,072 --> 00:15:58,920 I'm thinking of line 4. 349 00:15:58,920 --> 00:15:59,941 PROFESSOR: Oh. 350 00:15:59,941 --> 00:16:00,440 Yeah. 351 00:16:00,440 --> 00:16:02,260 I'm getting confused too. 352 00:16:02,260 --> 00:16:03,540 So let's start with line 4. 353 00:16:03,540 --> 00:16:04,520 Sorry. 354 00:16:04,520 --> 00:16:05,882 Shouldn't do line 5. 355 00:16:05,882 --> 00:16:08,059 AUDIENCE: --new word, then you're not-- like, 356 00:16:08,059 --> 00:16:09,850 you're not going to run through that again. 357 00:16:09,850 --> 00:16:10,516 PROFESSOR: Yeah. 358 00:16:10,516 --> 00:16:12,530 Let's worry about that right afterwards. 359 00:16:12,530 --> 00:16:13,760 Let's do line 4 first. 360 00:16:13,760 --> 00:16:15,790 Sorry, I jumped over line 4. 361 00:16:15,790 --> 00:16:19,520 So, line 4 definitely runs W times 362 00:16:19,520 --> 00:16:23,970 because it's inside the for loop from line 3 to line 9. 363 00:16:23,970 --> 00:16:29,510 So everything here will definitely run W times. 364 00:16:29,510 --> 00:16:31,910 But how many times does it run overall? 365 00:16:31,910 --> 00:16:36,377 So, line 4 iterates over all the entries here. 366 00:16:36,377 --> 00:16:37,710 How many times does that happen? 367 00:16:37,710 --> 00:16:41,161 368 00:16:41,161 --> 00:16:47,570 AUDIENCE: 1 plus W over-- times 10/2, 369 00:16:47,570 --> 00:16:54,588 because it's just worst case, L-- the length of L increases 370 00:16:54,588 --> 00:16:56,830 by 1 every time. 371 00:16:56,830 --> 00:16:58,950 [INAUDIBLE] 372 00:16:58,950 --> 00:17:02,350 PROFESSOR: OK, so I like that you started with worst case. 373 00:17:02,350 --> 00:17:04,130 Normally I would say exactly that. 374 00:17:04,130 --> 00:17:07,619 Worst case or W. But we had a different constant 375 00:17:07,619 --> 00:17:11,480 for the number of words that you have in the end. 376 00:17:11,480 --> 00:17:16,230 So let's say something a little bit better than W. Let's say, 377 00:17:16,230 --> 00:17:17,880 let's put the lower bound on it. 378 00:17:17,880 --> 00:17:20,036 So yeah, worst case, all the words are different. 379 00:17:20,036 --> 00:17:21,619 But what if they're not all different? 380 00:17:21,619 --> 00:17:25,632 And what if in the end I know I have L words? 381 00:17:25,632 --> 00:17:27,653 CLASS: [INAUDIBLE]. 382 00:17:27,653 --> 00:17:28,569 PROFESSOR: Worst case. 383 00:17:28,569 --> 00:17:33,001 384 00:17:33,001 --> 00:17:33,500 Almost. 385 00:17:33,500 --> 00:17:36,340 386 00:17:36,340 --> 00:17:39,940 So, I know I have a W from the outer loop. 387 00:17:39,940 --> 00:17:42,680 For each word in the outer loop, how many times 388 00:17:42,680 --> 00:17:46,800 does the inner loop execute? 389 00:17:46,800 --> 00:17:49,210 How many times do I have to go through something 390 00:17:49,210 --> 00:17:50,250 in the inner list? 391 00:17:50,250 --> 00:17:53,090 So I know here I have W words, suppose here I 392 00:17:53,090 --> 00:17:55,480 have L elements in the vector. 393 00:17:55,480 --> 00:17:58,195 For each one of these, how many times 394 00:17:58,195 --> 00:18:02,440 do I have to go through-- So how many elements do I 395 00:18:02,440 --> 00:18:03,946 have to go through here? 396 00:18:03,946 --> 00:18:05,764 AUDIENCE: Depends on where you are, though. 397 00:18:05,764 --> 00:18:07,930 For the first word, you only have to go through one. 398 00:18:07,930 --> 00:18:08,230 PROFESSOR: Yep. 399 00:18:08,230 --> 00:18:08,530 But-- 400 00:18:08,530 --> 00:18:09,590 AUDIENCE: For the second word, you have to through-- 401 00:18:09,590 --> 00:18:09,870 PROFESSOR: Yep. 402 00:18:09,870 --> 00:18:11,280 But I heard the worst case. 403 00:18:11,280 --> 00:18:12,780 And I like that, because it's easier 404 00:18:12,780 --> 00:18:14,071 to reason about the worst case. 405 00:18:14,071 --> 00:18:17,480 And most of the time it's sort of like the average case. 406 00:18:17,480 --> 00:18:21,200 AUDIENCE: So then, length of list-- L. 407 00:18:21,200 --> 00:18:25,530 PROFESSOR: The length of the list, and that's L. Worst case, 408 00:18:25,530 --> 00:18:27,976 the first words that I see will be L different words. 409 00:18:27,976 --> 00:18:29,350 And then all the words that I see 410 00:18:29,350 --> 00:18:32,840 will be the same as the words that I saw before. 411 00:18:32,840 --> 00:18:36,080 So worst case, the list will grow to L very fast, 412 00:18:36,080 --> 00:18:39,085 and then I'll keep seeing L L L. And I'll 413 00:18:39,085 --> 00:18:40,710 ignore what was there in the beginning, 414 00:18:40,710 --> 00:18:44,460 and I'll say L times. 415 00:18:44,460 --> 00:18:47,320 So I know the second list is bounded by L in length, 416 00:18:47,320 --> 00:18:49,470 the first list is bounded by W in length. 417 00:18:49,470 --> 00:18:53,960 So worst case this runs L times W times. 418 00:18:53,960 --> 00:18:56,394 And what's the cost of iterating? 419 00:18:56,394 --> 00:18:58,854 AUDIENCE: What is the difference between L and W? 420 00:18:58,854 --> 00:19:01,965 L is the document vector length, and W is the number of words. 421 00:19:01,965 --> 00:19:04,803 But isn't the number of elements in the document vector 422 00:19:04,803 --> 00:19:07,799 the number of words? 423 00:19:07,799 --> 00:19:09,090 PROFESSOR: How about this case? 424 00:19:09,090 --> 00:19:09,810 What's W? 425 00:19:09,810 --> 00:19:15,134 426 00:19:15,134 --> 00:19:16,110 AUDIENCE: 6, yeah. 427 00:19:16,110 --> 00:19:20,140 428 00:19:20,140 --> 00:19:23,790 PROFESSOR: So L is the number of unique words in a document. 429 00:19:23,790 --> 00:19:25,580 And I heard a really cool argument 430 00:19:25,580 --> 00:19:28,160 that I liked last time. 431 00:19:28,160 --> 00:19:29,120 Does anyone remember? 432 00:19:29,120 --> 00:19:29,840 About L? 433 00:19:29,840 --> 00:19:33,450 434 00:19:33,450 --> 00:19:37,160 If we're really dealing with a natural language like English, 435 00:19:37,160 --> 00:19:39,636 how many words do I have in English? 436 00:19:39,636 --> 00:19:41,135 AUDIENCE: Well, I think, at the max, 437 00:19:41,135 --> 00:19:43,997 there's actually around 250,000, but a lot of them 438 00:19:43,997 --> 00:19:45,157 are not used anymore. 439 00:19:45,157 --> 00:19:45,740 PROFESSOR: OK. 440 00:19:45,740 --> 00:19:47,720 So 250,000, right? 441 00:19:47,720 --> 00:19:48,820 Max. 442 00:19:48,820 --> 00:19:50,480 So that's a constant. 443 00:19:50,480 --> 00:19:52,930 If I have a document that contains 444 00:19:52,930 --> 00:19:56,060 all the writings of all the authors that were ever done, 445 00:19:56,060 --> 00:20:02,771 and say that's a billion words, L is still going to be 250,000. 446 00:20:02,771 --> 00:20:03,270 Right? 447 00:20:03,270 --> 00:20:04,882 So L can be very different from W. 448 00:20:04,882 --> 00:20:06,965 That's why we're keeping track of them separately. 449 00:20:06,965 --> 00:20:11,510 450 00:20:11,510 --> 00:20:14,910 One W L times W. What's the cost of iterating? 451 00:20:14,910 --> 00:20:19,190 452 00:20:19,190 --> 00:20:21,530 So we know how many times line 4 runs, 453 00:20:21,530 --> 00:20:26,930 but what's the cost of one step of iterating in the list? 454 00:20:26,930 --> 00:20:27,430 1. 455 00:20:27,430 --> 00:20:27,930 Very good. 456 00:20:27,930 --> 00:20:31,530 457 00:20:31,530 --> 00:20:32,036 Line 5. 458 00:20:32,036 --> 00:20:33,160 How many times does it run? 459 00:20:33,160 --> 00:20:37,420 460 00:20:37,420 --> 00:20:39,230 AUDIENCE: W times L? 461 00:20:39,230 --> 00:20:39,990 PROFESSOR: Yep. 462 00:20:39,990 --> 00:20:41,580 Same as line 4, right? 463 00:20:41,580 --> 00:20:43,800 The if is run all the time. 464 00:20:43,800 --> 00:20:47,860 And lines 5 and 6 only run sometimes, but-- sorry, 465 00:20:47,860 --> 00:20:53,750 line 6 and 7 only run sometimes, but line 5 runs all the time. 466 00:20:53,750 --> 00:20:54,740 What is the cost? 467 00:20:54,740 --> 00:21:01,642 468 00:21:01,642 --> 00:21:03,280 AUDIENCE: We can say it's constant. 469 00:21:03,280 --> 00:21:04,780 PROFESSOR: We can say it's constant. 470 00:21:04,780 --> 00:21:07,560 I like we can say it's constant, but why 471 00:21:07,560 --> 00:21:10,020 is it that we can say it's constant? 472 00:21:10,020 --> 00:21:12,480 Why don't I just say 1, if-- this is an empty list. 473 00:21:12,480 --> 00:21:13,280 This is a number. 474 00:21:13,280 --> 00:21:14,340 AUDIENCE: Depends on the word length. 475 00:21:14,340 --> 00:21:14,923 PROFESSOR: OK. 476 00:21:14,923 --> 00:21:17,065 It depends on the word lis-- length, very good. 477 00:21:17,065 --> 00:21:18,690 AUDIENCE: And we're assuming that words 478 00:21:18,690 --> 00:21:20,170 are all the same length. 479 00:21:20,170 --> 00:21:21,637 PROFESSOR: OK. 480 00:21:21,637 --> 00:21:22,220 AUDIENCE: And? 481 00:21:22,220 --> 00:21:26,190 So, 1 times L W. Little w. 482 00:21:26,190 --> 00:21:28,390 PROFESSOR: OK, very good. 483 00:21:28,390 --> 00:21:30,980 So we do assume that all the words are the same length. 484 00:21:30,980 --> 00:21:34,920 But unless I tell you that the length is really small, 485 00:21:34,920 --> 00:21:37,630 which I did, you can't say 1. 486 00:21:37,630 --> 00:21:41,000 So, when you said we can say it's constant, it's right. 487 00:21:41,000 --> 00:21:42,770 We can say, but we also have to say why, 488 00:21:42,770 --> 00:21:44,890 or at least think why, that's the case. 489 00:21:44,890 --> 00:21:48,760 So it's W-- we're going to use W here and when we copy it here, 490 00:21:48,760 --> 00:21:51,144 we're going to forget about it. 491 00:21:51,144 --> 00:21:52,852 AUDIENCE: Can you put a top bar on the W, 492 00:21:52,852 --> 00:21:56,830 just so I can tell that it's not the other W. 493 00:21:56,830 --> 00:22:07,950 PROFESSOR: OK But you have to be responsible for reminding 494 00:22:07,950 --> 00:22:08,770 me to put that. 495 00:22:08,770 --> 00:22:10,160 AUDIENCE: OK. 496 00:22:10,160 --> 00:22:12,240 PROFESSOR: OK. 497 00:22:12,240 --> 00:22:15,470 So string comparison, not constant. 498 00:22:15,470 --> 00:22:17,770 If I have two very long strings that 499 00:22:17,770 --> 00:22:19,550 only differ in the last character, 500 00:22:19,550 --> 00:22:21,550 I have to go through them character by character 501 00:22:21,550 --> 00:22:24,810 by character until I find the last character that's 502 00:22:24,810 --> 00:22:25,580 different. 503 00:22:25,580 --> 00:22:28,770 Because until I look at that, the strings might be equal. 504 00:22:28,770 --> 00:22:30,926 So comparing two long strings takes 505 00:22:30,926 --> 00:22:33,175 time that's proportional to the length of the strings. 506 00:22:33,175 --> 00:22:36,680 507 00:22:36,680 --> 00:22:40,130 OK, so line 5 costs w, tricky part, 508 00:22:40,130 --> 00:22:44,420 runs L times W part-- L times W times. 509 00:22:44,420 --> 00:22:46,350 How about line 6 and 7? 510 00:22:46,350 --> 00:22:52,860 511 00:22:52,860 --> 00:22:56,280 I didn't ask line 6, I asked line 6 and 7 together, 512 00:22:56,280 --> 00:22:57,690 because there is a trick there. 513 00:22:57,690 --> 00:23:05,690 514 00:23:05,690 --> 00:23:09,190 AUDIENCE: I think it's constant for line 6, right? 515 00:23:09,190 --> 00:23:10,518 PROFESSOR: Why? 516 00:23:10,518 --> 00:23:12,858 AUDIENCE: 'Cause it's a number. 517 00:23:12,858 --> 00:23:14,715 And one place in the entry. 518 00:23:14,715 --> 00:23:16,280 We've already grabbed the entry. 519 00:23:16,280 --> 00:23:20,420 PROFESSOR: So the cost of running it once is 1. 520 00:23:20,420 --> 00:23:21,700 Good. 521 00:23:21,700 --> 00:23:25,190 I can tell you that that's the same case for 7. 522 00:23:25,190 --> 00:23:26,560 How many times do they run? 523 00:23:26,560 --> 00:23:30,120 524 00:23:30,120 --> 00:23:32,490 This is the hard part. 525 00:23:32,490 --> 00:23:34,980 AUDIENCE: Wait, does break line break out of one loop? 526 00:23:34,980 --> 00:23:35,605 PROFESSOR: Yep. 527 00:23:35,605 --> 00:23:39,940 Break breaks out of the loop between lines 4 and 7. 528 00:23:39,940 --> 00:23:41,300 I have a question. 529 00:23:41,300 --> 00:23:46,350 AUDIENCE: Is L supposed to be not in line with if? 530 00:23:46,350 --> 00:23:46,850 Yes. 531 00:23:46,850 --> 00:23:49,580 PROFESSOR: It is supposed to be where it is. 532 00:23:49,580 --> 00:23:50,640 I will talk-- yes. 533 00:23:50,640 --> 00:23:53,880 So what happens is, that else is in line with a for. 534 00:23:53,880 --> 00:23:57,430 And if the for loop runs to completion, 535 00:23:57,430 --> 00:24:00,860 then it does get executed. 536 00:24:00,860 --> 00:24:04,097 If there's a break somewhere inside the for, 537 00:24:04,097 --> 00:24:05,305 then it doesn't get executed. 538 00:24:05,305 --> 00:24:08,140 539 00:24:08,140 --> 00:24:10,650 So the idea behind that is, usually 540 00:24:10,650 --> 00:24:12,510 use this for finding stuff. 541 00:24:12,510 --> 00:24:15,760 So you iterate over a list, and when you find something, 542 00:24:15,760 --> 00:24:17,516 break out of the loop. 543 00:24:17,516 --> 00:24:19,390 You did something, you break out of the loop. 544 00:24:19,390 --> 00:24:21,435 If you didn't find it, you can put an else 545 00:24:21,435 --> 00:24:23,167 and then say what code happens. 546 00:24:23,167 --> 00:24:25,000 And you don't have to write code on your own 547 00:24:25,000 --> 00:24:29,420 to check if you broke out of the loop. 548 00:24:29,420 --> 00:24:32,230 So if break executes, then it's going 549 00:24:32,230 --> 00:24:34,570 to take us out of the loop. 550 00:24:34,570 --> 00:24:35,930 It's going to ignore the else. 551 00:24:35,930 --> 00:24:38,282 And it's going to run the loop on line 3 again. 552 00:24:38,282 --> 00:24:39,865 So it's going to do another iteration. 553 00:24:39,865 --> 00:24:42,470 554 00:24:42,470 --> 00:24:44,280 So line 6 and 7, how many times? 555 00:24:44,280 --> 00:24:47,120 AUDIENCE: W minus L? 556 00:24:47,120 --> 00:24:48,450 PROFESSOR: W minus L. 557 00:24:48,450 --> 00:24:50,526 AUDIENCE: Like, if it's the difference 558 00:24:50,526 --> 00:24:54,962 in number of words and number of-- unique words. 559 00:24:54,962 --> 00:24:55,670 PROFESSOR: Smart. 560 00:24:55,670 --> 00:25:00,520 You gave the precise answer right the first time. 561 00:25:00,520 --> 00:25:02,230 W minus L. Very good. 562 00:25:02,230 --> 00:25:06,830 563 00:25:06,830 --> 00:25:13,960 And what happens once-- what happens once this runs? 564 00:25:13,960 --> 00:25:21,730 Why do I know-- why do I know that the if won't be-- Oh. 565 00:25:21,730 --> 00:25:23,660 Sorry, I'm getting myself confused. 566 00:25:23,660 --> 00:25:27,950 So it's going to run W minus L times total, right? 567 00:25:27,950 --> 00:25:31,570 Total times overall without this W thing here, 568 00:25:31,570 --> 00:25:33,547 so I should put an arrow and say-- 569 00:25:33,547 --> 00:25:38,320 570 00:25:38,320 --> 00:25:41,070 Now suppose I didn't notice this. 571 00:25:41,070 --> 00:25:43,510 Is there another way I can get the decent bounds? 572 00:25:43,510 --> 00:25:45,165 So this is the right, perfect answer. 573 00:25:45,165 --> 00:25:46,290 You have this, you're done. 574 00:25:46,290 --> 00:25:47,623 You don't need to think further. 575 00:25:47,623 --> 00:25:49,420 Suppose you don't have this. 576 00:25:49,420 --> 00:25:50,350 What else can you do? 577 00:25:50,350 --> 00:25:53,479 578 00:25:53,479 --> 00:25:55,070 AUDIENCE: But we don't have what? 579 00:25:55,070 --> 00:25:57,770 PROFESSOR: If I didn't notice that, hey, there 580 00:25:57,770 --> 00:26:03,330 are L words-- There are L new words, 581 00:26:03,330 --> 00:26:07,690 W words total, so W minus L words repeat themselves. 582 00:26:07,690 --> 00:26:11,230 So this is how many times the if is going to be true. 583 00:26:11,230 --> 00:26:13,330 If I didn't have that, then I could 584 00:26:13,330 --> 00:26:16,750 see that line 7 breaks out of the loop. 585 00:26:16,750 --> 00:26:19,314 So if that if runs once, then we're done. 586 00:26:19,314 --> 00:26:20,230 We're out of the loop. 587 00:26:20,230 --> 00:26:22,430 It's not going to run again. 588 00:26:22,430 --> 00:26:26,210 So a bound that's not as precise as the one you gave me 589 00:26:26,210 --> 00:26:31,265 is 1 times W, because it runs, at most, once per loop. 590 00:26:31,265 --> 00:26:40,030 591 00:26:40,030 --> 00:26:42,720 Does people see this? 592 00:26:42,720 --> 00:26:45,810 So this is an easy way to cop out of thinking. 593 00:26:45,810 --> 00:26:48,230 And I don't like to think more than necessary, because you 594 00:26:48,230 --> 00:26:50,100 have finite time on a test or in life, 595 00:26:50,100 --> 00:26:52,720 and you don't want to spend too much time on one thing. 596 00:26:52,720 --> 00:26:55,640 597 00:26:55,640 --> 00:26:56,670 We covered the loop. 598 00:26:56,670 --> 00:26:59,010 Let's look at else and append. 599 00:26:59,010 --> 00:27:00,510 I already got a helpful question, 600 00:27:00,510 --> 00:27:03,940 so I explained when the else would run. 601 00:27:03,940 --> 00:27:05,240 The running time for else. 602 00:27:05,240 --> 00:27:07,570 Else is a flow-- control flow statement. 603 00:27:07,570 --> 00:27:10,244 It's like break, so Python will keep 604 00:27:10,244 --> 00:27:11,910 track of whether a loop completed or not 605 00:27:11,910 --> 00:27:12,899 in constant time. 606 00:27:12,899 --> 00:27:13,690 I'll give you that. 607 00:27:13,690 --> 00:27:16,750 That's in the cost model. 608 00:27:16,750 --> 00:27:21,170 How many times does this else run, at most? 609 00:27:21,170 --> 00:27:22,840 OK, L. Good. 610 00:27:22,840 --> 00:27:25,546 611 00:27:25,546 --> 00:27:26,865 Perfect. 612 00:27:26,865 --> 00:27:27,573 How about line 9? 613 00:27:27,573 --> 00:27:36,012 614 00:27:36,012 --> 00:27:36,976 Loop stops here. 615 00:27:36,976 --> 00:27:39,890 616 00:27:39,890 --> 00:27:42,200 How about line 9? 617 00:27:42,200 --> 00:27:43,500 How many times does it run? 618 00:27:43,500 --> 00:27:44,886 That's easy. 619 00:27:44,886 --> 00:27:46,290 AUDIENCE: [INAUDIBLE]. 620 00:27:46,290 --> 00:27:47,800 PROFESSOR: Yep. 621 00:27:47,800 --> 00:27:48,820 And what does it do? 622 00:27:48,820 --> 00:27:49,590 It's an append. 623 00:27:49,590 --> 00:27:53,111 What's the cost for append? 624 00:27:53,111 --> 00:27:53,902 AUDIENCE: Constant. 625 00:27:53,902 --> 00:27:57,380 626 00:27:57,380 --> 00:27:59,680 PROFESSOR: OK Line 10 runs how many times? 627 00:27:59,680 --> 00:28:02,554 What's the cost? 628 00:28:02,554 --> 00:28:03,054 AUDIENCE: 1? 629 00:28:03,054 --> 00:28:07,307 630 00:28:07,307 --> 00:28:09,390 PROFESSOR: You guys didn't listen to me last time. 631 00:28:09,390 --> 00:28:11,020 So I was saying you have to look at the notes 632 00:28:11,020 --> 00:28:12,270 and you have to practice this. 633 00:28:12,270 --> 00:28:14,480 Because you have to have this model in your mind. 634 00:28:14,480 --> 00:28:16,160 So that when you're writing code, 635 00:28:16,160 --> 00:28:17,710 this has to happen automatically. 636 00:28:17,710 --> 00:28:20,306 You shouldn't have to think explicitly about it. 637 00:28:20,306 --> 00:28:22,180 Because if you do, you're not going to do it. 638 00:28:22,180 --> 00:28:23,679 AUDIENCE: For the else, shouldn't it 639 00:28:23,679 --> 00:28:28,612 be W tim-- I mean, it would be called W times not L times? 640 00:28:28,612 --> 00:28:30,744 Because you want to look at the outer loop and not 641 00:28:30,744 --> 00:28:32,060 the inner loop? 642 00:28:32,060 --> 00:28:35,942 So it can-- you call all-- once at the end of the total-- 643 00:28:35,942 --> 00:28:37,080 the inner for. 644 00:28:37,080 --> 00:28:37,580 Right? 645 00:28:37,580 --> 00:28:43,590 So, so it could be-- happen W times, maximum, not L times. 646 00:28:43,590 --> 00:28:47,960 'Cause the L is the for loop that the else coincides with. 647 00:28:47,960 --> 00:28:50,502 And the else would only happen once 648 00:28:50,502 --> 00:28:53,410 for every total iteration of that. 649 00:28:53,410 --> 00:28:55,400 PROFESSOR: OK so you're proposing W 650 00:28:55,400 --> 00:28:58,070 as the bound for the else, right? 651 00:28:58,070 --> 00:28:59,643 Here. 652 00:28:59,643 --> 00:29:00,939 AUDIENCE: Yeah. 653 00:29:00,939 --> 00:29:02,980 PROFESSOR: So I could say, hey, it runs, at most, 654 00:29:02,980 --> 00:29:04,140 once for outer loop. 655 00:29:04,140 --> 00:29:06,430 So it's, at most, W times. 656 00:29:06,430 --> 00:29:08,020 This is a nice, easy argument. 657 00:29:08,020 --> 00:29:09,680 We have a bound. 658 00:29:09,680 --> 00:29:10,670 L is a tighter bound. 659 00:29:10,670 --> 00:29:14,730 And when I got L, what happened here 660 00:29:14,730 --> 00:29:17,630 was the same kind of thinking that you did earlier 661 00:29:17,630 --> 00:29:18,960 to get this. 662 00:29:18,960 --> 00:29:21,580 So this bound is good, this bound is tighter. 663 00:29:21,580 --> 00:29:24,260 This bound is good, this bound is tighter. 664 00:29:24,260 --> 00:29:27,850 And the argument behind this one is that, hey, this 665 00:29:27,850 --> 00:29:30,570 else only happens for new words. 666 00:29:30,570 --> 00:29:33,300 If there's no new-- if the word that I looked at is old, 667 00:29:33,300 --> 00:29:36,900 then break is going to execute. 668 00:29:36,900 --> 00:29:39,040 And else is not going to execute. 669 00:29:39,040 --> 00:29:41,620 670 00:29:41,620 --> 00:29:45,330 So that's why I can say L. 671 00:29:45,330 --> 00:29:47,260 The beauty of asymptotics is that I 672 00:29:47,260 --> 00:29:49,290 can use either of the bounds and I'll still 673 00:29:49,290 --> 00:29:50,950 get the correct running time. 674 00:29:50,950 --> 00:29:53,510 So I'm not going to fuss over it too much. 675 00:29:53,510 --> 00:29:56,290 I like the tighter ones, because it means you guys are thinking. 676 00:29:56,290 --> 00:29:57,790 And you're looking at the algorithm, 677 00:29:57,790 --> 00:29:59,330 and you're understanding it. 678 00:29:59,330 --> 00:30:00,830 But if you don't have them, you'll 679 00:30:00,830 --> 00:30:02,329 still get the correct running times. 680 00:30:02,329 --> 00:30:04,640 So I think that's nice. 681 00:30:04,640 --> 00:30:07,540 PROFESSOR: OK, let's get the running time for everything. 682 00:30:07,540 --> 00:30:08,800 Can someone do it in one step? 683 00:30:08,800 --> 00:30:13,890 684 00:30:13,890 --> 00:30:15,830 Then let's do it step by step. 685 00:30:15,830 --> 00:30:17,710 So let's compute partial products. 686 00:30:17,710 --> 00:30:20,660 What are they? 687 00:30:20,660 --> 00:30:22,118 AUDIENCE: 1. 688 00:30:22,118 --> 00:30:25,520 W. LW. 689 00:30:25,520 --> 00:30:29,894 L, W, little w-- with the bar. 690 00:30:29,894 --> 00:30:31,840 AUDIENCE: (LAUGHING) With the bar. 691 00:30:31,840 --> 00:30:32,631 PROFESSOR: Awesome. 692 00:30:32,631 --> 00:30:39,258 AUDIENCE: W minus L. W. L, L, 1. 693 00:30:39,258 --> 00:30:42,130 694 00:30:42,130 --> 00:30:45,020 PROFESSOR: OK, so if I add them up, this is all asymptotic. 695 00:30:45,020 --> 00:30:47,530 So the biggest one will dominate. 696 00:30:47,530 --> 00:30:49,190 In general, I can just take a max 697 00:30:49,190 --> 00:30:52,100 instead of doing actual addition. 698 00:30:52,100 --> 00:30:54,296 So who dominates here? 699 00:30:54,296 --> 00:30:57,782 AUDIENCE: The fourth one down? 700 00:30:57,782 --> 00:30:59,780 PROFESSOR: Fourth-- OK. 701 00:30:59,780 --> 00:31:01,680 Yep. 702 00:31:01,680 --> 00:31:08,820 So line 5 is the biggest time consumer in this algorithm. 703 00:31:08,820 --> 00:31:11,430 And I know it's W times w-bar. 704 00:31:11,430 --> 00:31:12,940 So now I'm going to copy it here. 705 00:31:12,940 --> 00:31:16,575 What did I say I'll do when I'm copying it here? 706 00:31:16,575 --> 00:31:18,030 AUDIENCE: [INAUDIBLE]. 707 00:31:18,030 --> 00:31:18,720 PROFESSOR: Yep. 708 00:31:18,720 --> 00:31:22,640 So I'm assuming English five- to ten-character words. 709 00:31:22,640 --> 00:31:26,320 W L. And W L is smaller than w squared, 710 00:31:26,320 --> 00:31:28,780 so the assumption that I had before is correct. 711 00:31:28,780 --> 00:31:30,790 I don't have to change anything here. 712 00:31:30,790 --> 00:31:31,597 That is good. 713 00:31:31,597 --> 00:31:38,560 714 00:31:38,560 --> 00:31:42,190 So we noticed last time, and already forgot by now, 715 00:31:42,190 --> 00:31:46,230 that the biggest problem in this whole implementation 716 00:31:46,230 --> 00:31:50,800 was the plus in get words from line list. 717 00:31:50,800 --> 00:31:54,300 Suppose we forgot about it and we have this big pile of code, 718 00:31:54,300 --> 00:31:56,100 how do I go about making it faster? 719 00:31:56,100 --> 00:31:59,080 720 00:31:59,080 --> 00:32:01,720 Method one, go through every method. 721 00:32:01,720 --> 00:32:03,680 Do this. 722 00:32:03,680 --> 00:32:06,740 Compute the running times, and see which one's the slowest. 723 00:32:06,740 --> 00:32:10,190 Does this scale to 1,000 lines of code? 724 00:32:10,190 --> 00:32:11,730 Not so much. 725 00:32:11,730 --> 00:32:14,700 We're going to be giving you roughly 1,000 lines of code 726 00:32:14,700 --> 00:32:17,960 for PSET2, and we're going to ask it to make it faster. 727 00:32:17,960 --> 00:32:19,560 Do want to understand everything? 728 00:32:19,560 --> 00:32:22,350 No. 729 00:32:22,350 --> 00:32:24,590 Instead what you want to do is run the code 730 00:32:24,590 --> 00:32:25,824 through a profiler. 731 00:32:25,824 --> 00:32:26,740 So we have a computer. 732 00:32:26,740 --> 00:32:28,770 The computer can tell you which line 733 00:32:28,770 --> 00:32:30,877 takes the most time to run. 734 00:32:30,877 --> 00:32:32,710 So you don't have to do it on pen and paper. 735 00:32:32,710 --> 00:32:34,940 Whenever we can automate, do so. 736 00:32:34,940 --> 00:32:36,830 So we'll teach you how to run a profiler. 737 00:32:36,830 --> 00:32:38,110 It's in the notes. 738 00:32:38,110 --> 00:32:40,870 And if you look in the code outputs right 739 00:32:40,870 --> 00:32:45,080 after [INAUDIBLE], you'll see a profiler output. 740 00:32:45,080 --> 00:32:47,580 So what that tells you is for each function, 741 00:32:47,580 --> 00:32:51,980 how much time does it take-- that's the total time? 742 00:32:51,980 --> 00:32:53,670 And there's the cumulative time, which 743 00:32:53,670 --> 00:32:56,300 is how much does it take together with its children? 744 00:32:56,300 --> 00:32:59,680 745 00:32:59,680 --> 00:33:04,380 In this case for word frequencies from file, order 746 00:33:04,380 --> 00:33:06,360 of W-- this is how much time it takes 747 00:33:06,360 --> 00:33:07,860 including the functions it's called. 748 00:33:07,860 --> 00:33:10,540 So this is the cumulative time. 749 00:33:10,540 --> 00:33:13,940 Cumulative time is useful if I'm during runtime analysis. 750 00:33:13,940 --> 00:33:17,120 Is not so useful if I'm looking at where's 751 00:33:17,120 --> 00:33:18,745 the slowness in my program? 752 00:33:18,745 --> 00:33:20,370 Because if you look at cumulative time, 753 00:33:20,370 --> 00:33:24,510 you might see the slowness in one of the functions that 754 00:33:24,510 --> 00:33:27,930 get-- that word frequencies from file called. 755 00:33:27,930 --> 00:33:30,560 So the cumulative time for this is really big, 756 00:33:30,560 --> 00:33:35,140 but the total time-- the time that's spent inside it-- 757 00:33:35,140 --> 00:33:36,824 is not that bad. 758 00:33:36,824 --> 00:33:38,240 Instead if you look at total time, 759 00:33:38,240 --> 00:33:41,520 you'll see that the worst function is-- surprise, 760 00:33:41,520 --> 00:33:44,670 surprise-- get words from line list. 761 00:33:44,670 --> 00:33:48,150 So 5 lines to look at-- hey, there's a plus there. 762 00:33:48,150 --> 00:33:50,900 I remember from lecture that plus copies over lists, 763 00:33:50,900 --> 00:33:54,700 and it's kind of slow, so maybe I should use something else. 764 00:33:54,700 --> 00:33:57,340 Does d remember what else we should use? 765 00:33:57,340 --> 00:34:00,290 We talked about that last recitation. 766 00:34:00,290 --> 00:34:02,200 Extend. 767 00:34:02,200 --> 00:34:03,580 Document distance 2 . 768 00:34:03,580 --> 00:34:06,580 The only difference between it an document distance 1 769 00:34:06,580 --> 00:34:09,639 is get words from line list, line 5. 770 00:34:09,639 --> 00:34:12,580 That plus turned into an extend. 771 00:34:12,580 --> 00:34:16,330 One character turned to six, so about eight keystrokes, 772 00:34:16,330 --> 00:34:18,440 30% runtime improvement. 773 00:34:18,440 --> 00:34:20,500 Very good return on investment. 774 00:34:20,500 --> 00:34:24,530 Everything else is going to be harder. 775 00:34:24,530 --> 00:34:26,840 So let's look at that line, because that line dominated 776 00:34:26,840 --> 00:34:29,770 the running time of get words from line list. 777 00:34:29,770 --> 00:34:32,440 And think what's, then, your running time for it 778 00:34:32,440 --> 00:34:33,449 now that I have extend? 779 00:34:33,449 --> 00:35:05,440 780 00:35:05,440 --> 00:35:07,800 Does anyone remember what get words from line list does? 781 00:35:07,800 --> 00:35:16,130 782 00:35:16,130 --> 00:35:19,765 AUDIENCE: It gets the words out of the document. 783 00:35:19,765 --> 00:35:20,390 PROFESSOR: OK . 784 00:35:20,390 --> 00:35:22,110 It gets the word out of the document. 785 00:35:22,110 --> 00:35:25,720 So it reads a document that looks like a regular text file, 786 00:35:25,720 --> 00:35:30,240 and it gets this out of it. 787 00:35:30,240 --> 00:35:33,190 The way it does that is it goes through each line, 788 00:35:33,190 --> 00:35:36,640 reads the line as a string, breaks up the string 789 00:35:36,640 --> 00:35:39,310 into a list of words, and then combines 790 00:35:39,310 --> 00:35:42,240 all those lists of words together. 791 00:35:42,240 --> 00:35:44,420 Get words from string, line 4, returns 792 00:35:44,420 --> 00:35:47,114 the number of words in a line. 793 00:35:47,114 --> 00:35:49,030 Sorry, the list of words in the line, and then 794 00:35:49,030 --> 00:35:51,510 extend combines the lists together. 795 00:35:51,510 --> 00:35:56,140 Let's add some constants that we had last time. 796 00:35:56,140 --> 00:36:08,220 I think we had that K is the number of words per line. 797 00:36:08,220 --> 00:36:16,850 And Z is the number of total lines. 798 00:36:16,850 --> 00:36:29,492 So this K is actually-- W over Z. No, I don't like that. 799 00:36:29,492 --> 00:36:29,992 That's work. 800 00:36:29,992 --> 00:36:32,890 801 00:36:32,890 --> 00:36:43,730 So Z is K W over K. And we argued 802 00:36:43,730 --> 00:36:45,370 that we're not going to talk about K 803 00:36:45,370 --> 00:36:48,490 too much, because a document needs to fit on a screen 804 00:36:48,490 --> 00:36:50,820 or needs to fit on a piece of paper. 805 00:36:50,820 --> 00:36:52,740 So the line length has to be finite, right? 806 00:36:52,740 --> 00:36:56,069 Otherwise, if I have a document that has 10,000 character 807 00:36:56,069 --> 00:36:57,860 lines, I can't even write it on this board. 808 00:36:57,860 --> 00:36:59,940 Even though it's really long. 809 00:36:59,940 --> 00:37:03,300 So K, the number of words in a line, is going to be finite. 810 00:37:03,300 --> 00:37:06,270 But we'll need it for this analysis. 811 00:37:06,270 --> 00:37:10,650 So line 4 returns a list with the words on a line. 812 00:37:10,650 --> 00:37:12,225 How many elements in that list? 813 00:37:12,225 --> 00:37:19,010 814 00:37:19,010 --> 00:37:22,660 This returns a list with how many elements? 815 00:37:22,660 --> 00:37:28,408 816 00:37:28,408 --> 00:37:31,459 AUDIENCE: Could K [INAUDIBLE] words on line. 817 00:37:31,459 --> 00:37:33,250 PROFESSOR: So even if I ask easy questions, 818 00:37:33,250 --> 00:37:34,249 you guys have to answer. 819 00:37:34,249 --> 00:37:36,900 Because otherwise I'll stall until you do. 820 00:37:36,900 --> 00:37:39,240 So 4 gives me a list with K elements, 821 00:37:39,240 --> 00:37:43,060 and 5 appends that small list to the big list of words 822 00:37:43,060 --> 00:37:45,040 in the entire document. 823 00:37:45,040 --> 00:37:47,310 Before I used plus and that did something bad. 824 00:37:47,310 --> 00:37:48,460 Now I'm using extend. 825 00:37:48,460 --> 00:37:53,248 What's the cost of one extend called? 826 00:37:53,248 --> 00:37:54,706 AUDIENCE: Constant? 827 00:37:54,706 --> 00:37:55,912 Order of K? 828 00:37:55,912 --> 00:37:58,120 PROFESSOR: I want to know your Python implementation. 829 00:37:58,120 --> 00:38:02,288 830 00:38:02,288 --> 00:38:04,708 AUDIENCE: If Python did it like a linked list , 831 00:38:04,708 --> 00:38:08,477 like doubly linked lists, could be order constant? 832 00:38:08,477 --> 00:38:09,810 PROFESSOR: I like your question. 833 00:38:09,810 --> 00:38:13,720 So if Python lists were actually linked lists-- 834 00:38:13,720 --> 00:38:16,060 so if the name wasn't confusing-- 835 00:38:16,060 --> 00:38:18,320 then yes, merging two lists would be constant. 836 00:38:18,320 --> 00:38:21,520 But then accessing one element in the middle of a list 837 00:38:21,520 --> 00:38:24,160 would not be constant anymore. 838 00:38:24,160 --> 00:38:25,910 Say I want to access element number 839 00:38:25,910 --> 00:38:29,160 200 in a list of 10,000 elements. 840 00:38:29,160 --> 00:38:31,165 I have to go through 200 elements. 841 00:38:31,165 --> 00:38:32,540 We didn't do linked lists when we 842 00:38:32,540 --> 00:38:33,930 ran it, so don't worry about it. 843 00:38:33,930 --> 00:38:36,230 So they decided that it's less confusing to have 844 00:38:36,230 --> 00:38:39,710 lists actually be arrays. 845 00:38:39,710 --> 00:38:42,500 So Python lists, array in CLRS. 846 00:38:42,500 --> 00:38:45,175 847 00:38:45,175 --> 00:38:47,300 AUDIENCE: That can use their storage contiguously? 848 00:38:47,300 --> 00:38:48,200 PROFESSOR: Yep. 849 00:38:48,200 --> 00:38:50,930 AUDIENCE: So when you copy, why can't you copy a block? 850 00:38:50,930 --> 00:38:52,731 Why do you have to access each other? 851 00:38:52,731 --> 00:38:54,720 Does that make sense? 852 00:38:54,720 --> 00:38:56,930 PROFESSOR: So, you can copy a block, 853 00:38:56,930 --> 00:38:58,600 but in order to copy a block, you still 854 00:38:58,600 --> 00:39:00,270 have to move everything. 855 00:39:00,270 --> 00:39:03,770 So if your block is 10,000 elements, 856 00:39:03,770 --> 00:39:08,370 you still have to move 10,000 bytes times element size. 857 00:39:08,370 --> 00:39:11,490 And the CPU works on 4 bytes at a time or 8 bytes at a time. 858 00:39:11,490 --> 00:39:13,990 859 00:39:13,990 --> 00:39:14,490 OK. 860 00:39:14,490 --> 00:39:16,150 But this is the right kind of thing 861 00:39:16,150 --> 00:39:17,910 to be thinking about when you're doing the cost model. 862 00:39:17,910 --> 00:39:19,300 And this is what you want to have in your head 863 00:39:19,300 --> 00:39:20,424 when you're writing Python. 864 00:39:20,424 --> 00:39:21,610 So, good. 865 00:39:21,610 --> 00:39:23,492 I like your question. 866 00:39:23,492 --> 00:39:24,950 I wanted to say that at some point, 867 00:39:24,950 --> 00:39:26,366 but I didn't get the occasion yet. 868 00:39:26,366 --> 00:39:28,050 Lists in Python are not lists in CLRS. 869 00:39:28,050 --> 00:39:30,561 870 00:39:30,561 --> 00:39:31,060 OK. 871 00:39:31,060 --> 00:39:35,030 So with that long explanation, an extend 872 00:39:35,030 --> 00:39:37,590 is a list-- is a sequence of appends, right? 873 00:39:37,590 --> 00:39:39,020 If you have two lists and you want 874 00:39:39,020 --> 00:39:41,980 to extend the first list to the second list, 875 00:39:41,980 --> 00:39:44,640 extend basically goes through each element in the second list 876 00:39:44,640 --> 00:39:47,800 and calls append on the first list. 877 00:39:47,800 --> 00:39:50,380 The list is length K, the second list is length K, 878 00:39:50,380 --> 00:39:52,510 so K appends are going to happen. 879 00:39:52,510 --> 00:39:54,200 And append is constant time. 880 00:39:54,200 --> 00:39:55,910 Total cost, K. 881 00:39:55,910 --> 00:39:58,000 Now many times does line 5 run? 882 00:39:58,000 --> 00:40:01,227 883 00:40:01,227 --> 00:40:03,265 AUDIENCE: [INAUDIBLE]. 884 00:40:03,265 --> 00:40:04,140 PROFESSOR: Very good. 885 00:40:04,140 --> 00:40:07,140 886 00:40:07,140 --> 00:40:09,200 So what is the total running cost 887 00:40:09,200 --> 00:40:12,470 of the algorithm if this is the line that dominates? 888 00:40:12,470 --> 00:40:13,970 I don't want to do every other line, 889 00:40:13,970 --> 00:40:16,250 so I'll promise that this is the line that dominates. 890 00:40:16,250 --> 00:40:19,300 What is the total running time? 891 00:40:19,300 --> 00:40:20,377 AUDIENCE: [INAUDIBLE]. 892 00:40:20,377 --> 00:40:20,960 PROFESSOR: OK. 893 00:40:20,960 --> 00:40:22,230 Very good. 894 00:40:22,230 --> 00:40:23,390 K times L. And that is? 895 00:40:23,390 --> 00:40:27,530 896 00:40:27,530 --> 00:40:33,160 Oh, K. So I shouldn't have said K times L, sorry. 897 00:40:33,160 --> 00:40:35,430 L is not the number of lines. 898 00:40:35,430 --> 00:40:40,250 L is the number here, so Z is the number of lines. 899 00:40:40,250 --> 00:40:45,720 So it's K times Z. Which means I'm using bad letters, 900 00:40:45,720 --> 00:40:47,120 so please bear with me. 901 00:40:47,120 --> 00:40:48,830 We'll forget about them in a minute. 902 00:40:48,830 --> 00:40:50,830 So K times Z equals? 903 00:40:50,830 --> 00:40:53,470 AUDIENCE: W. 904 00:40:53,470 --> 00:40:57,660 PROFESSOR: W. So what do I write here? 905 00:40:57,660 --> 00:41:01,994 906 00:41:01,994 --> 00:41:02,910 AUDIENCE: [INAUDIBLE]. 907 00:41:02,910 --> 00:41:03,530 PROFESSOR: OK. 908 00:41:03,530 --> 00:41:04,760 Good. 909 00:41:04,760 --> 00:41:05,830 Very good. 910 00:41:05,830 --> 00:41:08,800 What do I write here? 911 00:41:08,800 --> 00:41:11,862 Word frequencies from file. 912 00:41:11,862 --> 00:41:13,137 AUDIENCE: W L. 913 00:41:13,137 --> 00:41:13,720 PROFESSOR: OK. 914 00:41:13,720 --> 00:41:16,460 915 00:41:16,460 --> 00:41:17,335 What do I write here? 916 00:41:17,335 --> 00:41:21,407 917 00:41:21,407 --> 00:41:21,990 AUDIENCE: W 1. 918 00:41:21,990 --> 00:41:35,150 919 00:41:35,150 --> 00:41:38,237 AUDIENCE: We have to-- put L 1 squared and L 2 squared back 920 00:41:38,237 --> 00:41:38,950 in. 921 00:41:38,950 --> 00:41:40,160 PROFESSOR: OK, do we? 922 00:41:40,160 --> 00:41:43,060 Well, the-- No. 923 00:41:43,060 --> 00:41:45,535 W, you want to always be bigger than L. I hope. 924 00:41:45,535 --> 00:41:46,160 PROFESSOR: Yep. 925 00:41:46,160 --> 00:41:48,680 So if I put it in, L 1 squared is 926 00:41:48,680 --> 00:41:53,570 L 1 L 1, which is smaller than W 1 L 1. 927 00:41:53,570 --> 00:41:55,560 But I have to think about it before doing that. 928 00:41:55,560 --> 00:41:58,260 I can't ignore this completely. 929 00:41:58,260 --> 00:42:02,240 930 00:42:02,240 --> 00:42:04,210 So this is document distance 2. 931 00:42:04,210 --> 00:42:07,207 The asymptotic complexity improved, the running time 932 00:42:07,207 --> 00:42:07,707 improved. 933 00:42:07,707 --> 00:42:14,530 934 00:42:14,530 --> 00:42:16,410 Next thing that happens to make this faster 935 00:42:16,410 --> 00:42:19,180 is-- I'm not giving you the profiler output, 936 00:42:19,180 --> 00:42:23,040 but you have to take my word for it that the longest methods are 937 00:42:23,040 --> 00:42:27,170 count frequency and inner product. 938 00:42:27,170 --> 00:42:29,770 So what I'm going to do first is I'm 939 00:42:29,770 --> 00:42:32,320 going to make inner product faster. 940 00:42:32,320 --> 00:42:34,400 But in order to do that, I have to make 941 00:42:34,400 --> 00:42:36,550 word frequencies for file slower. 942 00:42:36,550 --> 00:42:38,050 And that is because I happen to know 943 00:42:38,050 --> 00:42:44,030 an algorithm for inner product that is a lot faster, if only 944 00:42:44,030 --> 00:42:47,575 you can promise me that in this list, the words are ordered. 945 00:42:47,575 --> 00:42:48,450 The words are sorted. 946 00:42:48,450 --> 00:42:51,210 947 00:42:51,210 --> 00:42:57,000 So what happens in that list 1 is, the moment I see a word, 948 00:42:57,000 --> 00:42:59,510 if it's not in the list I add it at the end. 949 00:42:59,510 --> 00:43:01,820 So the words here show up pretty much in the order 950 00:43:01,820 --> 00:43:03,710 that they show up in the file. 951 00:43:03,710 --> 00:43:05,680 Well, if instead I could have something 952 00:43:05,680 --> 00:43:11,090 that looks like-- what is it? 953 00:43:11,090 --> 00:43:12,470 Fox. 954 00:43:12,470 --> 00:43:13,731 In. 955 00:43:13,731 --> 00:43:14,230 His. 956 00:43:14,230 --> 00:43:16,890 957 00:43:16,890 --> 00:43:19,850 Hat is somewhere here. 958 00:43:19,850 --> 00:43:22,650 So if these words would be in this order, 959 00:43:22,650 --> 00:43:25,210 together with the word that I'm missing, 960 00:43:25,210 --> 00:43:27,470 then I can combine two lists. 961 00:43:27,470 --> 00:43:31,620 I can do an inner product a lot faster. 962 00:43:31,620 --> 00:43:34,130 Let's see how that would work. 963 00:43:34,130 --> 00:43:37,360 And I'm already getting confused my words, so let's do a trick. 964 00:43:37,360 --> 00:43:39,810 Let's say that instead of words, we'll use numbers. 965 00:43:39,810 --> 00:43:41,570 So instead of saying "the," I'm going 966 00:43:41,570 --> 00:43:45,060 to say the is the 50th word in the dictionary. 967 00:43:45,060 --> 00:43:47,957 So I'm going to use number 50. 968 00:43:47,957 --> 00:43:49,290 Because I want to write numbers. 969 00:43:49,290 --> 00:43:50,940 The numbers are easier to deal with. 970 00:43:50,940 --> 00:43:53,460 So say the first document that I have-- 971 00:43:53,460 --> 00:43:59,840 the fox is in the hat-- as words number 3, 4, 6, 8, and 9. 972 00:43:59,840 --> 00:44:03,140 And they show up twice, once, once, once. 973 00:44:03,140 --> 00:44:06,360 974 00:44:06,360 --> 00:44:08,720 9, once. 975 00:44:08,720 --> 00:44:17,160 And say I'm trying to compute the inner product of this 976 00:44:17,160 --> 00:44:19,370 with a document that has word number 2 977 00:44:19,370 --> 00:44:23,290 showing up once, word number 3 showing up once, word number 978 00:44:23,290 --> 00:44:28,350 6 showing up once, word number 7 showing up once, 979 00:44:28,350 --> 00:44:30,174 word number 8 showing up once. 980 00:44:30,174 --> 00:44:33,737 981 00:44:33,737 --> 00:44:35,820 OK, the algorithm for inner product that we talked 982 00:44:35,820 --> 00:44:39,050 about last time was, go through each element in one 983 00:44:39,050 --> 00:44:42,710 of the vectors, find an element with the same word 984 00:44:42,710 --> 00:44:45,410 in the second vector, and if you can find it, 985 00:44:45,410 --> 00:44:47,370 then take the number of times the words show up 986 00:44:47,370 --> 00:44:48,800 and multiply them. 987 00:44:48,800 --> 00:44:50,560 So here I have a 3, 2. 988 00:44:50,560 --> 00:44:54,205 I would find this element here that has 3, 1. 989 00:44:54,205 --> 00:44:55,205 I have these everywhere. 990 00:44:55,205 --> 00:44:58,030 991 00:44:58,030 --> 00:45:00,370 And I take the 2 and the 1 and I multiply them. 992 00:45:00,370 --> 00:45:03,920 So the 3s have to be the same, then I think the 2 and the 1, 993 00:45:03,920 --> 00:45:05,500 and I multiply them. 994 00:45:05,500 --> 00:45:08,270 And for all the elements where that's case, 995 00:45:08,270 --> 00:45:10,430 I add up the results of the multiplication. 996 00:45:10,430 --> 00:45:13,420 997 00:45:13,420 --> 00:45:16,930 So step one, go through each element in a vector. 998 00:45:16,930 --> 00:45:20,095 That's not going to get faster if this other vector is 999 00:45:20,095 --> 00:45:21,260 sorted, right? 1000 00:45:21,260 --> 00:45:27,740 But the step of looking up the second element can be sped up. 1001 00:45:27,740 --> 00:45:31,430 The first and easiest way I can speed this up is, hey, 1002 00:45:31,430 --> 00:45:32,250 this is sorted. 1003 00:45:32,250 --> 00:45:37,340 If I'm looking up three, why not do a binary search? 1004 00:45:37,340 --> 00:45:39,640 What would be the cost if I do that? 1005 00:45:39,640 --> 00:45:43,670 So here I have L1 element, here I have L2 elements. 1006 00:45:43,670 --> 00:45:47,400 What's the cost of doing one binary search here? 1007 00:45:47,400 --> 00:45:48,940 AUDIENCE: Log of L2? 1008 00:45:48,940 --> 00:45:51,660 PROFESSOR: OK. 1009 00:45:51,660 --> 00:45:55,201 And how many times do I do a binary search? 1010 00:45:55,201 --> 00:45:57,682 AUDIENCE: [INAUDIBLE]. 1011 00:45:57,682 --> 00:45:59,640 PROFESSOR: So if I go through each element here 1012 00:45:59,640 --> 00:46:02,139 and I do a binary search, which is a nice and easy algorithm 1013 00:46:02,139 --> 00:46:03,800 that I can explain in 10 seconds, 1014 00:46:03,800 --> 00:46:06,520 I'm already faster than L1 L2. 1015 00:46:06,520 --> 00:46:08,680 So it's worth sorting. 1016 00:46:08,680 --> 00:46:11,200 Now, the algorithm that we use in class 1017 00:46:11,200 --> 00:46:15,790 takes time proportional to L1 plus L2. 1018 00:46:15,790 --> 00:46:17,590 So that's even trickier, and it's 1019 00:46:17,590 --> 00:46:19,300 going to take a bit more time to explain. 1020 00:46:19,300 --> 00:46:21,970 1021 00:46:21,970 --> 00:46:24,250 Does-- did anyone understand the algorithm for class 1022 00:46:24,250 --> 00:46:25,690 and wants to help me explain it? 1023 00:46:25,690 --> 00:46:30,580 1024 00:46:30,580 --> 00:46:32,790 Didn't think so. 1025 00:46:32,790 --> 00:46:34,140 OK. 1026 00:46:34,140 --> 00:46:39,230 So idea is that both of these vectors are sorted. 1027 00:46:39,230 --> 00:46:46,390 So if I have a 3 here and I found my 3 here, next time when 1028 00:46:46,390 --> 00:46:49,850 I get to 4, I know for sure that 4 is not 1029 00:46:49,850 --> 00:46:54,370 going to be anywhere here, because this vector is sorted. 1030 00:46:54,370 --> 00:46:57,550 Say I couldn't find 4, then I go to 6. 1031 00:46:57,550 --> 00:47:00,850 If 6 is here, when I have to look for 8, 1032 00:47:00,850 --> 00:47:04,260 I know for sure that 8 is not going to be anywhere up here. 1033 00:47:04,260 --> 00:47:07,210 1034 00:47:07,210 --> 00:47:11,290 So what I do is, I have a pointer here that remembers, 1035 00:47:11,290 --> 00:47:15,684 where's the last element that I have seen? 1036 00:47:15,684 --> 00:47:16,975 Does this make sense to people? 1037 00:47:16,975 --> 00:47:19,890 1038 00:47:19,890 --> 00:47:22,500 So when I start here and I look at 3, 1039 00:47:22,500 --> 00:47:26,300 I have a pointer here that says, I didn't see anything here. 1040 00:47:26,300 --> 00:47:28,170 I look at 2, it's not 3. 1041 00:47:28,170 --> 00:47:29,050 It's smaller. 1042 00:47:29,050 --> 00:47:29,940 I look at 3. 1043 00:47:29,940 --> 00:47:30,950 I found it, good. 1044 00:47:30,950 --> 00:47:32,720 I do my product. 1045 00:47:32,720 --> 00:47:35,610 Then I go look for the next element here, 4. 1046 00:47:35,610 --> 00:47:36,845 I look at 6. 1047 00:47:36,845 --> 00:47:38,840 6 is bigger than 4. 1048 00:47:38,840 --> 00:47:42,980 So I know for sure that nothing below it is going to be 4. 1049 00:47:42,980 --> 00:47:46,140 So I can stop right here and keep my pointer here. 1050 00:47:46,140 --> 00:47:47,840 Then I go to 6 here. 1051 00:47:47,840 --> 00:47:48,580 And I look here. 1052 00:47:48,580 --> 00:47:49,300 Where did I stop? 1053 00:47:49,300 --> 00:47:50,520 I stop here. 1054 00:47:50,520 --> 00:47:52,030 This element matches this. 1055 00:47:52,030 --> 00:47:53,750 I do my product. 1056 00:47:53,750 --> 00:47:56,150 Keep it in. 1057 00:47:56,150 --> 00:47:58,030 Now I go to 8 here. 1058 00:47:58,030 --> 00:47:59,310 I was at 6. 1059 00:47:59,310 --> 00:48:01,320 6 is smaller than 8. 1060 00:48:01,320 --> 00:48:03,236 7 is smaller than 8. 1061 00:48:03,236 --> 00:48:04,420 8 is equal to 8. 1062 00:48:04,420 --> 00:48:05,630 I found something. 1063 00:48:05,630 --> 00:48:06,850 Sweet. 1064 00:48:06,850 --> 00:48:08,500 I do a product. 1065 00:48:08,500 --> 00:48:09,730 And then I stop. 1066 00:48:09,730 --> 00:48:12,320 And I look at the next element. 1067 00:48:12,320 --> 00:48:15,830 I know for sure that nothing here is going to be 9, 1068 00:48:15,830 --> 00:48:17,610 so I can keep looking down. 1069 00:48:17,610 --> 00:48:20,160 I hit the end of my list. 1070 00:48:20,160 --> 00:48:21,310 OK. 1071 00:48:21,310 --> 00:48:25,069 Whatever I have down here-- it's going 9, 10, 11, 12-- 1072 00:48:25,069 --> 00:48:27,360 is not going to be in this list, because it was sorted. 1073 00:48:27,360 --> 00:48:30,475 So I can stop. 1074 00:48:30,475 --> 00:48:34,340 AUDIENCE: How do you keep going-- 1075 00:48:34,340 --> 00:48:37,340 What algorithm so you use to keep looking down on L2? 1076 00:48:37,340 --> 00:48:38,204 PROFESSOR: Plus 1. 1077 00:48:38,204 --> 00:48:39,120 AUDIENCE: So what if-- 1078 00:48:39,120 --> 00:48:40,328 PROFESSOR: I keep going down. 1079 00:48:40,328 --> 00:48:46,360 AUDIENCE: What if the left side was 3, 4, 6, 9243? 1080 00:48:46,360 --> 00:48:47,370 PROFESSOR: 9,000, what? 1081 00:48:47,370 --> 00:48:48,470 AUDIENCE: Just a big number. 1082 00:48:48,470 --> 00:48:49,136 AUDIENCE: Right. 1083 00:48:49,136 --> 00:48:52,714 Then you have to increment by 1 each time-- 1084 00:48:52,714 --> 00:48:54,130 PROFESSOR: So I'm not incrementing 1085 00:48:54,130 --> 00:48:56,080 the number I'm looking at. 1086 00:48:56,080 --> 00:48:57,710 Here, I looked at 6. 1087 00:48:57,710 --> 00:48:59,610 Then I'm looking for 8. 1088 00:48:59,610 --> 00:49:03,180 And after I found 8, I'm looking for 19,000. 1089 00:49:03,180 --> 00:49:06,800 So I'm going to go down, either until I find 90,000 1090 00:49:06,800 --> 00:49:07,710 or until I stop. 1091 00:49:07,710 --> 00:49:10,382 AUDIENCE: So are you going to go down one at a time, 1092 00:49:10,382 --> 00:49:11,007 PROFESSOR: Yep. 1093 00:49:11,007 --> 00:49:13,840 AUDIENCE: --why not do a binary search? 1094 00:49:13,840 --> 00:49:16,560 PROFESSOR: Because if I do a binary search, 1095 00:49:16,560 --> 00:49:20,550 the analysis that says it's fast is not going to work. 1096 00:49:20,550 --> 00:49:23,550 It turns out that this gives me the optimal running time. 1097 00:49:23,550 --> 00:49:28,830 If I do a binary search, suppose I have a list that's like this. 1098 00:49:28,830 --> 00:49:33,509 1, 2, 3, 4, 5, all the way down to 10,000. 1099 00:49:33,509 --> 00:49:36,750 Ugh, I can't write. 1100 00:49:36,750 --> 00:49:39,920 And I have another list that's like this- 1, 2, 3, 4, 1101 00:49:39,920 --> 00:49:42,800 all the way down to 10,000. 1102 00:49:42,800 --> 00:49:44,720 1, 1. 1103 00:49:44,720 --> 00:49:49,830 I do a binary search, takes log N. I look at 2. 1104 00:49:49,830 --> 00:49:53,230 I do a binary search, takes almost log N. I look at 3, 1105 00:49:53,230 --> 00:49:56,352 do a binary search, takes almost log N. So on, so forth. 1106 00:49:56,352 --> 00:49:59,550 1107 00:49:59,550 --> 00:50:00,705 So this is-- 1108 00:50:00,705 --> 00:50:03,214 AUDIENCE: But if your left list had been 1109 00:50:03,214 --> 00:50:04,380 PROFESSOR: Log N plus log N. 1110 00:50:04,380 --> 00:50:06,034 AUDIENCE: --10,000. 1111 00:50:06,034 --> 00:50:06,700 PROFESSOR: Yeah. 1112 00:50:06,700 --> 00:50:09,770 AUDIENCE: You'd have taken N time. 1113 00:50:09,770 --> 00:50:10,630 PROFESSOR: Yes. 1114 00:50:10,630 --> 00:50:13,030 Well, this algorithm takes N time, 1115 00:50:13,030 --> 00:50:15,800 even if I have to list like this. 1116 00:50:15,800 --> 00:50:19,490 It takes 10,000 plus 10,000 time. 1117 00:50:19,490 --> 00:50:21,700 Whereas this algorithm will take time that's actually 1118 00:50:21,700 --> 00:50:29,910 proportional to 10,000 log 10,000. 1119 00:50:29,910 --> 00:50:30,740 You believe me. 1120 00:50:30,740 --> 00:50:34,480 So, a way to look at this is to do bounds and say, 1121 00:50:34,480 --> 00:50:37,140 for the elements 1 through 5,000, 1122 00:50:37,140 --> 00:50:41,470 it's going to do a binary search for more than 5,000 elements. 1123 00:50:41,470 --> 00:50:44,170 So, the time-- the running time is definitely bigger 1124 00:50:44,170 --> 00:50:48,366 than N over 2 log N over 2. 1125 00:50:48,366 --> 00:50:51,840 1126 00:50:51,840 --> 00:50:52,630 Constant. 1127 00:50:52,630 --> 00:51:00,710 This becomes log N minus 1 and log N. 1128 00:51:00,710 --> 00:51:01,900 That is a good question. 1129 00:51:01,900 --> 00:51:04,340 I wondered about that the first time I saw merge-sort too. 1130 00:51:04,340 --> 00:51:05,810 And I was thinking, hey, I'm going 1131 00:51:05,810 --> 00:51:07,768 to do a binary search here because it's faster, 1132 00:51:07,768 --> 00:51:10,510 and I'm going to make a faster algorithm than anyone has ever 1133 00:51:10,510 --> 00:51:11,390 seen. 1134 00:51:11,390 --> 00:51:13,320 Well, if you do the analysis, not so much. 1135 00:51:13,320 --> 00:51:14,660 But you need to think about it, and you 1136 00:51:14,660 --> 00:51:16,660 need to know why that's true or that's not true. 1137 00:51:16,660 --> 00:51:17,670 So I like your question. 1138 00:51:17,670 --> 00:51:20,140 Thank you. 1139 00:51:20,140 --> 00:51:22,520 So now let's get down so the plain old merge-sort 1140 00:51:22,520 --> 00:51:26,860 that everyone-- sorry, merge that everyone knows. 1141 00:51:26,860 --> 00:51:30,610 So if we go through these one by one, 1142 00:51:30,610 --> 00:51:33,880 how many times am I going to be advancing this pointer? 1143 00:51:33,880 --> 00:51:35,498 In total? 1144 00:51:35,498 --> 00:51:36,920 AUDIENCE: L2? 1145 00:51:36,920 --> 00:51:38,160 PROFESSOR: L2 times. 1146 00:51:38,160 --> 00:51:42,540 So this pointer can only go down, right? 1147 00:51:42,540 --> 00:51:45,660 So worst case is going to go down L2 times. 1148 00:51:45,660 --> 00:51:48,550 And then I'm done with the list, return. 1149 00:51:48,550 --> 00:51:50,580 How many elements am I going to look through? 1150 00:51:50,580 --> 00:51:53,301 So how many times does this pointer going to advance? 1151 00:51:53,301 --> 00:51:53,842 AUDIENCE: L1? 1152 00:51:53,842 --> 00:51:56,700 1153 00:51:56,700 --> 00:51:59,080 PROFESSOR: This one. 1154 00:51:59,080 --> 00:52:00,550 AUDIENCE: But I thought, like-- 1155 00:52:00,550 --> 00:52:03,490 AUDIENCE: Then you get extra ones in between. 1156 00:52:03,490 --> 00:52:05,560 PROFESSOR: What if L2 is bigger than L1? 1157 00:52:05,560 --> 00:52:06,780 What if-- 1158 00:52:06,780 --> 00:52:07,660 AUDIENCE: Oh, right. 1159 00:52:07,660 --> 00:52:08,540 OK, never mind. 1160 00:52:08,540 --> 00:52:10,499 The reason I said L2 was because-- 1161 00:52:10,499 --> 00:52:13,040 PROFESSOR: You're thinking after I'm going through this list, 1162 00:52:13,040 --> 00:52:13,700 I'm out, right? 1163 00:52:13,700 --> 00:52:15,325 AUDIENCE: Right, that's what I thought. 1164 00:52:15,325 --> 00:52:17,640 PROFESSOR: So, your answer works if this list, say, has 1165 00:52:17,640 --> 00:52:19,620 10 elements and this has 10,000. 1166 00:52:19,620 --> 00:52:22,230 And I go through this one really quickly. 1167 00:52:22,230 --> 00:52:25,697 But if this list has 10 elements and this list has 10,000, 1168 00:52:25,697 --> 00:52:27,280 and they both start with 1 through 10, 1169 00:52:27,280 --> 00:52:31,660 I have to say a 1 because that's a better bound. 1170 00:52:31,660 --> 00:52:33,596 So I have to say this. 1171 00:52:33,596 --> 00:52:35,345 AUDIENCE: Could you say the opposite value 1172 00:52:35,345 --> 00:52:37,450 of the difference between the two of them? 1173 00:52:37,450 --> 00:52:39,486 Because if you're going to stop at 1 has 10 1174 00:52:39,486 --> 00:52:43,120 and the other one has 10,000, and let's say 1175 00:52:43,120 --> 00:52:46,579 that only the first ten are actually equal, 1176 00:52:46,579 --> 00:52:48,870 then you're going to go through that list, find all 10, 1177 00:52:48,870 --> 00:52:50,110 and you stop. 1178 00:52:50,110 --> 00:52:51,059 That would be 9,000-- 1179 00:52:51,059 --> 00:52:53,350 PROFESSOR: I could say about that if I'm looking at one 1180 00:52:53,350 --> 00:52:56,160 case, but the magic trick is-- let's-- we're looking 1181 00:52:56,160 --> 00:52:57,680 at the worst case. 1182 00:52:57,680 --> 00:52:59,844 So worst case, if I have 10 elements, 1183 00:52:59,844 --> 00:53:01,760 they'll be all the way down in the other list. 1184 00:53:01,760 --> 00:53:03,010 Or they won't be there at all. 1185 00:53:03,010 --> 00:53:04,970 And I have to go down through all the list. 1186 00:53:04,970 --> 00:53:07,670 AUDIENCE: OK. 1187 00:53:07,670 --> 00:53:09,640 PROFESSOR: So worst case, L1 plus L2. 1188 00:53:09,640 --> 00:53:15,000 1189 00:53:15,000 --> 00:53:18,760 Let me see if I have any time left. 1190 00:53:18,760 --> 00:53:19,720 Nope. 1191 00:53:19,720 --> 00:53:25,050 So, what I would like you to do is go through insertion sort. 1192 00:53:25,050 --> 00:53:27,700 Insertion sort matches the textbook. 1193 00:53:27,700 --> 00:53:30,540 Look at the definition in the textbook, look at the code, 1194 00:53:30,540 --> 00:53:32,009 convince yourself it's the same. 1195 00:53:32,009 --> 00:53:33,800 Look at the running time, convince yourself 1196 00:53:33,800 --> 00:53:35,280 it's N squared. 1197 00:53:35,280 --> 00:53:38,880 Then look at inner product and convince yourself 1198 00:53:38,880 --> 00:53:41,360 that this is what it does. 1199 00:53:41,360 --> 00:53:44,820 Go through this line by line, see where they match, 1200 00:53:44,820 --> 00:53:46,100 put the cost on. 1201 00:53:46,100 --> 00:53:49,040 Make sure that the cost is L1 plus L2. 1202 00:53:49,040 --> 00:53:52,810 And last, go through merge-sort and notice 1203 00:53:52,810 --> 00:53:56,660 that merge in [INAUDIBLE] 6 is exactly the same 1204 00:53:56,660 --> 00:53:57,880 as inner product. 1205 00:53:57,880 --> 00:54:00,020 So this pointer magic that I did here 1206 00:54:00,020 --> 00:54:02,440 is exactly what's happening inside merge. 1207 00:54:02,440 --> 00:54:03,570 And understand merge-sort. 1208 00:54:03,570 --> 00:54:05,630 Look at the textbook, look at the notes, 1209 00:54:05,630 --> 00:54:08,270 and see how they match. 1210 00:54:08,270 --> 00:54:08,870 OK. 1211 00:54:08,870 --> 00:54:10,720 Thanks, guys.