1
00:00:00,000 --> 00:00:00,060

2
00:00:00,060 --> 00:00:01,770
The following
content is provided

3
00:00:01,770 --> 00:00:04,010
under a Creative
Commons license.

4
00:00:04,010 --> 00:00:06,860
Your support will help MIT
OpenCourseWare continue

5
00:00:06,860 --> 00:00:10,720
to offer high-quality
educational resources for free.

6
00:00:10,720 --> 00:00:13,340
To make a donation or
view additional materials

7
00:00:13,340 --> 00:00:17,226
from hundreds of MIT courses,
visit MIT OpenCourseWare

8
00:00:17,226 --> 00:00:17,851
at ocw.mit.edu.

9
00:00:17,851 --> 00:00:22,030

10
00:00:22,030 --> 00:00:25,270
PROFESSOR: So did
everyone turn in PSET1?

11
00:00:25,270 --> 00:00:25,770
Yes?

12
00:00:25,770 --> 00:00:27,220
Good.

13
00:00:27,220 --> 00:00:30,760
OK, so there is a PSET1
critique due in a few days.

14
00:00:30,760 --> 00:00:32,189
My advice?

15
00:00:32,189 --> 00:00:33,215
You already did PSET1.

16
00:00:33,215 --> 00:00:35,090
You remember what you
wrote out on the proof.

17
00:00:35,090 --> 00:00:36,506
Look at the solution.

18
00:00:36,506 --> 00:00:38,630
Write one paragraph today,
and you're done with it.

19
00:00:38,630 --> 00:00:40,800
Then you go focus on PSET2.

20
00:00:40,800 --> 00:00:42,334
If you leave it
off until Tuesday,

21
00:00:42,334 --> 00:00:44,250
you're going to have to
read your proof again,

22
00:00:44,250 --> 00:00:46,370
remember what you're thinking.

23
00:00:46,370 --> 00:00:47,577
It's a lot more work.

24
00:00:47,577 --> 00:00:49,160
Just do it now, get
it out of the way,

25
00:00:49,160 --> 00:00:51,466
and put PSET1 behind you.

26
00:00:51,466 --> 00:00:53,340
AUDIENCE: Is the critique
only for the proof?

27
00:00:53,340 --> 00:00:54,784
Or is this for all of them?

28
00:00:54,784 --> 00:00:55,450
PROFESSOR: Nope.

29
00:00:55,450 --> 00:00:56,399
Just the proof.

30
00:00:56,399 --> 00:00:58,440
So you have to compare
your proof with our proof?

31
00:00:58,440 --> 00:01:01,128

32
00:01:01,128 --> 00:01:02,880
AUDIENCE: Is there an
assignment for that?

33
00:01:02,880 --> 00:01:04,876
Or do we just know to do it?

34
00:01:04,876 --> 00:01:05,610
PROFESSOR: Uh.

35
00:01:05,610 --> 00:01:07,475
PSET1 be Stellar.

36
00:01:07,475 --> 00:01:09,990

37
00:01:09,990 --> 00:01:10,580
Oh, no, sorry.

38
00:01:10,580 --> 00:01:11,413
It's not in Stellar.

39
00:01:11,413 --> 00:01:13,580
It's on our new grading
site that just went out.

40
00:01:13,580 --> 00:01:15,330
So you have to go to
our new grading site,

41
00:01:15,330 --> 00:01:17,500
and you have to type
in your critique there.

42
00:01:17,500 --> 00:01:18,790
And it's one paragraph.

43
00:01:18,790 --> 00:01:20,220
You should aim
for one paragraph.

44
00:01:20,220 --> 00:01:21,760
If you're doing more
than that, then you're

45
00:01:21,760 --> 00:01:22,676
doing something wrong.

46
00:01:22,676 --> 00:01:25,300
And it's LATEX Plus math mode.

47
00:01:25,300 --> 00:01:29,460
So you can use math mode,
and that's about it.

48
00:01:29,460 --> 00:01:31,852
OK, any more questions
about the critique?

49
00:01:31,852 --> 00:01:32,560
It's a new thing.

50
00:01:32,560 --> 00:01:35,160
We care about it because it will
make our grading life easier.

51
00:01:35,160 --> 00:01:37,360
And because it'll force you to
look at the solutions and see

52
00:01:37,360 --> 00:01:39,130
what you understood
and what you didn't.

53
00:01:39,130 --> 00:01:40,060
So we care about it.

54
00:01:40,060 --> 00:01:41,960
Don't ignore it.

55
00:01:41,960 --> 00:01:43,638
Yes?

56
00:01:43,638 --> 00:01:45,304
AUDIENCE: Like, how
much is it weighted?

57
00:01:45,304 --> 00:01:48,294
How much does it count
toward the grade?

58
00:01:48,294 --> 00:01:49,960
PROFESSOR: If you
don't have a critique,

59
00:01:49,960 --> 00:01:53,394
we will most likely
give you a 0 for proof.

60
00:01:53,394 --> 00:01:54,310
AUDIENCE: [INAUDIBLE].

61
00:01:54,310 --> 00:01:57,350

62
00:01:57,350 --> 00:02:00,070
PROFESSOR: If your proof is bad
and your critique of the proof

63
00:02:00,070 --> 00:02:02,070
is good, then you
might get something.

64
00:02:02,070 --> 00:02:04,370
If your proof is bad and
you have no critique--

65
00:02:04,370 --> 00:02:06,370
actually if your proof
is whatever it is and you

66
00:02:06,370 --> 00:02:07,619
have no critique, you get a 0.

67
00:02:07,619 --> 00:02:10,032
AUDIENCE: Yeah. [CHUCKLE]

68
00:02:10,032 --> 00:02:11,740
PROFESSOR: Any more
questions about that?

69
00:02:11,740 --> 00:02:14,870

70
00:02:14,870 --> 00:02:17,080
OK.

71
00:02:17,080 --> 00:02:20,740
Who needs help remembering
what the document distance

72
00:02:20,740 --> 00:02:21,960
problem is?

73
00:02:21,960 --> 00:02:24,910

74
00:02:24,910 --> 00:02:25,410
OK.

75
00:02:25,410 --> 00:02:28,350
Everyone who went to lecture
or to [INAUDIBLE] remembers.

76
00:02:28,350 --> 00:02:29,840
That's good.

77
00:02:29,840 --> 00:02:31,360
Who went to lecture last time?

78
00:02:31,360 --> 00:02:34,720

79
00:02:34,720 --> 00:02:35,290
Cool.

80
00:02:35,290 --> 00:02:36,010
That's good.

81
00:02:36,010 --> 00:02:38,560
So we did insertion
sort and merge sort

82
00:02:38,560 --> 00:02:39,970
from a theoretical standpoint.

83
00:02:39,970 --> 00:02:42,851
Today we're going to look at
the code for the insertion sort

84
00:02:42,851 --> 00:02:45,100
and, if we have time, look
at the code for merge-sort,

85
00:02:45,100 --> 00:02:48,660
and use the same strategy as we
did last time to analyze them,

86
00:02:48,660 --> 00:02:51,290
look at the running time, make
sure the running time matches

87
00:02:51,290 --> 00:02:55,410
the theory, and see how
pseudocode turns into Python.

88
00:02:55,410 --> 00:02:59,010

89
00:02:59,010 --> 00:03:01,690
So you all have your listings.

90
00:03:01,690 --> 00:03:05,790
Last time in document
distance, we covered Main,

91
00:03:05,790 --> 00:03:07,570
and we covered most
of the functions

92
00:03:07,570 --> 00:03:10,450
except for count frequency.

93
00:03:10,450 --> 00:03:15,200
Can anyone remind me what
the call graph looked like?

94
00:03:15,200 --> 00:03:16,700
So the call graph
is the tree that I

95
00:03:16,700 --> 00:03:18,990
had up on the left,
and it started at Main.

96
00:03:18,990 --> 00:03:28,650

97
00:03:28,650 --> 00:03:29,150
Thank you.

98
00:03:29,150 --> 00:03:31,810

99
00:03:31,810 --> 00:03:40,610
So Main calls word frequencies
for file, which in turn calls?

100
00:03:40,610 --> 00:03:42,470
AUDIENCE: Well, it's
probably line list.

101
00:03:42,470 --> 00:03:43,090
PROFESSOR: OK.

102
00:03:43,090 --> 00:03:45,042
AUDIENCE: And count frequency.

103
00:03:45,042 --> 00:03:51,390

104
00:03:51,390 --> 00:03:54,150
PROFESSOR: So we pretend we
don't see the read file called.

105
00:03:54,150 --> 00:03:57,360
We assume that the data
is already in memory

106
00:03:57,360 --> 00:03:59,970
or that the call
takes time that's

107
00:03:59,970 --> 00:04:03,730
proportional to the running--
to the length of the file.

108
00:04:03,730 --> 00:04:07,036
And we only look at get-word
from line list and count

109
00:04:07,036 --> 00:04:07,535
frequency.

110
00:04:07,535 --> 00:04:13,340

111
00:04:13,340 --> 00:04:14,390
OK.

112
00:04:14,390 --> 00:04:17,072
Who else does Main call?

113
00:04:17,072 --> 00:04:18,030
AUDIENCE: Vector angle?

114
00:04:18,030 --> 00:04:25,760

115
00:04:25,760 --> 00:04:28,624
PROFESSOR: And the vector angle?

116
00:04:28,624 --> 00:04:29,624
AUDIENCE: Inner product.

117
00:04:29,624 --> 00:04:38,017

118
00:04:38,017 --> 00:04:38,600
PROFESSOR: OK.

119
00:04:38,600 --> 00:04:43,370
Let's put up the constants
for-- for the document distance

120
00:04:43,370 --> 00:04:46,720
problem that we used last time.

121
00:04:46,720 --> 00:04:51,780
So we said that the
document has W words.

122
00:04:51,780 --> 00:04:54,170
And then when you take
that list of words

123
00:04:54,170 --> 00:04:57,330
and you turn it into
a distance vector,

124
00:04:57,330 --> 00:04:59,850
you will get assigned
to a document vector.

125
00:04:59,850 --> 00:05:04,180
You will get L elements, which
basically means L unique words.

126
00:05:04,180 --> 00:05:09,730
So L is the document
vector length.

127
00:05:09,730 --> 00:05:12,530
And we assume we're using a
natural language like English,

128
00:05:12,530 --> 00:05:15,090
so all the words
are bounded in size.

129
00:05:15,090 --> 00:05:17,470
Like 5 to 10
characters, for example.

130
00:05:17,470 --> 00:05:20,210
And to make our
life easier, we say

131
00:05:20,210 --> 00:05:30,430
all the words have the same
size W. So w is the word length.

132
00:05:30,430 --> 00:05:32,810
Using these numbers,
can anyone remind me

133
00:05:32,810 --> 00:05:36,190
what we said the costs
for these methods are?

134
00:05:36,190 --> 00:05:38,690
And we didn't analyze
count frequency,

135
00:05:38,690 --> 00:05:40,740
so it's OK to not
take my word for it

136
00:05:40,740 --> 00:05:42,800
and not tell me what
I said last time.

137
00:05:42,800 --> 00:05:46,010
But I would like numbers for
word frequencies from file,

138
00:05:46,010 --> 00:05:50,084
get word from line list, vector
angle, and inner product.

139
00:05:50,084 --> 00:05:51,915
AUDIENCE: Was that mu squared?

140
00:05:51,915 --> 00:05:53,160
That was last time.

141
00:05:53,160 --> 00:05:55,380
PROFESSOR: OK.

142
00:05:55,380 --> 00:05:57,150
Which-- which one?

143
00:05:57,150 --> 00:06:01,410
Does anyone-- does
anyone else want to try?

144
00:06:01,410 --> 00:06:03,780
Let's not do guessing.

145
00:06:03,780 --> 00:06:08,290
I'll pull them up
if nobody remembers.

146
00:06:08,290 --> 00:06:10,910
I spent an entire hour on
that, and you guys did too.

147
00:06:10,910 --> 00:06:11,574
It was painful.

148
00:06:11,574 --> 00:06:12,282
AUDIENCE: I know.

149
00:06:12,282 --> 00:06:14,924
But we had to, like, you
know, add them up, and then--

150
00:06:14,924 --> 00:06:15,590
PROFESSOR: Yeah.

151
00:06:15,590 --> 00:06:17,340
So we did a lot of
work for those numbers.

152
00:06:17,340 --> 00:06:20,280

153
00:06:20,280 --> 00:06:24,350
So get word from line list
was order of W squared.

154
00:06:24,350 --> 00:06:26,490
Does anyone remember why?

155
00:06:26,490 --> 00:06:29,220
What made it take so much time?

156
00:06:29,220 --> 00:06:34,020

157
00:06:34,020 --> 00:06:36,900
AUDIENCE: 'Cause you
append it to the word list.

158
00:06:36,900 --> 00:06:39,467
For, like, you add it to
the end to go through.

159
00:06:39,467 --> 00:06:40,050
PROFESSOR: OK.

160
00:06:40,050 --> 00:06:43,820

161
00:06:43,820 --> 00:06:44,530
So you add it.

162
00:06:44,530 --> 00:06:45,920
So what?

163
00:06:45,920 --> 00:06:47,922
AUDIENCE: So, like,
every time you need it,

164
00:06:47,922 --> 00:06:52,260
you, like, do word list equals
word list plus words in line.

165
00:06:52,260 --> 00:06:53,070
PROFESSOR: OK.

166
00:06:53,070 --> 00:06:55,300
Excellent.

167
00:06:55,300 --> 00:06:58,200
So get words from
line list, line five.

168
00:06:58,200 --> 00:06:59,580
There's a plus there.

169
00:06:59,580 --> 00:07:03,630
And that plus sign messes
up the performance.

170
00:07:03,630 --> 00:07:06,890
So that's why it's w squared.

171
00:07:06,890 --> 00:07:08,800
Count frequency,
we didn't cover it,

172
00:07:08,800 --> 00:07:10,160
you had to take my word for it.

173
00:07:10,160 --> 00:07:12,400
So we will cover it now.

174
00:07:12,400 --> 00:07:15,145
And, inner product.

175
00:07:15,145 --> 00:07:26,060

176
00:07:26,060 --> 00:07:29,610
So suppose you have two vectors
of line-- length L1 and L2.

177
00:07:29,610 --> 00:07:32,700
How much time does it take
to compute the inner product?

178
00:07:32,700 --> 00:07:34,492
AUDIENCE: L1, L2 time.

179
00:07:34,492 --> 00:07:35,075
PROFESSOR: OK.

180
00:07:35,075 --> 00:07:38,880

181
00:07:38,880 --> 00:07:40,970
So how much time does
vector angle take?

182
00:07:40,970 --> 00:07:48,335

183
00:07:48,335 --> 00:07:49,317
AUDIENCE: L1 L2 time.

184
00:07:49,317 --> 00:07:53,750

185
00:07:53,750 --> 00:07:56,509
PROFESSOR: L1 L2.

186
00:07:56,509 --> 00:07:57,800
So it makes three calls, right?

187
00:07:57,800 --> 00:07:59,270
You get the two things.

188
00:07:59,270 --> 00:08:02,450
First, it computes the inner
product of the two lists.

189
00:08:02,450 --> 00:08:05,890
And then it has to compute the
inner product of each vector--

190
00:08:05,890 --> 00:08:08,359
of each document
vector with itself.

191
00:08:08,359 --> 00:08:10,525
Because that's what's on
the bottom of the fraction.

192
00:08:10,525 --> 00:08:13,330

193
00:08:13,330 --> 00:08:15,210
So what's the running
time for that?

194
00:08:15,210 --> 00:08:18,710
AUDIENCE: Plus L1
squared, plus L2 squared.

195
00:08:18,710 --> 00:08:22,880
PROFESSOR: Plus L1
squared, plus L2 squared.

196
00:08:22,880 --> 00:08:25,380
So someone was really helpful
last time and asked me,

197
00:08:25,380 --> 00:08:27,600
can you make this
simpler with some math?

198
00:08:27,600 --> 00:08:30,080
And I said, I don't know
so I don't think I will.

199
00:08:30,080 --> 00:08:32,820
But I looked at-- my, I
looked at my high school math

200
00:08:32,820 --> 00:08:37,549
afterwards, and it turns out
that these-- so L1 squared

201
00:08:37,549 --> 00:08:40,830
plus L2 squared are guaranteed
to be greater than L1 L2

202
00:08:40,830 --> 00:08:43,009
as long as these
numbers are positive.

203
00:08:43,009 --> 00:08:44,550
And we're working
with document list,

204
00:08:44,550 --> 00:08:46,350
so they're always positive.

205
00:08:46,350 --> 00:08:49,155
So this will go away.

206
00:08:49,155 --> 00:08:54,320

207
00:08:54,320 --> 00:09:00,080
So let's assume count frequency
is w squared or smaller.

208
00:09:00,080 --> 00:09:03,310
So the total running time for
word frequencies from file

209
00:09:03,310 --> 00:09:06,580
is w squared.

210
00:09:06,580 --> 00:09:08,300
What's the running
time for everything?

211
00:09:08,300 --> 00:09:13,114

212
00:09:13,114 --> 00:09:14,780
AUDIENCE: Would just
be, like, w squared

213
00:09:14,780 --> 00:09:17,410
plus L1 squared plus L2 squared?

214
00:09:17,410 --> 00:09:18,160
PROFESSOR: Almost.

215
00:09:18,160 --> 00:09:20,330
AUDIENCE: Actually, it depends
on which one's greater, right?

216
00:09:20,330 --> 00:09:21,330
PROFESSOR: Well, almost.

217
00:09:21,330 --> 00:09:27,580
So here W works, assuming you
get one document with W words.

218
00:09:27,580 --> 00:09:30,230
But word frequencies from
files is called twice, once

219
00:09:30,230 --> 00:09:31,870
for each document.

220
00:09:31,870 --> 00:09:36,060
First document has W1 words,
second document has W2 words.

221
00:09:36,060 --> 00:09:38,974
So what's the running time?

222
00:09:38,974 --> 00:09:42,750
AUDIENCE: W1 squared
plus W2 squared.

223
00:09:42,750 --> 00:09:48,000
PROFESSOR: W1 squared
plus W2 squared.

224
00:09:48,000 --> 00:09:50,200
So I take this, and I add this.

225
00:09:50,200 --> 00:09:51,590
Right?

226
00:09:51,590 --> 00:09:55,380
Except when I add this, if
I want to add L1 squared,

227
00:09:55,380 --> 00:09:58,290
I know L1 is the number
of unique documents--

228
00:09:58,290 --> 00:10:00,600
of unique words in the document.

229
00:10:00,600 --> 00:10:04,450
And W1 is the total number
of words in the document.

230
00:10:04,450 --> 00:10:07,177
W1 guaranteed to be
greater or equal than L1,

231
00:10:07,177 --> 00:10:08,260
so it's going to dominate.

232
00:10:08,260 --> 00:10:09,440
I don't need to add it.

233
00:10:09,440 --> 00:10:11,360
Same for L2.

234
00:10:11,360 --> 00:10:12,178
This is it.

235
00:10:12,178 --> 00:10:15,200

236
00:10:15,200 --> 00:10:15,700
OK.

237
00:10:15,700 --> 00:10:18,054
You guys don't seem to
remember the numbers for these.

238
00:10:18,054 --> 00:10:20,220
So that means I didn't
torture you enough last time.

239
00:10:20,220 --> 00:10:21,830
So let's do more.

240
00:10:21,830 --> 00:10:23,270
Let's look at count frequency.

241
00:10:23,270 --> 00:10:27,750
And let's compute
the cost for that.

242
00:10:27,750 --> 00:10:36,830

243
00:10:36,830 --> 00:10:40,730
So what we did last time was, we
went through each line of code.

244
00:10:40,730 --> 00:10:42,820
We thought, how
much time does it

245
00:10:42,820 --> 00:10:44,550
take to execute the line once?

246
00:10:44,550 --> 00:10:46,710
And how many times
does the line run?

247
00:10:46,710 --> 00:10:49,819
And then we compute the product
of that, add everything up,

248
00:10:49,819 --> 00:10:51,360
and that's the cost
for the function.

249
00:10:51,360 --> 00:10:55,940

250
00:10:55,940 --> 00:10:58,260
First off, before
I put numbers here,

251
00:10:58,260 --> 00:10:59,540
what does the method to do?

252
00:10:59,540 --> 00:11:09,480

253
00:11:09,480 --> 00:11:12,040
AUDIENCE: It takes
a list of words--

254
00:11:12,040 --> 00:11:12,730
PROFESSOR: OK.

255
00:11:12,730 --> 00:11:16,150
AUDIENCE: For each
item in that list,

256
00:11:16,150 --> 00:11:17,870
checks to see if
it's-- you know,

257
00:11:17,870 --> 00:11:25,407
list of words that
it's-- counted, right?

258
00:11:25,407 --> 00:11:25,990
PROFESSOR: OK.

259
00:11:25,990 --> 00:11:28,290
So you're telling me
what the code does.

260
00:11:28,290 --> 00:11:29,010
AUDIENCE: Yeah.

261
00:11:29,010 --> 00:11:31,070
PROFESSOR: Try to
look at Main or try

262
00:11:31,070 --> 00:11:33,420
to look at word
frequencies for files.

263
00:11:33,420 --> 00:11:36,540
So look at it top-down, and tell
me what the purpose of it is.

264
00:11:36,540 --> 00:11:38,310
What's the goal?

265
00:11:38,310 --> 00:11:45,108
AUDIENCE: Making a list
of-- and each object

266
00:11:45,108 --> 00:11:47,400
is a list with a
word and a number.

267
00:11:47,400 --> 00:11:48,220
PROFESSOR: OK.

268
00:11:48,220 --> 00:11:49,100
Excellent.

269
00:11:49,100 --> 00:11:51,790
So big picture, I have
the first document.

270
00:11:51,790 --> 00:11:52,640
I read it in.

271
00:11:52,640 --> 00:11:54,106
I break it up into words.

272
00:11:54,106 --> 00:11:55,230
And I have a list of words.

273
00:11:55,230 --> 00:11:58,380
That's what word
frequencies for file

274
00:11:58,380 --> 00:12:00,920
gives me-- sorry, that's what
you get words from line list

275
00:12:00,920 --> 00:12:02,190
gives me.

276
00:12:02,190 --> 00:12:02,890
List of words.

277
00:12:02,890 --> 00:12:08,720

278
00:12:08,720 --> 00:12:16,000
The fox is in the hat.

279
00:12:16,000 --> 00:12:18,680

280
00:12:18,680 --> 00:12:21,600
And this gets passed
to count frequency,

281
00:12:21,600 --> 00:12:24,700
and count frequency
gives me, you said,

282
00:12:24,700 --> 00:12:27,430
an object, which is a list.

283
00:12:27,430 --> 00:12:30,180
Of lists, where each
of them has the word

284
00:12:30,180 --> 00:12:31,590
and how many times it shows up.

285
00:12:31,590 --> 00:12:39,510
So I would have "the" shows
up twice, "fox" shows up

286
00:12:39,510 --> 00:12:51,110
once, "is" shows up
once, "in" shows up once,

287
00:12:51,110 --> 00:12:56,110
and-- I need a shorter
example-- "hat" shows up once.

288
00:12:56,110 --> 00:12:58,155
So it takes this and
turns it into that.

289
00:12:58,155 --> 00:13:01,840

290
00:13:01,840 --> 00:13:05,090
So on line 2, I have a
list L that's initialized.

291
00:13:05,090 --> 00:13:06,770
And then, at the
end, it's returned.

292
00:13:06,770 --> 00:13:09,840
So I'm going to guess that L
is going to look like this.

293
00:13:09,840 --> 00:13:13,350

294
00:13:13,350 --> 00:13:17,050
Line 3, for new word in word
list, iterates over the input.

295
00:13:17,050 --> 00:13:19,430
So iterates over this.

296
00:13:19,430 --> 00:13:23,340
And then, line 4 checks
to see, for each new word,

297
00:13:23,340 --> 00:13:26,150
it looks at the list that
I have under construction.

298
00:13:26,150 --> 00:13:30,950
So exam-- for example, if
I ran through all the words

299
00:13:30,950 --> 00:13:33,360
and then I'm trying to
put in hat right now,

300
00:13:33,360 --> 00:13:35,250
I wouldn't have it here.

301
00:13:35,250 --> 00:13:39,890
What line 4 does is, it
looks at all the entries.

302
00:13:39,890 --> 00:13:42,710
And it says, if I
can find the words--

303
00:13:42,710 --> 00:13:45,180
so if I can find the
word hat somewhere here--

304
00:13:45,180 --> 00:13:47,090
then increment the number.

305
00:13:47,090 --> 00:13:49,642
If I can't, then
make a new entry

306
00:13:49,642 --> 00:13:52,100
and say that the word shows up
once, because it's the first

307
00:13:52,100 --> 00:13:52,600
I see it.

308
00:13:52,600 --> 00:13:56,340

309
00:13:56,340 --> 00:13:57,590
So this is what the code does.

310
00:13:57,590 --> 00:14:01,640
Now let's see how
fast it does that.

311
00:14:01,640 --> 00:14:07,450
So line 2 initialize the
output to an empty list.

312
00:14:07,450 --> 00:14:10,071
What's the cost for that?

313
00:14:10,071 --> 00:14:10,988
AUDIENCE: [INAUDIBLE].

314
00:14:10,988 --> 00:14:11,862
PROFESSOR: Very good.

315
00:14:11,862 --> 00:14:12,580
How many times?

316
00:14:12,580 --> 00:14:18,370

317
00:14:18,370 --> 00:14:19,810
For new word in word lists.

318
00:14:19,810 --> 00:14:20,310
Cost?

319
00:14:20,310 --> 00:14:23,726

320
00:14:23,726 --> 00:14:26,680
AUDIENCE: [INAUDIBLE].

321
00:14:26,680 --> 00:14:29,070
I know the cost is 1.

322
00:14:29,070 --> 00:14:32,330
PROFESSOR: OK, so we are-- here,
it's a bit confusing, right?

323
00:14:32,330 --> 00:14:35,610
We're saying that, oh, there's
does iteration over a list.

324
00:14:35,610 --> 00:14:37,920
And each step of the
iteration is constant time,

325
00:14:37,920 --> 00:14:42,210
but the iteration
happens L times.

326
00:14:42,210 --> 00:14:43,510
I'm sorry, not L times.

327
00:14:43,510 --> 00:14:45,400
The length of the list times.

328
00:14:45,400 --> 00:14:48,120
How many-- how many
elements are in word lists?

329
00:14:48,120 --> 00:14:53,090

330
00:14:53,090 --> 00:14:57,730
I heard a very low W, so
I will pretend I heard it.

331
00:14:57,730 --> 00:15:01,350
Or I hope I heard W. So
word list, the words I

332
00:15:01,350 --> 00:15:06,690
got from the document,
W. How about the if.

333
00:15:06,690 --> 00:15:13,570
So, it looks at the
word that I have-- oh.

334
00:15:13,570 --> 00:15:16,680
This code is confusing because
I forgot a line, right?

335
00:15:16,680 --> 00:15:20,580
Pretend that between line--
oh, no, sorry, I didn't.

336
00:15:20,580 --> 00:15:23,512
New word is-- new word
is assigned in line 3.

337
00:15:23,512 --> 00:15:29,030
So new word in line 3 is
compared to the first element

338
00:15:29,030 --> 00:15:31,280
of the entry that's
assigned in line 4.

339
00:15:31,280 --> 00:15:36,960
So hat is compared with
the, fox, so on, so forth.

340
00:15:36,960 --> 00:15:41,530
And if the comparison is
true, it runs line 6 and 7.

341
00:15:41,530 --> 00:15:45,230
And if not, it keeps looping.

342
00:15:45,230 --> 00:15:50,300
So line 5, the if how,
many times does it run?

343
00:15:50,300 --> 00:15:52,498
Just the comparison.

344
00:15:52,498 --> 00:15:53,810
AUDIENCE: W.

345
00:15:53,810 --> 00:15:55,107
PROFESSOR: W.

346
00:15:55,107 --> 00:15:56,148
AUDIENCE: Oh, no, no, no.

347
00:15:56,148 --> 00:15:57,072
No.

348
00:15:57,072 --> 00:15:58,920
I'm thinking of line 4.

349
00:15:58,920 --> 00:15:59,941
PROFESSOR: Oh.

350
00:15:59,941 --> 00:16:00,440
Yeah.

351
00:16:00,440 --> 00:16:02,260
I'm getting confused too.

352
00:16:02,260 --> 00:16:03,540
So let's start with line 4.

353
00:16:03,540 --> 00:16:04,520
Sorry.

354
00:16:04,520 --> 00:16:05,882
Shouldn't do line 5.

355
00:16:05,882 --> 00:16:08,059
AUDIENCE: --new word,
then you're not-- like,

356
00:16:08,059 --> 00:16:09,850
you're not going to
run through that again.

357
00:16:09,850 --> 00:16:10,516
PROFESSOR: Yeah.

358
00:16:10,516 --> 00:16:12,530
Let's worry about
that right afterwards.

359
00:16:12,530 --> 00:16:13,760
Let's do line 4 first.

360
00:16:13,760 --> 00:16:15,790
Sorry, I jumped over line 4.

361
00:16:15,790 --> 00:16:19,520
So, line 4 definitely
runs W times

362
00:16:19,520 --> 00:16:23,970
because it's inside the for
loop from line 3 to line 9.

363
00:16:23,970 --> 00:16:29,510
So everything here will
definitely run W times.

364
00:16:29,510 --> 00:16:31,910
But how many times
does it run overall?

365
00:16:31,910 --> 00:16:36,377
So, line 4 iterates over
all the entries here.

366
00:16:36,377 --> 00:16:37,710
How many times does that happen?

367
00:16:37,710 --> 00:16:41,161

368
00:16:41,161 --> 00:16:47,570
AUDIENCE: 1 plus W
over-- times 10/2,

369
00:16:47,570 --> 00:16:54,588
because it's just worst case,
L-- the length of L increases

370
00:16:54,588 --> 00:16:56,830
by 1 every time.

371
00:16:56,830 --> 00:16:58,950
[INAUDIBLE]

372
00:16:58,950 --> 00:17:02,350
PROFESSOR: OK, so I like that
you started with worst case.

373
00:17:02,350 --> 00:17:04,130
Normally I would
say exactly that.

374
00:17:04,130 --> 00:17:07,619
Worst case or W. But we
had a different constant

375
00:17:07,619 --> 00:17:11,480
for the number of words
that you have in the end.

376
00:17:11,480 --> 00:17:16,230
So let's say something a little
bit better than W. Let's say,

377
00:17:16,230 --> 00:17:17,880
let's put the lower bound on it.

378
00:17:17,880 --> 00:17:20,036
So yeah, worst case, all
the words are different.

379
00:17:20,036 --> 00:17:21,619
But what if they're
not all different?

380
00:17:21,619 --> 00:17:25,632
And what if in the end
I know I have L words?

381
00:17:25,632 --> 00:17:27,653
CLASS: [INAUDIBLE].

382
00:17:27,653 --> 00:17:28,569
PROFESSOR: Worst case.

383
00:17:28,569 --> 00:17:33,001

384
00:17:33,001 --> 00:17:33,500
Almost.

385
00:17:33,500 --> 00:17:36,340

386
00:17:36,340 --> 00:17:39,940
So, I know I have a W
from the outer loop.

387
00:17:39,940 --> 00:17:42,680
For each word in the
outer loop, how many times

388
00:17:42,680 --> 00:17:46,800
does the inner loop execute?

389
00:17:46,800 --> 00:17:49,210
How many times do I have
to go through something

390
00:17:49,210 --> 00:17:50,250
in the inner list?

391
00:17:50,250 --> 00:17:53,090
So I know here I have
W words, suppose here I

392
00:17:53,090 --> 00:17:55,480
have L elements in the vector.

393
00:17:55,480 --> 00:17:58,195
For each one of
these, how many times

394
00:17:58,195 --> 00:18:02,440
do I have to go through--
So how many elements do I

395
00:18:02,440 --> 00:18:03,946
have to go through here?

396
00:18:03,946 --> 00:18:05,764
AUDIENCE: Depends on
where you are, though.

397
00:18:05,764 --> 00:18:07,930
For the first word, you
only have to go through one.

398
00:18:07,930 --> 00:18:08,230
PROFESSOR: Yep.

399
00:18:08,230 --> 00:18:08,530
But--

400
00:18:08,530 --> 00:18:09,590
AUDIENCE: For the second
word, you have to through--

401
00:18:09,590 --> 00:18:09,870
PROFESSOR: Yep.

402
00:18:09,870 --> 00:18:11,280
But I heard the worst case.

403
00:18:11,280 --> 00:18:12,780
And I like that,
because it's easier

404
00:18:12,780 --> 00:18:14,071
to reason about the worst case.

405
00:18:14,071 --> 00:18:17,480
And most of the time it's
sort of like the average case.

406
00:18:17,480 --> 00:18:21,200
AUDIENCE: So then,
length of list-- L.

407
00:18:21,200 --> 00:18:25,530
PROFESSOR: The length of the
list, and that's L. Worst case,

408
00:18:25,530 --> 00:18:27,976
the first words that I see
will be L different words.

409
00:18:27,976 --> 00:18:29,350
And then all the
words that I see

410
00:18:29,350 --> 00:18:32,840
will be the same as the
words that I saw before.

411
00:18:32,840 --> 00:18:36,080
So worst case, the list
will grow to L very fast,

412
00:18:36,080 --> 00:18:39,085
and then I'll keep
seeing L L L. And I'll

413
00:18:39,085 --> 00:18:40,710
ignore what was there
in the beginning,

414
00:18:40,710 --> 00:18:44,460
and I'll say L times.

415
00:18:44,460 --> 00:18:47,320
So I know the second list
is bounded by L in length,

416
00:18:47,320 --> 00:18:49,470
the first list is
bounded by W in length.

417
00:18:49,470 --> 00:18:53,960
So worst case this
runs L times W times.

418
00:18:53,960 --> 00:18:56,394
And what's the
cost of iterating?

419
00:18:56,394 --> 00:18:58,854
AUDIENCE: What is the
difference between L and W?

420
00:18:58,854 --> 00:19:01,965
L is the document vector length,
and W is the number of words.

421
00:19:01,965 --> 00:19:04,803
But isn't the number of
elements in the document vector

422
00:19:04,803 --> 00:19:07,799
the number of words?

423
00:19:07,799 --> 00:19:09,090
PROFESSOR: How about this case?

424
00:19:09,090 --> 00:19:09,810
What's W?

425
00:19:09,810 --> 00:19:15,134

426
00:19:15,134 --> 00:19:16,110
AUDIENCE: 6, yeah.

427
00:19:16,110 --> 00:19:20,140

428
00:19:20,140 --> 00:19:23,790
PROFESSOR: So L is the number
of unique words in a document.

429
00:19:23,790 --> 00:19:25,580
And I heard a
really cool argument

430
00:19:25,580 --> 00:19:28,160
that I liked last time.

431
00:19:28,160 --> 00:19:29,120
Does anyone remember?

432
00:19:29,120 --> 00:19:29,840
About L?

433
00:19:29,840 --> 00:19:33,450

434
00:19:33,450 --> 00:19:37,160
If we're really dealing with a
natural language like English,

435
00:19:37,160 --> 00:19:39,636
how many words do
I have in English?

436
00:19:39,636 --> 00:19:41,135
AUDIENCE: Well, I
think, at the max,

437
00:19:41,135 --> 00:19:43,997
there's actually around
250,000, but a lot of them

438
00:19:43,997 --> 00:19:45,157
are not used anymore.

439
00:19:45,157 --> 00:19:45,740
PROFESSOR: OK.

440
00:19:45,740 --> 00:19:47,720
So 250,000, right?

441
00:19:47,720 --> 00:19:48,820
Max.

442
00:19:48,820 --> 00:19:50,480
So that's a constant.

443
00:19:50,480 --> 00:19:52,930
If I have a document
that contains

444
00:19:52,930 --> 00:19:56,060
all the writings of all the
authors that were ever done,

445
00:19:56,060 --> 00:20:02,771
and say that's a billion words,
L is still going to be 250,000.

446
00:20:02,771 --> 00:20:03,270
Right?

447
00:20:03,270 --> 00:20:04,882
So L can be very
different from W.

448
00:20:04,882 --> 00:20:06,965
That's why we're keeping
track of them separately.

449
00:20:06,965 --> 00:20:11,510

450
00:20:11,510 --> 00:20:14,910
One W L times W. What's
the cost of iterating?

451
00:20:14,910 --> 00:20:19,190

452
00:20:19,190 --> 00:20:21,530
So we know how many
times line 4 runs,

453
00:20:21,530 --> 00:20:26,930
but what's the cost of one
step of iterating in the list?

454
00:20:26,930 --> 00:20:27,430
1.

455
00:20:27,430 --> 00:20:27,930
Very good.

456
00:20:27,930 --> 00:20:31,530

457
00:20:31,530 --> 00:20:32,036
Line 5.

458
00:20:32,036 --> 00:20:33,160
How many times does it run?

459
00:20:33,160 --> 00:20:37,420

460
00:20:37,420 --> 00:20:39,230
AUDIENCE: W times L?

461
00:20:39,230 --> 00:20:39,990
PROFESSOR: Yep.

462
00:20:39,990 --> 00:20:41,580
Same as line 4, right?

463
00:20:41,580 --> 00:20:43,800
The if is run all the time.

464
00:20:43,800 --> 00:20:47,860
And lines 5 and 6 only run
sometimes, but-- sorry,

465
00:20:47,860 --> 00:20:53,750
line 6 and 7 only run sometimes,
but line 5 runs all the time.

466
00:20:53,750 --> 00:20:54,740
What is the cost?

467
00:20:54,740 --> 00:21:01,642

468
00:21:01,642 --> 00:21:03,280
AUDIENCE: We can
say it's constant.

469
00:21:03,280 --> 00:21:04,780
PROFESSOR: We can
say it's constant.

470
00:21:04,780 --> 00:21:07,560
I like we can say
it's constant, but why

471
00:21:07,560 --> 00:21:10,020
is it that we can
say it's constant?

472
00:21:10,020 --> 00:21:12,480
Why don't I just say 1,
if-- this is an empty list.

473
00:21:12,480 --> 00:21:13,280
This is a number.

474
00:21:13,280 --> 00:21:14,340
AUDIENCE: Depends
on the word length.

475
00:21:14,340 --> 00:21:14,923
PROFESSOR: OK.

476
00:21:14,923 --> 00:21:17,065
It depends on the word
lis-- length, very good.

477
00:21:17,065 --> 00:21:18,690
AUDIENCE: And we're
assuming that words

478
00:21:18,690 --> 00:21:20,170
are all the same length.

479
00:21:20,170 --> 00:21:21,637
PROFESSOR: OK.

480
00:21:21,637 --> 00:21:22,220
AUDIENCE: And?

481
00:21:22,220 --> 00:21:26,190
So, 1 times L W. Little w.

482
00:21:26,190 --> 00:21:28,390
PROFESSOR: OK, very good.

483
00:21:28,390 --> 00:21:30,980
So we do assume that all the
words are the same length.

484
00:21:30,980 --> 00:21:34,920
But unless I tell you that
the length is really small,

485
00:21:34,920 --> 00:21:37,630
which I did, you can't say 1.

486
00:21:37,630 --> 00:21:41,000
So, when you said we can say
it's constant, it's right.

487
00:21:41,000 --> 00:21:42,770
We can say, but we
also have to say why,

488
00:21:42,770 --> 00:21:44,890
or at least think
why, that's the case.

489
00:21:44,890 --> 00:21:48,760
So it's W-- we're going to use
W here and when we copy it here,

490
00:21:48,760 --> 00:21:51,144
we're going to forget about it.

491
00:21:51,144 --> 00:21:52,852
AUDIENCE: Can you put
a top bar on the W,

492
00:21:52,852 --> 00:21:56,830
just so I can tell that
it's not the other W.

493
00:21:56,830 --> 00:22:07,950
PROFESSOR: OK But you have to
be responsible for reminding

494
00:22:07,950 --> 00:22:08,770
me to put that.

495
00:22:08,770 --> 00:22:10,160
AUDIENCE: OK.

496
00:22:10,160 --> 00:22:12,240
PROFESSOR: OK.

497
00:22:12,240 --> 00:22:15,470
So string comparison,
not constant.

498
00:22:15,470 --> 00:22:17,770
If I have two very
long strings that

499
00:22:17,770 --> 00:22:19,550
only differ in the
last character,

500
00:22:19,550 --> 00:22:21,550
I have to go through them
character by character

501
00:22:21,550 --> 00:22:24,810
by character until I find
the last character that's

502
00:22:24,810 --> 00:22:25,580
different.

503
00:22:25,580 --> 00:22:28,770
Because until I look at that,
the strings might be equal.

504
00:22:28,770 --> 00:22:30,926
So comparing two
long strings takes

505
00:22:30,926 --> 00:22:33,175
time that's proportional to
the length of the strings.

506
00:22:33,175 --> 00:22:36,680

507
00:22:36,680 --> 00:22:40,130
OK, so line 5 costs
w, tricky part,

508
00:22:40,130 --> 00:22:44,420
runs L times W part--
L times W times.

509
00:22:44,420 --> 00:22:46,350
How about line 6 and 7?

510
00:22:46,350 --> 00:22:52,860

511
00:22:52,860 --> 00:22:56,280
I didn't ask line 6, I
asked line 6 and 7 together,

512
00:22:56,280 --> 00:22:57,690
because there is a trick there.

513
00:22:57,690 --> 00:23:05,690

514
00:23:05,690 --> 00:23:09,190
AUDIENCE: I think it's
constant for line 6, right?

515
00:23:09,190 --> 00:23:10,518
PROFESSOR: Why?

516
00:23:10,518 --> 00:23:12,858
AUDIENCE: 'Cause it's a number.

517
00:23:12,858 --> 00:23:14,715
And one place in the entry.

518
00:23:14,715 --> 00:23:16,280
We've already grabbed the entry.

519
00:23:16,280 --> 00:23:20,420
PROFESSOR: So the cost
of running it once is 1.

520
00:23:20,420 --> 00:23:21,700
Good.

521
00:23:21,700 --> 00:23:25,190
I can tell you that that's
the same case for 7.

522
00:23:25,190 --> 00:23:26,560
How many times do they run?

523
00:23:26,560 --> 00:23:30,120

524
00:23:30,120 --> 00:23:32,490
This is the hard part.

525
00:23:32,490 --> 00:23:34,980
AUDIENCE: Wait, does break
line break out of one loop?

526
00:23:34,980 --> 00:23:35,605
PROFESSOR: Yep.

527
00:23:35,605 --> 00:23:39,940
Break breaks out of the
loop between lines 4 and 7.

528
00:23:39,940 --> 00:23:41,300
I have a question.

529
00:23:41,300 --> 00:23:46,350
AUDIENCE: Is L supposed
to be not in line with if?

530
00:23:46,350 --> 00:23:46,850
Yes.

531
00:23:46,850 --> 00:23:49,580
PROFESSOR: It is supposed
to be where it is.

532
00:23:49,580 --> 00:23:50,640
I will talk-- yes.

533
00:23:50,640 --> 00:23:53,880
So what happens is, that
else is in line with a for.

534
00:23:53,880 --> 00:23:57,430
And if the for loop
runs to completion,

535
00:23:57,430 --> 00:24:00,860
then it does get executed.

536
00:24:00,860 --> 00:24:04,097
If there's a break
somewhere inside the for,

537
00:24:04,097 --> 00:24:05,305
then it doesn't get executed.

538
00:24:05,305 --> 00:24:08,140

539
00:24:08,140 --> 00:24:10,650
So the idea behind
that is, usually

540
00:24:10,650 --> 00:24:12,510
use this for finding stuff.

541
00:24:12,510 --> 00:24:15,760
So you iterate over a list,
and when you find something,

542
00:24:15,760 --> 00:24:17,516
break out of the loop.

543
00:24:17,516 --> 00:24:19,390
You did something, you
break out of the loop.

544
00:24:19,390 --> 00:24:21,435
If you didn't find it,
you can put an else

545
00:24:21,435 --> 00:24:23,167
and then say what code happens.

546
00:24:23,167 --> 00:24:25,000
And you don't have to
write code on your own

547
00:24:25,000 --> 00:24:29,420
to check if you broke
out of the loop.

548
00:24:29,420 --> 00:24:32,230
So if break executes,
then it's going

549
00:24:32,230 --> 00:24:34,570
to take us out of the loop.

550
00:24:34,570 --> 00:24:35,930
It's going to ignore the else.

551
00:24:35,930 --> 00:24:38,282
And it's going to run
the loop on line 3 again.

552
00:24:38,282 --> 00:24:39,865
So it's going to do
another iteration.

553
00:24:39,865 --> 00:24:42,470

554
00:24:42,470 --> 00:24:44,280
So line 6 and 7, how many times?

555
00:24:44,280 --> 00:24:47,120
AUDIENCE: W minus L?

556
00:24:47,120 --> 00:24:48,450
PROFESSOR: W minus L.

557
00:24:48,450 --> 00:24:50,526
AUDIENCE: Like, if
it's the difference

558
00:24:50,526 --> 00:24:54,962
in number of words and
number of-- unique words.

559
00:24:54,962 --> 00:24:55,670
PROFESSOR: Smart.

560
00:24:55,670 --> 00:25:00,520
You gave the precise answer
right the first time.

561
00:25:00,520 --> 00:25:02,230
W minus L. Very good.

562
00:25:02,230 --> 00:25:06,830

563
00:25:06,830 --> 00:25:13,960
And what happens once--
what happens once this runs?

564
00:25:13,960 --> 00:25:21,730
Why do I know-- why do I know
that the if won't be-- Oh.

565
00:25:21,730 --> 00:25:23,660
Sorry, I'm getting
myself confused.

566
00:25:23,660 --> 00:25:27,950
So it's going to run W
minus L times total, right?

567
00:25:27,950 --> 00:25:31,570
Total times overall
without this W thing here,

568
00:25:31,570 --> 00:25:33,547
so I should put
an arrow and say--

569
00:25:33,547 --> 00:25:38,320

570
00:25:38,320 --> 00:25:41,070
Now suppose I
didn't notice this.

571
00:25:41,070 --> 00:25:43,510
Is there another way I
can get the decent bounds?

572
00:25:43,510 --> 00:25:45,165
So this is the right,
perfect answer.

573
00:25:45,165 --> 00:25:46,290
You have this, you're done.

574
00:25:46,290 --> 00:25:47,623
You don't need to think further.

575
00:25:47,623 --> 00:25:49,420
Suppose you don't have this.

576
00:25:49,420 --> 00:25:50,350
What else can you do?

577
00:25:50,350 --> 00:25:53,479

578
00:25:53,479 --> 00:25:55,070
AUDIENCE: But we
don't have what?

579
00:25:55,070 --> 00:25:57,770
PROFESSOR: If I didn't
notice that, hey, there

580
00:25:57,770 --> 00:26:03,330
are L words-- There
are L new words,

581
00:26:03,330 --> 00:26:07,690
W words total, so W minus
L words repeat themselves.

582
00:26:07,690 --> 00:26:11,230
So this is how many times
the if is going to be true.

583
00:26:11,230 --> 00:26:13,330
If I didn't have
that, then I could

584
00:26:13,330 --> 00:26:16,750
see that line 7 breaks
out of the loop.

585
00:26:16,750 --> 00:26:19,314
So if that if runs
once, then we're done.

586
00:26:19,314 --> 00:26:20,230
We're out of the loop.

587
00:26:20,230 --> 00:26:22,430
It's not going to run again.

588
00:26:22,430 --> 00:26:26,210
So a bound that's not as
precise as the one you gave me

589
00:26:26,210 --> 00:26:31,265
is 1 times W, because it
runs, at most, once per loop.

590
00:26:31,265 --> 00:26:40,030

591
00:26:40,030 --> 00:26:42,720
Does people see this?

592
00:26:42,720 --> 00:26:45,810
So this is an easy way
to cop out of thinking.

593
00:26:45,810 --> 00:26:48,230
And I don't like to think more
than necessary, because you

594
00:26:48,230 --> 00:26:50,100
have finite time on
a test or in life,

595
00:26:50,100 --> 00:26:52,720
and you don't want to spend
too much time on one thing.

596
00:26:52,720 --> 00:26:55,640

597
00:26:55,640 --> 00:26:56,670
We covered the loop.

598
00:26:56,670 --> 00:26:59,010
Let's look at else and append.

599
00:26:59,010 --> 00:27:00,510
I already got a
helpful question,

600
00:27:00,510 --> 00:27:03,940
so I explained when
the else would run.

601
00:27:03,940 --> 00:27:05,240
The running time for else.

602
00:27:05,240 --> 00:27:07,570
Else is a flow--
control flow statement.

603
00:27:07,570 --> 00:27:10,244
It's like break,
so Python will keep

604
00:27:10,244 --> 00:27:11,910
track of whether a
loop completed or not

605
00:27:11,910 --> 00:27:12,899
in constant time.

606
00:27:12,899 --> 00:27:13,690
I'll give you that.

607
00:27:13,690 --> 00:27:16,750
That's in the cost model.

608
00:27:16,750 --> 00:27:21,170
How many times does
this else run, at most?

609
00:27:21,170 --> 00:27:22,840
OK, L. Good.

610
00:27:22,840 --> 00:27:25,546

611
00:27:25,546 --> 00:27:26,865
Perfect.

612
00:27:26,865 --> 00:27:27,573
How about line 9?

613
00:27:27,573 --> 00:27:36,012

614
00:27:36,012 --> 00:27:36,976
Loop stops here.

615
00:27:36,976 --> 00:27:39,890

616
00:27:39,890 --> 00:27:42,200
How about line 9?

617
00:27:42,200 --> 00:27:43,500
How many times does it run?

618
00:27:43,500 --> 00:27:44,886
That's easy.

619
00:27:44,886 --> 00:27:46,290
AUDIENCE: [INAUDIBLE].

620
00:27:46,290 --> 00:27:47,800
PROFESSOR: Yep.

621
00:27:47,800 --> 00:27:48,820
And what does it do?

622
00:27:48,820 --> 00:27:49,590
It's an append.

623
00:27:49,590 --> 00:27:53,111
What's the cost for append?

624
00:27:53,111 --> 00:27:53,902
AUDIENCE: Constant.

625
00:27:53,902 --> 00:27:57,380

626
00:27:57,380 --> 00:27:59,680
PROFESSOR: OK Line 10
runs how many times?

627
00:27:59,680 --> 00:28:02,554
What's the cost?

628
00:28:02,554 --> 00:28:03,054
AUDIENCE: 1?

629
00:28:03,054 --> 00:28:07,307

630
00:28:07,307 --> 00:28:09,390
PROFESSOR: You guys didn't
listen to me last time.

631
00:28:09,390 --> 00:28:11,020
So I was saying you have
to look at the notes

632
00:28:11,020 --> 00:28:12,270
and you have to practice this.

633
00:28:12,270 --> 00:28:14,480
Because you have to have
this model in your mind.

634
00:28:14,480 --> 00:28:16,160
So that when you're
writing code,

635
00:28:16,160 --> 00:28:17,710
this has to happen
automatically.

636
00:28:17,710 --> 00:28:20,306
You shouldn't have to
think explicitly about it.

637
00:28:20,306 --> 00:28:22,180
Because if you do, you're
not going to do it.

638
00:28:22,180 --> 00:28:23,679
AUDIENCE: For the
else, shouldn't it

639
00:28:23,679 --> 00:28:28,612
be W tim-- I mean, it would
be called W times not L times?

640
00:28:28,612 --> 00:28:30,744
Because you want to look
at the outer loop and not

641
00:28:30,744 --> 00:28:32,060
the inner loop?

642
00:28:32,060 --> 00:28:35,942
So it can-- you call all--
once at the end of the total--

643
00:28:35,942 --> 00:28:37,080
the inner for.

644
00:28:37,080 --> 00:28:37,580
Right?

645
00:28:37,580 --> 00:28:43,590
So, so it could be-- happen W
times, maximum, not L times.

646
00:28:43,590 --> 00:28:47,960
'Cause the L is the for loop
that the else coincides with.

647
00:28:47,960 --> 00:28:50,502
And the else would
only happen once

648
00:28:50,502 --> 00:28:53,410
for every total
iteration of that.

649
00:28:53,410 --> 00:28:55,400
PROFESSOR: OK so
you're proposing W

650
00:28:55,400 --> 00:28:58,070
as the bound for
the else, right?

651
00:28:58,070 --> 00:28:59,643
Here.

652
00:28:59,643 --> 00:29:00,939
AUDIENCE: Yeah.

653
00:29:00,939 --> 00:29:02,980
PROFESSOR: So I could say,
hey, it runs, at most,

654
00:29:02,980 --> 00:29:04,140
once for outer loop.

655
00:29:04,140 --> 00:29:06,430
So it's, at most, W times.

656
00:29:06,430 --> 00:29:08,020
This is a nice, easy argument.

657
00:29:08,020 --> 00:29:09,680
We have a bound.

658
00:29:09,680 --> 00:29:10,670
L is a tighter bound.

659
00:29:10,670 --> 00:29:14,730
And when I got L,
what happened here

660
00:29:14,730 --> 00:29:17,630
was the same kind of
thinking that you did earlier

661
00:29:17,630 --> 00:29:18,960
to get this.

662
00:29:18,960 --> 00:29:21,580
So this bound is good,
this bound is tighter.

663
00:29:21,580 --> 00:29:24,260
This bound is good,
this bound is tighter.

664
00:29:24,260 --> 00:29:27,850
And the argument behind
this one is that, hey, this

665
00:29:27,850 --> 00:29:30,570
else only happens for new words.

666
00:29:30,570 --> 00:29:33,300
If there's no new-- if the
word that I looked at is old,

667
00:29:33,300 --> 00:29:36,900
then break is going to execute.

668
00:29:36,900 --> 00:29:39,040
And else is not
going to execute.

669
00:29:39,040 --> 00:29:41,620

670
00:29:41,620 --> 00:29:45,330
So that's why I can say L.

671
00:29:45,330 --> 00:29:47,260
The beauty of
asymptotics is that I

672
00:29:47,260 --> 00:29:49,290
can use either of the
bounds and I'll still

673
00:29:49,290 --> 00:29:50,950
get the correct running time.

674
00:29:50,950 --> 00:29:53,510
So I'm not going to
fuss over it too much.

675
00:29:53,510 --> 00:29:56,290
I like the tighter ones, because
it means you guys are thinking.

676
00:29:56,290 --> 00:29:57,790
And you're looking
at the algorithm,

677
00:29:57,790 --> 00:29:59,330
and you're understanding it.

678
00:29:59,330 --> 00:30:00,830
But if you don't
have them, you'll

679
00:30:00,830 --> 00:30:02,329
still get the correct
running times.

680
00:30:02,329 --> 00:30:04,640
So I think that's nice.

681
00:30:04,640 --> 00:30:07,540
PROFESSOR: OK, let's get the
running time for everything.

682
00:30:07,540 --> 00:30:08,800
Can someone do it in one step?

683
00:30:08,800 --> 00:30:13,890

684
00:30:13,890 --> 00:30:15,830
Then let's do it step by step.

685
00:30:15,830 --> 00:30:17,710
So let's compute
partial products.

686
00:30:17,710 --> 00:30:20,660
What are they?

687
00:30:20,660 --> 00:30:22,118
AUDIENCE: 1.

688
00:30:22,118 --> 00:30:25,520
W. LW.

689
00:30:25,520 --> 00:30:29,894
L, W, little w-- with the bar.

690
00:30:29,894 --> 00:30:31,840
AUDIENCE: (LAUGHING)
With the bar.

691
00:30:31,840 --> 00:30:32,631
PROFESSOR: Awesome.

692
00:30:32,631 --> 00:30:39,258
AUDIENCE: W minus L. W. L, L, 1.

693
00:30:39,258 --> 00:30:42,130

694
00:30:42,130 --> 00:30:45,020
PROFESSOR: OK, so if I add them
up, this is all asymptotic.

695
00:30:45,020 --> 00:30:47,530
So the biggest
one will dominate.

696
00:30:47,530 --> 00:30:49,190
In general, I can
just take a max

697
00:30:49,190 --> 00:30:52,100
instead of doing
actual addition.

698
00:30:52,100 --> 00:30:54,296
So who dominates here?

699
00:30:54,296 --> 00:30:57,782
AUDIENCE: The fourth one down?

700
00:30:57,782 --> 00:30:59,780
PROFESSOR: Fourth-- OK.

701
00:30:59,780 --> 00:31:01,680
Yep.

702
00:31:01,680 --> 00:31:08,820
So line 5 is the biggest time
consumer in this algorithm.

703
00:31:08,820 --> 00:31:11,430
And I know it's W times w-bar.

704
00:31:11,430 --> 00:31:12,940
So now I'm going
to copy it here.

705
00:31:12,940 --> 00:31:16,575
What did I say I'll do
when I'm copying it here?

706
00:31:16,575 --> 00:31:18,030
AUDIENCE: [INAUDIBLE].

707
00:31:18,030 --> 00:31:18,720
PROFESSOR: Yep.

708
00:31:18,720 --> 00:31:22,640
So I'm assuming English
five- to ten-character words.

709
00:31:22,640 --> 00:31:26,320
W L. And W L is
smaller than w squared,

710
00:31:26,320 --> 00:31:28,780
so the assumption that
I had before is correct.

711
00:31:28,780 --> 00:31:30,790
I don't have to
change anything here.

712
00:31:30,790 --> 00:31:31,597
That is good.

713
00:31:31,597 --> 00:31:38,560

714
00:31:38,560 --> 00:31:42,190
So we noticed last time,
and already forgot by now,

715
00:31:42,190 --> 00:31:46,230
that the biggest problem in
this whole implementation

716
00:31:46,230 --> 00:31:50,800
was the plus in get
words from line list.

717
00:31:50,800 --> 00:31:54,300
Suppose we forgot about it and
we have this big pile of code,

718
00:31:54,300 --> 00:31:56,100
how do I go about
making it faster?

719
00:31:56,100 --> 00:31:59,080

720
00:31:59,080 --> 00:32:01,720
Method one, go
through every method.

721
00:32:01,720 --> 00:32:03,680
Do this.

722
00:32:03,680 --> 00:32:06,740
Compute the running times, and
see which one's the slowest.

723
00:32:06,740 --> 00:32:10,190
Does this scale to
1,000 lines of code?

724
00:32:10,190 --> 00:32:11,730
Not so much.

725
00:32:11,730 --> 00:32:14,700
We're going to be giving you
roughly 1,000 lines of code

726
00:32:14,700 --> 00:32:17,960
for PSET2, and we're going
to ask it to make it faster.

727
00:32:17,960 --> 00:32:19,560
Do want to understand
everything?

728
00:32:19,560 --> 00:32:22,350
No.

729
00:32:22,350 --> 00:32:24,590
Instead what you want
to do is run the code

730
00:32:24,590 --> 00:32:25,824
through a profiler.

731
00:32:25,824 --> 00:32:26,740
So we have a computer.

732
00:32:26,740 --> 00:32:28,770
The computer can
tell you which line

733
00:32:28,770 --> 00:32:30,877
takes the most time to run.

734
00:32:30,877 --> 00:32:32,710
So you don't have to
do it on pen and paper.

735
00:32:32,710 --> 00:32:34,940
Whenever we can automate, do so.

736
00:32:34,940 --> 00:32:36,830
So we'll teach you
how to run a profiler.

737
00:32:36,830 --> 00:32:38,110
It's in the notes.

738
00:32:38,110 --> 00:32:40,870
And if you look in
the code outputs right

739
00:32:40,870 --> 00:32:45,080
after [INAUDIBLE], you'll
see a profiler output.

740
00:32:45,080 --> 00:32:47,580
So what that tells you
is for each function,

741
00:32:47,580 --> 00:32:51,980
how much time does it take--
that's the total time?

742
00:32:51,980 --> 00:32:53,670
And there's the
cumulative time, which

743
00:32:53,670 --> 00:32:56,300
is how much does it take
together with its children?

744
00:32:56,300 --> 00:32:59,680

745
00:32:59,680 --> 00:33:04,380
In this case for word
frequencies from file, order

746
00:33:04,380 --> 00:33:06,360
of W-- this is how
much time it takes

747
00:33:06,360 --> 00:33:07,860
including the
functions it's called.

748
00:33:07,860 --> 00:33:10,540
So this is the cumulative time.

749
00:33:10,540 --> 00:33:13,940
Cumulative time is useful if
I'm during runtime analysis.

750
00:33:13,940 --> 00:33:17,120
Is not so useful if
I'm looking at where's

751
00:33:17,120 --> 00:33:18,745
the slowness in my program?

752
00:33:18,745 --> 00:33:20,370
Because if you look
at cumulative time,

753
00:33:20,370 --> 00:33:24,510
you might see the slowness
in one of the functions that

754
00:33:24,510 --> 00:33:27,930
get-- that word frequencies
from file called.

755
00:33:27,930 --> 00:33:30,560
So the cumulative time
for this is really big,

756
00:33:30,560 --> 00:33:35,140
but the total time-- the
time that's spent inside it--

757
00:33:35,140 --> 00:33:36,824
is not that bad.

758
00:33:36,824 --> 00:33:38,240
Instead if you
look at total time,

759
00:33:38,240 --> 00:33:41,520
you'll see that the worst
function is-- surprise,

760
00:33:41,520 --> 00:33:44,670
surprise-- get words
from line list.

761
00:33:44,670 --> 00:33:48,150
So 5 lines to look at--
hey, there's a plus there.

762
00:33:48,150 --> 00:33:50,900
I remember from lecture
that plus copies over lists,

763
00:33:50,900 --> 00:33:54,700
and it's kind of slow, so maybe
I should use something else.

764
00:33:54,700 --> 00:33:57,340
Does d remember what
else we should use?

765
00:33:57,340 --> 00:34:00,290
We talked about that
last recitation.

766
00:34:00,290 --> 00:34:02,200
Extend.

767
00:34:02,200 --> 00:34:03,580
Document distance 2 .

768
00:34:03,580 --> 00:34:06,580
The only difference between
it an document distance 1

769
00:34:06,580 --> 00:34:09,639
is get words from
line list, line 5.

770
00:34:09,639 --> 00:34:12,580
That plus turned into an extend.

771
00:34:12,580 --> 00:34:16,330
One character turned to six,
so about eight keystrokes,

772
00:34:16,330 --> 00:34:18,440
30% runtime improvement.

773
00:34:18,440 --> 00:34:20,500
Very good return on investment.

774
00:34:20,500 --> 00:34:24,530
Everything else is
going to be harder.

775
00:34:24,530 --> 00:34:26,840
So let's look at that line,
because that line dominated

776
00:34:26,840 --> 00:34:29,770
the running time of get
words from line list.

777
00:34:29,770 --> 00:34:32,440
And think what's, then,
your running time for it

778
00:34:32,440 --> 00:34:33,449
now that I have extend?

779
00:34:33,449 --> 00:35:05,440

780
00:35:05,440 --> 00:35:07,800
Does anyone remember what get
words from line list does?

781
00:35:07,800 --> 00:35:16,130

782
00:35:16,130 --> 00:35:19,765
AUDIENCE: It gets the
words out of the document.

783
00:35:19,765 --> 00:35:20,390
PROFESSOR: OK .

784
00:35:20,390 --> 00:35:22,110
It gets the word
out of the document.

785
00:35:22,110 --> 00:35:25,720
So it reads a document that
looks like a regular text file,

786
00:35:25,720 --> 00:35:30,240
and it gets this out of it.

787
00:35:30,240 --> 00:35:33,190
The way it does that is
it goes through each line,

788
00:35:33,190 --> 00:35:36,640
reads the line as a string,
breaks up the string

789
00:35:36,640 --> 00:35:39,310
into a list of words,
and then combines

790
00:35:39,310 --> 00:35:42,240
all those lists
of words together.

791
00:35:42,240 --> 00:35:44,420
Get words from string,
line 4, returns

792
00:35:44,420 --> 00:35:47,114
the number of words in a line.

793
00:35:47,114 --> 00:35:49,030
Sorry, the list of words
in the line, and then

794
00:35:49,030 --> 00:35:51,510
extend combines
the lists together.

795
00:35:51,510 --> 00:35:56,140
Let's add some constants
that we had last time.

796
00:35:56,140 --> 00:36:08,220
I think we had that K is the
number of words per line.

797
00:36:08,220 --> 00:36:16,850
And Z is the number
of total lines.

798
00:36:16,850 --> 00:36:29,492
So this K is actually-- W
over Z. No, I don't like that.

799
00:36:29,492 --> 00:36:29,992
That's work.

800
00:36:29,992 --> 00:36:32,890

801
00:36:32,890 --> 00:36:43,730
So Z is K W over
K. And we argued

802
00:36:43,730 --> 00:36:45,370
that we're not going
to talk about K

803
00:36:45,370 --> 00:36:48,490
too much, because a document
needs to fit on a screen

804
00:36:48,490 --> 00:36:50,820
or needs to fit on
a piece of paper.

805
00:36:50,820 --> 00:36:52,740
So the line length has
to be finite, right?

806
00:36:52,740 --> 00:36:56,069
Otherwise, if I have a document
that has 10,000 character

807
00:36:56,069 --> 00:36:57,860
lines, I can't even
write it on this board.

808
00:36:57,860 --> 00:36:59,940
Even though it's really long.

809
00:36:59,940 --> 00:37:03,300
So K, the number of words in
a line, is going to be finite.

810
00:37:03,300 --> 00:37:06,270
But we'll need it
for this analysis.

811
00:37:06,270 --> 00:37:10,650
So line 4 returns a list
with the words on a line.

812
00:37:10,650 --> 00:37:12,225
How many elements in that list?

813
00:37:12,225 --> 00:37:19,010

814
00:37:19,010 --> 00:37:22,660
This returns a list
with how many elements?

815
00:37:22,660 --> 00:37:28,408

816
00:37:28,408 --> 00:37:31,459
AUDIENCE: Could K
[INAUDIBLE] words on line.

817
00:37:31,459 --> 00:37:33,250
PROFESSOR: So even if
I ask easy questions,

818
00:37:33,250 --> 00:37:34,249
you guys have to answer.

819
00:37:34,249 --> 00:37:36,900
Because otherwise I'll
stall until you do.

820
00:37:36,900 --> 00:37:39,240
So 4 gives me a list
with K elements,

821
00:37:39,240 --> 00:37:43,060
and 5 appends that small
list to the big list of words

822
00:37:43,060 --> 00:37:45,040
in the entire document.

823
00:37:45,040 --> 00:37:47,310
Before I used plus and
that did something bad.

824
00:37:47,310 --> 00:37:48,460
Now I'm using extend.

825
00:37:48,460 --> 00:37:53,248
What's the cost of
one extend called?

826
00:37:53,248 --> 00:37:54,706
AUDIENCE: Constant?

827
00:37:54,706 --> 00:37:55,912
Order of K?

828
00:37:55,912 --> 00:37:58,120
PROFESSOR: I want to know
your Python implementation.

829
00:37:58,120 --> 00:38:02,288

830
00:38:02,288 --> 00:38:04,708
AUDIENCE: If Python did
it like a linked list ,

831
00:38:04,708 --> 00:38:08,477
like doubly linked lists,
could be order constant?

832
00:38:08,477 --> 00:38:09,810
PROFESSOR: I like your question.

833
00:38:09,810 --> 00:38:13,720
So if Python lists were
actually linked lists--

834
00:38:13,720 --> 00:38:16,060
so if the name
wasn't confusing--

835
00:38:16,060 --> 00:38:18,320
then yes, merging two
lists would be constant.

836
00:38:18,320 --> 00:38:21,520
But then accessing one element
in the middle of a list

837
00:38:21,520 --> 00:38:24,160
would not be constant anymore.

838
00:38:24,160 --> 00:38:25,910
Say I want to access
element number

839
00:38:25,910 --> 00:38:29,160
200 in a list of
10,000 elements.

840
00:38:29,160 --> 00:38:31,165
I have to go through
200 elements.

841
00:38:31,165 --> 00:38:32,540
We didn't do linked
lists when we

842
00:38:32,540 --> 00:38:33,930
ran it, so don't worry about it.

843
00:38:33,930 --> 00:38:36,230
So they decided that it's
less confusing to have

844
00:38:36,230 --> 00:38:39,710
lists actually be arrays.

845
00:38:39,710 --> 00:38:42,500
So Python lists, array in CLRS.

846
00:38:42,500 --> 00:38:45,175

847
00:38:45,175 --> 00:38:47,300
AUDIENCE: That can use
their storage contiguously?

848
00:38:47,300 --> 00:38:48,200
PROFESSOR: Yep.

849
00:38:48,200 --> 00:38:50,930
AUDIENCE: So when you copy,
why can't you copy a block?

850
00:38:50,930 --> 00:38:52,731
Why do you have to
access each other?

851
00:38:52,731 --> 00:38:54,720
Does that make sense?

852
00:38:54,720 --> 00:38:56,930
PROFESSOR: So, you
can copy a block,

853
00:38:56,930 --> 00:38:58,600
but in order to copy
a block, you still

854
00:38:58,600 --> 00:39:00,270
have to move everything.

855
00:39:00,270 --> 00:39:03,770
So if your block
is 10,000 elements,

856
00:39:03,770 --> 00:39:08,370
you still have to move 10,000
bytes times element size.

857
00:39:08,370 --> 00:39:11,490
And the CPU works on 4 bytes
at a time or 8 bytes at a time.

858
00:39:11,490 --> 00:39:13,990

859
00:39:13,990 --> 00:39:14,490
OK.

860
00:39:14,490 --> 00:39:16,150
But this is the
right kind of thing

861
00:39:16,150 --> 00:39:17,910
to be thinking about when
you're doing the cost model.

862
00:39:17,910 --> 00:39:19,300
And this is what you
want to have in your head

863
00:39:19,300 --> 00:39:20,424
when you're writing Python.

864
00:39:20,424 --> 00:39:21,610
So, good.

865
00:39:21,610 --> 00:39:23,492
I like your question.

866
00:39:23,492 --> 00:39:24,950
I wanted to say
that at some point,

867
00:39:24,950 --> 00:39:26,366
but I didn't get
the occasion yet.

868
00:39:26,366 --> 00:39:28,050
Lists in Python are
not lists in CLRS.

869
00:39:28,050 --> 00:39:30,561

870
00:39:30,561 --> 00:39:31,060
OK.

871
00:39:31,060 --> 00:39:35,030
So with that long
explanation, an extend

872
00:39:35,030 --> 00:39:37,590
is a list-- is a sequence
of appends, right?

873
00:39:37,590 --> 00:39:39,020
If you have two
lists and you want

874
00:39:39,020 --> 00:39:41,980
to extend the first
list to the second list,

875
00:39:41,980 --> 00:39:44,640
extend basically goes through
each element in the second list

876
00:39:44,640 --> 00:39:47,800
and calls append
on the first list.

877
00:39:47,800 --> 00:39:50,380
The list is length K, the
second list is length K,

878
00:39:50,380 --> 00:39:52,510
so K appends are
going to happen.

879
00:39:52,510 --> 00:39:54,200
And append is constant time.

880
00:39:54,200 --> 00:39:55,910
Total cost, K.

881
00:39:55,910 --> 00:39:58,000
Now many times does line 5 run?

882
00:39:58,000 --> 00:40:01,227

883
00:40:01,227 --> 00:40:03,265
AUDIENCE: [INAUDIBLE].

884
00:40:03,265 --> 00:40:04,140
PROFESSOR: Very good.

885
00:40:04,140 --> 00:40:07,140

886
00:40:07,140 --> 00:40:09,200
So what is the
total running cost

887
00:40:09,200 --> 00:40:12,470
of the algorithm if this
is the line that dominates?

888
00:40:12,470 --> 00:40:13,970
I don't want to do
every other line,

889
00:40:13,970 --> 00:40:16,250
so I'll promise that this
is the line that dominates.

890
00:40:16,250 --> 00:40:19,300
What is the total running time?

891
00:40:19,300 --> 00:40:20,377
AUDIENCE: [INAUDIBLE].

892
00:40:20,377 --> 00:40:20,960
PROFESSOR: OK.

893
00:40:20,960 --> 00:40:22,230
Very good.

894
00:40:22,230 --> 00:40:23,390
K times L. And that is?

895
00:40:23,390 --> 00:40:27,530

896
00:40:27,530 --> 00:40:33,160
Oh, K. So I shouldn't have
said K times L, sorry.

897
00:40:33,160 --> 00:40:35,430
L is not the number of lines.

898
00:40:35,430 --> 00:40:40,250
L is the number here, so
Z is the number of lines.

899
00:40:40,250 --> 00:40:45,720
So it's K times Z. Which
means I'm using bad letters,

900
00:40:45,720 --> 00:40:47,120
so please bear with me.

901
00:40:47,120 --> 00:40:48,830
We'll forget about
them in a minute.

902
00:40:48,830 --> 00:40:50,830
So K times Z equals?

903
00:40:50,830 --> 00:40:53,470
AUDIENCE: W.

904
00:40:53,470 --> 00:40:57,660
PROFESSOR: W. So
what do I write here?

905
00:40:57,660 --> 00:41:01,994

906
00:41:01,994 --> 00:41:02,910
AUDIENCE: [INAUDIBLE].

907
00:41:02,910 --> 00:41:03,530
PROFESSOR: OK.

908
00:41:03,530 --> 00:41:04,760
Good.

909
00:41:04,760 --> 00:41:05,830
Very good.

910
00:41:05,830 --> 00:41:08,800
What do I write here?

911
00:41:08,800 --> 00:41:11,862
Word frequencies from file.

912
00:41:11,862 --> 00:41:13,137
AUDIENCE: W L.

913
00:41:13,137 --> 00:41:13,720
PROFESSOR: OK.

914
00:41:13,720 --> 00:41:16,460

915
00:41:16,460 --> 00:41:17,335
What do I write here?

916
00:41:17,335 --> 00:41:21,407

917
00:41:21,407 --> 00:41:21,990
AUDIENCE: W 1.

918
00:41:21,990 --> 00:41:35,150

919
00:41:35,150 --> 00:41:38,237
AUDIENCE: We have to-- put L
1 squared and L 2 squared back

920
00:41:38,237 --> 00:41:38,950
in.

921
00:41:38,950 --> 00:41:40,160
PROFESSOR: OK, do we?

922
00:41:40,160 --> 00:41:43,060
Well, the-- No.

923
00:41:43,060 --> 00:41:45,535
W, you want to always be
bigger than L. I hope.

924
00:41:45,535 --> 00:41:46,160
PROFESSOR: Yep.

925
00:41:46,160 --> 00:41:48,680
So if I put it
in, L 1 squared is

926
00:41:48,680 --> 00:41:53,570
L 1 L 1, which is
smaller than W 1 L 1.

927
00:41:53,570 --> 00:41:55,560
But I have to think about
it before doing that.

928
00:41:55,560 --> 00:41:58,260
I can't ignore this completely.

929
00:41:58,260 --> 00:42:02,240

930
00:42:02,240 --> 00:42:04,210
So this is document distance 2.

931
00:42:04,210 --> 00:42:07,207
The asymptotic complexity
improved, the running time

932
00:42:07,207 --> 00:42:07,707
improved.

933
00:42:07,707 --> 00:42:14,530

934
00:42:14,530 --> 00:42:16,410
Next thing that happens
to make this faster

935
00:42:16,410 --> 00:42:19,180
is-- I'm not giving you
the profiler output,

936
00:42:19,180 --> 00:42:23,040
but you have to take my word for
it that the longest methods are

937
00:42:23,040 --> 00:42:27,170
count frequency
and inner product.

938
00:42:27,170 --> 00:42:29,770
So what I'm going
to do first is I'm

939
00:42:29,770 --> 00:42:32,320
going to make inner
product faster.

940
00:42:32,320 --> 00:42:34,400
But in order to do
that, I have to make

941
00:42:34,400 --> 00:42:36,550
word frequencies
for file slower.

942
00:42:36,550 --> 00:42:38,050
And that is because
I happen to know

943
00:42:38,050 --> 00:42:44,030
an algorithm for inner product
that is a lot faster, if only

944
00:42:44,030 --> 00:42:47,575
you can promise me that in this
list, the words are ordered.

945
00:42:47,575 --> 00:42:48,450
The words are sorted.

946
00:42:48,450 --> 00:42:51,210

947
00:42:51,210 --> 00:42:57,000
So what happens in that list
1 is, the moment I see a word,

948
00:42:57,000 --> 00:42:59,510
if it's not in the list
I add it at the end.

949
00:42:59,510 --> 00:43:01,820
So the words here show up
pretty much in the order

950
00:43:01,820 --> 00:43:03,710
that they show up in the file.

951
00:43:03,710 --> 00:43:05,680
Well, if instead I
could have something

952
00:43:05,680 --> 00:43:11,090
that looks like-- what is it?

953
00:43:11,090 --> 00:43:12,470
Fox.

954
00:43:12,470 --> 00:43:13,731
In.

955
00:43:13,731 --> 00:43:14,230
His.

956
00:43:14,230 --> 00:43:16,890

957
00:43:16,890 --> 00:43:19,850
Hat is somewhere here.

958
00:43:19,850 --> 00:43:22,650
So if these words
would be in this order,

959
00:43:22,650 --> 00:43:25,210
together with the
word that I'm missing,

960
00:43:25,210 --> 00:43:27,470
then I can combine two lists.

961
00:43:27,470 --> 00:43:31,620
I can do an inner
product a lot faster.

962
00:43:31,620 --> 00:43:34,130
Let's see how that would work.

963
00:43:34,130 --> 00:43:37,360
And I'm already getting confused
my words, so let's do a trick.

964
00:43:37,360 --> 00:43:39,810
Let's say that instead of
words, we'll use numbers.

965
00:43:39,810 --> 00:43:41,570
So instead of saying
"the," I'm going

966
00:43:41,570 --> 00:43:45,060
to say the is the 50th
word in the dictionary.

967
00:43:45,060 --> 00:43:47,957
So I'm going to use number 50.

968
00:43:47,957 --> 00:43:49,290
Because I want to write numbers.

969
00:43:49,290 --> 00:43:50,940
The numbers are
easier to deal with.

970
00:43:50,940 --> 00:43:53,460
So say the first
document that I have--

971
00:43:53,460 --> 00:43:59,840
the fox is in the hat-- as
words number 3, 4, 6, 8, and 9.

972
00:43:59,840 --> 00:44:03,140
And they show up twice,
once, once, once.

973
00:44:03,140 --> 00:44:06,360

974
00:44:06,360 --> 00:44:08,720
9, once.

975
00:44:08,720 --> 00:44:17,160
And say I'm trying to compute
the inner product of this

976
00:44:17,160 --> 00:44:19,370
with a document that
has word number 2

977
00:44:19,370 --> 00:44:23,290
showing up once, word number
3 showing up once, word number

978
00:44:23,290 --> 00:44:28,350
6 showing up once, word
number 7 showing up once,

979
00:44:28,350 --> 00:44:30,174
word number 8 showing up once.

980
00:44:30,174 --> 00:44:33,737

981
00:44:33,737 --> 00:44:35,820
OK, the algorithm for inner
product that we talked

982
00:44:35,820 --> 00:44:39,050
about last time was, go
through each element in one

983
00:44:39,050 --> 00:44:42,710
of the vectors, find an
element with the same word

984
00:44:42,710 --> 00:44:45,410
in the second vector,
and if you can find it,

985
00:44:45,410 --> 00:44:47,370
then take the number of
times the words show up

986
00:44:47,370 --> 00:44:48,800
and multiply them.

987
00:44:48,800 --> 00:44:50,560
So here I have a 3, 2.

988
00:44:50,560 --> 00:44:54,205
I would find this element
here that has 3, 1.

989
00:44:54,205 --> 00:44:55,205
I have these everywhere.

990
00:44:55,205 --> 00:44:58,030

991
00:44:58,030 --> 00:45:00,370
And I take the 2 and the
1 and I multiply them.

992
00:45:00,370 --> 00:45:03,920
So the 3s have to be the same,
then I think the 2 and the 1,

993
00:45:03,920 --> 00:45:05,500
and I multiply them.

994
00:45:05,500 --> 00:45:08,270
And for all the elements
where that's case,

995
00:45:08,270 --> 00:45:10,430
I add up the results
of the multiplication.

996
00:45:10,430 --> 00:45:13,420

997
00:45:13,420 --> 00:45:16,930
So step one, go through
each element in a vector.

998
00:45:16,930 --> 00:45:20,095
That's not going to get
faster if this other vector is

999
00:45:20,095 --> 00:45:21,260
sorted, right?

1000
00:45:21,260 --> 00:45:27,740
But the step of looking up the
second element can be sped up.

1001
00:45:27,740 --> 00:45:31,430
The first and easiest way I
can speed this up is, hey,

1002
00:45:31,430 --> 00:45:32,250
this is sorted.

1003
00:45:32,250 --> 00:45:37,340
If I'm looking up three,
why not do a binary search?

1004
00:45:37,340 --> 00:45:39,640
What would be the
cost if I do that?

1005
00:45:39,640 --> 00:45:43,670
So here I have L1 element,
here I have L2 elements.

1006
00:45:43,670 --> 00:45:47,400
What's the cost of doing
one binary search here?

1007
00:45:47,400 --> 00:45:48,940
AUDIENCE: Log of L2?

1008
00:45:48,940 --> 00:45:51,660
PROFESSOR: OK.

1009
00:45:51,660 --> 00:45:55,201
And how many times do
I do a binary search?

1010
00:45:55,201 --> 00:45:57,682
AUDIENCE: [INAUDIBLE].

1011
00:45:57,682 --> 00:45:59,640
PROFESSOR: So if I go
through each element here

1012
00:45:59,640 --> 00:46:02,139
and I do a binary search, which
is a nice and easy algorithm

1013
00:46:02,139 --> 00:46:03,800
that I can explain
in 10 seconds,

1014
00:46:03,800 --> 00:46:06,520
I'm already faster than L1 L2.

1015
00:46:06,520 --> 00:46:08,680
So it's worth sorting.

1016
00:46:08,680 --> 00:46:11,200
Now, the algorithm
that we use in class

1017
00:46:11,200 --> 00:46:15,790
takes time proportional
to L1 plus L2.

1018
00:46:15,790 --> 00:46:17,590
So that's even
trickier, and it's

1019
00:46:17,590 --> 00:46:19,300
going to take a bit
more time to explain.

1020
00:46:19,300 --> 00:46:21,970

1021
00:46:21,970 --> 00:46:24,250
Does-- did anyone understand
the algorithm for class

1022
00:46:24,250 --> 00:46:25,690
and wants to help me explain it?

1023
00:46:25,690 --> 00:46:30,580

1024
00:46:30,580 --> 00:46:32,790
Didn't think so.

1025
00:46:32,790 --> 00:46:34,140
OK.

1026
00:46:34,140 --> 00:46:39,230
So idea is that both of
these vectors are sorted.

1027
00:46:39,230 --> 00:46:46,390
So if I have a 3 here and I
found my 3 here, next time when

1028
00:46:46,390 --> 00:46:49,850
I get to 4, I know
for sure that 4 is not

1029
00:46:49,850 --> 00:46:54,370
going to be anywhere here,
because this vector is sorted.

1030
00:46:54,370 --> 00:46:57,550
Say I couldn't find
4, then I go to 6.

1031
00:46:57,550 --> 00:47:00,850
If 6 is here, when I
have to look for 8,

1032
00:47:00,850 --> 00:47:04,260
I know for sure that 8 is not
going to be anywhere up here.

1033
00:47:04,260 --> 00:47:07,210

1034
00:47:07,210 --> 00:47:11,290
So what I do is, I have a
pointer here that remembers,

1035
00:47:11,290 --> 00:47:15,684
where's the last element
that I have seen?

1036
00:47:15,684 --> 00:47:16,975
Does this make sense to people?

1037
00:47:16,975 --> 00:47:19,890

1038
00:47:19,890 --> 00:47:22,500
So when I start here
and I look at 3,

1039
00:47:22,500 --> 00:47:26,300
I have a pointer here that says,
I didn't see anything here.

1040
00:47:26,300 --> 00:47:28,170
I look at 2, it's not 3.

1041
00:47:28,170 --> 00:47:29,050
It's smaller.

1042
00:47:29,050 --> 00:47:29,940
I look at 3.

1043
00:47:29,940 --> 00:47:30,950
I found it, good.

1044
00:47:30,950 --> 00:47:32,720
I do my product.

1045
00:47:32,720 --> 00:47:35,610
Then I go look for the
next element here, 4.

1046
00:47:35,610 --> 00:47:36,845
I look at 6.

1047
00:47:36,845 --> 00:47:38,840
6 is bigger than 4.

1048
00:47:38,840 --> 00:47:42,980
So I know for sure that nothing
below it is going to be 4.

1049
00:47:42,980 --> 00:47:46,140
So I can stop right here
and keep my pointer here.

1050
00:47:46,140 --> 00:47:47,840
Then I go to 6 here.

1051
00:47:47,840 --> 00:47:48,580
And I look here.

1052
00:47:48,580 --> 00:47:49,300
Where did I stop?

1053
00:47:49,300 --> 00:47:50,520
I stop here.

1054
00:47:50,520 --> 00:47:52,030
This element matches this.

1055
00:47:52,030 --> 00:47:53,750
I do my product.

1056
00:47:53,750 --> 00:47:56,150
Keep it in.

1057
00:47:56,150 --> 00:47:58,030
Now I go to 8 here.

1058
00:47:58,030 --> 00:47:59,310
I was at 6.

1059
00:47:59,310 --> 00:48:01,320
6 is smaller than 8.

1060
00:48:01,320 --> 00:48:03,236
7 is smaller than 8.

1061
00:48:03,236 --> 00:48:04,420
8 is equal to 8.

1062
00:48:04,420 --> 00:48:05,630
I found something.

1063
00:48:05,630 --> 00:48:06,850
Sweet.

1064
00:48:06,850 --> 00:48:08,500
I do a product.

1065
00:48:08,500 --> 00:48:09,730
And then I stop.

1066
00:48:09,730 --> 00:48:12,320
And I look at the next element.

1067
00:48:12,320 --> 00:48:15,830
I know for sure that nothing
here is going to be 9,

1068
00:48:15,830 --> 00:48:17,610
so I can keep looking down.

1069
00:48:17,610 --> 00:48:20,160
I hit the end of my list.

1070
00:48:20,160 --> 00:48:21,310
OK.

1071
00:48:21,310 --> 00:48:25,069
Whatever I have down here--
it's going 9, 10, 11, 12--

1072
00:48:25,069 --> 00:48:27,360
is not going to be in this
list, because it was sorted.

1073
00:48:27,360 --> 00:48:30,475
So I can stop.

1074
00:48:30,475 --> 00:48:34,340
AUDIENCE: How do
you keep going--

1075
00:48:34,340 --> 00:48:37,340
What algorithm so you use
to keep looking down on L2?

1076
00:48:37,340 --> 00:48:38,204
PROFESSOR: Plus 1.

1077
00:48:38,204 --> 00:48:39,120
AUDIENCE: So what if--

1078
00:48:39,120 --> 00:48:40,328
PROFESSOR: I keep going down.

1079
00:48:40,328 --> 00:48:46,360
AUDIENCE: What if the left
side was 3, 4, 6, 9243?

1080
00:48:46,360 --> 00:48:47,370
PROFESSOR: 9,000, what?

1081
00:48:47,370 --> 00:48:48,470
AUDIENCE: Just a big number.

1082
00:48:48,470 --> 00:48:49,136
AUDIENCE: Right.

1083
00:48:49,136 --> 00:48:52,714
Then you have to
increment by 1 each time--

1084
00:48:52,714 --> 00:48:54,130
PROFESSOR: So I'm
not incrementing

1085
00:48:54,130 --> 00:48:56,080
the number I'm looking at.

1086
00:48:56,080 --> 00:48:57,710
Here, I looked at 6.

1087
00:48:57,710 --> 00:48:59,610
Then I'm looking for 8.

1088
00:48:59,610 --> 00:49:03,180
And after I found 8,
I'm looking for 19,000.

1089
00:49:03,180 --> 00:49:06,800
So I'm going to go down,
either until I find 90,000

1090
00:49:06,800 --> 00:49:07,710
or until I stop.

1091
00:49:07,710 --> 00:49:10,382
AUDIENCE: So are you going
to go down one at a time,

1092
00:49:10,382 --> 00:49:11,007
PROFESSOR: Yep.

1093
00:49:11,007 --> 00:49:13,840
AUDIENCE: --why not
do a binary search?

1094
00:49:13,840 --> 00:49:16,560
PROFESSOR: Because if
I do a binary search,

1095
00:49:16,560 --> 00:49:20,550
the analysis that says it's
fast is not going to work.

1096
00:49:20,550 --> 00:49:23,550
It turns out that this gives
me the optimal running time.

1097
00:49:23,550 --> 00:49:28,830
If I do a binary search, suppose
I have a list that's like this.

1098
00:49:28,830 --> 00:49:33,509
1, 2, 3, 4, 5, all the
way down to 10,000.

1099
00:49:33,509 --> 00:49:36,750
Ugh, I can't write.

1100
00:49:36,750 --> 00:49:39,920
And I have another list
that's like this- 1, 2, 3, 4,

1101
00:49:39,920 --> 00:49:42,800
all the way down to 10,000.

1102
00:49:42,800 --> 00:49:44,720
1, 1.

1103
00:49:44,720 --> 00:49:49,830
I do a binary search,
takes log N. I look at 2.

1104
00:49:49,830 --> 00:49:53,230
I do a binary search, takes
almost log N. I look at 3,

1105
00:49:53,230 --> 00:49:56,352
do a binary search, takes
almost log N. So on, so forth.

1106
00:49:56,352 --> 00:49:59,550

1107
00:49:59,550 --> 00:50:00,705
So this is--

1108
00:50:00,705 --> 00:50:03,214
AUDIENCE: But if your
left list had been

1109
00:50:03,214 --> 00:50:04,380
PROFESSOR: Log N plus log N.

1110
00:50:04,380 --> 00:50:06,034
AUDIENCE: --10,000.

1111
00:50:06,034 --> 00:50:06,700
PROFESSOR: Yeah.

1112
00:50:06,700 --> 00:50:09,770
AUDIENCE: You'd
have taken N time.

1113
00:50:09,770 --> 00:50:10,630
PROFESSOR: Yes.

1114
00:50:10,630 --> 00:50:13,030
Well, this algorithm
takes N time,

1115
00:50:13,030 --> 00:50:15,800
even if I have to
list like this.

1116
00:50:15,800 --> 00:50:19,490
It takes 10,000
plus 10,000 time.

1117
00:50:19,490 --> 00:50:21,700
Whereas this algorithm will
take time that's actually

1118
00:50:21,700 --> 00:50:29,910
proportional to
10,000 log 10,000.

1119
00:50:29,910 --> 00:50:30,740
You believe me.

1120
00:50:30,740 --> 00:50:34,480
So, a way to look at this
is to do bounds and say,

1121
00:50:34,480 --> 00:50:37,140
for the elements
1 through 5,000,

1122
00:50:37,140 --> 00:50:41,470
it's going to do a binary search
for more than 5,000 elements.

1123
00:50:41,470 --> 00:50:44,170
So, the time-- the running
time is definitely bigger

1124
00:50:44,170 --> 00:50:48,366
than N over 2 log N over 2.

1125
00:50:48,366 --> 00:50:51,840

1126
00:50:51,840 --> 00:50:52,630
Constant.

1127
00:50:52,630 --> 00:51:00,710
This becomes log N
minus 1 and log N.

1128
00:51:00,710 --> 00:51:01,900
That is a good question.

1129
00:51:01,900 --> 00:51:04,340
I wondered about that the first
time I saw merge-sort too.

1130
00:51:04,340 --> 00:51:05,810
And I was thinking,
hey, I'm going

1131
00:51:05,810 --> 00:51:07,768
to do a binary search
here because it's faster,

1132
00:51:07,768 --> 00:51:10,510
and I'm going to make a faster
algorithm than anyone has ever

1133
00:51:10,510 --> 00:51:11,390
seen.

1134
00:51:11,390 --> 00:51:13,320
Well, if you do the
analysis, not so much.

1135
00:51:13,320 --> 00:51:14,660
But you need to think
about it, and you

1136
00:51:14,660 --> 00:51:16,660
need to know why that's
true or that's not true.

1137
00:51:16,660 --> 00:51:17,670
So I like your question.

1138
00:51:17,670 --> 00:51:20,140
Thank you.

1139
00:51:20,140 --> 00:51:22,520
So now let's get down so
the plain old merge-sort

1140
00:51:22,520 --> 00:51:26,860
that everyone-- sorry,
merge that everyone knows.

1141
00:51:26,860 --> 00:51:30,610
So if we go through
these one by one,

1142
00:51:30,610 --> 00:51:33,880
how many times am I going to
be advancing this pointer?

1143
00:51:33,880 --> 00:51:35,498
In total?

1144
00:51:35,498 --> 00:51:36,920
AUDIENCE: L2?

1145
00:51:36,920 --> 00:51:38,160
PROFESSOR: L2 times.

1146
00:51:38,160 --> 00:51:42,540
So this pointer can
only go down, right?

1147
00:51:42,540 --> 00:51:45,660
So worst case is going
to go down L2 times.

1148
00:51:45,660 --> 00:51:48,550
And then I'm done
with the list, return.

1149
00:51:48,550 --> 00:51:50,580
How many elements am I
going to look through?

1150
00:51:50,580 --> 00:51:53,301
So how many times does this
pointer going to advance?

1151
00:51:53,301 --> 00:51:53,842
AUDIENCE: L1?

1152
00:51:53,842 --> 00:51:56,700

1153
00:51:56,700 --> 00:51:59,080
PROFESSOR: This one.

1154
00:51:59,080 --> 00:52:00,550
AUDIENCE: But I thought, like--

1155
00:52:00,550 --> 00:52:03,490
AUDIENCE: Then you get
extra ones in between.

1156
00:52:03,490 --> 00:52:05,560
PROFESSOR: What if
L2 is bigger than L1?

1157
00:52:05,560 --> 00:52:06,780
What if--

1158
00:52:06,780 --> 00:52:07,660
AUDIENCE: Oh, right.

1159
00:52:07,660 --> 00:52:08,540
OK, never mind.

1160
00:52:08,540 --> 00:52:10,499
The reason I said
L2 was because--

1161
00:52:10,499 --> 00:52:13,040
PROFESSOR: You're thinking after
I'm going through this list,

1162
00:52:13,040 --> 00:52:13,700
I'm out, right?

1163
00:52:13,700 --> 00:52:15,325
AUDIENCE: Right,
that's what I thought.

1164
00:52:15,325 --> 00:52:17,640
PROFESSOR: So, your answer
works if this list, say, has

1165
00:52:17,640 --> 00:52:19,620
10 elements and this has 10,000.

1166
00:52:19,620 --> 00:52:22,230
And I go through this
one really quickly.

1167
00:52:22,230 --> 00:52:25,697
But if this list has 10 elements
and this list has 10,000,

1168
00:52:25,697 --> 00:52:27,280
and they both start
with 1 through 10,

1169
00:52:27,280 --> 00:52:31,660
I have to say a 1 because
that's a better bound.

1170
00:52:31,660 --> 00:52:33,596
So I have to say this.

1171
00:52:33,596 --> 00:52:35,345
AUDIENCE: Could you
say the opposite value

1172
00:52:35,345 --> 00:52:37,450
of the difference
between the two of them?

1173
00:52:37,450 --> 00:52:39,486
Because if you're going
to stop at 1 has 10

1174
00:52:39,486 --> 00:52:43,120
and the other one has
10,000, and let's say

1175
00:52:43,120 --> 00:52:46,579
that only the first
ten are actually equal,

1176
00:52:46,579 --> 00:52:48,870
then you're going to go
through that list, find all 10,

1177
00:52:48,870 --> 00:52:50,110
and you stop.

1178
00:52:50,110 --> 00:52:51,059
That would be 9,000--

1179
00:52:51,059 --> 00:52:53,350
PROFESSOR: I could say about
that if I'm looking at one

1180
00:52:53,350 --> 00:52:56,160
case, but the magic trick
is-- let's-- we're looking

1181
00:52:56,160 --> 00:52:57,680
at the worst case.

1182
00:52:57,680 --> 00:52:59,844
So worst case, if
I have 10 elements,

1183
00:52:59,844 --> 00:53:01,760
they'll be all the way
down in the other list.

1184
00:53:01,760 --> 00:53:03,010
Or they won't be there at all.

1185
00:53:03,010 --> 00:53:04,970
And I have to go down
through all the list.

1186
00:53:04,970 --> 00:53:07,670
AUDIENCE: OK.

1187
00:53:07,670 --> 00:53:09,640
PROFESSOR: So worst
case, L1 plus L2.

1188
00:53:09,640 --> 00:53:15,000

1189
00:53:15,000 --> 00:53:18,760
Let me see if I
have any time left.

1190
00:53:18,760 --> 00:53:19,720
Nope.

1191
00:53:19,720 --> 00:53:25,050
So, what I would like you to do
is go through insertion sort.

1192
00:53:25,050 --> 00:53:27,700
Insertion sort
matches the textbook.

1193
00:53:27,700 --> 00:53:30,540
Look at the definition in the
textbook, look at the code,

1194
00:53:30,540 --> 00:53:32,009
convince yourself it's the same.

1195
00:53:32,009 --> 00:53:33,800
Look at the running
time, convince yourself

1196
00:53:33,800 --> 00:53:35,280
it's N squared.

1197
00:53:35,280 --> 00:53:38,880
Then look at inner product
and convince yourself

1198
00:53:38,880 --> 00:53:41,360
that this is what it does.

1199
00:53:41,360 --> 00:53:44,820
Go through this line by
line, see where they match,

1200
00:53:44,820 --> 00:53:46,100
put the cost on.

1201
00:53:46,100 --> 00:53:49,040
Make sure that the
cost is L1 plus L2.

1202
00:53:49,040 --> 00:53:52,810
And last, go through
merge-sort and notice

1203
00:53:52,810 --> 00:53:56,660
that merge in [INAUDIBLE]
6 is exactly the same

1204
00:53:56,660 --> 00:53:57,880
as inner product.

1205
00:53:57,880 --> 00:54:00,020
So this pointer
magic that I did here

1206
00:54:00,020 --> 00:54:02,440
is exactly what's
happening inside merge.

1207
00:54:02,440 --> 00:54:03,570
And understand merge-sort.

1208
00:54:03,570 --> 00:54:05,630
Look at the textbook,
look at the notes,

1209
00:54:05,630 --> 00:54:08,270
and see how they match.

1210
00:54:08,270 --> 00:54:08,870
OK.

1211
00:54:08,870 --> 00:54:10,720
Thanks, guys.