The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: Last Tuesday, we ended the lecture talking about knapsack problems. We talked about the continuous knapsack problem and the fact that you could solve it optimally with a greedy algorithm. And we looked at the 0-1 knapsack problem and discussed the fact that while we could write greedy algorithms that would solve the problem quickly, we have to be careful what we mean by "solve": while those algorithms would choose a set of items that we could indeed carry away, there was no guarantee that they would choose the optimal items, that is to say, a set that would maximize the objective function of total value.

After that, we looked at a brute force algorithm, on the board only, for finding a guaranteed optimal solution, but observed that on even a moderately sized set of items it might take a decade or so to run. We decided that wasn't very good. Nevertheless, I want to start today by looking at some code that implements a brute force algorithm, not because I expect anyone to actually run it on a real example, but because a bit later in the term we'll see how we could modify it into something that would be practical. And there are some things to learn by looking at it.

So let's look at some code here. I don't expect you to understand all the details of this code in real time. It's more that I want you to understand the basic idea behind it and then the result we get. You'll remember that we looked at the complexity by saying, well, really, it's like binary numbers. So the first helper subroutine I'm going to use is something that generates binary numbers. It takes n, some natural number, and a number of digits, and returns a binary string of that length representing the decimal number n.
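A minimal sketch of such a helper. The name and exact interface here are my assumption, not necessarily the code shown in lecture:

```python
def d_to_b(n, num_digits):
    """Return a binary string of length num_digits representing
    the natural number n, zero-padded on the left."""
    assert 0 <= n < 2 ** num_digits
    result = ''
    while n > 0:
        result = str(n % 2) + result
        n = n // 2
    # Pad with leading zeros so the string always has num_digits digits.
    return result.rjust(num_digits, '0')

print(d_to_b(5, 4))  # → '0101'
```

The left padding is the point of the second argument, as explained next.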
Why am I giving it this number of digits? Because I need to zero pad it. If I want a vector that represents whether or not I take each item, and I take only one item, say the first one, I don't want just a binary string with one digit in it, because I need all those zeros to indicate that I'm not taking the other items. And so the second argument tells me, in effect, how many zeros I'm going to need. There's nothing mysterious about the way it does it.

OK. The next helper function generates the power set of the items. What is a power set? If you take a set, you can ask the question: what are all the subsets of that set? What's the smallest subset of a set? It's the empty set, no items. What's the largest subset of a set? All of the items. And then we have everything in between: the set that contains the first item, the set that contains the second item, et cetera; the set that contains the first and the second, the first and the third. There are a lot of them. And of course, how many is a lot? Well, 2 to the n is a lot.

But now we're going to generate every possible subset of items. And we're going to do this simply by using the decimal-to-binary function to tell us whether or not we keep each item, so we can enumerate them. We can generate them all. And now we have the set of all possible collections of items one might take, irrespective of whether they obey the constraint of not weighing too much.

The next function is the one that does the work. This is the interesting one: chooseBest. It takes a power set, the constraint, and two functions. One, getValue, tells me the value of an item. The other, getWeight, tells me the weight of an item. Then it just goes through and enumerates all possibilities and eventually chooses, I won't say the best set, because it might not be unique (there might be more than one optimal answer), but it finds at least one optimal answer, and then it returns that. Again, it's a very straightforward implementation of the brute force algorithm I sketched on the board.
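A sketch of how those two functions might look. The names, the bit-twiddling in place of the string helper, and the item representation are my assumptions, not the lecture's code:

```python
def gen_powerset(items):
    """Generate all 2**len(items) subsets of items, by binary
    counting: bit j of the counter says whether item j is included."""
    powerset = []
    for i in range(2 ** len(items)):
        subset = []
        for j in range(len(items)):
            if (i >> j) & 1:  # is bit j of i set?
                subset.append(items[j])
        powerset.append(subset)
    return powerset

def choose_best(powerset, constraint, get_value, get_weight):
    """Brute force: examine every subset, keep the most valuable
    one whose total weight does not exceed the constraint."""
    best_value = 0.0
    best_set = None
    for subset in powerset:
        value = sum(get_value(item) for item in subset)
        weight = sum(get_weight(item) for item in subset)
        if weight <= constraint and value > best_value:
            best_value = value
            best_set = subset
    return best_set, best_value
```

With items represented as (name, value, weight) tuples, get_value and get_weight would just be `lambda item: item[1]` and `lambda item: item[2]`.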
And then we can run it with testBest, which is going to build the items using the function we looked at last time. It's then going to get the power set of the items, call chooseBest, and then print the result.

So let's see what happens if we run it. We get an error. Oh dear, I hadn't expected that. And it says... oh, testBest is not defined. All right, let's try that again. It sure looks like it's defined to me. There it is. OK.

And you may recall that this is a better answer than anything that was generated by the greedy algorithm on Tuesday. You may not recall it, but believe me, it is. It happened to have found a better solution. And not surprisingly, that's because I contrived the example to make sure that would happen. Why does it work better, or rather, why might it find a better answer? Well, because the greedy algorithm chose something that was locally optimal at each step. But there was no guarantee that a sequence of locally optimal decisions would reach a global optimum.

What this algorithm does is find a global optimum by looking at all solutions. And that's something we'll see again and again as we go forward: there's always a temptation to do things one step at a time, finding local optima, because it's fast and it's easy. But there's no guarantee it will work well.

Now the problem, of course, with finding the global optimum is, as we discussed, that it is prohibitively expensive. Now, you could ask: is it prohibitively expensive because I chose a stupid algorithm, the brute force algorithm? Well, it is a stupid algorithm. But in fact, this is a problem that is what we would call inherently exponential. We've looked at this concept before: in addition to talking about the complexity of an algorithm, we can talk about the complexity of a problem, in which we ask the question, how fast can the absolute best, fastest solution to this problem be?
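A small illustration of that point, with a contrived instance of my own (not the example from lecture): a greedy algorithm that always takes the remaining item with the best value-to-weight ratio can miss the global optimum.

```python
def greedy_by_density(items, constraint):
    """Repeatedly take the item with the best value/weight ratio
    that still fits. Locally optimal at each step, but not
    guaranteed to be globally optimal."""
    taken, total_value, remaining = [], 0, constraint
    for name, value, weight in sorted(items, key=lambda it: it[1] / it[2],
                                      reverse=True):
        if weight <= remaining:
            taken.append(name)
            total_value += value
            remaining -= weight
    return taken, total_value

# Capacity 10: greedy takes x first (best ratio, 7/6), and then
# neither y nor z fits, for a total value of 7. Taking y and z
# instead gives value 10, the global optimum brute force would find.
items = [('x', 7, 6), ('y', 5, 5), ('z', 5, 5)]
print(greedy_by_density(items, 10))  # → (['x'], 7)
```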
And here you can construct a mathematical proof that says the problem is inherently exponential. No matter what we do, we're not going to be able to find something that's guaranteed to find the optimal answer and that is faster than exponential. Well, now, let's be careful about that statement. What it means is that the worst case is inherently exponential. As we will see in a couple of weeks (it'll take us a while to get there), there are actually algorithms that people use to solve these inherently exponential problems, and to solve them fast enough to be useful.

So, for example, when you go to look at airline fares on Kayak to try and find the best fare from A to B, that is an inherently exponential problem, but you get an answer pretty quickly. And that's because there are techniques you can use. Now, in fact, one of the reasons you get it quickly is that they don't guarantee that you actually get an optimal solution. But there are also techniques that guarantee to give you an optimal solution and that almost all the time will run quickly. And we'll look at one of those a bit later in the term.

Before we do that, however, I want to leave the whole question of complexity behind for a while and look at another class of optimization problems. We'll look at several different kinds of optimization problems as the term goes forward. The kind I want to look at today belongs to what is probably, I would say, the most exciting branch of computer science today. And of course I might have a bias. And that's machine learning.

It's a term you'll hear a lot about. And it's a technique that many of you will apply. You might not write your own code. But I guarantee you will either be the beneficiary or the victim of machine learning almost every time you log on to the web these days.

I should probably start by defining what machine learning is. But that's hard to do; I really don't know how to do it. Superficially, you could say that machine learning deals with the question of how to build programs that learn. However, I think in a very real sense every program we write learns something. If I implement Newton's method, it's learning what the roots of the polynomial are. Certainly when we looked at curve fitting, fitting curves to data, we were learning a model of the data. That's what regression is.

Wikipedia says (and of course, it must be true if Wikipedia says it) that machine learning is a scientific discipline that is concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data. I'm not sure how helpful this definition is, but it was the best I could find. And it doesn't really matter. It does, though, get at the issue that a major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data.

This whole process is called inductive inference. The basic idea is that one observes... actually, one doesn't. The program observes examples that represent incomplete information about some statistical phenomenon, and then tries to generate a model, just as with curve fitting, that summarizes some statistical properties of that data and can be used to predict the future, for example, to give you information about unseen data.

There are, roughly speaking, two distinct approaches to machine learning, called supervised learning and unsupervised learning.

Let's first talk about supervised learning; it's a little easier to appreciate how it might work. In supervised learning, we associate a label with each example in a training set. So think of that as an answer to a query about an example. If the labels are discrete, we typically call it a classification problem. So we would try to classify, for example, a transaction on a credit card as belonging to the owner of that credit card or not belonging to the owner, i.e., with some probability, a stolen credit card. So it's discrete: it belongs to the owner, or it doesn't belong to the owner.

If the labels are real valued, we think of it as a regression problem. And so indeed, when we did the curve fitting, we were doing machine learning, and we were handling a regression problem.

Based on the examples in the training set, the goal is to build a program that can predict the answer for other cases before they are explicitly observed. So we're trying to generalize from the statistical properties of a training set to be able to make predictions about things we haven't seen.

Let's look at an example. Here, I've got red and blue circles. And I'm trying to learn what makes a circle red, or what's the difference between red and blue, other than the color. Think of my information as the (x, y) values and the label as the color, red or blue. So I've labeled each one, and now I'm trying to learn something. Well, it's kind of tricky. What are the questions I need to answer to think about this? And then we'll look at how we might do it.
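To make the supervised setup concrete, here is a minimal sketch of my own (not a method from the lecture): a nearest-neighbor classifier that predicts the label of a new point by copying the label of the closest labeled training example.

```python
import math

def nearest_neighbor(training, point):
    """training: list of ((x, y), label) pairs.
    Predict the label of point by finding the closest
    training example under Euclidean distance."""
    def dist(p, q):
        return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)
    closest = min(training, key=lambda example: dist(example[0], point))
    return closest[1]

training = [((1, 1), 'red'), ((1, 2), 'red'),
            ((8, 8), 'blue'), ((9, 7), 'blue')]
print(nearest_neighbor(training, (2, 2)))  # → 'red'
```

A point near the red group gets labeled red; whether that generalizes depends on exactly the questions raised next.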
So, the first question I need to ask is: are the labels accurate? And in fact, in a lot of real-world examples, in most real-world examples, there's no guarantee that the labels are accurate. So you have to assume that, well, maybe some of the labels are wrong. How do we deal with that?

Perhaps the most fundamental question is: is the past representative of the future? We've seen many examples where people have learned things, for example, to predict the price of housing. And it turns out you hit some singularity, which means the past is not a very good predictor of the future. And even if all of your learning is good, you get the wrong answer. So you sort of always have to ask that question.

Do you have enough data to generalize? And by this, I mean enough training data. If your training set is very small, you shouldn't have a lot of confidence in what you learn.

A big issue here is feature extraction. As we'll see when we look at real examples, the world is a pretty complex place, and we need to decide what features we're going to use. If I were to ask 25% of you to come up to the front of the room and then try to separate you based upon some feature, if I were to say, all right, I'm going to separate the good students from the bad students, but the only features I have available are the clothes you're wearing, it might not work so well.

And very importantly, how tight should the fit be?

So now let's go back to our example. We can look at two different ways we might generalize from this data. And indeed, when we're looking at classification problems in supervised learning, what we're typically doing is trying to find some way of dividing our training data. In this case, I've given you a two-dimensional projection. As we'll see, it's not always two-dimensional; it's not usually two-dimensional. So I might choose this rather eccentric shape and say that's great. And why is that great?

It's great because it minimizes training error. So if we look at it as an optimization problem, we might say that our objective function is how many points are correctly classified in the training data as red or blue. And this triangular shape has no training error: every point in the training data is perfectly classified.

If I choose this linear separator instead, I have some training error: this red point is misclassified in the training data. Does that mean that the triangle is better than the line? Not necessarily, right? Because my goal is to predict future points. And maybe that point is mislabeled, an experimental error. Maybe it's accurately labeled but an outlier, something very unusual, and the eccentric shape will not generalize well. This is analogous to what we talked about as overfitting when we looked at curve fitting. And that's a very big problem in machine learning: if you overfit to your training data, it might not generalize well, and it might give you bogus answers going forward.

OK. So that's a very quick look at supervised learning.
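That objective function, the count of misclassified training points, is easy to write down. A sketch, using a hypothetical linear separator of my own invention (points to the left of the vertical line x = 5 are called red):

```python
def training_error(classifier, labeled_points):
    """Count how many labeled training points the classifier
    gets wrong. labeled_points: list of ((x, y), label) pairs."""
    return sum(1 for point, label in labeled_points
               if classifier(point) != label)

# A hypothetical linear separator: the vertical line x = 5.
def linear_separator(point):
    return 'red' if point[0] < 5 else 'blue'

data = [((1, 1), 'red'), ((2, 3), 'red'), ((8, 8), 'blue'),
        ((6, 2), 'blue'), ((7, 4), 'red')]  # the last point is an outlier
print(training_error(linear_separator, data))  # → 1
```

A more contorted boundary could drive that count to zero on this data, but, as just discussed, that is exactly the overfitting risk.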
We'll come back to that. I now want to talk about unsupervised learning. The big difference here is that we have training data, but we don't have labels. So I just give you a bunch of points. It's as if we looked at this picture and I didn't tell you which were the red points and which were the blue points. They were just all points.

So what can I learn? What you're typically learning in unsupervised learning is the regularities of the data. So if we look at this and think away the red and the blue, we might well say, well, at least there is some structure to this data. And maybe what I should do is divide it this way; that gives me a kind of nice, clean separation. But maybe I should divide it that way instead. Or maybe I should put a circle around each of these four groupings. It's complicated, deciding what to do. But what we see is that there is clearly some structure here. And the idea of unsupervised learning is to discover that structure.

Far and away the dominant form of unsupervised learning is clustering. And that's what I was just talking about: finding the clusters in this data. So we'll move forward here. There it is, with everything the same color. But here I've labeled the x- and y-axes as height and weight.

What does clustering mean? It's the process of organizing the objects, or the points, into groups whose members are similar in some way. A key issue is: what do we mean by similar? What's the metric we want to use? And we can see that here. If I tell you that, really, I want to cluster people by height, say, people are similar if they're the same height, then it's pretty clear how I should divide this, right, what my clusters should be. My clusters should probably be this group of shorter people and this group of taller people. If I tell you I'm interested in weight, then probably I want to cluster with the divider here, between the heavier people and the lighter people.
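One way to express that choice of metric is as a weighted distance; the feature weights below are my own illustration. Setting a feature's weight to zero makes similarity depend only on the other feature, which is exactly the height-only or weight-only clustering just described.

```python
import math

def weighted_distance(p, q, weights):
    """Euclidean distance between feature vectors p and q,
    with each feature dimension scaled by a weight."""
    return math.sqrt(sum(w * (a - b) ** 2
                         for a, b, w in zip(p, q, weights)))

alice = (1.6, 90)  # (height in meters, weight in kilograms)
bob   = (1.9, 90)
carol = (1.6, 60)

# Caring only about height: Alice is far from Bob, identical to Carol.
print(weighted_distance(alice, bob, (1, 0)))    # ≈ 0.3
print(weighted_distance(alice, carol, (1, 0)))  # 0.0
# Caring only about weight reverses which pair is similar.
print(weighted_distance(alice, bob, (0, 1)))    # 0.0
```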
428 00:26:18,950 --> 00:26:23,670 Or if I say well, I'm interested in some combination 429 00:26:23,670 --> 00:26:27,090 of those two, then maybe I'll get four clusters as I 430 00:26:27,090 --> 00:26:28,340 discussed before. 431 00:26:28,340 --> 00:26:35,840 432 00:26:35,840 --> 00:26:38,680 Clustering algorithms are used all over the place. 433 00:26:38,680 --> 00:26:43,800 For example, in marketing, they're used to find groups of 434 00:26:43,800 --> 00:26:47,910 customers with similar behavior. 435 00:26:47,910 --> 00:26:52,380 Walmart is famous for using that clustering to find that. 436 00:26:52,380 --> 00:26:56,670 They did a clustering to determine when people bought 437 00:26:56,670 --> 00:26:58,010 the same thing. 438 00:26:58,010 --> 00:27:00,240 And then they would rearrange their shelves to encourage 439 00:27:00,240 --> 00:27:02,230 people to buy things. 440 00:27:02,230 --> 00:27:05,360 And sort of the most famous example they discovered was 441 00:27:05,360 --> 00:27:08,170 there was a strong correlation between people between people 442 00:27:08,170 --> 00:27:11,790 who bought beer and people who bought diapers. 443 00:27:11,790 --> 00:27:13,630 And so there was a period where if you walked in a 444 00:27:13,630 --> 00:27:16,160 Walmart store, you would find the beer and the diapers next 445 00:27:16,160 --> 00:27:17,550 to each other. 446 00:27:17,550 --> 00:27:19,720 And I leave it to you to speculate on 447 00:27:19,720 --> 00:27:21,420 why that was true. 448 00:27:21,420 --> 00:27:23,215 It just happened to be true in Walmart. 449 00:27:23,215 --> 00:27:26,170 450 00:27:26,170 --> 00:27:30,600 Amazon uses clustering to find people who like similar books. 451 00:27:30,600 --> 00:27:33,820 So every time you buy a book on Amazon, they're running a 452 00:27:33,820 --> 00:27:37,090 clustering algorithm to find out who looks like you. 453 00:27:37,090 --> 00:27:39,530 Said, oh, this person looks just like you. 
454 00:27:39,530 --> 00:27:42,130 So if they buy a book, maybe you'll get an email suggesting 455 00:27:42,130 --> 00:27:46,430 you buy that book, or a recommendation the next time you log into Amazon. 456 00:27:46,430 --> 00:27:48,800 Or when you look at a book, they tell you here are some 457 00:27:48,800 --> 00:27:50,400 similar books. 458 00:27:50,400 --> 00:27:52,670 And then they've done a clustering to group books as 459 00:27:52,670 --> 00:27:56,380 similar based on buying habits. 460 00:27:56,380 --> 00:28:01,550 Netflix uses that to recommend movies, et cetera. 461 00:28:01,550 --> 00:28:07,420 Biologists spend a lot of time these days doing clustering. 462 00:28:07,420 --> 00:28:09,630 They classify plants or animals 463 00:28:09,630 --> 00:28:10,780 based on their features. 464 00:28:10,780 --> 00:28:14,890 We'll shortly see an example of that, as in right after 465 00:28:14,890 --> 00:28:16,760 Patriots' Day. 466 00:28:16,760 --> 00:28:19,640 But they also use it a lot in genetics. 467 00:28:19,640 --> 00:28:24,140 So clustering is used to try to find groups of 468 00:28:24,140 --> 00:28:27,080 genes that look alike. 469 00:28:27,080 --> 00:28:30,930 Insurance companies use that to decide how much to charge 470 00:28:30,930 --> 00:28:33,340 you for your automobile insurance. 471 00:28:33,340 --> 00:28:36,990 They cluster drivers based upon-- and use that to predict 472 00:28:36,990 --> 00:28:40,420 who's going to have an accident. 473 00:28:40,420 --> 00:28:45,830 Document classification on the web uses it all the time. 474 00:28:45,830 --> 00:28:47,200 It's used a lot in medicine. 475 00:28:47,200 --> 00:28:49,580 Just used all over the place. 476 00:28:49,580 --> 00:28:51,650 So what is it exactly? 477 00:28:51,650 --> 00:28:56,200 Well, the nice thing is we can define it very 478 00:28:56,200 --> 00:28:59,840 straightforwardly as an optimization problem. 
479 00:28:59,840 --> 00:29:03,670 And so we can ask what properties does a good 480 00:29:03,670 --> 00:29:04,920 clustering have? 481 00:29:04,920 --> 00:29:07,610 482 00:29:07,610 --> 00:29:16,900 Well, it should have low intra-cluster dissimilarity. 483 00:29:16,900 --> 00:29:26,100 484 00:29:26,100 --> 00:29:29,600 So in a good clustering, all of the points in the same 485 00:29:29,600 --> 00:29:34,060 cluster should be similar, by whatever metric you're using 486 00:29:34,060 --> 00:29:35,800 for similarity. 487 00:29:35,800 --> 00:29:38,910 As we'll see, there are a lot of choices there. 488 00:29:38,910 --> 00:29:41,960 But that's not enough. 489 00:29:41,960 --> 00:29:55,300 We'd also like to have high inter-cluster dissimilarity. 490 00:29:55,300 --> 00:29:57,790 So we'd like the points within a cluster to be a lot like 491 00:29:57,790 --> 00:29:58,600 each other. 492 00:29:58,600 --> 00:30:01,730 But if points are in different clusters, we'd like them to be 493 00:30:01,730 --> 00:30:04,900 quite different from each other. 494 00:30:04,900 --> 00:30:07,894 That tells us that we have a good cluster. 495 00:30:07,894 --> 00:30:09,570 All right, let's look at it. 496 00:30:09,570 --> 00:30:18,640 497 00:30:18,640 --> 00:30:22,160 How might we model dissimilarity? 498 00:30:22,160 --> 00:30:26,760 Well, using a concept we've already seen-- variance. 499 00:30:26,760 --> 00:30:37,940 So we can talk about the variance of some cluster C as 500 00:30:37,940 --> 00:30:45,775 equal to the sum of all elements x in C, of the mean 501 00:30:45,775 --> 00:30:53,210 of C minus x squared. 502 00:30:53,210 --> 00:30:56,310 Or maybe we can take the square root of it, if we want. 503 00:30:56,310 --> 00:30:59,990 But it's exactly the idea we've seen before, right? 504 00:30:59,990 --> 00:31:02,610 Then we say what's the average value of the cluster? 505 00:31:02,610 --> 00:31:06,000 And then we look at how far is each point from the average. 
506 00:31:06,000 --> 00:31:07,070 We sum them. 507 00:31:07,070 --> 00:31:09,140 And that tells us how much variance we 508 00:31:09,140 --> 00:31:10,770 have within the cluster. 509 00:31:10,770 --> 00:31:13,810 510 00:31:13,810 --> 00:31:15,060 Make sense? 511 00:31:15,060 --> 00:31:17,560 512 00:31:17,560 --> 00:31:19,680 So that's variance. 513 00:31:19,680 --> 00:31:22,660 514 00:31:22,660 --> 00:31:26,190 So we can use that to talk about how similar or 515 00:31:26,190 --> 00:31:28,430 dissimilar the elements in the cluster are. 516 00:31:28,430 --> 00:31:31,860 517 00:31:31,860 --> 00:31:36,490 We can use the same idea to compare points in separate 518 00:31:36,490 --> 00:31:40,660 clusters and compute various different ways-- and we'll 519 00:31:40,660 --> 00:31:42,300 look at different ways-- 520 00:31:42,300 --> 00:31:46,510 to look at the distance between clusters. 521 00:31:46,510 --> 00:31:50,890 So combining these two things, we could get, say, a metric 522 00:31:50,890 --> 00:31:52,140 we'll call badness-- 523 00:31:52,140 --> 00:31:55,000 524 00:31:55,000 --> 00:31:57,480 not a technical word. 525 00:31:57,480 --> 00:32:03,470 And now I'll ask the question: is the optimization problem 526 00:32:03,470 --> 00:32:05,690 that we're solving in clustering 527 00:32:05,690 --> 00:32:07,540 finding a set of clusters-- 528 00:32:07,540 --> 00:32:09,230 capital C-- 529 00:32:09,230 --> 00:32:13,120 such that badness of that set of clusters is minimized? 530 00:32:13,120 --> 00:32:16,080 531 00:32:16,080 --> 00:32:19,600 Is that a sufficient definition of the problem 532 00:32:19,600 --> 00:32:22,710 we're trying to solve? 533 00:32:22,710 --> 00:32:25,760 Find a set of clusters C, such that the 534 00:32:25,760 --> 00:32:29,130 badness of C is minimized. 535 00:32:29,130 --> 00:32:32,835 Is that good enough? 536 00:32:32,835 --> 00:32:33,310 AUDIENCE: No. 537 00:32:33,310 --> 00:32:35,010 PROFESSOR: No, why not? 
538 00:32:35,010 --> 00:32:37,007 AUDIENCE: Just imagine a case where you view cluster-- if 539 00:32:37,007 --> 00:32:38,855 you make a single cluster, every cluster has 540 00:32:38,855 --> 00:32:40,170 one element in it. 541 00:32:40,170 --> 00:32:41,700 And the variance is 0. 542 00:32:41,700 --> 00:32:42,680 PROFESSOR: Exactly. 543 00:32:42,680 --> 00:32:46,890 So that has a trivial solution, which is probably 544 00:32:46,890 --> 00:32:50,330 not the one we want, of putting each 545 00:32:50,330 --> 00:32:54,040 point in its own cluster. 546 00:32:54,040 --> 00:32:54,810 Badness-- 547 00:32:54,810 --> 00:32:55,520 it won't be bad. 548 00:32:55,520 --> 00:32:58,790 It'll be a perfect clustering in some sense. 549 00:32:58,790 --> 00:33:02,640 But it doesn't do us any good really. 550 00:33:02,640 --> 00:33:05,570 So what do we do to fix that? 551 00:33:05,570 --> 00:33:07,270 What do we usually do when we formulate an 552 00:33:07,270 --> 00:33:08,360 optimization problem? 553 00:33:08,360 --> 00:33:09,990 What's missing? 554 00:33:09,990 --> 00:33:11,620 I've given you the objective function. 555 00:33:11,620 --> 00:33:13,620 What have I not given you? 556 00:33:13,620 --> 00:33:14,100 AUDIENCE: Constraints. 557 00:33:14,100 --> 00:33:16,020 PROFESSOR: A constraint. 558 00:33:16,020 --> 00:33:22,620 So we need to add some constraint here that will 559 00:33:22,620 --> 00:33:27,230 prevent us from finding a trivial solution. 560 00:33:27,230 --> 00:33:31,700 So what kind of constraints might we look at? 561 00:33:31,700 --> 00:33:33,940 There are different ways of doing it. 562 00:33:33,940 --> 00:33:36,950 563 00:33:36,950 --> 00:33:38,380 A couple of usual ones: 564 00:33:38,380 --> 00:33:43,520 sometimes you might have as a constraint the maximum number 565 00:33:43,520 --> 00:33:44,770 of clusters. 
566 00:33:44,770 --> 00:33:46,965 567 00:33:46,965 --> 00:33:52,530 Say, all right, cluster my data, but I want at most K 568 00:33:52,530 --> 00:33:53,460 clusters -- 569 00:33:53,460 --> 00:33:56,160 10 clusters. 570 00:33:56,160 --> 00:33:58,600 That would be my constraint, like the weight for the 571 00:33:58,600 --> 00:34:01,680 knapsack problem. 572 00:34:01,680 --> 00:34:07,740 Or maybe I'll want to put something on the maximum 573 00:34:07,740 --> 00:34:10,469 distance between clusters. 574 00:34:10,469 --> 00:34:13,110 So I don't want the distance between any two clusters to be 575 00:34:13,110 --> 00:34:14,360 more than something. 576 00:34:14,360 --> 00:34:19,540 577 00:34:19,540 --> 00:34:23,420 In general, solving this optimization problem is 578 00:34:23,420 --> 00:34:25,690 computationally prohibitive. 579 00:34:25,690 --> 00:34:30,190 So once again, in practice, what people typically resort 580 00:34:30,190 --> 00:34:32,699 to is greedy algorithms. 581 00:34:32,699 --> 00:34:36,600 And I want to look at two kinds of greedy algorithms, 582 00:34:36,600 --> 00:34:41,020 probably the two most common approaches to clustering. 583 00:34:41,020 --> 00:34:42,270 One is called k-means. 584 00:34:42,270 --> 00:34:47,145 585 00:34:47,145 --> 00:34:52,385 In k-means clustering, you say I want exactly k clusters. 586 00:34:52,385 --> 00:34:55,000 587 00:34:55,000 --> 00:34:57,950 And find the best k clustering. 588 00:34:57,950 --> 00:35:00,070 We'll talk about how it does that. 589 00:35:00,070 --> 00:35:03,170 And again, it's not guaranteed to find the best. 590 00:35:03,170 --> 00:35:05,000 And the other is hierarchical clustering. 591 00:35:05,000 --> 00:35:14,210 592 00:35:14,210 --> 00:35:15,790 We'll come back to that shortly. 593 00:35:15,790 --> 00:35:22,750 Both are simple to understand and widely used in practice. 594 00:35:22,750 --> 00:35:28,380 So let's first talk about how we do this. 
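Before turning to the algorithms, the variance-based objective just described on the board can be sketched in Python. The `badness` function here is one plausible way of combining per-cluster variances into a single score; the lecture deliberately leaves the exact combination open, so treat it as an illustrative assumption.

```python
def variance(cluster):
    """The board's definition: sum over all x in cluster C of
    (mean(C) - x) squared."""
    mean = sum(cluster) / len(cluster)
    return sum((mean - x) ** 2 for x in cluster)

def badness(clustering):
    """One plausible combination (an assumption, not the lecture's
    exact definition): total intra-cluster variance over all clusters."""
    return sum(variance(c) for c in clustering)

# Tight clusters give low badness; mixed-up clusters give higher badness.
print(badness([[180, 175, 178], [150, 152]]))
print(badness([[180, 150], [175, 152, 178]]))

# The trivial solution from the discussion: every point in its own
# cluster drives badness to 0 -- hence the need for a constraint,
# such as allowing at most k clusters.
print(badness([[180], [175], [178], [150], [152]]))  # 0
```

Note that without a constraint, minimizing this objective always picks the trivial all-singletons clustering, exactly as the audience member pointed out.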
595 00:35:28,380 --> 00:35:30,555 Let's first look at hierarchical clustering. 596 00:35:30,555 --> 00:35:38,200 597 00:35:38,200 --> 00:35:43,060 So we have a set of n items to be clustered. 598 00:35:43,060 --> 00:35:54,670 And let's assume we have an n by n distance matrix that 599 00:35:54,670 --> 00:35:59,790 tells me for each pair of items how far they are from 600 00:35:59,790 --> 00:36:01,040 each other. 601 00:36:01,040 --> 00:36:03,140 602 00:36:03,140 --> 00:36:05,970 So we can look at an example. 603 00:36:05,970 --> 00:36:10,400 So here's an n by n distance matrix for the airline 604 00:36:10,400 --> 00:36:14,230 distance between some cities in the United States. 605 00:36:14,230 --> 00:36:17,480 The distance from Boston to Boston is 0 miles. 606 00:36:17,480 --> 00:36:20,450 Distance from New York is 206. 607 00:36:20,450 --> 00:36:23,420 The distance from Chicago to San Francisco 608 00:36:23,420 --> 00:36:26,480 is 2,142, et cetera. 609 00:36:26,480 --> 00:36:27,370 All right? 610 00:36:27,370 --> 00:36:30,020 So I have my n by n distance matrix there. 611 00:36:30,020 --> 00:36:33,200 612 00:36:33,200 --> 00:36:37,360 Now let's go through how hierarchical clustering would 613 00:36:37,360 --> 00:36:40,540 relate these things to each other. 614 00:36:40,540 --> 00:36:51,490 So we start by assigning each item to its own cluster. 615 00:36:51,490 --> 00:36:59,910 616 00:36:59,910 --> 00:37:03,710 So if we have n items, we now have n clusters. 617 00:37:03,710 --> 00:37:08,250 618 00:37:08,250 --> 00:37:09,070 All right? 619 00:37:09,070 --> 00:37:13,430 That's the trivial solution that you suggested before. 620 00:37:13,430 --> 00:37:34,360 The next step is to find the most similar pair of clusters 621 00:37:34,360 --> 00:37:35,610 and merge them. 622 00:37:35,610 --> 00:37:42,310 623 00:37:42,310 --> 00:37:49,150 So if we look here and we just-- we start, we'll have 624 00:37:49,150 --> 00:37:52,050 six clusters, one for each city. 
625 00:37:52,050 --> 00:37:56,185 And we would merge the two most similar, which I guess in 626 00:37:56,185 --> 00:37:59,450 this case is New York and Boston. 627 00:37:59,450 --> 00:38:01,920 Hard to believe that those are the most similar cities. 628 00:38:01,920 --> 00:38:05,800 But at least by this distance metric they're the closest. 629 00:38:05,800 --> 00:38:07,450 So we would merge those two. 630 00:38:07,450 --> 00:38:14,530 631 00:38:14,530 --> 00:38:24,680 And then you just continue the process in principle, until 632 00:38:24,680 --> 00:38:27,420 all items are in one cluster. 633 00:38:27,420 --> 00:38:30,810 So now you have a whole hierarchy of clusters. 634 00:38:30,810 --> 00:38:33,920 And you can cut it off where you want. 635 00:38:33,920 --> 00:38:36,020 If you want to have six clusters, you could look at 636 00:38:36,020 --> 00:38:37,480 where in the hierarchy you have six. 637 00:38:37,480 --> 00:38:41,160 You can look where you have two, where you have three. 638 00:38:41,160 --> 00:38:43,610 Of course, you don't have to go all the way to finish it if 639 00:38:43,610 --> 00:38:45,670 you don't want to. 640 00:38:45,670 --> 00:38:48,450 This kind of hierarchical clustering is called 641 00:38:48,450 --> 00:38:49,700 agglomerative. 642 00:38:49,700 --> 00:38:57,880 643 00:38:57,880 --> 00:38:58,420 Why? 644 00:38:58,420 --> 00:39:00,850 Well, because we're combining things. 645 00:39:00,850 --> 00:39:02,100 We're agglomerating them. 646 00:39:02,100 --> 00:39:07,890 647 00:39:07,890 --> 00:39:09,310 So this is pretty 648 00:39:09,310 --> 00:39:12,025 straightforward, except for step (2). 649 00:39:12,025 --> 00:39:15,690 650 00:39:15,690 --> 00:39:22,460 The complication in step (2) is we have to define what it 651 00:39:22,460 --> 00:39:26,010 means to find the two most similar clusters. 
652 00:39:26,010 --> 00:39:28,560 653 00:39:28,560 --> 00:39:32,520 Now it's pretty easy when the clusters each contain one 654 00:39:32,520 --> 00:39:36,590 element, because, well, we have our metric-- in this 655 00:39:36,590 --> 00:39:37,400 case, distance-- 656 00:39:37,400 --> 00:39:41,180 and we can just do that as I did. 657 00:39:41,180 --> 00:39:44,810 But it's not so obvious what you do when they 658 00:39:44,810 --> 00:39:46,740 have multiple elements. 659 00:39:46,740 --> 00:39:49,940 660 00:39:49,940 --> 00:39:55,250 And in fact, different metrics can be used to get different 661 00:39:55,250 --> 00:39:56,940 properties. 662 00:39:56,940 --> 00:39:58,540 So I want to talk about some of the 663 00:39:58,540 --> 00:40:00,690 metrics we use for that. 664 00:40:00,690 --> 00:40:03,645 These are typically called linkage criteria. 665 00:40:03,645 --> 00:40:13,990 666 00:40:13,990 --> 00:40:18,380 So one popular one is what's called single linkage. 667 00:40:18,380 --> 00:40:24,170 668 00:40:24,170 --> 00:40:25,190 It's also called 669 00:40:25,190 --> 00:40:28,040 connectedness, or minimum method. 670 00:40:28,040 --> 00:40:31,160 In this, we consider the distance between a pair of 671 00:40:31,160 --> 00:40:36,680 clusters to be equal to the shortest distance from any 672 00:40:36,680 --> 00:40:37,975 member to any other member. 673 00:40:37,975 --> 00:41:00,220 674 00:41:00,220 --> 00:41:04,080 So we take the two points in each cluster that are closest 675 00:41:04,080 --> 00:41:06,780 to each other and say that's the distance 676 00:41:06,780 --> 00:41:08,030 between the two clusters. 
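The agglomerative procedure just described — start with n singleton clusters, then repeatedly merge the closest pair under the chosen linkage criterion — can be sketched as follows, using single linkage. The four-item distance matrix here is a small hypothetical one for illustration, not the city data from the slides.

```python
from itertools import combinations

def single_linkage(c1, c2, dist):
    """Distance between two clusters = shortest distance from any
    member of c1 to any member of c2 (the 'minimum' method)."""
    return min(dist[frozenset((a, b))] for a in c1 for b in c2)

def agglomerate(items, dist, linkage=single_linkage):
    """Start with each item in its own cluster; repeatedly merge the
    closest pair, recording every level of the hierarchy so it can
    be cut off wherever you want."""
    clusters = [frozenset([x]) for x in items]
    levels = [list(clusters)]
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: linkage(pair[0], pair[1], dist))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        levels.append(list(clusters))
    return levels

# Hypothetical symmetric distance matrix over four items.
d = {frozenset(p): v for p, v in {
    ('A', 'B'): 2, ('A', 'C'): 9, ('A', 'D'): 10,
    ('B', 'C'): 8, ('B', 'D'): 9, ('C', 'D'): 3,
}.items()}

# A and B merge first (distance 2), then C and D (distance 3),
# then everything collapses into one cluster.
for level in agglomerate('ABCD', d):
    print(sorted(sorted(c) for c in level))
```

Swapping in a `max(...)` for the `min(...)` over member pairs would give complete linkage instead; averaging the pairwise distances would give average linkage.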
677 00:41:08,030 --> 00:41:14,490 678 00:41:14,490 --> 00:41:17,335 People also use something called complete linkage-- 679 00:41:17,335 --> 00:41:21,376 680 00:41:21,376 --> 00:41:25,420 It's also called diameter or maximum-- 681 00:41:25,420 --> 00:41:30,010 where we consider the distance between any two clusters to be 682 00:41:30,010 --> 00:41:32,710 the distance between the points that are furthest from 683 00:41:32,710 --> 00:41:33,960 each other. 684 00:41:33,960 --> 00:41:40,060 685 00:41:40,060 --> 00:41:44,940 So in one case, essentially, single linkage was looking at 686 00:41:44,940 --> 00:41:47,150 the best case. 687 00:41:47,150 --> 00:41:49,390 Complete-- 688 00:41:49,390 --> 00:41:50,870 in English, not French-- 689 00:41:50,870 --> 00:41:53,450 is looking at the worst case. 690 00:41:53,450 --> 00:41:58,110 691 00:41:58,110 --> 00:42:01,870 And you won't be surprised to hear that you could also look 692 00:42:01,870 --> 00:42:11,175 at the average case, where you take all of the distances. 693 00:42:11,175 --> 00:42:13,950 694 00:42:13,950 --> 00:42:16,160 So you take all of the pairwise things. 695 00:42:16,160 --> 00:42:17,020 You add them up. 696 00:42:17,020 --> 00:42:18,860 You take the average. 697 00:42:18,860 --> 00:42:20,460 You can also take the 698 00:42:20,460 --> 00:42:23,640 median, if you want, instead. 699 00:42:23,640 --> 00:42:27,130 None of these is necessarily best. 700 00:42:27,130 --> 00:42:29,790 But they do give you different answers. 701 00:42:29,790 --> 00:42:32,605 And so I want to look at that now with our example here. 702 00:42:32,605 --> 00:42:35,370 703 00:42:35,370 --> 00:42:38,500 So let's look at it and run it. 704 00:42:38,500 --> 00:42:41,740 So the first step is independent of what linkage 705 00:42:41,740 --> 00:42:42,850 we're using. 706 00:42:42,850 --> 00:42:46,580 We get these six clusters. 707 00:42:46,580 --> 00:42:51,790 All right, now let's look at the second step. 
708 00:42:51,790 --> 00:42:55,030 Well, also pretty simple since we only have one 709 00:42:55,030 --> 00:42:57,400 element in each one. 710 00:42:57,400 --> 00:42:59,170 We're going to get that clustering. 711 00:42:59,170 --> 00:43:02,750 712 00:43:02,750 --> 00:43:03,300 All right. 713 00:43:03,300 --> 00:43:08,960 Now, what about the next step? 714 00:43:08,960 --> 00:43:13,750 What do I get if I'm using the minimal 715 00:43:13,750 --> 00:43:15,350 single linkage distance? 716 00:43:15,350 --> 00:43:16,895 What gets merged here? 717 00:43:16,895 --> 00:43:23,640 718 00:43:23,640 --> 00:43:24,980 Somebody? 719 00:43:24,980 --> 00:43:27,070 AUDIENCE: Boston, New York and Chicago. 720 00:43:27,070 --> 00:43:28,550 PROFESSOR: Boston, New York, and Chicago. 721 00:43:28,550 --> 00:43:31,910 722 00:43:31,910 --> 00:43:35,495 And it turns out we'll get the same thing if we use other 723 00:43:35,495 --> 00:43:37,270 linkages in this case. 724 00:43:37,270 --> 00:43:38,745 Let's continue to the next step. 725 00:43:38,745 --> 00:43:43,060 726 00:43:43,060 --> 00:43:45,935 Now we'll end up merging San Francisco and Seattle. 727 00:43:45,935 --> 00:43:52,840 728 00:43:52,840 --> 00:43:55,840 Now we get a difference. 729 00:43:55,840 --> 00:43:57,540 What does the red represent and what 730 00:43:57,540 --> 00:43:58,520 does the blue represent? 731 00:43:58,520 --> 00:44:00,980 Which linkage criteria? 732 00:44:00,980 --> 00:44:04,590 We're saying, we could either merge Denver with Boston, New 733 00:44:04,590 --> 00:44:06,480 York, and Chicago. 734 00:44:06,480 --> 00:44:10,185 Or we could merge Denver with San Francisco and Seattle. 735 00:44:10,185 --> 00:44:14,920 736 00:44:14,920 --> 00:44:16,790 Which is which? 737 00:44:16,790 --> 00:44:20,660 Which linkage criterion has put Denver in which cluster? 738 00:44:20,660 --> 00:44:30,910 739 00:44:30,910 --> 00:44:33,640 Well, suppose we're using single linkage. 
740 00:44:33,640 --> 00:44:37,520 Where are we getting it from? 741 00:44:37,520 --> 00:44:39,460 AUDIENCE: Boston, New York and Chicago? 742 00:44:39,460 --> 00:44:39,945 PROFESSOR: Yes. 743 00:44:39,945 --> 00:44:42,370 Because it's not so far from Chicago. 744 00:44:42,370 --> 00:44:45,924 Even though it's pretty far from Boston or New York. 745 00:44:45,924 --> 00:44:50,744 But if we use average linkage, we see on average, it's closer 746 00:44:50,744 --> 00:44:55,180 to San Francisco and Seattle than it is to the average of 747 00:44:55,180 --> 00:44:57,300 Boston, New York, or Chicago. 748 00:44:57,300 --> 00:45:00,920 So we get a different answer. 749 00:45:00,920 --> 00:45:03,860 And then finally, at the last step, 750 00:45:03,860 --> 00:45:08,270 everything gets merged together. 751 00:45:08,270 --> 00:45:15,940 So you can see, in this case, without having labels, we have 752 00:45:15,940 --> 00:45:22,200 used a feature to produce things and, say, if we wanted 753 00:45:22,200 --> 00:45:25,840 to have three clusters, we would maybe stop here. 754 00:45:25,840 --> 00:45:28,300 And we'd say, all right, these things are one cluster. 755 00:45:28,300 --> 00:45:29,170 This is a cluster. 756 00:45:29,170 --> 00:45:31,340 And this is a cluster. 757 00:45:31,340 --> 00:45:33,640 And that's not a bad geographical clustering, 758 00:45:33,640 --> 00:45:36,990 actually, for deciding how to relate these 759 00:45:36,990 --> 00:45:38,240 things to each other. 760 00:45:38,240 --> 00:45:40,890 761 00:45:40,890 --> 00:45:43,160 This technique is used a lot. 762 00:45:43,160 --> 00:45:46,260 It does have some weaknesses. 763 00:45:46,260 --> 00:45:50,050 One weakness is it's very time consuming. 764 00:45:50,050 --> 00:45:53,310 It doesn't scale well. 765 00:45:53,310 --> 00:45:58,370 The complexity is at least order n-squared, where n is 766 00:45:58,370 --> 00:46:03,900 the number of points to be clustered. 
767 00:46:03,900 --> 00:46:06,060 And in fact, in many implementations, it's worse 768 00:46:06,060 --> 00:46:08,230 than n-squared. 769 00:46:08,230 --> 00:46:13,010 And of course, it doesn't necessarily find the optimal 770 00:46:13,010 --> 00:46:17,240 clustering, even given these criteria. 771 00:46:17,240 --> 00:46:21,190 It might never at any level have the optimal clustering, 772 00:46:21,190 --> 00:46:24,770 because, again, at each step, it's making a locally optimal 773 00:46:24,770 --> 00:46:28,435 decision, not guaranteed to find the best solution. 774 00:46:28,435 --> 00:46:33,040 775 00:46:33,040 --> 00:46:38,950 I should point out that a big issue in getting these 776 00:46:38,950 --> 00:46:42,940 clusters was 777 00:46:42,940 --> 00:46:46,920 my choice of features. 778 00:46:46,920 --> 00:46:50,580 And this is something we're going to come back to in 779 00:46:50,580 --> 00:46:54,700 spades, because I actually think it is the most important 780 00:46:54,700 --> 00:46:56,980 issue in machine learning-- 781 00:46:56,980 --> 00:47:01,310 if we're going to say which points are similar to each 782 00:47:01,310 --> 00:47:06,840 other, we need to understand our feature space. 783 00:47:06,840 --> 00:47:10,500 So, for example, the feature I'm using here 784 00:47:10,500 --> 00:47:14,340 is distance by air. 785 00:47:14,340 --> 00:47:19,010 Suppose, instead, I added distance by air and distance 786 00:47:19,010 --> 00:47:22,900 by road and distance by train. 787 00:47:22,900 --> 00:47:25,900 Well, particularly given the sparsity of railroads in this 788 00:47:25,900 --> 00:47:29,180 country, we might get very different clustering, 789 00:47:29,180 --> 00:47:33,090 depending upon where the trains ran. 790 00:47:33,090 --> 00:47:36,590 And suppose I throw in a totally different feature like 791 00:47:36,590 --> 00:47:39,570 population. 
792 00:47:39,570 --> 00:47:42,040 Well, I might get another different clustering, 793 00:47:42,040 --> 00:47:43,395 depending on how I use that. 794 00:47:43,395 --> 00:47:46,780 795 00:47:46,780 --> 00:47:54,910 What we typically need to do in these situations, dealing 796 00:47:54,910 --> 00:47:57,660 with multi-dimensional data-- and most data is 797 00:47:57,660 --> 00:47:59,230 multi-dimensional-- 798 00:47:59,230 --> 00:48:06,790 is we construct something called a feature vector that 799 00:48:06,790 --> 00:48:10,830 incorporates multiple features. 800 00:48:10,830 --> 00:48:14,870 So we might have for each city-- 801 00:48:14,870 --> 00:48:16,805 we'll just take something like the distance. 802 00:48:16,805 --> 00:48:22,600 803 00:48:22,600 --> 00:48:26,720 Or let's say, instead of distance, we'll compute the 804 00:48:26,720 --> 00:48:33,810 distance by having for each city its GPS coordinates, 805 00:48:33,810 --> 00:48:39,110 where it is on the globe, and its population. 806 00:48:39,110 --> 00:48:42,840 And let's say that's how we define a city. 807 00:48:42,840 --> 00:48:45,070 And that would be our feature vector. 808 00:48:45,070 --> 00:48:47,560 And then we would cluster it, say, using hierarchical 809 00:48:47,560 --> 00:48:51,140 clustering to determine which cities are most like which 810 00:48:51,140 --> 00:48:53,340 other cities. 811 00:48:53,340 --> 00:48:56,280 Well, it's a little bit complicated. 812 00:48:56,280 --> 00:49:01,760 I have to ask how do I compare feature vectors? 813 00:49:01,760 --> 00:49:04,990 What distance metric do I use there? 814 00:49:04,990 --> 00:49:09,600 Do I get confused that GPS coordinates and populations 815 00:49:09,600 --> 00:49:12,700 are essentially unrelated? 816 00:49:12,700 --> 00:49:16,240 And I wouldn't like to compare those to each other. 
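The worry just raised — that GPS coordinates and population live on wildly different scales — is easy to see if you put them in one feature vector and apply a plain Euclidean distance. The city values below are hypothetical round numbers for illustration only.

```python
import math

def euclidean(v, w):
    """Plain Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

# Hypothetical feature vectors: (latitude, longitude, population).
city_a = (42.4, -71.1, 650_000)
city_b = (40.7, -74.0, 8_400_000)   # geographically nearby, huge population
city_c = (47.6, -122.3, 750_000)    # geographically far, similar population

# Population swamps the coordinates: the far-away city with a similar
# population ends up "closer" than the geographically nearby one.
print(euclidean(city_a, city_b) > euclidean(city_a, city_c))  # True
```

This is exactly why comparing unrelated features without some kind of scaling or weighting is dangerous, and it motivates the feature-space discussion the lecture is heading toward.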
817 00:49:16,240 --> 00:49:19,240 Lots of issues there, and that's what we're going to 818 00:49:19,240 --> 00:49:22,500 talk about when we come back from Patriots' Day-- 819 00:49:22,500 --> 00:49:26,440 how in real-world problems, we go from the large 820 00:49:26,440 --> 00:49:30,430 number of features associated with objects or things in the 821 00:49:30,430 --> 00:49:34,680 real world to feature vectors that allow us to automatically 822 00:49:34,680 --> 00:49:37,830 deduce which things are quote "most similar" 823 00:49:37,830 --> 00:49:39,080 to which other things. 824 00:49:39,080 --> 00:49:42,485