1
00:00:01,550 --> 00:00:03,920
The following content is
provided under a Creative
2
00:00:03,920 --> 00:00:05,310
Commons license.
3
00:00:05,310 --> 00:00:07,520
Your support will help
MIT OpenCourseWare
4
00:00:07,520 --> 00:00:11,610
continue to offer high quality
educational resources for free.
5
00:00:11,610 --> 00:00:14,180
To make a donation or to
view additional materials
6
00:00:14,180 --> 00:00:17,795
from hundreds of MIT courses,
visit MIT OpenCourseWare
7
00:00:17,795 --> 00:00:19,026
at ocw.mit.edu.
8
00:00:22,800 --> 00:00:26,180
GILBERT STRANG: Well, OK,
I am happy to be back,
9
00:00:26,180 --> 00:00:29,910
and I am really happy
about the project
10
00:00:29,910 --> 00:00:32,229
proposals that are coming in.
11
00:00:32,229 --> 00:00:36,840
This is like, OK, this is really
a good part of the course.
12
00:00:36,840 --> 00:00:42,360
And so keep them coming, and
I'm happy to give whatever
13
00:00:42,360 --> 00:00:45,000
feedback I can on
those proposals,
14
00:00:45,000 --> 00:00:49,120
and do make a start there.
15
00:00:49,120 --> 00:00:53,100
They're really good, and
if some are completed
16
00:00:53,100 --> 00:00:55,230
before the end of
the semester and we
17
00:00:55,230 --> 00:01:00,860
can to offer you a
chance to report on them,
18
00:01:00,860 --> 00:01:04,019
that that's good too.
19
00:01:04,019 --> 00:01:07,930
So well done with
those proposals.
20
00:01:07,930 --> 00:01:12,830
So today, I'm
jumping to part six.
21
00:01:12,830 --> 00:01:15,870
So part six and part
seven are optimization
22
00:01:15,870 --> 00:01:17,910
which is the fundamental
algorithm that
23
00:01:17,910 --> 00:01:19,980
goes into deep learning.
24
00:01:19,980 --> 00:01:22,140
So we've got to start
with optimization.
25
00:01:22,140 --> 00:01:26,010
Everybody has to
get that picture,
26
00:01:26,010 --> 00:01:29,580
and then part seven
will be the structure
27
00:01:29,580 --> 00:01:33,660
of CNNs, Convolution
Neural Nets,
28
00:01:33,660 --> 00:01:37,560
and all kinds of applications.
29
00:01:37,560 --> 00:01:41,430
And so can we start
with optimization?
30
00:01:41,430 --> 00:01:44,370
So first, can I like
get the basic facts
31
00:01:44,370 --> 00:01:49,840
about three terms
of a Taylor series?
32
00:01:49,840 --> 00:01:51,030
So that's the typical.
33
00:01:51,030 --> 00:01:55,590
It's seldom that we would
go up to third derivatives
34
00:01:55,590 --> 00:01:56,970
in optimization.
35
00:01:56,970 --> 00:02:01,920
So that's the most useful
approximation to a function.
36
00:02:01,920 --> 00:02:03,720
Everybody recognizes it.
37
00:02:03,720 --> 00:02:08,729
Here, I'm thinking of
F as just one function,
38
00:02:08,729 --> 00:02:12,990
and x as just one
variable, but now
39
00:02:12,990 --> 00:02:18,180
I really want to go
to more variables.
40
00:02:18,180 --> 00:02:21,360
So what do I have
to change if F is
41
00:02:21,360 --> 00:02:26,790
a function of more variables?
42
00:02:26,790 --> 00:02:31,050
So now, I'm thinking of x as--
43
00:02:31,050 --> 00:02:32,730
well, now let me see.
44
00:02:37,670 --> 00:02:42,470
Yeah, I want n variables here.
45
00:02:42,470 --> 00:02:45,170
x is x1 up to xn.
46
00:02:48,830 --> 00:02:52,840
So just to get
the words straight
47
00:02:52,840 --> 00:02:56,800
so we can begin on
optimization, so what
48
00:02:56,800 --> 00:03:02,740
will be the similar step
so the function F at x--
49
00:03:02,740 --> 00:03:06,590
remember, x is n variables.
50
00:03:06,590 --> 00:03:07,330
OK?
51
00:03:07,330 --> 00:03:08,350
Now, what do I have?
52
00:03:08,350 --> 00:03:13,170
Delta x, so what's the
point about delta x now?
53
00:03:13,170 --> 00:03:17,500
It's a vector, delta
x1 to delta xn,
54
00:03:17,500 --> 00:03:20,190
and what about the
derivative of F?
55
00:03:20,190 --> 00:03:22,810
It's a vector too,
the derivative
56
00:03:22,810 --> 00:03:26,175
of F with respect to
x1, the derivative of F
57
00:03:26,175 --> 00:03:29,660
with respect to x2, and so on.
58
00:03:29,660 --> 00:03:32,420
What do I have to
change about that?
59
00:03:32,420 --> 00:03:36,310
I know those guys are vectors,
so it's their dot product.
60
00:03:36,310 --> 00:03:44,360
So it's delta x transpose
at vector times this dF/dx.
61
00:03:48,340 --> 00:03:52,620
So now I'm replacing this
by all the derivatives,
62
00:03:52,620 --> 00:03:58,040
and it's the gradient.
63
00:03:58,040 --> 00:04:03,820
So the gradient of F at
x is the derivatives--
64
00:04:07,257 --> 00:04:09,920
let's see.
65
00:04:09,920 --> 00:04:14,320
It's essential to get the
notation straight here.
66
00:04:14,320 --> 00:04:17,140
Yeah, so it'll be the
partial derivatives
67
00:04:17,140 --> 00:04:26,290
of the function F. So grad F
is the partial derivatives of F
68
00:04:26,290 --> 00:04:30,846
with respect to x1 down to
partial derivative with respect
69
00:04:30,846 --> 00:04:32,650
to xn.
70
00:04:32,650 --> 00:04:34,180
OK, good.
71
00:04:34,180 --> 00:04:37,490
That's the linear term, and
now what's the quadratic term?
72
00:04:37,490 --> 00:04:42,010
1/2, now delta x isn't
a scalar anymore.
73
00:04:42,010 --> 00:04:43,160
It's a vector.
74
00:04:43,160 --> 00:04:47,500
So I'm going to have
delta x transpose
75
00:04:47,500 --> 00:04:51,970
and a delta x, and
what goes in between
76
00:04:51,970 --> 00:04:55,240
is the second
derivatives, but I've
77
00:04:55,240 --> 00:04:57,460
got a function of n variables.
78
00:05:00,190 --> 00:05:04,240
So now, I have a matrix
of second derivatives,
79
00:05:04,240 --> 00:05:10,510
and I'll call it H. This is the
matrix of second derivatives,
80
00:05:10,510 --> 00:05:19,840
Hjk is the second derivative
of F with respect to xj and xk,
81
00:05:19,840 --> 00:05:23,360
and what's the
name for this guy?
82
00:05:23,360 --> 00:05:26,485
The Hessian, Hessian matrix.
83
00:05:30,350 --> 00:05:33,260
How the Hessians got into
this picture I don't know.
84
00:05:33,260 --> 00:05:34,960
The only Hessians
I know are the ones
85
00:05:34,960 --> 00:05:39,530
who fought in the
Revolutionary War for somebody.
86
00:05:39,530 --> 00:05:40,030
Who?
87
00:05:40,030 --> 00:05:42,220
Which side were they on?
88
00:05:42,220 --> 00:05:43,750
I think maybe the wrong side.
89
00:05:43,750 --> 00:05:47,470
The French were
on our side and--
90
00:05:47,470 --> 00:05:50,530
Anyway, Hessian
matrix, and what are
91
00:05:50,530 --> 00:05:52,345
the facts about that matrix?
92
00:05:52,345 --> 00:05:54,220
Well, the first fact is
that it's [INAUDIBLE]
93
00:05:54,220 --> 00:05:56,883
and the key fact
is it's symmetric.
94
00:06:00,200 --> 00:06:02,140
Yeah.
95
00:06:02,140 --> 00:06:07,420
OK, and again, it's
an approximation.
96
00:06:07,420 --> 00:06:12,880
And everybody recognizes
that if n is very large,
97
00:06:12,880 --> 00:06:16,360
and we have a function
of many variables.
98
00:06:16,360 --> 00:06:19,540
Then, we had n derivatives
to compute here,
99
00:06:19,540 --> 00:06:22,650
and about 1/2 n
squared derivatives.
100
00:06:22,650 --> 00:06:25,960
The 1/2 comes from the
symmetry, but the key point
101
00:06:25,960 --> 00:06:30,670
is the n squared derivatives
to compute there.
102
00:06:30,670 --> 00:06:36,240
So computing the
gradient is feasible
103
00:06:36,240 --> 00:06:40,050
if n is small or
moderately large.
104
00:06:40,050 --> 00:06:43,900
Actually, by using
automatic differentiation,
105
00:06:43,900 --> 00:06:46,050
the key idea of
back propagation,
106
00:06:46,050 --> 00:06:54,750
back prop, you can
speed up the computation
107
00:06:54,750 --> 00:06:56,820
of derivatives quite amazingly.
108
00:06:56,820 --> 00:07:02,400
But still for the size
of deep learning problems
109
00:07:02,400 --> 00:07:04,300
that's out of reach.
110
00:07:04,300 --> 00:07:05,340
OK.
111
00:07:05,340 --> 00:07:10,140
So that's the
picture, and then I
112
00:07:10,140 --> 00:07:16,170
will want to use this
to solve equations.
113
00:07:16,170 --> 00:07:25,140
There is a parallel
picture for a vector f.
114
00:07:25,140 --> 00:07:26,940
So now, this is a
vector function.
115
00:07:26,940 --> 00:07:39,210
This is f1 of x up to fn
of x, an x is x1 to xn.
116
00:07:42,030 --> 00:07:45,950
So I have n functions
of n variables,
117
00:07:45,950 --> 00:07:47,700
n functions of n variables.
118
00:07:47,700 --> 00:07:51,380
Well, that's exactly what
I have in the gradient.
119
00:07:54,200 --> 00:07:57,230
Think of these two as
parallel, the parallel
120
00:07:57,230 --> 00:08:02,230
being f corresponds
to the gradient of F,
121
00:08:02,230 --> 00:08:06,980
n functions of n variables.
122
00:08:10,800 --> 00:08:11,300
OK.
123
00:08:15,010 --> 00:08:18,730
Now maybe, what I'm after
here is to solve f equals 0.
124
00:08:18,730 --> 00:08:25,570
So I'm going to think about
the f at x plus delta x,
125
00:08:25,570 --> 00:08:28,890
so it starts with f of x.
126
00:08:28,890 --> 00:08:34,539
And then we have the
correction times the matrix
127
00:08:34,539 --> 00:08:42,309
of first derivatives, and
what's the name for that matrix
128
00:08:42,309 --> 00:08:45,850
of first derivatives?
129
00:08:45,850 --> 00:08:51,670
Well, if I'm just
given n functions--
130
00:08:54,280 --> 00:08:56,830
yeah, what am I after here?
131
00:08:56,830 --> 00:09:05,120
I'm looking for the Jacobian.
132
00:09:05,120 --> 00:09:10,580
So here we'll go the Jacobian,
J. This is the Jacobian named
133
00:09:10,580 --> 00:09:16,320
after Jacoby, Jacobian matrix.
134
00:09:16,320 --> 00:09:18,920
And what are its entries?
135
00:09:18,920 --> 00:09:23,060
J, the jk entry
is the derivative
136
00:09:23,060 --> 00:09:30,780
of the J function with
respect to the kth variable,
137
00:09:30,780 --> 00:09:34,520
and I'm stopping at
first order there.
138
00:09:34,520 --> 00:09:38,360
OK, so these are sort of
like facts of calculus,
139
00:09:38,360 --> 00:09:41,000
facts of 18.02 you could say.
140
00:09:41,000 --> 00:09:44,550
Multivariable calculus,
that's the point.
141
00:09:44,550 --> 00:09:47,870
Notice that we're doing just
like the first half of 18.02,
142
00:09:47,870 --> 00:09:52,880
just do differential calculus,
derivatives, Taylor series.
143
00:09:52,880 --> 00:09:54,750
We're not doing
multiple integrals.
144
00:09:54,750 --> 00:09:57,050
That's not part
of our world here.
145
00:09:57,050 --> 00:10:00,330
OK, so that's the background.
146
00:10:00,330 --> 00:10:05,840
Now, I want to look
at optimization.
147
00:10:05,840 --> 00:10:12,610
So over here, I
want to optimize--
148
00:10:12,610 --> 00:10:24,290
well, over here, let me
try to minimize F of x,
149
00:10:24,290 --> 00:10:27,340
and I'll be in the
vector case here.
150
00:10:27,340 --> 00:10:38,610
And over here, I want
to solve f equals 0,
151
00:10:38,610 --> 00:10:45,330
and of course, that
means f of 1 equals 0
152
00:10:45,330 --> 00:10:48,720
all the way along
to fn equals 0.
153
00:10:48,720 --> 00:10:54,080
Here, I have n equations,
and n unknowns.
154
00:10:59,720 --> 00:11:02,660
Let me start with
that one, and I'll
155
00:11:02,660 --> 00:11:05,450
start with Newton's
method, Newton's method
156
00:11:05,450 --> 00:11:09,890
to solve these n
equations and n unknowns.
157
00:11:09,890 --> 00:11:22,440
OK, so Newton, Newton's
method which is often
158
00:11:22,440 --> 00:11:24,210
not presented in 18.02.
159
00:11:24,210 --> 00:11:29,130
That's a crime, because
that's the big application
160
00:11:29,130 --> 00:11:33,180
of gradients in Jacobians.
161
00:11:33,180 --> 00:11:37,150
OK, so I'm trying to solve
n equations and n unknowns,
162
00:11:37,150 --> 00:11:42,930
and so I want f at x
plus delta x to be 0.
163
00:11:42,930 --> 00:11:43,500
Right?
164
00:11:43,500 --> 00:11:46,740
So I want f of x
plus delta x to be 0.
165
00:11:46,740 --> 00:11:49,890
So f at x plus delta x is--
166
00:11:49,890 --> 00:11:51,150
I'm putting in a 0.
167
00:11:51,150 --> 00:11:53,340
I'm just copying that equation--
168
00:11:53,340 --> 00:11:58,440
is f at where I am.
169
00:11:58,440 --> 00:12:02,370
Let me use K for
the case iteration.
170
00:12:02,370 --> 00:12:04,490
So I'm at a point xK.
171
00:12:04,490 --> 00:12:07,230
I want to get to
a point xK plus 1.
172
00:12:07,230 --> 00:12:14,890
And so I have 0 is f of
x plus J, at that point,
173
00:12:14,890 --> 00:12:21,700
times delta x which
is xK plus 1 minus xK.
174
00:12:21,700 --> 00:12:24,030
Good.
175
00:12:24,030 --> 00:12:25,020
That's Newton's method.
176
00:12:25,020 --> 00:12:29,010
Of course, 0 isn't quite true.
177
00:12:29,010 --> 00:12:35,250
Well, 0 will be true if I'm
constructing xK plus 1 here.
178
00:12:35,250 --> 00:12:37,950
I'm constructing xK plus 1.
179
00:12:37,950 --> 00:12:39,030
OK.
180
00:12:39,030 --> 00:12:43,410
So let me just rewrite that,
and we've got Newton's method.
181
00:12:43,410 --> 00:12:52,750
So we're looking for this
change, xK plus 1 minus xK.
182
00:12:52,750 --> 00:13:00,010
I'll put it on this side
as plus xK, so that's this.
183
00:13:00,010 --> 00:13:01,960
Now, I have to invert
that and put it
184
00:13:01,960 --> 00:13:04,010
on the other side
of the equation.
185
00:13:04,010 --> 00:13:06,280
So that will go with a minus.
186
00:13:06,280 --> 00:13:12,880
This guy will be
inverted and f at xK.
187
00:13:15,472 --> 00:13:17,170
So that's Newton's methods.
188
00:13:17,170 --> 00:13:19,980
It's natural.
189
00:13:19,980 --> 00:13:21,310
So let me just repeat that.
190
00:13:21,310 --> 00:13:26,140
You see where the xK plus
1 minus xK is sitting?
191
00:13:26,140 --> 00:13:27,670
Right?
192
00:13:27,670 --> 00:13:32,930
And I moved f of xK to the
other side with a minus sign,
193
00:13:32,930 --> 00:13:37,150
and then I multiplied through
by J inverse, so I got that.
194
00:13:37,150 --> 00:13:45,020
So that's Newton's method
for a system of equations,
195
00:13:45,020 --> 00:13:47,380
and over there, I'm going to
write down Newton's method
196
00:13:47,380 --> 00:13:49,030
for minimizing a function.
197
00:13:49,030 --> 00:13:53,890
This is such basic stuff
that we have to begin here.
198
00:13:53,890 --> 00:13:58,540
Let me even begin with an
extremely straightforward
199
00:13:58,540 --> 00:14:02,440
example of Newton's method here.
200
00:14:02,440 --> 00:14:05,950
Suppose my function--
suppose I've only
201
00:14:05,950 --> 00:14:10,900
got one function actually.
202
00:14:10,900 --> 00:14:14,450
Suppose I only had one function.
203
00:14:14,450 --> 00:14:21,240
So suppose my function
is x squared minus 9,
204
00:14:21,240 --> 00:14:26,020
and I want to solve
f of x equals 0.
205
00:14:26,020 --> 00:14:28,990
I want to find the
square root of 9.
206
00:14:28,990 --> 00:14:33,790
OK, so what is
Newton's method for it?
207
00:14:33,790 --> 00:14:37,360
My point is just to see how
Newton's method is written
208
00:14:37,360 --> 00:14:43,200
and then rewrite it a little bit
so that we see the convergence.
209
00:14:43,200 --> 00:14:48,100
OK, so of course,
the Jacobian is 2x.
210
00:14:48,100 --> 00:14:51,580
So Newton's method
says that xK plus 1--
211
00:14:51,580 --> 00:14:56,230
I'm just going to copy
that Newton's method--
212
00:14:56,230 --> 00:14:59,170
minus 1 over 2xK.
213
00:14:59,170 --> 00:15:01,060
Right?
214
00:15:01,060 --> 00:15:08,680
That's the derivative times f at
xK which is xK squared minus 9.
215
00:15:12,190 --> 00:15:13,280
OK.
216
00:15:13,280 --> 00:15:17,720
We followed the formula,
this determines xK plus 1,
217
00:15:17,720 --> 00:15:20,350
and let's simplify it.
218
00:15:20,350 --> 00:15:26,500
So here I have xK minus
that looks like 1/2 of xK,
219
00:15:26,500 --> 00:15:30,730
so I think I have
1/2xK, and then
220
00:15:30,730 --> 00:15:40,137
this times this is
9/2 of 1 over xK.
221
00:15:40,137 --> 00:15:40,720
Is that right?
222
00:15:43,450 --> 00:15:50,100
1/2 of xk from this stuff
and plus 9/2 of 1 over xK.
223
00:15:50,100 --> 00:15:50,600
OK.
224
00:15:53,940 --> 00:15:59,850
Can I just like check that
I know the answer is 3?
225
00:15:59,850 --> 00:16:04,260
Can I be sure that I
get the right answer, 3?
226
00:16:04,260 --> 00:16:10,170
That if xK was exactly 3, then
of course, I expect xK plus 1
227
00:16:10,170 --> 00:16:11,730
to stay at 3.
228
00:16:11,730 --> 00:16:13,740
So does that happen?
229
00:16:13,740 --> 00:16:24,330
So 1/2 of 3 and 9/2 of 1/3,
what's that, 1/2 of 3 and 9/2
230
00:16:24,330 --> 00:16:25,440
of 1/3?
231
00:16:25,440 --> 00:16:28,720
OK, that's 3/2 and 3/2.
232
00:16:28,720 --> 00:16:31,580
That's 6/2, and that's 3.
233
00:16:31,580 --> 00:16:32,080
OK.
234
00:16:35,690 --> 00:16:38,870
So we've checked that the method
is consistent which just means
235
00:16:38,870 --> 00:16:41,690
we kept the algebra straight.
236
00:16:41,690 --> 00:16:46,430
But then the really important
point about Newton's method
237
00:16:46,430 --> 00:16:50,700
is to discover how
fast it converges.
238
00:16:50,700 --> 00:16:55,350
So now let me do
xK plus 1 minus 3.
239
00:16:55,350 --> 00:17:00,650
So now, I'm looking at the
error which is, I hope,
240
00:17:00,650 --> 00:17:02,420
approaching 0.
241
00:17:02,420 --> 00:17:03,770
Is it approaching 0?
242
00:17:03,770 --> 00:17:05,540
How quickly is it approaching 0?
243
00:17:05,540 --> 00:17:09,470
These are the fundamental
questions of optimization.
244
00:17:09,470 --> 00:17:14,940
So I'm going to subtract
3 from both sides somehow.
245
00:17:14,940 --> 00:17:20,970
OK, from here, I guess,
I'm going to subtract 3.
246
00:17:20,970 --> 00:17:24,569
So I was just checking
that it was correct.
247
00:17:24,569 --> 00:17:25,950
OK.
248
00:17:25,950 --> 00:17:30,310
Now, so xK plus 1
minus 3, I'm going
249
00:17:30,310 --> 00:17:32,310
to subtract 3 from both sides.
250
00:17:32,310 --> 00:17:38,190
I'm going to subtract 3
there, and then I hope that--
251
00:17:38,190 --> 00:17:40,110
that box is what goes down here.
252
00:17:40,110 --> 00:17:41,070
Right?
253
00:17:41,070 --> 00:17:48,960
Subtracted 3 from both sides, so
I'm hoping now things go to 0.
254
00:17:48,960 --> 00:17:52,200
OK, so what do I have there?
255
00:17:52,200 --> 00:17:55,330
Let me factor out the 1 over xK.
256
00:17:58,860 --> 00:18:01,850
So what do I have then left?
257
00:18:01,850 --> 00:18:08,750
1 over xK, so there's a
9/2 from there, 1 over xK.
258
00:18:08,750 --> 00:18:12,710
So I really have
1/2 of xK squared,
259
00:18:12,710 --> 00:18:15,080
because I've divided by an xK.
260
00:18:15,080 --> 00:18:21,120
And this minus 3, I
better put minus 3xK,
261
00:18:21,120 --> 00:18:23,230
because I'm dividing by xK.
262
00:18:23,230 --> 00:18:25,570
I claim that that's--
263
00:18:25,570 --> 00:18:26,360
now I've got it.
264
00:18:30,740 --> 00:18:34,970
And let's see, let
me take out the 2--
265
00:18:34,970 --> 00:18:41,480
2, forget these 2s,
and make that a 6.
266
00:18:41,480 --> 00:18:47,210
So I have 1 over 2xK times
9 plus xK squared minus 6.
267
00:18:47,210 --> 00:18:48,410
Anything good about that?
268
00:18:50,980 --> 00:18:51,880
We hope so.
269
00:18:56,450 --> 00:19:00,450
We hope that that is
something attractive.
270
00:19:00,450 --> 00:19:05,730
So this is, again, the
error at set K plus 1,
271
00:19:05,730 --> 00:19:11,940
and it's 1 over 2xK times
this thing in brackets--
272
00:19:11,940 --> 00:19:15,000
9 plus xK squared minus 6xK.
273
00:19:15,000 --> 00:19:23,235
And we recognize that
as xK minus 3 squared.
274
00:19:25,980 --> 00:19:34,710
xK squared minus 6 of them plus
9, that's xK minus 3 squared.
275
00:19:34,710 --> 00:19:40,370
OK, that was the
goal, of course.
276
00:19:40,370 --> 00:19:44,390
That's the goal that shows why
Newton's method is fantastic.
277
00:19:44,390 --> 00:19:48,080
If you can execute it, if
you can start near enough,
278
00:19:48,080 --> 00:19:50,850
notice that--
279
00:19:50,850 --> 00:19:54,750
so how do I describe
this great equation?
280
00:19:54,750 --> 00:19:58,850
It says that the error is
squared at every step, squared
281
00:19:58,850 --> 00:20:00,770
at every step.
282
00:20:00,770 --> 00:20:08,132
So if I'm converging to a
limit, it will satisfy the--
283
00:20:08,132 --> 00:20:15,050
it'll be 3, or I guess
minus 3, is that possible?
284
00:20:15,050 --> 00:20:17,940
Yeah, minus 3 is
another solution here.
285
00:20:17,940 --> 00:20:20,810
So we've got two solutions.
286
00:20:20,810 --> 00:20:22,955
Newton's method
could converge to 3.
287
00:20:25,585 --> 00:20:29,190
Am I right, it could
converge to minus 3?
288
00:20:29,190 --> 00:20:34,920
So I'd have a similar equation
sort of centered at minus 3,
289
00:20:34,920 --> 00:20:37,920
or does it always
do one of those?
290
00:20:40,710 --> 00:20:41,700
It could blow up.
291
00:20:44,730 --> 00:20:48,430
So there are sort of
regions of attraction.
292
00:20:48,430 --> 00:20:51,690
They're all the starting
points that approach 3,
293
00:20:51,690 --> 00:20:53,190
and the whole point
of that equation
294
00:20:53,190 --> 00:20:57,870
is with quadratic
convergence the error
295
00:20:57,870 --> 00:20:59,730
being squared at every step.
296
00:20:59,730 --> 00:21:01,760
It zooms in on 3.
297
00:21:01,760 --> 00:21:04,060
Then, there is all
the starting points
298
00:21:04,060 --> 00:21:07,420
that would go to minus
3, and then there
299
00:21:07,420 --> 00:21:10,920
are the starting points
that would blow up.
300
00:21:10,920 --> 00:21:15,170
And those, maybe for
this very simple problem,
301
00:21:15,170 --> 00:21:17,760
the picture is not too
difficult to sort out
302
00:21:17,760 --> 00:21:18,675
those three regions.
303
00:21:21,600 --> 00:21:29,010
And this is allowing for a
vector, two equations or n
304
00:21:29,010 --> 00:21:31,950
equations, then
we're in n variables,
305
00:21:31,950 --> 00:21:35,310
and really you get
beautiful pictures.
306
00:21:35,310 --> 00:21:39,900
You get some of the type
of pictures that gave rise
307
00:21:39,900 --> 00:21:45,030
to these books on fractals,
picture books on fractals
308
00:21:45,030 --> 00:21:46,920
for these basins of attraction.
309
00:21:46,920 --> 00:21:52,410
Does the starting point lead
you to one of the solutions,
310
00:21:52,410 --> 00:21:54,780
or does it lead you to infinity?
311
00:21:54,780 --> 00:21:59,790
Here, that would be interesting
to just draw it for this,
312
00:21:59,790 --> 00:22:03,900
but the essential point is
the quadratic convergence,
313
00:22:03,900 --> 00:22:05,670
if it's close enough.
314
00:22:05,670 --> 00:22:09,210
You see that it has to be close.
315
00:22:09,210 --> 00:22:15,760
If x0 is pretty near 3, then
this is about 1/6 of that,
316
00:22:15,760 --> 00:22:21,430
and there would be a good region
of attraction in this case.
317
00:22:21,430 --> 00:22:22,140
OK.
318
00:22:22,140 --> 00:22:29,640
So that's Newton's
method for equations.
319
00:22:29,640 --> 00:22:31,860
And now I want to
do Newton's method.
320
00:22:31,860 --> 00:22:34,230
I just want to convert
all those words over
321
00:22:34,230 --> 00:22:38,450
to Newton's method
for optimization.
322
00:22:38,450 --> 00:22:45,810
So remember, these boards
were solving f equals 0.
323
00:22:45,810 --> 00:22:48,570
This board is
minimizing capital F,
324
00:22:48,570 --> 00:22:50,860
and what's the
connection between them?
325
00:22:50,860 --> 00:23:02,200
Well of course, this corresponds
to solving the gradient
326
00:23:02,200 --> 00:23:04,330
equals 0.
327
00:23:04,330 --> 00:23:07,420
At a minimum, if
I'm minimizing, I'm
328
00:23:07,420 --> 00:23:11,750
finding a point where all
the first derivatives are 0.
329
00:23:11,750 --> 00:23:16,060
So that will be the
match between these.
330
00:23:16,060 --> 00:23:22,510
This grad F in this picture is
the small f in that picture.
331
00:23:22,510 --> 00:23:23,010
OK.
332
00:23:25,740 --> 00:23:28,710
Now, I guess here I have--
333
00:23:28,710 --> 00:23:33,930
and this is sort of the
heart of our applications
334
00:23:33,930 --> 00:23:38,070
to deep learning-- we have
very complicated loss functions
335
00:23:38,070 --> 00:23:40,260
to minimize,
functions of thousands
336
00:23:40,260 --> 00:23:42,900
or hundreds of
thousands of variables.
337
00:23:42,900 --> 00:23:44,840
OK.
338
00:23:44,840 --> 00:23:47,840
So that means that we would
like to use Newton's method,
339
00:23:47,840 --> 00:23:49,970
but often we can't.
340
00:23:49,970 --> 00:23:54,380
So I need him to put
down here two methods--
341
00:23:54,380 --> 00:23:59,060
one that doesn't involve
those high second derivatives
342
00:23:59,060 --> 00:24:02,750
and Newton's that does.
343
00:24:02,750 --> 00:24:10,430
So first, I'll write down a
method that does not involve,
344
00:24:10,430 --> 00:24:16,390
so method one, and this
will be steepest descent.
345
00:24:25,040 --> 00:24:26,910
And what is that?
346
00:24:26,910 --> 00:24:35,430
That says that xK plus 1--
the new x is the old x minus--
347
00:24:35,430 --> 00:24:40,110
steepest descent means that I
move in the steepest direction
348
00:24:40,110 --> 00:24:45,240
which is the direction
of the gradient of F.
349
00:24:45,240 --> 00:24:47,520
I move some distance,
and I better
350
00:24:47,520 --> 00:24:51,360
have freedom to decide what
that distance should be.
351
00:24:51,360 --> 00:25:01,670
So this is a step size, s, or in
the language of deep learning,
352
00:25:01,670 --> 00:25:03,830
it's often called
the learning rate,
353
00:25:03,830 --> 00:25:06,450
so if you see learning rate.
354
00:25:11,910 --> 00:25:12,410
OK.
355
00:25:19,000 --> 00:25:23,440
So and it's natural
to choose sK.
356
00:25:23,440 --> 00:25:27,340
We're going along, do you see
what this right-hand side looks
357
00:25:27,340 --> 00:25:27,970
like?
358
00:25:27,970 --> 00:25:31,480
I'm at a point in n dimensions.
359
00:25:31,480 --> 00:25:34,640
We're in n dimensions here.
360
00:25:34,640 --> 00:25:37,770
We have functions
of n variables.
361
00:25:37,770 --> 00:25:39,020
There is a vector.
362
00:25:39,020 --> 00:25:43,780
There is a direction to
move down the steepest slope
363
00:25:43,780 --> 00:25:46,360
of the graph.
364
00:25:46,360 --> 00:25:51,910
And here is a distance to
move, and we will stop.
365
00:25:51,910 --> 00:25:59,350
We'll have to get off
this step, normally.
366
00:25:59,350 --> 00:26:02,650
If we stay on it,
it will swing back,
367
00:26:02,650 --> 00:26:06,910
it'll take us off to infinity.
368
00:26:06,910 --> 00:26:12,450
You would like to choose
sK so that you minimize
369
00:26:12,450 --> 00:26:20,080
capital F. You take the point on
this line, so this a line in R
370
00:26:20,080 --> 00:26:24,350
n, a direction in R n.
371
00:26:28,820 --> 00:26:33,790
And for all the points on
that line, in that direction,
372
00:26:33,790 --> 00:26:39,090
F has some value, and what
you expect is that initially,
373
00:26:39,090 --> 00:26:45,460
because you chose it sensibly,
the value of F will drop.
374
00:26:45,460 --> 00:26:49,830
But then at a certain point,
it will turn back on you
375
00:26:49,830 --> 00:26:51,130
and increase.
376
00:26:51,130 --> 00:26:54,430
So that would be the
natural stopping point.
377
00:26:54,430 --> 00:26:57,460
I would call that an
exact line search.
378
00:26:57,460 --> 00:27:08,930
So I exact line search would be,
exact line search is the best
379
00:27:08,930 --> 00:27:09,430
s.
380
00:27:11,950 --> 00:27:16,390
Of course, that would
take time to compute,
381
00:27:16,390 --> 00:27:19,360
and you probably,
in deep learning,
382
00:27:19,360 --> 00:27:25,690
that's time you can't afford,
so you fix the learning rate s.
383
00:27:25,690 --> 00:27:29,240
Maybe you choose 0.01
to be pretty safe.
384
00:27:29,240 --> 00:27:32,800
OK, so that's method
one, steepest descent.
385
00:27:32,800 --> 00:27:34,800
Now, method two will
be Newton's method.
386
00:27:43,380 --> 00:27:53,070
So now, we have xK plus 1
equal to xK minus something
387
00:27:53,070 --> 00:28:01,810
times delta F, and now I'm
going to do the right thing.
388
00:28:01,810 --> 00:28:05,760
I'm going to live right
here, and the right thing
389
00:28:05,760 --> 00:28:08,850
is the Hessian, the
second derivative.
390
00:28:08,850 --> 00:28:10,770
This was cheap.
391
00:28:10,770 --> 00:28:14,280
We just took the direction
and went along it.
392
00:28:14,280 --> 00:28:17,850
Now, we're getting really
the right direction
393
00:28:17,850 --> 00:28:23,850
by using the second derivative,
so that's H inverse.
394
00:28:23,850 --> 00:28:32,430
OK, and what I've
done is to set that 0.
395
00:28:37,900 --> 00:28:40,090
Do you see that's
Newton's method?
396
00:28:40,090 --> 00:28:44,110
It's totally
parallel to this guy.
397
00:28:44,110 --> 00:28:48,850
Actually, I'm really happy to
have these two on the board
398
00:28:48,850 --> 00:28:52,540
parallel to each other, because
you have to keep straight,
399
00:28:52,540 --> 00:28:56,800
are you solving equations, or
are you minimizing functions?
400
00:28:56,800 --> 00:29:00,430
And you're using different
letters in the two problems,
401
00:29:00,430 --> 00:29:02,850
but now you see how they match.
402
00:29:02,850 --> 00:29:08,680
The Jacobian of-- so
again the matches,
403
00:29:08,680 --> 00:29:13,600
think of f as the
gradient of F. That's
404
00:29:13,600 --> 00:29:16,510
the way you should think of it.
405
00:29:16,510 --> 00:29:24,940
So the Jacobian of the
gradient is the Hessian.
406
00:29:24,940 --> 00:29:27,880
The Jacobian of the
gradient is the Hessian,
407
00:29:27,880 --> 00:29:30,790
and that makes sense,
because the first derivative
408
00:29:30,790 --> 00:29:33,460
of the first derivative
is the second derivative.
409
00:29:33,460 --> 00:29:39,640
Only we're doing matrix y, so
the Jacobian of the gradient--
410
00:29:39,640 --> 00:29:41,620
we're doing a vector
matrix sentence
411
00:29:41,620 --> 00:29:46,090
instead of a scalar sentence--
the Jacobian of the gradient
412
00:29:46,090 --> 00:29:48,110
is a Hessian.
413
00:29:48,110 --> 00:29:50,600
Yeah, right.
414
00:29:50,600 --> 00:29:53,770
OK, so that's what I
wanted to start with,
415
00:29:53,770 --> 00:29:56,110
just to get those
basic facts down.
416
00:29:56,110 --> 00:30:03,280
And so the basic facts were
the three-term Taylor series.
417
00:30:03,280 --> 00:30:09,400
And then the basic algorithms
followed naturally from it
418
00:30:09,400 --> 00:30:13,420
by setting f F at
the new point to 0,
419
00:30:13,420 --> 00:30:16,010
if that's what you were
solving or by assuming
420
00:30:16,010 --> 00:30:17,020
you had the minimum.
421
00:30:17,020 --> 00:30:19,420
Right, good, good, good, good.
422
00:30:19,420 --> 00:30:21,100
OK.
423
00:30:21,100 --> 00:30:22,900
Now, what?
424
00:30:22,900 --> 00:30:26,380
Now, we have to think about
solving these problems,
425
00:30:26,380 --> 00:30:27,730
studying.
426
00:30:27,730 --> 00:30:29,170
Do they converge?
427
00:30:29,170 --> 00:30:30,970
What rate do they converge?
428
00:30:30,970 --> 00:30:35,770
Well, the rate of
convergence is like why
429
00:30:35,770 --> 00:30:39,410
I separated off this example.
430
00:30:39,410 --> 00:30:41,800
So the convergence rate
for Newton's method
431
00:30:41,800 --> 00:30:43,840
will be quadratic.
432
00:30:43,840 --> 00:30:47,980
The error gets squared,
and of course, that
433
00:30:47,980 --> 00:30:53,160
means super-fast convergence,
if you start and close enough.
434
00:30:53,160 --> 00:30:55,860
The rate of convergence
for a steepest descent
435
00:30:55,860 --> 00:30:57,600
is, of course, not.
436
00:30:57,600 --> 00:31:00,480
You're not squaring errors
here, because you're just
437
00:31:00,480 --> 00:31:03,630
taking some number
instead of the inverse
438
00:31:03,630 --> 00:31:12,610
of the correct matrix, so
you can't expect super speed.
439
00:31:12,610 --> 00:31:15,810
So a linear rate of
convergence would be right.
440
00:31:18,430 --> 00:31:21,240
You would like to
know that the error is
441
00:31:21,240 --> 00:31:26,850
multiplied at every step
by some constant below 1.
442
00:31:26,850 --> 00:31:29,400
That would be a
linear rate compared
443
00:31:29,400 --> 00:31:35,420
to being squared at every step.
444
00:31:35,420 --> 00:31:42,310
OK, and so this will
be our basic formula
445
00:31:42,310 --> 00:31:50,830
that we build on for really
large scale problems.
446
00:31:50,830 --> 00:31:53,590
And there are
methods, of course,
447
00:31:53,590 --> 00:31:57,790
people are going to come up
with methods that they're sort
448
00:31:57,790 --> 00:32:01,260
of a cheap Newton's method.
449
00:32:01,260 --> 00:32:05,020
Levenberg-Marquardt,
and it's in the notes
450
00:32:05,020 --> 00:32:09,850
at the end of this
section, at the end of 6.4
451
00:32:09,850 --> 00:32:11,105
that we'll get to.
452
00:32:11,105 --> 00:32:16,130
So Levenberg-Marquardt is a sort
of cheap man's Newton's method.
453
00:32:16,130 --> 00:32:22,180
It does not compute to
Hessian, but it says, OK,
454
00:32:22,180 --> 00:32:27,160
from the gradient, I can
see one term in the Hessian.
455
00:32:27,160 --> 00:32:34,570
So it grabs that term, but
it's not fully second order.
456
00:32:34,570 --> 00:32:36,270
OK.
457
00:32:36,270 --> 00:32:40,860
So now, we have to
think about problems,
458
00:32:40,860 --> 00:32:45,360
and I guess the message here
is, at our starting point,
459
00:32:45,360 --> 00:32:49,100
has to be convexity.
460
00:32:49,100 --> 00:32:55,090
Convexity is the key
word for these problems,
461
00:32:55,090 --> 00:32:57,740
for the function that
we want to minimize.
462
00:32:57,740 --> 00:33:01,800
If that's a convex
function, well first of all,
463
00:33:01,800 --> 00:33:07,570
the convex function is
likely to have one minimum.
464
00:33:07,570 --> 00:33:13,850
And the picture that's in
our mind of steepest descent,
465
00:33:13,850 --> 00:33:17,410
this picture of a bowl,
a bowl is the graph
466
00:33:17,410 --> 00:33:19,480
of the convex function.
467
00:33:19,480 --> 00:33:21,970
So I'm turning to convexity now.
468
00:33:21,970 --> 00:33:28,260
I'll leave that board there,
because that's pretty crucial,
469
00:33:28,260 --> 00:33:32,010
and speak about the
idea of convexity.
470
00:33:32,010 --> 00:33:39,960
Convex function, convex
set, so let's call
471
00:33:39,960 --> 00:33:45,540
the function f of x,
and a typical convex set
472
00:33:45,540 --> 00:33:50,590
would be I'll call it K. OK.
473
00:33:50,590 --> 00:33:54,760
So we just want to remember what
does that word can convex mean,
474
00:33:54,760 --> 00:33:57,910
and how do you know if
you have a convex function
475
00:33:57,910 --> 00:33:59,470
or a convex set?
476
00:33:59,470 --> 00:34:01,930
OK, let me start
with convex set.
477
00:34:01,930 --> 00:34:05,290
So because here is
my general problem,
478
00:34:05,290 --> 00:34:16,960
my convex minimization,
which you hope to have,
479
00:34:16,960 --> 00:34:19,480
and in many applications,
you do have.
480
00:34:19,480 --> 00:34:26,199
So you minimize
a convex function
481
00:34:26,199 --> 00:34:31,330
for points in a convex set.
482
00:34:31,330 --> 00:34:34,570
So that's like the
ideal situation.
483
00:34:34,570 --> 00:34:38,230
That's the ideal
situation, to get something
484
00:34:38,230 --> 00:34:41,590
on your side, something
powerful, convexity.
485
00:34:41,590 --> 00:34:44,409
The function is convex,
and you say, well,
486
00:34:44,409 --> 00:34:47,710
let me draw a convex
function, the graph.
487
00:34:47,710 --> 00:34:54,980
OK, so I'll draw a convex
function, say a bowl.
488
00:34:54,980 --> 00:35:00,640
So that's a graph of f of x,
and then here are the x's.
489
00:35:00,640 --> 00:35:08,800
Let me maybe put x1 and x2 in
the base and the graph of f
490
00:35:08,800 --> 00:35:11,680
of 1x x2 up here.
491
00:35:16,110 --> 00:35:16,610
OK.
492
00:35:16,610 --> 00:35:17,820
Actually, I'm over there.
493
00:35:17,820 --> 00:35:21,340
I should be calling this
function F, I think.
494
00:35:24,268 --> 00:35:25,732
Is that right?
495
00:35:29,160 --> 00:35:33,220
Yeah, a little f would be
the gradient of this guy.
496
00:35:33,220 --> 00:35:34,000
Yeah, I think so.
497
00:35:38,543 --> 00:35:39,043
OK.
498
00:35:43,200 --> 00:35:49,120
Now, I'm minimizing it over
certain x's, not all x's.
499
00:35:49,120 --> 00:35:55,750
I might be minimizing,
for example,
500
00:35:55,750 --> 00:36:05,230
K might be the set where
Ax equals B. K might be,
501
00:36:05,230 --> 00:36:09,460
in that case, a subspace
or a shifted subspace.
502
00:36:09,460 --> 00:36:14,350
I said subspace, but then 18.06
is reminding me in my mind
503
00:36:14,350 --> 00:36:16,710
that I only have a
subspace when B is 0.
504
00:36:19,330 --> 00:36:23,530
You know the word for a subspace
that's sort of moved over?
505
00:36:23,530 --> 00:36:27,110
Affine, so I'll just
put that word down here.
506
00:36:30,190 --> 00:36:35,200
Bunch of words to
learn for this topic,
507
00:36:35,200 --> 00:36:36,760
but they're worth learning.
508
00:36:36,760 --> 00:36:37,630
OK.
509
00:36:37,630 --> 00:36:42,700
So it's like a plane but not
necessarily through the origin.
510
00:36:42,700 --> 00:36:44,640
If B is 0, it doesn't
go through it.
511
00:36:44,640 --> 00:36:46,980
If B it's not 0, it doesn't
go through the origin.
512
00:36:46,980 --> 00:36:47,480
OK.
513
00:36:47,480 --> 00:36:49,940
Anyway, or I have
some other convex set.
514
00:36:49,940 --> 00:36:57,519
Let me just put this convex
set K in the base for you,
515
00:36:57,519 --> 00:36:59,470
and did I make it convex?
516
00:36:59,470 --> 00:37:04,060
I think pretty luckily I did.
517
00:37:04,060 --> 00:37:05,200
So now what's the?
518
00:37:07,840 --> 00:37:11,540
Well, the convex
sets the constraint,
519
00:37:11,540 --> 00:37:13,840
so this is the constraint set.
520
00:37:13,840 --> 00:37:23,290
Constraint is that x
must be in the set K. OK,
521
00:37:23,290 --> 00:37:28,120
and I drew it as a convex blob.
522
00:37:28,120 --> 00:37:32,770
Here was an example
where it's flat, not
523
00:37:32,770 --> 00:37:36,110
a blob but a flat plane.
524
00:37:36,110 --> 00:37:41,020
But let me come back to
what does convex mean.
525
00:37:41,020 --> 00:37:42,450
What's a convex set?
526
00:37:42,450 --> 00:37:47,430
Yeah, we have to do that,
should have done that before.
527
00:37:55,140 --> 00:38:00,040
In the notes, I had the
fun of figuring out,
528
00:38:00,040 --> 00:38:04,870
if I took a triangle,
is that a convex set?
529
00:38:04,870 --> 00:38:05,830
Let's just be sure.
530
00:38:09,640 --> 00:38:11,530
So what's a convex set?
531
00:38:11,530 --> 00:38:14,020
That is a convex
set, because if I
532
00:38:14,020 --> 00:38:19,120
take any two points in the set
and draw the line between them,
533
00:38:19,120 --> 00:38:21,710
it stays in the set.
534
00:38:21,710 --> 00:38:31,370
So that's convexity,
any edge, line, from x1
535
00:38:31,370 --> 00:38:37,970
to x2 stays in the set.
536
00:38:37,970 --> 00:38:39,920
OK, good.
537
00:38:39,920 --> 00:38:43,380
So here's my little
exercise to myself.
538
00:38:43,380 --> 00:38:46,720
What if I took the
union of two triangles?
539
00:38:46,720 --> 00:38:52,760
All I want to get you to do is
just visualize convex and not
540
00:38:52,760 --> 00:38:54,620
convex possibilities.
541
00:38:54,620 --> 00:39:02,120
Suppose I have one triangle,
even if it was obtuse,
542
00:39:02,120 --> 00:39:04,730
that's still convex, right?
543
00:39:04,730 --> 00:39:06,230
No problem.
544
00:39:06,230 --> 00:39:08,630
But now what if I put
those two triangles
545
00:39:08,630 --> 00:39:11,310
together, take their union?
546
00:39:11,310 --> 00:39:15,740
Well, if I take them sitting
with a big gap between,
547
00:39:15,740 --> 00:39:17,330
like I've lost.
548
00:39:17,330 --> 00:39:20,390
I mean, I never had
a chance that way,
549
00:39:20,390 --> 00:39:24,290
because if it was the
union of these two-- well,
550
00:39:24,290 --> 00:39:25,880
you know what I'm going to say.
551
00:39:25,880 --> 00:39:28,100
If I'm doing that point
and that point, of course,
552
00:39:28,100 --> 00:39:30,270
it goes outside and stupid.
553
00:39:30,270 --> 00:39:30,770
All right.
554
00:39:36,140 --> 00:39:41,150
What if what if that
triangle, that lower triangle,
555
00:39:41,150 --> 00:39:42,980
overlaps the upper triangle?
556
00:39:42,980 --> 00:39:44,240
Is that a convex set?
557
00:39:47,000 --> 00:39:49,130
Everybody's right saying no.
558
00:39:49,130 --> 00:39:53,240
Why how do I see that the
union of those two triangles
559
00:39:53,240 --> 00:39:56,200
is not a convex set?
560
00:39:56,200 --> 00:39:59,530
Guys, you tell me where
to pick two points,
561
00:39:59,530 --> 00:40:00,940
where the line goes out.
562
00:40:00,940 --> 00:40:03,700
Well, I take one from
that corner and one
563
00:40:03,700 --> 00:40:08,260
from that corner, and the line
between them went outside.
564
00:40:08,260 --> 00:40:17,305
So union is usually not convex.
565
00:40:21,990 --> 00:40:24,240
Well, if I think of
the union of two sets,
566
00:40:24,240 --> 00:40:29,560
my mind automatically goes
to the other corresponding
567
00:40:29,560 --> 00:40:36,540
possibility which is the
intersection of the two sets.
568
00:40:36,540 --> 00:40:39,384
So if I take the
intersection of two sets.
569
00:40:43,680 --> 00:40:46,890
Now, what's the deal with that?
570
00:40:46,890 --> 00:40:50,670
When I had two triangles,
two separated triangles,
571
00:40:50,670 --> 00:40:54,780
what can we say about the
intersection of those two
572
00:40:54,780 --> 00:40:55,440
triangles?
573
00:40:55,440 --> 00:40:56,790
AUDIENCE: [INAUDIBLE]
574
00:40:56,790 --> 00:40:58,510
GILBERT STRANG: It's empty.
575
00:40:58,510 --> 00:41:02,380
So should we regard the
empty set as a convex set?
576
00:41:02,380 --> 00:41:03,950
Yes.
577
00:41:03,950 --> 00:41:04,797
Isn't it?
578
00:41:04,797 --> 00:41:06,290
AUDIENCE: Yeah, it's vacuous.
579
00:41:06,290 --> 00:41:09,770
GILBERT STRANG: Vacuous, so
it hasn't got any problems.
580
00:41:09,770 --> 00:41:11,430
Right?
581
00:41:11,430 --> 00:41:18,770
OK, but now the intersection
is always convex.
582
00:41:18,770 --> 00:41:22,640
I'm assuming the two sets
that we start with are.
583
00:41:22,640 --> 00:41:26,640
Now, that's an important fact,
that the intersection of convex
584
00:41:26,640 --> 00:41:27,140
sets.
585
00:41:27,140 --> 00:41:30,305
Let's just draw a picture
that shows an example.
586
00:41:34,160 --> 00:41:35,840
So what's the intersection?
587
00:41:35,840 --> 00:41:39,560
Just this part and it's convex.
588
00:41:39,560 --> 00:41:42,480
OK, can you give
me a little proof
589
00:41:42,480 --> 00:41:48,670
that the intersection is convex?
590
00:41:48,670 --> 00:41:51,060
So I take two points
in the intersection--
591
00:41:51,060 --> 00:41:54,070
let me start the proof.
592
00:41:54,070 --> 00:41:57,550
To test if something's
convex, how do you test it?
593
00:41:57,550 --> 00:42:01,570
You take two points in the
set in the intersection,
594
00:42:01,570 --> 00:42:03,580
and you want to show that
the line between them
595
00:42:03,580 --> 00:42:06,850
is in the intersection.
596
00:42:06,850 --> 00:42:08,420
OK, why is that?
597
00:42:08,420 --> 00:42:14,350
So take two points, take
x1 in the intersection.
598
00:42:14,350 --> 00:42:17,530
We've got two sets
here, and that's
599
00:42:17,530 --> 00:42:19,870
the symbol for
intersection, and we've
600
00:42:19,870 --> 00:42:22,017
got another point
in the intersection.
601
00:42:24,640 --> 00:42:27,950
And now, we want to look
at the line between them,
602
00:42:27,950 --> 00:42:33,100
the line from x1 to 2x.
603
00:42:33,100 --> 00:42:35,236
What's the deal with that one?
604
00:42:38,070 --> 00:42:41,038
Is that fully in K1?
605
00:42:41,038 --> 00:42:42,830
AUDIENCE: Yes.
606
00:42:42,830 --> 00:42:45,510
GILBERT STRANG: Why
is it fully in K1?
607
00:42:45,510 --> 00:42:48,830
I took two points
in the intersection,
608
00:42:48,830 --> 00:42:52,850
I'm looking at the line
between them, and I'm asking,
609
00:42:52,850 --> 00:42:54,680
is it in the first set K1?
610
00:42:54,680 --> 00:42:57,950
And the answer is yes,
because those points
611
00:42:57,950 --> 00:43:02,010
were in K1, and K1's convex.
612
00:43:02,010 --> 00:43:05,520
And is that line
between them in K2?
613
00:43:05,520 --> 00:43:11,550
Yes, same reason, the
two endpoints were in K2,
614
00:43:11,550 --> 00:43:13,890
so the line between
them is in K2.
615
00:43:13,890 --> 00:43:17,490
So the intersection of
convex sets is always convex.
616
00:43:17,490 --> 00:43:22,990
The intersection of
convex sets is convex.
617
00:43:22,990 --> 00:43:23,490
Good.
618
00:43:26,420 --> 00:43:29,570
So you'll see in the
note these possibilities
619
00:43:29,570 --> 00:43:31,560
with two triangles.
620
00:43:31,560 --> 00:43:35,770
Sometimes, you can take the
union but not very often.
621
00:43:35,770 --> 00:43:36,590
OK.
622
00:43:36,590 --> 00:43:41,130
Now, what's the next
thing I have to do?
623
00:43:41,130 --> 00:43:45,230
Convex functions,
we got convex sets,
624
00:43:45,230 --> 00:43:48,590
what are convex
functions, and we're good.
625
00:43:48,590 --> 00:43:53,140
Because this is our
prototype of a problem,
626
00:43:53,140 --> 00:43:57,930
and I now want to know what
it means for that F to be--
627
00:43:57,930 --> 00:43:58,640
oh, I'm sorry.
628
00:43:58,640 --> 00:44:03,920
I now know what it means for
the set K to be convex set,
629
00:44:03,920 --> 00:44:07,525
but now I have to look at the
other often more important part
630
00:44:07,525 --> 00:44:08,150
of the problem.
631
00:44:08,150 --> 00:44:10,550
What's the function
I'm minimizing,
632
00:44:10,550 --> 00:44:15,110
and I'm looking for functions
with this kind of a picture.
633
00:44:15,110 --> 00:44:16,610
OK.
634
00:44:16,610 --> 00:44:22,580
The coolest way is to connect
the definition of a convex
635
00:44:22,580 --> 00:44:28,340
function to the definition
of a convex set.
636
00:44:28,340 --> 00:44:31,820
This is really the nicest way.
637
00:44:31,820 --> 00:44:32,780
It's a little quick.
638
00:44:32,780 --> 00:44:34,610
It just swishes by you.
639
00:44:34,610 --> 00:44:38,240
But tell me, do you see a
convex set in that picture?
640
00:44:38,240 --> 00:44:40,930
[INAUDIBLE]
641
00:44:40,930 --> 00:44:43,550
You see a convex
set in that picture.
642
00:44:43,550 --> 00:44:46,300
That's the picture of a
graph of a convex function.
643
00:44:46,300 --> 00:44:48,850
It's a picture of a bowl.
644
00:44:48,850 --> 00:44:53,400
Are the points on that
surface, is that a convex set?
645
00:44:53,400 --> 00:44:55,650
No, certainly not.
646
00:44:55,650 --> 00:45:03,310
No, but where is a convex set to
be found here, in that picture?
647
00:45:03,310 --> 00:45:04,290
Yes.
648
00:45:04,290 --> 00:45:07,230
AUDIENCE: The set of y, if y
is greater than [INAUDIBLE]
649
00:45:07,230 --> 00:45:11,210
GILBERT STRANG: Yes, the
points on and above the bowl,
650
00:45:11,210 --> 00:45:16,470
inside the bowl, we
could say, these points.
651
00:45:16,470 --> 00:45:21,600
So convex function,
yes, a function's
652
00:45:21,600 --> 00:45:37,320
convex when the points on and
above the graph are convex set.
653
00:45:48,650 --> 00:45:51,110
You could say,
OK, mathematicians
654
00:45:51,110 --> 00:45:52,100
are just being lazy.
655
00:45:52,100 --> 00:45:55,190
Having got one definition
straight for a convex set,
656
00:45:55,190 --> 00:45:59,045
now they're just using that
to give an easy definition
657
00:45:59,045 --> 00:46:00,930
of a convex function.
658
00:46:00,930 --> 00:46:04,190
Actually, it's quite
useful for functions
659
00:46:04,190 --> 00:46:09,080
that could maybe equal infinity,
sort of generalized functions.
660
00:46:09,080 --> 00:46:15,260
But it's not the quickest way to
tell if the function is convex.
661
00:46:15,260 --> 00:46:19,070
It's not our usual test
for convex functions.
662
00:46:19,070 --> 00:46:21,520
So now I want to
give such a test.
663
00:46:21,520 --> 00:46:22,020
OK.
664
00:46:24,740 --> 00:46:36,030
So now, the definition of convex
function, of a smooth convex,
665
00:46:36,030 --> 00:46:36,530
yeah.
666
00:46:39,640 --> 00:46:45,190
This fact, I shouldn't
rush off away from it,
667
00:46:45,190 --> 00:46:51,560
from the definition
of a convex function
668
00:46:51,560 --> 00:46:55,020
as having a convex
set above its graph.
669
00:46:55,020 --> 00:46:59,450
The really official French name
for the set above the graph
670
00:46:59,450 --> 00:47:03,780
is the epigraph, but I won't
even write that word down.
671
00:47:03,780 --> 00:47:05,690
OK.
672
00:47:05,690 --> 00:47:08,300
Why do I come back
to that for a minute?
673
00:47:08,300 --> 00:47:16,720
Because I would like to think
about two functions, F1 and F2.
674
00:47:16,720 --> 00:47:18,640
Out of two functions,
I can always
675
00:47:18,640 --> 00:47:23,140
create the minimum
or the maximum.
676
00:47:23,140 --> 00:47:28,795
So suppose I have to convex
functions, convex function
677
00:47:28,795 --> 00:47:30,535
F1 and F2.
678
00:47:34,440 --> 00:47:35,050
OK.
679
00:47:35,050 --> 00:47:38,840
Then, I could choose a minimum.
680
00:47:38,840 --> 00:47:41,450
I could choose my new function.
681
00:47:41,450 --> 00:47:44,090
Shall I call it
little m for minimum?
682
00:47:44,090 --> 00:47:51,340
m of x is the
minimum of F1 and F2.
683
00:47:51,340 --> 00:47:54,380
And I could choose
a maximum function
684
00:47:54,380 --> 00:47:59,330
which would be the
maximum of F1 of x and F2
685
00:47:59,330 --> 00:48:03,870
of x at the same point x.
686
00:48:03,870 --> 00:48:07,280
It's just a natural to think,
OK, I have two functions.
687
00:48:07,280 --> 00:48:11,212
I've got a bowl and
I've got another bowl,
688
00:48:11,212 --> 00:48:12,545
and suppose they're both convex.
689
00:48:18,030 --> 00:48:20,472
So I'm just stretching
you to think here.
690
00:48:20,472 --> 00:48:25,060
If I've got the graphs
of two convex functions,
691
00:48:25,060 --> 00:48:29,560
and I would like to consider the
minimum of those two functions
692
00:48:29,560 --> 00:48:32,150
and also the maximum
of those two functions.
693
00:48:32,150 --> 00:48:34,730
I believe life is good.
694
00:48:34,730 --> 00:48:38,830
One of these will be
convex, and the other won't.
695
00:48:38,830 --> 00:48:41,980
And can you identify
which one is convex
696
00:48:41,980 --> 00:48:43,300
and which one is not convex?
697
00:48:46,310 --> 00:48:49,100
What about the minimum?
698
00:48:49,100 --> 00:48:50,690
Is that a convex function?
699
00:48:50,690 --> 00:48:52,470
So just look at the graph.
700
00:48:52,470 --> 00:48:54,110
What does the minimum look like?
701
00:48:54,110 --> 00:48:58,250
The minimum is this guy
until they meet somehow
702
00:48:58,250 --> 00:49:01,190
on some surface
and then this guy.
703
00:49:01,190 --> 00:49:02,240
Is that convex?
704
00:49:02,240 --> 00:49:05,900
We have like one minute
to answer that question.
705
00:49:05,900 --> 00:49:07,160
Absolutely no.
706
00:49:07,160 --> 00:49:10,370
It's got this bad kink in it.
707
00:49:10,370 --> 00:49:14,220
What about the maximum
of the two functions?
708
00:49:14,220 --> 00:49:16,490
So the maximum is the
one that is above,
709
00:49:16,490 --> 00:49:22,150
all the points or things
that are above or on.
710
00:49:26,010 --> 00:49:27,770
There is the maximum function.
711
00:49:27,770 --> 00:49:29,990
That was the minimum function.
712
00:49:29,990 --> 00:49:31,460
It had a kink.
713
00:49:31,460 --> 00:49:33,380
The maximum function
is like that,
714
00:49:33,380 --> 00:49:41,530
and it is convex, so
maximum yes, minimum no.
715
00:49:44,310 --> 00:49:50,740
OK, and we could have a
maximum of 1,500 functions.
716
00:49:50,740 --> 00:49:53,800
If the 1,500 functions
are all convex,
717
00:49:53,800 --> 00:49:55,600
the maximum will
be, because it's
718
00:49:55,600 --> 00:50:00,520
the part way above
everybody's graph,
719
00:50:00,520 --> 00:50:03,580
and that would be the
graph of the maximum.
720
00:50:03,580 --> 00:50:05,410
OK, good.
721
00:50:05,410 --> 00:50:11,260
And now finally, let me
just say, how do you know
722
00:50:11,260 --> 00:50:13,480
whether a function is convex?
723
00:50:13,480 --> 00:50:16,570
How to test, how of test.
724
00:50:20,330 --> 00:50:24,890
OK, so let me take just a
function of one variable.
725
00:50:24,890 --> 00:50:28,130
What's the test you learned
in calculus, freshman
726
00:50:28,130 --> 00:50:36,360
calculus actually, just show
that this is a convex function?
727
00:50:36,360 --> 00:50:38,542
What's the test for that?
728
00:50:38,542 --> 00:50:40,067
AUDIENCE: Use second derivative.
729
00:50:40,067 --> 00:50:41,900
GILBERT STRANG: Second
derivative should be?
730
00:50:41,900 --> 00:50:42,692
AUDIENCE: Positive.
731
00:50:42,692 --> 00:50:45,200
GILBERT STRANG:
Positive or possibly 0,
732
00:50:45,200 --> 00:50:53,360
so second derivative greater
or equals 0 everywhere.
733
00:50:53,360 --> 00:50:54,360
That's convex.
734
00:50:57,480 --> 00:51:02,920
OK, final question,
suppose F is a vector.
735
00:51:02,920 --> 00:51:09,370
So this is a vector, and so I
have n functions of n variable.
736
00:51:09,370 --> 00:51:10,020
No, I don't.
737
00:51:10,020 --> 00:51:14,030
I have one, sorry,
I've got one function,
738
00:51:14,030 --> 00:51:15,490
but I'm in n variables.
739
00:51:19,510 --> 00:51:21,680
So this was just one.
740
00:51:21,680 --> 00:51:23,360
What's the test for convexity?
741
00:51:27,350 --> 00:51:33,080
So it would be passed, for
example, by x1 squared plus x2
742
00:51:33,080 --> 00:51:35,150
squared.
743
00:51:35,150 --> 00:51:37,683
Would it be passed by--
744
00:51:37,683 --> 00:51:39,350
so here would be the
question-- would it
745
00:51:39,350 --> 00:51:45,830
be passed by x transpose
some symmetric matrix S?
746
00:51:45,830 --> 00:51:51,000
That would be a quadratic,
a pure quadratic.
747
00:51:51,000 --> 00:51:53,760
Would it be convex?
748
00:51:53,760 --> 00:51:54,780
What would be the test?
749
00:51:57,500 --> 00:52:00,350
I'm looking for an n
dimensional equivalent
750
00:52:00,350 --> 00:52:04,460
of positive second derivative.
751
00:52:04,460 --> 00:52:07,370
The n dimensional equivalent
of positive second derivative
752
00:52:07,370 --> 00:52:12,560
is convexity, and we have to
recognize what's the test.
753
00:52:12,560 --> 00:52:14,990
So I could apply it
to this function,
754
00:52:14,990 --> 00:52:20,090
or I could apply it to any
function of n variables.
755
00:52:24,170 --> 00:52:25,280
It should be OK.
756
00:52:27,960 --> 00:52:29,220
What's the test here?
757
00:52:29,220 --> 00:52:33,340
Here, I have a matrix
instead of a number.
758
00:52:33,340 --> 00:52:38,090
So what's the
requirement going to be?
759
00:52:38,090 --> 00:52:39,527
Times out, yeah?
760
00:52:39,527 --> 00:52:43,380
[INAUDIBLE] Positive
definite or semidefinite,
761
00:52:43,380 --> 00:52:45,440
or semidefinite just as here.
762
00:52:45,440 --> 00:52:45,940
Yeah.
763
00:52:45,940 --> 00:52:51,680
So the test is positive,
semidefinite, Hessian.
764
00:52:54,520 --> 00:52:56,830
And here, the Hessian
is actually that S,
765
00:52:56,830 --> 00:53:00,700
because the second
derivatives will produce--
766
00:53:00,700 --> 00:53:02,220
I'll put a 1/2 in there--
767
00:53:02,220 --> 00:53:06,940
the second derivatives will
produce S equal the Hessian H.
768
00:53:06,940 --> 00:53:08,590
So here, the S--
769
00:53:08,590 --> 00:53:11,800
so positive
semidefinite, Hessian
770
00:53:11,800 --> 00:53:17,920
in general, second derivative
matrix for a quadratic.
771
00:53:17,920 --> 00:53:19,150
OK.
772
00:53:19,150 --> 00:53:23,950
So its convex
problems that we're
773
00:53:23,950 --> 00:53:28,150
going to get farther with.
774
00:53:28,150 --> 00:53:31,040
We run into no saddle points.
775
00:53:31,040 --> 00:53:34,220
We run into no local minimum.
776
00:53:34,220 --> 00:53:36,950
Once we found the minimum,
it's the global minimum.
777
00:53:36,950 --> 00:53:38,520
These are the good problems.
778
00:53:38,520 --> 00:53:41,180
OK, again, happy
to see you today,
779
00:53:41,180 --> 00:53:43,822
and I look forward to Wednesday.