Yann LeCun at Duke's Responsible AI Symposium
Transcript
[Music]
All right. So I'm going to talk a bit about where I think AI is going over the next few years, and this will cover some of the research I've personally been involved in over the last several years.
First, what I want to say is that regardless of what your interest in AI is, at some point we're going to need human-level AI systems, because they're going to assist us in our daily lives, maybe with us at all times. And the best way to interact with such a system is if it has an intelligence similar to human intelligence, because we're familiar with interacting with other humans. That was depicted in the movie Her, the Spike Jonze film from 2013.
Eventually, at least if Meta's vision is fulfilled, we'll be interacting with those agents through smart devices: smart glasses and things of that type, augmented-reality or mixed-reality glasses. But for this we need technology that doesn't exist yet. We need systems that understand how the world works, that can remember, reason, and plan. And we're nowhere near this at the moment.
On the hardware side, we're making progress toward devices that let us have assistants with us at all times. Currently they're in our smartphones, which is somewhat inconvenient. If they could display information in smart glasses, and we could talk to them or interact with them through an electromyography bracelet that lets us point, click, and type with our hands in our pockets, that would be more convenient. That is going to happen. What's also going to happen over the next decade or so is practical hardware platforms for things like humanoid robots and domestic robots. But we don't have the technology that would make those assistants and domestic robots sufficiently smart yet, and that needs to make significant progress over the next five years, roughly.
And the problem is that machine learning sucks. In terms of sample efficiency, in terms of being able to acquire new skills quickly, machine learning is nowhere near the kind of capabilities that we observe not just in humans, but in most animals. Animals can learn new tasks extremely quickly, they understand how the world works, and many of them can certainly reason and plan. They have what we would qualify as common sense. And their behavior is not driven by the statistics of their training data; it's driven by objectives that were hardwired into them, and into us, by evolution. Okay, so we have some progress to make.
Now let's talk about what the current AI technology that everybody talks about actually does: autoregressive large language models. We call them LLMs, or chatbots, but they're really autoregressive LLMs. They're trained to predict the next word that follows a text. Many of you already know this: you take a text and you train a system to predict the next word that follows that text.
Now, because of the architecture of the kind of network that's trained to do this, you can actually train it in parallel. In effect, you train those systems as a kind of autoencoder: you train the system to reproduce its input on its output, but because it cannot look at a particular input token to produce the corresponding output, it can only look at the tokens to its left. That's called a causal architecture. In effect you're training it to produce the word that follows a text, and you're training it to do this for every word in the text in parallel, which is very efficient. So it scales very well. You can scale those systems to hundreds of billions of parameters, and some sort of emergent property comes out: those systems seem to have some level of understanding of the underlying reality, but it's pretty shallow.
Once you've trained a system of this type, you can have it produce text autoregressively: it predicts the next word that follows a particular prompt you give it; then you shift that word into its input, shift everything by one, and it predicts the second word; shift that in, then the third word, and so on. That's autoregressive prediction. It's not a new concept; it's been around since the 1950s, if not the 1940s.
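To make the mechanics concrete, here is a minimal sketch in PyTorch of both pieces just described: the parallel next-token training objective with a causal mask, and the shift-in-and-repeat decoding loop. The tiny model, vocabulary size, and random tokens are illustrative assumptions, not the architecture of any particular LLM.

```python
# Minimal sketch of next-token training with a causal mask and greedy
# autoregressive decoding. Model size, vocabulary, and data are toy placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

V, D, T = 100, 32, 16          # vocab size, embedding dim, sequence length

class TinyCausalLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(V, D)
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, V)

    def forward(self, tokens):                        # tokens: (B, T)
        t = tokens.size(1)
        # Causal mask: position i may only attend to positions <= i.
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        h = self.backbone(self.embed(tokens), mask=mask)
        return self.head(h)                           # (B, T, V) logits

model = TinyCausalLM()
tokens = torch.randint(0, V, (8, T))                  # fake training batch

# Training objective: every position predicts the *next* token, in parallel.
logits = model(tokens)
loss = F.cross_entropy(logits[:, :-1].reshape(-1, V), tokens[:, 1:].reshape(-1))
loss.backward()

# Autoregressive decoding: predict a token, shift it in, predict the next one.
prompt = torch.randint(0, V, (1, 4))
for _ in range(10):
    next_logits = model(prompt)[:, -1]                # scores over the next token
    next_token = next_logits.argmax(dim=-1, keepdim=True)   # greedy choice
    prompt = torch.cat([prompt, next_token], dim=1)
```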
But there is a major limitation with this, which is that there is no way for the system to really reflect on anything: it spends the exact same amount of computation for every word or token it produces. If you ask a system like this a yes/no question such as "does 2 + 2 equal 4?", it spends a particular quantity of computation to answer. Now ask "does P equal NP?", and it will spend the exact same amount of computation and then produce an answer, yes or no, which will most likely be wrong, or at least unjustified if it happens to be right. That's not right: we tend to spend a lot more time on complex questions than on simple ones.
There is another problem, which holds under some assumptions. Think of the space of all possible sequences of tokens or words as a tree; within it there is a subtree of acceptable answers to the question. At every token the system produces, there is some probability that it takes you out of that subtree of acceptable answers. Now make an incredibly strong assumption: that this probability is constant and independent for every token in the answer, which is of course false, but take it as a first approximation. What that means is that the probability of staying within the subtree decreases exponentially as the answer gets longer, and drifting out becomes almost inevitable. So the longer the output those systems produce, the more likely it is to become wrong or irrelevant. This is a very informal argument, but there are some papers that study it more carefully.
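The argument can be made concrete with a one-line calculation; the per-token error rates and answer lengths below are arbitrary illustrative assumptions, not measured values.

```python
# Toy illustration of the divergence argument: if each generated token has an
# independent probability e of leaving the set of acceptable answers (a strong,
# admittedly false assumption), the chance of still being acceptable after n
# tokens is (1 - e)**n, which decays exponentially with answer length.
for e in (0.001, 0.01, 0.05):            # hypothetical per-token error rates
    for n in (10, 100, 1000):            # answer lengths in tokens
        print(f"e={e:>5}, n={n:>4}: P(still acceptable) = {(1 - e) ** n:.4f}")
```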
So we're missing something really big, even bigger than the small issues I just mentioned. Never mind humans: cats and dogs can do amazing things that we cannot come even close to reproducing with our AI systems. Any house cat can plan highly complex actions. Just observe a cat standing at the bottom of a bunch of furniture, thinking about how it's going to jump to the top. They clearly plan.
In humans, any 10-year-old can clear the dinner table and fill the dishwasher without being trained to do it. The first time you ask a 10-year-old to do it, he or she will be able to. That's called zero-shot: no training needed, the ability to solve new tasks without actually training ourselves to do them. Any 17-year-old can learn to drive a car in about 20 hours of practice. Yet we don't have domestic robots, and we don't even have self-driving cars, despite having millions of hours of training data of cars being driven by expert drivers. We could train a system to simply emulate the human driver, but when we do, we don't get a system that is nearly as reliable as a human, and it cannot handle very unusual situations. So we have systems that can pass the bar exam, solve complex math problems, prove theorems, and beat us at chess, Go, poker, Diplomacy, and whatever else, but we don't have level-five self-driving cars and we don't have domestic robots. That's another example of Moravec's paradox. Moravec was a roboticist, and he asked: how come all the things we think of as sophisticated intellectual tasks, like playing chess or planning a path from one city to another, can be solved with computers, and computers can even solve them better than humans, yet the things we take for granted, dealing with the real world, acting in the real world, are things we don't know how to do with machines?
That points to an interesting observation. A typical LLM today, like Llama, is trained on on the order of 30 trillion tokens, that is, 3 × 10^13 tokens. A token is a kind of subword unit, worth roughly three quarters of a word, and each token is about three bytes for Llama. So the complete data volume is 0.9 × 10^14, basically 10^14 bytes. That's essentially the entirety of the publicly available text on the internet. It would take any of us on the order of 400,000 years to read it, reading 12 hours a day, which obviously we would not survive.
Now consider a human child. A four-year-old has been awake a total of about 16,000 hours; that's what psychologists tell us. Which, by the way, is not a big amount of video: it's about 30 minutes' worth of YouTube uploads. We have 2 million optic nerve fibers, 1 million per optic nerve, going to the brain, to the visual cortex, and each fiber carries about one byte per second, give or take. Do the arithmetic and the result is about 10^14 bytes. So in four years, a child has seen as much raw data as the biggest LLMs trained on all the publicly available text on the internet, in about 100,000 times less time than it would take to read the material.
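For reference, here is that arithmetic spelled out. The reading speed is an assumption added for the check; the other figures are the ones quoted above.

```python
# Back-of-the-envelope check of the numbers quoted above. The reading speed
# (250 words/minute) is an added assumption; the other figures are from the talk.
tokens          = 30e12                                        # ~30 trillion training tokens
bytes_per_token = 3
text_bytes      = tokens * bytes_per_token                     # ~0.9e14, basically 1e14 bytes

words           = tokens * 0.75                                # ~3/4 word per token
reading_years   = words / (250 * 60 * 12 * 365)                # 250 wpm, 12 h/day
                                                               # -> a few hundred thousand years
awake_seconds   = 16_000 * 3600                                # four-year-old, ~16,000 h awake
optic_bytes     = awake_seconds * 2e6 * 1.0                    # 2M fibers x ~1 byte/s
                                                               # -> ~1e14 bytes as well
print(f"text: {text_bytes:.1e} bytes, reading time: {reading_years:,.0f} years")
print(f"vision by age four: {optic_bytes:.1e} bytes")
```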
What that tells you is that we're not going to get to human-level AI by training on text. It's just not happening. Despite what Dario Amodei is saying, that we're going to have super-intelligent, PhD-level assistants by next year or the year after, it's not going to happen, unless you redefine what it means to be intelligent. If you constrain the definition of intelligence to being able to solve certain math problems and other things of that type, then yes, maybe. But that's not full intelligence. There's one immediate consequence of this, which is that we're not going to have super-intelligent systems anytime soon, and we're not going to die next year. Okay.
So how do babies learn how the world works? They basically learn it mostly by observation in the first three months of life, because they can't really interact with the world beyond their limbs. They certainly build a model of their own body, because they can move their limbs, but the model of the external world comes mostly through passive perception. They can move their eyes, obviously, but they can't really grab objects; that comes later. Things like gravity and inertia, what we call intuitive physics, are not acquired by infants until the age of about nine months. Before nine months, if you show the scenario at the bottom here to a six-month-old, where an object sits on a platform, you push it off the platform, and it appears to float in the air, the six-month-old will barely pay attention. A ten-month-old will react like the little girl here, because in the meantime the baby has learned that objects that are not supported are supposed to fall. Whenever our mental model of the world is violated, it makes us pay attention, because of course it could be dangerous.
So that gives us a list of what we want advanced machine intelligence systems to have. We want systems that learn world models from sensory input. Why from sensory input? Because it's much higher bandwidth than text and you don't need humans to produce it; you can have as much data as you want. So a task like learning intuitive physics from video would be a very good thing to get machines to do, and I'll show some examples of that in a few minutes. We want systems that have persistent memory. We want systems that can plan complex actions so as to fulfill an objective, and for that you need a mental model of the world, because you need to be able to predict what the consequences of your actions are going to be. We need systems that can reason, obviously, but that's really much the same thing as manipulating a mental model. And we need systems that are controllable and safe, which is not the case for LLMs. Some of my colleagues don't like me to say this, but LLMs are intrinsically unsafe in a way, because they just produce answers that satisfy the statistics of the training data; there's no direct way of controlling what they say. They're still useful, there are still a lot of interesting applications you can build with them, and we should absolutely work on them in a big way, but they're not going to take us to human-level AI.
Okay. So first of all, what type of inference should an AI system be able to do? As I explained earlier, an LLM produces an answer by just propagating through a bunch of layers of a neural net and then producing a token, and that's not computationally sufficient for intelligent behavior, reasoning, or planning. What you want is to produce an output through an inference process. Imagine you have an observation, you have a system, and you propose a potential output to the system; the system gives you a scalar output that measures to what extent the proposed output is compatible with the input, or rather to what extent it is incompatible with the input. A system of this type can perform inference by optimization: it searches for an output that minimizes a particular objective function. The square on the right marked "objective" is a function with a scalar output (not represented) that tells you to what extent the output is incompatible with the input: if the value is zero, they're compatible; the larger it is, the less compatible they are.
This type of inference is very classical. It's the type of inference that classical AI has used for a long time. Reasoning in classical AI, and in optimal control and robotics, is performed by optimizing a function with respect to the output: you search for an output that minimizes some function. That's what you do when you compute a shortest path, when you plan a trajectory from one city to another, or when you solve a SAT problem; basically, most computational tasks can be reduced to an optimization problem. And of course it's exactly what probabilistic graphical models and Bayesian nets were doing: when you try to figure out the optimal value of a latent variable whose value you don't know, you're basically minimizing a negative log-likelihood, or a free energy if you're in the log domain. So this is very classical.
And this is the type of process that psychologists would qualify as System 2. In psychology there is System 1 and System 2. System 1 covers the tasks you can accomplish without thinking about them; you become so used to them that they become automatic and subconscious. If you're an experienced driver, you don't need to think about driving; you just drive. System 2 is when you recruit the full power of your mind and your mental model of the world to plan an action, potentially imagining all kinds of catastrophe scenarios and how to avoid them. So compare a person driving a car for the first time with an experienced driver: the first person uses System 2, the second uses System 1.
Okay. So how are we going to formalize this a little bit? This is not really complicated theory, but the way you would build such a function, one that measures the degree of incompatibility between an input and an output, can be captured by the notion of energy-based models. An energy-based model is a weaker form of probabilistic modeling, if you want. You have a function F, called an energy or free energy, which depends on two variables; in this little diagram they are scalar variables, x, which you observe, and y, which you're supposed to infer. The dependency between x and y, the relation that produces y from x, is not a function, because you may have multiple values of y that are compatible with a single value of x. So you have to represent it with an implicit function, and that's this energy function. The energy function F(x, y) is zero when y is compatible with x and takes a larger positive value when y is not compatible with x. If you build the thing properly, this energy landscape is reasonably smooth, so that if I give you a value of x, you can easily find a value of y that minimizes the energy through optimization, perhaps using gradient information or something like that.
To train a system like this, you need to learn the parameters of this energy function, which could be some big neural net, in such a way that it gives you low energy on examples of x and y that are compatible and high energy for everything else. And that's the hard part; I'll come back to it later.
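Here is a minimal sketch of that kind of inference by optimization: given x, search for the y that minimizes a scalar energy F(x, y) by gradient descent. The energy network is an untrained stand-in; the point is only the mechanics of optimizing over the output rather than computing it in a single forward pass.

```python
# Sketch of inference by optimization with an energy-based model: given x,
# search for the y that minimizes a scalar energy F(x, y) by gradient descent.
# The energy network here is an untrained stand-in, just to show the mechanics.
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    def __init__(self, dx=8, dy=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dx + dy, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))                     # scalar incompatibility score

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

F_energy = EnergyNet()
x = torch.randn(1, 8)                                 # the observation, held fixed

y = torch.zeros(1, 4, requires_grad=True)             # initial guess for the output
opt = torch.optim.SGD([y], lr=0.1)
for step in range(100):                               # inference = minimize F over y
    opt.zero_grad()
    energy = F_energy(x, y)
    energy.backward()
    opt.step()
print("final energy:", float(F_energy(x, y)))
```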
Okay, so assuming you have such an energy function, how would you structure it internally, and how would you use it for reasoning and planning? Here is an example block diagram of the internal structure of this energy function. You have an observation on the left that goes through a perception module, some big neural net that produces a representation of the state of the world as you perceive it. You might want to combine this with the contents of a memory, which represents everything else you know about the world but are not currently perceiving. That gives you an idea of the state of the world. You feed this to your world model, which is the centerpiece of the architecture. The world model also takes a proposal for an action sequence to accomplish; given the state of the world and the action sequence, the world model predicts the next state of the world, the predicted representation of the state, after the action sequence has been carried out. You can feed this prediction to a few objective functions. One is a task objective, which measures to what extent you're accomplishing a task; the task could be specified through another input to that objective function. Then there is perhaps a set of other objectives that act as guardrails: guardrails guarantee that the sequence of actions to be taken is not dangerous, or that it respects boundaries of some kind. The way the system operates is that, given an input, a perception, it searches, through optimization, for an action sequence that minimizes the objectives and the guardrails. You can think of the guardrails as constraints that need to be satisfied, as opposed to costs, or you can view them as cost penalty functions if you want. So that's an example of inference by optimization.
Now, in classical control theory, a world model is something that, given the state of the world at time t and an action you might take at time t, gives you the state of the world at time t + Δt. Say your world model is some differential equation that governs the dynamics of a robot arm or a rocket or something like that. You might want to run your world model multiple times, unfolded in time as represented here. Here you have a sequence of two actions; you feed it to your world model, but at the second step the world model takes as input the predicted output from the previous time step. The guardrail cost can be applied to the entire trajectory, not just the final state, and the same is true for the task cost. This type of architecture is very classical in optimal control; this kind of operation is called model predictive control. You plan a sequence of actions such that, according to your world model, a particular objective will be fulfilled, and you do this by optimization.
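As a rough illustration of model-predictive control in this style, here is a sketch that unrolls a toy world model over a candidate action sequence and optimizes the actions against a task cost plus a guardrail penalty. The dynamics, costs, dimensions, and optimizer settings are all placeholder assumptions, not anything from an actual control stack.

```python
# Sketch of planning as model-predictive control: unroll a world model over a
# candidate action sequence and optimize the actions to minimize a task cost
# plus a guardrail penalty. Dynamics, costs, and dimensions are placeholders.
import torch

def world_model(state, action):            # toy deterministic dynamics s_{t+1} = f(s_t, a_t)
    return state + 0.1 * action

def task_cost(state, goal):                # how far the final state is from the goal
    return ((state - goal) ** 2).sum()

def guardrail_cost(state):                 # penalize leaving a "safe" box |s| <= 2
    return torch.relu(state.abs() - 2.0).sum()

horizon, dim = 10, 3
state0 = torch.zeros(dim)
goal = torch.tensor([1.0, -0.5, 0.7])

actions = torch.zeros(horizon, dim, requires_grad=True)   # the plan to be optimized
opt = torch.optim.Adam([actions], lr=0.05)

for it in range(200):
    opt.zero_grad()
    s, cost = state0, 0.0
    for t in range(horizon):                               # unroll the model in time
        s = world_model(s, actions[t])
        cost = cost + guardrail_cost(s)                    # guardrails along the trajectory
    cost = cost + task_cost(s, goal)                       # task objective on the final state
    cost.backward()
    opt.step()

print("planned first action:", actions[0].detach())
```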
Now, in reality, the world is not completely deterministic, so you may have to do this in the presence of uncertainty. The extra variables here, the latent variables, can be thought of as variables that represent everything you don't know about the environment, everything that makes it non-deterministic, if you want. You draw those variables from some distribution, or optimize them in some way, in order to do the planning. That makes everything much more complicated, so we're going to ignore it for the time being.
Ultimately, though, what you would want is not planning at a single level, because that could be extremely difficult or even impossible; you just don't have the information. It turns out I have to be in Paris tomorrow; it's a true story. To go to Paris and be there tomorrow morning, I cannot plan my entire trip in terms of millisecond-by-millisecond muscle control, which is really the lowest level of action a human can take. Instead, I have to plan at a very high, abstract level, where I don't need all the information to do the planning. I don't need to know much to know that to fly to Paris, I need to go to the airport first and catch a plane.
Then going to the airport becomes my subgoal, and I need to figure out how to do that. The example I have is from my office at NYU: if I'm in New York, I just go out on the street and hail a taxi; you can do that in New York. Now, I'm sitting in my office at NYU; how do I get to the street? I need to walk to the elevator, press the button, walk out of the building. How do I get to the elevator? I need to stand up from my chair, pick up my bag, open the door, walk to the elevator while avoiding potential obstacles on the way, and so on. And there is some point in the hierarchy where I can just act: I don't need to know more than what I currently perceive, and I don't need to plan, because I'm used to standing up from my chair. I can just accomplish the task. Almost everything we do uses this kind of hierarchical planning, and that's true of animals as well. Here's the thing, though: we have no idea how to do this with machines. We have no idea how to train a machine so that it has a world model that can do this type of hierarchical planning. Robots do hierarchical planning, but the levels are hardwired by hand, if you want.
How do we train a system to do this from examples? That led me to the idea that all of the components you see here can be put into some sort of architecture that people have called a cognitive architecture, whose centerpiece is a world model. This is actually represented roughly where it could sit in the human brain: the world model is basically your prefrontal cortex, perception is in the back, motor control is kind of in the middle, and the short-term memory is the hippocampus, on the inside. Then in the basal ganglia, at the bottom of the brain, you have a bunch of those objective functions, at least if you squint. So that leads to this architecture, called objective-driven AI, which is what I just described. I wrote a kind of vision paper about this about three years ago and put it online on OpenReview; it's not on arXiv, only on OpenReview, so that people can comment. It's called "A Path Towards Autonomous Machine Intelligence," and it explains a lot of what I just described.
Okay. So how far have we gone in that direction? Can we train a system to learn how the world works from video, the same way we train a system to produce language by training it to predict the next word in a text? Can we do this with video: show a video to the system and then train it to predict what's going to happen next? If the system is able to do a good job at this, that means it has understood the underlying nature of the world. It probably knows that the world is three-dimensional, that some objects, animate ones, can move spontaneously, and that most other objects obey simple rules of physics, and so on.
So can we use the same techniques that we used so successfully for text to train a video system? And the answer is absolutely not, despite what you might read. There are a lot of people in the field today who strongly believe that the way you get a system to understand the real world is to train it to predict, at the pixel level, what happens in a video. I used to believe this until about five years ago. I completely changed my mind, and I've become philosophically opposed to the very idea. It just doesn't work, and we've tried: I've worked on this for the better part of the last 20 years, trying to predict what's going to happen in a video, based on the idea that you need self-supervised learning. You don't want to train a system to accomplish a particular task; you just want to train it to understand the world, and then learning a particular task will be very simple and fast. Which is why I never believed in reinforcement learning, for example.
So why does it work so well for text, and why does it not work for video? The answer is very simple: text is simple. Language is simple. We think of language as the epitome of human intelligence, but in fact, no, language is simple. It's like chess; chess is simple. It's like computing integrals: hard for humans, but algorithmically not that hard. So why is it simple to train a machine to predict the next word? Well, you can never predict the actual word that follows a text, but what you can do is produce a probability distribution over all possible words in your dictionary, and you can do that because there is only a finite number of words or tokens. So you can handle uncertainty pretty easily. For video, though, we don't have a way of representing a normalized distribution over all possible video frames. Not only do we not have it, it's actually an intractable mathematical problem: we don't even know how to represent proper distributions in such high-dimensional continuous spaces. We can represent energy functions, which would be the unnormalized logarithm of a distribution, but even those are not very good, and we certainly don't know how to normalize them. So the whole idea of probabilistic modeling basically has to be thrown out the window if we want to predict what goes on in a video.
And the main issue is that if we train a system to predict what goes on in a video at the pixel level, we get blurry predictions. We get the kind of prediction you see here; this is old work from 2016, almost ten years old now. This is the kind of prediction you get when you train a pretty big neural net to predict what's going to happen in short videos. The second column at the bottom is also what you get when you try to predict the trajectories of cars on a highway: blurry.
My solution to this is something I call joint embedding predictive architecture, and this is what it looks like. What's the difference from what I showed you before? The difference is that y now goes through an encoder. You take a video, call it y; you corrupt it, or transform it, or encode an action that may have taken place between the second video and the first, and call this x; you run it through an encoder as well; and then you try to predict not the entire video, but the representation of that video. So instead of making predictions at the pixel level, you make predictions in representation space. If the variable here is an action being taken in the world, then this is a world model: given the representation of the state of the world at time t, s_x, and an action, it predicts the representation of the state of the world that will result from taking that action. That's the difference between the two architectures: on the left you have generative architectures, which predict y; on the right you have joint embedding predictive architectures, JEPAs, which predict an abstract representation of y, in which all the details of y that are essentially unpredictable have been eliminated.
So if I take a video of this room, starting from the left and slowly panning toward the right, and I stop right here and ask the system to predict the rest of the video, it will predict that the camera is going to continue panning. It will probably predict that this is a room. It might even predict that there are blinds on this side, because it can figure out the lighting pattern on everyone's faces. But there's no way it can predict what every one of you looks like; absolutely no way. The information is just not there in the initial segment. It can't predict which chairs are occupied, and it can't predict the detailed texture of the carpet. So if you train a system to predict at the pixel level, it's going to spend all of its resources predicting things it cannot predict, and as a result it will produce a blurry mess, which is basically the average of all the things that could plausibly happen. If you do this in representation space instead, you simplify the problem a lot.
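Here is a minimal sketch of the forward pass of such an architecture, with untrained stand-in networks: encode x and y, predict the representation of y from the representation of x and an action or latent variable z, and measure the error in representation space rather than in pixel space. The shapes, the shared encoder, and the action dimension are illustrative assumptions; real systems use large vision encoders.

```python
# Sketch of a joint-embedding predictive architecture (JEPA): encode x and y,
# and predict the *representation* of y from the representation of x and an
# action/latent z, rather than predicting y at the pixel level.
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU(),
                    nn.Linear(128, 64))               # shared encoder for x and y
predictor = nn.Sequential(nn.Linear(64 + 8, 128), nn.ReLU(),
                          nn.Linear(128, 64))         # predicts s_y from (s_x, z)

x = torch.randn(16, 3, 32, 32)                        # current frames
y = torch.randn(16, 3, 32, 32)                        # next frames
z = torch.randn(16, 8)                                # action taken between them

s_x, s_y = enc(x), enc(y)
s_y_pred = predictor(torch.cat([s_x, z], dim=-1))

# Prediction error in representation space. On its own this loss can collapse
# (the encoder could output a constant); the remedies discussed later
# (contrastive terms, EMA targets, variance-covariance penalties) prevent that.
loss = ((s_y_pred - s_y.detach()) ** 2).mean()
loss.backward()
```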
But then there's one complicated problem, which is how you train those architectures. There are several types of joint embedding architectures. The one on the left is a pure joint embedding architecture; if the two encoders are identical and share the same weights, it's called a Siamese neural network, a concept from one of my papers in the early '90s. The JEPA with a predictor is the one in the middle, and the action-conditioned JEPA is the one on the right, where the z variable is either a latent variable or an action being taken; that one can be seen as a kind of causal model of what happens in the world given an action or transformation that may occur. That's the kind of architecture we're going to need to train. So how are we going to train it?
That's where the energy-based model story becomes useful. As I said before, the way you would train an energy-based model of this type is to make sure that the energy produced by the system, the value of those objective functions, is low for training samples of x and y and high everywhere else. It's easy enough to make the energy low for training samples: just show an example of x and y from your training set and tweak the parameters of the system so that the scalar energy goes down. Super easy. The problem is how to make sure the energy is higher everywhere else, and there are two major categories of methods for this (there are more, but two major ones). If you don't explicitly make sure that the energy is higher outside the manifold of the data, you run the risk of a collapsed system: one that gives you zero energy for every pair of x and y, which is not a good model. You need a contrast between the good pairs of (x, y) and the bad pairs.
So, two classes of methods. Contrastive methods consist in generating other pairs (x, y) that are not on the data manifold and then pushing their energy up. There's an issue with contrastive methods. I used to be a fan of them; I kind of invented them with this Siamese network stuff. But they don't work very well in high dimension: if you have a high-dimensional representation space, the number of contrastive samples you have to generate grows, in the limit, exponentially with the dimension, so it doesn't scale very well. Experimentally it doesn't work very well either.
The alternative is what are called regularized methods: methods where, either by construction or through a regularizer, the volume of space that can take low energy is limited or minimized somehow. That sounds a little mysterious, but I'll give you an example of how it can be done. So again: contrastive methods versus regularized methods, where regularized methods try to minimize the volume of space that can take low energy.
The way we test this experimentally is that we train one of those joint embedding, or JEPA, architectures on unlabeled data, either video or images or whatever. Then we take the encoder and use the representation it has learned as input to a supervised classifier or predictor, which we train in supervised mode to solve a particular task: object recognition, segmentation, action recognition in video, whatever it is. That's the standard way you evaluate self-supervised learning.
As for contrastive methods, I said I didn't want to use them, but just for completeness: there's an old paper of mine from 1993 where we proposed this; it has come to be known as metric learning. You try to train a neural net to project inputs into a space where Euclidean distance makes semantic sense, if you want. There were a couple of other papers from my lab in the mid-2000s, and then a paper from Google in 2020 called SimCLR, which showed that this type of method can give decent performance for self-supervised training of image recognition and object recognition systems. But the representations produced by those methods tend to be low-dimensional, not more than 200 or so.
There's another set of methods, which I haven't mentioned yet, that are sort of regularized methods but not completely well understood; they are based on distillation, and they're very popular at the moment. Basically, they again consist of two encoders, but the weights of the encoder on the right are an exponential moving average of the weights of the encoder on the left. Think of it this way: the encoder on the left is trained, and its weights can change pretty quickly, and it's trained against the output coming out of the encoder on the right. You do not backpropagate gradients into the encoder on the right; the red cross here indicates that gradients are not backpropagated. You just set the weights of the encoder on the right to the exponential moving average of the weights on the left. Somehow this works, and if you do it right it prevents a collapse, a collapse being the situation where the encoder simply ignores the input and produces a constant output, which of course would minimize the prediction error. Somehow it prevents collapse. There are some theoretical papers on this, but it's still a little mysterious, I should say. Still, there's a bunch of papers that use this idea, and they work really well.
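Here is a minimal sketch of that distillation setup: an online encoder trained to match a target encoder whose weights are an exponential moving average of the online weights, with no gradient flowing into the target branch. The networks, the augmentations, and the decay rate are toy assumptions; real methods of this family (BYOL, DINO, and the like) add a predictor head, centering, and other details.

```python
# Sketch of the distillation trick: the online encoder is trained to match the
# output of a target encoder whose weights are an exponential moving average
# (EMA) of the online weights, and no gradient flows into the target branch.
import copy
import torch
import torch.nn as nn

online = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
target = copy.deepcopy(online)                     # same architecture, EMA weights
for p in target.parameters():
    p.requires_grad_(False)                        # no backprop into the target

opt = torch.optim.Adam(online.parameters(), lr=1e-3)
tau = 0.99                                         # EMA decay rate (assumed value)

for step in range(100):
    x = torch.randn(8, 32)
    view1 = x + 0.1 * torch.randn_like(x)          # two "views" of the same input
    view2 = x + 0.1 * torch.randn_like(x)

    pred = online(view1)
    with torch.no_grad():                          # stop-gradient on the target branch
        tgt = target(view2)
    loss = ((pred - tgt) ** 2).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()

    with torch.no_grad():                          # EMA update of the target weights
        for p_t, p_o in zip(target.parameters(), online.parameters()):
            p_t.mul_(tau).add_((1 - tau) * p_o)
```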
So: BYOL from DeepMind, SimSiam from my colleagues at Meta from 2020, and DINO, DINOv2, I-JEPA and V-JEPA, which I'm going to talk a little bit about. There's a particular self-supervised image feature extraction system called DINOv2, produced by my colleagues at FAIR Paris, that works really well. People use DINOv2 features for all kinds of applications in computer vision: just use it as a generic feature extractor, feed it to a supervised classifier, and with just a few samples you can train your classifier to accomplish just about any vision task. One of our colleagues, for example, took satellite images along with a small amount of labeled data where the height of the canopy was annotated for some areas. She trained a head using that small amount of data, applied it to the entire Earth, and was then able to estimate how much carbon is captured in vegetation in total across the planet, a quantity that is interesting to know for climate-change prediction. Literally hundreds of people use these features.
Okay, so here's a version of this that we can use to build a world model; it's called DINO-WM, the DINO world model. This is a recent paper on arXiv by a student co-advised by Lerrel Pinto, who is a roboticist at NYU, and myself, Gaoyue Zhou. The basic idea is that you take the representation of an image produced by DINOv2, and the representation of an image of the environment after you've taken a particular action, and you train a predictor. You keep the encoder fixed, but you train the predictor to predict the representation of the world after the action has been carried out. Once you have that action-conditioned predictor, you can use it for planning. You show the system an initial state, extract the features with DINOv2, run your world model for multiple time steps, maybe ten or so, and then measure the distance in representation space to a target state you want to attain, for example a configuration you want a robot to reach through a certain number of actions. You feed the target state to the encoder as well, measure the distance in representation space, and then, by optimization, figure out a sequence of actions that minimizes this task objective. And this works really well.
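Here is a condensed sketch of that planning loop, with untrained stand-in networks in place of the DINOv2 encoder and the trained predictor. It only shows the structure just described (frozen encoder, action-conditioned predictor in representation space, optimization of an action sequence to minimize the distance to the encoded goal image), not the actual DINO-WM implementation; dimensions and optimizer settings are assumptions.

```python
# Condensed sketch of planning with a latent world model: a frozen image encoder,
# an action-conditioned predictor in representation space, and an action
# sequence optimized so that the predicted final representation matches the
# encoded goal image. All networks are untrained stand-ins.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU(),
                        nn.Linear(256, 128))
predictor = nn.Sequential(nn.Linear(128 + 2, 256), nn.ReLU(),
                          nn.Linear(256, 128))        # (s_t, a_t) -> s_{t+1}
for p in list(encoder.parameters()) + list(predictor.parameters()):
    p.requires_grad_(False)                           # both are fixed at planning time

obs = torch.randn(1, 3, 64, 64)                       # current camera image
goal = torch.randn(1, 3, 64, 64)                      # image of the desired configuration
s0, s_goal = encoder(obs), encoder(goal)

horizon = 10
actions = torch.zeros(horizon, 1, 2, requires_grad=True)   # e.g. (dx, dy) pushes
opt = torch.optim.Adam([actions], lr=0.05)

for it in range(200):
    opt.zero_grad()
    s = s0
    for t in range(horizon):                          # imagine the rollout in latent space
        s = predictor(torch.cat([s, actions[t]], dim=-1))
    loss = ((s - s_goal) ** 2).mean()                 # distance to the goal representation
    loss.backward()
    opt.step()
```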
What you see here are reconstructions produced by a separately trained decoder. We don't use a decoder for training; there is no decoder, and that's very important. Whenever you train an image encoder through reconstruction, it doesn't work very well; it doesn't work as well as training with those joint embedding criteria. At the top is a problem where you have a little blue dot that you can move around, and it can push a T-shaped object. The top row shows a sequence of actions being taken and the result of applying that sequence in the actual environment, which is a simulated one. The bottom row is the prediction the DINO world model produces: we take the representations resulting from predicting through the sequence of actions, and for each time step we run them through the decoder so we can produce a corresponding image. Again, this decoder is trained separately, afterwards. You can see that it's pretty similar to what goes on in the first row; the last row is almost identical to the first one. Everything in between is other methods that people have proposed to solve this problem using various techniques, and none of them does nearly as good a job.
Now, the dynamics here is super simple, but there are other situations where the dynamics is much more complicated. Here, the robot comes down onto a platter and moves by delta-x, delta-y, and that pushes a bunch of blue chips which push onto each other. It's a fairly complex dynamical system, one you can simulate but can't really reduce to equations that would let you plan. Again, at the top is the ground truth, what actually occurs after executing a certain number of actions, and at the bottom is a decoded version of the internal state of the system after having carried out the same imagined actions. The result is pretty similar, and again the other methods are not nearly as good. I'm not going to bore you with more results; it just works better.
Okay, let me show you some videos; it's more fun. You can apply this to all kinds of situations: train world models or predictors for pushing a T around, for navigating through a maze, or for those tasks of moving a little string or a bunch of chips. Let me show you the last video; I think it's the most fun one. You give the system arbitrary goals and an initial condition, and what you see at the bottom is the robot using planned actions to try to reproduce the target configuration it's observing, limited to a relatively small number of actions. Some of this is done open-loop, some of it using what's called receding-horizon planning. We can apply this to all kinds of situations that require planning, and it works pretty well, but there's still a lot more work to do there.
Now, those architectures involve a bit of a cheat that I should mention: the encoder is pre-trained. It's pre-trained with self-supervised learning using a distillation method (DINOv2 uses a distillation method to train the visual encoder), but the predictor, the world model, is trained separately, assuming the encoder is already there. In I-JEPA and V-JEPA, everything is trained at once, also using a distillation method, but we train the predictor simultaneously. There are two papers on this, one very recent on V-JEPA. I'm going to go fast here, but this technique of using a distillation method to train one of those JEPA architectures works really well for learning visual features, for recognizing objects, segmenting images, or doing other tasks. It's much faster than alternative methods of self-supervised learning, and it gives better results. DINOv2 gives better results because it's trained on much bigger datasets, and we haven't yet compared them exactly. What we compare it with here is another project at FAIR called MAE, the masked autoencoder. That method is basically a denoising autoencoder: you take an image, mask certain parts of it, and train a gigantic neural net to reconstruct the full image from the partially masked one. It doesn't work as well and it's much more expensive; in fact, that project was cancelled, abandoned. So the main lesson is: if you're going to train a system to produce representations of images or video, do not train it by reconstruction. It's not going to work. Which is why I don't think any of the efforts to use models like Sora, video prediction systems, to produce world models are ever going to work. It's a complete waste of time.
Okay, so V-JEPA is just a version of I-JEPA for video, where the input to the system is a sequence of 16 frames, partially masked, and you train the system to predict the representation of the full video from the representation of the partially masked one. This works really well; it produces features that are very useful if you want to classify, for example, an action that takes place in a video. Again, I'm not going to bore you with details. It can basically hallucinate what goes on in the missing parts of the video: this is a partially masked video, and what you see on the right is an output produced by a separately trained decoder, using a diffusion model. But it's important that the decoder is not used for training the internal model; if you do that, it actually doesn't work as well.
Now here's a surprising result, from a paper we just posted on arXiv a few weeks ago. If you take this V-JEPA model and show it videos in which something really weird occurs, something that's not physically possible, the system can tell you that it's not physically possible. What you do is take the 16-frame input window, slide it over a longer video, and measure the system's prediction error: you measure to what extent it can predict the representation of the full video from a partial one, or future frames from current ones, even though it hasn't been trained to do exactly that. If something really unusual appears in the video, like an object spontaneously disappearing, the prediction error goes through the roof; it tells you that something very strange is happening there. In fact, if you construct a dataset of pairs of videos, where one is a perfectly normal video in which physics is not violated and the other is one in which some aspect of physics is violated, an object disappears or changes shape or changes color or something like that, then with practically 100% accuracy the system can tell you which one is less plausible than the other.
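A rough sketch of that sliding-window "surprise" measurement is below. The encoder, the predictor, and the way the context window is summarized are stand-in assumptions; the only point is the shape of the procedure, scoring each time step by the prediction error in representation space and flagging spikes.

```python
# Sketch of the "surprise" measurement described above: slide a short window
# over a longer video, let a (stand-in) encoder and predictor guess the
# representation of the upcoming frame, and record the prediction error.
# A spike in this error is the signal that something implausible happened.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
predictor = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))

video = torch.randn(100, 3, 32, 32)        # T frames of a (fake) video
window, errors = 16, []

with torch.no_grad():
    feats = encoder(video)                 # (T, 64) per-frame representations
    for t in range(window, video.shape[0]):
        context = feats[t - window:t].mean(dim=0)      # crude summary of the window
        pred_next = predictor(context)
        errors.append(((pred_next - feats[t]) ** 2).mean().item())

# With a trained model, a large error at time t would flag an implausible event.
print(max(errors), sum(errors) / len(errors))
```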
So that's very interesting, and performance is better or worse depending on the type of common sense you're testing the system for.
But here is what we are really moving toward: techniques for training those JEPA architectures with regularized methods based on information maximization. A bunch of them have been proposed over the last five years or so: MCR² from Yi Ma's group at Berkeley, Barlow Twins from my group at Meta, VICReg, which is also from my group at Meta, and MMCR from my colleagues at NYU in computational neuroscience. So how do they work?
Those methods basically say: I want to prevent the system from collapsing, from minimizing the prediction error by ignoring the input and producing a constant representation at the output of the encoder. One way to do this is to have some measure of the information that comes out of the encoder and try to maximize it. Now, we have an information theorist in the audience, and if he had some hair left, he would be pulling it out; my apologies, we're old friends and colleagues. Here is the problem: we do not have any good way of maximizing information content, because that would require a lower bound on information content, and we don't have lower bounds on information content, only upper bounds. So what we're going to do is take some measure of information content that we know is only an upper bound, push up on it anyway, and cross our fingers that the actual information content follows. That's as justified as I can make it. The authors of those other papers have other justifications: trying to produce efficient coding, trying to make sure that the representation vectors coming out of the encoder fill up the representation space, or are uniform on a sphere, things like that. But basically they're all surrogates for some measure of information content, computed under some assumptions, and because of those assumptions they're all upper bounds. The assumptions basically ignore dependencies between variables.
So how can we do this? Basically, we can do it by making sure that the variables coming out of the encoder, the components of the vector the encoder produces, are somewhat independent, or at least uncorrelated. Suppose we feed a batch of samples to the encoder and collect a matrix, where each row is a sample and each column is one variable, one component of the representation vector. We then have two types of methods. Sample-contrastive methods, the contrastive methods I was talking about earlier, try to make sure that the rows of that matrix are all different from each other, possibly orthogonal to each other if you can, or maximally different from each other; that's what SimCLR and various other Siamese-net methods are doing. The alternative I'm proposing here is to make the columns of that matrix independent, or at least pairwise orthogonal, which is another way of saying that I want the variables to be uncorrelated. There are various specific ways of doing this, and those papers have different recipes. The variance-covariance regularization method I mentioned does it by making sure the variables have a standard deviation of one, or at least one, and then making sure that the off-diagonal terms of the covariance matrix, which is this matrix transposed and multiplied by itself, are as close to zero as possible; that guarantees that the variables are uncorrelated. And then you simultaneously minimize the prediction error. What you get in the end is a system that finds a trade-off between extracting as much information as possible from the input and only extracting the information that is actually predictable by the predictor, eliminating everything in the observations, in x and y, that is not predictable: all the stuff in y that is not predictable from x, essentially.
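Here is a compact sketch of the variance-covariance idea just described, applied to a batch of embeddings. The loss weights and the margin are illustrative assumptions, not the published hyperparameters, and in a full JEPA the variance and covariance terms would be applied to the encoder outputs of both branches.

```python
# Compact sketch of variance-covariance regularization on a batch of embeddings:
# keep each dimension's standard deviation above a margin (variance term) and
# push the off-diagonal entries of the covariance matrix toward zero
# (covariance term), alongside the prediction error.
import torch

def variance_covariance_terms(z, margin=1.0, eps=1e-4):
    # z: (batch, dim) matrix of embeddings; rows = samples, columns = variables.
    z = z - z.mean(dim=0)
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = torch.relu(margin - std).mean()            # hinge: keep std >= margin
    cov = (z.T @ z) / (z.shape[0] - 1)                     # (dim, dim) covariance matrix
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / z.shape[1]          # decorrelate the variables
    return var_loss, cov_loss

s_y_pred = torch.randn(256, 64, requires_grad=True)       # predictor output (stand-in)
s_y      = torch.randn(256, 64)                           # target representation (stand-in)

pred_loss = ((s_y_pred - s_y) ** 2).mean()                 # prediction error term
var_loss, cov_loss = variance_covariance_terms(s_y)
total = pred_loss + 25.0 * var_loss + 1.0 * cov_loss       # weights are assumptions
total.backward()
```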
If you train a system like this, completely self-supervised, on unlabeled data, then take the representation, feed it to a supervised classifier, and measure the performance, it works well; I'm not going to bore you with numbers. But what we can also do is train a world model with this method. This is a very recent paper, just out last week on arXiv, by two of my students, Vlad Sobal and Kevin Jong, together with some of my colleagues: Cho Colero, who was a postdoc at FAIR and is now a professor at Brown, and Tim Brner, a postdoc at NYU who is on the job market.
So, basically, the idea here, let me jump directly to it, is again a world model, but the encoder and the predictor are trained simultaneously, and collapse is prevented using the variance-covariance regularization, the information maximization I just talked about, together with minimizing the prediction error. It's trained on sequences of video frames, and once you have this world model trained end to end, you can use it for planning. The system can actually plan pretty well in simple cases, but there's still a lot of work to do to scale this up and make it work in more complex situations.
This is another very recent project, by Amir Bar, who is a postdoc at Meta, which also uses a world model, trained in a different fashion, to plan in the real world. You train it from videos taken from a robot: the robot sits at a particular position, it moves by a known quantity, and then you have another video frame, so you can train the system to predict what the world is going to look like if you execute a particular motion. Then you can use it to plan trajectories, and it works, which is pretty cool. So those systems can predict what the world is going to look like if you follow a trajectory, and therefore they can plan a sequence of actions so that the world ends up looking like a particular target.
There are lots of fun videos there. Okay, let me conclude, because I'm badly out of time. I have a number of recommendations. Abandon generative models in favor of those joint embedding architectures. Everybody's talking about generative AI, everybody's working on generative AI, everybody's assuming that AI just is generative AI, and I'm telling you: forget about generative AI. Abandon probabilistic models in favor of energy-based models. Probabilistic modeling is the main theoretical framework on which all of machine learning is based, but it really doesn't work in the context of making non-deterministic predictions in high-dimensional continuous spaces. So I'm telling you: use those energy-based methods. You don't need to normalize anything; they're like unnormalized logarithms of probabilities, if you want, but you have a much bigger space at your disposal. Abandon contrastive methods, again something that's very popular at the moment, in favor of those regularized methods. And of course, abandon reinforcement learning; I've been saying this for ten years.
So if you are interested in human-level AI, do not work on LLMs. Work on LLMs if you want an engineering job next year; but if you want a research job on a topic that is not going to fall out of favor three or five years from now, don't work on LLMs. I mean, LLMs will be useful, just not as the centerpiece of AI systems. We have a lot of problems to solve; the research program built around this is probably a ten-year program for a lot of people.
So if you're looking for a good topic for a PhD in AI, there are a lot of problems to solve within this framework: training large-scale world models on all kinds of inputs; figuring out good planning algorithms, optimization algorithms for planning (it turns out gradient-based methods tend to get stuck in local minima, so we might have to use more sophisticated optimization like ADMM, or some amount of gradient-free optimization, which I'd like to stay away from); dealing with planning under uncertainty with latent variables; hierarchical planning, which is completely unsolved, completely open; associative memory, which I didn't talk about; and then some slightly more theoretical issues: mathematical foundations for energy-based learning and inference; learning cost modules, because here we only used very simple situations where the cost module can be built by hand, it's just a Euclidean distance in representation space, whereas in most cases you probably have to learn the cost function; planning with an inaccurate world model; adjusting the world model as you go; and so on.
What this model tells you is that you have three ways to be stupid. The first is that your world model can be wrong, so the effects of your actions may not be the ones you think they would be. The second is that your cost functions might be inappropriate: they might lead to outcomes that are not the ones you expect, or you may not have any guardrails, which will lead you to do really bad things to good people without realizing it. And the third, even if your world model and your cost function are good, is not being able to find a sequence of actions that actually fulfills your objective. In some AI systems, and certainly in some people, all three are bad: they don't have the right world model, they don't have the right cost function because they have no morals, and whatever action they decide to take is completely ineffective. That's a good description of some people in government who shall remain unnamed.
Okay. So, in the future, we're going to have virtual assistants with us at all times, helping us in our daily lives, and those systems will eventually constitute a kind of repository of all human knowledge. We will not go to a search engine, or even a library (unless we really like going to libraries, which I do); we'll just ask a question of our AI assistant. We may even pose a problem to our AI assistant, and it might be able to solve it. Now, all of our digital diet is going to be mediated by those AI assistants, and it would be extremely dangerous for everything about humanity if those AI assistants came from a handful of companies on the West Coast of the US, or from China. For the sake of linguistic and cultural diversity, of different value systems, political leanings, whatever it is, we cannot possibly get all of our information diet from just a few systems of this type. We need high diversity in AI assistants for the same reason that we need high diversity in the media and the press. And because those systems are so expensive to train, the only way to get there is if the people who have the means to train foundation models release them in open source, so that a lot of other people can fine-tune them for whatever language, culture, value system, or centers of interest they have. I don't see any alternative to this.
So I think, in terms of ethics (this is, after all, a symposium about ethics), the most important aspect of AI ethics today is not bias, it's not whether AI is going to kill us all, it's none of that. It's whether we are going to have the tools to build highly diverse AI systems, so that we don't get all of our information from just a handful of them, and that means open-source foundation models. So if you have any level of influence on anyone, particularly in government, make sure that governments don't make laws that would make open source risky, complicated, or illegal; there are proposals in that direction, and that would be a very bad outcome, I think, for the future. But if we manage to preserve diversity, then perhaps humanity will go through a new renaissance, because if we have access to all the world's knowledge in an even more efficient fashion than we currently do, and we have systems that assist us in all the decisions we make every day, it will amplify human intelligence. It would be as if everyone were walking around with a staff of super-smart people working for them. We should not be scared by the fact that they would be smarter than us, because we set the objectives for them; which is why I think this idea of objective-driven AI is so important. We'd be like politicians, who don't know anything but have a staff of people, experts in various topics, who advise them. We'll all be pointy-haired managers of virtual people. Thank you very much.
[Applause]
[Music]