Yann LeCun at Duke's Responsible AI Symposium
Transcript
[Music]
All right. So I'm going to talk a bit about where I think AI is going over the next few years, and this will cover some of the research I've personally been involved in over the last several years.
First, what I want to say is that regardless of what your interest in AI is, at some point we're going to need human-level AI systems, because they're going to assist us in our daily lives, maybe with us at all times. And the best way to interact with such a system is if it has an intelligence similar to human intelligence, because we're familiar with interacting with other humans. That was depicted in the movie Her, the Spike Jonze film from 2013.
Eventually, at least if Meta's vision is fulfilled, we'll be interacting with those agents through smart devices: smart glasses and things of that type, augmented-reality or mixed-reality glasses. But for this we need technology that doesn't exist yet. We need systems that understand how the world works, that can remember, reason, and plan. And we're nowhere near this at the moment.
On the hardware side, we're making progress toward devices that let us have assistants with us at all times. Currently they're in our smartphones, which is somewhat inconvenient. If they could display information in smart glasses, and we could talk to them or interact with them through an electromyography bracelet that lets us point, click, and type with our hands in our pockets, that would be more convenient. That is going to happen. What's also going to happen over the next decade or so is practical hardware platforms for things like humanoid robots and domestic robots. But we don't have the technology that would make those assistants and domestic robots sufficiently smart yet, and that needs to make significant progress over the next five years, roughly.
And the problem is that machine learning sucks. In terms of sample efficiency, in terms of being able to acquire new skills quickly, machine learning is nowhere near the kind of capabilities that we observe not just in humans, but in most animals. Animals can learn new tasks extremely quickly, they understand how the world works, and many of them can certainly reason and plan. They have what we would qualify as common sense. And their behavior is not driven by the statistics of their training data; it's driven by objectives that were hardwired into them, and into us, by evolution. Okay, so we have some progress to make.
Now let's talk about what the current AI technology that everybody talks about actually does: autoregressive large language models. We call them LLMs, or chatbots, but they're really autoregressive LLMs. They're trained to predict the next word that follows a text. Many of you already know this: you take a text and you train a system to predict the next word that follows that text.
Now, because of the architecture of the kind of network that's trained to do this, you can actually train it in parallel. In effect, you train those systems as a kind of autoencoder: you train the system to reproduce its input on its output, but because it cannot look at a particular input token to produce the corresponding output, it can only look at the tokens to its left. That's called a causal architecture. In effect you're training it to produce the word that follows a text, and you're training it to do this for every word in the text in parallel, which is very efficient. So it scales very well. You can scale those systems to hundreds of billions of parameters, and some sort of emergent property comes out: those systems seem to have some level of understanding of the underlying reality, but it's pretty shallow.
Once you've trained a system of this type, you can have it produce text autoregressively: it predicts the next word that follows a particular prompt you give it; then you shift that word into its input, shift everything by one, and it predicts the second word; shift that in, then the third word, and so on. That's autoregressive prediction. It's not a new concept; it's been around since the 1950s, if not the 1940s.
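To make the mechanics concrete, here is a minimal sketch in PyTorch of both pieces just described: the parallel next-token training objective with a causal mask, and the shift-in-and-repeat decoding loop. The tiny model, vocabulary size, and random tokens are illustrative assumptions, not the architecture of any particular LLM.

```python
# Minimal sketch of next-token training with a causal mask and greedy
# autoregressive decoding. Model size, vocabulary, and data are toy placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

V, D, T = 100, 32, 16          # vocab size, embedding dim, sequence length

class TinyCausalLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(V, D)
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, V)

    def forward(self, tokens):                        # tokens: (B, T)
        t = tokens.size(1)
        # Causal mask: position i may only attend to positions <= i.
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        h = self.backbone(self.embed(tokens), mask=mask)
        return self.head(h)                           # (B, T, V) logits

model = TinyCausalLM()
tokens = torch.randint(0, V, (8, T))                  # fake training batch

# Training objective: every position predicts the *next* token, in parallel.
logits = model(tokens)
loss = F.cross_entropy(logits[:, :-1].reshape(-1, V), tokens[:, 1:].reshape(-1))
loss.backward()

# Autoregressive decoding: predict a token, shift it in, predict the next one.
prompt = torch.randint(0, V, (1, 4))
for _ in range(10):
    next_logits = model(prompt)[:, -1]                # scores over the next token
    next_token = next_logits.argmax(dim=-1, keepdim=True)   # greedy choice
    prompt = torch.cat([prompt, next_token], dim=1)
```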
But there is a major limitation with this, which is that there is no way for the system to really reflect on anything: it spends the exact same amount of computation for every word or token it produces. If you ask a system like this a yes/no question such as "does 2 + 2 equal 4?", it spends a particular quantity of computation to answer. Now ask "does P equal NP?", and it will spend the exact same amount of computation and then produce an answer, yes or no, which will most likely be wrong, or at least unjustified if it happens to be right. That's not right: we tend to spend a lot more time on complex questions than on simple ones.
There is another problem, which holds under some assumptions. Think of the space of all possible sequences of tokens or words as a tree; within it there is a subtree of acceptable answers to the question. At every token the system produces, there is some probability that it takes you out of that subtree of acceptable answers. Now make an incredibly strong assumption: that this probability is constant and independent for every token in the answer, which is of course false, but take it as a first approximation. What that means is that the probability of staying within the subtree decreases exponentially as the answer gets longer, and drifting out becomes almost inevitable. So the longer the output those systems produce, the more likely it is to become wrong or irrelevant. This is a very informal argument, but there are some papers that study it more carefully.
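The argument can be made concrete with a one-line calculation; the per-token error rates and answer lengths below are arbitrary illustrative assumptions, not measured values.

```python
# Toy illustration of the divergence argument: if each generated token has an
# independent probability e of leaving the set of acceptable answers (a strong,
# admittedly false assumption), the chance of still being acceptable after n
# tokens is (1 - e)**n, which decays exponentially with answer length.
for e in (0.001, 0.01, 0.05):            # hypothetical per-token error rates
    for n in (10, 100, 1000):            # answer lengths in tokens
        print(f"e={e:>5}, n={n:>4}: P(still acceptable) = {(1 - e) ** n:.4f}")
```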
So we're missing something really big, even bigger than the small issues I just mentioned. Never mind humans: cats and dogs can do amazing things that we cannot come even close to reproducing with our AI systems. Any house cat can plan highly complex actions. Just observe a cat standing at the bottom of a bunch of furniture, thinking about how it's going to jump to the top. They clearly plan.
In humans, any 10-year-old can clear the dinner table and fill the dishwasher without being trained to do it. The first time you ask a 10-year-old to do it, he or she will be able to. That's called zero-shot: no training needed, the ability to solve new tasks without actually training ourselves to do them. Any 17-year-old can learn to drive a car in about 20 hours of practice. Yet we don't have domestic robots, and we don't even have self-driving cars, despite having millions of hours of training data of cars being driven by expert drivers. We could train a system to simply emulate the human driver, but when we do, we don't get a system that is nearly as reliable as a human, and it cannot handle very unusual situations. So we have systems that can pass the bar exam, solve complex math problems, prove theorems, and beat us at chess, Go, poker, Diplomacy, and whatever else, but we don't have level-five self-driving cars and we don't have domestic robots. That's another example of Moravec's paradox. Moravec was a roboticist, and he asked: how come all the things we think of as sophisticated intellectual tasks, like playing chess or planning a path from one city to another, can be solved with computers, and computers can even solve them better than humans, yet the things we take for granted, dealing with the real world, acting in the real world, are things we don't know how to do with machines?
That points to an interesting observation. A typical LLM today, like Llama, is trained on on the order of 30 trillion tokens, that is, 3 × 10^13 tokens. A token is a kind of subword unit, worth roughly three quarters of a word, and each token is about three bytes for Llama. So the complete data volume is 0.9 × 10^14, basically 10^14 bytes. That's essentially the entirety of the publicly available text on the internet. It would take any of us on the order of 400,000 years to read it, reading 12 hours a day, which obviously we would not survive.
Now consider a human child. A four-year-old has been awake a total of about 16,000 hours; that's what psychologists tell us. Which, by the way, is not a big amount of video: it's about 30 minutes' worth of YouTube uploads. We have 2 million optic nerve fibers, 1 million per optic nerve, going to the brain, to the visual cortex, and each fiber carries about one byte per second, give or take. Do the arithmetic and the result is about 10^14 bytes. So in four years, a child has seen as much raw data as the biggest LLMs trained on all the publicly available text on the internet, in about 100,000 times less time than it would take to read the material.
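For reference, here is that arithmetic spelled out. The reading speed is an assumption added for the check; the other figures are the ones quoted above.

```python
# Back-of-the-envelope check of the numbers quoted above. The reading speed
# (250 words/minute) is an added assumption; the other figures are from the talk.
tokens          = 30e12                                        # ~30 trillion training tokens
bytes_per_token = 3
text_bytes      = tokens * bytes_per_token                     # ~0.9e14, basically 1e14 bytes

words           = tokens * 0.75                                # ~3/4 word per token
reading_years   = words / (250 * 60 * 12 * 365)                # 250 wpm, 12 h/day
                                                               # -> a few hundred thousand years
awake_seconds   = 16_000 * 3600                                # four-year-old, ~16,000 h awake
optic_bytes     = awake_seconds * 2e6 * 1.0                    # 2M fibers x ~1 byte/s
                                                               # -> ~1e14 bytes as well
print(f"text: {text_bytes:.1e} bytes, reading time: {reading_years:,.0f} years")
print(f"vision by age four: {optic_bytes:.1e} bytes")
```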
What that tells you is that we're not going to get to human-level AI by training on text. It's just not happening. Despite what Dario Amodei is saying, that we're going to have super-intelligent, PhD-level assistants by next year or the year after, it's not going to happen, unless you redefine what it means to be intelligent. If you constrain the definition of intelligence to being able to solve certain math problems and other things of that type, then yes, maybe. But that's not full intelligence. There's one immediate consequence of this, which is that we're not going to have super-intelligent systems anytime soon, and we're not going to die next year. Okay.
So how do babies learn how the world works? They basically learn it mostly by observation in the first three months of life, because they can't really interact with the world beyond their limbs. They certainly build a model of their own body, because they can move their limbs, but the model of the external world comes mostly through passive perception. They can move their eyes, obviously, but they can't really grab objects; that comes later. Things like gravity and inertia, what we call intuitive physics, are not acquired by infants until the age of about nine months. Before nine months, if you show the scenario at the bottom here to a six-month-old, where an object sits on a platform, you push it off the platform, and it appears to float in the air, the six-month-old will barely pay attention. A ten-month-old will react like the little girl here, because in the meantime the baby has learned that objects that are not supported are supposed to fall. Whenever our mental model of the world is violated, it makes us pay attention, because of course it could be dangerous.
So that gives us a list of what we want advanced machine intelligence systems to have. We want systems that learn world models from sensory input. Why from sensory input? Because it's much higher bandwidth than text and you don't need humans to produce it; you can have as much data as you want. So a task like learning intuitive physics from video would be a very good thing to get machines to do, and I'll show some examples of that in a few minutes. We want systems that have persistent memory. We want systems that can plan complex actions so as to fulfill an objective, and for that you need a mental model of the world, because you need to be able to predict what the consequences of your actions are going to be. We need systems that can reason, obviously, but that's really much the same thing as manipulating a mental model. And we need systems that are controllable and safe, which is not the case for LLMs. Some of my colleagues don't like me to say this, but LLMs are intrinsically unsafe in a way, because they just produce answers that satisfy the statistics of the training data; there's no direct way of controlling what they say. They're still useful, there are still a lot of interesting applications you can build with them, and we should absolutely work on them in a big way, but they're not going to take us to human-level AI.
Okay. So first of all, what type of inference should an AI system be able to do? As I explained earlier, an LLM produces an answer by just propagating through a bunch of layers of a neural net and then producing a token, and that's not computationally sufficient for intelligent behavior, reasoning, or planning. What you want is to produce an output through an inference process. Imagine you have an observation, you have a system, and you propose a potential output to the system; the system gives you a scalar output that measures to what extent the proposed output is compatible with the input, or rather to what extent it is incompatible with the input. A system of this type can perform inference by optimization: it searches for an output that minimizes a particular objective function. The square on the right marked "objective" is a function with a scalar output (not represented) that tells you to what extent the output is incompatible with the input: if the value is zero, they're compatible; the larger it is, the less compatible they are.
This type of inference is very classical. It's the type of inference that classical AI has used for a long time. Reasoning in classical AI, and in optimal control and robotics, is performed by optimizing a function with respect to the output: you search for an output that minimizes some function. That's what you do when you compute a shortest path, when you plan a trajectory from one city to another, or when you solve a SAT problem; basically, most computational tasks can be reduced to an optimization problem. And of course it's exactly what probabilistic graphical models and Bayesian nets were doing: when you try to figure out the optimal value of a latent variable whose value you don't know, you're basically minimizing a negative log-likelihood, or a free energy if you're in the log domain. So this is very classical.
And this is the type of process that psychologists would qualify as System 2. In psychology there is System 1 and System 2. System 1 covers the tasks you can accomplish without thinking about them; you become so used to them that they become automatic and subconscious. If you're an experienced driver, you don't need to think about driving; you just drive. System 2 is when you recruit the full power of your mind and your mental model of the world to plan an action, potentially imagining all kinds of catastrophe scenarios and how to avoid them. So compare a person driving a car for the first time with an experienced driver: the first person uses System 2, the second uses System 1.
Okay. So how are we going to formalize this a little bit? This is not really complicated theory, but the way you would build such a function, one that measures the degree of incompatibility between an input and an output, can be captured by the notion of energy-based models. An energy-based model is a weaker form of probabilistic modeling, if you want. You have a function F, called an energy or free energy, which depends on two variables; in this little diagram they are scalar variables, x, which you observe, and y, which you're supposed to infer. The dependency between x and y, the relation that produces y from x, is not a function, because you may have multiple values of y that are compatible with a single value of x. So you have to represent it with an implicit function, and that's this energy function. The energy function F(x, y) is zero when y is compatible with x and takes a larger positive value when y is not compatible with x. If you build the thing properly, this energy landscape is reasonably smooth, so that if I give you a value of x, you can easily find a value of y that minimizes the energy through optimization, perhaps using gradient information or something like that.
To train a system like this, you need to learn the parameters of this energy function, which could be some big neural net, in such a way that it gives you low energy on examples of x and y that are compatible and high energy for everything else. And that's the hard part; I'll come back to it later.
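Here is a minimal sketch of that kind of inference by optimization: given x, search for the y that minimizes a scalar energy F(x, y) by gradient descent. The energy network is an untrained stand-in; the point is only the mechanics of optimizing over the output rather than computing it in a single forward pass.

```python
# Sketch of inference by optimization with an energy-based model: given x,
# search for the y that minimizes a scalar energy F(x, y) by gradient descent.
# The energy network here is an untrained stand-in, just to show the mechanics.
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    def __init__(self, dx=8, dy=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dx + dy, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))                     # scalar incompatibility score

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

F_energy = EnergyNet()
x = torch.randn(1, 8)                                 # the observation, held fixed

y = torch.zeros(1, 4, requires_grad=True)             # initial guess for the output
opt = torch.optim.SGD([y], lr=0.1)
for step in range(100):                               # inference = minimize F over y
    opt.zero_grad()
    energy = F_energy(x, y)
    energy.backward()
    opt.step()
print("final energy:", float(F_energy(x, y)))
```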
Okay, so assuming you have such an energy function, how would you structure it internally, and how would you use it for reasoning and planning? Here is an example block diagram of the internal structure of this energy function. You have an observation on the left that goes through a perception module, some big neural net that produces a representation of the state of the world as you perceive it. You might want to combine this with the contents of a memory, which represents everything else you know about the world but are not currently perceiving. That gives you an idea of the state of the world. You feed this to your world model, which is the centerpiece of the architecture. The world model also takes a proposal for an action sequence to accomplish; given the state of the world and the action sequence, the world model predicts the next state of the world, the predicted representation of the state, after the action sequence has been carried out. You can feed this prediction to a few objective functions. One is a task objective, which measures to what extent you're accomplishing a task; the task could be specified through another input to that objective function. Then there is perhaps a set of other objectives that act as guardrails: guardrails guarantee that the sequence of actions to be taken is not dangerous, or that it respects boundaries of some kind. The way the system operates is that, given an input, a perception, it searches, through optimization, for an action sequence that minimizes the objectives and the guardrails. You can think of the guardrails as constraints that need to be satisfied, as opposed to costs, or you can view them as cost penalty functions if you want. So that's an example of inference by optimization.
Now, in classical control theory, a world model is something that, given the state of the world at time t and an action you might take at time t, gives you the state of the world at time t + Δt. Say your world model is some differential equation that governs the dynamics of a robot arm or a rocket or something like that. You might want to run your world model multiple times, unfolded in time as represented here. Here you have a sequence of two actions; you feed it to your world model, but at the second step the world model takes as input the predicted output from the previous time step. The guardrail cost can be applied to the entire trajectory, not just the final state, and the same is true for the task cost. This type of architecture is very classical in optimal control; this kind of operation is called model predictive control. You plan a sequence of actions such that, according to your world model, a particular objective will be fulfilled, and you do this by optimization.
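As a rough illustration of model-predictive control in this style, here is a sketch that unrolls a toy world model over a candidate action sequence and optimizes the actions against a task cost plus a guardrail penalty. The dynamics, costs, dimensions, and optimizer settings are all placeholder assumptions, not anything from an actual control stack.

```python
# Sketch of planning as model-predictive control: unroll a world model over a
# candidate action sequence and optimize the actions to minimize a task cost
# plus a guardrail penalty. Dynamics, costs, and dimensions are placeholders.
import torch

def world_model(state, action):            # toy deterministic dynamics s_{t+1} = f(s_t, a_t)
    return state + 0.1 * action

def task_cost(state, goal):                # how far the final state is from the goal
    return ((state - goal) ** 2).sum()

def guardrail_cost(state):                 # penalize leaving a "safe" box |s| <= 2
    return torch.relu(state.abs() - 2.0).sum()

horizon, dim = 10, 3
state0 = torch.zeros(dim)
goal = torch.tensor([1.0, -0.5, 0.7])

actions = torch.zeros(horizon, dim, requires_grad=True)   # the plan to be optimized
opt = torch.optim.Adam([actions], lr=0.05)

for it in range(200):
    opt.zero_grad()
    s, cost = state0, 0.0
    for t in range(horizon):                               # unroll the model in time
        s = world_model(s, actions[t])
        cost = cost + guardrail_cost(s)                    # guardrails along the trajectory
    cost = cost + task_cost(s, goal)                       # task objective on the final state
    cost.backward()
    opt.step()

print("planned first action:", actions[0].detach())
```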
Now, in reality, the world is not completely deterministic, so you may have to do this in the presence of uncertainty. The extra variables here, the latent variables, can be thought of as variables that represent everything you don't know about the environment, everything that makes it non-deterministic, if you want. You draw those variables from some distribution, or optimize them in some way, in order to do the planning. That makes everything much more complicated, so we're going to ignore it for the time being.
Ultimately, though, what you would want is not planning at a single level, because that could be extremely difficult or even impossible; you just don't have the information. It turns out I have to be in Paris tomorrow; it's a true story. To go to Paris and be there tomorrow morning, I cannot plan my entire trip in terms of millisecond-by-millisecond muscle control, which is really the lowest level of action a human can take. Instead, I have to plan at a very high, abstract level, where I don't need all the information to do the planning. I don't need to know much to know that to fly to Paris, I need to go to the airport first and catch a plane.
Then going to the airport becomes my subgoal, and I need to figure out how to do that. The example I have is from my office at NYU: if I'm in New York, I just go out on the street and hail a taxi; you can do that in New York. Now, I'm sitting in my office at NYU; how do I get to the street? I need to walk to the elevator, press the button, walk out of the building. How do I get to the elevator? I need to stand up from my chair, pick up my bag, open the door, walk to the elevator while avoiding potential obstacles on the way, and so on. And there is some point in the hierarchy where I can just act: I don't need to know more than what I currently perceive, and I don't need to plan, because I'm used to standing up from my chair. I can just accomplish the task. Almost everything we do uses this kind of hierarchical planning, and that's true of animals as well. Here's the thing, though: we have no idea how to do this with machines. We have no idea how to train a machine so that it has a world model that can do this type of hierarchical planning. Robots do hierarchical planning, but the levels are hardwired by hand, if you want.
How do we train a system to do this from examples? That led me to the idea that all of the components you see here can be put into some sort of architecture that people have called a cognitive architecture, whose centerpiece is a world model. This is actually represented roughly where it could sit in the human brain: the world model is basically your prefrontal cortex, perception is in the back, motor control is kind of in the middle, and the short-term memory is the hippocampus, on the inside. Then in the basal ganglia, at the bottom of the brain, you have a bunch of those objective functions, at least if you squint. So that leads to this architecture, called objective-driven AI, which is what I just described. I wrote a kind of vision paper about this about three years ago and put it online on OpenReview; it's not on arXiv, only on OpenReview, so that people can comment. It's called "A Path Towards Autonomous Machine Intelligence," and it explains a lot of what I just described.
Okay. So how far have we gone in that direction? Can we train a system to learn how the world works from video, the same way we train a system to produce language by training it to predict the next word in a text? Can we do this with video: show a video to the system and then train it to predict what's going to happen next? If the system is able to do a good job at this, that means it has understood the underlying nature of the world. It probably knows that the world is three-dimensional, that some objects, animate ones, can move spontaneously, and that most other objects obey simple rules of physics, and so on.
So can we use the same techniques that we used so successfully for text to train a video system? And the answer is absolutely not, despite what you might read. There are a lot of people in the field today who strongly believe that the way you get a system to understand the real world is to train it to predict, at the pixel level, what happens in a video. I used to believe this until about five years ago. I completely changed my mind, and I've become philosophically opposed to the very idea. It just doesn't work, and we've tried: I've worked on this for the better part of the last 20 years, trying to predict what's going to happen in a video, based on the idea that you need self-supervised learning. You don't want to train a system to accomplish a particular task; you just want to train it to understand the world, and then learning a particular task will be very simple and fast. Which is why I never believed in reinforcement learning, for example.
So why does it work so well for text, and why does it not work for video? The answer is very simple: text is simple. Language is simple. We think of language as the epitome of human intelligence, but in fact, no, language is simple. It's like chess; chess is simple. It's like computing integrals: hard for humans, but algorithmically not that hard. So why is it simple to train a machine to predict the next word? Well, you can never predict the actual word that follows a text, but what you can do is produce a probability distribution over all possible words in your dictionary, and you can do that because there is only a finite number of words or tokens. So you can handle uncertainty pretty easily. For video, though, we don't have a way of representing a normalized distribution over all possible video frames. Not only do we not have it, it's actually an intractable mathematical problem: we don't even know how to represent proper distributions in such high-dimensional continuous spaces. We can represent energy functions, which would be the unnormalized logarithm of a distribution, but even those are not very good, and we certainly don't know how to normalize them. So the whole idea of probabilistic modeling basically has to be thrown out the window if we want to predict what goes on in a video.
And the main issue is that if we train a system to predict what goes on in a video at the pixel level, we get blurry predictions. We get the kind of prediction you see here; this is old work from 2016, almost ten years old now. This is the kind of prediction you get when you train a pretty big neural net to predict what's going to happen in short videos. The second column at the bottom is also what you get when you try to predict the trajectories of cars on a highway: blurry.
My solution to this is something I call joint embedding predictive architecture, and this is what it looks like. What's the difference from what I showed you before? The difference is that y now goes through an encoder. You take a video, call it y; you corrupt it, or transform it, or encode an action that may have taken place between the second video and the first, and call this x; you run it through an encoder as well; and then you try to predict not the entire video, but the representation of that video. So instead of making predictions at the pixel level, you make predictions in representation space. If the variable here is an action being taken in the world, then this is a world model: given the representation of the state of the world at time t, s_x, and an action, it predicts the representation of the state of the world that will result from taking that action. That's the difference between the two architectures: on the left you have generative architectures, which predict y; on the right you have joint embedding predictive architectures, JEPAs, which predict an abstract representation of y, in which all the details of y that are essentially unpredictable have been eliminated.
So if I take a video of this room, starting from the left and slowly panning toward the right, and I stop right here and ask the system to predict the rest of the video, it will predict that the camera is going to continue panning. It will probably predict that this is a room. It might even predict that there are blinds on this side, because it can figure out the lighting pattern on everyone's faces. But there's no way it can predict what every one of you looks like; absolutely no way. The information is just not there in the initial segment. It can't predict which chairs are occupied, and it can't predict the detailed texture of the carpet. So if you train a system to predict at the pixel level, it's going to spend all of its resources predicting things it cannot predict, and as a result it will produce a blurry mess, which is basically the average of all the things that could plausibly happen. If you do this in representation space instead, you simplify the problem a lot.
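Here is a minimal sketch of the forward pass of such an architecture, with untrained stand-in networks: encode x and y, predict the representation of y from the representation of x and an action or latent variable z, and measure the error in representation space rather than in pixel space. The shapes, the shared encoder, and the action dimension are illustrative assumptions; real systems use large vision encoders.

```python
# Sketch of a joint-embedding predictive architecture (JEPA): encode x and y,
# and predict the *representation* of y from the representation of x and an
# action/latent z, rather than predicting y at the pixel level.
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU(),
                    nn.Linear(128, 64))               # shared encoder for x and y
predictor = nn.Sequential(nn.Linear(64 + 8, 128), nn.ReLU(),
                          nn.Linear(128, 64))         # predicts s_y from (s_x, z)

x = torch.randn(16, 3, 32, 32)                        # current frames
y = torch.randn(16, 3, 32, 32)                        # next frames
z = torch.randn(16, 8)                                # action taken between them

s_x, s_y = enc(x), enc(y)
s_y_pred = predictor(torch.cat([s_x, z], dim=-1))

# Prediction error in representation space. On its own this loss can collapse
# (the encoder could output a constant); the remedies discussed later
# (contrastive terms, EMA targets, variance-covariance penalties) prevent that.
loss = ((s_y_pred - s_y.detach()) ** 2).mean()
loss.backward()
```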
But then there's one complicated problem, which is how you train those architectures. There are several types of joint embedding architectures. The one on the left is a pure joint embedding architecture; if the two encoders are identical and share the same weights, it's called a Siamese neural network, a concept from one of my papers in the early '90s. The JEPA with a predictor is the one in the middle, and the action-conditioned JEPA is the one on the right, where the z variable is either a latent variable or an action being taken; that one can be seen as a kind of causal model of what happens in the world given an action or transformation that may occur. That's the kind of architecture we're going to need to train. So how are we going to train it?
That's where the energy-based model story becomes useful. As I said before, the way you would train an energy-based model of this type is to make sure that the energy produced by the system, the value of those objective functions, is low for training samples of x and y and high everywhere else. It's easy enough to make the energy low for training samples: just show an example of x and y from your training set and tweak the parameters of the system so that the scalar energy goes down. Super easy. The problem is how to make sure the energy is higher everywhere else, and there are two major categories of methods for this (there are more, but two major ones). If you don't explicitly make sure that the energy is higher outside the manifold of the data, you run the risk of a collapsed system: one that gives you zero energy for every pair of x and y, which is not a good model. You need a contrast between the good pairs of (x, y) and the bad pairs.
So, two classes of methods. Contrastive methods consist in generating other pairs (x, y) that are not on the data manifold and then pushing their energy up. There's an issue with contrastive methods. I used to be a fan of them; I kind of invented them with this Siamese network stuff. But they don't work very well in high dimension: if you have a high-dimensional representation space, the number of contrastive samples you have to generate grows, in the limit, exponentially with the dimension, so it doesn't scale very well. Experimentally it doesn't work very well either.
The alternative is what are called regularized methods: methods where, either by construction or through a regularizer, the volume of space that can take low energy is limited or minimized somehow. That sounds a little mysterious, but I'll give you an example of how it can be done. So again: contrastive methods versus regularized methods, where regularized methods try to minimize the volume of space that can take low energy.
The way we test this experimentally is that we train one of those joint embedding, or JEPA, architectures on unlabeled data, either video or images or whatever. Then we take the encoder and use the representation it has learned as input to a supervised classifier or predictor, which we train in supervised mode to solve a particular task: object recognition, segmentation, action recognition in video, whatever it is. That's the standard way you evaluate self-supervised learning.
As for contrastive methods, I said I didn't want to use them, but just for completeness: there's an old paper of mine from 1993 where we proposed this; it has come to be known as metric learning. You try to train a neural net to project inputs into a space where Euclidean distance makes semantic sense, if you want. There were a couple of other papers from my lab in the mid-2000s, and then a paper from Google in 2020 called SimCLR, which showed that this type of method can give decent performance for self-supervised training of image recognition and object recognition systems. But the representations produced by those methods tend to be low-dimensional, not more than 200 or so.
There's another set of methods, which I haven't mentioned yet, that are sort of regularized methods but not completely well understood; they are based on distillation, and they're very popular at the moment. Basically, they again consist of two encoders, but the weights of the encoder on the right are an exponential moving average of the weights of the encoder on the left. Think of it this way: the encoder on the left is trained, and its weights can change pretty quickly, and it's trained against the output coming out of the encoder on the right. You do not backpropagate gradients into the encoder on the right; the red cross here indicates that gradients are not backpropagated. You just set the weights of the encoder on the right to the exponential moving average of the weights on the left. Somehow this works, and if you do it right it prevents a collapse, a collapse being the situation where the encoder simply ignores the input and produces a constant output, which of course would minimize the prediction error. Somehow it prevents collapse. There are some theoretical papers on this, but it's still a little mysterious, I should say. Still, there's a bunch of papers that use this idea, and they work really well.
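Here is a minimal sketch of that distillation setup: an online encoder trained to match a target encoder whose weights are an exponential moving average of the online weights, with no gradient flowing into the target branch. The networks, the augmentations, and the decay rate are toy assumptions; real methods of this family (BYOL, DINO, and the like) add a predictor head, centering, and other details.

```python
# Sketch of the distillation trick: the online encoder is trained to match the
# output of a target encoder whose weights are an exponential moving average
# (EMA) of the online weights, and no gradient flows into the target branch.
import copy
import torch
import torch.nn as nn

online = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
target = copy.deepcopy(online)                     # same architecture, EMA weights
for p in target.parameters():
    p.requires_grad_(False)                        # no backprop into the target

opt = torch.optim.Adam(online.parameters(), lr=1e-3)
tau = 0.99                                         # EMA decay rate (assumed value)

for step in range(100):
    x = torch.randn(8, 32)
    view1 = x + 0.1 * torch.randn_like(x)          # two "views" of the same input
    view2 = x + 0.1 * torch.randn_like(x)

    pred = online(view1)
    with torch.no_grad():                          # stop-gradient on the target branch
        tgt = target(view2)
    loss = ((pred - tgt) ** 2).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()

    with torch.no_grad():                          # EMA update of the target weights
        for p_t, p_o in zip(target.parameters(), online.parameters()):
            p_t.mul_(tau).add_((1 - tau) * p_o)
```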
So: BYOL from DeepMind, SimSiam from my colleagues at Meta from 2020, and DINO, DINOv2, I-JEPA and V-JEPA, which I'm going to talk a little bit about. There's a particular self-supervised image feature extraction system called DINOv2, produced by my colleagues at FAIR Paris, that works really well. People use DINOv2 features for all kinds of applications in computer vision: just use it as a generic feature extractor, feed it to a supervised classifier, and with just a few samples you can train your classifier to accomplish just about any vision task. One of our colleagues, for example, took satellite images along with a small amount of labeled data where the height of the canopy was annotated for some areas. She trained a head using that small amount of data, applied it to the entire Earth, and was then able to estimate how much carbon is captured in vegetation in total across the planet, a quantity that is interesting to know for climate-change prediction. Literally hundreds of people use these features.
Okay, so here's a version of this that we can use to build a world model; it's called DINO-WM, the DINO world model. This is a recent paper on arXiv by a student co-advised by Lerrel Pinto, who is a roboticist at NYU, and myself, Gaoyue Zhou. The basic idea is that you take the representation of an image produced by DINOv2, and the representation of an image of the environment after you've taken a particular action, and you train a predictor. You keep the encoder fixed, but you train the predictor to predict the representation of the world after the action has been carried out. Once you have that action-conditioned predictor, you can use it for planning. You show the system an initial state, extract the features with DINOv2, run your world model for multiple time steps, maybe ten or so, and then measure the distance in representation space to a target state you want to attain, for example a configuration you want a robot to reach through a certain number of actions. You feed the target state to the encoder as well, measure the distance in representation space, and then, by optimization, figure out a sequence of actions that minimizes this task objective. And this works really well.
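Here is a condensed sketch of that planning loop, with untrained stand-in networks in place of the DINOv2 encoder and the trained predictor. It only shows the structure just described (frozen encoder, action-conditioned predictor in representation space, optimization of an action sequence to minimize the distance to the encoded goal image), not the actual DINO-WM implementation; dimensions and optimizer settings are assumptions.

```python
# Condensed sketch of planning with a latent world model: a frozen image encoder,
# an action-conditioned predictor in representation space, and an action
# sequence optimized so that the predicted final representation matches the
# encoded goal image. All networks are untrained stand-ins.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU(),
                        nn.Linear(256, 128))
predictor = nn.Sequential(nn.Linear(128 + 2, 256), nn.ReLU(),
                          nn.Linear(256, 128))        # (s_t, a_t) -> s_{t+1}
for p in list(encoder.parameters()) + list(predictor.parameters()):
    p.requires_grad_(False)                           # both are fixed at planning time

obs = torch.randn(1, 3, 64, 64)                       # current camera image
goal = torch.randn(1, 3, 64, 64)                      # image of the desired configuration
s0, s_goal = encoder(obs), encoder(goal)

horizon = 10
actions = torch.zeros(horizon, 1, 2, requires_grad=True)   # e.g. (dx, dy) pushes
opt = torch.optim.Adam([actions], lr=0.05)

for it in range(200):
    opt.zero_grad()
    s = s0
    for t in range(horizon):                          # imagine the rollout in latent space
        s = predictor(torch.cat([s, actions[t]], dim=-1))
    loss = ((s - s_goal) ** 2).mean()                 # distance to the goal representation
    loss.backward()
    opt.step()
```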
What you see here are reconstructions produced by a separately trained decoder. We don't use a decoder for training; there is no decoder, and that's very important. Whenever you train an image encoder through reconstruction, it doesn't work very well; it doesn't work as well as training with those joint embedding criteria. At the top is a problem where you have a little blue dot that you can move around, and it can push a T-shaped object. The top row shows a sequence of actions being taken and the result of applying that sequence in the actual environment, which is a simulated one. The bottom row is the prediction the DINO world model produces: we take the representations resulting from predicting through the sequence of actions, and for each time step we run them through the decoder so we can produce a corresponding image. Again, this decoder is trained separately, afterwards. You can see that it's pretty similar to what goes on in the first row; the last row is almost identical to the first one. Everything in between is other methods that people have proposed to solve this problem using various techniques, and none of them does nearly as good a job.
Now, the dynamics here is super simple, but there are other situations where the dynamics is much more complicated. Here, the robot comes down onto a platter and moves by delta-x, delta-y, and that pushes a bunch of blue chips which push onto each other. It's a fairly complex dynamical system, one you can simulate but can't really reduce to equations that would let you plan. Again, at the top is the ground truth, what actually occurs after executing a certain number of actions, and at the bottom is a decoded version of the internal state of the system after having carried out the same imagined actions. The result is pretty similar, and again the other methods are not nearly as good. I'm not going to bore you with more results; it just works better.
Okay, let me show you some videos; it's more fun. You can apply this to all kinds of situations: train world models or predictors for pushing a T around, for navigating through a maze, or for those tasks of moving a little string or a bunch of chips. Let me show you the last video; I think it's the most fun one. You give the system arbitrary goals and an initial condition, and what you see at the bottom is the robot using planned actions to try to reproduce the target configuration it's observing, limited to a relatively small number of actions. Some of this is done open-loop, some of it using what's called receding-horizon planning. We can apply this to all kinds of situations that require planning, and it works pretty well, but there's still a lot more work to do there.
Now, those architectures involve a bit of a cheat that I should mention: the encoder is pre-trained. It's pre-trained with self-supervised learning using a distillation method (DINOv2 uses a distillation method to train the visual encoder), but the predictor, the world model, is trained separately, assuming the encoder is already there. In I-JEPA and V-JEPA, everything is trained at once, also using a distillation method, but we train the predictor simultaneously. There are two papers on this, one very recent on V-JEPA. I'm going to go fast here, but this technique of using a distillation method to train one of those JEPA architectures works really well for learning visual features, for recognizing objects, segmenting images, or doing other tasks. It's much faster than alternative methods of self-supervised learning, and it gives better results. DINOv2 gives better results because it's trained on much bigger datasets, and we haven't yet compared them exactly. What we compare it with here is another project at FAIR called MAE, the masked autoencoder. That method is basically a denoising autoencoder: you take an image, mask certain parts of it, and train a gigantic neural net to reconstruct the full image from the partially masked one. It doesn't work as well and it's much more expensive; in fact, that project was cancelled, abandoned. So the main lesson is: if you're going to train a system to produce representations of images or video, do not train it by reconstruction. It's not going to work. Which is why I don't think any of the efforts to use models like Sora, video prediction systems, to produce world models are ever going to work. It's a complete waste of time.
Okay, so V-JEPA is just a version of I-JEPA for video, where the input to the system is a sequence of 16 frames, partially masked, and you train the system to predict the representation of the full video from the representation of the partially masked one. This works really well; it produces features that are very useful if you want to classify, for example, an action that takes place in a video. Again, I'm not going to bore you with details. It can basically hallucinate what goes on in the missing parts of the video: this is a partially masked video, and what you see on the right is an output produced by a separately trained decoder, using a diffusion model. But it's important that the decoder is not used for training the internal model; if you do that, it actually doesn't work as well.
Now here's a surprising result, from a paper we just posted on arXiv a few weeks ago. If you take this V-JEPA model and show it videos in which something really weird occurs, something that's not physically possible, the system can tell you that it's not physically possible. What you do is take the 16-frame input window, slide it over a longer video, and measure the system's prediction error: you measure to what extent it can predict the representation of the full video from a partial one, or future frames from current ones, even though it hasn't been trained to do exactly that. If something really unusual appears in the video, like an object spontaneously disappearing, the prediction error goes through the roof; it tells you that something very strange is happening there. In fact, if you construct a dataset of pairs of videos, where one is a perfectly normal video in which physics is not violated and the other is one in which some aspect of physics is violated, an object disappears or changes shape or changes color or something like that, then with practically 100% accuracy the system can tell you which one is less plausible than the other.
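A rough sketch of that sliding-window "surprise" measurement is below. The encoder, the predictor, and the way the context window is summarized are stand-in assumptions; the only point is the shape of the procedure, scoring each time step by the prediction error in representation space and flagging spikes.

```python
# Sketch of the "surprise" measurement described above: slide a short window
# over a longer video, let a (stand-in) encoder and predictor guess the
# representation of the upcoming frame, and record the prediction error.
# A spike in this error is the signal that something implausible happened.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
predictor = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))

video = torch.randn(100, 3, 32, 32)        # T frames of a (fake) video
window, errors = 16, []

with torch.no_grad():
    feats = encoder(video)                 # (T, 64) per-frame representations
    for t in range(window, video.shape[0]):
        context = feats[t - window:t].mean(dim=0)      # crude summary of the window
        pred_next = predictor(context)
        errors.append(((pred_next - feats[t]) ** 2).mean().item())

# With a trained model, a large error at time t would flag an implausible event.
print(max(errors), sum(errors) / len(errors))
```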
So that's very interesting, and performance is better or worse depending on the type of common sense you're testing the system for.
But here is what we are really moving toward: techniques for training those JEPA architectures with regularized methods based on information maximization. A bunch of them have been proposed over the last five years or so: MCR² from Yi Ma's group at Berkeley, Barlow Twins from my group at Meta, VICReg, which is also from my group at Meta, and MMCR from my colleagues at NYU in computational neuroscience. So how do they work?
Those methods basically say: I want to prevent the system from collapsing, from minimizing the prediction error by ignoring the input and producing a constant representation at the output of the encoder. One way to do this is to have some measure of the information that comes out of the encoder and try to maximize it. Now, we have an information theorist in the audience, and if he had some hair left, he would be pulling it out; my apologies, we're old friends and colleagues. Here is the problem: we do not have any good way of maximizing information content, because that would require a lower bound on information content, and we don't have lower bounds on information content, only upper bounds. So what we're going to do is take some measure of information content that we know is only an upper bound, push up on it anyway, and cross our fingers that the actual information content follows. That's as justified as I can make it. The authors of those other papers have other justifications: trying to produce efficient coding, trying to make sure that the representation vectors coming out of the encoder fill up the representation space, or are uniform on a sphere, things like that. But basically they're all surrogates for some measure of information content, computed under some assumptions, and because of those assumptions they're all upper bounds. The assumptions basically ignore dependencies between variables.
So how can we do this? Basically, we can do it by making sure that the variables coming out of the encoder, the components of the vector the encoder produces, are somewhat independent, or at least uncorrelated. Suppose we feed a batch of samples to the encoder and collect a matrix, where each row is a sample and each column is one variable, one component of the representation vector. We then have two types of methods. Sample-contrastive methods, the contrastive methods I was talking about earlier, try to make sure that the rows of that matrix are all different from each other, possibly orthogonal to each other if you can, or maximally different from each other; that's what SimCLR and various other Siamese-net methods are doing. The alternative I'm proposing here is to make the columns of that matrix independent, or at least pairwise orthogonal, which is another way of saying that I want the variables to be uncorrelated. There are various specific ways of doing this, and those papers have different recipes. The variance-covariance regularization method I mentioned does it by making sure the variables have a standard deviation of one, or at least one, and then making sure that the off-diagonal terms of the covariance matrix, which is this matrix transposed and multiplied by itself, are as close to zero as possible; that guarantees that the variables are uncorrelated. And then you simultaneously minimize the prediction error. What you get in the end is a system that finds a trade-off between extracting as much information as possible from the input and only extracting the information that is actually predictable by the predictor, eliminating everything in the observations, in x and y, that is not predictable: all the stuff in y that is not predictable from x, essentially.
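Here is a compact sketch of the variance-covariance idea just described, applied to a batch of embeddings. The loss weights and the margin are illustrative assumptions, not the published hyperparameters, and in a full JEPA the variance and covariance terms would be applied to the encoder outputs of both branches.

```python
# Compact sketch of variance-covariance regularization on a batch of embeddings:
# keep each dimension's standard deviation above a margin (variance term) and
# push the off-diagonal entries of the covariance matrix toward zero
# (covariance term), alongside the prediction error.
import torch

def variance_covariance_terms(z, margin=1.0, eps=1e-4):
    # z: (batch, dim) matrix of embeddings; rows = samples, columns = variables.
    z = z - z.mean(dim=0)
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = torch.relu(margin - std).mean()            # hinge: keep std >= margin
    cov = (z.T @ z) / (z.shape[0] - 1)                     # (dim, dim) covariance matrix
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / z.shape[1]          # decorrelate the variables
    return var_loss, cov_loss

s_y_pred = torch.randn(256, 64, requires_grad=True)       # predictor output (stand-in)
s_y      = torch.randn(256, 64)                           # target representation (stand-in)

pred_loss = ((s_y_pred - s_y) ** 2).mean()                 # prediction error term
var_loss, cov_loss = variance_covariance_terms(s_y)
total = pred_loss + 25.0 * var_loss + 1.0 * cov_loss       # weights are assumptions
total.backward()
```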
If you train a system like this, completely self-supervised, on unlabeled data, then take the representation, feed it to a supervised classifier, and measure the performance, it works well; I'm not going to bore you with numbers. But what we can also do is train a world model with this method. This is a very recent paper, just out last week on arXiv, by two of my students, Vlad Sobal and Kevin Jong, together with some of my colleagues: Cho Colero, who was a postdoc at FAIR and is now a professor at Brown, and Tim Brner, a postdoc at NYU who is on the job market.
So, basically, the idea here, let me jump directly to it, is again a world model, but the encoder and the predictor are trained simultaneously, and collapse is prevented using the variance-covariance regularization, the information maximization I just talked about, together with minimizing the prediction error. It's trained on sequences of video frames, and once you have this world model trained end to end, you can use it for planning. The system can actually plan pretty well in simple cases, but there's still a lot of work to do to scale this up and make it work in more complex situations.
This is another very recent project, by Amir Bar, who is a postdoc at Meta, which also uses a world model, trained in a different fashion, to plan in the real world. You train it from videos taken from a robot: the robot sits at a particular position, it moves by a known quantity, and then you have another video frame, so you can train the system to predict what the world is going to look like if you execute a particular motion. Then you can use it to plan trajectories, and it works, which is pretty cool. So those systems can predict what the world is going to look like if you follow a trajectory, and therefore they can plan a sequence of actions so that the world ends up looking like a particular target.
There are lots of fun videos there. Okay, let me conclude, because I'm badly out of time. I have a number of recommendations. Abandon generative models in favor of those joint embedding architectures. Everybody's talking about generative AI, everybody's working on generative AI, everybody's assuming that AI just is generative AI, and I'm telling you: forget about generative AI. Abandon probabilistic models in favor of energy-based models. Probabilistic modeling is the main theoretical framework on which all of machine learning is based, but it really doesn't work in the context of making non-deterministic predictions in high-dimensional continuous spaces. So I'm telling you: use those energy-based methods. You don't need to normalize anything; they're like unnormalized logarithms of probabilities, if you want, but you have a much bigger space at your disposal. Abandon contrastive methods, again something that's very popular at the moment, in favor of those regularized methods. And of course, abandon reinforcement learning; I've been saying this for ten years.
So if you are interested in human-level AI, do not work on LLMs. Work on LLMs if you want an engineering job next year; but if you want a research job on a topic that is not going to fall out of favor three or five years from now, don't work on LLMs. I mean, LLMs will be useful, just not as the centerpiece of AI systems. We have a lot of problems to solve; the research program built around this is probably a ten-year program for a lot of people.
So if you're looking for a good topic for a PhD in AI, there are a lot of problems to solve within this framework: training large-scale world models on all kinds of inputs; figuring out good planning algorithms, optimization algorithms for planning (it turns out gradient-based methods tend to get stuck in local minima, so we might have to use more sophisticated optimization like ADMM, or some amount of gradient-free optimization, which I'd like to stay away from); dealing with planning under uncertainty with latent variables; hierarchical planning, which is completely unsolved, completely open; associative memory, which I didn't talk about; and then some slightly more theoretical issues: mathematical foundations for energy-based learning and inference; learning cost modules, because here we only used very simple situations where the cost module can be built by hand, it's just a Euclidean distance in representation space, whereas in most cases you probably have to learn the cost function; planning with an inaccurate world model; adjusting the world model as you go; and so on.
What this model tells you is that you have three ways to be stupid. The first is that your world model can be wrong, so the effects of your actions may not be the ones you think they would be. The second is that your cost functions might be inappropriate: they might lead to outcomes that are not the ones you expect, or you may not have any guardrails, which will lead you to do really bad things to good people without realizing it. And the third, even if your world model and your cost function are good, is not being able to find a sequence of actions that actually fulfills your objective. In some AI systems, and certainly in some people, all three are bad: they don't have the right world model, they don't have the right cost function because they have no morals, and whatever action they decide to take is completely ineffective. That's a good description of some people in government who shall remain unnamed.
Okay. So, in the future, we're going to have virtual assistants with us at all times, helping us in our daily lives, and those systems will eventually constitute a kind of repository of all human knowledge. We will not go to a search engine, or even a library (unless we really like going to libraries, which I do); we'll just ask a question of our AI assistant. We may even pose a problem to our AI assistant, and it might be able to solve it. Now, all of our digital diet is going to be mediated by those AI assistants, and it would be extremely dangerous for everything about humanity if those AI assistants came from a handful of companies on the West Coast of the US, or from China. For the sake of linguistic and cultural diversity, of different value systems, political leanings, whatever it is, we cannot possibly get all of our information diet from just a few systems of this type. We need high diversity in AI assistants for the same reason that we need high diversity in the media and the press. And because those systems are so expensive to train, the only way to get there is if the people who have the means to train foundation models release them in open source, so that a lot of other people can fine-tune them for whatever language, culture, value system, or centers of interest they have. I don't see any alternative to this.
So I think, in terms of ethics (this is, after all, a symposium about ethics), the most important aspect of AI ethics today is not bias, it's not whether AI is going to kill us all, it's none of that. It's whether we are going to have the tools to build highly diverse AI systems, so that we don't get all of our information from just a handful of them, and that means open-source foundation models. So if you have any level of influence on anyone, particularly in government, make sure that governments don't make laws that would make open source risky, complicated, or illegal; there are proposals in that direction, and that would be a very bad outcome, I think, for the future. But if we manage to preserve diversity, then perhaps humanity will go through a new renaissance, because if we have access to all the world's knowledge in an even more efficient fashion than we currently do, and we have systems that assist us in all the decisions we make every day, it will amplify human intelligence. It would be as if everyone were walking around with a staff of super-smart people working for them. We should not be scared by the fact that they would be smarter than us, because we set the objectives for them; which is why I think this idea of objective-driven AI is so important. We'd be like politicians, who don't know anything but have a staff of people, experts in various topics, who advise them. We'll all be pointy-haired managers of virtual people. Thank you very much.
[Applause]
[Music]