The Original Sin of Computing...that no one can fix
Loading YouTube video...
Video ID:Fu3laL5VYdM
Use the "Analytical Tools" panel on the right to create personalized analytical resources from this video content.
Transcript
The Turing Award is widely considered to
be the Nobel Prize of Computer Science.
You may be surprised to know that this
is not a real turning award, but it does
look a lot like one. It is also in the
shape of a bowl, but it also comes with
a million-doll prize as well. When Ken
Thompson, one of the original authors of
the Unix operating system, accepted his
touring award in 1984, his acceptance
speech was a little bit different. It
didn't talk about just like some
generalized computing theory. He also
didn't talk about the future of computer
science. Instead, he released a very
dark and terrifying thought experiment
onto the world. What if every single
compiler was already irreversibly
compromised?
An original sin passed to every
generation of compiler with no cure.
He called it reflections on trusting
trust. It was the first example of a
meta backdoor, a virus so perfect that
even complete code auditing couldn't
catch it. But let's think how an
original compiler sin could permeate
through multiple subsequent generations.
Now, I have a story for you as an
analogy. About a year ago, I made a
video on virtual memory. It was a really
good video. I encourage you to go watch
it. In that video, I accidentally
mispronounced the word contiguous as
contiguous. And so inevitably when the
comments made me realize my mistake, I
ended up calling my mom because she
watches all my videos. Hi mom. And I
asked her, "Did you notice my mistake?"
And she said, "What mistake?" And I
said, "I mispronounced the word
contiguous as contigious." And she just
paused for a second and she said, "Is
that wrong?" Now, I learned that word
from my mom and she learned that word
from her parents. So, it has to make you
wonder how many generations far back do
you have to go to find the original
mistake? And that's exactly how the
compiler original sin theory works.
Basically, it's like self-replicating
code where code reproduces code that
also reproduces that same issue. Now,
you might think this is kind of
complicated, but let me make it much
worse for you. If I look over here, this
is actually the family tree for the
fourth programming language. And this is
kind of terrifying. And you will also be
terrified to know that this is actually
a very good representation of a lot of
the other family trees for different
programming languages and compilers
themselves. This is just absolutely
terrible to take a look at. We have all
of these different derivations going
from the year 1975 to 2010. I don't even
want to know what happened after 2010.
To understand how an original compiler
sin would work, you first have to
understand how compilers are compiled.
And remember, compilers are just the
programs that take in your source code
and produce that compiled binary. A lot
of different programming languages are
something called self-hosted. And to
understand what that means, let's take a
step back for a second. I want you to
think of a word, any word you want. I'm
going to pick the word literal, and I'm
going to come up with a definition for
that word. My definition for a literal
might be something like taking words at
face value or taking something at their
most basic definition. But pause for a
second and how did I describe that
definition of the English word literal?
I used the English language to describe
the definition of an English word. And
you probably did the same if you were
following along with that or or if you
didn't let me know what you did cuz I
need to like study you or something. But
this kind of creates a paradox where you
have to know the language in order to
understand definitions of that language.
And this is exactly how it works in the
C programming language. Programming
languages being self-hosted means that
their compilers are written in the
language that they compile for. Aka the
C compiler is actually written in the C
programming language. Now when I first
started learning this, this kind of gave
me an existential crisis. So allow me to
also share my existential crisis with
you. So if the C compiler is written in
the C language itself, how was that
compiler compiled? And the answer is
another compiler. But how was that
compiler compiled? Prior to the C
compiler being fully self-hosted, it was
actually derived from the B compiler.
And I know naming conventions were
absolutely genius back then. So if we go
back generations of compilers compiling
and deriving other compilers
compiler baguette compiler sounds
biblical goes along with this well we
actually get the original self-hosted
assemblers assembling assembly and I
know what you're about to say but how
are those assemblers assembled? I hope
you're just as horrified as me to think
that for the very earliest computers
humans were the original assemblers. So
what programmers had to do is they had
to go handwrite their assembly programs
and then at assemble time they had to go
over to the instructions manual and
handwrite those ones and zeros for the
corresponding op codes of their program.
So if you haven't thanked your compiler
today, make sure to do so. We can
demonstrate the idea of self-hosted and
self-reroducing compilers with a
programming exercise. Ken Thompson in
his acceptance speech said that in
college before video games they used to
amuse themselves with different
programming exercises. I don't think I
was doing that in college. I think I was
playing too much World of Warcraft. But
the idea behind the exercise is to
produce a program that when compiled and
executed will output an exact copy of
the original source code. And the name
for these is actually quines after
Willard Vin Orman Quin the philosopher.
So, I'd like to try my hand at producing
some self-producing code. And if you'd
like to follow along, I've made all of
the code for this available in my GitHub
repository, which I will leave a link to
in the description of this video. So,
let's go over to our very first simple
quin program, and this is going to be
written in C. Now, remember, the goal
behind this program is to produce
exactly this source code as soon as it
executes the program. And this is going
to include all of these different
comments as well as the actual code
itself. So if you progressively just
kept running and rerunning the output of
this program, you would get kind of this
recursive program that's always
reproducing itself. So that's the whole
complicated goal behind this. And I do a
little bit of tricks with this with the
format specifiers. For example, if I
want to print a new line in this code, I
can't just print a new line in the code.
I have to use these format specifiers.
So 10 corresponds to the asky value on
the ASKI table for the new line
character. And then if I want to print
quotation marks inside of here, I can
use the asky value 34 in order to
produce those quotation marks inside of
my string code that's going to be part
of my output. Now that sounds
complicated, but let's go and let's
execute this program and make sure it
produces all of the correct code. So
I've already compiled my program and if
I do /coin
here you can see this exactly matches
our original code and if I want to what
I can do if I want to verify that the
output is exactly the same I'm going to
do this and I'm going to write the
output of this to a temporary file I'll
just call it temp. So now that contains
the output of the program. And so if I
want to diff this, I can do diff. I'm
going to do the original quinine. C,
which is the source code. And then I'm
going to diff that with that temp
output. And it should be exactly the
same. Fingers crossed. Yes, it is
exactly the same. So we have
successfully made some self-producing
code. Now Thompson did say that the best
vehicle for producing this
self-producing code was forran. So I
decided to also write another version of
this same code in the forran programming
language. Now if we go over here, this
looks very similar to the C program
except so much uglier. But we have a
kind of copy of the source code for this
stored inside of this string here as
well. So remember, we kind of have to
self-reroduce the program inside of this
string that's going to allow us to
actually print this string one more
time. And kind of a side note here, if
you're ever coding in forran, note that
a dow loop, you cannot increment the
flag variable inside of it. So I had
ended up having to change this to a dowh
loop just to be able to increment my i
variable. That's very interesting. And I
believe this is the first time I've ever
had the forran language as a language
listed inside of my GitHub. So that's
kind of a good achievement. To verify my
forran quin is working properly, I'm
going to compile this. So we'll do G for
I'm going to do my fort quinine.f90
output fort coin and we can run this
fort coin and here you can see we do
indeed have a another reproduction of
this particular code. Now let's move on
to the more interesting portion of this
exercise. What if we slipped in some
additional trojanized code inside of the
self-producing program so that the
second iteration would either have that
trojanized code or have the code that
was able to produce that code. So if I
go over to Trojancoin.c, you can see the
structure for this particular program is
primarily the same as the previous ones,
but I have this additional function
inside of here. And so what this
additional function is going to do is
it's going to check for the string right
found inside of the string that's about
to be printed of the additional code
that second iteration and it's going to
replace the word right with the word
wrong. So that means even inside of all
these comments and stuff this is going
to produce a trojanized version of the
output. And you can see we also have to
include this check function inside of
the original string that's going to be
our format specifier to specify the
output of this. And if I go over and I
want to execute this program slash
judging coin, you can see if we go up
here somewhere inside of this messy
code, we have actually replaced the word
right which was inside of this original
comment with the word wrong which is
making us look just a little bit crazy
over here. If this idea for Trojanized
code was instead part of a malicious
compiler, it would basically have two
different options. In the first option,
if it recognized that it was compiling a
regular program, it would slip in that
malicious payload into the compiled
binary. In the second case, if it
realized that it was compiling code for
a secondary compiler, aka it's
recognizing its own source code, it
would instead slip in the code that's
used to generate the malicious payload.
Now, you may be thinking that that
Trojanized code was supremely obvious,
and it kind of was, but I did that on
purpose to demonstrate the point of how
to slip in the original sin behavior
into the coin itself. But there's just a
multitude of offiscation techniques that
you can add to code to make it more
difficult to identify and detect inside
of code. So, let's take a look at
obfiscated Trojancoin. C to demonstrate
how this is done. Now this is just
honestly the bare minimum of obiscation
I have added to this particular program.
All I've simply done is taken the string
that's used to generate the code for the
program and translated that into the
decimal version of the ASKI values. So
if you take a look at the ASI table, all
of those characters have corresponding
decimal values. And so this is actually
the original source code just encoded in
its decimal value form. And you can see
I still have my check function inside of
here that's going to replace the word
right with the word wrong. But instead
I'm also using the asky representations
in decimal form for this as well. And
this function is also encoded inside of
our giant number array here. I would say
this is a lot more difficult to detect.
So let's take a look at our offiscated
code. I'm going to compile with GCC.
Here's my offiscated Trojan coin. Ignore
the warnings. Not normally, but in this
particular case, you should. And we're
going to execute this code.
And you can see even though we had this
confusing character array in decimal
form in the original code, we actually
have that original source code in the
unobfiscated version. And this is really
only the bare minimum of obfiscation
that I'm adding here. As a professional
reverse engineer, there is so much more
that you can do with obfiscation. But
back to our original sim compiler
scenario, you can use this to obiscate
the source code of the Trojanized
compiler. But think about this, this
might not even really be necessary
because if you are using that compiler,
you're compiling another compiler. And
so other people are probably only using
the compiled version of that. Meaning
that if they really wanted to find that
Trojanized code inside of it, they're
probably working with a binary. So,
they're going to have to actually
reverse engineer that binary using tools
like Gedra or IDA to even be able to
detect that malicious code. I don't know
about you, but I'm not really going
around reverse engineering my compilers
on a daily basis probably. Not only
that, but if you're looking at the
programs that are produced by this
compiler, it would be very difficult to
be able to detect this Trojanized
behavior. you've effectively removed the
Trojan version from the source code of
that program. If you audit the source
code, it's going to appear clean because
the infection process only happens after
the binary has been compiled. So, you
would also have to reverse engineer that
target binary. Now, you may be thinking
that if you did this and infected the C
compiler, you'd really only be infecting
C programs. But that's actually not
true. If you think about the complicated
family tree of programming languages and
compilers that I demonstrated earlier, a
lot of programming languages will build
off of other compilers to kind of avoid
reinventing the wheel and focus on their
new language that they're creating. A
really good example of this is the LLVM
project. Now, the LLVM project is a
compiler infrastructure used by so many
different languages. Basically,
everything is LLVM. Everything is
computer.
So it's used directly to compile C, C++
or Objective C with its provided clang
compiler or it's also used indirectly by
a lot of different programming languages
like uh Swift's compiler, Rust's
compiler or even forrans compiler. To
demonstrate just how pervasive this
project actually is and how many
compilers this applies to, think about
developing an iOS application. Now, you
normally would not consider an iOS
application needing to involve the C
programming language, but clang is used
to compile Objective C, which used to be
the language used for iOS apps. So, if
we go over here, I have this make file.
Instead of just doing this inside of
Xcode, which normally does all of that
on the back end, and you have no idea
that LLVM is involved at all, I can do
this manually. And I downloaded my own
copy of the LLVM project, which is big
ordeal. We'll get into that in a second.
And I'm going to custom compile this
directly. So I'm reading in all of the
Objective C binaries and I'm going to be
producing a target MacO binary for the
iOS application. This is also very
interesting because you can use this and
add extra offiscation flags and
customizations to LLVM to add automated
offiscation to the binary. But that's
kind of a topic for another video. So,
let me go ahead real quick and
demonstrate this. So, I'm just going to
do make and it's reading in all of my
Objective C binaries here. It's adding
them in together. It's compiling them
and it's going to produce the target
binary. We're going to let this load cuz
this is a really big application.
And here we go. We have completed our
test application which is comprised of
originally Objective Code. Now, to show
how large the LLVM project is, and to
demonstrate that you're not going to be
regularly auditing both your compiler
and any inherited compilers that they
rely on, let's take a look at the LLVM
project. Now, this is available open
source on GitHub, which is kind of cool.
And this has this has so much stuff to
it. And it did say it was an
infrastructure. So that's just what the
repository looks like. But let me show
you something funny. So I'm going to
clone this repository.
I'm going to do get clone. That's my
LLVM project, which I did tell you I
downloaded and customized earlier. We
are cloning this project.
It's going to be a while. Everyone,
how are you doing today?
Hopefully good.
What did the What did the brother cell
say to the sister cell when she stepped
on his toe?
Mitosis.
Okay, this is now cloning. You can see
we're at 1% 2%. This is going to be very
slow. So, you can see this repository is
absolutely enormous. So, there is no way
that you're going to be auditing this. I
am going to kill this because I don't
need that. I've talked a lot about how
to implement a kind of proof of concept
for this terrifying compiler thought
experiment. But let's talk about some of
the mitigations now and there are
mitigations although they're pretty
complex. David Wheeler as part of his
PhD thesis created a technique for
mitigating this known as diverse double
compiling. So let's say if we wanted to
try to prove that GCC hadn't been
tampered with or trojanized. I have the
steps right here for implementing DDC.
One, pick a second independent compiler
that you trust. Big job there. Two, use
that second compiler to build a fresh
GCC from source. We're going to call
that source G1. And then three, use G1
to build GCC again from source. We're
going to call that result G2. And now
the fourth step, compare G1 to G2. If
they are bit forbit identical, then you
know that successfully this compiler has
not been tampered with. So you can use
this and trust the programs created from
it. Now this can get pretty complicated.
So I will share a link to that paper in
the description of this video. Now note
to think about if you're trying to hand
audit modern compilers these days like
GCC or clang, they just have such a
massive code base that it's going to be
super challenging. So, another technique
that you could try to use instead is
kind of handwriting your own teeny
compiler and then using that as a
bootstrap to generate larger and larger
compilers until you have a final
compiler with a pure lineage that has
been written entirely by yourself that
you know you can trust. Now, you might
think this all kind of sounds
far-fetched, but this attack scenario
has actually happened in real life. Back
in 2015, there was this family of Mac
malware called Xcode Ghost. And what it
was was a Trojanized version elite Xcode
compiler targeting iOS developers. So
these developers would open up this
Trojanized version of their IDE, write
their iOS applications, and publish them
to the App Store. But what the infected
compiler would do is slip in additional
malicious code from the infected IDE
into the final applications. So that
means hundreds of applications made it
onto the app store looking like they
were coming from legitimate developers
with those developers being completely
unaware that they were actually
publishing Trojanized code. Now the
scariest part about all of this is that
Thompson gave a highly specific outline
of how he would implement this kind of
attack in his acceptance speech was a
little too specific. In fact, if you
scrape old Usenet archives, you can
actually see the Oracle employees and
the Bell Labs employees going back and
forth theorizing about whether he
actually did it. But the craziest part
is if you look at the old archived
messages, they're kind of corrupted. But
Ken himself actually admits to slipping
in the Trojanized login code into the
Unix support group at Bell Labs for fun.
They snuck it in by advertising it as
some nonbackwards compatible feature so
that the group would have to use their
specialized compiler with the original
sin code inside of it. But the original
Trojan code inside was kind of smart. If
you used the - S option to output the
assembly of the program, it would
actually not generate the Trojanized
version of it. So it' kind of generate a
clean version so that you wouldn't be
able to detect it if you were trying to
look at the assembly. And Ken claims
that at some point someone ran it with
the - S option and so the original SIN
was removed from that compiler in the
chain of compilers at Bell Labs. He also
claimed that the compiler was never
released outside of Bell Labs, but I'm
not so sure about that. Bell Labs had
their hands on just about everything in
the 70s and 80s, so there's definitely a
nonzero chance that that compiler
actually escaped. Hopefully that doesn't
keep you up at night. I particularly
like one quote from Ken Thompson saying,
"You can't trust code that you didn't
totally write yourself, especially not
code from companies that employ people
like me."
Like, Ken, if this is the story that you
are sharing, what did you possibly do
that you're not sharing with us? It's
not really a feasible philosophy to
follow though in this day and age. If
you take a step back for a moment and
think about all the proprietary software
or applications that you've used even
today, you're definitely not going to be
able to go through and audit all of
those applications, let alone write all
of that code yourself. But personally, I
think this is one of the spookiest
thought experiments that I've read in a
while. So, I'd be curious if anybody
else has read any other spookier thought
experiments lately. Let me know. Anyway,
thank you so much for watching everyone.
As always, I hope you enjoyed this
video. Lori wired out.
Analytical Tools
Create personalized analytical resources on-demand.