The Original Sin of Computing...that no one can fix

Feedback

What if we can't trust *ANY* software...even if you have the source code? A perfect, self-replicating "sin" passed down for generations of compilers. Invisibly compromised in every program. It sounds like a spy novel; only it actually happened. Ken Thompson, creator of UNIX, theorized how Code Auditing won't save you. This is how it works...and why it's already in the wild. --- Timestamps: 00:00 Original Sin 02:06 Compiler Family Tree 05:44 Self Reproducing Code 10:42 Obfuscating the Quine 13:24 C touches EVERYTHING 17:43 A Mitigation (sorta) --- Check out my X account for cool Computer Science stuff! https://x.com/lauriewired --- Reflections on Trusting Trust, Ken Thompson: https://dl.acm.org/doi/10.1145/358198.358210 David Wheeler's Diverse Double Compiling Mitigation: https://dwheeler.com/trusting-trust/

LaurieWired

22:19

2025-10-08

Loading YouTube video...

Video ID:Fu3laL5VYdM

Use the "Analytical Tools" panel on the right to create personalized analytical resources from this video content.

Transcript

574 segmentsCurrent: 0:00

The Turing Award is widely considered to

be the Nobel Prize of Computer Science.

You may be surprised to know that this

is not a real turning award, but it does

look a lot like one. It is also in the

shape of a bowl, but it also comes with

a million-doll prize as well. When Ken

Thompson, one of the original authors of

the Unix operating system, accepted his

touring award in 1984, his acceptance

speech was a little bit different. It

didn't talk about just like some

generalized computing theory. He also

didn't talk about the future of computer

science. Instead, he released a very

dark and terrifying thought experiment

onto the world. What if every single

compiler was already irreversibly

compromised?

An original sin passed to every

generation of compiler with no cure.

He called it reflections on trusting

trust. It was the first example of a

meta backdoor, a virus so perfect that

even complete code auditing couldn't

catch it. But let's think how an

original compiler sin could permeate

through multiple subsequent generations.

Now, I have a story for you as an

analogy. About a year ago, I made a

video on virtual memory. It was a really

good video. I encourage you to go watch

it. In that video, I accidentally

mispronounced the word contiguous as

contiguous. And so inevitably when the

comments made me realize my mistake, I

ended up calling my mom because she

watches all my videos. Hi mom. And I

asked her, "Did you notice my mistake?"

And she said, "What mistake?" And I

said, "I mispronounced the word

contiguous as contigious." And she just

paused for a second and she said, "Is

that wrong?" Now, I learned that word

from my mom and she learned that word

from her parents. So, it has to make you

wonder how many generations far back do

you have to go to find the original

mistake? And that's exactly how the

compiler original sin theory works.

Basically, it's like self-replicating

code where code reproduces code that

also reproduces that same issue. Now,

you might think this is kind of

complicated, but let me make it much

worse for you. If I look over here, this

is actually the family tree for the

fourth programming language. And this is

kind of terrifying. And you will also be

terrified to know that this is actually

a very good representation of a lot of

the other family trees for different

programming languages and compilers

themselves. This is just absolutely

terrible to take a look at. We have all

of these different derivations going

from the year 1975 to 2010. I don't even

want to know what happened after 2010.

To understand how an original compiler

sin would work, you first have to

understand how compilers are compiled.

And remember, compilers are just the

programs that take in your source code

and produce that compiled binary. A lot

of different programming languages are

something called self-hosted. And to

understand what that means, let's take a

step back for a second. I want you to

think of a word, any word you want. I'm

going to pick the word literal, and I'm

going to come up with a definition for

that word. My definition for a literal

might be something like taking words at

face value or taking something at their

most basic definition. But pause for a

second and how did I describe that

definition of the English word literal?

I used the English language to describe

the definition of an English word. And

you probably did the same if you were

following along with that or or if you

didn't let me know what you did cuz I

need to like study you or something. But

this kind of creates a paradox where you

have to know the language in order to

understand definitions of that language.

And this is exactly how it works in the

C programming language. Programming

languages being self-hosted means that

their compilers are written in the

language that they compile for. Aka the

C compiler is actually written in the C

programming language. Now when I first

started learning this, this kind of gave

me an existential crisis. So allow me to

also share my existential crisis with

you. So if the C compiler is written in

the C language itself, how was that

compiler compiled? And the answer is

another compiler. But how was that

compiler compiled? Prior to the C

compiler being fully self-hosted, it was

actually derived from the B compiler.

And I know naming conventions were

absolutely genius back then. So if we go

back generations of compilers compiling

and deriving other compilers

compiler baguette compiler sounds

biblical goes along with this well we

actually get the original self-hosted

assemblers assembling assembly and I

know what you're about to say but how

are those assemblers assembled? I hope

you're just as horrified as me to think

that for the very earliest computers

humans were the original assemblers. So

what programmers had to do is they had

to go handwrite their assembly programs

and then at assemble time they had to go

over to the instructions manual and

handwrite those ones and zeros for the

corresponding op codes of their program.

So if you haven't thanked your compiler

today, make sure to do so. We can

demonstrate the idea of self-hosted and

self-reroducing compilers with a

programming exercise. Ken Thompson in

his acceptance speech said that in

college before video games they used to

amuse themselves with different

programming exercises. I don't think I

was doing that in college. I think I was

playing too much World of Warcraft. But

the idea behind the exercise is to

produce a program that when compiled and

executed will output an exact copy of

the original source code. And the name

for these is actually quines after

Willard Vin Orman Quin the philosopher.

So, I'd like to try my hand at producing

some self-producing code. And if you'd

like to follow along, I've made all of

the code for this available in my GitHub

repository, which I will leave a link to

in the description of this video. So,

let's go over to our very first simple

quin program, and this is going to be

written in C. Now, remember, the goal

behind this program is to produce

exactly this source code as soon as it

executes the program. And this is going

to include all of these different

comments as well as the actual code

itself. So if you progressively just

kept running and rerunning the output of

this program, you would get kind of this

recursive program that's always

reproducing itself. So that's the whole

complicated goal behind this. And I do a

little bit of tricks with this with the

format specifiers. For example, if I

want to print a new line in this code, I

can't just print a new line in the code.

I have to use these format specifiers.

So 10 corresponds to the asky value on

the ASKI table for the new line

character. And then if I want to print

quotation marks inside of here, I can

use the asky value 34 in order to

produce those quotation marks inside of

my string code that's going to be part

of my output. Now that sounds

complicated, but let's go and let's

execute this program and make sure it

produces all of the correct code. So

I've already compiled my program and if

I do /coin

here you can see this exactly matches

our original code and if I want to what

I can do if I want to verify that the

output is exactly the same I'm going to

do this and I'm going to write the

output of this to a temporary file I'll

just call it temp. So now that contains

the output of the program. And so if I

want to diff this, I can do diff. I'm

going to do the original quinine. C,

which is the source code. And then I'm

going to diff that with that temp

output. And it should be exactly the

same. Fingers crossed. Yes, it is

exactly the same. So we have

successfully made some self-producing

code. Now Thompson did say that the best

vehicle for producing this

self-producing code was forran. So I

decided to also write another version of

this same code in the forran programming

language. Now if we go over here, this

looks very similar to the C program

except so much uglier. But we have a

kind of copy of the source code for this

stored inside of this string here as

well. So remember, we kind of have to

self-reroduce the program inside of this

string that's going to allow us to

actually print this string one more

time. And kind of a side note here, if

you're ever coding in forran, note that

a dow loop, you cannot increment the

flag variable inside of it. So I had

ended up having to change this to a dowh

loop just to be able to increment my i

variable. That's very interesting. And I

believe this is the first time I've ever

had the forran language as a language

listed inside of my GitHub. So that's

kind of a good achievement. To verify my

forran quin is working properly, I'm

going to compile this. So we'll do G for

I'm going to do my fort quinine.f90

output fort coin and we can run this

fort coin and here you can see we do

indeed have a another reproduction of

this particular code. Now let's move on

to the more interesting portion of this

exercise. What if we slipped in some

additional trojanized code inside of the

self-producing program so that the

second iteration would either have that

trojanized code or have the code that

was able to produce that code. So if I

go over to Trojancoin.c, you can see the

structure for this particular program is

primarily the same as the previous ones,

but I have this additional function

inside of here. And so what this

additional function is going to do is

it's going to check for the string right

found inside of the string that's about

to be printed of the additional code

that second iteration and it's going to

replace the word right with the word

wrong. So that means even inside of all

these comments and stuff this is going

to produce a trojanized version of the

output. And you can see we also have to

include this check function inside of

the original string that's going to be

our format specifier to specify the

output of this. And if I go over and I

want to execute this program slash

judging coin, you can see if we go up

here somewhere inside of this messy

code, we have actually replaced the word

right which was inside of this original

comment with the word wrong which is

making us look just a little bit crazy

over here. If this idea for Trojanized

code was instead part of a malicious

compiler, it would basically have two

different options. In the first option,

if it recognized that it was compiling a

regular program, it would slip in that

malicious payload into the compiled

binary. In the second case, if it

realized that it was compiling code for

a secondary compiler, aka it's

recognizing its own source code, it

would instead slip in the code that's

used to generate the malicious payload.

Now, you may be thinking that that

Trojanized code was supremely obvious,

and it kind of was, but I did that on

purpose to demonstrate the point of how

to slip in the original sin behavior

into the coin itself. But there's just a

multitude of offiscation techniques that

you can add to code to make it more

difficult to identify and detect inside

of code. So, let's take a look at

obfiscated Trojancoin. C to demonstrate

how this is done. Now this is just

honestly the bare minimum of obiscation

I have added to this particular program.

All I've simply done is taken the string

that's used to generate the code for the

program and translated that into the

decimal version of the ASKI values. So

if you take a look at the ASI table, all

of those characters have corresponding

decimal values. And so this is actually

the original source code just encoded in

its decimal value form. And you can see

I still have my check function inside of

here that's going to replace the word

right with the word wrong. But instead

I'm also using the asky representations

in decimal form for this as well. And

this function is also encoded inside of

our giant number array here. I would say

this is a lot more difficult to detect.

So let's take a look at our offiscated

code. I'm going to compile with GCC.

Here's my offiscated Trojan coin. Ignore

the warnings. Not normally, but in this

particular case, you should. And we're

going to execute this code.

And you can see even though we had this

confusing character array in decimal

form in the original code, we actually

have that original source code in the

unobfiscated version. And this is really

only the bare minimum of obfiscation

that I'm adding here. As a professional

reverse engineer, there is so much more

that you can do with obfiscation. But

back to our original sim compiler

scenario, you can use this to obiscate

the source code of the Trojanized

compiler. But think about this, this

might not even really be necessary

because if you are using that compiler,

you're compiling another compiler. And

so other people are probably only using

the compiled version of that. Meaning

that if they really wanted to find that

Trojanized code inside of it, they're

probably working with a binary. So,

they're going to have to actually

reverse engineer that binary using tools

like Gedra or IDA to even be able to

detect that malicious code. I don't know

about you, but I'm not really going

around reverse engineering my compilers

on a daily basis probably. Not only

that, but if you're looking at the

programs that are produced by this

compiler, it would be very difficult to

be able to detect this Trojanized

behavior. you've effectively removed the

Trojan version from the source code of

that program. If you audit the source

code, it's going to appear clean because

the infection process only happens after

the binary has been compiled. So, you

would also have to reverse engineer that

target binary. Now, you may be thinking

that if you did this and infected the C

compiler, you'd really only be infecting

C programs. But that's actually not

true. If you think about the complicated

family tree of programming languages and

compilers that I demonstrated earlier, a

lot of programming languages will build

off of other compilers to kind of avoid

reinventing the wheel and focus on their

new language that they're creating. A

really good example of this is the LLVM

project. Now, the LLVM project is a

compiler infrastructure used by so many

different languages. Basically,

everything is LLVM. Everything is

computer.

So it's used directly to compile C, C++

or Objective C with its provided clang

compiler or it's also used indirectly by

a lot of different programming languages

like uh Swift's compiler, Rust's

compiler or even forrans compiler. To

demonstrate just how pervasive this

project actually is and how many

compilers this applies to, think about

developing an iOS application. Now, you

normally would not consider an iOS

application needing to involve the C

programming language, but clang is used

to compile Objective C, which used to be

the language used for iOS apps. So, if

we go over here, I have this make file.

Instead of just doing this inside of

Xcode, which normally does all of that

on the back end, and you have no idea

that LLVM is involved at all, I can do

this manually. And I downloaded my own

copy of the LLVM project, which is big

ordeal. We'll get into that in a second.

And I'm going to custom compile this

directly. So I'm reading in all of the

Objective C binaries and I'm going to be

producing a target MacO binary for the

iOS application. This is also very

interesting because you can use this and

add extra offiscation flags and

customizations to LLVM to add automated

offiscation to the binary. But that's

kind of a topic for another video. So,

let me go ahead real quick and

demonstrate this. So, I'm just going to

do make and it's reading in all of my

Objective C binaries here. It's adding

them in together. It's compiling them

and it's going to produce the target

binary. We're going to let this load cuz

this is a really big application.

And here we go. We have completed our

test application which is comprised of

originally Objective Code. Now, to show

how large the LLVM project is, and to

demonstrate that you're not going to be

regularly auditing both your compiler

and any inherited compilers that they

rely on, let's take a look at the LLVM

project. Now, this is available open

source on GitHub, which is kind of cool.

And this has this has so much stuff to

it. And it did say it was an

infrastructure. So that's just what the

repository looks like. But let me show

you something funny. So I'm going to

clone this repository.

I'm going to do get clone. That's my

LLVM project, which I did tell you I

downloaded and customized earlier. We

are cloning this project.

It's going to be a while. Everyone,

how are you doing today?

Hopefully good.

What did the What did the brother cell

say to the sister cell when she stepped

on his toe?

Mitosis.

Okay, this is now cloning. You can see

we're at 1% 2%. This is going to be very

slow. So, you can see this repository is

absolutely enormous. So, there is no way

that you're going to be auditing this. I

am going to kill this because I don't

need that. I've talked a lot about how

to implement a kind of proof of concept

for this terrifying compiler thought

experiment. But let's talk about some of

the mitigations now and there are

mitigations although they're pretty

complex. David Wheeler as part of his

PhD thesis created a technique for

mitigating this known as diverse double

compiling. So let's say if we wanted to

try to prove that GCC hadn't been

tampered with or trojanized. I have the

steps right here for implementing DDC.

One, pick a second independent compiler

that you trust. Big job there. Two, use

that second compiler to build a fresh

GCC from source. We're going to call

that source G1. And then three, use G1

to build GCC again from source. We're

going to call that result G2. And now

the fourth step, compare G1 to G2. If

they are bit forbit identical, then you

know that successfully this compiler has

not been tampered with. So you can use

this and trust the programs created from

it. Now this can get pretty complicated.

So I will share a link to that paper in

the description of this video. Now note

to think about if you're trying to hand

audit modern compilers these days like

GCC or clang, they just have such a

massive code base that it's going to be

super challenging. So, another technique

that you could try to use instead is

kind of handwriting your own teeny

compiler and then using that as a

bootstrap to generate larger and larger

compilers until you have a final

compiler with a pure lineage that has

been written entirely by yourself that

you know you can trust. Now, you might

think this all kind of sounds

far-fetched, but this attack scenario

has actually happened in real life. Back

in 2015, there was this family of Mac

malware called Xcode Ghost. And what it

was was a Trojanized version elite Xcode

compiler targeting iOS developers. So

these developers would open up this

Trojanized version of their IDE, write

their iOS applications, and publish them

to the App Store. But what the infected

compiler would do is slip in additional

malicious code from the infected IDE

into the final applications. So that

means hundreds of applications made it

onto the app store looking like they

were coming from legitimate developers

with those developers being completely

unaware that they were actually

publishing Trojanized code. Now the

scariest part about all of this is that

Thompson gave a highly specific outline

of how he would implement this kind of

attack in his acceptance speech was a

little too specific. In fact, if you

scrape old Usenet archives, you can

actually see the Oracle employees and

the Bell Labs employees going back and

forth theorizing about whether he

actually did it. But the craziest part

is if you look at the old archived

messages, they're kind of corrupted. But

Ken himself actually admits to slipping

in the Trojanized login code into the

Unix support group at Bell Labs for fun.

They snuck it in by advertising it as

some nonbackwards compatible feature so

that the group would have to use their

specialized compiler with the original

sin code inside of it. But the original

Trojan code inside was kind of smart. If

you used the - S option to output the

assembly of the program, it would

actually not generate the Trojanized

version of it. So it' kind of generate a

clean version so that you wouldn't be

able to detect it if you were trying to

look at the assembly. And Ken claims

that at some point someone ran it with

the - S option and so the original SIN

was removed from that compiler in the

chain of compilers at Bell Labs. He also

claimed that the compiler was never

released outside of Bell Labs, but I'm

not so sure about that. Bell Labs had

their hands on just about everything in

the 70s and 80s, so there's definitely a

nonzero chance that that compiler

actually escaped. Hopefully that doesn't

keep you up at night. I particularly

like one quote from Ken Thompson saying,

"You can't trust code that you didn't

totally write yourself, especially not

code from companies that employ people

like me."

Like, Ken, if this is the story that you

are sharing, what did you possibly do

that you're not sharing with us? It's

not really a feasible philosophy to

follow though in this day and age. If

you take a step back for a moment and

think about all the proprietary software

or applications that you've used even

today, you're definitely not going to be

able to go through and audit all of

those applications, let alone write all

of that code yourself. But personally, I

think this is one of the spookiest

thought experiments that I've read in a

while. So, I'd be curious if anybody

else has read any other spookier thought

experiments lately. Let me know. Anyway,

thank you so much for watching everyone.

As always, I hope you enjoyed this

video. Lori wired out.

Analytical Tools

Create personalized analytical resources on-demand.

Analysis

Visualization

Learning

Assessment

Chatbot is available after you save this video to your library.