What Nobody Tells You About Being a Quant

0:00 Intro
1:10 The Interview Process
6:06 The Multifactor Trading Model
17:54 Security Matching for Alternative Data
23:15 Productionizing Trading Factors
27:33 Cloud – AWS, Databricks & Spark
32:09 Databases – Parquet & KDB+
36:48 Recap & Thanks

Book: https://cms.dm.uba.ar/Members/maurette/ACF2022/Richard%20Grinold%2C%20Ronald%20Kahn-Active%20Portfolio%20Management_%20A%20Quantitative%20Approach%20for%20Producing%20Superior%20Returns%20and%20Controlling%20Risk-McGraw-Hill%20%281999%29.pdf

#citadel #quant #quantquestions #howtobecomeaquant #whatisaquant #quantinterviewquestions #quantfirms #quanttrading #quantfinancecourse #whatisquanttrading #whatisquantfinance #quantengineer #quantsystems #howtobecomeaquanttrader #quantfinance #quantdata #quantdeveloper #codingjesus

The Quant Insider

38:03

2026-06-17

Video ID:tzTftCzmr7k

Transcript

Hi everyone, I've received a lot of

requests to do a full walk-through of my

experience as a quantitative developer.

And so today, I'm going to share

everything I learned in my four years as

a quant, and also touch on things I've

worked on to give a detailed look into

some practical systems that a quant

developer could be expected to

implement. The most exciting piece that

I will cover is a review of the

systems design of a trading system you

can expect at most big quant firms,

which can be great for implementing in

your personal projects, and in turn a

great talking point in interviews.

I've sectioned this video off into

chapters, so you can easily skip ahead

to the parts you're most interested in.

First, I'll start with the experience I

had before I was accepted as a

quantitative developer at my firm. I

graduated from the engineering

department at the University of

Waterloo, and I had completed six

internships, all of them in big tech and

startups. Across those roles, I worked

as a software developer and as a machine

learning engineer. Despite the extensive

internship experience, I didn't have any

finance or quant finance background at

all. After I graduated, I applied to

hundreds of quant roles of varying

types, and I eventually landed mine.

The interview process consisted of six

rounds, and I'll cover each one. The

first two rounds were technical coding

questions, but they weren't your typical

LeetCode questions. They felt

custom-made for the role, and the two

questions I got were of varying

difficulty.

The first round asked me to build a

portfolio management system,

something that could track orders,

execute orders, and maintain my

positions across different stocks.

I found this one fairly easy because I

had already built trading systems in the

past, and was familiar with the features

the interviewer was asking for, like

stop losses, limit orders, and so on.

The key thing I took away from that

round was the importance of

communicating my thought process

thoroughly and concisely.

If there's one piece of technical

interview advice you take from this

video, it's this.

Half of what they're looking for is

whether you can work your way through

the problem using good technical

practices,

and the other half is whether you can

communicate your thought process

clearly.

This is a hugely underrated skill in

interviews,

and it's one of the main things I, along

with my colleagues, look for when we

conduct interviews ourselves. If a

candidate can't communicate their

thinking behind a simple coding

question, we lose confidence that

they'll be able to explain the far more

complex projects they'd take on if they

were hired as a quant developer,

researcher, or trader.

As a quant, part of your job is being

able to explain technical concepts to a

wide range of people across the firm in

a very intuitive and clear way.

When I interview candidates for our

firm, I tend to value excellent

communication above coding and

technical skills, especially in this

climate in the tech industry. The second

round was about developing an algorithm

around different sets of combinations

that would appear in a 52-card deck, and

then modifying the suits and ranks in the

deck. I can't remember the exact

question. I tried searching for it

online afterward, and couldn't find

anything that matched. But from what I

recall, it was really testing my ability

to apply statistics to programming.

This was by far the hardest question I

encountered in the whole process.

And I'll be honest, I didn't get it

fully correct.

But I was able to communicate my thought

process as best I could, and got

probably 70% of the way there.

That actually reinforces the point I

made earlier. Your communication will

take you a lot farther in the interview

process than you might realize. I truly

thought that round would get me

cut right away.

To my surprise, the interviewer advanced me to the

next stage to meet the partners at the

firm, who are now my managers. So, my

biggest piece of advice is this.

Whenever you practice technical

questions, talk through the problem out

loud to yourself first. Write it down in

pseudocode, and then actually implement

it. That way, you build the habit of

explaining yourself concisely. The third

round was a system design and technical

round with two partners at the firm. It

was a little intimidating, almost a good

cop, bad cop situation. They went

through my resume in full detail and

dove deep into the projects I'd worked

on. Their goal was to see how well I

could communicate a technical concept,

and also to confirm that I had genuinely

done the things on my resume.

And trust me, it comes across very

clearly if that isn't true.

I talked about one of the projects from

my machine learning engineering role,

and a side project I'd built in trading

systems.

This kind of round will almost certainly

happen in every interview process.

So, I highly recommend being extremely

comfortable being questioned about your

own work and able to talk about a

project for a solid hour. No matter how

hard they pushed on the work I'd done,

it was important that I came across as

confident and stood my ground with the

technical depth to explain my projects

in detail.

They asked me to show the system design

for one of my projects, so I shared my

screen and drew out the systems I'd

built, and they questioned each piece.

There were a lot of questions around

data engineering, how I'd organized the

database, and the specific ML algorithms

I'd used, which I believe were some kind

of gradient boosted tree algorithm,

if I remember correctly.

The fourth, fifth, and sixth interviews

were quite similar. Each one was with a

different partner at the firm. It seemed

like they were rotating me around all

the partners of each sub team within the

quant department to figure out where I'd

fit in the company.

I won't dwell on these too much because

they didn't feel as difficult. They were

mostly a mix of systems design and

behavioral questions, talking about the

projects I'd worked on, database

questions, and understanding my

experience in the quantitative finance

space.

They traded global equity, so they did

ask me some questions related to that,

but they were fairly high level, and

they clarified that I wouldn't be

penalized for not knowing too much of

it. Though they made it clear it would

be expected of me to grasp it if I

joined.

The first big concept I want to walk

through is the multi-factor trading

model, because honestly everything else

in this video hangs off it. It's the

standard architecture across a lot of

the top firms, and once you understand

it, the rest of my job will make a lot

more sense. It's not a new idea, it goes

back a few decades, and there's one book

I'd point any new hire to before

anything else, Active Portfolio

Management by Richard Grinold and Ronald

Kahn.

I'll link it in the description. What I

want to do here is lay out the whole

machine end to end, the way I'd sketch

it on a whiteboard for someone on their

first week.

Then, for the rest of the video, we'll

zoom into the individual pieces I

actually worked on and expand them.

One thing to set expectations on first,

the multi-factor system at my firm, like

at basically every firm, is an enormous

code base. Nobody who joins today

understands the entire thing end to end,

and I'd argue nobody understands all of

it at all. So think of this section as

the map I wish someone had drawn for me

on day one.

Let me start with the single idea the

whole business rests on.

We get paid to outperform a benchmark. A

client could just buy the index for

almost nothing, so our entire reason to

exist is the value we add on top of that

index. That value add has a name, alpha.

The cleanest way to think about it, if

your benchmark is up 10% and you're up

12, that extra 2% that came from you,

not from the market rising, is your

alpha.

Everything we build is in service of

producing more of it.

And here's the mindset shift that trips

up most people coming from a pure

engineering background, like I did. The

market price already has a consensus

view baked into it. Thousands of smart

people have already priced in what they

collectively think a stock is worth. So,

our job isn't to figure out what a stock

is worth in some absolute sense. It's to

find the specific spots where we

disagree with the consensus, and to be

right about that disagreement more often

than the other people trying to do the

exact same thing. Active management is

forecasting the market's errors. So, how

do you forecast in a disciplined,

repeatable way? You start by realizing

that a stock's move on any given day

isn't one thing.

It's a stack of things added together.

Picture an oil producer that's up 14%

over some period.

A chunk of that is just the whole equity

market going up. Another chunk is its

country.

Another chunk is its industry.

Oil moved, so oil stocks moved with it.

And only what's left after you strip all

of that out is truly specific to that

company.

That leftover piece is the part you

actually have a view on.

The way we separate those pieces in

practice is regression. You regress the

stock's return against the things that

drive it, and whatever they explain is

factor return, while the unexplained

residual is the stock's own

idiosyncratic return. And a quick gut

check that surprises people, even for an

oil stock that tracks oil tightly, that

company-specific residual is usually a

big part of its total movement. Stocks

are a lot more individual than they

look. That idea, decomposing returns

into common drivers plus a specific

piece, is the seed of the whole model.
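
As a rough sketch of that decomposition, here is a toy single-factor regression in plain Python. The return numbers are made up, and a real model would regress against many factors at once, so treat this as an illustration of the idea rather than how any firm does it.

```python
# Toy single-factor decomposition:
#   stock return = intercept + beta * factor return + residual
# All numbers below are made up for illustration.

def ols_fit(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    b = cov_xy / var_x
    a = mean_y - b * mean_x
    return a, b

def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Hypothetical monthly returns (percent) for an oil factor and one oil producer.
oil_factor = [2.0, -1.0, 3.0, 0.5, -2.5, 1.0]
oil_stock = [3.1, -0.2, 2.4, 1.9, -3.8, 0.1]

intercept, beta = ols_fit(oil_factor, oil_stock)

# Explained part: what the regression attributes to the factor each month.
explained = [intercept + beta * f for f in oil_factor]
# Specific (idiosyncratic) part: whatever the regression leaves unexplained.
residual = [r - e for r, e in zip(oil_stock, explained)]

# Share of the stock's variance that is company-specific (1 - R^2 for OLS).
specific_share = variance(residual) / variance(oil_stock)
print(f"beta = {beta:.2f}, specific share of variance = {specific_share:.0%}")
```

By construction the residuals average to zero and are uncorrelated with the factor, which is what makes them a clean "specific" piece.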

So, let me walk the pipeline it grows

into. It all starts with data and

signals. Everything downstream is only

as good as the data feeding it. So, this

is where it begins. Both traditional

financial data and the alternative data

I'll get into later.

Out of that data, you build signals.

Individual, measurable views on a stock.

Classic ones are value, momentum, size,

quality.

The more exotic ones come out of

alternative data sets.

One thing worth flagging early.

Before any of this, you also have to

define your universe.

The set of names you're even willing to

trade.

Your instinct is to trade everything for

maximum breadth, but you quickly learn

some names just aren't worth it.

Illiquid stuff you can't get in and out

of without moving the price against

yourself.

So, you draw a sensible boundary first.

A raw signal though, isn't something you

can trade directly. And turning it into

something tradable is the next block,

the alpha.

An alpha is a clean, refined forecast of

return.

The intuition for the refinement is

three things multiplied together. How

volatile the stock is.

How much genuine predictive skill your

signal actually has.

And how strongly the signal's firing for

that specific name, right now.
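
A minimal sketch of that three-way product, with some assumptions of mine: "skill" is expressed as an information coefficient (`ic`, the correlation between forecasts and outcomes), and "how strongly the signal is firing" is a cross-sectional z-score. The tickers and numbers are hypothetical.

```python
from statistics import mean, pstdev

def refine_alpha(raw_signal, residual_vol, ic):
    """Turn raw signal values into alphas: volatility * skill * score.

    raw_signal:   {name: raw signal value}
    residual_vol: {name: annual residual volatility, e.g. 0.30 for 30%}
    ic:           assumed skill of the signal (information coefficient)
    """
    mu = mean(raw_signal.values())
    sd = pstdev(raw_signal.values())
    alphas = {}
    for name, value in raw_signal.items():
        score = (value - mu) / sd  # how strongly the signal fires for this name
        alphas[name] = residual_vol[name] * ic * score
    return alphas

# Hypothetical inputs for four names.
raw_signal = {"AAA": 1.8, "BBB": 0.2, "CCC": -1.1, "DDD": 0.5}
residual_vol = {"AAA": 0.30, "BBB": 0.20, "CCC": 0.25, "DDD": 0.40}

alphas = refine_alpha(raw_signal, residual_vol, ic=0.05)
for name, alpha in alphas.items():
    print(f"{name}: {alpha:+.4f}")
```

Note that a signal with zero skill produces zero alpha for every name, no matter how extreme the raw values are.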

Each refined signal becomes one factor.

And multi-factor just means you're

running many of these at once. Each one

ideally capturing a different,

independent edge.

We'll come back later to why

independence matters so much, because

it's the whole game.

Running right alongside the alphas, not

after them, in parallel, is the risk

model. And this is the block newcomers

consistently underrate. The alpha tells

you what you hope to make. The risk

model tells you what it could cost you

in volatility. It uses that same

decomposition from before. A stock's

return is its exposure to a set of

common factors, industries, plus risk

indices like size, value, volatility,

momentum. Plus its own specific piece.

Now, why go to all this trouble instead

of just measuring how every stock moves

with every other stock directly?

This is the single best why I can give

you. So, let me actually work it

through.

To understand a portfolio's risk, you

need the covariance between every pair

of stocks, how each one co-moves with

each other one.

Take a universe of 1,400 names.

The number of pairwise relationships

you'd have to estimate is on the order

of 980,000.

That's hopeless. You can't estimate a

million numbers reliably from finite

history. You'd mostly be fitting noise.

So, here's the trick that makes the

whole thing possible. Instead of

stock-to-stock, you say every stock is

just a bundle of exposures to maybe 65

common factors. Now, you only need the

covariances among the factors, which is

around 2,000 numbers instead of nearly a

million.

You've collapsed an impossible problem

into a tractable one.

That is why the risk model is built on

factors rather than raw stocks.

It's the only way the math is even

computable.
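
The counts above can be checked with a few lines of arithmetic. This only reproduces the numbers quoted here; in a factor model each stock also needs its factor exposures and one specific variance, which is still far less to estimate than every stock pair.

```python
def pairwise_count(n):
    """Number of distinct pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

n_stocks = 1400
n_factors = 65

# Distinct stock-to-stock covariances you would need without a factor model.
stock_pairs = pairwise_count(n_stocks)

# Distinct factor-to-factor covariances, plus one variance per factor.
factor_pairs = pairwise_count(n_factors)
factor_terms = factor_pairs + n_factors

print(f"stock pairs:  {stock_pairs:,}")
print(f"factor terms: {factor_terms:,}")
```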

While I'm on risk, two pieces of

intuition I lean on constantly. First,

risk does not add up the way you'd

expect. The risk of a portfolio is less

than the weighted average of its parts

because the stocks don't all move

together. That gap is exactly the

benefit of diversification. Spread

across enough uncorrelated names, and

the specific risk averages away toward

nothing. But, and this is the catch, it

only averages away the specific part.

Stocks all tend to rise and fall with

the market together, and that shared

market risk never diversifies away, no

matter how many names you hold. That's

the line between specific risk, which

you can dilute, and systematic risk,

which you're always carrying.
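
One way to see both halves of that is a deliberately simplified setup of my own: an equal-weighted portfolio where every stock has the same volatility and every pair has the same correlation. With zero correlation the risk keeps shrinking as names are added; with positive correlation it levels off at a floor that no amount of diversification removes.

```python
import math

def equal_weight_risk(n, stock_vol, correlation):
    """Volatility of an equal-weighted portfolio of n identical stocks.

    variance = vol^2 * (1 + (n - 1) * correlation) / n
    """
    variance = stock_vol ** 2 * (1 + (n - 1) * correlation) / n
    return math.sqrt(variance)

stock_vol = 0.30      # every stock has 30% volatility (assumption)
shared_corr = 0.25    # assumed common correlation between any two stocks

for n in (1, 10, 100, 1000):
    specific_only = equal_weight_risk(n, stock_vol, 0.0)
    with_market = equal_weight_risk(n, stock_vol, shared_corr)
    print(f"{n:>5} names: uncorrelated {specific_only:.3f}, correlated {with_market:.3f}")

# The floor the correlated portfolio approaches as n grows.
systematic_floor = stock_vol * math.sqrt(shared_corr)
```

In every row the portfolio risk is at or below 30%, the weighted average of the individual risks.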

It's the core insight behind CAPM. The

market won't pay you a premium for risk

you could have diversified away for

free.

Arbitrage pricing theory took that one

step further

and said, "There isn't just one

systematic driver,

but many."

Which is precisely the multi-factor view

we run on.

The second piece of intuition is about

time.

Risk doesn't add across time, but

variance does.

As long as today's return isn't

correlated with yesterday's.

Which for most assets, it basically

isn't.

So variance piles up linearly with time.

And since risk is the square root of

variance, risk grows with the square

root of time.

That's why when you turn a monthly

volatility number into an annual one,

you multiply by the square root of 12,

not by 12. Tiny detail, but it's

everywhere.

And getting it wrong quietly corrupts

every risk number you produce.
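
A small sketch of that conversion, assuming returns are uncorrelated from one period to the next as described above. The 5% monthly figure is just an example.

```python
import math

def annualize_vol(period_vol, periods_per_year=12):
    """Scale a per-period volatility to annual using the square root of time."""
    return period_vol * math.sqrt(periods_per_year)

monthly_vol = 0.05  # example: 5% monthly volatility

annual_vol = annualize_vol(monthly_vol)   # correct: multiply by sqrt(12)
naive_vol = monthly_vol * 12              # wrong: treats risk as additive in time

# Variance, unlike volatility, does add up linearly across the 12 months.
annual_variance = 12 * monthly_vol ** 2

print(f"annualized: {annual_vol:.4f}, naive (wrong): {naive_vol:.4f}")
```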

Once you've got both halves, the alphas

saying what you expect to make,

and the risk model saying what it'll

cost, they feed into portfolio

construction. This is an optimizer. And

conceptually, it's doing a balancing

act. Maximize expected return. Subtract

a penalty for the risk you're taking.

Subtract the cost of trading, all while

respecting your constraints. Position

limits, staying sector neutral,

turnover caps, how much leverage you're

allowed.
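
To make that balancing act concrete, here is a toy version of the objective for three names. Everything in it is an assumption for illustration: the alphas, the covariance matrix, the linear cost, the limits, and especially the brute-force grid search, which stands in for the real numerical optimizer a firm would use.

```python
from itertools import product

def utility(holdings, alphas, cov, risk_aversion, current, cost_rate):
    """Expected return minus a risk penalty minus a linear trading cost."""
    n = len(holdings)
    expected_return = sum(alphas[i] * holdings[i] for i in range(n))
    variance = sum(holdings[i] * cov[i][j] * holdings[j]
                   for i in range(n) for j in range(n))
    trading_cost = cost_rate * sum(abs(holdings[i] - current[i]) for i in range(n))
    return expected_return - risk_aversion * variance - trading_cost

def within_limits(holdings, max_position, max_gross):
    """Constraints: a per-name position limit and a gross leverage cap."""
    tolerance = 1e-9
    return (all(abs(h) <= max_position + tolerance for h in holdings)
            and sum(abs(h) for h in holdings) <= max_gross + tolerance)

# Made-up inputs; weights are fractions of capital, negative means short.
alphas = [0.04, 0.01, -0.03]
cov = [[0.0900, 0.0200, 0.0100],
       [0.0200, 0.0400, 0.0100],
       [0.0100, 0.0100, 0.0625]]
current = [0.0, 0.0, 0.0]   # starting from cash

# Coarse grid of candidate weights from -0.5 to 0.5 in steps of 0.1.
grid = [i / 10 for i in range(-5, 6)]
candidates = [h for h in product(grid, repeat=3)
              if within_limits(h, max_position=0.5, max_gross=1.0)]

target = max(candidates,
             key=lambda h: utility(h, alphas, cov, 1.0, current, 0.002))
target_utility = utility(target, alphas, cov, 1.0, current, 0.002)
print("target weights:", target)
```

The search ends up long the name with the positive alpha and short the name with the negative one, but sized well below the position limit because the risk penalty grows with the square of the holding.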

What comes out the other side

is the target portfolio.

The actual amount of each name you want

to hold. The optimizer just hands you a

target, though.

Actually getting there is the next

block, implementation and trading.

The whole philosophy of this stage fits

in one line.

Subtract as little value as possible.

Every trade leaks a bit of your alpha

back out, and this is death by a

thousand cuts.

So it's worth knowing what those costs

are.

There's commission,

the per share fee to the broker.

There's the bid-ask spread. Buy at the

ask, sell at the bid. And that gap is

the cost of a round trip. There's market

impact, the big sneaky one. Buying one

share is cheap, but buying 100,000

shares pushes the price against you as

you go.

I always describe market impact as the

finance version of the Heisenberg

uncertainty principle. You can't observe the market

without disturbing it. And finally,

there's opportunity cost.

The trade you waited on for a better

price that just ran away from you. The

way you measure all of this after the

fact is implementation shortfall. You

run a hypothetical paper portfolio with

zero trading costs and compare it to

your real one.
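
A hedged sketch of that comparison with invented trades: the paper portfolio buys the full target at the decision price for free, while the real one pays commission, fills at worse prices, and leaves part of one order unfilled. The split into three cost buckets is one common way to break the gap down, not necessarily how any particular firm reports it.

```python
# Hypothetical two-name example; all prices and sizes are made up.
decision_price = {"AAA": 100.0, "BBB": 50.0}   # price when the trade was decided
fill_price = {"AAA": 100.4, "BBB": 50.1}       # average price actually paid
end_price = {"AAA": 103.0, "BBB": 52.0}        # price at the end of the period
target_shares = {"AAA": 1000, "BBB": 2000}     # what the optimizer asked for
filled_shares = {"AAA": 1000, "BBB": 1500}     # 500 BBB shares never traded
commission_per_share = 0.005

names = list(target_shares)

# Paper portfolio: full target, bought at the decision price, zero costs.
paper_pnl = sum(target_shares[n] * (end_price[n] - decision_price[n]) for n in names)

# Real portfolio: only what filled, at the fill price, minus commission.
commission = commission_per_share * sum(filled_shares.values())
real_pnl = sum(filled_shares[n] * (end_price[n] - fill_price[n]) for n in names) - commission

shortfall = paper_pnl - real_pnl

# Breakdown of the gap.
execution_cost = sum(filled_shares[n] * (fill_price[n] - decision_price[n]) for n in names)
opportunity_cost = sum((target_shares[n] - filled_shares[n])
                       * (end_price[n] - decision_price[n]) for n in names)

print(f"shortfall: {shortfall:.2f}")
print(f"  execution {execution_cost:.2f}, commission {commission:.2f}, "
      f"opportunity {opportunity_cost:.2f}")
```

In this example the unfilled 500 shares are the biggest leak, which is the opportunity cost described above.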

And the gap is your total cost of

implementation. The last block closes

the loop, performance analysis. After

the fact, you take what actually

happened and decompose it. How much came

from the factor bets you intended to

make? How much from constraints?

How much was just noise?

The real goal is to separate skill from

luck and to find where the skill

actually lives, so you double down on

what's working. This block feeds

straight back to research because

factors decay. An edge that printed

money five years ago gets crowded out as

everyone else discovers it. You're never

done. You're always refurbishing old

factors and hunting new ones. Two

numbers tie the entire machine together

and I want you to internalize both. The

first is the information ratio. Your

active return divided by your active

risk. It's the report card. How much

value are you adding per unit of risk

you choose to take? The second genuinely

changed how I think. The fundamental law

of active management. Your information

ratio is roughly your skill multiplied

by the square root of your breadth.

Where breadth is the number of

independent bets you make.

So, there are exactly two ways to get

better.

Be more skillful per bet or make more

independent bets. Now, that word

independent is doing enormous work. If

you're long five stocks and short five,

but all the longs are retail and all the

shorts are energy, you don't have 10

bets. You have two. A bet on retail and

a bet against energy. Real breadth means

genuinely distinct decisions across both

the names you cover and how often you

independently revisit them. And once

that clicks, the entire point of a

multi-factor model snaps into focus.

It's a breadth machine.

It's how you make thousands of small,

independent, slightly better than even

bets across the whole market every

single day.

And the square root tells you that

stacking up breadth is how a modest per

bet skill compounds into a serious edge.
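
The fundamental law is simple enough to play with directly. In this sketch the skill number (0.02) is an arbitrary example, and the breadth figures are chosen to mirror the retail versus energy case above.

```python
import math

def information_ratio(skill, breadth):
    """Fundamental law of active management: IR is roughly skill * sqrt(breadth)."""
    return skill * math.sqrt(breadth)

skill = 0.02  # assumed per-bet skill, small on purpose

# Ten positions that are really two bets (retail vs. energy) versus ten
# genuinely independent bets, versus the breadth of a large daily book.
ir_two_bets = information_ratio(skill, 2)
ir_ten_bets = information_ratio(skill, 10)
ir_many_bets = information_ratio(skill, 2500)

print(f"2 bets:    {ir_two_bets:.3f}")
print(f"10 bets:   {ir_ten_bets:.3f}")
print(f"2500 bets: {ir_many_bets:.3f}")
```

Because of the square root, going from 10 to 2,500 independent bets multiplies the information ratio by about 16, not 250.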

That's the skeleton. Everything else I

show you from here hangs off one of

these blocks.

So, let's zoom into the very first block

of that diagram, data and signals,

because that's where I spent a real

chunk of my time and it's where the

whole machine either stands or falls.

I said everything downstream is only as

good as the data feeding it and I want

to show you what good data actually

takes, starting with a fundamental piece

of data engineering in this space,

security matching.

Before I get to matching, it's worth

naming the unglamorous work that lives

in this block because it's easy to

assume the data shows up clean and it

never does. A big part of the job is

data scrubbing, cross-checking outliers

against other sources, filling gaps,

fixing formats, and then reconciling

vendors who all describe the same

company differently. One vendor

identifies a company by CUSIP, another

by SEDOL, another by ISIN. One revises

its history when figures get restated,

another doesn't.

Sorting all of that out so the data is

consistent and trustworthy is the price

of admission before anything downstream

can run.

And the sharpest version of that problem

is security matching.

Most quant firms with the budget for it

will buy alternative data, a newer type

of data that's entered the quant finance

space over the past several years.

Examples of alternative data include

scraped social media data, supply chain

information, news articles, credit card

transactions, and broker reports, among

many others.

The reason alternative data has exploded

in popularity over the past few years is

that it's become so hard to extract

valuable information from traditional

data.

There's been a lot more competition over

the past couple of decades, and quant

firms are constantly looking for unique

pieces of information they can turn into

trading factors.

So, whenever a firm buys a data set, the

first step to integrating it into the

system is a process called security

matching,

something every data engineer at a quant

firm has to do.

It's a pretty tedious process, but it's

a necessity for everything downstream.

Security matching is the process of

mapping the entities in a data set to

the firm's internal identifiers for its

trading model.

For example, mapping a URL from a

website like apple.com to the official

identifiers that represent Apple.

These official identifiers are

standardized across the finance

industry.

A few to mention are ISIN, CUSIP, and

SEDOL,

and Bloomberg IDs are also commonly

accepted.

The reason you need to map these

entities is so you can algorithmically

trade based on the data set you bought.

The key concept in security matching is

point in time. Point in time is a hard

requirement because it prevents you from

corrupting the data with look-ahead

bias. An easy way to understand it, the

identifier for Apple is only valid for

the specific time periods during which

the company was publicly listed under

it, and those official identifiers can

change for several reasons. One example

is mergers and acquisitions. So, you

always have to ask what was true as of

that date, not what is true today. This

whole process tends to get pretty

straightforward once you've matched a

couple of data sets. A couple of things

to keep in mind. The data sets you're

mapping are often several terabytes in

size and they require quite a bit of

sophistication in how you process them.

That ranges from distributed computing,

where applicable, to processing the data

with fast, efficient transformations

using tools like Polars, Spark, or

NumPy.
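
Going back to the point-in-time requirement, a minimal lookup might look like the sketch below. The mapping rows, the `SEC_001` style internal IDs, and the `match_as_of` helper are all hypothetical; a real mapping table would be far larger and would live in a database rather than a Python list.

```python
from datetime import date

# Hypothetical mapping rows: (vendor key, internal id, valid_from, valid_to).
# valid_to is exclusive; None means the mapping is still valid today.
MAPPING = [
    ("oldco.com", "SEC_001", date(2010, 1, 1), date(2018, 6, 1)),
    ("oldco.com", "SEC_002", date(2018, 6, 1), None),  # identifier changed, e.g. after a merger
]

def match_as_of(vendor_key, as_of, mapping=MAPPING):
    """Return the internal id that was valid for vendor_key on the as_of date."""
    for key, internal_id, valid_from, valid_to in mapping:
        if key != vendor_key:
            continue
        if valid_from <= as_of and (valid_to is None or as_of < valid_to):
            return internal_id
    return None  # not listed under any known identifier on that date

# A record dated 2015 must map to the identifier valid in 2015,
# not to whatever the company is listed under today.
print(match_as_of("oldco.com", date(2015, 3, 2)))
print(match_as_of("oldco.com", date(2020, 1, 1)))
```

Returning `None` for dates outside every validity window is deliberate: silently falling back to today's identifier is exactly how look-ahead bias sneaks in.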

Another concept that's very relevant to

security matching is data loaders.

When a vendor sells data to a quant firm

and then updates their database, the

firm needs to make sure it pulls in that

updated data. And the way you do that is

with a data loader, which tends to be

unique to each vendor.

Some vendors update their data daily,

some weekly, some monthly.

It really depends on the type of data

they sell.

Typically, what you'd expect is for the

data loader to pull in the updated

information from the vendor's side,

perform the necessary transformations

and aggregations to preprocess it, and

then save it down into the firm's

internal database.

From that point on, any trading factor

built off that data runs again on the

fresh data that just came in. So, data

loaders are a production-quality

feature. If the data loader doesn't

work, the trading factor gets halted.

So, it's critical that both the security

matching and the data loaders work

flawlessly and that you build in some

kind of auditing to catch any

discrepancies in the data coming

upstream from the vendor. And that opens

up a whole new concept, data auditing. I

couldn't depend on the vendor to provide

correct data all the time,

so I had to build auditing systems for

the different vendors.

Most of the time, this was just

statistical metrics reported on certain

columns of the data set,

or even calculating the coverage of the

securities mentioned in the newly

refreshed file.

Any difference that crossed a certain

threshold would flag me, or whoever else

was responsible on the data side, to

look into the issue. This is a very

common standard practice in the quant

space because if the data is wrong, your

trading factor is trading on incorrect

data. So, it's vital to pinpoint the

issue as early and as far upstream as

possible.
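
Here is a small sketch of the kind of audit described above: compare a column's mean between the previous and refreshed files, and check how much of the trading universe the new file covers. The function name, field names, and thresholds are assumptions for illustration.

```python
from statistics import mean

def audit_refresh(previous, refreshed, column, universe,
                  max_mean_shift=0.10, min_coverage=0.95):
    """Return a list of flags for a refreshed vendor file (empty means OK)."""
    flags = []

    # Statistical check: has the column's mean moved more than the threshold?
    prev_mean = mean(row[column] for row in previous)
    new_mean = mean(row[column] for row in refreshed)
    if prev_mean != 0:
        shift = abs(new_mean - prev_mean) / abs(prev_mean)
        if shift > max_mean_shift:
            flags.append(f"{column}: mean shifted by {shift:.0%}")

    # Coverage check: what share of the universe appears in the new file?
    covered = {row["security_id"] for row in refreshed} & universe
    coverage = len(covered) / len(universe)
    if coverage < min_coverage:
        flags.append(f"coverage dropped to {coverage:.0%}")

    return flags

# Hypothetical data: four securities, one numeric column.
universe = {"S1", "S2", "S3", "S4"}
previous = [{"security_id": s, "value": v}
            for s, v in [("S1", 10.0), ("S2", 12.0), ("S3", 11.0), ("S4", 9.0)]]
good_refresh = [{"security_id": s, "value": v}
                for s, v in [("S1", 10.0), ("S2", 12.0), ("S3", 11.0), ("S4", 10.0)]]
bad_refresh = [{"security_id": s, "value": v}
               for s, v in [("S1", 100.0), ("S2", 120.0)]]

good_flags = audit_refresh(previous, good_refresh, "value", universe)
bad_flags = audit_refresh(previous, bad_refresh, "value", universe)
print(good_flags)
print(bad_flags)
```

The bad file trips both checks: the values look like they changed units, and half the universe is missing.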

Now, let's move one block to the right

on our diagram to where those signals

become alphas, the actual trading

factors. On the whiteboard, that's a

single tidy box, but in real life, it's

where research meets production. And

that handoff is most of what I did

day-to-day. So, I want to expand on that

block and talk about my experience

productionizing trading factors.

I worked alongside plenty of senior

researchers who were responsible for the

research side of the trading factors,

and I worked very closely with them to

take all of their research code and

methodology and convert it into

production-ready code that could run in

our live trading model.

I worked on about six different research

projects over my 3 years as a quant dev

so far. I obviously can't talk about the

specifics, but I'll walk through the

high-level process and what I went

through. The projects I worked on

covered very different topics. Some of

them leveraged alternative data, others

used traditional financial data, and

only a very small subset actually used

neural network techniques, which I found

pretty surprising. A lot of research in

general can be done with regression

models or gradient boosted trees. If

there's interest in understanding

regression models or gradient boosted

trees, let me know in the comments.

That'll be my signal to do a future

video explaining them intuitively.

Most of the research code was written in

R, and I'd take that and convert it into

either Python, Spark, or KDB+ (with its q language),

which is a pretty old technology that a

lot of hedge funds use. It's extremely

optimized for speed, and it's an

excellent database. That said, I used a

lot of distributed computing when

building these trading factors, along

with a lot of vectorization using NumPy.

Since everything is time series based,

we needed to enable distributed

computation for practically every

trading factor we built because the

first thing you do is run the code over

historical data. And that can range

anywhere from 5 to 15 years.

With distributed computing, a full

historical run typically takes about half a

day to a full day, depending on the trading

factor and the data being used. Once

everything is set, you're pretty much

ready to start running the trading

factor on the new data that comes in

through the data loader. And because

it's all distributed, it runs a lot

faster than it otherwise would. You also

have to keep in mind that these trading

factors have a specific time window in

which they can be updated. In

production, multi-factor models tend to

get updated every day. And if you're

trading global equity, for example, you

only have a really specific window. Once

the New York market closes, you have a

few hours to run the model on the latest

data so you can start trading during

Japan's market hours.

So, being able to distribute your code

and really understanding this concept is

very important. When I wrote the

production code, I was given a Word

document that contained all of the

research methodology for that factor.

That gave me an easy way to understand

the thought process behind each step.

And it also made collaboration much

easier across the several members who

might be involved in a project. It's

essentially a central note that everyone

can refer to when trying to understand a

trading factor.

It became very clear during my time at

the firm that documentation is extremely

important. Not only does it help

researchers and developers understand

the code better, but in the future it's

common for a trading factor to need a

second research project on top of it,

either to fix issues or to amplify its

performance. This happens because

factors always decay in performance, and

it's up to the portfolio managers to

decide whether a particular factor has

some low-hanging fruit worth another

research project to improve it.

These research projects tend to involve

several people from the team, usually a

researcher, a developer, and a tester,

plus a few senior partners who provide

oversight, give advice, and get constant

reporting. These small pods are a

crucial part of the firm because they

require excellent collaboration from

each member and a real sense of team

spirit.

One thing I noticed is that the

portfolio managers paid close attention

to particular pods. If they noticed a

pod had a great work ethic and really

good synchronous workflows, they tended to

keep that pod together for future

projects because it meant better project

throughput.

Here's a fair question to ask looking at

the board so far. Those historical runs

over 15 years of data across thousands

of names,

where does that actually run and how

does it finish before the next trading

window?

That's the infrastructure layer sitting

underneath the data and alpha blocks we

just expanded. So, let me draw it in. When

I joined the firm, everything

was done on premises. They had huge

servers in the office where everything

ran. It soon became clear to management

that they had to migrate to cloud

services. The team was growing, the data

was getting larger, and the factors were

getting more complicated and demanding

more compute.

So, one of my biggest tasks at the firm

was migrating a lot of our

infrastructure to the cloud and that

forced me to get very comfortable with

AWS and Databricks.

AWS has a lot of services and it can be

confusing to navigate. But, the main

ones I leveraged were EC2, ECR, and S3.

On the Databricks side, I pretty much

used every feature they had at the time.

They've definitely expanded a lot since.

When I migrated everything, I took our

on-premises code, moved it into

notebooks, and incorporated those into

workflows, which essentially trigger the

code to run on a specific schedule. But,

the most notable pieces of this section

are Apache Spark and Delta Lake, which

were the two features that really

transformed a lot of our processes. Let

me explain Apache Spark in some detail

because it's the engine that made all of

this possible at scale. Spark is a

distributed in-memory data processing

engine. The whole point of it is to take

a computation that would never fit on or

finish on a single machine, and spread

it across a cluster of machines that

work on it in parallel.

The way it's structured, you have a

driver, which is the brain. It holds

your program and builds the plan.

And you have executors, which are the

workers spread across the cluster that

actually crunch the data.

A cluster manager hands out the

machines. Your big data set gets split

into chunks called partitions, and each

executor works on its own partitions at

the same time.

That's where the speed comes from. If

you have 100 partitions and enough

executors, you're doing 100 things at

once.

A couple of things make Spark special.

First, it's lazy. When you write

transformations, filter this, join that,

group by the other, Spark doesn't

actually run anything yet. It just

records what you asked for and builds a

graph of the steps.

It only executes when you hit an action,

like writing the result or counting

rows.

That laziness is what lets Spark's query

optimizer, Catalyst, look at the entire

plan and rewrite it to be as efficient

as possible, pushing your filters down

to the data source so it reads less,

reordering joins, and so on, before a

single byte is processed.
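
Here is a toy sketch of that laziness in plain Python. It is not Spark's API and it does no optimization the way Catalyst does; `LazyTable` and its methods are hypothetical names. It only shows that transformations record a plan and that nothing runs until an action is called.

```python
class LazyTable:
    """Toy lazy table: transformations are recorded, actions run them."""

    def __init__(self, rows, plan=None):
        self._rows = rows
        self._plan = plan or []  # recorded steps, nothing executed yet

    def filter(self, predicate):
        return LazyTable(self._rows, self._plan + [("filter", predicate)])

    def map(self, fn):
        return LazyTable(self._rows, self._plan + [("map", fn)])

    def _execute(self):
        # Stream rows through the recorded steps in order.
        out = iter(self._rows)
        for kind, fn in self._plan:
            out = filter(fn, out) if kind == "filter" else map(fn, out)
        return out

    def count(self):
        """Action: triggers execution of the recorded plan."""
        return sum(1 for _ in self._execute())

    def collect(self):
        """Action: triggers execution and returns the rows."""
        return list(self._execute())


seen = []


def is_even(x):
    seen.append(x)  # record when the predicate actually runs
    return x % 2 == 0


pipeline = LazyTable(range(6)).filter(is_even).map(lambda x: x * 10)
ran_before_action = list(seen)  # still empty: only a plan exists so far
result = pipeline.collect()     # the action triggers execution
```

Because the whole plan exists before anything runs, a real optimizer has the chance to rearrange it first, which is the part this sketch leaves out.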

Second, it's in-memory. The old

MapReduce model wrote intermediate

results to disk between every step.

Spark keeps data in memory across steps,

which is why it's often an order of

magnitude faster for the kind of

multi-step pipelines we ran. And it's

fault tolerant. Because Spark remembers

the lineage, the recipe of

transformations that built any piece of

data, it can just recompute a lost chunk

if a machine dies instead of failing the

whole job. The one thing you have to

respect with Spark is the shuffle. Some

operations, like joins and group-bys, need

data that lives on different machines to

be moved around so related rows end up

together.
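
A rough way to picture a shuffle, in plain Python rather than Spark: rows are (key, value) pairs with integer keys, and each row is routed to a target partition by `key % num_targets`. The function names and the "moved rows" counter are assumptions for illustration; the counter is only a crude proxy for network traffic.

```python
from collections import defaultdict


def shuffle_by_key(partitions, num_targets):
    """Redistribute (key, value) rows so equal keys share a target partition.

    Returns the new partitions and the number of rows that changed partition.
    Keys are assumed to be integers.
    """
    targets = [[] for _ in range(num_targets)]
    moved = 0
    for source_index, partition in enumerate(partitions):
        for key, value in partition:
            target_index = key % num_targets
            if target_index != source_index:
                moved += 1
            targets[target_index].append((key, value))
    return targets, moved


def group_sum(partitions):
    """Per-partition aggregation; valid only after keys are co-located."""
    totals = {}
    for partition in partitions:
        local = defaultdict(int)
        for key, value in partition:
            local[key] += value
        totals.update(local)
    return totals
```

Running `group_sum` directly on partitions where the same key appears in two places would give wrong totals, which is why the data has to move first.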

That movement across the network is the

expensive part. And most of optimizing

Spark comes down to minimizing and

controlling shuffles. So, how would this

actually look if I applied Spark to a

large data set? Picture one of the

alternative data sets I mentioned,

several terabytes sitting in S3 as

partitioned files. The flow looks like

this. Spark reads those partitions from

S3 spread across the executors. Because

it's lazy, it pushes my date and column

filters all the way down so it only

pulls what I actually need. Then it runs

the transformations in parallel on each

partition, cleaning, aggregating,

computing my features. When I need to

attach the firm's internal security

identifiers, the security matching step

from earlier, that's a join.

And because the mapping table is small,

Spark broadcasts it out to every

executor so the join happens locally

with no shuffle. Finally, I partition

the output by date and write it back

out to a Delta Lake table, ready for

the trading factor to consume. A

historical run that would take days on

one machine comes down to hours.
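
The broadcast join step is the easiest one to sketch without a cluster. This is plain Python, not Spark, and the column names `vendor_id` and `security_id` are hypothetical. The point is that every partition gets its own full copy of the small mapping table, so none of the large data set's rows have to move.

```python
def broadcast_join(partitions, mapping, key="vendor_id"):
    """Join each partition against a small lookup table held locally.

    `mapping` plays the role of the broadcast table: every partition gets the
    full copy, so no rows of the large data set need to move.
    """
    joined_partitions = []
    for partition in partitions:
        local_copy = dict(mapping)  # each "executor" holds its own copy
        joined = [
            {**row, "security_id": local_copy[row[key]]}
            for row in partition
            if row[key] in local_copy  # inner join: drop unmatched rows
        ]
        joined_partitions.append(joined)
    return joined_partitions
```

This only pays off while the lookup table is small enough to copy everywhere; once both sides are large, you are back to a shuffle.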

That Spark job had to read from

somewhere and write to somewhere. So,

the natural next layer to draw

underneath everything is storage. And

the choice of where data lives isn't an

afterthought. It's wired into why the

whole system performs the way it does.

So, I'm also going to explain two types

of databases that are very commonly used

in the quant space and that almost

always come up in interviews. I'd say

about 80% of the interviews I

encountered for quant developer roles

asked me about the specifics of

databases, and discussing Parquet data

lakes and KDB+ has always been a talking

point. So, I'll go through how quants

typically use Parquet data lakes and

why. Let's start with Parquet and Delta

Lake. Parquet is a columnar file format.

A row-oriented database stores all of a

record's fields together. Parquet flips

that and stores each column together

instead. That sounds like a small

detail, but it's huge for the kind of

work we do because our queries usually

touch a few columns across millions of

rows, not whole rows. Storing by column

means you only read the columns you ask

for. You get fantastic compression

because similar values sit next to each

other, and you can skip entire chunks of

a file that couldn't possibly match your

filter. A Delta Lake is what you get

when you put a transaction layer on top

of a pile of Parquet files. On its own,

a folder of Parquet files has no concept

of a consistent all-or-nothing change.

Delta adds a transaction log that gives

you ACID guarantees: atomicity,

consistency, isolation, and durability.

In practice, that means a write either

fully happens or doesn't happen at all.

Readers never see a half-written table,

and concurrent jobs don't corrupt each

other. It also enforces a schema, so bad

data doesn't silently slip in.
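
As a toy model of the all-or-nothing write and the schema check, here is a sketch in plain Python. It is not Delta Lake's actual format or API; `TinyDeltaTable`, the `part-N` file names, and the schema-as-Python-types convention are all assumptions. The idea it tries to show is that files are staged first and only become visible when a single log entry is appended.

```python
class TinyDeltaTable:
    """Toy table: data files plus an append-only commit log."""

    def __init__(self, schema):
        self.schema = schema  # column name -> Python type
        self.files = {}       # file name -> list of rows
        self.log = []         # each entry: file names committed together

    def _validate(self, rows):
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"unexpected columns: {sorted(row)}")
            for column, expected in self.schema.items():
                if not isinstance(row[column], expected):
                    raise ValueError(f"{column} must be {expected.__name__}")

    def write(self, batches):
        """Write several batches as one all-or-nothing commit."""
        staged = {}
        for rows in batches:
            self._validate(rows)  # a failure here leaves the table untouched
            staged[f"part-{len(self.files) + len(staged)}"] = rows
        self.files.update(staged)
        self.log.append(list(staged))  # the single log append is the commit point

    def read(self):
        """Readers see only files referenced by the log."""
        return [
            row
            for commit in self.log
            for name in commit
            for row in self.files[name]
        ]
```

If the second batch in a write fails validation, the first batch was only staged, so a reader still sees exactly the table as of the previous commit.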

This combination is almost perfect for

quant data, and especially for time

series. You partition the data by date,

so when a factor needs the last 10 years

of a few fields, it scans exactly those

date partitions and exactly those

columns, nothing else. Daily updates are

just appends of a new date partition,

which is fast and cheap, and batch

computation loves this because the whole

historical run is just one big parallel

scan. But, the feature I leaned on the

most was versioning. Because Delta keeps

a transaction log of every change, you

get time travel. You can query the table

exactly as it looked at any past version or timestamp still kept in the log.

That ties directly back to the

point-in-time requirement I talked about

in security matching.

When I run a factor over history, I need

the data as it was known then,

not as it's been restated since.

Versioning gives me that essentially for

free.

And it makes runs reproducible.
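
Time travel falls out of the log almost mechanically. In this plain-Python sketch (not Delta Lake's real log format), each version is a list of hypothetical `add` and `remove` actions, and a snapshot is rebuilt by replaying the log up to the version you ask for. The `eps` field and the file names are made up for illustration.

```python
def snapshot(log, version):
    """Replay add/remove actions up to and including `version`."""
    live = {}
    for actions in log[: version + 1]:
        for action, name, rows in actions:
            if action == "add":
                live[name] = rows
            elif action == "remove":
                live.pop(name, None)
    return [row for name in sorted(live) for row in live[name]]


# Version 0 loads the vendor's original figure; version 1 restates it.
log = [
    [("add", "2024-01-02.v0", [{"date": "2024-01-02", "eps": 1.00}])],
    [
        ("remove", "2024-01-02.v0", None),
        ("add", "2024-01-02.v1", [{"date": "2024-01-02", "eps": 1.10}]),
    ],
]

as_known_then = snapshot(log, 0)  # what a backtest should see
as_known_now = snapshot(log, 1)   # what the table shows today
```

A backtest pinned to version 0 keeps returning the original figure no matter how many restatements land later, which is what makes the run reproducible.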

If a vendor restates a chunk of history,

I can cleanly overwrite just those

partitions instead of rebuilding the

whole table.
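
The date-partitioned, column-wise layout and the partition overwrite can be sketched together. This is an in-memory stand-in, not Parquet or Delta; `PartitionedColumnStore` and its methods are hypothetical, and dates are assumed to be ISO strings so they sort correctly as text. The `touched` count shows how little data a narrow query reads.

```python
class PartitionedColumnStore:
    """Toy store: one entry per date, each holding columns as separate lists."""

    def __init__(self):
        self.partitions = {}  # date string -> {column name -> list of values}

    def overwrite_partition(self, date, columns):
        """Append a new date or replace an existing one; other dates are untouched."""
        self.partitions[date] = columns

    def scan(self, start, end, columns):
        """Read only the requested columns from dates in [start, end]."""
        touched = 0
        result = {column: [] for column in columns}
        for date in sorted(self.partitions):
            if start <= date <= end:    # partition pruning
                for column in columns:  # column pruning
                    values = self.partitions[date][column]
                    result[column].extend(values)
                    touched += len(values)
        return result, touched
```

A daily load and a vendor restatement are the same call here: `overwrite_partition` replaces one date and leaves every other partition alone.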

And because everything sits on cheap

object storage like S3

and reads in parallel,

the read and write throughput scales

with the size of your cluster rather

than choking on a single machine. The

other database is KDB+ and its query

language Q. KDB+ is a completely

different animal from Parquet and Delta.

It's an in-memory columnar time series

database and it is absurdly fast. It was

built from the ground up for exactly the

kind of data finance produces, enormous

streams of timestamped ticks and quotes.

The language Q is terse and vectorized.

You write tiny expressions that operate

on entire columns at once the way NumPy

does and it runs extremely close to the

metal. The thing KDB+ does better than

almost anything else is time-based

joins, in particular the as-of join

where for every trade you want the quote

that was in effect at that exact moment.

That operation is everywhere in finance

and painfully slow in a normal database.

In KDB+, it's a first-class,

lightning-fast primitive. That's why a

lot of hedge funds still build their

core on it despite it being an old niche

technology.
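
To show what an as-of join actually computes, here is a minimal Python sketch using binary search. It is not how KDB+ implements it and it handles a single symbol only, whereas a real as-of join would also match on the instrument. The field names `time`, `bid`, and `ask` are assumptions.

```python
from bisect import bisect_right


def asof_join(trades, quotes):
    """For each trade, attach the latest quote at or before the trade time.

    Both inputs are lists of dicts with a numeric "time"; quotes must be
    sorted by time.
    """
    quote_times = [q["time"] for q in quotes]
    joined = []
    for trade in trades:
        # Index of the last quote whose time is <= the trade time.
        i = bisect_right(quote_times, trade["time"]) - 1
        quote = quotes[i] if i >= 0 else None
        joined.append({
            **trade,
            "bid": quote["bid"] if quote else None,
            "ask": quote["ask"] if quote else None,
        })
    return joined
```

A trade that arrives before the first quote gets no match, which is the behavior you want: there was no prevailing quote to attach.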

At my firm, KDB+ was the backbone of the

live trading model and we ran it in

combination with HTCondor,

which is a job scheduler that farms work

out across a grid of machines.

So, KDB+ held and served the time series

data at speed and HTCondor distributed

the actual model computation across the

grid on top of it.

The way I'd sum up the two: Parquet and

Delta Lake are your cheap, massive,

versioned warehouse for research and

batch. They scale out and they remember

everything.

KDB+ is your high-performance engine for

time series and live trading.

It's all about raw speed on timestamped

data.

Most firms use both because they're

solving two different problems.

And that's pretty much everything I

wanted to cover for now.

If you take one thing away, let it be

the picture we just built. We started

with a single skeleton of the

multi-factor machine. And every piece I

worked on was really just one of those

blocks cracked open and expanded. The

data block became security matching, the

alpha block became the research to

production pipeline, and underneath all

of it sat the infrastructure and the

databases that make it run on time.

That's genuinely how it feels on the

inside. One big connected system where

every part exists for a reason.

There's a lot more I could dive into,

but this should be enough for now,

depending on how much of a reaction this

video gets.

If there's any positive feedback,

that'll be the encouragement I need to

keep making more videos that are helpful

for you.

Let me know your thoughts in the

comments. I'll genuinely decide whether

to keep making these videos based on

your reactions. So, if this was helpful,

please let me know.

Tell me what you liked the most and what

you'd like to see next.

If you're interested in this kind of

content, go ahead and like and subscribe

so you can stay updated. I'll try to

post a video every week going forward

and we'll see how it goes. Thank you so

much and I'll do my best to answer any

questions you have. Thanks. Bye.
