What Nobody Tells You About Being a Quant
Video ID:tzTftCzmr7k
Transcript
Hi everyone, I've received a lot of
requests to do a full walk-through of my
experience as a quantitative developer.
And so today, I'm going to share
everything I learned in my four years as
a quant, and also touch on things I've
worked on to give a detailed look into
some practical systems that a quant
developer could be expected to
implement. The most exciting piece that
I will cover is a review of the
system design of a trading system you
can expect at most big quant firms,
which can be great for implementing in
your personal projects, and in turn a
great talking point in interviews.
I've sectioned this video off into
chapters, so you can easily skip ahead
to the parts you're most interested in.
First, I'll start with the experience I
had before I was accepted as a
quantitative developer at my firm. I
graduated from the engineering
department at the University of
Waterloo, and I had completed six
internships, all of them in big tech and
startups. Across those roles, I worked
as a software developer and as a machine
learning engineer. Despite the extensive
internship experience, I didn't have any
finance or quant finance background at
all. After I graduated, I applied to
hundreds of quant roles of varying
types, and I eventually landed mine.
The interview process consisted of six
rounds, and I'll cover each one. The
first two rounds were technical coding
questions, but they weren't your typical
LeetCode questions. They felt
custom-made for the role, and the two
questions I got were of varying
difficulty.
The first round asked me to build a
portfolio management system,
something that could track orders,
execute orders, and maintain my
positions across different stocks.
I found this one fairly easy because I
had already built trading systems in the
past, and was familiar with the features
the interviewer was asking for, like
stop losses, limit orders, and so on.
The key thing I took away from that
round was the importance of
communicating my thought process
thoroughly and concisely.
If there's one piece of technical
interview advice you take from this
video, it's this.
Half of what they're looking for is
whether you can work your way through
the problem using good technical
practices,
and the other half is whether you can
communicate your thought process
clearly.
This is a hugely underrated skill in
interviews,
and it's one of the main things I, along
with my colleagues, look for when we
conduct interviews ourselves. If a
candidate can't communicate their
thinking behind a simple coding
question, we lose confidence that
they'll be able to explain the far more
complex projects they'd take on if they
were hired as a quant developer,
researcher, or trader.
As a quant, part of your job is being
able to explain technical concepts to a
wide range of people across the firm in
a very intuitive and clear way.
When I interview candidates for our
firm, I tend to weigh excellent
communication more heavily than raw
coding skill, especially in this
climate in the tech industry. The second
round was about developing an algorithm
around the different sets of combinations
that could appear in a 52-card deck, and
then modifying the suits and ranks in the
deck. I can't remember the exact
question. I tried searching for it
online afterward, and couldn't find
anything that matched. But from what I
recall, it was really testing my ability
to apply statistics to programming.
This was by far the hardest question I
encountered in the whole process.
And I'll be honest, I didn't get it
fully correct.
But I was able to communicate my thought
process as best I could, and got
probably 70% of the way there.
That actually reinforces the point I
made earlier. Your communication will
take you a lot farther in the interview
process than you might realize. I truly
thought that round would get me
knocked out right away.
To my surprise, he moved me on to the
next stage to meet the partners at the
firm, who are now my managers. So, my
biggest piece of advice is this.
Whenever you practice technical
questions, talk through the problem out
loud to yourself first. Write it down in
pseudocode, and then actually implement
it. That way, you build the habit of
explaining yourself concisely. The third
round was a system design and technical
round with two partners at the firm. It
was a little intimidating, almost a good
cop, bad cop situation. They went
through my resume in full detail and
dove deep into the projects I'd worked
on. Their goal was to see how well I
could communicate a technical concept,
and also to confirm that I had genuinely
done the things on my resume.
And trust me, it comes across very
clearly if that isn't true.
I talked about one of the projects from
my machine learning engineering role,
and a side project I'd built in trading
systems.
This kind of round will almost certainly
happen in every interview process.
So, I highly recommend being extremely
comfortable being questioned about your
own work and able to talk about a
project for a solid hour. No matter how
hard they pushed on the work I'd done,
it was important that I came across as
confident and stood my ground with the
technical depth to explain my projects
in detail.
They asked me to show the system design
for one of my projects, so I shared my
screen and drew out the systems I'd
built, and they questioned each piece.
There were a lot of questions around
data engineering, how I'd organized the
database, and the specific ML algorithms
I'd used, which I believe were some kind
of gradient boosted tree algorithm,
if I remember correctly.
The fourth, fifth, and sixth interviews
were quite similar. Each one was with a
different partner at the firm. It seemed
like they were rotating me around all
the partners of each sub team within the
quant department to figure out where I'd
fit in the company.
I won't dwell on these too much because
they didn't feel as difficult. They were
mostly a mix of systems design and
behavioral questions, talking about the
projects I'd worked on, database
questions, and understanding my
experience in the quantitative finance
space.
They traded global equity, so they did
ask me some questions related to that,
but they were fairly high level, and
they clarified that I wouldn't be
penalized for not knowing too much of
it. Though they made it clear it would
be expected of me to grasp it if I
joined.
The first big concept I want to walk
through is the multi-factor trading
model, because honestly everything else
in this video hangs off it. It's the
standard architecture across a lot of
the top firms, and once you understand
it, the rest of my job will make a lot
more sense. It's not a new idea, it goes
back a few decades, and there's one book
I'd point any new hire to before
anything else, Active Portfolio
Management by Richard Grinold and Ronald
Kahn.
I'll link it in the description. What I
want to do here is lay out the whole
machine end to end, the way I'd sketch
it on a whiteboard for someone on their
first week.
Then, for the rest of the video, we'll
zoom into the individual pieces I
actually worked on and expand them.
One thing to set expectations on first,
the multi-factor system at my firm, like
at basically every firm, is an enormous
code base. Nobody who joins today
understands the entire thing end to end,
and I'd argue nobody understands all of
it at all. So think of this section as
the map I wish someone had drawn for me
on day one.
Let me start with the single idea the
whole business rests on.
We get paid to outperform a benchmark. A
client could just buy the index for
almost nothing, so our entire reason to
exist is the value we add on top of that
index. That value add has a name, alpha.
The cleanest way to think about it, if
your benchmark is up 10% and you're up
12, that extra 2% that came from you,
not from the market rising, is your
alpha.
Everything we build is in service of
producing more of it.
And here's the mindset shift that trips
up most people coming from a pure
engineering background, like I did. The
market price already has a consensus
view baked into it. Thousands of smart
people have already priced in what they
collectively think a stock is worth. So,
our job isn't to figure out what a stock
is worth in some absolute sense. It's to
find the specific spots where we
disagree with the consensus, and to be
right about that disagreement more often
than the other people trying to do the
exact same thing. Active management is
forecasting the market's errors. So, how
do you forecast in a disciplined,
repeatable way? You start by realizing
that a stock's move on any given day
isn't one thing.
It's a stack of things added together.
Picture an oil producer that's up 14%
over some period.
A chunk of that is just the whole equity
market going up. Another chunk is its
country.
Another chunk is its industry.
Oil moved, so oil stocks moved with it.
And only what's left after you strip all
of that out is truly specific to that
company.
That leftover piece is the part you
actually have a view on.
The way we separate those pieces in
practice is regression. You regress the
stock's return against the things that
drive it, and whatever they explain is
factor return, while the unexplained
residual is the stock's own
idiosyncratic return. And a quick gut
check that surprises people, even for an
oil stock that tracks oil tightly, that
company-specific residual is usually a
big part of its total movement. Stocks
are a lot more individual than they
look. That idea, decomposing returns
into common drivers plus a specific
piece, is the seed of the whole model.
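To make that decomposition concrete, here is a minimal sketch, assuming a single made-up "oil" factor and invented return numbers. A real model would regress on many factors at once (market, country, industry), and the function and variable names here are purely illustrative.

```python
# Toy single-factor decomposition: stock return = intercept + beta * factor + residual.
# All return numbers below are invented for illustration.

def ols_fit(y, x):
    """Least-squares slope (beta) and intercept of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    beta = cov_xy / var_x
    intercept = mean_y - beta * mean_x
    return beta, intercept

oil_factor = [0.020, -0.010, 0.015, 0.005, -0.020, 0.010]    # factor returns
stock = [0.031, -0.004, 0.012, 0.015, -0.027, 0.006]         # oil producer returns

beta, intercept = ols_fit(stock, oil_factor)

# The part of each period's return explained by the factor...
factor_part = [beta * f for f in oil_factor]
# ...and the unexplained, company-specific remainder.
residual = [r - intercept - fp for r, fp in zip(stock, factor_part)]

for r, fp, res in zip(stock, factor_part, residual):
    print(f"total={r:+.4f}  factor={fp:+.4f}  specific={res:+.4f}")
```

By construction the residuals average to zero and are uncorrelated with the factor, which is what lets you treat them as the stock's own piece.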
So, let me walk the pipeline it grows
into. It all starts with data and
signals. Everything downstream is only
as good as the data feeding it. So, this
is where it begins. Both traditional
financial data and the alternative data
I'll get into later.
Out of that data, you build signals.
Individual, measurable views on a stock.
Classic ones are value, momentum, size,
quality.
The more exotic ones come out of
alternative data sets.
One thing worth flagging early.
Before any of this, you also have to
define your universe.
The set of names you're even willing to
trade.
Your instinct is to trade everything for
maximum breadth, but you quickly learn
some names just aren't worth it.
Illiquid stuff you can't get in and out
of without moving the price against
yourself.
So, you draw a sensible boundary first.
A raw signal though, isn't something you
can trade directly. And turning it into
something tradable is the next block,
the alpha.
An alpha is a clean, refined forecast of
return.
The intuition for the refinement is
three things multiplied together. How
volatile the stock is.
How much genuine predictive skill your
signal actually has.
And how strongly the signal's firing for
that specific name, right now.
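Here is a rough sketch of that three-way multiplication, assuming the "how strongly it's firing" part is a cross-sectional z-score of the raw signal. The tickers, volatilities, and the 0.05 skill number are all made up, and `z_scores` and `refined_alpha` are hypothetical helper names.

```python
from statistics import mean, pstdev

def z_scores(raw):
    """Standardize a raw signal across the universe (mean 0, std 1)."""
    mu = mean(raw.values())
    sd = pstdev(raw.values())
    return {name: (value - mu) / sd for name, value in raw.items()}

def refined_alpha(volatility, skill, score):
    """Forecast return = volatility x skill x standardized score."""
    return volatility * skill * score

# Invented inputs for four placeholder names.
raw_signal = {"AAA": 1.2, "BBB": 0.4, "CCC": -0.1, "DDD": -1.5}
volatility = {"AAA": 0.30, "BBB": 0.20, "CCC": 0.25, "DDD": 0.35}
skill = 0.05  # assumed correlation between signal and later returns

scores = z_scores(raw_signal)
alphas = {
    name: refined_alpha(volatility[name], skill, scores[name])
    for name in raw_signal
}

for name, alpha in alphas.items():
    print(f"{name}: score={scores[name]:+.2f}  alpha={alpha:+.4f}")
```

The useful property is that a signal with little skill gets shrunk toward zero automatically, no matter how loudly it fires.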
Each refined signal becomes one factor.
And multi-factor just means you're
running many of these at once. Each one
ideally capturing a different,
independent edge.
We'll come back later to why
independence matters so much, because
it's the whole game.
Running right alongside the alphas, not
after them, in parallel, is the risk
model. And this is the block newcomers
consistently underrate. The alpha tells
you what you hope to make. The risk
model tells you what it could cost you
in volatility. It uses that same
decomposition from before. A stock's
return is its exposure to a set of
common factors, industries, plus risk
indices like size, value, volatility,
momentum. Plus its own specific piece.
Now, why go to all this trouble instead
of just measuring how every stock moves
with every other stock directly?
This is the single best why I can give
you. So, let me actually work it
through.
To understand a portfolio's risk, you
need the covariance between every pair
of stocks, how each one co-moves with
each other one.
Take a universe of 1,400 names.
The number of pairwise relationships
you'd have to estimate is on the order
of 980,000.
That's hopeless. You can't estimate a
million numbers reliably from finite
history. You'd mostly be fitting noise.
So, here's the trick that makes the
whole thing possible. Instead of
stock-to-stock, you say every stock is
just a bundle of exposures to maybe 65
common factors. Now, you only need the
covariances among the factors, which is
around 2,000 numbers instead of nearly a
million.
You've collapsed an impossible problem
into a tractable one.
That is why the risk model is built on
factors rather than raw stocks.
It's the only way the math is even
computable.
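The counting behind those two numbers is quick to check. This is just the arithmetic, with hypothetical function names; the factor count below includes each factor's own variance as well as the pairwise covariances.

```python
def pairwise_covariances(n_stocks):
    """Distinct stock-to-stock covariances: n * (n - 1) / 2."""
    return n_stocks * (n_stocks - 1) // 2

def factor_covariance_terms(n_factors):
    """Distinct entries of a symmetric factor covariance matrix: k * (k + 1) / 2."""
    return n_factors * (n_factors + 1) // 2

stock_pairs = pairwise_covariances(1400)
factor_terms = factor_covariance_terms(65)

print(stock_pairs)   # 979300
print(factor_terms)  # 2145
```

The factor version still needs each stock's exposures and its own specific variance on top of this, but those grow linearly with the number of names rather than with the square of it.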
While I'm on risk, two pieces of
intuition I lean on constantly. First,
risk does not add up the way you'd
expect. The risk of a portfolio is less
than the weighted average of its parts
because the stocks don't all move
together. That gap is exactly the
benefit of diversification. Spread
across enough uncorrelated names, and
the specific risk averages away toward
nothing. But, and this is the catch, it
only averages away the specific part.
Stocks all tend to rise and fall with
the market together, and that shared
market risk never diversifies away, no
matter how many names you hold. That's
the line between specific risk, which
you can dilute, and systematic risk,
which you're always carrying.
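A minimal sketch of that split, assuming an equal-weighted portfolio where every stock has the same market exposure and the same, mutually uncorrelated, specific risk. The 15% and 30% volatilities are invented.

```python
import math

def portfolio_vol(n_stocks, market_vol=0.15, specific_vol=0.30):
    """Volatility of an equal-weighted portfolio of n similar stocks.

    Market variance is shared by every name, so it does not shrink.
    Specific variance is uncorrelated across names, so it averages down as 1/n.
    """
    return math.sqrt(market_vol ** 2 + specific_vol ** 2 / n_stocks)

for n in (1, 10, 100, 1000):
    print(f"{n:>5} names: {portfolio_vol(n):.4f}")
```

With one name you carry everything. As the count grows, the number heads toward 0.15 and stops there: that floor is the market risk you can't dilute.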
It's the core insight behind CAPM. The
market won't pay you a premium for risk
you could have diversified away for
free.
Arbitrage pricing theory took that one
step further
and said, "There isn't just one
systematic driver,
but many."
Which is precisely the multi-factor view
we run on.
The second piece of intuition is about
time.
Risk doesn't add across time, but
variance does.
As long as today's return isn't
correlated with yesterday's.
Which for most assets, it basically
isn't.
So variance piles up linearly with time.
And since risk is the square root of
variance, risk grows with the square
root of time.
That's why when you turn a monthly
volatility number into an annual one,
you multiply by the square root of 12,
not by 12. Tiny detail, but it's
everywhere.
And getting it wrong quietly corrupts
every risk number you produce.
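In code the conversion is one line. A small sketch, assuming uncorrelated monthly returns and an invented 5% monthly volatility:

```python
import math

def annualize_vol(period_vol, periods_per_year=12):
    """Scale a per-period volatility to annual, assuming uncorrelated periods."""
    return period_vol * math.sqrt(periods_per_year)

monthly_vol = 0.05
monthly_var = monthly_vol ** 2

annual_var = monthly_var * 12           # variance adds linearly over time
annual_vol = annualize_vol(monthly_vol) # risk grows with the square root

print(f"annual vol: {annual_vol:.4f}")                 # about 0.1732
print(f"wrong (x12) answer: {monthly_vol * 12:.4f}")   # 0.6000
```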
Once you've got both halves, the alphas
saying what you expect to make
and the risk model saying what it'll
cost, they feed into portfolio
construction. This is an optimizer. And
conceptually, it's doing a balancing
act. Maximize expected return. Subtract
a penalty for the risk you're taking.
Subtract the cost of trading, all while
respecting your constraints. Position
limits, staying sector neutral,
turnover caps, how much leverage you're
allowed.
What comes out the other side
is the target portfolio.
The actual number of each name you want
to hold. The optimizer just hands you a
target, though.
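To give a feel for that balancing act, here is a deliberately stripped-down sketch. It assumes the stocks are uncorrelated, ignores trading costs entirely, and keeps only one constraint, a position limit. Under those assumptions the problem separates name by name and has a closed-form answer. A production optimizer uses the full factor covariance and a numerical solver; every name and number here is hypothetical.

```python
def target_weights(alphas, vols, risk_aversion, max_weight):
    """Maximize sum(alpha * w) - risk_aversion * sum(vol^2 * w^2), |w| <= max_weight.

    With uncorrelated names each weight can be solved on its own:
    w = alpha / (2 * risk_aversion * vol^2), then clipped to the position limit.
    """
    weights = {}
    for name, alpha in alphas.items():
        unconstrained = alpha / (2.0 * risk_aversion * vols[name] ** 2)
        weights[name] = max(-max_weight, min(max_weight, unconstrained))
    return weights

# Invented forecasts and volatilities for three placeholder names.
alphas = {"AAA": 0.02, "BBB": -0.01, "CCC": 0.005}
vols = {"AAA": 0.20, "BBB": 0.25, "CCC": 0.40}

weights = target_weights(alphas, vols, risk_aversion=5.0, max_weight=0.04)

for name, weight in weights.items():
    print(f"{name}: {weight:+.4f}")
```

Even in this toy version you can see the trade-off: the highest-alpha name wants a 5% weight but gets capped at 4%, and the volatile name gets a small position despite a positive forecast.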
Actually getting there is the next
block, implementation and trading.
The whole philosophy of this stage fits
in one line.
Subtract as little value as possible.
Every trade leaks a bit of your alpha
back out, and this is death by a
thousand cuts.
So it's worth knowing what those costs
are.
There's commission,
the per share fee to the broker.
There's the bid-ask spread. Buy at the
ask, sell at the bid. And that gap is
the cost of a round trip. There's market
impact, the big sneaky one. Buying one
share is cheap, but buying 100,000
shares pushes the price against you as
you go.
I always describe market impact as the
finance version of the Heisenberg
principle. You can't observe the market
without disturbing it. And finally,
there's opportunity cost.
The trade you waited on for a better
price that just ran away from you. The
way you measure all of this after the
fact is implementation shortfall. You
run a hypothetical paper portfolio with
zero trading costs and compare it to
your real one.
And the gap is your total cost of
implementation. The last block closes
the loop, performance analysis. After
the fact, you take what actually
happened and decompose it. How much came
from the factor bets you intended to
make? How much from constraints?
How much was just noise?
The real goal is to separate skill from
luck and to find where the skill
actually lives, so you double down on
what's working. This block feeds
straight back to research because
factors decay. An edge that printed
money five years ago gets crowded out as
everyone else discovers it. You're never
done. You're always refurbishing old
factors and hunting new ones. Two
numbers tie the entire machine together
and I want you to internalize both. The
first is the information ratio. Your
active return divided by your active
risk. It's the report card. How much
value are you adding per unit of risk
you choose to take? The second genuinely
changed how I think. The fundamental law
of active management. Your information
ratio is roughly your skill multiplied
by the square root of your breadth.
Where breadth is the number of
independent bets you make.
So, there are exactly two ways to get
better.
Be more skillful per bet or make more
independent bets. Now, that word
independent is doing enormous work. If
you're long five stocks and short five,
but all the longs are retail and all the
shorts are energy, you don't have 10
bets. You have two. A bet on retail and
a bet against energy. Real breadth means
genuinely distinct decisions across both
the names you cover and how often you
independently revisit them. And once
that clicks, the entire point of a
multi-factor model snaps into focus.
It's a breadth machine.
It's how you make thousands of small,
independent, slightly better than even
bets across the whole market every
single day.
And the square root tells you that
stacking up breadth is how a modest per
bet skill compounds into a serious edge.
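The fundamental law is simple enough to play with directly. A small sketch with invented skill numbers, where `information_ratio` is a hypothetical helper name:

```python
import math

def information_ratio(skill, breadth):
    """Fundamental law: IR is roughly skill times sqrt(independent bets)."""
    return skill * math.sqrt(breadth)

# Same small per-bet skill, very different breadth.
few_bets = information_ratio(skill=0.02, breadth=100)
many_bets = information_ratio(skill=0.02, breadth=10_000)

# The retail-versus-energy book: ten positions, but only two independent bets.
looks_like = information_ratio(skill=0.05, breadth=10)
really_is = information_ratio(skill=0.05, breadth=2)

print(f"100 bets: {few_bets:.2f}, 10,000 bets: {many_bets:.2f}")
print(f"ten 'bets': {looks_like:.3f}, two real bets: {really_is:.3f}")
```

Because of the square root, a hundred times more independent bets buys ten times the information ratio, not a hundred.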
That's the skeleton. Everything else I
show you from here hangs off one of
these blocks.
So, let's zoom into the very first block
of that diagram, data and signals,
because that's where I spent a real
chunk of my time and it's where the
whole machine either stands or falls.
I said everything downstream is only as
good as the data feeding it and I want
to show you what good data actually
takes, starting with a fundamental piece
of data engineering in this space,
security matching.
Before I get to matching, it's worth
naming the unglamorous work that lives
in this block because it's easy to
assume the data shows up clean and it
never does. A big part of the job is
data scrubbing, cross-checking outliers
against other sources, filling gaps,
fixing formats, and then reconciling
vendors who all describe the same
company differently. One vendor
identifies a company by CUSIP, another
by ISIN. One revises
its history when figures get restated,
another doesn't.
Sorting all of that out so the data is
consistent and trustworthy is the price
of admission before anything downstream
can run.
And the sharpest version of that problem
is security matching.
Most quant firms with the budget for it
will buy alternative data, a newer type
of data that's entered the quant finance
space over the past several years.
Examples of alternative data include
scraped social media data, supply chain
information, news articles, credit card
transactions, and broker reports, among
many others.
The reason alternative data has exploded
in popularity over the past few years is
that it's become so hard to extract
valuable information from traditional
data.
There's been a lot more competition over
the past couple of decades, and quant
firms are constantly looking for unique
pieces of information they can turn into
trading factors.
So, whenever a firm buys a data set, the
first step to integrating it into the
system is a process called security
matching,
something every data engineer at a quant
firm has to do.
It's a pretty tedious process, but it's
a necessity for everything downstream.
Security matching is the process of
mapping the entities in a data set to
the firm's internal identifiers for its
trading model.
For example, mapping a URL from a
website like apple.com to the official
identifiers that represent Apple.
These official identifiers are
standardized across the finance
industry.
A few to mention are ISIN and CUSIP,
and Bloomberg IDs are also commonly
accepted.
The reason you need to map these
entities is so you can algorithmically
trade based on the data set you bought.
The key concept in security matching is
point in time. Point in time is a hard
requirement because it prevents you from
corrupting the data with look-ahead
bias. An easy way to understand it, the
identifier for Apple is only valid for
the specific time periods during which
the company was publicly listed under
it, and those official identifiers can
change for several reasons. One example
is mergers and acquisitions. So, you
always have to ask what was true as of
that date, not what is true today. This
whole process tends to get pretty
straightforward once you've matched a
couple of data sets. A couple of things
to keep in mind. The data sets you're
mapping are often several terabytes in
size and they require quite a bit of
sophistication in how you process them.
That ranges from distributed computing,
where applicable, to processing the data
with fast, efficient transformations
using tools like Polars, Spark, or
NumPy.
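To make the point-in-time idea from a moment ago concrete, here is a minimal sketch of an "as of" lookup. The mapping rows, the `example.com` key, and the `SEC_001` style identifiers are all made up, and a real mapping table would live in a database rather than a Python list.

```python
from datetime import date

# Hypothetical mapping rows: (vendor_key, internal_id, valid_from, valid_to).
# valid_to is exclusive; None means the mapping is still valid today.
MAPPING = [
    ("example.com", "SEC_001", date(2010, 1, 1), date(2018, 6, 1)),
    ("example.com", "SEC_002", date(2018, 6, 1), None),  # e.g. after a merger
]

def match_as_of(vendor_key, as_of):
    """Return the internal id that was valid for vendor_key on the as_of date."""
    for key, internal_id, valid_from, valid_to in MAPPING:
        if key != vendor_key:
            continue
        if valid_from <= as_of and (valid_to is None or as_of < valid_to):
            return internal_id
    return None  # not listed under any identifier on that date

print(match_as_of("example.com", date(2015, 3, 1)))  # SEC_001
print(match_as_of("example.com", date(2020, 1, 1)))  # SEC_002
print(match_as_of("example.com", date(2005, 1, 1)))  # None
```

The look-ahead bug this guards against is joining every historical row to today's identifier, which quietly leaks knowledge of later corporate events into the past.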
Another concept that's very relevant to
security matching is data loaders.
When a vendor sells data to a quant firm
and then updates their database, the
firm needs to make sure it pulls in that
updated data. And the way you do that is
with a data loader, which tends to be
unique to each vendor.
Some vendors update their data daily,
some weekly, some monthly.
It really depends on the type of data
they sell.
Typically, what you'd expect is for the
data loader to pull in the updated
information from the vendor's side,
perform the necessary transformations
and aggregations to preprocess it, and
then save it down into the firm's
internal database.
From that point on, any trading factor
built off that data runs again on the
fresh data that just came in. So, data
loaders are a production-quality
feature. If the data loader doesn't
work, the trading factor gets halted.
So, it's critical that both the security
matching and the data loaders work
flawlessly and that you build in some
kind of auditing to catch any
discrepancies in the data coming
from the vendor upstream. And that opens
up a whole new concept, data auditing. I
couldn't depend on the vendor to provide
correct data all the time,
so I had to build auditing systems for
the different vendors.
Most of the time, this was just
statistical metrics reported on certain
columns of the data set,
or even calculating the coverage of the
securities mentioned in the newly
refreshed file.
Any difference that crossed a certain
threshold would flag me, or whoever else
was responsible on the data side, to
look into the issue. This is
standard practice in the quant
space because if the data is wrong, your
trading factor is trading on incorrect
data. So, it's vital to pinpoint the
issue as early and as far upstream as
possible.
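A bare-bones sketch of that kind of audit, assuming each refresh is a dictionary keyed by security id and that the previous refresh is non-empty. The `spend` column, the thresholds, and the function name are hypothetical; a real audit would track more statistics than a single mean.

```python
from statistics import mean

def audit_refresh(previous, current, column, max_mean_shift=0.25, min_coverage=0.90):
    """Compare a new vendor file to the last one and return a list of flags."""
    flags = []

    # Coverage: what share of previously seen securities is still present?
    coverage = len(set(previous) & set(current)) / len(previous)
    if coverage < min_coverage:
        flags.append(f"coverage dropped to {coverage:.0%}")

    # Simple column statistic: has the mean moved more than the threshold?
    prev_mean = mean(row[column] for row in previous.values())
    curr_mean = mean(row[column] for row in current.values())
    if prev_mean != 0 and abs(curr_mean - prev_mean) / abs(prev_mean) > max_mean_shift:
        flags.append(f"mean of {column} moved from {prev_mean:.1f} to {curr_mean:.1f}")

    return flags

last_week = {"S1": {"spend": 100}, "S2": {"spend": 110},
             "S3": {"spend": 90}, "S4": {"spend": 100}}
good_file = {"S1": {"spend": 105}, "S2": {"spend": 108},
             "S3": {"spend": 95}, "S4": {"spend": 100}}
bad_file = {"S1": {"spend": 1000}, "S2": {"spend": 1100}}  # half missing, units changed

good_flags = audit_refresh(last_week, good_file, "spend")
bad_flags = audit_refresh(last_week, bad_file, "spend")
print(good_flags)
print(bad_flags)
```

Running this on every refresh, before the trading factor reads the data, is what catches the problem at the loader instead of in the portfolio.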
Now, let's move one block to the right
on our diagram to where those signals
become alphas, the actual trading
factors. On the whiteboard, that's a
single tidy box, but in real life, it's
where research meets production. And
that handoff is most of what I did
day-to-day. So, I want to expand on that
block and talk about my experience
productionizing trading factors.
I worked alongside plenty of senior
researchers who were responsible for the
research side of the trading factors,
and I worked very closely with them to
take all of their research code and
methodology and convert it into
production-ready code that could run in
our live trading model.
I worked on about six different research
projects over my 3 years as a quant dev
so far. I obviously can't talk about the
specifics, but I'll walk through the
high-level process and what I went
through. The projects I worked on
covered very different topics. Some of
them leveraged alternative data, others
used traditional financial data, and
only a very small subset actually used
neural network techniques, which I found
pretty surprising. A lot of research in
general can be done with regression
models or gradient boosted trees. If
there's interest in understanding
regression models or gradient boosted
trees, let me know in the comments.
That'll be my signal to do a future
video explaining them intuitively.
Most of the research code was written in
R, and I'd take that and convert it into
either Python, Spark, or KDB,
a pretty old time-series database with its
own query language, q, that a
lot of hedge funds use. It's extremely
optimized for speed. That said, I used a
lot of distributed computing when
building these trading factors, along
with a lot of vectorization using NumPy.
Since everything is time series based,
we needed to enable distributed
computation for practically every
trading factor we built because the
first thing you do is run the code over
historical data. And that can range
anywhere from 5 to 15 years.
With distributed computing, a full
historical run takes about half a
day to a full day in most cases,
depending on the trading
factor and the data being used. Once
everything is set, you're pretty much
ready to start running the trading
factor on the new data that comes in
through the data loader. And because
it's all distributed, it runs a lot
faster than it otherwise would. You also
have to keep in mind that these trading
factors have a specific time window in
which they can be updated. In
production, multi-factor models tend to
get updated every day. And if you're
trading global equity, for example, you
only have a really specific window. Once
the New York market closes, you have a
few hours to run the model on the latest
data so you can start trading during
Japan's market hours.
So, being able to distribute your code
and really understanding this concept is
very important. When I wrote the
production code, I was given a Word
document that contained all of the
research methodology for that factor.
That gave me an easy way to understand
the thought process behind each step.
And it also made collaboration much
easier across the several members who
might be involved in a project. It's
essentially a central note that everyone
can refer to when trying to understand a
trading factor.
It became very clear during my time at
the firm that documentation is extremely
important. Not only does it help
researchers and developers understand
the code better, but in the future it's
common for a trading factor to need a
second research project on top of it,
either to fix issues or to amplify its
performance. This happens because
factors always decay in performance, and
it's up to the portfolio managers to
decide whether a particular factor has
some low-hanging fruit worth another
research project to improve it.
These research projects tend to involve
several people from the team, usually a
researcher, a developer, and a tester,
plus a few senior partners who provide
oversight, give advice, and get constant
reporting. These small pods are a
crucial part of the firm because they
require excellent collaboration from
each member and a real sense of team
spirit.
One thing I noticed is that the
portfolio managers paid close attention
to particular pods. If they noticed a
pod had a great work ethic and really
good synchronous workflows, they tended to
keep that pod together for future
projects because it meant better project
throughput.
Here's a fair question to ask looking at
the board so far. Those historical runs
over 15 years of data across thousands
of names,
where does that actually run and how
does it finish before the next trading
window?
That's the infrastructure layer sitting
underneath the data and alpha blocks we
just expanded. So, let me draw it in. When
I joined the firm, everything
was still done on premises. They had huge
servers in the office where everything
ran. It soon became clear to management
that they had to migrate to cloud
services. The team was growing, the data
was getting larger, and the factors were
getting more complicated and demanding
more compute.
So, one of my biggest tasks at the firm
was migrating a lot of our
infrastructure to the cloud and that
forced me to get very comfortable with
AWS and Databricks.
AWS has a lot of services and it can be
confusing to navigate. But, the main
ones I leveraged were EC2, ECR, and S3.
On the Databricks side, I pretty much
used every feature they had at the time.
They've definitely expanded a lot since.
When I migrated everything, I took our
on-premises code, moved it into
notebooks, and incorporated those into
workflows, which essentially trigger the
code to run on a specific schedule. But,
the most notable pieces of this section
are Apache Spark and Delta Lake, which
were the two features that really
transformed a lot of our processes. Let
me explain Apache Spark in some detail
because it's the engine that made all of
this possible at scale. Spark is a
distributed in-memory data processing
engine. The whole point of it is to take
a computation that would never fit on or
finish on a single machine, and spread
it across a cluster of machines that
work on it in parallel.
The way it's structured, you have a
driver, which is the brain. It holds
your program and builds the plan.
And you have executors, which are the
workers spread across the cluster that
actually crunch the data.
A cluster manager hands out the
machines. Your big data set gets split
into chunks called partitions, and each
executor works on its own partitions at
the same time.
That's where the speed comes from. If
you have 100 partitions and enough
executors, you're doing 100 things at
once.
A couple of things make Spark special.
First, it's lazy. When you write
transformations, filter this, join that,
group by the other, Spark doesn't
actually run anything yet. It just
records what you asked for and builds a
graph of the steps.
It only executes when you hit an action,
like writing the result or counting
rows.
That laziness is what lets Spark's query
optimizer, Catalyst, look at the entire
plan and rewrite it to be as efficient
as possible, pushing your filters down
to the data source so it reads less,
reordering joins, and so on, before a
single byte is processed.
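The laziness is easier to believe once you see it. This is a toy stand-in written in plain Python, not the Spark API: `LazyDataset` is a made-up class that only records steps until an action is called, and it does none of the plan rewriting a real optimizer does.

```python
class LazyDataset:
    """Toy lazily evaluated dataset: transformations are recorded, not run."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of ("filter" | "map", function) steps

    def filter(self, predicate):
        # Transformation: return a new dataset with one more recorded step.
        return LazyDataset(self._rows, self._plan + (("filter", predicate),))

    def map(self, func):
        return LazyDataset(self._rows, self._plan + (("map", func),))

    def collect(self):
        """Action: only now does the recorded plan actually execute."""
        out = []
        for row in self._rows:
            keep = True
            for kind, fn in self._plan:
                if kind == "filter":
                    if not fn(row):
                        keep = False
                        break
                else:
                    row = fn(row)
            if keep:
                out.append(row)
        return out

calls = []

def double(x):
    calls.append(x)  # record every time real work happens
    return x * 2

dataset = LazyDataset(range(10)).filter(lambda x: x % 2 == 0).map(double)
calls_before_action = len(calls)  # still 0: nothing has run yet

result = dataset.collect()
print(calls_before_action, result)
```

Because the whole chain exists as data before anything runs, there is something for an optimizer to inspect and reorder, which is the opening Catalyst uses.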
Second, it's in-memory. The old
MapReduce model wrote intermediate
results to disk between every step.
Spark keeps data in memory across steps,
which is why it's often an order of
magnitude faster for the kind of
multi-step pipelines we ran. And it's
fault tolerant. Because Spark remembers
the lineage, the recipe of
transformations that built any piece of
data, it can just recompute a lost chunk
if a machine dies instead of failing the
whole job. The one thing you have to
respect with Spark is the shuffle. Some
operations, joins and group-bys, need
data that lives on different machines to
be moved around so related rows end up
together.
That movement across the network is the
expensive part. And most of optimizing
Spark comes down to minimizing and
controlling shuffles. So, how would this
actually look if I applied Spark to a
large data set? Picture one of the
alternative data sets I mentioned,
several terabytes sitting in S3 as
partitioned files. The flow looks like
this. Spark reads those partitions from
S3 spread across the executors. Because
it's lazy, it pushes my date and column
filters all the way down so it only
pulls what I actually need. Then it runs
the transformations in parallel on each
partition, cleaning, aggregating,
computing my features. When I need to
attach the firm's internal security
identifiers, the security matching step
from earlier, that's a join.
And because the mapping table is small,
Spark broadcasts it out to every
executor so the join happens locally
with no shuffle. Finally, I partition
the output by date and write it back
down into a Delta Lake table ready for
the trading factor to consume. A
historical run that would take days on
one machine comes down to hours.
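The difference between a broadcast join and a shuffle can be sketched in plain Python. This is a toy model, not Spark code: the partitions are just lists, and the names broadcast_join, shuffle_by_key, vendor_ticker, and security_id are assumptions made up for the example.

```python
import zlib


def broadcast_join(partitions, mapping, key):
    """Join each partition against a full local copy of a small mapping table."""
    joined = []
    for part in partitions:
        local_copy = dict(mapping)  # the copy "shipped" to this executor
        joined.append([
            {**row, "security_id": local_copy[row[key]]}
            for row in part
            if row[key] in local_copy  # inner join: unmatched rows are dropped
        ])
    return joined


def shuffle_by_key(partitions, key, n_out):
    """Repartition rows so equal keys land together; count rows that had to move."""
    out = [[] for _ in range(n_out)]
    moved = 0
    for src, part in enumerate(partitions):
        for row in part:
            # crc32 gives a hash that is stable across runs, unlike hash() on str
            dst = zlib.crc32(row[key].encode()) % n_out
            moved += int(dst != src)
            out[dst].append(row)
    return out, moved


partitions = [
    [{"vendor_ticker": "AAA", "value": 1.0}, {"vendor_ticker": "BBB", "value": 2.0}],
    [{"vendor_ticker": "AAA", "value": 3.0}, {"vendor_ticker": "CCC", "value": 4.0}],
]
mapping = {"AAA": 101, "BBB": 202, "CCC": 303}  # small table: cheap to copy

joined = broadcast_join(partitions, mapping, "vendor_ticker")
shuffled, rows_moved = shuffle_by_key(partitions, "vendor_ticker", n_out=2)
```

In the broadcast version, every row stays in the partition it started in and only the small table is copied. In the shuffle version, rows of the big data set are the thing that moves, which is the cost you are trying to avoid.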
That Spark job had to read from
somewhere and write to somewhere. So,
the natural next layer to draw
underneath everything is storage. And
the choice of where data lives isn't an
afterthought. It's wired into why the
whole system performs the way it does.
So, I'm also going to explain two types
of databases that are very commonly used
in the quant space and that almost
always come up in interviews. I'd say
about 80% of the interviews I
encountered for quant developer roles
asked me about the specifics of
databases, and discussing Parquet data
lakes and Kdb has always been a talking
point. So, I'll go through how quants
typically use Parquet data lakes and
why. Let's start with Parquet and Delta
Lake. Parquet is a columnar file format.
A normal database row stores all of a
record's fields together. Parquet flips
that and stores each column together
instead. That sounds like a small
detail, but it's huge for the kind of
work we do because our queries usually
touch a few columns across millions of
rows, not whole rows. Storing by column
means you only read the columns you ask
for. You get fantastic compression
because similar values sit next to each
other, and you can skip entire chunks of
a file that couldn't possibly match your
filter. Delta Lake is what you get
when you put a transaction layer on top
of a pile of Parquet files. On its own,
a folder of Parquet files has no concept
of a consistent all-or-nothing change.
Delta adds a transaction log that gives
you ACID guarantees: atomicity,
consistency, isolation, and durability.
In practice, that means a write either
fully happens or doesn't happen at all.
Readers never see a half-written table,
and concurrent jobs don't corrupt each
other. It also enforces a schema, so bad
data doesn't silently slip in.
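Going back to the columnar layout for a moment, here is a small plain-Python sketch of why it helps. It does not use the real Parquet format or any Parquet library. The helper names to_columns, chunk_stats, and chunks_to_read, the chunk size, and the sample data are all assumptions for illustration. It shows two ideas: reading one column without touching the others, and using per-chunk min and max values to skip chunks that cannot match a filter.

```python
rows = [
    {"date": "2024-01-02", "ticker": "AAA", "close": 10.0},
    {"date": "2024-01-02", "ticker": "BBB", "close": 20.0},
    {"date": "2024-01-03", "ticker": "AAA", "close": 11.0},
    {"date": "2024-01-03", "ticker": "BBB", "close": 21.0},
    {"date": "2024-01-04", "ticker": "AAA", "close": 12.0},
    {"date": "2024-01-04", "ticker": "BBB", "close": 22.0},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    return {name: [r[name] for r in records] for name in records[0]}


def chunk_stats(values, chunk_size):
    """Record (min, max) for each fixed-size chunk of a column."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    return [(min(c), max(c)) for c in chunks]


def chunks_to_read(stats, lower_bound):
    """Keep only chunks whose max could satisfy value >= lower_bound."""
    return [i for i, (_, hi) in enumerate(stats) if hi >= lower_bound]


columns = to_columns(rows)

# A query that only needs "close" reads one list and ignores the rest.
mean_close = sum(columns["close"]) / len(columns["close"])

# A filter on date can rule out whole chunks from their statistics alone.
date_stats = chunk_stats(columns["date"], chunk_size=2)
needed = chunks_to_read(date_stats, "2024-01-04")
```

Here only the last of the three date chunks can contain rows on or after 2024-01-04, so the other two never need to be read.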
This combination is almost perfect for
quant data, and especially for time
series. You partition the data by date,
so when a factor needs the last 10 years
of a few fields, it scans exactly those
date partitions and exactly those
columns, nothing else. Daily updates are
just appends of a new date partition,
which is fast and cheap, and batch
computation loves this because the whole
historical run is just one big parallel
scan. But, the feature I leaned on the
most was versioning. Because Delta keeps
a transaction log of every change, you
get time travel. You can query the table
exactly as it looked at any past version or timestamp that's still retained.
That ties directly back to the
point-in-time requirement I talked about
in security matching.
When I run a factor over history, I need
the data as it was known then,
not as it's been restated since.
Versioning gives me that essentially for
free.
And it makes runs reproducible.
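The versioning idea can be sketched with a toy append-only log in plain Python. This is not how Delta Lake is implemented and not its API: VersionedTable, commit, read, and the two-field schema are hypothetical names for the example. The assumption here is that each commit replaces whole date partitions and that reading a version means replaying the log up to that point.

```python
class VersionedTable:
    """Toy table backed by an append-only commit log (illustration only)."""

    SCHEMA = {"ticker", "value"}

    def __init__(self):
        self._log = []  # one entry per commit: {date_partition: rows}

    def commit(self, partitions):
        # Validate and stage everything first, so a bad row rejects the
        # whole commit and the log is left untouched (all-or-nothing).
        staged = {}
        for date, rows in partitions.items():
            for row in rows:
                if set(row) != self.SCHEMA:
                    raise ValueError(f"schema mismatch in partition {date}")
            staged[date] = [dict(r) for r in rows]
        self._log.append(staged)
        return len(self._log) - 1  # version number of this commit

    def version_count(self):
        return len(self._log)

    def read(self, version=None):
        # Replay commits up to the requested version; later commits
        # overwrite any partition they touch.
        if version is None:
            version = len(self._log) - 1
        state = {}
        for entry in self._log[: version + 1]:
            state.update(entry)
        return state


table = VersionedTable()
v0 = table.commit({"2024-01-02": [{"ticker": "AAA", "value": 1.0}]})
v1 = table.commit({"2024-01-03": [{"ticker": "AAA", "value": 2.0}]})
# A vendor restatement: overwrite just the affected partition.
v2 = table.commit({"2024-01-02": [{"ticker": "AAA", "value": 1.5}]})

as_known_at_v1 = table.read(v1)["2024-01-02"][0]["value"]  # before the restatement
latest = table.read()["2024-01-02"][0]["value"]            # after the restatement
```

Reading at v1 still returns the original value even though the partition was later restated, which is the point-in-time behaviour a backtest needs.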
If a vendor restates a chunk of history,
I can cleanly overwrite just those
partitions instead of rebuilding the
whole table.
And because everything sits on cheap
object storage like S3
and reads in parallel,
the read and write throughput scales
with the size of your cluster rather
than choking on a single machine. The
other database is Kdb and its query
language Q. Kdb is a completely
different animal from Parquet and Delta.
It's an in-memory columnar time series
database and it is absurdly fast. It was
built from the ground up for exactly the
kind of data finance produces, enormous
streams of timestamped ticks and quotes.
The language Q is terse and vectorized.
You write tiny expressions that operate
on entire columns at once the way NumPy
does and it runs extremely close to the
metal. The thing Kdb does better than
almost anything else is time-based
joins, in particular the as-of join
where for every trade you want the quote
that was in effect at that exact moment.
That operation is everywhere in finance
and painfully slow in a normal database.
In Kdb, it's a first-class
lightning-fast primitive. That's why a
lot of hedge funds still build their
core on it despite it being an old niche
technology.
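For anyone who hasn't seen an as-of join, here is a minimal sketch of the logic in plain Python using binary search. It is not Q and says nothing about how Kdb implements it. The function name asof_join, the integer timestamps, and the sample trades and quotes are assumptions for illustration. For each trade it picks the latest quote for the same symbol at or before the trade time.

```python
from bisect import bisect_right
from collections import defaultdict


def asof_join(trades, quotes):
    """Attach to each trade the most recent quote at or before its time."""
    # Group quotes by symbol and sort each group by time once up front.
    by_symbol = defaultdict(list)
    for q in sorted(quotes, key=lambda q: q["time"]):
        by_symbol[q["sym"]].append(q)
    times = {sym: [q["time"] for q in qs] for sym, qs in by_symbol.items()}

    out = []
    for t in trades:
        sym_times = times.get(t["sym"], [])
        # Index of the last quote with time <= trade time, or -1 if none.
        i = bisect_right(sym_times, t["time"]) - 1
        quote = by_symbol[t["sym"]][i] if i >= 0 else None
        out.append({
            **t,
            "bid": quote["bid"] if quote else None,
            "ask": quote["ask"] if quote else None,
        })
    return out


quotes = [
    {"time": 1, "sym": "AAA", "bid": 9.9, "ask": 10.1},
    {"time": 3, "sym": "AAA", "bid": 10.0, "ask": 10.2},
    {"time": 2, "sym": "BBB", "bid": 19.9, "ask": 20.1},
]
trades = [
    {"time": 2, "sym": "AAA", "px": 10.05},  # matches the AAA quote at time 1
    {"time": 3, "sym": "AAA", "px": 10.15},  # matches the AAA quote at time 3
    {"time": 1, "sym": "BBB", "px": 20.00},  # no BBB quote yet, so no match
]

joined = asof_join(trades, quotes)
```

Each lookup is a binary search over sorted times, so the cost per trade grows with the logarithm of the number of quotes rather than requiring a scan of all of them.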
At my firm, Kdb was the backbone of the
live trading model and we ran it in
combination with HTCondor,
which is a job scheduler that farms work
out across a grid of machines.
So, Kdb held and served the time series
data at speed and HTCondor distributed
the actual model computation across the
grid on top of it.
The way I'd sum up the two: Parquet and
Delta Lake are your cheap, massive,
versioned warehouse for research and
batch. They scale out and they remember
everything.
Kdb is your high-performance engine for
time series and live trading.
It's all about raw speed on timestamped
data.
Most firms use both because they're
solving two different problems.
And that's pretty much everything I
wanted to cover for now.
If you take one thing away, let it be
the picture we just built. We started
with a single skeleton of the
multi-factor machine. And every piece I
worked on was really just one of those
blocks cracked open and expanded. The
data block became security matching, the
alpha block became the research to
production pipeline, and underneath all
of it sat the infrastructure and the
databases that make it run on time.
That's genuinely how it feels on the
inside. One big connected system where
every part exists for a reason.
There's a lot more I could dive into,
but this should be enough for now,
depending on how much of a reaction this
video gets.
If there's any positive feedback,
that'll be the encouragement I need to
keep making more videos that are helpful
for you.
Let me know your thoughts in the
comments. I'll genuinely decide whether
to keep making these videos based on
your reactions. So, if this was helpful,
please let me know.
Tell me what you like the most and what
you'd like to see next.
If you're interested in this kind of
content, go ahead and like and subscribe
so you can stay updated. I'll try to
post a video every week going forward
and we'll see how it goes. Thank you so
much and I'll do my best to answer any
questions you have. Thanks. Bye.