The Biggest Challenge for Data-Driven Investors: Mirroring the Past Into the Future
DDVC #57: Where venture capital and data intersect. Every week.
👋 Hi, I’m Andre and welcome to my weekly newsletter, Data-driven VC. Every Thursday I cover hands-on insights into data-driven innovation in venture capital and connect the dots between the latest research, reviews of novel tools and datasets, deep dives into various VC tech stacks, interviews with experts, and the implications for all stakeholders. Follow along to understand how data-driven approaches change the game, why it matters, and what it means for you.
Current subscribers: 14,135, +185 since last week
Brought to you by VESTBERRY - The Portfolio Intelligence Software for Data-driven VCs
Watch our 7-minute product demo, showcasing the platform's powerful features and intuitive interface. Gain valuable insights and make data-driven decisions with unparalleled ease.
Piece 2 of 2
Welcome back to another episode around biases in the world of venture capital. While last week’s episode centered around cognitive biases and how they impact human decision-making across the VC investment process, today’s episode is all about AI and data biases and how they are prone to mirror the past into the future.
How Can Data-driven Approaches Help Investors Overcome Cognitive Biases?
Let’s look back at my very first episode “Why VC is Broken and Where to Start Fixing It” more than a year ago:
… the VC investment/decision-making process is manual, inefficient, non-inclusive, subjective and biased which leads not only to a huge waste of resources but more importantly to sub-optimal outcomes and missed opportunities.
Double-clicking on the “subjective and biased” part, we find that cognitive biases are the key driver pushing investors toward pattern matching and, oftentimes, suboptimal outcomes. Pattern matching on the basis of a limited sample size is dangerous, and unfortunately no investor has had the opportunity to experience all of the world’s success cases firsthand.
Data-driven approaches may balance these shortcomings if done right. By assembling a comprehensive time-series dataset about all companies out there, we can analyze how different features of successful vs unsuccessful companies have evolved over time, just as described in this paper.
Following the extraction of “success patterns” (check out this related piece on “Patterns of Successful Startups”), investors can not only translate them into an algorithmic selection of new investment opportunities but also leverage these findings to challenge their cognitive biases.
Firsthand experiences with a limited sample of successful vs unsuccessful companies shape subjective cognitive biases. Creating awareness of these biases and balancing them with more objective feature patterns, identified through a significantly more comprehensive data sample, merges the best of both worlds: subjective/human + objective/data.
What Is Data Bias?
While humans are prone to cognitive biases, data-driven approaches and machine-learning models are prone to data bias. Data bias occurs when an information set is inaccurate and fails to represent the entire population.
For example, when looking at successful vs unsuccessful startups, one might over-index a specific industry or geography in the training data. As a result, extracted “success patterns” might only partially apply to the full universe of opportunities out there, limiting your ability to spot all success candidates.
Data bias is a significant concern as it can lead to biased responses and skewed outcomes, resulting in inequality and ineffectiveness in the screening/investment selection process.
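A quick way to surface this kind of representation bias is to compare feature distributions in the training sample against the full population. The sketch below uses toy, illustrative data (the `industry` column and its values are hypothetical) to show the idea in pandas:

```python
import pandas as pd

# Hypothetical population vs. training sample (all values are illustrative)
population = pd.DataFrame({"industry": ["fintech"] * 30 + ["biotech"] * 30 + ["saas"] * 40})
train = pd.DataFrame({"industry": ["fintech"] * 60 + ["biotech"] * 10 + ["saas"] * 30})

# Compare industry shares: large gaps indicate an over- or under-indexed group
pop_share = population["industry"].value_counts(normalize=True)
train_share = train["industry"].value_counts(normalize=True)
gap = (train_share - pop_share).abs().sort_values(ascending=False)
print(gap)  # fintech over-indexed by 0.30, biotech under-indexed by 0.20
```

The same check extends naturally to geography, founding year, or any other feature where the training data might silently diverge from the universe of companies you actually want to screen.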
How to Mitigate Data Bias?
As with cognitive biases, the first step is to create awareness of data biases. Only then can we leverage techniques like stratified sampling, oversampling and undersampling, or moderator variables. The latter has proven extremely valuable in the context of startup screening, so let’s dive into this topic in a bit more detail.
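To make the sampling techniques concrete, here is a minimal undersampling sketch in pandas on toy data (the column names `industry` and `success` are hypothetical): the majority group is downsampled so each industry is equally represented before training.

```python
import pandas as pd

# Toy dataset heavily skewed toward one industry (all names are illustrative)
df = pd.DataFrame({
    "industry": ["saas"] * 80 + ["biotech"] * 20,
    "success": [1, 0] * 50,
})

# Undersample: draw from each industry only as many rows as the smallest group has
n = df["industry"].value_counts().min()
balanced = (
    df.groupby("industry", group_keys=False)
      .sample(n=n, random_state=42)
)
print(balanced["industry"].value_counts())  # 20 rows per industry
```

Oversampling works analogously (`.sample(n=..., replace=True)` on the minority groups), and stratified sampling keeps the group shares fixed at population levels rather than equalizing them.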
Understanding Moderator Variables: In the context of research and statistics, a moderator variable affects the strength or direction of the relationship between an independent variable (=cause) and a dependent variable (=effect). In simpler terms, it can change how one factor affects another.
Addressing Data Bias with Moderators:
Clarifying Relationships: By examining moderator variables, we can better understand under which conditions certain relationships hold or don't hold. For instance, if we're studying the relationship between startup success (dependent variable) and investment received (independent variable), a moderator like "region of operation" might reveal that the relationship is stronger in urban areas compared to rural areas.
Identifying Hidden Biases: Sometimes, biases aren't evident until you introduce a moderator. For example, a dataset might show that a tech bootcamp improves job placement rates for all participants. But when the moderator "gender" is introduced, it could reveal a significant discrepancy in placement rates between men and women, indicating a potential bias.
Doesn't Eliminate Bias: Introducing moderator variables can help reveal and understand biases, but it doesn't inherently eliminate them. This requires additional initiatives like the sampling techniques mentioned above.
Requires Thoughtful Selection: Not all variables serve effectively as moderators. Researchers must have a theoretical or empirical reason to believe that a certain variable can act as a moderator.
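The “region of operation” example above can be sketched numerically: fit the funding-to-success relationship separately within each level of the moderator and compare the slopes. The data below is synthetic and all variable names are illustrative, so treat it as a shape of the analysis rather than a real finding.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: funding drives success strongly in urban regions, weakly in rural ones
n = 200
region = np.repeat(["urban", "rural"], n // 2)
funding = rng.uniform(0, 10, n)
true_slope = np.where(region == "urban", 0.8, 0.1)
score = true_slope * funding + rng.normal(0, 0.5, n)
df = pd.DataFrame({"region": region, "funding": funding, "score": score})

# Fit the funding -> score relationship within each level of the moderator
slopes = {}
for reg, grp in df.groupby("region"):
    b, a = np.polyfit(grp["funding"], grp["score"], 1)
    slopes[reg] = b
    print(f"{reg}: slope = {b:.2f}")
# Diverging slopes indicate that "region" moderates the funding -> success relationship
```

In a full analysis you would test the interaction term (e.g. funding × region in a regression) for significance rather than eyeballing the two slopes.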
The Biggest Challenge of Data-driven Investing: Mirroring the Past Into the Future
Beyond the data biases mentioned above, one of the biggest concerns with purely data-driven investment selection is the question of how to deal with changing success patterns. Let’s imagine the following (a brief summary of my paper here):
You procure startup data from established providers like Crunchbase, Pitchbook, and Dealroom —> the big problem is that they tend to update and overwrite features; historic values get deleted, which makes it difficult to reconstruct the full history of a company
You scrape additional data from LinkedIn, ProductHunt, GitHub, and diverse public registers —> important to repeat at consistent intervals, like every week or month, to keep track of feature development over time
You merge all datasets together and remove duplicates to receive a single source of truth with comprehensive coverage of companies and maximum level of detail on time-series features
You encode your features (think for example One-Hot Encoding) and take a snapshot of all independent variables as of t1 (think 1st Jan 2015)
You classify the sample into success (for example IPO and M&A above $500m) and failure (all other cases) at a later point in time to represent required outcomes from an early-stage investor’s view; this is your dependent variable as of t2 (think 31st Dec 2020)
You train a classification model to identify patterns across the independent variables as of t1 (1st Jan 2015) that predict the success of the dependent variable as of t2 (31st Dec 2020); these are your success patterns
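The encode-snapshot-label-train steps above can be sketched end to end in scikit-learn. The dataset below is a toy stand-in (all columns like `industry` and `team_size` are hypothetical), and a simple logistic regression plays the role of the classification model:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy snapshot of independent variables as of t1 (all columns are illustrative)
df = pd.DataFrame({
    "industry": ["saas", "biotech", "saas", "fintech"] * 25,
    "team_size": [5, 12, 3, 8] * 25,
    "success_t2": [1, 0, 0, 1] * 25,  # outcome observed at t2 (IPO / large M&A)
})

# One-Hot Encoding of categorical features; numeric features pass through
X = pd.get_dummies(df[["industry", "team_size"]])
y = df["success_t2"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Train the classifier: its learned coefficients are the "success patterns"
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```

In practice the model family (gradient boosting, neural nets, survival models) and the feature engineering matter far more than this sketch suggests, but the t1-features / t2-labels framing stays the same.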
Shifting t1 and t2 equally across time reveals that success patterns change, even when keeping the majority of features constant. This can be explained by the fact that business models and industries evolve, requiring new approaches to become successful. Said differently, what got us here won’t get us there.
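The effect of shifting t1 and t2 can be illustrated with a deliberately simplified simulation: two “eras” of synthetic data in which the feature that predicts success flips, so a model trained on one era learns a pattern that is wrong in the next. Everything below (era labels, the drift itself) is fabricated purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Two eras with synthetic drift: the predictive direction of feature 0 flips
def make_era(flip):
    X = rng.normal(size=(300, 2))
    y = (X[:, 0] * (-1 if flip else 1) > 0).astype(int)
    return X, y

coefs = {}
for era, flip in [("2010-2015", False), ("2015-2020", True)]:
    X, y = make_era(flip)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    coefs[era] = clf.coef_[0]
    print(era, coefs[era].round(2))
# The sign of the first coefficient flips across eras: the learned "success pattern" changed
```

Comparing coefficients (or feature importances) across rolling (t1, t2) windows like this is one pragmatic way to detect that your success patterns are drifting before they silently degrade your screening.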
In VC land, this means that sourcing and screening investment opportunities with algorithms that got trained on historical data bears the risk that novel, so far unseen business models or innovations might fall through the cracks.
Assume you’ve trained a model with input data / independent variables t1 = 2018 and output data / dependent variable t2 = 2022. Would the model know what a successful nuclear fusion company looks like? Of course not.
Why? Because there hasn’t been a successful nuclear fusion company to date. It will take at least a few more years, potentially decades, to know what success for this kind of company looks like. Until then, we cannot rely purely on data-driven approaches to identify novel innovations but need human intuition.
Augmented VC as the Solution to All Problems
While data-driven methods offer objectivity and counteract the cognitive biases inherent in human decision-making, humans possess the unique ability to rectify the limitations of these data-centric strategies. Not only can they prevent the mere replication of historical patterns, but they can also discern novel and previously unrecognized patterns.
In a nutshell, data-driven approaches are exceptional at spotting established success patterns but struggle to identify novel innovations, whereas (some) humans are exceptional at intuitively identifying so-far-unseen opportunities yet remain prone to flawed cognitive biases that result from limited sample sizes. Therefore, combining the power and objectivity of computers with the intuition of humans seems like the only way to improve efficiency, effectiveness, and inclusiveness 🤖🤝🤓
Thank you for reading. If you liked it, share it with your friends, colleagues, and everyone interested in data-driven innovation. Subscribe below and follow me on LinkedIn or Twitter to never miss data-driven VC updates again.
If you have any suggestions, want me to feature an article, research, your tech stack or list a job, hit me up! I would love to include it in my next edition😎