

Data-driven VC #2: How to not miss an investment opportunity anymore
Where venture capital and data intersect. Every week.
Hi, I'm Andre and welcome to my weekly newsletter, Data-driven VC. Every Thursday I cover hands-on insights into data-driven innovation in venture capital and connect the dots between the latest research, reviews of novel tools and datasets, deep dives into various VC tech stacks, interviews with experts and the implications for all stakeholders. Follow along to understand how data-driven approaches change the game, why it matters, and what it means for you.
Current subscribers: 1,134, +298 since last week
Let's start with the most essential takeaways from my previous episode: VC is broken, and the sourcing and screening stages are the most critical parts of the value chain, as 2/3 of VC value is created here. Said differently, VC is a game of finding and picking the winners. Having explored the shortcomings of the respective stages, the major areas for improvement in the sourcing stage are more comprehensive coverage of potential investment opportunities (this is what I call "identification") as well as more complete, unbiased and accurate data on the respective companies and the people behind them (this is what I call "enrichment"). We will come back to these terms later; for now, let's zoom out and start with the question of all questions:
How to identify every company as early as possible?
The question of comprehensive top-of-funnel coverage is one of the most difficult to answer and is top of mind for every investor, no matter the stage. Certainly, it becomes easier the later you invest, as the number of potential investment opportunities across development stages (pre-Seed to pre-IPO) naturally decreases. The challenge itself, however, is structurally the same: see every company as early as possible. To get to the answer, I'd like to take a bird's-eye view of the universe of different approaches and cluster them into human-centric and data-driven.
Human-centric sourcing approaches
Foot on the ground: I wasn't even sure whether to mention this point as it's that obvious, but for comprehensiveness and for the non-VCs among you, I do. In the past, VCs were used to founders flying in and meeting them at their respective offices. Clearly, this trend has flipped upside down. Nowadays, VCs fly around to meet the founders wherever they are. Even more efficiently, many investors have opened satellite offices on the ground, like in Paris at Station F and in Munich next to CDTM, or got a desk in hotspot WeWorks like in Berlin or London.
Mentoring: In line with the "foot on the ground" approach, VCs started to offer free mentoring in promising incubator or accelerator programs like Y Combinator, Entrepreneurs First or Techstars. This way, they get to interact with founders even before the prominent demo days.
Accelerators: Some multi-stage VC firms started to move upstream and, as part of this trend, even launched their own incubators or accelerators. Though a different business, it's still closely related, oftentimes managed by independent teams and certainly a great, proprietary sourcing opportunity for the respective firms. Examples include Sequoia Arc or a16z START. Given the recency of these kinds of programs, the questions of signaling and reputational risks remain open.
Student ambassadors: The list of uni drop-outs who founded category-defining companies is nearly endless: Facebook, Microsoft, Apple, Twitter, Square.. A range of investors have identified this trend and figured that it's not enough to give a guest lecture and visit campus once a semester. To be top of mind for these youngsters, it's oftentimes not enough to be on-site; you actually need to be one of them. We all know that the level of trust is different with peers and friends than with strangers you might have met once or twice. To hear their crazy ideas, participate in these early brainstorming sessions and see the inception of a great company first-hand, a range of funds have extended their wider team and started to recruit active students into their campus programs. These students oftentimes serve as part-time (junior) analysts or get a finder's fee in case the fund ends up investing. Some examples include Picus Capital, Alix Ventures or Hook VC.
Angel networks: Following the same logic as the student ambassador programs, many VCs have started to recruit operators who bring close proximity to ecosystems and proprietary deal-flow into so-called "angel programs". The main hypothesis is "great founders know great founders", and funds want to incentivize the well-connected ones among these operators to share info on promising opportunities with them. But only with them. Setups differ. They range from loose constructs with a simple finder's fee (cash or shareholding in the respective investee), through angels who invest money on behalf of the fund but legally on their own, where the fund then has an option to acquire the resulting angel portfolio (or sometimes only specific positions) at a fixed price until a specific point in time, to very close collaborations where angels invest legally and monetarily on behalf of the fund. Sounds win-win, but as always, the devil is in the details, and it seems like no fund has cracked the winning setup yet. While the upside is clear, there is, similar to the accelerators, a significant downside in terms of signaling risk (for example "if angel A invested on behalf of fund Y and fund Y does not invest in the next round of the company, then there must be something wrong"; similar to funds not doing their pro-rata) and potential reputational damage for the fund. You can find a great summary of the different programs and requirements by Superscout here.
Research affiliations: Unlike in the US, European universities still lack proper setups to transfer their world-class research into practice. Tech transfer offices are as scarce as standardized terms for IP transfer and, as a result, significant potential is left on the table. To solve this problem, Earlybird has launched UNI-X, a pan-European first-check fund connected to a network of leading professors (multi-university affiliation #ResearchersPleaseReachOut) who are incentivized to share the most promising research and have access to a platform to complement teams. Examples of related but still different single-university affiliations include UVC, which evolved out of the TU Munich ecosystem, or Cambridge Innovation Capital, which evolved around the University of Cambridge.
Fund-of-funds: More relevant for late-stage than for early-stage funds is the fund-of-funds strategy, where the later-stage fund becomes a limited partner (LP) investor in a range of early-stage funds. As a result, they get access to proprietary information on the respective portfolio companies and can thereby establish a multiplier pipeline while at the same time ensuring early access/introductions to the prospects. Examples include US-based New Enterprise Associates (NEA), which is an LP in Europe-based Speedinvest, essentially as their extended foot on the ground on the continent, or Molten, which is an LP in a range of European early-stage funds such as Seedcamp, IQ Capital or Earlybird. Similar but still different, many general partners (GPs) of multi-stage or growth-stage VC firms have invested privately as LPs in different early-stage funds. Clearly, their motivation is not only financial returns but also proprietary insights into the respective portfolios and trends, to eventually benefit with their own vehicle.
To wrap up the human-centric approaches, all of the above share the intention to grow deeper roots in the relevant ecosystems, to pick up signals about promising founders/companies earlier and to get an unfair advantage, including proprietary access, that can hardly, if at all, be replicated by other funds. While different approaches come with different pros, cons and costs, they all share that they are hard to scale and far from providing perfect coverage.
The figure below illustrates the initial identification source in the old world, where it was only inbound and outbound, and today, where human-centric sourcing innovations were added to the outbound component. It is important to note that it's only about the initial identification and does not represent multiple identifications, e.g. a founder reached out to the VC firm inbound but was identified through an operator in their angel program as well. In this example, the potential investment opportunity would only be presented as an inbound. The figure also shows the potential that is yet unobserved, i.e. the grey area.
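The counting rule behind the figure can be sketched in a few lines of Python. This is only an illustration: the event tuples, channel names and dates are made up, and `initial_sources` and `count_by_channel` are hypothetical helpers, not part of any real tool.

```python
from datetime import date

# Hypothetical identification events: (company, channel, date first seen via that channel).
events = [
    ("acme", "angel_program", date(2022, 3, 14)),
    ("acme", "inbound", date(2022, 3, 2)),
    ("globex", "student_ambassador", date(2022, 1, 20)),
    ("initech", "outbound", date(2022, 2, 5)),
    ("initech", "inbound", date(2022, 4, 1)),
]

def initial_sources(events):
    """Map each company to the channel that identified it first."""
    first = {}
    for company, channel, seen_on in events:
        # Keep only the earliest sighting per company.
        if company not in first or seen_on < first[company][1]:
            first[company] = (channel, seen_on)
    return {company: channel for company, (channel, _) in first.items()}

def count_by_channel(attribution):
    """Count how many companies each channel identified first."""
    counts = {}
    for channel in attribution.values():
        counts[channel] = counts.get(channel, 0) + 1
    return counts

attribution = initial_sources(events)
print(attribution)
print(count_by_channel(attribution))
```

In this toy data, "acme" was seen both inbound and via the angel program, but only the earlier inbound contact counts, which mirrors the example in the paragraph above.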

Although human-centric approaches allow VCs to further open up their funnel, they are far from comprehensive. Moreover, these "hotspot"-focused approaches create deep roots in various ecosystems but fail to solve one of the major shortcomings of the sourcing stage: the lack of inclusion. Or, as Leslie Cornfeld put it, "Talent is distributed equally, but opportunity is not". In my perspective, it's our job as VCs to change this and distribute opportunities equal to talent. This means identifying opportunities independent of geography and social relationships, and without warm introductions. I'm convinced that data-driven approaches are the only way to get closer to comprehensive coverage while at the same time making VC more inclusive and distributing opportunities equal to talent.
Data-driven sourcing approaches
Identification: I'd like to follow the structure of identification and enrichment introduced above. Identification includes all online sources where we can potentially find a new company. So it's really only about finding a company, independent of the data quality. Being an early-stage investor, I'd like to find them as early as possible, ideally at the point in time where a soon-to-be founder gets an idea and the itch to start her own company. I created a list of the most important identification source categories and some examples below.
Social networks: LinkedIn, Twitter, Instagram, TikTok, YouTube
Content platforms: Medium, Substack
Forums: Reddit, Hacker News
Product/code platforms: Product Hunt, GitHub, App stores
Public registers: Handelsregister, Companies House, INPI
Startup media: TechCrunch, Tech.eu, Deutsche Startups
Commercial databases: Crunchbase, PitchBook, CB Insights
Experience shows that every company has some kind of digital footprint, even if it's just the required entry in the public register or a founder changing her LinkedIn status to "starting something new". We will find you.. :)
.. though more likely in many places than in one, which leads to the problem of entity matching. Said differently, as soon as we find one and the same company in different identification sources, we need to merge the different entities and remove duplicates, to eventually end up with one single source of truth. I will cover this problem in a future, more ML-focused episode. So for now let's assume we've completed the entity matching and have no duplicates. The next step is to enrich all entries/companies with further information.
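To make the entity matching problem a bit more tangible before that episode, here is a deliberately naive, rule-based sketch in Python. It is not the ML approach announced above, just an assumption-laden illustration: records are merged whenever they share a normalized name or website domain. The record fields, the `LEGAL_SUFFIXES` list and the helper names are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical list of legal-form suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"gmbh", "ug", "ag", "inc", "ltd", "llc", "sas", "bv"}

def name_key(name):
    """Lowercase, drop punctuation and legal suffixes, e.g. 'Acme GmbH' -> 'name:acme'."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    tokens = [t for t in tokens if t not in LEGAL_SUFFIXES]
    return "name:" + " ".join(tokens) if tokens else None

def domain_key(url):
    """Reduce a URL to its host without scheme, path or leading 'www.'."""
    if not url:
        return None
    host = re.sub(r"^[a-z]+://", "", url.strip().lower()).split("/")[0]
    host = host[4:] if host.startswith("www.") else host
    return "domain:" + host if host else None

def match_entities(records):
    """Group records that share a normalized name or domain (union-find)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    seen = {}  # key -> index of the first record carrying that key
    for i, rec in enumerate(records):
        for key in (name_key(rec.get("name", "")), domain_key(rec.get("url"))):
            if key is None:
                continue
            if key in seen:
                parent[find(i)] = find(seen[key])  # merge the two groups
            else:
                seen[key] = i

    groups = defaultdict(list)
    for i, rec in enumerate(records):
        groups[find(i)].append(rec)
    return list(groups.values())

# Made-up records, as they might arrive from different identification sources.
records = [
    {"source": "linkedin", "name": "Acme GmbH", "url": None},
    {"source": "handelsregister", "name": "ACME", "url": "https://www.acme.io/about"},
    {"source": "producthunt", "name": "Acme App", "url": "acme.io"},
    {"source": "crunchbase", "name": "Globex Inc.", "url": "http://globex.com"},
]
groups = match_entities(records)
print([[r["source"] for r in g] for g in groups])
```

Exact-key rules like these miss typos and rebrands and can wrongly merge companies that share a generic name, which is exactly why a more ML-focused treatment is needed in practice.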
Enrichment: Similar to the identification, the first stage of the enrichment is to find every single data point on a specific company, independent of the quality. Obviously, we collect all data we can find from the identification source in one go, so the above-listed sources oftentimes serve as both an identification and an enrichment source. Additionally, we add specific enrichment sources, which I categorized below with some examples.
People data: LinkedIn, PeopledataLabs, LifeDataTechnologies -> Headcount, job postings, split across departments
Website traffic: SimilarWeb, Semrush, SpyFu -> Visitors, time spent, global rank
Patents: QuantIP, European Patent Offices -> Number of patents, patent holder
Clinical trial data: FDA -> Clinical trial stage
Grant databases: Horizon 2020 -> Grant size, duration
Academic publications: Google Scholar, SSRN -> Citations, contributors
Payment/credit card data: FableData, SnapBizz, ShipsDNA -> Volume, distribution
Product traction/reviews: Product Hunt, GitHub, App stores -> Upvotes, stars, forks, issues, likes, reviews
Commercial databases: Crunchbase, PitchBook, CB Insights -> Funding, financial KPIs
Company reviews: Kununu, Glassdoor -> Employee feedback
Once all data is collected, the next step is to verify it. One avenue is to create a dictionary and manually rank the credibility of the respective sources, e.g. primary sources (like the company website) are more credible than secondary sources (like SimilarWeb). From this, you can create a ground truth dataset for every company and weigh the credibility of different data points. And finally, you'll be at the point where the fun stuff begins: feature engineering, NLP, sentiment analysis, scoring models…
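A credibility dictionary like the one described above could look roughly as follows in Python. Treat it as a sketch under assumptions: the scores in `SOURCE_CREDIBILITY`, the `DEFAULT_CREDIBILITY` fallback, the source labels and the sample observations are all invented for illustration, and summing credibility per value is just one possible way to weigh conflicting data points.

```python
from collections import defaultdict

# Hypothetical, manually assigned credibility scores in [0, 1]; numbers are illustrative only.
SOURCE_CREDIBILITY = {
    "company_website": 1.0,      # primary source
    "public_register": 0.9,
    "commercial_database": 0.6,
    "similarweb": 0.5,           # secondary source
}
DEFAULT_CREDIBILITY = 0.3        # fallback for sources that were not ranked yet

def resolve(observations):
    """Pick, per field, the value backed by the highest summed credibility."""
    votes = defaultdict(lambda: defaultdict(float))
    for field, value, source in observations:
        votes[field][value] += SOURCE_CREDIBILITY.get(source, DEFAULT_CREDIBILITY)
    return {field: max(values, key=values.get) for field, values in votes.items()}

# Made-up, conflicting data points for one company: (field, value, source).
observations = [
    ("headcount", 12, "company_website"),
    ("headcount", 25, "commercial_database"),
    ("founded", 2021, "public_register"),
    ("founded", 2020, "commercial_database"),
]
ground_truth = resolve(observations)
print(ground_truth)
```

Note the design choice: because scores are summed, several weaker sources that agree can outvote a single stronger one. If primary sources should always win, taking the maximum score per value instead of the sum would be the simpler rule.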
But let's cut it here, keep the fun stuff for next time and wrap up the data-driven sourcing part. Although the universe of potential identification and enrichment sources intuitively seems infinite, the above structure shows that it is actually quite manageable. In my experience, it probably follows an 80/20 Pareto/power-law distribution (similar to everything else in venture..) where a minority of the sources delivers the majority of the coverage (#LinkedIn).
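One way to check this Pareto intuition on your own data is to order sources by how many new companies each one adds and look at the cumulative coverage. The Python sketch below does this greedily. The source names and company IDs are made up, `coverage_curve` is a hypothetical helper, and "coverage" here is only relative to the companies found by at least one source, not to the true (unobserved) universe.

```python
def coverage_curve(found_by_source):
    """Greedily order sources by marginal coverage; return cumulative shares."""
    universe = set().union(*found_by_source.values())
    covered, curve = set(), []
    remaining = dict(found_by_source)
    while remaining and len(covered) < len(universe):
        # Pick the source that adds the most companies not yet covered.
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        covered |= remaining.pop(best)
        curve.append((best, len(covered) / len(universe)))
    return curve

# Made-up example: which (anonymized) companies each source identified.
found_by_source = {
    "linkedin": {"a", "b", "c", "d", "e", "f", "g"},
    "public_register": {"a", "b", "h"},
    "producthunt": {"c", "i"},
    "startup_media": {"d", "e"},
    "commercial_database": {"a", "j"},
}
curve = coverage_curve(found_by_source)
for source, share in curve:
    print(f"{source}: {share:.0%}")
```

In this toy data, one source already covers most companies and each further source adds only a little, while a source that adds nothing new never shows up in the curve.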
That being said, I'd like to end this sourcing-focused episode with the conclusion that the VC business never materially changed, but innovation forced us to add new layers: today, the human-centric approaches that help VCs get better coverage and proprietary access, and tomorrow, the data-driven approaches that will eventually lead to comprehensive coverage and, hopefully, an equal distribution of opportunity.
Stay driven,
Andre
---
If you have any suggestions, want me to feature an article, research, your tech stack or list a job, hit me up! I would love to include it in my next edition.
Phenomenal! Would love to get in-depth ideas about the data engineering stack, especially the signal-to-noise ratio (for data-driven sourcing) and how to optimize it.
Stoked about what's to come!!!
Hi, I would add "company-data" to the enrichment point of Data-driven sourcing approaches - for example, which technologies are companies using based on their self-description in job postings (e.g., as used in https://techmap.io)