Data-driven VC #6: "Sh*t in, sh*t out" and why feature engineering is the ultimate differentiator for VCs
Where venture capital and data intersect. Every week.
👋 Hi, I’m Andre and welcome to my weekly newsletter, Data-driven VC. Every Thursday I cover hands-on insights into data-driven innovation in venture capital and connect the dots between the latest research, reviews of novel tools and datasets, deep dives into various VC tech stacks, interviews with experts and the implications for all stakeholders. Follow along to understand how data-driven approaches change the game, why it matters, and what it means for you.
Current subscribers: 1,906, +260 since last week
Disclaimer: This and the next episode will be a bit nerdier 🤓
Another week, another episode :) The last five episodes focused on why VC is broken and why to start fixing it in the sourcing and screening part (episode#1), how we can (or rather need to) complement human-centric approaches with data-driven ones (episode#2), why a hybrid setup is the best answer to “make versus buy” and how we can leverage commercial startup databases (including a benchmark across the most prominent providers, episode#3), how we can complement this foundation with web crawlers/scrapers (episode#4) and, lastly, how we use entity matching to create a single source of truth (episode#5).
Assuming the above has been diligently implemented, we have achieved comprehensive coverage both with respect to the number of startups at the top of the funnel (identification) and with respect to the data for every individual startup itself (enrichment). Moreover, everything has been merged into a single source of truth, without any duplication. The problem, however, is that enrichment data is messy. Very messy. So let’s clean up!
Data Cleaning and Feature Engineering
First off, we never change the original feature values; we store them as they are in a data lake. Only then do we establish data pipelines that clean, transform and process the features. With these data pipelines, we follow two major goals:
1. Prepare data to be consumed by a frontend so that it can be presented to and interacted with by a user (for manual analysis and startup exploration)
2. Prepare data to train and run algorithms, including NLP, classification and scoring models (to eventually cut through the noise)
To achieve both, we need to apply two different techniques: data cleaning and feature engineering. “What’s the difference between data cleaning and feature engineering?” you rightfully ask. Data cleaning refers to the process of dealing with incomplete, irrelevant, corrupt or missing records in our dataset, whereas feature engineering is the process of applying domain knowledge to transform existing features or create new ones for ML model training. It’s a fine line but, at the highest level, data cleaning is a process of subtraction whereas feature engineering is a process of addition. Fun fact: data scientists spend about 2/3 of their time subtracting and adding, aka data cleaning and feature engineering.
Data cleaning and feature engineering operations depend on the feature types, so let’s look into the three major ones in our dataset:
String values are text data. We need to make sure that the text is consistent. For example, capitalization might cause problems when processing because it can change the meaning of a word or sentence, like “Bill” as a name versus “bill” as an invoice. In line with capitalization, we should also run simple spell checkers like PySpellchecker. Following the “every startup once, no more and no less” (=single source of truth) approach in episode#5, I also prefer a single “language of truth” for all string features. As we crawl/scrape startup data from different sources across different geographies, it makes sense to translate all text data into a uniform language, English in our case.
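To make this concrete, here is a minimal sketch of such a string-cleaning step in plain Python. It only covers unicode normalization, whitespace and case-insensitive matching; spell checking (e.g. via the pyspellchecker library) and translation would call external libraries or APIs, so they are omitted. The function names are illustrative, not from a real pipeline:

```python
import unicodedata

def clean_string(value: str) -> str:
    """Normalize unicode (NFKC folds full-width chars, ligatures, etc.),
    trim the ends and collapse internal runs of whitespace."""
    value = unicodedata.normalize("NFKC", value)
    return " ".join(value.split())

def match_key(value: str) -> str:
    """Casefolded key used for matching/deduplication only, so that
    'Bill' vs 'bill' doesn't create spurious mismatches while the
    display value keeps its original capitalization."""
    return clean_string(value).casefold()
```

The point of the two-function split is that lowercasing is lossy (the “Bill” vs “bill” problem), so it is applied only to the matching key, never to the stored value.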
Other string-related issues include inconsistencies in formatting. For example, if you have a column of US dollar amounts, you might want to convert any other currency type into US dollars so as to preserve a consistent standard currency. The same applies to abbreviations like “K” for thousands or “M” for millions, and for any other form of measurement such as grams, ounces, etc. You get it.
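A hedged sketch of what such currency/abbreviation normalization can look like; the exchange rates below are made-up placeholders (in a real pipeline you would pull them from an FX data source):

```python
# Hypothetical exchange rates; in practice fetched from an FX API
FX_TO_USD = {"USD": 1.0, "EUR": 1.05, "GBP": 1.21}
SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

def to_usd(raw: str, currency: str = "USD") -> float:
    """Parse amounts like '1.5M', '250K' or '$2,000' into a plain USD float."""
    raw = raw.strip().upper().lstrip("$€£")
    multiplier = 1
    if raw and raw[-1] in SUFFIXES:
        multiplier = SUFFIXES[raw[-1]]
        raw = raw[:-1]
    return float(raw.replace(",", "")) * multiplier * FX_TO_USD[currency]

print(to_usd("1.5M"))         # 1500000.0
print(to_usd("250K", "EUR"))  # 262500.0
```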
Numerical values are the most common data type we need to convert when cleaning our data. Numbers are often stored as strings, but in order to be processed they need to appear as numeric values. As long as they appear as text, they are treated as strings: we can neither present them in the frontend as intended (e.g. the user wants to sort numerical features ascending or descending) nor can the algorithms perform mathematical operations on them.
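A minimal example of such a conversion step; returning None for unparseable values is just one possible convention for flagging them as missing downstream:

```python
def to_number(raw):
    """Convert a string field to int/float where possible; return None
    for values that cannot be parsed (treated as missing downstream)."""
    if raw is None:
        return None
    cleaned = str(raw).strip().replace(",", "")
    try:
        number = float(cleaned)
    except ValueError:
        return None
    # Keep whole numbers as ints so sorting and display stay clean
    return int(number) if number.is_integer() else number

print(to_number("1,024"))  # 1024
print(to_number("3.14"))   # 3.14
print(to_number("n/a"))    # None
```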
The same applies to dates that are stored as free text. These should all be converted into a consistent, machine-readable format. For example, if you have an entry that reads “October 20th 2022”, you’ll need to change that to read “10/20/2022”.
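For example, a small sketch that normalizes such free-text dates using only the standard library (the list of accepted formats would grow with whatever your sources actually emit):

```python
import re
from datetime import datetime

def parse_date(raw: str) -> str:
    """Turn free-text dates like 'October 20th 2022' into MM/DD/YYYY."""
    # Strip ordinal suffixes: 1st, 2nd, 3rd, 20th -> 1, 2, 3, 20
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", raw.strip())
    for fmt in ("%B %d %Y", "%B %d, %Y", "%d %B %Y"):
        try:
            return datetime.strptime(cleaned, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

print(parse_date("October 20th 2022"))  # 10/20/2022
```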
Categorical data are variables that contain label values rather than numeric values. The number of possible labels is often limited to a fixed set, where each value represents a different category. One example would be the funding stage of a company, e.g. “Seed”, “Series A”, “Series B” (...), “Series G” (the latest I’ve seen, at least). With respect to goal 2. above, we oftentimes need to transform categorical features (=feature engineering) via integer encoding (think 1=Seed, 2=Series A, 3=Series B, etc.) or one-hot encoding (think 001=Seed, 010=Series A, 100=Series B, etc.) to make them “usable” by different ML models.
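Both encodings can be sketched in a few lines of plain Python; the stage list is illustrative and truncated:

```python
STAGES = ["Seed", "Series A", "Series B", "Series C"]

def integer_encode(stage: str) -> int:
    """Ordinal encoding: funding stages have a natural order,
    so 1 < 2 < 3 carries real meaning."""
    return STAGES.index(stage) + 1

def one_hot_encode(stage: str) -> list[int]:
    """One binary column per category; exactly one of them is 'hot'."""
    return [1 if s == stage else 0 for s in STAGES]

print(integer_encode("Series A"))  # 2
print(one_hot_encode("Series A"))  # [0, 1, 0, 0]
```

Integer encoding suits ordered categories like funding stages; for unordered categories (e.g. country), one-hot encoding avoids implying a fake ordering to the model.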
Missing values and sparse features
Some terminology to start with. “Sparse features” are not a result of “missing values”. When there is missing data, it means that the values of a specific feature are unknown. On the other hand, if the data is sparse, all the data points are known, but most of them have zero value. Sparse features are common in machine learning (ML) use cases, especially in the form of one-hot encoding as described above. Let’s align this logic with the two goals above.
For 1) the presentation and interaction of our data in a UI, only missing data matters (not sparse data). We first calculate the availability of values in a specific feature in percent across the full sample to then decide whether we should present or exclude the respective feature. Obviously, the higher the availability percentage, the more we should consider including it. Still, it’s a case-by-case decision that also factors in the importance of the respective feature. No big thing, no big downside.
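The availability calculation itself is straightforward; a toy sketch with made-up startup records:

```python
def availability(rows: list[dict], feature: str) -> float:
    """Share of observations where `feature` is present, in percent."""
    present = sum(1 for row in rows if row.get(feature) not in (None, ""))
    return 100 * present / len(rows)

startups = [
    {"name": "A", "funding": 1_000_000},
    {"name": "B", "funding": None},
    {"name": "C", "funding": 500_000},
    {"name": "D"},
]
print(availability(startups, "funding"))  # 50.0
```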
For 2) the training of an algorithm, however, both missing values and sparse features matter. With respect to missing values, academics would typically omit the respective observations (=startups) via listwise or pairwise deletion. But as we collect data from countless different sources, literally every startup has missing values across different features. Following the deletion approach, we would need to delete every single observation. Not feasible.
Another approach would be to fill in the missing values via methods like regression imputation (=existing variables are used to predict the missing value), synthetic data generation (check out Earlybird portfolio company Mostly AI!), “last entry carried forward” or interpolation (=the latter two for longitudinal/time-series features).
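“Last entry carried forward” and linear interpolation are simple enough to sketch without any libraries (in practice, pandas’ fillna/interpolate do this for you; regression imputation and synthetic data generation need proper models and are out of scope here):

```python
def last_entry_carried_forward(series):
    """Fill gaps in a time series with the last observed value."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

def linear_interpolate(series):
    """Fill interior gaps by drawing a straight line between
    the nearest known neighbors on each side."""
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled

print(last_entry_carried_forward([10, None, None, 40]))  # [10, 10, 10, 40]
print(linear_interpolate([10, None, None, 40]))          # [10, 20.0, 30.0, 40]
```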
A less scalable but more accurate human-in-the-loop approach would be to get in touch with someone at the respective companies and ask them to add the missing data via a survey or interface. Crunchbase is a great example of this approach as they incentivize founders and investors to claim and edit their data. A more reasonable approach would be to only reach out to a list of watchlist founders and ask to submit their data to also unbias the dataset (bias in the sense that publicly available data might be positively skewed whereas negative data is rather kept private).
With respect to the sparse features, we could remove the respective features from the model, run a principal component analysis (PCA; projects the features onto the directions of the principal components so we can select the most important components) or apply feature hashing (sparse features are binned into the desired number of output features using a hash function).
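Feature hashing, for instance, can be sketched as follows; the bucket count and tokens are arbitrary choices for illustration:

```python
import hashlib

def hash_features(tokens: list[str], n_buckets: int = 8) -> list[int]:
    """Feature hashing: bin an unbounded vocabulary of sparse
    one-hot features into a fixed number of output columns."""
    vector = [0] * n_buckets
    for token in tokens:
        # Stable hash; Python's builtin hash() is salted per process
        digest = hashlib.md5(token.encode()).hexdigest()
        vector[int(digest, 16) % n_buckets] += 1
    return vector

vec = hash_features(["fintech", "b2b", "saas"], n_buckets=8)
print(len(vec), sum(vec))  # 8 3
```

The trade-off: the output dimensionality is fixed no matter how many distinct tokens appear, at the cost of occasional collisions (two tokens landing in the same bucket).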
Creative feature engineering approaches #holygrail
Time to get creative! Once we’ve solved the missing-value and sparse-feature issues (=the messy stuff), we can start creating more explorative features. The first step is to leverage domain knowledge and add interaction features by combining two or more of the original features as products, sums or differences. A general tip is to look at each pair of features and ask yourself, “Could I combine this information in any way that might be even more useful?”
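A toy sketch of such interaction features; every feature name here is hypothetical, not from our actual schema:

```python
def add_interaction_features(startup: dict) -> dict:
    """Derive illustrative interaction features from raw ones."""
    enriched = dict(startup)
    # Capital efficiency: funding consumed per employee
    if startup.get("total_funding") and startup.get("employees"):
        enriched["funding_per_employee"] = (
            startup["total_funding"] / startup["employees"]
        )
    # Momentum: headcount growth relative to company age
    if startup.get("employee_growth_12m") is not None and startup.get("age_years"):
        enriched["growth_per_year"] = (
            startup["employee_growth_12m"] / startup["age_years"]
        )
    return enriched

s = add_interaction_features(
    {"total_funding": 2_000_000, "employees": 20,
     "age_years": 2, "employee_growth_12m": 10}
)
print(s["funding_per_employee"], s["growth_per_year"])  # 100000.0 5.0
```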
Another approach is auto-generating industry or technology tags based on company descriptions. Similar to the entity-matching approaches in episode#5, the easiest way to start is with a keyword-based classification approach. Just put all keywords into a dictionary and link them to your individual tags. On a more advanced side, we explored this use case with Earlybird portfolio companies Aleph Alpha and thingsTHINKING by providing our internal industry and technology tags together with a set of manually labeled companies to adapt their generic large language/semantic models to our specific requirements.
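The keyword-based starting point really is as simple as it sounds; the keywords and tags below are placeholders, not our actual taxonomy:

```python
# Hypothetical keyword dictionary mapping trigger words to internal tags
KEYWORD_TAGS = {
    "payments": "FinTech",
    "lending": "FinTech",
    "diagnostics": "HealthTech",
    "machine learning": "AI/ML",
    "llm": "AI/ML",
}

def tag_company(description: str) -> set[str]:
    """Naive keyword-based classifier: assign every tag whose
    trigger keyword appears in the lowercased description."""
    text = description.casefold()
    return {tag for keyword, tag in KEYWORD_TAGS.items() if keyword in text}

print(tag_company("We build machine learning models for payments fraud"))
# a set containing 'AI/ML' and 'FinTech'
```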
Lastly, to share some of the more explorative work we’ve been doing: we combine the features that include founders’ public social media profiles with a model derived from the paper “Computer-based personality judgments are more accurate than those made by humans” to predict founders’ Big Five personality traits. Another approach I explored some years ago was to transform Twitter features into a knowledge graph, create a list of tier-1 investors and derive a binary feature indicating whether more than one of the listed investors follows the same company. A signal to look at this seemingly interesting company too, right?
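That last follow signal boils down to a set intersection; the investor handles and the threshold below are made up for illustration:

```python
# Hypothetical tier-1 investor Twitter handles
TIER1_INVESTORS = {"investor_a", "investor_b", "investor_c"}

def tier1_follow_signal(company_followers: set[str], threshold: int = 2) -> int:
    """Binary feature: 1 if at least `threshold` tier-1 investors
    follow the company's account, else 0."""
    return int(len(company_followers & TIER1_INVESTORS) >= threshold)

print(tier1_follow_signal({"investor_a", "investor_b", "someone_else"}))  # 1
print(tier1_follow_signal({"investor_a"}))                                # 0
```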
The universe of such ideas for explorative feature engineering is virtually endless, and I perceive this part as the most challenging but also the most differentiating part of the screening process. Explorative feature engineering/generation is where you can really make a difference as a data-driven VC in the long run, and where you should keep your secret sauce. This is the holy grail of data-driven VC. Data collection and model training, by contrast, are pretty straightforward.
In the next episode, I will dive deeper into goal 2) above and explore the world of screening approaches, from deterministic to ML-based ones.
Thank you for reading. If you liked it, share it with your friends, colleagues and everyone interested in data-driven innovation. Subscribe below and follow me on LinkedIn or Twitter to never miss data-driven VC updates again.
If you have any suggestions, want me to feature an article, research, your tech stack or list a job, hit me up! I would love to include it in my next edition😎