

First steps as a data-driven VC without coding skills: AI-powered Google Sheet to track LinkedIn profiles
DDVC #32: Where venture capital and data intersect. Every week.
👋 Hi, I’m Andre and welcome to my weekly newsletter, Data-driven VC. Every Thursday I cover hands-on insights into data-driven innovation in venture capital and connect the dots between the latest research, reviews of novel tools and datasets, deep dives into various VC tech stacks, interviews with experts and the implications for all stakeholders. Follow along to understand how data-driven approaches change the game, why it matters, and what it means for you.
Current subscribers: 7,747+, +162 since last week
Incentives to become more data-driven are obvious, yet many firms are stuck in a buy versus build trade-off and end up doing nothing.
“What are the first steps on the journey from productivity VC to data-driven VC?” Today, I’m incredibly excited to have Vlastimil Vodička, CEO and Founder of Leadspicker, share his step-by-step guide to start leveraging AI as a VC - without coding skills - in his guest post below.
If you're a fan of Andre's newsletter, chances are you're already intrigued by using LinkedIn for startup sourcing and by GPT's powerful classification and categorization capabilities, even if you lack coding skills.
As part of a little no-code experiment, we've put together a comprehensive guide on how to connect OpenAI's GPT to your Google Spreadsheet. This guide will show you how to evaluate companies scraped from LinkedIn and determine if they're a startup or not, and how to categorize them into predefined categories directly in your spreadsheet using GPT.
By reading this blog post, you will learn:
How to Add OpenAI's GPT to Your Google Spreadsheet
Challenges We Faced When Scraping Data from LinkedIn
How to Classify and Categorize Startups with GPT via Your Google Spreadsheet
Outcomes of Our Little No-Code Experiment
Extracting Data from LinkedIn Sales Navigator
For this experiment, we extracted data from LinkedIn Sales Navigator, focusing on people who became founders in the Central and Eastern Europe (CEE) region within the last two years. To achieve this, we used the advanced search filters in Sales Navigator, targeting specific job titles. We recommend using a boolean query such as "founder" OR "co-founder" OR "CEO" OR "CTO" instead of the filter options that LinkedIn offers, as it can provide more accurate and comprehensive results.
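If you maintain your target job titles as a list, the boolean query can be assembled rather than typed by hand. The snippet below is a minimal sketch; buildBooleanQuery is a hypothetical helper name, and the output is simply a string to paste into the Sales Navigator title search.

```javascript
// Hypothetical helper: joins job titles into a boolean query string.
function buildBooleanQuery(titles) {
  return titles
    .map((title) => title.trim())
    .filter((title) => title.length > 0)
    // Quote each title so multi-word titles are matched as one phrase
    .map((title) => `"${title}"`)
    .join(" OR ");
}

const query = buildBooleanQuery(["founder", "co-founder", "CEO", "CTO"]);
console.log(query); // "founder" OR "co-founder" OR "CEO" OR "CTO"
```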
We chose to focus on the CEE region, but you can select any region according to your specific interests.
To extract data from Sales Navigator, we recommend using no-code tools such as PhantomBuster, Duck-soup or Apify.
Challenges we faced:
Batch size: LinkedIn doesn't display the exact number of people in its database who match your search criteria. Splitting your search region into smaller samples of at most 2,000 contacts per batch helps ensure that you retrieve all the relevant data.
False positives: LinkedIn can also display completely irrelevant profiles that don't match your search criteria, and it's unclear why this happens. This makes it difficult to categorize and analyze the extracted data accurately, so it's important to clean and filter the data carefully before further analysis. This step is time-consuming but crucial for the accuracy and reliability of your data.
Data cleaning: it's important to deduplicate and clean the data by removing obviously irrelevant profiles. This step is necessary to ensure that the final data set is accurate.
Data enrichment may be necessary depending on the tool you used for extracting data. Make sure that you have scraped the information from LinkedIn company profiles, such as the company description, headquarters location, and the number of employees. PhantomBuster and Duck-soup can do the trick.
Keep in mind LinkedIn's profile visit limits to avoid getting your LinkedIn account blocked, as already described in his article.
Outcome: We were able to export a total of 29,763 profiles. After some basic deduplication and data cleaning, we ended up with 21,311 unique firms in the dataset.
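For readers who prefer to script the deduplication step instead of doing it by hand, the following is a minimal sketch of one possible approach, not the exact procedure we used. The row fields (name, company, companyUrl) are assumptions about what your export contains, and dropping rows without any company information is a design choice you may want to change.

```javascript
// Assumed row shape: { name, company, companyUrl }. Adapt to your export.
function normalizeKey(value) {
  return String(value || "")
    .trim()
    .toLowerCase()
    .replace(/\/+$/, ""); // ignore trailing slashes in URLs
}

function dedupeByCompany(rows) {
  const seen = new Set();
  const unique = [];
  for (const row of rows) {
    // Prefer the company URL as the key; fall back to the company name
    const key = normalizeKey(row.companyUrl) || normalizeKey(row.company);
    // Skip rows without company information and rows already seen
    if (!key || seen.has(key)) continue;
    seen.add(key);
    unique.push(row);
  }
  return unique;
}
```

The first profile found for each company is kept, so sort the rows beforehand if you care which founder represents a firm.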
Now, let's find out whether they really are new startups and which industries they belong to.
How to add GPT-3.5 to Google Spreadsheet
Open the Google Sheet where you want to add GPT-3.5 -> go to Extensions -> Apps Script
Copy and paste the attached code into your Google Apps Script Project
/**
 * Generates text using OpenAI's gpt-3.5-turbo chat model
 * @param {string} prompt The prompt to feed to the model
 * @param {string} cell The cell content to append to the prompt
 * @param {number} [maxWords] Approximate maximum number of words to generate (defaults to a 100-token limit)
 * @return {string} The generated text
 * @customfunction
 */
function runOpenAI(prompt, cell, maxWords) {
  const API_KEY = "YourAPIkey"; // replace with your OpenAI secret key
  const model = "gpt-3.5-turbo";
  const temperature = 0; // 0 keeps the answers as repeatable as possible
  // One token is roughly 0.75 English words, so convert the word limit
  // into a whole number of tokens (the API only accepts integers)
  const maxTokens = maxWords ? Math.ceil(maxWords / 0.75) : 100;
  // Append the cell content to the prompt
  const fullPrompt = prompt + cell + ":";
  // Set up the request body with the given parameters
  const requestBody = {
    "model": model,
    "messages": [
      {"role": "system", "content": "You are a helpful assistant that answers questions."},
      {"role": "user", "content": fullPrompt}
    ],
    "temperature": temperature,
    "max_tokens": maxTokens
  };
  console.log(requestBody);
  // Set up the request options with the required headers
  const requestOptions = {
    "method": "POST",
    "headers": {
      "Content-Type": "application/json",
      "Authorization": "Bearer " + API_KEY
    },
    "payload": JSON.stringify(requestBody)
  };
  // Send the request to the OpenAI chat completions endpoint
  const response = UrlFetchApp.fetch("https://api.openai.com/v1/chat/completions", requestOptions);
  console.log(response.getContentText());
  // Parse the response body as a JSON object
  const responseBody = JSON.parse(response.getContentText());
  // Return the generated text from the first choice
  return responseBody.choices[0].message.content;
}
Go to https://openai.com/api/ and click on "Sign up" to complete your registration (if you already have an account, simply click on "Log in").
Navigate to the "API Keys" tab, or use this direct link: https://platform.openai.com/account/api-keys.
If you already have an API key, just copy it. If you don't, create a new key by clicking on “+ Create new secret key”.
Paste your key inside your Google Apps Script project, replacing the YourAPIkey placeholder.
It should look like the screenshot below:
Save the Google Apps Script project.
GREAT! Now after returning to your Google Sheet, you can directly access AI by using the formula =runOpenAI() and inserting your prompt.
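If you share the sheet with colleagues, you may not want the key to sit in the script's source code. One option, sketched below as an assumption rather than part of our original setup, is to store it as a script property and read it at runtime. readApiKey is a hypothetical helper and "OPENAI_API_KEY" is a property name chosen for this example; in Apps Script you would pass PropertiesService.getScriptProperties() as the argument.

```javascript
// Reads the OpenAI key from a properties store instead of the source code.
// `properties` is any object with a getProperty(name) method, such as the
// one returned by PropertiesService.getScriptProperties() in Apps Script.
function readApiKey(properties) {
  const key = properties.getProperty("OPENAI_API_KEY");
  if (!key) {
    throw new Error("Script property OPENAI_API_KEY is not set.");
  }
  return key;
}

// Inside runOpenAI you could then write:
// const API_KEY = readApiKey(PropertiesService.getScriptProperties());
```

Script properties can be added in the Apps Script editor under Project Settings.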
Classifying Startups: How to Determine if a Company is a Startup or Not in Google Sheets
Now you have an AI-powered Google Sheet in place! So we can now begin classifying the companies that we extracted from LinkedIn using GPT to determine whether or not they are startups.
We used the GPT-3.5 Turbo model (gpt-3.5-turbo), the same one that powers ChatGPT. To start classifying companies, create a new column in your Google Sheet and call the function directly in the appropriate cell to determine whether the company is a startup based on its description. For this experiment, we used this prompt:
=runOpenAI("Based on the provided description, decide whether this company might be a startup or not. Software companies with a product are more likely to be startups. Respond with either yes startup or not startup. Description:",B3,100)
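GPT does not always reply with exactly "yes startup" or "not startup", so it can help to normalize the answers before counting them. The function below is a minimal sketch of one way to do that; parseStartupAnswer and the "unclear" label are hypothetical names introduced for illustration.

```javascript
// Maps a free-text model answer to "startup", "not startup" or "unclear".
function parseStartupAnswer(answer) {
  const text = String(answer || "").trim().toLowerCase();
  // Check negative forms first, since "not startup" also contains "startup"
  if (/^no\b/.test(text) || /\bnot\s+(a\s+)?startup\b/.test(text)) {
    return "not startup";
  }
  if (/\byes\b/.test(text) || /\bstartup\b/.test(text)) {
    return "startup";
  }
  // Anything else (empty cells, explanations, errors) needs a manual look
  return "unclear";
}
```

Rows labeled "unclear" are good candidates for a manual review or a reworded prompt.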
Categorizing Startups into Selected Industries with GPT in Google Sheets
In our experiment, we not only wanted to determine whether or not a company is a startup, but also to categorize them into specific industries. To achieve this, we selected 10 categories and used GPT to decide whether each startup belongs to one or up to three of them.
You can do it similarly as in the step above, just changing the prompt. We used the following prompt:
=runOpenAI("Decide based on the provided company description in which industry this company operates. Answers can be any of these exactly: 'fintech', 'IoT & Robotics', 'Biotech & LifeScience', 'B2B SaaS', 'AI & Deeptech', 'Blockchain', 'Health & Life Science', 'Energy & Cleantech', 'Agriculture'. Each company can belong to one or maximum 3 categories",B2)
Optimizing your prompt is important to ensure that you get accurate and relevant results. It may be necessary to rewrite your prompt, adjust your approach, or test it on multiple companies until you are satisfied with the output.
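While testing a prompt, it is useful to check whether the answers stay within the allowed categories and the limit of three. The sketch below shows one way to validate an answer; parseCategories is a hypothetical helper, and the category list is copied from the prompt above.

```javascript
const ALLOWED_CATEGORIES = [
  "fintech", "IoT & Robotics", "Biotech & LifeScience", "B2B SaaS",
  "AI & Deeptech", "Blockchain", "Health & Life Science",
  "Energy & Cleantech", "Agriculture"
];

// Returns the allowed categories mentioned in the answer, in the order
// they appear, cut off at maxCategories. Unknown labels are ignored.
function parseCategories(answer, allowed, maxCategories) {
  const text = String(answer || "").toLowerCase();
  const hits = [];
  for (const category of allowed) {
    const position = text.indexOf(category.toLowerCase());
    if (position !== -1) hits.push({ category, position });
  }
  hits.sort((a, b) => a.position - b.position);
  return hits.slice(0, maxCategories).map((hit) => hit.category);
}

console.log(parseCategories("Fintech, Blockchain", ALLOWED_CATEGORIES, 3));
```

An empty result means the model answered outside the list, which is usually a sign that the prompt needs another iteration.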
Challenges we faced when categorizing and classifying companies in Google Sheets:
We had to adjust the API parameters, such as the max tokens and temperature, to achieve satisfactory results.
The OpenAI API may be unstable when working with a large amount of data.
Google Sheets is not always the best tool for managing such datasets.
When Google Sheets refreshes, the formulas may call the OpenAI API again, leading to additional charges.
To prevent this, it is recommended to copy and paste the results as values to avoid formula relaunches.
Additionally, those starting out may quickly reach the OpenAI credit limit and need to request an increase.
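If pasting results as values is not practical while you are still iterating, a short-lived cache is another way to reduce repeated calls. The sketch below is an assumption, not something we used in the experiment: cachedAnswer and shortKey are hypothetical helpers, and `cache` is any object with get and put methods, such as the one returned by CacheService.getScriptCache() in Apps Script.

```javascript
// Builds a short cache key, since Apps Script cache keys are limited in
// length. This is a simple non-cryptographic hash; collisions are possible.
function shortKey(text) {
  let hash = 5381;
  for (let i = 0; i < text.length; i++) {
    hash = ((hash * 33) ^ text.charCodeAt(i)) >>> 0;
  }
  return "gpt_" + hash.toString(16) + "_" + text.length;
}

// Returns the cached value for `key`, or computes and stores it.
function cachedAnswer(cache, key, compute) {
  const hit = cache.get(key);
  if (hit !== null && hit !== undefined) return hit;
  const value = compute();
  cache.put(key, value, 21600); // 6 hours, the longest Apps Script allows
  return value;
}

// Possible use inside a custom function (Apps Script only):
// const cache = CacheService.getScriptCache();
// return cachedAnswer(cache, shortKey(prompt + cell),
//                     () => runOpenAI(prompt, cell, maxWords));
```

Because the cache expires, copying the final results as values remains the durable fix.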
Discovering New Startups in CEE: The Outcome of Our Process
After running the classification process, we discovered that since 2021, there have been 8,368 companies in the CEE region that were classified by GPT as possible startups based on the prompt. This information has given us valuable insights into the startup ecosystem in the region and could be used to guide further research or investment opportunities. Here are some additional insights we gained from the process:
Insights:
The high conversion rate of 8,368 out of 21,311 is likely due to the filters we used on LinkedIn, as individuals with job titles such as founder or CEO in small companies that have recently started are likely to appear as "startup-ish."
In 680 cases, we lacked enough information to decide because the founder's company had no description on LinkedIn.
2,320 companies had descriptions on LinkedIn, but their websites were not working, usually because the domain was parked at a domain provider. In our experience, this often indicates that the startup is no longer operating rather than just starting out.
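To put these counts in proportion, the shares can be computed as in the sketch below. It assumes that all three counts are measured against the 21,311 unique firms, which the numbers above do not state explicitly for the last two; the label names are made up for this example.

```javascript
const totalFirms = 21311;
const counts = {
  possibleStartups: 8368,
  missingDescription: 680,
  nonWorkingWebsite: 2320
};

// Share of the total in percent, rounded to one decimal place
function sharePercent(count, total) {
  return Math.round((count / total) * 1000) / 10;
}

for (const [label, count] of Object.entries(counts)) {
  console.log(`${label}: ${sharePercent(count, totalFirms)}%`);
}
// possibleStartups: 39.3%
// missingDescription: 3.2%
// nonWorkingWebsite: 10.9%
```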
Below, we present the outcomes of the categorization process based on the analysis conducted by GPT-3.5:
Findings:
One company may fall under up to three categories since industries often overlap.
GPT-3.5 understands some industries, such as fintech, very well. For others, you may need to add a definition of the industry directly to the prompt.
At the time of writing, GPT-4 was not generally available through the OpenAI API; it may be more effective at understanding industries.
Providing more description is not always better, as irrelevant blog posts or other SEO-oriented content can negatively impact results.
The best approach is to use the LinkedIn company description and the experience section descriptions of founders.
When information is missing, you can use metadata from the company website or a summary of the website, but these methods require more sophisticated tools than Google Sheets.
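Adding industry definitions to the prompt by hand gets tedious once you have many categories. The sketch below shows one way to assemble such a prompt from a list; buildCategoryPrompt is a hypothetical helper, and the example definition is invented for illustration, not one we tested.

```javascript
// Builds a categorization prompt from category names and optional definitions.
function buildCategoryPrompt(categories, maxCategories) {
  const lines = categories.map((c) =>
    c.definition ? `- '${c.name}': ${c.definition}` : `- '${c.name}'`
  );
  return [
    "Decide based on the provided company description in which industry this company operates.",
    "Answer with exactly these category names:",
    ...lines,
    `Each company can belong to one or a maximum of ${maxCategories} categories.`,
    "Description:"
  ].join("\n");
}

const examplePrompt = buildCategoryPrompt(
  [
    { name: "fintech" },
    // The definition below is an invented example
    { name: "IoT & Robotics", definition: "connected hardware devices and robots" }
  ],
  3
);
console.log(examplePrompt);
```

In Apps Script, the resulting string could be passed as the first argument of runOpenAI from a second custom function.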
Exploring CEE Startup Founders in the Diaspora: Which Locations Are Assigned to Them on LinkedIn?
During our experiment, we discovered that many founders with companies based in the CEE region are discoverable on LinkedIn as being located in the USA or other countries outside the region.
Conversely, we also found that many founders originally from Ukraine or Russia have companies that LinkedIn lists as being based in CEE. These findings demonstrate the fluidity of startup founders' locations and the importance of understanding their actual geographic locations when analyzing the CEE startup ecosystem.
Conclusion
In conclusion, leveraging the power of LLMs such as OpenAI's GPT together with LinkedIn for startup sourcing and categorization can be a game-changer for VCs and researchers alike.
This guide serves as a starting point for those who may not have coding expertise but are eager to explore GPT's capabilities by integrating it into a Google Spreadsheet. Importantly, the primary goal of this article was to provide a detailed process description, empowering readers to replicate and experiment with the methodology themselves, rather than solely focusing on the outcomes.
I would like to extend a big thank you to my colleague Tomas Blatak, who played an instrumental role in conducting this experiment.
That’s it for today. Hopefully, this comprehensive guide serves as further inspiration for your firm to become more data-driven. Let’s leverage our growing community to learn from each other and eventually make our industry more efficient, effective and inclusive.
Stay driven,
Andre
PS: Check out the “Riding Unicorns” podcast by my friends James and Hector to explore the ins and outs of tech unicorns with some of the leading VCs and operators.
Thank you for reading. If you liked it, share it with your friends, colleagues and everyone interested in data-driven innovation. Subscribe below and follow me on LinkedIn or Twitter to never miss data-driven VC updates again.
If you have any suggestions, want me to feature an article, research, your tech stack or list a job, hit me up! I would love to include it in my next edition😎
This is great and inspiring! We're seeing some of these approaches being adopted in China. Although the raw data sources are different (e.g., China's LinkedIn data is not as rich), Chinese VCs use many publicly available data sources, such as e-commerce data and company registration records, to track up-and-coming startups. They also organize these startups into knowledge graph maps.