LinkedIn is the world’s largest professional network with over 850 million members. For professionals and recruiters, it provides a great platform to connect with other professionals, search for jobs or talent, and build a professional brand. LinkedIn also provides useful insights through data on skills, industries, companies and more. This data can be extracted for analysis using Python.
In this comprehensive guide, we will learn how to extract data from LinkedIn using Python, covering the requirements, the three main extraction methods, and hands-on examples with Selenium and the LinkedIn APIs.
Requirements
To extract data from LinkedIn, you need to have the following requirements fulfilled:
- LinkedIn Account: You need a LinkedIn account to access LinkedIn data. Make sure your account is logged in before scraping data.
- LinkedIn API Keys: LinkedIn provides various APIs to extract data. To use these APIs, you will need API keys which can be obtained after registering as a LinkedIn developer.
- Python: Python programming knowledge is required to write the code to extract and process LinkedIn data. We’ll be using libraries like Selenium, Beautiful Soup, and Pandas.
- Code Editor: You need a code editor like Jupyter Notebook or Spyder to write and execute the Python code.
Methods to Extract LinkedIn Data
There are primarily three methods to extract data from LinkedIn:
- Web Scraping: This involves directly scraping LinkedIn web pages to extract data. We’ll use Selenium and Beautiful Soup for web scraping.
- LinkedIn APIs: LinkedIn provides various APIs for recruiters, ads, groups etc. We can use these APIs to fetch data.
- LinkedIn Data Exports: Members can export their own LinkedIn data as well as company data if they are admins. This data can then be processed using Python.
Let’s look at each of these methods in more detail:
1. Web Scraping LinkedIn Data
Web scraping involves accessing web page source code and extracting information from HTML tags and classes. The main steps are:
- Use Selenium to access LinkedIn pages by automating a browser.
- Parse the page source using Beautiful Soup library.
- Target specific HTML tags, ids, classes to extract text and data.
- Store the scraped data in structured formats like DataFrames.
Here is a sample code to extract names from LinkedIn search pages:
```python
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome()
driver.get('https://www.linkedin.com/search/results/people/?keywords=data%20scientist')

# Parse the rendered page source
soup = BeautifulSoup(driver.page_source, 'html.parser')

# The class name below reflects LinkedIn's markup at the time of writing
# and may change without notice
names = []
for tag in soup.find_all('span', {'class': 'entity-result__title-text'}):
    names.append(tag.text.strip())

df = pd.DataFrame({'names': names})
print(df)
```
This will print out a Pandas DataFrame with names extracted from the page source. The main advantages of web scraping are:
- Can extract any public information without needing API access.
- Provides granular control over data extraction process.
However, web scraping has certain limitations:
- Prone to breaking if LinkedIn changes page layouts.
- Difficult to scale across many pages and profiles.
- Risk of blocking or CAPTCHAs when scraping at scale (see the throttling sketch below).
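One common mitigation for blocking is to throttle and randomize requests. Here is a minimal sketch; the profile URLs are placeholders:

```python
import random
import time
from selenium import webdriver

driver = webdriver.Chrome()

# Placeholder URLs; replace with pages you are permitted to view
urls = [
    'https://www.linkedin.com/in/billgates',
    'https://www.linkedin.com/in/stevejobs',
]

for url in urls:
    driver.get(url)
    # Random delay between requests to mimic human browsing
    # and reduce the chance of rate limiting
    time.sleep(random.uniform(3, 8))
```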
2. Using LinkedIn APIs
LinkedIn provides a range of REST APIs for recruiters, advertising, company information and more. These require API keys but provide structured data access. Some key APIs include:
- Jobs API: Provides search and posting APIs for jobs.
- Companies API: Details on companies, statistics and industries.
- Interests API: Get member interests and suggestions.
- Ads API: Programmatic access for managing LinkedIn ads.
Here is sample code to fetch company details using the Companies API (the exact query parameters vary by API version and app permissions):
```python
import requests

api_key = 'xxxxxxxxxxxx'  # OAuth 2.0 access token for your app

# Note: the exact query parameters and projection syntax depend on the
# API version and the permissions granted to your app; this URL is illustrative
url = 'https://api.linkedin.com/v2/companies?q=Anthropic&projection=(id,name,staffCountRange)'
headers = {'Authorization': 'Bearer ' + api_key}

response = requests.get(url, headers=headers)
print(response.json())
```
This prints out company details in JSON format. The key advantages of APIs are:
- Structured data in consistent formats like JSON.
- Higher rate limits for data access.
- Official support and documentation for usage.
The limitations include:
- Require API keys and authorization.
- Restricted to specific endpoints provided.
- Cannot extract all data that web scraping might allow.
3. Exporting LinkedIn Data
LinkedIn allows users to export their own profile data as well as company page data if they are administrators. The steps are:
- Go to account settings and privacy.
- Click on “Get a copy of your data”.
- Choose desired data settings and export type.
- LinkedIn will email you the export file when ready.
The export is provided as a Zip archive. Depending on the datasets you select, it can contain profile info, connections, interests, company pages, ads data and more.
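Before parsing anything, it helps to list what the archive actually contains. Here is a minimal sketch; the archive name linkedin-export.zip is a placeholder:

```python
import zipfile

# 'linkedin-export.zip' is a placeholder for your downloaded archive
with zipfile.ZipFile('linkedin-export.zip') as archive:
    # List every file in the export to see what is available
    for name in archive.namelist():
        print(name)
    # Extract everything to a working directory
    archive.extractall('linkedin-data')
```

With the files extracted, here is sample code to process an exported JSON file: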
```python
import json

with open('linkedin-data.json') as f:
    data = json.load(f)

print(data['positions'])
```
This prints the position details from the exported profile data. The main advantages of exported data are:
- Full access to your personal LinkedIn data.
- Get company and LinkedIn page data if admin.
- Data already structured in JSON format.
The limitations are:
- Only your own data or company data if admin.
- Need to manually export and process data.
- Data exports are limited to a few specific datasets.
Selenium Web Scraping Hands-On
Now that we have understood the basics, let’s dive deeper into web scraping using Selenium in Python. We will learn how to:
- Set up Selenium and ChromeDriver
- Log in to LinkedIn using Selenium
- Navigate profile pages
- Scroll pages to load dynamic content
- Extract text, links and data
- Store extracted data in Pandas
Installing Selenium
We need to install Selenium first, along with ChromeDriver, which lets Selenium control the Chrome browser.
```bash
pip install selenium
# Download ChromeDriver and add it to your PATH:
# https://sites.google.com/a/chromium.org/chromedriver/downloads
```
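If you would rather not manage the driver manually, the third-party webdriver-manager package can download a matching ChromeDriver at runtime (and Selenium 4.6+ ships its own Selenium Manager that does this automatically). A sketch using webdriver-manager:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Downloads a ChromeDriver matching the installed Chrome version
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
```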
Login to LinkedIn using Selenium
Here is how to log into LinkedIn using Selenium:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.linkedin.com')

# Element ids and classes reflect LinkedIn's login form
# at the time of writing and may change
username = driver.find_element(By.ID, 'session_key')
username.send_keys('my_username')

password = driver.find_element(By.ID, 'session_password')
password.send_keys('my_password')

submit = driver.find_element(By.CLASS_NAME, 'sign-in-form__submit-button')
submit.click()
```
This fills in the username and password and clicks submit to log in. We can also use CSS selectors alongside ids and classes to identify elements, as in the sketch below.
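For example, a CSS selector can target the submit button by attribute instead of a long class name. This is a sketch; the selector is an assumption about LinkedIn's current markup:

```python
from selenium.webdriver.common.by import By

# Attribute-based selector; less likely to break than a specific
# class name, though the form structure is still an assumption
submit = driver.find_element(By.CSS_SELECTOR, 'form button[type="submit"]')
submit.click()
```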
Navigate to Profiles
Once logged in, we can navigate to any profile using:
```python
driver.get('https://www.linkedin.com/in/billgates')
```
Similarly, we can navigate to other pages like search results, company pages and groups. Selenium handles the navigation and carries over the session cookies from the login.
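Because profile pages render much of their content asynchronously, it is safer to wait for a known element before scraping. A minimal sketch using Selenium's explicit waits; the h1 tag stands in for whatever element you actually target:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get('https://www.linkedin.com/in/billgates')

# Wait up to 10 seconds for the profile heading to appear
wait = WebDriverWait(driver, 10)
heading = wait.until(EC.presence_of_element_located((By.TAG_NAME, 'h1')))
print(heading.text)
```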
Scroll to Load Dynamic Content
LinkedIn often loads content dynamically as you scroll down a profile. To load such content, we need to simulate scrolling using Selenium:
```python
import time

SCROLL_PAUSE_TIME = 3

# Get the initial scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to the bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait for new content to load
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate the new scroll height and compare with the last height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
```
This keeps scrolling until the page height stops changing, which means no new content is loading. We can then scrape the entire page, including the dynamically loaded content.
Extract Text, Links and Data
To extract information from the pages, we will use Beautiful Soup after Selenium loads the page. For example:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')

# Extract text from elements
name = soup.find('h1').getText()

# Extract links (href=True skips anchors without an href attribute)
links = [a['href'] for a in soup.find_all('a', href=True)]

# Extract element attributes like ids
# (the 'profile-detail' class is illustrative)
ids = [div['id'] for div in soup.find_all('div', {'class': 'profile-detail'})]
```
We can target specific classes, ids, attributes or text to extract any data from the Selenium loaded page using Beautiful Soup.
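Beautiful Soup also supports CSS selectors through select() and select_one(), which can read more cleanly than chained find calls. The class and id names below are illustrative, not LinkedIn's actual markup:

```python
# CSS selectors via select(); the selectors here are illustrative
headline = soup.select_one('div.profile-top-card h1')
experience = soup.select('section#experience li')

for item in experience:
    print(item.get_text(strip=True))
```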
Store Extracted Data in Pandas
For storing and analyzing extracted data, we can use Pandas DataFrames:
```python
import pandas as pd
from selenium.webdriver.common.by import By

data = []
profiles = ['https://www.linkedin.com/in/billgates',
            'https://www.linkedin.com/in/stevejobs']

for profile in profiles:
    driver.get(profile)
    # The 'name' and 'company' class names are illustrative;
    # inspect the page source for the current markup
    name = driver.find_element(By.CLASS_NAME, 'name').text
    company = driver.find_element(By.CLASS_NAME, 'company').text
    data.append([name, company])

df = pd.DataFrame(data, columns=['Name', 'Company'])
print(df)
```
This will scrape multiple profiles and store structured data in a DataFrame for analysis.
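The DataFrame can then be persisted for later analysis, for example:

```python
# Save the scraped profiles to CSV for later analysis
df.to_csv('profiles.csv', index=False)
```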
LinkedIn API Hands-On
Now let’s dive deeper into using official LinkedIn APIs. We will cover:
- Obtaining API keys
- Authentication and API requests
- Profile and Company APIs
- Jobs API
- Search API
- Ads API
Obtaining LinkedIn API Keys
To get started with LinkedIn APIs, we need to obtain API keys:
- Go to https://www.linkedin.com/developers/
- Sign in with your LinkedIn account
- Register for a new app to get API keys
- The app will have a Client ID and Client Secret
- We will use these keys to authenticate in our code
Authentication and API Requests
To use the APIs, we first exchange the API keys for an OAuth 2.0 access token. The code below assumes a client wrapper that exposes this flow; the exact interface depends on the library you choose:
```python
from linkedin_api import Linkedin

CLIENT_ID = 'xxxxxx'
CLIENT_SECRET = 'xxxxxxx'
API_URL = 'https://api.linkedin.com/v2/'

# Note: wrapper interfaces vary between libraries and versions;
# check your client library's documentation for the exact calls
linkedin = Linkedin(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)

# Authenticate and get an access token
token = linkedin.get_access_token()

# Set the access token for all subsequent requests
linkedin.set_access_token(token)
```
We can now use the linkedin client for API requests. For example, to get profile data:
```python
response = linkedin.get_profile(fields=['id', 'firstName', 'lastName'])
print(response.json())
```
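If you prefer not to depend on a wrapper, the token exchange can be done directly with requests against LinkedIn's OAuth 2.0 endpoint. This sketch assumes the client-credentials (two-legged) grant is enabled for your app; most member-data endpoints instead require the three-legged authorization-code flow:

```python
import requests

CLIENT_ID = 'xxxxxx'
CLIENT_SECRET = 'xxxxxxx'

# Exchange app credentials for an access token
token_resp = requests.post(
    'https://www.linkedin.com/oauth/v2/accessToken',
    data={
        'grant_type': 'client_credentials',
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
    },
)
access_token = token_resp.json()['access_token']

# Call a REST endpoint with the bearer token
headers = {'Authorization': f'Bearer {access_token}'}
response = requests.get('https://api.linkedin.com/v2/me', headers=headers)
print(response.json())
```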
Profile and Company APIs
Here are some examples of the Profile and Company APIs:
```python
# Get your own profile data
my_profile = linkedin.get_profile(fields=['education', 'positions'])

# Get a profile by id
profile = linkedin.get_profile('abcdefg1234', fields=['skills'])

# Search for people
people = linkedin.search_people(keywords='python developer', limit=10)

# Company data
company = linkedin.get_company('anthropic', fields=['staffCount'])

# Search companies
companies = linkedin.search_companies(keywords='artificial intelligence')
```
Jobs API
The Jobs API can be used to search jobs and post new listings:
```python
# Search jobs by keyword
jobs = linkedin.search_jobs(keywords='data scientist')

# Post a new job listing
# (job posting generally requires partner-level API access)
job_data = {'title': 'Python Developer'}
linkedin.post_job(job_data)
```
Search API
We can search across multiple datasets like people and companies:
```python
# Search people and companies
results = linkedin.search(keywords='data', types=['people', 'companies'], limit=50)

# Search by name
john = linkedin.search(name='John Doe', types=['people'])
```
Ads API
The Ads API can create and manage LinkedIn ads programmatically:
```python
# Get the list of current ad accounts
# (method names are illustrative; the Marketing API exposes
# REST endpoints and requires separate approval)
accounts = linkedin.get_ad_accounts()

# Create a new Sponsored Content ad
content = {'description': 'Check out our products!'}
campaign = {'dailyBudget': 100, 'campaignGroup': 'Website Visitors'}
ad = linkedin.create_sponsored_content(accounts[0], content, campaign)
```
This covers the key steps to use LinkedIn APIs in Python. The APIs provide powerful structured access to LinkedIn data for applications.
Conclusion
In this comprehensive guide, we learned different methods to extract LinkedIn data using Python:
- Web scraping using Selenium and Beautiful Soup
- LinkedIn API access using API keys
- Exporting and processing LinkedIn data
Here are some key takeaways:
- Web scraping provides fine-grained control but can break more easily.
- APIs offer structured data access but require keys and are subject to rate limits.
- Exports give access to your own data but are limited to a few datasets.
- Selenium is useful for automation and dynamic page loads.
- Beautiful Soup helps extract data from page source.
- Pandas can store extracted data as DataFrames.
LinkedIn data can provide valuable insights for recruiters, marketers, analysts and developers. Using Python, we can build powerful applications and analytics on top of LinkedIn’s professional data.