
How I Scraped More Than 100,000 Posts on LinkedIn

BY Atakan Demircioğlu

I needed to analyze some LinkedIn posts, so I decided to scrape viral posts from LinkedIn. This is the story of scraping LinkedIn like a pro.


As I said in the description, I needed public LinkedIn post data, and when I looked for an existing dataset I couldn’t find one. So I decided to scrape the public posts myself, and that is where the story started.

How to Scrape LinkedIn?

That was the first question, and there are different ways to answer it. In general, I wanted my scraper to act like a real user, because this gives me some protection from LinkedIn’s ban policies, restrictions, and so on. For this reason, I decided to use a headless browser.

On the other hand, my research showed that some people simply copy the API requests and extract the data from the responses. In my opinion this is a bad way to scrape LinkedIn, because you will get banned very quickly.

After that, I needed to decide on the tool, the framework, and the programming language. Previously I had used Go with Colly and got really good results when scraping Amazon. But this time I wanted to write JavaScript, so I decided to use Puppeteer. There are alternatives to Puppeteer, but I already knew how to use it, and that influenced my decision.

Started Scraping LinkedIn with Puppeteer

After making these decisions, I quickly installed Chromium and Puppeteer on my computer. (I am not sharing my own code in this article.)

My main goal was to find viral posts, including viral posts under specific hashtags, so that I could analyze them easily afterwards.

I created a list of users who are popular on LinkedIn. You can find plenty of such lists with a quick Google search.

When I tried to scrape the users and their posts, the first challenge appeared :)

LinkedIn Requires Login to Scrape User Posts

The solution was easy. I needed to log in and then set the session cookies in my Puppeteer script, and that is what I did.

I also wrote a login script so that I could log in automatically.
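The cookie step above can be sketched roughly as follows. This is a minimal illustration and not the script used for this project: the file name and the helper names saveCookies and restoreCookies are assumptions, and the only Puppeteer calls used are page.cookies() and page.setCookie(). Newer Puppeteer releases may prefer browser-level cookie methods, so check the documentation for your version.

```javascript
const fs = require('node:fs/promises');

// Hypothetical file name; any writable path works.
const COOKIE_FILE = 'linkedin-cookies.json';

// Save the cookies of an already logged-in Puppeteer page to disk.
// Returns the number of cookies written.
async function saveCookies(page, file = COOKIE_FILE) {
  const cookies = await page.cookies();
  await fs.writeFile(file, JSON.stringify(cookies, null, 2));
  return cookies.length;
}

// Load cookies from disk into a fresh page before navigating.
// Returns false when there is nothing to restore, which means
// a real login is needed first.
async function restoreCookies(page, file = COOKIE_FILE) {
  let raw;
  try {
    raw = await fs.readFile(file, 'utf8');
  } catch (err) {
    if (err.code === 'ENOENT') return false; // no cookie file yet
    throw err;
  }
  const cookies = JSON.parse(raw);
  if (cookies.length === 0) return false;
  await page.setCookie(...cookies);
  return true;
}
```

A typical flow would call restoreCookies(page) right after creating the page, fall back to the login script when it returns false, and call saveCookies(page) once the login has succeeded.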

Then I started collecting post content, likes, shares, and so on.

After 2 hours I realized I was making a mistake :( Long posts are truncated behind a “see all” link, so most of my crawled data contained only part of the content.

Then I improved my script so that it recursively clicked the necessary buttons to get the full content.
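One way to sketch this "keep clicking until nothing is left to expand" idea is shown below. It is an illustration under assumptions, not the original script: the function name expandAll and the selector passed to it are hypothetical (the real selector has to be found by inspecting the page), and the loop assumes that an expand button disappears once it has been clicked. The Puppeteer calls used are page.$$() and ElementHandle.click().

```javascript
// Click every element matching `selector` until none are left,
// or until `maxRounds` passes have been made (a safety cap in case
// the buttons never disappear). Returns the number of successful clicks.
async function expandAll(page, selector, maxRounds = 10) {
  let clicked = 0;
  for (let round = 0; round < maxRounds; round++) {
    const buttons = await page.$$(selector);
    if (buttons.length === 0) break; // nothing left to expand
    for (const button of buttons) {
      try {
        await button.click();
        clicked++;
      } catch (err) {
        // The element may have been detached or hidden meanwhile; skip it.
      }
    }
  }
  return clicked;
}
```

Calling it as expandAll(page, '.some-expand-button') before reading the post text would make sure the full content is in the DOM; the selector in this call is a placeholder.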

After that, I ran TRUNCATE on my table and started scraping again. At this point I needed a way to handle this kind of mistake, and updates in general, because I can’t scrape all the data from scratch every time.

So I found a solution (I don’t want to share the details because I am still using it :D). I found a post URL that I can open directly with a post’s unique identifier, so I can re-scrape a single post and pick up its updates easily.

The second and biggest challenge was that my account got suspended.

LinkedIn Banned the Scraping Account

When I started, I knew that one day my scraping accounts would be banned. This is the nature of web scraping, but we are here to find solutions for that.

My first idea was what most people do: open new accounts with a script and continue scraping. But LinkedIn has good protections here. During registration they send a confirmation email. That alone is not a big problem, because I can open the mailbox and copy the confirmation code. But LinkedIn also requires solving a captcha, and that was a problem for me :)

I know there are services that solve captchas, but I didn’t want to pay for them, and integrating one would have meant more development work for me.

So at this point, I needed to protect my scraping accounts from getting banned.

Random clicks and random waits applied


The main problem is that I really don’t know how LinkedIn decides to restrict accounts. Here, the only approach is trial and error :)

So first I added random delays before clicking anything on the page. After that, I also started waiting a random amount of time whenever I moved to another page, generally between 5 and 40 seconds.

This helped a little, but after a while I still got banned :D
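The random waits described above could look something like the sketch below. It is only an illustration: the helper names are made up, and the 0.5 to 3 second range before clicks is an assumption (the article only gives the 5 to 40 second range for page changes). The Puppeteer calls used are page.click() and page.goto().

```javascript
// Return a random integer in [min, max], both ends inclusive.
function randomInt(min, max) {
  return Math.floor(Math.random() * (max - min + 1)) + min;
}

// Resolve after `ms` milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Pause for a random time between minMs and maxMs; returns the time waited.
async function randomPause(minMs, maxMs) {
  const ms = randomInt(minMs, maxMs);
  await sleep(ms);
  return ms;
}

// Wait a short random time, then click. `page` is a Puppeteer Page.
// The default range is an assumption, not a value from the article.
async function humanClick(page, selector, minMs = 500, maxMs = 3000) {
  await randomPause(minMs, maxMs);
  await page.click(selector);
}

// Wait 5 to 40 seconds by default (the range mentioned above), then navigate.
async function humanGoto(page, url, minMs = 5000, maxMs = 40000) {
  await randomPause(minMs, maxMs);
  await page.goto(url, { waitUntil: 'domcontentloaded' });
}
```

Routing every click and every navigation through helpers like these keeps the delay policy in one place, which makes the trial-and-error tuning easier.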

While I was trying to solve these problems, a different problem appeared.

The Puppeteer script randomly clicks and redirects to another page

Sometimes the script accidentally opens a new window, follows a redirect, or clicks through to the messages page, and this breaks my scraping logic.

So I did some research and found a solution in Puppeteer. If you run into the same problem, look into Puppeteer’s request interception (page.setRequestInterception).
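As a rough illustration of how request interception can stop stray navigations, here is a minimal sketch. It is not the solution used in this project: the allow-list idea, the helper names, and the URL prefixes are assumptions. It relies on page.setRequestInterception(), the 'request' event, and the request methods isNavigationRequest(), frame(), url(), abort(), and continue(). It only covers navigations in the main frame; accidentally opened new windows would need separate handling.

```javascript
// Hypothetical allow-list check: a URL is allowed when it starts
// with one of the given prefixes.
function isAllowed(url, allowedPrefixes) {
  return allowedPrefixes.some((prefix) => url.startsWith(prefix));
}

// Abort main-frame navigations to URLs outside the allow-list and
// let every other request (images, scripts, XHR, iframes) through.
async function blockStrayNavigation(page, allowedPrefixes) {
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    const isMainFrameNavigation =
      request.isNavigationRequest() && request.frame() === page.mainFrame();
    if (isMainFrameNavigation && !isAllowed(request.url(), allowedPrefixes)) {
      request.abort();
    } else {
      request.continue();
    }
  });
}
```

With this in place, an accidental click that tries to move the main frame to a page outside the list is cancelled at the request level, so the scraper stays on the page it was working on. Once interception is enabled, every request must be either continued or aborted, which is why the handler always calls one of the two.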

That is all for this article. I will share other solutions in upcoming articles. This is a long story, and web scraping is not easy :)