If you are coming over from the article How To Scrape The Web With NodeJs because you need to emulate a human using a browser, you are in the right place!
It’s quite common for websites to have anti-bot measures in place. These can be as simple as IP rate limiting, or as involved as inspecting each request to determine whether it was generated by a program or a real web browser. And while it’s true you can mimic browser requests in pretty much any programming language on the market today, sometimes it’s best to just use what works: a real web browser.
When automating a browser, the libraries used are language dependent. In the case of NodeJs, the most popular browser automation libraries are:
- Puppeteer
- Selenium
First, we’ll cover Puppeteer: how to set up your project, all the way to loading your first page in Chrome with code, and every step in between.
Puppeteer
First things first, you’re going to want to open a command prompt, navigate to the directory you want your source code to be in, and type in or copy:
npm init -y
This will initialize a new node project in that directory.
The next command you’ll want to enter is:
npm install puppeteer
This will download Puppeteer along with a compatible Chromium browser and make both accessible within your project.
Next you’ll create a file. Name it whatever you want; ours will be called index.js to align with convention. We’ll post the full file content here and then break it down line by line below.
(async () => {
    const puppeteer = require('puppeteer');
    const browser = await puppeteer.launch({
        headless: false,
        args: [
            '--user-agent=Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
        ]
    });
    const page = await browser.newPage();
    const pageResponse = await page.goto('https://httpbin.org/user-agent', {
        timeout: 35000,
        waitUntil: 'networkidle0',
    });
    if (pageResponse.status() !== 200) {
        console.error('there was an error scraping your url');
        await page.close();
        await browser.close();
        return;
    }
    let content = await page.content();
    await page.close();
    await browser.close();
    // process data
})();
The first thing we’ll add to our file is an async wrapper, because Puppeteer’s methods are asynchronous and we’ll need await to call them.
(async () => {
})();
Next we’ll import the Puppeteer library.
const puppeteer = require('puppeteer');
Now it’s time to initialize the browser. There are a few key things to note. The first is that we are setting the headless flag to false. This flag determines whether the Chrome browser actually renders to your screen: when headless is set to true, the browser will NOT render its graphical interface, but it will still behave as though it had. Right now, for debugging and visualization purposes, we are having the browser render itself. If you were trying to run this code on an OS that doesn’t render a desktop, like Ubuntu Server, you would need to set headless to true.
The next important part is the --user-agent entry in the args array. We are manually setting the value that will be sent as the User-Agent header with every request the browser makes. We do this because the default Chromium user agent is Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/97.0.4691.0 Safari/537.36, which is easily detected by anti-bot measures. So we set it to a legitimate user agent, and that’s one more thing that makes this request look like it came from a real person on a real computer. If you want to use your own browser’s user agent, you can load this url (https://httpbin.org/user-agent) in your browser and it will return exactly what you’re looking for.
const browser = await puppeteer.launch({
    headless: false,
    args: [
        '--user-agent=Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    ]
});
Now we have to tell the browser to open a new page and give us access to it (we don’t get access to the default opened page unfortunately). We do this with this line:
const page = await browser.newPage();
Finally, time to actually make the browser go somewhere. Using page.goto we can navigate to our desired url. We can also pass in several other parameters, like a timeout period (in ms) before the browser throws an error and says the page is unreachable. We can also pass in an identifier that Puppeteer uses to determine when the page is done loading; in this case we use networkidle0, which means navigation is considered finished once there have been no network connections for at least 500 ms. You can read more about the other options for waitUntil here: puppeteer github documentation
const pageResponse = await page.goto('https://httpbin.org/user-agent', {
    timeout: 35000,
    waitUntil: 'networkidle0',
});
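Under the hood, a navigation timeout like this is conceptually just a race between the navigation promise and a timer. Here’s a minimal sketch of the idea in plain Node (this withTimeout helper is our own illustration, not part of Puppeteer’s API):

```javascript
// Reject if `promise` doesn't settle within `ms` milliseconds -- the same
// idea Puppeteer's `timeout` option implements for page.goto.
function withTimeout(promise, ms) {
    let timer;
    const deadline = new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error('timed out')), ms);
    });
    // Whichever settles first wins; always clear the timer afterwards.
    return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// A promise that resolves in 10 ms easily beats a 100 ms deadline.
withTimeout(new Promise(resolve => setTimeout(() => resolve('ok'), 10)), 100)
    .then(value => console.log(value)); // "ok"
```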
The page.goto call makes the browser navigate to the page and returns a response object whose status() method gives the HTTP status code, which indicates success or failure. Before accessing the content, check this status to avoid unintended errors.
if (pageResponse.status() !== 200) {
    console.error('there was an error scraping your url');
    await page.close();
    await browser.close();
    return;
}
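Note that checking for exactly 200 treats every other 2xx code as a failure. If you want to be a bit more forgiving, a tiny helper like this (our own addition, not part of Puppeteer) accepts the whole 2xx range:

```javascript
// Treat any 2xx HTTP status code as success.
function isSuccess(status) {
    return status >= 200 && status < 300;
}

console.log(isSuccess(200)); // true
console.log(isSuccess(204)); // true
console.log(isSuccess(404)); // false
```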
After you verify success, that’s it. All that’s left to do is grab the content and process your data. Don’t forget to close the page and the browser when you’re done!
let content = await page.content();
await page.close();
await browser.close();
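As a concrete example of the “process data” step: https://httpbin.org/user-agent responds with a small JSON body containing a single user-agent key, and when Chrome renders a raw JSON response, page.content() typically returns it wrapped in HTML with the body inside a pre element. A sketch of pulling the value back out (the exact wrapper markup is an assumption about how Chrome displays raw JSON):

```javascript
// Extract the user-agent value from the HTML wrapper Chrome puts around
// a raw JSON response (the body usually lands inside a <pre> element).
function extractUserAgent(html) {
    const match = html.match(/<pre[^>]*>([\s\S]*?)<\/pre>/);
    if (!match) return null;
    return JSON.parse(match[1])['user-agent'];
}

const sample = '<html><body><pre>{"user-agent": "Mozilla/5.0 Test"}</pre></body></html>';
console.log(extractUserAgent(sample)); // "Mozilla/5.0 Test"
```

If the override from the launch arguments worked, the value you extract here should match the user agent you passed in, not a HeadlessChrome string.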
Selenium
Now it’s time to look at Selenium. The setup for this library is a little more cumbersome; unfortunately, it’s not as simple as installing an npm package alone. There are still a couple of npm packages to install, though, and we’ll start there.
First is Selenium itself:
npm install selenium-webdriver
Next is the chromedriver package:
npm install chromedriver
Now for the task of setting up your system to actually be able to use ChromeDriver. This can be a somewhat involved task, so in this article we’ll highlight the important parts, but if you need a more in-depth tutorial on how to get set up, we have an article for that here.
- Download the version of the driver compatible with the version of Chrome you have installed.
- Extract the downloaded file to an easy-to-find location on your hard drive.
- Add the folder with the extracted binary to your system path.
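On macOS or Linux, the last step can be done from the shell. The /opt/chromedriver path below is a hypothetical install location; substitute wherever you actually extracted the binary (on Windows, use the Environment Variables dialog instead):

```shell
# Append the folder containing the chromedriver binary to PATH.
# (Add this line to ~/.bashrc or ~/.zshrc to make it permanent.)
export PATH="$PATH:/opt/chromedriver"

# Confirm the folder is now on the path.
echo "$PATH" | grep -q "/opt/chromedriver" && echo "chromedriver folder is on PATH"
```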
Now that your system is set up (if you’ve done everything correctly), you’re ready to start coding. Like before, we’ll post the full file first and then break down the lines.
(async () => {
    const chrome = require('selenium-webdriver/chrome');
    const {Builder} = require('selenium-webdriver');
    let driver = new Builder()
        .forBrowser('chrome')
        .setChromeOptions(new chrome.Options().headless()
            .addArguments('user-agent=Mozilla/5.0 (Windows Phone 10.0; Android 4.2.1; Microsoft; Lumia 640 XL LTE) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Mobile Safari/537.36 Edge/12.10166')
        )
        .build();
    await driver.get('https://httpbin.org/user-agent');
    console.log(await driver.getPageSource());
    await driver.quit();
})();
Same as last time we surround it in an anonymous async function.
(async () => {
})();
Next we have our two library imports that allow this whole setup to work.
const chrome = require('selenium-webdriver/chrome');
const {Builder} = require('selenium-webdriver');
Initializing the browser is done via a completely different syntax. All of the arguments are passed in via the setChromeOptions function, which takes a chrome.Options object. With these arguments we can accomplish the same things we did with Puppeteer: they set the browser to operate in headless mode and set the user agent to something that looks like it’s coming from a real user on a real browser. The default user agent for Selenium in headless mode is Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/101.0.4951.54 Safari/537.36, which would be easily detected.
let driver = new Builder()
    .forBrowser('chrome')
    .setChromeOptions(new chrome.Options().headless()
        .addArguments('user-agent=Mozilla/5.0 (Windows Phone 10.0; Android 4.2.1; Microsoft; Lumia 640 XL LTE) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Mobile Safari/537.36 Edge/12.10166')
    )
    .build();
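A quick way to sanity-check a user-agent string before you use it is to look for the telltale HeadlessChrome marker. This toy check is our own illustration, not part of Selenium:

```javascript
// Returns true if a user-agent string would give away a headless browser.
function looksHeadless(userAgent) {
    return userAgent.includes('HeadlessChrome');
}

console.log(looksHeadless(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/101.0.4951.54 Safari/537.36'
)); // true
console.log(looksHeadless(
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
)); // false
```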
The standard practice when using Puppeteer is to open a new tab, as it’s the easiest way to get access to a browser context for manipulation. In Selenium, there’s no need to do that: once you’ve initialized the driver, you’re immediately given that context.
So to navigate to your target site, you call driver.get like so.
await driver.get('https://httpbin.org/user-agent');
Once the get method returns, you’ll have access to the content via getPageSource, which in our script’s case we simply log to the console output.
console.log(await driver.getPageSource());
Then, to clean everything up once you’re done parsing the page source, don’t forget to close down Chrome.
await driver.quit();
That’s it! That should be enough information to get you up and running with either Puppeteer or Selenium. Even though this tutorial only covered enough to get you set up and didn’t go into depth, we think it’s worth offering our 2 cents on which library is better. In our opinion, Puppeteer offers a more robust library for controlling the browser and is far easier to get set up, and is thus the superior choice here. But of course there are other factors that can influence your final choice of library.