
Web Scraping For Amazon Prices

Idea:
I wanted to build a basic web scraper that fetches the price of a given product on Amazon. Amazon has a Product Advertising API that lets you do this programmatically, but after watching a few videos on the subject, I wanted to try doing it this way instead, as a simple Node app.

What is a web scraper?
From Wikipedia:
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.[1] Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
full wiki reference on web scraping

Steps to solve this problem
1: Setup project
2: Manually inspect the page to see where the price is displayed. If it sits in an element with a class or id, note it. In this case, the price lives in an element with the id #priceblock_ourprice
3: Get the HTML of the page (using axios inside the getHTML() function)
4: Once we have the HTML, pull the price out of the page with cheerio in the getAmazonPrice() function

Node packages used
cheeriojs - Essentially jQuery for Node. Lets you easily pick elements out of a page (see the short sketch after this list)
axios - Promise based HTTP client for the browser and node.js
esm - ECMAScript module loader so we can use import
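To get a feel for cheerio before wiring the whole scraper together, here is a minimal standalone sketch. The hard-coded markup (and the CDN$ 449.99 price in it) is made up purely for illustration; it is just a simplified stand-in for what you find when you inspect the real product page.

cheerio-example.js (illustration only, not part of the project)
import cheerio from 'cheerio';

// Made-up, simplified markup containing the element we care about
const sampleHTML = `<span id="priceblock_ourprice" class="a-color-price">CDN$ 449.99</span>`;

const $ = cheerio.load(sampleHTML);              // load the markup, jQuery-style
console.log($('#priceblock_ourprice').text());   // -> CDN$ 449.99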

Setup
   mkdir simpleWebScraper
   cd simpleWebScraper
   npm init -f (-f accepts the defaults)

Install packages
   npm i cheerio axios esm
   npm i nodemon --save-dev

After you run npm init and install the packages, a package.json file is created for you. Once it exists, you can go into the scripts object and add a command to run the app. See line 8 of the package.json file below.

package.json
{
  "name": "amazon-web-scraper",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1",
    "dev": "nodemon -r esm index.js"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "axios": "^0.18.0",
    "cheerio": "^1.0.0-rc.3",
    "esm": "^3.2.22"
  },
  "devDependencies": {
    "nodemon": "^1.18.11"
  }
}

With the project skeleton set up, we can now add the following files (index.js and scrape.js).

index.js
import { getHTML, getAmazonPrice } from './scrape';

const productURL = `https://www.amazon.ca/Vitamix-Explorian-Professional-Grade-Low-Profile-Refurbished/dp/B07CXVSMZ4/ref=sr_1_5?keywords=vitamix&qid=1555870204&s=gateway&sr=8-5&th=1`;

async function scrapePage() {
  const html = await getHTML(productURL);
  const amazonPrice = await getAmazonPrice(html);
  console.log(`The price is ${amazonPrice}`);
}

scrapePage();
scrape.js
import axios from 'axios';
import cheerio from 'cheerio';

async function getHTML(productURL) {
  const { data: html } = await axios.get(productURL, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'
    }
  })
  .catch(function (error) {
    console.log(error);
  });
  return html;
}

async function getAmazonPrice(html) {
  const $ = cheerio.load(html);             // load the page markup, jQuery-style
  const span = $('#priceblock_ourprice');   // select the price element by its id
  return span.html();
}

export { getHTML, getAmazonPrice };

Finally, to run this app, simply go to your terminal, and enter:
npm run dev

Because I am using nodemon, any time you make a change and save a file, the app will automatically be re-run.
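Assuming the request succeeds, the terminal output should look something along these lines (the price shown here is just a placeholder; you will see whatever Amazon is listing at the time):

   The price is CDN$ 449.99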

Note that in scrape.js, lines 6-8, I had to pass headers. Without them, the request came back with a 503 error. Please see notes 2, 3, and 4 below.
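One rough edge worth being aware of: in getHTML() as written, a failed request is only logged by the .catch(), after which the destructuring of { data: html } is applied to undefined and throws. Below is a minimal sketch of a more defensive variant; the name getHTMLSafe and the exact error message are illustrative choices of mine, not something from the project above.

getHTMLSafe sketch (hypothetical alternative to getHTML)
async function getHTMLSafe(productURL) {
  try {
    const { data: html } = await axios.get(productURL, {
      headers: {
        // Same browser-like User-Agent as above, to avoid the 503 response
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'
      }
    });
    return html;
  } catch (error) {
    // Surface the HTTP status (e.g. 503) instead of silently returning undefined
    const status = error.response ? ` (status ${error.response.status})` : '';
    console.error(`Request failed${status}: ${error.message}`);
    throw error;
  }
}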
