
Web Scraping For Amazon Prices

Idea:
I wanted to build a basic web scraper that fetches the price of a given product on Amazon. Amazon has a Product Advertising API that lets you do this programmatically, but after watching a few videos on the subject, I wanted to try doing it this way instead, as a simple Node app.

What is a web scraper?
From Wikipedia:
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.[1] Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
full wiki reference on web scraping

Steps to solve this problem
1: Setup project
2: Manually inspect the page to see where the price is displayed. If it sits in an element with a class or id, note it. In this case, the price lives in an element with the id #priceblock_ourprice
3: Get the HTML of the page (using axios inside the getHTML() function)
4: Once we have the HTML, pull the price out of the page with cheerio in the getAmazonPrice() function

Node packages used
cheeriojs - Essentially jQuery for Node. Lets you easily pick elements out of a page (see the short sketch after this list)
axios - Promise based HTTP client for the browser and node.js
esm - ECMAScript module loader so we can use import
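To get a feel for cheerio before wiring the whole scraper together, here is a minimal standalone sketch. The hard-coded markup (and the CDN$ 449.99 price in it) is made up purely for illustration; it is just a simplified stand-in for what you find when you inspect the real product page.

cheerio-example.js (illustration only, not part of the project)
import cheerio from 'cheerio';

// Made-up, simplified markup containing the element we care about
const sampleHTML = `<span id="priceblock_ourprice" class="a-color-price">CDN$ 449.99</span>`;

const $ = cheerio.load(sampleHTML);              // load the markup, jQuery-style
console.log($('#priceblock_ourprice').text());   // -> CDN$ 449.99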

Setup
   mkdir simpleWebScraper
   cd simpleWebScraper
   npm init -f (-f accepts the defaults)

Install packages
   npm i cheerio axios esm
   npm i nodemon --save-dev

After you run npm init and install the packages, a package.json file is created for you. Once it exists, you can go into the scripts object and add a command to run the app. See line 8 of the package.json file below.

package.json
{
  "name": "amazon-web-scraper",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1",
    "dev": "nodemon -r esm index.js"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "axios": "^0.18.0",
    "cheerio": "^1.0.0-rc.3",
    "esm": "^3.2.22"
  },
  "devDependencies": {
    "nodemon": "^1.18.11"
  }
}

With the project skeleton set up, we can now add the following files (index.js and scrape.js).

index.js
import { getHTML, getAmazonPrice } from './scrape';

const productURL = `https://www.amazon.ca/Vitamix-Explorian-Professional-Grade-Low-Profile-Refurbished/dp/B07CXVSMZ4/ref=sr_1_5?keywords=vitamix&qid=1555870204&s=gateway&sr=8-5&th=1`;

async function scrapePage() {
  const html = await getHTML(productURL);
  const amazonPrice = await getAmazonPrice(html);
  console.log(`The price is ${amazonPrice}`);
}

scrapePage();
scrape.js
import axios from 'axios';
import cheerio from 'cheerio';

async function getHTML(productURL) {
  const { data: html } = await axios.get(productURL, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'
    }
  })
  .catch(function (error) {
    console.log(error);
  });
  return html;
}

async function getAmazonPrice(html) {
  const $ = cheerio.load(html);             // load the page markup, jQuery-style
  const span = $('#priceblock_ourprice');   // select the price element by its id
  return span.html();
}

export { getHTML, getAmazonPrice };

Finally, to run this app, simply go to your terminal, and enter:
npm run dev

Because I am using nodemon, any time you make a change and save a file, the app will automatically be re-run.
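Assuming the request succeeds, the terminal output should look something along these lines (the price shown here is just a placeholder; you will see whatever Amazon is listing at the time):

   The price is CDN$ 449.99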

Note that in scrape.js, lines 6-8, I had to pass headers. Without them, the request came back with a 503 error. Please see notes 2, 3, and 4 below.
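One rough edge worth being aware of: in getHTML() as written, a failed request is only logged by the .catch(), after which the destructuring of { data: html } is applied to undefined and throws. Below is a minimal sketch of a more defensive variant; the name getHTMLSafe and the exact error message are illustrative choices of mine, not something from the project above.

getHTMLSafe sketch (hypothetical alternative to getHTML)
async function getHTMLSafe(productURL) {
  try {
    const { data: html } = await axios.get(productURL, {
      headers: {
        // Same browser-like User-Agent as above, to avoid the 503 response
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'
      }
    });
    return html;
  } catch (error) {
    // Surface the HTTP status (e.g. 503) instead of silently returning undefined
    const status = error.response ? ` (status ${error.response.status})` : '';
    console.error(`Request failed${status}: ${error.message}`);
    throw error;
  }
}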
