In today's quickly evolving business world, staying competitive requires tracking price fluctuations regularly so that your strategies stay current. Manually gathering pricing information consumes time and effort you cannot afford when you plan to grow your business.
Competitor pricing data scraping is the ethical collection of real-time pricing information from target platforms such as websites, ecommerce stores, and marketplaces. The goal is to help businesses monitor and analyze the prices of similar products and services their competitors offer.
With this data, it becomes easier to make informed pricing decisions that maintain the right balance and boost profit margins. Combined with a dynamic pricing model, you also stand a better chance of beating the competition by delivering affordable services to your target customers.
Business dynamics continuously shift, and real-time data analysis keeps your strategy ahead of them. Here are some simple steps to scrape and compare prices for the same product across different platforms:
Here are the libraries we require for the scraping:
httpx: sends HTTP requests to the web pages and retrieves the responses as HTML.
parsel: parses the HTML and extracts data using CSS and XPath selectors.
asyncio: runs the scrapers asynchronously, which boosts the speed of web scraping.
loguru: handles monitoring and logging for the competitor price tracker.
Since asyncio comes pre-installed with Python, you only need to install the remaining libraries with this command:
pip install httpx parsel loguru
We will scrape data from three competitors, Walmart, Amazon, and BestBuy, to compare PlayStation 5 prices. The search keyword on each platform will be “PS5 Digital Edition.” Let us start by scraping the data from Walmart.
import urllib.parse
import asyncio
import json

from httpx import AsyncClient, Response
from parsel import Selector
from typing import Dict, List
from loguru import logger as log

# create an HTTP client with headers that look like a real web browser
client = AsyncClient(
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.35",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
    },
    follow_redirects=True,
    http2=True
)


async def scrape_walmart(search_query: str) -> List[Dict]:
    """scrape Walmart search pages"""

    def parse_walmart(response: Response) -> List[Dict]:
        """parse Walmart search pages"""
        selector = Selector(response.text)
        data = []
        # grab the first result box on the search page
        product_box = selector.xpath("//div[@data-testid='item-stack']/div[1]")
        link = product_box.xpath(".//a[@link-identifier]/@link-identifier").get()
        title = product_box.xpath(".//a[@link-identifier]/span/text()").get()
        price = product_box.xpath(".//div[@data-automation-id='product-price']/span/text()").get()
        # keep the digits after the "$" sign, dropping the trailing character
        price = float(price[price.find("$") + 1:-1]) if price else None
        rate = product_box.xpath(".//span[@data-testid='product-ratings']/@data-value").get()
        review_count = product_box.xpath(".//span[@data-testid='product-reviews']/@data-value").get()
        data.append({
            "link": "https://www.walmart.com/ip/" + link,
            "title": title,
            "price": price,
            "rate": float(rate) if rate else None,
            "review_count": int(review_count) if review_count else None
        })
        return data

    search_url = "https://www.walmart.com/search?q=" + urllib.parse.quote_plus(search_query) + "&sort=best_seller"
    response = await client.get(search_url)
    if response.status_code == 403:
        raise Exception("Walmart requests are blocked")
    data = parse_walmart(response)
    log.success(f"scraped {len(data)} products from Walmart")
    return data
async def run():
    data = await scrape_walmart(
        search_query="PS5 digital edition"
    )
    # print the data in JSON format
    print(json.dumps(data, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
In the above code, we defined two functions: scrape_walmart, which builds the search URL, requests it, and checks whether the request was blocked; and the nested parse_walmart, which extracts each product's link, title, price, rating, and review count from the HTML. Next, let's apply the same approach to Amazon search pages:
import urllib.parse
import asyncio
import json

from httpx import AsyncClient, Response
from parsel import Selector
from typing import Dict, List
from loguru import logger as log

# create an HTTP client with headers that look like a real web browser
client = AsyncClient(
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.35",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
    },
    follow_redirects=True,
    http2=True
)


async def scrape_amazon(search_query: str) -> List[Dict]:
    """scrape Amazon search pages"""

    def parse_amazon(response: Response) -> List[Dict]:
        """parse Amazon search pages"""
        selector = Selector(response.text)
        data = []
        # grab the first organic result on the search page
        product_box = selector.xpath("//div[contains(@class, 'search-results')]/div[@data-component-type='s-search-result']")
        # the product ID (ASIN) is embedded in the product URL after /dp/
        product_id = product_box.xpath(".//div[@data-cy='title-recipe']/h2/a[contains(@class, 'a-link-normal')]/@href").get().split("/dp/")[-1].split("/")[0]
        title = product_box.xpath(".//div[@data-cy='title-recipe']/h2/a/span/text()").get()
        price = product_box.xpath(".//span[@class='a-price']/span/text()").get()
        price = float(price.replace("$", "")) if price else None
        rate = product_box.xpath(".//span[contains(@aria-label, 'stars')]/@aria-label").re_first(r"(\d+\.*\d*) out")
        review_count = product_box.xpath(".//div[contains(@data-csa-c-content-id, 'ratings-count')]/span/@aria-label").get()
        data.append({
            "link": f"https://www.amazon.com/dp/{product_id}",
            "title": title,
            "price": price,
            "rate": float(rate) if rate else None,
            "review_count": int(review_count.replace(',', '')) if review_count else None,
        })
        return data

    search_url = "https://www.amazon.com/s?k=" + urllib.parse.quote_plus(search_query)
    response = await client.get(search_url)
    if response.status_code in (403, 503):
        raise Exception("Amazon requests are blocked")
    data = parse_amazon(response)
    log.success(f"scraped {len(data)} products from Amazon")
    return data
async def run():
    amazon_data = await scrape_amazon(
        search_query="PS5 digital edition"
    )
    # print the data in JSON format
    print(json.dumps(amazon_data, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    asyncio.run(run())

Finally, let's collect the same fields from BestBuy search pages:
import urllib.parse
import asyncio
import json

from httpx import AsyncClient, Response
from parsel import Selector
from typing import Dict, List
from loguru import logger as log

# create an HTTP client with headers that look like a real web browser
client = AsyncClient(
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.35",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
    },
    follow_redirects=True,
    http2=True
)


async def scrape_bestbuy(search_query: str) -> List[Dict]:
    """scrape BestBuy search pages"""

    def parse_bestbuy(response: Response) -> List[Dict]:
        """parse BestBuy search pages"""
        selector = Selector(response.text)
        data = []
        # grab the first SKU item on the search page
        product_box = selector.xpath("//ol[contains(@class, 'sku-item-list')]/li[@class='sku-item']")
        # the SKU ID is embedded in the product URL after ?skuId=
        product_id = product_box.xpath(".//h4[@class='sku-title']/a/@href").get().split("?skuId=")[-1]
        title = product_box.xpath(".//h4[@class='sku-title']/a/text()").get()
        price = product_box.xpath(".//div[contains(@class, 'priceView')]/span/text()").get()
        price = float(price.replace("$", "")) if price else None
        rate = product_box.xpath(".//div[contains(@class, 'ratings-reviews')]/p/text()").get()
        review_count = product_box.xpath(".//span[@class='c-reviews ']/text()").get()
        data.append({
            "link": f"https://www.bestbuy.com/site/{product_id}.p",
            "title": title,
            "price": price,
            # the rating text looks like "Rating 4.8 out of 5 stars"
            "rate": float(rate.split()[1]) if rate else None,
            # the review count text looks like "(769)"
            "review_count": int(review_count[1:-1].replace(",", "")) if review_count else None
        })
        return data

    search_url = "https://www.bestbuy.com/site/searchpage.jsp?st=" + urllib.parse.quote_plus(search_query)
    response = await client.get(search_url)
    if response.status_code == 403:
        raise Exception("BestBuy requests are blocked")
    data = parse_bestbuy(response)
    log.success(f"scraped {len(data)} products from BestBuy")
    return data
async def run():
    bestbuy_data = await scrape_bestbuy(
        search_query="PS5 digital edition"
    )
    # print the data in JSON format
    print(json.dumps(bestbuy_data, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    asyncio.run(run())
In this step, we will combine all the scraping logic into a single competitor price tracker:
async def track_competitor_prices(search_query: str):
    """scrape products from different competitors"""
    data = {}
    data["walmart"] = await scrape_walmart(search_query=search_query)
    data["amazon"] = await scrape_amazon(search_query=search_query)
    data["bestbuy"] = await scrape_bestbuy(search_query=search_query)

    product_count = sum(len(products) for products in data.values())
    log.success(f"successfully scraped {product_count} products")

    # save the results into a JSON file
    with open("data.json", "w", encoding="utf-8") as file:
        json.dump(data, file, indent=2, ensure_ascii=False)


async def run():
    await track_competitor_prices(
        search_query="PS5 digital edition"
    )


if __name__ == "__main__":
    asyncio.run(run())
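Since the three scrapers are independent of each other, you can also run them concurrently with asyncio.gather, which is where the asynchronous design pays off in speed. Here is a minimal sketch of that variation, reusing the scrape_* functions, logger, and imports from the scripts above:

async def track_competitor_prices_concurrently(search_query: str):
    """run all three scrapers at the same time and save the combined results"""
    walmart, amazon, bestbuy = await asyncio.gather(
        scrape_walmart(search_query=search_query),
        scrape_amazon(search_query=search_query),
        scrape_bestbuy(search_query=search_query),
    )
    data = {"walmart": walmart, "amazon": amazon, "bestbuy": bestbuy}
    log.success(f"successfully scraped {sum(len(p) for p in data.values())} products")
    # save the combined results into a JSON file
    with open("data.json", "w", encoding="utf-8") as file:
        json.dump(data, file, indent=2, ensure_ascii=False)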
All the results will be organized in a single JSON file:
{ "walmart": [ { "link": "https://www.walmart.com/ip/5113183757", "title": "Sony PlayStation 5 (PS5) Digital Console Slim", "price": 449.0, "rate": 4.6, "review_count": 369 } ], "amazon": [ { "link": "https://www.amazon.com/dp/B0CL5KNB9M", "title": "PlayStation®5 Digital Edition (slim)", "price": 449.0, "rate": 4.7, "review_count": 2521 } ], "bestbuy": [ { "link": "https://www.bestbuy.com/site/6566040.p", "title": "Sony - PlayStation 5 Slim Console Digital Edition - White", "price": 449.99, "rate": 4.8, "review_count": 769 } ] }
The scraped product data can now be analyzed for insights into competitors' performance. Here is a simple monitoring function to analyze the information:
def generate_insights(data):
    """analyze the data for insight values"""

    def calculate_average(lst):
        # average the non-None values, rounded to two decimals
        non_none_values = [value for value in lst if value is not None]
        return round(sum(non_none_values) / len(non_none_values), 2) if non_none_values else None

    # extract all products across competitors
    all_products = [product for products in data.values() for product in products]

    # calculate overall averages
    overall_average_price = calculate_average([product["price"] for product in all_products])
    overall_average_rate = calculate_average([product["rate"] for product in all_products])
    overall_average_review_count = calculate_average([product["review_count"] for product in all_products])

    # find the lowest priced, highest priced, highest rated, and most reviewed
    # products across all competitors (treating missing values as worst case)
    lowest_priced_product = min(all_products, key=lambda x: x["price"] if x["price"] is not None else float("inf"))
    highest_priced_product = max(all_products, key=lambda x: x["price"] if x["price"] is not None else float("-inf"))
    highest_rated_product = max(all_products, key=lambda x: x["rate"] if x["rate"] is not None else 0)
    highest_reviewed_product = max(all_products, key=lambda x: x["review_count"] if x["review_count"] is not None else 0)

    # map each product's domain name (e.g. "walmart") back to the retailer key
    website_names = {products[0]["link"].split(".")[1]: retailer for retailer, products in data.items()}

    insights = {
        "Overall Average Price": overall_average_price,
        "Overall Average Rate": overall_average_rate,
        "Overall Average Review Count": overall_average_review_count,
        "Lowest Priced Product": {
            "Product": lowest_priced_product,
            "Competitor": website_names.get(lowest_priced_product["link"].split(".")[1])
        },
        "Highest Priced Product": {
            "Product": highest_priced_product,
            "Competitor": website_names.get(highest_priced_product["link"].split(".")[1])
        },
        "Highest Rated Product": {
            "Product": highest_rated_product,
            "Competitor": website_names.get(highest_rated_product["link"].split(".")[1])
        },
        "Highest Reviewed Product": {
            "Product": highest_reviewed_product,
            "Competitor": website_names.get(highest_reviewed_product["link"].split(".")[1])
        }
    }

    # save the insights to a JSON file
    with open("insights.json", "w", encoding="utf-8") as json_file:
        json.dump(insights, json_file, indent=2, ensure_ascii=False)
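To run the analysis on results you scraped earlier, load data.json and pass it to the function. A minimal usage example:

import json

# load the previously scraped results and write insights.json
with open("data.json", "r", encoding="utf-8") as file:
    data = json.load(file)

generate_insights(data)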
We have introduced the generate_insights function, which calculates various metrics: the overall average price, rating, and review count, plus the lowest-priced, highest-priced, highest-rated, and most-reviewed products across all competitors.
The insight data below presents these statistics as plain numbers, making analysis easier. You can now compare product prices from various competitors:
{ "Overall Average Price": 449.33, "Overall Average Rate": 4.7, "Overall Average Review Count": 1219.67, "Lowest Priced Product": { "Product": { "link": "https://www.walmart.com/ip/5113183757", "title": "Sony PlayStation 5 (PS5) Digital Console Slim", "price": 449.0, "rate": 4.6, "review_count": 369 }, "Competitor": "walmart" }, "Highest Priced Product": { "Product": { "link": "https://www.bestbuy.com/site/6566040.p", "title": "Sony - PlayStation 5 Slim Console Digital Edition - White", "price": 449.99, "rate": 4.8, "review_count": 769 }, "Competitor": "bestbuy" }, "Highest Rated Product": { "Product": { "link": "https://www.bestbuy.com/site/6566040.p", "title": "Sony - PlayStation 5 Slim Console Digital Edition - White", "price": 449.99, "rate": 4.8, "review_count": 769 }, "Competitor": "bestbuy" }, "Highest Reviewed Product": { "Product": { "link": "https://www.amazon.com/dp/B0CL5KNB9M", "title": "PlayStation 5 Digital Edition (slim)", "price": 449.0, "rate": 4.7, "review_count": 2521 }, "Competitor": "amazon" } }
Competing in this dynamic market comes with hurdles that require advanced solutions to handle efficiently. Here are some common challenges you might face while extracting and analyzing competitor pricing data:
Prices on ecommerce websites change frequently based on stock levels, demand, and competitors' own pricing moves. This makes gathering information in real time, or even every few hours, technically challenging; one common remedy is to re-run the tracker on a fixed schedule, as sketched below.
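Here is a minimal scheduling sketch. The four-hour interval is an assumption to tune for how fast your market moves, and it reuses the track_competitor_prices function and loguru logger from the tutorial above:

import asyncio

SCRAPE_INTERVAL_HOURS = 4  # assumption: tune to how fast prices change in your market

async def run_tracker_forever():
    """re-run the competitor price tracker on a fixed schedule"""
    while True:
        try:
            await track_competitor_prices(search_query="PS5 digital edition")
        except Exception as exc:
            log.error(f"tracking run failed: {exc}")  # keep the loop alive on failures
        await asyncio.sleep(SCRAPE_INTERVAL_HOURS * 3600)

if __name__ == "__main__":
    asyncio.run(run_tracker_forever())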
Product pages on online stores carry much more than pricing, including product descriptions, reviews, related products, ratings, and more. This requires a capable scraper that can identify and extract only the specific data you need.
Many ecommerce sellers use dynamic pricing, where prices change based on browsing history, location, time of day, and market fluctuations. Accounting for these changes while keeping your own pricing model up to date and profitable is difficult.
Prices also vary by geographic location due to different tax rates, regional strategies, and shipping costs. This makes it essential to have scrapers that can simulate being in different locations using VPNs or proxy servers.
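For example, httpx can route requests through a proxy so pages are fetched as if from a specific region. This is only a sketch: the proxy URL below is a placeholder for your own provider's gateway, and depending on your httpx version the argument is proxy (newer releases) or proxies (older ones):

from httpx import AsyncClient

# placeholder credentials and endpoint; substitute your proxy provider's gateway
US_PROXY = "http://username:password@us.proxy.example.com:8000"

geo_client = AsyncClient(
    proxy=US_PROXY,  # on older httpx releases, use proxies=US_PROXY instead
    headers={"Accept-Language": "en-US,en;q=0.9"},
    follow_redirects=True,
)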
Ecommerce platforms often sell the same product in different variations, such as size, color, packaging, or seller, each with its own price. It is critical that your competitor data scraping tool captures the correct variant so that you compare prices like for like; a simple filter such as the sketch below can help.
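One naive, illustrative way to guard against mixing variants is to filter scraped titles for required and forbidden keywords before comparing prices. The keyword lists here are assumptions chosen for the PS5 Digital Edition example, and products stands for a list of product dicts like the ones scraped above:

def matches_variant(title, required_terms=("digital",), excluded_terms=("disc",)):
    """naive variant filter: keep titles containing every required term
    and none of the excluded terms (case-insensitive)"""
    lowered = title.lower()
    return all(term in lowered for term in required_terms) and not any(
        term in lowered for term in excluded_terms
    )

# example: drop disc-edition listings before comparing prices
filtered = [p for p in products if p["title"] and matches_variant(p["title"])]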
Competitor price data scraping has become essential for businesses looking to outpace the competition and earn better returns. Here are some reasons to invest in scraping pricing data from your competitors:
Knowing competitor prices lets you set a pricing model that attracts your target audience's attention. If a competitor constantly offers deals, your business can seize the opportunity to provide attractive discounts.
Gathering pricing information over a period of time helps you identify common patterns and seasonal changes. Analyzed with professional scraping tools, this historical data helps business owners anticipate seasonal shifts and adjust pricing accordingly.
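For instance, you could append each scraping run to a timestamped history file and study the trend later. A minimal sketch, assuming data has the same {retailer: [products]} shape as data.json above and the price_history.csv filename is an arbitrary choice:

import csv
from datetime import datetime, timezone

def append_price_snapshot(data, path="price_history.csv"):
    """append one timestamped row per product for later trend analysis"""
    timestamp = datetime.now(timezone.utc).isoformat()
    with open(path, "a", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        for retailer, products in data.items():
            for product in products:
                writer.writerow([timestamp, retailer, product["title"], product["price"]])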
Monitoring competitors' stock levels helps you set the right prices across your own inventory. Maintaining the right balance of popular and less in-demand products becomes effortless while you deliver the best customer service.
We have shared essential insights into scraping and analyzing competitor pricing data with professional help. At Scraping Intelligence, you can access advanced technologies and the latest strategies to gather up-to-date information from your competitors.
Web scraping is a powerful solution for performing competitive analysis, extracting valuable and current information from target websites. We respect data privacy and terms of service to uphold our ethical responsibilities while extracting information.