Web scraping, data mining any website
In this article, I will show you how to data mine a website and export the data to a spreadsheet, using an actual project I took on as an example. I found the project on the freelancing website Upwork. The posting read as follows:
Record Store Day Website Scrape
We’re looking for a CSV file / Excel Spreadsheet of all participating stores on this website:
https://recordstoreday.com/Stores
Data Required:
Company Name
URL
Phone
Email
Address
Result
The result of my scraping solution is below.
How I did it
The solution basically does the following:
- Create a corpus of all the URLs that contain the company data.
- Scrape each URL in the corpus and store the extracted data in a file.
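Since I am not sharing the project's code, here is a minimal sketch of the last step, storing the scraped records in a CSV file with Python's standard `csv` module. The sample record and the `write_csv` helper are illustrative, not the actual project code:

```python
import csv

# Hypothetical scraped record -- in the real project these would come
# from parsing each store page in the corpus.
stores = [
    {"Company Name": "Example Records", "URL": "http://example.com",
     "Phone": "555-0100", "Email": "info@example.com",
     "Address": "1 Main St, Springfield"},
]

# The columns requested in the job posting.
FIELDS = ["Company Name", "URL", "Phone", "Email", "Address"]

def write_csv(rows, path):
    """Write the scraped store records to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

The resulting CSV opens directly in Excel, which satisfies the "CSV file / Excel Spreadsheet" requirement from the posting.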
Creating a corpus
Before we can start extracting data from webpages, we need to spider parts of the website to figure out which pages we need to scrape. This spidering gives us a set of URLs that we refer to as a corpus.
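As a rough sketch of the corpus-building step, the function below collects links to store detail pages from a listing page's HTML. The `/Stores/` link pattern is an assumption about the site's structure, made purely for illustration:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://recordstoreday.com"

def extract_store_links(listing_html):
    """Collect absolute URLs of store detail pages from a listing page.

    Assumes (hypothetically) that store detail pages live under
    the /Stores/ path -- the real selector would come from inspecting
    the site's markup.
    """
    soup = BeautifulSoup(listing_html, "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):
        if a["href"].startswith("/Stores/"):
            links.add(urljoin(BASE, a["href"]))
    return sorted(links)
```

Running this over every listing page and collecting the results into one set would give the corpus of URLs to scrape.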
Scraping
The website I needed to scrape rendered the information I was after in the browser using JavaScript, so a simple HTTP request for the HTML was not enough. This is where the Dryscrape library came in handy, as it renders the page just as a browser would. After that, we can parse the rendered body using BeautifulSoup.
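In outline, that render-then-parse step looks like the sketch below. `render_page` uses Dryscrape's documented Session API (and needs a working headless WebKit install, so the import is kept local); the class names in `parse_store` are hypothetical stand-ins for whatever the real page markup turns out to be:

```python
from bs4 import BeautifulSoup

def render_page(url):
    """Render a JavaScript-heavy page and return its HTML body."""
    import dryscrape  # requires a headless WebKit environment
    session = dryscrape.Session()
    session.visit(url)
    return session.body()

def parse_store(html):
    """Pull the required fields out of a rendered store page.

    The class names below are invented for illustration -- the real
    ones would come from inspecting the page in the browser's dev tools.
    """
    soup = BeautifulSoup(html, "html.parser")

    def text(cls):
        node = soup.find(class_=cls)
        return node.get_text(strip=True) if node else ""

    return {
        "Company Name": text("store-name"),
        "Phone": text("store-phone"),
        "Email": text("store-email"),
        "Address": text("store-address"),
    }
```

Keeping rendering and parsing in separate functions also makes the parser easy to test against saved HTML, without hitting the live site.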
Python3 and Libraries
I built the solution using the Python programming language and a few libraries:
- DryScrape
Documentation of this library can be found at https://dryscrape.readthedocs.io/
According to this documentation, DryScrape is a lightweight web scraping library for Python. It uses a headless WebKit instance to evaluate JavaScript on the visited pages. This enables painless scraping of plain web pages as well as JavaScript-heavy "Web 2.0" applications like Facebook.
- BeautifulSoup
Documentation of this library can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
According to its documentation, Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
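To give a quick taste of that API, here is a small, self-contained example of navigating and searching a parse tree. The HTML snippet is made up for illustration:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <div id="store">
    <h2>Example Records</h2>
    <a href="http://example.com">website</a>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigate: find the div by id, then step down to its <h2> child.
name = soup.find("div", id="store").h2.get_text(strip=True)

# Search: grab the first link and read its href attribute like a dict.
url = soup.find("a")["href"]
```

Here `name` is `"Example Records"` and `url` is `"http://example.com"`; the same `find`/attribute-access pattern extracts each field the job posting asked for.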
In-depth video explaining the code behind the solution
I will not share the code for this solution, because I do not want just anyone to run it and scrape the website in question. However, I did post a two-part video series explaining the code behind the solution in detail, so that fellow coders can see, understand, and possibly learn how to data mine a website in this way.
Part 1
Part 2
Need help?
If you need help mining for data, then contact me.