
How to web scrape and data mine any website to Excel

Web scraping, data mining any website

In this article, I will show you how you can data mine any website and export the data to a spreadsheet. I will do so by showing you an example of an actual project I took on. I found the project on the freelancing website UpWork. The posting was as follows:

Posting on UpWork

Record Store Day Website Scrape

We’re looking for a CSV file / Excel Spreadsheet of all participating stores on this website:
https://recordstoreday.com/Stores

Data Required:
Company Name
URL
Phone
Email
Address

Result

The result of my scraping solution is below.

Result of web scraping

How I did it

The solution does two things:

  1. Create a corpus of all the URLs that contain the company data
  2. Scrape all the URLs in the corpus and store the data in a file.
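
The two steps above can be sketched as a small pipeline. This is a minimal sketch and not the code from the project: `build_corpus` and `scrape_page` are hypothetical placeholders that return canned data, the example.com URLs are made up, and only the column names come from the job posting.

```python
import csv
import io

# Column names taken from the job posting.
FIELDS = ["Company Name", "URL", "Phone", "Email", "Address"]


def build_corpus():
    """Step 1 (placeholder): return the URLs of the pages that hold the data."""
    return ["https://example.com/store/1", "https://example.com/store/2"]


def scrape_page(url):
    """Step 2 (placeholder): return one row of data for a single page."""
    return {
        "Company Name": "Store at " + url,
        "URL": url,
        "Phone": "",
        "Email": "",
        "Address": "",
    }


def export_csv(rows, out):
    """Write the scraped rows to a file-like object as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def run(out):
    """Scrape every URL in the corpus and export the rows."""
    rows = [scrape_page(url) for url in build_corpus()]
    export_csv(rows, out)
    return rows


# An in-memory buffer keeps the sketch side-effect free; to write a real
# file, pass open("stores.csv", "w", newline="") instead.
buffer = io.StringIO()
rows = run(buffer)
csv_text = buffer.getvalue()
```

A CSV file like this opens directly in Excel, which is why the sketch stops at CSV rather than writing a native spreadsheet format.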

Creating a corpus

Before we can start extracting data from web pages, we need to spider parts of the website to figure out which pages we need to scrape. This spidering gives us a set of URLs that we refer to as a corpus.
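
As a rough illustration of building a corpus, the sketch below collects links from a listing page and keeps only those that point to detail pages. Everything specific here is an assumption made for the example: the HTML snippet, the `/store/` path prefix, the example.com base URL, and the `links_to_corpus` helper. It uses only Python's standard library, and a real site will need its own filtering rule.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical listing page; a real site will use different markup.
LISTING_HTML = """
<ul>
  <li><a href="/store/101">Vinyl Vault</a></li>
  <li><a href="/store/102">Groove House</a></li>
  <li><a href="/about">About us</a></li>
  <li><a href="/store/101">Vinyl Vault (duplicate link)</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag encountered while parsing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_to_corpus(html, base_url, path_prefix):
    """Return absolute, de-duplicated URLs whose path starts with path_prefix."""
    collector = LinkCollector()
    collector.feed(html)
    corpus = []
    for href in collector.hrefs:
        if href.startswith(path_prefix):
            url = urljoin(base_url, href)
            if url not in corpus:  # keep first occurrence, preserve order
                corpus.append(url)
    return corpus


corpus = links_to_corpus(LISTING_HTML, "https://example.com/", "/store/")
```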

Scraping

The website I needed to scrape rendered the information I was after in the browser using JavaScript. Therefore, a simple HTTP request for the HTML was not enough. This is where the Dryscrape library came in handy, as it renders the page as if it were in a browser. After that, we can parse the rendered body using BeautifulSoup.
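
The render-then-parse idea might look roughly like the sketch below. It is not the project's code. The markup, the CSS class names, and the store details are invented for illustration, and `fetch_rendered_html` and `parse_store` are hypothetical helpers. The Dryscrape part is never called here, so the parsing part runs without a WebKit installation.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url):
    """Load a page in Dryscrape's headless WebKit and return the rendered HTML.

    Not called in this sketch; it requires dryscrape to be installed.
    """
    import dryscrape  # imported lazily so the parsing below runs without it

    session = dryscrape.Session()
    session.visit(url)
    return session.body()


# Hypothetical rendered markup; the real page structure will differ.
RENDERED_HTML = """
<div class="store">
  <h2 class="name">Vinyl Vault</h2>
  <a class="website" href="https://vinylvault.example.com">Website</a>
  <span class="phone">555-0100</span>
  <a class="email" href="mailto:info@vinylvault.example.com">Email</a>
  <p class="address">1 Main Street, Springfield</p>
</div>
"""


def text_of(node):
    """Return the stripped text of a tag, or an empty string if it is missing."""
    return node.get_text(strip=True) if node else ""


def parse_store(html):
    """Extract the five requested fields from one rendered store page."""
    soup = BeautifulSoup(html, "html.parser")
    website = soup.select_one("a.website")
    email = soup.select_one("a.email")
    return {
        "Company Name": text_of(soup.select_one(".name")),
        "URL": website["href"] if website else "",
        "Phone": text_of(soup.select_one(".phone")),
        "Email": email["href"].replace("mailto:", "", 1) if email else "",
        "Address": text_of(soup.select_one(".address")),
    }


store = parse_store(RENDERED_HTML)
```

Returning empty strings for missing elements keeps every row the same shape, which makes the later export to a spreadsheet simpler.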

Python3 and Libraries

I built the solution using the Python programming language and a few libraries:
  • DryScrape
    Documentation of this library can be found at https://dryscrape.readthedocs.io/
    According to this documentation, DryScrape is a lightweight web scraping library for Python. It uses a headless Webkit instance to evaluate Javascript on the visited pages. This enables painless scraping of plain web pages as well as Javascript-heavy “Web 2.0” applications like Facebook.
  • BeautifulSoup
    Documentation of this library can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
    According to its documentation, Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

In-depth video explaining the code behind the solution

I will not share the code for this solution, because I do not want just anyone to run it and scrape the website in question. However, I did post a two-part video series explaining the code behind the solution in detail, so that fellow coders can see, understand, and possibly learn how to data mine a website in this way.

Part 1

Part 2

Need help?

If you need help mining for data, then contact me.