Web scraping and data mining any website
In this article, I will show you how you can data mine any website and export the data to a spreadsheet, using an actual project I took on as an example. I found the project on the freelancing website Upwork. The posting was as follows:
Record Store Day Website Scrape
We’re looking for a CSV file / Excel Spreadsheet of all participating stores on this website:
https://recordstoreday.com/Stores
Data Required:
Company Name
URL
Phone
Email
Address
Result
The result of my scraping solution is below.
How I did it
- Create a corpus of all the URLs that contain the company data
- Scrape all the URLs in the corpus and store the data in a file.
Creating a corpus
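The corpus-building code itself is not reproduced here, so below is a minimal sketch of what this step could look like using the dryscrape and BeautifulSoup libraries described further down. The build_corpus helper and the '/Venues/' href filter are assumptions made for illustration only; the real link pattern has to be confirmed by inspecting the rendered listing page.

```python
# Sketch: collect the URLs of all store pages linked from the listing page.
# The '/Venues/' href filter is a placeholder assumption, not the site's
# confirmed URL scheme.
import dryscrape
from bs4 import BeautifulSoup

BASE_URL = 'https://recordstoreday.com'

def build_corpus(listing_path='/Stores'):
    dryscrape.start_xvfb()                       # required on a headless Linux server
    session = dryscrape.Session(base_url=BASE_URL)
    session.visit(listing_path)                  # loads the page and runs its JavaScript
    soup = BeautifulSoup(session.body(), 'html.parser')

    corpus = set()
    for link in soup.find_all('a', href=True):
        href = link['href']
        if href.startswith('/Venues/'):          # placeholder pattern for store pages
            corpus.add(BASE_URL + href)
    return sorted(corpus)

if __name__ == '__main__':
    urls = build_corpus()
    with open('corpus.txt', 'w') as f:
        f.write('\n'.join(urls))
```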
Scraping
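Likewise, here is a hedged sketch of the scraping step: read every URL from the corpus file, render each store page, pull out the requested fields, and write them to a CSV with Python's csv module. The scrape_stores helper and all of the CSS selectors below are hypothetical placeholders, not the site's actual markup.

```python
# Sketch: visit every URL in the corpus and write the required fields to a CSV.
# All CSS selectors are hypothetical placeholders; the real ones come from
# inspecting an individual store page.
import csv
import dryscrape
from bs4 import BeautifulSoup

def text_or_blank(soup, selector):
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node else ''

def scrape_stores(corpus_file='corpus.txt', out_file='stores.csv'):
    dryscrape.start_xvfb()
    session = dryscrape.Session()
    fields = ['Company Name', 'URL', 'Phone', 'Email', 'Address']

    with open(corpus_file) as f, open(out_file, 'w', newline='') as out:
        writer = csv.DictWriter(out, fieldnames=fields)
        writer.writeheader()
        for url in (line.strip() for line in f if line.strip()):
            session.visit(url)
            soup = BeautifulSoup(session.body(), 'html.parser')
            writer.writerow({
                'Company Name': text_or_blank(soup, '.store-name'),    # placeholder
                'URL': text_or_blank(soup, '.store-website'),          # placeholder
                'Phone': text_or_blank(soup, '.store-phone'),          # placeholder
                'Email': text_or_blank(soup, '.store-email'),          # placeholder
                'Address': text_or_blank(soup, '.store-address'),      # placeholder
            })

if __name__ == '__main__':
    scrape_stores()
```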
Python3 and Libraries
- DryScrape
Documentation of this library can be found at https://dryscrape.readthedocs.io/
According to this documentation, DryScrape is a lightweight web scraping library for Python. It uses a headless Webkit instance to evaluate Javascript on the visited pages. This enables painless scraping of plain web pages as well as Javascript-heavy “Web 2.0” applications like Facebook.
- BeautifulSoup
Documentation of this library can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
According to its documentation, Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
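To show how the two libraries fit together (an illustrative snippet, not code from the original project): dryscrape loads the page and executes its JavaScript, and BeautifulSoup then parses the rendered HTML.

```python
# Minimal dryscrape + BeautifulSoup round trip: render a JavaScript-enabled
# page and parse the resulting HTML.
import dryscrape
from bs4 import BeautifulSoup

dryscrape.start_xvfb()                        # only needed on headless servers
session = dryscrape.Session()
session.visit('https://recordstoreday.com/Stores')

soup = BeautifulSoup(session.body(), 'html.parser')
print(soup.title.get_text(strip=True) if soup.title else 'no <title> found')
print(len(soup.find_all('a', href=True)), 'links found')
```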