Python Web Scraping Tutorial using BeautifulSoup & Scrapy
- Introduction to web scraping
- Important tools & Python libraries required for the tutorial
- Parsing a page with BeautifulSoup
- A complete example using BeautifulSoup
- Example using Scrapy
Meaningful data is what we all want for our data science projects! Often, though, the data is not available as a CSV file or in your data mart, and the only way to get at it is through HTML pages on the Internet. In these situations we usually turn to a technique called web scraping to get the data from the HTML pages into the format we need for our analysis.
A lot of valuable information is freely available online, but not necessarily in a nice structured format. Web scraping is the process of collecting more or less structured data from the Internet. It typically involves:
- Crawling from page to page, following links (think Google crawler)
- Extracting information from the resulting downloads, such as web pages, PDF documents, Word documents, etc.
What are the other ways to extract information from the web?
- Web API – The preferred way to get the information you need from your data providers. Many technology-driven organisations provide data APIs so that you can access their data in a more structured manner.
- RSS feeds – RSS stands for Really Simple Syndication or Rich Site Summary, but it is often referred to simply as a feed or news feed. RSS lets publishers syndicate their content automatically. A feed is a structured XML document that includes full or summarized text along with other metadata such as the publication date and author name.
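To illustrate why a feed is easier to consume than raw HTML, here is a minimal sketch that reads an RSS 2.0 document with Python's standard library. The feed content and the names SAMPLE_RSS and parse_rss_items are made up for this example; a real feed would be downloaded first and may contain extra fields or namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, inlined so the sketch runs without network access.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 01 Jan 2024 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
      <pubDate>Tue, 02 Jan 2024 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict per <item> element found in the feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
        })
    return items


for entry in parse_rss_items(SAMPLE_RSS):
    print(entry["published"], entry["title"], entry["link"])
```

Because the feed is already structured, no layout-specific selectors are needed, which is why a feed or an API is usually worth checking for before scraping.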
Even when an API does exist, its request volume and rate limits, the types of data, or the data format it provides might be insufficient for your purposes. This is where web scraping steps in. With few exceptions, if you can view it in your browser, you can access it via a Python script. If you can access it in a script, you can store it in a database. And if you can store it in a database, you can do virtually anything with that data.
Web scraping is the practice of using a computer program to grab information from web pages (mostly HTML) and gather the data you need in the format most useful to you, while preserving the structure of the data.
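As a small illustration of the "store it in a database" step, the sketch below writes a few rows into an in-memory SQLite database using Python's built-in sqlite3 module. The rows, the products table, and the store_rows helper are hypothetical stand-ins for whatever your scraper actually collects.

```python
import sqlite3

# Hypothetical rows, standing in for data a scraper has already extracted.
SCRAPED_ROWS = [("Blue Widget", 19.99), ("Red Widget", 5.00)]


def store_rows(conn, rows):
    """Create the table if needed and insert the scraped rows."""
    conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)")
    conn.executemany("INSERT INTO products (name, price) VALUES (?, ?)", rows)
    conn.commit()


# ":memory:" keeps the database in RAM; pass a file path to persist it instead.
conn = sqlite3.connect(":memory:")
store_rows(conn, SCRAPED_ROWS)

# Once stored, the data can be queried like any other table.
cheapest = conn.execute(
    "SELECT name, price FROM products ORDER BY price LIMIT 1"
).fetchone()
print(cheapest)
```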
A few scraping rules
We are all set to start web scraping. But first, a few rules you should keep in mind when running any scraping utility.
- Check a website's terms and conditions before you scrape it. It's their data, and they likely have rules governing its use. These rules also differ by country and by type of organisation, so read the terms and conditions first.
- Be nice – your script will send web requests much faster than a human can. Space out your requests a bit so that you don't hammer the website's server. Webmasters can detect aggressive scrapers and block the traffic they send.
- Scrapers break – websites change their layout all the time. When that happens, you need to rewrite your code, so be prepared. Expect this more often if the site is dynamic in nature.
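The "be nice" rule can be sketched with the standard library alone. The example below is one possible approach, not the only one: it consults a site's robots.txt through urllib.robotparser and pauses between the URLs it is allowed to fetch. The robots.txt content, the user agent "my-scraper", and the helper names are assumptions made for illustration. Note that robots.txt is a separate convention from a site's terms and conditions, which you still need to read yourself.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_robot_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_urls(urls, parser, user_agent="my-scraper",
                default_delay=1.0, sleep=time.sleep):
    """Yield only the URLs robots.txt allows, pausing between them."""
    # Use the site's Crawl-delay if it declares one, else a default pause.
    delay = parser.crawl_delay(user_agent) or default_delay
    first = True
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # skip paths the site asks crawlers to avoid
        if not first:
            sleep(delay)  # space out requests so the server is not hammered
        first = False
        yield url
```

In a real scraper you would fetch each yielded URL inside the loop, and you could load the live file with the parser's set_url() and read() methods instead of the inlined text.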
Let’s Get Started!
Important tools & Python libraries required for the tutorial:
There are a lot of tools and libraries for web scraping with Python; I personally prefer BeautifulSoup and Scrapy. It is also worth learning XPath, CSS selectors (cssselect), and regular expressions. Below is a list of tools and libraries that are useful for web scraping.
- Scrapy is an aptly named library for creating spider bots to systematically crawl the web and extract structured data like prices, contact info, and URLs. Originally designed for web scraping, Scrapy can also extract data from APIs.
- BeautifulSoup. It can be slow, but this XML and HTML parsing library is very useful for beginners.
- Beautiful Soup (BS4) is a parsing library that can use different parsers. A parser is simply a program that can extract data from HTML and XML documents.
- Beautiful Soup can use the parser from Python’s standard library (html.parser). It’s flexible and forgiving, but a little slow. The good news is that you can swap it out for a faster one, such as lxml, if you need the speed.
- One advantage of BS4 is its ability to automatically detect encodings. This allows it to gracefully handle HTML documents with special characters.
- In addition, BS4 can help you navigate a parsed document and find what you need. This makes it quick and painless to build common applications. For example, finding all the links on a web page takes only a few lines.
- Requests. The most famous HTTP library, written by Kenneth Reitz. It’s a must-have for every Python developer.
- Lxml is a high-performance, production-quality HTML and XML parsing library. Among all the Python web scraping libraries, we’ve enjoyed using lxml the most. It’s straightforward, fast, and feature-rich.
- Selenium is a tool that automates browsers through its WebDriver API. With it, you can actually open a Google Chrome window, visit a site, and click on links. Pretty cool, right? It also comes with Python bindings for controlling it right from your application. This makes it a breeze to integrate with your chosen parsing library.
Another important concept you may need to practise to get your work done is regular expressions in Python.
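As a quick taste of regular expressions in a scraping context, the sketch below pulls prices and an email address out of a piece of text using Python's re module. The sample text and both patterns are illustrative assumptions; they are deliberately simple and will not cover every price or email format you meet in the wild.

```python
import re

# Hypothetical text, standing in for content already extracted from a page.
SAMPLE_TEXT = "Blue Widget costs $19.99, Red Widget costs $5, contact sales@example.com"

# A dollar sign followed by digits and an optional two-digit decimal part.
PRICE_PATTERN = re.compile(r"\$(\d+(?:\.\d{2})?)")

# A deliberately simple email pattern: name@domain with at least one dot.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

prices = [float(p) for p in PRICE_PATTERN.findall(SAMPLE_TEXT)]
emails = EMAIL_PATTERN.findall(SAMPLE_TEXT)

print(prices)
print(emails)
```

Regular expressions work best on short, already-extracted strings like this one; for navigating the HTML itself, a parser is usually the more robust choice.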
Parsing a page with BeautifulSoup
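Here is a minimal sketch of what parsing a page with BeautifulSoup can look like, assuming the bs4 package is installed (pip install beautifulsoup4). The HTML is a made-up page inlined in the script so that it runs without a network request, and extract_links is a helper name chosen for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical page, inlined so the sketch runs without downloading anything.
SAMPLE_HTML = """
<html>
  <head><title>Example Store</title></head>
  <body>
    <h1>Products</h1>
    <ul>
      <li><a href="/products/1">Blue Widget</a></li>
      <li><a href="/products/2">Red Widget</a></li>
      <li><a href="https://example.com/about">About us</a></li>
    </ul>
  </body>
</html>
"""


def extract_links(html):
    """Return (link text, href) pairs for every <a> tag that has an href."""
    # "html.parser" is the parser bundled with Python's standard library.
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(soup.title.string)  # text of the <title> tag

for text, href in extract_links(SAMPLE_HTML):
    print(text, "->", href)
```

For a live page, you would typically download the HTML first, for example with requests.get(url).text, and pass that string to BeautifulSoup in place of SAMPLE_HTML.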
Did you find this article helpful? Please share your opinions and thoughts in the comments section below, and feel free to reach out to me if you have a suggestion!