Python Web Scraping Tutorial using BeautifulSoup & Scrapy
Beginner’s guide to Web Scraping in Python using BeautifulSoup & scrapy
Introduction to Web Scraping
Important tools & Python libraries required for this tutorial
Parsing a page with BeautifulSoup
A complete example using BeautifulSoup
Example using Scrapy
“Meaningful data” is what we all want for our data science projects! There are situations where the data is not available as a CSV or in your data mart, and the only way to access it is from HTML pages on the internet. In such situations we usually turn to a technique called web scraping to get the data from HTML web pages into the format we require for our analysis.
A lot of valuable information is freely available online, but not necessarily in a nice structured format. Web scraping is the process of collecting more or less structured data from the Internet. This typically involves:
- Crawling from page to page, following links (think Google crawler)
- Extracting information from the resulting downloads, such as web pages, PDF documents, Word documents, and so on.
Other ways to extract information from the web:
- Web APIs – the most preferred way to get the required information from your data providers. Almost every technology-driven organisation provides APIs to access its data in a more structured manner.
- RSS feeds – RSS stands for Really Simple Syndication (or Rich Site Summary), though it is often referred to simply as the feed or news feed. RSS allows publishers to automatically syndicate their content. It is basically a structured XML document that includes full or summarized text along with other metadata such as the published date, author name, etc.
Even when an API does exist, its request volume and rate limits, or the types or format of data it provides, might be insufficient for your purposes. This is where web scraping steps in. With few exceptions, if you can view it in your browser, you can access it via a Python script. If you can access it in a script, you can store it in a database. And if you can store it in a database, you can do virtually anything with that data.
Web scraping is the practice of using a computer program to grab information from web pages (mostly HTML) and gather the data you need in the format most useful to you, while at the same time preserving the structure of the data.
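As a quick illustration of how structured an RSS feed already is, it can be parsed with nothing more than Python's standard library (the feed XML below is an inline stand-in for what you would normally download from a feed URL):

```python
import xml.etree.ElementTree as ET

# A minimal stand-in for a real RSS 2.0 feed document.
rss = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <pubDate>Mon, 01 May 2017 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
      <pubDate>Tue, 02 May 2017 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(rss)
# Each <item> carries the structured metadata mentioned above.
items = [(i.findtext('title'), i.findtext('link')) for i in root.iter('item')]
print(items)
```

Because the structure is guaranteed by the feed format, there is no fragile HTML parsing involved at all.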
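A minimal sketch of that practice with BeautifulSoup (the HTML snippet and tag names here are made up for illustration; a real scraper would fetch `html` from a live URL):

```python
from bs4 import BeautifulSoup

# Fetching a live page would normally be one line with the requests library:
#   html = requests.get('http://example.com').text
# Here we parse an inline snippet so the example is self-contained.
html = '<html><body><h1>Quarterly Report</h1><p>Sales are up.</p></body></html>'

soup = BeautifulSoup(html, 'html.parser')
heading = soup.h1.get_text(strip=True)    # text of the first <h1>
paragraph = soup.p.get_text(strip=True)   # text of the first <p>
print(heading, '-', paragraph)
```

The parser turns the raw HTML into a tree you can query by tag, attribute, or CSS class, which is exactly the "preserving the structure" part of the definition above.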
You can find lots of examples of open-source Python scraping code on ScraperWiki – as well as a place to keep and run scrapers so you aren’t responsible for them forever. There are tutorials too that you can run entirely in your web browser. ScraperWiki was set up in 2009 by a group of clever people previously involved in the very useful http://theyworkforyou.com/. It hosts programs that scrape the web – and it also hosts the nice, tidy data these scrapers produce.
Right now, there are around 10,000 scrapers hosted on the website. You might be lucky, and find that someone has already written a program to scrape the website you’re interested in. If you’re not lucky, you can pay someone else to write that scraper – as long as you’re happy for the data and the scraper to be in the public domain.
A few scraping rules
We are all set to start web scraping. But first, here are a few rules you should keep in mind while running any scraping utility.
- Check a website’s terms and conditions before you scrape it. It’s their data, and they likely have rules governing its use. These rules also differ by country and type of organisation, so always read the terms and conditions first.
- Be nice – your script will send web requests much faster than a human can. Make sure you space out your requests so that you don’t hammer the website’s server. Webmasters these days are smart enough to detect aggressive traffic and block it.
- Scrapers break – websites change their layout all the time. When that happens, you need to rewrite your code, so be prepared – especially if the site is dynamic in nature.
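The first two rules above can be partly automated: the standard library's `urllib.robotparser` checks a site's `robots.txt`, and a small `time.sleep` spaces out requests. A sketch (the paths and rules here are made-up examples):

```python
import time
from urllib.robotparser import RobotFileParser

# Parse a robots.txt; inlined here for a self-contained example.
# Normally: rp.set_url('http://example.com/robots.txt'); rp.read()
rp = RobotFileParser()
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

allowed = rp.can_fetch('*', 'http://example.com/public/page.html')
blocked = rp.can_fetch('*', 'http://example.com/private/page.html')
print(allowed, blocked)

# Space out requests so you don't hammer the server.
for page in ['page1', 'page2']:
    # ... fetch(page) would go here ...
    time.sleep(0.5)  # a small, polite delay between requests
```

`robots.txt` is advisory, not a legal document, so it complements – not replaces – reading the site's terms and conditions.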
Let’s Get Started!
Important tools & Python libraries required for this tutorial:
There are a lot of tools and libraries available for web scraping with Python; I personally prefer BeautifulSoup and Scrapy. It is also good to learn XPath, CSS selectors, and regular expressions. Below is a list of tools and libraries useful for web scraping.
- Scrapy is an aptly named library for creating spider bots to systematically crawl the web and extract structured data like prices, contact info, and URLs. Originally designed for web scraping, Scrapy can also extract data from APIs.
- BeautifulSoup. I know it’s slow, but this XML and HTML parsing library is very useful for beginners.
Beautiful Soup (BS4) is a parsing library that can use different parsers. A parser is simply a program that can extract data from HTML and XML documents.
Beautiful Soup’s default parser comes from Python’s standard library. It’s flexible and forgiving, but a little slow. The good news is that you can swap out its parser with a faster one if you need the speed.
One advantage of BS4 is its ability to automatically detect encodings. This allows it to gracefully handle HTML documents with special characters.
In addition, BS4 can help you navigate a parsed document and find what you need. This makes it quick and painless to build common applications. For example, if you wanted to find all the links on the web page we pulled down earlier, it’s only a few lines:
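For instance (using an inline HTML snippet in place of the page pulled down earlier), extracting every link really is only a few lines:

```python
from bs4 import BeautifulSoup

html = """<html><body>
  <a href="http://example.com/about">About</a>
  <a href="http://example.com/contact">Contact</a>
</body></html>"""

soup = BeautifulSoup(html, 'html.parser')
# find_all('a') returns every anchor tag; read each one's href attribute.
links = [a.get('href') for a in soup.find_all('a')]
print(links)
```

`find_all` accepts tag names, attribute filters, and even regular expressions, so the same pattern scales from "all links" to much more specific queries.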
- Requests. The most famous HTTP library, written by Kenneth Reitz. It’s a must-have for every Python developer.
- Lxml is a high-performance, production-quality HTML and XML parsing library. Among all the Python web scraping libraries, we’ve enjoyed using lxml the most. It’s straightforward, fast, and feature-rich.
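A small sketch of what lxml looks like in practice, using an XPath expression over an inline snippet (the class name and prices are illustrative):

```python
from lxml import html as lxml_html

doc = lxml_html.fromstring(
    '<ul><li class="price">$10</li><li class="price">$12</li></ul>'
)
# XPath selects every <li> with class "price" and returns its text.
prices = doc.xpath('//li[@class="price"]/text()')
print(prices)
```

XPath queries like this are one reason lxml is popular for production scrapers: a single expression replaces several lines of tree navigation.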
- Selenium is a tool that automates browsers, also known as a web driver. With it, you can actually open a Google Chrome window, visit a site, and click on links. Pretty cool, right? It also comes with Python bindings for controlling it right from your application, which makes it a breeze to integrate with your chosen parsing library.
Other important concepts you might need to practise to get your work done:
- Regular expressions with Python
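For example, regular expressions shine when the data you need follows a predictable pattern inside otherwise messy text (the price format below is just an illustration):

```python
import re

text = "Item A costs $19.99, Item B costs $5 and Item C costs $129.50."
# \$ matches a literal dollar sign; \d+(?:\.\d{2})? matches the amount,
# with an optional two-digit decimal part.
prices = re.findall(r'\$\d+(?:\.\d{2})?', text)
print(prices)
```

In a scraper, a pattern like this is typically applied to text already extracted by BeautifulSoup or lxml, rather than to raw HTML.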
Example 1: Sentiment analysis of tweets by the Honourable Prime Minister of India, “Modi Ji”
This is sample code to illustrate how you can merge and analyse Twitter data with textual (sentiment) analysis. It loads all of the PM’s tweets since the inauguration (you can set any date), analyses the sentiment of the content, and looks at the volume.
The code below uses python-twitter (imported as `twitter`), one of the many packages that let you access the Twitter API from Python. You can install it from the terminal:
pip install python-twitter
TextBlob is a nice package with an easy interface for textual analysis. It uses the much more complex NLTK under the hood. You can install it from the terminal:
pip install textblob
In order to use the Twitter API, you need to register yourself and your application (this code!) on Twitter’s developer platform. I have placed my keys in a JSON file called ‘twitter_keys.json’.
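The key names below are the ones the code expects; the values are placeholders you would replace with your own credentials from the developer platform. Creating and reading the file with the standard `json` module looks like:

```python
import json

# Placeholder credentials - substitute the real values issued to you
# by Twitter's developer platform.
keys = {
    'consumer_key': 'YOUR_CONSUMER_KEY',
    'consumer_secret': 'YOUR_CONSUMER_SECRET',
    'access_token_key': 'YOUR_ACCESS_TOKEN_KEY',
    'access_token_secret': 'YOUR_ACCESS_TOKEN_SECRET',
}

with open('twitter_keys.json', 'w') as f:
    json.dump(keys, f, indent=2)

with open('twitter_keys.json', 'r') as f:
    loaded = json.load(f)
print(sorted(loaded))
```

Keeping credentials in a separate file like this also makes it easy to exclude them from version control.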
- Step 1: We begin by downloading the latest tweets (the API limits us to 200 at a time) and then loop back until we reach the inauguration date.
- Step 2: Next, we use TextBlob to compute each tweet’s sentiment, a measure that goes from -1 (negative) to +1 (positive). This is an easy way to get the sentiment, but for serious research one would need to assess whether the underlying algorithm is appropriate for the task.
```python
import json
from datetime import datetime

import pandas as pd
import matplotlib.pyplot as plt
from textblob import TextBlob
import twitter

# Load the API credentials saved in twitter_keys.json.
with open('twitter_keys.json', 'r') as f:
    keys = json.load(f)

api = twitter.Api(consumer_key=keys['consumer_key'],
                  consumer_secret=keys['consumer_secret'],
                  access_token_key=keys['access_token_key'],
                  access_token_secret=keys['access_token_secret'])

# Inauguration day
start_date = datetime(2017, 1, 20)

# Download tweets (200 at a time) until we reach the inauguration date.
statuses = []
done = False
max_id = None
while not done:
    tmp = api.GetUserTimeline(screen_name='narendramodi', count=200, max_id=max_id)
    statuses = statuses + tmp
    a = tmp[-1]
    max_id = a.id - 1
    done = pd.to_datetime(a.created_at) < start_date

stat_list = [[pd.to_datetime(x.created_at), x.text] for x in statuses]
tweets = pd.DataFrame(stat_list, columns=['timestamp', 'text'])
tweets = tweets[tweets.timestamp >= start_date].copy()
print(tweets.head())
print(str(len(tweets)) + ' tweets since inauguration.')

# Compute each tweet's sentiment polarity (-1 negative, +1 positive).
def getSentiment(text):
    return TextBlob(text).sentiment.polarity

tweets['sentiment'] = tweets['text'].apply(getSentiment)

print('TOP 10 NEGATIVE TWEETS\n')
for _, row in tweets.sort_values('sentiment').head(10).iterrows():
    print(row['text'] + '\n')

print('TOP 10 POSITIVE TWEETS\n')
for _, row in tweets.sort_values('sentiment', ascending=False).head(10).iterrows():
    print(row['text'] + '\n')

# Convert timestamps to US/Eastern and look at tweet volume.
tweets = tweets.set_index('timestamp').tz_localize('UTC').tz_convert('US/Eastern')

count_days = tweets.groupby(tweets.index.day_name())['sentiment'].count()
count_days.plot(kind='bar')
plt.show(block=True)

count_hours = tweets.groupby(tweets.index.hour)['sentiment'].count()
fig, axes = plt.subplots(1, 1, figsize=(8, 5), dpi=80)
count_hours.plot(kind='bar', ax=axes)
axes.set_title('Distribution of tweets over the day')
axes.set_ylabel('Number of tweets')
axes.set_xlabel('Hour')
plt.show(block=True)
```
Did you find this article helpful? Please share your opinions and thoughts in the comments section below, and feel free to reach out if you have a suggestion!
In part two of this article, we will discuss:
Parsing a page with BeautifulSoup
A complete example using BeautifulSoup
Example using Scrapy
Stay tuned for the next part!