Python Web Scraping Tutorial using BeautifulSoup & Scrapy

Beginner’s guide to Web Scraping in Python using BeautifulSoup & Scrapy


In this tutorial we will cover:

  • Introduction to web scraping

  • Important tools & Python libraries required for the tutorial

  • Parsing a page with BeautifulSoup

  • Complete example using BeautifulSoup

  • Example using Scrapy

 

Introduction

Meaningful data is what we all want for our data science projects! There are situations where the data is not available as a CSV file or in your data mart, and the only way to get it is from HTML pages on the internet. In these situations we usually turn to a technique called web scraping to pull the data from HTML web pages into the format we need for our analysis.

A lot of valuable information is freely available online, but not necessarily in a nice structured format. Web scraping is the process of collecting data that is more or less structured from the Internet.

It includes:

  • Crawling from page to page, following links (think Google crawler)
  • Extracting information from the resulting downloads, such as web pages, PDF documents, Word documents, etc.

Other ways to extract information from the web:

  • Web API – the most preferred way to get the required information from your data providers! Almost every technology-driven organisation provides data APIs so that its data can be accessed in a more structured manner (see the short sketch after this list).
  • RSS feeds – RSS stands for Really Simple Syndication or Rich Site Summary, though it is often referred to simply as a feed or news feed. RSS allows publishers to automatically syndicate their content. It is basically a structured XML document that includes full or summarised text along with other metadata such as the published date and author name.
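As a quick illustration of the API route, here is a minimal sketch that queries a public JSON API with the requests library (the GitHub repository endpoint is used purely as an example; any JSON API works the same way):

import requests

# Query a public JSON API (GitHub is used here only as an example endpoint).
response = requests.get('https://api.github.com/repos/scrapy/scrapy')
response.raise_for_status()

data = response.json()  # the API already returns structured data
print(data['full_name'], '-', data['description'])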

Even when an API does exist, its request volume and rate limits, the types of data, or the format of the data it provides might be insufficient for your purposes. This is where web scraping steps in. With few exceptions, if you can view it in your browser, you can access it via a Python script. If you can access it in a script, you can store it in a database. And if you can store it in a database, you can do virtually anything with that data.

Web scraping is the practice of using a computer program to grab information from web pages (mostly HTML) and gather the data you need in the format most useful to you, while at the same time preserving the structure of the data.
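To make that browser-to-script-to-database chain concrete, here is a minimal sketch that downloads a page with requests and stores the raw HTML in a local SQLite database (the URL, database name and table name are placeholders chosen purely for illustration):

import sqlite3
import requests

url = 'https://example.com/'  # placeholder URL for illustration
html = requests.get(url).text

# Store the raw HTML so it can be parsed or re-processed later.
conn = sqlite3.connect('pages.db')
conn.execute('CREATE TABLE IF NOT EXISTS pages (url TEXT, html TEXT)')
conn.execute('INSERT INTO pages VALUES (?, ?)', (url, html))
conn.commit()
conn.close()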

You can find plenty of example (open-source) Python scraping code on http://www.scraperwiki.com/ – as well as a place to keep and run scrapers so you aren’t responsible for them forever. There are tutorials, too, that you can run entirely in your web browser. The site was set up in 2009 by a group of clever people previously involved in the very useful http://theyworkforyou.com/. ScraperWiki hosts programmes to scrape the web – and it also hosts the nice, tidy data these scrapers produce.

Right now, there are around 10,000 scrapers hosted on the website. You might be lucky, and find that someone has already written a program to scrape the website you’re interested in. If you’re not lucky, you can pay someone else to write that scraper – as long as you’re happy for the data and the scraper to be in the public domain.

 

A few scraping rules

We are all set to start web scraping. But first, here are a few rules you should keep in mind when running any scraping utility.

  • Check a website’s terms and conditions before you scrape it. It’s their data, and they likely have rules that govern its use; these rules also differ depending on local laws and the type of organisation. Better to check the terms and conditions first!
  • Be nice – your script/utility will send web requests much faster than a human can. Make sure you space out your requests a bit so that you don’t hammer the website’s server. Webmasters these days are smart enough to catch you and block your traffic!
  • Scrapers break – websites change their layout all the time. If that happens, you will need to rewrite your code, so be prepared! This happens all the more often if the site is dynamic in nature.

Let’s get started!

 

Important tools & Python libraries required for the tutorial

There are lots of tools and libraries available for web scraping with Python; I personally prefer BeautifulSoup and Scrapy. It is also worth learning XPath, CSS selectors and regular expressions. Below is a list of tools/libraries useful for web scraping.

Python libraries

  • Scrapy is an aptly named library for creating spider bots to systematically crawl the web and extract structured data like prices, contact info, and URLs. Originally designed for web scraping, Scrapy can also extract data from APIs.
  • BeautifulSoup – I know it’s slow, but this XML and HTML parsing library is very useful for beginners.

Beautiful Soup (BS4) is a parsing library that can use different parsers. A parser is simply a program that can extract data from HTML and XML documents.

Beautiful Soup’s default parser comes from Python’s standard library. It’s flexible and forgiving, but a little slow. The good news is that you can swap out its parser with a faster one if you need the speed.

One advantage of BS4 is its ability to automatically detect encodings. This allows it to gracefully handle HTML documents with special characters.

In addition, BS4 can help you navigate a parsed document and find what you need, which makes it quick and painless to build common applications. For example, finding all the links in a web page takes only a few lines:
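(A minimal sketch, assuming the page is fetched with requests first; the URL is just a placeholder.)

import requests
from bs4 import BeautifulSoup

# Fetch a page (the URL is a placeholder for illustration).
html = requests.get('https://example.com/').text

# Parse it and print every link's text and target.
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a'):
    print(link.get_text(strip=True), link.get('href'))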

  • Requests – the most famous HTTP library, written by Kenneth Reitz. It’s a must-have for every Python developer.
  • lxml – a high-performance, production-quality HTML and XML parsing library. Among all the Python web scraping libraries, we’ve enjoyed using lxml the most. It’s straightforward, fast, and feature-rich (see the short XPath sketch after this list).
  • Selenium – a tool that automates browsers, also known as a web driver. With it, you can actually open a Google Chrome window, visit a site, and click on links. Pretty cool, right? It also comes with Python bindings for controlling it right from your application, which makes it a breeze to integrate with your chosen parsing library.
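As a quick illustration of lxml with an XPath expression (a minimal sketch; the HTML snippet is made up for the example):

from lxml import html

# A small made-up HTML snippet for illustration.
page = html.fromstring('<ul><li class="item">Apples</li><li class="item">Pears</li></ul>')

# The XPath expression selects both list items; text_content() strips the markup.
for item in page.xpath('//li[@class="item"]'):
    print(item.text_content())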

Other important concepts you might need to practise to get your work done:

  • Regular expressions with Python (a short sketch follows below)
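For instance, here is a minimal sketch that pulls email addresses out of raw text with Python’s built-in re module (the pattern and the sample text are simplified for illustration):

import re

text = 'Contact us at info@example.com or sales@example.org for details.'

# A simplified pattern; real-world email matching is messier than this.
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.]+', text)
print(emails)  # ['info@example.com', 'sales@example.org']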

Example 1: Text and sentiment analysis of tweets by the Honourable Prime Minister of India, “Modi Jee”

This is sample code illustrating how you can pull together and analyse Twitter data with textual (sentiment) analysis. The code loads all of the PM’s tweets since the inauguration (you can set any date), analyses the sentiment of their content, and looks at the tweet volume.

python-twitter

One of the many packages that let you access the Twitter API from Python. You can install it from the terminal:
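pip install python-twitter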

TextBlob

This is a nice package with an easy interface for textual analysis. It uses the much more complex NLTK under the hood. You can install it from the terminal:

pip install textblob

In order to use the Twitter API, you need to register yourself and your application (this code!) on Twitter’s developer platform. I have placed my keys in a JSON file called ‘twitter_keys.json’.
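For reference, the key file should look something like this (the key names match the ones used in the code below; the values shown are placeholders):

{
  "consumer_key": "YOUR_CONSUMER_KEY",
  "consumer_secret": "YOUR_CONSUMER_SECRET",
  "access_token_key": "YOUR_ACCESS_TOKEN_KEY",
  "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET"
}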

  • Step 1: We begin by downloading the latest tweets (the API limits us to 200 at a time), and then loop back until we reach the inauguration date.
  • Step 2: Next, we use TextBlob to compute each tweet’s sentiment score, which ranges from -1 (negative) to +1 (positive). This is an easy way to get sentiment, but for serious research you would need to assess whether the underlying algorithm is appropriate for the task.

 

import json
from datetime import datetime

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# For this one, cartoon-style plots seem appropriate (e.g. plt.xkcd()).

from textblob import TextBlob
import twitter

# Load the Twitter API credentials from the JSON key file.
with open('twitter_keys.json', 'r') as f:
    keys = json.load(f)

api = twitter.Api(consumer_key=keys['consumer_key'],
                  consumer_secret=keys['consumer_secret'],
                  access_token_key=keys['access_token_key'],
                  access_token_secret=keys['access_token_secret'])

# Inauguration day
start_date = datetime(2017, 1, 20)

statuses = []

# Download tweets (200 at a time) until we reach before the inauguration date.
done = False
max_id = None
while not done:
    tmp = api.GetUserTimeline(screen_name='narendramodi',
                              count=200, max_id=max_id)
    statuses = statuses + tmp
    a = tmp[-1]
    max_id = a.id - 1
    # Twitter timestamps carry a UTC offset; parse as UTC and drop the
    # timezone so they compare cleanly with start_date (re-localized later).
    done = pd.to_datetime(a.created_at, utc=True).tz_localize(None) < start_date

# Build a DataFrame of (timestamp, text) pairs.
stat_list = [[pd.to_datetime(x.created_at, utc=True).tz_localize(None), x.text]
             for x in statuses]
tweets = pd.DataFrame(stat_list, columns=['timestamp', 'text'])

tweets = tweets[tweets.timestamp >= start_date].copy()
print(tweets.head())

print(str(len(tweets)) + ' tweets since inauguration.')
plt.interactive(False)

# Compute each tweet's sentiment polarity (-1 negative to +1 positive).
def getSentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity

tweets['sentiment'] = tweets['text'].apply(getSentiment)

print('TOP 10 NEGATIVE TWEETS\n')
for _, row in tweets.sort_values('sentiment').head(10).iterrows():
    print(row['text'] + '\n')

print('TOP 10 POSITIVE TWEETS\n')
for _, row in tweets.sort_values('sentiment', ascending=False).head(10).iterrows():
    print(row['text'] + '\n')

# Index by timestamp and convert from UTC to US/Eastern time.
tweets = tweets.set_index('timestamp').tz_localize('UTC')
tweets = tweets.tz_convert('US/Eastern')

# Number of tweets per day of the week.
# (.day_name() replaces the older .weekday_name attribute.)
count_days = tweets.groupby(tweets.index.day_name())['sentiment'].count()
count_days.plot(kind='bar')
plt.show(block=True)

# Number of tweets per hour of the day.
count_hours = tweets.groupby(tweets.index.hour)['sentiment'].count()
fig, axes = plt.subplots(1, 1, figsize=(8, 5), dpi=80)
count_hours.plot(kind='bar', ax=axes)
axes.set_title('Distribution of tweets over the day')
axes.set_ylabel('Number of tweets')
axes.set_xlabel('Hour')
plt.show(block=True)

 

Did you find this article helpful? Please share your opinions and thoughts in the comments section below, and feel free to reach out if you have any suggestions!

In part two of this article we will discuss:

  • Parsing a page with BeautifulSoup
  • Complete example using BeautifulSoup
  • Example using Scrapy

Stay tuned for the next part!

Source: Python Bootcamp links
