How to extract data from websites

Image by James Osborne from Pixabay

The web contains a vast amount of data, and being able to extract the information you need from it is a useful, sometimes necessary, skill. Plenty of ready-made datasets are available for download from places like Kaggle, but in many cases you won’t find the exact data you need for your particular problem. Chances are, though, that the data exists somewhere on the web, and you’ll need to extract it from there.

Web scraping is the process of extracting data from web pages. In this article, we’ll see how to do web scraping in Python. There are several libraries for this task; here we will use Beautiful Soup 4. This library only takes care of extracting data from an HTML document, not downloading it. For downloading web pages, we need another library: requests.

So, we’ll need 2 packages:

  • requests — for downloading the HTML code from a given URL
  • Beautiful Soup — for extracting data from that HTML string

Installing the libraries

Now, let’s start by installing the required packages. Open a terminal window and type:

python -m pip install requests beautifulsoup4

…or, if you’re using a conda environment:

conda install requests beautifulsoup4

Now, try to run the following:

import requests
from bs4 import BeautifulSoup

If you don’t get any error, then the packages are installed successfully.

Using requests & beautiful soup to extract data

From the requests package we will use the get() function to download a web page from a given URL:

requests.get(url, params=None, **kwargs)

Where the parameters are:

  • url — url of the desired web page
  • params — an optional dictionary, list of tuples or bytes to send in the query string
  • **kwargs — optional arguments passed on to the underlying request, such as headers or timeout
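
To see how params ends up in the query string without sending anything over the network, a request can be prepared but not sent. This is only a sketch: the /search path and the q and page parameters are made up for illustration.

```python
import requests

# Hypothetical endpoint and parameters, used only for illustration.
url = "http://www.example.com/search"
query = {"q": "linux", "page": 2}

# Build the request without sending it, to inspect the final URL.
prepared = requests.Request("GET", url, params=query).prepare()
print(prepared.url)

# Sending it for real would look like this (timeout is one of the **kwargs):
# response = requests.get(url, params=query, timeout=10)
```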

This function returns an object of type requests.Response. Among this object’s attributes and methods, we are most interested in the .content attribute, which holds the body of the response (here, the HTML of the target web page) as raw bytes. The .text attribute holds the same body decoded to a string.

Example:

html_string = requests.get("http://www.example.com").content
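
In practice a download can fail or hang, so it may help to wrap the call in a small helper. The sketch below is one possible shape, not part of requests itself: the fetch_html name and the 10-second default timeout are assumptions made for this example.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a decoded string."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently parsing an error page.
    response.raise_for_status()
    # .text is the decoded body; .content would give the raw bytes.
    return response.text


# Usage (needs network access):
# html_string = fetch_html("http://www.example.com")
```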

Once we have the HTML of the target web page, we use the BeautifulSoup() constructor to parse it and get a BeautifulSoup object that we can use to navigate the document tree and extract the data we need.

soup = BeautifulSoup(markup_string, parser)

Where:

  • markup_string — the markup of our web page, as a string or bytes
  • parser — a string with the name of the parser to be used; here we will use Python’s built-in parser: “html.parser”

Note that we named the first parameter “markup_string” rather than “html_string” because BeautifulSoup can also be used with other markup languages, not just HTML, as long as we specify an appropriate parser; e.g. we can parse XML by passing “xml” as the parser (this requires the lxml package to be installed).
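
As a quick check that parsing works, here is a minimal sketch on a hand-written document. The markup is invented for the example; it also shows that the constructor accepts bytes as well as a string.

```python
from bs4 import BeautifulSoup

# Small hand-written document, used instead of a downloaded page.
markup_string = (
    "<html><head><title>Demo</title></head>"
    "<body><p>Hello</p></body></html>"
)

soup = BeautifulSoup(markup_string, "html.parser")
print(soup.title.string)
print(soup.p.string)

# Bytes (like the .content of a response) are accepted too.
soup_from_bytes = BeautifulSoup(markup_string.encode("utf-8"), "html.parser")
```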

A BeautifulSoup object has several methods and attributes that we can use to navigate the parsed document and extract data from it.
The most used method is .find_all():

soup.find_all(name, attrs, recursive, string, limit, **kwargs)

Where:

  • name — name of the tag; e.g. “a”, “div”, “img”
  • attrs — a dictionary with the tag’s attributes; e.g. {“class”: “nav”, “href”: “#menuitem”}
  • recursive — boolean; if False, only direct children are considered; if True (default), all descendants are examined in the search
  • string — used to search for strings in the element’s content
  • limit — limit the search to only this number of found elements

Example:

soup.find_all("a", attrs={"class": "nav", "data-foo": "value"})

The line above returns a list with all “a” elements that also have the specified attributes.
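
The parameters listed above can be tried out on a small fragment. The menu markup below is invented for this sketch.

```python
from bs4 import BeautifulSoup

# Invented fragment: three links nested inside list items.
html = """
<ul id="menu">
  <li><a class="nav" href="#home">Home</a></li>
  <li><a class="nav" href="#docs">Docs</a></li>
  <li><a href="#about">About</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# attrs: only links with class "nav".
nav_links = soup.find_all("a", attrs={"class": "nav"})

# limit: stop after the first match (still returns a list).
first_link = soup.find_all("a", limit=1)

# string: match on the text inside the element.
about = soup.find_all("a", string="About")

# recursive=False: look only at direct children of <ul>.
# The links sit inside <li> elements, so none are found.
direct_links = soup.ul.find_all("a", recursive=False)
direct_items = soup.ul.find_all("li", recursive=False)
```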

HTML attributes whose names cannot be confused with this method’s parameters or with Python keywords can be passed directly as keyword arguments, without putting them inside the attrs dictionary. The HTML class attribute can also be used this way, but since class is a Python keyword, write class_=“…” instead of class=“…”.

Example:

soup.find_all("a", class_="nav")

Because this method is the most used one, it has a shortcut: calling the BeautifulSoup object directly has the same effect as calling the .find_all() method.

Example:

soup("a", class_="nav")
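
The different spellings can be compared side by side. In this sketch the three links are invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a class="nav" id="home" href="#home">Home</a>'
    '<a class="nav" href="#docs">Docs</a>'
    '<a href="#about">About</a>',
    "html.parser",
)

# Three ways of asking for the same elements.
by_dict = soup.find_all("a", attrs={"class": "nav"})
by_keyword = soup.find_all("a", class_="nav")
by_call = soup("a", class_="nav")

# Attributes such as id need no trailing underscore.
by_id = soup.find_all("a", id="home")
```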

The .find() method is like .find_all(), but it stops searching after it finds the first matching element and returns that element. It is roughly equivalent to .find_all(..., limit=1), but instead of returning a list, it returns a single element (or None if nothing matches).
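
A short sketch of the difference, including the case where nothing matches. The "price" class is made up so that the search fails on purpose.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<span class="rating">5</span><span class="views">100</span>',
    "html.parser",
)

first_span = soup.find("span")                  # a single element
same_span = soup.find_all("span", limit=1)[0]   # first item of a list

# No element has this (made-up) class, so .find() returns None.
missing = soup.find("span", class_="price")
if missing is None:
    print("no element with class 'price'")
```

Checking for None before using the result avoids an AttributeError when the page does not contain the expected element.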

The .contents attribute of an element is a list of all its children. If the element does not contain nested HTML elements, then .contents[0] is just the text inside it. So once we have found the element that contains the data we need using .find_all() or .find(), all we need to do to get the data inside it is to access .contents[0].

Example:

soup = BeautifulSoup('''
    <div>
        <span class="rating">5</span>
        <span class="views">100</span>
    </div>
''', "html.parser")
views = soup.find("span", class_="views").contents[0]
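
When the element does contain nested tags, .contents[0] may be a tag rather than text. The sketch below uses a slightly modified, invented version of the span to show this, and uses .get_text() as one way to collect the text of an element together with its descendants.

```python
from bs4 import BeautifulSoup

# Invented variation: the number is wrapped in a nested <b> tag.
soup = BeautifulSoup(
    '<span class="views"><b>100</b> views</span>',
    "html.parser",
)
span = soup.find("span", class_="views")

first_child = span.contents[0]   # the <b> tag, not a string
all_text = span.get_text()       # text of the element and its descendants
number = int(first_child.string)
```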

What if the data we need is not inside the element, but is the value of one of its attributes? We can access an element’s attribute value as follows:

tag['attr_name']

Example:

soup = BeautifulSoup('''
    <div>
        <img src="./img1.png">
    </div>
''', "html.parser")
img_source = soup.find("img")['src']
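
Square brackets raise a KeyError when the attribute is absent, so for optional attributes .get() may be more convenient. A small sketch, with the class values invented for the example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<img src="./img1.png" class="thumb large">',
    "html.parser",
)
img = soup.find("img")

src = img["src"]                    # raises KeyError if missing
alt = img.get("alt")                # None, because there is no alt
alt_or_default = img.get("alt", "")
classes = img["class"]              # class is multi-valued: a list
all_attrs = img.attrs               # every attribute as a dictionary
```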

Web scraping example: get top 10 Linux distros

Now, let’s see a simple web scraping example using the concepts above. We will extract a list of the top 10 most popular Linux distros from the DistroWatch website. DistroWatch (https://distrowatch.com/) is a website featuring news about Linux distros and open source software that runs on Linux. On its right side, the website shows a ranking of the most popular Linux distros. From this ranking, we will extract the first 10.

First, we will download the web page and construct a BeautifulSoup object from it:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("https://distrowatch.com/").content,
    "html.parser")

Next, we need to work out how to identify the data we want inside the HTML code. For that, we will use Chrome’s developer tools. Right-click somewhere on the web page and click “Inspect”, or press “Ctrl+Shift+I”, to open them.

Then, if you click on the little arrow in the top-left corner of the developer tools and then click on some element on the web page, the dev tools window shows the piece of HTML associated with that element. You can then use that information to tell Beautiful Soup where to find the element.

In our example, we can see that the ranking is structured as an HTML table and each distro name is inside a td element with class “phr2”. Inside that td element is a link containing the text we want to extract (the distro’s name). The next few lines of code extract it:

top_ten_distros = []
distro_tds = soup("td", class_="phr2", limit=10)
for td in distro_tds:
    top_ten_distros.append(td.find("a").contents[0])
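
Because the live page changes over time, the same extraction can also be tried offline. The sketch below wraps the loop in a function and runs it on a made-up fragment that imitates the td/“phr2” structure described above; the extract_distro_names name, the sample markup and the order of the names are all assumptions for illustration, not the real ranking. It also skips cells that contain no link instead of failing on them.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the structure described above; the names
# and their order are placeholders, not the actual ranking.
sample_html = """
<table>
  <tr><td class="phr2"><a href="mint">Mint</a></td></tr>
  <tr><td class="phr2"><a href="debian">Debian</a></td></tr>
  <tr><td class="phr2">no link here</td></tr>
  <tr><td class="phr2"><a href="ubuntu">Ubuntu</a></td></tr>
</table>
"""


def extract_distro_names(markup, limit=10):
    """Return the link text of the first `limit` td.phr2 cells."""
    soup = BeautifulSoup(markup, "html.parser")
    names = []
    for td in soup.find_all("td", class_="phr2", limit=limit):
        link = td.find("a")
        if link is not None:  # skip cells without a link
            names.append(link.get_text(strip=True))
    return names


print(extract_distro_names(sample_html))
```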

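Run against the live page, the loop above leaves top_ten_distros holding the names of the ten distros at the top of the ranking.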



I hope you found this information useful and thanks for reading!



Dorian

Passionate about Data Science, AI, Programming & Math
