How to interact with websites and extract data

Background image by Markus Winkler from Pixabay

Selenium WebDriver lets you automate interaction with a web browser. It has many use cases, among them web application testing and web scraping; the latter is what data scientists care about most.

For web scraping, Selenium is more powerful than a pure parser. Tools like Beautiful Soup are meant only for parsing a static HTML document and extracting data from it. If the web page you want to extract data from needs some interaction on your part to make its data available, like logging in or scrolling down, a parser alone is not enough.

But Selenium is not just a parser. When you request a web page, it actually opens a web browser of your choice and goes to that address. Then you can do from code what you usually can do with a web browser: click stuff, type into text fields, scroll, extract data, etc.

Selenium WebDriver is available for several programming languages: Ruby, Java, Python, C#, and JavaScript, plus others that are not officially supported. In this article, we will work only with the Python version.

Now, let’s go ahead and see how we can do these things with Selenium, and we’ll start with its installation.

Installation

As with most Python packages, installation is straightforward:

pip install selenium

…or:

conda install selenium

Besides installing the Python package, we also need a browser driver. A browser driver is an executable, provided by the browser’s developers, that Selenium uses to control the browser. Drivers are available for the following web browsers: Chrome, Firefox, Edge, Opera, and Safari.

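Before using a browser with Selenium, we need to download the required browser driver, put it into a folder (e.g. C:\Users\username\browser_drivers), and add this folder to the PATH environment variable. (Selenium 4.6 and later ship with Selenium Manager, which can download a matching driver automatically, so this manual step mainly matters for older versions.)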
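
Before launching a browser, it can help to confirm that the driver executable is actually reachable through PATH. The sketch below uses only the standard library; the executable names are the usual ones for each browser’s driver, and find_driver is a hypothetical helper, not part of Selenium.

```python
import shutil

# Usual driver executable names (on Windows the .exe suffix is resolved for us).
DRIVER_EXECUTABLES = {
    "chrome": "chromedriver",
    "firefox": "geckodriver",
    "edge": "msedgedriver",
    "safari": "safaridriver",
}


def find_driver(browser):
    """Return the full path of the browser's driver if it is on PATH, else None."""
    executable = DRIVER_EXECUTABLES[browser]
    return shutil.which(executable)


# Report which drivers can be found from the current environment.
for browser in DRIVER_EXECUTABLES:
    path = find_driver(browser)
    print(f"{browser}: {path or 'driver not found on PATH'}")
```

If a driver you just downloaded is reported as missing, the folder that contains it is probably not on PATH yet, or the terminal needs to be restarted to pick up the change.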

Opening a browser and navigating around

We first need to import Selenium. Then, to open a specific web browser, we just construct the corresponding object:

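from selenium import webdriver

# Each call opens a browser window, so construct only the driver you need
chrome_driver = webdriver.Chrome()
firefox_driver = webdriver.Firefox()
edge_driver = webdriver.Edge()
opera_driver = webdriver.Opera()  # older Selenium only; removed from recent Selenium 4 releases
safari_driver = webdriver.Safari()  # macOS only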

For example, when we run driver = webdriver.Chrome(), a Chrome window opens immediately.

If we just want to scrape some data and don’t want a browser window to open, we can start the browser in headless mode:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)

The above code will open Chrome without the user interface, so we’ll not see anything on the screen.

Once a browser is open, navigating to a web address is as simple as calling the .get('https://www.example.com') method on a driver object.

driver = webdriver.Chrome()
driver.get('https://www.google.com')

We can also navigate back and forth through the browsing history using the .back() and .forward() methods.

For example:
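
The helper below is a minimal sketch of such a round trip. The name visit_and_go_back is hypothetical, and driver can be any driver object created as shown earlier (for instance webdriver.Chrome()).

```python
def visit_and_go_back(driver, first_url, second_url):
    """Open two pages in turn, then step back and forward through the history.

    Returns the URL the browser reports after each history move.
    """
    driver.get(first_url)
    driver.get(second_url)

    driver.back()  # back to the first page
    url_after_back = driver.current_url

    driver.forward()  # forward again to the second page
    url_after_forward = driver.current_url

    return url_after_back, url_after_forward


# Usage with a real browser (opens a window):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   print(visit_and_go_back(driver, "https://www.example.com", "https://www.google.com"))
```

The reported URLs may differ slightly from the ones passed in if a site redirects.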

Interacting with a web page

To interact with the web page, we first need to locate our desired element, then we can send commands to it.

Selenium has a range of methods we can use for identifying elements on a page.

Some of them are listed below. (Selenium 4.3 removed the older find_element_by_id(…)-style helpers; the same lookups are now written with find_element and a By locator, which also works in older versions.)

from selenium.webdriver.common.by import By

element = driver.find_element(By.ID, '…')
element = driver.find_element(By.NAME, '…')
element = driver.find_element(By.TAG_NAME, '…')
element = driver.find_element(By.CLASS_NAME, '…')
element = driver.find_element(By.CSS_SELECTOR, '…')

…and:

elements = driver.find_elements(By.ID, '…')
elements = driver.find_elements(By.NAME, '…')
elements = driver.find_elements(By.TAG_NAME, '…')
elements = driver.find_elements(By.CLASS_NAME, '…')
elements = driver.find_elements(By.CSS_SELECTOR, '…')

The first group of methods returns only the first element found that matches the criteria. If there is no such element, then a NoSuchElementException is raised.

The second group of methods returns a list of all the matched elements, or an empty list if there are none; no exception is raised in that case.

These methods can be called on either a driver object (what’s returned by webdriver.BrowserName()) or another element.

Among these lookups, the one that seems the most convenient to me (and to other people who know CSS) is the CSS selector one: By.CSS_SELECTOR, or .find_element(s)_by_css_selector in older Selenium versions. That’s because a single CSS selector can pack all the information the other lookups rely on to identify an element on a page: tag name, id, class name, attributes, and many other things. A CSS selectors cheat sheet can help if you need a refresher.

Once we have identified an element on the page, we can send it several commands. For example, we can click on it with element.click(), send keystrokes to it with element.send_keys(…), or clear a text input with element.clear().

.send_keys(…) can be used either to send a whole string at once (element.send_keys('Selenium is cool.')) or to send individual keys like arrow up/down (element.send_keys(Keys.ARROW_DOWN)). The Keys class is imported with from selenium.webdriver.common.keys import Keys.

Here is an example of scrolling with the arrow keys:

And below is an example of searching for something on Google:

You may wonder what driver.switch_to.frame(iframe) and driver.switch_to.default_content() are for in the code above. When you search for an element in a page with Selenium, it ignores what’s inside iframes. So you need to call driver.switch_to.frame(iframe) to be able to find elements inside that iframe. Then, once you have found what you were looking for, call driver.switch_to.default_content() to be able to find elements outside the previously selected iframe again.

The loop above is there to make it look like someone is typing into the text field, but we could just as well enter all the text at once with search_input.send_keys('python find in list').

By the way, the .send_keys() method also allows deleting text by sending keys like backspace: search_input.send_keys(Keys.BACKSPACE).

Extracting data from a web page

After we have located an element on the page, there are three main ways to extract data from it:

  • Through the element.text attribute, which returns the text that’s inside the HTML element.
  • By using element.get_attribute(attr_name), which returns the value of an attribute, e.g. href, class, name. If there is no attribute with the given name, it returns None.
  • By taking a screenshot of the element with element.screenshot(filename). filename should be the full path to your image file, including the .png extension. The method returns False if there is an IOError. If everything is OK, it returns True.
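
The three approaches can be combined in one small helper. This is a sketch under the assumption that element is an already located element; extract_from_element and the choice of attributes are only illustrative.

```python
def extract_from_element(element, screenshot_path=None):
    """Collect the text and a few common attributes of a located element.

    `screenshot_path`, if given, should be a full path ending in .png.
    """
    data = {
        "text": element.text,  # text inside the HTML element
        "href": element.get_attribute("href"),  # None if the attribute is absent
        "class": element.get_attribute("class"),
    }
    if screenshot_path is not None:
        # screenshot() returns True on success and False on an IOError
        data["screenshot_saved"] = element.screenshot(screenshot_path)
    return data


# Usage with a real browser:
#   link = driver.find_element(By.CSS_SELECTOR, "a")
#   print(extract_from_element(link, screenshot_path="/tmp/link.png"))
```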

As an example, below we will extract the page titles and URLs from the results of the Google search we saw above and put them into a pandas data frame.


That’s it for this article. I hope it gave you a good idea of what Selenium WebDriver is capable of.

I hope you found this information useful and thanks for reading!

This article is also posted on Medium here. Feel free to have a look!


Dorian

Passionate about Data Science, AI, Programming & Math
