How to interact with websites and extract data
Automating interaction with a web browser is exactly what Selenium WebDriver lets you do. This has many use cases, among them web application testing and web scraping, the latter being what data scientists care about most.
Selenium is a more powerful tool for web scraping than most. Other tools, like Beautiful Soup, are meant only for parsing a static HTML file and extracting data from it. If the web page you want to extract data from requires some interaction on your behalf to make its data available, such as logging in or scrolling down, then such a parsing tool is useless.
But Selenium is not just a parser. When you request a web page, it actually opens a web browser of your choice and goes to that address. Then you can do from code what you usually can do with a web browser: click stuff, type into text fields, scroll, extract data, etc.
Now, let’s go ahead and see how we can do these things with Selenium, and we’ll start with its installation.
Like most Python packages, the installation is quite straightforward:
```
pip install selenium
```

or, if you use conda:

```
conda install selenium
```
But besides installing the Python package, we also need to download a browser driver: an executable, provided by the browser's developers, that Selenium uses to control the browser. Browser drivers are available for the following web browsers: Chrome, Firefox, Edge, Opera, and Safari.
Before using a browser in Selenium, we first need to download the required browser driver, put it into a folder (e.g. C:\Users\username\browser_drivers), and add this folder to the PATH environment variable.
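If you'd rather not edit the system-wide PATH, one alternative is to extend PATH from within Python before constructing the driver. A minimal sketch (the folder path is only an example, adjust it to wherever you put your drivers):

```python
import os

# Append the folder containing the browser drivers to PATH,
# for this Python process only (example path, adjust to yours)
os.environ['PATH'] += os.pathsep + r'C:\Users\username\browser_drivers'
```

Selenium looks the driver executable up on PATH when the driver object is constructed, so this must run before webdriver.Chrome() is called.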
Opening a browser and navigating around
We first need to import Selenium; then, to open a specific web browser, we just construct the corresponding object:

```python
from selenium import webdriver

chrome_driver = webdriver.Chrome()
firefox_driver = webdriver.Firefox()
edge_driver = webdriver.Edge()
opera_driver = webdriver.Opera()
safari_driver = webdriver.Safari()
```
For example, when we run driver = webdriver.Chrome(), a Chrome window opens immediately.
If we just want to scrape some data and don't need a browser window open, we can run the browser in headless mode:
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)
```
The above code opens Chrome without the user interface, so nothing appears on the screen.
After we have opened a browser, navigating to a web address is as simple as calling the .get() method on the driver object:

```python
driver = webdriver.Chrome()
driver.get('https://www.google.com')
```
And we can navigate back and forth through our history using driver.back() and driver.forward().
Interacting with a web page
To interact with the web page, we first need to locate our desired element, then we can send commands to it.
Selenium has a range of methods we can use for identifying elements on a page.
Some of them are:
```python
element = driver.find_element_by_id('…')
element = driver.find_element_by_name('…')
element = driver.find_element_by_tag_name('…')
element = driver.find_element_by_class_name('…')
element = driver.find_element_by_css_selector('…')

elements = driver.find_elements_by_id('…')
elements = driver.find_elements_by_name('…')
elements = driver.find_elements_by_tag_name('…')
elements = driver.find_elements_by_class_name('…')
elements = driver.find_elements_by_css_selector('…')
```
The first group of methods returns only the first element found that matches the criteria. If there is no such element, a NoSuchElementException is raised.
The second group of methods returns a list of all the matched elements, or an empty list if there are none; no exception is raised.
These methods can be called either on a driver object (what's returned by webdriver.BrowserName()) or on another element.
Among these methods, the one that seems most convenient to me (and to other people who know CSS) is .find_element(s)_by_css_selector. That's because a single CSS selector can pack all the information needed to identify an element on a page, information we would otherwise have to spread across all the other methods. That is, in a CSS selector we can use tag name, id, class name, attributes, and many other things. Here is a cheat sheet on CSS selectors, in case you need it.
After we have identified an element on the page, we can send a couple of signals to it. To enumerate a few: we can click on it with element.click(), we can press keyboard keys on it with element.send_keys(…), or we can clear a text input with element.clear(). .send_keys(…) can be used either to send a whole string at once (element.send_keys('Selenium is cool.')) or to send individual keys like arrow up/down (element.send_keys(Keys.ARROW_DOWN), with Keys imported from selenium.webdriver.common.keys).
Here is an example of scrolling with arrow key:
And below is an example of searching something on google:
Now you are probably wondering about driver.switch_to.default_content() in the code above. When you search for an element on a page with Selenium, it ignores what's inside iframes. So you need to call driver.switch_to.frame(iframe) to be able to search for elements inside that iframe. Then, after you have found what you were looking for, driver.switch_to.default_content() needs to be called to be able to find elements outside the previously selected iframe again.
The loop above was made to make it look like someone is typing into the text field, but we could just as well enter all the text at once with search_input.send_keys('python find in list').
By the way, the .send_keys() method also allows deleting text by using keys like backspace:
Extracting data from a web page
After we have located an element on the page, there are mainly three ways we can extract data from it:
- Through the element.text attribute, which returns the text inside the HTML element.
- By using element.get_attribute(attr_name), which returns the value of an attribute, e.g. href, class, name. If there is no attribute with the given name, it returns None.
- By taking a screenshot of the element with element.screenshot(filename). filename should be the full path to your image file, including the .png extension. The method returns False if there is an IOError; if everything is OK, it returns True.
As an example, below we will extract the page titles and URLs of the google search we saw above and put them into a pandas data frame.
That’s it for this article. I hope it gave you a good idea of what Selenium WebDriver is capable of.
To learn more about web scraping, here is a great book that covers many aspects of scraping data from the modern web:
I hope you found this information useful and thanks for reading!
This article is also posted on Medium here. You can have a look!