Web Scraping Tutorial in Python – Part 2

28th March 2019


Welcome to the second installment in the web scraping series. Last week, we began to dive into how we can use Python to grab HTML pages and parse information from them.

However, last week we only covered cases where the elements we needed lived in static HTML. When a page adds a JavaScript layer or interactivity, web scraping becomes more difficult.

To overcome this, we’ll use a tool called Selenium. It is primarily used for automated website testing, but it is also great for web scraping because it can mimic human interactions such as clicking buttons or entering information into forms. As a simple example, I’ll demo how to use Selenium to search Google for us.

Installation

We can install the selenium package easily using pip. Additionally, you will need to install a webdriver for your browser – I use Chrome, which means having chromedriver available on your PATH.

pip install selenium

Using Selenium

Selenium is very intuitive and visual. Running the code below should open a new Chrome window on your computer and head to the web address provided.

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('http://www.google.com')

Chrome will probably let you know it’s being controlled by automated test software, but a new page should open to google.com. And that’s it! That’s how you can use Selenium to open any web page you like.

Button Clicking and Form Entry

Of course, we don’t intend to only open web pages but to perform tasks with them. Specifically, here we want to fill in a form on Google and press a button to search. Just like we did in the last blog post, we have to “inspect” the webpage to see which elements we need to target. This is easy to do in Chrome.

Right-clicking in the Chrome Browser and clicking “inspect” will provide html info

We can see the actual <input> tag where we type has name="q", which is enough info to find it with Selenium. Then all we have to do is send a RETURN keystroke, just as we would on a keyboard, to submit our search. Selenium has built-in functionality for reproducing these keystrokes.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

def search_google(query):
    browser = webdriver.Chrome()
    browser.get('http://www.google.com')

    # find the search bar by its name attribute
    search = browser.find_element(By.NAME, 'q')
    # fill in the form with our query
    search.send_keys(query)
    # press return/enter to submit the search
    search.send_keys(Keys.RETURN)

Now we have a function that searches Google for whatever query we send it. Try search_google("Jamie AI") to watch the browser automate your search, or use any other query you like.

Parsing the Results

Once we have the page, parsing the results is fairly easy. You can use the XPath selectors that Selenium works with natively, but from here I prefer to pass the HTML to BeautifulSoup. Let’s add a line to our function to extract the page_source attribute, which contains the HTML, and return it to the user.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

def search_google(query):
    browser = webdriver.Chrome()
    browser.get('http://www.google.com')

    # find the search bar by its name attribute
    search = browser.find_element(By.NAME, 'q')
    # fill in the form with our query
    search.send_keys(query)
    # press return/enter to submit the search
    search.send_keys(Keys.RETURN)

    # sleep for 5 seconds to give the results page time to load
    time.sleep(5)
    return browser.page_source

Now we just have to run the function and parse our results!

from bs4 import BeautifulSoup as bs

html = search_google("cats")
# create a bs object, using Python's built-in parser
soup = bs(html, "html.parser")
# find the link and text of the first result
soup.find("div", {"class": "r"})

And there you go. By adding a loop to go through each search result, or even each results page, you can automate the crawling of Google results for your enjoyment.
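Such a loop might look like the sketch below. The HTML here is a made-up snippet mimicking the result markup discussed above; Google changes its real markup often, so the "r" class (and the structure inside it) should be treated as an assumption to verify in your own inspector.

```python
from bs4 import BeautifulSoup as bs

# made-up sample of the results markup; real Google HTML will differ
sample_html = """
<div class="r"><a href="https://example.com/1">First result</a></div>
<div class="r"><a href="https://example.com/2">Second result</a></div>
"""

soup = bs(sample_html, "html.parser")

# collect (text, link) pairs from every result div
results = []
for div in soup.find_all("div", {"class": "r"}):
    link = div.find("a")
    results.append((link.get_text(), link["href"]))

print(results)
```

Swap `sample_html` for the page source returned by search_google and the same loop walks the live results.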

Read also: Web Scraping Tutorial in Python – Part 1


Kyle Gallatin

A data scientist with an MS in molecular biology. He currently helps deploy AI and technology solutions within Pfizer’s Artificial Intelligence Center of Excellence using Python and other computer science frameworks.
