Web Scraping Tutorial in Python – Part 3

19th April 2019


So far we’ve used Python’s requests library to crawl static pages, and the Selenium framework to make more complex requests. However, neither of these tools is really suited to generalization or scale. What can we do with sites that don’t lend themselves to easy, rule-based crawling?

In this post, we’ll scrape an entire domain using one of the most popular (and powerful) Python web scraping frameworks: scrapy.

Installation

Installation is easy: simply use pip to install the package.

pip install scrapy
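If you want to confirm the install worked, scrapy comes with a command-line tool you can check right from the terminal:

scrapy version

This should print the version of scrapy you just installed.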

Starting a Scrapy Project

Unlike the previous packages we’ve worked with, scrapy is more extensive and allows you to create an entire project. Once you have it installed on your system, open a terminal and cd into a directory where you want to store the code for your scraping project. Then use the command line to start your scrapy project.

scrapy startproject <yourprojectname>

You may choose whatever you like to replace <yourprojectname>, depending on what you want to do with the spider. I chose domain_scraper, since we’ll be using this scraper to crawl entire domains.
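For example, using the project name from this post, the command looks like this:

scrapy startproject domain_scraper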

When you start a scrapy project, it will spawn a directory with the following contents:

domain_scraper/
    scrapy.cfg            # deploy configuration file
    domain_scraper/       # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # directory for spiders
            __init__.py

Each of the files in our project’s Python module serves a different function. items.py is where we define the fields we want to extract. pipelines.py defines where we send the data we crawl from the website. settings.py and middlewares.py can be used to fine-tune how the spider will crawl the web. The spiders/ directory comes empty, but it’s where we’ll put our own spiders.py file.

For a very basic tutorial on using scrapy, you can refer to the official documentation they provide: https://docs.scrapy.org/en/latest/intro/tutorial.html. This example scrapes the quotes.toscrape.com website we crawled in our first demo.

items.py

Our items are the objects we want to scrape from each page we crawl. To keep it simple, let’s stick with just the title, body (the raw HTML), and url of each of the pages. We use scrapy’s built-in classes to define our items.

from scrapy import Item, Field

class DomainScraperItem(Item):
    # One field for each piece of data we want to collect from a page
    url = Field()
    title = Field()
    body = Field()
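Once defined, a scrapy Item behaves much like a Python dictionary. A quick sketch of populating one by hand (the values here are just placeholders):

item = DomainScraperItem()
item['url'] = 'http://quotes.toscrape.com'  # placeholder values for illustration
item['title'] = 'Quotes to Scrape'
print(item['title'])  # fields can be read back just like dict keys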

spiders.py

The scrapy spider is analogous to the brain of our web scraper: this is where all of the crawling logic lives. Fortunately, scrapy comes with some convenient classes like CrawlSpider which can be used to traverse domains pretty effectively. Create a spiders.py file in the spiders/ directory and fill it with the following content:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from domain_scraper.items import DomainScraperItem

class DomainSpider(CrawlSpider):

    name = "domain_spider"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com"]
    rules = (
        Rule(LinkExtractor(), callback='parse_response', follow=True),
    )

    def parse_response(self, response):
        # Fill in one item per crawled page
        item = DomainScraperItem()
        item['title'] = response.xpath('/html/head/title/text()').extract_first()
        item['body'] = response.body
        item['url'] = response.url
        yield item

A few key things here:

  1. We named our spider domain_spider, which is the name we’ll use to run it.
  2. allowed_domains specifies which domains your spider is allowed to visit when it finds links, and start_urls is a list of starting points for the crawl.
  3. The rules tuple defines how we’ll crawl the pages: LinkExtractor() extracts links from each page to follow, and each response is passed to our parse_response callback (see the sketch after this list for how to restrict which links get followed).
  4. Our parse_response function fills in the fields we defined in our item class using simple attributes of each response and yields the item (where it will be passed to our pipeline).
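As an example of point 3, LinkExtractor accepts allow and deny patterns if you only want to follow certain kinds of links. A minimal sketch (the regular expressions are hypothetical examples, not part of the spider above):

rules = (
    # Only follow links to tag pages, and never follow the login page
    Rule(
        LinkExtractor(allow=(r'/tag/',), deny=(r'/login',)),
        callback='parse_response',
        follow=True,
    ),
)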

pipelines.py

from scrapy.exporters import CsvItemExporter

class DomainScraperPipeline(object):
    def __init__(self):
        self.filename = 'pages.csv'

    def open_spider(self, spider):
        # Open the output file and start the CSV exporter when the spider starts
        self.csvfile = open(self.filename, 'wb')
        self.exporter = CsvItemExporter(self.csvfile)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        # Finish exporting and close the file when the spider shuts down
        self.exporter.finish_exporting()
        self.csvfile.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields
        self.exporter.export_item(item)
        return item

Our pipelines file describes what happens to our data as it’s crawled. Here, we use scrapy’s CsvItemExporter class to write the data we scrape to a CSV file. The name of the file is set in the pipeline’s __init__ method.
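If you want control over the column order in the output, CsvItemExporter also accepts a fields_to_export argument. A small sketch of how the exporter line in open_spider could look:

self.exporter = CsvItemExporter(self.csvfile, fields_to_export=['url', 'title', 'body'])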

settings.py

Open up the settings.py file to see what it contains. There will be a bunch of commented-out and preset options for your crawler. It would take a while to get into all of these; the only key thing you have to do here is uncomment the ITEM_PIPELINES section so that it activates the pipeline we coded in our pipelines file.

ITEM_PIPELINES = {
   'domain_scraper.pipelines.DomainScraperPipeline': 300,
}
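While you’re in settings.py, a couple of the other options are worth knowing about. A sketch of two commonly tweaked settings (the delay value is just an example):

# Respect robots.txt rules (enabled by default in new scrapy projects)
ROBOTSTXT_OBEY = True

# Wait a second between requests to go easy on the site
DOWNLOAD_DELAY = 1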

Then that’s it! Just go to the top-level directory of the project and run your spider:

scrapy crawl domain_spider

Scrapy should start up, rip through the domain provided and spit out a pages.csv file with all the data we wanted!
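To sanity-check the output, you can load the CSV back in. This snippet assumes you have pandas installed; it isn’t part of the scrapy project itself:

import pandas as pd

# Load the exported pages and take a quick look
pages = pd.read_csv('pages.csv')
print(pages.shape)
print(pages['title'].head())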

Next Steps

Try playing around with different allowed domains and start urls. Additionally, try editing the parse_response function in the scrapy spider to grab more complex fields. Happy scraping!

Read also: Web Scraping Tutorial in Python – Part 2 | Web Scraping Tutorial in Python – Part 1


Kyle Gallatin

A data scientist with an MS in molecular biology. He currently helps deploy AI and technology solutions within Pfizer’s Artificial Intelligence Center of Excellence using Python and other frameworks.
