Introduction:

This is the #5 post of my Scrapy Tutorial Series. In this tutorial, I will talk about how to create a Scrapy project and a Scrapy spider, and I will also show you how to use some basic Scrapy commands.

You can get the source code of this project at the end of this tutorial.

Scrapy Commands

First, let's take a quick look at the Scrapy commands to get a general impression; we will learn more about them later. Type scrapy in the terminal, and you will see the output below.

$ scrapy
Scrapy 1.4.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

As you can see, this is a short list of Scrapy commands. If you want details about any of them, just run scrapy <command> -h. In this tutorial, we will use startproject and genspider to create the project and spider files, and then I will show you how to use the shell and fetch commands to test your code.
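For example, the shell command opens an interactive console loaded with the response of a URL, which is handy for testing selectors before putting them into a spider (the session below is illustrative; your exact output may differ):

$ scrapy shell http://quotes.toscrape.com/
>>> response.status
200
>>> response.xpath("//title/text()").extract_first()
'Quotes to Scrape'
>>> exit()

Similarly, scrapy fetch http://quotes.toscrape.com/ downloads the page using the Scrapy downloader and prints the body to standard output, so you can check what Scrapy actually receives.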

Create a Simple Scrapy Project

Now let's create a new Scrapy project from scratch.

$ scrapy startproject scrapy_spider

A project named scrapy_spider has now been created, and we can follow the output's suggestion to use genspider to generate a spider for us:

You can start your first spider with:
    cd scrapy_spider
    scrapy genspider example example.com

Now you have a Scrapy project which contains a spider named example. Let's take a look at the project directory.

├── scrapy.cfg                # deploy configuration file
└── scrapy_spider             # project's Python module, you'll import your code from here
    ├── __init__.py
    ├── items.py              # project items definition file
    ├── middlewares.py        # project middlewares file
    ├── pipelines.py          # project pipeline file
    ├── settings.py           # project settings file
    └── spiders               # a directory where spiders are located
        ├── __init__.py
        └── example.py        # spider we just created
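Most of these files start out as near-empty skeletons. To give one example of what goes where: items.py is where you could later define a structured container for your scraped data. The class and field names below are just an illustration, not something the generated project already contains:

import scrapy

class QuoteItem(scrapy.Item):
    # example item definition: each field is declared with scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()

We will not need a custom item in this simple tutorial, since a spider can also yield plain dicts.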

Our first Scrapy spider

Now let's create a Scrapy spider to scrape something for us. Our target website is Quotes to Scrape, and we plan to crawl all the quotes on the page. First, we need to create a new Scrapy spider.

# Usage: scrapy genspider [options] <name> <domain> 

$ scrapy genspider quotes_spider quotes.toscrape.com

We just used genspider to create a spider; the parameters you passed in show up in the generated Python file. Let's see what is in scrapy_spider/spiders/quotes_spider.py:

# -*- coding: utf-8 -*-
import scrapy

class QuotesSpiderSpider(scrapy.Spider):
    name = 'quotes_spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass

As you can see, the spider file contains some important attributes:

  1. name: identifies the spider. It must be unique within a project.

  2. start_urls: the list of URLs the spider starts crawling from (see the sketch after this list for how Scrapy uses them).

  3. allowed_domains: an optional safeguard that is useful for broad crawls; requests to URLs whose domain is not in this list are filtered out.

  4. parse: a method that will be called to handle the response downloaded for each of the requests made.
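Regarding how these attributes work together: Scrapy's default start_requests method turns start_urls into the initial requests and sends each response to parse. A rough sketch of that default behavior (simplified; the real implementation lives in scrapy.Spider) looks like this:

def start_requests(self):
    # simplified version of what scrapy.Spider does by default:
    # yield one request per start URL and route each response
    # to the parse callback
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse)

You normally do not need to override this; it is shown here only to explain where start_urls is consumed.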

Now let's write some code in parse to crawl something for us.

# -*- coding: utf-8 -*-
import scrapy

class QuotesSpiderSpider(scrapy.Spider):
    name = 'quotes_spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.xpath("//div[@class='quote']//span[@class='text']/text()").extract()
        yield {'quotes': quotes}

We use an XPath expression to extract the data, which will be covered in more detail later in this series. Before running the full crawl, it is worth verifying the expression in the Scrapy shell.
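Here is what that check might look like (output truncated; the quotes shown may change if the site is updated):

$ scrapy shell http://quotes.toscrape.com/
>>> quotes = response.xpath("//div[@class='quote']//span[@class='text']/text()").extract()
>>> len(quotes)
10
>>> quotes[0]
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

Once the expression returns what you expect, run the spider to crawl the data.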

$ scrapy crawl quotes_spider

# Then you can see the valid data in the terminal output
[scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
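The spider above yields all the quotes on the page as a single item. As a small variation you could try yourself, iterating over each quote block pairs every quote with its author; this sketch is based on the page's current markup, and the dictionary keys are my own choice:

def parse(self, response):
    # yield one item per quote: loop over each div.quote block and
    # use relative XPath expressions for the text and the author
    for quote in response.xpath("//div[@class='quote']"):
        yield {
            'text': quote.xpath(".//span[@class='text']/text()").extract_first(),
            'author': quote.xpath(".//small[@class='author']/text()").extract_first(),
        }

Either way, Scrapy's built-in feed exports can save the scraped items to a file, for example scrapy crawl quotes_spider -o quotes.json.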

Conclusion

In this Scrapy tutorial, we successfully created a Scrapy project and a Scrapy spider using some basic Scrapy commands, and we now have a spider that can crawl data for us.

To help you focus on the key parts, I only paste part of the source code instead of the whole file in this tutorial. If you want source code that runs directly in your local environment, just:

git clone [email protected]:michael-yin/scrapy-spider-example.git

cd scrapy-spider-example

git checkout ffc4cc3

# If you star my project, I would appreciate it