class: center, middle

.center[![Python](http://m1.paperblog.com/i/201/2016454/guia-python-conceptos-programacion-atributos--L-DTucOw.png)]

# Web Scraping in Python: Scrapy

March 14, 2018
Instructor: [S. M. Masoud Sadrnezhaad](https://twitter.com/smmsadrnezh)

---

Introduction
==========

- Web scraping, often called web crawling or web spidering, or "programmatically going over a collection of web pages and extracting data," is a powerful tool for working with data on the web.
- With a web scraper, you can mine data about a set of products, get a **large corpus of text** or quantitative data to play around with, get data from a site **without an official API**, or just satisfy your own personal curiosity.
- You'll learn about the fundamentals of the scraping and spidering process as you explore a playful data set.
- We'll use [BrickSet](https://brickset.com/), a community-run site that contains information about LEGO sets. By the end of this tutorial, you'll have a fully functional Python web scraper that walks through a series of pages on Brickset, extracts data about LEGO sets from each page, and displays the data on your screen.
- The scraper will be easily expandable, so you can tinker with it and use it as a foundation for your own projects that scrape data from the web.

---

Creating a Basic Scraper
==========

- Scraping is a two-step process:
  - You systematically find and download web pages.
  - You take those web pages and extract information from them.
- Both of those steps can be implemented in a number of ways in many languages.
- You can build a scraper **from scratch** using modules or libraries provided by your programming language, but then you have to deal with some potential headaches as your scraper grows **more complex**.
- For example, you'll need to **handle concurrency** so you can crawl more than one page at a time.
- You'll probably want to figure out how to transform your scraped **data into different formats like CSV, XML, or JSON**.
- And you'll sometimes have to deal with sites that **require specific settings** and access patterns.
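- To make the "from scratch" option concrete, here is a minimal sketch using only Python's standard library; the HTML fragment and the `LinkExtractor` name are made up for illustration:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag we encounter."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<li><a href="/sets/1">Set 1</a></li>'
            '<li><a href="/sets/2">Set 2</a></li>')
print(parser.links)  # ['/sets/1', '/sets/2']
```

- Even this toy version needs a custom class and manual state tracking; downloading, concurrency, and output formats would all be extra work on top of it.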
---

Creating a Basic Scraper (Contd)
==========

- You'll have better luck if you build your scraper on top of an **existing library** that handles those issues for you.
- We're going to use Python and Scrapy to build our scraper.
- Scrapy is one of the **most popular and powerful** Python scraping libraries; it takes a "batteries included" approach to scraping, meaning that it handles a lot of the common functionality that all scrapers need so developers **don't have to reinvent the wheel** each time.
- It makes scraping a quick and fun process!
- Scrapy, like most Python packages, **is on PyPI**.
- PyPI, the Python Package Index, is a community-owned repository of all published Python software.
- You can install Scrapy with pip using the following command:

```bash
pip install scrapy
```

---

Creating a Basic Scraper (Contd)
==========

- With Scrapy installed, let's create a new folder for our project.
- You can do this in the terminal by running:

```bash
mkdir brickset-scraper
```

- Now, navigate into the new directory you just created:

```bash
cd brickset-scraper
```

- Then create a new Python file for our scraper called `scraper.py`.
- We'll place all of our code in this file for this tutorial.
- You can create this file in the terminal with the `touch` command, like this:

```bash
touch scraper.py
```

- Or you can create the file using your text editor or graphical file manager.

---

Creating a Basic Scraper (Contd)
==========

- We'll start by making a very basic scraper that uses Scrapy as its foundation.
- To do that, we'll create a Python class that subclasses `scrapy.Spider`, a basic spider class provided by Scrapy.
- This class will have two required attributes:
  - `name` — just a name for the spider.
  - `start_urls` — a list of URLs that you start to crawl from. We'll start with one URL.
- Open the `scraper.py` file in your text editor and add this code to create the basic spider:

```python
import scrapy


class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']
```

---

Creating a Basic Scraper (Contd)
==========

- Let's break this down line by line:
- First, we import `scrapy` so that we can use the classes that the package provides.
- Next, we take the `Spider` class provided by Scrapy and make a subclass out of it called `BrickSetSpider`.
- The `Spider` subclass has methods and behaviors that define how to follow URLs and extract data from the pages it finds, but it doesn't know where to look or what data to look for.
- By subclassing it, we can give it that information.
- Then we give the spider the name `brickset_spider`.
- Finally, we give our scraper a single URL to start from: http://brickset.com/sets/year-2016.
- If you open that URL in your browser, it will take you to a search results page, showing the first of many pages containing LEGO sets.

---

Creating a Basic Scraper (Contd)
==========

- You typically run Python files with a command like `python path/to/file.py`.
- However, Scrapy comes with its own command-line interface to streamline the process of starting a scraper.
- Start your scraper with the following command:

```bash
scrapy runspider scraper.py
```

- That's a lot of output, so let's break it down.
- The scraper **initialized and loaded** additional components and extensions it needed to handle reading data from URLs.
- It used the URL we provided in the **`start_urls` list** and **grabbed the HTML**, just like your web browser would do.
- It **passed that HTML to the `parse` method**, which doesn't do anything by default. Since we never **wrote our own `parse` method**, the spider just finishes without doing any work.

---

Extracting Data from a Page
==========

- Now let's pull some data from the page.
- We've created a very basic program that pulls down a page, but it doesn't do any scraping or spidering yet.
- Let's give it some data to extract.
- If you look at the page we want to scrape, you'll see it has the following structure:
  - There's a header that's present on every page.
  - There's some top-level search data, including the number of matches, what we're searching for, and the breadcrumbs for the site.
  - Then there are the sets themselves, displayed in what looks like a table or ordered list.
  - Each set has a similar format.
- When writing a scraper, it's a good idea to look at the source of the HTML file and familiarize yourself with the structure.

---

Extracting Data from a Page (Contd)
==========

```python
class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1 a ::text'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
            }
```

- This code grabs all the sets on the page and loops over them to extract the data.
- We append `::text` to our selector for the name. That's a CSS pseudo-selector that fetches the text inside the `a` tag rather than the tag itself.
- We call `extract_first()` on the object returned by `brickset.css(NAME_SELECTOR)` because we just want the first element that matches the selector. This gives us a string, rather than a list of elements.

---

Extracting Data from a Page (Contd)
==========

- Save the file and run the scraper again:

```bash
scrapy runspider scraper.py
```

- This time you'll see the names of the sets appear in the output.
- Keep expanding on this by adding new selectors for the other fields you want to extract.

---

Crawling Multiple Pages
==========

- We've successfully extracted data from that initial page, but we're not progressing past it to see the rest of the results.
- The whole point of a spider is to detect and traverse links to other pages and grab data from those pages too.
- You'll notice that the top and bottom of each page has a little right caret (›) that links to the next page of results.

```html
...
<li class="next">
  <a href="http://brickset.com/sets/year-2016/page-2">›</a>
</li>
<li class="last">
  <a href="...">»</a>
</li>
```

- As you can see, there's a `li` tag with the class `next`, and inside that tag, there's an `a` tag with a link to the next page.

---

Crawling Multiple Pages (Contd)
==========

- All we have to do is tell the scraper to follow that link if it exists.
- Modify your code as follows:

```python
class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1 a ::text'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
            }

        NEXT_PAGE_SELECTOR = '.next a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )
```

---

Crawling Multiple Pages (Contd)
==========

- First, we define a selector for the "next page" link, extract the first match, and check if it exists.
- The `scrapy.Request` is a value that we return, saying "Hey, crawl this page", and `callback=self.parse` says "once you've gotten the HTML from this page, pass it back to this method so we can parse it, extract the data, and find the next page."
- This means that once we go to the next page, we'll look for a link to the next page there, and on that page we'll look for a link to the next page, and so on, until we don't find a link for the next page.
- This is the key piece of web scraping: finding and following links.

---

Crawling Multiple Pages (Contd)
==========

- In this example, it's very linear; one page has a link to the next page until we've hit the last page, but you could follow links to tags, or other search results, or any other URL you'd like.
- Now, if you save your code and run the spider again, you'll see that it doesn't just stop once it iterates through the first page of sets.
- It keeps on going through all 779 matches on 23 pages!
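- A note on the `response.urljoin` call above: it resolves a possibly relative link against the current page's URL, with the same semantics as the standard library's `urljoin`:

```python
from urllib.parse import urljoin

# response.urljoin(next_page) behaves like urljoin(response.url, next_page).
print(urljoin('http://brickset.com/sets/year-2016',
              '/sets/year-2016/page-2'))
```

- This is why the spider works even if the site's "next" link is a relative path rather than a full URL.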
- In the grand scheme of things it's not a huge chunk of data, but now you know the process by which you automatically find new pages to scrape.

---

# References

* https://www.digitalocean.com/community/tutorials/how-to-crawl-a-web-page-with-scrapy-and-python-3

---

class: center, middle

.center[![Python](http://m1.paperblog.com/i/201/2016454/guia-python-conceptos-programacion-atributos--L-DTucOw.png)]

# Thank you. Any questions?