Scrapy Short Tutorial

Extracting data

Start the Scrapy shell:

scrapy shell 'http://quotes.toscrape.com/page/1/'

The output will look something like this:

[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

Finding elements with CSS selectors

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

>>> response.css('title::text').getall()
['Quotes to Scrape']

>>> response.css('title').getall()
['<title>Quotes to Scrape</title>']

Get the first element:

>>> response.css('title::text').get()
'Quotes to Scrape'

or, equivalently, by indexing the selector list:

>>> response.css('title::text')[0].get()
'Quotes to Scrape'

Methods: getall() returns a list of all matches, get() returns a single (the first) match, re() extracts matches with a regular expression.

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
 
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
 
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
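
The three re() results above differ because of capture groups: with no groups the whole match is returned, and with groups each captured group becomes its own list item. A minimal sketch of that behavior using only the standard library's re module follows. The helper name selector_re is hypothetical and is not part of Scrapy; it only approximates what re() does for a single text node, so it can be run without the Scrapy shell.

```python
import re


def selector_re(pattern, text):
    """Hypothetical helper (not Scrapy API): find all matches in text
    and flatten capture groups into one flat list of strings."""
    results = []
    for match in re.findall(pattern, text):
        if isinstance(match, tuple):
            # Several capture groups: findall yields a tuple per match.
            results.extend(match)
        else:
            # No groups (or one group): findall yields a plain string.
            results.append(match)
    return results


title = "Quotes to Scrape"
print(selector_re(r"Quotes.*", title))        # whole match, no groups
print(selector_re(r"Q\w+", title))            # whole match, no groups
print(selector_re(r"(\w+) to (\w+)", title))  # two groups, flattened
```

In the shell itself, response.css('title::text').re(...) remains the thing to call; the sketch only shows why the last pattern yields two items instead of one.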

XPath

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
 
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'
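
To experiment with the same lookup outside the Scrapy shell, the sketch below uses the standard library's xml.etree.ElementTree, which supports only a limited XPath subset. Assumptions: the markup is a small, well-formed stand-in for the real page, './/title' plays the role of '//title', and the .text attribute replaces the text() step, which ElementTree does not support. The function name title_texts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified stand-in for the page fetched in the shell.
MARKUP = (
    "<html><head><title>Quotes to Scrape</title></head>"
    "<body></body></html>"
)


def title_texts(markup):
    """Return the text of every <title> element, similar in spirit
    to response.xpath('//title/text()').getall()."""
    root = ET.fromstring(markup)
    # './/title' selects all <title> descendants of the root element.
    return [element.text for element in root.findall(".//title")]


texts = title_texts(MARKUP)
print(texts)                         # all matches, like getall()
print(texts[0] if texts else None)   # first match or None, like get()
```

Real pages are rarely well-formed XML, so this approach is suitable only for quick experiments; Scrapy's own selectors handle ordinary HTML.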
