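Trying out scraping with Scrapy
Description
At a glance, the spider extracts elements from the page source by tag, then follows the link in markup like the one below to reach the next page.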
<li class="next">
<a href="/tag/humor/page/2/">Next <span aria-hidden="true">→</span></a>
</li>
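To make the "follow the next link" step concrete, here is a minimal standard-library sketch of the same idea. It is not how Scrapy works internally; the class `NextLinkParser` and the function `find_next_url` are hypothetical names invented for this illustration. The sketch finds the first `href` inside `<li class="next">` and resolves it against the current page URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Collects the href of the first <a> found inside <li class="next">."""

    def __init__(self):
        super().__init__()
        self._in_next_item = False
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "li" and "next" in (attr_map.get("class") or "").split():
            self._in_next_item = True
        elif tag == "a" and self._in_next_item and self.next_href is None:
            self.next_href = attr_map.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_next_item = False


def find_next_url(page_url, html):
    """Return the absolute URL of the next page, or None when there is no next link."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    # The href is site-relative, so resolve it against the current page URL.
    return urljoin(page_url, parser.next_href)


snippet = """
<li class="next">
<a href="/tag/humor/page/2/">Next <span aria-hidden="true">→</span></a>
</li>
"""
next_url = find_next_url("https://quotes.toscrape.com/tag/humor/", snippet)
print(next_url)
```

When the page has no `li.next` element, `find_next_url` returns `None`, which is the condition a crawler would use to stop paginating.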
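Thoughts
Ah, it really is this easy... Once you understand how to use the methods and how the target site is structured, it seems you could collect data in just about any way you like.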
Implementation
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Each quote sits in its own <div class="quote"> block.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        # Follow the "Next" link; a page without one ends the crawl.
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Result
[
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen"},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
{"text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d", "author": "Garrison Keillor"},
{"text": "\u201cBeauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.\u201d", "author": "Jim Henson"},
{"text": "\u201cAll you need is love. But a little chocolate now and then doesn't hurt.\u201d", "author": "Charles M. Schulz"},
{"text": "\u201cRemember, we're madly in love, so it's all right to kiss me anytime you feel like it.\u201d", "author": "Suzanne Collins"},
{"text": "\u201cSome people never go crazy. What truly horrible lives they must lead.\u201d", "author": "Charles Bukowski"},
{"text": "\u201cThe trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.\u201d", "author": "Terry Pratchett"},
{"text": "\u201cThink left and think right and think low and think high. Oh, the thinks you can think up if only you try!\u201d", "author": "Dr. Seuss"},
{"text": "\u201cThe reason I talk to myself is because I\u2019m the only one whose answers I accept.\u201d", "author": "George Carlin"},
{"text": "\u201cI am free of all prejudice. I hate everyone equally. \u201d", "author": "W.C. Fields"},
{"text": "\u201cA lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.\u201d", "author": "Jane Austen"}
]
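The `\u201c` and `\u201d` sequences in the output are JSON escapes for the curly quotation marks on the page, not corrupted text. The short sketch below, which uses only Python's `json` module and one line copied from the output, shows that loading the JSON restores the original characters. The variable names are chosen for this illustration only.

```python
import json

# One item copied from the output; the raw string keeps the \u escapes as written.
raw = r'[{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"}]'

items = json.loads(raw)
first = items[0]
print(first["text"])
print(first["author"])

# ensure_ascii=False writes the curly quotes as literal characters instead of escapes.
readable = json.dumps(items, ensure_ascii=False)
print(readable)
```

If unescaped output is preferred straight from the crawl, Scrapy's `FEED_EXPORT_ENCODING` setting can be set to `utf-8`. The post does not say how the output file was produced, so treat that as a suggestion rather than a description of what was done here.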
Source of the target site
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Quotes to Scrape</title>
<link rel="stylesheet" href="/static/bootstrap.min.css">
<link rel="stylesheet" href="/static/main.css">
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<h3>Viewing tag: <a href="/tag/humor/page/1/">humor</a></h3>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope itemtype="https://schema.org/CreativeWork">
<span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>
<span>by <small class="author" itemprop="author">Jane Austen</small>
<a href="/author/Jane-Austen">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor" / >
<a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>
<a class="tag" href="/tag/books/page/1/">books</a>
<a class="tag" href="/tag/classic/page/1/">classic</a>
<a class="tag" href="/tag/humor/page/1/">humor</a>
</div>
</div>
<div class="quote" itemscope itemtype="https://schema.org/CreativeWork">
<span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>
<span>by <small class="author" itemprop="author">Steve Martin</small>
<a href="/author/Steve-Martin">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="humor,obvious,simile" / >
<a class="tag" href="/tag/humor/page/1/">humor</a>
<a class="tag" href="/tag/obvious/page/1/">obvious</a>
<a class="tag" href="/tag/simile/page/1/">simile</a>
</div>
</div>
<div class="quote" itemscope itemtype="https://schema.org/CreativeWork">
<span class="text" itemprop="text">“Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.”</span>
<span>by <small class="author" itemprop="author">Garrison Keillor</small>
<a href="/author/Garrison-Keillor">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="humor,religion" / >
<a class="tag" href="/tag/humor/page/1/">humor</a>
<a class="tag" href="/tag/religion/page/1/">religion</a>
</div>
</div>
<div class="quote" itemscope itemtype="https://schema.org/CreativeWork">
<span class="text" itemprop="text">“Beauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.”</span>
<span>by <small class="author" itemprop="author">Jim Henson</small>
<a href="/author/Jim-Henson">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="humor" / >
<a class="tag" href="/tag/humor/page/1/">humor</a>
</div>
</div>
<div class="quote" itemscope itemtype="https://schema.org/CreativeWork">
<span class="text" itemprop="text">“All you need is love. But a little chocolate now and then doesn't hurt.”</span>
<span>by <small class="author" itemprop="author">Charles M. Schulz</small>
<a href="/author/Charles-M-Schulz">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="chocolate,food,humor" / >
<a class="tag" href="/tag/chocolate/page/1/">chocolate</a>
<a class="tag" href="/tag/food/page/1/">food</a>
<a class="tag" href="/tag/humor/page/1/">humor</a>
</div>
</div>
<div class="quote" itemscope itemtype="https://schema.org/CreativeWork">
<span class="text" itemprop="text">“Remember, we're madly in love, so it's all right to kiss me anytime you feel like it.”</span>
<span>by <small class="author" itemprop="author">Suzanne Collins</small>
<a href="/author/Suzanne-Collins">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="humor" / >
<a class="tag" href="/tag/humor/page/1/">humor</a>
</div>
</div>
<div class="quote" itemscope itemtype="https://schema.org/CreativeWork">
<span class="text" itemprop="text">“Some people never go crazy. What truly horrible lives they must lead.”</span>
<span>by <small class="author" itemprop="author">Charles Bukowski</small>
<a href="/author/Charles-Bukowski">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="humor" / >
<a class="tag" href="/tag/humor/page/1/">humor</a>
</div>
</div>
<div class="quote" itemscope itemtype="https://schema.org/CreativeWork">
<span class="text" itemprop="text">“The trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.”</span>
<span>by <small class="author" itemprop="author">Terry Pratchett</small>
<a href="/author/Terry-Pratchett">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="humor,open-mind,thinking" / >
<a class="tag" href="/tag/humor/page/1/">humor</a>
<a class="tag" href="/tag/open-mind/page/1/">open-mind</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
</div>
</div>
<div class="quote" itemscope itemtype="https://schema.org/CreativeWork">
<span class="text" itemprop="text">“Think left and think right and think low and think high. Oh, the thinks you can think up if only you try!”</span>
<span>by <small class="author" itemprop="author">Dr. Seuss</small>
<a href="/author/Dr-Seuss">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="humor,philosophy" / >
<a class="tag" href="/tag/humor/page/1/">humor</a>
<a class="tag" href="/tag/philosophy/page/1/">philosophy</a>
</div>
</div>
<div class="quote" itemscope itemtype="https://schema.org/CreativeWork">
<span class="text" itemprop="text">“The reason I talk to myself is because I’m the only one whose answers I accept.”</span>
<span>by <small class="author" itemprop="author">George Carlin</small>
<a href="/author/George-Carlin">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="humor,insanity,lies,lying,self-indulgence,truth" / >
<a class="tag" href="/tag/humor/page/1/">humor</a>
<a class="tag" href="/tag/insanity/page/1/">insanity</a>
<a class="tag" href="/tag/lies/page/1/">lies</a>
<a class="tag" href="/tag/lying/page/1/">lying</a>
<a class="tag" href="/tag/self-indulgence/page/1/">self-indulgence</a>
<a class="tag" href="/tag/truth/page/1/">truth</a>
</div>
</div>
<nav>
<ul class="pager">
<li class="next">
<a href="/tag/humor/page/2/">Next <span aria-hidden="true">→</span></a>
</li>
</ul>
</nav>
</div>
<div class="col-md-4 tags-box">
<h2>Top Ten tags</h2>
<span class="tag-item">
<a class="tag" style="font-size: 28px" href="/tag/love/">love</a>
</span>
<span class="tag-item">
<a class="tag" style="font-size: 26px" href="/tag/inspirational/">inspirational</a>
</span>
<span class="tag-item">
<a class="tag" style="font-size: 26px" href="/tag/life/">life</a>
</span>
<span class="tag-item">
<a class="tag" style="font-size: 24px" href="/tag/humor/">humor</a>
</span>
<span class="tag-item">
<a class="tag" style="font-size: 22px" href="/tag/books/">books</a>
</span>
<span class="tag-item">
<a class="tag" style="font-size: 14px" href="/tag/reading/">reading</a>
</span>
<span class="tag-item">
<a class="tag" style="font-size: 10px" href="/tag/friendship/">friendship</a>
</span>
<span class="tag-item">
<a class="tag" style="font-size: 8px" href="/tag/friends/">friends</a>
</span>
<span class="tag-item">
<a class="tag" style="font-size: 8px" href="/tag/truth/">truth</a>
</span>
<span class="tag-item">
<a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>
</span>
</div>
</div>
</div>
<footer class="footer">
<div class="container">
<p class="text-muted">
Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>
</p>
<p class="copyright">
Made with <span class='sh-red'>❤</span> by <a href="https://scrapinghub.com">Scrapinghub</a>
</p>
</div>
</footer>
</body>
</html>