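Trying out scraping with Scrapy

Explanation

At a glance, the spider picks the tags it wants out of the page source and then moves on to the next page, using the link found in markup like the following.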

<li class="next">
    <a href="/tag/humor/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
</li>
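
To make the "move on to the next link" step concrete, here is a minimal sketch that uses only the Python standard library in place of Scrapy's selectors. The class `NextLinkParser` and the function `next_page_url` are hypothetical names made up for this illustration. The sketch pulls the `href` out of the pager markup above and resolves it against the current page URL, which is roughly the job that `response.follow` takes care of in the spider.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# The pager markup shown above, copied as a string for the example.
PAGER_HTML = """
<li class="next">
    <a href="/tag/humor/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
</li>
"""


class NextLinkParser(HTMLParser):
    """Remembers the href of the first <a> inside <li class="next">."""

    def __init__(self):
        super().__init__()
        self.in_next_item = False
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "next" in (attrs.get("class") or "").split():
            self.in_next_item = True
        elif tag == "a" and self.in_next_item and self.next_href is None:
            self.next_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_next_item = False


def next_page_url(page_url, html):
    """Return the absolute URL of the next page, or None on the last page."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    # The href is relative, so resolve it against the page it came from.
    return urljoin(page_url, parser.next_href)


print(next_page_url("https://quotes.toscrape.com/tag/humor/", PAGER_HTML))
```

When a page has no `li.next` element, the function returns `None`, which corresponds to the point where the crawl stops.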

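Impressions

Wow, it really is this easy. Once you understand how to use the methods and how the target site is structured, it seems you could collect data in just about any way you like.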

Implementation

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">.
        for quote in response.css('div.quote'):
            yield {
                # .get() returns the first match, or None if nothing matched.
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        # Follow the "Next" pager link until the last page, which has none.
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
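
To see why the two selectors in `parse` land on the quote text and the author, the sketch below walks a simplified quote block with the standard library's `xml.etree.ElementTree` instead of Scrapy. This is only an illustration under assumptions: the fragment is a trimmed, well-formed version of one quote block from the target page, `extract_quote` is a hypothetical helper, and the ElementTree paths are rough equivalents of `span.text::text` and `span/small/text()`, not Scrapy's own API.

```python
import xml.etree.ElementTree as ET

# A trimmed, well-formed version of one quote block from the target page.
FRAGMENT = """
<div class="quote">
  <span class="text">“A day without sunshine is like, you know, night.”</span>
  <span>by <small class="author">Steve Martin</small>
  <a href="/author/Steve-Martin">(about)</a>
  </span>
</div>
"""


def extract_quote(quote):
    """Pull the text and author out of one <div class="quote"> element."""
    # Rough equivalent of the CSS selector span.text::text
    text = quote.find("span[@class='text']")
    # Rough equivalent of the relative XPath span/small/text()
    author = quote.find("span/small")
    return {
        "text": text.text if text is not None else None,
        "author": author.text if author is not None else None,
    }


root = ET.fromstring(FRAGMENT)
# iter() also visits the root element itself, so a lone <div> is included.
quotes = [extract_quote(q) for q in root.iter("div") if q.get("class") == "quote"]
print(quotes)
```

The real page is HTML rather than XML (for example, `itemscope` has no value), so ElementTree would not parse it as-is. Scrapy's selectors are built for that kind of markup, which is why the spider can stay so short.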

Results

[
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen"},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
{"text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d", "author": "Garrison Keillor"},
{"text": "\u201cBeauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.\u201d", "author": "Jim Henson"},
{"text": "\u201cAll you need is love. But a little chocolate now and then doesn't hurt.\u201d", "author": "Charles M. Schulz"},
{"text": "\u201cRemember, we're madly in love, so it's all right to kiss me anytime you feel like it.\u201d", "author": "Suzanne Collins"},
{"text": "\u201cSome people never go crazy. What truly horrible lives they must lead.\u201d", "author": "Charles Bukowski"},
{"text": "\u201cThe trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.\u201d", "author": "Terry Pratchett"},
{"text": "\u201cThink left and think right and think low and think high. Oh, the thinks you can think up if only you try!\u201d", "author": "Dr. Seuss"},
{"text": "\u201cThe reason I talk to myself is because I\u2019m the only one whose answers I accept.\u201d", "author": "George Carlin"},
{"text": "\u201cI am free of all prejudice. I hate everyone equally. \u201d", "author": "W.C. Fields"},
{"text": "\u201cA lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.\u201d", "author": "Jane Austen"}
]
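
The `\u201c` and `\u201d` sequences in the output are JSON escapes for curly quotation marks, not corrupted data. As a small sketch, the snippet below loads three of the items above with the standard `json` module and counts quotes per author. The variable names are made up for this example, and the raw string stands in for the file the spider wrote.

```python
import json
from collections import Counter

# Three items copied from the output above. The raw string keeps the
# backslash escapes exactly as they appear in the JSON file.
RAW = r'''[
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen"},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
{"text": "\u201cA lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.\u201d", "author": "Jane Austen"}
]'''

# json.loads turns each \uXXXX escape back into the real character.
items = json.loads(RAW)
by_author = Counter(item["author"] for item in items)

print(items[1]["text"])
print(by_author.most_common(1))
```

If the escapes get in the way when reading the file by eye, Scrapy's `FEED_EXPORT_ENCODING` setting can be set to `utf-8` so that the characters are written out directly.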

Source of the target site

<!DOCTYPE html>

<html lang="en">

<head>

<meta charset="UTF-8">

<title>Quotes to Scrape</title>

<link rel="stylesheet" href="/static/bootstrap.min.css">

<link rel="stylesheet" href="/static/main.css">

</head>

<body>

<div class="container">

<div class="row header-box">

<div class="col-md-8">

<h1>

<a href="/" style="text-decoration: none">Quotes to Scrape</a>

</h1>

</div>

<div class="col-md-4">

<p>



<a href="/login">Login</a>



</p>

</div>

</div>





<h3>Viewing tag: <a href="/tag/humor/page/1/">humor</a></h3>



<div class="row">

<div class="col-md-8">



<div class="quote" itemscope itemtype="https://schema.org/CreativeWork">

<span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>

<span>by <small class="author" itemprop="author">Jane Austen</small>

<a href="/author/Jane-Austen">(about)</a>

</span>

<div class="tags">

Tags:

<meta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor" / > 



<a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>



<a class="tag" href="/tag/books/page/1/">books</a>



<a class="tag" href="/tag/classic/page/1/">classic</a>



<a class="tag" href="/tag/humor/page/1/">humor</a>



</div>

</div>



<div class="quote" itemscope itemtype="https://schema.org/CreativeWork">

<span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>

<span>by <small class="author" itemprop="author">Steve Martin</small>

<a href="/author/Steve-Martin">(about)</a>

</span>

<div class="tags">

Tags:

<meta class="keywords" itemprop="keywords" content="humor,obvious,simile" / > 



<a class="tag" href="/tag/humor/page/1/">humor</a>



<a class="tag" href="/tag/obvious/page/1/">obvious</a>



<a class="tag" href="/tag/simile/page/1/">simile</a>



</div>

</div>



<div class="quote" itemscope itemtype="https://schema.org/CreativeWork">

<span class="text" itemprop="text">“Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.”</span>

<span>by <small class="author" itemprop="author">Garrison Keillor</small>

<a href="/author/Garrison-Keillor">(about)</a>

</span>

<div class="tags">

Tags:

<meta class="keywords" itemprop="keywords" content="humor,religion" / > 



<a class="tag" href="/tag/humor/page/1/">humor</a>



<a class="tag" href="/tag/religion/page/1/">religion</a>



</div>

</div>



<div class="quote" itemscope itemtype="https://schema.org/CreativeWork">

<span class="text" itemprop="text">“Beauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.”</span>

<span>by <small class="author" itemprop="author">Jim Henson</small>

<a href="/author/Jim-Henson">(about)</a>

</span>

<div class="tags">

Tags:

<meta class="keywords" itemprop="keywords" content="humor" / > 



<a class="tag" href="/tag/humor/page/1/">humor</a>



</div>

</div>



<div class="quote" itemscope itemtype="https://schema.org/CreativeWork">

<span class="text" itemprop="text">“All you need is love. But a little chocolate now and then doesn&#39;t hurt.”</span>

<span>by <small class="author" itemprop="author">Charles M. Schulz</small>

<a href="/author/Charles-M-Schulz">(about)</a>

</span>

<div class="tags">

Tags:

<meta class="keywords" itemprop="keywords" content="chocolate,food,humor" / > 



<a class="tag" href="/tag/chocolate/page/1/">chocolate</a>



<a class="tag" href="/tag/food/page/1/">food</a>



<a class="tag" href="/tag/humor/page/1/">humor</a>



</div>

</div>



<div class="quote" itemscope itemtype="https://schema.org/CreativeWork">

<span class="text" itemprop="text">“Remember, we&#39;re madly in love, so it&#39;s all right to kiss me anytime you feel like it.”</span>

<span>by <small class="author" itemprop="author">Suzanne Collins</small>

<a href="/author/Suzanne-Collins">(about)</a>

</span>

<div class="tags">

Tags:

<meta class="keywords" itemprop="keywords" content="humor" / > 



<a class="tag" href="/tag/humor/page/1/">humor</a>



</div>

</div>



<div class="quote" itemscope itemtype="https://schema.org/CreativeWork">

<span class="text" itemprop="text">“Some people never go crazy. What truly horrible lives they must lead.”</span>

<span>by <small class="author" itemprop="author">Charles Bukowski</small>

<a href="/author/Charles-Bukowski">(about)</a>

</span>

<div class="tags">

Tags:

<meta class="keywords" itemprop="keywords" content="humor" / > 



<a class="tag" href="/tag/humor/page/1/">humor</a>



</div>

</div>



<div class="quote" itemscope itemtype="https://schema.org/CreativeWork">

<span class="text" itemprop="text">“The trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.”</span>

<span>by <small class="author" itemprop="author">Terry Pratchett</small>

<a href="/author/Terry-Pratchett">(about)</a>

</span>

<div class="tags">

Tags:

<meta class="keywords" itemprop="keywords" content="humor,open-mind,thinking" / > 



<a class="tag" href="/tag/humor/page/1/">humor</a>



<a class="tag" href="/tag/open-mind/page/1/">open-mind</a>



<a class="tag" href="/tag/thinking/page/1/">thinking</a>



</div>

</div>



<div class="quote" itemscope itemtype="https://schema.org/CreativeWork">

<span class="text" itemprop="text">“Think left and think right and think low and think high. Oh, the thinks you can think up if only you try!”</span>

<span>by <small class="author" itemprop="author">Dr. Seuss</small>

<a href="/author/Dr-Seuss">(about)</a>

</span>

<div class="tags">

Tags:

<meta class="keywords" itemprop="keywords" content="humor,philosophy" / > 



<a class="tag" href="/tag/humor/page/1/">humor</a>



<a class="tag" href="/tag/philosophy/page/1/">philosophy</a>



</div>

</div>



<div class="quote" itemscope itemtype="https://schema.org/CreativeWork">

<span class="text" itemprop="text">“The reason I talk to myself is because I’m the only one whose answers I accept.”</span>

<span>by <small class="author" itemprop="author">George Carlin</small>

<a href="/author/George-Carlin">(about)</a>

</span>

<div class="tags">

Tags:

<meta class="keywords" itemprop="keywords" content="humor,insanity,lies,lying,self-indulgence,truth" / > 



<a class="tag" href="/tag/humor/page/1/">humor</a>



<a class="tag" href="/tag/insanity/page/1/">insanity</a>



<a class="tag" href="/tag/lies/page/1/">lies</a>



<a class="tag" href="/tag/lying/page/1/">lying</a>



<a class="tag" href="/tag/self-indulgence/page/1/">self-indulgence</a>



<a class="tag" href="/tag/truth/page/1/">truth</a>



</div>

</div>



<nav>

<ul class="pager">





<li class="next">

<a href="/tag/humor/page/2/">Next <span aria-hidden="true">&rarr;</span></a>

</li>



</ul>

</nav>

</div>

<div class="col-md-4 tags-box">



<h2>Top Ten tags</h2>



<span class="tag-item">

<a class="tag" style="font-size: 28px" href="/tag/love/">love</a>

</span>



<span class="tag-item">

<a class="tag" style="font-size: 26px" href="/tag/inspirational/">inspirational</a>

</span>



<span class="tag-item">

<a class="tag" style="font-size: 26px" href="/tag/life/">life</a>

</span>



<span class="tag-item">

<a class="tag" style="font-size: 24px" href="/tag/humor/">humor</a>

</span>



<span class="tag-item">

<a class="tag" style="font-size: 22px" href="/tag/books/">books</a>

</span>



<span class="tag-item">

<a class="tag" style="font-size: 14px" href="/tag/reading/">reading</a>

</span>



<span class="tag-item">

<a class="tag" style="font-size: 10px" href="/tag/friendship/">friendship</a>

</span>



<span class="tag-item">

<a class="tag" style="font-size: 8px" href="/tag/friends/">friends</a>

</span>



<span class="tag-item">

<a class="tag" style="font-size: 8px" href="/tag/truth/">truth</a>

</span>



<span class="tag-item">

<a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>

</span>





</div>

</div>



</div>

<footer class="footer">

<div class="container">

<p class="text-muted">

Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>

</p>

<p class="copyright">

Made with <span class='sh-red'>❤</span> by <a href="https://scrapinghub.com">Scrapinghub</a>

</p>

</div>

</footer>

</body>

</html>

References

Scrapy documentation: Scrapy at a glance

藤沢瞭介(Ryosuke Hujisawa)
