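Trying out web scraping with Scrapy

Explanation

At a glance, the spider pulls the elements it needs out of the page source, then moves on to the next page by following the link found in markup like the following.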

<li class="next">
    <a href="/tag/humor/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
</li>
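To make the "follow the next link" step concrete, here is a minimal sketch that uses only the Python standard library rather than Scrapy. It finds the `href` inside `<li class="next">` and resolves it against the current page URL, which is roughly what `response.follow` does with a relative link. The names `NextLinkParser` and `find_next_url` are hypothetical helpers made up for this illustration; they are not part of Scrapy.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Remembers the href of the first <a> inside <li class="next">."""

    def __init__(self):
        super().__init__()
        self._in_next_item = False
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "next" in (attrs.get("class") or "").split():
            self._in_next_item = True
        elif tag == "a" and self._in_next_item and self.next_href is None:
            self.next_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_next_item = False


def find_next_url(page_url, html):
    """Return the absolute URL of the next page, or None on the last page."""
    parser = NextLinkParser()
    parser.feed(html)
    parser.close()
    if parser.next_href is None:
        return None
    # Relative links such as "/tag/humor/page/2/" are resolved against the page URL.
    return urljoin(page_url, parser.next_href)


snippet = """
<li class="next">
    <a href="/tag/humor/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
</li>
"""
next_url = find_next_url("http://quotes.toscrape.com/tag/humor/", snippet)
print(next_url)
```

When a page has no `<li class="next">` element, `find_next_url` returns `None`. That is the same condition the spider checks before it stops crawling.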

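Impressions

Ah, it really is this easy... Once you understand how to use the methods and how the target site is structured, it looks like you can collect data in just about any way you like.

Implementation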

import scrapy


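class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # The crawl starts from the first page of the "humor" tag.
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Each quote on the page lives in its own <div class="quote">.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        # The pager link is absent on the last page, so next_page becomes None.
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            # Follow the relative link and parse the next page with this same method.
            yield response.follow(next_page, self.parse)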
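To see what the two selectors in `parse` actually pick out, the following is a rough standard-library sketch that imitates them on one quote block copied from the target page. It is not how Scrapy works internally. `QuoteParser` is a hypothetical class written only for this illustration, and it assumes the page keeps the same `div.quote` / `span.text` / `small` layout.

```python
from html.parser import HTMLParser


class QuoteParser(HTMLParser):
    """Collects {'text', 'author'} dicts from <div class="quote"> blocks."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._field = None  # field that the next text node belongs to

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "quote" in classes:
            self.items.append({"text": None, "author": None})
        elif self.items and tag == "span" and "text" in classes:
            self._field = "text"    # roughly span.text::text
        elif self.items and tag == "small":
            self._field = "author"  # roughly span/small/text()

    def handle_data(self, data):
        # Keep only the first text node per field, like extract_first().
        if self._field and self.items[-1][self._field] is None:
            self.items[-1][self._field] = data

    def handle_endtag(self, tag):
        if tag in ("span", "small"):
            self._field = None


sample = """
<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>
<span>by <small class="author" itemprop="author">Steve Martin</small>
<a href="/author/Steve-Martin">(about)</a>
</span>
</div>
"""
parser = QuoteParser()
parser.feed(sample)
parser.close()
print(parser.items)
```

In practice the Scrapy selectors are much shorter and more robust than hand-written parsing like this. The sketch only shows which text nodes end up in each field.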

Result

[
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen"},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
{"text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d", "author": "Garrison Keillor"},
{"text": "\u201cBeauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.\u201d", "author": "Jim Henson"},
{"text": "\u201cAll you need is love. But a little chocolate now and then doesn't hurt.\u201d", "author": "Charles M. Schulz"},
{"text": "\u201cRemember, we're madly in love, so it's all right to kiss me anytime you feel like it.\u201d", "author": "Suzanne Collins"},
{"text": "\u201cSome people never go crazy. What truly horrible lives they must lead.\u201d", "author": "Charles Bukowski"},
{"text": "\u201cThe trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.\u201d", "author": "Terry Pratchett"},
{"text": "\u201cThink left and think right and think low and think high. Oh, the thinks you can think up if only you try!\u201d", "author": "Dr. Seuss"},
{"text": "\u201cThe reason I talk to myself is because I\u2019m the only one whose answers I accept.\u201d", "author": "George Carlin"},
{"text": "\u201cI am free of all prejudice. I hate everyone equally. \u201d", "author": "W.C. Fields"},
{"text": "\u201cA lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.\u201d", "author": "Jane Austen"}
]
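The `\u201c` and `\u201d` sequences in the output are JSON escapes for the curly quotation marks around each quote, not corrupted text. As a small sketch, loading the data with the standard `json` module turns them back into the real characters. The three records below are copied from the output above, and the variable names are made up for this example.

```python
import json
from collections import Counter

# A raw string keeps the \uXXXX escapes exactly as they appear in the file.
raw_output = r'''[
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen"},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
{"text": "\u201cA lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.\u201d", "author": "Jane Austen"}
]'''

quotes = json.loads(raw_output)

# The escapes are decoded into the curly quote characters.
for quote in quotes:
    print(quote["author"], "-", quote["text"])

# Count how many quotes each author contributed.
author_counts = Counter(quote["author"] for quote in quotes)
print(author_counts.most_common(1))
```

If you would rather have the curly quotes written directly to the file, Scrapy has a `FEED_EXPORT_ENCODING` setting that, as far as I know, does this when set to `utf-8`. I did not test it for this post.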

Source of the target site

<!DOCTYPE html>

<html lang="en">

<head>

<meta charset="UTF-8">

<title>Quotes to Scrape</title>

<link rel="stylesheet" href="/static/bootstrap.min.css">

<link rel="stylesheet" href="/static/main.css">

</head>

<body>

<div class="container">

<div class="row header-box">

<div class="col-md-8">

<h1>

<a href="/" style="text-decoration: none">Quotes to Scrape</a>

</h1>

</div>

<div class="col-md-4">

<p>



<a href="/login">Login</a>



</p>

</div>

</div>





<h3>Viewing tag: <a href="/tag/humor/page/1/">humor</a></h3>



<div class="row">

<div class="col-md-8">



<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">

<span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>

<span>by <small class="author" itemprop="author">Jane Austen</small>

<a href="/author/Jane-Austen">(about)</a>

</span>

<div class="tags">

Tags:

<meta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor" / > 



<a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>



<a class="tag" href="/tag/books/page/1/">books</a>



<a class="tag" href="/tag/classic/page/1/">classic</a>



<a class="tag" href="/tag/humor/page/1/">humor</a>



</div>

</div>



<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">

<span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>

<span>by <small class="author" itemprop="author">Steve Martin</small>

<a href="/author/Steve-Martin">(about)</a>

</span>

<div class="tags">

Tags:

<meta class="keywords" itemprop="keywords" content="humor,obvious,simile" / > 



<a class="tag" href="/tag/humor/page/1/">humor</a>



<a class="tag" href="/tag/obvious/page/1/">obvious</a>



<a class="tag" href="/tag/simile/page/1/">simile</a>



</div>

</div>



<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">

<span class="text" itemprop="text">“Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.”</span>

<span>by <small class="author" itemprop="author">Garrison Keillor</small>

<a href="/author/Garrison-Keillor">(about)</a>

</span>

<div class="tags">

Tags:

<meta class="keywords" itemprop="keywords" content="humor,religion" / > 



<a class="tag" href="/tag/humor/page/1/">humor</a>



<a class="tag" href="/tag/religion/page/1/">religion</a>



</div>

</div>



<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">

<span class="text" itemprop="text">“Beauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.”</span>

<span>by <small class="author" itemprop="author">Jim Henson</small>

<a href="/author/Jim-Henson">(about)</a>

</span>

<div class="tags">

Tags:

<meta class="keywords" itemprop="keywords" content="humor" / > 



<a class="tag" href="/tag/humor/page/1/">humor</a>



</div>

</div>



<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">

<span class="text" itemprop="text">“All you need is love. But a little chocolate now and then doesn&#39;t hurt.”</span>

<span>by <small class="author" itemprop="author">Charles M. Schulz</small>

<a href="/author/Charles-M-Schulz">(about)</a>

</span>

<div class="tags">

Tags:

<meta class="keywords" itemprop="keywords" content="chocolate,food,humor" / > 



<a class="tag" href="/tag/chocolate/page/1/">chocolate</a>



<a class="tag" href="/tag/food/page/1/">food</a>



<a class="tag" href="/tag/humor/page/1/">humor</a>



</div>

</div>



<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">

<span class="text" itemprop="text">“Remember, we&#39;re madly in love, so it&#39;s all right to kiss me anytime you feel like it.”</span>

<span>by <small class="author" itemprop="author">Suzanne Collins</small>

<a href="/author/Suzanne-Collins">(about)</a>

</span>

<div class="tags">

Tags:

<meta class="keywords" itemprop="keywords" content="humor" / > 



<a class="tag" href="/tag/humor/page/1/">humor</a>



</div>

</div>



<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">

<span class="text" itemprop="text">“Some people never go crazy. What truly horrible lives they must lead.”</span>

<span>by <small class="author" itemprop="author">Charles Bukowski</small>

<a href="/author/Charles-Bukowski">(about)</a>

</span>

<div class="tags">

Tags:

<meta class="keywords" itemprop="keywords" content="humor" / > 



<a class="tag" href="/tag/humor/page/1/">humor</a>



</div>

</div>



<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">

<span class="text" itemprop="text">“The trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.”</span>

<span>by <small class="author" itemprop="author">Terry Pratchett</small>

<a href="/author/Terry-Pratchett">(about)</a>

</span>

<div class="tags">

Tags:

<meta class="keywords" itemprop="keywords" content="humor,open-mind,thinking" / > 



<a class="tag" href="/tag/humor/page/1/">humor</a>



<a class="tag" href="/tag/open-mind/page/1/">open-mind</a>



<a class="tag" href="/tag/thinking/page/1/">thinking</a>



</div>

</div>



<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">

<span class="text" itemprop="text">“Think left and think right and think low and think high. Oh, the thinks you can think up if only you try!”</span>

<span>by <small class="author" itemprop="author">Dr. Seuss</small>

<a href="/author/Dr-Seuss">(about)</a>

</span>

<div class="tags">

Tags:

<meta class="keywords" itemprop="keywords" content="humor,philosophy" / > 



<a class="tag" href="/tag/humor/page/1/">humor</a>



<a class="tag" href="/tag/philosophy/page/1/">philosophy</a>



</div>

</div>



<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">

<span class="text" itemprop="text">“The reason I talk to myself is because I’m the only one whose answers I accept.”</span>

<span>by <small class="author" itemprop="author">George Carlin</small>

<a href="/author/George-Carlin">(about)</a>

</span>

<div class="tags">

Tags:

<meta class="keywords" itemprop="keywords" content="humor,insanity,lies,lying,self-indulgence,truth" / > 



<a class="tag" href="/tag/humor/page/1/">humor</a>



<a class="tag" href="/tag/insanity/page/1/">insanity</a>



<a class="tag" href="/tag/lies/page/1/">lies</a>



<a class="tag" href="/tag/lying/page/1/">lying</a>



<a class="tag" href="/tag/self-indulgence/page/1/">self-indulgence</a>



<a class="tag" href="/tag/truth/page/1/">truth</a>



</div>

</div>



<nav>

<ul class="pager">





<li class="next">

<a href="/tag/humor/page/2/">Next <span aria-hidden="true">&rarr;</span></a>

</li>



</ul>

</nav>

</div>

<div class="col-md-4 tags-box">



<h2>Top Ten tags</h2>



<span class="tag-item">

<a class="tag" style="font-size: 28px" href="/tag/love/">love</a>

</span>



<span class="tag-item">

<a class="tag" style="font-size: 26px" href="/tag/inspirational/">inspirational</a>

</span>



<span class="tag-item">

<a class="tag" style="font-size: 26px" href="/tag/life/">life</a>

</span>



<span class="tag-item">

<a class="tag" style="font-size: 24px" href="/tag/humor/">humor</a>

</span>



<span class="tag-item">

<a class="tag" style="font-size: 22px" href="/tag/books/">books</a>

</span>



<span class="tag-item">

<a class="tag" style="font-size: 14px" href="/tag/reading/">reading</a>

</span>



<span class="tag-item">

<a class="tag" style="font-size: 10px" href="/tag/friendship/">friendship</a>

</span>



<span class="tag-item">

<a class="tag" style="font-size: 8px" href="/tag/friends/">friends</a>

</span>



<span class="tag-item">

<a class="tag" style="font-size: 8px" href="/tag/truth/">truth</a>

</span>



<span class="tag-item">

<a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>

</span>





</div>

</div>



</div>

<footer class="footer">

<div class="container">

<p class="text-muted">

Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>

</p>

<p class="copyright">

Made with <span class='sh-red'>❤</span> by <a href="https://scrapinghub.com">Scrapinghub</a>

</p>

</div>

</footer>

</body>

</html>

Reference

Docs » Scrapy at a glance


Copyright© offブログ! , 2021 All Rights Reserved Powered by AFFINGER5.