Web Crawling from 0 to 1 - 03

Pyladies Taiwan

Speaker : Mars

2016/12

Roadmap

  • Introduction to Request & Response
  • GET/POST
  • Introduction to Cookies
  • Crawler case studies and practical techniques

Request & Response

Request Method

GET

http://blog.marsw.tw/search?q=安藤忠雄
response = requests.get(url)

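  • Like a postcard: simple and direct
  • Downside: the mail carrier can read everything, because the data sits in the URL itself
  • Limited by the size of the postcard: the length of the content and address (the URL) is restricted.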

POST

http://www.thsrc.com.tw/tw/TimeTable/SearchResult
response = requests.post(url,data=post_data)

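  • Like ordinary mail: the content is protected inside an envelope
  • The mail carrier sees only the address, not the letter: the data travels in the request body instead of the URL, so it stays out of browser history and server logs (real protection against eavesdropping still requires HTTPS)
  • An envelope holds far more than a postcard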
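To make the postcard/envelope contrast concrete, here is a minimal sketch that builds (but never sends) one request of each kind. It uses only the standard library's urllib so it runs offline; requests performs the equivalent encoding when you pass params= or data=. The two form fields are a subset of the ones used in this talk.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: the query is appended to the URL, visible like text on a postcard.
get_url = "http://blog.marsw.tw/search?" + urlencode({"q": "安藤忠雄"})
get_req = Request(get_url)

# POST: the form data goes into the request body, like a letter in an envelope.
form = {"SearchDate": "2016/12/25", "SearchTime": "17:00"}
post_req = Request(
    "http://www.thsrc.com.tw/tw/TimeTable/SearchResult",
    data=urlencode(form).encode("ascii"),
)

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url)
print(post_req.data)
```

Note that the POST URL stays the same whatever is in the form, while every change to a GET query produces a different URL.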

In [1]:
import requests 
response = requests.get("http://blog.marsw.tw/")
html = response.text
content_type = response.headers['content-type']
print (content_type)
text/html; charset=UTF-8
In [2]:
import requests
post_data = {
    "StartStation": "977abb69-413a-4ccf-a109-0272c24fd490", 
    "EndStation": "f2519629-5973-4d08-913b-479cce78a356",
    "SearchDate": "2016/12/25",
    "SearchTime": "17:00",
    "SearchWay":"DepartureInMandarin",
    "RestTime":"",
    "EarlyOrLater":""
}
response = requests.post("http://www.thsrc.com.tw/tw/TimeTable/SearchResult",data=post_data)
html = response.text
In [3]:
from lxml import etree
page = etree.HTML(html)
for i in page.xpath("//table[@class='touch_table']//tr"):
    print (" ".join(i.xpath("./td//text()")))
0845 17:11 19:25
0667 17:21 19:20
0149 17:31 19:05
0669 17:46 19:45
1245 17:51 19:30
0849 18:11 20:25
0673 18:21 20:20
0153 18:31 20:05
0675 18:46 20:45
0249 18:51 20:30

Coding Time

  • Query the THSR (Taiwan High Speed Rail) timetable for 2016/12/31, Hsinchu -> Taipei, for trains with an arrival time of 21:00
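One possible way to handle the time filter, sketched on rows copied from the In [3] output rather than on a live query: because the times are zero-padded HH:MM strings, plain string comparison orders them chronologically. The helper name arriving_by and the 19:30 cutoff are illustrative assumptions, not part of the exercise.

```python
# Rows in the "train departure arrival" form printed by In [3].
rows = [
    "0845 17:11 19:25",
    "0667 17:21 19:20",
    "0149 17:31 19:05",
    "0669 17:46 19:45",
    "1245 17:51 19:30",
]

def arriving_by(rows, cutoff):
    """Return (train, departure, arrival) tuples arriving at or before cutoff."""
    result = []
    for row in rows:
        train, departure, arrival = row.split()
        if arrival <= cutoff:  # zero-padded HH:MM strings sort chronologically
            result.append((train, departure, arrival))
    return result

for train in arriving_by(rows, "19:30"):
    print(train)
```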

Introduction to Cookies

  • requests.get(url, cookies=my_cookie_info)
  • requests.post(url, data=post_data, cookies=my_cookie_info)
  • The data is stored on the client
  • What gets recorded: login state (username), browsing history, etc.
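As a rough illustration of what "stored on the client" means on the wire, the sketch below parses a Set-Cookie value with the standard library and rebuilds the Cookie header a client would send back. The over18 cookie name comes from the PTT example in this talk; the Path attribute is an assumption.

```python
from http.cookies import SimpleCookie

# A Set-Cookie value as a server might send it (the Path attribute is assumed).
jar = SimpleCookie()
jar.load("over18=1; Path=/")

# Keep only name -> value, the same dict shape that requests accepts for cookies=.
cookie_dict = {name: morsel.value for name, morsel in jar.items()}

# What the client sends back on later requests, in the Cookie header.
cookie_header = "; ".join("{}={}".format(k, v) for k, v in cookie_dict.items())

print(cookie_dict)
print(cookie_header)
```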

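Applications

  • Remembering my account
  • A shopping cart has to record which items the user is buying before checkout
  • Recommending products you might like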

Session

  • The data is stored on the server
  • Holds the more sensitive data
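A minimal sketch of the idea, assuming a toy in-memory store: the server keeps the sensitive data and hands the client only a random id to carry in a cookie. The names session_store, create_session, load_session and the session_id cookie are all hypothetical.

```python
import secrets

# Server-side store: session id -> data. A real server would use a database or cache.
session_store = {}

def create_session(data):
    """Store data on the server and return the id that goes into the cookie."""
    session_id = secrets.token_hex(16)
    session_store[session_id] = data
    return session_id

def load_session(cookies):
    """Look up the server-side data from the id in the client's cookies."""
    return session_store.get(cookies.get("session_id"))

sid = create_session({"user": "mars", "cart": ["book"]})
client_cookies = {"session_id": sid}  # the only thing the client holds

print(load_session(client_cookies))
print(load_session({"session_id": "forged"}))
```

The client never sees the stored dictionary, and a made-up id simply finds nothing.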

Cache

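  • Speeds up browsing
  • Image files, JavaScript, XML, etc.

A Hundred Thousand Whys

  • Why does YouTube sometimes keep playing for a while after the network drops?
  • Why does a page open faster the second time?
  • Why do you need to explicitly refresh after a website releases a new version?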
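The last two questions can be illustrated with a toy cache. This is only a sketch under simplifying assumptions: slow_fetch stands in for a real download, and real browser caches also honour expiry rules that are left out here.

```python
cache = {}
stats = {"fetches": 0}

def slow_fetch(url):
    """Stand-in for a network download; counts how often it is called."""
    stats["fetches"] += 1
    return "content of {} (version {})".format(url, stats["fetches"])

def cached_get(url, refresh=False):
    """Return the cached copy unless it is missing or a refresh is forced."""
    if refresh or url not in cache:
        cache[url] = slow_fetch(url)
    return cache[url]

first = cached_get("/logo.png")                # downloaded
second = cached_get("/logo.png")               # served from the cache, no download
third = cached_get("/logo.png", refresh=True)  # forced refresh, downloaded again

print(stats["fetches"], first == second, third)
```

The second call is fast because nothing is downloaded, and the stale copy keeps being served until a refresh is forced.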
In [4]:
import requests
resq = requests.Session() # Despite sharing the name "Session", this is a requests object whose job is to keep cookies and connections alive across requests!
post_data = {"from": "/bbs/Gossiping/index.html", "yes": "yes"}
resp = resq.post("https://www.ptt.cc/ask/over18", data=post_data)
resp = resq.get("https://www.ptt.cc/bbs/Gossiping/M.1481674520.A.251.html")
html = resp.text
print (html)
<!DOCTYPE html>
<html>
	<head>
		<meta charset="utf-8">
		

<meta name="viewport" content="width=device-width, initial-scale=1">

<title>[問卦] 為什麼服部平次沒被稱為大阪死神? - 看板 Gossiping - 批踢踢實業坊</title>
<meta name="robots" content="all">
<meta name="keywords" content="Ptt BBS 批踢踢">
<meta name="description" content="
關西名偵探服部平次  跟柯南一樣都會碰

到一堆死人的刑事案件 那他怎沒被稱為

">
<meta property="og:site_name" content="Ptt 批踢踢實業坊">
<meta property="og:title" content="[問卦] 為什麼服部平次沒被稱為大阪死神?">
<meta property="og:description" content="
關西名偵探服部平次  跟柯南一樣都會碰

到一堆死人的刑事案件 那他怎沒被稱為

">
<link rel="canonical" href="https://www.ptt.cc/bbs/Gossiping/M.1481674520.A.251.html">

<link rel="stylesheet" type="text/css" href="//images.ptt.cc/v2.20/bbs-common.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/v2.20/bbs-base.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/v2.20/bbs-custom.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/v2.20/pushstream.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/v2.20/bbs-print.css" media="print">


<script src="//ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="//images.ptt.cc/v2.20/bbs.js"></script>


		

<script type="text/javascript">

  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-32365737-1']);
  _gaq.push(['_setDomainName', 'ptt.cc']);
  _gaq.push(['_trackPageview']);

  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();

</script>


	</head>
    <body>
		
<div id="fb-root"></div>
<script>(function(d, s, id) {
var js, fjs = d.getElementsByTagName(s)[0];
if (d.getElementById(id)) return;
js = d.createElement(s); js.id = id;
js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";
fjs.parentNode.insertBefore(js, fjs);
}(document, 'script', 'facebook-jssdk'));</script>

<div id="topbar-container">
	<div id="topbar" class="bbs-content">
		<a id="logo" href="/">批踢踢實業坊</a>
		<span>&rsaquo;</span>
		<a class="board" href="/bbs/Gossiping/index.html"><span class="board-label">看板 </span>Gossiping</a>
		<a class="right small" href="/about.html">關於我們</a>
		<a class="right small" href="/contact.html">聯絡資訊</a>
	</div>
</div>
<div id="navigation-container">
	<div id="navigation" class="bbs-content">
		<a class="board" href="/bbs/Gossiping/index.html">返回看板</a>
		<div class="bar"></div>
		<div class="share">
			<span>分享</span>
			<div class="fb-like" data-send="false" data-layout="button_count" data-width="90" data-show-faces="false" data-href="http://www.ptt.cc/bbs/Gossiping/M.1481674520.A.251.html"></div>

			<div class="g-plusone" data-size="medium"></div>
<script type="text/javascript">
window.___gcfg = {lang: 'zh-TW'};
(function() {
var po = document.createElement('script'); po.type = 'text/javascript'; po.async = true;
po.src = 'https://apis.google.com/js/plusone.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(po, s);
})();
</script>

		</div>
	</div>
</div>
<div id="main-container">
    <div id="main-content" class="bbs-screen bbs-content"><div class="article-metaline"><span class="article-meta-tag">作者</span><span class="article-meta-value">saintlin (saintlin)</span></div><div class="article-metaline-right"><span class="article-meta-tag">看板</span><span class="article-meta-value">Gossiping</span></div><div class="article-metaline"><span class="article-meta-tag">標題</span><span class="article-meta-value">[問卦] 為什麼服部平次沒被稱為大阪死神?</span></div><div class="article-metaline"><span class="article-meta-tag">時間</span><span class="article-meta-value">Wed Dec 14 08:15:17 2016</span></div>
關西名偵探服部平次  跟柯南一樣都會碰

到一堆死人的刑事案件 那他怎沒被稱為

大阪死神? 還是只是沒演出來就稱不上

死神這名號?

--
<span class="f2">※ 發信站: 批踢踢實業坊(ptt.cc), 來自: 42.72.125.27
</span><span class="f2">※ 文章網址: <a href="https://www.ptt.cc/bbs/Gossiping/M.1481674520.A.251.html" target="_blank" rel="nofollow">https://www.ptt.cc/bbs/Gossiping/M.1481674520.A.251.html</a>
</span><div class="push"><span class="f1 hl push-tag">→ </span><span class="f3 hl push-userid">momo1244</span><span class="f3 push-content">: 因為死的人幾乎不在大阪</span><span class="push-ipdatetime"> 12/14 08:15
</span></div><div class="push"><span class="f1 hl push-tag">→ </span><span class="f3 hl push-userid">aterui</span><span class="f3 push-content">: 原本都是警察委託他辦理事件,跟柯南在一起時才會碰巧遇到</span><span class="push-ipdatetime"> 12/14 08:16
</span></div><div class="push"><span class="f1 hl push-tag">→ </span><span class="f3 hl push-userid">winiS</span><span class="f3 push-content">: 擊墜數不到</span><span class="push-ipdatetime"> 12/14 08:18
</span></div><div class="push"><span class="hl push-tag">推 </span><span class="f3 hl push-userid">akway</span><span class="f3 push-content">: 因為服部都是被委託辦案 只有跟柯南在一起才會碰到兇殺案</span><span class="push-ipdatetime"> 12/14 08:22
</span></div><div class="push"><span class="f1 hl push-tag">→ </span><span class="f3 hl push-userid">tetsu2008</span><span class="f3 push-content">: 都是柯南來大阪害的</span><span class="push-ipdatetime"> 12/14 08:41
</span></div><div class="push"><span class="hl push-tag">推 </span><span class="f3 hl push-userid">feedfish</span><span class="f3 push-content">: 都是柯南的關係吧</span><span class="push-ipdatetime"> 12/14 08:51
</span></div></div>
    
    <div id="article-polling" data-pollurl="/poll/Gossiping/M.1481674520.A.251.html?cacheKey=2052-1115141154&offset=983&offset-sig=ea3331af4a298b2280f1b0fbda4d161f776609ba" data-longpollurl="/v1/longpoll?id=b6ce4f0d6fb4823a614c2cd38739bd1ebcfae5c6" data-offset="983"></div>
    

    
<div class="bbs-screen bbs-footer-message">本網站已依台灣網站內容分級規定處理。此區域為限制級,未滿十八歲者不得瀏覽。</div>

</div>

    </body>
</html>

Getting the current cookies

In [5]:
print (resq.cookies.get_dict())
{'over18': '1', '__cfduid': 'd60ac267650fba286d8fcc23f28cd97f31481700563'}
In [6]:
import requests
resp = requests.get("https://www.ptt.cc/bbs/Gossiping/M.1481674520.A.251.html",cookies=resq.cookies.get_dict())
html = resp.text
print (html)
(Output is identical to the HTML printed by In [4].)

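Dictionary (dict)

  • Key: works much like a list index, but can be a string or any other hashable value
  • Set a value: dict_name[key] = value
  • Get a value: dict_name[key]
  • key:value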
    {
      "the":10 ,
      "a"  :9  ,
      "of" :6  ,
      "in" :6
    }

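! Note

  • A dict is not a sequence: there is no positional index, and before Python 3.7 the iteration order was not guaranteed (since 3.7 it follows insertion order)
  • A dict stores keys, and you look up the value by its key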
In [7]:
my_dict = {}
my_dict["the"] = 10
my_dict["a"] = 9
my_dict["of"] = 6
my_dict["in"] = 6
my_dict2 = {"the":10,"a":9,"of":6,"in":6}
print (my_dict)
print (my_dict2)
print (my_dict["the"])
{'a': 9, 'in': 6, 'of': 6, 'the': 10}
{'a': 9, 'in': 6, 'of': 6, 'the': 10}
10

! Note

  • Looking up a key that does not exist in the dict raises a KeyError!
In [8]:
my_dict = {"the":10,"a":9,"of":6,"in":6}
print ("with" in my_dict)
print (my_dict["with"])
False
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-8-bfa4e840c783> in <module>()
      1 my_dict = {"the":10,"a":9,"of":6,"in":6}
      2 print ("with" in my_dict)
----> 3 print (my_dict["with"])

KeyError: 'with'
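Three common ways to avoid that KeyError, shown on the same dictionary. Which one fits depends on whether a missing key is expected or truly exceptional.

```python
my_dict = {"the": 10, "a": 9, "of": 6, "in": 6}

# 1. Check membership first.
if "with" in my_dict:
    print(my_dict["with"])
else:
    print("no entry for 'with'")

# 2. dict.get returns a default (None unless one is given) instead of raising.
count = my_dict.get("with", 0)

# 3. Catch the KeyError when a missing key is truly exceptional.
try:
    value = my_dict["with"]
except KeyError:
    value = None

print(count, value)
```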

Saving a dictionary

  • pickle: Python data types <-> files
  • json: Python data types <-> strings
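The two formats differ in what survives a round trip. The sketch below uses a made-up dictionary (only the over18 entry comes from this talk) to show that pickle restores the exact Python objects, while json is limited to JSON types.

```python
import json
import pickle

# Sample data chosen to expose the differences: a tuple value and an int key.
data = {"over18": "1", "visits": (1, 2), 7: "lucky"}

# pickle: produces bytes and restores the exact Python objects.
restored_pickle = pickle.loads(pickle.dumps(data))

# json: produces text; tuples come back as lists and keys come back as strings.
json_text = json.dumps(data)
restored_json = json.loads(json_text)

print(restored_pickle == data)
print(type(json_text), restored_json)
```

For a cookie dictionary of plain strings both work equally well. json has the advantage of being readable by other languages, and pickle files should only be loaded when they come from a source you trust.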
In [9]:
import pickle
with open("cookie.p", "wb") as fileout:  # the with block closes the file automatically
    pickle.dump(resq.cookies.get_dict(), fileout)
with open("cookie.p", "rb") as filein:
    my_cookie = pickle.load(filein)
print (type(my_cookie),my_cookie)
<class 'dict'> {'over18': '1', '__cfduid': 'd60ac267650fba286d8fcc23f28cd97f31481700563'}
In [10]:
import json
file_string = json.dumps(resq.cookies.get_dict())
print (type(file_string),file_string)
with open("cookie.txt","w") as fileout:  # the with block closes the file automatically
    fileout.write(file_string)
with open("cookie.txt") as filein:
    my_cookie2 = json.loads(filein.read())
print (type(my_cookie2),my_cookie2)
<class 'str'> {"over18": "1", "__cfduid": "d60ac267650fba286d8fcc23f28cd97f31481700563"}
<class 'dict'> {'over18': '1', '__cfduid': 'd60ac267650fba286d8fcc23f28cd97f31481700563'}
In [11]:
import requests
resp = requests.get("https://www.ptt.cc/bbs/Gossiping/M.1481674520.A.251.html",cookies=my_cookie)
html = resp.text
print (html)
(Output is identical to the HTML printed by In [4].)

Coding Time

  • Try fetching any article from the Gossiping board!

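Crawler case studies and practical techniques

Make good use of Chrome's Developer Tools to observe

  • Network: observe each request => requests
  • Elements: helps inspect the page structure as parsed and formatted by Chrome => lxml
    • But the raw source is the most accurate! (right-click > View page source)
    • Copy XPath can help with parsing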

Observe the details! Trial and error!

Try changing the page number?

What do they have in common?

  • //*[@id="Blog1"]/div[1]/div[1]/div/div[1]/article/header/h5/a
  • //*[@id="Blog1"]/div[1]/div[1]/div/div[2]/article/header/h5/a
    • Using just //article/header/h5/a, or even the shorter //h5/a, gives the same result!
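The claim can be checked on a small stand-in document. The markup below is hypothetical, shaped only to match the two long paths, and the sketch uses the standard library's ElementTree, which supports a subset of XPath and needs a leading "." in the path. With lxml the equivalent call would be page.xpath("//h5/a").

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped like the paths above; not copied from the real blog.
snippet = """<html><body><div id="Blog1"><div><div><div>
<div><article><header><h5><a href="/p1">Post 1</a></h5></header></article></div>
<div><article><header><h5><a href="/p2">Post 2</a></h5></header></article></div>
</div></div></div></div></body></html>"""
root = ET.fromstring(snippet)

# The two long paths copied from DevTools, one per article.
prefix = ".//*[@id='Blog1']/div[1]/div[1]/div/"
long_hits = (root.findall(prefix + "div[1]/article/header/h5/a")
             + root.findall(prefix + "div[2]/article/header/h5/a"))

# The short path matches both articles in one query.
short_hits = root.findall(".//h5/a")

print([a.text for a in long_hits])
print([a.text for a in short_hits])
```

The short path is also less brittle: it keeps working if the site adds or removes a wrapper div, whereas the long paths would stop matching.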

Recap of the November crawler event

In [12]:
import requests
for i in range(1,3+1): # fetch pages 1 to 3
    url = "http://ecshweb.pchome.com.tw/search/v3.3/all/results?q=mac&page={}&sort=rnk/dc".format(str(i))
    resp = requests.get(url)
    html = resp.text
    

Exercise: g0v 斧頭幫大挑戰 (Axe Gang Challenge)

  • Practice crawler syntax
  • Get familiar with Python syntax
  • Level 4:
    header = {"User-Agent":"???",
              "key":"value",...
            }
    requests.get(url,headers=header)
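To see what custom headers look like before anything is sent, here is a small offline sketch using the standard library. The header values and the example.com URL are hypothetical placeholders, not the answer to Level 4. With requests the same dict is passed as headers=, as in the line above.

```python
from urllib.request import Request

# Hypothetical values for illustration; they are not the answer to Level 4.
headers = {
    "User-Agent": "my-crawler/0.1",
    "Accept-Language": "zh-TW",
}
req = Request("http://example.com/", headers=headers)

# urllib normalises header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
print(sorted(req.header_items()))
```

Comparing these headers with the ones Chrome shows in the Network tab is a quick way to spot what a site expects from a browser.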