
print ("Hello PyLadies"),按下介面上的 或是用快捷鍵Ctrl+Enter、Shift+Enter編譯執行


GET

from lxml import etree
import requests
url = "http://blog.marsw.tw"
response = requests.get(url)
html = response.text
fileout = open("test.html","w")
fileout.write(html)
fileout.close()
page = etree.HTML(html)
first_article_tags = page.xpath("//footer/a/text()")[0]
print (first_article_tags)
for article_title in page.xpath("//h5/a/text()"):
    print (article_title)
旅遊-日本
2014。09~關西9天滾出趣體驗
2014。05 日本 ~ Day9 旅程的最後~滾出趣精神、令人屏息的司馬遼太郎紀念館
2014。05 日本 ~ 滾出趣(調查11) 道頓堀的街頭運動
2014。05 日本 ~ Day8 鞍馬天氣晴、貴船川床流水涼麵、京都風情
2014。05 日本 ~ 高瀬川、鴨川、祇園之美
from lxml import etree
Equip only the etree tool from the lxml toolbox.
import requests
Equip the entire requests toolbox.
url = "http://blog.marsw.tw"
我們把 "http://blog.marsw.tw" 這個<字串 string>物件的儲存空間
以名稱 url 指向這著儲存空間。
response = requests.get(url)
Use the get tool from the requests toolbox.
This tool fetches the data from the web page at url (sending the request with the GET method),
and we point our chosen name response at the storage space holding that data.
html = response.text
Take the text data belonging to response, which is the page's source code (response holds the page data fetched with get),
and point our chosen name html at it.
In Python, the formal terms for "toolbox" and "tool" are
"module" and "function / class" respectively;
they are a bit more involved, so we won't explain them in detail today.
Let's first look at naming objects and the <string> type.
url = "http://blog.marsw.tw"
response = requests.get(url)
html = response.text
url, response, and html are all names we define ourselves (avoid Python's reserved words and built-in names: import, for, in, str, ...).
Pick meaningful names: xyz = "http://blog.marsw.tw" vs. url = "http://blog.marsw.tw".
Strings are wrapped in double quotes " or single quotes '.
Strings can be joined with +, or built with .format().

my_string = "PyLadies Taiwan"
my_string2 = ""
my_string2 = my_string2 + "Py" + "Ladies"
print (my_string)
print (my_string2)
PyLadies Taiwan
PyLadies
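A quick sketch of the quoting and naming rules above (the names and values here are made up for illustration):

# Either quote style makes a <string>
s1 = 'PyLadies'      # single quotes
s2 = "Taiwan"        # double quotes
print (s1 + " " + s2)
# Reserved words cannot be used as names:
# for = 2016   # SyntaxError!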
# Build each stock's URL (the stock ID sits inside the URL, not at the end)
url = "http://www.wantgoo.com/stock/{}?searchType=stocks".format(2330)
print (url)
url = "http://www.wantgoo.com/stock/{}?searchType=stocks".format(2371)
print (url)
http://www.wantgoo.com/stock/2330?searchType=stocks
http://www.wantgoo.com/stock/2371?searchType=stocks
# Build facebook fan-page URLs (we want them readable, without stringing + everywhere)
url = "https://www.facebook.com/{}/{}/".format("pyladies.tw","about")
print (url)
url = "https://www.facebook.com/{}/{}/".format("pyladies.tw","photos")
print (url)
https://www.facebook.com/pyladies.tw/about/
https://www.facebook.com/pyladies.tw/photos/
Without a format string, plain string concatenation turns into:
url = "https://www.facebook.com/"+"pyladies.tw"+"/"+"about"+"/"
And without recording it in a named object url, printing directly gives cluttered code like this:
print ("https://www.facebook.com/"+"pyladies.tw"+"/"+"about"+"/")
fileout = open("test.html","w")
open is the tool that opens a file for us; we call this file test.html.
The file is opened for "writing data", so we add "w"; without it, the file is opened for "reading".
We point the name fileout at this opened file.
fileout.write(html)
Then use the write function to write the data that the name html points to (it gets written into test.html).
fileout.close()
Finally, another person or program might use this file at the same time,
which could affect its contents, so we close the file with the close function;
at least from here on, this program will no longer touch the file.
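As a check, the same open tool without "w" opens the file for reading; a small sketch (not part of the original example):

filein = open("test.html")    # no "w": opened for reading
print (filein.read()[:100])   # show the first 100 characters
filein.close()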
article = "Bubble tea represents the 'QQ' food texture that Taiwanese love. The phrase refers to something that is especially chewy, like the tapioca balls that form the 'bubbles' in bubble tea. It's said this unusual drink was invented out of boredom. Chun Shui Tang and Hanlin Tea Room both claim to have invented bubble tea by combining sweetened tapioca pudding (a popular Taiwanese dessert) with tea. Regardless of which shop did it first, today the city is filled with bubble tea joints. Variations on the theme include taro-flavored tea, jasmine tea and coffee, served cold or hot."
fileout = open("my_article.txt","w")
fileout.write(article)
fileout.close()
Why record the file opened by open in the named object fileout?
Because we open it with w, and when a file with exactly the same name already exists, its original contents are wiped.
So if we wrote:
open('test.html','w').write(html)
open('test.html','w').close()
the second open of test.html with w
would wipe what the first write wrote.
That's why we use a named object: we reuse the already-opened file and open it only once.
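As an aside, Python's with statement (not covered in the original text) opens a file, names it, and closes it for you automatically:

# with closes the file by itself at the end of the block
with open("my_article.txt","w") as fileout:
    fileout.write(article)   # article is the string defined above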
Now that everyone knows how GET works, type in the code for fetching a page with GET, press the Run button in the interface, or use the shortcut Ctrl+Enter or Shift+Enter to execute it.
from lxml import etree
import requests
url = "http://blog.marsw.tw"
response = requests.get(url)
html = response.text
fileout = open("test.html","w")
fileout.write(html)
fileout.close()
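If the saved file comes out garbled (mojibake is common on Windows), the variant below forces UTF-8 on both the response and the output file: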
from lxml import etree
import requests
url = "http://blog.marsw.tw"
response = requests.get(url)
response.encoding = 'utf-8'
html = response.text
fileout = open(r"C:\Users\Desktop\test.html","w",encoding="utf8")
fileout.write(html)
fileout.close()
You should now get a test.html file.

Try changing url to fetch different web pages!

from lxml import etree
import requests
url = "http://blog.marsw.tw"
response = requests.get(url)
html = response.text
fileout = open("test.html","w")
fileout.write(html)
fileout.close()
page = etree.HTML(html)
Take the data stored in the object named html (the page source we fetched),
convert it with the etree tool's HTML function into "XPath node form",
and stick the name page on it.
The etree tool from the lxml toolbox, declared on the first line of the example, finally gets used!
from lxml import etree
<a href="">連結文字</a> <img src=""/><h1>~<h6><br><html>
<head>
<title>Title</title>
</head>
<body>
<h1>Subtitle</h1>
<a href="http://tw.pyladies.com/">PyLadies Website</a>
<p>
This is a paragraph <br>
<a href="http://www.meetup.com/PyLadiesTW/">PyLadies Meetup</a> <br>
<a href="https://www.facebook.com/pyladies.tw">PyLadies FB</a> <br>
</p>
<img src="http://tw.pyladies.com/img/logo2.png" width="99px"/>
</body>
</html>
XPath path syntax:
/  : the node one level down, or that tag's attribute (@) or text (text())
// : children, children's children, children's children's children, ...etc.
.  : search starting from the current node; often used to pull together data that sits at the same level (generation)

<html>
<head>
<title>Title</title>
</head>
<body>
<h1>Subtitle</h1>
<a href="http://tw.pyladies.com/">PyLadies Website</a>
<p>
This is a paragraph <br>
<a href="http://www.meetup.com/PyLadiesTW/">PyLadies Meetup</a> <br>
<a href="https://www.facebook.com/pyladies.tw">PyLadies FB</a> <br>
</p>
<img src="http://tw.pyladies.com/img/logo2.png" width="99px"/>
</body>
</html>
# Here we don't fetch with requests; we paste the source straight into html, to make things easier to follow
# For long text containing line breaks, use "triple" double quotes or single quotes
from lxml import etree
html = """
<html>
<head>
<title>Title</title>
</head>
<body>
<h1>Subtitle</h1>
<a href="http://tw.pyladies.com/">PyLadies Website</a>
<p>
This is a paragraph <br>
<a href="http://www.meetup.com/PyLadiesTW/">PyLadies Meetup</a> <br>
<a href="https://www.facebook.com/pyladies.tw">PyLadies FB</a> <br>
</p>
<img src="http://tw.pyladies.com/img/logo2.png" width="99px"/>
</body>
</html>
"""
page = etree.HTML(html)
link_text_list = page.xpath("//a/text()")
link_text_p_list = page.xpath("//p/a/text()")
link_list = page.xpath("//a/@href")
print (link_text_list)
print (link_text_p_list)
print (link_list)
['PyLadies Website', 'PyLadies Meetup', 'PyLadies FB']
['PyLadies Meetup', 'PyLadies FB']
['http://tw.pyladies.com/', 'http://www.meetup.com/PyLadiesTW/', 'https://www.facebook.com/pyladies.tw']
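To make the difference between / and // (and the relative ., used later) concrete, here is a small sketch against the same sample page; the extra prints are mine, not from the original:

# Absolute path with / : must spell out every level from the root
print (page.xpath("/html/body/a/text()"))   # ['PyLadies Website'] -- only the direct child of <body>
# // : matches at any depth
print (page.xpath("//a/text()"))            # all three link texts
# . : relative to a node we already hold
p_node = page.xpath("//p")[0]
print (p_node.xpath("./a/text()"))          # ['PyLadies Meetup', 'PyLadies FB']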
A web page contains many links (a tags); an object type that can store many pieces of data like this is called a "list".
my_list = ["a",2016,5566,"PyLadies"]
my_list2= []
my_list2+=[2016]
my_list2+=["abc"]
print (my_list)
print (my_list2)
['a', 2016, 5566, 'PyLadies']
[2016, 'abc']
len(list) : gets the length of the list

my_list = ["a",2016,5566,"PyLadies",2016,2016.0]
print ("The 1th element of my_list = ",my_list[0])
print ("The 4th element of my_list = ",my_list[3])
print ("The last element of my_list = ",my_list[-1])
print ("The second-last element of my_list = ",my_list[-2])
print ("Length of my_list = ",len(my_list))
The 1st element of my_list =  a
The 4th element of my_list =  PyLadies
The last element of my_list =  2016.0
The second-last element of my_list =  2016
Length of my_list =  6
first_article_tags = page.xpath("//footer/a/text()")[0]
In the code above we already pointed page at the object in "XPath node form".
This line searches the data page points to with xpath:
across the whole document (//), every footer tag with an a-tag child, taking that a's text (text());
we keep only the first result and point the name first_article_tags at it.
print (first_article_tags)
Print the data that first_article_tags points to on screen.
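One caveat (my addition, not from the original): xpath always returns a list, and taking [0] from an empty list raises an IndexError, so on a page with no match it is safer to check first:

results = page.xpath("//footer/a/text()")
if results:                    # the list is non-empty
    print (results[0])
else:
    print ("no match found")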
from lxml import etree
html = """
<html>
<head>
<title>Title</title>
</head>
<body>
<h1>Subtitle</h1>
<a href="http://tw.pyladies.com/">PyLadies Website</a>
<p>
This is a paragraph <br>
<a href="http://www.meetup.com/PyLadiesTW/">PyLadies Meetup</a> <br>
<a href="https://www.facebook.com/pyladies.tw">PyLadies FB</a> <br>
</p>
<img src="http://tw.pyladies.com/img/logo2.png" width="99px"/>
</body>
</html>
"""
page = etree.HTML(html)
link_text_list = page.xpath("//a/text()")
print (link_text_list)
print ("網頁中共有",len(link_text_list),"個連結")
['PyLadies Website', 'PyLadies Meetup', 'PyLadies FB']
The page contains 3 links
from lxml import etree
html = """
<html>
<head>
<title>Title</title>
</head>
<body>
<h1>Subtitle</h1>
<a href="http://tw.pyladies.com/">PyLadies Website</a>
<p>
This is a paragraph <br>
<a href="http://www.meetup.com/PyLadiesTW/">PyLadies Meetup</a> <br>
<a href="https://www.facebook.com/pyladies.tw">PyLadies FB</a> <br>
</p>
<img src="http://tw.pyladies.com/img/logo2.png" width="99px"/>
</body>
</html>
"""
page = etree.HTML(html)
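# Exercise: fill in the blanks to grab every link's @href and print the result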
link_list = page.xpath("______")
print (____)
my_list = ["a",2016,5566,"PyLadies"]
for element in my_list:
    print (element)
a
2016
5566
PyLadies
for article_title in page.xpath("//h5/a/text()"):
    print (article_title)
Take the results (a list) of: the whole document (//), every h5 tag with an a-tag child, that a's text (text());
pull them out one at a time, point the name article_title at each one, and print it.
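Exercise: fill in the blanks in the block below so that it prints the text of every link on the sample page, one per line. The expected output: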
PyLadies Website
PyLadies Meetup
PyLadies FB
from lxml import etree
html = """
<html>
<head>
<title>Title</title>
</head>
<body>
<h1>Subtitle</h1>
<a href="http://tw.pyladies.com/">PyLadies Website</a>
<p>
This is a paragraph <br>
<a href="http://www.meetup.com/PyLadiesTW/">PyLadies Meetup</a> <br>
<a href="https://www.facebook.com/pyladies.tw">PyLadies FB</a> <br>
</p>
<img src="http://tw.pyladies.com/img/logo2.png" width="99px"/>
</body>
</html>
"""
page = _____
___ text ___ page.xpath("_____"):
    print(text) # remember to indent


//*[@id="Blog1"]/div[1]/div[1]/div/div[1]/article/header/h5/a

but the much shorter path below reaches the same nodes:

//h5/a

from lxml import etree
html = """
<html>
<head>
<title>Title</title>
</head>
<body>
<h1>Subtitle</h1>
<a href="http://tw.pyladies.com/">PyLadies Website</a>
<p>
This is a paragraph <br>
<a href="http://www.meetup.com/PyLadiesTW/">PyLadies Meetup</a> <br>
<a href="https://www.facebook.com/pyladies.tw">PyLadies FB</a> <br>
</p>
<img src="http://tw.pyladies.com/img/logo2.png" width="99px"/>
</body>
</html>
"""
page = etree.HTML(html)
for link_node in page.xpath("//a"):
    text = link_node.xpath("./text()")[0]
    link = link_node.xpath("./@href")[0]
    print (text,link)
PyLadies Website http://tw.pyladies.com/
PyLadies Meetup http://www.meetup.com/PyLadiesTW/
PyLadies FB https://www.facebook.com/pyladies.tw
from lxml import etree
import requests
url = "http://blog.marsw.tw"
response = requests.get(url)
html = response.text
page = etree.HTML(html)
for img_src in page.xpath("//img/@src"):
    # fetch the image itself
    img_response = requests.get(img_src)
    img = img_response.content
    filename = img_src.split("/")[-1]
    filepath = "tmp/"+filename
    fileout = open(filepath,"wb")
    fileout.write(img)
    fileout.close()   # close each file once it is written
The files go into tmp; remember to create that folder first, otherwise the path won't be found.
An image is binary data, so we take the response's content (the contents in bytes form) instead of text; and because it is bytes, we open the file with wb.
string_list = original_string.split(substring) : cuts the "original string" at every "substring" into a "list of strings".

my_string = "PyLadies Taiwan"
print (my_string.split(" "))
print (my_string.split(" ")[0])
print (my_string.split(" ")[-1])
['PyLadies', 'Taiwan']
PyLadies
Taiwan
img_src = "http://1.bp.blogspot.com/-lP9M5nJ-kb0/U3nAOYRLCAI/AAAAAAAAUaw/1SbrOZBwz3g/s1600/2014-05-01%2B22.26.27.jpg"
print (img_src.split("/"))
print (img_src.split("/")[-1])
['http:', '', '1.bp.blogspot.com', '-lP9M5nJ-kb0', 'U3nAOYRLCAI', 'AAAAAAAAUaw', '1SbrOZBwz3g', 's1600', '2014-05-01%2B22.26.27.jpg']
2014-05-01%2B22.26.27.jpg
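If you'd rather not create the tmp folder by hand, the standard library can do it; a small sketch (os.makedirs is my addition, not part of the original code):

import os
os.makedirs("tmp", exist_ok=True)   # create tmp if it doesn't exist yet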
from lxml import etree
import requests
# Crawling
url = "http://blog.marsw.tw"
response = requests.get(url)
html = response.text
# Save to File
fileout = open("test.html","w")
fileout.write(html)
fileout.close()
# Parsing
page = etree.HTML(html)
for article_title_node in page.xpath("//h5/a"):
    text = article_title_node.xpath("./text()")[0]
    link = article_title_node.xpath("./@href")[0]
    print (text,link)
2014。09~關西9天滾出趣體驗 http://blog.marsw.tw/2014/09/2014099.html
2014。05 日本 ~ Day9 旅程的最後~滾出趣精神、令人屏息的司馬遼太郎紀念館 http://blog.marsw.tw/2014/09/201405-day9.html
2014。05 日本 ~ 滾出趣(調查11) 道頓堀的街頭運動 http://blog.marsw.tw/2014/09/201405-11.html
2014。05 日本 ~ Day8 鞍馬天氣晴、貴船川床流水涼麵、京都風情 http://blog.marsw.tw/2014/09/201405-day8.html
2014。05 日本 ~ 高瀬川、鴨川、祇園之美 http://blog.marsw.tw/2014/09/201405.html
//*[@id="Blog1"]/div[1]/div[1]/div/div[1]/article/header/h5/a//*[@id="Blog1"]/div[1]/div[1]/div/div[2]/article/header/h5/a//article/header/h5/a,甚至更簡短的//h5/a就可以達到同樣效果!pchome的網址無法直接拿到商品資訊