Web Crawling from 0 to 1 - 05

Pyladies Taiwan

Speaker : Mars

2017/02

Roadmap

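  • When you need Selenium+PhantomJS
  • Basic operations: opening a page, taking screenshots, releasing resources, getting page content
  • Page interaction: keyboard operations, button operations
  • Advanced: ending the wait early once rendering completes
  • Appendix: installation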

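When you need Selenium+PhantomJS

Lessons from development experience

  • Snapshots of a product at the moment a customer buys it
  • Search engine cached pages for SEO
  • Login or multi-page verification handled by JavaScript
  • Requests that need complex parameters
  • CAPTCHAs
  • Automated testing

=> What is needed is a toolkit that automates browser operations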
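As a rough illustration of the JavaScript case above: when a page builds its content in the browser, the raw HTML from a plain HTTP request does not contain that content yet. The page source below is made up for this sketch (it is not PChome's real markup), and `ItemCounter` is a hypothetical helper built on the standard library parser.

```python
from html.parser import HTMLParser

# Hypothetical page source: the product list is filled in by JavaScript,
# so the static markup only contains an empty container.
RAW_HTML = """
<html><body>
  <ul id="products"></ul>
  <script>
    var ul = document.getElementById("products");
    ["MacBook", "Mac mini"].forEach(function (name) {
      var li = document.createElement("li");
      li.textContent = name;
      ul.appendChild(li);
    });
  </script>
</body></html>
"""


class ItemCounter(HTMLParser):
    """Counts the <li> tags present in the static markup."""

    def __init__(self):
        super().__init__()
        self.items = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.items += 1


counter = ItemCounter()
counter.feed(RAW_HTML)
static_item_count = counter.items
print(static_item_count)  # no <li> exists until a browser runs the script
```

A tool that drives a real browser engine runs the script first, so the rendered page would contain the two items.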

Browser engines supported by Selenium

  • Firefox
  • Chrome
  • IE
  • Opera
  • PhantomJS

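Basic operations

- Using the PChome product listing as an example

Opening a page and taking a screenshot

  • webdriver.PhantomJS(path) looks up the executable through the PATH environment variable by default; you can also pass the path explicitly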
In [1]:
from selenium import webdriver
driver = webdriver.PhantomJS() 
driver.get("http://ecshweb.pchome.com.tw/search/v3.3/?q=mac")
driver.save_screenshot("pchome.jpg")   
Out[1]:
True
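Before starting the driver, it can help to check where the PhantomJS executable actually is. The sketch below uses only the standard library; `find_phantomjs` is a hypothetical helper name, not part of Selenium.

```python
import os
import shutil


def find_phantomjs(explicit_path=None):
    """Return a usable PhantomJS path, or None if nothing is found.

    Hypothetical helper: prefers an explicitly given file, otherwise
    falls back to searching the PATH environment variable.
    """
    if explicit_path and os.path.isfile(explicit_path):
        return explicit_path
    return shutil.which("phantomjs")


path = find_phantomjs()
print(path if path else "phantomjs not found on PATH")
```

If a path is found, it can be passed as webdriver.PhantomJS(path).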

Releasing resources

Command to kill every PhantomJS process at once on Unix:
kill $(ps aux | grep 'phantomjs' | awk '{print $2}')

In [2]:
import signal
driver.service.process.send_signal(signal.SIGTERM) # kill the process at the OS level
driver.quit()                                      # release the driver in the program
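The two cleanup calls above are easy to skip when the crawl raises an exception halfway through. One possible pattern, sketched here, is a context manager that always runs them. `managed_driver` is a hypothetical helper, and the `Fake*` classes are stand-ins so the sketch runs without PhantomJS; in real use you would pass `webdriver.PhantomJS` instead.

```python
from contextlib import contextmanager
import signal


@contextmanager
def managed_driver(make_driver):
    """Yield a driver and always release it afterwards.

    `make_driver` is any zero-argument callable that returns a driver.
    """
    driver = make_driver()
    try:
        yield driver
    finally:
        try:
            # Kill the browser process at the OS level
            driver.service.process.send_signal(signal.SIGTERM)
        finally:
            # Release the driver in the program
            driver.quit()


# Stand-in objects that mimic only the attributes used above.
class FakeProcess:
    def __init__(self):
        self.signals = []

    def send_signal(self, sig):
        self.signals.append(sig)


class FakeService:
    def __init__(self):
        self.process = FakeProcess()


class FakeDriver:
    def __init__(self):
        self.service = FakeService()
        self.closed = False

    def quit(self):
        self.closed = True


try:
    with managed_driver(FakeDriver) as driver:
        raise RuntimeError("simulated crawl failure")
except RuntimeError:
    pass

# Cleanup ran even though the body raised
print(driver.closed, driver.service.process.signals == [signal.SIGTERM])
```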

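Getting page content

  • PhantomJS decodes as utf8 by default
  • requests decodes according to the Content-Type header by default
  • If the page contains Chinese text, extra fonts must be installed on the system (see the appendix)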
In [3]:
from selenium import webdriver
import signal
driver = webdriver.PhantomJS() 
driver.get("http://ecshweb.pchome.com.tw/search/v3.3/?q=mac")
html_selenium = driver.page_source
# Write with an explicit encoding so non-ASCII text is saved consistently
with open('pchome_selenium.html', 'w', encoding='utf8') as fileout:
    fileout.write(html_selenium)

driver.service.process.send_signal(signal.SIGTERM)
driver.quit()    
In [4]:
import requests 
resp = requests.get("http://ecshweb.pchome.com.tw/search/v3.3/?q=mac")
html_requests = resp.content.decode('utf8')
# Write with an explicit encoding so non-ASCII text is saved consistently
with open('pchome_requests.html', 'w', encoding='utf8') as fileout:
    fileout.write(html_requests)
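The decoding difference matters when a page is not served as utf8. The sketch below uses a made-up Big5 response body and header (not taken from PChome) to show decoding by the charset in Content-Type versus assuming utf8. `charset_from_content_type` is a hypothetical helper built on the standard library.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


# Assumed example response: a Big5-encoded body with a matching header
body = "蘋果電腦".encode("big5")
header = "text/html; charset=Big5"

charset = charset_from_content_type(header)
text_by_header = body.decode(charset)                  # follows Content-Type
text_as_utf8 = body.decode("utf8", errors="replace")   # assumes utf8

print(charset)
print(text_by_header == "蘋果電腦")
print(text_as_utf8 == "蘋果電腦")
```

Only the header-based decode recovers the original text here; the utf8 attempt yields replacement characters.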

Page interaction

  • Selecting elements with find_element: when several elements match, the first one is returned
    • find_element(By.ID,"")
    • find_element(By.NAME,"")
    • find_element(By.XPATH,"")
    • find_element(By.LINK_TEXT,"")
    • find_element(By.PARTIAL_LINK_TEXT,"")
    • find_element(By.TAG_NAME,"")
    • find_element(By.CLASS_NAME,"")
    • find_element(By.CSS_SELECTOR,"")
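The "first match wins" behaviour can be tried without a browser. The sketch below is only an analogy using the standard library's ElementTree on made-up markup: `find` returns the first match, the way find_element does, and `findall` returns every match, the way Selenium's find_elements does.

```python
import xml.etree.ElementTree as ET

# Made-up markup with three elements that match the same XPath
doc = ET.fromstring(
    "<ul>"
    "<li class='item'>MacBook</li>"
    "<li class='item'>Mac mini</li>"
    "<li class='item'>iMac</li>"
    "</ul>"
)

first = doc.find(".//li[@class='item']")           # first match only
everything = doc.findall(".//li[@class='item']")   # every match, in document order

print(first.text)
print([li.text for li in everything])
```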

Keyboard operations

  • Special keys
    • ARROW_UP, ARROW_DOWN, ARROW_LEFT, ARROW_RIGHT
    • BACKSPACE, DELETE
    • ENTER, SPACE, TAB
    • COMMAND, CONTROL, SHIFT, ALT
    • F1~F12
In [5]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
import signal

driver = webdriver.PhantomJS() 
driver.get("http://shopping.pchome.com.tw/")
time.sleep(5)

text_box = driver.find_element(By.XPATH, "//*[@id='keyword']")
text_box.send_keys("macbook")     # type text
text_box.clear()                  # clear the field
text_box.send_keys("mac")         # type text
driver.save_screenshot("pchome_textbox.jpg") 

text_box.send_keys(Keys.ENTER)
time.sleep(5)
driver.save_screenshot("pchome_search.jpg") 

driver.service.process.send_signal(signal.SIGTERM)
driver.quit() 

Coding Time

Try capturing screenshots of the crawler slides!

In [6]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from lxml import etree
import time
import signal

driver = webdriver.PhantomJS() 
url = "http://tw.pyladies.com/~marsw/crawler04.slides.html"
driver.get(url)
time.sleep(10)
driver.save_screenshot("slide01.jpg") 
fileout = open('slide.html','w')
fileout.write(driver.page_source)
fileout.close()

page = etree.HTML(driver.page_source)
for button_enable in page.xpath("//button/@class"):
    print (button_enable)

anywhere = driver.find_element(By.XPATH, "//h2")
anywhere.send_keys(Keys.ARROW_DOWN)

time.sleep(5)
driver.save_screenshot("slide02.jpg") 

driver.service.process.send_signal(signal.SIGTERM)
driver.quit() 
navigate-left
navigate-right enabled
navigate-up
navigate-down enabled
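The class values printed above tell you which directions the slide deck can still move in. A small sketch of turning them into a list of usable directions follows; `enabled_directions` is a hypothetical helper, and the input is the output shown above.

```python
def enabled_directions(button_classes):
    """Return the directions whose navigation button is marked `enabled`."""
    directions = []
    for class_attr in button_classes:
        names = class_attr.split()
        if "enabled" not in names:
            continue
        for name in names:
            if name.startswith("navigate-"):
                directions.append(name[len("navigate-"):])
    return directions


classes = [
    "navigate-left",
    "navigate-right enabled",
    "navigate-up",
    "navigate-down enabled",
]
print(enabled_directions(classes))
```

With this, a loop could keep sending Keys.ARROW_DOWN and saving screenshots until "down" is no longer enabled.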

Button operations

  • Regular buttons
  • Radio buttons
  • Checkboxes
  • Drop-down lists
In [7]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import signal

driver = webdriver.PhantomJS() 
driver.get("http://24h.pchome.com.tw/prod/DGAX07-A900799PN")
time.sleep(10)

### Drop-down list
driver.find_element(By.XPATH, "//select[@class='Qty']/option[text()='2']").click()

### Regular button
sbtn = driver.find_element(By.XPATH, "//button[@accesskey='I']")
sbtn.click()
time.sleep(10)
driver.save_screenshot("pchome_add2cart.jpg") 


driver.service.process.send_signal(signal.SIGTERM)
driver.quit() 

Advanced: ending the wait early once rendering completes

In [8]:
# No waiting
from selenium import webdriver
from selenium.webdriver.common.by import By
import signal
driver = webdriver.PhantomJS() 
driver.get("http://24h.pchome.com.tw/prod/DGAX07-A900799PN")

driver.find_element(By.XPATH, "//select[@class='Qty']/option[text()='2']").click()

driver.service.process.send_signal(signal.SIGTERM)
driver.quit() 
---------------------------------------------------------------------------
NoSuchElementException                    Traceback (most recent call last)
<ipython-input-8-675a839bd9d5> in <module>()
      5 driver.get("http://24h.pchome.com.tw/prod/DGAX07-A900799PN")
      6 
----> 7 driver.find_element(By.XPATH, "//select[@class='Qty']/option[text()='2']").click()
      8 
      9 driver.service.process.send_signal(signal.SIGTERM)

/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/webdriver.py in find_element(self, by, value)
    750         return self.execute(Command.FIND_ELEMENT, {
    751             'using': by,
--> 752             'value': value})['value']
    753 
    754     def find_elements(self, by=By.ID, value=None):

/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/webdriver.py in execute(self, driver_command, params)
    234         response = self.command_executor.execute(driver_command, params)
    235         if response:
--> 236             self.error_handler.check_response(response)
    237             response['value'] = self._unwrap_value(
    238                 response.get('value', None))

/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/errorhandler.py in check_response(self, response)
    190         elif exception_class == UnexpectedAlertPresentException and 'alert' in value:
    191             raise exception_class(message, screen, stacktrace, value['alert'].get('text'))
--> 192         raise exception_class(message, screen, stacktrace)
    193 
    194     def _value_or_default(self, obj, key, default):

NoSuchElementException: Message: {"errorMessage":"Unable to find element with xpath '//select[@class='Qty']/option[text()='2']'","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"125","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:34925","User-Agent":"Python-urllib/3.4"},"httpVersion":"1.1","method":"POST","post":"{\"value\": \"//select[@class='Qty']/option[text()='2']\", \"using\": \"xpath\", \"sessionId\": \"e7a97100-f330-11e6-99d9-c35c87d1b9f4\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/e7a97100-f330-11e6-99d9-c35c87d1b9f4/element"}}
Screenshot: available via screen
In [9]:
# Forced wait of 10 seconds
from selenium import webdriver
from selenium.webdriver.common.by import By
import signal
import time
driver = webdriver.PhantomJS() 
driver.get("http://24h.pchome.com.tw/prod/DGAX07-A900799PN")
tstart = time.time()
time.sleep(10)  
tstop = time.time()
print (tstop-tstart)
driver.find_element(By.XPATH, "//select[@class='Qty']/option[text()='2']").click()

driver.service.process.send_signal(signal.SIGTERM)
driver.quit() 
10.015915870666504
In [10]:
# Wait at most 10 seconds
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import signal
import time
driver = webdriver.PhantomJS() 
driver.get("http://24h.pchome.com.tw/prod/DGAX07-A900799PN")
tstart = time.time()
element = None
try:
    # Returns as soon as the element is present in the DOM
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//select[@class='Qty']/option[text()='2']"))
    )
except TimeoutException:
    print ("TimeOut")
finally:
    tstop = time.time()
    print (tstop-tstart)
    # Click only if the wait found the element
    if element is not None:
        element.click()
    driver.service.process.send_signal(signal.SIGTERM)
    driver.quit()
1.8321282863616943
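The reason the last run finishes in under 2 seconds is that the wait keeps re-checking the condition and returns the moment it holds, instead of sleeping for the full 10 seconds. The sketch below shows that polling idea in plain Python. It is not Selenium's implementation; `wait_until` and `element_ready` are hypothetical names, and the condition is a stand-in that becomes true on its third check.

```python
import time


def wait_until(condition, timeout=10, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Raises TimeoutError once `timeout` seconds have elapsed.
    """
    deadline = time.time() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.time() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(poll)


# Stand-in for "the element has been rendered"
calls = {"count": 0}


def element_ready():
    calls["count"] += 1
    return "element" if calls["count"] >= 3 else None


tstart = time.time()
found = wait_until(element_ready, timeout=10, poll=0.1)
elapsed = time.time() - tstart
print(found, calls["count"], elapsed < 10)
```

The timeout is only an upper bound: the call returns as soon as the condition is met.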

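Appendix: installation

  • [PhantomJS - install via npm]: sets up the required environment and variables for you
  • [PhantomJS - use the binary package directly]: the lightweight option
  • [Chrome]
  • [Selenium]
  • [Chinese fonts]: not needed on Windows or Mac

PhantomJS - install via npm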

[Mac]

  • Install Node.JS
  • sudo npm -g install phantomjs

[CentOS]

sudo yum install nodejs
sudo yum install npm
sudo yum install bzip2 
sudo npm -g install phantomjs
sudo yum install fontconfig

[Ubuntu]

sudo apt-get install -y nodejs
sudo apt-get install npm
sudo apt-get install phantomjs

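PhantomJS - use the binary package directly

Depending on the system, you may also need to install fontconfig or libfontconfig.

[Windows/Mac/Linux/FreeBSD]

  • Download the build for your system from the official site and extract it
  • Specify the path in code with webdriver.PhantomJS(path), or set the environment variable for your system

    On Unix, after extracting, symlink the binary so it is on the PATH:
    sudo ln -s /xxxx/bin/phantomjs /usr/bin/phantomjs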

[Raspberry Pi] GitHub source

sudo apt-get install libfontconfig1 libfreetype6 libpng12-0
curl -o /tmp/phantomjs_2.1.1_armhf.deb -sSL https://github.com/fg2it/phantomjs-on-raspberry/releases/download/v2.1.1-wheezy-jessie-armv6/phantomjs_2.1.1_armhf.deb
sudo dpkg -i /tmp/phantomjs_2.1.1_armhf.deb
sudo ln -s /usr/local/bin/phantomjs /usr/bin/phantomjs

Chrome

Selenium

[easy_install]

sudo easy_install selenium

[pip]

sudo pip install selenium
If a Unix system has both Python 2.x and Python 3.x, install for Python 3.x with: sudo pip3 install selenium
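When several Python versions are installed, a quick way to confirm that the interpreter you are running can see the package is to ask the import system directly. `is_installed` is a hypothetical helper name; run the script with the same interpreter you use for crawling.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the current interpreter can locate `module_name`."""
    return importlib.util.find_spec(module_name) is not None


print("selenium installed:", is_installed("selenium"))
```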

Chinese fonts

[CentOS]

yum install bitmap-fonts bitmap-fonts-cjk

[Ubuntu]

sudo apt-get install xfonts-wqy
sudo apt-get install ttf-wqy-zenhei