Web Crawling from 0 to 1 - 06¶

Pyladies Taiwan¶

Speaker : Mars¶

2017/03¶

Roadmap¶

Types of CAPTCHAs and how to handle them
pytesseract
2captcha
Cropping images and hands-on walkthroughs
Appendix: installation

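Types of CAPTCHAs and how to handle them¶

Types of CAPTCHAs¶

Standard fonts¶

Standard fonts + noise¶

Rotated fonts¶

Truly nasty CAPTCHAs¶

How to handle CAPTCHAs¶

pytesseract: standard fonts
Image processing (denoising, grayscale, contrast enhancement...): standard fonts + noise, rotated fonts
Processing CAPTCHAs with OpenCV
Machine learning: when training data is available (Tesseract itself is built on machine-learning ideas)
Human labor (solving by hand): the most accurate way to solve a CAPTCHA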
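To make the denoising step listed above a little more concrete, here is a minimal pure-Python sketch of a 3x3 median filter on a grayscale grid. The function name `median_filter` and the nested-list image representation are assumptions made for illustration only; in practice you would more likely use Pillow's `ImageFilter.MedianFilter` or OpenCV.

```python
from statistics import median

def median_filter(gray):
    """Return a copy of a 2D grayscale grid in which each pixel is replaced by
    the median of its 3x3 neighbourhood (edges use the neighbours that exist)."""
    height, width = len(gray), len(gray[0])
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                gray[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            result[y][x] = int(median(window))
    return result

# A white 5x5 image with a single black speck of noise in the middle.
noisy = [[255] * 5 for _ in range(5)]
noisy[2][2] = 0
cleaned = median_filter(noisy)
print(cleaned[2][2])  # the speck is replaced by the surrounding white
```

Isolated specks disappear because they are never the median of their neighbourhood, while larger strokes (the characters) mostly survive.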

Everything introduced below targets text-input CAPTCHAs only!

pytesseract ¶

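A Python module built on Tesseract-OCR, which was maintained by Google at the time of this talk
Multilingual: languages are selected with ISO 639-2 codes (e.g. eng)
Open source

Pros: fast, and handles English recognition (Chinese was problematic in our tests)
Cons: not very accurate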
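Since accuracy is the weak point, one commonly used mitigation is to tell Tesseract what kind of text to expect. The sketch below builds an option string and passes it through the `config` argument of `pytesseract.image_to_string`. The helper names `build_tesseract_config` and `read_digits` are hypothetical, and the flags are assumptions about your Tesseract version: older 3.x releases spell the page-segmentation flag `-psm`, and some versions ignore the character whitelist.

```python
def build_tesseract_config(whitelist, psm=7):
    """Build a Tesseract option string: psm 7 treats the image as a single
    text line, and the whitelist limits which characters may be returned."""
    return "--psm {} -c tessedit_char_whitelist={}".format(psm, whitelist)

def read_digits(path):
    """OCR an image that is expected to contain digits only."""
    # Imported here so the config helper above works without these packages.
    from PIL import Image
    import pytesseract
    config = build_tesseract_config("0123456789")
    return pytesseract.image_to_string(Image.open(path), config=config).strip()

print(build_tesseract_config("0123456789"))
```

Usage would look like `read_digits('captcha/1-284176.png')`, assuming Tesseract and pytesseract are installed.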

In [1]:

# pytesseract can recognize text in an image
from PIL import Image
import pytesseract
print(pytesseract.image_to_string(Image.open('slide_image/ocr-eng.png'),lang="eng"))

Pyladies is a group of women developers who love the Python
programming language.

We are an intemat‘lonal mentorship group 9 with a focus on helping
more women become active participants and leaders in the Python
open-source community.

We host monthly meetups with different topics such as beginners
meetups, project of python presentation and tutorial Open to all who
identify as women. Feel free to join us!

In [2]:

from PIL import Image
import pytesseract

print(pytesseract.image_to_string(Image.open('captcha/1-284176.png')))
print(pytesseract.image_to_string(Image.open('captcha/1-882579.png')))

284176
882579

In [3]:

# Noisy CAPTCHAs: feeding them straight to pytesseract works poorly
from PIL import Image
import pytesseract

for name in ['2-J45Z', '2-R59X', '2-4089', '2-9198',
             '2-0479', '2-1430', '2-49586', '2-88860']:
    print(pytesseract.image_to_string(Image.open('captcha/{}.png'.format(name))))
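To put a number on "works poorly", a small scoring helper can compare each OCR result with the expected answer. The sketch below assumes the sample files follow a `<type>-<answer>.png` naming pattern (inferred from names such as `1-284176.png`, not stated in the slides); `expected_from_filename`, `accuracy` and `fake_recognize` are hypothetical names, and the recognizer is a stand-in so the example runs without Tesseract.

```python
import os

def expected_from_filename(path):
    """'captcha/2-J45Z.png' -> 'J45Z' (assumes '<type>-<answer>.png')."""
    name = os.path.splitext(os.path.basename(path))[0]
    return name.split("-", 1)[1]

def accuracy(paths, recognize):
    """Fraction of images for which recognize(path) equals the expected text."""
    hits = sum(1 for p in paths
               if recognize(p).strip() == expected_from_filename(p))
    return hits / len(paths)

# Stand-in recognizer for demonstration; swap in a pytesseract-based function.
def fake_recognize(path):
    return "J45Z" if "J45Z" in path else ""

paths = ["captcha/2-J45Z.png", "captcha/2-R59X.png"]
print(accuracy(paths, fake_recognize))
```

The same helper can then be rerun after each preprocessing change to see whether it actually helps.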

pytesseract + image processing (threshold)¶

In [4]:

from PIL import Image

def threshold(filein, fileout, limit=100):
    """Binarize an image: dark pixels become black, everything else white."""
    img = Image.open(filein).convert('RGBA')
    pixdata = img.load()

    for y in range(img.size[1]):
        for x in range(img.size[0]):
            r, g, b = pixdata[x, y][:3]
            # If any of the R, G, B channels is below the limit, paint the pixel black
            if r < limit or g < limit or b < limit:
                pixdata[x, y] = (0, 0, 0, 255)
            else:
                pixdata[x, y] = (255, 255, 255, 255)
    img.save(fileout)

In [5]:

from PIL import Image
import pytesseract

threshold('captcha/2-R59X.png','captcha/2-R59X_threshold.png')
print(pytesseract.image_to_string(Image.open('captcha/2-R59X.png')))
print(pytesseract.image_to_string(Image.open('captcha/2-R59X_threshold.png')))

R5 9X
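The thresholded result still contains a stray space. A small clean-up step can remove such artifacts before the text is submitted. This is a minimal sketch that assumes the CAPTCHA uses only ASCII letters and digits; `clean_captcha_text` is a hypothetical helper name.

```python
import re

def clean_captcha_text(raw):
    """Drop everything except ASCII letters and digits from OCR output."""
    return re.sub(r"[^0-9A-Za-z]", "", raw)

print(clean_captcha_text("R5 9X\n"))
```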

2captcha ¶

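Supports many programming languages: Python, PHP, C++, C#, JS, etc.
You can earn credit by working as a human solver instead of buying it
Pros: simple, easy to use, and highly accurate
Cons: slow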

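Other similar free services¶

Services recommended by fellow practitioners¶

Most of these must be paid for before use; their advantage is speed

Anti-Captcha: Hellowings
打碼兔 (dama2): 大數學堂
Text Captcha Decoder
Analysis of solving services: Antigate, BeatCaptchas, BypassCaptcha, CaptchaBot, CaptchaBypass, CaptchaGateway, DeCaptcher, ImageToText
- Accuracy: BypassCaptcha is the lowest (86-89%); the rest are about the same (93-97%)
- Response time: ImageToText and Antigate are the fastest (9.4 s and 9.6 s), CaptchaGateway is the slowest (21.3 s); the median is 14 s
- Does a higher price buy a faster response? Antigate and CaptchaBot offer the best value for money, while ImageToText's pricing is unreasonable

=> 2captcha was chosen simply because it turned up in a search, and because it can be used without a credit card or any payment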

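Registering for and using 2captcha¶

Sign up for an account on the official site, https://2captcha.com/
Choose I'M A DEVELOPER (if you pick the wrong one, you can switch roles at the top right)
Start Work: the account needs credit before the API can be used
- Pass the training test: this prevents frequent wrong answers during real work, which could get the account locked
- Accept the terms
- Start solving: keep solving until the account has enough credit
2Captcha API:
- API information page
- Your CAPTCHAs: review the records of images uploaded through the API (price, solving time)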

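Start Work and passing the training test¶

The training gives plenty of hints that reduce mistakes in the real solving work later on.
If you accidentally leave partway or the page stops responding, just reload the page
- Do not click Start Over, or everything restarts from the beginning

Accept the terms¶

Start solving¶

The solving process¶

The faster you solve, the higher the pay
Click Stop whenever you want to quit; after reloading the page, the Reputation panel on the left shows your earnings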

API information page¶

Make a note of your captcha KEY
The solving speed is adjustable: the slower, the cheaper
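For readers curious what a client does with that key, here is a rough sketch of the upload-then-poll flow. It is based on my understanding of 2captcha's `in.php` / `res.php` HTTP endpoints and their `OK|...` reply format, so treat the URLs, parameter names and status strings as assumptions to verify against the API information page. `parse_reply` and `solve` are hypothetical names, and `requests` is a third-party package.

```python
import time

def parse_reply(text):
    """Split a reply such as 'OK|284176' into (ok, payload)."""
    if text.startswith("OK|"):
        return True, text[3:]
    return False, text

def solve(api_key, image_path, poll_seconds=5, max_polls=24):
    """Upload an image, then poll until an answer (or an error) comes back."""
    import requests  # imported here so parse_reply stays usable on its own
    with open(image_path, "rb") as f:
        reply = requests.post("http://2captcha.com/in.php",
                              data={"key": api_key, "method": "post"},
                              files={"file": f}).text
    ok, captcha_id = parse_reply(reply)
    if not ok:
        raise RuntimeError("upload failed: " + reply)
    for _ in range(max_polls):
        time.sleep(poll_seconds)
        reply = requests.get("http://2captcha.com/res.php",
                             params={"key": api_key, "action": "get",
                                     "id": captcha_id}).text
        ok, payload = parse_reply(reply)
        if ok:
            return payload
        if payload != "CAPCHA_NOT_READY":  # assumed "still working" status
            raise RuntimeError("solve failed: " + payload)
    raise TimeoutError("no answer after polling")

print(parse_reply("OK|284176"))
print(parse_reply("CAPCHA_NOT_READY"))
```

The polling loop is the reason this route is slow: a human has to solve the image before an answer becomes available.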

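Your CAPTCHAs: reviewing the records uploaded through the API¶

Shows the cost and solving time of every CAPTCHA uploaded

=> Solving 6 CAPTCHAs myself as a worker earned about 0.006; the API charged 0.00095 to solve one simple CAPTCHA
=> Solving one CAPTCHA pays for about 10 API uses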

In [6]:

from captcha2 import CaptchaUpload
import time
tstart = time.time()

_API_KEY = "YOUR_KEY"
captcha = CaptchaUpload(_API_KEY)
print (captcha.solve('captcha/1-284176.png'))

tstop = time.time()
print (tstop-tstart)

284176
15.581701040267944

In [7]:

from PIL import Image
import pytesseract
import time
tstart = time.time()
print(pytesseract.image_to_string(Image.open('captcha/1-284176.png')))
tstop = time.time()
print (tstop-tstart)

284176
0.1440131664276123
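The two timings suggest a simple combination: try the fast local OCR first and only fall back to the slow human-backed service when the local answer looks wrong. The sketch below shows that control flow with stand-in solvers so it runs on its own; `solve_with_fallback` and `six_digits` are hypothetical names, and the six-digit rule is an assumption about this particular CAPTCHA type.

```python
def solve_with_fallback(path, fast_solver, slow_solver, looks_valid):
    """Return the fast solver's answer if it looks valid, else the slow one's."""
    guess = fast_solver(path).strip()
    if looks_valid(guess):
        return guess
    return slow_solver(path).strip()

def six_digits(text):
    """Validity check for CAPTCHAs known to be exactly six digits."""
    return len(text) == 6 and text.isdigit()

# Stand-in solvers for demonstration; in practice these would wrap
# pytesseract.image_to_string and a 2captcha client respectively.
answer = solve_with_fallback("captcha/1-284176.png",
                             fast_solver=lambda p: "2841 76",
                             slow_solver=lambda p: "284176",
                             looks_valid=six_digits)
print(answer)
```

A format check cannot catch every OCR mistake (a wrong digit still looks valid), so this only trims the obvious failures.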

Cropping images and hands-on walkthroughs¶

Cropping images¶

In [8]:

from PIL import Image
img = Image.open('slide_image/govtw_cmpyinfo.png')
img2 = img.crop((240, 110, 350, 140)) # left top right bottom
img2.save('temp.png')
img2

Out[8]:

Coding Time¶

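Try it: grab the CAPTCHA from the "Department of Commerce, MOEA: Company and Branch Basic Information Search" page (經濟部商業司─公司及分公司基本資料查詢) http://gcis.nat.gov.tw/pub/cmpy/cmpyInfoListAction.do

Control a webdriver with selenium
- Set the browser window size
- Open the page
- Take a screenshot
- Crop the CAPTCHA out of the screenshot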
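For the last step, the crop box does not have to be found by trial and error: selenium elements expose `location` and `size` dictionaries, which can be converted into Pillow's (left, top, right, bottom) box. The helper `crop_box` below is a hypothetical name, and the sample numbers are made up for illustration. On high-DPI displays the screenshot may be scaled, in which case the values would need adjusting.

```python
def crop_box(location, size):
    """Convert an element's location/size dicts into Pillow's crop box
    (left, top, right, bottom)."""
    left = int(location["x"])
    top = int(location["y"])
    return (left, top, left + int(size["width"]), top + int(size["height"]))

# With selenium, `element.location` and `element.size` provide such dicts.
print(crop_box({"x": 260, "y": 145}, {"width": 110, "height": 35}))
```

The resulting tuple can be passed straight to `Image.crop`.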

In [9]:

from selenium import webdriver
import signal
# Set up the browser and window size, open the page, take a screenshot
driver = webdriver.PhantomJS()
driver.set_window_size(1024, 768)
driver.get("http://gcis.nat.gov.tw/pub/cmpy/cmpyInfoListAction.do")
driver.save_screenshot('cmpyInfo_raw.png')  # save_screenshot always writes PNG data
# Crop the image
# Your Code
# Remember to shut down the PhantomJS process
driver.service.process.send_signal(signal.SIGTERM)
driver.quit()

實戰解析¶

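What to check: right-click the CAPTCHA, choose "Open image in new tab", then reload that tab

If the CAPTCHA changes: simulating a browser with selenium is the safest option, and the image has to be cropped from a screenshot
If the CAPTCHA stays the same (the URL usually carries a parameter): the image can be fetched directly with requests.get()

Department of Commerce, MOEA company search (經濟部商業司─公司及分公司基本資料查詢): the CAPTCHA changes¶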

In [10]:

from selenium import webdriver
from selenium.webdriver.common.by import By
from PIL import Image
import pytesseract
import signal
# Set up the browser and window size, open the page, take a screenshot
driver = webdriver.PhantomJS()
driver.set_window_size(1024, 768)
driver.get("http://gcis.nat.gov.tw/pub/cmpy/cmpyInfoListAction.do")
driver.save_screenshot('cmpyInfo_raw.png')  # save_screenshot always writes PNG data
# Crop the CAPTCHA out of the screenshot
img = Image.open('cmpyInfo_raw.png')
img2 = img.crop((260, 145, 370, 180)) # left top right bottom
img2.save('captcha.png')
# Recognize the CAPTCHA text (strip whitespace so no stray newline is typed)
captcha = pytesseract.image_to_string(Image.open('captcha.png')).strip()
print(captcha)
# Fill in the form
text_box = driver.find_element(By.XPATH, "//input[@name='queryStr']")
text_box.send_keys("玩咖旅行社")
text_box = driver.find_element(By.XPATH, "//input[@name='imageCode']")
text_box.send_keys(captcha)
driver.save_screenshot('cmpyInfo_ready.png')
# Submit
button = driver.find_element(By.XPATH, "//input[@name='submitData']")
button.click()
driver.save_screenshot('cmpyInfo_submit.png')
# Remember to shut down the PhantomJS process
driver.service.process.send_signal(signal.SIGTERM)
driver.quit()

Chunghwa Post website > Chinese-to-English address translation (中華郵政全球資訊網>中文地址英譯): the CAPTCHA stays the same¶

In [11]:

import requests
from lxml import etree
from PIL import Image
import pytesseract

# Open the page
resp = requests.get("http://www.post.gov.tw/post/internet/SearchZone/index.jsp?ID=130112")
# Extract the image URL
html = resp.text
page = etree.HTML(html)
image_src = page.xpath("//img[@id='imgCaptcha3']/@src")[0]
vKey = image_src.split("&vKey=")[-1]
# Download the image
resp_captcha = requests.get("http://www.post.gov.tw/post/internet/"+image_src.replace("../",""))
with open("post_gov_captcha.jpg", "wb") as fileout:
    fileout.write(resp_captcha.content)
# Recognize the CAPTCHA text
captcha = pytesseract.image_to_string(Image.open('post_gov_captcha.jpg')).strip()
print(captcha)
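The cell above builds the image URL and extracts `vKey` with string operations. As a hedged alternative, the standard library can resolve a relative `src` against the page URL and parse the query string. The `image_src` value below is made up to illustrate the shape of such a link; it is not the site's real path.

```python
from urllib.parse import urljoin, urlparse, parse_qs

page_url = "http://www.post.gov.tw/post/internet/SearchZone/index.jsp?ID=130112"
# Hypothetical src value, invented for this example.
image_src = "../captcha/image.jsp?size=3&vKey=abc123"

# Resolve "../" relative to the page the <img> tag was found on.
image_url = urljoin(page_url, image_src)
# Read the vKey parameter regardless of where it sits in the query string.
v_key = parse_qs(urlparse(image_url).query)["vKey"][0]
print(image_url)
print(v_key)
```

This keeps working if the site changes the directory depth or the parameter order, which the string-based version would not.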

In [12]:

# Submit the data with requests' get/post
post_data = {
    "do_s_1":"1",
    "vKey": vKey,
    "city":"臺北市",
    "change_city":"2",
    "cityarea":"中山區",
    "street":"中山北路２段",
    "lane":"",
    "alley":"",
    "num":"31",
    "num_hyphen":"",
    "fl":"9",
    "hyphen":"",
    "suite":"",
    "list":"true",
    "checkImange":captcha,
    "submit":"查詢"
}
resp = requests.post("http://www.post.gov.tw/post/internet/Postal/index.jsp?ID=207",data=post_data)
html = resp.text
page = etree.HTML(html)
eng_address = "".join(page.xpath("//table[contains(@class,'TableStyle_02')][1]//tr[2]//text()")).strip()
print (eng_address)

9F., No.31, Sec. 2, Zhongshan N. Rd., Zhongshan Dist., Taipei City 104, Taiwan (R.O.C.)

Coding Time¶

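Try it: look up pickup number E42981808304 in the "E-Tracking 交易系統" (E-Tracking transaction system)
https://eservice.7-11.com.tw/E-Tracking/search.aspx

What to check: right-click the CAPTCHA and choose "Open image in new tab"
- Reload the page: is the CAPTCHA still the same?
- What about opening the same URL in a new incognito window?
Possible approaches: selenium, or requests + Session
- Recap from crawler03: Session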

In [13]:

import requests
from lxml import etree
from PIL import Image
import pytesseract

resq = requests.Session() 
resp = resq.get("https://eservice.7-11.com.tw/E-Tracking/search.aspx")
html = resp.text
page = etree.HTML(html)
image_src = page.xpath("//img[@id='ImgVCode']/@src")[0]
__VIEWSTATE = page.xpath("//input[@name='__VIEWSTATE']/@value")[0]
__VIEWSTATEGENERATOR = page.xpath("//input[@name='__VIEWSTATEGENERATOR']/@value")[0]

# Download the image (same session, so the cookies match the CAPTCHA)
resp_captcha = resq.get("https://eservice.7-11.com.tw/E-Tracking/"+image_src)
with open("7-11_captcha.jpg", "wb") as fileout:
    fileout.write(resp_captcha.content)

# Recognize the CAPTCHA text
captcha = pytesseract.image_to_string(Image.open('7-11_captcha.jpg')).strip()
print(captcha)

# Submit the data with requests' get/post
post_data = {
    "__EVENTTARGET":"",
    "__EVENTARGUMENT":"",
    "__VIEWSTATE":__VIEWSTATE,
    "__VIEWSTATEGENERATOR":__VIEWSTATEGENERATOR,
    "txtProductNum":"E42981808304",
    "tbChkCode":captcha,
    "btUserSearch":"查 詢"
}
resp = resq.post("https://eservice.7-11.com.tw/E-Tracking/search.aspx",data=post_data)
html = resp.content.decode("utf8")
print (html)
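ASP.NET pages like this one carry state in hidden form fields such as `__VIEWSTATE`, and the set of fields can change over time. Instead of one XPath per field, all hidden inputs can be collected in one pass. The sketch below uses only the standard library; `HiddenInputParser` and `hidden_fields` are hypothetical names, and the sample markup is made up for illustration.

```python
from html.parser import HTMLParser

class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

def hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML text."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields

# Made-up markup for illustration only.
sample = '''
<form>
  <input type="hidden" name="__VIEWSTATE" value="abc" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" value="123" />
  <input type="text" name="txtProductNum" />
</form>
'''
print(hidden_fields(sample))
```

The returned dict can be merged into the POST payload with `post_data.update(hidden_fields(html))` before the visible fields are filled in.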

Appendix: installation¶

On Unix systems that have both Python 2.x and Python 3.x, use pip3 to install packages for Python 3.x

[PIL]
[tesseract]
[pytesseract]
[2captcha]

PIL¶

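What gets installed is Pillow, because the original PIL is no longer maintained.
Note that Pillow and PIL cannot coexist; if PIL is already installed, uninstall it first.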

pip install Pillow
pip3 install Pillow

If you run into problems on Linux (Ubuntu, Debian), install these first

sudo apt-get install libjpeg8 libjpeg62-dev libfreetype6 libfreetype6-dev

tesseract¶

[Linux]¶

sudo apt-get install tesseract-ocr

[Others]¶

https://github.com/tesseract-ocr/tesseract/wiki/Downloads
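pytesseract is a wrapper that calls the `tesseract` executable, so a quick check that the binary is reachable can save some debugging. This is a minimal sketch using only the standard library; `find_tesseract` is a hypothetical helper name.

```python
import shutil

def find_tesseract():
    """Return the full path of the tesseract executable, or None if absent."""
    return shutil.which("tesseract")

path = find_tesseract()
print(path if path else "tesseract not found on PATH")
```

If nothing is found, install Tesseract as described above or add its folder to PATH before using pytesseract.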

pytesseract¶

pip install pytesseract
pip3 install pytesseract

2captcha¶

Install captcha2, because the official sample module captcha2upload only supports Python 2.x.

pip install captcha2  
pip3 install captcha2