Python网络爬虫权威指南 (2nd Edition): Impressions After a Trial Purchase
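This book is supposedly a new release, published in April 2019.
But the actual content is essentially identical to the first edition. It still works as a general introduction, but none of the code in it runs anymore.
Could they put in a little more care?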
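The code on page 177 is wrong at the logic level: pytesseract is imported but never used, and Tesseract is called through subprocess instead. That is presumably the first edition's approach. I can't tell whether the author or the translator is to blame. Changing the code to the following makes more sense: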
import time
from urllib.request import urlretrieve

from PIL import Image
import pytesseract
from selenium import webdriver

# Create new Selenium driver
# (executable_path and find_element(s)_by_* are Selenium 3 style APIs)
driver = webdriver.Chrome(executable_path='drivers/chromedriver/chromedriver')
driver.get(
    'https://www.amazon.com/Death-Ivan-Ilyich-Nikolayevich-Tolstoy/dp/1427027277')
time.sleep(2)

# Click on the book preview button
driver.find_element_by_id('imgBlkFront').click()
imageList = []

# Wait for the page to load
time.sleep(5)

while 'pointer' in driver.find_element_by_id('sitbReaderRightPageTurner').get_attribute('style'):
    # While the right arrow is available for clicking, turn through pages
    driver.find_element_by_id('sitbReaderRightPageTurner').click()
    time.sleep(2)
    # Get any new pages that have loaded (multiple pages can load at once;
    # the membership check below skips images that were already processed)
    pages = driver.find_elements_by_xpath(
        "//div[@class='pageImage']/div/img")
    if not len(pages):
        print('No pages found')
    for page in pages:
        image = page.get_attribute('src')
        print('Found image: {}'.format(image))
        if image not in imageList:
            # Download the page image, then OCR it directly with pytesseract
            urlretrieve(image, 'page.jpg')
            imageList.append(image)
            print(pytesseract.image_to_string(Image.open('page.jpg')))

driver.quit()
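For comparison, the subprocess route I'm complaining about boils down to roughly the following. This is only a sketch, not the book's actual code: the helper names `tesseract_command` and `ocr_via_subprocess` are made up for illustration, and it assumes the `tesseract` binary is installed and on the PATH. The Tesseract CLI takes an image and an output base name, and writes the recognized text to `<output base>.txt`.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base):
    """Build the argv for the Tesseract CLI: tesseract <image> <output base>."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_via_subprocess(image_path, output_base="page"):
    """Run the Tesseract CLI, then read back the text file it writes.

    Hypothetical helper: assumes the `tesseract` executable is on the PATH.
    """
    subprocess.run(
        tesseract_command(image_path, output_base),
        check=True,           # raise CalledProcessError if tesseract fails
        capture_output=True,  # keep tesseract's console output off the screen
    )
    # Tesseract appends ".txt" to the output base name
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling `ocr_via_subprocess('page.jpg')` would then do the same job as `pytesseract.image_to_string(Image.open('page.jpg'))`. pytesseract is itself a wrapper around the Tesseract binary, so if you import it, the manual subprocess call and the temporary text file are unnecessary.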
Don't buy this book. It is effectively just the first edition, and it covers nothing but web scraping.