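I am trying to build an application that scrapes certain e-commerce websites. I am using Selenium for this and am trying to deploy the application on an EC2 instance running CentOS. I developed the code locally, where it works, but on the remote machine it raises an error.
The code I am using: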
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
ser = Service(ChromeDriverManager().install())
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
selenium_driver = webdriver.Chrome(service=ser, options=chrome_options)
url = 'https://www.everlane.com/products/womens-cloud-cable-knit-vest-oatmeal?collection=womens-newest-arrivals'
selenium_driver.get(url)
title = selenium_driver.find_element(By.XPATH, '//*[@id="content"]/div/div[3]/div[2]/div/div/div/div[2]/div/div[1]/hgroup/h1/span')
print(title.text)
When I run this code on the remote machine, I get the following stack trace:
Traceback (most recent call last):
File "/home/ec2-user/.local/lib/python3.7/site-packages/flask/app.py", line 2091, in __call__
return self.wsgi_app(environ, start_response)
File "/home/ec2-user/.local/lib/python3.7/site-packages/flask/app.py", line 2076, in wsgi_app
response = self.handle_exception(e)
File "/home/ec2-user/.local/lib/python3.7/site-packages/flask/app.py", line 2073, in wsgi_app
response = self.full_dispatch_request()
File "/home/ec2-user/.local/lib/python3.7/site-packages/flask/app.py", line 1518, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/home/ec2-user/.local/lib/python3.7/site-packages/flask/app.py", line 1516, in full_dispatch_request
rv = self.dispatch_request()
File "/home/ec2-user/.local/lib/python3.7/site-packages/flask/app.py", line 1502, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
File "/home/ec2-user/price_tracker/flask_api.py", line 22, in home
title, price, isSizeAvailable, shop = prices.checkInfoByShop(url, size)
File "/home/ec2-user/price_tracker/check_prices.py", line 132, in checkInfoByShop
secondaryPriceXPath=secondaryPriceXPath)
File "/home/ec2-user/price_tracker/check_prices.py", line 61, in checkSelenium
title = self.selenium_driver.find_element(By.XPATH, titleXPath)
File "/home/ec2-user/.local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 1246, in find_element
'value': value})['value']
File "/home/ec2-user/.local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 424, in execute
self.error_handler.check_response(response)
File "/home/ec2-user/.local/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="content"]/div/div[3]/div[2]/div/div/div/div[2]/div/div[1]/hgroup/h1/span"}
(Session info: headless chrome=96.0.4664.110)
Stacktrace:
#0 0x559979e8dee3 <unknown>
#1 0x55997995b608 <unknown>
#2 0x559979991aa1 <unknown>
#3 0x559979991c61 <unknown>
#4 0x5599799c4714 <unknown>
#5 0x5599799af29d <unknown>
#6 0x5599799c23bc <unknown>
#7 0x5599799af163 <unknown>
#8 0x559979984bfc <unknown>
#9 0x559979985c05 <unknown>
#10 0x559979ebfbaa <unknown>
#11 0x559979ed5651 <unknown>
#12 0x559979ec0b05 <unknown>
#13 0x559979ed6a68 <unknown>
#14 0x559979eb505f <unknown>
#15 0x559979ef1818 <unknown>
#16 0x559979ef1998 <unknown>
#17 0x559979f0ceed <unknown>
#18 0x7ff5dd53b40b <unknown>
For debugging purposes, I tried:
body = selenium_driver.find_element(By.XPATH, '/html/body')
print(body.text)
which returned:
"We're sorry, something has gone wrong. Please try again.\nIf you continue to have trouble, please contact us at support@everlane.com.\nChecking your browser before accessing www.everlane.com.\nThis process is automatic. Your browser will redirect to your requested content shortly.\nPlease allow up to 5 seconds…\nDebugging Information\nIP Address\n<ip-address>\nRay ID\n6c57184d797805a0"
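I understand that my request is probably being blocked for some reason, but is there a way around this?
I tried adding wait statements in the hope that the redirect would eventually land, but nothing has worked so far.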
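A uj5u.com user replied:
The message suggests that the page content is different from what you expect, so your code is behaving as designed: the element is not on the page that was served. I would have Selenium wait until the element is visible. If you don't want to do that, you can instead wait for the page to redirect; how to do that has been answered in another Stack Overflow question.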
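To make the "wait for the redirect" idea concrete, here is a minimal sketch of a polling helper. It is not part of Selenium; the name `wait_until` and its `timeout` and `poll_interval` defaults are assumptions chosen for illustration. It only needs a callable, so it can be tried without a browser.

```python
import time


def wait_until(predicate, timeout=15.0, poll_interval=0.5):
    """Call predicate() until it returns a truthy value or the timeout elapses.

    Hypothetical helper for illustration; it is not a Selenium API.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            # Condition met: hand the truthy value back to the caller.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


if __name__ == "__main__":
    # Simulated page bodies: two interstitial responses, then the real page.
    bodies = iter(["Checking your browser", "Checking your browser", "Product page"])
    print(wait_until(lambda: next(bodies) == "Product page", timeout=5, poll_interval=0.01))
```

With a real driver, the predicate could be something like `lambda: "Checking your browser" not in selenium_driver.find_element(By.TAG_NAME, "body").text`. If the site never lets the request through, the helper simply times out, which would match what the question describes.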
A uj5u.com user replied:
Because of the message
Checking your browser before accessing www.everlane.com.\nThis process is automatic. Your browser will redirect to your requested content shortly.
it seems the site has Cloudflare protection enabled.
See this reference: https://thegeekpage.com/how-to-fix-checking-your-browser-before-accessing-message/
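Before trying workarounds, it can help to have the scraper recognise this interstitial explicitly instead of failing later with `NoSuchElementException`. The following is a rough sketch; the function name `looks_like_challenge_page` and the marker list are assumptions, with the marker phrases taken only from the body text quoted in the question.

```python
# Phrases copied from the body text quoted in the question; other sites or
# newer challenge pages may use different wording, so treat this as a heuristic.
CHALLENGE_MARKERS = (
    "Checking your browser before accessing",
    "Your browser will redirect to your requested content shortly",
)


def looks_like_challenge_page(body_text: str) -> bool:
    """Return True if the page text contains any known challenge phrase."""
    text = body_text.lower()
    return any(marker.lower() in text for marker in CHALLENGE_MARKERS)


if __name__ == "__main__":
    sample = (
        "Checking your browser before accessing www.everlane.com.\n"
        "This process is automatic."
    )
    print(looks_like_challenge_page(sample))
    print(looks_like_challenge_page("Some ordinary product page text"))
```

In the scraper this could be called on `selenium_driver.find_element(By.XPATH, '/html/body').text` right after `get(url)`, so that a blocked request is reported as such rather than as a missing element.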
I would suggest trying selenium-stealth:
https://pypi.org/project/selenium-stealth/
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium_stealth import stealth
ser = Service(ChromeDriverManager().install())
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(service=ser, options=options)
stealth(driver,
languages=["en-US", "en"],
vendor="Google Inc.",
platform="Win32",
webgl_vendor="Intel Inc.",
renderer="Intel Iris OpenGL Engine",
fix_hairline=True,
)
url = 'https://www.everlane.com/products/womens-cloud-cable-knit-vest-oatmeal?collection=womens-newest-arrivals'
driver.get(url)
title = driver.find_element(By.XPATH, '//*[@id="content"]/div/div[3]/div[2]/div/div/div/div[2]/div/div[1]/hgroup/h1/span')
print(title.text)
In addition, some of these repositories may help:
- https://github.com/ultrafunkamsterdam/undetected-chromedriver
- https://github.com/VeNoMouS/cloudscraper
- https://github.com/unixfox/pupflare
Or take a look at this topic:
https://github.com/topics/cloudflare-bypass
A uj5u.com user replied:
I suggest using WebDriverWait to wait for the page to load.
# The 10-second timeout is an assumed value; adjust it as needed.
wait = WebDriverWait(selenium_driver, 10)
elem = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="content"]/div/div[3]/div[2]/div/div/div/div[2]/div/div[1]/hgroup/h1/span')))
print(elem.text)
Imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC