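I am scraping pages from the Italian website that publishes new laws (Gazzetta Ufficiale) in order to save the final page, the one that contains the text of the law.
I have a loop that builds a list of pages to download. Below is a complete working code example that shows the problem I am running into (the example has no loop; I simply perform two "get" calls).
What is the best way to handle the rare pages that do not show the "Visualizza" (show) button and instead display the desired full text directly?
I hope the code is self-explanatory and sufficiently commented. Thank you in advance, and a very happy 2022!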
import time
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome("/Users/bob/Documents/work/scraper/scrape_gu/chromedriver")
# showing the "normal" behaviour
driver.get(
    "https://www.gazzettaufficiale.it/atto/vediMenuHTML?atto.dataPubblicazioneGazzetta=2021-01-02&atto.codiceRedazionale=20A07300&tipoSerie=serie_generale&tipoVigenza=originario"
)
# this page has a "Visualizza" button, find it and click it.
bottoni = WebDriverWait(driver, 10).until(
    EC.visibility_of_all_elements_located(
        (By.XPATH, '//*[@id="corpo_export"]/div/input[1]')
    )
)
time.sleep(5) # just to see the "normal" result with the "Visualizza" button
bottoni[0].click() # now click it and this shows the desired final webpage
time.sleep(5) # just to see the "normal" desired result
# but unfortunately some pages directly get to the end result WITHOUT the "Visualizza" button.
# as an example see the following get
# showing the behaviour WITHOUT the "Visualizza" button
driver.get(
    "https://www.gazzettaufficiale.it/atto/vediMenuHTML?atto.dataPubblicazioneGazzetta=2021-01-02&atto.codiceRedazionale=20A07249&tipoSerie=serie_generale&tipoVigenza=originario"
)  # get a law page
time.sleep(5)  # as you can see we are now on the final desired full page WITHOUT the Visualizza button
# hence the following code, identical to that above will fail and timeout
bottoni = WebDriverWait(driver, 10).until(
    EC.visibility_of_all_elements_located(
        (By.XPATH, '//*[@id="corpo_export"]/div/input[1]')
    )
)
time.sleep(5) # just to see the result
bottoni[0].click() # and this shows the desired final webpage
# and the program abends with the following message
# File "/Users/bob/Documents/work/scraper/scrape_gu/temp.py", line 33, in <module>
# bottoni = WebDriverWait(driver, 10).until(
# File "/Users/bob/opt/miniconda3/envs/scraping/lib/python3.8/site-packages/selenium/webdriver/support/wait.py", line 80, in until
# raise TimeoutException(message, screen, stacktrace)
# selenium.common.exceptions.TimeoutException: Message:
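Answer from a uj5u.com user:
Use a try and except block to catch the exception. If there is no button, extract the text directly; the except branch handles that case.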
...
# needed for the except clause below; not imported in the question's code
from selenium.common.exceptions import TimeoutException

urls = [
    'https://www.gazzettaufficiale.it/atto/vediMenuHTML?atto.dataPubblicazioneGazzetta=2021-01-02&atto.codiceRedazionale=20A07300&tipoSerie=serie_generale&tipoVigenza=originario',
    'https://www.gazzettaufficiale.it/atto/vediMenuHTML?atto.dataPubblicazioneGazzetta=2021-01-02&atto.codiceRedazionale=20A07249&tipoSerie=serie_generale&tipoVigenza=originario'
]
data = []
for url in urls:
    driver.get(url)
    try:
        # wait up to 1 second for the "Visualizza" button to become clickable
        bottoni = WebDriverWait(driver, 1).until(
            EC.element_to_be_clickable(
                (By.XPATH, '//input[@value="Visualizza"]')
            )
        )
        bottoni.click()
    except TimeoutException:
        # no button: the page already shows the full text
        print('no bottoni -')
    finally:
        # in both cases, store the text of the page now displayed
        data.append(driver.find_element(By.XPATH, '//body').text)
driver.close()
print(data)
...
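Answer from a uj5u.com user:
First of all, using Selenium for this task is overkill.
You can do the same thing with requests or aiohttp plus beautifulsoup, and it will be faster and easier to code.
Now, back to your question: there are a few solutions.
The simplest are:
- Catch the timeout exception: if the button is not found, parse the law text directly.
- Check whether the button exists before clicking it or parsing the page: !driver.findElements(By.id("corpo_export")).isEmpty() (this is Java syntax; in Python, driver.find_elements(By.ID, "corpo_export") returns an empty list when nothing matches).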
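The second option in the list above could look roughly like this in Python. It is only a minimal sketch: `click_if_present` and `page_text` are hypothetical helper names, and the XPath is borrowed from the other answer rather than checked against the site. Because `find_elements` returns an empty list instead of raising, no timeout is involved; the trade-off is that it does not wait, so the sketch assumes the page has finished loading when the check runs.

```python
# "xpath" is the string value of selenium.webdriver.common.by.By.XPATH,
# so these helpers work with a Selenium driver without importing Selenium.
VISUALIZZA_XPATH = '//input[@value="Visualizza"]'  # assumption: taken from the other answer


def click_if_present(driver, xpath=VISUALIZZA_XPATH):
    """Click the first element matching xpath, if there is one.

    Returns True if a button was found and clicked, False otherwise.
    """
    buttons = driver.find_elements("xpath", xpath)  # empty list when absent
    if not buttons:
        return False
    buttons[0].click()
    return True


def page_text(driver, url):
    """Load url, click "Visualizza" when it exists, and return the body text."""
    driver.get(url)
    click_if_present(driver)
    return driver.find_element("xpath", "//body").text
```

With the `driver` from the question and a list of URLs called `urls`, something like `data = [page_text(driver, url) for url in urls]` would collect the text of each page.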
But then again, you would have an easier time dropping Selenium and using beautifulsoup.
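As a rough illustration of the browser-free route, the sketch below checks a page's HTML for the "Visualizza" button. It uses the standard library's `html.parser` in place of BeautifulSoup purely to stay dependency-free; with BeautifulSoup the same check would be along the lines of `soup.find("input", attrs={"value": "Visualizza"})`. It assumes the button is present in the HTML the server returns rather than injected by JavaScript, which has not been verified for this site, and `ButtonFinder` and `has_visualizza_button` are hypothetical names.

```python
from html.parser import HTMLParser


class ButtonFinder(HTMLParser):
    """Records whether an <input value="Visualizza"> tag was seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "input" and dict(attrs).get("value") == "Visualizza":
            self.found = True


def has_visualizza_button(html):
    """Return True if the HTML contains the "Visualizza" input button."""
    finder = ButtonFinder()
    finder.feed(html)
    return finder.found
```

The HTML string would come from whichever HTTP client is used, for example `requests.get(url).text`. When the function returns False, the page would already be the full text and could be parsed as it is.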