I'm new to Python.
I'm trying to scrape a site as a way of learning Python, but I'm stuck on the following problems.
I'm sorry for the basic question, but I would appreciate any advice.
Thank you.
If an item targeted by an XPath is not posted on the target site, the process stops with an error.
If a pop-up suddenly appears on the target site, the subsequent processing fails and the script stops with an error.
Questions:
- If an item targeted by an XPath is not posted on the target site, how can I store an error text for that item in the list and continue with the subsequent processing?
- If a pop-up suddenly appears on the target site, how can I close it and continue with the subsequent processing? The pop-up does not appear every time. It appears on the list page, not on the details page.
* I know how to close the pop-up once it is displayed.
It can be closed with the following code.
elem_close_btn = browser.find_element_by_id('popover-link-x')
elem_close_btn.click()
Code description:
- Browser: Chrome
- Library used: selenium (not using beautifulsoup)
- Crawling starts from the list page given by the paging URL in the settings.
- Go to each details page and get the five items specified by XPath.
- Page through the list pages, crawling the details pages of each one.
- Repeat for list pages 1-10 ...
* Code part
from selenium import webdriver
from time import sleep
import pandas as pd
browser = webdriver.Chrome('chromedriver.exe')
# ======================================================================
# Setting
# ======================================================================
## Origin URL
#url = 'https://jp.indeed.com/%E6%B1%82%E4%BA%BA?q=%C2%A56%2C000%2C000%E3%80%80%E3%83%87%E3%83%BC%E3%82%BF&l=%E6%9D%B1%E4%BA%AC%E9%83%BD'
## Paging URL
page = 'https://jp.indeed.com/jobs?q=%C2%A56%2C000%2C000%E3%80%80%E3%83%87%E3%83%BC%E3%82%BF&l=%E6%9D%B1%E4%BA%AC%E9%83%BD&start={}'
## scraping points
### Title
results_01 = []
r01_xpath = '/html/body/div[1]/div[2]/div[3]/div/div/div[1]/div[1]/div[1]/h3'
### company name
results_02 = []
r02_xpath = '/html/body/div[1]/div[2]/div[3]/div/div/div[1]/div[1]/div[1]/div[1]/div/div/div[1]'
### Work location
results_03 = []
r03_xpath = '/html/body/div[1]/div[2]/div[3]/div/div/div[1]/div[1]/div[1]/div[1]/div/div/div[3]'
### annual income
results_04 = []
r04_xpath = '/html/body/div[1]/div[2]/div[3]/div/div/div[1]/div[1]/div[1]/div[2]/span[1]'
### Details
results_05 = []
r05_xpath = '//*[@id="jobDescriptionText"]'
# ======================================================================
# Crawl&scraping
# ======================================================================
index_page = 0
for pages in range(10, 100):
    browser.get(page.format(index_page))
    ## Detail page URL class is specified
    elem_detail_btn = browser.find_elements_by_class_name('title')
    index_detail = 0
    for elem_detail_btns in range(len(elem_detail_btn)):
        elem_detail_btn[index_detail].click()
        ## Go to details page
        browser.switch_to.window(browser.window_handles[-1])
        ## scraping
        r01 = browser.find_element_by_xpath(r01_xpath).text
        results_01.append(r01)
        r02 = browser.find_element_by_xpath(r02_xpath).text
        results_02.append(r02)
        r03 = browser.find_element_by_xpath(r03_xpath).text
        results_03.append(r03)
        r04 = browser.find_element_by_xpath(r04_xpath).text
        results_04.append(r04)
        r05 = browser.find_element_by_xpath(r05_xpath).text
        results_05.append(r05)
        ## Go to list page
        browser.switch_to.window(browser.window_handles[0])
        index_detail += 1
    index_page += 10
-
Answer # 1
-
Answer # 2
Regarding "if the item targeted by an XPath is not posted on the site, store the error text in the list and proceed to the subsequent processing":
browser.find_element_by_~
is awkward here because it raises an exception when there is no matching element.
browser.find_elements_by_~
returns [] when there are no matching elements, so an if statement can be used to detect that the element was missing.
Regarding "if a pop-up suddenly appears on the site":
Check for the pop-up in the same way, and close it if it is present.
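The approach in this answer could be sketched roughly as follows. This is only an illustration, not the answerer's code: the helper names text_or_default and close_popup_if_present are made up here, and 'Non' as the placeholder is an assumption. The generic find_elements(by, value) form is used with the locator strings "xpath" and "id" (the values of Selenium's By.XPATH and By.ID), because, as far as I know, the find_elements_by_~ shortcuts were removed in recent Selenium 4 releases.

```python
def text_or_default(browser, xpath, default="Non"):
    """Return the text of the first element matching xpath, or default if none.

    Hypothetical helper: `browser` is expected to be a Selenium WebDriver.
    """
    # find_elements returns an empty list (it does not raise) when nothing matches.
    found = browser.find_elements("xpath", xpath)  # "xpath" == By.XPATH
    if found:
        return found[0].text
    return default


def close_popup_if_present(browser, close_button_id="popover-link-x"):
    """Click the pop-up's close button if it exists; return True if it was clicked.

    The id 'popover-link-x' is taken from the question.
    """
    buttons = browser.find_elements("id", close_button_id)  # "id" == By.ID
    if buttons:
        buttons[0].click()
        return True
    return False
```

In the question's loop, results_01.append(text_or_default(browser, r01_xpath)) would then replace the find_element_by_xpath call, and close_popup_if_present(browser) could be called once after each browser.get(...) on the list page.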
Sorry, I have posted the code below.
The specific solution to the problem is as follows.
Two pieces of processing were added.
- First, initialize every item to be acquired with 'Non'.
If the value is then acquired normally, it is overwritten with the correct value and stored in the list. If it cannot be acquired, 'Non' is stored in the list.
- Added exception handling (try, except).
The item is acquired inside try. If an exception occurs because the item could not be acquired,
the reason for the error is printed in except and processing continues.
The value is stored in the list in finally, regardless of whether an exception occurred.
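The two points above might look something like this for a single item. It is a minimal sketch under assumptions, not the code that was actually posted: the function name scrape_field is hypothetical, and catching the broad Exception class is a guess at what "the error is passed" means (with Selenium, a missing element normally raises NoSuchElementException).

```python
def scrape_field(browser, xpath, results, default="Non"):
    """Append the element's text to results, or default if it cannot be read.

    Hypothetical helper: `browser` is expected to be a Selenium WebDriver and
    `results` one of the result lists (for example results_01).
    """
    value = default  # point 1: start from the placeholder
    try:
        # Overwrite the placeholder only if the element is actually found.
        value = browser.find_element("xpath", xpath).text  # "xpath" == By.XPATH
    except Exception as err:
        # point 2: show the reason and carry on instead of stopping.
        print("could not get", xpath, ":", err)
    finally:
        # Runs whether or not an exception occurred, so the list always grows by one.
        results.append(value)
    return value
```

Appending in finally keeps the five result lists the same length, which matters if they are later combined column by column.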
For the pop-up, exception handling (try, except, finally) was also added.
If an exception caused by the pop-up occurs when clicking a details page URL, the close button is clicked and then the details page URL is clicked again.
The switch to the details page window is done in finally, regardless of whether an exception occurred.
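A rough sketch of that pop-up handling, again under assumptions: open_detail_page is a made-up name, the close button id 'popover-link-x' comes from the question, and the broad except Exception stands in for whatever exception the blocked click actually raises.

```python
def open_detail_page(browser, link, close_button_id="popover-link-x"):
    """Click a details link, closing the pop-up and retrying once if the click fails.

    Hypothetical helper: `browser` is expected to be a Selenium WebDriver and
    `link` one of the elements found on the list page.
    Returns True if the pop-up had to be closed.
    """
    popup_closed = False
    try:
        link.click()
    except Exception as err:
        print("click failed, trying to close the pop-up:", err)
        browser.find_element("id", close_button_id).click()  # "id" == By.ID
        popup_closed = True
        link.click()  # retry; if this also fails, the exception propagates
    finally:
        # Make the newest window (the details page) the active one either way.
        browser.switch_to.window(browser.window_handles[-1])
    return popup_closed
```

One design note: if the failure was not caused by the pop-up, the close button lookup itself raises, so unrelated errors are not silently swallowed.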
Here is the full code.
I hope it will be helpful for those who are stuck in the same place.