
I'm new to Python.
I'm trying to scrape a site to learn Python, but I'm stuck on the problems described below.
Sorry for the basic question, but I'd appreciate any guidance.
Thank you.

Problems:
  • If an item targeted for acquisition by XPath is not posted on the publisher's site, the process stops with an error.

  • If a pop-up suddenly appears on the publisher's site, the subsequent steps fail and the process stops with an error.


Questions:
  • If an item targeted for acquisition by XPath is not posted on the publisher's site, how can I store the text of the acquisition error in the list and continue with the subsequent processing?

  • If a pop-up suddenly appears on the publisher's site, how can I close it and continue with the subsequent processing? The pop-up does not appear every time, and it appears on the list page, not on the detail page.

* I already know how to close the pop-up once it is displayed.
It can be closed with the following code.

elem_close_btn = browser.find_element_by_id('popover-link-x')
elem_close_btn.click()

Code description:
  • Browser: Chrome
  • Library used: selenium (not using BeautifulSoup)
  • Crawling starts from the list page using the paging URL defined in the settings.
  • The script moves to each detail page and scrapes the five items specified by XPath.
  • It then pages through the list page and crawls the next set of detail pages.
  • List pages 1-10 are repeated ...
    * Paging loop fragment:
    index_page = 0
    for pages in range(10, 100):

from selenium import webdriver
from time import sleep
import pandas as pd
browser = webdriver.Chrome('chromedriver.exe')
# ==============================================================================
# Setting
# ==============================================================================
## Origin URL
#url = 'https://jp.indeed.com/%E6%B1%82%E4%BA%BA?q=%C2%A56%2C000%2C000%E3%80%80%E3%83%87%E3%83%BC%E3%82%BF&l=%E6%9D%B1%E4%BA%AC%E9%83%BD'
## Paging URL
page = 'https://jp.indeed.com/jobs?q=%C2%A56%2C000%2C000%E3%80%80%E3%83%87%E3%83%BC%E3%82%BF&l=%E6%9D%B1%E4%BA%AC%E9%83%BD&start={}'
## Scraping targets
### Title
results_01 = []
r01_xpath = '/html/body/div[1]/div[2]/div[3]/div/div/div[1]/div[1]/div[1]/h3'
### Company name
results_02 = []
r02_xpath = '/html/body/div[1]/div[2]/div[3]/div/div/div[1]/div[1]/div[1]/div[1]/div/div/div[1]'
### Work location
results_03 = []
r03_xpath = '/html/body/div[1]/div[2]/div[3]/div/div/div[1]/div[1]/div[1]/div[1]/div/div/div[3]'
### Annual income
results_04 = []
r04_xpath = '/html/body/div[1]/div[2]/div[3]/div/div/div[1]/div[1]/div[1]/div[2]/span[1]'
### Details
results_05 = []
r05_xpath = '//*[@id="jobDescriptionText"]'

# ==============================================================================
# Crawl & scraping
# ==============================================================================
index_page = 0
for pages in range(10, 100):
    browser.get(page.format(index_page))

    ## Detail page links are identified by the 'title' class
    elem_detail_btn = browser.find_elements_by_class_name('title')
    index_detail = 0
    for elem_detail_btns in range(len(elem_detail_btn)):
        elem_detail_btn[index_detail].click()
        ## Switch to the detail page (last opened window)
        browser.switch_to.window(browser.window_handles[-1])
        ## Scraping
        r01 = browser.find_element_by_xpath(r01_xpath).text
        results_01.append(r01)
        r02 = browser.find_element_by_xpath(r02_xpath).text
        results_02.append(r02)
        r03 = browser.find_element_by_xpath(r03_xpath).text
        results_03.append(r03)
        r04 = browser.find_element_by_xpath(r04_xpath).text
        results_04.append(r04)
        r05 = browser.find_element_by_xpath(r05_xpath).text
        results_05.append(r05)

        ## Switch back to the list page
        browser.switch_to.window(browser.window_handles[0])

        index_detail += 1
    index_page += 10
  • Answer #1

    Sorry, I've put the code below.
    The specific solutions to the problems are as follows.

      

    If the item targeted for acquisition by XPath is not posted on the publisher's site, store the text of the acquisition error in the list and continue with the subsequent processing

    Two pieces of processing were added.

    - First, each item to be acquired is initialized to the placeholder 'Non'.
    If the value can then be acquired normally, the placeholder is overwritten with the correct value before being stored in the list; if it cannot be acquired, 'Non' is stored in the list instead.

    - Exception handling (try/except/finally) was added.
    If acquiring the item inside the try block raises an exception because the element cannot be found,
    the reason is printed in the except block and the error is otherwise ignored.
    The finally block stores the value in the list regardless of whether an exception occurred.

    try:
        r02 = 'Non'
        r02 = browser.find_element_by_xpath(r02_xpath).text
    except Exception as e:
        print(e)
    finally:
        results_02.append(r02)
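
    As an aside, this pattern could be factored into a small helper so it does not have to be repeated for every field. This is only a sketch and is not part of the original code; the helper name get_text_or_default is made up here:

    def get_text_or_default(browser, xpath, default='Non'):
        ## Hypothetical helper: return the element's text, or the placeholder when the element is missing
        try:
            return browser.find_element_by_xpath(xpath).text
        except Exception as e:
            print(e)
            return default

    ## Example usage: results_02.append(get_text_or_default(browser, r02_xpath))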
      

    If a pop-up suddenly appears on the publisher's site

    Exception handling (try/except/finally) was added here as well.
    If a pop-up causes an exception when clicking the detail-page link, the close button is clicked and then the detail-page link is clicked again.
    The finally block switches to the detail-page window regardless of whether an exception occurred.

    try:
        elem_detail_btn[index_detail].click()
    except:
        elem_close_btn = browser.find_element_by_id('popover-link-x')
        elem_close_btn.click()
        elem_detail_btn[index_detail].click()
    finally:
        ## Switch to the detail page
        browser.switch_to.window(browser.window_handles[-1])

    Here is the full code.
    I hope it helps anyone who is stuck in the same place.

    from selenium import webdriver
    import pandas as pd
    browser = webdriver.Chrome('chromedriver.exe')
    # ==============================================================================
    # Setting
    # ==============================================================================
    ## Origin URL
    #url = 'https://jp.indeed.com/%E6%B1%82%E4%BA%BA?q=%C2%A56%2C000%2C000%E3%80%80%E3%83%87%E3%83%BC%E3%82%BF&l=%E6%9D%B1%E4%BA%AC%E9%83%BD'
    ## Paging URL
    page = 'https://jp.indeed.com/jobs?q=%C2%A56%2C000%2C000%E3%80%80%E3%83%87%E3%83%BC%E3%82%BF&l=%E6%9D%B1%E4%BA%AC%E9%83%BD&start={}'
    ## Scraping targets
    ### URL
    results_01 = []
    column_name_01 = 'URL'
    r01_xpath = '/html/body/table[2]/tbody/tr/td/table/tbody/tr/td[2]/div/div[1]/a'
    ### Title
    results_02 = []
    column_name_02 = 'title'
    r02_xpath = '/html/body/div[1]/div[2]/div[3]/div/div/div[1]/div[1]/div[1]/h3'
    ### Company name
    results_03 = []
    column_name_03 = 'name'
    r03_xpath = '/html/body/div[1]/div[2]/div[3]/div/div/div[1]/div[1]/div[1]/div[1]/div/div/div[1]'
    ### Work location
    results_04 = []
    column_name_04 = 'place'
    r04_xpath = '/html/body/div[1]/div[2]/div[3]/div/div/div[1]/div[1]/div[1]/div[1]/div/div/div[3]'
    ### Annual income
    results_05 = []
    column_name_05 = 'income'
    r05_xpath = '/html/body/div[1]/div[2]/div[3]/div/div/div[1]/div[1]/div[1]/div[2]/span[1]'
    ### Employment status
    results_06 = []
    column_name_06 = 'status'
    r06_xpath = '/html/body/div[1]/div[2]/div[3]/div/div/div[1]/div[1]/div[1]/div[2]/span[2]'
    ### Details
    results_07 = []
    column_name_07 = 'detail'
    r07_xpath = '//*[@id="jobDescriptionText"]'
    # ==============================================================================
    # Crawl & scraping
    # ==============================================================================
    index_page = 0
    for pages in range(2):
        browser.get(page.format(index_page))
        ## Detail page links are identified by the 'title' class
        elem_detail_btn = browser.find_elements_by_class_name('title')
        index_detail = 0
        for elem_detail_btns in range(len(elem_detail_btn)):
            try:
                elem_detail_btn[index_detail].click()
            except:
                ## If a pop-up blocks the click, close it and click again
                elem_close_btn = browser.find_element_by_id('popover-link-x')
                elem_close_btn.click()
                elem_detail_btn[index_detail].click()
            finally:
                ## Switch to the detail page (last opened window)
                browser.switch_to.window(browser.window_handles[-1])
                ## Implicit wait of up to 15 seconds for elements to appear
                browser.implicitly_wait(15)
                ## Scraping: get each item, falling back to 'Non' when missing
                r01 = browser.current_url
                results_01.append(r01)
                try:
                    r02 = 'Non'
                    r02 = browser.find_element_by_xpath(r02_xpath).text
                except Exception as e:
                    print(e)
                finally:
                    results_02.append(r02)
                try:
                    r03 = 'Non'
                    r03 = browser.find_element_by_xpath(r03_xpath).text
                except Exception as e:
                    print(e)
                finally:
                    results_03.append(r03)
                try:
                    r04 = 'Non'
                    r04 = browser.find_element_by_xpath(r04_xpath).text
                except Exception as e:
                    print(e)
                finally:
                    results_04.append(r04)
                try:
                    r05 = 'Non'
                    r05 = browser.find_element_by_xpath(r05_xpath).text
                except Exception as e:
                    print(e)
                finally:
                    results_05.append(r05)
                try:
                    r06 = 'Non'
                    r06 = browser.find_element_by_xpath(r06_xpath).text
                except Exception as e:
                    print(e)
                finally:
                    results_06.append(r06)
                try:
                    r07 = 'Non'
                    r07 = browser.find_element_by_xpath(r07_xpath).text
                except Exception as e:
                    print(e)
                finally:
                    results_07.append(r07)
                ## Close the detail window only
                browser.close()
                ## Switch back to the list page
                browser.switch_to.window(browser.window_handles[0])
            index_detail += 1
        index_page += 10
    ## Close the browser
    browser.quit()
    # ==============================================================================
    # Data formatting & output
    # ==============================================================================
    ## Build the DataFrame
    df = pd.DataFrame()
    df[column_name_01] = results_01
    df[column_name_02] = results_02
    df[column_name_03] = results_03
    df[column_name_04] = results_04
    df[column_name_05] = results_05
    df[column_name_06] = results_06
    df[column_name_07] = results_07
    df
    # Output to CSV
    df.to_csv('results_indeed.csv', index=True)

  • Answer #2

      

    If the item targeted for acquisition by XPath is not posted on the publisher's site, store the text of the acquisition error in the list and continue with the subsequent processing

    browser.find_element_by_... is awkward to use here because it raises an exception when the element does not exist.
    browser.find_elements_by_... returns an empty list [] when there are no matching elements, so an if statement can be used to check whether the element was found.

      

    If a pop-up suddenly appears on the publisher's site

    Check for the pop-up in the same way, and close it if it is present.
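
    A minimal sketch of this approach, reusing the locators from the question (this code is not part of the original answer):

    ## Item acquisition: check for an empty list instead of catching an exception
    elems = browser.find_elements_by_xpath(r02_xpath)
    if elems:
        results_02.append(elems[0].text)
    else:
        results_02.append('Non')

    ## Pop-up: close it only if it is actually present, then click the detail link
    close_btns = browser.find_elements_by_id('popover-link-x')
    if close_btns:
        close_btns[0].click()
    elem_detail_btn[index_detail].click()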