
I'm studying web scraping.
I'm trying to get a list of Starbucks coffee stores (name + address).
(https://store.starbucks.co.jp/?keyword=)

As shown below, scraping is not prohibited by robots.txt:
User-Agent: *
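As an aside, this kind of robots.txt check can also be done in code. A minimal sketch using the standard library's `urllib.robotparser`, fed a sample "allow everything" policy like the one quoted above rather than the live file:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse() accepts the robots.txt body as a list of lines; here we use a
# sample policy instead of fetching https://store.starbucks.co.jp/robots.txt
rp.parse(['User-Agent: *', 'Disallow:'])

# An empty Disallow rule means everything is allowed for all user agents
print(rp.can_fetch('*', 'https://store.starbucks.co.jp/?keyword='))
```

To check the live file instead, you would call `rp.set_url(...)` followed by `rp.read()`.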
Problem

The Starbucks site uses JavaScript, and not all stores are displayed unless you click the "More" button.
So I used Selenium and the .click() method to try to display all of them.

However, as detailed below, I get a WebDriverException.
I would appreciate any advice from anyone who knows how to deal with this.
Thank you very much.

Code
import requests
import time
import csv
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import ElementNotInteractableException
from bs4 import BeautifulSoup

# Launch headless Chrome and get the Starbucks page
options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--headless')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=options)
driver.get('https://store.starbucks.co.jp/?keyword=')
time.sleep(1)
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')

# Click the "More" button until the end
while True:
    try:
        more_btn = driver.find_element_by_xpath('//*[@id="moreList"]')
        more_btn.click()
        time.sleep(1)
    except ElementNotInteractableException:
        break

# Get the store list
detailContainers = soup.find_all('div', class_="detailContainer")
storeNames = []
storeAddresses = []
for detailContainer in detailContainers:
    storeNames += [detailContainer.find(class_='storeName').get_text()]
    storeAddresses += [detailContainer.find(class_='storeAddress').get_text()]
storeList = pd.DataFrame(
        {
            'storeName': storeNames,
            'storeAddress': storeAddresses,
        }
    )
print(storeList)
Error log
[vagrant@localhost scraping]$ python scraping_starbucks.py
Traceback (most recent call last):
  File "scraping_starbucks.py", line 26, in <module>
    more_btn.click()
  File "/home/vagrant/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py", line 80, in click
    self._execute(Command.CLICK_ELEMENT)
  File "/home/vagrant/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py", line 633, in _execute
    return self._parent.execute(command, params)
  File "/home/vagrant/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/vagrant/.local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash
from unknown error: cannot determine loading status
from tab crashed
  (Session info: headless chrome=85.0.4183.121)
Answer #1

    It is probably because the code clicked before the [More] button was displayed.
    See the following for details on this error:
    stackoverflow - Scraping with selenium - click() gives an error

    When I disabled headless mode once and ran the code so I could watch it visually,
    it worked normally and the page could be displayed to the end.
    This likely depends on your execution and network environment, so please adjust accordingly.
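    One common way to avoid clicking too early is an explicit wait. A minimal sketch using Selenium's WebDriverWait, assuming the same `driver` and element id ("moreList") as in the question; the 10-second timeout is an arbitrary choice:

```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

while True:
    try:
        # Poll for up to 10 seconds until the "More" button is clickable,
        # instead of clicking it immediately after the page loads
        more_btn = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, '//*[@id="moreList"]'))
        )
        more_btn.click()
    except TimeoutException:
        # No clickable "More" button appeared: assume all stores are shown
        break
```

    This replaces the fixed time.sleep(1) with a wait that adapts to how fast the page actually responds.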

    Also, in the posted code, the page source is acquired before the while statement is executed,
    so detailContainers, storeNames, storeAddresses, and storeList
    only contain the first 100 items that are displayed by default.
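    To make that concrete: only the HTML captured at the moment page_source is read gets parsed, so the read must happen after the loop finishes. A sketch that moves the parsing into a function (the tiny inline HTML sample below stands in for the live page so the parsing step is self-contained; the class names match the question):

```python
from bs4 import BeautifulSoup
import pandas as pd

def parse_stores(html):
    """Extract store names and addresses from the given page source."""
    soup = BeautifulSoup(html, 'html.parser')
    names, addresses = [], []
    for container in soup.find_all('div', class_='detailContainer'):
        names.append(container.find(class_='storeName').get_text())
        addresses.append(container.find(class_='storeAddress').get_text())
    return pd.DataFrame({'storeName': names, 'storeAddress': addresses})

# In the real script, call this AFTER the while loop has clicked
# through every "More" button:
#   storeList = parse_stores(driver.page_source)
# Here, an inline sample stands in for the live page:
sample = '''
<div class="detailContainer">
  <span class="storeName">Shibuya</span>
  <span class="storeAddress">Tokyo, Shibuya-ku</span>
</div>
'''
print(parse_stores(sample))
```

    This way it is impossible to accidentally parse a stale snapshot of the page.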