
I want to look up the publisher name from a book title and collect the results in a Google spreadsheet. I'm working in Colab.
I tried to achieve the same thing using the National Diet Library API (https://www.tutorialfor.com/go.php?id=304423), but that proved difficult, so I would like to solve it by scraping instead.

The code is largely based on https://qiita.com/Azunyan1111/items/b161b998790b1db2ff7a.

https://honto.jp/netstore/search.html?gnrcd=1&k=%E3%82%A8%E3%83%BC%E3%82%B9%E8%96%AC%E7%90%86%E5%AD%A6&extSiteId=junkudo&cid=eu_hb_jtoh_0411&srchf=1
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-35-9e745ffb8004> in <module>()
     35
     36   # Display the text at the specified location using a CSS selector
---> 37   print(soup.select_one("#displayOrder1>div>div.stInfo>div.stContents>ul>li:nth-child(4)>a").text)
     38   publisher = soup.select_one("#displayOrder1>div>div.stInfo>div.stContents>ul>li:nth-child(4)>a").text
     39   # Reflect the result in the sheet
AttributeError: 'NoneType' object has no attribute 'text'
Corresponding source code
import numpy as np
from pandas import DataFrame
import xml.etree.ElementTree as ET
import requests
from collections import defaultdict
from google.colab import files
import urllib.request
from bs4 import BeautifulSoup
ss_url = "https://docs.google.com/spreadsheets/d/ooooooo"
workbook = gc.open_by_url(ss_url)
worksheet = workbook.get_worksheet(1)
cell_list = worksheet.range("A3:A5")
for cell in cell_list:
  # Search condition
  title = cell.value
  # URL-encode the search term
  search_word = urllib.parse.quote(title)
  # URL to access
  url = "https://honto.jp/netstore/search.html?gnrcd=1&k=" + search_word + "&extSiteId=junkudo&cid=eu_hb_jtoh_0411&srchf=1"
  print(url)
  # Access the URL; the return value is an instance containing the access result, the HTML, etc.
  instance = urllib.request.urlopen(url)
  # Extract the HTML from the instance and parse it with BeautifulSoup
  soup = BeautifulSoup(instance, "html.parser")
  # Display the text at the specified location using a CSS selector
  print(soup.select_one("#displayOrder1>div>div.stInfo>div.stContents>ul>li:nth-child(4)>a").text)
  publisher = soup.select_one("#displayOrder1>div>div.stInfo>div.stContents>ul>li:nth-child(4)>a").text
  # Reflect the result in the sheet
  worksheet.update_cell(cell.row, cell.col + 1, publisher)
What I tried

I changed
print(soup.select_one("#displayOrder1>div>div.stInfo>div.stContents>ul>li:nth-child(4)>a").text)
to
print(soup.select_one("#displayOrder1>div>div.stInfo>div.stContents>ul>li:nth-child(4)>a").string)
but the result did not change.

I also checked https://qiita.com/booleanoid/items/211820516eb7a2191b32, but that may describe a different problem.

I'm also looking into using Selenium, but I'm not sure whether it would lead to a solution.

  • Answer #1

    The result of soup.select_one(~~~) is None.
    That is, the specified node does not exist in the HTML that was parsed.
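
One way to make this case visible instead of crashing is to check the return value before reading .text. The following is a minimal sketch, assuming the same BeautifulSoup setup as in the question; the helper name select_text and the sample HTML are hypothetical and do not reflect the real honto.jp markup.

```python
from bs4 import BeautifulSoup


def select_text(soup, selector, default=None):
    """Return the stripped text of the first match, or `default` if nothing matches."""
    node = soup.select_one(selector)  # select_one returns None when there is no match
    if node is None:
        return default
    return node.get_text(strip=True)


# Hypothetical sample HTML for illustration only (not the real honto.jp page)
html = '<div id="displayOrder1"><ul><li><a>Example Publisher</a></li></ul></div>'
soup = BeautifulSoup(html, "html.parser")

found = select_text(soup, "#displayOrder1 > ul > li > a")
missing = select_text(soup, "#displayOrder2 > ul > li > a", default="(not found)")
print(found)
print(missing)
```

In the loop from the question, a None result could then be skipped or logged together with the URL, which makes it easier to see which titles failed.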

    Run print(urllib.request.urlopen(url).read()) and review the returned HTML carefully.

    If you wrote the selector based on what you see in the browser's developer tools, it can point to a node that is missing from the HTML the server actually returns. Common causes seen in past questions:
    - A node dynamically added by JavaScript
    - A node inside a frame
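
The review suggested above can be partly automated. The sketch below uses only the standard library to list the id attributes present in the raw HTML and to flag frames; the class name IdCollector, the function inspect_raw_html, and the sample HTML are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect every id attribute in the raw HTML and note whether frames exist."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.has_frames = False

    def handle_starttag(self, tag, attrs):
        if tag in ("iframe", "frame"):
            self.has_frames = True
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def inspect_raw_html(html):
    """Parse the HTML text as served, without running any JavaScript."""
    collector = IdCollector()
    collector.feed(html)
    return collector


# Hypothetical sample: the target id is absent because a script would add it later
sample = (
    '<html><body><div id="header"></div>'
    '<iframe src="/inner.html"></iframe>'
    "<script>/* results would be rendered here by JavaScript */</script>"
    "</body></html>"
)
report = inspect_raw_html(sample)
print("displayOrder1" in report.ids)
print(report.has_frames)
```

For the real page, the decoded bytes from urllib.request.urlopen(url).read() would be passed in place of the sample. If the id used in the selector is not in report.ids, the node is not in the served HTML and the selector cannot match, whatever the developer tools show.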