
I am scraping information on ramen shops rated 3.5 stars or higher in Tokyo from Tabelog.
I want to create a data frame with one row of "store name, address, rating score" per store and export it to CSV.

The data frame with "store name, address, rating score" as one row is not generated.
Also, stores rated 3.67 stars or higher are output, but processing stops once the rating goes below 3.67.

Corresponding source code

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import time
import csv

class Tabelog:
    """
    Tabelog scraping class
    If run with test_mode=True, only the data of 3 stores on the first page is fetched.
    """
    def __init__(self, base_url, test_mode=False, p_ward='in Tokyo', begin_page=1, end_page=30):

        # Variable declaration
        self.store_id = ''
        self.store_id_num = 0
        self.store_name = ''
        self.score = 0
        self.address_name = ""
        self.columns = ['store_id', 'store_name', 'score', 'address_name']
        self.df = pd.DataFrame(columns=self.columns)
        self.__regexcomp = re.compile(r'\n|\s')  # \n is a line break, \s is whitespace

        page_num = begin_page  # store list page number

        if test_mode:
            # Query string required when sorting by Tabelog score ranking
            list_url = base_url + str(page_num) + '/?Srt=D&SrtT=rt&sort_mode=1'
            self.scrape_list(list_url, mode=test_mode)
        else:
            while True:
                # Query string required when sorting by Tabelog score ranking
                list_url = base_url + str(page_num) + '/?Srt=D&SrtT=rt&sort_mode=1'
                if self.scrape_list(list_url, mode=test_mode) != True:
                    break
                # Fetch page data only up to the end_page parameter
                if page_num >= end_page:
                    break
                page_num += 1
        return

    def scrape_list(self, list_url, mode):
        """
        Parse the store list page
        """
        r = requests.get(list_url)
        if r.status_code != requests.codes.ok:
            return False

        soup = BeautifulSoup(r.content, 'html.parser')
        soup_a_list = soup.find_all('a', class_='list-rst__rst-name-target')  # store name list

        if len(soup_a_list) == 0:
            return False

        if mode:
            for soup_a in soup_a_list[:2]:
                item_url = soup_a.get('href')  # get the store's individual page URL
                self.store_id_num += 1
                self.scrape_item(item_url, mode)
        else:
            for soup_a in soup_a_list:
                item_url = soup_a.get('href')  # get the store's individual page URL
                self.store_id_num += 1
                self.scrape_item(item_url, mode)

        return True

    def scrape_item(self, item_url, mode):
        """
        Parse an individual store information page
        """
        start = time.time()

        r = requests.get(item_url)
        if r.status_code != requests.codes.ok:
            print(f'error: not found {item_url}')
            return

        soup = BeautifulSoup(r.content, 'html.parser')

        # Get the store name, e.g. "Taketora Shinjuku store"
        store_name_tag = soup.find('h2', class_='display-name')
        store_name = store_name_tag.span.string
        print('{} → store name: {}'.format(self.store_id_num, store_name.strip()), end='')
        self.store_name = store_name.strip()

        # Exclude stores other than ramen and tsukemen shops
        store_head = soup.find('div', class_='rdheader-subinfo')  # get header frame data for store information
        store_head_list = store_head.find_all('dl')
        store_head_list = store_head_list[1].find_all('span')

        print('target:', store_head_list[0].text)

        if store_head_list[0].text not in {'ramen', 'tsukemen'}:
            print('Not processed because it is not a ramen or tsukemen shop')
            self.store_id_num -= 1
            return

        # Get the address, e.g. "Tokyo, Shinjuku ward, Kabukicho 1-9-5, Sankei 61 Building 2F"
        try:
            address_name = soup.find("p", class_="rstinfo-table__address").text
            print("address: {}".format(address_name), end="")
            self.address_name = address_name
        except AttributeError:
            href = ''

        # Get the rating score, e.g. ...rdheader-ratingscore-val" rel="v:rating">3.58
        rating_score_tag = soup.find('b', class_='c-rating__val')
        rating_score = rating_score_tag.span.string
        print('Rating score: {} points'.format(rating_score), end='')
        self.score = rating_score

        # Exclude stores that have no rating score
        if rating_score == '-':
            print('Not processed because there is no rating')
            self.store_id_num -= 1
            return

        # Exclude stores rated below 3.5
        if float(rating_score) < 3.5:
            print('Not processed because the Tabelog rating is below 3.5')
            self.store_id_num -= 1
            return

        # Generate the data frame row
        self.make_df()
        return

    def make_df(self):
        self.store_id = str(self.store_id_num).zfill(8)  # zero padding
        se = pd.Series([self.store_id, self.store_name, self.address_name, self.score], self.columns)  # create the row
        self.df = self.df.append(se, self.columns)  # add the row to the dataframe (note: DataFrame.append was removed in pandas 2.0+; pd.concat is the modern replacement)
        return

Run

tokyo_ramen_address = Tabelog(base_url="https://tabelog.com/tokyo/rstLst/ramen/", test_mode=False)

tokyo_ramen_address.df.to_csv("Users/~~~~/Desktop/tokyo_ramen_address.csv")

What I tried

https://qiita.com/toshiyuki_tsutsui/items/f143946944a428ed105b
I adapted the code from the blog at the URL above to also scrape address information.
I was able to retrieve the addresses, but I could not output them as a data frame, and I could not retrieve stores all the way down to 3.5 stars.
(When I ran the code from the URL as-is, the data frame was populated and it worked.)

Supplementary information (FW/tool version, etc.)

I'm a beginner who has just started studying Python and scraping. I am working on macOS with Jupyter Notebook.
Sorry for the trouble, but I would appreciate any guidance. Thank you for your cooperation.

  • Answer #1

    According to the terms of use and a check with urllib.robotparser.RobotFileParser(), scraping appears to be prohibited on some pages of Tabelog, but the pages accessed by the code in question do not appear to be disallowed.
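
    For reference, that robots.txt check can be done roughly like this (a minimal sketch; the list-page URL checked is just the one from the question):

        import urllib.robotparser

        rp = urllib.robotparser.RobotFileParser()
        rp.set_url("https://tabelog.com/robots.txt")
        rp.read()
        # can_fetch() is True when the rules allow the given user agent to fetch the URL
        print(rp.can_fetch("*", "https://tabelog.com/tokyo/rstLst/ramen/1/"))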

    I also checked the code in the question; there were no major problems and it ran normally.
    I was able to output both the data frame and the CSV correctly.
    (Some indentation was off, __init__ was written as init, and there were a few
    other odd details, but it runs once only those points are corrected.)
    The main correction is as follows.

    - tokyo_ramen_address.df.to_csv("tokyo_ramen_address.csv")
    + tokyo_ramen_address.df.to_csv("tokyo_ramen_address.csv", encoding='utf_8_sig')
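    (My note: utf_8_sig writes a UTF-8 byte order mark, so spreadsheet software such as Excel recognizes the encoding when opening a CSV that contains Japanese text.)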
    import requests
    from bs4 import BeautifulSoup
    import re
    import pandas as pd
    import time
    import csv
    class Tabelog:
        def __init__(self, base_url, test_mode=False, p_ward='in Tokyo', begin_page=1, end_page=50):
            # Variable declaration
            self.store_id = ''
            self.store_id_num = 0
            self.store_name = ''
            self.score = 0
            self.address_name = ""
            self.columns = ['store_id', 'store_name', 'address_name', 'score']
            self.df = pd.DataFrame(columns=self.columns)
            self.__regexcomp = re.compile(r'\n|\s')  # \n is a line break, \s is whitespace
            page_num = begin_page  # store list page number
            if test_mode:
                list_url = base_url + str(page_num) + '/?Srt=D&SrtT=rt&sort_mode=1'  # query string required when sorting by Tabelog score ranking
                self.scrape_list(list_url, mode=test_mode)
            else:
                while True:
                    list_url = base_url + str(page_num) + '/?Srt=D&SrtT=rt&sort_mode=1'  # query string required when sorting by Tabelog score ranking
                    if self.scrape_list(list_url, mode=test_mode) != True:
                        break
                    # Fetch page data only up to the end_page parameter
                    if page_num >= end_page:
                        break
                    page_num += 1
            # Check the resulting df
            print(self.df)
        def scrape_list(self, list_url, mode):
            r = requests.get(list_url)
            if r.status_code != requests.codes.ok:
                return False
            soup = BeautifulSoup(r.content, 'html.parser')
            soup_a_list = soup.find_all('a', class_='list-rst__rst-name-target')  # list of store names
            if len(soup_a_list) == 0:
                return False
            if mode:
                for soup_a in soup_a_list[:2]:
                    item_url = soup_a.get('href')  # get the store's individual page URL
                    self.store_id_num += 1
                    self.scrape_item(item_url, mode)
            else:
                for soup_a in soup_a_list:
                    item_url = soup_a.get('href')  # get the store's individual page URL
                    self.store_id_num += 1
                    self.scrape_item(item_url, mode)
            return True
        def scrape_item(self, item_url, mode):
            start = time.time()
            r = requests.get(item_url)
            if r.status_code != requests.codes.ok:
                #print(f'error: not found {item_url}')
                return
            soup = BeautifulSoup(r.content, 'html.parser')
            store_name_tag = soup.find('h2', class_='display-name')
            store_name = store_name_tag.span.string
            #print('{} → store name: {}'.format(self.store_id_num, store_name.strip()), end='')
            self.store_name = store_name.strip()
            # Exclude stores other than ramen and tsukemen shops
            store_head = soup.find('div', class_='rdheader-subinfo')  # get header frame data for store information
            store_head_list = store_head.find_all('dl')
            store_head_list = store_head_list[1].find_all('span')
            #print('target:', store_head_list[0].text)
            if store_head_list[0].text not in {'ramen', 'tsukemen'}:
                #print('Not processed because it is not a ramen or tsukemen shop')
                self.store_id_num -= 1
                return
            try:
                address_name = soup.find("p", class_="rstinfo-table__address").text
                #print("address: {}".format(address_name), end="")
                self.address_name = address_name
            except AttributeError:
                href = ''
            rating_score_tag = soup.find('b', class_='c-rating__val')
            rating_score = rating_score_tag.span.string
            #print('Rating score: {} points'.format(rating_score), end='')
            self.score = rating_score
            # Exclude stores that have no rating score
            if rating_score == '-':
                #print('Not processed because there is no rating')
                self.store_id_num -= 1
                return
            # Exclude stores rated below 3.5
            if float(rating_score) < 3.5:
                #print('Not processed because the Tabelog rating is below 3.5')
                self.store_id_num -= 1
                return
            # Generate the data frame row
            self.make_df()
            return
        def make_df(self):
            self.store_id = str(self.store_id_num).zfill(8)  # zero padding
            se = pd.Series([self.store_id, self.store_name, self.address_name, self.score], self.columns)  # create the row
            self.df = self.df.append(se, self.columns)  # add the row to the dataframe
            print(self.address_name)
            print(self.score)
            print(self.store_name)
            print(self.store_id)
            print('df appended!')
            print('=' * 50)
            time.sleep(0.4)
            return
    if __name__ == '__main__':
        tokyo_ramen_address = Tabelog(base_url="https://tabelog.com/tokyo/rstLst/ramen/", test_mode=False)
        tokyo_ramen_address.df.to_csv("tokyo_ramen_address.csv", encoding='utf_8_sig')
    Postscript

    Since the original code is by someone else, I don't think the questioner can be blamed for this, but rather than doing the confirmation prints in the scrape_item function, it is better to check [whether each value could be extracted] and [whether it is the correct value] in the make_df function, as in the sketch below. Both checks can then be done in one place, which is more efficient. (Sorry for the rough explanation.)
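
    A rough sketch of what that consolidated check in make_df might look like (the conditions and messages here are illustrative, not the answerer's code):

        def make_df(self):
            self.store_id = str(self.store_id_num).zfill(8)
            # Verify extraction and value sanity in one place
            if not self.store_name:
                print('warning: store_name could not be extracted for', self.store_id)
            if not 3.5 <= float(self.score) <= 5.0:
                print('warning: unexpected score value:', self.score)
            se = pd.Series([self.store_id, self.store_name, self.address_name, self.score], self.columns)
            self.df = self.df.append(se, self.columns)
            return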

    Also, regarding the line if store_head_list[0].text not in {'ramen','tsukemen'}:
    when the first item of the "Genre" field in the store-information header is not "Ramen",
    the store is excluded even if it actually is a ramen shop.
    I think accuracy would improve further if this case were handled a little more flexibly; see the sketch below.
    Example: Genre: Ramen, Dandan noodles → processed
    Genre: Dandan noodles, Ramen → excluded
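
    For example, a more flexible version of that check (a sketch reusing store_head_list from the code above) accepts the store if any listed genre matches, not just the first one:

        # Collect all genre labels, then keep the store if any of them
        # is ramen or tsukemen
        genres = {span.text for span in store_head_list}
        if not genres & {'ramen', 'tsukemen'}:
            self.store_id_num -= 1
            return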