
I'm writing a program in Python that uses pandas, scrapes several sites, and finally writes out a CSV.
First, I declare df outside the functions in main.py as follows. This df is created successfully.

import pandas as pd

columns = ["Name", "Price", "Url"]
df = pd.DataFrame(columns=columns)

Next, the first function searches an e-commerce site, scrapes the results, and appends the values to df as shown below.
Name is the product name, Price is the price, and Url is the URL of the product page.

Name  Price  Url
hoge  hoga   hogu
huga  hoga   huge

In the next function, I want to take each Url from this df, visit that URL, scrape the product's ID, and add an ID column with those values to the same df (a sketch of this step follows the table below).

Name  Price  Url   ID
hoge  hoga   hogu  abc
huga  hoga   huge  def
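
A minimal sketch of that second step, assuming a hypothetical fetch_product_id() helper and a made-up "product-id" CSS class (neither appears in the original code):

import requests
from bs4 import BeautifulSoup

def fetch_product_id(url):
    # Hypothetical selector: assumes the ID lives in a <span class="product-id"> on the product page
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "lxml")
    tag = soup.find("span", {"class": "product-id"})
    return tag.string if tag is not None else None

def add_ids(df):
    # Visit each product page listed in the Url column and store the scraped ID in a new "ID" column
    df["ID"] = [fetch_product_id(u) for u in df["Url"]]
    return df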

Since I want to continue scraping another site using the ID, I would like to handle df globally, but even the two steps above are not working.

Traceback (most recent call last):
  File "main.py", line 14, in <module>
    gethogeSearch.search_hoge(search_word, get_pages)
  File "/hoge/gethogeSearch.py", line 102, in search_hoge
    print(df)
NameError: name 'df' is not defined
What I tried (added November 27, 2019)

Instead of using a global variable, I defined df in main.py and passed it to each module as an argument.

# main.py
import pandas as pd
import gethogeSearch

# DataFrame shared by each function; the source of the final CSV
columns = ["Name", "Price", "Url"]
df = pd.DataFrame(columns=columns)

# Read the run parameters from the terminal
search_word = input('Word to search: ')
number = input('Number of pages: ')
get_pages = int(number)

df = gethogeSearch.search_hoge(df, search_word, get_pages)
print(df)
# gethogeSearch.py
# coding: utf-8
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
from time import sleep
import random
import lxml.html

def search_hoge(df, search_word, get_pages):

    # Build the scraping URL (= url) from the search word (omitted)
    page = 1
    try:
        # Repeat for "get_pages" pages
        while page < get_pages + 1:
            # Show which page is currently being retrieved
            print(page, "Retrieving page .....")
            headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
            # Get the HTML from the created URL
            response = requests.get(url, headers=headers)
            sleepTime = random.randint(2, 8)

            # Initialize BeautifulSoup
            soup = BeautifulSoup(response.content, "lxml")
            # Entire search result list
            items = soup.select(".hoge")
            # Process each element of the list
            for item in items:
                name = item.find("span", {"class": "huga"})
                price = item.find("span", {"class": "hoge"})
                # Add to df only if neither field is empty
                if name is not None and price is not None:
                    nameTitle = name.string
                    priceText = price.string
                    item_url = item.a.get("href")
                    # Use df.columns as the index so we don't depend on a
                    # global "columns" that only exists in main.py
                    se = pd.Series([nameTitle, priceText, item_url], index=df.columns)
                    print(se)
                    df = df.append(se, ignore_index=True)
            # Get the "next page" URL at the bottom of the page
            NextUrl = soup.find('li', {"class": "hg"})
            Url = NextUrl.a.get("href")
            url = huga + Url  # huga is the site's base URL, defined in the omitted section
            # Moving on to the next page, so increase page by 1
            page += 1
            sleepTime = random.randint(3, 15)
            sleep(sleepTime)
    except Exception:
        # If scraping stops before reaching the requested number of pages,
        # report that there were no pages beyond this point
        nextpage = str(page + 1)
        print("There was no page " + nextpage + " or later")
    finally:
        # Decide the name of the CSV file to save
        filename = search_word + ".csv"
        # Write the collected list out as CSV
        df.to_csv(filename, encoding='utf-8-sig')
        print("Target URL: " + url)
        # Report that the run finished
        print(filename + " created")
    return df  # hand the DataFrame back to the caller (it came back empty in my run)
What I would like to know

Why does passing df as an argument not work?

  • Answer # 1

A global variable can be shared between functions, so if that isn't working, something else is wrong.

I can't say more than that, because no further code has been presented.

  • Answer # 2

You can't use global variables across modules (.py files). Although it is called a global variable, its scope is limited to a single file.
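
    A minimal two-file demonstration of that scope limitation, using a hypothetical module named helper (it mirrors the asker's NameError):

    # main.py
    import helper             # hypothetical second module
    df = "defined in main.py"
    helper.show()             # raises NameError inside helper.show()

    # helper.py
    def show():
        print(df)             # NameError: name 'df' is not defined -- main.py's df is not in this module's scope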

    I think the safe approach is simply to pass arguments and return values. When calling, write

    gethogeSearch.search_hoge(df, search_word, get_pages)

    and rewrite the gethogeSearch.search_hoge side to accept the corresponding argument. If the function modifies the DataFrame, return it with a return statement at the end and reassign it to the variable df on the caller's side.
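
    A minimal sketch of that argument/return-value pattern (the call values are placeholders):

    # main.py -- create the DataFrame, pass it in, and reassign the returned value
    import pandas as pd
    import gethogeSearch

    columns = ["Name", "Price", "Url"]
    df = pd.DataFrame(columns=columns)
    df = gethogeSearch.search_hoge(df, "hoge", 3)  # reassign so the caller sees the updates

    # gethogeSearch.py -- accept the DataFrame and hand it back when done
    def search_hoge(df, search_word, get_pages):
        # ... scraping that appends rows to df (omitted) ...
        return df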

  • Answer # 3

    Hayataka, I apologize for the misunderstanding.
    We will only operate on sites whose terms we do not violate.

    This link,
    https://qiita.com/567000/items/d8a29bb7404f68d90dd4
    "Adding a value (row) to a DataFrame whose types (columns) are fixed",
    was my reference; after adding that processing inside the function, it worked well.
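
    A sketch of the row-append technique the linked article describes (the example values here are made up): build a Series whose index matches the DataFrame's columns, then append it as a row.

    import pandas as pd

    columns = ["Name", "Price", "Url"]
    df = pd.DataFrame(columns=columns)

    # A Series indexed by the column names becomes one row of the DataFrame
    se = pd.Series(["hoge", "1000", "https://example.com/item"], index=columns)
    df = df.append(se, ignore_index=True)   # pandas < 2.0
    # On pandas >= 2.0, where DataFrame.append was removed, use:
    # df = pd.concat([df, se.to_frame().T], ignore_index=True)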