I'm writing a Python program that uses pandas to scrape multiple sites and finally output a CSV.
First, df is declared outside any function in main.py as follows. This df is created successfully.
columns = ["Name", "Price", "Url"]
df = pd.DataFrame(columns=columns)
Next, the first function searches an EC site, scrapes the results, and adds the values to df as follows.
Name is the product name, Price is the price, and Url is the URL of the product page.
| Name | Price | Url |
|---|---|---|
| hoge | hoga | hogu |
| huga | hoga | huge |
In the next function, I want to take each URL from this df, visit it, get the product's ID, and add an ID column with those values to the df.
| Name | Price | Url | ID |
|---|---|---|---|
| hoge | hoga | hogu | abc |
| huga | hoga | huge | def |
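A minimal sketch of this step, assuming a hypothetical `fetch_id` helper that takes a product URL and returns its ID (here it is only a stand-in lookup; a real one would download and parse the page). The name `add_ids` is also made up for the example. The function receives df as an argument and returns the updated copy:

```python
import pandas as pd


def fetch_id(url):
    # Hypothetical stand-in: a real version would request the page at
    # `url` and extract the product ID from its HTML.
    fake_ids = {"hogu": "abc", "huge": "def"}
    return fake_ids.get(url)


def add_ids(df, get_id=fetch_id):
    # Return a copy of df with an "ID" column filled from each row's "Url".
    out = df.copy()
    out["ID"] = [get_id(url) for url in out["Url"]]
    return out


df = pd.DataFrame(
    [["hoge", "hoga", "hogu"], ["huga", "hoga", "huge"]],
    columns=["Name", "Price", "Url"],
)
df = add_ids(df)
print(df)
```

Because the result is returned rather than written to a global, the caller decides which variable holds the updated DataFrame.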
Since I want to go on to scrape another site using the ID, I would like to treat df as a global, but even the two steps listed above do not work.
Traceback (most recent call last):
  File "main.py", line 14, in <module>
    gethogeSearch.search_hoge(search_word, get_pages)
  File "/hoge/gethogeSearch.py", line 102, in search_hoge
    print(df)
NameError: name 'df' is not defined
(Posted on November 27, 2019) What I tried
Instead of using a global variable, define df in main.py and pass it as an argument to each module.
# main.py
import pandas as pd
import gethogeSearch

# Global variable shared by each function; the basis of the final CSV
columns = ["Name", "Price", "Url"]
df = pd.DataFrame(columns=columns)

# Entered from the terminal at run time
search_word = input('word to search:')
number = input('Number of pages:')
get_pages = int(number)

df = gethogeSearch.search_hoge(df, search_word, get_pages)
print(df)
# gethogeSearch.py
# coding: utf-8
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
from time import sleep
import random
import lxml.html


def search_hoge(df, search_word, get_pages):
    # Create the scraping URL (= url) from the search word (omitted)
    page = 1
    try:
        # Repeat for "get_pages" pages
        while page < get_pages + 1:
            # Show which page is being retrieved
            print(page, "Retrieving page .....")
            headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3;Win64;x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
            # Get the HTML from the created URL
            response = requests.get(url, headers=headers)
            sleepTime = random.randint(2, 8)
            # BeautifulSoup initialization
            soup = BeautifulSoup(response.content, "lxml")
            # Entire search result list
            items = soup.select(".hoge")
            # Process each element of the list
            for item in items:
                name = item.find("span", {"class": "huga"})
                price = item.find("span", {"class": "hoge"})
                # Add to df if not empty
                if name is not None and price is not None:
                    nameTitle = name.string
                    priceText = price.string
                    item_url = item.a.get("href")
                    se = pd.Series([nameTitle, priceText, item_url],
                                   columns)
                    print(se)
                    df = df.append(se, columns)
            # Get the "next page" URL at the bottom of the page
            NextUrl = soup.find('li', {"class": "hg"})
            Url = NextUrl.a.get("href")
            url = huga + Url
            # Moving on to the next page, so increase page by 1
            page += 1
            sleepTime = random.randint(3, 15)
            sleep(sleepTime)
    except:
        # If it stops before reaching the requested number of pages,
        # report that there were no further pages
        nextpage = str(page + 1)
        print("There was no page " + nextpage + " or later")
    finally:
        # Decide the csv file name to save
        filename = search_word + ".csv"
        # Write the DataFrame to csv
        df.to_csv(filename, encoding='utf-8-sig')
        print("Target URL:" + url)
        # Report that it finished
        print(filename + " created")
    return df  # empty dataframe
What I want to know
Why can't I pass the arguments?
Answer # 1
Answer # 2
You can't use global variables across modules (.py files). Even though it is called a global variable, its scope is limited to the single file in which it is defined.
I think the safe approach is simply to use arguments and return values. When passing df, the call looks like gethogeSearch.search_hoge(df, search_word, get_pages). The gethogeSearch.search_hoge side is rewritten to accept the corresponding argument. If you want to modify the DataFrame, return it with a return statement at the end, and reassign it to the variable df on the receiving side.
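To illustrate the scoping rule, here is a small sketch that simulates a second module inside one file using `types.ModuleType`. The module name `other` and both function names are made up for the example. A function looks up global names in the module where it was defined, so it cannot see a `df` defined in another file unless the value is passed in:

```python
import types

# Build a stand-in for a separate .py file (hypothetical module "other").
other = types.ModuleType("other")
source = """
def show_global():
    return df          # looks for df in this module's own globals

def show_argument(df):
    return df          # uses the value the caller passed in
"""
exec(source, other.__dict__)

df = "defined in main"   # a global of the current module only

try:
    other.show_global()
    error = None
except NameError as e:
    error = str(e)

print(error)             # name 'df' is not defined

passed = other.show_argument(df)
print(passed)            # defined in main
```

The first call fails with the same NameError as in the question's traceback, while the second works because the value arrives as an argument.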
Answer # 3
Hayataka, I apologize for the misunderstanding.
I will only run this against sites where it does not violate the terms of use.
After adding the row-appending logic inside the function with reference to this link, it worked well:
https://qiita.com/567000/items/d8a29bb7404f68d90dd4
"Add value (row) to DataFarme whose type (column) is fixed"
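For reference, one way to add a row to a DataFrame whose columns are already fixed is sketched below. This is only an illustration and not necessarily the linked article's method; `add_row` is a name made up here. In the code as posted, search_hoge appears to refer to `columns`, which is defined only in main.py, so building the Series would raise a NameError that the bare `except:` silently catches. That would be consistent with the empty DataFrame coming back. Relying on the columns df already has avoids that dependency:

```python
import pandas as pd


def add_row(df, name, price, url):
    # Append one row and return the result. The values are matched by
    # position to the columns df already has, so no module-level
    # `columns` variable is needed inside this function.
    out = df.copy()
    out.loc[len(out)] = [name, price, url]
    return out


df = pd.DataFrame(columns=["Name", "Price", "Url"])
df = add_row(df, "hoge", "hoga", "hogu")
df = add_row(df, "huga", "hoga", "huge")
print(df)
```

Note that `DataFrame.append`, which the question's code uses, was removed in pandas 2.0, so on current versions a different way of adding rows is needed in any case.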
A global variable can be shared between functions, so if that is not working, something else is wrong.
I can't say anything more than that because no further code has been presented.