
When I specify multiple URLs in Scrapy, the data is fetched for each of them in order.
However, when the data is saved to the DB it gets overwritten, so only the last fetched value is stored, repeated once per URL.

Expected result (when scraping only the URL)

[{'url': 'https://www.python.org/'},
 {'url': 'https://www.python.org/downloads/'},
 {'url': 'https://docs.python.org/3'}
]


Actual values stored in the DB

[
{'url': 'https://docs.python.org/3/'},
{'url': 'https://docs.python.org/3/'},
{'url': 'https://docs.python.org/3/'}
]


I think the place to handle this is process_item() in pipelines.py, but I cannot get it to work.
Since these multiple URLs are search-result URLs, I want to keep their order, or attach a rank to them so they can be retrieved in order (a sketch of one approach follows the spider code below).

Windows 10
Python 3.6.5
Scrapyd 1.2
Scrapy 1.5

# blogspider.py
import scrapy
from scrapy import Request
from apps.main_app.models import SiteData
from collections import OrderedDict
"""
$ scrapy crawl blogspider
Command to run this spider from the command line
"""

class BlogSpider(scrapy.Spider):
    name = 'blogspider'  # Name of the spider. It does not work without this

    def __init__(self, *args, **kwargs):
        self.domain = kwargs.get('domain')
        self.start_urls = ['https://www.python.org/',
                           'https://www.python.org/downloads/',
                           'https://docs.python.org/',
                           ]
        self.item = OrderedDict()
        super(BlogSpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        item = self.item
        item['url'] = response.url
        yield item
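
For the ordering requirement, one common Scrapy pattern is to attach each URL's position to its request through meta and to build a fresh item per response instead of reusing a shared one. A minimal sketch under that assumption (the class name and the 'rank' key are my own choices, not part of the original code):

# ranked_blogspider.py (sketch)
import scrapy
from scrapy import Request
from collections import OrderedDict

class RankedBlogSpider(scrapy.Spider):
    name = 'rankedblogspider'
    start_urls = [
        'https://www.python.org/',
        'https://www.python.org/downloads/',
        'https://docs.python.org/',
    ]

    def start_requests(self):
        # enumerate() records each URL's position in the search results
        for rank, url in enumerate(self.start_urls):
            yield Request(url=url, callback=self.parse, meta={'rank': rank})

    def parse(self, response):
        item = OrderedDict()  # a fresh dict per response, never shared
        item['rank'] = response.meta['rank']
        item['url'] = response.url
        yield item
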
# pipelines.py
from apps.main_app.models import SiteData
import json

class ScrapyAppPipeline(object):
    def __init__(self, unique_id, *args, **kwargs):
        self.count_rank = 0
        self.unique_id = unique_id
        self.items = []
        self.dict = {}

    @classmethod
    def from_crawler(cls, crawler):
        # The pipeline instance is created via this class method.
        # Because the class is passed in, the crawler settings are accessible here.
        return cls(
            unique_id=crawler.settings.get('unique_id')  # comes through Django views
        )

    def close_spider(self, spider):
        # Called when the spider closes; saves the collected items to the Django model
        site_data = SiteData.objects.get(pk=self.unique_id)
        site_data.site_data = json.dumps(self.items)
        site_data.save()

    def process_item(self, item, spider):
        """Collect the items from the spider into self.items; close_spider saves them.
        Left as-is, the entries get overwritten once per URL."""
        self.items.append(item)
        print(self.items)
        return item
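
If each item carries a rank as in the spider sketch above, the pipeline can store an independent copy of every item as it arrives and sort before saving. A hedged sketch under that assumption (the class name is my own, and it assumes the 'rank' field from the spider sketch):

# pipelines.py (sketch)
from apps.main_app.models import SiteData
import json

class OrderedScrapyAppPipeline(object):
    def __init__(self, unique_id, *args, **kwargs):
        self.unique_id = unique_id
        self.items = []

    @classmethod
    def from_crawler(cls, crawler):
        return cls(unique_id=crawler.settings.get('unique_id'))

    def process_item(self, item, spider):
        # dict(item) stores an independent copy, so a later mutation of the
        # spider's item cannot change what was already collected
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # responses may arrive in any order; sort by the rank the spider attached
        ordered = sorted(self.items, key=lambda i: i['rank'])
        site_data = SiteData.objects.get(pk=self.unique_id)
        site_data.site_data = json.dumps(ordered)
        site_data.save()
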


Output of the print statement in process_item

[OrderedDict([('url', 'https://www.python.org/')])]
[OrderedDict([('url', 'https://www.python.org/downloads/')]), OrderedDict([('url', 'https://www.python.org/downloads/')])]
[OrderedDict([('url', 'https://docs.python.org/3')]), OrderedDict([('url', 'https://docs.python.org/3')]), OrderedDict([('url', 'https://docs.python.org/3')])]


The data is being collected, but each new append also overwrites the earlier entries, once per URL.
How can I append the items without them being overwritten? Thank you.

  • Answer #1

    Appending a dict-type variable to a Python list stores a reference, so the variable behaves like a pointer. If you append the same dict object several times, every appended entry ends up with the same value. Write it as follows instead:

    self.items.append(item.copy())

    With copy() added, the items are now appended without being overwritten. Thank you very much.
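
    A standalone illustration of that reference behavior, in plain Python with no Scrapy involved:

    # Appending the same dict object repeatedly vs. appending copies
    item = {}
    no_copy, with_copy = [], []
    for url in ['https://www.python.org/', 'https://docs.python.org/']:
        item['url'] = url
        no_copy.append(item)           # stores a reference to the one shared dict
        with_copy.append(item.copy())  # stores an independent snapshot

    print(no_copy)    # [{'url': 'https://docs.python.org/'}, {'url': 'https://docs.python.org/'}]
    print(with_copy)  # [{'url': 'https://www.python.org/'}, {'url': 'https://docs.python.org/'}]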