As the title says.

When I run the following code, it prints None.

import urllib.request
from bs4 import BeautifulSoup

class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html, parser)
        for tag in sp.find_all("a"):
            url = tag.get("html")
            if url is None:
                print(url)
                continue
            if "html" in url:
                print("\n" + url)

news = "http://news.google.com/"
Scraper(news).scrape()

The site I usually read is not Google News but this one:

http://jin115.com/

When I tried this URL, I was able to scrape it.

I also tried changing

url = tag.get("html") → url = tag.get("articles")

but it still printed None.

Any advice would be appreciated.
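
As a minimal illustration of why both lookups return None: `get` looks up an HTML attribute by name, and link tags carry their target in `href`, not in an attribute called `html` or `articles`. The sketch below uses the standard library's `html.parser` as a stand-in so it runs without third-party packages; BeautifulSoup's `tag.get` behaves like the dictionary lookup shown here. The markup in `sample` and the class name `AnchorAttrs` are hypothetical.

```python
from html.parser import HTMLParser


class AnchorAttrs(HTMLParser):
    """Collect the attributes of every <a> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchors.append(dict(attrs))


# Hypothetical markup, not taken from either site.
sample = '<a href="/articles/1.html">story</a><a name="top">no link</a>'

parser = AnchorAttrs()
parser.feed(sample)
parser.close()

for attrs in parser.anchors:
    # "html" and "articles" are not attribute names, so those lookups give None.
    # Only "href" can hold a link, and even that is missing on the second tag.
    print(attrs.get("html"), attrs.get("articles"), attrs.get("href"))
```

The second tag shows that an `href` lookup can also come back empty, so a None check before any string test is still needed.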

  • Answer # 1

    I had made a mistake while experimenting:
    tag.get("href") is correct, not tag.get("html").

    Even so, the check

    if "html" in url:

    did not work. I wrote it following the book "Self-study Programmer".

    import urllib.request
    from bs4 import BeautifulSoup

    class Scraper:
        def __init__(self, site):
            self.site = site

        def scrape(self):
            response = urllib.request.urlopen(self.site)
            html = response.read()
            soup = BeautifulSoup(html, "html.parser")
            for tag in soup.find_all("a"):
                url = tag.get("href")
                if url and "article" in url:
                    print("\n" + "https://news.google.com/" + url)

    Scraper('https://news.google.com/').scrape()

    I am not sure this code does what the book intended, but it prints URLs that can actually be opened.

  • Answer # 2

    This is probably code from a reference book; the same question about it comes up often on Stack Overflow.
    The structure of the http://news.google.com/ page changed at some point,
    so the code no longer works as written.

    Conclusion

    The code itself is not the problem.
    Because the target page has changed, the code cannot be used as is.


    Reference: Stack Overflow, "Web scraping code does not work"
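
Answer # 1 builds the final URL by concatenating "https://news.google.com/" with each href. A slightly more robust variant, sketched here under the assumption that the hrefs may be relative, root-relative, or already absolute, is to resolve them with `urllib.parse.urljoin`. The function name `article_links` and the values in `sample` are hypothetical; the "article" keyword filter is the one from Answer # 1.

```python
from urllib.parse import urljoin

BASE = "https://news.google.com/"


def article_links(hrefs, base=BASE, keyword="article"):
    """Return absolute URLs for hrefs that contain keyword, skipping None."""
    links = []
    for href in hrefs:
        # tag.get("href") returns None for anchors without an href attribute.
        if href and keyword in href:
            links.append(urljoin(base, href))
    return links


# Hypothetical href values of the kind tag.get("href") may return.
sample = [None, "./articles/abc123", "/about", "https://example.com/article/9"]

for link in article_links(sample):
    print(link)
```

With `urljoin`, a relative href is resolved against the base and an already absolute URL is left unchanged, whereas plain concatenation would prepend the base to it. Since the page structure may change again, the keyword filter should be treated as an assumption to re-check rather than a stable rule.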