I got the text of <h2>, but I don't know how to get the link

I am trying to get the text of each <h2> and its link into a CSV file by web scraping with Python.
I was able to get the text of <h2>, but I don't know how to get the contents of the link (href), such as #id2.

  • Scraping target: BeautifulSoup4 documentation
Desired result

Get the text and link of each <h2> into a CSV file

0, Soap can't be eaten¶, #id2
1, About this document¶, #id3
... (omitted)
Contents of <h2>
<h2>Can't eat soap<a href="#id2" title="Permalink to this headline">¶</a></h2>
<h2>About this document<a href="#id3" title="Permalink to this headline">¶</a></h2>
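For reference, a snippet like the one above can be parsed directly with BeautifulSoup. This is a minimal, self-contained sketch using the first <h2> fragment shown:

```python
from bs4 import BeautifulSoup

# The first <h2> fragment from above, as a standalone string.
html = '<h2>Can\'t eat soap<a href="#id2" title="Permalink to this headline">¶</a></h2>'
soup = BeautifulSoup(html, "html.parser")

h2 = soup.find("h2")
print(h2.get_text())   # Can't eat soap¶  (text of the h2, including the ¶ inside <a>)
print(h2.a["href"])    # #id2             (the href attribute of the <a> inside the h2)
```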


There are examples like the one below, and since there is an href inside the h2, there should be a way to get it. I'm currently using a single for loop, so I suspect I can get the link inside it, but it doesn't work (T_T)

for link in soup.find_all('a'):
    print(link.get('href'))
Current code
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import csv
"""
I want to save <h2> text and link in a csv file
"""
r = requests.get("http://kondou.com/BS4/index.html")
soup_content = BeautifulSoup(r.content, "html.parser")
alltxt = soup_content.get_text()
with open('h2textlink.csv', 'w+', newline='', encoding='utf-8') as f:
    n = 0
    for subheading in soup_content.find_all('h2'):
        sh = subheading.get_text()
        writer = csv.writer(f, lineterminator='\n')
        writer.writerow([n, sh])
        n += 1
Current output result
0, can't eat soap¶
1, About this document¶
2, When I want help¶
3, problems after installation¶
4, Installing the parser¶
... (omitted)
H2 text and link acquisition complete!

Thanks to you! (≧ ∀ ≦)

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import csv
"""
Save <h2> text and link in a csv file
"""
r = requests.get("http://kondou.com/BS4/index.html")
soup_content = BeautifulSoup(r.content, "html.parser")
alltxt = soup_content.get_text()
with open('h2textlink.csv', 'w+', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, lineterminator='\n')
    std_link = 'http://kondou.com/BS4/'
    for n, subheading in enumerate(soup_content.find_all('h2')):
        sh = subheading.get_text()
        h2link = subheading.a['href']
        writer.writerow([n, sh, std_link + h2link])
Code execution results
0, Soap can't be eaten¶, http://kondou.com/BS4/#id2
1, About this document¶, http://kondou.com/BS4/#id3
2, If I want help¶, http://kondou.com/BS4/#id5
3, Problems after installation¶, http://kondou.com/BS4/#id9
... (omitted)
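As a side note, instead of hand-concatenating a base string with the fragment, the standard library's urllib.parse.urljoin can resolve the relative link against a base URL. A small sketch, using the base string from the code above:

```python
from urllib.parse import urljoin

# Resolve a fragment-only href against the base URL instead of
# concatenating strings by hand.
base = "http://kondou.com/BS4/"
print(urljoin(base, "#id2"))  # http://kondou.com/BS4/#id2
```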
  • Answer #1

    It's a good idea to practice scraping on the document itself.

      

    I want to get the link in h2.

    href is an attribute of the <a> tag inside the h2, so it can be obtained as follows:

    subheading.find('a')['href']

    Or you can write:

    subheading.a['href']
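    Both forms can be tried on a standalone snippet. A minimal sketch using one of the <h2> fragments from the question:

```python
from bs4 import BeautifulSoup

html = '<h2>About this document<a href="#id3" title="Permalink to this headline">¶</a></h2>'
subheading = BeautifulSoup(html, "html.parser").find("h2")

# Both accessors return the same attribute value.
print(subheading.find("a")["href"])  # #id3
print(subheading.a["href"])          # #id3 -- .a is shorthand for find("a")
```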

    It is enough to create the writer object once, outside the loop.
    For the counter variable n, using enumerate is a bit cleaner.

    In response to comments
      

    (・_・) I read the linked document and also searched Google for "for statement enumerate", but I still didn't quite understand it... Sorry orz

    Use it like this.

    for n, subheading in enumerate(soup_content.find_all('h2')):
        ...

    This will eliminate n = 0 and n += 1.
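    A minimal illustration of enumerate, using hypothetical heading strings rather than a live page:

```python
# enumerate() pairs each item with its index, replacing the manual counter.
headings = ["Soap can't be eaten¶", "About this document¶"]
for n, sh in enumerate(headings):
    print(n, sh)
# 0 Soap can't be eaten¶
# 1 About this document¶
```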