Assumptions

I would like to achieve steps 1 and 2 below by web scraping with R, but this question is narrowed down to step 1. Note that the target page can only be reached through a dynamic authentication page, so the data is retrieved via RSelenium.

  1. Parse the JSON text data embedded in a specific HTML page
  2. Store this JSON data in a data frame
Problems I am experiencing

I can reach the target page with RSelenium and get the page source, but fromJSON fails to parse the JSON displayed in the page, and I do not know how to approach this.

What is displayed on the page

The following contents are displayed on the corresponding page.

[{"id": 1, "name": "family", "middleCategories": [{"id": 1, "name": "mother", "scenarios": [{"id": 105, "name": "son"}, {"id": 106, "name": "daughter"}]} .. (omitted) ..]}]}]
Page source

The overall structure of the target page, as seen in the Chrome developer tools, is as follows.

<html>
    <script></script>
    <head></head>
    <body></body>
</html>

The values have been changed, but the contents of the body tag above look like this.

<body><pre style=\"word-wrap: break-word; white-space: pre-wrap;\">[{\"id\": 1, \"name\": \"family\", \"middleCategories\": [{\"id\": 1, \"name\": \"mother\", \"scenarios\": [{\"id\": 105, \"name\": \"son\"}, {\"id\": 106, \"name\": \"daughter\"}]} .. (omitted) ..</pre></body>
Applicable source code

I tried the following code.

# Get the content (login and page-transition steps before this point are omitted)
res <- remDr$getPageSource()[[1]]
# Check that the content was retrieved
res
# The content looks like JSON, so parse it as JSON
fromJSON(res)
Output of res

As shown below, the page content appears to be retrieved without any problems.

[1] "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head></head><body><pre style=\"word-wrap: break-word; white-space: pre-wrap;\">[{\"id\": 1, \"name\": \"family\", \"middleCategories\": [{\"id\": 1, \"name\": \"mother\", \"scenarios\": [{\"id\": 105, \"name\": \"son\"}, {\"id\": 106, \"name\": \"daughter\"}]} .. (omitted) ..</pre></body></html>"
Error message: result of fromJSON(res)

When the above code is executed, the following error message appears.

Error: lexical error: invalid char in json text.
                                       <html xmlns="http://www.w3.org/
                     (right here) ------^
Supplemental information (FW/tool version etc.)

R (3.5.0)
RStudio (1.1.442)
library(RSelenium)
library(RJSONIO)
library(jsonlite)
library(rvest)

  • Answer # 1

    If you save the page locally, it is saved as JSON data and can then be read as JSON.

  • Answer # 2

    To my eyes, this is not valid JSON data. Before that, the string that should hold only the contents of the <body> tag also contains the HTML header, so the problem seems to lie in the string itself, before R's JSON parsing even comes into play.
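
Following the point in Answer # 2, one possible approach is to strip the HTML wrapper first and pass only the text inside the pre tag to fromJSON. The sketch below is an illustration, not a confirmed fix from the original thread. It assumes the xml2 package (installed alongside rvest) and jsonlite, and it uses a hypothetical, complete stand-in for `res`, because the page content in the question is elided. In the real session `res` would come from `remDr$getPageSource()[[1]]`.

```r
library(xml2)      # read_html(), xml_find_first(), xml_text()
library(jsonlite)  # fromJSON()

# Hypothetical stand-in for remDr$getPageSource()[[1]].
# The JSON here is a shortened but complete version of the sample in the question.
res <- paste0(
  '<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body>',
  '<pre style="word-wrap: break-word; white-space: pre-wrap;">',
  '[{"id": 1, "name": "family", "middleCategories": ',
  '[{"id": 1, "name": "mother", "scenarios": ',
  '[{"id": 105, "name": "son"}, {"id": 106, "name": "daughter"}]}]}]',
  '</pre></body></html>'
)

# Parse the HTML and keep only the text inside the first <pre> element.
page <- read_html(res)
json_text <- xml_text(xml_find_first(page, "//pre"))

# Only the JSON text is handed to the parser now, not the HTML wrapper.
categories <- fromJSON(json_text)

# jsonlite simplifies arrays of objects to data frames; nested arrays
# become list columns that hold further data frames.
scenarios <- categories$middleCategories[[1]]$scenarios[[1]]
print(scenarios)
```

Because both RJSONIO and jsonlite export a function named fromJSON, calling it as `jsonlite::fromJSON(json_text)` may be safer when both packages are attached.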