I want to scrape this university syllabus
Hitotsubashi University mercas
The site cannot be viewed without logging in, but members of the general public can get in by logging in with the ID and password fields left empty.
After that, you can search the syllabus from the syllabus page.
What I did
I tried to log in with Ruby's Mechanize library, but it does not work.
I tried to display the page after logging in with the following code:
require 'rubygems'
require 'nokogiri'
require 'mechanize'

agent = Mechanize.new

# Submit the login form with an empty ID and password
agent.get('https://mercas.hit-u.ac.jp/Campus/Web/UniversityPortal/UserAttestation/WFU06010.aspx') do |page|
  login_result = page.form_with(:action => 'WFU06010.aspx') do |login|
    login.field_with(:name => 'txbID').value = ""
    login.field_with(:name => 'txbPassword').value = ""
  end.submit
end

# Fetch the page shown after login and print it
agent.get('https://mercas.hit-u.ac.jp/Campus/Web/UniversityPortal/WFC00010.aspx') do |page|
  p page
end
The output contains only the menu bar.
For now, I just want to extract the text of the syllabus search results.
Answer #1
Scrape only to the extent that it does not burden the target site.
The pages on the site you are trying to scrape are generated dynamically, so this should not be possible with Mechanize, which does not execute JavaScript.
Selenium and Capybara can also handle dynamically changing pages, but I chose Watir because I am familiar with it.
If you run the following code, Firefox should launch and step through the syllabus course list one page at a time.
require 'watir'
require 'nokogiri'

# Requirements
#   watir       http://watir.com/
#   nokogiri    https://github.com/sparklemotion/nokogiri
#   Firefox     https://www.mozilla.org/en/firefox/
#   geckodriver https://github.com/mozilla/geckodriver/releases

# Timeout settings
client = Selenium::WebDriver::Remote::Http::Default.new
client.open_timeout = 480
client.read_timeout = 480
Watir.default_timeout = 600

# Pass the client so that the HTTP timeouts above take effect
browser = Watir::Browser.new :firefox, http_client: client
browser.goto('https://mercas.hit-u.ac.jp/Campus/Web/UniversityPortal/UserAttestation/WFU06010.aspx')
browser.div(xpath: '/html/body/form/div/div/div/div').wait_until(&:present?)

# Click Login
browser.element(xpath: '//*[@id="btnLogin"]').click
# Click on syllabus seminar
browser.element(xpath: '//*[@id="UC00050_S_0"]').click
# Click on syllabus search
browser.element(xpath: '//*[@id="UC00060_S_02"]').click
# Click search
browser.element(xpath: '//*[@id="searchButton"]').click
# Wait for the list of course subjects to be displayed
browser.span(xpath: '//*[@id="lblGamenTitle"]').wait_until(&:present?)

# Read the current and last page numbers from the rendered HTML
doc = Nokogiri::HTML.parse(browser.execute_script('return document.documentElement.innerHTML'))
last_page = doc.xpath('//*[@id="lblAllPage"]').text.to_i
current_page = doc.xpath('//*[@id="lblCurrentPage"]').text.to_i

# Move forward one page at a time, pausing between clicks
while current_page < last_page
  sleep(3)
  # The text must match the label of the link as shown on the page
  browser.element(text: 'Next page>').click
  current_page += 1
end

browser.close
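The loop above only moves from page to page; it does not keep anything it sees. A minimal sketch of how the text could be collected along the way is shown below. The helper name `collect_pages` and its two callables are hypothetical and not part of Watir. They stand in for the browser so that the logic can be run without one.

```ruby
# Walk from the current page to the last page and keep one result per page.
# read_page: callable returning whatever should be kept from the page on screen
# next_page: callable that moves the browser to the following page
# pause:     seconds to wait between pages, to go easy on the server
def collect_pages(current_page, last_page, read_page:, next_page:, pause: 3)
  results = [read_page.call]
  while current_page < last_page
    sleep(pause)
    next_page.call
    current_page += 1
    results << read_page.call
  end
  results
end

# With the browser from the script above, the callables might look like this
# (an assumption, not verified against the real site):
#   read_page = -> { browser.html }
#   next_page = -> { browser.element(text: 'Next page>').click }

# Stand-in data so the sketch runs without a browser
pages = ['page 1 rows', 'page 2 rows', 'page 3 rows']
index = 0
texts = collect_pages(1, pages.size,
                      read_page: -> { pages[index] },
                      next_page: -> { index += 1 },
                      pause: 0)
p texts
```

Each collected string could then be handed to `Nokogiri::HTML.parse` to pull out the rows of interest.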
Answer #2
As your output shows, the page seems to contain an <iframe>.
Unless you also fetch the content of that iframe, you cannot scrape the results.
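One way to act on this, sketched under assumptions: find the `src` of each `<iframe>` in the HTML you already fetched and resolve it against the page URL, so that the framed document can be requested separately (for example with `agent.get`). The helper `iframe_urls`, the sample HTML, and the example.com URLs are made up for illustration, and the real page's markup may differ. The regular expression is only a dependency-free stand-in; with Nokogiri loaded, `doc.css('iframe')` would likely be the more robust choice.

```ruby
require 'uri'

# Return the absolute URL of every <iframe src="..."> found in html.
# base_url is the URL the html was fetched from.
def iframe_urls(html, base_url)
  html.scan(/<iframe\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']/i)
      .flatten
      .map { |src| URI.join(base_url, src).to_s }
end

# Hypothetical page body with a menu bar and a framed main area
sample_html = <<~HTML
  <html><body>
    <div id="menu">menu bar</div>
    <iframe id="main" src="frames/main.aspx?id=1"></iframe>
  </body></html>
HTML

puts iframe_urls(sample_html, 'https://example.com/portal/index.aspx')
```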