I want to scrape the syllabus of this university:
Hitotsubashi University mercas
The site cannot be viewed without logging in, but members of the general public can get in by logging in with the ID and password fields left blank.
After that, you can go to the syllabus page and search the syllabus.

What I did
I tried to log in using Ruby's Mechanize library, but it doesn't work.

require 'rubygems'
require 'nokogiri'
require 'mechanize'

agent = Mechanize.new
agent.get('https://mercas.hit-u.ac.jp/Campus/Web/UniversityPortal/UserAttestation/WFU06010.aspx') do |page|
  # Fill in the login form with a blank ID and password
  login_result = page.form_with(:action => 'WFU06010.aspx') do |login|
    login.field_with(:name => 'txbID').value = ""
    login.field_with(:name => 'txbPassword').value = ""
  end
end
# Fetch the portal page that is shown after login
agent.get('https://mercas.hit-u.ac.jp/Campus/Web/UniversityPortal/WFC00010.aspx') do |page|
  p page
end
When I print the page after logging in with the code above, I get:
#<Mechanize::Page
  #<URI::HTTPS https://mercas.hit-u.ac.jp/Campus/Web/UniversityPortal/WFC00010.aspx>}
 {title "Hitotsubashi University Academic Information System/Student Portal Mercas (MERCURY CAMPUS SYSTEM)"}
  #<Mechanize::Page::Link "User Manual (for teachers)" "/manual-teacher-N.pdf">
  #<Mechanize::Page::Link "User Manual (for students)" "/manual-student.pdf">
  #<Mechanize::Page::Link
   "javascript:__doPostBack('UC00050$linkLogOut','')">
  #<Mechanize::Page::Link
   "Close window"
   "javascript:__doPostBack('UC00050$linkClose','')">}
  #<Mechanize::Form
   {name "WFC01010"}
   {method "POST"}
   {action "WFC01010.aspx"}
    [hidden:0x3fcf3e123e80 type: hidden name: __VIEWSTATE value: /wEPDwUKLTExNjY1ODMyMA8WCB4STWVzc2FnZUluZm9ybWF0aW9uFgEy/QIAAQAAAP////8BAAAAAA ~ omitted ~ 6m0yE2uY=]
    [hidden:0x3fcf3e123a70 type: hidden name: __VIEWSTATEGENERATOR value: 0B535023]
    [hidden:0x3fcf3e12369c type: hidden name: __EVENTVALIDATION value: /wEWAwKSoqUuAuegiskIAvrV7/IH6dgLY2bsu9wzjA4YEZUT7CSPO1Y=]}

Only the menu bar part of the page is returned.

For now, I just want to extract the text of the syllabus search results.

  • Answer # 1

    Scraping is acceptable only as long as it does not put a burden on the target site.

    On to the main topic.
    The site you are trying to scrape generates its pages dynamically,
    so it should not be possible with Mechanize, which does not execute JavaScript.

    Selenium and Capybara can also handle dynamically changing pages,
    but I chose Watir because I am familiar with it.

    If you run the following code, Firefox should launch and the syllabus's
    list of courses should advance page by page.

    require 'watir'
    require 'nokogiri'
    # Requirements
    # watir http://watir.com/
    # nokogiri https://github.com/sparklemotion/nokogiri
    # Firefox https://www.mozilla.org/en/firefox/
    # geckodriver https://github.com/mozilla/geckodriver/releases

    # Timeout settings
    client = Selenium::WebDriver::Remote::Http::Default.new
    client.open_timeout = 480
    client.read_timeout = 480
    Watir.default_timeout = 600
    # Pass the HTTP client so that the timeouts above take effect
    browser = Watir::Browser.new :firefox, http_client: client
    browser.goto('https://mercas.hit-u.ac.jp/Campus/Web/UniversityPortal/UserAttestation/WFU06010.aspx')
    browser.div(xpath: '/html/body/form/div/div/div/div[2]').wait_until(&:present?)
    # Click Login
    browser.element(xpath: '//*[@id="btnLogin"]').click
    # Click syllabus/seminar
    browser.element(xpath: '//*[@id="UC00050_S_0"]').click
    # Click syllabus search
    browser.element(xpath: '//*[@id="UC00060_S_02"]').click
    # Click search
    browser.element(xpath: '//*[@id="searchButton"]').click
    # Wait for the list of courses to be displayed
    browser.span(xpath: '//*[@id="lblGamenTitle"]').wait_until(&:present?)
    # Read the page counters from the rendered HTML
    doc = Nokogiri::HTML.parse(browser.execute_script('return document.documentElement.innerHTML'))
    last_page = doc.xpath('//*[@id="lblAllPage"]').text.to_i
    current_page = doc.xpath('//*[@id="lblCurrentPage"]').text.to_i
    # Move through the result pages one at a time
    while current_page < last_page
      sleep(3)
      browser.element(text: 'Next page>').click
      current_page += 1
    end
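
    The loop above only turns the pages; it does not collect anything yet.
    Below is a minimal sketch of one way to structure that step. It assumes
    a hypothetical helper named each_result_page (not part of Watir) that
    yields every page number from the current page to the last one and calls
    a caller-supplied advance lambda between pages. With Watir, advance could
    be the 'Next page>' click plus the sleep from the code above, and the
    block could parse browser.html with Nokogiri. A plain counter stands in
    for the browser here so that the sketch runs on its own.

    ```ruby
    # Hypothetical helper (not part of Watir): visit every result page once.
    # current_page / last_page are the integers read from lblCurrentPage / lblAllPage.
    # advance is any callable that moves the browser to the next page.
    def each_result_page(current_page, last_page, advance:)
      loop do
        yield current_page                  # handle the page that is showing now
        break if current_page >= last_page  # stop once the last page is handled
        advance.call                        # e.g. click 'Next page>' and wait
        current_page += 1
      end
    end

    # Stand-in for a browser: just count how often "next" was clicked.
    clicks = 0
    visited = []
    each_result_page(1, 3, advance: -> { clicks += 1 }) { |n| visited << n }
    p visited  # => [1, 2, 3]
    p clicks   # => 2
    ```

    Unlike the while loop above, this version also hands the first page to
    the block, so the extraction code only has to be written in one place.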

  • Answer # 2

    As the output suggests, the page seems to contain an <iframe>.
    You cannot scrape the content unless you also fetch what is inside that frame.
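
    The content of an <iframe> lives at a separate URL given by its src
    attribute, which is usually relative to the page that embeds it. Below is
    a minimal sketch of resolving such values with Ruby's standard URI.join.
    The helper name iframe_urls and the src values are made up for
    illustration; the real values on mercas are not shown in the question.
    The src values themselves could be collected with Nokogiri, for example
    doc.css('iframe').map { |f| f['src'] }.

    ```ruby
    require 'uri'

    # Hypothetical helper: turn iframe src values into absolute URLs.
    # page_url is the URL of the page that contains the iframes.
    def iframe_urls(page_url, srcs)
      srcs.map { |src| URI.join(page_url, src).to_s }
    end

    page_url = 'https://mercas.hit-u.ac.jp/Campus/Web/UniversityPortal/WFC00010.aspx'
    # Made-up src values, for illustration only.
    srcs = ['Frame.aspx', '/Campus/Web/Other/Frame.aspx']
    puts iframe_urls(page_url, srcs)
    ```

    The resulting URLs would have to be fetched within the same logged-in
    session. With Watir, elements inside the frame can instead be reached
    through browser.iframe.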
