
Last week I wrote a small introductory Node + Express crawler. Today, let's learn some more and write crawler version 2.0.

This time we are no longer crawling the blog garden; let's try something new and crawl Movie Heaven, because every weekend I download a movie from Movie Heaven to watch.

Talk is cheap. Show me the code!

Crawl page analysis

Our goals:

1. Grab the Movie Heaven homepage and get the 169 links to the latest movies in the left column.

2. Grab the Thunder download links for those 169 new movies, fetching them concurrently and asynchronously.

The specific analysis is as follows:

1. We don't need to grab every Thunder link, just the latest released movies, for example the left column below. There are 170 links in total; excluding the first one (which by itself contains 200 movies), that leaves 169 movies.

2. After grabbing the homepage and collecting those links, we then grab the Thunder download link for each movie.

Environment setup

1. Things needed: the Node environment, express, and cheerio were all introduced in the previous article, so they won't be covered again here: click to view.

2. New things to install:
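All of them can be installed with npm, for example: npm install superagent superagent-charset async eventproxy --save.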

superagent:

Role: similar to request; we can use it to make GET/POST and other requests and set the related request headers. Compared with the built-in modules, it is much simpler.

Usage:

var superagent = require("superagent");
superagent
  .get("/some-url")
  .end(function (err, res) {
    // do something
  });

superagent-charset:

Role: solves encoding problems. Because Movie Heaven's pages are encoded in gb2312, the crawled Chinese text would otherwise be garbled.

Usage:

var superagent = require("superagent");
var charset = require("superagent-charset");
charset(superagent);
superagent
  .get("/some-url")
  .charset("gb2312") // set the encoding here
  .end(function (err, res) {
    // do something
  });

async:

Role: async is a flow-control toolkit that provides straightforward and powerful asynchronous utilities. Here it is used to control concurrency.

Usage: what we need here is async.mapLimit(arr, limit, iterator, callback).

mapLimit initiates multiple asynchronous operations at the same time, up to the limit; whenever one of them calls back, the next one is started, and the final callback runs once they have all finished.

arr is the array to iterate over, limit is the maximum number of concurrent operations, each item in arr is passed to the iterator in turn for execution, and the execution results are collected and passed to the final callback.
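A minimal sketch of how it behaves (this example is not from the original post; the items array and the setTimeout are just stand-ins for real asynchronous work such as HTTP requests):

var async = require("async");

var items = [1, 2, 3, 4, 5];

// At most 2 tasks run at the same time; the final callback receives
// the results in the same order as the input array.
async.mapLimit(items, 2, function (item, callback) {
  setTimeout(function () {
    callback(null, item * 10); // pretend this is the result of the async task
  }, 100);
}, function (err, results) {
  console.log(results); // [ 10, 20, 30, 40, 50 ]
});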

eventproxy:

Role: eventproxy acts as a counter. It helps you track whether a group of asynchronous operations has completed; once they are all done, it automatically calls the handler function you provided, passing the collected data as arguments.

For example, I first crawl the links in the sidebar of the Movie Heaven homepage, and only then can the content behind those links be crawled. For details on eventproxy, click here.

Usage:

var eventproxy = require("eventproxy");
var fs = require("fs");
var ep = new eventproxy();
ep.after("got_file", files.length, function (list) {
  // executed after all the files have been read asynchronously
  // the contents of all the files are stored in the list array
});
for (var i = 0; i < files.length; i++) {
  fs.readFile(files[i], "utf-8", function (err, content) {
    // trigger the result event
    ep.emit("got_file", content);
  });
}
// Note: the two "got_file" names must match

Start crawling

The main program is in app.js, so when reading the code, focus mainly on app.js.

1. First define some global variables and import the required libraries:

var cheerio = require("cheerio");             // provides a jQuery-like API for the page
var charset = require("superagent-charset");  // solves the garbled-encoding problem
var superagent = require("superagent");       // issues the requests
charset(superagent);
var async = require("async");                 // asynchronous fetching / concurrency control
var express = require("express");
var eventproxy = require("eventproxy");       // flow control
var ep = new eventproxy();
var app = express();
var baseurl = "http://www.dytt8.net";         // Movie Heaven homepage link
var newmovielinkarr = [];                     // URLs of the new movies
var errlength = [];                           // counts the links that errored
var highscoremoviearr = [];                   // high-rated movies

2. Start by crawling the Movie Heaven homepage:

// First grab the Movie Heaven homepage
(function (page) {
  superagent
    .get(page)
    .charset("gb2312")
    .end(function (err, sres) {
      // general error handling
      if (err) {
        console.log("An error occurred while fetching " + page);
        return next(err); // next/res are assumed to come from the enclosing express route handler (not shown here)
      }
      var $ = cheerio.load(sres.text);
      // 170 movie links; pay attention to deduplication
      getallmovielink($);
      highscoremovie($);
      /*
       * Flow-control statement:
       * after the links on the left side of the homepage have been crawled,
       * we start crawling the detail pages
       */
      ep.emit("get_topic_html", "get " + page + " successful");
    });
})(baseurl);

Here we first grab the homepage and pass the fetched page content to the two functions getallmovielink and highscoremovie for processing.

getallmovielink collects the 169 movies in the left column, skipping the first link.

highscoremovie handles the first link in the left column; the movies behind it all have high ratings.

In the code above we set up a counter: once the homepage has been fetched, we emit the event named "get_topic_html", which guarantees that the detail pages are only crawled after the first page has been crawled.

ep.emit("get_topic_html", "get " + page + " successful");

The highscoremovie method is shown below. It actually doesn't play a big role here; I just collect some statistics about the high-rated movies page and was too lazy to crawl any further.

// More than 200 movies with a score of 8 or higher! This is just statistical data; we don't crawl further.
function highscoremovie($) {
  var url = "http://www.dytt8.net" + $(".co_content2 ul a").eq(0).attr("href");
  console.log(url);
  superagent
    .get(url)
    .charset("gb2312")
    .end(function (err, sres) {
      // general error handling
      if (err) {
        console.log("An error occurred while fetching " + url);
        return;
      }
      var $ = cheerio.load(sres.text);
      var elemp = $("#zoom p");
      var elema = $("#zoom a");
      for (var k = 1; k < elemp.length; k++) {
        var hurl = elemp.eq(k).find("a").text();
        if (highscoremoviearr.indexOf(hurl) === -1) {
          highscoremoviearr.push(hurl);
        }
      }
    });
}

3. Extract the links in the left column.

As shown below, on the homepage the links to the detail pages live under $(".co_content2 ul a").

So we traverse the detail-page links in the left column and store them in the newmovielinkarr array.

The getallmovielink method is as follows:

// Get all the links in the left column of the homepage
function getallmovielink($) {
  var linkelem = $(".co_content2 ul a");
  for (var i = 1; i < 170; i++) {
    var url = "http://www.dytt8.net" + linkelem.eq(i).attr("href");
    // pay attention to deduplication
    if (newmovielinkarr.indexOf(url) === -1) {
      newmovielinkarr.push(url);
    }
  }
}

4. Crawl each movie detail page we collected and extract the useful information, such as the movie's download link, which is what we care about.

// ep waits for the "get_topic_html" event; once it has been emitted the specified number of times (here, once), the callback runs
ep.after("get_topic_html", 1, function (eps) {
  var concurrencycount = 0;
  var num = -4; // because there are 5 concurrent requests, start 4 below zero
  // fetch a single URL and return its result through the callback;
  // mapLimit collects the results of all calls into one array
  var fetchurl = function (myurl, callback) {
    var fetchstart = new Date().getTime();
    concurrencycount++;
    num += 1;
    console.log("Current concurrency:", concurrencycount, ", now crawling", myurl);
    superagent
      .get(myurl)
      .charset("gb2312") // solve the encoding problem
      .end(function (err, ssres) {
        if (err) {
          callback(err, myurl + " error happened!");
          errlength.push(myurl);
          return next(err); // next/res are assumed to come from the enclosing express route handler (not shown here)
        }
        var time = new Date().getTime() - fetchstart;
        console.log("Crawled " + myurl + " successfully, took " + time + " milliseconds");
        concurrencycount--;
        var $ = cheerio.load(ssres.text);
        // process the fetched page
        getdownloadlink($, function (obj) {
          res.write("<br />");
          res.write(num + ", movie name --> " + obj.moviename);
          res.write("<br />");
          res.write("Thunder download link --> " + obj.downlink);
          res.write("<br />");
          res.write("Detail link --> <a href='" + myurl + "' target='_blank'>" + myurl + "</a>");
          res.write("<br />");
          res.write("<br />");
        });
        var result = {
          movielink: myurl
        };
        callback(null, result);
      });
  };
  // limit the maximum concurrency to 5 and collect the whole result array in the final callback
  // mapLimit(arr, limit, iterator, [callback])
  async.mapLimit(newmovielinkarr, 5, function (myurl, callback) {
    fetchurl(myurl, callback);
  }, function (err, result) {
    // callback after the crawl finishes; here we print some statistics
    console.log("Crawling finished, collected a total of " + newmovielinkarr.length + " entries");
    console.log("Errors: " + errlength.length + " entries");
    console.log("High-rated movies: " + highscoremoviearr.length);
    return false;
  });
});

First, async.mapLimit fetches all the detail pages concurrently, with a concurrency limit of 5. Crawling a detail page works essentially the same way as crawling the homepage, so I won't go into much detail here; the useful information is then printed to the page.
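The getdownloadlink helper called above is not shown in this post (the full version is in the GitHub repo linked at the end). A rough, hypothetical sketch of what it does, assuming the movie name sits in the page's .title_all h1 element and the Thunder link is the first anchor inside #Zoom (both selectors are assumptions, not confirmed by the post):

function getdownloadlink($, callback) {
  var moviename = $(".title_all h1").text();       // assumed selector for the movie name
  var downlink = $("#Zoom a").eq(0).attr("href");  // assumed selector for the Thunder download link
  callback({
    moviename: moviename,
    downlink: downlink
  });
}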

5. The output after running the command looks like this:

Browser interface:

With that, a slightly upgraded version of our crawler is complete. The article may not be entirely clear, but I have uploaded the code to GitHub, and running it yourself makes it easier to understand. If I have time later there may be another upgraded version of the crawler, for example storing the crawled information in MongoDB and displaying it on a separate page, or adding a timer so the crawler runs on a schedule.
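For the timer idea, a rough sketch might look like the following (startcrawl is a hypothetical wrapper around the crawling code above, and the interval is arbitrary):

function startcrawl() {
  // ... fetch the homepage, then the detail pages, as shown above ...
}

startcrawl();                                  // run once at startup
setInterval(startcrawl, 24 * 60 * 60 * 1000); // then run again once a day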

Note: if Chinese characters appear garbled in the browser, setting the browser's (e.g. Chrome's) encoding to UTF-8 solves it.

Code address: https://github.com/xianyulaodi/myspider2
