
I need test data for my ASP.NET MVC 3 project that imitates the cnblogs enterprise system, and I'm too tired to type it all in myself, so I grabbed some list data from cnblogs (the "blog garden"). I hope dudu won't blame me for it.

The crawl relies on regular expressions, so friends who are not familiar with them may want to consult some reference material first. Regular expressions are actually quite easy to pick up; what takes time is applying them to concrete examples.

Below I describe the process of capturing the cnblogs list data. If anyone has a better approach, I'd welcome the suggestion.

To fetch data with regular expressions, we first need a pattern that matches the target markup. I recommend The Regulator, a regular expression testing tool: we can use it to piece together the pattern we want, then drop the finished pattern into our program.

I found that the cnblogs homepage list can be accessed directly by URL (the base address appears in the code below). This way we can get each page's data straight from the URL instead of simulating a click on the "next page" button, which is much more convenient. Since my goal is only to grab some test data, I kept it simple.

1. The first step is to write a SQL helper class; I believe most programmers have one of these to hand. It is nothing more than the usual insert, delete, update, and query operations. Once the SqlHelper class is in place, we can start on the logic for fetching the data.
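The original post does not show the helper's internals, so here is a minimal sketch of what it might look like. Everything in it is an assumption: the connection-string name BlogDb, the table blog_Post, and its columns are hypothetical placeholders, shown only so the later code has something concrete to call.

using System.Configuration;
using System.Data.SqlClient;

public class SqlHelper
{
    // Hypothetical connection-string name; adjust to your web.config
    private static readonly string ConnStr =
        ConfigurationManager.ConnectionStrings["BlogDb"].ConnectionString;

    // Inserts one captured post; table and column names are assumptions
    public void Insert(string title, string content, int categoryId, string linkUrl)
    {
        const string sql =
            "INSERT INTO blog_Post (Title, Content, CategoryId, LinkUrl) " +
            "VALUES (@Title, @Content, @CategoryId, @LinkUrl)";

        using (var conn = new SqlConnection(ConnStr))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@Title", title);
            cmd.Parameters.AddWithValue("@Content", content);
            cmd.Parameters.AddWithValue("@CategoryId", categoryId);
            cmd.Parameters.AddWithValue("@LinkUrl", linkUrl);
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}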

2. Create a BlogRegexController

using System.Web.Mvc;

public class BlogRegexController : Controller
{
    public void ExecuteRegex()
    {
        // Base address of the cnblogs home list; page N lives at /pN
        string strBaseUrl = "http://www.cnblogs.com/p";
        // The cnblogs home list is limited to 200 pages, so walk all of them
        for (int i = 1; i <= 200; i++)
        {
            string strUrl = strBaseUrl + i.ToString();
            BlogRegex blogRegex = new BlogRegex(); // The class that actually crawls cnblogs
            string result = blogRegex.SendUrl(strUrl);
            blogRegex.AnalysisHtml(result);
            Response.Write("Get success");
        }
    }

    //
    // GET: /BlogRegex/
    public ActionResult Index()
    {
        ExecuteRegex();
        return View();
    }
}

The ExecuteRegex() method in the controller drives the capture of the cnblogs list data.

3. Next is the BlogRegex class it uses, which is responsible for grabbing the cnblogs list data and inserting it into the database.

using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

public class BlogRegex
{
    // Inserts one record into the database via the SqlHelper class
    // (the default category id was lost in the source text; 0 is a placeholder)
    public void Insert(string title, string content, string linkUrl, int categoryId = 0)
    {
        SqlHelper helper = new SqlHelper();
        helper.Insert(title, content, categoryId, linkUrl);
    }

    /// <summary>
    /// Requests the given URL and returns the raw HTML of the page
    /// </summary>
    /// <param name="strUrl"></param>
    /// <returns></returns>
    public string SendUrl(string strUrl)
    {
        try
        {
            WebRequest webRequest = WebRequest.Create(strUrl);
            using (WebResponse webResponse = webRequest.GetResponse())
            using (StreamReader reader = new StreamReader(webResponse.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }
        catch (Exception)
        {
            throw; // rethrow without resetting the stack trace
        }
    }

    /// <summary>
    /// Analyzes the HTML and parses the specific data out of it
    /// </summary>
    /// <param name="htmlContent"></param>
    public void AnalysisHtml(string htmlContent)
    {
        // The pattern I pieced together in The Regulator; note the escape characters.
        // (The pattern was garbled in the source text; this is a reconstruction of
        // the cnblogs post_item markup it targets.)
        string strPattern = "<div\\s*class=\"post_item\">\\s*.*\\s*.*\\s*.*\\s*.*\\s*.*\\s*.*\\s*.*\\s*"
            + "<div\\s*class=\"post_item_body\">\\s*<h3><a\\s*class=\"titlelnk\"\\s*href=\"(?<href>.*)\"\\s*target=\"_blank\">(?<title>.*)</a></h3>.*\\s*"
            + "<p\\s*class=\"post_item_summary\">\\s*(?<content>.*)\\s*</p>";
        Regex regex = new Regex(strPattern, RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.CultureInvariant);
        if (regex.IsMatch(htmlContent))
        {
            MatchCollection matchCollection = regex.Matches(htmlContent);
            foreach (Match match in matchCollection)
            {
                string title = match.Groups["title"].Value;     // Title of the list entry
                string content = match.Groups["content"].Value; // Summary text
                string linkUrl = match.Groups["href"].Value;    // Link to the post
                Insert(title, content, linkUrl);                // Insert into the database
            }
        }
    }
}
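Before pointing the crawler at all 200 pages, it can save time to sanity-check the pattern in isolation. Below is a minimal sketch of such a check; the HTML fragment is my hand-written assumption of what a cnblogs list item looks like (only the tail half of the pattern is exercised), not a capture of the live page.

using System;
using System.Text.RegularExpressions;

class PatternCheck
{
    static void Main()
    {
        // The tail of the pattern from AnalysisHtml, under test on its own
        string strPattern = "<div\\s*class=\"post_item_body\">\\s*<h3><a\\s*class=\"titlelnk\"\\s*href=\"(?<href>.*)\"\\s*target=\"_blank\">(?<title>.*)</a></h3>.*\\s*<p\\s*class=\"post_item_summary\">\\s*(?<content>.*)\\s*</p>";

        // Hand-written fragment shaped like a cnblogs list item (an assumption)
        string sample =
            "<div class=\"post_item_body\">\n" +
            "<h3><a class=\"titlelnk\" href=\"http://www.cnblogs.com/example/1.html\" target=\"_blank\">Sample title</a></h3>\n" +
            "<p class=\"post_item_summary\">Sample summary text</p>";

        Match m = Regex.Match(sample, strPattern, RegexOptions.IgnoreCase | RegexOptions.Multiline);
        Console.WriteLine(m.Success);                 // True if the pattern still fits
        Console.WriteLine(m.Groups["title"].Value);   // Sample title
        Console.WriteLine(m.Groups["content"].Value); // Sample summary text
    }
}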

4. With the code above we can easily pull the test data we need out of cnblogs. It is convenient, it is real data, and it is much faster than entering it by hand.

Strictly speaking, regular expressions are not really a language, only a syntax, because practically every language, C# and JavaScript included, has good support for them; the dialects differ only slightly. As long as we can piece the pattern together correctly, we can grab content from pretty much any website. A while ago I tried crawling Taobao data: I captured several million records in total, and I suspect there were many more I never reached. You have to admire Taobao; the volume of data is enormous.

Coming back to the C# we are using, it has very good support for regular expressions: Regex (in the System.Text.RegularExpressions namespace) is the class for working with them, and every regex operation goes through it.
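As a toy illustration of the class, unrelated to the crawl, here is how named groups pull fields out of text:

using System;
using System.Text.RegularExpressions;

class RegexDemo
{
    static void Main()
    {
        // (?<name>...) captures into a named group, retrieved via Groups["name"]
        var regex = new Regex(@"(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})");
        Match m = regex.Match("Posted on 2011-09-21 by someone");
        if (m.Success)
        {
            Console.WriteLine(m.Groups["year"].Value);  // 2011
            Console.WriteLine(m.Groups["month"].Value); // 09
        }
    }
}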

If you are not too familiar with regular expressions, there is a well-known 30-minute regular expression tutorial online that is very well written and worth a read. Pair it with a regex testing tool and I believe you can grab whatever content you want.

Piecing a pattern together can take a long time; after all, you have to analyze the HTML structure before you can pull content out of it. I hope everyone can stay patient, because as long as the pattern is spliced correctly, you will crawl the right content.
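One small tip on the escape-character problem noted in the code comment earlier: a C# verbatim string (prefixed with @) keeps backslashes single, which makes long patterns far easier to splice. Both lines below produce the same pattern:

// Regular string: every backslash doubled, quotes escaped as \"
string pattern1 = "<div\\s*class=\"post_item\">";

// Verbatim string: \ stays single; quotes are escaped by doubling them
string pattern2 = @"<div\s*class=""post_item"">";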

To head off anyone saying I'm all talk, here is the cnblogs homepage content I crawled. Since the homepage data keeps updating, you can verify that these records appear in the same order on cnblogs itself.

cnblogs shows 20 posts per page and the home list has 200 pages, so there are 4,000 records in total, and the data was fetched correctly.

As I have said before, a programmer who only knows how to write code is not necessarily a qualified programmer. Programmers should reduce their workload wherever possible; we are supposed to be smart people, after all. So we should actively learn frameworks and techniques that help with our work, such as IoC containers, Entity Framework, or NHibernate, to lighten the burden of developing and maintaining code. When we hear that the requirements have changed yet again, the usual sequence is anger, then grumbling, then finally making the change. If some frameworks can help us and keep us in a good mood while maintaining the code, why not use them?

One last word: because I want to build a simple website (MVC 3) that imitates cnblogs, I am preparing the various techniques it will need and writing them up in advance, to sort out what will be used and speed up the development later.