
Foreword

A few days ago I added pv statistics to the articles on this site; previously only uv was tracked. I hadn't added pv before because I figured each user reads an article once, and doing a database write on every visit really hurts performance. After all, a visit to the5fire blog only needs to fetch the article from the database (usually from the cache) and return it to the browser, so a write on that path seemed pointless. Even the earlier uv counting writes only once per user within 24 hours.

But then again, for a small site like the5fire blog it wouldn't matter even if every visit wrote to the database a dozen times. The traffic is pitifully small. Still, that's no reason not to aspire to handling billion-level traffic.

If this doesn't make sense to you, go and look at other people's websites, the ones with hundreds of millions or billions of visits, and see how they handle user writes such as leaving a comment.

Meaning of pv

With the reasons out of the way, let's get down to business. Every website has statistics like pv and uv, and often time on page, conversion rates for various kinds of pages and so on. My job at Sohu, to put it plainly, is building websites, and the business metrics we care about are all traffic-related. Having been a webmaster for many years, I also make adjustments based on some of the metrics in Baidu statistics.

But this time it's only about pv: the pv of a single article.

Setting abnormal visits aside, the more people visit an article on the internet, the more valuable that article is. After all, we only click on things we find valuable. This traffic is uv (unique visitors). So what is pv? A well-written article, especially a technical one, may be visited several times by the same person. I, for example, like to bookmark good articles and come back to them when I have time. Each repeat view (page refresh) counts as one pv. An article that readers come back to many times is worth more, so an article's pv/uv ratio is also an indicator of its value, especially in the age of clickbait. (One more digression: clickbait is not a product of today's media age. It already existed in the blogging era; it has just become more concentrated since.)
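To make the two counters concrete, here is a minimal in-memory sketch. The `ArticleStats` class, its method names and the sample visits are made up for illustration, and the 24-hour window simply follows the uv rule described in the foreword; it is not the blog's actual code.

```python
import time

DAY_SECONDS = 24 * 60 * 60  # assumed uv window: count a user once per 24 hours


class ArticleStats:
    """Toy in-memory counter; a real site would persist these numbers."""

    def __init__(self):
        self.pv = 0
        self.uv = 0
        self._last_counted = {}  # user_id -> timestamp of the visit last counted as uv

    def record_view(self, user_id, now=None):
        now = time.time() if now is None else now
        self.pv += 1  # every page load counts as a pv
        last = self._last_counted.get(user_id)
        if last is None or now - last >= DAY_SECONDS:
            self.uv += 1  # first visit from this user inside the window
            self._last_counted[user_id] = now

    def pv_per_uv(self):
        return self.pv / self.uv if self.uv else 0.0


stats = ArticleStats()
for user_id, ts in [("alice", 0), ("alice", 60), ("bob", 120), ("alice", 180)]:
    stats.record_view(user_id, now=ts)
print(stats.pv, stats.uv, stats.pv_per_uv())  # 4 2 2.0
```

Four page loads from two users give a pv/uv ratio of 2.0; the higher the ratio, the more often readers come back.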

But it isn't enough to talk only about value. Didn't the ancients ask how many measures of rice value can be traded for? (I made that up.)

For every current news site and media platform, pv equals ¥. More traffic means more income, whether it comes from advertising or from directing traffic to other channels. Sometimes I wonder whether the goal of it all is really to understand users better and push them what they want to see. Maybe it is. But one thing can't be avoided: a business model is being built in which advertisers and investors pay for the time users stay, so users are made to stay on the platform longer and spend more time there. (Purely a personal opinion; judge for yourself and think it through.)

So it seems that pv has become attractive.

Ways to collect statistics

For websites, these are the ways of counting pv and uv that the5fire knows of:

1. the5fire's early approach: every time a user visits an article, the article's pv + 1 and uv + 1. Crude but simple.
2. What the5fire blog does now: write a distributed task service and call it from the business code.
3. Tracking points embedded in the page: a tag or a referenced JS file sends data to the statistics server.
4. Collect the nginx access log (if you use nginx). The log format needs to be customized, at least to add user_id; statistics and summaries are then computed offline.

The first two are heavyweight implementations: code has to be inserted into specific pages. The last two are similar, since in essence both collect nginx logs, but they collect at different stages. With the third, nginx only receives the log request after the page has fully opened. With the fourth, a visit is counted as soon as the page is requested and upstream returns status code 200, even if the end user never actually sees the page.

In short, each has its advantages and disadvantages, and they can be cross-checked against each other.
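As a rough illustration of the offline log approach, the sketch below counts pv and uv per path from access-log lines. The log format (ip, time, request, status, then `user_id=...`) is a hypothetical custom format invented for this example, and uv here just means distinct user ids within the lines passed in; adapt the regular expression to whatever `log_format` you actually configure.

```python
import re
from collections import defaultdict

# Hypothetical custom log line: ip [time] "METHOD path HTTP/x" status user_id=<id>
LINE_RE = re.compile(
    r'^\S+ \[[^\]]+\] "(?:GET|HEAD) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) user_id=(?P<user_id>\S+)$'
)


def summarize(lines):
    """Return {path: (pv, uv)} for requests answered with status 200."""
    pv = defaultdict(int)
    users = defaultdict(set)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None or match.group("status") != "200":
            continue  # skip unparseable lines and non-200 responses
        path = match.group("path")
        pv[path] += 1
        users[path].add(match.group("user_id"))
    return {path: (pv[path], len(users[path])) for path in pv}


sample = [
    '1.2.3.4 [10/Oct/2017:13:55:36 +0800] "GET /blog/1/ HTTP/1.1" 200 user_id=u1',
    '1.2.3.4 [10/Oct/2017:13:56:01 +0800] "GET /blog/1/ HTTP/1.1" 200 user_id=u1',
    '5.6.7.8 [10/Oct/2017:13:57:12 +0800] "GET /blog/1/ HTTP/1.1" 200 user_id=u2',
    '5.6.7.8 [10/Oct/2017:13:58:40 +0800] "GET /blog/2/ HTTP/1.1" 404 user_id=u2',
]
print(summarize(sample))
```

Nothing here runs on the request path, which is the appeal of this approach; the cost is that the numbers are only as fresh as the last log run.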

How the blog does it

As mentioned above, the blog mainly relies on Celery, a distributed task queue, which is fairly simple to use with Django.

To use Celery with Django, the Celery runtime must be able to import the modules of the Django project, so the settings module has to be specified first. I use Django 1.11. Add celery.py to the same directory as wsgi.py, with the following code:

# coding: utf-8
from __future__ import absolute_import, unicode_literals
import os
from celery import Celery

# I split settings.py into develop.py and product.py
profile = os.environ.get("django_selfblog_profile", "develop")
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "django_selfblog.settings.%s" % profile)

app = Celery("selfblog", broker="redis://127.0.0.1:6666/2")
# Read every CELERY_-prefixed option from the Django settings module.
app.config_from_object("django.conf:settings", namespace="CELERY")
# Load task modules from all registered Django app configs.
app.autodiscover_tasks()

Here Redis is used as the broker instead of RabbitMQ, even though it is not the officially recommended choice. Redis is already my main cache, and I didn't want to introduce another system that needs maintenance.

With the startup file defined, the next step is to write the actual tasks in app/tasks.py:

# coding: utf-8
from __future__ import unicode_literals
from django.db.models import F
from django_selfblog.celery import app
from .models import Post


@app.task
def increase_pv(post_id):
    # F("pv") + 1 is computed by the database, not in Python.
    return Post.objects.filter(id=post_id).update(pv=F("pv") + 1)


@app.task
def increase_uv(post_id):
    return Post.objects.filter(id=post_id).update(uv=F("uv") + 1)

Then call the tasks at the appropriate place in the article view in views.py:

from .tasks import increase_pv, increase_uv
# ... context omitted
increase_pv.delay(self.post.id)
increase_uv.delay(self.post.id)

This way, the pv and uv bookkeeping for each visit is handed to the distributed task queue and does not slow down the visit itself.

If you want to check the execution status of a task, for example via:
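The effect of `increase_pv.delay(...)` can be pictured with a toy stand-in built only from the standard library. This is not how Celery works internally (there is no broker, no serialization, no separate process); the `handle_request` and `worker` names are invented here to show that the request handler only enqueues the work and returns.

```python
import queue
import threading

counters = {"pv": 0}
tasks = queue.Queue()


def worker():
    # Background consumer, playing the role of the Celery worker process.
    while True:
        post_id = tasks.get()
        if post_id is None:  # sentinel: stop the worker
            break
        counters["pv"] += 1  # the write happens off the request path
        tasks.task_done()


def handle_request(post_id):
    tasks.put(post_id)  # comparable to increase_pv.delay(post_id): enqueue and move on
    return "article %s" % post_id  # the response does not wait for the write


thread = threading.Thread(target=worker)
thread.start()
responses = [handle_request(1) for _ in range(3)]
tasks.join()     # demo only: wait until the queued writes have been processed
tasks.put(None)  # ask the worker to exit
thread.join()
print(responses, counters)
```

In the real setup the queue lives in Redis and the consumer is a separate Celery worker, so a slow or failing write never delays the page.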

r = increase_pv.delay(self.post.id)
print(r.ready())

To check whether a task has finished like this, you need to bring in django-celery-results. The steps are:

1. pip install django-celery-results
2. Add django_celery_results to INSTALLED_APPS.
3. Set CELERY_RESULT_BACKEND = "django-db" or "django-cache".
4. If django-db is configured, the results are stored in the database, so run python manage.py migrate django_celery_results to create the table.

Once this configuration is done, what remains is deployment. Every time the5fire blog's code is updated, it is redeployed to the server through fabric with fab re_deploy:master. Since Celery's code lives in the blog's codebase, adding Celery only requires one more entry in the supervisord configuration.

The added supervisord configuration:

[program:celery]
command=celery -A selfblog worker -P gevent --loglevel=info --concurrency=5
directory=/home/the5fire/selfblog/
process_name=%(program_name)s_%(process_num)d
umask=022
startsecs=0
stopwaitsecs=0
redirect_stderr=true
stdout_logfile=/tmp/log/celery_%(process_num)02d.log
numprocs=1
numprocs_start=1
environment=django_selfblog_profile=product

So every time the code is redeployed, the Celery process restarts as well.

Django tips

In a Django project, the ORM is where most of the performance is lost. If you are not familiar with it, it is easy to fall into a pit.

Take increasing pv as an example. Each time a user visits an article, the pv field is incremented by 1. In code:

# Never write code like this
post = Post.objects.get(pk=post_id)
post.pv = post.pv + 1
post.save()

This is the easiest way. In most cases, though, the article a user requests comes from the cache, since there is no need to go to the database for it every time. So what do we do about the counter? The intuitive way is to get the post, add 1 and save, as above. But that has a race condition.

Say 100 people visit an article at the same time, and several threads or processes are handling the requests. All of them may execute post = Post.objects.get(pk=post_id) at the same moment. If the article's pv in the database is 100, then post.pv is 100 in every one of them, and after all of them execute post.save() the result is 101. In other words, one hundred concurrent visits may increment pv by only one.
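The interleaving described above can be replayed deterministically without any threads. In this sketch a plain dict stands in for the database row, and the `stale_writes` function is invented for illustration: every "request" reads pv before any of them writes back.

```python
def stale_writes(start, visits):
    """Replay the worst case: all requests read first, then all save."""
    db = {"pv": start}
    snapshots = [db["pv"] for _ in range(visits)]  # every request reads the same value
    for seen in snapshots:
        db["pv"] = seen + 1  # each one saves its stale value + 1
    return db["pv"]


print(stale_writes(100, 100))  # 101, although 100 visits happened
```

Real traffic rarely lines up this perfectly, so the loss is usually smaller, but any overlap between a read and another request's save drops a count.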

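There are two ways to solve this problem.

First, locking. As far as I know Django does not provide a ready-made lock for this (the closest built-in is select_for_update() row locking inside a transaction), so you would have to do it yourself, and hardly anyone does. Second, let MySQL perform the increment itself, which is what I used above.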

For the second method, how is it done in Django? Translated into SQL it is:

UPDATE `blog_post` SET `pv` = (`blog_post`.`pv` + 1) WHERE `blog_post`.`id` = <post_id>;

The Django code is: Post.objects.filter(id=post_id).update(pv=F("pv") + 1). For F expressions, please refer to the official Django documentation.
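The same single-statement update can be tried without Django or MySQL. The sketch below uses the standard library's sqlite3 with an in-memory table named after the blog's `blog_post`; the two-column schema is an assumption made up for the demo, not the blog's real model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blog_post (id INTEGER PRIMARY KEY, pv INTEGER NOT NULL)")
conn.execute("INSERT INTO blog_post (id, pv) VALUES (1, 100)")


def increase_pv(conn, post_id):
    # The database computes pv + 1 itself; no value is read into Python first.
    cur = conn.execute("UPDATE blog_post SET pv = pv + 1 WHERE id = ?", (post_id,))
    conn.commit()
    return cur.rowcount  # rows changed, similar to the count QuerySet.update() returns


for _ in range(100):
    increase_pv(conn, 1)
pv = conn.execute("SELECT pv FROM blog_post WHERE id = 1").fetchone()[0]
print(pv)  # 200
```

Because the read and the write happen inside one statement, there is no window in which another request can save a stale value.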

To sum up
