What is a crawler? On crawlers and bypassing website anti-crawling mechanisms
What is a crawler? Put simply (and somewhat one-sidedly), a crawler is a tool that has the computer interact with a server automatically to obtain data. The most basic crawler just fetches the source code of a web page; going deeper, it may POST data to the page and read what the server returns in response to that POST request. In short, a crawler automatically obtains source data; any further processing of that data is follow-up work, and this article focuses on the data-fetching part. Please respect a website's robots.txt file: do not let your crawler break the law, and do not let it cause harm to the site.
A not-entirely-apt example of the crawling and anti-crawling concepts
For many reasons (server load, protecting data, and so on), many websites restrict what crawlers can do.
Think about it: when a person plays the role of the crawler, how do we get the source code of a web page? The most common way is, of course, right-clicking and choosing "View Page Source".
What if the website blocks the right mouse button?
Then we bring out our most useful crawling tool: F12 (discussion welcome).
Press F12 and the developer tools open (funny, I know).
The source code is right there!
With a person acting as the crawler, blocking the right mouse button is the anti-crawling strategy, and F12 is the way around it.
Now for the real anti-crawling strategies
When writing crawlers you will inevitably hit requests that return no data. Often the server is checking the UA (User-Agent) header, a very basic anti-crawling measure; just add a UA header when sending the request. Isn't that simple?
In fact, a simple and blunt approach is to copy over all the request headers the browser sends, needed or not...
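As a minimal sketch of adding a UA header with Python's standard library (the UA string below is copied from a desktop Chrome build; any real browser UA string works):

```python
import urllib.request

# A User-Agent string taken from a desktop browser; without it, some
# servers refuse the request or return empty data.
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/63.0.3239.84 Safari/537.36"),
}
req = urllib.request.Request("https://example.com", headers=headers)
print(req.get_header("User-agent"))
# To actually send the request: urllib.request.urlopen(req)
```

The same idea applies with any HTTP library: pass the browser's headers along with the request.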
Have you noticed that a website's CAPTCHA is also an anti-crawling strategy? To ensure that a site's users are real people, the CAPTCHA has truly made a great contribution. And alongside the CAPTCHA came CAPTCHA recognition.
Speaking of which, I don't actually know which came first, CAPTCHA recognition or image recognition?
Simple CAPTCHAs are now very easy to recognize; there are plenty of tutorials online, covering slightly more advanced concepts such as denoising, binarization, segmentation, and recombination. But website human-machine verification has become more and more formidable, for example:
A brief description of denoising and binarization
It can take a CAPTCHA like the one above and turn it into a clean two-tone version.
This is binarization: reducing the image itself to just two tones. The example is very simple; with Python's PIL library,

from PIL import Image
img = Image.open("captcha.png").convert("1")  # mode "1": 1-bit pixels, black and white
That does the job, but when the picture gets more complicated you have to think a little harder. Take an image like this:
If you just apply the simple method, it becomes:
How should CAPTCHAs like these be recognized? This is where denoising is useful: based on the CAPTCHA's own characteristics, you can work out the RGB value of its background colour and of everything other than the font, turn all of those pixels into one colour, and leave only the font. Example code follows; it just changes colours:
for x in range(0, image.size[0]):
    for y in range(0, image.size[1]):
        # print(arr[x][y])
        if arr[x][y].tolist() == base_color:   # base_color: background RGB, measured from the image
            arr[x][y] = 0
        elif arr[x][y].tolist()[0] in range(200, 256) and \
             arr[x][y].tolist()[1] in range(200, 256) and \
             arr[x][y].tolist()[2] in range(200, 256):
            arr[x][y] = 0                      # near-white noise
        elif arr[x][y].tolist() == [0, 0, 0]:
            arr[x][y] = 0
        else:
            arr[x][y] = 255                    # everything left is the font
Here arr is a NumPy matrix built from the image's RGB values; readers can try improving the code and experimenting for themselves.
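The same threshold logic can be checked on a tiny synthetic "image" without PIL or NumPy (all pixel values below are made up for illustration, and the 2x2 grid stands in for a real CAPTCHA):

```python
# Toy 2x2 image as nested [R, G, B] lists; base_color plays the role of the
# measured background colour of the CAPTCHA.
base_color = [120, 40, 200]
img = [
    [[120, 40, 200], [255, 255, 255]],   # background pixel, near-white noise
    [[10, 10, 10],   [0, 0, 0]],         # font pixel, pure black
]

for row in img:
    for i, px in enumerate(row):
        near_white = all(c in range(200, 256) for c in px)
        if px == base_color or near_white or px == [0, 0, 0]:
            row[i] = 0      # background, noise, and pure black are flattened
        else:
            row[i] = 255    # whatever remains is treated as the font

print(img)  # only the font pixel survives as 255
```

The structure mirrors the NumPy loop above: everything that matches the background, near-white noise, or pure black is collapsed, and the remaining pixels are kept as the font.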
With careful manipulation, the picture can become
The recognition rate is still very high.
As CAPTCHAs developed, there were plain digits and letters, then simple addition, subtraction, multiplication, and division; ready-made wheels for these can be found online. For somewhat harder digit, letter, and Chinese-character CAPTCHAs you can build your own wheel (as above), but anything beyond that is enough to warrant writing an artificial intelligence... (there are actual jobs that consist of recognizing CAPTCHAs...)
One extra tip: some websites have a CAPTCHA on the desktop version but not on the mobile version...
Next topic!
Another common anti-crawling strategy is IP blocking: too many visits in a short period usually gets the IP banned. The fix is simple: limit your request frequency or add an IP proxy pool; of course, going distributed also works...
IP proxy pool -> search on Google or Baidu; there are many proxy sites, and although few of the free proxies are usable, they are free after all.
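A minimal sketch of both ideas, throttling plus a round-robin proxy pool, using only the standard library (the proxy addresses below are placeholders, not real servers):

```python
import itertools
import time
import urllib.request

# Placeholder proxies; in practice these come from a proxy site or your own pool.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080", "http://10.0.0.3:8080"]

class ProxyPool:
    """Hand out a different proxy for each request, round-robin."""
    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)

    def next(self):
        return next(self._cycle)

def throttled_fetch(url, pool, delay=2.0):
    """Wait `delay` seconds, then fetch `url` through the next proxy."""
    time.sleep(delay)  # rate limiting: don't hammer the server
    proxy = pool.next()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    return opener.open(url)  # would fail here, since these proxies are fake

pool = ProxyPool(PROXIES)
print(pool.next(), pool.next(), pool.next(), pool.next())  # wraps around
```

Real pools also need health checks (drop proxies that time out) since most free proxies die quickly, but the rotation idea is the same.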
Another strategy that can also be counted as anti-crawling is asynchronously loaded data; as you get deeper into crawling (obviously it is the websites that keep updating!) you are bound to run into it. The solution is still F12. Take the unnamed NetEase Cloud Music site as an example: right-click to open the source code and try searching for the comments.
Where is the data?! This is the hallmark of asynchronous loading since the rise of JS and Ajax. But open F12, switch to the Network tab, refresh the page, and look carefully: there are no secrets.
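Once you spot the request in the Network tab, you can call that endpoint directly and parse the JSON it returns. A sketch of the parsing step (the payload below is a made-up stand-in; the real endpoint and response structure must be read off the Network tab, and NetEase's actual comment API differs):

```python
import json

# Stand-in for the JSON body an asynchronous endpoint might return;
# in practice this string would come from requesting the XHR URL directly.
raw = '{"comments": [{"user": "alice", "content": "nice song"}], "total": 1}'

data = json.loads(raw)
for c in data["comments"]:
    print(c["user"], "->", c["content"])
```

The point is that asynchronously loaded data usually arrives as clean JSON, which is easier to crawl than scraping it out of HTML.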
Oh, and if you happen to be listening to a song, clicking on the right request even lets you download it...
This is mentioned only to explain how the site is structured; please consciously resist piracy, protect copyright, and protect the interests of original creators.
What if the site blocks every one of these routes? We have one last trick, a powerful combination: Selenium + PhantomJS.
This combination is very strong and can simulate browser behaviour almost perfectly; the specific usage you can search for yourself. I don't recommend this method, as it is very heavyweight, and it is mentioned here only as background.
To sum up
This article has mainly discussed some common anti-crawler strategies (mainly the ones I've come across myself (shrug)), covering HTTP request headers, CAPTCHA recognition, IP proxy pools, and asynchronous loading, and has introduced some simple countermeasures (I can't do the hard ones!), mainly in Python. I hope it points you toward a new path.