What is a crawler? A talk on crawlers and bypassing website anti-crawling mechanisms

Posted: 2017-12-18 17:19:36 Author: Anonymous

What is a crawler? Simply and somewhat one-sidedly put, a crawler is a tool with which a computer automatically interacts with a server to obtain data. The most basic crawler fetches the source code of a web page; going deeper, it POSTs data to the page and obtains whatever the server returns in response to that POST request. In a word, a crawler automatically obtains source data; any further data processing is follow-up work. This article is mainly about the data-acquisition part. Please pay attention to the website's robots.txt file; do not let your crawler break the law, and do not let it cause harm to the website.

An admittedly imperfect example to illustrate the concepts of anti-crawling and anti-anti-crawling

For many reasons (such as server resources, protecting data, etc.), many websites place limits on crawlers.

Think about it: when a person plays the role of the crawler, how do we get the source code of a web page? The most common way is, of course, right-clicking and viewing the page source.

What if the website has blocked the right click?


Take out our most useful crawling tool: F12 (discussion welcome)

Press F12 and it opens (funny)

[screenshot: the page source shown in the developer tools]

The source code is out!!

When a person acts as the crawler, blocking the right click is the anti-crawling strategy, and F12 is the anti-anti-crawling method.

Now let's talk about proper anti-crawling strategies

While writing a crawler you will certainly run into requests that return no data. Often this is the server checking the UA header (User-Agent), a very basic anti-crawling measure; just add a UA header when sending the request. Isn't that simple?

In fact, a simple and crude approach is to just add all of the Request Headers, whether needed or not...
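A minimal sketch, assuming the requests library (the URL and User-Agent string below are placeholders; in practice, copy real headers out of F12's Network tab):

import requests

# Placeholder headers copied from a browser; swap in your own from F12
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0 Safari/537.36",
}
resp = requests.get("https://example.com", headers=headers)
print(resp.status_code, len(resp.text))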

Have you noticed that a website's verification code is also an anti-crawling strategy? To ensure that a site's users are real people, the CAPTCHA has truly made a great contribution. And along with the CAPTCHA came CAPTCHA recognition.

Speaking of which, I wonder which appeared first: CAPTCHA recognition or image recognition?

Simple CAPTCHAs are now very easy to recognize; there are plenty of tutorials online, covering slightly more advanced concepts such as denoising, binarization, segmentation, and recombination. But website human-machine verification has become more and more formidable, such as this:

[image: a modern human-verification challenge]

A brief description of denoising and binarization

Binarization turns a CAPTCHA like this:

[image: the original CAPTCHA]

into this:

[image: the binarized CAPTCHA]

That is, the image itself is reduced to only two tones. The example is very simple; with Python's PIL library:

from PIL import Image
image = Image.open("captcha.png").convert("1")  # "captcha.png" is a placeholder; mode "1" = two tones

That does the job, but once the picture gets more complicated you have to think a little harder; apply the simple approach to a noisier CAPTCHA and the result is still an unreadable mess.
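A slightly smarter first step, as a sketch (the filename and the threshold of 140 are placeholders to tune per CAPTCHA): convert to grayscale first, then binarize with a hand-picked threshold instead of PIL's default.

from PIL import Image

threshold = 140  # placeholder cutoff; tune per CAPTCHA
image = Image.open("captcha.png").convert("L")  # "L" = 8-bit grayscale
binary = image.point(lambda p: 255 if p > threshold else 0)  # two tones, custom cutoff
binary.save("captcha_binary.png")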

How should CAPTCHAs like these be recognized? This is where denoising becomes useful. Based on the characteristics of the CAPTCHA itself, you can work out the base color of the CAPTCHA and the RGB ranges of everything outside the font, turn all of those values into a single color, and leave only the font. Example code follows; it just remaps colors.

import numpy as np
from PIL import Image

image = Image.open("captcha.png").convert("RGB")
arr = np.array(image)  # matrix of the image's RGB values
base_color = [255, 255, 255]  # placeholder: measure your CAPTCHA's actual base color

rows, cols = arr.shape[:2]  # arr is indexed [row][column]
for x in range(rows):
    for y in range(cols):
        pixel = arr[x][y].tolist()
        if pixel == base_color:
            arr[x][y] = 0
        elif pixel[0] in range(200, 256) and pixel[1] in range(200, 256) and pixel[2] in range(200, 256):
            arr[x][y] = 0  # near-white noise pixels
        elif pixel == [0, 0, 0]:
            arr[x][y] = 0
        else:
            arr[x][y] = 255  # everything left is treated as the font

arr is a numpy matrix built from the image's RGB values; readers can try to improve the code and experiment for themselves.

With careful manipulation, the picture can be cleaned up to the point where the recognition rate is quite high.

As CAPTCHAs have developed, there are still plenty with clear digits and letters, or simple addition, subtraction, multiplication, and division; for those, ready-made wheels on the Internet can be used. For harder digits, letters, and Chinese characters, you can build your own wheel (as above). But beyond that, the problem is big enough to call for real artificial intelligence... (there are entire jobs devoted to CAPTCHA recognition...)
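One such off-the-shelf wheel, as a sketch (it assumes the pytesseract package and the underlying Tesseract OCR engine are installed; the filename is a placeholder for the cleaned-up image from above):

from PIL import Image
import pytesseract

# OCR the denoised, binarized CAPTCHA
text = pytesseract.image_to_string(Image.open("captcha_clean.png"))
print(text.strip())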

One extra tip: some websites show a CAPTCHA on the PC side but not on the mobile side...

Next topic!

A more common anti-crawling strategy is IP blocking: usually, too many visits in a short period of time get you blocked. This one is easy to deal with: limit your access frequency, or add an IP proxy pool; of course, going distributed also works...

IP proxy pool -> Google on your left, Baidu on your right; there are many proxy sites, and although not many of the free ones are usable, they are free, after all.
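A minimal sketch of both fixes together, assuming the requests library (the proxy addresses and URLs are placeholders; fill the pool from whatever proxy site you find):

import random
import time
import requests

proxy_pool = [
    "http://111.111.111.111:8080",  # placeholder proxies
    "http://222.222.222.222:3128",
]

for url in ["https://example.com/page/1", "https://example.com/page/2"]:
    proxy = random.choice(proxy_pool)  # rotate proxies between requests
    resp = requests.get(url, proxies={"http": proxy, "https": proxy},
                        headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    print(url, resp.status_code)
    time.sleep(2)  # limit the access frequency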

There is another kind of anti-crawling strategy, if it counts as one: asynchronous data loading. As your crawling gradually goes deeper (obviously, as the websites get updated!), you are bound to run into asynchronous loading. The solution is still F12. Take a certain unnamed NetEase Cloud Music site as an example: right-click to open the source code and try searching for the comments.

[screenshot: the page source, with no comment data anywhere]

Where did the data go?! This is the hallmark of asynchronous loading after the rise of JS and Ajax. But open F12, switch to the Network tab, refresh the page, look carefully, and there are no secrets.

[screenshot: the Network tab exposing the request that carries the comments]
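Once the Network tab reveals the endpoint, you can replay that request directly. A hypothetical sketch (the URL and parameters are placeholders, not NetEase's real API):

import requests

api_url = "https://example.com/api/comments"  # placeholder endpoint found via F12
params = {"id": 123456, "offset": 0, "limit": 20}
resp = requests.get(api_url, params=params,
                    headers={"User-Agent": "Mozilla/5.0"})
print(resp.json())  # asynchronous endpoints usually return JSON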

Oh, and if you happen to be listening to a song, you can find it here and click to download it...

[screenshot: the song file's request in the Network tab]

This is mentioned only to illustrate the website's structure. Please consciously resist piracy, protect copyright, and protect the interests of original creators.

What if the site limits you to death? We have one last trick, a powerful combination: Selenium + PhantomJS.

This combination is very powerful and can simulate browser behavior almost perfectly; Baidu/Google the specifics yourself. I do not recommend this method, it is very heavyweight, and it is included here only as general knowledge.
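For completeness, a minimal sketch of what that combination looked like circa 2017 (it assumes PhantomJS is installed and on the PATH; later Selenium releases deprecated the PhantomJS driver in favor of headless Chrome/Firefox):

from selenium import webdriver

driver = webdriver.PhantomJS()  # deprecated in later Selenium versions
driver.get("https://example.com")  # placeholder URL
print(driver.page_source[:200])  # the fully rendered source, JavaScript executed
driver.quit()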

To sum up

This article has discussed some common anti-crawling strategies (mainly the ones I've come across (shrug)): HTTP request headers, CAPTCHA recognition, IP proxy pools, and asynchronous loading. It introduced some simple methods (I can't do the hard ones!), mainly in Python. I hope it opens up a new path for you.

  • Tags: crawler, website anti-crawling and anti-anti-crawling mechanisms
