The Most Complete Guide to Anti-Crawler Techniques

King

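I. Controlling access via User-Agent:

Whether the client is a browser or a crawler program, every network request it sends to a server carries a set of headers. Here, for example, are the request headers for Zhihu: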

Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch, br
Accept-Language:zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4,da;q=0.2,la;q=0.2
Cache-Control:max-age=0
Connection:keep-alive
Cookie: **********
Host:zhuanlan.zhihu.com
Referer:Ehco - 知乎
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36

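Most of these fields are how the browser "identifies itself" to the server.
For a crawler program, the field that needs the most attention is User-Agent.
Many websites maintain a User-Agent whitelist, and only requests whose User-Agent falls within the normal range are served normally.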
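To make the whitelist idea concrete, here is a minimal sketch of the kind of check a server might run on each request. The function name `is_allowed_user_agent` and the token list are hypothetical, chosen only for illustration, and the `python-requests/2.13.0` string is an example value; real sites keep their own, usually stricter, rules.

```python
# Hypothetical whitelist: substrings that mainstream browsers
# include in their User-Agent string.
ALLOWED_UA_TOKENS = ("Chrome/", "Firefox/", "Safari/", "MSIE ", "Trident/")


def is_allowed_user_agent(headers):
    """Return True if the request headers carry a browser-like User-Agent."""
    # Header names are case-insensitive, so normalise the keys first.
    normalised = {name.lower(): value for name, value in headers.items()}
    user_agent = normalised.get("user-agent", "")
    return any(token in user_agent for token in ALLOWED_UA_TOKENS)


browser = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/57.0.2987.133 Safari/537.36"
}
script = {"User-Agent": "python-requests/2.13.0"}

print(is_allowed_user_agent(browser))  # True
print(is_allowed_user_agent(script))   # False
print(is_allowed_user_agent({}))       # False: no User-Agent at all
```

A substring check like this is easy to satisfy by copying a browser string, which is why it is usually only the first line of defence.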

Take Zhihu, for example:

import requests
import bs4
import random

def get_html(url):
    try:
        r = requests.get(url, timeout=30)
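        # No raise_for_status() call here, so an error page (such as the 500 below) is still returned as text
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "Something Wrong!"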
        
print(get_html('https://zhuanlan.zhihu.com'))

# OUT：
'''
<html><body><h1>500 Server Error</h1>
An internal server error occured.
</body></html>

'''

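As you can see, the request was rejected and the server returned a 500 error.
The cause lies in the headers that the requests library uses by default. Printing the headers of the response (r.headers) gives:

{'Date': 'Tue, 09 May 2017 12:13:00 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Set-Cookie': 'aliyungf_tc=AQAAAPDDXQnf6AEAHaBXcP1tHo5z1uta; Path=/; HttpOnly, acw_tc=AQAAAAM89GeptQMAHaBXcJiyTK3l8c5g; Path=/; HttpOnly', 'Cache-Control': 'no-cache'}

These are the headers the server sent back, not the ones requests sent. On the request side, requests identifies itself with a default User-Agent of the form python-requests/<version> rather than a browser string, so Zhihu's server naturally does not accept it.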
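HTTP client libraries generally announce themselves in the User-Agent header unless told otherwise. The sketch below uses the standard library's `urllib.request` instead of requests, purely so that it runs without third-party packages and without sending anything over the network; requests behaves analogously with its own `python-requests/<version>` default.

```python
import urllib.request

# build_opener() creates the object urllib uses to send requests;
# its addheaders list holds the headers added to every request by default.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)
print(default_headers)  # the User-agent value starts with "Python-urllib/"

# Setting the header explicitly on a Request object. urllib stores the
# header name capitalised as "User-agent". No request is sent here.
req = urllib.request.Request(
    "https://zhuanlan.zhihu.com",
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/57.0.2987.133 Safari/537.36"
    },
)
print(req.get_header("User-agent"))
```

A server that filters on User-Agent sees the library's default value in the first case and the browser string in the second.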

Solution:

You can set the user-agent yourself or, better still, pick one at random from a list of valid user-agents. The code is as follows:

def get_agent():
    '''
    Simulate the user-agent field of the request headers and
    return a dict holding one randomly chosen user-agent.
    '''
    agents = ['Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
              'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
              'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
              'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
              'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)']
    fakeheader = {}
    # random.choice avoids the IndexError that random.randint(0, len(agents))
    # can raise, because randint includes its upper bound
    fakeheader['User-agent'] = random.choice(agents)
    return fakeheader

# Note the new request function:

def get_html(url):
    try:
        r = requests.get(url, timeout=30, headers=get_agent())
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.status_code
    except requests.RequestException:
        return "Something Wrong!"

'''
OUT:
200
'''

def get_proxy():
    '''
    Simulate a simple proxy pool and
    return a dict holding one randomly chosen proxy.
    '''
    proxy = ["http://116.211.143.11:80",
             "http://183.1.86.235:8118",
             "http://183.32.88.244:808",
             "http://121.40.42.35:9999",
             "http://222.94.148.210:808"]
    fakepxs = {}
    # random.choice avoids the off-by-one in random.randint(0, len(proxy))
    fakepxs['http'] = random.choice(proxy)
    return fakepxs
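Free proxies like the ones listed above tend to stop working quickly, so a pool usually needs some way to discard dead entries. The following is a minimal sketch under that assumption; the `ProxyPool` class and its method names are hypothetical and not part of requests.

```python
import random


class ProxyPool:
    """A tiny in-memory proxy pool (hypothetical helper for illustration)."""

    def __init__(self, proxies):
        self._alive = list(proxies)

    def get(self):
        """Return a requests-style proxies dict with one random live proxy."""
        if not self._alive:
            raise RuntimeError("no working proxies left")
        return {"http": random.choice(self._alive)}

    def mark_bad(self, proxy_url):
        """Drop a proxy that timed out or was refused."""
        if proxy_url in self._alive:
            self._alive.remove(proxy_url)


pool = ProxyPool(["http://116.211.143.11:80", "http://183.1.86.235:8118"])
picked = pool.get()
pool.mark_bad(picked["http"])   # pretend the request through it failed
remaining = pool.get()          # only the other proxy can be returned now
print(picked, remaining)
```

The dict returned by `get()` has the shape that the `proxies` argument of `requests.get` expects, for example `requests.get(url, proxies=pool.get(), timeout=30)`.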

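IV. Restricting crawlers via robots.txt:

robots.txt (always lowercase) is an ASCII-encoded text file stored in the root directory of a website. It tells the robots of web search engines (also known as web spiders) which content on the site should not be fetched and which content may be fetched. Because URLs are case-sensitive on some systems, the robots.txt file name should always be lowercase, and the file should be placed in the root directory of the site. To define separately how search engine robots behave in a subdirectory, you can merge those settings into the root robots.txt or use robots metadata. The robots.txt protocol is not a binding standard but a convention that crawlers follow voluntarily, so it cannot guarantee a website's privacy. Note that robots.txt decides whether a URL may be fetched by string comparison, so a directory path with a trailing slash "/" and the same path without one denote different URLs. robots.txt allows wildcards such as "Disallow: *.gif" [1][2].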

User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /
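A well-behaved crawler checks these rules before fetching a page. The sketch below uses Python's standard `urllib.robotparser` on a small rule set given inline, so nothing is downloaded. The `EtaoSpider` entry mirrors the example above, while `MyCrawler`, `example.com`, and the `/private/` rule are assumptions made for illustration. Wildcard patterns are left out because this parser matches paths by plain prefix comparison.

```python
from urllib.robotparser import RobotFileParser

# Inline rules for illustration only.
rules = """\
User-agent: EtaoSpider
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# EtaoSpider is banned from the whole site.
print(parser.can_fetch("EtaoSpider", "https://example.com/item/1.html"))  # False
# Any other crawler falls under the "*" entry.
print(parser.can_fetch("MyCrawler", "https://example.com/item/1.html"))   # True
print(parser.can_fetch("MyCrawler", "https://example.com/private/x"))     # False
# Rules are compared as strings, so the path without the trailing
# slash does not match "Disallow: /private/".
print(parser.can_fetch("MyCrawler", "https://example.com/private"))       # True
```

For a live site, `set_url()` followed by `read()` loads the real file instead of `parse()`.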
