Crawler runs into HTTP error 521

I recently wrote a crawler that needs proxy IPs, so I decided to scrape some from kuaidaili (快代理).

However, requesting http://www.kuaidaili.com/proxylist/1 with urllib2 always came back with a 521 error.

HTTPError has a read method, so the body of the HTTP response can still be printed:

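import urllib2

url = "http://www.kuaidaili.com/proxylist/1"

try:
    resp = urllib2.urlopen(url)
    contents = resp.read()
except urllib2.HTTPError as error:
    # HTTPError is also a response object, so its body can still be read.
    print error
    contents = error.read()
    print contents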

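The body turned out to be a chunk of JavaScript. My guess was that on the first visit the server returns JS code together with status 521, and the script then sets some parameter and makes the browser request the page again, at which point it gets a normal HTTP 200 response. Checking with the browser's developer tools confirmed exactly that.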
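For anyone on Python 3, where urllib2 no longer exists, here is a minimal sketch of the same check. The names `fetch` and `looks_like_js_challenge` are made up for illustration, and the "status 521 plus a script tag" rule is only an assumption based on what this one site returned, not a general way to detect such pages.

```python
import urllib.error
import urllib.request


def fetch(url, opener=urllib.request.urlopen):
    """Return (status, body) even when the server answers with an HTTP error."""
    try:
        with opener(url) as resp:
            return resp.getcode(), resp.read()
    except urllib.error.HTTPError as error:
        # HTTPError doubles as a response object, so the body is still readable.
        return error.code, error.read()


def looks_like_js_challenge(status, body):
    """Heuristic (an assumption): status 521 and a script tag in the body."""
    return status == 521 and b"<script" in body.lower()
```

Calling `fetch("http://www.kuaidaili.com/proxylist/1")` and passing the result to `looks_like_js_challenge` would tell a plain HTTP crawler that it received the JS page rather than the proxy list. The `opener` parameter is only there so the function can be tried without network access.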

This effectively keeps ordinary HTTP crawlers out. It looks like I will have to bring in selenium.
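A rough sketch of what the selenium route could look like, in Python 3. It is not tested against the site: `fetch_with_browser` and `main` are hypothetical names, the `"<table" not in page` check is a guess at how to tell the JS page from the real proxy list, and `webdriver.Chrome()` assumes Chrome and a matching driver are installed. The idea is simply to let a real browser run the script and reload, then poll until the page changes.

```python
import time


def fetch_with_browser(driver, url, is_challenge, timeout=10.0, poll=0.5):
    """Load url in a real browser and wait until the JS page has been replaced."""
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source
        if not is_challenge(html):
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError("still on the JS page after %.1fs" % timeout)
        time.sleep(poll)


def main():
    # Imported lazily so the helper above can be used without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome plus a matching driver
    try:
        html = fetch_with_browser(
            driver,
            "http://www.kuaidaili.com/proxylist/1",
            # Hypothetical check: assume the real page lists proxies in a table.
            is_challenge=lambda page: "<table" not in page,
        )
        print(len(html))
    finally:
        driver.quit()
```

Running `main()` opens a browser window, waits for the reload, and prints the size of the final page. The polling loop is there because `page_source` read immediately after `get` may still be the JS page.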
