python - How to connect to https site with Scrapy via Polipo over TOR?
I'm not entirely sure what the problem is here.
I'm running Python 2.7.3 and Scrapy 0.16.5.
I've created a simple Scrapy spider to test connecting to my local Polipo proxy, so that I can send requests out via Tor. The basic code of the spider is as follows:
    from scrapy.spider import BaseSpider

    class TorSpider(BaseSpider):
        name = "tor"
        allowed_domains = ["check.torproject.org"]
        start_urls = [
            "https://check.torproject.org"
        ]

        def parse(self, response):
            # Dump the raw page so the Tor check result is visible
            print response.body
For the proxy middleware, I've defined:
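    from scrapy.conf import settings  # global settings object (Scrapy 0.16-era import)

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            # Route every outgoing request through the configured proxy
            request.meta['proxy'] = settings.get('HTTP_PROXY')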
My HTTP_PROXY in the settings file is defined as HTTP_PROXY = 'http://localhost:8123'
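As a sanity check on the middleware itself, here is a minimal sketch that runs the same logic without Scrapy. FakeRequest and proxy_for are hypothetical helpers made up for this illustration, and the proxy URL is passed in explicitly rather than read from the settings file. It only shows that the middleware treats http:// and https:// requests identically, so any scheme-specific failure would have to arise later in the download path:

```python
class FakeRequest(object):
    """Hypothetical stand-in for a Scrapy request: just a url and a meta dict."""

    def __init__(self, url):
        self.url = url
        self.meta = {}


class ProxyMiddleware(object):
    """Same logic as the middleware in the post, but the proxy URL is
    injected instead of being looked up in the Scrapy settings."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        request.meta['proxy'] = self.proxy_url


def proxy_for(url, proxy_url='http://localhost:8123'):
    """Run the middleware on a fake request and return the proxy it set."""
    request = FakeRequest(url)
    ProxyMiddleware(proxy_url).process_request(request, spider=None)
    return request.meta['proxy']
```

Both proxy_for('http://check.torproject.org') and proxy_for('https://check.torproject.org') come back with the same Polipo address.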
Now, if I change the start URL to http://check.torproject.org, it works fine, no problems.
If I attempt to run against https://check.torproject.org, I get a 400 Bad Request error every time (I've tried different https:// sites, and all of them have the same problem):
    2013-07-23 21:36:18+0100 [scrapy] INFO: Scrapy 0.16.5 started (bot: arachnid)
    2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RandomUserAgentMiddleware, ProxyMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled item pipelines:
    2013-07-23 21:36:18+0100 [tor] INFO: Spider opened
    2013-07-23 21:36:18+0100 [tor] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2013-07-23 21:36:18+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2013-07-23 21:36:18+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying <GET https://check.torproject.org> (failed 1 times): 400 Bad Request
    2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying <GET https://check.torproject.org> (failed 2 times): 400 Bad Request
    2013-07-23 21:36:18+0100 [tor] DEBUG: Gave up retrying <GET https://check.torproject.org> (failed 3 times): 400 Bad Request
    2013-07-23 21:36:18+0100 [tor] DEBUG: Crawled (400) <GET https://check.torproject.org> (referer: None)
    2013-07-23 21:36:18+0100 [tor] INFO: Closing spider (finished)
And to double-check that nothing is wrong with my Tor/Polipo setup, I'm able to run the following curl command in a terminal, and it connects fine:

    curl --proxy localhost:8123 https://check.torproject.org/
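For comparison: when curl is given an https:// URL together with an HTTP proxy, it first asks the proxy for a tunnel using a CONNECT request, and only then starts TLS through that tunnel. Below is a minimal sketch, using only the Python standard library, of sending that same CONNECT to Polipo by hand. It is a diagnostic aid rather than part of the spider, the function names are made up for this illustration, and it says nothing definitive about what Scrapy itself puts on the wire:

```python
import socket


def build_connect_request(host, port=443):
    """Return the bytes of an HTTP CONNECT request asking a proxy for a tunnel."""
    target = "%s:%d" % (host, port)
    lines = [
        "CONNECT %s HTTP/1.1" % target,
        "Host: %s" % target,
        "",  # blank line terminates the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def proxy_connect_status(proxy_host, proxy_port, host, port=443, timeout=30):
    """Send CONNECT to the proxy and return the first line of its reply."""
    sock = socket.create_connection((proxy_host, proxy_port), timeout=timeout)
    try:
        sock.sendall(build_connect_request(host, port))
        reply = sock.recv(4096)
    finally:
        sock.close()
    return reply.split(b"\r\n", 1)[0].decode("ascii", "replace")
```

Calling proxy_connect_status("localhost", 8123, "check.torproject.org") with Polipo running returns the proxy's status line. A 200 reply there would suggest the proxy grants tunnels and that the 400 depends on how the HTTPS request reaches it.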
Any suggestions as to what's wrong here?