python - How to connect to an HTTPS site with Scrapy via Polipo over Tor?


I'm not entirely sure what the problem is here.

I'm running Python 2.7.3 and Scrapy 0.16.5.

I've created a simple Scrapy spider to test connecting to my local Polipo proxy so that I can send requests out via Tor. The basic code of my spider is as follows:

    from scrapy.spider import BaseSpider


    class TorSpider(BaseSpider):
        name = "tor"
        allowed_domains = ["check.torproject.org"]
        start_urls = [
            "https://check.torproject.org"
        ]

        def parse(self, response):
            print response.body

For my proxy middleware, I've defined:

    from scrapy.conf import settings


    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            # Route every outgoing request through the proxy from settings.
            request.meta['proxy'] = settings.get('HTTP_PROXY')

My HTTP_PROXY in my settings file is defined as HTTP_PROXY = 'http://localhost:8123'.
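For reference, a settings file that wires this up might look like the sketch below. The module path `arachnid.middlewares.ProxyMiddleware` and the priority `410` are assumptions made for illustration (the question does not show them); only the `HTTP_PROXY` value comes from the text above.

```python
# settings.py (sketch; the module path and priority below are hypothetical)

# Local Polipo proxy, which forwards traffic to Tor.
HTTP_PROXY = 'http://localhost:8123'

# Register the custom middleware so Scrapy calls its process_request().
DOWNLOADER_MIDDLEWARES = {
    'arachnid.middlewares.ProxyMiddleware': 410,  # assumed path and order
}
```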

Now, if I change my start URL to http://check.torproject.org, everything works fine, no problems.

If I attempt to run against https://check.torproject.org, I get a 400 Bad Request error every time (I've also tried different https:// sites, and all of them have the same problem):

    2013-07-23 21:36:18+0100 [scrapy] INFO: Scrapy 0.16.5 started (bot: arachnid)
    2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RandomUserAgentMiddleware, ProxyMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled item pipelines:
    2013-07-23 21:36:18+0100 [tor] INFO: Spider opened
    2013-07-23 21:36:18+0100 [tor] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2013-07-23 21:36:18+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2013-07-23 21:36:18+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying <GET https://check.torproject.org> (failed 1 times): 400 Bad Request
    2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying <GET https://check.torproject.org> (failed 2 times): 400 Bad Request
    2013-07-23 21:36:18+0100 [tor] DEBUG: Gave up retrying <GET https://check.torproject.org> (failed 3 times): 400 Bad Request
    2013-07-23 21:36:18+0100 [tor] DEBUG: Crawled (400) <GET https://check.torproject.org> (referer: None)
    2013-07-23 21:36:18+0100 [tor] INFO: Closing spider (finished)
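One possible explanation, which the log does not confirm, lies in how HTTP proxies treat the two schemes. For http:// URLs a client sends the proxy an ordinary GET with the full URL, whereas for https:// URLs it is expected to open a tunnel with CONNECT first; a proxy that receives a plain GET for an https:// URL may answer with an error such as 400. The sketch below only illustrates that difference in the first request line. `proxy_request_line` is a hypothetical helper, not part of Scrapy or Polipo.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2


def proxy_request_line(url):
    """Return the first line a client would normally send to an HTTP proxy."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        # HTTPS through a proxy: ask the proxy to open a raw tunnel.
        port = parts.port or 443
        return "CONNECT %s:%d HTTP/1.1" % (parts.hostname, port)
    # Plain HTTP through a proxy: GET with the absolute URL.
    return "GET %s HTTP/1.1" % url


print(proxy_request_line("http://check.torproject.org/"))
print(proxy_request_line("https://check.torproject.org/"))
```

If the downloader in use sends the first form for both schemes, that would be consistent with http:// working and https:// failing, but this remains an assumption to verify, for example by inspecting the proxy's log.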

And just to double check that nothing is wrong with my Tor/Polipo setup, I'm able to run the following curl command in a terminal and connect fine: curl --proxy localhost:8123 https://check.torproject.org/

Any suggestions as to what's wrong here?
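The same check can also be sketched outside Scrapy with only the Python standard library, which helps separate proxy problems from Scrapy problems. The helper names `build_proxy_opener` and `fetch_via_proxy` are hypothetical; the proxy address is the one from the question, and the function that performs the request is defined but not called here.

```python
try:
    from urllib.request import ProxyHandler, build_opener  # Python 3
except ImportError:
    from urllib2 import ProxyHandler, build_opener          # Python 2


def build_proxy_opener(proxy_url):
    """Build an opener that sends both HTTP and HTTPS traffic via proxy_url."""
    handler = ProxyHandler({"http": proxy_url, "https": proxy_url})
    return build_opener(handler)


def fetch_via_proxy(url, proxy_url="http://localhost:8123", timeout=30):
    """Fetch url through the proxy and return the response body as bytes."""
    opener = build_proxy_opener(proxy_url)
    response = opener.open(url, timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()

# Example call (needs Polipo and Tor running locally):
#   body = fetch_via_proxy("https://check.torproject.org/")
```

If this succeeds where the spider fails, the proxy itself is probably fine and the difference lies in how the spider's requests reach it.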

Not sure if these may help you:

