python - Parsing uncommon symbol using BeautifulSoup

- April 15, 2013

I have a link, <a href=abc.asp?xyz=foobar&baz=lookatme´_beautiful.jpg>, that contains an unusual symbol, ´, which is not present on a standard English keyboard. It looks like the mirror image of the symbol that Ctrl+K produces in the editor. I ran this code, which I found on Stack Overflow:

    import BeautifulSoup

    soup = BeautifulSoup.BeautifulSoup("<a href=abc.asp?xyz=foobar&baz=lookatme´_beautiful.jpg>")
    for a in soup.findAll('a'):
        print a['href']

The output is abc.asp?xyz=foobar&baz=lookatme, but I want to get abc.asp?xyz=foobar&baz=lookatme´_beautiful.jpg. The website I'm scraping is in the .br domain. Most of the writing is in Portuguese, though the links are in English, so the uncommon symbol may not be a valid English-language symbol. Any thoughts or suggestions?

Edit: I looked at the representation of the string in Python, and it produced this for me: <a href=abc.asp?xyz=foobar&baz=lookatme\xb4_beautiful.jpg>

One way around this is to use a custom regex, such as this snippet from Stack Overflow:

    import re

    urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)

If it is impossible to modify BeautifulSoup's regex, how can I modify the above regex to incorporate the \xb4 symbol? (s here is the string in question.)
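On the regex route, a quick check suggests the pattern may not need any change: its character class excludes only quotes, spaces and >, so \xb4 is already accepted. The sketch below assumes s holds the sample tag from this question rather than the real page, and is written for Python 3.

```python
import re

# Assumed sample input: the tag from the question, with \xb4 (the acute accent).
s = '<a href=abc.asp?xyz=foobar&baz=lookatme\xb4_beautiful.jpg>'

# [^'" >] rejects only quotes, spaces and '>', so \xb4 stays inside the match.
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)

# ascii() shows non-ASCII characters as escapes, making \xb4 easy to spot.
for url in urls:
    print(ascii(url))
```

If the real page still yields truncated links with this pattern, the cause is more likely to be how the page was decoded than the regex itself.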

Upgrade to the latest version of BeautifulSoup (the bs4 package) and install html5lib, a lenient parser:

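    import requests
    from bs4 import BeautifulSoup

    # Fetch the page and parse it with the html5lib parser
    html = requests.get('http://www.atlasdermatologico.com.br/listar.asp?acao=indice').text
    soup = BeautifulSoup(html, 'html5lib')

    for a in soup.find_all('a'):
        href = a.get('href')
        # A backslash in the repr means the link contains an escaped character
        if '\\' in repr(href):
            print(repr(href))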

It correctly prints out the links with \xb4 in the URL.
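If installing html5lib is not an option, a dependency-free cross-check is possible with Python 3's standard-library html.parser. This is a different technique from the one above, not a replacement for it. The HrefCollector class is a hypothetical helper written for this sketch, and the input is the sample tag from the question rather than the live page.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: records the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


collector = HrefCollector()
# Assumed sample input: the unquoted href from the question.
collector.feed('<a href=abc.asp?xyz=foobar&baz=lookatme\xb4_beautiful.jpg>')
collector.close()

# ascii() shows non-ASCII characters as escapes, making \xb4 easy to spot.
for href in collector.hrefs:
    print(ascii(href))
```

The parser treats an unquoted attribute value as running up to the next whitespace or >, so the acute accent stays part of the href.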
