python - Parsing uncommon symbol using BeautifulSoup
I have a link, <a href=abc.asp?xyz=foobar&baz=lookatme´_beautiful.jpg>, that contains an unusual symbol, ´, which is not present on a standard English keyboard. It is the mirror image of the symbol that Ctrl+K produces in this editor. I ran the following code, which I found on Stack Overflow:
import BeautifulSoup

soup = BeautifulSoup.BeautifulSoup("<a href=abc.asp?xyz=foobar&baz=lookatme´_beautiful.jpg>")
# Print the href of every anchor tag (BeautifulSoup 3, Python 2 syntax)
for a in soup.findAll('a'):
    print a['href']
The output is abc.asp?xyz=foobar&baz=lookatme, but I want abc.asp?xyz=foobar&baz=lookatme´_beautiful.jpg. The website I'm scraping is in the .br domain. Some of the text is in Portuguese, though the links are in English, so the uncommon symbol may not be a valid English-language symbol. Any thoughts or suggestions?
Edit: I looked at the representation of the Python string, and it is <a href=abc.asp?xyz=foobar&baz=lookatme\xb4_beautiful.jpg>
One way around this is to write a custom regex, such as this snippet from Stack Overflow:
import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
If it is impossible to modify BeautifulSoup's regex, how can I modify the regex above to incorporate the \xb4 symbol? (Here s is the string in question.)
Upgrade to the latest version of BeautifulSoup and install html5lib, which is a lenient parser:
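import requests
from bs4 import BeautifulSoup

html = requests.get('http://www.atlasdermatologico.com.br/listar.asp?acao=indice').text
soup = BeautifulSoup(html, 'html5lib')
for a in soup.find_all('a'):
    href = a.get('href')
    # Show only links whose repr contains an escaped character such as \xb4.
    # On Python 3, repr() leaves ´ unescaped, so use ascii(href) there instead.
    if '\\' in repr(href):
        print(repr(href))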
It correctly prints out the links with \xb4 in the URL.
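As for the regex fallback raised in the question, the pattern quoted there may not need any change. Its character class is negated, so it accepts every character that is not a quote, a space, or >, and that includes \xb4. The following is a minimal sketch under that reading, written for Python 3 with the standard library only. The name HREF_PATTERN is introduced here purely for illustration.

```python
import re

# The anchor tag from the question; '\xb4' is the acute accent (U+00B4).
s = '<a href=abc.asp?xyz=foobar&baz=lookatme\xb4_beautiful.jpg>'

# Same pattern as quoted in the question: an optional opening quote, then
# every character up to the next quote, space, or '>'.
# HREF_PATTERN is a hypothetical name used only in this sketch.
HREF_PATTERN = re.compile(r'href=[\'"]?([^\'" >]+)')

urls = HREF_PATTERN.findall(s)
print(urls)
```

Run on the string from the question, this should capture the whole href, including the part after the accent. A regex like this is only a rough fallback: it knows nothing about HTML structure, so a real parser such as html5lib remains the safer choice for whole pages.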