python - Parsing uncommon symbol using BeautifulSoup
I have a link, <a href=abc.asp?xyz=foobar&baz=lookatme´_beautiful.jpg>, that contains an unusual symbol, ´, which is not present on a standard English keyboard. It is the mirror reflection of the symbol that Ctrl+K produces in this editor. After I ran this code, which I found on StackOverflow:

    import BeautifulSoup  # BeautifulSoup 3

    soup = BeautifulSoup.BeautifulSoup("<a href=abc.asp?xyz=foobar&baz=lookatme´_beautiful.jpg>")
    # Print the href of every anchor tag the parser finds.
    for a in soup.findAll('a'):
        print(a['href'])

the output is abc.asp?xyz=foobar&baz=lookatme, but I want to have abc.asp?xyz=foobar&baz=lookatme´_beautiful.jpg. The website I'm scraping is in the .br domain. Some of the writing is in Portuguese, though the links are in English, so the uncommon symbol may not be a valid English-language symbol. Any thoughts or suggestions?
Edit: I looked at the representation of the Python string it produced for me, and it is <a href=abc.asp?xyz=foobar&baz=lookatme\xb4_beautiful.jpg>
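The \xb4 escape can be checked directly. This is a minimal sketch in Python 3; the variable names are only for illustration, and the string is the sample tag from this question. It prints the escaped form of the string and looks up the Unicode name of the code point U+00B4:

```python
import unicodedata

# The sample tag from the question, with the symbol written as an escape.
link = "<a href=abc.asp?xyz=foobar&baz=lookatme\xb4_beautiful.jpg>"

# ascii() escapes every non-ASCII character, so the symbol appears as \xb4.
escaped = ascii(link)
print(escaped)

# Look up the official Unicode name of the code point U+00B4.
symbol_name = unicodedata.name("\xb4")
print(symbol_name)
```

The name reported is ACUTE ACCENT, so the character is an ordinary Latin-1 symbol rather than anything specific to Portuguese.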
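One way around this is to produce a custom regex, and here is a snippet from StackOverflow:

    import re

    # Capture everything after href= up to a quote, a space, or '>'.
    urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)

If it is impossible to modify BeautifulSoup's regex, how can I modify the above regex to incorporate the \xb4 symbol? (s here is the string in question.)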
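A note on the regex, as a minimal sketch in Python 3 and assuming s holds the tag shown in the edit above: the character class [^'" >] excludes only quotes, spaces, and >, so it appears to capture \xb4 without any modification.

```python
import re

# Assumption: s is the sample tag from the question, with the symbol as \xb4.
s = "<a href=abc.asp?xyz=foobar&baz=lookatme\xb4_beautiful.jpg>"

# Same pattern as in the question: an optional opening quote, then every
# character that is not a quote, a space, or '>'.
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(urls)
```

With this input the list holds a single URL that still ends in ´_beautiful.jpg, which suggests the truncation comes from the parser rather than from the pattern.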
Upgrade to the latest version of BeautifulSoup and install html5lib, which is a lenient parser:

    import requests
    from bs4 import BeautifulSoup

    html = requests.get('http://www.atlasdermatologico.com.br/listar.asp?acao=indice').text
    soup = BeautifulSoup(html, 'html5lib')
    for a in soup.find_all('a'):
        href = a.get('href')
        # Keep only hrefs whose repr contains an escape such as \xb4.
        # On Python 3, use ascii(href) here, because repr() leaves ´ unescaped.
        if '\\' in repr(href):
            print(repr(href))

It correctly prints out the links with \xb4 in the URL.