ruby - Using binary data (strings in UTF-8) from external file
I have a problem using strings in UTF-8 format, e.g. "\u0161\u010d\u0159\u017e\u00fd". When such a string is defined as a variable in my program, it works fine. But when I read such a string from an external file, I get the wrong output (not what I want/expect). I'm probably missing some necessary encoding stuff...
My code:
    file = "c:\\...\\vlmlist_unicode.txt" #\u306b\u3064\u3044\u3066
    data = File.open(file, 'rb') { |io| io.read.split(/\t/) }
    puts data
    data_var = "\u306b\u3064\u3044\u3066"
    puts data_var
Output:
    \u306b\u3064\u3044\u3066 # don't want
    について # want
I'm trying to read the file in binary form by specifying 'rb', but there must be some other problem... I run my code in NetBeans 7.3.1 with the built-in JRuby 1.7.3 (I also tried Ruby 2.0.0, without effect).
Since I'm new to the Ruby world, any ideas are welcome...
If the file contains the literal escaped string:
    \u306b\u3064\u3044\u3066
then you will need to unescape it after reading. Ruby does this for you only in string literals, which is why the second case worked for you. Taken from the answer to "What is the best way to unescape Unicode escape sequences in Ruby?", you can use this:
    file = "c:\\...\\vlmlist_unicode.txt" #\u306b\u3064\u3044\u3066
    data = File.open(file, 'rb') { |io|
      # replace each \uXXXX escape with the character it stands for
      contents = io.read.gsub(/\\u([\da-fA-F]{4})/) { |m|
        [$1].pack("H*").unpack("n*").pack("U*")
      }
      contents.split(/\t/)
    }
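To see what the substitution is doing, here is a minimal sketch that compares the text as it arrives from the file with an ordinary string literal, and then walks one escape through the pack/unpack chain. The variable names are only illustrative, and the single-quoted string stands in for the file contents.

```ruby
# Stand-in for the file contents: single quotes keep the backslashes,
# so this is 24 plain ASCII characters.
raw = '\u306b\u3064\u3044\u3066'
# Double quotes: Ruby decodes the escapes while parsing the literal.
literal = "\u306b\u3064\u3044\u3066"

puts raw.length      # 24
puts literal.length  # 4

# One escape, step by step:
hex   = "306b"
bytes = [hex].pack("H*")     # hex digits -> two raw bytes, 0x30 0x6B
code  = bytes.unpack("n*")   # bytes -> 16-bit big-endian integer: [12395]
char  = code.pack("U*")      # code point -> UTF-8 encoded character

# The same chain applied to every escape in the raw text:
unescaped = raw.gsub(/\\u([\da-fA-F]{4})/) { [$1].pack("H*").unpack("n*").pack("U*") }
puts unescaped == literal    # true
```

Note that the pattern only matches four-digit \uXXXX escapes; other escape forms are left untouched.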
Alternatively, if you want to make it more readable, extract the substitution into a new method and add it to the String class:
    class String
      # replace each \uXXXX escape with the character it stands for
      def unescape_unicode
        self.gsub(/\\u([\da-fA-F]{4})/) { |m|
          [$1].pack("H*").unpack("n*").pack("U*")
        }
      end
    end
Then you can call:
    file = "c:\\...\\vlmlist_unicode.txt" #\u306b\u3064\u3044\u3066
    data = File.open(file, 'rb') { |io| io.read.unescape_unicode.split(/\t/) }
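If you would like to check the whole round trip without touching the real file, something along these lines might help. It is only a sketch under a few assumptions: the temporary file, its name "unicode_demo" and its contents are made up for the demo, the substitution is written as a plain helper method (hypothetical, not part of Ruby) rather than a patch on String, and the file is opened as "r:UTF-8" instead of 'rb' so that the resulting strings are tagged as UTF-8.

```ruby
require "tempfile"

# Hypothetical helper: same substitution as above, without reopening String.
def unescape_unicode(text)
  text.gsub(/\\u([\da-fA-F]{4})/) { [$1].pack("H*").unpack("n*").pack("U*") }
end

# Write a tab-separated line with escaped text to a temporary file,
# then read it back, unescape it and split it on tabs.
data = Tempfile.create("unicode_demo") do |f|
  f.write('\u306b\u3064\u3044\u3066' + "\t" + 'plain')
  f.flush
  File.open(f.path, "r:UTF-8") { |io| unescape_unicode(io.read).split(/\t/) }
end

puts data                 # the decoded text, then "plain"
puts data.first.encoding  # UTF-8
```

Keeping the helper as a standalone method avoids changing String for the rest of the program; whether that matters depends on how large the script is.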