C# - How to Convert Unicode Text File With URLs to ANSI using URL Encoding
I have large text files containing URLs, encoded in UCS-2 Little Endian. They contain all kinds of links: Arabic, Chinese, Japanese, Korean, Russian, and any other language you can think of in a URL.
My goal is to create a script that automatically URL-encodes all of these links and saves them in an ANSI-encoded file.
Example:
These are some of the original links:
http://ejje.weblio.jp/content/あきれて物が言えない
https://ru.wikipedia.org/wiki/Дактиль
http://zh.wikipedia.org/zh/垃圾食品
http://abunawaf.com/سيارات-الملوك-وورثتهم-صور
http://ko.wiktionary.org/wiki/가능해지다
These need to become:
http://ejje.weblio.jp/content/%e3%81%82%e3%81%8d%e3%82%8c%e3%81%a6%e7%89%a9%e3%81%8c%e8%a8%80%e3%81%88%e3%81%aa%e3%81%84
https://ru.wikipedia.org/wiki/%d0%94%d0%b0%d0%ba%d1%82%d0%b8%d0%bb%d1%8c
http://zh.wikipedia.org/zh/%e5%9e%83%e5%9c%be%e9%a3%9f%e5%93%81
http://abunawaf.com/%d8%b3%d9%8a%d8%a7%d8%b1%d8%a7%d8%aa-%d8%a7%d9%84%d9%85%d9%84%d9%88%d9%83-%d9%88%d9%88%d8%b1%d8%ab%d8%aa%d9%87%d9%85-%d8%b5%d9%88%d8%b1
http://ko.wiktionary.org/wiki/%ea%b0%80%eb%8a%a5%ed%95%b4%ec%a7%80%eb%8b%a4
I've used C# for that. I've tried using the HttpUtility.UrlPathEncode method like this:
    using System;
    using System.IO;
    using System.Text;
    using System.Web; // HttpUtility

    class Program
    {
        static void Main(string[] args)
        {
            string path = @"c:\temp\test.txt";
            string enpath = @"c:\temp\entest.txt";
            string[] lines = File.ReadAllLines(path);
            for (int i = 0; i < 72; i++)
            {
                // Encode each line, echo it, and append it to the ASCII output file.
                Console.Write(HttpUtility.UrlPathEncode(lines[i]) + Environment.NewLine);
                System.IO.File.AppendAllText(enpath, HttpUtility.UrlPathEncode(lines[i]) + Environment.NewLine, Encoding.ASCII);
            }
            Console.ReadLine();
        }
    }
It seems to convert them all, except for one small bug: if a URL contains a question mark, it doesn't convert anything after it. That is a big handicap for me, as I have a lot of links that contain question marks.
Example:
http://www.alkousy.com/showthread.php?4113-ÇáÚáã-ÈÇááøóå-åæ-ßäÒ-ÇáÃäÈíÇÁ-ææÑËÊåã-ãä-ÇáãÄãäíä
is being converted to:
http://www.alkousy.com/showthread.php?4113-?????-???????-??-???-????????-???????-??-????????
This is totally unacceptable for me, and I'm looking for a solution. I've tried Uri.EscapeDataString as well, but that one converts everything, including the // and the :
Is there a quick solution that doesn't require custom coding anything?
Use the Uri class instead:
    var url = "http://www.alkousy.com/あきれて物が言.php?4113-ÇáÚáã-ÈÇááøóå-åæ-ßäÒ-ÇáÃäÈíÇ";
    var uri = new Uri(url, UriKind.Absolute);
    Console.WriteLine(uri.GetComponents(UriComponents.AbsoluteUri, UriFormat.UriEscaped));
Which outputs:
http://www.alkousy.com/%e3%81%82%e3%81%8d%e3%82%8c%e3%81%a6%e7%89%a9%e3%81%8c%e8%a8%80.php?4113-%c3%87%c3%a1%c3%9a%c3%a1%c3%a3-%c3%88%c3%87%c3%a1%c3%a1%c3%b8%c3%b3%c3%a5-%c3%a5%c3%a6-%c3%9f%c3%a4%c3%92-%c3%87%c3%a1%c3%83%c3%a4%c3%88%c3%ad%c3%87
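Each escape in that output is one byte of a character's UTF-8 encoding, written as a percent sign followed by two hex digits. Purely as an illustration of what the escaping does, here is a minimal sketch. The class PercentEncoding and the method EncodeNonAscii are hypothetical names, not part of .NET, and the sketch only touches non-ASCII characters, whereas the Uri class also deals with reserved ASCII characters such as spaces.

```csharp
using System;
using System.Text;

public static class PercentEncoding
{
    // Hypothetical helper: copies ASCII characters unchanged and replaces each
    // run of non-ASCII characters with its UTF-8 bytes written as %xx.
    public static string EncodeNonAscii(string url)
    {
        var result = new StringBuilder(url.Length * 3);
        int i = 0;
        while (i < url.Length)
        {
            if (url[i] < 0x80)
            {
                result.Append(url[i]);
                i++;
                continue;
            }

            // Collect the whole non-ASCII run so surrogate pairs stay together.
            int start = i;
            while (i < url.Length && url[i] >= 0x80)
            {
                i++;
            }

            byte[] utf8 = Encoding.UTF8.GetBytes(url.Substring(start, i - start));
            foreach (byte b in utf8)
            {
                result.Append('%').Append(b.ToString("x2"));
            }
        }
        return result.ToString();
    }
}
```

Calling PercentEncoding.EncodeNonAscii("https://ru.wikipedia.org/wiki/Дактиль") returns the escaped form of that link listed in the question. For real input the Uri class remains the better choice, since it already knows the URI syntax.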
The Uri class treats the string as an actual URI, so it knows not to encode the protocol. You can adjust your code like this:
    using System;
    using System.IO;
    using System.Text;

    class Program
    {
        static void Main(string[] args)
        {
            string path = @"c:\temp\test.txt";
            string enpath = @"c:\temp\entest.txt";
            string[] lines = File.ReadAllLines(path);
            for (int i = 0; i < 72; i++)
            {
                // Parse the line as an absolute URI and get its fully escaped form.
                var uri = new Uri(lines[i], UriKind.Absolute);
                var escaped = uri.GetComponents(UriComponents.AbsoluteUri, UriFormat.UriEscaped);
                Console.WriteLine(escaped);
                System.IO.File.AppendAllText(enpath, escaped + Environment.NewLine, Encoding.ASCII);
            }
            Console.ReadLine();
        }
    }
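A likely explanation for the question marks in the question's output: HttpUtility.UrlPathEncode leaves everything after the ? as it is, and Encoding.ASCII then substitutes a ? for every character it cannot represent when the line is written. The sketch below only demonstrates that second step. AsciiFallbackDemo and ThroughAscii are hypothetical names, and the sketch assumes the default replacement behavior of Encoding.ASCII.

```csharp
using System;
using System.Text;

public static class AsciiFallbackDemo
{
    // Hypothetical helper: converts text to ASCII bytes and back, which is
    // what the text goes through when it is written with Encoding.ASCII.
    public static string ThroughAscii(string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(bytes);
    }
}
```

Unescaped text such as "php?4113-Çá" loses its non-ASCII characters on the way through, while an already escaped string such as "%c3%87%c3%a1" survives unchanged. That is why escaping with Uri before writing the file avoids the problem.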
Based on the comments, you can implement it with a foreach loop:
    foreach (var line in lines)
    {
        Uri uri;
        // Lines that are not valid absolute URIs are skipped.
        if (Uri.TryCreate(line, UriKind.Absolute, out uri))
        {
            var escaped = uri.GetComponents(UriComponents.AbsoluteUri, UriFormat.UriEscaped);
            Console.WriteLine(escaped);
            System.IO.File.AppendAllText(enpath, escaped + Environment.NewLine, Encoding.ASCII);
        }
    }
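Putting the pieces together, the loop above could be wrapped as in the following sketch. It is an assumption-laden variant, not the code from the answer: UrlFileConverter and EscapeFile are hypothetical names, the input is read explicitly as UTF-16 little endian (which is what Encoding.Unicode means in .NET) instead of relying on encoding detection, and the output is written in one call rather than appended line by line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class UrlFileConverter
{
    // Hypothetical helper: reads a UTF-16 LE file with one URL per line,
    // escapes every line that parses as an absolute URI, and writes the
    // results as ASCII. Returns the number of lines that were skipped.
    public static int EscapeFile(string inputPath, string outputPath)
    {
        var escapedLines = new List<string>();
        int skipped = 0;

        foreach (string line in File.ReadLines(inputPath, Encoding.Unicode))
        {
            Uri uri;
            if (Uri.TryCreate(line.Trim(), UriKind.Absolute, out uri))
            {
                escapedLines.Add(uri.GetComponents(UriComponents.AbsoluteUri, UriFormat.UriEscaped));
            }
            else
            {
                // Blank lines and anything that is not an absolute URI end up here.
                skipped++;
            }
        }

        // Write once instead of reopening the output file for every line.
        File.WriteAllLines(outputPath, escapedLines, Encoding.ASCII);
        return skipped;
    }
}
```

A call such as UrlFileConverter.EscapeFile(path, enpath) would replace the loop. Returning the skipped count is a design choice made here so that malformed lines do not disappear silently.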