Below are a few examples of garbled Cyrillic from a web page (this
happens to be a CD track list).
Is there a simple direct way to transliterate or re-encode these into
proper Cyrillic characters using BBEdit? I've tried all the charsets
in the 'Reopen using encoding' submenu, but to no avail.
I've done this (usually imperfectly) in the past using online
converters and hacky freeware, but I'd really like to accomplish this
task within BBEdit.
Any insights welcome.
001. čĺđîěîíŕő Čëŕđčîí (Ŕëôĺĺâ) - Ńëîâî â Âĺëčęóţ Ď˙ňíčöó
002. Őîđ ěóćńęîăî Äŕíčëîâŕ ě-đ˙ - Ęí˙çč ëţäńňčč ńîáđŕřŕń˙ íŕ Ăîńďîäŕ
003. Őîđ ěóćńęîăî Äŕíčëîâŕ ě-đ˙ - Ňĺ÷ĺ ăëŕăîë˙ Čóäŕ áĺççŕęîííűě ęíčćíčęîě
004. Őîđ ěóćńęîăî Äŕíčëîâŕ ě-đ˙ - Ëŕçŕđĺâŕ đŕäč âîńňŕíč˙
005. Őîđ ěóćńęîăî Äŕíčëîâŕ ě-đ˙ - Äíĺńü Čóäŕ îńňŕâë˙ĺň Ó÷čňĺë˙
--
Lloyd Dunn
http://nula.cc/
http://blog.nula.cc/
>Below are a few examples of garbled Cyrillic from a web page (this
>happens to be a CD track list).
>
>Is there a simple direct way to transliterate or re-encode these into
>proper Cyrillic characters using BBEdit? I've tried all the charsets
>in the 'Reopen using encoding' submenu, but to no avail.
The way to do it is surely to get the page to display properly in
the browser by changing the encoding there, not in BBEdit. In
Firefox: View->Character Encoding.
What encoding is declared in the <head> of the page when you view the
source? Does the server specify a character set when you do
curl --head "http://.../"
in Terminal?
The problem may be a simple error in the meta declaration or some
sort of double encoding may have taken place which needs to be
unraveled. If the worst comes to the worst you can use Perl/Encode
in BBEdit to sort it out, but you need to have a good idea what has
gone wrong first. What is the URL?
JD
--
You received this message because you are subscribed to the
"BBEdit Talk" discussion group on Google Groups.
To post to this group, send email to bbe...@googlegroups.com
To unsubscribe from this group, send email to
bbedit+un...@googlegroups.com
For more options, visit this group at
<http://groups.google.com/group/bbedit?hl=en>
If you have a feature request or would like to report a problem,
please email "sup...@barebones.com" rather than posting to the group.
Follow @bbedit on Twitter: <http://www.twitter.com/bbedit>
So maybe that is a starting point for me, but I'm guessing this isn't
really a BBEdit topic anymore, so I'll just proceed from here unless
there are any further suggestions.
Thanks.
--
Lloyd Dunn
http://nula.cc/
http://blog.nula.cc/
On Mar 10, 9:16 pm, "Robert A. Rosenberg" <rar...@banet.net> wrote:
> At 10:00 AM +0100 on 03/10/2011, Lloyd Dunn wrote about transliterate
> into cyrillic:
>
> >Below are a few examples of garbled Cyrillic from a web page (this
> >happens to be a CD track list).
>
> >Is there a simple direct way to transliterate or re-encode these into
> >proper Cyrillic characters using BBEdit? I've tried all the charsets
> >in the 'Reopen using encoding' submenu, but to no avail.
>
> >I've done this (usually imperfectly) in the past using online
> >converters and hacky freeware, but I'd really like to accomplish this
> >task within BBEdit.
>
> >Any insights welcome.
>
> This looks like the page is declared as ISO-8859-1 in the meta tag
> instead of utf-8. Use the source/page view option to check. Try
> telling your browser to display it as Character Set UTF-8. What is
> the URL? I can look at it for you if you want. When you see the
> characters in groups of 3 (and they are all accented) that is a tip
> off for utf-8. If you look up the ISO-8859-1 codepoint of the
> characters in (for example) � you can see how it converts to the
Those entities are not Cyrillic though.
I suspect that someone wrote the page in Cyrillic 8859-5 and then uploaded it to a host that only serves 8859-1 and it all got munged to hell.
Can you try some other encodings in Firefox (Windows Latin 2 maybe, or maybe even KOI8-R) and see if something renders the page correctly? If it does, then copy and paste the page contents.
What does the server claim is the encoding (this is different than what the HTML page claims is the encoding)?
--
Experience is something you don't get until just after you need it.
>On 11-Mar-2011, at 03:54, eleven wrote:
> >
>> When I view source, the
>> garbled Cyrillic is really all encoded entities like this (mixed with
>> Latin accented characters): &#324;&#281;î&#259; in a page that is
>> rendering as charset=iso-8859-1.
>
>Those entities are not Cyrillic though.
>
>I suspect that someone wrote the page in Cyrillic 8859-5 and then
>uploaded it to a host that only serves 8859-1 and it all got munged
>to hell.
>
>Can you try some other encodings in Firefox (Windows Latin 2 maybe,
>or maybe even KOI8-R) and see if something renders the page
>correctly. If it does, then copy and paste the page contents.
FF has an ISO-8859-5 charset setting, so that can be tried to see if
it comes out. The problem is that if it WERE ISO-8859-1, all the codes
would still be in the x00-xFF range. Even if it were originally
ISO-8859-5, something else mangled it, since just serving it as
ISO-8859-1 would simply show the Latin-1 glyphs in place of the
Cyrillic glyphs.
>Thanks for the responses. The URL is behind a login screen so there is
>no way for me to share it directly. I am pretty sure that the problem
>is with the page encoding, however, as you've both suggested.
>Firefox's View>Character Encoding just gives other kinds of garbled
>text. But I have figured out why, I think. When I view source, the
>garbled Cyrillic is really all encoded entities like this (mixed with
>Latin accented characters): &#324;&#281;î&#259; in a page that is
>rendering as charset=iso-8859-1.
Cyrillic runs from U+0400 to U+052F in Unicode. This corresponds to
the bytes xD0 x80 through xD4 xAF when UTF-8 encoded.
Your &#324;&#281; converts to U+0144 and U+0119, which in Unicode are
LATIN SMALL LETTER N WITH ACUTE and LATIN SMALL LETTER E WITH OGONEK
so something major is wrong with the creation of the page.
The fact that the charset on the page is ISO-8859-1 is an immediate
tip-off of the problem.
As to sharing the page, there is NO NEED to supply the actual URL. If
you go to the page and do a SAVE AS HTML it will save a copy to your
machine. That file can then be posted to an accessible web site or
sent as an attachment to a private message for analysis.
>I would still like to know if it is even possible to do
>transliteration chores like this within BBEdit thru a text factory
>or similar, as well as any wisdom as to how to go about it...
Your problem here is that it is not simple transliteration.
You can convert the decimal html entities easily into characters
using a UNIX filter something like this:
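#!/usr/bin/perl
use strict;
use warnings;
# Input is read as raw bytes, i.e. treated as ISO-8859-1.
# Write the decoded characters out as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
while (<>) {
    # Turn each decimal entity (e.g. &#324;) into its character.
    s~&#(\d+);~chr($1)~eg;
    print;
}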
but that will get you nowhere unless you know, or can guess, what
transformations the original Cyrillic was subjected to in the process
of producing the garbage. At some point it is likely that an attempt
was made to convert something to utf-8 and the raw bytes of the
supposed utf-8 were then converted to decimal html entities where
they were outside the range of iso-8859-1 -- you will see that
characters within range have not been so encoded. The original
Cyrillic could have been in any one of four distinct encodings. This
makes the task even more difficult. The problem arises quite
frequently in badly managed European sites but here the chances are
that the original text was windows-1252 and it's easier to follow the
process back.
JD
OK, in THAT case the original encoding was almost certainly Windows 1252, so with enough work and probably a perl/php script you should be able to get something useful back if you want to put some time into it.
Basically, you have to reverse the chain of encoding transformations until you get data that looks right.
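As an illustration of walking the chain backwards, here is a minimal Perl/Encode sketch. It rests on a guess drawn only from the sample lines at the top of this thread: that the text is Windows-1251 Cyrillic whose bytes were read as Windows-1250 (Central European). Nobody in the thread has confirmed that, and the helper name unmangle is made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                        # this script contains literal non-ASCII text
use Encode qw(encode decode);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: undo one "bytes read with the wrong charset" step.
# $wrong is the charset the bytes were mistakenly read as, $right is the
# charset they were really written in. Both defaults are guesses.
sub unmangle {
    my ($text, $wrong, $right) = @_;
    $wrong //= 'cp1250';
    $right //= 'cp1251';
    # Step 1: turn any decimal entities back into characters.
    $text =~ s/&#(\d+);/chr($1)/eg;
    # Step 2: recover the original bytes, then decode them correctly.
    return decode($right, encode($wrong, $text));
}

print unmangle('Čëŕđčîí'), "\n";        # prints Иларион
print unmangle('&#324;&#281;'), "\n";   # prints ск
```

If the output is still garbage, the guess about one of the two charsets is wrong, and other pairs can be tried by passing them as the second and third arguments. Characters that do not exist in the assumed wrong charset come out as question marks, which is itself a hint that the pair is not the right one.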
>At some point it is likely that an attempt was made to convert
>something to utf-8 and the raw bytes of the supposed utf-8 were then
>converted to decimal html entities where they were outside the range
>of iso-8859-1
Anything that is in UTF-8 has each character encoded either as a single
byte in the x00-7F range (US-ASCII) or as a lead byte of xC0 or above
followed by one or more bytes in the x80-BF range. The number of bytes
in the sequence is given by the number of 1 bits at the start of the
lead byte before you get to a 0 bit (thus 110xxxxx is 2 bytes [1
following byte], 1110xxxx is 3 bytes [2 following bytes], etc.). All
following bytes are of the form 10xxxxxx (so if you find one, you look
left until you find one of the form 11xxxxxx, which is a lead byte).
Details are at http://en.wikipedia.org/wiki/Utf8.
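The bit patterns above can be checked mechanically. The following is a small sketch, not a full validator: it only looks at the leading bits of a single byte, and the helper name utf8_seq_len is invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: classify one byte value (0-255) by its leading bits.
# Returns the total sequence length for a lead byte, 0 for a continuation
# byte (10xxxxxx), and undef for a byte that fits none of the patterns.
# Only the bit pattern is checked: xC0, xC1 and xF5-xF7 match a lead
# pattern here but never occur in valid UTF-8.
sub utf8_seq_len {
    my ($byte) = @_;
    return 1 if $byte < 0x80;               # 0xxxxxxx: US-ASCII
    return 0 if ($byte & 0xC0) == 0x80;     # 10xxxxxx: continuation byte
    return 2 if ($byte & 0xE0) == 0xC0;     # 110xxxxx
    return 3 if ($byte & 0xF0) == 0xE0;     # 1110xxxx
    return 4 if ($byte & 0xF8) == 0xF0;     # 11110xxx
    return;                                 # 11111xxx: not used
}

for my $byte (0x41, 0xD0, 0x80, 0xE2, 0xF0, 0xFF) {
    my $len = utf8_seq_len($byte);
    printf "x%02X -> %s\n", $byte, defined $len ? $len : 'invalid';
}
```

A byte such as xD0, which starts every character in the first part of the Cyrillic block, reports a length of 2, so it must be followed by exactly one continuation byte.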
As to the mangling issue, the codes do not match something converted
into UTF-8. For real Unicode Cyrillic (like the good sample in the
1000 range), here is the breakdown:
Cyrillic runs from U+0400 to U+052F in Unicode. This corresponds to
the bytes xD0 x80 through xD4 xAF when UTF-8 encoded.
The numbers are off for real UTF-8, even if the two bytes are merged into one.
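The byte values quoted for the Cyrillic range can be reproduced with Encode. This is only a sketch for checking the arithmetic, and utf8_hex is a made-up helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the UTF-8 bytes of one code point, as hex pairs.
sub utf8_hex {
    my ($codepoint) = @_;
    my $bytes = encode('UTF-8', chr $codepoint);
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

print utf8_hex(0x0400), "\n";   # D0 80  (start of the Cyrillic range)
print utf8_hex(0x052F), "\n";   # D4 AF  (end of the Cyrillic range)
print utf8_hex(324), "\n";      # C5 84  (the &#324; entity from the page)
```

A single byte can never exceed 255, so an entity such as &#324; cannot itself be one byte of a UTF-8 sequence. Its own UTF-8 form, C5 84, also falls outside the D0 80 to D4 AF span that real Cyrillic would occupy.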