transliterate into cyrillic

Lloyd Dunn

Mar 10, 2011, 4:00:37 AM
to bbe...@googlegroups.com
Below are a few examples of garbled Cyrillic from a web page (this
happens to be a CD track list).

Is there a simple direct way to transliterate or re-encode these into
proper Cyrillic characters using BBEdit? I've tried all the charsets
in the 'Reopen using encoding' submenu, but to no avail.

I've done this (usually imperfectly) in the past using online
converters and hacky freeware, but I'd really like to accomplish this
task within BBEdit.

Any insights welcome.

001. čĺđîěîíŕő Čëŕđčîí (Ŕëôĺĺâ) - Ńëîâî â Âĺëčęóţ Ď˙ňíčöó
002. Őîđ ěóćńęîăî Äŕíčëîâŕ ě-đ˙ - Ęí˙çč ëţäńňčč ńîáđŕřŕń˙ íŕ Ăîńďîäŕ
003. Őîđ ěóćńęîăî Äŕíčëîâŕ ě-đ˙ - Ňĺ÷ĺ ăëŕăîë˙ Čóäŕ áĺççŕęîííűě ęíčćíčęîě
004. Őîđ ěóćńęîăî Äŕíčëîâŕ ě-đ˙ - Ëŕçŕđĺâŕ đŕäč âîńňŕíč˙
005. Őîđ ěóćńęîăî Äŕíčëîâŕ ě-đ˙ - Äíĺńü Čóäŕ îńňŕâë˙ĺň Ó÷čňĺë˙

--
Lloyd Dunn
http://nula.cc/
http://blog.nula.cc/

John Delacour

Mar 10, 2011, 11:26:53 AM
to bbe...@googlegroups.com
At 10:00 +0100 10/03/2011, you wrote:

>Below are a few examples of garbled Cyrillic from a web page (this
>happens to be a CD track list).
>
>Is there a simple direct way to transliterate or re-encode these into
>proper Cyrillic characters using BBEdit? I've tried all the charsets
>in the 'Reopen using encoding' submenu, but to no avail.

The way to do it is surely to get the page to display properly in
the browser by changing the encoding there, not in BBEdit. In
Firefox: View->Character Encoding.

What encoding is declared in the <head> of the page when you view the
source? Does the server specify a character set when you do

curl --head "http://.../"

in Terminal?
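
The comparison being asked for here (what the server claims versus what the page claims) can be sketched in Python. This is only an illustration: the header value and the HTML snippet are made up, the function names are hypothetical, and the regular expression is a rough scan rather than a real HTML parser.

```python
import re
from email.message import Message


def charset_from_header(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def charset_from_meta(html):
    """Return the charset named in the first <meta> tag that has one, or None."""
    match = re.search(r"<meta[^>]+charset\s*=\s*[\"']?([\w-]+)", html, re.IGNORECASE)
    return match.group(1).lower() if match else None


# Made-up inputs standing in for `curl --head` output and the page source.
header = "text/html; charset=UTF-8"
page = '<head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></head>'

print(charset_from_header(header))  # charset the server claims
print(charset_from_meta(page))      # charset the page itself claims
```

If the two values disagree, or neither matches the bytes actually in the file, that is a likely place for the garbling to start.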

The problem may be a simple error in the meta declaration or some
sort of double encoding may have taken place which needs to be
unraveled. If the worst comes to the worst you can use Perl/Encode
in BBEdit to sort it out, but you need to have a good idea what has
gone wrong first. What is the URL?

JD

Robert A. Rosenberg

Mar 10, 2011, 3:16:12 PM
to bbe...@googlegroups.com, Lloyd Dunn
At 10:00 AM +0100 on 03/10/2011, Lloyd Dunn wrote about transliterate into cyrillic:

>Below are a few examples of garbled Cyrillic from a web page (this
>happens to be a CD track list).
>
>Is there a simple direct way to transliterate or re-encode these into
>proper Cyrillic characters using BBEdit? I've tried all the charsets
>in the 'Reopen using encoding' submenu, but to no avail.
>
>I've done this (usually imperfectly) in the past using online
>converters and hacky freeware, but I'd really like to accomplish this
>task within BBEdit.
>
>Any insights welcome.
>
>001. čĺđîěîíŕő Čëŕđčîí (Ŕëôĺĺâ) - Ńëîâî â Âĺëčęóţ Ď˙ňíčöó
>002. Őîđ ěóćńęîăî Äŕíčëîâŕ ě-đ˙ - Ęí˙çč ëţäńňčč ńîáđŕřŕń˙ íŕ Ăîńďîäŕ
>003. Őîđ ěóćńęîăî Äŕíčëîâŕ ě-đ˙ - Ňĺ÷ĺ ăëŕăîë˙ Čóäŕ áĺççŕęîííűě ęíčćíčęîě
>004. Őîđ ěóćńęîăî Äŕíčëîâŕ ě-đ˙ - Ëŕçŕđĺâŕ đŕäč âîńňŕíč˙
>005. Őîđ ěóćńęîăî Äŕíčëîâŕ ě-đ˙ - Äíĺńü Čóäŕ îńňŕâë˙ĺň Ó÷čňĺë˙
--
You received this message because you are subscribed to the
"BBEdit Talk" discussion group on Google Groups.
To post to this group, send email to bbe...@googlegroups.com
To unsubscribe from this group, send email to
bbedit+un...@googlegroups.com
For more options, visit this group at
<http://groups.google.com/group/bbedit?hl=en>
If you have a feature request or would like to report a problem,
please email "sup...@barebones.com" rather than posting to the group.
Follow @bbedit on Twitter: <http://www.twitter.com/bbedit>


This looks like the page is declared as ISO-8859-1 in the meta tag instead of UTF-8. Use the source/page view option to check. Try telling your browser to display it as Character Set UTF-8. What is the URL? I can look at it for you if you want. When you see the characters in groups of 3 (and they are all accented), that is a tip-off for UTF-8. If you look up the ISO-8859-1 code point of each character in one such group, you can see how it converts to the UTF-8 coding and see if it is in the Cyrillic Unicode range. The only problem with this is that Cyrillic is a 2-byte, not a 3-byte, UTF-8 encoding.
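
To make that tip-off concrete, here is a small Python sketch of what genuine UTF-8 Cyrillic looks like once it has been misread as ISO-8859-1. The sample word is only an illustration and is not taken from the page in question.

```python
# A Cyrillic word encoded as UTF-8, then wrongly decoded as ISO-8859-1.
word = "Слово"
raw = word.encode("utf-8")
misread = raw.decode("iso-8859-1")

print(len(word), len(raw))        # 5 letters become 10 bytes: two bytes per letter
print([hex(b) for b in raw])      # the first byte of every pair is 0xd0 or 0xd1

# Reversing the mistake recovers the original text.
print(misread.encode("iso-8859-1").decode("utf-8"))
```

Russian text that really is UTF-8 read as ISO-8859-1 shows up as pairs of characters beginning with Ð or Ñ, so the absence of that pattern is a hint that something else happened.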

eleven

Mar 11, 2011, 5:54:32 AM
to BBEdit Talk
Thanks for the responses. The URL is behind a login screen so there is
no way for me to share it directly. I am pretty sure that the problem
is with the page encoding, however, as you've both suggested.
Firefox's View>Character Encoding just gives other kinds of garbled
text. But I have figured out why, I think. When I view source, the
garbled Cyrillic is really all encoded entities like this (mixed with
Latin accented characters): &#324;&#281;î&#259; in a page that is
rendering as charset=iso-8859-1.

So maybe that is a starting point for me, but I'm guessing this isn't
really a BBEdit topic anymore, so I'll just proceed from here unless
there are any further suggestions.

Thanks.

LuKreme

Mar 11, 2011, 9:19:16 AM
to bbe...@googlegroups.com
On 11-Mar-2011, at 03:54, eleven wrote:
>
> When I view source, the
> garbled Cyrillic is really all encoded entities like this (mixed with
> Latin accented characters): &#324;&#281;î&#259; in a page that is
> rendering as charset=iso-8859-1.

Those entities are not Cyrillic though.

I suspect that someone wrote the page in Cyrillic 8859-5 and then uploaded it to a host that only serves 8859-1 and it all got munged to hell.

Can you try some other encodings in Firefox (Windows Latin 2 maybe, or maybe even KOI8-R) and see if something renders the page correctly? If it does, then copy and paste the page contents.

What does the server claim is the encoding (this is different than what the HTML page claims is the encoding)?

--
Experience is something you don't get until just after you need it.

Robert A. Rosenberg

Mar 11, 2011, 4:36:45 PM
to bbe...@googlegroups.com, LuKreme
At 07:19 AM -0700 on 03/11/2011, LuKreme wrote about Re:
transliterate into cyrillic:

>On 11-Mar-2011, at 03:54, eleven wrote:
> >
>> When I view source, the
>> garbled Cyrillic is really all encoded entities like this (mixed with
>> Latin accented characters): &#324;&#281;î&#259; in a page that is
>> rendering as charset=iso-8859-1.
>
>Those entities are not Cyrillic though.
>
>I suspect that someone wrote the page in Cyrillic 8859-5 and then
>uploaded it to a host that only serves 8859-1 and it all got munged
>to hell.
>
>Can you try some other encodings in Firefox (Windows Latin 2 maybe,
>or maybe even KOI8-R) and see if something renders the page
>correctly? If it does, then copy and paste the page contents.

Firefox has an ISO-8859-5 character set setting, so that can be tried
to see if the text comes out. The problem is that if it WERE
ISO-8859-1, all the codes would still be in the x00-xFF range. Even if
it was originally ISO-8859-5, something else mangled it, since merely
serving it as ISO-8859-1 would just show the ISO-8859-1 glyphs in
place of the Cyrillic ones.

Robert A. Rosenberg

Mar 11, 2011, 4:29:03 PM
to bbe...@googlegroups.com, eleven
At 02:54 AM -0800 on 03/11/2011, eleven wrote about Re: transliterate
into cyrillic:

>Thanks for the responses. The URL is behind a login screen so there is
>no way for me to share it directly. I am pretty sure that the problem
>is with the page encoding, however, as you've both suggested.
>Firefox's View>Character Encoding just gives other kinds of garbled
>text. But I have figured out why, I think. When I view source, the
>garbled Cyrillic is really all encoded entities like this (mixed with
>Latin accented characters): &#324;&#281;î&#259; in a page that is
>rendering as charset=iso-8859-1.


Cyrillic is from &#x400; to &#x52F; as Unicode. This corresponds to
&#xD0;&#x80; to &#xD4;&#xAF; when UTF-8 encoded.

Your &#324;&#281; converts to &#x144; &#x119; which in Unicode is
LATIN SMALL LETTER N WITH ACUTE and LATIN SMALL LETTER E WITH OGONEK
so something major is wrong with the creation of the page.
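
The same lookup can be done mechanically. A short Python sketch, using only the entity string quoted from the page source, that decodes each entity and reports its Unicode name and whether it falls in the Cyrillic range given above:

```python
import html
import unicodedata

sample = "&#324;&#281;î&#259;"   # the entity string quoted from the page source

for ch in html.unescape(sample):
    in_cyrillic = 0x0400 <= ord(ch) <= 0x052F
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  Cyrillic: {in_cyrillic}")
```

All four characters come out as Latin letters with diacritics, none of them in the U+0400 to U+052F range.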

The fact that the charset on the page is ISO-8859-1 is an immediate
tip-off of the problem.

As to sharing the page, there is NO NEED to supply the actual URL. If
you go to the page and do a SAVE AS HTML it will save a copy to your
machine. That file can then be posted to an accessible web site or
sent as an attachment to a private message for analysis.

eleven

Mar 12, 2011, 7:51:21 AM
to BBEdit Talk
Thanks, everyone, for the additional insight.

@LuKreme: I tried every Firefox encoding possibility that had to do
with Cyrillic and even Eastern European. Still garbled, sometimes in
different ways.

@Robert A. Rosenberg: For those interested, I've posted a copy of the
page here: http://dl.dropbox.com/u/2321985/index.html. It's just the
bare html, edited to remove the parts not germane to the discussion;
also no stylesheets, but the header is intact.

In the end, I gather that the garbled text was pulled from mp3 id3
tags, which is no doubt the source of the problem. As far as the html
and server is concerned, perhaps they are doing the 'right thing'
with what they are given.

I would still like to know if it is even possible to do
transliteration chores like this within BBEdit thru a text factory or
similar, as well as any wisdom as to how to go about it. I found an
AppleScript that actually fixes these broken id3 tags from within
iTunes, so I am going to take it apart and see if I can get that to
work somehow in BBEdit.

John Delacour

Mar 12, 2011, 11:05:01 AM
to bbe...@googlegroups.com
At 04:51 -0800 12/03/2011, eleven wrote:

>I would still like to know if it is even possible to do
>transliteration chores like this within BBEdit thru a text factory
>or similar, as well as any wisdom as to how to go about it...

Your problem here is that it is not simple transliteration.

You can convert the decimal html entities easily into characters
using a UNIX filter something like this:

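#!/usr/bin/perl
# Replace three-digit decimal HTML entities (e.g. &#324;) with the
# characters they name.
use strict;
no warnings;    # silences "Wide character" warnings on print
while (<>) {
    s~&#(\d\d\d);~chr($1)~eg;    # &#NNN; -> character with code point NNN
    print;
}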

but that will get you nowhere unless you know, or can guess, what
transformations the original Cyrillic was subjected to in the process
of producing the garbage. At some point it is likely that an attempt
was made to convert something to utf-8 and the raw bytes of the
supposed utf-8 were then converted to decimal html entities where
they were outside the range of iso-8859-1 -- you will see that
characters within range have not been so encoded. The original
Cyrillic could have been in any one of four distinct encodings. This
makes the task even more difficult. The problem arises quite
frequently in badly managed European sites but here the chances are
that the original text was windows-1252 and it's easier to follow the
process back.
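
One way to do that guessing systematically is to try every pairing of "encoding the bytes were misread as" and "encoding the text was originally in". The Python sketch below does this; both lists of encodings are assumptions chosen for illustration, not lists taken from this thread, and the test word is copied from track 001 of the original post.

```python
from itertools import product

# Assumed candidates, not a list taken from the thread.
MISREAD_AS = ["iso8859_1", "cp1252", "iso8859_2", "cp1250"]
ORIGINALS = ["cp1251", "koi8_r", "iso8859_5", "cp866"]


def candidates(garbled):
    """Map (misread_as, original) -> re-decoded text for every pair that works."""
    results = {}
    for wrong, right in product(MISREAD_AS, ORIGINALS):
        try:
            results[(wrong, right)] = garbled.encode(wrong).decode(right)
        except UnicodeError:
            continue  # this pair cannot explain the garbled text
    return results


for pair, text in candidates("Čëŕđčîí").items():
    print(pair, text)
```

Several pairs produce Cyrillic letters, so the output still has to be judged by eye. For this one word the pair (cp1250, cp1251) yields "Иларион", which suggests, without proving, that Windows-1251 text was at some point read as Windows-1250.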

JD


LuKreme

Mar 12, 2011, 3:48:44 PM
to bbe...@googlegroups.com
On Mar 12, 2011, at 5:51 AM, eleven wrote:
> In the end, I gather that the garbled text was pulled from mp3 id3 tags, which is no doubt the source of the problem. As far as the html and server is concerned, perhaps they are doing the 'right thing' with what they are given.

OK, in THAT case the original encoding was almost certainly Windows 1252, so with enough work and probably a perl/php script you should be able to get something useful back if you want to put some time into it.

Basically, you have to reverse the chain of encoding transformations until you get data that looks right.
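
As a rough Python sketch of reversing one such chain: the chain assumed here (Windows-1251 bytes read as Windows-1250, with out-of-range characters then written as decimal entities) is a guess that happens to fit the samples posted in this thread. It is not something the page's history confirms, and the function name and defaults are hypothetical.

```python
import html


def repair(text, misread_as="cp1250", original="cp1251"):
    """Undo one assumed chain: entities -> characters -> bytes -> Cyrillic."""
    chars = html.unescape(text)      # step 1: &#324; becomes the character it names
    raw = chars.encode(misread_as)   # step 2: back to the bytes that were misread
    return raw.decode(original)      # step 3: read those bytes as originally intended


print(repair("&#324;&#281;î&#259;"))
print(repair("Őîđ ěóćńęîăî Äŕíčëîâŕ"))
```

If a step raises a UnicodeError, or the result is not readable Russian, the assumed chain is wrong for that text and another pair of encodings has to be tried.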


Cai Alfredson

Mar 12, 2011, 6:21:02 PM
to bbe...@googlegroups.com
Of course you can write a text factory to do this chore. It will take you a while to get it right, but once you've got it you can use it over and over again.
Just make a string of "replace-all" operations. I used to do this all the time when I worked with Cyrillic encodings several years ago, but I haven't needed it for a while so I don't have a lot of tools any longer (all my old stuff was before OS X).
A tip that might be useful is that you can edit a text factory as XML, so you can create the factory itself using search-and-replace.
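
The list of replace-all pairs does not have to be typed by hand. Here is a Python sketch that generates one, assuming (as a guess, not an established fact about this page) that the text is Windows-1251 misread as Windows-1250. Each entry in the table corresponds to one find/replace pair; how those pairs are written into a text factory file is not shown here.

```python
def build_table(misread_as="cp1250", original="cp1251"):
    """Build a str.translate table of wrong-character -> right-character pairs."""
    table = {}
    for byte in range(0x80, 0x100):
        raw = bytes([byte])
        try:
            wrong, right = raw.decode(misread_as), raw.decode(original)
        except UnicodeDecodeError:
            continue  # byte is unassigned in one of the two encodings
        table[ord(wrong)] = right
    return table


TABLE = build_table()
print(len(TABLE), "replacement pairs")
print("Ńëîâî â Âĺëčęóţ".translate(TABLE))
```

A table like this only undoes a single one-to-one substitution; text that went through more than one bad conversion needs the steps reversed in order.
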
/Cai

Robert A. Rosenberg

Mar 14, 2011, 1:21:26 AM
to bbe...@googlegroups.com
At 04:05 PM +0000 on 03/12/2011, John Delacour wrote about Re:
transliterate into cyrillic:

>At some point it is likely that an attempt was made to convert
>something to utf-8 and the raw bytes of the supposed utf-8 were then
>converted to decimal html entities where they were outside the range
>of iso-8859-1

Anything that is in UTF-8 has each byte either between x00-7F if it is
US-ASCII, or xC0 or above followed by one or more bytes in the
x80-BF range. The number of bytes in the UTF-8 sequence is based
on the number of 1 bits at the start of the first byte before
you get to a 0 bit (thus 110xxxxx is 2 bytes [1 following byte],
1110xxxx is 3 bytes [2 following bytes], etc.). All following
bytes are of the form 10xxxxxx (so if you find one, you look
left until you find one that is of the form 11xxxxxx, which is a start
byte). Details are at http://en.wikipedia.org/wiki/Utf8.
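
The counting rule can be written out as a few lines of Python. This is a minimal sketch with a hypothetical function name; it classifies single bytes by their leading bits only and does not validate whole sequences.

```python
def sequence_length(byte):
    """Return the UTF-8 sequence length implied by a lead byte, or 0 for a continuation byte."""
    if byte < 0x80:
        return 1            # 0xxxxxxx: US-ASCII, stands alone
    if byte < 0xC0:
        return 0            # 10xxxxxx: continuation byte, not a start
    ones = 0
    while byte & (0x80 >> ones):
        ones += 1           # count the leading 1 bits
    return ones


# One Cyrillic letter (2 bytes) and one euro sign (3 bytes).
for b in "Я€".encode("utf-8"):
    print(f"{b:08b}", sequence_length(b))
```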

As to the mangling issue, the codes do not match something converted
into UTF-8. For real Unicode Cyrillic (like the good sample in the
1000 range), here is the breakdown:

Cyrillic is from &#x400; to &#x52F; as Unicode. This corresponds to
&#xD0;&#x80; to &#xD4;&#xAF; when UTF-8 encoded.

The numbers are off for real UTF-8, even if the two bytes are merged into one.
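
The byte values for the two ends of the range can be checked with a couple of lines of Python, shown here only as a quick sanity check. The values 324 and 281 are the decimal entities quoted earlier in the thread.

```python
# UTF-8 bytes for the first and last code points of the Cyrillic range.
for code_point in (0x0400, 0x052F):
    raw = chr(code_point).encode("utf-8")
    print(f"U+{code_point:04X} ->", " ".join(f"{b:02X}" for b in raw))

# The decimal values from the page are too large to be single bytes at all.
print([n <= 0xFF for n in (324, 281)])
```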
