Character maps and code pages(Split)

For discussions about programming, and for programming questions and advice


Moderator: Forum moderators

misko_2083
Posts: 196
Joined: Wed Dec 09, 2020 11:59 pm
Has thanked: 10 times
Been thanked: 20 times

Character maps and code pages(Split)

Post by misko_2083 »

Can you please split the thread @rockedge so we don't pollute it with this?

MochiMoppel wrote: Sun Jul 18, 2021 2:20 am
misko_2083 wrote: Fri Jul 16, 2021 6:35 pm

Our language is completely vocalized with 30 sounds and 30 Cyrillic characters.
However we also have a Latin script with 27 chars and 3 digraphs.

Serbian?

Sometimes people use English keyboards that don't have some letters (or lack a different keyboard input, or are too lazy to make a switch).
Instead they type the closest match and the software transliterates to a completely different letter in Cyrillic.

I assume that the "software" is not iconv and that there is no codepage around that can convert Basic Latin (ASCII) to Cyrillic equivalents, e.g. changing an ASCII L to a Cyrillic Л. On the other hand, a more or less sophisticated search/replace script should be able to do it, starting with the digraphs and leaving only ambiguous characters like Z (could be Ж or З) for manual correction. I don't know if this describes your task; it's just my imagination. A real-life example would help.

Serbian language.
lj, nj, and dž are digraphs, though people sometimes write dj instead of đ (see https://en.wikipedia.org/wiki/Novak_Djokovic),
which may cause confusion with words written in the Ijekavian dialect.

The characters missing from the English alphabet are š đ č ć ž
People sometimes type
s for s and š (с and ш)
z for z and ž (з and ж)
c for c, ć and č (ц, ћ and ч)

example:
Mozemo li da idemo na rucak?
When transliterated:
Моземо ли да идемо на руцак?
How it should be:
Možemo li da idemo na ručak?
Можемо ли да идемо на ручак?

Perhaps it's easier to fix the Latin text first and then transliterate.
Fred's (@fredx181) copy-code-paste-from-clipboard script with a yad UI would be ideal for this.

koze, kože
Cyrillic - Latin - English / explanation
з, ж - z, ž
козе - koze - goats
коже - kože - skins, leathers

kuce, kuće, kuče
Cyrillic - Latin - English / explanation
ц, ч, ћ - c, č, ć
куче - kuče - dog
куце - kuce - doggies / diminutive of the word kuče is 'куца - kuca'; in plural 'куце - kuce'
куће - kuće - houses / plural of the word 'кућа - kuća'
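
To make the ambiguity concrete, here is a rough sketch (not part of any existing tool) that lists every Latin spelling an ASCII-typed word could stand for. The function names expand_word and latin_candidates are made up for this example, and it assumes that only c, s and z are ambiguous; dj/đ is left out.

```shell
#!/bin/bash
# Sketch only: enumerate the Latin spellings an ASCII-typed word may stand for.
# Assumption: only c, s and z are ambiguous (dj/đ is not handled here).
expand_word() {
    local prefix=$1 rest=$2 alt
    if [[ -z $rest ]]; then
        printf '%s\n' "$prefix"      # nothing left to expand: emit one candidate
        return
    fi
    local ch=${rest:0:1} tail=${rest:1}
    case $ch in
        c) for alt in c č ć; do expand_word "$prefix$alt" "$tail"; done ;;
        s) for alt in s š;   do expand_word "$prefix$alt" "$tail"; done ;;
        z) for alt in z ž;   do expand_word "$prefix$alt" "$tail"; done ;;
        *) expand_word "$prefix$ch" "$tail" ;;
    esac
}

# Hypothetical wrapper: takes one ASCII word, prints one candidate per line
latin_candidates() { expand_word "" "$1"; }

latin_candidates kuce    # prints kuce, kuče and kuće, one per line
```

A dictionary lookup or a human still has to pick the right candidate; the sketch only shows how quickly the choices multiply.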

Something like this would change the characters

Code:

#!/bin/bash

declare -A LO_J=(["ј"]="j")

declare -A UP_J=(["Ј"]="J")

declare -A LO_SR_CYR_TO_LAT_DICT=(
    ["а"]="a"
    ["б"]="b"
    ["в"]="v"
    ["г"]="g"
    ["д"]="d"
    ["ђ"]="đ"
    ["е"]="e"
    ["ж"]="ž"
    ["з"]="z"
    ["и"]="i"
    ["к"]="k"
    ["л"]="l"
    ["љ"]="lj"
    ["м"]="m"
    ["н"]="n"
    ["њ"]="nj"
    ["о"]="o"
    ["п"]="p"
    ["р"]="r"
    ["с"]="s"
    ["т"]="t"
    ["ћ"]="ć"
    ["у"]="u"
    ["ф"]="f"
    ["х"]="h"
    ["ц"]="c"
    ["ч"]="č"
    ["џ"]="dž"
    ["ш"]="š"
)

declare -A UP_SR_CYR_TO_LAT_DICT=(
    ["А"]="A"
    ["Б"]="B"
    ["В"]="V"
    ["Г"]="G"
    ["Д"]="D"
    ["Ђ"]="Đ"
    ["Е"]="E"
    ["Ж"]="Ž"
    ["З"]="Z"
    ["И"]="I"
    ["К"]="K"
    ["Л"]="L"
    ["Љ"]="Lj"
    ["М"]="M"
    ["Н"]="N"
    ["Њ"]="Nj"
    ["О"]="O"
    ["П"]="P"
    ["Р"]="R"
    ["С"]="S"
    ["Т"]="T"
    ["Ћ"]="Ć"
    ["У"]="U"
    ["Ф"]="F"
    ["Х"]="H"
    ["Ц"]="C"
    ["Ч"]="Č"
    ["Џ"]="Dž"
    ["Ш"]="Š"
)

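string="абвгдђежзијклљмнњопрстћуфхцчџш"

echo "String:     $string"

# Cyrillic -> Latin: every key is a single character, so the order does not matter
for letter in "${!LO_SR_CYR_TO_LAT_DICT[@]}"; do
    string="${string//$letter/${LO_SR_CYR_TO_LAT_DICT[$letter]}}"
done
string="${string//${!LO_J[@]}/${LO_J[@]}}"

echo "Romanised:  $string"

# Latin -> Cyrillic: the digraphs lj, nj and dž must be replaced first.
# Associative arrays have no defined order, so l, n or d could otherwise
# be converted before the digraph is seen.
for letter in љ њ џ; do
    string="${string//${LO_SR_CYR_TO_LAT_DICT[$letter]}/$letter}"
done
for letter in "${!LO_SR_CYR_TO_LAT_DICT[@]}"; do
    string="${string//${LO_SR_CYR_TO_LAT_DICT[$letter]}/$letter}"
done

# j is replaced last, once the digraphs lj and nj are gone
string="${string//${LO_J[@]}/${!LO_J[@]}}"

echo "Cyrillic:   ${string}"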

Character by character is easy, but there are strings I would not want to transliterate, like web and email addresses.
https://forum.puppylinux.com transliterates to хттпс://форум.пуппyлинуx.цом
Hm, just thinking, if only I could think of an easy way to select the words in a text which should be reverted back to Latin or skipped altogether.
A mouse click would be ideal, maybe a combo box or a popup list dialog?
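
One possible way to skip such strings automatically, as a minimal sketch rather than a finished tool: split the line into words and leave anything that looks like a URL or an e-mail address untouched. The names PAIRS, cyrillize_word and cyrillize_text and the two regular expressions are assumptions made up for this example; it handles lowercase only and a single line of input.

```shell
#!/bin/bash
# Ordered Latin:Cyrillic pairs. Digraphs come first so that "lj" is not
# split into "l" + "j". Lowercase only, to keep the sketch short.
PAIRS=(lj:љ nj:њ dž:џ a:а b:б v:в g:г d:д đ:ђ e:е ž:ж z:з i:и j:ј k:к l:л
       m:м n:н o:о p:п r:р s:с t:т ć:ћ u:у f:ф h:х c:ц č:ч š:ш)

# Transliterate one word from Latin to Cyrillic
cyrillize_word() {
    local word=$1 pair from to
    for pair in "${PAIRS[@]}"; do
        from=${pair%%:*}
        to=${pair#*:}
        word=${word//"$from"/$to}
    done
    printf '%s' "$word"
}

# Transliterate a line, keeping words that look like URLs or e-mail addresses
cyrillize_text() {
    local url_re='^([a-z]+://|www\.)'        # assumed heuristic, not exhaustive
    local mail_re='^[^@]+@[^@]+\.[a-z]+$'    # assumed heuristic, not exhaustive
    local words word out=()
    read -ra words <<< "$1"
    for word in "${words[@]}"; do
        if [[ $word =~ $url_re || $word =~ $mail_re ]]; then
            out+=("$word")                   # keep as typed
        else
            out+=("$(cyrillize_word "$word")")
        fi
    done
    printf '%s\n' "${out[*]}"
}

cyrillize_text "idemo na https://forum.puppylinux.com sutra"
# prints: идемо на https://forum.puppylinux.com сутра
```

Anything the heuristics miss (an address followed by a full stop, a file name, a nickname) would still need the manual selection described above.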

Do you want to exit the Circus? The Harsh Truth
https://www.youtube.com/watch?v=ZJwQicZHp_c

misko_2083
Posts: 196
Joined: Wed Dec 09, 2020 11:59 pm
Has thanked: 10 times
Been thanked: 20 times

Re: Character maps and code pages

Post by misko_2083 »

Grey wrote: Sun Jul 18, 2021 4:45 pm
Yes, he meant Serbian - I remember from his screenshots the words "сликовни" and "уклони". Serbia (or Montenegro) is doing well.
You have not yet seen how encodings are used in the countries of the former USSR and English keyboards on which stickers with letters are glued - a mixture of Cyrillic and Latin + local flavor. But the people have adapted :)

Good memory Grey. I can't even remember what I ate for breakfast. :)

Talk about adapting. On this keyboard q w y x = љ њ ж џ.
I don't understand spoken Russian; it's like comparing Dutch with English.
Written Russian I can decode to some extent.

Grey
Posts: 2003
Joined: Wed Jul 22, 2020 12:33 am
Location: Russia
Has thanked: 75 times
Been thanked: 365 times

Re: Character maps and code pages

Post by Grey »

Hello. Previously, the Russian keyboard was qwerty = яверты. Currently qwerty = йцукен. The Serbian keyboard uses the first approach: each Latin letter is replaced by the corresponding Cyrillic one, and where there is no match, the key gets a unique Cyrillic letter.

And how, for example, is this problem solved in Ubuntu? Or is it not solved? Has really no one made a dictionary-based utility yet, so that selected text can be converted in both directions?

Fossapup OS, Ryzen 5 3600 CPU, 64 GB RAM, GeForce GTX 1050 Ti 4 GB, Sound Blaster Audigy Rx with amplifier + Yamaha speakers for loud sound, USB Sound Blaster X-Fi Surround 5.1 Pro V3 + headphones for quiet sound.

williams2
Posts: 1026
Joined: Sat Jul 25, 2020 5:45 pm
Been thanked: 291 times

Re: Character maps and code pages(Split)

Post by williams2 »

Would this be useful? http://mashke.org/Conv/

It uses Perl scripts that can be downloaded and run in a terminal or from a shell script.
It has a small built-in dictionary of common words.

misko_2083
Posts: 196
Joined: Wed Dec 09, 2020 11:59 pm
Has thanked: 10 times
Been thanked: 20 times

Re: Character maps and code pages(Split)

Post by misko_2083 »

Grey wrote: Tue Jul 20, 2021 3:00 pm

Hello. Previously, the Russian keyboard was qwerty = яверты. Currently qwerty = йцукен. The Serbian keyboard uses the first approach: each Latin letter is replaced by the corresponding Cyrillic one, and where there is no match, the key gets a unique Cyrillic letter.

And how, for example, is this problem solved in Ubuntu? Or is it not solved? Has really no one made a dictionary-based utility yet, so that selected text can be converted in both directions?

There are Serbian Cyrillic keyboards and Yu keyboards. Yu keyboards are more common.
I found out there is a LibreOffice macro that can transliterate selected text in both directions https://extensions.libreoffice.org/en/e ... ootranslit
It adds a small toolbar to the right.
But only for Serbian.

Grey
Posts: 2003
Joined: Wed Jul 22, 2020 12:33 am
Location: Russia
Has thanked: 75 times
Been thanked: 365 times

Re: Character maps and code pages(Split)

Post by Grey »

Five interesting projects.
0. Cirlat.
1. Cirilica converter in Python.
2. Serbian Translit for VSCode.
3. CyrLatConverter (JavaScript). Git and Home page.
4. Terminator :) No - Ћирилизатор - browser extension. Git and Home page.
At least some of this can be compiled/run in Fossapup, I think :)
Don't forget to rename the topic. Something about Serbia or Cyrillic ;)

MochiMoppel
Posts: 1139
Joined: Mon Jun 15, 2020 6:25 am
Location: Japan
Has thanked: 18 times
Been thanked: 372 times

Re: Character maps and code pages(Split)

Post by MochiMoppel »

misko_2083 wrote: Tue Jul 20, 2021 12:05 pm

Something like this would change the characters


Something like this would too (hopefully). On my machine using sed is 4x faster than a loop:

Code:

#!/bin/bash
cyrstring="хттпс://форум.пуппyлинуx.цом"                #cyrillic
latstring="Možemo li da idemo na ručak? kuče kuce kuće" #romanized Serbian 
ascstring="Mozemo li da idemo na rucak? kuce kuce kuce" #ASCIIfied Serbian

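# sed's y command needs both lists to have the same number of characters; a UTF-8 locale is assumed
CYR=абвгдђежзијклмнопрстћуфхцчшАБВГДЂЕЖЗИЈКЛМНОПРСТЋУФХЦЧШ #Unicode chars of block "Cyrillic" (uppercase Ј added)
LAT=abvgdđežzijklmnoprstćufhcčšABVGDĐEŽZIJKLMNOPRSTĆUFHCČŠ #ASCII and Unicode chars of block "Latin Extended-A" (uppercase J added)
CAS=абвгеијклмнопртуфхАБВГЕИЈКЛМНОПРТУФХ #cyrillic chars with unambiguous ASCII equivalents (Д dropped: D is ambiguous, Ј added)
ASC=abvgeijklmnoprtufhABVGEIJKLMNOPRTUFH #ASCII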

romanize () {
	echo "$@" | sed "
	s/љ/lj/g
	s/њ/nj/g
	s/џ/dž/g
	s/Љ/Lj/g
	s/Њ/Nj/g
	s/Џ/Dž/g
	y/$CYR/$LAT/
	"
}

asci2cyr () {
	echo "$@" | sed "
	s/lj/љ/g
	s/nj/њ/g
	s/dz/џ/g
	s/Lj/Љ/g
	s/Nj/Њ/g
	s/Dz/Џ/g
	s/d/<дђ>/g
	s/z/<жз>/g
	s/s/<сш>/g
	s/c/<ћцч>/g
	s/D/<ДЂ>/g
	s/Z/<ЖЗ>/g
	s/S/<СШ>/g
	s/C/<ЋЦЧ>/g
	y/$ASC/$CAS/
	"
}

cyrilize () {
	echo "$@" | sed "
	s/lj/љ/g
	s/nj/њ/g
	s/dž/џ/g
	s/Lj/Љ/g
	s/Nj/Њ/g
	s/Dž/Џ/g
	y/$LAT/$CYR/
	"
}

echo     "### Cyrillic transcribed to Latin"
echo     "$cyrstring"
romanize "$cyrstring"
echo -e  "\n### Latin transcribed to Cyrillic"
echo     "$latstring"
cyrilize "$latstring"
echo -e  "\n### Pure ASCII transcribed to Cyrillic (ambiguous chars <marked> )"
echo     "$ascstring"
asci2cyr "$ascstring"

Hm, just thinking, if only I could think of an easy way to select the words in a text which should be reverted back to Latin or skipped altogether.
Mouse click would be ideal, maybe a combo box or a popup list dialog?

The easiest way may be to use xclip to send the converted string to the clipboard, then Ctrl+V to overwrite the source string.
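
A rough sketch of how that could look, assuming xclip is installed and an X session is running. The function names read_selection, write_clipboard and convert_selection are made up for this example; only the xclip options -o, -i and -selection are real. The filter command is passed as arguments, so any converter that reads stdin and writes stdout can be plugged in.

```shell
#!/bin/bash
# Read the text that is currently highlighted with the mouse (PRIMARY selection)
read_selection()  { xclip -o -selection primary; }

# Put stdin on the clipboard, so that Ctrl+V pastes it
write_clipboard() { xclip -i -selection clipboard; }

# Usage: convert_selection FILTER [ARGS...]
# FILTER is any command that reads text on stdin and writes the converted
# text to stdout, for example a sed call with the substitutions shown above.
convert_selection() {
    read_selection | "$@" | write_clipboard
}

# Example (not run here): romanize the three digraph letters of the selection
#   convert_selection sed 's/љ/lj/g; s/њ/nj/g; s/џ/dž/g'
```

Select the text, run the script (for instance from a keyboard shortcut), then press Ctrl+V over the still-selected text to replace it.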

misko_2083
Posts: 196
Joined: Wed Dec 09, 2020 11:59 pm
Has thanked: 10 times
Been thanked: 20 times

Re: Character maps and code pages(Split)

Post by misko_2083 »

Grey wrote: Tue Jul 20, 2021 7:17 pm

Five interesting projects.
0. Cirlat.
1. Cirilica converter in Python.
2. Serbian Translit for VSCode.
3. CyrLatConverter (JavaScript). Git and Home page.
4. Terminator :) No - Ћирилизатор - browser extension. Git and Home page.
At least some of this can be compiled/run in Fossapup, I think :)
Don't forget to rename the topic. Something about Serbia or Cyrillic ;)

Thanks. So many options to choose from.
YouTube tries to do that automatically, and it often turns English written in Latin script into English written in Cyrillic.

MochiMoppel wrote: Wed Jul 21, 2021 2:30 am

The easiest way may be to use xclip to send the converted string to the clipboard, then Ctrl+V to overwrite the source string.
Thanks MochiMoppel, sed is indeed faster.

Grey
Posts: 2003
Joined: Wed Jul 22, 2020 12:33 am
Location: Russia
Has thanked: 75 times
Been thanked: 365 times

Re: Character maps and code pages(Split)

Post by Grey »

misko_2083 wrote: Sat Jul 31, 2021 10:28 am

Youtube kind of tries to do that automatically and it often switches English Latin to English Cyrillic.

Yes, Гунфигхтерс Моон, it sounds original :)
Гојко Митић is the best chief of the Redskins and he is not afraid of any cowboys and гунфигхтерс :)
misko_2083
Posts: 196
Joined: Wed Dec 09, 2020 11:59 pm
Has thanked: 10 times
Been thanked: 20 times

Re: Character maps and code pages(Split) (Solved)

Post by misko_2083 »

Grey wrote: Sat Jul 31, 2021 2:09 pm
misko_2083 wrote: Sat Jul 31, 2021 10:28 am

Youtube kind of tries to do that automatically and it often switches English Latin to English Cyrillic.

Yes, Гунфигхтерс Моон, it sounds original :)
Гојко Митић is the best chief of the Redskins and he is not afraid of any cowboys and гунфигхтерс :)

That's another subject for the off-topic category. :D
