Character maps and code pages(Split)

misko_2083 · Post by **misko_2083** » Tue Jul 20, 2021 12:05 pm

Can you please split the thread @rockedge so we don't pollute it with this?

MochiMoppel wrote: ↑Sun Jul 18, 2021 2:20 am
misko_2083 wrote: ↑Fri Jul 16, 2021 6:35 pm
Our language is completely vocalized with 30 sounds and 30 Cyrillic characters.
However we also have a Latin script with 27 chars and 3 digraphs.
Serbian?

Sometimes people use English keyboards that don't have some letters (or lack a different keyboard input, or are too lazy to make a switch).
Instead they type the closest match and the software transliterates to a completely different letter in Cyrillic.
I assume that the "software" is not iconv and that there is no codepage around that can convert Basic Latin (ASCII) to Cyrillic equivalents, e.g. changing an ASCII L to a Cyrillic Л. On the other hand, a more or less sophisticated search/replace script should be able to do it, starting with the digraphs and leaving only ambiguous characters like Z (could be Ж or З) for manual correction. Don't know if this describes your task, it's just my imagination. A real life example would help.

Serbian language.
lj, nj, and dž are digraphs, though sometimes people write dj (https://en.wikipedia.org/wiki/Novak_Djokovic)
instead of đ, which may cause confusion with words written in the Ijekavian dialect.

The characters missing from the English alphabet are š đ č ć ž.
People sometimes type:
s for s and š (с and ш)
z for z and ž (з and ж)
c for c, ć and č (ц, ћ and ч)

example:
Mozemo li da idemo na rucak?
When transliterated:
Моземо ли да идемо на руцак?
How it should be:
Možemo li da idemo na ručak?
Можемо ли да идемо на ручак?

Perhaps it's easier to fix the Latin text first and then transliterate.
Fred's (@fredx181) copy-code-paste-from-clipboard script with its yad UI would be ideal for this.

koze, kože
Cyrillic - Latin - English / explanation
з, ж - z, ž
козе - koze - goats
коже - kože - skins, leathers

kuce, kuće, kuče
Cyrillic - Latin - English / explanation
ц, ч, ћ - c, č, ć
куче - kuče - dog
куце - kuce - doggies / plural of 'куца - kuca', the diminutive of the word kuče
куће - kuće - houses / plural of the word 'кућа - kuća'
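
The ambiguity in these examples can be made concrete with a short sketch that lists every Cyrillic spelling an ASCII-only word could stand for. It is only an illustration under assumptions: the function name `candidates` and both tables are made up here, only lowercase c, s and z are expanded, and d/dj and the digraphs are ignored.

```shell
#!/bin/bash
# Hypothetical helper: list the Cyrillic spellings an ASCII-only word may stand for.
# Assumption: lowercase input; only c, s and z are treated as ambiguous.
declare -A AMBIGUOUS=( [c]="ц ч ћ" [s]="с ш" [z]="з ж" )
declare -A PLAIN=(
    [a]=а [b]=б [v]=в [g]=г [d]=д [e]=е [i]=и [j]=ј [k]=к [l]=л
    [m]=м [n]=н [o]=о [p]=п [r]=р [t]=т [u]=у [f]=ф [h]=х
)

candidates () {
    local word=$1 ch prefix option i
    local -a results=("") next
    for (( i = 0; i < ${#word}; i++ )); do
        ch=${word:i:1}
        next=()
        for prefix in "${results[@]}"; do
            if [[ -n ${AMBIGUOUS[$ch]} ]]; then
                # Branch: one new candidate per possible Cyrillic letter.
                for option in ${AMBIGUOUS[$ch]}; do
                    next+=("$prefix$option")
                done
            else
                # Unambiguous letter; characters without a mapping pass through unchanged.
                next+=("$prefix${PLAIN[$ch]:-$ch}")
            fi
        done
        results=("${next[@]}")
    done
    printf '%s\n' "${results[@]}"
}
```

`candidates kuce` prints куце, куче and куће, one per line; choosing the right one still needs a dictionary or a human.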

Something like this would change the characters

Code: Select all

#!/bin/bash

declare -A LO_J=(["ј"]="j")

declare -A UP_J=(["Ј"]="J")

declare -A LO_SR_CYR_TO_LAT_DICT=(
    ["а"]="a"
    ["б"]="b"
    ["в"]="v"
    ["г"]="g"
    ["д"]="d"
    ["ђ"]="đ"
    ["е"]="e"
    ["ж"]="ž"
    ["з"]="z"
    ["и"]="i"
    ["к"]="k"
    ["л"]="l"
    ["љ"]="lj"
    ["м"]="m"
    ["н"]="n"
    ["њ"]="nj"
    ["о"]="o"
    ["п"]="p"
    ["р"]="r"
    ["с"]="s"
    ["т"]="t"
    ["ћ"]="ć"
    ["у"]="u"
    ["ф"]="f"
    ["х"]="h"
    ["ц"]="c"
    ["ч"]="č"
    ["џ"]="dž"
    ["ш"]="š"
)

declare -A UP_SR_CYR_TO_LAT_DICT=(
    ["А"]="A"
    ["Б"]="B"
    ["В"]="V"
    ["Г"]="G"
    ["Д"]="D"
    ["Ђ"]="Đ"
    ["Е"]="E"
    ["Ж"]="Ž"
    ["З"]="Z"
    ["И"]="I"
    ["К"]="K"
    ["Л"]="L"
    ["Љ"]="Lj"
    ["М"]="M"
    ["Н"]="N"
    ["Њ"]="Nj"
    ["О"]="O"
    ["П"]="P"
    ["Р"]="R"
    ["С"]="S"
    ["Т"]="T"
    ["Ћ"]="Ć"
    ["У"]="U"
    ["Ф"]="F"
    ["Х"]="H"
    ["Ц"]="C"
    ["Ч"]="Č"
    ["Џ"]="Dž"
    ["Ш"]="Š"
)

    string="абвгдђежзијклљмнњопрстћуфхцчџш"

    echo "String:     $string"

    for letter in "${!LO_SR_CYR_TO_LAT_DICT[@]}"; do
        string="${string//$letter/${LO_SR_CYR_TO_LAT_DICT[$letter]}}"
    done

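    # ј is handled separately; in this direction the order of replacements does not matter
    string="${string//${!LO_J[@]}/${LO_J[@]}}"

    echo "Romanised:  $string"

    # Going back, the digraphs lj, nj and dž must be replaced first:
    # the loop order of an associative array is undefined, so l, n or d
    # could otherwise be converted before the digraph is seen
    string="${string//lj/љ}"
    string="${string//nj/њ}"
    string="${string//dž/џ}"

    for letter in "${!LO_SR_CYR_TO_LAT_DICT[@]}"; do
        string="${string//${LO_SR_CYR_TO_LAT_DICT[$letter]}/$letter}"
    done

    # j must be replaced last because of the digraphs lj, nj
    string="${string//${LO_J[@]}/${!LO_J[@]}}"

    echo "Cyrillic:   ${string}"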

Converting character by character is easy, but there are strings I would not want to transliterate, like web and email addresses.
https://forum.puppylinux.com transliterates to хттпс://форум.пуппyлинуx.цом
Hm, just thinking: if only I could think of an easy way to select the words in a text that should be reverted back to Latin, or that should skip conversion altogether.
A mouse click would be ideal, maybe a combo box or a popup list dialog?
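
One possible way to skip such strings automatically, as a rough sketch: convert word by word and leave alone anything that looks like an address. All names here (`PAIRS`, `to_cyrillic`, `looks_like_address`, `cyrilize_text`) are assumptions made for the example, and the address test is a crude heuristic rather than a real URL parser.

```shell
#!/bin/bash
# Sketch only: Latin -> Cyrillic, word by word, skipping URLs and e-mail addresses.

# Latin:Cyrillic pairs, digraphs first so that lj, nj and dž win over l, n and d.
PAIRS=(
    lj:љ nj:њ dž:џ Lj:Љ Nj:Њ Dž:Џ
    a:а b:б v:в g:г d:д đ:ђ e:е ž:ж z:з i:и j:ј k:к l:л m:м n:н
    o:о p:п r:р s:с t:т ć:ћ u:у f:ф h:х c:ц č:ч š:ш
    A:А B:Б V:В G:Г D:Д Đ:Ђ E:Е Ž:Ж Z:З I:И J:Ј K:К L:Л M:М N:Н
    O:О P:П R:Р S:С T:Т Ć:Ћ U:У F:Ф H:Х C:Ц Č:Ч Š:Ш
)

to_cyrillic () {
    local s=$1 pair from to
    for pair in "${PAIRS[@]}"; do
        from=${pair%%:*}
        to=${pair#*:}
        s=${s//"$from"/$to}
    done
    printf '%s' "$s"
}

looks_like_address () {
    # Heuristic: scheme://..., a www. prefix, or anything containing @.
    [[ $1 == *://* || $1 == www.* || $1 == *@* ]]
}

cyrilize_text () {
    local -a words
    local i
    read -r -a words <<< "$1"    # split one line on whitespace, without globbing
    for i in "${!words[@]}"; do
        looks_like_address "${words[i]}" || words[i]=$(to_cyrillic "${words[i]}")
    done
    printf '%s\n' "${words[*]}"
}
```

With this, `cyrilize_text "forum https://forum.puppylinux.com"` converts the first word and leaves the URL untouched. Runs of whitespace are collapsed to single spaces, and a manual override (the mouse click mentioned above) would still be needed for words the heuristic misses.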

misko_2083 · Post by **misko_2083** » Tue Jul 20, 2021 12:23 pm

Grey wrote: ↑Sun Jul 18, 2021 4:45 pm
MochiMoppel wrote: ↑Sun Jul 18, 2021 2:20 am
misko_2083 wrote: ↑Fri Jul 16, 2021 6:35 pm
Our language is completely vocalized with 30 sounds and 30 Cyrillic characters.
However we also have a Latin script with 27 chars and 3 digraphs.
Serbian?

Sometimes people use English keyboards that don't have some letters (or lack a different keyboard input, or are too lazy to make a switch).
Instead they type the closest match and the software transliterates to a completely different letter in Cyrillic.
I assume that the "software" is not iconv and that there is no codepage around that can convert Basic Latin (ASCII) to Cyrillic equivalents, e.g. changing an ASCII L to a Cyrillic Л. On the other hand, a more or less sophisticated search/replace script should be able to do it, starting with the digraphs and leaving only ambiguous characters like Z (could be Ж or З) for manual correction. Don't know if this describes your task, it's just my imagination. A real life example would help.

Yes, he meant Serbian - I remember from his screenshots the words "сликовни" and "уклони". Serbia (or Montenegro) is doing well.
You have not yet seen how encodings are used in the countries of the former USSR, or the English keyboards with letter stickers glued on them: a mixture of Cyrillic and Latin plus local flavor. But the people have adapted.

Good memory Grey. I can't even remember what I ate for breakfast.

Talk about adapting. On this keyboard q w y x = љ њ ж џ.
I don't understand spoken Russian, it's like comparing Dutch with English.
Written I can decode to some extent.

Grey · Post by **Grey** » Tue Jul 20, 2021 3:00 pm

Hello. Previously, the Russian keyboard was qwerty = яверты. Currently qwerty = йцукен. The Serbian keyboard uses the first option, that is, each Latin letter is replaced by the corresponding Cyrillic one, and if there is no match, a unique letter is used instead.

And how, for example, is such a problem solved in Ubuntu? Or not resolved? Really, no one has yet made a utility according to the principle of a "dictionary", so that when selecting a text, it can be transformed in both directions?

williams2 · Post by **williams2** » Tue Jul 20, 2021 5:53 pm

Would this be useful? http://mashke.org/Conv/

It uses Perl scripts that can be downloaded and run in a terminal or from a shell script.
It also has a small built-in dictionary of common words.
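
For illustration, a dictionary pass of this kind could look like the sketch below. It does not reproduce the Perl scripts linked above; the `DIACRITICS` table and the `restore_diacritics` function are hypothetical, and the two entries come from the example sentence earlier in this thread.

```shell
#!/bin/bash
# Sketch: restore diacritics in ASCII-only Serbian Latin text from a word list.
# Assumption: only words with a single possible reading are listed;
# ambiguous ones such as koze or kuce are deliberately left out.
declare -A DIACRITICS=(
    [mozemo]="možemo"
    [rucak]="ručak"
)

restore_diacritics () {
    local -a words
    local i word core tail fixed
    read -r -a words <<< "$1"            # split one line on whitespace
    for i in "${!words[@]}"; do
        word=${words[i]}
        core=${word%%[[:punct:]]*}       # part before the first punctuation mark
        tail=${word#"$core"}             # trailing punctuation, if any
        [[ -n $core ]] || continue
        fixed=${DIACRITICS[${core,,}]}   # lookup is case-insensitive
        if [[ -n $fixed ]]; then
            # Keep a leading capital letter from the input.
            [[ $core == [[:upper:]]* ]] && fixed=${fixed^}
            words[i]=$fixed$tail
        fi
    done
    printf '%s\n' "${words[*]}"
}
```

`restore_diacritics "Mozemo li da idemo na rucak?"` yields the corrected Latin sentence, which can then be transliterated to Cyrillic; words that are not in the table pass through unchanged.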

misko_2083 · Post by **misko_2083** » Tue Jul 20, 2021 6:36 pm

Grey wrote: ↑Tue Jul 20, 2021 3:00 pm
Hello. Previously, the Russian keyboard was qwerty = яверты. Currently qwerty = йцукен. The Serbian keyboard uses the first option, that is, each Latin letter is replaced by the corresponding Cyrillic one, and if there is no match, a unique letter is used instead.

And how, for example, is such a problem solved in Ubuntu? Or not resolved? Really, no one has yet made a utility according to the principle of a "dictionary", so that when selecting a text, it can be transformed in both directions?

There are Serbian Cyrillic keyboards and Yu keyboards. Yu keyboards are more common.
I found out there is a LibreOffice macro that can transliterate selected text in both directions: https://extensions.libreoffice.org/en/e ... ootranslit
It adds a small toolbar on the right, but it only works for Serbian.

Grey · Post by **Grey** » Tue Jul 20, 2021 7:17 pm

Five interesting projects.
0. Cirlat.
1. Cirilica converter in Python.
2. Serbian Translit for VSCode.
3. CyrLatConverter (JavaScript). Git and Home page.
4. Terminator No - Ћирилизатор - browser extension. Git and Home page.
At least some of this can be compiled/run in Fossapup, I think.
Don't forget to rename the topic. Something about Serbia or Cyrillic.

MochiMoppel · Post by **MochiMoppel** » Wed Jul 21, 2021 2:30 am

misko_2083 wrote: ↑Tue Jul 20, 2021 12:05 pm

Something like this would change the characters

Something like this would too (hopefully). On my machine using sed is 4x faster than a loop:

Code: Select all

#!/bin/bash
cyrstring="хттпс://форум.пуппyлинуx.цом"                #cyrillic
latstring="Možemo li da idemo na ručak? kuče kuce kuće" #romanized Serbian 
ascstring="Mozemo li da idemo na rucak? kuce kuce kuce" #ASCIIfied Serbian

# Note: sed's y command needs a UTF-8 locale to treat these as characters rather than bytes
CYR=абвгдђежзијклмнопрстћуфхцчшАБВГДЂЕЖЗИЈКЛМНОПРСТЋУФХЦЧШ #Unicode chars of block "Cyrillic"
LAT=abvgdđežzijklmnoprstćufhcčšABVGDĐEŽZIJKLMNOPRSTĆUFHCČŠ #ASCII and Unicode chars of block "Latin Extended-A"
CAS=абвгеијклмнопртуфхАБВГЕИЈКЛМНОПРТУФХ #cyrillic chars with unambiguous ASCII equivalents
ASC=abvgeijklmnoprtufhABVGEIJKLMNOPRTUFH #ASCII

romanize () {
	echo "$@" | sed "
	s/љ/lj/g
	s/њ/nj/g
	s/џ/dž/g
	s/Љ/Lj/g
	s/Њ/Nj/g
	s/Џ/Dž/g
	y/$CYR/$LAT/
	"
}

asci2cyr () {
	echo "$@" | sed "
	s/lj/љ/g
	s/nj/њ/g
	s/dz/џ/g
	s/Lj/Љ/g
	s/Nj/Њ/g
	s/Dz/Џ/g
	s/d/<дђ>/g
	s/z/<жз>/g
	s/s/<сш>/g
	s/c/<ћцч>/g
	s/D/<ДЂ>/g
	s/Z/<ЖЗ>/g
	s/S/<СШ>/g
	s/C/<ЋЦЧ>/g
	y/$ASC/$CAS/
	"
}

cyrilize () {
	echo "$@" | sed "
	s/lj/љ/g
	s/nj/њ/g
	s/dž/џ/g
	s/Lj/Љ/g
	s/Nj/Њ/g
	s/Dž/Џ/g
	y/$LAT/$CYR/
	"
}

echo     "### Cyrillic transcribed to Latin"
echo     "$cyrstring"
romanize "$cyrstring"
echo -e  "\n### Latin transcribed to Cyrillic"
echo     "$latstring"
cyrilize "$latstring"
echo -e  "\n### Pure ASCII transcribed to Cyrillic (ambiguous chars <marked> )"
echo     "$ascstring"
asci2cyr "$ascstring"

Hm, just thinking: if only I could think of an easy way to select the words in a text that should be reverted back to Latin, or that should skip conversion altogether.
A mouse click would be ideal, maybe a combo box or a popup list dialog?

The easiest way may be to use xclip to send the converted string to the clipboard, then Ctrl+V to overwrite the source string.
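
A minimal sketch of that idea, assuming xclip is installed and an X session is running. The function names are hypothetical, and `convert` is only a placeholder filter that handles two digraphs; in practice it would call a full converter such as the cyrilize function above.

```shell
#!/bin/bash
# Sketch: take the text currently selected with the mouse (PRIMARY selection),
# convert it, and put the result on the CLIPBOARD selection for Ctrl+V.

read_selection ()  { xclip -o -selection primary; }
write_clipboard () { xclip -i -selection clipboard; }

# Placeholder filter reading stdin; swap in a complete transliteration here.
convert () { sed 's/lj/љ/g; s/nj/њ/g'; }

convert_selection () {
    read_selection | convert | write_clipboard
}
```

Bound to a hotkey, `convert_selection` would let you select the text, press the key, and paste the converted string over the source with Ctrl+V.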

misko_2083 · Post by **misko_2083** » Sat Jul 31, 2021 10:28 am

Grey wrote: ↑Tue Jul 20, 2021 7:17 pm
Five interesting projects.
0. Cirlat.
1. Cirilica converter in Python.
2. Serbian Translit for VSCode.
3. CyrLatConverter (JavaScript). Git and Home page.
4. Terminator No - Ћирилизатор - browser extension. Git and Home page.
At least some of this can be compiled/run in Fossapup, I think.
Don't forget to rename the topic. Something about Serbia or Cyrillic.

Thanks. So many options to choose from.
YouTube kind of tries to do that automatically, and it often switches English Latin to English Cyrillic.

MochiMoppel wrote: ↑Wed Jul 21, 2021 2:30 am

The easiest way may be to use xclip to send the converted string to the clipboard, then Ctrl+V to overwrite the source string.

Thanks MochiMoppel, sed is faster indeed.

Grey · Post by **Grey** » Sat Jul 31, 2021 2:09 pm

misko_2083 wrote: ↑Sat Jul 31, 2021 10:28 am
YouTube kind of tries to do that automatically, and it often switches English Latin to English Cyrillic.

Yes, Гунфигхтерс Моон, it sounds original
Гојко Митић is the best chief of the Redskins and he is not afraid of any cowboys and гунфигхтерс

misko_2083 · Post by **misko_2083** » Sun Aug 15, 2021 2:08 am

Grey wrote: ↑Sat Jul 31, 2021 2:09 pm
misko_2083 wrote: ↑Sat Jul 31, 2021 10:28 am
YouTube kind of tries to do that automatically, and it often switches English Latin to English Cyrillic.

Yes, Гунфигхтерс Моон, it sounds original
Гојко Митић is the best chief of the Redskins and he is not afraid of any cowboys and гунфигхтерс

That's another subject for the off-topic category.
