Year in Review 2021

2021 was a difficult year for everyone living through the second year of a global pandemic; however, the Tamil computing community made much progress. Here is my take on it.

	Event	Comments	Date
1	Rust language support	Tokenizer for REST rust_v0.1	Jan 17th ’21
2	open-tamil v1.0	Release v1.0 : bug-fix pypi	Apr 18th ’21
3	tamilinayavaani v0.14	Release v0.14 : pypi	Dec 5th ’21
4	Book Translation of ‘Practical Algorithms and Data Structures’	pending – typeset + copy-edit; 220 page book	Nov ’21
5	Relaunch Min Madurai Tamil app	Google Play Store : link	Sep 8 ’21
6	Tutorial for TIC 20th – Keras AI	Beginning AI applications: link	Dec 4th ’21

Ezhil Language Foundation related activities in 2021

This year has been tough, but we kept our heads above water and are ready for another challenging year in 2022. I’m also happy to share that I’ve volunteered to serve on the steering committee of INFITT, to share some open-source viewpoints from my experience and some AI/ML strategies for developing our ecosystem.

Some of the major INFITT events in 2021 were the successful Hackathon for college students at KCT in Kovai, and the 20th TIC, organized virtually with good turnout and contributions from industry and academia.

I hope you are vaccinated, staying healthy, and in a positive frame of mind to have a successful year and to share some of your contributions with the Tamil community.

Sincerely

-Muthu

Spell Checker – What We Know 5

In this post of the article series, let us look at how the error-correction algorithm of a spell checker is structured, at a high level.

Figure 1: In Mexico, the image of Our Lady of Guadalupe is widely revered. To me it brings Poondi Matha and Velankanni Matha to mind. Location: Berkeley, California #Mexico #mural #ourladyofguadalupe

1 Spell-correction algorithm

Input: the words of the text, one at a time. To make the context clear, we may also pass the line in which the word occurs as context input.

Output: a list of the misspelled words and, for each such word, a list of alternative words that could replace it.

To implement such an algorithm we need a word list; following common usage we will call it a dictionary. That is, we need only each word and its correct spelling; the meaning of the word is not needed at first. So we will treat this word list by itself as the dictionary.

As a first step, any word of the text that is found directly in the list is taken to be correct and can be removed. E.g., in the 10-word sentence “அவன் வாத்து முட்டை விருப்பம் கொண்டவளை மட்டுமே சமைக்க தேர்ந்தெடுப்பதாக சீனாவில் அறிவித்திருந்தான்”, the words ‘அவன்’, ‘வாத்து’, ‘முட்டை’ and ‘விருப்பம்’ will be found as-is in the word list. Now 6 words remain.

Next, if we have a list of proper nouns, words matching it can be removed as well. In the artificial example above, the proper noun ‘சீனா’ (China) is found directly in this list. Now 5 words remain.

Next come verbs, grammatically classified particles, metonyms and so on. If we analyze them correctly and recognize them by rule, then starting from a list of only a few root words we can, through their generative nature, approach the analysis of many words. In Tamil, as in Latin, verbal participles and verbs change form depending on where they occur in the sentence. E.g., in the two sentences ‘அவன் ஒரு சட்டை வாங்க சென்றான்’ and ‘அவள் ஒரு சட்டை வாங்க செல்வாள்’, the root ‘செல்’ inflects to ‘சென்றான்’ for the masculine and to ‘செல்வாள்’ for the feminine. This forms a rather complex slice of the algorithm; it can be regarded as more linguistics and somewhat less computer science.

Prof. Rajam has implemented the rules of Tamil grammar, specifically the rules for how verbs inflect, in software on the site letsgrammar.org and explained them beautifully. In English these are called ‘word declension rules’.

Numbers, Sanskrit-origin words, stop words, plural words, English words and so on can also be detected and either removed from the text or corrected. Typing errors, Unicode errors and the like can be removed at this stage too.
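The first stages of this filtering can be sketched in a few lines of Python. This is only a minimal sketch: the word lists are tiny hypothetical stand-ins for real resources, the function name `filter_known_words` is made up for illustration, and the suffix-stripping step is a crude placeholder for the rule-based morphological analysis described above.

```python
# Minimal sketch of the staged filtering; all word lists are hypothetical.

def filter_known_words(words, dictionary, proper_nouns, suffixes):
    """Return the words that survive every stage, i.e. the suspects."""
    remaining = []
    for word in words:
        # Stage 1: the word appears verbatim in the dictionary.
        if word in dictionary:
            continue
        # Stage 2: the word appears verbatim in the proper-noun list.
        if word in proper_nouns:
            continue
        # Stage 3 (crude stand-in for real morphology): strip a known
        # suffix and check whether the remaining stem is a listed word.
        if any(word.endswith(s) and word[: -len(s)] in dictionary | proper_nouns
               for s in suffixes):
            continue
        remaining.append(word)
    return remaining

dictionary = {"அவன்", "வாத்து", "முட்டை", "விருப்பம்"}
proper_nouns = {"சீனா"}
suffixes = ("வில்",)  # assumed locative ending, for illustration only

sentence = ("அவன் வாத்து முட்டை விருப்பம் கொண்டவளை மட்டுமே "
            "சமைக்க தேர்ந்தெடுப்பதாக சீனாவில் அறிவித்திருந்தான்")
suspects = filter_known_words(sentence.split(), dictionary, proper_nouns, suffixes)
print(len(suspects), suspects)
```

With these toy lists, five words of the example sentence are left over for the later, more linguistic stages, matching the count in the walkthrough.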

2 Error types

As described above, in each of its four stages the spell checker recognizes a word [in the text] as belonging to the corresponding class; if that word has taken a wrong form, it is a misspelled word, and we can correct it by suggesting replacements. This is what is called a ‘wrong word error’.

Finally, words not removed by these four stages can only be regarded as words that are not in the dictionary. That is, we identify them as ‘non-word errors’. For these we cannot offer replacement words.

If concordance data is available, the spell checker above can be enhanced to offer replacements for both of the sentences ‘அன்பே சிவம் என்பர் சைவ சித்தாந்திகள்’ and ‘அன்பே சவம் என்பர் சைவ சித்தாந்திகள்’.

3 Delivery

We can recognize all of these stages, at a high level, in the architecture of every spell checker.

A spell checker takes in the text and pays no attention at all to the correct words; it operates solely around the incorrect ones. What a life: spell checkers are like golf, where not everything is decided by correct play; the wrong word, the wrong stroke, is what decides the win. Its job:

Flag the misspelled words
Show replacements for the misspelled words
If the user supplies a replacement for a misspelled word, add it to the word list and replace it in the text as well.

Finally, the spell checker combines all these inputs and delivers the corrected text.

ஆழக்கற்றல் (Deep Learning) – e-book

Michael Nielsen, a well-known computer scientist and quantum computing expert [co-author, with Isaac Chuang, of the famous ‘Quantum Computation and Quantum Information’], has written a nice book on Deep Learning in expository detail.

Front Cover — Book: “Quantum Computation and Quantum Information” from authors Michael Nielsen, Isaac Chuang. (C) 2000 Cambridge University Press. Google Books URL here

Nielsen’s new book, Neural networks and deep learning here, takes a more modern approach to (web) publishing by releasing the whole book under a Creative Commons Non-Commercial Share Alike [NC-SA] license.

In this book Mr. Nielsen breaks down how, using data, a computer can accomplish many tasks [without a step-by-step program, purely by learning from the data], along with the underlying artificial neural networks and their theory. Savour it like halwa.

How can handwritten digits be recognized using artificial neural networks? Mr. Nielsen shows the way in his book **Neural networks and deep learning**.

Engineers, take note! This book can be translated into Tamil. Will you take it up?

Open-Tamil v0.8

The last release of Open-Tamil was v0.71, from March 2018. Since then a lot of work has gone into the software: additional features, bug fixes, and a web interface at Tamilpesu.us.

You can find a copy of the commit log for the release at the following link: Release notes – Open-Tamil 2018 v0.8

Today, I’m posting the combined efforts of Open-Tamil developers as an update/packaged release v0.8 for Open-Tamil here.

Please try the software in your development environment as:

$ pip install --upgrade open-tamil

and report any problems via email to ezhillang@gmail.com

At this time, I convey my heartfelt thanks to the friends and engineers whose cooperation and contributions keep the Open-Tamil software package improving and its bugs getting fixed.

Let us continue the work next year too. Thank you. Vaazhga Valamudan!

Aamavadai

Let us continue from where the earlier post left off:

Corollary 2 of Theorem 3: If a letter is doubled within a word, that word is laid out on the torus with a loop.

Lemma 2: Words laid out purely horizontally or purely vertically do not exist in the language.

Corollary 3 of Theorem 3: On the torus there are no purely horizontal or purely vertical paths/letters.

Theorem 4: If all the words of a dictionary are mapped onto the torus, the number of intersecting points [of one or more words] cannot be arranged so as to be minimized. That is, for all the words of a dictionary, whatever the layout of the torus, the number of intersecting points does not change. It is an invariant.

Corollary 1 of Theorem 4: On the torus above [in one mapping of it, whether laid out in the order ‘அ,ஆ,இ,ஈ, … ,ஒ,ஓ,ஔ’ and ‘கசடதபர – யரலவழள – ….’, or stacked in other dimensions], each dictionary yields a distinctive count of intersecting points. This number can be called the signature of the dictionary.

Theorem 5: Each dictionary word on the torus can be taken as a path. The first letter of the word marks the start of the path and the last letter marks its end; the path is directed, with an arrow pointing from the start toward the end. Hence paths not in the dictionary correspond either to misspelled dictionary words or to new words absent from the dictionary.

Argument [may be regarded as the start of a proof]: Every word [and its path] on the torus must be a dictionary word. By coding theory / error-correcting-code theory, paths made of such correct letters are correct words, and wrong words [nonexistent words] are errors. Such words are regarded as misspellings of correct ones.

Corollary 1 of Theorem 5: Since the whole dictionary is mapped onto the torus above, a spelling corrector can be built with it. It can be implemented with the heuristic that the correction for a misspelled word is the correct word at the nearest distance from it.
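The nearest-word heuristic can be illustrated with a small Python sketch. Note the assumption: the distance on the torus is not defined in this post, so plain Levenshtein edit distance is swapped in as a stand-in, and the toy English word list and the function names are made up for illustration. For Tamil text, one would first split words into Tamil letters rather than raw Unicode code points.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here: strings)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution
        prev = curr
    return prev[-1]

def nearest_words(word, dictionary, max_distance=2):
    """Dictionary words within max_distance of word, closest first."""
    scored = sorted((edit_distance(word, w), w) for w in dictionary)
    return [w for d, w in scored if d <= max_distance]

dictionary = ["tea", "ted", "ten", "to", "in", "inn"]  # toy word list
print(nearest_words("tex", dictionary, max_distance=1))
```

Any other metric, including one derived from the torus layout, could replace `edit_distance` without changing `nearest_words`.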

Theorem 6: If the data structure of word trees known as tries is drawn on the torus, it forms connected paths with a single start and many path ends. Some of these paths may also end by merging.

**Figure 2**: The trie data structure. It holds the words ‘to’, ‘tea’, ‘ted’, ‘ten’, ‘A’, ‘in’ and ‘inn’.

For example, the terminal nodes ‘n’ in Figure 2 end by merging when they are placed on the torus.
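For readers who have not met tries before, here is a minimal Python sketch of the structure in Figure 2, built from nested dictionaries. The helper names and the `END` marker are assumptions made for illustration; the sketch only builds the tree and lists which words end in the same letter, it does not model the torus itself.

```python
END = "$"  # marker key meaning "a word ends at this node"

def build_trie(words):
    """Build a trie as nested dicts, one level per letter."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def contains(trie, word):
    """True only if word was stored as a complete word."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

def terminal_letters(trie, prefix=""):
    """Yield (word, last_letter) for every stored word."""
    for key, child in trie.items():
        if key == END:
            yield prefix, prefix[-1]
        else:
            yield from terminal_letters(child, prefix + key)

trie = build_trie(["to", "tea", "ted", "ten", "A", "in", "inn"])
ends_in_n = sorted(w for w, last in terminal_letters(trie) if last == "n")
print(ends_in_n)
```

The words collected in `ends_in_n` are the ones whose final nodes carry the letter ‘n’; in the tree they are distinct leaves, which is why they would coincide only once each letter is given a single position on the torus.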

-Muthu.

Latha vs Bamini – 1

Tamil billboard; credits - masanori_jpn via Flickr.

Well, this blog post is not about any famous cat-fight: ‘sabaash – sariyaane potTi!’

Frame grab from the song ‘Kannum Kannum’ from the movie ‘Vanjikottai Vaaliban’, with rival danseuses Padmini and Vaijayanthimala.

but about the more mundane issue of the resolution of Tamil letters, which may affect visual acuity and usability in practical things like billboards. Yes, we know Latha (the Tamil font from Microsoft) and Bamini, the famous, storied font created in the 1980s. The Bamini font is also used in the Chennai Metro and the Colombo railway station, among other places; the creator of Bamini was recently felicitated with the 2017 Tamil Computing award for pioneering efforts at the dawn of the digital era.

Back to resolution: ‘kannu theriyithaa?’ is the usual expression, but it really asks whether you are able to see the object/thing/place/person, and not literally ‘do you have vision?’

In optical science, it is well known that free space, i.e. distance, acts as a filter that introduces blur into the image. This is why we don’t see the details of far-away billboards, and why they grow in detail as one approaches them.

So if you are advertising in large billboards, obviously you want to be visible to audiences as far as laws of physics [Rayleigh resolution limit] will allow.

We can gather from simple considerations the following:

The larger the letters, the farther away they may be visible
The longer the wavelength of light [red (longer) to violet (shorter)], the farther it is visible without being scattered

So you can ask: if all billboards are painted in large, bold Tamil letters ‘adikkira maathiri’, will they be visible from very long distances? Yes. And they will also be boring.

This knowledge does not help us choose between two fonts, since we can draw/write/paint the letters of either font in any size and color; only the shape remains fixed.

Now to properly analyze the two fonts for best visual acuity, we may consider the following criteria:

The letters compared need to be the same in both fonts
The viewer observes the projected font/printed text/billboard from progressively farther distances
Essentially, the font on the billboard that remains visible from the farthest distance is the winner/better font under this criterion

Before we start drawing conclusions, you also want the test subjects to have 20/20 vision, or to wear corrective prescription eyeglasses that give the same level of vision.

Now, regardless of the color and size of the fonts we can use the criteria to compare the acuity of the fonts.

But wait, can we do this by computer modeling, without paint, labor and 20/20 vision subjects? You betcha! This will be the subject of the next blog post.

Until then…. Vaazhga Valamudan.

-Muthu

Project Madurai Corpus – உளி வீரன்

Project Madurai

The Project Madurai corpus contains a treasure trove of Tamil data across many generations and inflections of the Tamil language. I post-processed the files of this corpus in the project உளி வீரன்.

Data

We are able to look at data from the Project Madurai e-Texts. Currently there are 4,036,616 total words (40 lakh plus) in the ‘plain_text’ folder, which holds unigram and bigram data at the word level. One may use the open-tamil library to:

discover the unigram word-frequency of this corpus
discover the bigram word-frequency of this corpus (since successive words occur on successive lines).
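The unigram and bigram word frequencies can be sketched with the Python standard library alone. This is only a sketch under assumptions: it presumes one word per line, as the note on successive lines suggests, it does not use open-tamil, and the three-word sample below is made up rather than taken from the corpus.

```python
from collections import Counter
import io

def ngram_counts(lines):
    """Count unigrams and bigrams from an iterable of one-word lines."""
    words = [ln.strip() for ln in lines if ln.strip()]
    unigrams = Counter(words)
    # Successive lines hold successive words, so pairing each word
    # with its successor gives the word-level bigrams.
    bigrams = Counter(zip(words, words[1:]))
    return unigrams, bigrams

# Stand-in for an open file from the 'plain_text' folder.
sample = io.StringIO("அன்பே\nசிவம்\nஎன்பர்\nஅன்பே\nசிவம்\n")
unigrams, bigrams = ngram_counts(sample)
print(unigrams.most_common(2))
print(bigrams[("அன்பே", "சிவம்")])
```

For a real file, passing the open file object in place of `sample` works the same way, since the function only iterates over lines.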

Morse Code for Madurai Corpus

Using the techniques laid out in the earlier blog post on Morse code, we are able to regenerate the Morse code for Tamil using this additional data.

Average code word length = 6.65456 bits. Morse code for Tamil using Madurai corpus is displayed below [most frequently occurring symbols to least] – i.e. in descending order.

க -> ..---
ன் -> .--.-
ம் -> .-..-
த -> ----.
த் -> ---.-
க் -> --..-
வ -> -.---
ர் -> -.-.-
ல் -> ....-.
து -> ...--.
ரு -> ...-.-
ப -> ..--.-
ந் -> ..-...
தி -> ..-.--
ப் -> .-----
கு -> .---..
ய -> .--..-
ம -> .-.--.
ட -> --...-
ற் -> --.---
அ -> --.-..
வி -> --.-.-
ர -> -...--
டு -> -...-.
ன -> -..---
ங் -> -..-..
ண் -> -.-...
ட் -> .....--
கி -> .....-.
ள் -> ....---
ல -> ...---.
டி -> ...-...
ற -> ..--..-
யி -> .----.-
று -> .---.--
மு -> .--....
தா -> .--...-
இ -> .-....-
மா -> .-.-..-
பு -> .-.-.-.
ய் -> -------
கா -> ------.
ரி -> -----.-
யா -> ---....
வா -> ---..--
றி -> --.....
சி -> -.....-
லை -> -..--..
ச் -> -..--.-
ச -> -..-.--
யு -> -..-.-.
பி -> -.--...
பா -> -.--.--
உ -> -.-..--
எ -> -.-..-.
னை -> ......--
டை -> ....--..
ள -> ...-----
கொ -> ...-..--
செ -> ..--....
ளி -> ..-..---
ந -> ..-..--.
ண -> ..-.-...
லி -> ..-.-.--
லா -> ..-.-.-.
னி -> .----...
நி -> .---.-..
போ -> .-......
னா -> .-...--.
வே -> .-...-.-
வு -> .-.-----
கை -> .-.----.
னு -> .-.---..
தை -> .-.---.-
மை -> .-.-...-
மி -> .-.-.---
ரை -> .-.-.--.
ளை -> -----...
ழி -> ---...--
ஆ -> ---...-.
லு -> ---..-..
ழு -> --....--
பெ -> --....-.
றை -> --.--...
பொ -> --.--.--
நா -> --.--.-.
ஞ் -> -......-
ரா -> -....---
தே -> -....-..
ணி -> -....-.-
ழ -> -.--..--
சு -> -.--..-.
றா -> -.--.-.-
ழ் -> ........-
வெ -> .......-.
மே -> ......-..
டா -> ......-.-
ளு -> ...----.-
வை -> ...-..-.-
தெ -> ..--...--
யை -> ..-..-..-
கூ -> .----..--
ஒ -> .----..-.
யே -> .---.-.--
தோ -> .---.-.-.
சா -> .-.....-.
தொ -> .-...---.
மெ -> .-...-..-
நீ -> .-.-....-
கோ -> -----..--
கே -> --.--..--
சை -> -........
பே -> -.......-
சொ -> -....--.-
லே -> -.--.-...
யெ -> -.--.-..-
ளா -> .........-
னே -> .......---
ஏ -> ....--.---
வீ -> ....--.--.
பூ -> ....--.-.-
சே -> ...----...
யோ -> ...-..-..-
ழை -> ..--...-.-
நெ -> ..-..-....
தீ -> ..-..-.--.
ணை -> ..-..-.-.-
வ் -> ..-.-..--.
மூ -> .-...-----
றே -> .-...-...-
மொ -> .-.-.....-
கெ -> -----..-.-
ணு -> ---..-.--.
ஓ -> ---..-.-..
சூ -> --.--..-..
தூ -> -....--...
ரே -> .......--..
னெ -> .......--.-
மோ -> ....--.-..-
பை -> ...----..--
சீ -> ...----..-.
மீ -> ...-..-...-
ணா -> ..--...-...
டே -> ..-..-...-.
ஊ -> ..-..-.----
னோ -> ..-..-.---.
ளே -> ..-..-.-..-
வோ -> ..-.-..----
சோ -> ..-.-..---.
நே -> ..-.-..-...
ரெ -> ..-.-..-..-
லோ -> ..-.-..-.--
ஸ் -> ..-.-..-.-.
லெ -> .-.....---.
நோ -> .-.....--..
யொ -> .-...----..
ரோ -> .-...-.....
ஈ -> .-...-....-
றோ -> .-.-.......
நு -> .-.-......-
றெ -> ---..-.----
நூ -> ---..-.---.
கீ -> -....--..-.
ஞா -> ............
ஐ -> ..........--
ஷ -> ..........-.
ழா -> ...-..-.....
டெ -> ..--...-..-.
வொ -> ..-..-...---
ளெ -> ..-..-...--.
ஜ -> ..-..-.-....
றொ -> .-.....-----
ளோ -> .-.....--.--
னொ -> .-.....--.-.
டோ -> .-...----.--
யூ -> -----..-....
ஷ் -> -----..-...-
பீ -> ---..-.-.---
றீ -> ---..-.-.--.
லொ -> ---..-.-.-.-
ரொ -> --.--..-.---
ரீ -> --.--..-.-..
ரூ -> ...........--
ஞ -> ....--.-.....
னீ -> ....--.-...--
டொ -> ...-..-....--
ணீ -> ...-..-....-.
யீ -> ..--...-..---
டீ -> ..--...-..--.
வூ -> .-.....----.-
ணெ -> .-...----.-.-
ஸ -> -----..-..--.
ஜா -> -----..-..-.-
லீ -> --.--..-.--..
ணே -> --.--..-.-.--
னூ -> --.--..-.-.-.
லூ -> -....--..----
நொ -> -....--..--..
ஃ -> -....--..--.-
ளொ -> ...........-.-
ங -> ....--.-...-..
றூ -> ..-..-.-...-..
ணோ -> ..-..-.-...-.-
ஜ் -> .-.....----...
டூ -> .-...----.-...
ஹ -> -----..-..----
ஷி -> -----..-..-..-
நை -> ---..-.-.-...-
ஹா -> ---..-.-.-..--
ளீ -> --.--..-.--.-.
ளூ -> -....--..---.-
ழீ -> ...........-..-
ஜி -> ....--.-....--.
ஸி -> ....--.-....-..
ழே -> ....--.-...-.-.
ஞை -> ..-..-.-...----
கௌ -> ..-..-.-...--.-
மௌ -> .-.....----..--
ணொ -> .-...----.-..--
சௌ -> .-...----.-..-.
ஸா -> -----..-..---.-
ஷா -> -----..-..-...-
ஜெ -> ---..-.-.-.....
வௌ -> ---..-.-.-....-
ஷை -> ---..-.-.-..-..
ஜோ -> --.--..-.--.---
ஜீ -> --.--..-.--.--.
ழெ -> -....--..---...
ஷே -> -....--..---..-
ணூ -> ....--.-....----
ஜை -> ....--.-....---.
ஹி -> ....--.-....-.--
பௌ -> ....--.-...-.---
ஔ -> ..-..-.-...---..
ஞெ -> ..-..-.-...--...
ழூ -> .-.....----..-.-
ழோ -> -----..-..---..-
ழொ -> -----..-..-....-
ஸு -> ---..-.-.-..-.--
ஹோ -> ...........-.....
ஜு -> ...........-...--
ஷு -> ...........-...-.
ஞீ -> ....--.-....-.-..
ஹ் -> ..-..-.-...---.-.
தௌ -> ..-..-.-...---.--
ஸை -> ..-..-.-...--..--
ஜே -> ..-..-.-...--..-.
ஸீ -> -----..-..---...-
ஞி -> -----..-..-.....-
ஸூ -> ...........-....--
ஜொ -> ....--.-....-.-.--
ஹு -> ....--.-...-.--...
ஹை -> ....--.-...-.--..-
ஹீ -> ....--.-...-.--.--
ஸெ -> .-.....----..-....
ஜூ -> .-.....----..-...-
ரௌ -> .-.....----..-..-.
ஹே -> -----..-..---.....
ஸே -> -----..-..-.......
யௌ -> ---..-.-.-..-.-...
ஷூ -> ---..-.-.-..-.-..-
ஹூ -> ---..-.-.-..-.-.--
ஹெ -> ...........-....-..
ஞூ -> ...........-....-.-
ஸோ -> ....--.-...-.--.-..
ஞே -> .-.....----..-..---
ஷீ -> -----..-..---....--
ஷோ -> -----..-..-......--
ஷெ -> -----..-..---....-.
ஹொ -> ---..-.-.-..-.-.-..
ஞோ -> ---..-.-.-..-.-.-.-
ஸௌ -> ....--.-....-.-.-...
டௌ -> ....--.-....-.-.-.--
லௌ -> ....--.-....-.-.-.-.
ஞு -> ....--.-...-.--.-.--
நௌ -> .-.....----..-..--.-
ஙு -> -----..-..-......-..
ஹௌ -> ....--.-....-.-.-..-.
ஸொ -> .-.....----..-..--...
னௌ -> ....--.-...-.--.-.-.-
ஙொ -> ....--.-...-.--.-.-..
ஞௌ -> .-.....----..-..--..-
ஞொ -> ....--.-....-.-.-..---
ஙா -> -----..-..-......-.---
ஙே -> -----..-..-......-.-..
ளௌ -> ....--.-....-.-.-..--..
ஷொ -> -----..-..-......-.-.--
ழௌ -> -----..-..-......-.--..
ஙூ -> -----..-..-......-.-.-.
ஷௌ -> -----..-..-......-.--.--
றௌ -> -----..-..-......-.--.-.
ஙோ -> ....--.-....-.-.-..--.--
ஙி -> ....--.-....-.-.-..--.-.

We can say this Morse code book is a better representation of Tamil, since it covers 290 of the 323 letters in the Grantha + Tamil letter set, as found in the Madurai corpus of 4 million words.

Tamil in Morse-code

Can we compose a Tamil Morse-code ? Yes, we can.

International Morse Code – Source: Wikipedia

Start with a frequency count of Tamil letters from various sources
Build a probability distribution from the frequency counts
Build a Huffman code using the above distribution
Each letter of the Tamil alphabet gets a Morse code: 0 = ‘.’, 1 = ‘-’ (dot, dash).
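The steps above can be sketched in Python as follows. This is a minimal illustration, not the Open-Tamil implementation: the function name `huffman_morse` and the four-letter frequency table are made up, and real counts would come from a corpus.

```python
import heapq
from itertools import count

def huffman_morse(freqs):
    """Map each symbol to a dot/dash codeword via Huffman coding."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # a lone symbol still needs one signal
        (_, _, codes), = heap
        return {sym: "." for sym in codes}
    while len(heap) > 1:
        # Merge the two least frequent subtrees.
        w0, _, c0 = heapq.heappop(heap)
        w1, _, c1 = heapq.heappop(heap)
        merged = {s: "." + c for s, c in c0.items()}         # bit 0 -> dot
        merged.update({s: "-" + c for s, c in c1.items()})   # bit 1 -> dash
        heapq.heappush(heap, (w0 + w1, next(tiebreak), merged))
    return heap[0][2]

freqs = {"க": 40, "ம்": 30, "த": 20, "ஞ": 10}  # made-up counts
codes = huffman_morse(freqs)
total = sum(freqs.values())
avg_len = sum(freqs[s] * len(codes[s]) for s in freqs) / total
for sym in sorted(freqs, key=freqs.get, reverse=True):
    print(sym, "->", codes[sym])
print("average codeword length:", avg_len)
```

Frequent letters end up with short codewords and rare ones with long codewords, which is the pattern visible in the full table.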

Tamil Morse Code Table generated from Open-Tamil library. See here for full code and methodology. Full table follows.

Can you decode what this Morse code means in Tamil ? Hint: 2 words (4,5) letters long

`...-. --.--.. .---..--.--- .-..-. ...-. ---.-. -----.--.- .--....- ..-..-`

Please note the table was updated to list the letters from most frequent to least frequent, along with their codewords. Updated after publishing on Aug 16th, 2018.

Source coding theory

Information theory provides us with tools to calculate the information content of the symbols in a language, i.e. the letters of the alphabet in our case. The average codeword length was 6.45652 bits, which rounds up to 7 bits.
If the 230+ symbols were encoded in binary without attention to letter frequency, we would use ceil[ log2[230] ] = 8 bits per symbol, so the Morse code provides a relative data compression of 12.5%!

Previously, I had written on this blog about Morse code for Tamil here, and about its relationship with unigram, bigram and trigram models and word structure in the Tamil language.
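The arithmetic behind the 12.5% figure can be checked with a few lines of Python. The variable names are made up for illustration; the symbol count of 230 and the average length are the values quoted in the text.

```python
import math

n_symbols = 230               # "230+" symbols, per the text
avg_codeword_bits = 6.45652   # average Huffman codeword length, per the text

# Fixed-length binary code: every symbol costs the same number of bits.
fixed_bits = math.ceil(math.log2(n_symbols))

# Saving when the average length is rounded up to whole bits (7 of 8).
rounded_bits = math.ceil(avg_codeword_bits)
saving_rounded = 1 - rounded_bits / fixed_bits

# Saving when the unrounded average is used instead.
saving_exact = 1 - avg_codeword_bits / fixed_bits

print(fixed_bits, rounded_bits)
print(f"{saving_rounded:.1%} (rounded), {saving_exact:.1%} (unrounded)")
```

Using the unrounded average gives a somewhat larger saving than the rounded figure, since the average is below 7 bits.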

ம் -> --..
த -> -...
க -> ...-.
ல் -> ..---
த் -> ----.
க் -> -.---
ன் -> -.--.
ர -> .....-
ப -> ....--
வ -> ..--.-
தி -> ..-..-
ச -> ..-.-.
கு -> .----.
ம -> .---.-
ப் -> .--..-
ட் -> .--.-.
டு -> .-...-
ர் -> .-..-.
ய -> .-.-.-
அ -> ---..-
ட -> ---.--
ரு -> ---.-.
பு -> -..---
கா -> -..--.
து -> -..-.-
ல -> -.-..-
வி -> .......
டி -> ....-..
ண் -> ....-.-
சி -> ...---.
ன -> ..--...
ரி -> ..-....
ங் -> ..-...-
ந் -> ..-.---
ற் -> .-----.
இ -> .--...-
று -> .-..---
ச் -> .-....-
சு -> .-..--.
பா -> .-.----
கி -> .-.--..
பி -> .-.--.-
வா -> .-.-...
மு -> -----..
ள் -> ---....
லை -> --.--..
உ -> --.--.-
டை -> --.-..-
தா -> --.-.--
ண -> -..-...
கை -> -..-..-
ஆ -> -.-...-
மா -> -.-.---
ய் -> -.-.-.-
ள -> ......-.
சா -> ...--..-
ற -> ...--.--
லி -> ..--..--
வு -> .---...-
கொ -> .---..-.
ந -> .--.....
நி -> .--....-
ஞ் -> .--.----
ரா -> .--.---.
ணி -> .--.--..
ளி -> .--.--.-
யா -> .-......
நா -> .-.-..--
றி -> .-.-..-.
கோ -> -------.
செ -> ------..
ழி -> ------.-
னி -> -----.-.
ழு -> --.-----
மி -> --.----.
யி -> --.-....
பொ -> --.-.-..
ரை -> --.-.-.-
வெ -> -.-.....
எ -> -.-.--..
மை -> -.-.--.-
றை -> -.-.-..-
பூ -> ......--.
ழ -> ...-----.
னை -> ...----..
லா -> ...--.-..
சை -> ..--..-.-
வை -> ..-.--...
போ -> ..-.--..-
கூ -> ..-.--.-.
வே -> .--------
டா -> .-------.
தை -> .------..
பெ -> .---....-
ளை -> .---..---
தே -> .-.---...
ஒ -> .-.---.--
ழ் -> -----.---
லு -> ---...---
நீ -> ---...-..
சீ -> ---...-.-
தீ -> --.---...
மூ -> --.---..-
தொ -> --.---.--
ணை -> --.---.-.
ஏ -> --.-...-.
நெ -> -.-....-.
ளு -> -.-.-....
னா -> ......----
சூ -> ......---.
மே -> ...-------
தோ -> ...------.
தெ -> ...----.-.
சொ -> ...--.....
சே -> ...--....-
தூ -> ...--...--
யு -> ...--...-.
பே -> ...--.-.--
வீ -> ..--..-..-
ஊ -> .------.--
னு -> .---......
யோ -> .---.....-
சோ -> .---..--..
கே -> .-.....---
ழை -> .-.....--.
ணு -> .-.---..--
ஓ -> .-.---.-..
கெ -> ----------
கீ -> --------..
றா -> --------.-
பை -> -----.--..
ணா -> -----.--.-
ரோ -> ---...--.-
மொ -> -.-....--.
மெ -> -.-.-...--
லோ -> ...----.---
பீ -> ...----.--.
ளா -> ...--.-.-.-
ஈ -> ..--..-....
ஞா -> ..--..-...-
மீ -> ..-.--.----
வ் -> ..-.--.--..
மோ -> ..-.--.--.-
நு -> .---..--.-.
ஐ -> .-.....-..-
ரே -> .-.....-.-.
நோ -> .-.---..-.-
நே -> .-.---.-.--
நூ -> ---------..
யெ -> --.-...----
லே -> --.-...--..
ரீ -> -.-....----
நொ -> -.-....---.
யை -> -.-.-...-..
ழா -> ...--.-.-...
ரூ -> ...--.-.-..-
னோ -> .------.-.--
ஞ -> .---..--.---
யூ -> .---..--.--.
வோ -> .-.....-....
யே -> .-.....-.---
லெ -> .-.---..-...
ரெ -> .-.---.-.-.-
ணீ -> ---...--....
டோ -> ---...--..--
டெ -> ---...--...-
கௌ -> ---...--..-.
ணெ -> --.-...---..
சௌ -> --.-...---.-
றெ -> ..-.--.---...
லூ -> ..-.--.---..-
றோ -> .------.-....
னே -> ..-.--.---.--
னீ -> .------.-..-.
நை -> .------.-..--
டூ -> .------.-.-..
னெ -> .-.....-.--..
டே -> .-.....-.--.-
ஞெ -> .-.---..-..--
ளெ -> .-.---.-.-...
டீ -> ---------.---
யொ -> ---------.--.
பௌ -> ---------.-..
ஃ -> --.-...--.---
ஔ -> --.-...--.-..
ஞை -> -.-.-...-.---
யீ -> -.-.-...-.--.
றொ -> -.-.-...-.-.-
வொ -> .------.-...--
வூ -> ..-.--.---.-..
னூ -> .------.-.-.--
ளோ -> .-.....-...---
ணோ -> .------.-.-.-.
றே -> .-.....-...--.
மௌ -> .-.....-...-..
தௌ -> .-.---..-..-..
ளே -> .-.---.-.-..-.
லொ -> .-.---.-.-..--
றூ -> ---------.-.--
ரொ -> --.-...--.--..
டொ -> --.-...--.-.-.
ங -> -.-.-...-.-...
ணே -> ..-.--.---.-.--
ளீ -> .------.-...-..
ழூ -> .-.....-...-.-.
ளொ -> .-.---..-..-.-.
ரௌ -> .-.---..-..-.--
யௌ -> ---------.-.-..
னொ -> ---------.-.-.-
ழோ -> --.-...--.-.--.
ளூ -> --.-...--.-.---
ஞி -> -.-.-...-.-..--
ணொ -> .-.....-...-.---
ணூ -> .------.-...-.--
ழீ -> .-.....-...-.--.
ஸ் -> --.-...--.--.--.
வௌ -> -.-.-...-.-..-..
ஞீ -> --.-...--.--.---
ஷ் -> ..-.--.---.-.-...
ஷி -> ..-.--.---.-.-..-
ழெ -> ..-.--.---.-.-.-.
றீ -> .------.-...-.-.-
நௌ -> ..-.--.---.-.-.--
ஞே -> .------.-...-.-..
லௌ -> --.-...--.--.-..-
ஞொ -> -.-.-...-.-..-.--
ஙு -> --.-...--.--.-...
ஷ -> --.-...--.--.-.---
ழொ -> --.-...--.--.-.--.
ழே -> -.-.-...-.-..-.-.
டௌ -> --.-...--.--.-.-.-
ஞூ -> --.-...--.--.-.-..

Caveats and Closing Comments

Of course, 15 of the 247 letters have not received any codeword in this codebook. Further, with the inclusion of Grantha letters there are 323 letters in Tamil, for some of which we don’t have codewords.

Further, the unigram frequency distribution of a large text corpus like Project Madurai’s [PM] may be useful for developing a widely representative Morse code table. Once you have this PM unigram data, you know how to regenerate this Tamil Morse codebook!

Language Transformations

Question of Translation

How can you convert a text like “Mi amor!” to “என் உயிரே!” [from Spanish to Tamil]? Let’s assume we have Spanish-to-English and Tamil-to-English translators [each bidirectional with English]; then we can convert Spanish to English and then to Tamil. Likewise one can translate between any two languages in a clique of languages [so long as the clique is defined such that each language can be translated to at least one other language in the clique].

Development – Theory

Language can exist as text (print/message/document) or speech (audio, conversations) etc. Ideas are represented in any language. Ideas originate in one language and move to another, or sometimes originate in many languages simultaneously. Ideas can cross from one language to another via text or speech.

In mathematical terms, if we write L as the set of languages, L = { L₁, L₂, …, L_n }, and define each language as a tuple L_i = (T_i, S_i) of its text and speech forms, then we may define a function that operates on text and converts it to speech as:

TTS_i : T_i -> S_i

we may define a function speech recognition as,

ASR_i : S_i -> T_i

we may also define a translation function as,

TX_ij : L_i -> L_j

Essentially, by representing each language as a node in a graph with text and speech parts, we may connect these parts to each other via edges (functions like ASR and TTS), and to the nodes of other languages via translator-function edges.

In a graph with only two languages [English, Tamil], with edges for functions like TTS and ASR within the same language and Translator functions between the two languages (one for each direction), we see a graph like the following:

Fig. 1: Language transformation graph. Nodes represent languages and their components. Edges represent functions like TTS, ASR [within the same language] and Translators [directional, between languages].

Clearly this is a directed graph, with the ability to go from one language to another in text, speech or both forms, provided a path exists from the source to the target language. Using such a graph with no orphan nodes, we may have universal translation powers from language A to language B [so long as bidirectional connectivity is present with at least one neighbor].
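A small Python sketch can make the graph view concrete. Everything here is an assumption for illustration: the node naming as (language, form) pairs, the choice of which edges exist, and the addition of Spanish text from the opening example. Breadth-first search is used because, with uniform loss per edge, the path with the fewest hops is the natural one to prefer.

```python
from collections import deque

# Nodes are (language, form) pairs; edges are functions such as
# TTS, ASR and the directional translators TX. Edge set is made up.
edges = {
    ("es", "text"): [("en", "text")],                                    # TX es->en
    ("en", "text"): [("es", "text"), ("ta", "text"), ("en", "speech")],  # TX, TX, TTS
    ("en", "speech"): [("en", "text")],                                  # ASR
    ("ta", "text"): [("en", "text"), ("ta", "speech")],                  # TX, TTS
    ("ta", "speech"): [("ta", "text")],                                  # ASR
}

def shortest_path(graph, source, target):
    """Breadth-first search; returns a fewest-edges path or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

path = shortest_path(edges, ("es", "text"), ("ta", "speech"))
print(path)
```

A target with no incoming edges, such as Spanish speech in this toy graph, is unreachable, and `shortest_path` returns `None` for it. For non-uniform loss per edge, a weighted shortest-path algorithm such as Dijkstra's would take the place of the breadth-first search.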

Problems to Ponder

So the curious reader, now able to represent the translation problem as the graph problem of reaching node B from node A, can draw on the rich set of path-finding and shortest-distance algorithms and may attempt to answer some of these questions:

1. What is the graph criterion for a language to have no translations?
2. What is the graph criterion for a language to not be able to have a virtual assistant? [Siri, Cortana, Alexa etc.]
3. Conversely to 2, what is the minimum criterion [necessary but not sufficient] to have a virtual assistant [that can speak and listen]?
4. Given two paths of different lengths for translating from language A -> F, which one would you choose and why? Assume all jumps have a uniform information loss. What if the information loss at each edge is non-uniform; how can you optimize such a problem?
5. How would you introduce a new language into this graph so that it may be translated to all other languages [unidirectionally]?
6. How would you introduce a new language into this graph so that it can be translated bidirectionally?
7. How can you represent the transliteration function in this graph?

Answers will be posted soon! Feel free to leave your comments in section below.

-Muthu

India A.I. report – highlights

As written earlier, the chair of the committee that released India’s artificial intelligence report is Prof. Kamakoti of IIT-Chennai. See the key points of the report in the images below: