Classifying Tamil words – part 2

Recap

Continuing from the previous post (see part 1), I am sharing my results on classifying a sequence of Tamil letters as either a valid Tamil-like word or an English-like word using a binary classifier.

Prerequisites

You need to install scikit-learn by following the directions on its website here.

pip install -U scikit-learn

This will also pull in dependencies like NumPy and the other Python libraries that scikit-learn builds on.

Next, verify your installation by typing,

python -c "import sklearn"

which should run without any output if everything is set up correctly.
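Alternatively, you can print the installed version from a Python prompt as a quick sanity check:

import sklearn
print(sklearn.__version__)  # shows which scikit-learn version is installed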

Training the AI Classifier

To train the classifier, which is based on a multi-layer perceptron (in other words, an artificial neural network):

  1. we need to represent our input as a CSV file, with each sample encoded as a row of features.
    • in this case the data are CSV files holding features of the Jaffna, Azhagi and combinational transliterated outputs of the input words
    • See: files ‘english_dictionary_words.azhagi’ and ‘tamilvu_dictionary_words.txt’ at repo open-tamil/examples/classifier
  2. each word (represented as features) is also given a training label, usually an integer, forming a label column in the CSV file (across all samples); the typical features encoded in the data file are defined in the class Field in the file ‘classifier/preprocess.py’;
    • Typically, information about each word such as the number of letters, uyir (vowel) letters, mei (consonant) letters, ayutha letters, vallinams, mellinams, idayinams, the first and last letters, and the vowels is stored as a feature record in the CSV.
    • We can generate the various feature records for the data files by running the code in preprocess.py
  3. next we train the neural network using the scikit-learn API,
    • the key code is in ‘classifier/modelprocess2.py’
    • first we load the CSV feature vectors into Python as NumPy arrays for both class 0 (English words) and class 1 (Tamil words)
    • next we set up scaling of the data sets for both classes
    • we split off a test set and a training set, which is key to getting a well-generalized model
    • we import various tools from scikit-learn, like the input scaler ‘StandardScaler’ and ‘train_test_split’, to follow good training conventions (a minimal sketch of this loading, scaling and splitting step appears just after this list)
    • since we are doing classification, both the test and training inputs need to be scaled, but not the label data
  4. Next we set up a 3-layer neural network trained with the ‘lbfgs’ solver, and fit it to the X_train data and the corresponding Y_train labels
    • from sklearn.neural_network import MLPClassifier
      from sklearn.metrics import accuracy_score

      nn = MLPClassifier(hidden_layer_sizes=(8,8,7), solver='lbfgs')
      nn.fit(X_train, Y_train)

      Y_pred = nn.predict(X_test)

      print("accuracy => ", accuracy_score(Y_pred.ravel(), Y_test))

  5. The fitted neural network produces a score (goodness of fit) and is immediately serialized to disk for future reference (a sketch of this step appears below); we also output diagnostic information such as,
    • confusion matrix
    • classification report
  6. Next we use the trained neural network to show the results for a few known inputs.
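A minimal sketch of the loading, scaling and splitting described in item [3] is shown below; the CSV file names and column layout here are assumptions, and the actual code lives in ‘classifier/modelprocess2.py’:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# assumed file names: one CSV of feature rows per class
X_english = np.loadtxt('english_features.csv', delimiter=',')  # class 0
X_tamil = np.loadtxt('tamil_features.csv', delimiter=',')      # class 1

X = np.vstack([X_english, X_tamil])
Y = np.concatenate([np.zeros(len(X_english)), np.ones(len(X_tamil))])

# hold out a test set, then scale the inputs (labels are left untouched)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=7)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)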
Fig. 2: The classifier, trained to 89% accuracy, correctly identifies the word “Hello”; while both forms are acceptable in native script, it is a word of English origin!
  1. The key point for prediction with the ANN is to transform the input into a feature vector before applying it to the classifier input (see the sketch after Fig. 3 below).
  2. Once the training is complete we see results like those in item [6].
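As noted in item [5], the fitted model can be saved to disk and the diagnostics printed. A minimal sketch follows; joblib is one common way to persist scikit-learn models, and the exact mechanism in ‘classifier/modelprocess2.py’ may differ:

import joblib
from sklearn.metrics import confusion_matrix, classification_report

# persist the fitted network together with the scaler it depends on
joblib.dump({'model': nn, 'scaler': scaler}, 'tamil_word_classifier.pkl')

# diagnostics on the held-out test set
Y_pred = nn.predict(X_test)
print(confusion_matrix(Y_test, Y_pred))
print(classification_report(Y_test, Y_pred))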

Finally, we can automatically tell (via the neural network) whether a word like ‘computer’ is of Tamil or English origin; there is some sensitivity in this decision due to the roughly 10% error rate. I have a screenshot of the predictions for various words (the feature vectors are written as output as well).

Fig. 3: Neural Network prediction of Tamil words and English (transliterated into Tamil) words
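As a rough sketch of how such a prediction is made: a new word must first be transformed into a feature vector and scaled exactly like the training data, reusing the ‘nn’ and ‘scaler’ fitted above. The ‘toy_features’ helper below is illustrative only; the real features come from ‘classifier/preprocess.py’ and must match what the model was trained on:

import numpy as np
import tamil

def toy_features(word):
  # illustrative features only: word length, unique letter count, palindrome flag
  letters = tamil.utf8.get_letters(word)
  return [len(letters), len(set(letters)), int(letters == letters[::-1])]

word = u"கம்புயுடர்"  # a transliterated form of 'computer'
x = scaler.transform(np.array(toy_features(word)).reshape(1, -1))
print(nn.predict(x))  # 0 => English-like, 1 => Tamil-like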

Finally, we would like to conclude by saying that various artificial neural network topologies and hidden-layer sizes were tried, but we chose to stick with the simplest one. At this point the trained neural network seems quite satisfactory, and even ready to use for practical purposes.

Conclusion

Scikit-learn provides a powerful framework for training and building classification neural networks.

This work has shown straightforward classification, with a roughly 10% false-alarm rate (or ~90% classification rate), of various Tamil/English vocabularies, including words outside the training vocabulary. The source code is provided at the open-tamil site, along with the various CSV data files.

Good luck exploring neural networks. Getting beyond 90% accuracy on this task seems hard; there is still a long way to go.

Classifying Tamil words – part 1

Problem

One of the problems faced when building a Tamil spell checker, albeit a somewhat marginal one, can be phrased as follows:

Given a sequence of Tamil letters, how do you decide whether they form a true Tamil word (even one outside the dictionary) or a transliterated English word?

e.g. Between the words ‘உகந்த’ and ‘கம்புயுடர்’, can you decide which is a true Tamil word and which is transliterated?

Tools

This is fairly simple with the help of a neural network; given sufficient “features” and “training data” we can train such a network easily. With the current interest in this area, tools are available to make the task quite easy: scikit-learn, Keras, PyTorch or TensorFlow would all suffice.

Generally, the only thing you need to know about Artificial Intelligence (AI) here is that machines can be trained to do tasks based on two distinct learning processes:

  1. Regression,
  2. Classification

Read more on Wikipedia; the current “problem” is a classification task.

Features

Naturally, for the task of classifying a word, we may use features such as the following:

  1. Word length
  2. Are all characters unique?
  3. Number of repeated characters
  4. Vowel count, consonant count
    1. In Tamil this information is captured as (Kuril, Nedil, Ayudham) and (Vallinam, Mellinam, Idayinam)
  5. Is the word a palindrome?
  6. We can add bigram data as features in a next step

Basically, this task can be achieved with new code checked into Open-Tamil 0.7 (dev version) called ‘tamil.utf8.classify_letter’.
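For example, a quick sketch of using it looks like the following (the exact labels returned by classify_letter depend on the open-tamil dev version):

import tamil

word = u"உகந்த"
for letter in tamil.utf8.get_letters(word):
  print(letter, tamil.utf8.classify_letter(letter))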


Data sets

To make the data sets, we can use the Tamil VU dictionary as a list of valid Tamil words (label 1); next we can use a list of English words transliterated into Tamil as a list of invalid Tamil words (label 0).

Using this 1/0-labeled data, we may use part of the combined data for training the neural network with a gradient-descent algorithm, or any other method for building a supervised learning model.

Building Transliterated Data

Using the Python code below and the data file from the open-tamil repository, you can build and run the code,

import codecs

# assumed import path, based on open-tamil's transliterate package layout
from transliterate import algorithm, jaffna, azhagi, combinational

def jaffna_transliterate(eng_string):
  tamil_tx = algorithm.Iterative.transliterate(jaffna.Transliteration.table,eng_string)
  return tamil_tx

def azhagi_transliterate(eng_string):
  tamil_tx = algorithm.Iterative.transliterate(azhagi.Transliteration.table,eng_string)
  return tamil_tx

def combinational_transliterate(eng_string):
  tamil_tx = algorithm.Iterative.transliterate(combinational.Transliteration.table,eng_string)
  return tamil_tx

# 3 forms of Tamil transliteration for each English word
jfile = codecs.open('english_dictionary_words.jaffna','w','utf-8')
cfile = codecs.open('english_dictionary_words.combinational','w','utf-8')
afile = codecs.open('english_dictionary_words.azhagi','w','utf-8')
with codecs.open('english_dictionary_words.txt','r') as engf:
  for idx,w in enumerate(engf.readlines()):
    w = w.strip()
    if len(w) < 1:
      continue
    print(idx)
    jfile.write(u"%s\n"%jaffna_transliterate(w))
    cfile.write(u"%s\n"%combinational_transliterate(w))
    afile.write(u"%s\n"%azhagi_transliterate(w))
# close the output files only after the whole word list is processed
jfile.close()
cfile.close()
afile.close()

to get the following data files (the left pane shows the ‘Jaffna’ transliteration standard, while the right pane shows the source English word list); the full gist is on GitHub at this link.


In the next blog post I will share the details of training the neural network and building this classifier. Stay tuned!