Recap
Continuing from the previous post (see part-1), I am sharing my results on classifying a Tamil alphabet sequence as a valid Tamil-like word or an English-like word using a binary classifier.
Prerequisites
You need to install scikit-learn by following the directions on its website:
pip install -U scikit-learn
This will also install dependencies such as NumPy and the other Python libraries that scikit-learn relies on.
Next, check that your installation is okay by typing,
python -c "import sklearn"
which should run without any output if all your settings are okay.
Training the AI Classifier
To train a classifier based on a multi-layer perceptron (in other words, an artificial neural network):
- we need to represent our input as a CSV file, with each sample encoded as a row of features.
- in this case the data are CSV files holding features of the Jaffna, Azhagi and Combinational transliterated output of the input words
- See: files ‘english_dictionary_words.azhagi’ and ‘tamilvu_dictionary_words.txt’ in the repo at open-tamil/examples/classifier
- each word (represented as features) is also given a training label, usually an integer, which forms one column of the CSV file (across all samples); the features encoded in the data file are defined in class Field in the file ‘classifier/preprocess.py’;
- Typically, information about each word, such as the number of letters, uyir, mei and ayutha letters, vallinams, mellinams, idayinams, the first and last letters, and vowels, is stored in the feature record within the CSV.
- We can generate the feature records for the data files by running the code in preprocessor.py
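To make the idea of a feature record concrete, here is a minimal sketch of turning one Tamil word into a row of counts plus a label. It is not the code from the repo: the function names (`split_letters`, `word_features`, `to_csv_row`) and the choice and order of columns are assumptions made for illustration, and only a subset of the features listed above is computed.

```python
import csv
import io

# Base consonants of the three Tamil consonant groups.
VALLINAM = set("கசடதபற")
MELLINAM = set("ஙஞணநமன")
IDAYINAM = set("யரலவழள")
AYTHAM = "\u0b83"  # the aytham letter
PULLI = "\u0bcd"   # dot that marks a pure consonant (mei)


def is_mark(ch):
    """True for combining vowel signs (U+0BBE..U+0BCC) and the pulli."""
    return "\u0bbe" <= ch <= PULLI


def is_uyir(ch):
    """True for the independent vowels in U+0B85..U+0B94."""
    return "\u0b85" <= ch <= "\u0b94"


def split_letters(word):
    """Group code points into letters: a base character plus trailing marks."""
    letters = []
    for ch in word:
        if letters and is_mark(ch):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


def word_features(word):
    """Hypothetical feature record: one count per letter class."""
    letters = split_letters(word)
    bases = [letter[0] for letter in letters]
    return [
        len(letters),                                       # number of letters
        sum(is_uyir(b) for b in bases),                     # uyir
        sum(letter.endswith(PULLI) for letter in letters),  # mei
        bases.count(AYTHAM),                                # aytham
        sum(b in VALLINAM for b in bases),                  # vallinam
        sum(b in MELLINAM for b in bases),                  # mellinam
        sum(b in IDAYINAM for b in bases),                  # idayinam
    ]


def to_csv_row(word, label):
    """Format the features and the integer class label as one CSV line."""
    buf = io.StringIO()
    csv.writer(buf).writerow(word_features(word) + [label])
    return buf.getvalue().strip()


# Label 1 marks a Tamil word, label 0 an English-like word.
print(to_csv_row("தமிழ்", 1))
```

Counting on the base consonant of each letter keeps the sketch short; a real preprocessor would probably lean on the letter utilities that open-tamil already ships.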
- next we train the neural network using the scikit-learn API,
- this is the key code in ‘classifier/modelprocess2.py’
- first we load the CSV feature vectors into Python as NumPy arrays for both class 0 (English words) and class 1 (Tamil words)
- next we set up scaling of the data sets for both classes
- we pick a test set and a training set, which are key to getting a good model and a generalized fit
- We import various tools from scikit-learn, like the input scaler ‘StandardScaler’, ‘train_test_split’ etc., to keep up with good training conventions
- Since we are doing classification, both test and training inputs need to be scaled, but not the label data
- Next we set up a neural network with three hidden layers and the ‘lbfgs’ solver (the optimizer that fits the weights, not an activation function). We can fit it with the X_train data and the corresponding Y_train labels
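from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier
# three hidden layers with 8, 8 and 7 units, fitted with the L-BFGS solver
nn = MLPClassifier(hidden_layer_sizes=(8, 8, 7), solver='lbfgs')
nn.fit(X_train, Y_train)
Y_pred = nn.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
print(" accuracy => ", accuracy_score(Y_test, Y_pred.ravel()))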
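The loading, splitting and scaling steps that lead up to this fit can be sketched end to end as follows. This is a hedged illustration and not the contents of modelprocess2.py: the two arrays are synthetic stand-ins for the CSV feature files, and the feature count, split ratio, `max_iter` and random seeds are all assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two feature files (10 features per word).
# With real data these arrays would be loaded from the CSV files instead,
# for example with np.loadtxt(path, delimiter=",").
X_english = rng.normal(loc=0.0, scale=1.0, size=(200, 10))
X_tamil = rng.normal(loc=2.0, scale=1.0, size=(200, 10))

# Stack both classes and build the integer labels: 0 = English, 1 = Tamil.
X = np.vstack([X_english, X_tamil])
Y = np.concatenate([
    np.zeros(len(X_english), dtype=int),
    np.ones(len(X_tamil), dtype=int),
])

# Hold out a quarter of the samples, keeping the class balance.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, stratify=Y, random_state=0
)

# Scale the inputs only; the labels are left untouched.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

nn = MLPClassifier(
    hidden_layer_sizes=(8, 8, 7), solver="lbfgs", max_iter=500, random_state=0
)
nn.fit(X_train, Y_train)
test_accuracy = nn.score(X_test, Y_test)
print("accuracy =>", test_accuracy)
```

One design choice here is fitting the scaler on the training split only and reusing it for the test split, so that no information from the test set leaks into training.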
- The fitted neural network can generate a score (goodness of fit), and it is immediately serialized to disk for future reference; we also output diagnostic information like,
- confusion matrix
- classification report
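A minimal sketch of these diagnostics and of saving the model is below. The data are synthetic, the file name `tamil_english_mlp.pkl` is hypothetical, and using `pickle` is an assumption; the repo may serialize the network differently.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)


def make_data(n):
    """Synthetic two-class feature vectors (4 features), n samples per class."""
    X = np.vstack([rng.normal(0.0, 1.0, (n, 4)), rng.normal(3.0, 1.0, (n, 4))])
    Y = np.array([0] * n + [1] * n)
    return X, Y


X_train, Y_train = make_data(100)
X_test, Y_test = make_data(50)

nn = MLPClassifier(
    hidden_layer_sizes=(8, 8, 7), solver="lbfgs", max_iter=500, random_state=0
)
nn.fit(X_train, Y_train)
Y_pred = nn.predict(X_test)

# Rows are the true classes, columns the predicted classes.
cm = confusion_matrix(Y_test, Y_pred)
print(cm)
# Per-class precision, recall and F1 score.
print(classification_report(Y_test, Y_pred, target_names=["English", "Tamil"]))

# Save the fitted network to disk and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tamil_english_mlp.pkl")  # hypothetical name
    with open(path, "wb") as fh:
        pickle.dump(nn, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X_test) == Y_pred).all())
print("restored model agrees:", same_predictions)
```

The off-diagonal entries of the confusion matrix are the misclassified words, which is where an overall error figure comes from.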
- Next we use the trained neural network to show the results for a few known inputs.
- The key point for prediction with the ANN is to transform the input into a feature vector before applying it to the classifier
- Once the training is complete we see results like in item [6].
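One way to make sure new inputs go through the same transformation as the training data is to bundle the scaler and the network into a single scikit-learn pipeline. This is a swapped-in technique shown as a sketch, not necessarily what the repo does; the feature vectors are synthetic and four features wide purely for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Synthetic training features: class 0 centred at 0, class 1 centred at 3.
X_train = np.vstack([
    rng.normal(0.0, 1.0, (100, 4)),
    rng.normal(3.0, 1.0, (100, 4)),
])
Y_train = np.array([0] * 100 + [1] * 100)

# The pipeline applies the scaler before the network on fit and on predict.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(
        hidden_layer_sizes=(8, 8, 7), solver="lbfgs", max_iter=500, random_state=0
    ),
)
model.fit(X_train, Y_train)

# New words must first be turned into feature vectors of the same width
# and column order as the training data; raw strings cannot be passed in.
new_vectors = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [3.0, 3.0, 3.0, 3.0],
])
predicted = model.predict(new_vectors)
print(predicted)
```

With real data, each row of `new_vectors` would come from the same preprocessing step that produced the training CSV.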
Finally we can automatically tell (via a neural network) whether a word such as ‘computer’ is of Tamil or English origin; the decision is not fully reliable because of the roughly 10% error. I have a screenshot of the predictions for various words (the feature vectors are written to the output as well).
To conclude, various artificial neural network topologies and hidden-layer sizes were tried, but we chose to stick with the simplest. At this time the trained neural network seems quite satisfactory, and even ready for practical use.
Conclusion
Scikit-learn provides a powerful framework to build and train classification neural networks.
This work has shown easy classification, with a roughly 10% error rate (~90% classification accuracy), of various Tamil/English vocabularies, including words outside the training vocabulary. The source code is provided at the open-tamil site, including the various CSV data etc.
Good luck exploring neural networks. Getting beyond 90% in this task seemed hard, and there is a long way to go.