Motivation
In 2022 we are reaching a point where more Tamil datasets are available than Tamil tools – arunthamizh அருந்தமிழ். However, obtaining fully trained models, or pre-training new ones, is much harder and still requires domain expertise in hardware and software. Personally, I have published some small Jupyter notebooks (see here) and some simple articles, but they remain inadequate to cover the breadth of Tamil computing needs in the AI world, which include:
- NLP – Text Classification, Recommendation, Spell Checking, Correction tasks
- TTS – speech synthesis tasks
- ASR – speech recognition
While sufficient data exists for task 1 (NLP), the private corpora for speech tasks (see the arunthamizh list, அருந்தமிழ் பட்டியல்) and the public 300-hour voice dataset recently published by Mozilla Common Voice (with the University of Toronto Scarborough, Canada, leading the Tamil effort here) have largely filled the data gap for tasks 2 (TTS) and 3 (ASR).
Ultimately, such tooling would make it possible to quickly compose AI services from open-source tools and existing compute environments, and to host services and devices in the Tamil space.
Proposal
My proposal is the following:
- Develop an open-source toolbox for pre-training and task-specific training (specialization)
- Identify good components on which to base the effort
- Contribute engineering effort, testing, and validation
- R&D – data science, infrastructure, AI framework
- Engineering Validation – data science, Tamil language expertise
- Engineering – packaging, documentation, distribution
- Project management
- Library to be liberally licensed (MIT/BSD)
- Open-source license for developed models
- Find hardware resources for AI model pre-training etc.
- Managed by a steering committee / nominated BDFL
- Scope – decade time frame
- TBD – and many more (மேலும் பல).
Summary
Let’s build a pytorch-lightning-like API for Tamil AI tasks across NLP, TTS, and ASR.
Leave your thoughts by email at ezhillang -at- gmail -dot- com, or in the comments section.
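To make the idea a little more concrete, here is a rough sketch of the shape such an API could take. Everything in it is an assumption for illustration: the names TamilTask, Trainer, and SpellChecker are hypothetical and do not come from any existing library, and the toy spell checker uses only the Python standard library instead of a real neural model. The point is only the division of labour borrowed from pytorch-lightning: a task defines its steps, and a trainer owns the loop.

```python
from abc import ABC, abstractmethod
from difflib import get_close_matches


class TamilTask(ABC):
    """Hypothetical base class: one subclass per task (NLP, TTS, ASR)."""

    @abstractmethod
    def training_step(self, example):
        """Consume one training example and update internal state."""

    @abstractmethod
    def predict(self, item):
        """Return the task output for one input item."""


class Trainer:
    """Hypothetical driver that owns the training loop."""

    def __init__(self, max_epochs=1):
        self.max_epochs = max_epochs

    def fit(self, task, dataset):
        # Pass over the dataset max_epochs times, one example per step.
        for _ in range(self.max_epochs):
            for example in dataset:
                task.training_step(example)
        return task


class SpellChecker(TamilTask):
    """Toy NLP task: suggest the closest word from a learned vocabulary."""

    def __init__(self):
        self.vocabulary = set()

    def training_step(self, example):
        # "Training" here is just remembering each word that was seen.
        self.vocabulary.add(example)

    def predict(self, item):
        if item in self.vocabulary:
            return item
        # Fall back to the most similar known word, if one is close enough.
        matches = get_close_matches(item, sorted(self.vocabulary), n=1, cutoff=0.6)
        return matches[0] if matches else item


if __name__ == "__main__":
    words = ["தமிழ்", "கணினி", "மொழி"]
    checker = Trainer(max_epochs=1).fit(SpellChecker(), words)
    print(checker.predict("தமழ்"))  # misspelling of தமிழ்
```

In a real toolbox the task classes would presumably wrap pre-trained models and the trainer would handle hardware, checkpoints, and logging, so that a TTS or ASR task could be swapped in without changing the calling code.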