Stephen's Meandering Missives

August 2024 NLP Research

I recently finished an AI Programming with Python Nanodegree through Udacity, which gave me a much better understanding of the basics of machine learning. I wanted to pursue further learning in that space, and with Japanese being one of my hobbies, I thought something along those lines might be nice.

One of the slowdowns I have when reading in Japanese is unknown kanji. Recognizing a kanji word's meaning without knowing its reading (i.e. how you would spell the word in kana) makes it harder for me to fully understand it and make good connections. It makes it harder to internalize words and make future recall natural.

I've used tools that auto-add readings as furigana before, but they've always been lacking. One common issue is failing to give the most common reading for very common words: for example, giving the reading of 二人 as ににん instead of γ΅γŸγ‚Š. Some of that likely has to do with the segmentation and part-of-speech tagging tools they use, and some of it is likely the fault of their dictionaries.
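
To make that concrete, here's a minimal sketch of how the dictionary-based approach works under the hood. I'm assuming the fugashi MeCab wrapper with the unidic-lite dictionary purely for illustration; any given furigana tool may use different libraries, but the shape is the same:

    import fugashi  # MeCab wrapper; pip install 'fugashi[unidic-lite]'

    tagger = fugashi.Tagger()

    # Each token's reading comes straight from whichever dictionary entry
    # the segmenter picked, so a bad segmentation or a bad entry means
    # bad furigana.
    for word in tagger("二人で行く"):
        print(word.surface, word.feature.kana)

If the entry the segmenter picks for 二人 carries the reading ニニン, that's exactly what ends up rendered as furigana; there's no model of context to override it.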

As far as I can tell, the current state of the art is those dictionary-based tools. I wanted something better, something more likely to get the readings right, so I decided to look into the possibility of using some kind of machine learning model to produce the proper readings when I feed it Japanese text.

After consulting with my friend Matt, I started looking into current NLP libraries and techniques such as Word2vec and bag-of-words. Looking at what they are used for and what they are good at, I felt they weren't quite what I was looking for.

It occurred to me that what I was doing was similar to converting English text to IPA, so I searched around for that and found a post by a student at the University of Colorado Boulder who did just that using a neural machine translation model, with English as the source language and IPA as the target.
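
Framed that way, predicting readings is the same shape of problem: line up source text with its target transcription and train a translation model on the pairs. Made-up examples of what a training pair looks like in each case:

    English -> IPA:    source: cat          target: kæt
    Kanji -> kana:     source: 二人で行く   target: ふたりでいく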

He used OpenNMT, so I looked into that and found OpenNMT-Py. However, OpenNMT-Py was just deprecated a couple of months ago in favor of a new library called Eole.

Now, I just needed some data I could use to train a model: data with kanji words in context alongside their readings. Thankfully, I found this repo called awesome-japanese-nlp-resources, which linked to a public-domain corpus of Aozora Bunko works with full furigana.
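
Aozora Bunko texts mark furigana inline with ruby notation, e.g. ｜二人《ふたり》, where the reading sits in 《》 after the base text and an optional fullwidth ｜ marks where the base begins. Assuming the corpus keeps that notation (I haven't dug into its exact format yet), pulling out kanji/reading pairs could look something like this:

    import re

    # ｜base《reading》 with an explicit start marker, or base《reading》
    # where the base is the run of kanji just before the 《.
    RUBY = re.compile(
        r"｜(.+?)《(.+?)》|([\u4e00-\u9fff\u3005-\u3007]+)《(.+?)》"
    )

    def ruby_pairs(line):
        """Yield (base, reading) pairs from one line of Aozora-style text."""
        for m in RUBY.finditer(line):
            if m.group(1) is not None:
                yield m.group(1), m.group(2)
            else:
                yield m.group(3), m.group(4)

    print(list(ruby_pairs("｜二人《ふたり》で東京《とうきょう》へ行《い》く")))
    # [('二人', 'ふたり'), ('東京', 'とうきょう'), ('行', 'い')]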

I'm going to be working on one or more scripts to convert the corpus data to the format Eole expects, and then see how well this approach works.
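
Eole, like OpenNMT before it, trains on parallel data, which as far as I can tell means line-aligned source and target text files referenced from a YAML config. So the conversion mostly boils down to writing two files, one sentence per line: the source with the ruby markup stripped away, and the target with each annotated base swapped for its reading. A rough sketch reusing the RUBY regex from above (corpus_lines is a hypothetical stand-in for however I end up iterating the corpus):

    def to_source(line):
        # Keep the original text, drop the readings: ｜二人《ふたり》 -> 二人
        return RUBY.sub(lambda m: m.group(1) or m.group(3), line)

    def to_target(line):
        # Swap each annotated base for its reading: ｜二人《ふたり》 -> ふたり
        return RUBY.sub(lambda m: m.group(2) or m.group(4), line)

    with open("train.src", "w", encoding="utf-8") as src, \
         open("train.tgt", "w", encoding="utf-8") as tgt:
        for line in corpus_lines:  # hypothetical corpus iterator
            src.write(to_source(line) + "\n")
            tgt.write(to_target(line) + "\n")

Since the corpus advertises full furigana, every kanji should carry an annotation, which makes the target a fully-kana version of the sentence rather than just isolated readings.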