Trying to train a Japanese reading model
So, in my August post about NLP research, I mentioned that I was looking into training an ML model to give me the hiragana readings of kanji as an aid to reading Japanese text. The plan was to start by converting a public domain corpus of Aozora Bunko works with full furigana, which I found online from the National Diet Library of Japan.
Before I could convert anything, I had to understand the file format.
- The entire file is a set of tab-separated values with three columns total
- Each record represents a single sentence in the original text
- Each record starts with "行番号: " followed by the record number as the first column, and the other two columns are empty
- The following row is the original source text of the line in the first column, a blank second column, and is tagged with "[入力文]" in the third column
  - This line may be omitted if the original line was already fully hiragana (no kanji or katakana)
  - In that case, the next line (the reading line) would be parsed as the original line as well
- The next row is the reading line, with a blank first column, the reading text in the second column, and tagged with "[入力 読み]" in the third column
  - Basically the same text, but with any kanji or katakana converted to hiragana instead
  - This row also has extra space characters inserted between certain parts of the sentence
- The rest of the rows up until the next record marker are a breakdown of the elements of the sentence, one row per element
  - The first column is the original text
  - The second column is the corresponding text in the target sentence
  - The third column is a tag I'm calling POS (even though it's not exactly part-of-speech)
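To make the layout above concrete, here is a minimal sketch of a parser for it. This is not the code I actually used; the names Record and parse_records are made up for illustration, and the marker and tag strings are constants at the top that you should adjust to match the actual file.

```python
from dataclasses import dataclass, field

# Marker and tag strings as described in the list above (assumed spellings;
# adjust them if the file differs).
RECORD_PREFIX = "行番号: "
ORIGINAL_TAG = "[入力文]"
READING_TAG = "[入力 読み]"


@dataclass
class Record:
    number: str
    original: str = ""
    reading: str = ""
    # One (original text, reading text, tag) tuple per sentence element.
    elements: list = field(default_factory=list)


def parse_records(lines):
    """Group tab-separated rows into one Record per sentence."""
    records = []
    current = None
    for raw in lines:
        row = raw.rstrip("\r\n")
        if not row:
            continue
        # Pad to three columns so short rows do not fail on unpacking.
        first, second, tag = (row.split("\t") + ["", "", ""])[:3]
        if first.startswith(RECORD_PREFIX):
            current = Record(number=first[len(RECORD_PREFIX):].strip())
            records.append(current)
        elif current is None:
            continue  # ignore rows before the first record marker
        elif tag == ORIGINAL_TAG:
            current.original = first
        elif tag == READING_TAG:
            current.reading = second
            # The original row is omitted for all-hiragana sentences,
            # so fall back to the reading text as the original.
            if not current.original:
                current.original = second
        else:
            current.elements.append((first, second, tag))
    return records
```

Any iterable of lines works as input, so an open file handle (opened with encoding="utf-8") can be passed in directly.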
Once I had the file format understood and parsed, I could go about converting the data into something I could feed into Eole. From my understanding, Eole expects training data with one sentence per line, with spaces between words and between words and internal punctuation (e.g. commas). Since I'm not translating between different languages, but between different representations of the same language, I had to decide how I wanted to define a "word" in this context.
In my last post I mentioned a post by Zach Ryan, a student at the University of Colorado at Boulder, in which he was "translating" between English and IPA. He tried the approach of just submitting each individual character as a "word" to the model. Since Japanese doesn't normally use spaces between words anyway, and I was also "translating" between a word and its reading, I thought this might be a good approach to try here as well.
Looking at the data I had, the easiest approach would be to just use the full sentences as the source and target. I did have to strip out the existing spaces from them first, especially since the target sentences had added spaces in them (likely to help people visually parse them). I then took each character in the resulting strings and added a space after it to make each character a separate "word" for translation.
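As a rough sketch of that step (the function name to_char_tokens is made up for illustration, and this version puts the space between characters rather than after each one, so there is no trailing space):

```python
def to_char_tokens(sentence):
    """Remove all whitespace, then put a space between every character."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes the full-width space (U+3000).
    compact = "".join(sentence.split())
    return " ".join(compact)


print(to_char_tokens("きょう は はれ"))  # き ょ う は は れ
```

The same function is applied to both the source and the target sentence, so each side ends up as one line of single-character "words".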
I then split the files into train, validation, and test data sets. My original split was 30/30/30, and my accuracy was pretty poor with that, but I'll get back to that in a second. First I want to discuss an issue I ran into trying to install Eole.
You see, Eole has a dependency on the pyonmttok package, and the install failed, telling me that it couldn't find that package. Trying to install it manually failed as well (I tried both conda and pip). Googling turned up this Stack Overflow question about this exact issue. It turns out that package is not available for Windows, and I'm on Windows 10 and was trying to run it natively.
Given that, I switched over to my WSL2 install and tried to install Eole there. It failed there too, complaining about a lack of cmake, so I had to install that using apt first. Once I had cmake in place, though, it finally installed successfully.
Now that I actually had it installed, I roughly followed the instructions in the quickstart guide for Eole for training a model from scratch.
First I needed to build the vocab files. I had put my config in a file named aozora-jpjp.yaml, and started out with the example number of samples of 10,000, giving me the following command line:
eole build_vocab -config aozora-jpjp.yaml -n_sample 10000
I then attempted to train the model:
eole train -config aozora-jpjp.yaml
But I ran into another issue. Eole was breaking when trying to parse the vocab files, complaining about indexing past the end of a list. You see, the vocab files are also tab-delimited files, with two columns. The first column is the "word", and the second is a numeric value (I believe it's the number of times that "word" was found in the source data). The issue I was running into was that one of the vocab words it was pulling from the data was the empty string.
The reason why that was an issue was that the line with the issue had the following call in it:
line.split(None, 1)[1]
Where line held the text of the current line pulled from the vocab file.
When .split() is called on a string, it has special behavior if the first argument is omitted or set to None. From the Python documentation:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
This is what was causing the issue: since the line started with a whitespace character (the tab), that tab was ignored and the split line only had one entry, so the request for the second entry failed. Perhaps the empty record shouldn't have been created in the first place, or maybe that line just needs to be changed. I'll probably end up filing an issue with Eole later, but for now I just changed the delimiter from None to "\t", the tab character.
Doing that changed the behavior so that it split the line into two entries, an empty string and then the number. This fixed the issue for now, and I was able to continue.
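Here is a small standalone reproduction of the difference. The row "\t5\n" is a made-up example of a vocab line with an empty "word", and count_field is just a hypothetical wrapper around the same split call, not Eole's actual code.

```python
def count_field(line, sep=None):
    """Return the second field of a vocab row, like the call shown above."""
    return line.split(sep, 1)[1]


row = "\t5\n"  # made-up vocab row: empty "word", a tab, then the count

print(row.split(None, 1))   # ['5']
print(row.split("\t", 1))   # ['', '5\n']

try:
    count_field(row)            # sep=None: only one entry, so [1] fails
except IndexError as exc:
    print("IndexError:", exc)

print(repr(count_field(row, "\t")))  # '5\n'
```

For rows whose "word" is not empty, both separators give the same two entries, which is why the problem only shows up on that one line.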
Now I was able to do my first attempt at training the model. I started with a laughably small number of training steps, 1000, just to test things at first. With predictably silly results, of course. The majority of the output had lines that started with (ellipsis added for brevity):
ćććććććććććććććććććććććććććć...
When telling a friend about it, I joked that I wasn't aware I had Radwimps song lyrics in the training data. (A reference to their song Zenzenzense featured in Your Name.)
I upped the training steps to 30,000 and tried training again, but the validation accuracy hit a low plateau pretty early on, and a visual inspection of the predictions against the test data set showed... pretty much nonsense.
So, after some quick googling, I re-split my data into 80/10/10 and tried training again. Training fared much better this time, but validation accuracy still capped out at around 60%. This was still not accurate enough to make sense of the prediction output against the test data, but it was much better than the last attempt.
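For reference, a split like this can be sketched as follows. The function name split_dataset, the fixed seed, and the shuffle before splitting are all assumptions for the sake of the example, not a description of exactly how I produced my files.

```python
import random


def split_dataset(pairs, train=0.8, valid=0.1, seed=0):
    """Shuffle, then split into train/valid/test; test gets the remainder."""
    items = list(pairs)
    # A seeded shuffle keeps the split reproducible between runs (assumption:
    # sentences are shuffled rather than split in file order).
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_valid = int(len(items) * valid)
    return (
        items[:n_train],
        items[n_train:n_train + n_valid],
        items[n_train + n_valid:],
    )
```

Each item would be a (source, target) pair, so the two sides of a sentence always land in the same set. Calling it with train=0.9 and valid=0.05 gives a 90/5/5 split instead.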
So, I tried re-splitting the data again, this time 90/5/5. However, at step 21700/30000, my GPU ran out of memory (all 16 GB worth) and sat there thrashing for a while before I finally killed the process. Validation accuracy still only hit 60% at max, so I'm going to take a step back and see what else I can adjust to get the accuracy up where I need it to be.
In the meantime, I thought I'd publish this post on what I've done so far. I'll follow up with another post on this topic when I have more to share.