Yomikata: Heteronym disambiguation for Japanese
I’ve been learning Japanese since 2018, when I was still a grad student at the University of Chicago. Thanks to computers and the internet, learning Japanese is much easier than it used to be. In the old days even looking up new words required whipping out both a kanji dictionary and a separate word dictionary. Now you can just draw kanji directly into your phone.
But there are still many aspects of the Japanese language that computers have trouble with. One that used to drive me nuts stems from the abundance of heteronyms in Japanese: words that are written the same way but pronounced differently. For example, the character 角 can represent the words tsuno (horn), kado (corner), and kaku (angle), among others. In Japanese these kinds of heteronyms are called doukei iongo (同形異音語).
Japanese speakers can easily determine the appropriate reading of the word from context, but a simple dictionary lookup by a computer will get the reading wrong most of the time. For language learners and people who rely on screen readers, as well as all sorts of linguistic and computer speech applications, it would be helpful if a computer could understand the context and determine if the word should be read one way or another, just like a human.
That’s why I made Yomikata, an open-source Python library that uses a machine learning model to disambiguate Japanese heteronyms. Check out the demo page to get a sense of what Yomikata can do, and check out the GitHub repository to learn how to use it in your own projects. And continue reading this article to learn how Yomikata works.
Disambiguating words is a long-studied problem in the field of Natural Language Processing, in both English and Japanese. Until the early 2010s the most popular solution was to construct rules from grammatical or statistical patterns: if 角 appears in a sentence near the word for ‘animal’, it should probably be read tsuno (‘horn’). If instead it appears alongside ‘street’, it probably refers to kado (‘corner’). This approach works well for some words, but there is a limit to the complexity of patterns it can model.
Machine learning approaches to disambiguation have grown in popularity because they can exploit large datasets to learn complex language patterns. In 2018, Gorman et al. reported how they improved Google’s text-to-speech services by training a simple logistic classifier to disambiguate heteronyms in English. This model used the two words to the left and two words to the right of each ambiguous word to disambiguate it.
Ideally we would like to look at an entire sentence of context and not just a few neighboring words. The BERT model offers a way to do this. BERT is a language model based on the transformer architecture that uses context to embed words.
An embedding is a mapping from a word to a vector. For example the word ‘dog’ might be embedded as the vector [0.3, 0.7]. Embeddings are the starting point for machine learning models because they map words into a vector space on which you can do operations. There are many ways to embed words, but most algorithms try to put words with similar meanings close together in the vector space. For example a good algorithm would probably assign ‘dog’ and ‘puppy’ to nearby vectors.
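As a toy illustration of this idea (the vectors below are invented, not taken from any real model), we can measure closeness in the vector space with cosine similarity:

```python
import math

# Made-up 2-d embeddings for illustration; real models use hundreds of dimensions.
embedding = {
    "dog":   [0.30, 0.70],
    "puppy": [0.28, 0.65],
    "angle": [0.90, 0.10],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity(embedding["dog"], embedding["puppy"]))  # close to 1
print(cosine_similarity(embedding["dog"], embedding["angle"]))  # much smaller
```

A good embedding algorithm makes the first similarity much larger than the second, reflecting that ‘dog’ and ‘puppy’ mean nearly the same thing while ‘angle’ does not.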
To limit the number of words that a model needs to learn how to embed, it is often useful to embed pieces of words like stems and suffixes instead of entire words. These word pieces are called ‘tokens.’
Most embeddings are not immediately useful for disambiguating heteronyms: simple embeddings map tokens to vectors without looking at the context, so 角 will always map to the same vector no matter if it represents ‘horn’, ‘corner’, or ‘angle’.
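The problem is easy to see with a toy static lookup table (again with invented vectors): the lookup never sees the surrounding words, so 角 comes out identical in every sentence.

```python
# A static (context-free) embedding table: each token maps to one fixed vector.
# The vector values are invented for illustration.
static_embedding = {"角": [0.5, 0.1, 0.4]}

sentence_horn   = "鹿の角"    # "a deer's horn"   -> 角 should be tsuno
sentence_corner = "角を曲がる"  # "turn the corner" -> 角 should be kado

# The lookup ignores the surrounding words, so both sentences
# yield exactly the same vector for 角.
vec_a = static_embedding["角"]
vec_b = static_embedding["角"]
print(vec_a == vec_b)  # True: nothing here distinguishes tsuno from kado
```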
BERT on the other hand provides contextual embeddings. It takes an initial embedding for each token in a sentence then processes the sentence with a 110 million parameter transformer model. The output is an embedding for each token that contains some information about the tokens around it. Training BERT takes substantial computing resources. The Tohoku University group provides a BERT model that was trained on the entire Japanese Wikipedia.
With contextual embeddings, it becomes possible to disambiguate heteronyms using the entire sentence as context. After going through the BERT transformer 角 / tsuno, 角 / kado, and 角 / kaku will all map to different regions of the vector space because they appear in different types of sentences.
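The mechanism can be caricatured with a toy ‘contextual’ embedder that simply averages each token’s static vector with its neighbors’ (real BERT uses stacked self-attention layers, not averaging, and the vectors here are invented). The point is only that 角 now lands in different places depending on the sentence:

```python
# Invented static vectors for a handful of tokens.
static = {
    "鹿": [1.0, 0.0], "の": [0.5, 0.5], "角": [0.2, 0.8],
    "を": [0.4, 0.6], "曲がる": [0.0, 1.0],
}

def contextual_embed(tokens):
    """Toy contextual embedding: average each token's vector with its
    immediate neighbors'. Real BERT does this with attention layers."""
    out = []
    for i, tok in enumerate(tokens):
        window = tokens[max(0, i - 1): i + 2]  # token plus its neighbors
        vecs = [static[t] for t in window]
        out.append([sum(c) / len(vecs) for c in zip(*vecs)])
    return out

horn   = contextual_embed(["鹿", "の", "角"])      # deer's horn
corner = contextual_embed(["角", "を", "曲がる"])  # turn the corner

# 角 now gets a different vector in each sentence, because its
# neighbors differ -- this is what makes disambiguation possible.
print(horn[2], corner[0])
```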
In 2021, Kobayashi et al. used an SVM to classify which BERT outputs correspond to which heteronym readings for a group of ~70 Japanese heteronyms. Later that year Nicolis et al. at Amazon used a logistic classifier to do the same for BERT embeddings of English heteronyms.
These works kept the BERT model fixed and just trained classification layers on the output. Sato et al. 2021 went a step further and fine-tuned the BERT model weights themselves in order to aid in the classification, in other words training BERT to clearly separate the embeddings for the different readings of each heteronym. This is the approach I use for Yomikata.
Training a model to distinguish heteronym readings requires a large corpus of Japanese text with the different readings for each heteronym labeled. Luckily Sato et al. prepared two such corpora: the Aozora Bunko corpus and the National Diet Library titles corpus.
Aozora Bunko is a library of public domain or permissively licensed works. Sato et al. combined those texts with phonetic information from the Japan Braille Library to produce a high-quality phonetically annotated corpus of Japanese literature.
The NDL titles corpus is a list of titles of books in the National Diet Library. As part of the registration process all book titles are phonetically annotated, yielding a large but functionally limited corpus of annotated text.
I supplement these two large corpora with sentences from the Balanced Corpus of Contemporary Written Japanese, which is the largest and most widely used Japanese language text corpus. Only the small ‘core’ subset of the BCCWJ has annotations that are of good enough quality to act as training data. I also add sentences from the Kyoto University Web Document Leads Corpus, an annotated corpus of the first three sentences of 5,000 web documents.
After combining all these corpora, I search for sentences that contain heteronyms. I then use the sudachi dictionary to filter out sentences that contain heteronyms only as part of larger compound words, because these larger compounds usually disambiguate the heteronyms. I also remove sentences that contain only unexpected readings for heteronyms in order to cut down on the size of the training dataset. Since this cut uses knowledge of the true labels of the heteronyms in preparing the training data for the model, it introduces a bias that future versions of Yomikata should try to eliminate.
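The compound-word filter can be sketched roughly as follows. The heteronym and compound lists here are tiny stand-ins for illustration; the real pipeline consults the Sudachi dictionary rather than a hand-written list.

```python
# Sketch of the corpus filtering step, with stand-in word lists.
HETERONYMS = {"角", "経緯"}
COMPOUNDS = {"三角形", "角度"}  # compounds whose readings are already unambiguous

def keep_sentence(sentence):
    """Keep a sentence only if it contains a heteronym that is NOT
    merely part of a known disambiguating compound."""
    for h in HETERONYMS:
        if h not in sentence:
            continue
        # Strip out known compounds; if the heteronym survives,
        # it appears on its own somewhere in the sentence.
        stripped = sentence
        for c in COMPOUNDS:
            stripped = stripped.replace(c, "")
        if h in stripped:
            return True
    return False

print(keep_sentence("鹿の角が落ちた"))  # True: bare heteronym
print(keep_sentence("三角形を描いた"))  # False: 角 only inside 三角形
print(keep_sentence("犬が走った"))      # False: no heteronym at all
```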
This procedure results in a dataset of 628,755 Japanese sentences and book titles that I split into a training set (70%), a validation set (15%), and a test set (15%).
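A split like this is straightforward to do reproducibly with a fixed random seed (a generic sketch, not Yomikata’s actual preprocessing code):

```python
import random

def split_dataset(sentences, seed=42):
    """Shuffle and split into 70% train / 15% validation / 15% test."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.70 * n)
    n_val = int(0.15 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_dataset([f"sentence {i}" for i in range(1000)])
print(len(train), len(val), len(test))  # 700 150 150
```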
I used as a starting point the Tohoku group’s BERT-base-japanese-v2 model. This model tokenizes sentences using MeCab with the UniDic dictionary. UniDic favors small subword tokens, so many heteronyms are not present in the model’s initial vocabulary and I had to add them in. This is the main modeling improvement I have made over Sato et al.: the initial release of Yomikata covers 130 heteronyms, while Sato et al. report results for 93.
On top of the BERT transformer I bolt a linear layer to classify the BERT embeddings into heteronym readings or an
<OTHER> token that I use to indicate that the token belongs to a larger compound word. Training the model takes about an hour on one of my institute’s Tesla V100 GPUs. I used MLflow to manage and supervise the training process.
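The classification head itself is just a linear map from an embedding to one score per reading. Here is a minimal sketch with invented two-dimensional weights (the real head is a trained layer acting on 768-dimensional BERT outputs):

```python
# Toy classification head for the token 角. Weights are invented;
# the real model learns them during fine-tuning.
READINGS = ["tsuno", "kado", "kaku", "sumi", "<OTHER>"]

# One weight row per reading; real heads act on 768-d BERT vectors.
WEIGHTS = [
    [2.0, -1.0],   # tsuno
    [-1.0, 2.0],   # kado
    [0.5, 0.5],    # kaku
    [-0.5, -0.5],  # sumi
    [0.0, 0.0],    # <OTHER>: token is part of a larger compound
]
BIAS = [0.1, 0.0, -0.2, 0.0, 0.0]

def classify(embedding):
    """Linear layer + argmax: score = W @ x + b, pick the best reading."""
    scores = [sum(w * x for w, x in zip(row, embedding)) + b
              for row, b in zip(WEIGHTS, BIAS)]
    return READINGS[scores.index(max(scores))]

print(classify([0.9, 0.1]))  # a 'horn-like' context vector -> tsuno
print(classify([0.1, 0.9]))  # a 'corner-like' context vector -> kado
```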
Overall accuracy is 94%, but performance varies significantly from heteronym to heteronym. For 角, for example, the reading kaku is correctly identified in 834 out of its 848 appearances in the test data (98% accuracy), and performance for kado is also good (139/157, 89%). Performance is more mediocre for tsuno (37/52, 71%), and the uncommon reading sumi is correctly identified only once in its 15 appearances (1/15, 7%).
Large class imbalances are sometimes a problem: for example, the model completely fails to identify the ikisatsu reading of 経緯 because it is swamped by the keii reading in the data. In other unbalanced cases the model performs well; for example, the two readings henka (‘change’) and henge (‘embodiment’) of 変化 are very well distinguished despite 12,302 appearances of the former and only 78 of the latter in the test data.
My primary goal with Yomikata was to make a package that is easy to use, easy to share, easy to improve on, and easy to integrate into other projects. Despite much existing research on heteronym disambiguation in Japanese, Yomikata is to my knowledge the first public Japanese disambiguation library.
Installing Yomikata requires running just two commands: one to install the code and one to fetch the model binaries. I used Streamlit to make a demo page so people can see what Yomikata can do without having to install anything. The model is small enough that it can run on computers without a GPU.
Yomikata also comes integrated with several dictionaries, so that users can use Yomikata to get readings for all words in a sentence by combining the disambiguation powers of the machine learning model with the breadth of a full dictionary.
Data is the limiting factor in Yomikata’s performance. Many heteronym readings appear only a few dozen times in the training data, and the training corpora are not wholly representative of modern written or spoken Japanese. One way to improve the training data would be to apply Yomikata to a large but unannotated corpus like Wikipedia to find sentences that likely contain the rare readings. These can then be confirmed by hand and fed back into Yomikata’s training data.
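That mining loop could look roughly like this. Everything here is hypothetical: `predict_reading` is a stub standing in for the real model, and the corpus and confidence threshold are made up.

```python
# Sketch of mining candidate sentences for rare readings from an
# unannotated corpus. `predict_reading` is a stand-in stub.
RARE_READINGS = {"sumi", "ikisatsu"}

def predict_reading(sentence, heteronym):
    """Stub: the real version would run the BERT classifier and
    return (predicted reading, confidence)."""
    if "部屋の角" in sentence:
        return ("sumi", 0.9)
    return ("kaku", 0.99)

def mine_candidates(corpus, heteronym, threshold=0.8):
    """Collect sentences the model confidently labels with a rare
    reading, to be verified by hand and added to the training data."""
    candidates = []
    for sentence in corpus:
        if heteronym not in sentence:
            continue
        reading, confidence = predict_reading(sentence, heteronym)
        if reading in RARE_READINGS and confidence >= threshold:
            candidates.append((sentence, reading))
    return candidates

corpus = ["部屋の角に机を置く", "角を計算する", "犬が走った"]
print(mine_candidates(corpus, "角"))  # only the likely-sumi sentence survives
```

Only the high-confidence rare-reading hits would need human review, which keeps the manual annotation burden small.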
Yomikata also only supports the limited number of heteronyms identified in Sato et al., which are predominantly nouns and adverbs. There are many more heteronyms in Japanese which could and should be integrated into the model.
Finally, ambiguous words are just one of many difficulties that computers face when dealing with Japanese text. Japanese has no spaces between words, and so sentence parsers often make word-boundary mistakes. Using deep-learning models to parse and annotate Japanese sentences is promising, but even large language models struggle to match the direct and accurate information-retrieval capability that dictionaries can offer. Efforts to integrate large language models with external tools, like the approach taken in the recent Toolformer paper, are promising avenues to combine the two.