Contents:#
- What brought this on?
- An informal definition of curiously related words
- Common ancestors
- Similar meanings
- The big idea of our algorithm
- Conclusion
What brought this on?#
Did you know that the words “Galaxy” and “Lactose” are related? They both derive from the Proto-Indo-European word “glakt”, which means “milk”. I didn’t. And when a friend told me this, I was intrigued. As a computer nerd, I immediately had another question: can I automate finding such words?
An informal definition of curiously related words#
Before automating something I need to define what I want to do: for a given English word, I want to find all the English words that are curiously connected to it. What does this mean?
Let’s consider a pair of words. We’ll consider the pair curiously connected iff:
- They share a common ancestor
- Their meanings are similar
Going back to our example, I’d intuit the words “Galaxy” and “Lactose” to be curiously connected, as the pair meets both conditions.
Common ancestors#
Etymology studies the origin of a word and the historical development of its meaning. That sounds like exactly what we want. After a quick Google I found EtymDB, a database that contains many words from different languages and their etymological relationships. From EtymDB, for a given word, we can find all of that word’s contemporary related words. For example, consider the pair “potion” and “poison”. In EtymDB we can see that they share the same root word:
It turns out the common ancestor is the Latin word “pōtiō”. The two words have then evolved into their modern forms over time. Data in this format (things and the relationships between them) is often called a graph. While it is possible to analyse graph data in Python, doing so required reconstructing the graph from EtymDB every time, which took a long time (~30 mins) and isn’t great. Instead, I’m planning to use Neo4j, a database designed to work efficiently with graphs.
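To make this concrete, here’s a minimal sketch of what the common-ancestor lookup might look like once the EtymDB data is loaded into Neo4j. The node label `Word`, the relationship type `INHERITED_FROM`, and the property names are my own assumptions about the schema, not something EtymDB or Neo4j prescribes:

```python
from neo4j import GraphDatabase

# Assumed schema: (:Word {lemma, lang}) nodes connected by
# [:INHERITED_FROM] relationships pointing towards the ancestor word.
FIND_COMMON_ANCESTOR = """
MATCH (a:Word {lemma: $word_a})-[:INHERITED_FROM*1..]->(root:Word),
      (b:Word {lemma: $word_b})-[:INHERITED_FROM*1..]->(root)
RETURN root.lemma AS ancestor, root.lang AS language
LIMIT 1
"""

def common_ancestor(driver, word_a, word_b):
    """Return a shared ancestor of the two words, or None if there isn't one."""
    with driver.session() as session:
        record = session.run(FIND_COMMON_ANCESTOR, word_a=word_a, word_b=word_b).single()
        return (record["ancestor"], record["language"]) if record else None

if __name__ == "__main__":
    # Connection details are placeholders for a local Neo4j instance.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    print(common_ancestor(driver, "potion", "poison"))  # e.g. ('pōtiō', 'lat')
    driver.close()
```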
Similar meanings#
We’ve established that “potion” and “poison” share a common ancestor. They are connected. But are they curiously connected? The second part of the definition is that their meanings need to be similar. This is easy to intuit, but hard to quantify. For that we need to computationally represent the meaning of a word, and define some kind of distance metric between words. Then, if the similarity between the two words is high enough (i.e. the distance between them is small enough), we consider them to be curiously connected.
How can I tell a computer what a word means? Ideally we’d want words that intuitively have similar meanings (to us) to be “close” to each other. This is called semantic similarity: a notion of distance between words that is based on their meaning. One way to measure semantic similarity is to embed the words in a vector space. The big idea is that words with similar meanings tend to appear in similar surrounding contexts, so they end up close together in that space.
Implementing this isn’t too hard (maybe in another blog post!) but it is expensive, so I won’t be training the model myself. Luckily pre-trained models of this kind (such as Word2vec) are readily available, and we can use them through the gensim Python library.
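As a quick illustration, here’s a small sketch using gensim’s downloader to fetch a pre-trained model and compare two words. The model name below is one of gensim’s standard pre-trained options; which model I end up using is still an open question:

```python
import gensim.downloader as api

# Download (once) and load a pre-trained embedding model.
# "glove-wiki-gigaword-100" is one of gensim's bundled pre-trained models;
# any Word2vec-style KeyedVectors model would work the same way.
model = api.load("glove-wiki-gigaword-100")

# Cosine similarity between the two word vectors, roughly in [-1, 1];
# higher means the meanings are (distributionally) more similar.
print(model.similarity("potion", "poison"))
print(model.similarity("galaxy", "lactose"))
```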
The big idea of our algorithm#
What are we actually going to be doing? Let’s outline it (a rough Python sketch follows the list):
- Get a word from the user
- Validate that it is in EtymDB
- Validate that it is in English
- Find the root word for the user’s word
- Find all leaf words for the root word
- Make sure each leaf word is in English
- Make sure each leaf word is not the user’s word
- For each leaf word calculate the similarity to the user’s word
- Select the word with the highest similarity
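Put together, the skeleton of that algorithm might look something like the sketch below. Everything here is hypothetical glue: `find_root`, `find_leaves`, and `is_english` stand in for the EtymDB/Neo4j queries we haven’t written yet, and `model` is the pre-trained gensim model from earlier:

```python
def curiously_connected(word, model, find_root, find_leaves, is_english):
    """Return the English word that shares word's root and is closest in meaning.

    find_root(word)   -> the root/ancestor word in the etymology graph (hypothetical helper)
    find_leaves(root) -> all contemporary descendants of that root (hypothetical helper)
    is_english(word)  -> True if the word is an English entry in EtymDB (hypothetical helper)
    model             -> a gensim KeyedVectors model with a .similarity() method
    """
    if not is_english(word):
        raise ValueError(f"{word!r} is not an English word in EtymDB")

    root = find_root(word)

    # Keep only English descendants of the root, excluding the input word itself
    # and anything the embedding model doesn't know about.
    candidates = [
        leaf for leaf in find_leaves(root)
        if is_english(leaf) and leaf != word and leaf in model
    ]
    if not candidates:
        return None

    # Pick the candidate whose meaning is most similar to the user's word.
    return max(candidates, key=lambda leaf: model.similarity(word, leaf))
```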
Conclusion#
I’ve talked about an idea I’ve recently come across, and how (I think) I can automate it. We’ve defined the problem reasonably precisely: we created the concept of a curiously related word pair and covered each of its two properties in depth. We’ve also highlighted the technology we’ll need to implement the solution.