Curiously Related Words on Kai Striega

http://kaistriega.com/blog/curiously-related-words/
Recent content in Curiously Related Words on Kai Striega
Generator: Hugo (gohugo.io)
Language: en-us
Last updated: Sun, 08 Dec 2024 08:25:52 +1100

Curiously Related Words Constructing Our Query
http://kaistriega.com/blog/curiously-related-words/curiously-related-words-constructing-our-query/
Sun, 08 Dec 2024 08:25:52 +1100

Querying data

Neo4j uses a query language called Cypher. Cypher was inspired by ASCII art and lets us represent our ideas very intuitively. Nodes are written in parentheses, and relationships are drawn as arrows between nodes. If you have some spare time, I’d suggest you play around with Cypher before continuing, to familiarize yourself with it.

What did we want?

If we go way back to the original post, we said we wanted two things:

Curiously Related Words in Neo4j
http://kaistriega.com/blog/curiously-related-words/curiously-related-words-in-neo4j/
Sun, 08 Dec 2024 08:08:07 +1100

What we have

In the previous post I showed how to parse EtymDB and convert it into a format usable by the admin-import tool. We should now have five csv files:

  - vertex/full.csv
  - vertex/small.csv
  - vertex/with_embedding.csv
  - vertex/with_meaning.csv
  - relationships.csv

Getting Neo4j

Neo4j in the cloud

Neo4j provides a cloud service with a free tier. Unfortunately, the free tier is capped at 200k nodes and 400k relationships. We have 1.8M nodes and 640k relationships, so the free tier is not going to cut it.

Curiously Related Words Preprocessing Our Data
http://kaistriega.com/blog/curiously-related-words/curiously-related-words-preprocessing-our-data/
Sat, 07 Dec 2024 15:32:41 +1100

Previously I made up the concept of a curiously connected word and a high-level plan for finding such words. This post outlines the interesting parts of how I parse EtymDB. For those who are interested in all the code, it is available on my GitHub.

The data we have, and why that’s not enough

As outlined previously, we have two sources of data:

  - EtymDB, a database of words and their etymological relationships
  - gensim, a library of Word2vec models that model the semantic relationships between words

Our goal is to combine these two datasets into something nerdy.

What is a Curiously Related Word?
http://kaistriega.com/blog/curiously-related-words/what-is-a-curiously-related-word/
Sat, 07 Dec 2024 12:53:46 +1100

Contents:

  - What brought this on?
  - Defining a curiously related word pair
  - Common Ancestors
  - Similar Meanings
  - The big ideas

What brought this on?

Did you know that the words “Galaxy” and “Lactose” are related? They both derive from the Proto-Indo-European word “glakt”, which means “Milk”. I didn’t. And, when a friend told me this, I was intrigued. As a computer nerd, this brought up another question: can I automate finding such words?
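
To make the "common ancestor" idea from the Galaxy/Lactose example concrete, here is a minimal sketch in Python. It is not the author's code and does not read EtymDB: the `PARENTS` table is a hand-written toy that records only the one relationship mentioned above (collapsing any intermediate steps into a single link), and the function names `ancestors` and `common_ancestors` are made up for illustration.

```python
# Toy "derives from" table, hand-written for illustration only.
# Real data would come from EtymDB; intermediate words are omitted here.
PARENTS = {
    "galaxy": ["glakt"],
    "lactose": ["glakt"],
    "glakt": [],
}


def ancestors(word, parents):
    """Return every word reachable from `word` by following 'derives from' links."""
    seen = set()
    stack = list(parents.get(word, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def common_ancestors(a, b, parents):
    """Return the ancestors shared by words `a` and `b`."""
    return ancestors(a, parents) & ancestors(b, parents)


print(common_ancestors("galaxy", "lactose", PARENTS))  # {'glakt'}
```

A shared ancestor is only half of the idea; the series also mentions comparing meanings with Word2vec models, which this sketch leaves out.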