Previously I made up the concept of a curiously connected word and sketched a high-level plan for finding such words. This post outlines the interesting parts of how I parse EtymDB. For those who are interested in all the code, it is available on my GitHub.
The data we have, and why that’s not enough#
As outlined previously we have two sources of data:
- EtymDB, a database of words and their etymological relationships
- gensim, a library of Word2vec models that model the semantic relationships between words
Our goal is to combine these two datasets into something nerdy. But just having data isn’t enough: we need to be able to retrieve, analyse and work with our datasets efficiently. Furthermore, our data isn’t really tabular. What we have are words, with some associated properties, and relationships between them. When you hear “relationships” you may be tempted to think of a relational database. While these are often the right choice, our data is a graph, which can be difficult to work with in relational databases 1. Neo4j provides a database that is specialised in graphs. This seems like the right tool for the job. Plus I want to try a new tool2.
The Plan#
EtymDB provides its data as several csv files. This isn’t ideal, and we want to get it into Neo4j. The admin-import tool allows us to import the data we want; however, it needs to be in its own, very particular, format. We will need to write a conversion script and combine the result with the gensim embeddings. EtymDB stores the vertices and edges separately; these files are:
- etymdb_values.csv for the vertices (or nodes, if you’re not a mathematician)
- etymdb_links_info.csv for the relationships between words
Parsing Vertices#
Each row in etymdb_values.csv represents a word, followed by properties for each word. These are:
- The EtymDB ID
- The language code
- The field3
- The lexeme
- The meaning of the word, in English
One or more of these fields may be missing. We’ll need to think of a strategy to manage these. We’ll also want to:
- convert the language code to a recognisable language
- add another property, the gensim embedding
- encode the data into the format required by the admin-import tool
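To make this concrete, here is what splitting a single row looks like. The row below is a hypothetical illustration in the shape described above, not an actual line from the EtymDB dump:

```python
# A hypothetical vertex row (NOT a real EtymDB entry):
# EtymDB ID, language code, field, lexeme, English meaning, tab-separated.
sample_row = "1234\tfra\t1\tchat\tcat\n"

entries = sample_row.rstrip("\n").split("\t")
print(entries)  # ['1234', 'fra', '1', 'chat', 'cat']
```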
Parsing Edges#
Parsing edges looks much simpler. Each row in etymdb_links_info.csv contains three items:
- The relationship type
- The parent’s EtymDB ID
- The child’s EtymDB ID
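The edge class is skipped in the implementation below, but for completeness, a minimal sketch might look like this (the field names are my assumption, mirroring the vertex class):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    relationship_type: str
    parent_id: int  # the parent's EtymDB ID
    child_id: int   # the child's EtymDB ID
```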
Implementation#
Note: I’ll be focusing on parsing the vertices here. This is the more interesting problem, and I don’t see the benefit in doubling up on the classes - they are nearly identical.
Modelling the data in Python#
Representing our Vertex as a dataclass#
To represent our data I’m going to define a dataclass with the fields we want.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass(frozen=True)
class Vertex:
    etymdb_id: int
    language: str
    field: int
    lexeme: str
    meaning: Optional[str] = None
    embedding: Optional[np.ndarray] = None
```
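As a quick sanity check, frozen=True makes instances immutable. A sketch of constructing a Vertex and confirming it can’t be mutated (the example values are mine):

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional

import numpy as np


@dataclass(frozen=True)
class Vertex:
    etymdb_id: int
    language: str
    field: int
    lexeme: str
    meaning: Optional[str] = None
    embedding: Optional[np.ndarray] = None


v = Vertex(etymdb_id=1234, language="French", field=1, lexeme="chat", meaning="cat")
raised = False
try:
    v.lexeme = "chien"  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    raised = True
```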
Representing Word2Vec#
… and a protocol that represents the different classes of Word2Vec models we may want to use. This protocol defines a single method, get_vector, that, for a given word, returns the embedding of that word (as a NumPy array).

```python
from typing import Protocol

import numpy as np


class Word2VecModel(Protocol):
    def get_vector(self, word: str) -> np.ndarray:
        ...
```
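Any object with a matching get_vector method satisfies this protocol. gensim’s models do, but so does a hand-rolled stub, which is handy for testing without loading a real model. A sketch (the stub class is my own, not from the post):

```python
from typing import Protocol

import numpy as np


class Word2VecModel(Protocol):
    def get_vector(self, word: str) -> np.ndarray:
        ...


class StubModel:
    """A fake model backed by a plain dict; structurally satisfies the protocol."""

    def __init__(self, vectors: dict[str, np.ndarray]):
        self._vectors = vectors

    def get_vector(self, word: str) -> np.ndarray:
        return self._vectors[word]


model: Word2VecModel = StubModel({"chat": np.array([0.1, 0.2])})
vec = model.get_vector("chat")
```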
And some utilities#
Lastly we’ll need a couple of utility functions.
_get_language#
I borrowed the language code to language conversion from EtymDB’s code_to_lang.py file. Unfortunately, this file doesn’t contain all the codes that appear in EtymDB. When looking up a language, we’ll try to find it by its code; if the code isn’t present, we’ll call the language “Unknown”.
```python
def _get_language(language_code: str, code_to_lang: dict[str, str]) -> str:
    try:
        return code_to_lang[language_code]
    except KeyError:
        return "Unknown"
```
_get_embedding#
gensim does not include all words in EtymDB. That’s OK; we don’t require embeddings for all words. But we don’t want the script to error when a word is missing, so I’ve wrapped the lookup in a try..except clause.
```python
def _get_embedding(word: str, model: Word2VecModel) -> Optional[np.ndarray]:
    try:
        return model.get_vector(word)
    except KeyError:
        return None
```
Tying all the data structures together#
I wrote a function parse_vertices that takes an opened file, reads each line and tries to parse it. Rather than return a big list, I’ve made it a generator; this way we don’t hold a massive list in memory.
```python
def parse_vertices(fp: TextIO, code_to_lang: dict[str, str], model: Word2VecModel) -> Iterable[Vertex]:
    for row in fp:
        entries = [e for e in row.rstrip("\n").split("\t") if e != ""]
        match len(entries):
            case 5:
                idx, language_code, field, lexeme, meaning = entries
            case 4:
                idx, language_code, field, lexeme = entries
                meaning = None
            case _:
                continue
        language = _get_language(language_code, code_to_lang)
        embedding = _get_embedding(lexeme, model)
        yield Vertex(
            etymdb_id=int(idx),
            language=language,
            field=int(field),
            lexeme=lexeme,
            meaning=meaning,
            embedding=embedding,
        )
```
Making our data Neo4j compatible#
Defining our data model#
The admin-import tool allows us to import CSV data into Neo4j. As we’ve already parsed EtymDB, we need to turn this data into a format that the admin-import tool can parse. The CSV header format documentation tells us what this format is. If you didn’t read it, here’s the tl;dr:
- Files containing node data can have an ID field, a LABEL field, and properties.
- Properties are strings by default, but can be any of the Property data types.
- Arrays are defined by appending [] to the type and are delimited by a special value, --array-delimiter.
Example#
The vertex class should therefore have a header of etymdb_id:ID,lexeme,field:int,language,embedding:double[],meaning,:LABEL where:
- etymdb_id is the vertex’s ID
- lexeme, language and meaning are strings
- field is an integer
- embedding is an array of doubles
I haven’t touched on the :LABEL
yet. This is the “type” of the vertex.
Implementation in Python#
I chose to extend the Vertex class with a to_neo4j_dict method. I’m doing something funny with the embedding here: it turns out the admin-import tool doesn’t understand what a NumPy array is. We need to format the embedding as a string of values separated by the --array-delimiter (in this case a \t).
```python
def to_neo4j_dict(self) -> dict:
    fields = {
        "etymdb_id:ID": self.etymdb_id,
        "language": self.language,
        "field:int": self.field,
        "lexeme": self.lexeme,
    }
    if self.meaning is not None:
        fields["meaning"] = self.meaning
    if self.embedding is not None:
        fields["embedding:double[]"] = "\t".join(str(x) for x in self.embedding)
    return fields
```
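A sketch of the method in action, on a simplified Vertex where a plain list stands in for the NumPy array (the example values are mine):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Vertex:
    etymdb_id: int
    language: str
    field: int
    lexeme: str
    meaning: Optional[str] = None
    embedding: Optional[list] = None  # np.ndarray in the post

    def to_neo4j_dict(self) -> dict:
        fields = {
            "etymdb_id:ID": self.etymdb_id,
            "language": self.language,
            "field:int": self.field,
            "lexeme": self.lexeme,
        }
        if self.meaning is not None:
            fields["meaning"] = self.meaning
        if self.embedding is not None:
            fields["embedding:double[]"] = "\t".join(str(x) for x in self.embedding)
        return fields


v = Vertex(1234, "French", 1, "chat", meaning="cat", embedding=[0.1, 0.2])
d = v.to_neo4j_dict()
```

Note that when meaning or embedding is None the corresponding key is simply absent, which is what lets us group rows by their set of columns later.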
Actually parsing EtymDB#
Now that we have our data model sorted, we can start actually parsing EtymDB. There is one complication: our data contains words with differing numbers of properties (some lack embeddings, meanings, or both). The admin-import tool requires each row in a CSV file to have the same columns, but it can parse multiple CSVs with different headers.
To do this, I’m going to define a mapping from frozensets of the exact headers we’re looking for to a suffix, which names the file we’re going to store them in:
```python
CSV_HEADER_FILENAMES = {
    frozenset(["etymdb_id:ID", "language", "field:int", "lexeme", ":LABEL"]): "small",
    frozenset(["etymdb_id:ID", "language", "field:int", "lexeme", "meaning", ":LABEL"]): "with_meaning",
    frozenset(["etymdb_id:ID", "language", "field:int", "lexeme", "embedding:double[]", ":LABEL"]): "with_embedding",
    frozenset(["etymdb_id:ID", "language", "field:int", "lexeme", "meaning", "embedding:double[]", ":LABEL"]): "full",
}
```
From here, let’s create a dictionary of each type of vertex entry:
```python
from collections import defaultdict

csv_vertex_groups = defaultdict(list)
with open(etymdb_vertices_path) as fp:
    for vertex in parse_vertices(fp, WIKI_CODE_TO_LANG, MODEL):
        row_values = vertex.to_neo4j_dict() | {":LABEL": "Lexeme"}
        row_key = frozenset(row_values.keys())
        csv_vertex_groups[row_key].append(row_values)
```
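The grouping trick is that a frozenset of a dict’s keys is hashable and order-insensitive, so rows with the same set of columns land in the same bucket. A toy sketch with made-up rows:

```python
from collections import defaultdict

rows = [
    {"etymdb_id:ID": 1, "lexeme": "chat", "meaning": "cat"},
    {"etymdb_id:ID": 2, "lexeme": "chien"},  # no meaning: different column set
    {"etymdb_id:ID": 3, "lexeme": "vache", "meaning": "cow"},
]

groups = defaultdict(list)
for row in rows:
    groups[frozenset(row.keys())].append(row)

print(len(groups))  # 2 distinct column sets
```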
This stores all the vertices grouped by their header. With all the rows collected, we can go through each group in the dictionary and write its values to a CSV:
```python
for header, file_suffix in CSV_HEADER_FILENAMES.items():
    filename = Path(outdir) / "neo4j_import_data" / "vertex" / f"{file_suffix}.csv"
    with open(filename, "w+", newline="") as fp:  # newline="" as the csv docs recommend
        csv_writer = csv.DictWriter(fp, header)
        csv_writer.writeheader()
        for row_values in csv_vertex_groups[header]:
            csv_writer.writerow(row_values)
```
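To see what one of these files ends up containing, here is the same write loop run against an in-memory buffer, with a toy row and an explicit header list for a deterministic column order (both are mine, for illustration):

```python
import csv
import io

header = ["etymdb_id:ID", "lexeme", "meaning", ":LABEL"]
rows = [{"etymdb_id:ID": 1, "lexeme": "chat", "meaning": "cat", ":LABEL": "Lexeme"}]

buf = io.StringIO()
csv_writer = csv.DictWriter(buf, header)
csv_writer.writeheader()
for row_values in rows:
    csv_writer.writerow(row_values)

print(buf.getvalue())
```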
Conclusion#
Hooray! This post has shown you how to parse the data in EtymDB and store it in a format that can be parsed by Neo4j’s admin-import tool. Next up we’ll look a bit more into Neo4j and how to actually load our data.