Dictionary SQLite database format

This document describes the SQLite export format. This is a database file containing a collection of dictionary entries and supporting indexes and tables.

The supporting tables make it easy to locate articles by searching for words in their various forms. The lemma, form and searchtext tables can be deleted without losing information, since their contents can also be derived from the entry.xjtei structure.

All text strings in the database use the UTF-8 encoding.
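Because the lemma, form and searchtext tables are derived data, a consumer that only needs the entries could drop them to shrink the file. The following is a minimal sketch of that idea using Python's standard sqlite3 module. The column types in the CREATE TABLE statements and the helper name drop_supporting_tables are assumptions made for illustration; the real export may declare its tables differently.

```python
import sqlite3

# Derived tables named in this document; form is dropped before lemma
# because form rows refer to lemma rows.
SUPPORTING_TABLES = ("form", "lemma", "searchtext")


def drop_supporting_tables(conn):
    """Remove the derived tables, leaving the entry data untouched."""
    for table in SUPPORTING_TABLES:
        conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.commit()


# Stand-in database with an assumed, simplified schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entry (id INTEGER PRIMARY KEY, lang TEXT, pos_id TEXT, tei TEXT, xjtei TEXT);
    CREATE TABLE lemma (id INTEGER PRIMARY KEY, orth TEXT, entry_id INTEGER);
    CREATE TABLE form (lemma_id INTEGER, gram_id INTEGER, paradigm INTEGER, orth TEXT);
    CREATE TABLE searchtext (entry_id INTEGER PRIMARY KEY, text TEXT);
""")

drop_supporting_tables(conn)
remaining = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(remaining)
```

On a database file on disk, running VACUUM afterwards reclaims the space that the dropped tables occupied.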

Database schema

ER-diagram

entry

This represents a single dictionary entry: an article describing a single "word".

| Field | Type | Comment |
| --- | --- | --- |
| id | int pk | Every entry has a numeric key |
| lang | enum('nb', 'nn') | ISO 639-1 code: Bokmål or Nynorsk |
| pos_id | fk | What kind of word is this (verb, noun,...) |
| tei | xml null | The dictionary entry in TEI format |
| xjtei | json | The dictionary entry in XJ-TEI format |
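As a rough illustration of how the entry table might be read, the sketch below loads one row and decodes its xjtei column with Python's json module. SQLite has no native enum, xml or json column types, so the TEXT columns and the CHECK constraint here are assumptions. The JSON payload is a placeholder and says nothing about the actual XJ-TEI structure, and load_entry is a hypothetical helper name.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entry (
        id     INTEGER PRIMARY KEY,
        lang   TEXT NOT NULL CHECK (lang IN ('nb', 'nn')),  -- assumed stand-in for the enum
        pos_id TEXT,
        tei    TEXT,            -- nullable XML text
        xjtei  TEXT NOT NULL    -- JSON text
    )
""")

# Placeholder row; the payload is NOT real XJ-TEI.
conn.execute(
    "INSERT INTO entry (id, lang, pos_id, tei, xjtei) VALUES (?, ?, ?, ?, ?)",
    (1, "nb", "n", None, json.dumps({"placeholder": True})),
)


def load_entry(conn, entry_id):
    """Return one entry as a dict with decoded xjtei, or None if missing."""
    row = conn.execute(
        "SELECT lang, pos_id, xjtei FROM entry WHERE id = ?", (entry_id,)
    ).fetchone()
    if row is None:
        return None
    lang, pos_id, xjtei = row
    return {"lang": lang, "pos_id": pos_id, "xjtei": json.loads(xjtei)}


entry = load_entry(conn, 1)
print(entry)
```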

lemma

The base form of the word described by a dictionary entry (ref. Wikipedia). A single article can have multiple lemmas, and the same lemma.orth value can be used by other entries as well.

| Field | Type | Comment |
| --- | --- | --- |
| id | int pk | Each lemma has its own key |
| orth | text | The spelling of the word |
| entry_id | fk | The corresponding entry |
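Since the same orth value can point at several entries, a lookup by spelling generally returns a list. The sketch below shows one way such a lookup might be written. The column types, the index on lemma.orth, the sample rows and the helper name entries_for_lemma are all assumptions made for illustration and do not come from the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entry (id INTEGER PRIMARY KEY, lang TEXT, pos_id TEXT, tei TEXT, xjtei TEXT);
    CREATE TABLE lemma (
        id       INTEGER PRIMARY KEY,
        orth     TEXT,
        entry_id INTEGER REFERENCES entry(id)
    );
    CREATE INDEX lemma_orth ON lemma (orth);  -- assumed index, speeds up lookups

    -- Made-up sample rows: two nb entries share a spelling, one nn entry has two lemmas.
    INSERT INTO entry (id, lang, pos_id, xjtei) VALUES
        (1, 'nb', 'n', '{}'), (2, 'nb', 'n', '{}'), (3, 'nn', 'n', '{}');
    INSERT INTO lemma (id, orth, entry_id) VALUES
        (1, 'lys', 1), (2, 'lys', 2), (3, 'ljos', 3), (4, 'lys', 3);
""")


def entries_for_lemma(conn, orth, lang):
    """Return the ids of all entries in `lang` that have a lemma spelled `orth`."""
    rows = conn.execute(
        """
        SELECT DISTINCT entry.id
        FROM lemma JOIN entry ON entry.id = lemma.entry_id
        WHERE lemma.orth = ? AND entry.lang = ?
        ORDER BY entry.id
        """,
        (orth, lang),
    )
    return [entry_id for (entry_id,) in rows]


print(entries_for_lemma(conn, "lys", "nb"))
print(entries_for_lemma(conn, "ljos", "nn"))
```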

pos

POS stands for 'Part of Speech': the grammatical class that the word belongs to, such as verb, noun or adjective.

| Field | Type | Comment |
| --- | --- | --- |
| id | enum('v', 'n',...) | The class of word (v=verb, n=noun,...) |
| name | text | 'Verb', 'Substantiv', 'Adjektiv',... |
| lang | enum('nb', 'nn') | The language of name |

gram

This expresses the grammatical forms that words of the referenced pos take. For instance, nouns in Norwegian have the following 4 forms:

  • "Entall; Ubestemt form"
  • "Entall; Bestemt form"
  • "Flertall; Ubestemt form"
  • "Flertall; Bestemt form"

| Field | Type | Comment |
| --- | --- | --- |
| id | id | Just something unique |
| name | text | String like "Entall; Ubestemt form" |
| order | int | The natural order for the given pos and lang |
| pos_id | fk | The pos this applies to |
| lang | enum('nb', 'nn') | The language of name |

If gram.name contains ";", names that share the same prefix can be joined under a common column heading. For instance, the 4 forms above can be presented like this: (image: table of forms)
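The prefix grouping can be sketched in plain Python. The helper names split_name and group_headers are hypothetical, and the sketch assumes the names arrive already sorted by the order column, so that names with the same prefix are adjacent.

```python
from itertools import groupby


def split_name(name):
    """Split 'Entall; Ubestemt form' into ('Entall', 'Ubestemt form')."""
    prefix, sep, rest = name.partition(";")
    return prefix.strip(), (rest.strip() if sep else None)


def group_headers(names):
    """Group adjacent names by shared prefix: [(prefix, [sub-heading, ...]), ...]."""
    groups = []
    for prefix, pairs in groupby(map(split_name, names), key=lambda pair: pair[0]):
        # Names without ";" produce a group with no sub-headings.
        subs = [sub for _, sub in pairs if sub is not None]
        groups.append((prefix, subs))
    return groups


names = [
    "Entall; Ubestemt form",
    "Entall; Bestemt form",
    "Flertall; Ubestemt form",
    "Flertall; Bestemt form",
]
headers = group_headers(names)
print(headers)
```

Each tuple corresponds to one joined top-level column ("Entall", "Flertall") spanning its sub-columns.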

form

This encodes how a specific lemma of a word is spelled in its various grammatical forms. Multiple inflection systems can apply to a single word; these are distinguished by the paradigm key. A separate row is filled in for every variation of gram given the word's pos.

| Field | Type | Comment |
| --- | --- | --- |
| lemma_id | fk | Combined key |
| gram_id | fk | Combined key |
| paradigm | int | Combined key |
| orth | text | The spelling of the form |
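To show how form and gram fit together, the sketch below collects the inflected forms of one lemma, grouped by paradigm and sorted by the gram order column. The column types, integer gram ids, sample rows and the helper name paradigms are assumptions for illustration. Note that order is an SQL keyword, so the column name has to be quoted in queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gram (
        id      INTEGER PRIMARY KEY,
        name    TEXT,
        "order" INTEGER,          -- quoted because ORDER is an SQL keyword
        pos_id  TEXT,
        lang    TEXT
    );
    CREATE TABLE form (
        lemma_id INTEGER,
        gram_id  INTEGER REFERENCES gram(id),
        paradigm INTEGER,
        orth     TEXT,
        PRIMARY KEY (lemma_id, gram_id, paradigm)
    );

    -- Made-up sample rows; ids are deliberately not in display order.
    INSERT INTO gram (id, name, "order", pos_id, lang) VALUES
        (10, 'Flertall; Bestemt form',  4, 'n', 'nb'),
        (11, 'Entall; Ubestemt form',   1, 'n', 'nb'),
        (12, 'Entall; Bestemt form',    2, 'n', 'nb'),
        (13, 'Flertall; Ubestemt form', 3, 'n', 'nb');
    INSERT INTO form (lemma_id, gram_id, paradigm, orth) VALUES
        (1, 10, 1, 'bøkene'), (1, 11, 1, 'bok'), (1, 12, 1, 'boka'),   (1, 13, 1, 'bøker'),
        (1, 10, 2, 'bøkene'), (1, 11, 2, 'bok'), (1, 12, 2, 'boken'),  (1, 13, 2, 'bøker');
""")


def paradigms(conn, lemma_id):
    """Return {paradigm: [(gram name, spelling), ...]} in natural gram order."""
    rows = conn.execute(
        """
        SELECT form.paradigm, gram.name, form.orth
        FROM form JOIN gram ON gram.id = form.gram_id
        WHERE form.lemma_id = ?
        ORDER BY form.paradigm, gram."order"
        """,
        (lemma_id,),
    )
    table = {}
    for paradigm, name, orth in rows:
        table.setdefault(paradigm, []).append((name, orth))
    return table


inflections = paradigms(conn, 1)
for paradigm, forms in inflections.items():
    print(paradigm, [orth for _, orth in forms])
```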

searchtext

This contains the concatenation of the plain text found in a dictionary entry. It can be used to implement full-text search for dictionary entries that mention a specific word in their description.

| Field | Type | Comment |
| --- | --- | --- |
| entry_id | pk fk | The dictionary entry text is extracted from |
| text | text | lemma + forms + etym + defs + cits |
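A very simple search over this table can be built with a LIKE query, as sketched below. This is only one possible approach: it matches substrings rather than whole words, and SQLite's default LIKE is case-insensitive for ASCII characters only. The column types, sample rows and the helper name search are assumptions made for illustration. SQLite's FTS5 extension, where available, may be a better fit for larger databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE searchtext (
        entry_id INTEGER PRIMARY KEY,
        text     TEXT
    );
    -- Made-up sample rows standing in for lemma + forms + etym + defs + cits.
    INSERT INTO searchtext (entry_id, text) VALUES
        (1, 'bok boka boken bøker bøkene trykt skrift'),
        (2, 'hylle hylla hyller bokhylle'),
        (3, 'lys lyset');
""")


def search(conn, word):
    """Return ids of entries whose search text contains `word` as a substring.

    Wildcard characters (% and _) inside `word` are not escaped in this sketch.
    """
    rows = conn.execute(
        "SELECT entry_id FROM searchtext WHERE text LIKE ? ORDER BY entry_id",
        (f"%{word}%",),
    )
    return [entry_id for (entry_id,) in rows]


print(search(conn, "bok"))
print(search(conn, "hylle"))
```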