A Properly Normalized Relational Database Specifically Designed To Store, Retrieve, Manipulate and Associate the Content and Structures of Written Languages
By
March, 2009
All rights reserved
Table of Contents
2 NLDB and Debunking the Myths about Relational Databases and Lexical Data
2.1 Myth: Relational Databases Cannot Capture Lexical Information
2.2 Myth: The View of Lexical Data Should Be Tied To the Structure of Data
2.3 Myth: The More Complex the Data, the More Tables Are Required
2.4 Myth: Relational Databases Cannot Represent Hierarchical Data
3 The Database Normalization Imperative
3.1 The Main Benefits of a Properly Normalized Relational Database
4 The Problem with Machine-Readable Dictionaries and Lexical Text-Files
5 NLDB Data Categories and Database Design
6.1 NLDB Main Features and Abilities
7.1 Appendix A: NLDB Physical Data Model
7.2 Appendix B: NLDB Data Dictionary
7.3 Appendix C: NLDB SQL Creation Script
Natural Language Database (NLDB) is a properly normalized relational database that is specifically designed to store, retrieve, manipulate and, more importantly, associate the content, rules and structures of one or more natural languages in a granular and generalized manner, allowing its data to be queried, using standard Structured Query Language (SQL), for any conceivable purpose. Natural language content, structures and rules are represented in and reflected by the content and structure of NLDB tables and their relationships with one another. I can demonstrate that NLDB can be the fundamental data source and foundation upon which viable, practical Natural Language Processing and Computational Linguistics applications can be built.
NLDB is a unified, application independent, general-purpose, generalized, data driven natural language knowledge base. NLDB data can be imported and integrated from any machine-readable dictionaries, thesauri, lexical data files, annotated corpora or any grammatically correct text via grammar parsing routines. NLDB can store and retrieve any natural language structures and data including, but not limited to, grammatical, morphological, orthographical, syntactic, semantic, lexical and phonological. NLDB is currently capable of storing the contents of the annotated SUSANNE Corpus, WordNet 3.0 and most of the content from the Longman Dictionary of Contemporary English including phonograms and syllabification.
One of NLDB’s most powerful and fundamental abilities is to store any given language structure independently from, yet associated with, its words. For example, take the following three sentences: 1) “The cow jumped over the moon”, 2) “A cat crawled under the car”, 3) “The boy hopped across the street”. These three sentences share one distinct sentence structure that need only be stored one time and in one place. The words from each of the three different sentences are stored separately yet can be associated with this one shared distinct sentence structure. Distinct structures can be hard-mapped to any equivalent structures in any language, including their own. Distinct question structures can be mapped to distinct and equivalent answer structures. With the right processing, this ability to map structures can be used for accurate machine translation and natural language processing. The following quotes seem to express the need for and usefulness of a system such as NLDB.
“…what the research community ultimately needs is very large databases of language, analyzed in very great detail.”[1]
"Stroustrup (1997) states that 'constructing a useful…model of an application area is one of the highest forms of analysis. Thus, providing a tool, language, framework, etc., that makes the result of such work available to thousands is a way for programmers and designers to escape the trap of becoming craftsmen of one-of-a-kind artifacts.” [2]
“To enable computers to process human language, we need databases (corpora) of language samples annotated to show their structural features, as a source of information and statistics to guide the development of language-processing algorithms. This in turn requires some set of categories to be explicitly defined, so that researchers exchanging language data can be confident that they are using the annotations in the same way. Computational linguistics needs something like the Linnaean taxonomy created for botany in the 18th century, which for the first time enabled naturalists everywhere to exchange information about plants secure in the knowledge that when they used the same names they were talking about the same things.”[3]
Understanding the critical importance of properly normalizing databases is necessary to understand and appreciate the significance of NLDB’s potential contribution to and impact on the creation of viable new natural language applications and accurate machine translation. Critical differences separate and distinguish the NLDB normalized relational database from currently available machine-readable dictionaries, thesauri, lexical data files and annotated corpora.
NLDB documentation consists of a data model and a data dictionary. NLDB is composed of twelve core tables. The data model is an Entity Relationship Diagram (ERD) identifying all of NLDB’s database tables, their structure, and their relationships with one another. The data dictionary documents each table’s statement of purpose, its structure, what its columns represent, practical examples of its purpose, and example data.
Some of NLDB’s general categories of data include [WordTemplate], [Structure], [Word] and [Attribute] each of which has a very specific meaning.
A [WordTemplate] is simply a structure that identifies the abstracted representation of a word. It represents the place in other structures where an actual word would be. A word-template is defined by a grouping of any number of attributes that uniquely define the type of abstracted word that is allowable in a specific point in a structure’s sequence.
A [Structure] is defined by a [StructureType], a [Language] and an ordered sequence of one or more [WordTemplates]. It is a unique, distinct grammatical structure that is divorced from, yet associated with, actual words. It is stored only one time and in one place. It can represent a word or an abstract, grammatically well-formed phrase, sentence, paragraph, chapter, book, conversation or any other defined structure.
A [Word] is defined by a distinct spelling, a word sense, a phoneme, a syllabification, a language and any number of other attributes. Any distinct spelling of any word can be stored one time and in one place. It doesn’t matter if a distinct spelling has fifty different senses or meanings in each of fifty different languages, because associated attributes or attribute hierarchies can identify anything about that distinct spelling in each language.
[Attribute] values can be hierarchically organized in any conceivable way and in any number of hierarchical levels. Each distinct attribute need only be stored one time in one place. Attributes can define: Structures, Words, WordTemplates, and any other attributes. In a hierarchical grouping, parent attributes, child attributes, relationship types, sequential ordering of attributes in a group, and nesting level can all be identified.
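The relationship between [Structure], [WordTemplate] and [Word] described above can be sketched with a few tables. This is a minimal illustration, not the NLDB physical model: the table and column names (`Structure`, `WordTemplate`, `Sentence`, `SentenceWord`) are assumptions made for the example, and words are stored as plain text per sentence for brevity rather than as references to a single stored spelling.

```python
import sqlite3

# Hypothetical, simplified schema; names are illustrative assumptions,
# not the actual NLDB tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Structure (
    StructureID   INTEGER PRIMARY KEY,
    StructureType TEXT NOT NULL,
    Language      TEXT NOT NULL
);
CREATE TABLE WordTemplate (
    StructureID INTEGER NOT NULL REFERENCES Structure(StructureID),
    Position    INTEGER NOT NULL,
    Description TEXT NOT NULL,
    PRIMARY KEY (StructureID, Position)
);
CREATE TABLE Sentence (
    SentenceID  INTEGER PRIMARY KEY,
    StructureID INTEGER NOT NULL REFERENCES Structure(StructureID)
);
CREATE TABLE SentenceWord (
    SentenceID INTEGER NOT NULL REFERENCES Sentence(SentenceID),
    Position   INTEGER NOT NULL,
    Spelling   TEXT NOT NULL,
    PRIMARY KEY (SentenceID, Position)
);
""")

# The shared structure is stored once, as an ordered sequence of word templates.
conn.execute("INSERT INTO Structure VALUES (1, 'Declarative Sentence', 'English')")
templates = ["Determiner", "Noun", "Verb", "Preposition", "Determiner", "Noun"]
conn.executemany(
    "INSERT INTO WordTemplate VALUES (1, ?, ?)",
    list(enumerate(templates, start=1)),
)

# Three different sentences point at the same structure row.
sentences = [
    "The cow jumped over the moon",
    "A cat crawled under the car",
    "The boy hopped across the street",
]
for sentence_id, text in enumerate(sentences, start=1):
    conn.execute("INSERT INTO Sentence VALUES (?, 1)", (sentence_id,))
    conn.executemany(
        "INSERT INTO SentenceWord VALUES (?, ?, ?)",
        [(sentence_id, pos, word) for pos, word in enumerate(text.split(), start=1)],
    )

structure_count = conn.execute("SELECT COUNT(*) FROM Structure").fetchone()[0]

# Rebuild each sentence from its words, ordered by position.
words_by_sentence = {}
for sentence_id, spelling in conn.execute(
    "SELECT SentenceID, Spelling FROM SentenceWord ORDER BY SentenceID, Position"
):
    words_by_sentence.setdefault(sentence_id, []).append(spelling)
rebuilt = [" ".join(words) for _, words in sorted(words_by_sentence.items())]
```

After this runs, `structure_count` is 1 even though three sentences are stored, and `rebuilt` reproduces the three original sentences from their separately stored words.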
There exists a pervasive misunderstanding about the ability of relational models to represent lexical and language related information. This misunderstanding stems from a general lack of knowledge about relational modeling and from the belief that a perceived failure to accurately represent lexical data relationally is due to relational database technology itself when in fact this is due to a failure to apply this technology to accurately represent lexical and language structures relationally. I hope to eliminate these misunderstandings by explaining the NLDB design, my design decisions and how NLDB works.
I have designed and built numerous large enterprise-scale commercial databases having upwards of 120 tables. These are for systems that track inventory, sales and so on. NLDB has 12 tables. Any data can be accurately modeled relationally regardless of whether that data is related to business, science or linguistics. NLDB is a relational model that captures the structural properties of lexical data by allowing the nesting of attributes.
“Lexical data, as is obvious in any dictionary entry, is much more complex than the kind of data (suppliers and parts, employees' records, etc.) that has provided the impetus for most database research. Therefore, classical data models (e.g., relational) do not apply very well to lexical data, although several attempts have been made.” [4]
“Relational data models, including normalized models which allow the nesting of attributes, cannot capture the structural properties of lexical information.” [5]
The quote below suggests that the relational representation of lexical data is problematic because normalized data exists in different tables, ‘thus fragmenting the view of the data’. This is a common misunderstanding that stems from confusing the way in which data is presented to the user with the physical structure of the data in the database.
Many people assume that data should be physically stored in a database in the way that is easiest for a user to view. Storing data this way produces an un-normalized structure, which makes the data difficult to store, retrieve and manipulate. Data should be stored first and foremost in the way that most easily facilitates its storage, retrieval and manipulation. Once data is properly structured, it can be presented to the user in any conceivable way.
“Fragmentation of data: The most obvious problem arises from the fact that the number of values for each attribute in dictionary entries varies enormously. For example, entries may include several different pronunciations, parts of speech, orthographic variants, definitions, etc., while some other fields, such as examples, synonyms, cross-references, domain information, geographical information, etc., may be completely absent. To avoid massive duplication of data, the information must be split across several tables, thus fragmenting the view of the data.” [6]
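The distinction between physical storage and the user's view can be illustrated with an SQL view. In this hedged sketch, the `Entry` and `Definition` tables and the `EntryView` view are hypothetical names invented for the example. The data is split across tables, yet a single query presents it as a unified dictionary-style listing, including an entry whose definitions are completely absent.

```python
import sqlite3

# Hypothetical tables for illustration only; not the NLDB schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Entry (
    EntryID  INTEGER PRIMARY KEY,
    Headword TEXT NOT NULL UNIQUE
);
CREATE TABLE Definition (
    EntryID    INTEGER NOT NULL REFERENCES Entry(EntryID),
    SenseNo    INTEGER NOT NULL,
    Definition TEXT NOT NULL,
    PRIMARY KEY (EntryID, SenseNo)
);
-- The view reassembles the normalized tables for presentation.
-- LEFT JOIN keeps entries that have no definitions at all.
CREATE VIEW EntryView AS
SELECT e.Headword, d.SenseNo, d.Definition
FROM Entry e
LEFT JOIN Definition d ON d.EntryID = e.EntryID;
""")

conn.executemany("INSERT INTO Entry VALUES (?, ?)", [(1, "play"), (2, "aborigine")])
conn.executemany("INSERT INTO Definition VALUES (?, ?, ?)", [
    (1, 1, "a theatrical performance of a drama"),
    (1, 2, "a preset plan of action"),
])

view_rows = conn.execute(
    "SELECT Headword, SenseNo, Definition FROM EntryView ORDER BY Headword, SenseNo"
).fetchall()
```

The number of values per entry varies (two definitions for one headword, none for the other), but the stored tables stay free of duplication and the view still returns one coherent result set.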
The claim quoted below is gratuitous and unsupportable. The apparent complexity of the data has nothing to do with the number of tables in an appropriately normalized relational database. The number of tables in a database is determined by how data is modeled and normalized regardless of its apparent complexity.
“To avoid massive duplication of data, the information must be split across several tables, thus fragmenting the view of the data. The more complex the data, the more tables are required, and the more fragmented the view.” [7]
The claims quoted below are unsupportable. There are a number of commonly used ways to capture hierarchically structured data in relational databases, including recursive relationships and self-identifying entities. All data can be relationally modeled, whether it is hierarchical or not.
“…the relational model cannot capture the obvious hierarchy in most dictionary entries.” [8]
“Michael Stonebraker recently argued that the traditional database concept of ‘one size fits all’ is no longer applicable in the database market. Nowhere is this more true than with scientific data. Scientific data is different from business data, for which current [relational] database technology has been developed. Much of scientific data is tree structured because it models an inherently hierarchical process or object.” [9]
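One of the techniques named above, the recursive relationship, can be sketched as a table whose rows refer to their parent rows, walked with a recursive common table expression. The `Node` table and its contents are assumptions made for illustration; they stand in for the hierarchy of a dictionary entry (headword, part of speech, sense, example) and are not taken from the NLDB schema.

```python
import sqlite3

# Hypothetical self-referencing table; names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Node (
    NodeID   INTEGER PRIMARY KEY,
    ParentID INTEGER REFERENCES Node(NodeID),  -- NULL marks a root
    Label    TEXT NOT NULL
);
""")
conn.executemany("INSERT INTO Node VALUES (?, ?, ?)", [
    (1, None, "play"),
    (2, 1, "noun"),
    (3, 2, "a theatrical performance of a drama"),
    (4, 3, "the play lasted two hours"),
    (5, 2, "a preset plan of action"),
])

# Walk the hierarchy from the root down, building a path for every node.
paths = [row[0] for row in conn.execute("""
    WITH RECURSIVE Tree(NodeID, Path) AS (
        SELECT NodeID, Label FROM Node WHERE ParentID IS NULL
        UNION ALL
        SELECT n.NodeID, t.Path || ' > ' || n.Label
        FROM Node n
        JOIN Tree t ON n.ParentID = t.NodeID
    )
    SELECT Path FROM Tree ORDER BY Path
""")]
```

A single table holds a hierarchy of arbitrary depth; adding a deeper level requires only more rows, not more tables.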
Few people, even in the field of Computer Science, understand or appreciate the critical importance of data modeling and properly normalizing databases. Understanding this critical importance is necessary to understand and appreciate the significance of NLDB’s potential contribution to and impact on the creation of viable new natural language applications, accurate machine translation and many other practical software applications. A vital dependency exists between proper database design and the functionality required to support practical language applications.
One main benefit of a normalized relational database comes from the ability to store a distinct piece of data in one specific, distinct and unique place once and only once. This simplifies and speeds data storage and retrieval, eliminates duplicate and redundant data, eases the data maintenance burden, reduces or eliminates inconsistent data and allows for the efficient use of storage space.
Another main benefit comes from the ability to retrieve data easily in virtually any conceivable way via SQL. SQL is a standardized language for manipulating data in relational databases through a Relational Database Management System (RDBMS). SQL has been adopted by the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO) as the standard data access language.
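Both benefits can be seen in a small sketch. The `Spelling` and `WordSense` tables below are hypothetical and simplified (for example, the language is kept as plain text); they are meant only to show a distinct value being stored once and then queried ad hoc with SQL.

```python
import sqlite3

# Hypothetical, simplified tables; not the NLDB schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Spelling (
    SpellingID INTEGER PRIMARY KEY,
    Spelling   TEXT NOT NULL UNIQUE   -- each distinct spelling exists once
);
CREATE TABLE WordSense (
    SenseID    INTEGER PRIMARY KEY,
    SpellingID INTEGER NOT NULL REFERENCES Spelling(SpellingID),
    Language   TEXT NOT NULL,
    Gloss      TEXT NOT NULL
);
""")

conn.execute("INSERT INTO Spelling VALUES (1, 'play')")
conn.executemany(
    "INSERT INTO WordSense (SpellingID, Language, Gloss) VALUES (1, 'English', ?)",
    [("a theatrical performance of a drama",), ("a preset plan of action",)],
)

# Benefit 1: the database itself refuses a second copy of the same spelling.
try:
    conn.execute("INSERT INTO Spelling (Spelling) VALUES ('play')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Benefit 2: an ad hoc question answered with plain SQL.
senses_per_spelling = conn.execute("""
    SELECT s.Spelling, COUNT(*)
    FROM Spelling s
    JOIN WordSense w ON w.SpellingID = s.SpellingID
    GROUP BY s.Spelling
""").fetchall()
```

Because the spelling lives in one row, correcting or annotating it is a single-row change that every associated sense sees immediately.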
There are many different machine-readable dictionaries, thesauri, lexical data files and annotated corpora currently available. They contain varying kinds of data including grammatical, morphological, syntactic, semantic and lexical. They are generally composed of some data text files and some custom programming to provide data access. I have reverse engineered several of these to expose their underlying data structures and inner workings. The ways in which their data is structured and accessed limit their utility to fairly simple searches. Table 1 identifies some of the more popular of these data sources.
Acronym | Description
 | British National Corpus World Edition on CD-ROM
 | The Linguistic Data Consortium’s Lexical Database
 | Oxford’s English Dictionary
 | Princeton University’s Lexical Database for the English language
 | Lexical Data Files, University of Sheffield
 | Diachronic Corpus of Present-Day Spoken English
 | The International Corpus of English
 | The Linguistic Data Consortium’s English Dictionary
 | The SUSANNE Corpus and Analytic Scheme
Table 1
Critical and significant differences separate and distinguish NLDB from these text-file systems. Most notable of these differences are: 1) the content of the data, 2) the way the data is structured and 3) the methods of data access. A common misconception is that all electronic representations of rows and columns of data are of equal utility. This could not be further from the truth. Not all databases are created equal. A poorly designed database severely limits the ability to extract and represent meaningful data. A properly designed database allows data to be efficiently extracted and represented in any conceivable way.
Some refer to these text-file systems as databases or as relational. Although some of these systems can be considered databases in a rudimentary sense of the word, none of those I have worked with is normalized in any way or reliant on an RDBMS. They generally violate first and all subsequent normal forms. Consequently, the meaningful separation of disparate categories of data and their relationships to one another are lost or obscured. This necessitates that each disparate text-file system come with an associated computer program that is custom built specifically to operate only on its respective text files. These custom-built programs are limited to a predefined set of operations on their respective text files.
Properly normalized databases are designed so that the rules that relate their various tables to one another are contained within the structure of the database itself. The text-file systems I have seen obscure many or all of these rules by embedding them in the programming code of their associated computer programs. This obfuscation, combined with their un-normalized file structures, effectively renders SQL useless or problematic as a means of extracting and representing meaningful data. It also limits the data’s utility to the set of functions defined by the associated programs.
NLDB is designed very specifically to manipulate, store and retrieve any natural language content, structures and rules. In this section I explain the main categories of data that NLDB stores and the relationships that these categories have with one another. The content, structures and rules of natural languages are both represented in and reflected by the content and structure of NLDB database tables and their relationships with one-another. NLDB currently stores all WordNet 3.0 synsets in four different tables, none of which have more than six columns. This WordNet data can not only be represented as in WordNet, but can now be queried via SQL in any conceivable way.
A good database design should be data-driven. Being data-driven means that database tables and relationships should not change with the introduction of new kinds of data. The population of NLDB with new kinds of data has no bearing on the physical structure of the database. It is unnecessary to add or change NLDB tables as a result of adding any new data or new categories of data.
A benefit of this generalized, flexible approach to database design is that it allows NLDB to be populated with many disparate classes of attributes including Part of Speech, Language, Word Sense, Semantic Relations, Phonetic Pronunciation, etc. As with any database, the quality of its results is determined by the quality of its data. The ‘Garbage In, Garbage Out’ (GIGO) principle applies to NLDB as it does to anything else.
Table 2 illustrates some data categories that NLDB can store along with a brief description of the category’s purpose and a small sample of the category’s data where appropriate. These categories are not physical database tables. There is no one-to-one correspondence of these categories and NLDB physical tables. These are only conceptual representations of some data categories.
Category | Purpose | Example
Word | Identifies a unique and distinct spelling of a word in a language along with its specific sense and a main part of speech. A word can also be identified by a virtually infinite number of different attributes and classes of attributes. |
Language | Identifies the unique and distinct set of languages. | ‘French’, ‘German’, ‘English’…
Spelling | Identifies the unique and distinct set of symbols or spellings of words in one or more languages. | ‘Aborigine’, ‘аудитория’, ‘Чайковский’, ‘关系数据库’…
StructureType | Identifies the unique and distinct set of structure type categories and structure types that a language structure can have. This category can also represent infinitely nestable structure type categories and structure types. | ‘Paragraph’, ‘Declarative Sentence’, ‘Word’, ‘Phrase’, ‘Article’, ‘Book’, ‘Conversation’…
StructureRelation | Identifies the unique and distinct set of relationships that language structures can have with one another. Any structure can have any number of any relationship types with any other structure. | ‘Equivalent’, ‘Question’, ‘Answer’, ‘Member of’…
Phoneme | Identifies the unique and distinct set of sequenced Phonograms that combine to form the pronunciation(s) of a word represented by phonetic symbology. | (t-mt, -mä-), (kn-found, kn-)…
Template | Identifies the unique and distinct set of "skeletal" language structures. A [Template] is an abstract and well-formed grammatical sentence, phrase or other structure that is divorced from, yet associated with, any actual words and contains all the salient attributes of the words and the structure. This can be used to identify actual words that are allowable in any particular sequence of that [Template]. |
WordRelation | Identifies the unique and distinct set of the types of relationships that one word can have with another word. | ‘Homonym’, ‘Hypernym’, ‘Hyperonym’, ‘Hyponym’, ‘Synonym’, ‘Antonym’…
WordSense | Identifies the unique and distinct set of senses that a word can have in any language. | ‘(a theatrical performance of a drama) "the play lasted two hours"’…
Table 2
Any number of NLDB attributes can specify: a structure, a structure’s relationship with another structure or word, a word, a word’s relationship with another word or structure, another attribute, or an attribute’s relationship with another attribute, word or structure. These attribute entities represent unique and distinct attributes and attribute hierarchies. The hierarchical nature of these entities allows for any number of new and disparate attributes, attribute categories or any organization of new hierarchies of attribute categories.
Many disparate kinds of attributes can be stored in the [Attribute] table. One alternative to storing many disparate kinds of attributes in one [Attribute] table would be to segment the database structure into disparate, discrete, predefined kinds of attribute tables such as [WordRelation], [Spelling], [PartOfSpeech], [Tense], [WordSense], [Language], [StructureRelation]. The problem with this approach is that the database structure is no longer data driven because now if a user needs to add a new category of attributes that does not belong in any of these pre-defined tables, then the user is forced to alter the database structure by adding a new table to the database and change the structure of other tables in order to accommodate one new table. Also, if the user needs to create more than one new attribute category and organize them into a specific hierarchy of attribute categories then the user is forced to alter the database structure even more by adding a new table for each category and changing the structures of other tables to represent the hierarchy. This is an important aspect of relational design that explains why many attempts at relational lexical knowledge representation fail.
It is critically important to be able to add any number of new attributes, any number of new attribute categories and any number and organization of new hierarchies of attribute categories without having to change the structure of the database. For these reasons it is critically important for one [Attribute] table to be able to represent any number of new attributes, any number of new attribute categories and any number and organization of new hierarchies of attribute categories. It is for these reasons that the [Attribute] tables are used to store many disparate, unique, and distinct categories of data including, but not limited to: [PartOfSpeech], [Language], [Spelling], [Phonogram], [StructureRelation], [StructureType], [WordSense], [SemanticRelation], and [PhoneticPronunciation].
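The data-driven property argued for above can be sketched with a single self-referencing table. The `Attribute` table here is a hypothetical simplification (the actual NLDB attribute tables are not reproduced in this paper), and `add_attribute` and `table_names` are helper functions invented for the example. The point is that a new attribute category arrives as rows, leaving the set of tables unchanged.

```python
import sqlite3

# Hypothetical single attribute table; an illustrative assumption,
# not the actual NLDB [Attribute] tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Attribute (
    AttributeID INTEGER PRIMARY KEY,
    ParentID    INTEGER REFERENCES Attribute(AttributeID),  -- NULL = top-level category
    Name        TEXT NOT NULL
);
""")

def add_attribute(name, parent_id=None):
    """Insert one attribute row and return its generated key."""
    cur = conn.execute(
        "INSERT INTO Attribute (ParentID, Name) VALUES (?, ?)", (parent_id, name)
    )
    return cur.lastrowid

def table_names():
    """List the tables that currently exist in the database."""
    return [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )]

# An existing category with a small hierarchy.
part_of_speech = add_attribute("Part of Speech")
noun = add_attribute("Noun", part_of_speech)
add_attribute("Proper", noun)
add_attribute("Common", noun)

tables_before = table_names()

# A brand-new category and its members: rows only, no ALTER or CREATE TABLE.
tense = add_attribute("Tense")
add_attribute("Past", tense)
add_attribute("Present", tense)

tables_after = table_names()
tense_members = [row[0] for row in conn.execute(
    "SELECT Name FROM Attribute WHERE ParentID = ? ORDER BY Name", (tense,)
)]
```

With one table per predefined category, the `Tense` step would instead have required a new table and changes to every table that refers to attributes.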
This Entity Relationship Diagram (ERD) illustrates the deceptively simple structure of the [Attribute] entities.
Table 3 illustrates how data in the attribute tables can be used to visually hierarchically represent any attributes.
Attribute
Language
    French
    German…
Spelling
    Aborigine
    аудитория…
StructureType
    Paragraph
    Declarative Sentence…
WordRelation
    Homonym
    Meronym…
WordSense
    Play: a theatrical performance of…
    Play: a preset plan of action...
Part of Speech
    Noun
        Proper
        Common
            Countable
            Uncountable
        Concrete
        Abstract
Table 3
NLDB’s [Structure] entities constitute a language taxonomy specifically designed to store the surface and underlying structures of Natural Languages. After designing NLDB’s [Structure] tables, I came across the SUSANNE Analytic Scheme created by Dr. Geoffrey Sampson of the University of Sussex. The SUSANNE Analytic Scheme is a comprehensive language-engineering-oriented taxonomy and annotation scheme for the (logical and surface) grammar of English.[10] The annotated SUSANNE Corpus is a 130,000-word cross-section of written American English annotated in accordance with the scheme. The name ‘SUSANNE’ stands for ‘Surface and underlying structural analysis of natural English’.[11]
“SUSANNE scheme is so far as I am aware the first serious attempt anywhere to produce a comprehensive, fully explicit annotation scheme for English grammatical structure. It has won praise internationally” [12]
I discovered that NLDB required no modification to import and integrate the structures and contents of both the SUSANNE Analytic Scheme and the annotated SUSANNE Corpus. Although NLDB can accommodate this data, it was not originally designed to do so. NLDB’s [Structure] tables are generalized so that they can store SUSANNE Scheme tags as attributes as well as any other attributes, attribute categories and hierarchies of attribute categories which need to be added in the course of dealing with additional phenomena of natural languages.
As you can see, the three sentences below share an identical sentence structure. Table 4 graphically illustrates the abstract and well-formed grammatical sentence structure, or [Template], divorced from any actual words. This ‘skeletal’ structure can identify all salient attributes of the words and the structures. This structure can be used to identify any actual words that are allowable in any particular sequence of the structure template. By placing different allowable words in a well-formed [Template], a virtually infinite number of sentences could be automatically created from this one single [Template].
Actual Sentences
1 | 2 | 3 | 4 | 5 | 6
The | cow | jumped | over | the | moon
A | cat | crawled | under | the | car
The | boy | hopped | across | the | street

Position(s) | Template attributes
1-6 | Illocution: Assertion; Voice: Active; Structure Type: Sentence
1-2 | Noun Phrase; Subject
1 | Determiner: Article
2 | Noun; Noun: Concrete: Object: Life Form; Noun: Common: Countable; Noun: Singular; Simple Subject
3-6 | Verb Phrase; Predicate
3 | Verb; Verb: Action: Motion; Verb: Number: Singular; Verb: Regular; Verb: Transitive; Verb: Indicative: Past: Simple: Active; Simple Predicate
4-6 | Prepositional Phrase
4 | Preposition: Self: Motion: Area (See Preposition Project)
5-6 | Noun Phrase; Object
5 | Determiner
6 | Noun; Noun: Concrete: Object: Non-Life Form; Noun: Common: Countable; Noun: Singular

SUSANNE Scheme tags [O[S[Ns:s.Ns:s][Vd.Vd][P:p.[Ns.Ns]P:p]
1 | 2 | 3 | 4 | 5 | 6
[O[S[Ns:s. | .Ns:s] | [Vd.Vd] | [P:p. | [Ns. | .Ns]P:p]
AT | NN1c | VVDv | II | AT | NN1c
Table 4
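The idea of generating sentences by placing allowable words into a [Template] can be sketched as follows. The `TemplateSlot` and `WordAttribute` tables, and the reduction of each position to a single required attribute, are simplifying assumptions made for this example. The attribute labels are shortened from those in Table 4, and words are kept in lower case.

```python
import itertools
import sqlite3

# Hypothetical, simplified tables; one required attribute per position.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TemplateSlot (
    Position          INTEGER PRIMARY KEY,
    RequiredAttribute TEXT NOT NULL
);
CREATE TABLE WordAttribute (
    Spelling  TEXT NOT NULL,
    Attribute TEXT NOT NULL,
    PRIMARY KEY (Spelling, Attribute)
);
""")

conn.executemany("INSERT INTO TemplateSlot VALUES (?, ?)", [
    (1, "Determiner"),
    (2, "Noun: Life Form"),
    (3, "Verb: Motion"),
    (4, "Preposition: Motion"),
    (5, "Determiner"),
    (6, "Noun: Non-Life Form"),
])

vocabulary = {
    "Determiner": ["the", "a"],
    "Noun: Life Form": ["cow", "cat", "boy"],
    "Verb: Motion": ["jumped", "crawled", "hopped"],
    "Preposition: Motion": ["over", "under", "across"],
    "Noun: Non-Life Form": ["moon", "car", "street"],
}
conn.executemany(
    "INSERT INTO WordAttribute VALUES (?, ?)",
    [(word, attribute) for attribute, words in vocabulary.items() for word in words],
)

# For each position, look up every word carrying the required attribute.
slots = conn.execute(
    "SELECT Position, RequiredAttribute FROM TemplateSlot ORDER BY Position"
).fetchall()
allowable = [
    [row[0] for row in conn.execute(
        "SELECT Spelling FROM WordAttribute WHERE Attribute = ? ORDER BY Spelling",
        (attribute,),
    )]
    for _, attribute in slots
]

# Every combination of allowable words is a sentence with this structure.
generated = [" ".join(words) for words in itertools.product(*allowable)]
```

Even this tiny vocabulary yields 2 × 3 × 3 × 3 × 2 × 3 = 324 sentences from one template, including the three example sentences. A fuller model would match on several attributes per position, as Table 4 does.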
NLDB is a properly normalized relational database that is specifically designed to store, retrieve, manipulate and, more importantly, associate the content, rules and structures of one or more natural languages in a granular and generalized manner, allowing its data to be queried, using SQL, for any conceivable purpose. Natural language content, structures and rules are represented in and reflected by the content and structure of NLDB tables and their relationships with one another. I can demonstrate that NLDB can be the fundamental data source and foundation upon which viable, practical Natural Language Processing and Computational Linguistics applications can be built.
· Capture the structural properties of lexical data in a relational database
· Define and store any distinct natural language structures divorced from, yet associated with, actual words
· Map language structures in one language to equivalent structures in any other language
· Map question structures to answer structures
· Define different types of relationships between structures
· Define any attributes and organize them in zero or more hierarchical levels
· Associate any attribute or attribute hierarchy with any structure or any association between structures.
· Query using standard SQL.
· Accurate Machine Translation: The ability to hard-map structures with word senses to equivalent structures in other languages.
· Natural Language Understanding: The ability to hard-map question structures (divorced from their words, yet associated with them) to answer structures.
· Querying of any lexical information in any conceivable way.
· Unambiguous Synonym Replacement: Re-write a poem by replacing each of its words that has a synonym with that synonym while maintaining the rhyme, cadence and grammatical structure of the original poem.
· Re-write a book by replacing each of its words that has a synonym with that synonym while maintaining the meaning and grammatical structure of the original text.
· Text Generation from Pre-Defined Language Structures: Compose a new grammatically or colloquially correct novel that is a composite of identified salient factors that are mostly or completely consistent across a body of similar works (for example, Harlequin romance novels), such as sentence structures, main events, word usage, and similar yet different nouns such as place names or people’s names.
· Accurate Grammar Checking and Correction in multiple languages.
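The structure-mapping abilities listed above (equivalent structures across languages, question structures to answer structures) can be sketched with a relation table between structures. The `Structure` and `StructureRelation` tables, the `related` helper and the placeholder labels are assumptions for illustration only. The relation type names are taken from the StructureRelation examples in Table 2.

```python
import sqlite3

# Hypothetical tables; labels are placeholders, not real NLDB content.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Structure (
    StructureID INTEGER PRIMARY KEY,
    Language    TEXT NOT NULL,
    Label       TEXT NOT NULL
);
CREATE TABLE StructureRelation (
    FromStructureID INTEGER NOT NULL REFERENCES Structure(StructureID),
    ToStructureID   INTEGER NOT NULL REFERENCES Structure(StructureID),
    RelationType    TEXT NOT NULL,
    PRIMARY KEY (FromStructureID, ToStructureID, RelationType)
);
""")

conn.executemany("INSERT INTO Structure VALUES (?, ?, ?)", [
    (1, "English", "declarative sentence"),
    (2, "French", "declarative sentence"),
    (3, "English", "question"),
])
conn.executemany("INSERT INTO StructureRelation VALUES (?, ?, ?)", [
    (1, 2, "Equivalent"),  # English structure 1 maps to French structure 2
    (3, 1, "Answer"),      # question structure 3 is answered by structure 1
])

def related(structure_id, relation_type):
    """Return the structures reached from structure_id by relation_type."""
    return conn.execute("""
        SELECT s.StructureID, s.Language, s.Label
        FROM StructureRelation r
        JOIN Structure s ON s.StructureID = r.ToStructureID
        WHERE r.FromStructureID = ? AND r.RelationType = ?
        ORDER BY s.StructureID
    """, (structure_id, relation_type)).fetchall()

french_equivalents = related(1, "Equivalent")
answers = related(3, "Answer")
```

Because the mapping is between structures rather than between individual sentences, one `StructureRelation` row would cover every sentence stored against either structure.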
The NLDB Data Dictionary has been omitted in the interest of keeping this paper under the twelve-page limit. It is available upon request by email at: CompLing@Comcast.Net or at this [link]
The NLDB SQL Creation Script has been omitted in the interest of keeping this paper under the twelve-page limit. It is available upon request by email at: CompLing@Comcast.Net or at this [link]
[2] [February 11, 2002] "Combining UML, XML and Relational Database Technologies. The Best of All Worlds For Robust Linguistic Databases." By Larry S. Hayashi and John Hatton (SIL International). Pages 115-124 in Proceedings of the IRCS Workshop on Linguistic Databases (11-13 December 2001, University of Pennsylvania, Philadelphia, USA).
[3] Sampson, Geoffrey (1995) “English for the Computer”
[4] Zheng, Yifeng (2006) "Research Statement"
[5] Ide, Nancy; Le Maitre, Jacques; Véronis, Jean. (1999) “Outline Of A Model For Lexical Databases”.
[6] Ide, Nancy; Le Maitre, Jacques; Véronis, Jean. (1999) “Outline Of A Model For Lexical Databases”.
[7] Ide, Nancy; Le Maitre, Jacques; Véronis, Jean. (1999) “Outline Of A Model For Lexical Databases”