
On building an LLM for the English language

Posted on October 22, 2023 in Top Level Category by Stefaan Meeuws

See also the shared ChatGPT chat: https://chat.openai.com/share/34a8787e-2dca-421a-9e51-e67bb998629b


Here’s a Python program, which I created with the help of AI, that reads a text file of English prose and processes every sentence in it.

The code was generated in about the same time it took me to write this Facebook post. It prints every sentence in the text file, followed by each word of that sentence on its own line, marked (verb) if it is a verb or (noun) if it is a noun. For a verb, the line continues with the verb’s (tense) and its root, written as “root = ({verb root})”.

I intend to use this to build an LLM (Large Language Model) that can translate any language into any other language. Assuming I can create an LLM for English and an LLM for, say, Dutch or Chinese, translation then becomes a problem of “rotating” the source and target LLMs so that they “fit” or “click” into place.
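One possible reading of the “rotating” idea, offered only as a sketch: if the words of each language are represented as vectors, aligning two languages can be treated as finding the rotation that best maps known word pairs onto each other (the orthogonal Procrustes problem). Everything concrete below is an assumption made for illustration: the 2D vectors are made up, the two English/Dutch seed pairs are hypothetical, and the function names are mine. Real word vectors would have far more dimensions and would call for a linear-algebra library.

```python
import math

# Hypothetical 2D "embeddings"; the numbers are invented for illustration.
english = {"king": (1.0, 0.0), "queen": (0.0, 1.0), "man": (2.0, 0.5)}
dutch = {"koning": (0.0, 1.0), "koningin": (-1.0, 0.0), "man": (-0.5, 2.0)}

# Seed dictionary of known translation pairs (an assumption of this sketch).
seed_pairs = [("king", "koning"), ("queen", "koningin")]


def best_rotation_angle(pairs, source, target):
    """Return the angle (radians) of the 2D rotation R that minimises
    the summed squared distance between R(source vector) and target vector."""
    dot = cross = 0.0
    for src_word, tgt_word in pairs:
        sx, sy = source[src_word]
        tx, ty = target[tgt_word]
        dot += sx * tx + sy * ty
        cross += sx * ty - sy * tx
    return math.atan2(cross, dot)


def rotate(vec, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def translate(word, angle):
    """Rotate the English vector and return the nearest Dutch word."""
    rotated = rotate(english[word], angle)
    return min(dutch, key=lambda w: math.dist(rotated, dutch[w]))


angle = best_rotation_angle(seed_pairs, english, dutch)
print(round(math.degrees(angle)))  # 90 for this toy data
print(translate("man", angle))     # "man", a word that was not a seed pair
```

The rotation is fitted on the seed pairs only, and is then reused for a word outside that list. Whether real languages line up this neatly is exactly what would have to be tested.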

In this way I expect to be able to detect that “King” relates to “man” as “Queen” relates to “woman”, and that “son” relates to “Prince” as “daughter” relates to “Princess” (given a context, yet to be defined).
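Relations like these are commonly illustrated with vector arithmetic: subtract one word’s vector from another and add a third, then look for the nearest remaining word. The sketch below assumes hand-made 3D vectors whose axes (royal, female, young) stand in for the “context” mentioned above. Those axes, the numbers, and the `analogy` helper are invented for illustration and are not learned from any text.

```python
import math

# Hand-made toy vectors with axes (royal, female, young); purely illustrative.
vectors = {
    "man": (0, 0, 0), "woman": (0, 1, 0),
    "king": (1, 0, 0), "queen": (1, 1, 0),
    "son": (0, 0, 1), "daughter": (0, 1, 1),
    "prince": (1, 0, 1), "princess": (1, 1, 1),
}


def analogy(a, b, c):
    """Solve "a is to b as c is to ?" by computing b - a + c and
    returning the closest word that is not one of the three inputs."""
    target = tuple(vb - va + vc
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (w for w in vectors if w not in (a, b, c))
    return min(candidates, key=lambda w: math.dist(target, vectors[w]))


print(analogy("man", "king", "woman"))       # queen
print(analogy("son", "prince", "daughter"))  # princess
```

With vectors learned from real text the match would be approximate rather than exact, which is why the helper searches for the nearest word instead of testing for equality.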

It will probably take me many more weeks or even months before I come up with a real translation result, but bear in mind that this code (to build an LLM for English) is just the beginning, and just the work of 15 minutes on my development PC.

So wish me luck!

This is some of the output from a random text file containing English prose:


Yet it may be said on their behalf that they were not really as far in
advance of the majority of their contemporaries as they imagined they
were and it is to their credit that when their eyes were opened they
were opened thoroughly and not closed again.
Yet
it
may
be (verb)(present) root = (be)
said (verb)(past) root = (say)
on
their
behalf (noun)
that
they
were (verb)(past) root = (be)
not
really
as
far
in
advance (noun)
of
the
majority (noun)
of
their
contemporaries (noun)
as
they
imagined (verb)(past) root = (imagine)
they
were (verb)(past) root = (be)
and
it
is (verb)(present) root = (be)
to
their
credit (noun)
that
when
their
eyes (noun)
were (verb)(past) root = (be)
opened (verb)(past participle) root = (open)
they
were (verb)(past) root = (be)
opened (verb)(past participle) root = (open)
thoroughly
and
not
closed (verb)(past) root = (close)
again
.

And this is the code so far:

import re

import nltk
from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer

# Download the tokenizer, tagger, and WordNet data on first run.
# Newer NLTK releases may ask for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' instead; if so, download whichever
# resource the error message names.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')


# Define a function to clean text
def clean_text(text):
    # Remove special characters except periods.
    # This also strips commas and apostrophes, so "don't" becomes "dont".
    cleaned_text = re.sub(r'[^a-zA-Z0-9\s.]', '', text)
    return cleaned_text


# Define a function to get the verb tense from a Penn Treebank tag
def get_verb_tense(tag):
    if 'VBD' in tag:
        return '(past)'
    elif 'VBG' in tag:
        return '(present participle)'
    elif 'VBN' in tag:
        return '(past participle)'
    elif 'VB' in tag:
        # VB (base form), VBP and VBZ are all reported as present
        return '(present)'
    else:
        return ''


# Read the text file
file_path = 'your_text_file.txt'  # Replace with the actual file path
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

# Clean the text
cleaned_text = clean_text(text)

# Split the cleaned text into sentences
sentences = sent_tokenize(cleaned_text)

# Initialize the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Process and analyze each sentence
for sentence in sentences:
    print(sentence)
    words = word_tokenize(sentence)
    word_tags = pos_tag(words)
    for word, tag in word_tags:
        if tag.startswith('V'):
            tense = get_verb_tense(tag)
            root = lemmatizer.lemmatize(word, pos='v')  # Get the verb's root form
            print(f"{word} (verb){tense} root = ({root})")
        elif tag.startswith('N'):
            print(word, "(noun)")
        else:
            print(word)
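As a possible next step, the formatting of each output line could be separated from the tagging, so that it can be checked without downloading any NLTK data. The sketch below is one way to do that, not part of the generated script: `VERB_TENSES`, `format_token`, and the `lemmatize` parameter are hypothetical names, and the tags and roots in the usage lines are written by hand rather than produced by NLTK.

```python
# The same labels as get_verb_tense, written as a lookup table of
# Penn Treebank verb tags.
VERB_TENSES = {
    "VB": "(present)", "VBP": "(present)", "VBZ": "(present)",
    "VBD": "(past)", "VBG": "(present participle)", "VBN": "(past participle)",
}


def format_token(word, tag, lemmatize):
    """Return the output line for one (word, tag) pair.

    `lemmatize` is any callable mapping a verb to its root; in the full
    script it could wrap WordNetLemmatizer().lemmatize(word, pos='v').
    """
    if tag.startswith("V"):
        tense = VERB_TENSES.get(tag, "")
        return f"{word} (verb){tense} root = ({lemmatize(word)})"
    if tag.startswith("N"):
        return f"{word} (noun)"
    return word


# Stand-in lemmatizer and hand-written tags so the sketch runs without NLTK.
fake_roots = {"were": "be", "opened": "open"}
tagged = [("eyes", "NNS"), ("were", "VBD"), ("opened", "VBN"), ("thoroughly", "RB")]
for word, tag in tagged:
    print(format_token(word, tag, lambda w: fake_roots.get(w, w)))
```

Returning strings instead of printing them also leaves room to collect the results in a list or write them to a file later.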
