On building an LLM for the English language
See also (shared ChatGPT chat): https://chat.openai.com/share/34a8787e-2dca-421a-9e51-e67bb998629b
Here’s a Python program which I created using AI. It reads a text file of English prose and processes every sentence in the file.
The code was generated in about the time it took me to write this FB post. It prints every sentence in the text file, followed by each word of that sentence on its own line, marked (verb) if it is a verb or (noun) if it’s a noun. For a verb, the line continues with the tense, for example (past), and the root of the verb in the form “root = ({verb root})”.
I intend to use this to build an LLM (Large Language Model) that can translate any language into any other language. Assuming I can create an LLM for English and an LLM for, say, Dutch or Chinese, translation becomes a problem of “rotating” the source and target LLMs so they “fit” or “click” into place.
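To make the “rotating” idea a bit more concrete, here is a minimal sketch. It assumes that each language's words can be placed as points in a vector space, which the code in this post does not do yet. The 2-D vectors, the 40-degree offset, and the helper names (rotate, best_rotation, translate) are all made up for illustration; real word vectors would have hundreds of dimensions and would need a matrix method rather than a single angle.

```python
import math

# Hypothetical 2-D "embeddings"; a real model would learn these from text.
english = {"king": (0.9, 0.8), "queen": (0.9, -0.8),
           "man": (0.2, 0.7), "woman": (0.2, -0.7)}

def rotate(vec, theta):
    """Rotate a 2-D vector counter-clockwise by theta radians."""
    x, y = vec
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

# Assumption for the demo: the Dutch space is the English one turned by 40 degrees.
pairs = {"king": "koning", "queen": "koningin", "man": "man", "woman": "vrouw"}
dutch = {pairs[w]: rotate(v, math.radians(40)) for w, v in english.items()}

def best_rotation(source, target, seed_pairs):
    """Least-squares angle that maps the source vectors onto the target vectors."""
    dot = cross = 0.0
    for src_word, tgt_word in seed_pairs:
        ax, ay = source[src_word]
        bx, by = target[tgt_word]
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    return math.atan2(cross, dot)

def translate(word, source, target, theta):
    """Rotate the source word's vector and return the nearest target word."""
    moved = rotate(source[word], theta)
    return min(target, key=lambda t: math.dist(moved, target[t]))

# Fit the rotation on two known word pairs, then translate two words
# that were not used for fitting.
theta = best_rotation(english, dutch, [("king", "koning"), ("man", "man")])
print(round(math.degrees(theta)))                 # 40
print(translate("queen", english, dutch, theta))  # koningin
print(translate("woman", english, dutch, theta))  # vrouw
```

The point of the sketch is that a handful of known word pairs is enough to find the rotation, after which other words “click” into place by nearest-neighbour lookup. Whether real language models line up this neatly is exactly what remains to be tested.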
With such models I expect to be able to detect that “King” relates to “man” as “Queen” relates to “woman”, and that “Prince” relates to “son” as “Princess” relates to “daughter” (given a context, yet to be defined).
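As a rough illustration of how such relations could be detected, the sketch below uses plain vector arithmetic on toy data. It assumes every word has a vector; the three hand-picked dimensions (royalty, gender, generation), the values, and the analogy helper are hypothetical and chosen so the example works. A trained model would have to learn something similar from text.

```python
import math

# Hand-made toy vectors: (royalty, gender, generation). Hypothetical values.
vectors = {
    "king":     (1.0,  1.0, 1.0),
    "queen":    (1.0, -1.0, 1.0),
    "man":      (0.0,  1.0, 1.0),
    "woman":    (0.0, -1.0, 1.0),
    "prince":   (1.0,  1.0, 0.0),
    "princess": (1.0, -1.0, 0.0),
    "son":      (0.0,  1.0, 0.0),
    "daughter": (0.0, -1.0, 0.0),
}

def analogy(a, b, c, vectors):
    """Solve 'a is to b as ? is to c': compute a - b + c, return the nearest other word."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (w for w in vectors if w not in (a, b, c))
    return min(candidates, key=lambda w: math.dist(target, vectors[w]))

print(analogy("king", "man", "woman", vectors))       # queen
print(analogy("prince", "son", "daughter", vectors))  # princess
```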
It will probably take me many more weeks or even months before I come up with a real translation result, but bear in mind that this code (to build an LLM for English) is just the beginning, and just 15 minutes of work on my development PC.
So wish me luck!
This is some of the output from a random text file containing English prose:
Yet it may be said on their behalf that they were not really as far in
advance of the majority of their contemporaries as they imagined they
were and it is to their credit that when their eyes were opened they
were opened thoroughly and not closed again.
Yet
it
may
be (verb)(present) root = (be)
said (verb)(past) root = (say)
on
their
behalf (noun)
that
they
were (verb)(past) root = (be)
not
really
as
far
in
advance (noun)
of
the
majority (noun)
of
their
contemporaries (noun)
as
they
imagined (verb)(past) root = (imagine)
they
were (verb)(past) root = (be)
and
it
is (verb)(present) root = (be)
to
their
credit (noun)
that
when
their
eyes (noun)
were (verb)(past) root = (be)
opened (verb)(past participle) root = (open)
they
were (verb)(past) root = (be)
opened (verb)(past participle) root = (open)
thoroughly
and
not
closed (verb)(past) root = (close)
again
.
And this is the code so far:
import re

import nltk
from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer

# Download the tokenizer, tagger and WordNet data (skipped if already present).
# Newer NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng'
# instead; if so, download whichever resource the error message names.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')


# Define a function to clean text
def clean_text(text):
    # Remove special characters except periods
    cleaned_text = re.sub(r'[^a-zA-Z0-9\s.]', '', text)
    return cleaned_text


# Define a function to get the verb tense from a Penn Treebank tag
def get_verb_tense(tag):
    if 'VBD' in tag:
        return '(past)'
    elif 'VBG' in tag:
        return '(present participle)'
    elif 'VBN' in tag:
        return '(past participle)'
    elif 'VB' in tag:
        # Catches VB (base form), VBP and VBZ alike
        return '(present)'
    else:
        return ''


# Read the text file
file_path = 'your_text_file.txt'  # Replace with the actual file path
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

# Clean the text
cleaned_text = clean_text(text)

# Split the cleaned text into sentences
sentences = sent_tokenize(cleaned_text)

# Initialize the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Process and analyze each sentence
for sentence in sentences:
    print(sentence)
    words = word_tokenize(sentence)
    word_tags = pos_tag(words)
    for word, tag in word_tags:
        if tag.startswith('V'):
            tense = get_verb_tense(tag)
            root = lemmatizer.lemmatize(word, pos='v')  # Get the verb's root form
            print(f"{word} (verb){tense} root = ({root})")
        elif tag.startswith('N'):
            print(word, "(noun)")
        else:
            print(word)
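One thing to watch in get_verb_tense: its last check matches the base-form tag VB as well as the present-tense tags VBP and VBZ, which is why “be” in “may be” shows up as (present) in the sample output. A possible alternative is an exact lookup over the six Penn Treebank verb tags. This is only a sketch: the names TENSE_BY_TAG and describe_verb_tag are hypothetical, and the label wording is an illustrative choice rather than anything NLTK provides.

```python
# Penn Treebank verb tags (the tag set nltk.pos_tag uses for English),
# mapped to human-readable labels. The label text is an illustrative choice.
TENSE_BY_TAG = {
    "VB": "(base form)",
    "VBP": "(present)",
    "VBZ": "(present, 3rd person singular)",
    "VBD": "(past)",
    "VBG": "(present participle)",
    "VBN": "(past participle)",
}

def describe_verb_tag(tag):
    """Return the label for an exact verb tag, or '' for any other tag."""
    return TENSE_BY_TAG.get(tag, "")

for tag in ("VB", "VBZ", "VBD", "VBN", "NN"):
    print(tag, describe_verb_tag(tag))
```

An exact lookup also avoids depending on the order of the substring checks, since no tag can match more than one entry.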