OOV Archives - Nerd Corner

NLP Application: Tensorflow.js vs Tensorflow Python – Part 2

Nerds — Wed, 31 May 2023 21:47:58 +0000

The first part demonstrated how to read in and prepare datasets. In addition, the tokenisation of a dataset was discussed in detail. The points were illustrated using an example in Tensorflow (Python) and Tensorflow.js (Tfjs). In both the Python example and the JavaScript example, the model is only able to recognise words that have occurred at least once in the data set. This model has issues with new words, because we didn’t take an OOV token into account for our NLP tensorflow application.

You might also be interested in: NLP application part 1 (reading data, preparing data and tokenisation)

OOV Token

With a translator, it is only a matter of time before an unknown word is entered. This can be a personal name, a spelling mistake or something similar. It is therefore advisable to train the model with regard to unknown words. An OOV token is needed for this. OOV stands for “out of vocabulary”. During training, the model learns to generate this token or to handle it accordingly. At inference time, an unknown word can then be replaced by the OOV token before it is passed to the model. The model will treat it like any other token and generate a response based on its learned behaviour.

Now you might ask how to include this OOV token in the training data. I do this by automatically searching my dataset at the beginning for words that occur only once. Then I replace these rare words with the OOV token so that my model can also learn to react to unknown words.
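The replacement step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code from my project: the token string "<OOV>", the function name replace_rare_words and the min_count parameter are assumptions made for the example.

```python
from collections import Counter

# Assumed placeholder string; any marker that never occurs in real text works.
OOV_TOKEN = "<OOV>"


def replace_rare_words(sentences, min_count=2, oov_token=OOV_TOKEN):
    """Replace every word that occurs fewer than min_count times with oov_token."""
    # Count each word across the whole data set (simple whitespace split).
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return [
        " ".join(
            word if counts[word] >= min_count else oov_token
            for word in sentence.split()
        )
        for sentence in sentences
    ]


sentences = ["i am hungry", "i am tired", "you are hungry"]
print(replace_rare_words(sentences))
# ['i am hungry', 'i am <OOV>', '<OOV> <OOV> hungry']
```

With min_count=2, exactly the words that occur once are replaced. A higher threshold would push more rare words into the OOV bucket.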

Padding

In many machine learning models, including neural networks, the inputs are expected to have a fixed size or shape. This requirement arises from the structure and operation of the underlying computational graph. Inputs of the same length simplify the data processing pipeline and enable efficient batch processing.

Why the inputs should have equal length:

- Matrix operations: Neural networks typically process inputs in batches, and batch processing is most efficient when the input data has a uniform shape. The data is organised into matrices, with each row representing an input instance. To perform matrix operations efficiently, all input instances must have the same shape.
- Sharing of parameters: In many neural network architectures, model parameters (weights) are shared across different parts of the input sequence. For example, in recurrent neural networks (RNNs), the same weights are used to process each time step. Weight sharing itself works for any length, but to process a whole batch in one pass, every sequence in the batch must be unrolled for the same number of time steps.
- Memory allocation: Neural networks often allocate memory based on the maximum length of the input sequences. If the sequences have different lengths, dynamic memory allocation is required, which can be more complex and less efficient.

While it is possible to process variable-length inputs with additional techniques such as masking, this increases the complexity of the model and may require additional processing steps. For simplicity and efficiency, it is therefore common to pad or truncate sequences to a fixed length before feeding them into a neural network model.

from keras.utils import pad_sequences

# pad sequences
encoder_seq = pad_sequences(encoder, maxlen=max_encoder_sequence_len, padding="post")
decoder_inp = pad_sequences([arr[:-1] for arr in decoder], maxlen=max_decoder_sequence_len, padding="post")
decoder_output = pad_sequences([arr[1:] for arr in decoder], maxlen=max_decoder_sequence_len, padding="post")
print(encoder_seq)
print([idx_2_txt_encoder[i] for i in encoder_seq[0]])
print([idx_2_txt_decoder[i] for i in decoder_inp[0]])
print([idx_2_txt_decoder[i] for i in decoder_output[0]])

I have added the 4 print commands for better illustration. Initially, I thought that the longest record in the data set would set the length for input data and output data. So the padding length for input and output would be the same. But that is not the case! Input data and output data are normalised to different lengths!

In the example here, I have used a tiny data set where the longest English sentence consists of 3 words and the longest French sentence consists of 10 words. Accordingly, each training example is padded with the padding index 0 until the input reaches 3 tokens and the output 10 tokens.

The decoder output with [arr[1:] for arr in decoder] removes the “start” token and the decoder input with [arr[:-1] for arr in decoder] removes the “end” token.

In sequence-to-sequence models, the decoder is trained to generate the output sequence based on the input sequence and the previously generated tokens. During training, the decoder input begins with the “start” token, which serves as the initialisation token for the decoder. At every position the decoder is supposed to predict the next token, so the target at each position is the token that follows the one in the decoder input. The decoder output sequence is therefore the same sequence shifted by one position: the decoder input contains the “start” token and excludes the “end” token, while the decoder output contains the “end” token and excludes the “start” token. In this way, we ensure that the decoder learns to generate the correct output sequence based on the input.
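The shift by one position is easiest to see with concrete numbers. The following sketch uses a small pure-Python helper, pad_post, as a simplified stand-in for pad_sequences with padding="post" (unlike Keras, it also truncates on the right). The token indices (1 for the start token, 2 for the end token) are made up for the illustration.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad (or truncate) each sequence on the right to exactly maxlen entries."""
    padded = []
    for seq in sequences:
        trimmed = list(seq[:maxlen])
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded


# Hypothetical token indices: 1 = start token, 2 = end token, 0 = padding
decoder = [[1, 7, 9, 2], [1, 5, 2]]
max_decoder_sequence_len = max(len(seq) for seq in decoder)

# Decoder input keeps the start token and drops the last token.
decoder_inp = pad_post([seq[:-1] for seq in decoder], max_decoder_sequence_len)
# Decoder output drops the start token and keeps the end token.
decoder_output = pad_post([seq[1:] for seq in decoder], max_decoder_sequence_len)

print(decoder_inp)     # [[1, 7, 9, 0], [1, 5, 0, 0]]
print(decoder_output)  # [[7, 9, 2, 0], [5, 2, 0, 0]]
```

At every position, the entry in decoder_output is the token that follows the entry in decoder_inp, which is exactly what the decoder has to learn to predict.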

During inference (model application after training) or when using the trained model to generate translations, we can start with the “start” token and iteratively generate tokens until we encounter the “end” token or reach a maximum sequence length.
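The generation loop described above can be sketched as follows. This is a simplified illustration and not the inference code of the trained model: predict_next is a hypothetical callback that stands in for a call to the decoder, and the token indices are invented.

```python
def greedy_decode(predict_next, start_id, end_id, max_len):
    """Generate token ids one at a time until end_id appears or max_len is reached."""
    generated = [start_id]
    while len(generated) - 1 < max_len:
        next_id = predict_next(generated)
        if next_id == end_id:
            break
        generated.append(next_id)
    return generated[1:]  # drop the start token


# Toy stand-in for a trained model: replays a fixed sequence ending in the end token (2).
def toy_predict_next(tokens_so_far):
    canned = [7, 9, 4, 2]
    return canned[len(tokens_so_far) - 1]


print(greedy_decode(toy_predict_next, start_id=1, end_id=2, max_len=10))  # [7, 9, 4]
```

The max_len guard matters in practice, because a model that never produces the end token would otherwise loop forever.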

When padding for Tensorflow.js, we adopt Python’s approach 1:1. Unfortunately, this again means more work and more lines of code, as there is no padSequences function in Tfjs. I have therefore written my own padSequences function:

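// Helper that is not shown in the original snippet (minimal version):
// returns the length of the longest sequence
function findMaxLength(sequences) {
  return sequences.reduce((max, sequence) => Math.max(max, sequence.length), 0);
}

// Pads every sequence with trailing zeros up to the longest length
function padSequences(sequences) {
  const paddedSequences = [];
  const maxlen = findMaxLength(sequences);

  for (const sequence of sequences) {
    if (sequence.length >= maxlen) {
      paddedSequences.push(sequence.slice(0, maxlen));
    } else {
      const paddingLength = maxlen - sequence.length;
      const paddingArray = new Array(paddingLength).fill(0);
      const paddedSequence = sequence.concat(paddingArray);
      paddedSequences.push(paddedSequence);
    }
  }

  return paddedSequences;
}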

We can then use this function to determine our encoder, decoder input and decoder output:

function pad(data) {
  const encoderSeq = padSequences(data.en);
  const decoderInp = padSequences(data.de.map((arr) => arr.slice(0, -1))); // Has startToken
  const decoderOutput = padSequences(data.de.map((arr) => arr.slice(1))); // Has endToken
  console.log(decoderInp);
}

In my case, the “1” is the “startToken”, so the decoder input looks like this, for example:

Create a model

from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model

# Design LSTM NN (Encoder & Decoder)
# encoder model
encoder_input = Input(shape=(None,), name="encoder_input_layer")
encoder_embedding = Embedding(num_encoder_tokens, 300, input_length=max_encoder_sequence_len, name="encoder_embedding_layer")(encoder_input)
encoder_lstm = LSTM(256, activation="tanh", return_sequences=True, return_state=True, name="encoder_lstm_1_layer")(encoder_embedding)
encoder_lstm2 = LSTM(256, activation="tanh", return_state=True, name="encoder_lstm_2_layer")(encoder_lstm)
_, state_h, state_c = encoder_lstm2
encoder_states = [state_h, state_c]

# decoder model
decoder_input = Input(shape=(None,), name="decoder_input_layer")
decoder_embedding = Embedding(num_decoder_tokens, 300, input_length=max_decoder_sequence_len, name="decoder_embedding_layer")(decoder_input)
decoder_lstm = LSTM(256, activation="tanh", return_state=True, return_sequences=True, name="decoder_lstm_layer")
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens+1, activation="softmax", name="decoder_final_layer")
outputs = decoder_dense(decoder_outputs)

model = Model([encoder_input, decoder_input], outputs)

The code example shows the design of a neural network with Long Short-Term Memory (LSTM) for sequence-to-sequence learning (Seq2Seq), which is typically used for tasks such as machine translation. The code defines two main parts: Encoder Model and Decoder Model.

Encoder Model:

- The encoder input layer (encoder_input) represents the input sequence of the encoder model.
- The input sequence is embedded using an embedding layer (encoder_embedding) that converts each token into a dense vector representation.
- The embedded sequence is then passed through the first LSTM layer (encoder_lstm_1_layer) to capture sequential information. Because return_sequences and return_state are both set, it returns the full output sequence together with its final hidden state and final cell state.
- The output of the first LSTM layer is further processed by the second LSTM layer (encoder_lstm_2_layer). With return_state=True and without return_sequences, this layer returns its last output together with its final hidden state and final cell state, which summarise the input sequence.
- The result of the second LSTM layer is unpacked into the last output (which is discarded), the final hidden state (state_h) and the final cell state (state_c).
- The two states are collected in encoder_states and passed on to the decoder model as its initial state.

Decoder Model:

- The decoder input layer (decoder_input) represents the input sequence of the decoder model, which consists of the target sequence shifted by one position.
- Similar to the encoder, the input sequence is embedded using an embedding layer (decoder_embedding).
- The embedded sequence is then passed through an LSTM layer (decoder_lstm_layer), where the initial states are set to the final states of the encoder model. This allows the decoder to take into account the relevant information from the encoder.
- The LSTM layer provides the output sequence and the end states.
- The output sequence from the LSTM layer is passed through a dense layer (decoder_final_layer) with a softmax activation function that predicts the probability distribution over the output tokens.

The Model class is used to create the overall model by specifying the input layers ([encoder_input, decoder_input]) and the output layer (outputs). This model architecture follows the basic structure of an encoder-decoder model using LSTMs, where the encoder processes the input sequence and generates the context (its final hidden state and cell state), which is then used by the decoder to generate the output sequence.

The same model can also be implemented in JS:

function createModell(
  numEncoderTokens,
  numDecoderTokens,
  maxEncoderSequenceLen,
  maxDecoderSequenceLen
) {
  // Encoder model
  const encoderInput = tf.input({ shape: [null], name: "encoderInputLayer" });
  const encoderEmbedding = tf.layers
    .embedding({
      inputDim: numEncoderTokens,
      outputDim: 300,
      inputLength: maxEncoderSequenceLen,
      name: "encoderEmbeddingLayer",
    })
    .apply(encoderInput);
  const encoderLstm = tf.layers
    .lstm({
      units: 256,
      activation: "tanh",
      returnSequences: true,
      returnState: true,
      name: "encoderLstm1Layer",
    })
    .apply(encoderEmbedding);
  const [_, state_h, state_c] = tf.layers
    .lstm({
      units: 256,
      activation: "tanh",
      returnState: true,
      name: "encoderLstm2Layer",
    })
    .apply(encoderLstm);
  const encoderStates = [state_h, state_c];

  // Decoder model
  const decoderInput = tf.input({ shape: [null], name: "decoderInputLayer" });
  const decoderEmbedding = tf.layers
    .embedding({
      inputDim: numDecoderTokens,
      outputDim: 300,
      inputLength: maxDecoderSequenceLen,
      name: "decoderEmbeddingLayer",
    })
    .apply(decoderInput);
  const decoderLstm = tf.layers.lstm({
    units: 256,
    activation: "tanh",
    returnState: true,
    returnSequences: true,
    name: "decoderLstmLayer",
  });
  const [decoderOutputs, ,] = decoderLstm.apply(decoderEmbedding, {
    initialState: encoderStates,
  });
  const decoderDense = tf.layers.dense({
    units: numDecoderTokens + 1,
    activation: "softmax",
    name: "decoderFinalLayer",
  });
  const outputs = decoderDense.apply(decoderOutputs);

  const model = tf.model({ inputs: [encoderInput, decoderInput], outputs });
  return model;
}

Train and save a model

# train model
loss = tf.losses.SparseCategoricalCrossentropy()
model.compile(optimizer='rmsprop', loss=loss, metrics=['accuracy'])
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
history = model.fit(
   [encoder_seq, decoder_inp],
   decoder_output,
   epochs=80,  # 80
   batch_size=450,  # 450
   # callbacks=[callback]
)

The model.fit() function is used to train the model. The training data consists of the encoder input sequences (encoder_seq), the decoder input sequences (decoder_inp) and the decoder output sequences (decoder_output). The training runs for 80 epochs with a batch size of 450. The EarlyStopping callback defined above would stop the training once the loss has not improved for 3 consecutive epochs. In the code shown it is commented out in the fit call, so all 80 epochs are run. The training progress is stored in the variable history.

The model in Tensorflow can handle both tensors and numpy arrays as inputs. If you pass numpy arrays as inputs to the fit function in TensorFlow, it will automatically convert them to tensors internally before training is performed. In the code, the encoder_seq, decoder_inp and decoder_output arrays are automatically converted to tensors when passed to the fit function. This allows TensorFlow to perform the necessary calculations during the training process.

The fit function in TensorFlow.js, by contrast, expects tf.Tensor objects. A plain 2D array such as encoderSeq therefore has to be converted first, for example with tf.tensor2d(encoderSeq), before it is passed to fit.

# save model
model.save("./model-experimental/Translate_Eng_FR.h5")
model.save_weights("./model-experimental/model_NMT")

It is common to store the weights of a trained model separately from the model architecture. Storing the weights and the architecture separately allows more flexibility when loading and using the model. For example, one can load only the weights if the model architecture has been defined elsewhere or if the weights are to be used in another model with a similar architecture.

Finally, here is the corresponding code in JavaScript:

async function trainModel(data) {
  const encoderSeq = padSequences(data.en);
  const decoderInp = padSequences(data.de.map((arr) => arr.slice(0, -1))); // Has startToken
  const decoderOutput = padSequences(data.de.map((arr) => arr.slice(1))); // Has endToken

  data.model.compile({
    optimizer: "rmsprop",
    loss: "sparseCategoricalCrossentropy",
    metrics: ["accuracy"],
  });
  const history = await data.model.fit(
    [encoderSeq, decoderInp],
    decoderOutput,
    {
      epochs: 80,
      batchSize: 450, // the Tfjs option is called batchSize, not batch_size
    }
  );
}

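This is where my frustration with Tensorflow.js comes in. Although each step mirrors the corresponding step in Python 1:1, training the model in Tensorflow.js doesn’t work. I always get the error message below. Its stack trace points to standardizeInputData reading array.shape, which is undefined for a plain JavaScript array, so the inputs most likely have to be converted to tensors (for example with tf.tensor2d) before they are passed to fit: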

C:\Users\[...]\node_modules\@tensorflow\tfjs-layers\dist\tf-layers.node.js:23386
            if (array.shape.length !== shapes[i].length) {
                            ^

TypeError: Cannot read properties of undefined (reading 'length')
    at standardizeInputData

General loss function and optimiser

Loss functions and optimisers are key components in training a machine learning model. A loss function, also known as an objective or cost function, measures the performance of a model by quantifying the dissimilarity between the predicted outputs and the actual targets. The goal of training a model is to minimise this loss function, which essentially means improving the model’s ability to make accurate predictions. The choice of loss function depends on the problem at hand. For example, in classification tasks, categorical cross entropy, binary cross entropy and softmax cross entropy are common loss functions, while in regression tasks, mean squared error (MSE) and mean absolute error (MAE) are often used.

An optimiser, on the other hand, is responsible for updating the model parameters (weights and biases) during training to minimise the loss function. It determines how to adjust the parameters of the model based on the calculated gradients of the loss function with respect to these parameters. Optimisers use various algorithms and techniques to efficiently search for the optimal values of the parameters. Common optimisers include Stochastic Gradient Descent (SGD), Adam, RMSprop and Adagrad. Each optimiser has its own features and hyper-parameters that can affect the training process and the convergence speed of the model.

The choice of loss function and optimiser depends on the specific task, the model architecture and the characteristics of the data set. It is important to select appropriate loss functions and optimisers to ensure effective model training and convergence to optimal performance.

Frequently used loss functions and optimisers

Loss functions:

Categorical cross entropy: This loss function is often used in sequence-to-sequence models for multi-class classification problems where each target word is treated as a separate class.
Sparse categorical cross entropy: Similar to categorical cross entropy, but suitable when the target sequences are represented as sparse integer sequences (e.g. using word indices).
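The difference between the two loss functions lies only in how the target is encoded. The following sketch computes both by hand for a single prediction. It is a simplified illustration with made-up probabilities, not the TensorFlow implementation, and the function names are chosen for the example.

```python
import math


def categorical_cross_entropy(one_hot_target, probs):
    """Target is a one-hot vector, e.g. [0, 1, 0]."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))


def sparse_categorical_cross_entropy(target_index, probs):
    """Target is just the index of the correct class, e.g. 1."""
    return -math.log(probs[target_index])


probs = [0.1, 0.7, 0.2]  # predicted distribution over a 3-word vocabulary

loss_dense = categorical_cross_entropy([0, 1, 0], probs)
loss_sparse = sparse_categorical_cross_entropy(1, probs)

print(loss_dense, loss_sparse)  # both equal -log(0.7)
```

Both calls yield the same value. The sparse variant simply avoids building a one-hot vector per target word, which saves a lot of memory for large vocabularies.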

Optimiser:

Adam: Adam is a popular optimiser that combines the advantages of the Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation (RMSprop). It adjusts the learning rate for each parameter based on previous gradients, which contributes to faster convergence and better handling of sparse gradients.
RMSprop: RMSprop is an optimiser that maintains a moving average of squared gradients for each parameter. It adjusts the learning rate based on the size of the gradient, allowing for faster convergence and better performance on non-stationary targets.
Adagrad: Adagrad adjusts the learning rate individually for each parameter based on historical gradient accumulation. It performs larger updates for infrequent parameters and smaller updates for frequent parameters.
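To make the descriptions above more concrete, here is a minimal sketch of the Adagrad and RMSprop update rules for a single parameter, applied to the toy function f(x) = x². The hyper-parameter values and function names are assumptions chosen for the illustration. Real implementations work on whole tensors and offer further options.

```python
import math


def adagrad_step(param, grad, accum, lr=0.1, eps=1e-8):
    """Adagrad: divide by the root of the sum of all past squared gradients."""
    accum = accum + grad ** 2
    param = param - lr * grad / (math.sqrt(accum) + eps)
    return param, accum


def rmsprop_step(param, grad, avg_sq, lr=0.1, rho=0.9, eps=1e-8):
    """RMSprop: divide by the root of a moving average of squared gradients."""
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2
    param = param - lr * grad / (math.sqrt(avg_sq) + eps)
    return param, avg_sq


# Minimise f(x) = x^2, whose gradient is 2x, starting from x = 5.
x_ada, accum = 5.0, 0.0
x_rms, avg_sq = 5.0, 0.0
for _ in range(100):
    x_ada, accum = adagrad_step(x_ada, 2 * x_ada, accum)
    x_rms, avg_sq = rmsprop_step(x_rms, 2 * x_rms, avg_sq)

print(x_ada, x_rms)  # both have moved from 5 towards the minimum at 0
```

The only difference between the two rules is the accumulator: Adagrad sums all squared gradients, so its steps keep shrinking, while RMSprop forgets old gradients through the moving average.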

Files for download

NLP Tensorflow.js code (model has an error!)

The post NLP Application: Tensorflow.js vs Tensorflow Python – Part 2 appeared first on Nerd Corner.

NLP Application: Tensorflow.js vs Tensorflow Python – Part 1

Nerds — Wed, 31 May 2023 19:17:55 +0000

I am currently working on a project in which I want to program a German to Bavarian translator using machine learning. This falls under Natural Language Processing (NLP). A Google library called Tensorflow is often used for the implementation. There is Tensorflow.js as well as Tensorflow (Python). Since I develop professionally with Angular and am therefore familiar with TypeScript and JavaScript, I initially decided to implement the NLP application in Tensorflow.js. I was naive enough to assume that the only difference between the two libraries would be the programming language used. This is definitely not the case! For my NLP project, some basic functions are missing in Tensorflow.js (such as a tokenizer). In this post I explained the general differences between Tensorflow.js and Tensorflow (Python).

I spent many evenings trying to get my project to work with Tensorflow.js and failed in the end. Switching to Python brought the breakthrough I was hoping for! I would recommend that everyone use Python for NLP applications! Nevertheless, in this article I want to explain the differences between Tensorflow.js and Tensorflow in relation to my project using code examples. Along the way, I will also incorporate my newly acquired knowledge into the respective sections as best I can.

You might also be interested in: NLP application part 2 (OOV token, padding, creating the model and training the model).

Reading in data

First of all, you need a data set with which the model will be trained later. Here I can recommend https://www.kaggle.com/. There you will find a large number of data sets for free use and even some code examples. You can either read in the data set via a link or download it and then read it in locally from the file system. A good data set should contain over 100,000 examples, preferably including whole paragraphs. For example, this is what an English/French data set looks like as a CSV:

First, the simple variant using Python:

import pandas as pd

# read in dataSet for training
df = pd.read_csv("./dataset/eng_-french.csv")
df.columns = ["english", "french"]
print(df.head())
print(df.info())

We use the pandas library to read in the CSV. With head() we can check whether it worked by displaying the first 5 rows. With info() we get more information, such as the number of columns and rows:

For comparison in Tensorflow.js (Tfjs) there is also a possibility to read in CSV:

const tf = require("@tensorflow/tfjs");

async function readInData() {
  await tf.ready();
  const languageDataSet = tf.data.csv("file://" + "./ger_en_trans.csv");

  // Extract language pairs
  const dataset = languageDataSet.map((record) => ({
    en: record.en,
    de: record.de,
  }));

  const pairs = await dataset.toArray();

  console.log(pairs);
}

readInData();

I tried at first to read in the same data set as in the Python version:

Afterwards I wanted to shorten the headings in the original CSV, but this strangely gave me an error message when reading in. Even when I restored the CSV to its original state, the error remained:

In the end, I decided to use a different data set:

This one was also much more readable when it was read in:

And here is the final result after the mapping:

Although Tfjs offers an extra function to read in the CSV, I still had more trouble than in the Python version. I have also not found a quick way to read in a data set in txt format. However, txt files are widespread!

Prepare data

I have often seen that a cleaning function was written for data preparation and that the output set also received a start and end token. I then wondered whether the input set, i.e. the encoder, also needs a start and end token. In the context of sequence-to-sequence models, however, the encoder does not need explicit start and end tokens. Its purpose is to process the input sequence as it is and produce a representation of the input.

The decoder, on the other hand, which generates the output sequence, usually benefits from the use of start and end tokens. These tokens help to mark the beginning and end of the generated sequence. The use of start and end tokens is therefore specific to the decoder. During training, the input sequence of the decoder includes a start token at the beginning and excludes an end token at the end. The output sequence of the decoder contains the end token and excludes the start token. In this way, the model learns to generate the correct output sequence based on the input.

When creating translations with the trained model, you start with the start token and generate one token after another until you hit the end token or reach a maximum sequence length. Adding start and end tokens to the decoder set improves the performance of the NLP translator model. It helps to establish clear sequence boundaries and supports the generation process by indicating where the translation starts and ends.

In summary:

Encoder: No need for start and end tokens. Processes the input sequence as it is.
Decoder: Start and end tokens are helpful for generating the output sequence.

We start again with the easy part, namely Python. We want to clean up the data we read in. This means converting everything to lower case and removing characters that are not part of the alphabet or punctuation marks. For this we need the regex library (re).

import re

def clean(text):
    text = text.lower()  # lower case
    # remove any characters not a-z and ?!,'
    # please note that french has additional characters...I just simplified that
    text = re.sub(u"[^a-z!?',]", " ", text)
    return text


# apply the cleaning function to the DataFrame
# (data is assumed to be the DataFrame that was read in above as df)
data["english"] = data["english"].apply(clean)
data["french"] = data["french"].apply(clean)

# add start and end tokens to the decoder sentence (french)
data["french"] = data["french"].apply(lambda txt: f" {txt} ")

print(data.sample(10))

I have simplified here. Since this is a French data set, one should actually write an extra cleaning function that also takes French letters like “ê” into account. The sample() function only serves to illustrate the data:
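A cleaning function that keeps French letters could look roughly like the following sketch. The function name clean_french and the list of accented characters are assumptions for this example, and the list may not be exhaustive for every text.

```python
import re


def clean_french(text):
    """Lower-case the text and keep French letters plus the characters ! ? ' ,"""
    text = text.lower()
    # Everything that is not a plain or accented French letter
    # or one of ! ? ' , is replaced by a space.
    text = re.sub(r"[^a-zàâäæçéèêëîïôœùûüÿ!?',]", " ", text)
    return text


print(clean_french("Êtes-vous prêt ? Ça va!"))  # êtes vous prêt ? ça va!
```

As in the simplified version, removed characters are replaced by a space, so a hyphenated form such as “êtes-vous” is split into two words.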

In Tfjs the process is absolutely identical. I have created a cleanData() function and modified the previous code:

function cleanData(text) {
  //if necessary also remove any characters not a-z and ?!,'
  return text.toLowerCase();
}

const dataset = languageDataSet.map((record) => ({
   en: cleanData(record.en),
   de: "startToken " + cleanData(record.de) + " endToken",
 }));

The result is therefore also identical to the Python approach:

If the words “start” and “end” are part of regular sentences and are not used as special tokens to mark the beginning and end of sequences, then they should definitely not be replaced by corresponding indices during tokenisation. When tokenising, it is important to choose special tokens that are unlikely to occur in the actual input data. This ensures that the model can distinguish them from normal words and learns to produce the appropriate output sequences.

If “start” and “end” are regular words in the input sentences, consider using different special tokens to mark the start and end of sequences. A common choice is a pair of markers wrapped in angle brackets, which do not appear in ordinary sentences. Using special tokens that are unlikely to be part of the regular vocabulary ensures that they can be correctly identified and processed by the model during training and generation.

For example, the tokenised sequences would look like this:

Decoder Input: [“”, “hello”, “world”]
Decoder Output: [“hello”, “world”, “”]

Therefore AVOID the following:

Decoder Input: [“start”, “hello”, “world”]
Decoder output: [“hello”, “world”, “end”]

Tokenisation

# tokenization
import tensorflow as tf
from tensorflow import keras
from keras.preprocessing.text import Tokenizer
import numpy as np

# english tokenizer
english_tokenize = Tokenizer(filters='#$%&()*+,-./:;<=>@[\\]^_`{|}~\t\n')
english_tokenize.fit_on_texts(data["english"])
num_encoder_tokens = len(english_tokenize.word_index)+1
# print(num_encoder_tokens)
encoder = english_tokenize.texts_to_sequences(data["english"])
# print(encoder[:5])
max_encoder_sequence_len = np.max([len(enc) for enc in encoder])
# print(max_encoder_sequence_len)

# french tokenizer
french_tokenize = Tokenizer(filters="#$%&()*+,-./:;<=>@[\\]^_`{|}~\t\n")
french_tokenize.fit_on_texts(data["french"])
num_decoder_tokens = len(french_tokenize.word_index)+1
# print(num_decoder_tokens)
decoder = french_tokenize.texts_to_sequences(data["french"])
# print(decoder[:5])
max_decoder_sequence_len = np.max([len(dec) for dec in decoder])
# print(max_decoder_sequence_len)

This code performs tokenisation and sequence preprocessing with the Tokenizer class in TensorFlow.

- english_tokenize = Tokenizer(filters='#$%&()*+,-./:;<=>@[\\]^_`{|}~\t\n') initialises a tokenizer object for English sentences. The filters parameter specifies characters to be filtered out during tokenisation. We have already filtered the data in the cleaning process, so it is not really necessary to filter again here.
- english_tokenize.fit_on_texts(data["english"]) updates the internal vocabulary of the tokenizer based on the English sentences in the variable data. Each word in the vocabulary is assigned a unique index.
- num_encoder_tokens = len(english_tokenize.word_index) + 1 determines the number of unique tokens (words) in the English vocabulary, plus one for index 0, which is reserved for padding. The word_index attribute of the tokenizer returns a dictionary that maps words to their respective indices.
- encoder = english_tokenize.texts_to_sequences(data["english"]) converts the English sentences in the variable data into sequences of token indices using the tokenizer. Each sentence is replaced by a sequence of integers representing the corresponding words.
- max_encoder_sequence_len = np.max([len(enc) for enc in encoder]) calculates the maximum length (number of tokens) among all encoded sequences. It uses the max function of NumPy to find the maximum value of the list produced by the list comprehension.

These steps help to prepare the sentences for further processing in an NLP model. This is necessary for both languages!

The sentences have now been tokenised and converted into sequences of token indices, and the maximum sequence length has been determined. An example sentence from the dataset now looks like this: [148, 252, 59, 14, 111]. Here, 148 could stand for “I”, 252 for “am”, 59 for “very”, 14 for “hungry” and 111 for “now”.
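To show what happens conceptually, the following pure-Python sketch mimics the three steps (building the vocabulary, converting sentences to indices, determining the maximum length) on two toy sentences. It is a simplified stand-in for the Keras Tokenizer: it only splits on whitespace and applies no filters, and the function names are chosen for this example.

```python
from collections import Counter


def fit_word_index(texts):
    """Assign an index to every word; frequent words get small indices."""
    counts = Counter(word for text in texts for word in text.split())
    # Index 0 is left free for padding, so numbering starts at 1.
    ordered = sorted(counts, key=lambda word: -counts[word])
    return {word: idx for idx, word in enumerate(ordered, start=1)}


def texts_to_sequences(texts, word_index):
    """Replace every known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]


texts = ["i am hungry", "i am very tired"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index)
num_tokens = len(word_index) + 1  # +1 for the padding index 0
max_len = max(len(seq) for seq in sequences)

print(word_index)  # {'i': 1, 'am': 2, 'hungry': 3, 'very': 4, 'tired': 5}
print(sequences)   # [[1, 2, 3], [1, 2, 4, 5]]
print(num_tokens, max_len)  # 6 4
```

Note that words which are not in the vocabulary are silently dropped here, which is precisely the situation an OOV token is meant to handle.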

idx_2_txt_decoder = {k: i for i, k in french_tokenize.word_index.items()}
# print(idx_2_txt_decoder)
idx_2_txt_encoder = {k: i for i, k in english_tokenize.word_index.items()}
# print(idx_2_txt_encoder)

idx_2_txt_decoder[0] = ""
idx_2_txt_encoder[0] = ""

The code snippet idx_2_txt_encoder = {k: i for i, k in english_tokenize.word_index.items()} creates a dictionary idx_2_txt_encoder that maps token indices to the corresponding words in the English vocabulary. It is a dictionary comprehension that iterates over the key-value pairs in english_tokenize.word_index. At each iteration, the loop variable i holds a word from the vocabulary (the key of word_index) and k holds its index (the value). The comprehension creates a new dictionary whose keys are the indices (k) and whose values are the words (i).

The resulting idx_2_txt_encoder dictionary allows you to look up the word corresponding to a particular index in the English vocabulary. english_tokenize.word_index, by the way, is exactly the reverse mapping: there the key is the word and the value is the index. The last line, idx_2_txt_encoder[0] = "", adds a special entry to the dictionary: index 0, which the tokenizer never assigns to a word, is reserved for the padding value that is used when padding sequences.

Afterwards, you should save the dictionaries, because once the model has been trained and is in use, its translations will also be a series of indices that are transformed back into readable sentences with the help of the dictionary.

import pickle
# Saving the dictionaries
with open("./saves/idx_2_word_input.txt", "wb") as f:
    pickle.dump(idx_2_txt_encoder, f)
with open("./saves/idx_2_word_target.txt", "wb") as f:
    pickle.dump(idx_2_txt_decoder, f)
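How a saved dictionary is used later can be sketched as follows. The toy mapping, the temporary file and the helper indices_to_text are assumptions for this example. In the real project, the dictionary would be loaded from the file written above.

```python
import os
import pickle
import tempfile

# Toy index-to-word mapping; index 0 is the padding entry.
idx_2_txt_decoder = {0: "", 1: "start", 2: "end", 3: "bonjour", 4: "monde"}

# Save the dictionary to a temporary file and load it again.
path = os.path.join(tempfile.mkdtemp(), "idx_2_word_target.pkl")
with open(path, "wb") as f:
    pickle.dump(idx_2_txt_decoder, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)


def indices_to_text(indices, idx_2_txt):
    """Turn a sequence of predicted indices back into a readable sentence."""
    words = [idx_2_txt[i] for i in indices if i != 0]  # skip padding
    return " ".join(words)


print(indices_to_text([3, 4, 0, 0], loaded))  # bonjour monde
```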

The same process as in Python can also be constructed for the NLP application in Tensorflow.js. Of course, you need a few more lines of code and the overall workload is higher. The first hurdle here is the tokenizer. Unfortunately, unlike Tensorflow (Python), Tfjs does not have its own tokenizer. After extensive research, I luckily found the natural.WordTokenizer. I would like to point out here that a Node.js project is definitely required. Tfjs can be integrated via a