Quora Question Pairs: Detecting Text Similarity using Siamese networks.

Aadit Kapoor
Towards Data Science

Ever wondered how to calculate text similarity using Deep Learning? We aim to develop a model that detects whether two pieces of text have the same meaning, using the Quora Question Pairs dataset.

Cover photo: https://unsplash.com/photos/askpr0s66Rg

Requirements

  • Python 3.8
  • Scikit-Learn
  • TensorFlow
  • Gensim
  • NLTK

Dataset

Let us first start by exploring the dataset. Our dataset consists of:

  • id: The ID of a question pair in the training set
  • qid1, qid2: Unique IDs of each question
  • question1: The text of the first question
  • question2: The text of the second question
  • is_duplicate: 1 if question1 and question2 have the same meaning, 0 otherwise
Preview of our dataset
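
The preview above can be reproduced with a minimal loading sketch, assuming the train.csv file from the Kaggle competition sits in the working directory:

import pandas as pd

# Assumption: train.csv downloaded from the Kaggle Quora Question Pairs competition
df = pd.read_csv('train.csv')

print(df.shape)
print(df.columns.tolist())
df.head()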

Data Preprocessing

Like any Machine Learning project, we will start by preprocessing the data. Let us first load the data and combine question1 and question2 to form the vocabulary.

def load_data(df):
    question1 = df['question1'].astype(str).values
    question2 = df['question2'].astype(str).values
    # combined: concatenated questions used later to build the vocabulary
    df['combined'] = df['question1'] + df['question2']
    labels = df['is_duplicate'].values
    return question1, question2, labels

question1, question2, labels = load_data(df)
question1 = list(question1)
question2 = list(question2)
combined = question1 + question2

df.head()
Our modified data frame

We will also clean the text a bit.

# Remove non-ASCII characters from the dataset.
def cleanAscii(text):
    return ''.join(i for i in text if ord(i) < 128)
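
As a usage sketch (this call is my own, not in the original), the filter can be applied to both question lists before tokenization:

# Apply the ASCII filter to every question and rebuild the combined vocabulary
question1 = [cleanAscii(q) for q in question1]
question2 = [cleanAscii(q) for q in question2]
combined = question1 + question2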

Word Embeddings

Every feed-forward neural network that takes words from a vocabulary as input and embeds them as vectors into a lower dimensional space, which it then fine-tunes through back-propagation, necessarily yields word embeddings as the weights of the first layer, which is usually referred to as Embedding Layer (Ruder, 2016)

Image from [3]

Word embeddings learn the syntactic and semantic aspects of the text (Almeida et al., 2019). Since our problem hinges on the semantic meaning of the questions, we will use a word embedding as the first layer of our Siamese network.

For this, we will use the popular GloVe (Global Vectors for Word Representation) model. We will obtain the pre-trained vectors (https://nlp.stanford.edu/projects/glove/) and load them into our first layer, the embedding layer.

Because GloVe is trained so that related words end up close together under cosine similarity (a nearest-neighbours view), it is able to capture semantic similarity between words. In our model, we will build an embedding matrix from the GloVe weights and look up a word vector for each word in our sentences.

First, we build a Tokenizer over our combined vocabulary.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_words = 10000
tok = Tokenizer(num_words=max_words, oov_token="<OOV>")
tok.fit_on_texts(combined)

# Convert the questions to integer sequences and pad them to a fixed length of 300
sequences = tok.texts_to_sequences(combined)
sequences = pad_sequences(sequences, maxlen=300, padding='post')
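
As a quick sanity check (the question below is made up), the tokenizer maps each word to an integer index and pad_sequences right-pads the result with zeros:

sample = tok.texts_to_sequences(["How do I learn machine learning?"])
print(sample)  # one list of word indices per input string
print(pad_sequences(sample, maxlen=300, padding='post').shape)  # (1, 300)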

Now, assuming we have downloaded the GloVe pre-trained vectors from the link above, we initialize our embedding layer with the embedding matrix.

import os
import numpy as np

max_words = 10000
word_index = len(tok.word_index) + 1
glove_dir = ''
embeddings_index = {}

# Parse the GloVe file: each line holds a word followed by its 100-dim vector
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))
print(word_index)

# Build the embedding matrix: row i holds the GloVe vector for word index i
embedding_dim = 100
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in tok.word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
Screenshot of the output
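
To illustrate the cosine-similarity intuition mentioned earlier, here is a small sketch (my own, with illustrative words that may or may not be present in the loaded file) comparing GloVe vectors:

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two word vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Related words should score higher than unrelated ones
if all(w in embeddings_index for w in ('king', 'queen', 'banana')):
    print(cosine_similarity(embeddings_index['king'], embeddings_index['queen']))
    print(cosine_similarity(embeddings_index['king'], embeddings_index['banana']))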

Model

Image from [4]

Now that we have created our embedding matrix, we will start building our model.

Model summary
import tensorflow as tf

lstm_units = 50  # assumption: the number of LSTM units is not specified in the original code
# Shared bidirectional LSTM encoder used for both questions
lstm_layer = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(lstm_units, dropout=0.2, recurrent_dropout=0.2))
# Embedding layer loaded with our GloVe matrix and frozen during training
emb = tf.keras.layers.Embedding(max_words, embedding_dim, input_length=300,
                                weights=[embedding_matrix], trainable=False)
input1 = tf.keras.Input(shape=(300,))
e1 = emb(input1)
x1 = lstm_layer(e1)
input2 = tf.keras.Input(shape=(300,))
e2 = emb(input2)
x2 = lstm_layer(e2)
# L1 (Manhattan) distance between the two encodings
mhd = lambda x: tf.keras.backend.abs(x[0] - x[1])
merged = tf.keras.layers.Lambda(function=mhd, output_shape=lambda x: x[0],
                                name='L1_distance')([x1, x2])
preds = tf.keras.layers.Dense(1, activation='sigmoid')(merged)
model = tf.keras.Model(inputs=[input1, input2], outputs=preds)
model.compile(loss='mse', optimizer='adam')

We use a shared LSTM layer to encode the 100-dimensional word embeddings of each question. We then compute the Manhattan distance (also called the L1 distance) between the two encodings, followed by a sigmoid activation to squash the output between 0 and 1 (1 indicates maximum similarity, 0 minimum similarity). We use MSE as the loss function and the Adam optimizer.
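
As a toy illustration (the numbers are made up, not from the article), the element-wise L1 merge behaves like this:

import tensorflow as tf

# Two fake "encoded question" vectors (illustrative values only)
a = tf.constant([[1.0, 2.0, 3.0]])
b = tf.constant([[0.5, 2.0, 5.0]])

# Element-wise absolute difference, as in the L1_distance Lambda layer
print(tf.keras.backend.abs(a - b).numpy())  # [[0.5, 0.0, 2.0]]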

Our Model structure

Training

We split our train.csv into train, test, and validation sets to evaluate our model.

from sklearn.model_selection import train_test_split

def create_data():
    # Features are the raw question1/question2 texts; labels are is_duplicate
    features = df_train.drop(columns=['id', 'qid1', 'qid2', 'is_duplicate']).values
    labels = df_train['is_duplicate'].values
    # 60/20/20 split into train, test, and validation sets
    x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
    x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.25, random_state=42)
    return x_train, x_test, y_train, y_test, x_val, y_val
question1 and question2
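
As a usage sketch (this call is my own), the splits can be created and inspected like this; each feature row holds the raw text of question1 and question2, which goes through the same tokenizer and padding as before when fed to the two model inputs:

x_train, x_test, y_train, y_test, x_val, y_val = create_data()
# Each row of the feature arrays holds a (question1, question2) pair of raw strings
print(x_train.shape, x_val.shape, x_test.shape)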

To train our model, we simply call the fit function followed by the inputs.

history = model.fit([x_train[:,0], x_train[:,1]], y_train, epochs=100, validation_data=([x_val[:,0], x_val[:,1]], y_val))
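
After training, inference on a new pair follows the same preprocessing. The sketch below (with made-up questions) tokenizes and pads both inputs and reads the sigmoid output as the similarity score:

# Illustrative inference on a made-up question pair
q1 = ["How can I learn Python quickly?"]
q2 = ["What is the fastest way to learn Python?"]
s1 = pad_sequences(tok.texts_to_sequences(q1), maxlen=300, padding='post')
s2 = pad_sequences(tok.texts_to_sequences(q2), maxlen=300, padding='post')
# A score near 1 suggests the questions are duplicates; near 0 suggests they are not
print(model.predict([s1, s2]))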

References
