Quora Question Pairs: Detecting Text Similarity using Siamese networks.

Aadit Kapoor
Towards Data Science

Ever wondered how to calculate text similarity using Deep Learning? We aim to develop a model that detects whether two pieces of text have the same meaning, using the Quora Question Pairs dataset.

Cover photo: https://unsplash.com/photos/askpr0s66Rg

Requirements

  • Python 3.8
  • Scikit-Learn
  • TensorFlow
  • Gensim
  • NLTK

Dataset

Let us first start by exploring the dataset. Our dataset consists of:

  • id: The ID of a question pair in the training set
  • qid1, qid2: Unique IDs of each question
  • question1: The text of the first question
  • question2: The text of the second question
  • is_duplicate: 1 if question1 and question2 have the same meaning, 0 otherwise
Preview of our dataset
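
The preview above can be reproduced with a minimal loading sketch, assuming the train.csv file from the Kaggle competition sits in the working directory:

import pandas as pd

# Assumption: train.csv downloaded from the Kaggle Quora Question Pairs competition
df = pd.read_csv('train.csv')

print(df.shape)
print(df.columns.tolist())
df.head()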

Data Preprocessing

Like any Machine Learning project, we will start by preprocessing the data. Let us first load the data and combine question1 and question2 to form the vocabulary.

def load_data(df):
    question1 = df['question1'].astype(str).values
    question2 = df['question2'].astype(str).values
    # combined: concatenated questions used later to build the vocabulary
    df['combined'] = df['question1'] + df['question2']
    labels = df['is_duplicate'].values
    return question1, question2, labels

question1, question2, labels = load_data(df)
question1 = list(question1)
question2 = list(question2)
combined = question1 + question2

df.head()
Our modified data frame

We will also clean the text a bit.

# Remove non-ASCII characters from the dataset.
def cleanAscii(text):
    return ''.join(i for i in text if ord(i) < 128)
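
As a usage sketch (this call is my own, not in the original), the filter can be applied to both question lists before tokenization:

# Apply the ASCII filter to every question and rebuild the combined vocabulary
question1 = [cleanAscii(q) for q in question1]
question2 = [cleanAscii(q) for q in question2]
combined = question1 + question2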

Word Embeddings

Every feed-forward neural network that takes words from a vocabulary as input and embeds them as vectors into a lower dimensional space, which it then fine-tunes through back-propagation, necessarily yields word embeddings as the weights of the first layer, which is usually referred to as Embedding Layer (Ruder, 2016)

Image from [3]

Word embeddings learn the syntactic and semantic aspects of the text (Almeida et al., 2019). Since our problem hinges on the semantic meaning of the questions, we will use a word embedding as the first layer of our Siamese network.

For this, we will use the popular GloVe (Global Vectors for Word Representation) model. We will obtain the pre-trained vectors (https://nlp.stanford.edu/projects/glove/) and load them into our first layer, the embedding layer.

Because GloVe is trained so that related words end up close together under cosine similarity (a nearest-neighbours view), it is able to capture semantic similarity between words. In our model, we will build an embedding matrix from the GloVe weights and look up a word vector for each word in our sentences.

First, we build a Tokenizer over our combined vocabulary.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_words = 10000
tok = Tokenizer(num_words=max_words, oov_token="<OOV>")
tok.fit_on_texts(combined)

# Convert the questions to integer sequences and pad them to a fixed length of 300
sequences = tok.texts_to_sequences(combined)
sequences = pad_sequences(sequences, maxlen=300, padding='post')
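
As a quick sanity check (the question below is made up), the tokenizer maps each word to an integer index and pad_sequences right-pads the result with zeros:

sample = tok.texts_to_sequences(["How do I learn machine learning?"])
print(sample)  # one list of word indices per input string
print(pad_sequences(sample, maxlen=300, padding='post').shape)  # (1, 300)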

Now, assuming we have downloaded the GloVe pre-trained vectors from the link above, we initialize our embedding layer with the embedding matrix.

import os
import numpy as np

max_words = 10000
word_index = len(tok.word_index) + 1
glove_dir = ''
embeddings_index = {}

# Parse the GloVe file: each line holds a word followed by its 100-dim vector
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))
print(word_index)

# Build the embedding matrix: row i holds the GloVe vector for word index i
embedding_dim = 100
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in tok.word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
Screenshot of the output
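
To illustrate the cosine-similarity intuition mentioned earlier, here is a small sketch (my own, with illustrative words that may or may not be present in the loaded file) comparing GloVe vectors:

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two word vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Related words should score higher than unrelated ones
if all(w in embeddings_index for w in ('king', 'queen', 'banana')):
    print(cosine_similarity(embeddings_index['king'], embeddings_index['queen']))
    print(cosine_similarity(embeddings_index['king'], embeddings_index['banana']))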

Model

Image from [4]

Now that we have created our embedding matrix, we will start building our model.

Model summary
import tensorflow as tf

lstm_units = 50  # assumption: the number of LSTM units is not specified in the original code
# Shared bidirectional LSTM encoder used for both questions
lstm_layer = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(lstm_units, dropout=0.2, recurrent_dropout=0.2))
# Embedding layer loaded with our GloVe matrix and frozen during training
emb = tf.keras.layers.Embedding(max_words, embedding_dim, input_length=300,
                                weights=[embedding_matrix], trainable=False)
input1 = tf.keras.Input(shape=(300,))
e1 = emb(input1)
x1 = lstm_layer(e1)
input2 = tf.keras.Input(shape=(300,))
e2 = emb(input2)
x2 = lstm_layer(e2)
# L1 (Manhattan) distance between the two encodings
mhd = lambda x: tf.keras.backend.abs(x[0] - x[1])
merged = tf.keras.layers.Lambda(function=mhd, output_shape=lambda x: x[0],
                                name='L1_distance')([x1, x2])
preds = tf.keras.layers.Dense(1, activation='sigmoid')(merged)
model = tf.keras.Model(inputs=[input1, input2], outputs=preds)
model.compile(loss='mse', optimizer='adam')

We use a shared LSTM layer to encode the 100-dimensional word embeddings of each question. We then compute the Manhattan distance (also called the L1 distance) between the two encodings, followed by a sigmoid activation to squash the output between 0 and 1 (1 indicates maximum similarity, 0 minimum similarity). We use MSE as the loss function and the Adam optimizer.
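
As a toy illustration (the numbers are made up, not from the article), the element-wise L1 merge behaves like this:

import tensorflow as tf

# Two fake "encoded question" vectors (illustrative values only)
a = tf.constant([[1.0, 2.0, 3.0]])
b = tf.constant([[0.5, 2.0, 5.0]])

# Element-wise absolute difference, as in the L1_distance Lambda layer
print(tf.keras.backend.abs(a - b).numpy())  # [[0.5, 0.0, 2.0]]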

Our Model structure

Training

We split our train.csv into train, test, and validation sets to evaluate our model.

from sklearn.model_selection import train_test_split

def create_data():
    # Features are the raw question1/question2 texts; labels are is_duplicate
    features = df_train.drop(columns=['id', 'qid1', 'qid2', 'is_duplicate']).values
    labels = df_train['is_duplicate'].values
    # 60/20/20 split into train, test, and validation sets
    x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
    x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.25, random_state=42)
    return x_train, x_test, y_train, y_test, x_val, y_val
question1 and question2
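
As a usage sketch (this call is my own), the splits can be created and inspected like this; each feature row holds the raw text of question1 and question2, which goes through the same tokenizer and padding as before when fed to the two model inputs:

x_train, x_test, y_train, y_test, x_val, y_val = create_data()
# Each row of the feature arrays holds a (question1, question2) pair of raw strings
print(x_train.shape, x_val.shape, x_test.shape)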

To train our model, we simply call the fit function followed by the inputs.

history = model.fit([x_train[:,0], x_train[:,1]], y_train, epochs=100, validation_data=([x_val[:,0], x_val[:,1]], y_val))
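
After training, inference on a new pair follows the same preprocessing. The sketch below (with made-up questions) tokenizes and pads both inputs and reads the sigmoid output as the similarity score:

# Illustrative inference on a made-up question pair
q1 = ["How can I learn Python quickly?"]
q2 = ["What is the fastest way to learn Python?"]
s1 = pad_sequences(tok.texts_to_sequences(q1), maxlen=300, padding='post')
s2 = pad_sequences(tok.texts_to_sequences(q2), maxlen=300, padding='post')
# A score near 1 suggests the questions are duplicates; near 0 suggests they are not
print(model.predict([s1, s2]))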

References
