Kaggle: Quora Insincere Questions Classification

Posted on 2018-12-06 | In Competitions

From now on, I will write some of my reports in English.

Introduction

Quora Insincere Questions Classification

Target: A binary classifier to identify insincere questions
Evaluation: F1 Score
Environment: Kernels only
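
Since the evaluation metric is F1, it helps to be concrete about how it is computed for the positive (insincere) class. Below is a minimal sketch in plain Python. The toy labels, the probabilities, and the `best_threshold` helper are illustrative assumptions and not part of the competition code; picking the cut-off on validation data is one common way to handle a metric like F1, not necessarily what any particular kernel does.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, probs, candidates):
    """Hypothetical helper: return the (threshold, f1) pair with the highest F1."""
    scored = [(f1_score(y_true, [int(p >= t) for p in probs]), t) for t in candidates]
    best_f1, best_t = max(scored)
    return best_t, best_f1


# Toy data, made up for illustration: 2 insincere questions out of 8
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.05, 0.10, 0.20, 0.35, 0.60, 0.10, 0.45, 0.90]
threshold, score = best_threshold(y_true, probs, [0.3, 0.4, 0.5])
print(threshold, round(score, 3))
```

Because only about 6% of the questions are insincere, accuracy would reward predicting "sincere" everywhere, while F1 drops to 0 in that case.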

Insincere Question:
Questions founded upon false premises, or that intend to make a statement rather than look for helpful answers. For example:

Why is it that European food is so terrible? (false premise)
Why are Quora employees friendly and helpful? (make a statement)

Environment:

Kernels only; external data is not allowed
CPU Kernel <= 6 hours run-time
GPU Kernel <= 2 hours run-time
No internet access enabled
No multiple data sources enabled
No custom packages
Submission file must be named “submission.csv”
Pretrained embeddings:
- GoogleNews-vectors-negative300 (word2vec)
- glove.840B.300d
- paragram_300_sl999
- wiki-news-300d-1M (fasttext)

EDA

train.head(10)
[Output: first 10 rows of the training set]

train["target"].value_counts()
[Output] About 6% of the training data are insincere questions.

train[train.target == 1][0:10]
[Output] Examples of insincere questions.

train['question_text'].apply(lambda x: len(x.split())).plot(kind='hist')
[Figure] Most questions are 70 words long or shorter.

Word frequency plots for sincere and insincere questions (from the kernel "Simple Exploration Notebook - QIQC"):

Word [figure]

Bigram [figure]

Trigram [figure]
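
The counts behind word, bigram, and trigram frequency plots like these can be produced with a few lines of standard-library Python. This is only a rough sketch: the `ngram_counts` function is a name chosen here for illustration, it uses naive whitespace tokenization, and it does not reproduce the preprocessing or plotting of the referenced notebook.

```python
from collections import Counter


def ngram_counts(texts, n=1):
    """Count n-grams over whitespace-tokenized, lower-cased texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # Zipping n shifted views of the token list yields consecutive n-grams
        shifted = (tokens[i:] for i in range(n))
        counts.update(" ".join(gram) for gram in zip(*shifted))
    return counts


# The two example questions from the Introduction, used as toy input
questions = [
    "Why is it that European food is so terrible?",
    "Why are Quora employees friendly and helpful?",
]
unigrams = ngram_counts(questions, n=1)
bigrams = ngram_counts(questions, n=2)
print(unigrams.most_common(3))
```

On the real data one would call this separately on the sincere and insincere subsets (for example `train[train.target == 1]['question_text']`) and compare the most common entries.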

Embeddings

How to use the different pretrained embeddings is one of the most important parts of this competition.

Some great kernels about embeddings:

A look at different embeddings.! (a great example of using pretrained embeddings)
How to: Preprocessing when using embeddings
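
To make the idea concrete, here is a minimal sketch of two steps that usually come first when working with pretrained embeddings: parsing a text-format vector file into a lookup table, and measuring how much of the question vocabulary it covers. The function names `load_embeddings` and `coverage`, the 3-dimensional toy vectors, and the in-memory "file" are assumptions made for illustration; this is not code taken from the kernels listed above.

```python
import io
from collections import Counter


def load_embeddings(lines, dim):
    """Parse 'token v1 ... v_dim' text lines into a dict of float lists."""
    index = {}
    for line in lines:
        # Split from the right so tokens that contain spaces stay whole
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip header lines (e.g. 'count dim') or malformed rows
        index[parts[0]] = [float(x) for x in parts[1:]]
    return index


def coverage(texts, embeddings):
    """Return (vocabulary coverage, token coverage, out-of-vocabulary counts)."""
    vocab = Counter(tok for text in texts for tok in text.split())
    known = sum(c for w, c in vocab.items() if w in embeddings)
    oov = Counter({w: c for w, c in vocab.items() if w not in embeddings})
    vocab_cov = (len(vocab) - len(oov)) / len(vocab)
    token_cov = known / sum(vocab.values())
    return vocab_cov, token_cov, oov


# Toy stand-in for an embedding file: header line plus three 3-d vectors
toy_file = io.StringIO("3 3\nwhy 0.1 0.2 0.3\nis 0.0 0.5 0.1\nfood 0.3 0.3 0.3\n")
embeddings = load_embeddings(toy_file, dim=3)

texts = ["why is food terrible?", "why is it"]
vocab_cov, token_cov, oov = coverage(texts, embeddings)
print(round(vocab_cov, 2), round(token_cov, 2), oov.most_common(2))
```

With a real file one would pass an open file handle and `dim=300`. The out-of-vocabulary list (here `terrible?`, with its punctuation still attached) hints at which preprocessing steps would raise coverage.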

Reference

Unfinished.

rocuku

