Kaggle: Quora Insincere Questions Classification

From here on, part of this report is written in English.

Introduction

Quora Insincere Questions Classification

  • Target: A binary classifier to identify insincere questions
  • Evaluation: F1 Score
  • Environment: Kernels only
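Because the metric is F1 rather than accuracy, the cutoff used to turn predicted probabilities into 0/1 labels matters. The sketch below is only an illustration, not the competition's scoring code: `f1_score` and `best_threshold` are hypothetical helper names, and the labels and scores are made-up toy values.

```python
def f1_score(y_true, y_pred):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, scores, candidates):
    """Return the (threshold, f1) pair with the highest F1 among the candidates."""
    best = (None, -1.0)
    for thr in candidates:
        preds = [1 if s >= thr else 0 for s in scores]
        f1 = f1_score(y_true, preds)
        if f1 > best[1]:
            best = (thr, f1)
    return best


# Toy data (made up): 2 insincere questions out of 8.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.90]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
thr, f1 = best_threshold(y_true, scores, candidates)
print(thr, round(f1, 3))
```

In practice the threshold would be searched on held-out validation predictions, since a cutoff tuned on the training data tends to look better than it is.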

Insincere Question:
Questions founded upon false premises, or that intend to make a statement rather than look for helpful answers. For example:

  • Why is it that European food is so terrible? (false premise)
  • Why are Quora employees friendly and helpful? (make a statement)

Environment:

  • Kernels only; external data is not allowed
  • CPU Kernel <= 6 hours run-time
  • GPU Kernel <= 2 hours run-time
  • No internet access enabled
  • No multiple data sources enabled
  • No custom packages
  • Submission file must be named “submission.csv”
  • Pretrained embeddings:
    • GoogleNews-vectors-negative300 (word2vec)
    • glove.840B.300d
    • paragram_300_sl999
    • wiki-news-300d-1M (fasttext)
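Three of the embeddings listed above (GloVe, Paragram and the fastText wiki-news vectors) are plain-text files with one token per line followed by its vector components. A minimal sketch of how such a file could be turned into an embedding matrix is shown below. `load_text_embeddings` and `build_embedding_matrix` are hypothetical helper names, the in-memory `lines` stand in for an opened file, and the 3-dimensional vectors are made up. The GoogleNews word2vec file is binary and is not covered by this parser.

```python
import numpy as np


def load_text_embeddings(lines, dim):
    """Parse 'token v1 ... v_dim' lines into a {token: vector} dict."""
    index = {}
    for line in lines:
        # Split from the right so tokens that contain spaces stay intact.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # e.g. a "count dim" header line
        index[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return index


def build_embedding_matrix(word_index, embeddings, dim):
    """Row i holds the vector of the word with id i; row 0 is kept for padding.

    Assumption: words missing from the embeddings are left as zero vectors.
    """
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, row in word_index.items():
        vec = embeddings.get(word)
        if vec is not None:
            matrix[row] = vec
    return matrix


# Toy stand-in for an embedding file (made-up values, dim=3).
lines = [
    "3 3",                 # header line, skipped
    "the 0.1 0.2 0.3",
    "quora 0.4 0.5 0.6",
    ". . . 0.7 0.8 0.9",   # token containing spaces
]
embeddings = load_text_embeddings(lines, dim=3)
word_index = {"the": 1, "quora": 2, "kaggle": 3}  # hypothetical tokenizer output
matrix = build_embedding_matrix(word_index, embeddings, dim=3)
print(matrix.shape)
```

With the real files, `lines` would be the opened file object and `dim` would be 300.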

EDA

train.head(10)

Figure: first 10 rows of the training data.

train["target"].value_counts()

Figure: about 6% of the training data are insincere questions.
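The class share can also be read off directly instead of from raw counts. A small sketch, assuming a pandas DataFrame with a 0/1 `target` column as in the snippets above; `train_toy` and its rows are made up and only stand in for the real training file.

```python
import pandas as pd

# Toy stand-in for the training data (made-up rows).
train_toy = pd.DataFrame({
    "question_text": ["q1", "q2", "q3", "q4"],
    "target": [0, 0, 0, 1],
})

# Share of each class instead of raw counts.
class_share = train_toy["target"].value_counts(normalize=True)

# For a 0/1 target, the mean is the share of insincere questions.
positive_rate = train_toy["target"].mean()
print(class_share.loc[1], positive_rate)
```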

train[train.target == 1][0:10]

Figure: examples of insincere questions.

train['question_text'].apply(lambda x: len(x.split())).plot(kind='hist')

Figure: most questions are 70 words or shorter.
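The same word counts can be summarised numerically, which is handy when no plot is available. This is a sketch on made-up questions: `word_counts` is a hypothetical helper, and the cutoff of 10 words is chosen only to suit the toy data (the real data would use something like the 70 words mentioned above).

```python
import pandas as pd


def word_counts(questions):
    """Number of whitespace-separated tokens per question."""
    return questions.apply(lambda x: len(x.split()))


# Made-up questions standing in for train['question_text'].
toy_questions = pd.Series([
    "How do I learn Python ?",
    "Why ?",
    "What is the best way to prepare for a data science interview ?",
])

counts = word_counts(toy_questions)
share_short = (counts <= 10).mean()  # fraction of questions with at most 10 words
p99 = counts.quantile(0.99)          # length that covers 99% of the questions
print(counts.tolist(), share_short, p99)
```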

Word Frequency plot of sincere & insincere questions (from: Simple Exploration Notebook - QIQC):

Word: [figure]

Bigram: [figure]

Trigram: [figure]
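The counts behind such word, bigram and trigram plots can be produced with a few lines of standard-library code. This is a simplified sketch rather than the referenced notebook's code: `ngrams` and `top_ngrams` are hypothetical helpers, tokens are split on whitespace only, and no stopwords are removed. The two input questions are the insincere examples quoted in the introduction.

```python
from collections import Counter


def ngrams(text, n):
    """Lower-cased, whitespace-tokenised n-grams of a single question."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(texts, n, k):
    """The k most frequent n-grams over a collection of questions."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text, n))
    return counts.most_common(k)


insincere = [
    "Why is it that European food is so terrible?",
    "Why are Quora employees friendly and helpful?",
]
top_words = top_ngrams(insincere, n=1, k=2)
top_bigrams = top_ngrams(insincere, n=2, k=3)
print(top_words)
print(top_bigrams)
```

Running the same function separately on the sincere and the insincere questions gives the two sides of each comparison.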

Embeddings

Making good use of the different pretrained embeddings is one of the most important parts of this competition.
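One practical way to judge how well a pretrained embedding fits the data is to measure how much of the training vocabulary it covers. The sketch below is an illustration under assumptions: `build_vocab` and `check_coverage` are hypothetical helpers, the questions are made up, and a plain set of tokens stands in for a loaded embedding index.

```python
from collections import Counter


def build_vocab(texts):
    """Count whitespace-separated tokens over all questions."""
    vocab = Counter()
    for text in texts:
        vocab.update(text.split())
    return vocab


def check_coverage(vocab, embeddings):
    """Return (share of distinct words covered, share of all tokens covered, OOV list)."""
    known_types = sum(1 for w in vocab if w in embeddings)
    known_tokens = sum(c for w, c in vocab.items() if w in embeddings)
    oov = sorted(
        ((w, c) for w, c in vocab.items() if w not in embeddings),
        key=lambda item: -item[1],
    )
    return known_types / len(vocab), known_tokens / sum(vocab.values()), oov


# Made-up questions and a made-up embedding vocabulary.
texts = ["Why is Quora so popular?", "Is Quora useful?"]
embeddings = {"Why", "is", "Is", "Quora", "so", "popular", "useful"}

vocab = build_vocab(texts)
type_cov, token_cov, oov = check_coverage(vocab, embeddings)
print(type_cov, token_cov, oov)
```

In this toy run the out-of-vocabulary entries are `popular?` and `useful?`: the words themselves are known, but the attached punctuation hides them, which is the kind of issue preprocessing is meant to fix.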

Some great kernels about embeddings:

Reference

  1. What is an insincere question?
  2. Simple Exploration Notebook - QIQC
  3. A look at different embeddings.!
  4. How to: Preprocessing when using embeddings

Unfinished.