- See here for more information.
- Author: Lilian Besson.
- License: MIT License.
We have a few CSV files, let's start by reading them.
from tqdm import tqdm
import numpy as np
import pandas as pd
!ls -larth *.csv
!cp -vf submission.csv submission.csv.old
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
titles = pd.read_csv("titles.csv")
watched = pd.read_csv("watched.csv")
np.unique(titles.category)
Just to check that they have been read correctly:
train[:5]
len(train)
min(train['user_id']), max(train['user_id'])
min(train['work_id']), max(train['work_id'])
test[:5]
len(test)
min(test['user_id']), max(test['user_id'])
min(test['work_id']), max(test['work_id'])
watched[:5]
len(watched)
min(watched['user_id']), max(watched['user_id'])
min(watched['work_id']), max(watched['work_id'])
A first naive solution: for each work, predict the mean rating of the users who saw it, using data from the train data.
submission = test.copy()
total_average_rating = train.rating.mean()
submission[:5]
len(submission)
works_id = np.unique(np.append(test.work_id.unique(), train.work_id.unique()))
mean_ratings = pd.DataFrame(data={'mean_rating': 0}, index=works_id)
mean_ratings[:5]
len(mean_ratings)
computed_means = pd.DataFrame(data={'mean_rating': train.groupby('work_id').mean()['rating']}, index=works_id)
computed_means[:5]
len(computed_means)
mean_ratings.update(computed_means)
mean_ratings[:10]
len(mean_ratings)
submission = submission.join(mean_ratings, on='work_id')
submission.rename(columns={'mean_rating': 'prob_willsee'}, inplace=True)
# fallback for works absent from train (the mean of an empty group is NaN)
submission.fillna(value=total_average_rating, inplace=True)
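The join-then-fill-NaN step above can be sketched on toy data (the values below are made up, not from the real CSV files): works never rated in train get the global average as a fallback.

```python
import pandas as pd

# Toy sketch of the steps above: join per-work mean ratings onto the
# submission, using the global train mean as a fallback for works
# that never appear in train (work 3 here).
train = pd.DataFrame({'work_id': [1, 1, 2], 'rating': [0.0, 1.0, 1.0]})
submission = pd.DataFrame({'work_id': [1, 2, 3]})

total_average_rating = train.rating.mean()  # 2/3 on this toy data
mean_ratings = train.groupby('work_id').rating.mean().rename('prob_willsee')

submission = submission.join(mean_ratings, on='work_id')
submission = submission.fillna(value=total_average_rating)
print(submission)
```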
submission[:10]
Let's save it to submission_naive1.csv:
submission.to_csv("submission_naive1.csv", index=False)
!ls -larth submission_naive1.csv
watched.csv
The bonus data set watched.csv can give a lot of information. There are 200000 entries in it, and only 100000 in test.csv.
len(test), len(watched)
ratings = np.unique(watched.rating).tolist()
ratings
watched[:5]
By using the train data pairs (user, work) that are also in watched, we can learn to map the string ratings, i.e., 'dislike', 'neutral', 'like', 'love', to a probability of having seen the movie.
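This idea can be sketched on toy data (hypothetical values, assuming the 0/1 rating column from train is the seen/not-seen indicator): average the 0/1 column inside each string-rating group to get an empirical probability.

```python
import pandas as pd

# Toy sketch: each row has a string rating (from watched) and a 0/1
# seen-indicator (from train); the group-wise mean of the 0/1 column
# is the empirical probability of having seen the work.
merged = pd.DataFrame({
    'strrating': ['dislike', 'neutral', 'like', 'love', 'like', 'love'],
    'rating':    [0,         1,         1,      1,      1,      0],
})
prob_by_strrating = merged.groupby('strrating').rating.mean()
print(prob_by_strrating)
```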
watched.rename(columns={'rating': 'strrating'}, inplace=True)
watched[:5]
train[:5]
Are there pairs (user, work) for which both train data and watched data are available (i.e., both see/notsee and liked/disliked)?
train.merge(watched, on=['user_id', 'work_id'])
And what about test data?
test.merge(watched, on=['user_id', 'work_id'])
test.merge(watched, on=['work_id'])
No! So we can forget about user_id, and instead learn how to map liked/disliked to see/notsee for each movie.
all_train = watched.merge(train, on='work_id')
all_train[:5]
del all_train['user_id_x']
del all_train['user_id_y']
We can delete the user_id columns.
all_train[:5]
We can first get the average rating of each work:
all_train.groupby('work_id').rating.mean()[:10]
This table now contains, for each work, a mapping from strrating to rating.
It can be combined into a concise mapping, like in this form:
mapping_strrating_probwillsee = {
'dislike': 0,
'neutral': 0.50,
'like': 0.75,
'love': 1,
}
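Such a hand-written mapping can be applied column-wise with Series.map (the strrating values below are toy examples):

```python
import pandas as pd

# Hypothetical mapping from string rating to probability of having
# seen the work, applied entry-wise with Series.map
mapping_strrating_probwillsee = {
    'dislike': 0,
    'neutral': 0.50,
    'like': 0.75,
    'love': 1,
}
toy = pd.DataFrame({'strrating': ['love', 'dislike', 'like']})
toy['prob_willsee'] = toy['strrating'].map(mapping_strrating_probwillsee)
print(toy)
```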
Manually, for instance for one movie:
all_train[(all_train.work_id == 8025) & (all_train.strrating == 'dislike')]
all_train[all_train.work_id == 8025].rating.mean()
len(all_train[(all_train.work_id == 8025) & (all_train.strrating == 'dislike')].rating)
all_train[(all_train.work_id == 8025) & (all_train.strrating == 'dislike')].rating.mean()
len(all_train[(all_train.work_id == 8025) & (all_train.strrating == 'neutral')].rating)
all_train[(all_train.work_id == 8025) & (all_train.strrating == 'neutral')].rating.mean()
len(all_train[(all_train.work_id == 8025) & (all_train.strrating == 'like')].rating)
all_train[(all_train.work_id == 8025) & (all_train.strrating == 'like')].rating.mean()
len(all_train[(all_train.work_id == 8025) & (all_train.strrating == 'love')].rating)
all_train[(all_train.work_id == 8025) & (all_train.strrating == 'love')].rating.mean()
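The eight queries above can be collapsed into a single groupby with agg; here is a sketch over a toy stand-in for all_train (the values are made up):

```python
import pandas as pd

# Toy stand-in for all_train: one groupby replaces the per-strrating
# count/mean queries for a single work.
all_train = pd.DataFrame({
    'work_id':   [8025, 8025, 8025, 8025, 8025],
    'strrating': ['dislike', 'neutral', 'like', 'love', 'love'],
    'rating':    [0, 1, 1, 0, 1],
})
summary = (all_train[all_train.work_id == 8025]
           .groupby('strrating')
           .rating.agg(['count', 'mean']))
print(summary)
```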
That's weird!
titles.csv
I don't think I want to use the titles, but clustering the works by category could help, maybe.
categories = np.unique(titles.category).tolist()
categories
for cat in categories:
    print("There are {:>5} work(s) in category '{}'.".format(sum(titles.category == cat), cat))
One category is alone; let's relabel it as 'anime'.
categories = {
'anime': 0,
'album': 0,
'manga': 1,
}
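This dict can then be applied to the category column with Series.map (toy titles table below, not the real data): 'album' is folded into the same cluster as 'anime'.

```python
import pandas as pd

# Toy titles-like table; map each category to its cluster id,
# merging 'album' with 'anime' (both -> 0), 'manga' -> 1.
categories = {'anime': 0, 'album': 0, 'manga': 1}
toy_titles = pd.DataFrame({'category': ['anime', 'manga', 'album', 'manga']})
toy_titles['cluster'] = toy_titles['category'].map(categories)
print(toy_titles)
```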
TODO!