- See here for more information.
- Author: Lilian Besson.
- License: MIT License.
We have a few CSV files, let's start by reading them.
from tqdm import tqdm
import numpy as np
import pandas as pd
!ls -larth *.csv
!cp -vf submission.csv submission.csv.old
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
titles = pd.read_csv("titles.csv")
watched = pd.read_csv("watched.csv")
np.unique(titles.category)
Just to check that they have been read correctly:
train[:5]
len(train)
min(train['user_id']), max(train['user_id'])
min(train['work_id']), max(train['work_id'])
test[:5]
len(test)
min(test['user_id']), max(test['user_id'])
min(test['work_id']), max(test['work_id'])
watched[:5]
len(watched)
min(watched['user_id']), max(watched['user_id'])
min(watched['work_id']), max(watched['work_id'])
A first naive solution: for each work, predict the mean rating of the users who saw it, using data from the train data.
submission = test.copy()
total_average_rating = train.rating.mean()
submission[:5]
len(submission)
works_id = np.unique(np.append(test.work_id.unique(), train.work_id.unique()))
mean_ratings = pd.DataFrame(data={'mean_rating': 0}, index=works_id)
mean_ratings[:5]
len(mean_ratings)
computed_means = pd.DataFrame(data={'mean_rating': train.groupby('work_id').mean()['rating']}, index=works_id)
computed_means[:5]
len(computed_means)
mean_ratings.update(computed_means)
mean_ratings[:10]
len(mean_ratings)
submission = submission.join(mean_ratings, on='work_id')
submission.rename(columns={'mean_rating': 'prob_willsee'}, inplace=True)
# fallback for works absent from train (the mean of an empty group is NaN)
submission.fillna(value=total_average_rating, inplace=True)
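The join-then-fill-NaN step above can be sketched on toy data (the values below are made up, not from the real CSV files): works never rated in train get the global average as a fallback.

```python
import pandas as pd

# Toy sketch of the steps above: join per-work mean ratings onto the
# submission, using the global train mean as a fallback for works
# that never appear in train (work 3 here).
train = pd.DataFrame({'work_id': [1, 1, 2], 'rating': [0.0, 1.0, 1.0]})
submission = pd.DataFrame({'work_id': [1, 2, 3]})

total_average_rating = train.rating.mean()  # 2/3 on this toy data
mean_ratings = train.groupby('work_id').rating.mean().rename('prob_willsee')

submission = submission.join(mean_ratings, on='work_id')
submission = submission.fillna(value=total_average_rating)
print(submission)
```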
submission[:10]
Let's save it to submission_naive1.csv:
submission.to_csv("submission_naive1.csv", index=False)
!ls -larth submission_naive1.csv
watched.csv
The bonus data set watched.csv can give a lot of information. There are 200000 entries in it, and only 100000 in test.csv.
len(test), len(watched)
ratings = np.unique(watched.rating).tolist()
ratings
watched[:5]
By using the train data pairs (user, work) that are also in watched, we can learn to map the string ratings, i.e., 'dislike', 'neutral', 'like', 'love', to a probability of having seen the movie.
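This idea can be sketched on toy data (hypothetical values, assuming the 0/1 rating column from train is the seen/not-seen indicator): average the 0/1 column inside each string-rating group to get an empirical probability.

```python
import pandas as pd

# Toy sketch: each row has a string rating (from watched) and a 0/1
# seen-indicator (from train); the group-wise mean of the 0/1 column
# is the empirical probability of having seen the work.
merged = pd.DataFrame({
    'strrating': ['dislike', 'neutral', 'like', 'love', 'like', 'love'],
    'rating':    [0,         1,         1,      1,      1,      0],
})
prob_by_strrating = merged.groupby('strrating').rating.mean()
print(prob_by_strrating)
```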
watched.rename(columns={'rating': 'strrating'}, inplace=True)
watched[:5]
train[:5]
Are there pairs (user, work) for which both train data and watched data are available (i.e., both see/notsee and liked/disliked)?
train.merge(watched, on=['user_id', 'work_id'])
And what about test data?
test.merge(watched, on=['user_id', 'work_id'])
test.merge(watched, on=['work_id'])
No! So we can forget about user_id, and instead learn how to map liked/disliked to see/notsee for each movie.
all_train = watched.merge(train, on='work_id')
all_train[:5]
del all_train['user_id_x']
del all_train['user_id_y']
We can delete the user_id columns.
all_train[:5]
We can first get the average rating of each work:
all_train.groupby('work_id').rating.mean()[:10]
This table now contains, for each work, a mapping from strrating to rating.
It can be combined into a concise mapping, like in this form:
mapping_strrating_probwillsee = {
'dislike': 0,
'neutral': 0.50,
'like': 0.75,
'love': 1,
}
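Such a hand-written mapping can be applied column-wise with Series.map (the strrating values below are toy examples):

```python
import pandas as pd

# Hypothetical mapping from string rating to probability of having
# seen the work, applied entry-wise with Series.map
mapping_strrating_probwillsee = {
    'dislike': 0,
    'neutral': 0.50,
    'like': 0.75,
    'love': 1,
}
toy = pd.DataFrame({'strrating': ['love', 'dislike', 'like']})
toy['prob_willsee'] = toy['strrating'].map(mapping_strrating_probwillsee)
print(toy)
```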
Manually, for instance for one movie:
all_train[(all_train.work_id == 8025) & (all_train.strrating == 'dislike')]
all_train[all_train.work_id == 8025].rating.mean()
len(all_train[(all_train.work_id == 8025) & (all_train.strrating == 'dislike')].rating)
all_train[(all_train.work_id == 8025) & (all_train.strrating == 'dislike')].rating.mean()
len(all_train[(all_train.work_id == 8025) & (all_train.strrating == 'neutral')].rating)
all_train[(all_train.work_id == 8025) & (all_train.strrating == 'neutral')].rating.mean()
len(all_train[(all_train.work_id == 8025) & (all_train.strrating == 'like')].rating)
all_train[(all_train.work_id == 8025) & (all_train.strrating == 'like')].rating.mean()
len(all_train[(all_train.work_id == 8025) & (all_train.strrating == 'love')].rating)
all_train[(all_train.work_id == 8025) & (all_train.strrating == 'love')].rating.mean()
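The eight queries above can be collapsed into a single groupby with agg; here is a sketch over a toy stand-in for all_train (the values are made up):

```python
import pandas as pd

# Toy stand-in for all_train: one groupby replaces the per-strrating
# count/mean queries for a single work.
all_train = pd.DataFrame({
    'work_id':   [8025, 8025, 8025, 8025, 8025],
    'strrating': ['dislike', 'neutral', 'like', 'love', 'love'],
    'rating':    [0, 1, 1, 0, 1],
})
summary = (all_train[all_train.work_id == 8025]
           .groupby('strrating')
           .rating.agg(['count', 'mean']))
print(summary)
```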
That's weird!
titles.csv
I don't think I want to use the titles, but clustering the works by category could help, maybe.
categories = np.unique(titles.category).tolist()
categories
for cat in categories:
    print("There are {:>5} work(s) in category '{}'.".format(sum(titles.category == cat), cat))
One category is alone; let's relabel it as 'anime'.
categories = {
'anime': 0,
'album': 0,
'manga': 1,
}
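This dict can then be applied to the category column with Series.map (toy titles table below, not the real data): 'album' is folded into the same cluster as 'anime'.

```python
import pandas as pd

# Toy titles-like table; map each category to its cluster id,
# merging 'album' with 'anime' (both -> 0), 'manga' -> 1.
categories = {'anime': 0, 'album': 0, 'manga': 1}
toy_titles = pd.DataFrame({'category': ['anime', 'manga', 'album', 'manga']})
toy_titles['cluster'] = toy_titles['category'].map(categories)
print(toy_titles)
```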
TODO!