- See here for more information.
- Author: Lilian Besson.
- License: MIT License.
We have a few CSV files, let's start by reading them.
from tqdm import tqdm
import numpy as np
import pandas as pd
!ls -larth *.csv
!cp -vf submission.csv submission.csv.old
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
titles = pd.read_csv("titles.csv")
watched = pd.read_csv("watched.csv")
np.unique(titles.category)
Just to check they have correctly been read:
train[:5]
len(train)
min(train['user_id']), max(train['user_id'])
min(train['work_id']), max(train['work_id'])
test[:5]
len(test)
min(test['user_id']), max(test['user_id'])
min(test['work_id']), max(test['work_id'])
watched[:5]
len(watched)
min(watched['user_id']), max(watched['user_id'])
min(watched['work_id']), max(watched['work_id'])
Let us build a first naive submission: for each work, predict the mean rating of users who saw it, using data from the train data.
submission = test.copy()
total_average_rating = train.rating.mean()
submission[:5]
len(submission)
works_id = np.unique(np.append(test.work_id.unique(), train.work_id.unique()))
mean_ratings = pd.DataFrame(data={'mean_rating': 0}, index=works_id)
mean_ratings[:5]
len(mean_ratings)
computed_means = pd.DataFrame(data={'mean_rating': train.groupby('work_id')['rating'].mean()}, index=works_id)
computed_means[:5]
len(computed_means)
mean_ratings.update(computed_means)
mean_ratings[:10]
len(mean_ratings)
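The pattern above (seed every work with a default mean, then `update` with the per-work means computed on the train split) can be sketched on toy data. The frames below are hypothetical stand-ins, not the real CSVs:

```python
import numpy as np
import pandas as pd

# Hypothetical tiny train/test splits
train = pd.DataFrame({'work_id': [1, 1, 2], 'rating': [0.0, 1.0, 1.0]})
test = pd.DataFrame({'work_id': [1, 2, 3]})

# Works appearing in either split; works seen only in test keep the default 0
works_id = np.unique(np.append(test.work_id.unique(), train.work_id.unique()))
mean_ratings = pd.DataFrame({'mean_rating': 0.0}, index=works_id)

# Per-work mean rating from the train split (NaN for work 3, absent from train)
computed_means = pd.DataFrame(
    {'mean_rating': train.groupby('work_id')['rating'].mean()},
    index=works_id,
)

# update() overwrites only where computed_means is non-NaN,
# so work 3 keeps its default of 0.0
mean_ratings.update(computed_means)
print(mean_ratings)
```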
submission = submission.join(mean_ratings, on='work_id')
submission.rename(columns={'mean_rating': 'prob_willsee'}, inplace=True)
# works never rated in the train data get the global average rating
submission.fillna(value=total_average_rating, inplace=True)
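The join/rename/fillna step can be illustrated on a toy frame (hypothetical ids; `rename(columns=...)` is the current pandas spelling for renaming columns):

```python
import pandas as pd

# Hypothetical submission rows; work 99 does not appear in the train means
submission = pd.DataFrame({'user_id': [10, 11], 'work_id': [1, 99]})
mean_ratings = pd.DataFrame({'mean_rating': [0.5]}, index=[1])
total_average_rating = 0.6

# join on the work_id column against the mean_ratings index
sub = submission.join(mean_ratings, on='work_id')
sub = sub.rename(columns={'mean_rating': 'prob_willsee'})
# work 99 produced a NaN, replaced by the global average
sub = sub.fillna(value=total_average_rating)
print(sub)
```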
submission[:10]
Let's save it to submission_naive1.csv:
submission.to_csv("submission_naive1.csv", index=False)
!ls -larth submission_naive1.csv
watched.csv
The bonus data set watched can give a lot of information. There are 200000 entries in it and only 100000 in test.csv.
len(test), len(watched)
ratings = np.unique(watched.rating).tolist()
ratings
watched[:5]
By using the train data (user, work) pairs that are also in watched, we can learn to map the string rating, i.e., 'dislike', 'neutral', 'like', 'love', to the probability of having seen the movie.
watched.rename(columns={'rating': 'strrating'}, inplace=True)
watched[:5]
train[:5]
Are there pairs (user, work) for which both train data and watched data are available (i.e., both see/notsee and liked/disliked)?
train.merge(watched, on=['user_id', 'work_id'])
And what about test data?
test.merge(watched, on=['user_id', 'work_id'])
test.merge(watched, on=['work_id'])
No! So we can forget about the user_id, and we will learn how to map liked/disliked to see/notsee for each movie.
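The emptiness check above can be reproduced on toy frames (hypothetical data, not the real CSVs): an inner merge on both keys returns no rows exactly when no (user, work) pair is shared.

```python
import pandas as pd

# Hypothetical train-like and watched-like tables with disjoint users
seen = pd.DataFrame({'user_id': [1, 2], 'work_id': [10, 20], 'rating': [1, 0]})
liked = pd.DataFrame({'user_id': [3, 4], 'work_id': [10, 20],
                      'strrating': ['like', 'dislike']})

# Inner merge on both keys: an empty result means no (user, work) pair
# appears in both tables
common = seen.merge(liked, on=['user_id', 'work_id'])
print(common.empty)
```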
all_train = watched.merge(train, on='work_id')
all_train[:5]
del all_train['user_id_x']
del all_train['user_id_y']
We can delete the user_id columns.
all_train[:5]
We can first get the average rating of each work:
all_train.groupby('work_id').rating.mean()[:10]
This table now contains, for each work, a mapping from strrating to rating.
It can be condensed into a concise mapping, for instance in this form:
mapping_strrating_probwillsee = {
'dislike': 0,
'neutral': 0.50,
'like': 0.75,
'love': 1,
}
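Such a dictionary can be applied to a column of string ratings with `Series.map`; a minimal sketch on hypothetical values:

```python
import pandas as pd

mapping_strrating_probwillsee = {
    'dislike': 0,
    'neutral': 0.50,
    'like': 0.75,
    'love': 1,
}

# Hypothetical column of string ratings
strratings = pd.Series(['love', 'neutral', 'dislike'])
# map() replaces each string by its probability from the dictionary
probs = strratings.map(mapping_strrating_probwillsee)
print(probs.tolist())
```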
Manually, for instance for one movie:
all_train[(all_train.work_id == 8025) & (all_train.strrating == 'dislike')]
all_train[all_train.work_id == 8025].rating.mean()
len(all_train[(all_train.work_id == 8025) & (all_train.strrating == 'dislike')].rating)
all_train[(all_train.work_id == 8025) & (all_train.strrating == 'dislike')].rating.mean()
len(all_train[(all_train.work_id == 8025) & (all_train.strrating == 'neutral')].rating)
all_train[(all_train.work_id == 8025) & (all_train.strrating == 'neutral')].rating.mean()
len(all_train[(all_train.work_id == 8025) & (all_train.strrating == 'like')].rating)
all_train[(all_train.work_id == 8025) & (all_train.strrating == 'like')].rating.mean()
len(all_train[(all_train.work_id == 8025) & (all_train.strrating == 'love')].rating)
all_train[(all_train.work_id == 8025) & (all_train.strrating == 'love')].rating.mean()
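The repetitive per-category queries above can be condensed into one `groupby` over (work_id, strrating), giving the count and mean in a single table. A sketch on a hypothetical miniature `all_train`:

```python
import pandas as pd

# Hypothetical merged table with the same columns as all_train
all_train = pd.DataFrame({
    'work_id': [8025] * 4,
    'strrating': ['dislike', 'dislike', 'love', 'love'],
    'rating': [1, 0, 1, 1],
})

# One row per (work, string rating): group size and mean see/notsee rating
stats = all_train.groupby(['work_id', 'strrating'])['rating'].agg(['size', 'mean'])
print(stats)
```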
That's weird!
titles.csv
I don't think I want to use the titles, but clustering the works by categories could help, maybe.
categories = np.unique(titles.category).tolist()
categories
for cat in categories:
print("There is {:>5} work(s) in category '{}'.".format(sum(titles.category == cat), cat))
One category stands alone, so let's map it to the same id as 'anime'.
categories = {
'anime': 0,
'album': 0,
'manga': 1,
}
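This dictionary can then be applied to the category column with `Series.map`, merging 'album' into the 'anime' cluster; a sketch on hypothetical titles:

```python
import pandas as pd

# Hypothetical titles table, same columns as titles.csv
titles = pd.DataFrame({'work_id': [1, 2, 3],
                       'category': ['anime', 'album', 'manga']})
categories = {
    'anime': 0,
    'album': 0,
    'manga': 1,
}
# 'anime' and 'album' collapse to the same cluster id
titles['category_id'] = titles['category'].map(categories)
print(titles)
```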
TODO !