This small notebook is a demonstration of what you can do with:
import seaborn as sns
sns.set(context="notebook", style="whitegrid", palette="hls", font="sans-serif", font_scale=1.6)
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (8*16/9, 8)
This is very specific to where your data is stored.
from pathlib import Path
root = Path('..')
dataroot = root / 'data' / 'raw'
!ls -larth $dataroot
smsdata = dataroot / 'sms'
!ls -larth $smsdata
xml_filenames = !ls $smsdata
a_xml_filename = smsdata / xml_filenames[0]
!ls $a_xml_filename
!wc $a_xml_filename
Let's see the latest text message I received:
!head -n3 $a_xml_filename
str(a_xml_filename)
As suggested in this StackOverflow answer, I will use minidom.
from xml.dom import minidom
xmldoc = minidom.parse(str(a_xml_filename))
sms_list = xmldoc.getElementsByTagName('sms')
len(sms_list)
There are about $17000$ text messages in this file.
def sms_to_dict(sms):
    return dict(sms.attributes.items())
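To see what `sms_to_dict` produces, here is a quick sanity check on a hypothetical one-message document mimicking the backup format (the attribute values are made up, not taken from the real backup):

```python
from xml.dom import minidom

def sms_to_dict(sms):
    return dict(sms.attributes.items())

# Hypothetical minimal document mimicking the backup format
doc = minidom.parseString(
    '<smses><sms address="0612121212" type="1" body="Hi"/></smses>'
)
sms = doc.getElementsByTagName('sms')[0]
print(sms_to_dict(sms))
# {'address': '0612121212', 'type': '1', 'body': 'Hi'}
```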
all_dicts = [sms_to_dict(sms) for sms in sms_list]
The following keys are stored for each message:
attributes = list(all_dicts[0].keys())
attributes
This data structure is basic, and does not use much memory.
from sys import getsizeof
getsizeof(all_dicts)
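Note that `getsizeof` only measures the list object itself (its array of pointers), not the dicts it references, so the figure above underestimates the real footprint. A rough deep estimate, sketched on toy data:

```python
from sys import getsizeof

# Toy stand-in for all_dicts
all_dicts = [{'body': 'hello', 'type': '1'} for _ in range(1000)]

shallow = getsizeof(all_dicts)  # only the list's pointer array
deep = shallow + sum(
    getsizeof(d) + sum(getsizeof(k) + getsizeof(v) for k, v in d.items())
    for d in all_dicts
)
print(shallow < deep)  # the shallow figure is a large underestimate
```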
latest = all_dicts[0]
first = all_dicts[-1]
print(f"This dataset spans between '{first['time']}' and '{latest['time']}'")
from datetime import date, timedelta
from dateutil.parser import parse
span = date.fromtimestamp(int(latest['date'])/1000) - date.fromtimestamp(int(first['date'])/1000)
span_days = span.days
print(f"This dataset spans on {span_days} days")
This was not intentional!
sum(a not in sms for a in attributes for sms in all_dicts)
for a in attributes:
    dont_have_it = sum(a not in sms for sms in all_dicts)
    print(f"For attribute {a:>14}, {dont_have_it:>5} messages do not have it.")
import numpy as np
import pandas as pd
Behold the magic!
df = pd.DataFrame(all_dicts)
Let's clean up, first by using only the initials, like "LB" instead of "Lilian Besson".
Why? I want to preserve the privacy of my contacts!
def only_initials(s):
    if len(s) == 0:
        return s
    parts = s.split(" ")
    if len(parts) <= 1:
        return s[:2]
    else:
        return parts[0][:1] + parts[1][:1]
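A quick check that `only_initials` behaves as described, using the example name from above plus two edge cases:

```python
def only_initials(s):
    if len(s) == 0:
        return s
    parts = s.split(" ")
    if len(parts) <= 1:
        return s[:2]
    else:
        return parts[0][:1] + parts[1][:1]

print(only_initials("Lilian Besson"))  # LB
print(only_initials("Alice"))          # Al  (single word: first two letters)
print(only_initials(""))               # empty string stays empty
```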
df.name = df.name.apply(only_initials)
Then by cleaning up phone numbers. I want to remove the spaces, the '+' in the country code, and hide the last 3 digits to preserve the privacy of my contacts.
def cleanup_phonenumber(address):
    address = address.replace(' ', '')
    address = address.replace('+33', '0')
    return address[:-3] + 'XXX'
Tests:
cleanup_phonenumber('0612121212')
cleanup_phonenumber('06 12 12 12 12')
cleanup_phonenumber('06 121 212 12')
cleanup_phonenumber('+33612121212')
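All four spellings above should normalize to the same masked number; a quick assertion-style check:

```python
def cleanup_phonenumber(address):
    address = address.replace(' ', '')
    address = address.replace('+33', '0')
    return address[:-3] + 'XXX'

variants = ['0612121212', '06 12 12 12 12', '06 121 212 12', '+33612121212']
masked = {cleanup_phonenumber(v) for v in variants}
print(masked)  # {'0612121XXX'}
```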
df.address = df.address.apply(cleanup_phonenumber)
Let's check:
df[:2]
Some information:
df.info()
What about the "type" attribute? It is 1 for the received SMS and 2 for the sent SMS.
set(list(df.type))
received = df[df.type == '1']
sent = df[df.type == '2']
received[:1]
sent[:1]
len(received)
len(sent)
round(100 * ((len(sent) / len(received)) - 1))
What about the "service_center" information? Not so sure, but we don't care. Actually, it can be used to know the countries from which I sent all these messages:
country_codes = set([s[:3] for s in set(list(df.service_center)) if s and s.startswith('+')])
country_codes
assert set(list(df.read)) == {'1'}, "Some message was not read before the backup!"
min([len(b) for b in df.body])
sum([True for b in df.body if len(b) == 0])
I want to answer these statistical questions:
First, let's check this:
contacts = list(set(list(df.address)))
sent_contacts = list(set(list(sent.address)))
received_contacts = list(set(list(received.address)))
print(f"I sent messages to {len(sent_contacts)} different people, and received from {len(received_contacts)} different people in the last {span_days} days.")
Now, get the list of 10 people I exchanged the most text messages with:
df.groupby(['address']).body.count().sort_values(ascending=False)[:10]
exchanges = df.groupby(['address']).body.count()
Most of my contacts exchanged only one message with me ($10\%$), or very few ($43\%$ exchanged at most 10 messages, and $56\%$ at most 20).
N = len(exchanges)
N
round(100 * sum(exchanges == 1) / N)
round(100 * sum(exchanges <= 10) / N)
round(100 * sum(exchanges <= 20) / N)
exchanges.plot.hist()
But most of my messages ($61\%$) are with my top-10 contacts, with whom I exchange on average $1000$ messages in 2 years. There is a huge gap between the #1 contact ($1664$ messages) and the #10 contact (three times fewer!).
top10 = df.groupby(['address']).body.count().sort_values(ascending=False)[:10]
round(100 * top10.sum() / len(df))
top10.describe()
$76\%$ of my messages are with only $20$ people, and $90\%$ with $50$ people out of $153$.
top20 = df.groupby(['address']).body.count().sort_values(ascending=False)[:20]
round(100 * top20.sum() / len(df))
top20.mean()
top50 = df.groupby(['address']).body.count().sort_values(ascending=False)[:50]
round(100 * top50.sum() / len(df))
top50.mean()
Not so useful; I would like to have their names:
top20 = df.groupby(['address', 'name']).body.count().sort_values(ascending=False)[:20]
top20
Note that some names are missing, so it was wise to keep the phone numbers also.
df.groupby(['name']).body.count().sort_values(ascending=False)[:10]
I can draw some interesting conclusions:
It is interesting to notice that two of these top-20 contacts are "recent friends" whom I met in 2017.
import matplotlib.pyplot as plt
lengths = pd.Series([len(b) for b in df.body])
sent_lengths = pd.Series([len(b) for b in sent.body])
received_lengths = pd.Series([len(b) for b in received.body])
lengths.describe()
lengths.plot.hist(alpha=0.8)
It is interesting to divide sizes by $160$ and count how many texts fit in a single SMS or need more. The average message length is $70\%$ of one SMS.
(lengths / 160).round().describe()
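A small caveat: `round()` maps a 161-character text to $1$, while it actually needs $2$ SMS parts; a ceiling division counts parts exactly. A sketch on toy lengths (the notebook would use the real `lengths` Series):

```python
import numpy as np
import pandas as pd

# Toy message lengths, in characters
lengths = pd.Series([20, 70, 160, 161, 500])

# Number of 160-character SMS parts each message needs
parts = np.ceil(lengths / 160).astype(int)
print(parts.tolist())      # [1, 1, 1, 2, 4]
print((parts > 1).sum())   # 2 messages need more than one part
```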
And if we restrict to the reasonably-sized SMS, we see that most of them fit in fewer than 160 characters.
(lengths[lengths <= 4*160] / 160).plot.hist(alpha=0.8)
Most messages fit in one SMS, and most of them fit in $30\%$ of 160 characters:
(lengths[lengths <= 160] / 160).plot.hist(alpha=0.8)
Do I send longer messages than the ones I receive?
sent_lengths.describe()
received_lengths.describe()
To make it more visual, here is the histogram plots of non-MMS text messages I received and sent in the last two years:
(sent_lengths[sent_lengths <= 4*160] / 160).plot.hist(alpha=0.6, label="sent")
(received_lengths[received_lengths <= 4*160] / 160).plot.hist(alpha=0.6, label="received")
plt.legend()
plt.xlabel("Size of SMS")
#plt.savefig("images/size_of_sms_sent_vs_received.png")
To answer this, I will need to cluster the data on a daily basis.
It shouldn't be too hard to do, by adding a "day" column, based on the "time" column.
df[:5]
from datetime import date
def day_of_date(d):
    timestamp = int(d)/1000
    this_d = date.fromtimestamp(timestamp)
    return (this_d.day, this_d.month, this_d.year)
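One caveat: a `(day, month, year)` tuple does not sort chronologically (the 1sts of every month would come first). If the days ever need to be in calendar order, a hypothetical `(year, month, day)` variant works; a sketch:

```python
from datetime import date

def sortable_day_of_date(d):
    """Hypothetical variant of day_of_date with a chronologically sortable key."""
    this_d = date.fromtimestamp(int(d) / 1000)
    return (this_d.year, this_d.month, this_d.day)

# The sample timestamp used below falls in September 2018
print(sortable_day_of_date('1536725248997'))
```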
df.date[:1]
day_of_date('1536725248997')
Now, we can create a "day" attribute pretty easily:
df['day'] = df.date.apply(day_of_date)
df[:1]
And then we can group the data by "day".
df.groupby(['day']).body.count().sort_values(ascending=False)[:10]
df.groupby(['day', 'address', 'name']).body.count().sort_values(ascending=False)[:10]
On average, for each of my contacts, I exchange about $4.4$ messages per day. The maximum is a day in July when I exchanged a lot with LS (see above).
df.groupby(['day', 'address', 'name']).body.count().describe()
On average, I exchange about $27$ messages every day. What about the difference between sent and received messages?
df.groupby(['day']).body.count().describe()
received = df[df.type == '1']
sent = df[df.type == '2']
On average, I send $15$ messages and receive $12$ every day.
Extreme values for both follow the same distribution, which is very logical: a day when I send a lot is also a day when I receive a lot!
sent.groupby(['day']).body.count().describe()
received.groupby(['day']).body.count().describe()
To answer this, I need to group the data by weekday or by month.
df.date[:1]
weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
def weekday_of_date(d):
    timestamp = int(d)/1000
    this_d = date.fromtimestamp(timestamp)
    return weekdays[this_d.weekday()]
def weekend_of_date(d):
    timestamp = int(d)/1000
    this_d = date.fromtimestamp(timestamp)
    if this_d.weekday() >= 5:
        return "weekend"
    else:
        return "week"
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
def month_of_date(d):
    timestamp = int(d)/1000
    this_d = date.fromtimestamp(timestamp)
    return months[this_d.month - 1]
weekday_of_date('1536725248997')
weekend_of_date('1536725248997')
month_of_date('1536725248997')
I also want to group the text messages by hours or minutes.
df.time[:1]
def hour_of_time(t):
    return t[-8:-6]
def minute_of_time(t):
    return t[-5:-3]
t = '23 janv. 2018 17:34:08'
hour_of_time(t)
minute_of_time(t)
Now, we can create attributes for "weekday", "month" and "weekend" pretty easily:
df['weekday'] = df.date.apply(weekday_of_date)
df['weekend'] = df.date.apply(weekend_of_date)
df['month'] = df.date.apply(month_of_date)
df['hour'] = df.time.apply(hour_of_time)
df['minute'] = df.time.apply(minute_of_time)
Let's just check:
df[:1]
df[-1:]
Now, I can see that I don't text a lot at the beginning of the work week, and I'm mostly active on weekends and Fridays. That's a very natural result!
df.groupby(['weekday']).body.count().sort_values(ascending=False)
weekends = df.groupby(['weekend']).body.count().sort_values(ascending=False)
weekends["week"] /= 5
weekends["weekend"] /= 2
weekends
weekends.plot.bar(rot=0)
plt.title("Proportionally, I send more messages\nin 2 days of weekends than 5 days of the week")
#plt.savefig("images/week_vs_weekend.png")
Let's visualize a little bit more:
plt.axis('equal')
r = df.groupby(['weekday']).body.count().sort_values()
r.plot.pie(radius=1.1)
plt.title("Repartition of text message by weekday")
#plt.savefig("images/messages_by_weekday_1.png")
r = df.groupby(['weekday']).body.count().sort_values()
r.plot.bar(rot=45)
plt.title("Repartition of text messages by weekday\nInterpretation: I socialize more during the weekend!!")
#plt.savefig("images/messages_by_weekday_2.png")
And by month? There are clearly more in September, December and January:
df.groupby(['month']).body.count().sort_values(ascending=False)
monthnum_of_month = {m: i for i, m in enumerate(months)}
def lambda_monthnum_of_month(m):
    return monthnum_of_month[m]
monthnums = [lambda_monthnum_of_month(m) for m in months]
monthnums
df['monthnum'] = df.month.apply(lambda_monthnum_of_month)
key = df['month'].map(monthnum_of_month)
r = df.iloc[key.argsort()].groupby(['monthnum', 'month']).body.count()
r.to_numpy()
plt.bar(monthnums, r.to_numpy(), tick_label=months)
plt.title("Repartition of text messages by month\nInterpretation: I socialize more for Christmas and in September")
#plt.savefig("images/messages_by_month.png")
df.groupby(['hour']).body.count().sort_values(ascending=False)
r = df.groupby(['hour']).body.count()
r.plot.bar(rot=45)
hours_to_see = [0, 2, 6, 8, 10, 12, 14, 16, 18, 20, 22]
plt.xticks(hours_to_see, hours_to_see)
plt.title("Repartition of text messages by hours in a day\nInterpretation: I mainly text in the evenings around 19:00")
#plt.savefig("images/messages_by_hour.png")
I don't see other interesting questions to ask of this dataset, so I will stop here.
That's it for today! See you, folks!
See here for other notebooks I wrote.