Introduction

This document details the series of actions carried out to improve the quality of the dataset from the Twitter user 'WeRateDogs'. This peculiar dataset holds interesting information about dogs featured on the internet. To make this and later analyses easier, the dataset was augmented, assessed, cleaned and briefly analyzed. It contains observations that could yield insights into which dogs attract the most attention, which pictures or actions are most popular on Twitter, and so on.

In [1371]:
#Required packages and tools are imported
import pandas as pd
import numpy as np
import requests as rd
import time
import tweepy as tw
import json
import sys
from matplotlib import pyplot as plt
%matplotlib inline
In [1372]:
#Twitter API credentials
#consumer_key = 'YOUR CONSUMER KEY'
#consumer_secret = 'YOUR CONSUMER SECRET'
#access_token = 'YOUR ACCESS TOKEN'
#access_secret = 'YOUR ACCESS SECRET'

#auth = tw.OAuthHandler(consumer_key, consumer_secret)
#auth.set_access_token(access_token, access_secret)
#api = tw.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify = True, parser=tw.parsers.JSONParser())

Data Wrangling

This section holds the steps carried out to gather, assess and clean the datasets. All steps are documented here, and more detail can be found in the attached file 'wrangle_report.pdf'.

Gathering

The gathering of all files required for the analysis is documented here. The first file, containing the original 2356 tweets, was provided as a .csv file and uploaded to this notebook server. The second file was downloaded from a URL using the requests library. The third file was obtained through the Twitter API. The steps taken to obtain each file are described in the sections below:

In [1373]:
# Read CSV (comma-separated) file into DataFrame
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

The dataset is first inspected to make sure it was imported successfully:

In [1374]:
# Preliminary Visual inspection of the dataset 
twitter_archive.head()
Out[1374]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... NaN NaN NaN https://twitter.com/dog_rates/status/892177421... 13 10 Tilly None None None None
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Archie. He is a rare Norwegian Pouncin... NaN NaN NaN https://twitter.com/dog_rates/status/891815181... 12 10 Archie None None None None
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Darla. She commenced a snooze mid meal... NaN NaN NaN https://twitter.com/dog_rates/status/891689557... 13 10 Darla None None None None
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" r... This is Franklin. He would like you to stop ca... NaN NaN NaN https://twitter.com/dog_rates/status/891327558... 12 10 Franklin None None None None

As the import appears to have been successful, the second dataset is downloaded from the provided URL using the requests library:

In [1375]:
# Store the URL of the image-predictions file and fetch it with requests
URL1 = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = rd.get(URL1)

# Save the downloaded content to a local file
with open('image_predictions.tsv', mode='wb') as file:
    file.write(response.content)
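The cell above writes whatever body comes back, even for an error page. A slightly more defensive variant checks the HTTP status first; this `download_file` helper (and its injectable `session` parameter) is an illustrative sketch, not code the notebook ran:

```python
def download_file(url, path, session=None, timeout=30):
    """Fetch url and save the response body to path, raising on HTTP errors."""
    if session is None:
        import requests  # deferred import so a stub session can be injected for testing
        session = requests
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raises on 4xx/5xx instead of silently saving an error page
    with open(path, mode='wb') as f:
        f.write(response.content)
    return path
```

`requests.get` has no timeout by default, so passing one avoids a notebook cell that hangs indefinitely on a stalled connection.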
In [1376]:
# Read TSV (Tab-separated) file into DataFrame
image_predictions = pd.read_csv('image_predictions.tsv', sep='\t')
In [1377]:
# Preliminary visual inspection of the dataset 
image_predictions.sample(2)
Out[1377]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
569 678389028614488064 https://pbs.twimg.com/media/CWofOHUWUAACGVa.jpg 1 miniature_pinscher 0.516284 True kelpie 0.227402 True vizsla 0.103246 True
295 671357843010908160 https://pbs.twimg.com/media/CVEkZaPXIAEw5vr.jpg 1 Italian_greyhound 0.831757 True toy_terrier 0.043306 True Chihuahua 0.036773 True

The third dataset is gathered from Twitter using its API and the Tweepy library. The tweets to download are those whose IDs appear in the original file. To download exactly those, the 'tweet_id' column is stored in a list and every tweet matching an ID in that list is saved to the file 'tweet_json.txt'.

In [1378]:
# Store the tweet IDs in a list
list_ids = list(twitter_archive.tweet_id)

# Verify the length of this list
len(list_ids) 
Out[1378]:
2356
In [1379]:
"""
#Set up a timer to see how long the operation lasts
start = time.time()
deleted_tweets = 0

#Set up a for loop that verifies the tweet id's 
# Use try and except for cases where the tweet has been deleted
with open('tweet_json.txt', mode= 'w') as f:
    for ids in list_ids:
        try:
            tweet = api.get_status(ids, tweet_mode='extended')
            json.dump(tweet, f)
        except:
            deleted_tweets += 1
            
#record end time             
end = time.time()
#print total duration
print(end - start)

# print total of tweets deleted
print('A total of ' + str(deleted_tweets) + ' tweets were deleted')

"""
Out[1379]:
"\n#Set up a timer to see how long the operation lasts\nstart = time.time()\ndeleted_tweets = 0\n\n#Set up a for loop that verifies the tweet id's \n# Use try and except for cases where the tweet has been deleted\nwith open('tweet_json.txt', mode= 'w') as f:\n    for ids in list_ids:\n        try:\n            tweet = api.get_status(ids, tweet_mode='extended')\n            json.dump(tweet, f)\n        except:\n            deleted_tweets += 1\n            \n#record end time             \nend = time.time()\n#print total duration\nprint(end - start)\n\n# print total of tweets deleted\nprint('A total of ' + str(deleted_tweets) + ' tweets were deleted')\n\n"

The file can now be read into a DataFrame:

In [1380]:
# Read the flat .txt file into DataFrame using read_json
# Full_tweets = pd.read_json('tweet_json.txt', orient= 'records', encoding="utf8", lines=True)

This code did not work because the file contained invalid JSON, throwing the following error:

ValueError: Unexpected character found when decoding array value (2)

Checking the file in a JSON validator showed an EOF (end of file) error: json.dump had written the tweet objects back to back, with no newline between them, so the file is neither valid as a single JSON document nor line-delimited. A workaround was found on Stack Exchange: a function that parses the concatenated objects one by one into a list, from which a DataFrame is then created. This approach is used below:

In [1381]:
# First the parseMultipleJSON function is created to split the file's
# concatenated JSON objects by tracking brace depth, so each object can be
# parsed on its own (this assumes no stray braces inside string values)

def parseMultipleJSON(lines):
    skip = prev = 0
    data = []
    text = ''.join(lines)
    for idx, char in enumerate(text):
        if char == "{":
            skip += 1      # entering an object (possibly nested)
        elif char == "}":
            skip -= 1      # leaving it
            if skip == 0:  # a top-level object just closed
                data.append(json.loads(text[prev:idx + 1]))
                prev = idx + 1
    return data

with open('tweet_json.txt', 'r') as json_file:
    lines = [line.strip("\n") for line in json_file.readlines()]
    data = parseMultipleJSON(lines)

detailed_tweets = pd.DataFrame(data)
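For future runs, the workaround can be avoided entirely by writing the file as newline-delimited JSON (one `json.dumps` plus `'\n'` per tweet), a format both `json.loads` per line and `pd.read_json(..., lines=True)` handle. A minimal sketch with made-up stand-in records:

```python
import io
import json

# Hypothetical records standing in for the tweet payloads.
records = [{"id": 1, "full_text": "This is Phineas."},
           {"id": 2, "full_text": "This is Tilly."}]

# Write one JSON object per line (NDJSON) instead of back-to-back objects.
buf = io.StringIO()
for rec in records:
    buf.write(json.dumps(rec) + "\n")

# Every line is now a self-contained JSON document.
parsed = [json.loads(line) for line in buf.getvalue().splitlines()]
```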
In [1382]:
# Preliminary visual inspection of the dataset 
detailed_tweets.sample(2)
Out[1382]:
created_at id id_str full_text truncated display_text_range entities extended_entities source in_reply_to_status_id ... favorited retweeted possibly_sensitive possibly_sensitive_appealable lang retweeted_status quoted_status_id quoted_status_id_str quoted_status_permalink quoted_status
212 Thu Apr 06 00:13:11 +0000 2017 849776966551130114 849776966551130114 Seriously guys? Again? We only rate dogs. Plea... False [0, 135] {'hashtags': [], 'symbols': [], 'user_mentions... {'media': [{'id': 849776958493818880, 'id_str'... <a href="http://twitter.com/download/iphone" r... NaN ... False False False False en NaN NaN NaN NaN NaN
57 Fri Jun 30 23:47:07 +0000 2017 880935762899988482 880935762899988482 This is Louis. He's crossing. It's a big deal.... False [0, 72] {'hashtags': [], 'symbols': [], 'user_mentions... {'media': [{'id': 880935758152028161, 'id_str'... <a href="http://twitter.com/download/iphone" r... NaN ... False False False False en NaN NaN NaN NaN NaN

2 rows × 32 columns

Assessing

This part of the analysis focuses on identifying the aspects of the datasets that need improvement; fixing them will make the subsequent analysis possible. A visual inspection is carried out using the software Numbers, followed by a programmatic one. Issues found are categorized as quality or tidiness issues.

  1. twitter_archive table

    First a visual inspection was carried out with the software 'Numbers'


This inspection gave a general idea of what the variables in this file are and which values they could and could not take. Many observations were empty (missing values).

Programmatic inspection:

In [1383]:
# Display main characteristics of this table with info
twitter_archive.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 non-null   object 
 14  floofer                     2356 non-null   object 
 15  pupper                      2356 non-null   object 
 16  puppo                       2356 non-null   object 
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB
In [1384]:
# Verify the data types of the timestamp 
type(twitter_archive.timestamp[9])
Out[1384]:
str
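The string timestamps are an obvious cleaning candidate; the archive's format ('2017-08-01 16:23:56 +0000') parses directly, for example:

```python
import pandas as pd

# The trailing '+0000' offset is understood; utc=True yields a tz-aware Timestamp.
ts = pd.to_datetime("2017-08-01 16:23:56 +0000", utc=True)
```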
In [1385]:
# Sample rows where retweeted_status_id is null (original tweets, not retweets)
twitter_archive[twitter_archive['retweeted_status_id'].isnull()].sample(14)
Out[1385]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
1769 678380236862578688 NaN NaN 2015-12-20 01:03:46 +0000 <a href="http://twitter.com/download/iphone" r... This is Crumpet. He underestimated the snow. Q... NaN NaN NaN https://twitter.com/dog_rates/status/678380236... 10 10 Crumpet None None None None
1401 699434518667751424 NaN NaN 2016-02-16 03:25:58 +0000 <a href="http://twitter.com/download/iphone" r... I know this is a tad late but here's a wonderf... NaN NaN NaN https://twitter.com/dog_rates/status/699434518... 12 10 None None None pupper None
245 845812042753855489 NaN NaN 2017-03-26 01:38:00 +0000 <a href="http://twitter.com/download/iphone" r... We usually don't rate polar bears but this one... NaN NaN NaN https://twitter.com/dog_rates/status/845812042... 13 10 None None None None None
1929 674042553264685056 NaN NaN 2015-12-08 01:47:22 +0000 <a href="http://twitter.com/download/iphone" r... Yea I can't handle the cuteness anymore. Curls... NaN NaN NaN https://twitter.com/dog_rates/status/674042553... 12 10 None None None None None
1428 697616773278015490 NaN NaN 2016-02-11 03:02:54 +0000 <a href="http://twitter.com/download/iphone" r... This pupper doubles as a hallway rug. Very rar... NaN NaN NaN https://twitter.com/dog_rates/status/697616773... 11 10 None None None pupper None
1957 673583129559498752 NaN NaN 2015-12-06 19:21:47 +0000 <a href="http://twitter.com/download/iphone" r... This is Sandy. She loves her spot by the tree.... NaN NaN NaN https://twitter.com/dog_rates/status/673583129... 11 10 Sandy None None None None
1404 699370870310113280 NaN NaN 2016-02-15 23:13:03 +0000 <a href="http://twitter.com/download/iphone" r... Say hello to Calbert. He doesn't have enough l... NaN NaN NaN https://twitter.com/dog_rates/status/699370870... 11 10 Calbert None None None None
553 804026241225523202 NaN NaN 2016-11-30 18:16:08 +0000 <a href="http://twitter.com/download/iphone" r... This is Bo. He's going to make me cry. 13/10 p... NaN NaN NaN https://twitter.com/dog_rates/status/804026241... 13 10 Bo None None None None
1690 681339448655802368 NaN NaN 2015-12-28 05:02:37 +0000 <a href="http://twitter.com/download/iphone" r... This is Asher. He's not wearing a seatbelt or ... NaN NaN NaN https://twitter.com/dog_rates/status/681339448... 9 10 Asher None None None None
706 785533386513321988 NaN NaN 2016-10-10 17:32:08 +0000 <a href="http://twitter.com/download/iphone" r... This is Dallas. Her tongue is ridiculous. 11/1... NaN NaN NaN https://twitter.com/dog_rates/status/785533386... 11 10 Dallas None None None None
1468 694206574471057408 NaN NaN 2016-02-01 17:11:59 +0000 <a href="http://twitter.com/download/iphone" r... "Martha come take a look at this. I'm so fed u... NaN NaN NaN https://twitter.com/dog_rates/status/694206574... 10 10 None None None None None
2141 669993076832759809 NaN NaN 2015-11-26 21:36:12 +0000 <a href="http://twitter.com/download/iphone" r... This is Zoey. Her dreams of becoming a hippo b... NaN NaN NaN https://twitter.com/dog_rates/status/669993076... 9 10 Zoey None None None None
252 844973813909606400 NaN NaN 2017-03-23 18:07:10 +0000 <a href="http://twitter.com/download/iphone" r... This is Brady. He's a recovering alcoholic. De... NaN NaN NaN https://twitter.com/dog_rates/status/844973813... 12 10 Brady None None None None
1901 674646392044941312 NaN NaN 2015-12-09 17:46:48 +0000 <a href="http://twitter.com/download/iphone" r... Two gorgeous dogs here. Little waddling dog is... NaN NaN NaN https://twitter.com/dog_rates/status/674646392... 5 10 None None None None None
In [1386]:
# Look at the unique values
twitter_archive.nunique()
Out[1386]:
tweet_id                      2356
in_reply_to_status_id           77
in_reply_to_user_id             31
timestamp                     2356
source                           4
text                          2356
retweeted_status_id            181
retweeted_status_user_id        25
retweeted_status_timestamp     181
expanded_urls                 2218
rating_numerator                40
rating_denominator              18
name                           957
doggo                            2
floofer                          2
pupper                           2
puppo                            2
dtype: int64
In [1387]:
# Verify the most repeated value within the stages of dog 'doggo'
twitter_archive.doggo.mode()
Out[1387]:
0    None
dtype: object
In [1388]:
# Verify if all dog stage columns have 'None' as their most frequent value
twitter_archive.doggo.mode()[0] == twitter_archive.floofer.mode()[0] == twitter_archive.pupper.mode()[0] == twitter_archive.puppo.mode()[0]
Out[1388]:
True
In [1389]:
# Obtain number of entries that do not have a dog stage assigned
twitter_archive_subset = twitter_archive[(twitter_archive['doggo'] == 'None') & (twitter_archive['floofer'] == 'None') & (twitter_archive['pupper'] == 'None') & (twitter_archive['puppo'] == 'None')]
len(twitter_archive_subset)
Out[1389]:
1976
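The four-way boolean chain above can be written more compactly with `eq` and `all(axis=1)`; a sketch on a toy frame (column names match the archive, values are made up):

```python
import pandas as pd

# Toy stand-in for the four dog-stage columns of twitter_archive.
toy = pd.DataFrame({
    "doggo":   ["None", "doggo", "None"],
    "floofer": ["None", "None",  "None"],
    "pupper":  ["None", "None",  "pupper"],
    "puppo":   ["None", "None",  "None"],
})

stages = ["doggo", "floofer", "pupper", "puppo"]
# True for rows where every stage column holds the placeholder string 'None'
no_stage = toy[stages].eq("None").all(axis=1)
missing_stage_count = int(no_stage.sum())
```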
In [1390]:
# Verify if some rows have more than one dog stage assigned; compare the other stages to 'doggo'
duplicates = 0
stages = ['floofer', 'pupper', 'puppo']
for col in stages:
    duplicates += len(twitter_archive.loc[(twitter_archive['doggo'] == 'doggo') & (twitter_archive[col] != 'None')])
print(duplicates)
14
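The loop above only catches combinations involving 'doggo'. Counting non-'None' stages per row flags any multi-stage combination; a sketch on toy data:

```python
import pandas as pd

# Toy rows: the first has two stages assigned, the others one each.
toy = pd.DataFrame({
    "doggo":   ["doggo",  "None",   "doggo"],
    "floofer": ["None",   "None",   "None"],
    "pupper":  ["pupper", "pupper", "None"],
    "puppo":   ["None",   "None",   "None"],
})

stages = ["doggo", "floofer", "pupper", "puppo"]
stage_counts = toy[stages].ne("None").sum(axis=1)  # stages assigned per row
multi_stage = int((stage_counts > 1).sum())        # rows with 2+ stages
```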
In [1391]:
# Display descriptive statistics of the variables with describe
twitter_archive.describe()
Out[1391]:
tweet_id in_reply_to_status_id in_reply_to_user_id retweeted_status_id retweeted_status_user_id rating_numerator rating_denominator
count 2.356000e+03 7.800000e+01 7.800000e+01 1.810000e+02 1.810000e+02 2356.000000 2356.000000
mean 7.427716e+17 7.455079e+17 2.014171e+16 7.720400e+17 1.241698e+16 13.126486 10.455433
std 6.856705e+16 7.582492e+16 1.252797e+17 6.236928e+16 9.599254e+16 45.876648 6.745237
min 6.660209e+17 6.658147e+17 1.185634e+07 6.661041e+17 7.832140e+05 0.000000 0.000000
25% 6.783989e+17 6.757419e+17 3.086374e+08 7.186315e+17 4.196984e+09 10.000000 10.000000
50% 7.196279e+17 7.038708e+17 4.196984e+09 7.804657e+17 4.196984e+09 11.000000 10.000000
75% 7.993373e+17 8.257804e+17 4.196984e+09 8.203146e+17 4.196984e+09 12.000000 10.000000
max 8.924206e+17 8.862664e+17 8.405479e+17 8.874740e+17 7.874618e+17 1776.000000 170.000000
In [1392]:
# Inspect the row with the maximum rating numerator
twitter_archive[twitter_archive.rating_numerator == 1776.000000]
Out[1392]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
979 749981277374128128 NaN NaN 2016-07-04 15:00:45 +0000 <a href="https://about.twitter.com/products/tw... This is Atticus. He's quite simply America af.... NaN NaN NaN https://twitter.com/dog_rates/status/749981277... 1776 10 Atticus None None None None
In [1393]:
# Inspect the row with the maximum rating denominator
twitter_archive[twitter_archive.rating_denominator == 170.000000]
Out[1393]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
1120 731156023742988288 NaN NaN 2016-05-13 16:15:54 +0000 <a href="http://twitter.com/download/iphone" r... Say hello to this unbelievably well behaved sq... NaN NaN NaN https://twitter.com/dog_rates/status/731156023... 204 170 this None None None None
In [1394]:
# Verify low values for the denominator
twitter_archive[twitter_archive.rating_denominator < 10]
Out[1394]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
313 835246439529840640 8.352460e+17 26259576.0 2017-02-24 21:54:03 +0000 <a href="http://twitter.com/download/iphone" r... @jonnysun @Lin_Manuel ok jomny I know you're e... NaN NaN NaN NaN 960 0 None None None None None
516 810984652412424192 NaN NaN 2016-12-19 23:06:23 +0000 <a href="http://twitter.com/download/iphone" r... Meet Sam. She smiles 24/7 &amp; secretly aspir... NaN NaN NaN https://www.gofundme.com/sams-smile,https://tw... 24 7 Sam None None None None
2335 666287406224695296 NaN NaN 2015-11-16 16:11:11 +0000 <a href="http://twitter.com/download/iphone" r... This is an Albanian 3 1/2 legged Episcopalian... NaN NaN NaN https://twitter.com/dog_rates/status/666287406... 1 2 an None None None None
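Rows like 24/7 and 3 1/2 suggest the rating extraction grabbed the first fraction in the tweet text. A stricter pattern that requires the denominator to end in 0 (the account rates out of 10, or multiples of 10 for group photos) skips such false matches; this `extract_rating` helper is an illustrative sketch, not the archive's actual extraction code:

```python
import re

def extract_rating(text):
    """Return (numerator, denominator) for the first <n>/<multiple of 10> in text."""
    match = re.search(r"(\d+(?:\.\d+)?)/(\d+0)\b", text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))

# '24/7' is skipped because 7 does not end in 0; '11/10' is picked up instead.
extract_rating("She smiles 24/7 and deserves 11/10")
```

The numerator group also accepts decimals, so ratings like 13.5/10 are captured whole rather than truncated to 5/10.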
In [1395]:
# Verify low values for the numerator
len(twitter_archive[twitter_archive.rating_numerator < 10])
Out[1395]:
440
In [1396]:
# Search for duplicated tweet IDs with the 'duplicated' method
twitter_archive[twitter_archive.tweet_id.duplicated()]
Out[1396]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
In [1397]:
# Search for duplicated texts with 'duplicated' method
twitter_archive[twitter_archive.text.duplicated()]
Out[1397]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
In [1398]:
# Inspect how many tweets are replies to other tweets
len(twitter_archive.in_reply_to_status_id.value_counts())
Out[1398]:
77
In [1399]:
# Inspect how many tweets are replies to other users
len(twitter_archive.in_reply_to_user_id.value_counts())
Out[1399]:
31
In [1400]:
# Random sample of the dataset
twitter_archive.sample(3)
Out[1400]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
579 800513324630806528 NaN NaN 2016-11-21 01:37:04 +0000 <a href="http://twitter.com/download/iphone" r... This is Chef. Chef loves everyone and wants ev... NaN NaN NaN https://twitter.com/dog_rates/status/800513324... 11 10 Chef None None None None
1749 679047485189439488 NaN NaN 2015-12-21 21:15:11 +0000 <a href="http://twitter.com/download/iphone" r... This dog doesn't know how to stairs. Quite tra... NaN NaN NaN https://twitter.com/dog_rates/status/679047485... 9 10 None None None None None
358 830097400375152640 NaN NaN 2017-02-10 16:53:37 +0000 <a href="http://twitter.com/download/iphone" r... Meet Samson. He's absolute fluffy perfection. ... NaN NaN NaN https://www.gofundme.com/sick-baby-samson,http... 13 10 Samson None None None None
  2. image_predictions table

    First a visual inspection was carried out with the software 'Numbers'.


As with the previous table, since these values are diverse and not governed by many constraints, the visual inspection did not deliver especially useful insights.

In [1401]:
# Display main characteristics of this table with info
image_predictions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
In [1402]:
# Verify the data types of the predictions
type(image_predictions.p1[9])
Out[1402]:
str
In [1403]:
# Look at unique values
image_predictions.nunique()
Out[1403]:
tweet_id    2075
jpg_url     2009
img_num        4
p1           378
p1_conf     2006
p1_dog         2
p2           405
p2_conf     2004
p2_dog         2
p3           408
p3_conf     2006
p3_dog         2
dtype: int64
In [1404]:
# Most frequent breed among first predictions (p1)
image_predictions.p1.mode()
Out[1404]:
0    golden_retriever
dtype: object
In [1405]:
# Most frequent breed among second predictions (p2)
image_predictions.p2.mode()
Out[1405]:
0    Labrador_retriever
dtype: object
In [1406]:
# Most frequent breed among third predictions (p3)
image_predictions.p3.mode()
Out[1406]:
0    Labrador_retriever
dtype: object
In [1407]:
# Display descriptive statistics of the variables with describe
image_predictions.describe()
Out[1407]:
tweet_id img_num p1_conf p2_conf p3_conf
count 2.075000e+03 2075.000000 2075.000000 2.075000e+03 2.075000e+03
mean 7.384514e+17 1.203855 0.594548 1.345886e-01 6.032417e-02
std 6.785203e+16 0.561875 0.271174 1.006657e-01 5.090593e-02
min 6.660209e+17 1.000000 0.044333 1.011300e-08 1.740170e-10
25% 6.764835e+17 1.000000 0.364412 5.388625e-02 1.622240e-02
50% 7.119988e+17 1.000000 0.588230 1.181810e-01 4.944380e-02
75% 7.932034e+17 1.000000 0.843855 1.955655e-01 9.180755e-02
max 8.924206e+17 4.000000 1.000000 4.880140e-01 2.734190e-01
In [1408]:
# Inspect value with confidence equal to 1
image_predictions[image_predictions.p1_conf == 1]
Out[1408]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
106 667866724293877760 https://pbs.twimg.com/media/CUS9PlUWwAANeAD.jpg 1 jigsaw_puzzle 1.0 False prayer_rug 1.011300e-08 False doormat 1.740170e-10 False
In [1409]:
# Count entries where none of the three predictions is a dog
image_predictions_subset = image_predictions[(image_predictions['p1_dog'] == False) & (image_predictions['p2_dog'] == False) & (image_predictions['p3_dog'] == False) ]
len(image_predictions_subset)
Out[1409]:
324
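A common follow-up to this finding is collapsing the three predictions into a single best breed per tweet: take the highest-confidence prediction that is actually a dog, else mark the row as having no breed. A sketch on toy rows (values modeled on the outputs above, not real data):

```python
import pandas as pd

# Toy stand-in for image_predictions' prediction columns.
toy = pd.DataFrame({
    "p1": ["jigsaw_puzzle", "Siamese_cat"], "p1_conf": [1.0, 0.20],   "p1_dog": [False, False],
    "p2": ["prayer_rug", "Chihuahua"],      "p2_conf": [1e-8, 0.10],  "p2_dog": [False, True],
    "p3": ["doormat", "basenji"],           "p3_conf": [2e-10, 0.07], "p3_dog": [False, True],
})

def best_dog_breed(row):
    """First prediction (p1 carries the highest confidence) flagged as a dog."""
    for i in (1, 2, 3):
        if row[f"p{i}_dog"]:
            return row[f"p{i}"]
    return None  # all three attempts missed

toy["breed"] = toy.apply(best_dog_breed, axis=1)
```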
In [1410]:
# Search for duplicated tweets with 'duplicated' method
image_predictions[image_predictions.tweet_id.duplicated()]
Out[1410]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
In [1411]:
# Show a random sample of the dataset
image_predictions.sample(4)
Out[1411]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
947 704761120771465216 https://pbs.twimg.com/media/CcfQgHVWoAAxauy.jpg 1 Siamese_cat 0.202294 False Chihuahua 0.100418 True basenji 0.072097 True
1075 717047459982213120 https://pbs.twimg.com/media/CfN23ArXEAEkZkz.jpg 1 golden_retriever 0.983548 True Labrador_retriever 0.012185 True cocker_spaniel 0.002412 True
1780 828650029636317184 https://pbs.twimg.com/media/C3_0yhCWEAETXj2.jpg 1 golden_retriever 0.649209 True Chesapeake_Bay_retriever 0.198560 True vizsla 0.056200 True
342 672256522047614977 https://pbs.twimg.com/media/CVRVvRMWEAIBKOP.jpg 1 ostrich 0.999004 False Arabian_camel 0.000512 False llama 0.000147 False
  3. detailed_tweets table

    The visual inspection was carried out using Pandas.

In [1412]:
detailed_tweets
Out[1412]:
created_at id id_str full_text truncated display_text_range entities extended_entities source in_reply_to_status_id ... favorited retweeted possibly_sensitive possibly_sensitive_appealable lang retweeted_status quoted_status_id quoted_status_id_str quoted_status_permalink quoted_status
0 Tue Aug 01 16:23:56 +0000 2017 892420643555336193 892420643555336193 This is Phineas. He's a mystical boy. Only eve... False [0, 85] {'hashtags': [], 'symbols': [], 'user_mentions... {'media': [{'id': 892420639486877696, 'id_str'... <a href="http://twitter.com/download/iphone" r... NaN ... False False False False en NaN NaN NaN NaN NaN
1 Tue Aug 01 00:17:27 +0000 2017 892177421306343426 892177421306343426 This is Tilly. She's just checking pup on you.... False [0, 138] {'hashtags': [], 'symbols': [], 'user_mentions... {'media': [{'id': 892177413194625024, 'id_str'... <a href="http://twitter.com/download/iphone" r... NaN ... False False False False en NaN NaN NaN NaN NaN
2 Mon Jul 31 00:18:03 +0000 2017 891815181378084864 891815181378084864 This is Archie. He is a rare Norwegian Pouncin... False [0, 121] {'hashtags': [], 'symbols': [], 'user_mentions... {'media': [{'id': 891815175371796480, 'id_str'... <a href="http://twitter.com/download/iphone" r... NaN ... False False False False en NaN NaN NaN NaN NaN
3 Sun Jul 30 15:58:51 +0000 2017 891689557279858688 891689557279858688 This is Darla. She commenced a snooze mid meal... False [0, 79] {'hashtags': [], 'symbols': [], 'user_mentions... {'media': [{'id': 891689552724799489, 'id_str'... <a href="http://twitter.com/download/iphone" r... NaN ... False False False False en NaN NaN NaN NaN NaN
4 Sat Jul 29 16:00:24 +0000 2017 891327558926688256 891327558926688256 This is Franklin. He would like you to stop ca... False [0, 138] {'hashtags': [{'text': 'BarkWeek', 'indices': ... {'media': [{'id': 891327551943041024, 'id_str'... <a href="http://twitter.com/download/iphone" r... NaN ... False False False False en NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2326 Mon Nov 16 00:24:50 +0000 2015 666049248165822465 666049248165822465 Here we have a 1949 1st generation vulpix. Enj... False [0, 120] {'hashtags': [], 'symbols': [], 'user_mentions... {'media': [{'id': 666049244999131136, 'id_str'... <a href="http://twitter.com/download/iphone" r... NaN ... False False False False en NaN NaN NaN NaN NaN
2327 Mon Nov 16 00:04:52 +0000 2015 666044226329800704 666044226329800704 This is a purebred Piers Morgan. Loves to Netf... False [0, 137] {'hashtags': [], 'symbols': [], 'user_mentions... {'media': [{'id': 666044217047650304, 'id_str'... <a href="http://twitter.com/download/iphone" r... NaN ... False False False False en NaN NaN NaN NaN NaN
2328 Sun Nov 15 23:21:54 +0000 2015 666033412701032449 666033412701032449 Here is a very happy pup. Big fan of well-main... False [0, 130] {'hashtags': [], 'symbols': [], 'user_mentions... {'media': [{'id': 666033409081393153, 'id_str'... <a href="http://twitter.com/download/iphone" r... NaN ... False False False False en NaN NaN NaN NaN NaN
2329 Sun Nov 15 23:05:30 +0000 2015 666029285002620928 666029285002620928 This is a western brown Mitsubishi terrier. Up... False [0, 139] {'hashtags': [], 'symbols': [], 'user_mentions... {'media': [{'id': 666029276303482880, 'id_str'... <a href="http://twitter.com/download/iphone" r... NaN ... False False False False en NaN NaN NaN NaN NaN
2330 Sun Nov 15 22:32:08 +0000 2015 666020888022790149 666020888022790149 Here we have a Japanese Irish Setter. Lost eye... False [0, 131] {'hashtags': [], 'symbols': [], 'user_mentions... {'media': [{'id': 666020881337073664, 'id_str'... <a href="http://twitter.com/download/iphone" r... NaN ... False False False False en NaN NaN NaN NaN NaN

2331 rows × 32 columns

This table has a lot of information but many of the entries are empty (missing values).

In [1413]:
# Display main characteristics of this table with info
detailed_tweets.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 32 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   created_at                     2331 non-null   object 
 1   id                             2331 non-null   int64  
 2   id_str                         2331 non-null   object 
 3   full_text                      2331 non-null   object 
 4   truncated                      2331 non-null   bool   
 5   display_text_range             2331 non-null   object 
 6   entities                       2331 non-null   object 
 7   extended_entities              2059 non-null   object 
 8   source                         2331 non-null   object 
 9   in_reply_to_status_id          77 non-null     float64
 10  in_reply_to_status_id_str      77 non-null     object 
 11  in_reply_to_user_id            77 non-null     float64
 12  in_reply_to_user_id_str        77 non-null     object 
 13  in_reply_to_screen_name        77 non-null     object 
 14  user                           2331 non-null   object 
 15  geo                            0 non-null      object 
 16  coordinates                    0 non-null      object 
 17  place                          1 non-null      object 
 18  contributors                   0 non-null      object 
 19  is_quote_status                2331 non-null   bool   
 20  retweet_count                  2331 non-null   int64  
 21  favorite_count                 2331 non-null   int64  
 22  favorited                      2331 non-null   bool   
 23  retweeted                      2331 non-null   bool   
 24  possibly_sensitive             2196 non-null   object 
 25  possibly_sensitive_appealable  2196 non-null   object 
 26  lang                           2331 non-null   object 
 27  retweeted_status               163 non-null    object 
 28  quoted_status_id               26 non-null     float64
 29  quoted_status_id_str           26 non-null     object 
 30  quoted_status_permalink        26 non-null     object 
 31  quoted_status                  24 non-null     object 
dtypes: bool(4), float64(3), int64(3), object(22)
memory usage: 519.1+ KB
In [1414]:
# Verify the data type of the column 'created_at'
type(detailed_tweets.created_at[9])
Out[1414]:
str
In [1415]:
# Verify the data type of the column 'possibly_sensitive'
type(detailed_tweets.possibly_sensitive[9])
Out[1415]:
bool
In [1416]:
# Display descriptive statistics 
detailed_tweets.describe()
Out[1416]:
id in_reply_to_status_id in_reply_to_user_id retweet_count favorite_count quoted_status_id
count 2.331000e+03 7.700000e+01 7.700000e+01 2331.000000 2331.000000 2.600000e+01
mean 7.419079e+17 7.440692e+17 2.040329e+16 2656.050622 7464.755899 8.113972e+17
std 6.823170e+16 7.524295e+16 1.260797e+17 4492.226874 11590.528926 6.295843e+16
min 6.660209e+17 6.658147e+17 1.185634e+07 1.000000 0.000000 6.721083e+17
25% 6.782670e+17 6.757073e+17 3.589728e+08 539.000000 1299.000000 7.761338e+17
50% 7.182469e+17 7.032559e+17 4.196984e+09 1243.000000 3244.000000 8.281173e+17
75% 7.986692e+17 8.233264e+17 4.196984e+09 3083.500000 9142.500000 8.637581e+17
max 8.924206e+17 8.862664e+17 8.405479e+17 76336.000000 154049.000000 8.860534e+17
In [1417]:
# Search for duplicated tweets with 'duplicated' 
detailed_tweets[detailed_tweets.id.duplicated()]
Out[1417]:
created_at id id_str full_text truncated display_text_range entities extended_entities source in_reply_to_status_id ... favorited retweeted possibly_sensitive possibly_sensitive_appealable lang retweeted_status quoted_status_id quoted_status_id_str quoted_status_permalink quoted_status

0 rows × 32 columns

In [1418]:
# Show a random sample of the dataset
detailed_tweets.sample(4)
Out[1418]:
created_at id id_str full_text truncated display_text_range entities extended_entities source in_reply_to_status_id ... favorited retweeted possibly_sensitive possibly_sensitive_appealable lang retweeted_status quoted_status_id quoted_status_id_str quoted_status_permalink quoted_status
1219 Mon Mar 21 17:30:03 +0000 2016 711968124745228288 711968124745228288 Meet Winston. He's trapped in a cup of coffee.... False [0, 106] {'hashtags': [], 'symbols': [], 'user_mentions... {'media': [{'id': 711968119305211904, 'id_str'... <a href="http://twitter.com/download/iphone" r... NaN ... False False False False en NaN NaN NaN NaN NaN
1816 Sun Dec 13 03:21:34 +0000 2015 675878199931371520 675878199931371520 Ok, I'll admit this is a pretty adorable bunny... False [0, 134] {'hashtags': [], 'symbols': [], 'user_mentions... {'media': [{'id': 675878195854483457, 'id_str'... <a href="http://twitter.com/download/iphone" r... NaN ... False False False False en NaN NaN NaN NaN NaN
2244 Fri Nov 20 01:06:48 +0000 2015 667509364010450944 667509364010450944 This a Norwegian Pewterschmidt named Tickles. ... False [0, 116] {'hashtags': [], 'symbols': [], 'user_mentions... {'media': [{'id': 667509363477577728, 'id_str'... <a href="http://twitter.com" rel="nofollow">Tw... NaN ... False False False False en NaN NaN NaN NaN NaN
2156 Tue Nov 24 03:29:51 +0000 2015 668994913074286592 668994913074286592 Two gorgeous pups here. Both have cute fake ho... False [0, 140] {'hashtags': [], 'symbols': [], 'user_mentions... {'media': [{'id': 668994907227430913, 'id_str'... <a href="http://twitter.com/download/iphone" r... NaN ... False False False False en NaN NaN NaN NaN NaN

4 rows × 32 columns

The last assessment checks for duplicated columns across the three tables.

In [1419]:
# Join all columns in a list and verify duplicated ones
all_columns = pd.Series(list(twitter_archive) + list(image_predictions) + list(detailed_tweets))
all_columns[all_columns.duplicated()]
Out[1419]:
17                 tweet_id
37                   source
38    in_reply_to_status_id
40      in_reply_to_user_id
dtype: object

Observations:

The main observations regarding data quality and tidiness are summarized below. Quality issues are divided by table and tidiness issues are presented together for all tables.

Quality
twitter_archive table
  • Data type of timestamp is str.
  • Missing values from 6 out of 17 columns.
  • The maximum rating numerators and denominators could be outliers; moreover, some denominator values are lower than 10.
  • The dog "stage" columns are mostly empty.
  • Some dog "stage" entries are duplicated.
  • Some columns are not required for the analysis.
image_predictions table
  • Column names could be more descriptive.
detailed_tweets table
  • Column names could be more descriptive.
  • Missing values from 17 out of 32 columns.
  • Some columns are totally empty, and imputing is not feasible.
  • Data type of the timestamp (created_at) is str.
  • Some columns are not required for the analysis.
Tidiness
  • The dog stage in the twitter_archive table should be one column instead of four.
  • The twitter_archive and detailed_tweets tables share duplicated information; dropping these duplicates would make them more coherent observational units. For this analysis, favorite_count and retweet_count will be taken as response variables and analyzed against tweet information (from the detailed_tweets table) and the dog stage data from the twitter_archive table. With this in mind:
    • duplicated columns will be dropped from the twitter_archive table;
    • the 'response' columns from detailed_tweets will be merged into the twitter_archive table.

Cleaning

The cleaning steps carried out to fix the issues discovered in the previous section are described here.

In [1420]:
# Copy the datasets before modifying them separately in case cleaning operations are not correct
twitter_archive_clean = twitter_archive.copy()
In [1421]:
image_predictions_clean = image_predictions.copy()
In [1422]:
detailed_tweets_clean = detailed_tweets.copy()

1. Missing Data

1.1 twitter_archive: Missing values (unnecessary columns)
Define

Drop the columns with missing values that are not necessary for the analysis using drop() method.
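As a side note, dropping by column name (rather than by integer position, as in the cell below) is more robust to column reordering; a minimal sketch on a toy frame with a few of the archive's column names:

```python
import pandas as pd

# Toy frame with a few of the archive's real column names
df = pd.DataFrame(columns=[
    'tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id',
    'timestamp', 'retweeted_status_id'])

# Dropping by name is less brittle than positional indices;
# errors='ignore' skips columns that are already gone on re-runs
to_drop = ['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id']
df = df.drop(columns=to_drop, errors='ignore')
print(list(df.columns))
```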

Code
In [1423]:
# Drop columns that will not be included in the analysis using the drop() method
twitter_archive_clean.drop(twitter_archive_clean.columns[[1,2,6,7,8,9]], axis = 1, inplace = True) 
Test
In [1424]:
# Verify that columns have been dropped with info()
twitter_archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2356 non-null   int64 
 1   timestamp           2356 non-null   object
 2   source              2356 non-null   object
 3   text                2356 non-null   object
 4   rating_numerator    2356 non-null   int64 
 5   rating_denominator  2356 non-null   int64 
 6   name                2356 non-null   object
 7   doggo               2356 non-null   object
 8   floofer             2356 non-null   object
 9   pupper              2356 non-null   object
 10  puppo               2356 non-null   object
dtypes: int64(3), object(8)
memory usage: 202.6+ KB
1.2 twitter_archive: Missing values (imputable ones)
Define

Re-extract the dog stages by matching their definitions and popular synonyms against each tweet's text. This will be done in the tidiness section.

1.3 detailed_tweets: Missing values
Define

Drop the columns that are not necessary for the analysis, as well as those with few or no values where imputation is not feasible, using the drop() method.

Code
In [1425]:
# Verify the index of the columns
detailed_tweets_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 32 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   created_at                     2331 non-null   object 
 1   id                             2331 non-null   int64  
 2   id_str                         2331 non-null   object 
 3   full_text                      2331 non-null   object 
 4   truncated                      2331 non-null   bool   
 5   display_text_range             2331 non-null   object 
 6   entities                       2331 non-null   object 
 7   extended_entities              2059 non-null   object 
 8   source                         2331 non-null   object 
 9   in_reply_to_status_id          77 non-null     float64
 10  in_reply_to_status_id_str      77 non-null     object 
 11  in_reply_to_user_id            77 non-null     float64
 12  in_reply_to_user_id_str        77 non-null     object 
 13  in_reply_to_screen_name        77 non-null     object 
 14  user                           2331 non-null   object 
 15  geo                            0 non-null      object 
 16  coordinates                    0 non-null      object 
 17  place                          1 non-null      object 
 18  contributors                   0 non-null      object 
 19  is_quote_status                2331 non-null   bool   
 20  retweet_count                  2331 non-null   int64  
 21  favorite_count                 2331 non-null   int64  
 22  favorited                      2331 non-null   bool   
 23  retweeted                      2331 non-null   bool   
 24  possibly_sensitive             2196 non-null   object 
 25  possibly_sensitive_appealable  2196 non-null   object 
 26  lang                           2331 non-null   object 
 27  retweeted_status               163 non-null    object 
 28  quoted_status_id               26 non-null     float64
 29  quoted_status_id_str           26 non-null     object 
 30  quoted_status_permalink        26 non-null     object 
 31  quoted_status                  24 non-null     object 
dtypes: bool(4), float64(3), int64(3), object(22)
memory usage: 519.1+ KB
In [1426]:
# Drop columns previously targeted using drop() method
detailed_tweets_clean.drop(detailed_tweets_clean.columns[[2,7,10,11,12,13,15,16,17,18,22,23,25,28,29,30,31]], axis = 1, inplace = True) 
Test
In [1427]:
# Verify that columns have been dropped with info()
detailed_tweets_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   created_at             2331 non-null   object 
 1   id                     2331 non-null   int64  
 2   full_text              2331 non-null   object 
 3   truncated              2331 non-null   bool   
 4   display_text_range     2331 non-null   object 
 5   entities               2331 non-null   object 
 6   source                 2331 non-null   object 
 7   in_reply_to_status_id  77 non-null     float64
 8   user                   2331 non-null   object 
 9   is_quote_status        2331 non-null   bool   
 10  retweet_count          2331 non-null   int64  
 11  favorite_count         2331 non-null   int64  
 12  possibly_sensitive     2196 non-null   object 
 13  lang                   2331 non-null   object 
 14  retweeted_status       163 non-null    object 
dtypes: bool(2), float64(1), int64(3), object(9)
memory usage: 241.4+ KB

2. Tidiness

2.1 The dog stage in the twitter_archive table should be one column instead of four.
Define

The process to improve the tidiness of the data frame, drop duplicates, and fix the missing/redundant values (by imputing new ones) is as follows:

  1. Melt the columns describing dogs’ stages;
  2. Drop redundant values produced by the melt;
  3. Extract the rows that already had 'stage' values and drop the 14 entries with two stage values;
  4. Apply a custom function to extract new stages from the text column;
  5. Concatenate both data frames: the one with stages obtained from the function and the one that already had values.
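Steps 1–2 can be sketched on a toy frame (hypothetical rows) before applying them to the real archive:

```python
import pandas as pd

# Hypothetical rows: tweet 1 has no stage; tweet 2 has two stage values
df = pd.DataFrame({
    'tweet_id': [1, 2],
    'doggo': ['None', 'doggo'],
    'pupper': ['None', 'pupper'],
})

# Step 1: melt the stage columns into long form
melted = pd.melt(df, id_vars=['tweet_id'],
                 value_vars=['doggo', 'pupper'],
                 var_name='stage', value_name='value')

# Step 2: drop redundant rows produced by the melt
# (tweet 1's second 'None' row is removed; tweet 2 keeps both stages)
melted = melted.drop_duplicates(subset=['tweet_id', 'value'])
print(melted)
```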
Code
In [1428]:
# Melt redundant columns (doggo, pupper...)
twitter_archive_clean = pd.melt(twitter_archive_clean, id_vars=['tweet_id','timestamp', 'source', 'text', 'rating_numerator', 'rating_denominator', 'name'], value_vars=['doggo', 'floofer', 'pupper', 'puppo'], var_name='stage') 
In [1429]:
# Drop the duplicates values except for those rows with stage information and their duplicates
twitter_archive_clean = twitter_archive_clean.drop_duplicates(subset = ['tweet_id','value'])
In [1430]:
# Store away the rows that already have stage info, then drop the 14 identified duplicates
stages_temp = twitter_archive_clean[twitter_archive_clean['stage'] == twitter_archive_clean['value']]
# Duplicates are dropped arbitrarily (keep='last') because either value could be correct
stages_temp = stages_temp.drop_duplicates(subset=['tweet_id'], keep='last')
In [1431]:
# Drop all rows with tweet_id duplicates 
twitter_archive_clean = twitter_archive_clean.drop_duplicates(subset=['tweet_id'], keep=False)
In [1432]:
# Create a function that matches common adjectives against the dog stage definitions
def DogClassificator(text):
    # List of popular, descriptive dog adjectives to search for
    adjectives = ['pupper', 'puppy', 'pup', 'little', 'young', 'old boy', 'floof', 'boy', 'mature', 'dogo']
    # Specific lists for the different stages (could also be slices of adjectives)
    puppo = ['pupper', 'puppy', 'pup', 'little']
    pupper = ['young', 'old boy']
    floofer = ['floof', 'fur']  # note: 'fur' is never reached, as it is not in adjectives
    doggo = ['old boy', 'boy', 'mature', 'dogo']
    # Loop over the tweet's text in search of matches with the adjectives above
    value = 'None'
    for adjective in adjectives:
        if adjective in text and adjective in puppo:
            value = 'puppo'
            break
        elif adjective in text and adjective in pupper:
            value = 'pupper'
            break
        elif adjective in text and adjective in floofer:
            value = 'floofer'
            break
        elif adjective in text and adjective in doggo:
            value = 'doggo'
            break
    return value

# Apply the function to fill the 'value' column with the extracted stages
twitter_archive_clean["value"] = twitter_archive_clean["text"].apply(DogClassificator)
In [1433]:
# Concatenate classified and stored dataframes to have back the 2356 total
twitter_archive_clean = pd.concat([twitter_archive_clean, stages_temp])
In [1434]:
# Drop stage column
twitter_archive_clean = twitter_archive_clean.drop(['stage'], axis=1)
In [1435]:
# Rename value column
twitter_archive_clean.rename(columns={'value':'dog_stage'}, inplace=True)
Test
In [1436]:
# Inspect the dataset 
twitter_archive_clean.head()
Out[1436]:
tweet_id timestamp source text rating_numerator rating_denominator name dog_stage
0 892420643555336193 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... 13 10 Phineas doggo
1 892177421306343426 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... 13 10 Tilly puppo
2 891815181378084864 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Archie. He is a rare Norwegian Pouncin... 12 10 Archie None
3 891689557279858688 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Darla. She commenced a snooze mid meal... 13 10 Darla None
4 891327558926688256 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" r... This is Franklin. He would like you to stop ca... 12 10 Franklin None
In [1437]:
# Inspect the dataset 
twitter_archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2356 entries, 0 to 8151
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2356 non-null   int64 
 1   timestamp           2356 non-null   object
 2   source              2356 non-null   object
 3   text                2356 non-null   object
 4   rating_numerator    2356 non-null   int64 
 5   rating_denominator  2356 non-null   int64 
 6   name                2356 non-null   object
 7   dog_stage           2356 non-null   object
dtypes: int64(3), object(5)
memory usage: 165.7+ KB
2.2 Consistency between the twitter_archive and detailed_tweets tables as separate observational units
2.2.1 The twitter_archive and detailed_tweets tables have duplicated columns
Define

Drop the duplicated/redundant columns shared between the twitter_archive and detailed_tweets tables.

Code
In [1438]:
# Drop redundant columns
twitter_archive_clean.drop(twitter_archive_clean.columns[[1,2,3]], axis = 1, inplace = True) 
Test
In [1439]:
# Verify that columns have been dropped with info()
twitter_archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2356 entries, 0 to 8151
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2356 non-null   int64 
 1   rating_numerator    2356 non-null   int64 
 2   rating_denominator  2356 non-null   int64 
 3   name                2356 non-null   object
 4   dog_stage           2356 non-null   object
dtypes: int64(3), object(2)
memory usage: 110.4+ KB
In [1440]:
# Verify columns and shape with info()
detailed_tweets_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   created_at             2331 non-null   object 
 1   id                     2331 non-null   int64  
 2   full_text              2331 non-null   object 
 3   truncated              2331 non-null   bool   
 4   display_text_range     2331 non-null   object 
 5   entities               2331 non-null   object 
 6   source                 2331 non-null   object 
 7   in_reply_to_status_id  77 non-null     float64
 8   user                   2331 non-null   object 
 9   is_quote_status        2331 non-null   bool   
 10  retweet_count          2331 non-null   int64  
 11  favorite_count         2331 non-null   int64  
 12  possibly_sensitive     2196 non-null   object 
 13  lang                   2331 non-null   object 
 14  retweeted_status       163 non-null    object 
dtypes: bool(2), float64(1), int64(3), object(9)
memory usage: 241.4+ KB
2.2.2 The 'response' columns favorite_count and retweet_count will be merged from detailed_tweets into twitter_archive
Define

Merge the favorite_count and retweet_count columns to the twitter_archive table, joining on tweet_id.
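The merge itself can be sketched with hypothetical rows; the validate argument (not used in the cell below) is an optional safeguard against accidental duplicate keys:

```python
import pandas as pd

# Hypothetical rows standing in for the two tables
archive = pd.DataFrame({'tweet_id': [1, 2, 3],
                        'dog_stage': ['doggo', 'pupper', 'None']})
details = pd.DataFrame({'tweet_id': [1, 2],
                        'retweet_count': [100, 50],
                        'favorite_count': [400, 90]})

# Inner join keeps only tweets present in both tables;
# validate raises if a key appears more than once on either side
merged = pd.merge(archive, details, on='tweet_id', how='inner',
                  validate='one_to_one')
print(merged.shape)
```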

Code
In [1441]:
# Rename id column to match that of twitter_archive 
detailed_tweets_clean.rename(columns={'id':'tweet_id'}, inplace=True)
In [1442]:
# Compare identifiers of both columns
twitter_archive_clean['tweet_id'].isin(detailed_tweets_clean['tweet_id']).value_counts()
Out[1442]:
True     2331
False      25
Name: tweet_id, dtype: int64
In [1443]:
# Merge the response variables into a new table (for the exploratory analysis)
twitter_archive_clean2 = pd.merge(twitter_archive_clean, detailed_tweets_clean, on=['tweet_id'], how='inner')
twitter_archive_clean2.head()
Out[1443]:
tweet_id rating_numerator rating_denominator name dog_stage created_at full_text truncated display_text_range entities source in_reply_to_status_id user is_quote_status retweet_count favorite_count possibly_sensitive lang retweeted_status
0 892420643555336193 13 10 Phineas doggo Tue Aug 01 16:23:56 +0000 2017 This is Phineas. He's a mystical boy. Only eve... False [0, 85] {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... NaN {'id': 4196983835, 'id_str': '4196983835', 'na... False 7554 35731 False en NaN
1 892177421306343426 13 10 Tilly puppo Tue Aug 01 00:17:27 +0000 2017 This is Tilly. She's just checking pup on you.... False [0, 138] {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... NaN {'id': 4196983835, 'id_str': '4196983835', 'na... False 5599 30872 False en NaN
2 891815181378084864 12 10 Archie None Mon Jul 31 00:18:03 +0000 2017 This is Archie. He is a rare Norwegian Pouncin... False [0, 121] {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... NaN {'id': 4196983835, 'id_str': '4196983835', 'na... False 3709 23224 False en NaN
3 891689557279858688 13 10 Darla None Sun Jul 30 15:58:51 +0000 2017 This is Darla. She commenced a snooze mid meal... False [0, 79] {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... NaN {'id': 4196983835, 'id_str': '4196983835', 'na... False 7739 39020 False en NaN
4 891327558926688256 12 10 Franklin None Sat Jul 29 16:00:24 +0000 2017 This is Franklin. He would like you to stop ca... False [0, 138] {'hashtags': [{'text': 'BarkWeek', 'indices': ... <a href="http://twitter.com/download/iphone" r... NaN {'id': 4196983835, 'id_str': '4196983835', 'na... False 8335 37280 False en NaN
Test
In [1444]:
#Verify columns and shape with info()
twitter_archive_clean2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2331 entries, 0 to 2330
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   tweet_id               2331 non-null   int64  
 1   rating_numerator       2331 non-null   int64  
 2   rating_denominator     2331 non-null   int64  
 3   name                   2331 non-null   object 
 4   dog_stage              2331 non-null   object 
 5   created_at             2331 non-null   object 
 6   full_text              2331 non-null   object 
 7   truncated              2331 non-null   bool   
 8   display_text_range     2331 non-null   object 
 9   entities               2331 non-null   object 
 10  source                 2331 non-null   object 
 11  in_reply_to_status_id  77 non-null     float64
 12  user                   2331 non-null   object 
 13  is_quote_status        2331 non-null   bool   
 14  retweet_count          2331 non-null   int64  
 15  favorite_count         2331 non-null   int64  
 16  possibly_sensitive     2196 non-null   object 
 17  lang                   2331 non-null   object 
 18  retweeted_status       163 non-null    object 
dtypes: bool(2), float64(1), int64(5), object(11)
memory usage: 332.3+ KB

3. Quality

Some of the quality issues have been tackled during the tidiness and missing data operations.

twitter_archive table:

3.1 Invalid data among numerator and denominator ratings
Define

Inspect and handle numerators and denominators which could be outliers.
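A boolean mask is one way to express the cut-off; a minimal sketch with hypothetical ratings (the /10 scale is WeRateDogs' convention):

```python
import pandas as pd

# Hypothetical ratings; the middle row violates the /10 convention
df = pd.DataFrame({'rating_numerator': [13, 5, 1776],
                   'rating_denominator': [10, 7, 10]})

# Keep rows whose denominator matches the standard /10 scale (or above)
# and whose numerator is at least 10, mirroring the notebook's cut-off
mask = (df['rating_denominator'] >= 10) & (df['rating_numerator'] >= 10)
df = df[mask]
print(df)
```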

Code
In [1445]:
# Inspect numerator values for outliers and possible errors
twitter_archive_clean2.rating_numerator.value_counts()

#Inspect denominator values
twitter_archive_clean2.rating_denominator.value_counts()
Out[1445]:
10     2309
50        3
80        2
11        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64
In [1446]:
# Observations with denominator or numerator values lower than 10 will be dropped
twitter_archive_clean2.drop(twitter_archive_clean2.loc[twitter_archive_clean2['rating_denominator'] < 10].index, inplace=True)
twitter_archive_clean2.drop(twitter_archive_clean2.loc[twitter_archive_clean2['rating_numerator'] < 10].index, inplace=True)
Test
In [1447]:
#Inspect with value_counts()
twitter_archive_clean2.rating_denominator.value_counts()
twitter_archive_clean2.rating_numerator.value_counts()
Out[1447]:
12      552
11      464
10      458
13      339
14       51
420       2
15        2
75        2
50        1
17        1
20        1
26        1
27        1
44        1
45        1
1776      1
666       1
80        1
84        1
88        1
99        1
121       1
143       1
144       1
165       1
182       1
204       1
60        1
Name: rating_numerator, dtype: int64
image_predictions table
3.2 Column names are not descriptive enough
Define

Change column names to descriptive ones

Code
In [1448]:
#Change names of the columns
image_predictions_clean.rename(columns={'p1':'prediction1', 'p2':'prediction2', 'p3':'prediction3' }, inplace=True)
In [1449]:
image_predictions_clean.rename(columns={'jpg_url':'image_url' }, inplace=True)
In [1450]:
image_predictions_clean.rename(columns={'p1_conf':'prediction1_confidence', 'p2_conf':'prediction2_confidence', 'p3_conf':'prediction3_confidence' }, inplace=True)
In [1451]:
image_predictions_clean.rename(columns={'p1_dog':'p1_breed_dog', 'p2_dog':'p2_breed_dog', 'p3_dog':'p3_breed_dog' }, inplace=True)
Test
In [1452]:
#Inspect with info()
image_predictions_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   tweet_id                2075 non-null   int64  
 1   image_url               2075 non-null   object 
 2   img_num                 2075 non-null   int64  
 3   prediction1             2075 non-null   object 
 4   prediction1_confidence  2075 non-null   float64
 5   p1_breed_dog            2075 non-null   bool   
 6   prediction2             2075 non-null   object 
 7   prediction2_confidence  2075 non-null   float64
 8   p2_breed_dog            2075 non-null   bool   
 9   prediction3             2075 non-null   object 
 10  prediction3_confidence  2075 non-null   float64
 11  p3_breed_dog            2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB

detailed_tweet table

3.3 Data type of timestamp is str.
Define

Change the datatype from str to datetime.
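A minimal sketch of the conversion, using Twitter's created_at layout on a toy Series; passing an explicit format string is optional but makes the parse unambiguous:

```python
import pandas as pd

# Twitter's created_at strings use this layout
s = pd.Series(['Sun Nov 15 23:05:30 +0000 2015'])

# Parse with an explicit format; the %z offset yields a tz-aware dtype
parsed = pd.to_datetime(s, format='%a %b %d %H:%M:%S %z %Y')
print(parsed.dt.year[0])
```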

Code
In [1453]:
#Change the datatype with to_datetime()
detailed_tweets_clean['created_at'] = pd.to_datetime(detailed_tweets_clean['created_at'])
Test
In [1454]:
# Inspect the data type of the created_at column
type(detailed_tweets_clean.created_at[9])
Out[1454]:
pandas._libs.tslibs.timestamps.Timestamp
3.4 Some columns are not required for the analysis
Define

Drop columns that are not required.

Code
In [1455]:
#Drop columns that are not required
detailed_tweets_clean.drop(detailed_tweets_clean.columns[[2,7]], axis = 1, inplace = True) 
Test
In [1456]:
#Inspect with info()
detailed_tweets_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   created_at          2331 non-null   datetime64[ns, UTC]
 1   tweet_id            2331 non-null   int64              
 2   truncated           2331 non-null   bool               
 3   display_text_range  2331 non-null   object             
 4   entities            2331 non-null   object             
 5   source              2331 non-null   object             
 6   user                2331 non-null   object             
 7   is_quote_status     2331 non-null   bool               
 8   retweet_count       2331 non-null   int64              
 9   favorite_count      2331 non-null   int64              
 10  possibly_sensitive  2196 non-null   object             
 11  lang                2331 non-null   object             
 12  retweeted_status    163 non-null    object             
dtypes: bool(2), datetime64[ns, UTC](1), int64(3), object(7)
memory usage: 205.0+ KB
3.5 Some column names are not descriptive enough
Define

Change column names to descriptive ones.

Code
In [1457]:
#Change names of the columns
detailed_tweets_clean.rename(columns={'display_text_range':'tweet_length', 'lang':'language'  }, inplace=True)
Test
In [1458]:
#Inspect with info()
detailed_tweets_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   created_at          2331 non-null   datetime64[ns, UTC]
 1   tweet_id            2331 non-null   int64              
 2   truncated           2331 non-null   bool               
 3   tweet_length        2331 non-null   object             
 4   entities            2331 non-null   object             
 5   source              2331 non-null   object             
 6   user                2331 non-null   object             
 7   is_quote_status     2331 non-null   bool               
 8   retweet_count       2331 non-null   int64              
 9   favorite_count      2331 non-null   int64              
 10  possibly_sensitive  2196 non-null   object             
 11  language            2331 non-null   object             
 12  retweeted_status    163 non-null    object             
dtypes: bool(2), datetime64[ns, UTC](1), int64(3), object(7)
memory usage: 205.0+ KB

Storing

The cleaned datasets will be saved to .csv files.

In [1459]:
#Save files with to_csv() function
twitter_archive_clean.to_csv('twitter_archive_master.csv', index=False)
twitter_archive_clean2.to_csv('twitter_archive_master2.csv', index=False)
image_predictions_clean.to_csv('image_predictions_clean.csv', index=False)
detailed_tweets_clean.to_csv('detailed_tweets_clean.csv', index=False)

Exploratory Data Analysis

Now that the data has been cleaned and tidied, descriptive statistics and visualizations can be used to better understand the variables and how they interact. Some questions that arise are:

Which dog stage is favorited or retweeted the most?
In [1460]:
# Group by dog_stage
twitter_archive_clean2.groupby(['dog_stage']).mean()
Out[1460]:
tweet_id rating_numerator rating_denominator truncated in_reply_to_status_id is_quote_status retweet_count favorite_count
dog_stage
None 7.506073e+17 14.111969 10.386873 0.0 7.422830e+17 0.017761 2807.005405 7885.288031
doggo 8.037876e+17 17.892857 10.000000 0.0 8.188688e+17 0.035714 6953.044643 17166.258929
floofer 7.541778e+17 14.321429 12.142857 0.0 NaN 0.000000 2642.428571 8856.464286
pupper 7.338976e+17 11.466981 10.000000 0.0 7.136706e+17 0.009434 2737.938679 6868.891509
puppo 7.663997e+17 13.905350 12.098765 0.0 7.614270e+17 0.008230 2908.308642 9999.786008

This table shows the statistics of favorited and retweeted posts by dog stage. The dogs classified as doggo have the highest favorite_count mean, followed by puppo.

In [1461]:
# Plot the mean favorite_count per dog stage
twitter_archive_clean2.groupby(['dog_stage']).favorite_count.mean().plot(kind='bar');
plt.title('Avg. favorite count of tweets by dog "stage"');
plt.ylabel('Favorite count', fontsize = 15);
plt.xlabel("Dog's stage", fontsize = 15);
In [1462]:
# Plot the mean retweet_count per dog stage
twitter_archive_clean2.groupby(['dog_stage']).retweet_count.mean().plot(kind='bar');
plt.title('Avg. retweet count of tweets by dog "stage"');
plt.ylabel('Retweet count', fontsize=15);
plt.xlabel("Dog's stage", fontsize=15);
Which breeds are most common?
In [1463]:
# Counts of dog breeds for algorithm 1
image_predictions_clean.prediction1.value_counts()
Out[1463]:
golden_retriever      150
Labrador_retriever    100
Pembroke               89
Chihuahua              83
pug                    57
                     ... 
hotdog                  1
water_bottle            1
mud_turtle              1
otter                   1
sliding_door            1
Name: prediction1, Length: 378, dtype: int64
In [1464]:
# Counts of dog breeds for algorithm 2
image_predictions_clean.prediction2.value_counts()
Out[1464]:
Labrador_retriever     104
golden_retriever        92
Cardigan                73
Chihuahua               44
Pomeranian              42
                      ... 
washbasin                1
medicine_chest           1
cockroach                1
African_hunting_dog      1
shopping_cart            1
Name: prediction2, Length: 405, dtype: int64
In [1465]:
# Counts of dog breeds for algorithm 3
image_predictions_clean.prediction3.value_counts()
Out[1465]:
Labrador_retriever    79
Chihuahua             58
golden_retriever      48
Eskimo_dog            38
kelpie                35
                      ..
rock_crab              1
affenpinscher          1
snail                  1
acorn                  1
viaduct                1
Name: prediction3, Length: 408, dtype: int64

From these values, retrievers appear to be the breed predicted most often by all three algorithms. It is also important to see which algorithm delivers results with higher confidence. An extra step could be checking whether specific breeds receive more favorites or retweets than others.
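One quick way to check that agreement is to intersect each algorithm's most frequent predictions. A hedged sketch on a hypothetical predictions table (same column names as image_predictions_clean, fabricated rows):

```python
import pandas as pd

# Hypothetical predictions table; rows are made up for illustration
preds = pd.DataFrame({
    'prediction1': ['golden_retriever'] * 3 + ['pug'] * 2 + ['Chihuahua'],
    'prediction2': ['Labrador_retriever'] * 3 + ['golden_retriever'] * 2 + ['pug'],
    'prediction3': ['golden_retriever'] * 3 + ['Chihuahua'] * 2 + ['kelpie'],
})

# Breeds appearing among the two most frequent predictions of every algorithm
top = [set(preds[c].value_counts().head(2).index) for c in preds.columns]
common = top[0] & top[1] & top[2]
print(common)  # {'golden_retriever'}
```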

In [1466]:
# Descriptive statistics of the predictions table
image_predictions_clean.describe()
Out[1466]:
           tweet_id      img_num  prediction1_confidence  prediction2_confidence  prediction3_confidence
count  2.075000e+03  2075.000000             2075.000000            2.075000e+03            2.075000e+03
mean   7.384514e+17     1.203855                0.594548            1.345886e-01            6.032417e-02
std    6.785203e+16     0.561875                0.271174            1.006657e-01            5.090593e-02
min    6.660209e+17     1.000000                0.044333            1.011300e-08            1.740170e-10
25%    6.764835e+17     1.000000                0.364412            5.388625e-02            1.622240e-02
50%    7.119988e+17     1.000000                0.588230            1.181810e-01            4.944380e-02
75%    7.932034e+17     1.000000                0.843855            1.955655e-01            9.180755e-02
max    8.924206e+17     4.000000                1.000000            4.880140e-01            2.734190e-01

Perhaps as expected, the first predictions (prediction1) carry the highest confidence values.
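This can be confirmed by comparing the mean of each confidence column directly. A sketch on toy numbers shaped like the describe() output (same column names, fabricated values):

```python
import pandas as pd

# Toy confidence columns (fabricated values, illustrative only)
conf = pd.DataFrame({
    'prediction1_confidence': [0.90, 0.60, 0.40],
    'prediction2_confidence': [0.20, 0.15, 0.10],
    'prediction3_confidence': [0.05, 0.06, 0.04],
})

# Mean confidence per prediction level, highest first
means = conf.mean().sort_values(ascending=False)
print(means.index[0])  # 'prediction1_confidence'
```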

Are longer tweets favored?
In [1467]:
# The tweet_length column holds [start, end] pairs, so it is split into two columns
detailed_tweets_clean[['tweet_start','tweet_end']] = pd.DataFrame(detailed_tweets_clean.tweet_length.tolist(), index= detailed_tweets_clean.index)
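The same pattern can be shown on a self-contained toy frame, assuming tweet_length stores [start, end] pairs as above (the values below are fabricated):

```python
import pandas as pd

# Toy frame with a list-valued column like tweet_length
df = pd.DataFrame({'tweet_length': [[0, 140], [0, 85], [0, 50]]})

# Expand each [start, end] pair into its own column
df[['tweet_start', 'tweet_end']] = pd.DataFrame(df.tweet_length.tolist(), index=df.index)
print(df.tweet_end.tolist())  # [140, 85, 50]
```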
In [1468]:
# Inspect the most frequent tweet length
detailed_tweets_clean.tweet_end.mode()
Out[1468]:
0    140
dtype: int64
In [1469]:
# Inspect the distribution of tweet lengths
detailed_tweets_clean.tweet_end.value_counts()
Out[1469]:
140    252
139    145
138     79
137     77
116     59
      ... 
29       1
31       1
43       1
145      1
11       1
Name: tweet_end, Length: 125, dtype: int64
In [1470]:
# Masks to categorize tweet lengths, using 50 characters
# (roughly a third of the 140-character limit) as the cutoff
Long = detailed_tweets_clean.tweet_end > 50
Short = detailed_tweets_clean.tweet_end <= 50
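An alternative to a pair of boolean masks is pd.cut, which assigns every length to exactly one labeled bin, so boundary values like 50 cannot slip through. A sketch on hypothetical lengths:

```python
import pandas as pd

# Hypothetical tweet lengths in characters
lengths = pd.Series([11, 43, 50, 116, 140])

# Bins (0, 50] -> 'Short' and (50, 140] -> 'Long'; every value lands in one bin
labels = pd.cut(lengths, bins=[0, 50, 140], labels=['Short', 'Long'])
print(labels.tolist())  # ['Short', 'Short', 'Short', 'Long', 'Long']
```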
In [1471]:
# Histogram of tweets lengths
detailed_tweets_clean.tweet_end[Long].hist(label = 'Long tweet');
detailed_tweets_clean.tweet_end[Short].hist(label = 'Short tweet');
plt.title("Histogram of tweets length")
plt.xlabel("Tweet's length (Char)")
plt.legend();
In [1472]:
# What is the mean of favorite counts for long tweets
detailed_tweets_clean.favorite_count[Long].mean()
Out[1472]:
7504.939180255619
In [1473]:
# What is the mean of favorite counts for short tweets
detailed_tweets_clean.favorite_count[Short].mean()
Out[1473]:
5229.1551724137935

As seen, the average favorite count for longer tweets is higher. This makes sense, as longer tweets tend to provide more information about the dog.

In [1474]:
# What is the mean of retweets for long tweets
detailed_tweets_clean.retweet_count[Long].mean()
Out[1474]:
2681.32921992067
In [1475]:
# What is the mean of retweets for short tweets
detailed_tweets_clean.retweet_count[Short].mean()
Out[1475]:
1487.2241379310344

Similarly, the average retweet count for longer tweets is higher. One aspect to take into account is that long tweets greatly outnumber short ones, so further inspection is needed before establishing any real relationship.
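A first step for that further inspection could be the correlation between tweet length and favorites. A sketch on fabricated numbers (not the real data):

```python
import pandas as pd

# Toy data: tweet length vs. favorites (fabricated, illustrative only)
df = pd.DataFrame({
    'tweet_end': [30, 60, 90, 120, 140],
    'favorite_count': [1500, 2600, 4000, 6200, 7500],
})

# Pearson correlation between tweet length and favorite count
r = df['tweet_end'].corr(df['favorite_count'])
print(r)  # strongly positive for this toy data
```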

Conclusions

This project shows the stages of wrangling and exploratory analysis of three datasets holding interesting information about dogs posted on Twitter by the WeRateDogs account. The datasets were obtained from diverse sources; their main flaws were inspected, and several were documented and fixed. The resulting dataset helped answer questions related to favorite and retweet counts. Based on the results, dogs in the doggo and puppo stages receive the most retweets and likes. In the image-predictions dataset, retriever breeds were predicted far more often than the rest. Lastly, the table holding detailed tweet information showed that tweets using the full 140-character limit were the most common, as well as the most favorited and retweeted. These insights come from a superficial analysis, so further work is needed to establish any real relationships between these variables.