This document details the series of actions carried out to improve the quality of the dataset from the Twitter account 'WeRateDogs'. This peculiar dataset holds interesting information about dogs featured on the internet. To make this and future analyses easier, the dataset was augmented, assessed, cleaned and briefly analyzed. The file contains observations that could be useful for gaining insights into which dogs seem to attract more attention, which pictures or actions are the most popular on Twitter, and so on.
#Required packages and tools are imported
import pandas as pd
import numpy as np
import requests as rd
import time
import tweepy as tw
import json
import sys
from matplotlib import pyplot as plt
%matplotlib inline
#Twitter API credentials
#consumer_key = 'YOUR CONSUMER KEY'
#consumer_secret = 'YOUR CONSUMER SECRET'
#access_token = 'YOUR ACCESS TOKEN'
#access_secret = 'YOUR ACCESS SECRET'
#auth = tw.OAuthHandler(consumer_key, consumer_secret)
#auth.set_access_token(access_token, access_secret)
#api = tw.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify = True, parser=tw.parsers.JSONParser())
This section holds the steps carried out to assemble/obtain, inspect and properly clean the datasets. All steps will be documented, and more information can be obtained in the attached file called ‘wrangle_report.pdf’.
The gathering process for all documents required to carry out the analysis is documented here. The first file, with the original 2355 tweets, was handed over as a .csv file and uploaded to this notebook server. The second file was downloaded from a URL using the requests library. The third file was obtained using the Twitter API. The tasks required to obtain these files are described in the sections below:
# Read CSV (comma-separated) file into DataFrame
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')
The dataset is first inspected to make sure it was imported successfully
# Preliminary Visual inspection of the dataset
twitter_archive.head()
As this import appears to have been successful, the second dataset will be downloaded from the provided URL using the requests library:
# Create a variable holding the URL with the files and same for the response
URL1 = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = rd.get(URL1)
# Create the file where the content of the url will be saved
with open('image_predictions.tsv', mode='wb') as file:
    file.write(response.content)
# Read TSV (Tab-separated) file into DataFrame
image_predictions = pd.read_csv('image_predictions.tsv', sep='\t')
# Preliminary visual inspection of the dataset
image_predictions.sample(2)
The third dataset is gathered from Twitter using their API and the Tweepy library. The tweets to download are those whose tweet IDs appear in the original file. In order to download exactly those tweets, the 'tweet_id' column is stored in a list and every tweet matching one of those IDs is saved to a file called ‘tweet_json.txt’.
# Create a list with the tweet IDs
list_ids = list(twitter_archive.tweet_id)
# Verify the length of this list
len(list_ids)
"""
#Set up a timer to see how long the operation lasts
start = time.time()
deleted_tweets = 0
#Set up a for loop that verifies the tweet id's
# Use try and except for cases where the tweet has been deleted
with open('tweet_json.txt', mode='w') as f:
    for ids in list_ids:
        try:
            tweet = api.get_status(ids, tweet_mode='extended')
            json.dump(tweet, f)
        except:
            deleted_tweets += 1
#record end time
end = time.time()
#print total duration
print(end - start)
# print total of tweets deleted
print('A total of ' + str(deleted_tweets) + ' tweets were deleted')
"""
Now the file can be read into a DataFrame
# Read the flat .txt file into DataFrame using read_json
# Full_tweets = pd.read_json('tweet_json.txt', orient= 'records', encoding="utf8", lines=True)
This code did not work because the file contained invalid JSON, and it threw the following error:
ValueError: Unexpected character found when decoding array value (2)
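The underlying cause is that json.dump was called once per tweet with no separator, so the objects were concatenated into one invalid JSON document. For illustration only, a minimal sketch of how the gathering loop could instead write newline-delimited JSON, which pd.read_json(..., lines=True) can parse directly; it assumes the same api object and list_ids as above (with the JSONParser configured earlier, so get_status returns a plain dict), and the file name tweet_json_ndjson.txt is hypothetical:
# Hypothetical alternative (not the approach used in this notebook):
# write one JSON object per line so the file is valid newline-delimited JSON
with open('tweet_json_ndjson.txt', mode='w') as f:
    for ids in list_ids:
        try:
            tweet = api.get_status(ids, tweet_mode='extended')
            f.write(json.dumps(tweet) + '\n')  # newline separates the objects
        except Exception:
            pass  # tweet deleted or otherwise unavailable
# detailed_tweets = pd.read_json('tweet_json_ndjson.txt', lines=True)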
The JSON file was checked in a JSON validator and it showed an EOF (end of file) error. A workaround was found on Stack Exchange, where a function was created to parse and format the JSON objects one by one into a list and finally create a DataFrame. This approach is used below:
# First the ParseMultipleJson function is created to make sure the
# JSONs objects of the file are parsed correctly
def parseMultipleJSON(lines):
    skip = prev = 0
    data = []
    lines = ''.join(lines)
    for idx, line in enumerate(lines):
        if line == "{":
            skip += 1
        elif line == "}":
            skip -= 1
            if skip == 0:
                json_string = ''.join(lines[prev:idx+1])
                data.append(json.loads(json_string))
                prev = idx+1
    return data
with open('tweet_json.txt', 'r') as json_file:
    lines = json_file.readlines()
lines = [line.strip("\n") for line in lines]
data = parseMultipleJSON(lines)
detailed_tweets = pd.DataFrame(data)
# Preliminary visual inspection of the dataset
detailed_tweets.sample(2)
This part of the analysis focuses on identifying the aspects of the datasets that need improvement. These improvements will make the subsequent analysis possible. A visual inspection will be carried out using the software Numbers, followed by a programmatic one. Issues found will be categorized into quality and tidiness sections.
twitter_archive table
First, a visual inspection was carried out with the software 'Numbers'.
This inspection gave a general idea of what the variables in this file are and which values they could and could not take. Many observations were empty (missing values).
Programmatic inspection:
# Display main characteristics of this table with info
twitter_archive.info()
# Verify the data types of the timestamp
type(twitter_archive.timestamp[9])
# Sample rows where retweeted_status_id is null (i.e. original tweets, not retweets)
twitter_archive[twitter_archive['retweeted_status_id'].isnull()].sample(14)
# Look at the unique values
twitter_archive.nunique()
# Verify the most repeated value within the stages of dog 'doggo'
twitter_archive.doggo.mode()
# Verify if all types of dog stages have 'None' as the most frequent value
twitter_archive.doggo.mode()[0] == twitter_archive.floofer.mode()[0] == twitter_archive.pupper.mode()[0] == twitter_archive.puppo.mode()[0]
# Obtain number of entries that do not have a dog stage assigned
twitter_archive_subset = twitter_archive[(twitter_archive['doggo'] == 'None') & (twitter_archive['floofer'] == 'None') & (twitter_archive['pupper'] == 'None') & (twitter_archive['puppo'] == 'None')]
len(twitter_archive_subset)
# Verify if some rows have more than one dog stage assigned; first comparing all stages to doggos
duplicates = 0
stages = ['floofer', 'pupper', 'puppo']
for col in stages:
    duplicates += len(twitter_archive.loc[(twitter_archive['doggo'] == 'doggo') & (twitter_archive[col] != 'None')])
print(duplicates)
# Display descriptive statistics of the variables with describe
twitter_archive.describe()
# Verify the maximum value of the numerator
twitter_archive[twitter_archive.rating_numerator == 1776.000000]
# Verify the maximum value of the denominator
twitter_archive[twitter_archive.rating_denominator == 170.000000]
# Verify low values for the denominator
twitter_archive[twitter_archive.rating_denominator < 10]
# Verify low values for the numerator
len(twitter_archive[twitter_archive.rating_numerator < 10])
# Search for duplicated tweets ID's with 'duplicated' method
twitter_archive[twitter_archive.tweet_id.duplicated()]
# Search for duplicated texts with 'duplicated' method
twitter_archive[twitter_archive.text.duplicated()]
# Inspect how many tweets are a response to other tweets
len(twitter_archive.in_reply_to_status_id.value_counts())
# Inspect how many tweets are a response to other users
len(twitter_archive.in_reply_to_user_id.value_counts())
# Random sample of the dataset
twitter_archive.sample(3)
image_predictions table
First, a visual inspection was carried out with the software 'Numbers'.
As with the previous table, since these values are so diverse and are not ruled by many constraints, the visual search did not deliver especially useful insights.
# Display main characteristics of this table with info
image_predictions.info()
# Verify the data types of the predictions
type(image_predictions.p1[9])
# Look at unique values
image_predictions.nunique()
# Most frequently predicted breed in the first prediction (p1)
image_predictions.p1.mode()
# Most frequently predicted breed in the second prediction (p2)
image_predictions.p2.mode()
# Most frequently predicted breed in the third prediction (p3)
image_predictions.p3.mode()
# Display descriptive statistics of the variables with describe
image_predictions.describe()
# Inspect value with confidence equal to 1
image_predictions[image_predictions.p1_conf == 1]
# Check how many entries have no dog breed predicted in any of the three attempts
image_predictions_subset = image_predictions[(image_predictions['p1_dog'] == False) & (image_predictions['p2_dog'] == False) & (image_predictions['p3_dog'] == False) ]
len(image_predictions_subset)
# Search for duplicated tweets with 'duplicated' method
image_predictions[image_predictions.tweet_id.duplicated()]
# Show a random sample of the dataset
image_predictions.sample(4)
detailed_tweets table
The visual inspection was carried out using Pandas.
detailed_tweets
This table has a lot of information but many of the entries are empty (missing values).
# Display main characteristics of this table with info
detailed_tweets.info()
# Verify the data types of the column 'created at'
type(detailed_tweets.created_at[9])
# verify the data types of the column 'possibly_sensitive'
type(detailed_tweets.possibly_sensitive[9])
# Display descriptive statistics
detailed_tweets.describe()
# Search for duplicated tweets with 'duplicated'
detailed_tweets[detailed_tweets.id.duplicated()]
# Show a random sample of the dataset
detailed_tweets.sample(4)
The last assessment is a verification of duplicated columns among the three tables
# Join all columns in a list and verify duplicated ones
all_columns = pd.Series(list(twitter_archive) + list(image_predictions) + list(detailed_tweets))
all_columns[all_columns.duplicated()]
The main observations regarding data quality and tidiness are summarized below. Quality issues are divided by table and tidiness issues are presented together for all tables.
Quality issues were identified in the twitter_archive, image_predictions and detailed_tweets tables.
Tidiness issues:
The dog stage columns (doggo, floofer, pupper, puppo) in the twitter_archive table should be one column instead of four.
The twitter_archive table and the detailed_tweets table share duplicated information. By dropping these duplicates they could constitute more coherent observational units. For the sake of this analysis the columns favorite_count and retweet_count will be taken as response variables and analyzed based on the tweet information (from the detailed_tweets table) and its relationship with the dog stage data from the twitter_archive table. With this in mind, the duplicated columns between the twitter_archive and detailed_tweets tables will be dropped, and favorite_count and retweet_count from the detailed_tweets table will be merged into the twitter_archive table.
The cleaning efforts carried out to fix the issues discovered in the previous section are described herein.
# Copy the datasets before modifying them separately in case cleaning operations are not correct
twitter_archive_clean = twitter_archive.copy()
image_predictions_clean = image_predictions.copy()
detailed_tweets_clean = detailed_tweets.copy()
twitter_archive table: Missing values (unnecessary columns)
Drop the columns with missing values that are not necessary for the analysis, using the drop() method.
# Drop columns that will not be included in the analysis using drop() method
twitter_archive_clean.drop(twitter_archive_clean.columns[[1,2,6,7,8,9]], axis = 1, inplace = True)
# Verify that columns have been drop with info()
twitter_archive_clean.info()
twitter_archive table: Missing values (imputable ones)
Re-extract the dog stages by matching their definitions and popular synonyms against the text of each tweet. This will be done in the tidiness section.
detailed_tweets table: Missing values
Drop the columns that are not necessary for the analysis and the ones with little or no values (where imputation is not possible), using the drop() method.
# Verify the index of the columns
detailed_tweets_clean.info()
# Drop columns previously targeted using drop() method
detailed_tweets_clean.drop(detailed_tweets_clean.columns[[2,7,10,11,12,13,15,16,17,18,22,23,25,28,29,30,31]], axis = 1, inplace = True)
# Verify that columns have been drop with info()
detailed_tweets_clean.info()
The dog stage columns in the twitter_archive table should be one column instead of four
The process to improve the tidiness of the data frame, drop duplicates and fix the missing/redundant values (by imputing new ones) is the following:
# Melt redundant columns (doggo, pupper...)
twitter_archive_clean = pd.melt(twitter_archive_clean, id_vars=['tweet_id','timestamp', 'source', 'text', 'rating_numerator', 'rating_denominator', 'name'], value_vars=['doggo', 'floofer', 'pupper', 'puppo'], var_name='stage')
# Drop the duplicate rows created by the melt, keeping one row per (tweet_id, value) pair
twitter_archive_clean = twitter_archive_clean.drop_duplicates(subset = ['tweet_id','value'])
# Store the rows that carry dog stage information so they can be added back later
stages_temp = twitter_archive_clean[twitter_archive_clean['stage'] == twitter_archive_clean['value']]
# Duplicates in this column are to be dropped randomly because both have the same chances to be right
stages_temp = stages_temp.drop_duplicates(subset=['tweet_id'], keep='last')
# Drop all rows with tweet_id duplicates
twitter_archive_clean = twitter_archive_clean.drop_duplicates(subset=['tweet_id'], keep=False)
# Create a function that compares common adjectives to those of the dogs definition
def DogClassificator(word):
    #Create a list with popular and descriptive dog's adjectives
    adjectives = ['pupper', 'puppy', 'pup', 'little', 'young', 'old boy', 'floof', 'boy', 'mature', 'dogo']
    #Create specific lists for different stages (Also could be indexing of adjectives)
    puppy = ['pupper', 'puppy', 'pup', 'little']
    pupper = ['young', 'old boy']
    floofer = ['floof', 'fur']
    doggo = ['old boy', 'boy', 'mature', 'dogo']
    #Loop over the tweet's text in search of matches with the aforementioned adjectives
    for adjective in adjectives:
        if (adjective in word and adjective in puppy):
            value = 'puppo'
            break
        elif (adjective in word and adjective in pupper):
            value = 'pupper'
            break
        elif (adjective in word and adjective in floofer):
            value = 'floofer'
            break
        elif (adjective in word and adjective in doggo):
            value = 'doggo'
            break
        else:
            value = 'None'
    return value
# Apply the function to the tweets' text and store the result in the 'value' column
twitter_archive_clean["value"] = twitter_archive_clean["text"].apply(DogClassificator)
# Concatenate classified and stored dataframes to have back the 2356 total
twitter_archive_clean = pd.concat([twitter_archive_clean, stages_temp])
# Drop stage column
twitter_archive_clean = twitter_archive_clean.drop(['stage'], axis=1)
# Rename value column
twitter_archive_clean.rename(columns={'value':'dog_stage'}, inplace=True)
# Inspect the dataset
twitter_archive_clean.head()
# Inspect the dataset
twitter_archive_clean.info()
Drop the duplicated/redundant columns between the twitter_archive and detailed_tweets tables.
# Drop redundant columns
twitter_archive_clean.drop(twitter_archive_clean.columns[[1,2,3]], axis = 1, inplace = True)
# Verify that columns have been dropped with info()
twitter_archive_clean.info()
# Verify columns and shape with info()
detailed_tweets_clean.info()
Merge detailed_tweets to twitter_archive
Merge the favorite_count and retweet_count columns into the twitter_archive table, joining on tweet_id.
# Rename id column to match that of twitter_archive
detailed_tweets_clean.rename(columns={'id':'tweet_id'}, inplace=True)
# Compare identifiers of both columns
twitter_archive_clean['tweet_id'].isin(detailed_tweets_clean['tweet_id']).value_counts()
# Merge response variables to table 1 as another table (to carry out the exploratory analysis)
twitter_archive_clean2 = pd.merge(twitter_archive_clean, detailed_tweets_clean, on=['tweet_id'], how='inner')
twitter_archive_clean2.head()
#Verify columns and shape with info()
twitter_archive_clean2.info()
Some of the quality issues have been tackled during the tidiness and missing data operations.
twitter_archive table:
Inspect and handle numerators and denominators which could be outliers.
#inspect the outliers and drop the ones that could be errors:
twitter_archive_clean2.rating_numerator.value_counts()
#Inspect denominator values
twitter_archive_clean2.rating_denominator.value_counts()
# Observations with a numerator or denominator lower than 10 will be dropped
twitter_archive_clean2.drop(twitter_archive_clean2.loc[twitter_archive_clean2['rating_denominator'] < 10].index, inplace=True)
twitter_archive_clean2.drop(twitter_archive_clean2.loc[twitter_archive_clean2['rating_numerator'] < 10].index, inplace=True)
#Inspect with value_counts()
twitter_archive_clean2.rating_denominator.value_counts()
twitter_archive_clean2.rating_numerator.value_counts()
image_predictions table
Change column names to descriptive ones.
#Change names of the columns
image_predictions_clean.rename(columns={'p1':'prediction1', 'p2':'prediction2', 'p3':'prediction3' }, inplace=True)
image_predictions_clean.rename(columns={'jpg_url':'image_url' }, inplace=True)
image_predictions_clean.rename(columns={'p1_conf':'prediction1_confidence', 'p2_conf':'prediction2_confidence', 'p3_conf':'prediction3_confidence' }, inplace=True)
image_predictions_clean.rename(columns={'p1_dog':'p1_breed_dog', 'p2_dog':'p2_breed_dog', 'p3_dog':'p3_breed_dog' }, inplace=True)
#Inspect with info()
image_predictions_clean.info()
detailed_tweets table
Change the datatype of created_at from str to datetime.
#Change the datatype with to_datetime()
detailed_tweets_clean['created_at'] = pd.to_datetime(detailed_tweets_clean['created_at'])
#Inspect the datetype of the created at column
type(detailed_tweets_clean.created_at[9])
Drop columns that are not required.
#Drop columns that are not required
detailed_tweets_clean.drop(detailed_tweets_clean.columns[[2,7]], axis = 1, inplace = True)
#Inspect with info()
detailed_tweets_clean.info()
Change column names to descriptive ones.
#Change names of the columns
detailed_tweets_clean.rename(columns={'display_text_range':'tweet_length', 'lang':'language' }, inplace=True)
#Inspect with info()
detailed_tweets_clean.info()
These cleaned tables will be saved to .csv files.
#Save files with to_csv() function
twitter_archive_clean.to_csv('twitter_archive_master.csv', index=False)
twitter_archive_clean2.to_csv('twitter_archive_master2.csv', index=False)
image_predictions_clean.to_csv('image_predictions_clean.csv', index=False)
detailed_tweets_clean.to_csv('detailed_tweets_clean.csv', index=False)
Now that the data has been cleaned and tidied, some useful descriptive statistics and visual aids can be used to better understand the variables and how they interact. Some of the questions that could arise are explored below:
#group by stage_dog
twitter_archive_clean2.groupby(['dog_stage']).mean()
This table shows the mean values of the numeric variables (including favorite and retweet counts) by dog stage. It seems as if the dogs classified as puppo and doggo have the highest favorite_count mean.
# Dog stages are plotted against the mean favorite_count
twitter_archive_clean2.groupby(['dog_stage']).favorite_count.mean().plot(kind='bar');
plt.title('Avg. favorite count of tweets by dog "stage"');
plt.ylabel('Favorite count', fontsize = 15);
plt.xlabel("Dog's stage", fontsize = 15);
# Dog stages are plotted against the mean retweet_count
twitter_archive_clean2.groupby(['dog_stage']).retweet_count.mean().plot(kind='bar');
plt.title('Avg. retweet count of tweets by dog "stage"');
plt.ylabel('Retweet count', fontsize=15);
plt.xlabel("Dog's stage", fontsize=15);
# Counts of dog breeds for algorithm 1
image_predictions_clean.prediction1.value_counts()
# Counts of dog breeds for algorithm 2
image_predictions_clean.prediction2.value_counts()
# Counts of dog breeds for algorithm 3
image_predictions_clean.prediction3.value_counts()
From these values it seems as if the retriever dogs are the ones predicted the most by all three algorithms. It is important to see which algorithm delivers results with higher confidence. An extra step could be checking whether specific breeds get more favorites or retweets than others.
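For illustration, that extra step could be sketched by joining the first prediction to the response variables and grouping by breed; this is only a rough outline and assumes the cleaned tables built above (twitter_archive_clean2 with tweet_id, favorite_count and retweet_count, and image_predictions_clean with the renamed prediction columns):
# Illustrative sketch: join the first prediction to the response variables
breeds = pd.merge(
    twitter_archive_clean2[['tweet_id', 'favorite_count', 'retweet_count']],
    image_predictions_clean[['tweet_id', 'prediction1', 'p1_breed_dog']],
    on='tweet_id', how='inner')
# Keep only rows where the first prediction is an actual dog breed
breeds = breeds[breeds['p1_breed_dog']]
# Mean favorite and retweet counts for the 10 most frequently predicted breeds
top_breeds = breeds['prediction1'].value_counts().head(10).index
breeds[breeds['prediction1'].isin(top_breeds)].groupby('prediction1')[['favorite_count', 'retweet_count']].mean()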
# Descriptive statistics of the predictions table
image_predictions_clean.describe()
Perhaps as expected, the first predictions (prediction1) are the ones with the highest confidence values.
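A small, focused check of that observation, using the confidence columns renamed during cleaning (a sketch rather than part of the cleaning pipeline):
# Mean confidence of each of the three prediction attempts
image_predictions_clean[['prediction1_confidence', 'prediction2_confidence', 'prediction3_confidence']].mean()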
# In order to use the tweet_length column it had to be divided into two columns
detailed_tweets_clean[['tweet_start','tweet_end']] = pd.DataFrame(detailed_tweets_clean.tweet_length.tolist(), index= detailed_tweets_clean.index)
# Inspect the most frequent tweet length
detailed_tweets_clean.tweet_end.mode()
# Inspect tweets length
detailed_tweets_clean.tweet_end.value_counts()
# Create masks to categorize tweet lengths, using roughly 1/3 of the 140-character maximum (50 characters) as the cutoff
Long = detailed_tweets_clean.tweet_end > 50
Short= detailed_tweets_clean.tweet_end < 50
# Histogram of tweets lengths
detailed_tweets_clean.tweet_end[Long].hist(label = 'Long tweet');
detailed_tweets_clean.tweet_end[Short].hist(label = 'Short tweet');
plt.title("Histogram of tweets length")
plt.xlabel("Tweet's length (Char)")
plt.legend();
# What is the mean of favorite counts for long tweets
detailed_tweets_clean.favorite_count[Long].mean()
# What is the mean of favorite counts for short tweets
detailed_tweets_clean.favorite_count[Short].mean()
As seen, the average favorite count for longer tweets is higher. This makes sense, as longer tweets tend to provide more information about the dog.
# What is the mean of retweets for long tweets
detailed_tweets_clean.retweet_count[Long].mean()
# What is the mean of retweets for short tweets
detailed_tweets_clean.retweet_count[Short].mean()
Similarly, the number of retweets for longer tweets is higher. One aspect to take into account is that there are many more long tweets than short ones, so further inspection of this relationship is needed before establishing any conclusions.
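One such further inspection could look at the relationship across the full range of lengths rather than just two bins. A minimal sketch, assuming the tweet_end and favorite_count columns of detailed_tweets_clean created earlier:
# Correlation between tweet length and favorite count over the full range
print(detailed_tweets_clean['tweet_end'].corr(detailed_tweets_clean['favorite_count']))
# Scatter plot of length vs. favorite count for a visual check
detailed_tweets_clean.plot(kind='scatter', x='tweet_end', y='favorite_count', alpha=0.3);
plt.title('Favorite count vs. tweet length');
plt.xlabel("Tweet's length (Char)");
plt.ylabel('Favorite count');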
This project shows the stages of wrangling and exploratory analysis of three datasets holding interesting information about dogs on Twitter from the WeRateDogs account. The datasets were obtained from diverse sources, their main flaws were inspected, and some of them were documented and fixed. The resulting dataset helped to gain some insights on questions related to the favorite and retweet counts. Based on the results delivered by the dataset, it seems as if puppies are the ones with more retweets and likes. From the image predictions dataset, the retriever breeds were the ones predicted the most, by far in comparison with the rest. Lastly, the table holding detailed tweet information showed that tweets using the 140-character limit were the most common ones, as well as the most favorited and retweeted. These insights were obtained from a superficial analysis, so further analysis has to be carried out in order to establish relationships between these variables.