This document details the series of actions carried out to improve the quality of the dataset from the Twitter account 'WeRateDogs'. This peculiar dataset holds interesting information about dogs featured on the internet. To make this and future analyses easier, the dataset was augmented, assessed, cleaned and briefly analyzed. The file contains observations that could be useful for gaining insights into which dogs seem to attract more attention, which pictures or actions are the most popular on Twitter, and so on.
#Required packages and tools are imported
import pandas as pd
import numpy as np
import requests as rd
import time
import tweepy as tw
import json
import sys
from matplotlib import pyplot as plt
%matplotlib inline
#Twitter API credentials
#consumer_key = 'YOUR CONSUMER KEY'
#consumer_secret = 'YOUR CONSUMER SECRET'
#access_token = 'YOUR ACCESS TOKEN'
#access_secret = 'YOUR ACCESS SECRET'
#auth = tw.OAuthHandler(consumer_key, consumer_secret)
#auth.set_access_token(access_token, access_secret)
#api = tw.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify = True, parser=tw.parsers.JSONParser())
This section holds the steps carried out to assemble/obtain, inspect and properly clean the datasets. All steps will be documented, and more information can be obtained in the attached file called ‘wrangle_report.pdf’.
The gathering process for all documents required to carry out the analysis is documented here. The first file, with the original 2355 tweets, was handed over as a .csv file and uploaded to this notebook server. The second file was downloaded from a URL using the requests library. The third file was obtained using the Twitter API. The tasks required to obtain these files are described in the sections below:
# Read CSV (comma-separated) file into DataFrame
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')
The dataset is first inspected to make sure it was imported successfully
# Preliminary Visual inspection of the dataset
twitter_archive.head()
As this import appears to have been successful, the second dataset will be downloaded from the provided URL using the requests library:
# Create a variable holding the URL with the files and same for the response
URL1 = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = rd.get(URL1)
# Create the file where the content of the url will be saved
with open('image_predictions.tsv', mode='wb') as file:
    file.write(response.content)
# Read TSV (Tab-separated) file into DataFrame
image_predictions = pd.read_csv('image_predictions.tsv', sep='\t')
# Preliminary visual inspection of the dataset
image_predictions.sample(2)
The third dataset is gathered from Twitter using their API and the Tweepy library. The tweets to download are those whose tweet IDs appear in the original file. In order to download exactly those tweets, the 'tweet_id' column is stored in a list and every tweet matching one of those IDs is saved to a file called ‘tweet_json.txt’.
# Create a list with the tweet IDs
list_ids = list(twitter_archive.tweet_id)
# Verify the length of this list
len(list_ids)
"""
#Set up a timer to see how long the operation lasts
start = time.time()
deleted_tweets = 0
#Set up a for loop that verifies the tweet id's
# Use try and except for cases where the tweet has been deleted
with open('tweet_json.txt', mode='w') as f:
    for ids in list_ids:
        try:
            tweet = api.get_status(ids, tweet_mode='extended')
            json.dump(tweet, f)
        except:
            deleted_tweets += 1
#record end time
end = time.time()
#print total duration
print(end - start)
# print total of tweets deleted
print('A total of ' + str(deleted_tweets) + ' tweets were deleted')
"""
Now the file can be read into a DataFrame
# Read the flat .txt file into DataFrame using read_json
# Full_tweets = pd.read_json('tweet_json.txt', orient= 'records', encoding="utf8", lines=True)
This code did not work because the file contained invalid JSON, and it threw the following error:
ValueError: Unexpected character found when decoding array value (2)
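The underlying cause is that json.dump was called once per tweet with no separator, so the objects were concatenated into one invalid JSON document. For illustration only, a minimal sketch of how the gathering loop could instead write newline-delimited JSON, which pd.read_json(..., lines=True) can parse directly; it assumes the same api object and list_ids as above (with the JSONParser configured earlier, so get_status returns a plain dict), and the file name tweet_json_ndjson.txt is hypothetical:
# Hypothetical alternative (not the approach used in this notebook):
# write one JSON object per line so the file is valid newline-delimited JSON
with open('tweet_json_ndjson.txt', mode='w') as f:
    for ids in list_ids:
        try:
            tweet = api.get_status(ids, tweet_mode='extended')
            f.write(json.dumps(tweet) + '\n')  # newline separates the objects
        except Exception:
            pass  # tweet deleted or otherwise unavailable
# detailed_tweets = pd.read_json('tweet_json_ndjson.txt', lines=True)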
The JSON file was checked in a JSON validator and it showed an EOF (end of file) error. A workaround was found on Stack Exchange, where a function was created to parse and format the JSON objects one by one into a list and finally create a DataFrame. This approach is used below:
# First the ParseMultipleJson function is created to make sure the
# JSONs objects of the file are parsed correctly
def parseMultipleJSON(lines):
    skip = prev = 0
    data = []
    lines = ''.join(lines)
    for idx, line in enumerate(lines):
        if line == "{":
            skip += 1
        elif line == "}":
            skip -= 1
            if skip == 0:
                json_string = ''.join(lines[prev:idx+1])
                data.append(json.loads(json_string))
                prev = idx+1
    return data
with open('tweet_json.txt', 'r') as json_file:
    lines = json_file.readlines()
lines = [line.strip("\n") for line in lines]
data = parseMultipleJSON(lines)
detailed_tweets = pd.DataFrame(data)
# Preliminary visual inspection of the dataset
detailed_tweets.sample(2)
This part of the analysis focuses on identifying the aspects of the datasets that need improvement. These improvements will make the subsequent analysis possible. A visual inspection will be carried out using the software Numbers, followed by a programmatic one. Issues found will be categorized into quality and tidiness sections.
twitter_archive table
First, a visual inspection was carried out with the software 'Numbers'.
This inspection gave a general idea of what the variables in this file are and which values they could and could not take. Many observations were empty (missing values).
Programmatic inspection:
# Display main characteristics of this table with info
twitter_archive.info()
# Verify the data types of the timestamp
type(twitter_archive.timestamp[9])
# Sample rows where retweeted_status_id is null (i.e. original tweets, not retweets)
twitter_archive[twitter_archive['retweeted_status_id'].isnull()].sample(14)
# Look at the unique values
twitter_archive.nunique()
# Verify the most repeated value within the stages of dog 'doggo'
twitter_archive.doggo.mode()
# Verify if all types of dog stages have 'None' as the most frequent value
twitter_archive.doggo.mode()[0] == twitter_archive.floofer.mode()[0] == twitter_archive.pupper.mode()[0] == twitter_archive.puppo.mode()[0]
# Obtain number of entries that do not have a dog stage assigned
twitter_archive_subset = twitter_archive[(twitter_archive['doggo'] == 'None') & (twitter_archive['floofer'] == 'None') & (twitter_archive['pupper'] == 'None') & (twitter_archive['puppo'] == 'None')]
len(twitter_archive_subset)
# Verify if some rows have more than one dog stage assigned; first comparing all stages to doggos
duplicates = 0
stages = ['floofer', 'pupper', 'puppo']
for col in stages:
    duplicates += len(twitter_archive.loc[(twitter_archive['doggo'] == 'doggo') & (twitter_archive[col] != 'None')])
print(duplicates)
# Display descriptive statistics of the variables with describe
twitter_archive.describe()
# Verify the maximum value of the numerator
twitter_archive[twitter_archive.rating_numerator == 1776.000000]
# Verify the maximum value of the denominator
twitter_archive[twitter_archive.rating_denominator == 170.000000]
# Verify low values for the denominator
twitter_archive[twitter_archive.rating_denominator < 10]
# Verify low values for the numerator
len(twitter_archive[twitter_archive.rating_numerator < 10])
# Search for duplicated tweets ID's with 'duplicated' method
twitter_archive[twitter_archive.tweet_id.duplicated()]
# Search for duplicated texts with 'duplicated' method
twitter_archive[twitter_archive.text.duplicated()]
# Inspect how many tweets are a response to other tweets
len(twitter_archive.in_reply_to_status_id.value_counts())
# Inspect how many tweets are a response to other users
len(twitter_archive.in_reply_to_user_id.value_counts())
# Random sample of the dataset
twitter_archive.sample(3)
image_predictions table
First, a visual inspection was carried out with the software 'Numbers'.
As with the previous table, since these values are so diverse and are not ruled by many constraints, the visual search did not deliver especially useful insights.
# Display main characteristics of this table with info
image_predictions.info()
# Verify the data types of the predictions
type(image_predictions.p1[9])
# Look at unique values
image_predictions.nunique()
# Most frequently predicted breed in the first prediction (p1)
image_predictions.p1.mode()
# Most frequently predicted breed in the second prediction (p2)
image_predictions.p2.mode()
# Most frequently predicted breed in the third prediction (p3)
image_predictions.p3.mode()
# Display descriptive statistics of the variables with describe
image_predictions.describe()
# Inspect value with confidence equal to 1
image_predictions[image_predictions.p1_conf == 1]
# Check how many entries have no dog breed predicted in any of the three attempts
image_predictions_subset = image_predictions[(image_predictions['p1_dog'] == False) & (image_predictions['p2_dog'] == False) & (image_predictions['p3_dog'] == False) ]
len(image_predictions_subset)
# Search for duplicated tweets with 'duplicated' method
image_predictions[image_predictions.tweet_id.duplicated()]
# Show a random sample of the dataset
image_predictions.sample(4)
detailed_tweets table
The visual inspection was carried out using Pandas.
detailed_tweets
This table has a lot of information but many of the entries are empty (missing values).
# Display main characteristics of this table with info
detailed_tweets.info()
# Verify the data types of the column 'created at'
type(detailed_tweets.created_at[9])
# verify the data types of the column 'possibly_sensitive'
type(detailed_tweets.possibly_sensitive[9])
# Display descriptive statistics
detailed_tweets.describe()
# Search for duplicated tweets with 'duplicated'
detailed_tweets[detailed_tweets.id.duplicated()]
# Show a random sample of the dataset
detailed_tweets.sample(4)
The last assessment is a verification of duplicated columns among the three tables
# Join all columns in a list and verify duplicated ones
all_columns = pd.Series(list(twitter_archive) + list(image_predictions) + list(detailed_tweets))
all_columns[all_columns.duplicated()]
The main observations regarding data quality and tidiness are summarized below. Quality issues are divided by table and tidiness issues are presented together for all tables.
Quality issues were identified in the twitter_archive, image_predictions and detailed_tweets tables.
Tidiness issues:
The dog stage columns (doggo, floofer, pupper, puppo) in the twitter_archive table should be one column instead of four.
The twitter_archive table and the detailed_tweets table share duplicated information. By dropping these duplicates they could constitute more coherent observational units. For the sake of this analysis the columns favorite_count and retweet_count will be taken as response variables and analyzed based on the tweet information (from the detailed_tweets table) and its relationship with the dog stage data from the twitter_archive table. With this in mind, the duplicated columns between the twitter_archive and detailed_tweets tables will be dropped, and favorite_count and retweet_count from the detailed_tweets table will be merged into the twitter_archive table.
The cleaning efforts carried out to fix the issues discovered in the previous section are described herein.
# Copy the datasets before modifying them separately in case cleaning operations are not correct
twitter_archive_clean = twitter_archive.copy()
image_predictions_clean = image_predictions.copy()
detailed_tweets_clean = detailed_tweets.copy()
twitter_archive table: Missing values (unnecessary columns)
Drop the columns with missing values that are not necessary for the analysis, using the drop() method.
# Drop columns that will not be included in the analysis using drop() method
twitter_archive_clean.drop(twitter_archive_clean.columns[[1,2,6,7,8,9]], axis = 1, inplace = True)
# Verify that columns have been drop with info()
twitter_archive_clean.info()
twitter_archive table: Missing values (imputable ones)
Re-extract the dog stages by matching their definitions and popular synonyms against the text of each tweet. This will be done in the tidiness section.
detailed_tweets table: Missing values
Drop the columns that are not necessary for the analysis and the ones with little or no values (where imputation is not possible), using the drop() method.
# Verify the index of the columns
detailed_tweets_clean.info()
# Drop columns previously targeted using drop() method
detailed_tweets_clean.drop(detailed_tweets_clean.columns[[2,7,10,11,12,13,15,16,17,18,22,23,25,28,29,30,31]], axis = 1, inplace = True)
# Verify that columns have been drop with info()
detailed_tweets_clean.info()
The dog stage columns in the twitter_archive table should be one column instead of four
The process to improve the tidiness of the data frame, drop duplicates and fix the missing/redundant values (by imputing new ones) is the following:
# Melt redundant columns (doggo, pupper...)
twitter_archive_clean = pd.melt(twitter_archive_clean, id_vars=['tweet_id','timestamp', 'source', 'text', 'rating_numerator', 'rating_denominator', 'name'], value_vars=['doggo', 'floofer', 'pupper', 'puppo'], var_name='stage')
# Drop the duplicate rows created by the melt, keeping one row per (tweet_id, value) pair
twitter_archive_clean = twitter_archive_clean.drop_duplicates(subset = ['tweet_id','value'])
# Store the rows that carry dog stage information so they can be added back later
stages_temp = twitter_archive_clean[twitter_archive_clean['stage'] == twitter_archive_clean['value']]
# Duplicates in this column are to be dropped randomly because both have the same chances to be right
stages_temp = stages_temp.drop_duplicates(subset=['tweet_id'], keep='last')
# Drop all rows with tweet_id duplicates
twitter_archive_clean = twitter_archive_clean.drop_duplicates(subset=['tweet_id'], keep=False)
# Create a function that compares common adjectives to those of the dogs definition
def DogClassificator(word):
    #Create a list with popular and descriptive dog's adjectives
    adjectives = ['pupper', 'puppy', 'pup', 'little', 'young', 'old boy', 'floof', 'boy', 'mature', 'dogo']
    #Create specific lists for different stages (Also could be indexing of adjectives)
    puppy = ['pupper', 'puppy', 'pup', 'little']
    pupper = ['young', 'old boy']
    floofer = ['floof', 'fur']
    doggo = ['old boy', 'boy', 'mature', 'dogo']
    #Loop over the tweet's text in search of matches with the aforementioned adjectives
    for adjective in adjectives:
        if (adjective in word and adjective in puppy):
            value = 'puppo'
            break
        elif (adjective in word and adjective in pupper):
            value = 'pupper'
            break
        elif (adjective in word and adjective in floofer):
            value = 'floofer'
            break
        elif (adjective in word and adjective in doggo):
            value = 'doggo'
            break
        else:
            value = 'None'
    return value
# Apply the function to the tweets' text and store the result in the 'value' column
twitter_archive_clean["value"] = twitter_archive_clean["text"].apply(DogClassificator)
# Concatenate classified and stored dataframes to have back the 2356 total
twitter_archive_clean = pd.concat([twitter_archive_clean, stages_temp])
# Drop stage column
twitter_archive_clean = twitter_archive_clean.drop(['stage'], axis=1)
# Rename value column
twitter_archive_clean.rename(columns={'value':'dog_stage'}, inplace=True)
# Inspect the dataset
twitter_archive_clean.head()
# Inspect the dataset
twitter_archive_clean.info()
Drop the duplicated/redundant columns between the twitter_archive and detailed_tweets tables.
# Drop redundant columns
twitter_archive_clean.drop(twitter_archive_clean.columns[[1,2,3]], axis = 1, inplace = True)
# Verify that columns have been dropped with info()
twitter_archive_clean.info()
# Verify columns and shape with info()
detailed_tweets_clean.info()
Merge detailed_tweets to twitter_archive
Merge the favorite_count and retweet_count columns into the twitter_archive table, joining on tweet_id.
# Rename id column to match that of twitter_archive
detailed_tweets_clean.rename(columns={'id':'tweet_id'}, inplace=True)
# Compare identifiers of both columns
twitter_archive_clean['tweet_id'].isin(detailed_tweets_clean['tweet_id']).value_counts()
# Merge response variables to table 1 as another table (to carry out the exploratory analysis)
twitter_archive_clean2 = pd.merge(twitter_archive_clean, detailed_tweets_clean, on=['tweet_id'], how='inner')
twitter_archive_clean2.head()
#Verify columns and shape with info()
twitter_archive_clean2.info()
Some of the quality issues have been tackled during the tidiness and missing data operations.
twitter_archive table:
Inspect and handle numerators and denominators which could be outliers.
#inspect the outliers and drop the ones that could be errors:
twitter_archive_clean2.rating_numerator.value_counts()
#Inspect denominator values
twitter_archive_clean2.rating_denominator.value_counts()
# Observations with a numerator or denominator lower than 10 will be dropped
twitter_archive_clean2.drop(twitter_archive_clean2.loc[twitter_archive_clean2['rating_denominator'] < 10].index, inplace=True)
twitter_archive_clean2.drop(twitter_archive_clean2.loc[twitter_archive_clean2['rating_numerator'] < 10].index, inplace=True)
#Inspect with value_counts()
twitter_archive_clean2.rating_denominator.value_counts()
twitter_archive_clean2.rating_numerator.value_counts()
image_predictions table
Change column names to descriptive ones.
#Change names of the columns
image_predictions_clean.rename(columns={'p1':'prediction1', 'p2':'prediction2', 'p3':'prediction3' }, inplace=True)
image_predictions_clean.rename(columns={'jpg_url':'image_url' }, inplace=True)
image_predictions_clean.rename(columns={'p1_conf':'prediction1_confidence', 'p2_conf':'prediction2_confidence', 'p3_conf':'prediction3_confidence' }, inplace=True)
image_predictions_clean.rename(columns={'p1_dog':'p1_breed_dog', 'p2_dog':'p2_breed_dog', 'p3_dog':'p3_breed_dog' }, inplace=True)
#Inspect with info()
image_predictions_clean.info()
detailed_tweets table
Change the datatype of created_at from str to datetime.
#Change the datatype with to_datetime()
detailed_tweets_clean['created_at'] = pd.to_datetime(detailed_tweets_clean['created_at'])
#Inspect the datetype of the created at column
type(detailed_tweets_clean.created_at[9])
Drop columns that are not required.
#Drop columns that are not required
detailed_tweets_clean.drop(detailed_tweets_clean.columns[[2,7]], axis = 1, inplace = True)
#Inspect with info()
detailed_tweets_clean.info()
Change column names to descriptive ones.
#Change names of the columns
detailed_tweets_clean.rename(columns={'display_text_range':'tweet_length', 'lang':'language' }, inplace=True)
#Inspect with info()
detailed_tweets_clean.info()
These cleaned tables will be saved to .csv files.
#Save files with to_csv() function
twitter_archive_clean.to_csv('twitter_archive_master.csv', index=False)
twitter_archive_clean2.to_csv('twitter_archive_master2.csv', index=False)
image_predictions_clean.to_csv('image_predictions_clean.csv', index=False)
detailed_tweets_clean.to_csv('detailed_tweets_clean.csv', index=False)
Now that the data has been cleaned and tidied, some useful descriptive statistics and visual aids can be used to better understand the variables and how they interact. Some of the questions that could arise are explored below:
#group by stage_dog
twitter_archive_clean2.groupby(['dog_stage']).mean()
This table shows the mean values of the numeric variables (including favorite and retweet counts) by dog stage. It seems as if the dogs classified as puppo and doggo have the highest favorite_count mean.
# Dog stages are plotted against the mean favorite_count
twitter_archive_clean2.groupby(['dog_stage']).favorite_count.mean().plot(kind='bar');
plt.title('Avg. favorite count of tweets by dog "stage"');
plt.ylabel('Favorite count', fontsize = 15);
plt.xlabel("Dog's stage", fontsize = 15);
# Dog stages are plotted against the mean retweet_count
twitter_archive_clean2.groupby(['dog_stage']).retweet_count.mean().plot(kind='bar');
plt.title('Avg. retweet count of tweets by dog "stage"');
plt.ylabel('Retweet count', fontsize=15);
plt.xlabel("Dog's stage", fontsize=15);
# Counts of dog breeds for algorithm 1
image_predictions_clean.prediction1.value_counts()
# Counts of dog breeds for algorithm 2
image_predictions_clean.prediction2.value_counts()
# Counts of dog breeds for algorithm 3
image_predictions_clean.prediction3.value_counts()
From these values it seems as if the retriever dogs are the ones predicted the most by all three algorithms. It is important to see which algorithm delivers results with higher confidence. An extra step could be checking whether specific breeds get more favorites or retweets than others.
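For illustration, that extra step could be sketched by joining the first prediction to the response variables and grouping by breed; this is only a rough outline and assumes the cleaned tables built above (twitter_archive_clean2 with tweet_id, favorite_count and retweet_count, and image_predictions_clean with the renamed prediction columns):
# Illustrative sketch: join the first prediction to the response variables
breeds = pd.merge(
    twitter_archive_clean2[['tweet_id', 'favorite_count', 'retweet_count']],
    image_predictions_clean[['tweet_id', 'prediction1', 'p1_breed_dog']],
    on='tweet_id', how='inner')
# Keep only rows where the first prediction is an actual dog breed
breeds = breeds[breeds['p1_breed_dog']]
# Mean favorite and retweet counts for the 10 most frequently predicted breeds
top_breeds = breeds['prediction1'].value_counts().head(10).index
breeds[breeds['prediction1'].isin(top_breeds)].groupby('prediction1')[['favorite_count', 'retweet_count']].mean()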
# Descriptive statistics of the predictions table
image_predictions_clean.describe()
Perhaps as expected, the first predictions (prediction1) are the ones with the highest confidence values.
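A small, focused check of that observation, using the confidence columns renamed during cleaning (a sketch rather than part of the cleaning pipeline):
# Mean confidence of each of the three prediction attempts
image_predictions_clean[['prediction1_confidence', 'prediction2_confidence', 'prediction3_confidence']].mean()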
# In order to use the tweet_length column it had to be divided into two columns
detailed_tweets_clean[['tweet_start','tweet_end']] = pd.DataFrame(detailed_tweets_clean.tweet_length.tolist(), index= detailed_tweets_clean.index)
# Inspect the most frequent tweet length
detailed_tweets_clean.tweet_end.mode()
# Inspect tweets length
detailed_tweets_clean.tweet_end.value_counts()
# Create masks to categorize tweet lengths, using roughly 1/3 of the 140-character maximum (50 characters) as the cutoff
Long = detailed_tweets_clean.tweet_end > 50
Short= detailed_tweets_clean.tweet_end < 50
# Histogram of tweets lengths
detailed_tweets_clean.tweet_end[Long].hist(label = 'Long tweet');
detailed_tweets_clean.tweet_end[Short].hist(label = 'Short tweet');
plt.title("Histogram of tweets length")
plt.xlabel("Tweet's length (Char)")
plt.legend();
# What is the mean of favorite counts for long tweets
detailed_tweets_clean.favorite_count[Long].mean()
# What is the mean of favorite counts for short tweets
detailed_tweets_clean.favorite_count[Short].mean()
As seen, the average favorite count for longer tweets is higher. This makes sense, as longer tweets tend to provide more information about the dog.
# What is the mean of retweets for long tweets
detailed_tweets_clean.retweet_count[Long].mean()
# What is the mean of retweets for short tweets
detailed_tweets_clean.retweet_count[Short].mean()
Similarly, the number of retweets for longer tweets is higher. One aspect to take into account is that there are many more long tweets than short ones, so further inspection of this relationship is needed before establishing any conclusions.
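One such further inspection could look at the relationship across the full range of lengths rather than just two bins. A minimal sketch, assuming the tweet_end and favorite_count columns of detailed_tweets_clean created earlier:
# Correlation between tweet length and favorite count over the full range
print(detailed_tweets_clean['tweet_end'].corr(detailed_tweets_clean['favorite_count']))
# Scatter plot of length vs. favorite count for a visual check
detailed_tweets_clean.plot(kind='scatter', x='tweet_end', y='favorite_count', alpha=0.3);
plt.title('Favorite count vs. tweet length');
plt.xlabel("Tweet's length (Char)");
plt.ylabel('Favorite count');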
This project shows the stages of wrangling and exploratory analysis of three datasets holding interesting information about dogs on Twitter from the WeRateDogs account. The datasets were obtained from diverse sources, their main flaws were inspected, and some of them were documented and fixed. The resulting dataset helped to gain some insights on questions related to the favorite and retweet counts. Based on the results delivered by the dataset, it seems as if puppies are the ones with more retweets and likes. From the image predictions dataset, the retriever breeds were the ones predicted the most, by far in comparison with the rest. Lastly, the table holding detailed tweet information showed that tweets using the 140-character limit were the most common ones, as well as the most favorited and retweeted. These insights were obtained from a superficial analysis, so further analysis has to be carried out in order to establish relationships between these variables.