The dataset chosen is the TMDb movie data. The original collection holds detailed information on approximately 10,867 movies spread across 21 columns, including information related to their cast, production companies, budget, runtime, and so on. The properties contained in the dataset are quite diverse: some variables are names (actor and director names), others are ordinal data (ratings), sentences (taglines), or even paragraphs (overviews). The data covers several decades, and the revenues of older years were adjusted so as to be comparable to current ones. The file contains observations that could be used to gain insights into which factors contribute to the success of a film at the box office, or to better ratings from its viewers. In that regard, the dependent variables will be revenue and user rating.
The questions that will direct this introductory investigation are:
- How do the runtime and budget influence the ratings a movie receives?
- How do the runtime and budget influence the box office performance of a movie?
These two will be inspected so as to answer a third question:
#The packages required to carry out the analysis are pandas, numpy and matplotlib
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from itertools import cycle, islice
%matplotlib inline
The dataset was downloaded as a CSV file. In order to carry out the exploratory analysis, some adjustments had to be made; these are described in the following sections.
As stated in the previous section, this dataset holds information on movies. The first step is to load the dataset and, from there, examine the variables that could be useful for answering the research questions, along with their corresponding values.
The data was downloaded as a CSV file and opened with the method pandas.read_csv. The first attempt to open the file was unsuccessful due to a decoding error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 8: invalid continuation byte
After investigating the error online, one proposed solution was setting the encoding to 'latin-1'. The official pandas documentation lists the encoding as an optional parameter of read_csv. Further reading highlighted that 'latin-1' maps every possible byte to a character, so the file is read no matter what. As the real encoding of the file was not known, the encoding was set to 'latin-1'. This means that some of the values contained in the rows could be decoded into the wrong characters.
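If one wanted to verify the guess instead of forcing 'latin-1', a minimal sketch using the third-party chardet package (an assumption; it is not part of the standard library) could estimate the encoding from a sample of the raw bytes:
# Estimate the file's encoding from the first ~100 KB of raw bytes (requires chardet to be installed)
import chardet
with open("tmdb-movies.csv", "rb") as rawFile:
    guess = chardet.detect(rawFile.read(100000))
guess  # a dict such as {'encoding': ..., 'confidence': ...}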
# Read and inspect the dataset
movieData = pd.read_csv("tmdb-movies.csv", encoding = 'latin-1')
movieData.head()
movieData.head(0)
The first inspection of the data highlighted that the first row was just a statement specifying where the data came from. This line was not necessary, so the method read_csv had to be told to ignore it. This was accomplished with the parameter skiprows which, according to the pandas documentation, accepts the number of lines to skip at the start of the file. As the row to be deleted was the zeroth one, the number specified was 1.
# Read and inspect the dataset after the adjustments
movieData = pd.read_csv("tmdb-movies.csv", skiprows = 1, encoding='latin-1')
movieData.head(2)
The columns have mostly self-explanatory names that should help answer the analysis questions. The first column is the exception, with a very long name that makes the inspection process difficult. In order to facilitate the preliminary assessment, this column will be renamed to ID_2. Perhaps in later parts of the analysis it could be renamed to a more pertinent label, or simply dropped if it has no use. First, it would be useful to have a look at all the column labels:
# All the columns labels will be inspected
movieData.columns
So far, the names of the other columns will remain the same; they are all lower case and have no spaces. As the name of the first column is too long to type and copying it could lead to mistakes, its value will be assigned to a variable and then renamed.
#We assign the first column name to a variable to ensure we have the right value
first = movieData.columns[0]
#We use the rename method and inspect the first two rows of the dataset
movieData.rename(columns = {first: "ID_2"}, inplace = True)
movieData.head(2)
Now the file seems legible. So far, several things that can be improved have been spotted, but first a quick inspection of the columns and their content was carried out to better understand the dataset. As the first values have been inspected, it is time to see the end of the dataset.
# Inspection of the last values of the dataset
movieData.tail(2)
Here one can see that all fields are missing in the last row, which seems to be an empty observation. Further inspections are aimed at determining the shape of the dataset, its data types, and general information:
# Obtain the number of rows and columns of the dataset
movieData.shape
# General information on the dataset
movieData.info()
The dataset has many null values that have to be handled.
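As a complement to info(), a quick per-column count of the missing values makes the nulls easier to spot:
# Count the null values in each column
movieData.isnull().sum()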
#the data type of its variables
movieData.dtypes
Some of these datatypes have to be further inspected.
# Look at the unique values
movieData.nunique()
The unique values display interesting facts. For instance, for such a large dataset the runtime values have a very narrow range; the same applies to budget.
It is pertinent to carry out an inspection for duplicated entries:
# With the function sum, we obtain the total number of duplicated rows
sum(movieData.duplicated())
# As there is just one duplicated entry, it will be dropped:
movieData.drop_duplicates(inplace = True)
It is pertinent to see the statistics of the dataset and analyze its variables and their measurement units:
# Inspection of the descriptive statistics of the numerical variables with 'describe'
movieData.describe()
The columns holding numerical variables are ID_2, popularity, budget, revenue, runtime, vote_count, vote_average, release_year and budget_adj. Most of them have pretty self-explanatory names. ID_2 seems to serve as a unique identifier for each movie; popularity looks like a measure of the movie's popularity, although it is not clear how it is obtained; budget represents the funds used to produce the movie, and budget_adj normalizes those funds to minimize the impact of inflation over time; vote_count contains the total number of votes each movie received; vote_average is the average score the users have given to the movie; and release_year is the year in which the movie premiered.
Some general observations so far are:
- Some variables have zero as their minimum value, which is not possible given what they represent: budget, revenue, runtime and budget_adj. No movie could have a runtime of zero or a budget of $0.00 (see the quick check after this list);
- The 25th and 50th percentiles of budget and budget_adj are also zero, which could be due to missing data; this applies to revenue as well;
- The runtime must be stored in minutes. It seems the longest movie lasted 15 hours, while the majority lasted less than 2 hours;
- Most of the movies listed have a vote average around 6; the popularity values span from 0 to 32, but most of the movies have a popularity of less than 1. The value of 32 could be due to an error;
- Movies span from 1960 to 2015.
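As a quick check of the zero-value observation above, the rows equal to zero can be counted per column:
# Count how many rows hold a zero in each of the suspicious columns
(movieData[['budget', 'revenue', 'runtime', 'budget_adj']] == 0).sum()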
And finally, the histogram of the whole dataset is to be shown.
# A visual inspection of the variables with 'hist'
movieData.hist(figsize = (11,8));
Many of these histograms are skewed to the right, which could be due to missing data filled with zeroes or null values. The release_year column shows that the movies included are mostly from the 2000s. The vote_average variable could have a normal distribution, the only one of the whole dataset.
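To support the claim that the zeroes drive the skew, a quick sketch of the budget histogram with the zero values excluded can be drawn:
# Budget histogram excluding the zero values, to see the shape without the zero spike
movieData.loc[movieData['budget'] > 0, 'budget'].hist(bins = 30);
plt.title("Budget histogram, zero values excluded");
plt.xlabel("Budget ($)");
plt.ylabel("Movies");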
The inspections carried out in the previous section helped to better understand the dataset and revealed what can be done to make it workable. Some work already done was cleaning up the column labels and dropping the duplicated row. First, there are some columns with no apparent use for the analysis that will be eliminated: the first two columns, which apparently contain identifiers, plus homepage and tagline. Additionally, there are other columns from which it would be very difficult to obtain advantageous information. For instance, the cast is a key part of a movie's success, but to make it a more functional column one would have to rank the actors and give them some sort of weight. In spite of that, the cast will be kept in order to check its impact on the results and then verify whether the actors are indeed A-list ones. The same reasoning applies to overview and keywords, but these two will be dropped. Secondly, the data has some missing values that have to be handled, and it has to be checked whether there are fully null rows. Subsequently, the data types of the remaining columns have to be inspected and changed if needed.
First, the columns with little use will be dropped; these are: 'ID_2', 'imdb_id', 'homepage', 'tagline', 'keywords' and 'overview'.
movieData.drop(['ID_2', 'imdb_id', 'homepage', 'tagline', 'keywords', 'overview'], axis = 1, inplace = True)
The following lines handle the missing values, which will be treated differently depending on which columns they belong to.
# Glance at where the values are missing from the columns kept.
movieData.info()
So far, the columns with missing values are cast, director, genres, and production_companies. These all hold categorical data, so assigning any measure of central tendency to the null values would require extra work. First, the rows where all values are missing will be dropped:
# Rows with all columns null are dropped with dropna and the parameter 'how' set to 'all'
movieData.dropna(how = 'all', inplace = True)
movieData.info()
The column 'production_companies' has many missing values; if all the rows with missing values were dropped, the sample would lose at least a thousand observations. It is important to inspect this column further before making any decision.
# Inspection of the counts of production companies values:
movieData['production_companies'].value_counts()
It seems that the most relevant production companies have the most occurrences in the dataset, hence it could be useful to include them in the analysis. The rows with null values will be set to 'Unknown'.
# The column production_companies will be modified by giving the value 'Unknown' to all null rows
movieData['production_companies'] = movieData['production_companies'].fillna('Unknown')
movieData.info()
On the other hand, the rows missing cast, director or genres are not so many, so they will be dropped.
#Rows with missing values will be dropped, decreasing the number of observations by a few dozen
movieData.dropna(inplace = True)
movieData.info()
The data types of the columns that are not numerical are to be examined. Original title, cast, director, production companies and genres should be strings.
#verification of the columns that should be strings, checking a sample row with isinstance:
isinstance(movieData['original_title'].iloc[5], str) and isinstance(movieData['cast'].iloc[5], str)
isinstance(movieData['director'].iloc[5], str) and isinstance(movieData['genres'].iloc[5], str)
isinstance(movieData['production_companies'].iloc[5], str)
So far, all the values kept as strings are in the proper data type. Exceptions are release_date (which should be stored as a date), release_year (which should be an integer) and revenue_adj (which should be a float). These will be changed.
#Verification of the format of the release_date values:
(movieData['release_date'][5])
#The dates will be transformed to datetype with the to_datetime method:
movieData['release_date'] = pd.to_datetime(movieData['release_date'])
This operation turned the movies from the late 1960s into the 2060s. This problem of mixed centuries happens when a dataset stores the year as a two-digit value while containing dates from different centuries. As the dataset contains no movies released after 2015, any parsed release date in 2020 or later must belong to the previous century, so those rows will be moved back 100 years. This approach to the problem was found on Stack Overflow.
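A minimal illustration of this behavior, using a hypothetical date string: pandas interprets two-digit years relative to the current date, so a year such as '66' can be parsed into the 2060s.
# Hypothetical example: the two-digit year '66' is parsed into the next century
pd.to_datetime('6/15/66')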
#Movies whose parsed year is 2020 or later will be moved back to the past century.
movieData.loc[movieData['release_date'].dt.year >= 2020, 'release_date'] -= pd.DateOffset(years=100)
#inspection of the results
movieData['release_date'].describe()
# The release year will be changed to int using the astype method:
movieData.release_year = movieData.release_year.astype(int)
movieData.release_year.dtypes
#Inspection of what revenue_adj looks like before the change to float
movieData['revenue_adj'][9]
#The values looked like numbers, so the astype method to float could be useful
#movieData.revenue_adj = movieData.revenue_adj.astype(float)
#This line of code did not work because one row held the garbled value '0</pre></body></html>Xtext/csv'
#The row containing that value was found with 'loc' and dropped with 'drop'
movieData.loc[movieData['revenue_adj']=='0</pre></body></html>Xtext/csv']
movieData.drop(movieData.loc[movieData['revenue_adj']=='0</pre></body></html>Xtext/csv'].index, inplace=True)
#With the wrong value gone, the conversion was run again
movieData.revenue_adj = movieData.revenue_adj.astype(float)
Finally, the numerical variables whose minimum values were zero will be inspected:
# New dataset with all the rows whose budget value equals to zero
zero_budget = movieData.loc[movieData['budget']==0,:]
zero_budget.budget.describe()
#This shows that half of the dataset lacks budget values
# New dataset with all the rows whose revenue value equals zero
zero_rev = movieData.loc[movieData['revenue']==0,:]
zero_rev.revenue.describe()
#This shows that half of the dataset lacks revenue values
The missing revenue and budget values could introduce bias into the analysis. The options are to fill these values with the mean or to drop the rows. Filling with the mean would turn the original distribution of the data into an unrealistic one. Because these columns are important for the analysis, dropping the columns themselves is not an option. On the other hand, dropping the rows with zero values would cut the observations roughly in half, but at least the remaining observations would have realistic data. With these considerations in mind, the rows with real values of revenue and budget will be saved into a new dataset, and both datasets will be investigated.
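As a quick illustration of why mean-filling was rejected (a sketch, not part of the cleaning itself), one can compare the spread of the budget column before and after replacing the zeroes with the mean of the non-zero values:
# Replacing the zero budgets with the mean shrinks the spread and piles mass at a single value
nonZeroMean = movieData.loc[movieData['budget'] > 0, 'budget'].mean()
meanFilled = movieData['budget'].replace(0, nonZeroMean)
movieData['budget'].std(), meanFilled.std()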
#New dataset keeping only rows with realistic budget and revenue values
#Note: the drops must be chained; re-deriving movieData2 from movieData each time would undo the previous filters
movieData2 = movieData.drop(movieData[movieData.budget <= 0].index)
movieData2 = movieData2.drop(movieData2[movieData2.budget_adj <= 0].index)
movieData2 = movieData2.drop(movieData2[(movieData2.revenue <= 0) & (movieData2.revenue_adj <= 0)].index)
movieData2.info()
Lastly the popularity variable will be inspected.
movieData.popularity.describe()
It would seem that popularity is a value from 0 to 1, with several outliers. The official documentation states that the metric is related to the number of votes rather than the vote average, but it gives no specification of its minimum and maximum values, so it is difficult to establish a threshold. This column will be dropped.
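Before dropping it, a quick look at the quantiles of popularity (a minimal check) illustrates how extreme the upper tail is:
# Quantiles of popularity: most values sit below 1 while the maximum is far larger
movieData['popularity'].quantile([0.25, 0.5, 0.75, 0.95, 0.99, 1.0])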
#The column popularity is dropped using the function drop
movieData.drop(['popularity'], axis = 1, inplace = True)
#statistics of the numerical values
movieData.describe()
movieData2.describe()
These values seem a lot better now, and we can use them to get some answers to the proposed questions.
For the purposes of the examination, a well-rated movie will be defined as one with a vote average above 7, a regular movie as one between 6 and 7, and a bad movie as one with an average rating below 6. The descriptive table highlights that these movies received on average around 200 votes. Using these definitions, some masks will be created.
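A quick query supports the figure of roughly 200 votes per movie mentioned above:
# Mean number of votes per movie
movieData['vote_count'].mean()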
Firstly, the variable 'budget' will be inspected in this regard.
#Masks dividing the movies into good, regular and bad
#New variables with Boolean values based on a condition from the original dataframe
Good = movieData.vote_average > 7
Regular = ((movieData.vote_average <= 7) & (movieData.vote_average >= 6))
bad = movieData.vote_average < 6
#How these ratings are related to the budget
#The average budget of movies rated as good
movieData.budget[Good].mean()
#The average budget of movies rated as regular
movieData.budget[Regular].mean()
#The average budget of movies rated as bad
movieData.budget[bad].mean()
It seems that the movies rated as good have a higher average budget than the lower-rated ones.
Now it is time to see their relationship with runtime:
#The average runtime of movies rated as good
movieData.runtime[Good].mean()
#The average runtime of movies rated as regular
movieData.runtime[Regular].mean()
#The average runtime of movies rated as bad
movieData.runtime[bad].mean()
It seems that the movies rated as 'Good' last longer on average.
These findings will be inspected visually:
movieData.budget[Good].hist(color = 'green', label = 'Good Movie', bins = 13);
movieData.budget[Regular].hist(color = 'blue', alpha = 0.2, label = 'Regular Movie', bins = 13);
movieData.budget[bad].hist(color = 'orange', alpha = 0.2, label = 'Bad Movie', bins = 13);
plt.title("Histogram of Movie expenses")
plt.xlabel("Budget (1*10 ^8 $)")
plt.ylabel("Movies")
plt.legend();
This chart shows that the majority of movies have a low budget, but this is driven by the large number of movies with budgets equal to zero. Still, for both good and less good movies, inexpensive productions dominate.
# Visual inspection of the datasets with hist function
movieData.runtime[Good].hist(color = 'green', label = 'Good Movie', bins = 15);
movieData.runtime[Regular].hist(color = 'blue', alpha = 0.2, label = 'Regular Movie', bins = 15);
movieData.runtime[bad].hist(color = 'orange', alpha = 0.5, label = 'Bad Movie', bins = 15);
plt.title("Histogram of Movie duration")
plt.xlabel("Runtime (minutes)")
plt.ylabel("Movies")
plt.legend();
This chart shows that the majority of movies, independent of their rating, last between 100 and 150 minutes.
In order to avoid the effect of the zero values, the same examinations will be performed on the dataset without them.
#Masks dividing the movies into good, regular and bad for the filtered dataset
Good = movieData2.vote_average > 7
Regular = ((movieData2.vote_average <= 7) & (movieData2.vote_average >= 6))
bad = movieData2.vote_average < 6
#the average budget of movies labeled as good:
movieData2.budget[Good].mean()
#the average budget of movies labeled as regular:
movieData2.budget[Regular].mean()
#the average budget of movies labeled as bad:
movieData2.budget[bad].mean()
# Visual inspection of the dataset without budgets equal to zero:
movieData2.budget[Good].hist(color = 'green', label = 'Good Movie');
movieData2.budget[Regular].hist(color = 'blue', alpha = 0.2, label = 'Regular Movie');
movieData2.budget[bad].hist(color = 'orange', alpha = 0.4, label = 'Bad Movie');
plt.title("Histogram of Movie expenses")
plt.xlabel("Budget (1*10 ^8 $)")
plt.ylabel("Movies")
plt.legend();
The trend seems pretty similar for both the filtered and unfiltered datasets: good and regular movies have similar budgets. One thing is clear, the majority of movies do not have excessive budgets.
As an extra, other potential explanatory variables to explore are cast and director:
#The cast unique values and the director statistics are obtained
movieData.cast.nunique()
movieData.director.describe()
The cast column in its present form would be difficult to use. When it comes to directors, there are roughly half as many unique directors as movies, so it is pertinent to inspect whether the most frequent directors are related to the movies with higher votes. First, the most frequent directors will be obtained; pandas can count them with the value_counts method, but in order to store them as a list an approach found on Stack Overflow was utilized:
#first the 15 most frequent directors will be obtained
# n is the number of directors to retrieve
n = 15
# A new variable is created with the directors with more values assigned in the dataset
freqDir = movieData.director.value_counts()[:n].index.tolist()
#following, the movies created by these directors will be listed and their statistics analyzed
moviesTopDir = movieData[movieData['director'].isin(freqDir)]
moviesTopDir.vote_average.describe()
The table shows that the movies directed by the 15 most common directors are on average 'Regular' movies according to our previous definition. The highest-rated one reaches 8.1, but 75% of them fall below 6.9; the standard deviation is very low, hence most of the movies are close to the mean.
The same approach will be used for the production companies and genres: first obtain the most common ones, then see their average rating.
# Unique values for production companies
movieData.production_companies.nunique()
# 15 most common production companies
n = 15
freqPro = movieData.production_companies.value_counts()[:n].index.tolist()
print(freqPro)
#Dataset only with movies produced by these companies
moviesTopComp = movieData[movieData['production_companies'].isin(freqPro)]
moviesTopComp.vote_average.describe()
The table shows that the movies produced by the 15 most common companies are on average 'Bad' movies. The highest-rated one reaches 9.2 (the highest rating of the entire dataset), but 75% of them fall below 6.7; the standard deviation is quite high, hence the values of this subset are very spread out.
Last thing to be inspected are going to be the genres:
# Unique values for genres
movieData.genres.nunique()
# 15 most common genres
# n is the number of genres to retrieve
n = 15
freqGen = movieData.genres.value_counts()[:n].index.tolist()
print(freqGen)
moviesTopGen = movieData[movieData['genres'].isin(freqGen)]
moviesTopGen.vote_average.describe()
The table shows that the movies belonging to the 15 most common genre combinations are on average 'Bad' movies. The highest-rated one reaches 9.2 (the highest rating of the entire dataset), but 75% of them fall below 6.7; the standard deviation is low, meaning that these movies are close to the mean.
movieData.vote_average.mode()
The mode of the vote_average is 6.1 (most common rating among movies).
A last inspection of the cast variable:
movieData.cast.describe()
It seems as if Louis C.K. is the most common value with 6 titles, but this is mainly because multiple actors are stored together in the cast field, separated by a pipe character ('|'), so the counts refer to full cast strings rather than individual actors.
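A sketch of how the pipe-separated field could be split to count individual actors instead (assuming a pandas version that provides Series.explode, 0.25 or later):
# Split each cast string into a list of actors, flatten the lists, and count occurrences
actorCounts = movieData['cast'].str.split('|').explode().value_counts()
actorCounts.head(10)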
The measure of box office success for this analysis will depend on the budget: a successful revenue will be defined as one that is at least 3 times the budget.
#Creation of masks to inspect the movies that follow the criteria explained above
Succesful = movieData.revenue_adj > (3 * movieData.budget_adj)
Bad = movieData.revenue_adj < movieData.budget_adj
#Total of movies that are considered successful
sum(Succesful)
There are more 'bad' movies than successful ones.
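As a quick check of that claim, the complementary count is obtained the same way:
#Total of movies whose revenue did not reach their budget
sum(Bad)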
#mean budget of movies considered successful and bad
movieData.budget[Succesful].mean()
movieData.budget[Bad].mean()
Successful movies spent more on average than 'bad' ones.
#relationship between success in the box office and runtime
movieData.runtime[Succesful].mean()
movieData.runtime[Bad].mean()
These results show that the average successful movie lasts longer than the ones that do not perform as well at the box office. This will be inspected visually.
#duration of movies vs revenue criteria
movieData.runtime[Bad].hist(label = 'Bad Performing Movie');
movieData.runtime[Succesful].hist(label = 'Successful Movie');
plt.title("Histogram of movie duration")
plt.xlabel("Runtime (minutes)")
plt.ylabel("Movies")
plt.legend();
The mean and the histogram show that movies with good box office performance last longer on average.
The categorical variables of this dataset have too many distinct values to draw specific relationships between each category and the success or failure criteria described above. Because of that, the most common values of genres, companies and directors are inspected visually.
#The most common genres are inspected
#Optional: a list of colors to give each bar a distinct color (commented out, currently unused)
#my_colors = list(islice(cycle(['b', 'g', 'r', 'c', 'm', 'y', 'k', 'pink', 'orange', 'gray', 'olive', 'cyan', 'brown', 'purple', 'blue']), None, len(moviesTopGen)))
#With groupby the average adjusted revenues are grouped by genre and plotted
moviesTopGen.groupby('genres').revenue_adj.mean().plot(kind='bar');
plt.title('Avg. revenues of most common genres');
plt.ylabel('Revenues', fontsize=15);
plt.xlabel('Genres', fontsize=15);
This graph shows that among the most common genres, Animation|Family is the one with the highest average revenues.
#The most common directors are inspected
#Optional: a list of colors to give each bar a distinct color (commented out, currently unused)
#my_colors = list(islice(cycle(['b', 'g', 'r', 'c', 'm', 'y', 'k', 'pink', 'orange', 'gray']), None, len(moviesTopDir)))
#With groupby the average adjusted revenues are grouped by director and plotted
moviesTopDir.groupby('director').revenue_adj.mean().plot(kind='bar');
plt.title('Avg. revenues of most common directors');
plt.ylabel('Revenues', fontsize=12);
plt.xlabel('Directors', fontsize=12);
This graph shows that among the most common directors, Steven Spielberg is the one whose movies have the highest average revenues.
#The most common production companies are inspected
#A list of colors intended to give each bar a distinct color (note: not passed to the plot below, so currently unused)
my_colors = list(islice(cycle(['b', 'g', 'r', 'c', 'm', 'y', 'k', 'pink', 'orange', 'gray', 'olive', 'cyan', 'brown', 'purple', 'blue']), None, len(moviesTopComp)))
#With groupby the average adjusted revenues are grouped by production company and plotted
moviesTopComp.groupby('production_companies').revenue_adj.mean().plot(kind='bar');
plt.title('Avg. revenues of most common prod. companies');
plt.ylabel('Revenues', fontsize=15);
plt.xlabel('Production Companies', fontsize=15);
This graph shows that among the most common production companies, Paramount and Columbia are the ones with the highest average revenues. Lastly, the release year of the movies is to be inspected to see whether recent movies are more successful (perhaps because they are spending more):
#relationship between success in the box office and release year
#The mode will be utilized because the average year does not make a lot of sense
movieData.release_year[Succesful].mode()
movieData.release_year[Bad].mode()
The most repeated year among the successful movies was 2014, and 2013 for the bad ones.
A short inspection of the relationship between the revenues and vote averages is carried out:
#Evaluation of the vote average that the movies most successful in the box office get:
movieData[Succesful].vote_average.describe()
This shows that movies with good revenues are, on average, regular movies.
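As a visual complement, a sketch of a scatter plot of adjusted revenue against vote average shows the relationship (if any) between the two:
# Scatter plot of adjusted revenue vs. vote average
movieData.plot(kind = 'scatter', x = 'vote_average', y = 'revenue_adj', alpha = 0.3);
plt.title("Adjusted revenue vs. vote average");
plt.xlabel("Vote average");
plt.ylabel("Adjusted revenue ($)");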
The movie with the highest rating will be inspected:
#Obtain the highest rating and the corresponding movie; using max() avoids hardcoding the value
movieData.describe()
movieTop = movieData[movieData['vote_average'] == movieData['vote_average'].max()]
movieTop
The revenue and budget values of this movie are zero, hence no further information can be gathered from it.
#Obtain the highest rating for the secondary data frame
movieData2.describe()
movieTop2 = movieData2[movieData2['vote_average'] == movieData2['vote_average'].max()]
movieTop2
The analysis in general was challenging due to the variety of the categorical variables. Variables such as cast and director have a strong impact on the outcome of a movie, but in order to be used they had to be prepared. The first question, 'How do the runtime and budget influence the ratings a movie receives?', could be answered by saying that movies that last too long are not among the higher-rated ones, and that the movies with higher ratings have, on average, a higher budget than the ones with worse ratings. That could be due to the presence of outliers and needs to be inspected further, for example to see whether certain runtimes are more common in certain genres. When it comes to the second question, 'How do the runtime and budget influence the box office performance of a movie?', the inspection showed that movies considered successful lasted longer and spent more on average; also, the movies considered successful were less than half of the dataset. In general, the high number of movies lacking budget and revenue values strongly impacted the results obtained: the large share of zero values influences the mean, and many conclusions were drawn from average values. Moreover, inspecting only the most frequent values of the categorical variables has the problem of reducing the samples dramatically. The conclusions drawn from this preliminary examination are insightful but have flaws.