The dataset chosen is the TMDb movie data. The original collection holds detailed information on approximately 10,867 movies spread across 21 columns, including information related to their cast, production companies, budget, runtime, and so on. The properties contained in the dataset are quite diverse: some variables are names (actor and director names), others are ordinal data (ratings), sentences (taglines), or even paragraphs (overviews). The data covers several decades, and the revenues of older years were adjusted so as to be comparable to current ones. The file contains observations that could be used to gain insights into which factors contribute to the success of a film at the box office, or to better ratings from its viewers. In that regard, the dependent variables will be revenue and user rating.
The questions that will direct this introductory investigation are:
- How do the runtime and budget influence the ratings a movie receives?
- How do the runtime and budget influence the box office performance of a movie?
These two will be inspected so as to answer a third question:
#The packages required to carry out the analysis are pandas, numpy and matplotlib
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from itertools import cycle, islice
%matplotlib inline
The dataset was downloaded as a CSV file. In order to carry out the exploratory analysis, some adjustments had to be made; these are described in the following sections.
As stated in the previous section, this dataset holds information on movies. The first step is to load the dataset and, from there, examine the variables that could be useful for answering the research questions, along with their corresponding values.
The data was downloaded as a CSV file and opened with the method pandas.read_csv. The first attempt to open the file was unsuccessful due to a decoding error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 8: invalid continuation byte
After investigating the error online, one proposed solution was setting the encoding to 'latin-1'. The official pandas documentation lists the encoding as an optional parameter of read_csv. Further reading highlighted that 'latin-1' maps every possible byte to a character, so the file is read no matter what. As the real encoding of the file was not known, the encoding was set to 'latin-1'. This means that some of the values contained in the rows could be decoded into the wrong characters.
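If one wanted to verify the guess instead of forcing 'latin-1', a minimal sketch using the third-party chardet package (an assumption; it is not part of the standard library) could estimate the encoding from a sample of the raw bytes:
# Estimate the file's encoding from the first ~100 KB of raw bytes (requires chardet to be installed)
import chardet
with open("tmdb-movies.csv", "rb") as rawFile:
    guess = chardet.detect(rawFile.read(100000))
guess  # a dict such as {'encoding': ..., 'confidence': ...}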
# Read and inspect the dataset
movieData = pd.read_csv("tmdb-movies.csv", encoding = 'latin-1')
movieData.head()
movieData.head(0)
The first inspection of the data highlighted that the first row was just a statement specifying where the data came from. This line was not necessary, so the method read_csv had to be told to ignore it. This was accomplished with the parameter skiprows which, according to the pandas documentation, accepts the number of lines to skip at the start of the file. As the row to be deleted was the zeroth one, the number specified was 1.
# Read and inspect the dataset after the adjustments
movieData = pd.read_csv("tmdb-movies.csv", skiprows = 1, encoding='latin-1')
movieData.head(2)
The columns have mostly self-explanatory names that should help answer the analysis questions. The first column is the exception, with a very long name that makes the inspection process difficult. In order to facilitate the preliminary assessment, this column will be renamed to ID_2. Perhaps in later parts of the analysis it could be renamed to a more pertinent label, or simply dropped if it has no use. First, it would be useful to have a look at all the column labels:
# All the columns labels will be inspected
movieData.columns
So far, the names of the other columns will remain the same; they are all lower case and have no spaces. As the name of the first column is too long to type and copying it could lead to mistakes, its value will be assigned to a variable and then renamed.
#We assign the first column name to a variable to ensure we have the right value
first = movieData.columns[0]
#We use the rename method and inspect the first two rows of the dataset
movieData.rename(columns = {first: "ID_2"}, inplace = True)
movieData.head(2)
Now the file seems legible. So far, several things that can be improved have been spotted, but first a quick inspection of the columns and their content was carried out to better understand the dataset. As the first values have been inspected, it is time to see the end of the dataset.
# Inspection of the last values of the dataset
movieData.tail(2)
Here one can see that all fields are missing in the last row, which seems to be an empty observation. Further inspections are aimed at determining the shape of the dataset, its data types, and general information:
# Obtain the number of rows and columns of the dataset
movieData.shape
# General information on the dataset
movieData.info()
The dataset has many null values that have to be handled.
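As a complement to info(), a quick per-column count of the missing values makes the nulls easier to spot:
# Count the null values in each column
movieData.isnull().sum()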
#the data type of its variables
movieData.dtypes
Some of these datatypes have to be further inspected.
# Look at the unique values
movieData.nunique()
The unique values display interesting facts. For instance, for such a large dataset the runtime values have a very narrow range; the same applies to budget.
It is pertinent to carry out an inspection for duplicated entries:
# With the function sum, we obtain the total number of duplicated rows
sum(movieData.duplicated())
# As there is just one duplicated entry, it will be dropped:
movieData.drop_duplicates(inplace = True)
It is pertinent to see the statistics of the dataset and analyze its variables and their measurement units:
# Inspection of the descriptive statistics of the numerical variables with 'describe'
movieData.describe()
The columns holding numerical variables are ID_2, popularity, budget, revenue, runtime, vote_count, vote_average, release_year and budget_adj. Most of them have pretty self-explanatory names. ID_2 seems to serve as a unique identifier for each movie; popularity looks like a measure of the movie's popularity, although it is not clear how it is obtained; budget represents the funds used to produce the movie, and budget_adj normalizes those funds to minimize the impact of inflation over time; vote_count contains the total number of votes each movie received; vote_average is the average score the users have given to the movie; and release_year is the year in which the movie premiered.
Some general observations so far are:
- Some variables have zero as their minimum value, which is not possible given what they represent: budget, revenue, runtime and budget_adj. No movie could have a runtime of zero or a budget of $0.00 (see the quick check after this list);
- The 25th and 50th percentiles of budget and budget_adj are also zero, which could be due to missing data; this applies to revenue as well;
- The runtime must be stored in minutes. It seems the longest movie lasted 15 hours, while the majority lasted less than 2 hours;
- Most of the movies listed have a vote average around 6; the popularity values span from 0 to 32, but most of the movies have a popularity of less than 1. The value of 32 could be due to an error;
- Movies span from 1960 to 2015.
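As a quick check of the zero-value observation above, the rows equal to zero can be counted per column:
# Count how many rows hold a zero in each of the suspicious columns
(movieData[['budget', 'revenue', 'runtime', 'budget_adj']] == 0).sum()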
And finally, the histogram of the whole dataset is to be shown.
# A visual inspection of the variables with 'hist'
movieData.hist(figsize = (11,8));
Many of these histograms are skewed to the right, which could be due to missing data filled with zeroes or null values. The release_year column shows that the movies included are mostly from the 2000s. The vote_average variable could have a normal distribution, the only one of the whole dataset.
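To support the claim that the zeroes drive the skew, a quick sketch of the budget histogram with the zero values excluded can be drawn:
# Budget histogram excluding the zero values, to see the shape without the zero spike
movieData.loc[movieData['budget'] > 0, 'budget'].hist(bins = 30);
plt.title("Budget histogram, zero values excluded");
plt.xlabel("Budget ($)");
plt.ylabel("Movies");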
The inspections carried out in the previous section helped to better understand the dataset and revealed what can be done to make it workable. Some work already done was cleaning up the column labels and dropping the duplicated row. First, there are some columns with no apparent use for the analysis that will be eliminated: the first two columns, which apparently contain identifiers, plus homepage and tagline. Additionally, there are other columns from which it would be very difficult to obtain advantageous information. For instance, the cast is a key part of a movie's success, but to make it a more functional column one would have to rank the actors and give them some sort of weight. In spite of that, the cast will be kept in order to check its impact on the results and then verify whether the actors are indeed A-list ones. The same reasoning applies to overview and keywords, but these two will be dropped. Secondly, the data has some missing values that have to be handled, and it has to be checked whether there are fully null rows. Subsequently, the data types of the remaining columns have to be inspected and changed if needed.
First, the columns with little use will be dropped; these are: 'ID_2', 'imdb_id', 'homepage', 'tagline', 'keywords' and 'overview'.
movieData.drop(['ID_2', 'imdb_id', 'homepage', 'tagline', 'keywords', 'overview'], axis = 1, inplace = True)
The following lines handle the missing values, which will be treated differently depending on which columns they belong to.
# Glance at where the values are missing from the columns kept.
movieData.info()
So far, the columns with missing values are cast, director, genres, and production_companies. These all hold categorical data, so assigning any measure of central tendency to the null values would require extra work. First, the rows where all values are missing will be dropped:
# Rows with all columns null are dropped with dropna and the parameter 'how' set to 'all'
movieData.dropna(how = 'all', inplace = True)
movieData.info()
The column 'production_companies' has many missing values; if all the rows with missing values were dropped, the sample would lose at least a thousand observations. It is important to inspect this column further before making any decision.
# Inspection of the counts of production companies values:
movieData['production_companies'].value_counts()
It seems that the most relevant production companies have the most occurrences in the dataset, hence it could be useful to include them in the analysis. The rows with null values will be set to 'Unknown'.
# The column production_companies will be modified by giving the value 'Unknown' to all null rows
movieData['production_companies'] = movieData['production_companies'].fillna('Unknown')
movieData.info()
On the other hand, the rows missing cast, director or genres are not so many, so they will be dropped.
#Rows with missing values will be dropped, decreasing the number of observations by a few dozen
movieData.dropna(inplace = True)
movieData.info()
The data types of the columns that are not numerical are to be examined. Original title, cast, director, production companies and genres should be strings.
#verification of the columns that should be strings, checking a sample row with isinstance:
isinstance(movieData['original_title'].iloc[5], str) and isinstance(movieData['cast'].iloc[5], str)
isinstance(movieData['director'].iloc[5], str) and isinstance(movieData['genres'].iloc[5], str)
isinstance(movieData['production_companies'].iloc[5], str)
So far, all the values kept as strings are in the proper data type. Exceptions are release_date (which should be stored as a date), release_year (which should be an integer) and revenue_adj (which should be a float). These will be changed.
#Verification of the format of the release_date values:
(movieData['release_date'][5])
#The dates will be transformed to datetype with the to_datetime method:
movieData['release_date'] = pd.to_datetime(movieData['release_date'])
This operation turned the movies from the late 1960s into the 2060s. This problem of mixed centuries happens when a dataset stores the year as a two-digit value while containing dates from different centuries. As the dataset contains no movies released after 2015, any parsed release date in 2020 or later must belong to the previous century, so those rows will be moved back 100 years. This approach to the problem was found on Stack Overflow.
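A minimal illustration of this behavior, using a hypothetical date string: pandas interprets two-digit years relative to the current date, so a year such as '66' can be parsed into the 2060s.
# Hypothetical example: the two-digit year '66' is parsed into the next century
pd.to_datetime('6/15/66')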
#Movies whose parsed year is 2020 or later will be moved back to the past century.
movieData.loc[movieData['release_date'].dt.year >= 2020, 'release_date'] -= pd.DateOffset(years=100)
#inspection of the results
movieData['release_date'].describe()
# The release year will be changed to int using the astype method:
movieData.release_year = movieData.release_year.astype(int)
movieData.release_year.dtypes
#Inspection of what revenue_adj looks like before the change to float
movieData['revenue_adj'][9]
#The values looked like numbers, so the astype method to float could be useful
#movieData.revenue_adj = movieData.revenue_adj.astype(float)
#This line of code did not work because one row held the garbled value '0</pre></body></html>Xtext/csv'
#The row containing that value was found with 'loc' and dropped with 'drop'
movieData.loc[movieData['revenue_adj']=='0</pre></body></html>Xtext/csv']
movieData.drop(movieData.loc[movieData['revenue_adj']=='0</pre></body></html>Xtext/csv'].index, inplace=True)
#With the wrong value gone, the conversion was run again
movieData.revenue_adj = movieData.revenue_adj.astype(float)
Finally, the numerical variables whose minimum values were zero will be inspected:
# New dataset with all the rows whose budget value equals to zero
zero_budget = movieData.loc[movieData['budget']==0,:]
zero_budget.budget.describe()
#This shows that half of the dataset lacks budget values
# New dataset with all the rows whose revenue value equals zero
zero_rev = movieData.loc[movieData['revenue']==0,:]
zero_rev.revenue.describe()
#This shows that half of the dataset lacks revenue values
The missing revenue and budget values could introduce bias into the analysis. The options are to fill these values with the mean or to drop the rows. Filling with the mean would turn the original distribution of the data into an unrealistic one. Because these columns are important for the analysis, dropping the columns themselves is not an option. On the other hand, dropping the rows with zero values would cut the observations roughly in half, but at least the remaining observations would have realistic data. With these considerations in mind, the rows with real values of revenue and budget will be saved into a new dataset, and both datasets will be investigated.
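As a quick illustration of why mean-filling was rejected (a sketch, not part of the cleaning itself), one can compare the spread of the budget column before and after replacing the zeroes with the mean of the non-zero values:
# Replacing the zero budgets with the mean shrinks the spread and piles mass at a single value
nonZeroMean = movieData.loc[movieData['budget'] > 0, 'budget'].mean()
meanFilled = movieData['budget'].replace(0, nonZeroMean)
movieData['budget'].std(), meanFilled.std()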
#New dataset keeping only rows with realistic budget and revenue values
#Note: the drops must be chained; re-deriving movieData2 from movieData each time would undo the previous filters
movieData2 = movieData.drop(movieData[movieData.budget <= 0].index)
movieData2 = movieData2.drop(movieData2[movieData2.budget_adj <= 0].index)
movieData2 = movieData2.drop(movieData2[(movieData2.revenue <= 0) & (movieData2.revenue_adj <= 0)].index)
movieData2.info()
Lastly the popularity variable will be inspected.
movieData.popularity.describe()
It would seem that popularity is a value from 0 to 1, with several outliers. The official documentation states that the metric is related to the number of votes rather than the vote average, but it gives no specification of its minimum and maximum values, so it is difficult to establish a threshold. This column will be dropped.
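Before dropping it, a quick look at the quantiles of popularity (a minimal check) illustrates how extreme the upper tail is:
# Quantiles of popularity: most values sit below 1 while the maximum is far larger
movieData['popularity'].quantile([0.25, 0.5, 0.75, 0.95, 0.99, 1.0])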
#The column popularity is dropped using the function drop
movieData.drop(['popularity'], axis = 1, inplace = True)
#statistics of the numerical values
movieData.describe()
movieData2.describe()
These values seem a lot better now, and we can use them to get some answers to the proposed questions.
For the purposes of the examination, a well-rated movie will be defined as one with a vote average above 7, a regular movie as one between 6 and 7, and a bad movie as one with an average rating below 6. The descriptive table highlights that these movies received on average around 200 votes. Using these definitions, some masks will be created.
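A quick query supports the figure of roughly 200 votes per movie mentioned above:
# Mean number of votes per movie
movieData['vote_count'].mean()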
Firstly, the variable 'budget' will be inspected in this regard.
#Masks dividing the movies into good, regular and bad
#New variables with Boolean values based on a condition from the original dataframe
Good = movieData.vote_average > 7
Regular = ((movieData.vote_average <= 7) & (movieData.vote_average >= 6))
bad = movieData.vote_average < 6
#How these ratings are related to the budget
#The average budget of movies rated as good
movieData.budget[Good].mean()
#The average budget of movies rated as regular
movieData.budget[Regular].mean()
#The average budget of movies rated as bad
movieData.budget[bad].mean()
It seems that the movies rated as good have a higher average budget than the lower-rated ones.
Now it is time to see their relationship with runtime:
#The average runtime of movies rated as good
movieData.runtime[Good].mean()
#The average runtime of movies rated as regular
movieData.runtime[Regular].mean()
#The average runtime of movies rated as bad
movieData.runtime[bad].mean()
It seems that the movies rated as 'Good' last longer on average.
These findings will be inspected visually:
movieData.budget[Good].hist(color = 'green', label = 'Good Movie', bins = 13);
movieData.budget[Regular].hist(color = 'blue', alpha = 0.2, label = 'Regular Movie', bins = 13);
movieData.budget[bad].hist(color = 'orange', alpha = 0.2, label = 'Bad Movie', bins = 13);
plt.title("Histogram of Movie expenses")
plt.xlabel("Budget (1*10 ^8 $)")
plt.ylabel("Movies")
plt.legend();
This chart shows that the majority of movies have a low budget, but this is driven by the large number of movies with budgets equal to zero. Still, for both good and less good movies, inexpensive productions dominate.
# Visual inspection of the datasets with hist function
movieData.runtime[Good].hist(color = 'green', label = 'Good Movie', bins = 15);
movieData.runtime[Regular].hist(color = 'blue', alpha = 0.2, label = 'Regular Movie', bins = 15);
movieData.runtime[bad].hist(color = 'orange', alpha = 0.5, label = 'Bad Movie', bins = 15);
plt.title("Histogram of Movie duration")
plt.xlabel("Runtime (minutes)")
plt.ylabel("Movies")
plt.legend();
This chart shows that the majority of movies, independent of their rating, last between 100 and 150 minutes.
In order to avoid the effect of the zero values, the same examinations will be performed on the dataset without them.
#Masks dividing the movies into good, regular and bad for the filtered dataset
Good = movieData2.vote_average > 7
Regular = ((movieData2.vote_average <= 7) & (movieData2.vote_average >= 6))
bad = movieData2.vote_average < 6
#the average budget of movies labeled as good:
movieData2.budget[Good].mean()
#the average budget of movies labeled as regular:
movieData2.budget[Regular].mean()
#the average budget of movies labeled as bad:
movieData2.budget[bad].mean()
# Visual inspection of the dataset without budgets equal to zero:
movieData2.budget[Good].hist(color = 'green', label = 'Good Movie');
movieData2.budget[Regular].hist(color = 'blue', alpha = 0.2, label = 'Regular Movie');
movieData2.budget[bad].hist(color = 'orange', alpha = 0.4, label = 'Bad Movie');
plt.title("Histogram of Movie expenses")
plt.xlabel("Budget (1*10 ^8 $)")
plt.ylabel("Movies")
plt.legend();
The trend seems pretty similar for both the filtered and unfiltered datasets: good and regular movies have similar budgets. One thing is clear, the majority of movies do not have excessive budgets.
As an extra, other potential explanatory variables to explore are cast and director:
#The cast unique values and the director statistics are obtained
movieData.cast.nunique()
movieData.director.describe()
The cast column in its present form would be difficult to use. When it comes to directors, there are roughly half as many unique directors as movies, so it is pertinent to inspect whether the most frequent directors are related to the movies with higher votes. First, the most frequent directors will be obtained; pandas can count them with the value_counts method, but in order to store them as a list an approach found on Stack Overflow was utilized:
#first the 15 most frequent directors will be obtained
# n is the number of directors to retrieve
n = 15
# A new variable is created with the directors with more values assigned in the dataset
freqDir = movieData.director.value_counts()[:n].index.tolist()
#following, the movies created by these directors will be listed and their statistics analyzed
moviesTopDir = movieData[movieData['director'].isin(freqDir)]
moviesTopDir.vote_average.describe()
The table shows that the movies directed by the 15 most common directors are on average 'Regular' movies according to our previous definition. The highest-rated one reaches 8.1, but 75% of them fall below 6.9; the standard deviation is very low, hence most of the movies are close to the mean.
The same approach will be used for the production companies and genres: first obtain the most common ones, then see their average rating.
# Unique values for production companies
movieData.production_companies.nunique()
# 15 most common production companies
n = 15
freqPro = movieData.production_companies.value_counts()[:n].index.tolist()
print(freqPro)
#Dataset only with movies produced by these companies
moviesTopComp = movieData[movieData['production_companies'].isin(freqPro)]
moviesTopComp.vote_average.describe()
The table shows that the movies produced by the 15 most common companies are on average 'Bad' movies. The highest-rated one reaches 9.2 (the highest rating of the entire dataset), but 75% of them fall below 6.7; the standard deviation is quite high, hence the values of this subset are very spread out.
Last thing to be inspected are going to be the genres:
# Unique values for genres
movieData.genres.nunique()
# 15 most common genres
# n is the number of genres to retrieve
n = 15
freqGen = movieData.genres.value_counts()[:n].index.tolist()
print(freqGen)
moviesTopGen = movieData[movieData['genres'].isin(freqGen)]
moviesTopGen.vote_average.describe()
The table shows that the movies belonging to the 15 most common genre combinations are on average 'Bad' movies. The highest-rated one reaches 9.2 (the highest rating of the entire dataset), but 75% of them fall below 6.7; the standard deviation is low, meaning that these movies are close to the mean.
movieData.vote_average.mode()
The mode of the vote_average is 6.1 (most common rating among movies).
A last inspection of the cast variable:
movieData.cast.describe()
It seems as if Louis C.K. is the most common value with 6 titles, but this is mainly because multiple actors are stored together in the cast field, separated by a pipe character ('|'), so the counts refer to full cast strings rather than individual actors.
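A sketch of how the pipe-separated field could be split to count individual actors instead (assuming a pandas version that provides Series.explode, 0.25 or later):
# Split each cast string into a list of actors, flatten the lists, and count occurrences
actorCounts = movieData['cast'].str.split('|').explode().value_counts()
actorCounts.head(10)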
The measure of box office success for this analysis will depend on the budget: a successful revenue will be defined as one that is at least 3 times the budget.
#Creation of masks to inspect the movies that follow the criteria explained above
Succesful = movieData.revenue_adj > (3 * movieData.budget_adj)
Bad = movieData.revenue_adj < movieData.budget_adj
#Total of movies that are considered successful
sum(Succesful)
There are more 'bad' movies than successful ones.
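As a quick check of that claim, the complementary count is obtained the same way:
#Total of movies whose revenue did not reach their budget
sum(Bad)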
#mean budget of movies considered successful and bad
movieData.budget[Succesful].mean()
movieData.budget[Bad].mean()
Successful movies spent more on average than 'bad' ones.
#relationship between success in the box office and runtime
movieData.runtime[Succesful].mean()
movieData.runtime[Bad].mean()
These results show that the average successful movie lasts longer than the ones that do not perform as well at the box office. This will be inspected visually.
#duration of movies vs revenue criteria
movieData.runtime[Bad].hist(label = 'Bad Performing Movie');
movieData.runtime[Succesful].hist(label = 'Successful Movie');
plt.title("Histogram of movie duration")
plt.xlabel("Runtime (minutes)")
plt.ylabel("Movies")
plt.legend();
The mean and the histogram show that movies with good box office performance last longer on average.
The categorical variables of this dataset have too many distinct values to draw specific relationships between each category and the success or failure criteria described above. Because of that, the most common values of genres, companies and directors are inspected visually.
#The most common genres are inspected
#Optional: a list of colors to give each bar a distinct color (commented out, currently unused)
#my_colors = list(islice(cycle(['b', 'g', 'r', 'c', 'm', 'y', 'k', 'pink', 'orange', 'gray', 'olive', 'cyan', 'brown', 'purple', 'blue']), None, len(moviesTopGen)))
#With groupby the average adjusted revenues are grouped by genre and plotted
moviesTopGen.groupby('genres').revenue_adj.mean().plot(kind='bar');
plt.title('Avg. revenues of most common genres');
plt.ylabel('Revenues', fontsize=15);
plt.xlabel('Genres', fontsize=15);
This graph shows that among the most common genres, Animation|Family is the one with the highest average revenues.
#The most common directors are inspected
#Optional: a list of colors to give each bar a distinct color (commented out, currently unused)
#my_colors = list(islice(cycle(['b', 'g', 'r', 'c', 'm', 'y', 'k', 'pink', 'orange', 'gray']), None, len(moviesTopDir)))
#With groupby the average adjusted revenues are grouped by director and plotted
moviesTopDir.groupby('director').revenue_adj.mean().plot(kind='bar');
plt.title('Avg. revenues of most common directors');
plt.ylabel('Revenues', fontsize=12);
plt.xlabel('Directors', fontsize=12);
This graph shows that among the most common directors, Steven Spielberg is the one whose movies have the highest average revenues.
#The most common production companies are inspected
#A list of colors intended to give each bar a distinct color (note: not passed to the plot below, so currently unused)
my_colors = list(islice(cycle(['b', 'g', 'r', 'c', 'm', 'y', 'k', 'pink', 'orange', 'gray', 'olive', 'cyan', 'brown', 'purple', 'blue']), None, len(moviesTopComp)))
#With groupby the average adjusted revenues are grouped by production company and plotted
moviesTopComp.groupby('production_companies').revenue_adj.mean().plot(kind='bar');
plt.title('Avg. revenues of most common prod. companies');
plt.ylabel('Revenues', fontsize=15);
plt.xlabel('Production Companies', fontsize=15);
This graph shows that among the most common production companies, Paramount and Columbia are the ones with the highest average revenues. Lastly, the release year of the movies is to be inspected to see whether recent movies are more successful (perhaps because they are spending more):
#relationship between success in the box office and release year
#The mode will be utilized because the average year does not make a lot of sense
movieData.release_year[Succesful].mode()
movieData.release_year[Bad].mode()
The most repeated year among the successful movies was 2014, and 2013 for the bad ones.
A short inspection of the relationship between the revenues and vote averages is carried out:
#Evaluation of the vote average that the movies most successful in the box office get:
movieData[Succesful].vote_average.describe()
This shows that movies with good revenues are, on average, regular movies.
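As a visual complement, a sketch of a scatter plot of adjusted revenue against vote average shows the relationship (if any) between the two:
# Scatter plot of adjusted revenue vs. vote average
movieData.plot(kind = 'scatter', x = 'vote_average', y = 'revenue_adj', alpha = 0.3);
plt.title("Adjusted revenue vs. vote average");
plt.xlabel("Vote average");
plt.ylabel("Adjusted revenue ($)");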
The movie with the highest rating will be inspected:
#Obtain the highest rating and the corresponding movie; using max() avoids hardcoding the value
movieData.describe()
movieTop = movieData[movieData['vote_average'] == movieData['vote_average'].max()]
movieTop
The revenue and budget values of this movie are zero, hence no further information can be gathered from it.
#Obtain the highest rating for the secondary data frame
movieData2.describe()
movieTop2 = movieData2[movieData2['vote_average'] == movieData2['vote_average'].max()]
movieTop2
The analysis in general was challenging due to the variety of the categorical variables. Variables such as cast and director have a strong impact on the outcome of a movie, but in order to be used they had to be prepared. The first question, 'How do the runtime and budget influence the ratings a movie receives?', could be answered by saying that movies that last too long are not among the higher-rated ones, and that the movies with higher ratings have, on average, a higher budget than the ones with worse ratings. That could be due to the presence of outliers and needs to be inspected further, for example to see whether certain runtimes are more common in certain genres. When it comes to the second question, 'How do the runtime and budget influence the box office performance of a movie?', the inspection showed that movies considered successful lasted longer and spent more on average; also, the movies considered successful were less than half of the dataset. In general, the high number of movies lacking budget and revenue values strongly impacted the results obtained: the large share of zero values influences the mean, and many conclusions were drawn from average values. Moreover, inspecting only the most frequent values of the categorical variables has the problem of reducing the samples dramatically. The conclusions drawn from this preliminary examination are insightful but have flaws.