Project: Factors influencing movie revenue and ratings

Table of Contents

Introduction

The dataset chosen is the TMDb movie data. The original collection holds detailed information of approximately 10,867 movies within 21 columns including information related to their casting, producer, budget, runtime, and so on. The properties contained in the dataset is quite diverse. Some of these variables are names (actors and producers names), others are ordinal data (ratings), sentences (tagline) and even paragraphs (overview). The data covers several decades, and it had the revenues of older years adjusted so as to be comparable to current ones. The file contains observations that could be used to get insights on what factors could contribute to the success of a film at a box office, or to get better ratings by its viewers. In that regard, the dependent variables will be revenue and user's rating.

The questions that will direct this introductory investigation are:

  • How the runtime and budget influence the ratings a movie receives?
  • How the runtime and budget influence the box office performance of a movie??
  • These two will be inspected so as to answer a third question:

  • Are the box office performance and the ratings related?
  • In [39]:
    #The packages required to carry out the analysis are pandas, numpy and matplotlib
    
    import pandas as pd
    import numpy as np
    from matplotlib import pyplot as plt
    from itertools import cycle, islice
    
    %matplotlib inline
    

    Data Wrangling

    The data set was downloaded as a csv file. In order to carry out the exploratory analysis some adjustments had to be done and are described in the following sections.

    General Properties

    As stated in the previous section this dataset holds information of movies. The first step is to load the data set and from there examine the variables that could be useful to answer the research questions and their corresponding values.

    The data was downloaded as a csv file. The method pandas.read_csv was used so as to open it. The first attempt to open the file was unsucesfull due to a decoding error :</p>

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 8: invalid continuation byte

    After investigating the error online one solution proposed was encoding the string to latin-1'. Official panda’s documentation introduced the encoding as an optional parameter. Further reading highlighted that specifying the 'latin-1' encoding would just ignore those bites that are not being decoded and read the file no matter what. As the real encoding of the file was not known, the encoding was set to 'latin-1'. This means that some of the values contained in the rows could be unknown characters.

    In [40]:
    # Read and inspect the dataset
    
    movieData = pd.read_csv("tmdb-movies.csv", encoding = 'latin-1')
    movieData.head()
    movieData.head(0)
    
    Out[40]:
    bplist00Ñ_WebMainResourceÕ\t

    The first inspection of the data highlighted that the first row was just a statement specifying where the data came from. This line was not necessary, so we had to specify to the method read_csv that the first row had to be ignored. This was accomplished with the parameter skiprow. Pandas documentation stated that this parameter has to be a 0-indexed integer. As the row to be deleted was the zeroth one, the number specified was 1.

    In [41]:
    # Read and inspect the dataset after the adjustments
    movieData = pd.read_csv("tmdb-movies.csv", skiprows = 1, encoding='latin-1')
    movieData.head(2)
    
    Out[41]:
    _WebResourceTextEncodingName^WebResourceURL_WebResourceFrameName_WebResourceData_WebResourceMIMETypeUUTF-8_^https://d17h27t6h515a5.cloudfront.net/topher/2017/October/59dd1c4c_tmdb-movies/tmdb-movies.csvPO imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
    0 135397.0 tt0369610 32.985763 150000000.0 1.513529e+09 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124.0 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562.0 6.5 2015.0 1.379999e+08 1392445892.5238
    1 76341.0 tt1392190 28.419936 150000000.0 3.784364e+08 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120.0 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185.0 7.1 2015.0 1.379999e+08 348161292.489031

    2 rows × 21 columns

    The columns seem to have very self-explanatory names that could be helpful to answer the analysis questions. The first column is the exception with a very long name that makes the inspection process a little difficult. In order to facilitate the preliminary assessment this column will be renamed to ID_2. Perhaps in later parts of the analysis one could rename it to a more pertinent label or just drop it, if it has no use. First, would be useful to have a look at all the columns labels:

    In [42]:
    # All the columns labels will be inspected 
    movieData.columns
    
    Out[42]:
    Index(['
    
    _WebResourceTextEncodingName^WebResourceURL_WebResourceFrameName_WebResourceData_WebResourceMIMETypeUUTF-8_^https://d17h27t6h515a5.cloudfront.net/topher/2017/October/59dd1c4c_tmdb-movies/tmdb-movies.csvPO',
           'imdb_id', 'popularity', 'budget', 'revenue', 'original_title', 'cast',
           'homepage', 'director', 'tagline', 'keywords', 'overview', 'runtime',
           'genres', 'production_companies', 'release_date', 'vote_count',
           'vote_average', 'release_year', 'budget_adj', 'revenue_adj'],
          dtype='object')

    So far, the names of the other columns will remain the same, they are all lower case and have no spaces. As the name of the first column is too long to type it and copying it could lead to mistakes, its value will be asigned to a variable and then renamed.

    In [43]:
    #We asign the first column name to a variable to ensure we have the right value
    first = movieData.columns[0]
    
    #We use the rename method and inspect the first two rows of the dataset
    movieData.rename(columns = {first: "ID_2"}, inplace = True)
    movieData.head(2)
    
    Out[43]:
    ID_2 imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
    0 135397.0 tt0369610 32.985763 150000000.0 1.513529e+09 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124.0 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562.0 6.5 2015.0 1.379999e+08 1392445892.5238
    1 76341.0 tt1392190 28.419936 150000000.0 3.784364e+08 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120.0 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185.0 7.1 2015.0 1.379999e+08 348161292.489031

    2 rows × 21 columns

    Now the file seems legible. So far, things that can be improved have been spotted, but first a quick inspection of the columns and their content was carried out to better understand the dataset. As the first values have been inspected, it is time to see the end of the dataset.

    In [44]:
    # Inspection of the last values of the dataset
    movieData.tail(2)
    
    Out[44]:
    ID_2 imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
    10865 22293.0 tt0060666 0.035919 19000.0 0.0 Manos: The Hands of Fate Harold P. Warren|Tom Neyman|John Reynolds|Dian... NaN Harold P. Warren It's Shocking! It's Beyond Your Imagination! ... A family gets lost on the road and stumbles up... 74.0 Horror Norm-Iris 11/15/66 15.0 1.5 1966.0 127642.279154 0</pre></body></html>Xtext/csv
    10866 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

    2 rows × 21 columns

    Here one can see that all fields are missing in the last row, that seems as an empty observation. Further inspections are aimed at determining the shape of the dataset, its data types and general information:

    In [45]:
    # Obtain the amount of columns and rows of the dataset
    movieData.shape
    
    Out[45]:
    (10867, 21)
    In [46]:
    # General information on the dataset
    movieData.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10867 entries, 0 to 10866
    Data columns (total 21 columns):
     #   Column                Non-Null Count  Dtype  
    ---  ------                --------------  -----  
     0   ID_2                  10866 non-null  float64
     1   imdb_id               10856 non-null  object 
     2   popularity            10866 non-null  float64
     3   budget                10866 non-null  float64
     4   revenue               10866 non-null  float64
     5   original_title        10866 non-null  object 
     6   cast                  10790 non-null  object 
     7   homepage              2936 non-null   object 
     8   director              10822 non-null  object 
     9   tagline               8042 non-null   object 
     10  keywords              9373 non-null   object 
     11  overview              10862 non-null  object 
     12  runtime               10866 non-null  float64
     13  genres                10843 non-null  object 
     14  production_companies  9836 non-null   object 
     15  release_date          10866 non-null  object 
     16  vote_count            10866 non-null  float64
     17  vote_average          10866 non-null  float64
     18  release_year          10866 non-null  float64
     19  budget_adj            10866 non-null  float64
     20  revenue_adj           10866 non-null  object 
    dtypes: float64(9), object(12)
    memory usage: 1.7+ MB
    

    The dataset has many null values that have to be handled.

    In [47]:
    #the data type of its variables
    movieData.dtypes
    
    Out[47]:
    ID_2                    float64
    imdb_id                  object
    popularity              float64
    budget                  float64
    revenue                 float64
    original_title           object
    cast                     object
    homepage                 object
    director                 object
    tagline                  object
    keywords                 object
    overview                 object
    runtime                 float64
    genres                   object
    production_companies     object
    release_date             object
    vote_count              float64
    vote_average            float64
    release_year            float64
    budget_adj              float64
    revenue_adj              object
    dtype: object

    Some of these datatypes have to be further inspected.

    In [48]:
    # Look at the unique values
    movieData.nunique()
    
    Out[48]:
    ID_2                    10865
    imdb_id                 10855
    popularity              10814
    budget                    557
    revenue                  4702
    original_title          10571
    cast                    10719
    homepage                 2896
    director                 5067
    tagline                  7997
    keywords                 8804
    overview                10847
    runtime                   247
    genres                   2039
    production_companies     7445
    release_date             5909
    vote_count               1289
    vote_average               72
    release_year               56
    budget_adj               2614
    revenue_adj              4841
    dtype: int64

    The unique values display interesting facts. For instance, for such a large dataset the runtime values have a very narrow range, same with budget.

    It is pertinent to carry out an inspection for duplicated entries:

    In [49]:
    # With the function sum, we obtain a total of the duplicated values
    sum(movieData.duplicated())
    
    Out[49]:
    1
    In [50]:
    # As the duplicated entry is just one, it will be dropped:
    movieData.drop_duplicates(inplace = True)
    

    It is pertinent to see the statistics of the data set and analyze its variables and its measurement units:

    In [51]:
    # Inspection of the descriptive statistics of the numerical variables with 'describe'
    movieData.describe()
    
    Out[51]:
    ID_2 popularity budget revenue runtime vote_count vote_average release_year budget_adj
    count 10865.000000 10865.000000 1.086500e+04 1.086500e+04 10865.000000 10865.000000 10865.000000 10865.000000 1.086500e+04
    mean 66066.374413 0.646446 1.462429e+07 3.982690e+07 102.071790 217.399632 5.975012 2001.321859 1.754989e+07
    std 92134.091971 1.000231 3.091428e+07 1.170083e+08 31.382701 575.644627 0.935138 12.813260 3.430753e+07
    min 5.000000 0.000065 0.000000e+00 0.000000e+00 0.000000 10.000000 1.500000 1960.000000 0.000000e+00
    25% 10596.000000 0.207575 0.000000e+00 0.000000e+00 90.000000 17.000000 5.400000 1995.000000 0.000000e+00
    50% 20662.000000 0.383831 0.000000e+00 0.000000e+00 99.000000 38.000000 6.000000 2006.000000 0.000000e+00
    75% 75612.000000 0.713857 1.500000e+07 2.400000e+07 111.000000 146.000000 6.600000 2011.000000 2.085325e+07
    max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000 9767.000000 9.200000 2015.000000 4.250000e+08

    The columns holding numerical variables are ID2, popularity, budget, revenue, runtime, vote_count, vote_average, release_year and budjet_adj. Most of them have pretty self-explanatory names. ID_2 seems to serve as an unique identifier for each movie; popularity looks as a measure of its popularity but it is not clear from where it is obtained; budget represent the funds used to produce the movie and budget_adj is a value that normalizes the funds minimizing the impacts of inflation over time; vote count contains the total votes for each movie; average is the average punctuation the users have given to the movies and release year is the year in which the movie premiered.

    Some general observations so far are:

    • Some variables have as minimum value zero (which is not possible due to the value they represent) these are budget, revenue, runtime and budget_adj. No movie could have a runtime of zero or a budget of 0.00$
    • The first and second percentile of budget and budget_adj are also zero which could be due to an error. This applies also to revenue;
    • The runtime has to be stored in minutes. Seems like the longest movie lasted 15 hours; while the majority lasted less than 2 hours
    • Most of the movies listed have a vote average of 6; the popularity values span from 0 - 32 but most of the movies have a popularity of less than 1. The 32 value could be due to an error;
    • Movies span from 1960 to 2015.

    And finally, the histogram of the whole dataset is to be shown.</p>

    In [52]:
    # A visual inspection of the variables with 'hist'
    movieData.hist(figsize = (11,8));
    

    Many of these histograms are skewed to the right, this could be due to some data missing and filled with zeroes or null values. The Year column shows that the movies included are in majority from the 2000's. The vote_average variable could have a normal distribution, the only one of the whole dataset.

    Data Cleaning

    The inspections carried out in the previous section helped to better understand the dataset and revealed what can be done to make it a workable one. Some work already done was cleaning up the column labels and dropping the duplicated values. First, there are some columns that have no apparent use to the analysis and will be eliminated. These are the first two columns, that contain identifiers apparently, the homepage and the tagline ones. Additionally, there are other columns from which would be very difficult to obtain advantageous information. For instance, the cast is a key part of a movie's success, but in order to make it a more functional column one would have to rank the actors and give them some sort of weight. In spite of that, the cast will be kept in order to check its impact on the results and then verify if indeed the actors are A-list ones. The same applies to the overview and the keywords, but these two will be dropped. Secondly, the data shows some missing values that have to be handled and inspect if there null rows. Subsequently, the data types of the remaining columns have to be inspected and changed if needed.

    First, the columns with little use will be dropped, these are: 'ID_2', 'imdb_id', 'homepage', 'tagline', 'keywords' and 'overview'.

    In [53]:
    movieData.drop(['ID_2', 'imdb_id', 'homepage', 'tagline', 'keywords', 'overview'], axis = 1, inplace = True)
    

    The following lines will handle the missing values. These will be handled differently depending on which values are missing.

    In [54]:
    # Glance at where the values are missing from the columns kept.
    movieData.info()
    
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 10866 entries, 0 to 10866
    Data columns (total 15 columns):
     #   Column                Non-Null Count  Dtype  
    ---  ------                --------------  -----  
     0   popularity            10865 non-null  float64
     1   budget                10865 non-null  float64
     2   revenue               10865 non-null  float64
     3   original_title        10865 non-null  object 
     4   cast                  10789 non-null  object 
     5   director              10821 non-null  object 
     6   runtime               10865 non-null  float64
     7   genres                10842 non-null  object 
     8   production_companies  9835 non-null   object 
     9   release_date          10865 non-null  object 
     10  vote_count            10865 non-null  float64
     11  vote_average          10865 non-null  float64
     12  release_year          10865 non-null  float64
     13  budget_adj            10865 non-null  float64
     14  revenue_adj           10865 non-null  object 
    dtypes: float64(8), object(7)
    memory usage: 1.3+ MB
    

    So far, the columns missing values are cast, director, genres, and production companies. These are all categorical data so assigning any measure of tendency to the null values would require some extra work. First the rows with all missing values will be dropped:

    In [55]:
    # Rows with all columns set to null are dropped with the drop function and the parameter 'how' equal to all
    movieData.dropna(how = 'all', inplace = True)
    movieData.info()
    
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 10865 entries, 0 to 10865
    Data columns (total 15 columns):
     #   Column                Non-Null Count  Dtype  
    ---  ------                --------------  -----  
     0   popularity            10865 non-null  float64
     1   budget                10865 non-null  float64
     2   revenue               10865 non-null  float64
     3   original_title        10865 non-null  object 
     4   cast                  10789 non-null  object 
     5   director              10821 non-null  object 
     6   runtime               10865 non-null  float64
     7   genres                10842 non-null  object 
     8   production_companies  9835 non-null   object 
     9   release_date          10865 non-null  object 
     10  vote_count            10865 non-null  float64
     11  vote_average          10865 non-null  float64
     12  release_year          10865 non-null  float64
     13  budget_adj            10865 non-null  float64
     14  revenue_adj           10865 non-null  object 
    dtypes: float64(8), object(7)
    memory usage: 1.3+ MB
    

    The column 'production_companies' has many missing rows, if all the rows with missing values are dropped the sample would lose at least a thousand observations. It would be important to inspect further this column before making any decision.

    In [56]:
    # Inspection of the counts of production companies values:
    movieData['production_companies'].value_counts(0)
    
    Out[56]:
    Paramount Pictures                                                                                                     156
    Universal Pictures                                                                                                     133
    Warner Bros.                                                                                                            84
    Walt Disney Pictures                                                                                                    76
    Columbia Pictures                                                                                                       72
                                                                                                                          ... 
    Lions Gate Films|2929 Productions|Out of the Blue... Entertainment|Starbucks Entertainment|Cinema Gypsy Productions      1
    Tiger Aspect Productions|PolyGram Filmed Entertainment|Working Title Films                                               1
    Columbia Pictures|Gary Sanchez Productions                                                                               1
    Beatrice Film                                                                                                            1
    Rampart Films|Raven West Films Ltd.|Resonance Film &amp; Video                                                           1
    Name: production_companies, Length: 7445, dtype: int64

    Seems like the most relevant production companies have the most occurrences in the dataset, hence it could be useful to include them in the analysis. The rows with null values will be set to the value "unknown".

    In [57]:
    # The column production companies will modified by giving the 'unknown' value to all null rows
    movieData['production_companies'] = movieData['production_companies'].fillna('Unknown')
    movieData.info()
    
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 10865 entries, 0 to 10865
    Data columns (total 15 columns):
     #   Column                Non-Null Count  Dtype  
    ---  ------                --------------  -----  
     0   popularity            10865 non-null  float64
     1   budget                10865 non-null  float64
     2   revenue               10865 non-null  float64
     3   original_title        10865 non-null  object 
     4   cast                  10789 non-null  object 
     5   director              10821 non-null  object 
     6   runtime               10865 non-null  float64
     7   genres                10842 non-null  object 
     8   production_companies  10865 non-null  object 
     9   release_date          10865 non-null  object 
     10  vote_count            10865 non-null  float64
     11  vote_average          10865 non-null  float64
     12  release_year          10865 non-null  float64
     13  budget_adj            10865 non-null  float64
     14  revenue_adj           10865 non-null  object 
    dtypes: float64(8), object(7)
    memory usage: 1.3+ MB
    

    On the other hand, the values missing cast, director and genres are not so many so these rows will be dropped.

    In [58]:
    #Rows with missing values will be dropped, decreasing the number of observations by few dozens
    movieData.dropna(inplace = True)
    movieData.info()
    
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 10731 entries, 0 to 10865
    Data columns (total 15 columns):
     #   Column                Non-Null Count  Dtype  
    ---  ------                --------------  -----  
     0   popularity            10731 non-null  float64
     1   budget                10731 non-null  float64
     2   revenue               10731 non-null  float64
     3   original_title        10731 non-null  object 
     4   cast                  10731 non-null  object 
     5   director              10731 non-null  object 
     6   runtime               10731 non-null  float64
     7   genres                10731 non-null  object 
     8   production_companies  10731 non-null  object 
     9   release_date          10731 non-null  object 
     10  vote_count            10731 non-null  float64
     11  vote_average          10731 non-null  float64
     12  release_year          10731 non-null  float64
     13  budget_adj            10731 non-null  float64
     14  revenue_adj           10731 non-null  object 
    dtypes: float64(8), object(7)
    memory usage: 1.3+ MB
    

    The datatypes of the columns that are not numericals are to be examined. Original title, cast, director, production companies and genres should be strings.

    In [59]:
    #verification of the columns that should be strings:
    str == type(movieData['original_title'][5]) == type(movieData['cast'][5])
    str == type(movieData['director'][5]) == type(movieData['genres'][5])
    str == type(movieData['production_companies'][5]) 
    
    Out[59]:
    True

    So far, all the values kept as string are in the proper data type. Exemptions are release_date (that should be saved as a date), release_year should be an integer and revenue_adj should be a float. These will be changed.

    In [60]:
    #Verification of the form of the release_date values:
    (movieData['release_date'][5])
    
    Out[60]:
    '12/25/15'
    In [61]:
    #The dates will be transformed to datetype with the to_datetime method:
    movieData['release_date'] = pd.to_datetime(movieData['release_date'])
    

    This operation turned those movies from the late 1960’s into the years 2060’s. This problem of mixing centuries happens when datasets store the year in a two-digit value while containing values of different centuries. As the data set cannot hold values from years after 2020, all those rows with release dates after this year will be taken back to the previous century. This approach to the problem was found on Stack Overflow.

    In [62]:
    #Movies whose year is greater than the current one will be taken back to the past century.
    movieData.loc[movieData['release_date'].dt.year >= 2020, 'release_date'] -= pd.DateOffset(years=100)
    
    In [63]:
    #inspection of the results
    movieData['release_date'].describe()
    
    Out[63]:
    count                   10731
    unique                   5872
    top       2009-01-01 00:00:00
    freq                       27
    first     1960-01-01 00:00:00
    last      2015-12-31 00:00:00
    Name: release_date, dtype: object
    In [64]:
    # The relese year will be changed to int using the astype method:
    movieData.release_year = movieData.release_year.astype(int)
    movieData.release_year.dtypes
    
    Out[64]:
    dtype('int64')
    In [65]:
    #Inspection of how revenue_ad looks like and further change to float
    movieData['revenue_adj'][9]
    
    #They seemed like numbers so the method astype to float could be useful
    #movieData.revenue_adj = movieData.revenue_adj.astype(float)
    #This line of code did not work because there was a row with a value = '0</pre></body></html>Xtext/csv';
    
    #The row containing that value was found with 'loc' and dropped with 'drop'
    movieData.loc[movieData['revenue_adj']=='0</pre></body></html>Xtext/csv']
    movieData.drop(movieData.loc[movieData['revenue_adj']=='0</pre></body></html>Xtext/csv'].index, inplace=True)
    
    #with the wrong value gone the code was ran again
    movieData.revenue_adj = movieData.revenue_adj.astype(float)
    

    Finally the numerical variables that had some minimun values as zero will be inspected:

    In [66]:
    # New dataset with all the rows whose budget value equals to zero
    zero_budget = movieData.loc[movieData['budget']==0,:]
    zero_budget.budget.describe()
    
    #This shows that half of the dataset lacks budget values
    
    # New dataset with all the rows whose budget value equals to zero
    zero_rev = movieData.loc[movieData['revenue']==0,:]
    zero_rev.revenue.describe()
    
    #This shows that half of the dataset lacks revenue values
    
    Out[66]:
    count    5887.0
    mean        0.0
    std         0.0
    min         0.0
    25%         0.0
    50%         0.0
    75%         0.0
    max         0.0
    Name: revenue, dtype: float64

    The lack of revenue and budget values could introduce bias to the analysis. Options are to fill these values with the mean or drop the rows. Fillng with the mean will turn the original distribution of the data into a non-realistic one. Due to the fact that these columns are important for the analysis dropping them is not an option. On the other hand, dropping the rows with zero values would cut the observations by half, but at least these will be observations with realistic data. With these considerations in mind, the dataset with real values of revenue and budget will be saved into a new dataset and both datasets will be investigated.

    In [67]:
    #New datasets with budget and revenue values filled
    movieData2 = movieData.drop(movieData[(movieData.budget <= 0)].index)
    movieData2 = movieData.drop(movieData[(movieData.budget_adj <= 0)].index)
    movieData2 = movieData.drop(movieData[(movieData.revenue <= 0) & (movieData.revenue_adj <= 0)].index)
    movieData2.info()
    
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 4843 entries, 0 to 10848
    Data columns (total 15 columns):
     #   Column                Non-Null Count  Dtype         
    ---  ------                --------------  -----         
     0   popularity            4843 non-null   float64       
     1   budget                4843 non-null   float64       
     2   revenue               4843 non-null   float64       
     3   original_title        4843 non-null   object        
     4   cast                  4843 non-null   object        
     5   director              4843 non-null   object        
     6   runtime               4843 non-null   float64       
     7   genres                4843 non-null   object        
     8   production_companies  4843 non-null   object        
     9   release_date          4843 non-null   datetime64[ns]
     10  vote_count            4843 non-null   float64       
     11  vote_average          4843 non-null   float64       
     12  release_year          4843 non-null   int64         
     13  budget_adj            4843 non-null   float64       
     14  revenue_adj           4843 non-null   float64       
    dtypes: datetime64[ns](1), float64(8), int64(1), object(5)
    memory usage: 605.4+ KB
    

    Lastly the popularity variable will be inspected.

    In [68]:
    movieData.popularity.describe()
    
    Out[68]:
    count    10730.000000
    mean         0.652672
    std          1.004833
    min          0.000188
    25%          0.210766
    50%          0.387136
    75%          0.721158
    max         32.985763
    Name: popularity, dtype: float64

    It would seem as if the popularity is a value from 0 to 1. With several outliers. Official documentation states that the metric is related to the number of votes and not the vote average, this column will be dropped, but no specifications of their min and max values is stated so is difficult to stablish a threshold. This column will be dropped.

    In [69]:
    #The column popularity is dropped using the function drop
    movieData.drop(['popularity'], axis = 1, inplace = True)
    

    Exploratory Data Analysis

    Now that the data has been cleaned, the descriptive statistics and visual aids can be used to better understand the variables and how they interact.

    How the runtime and budget influence the ratings a movie receives?

    In [70]:
    #statistics of the numerical values
    movieData.describe()
    movieData2.describe()
    
    Out[70]:
    popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
    count 4843.000000 4.843000e+03 4.843000e+03 4843.000000 4843.000000 4843.000000 4843.000000 4.843000e+03 4.843000e+03
    mean 1.046315 2.960373e+07 8.933981e+07 107.969647 436.794342 6.149060 2000.914516 3.520258e+07 1.152341e+08
    std 1.357222 4.053107e+07 1.621546e+08 21.107426 806.854055 0.798274 11.571917 4.377308e+07 1.989424e+08
    min 0.001117 0.000000e+00 2.000000e+00 15.000000 10.000000 2.100000 1960.000000 0.000000e+00 2.370705e+00
    25% 0.388227 1.700000e+06 7.779664e+06 95.000000 47.000000 5.600000 1994.000000 2.334746e+06 1.048057e+07
    50% 0.680803 1.500000e+07 3.191160e+07 104.000000 147.000000 6.200000 2004.000000 2.032801e+07 4.402879e+07
    75% 1.210533 4.000000e+07 1.000000e+08 117.000000 435.500000 6.700000 2010.000000 4.973958e+07 1.317599e+08
    max 32.985763 4.250000e+08 2.781506e+09 705.000000 9767.000000 8.400000 2015.000000 4.250000e+08 2.827124e+09

    These values seem a lot better now, and we can use them to get some answers to the proposed questions.

    For the purposes of the examination, a well-rated movie will be defined as that with a vote of 7 or higher, regular movies are that between 6 and 7 and a bad movie one that has less than 5 average rating. The descriptive table highlights that each of these movies have a n average of 200 votes. Using this definitions, some masks will be created.

    Firstly, the variable 'budget' will be inspected in this regard.

    In [71]:
    #Masks dividing the movies into good and regular
    #New variables with Boolean values based on a condition from the original dataframe
    Good = movieData.vote_average > 7
    Regular = ((movieData.vote_average < 7) | (movieData.vote_average > 6))
    bad = movieData.vote_average < 6
    
    #How these ratings are related to the budget
    #The average budget of movies rated as good
    movieData.budget[Good].mean()
    
    Out[71]:
    17037564.310821805
    In [72]:
    #The average budget of movies rated as regular
    movieData.budget[Regular].mean()
    
    Out[72]:
    14805024.023485554
    In [73]:
    #The average budget of movies rated as bad
    movieData.budget[bad].mean() 
    
    Out[73]:
    12610486.134366926

    It seems like the movies rated as good have a higher budget than the ones ranked as not.

    Now it is time to see their relationship with runtime:

    In [74]:
    #The average runtime of movies rated as good
    movieData.runtime[Good].mean()
    
    Out[74]:
    110.00813669650123
    In [75]:
    #The average runtime of movies rated as regular
    movieData.runtime[Regular].mean()
    
    Out[75]:
    102.47148182665424
    In [76]:
    #The average runtime of movies rated as bad
    movieData.runtime[bad].mean()
    
    Out[76]:
    97.69310276287021

    It seems as if the movies ranked as 'Good' last longer.

    These findings will be inspected visually:

    In [77]:
    movieData.budget[Good].hist(color = 'green', label = 'Good Movie', bins = 13);
    movieData.budget[Regular].hist(color = 'blue', alpha = 0.2, label = 'Regular Movie', bins = 13);
    movieData.budget[bad].hist(color = 'orange', alpha = 0.2, label = 'Bad Movie', bins = 13);
    plt.title("Histogram of Movie expenses")
    plt.xlabel("Budget (1*10 ^8 $)")
    plt.ylabel("Movies")
    plt.legend();
    

    This chart shows as if the majority of the movies have low budget but this is related to big ammount of movies with budgets equal to zero. Still both good and less good movies have higher ammount of not so expensive movies.

    In [78]:
    # Visual inspection of the datasets with hist function 
    movieData.runtime[Good].hist(color = 'green', label = 'Good Movie', bins = 15);
    movieData.runtime[Regular].hist(color = 'blue', alpha = 0.2, label = 'Regular Movie', bins = 15);
    movieData.runtime[bad].hist(color = 'orange', alpha = 0.5, label = 'Bad Movie', bins = 15);
    plt.title("Histogram of Movie duration")
    plt.xlabel("Runtime (minutes)")
    plt.ylabel("Movies")
    plt.legend();
    

    This chart shows as if the majority of the movies independent of their quality last between 100 and 150 minutes.

    In order to avoid the effect of the null values, the same examinations will be performed with the dataset without zero values.

    In [79]:
    #Masks dividing the movies into good and regular
    Good = movieData2.vote_average > 7
    Regular = ((movieData2.vote_average < 7) | (movieData2.vote_average > 6))
    bad = movieData2.vote_average < 6
    
    #the average budget of movied labeled as good:
    movieData2.budget[Good].mean()
    
    Out[79]:
    32126722.119426753
    In [80]:
    #the average budget of movied labeled as regular:
    movieData2.budget[Regular].mean()
    
    Out[80]:
    29603729.519719183
    In [81]:
    #the average budget of movied labeled as bad:
    movieData2.budget[bad].mean()
    
    Out[81]:
    28621739.973247726
    In [82]:
    # Visual inspection of the dataset without budgets equal to zero:
    movieData2.budget[Good].hist(color = 'green', label = 'Good Movie');
    movieData2.budget[Regular].hist(color = 'blue', alpha = 0.2, label = 'Regular Movie');
    movieData2.budget[bad].hist(color = 'orange', alpha = 0.4, label = 'Regular Movie');
    plt.title("Histogram of Movie expenses")
    plt.xlabel("Budget (1*10 ^8 $)")
    plt.ylabel("Movies")
    plt.legend();
    

    The trend seems pretty similar for both filtered and unfiltered datasets the good and regular movies have similar budgets. One thing is clear, that the majority of movies do not have excessive budgets.

    As an extra, other potential explanatory variables to explore are cast and director:

    In [83]:
    #The cast unique values ands statistics are obtained
    movieData.cast.nunique()
    
    movieData.director.describe()
    
    Out[83]:
    count           10730
    unique           5017
    top       Woody Allen
    freq               45
    Name: director, dtype: object

    The cast column in its present form would be difficult to use. When it comes to directors, they are half the movies, it is pertinent to inspect if these frequent directors are related to the movies with higher votes. First the most frequent directors will be obtained, pandas could count the with the value_counts method but in order to store them as a list, an approach fund on stackOverflow was utilized:

    In [84]:
    #first the 15 most frequent directors will be obtained
    # n is the number of directors to retrieve
    n = 15
    # A new variable is created with the directors with more values assigned in the dataset
    freqDir = movieData.director.value_counts()[:n].index.tolist()
    
    #following, the movies created by these directors will be listed and their statistics analyzed
    moviesTopDir = movieData[movieData['director'].isin(freqDir)]
    moviesTopDir.vote_average.describe()
    
    Out[84]:
    count    354.00000
    mean       6.40791
    std        0.75312
    min        4.10000
    25%        5.90000
    50%        6.50000
    75%        6.90000
    max        8.10000
    Name: vote_average, dtype: float64

    The table shows that the movies directed by the 15th most common directors are in average 'Regular' movies due to our previous definition. The one with higher rank is 8.1 but the 75% of them fall below 6.9; the standard deviation is very low hence most of the movies are around the mean.

    The same approach will be used for the production companies and genres. First obtain the most common ones and see their average punctuation.

    In [85]:
    # Unique values for production companies
    movieData.production_companies.nunique()
    
    # 15th  common production companies
    n = 15
    freqPro = movieData.production_companies.value_counts()[:n].index.tolist()
    print(freqPro)
    
    #Dataset only with movies produced by these companies
    moviesTopComp = movieData[movieData['production_companies'].isin(freqPro)]
    moviesTopComp.vote_average.describe()
    
    ['Unknown', 'Paramount Pictures', 'Universal Pictures', 'Warner Bros.', 'Walt Disney Pictures', 'Columbia Pictures', 'Metro-Goldwyn-Mayer (MGM)', 'New Line Cinema', 'Touchstone Pictures', '20th Century Fox', 'Twentieth Century Fox Film Corporation', 'TriStar Pictures', 'Orion Pictures', 'Miramax Films', 'DreamWorks Animation']
    
    Out[85]:
    count    1912.000000
    mean        5.997071
    std         0.947111
    min         2.100000
    25%         5.400000
    50%         6.000000
    75%         6.700000
    max         9.200000
    Name: vote_average, dtype: float64

    The table shows that the movies produced by the 15th most common companies are in average 'Bad' movies. The one with higher rank is 9.2 (the highest rank of the entire dataset) but the 75% of them fall below 6.7; the standard deviation is quite high hence, the values of this subset are very spread.

    Last thing to be inspected are going to be the genres:

    In [86]:
    # Unique values for genres
    movieData.genres.nunique()
    
    # 15th  common genres
    # a new variable with the number of 
    n = 15
    freqGen = movieData.genres.value_counts()[:n].index.tolist()
    print(freqGen)
    moviesTopGen = movieData[movieData['genres'].isin(freqGen)]
    moviesTopGen.vote_average.describe()
    
    ['Drama', 'Comedy', 'Drama|Romance', 'Comedy|Drama', 'Documentary', 'Comedy|Romance', 'Horror|Thriller', 'Horror', 'Comedy|Drama|Romance', 'Drama|Thriller', 'Comedy|Family', 'Action|Thriller', 'Thriller', 'Drama|Comedy', 'Animation|Family']
    
    Out[86]:
    count    3870.000000
    mean        5.981214
    std         0.928975
    min         2.300000
    25%         5.400000
    50%         6.000000
    75%         6.600000
    max         9.200000
    Name: vote_average, dtype: float64

    The table shows that belong to the 15th most common genres are in average 'Bad' movies. The one with higher rank is 9.2 (the highest rank of the entire dataset) but the 75% of them fall below 6.7; the standard deviation is low meaning that these movies are around the mean.

    In [87]:
    movieData.vote_average.mode()
    
    Out[87]:
    0    6.1
    dtype: float64

    The mode of the vote_average is 6.1 (most common rating among movies).

    A last inspection of the cast variable:

    In [88]:
    movieData.cast.describe()
    
    Out[88]:
    count          10730
    unique         10664
    top       Louis C.K.
    freq               6
    Name: cast, dtype: object

    It seems as if Louis C.K. is the most common actor with 6 titles. This is mainly due to the fact that actors are stashed in the cast and separated by a bar.

    How the runtime and budget influence the box office performance of a movie?

    The rate of success in the revenue for this analysis will depend on the budget. A successful revenue will be defined as one that is at least 3 times the budget.

    In [89]:
    #Creation of masks to inspect the movies that follow the criteria explained above
    Succesful = movieData.revenue_adj > (3 * movieData.budget_adj)
    Bad = movieData.revenue_adj <  movieData.budget_adj
    
    In [90]:
    #Total of movies that are considered as uccesful
    sum(Succesful)
    
    Out[90]:
    2398

    The 'bad' movies are more than the succesful ones.

    In [91]:
    #mean budget of movies considered as succesful and bad
    movieData.budget[Succesful].mean()
    
    Out[91]:
    20100607.09591326
    In [92]:
    movieData.budget[Bad].mean()
    
    Out[92]:
    19797233.88074168

    Good movies spent more in average than 'bad' ones.

    In [93]:
    #relationship between sucess in the box office and runtime
    movieData.runtime[Succesful].mean()
    
    Out[93]:
    107.34904086738949
    In [94]:
    movieData.runtime[Bad].mean()
    
    Out[94]:
    103.67130214917826

    This results show that the average succesful movie last longer than the ones that do not perform that well in the box office. This will be inspected visually.

    In [95]:
    #duration of movies vs revenue criteria
    movieData.runtime[Bad].hist(label = 'Bad Performing movie');
    movieData.runtime[Succesful].hist(label = 'Sucessful Movie');
    plt.title("Histogram of movie duration")
    plt.xlabel("Runtime (minutes)")
    plt.ylabel("Movies")
    plt.legend();
    

    The mean and the histogram show that movies with good performance in the box office last, in average, longer.

    The categorical variables of this dataset have too many values in order to draw specific relationships between each category and the criteria described above as success or failure. Because of that, the most common values of genres, companies and directors are inspected visually.

    In [104]:
    #The most common genres are inspected
    #First a list of colors is created to assign each genre a distinct color
    #my_colors = list(islice(cycle(['b', 'g', 'r', 'c', 'm', 'y', 'k', 'pink', 'orange', 'gray', 'olive', 'cyan', 'brown', 'purple', 'blue']), None, len(moviesTopGen)))
    
    #With groupby the revenues are grouped by genre and plot
    moviesTopGen.groupby('genres').revenue_adj.mean().plot(kind='bar');
    plt.title('Avg. revenues of most common genres');
    plt.ylabel('Revenues', fontsize=15);
    plt.xlabel('Genres', fontsize=15);
    

    This graph shows that among the most common genres, Animation|Family is the one with the highest average revenues.

    In [105]:
    #The most common genres are inspected
    #First a list of colors is created to assign each genre a distinct color
    #my_colors = list(islice(cycle(['b', 'g', 'r', 'c', 'm', 'y', 'k', 'pink', 'orange', 'gray']), None, len(moviesTopDir)))
    
    #With groupby the revenues are grouped by genre and plot
    moviesTopDir.groupby('director').revenue_adj.mean().plot(kind='bar');
    plt.title('Avg. revenues of most common directors');
    plt.ylabel('Revenues', fontsize=12);
    plt.xlabel('Directors', fontsize=12);
    

    This graph shows that among the most common directors, Steven Spielberg is the one whose movies have the highest average revenues.

    In [106]:
    #The most common genres are inspected
    #First a list of colors is created to assign each genre a distinct color
    my_colors = list(islice(cycle(['b', 'g', 'r', 'c', 'm', 'y', 'k', 'pink', 'orange', 'gray', 'olive', 'cyan', 'brown', 'purple', 'blue']), None, len(moviesTopComp)))
    
    #With groupby the revenues are grouped by genre and plot
    moviesTopComp.groupby('production_companies').revenue_adj.mean().plot(kind='bar');
    plt.title('Avg. revenues of most common prod. companies');
    plt.ylabel('Revenues', fontsize=15);
    plt.xlabel('Production Companies', fontsize=15);
    

    This graph shows that among the most common production companies, Paramount and Columbia are the ones with the highst average revenues. Lastly, the release year of the movies is to be inspected to see if recent movies are more succesful (perhaps because they are spending more):

    In [99]:
    #relationship between sucess in the box office and release year 
    #The mode will be utilized because the average year does not make a lot of sense
    movieData.release_year[Succesful].mode()
    
    Out[99]:
    0    2014
    dtype: int64
    In [100]:
    movieData.release_year[Bad].mode()
    
    Out[100]:
    0    2013
    dtype: int64

    The most repeated year among the succesful movies was 2014 and 2013 for the bad ones.

    Are the box office performance and the ratings related?

    A short inspection of the relatioship between the revenues and vote averages is carried out:

    In [101]:
    #Evaluation of the vote average the movies more succesful in the box office get:
    movieData[Succesful].vote_average.describe()
    
    Out[101]:
    count    2398.000000
    mean        6.310842
    std         0.796917
    min         2.100000
    25%         5.800000
    50%         6.300000
    75%         6.900000
    max         8.400000
    Name: vote_average, dtype: float64

    This shows that movies with good revenues are in average, regular movies.

    The movie with the highest rating will be inspected:

    In [102]:
    #Obtention of the highest rate:
    movieData.describe()
    
    movieTop = movieData[movieData['vote_average']==9.2]
    movieTop
    
    Out[102]:
    budget revenue original_title cast director runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
    3894 0.0 0.0 The Story of Film: An Odyssey Mark Cousins|Jean-Michel Frodon|Cari Beauchamp... Mark Cousins 900.0 Documentary Unknown 2011-09-03 14.0 9.2 2011 0.0 0.0

    The revenue and budget values of this movie is zero, hence no further information could be gathered by it.

    In [103]:
    #Obtention of the highest rate for the secondary data frame:
    movieData2.describe()
    movieTop2 = movieData2[movieData2['vote_average']==8.4]
    movieTop2
    
    Out[103]:
    popularity budget revenue original_title cast director runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
    4178 7.192039 25000000.0 28341469.0 The Shawshank Redemption Tim Robbins|Morgan Freeman|Bob Gunton|William ... Frank Darabont 142.0 Drama|Crime Castle Rock Entertainment 1994-09-10 5754.0 8.4 1994 3.677779e+07 4.169346e+07
    7948 0.283191 1200000.0 4978922.0 Stop Making Sense David Byrne|Tina Weymouth|Chris Frantz|Jerry H... Jonathan Demme 88.0 Documentary|Music Talking Heads|Arnold Stiefel Company 1984-11-16 23.0 8.4 1984 2.518850e+06 1.045097e+07

    Conclusions

    The analysis in general was challenging due to the variety of the categorical variables. Variables such as cast and director have a strong impact on the outcome of a movie, but in order to be used they had to be prepared. The answer to the first question ‘How the runtime and budget influence the ratings a movie receives?’ could be answered by saying that movies that last too long are not among the higher rated movies. In addition, the movies with higher ratings have higher budget in average than the ones with worse ratings. That could be due to the presence of outliers. This needed to be inspected further and see if certain runtimes are more common among certain genres. When it comes to the second question ‘How the runtime and budget influence the box office performance of a movie? The inspection showed that movies considered to be successful lasted longer in average and spent more in average. Also the movies considered as successful were less than half of the dataset. In general, a high number of movies lacking values for budget and revenues impacted highly the results obtained. The big presence of null values influences the mean, and many assumptions were made based on average values. Moreover, the inspection of the top frequent values among the categorical values also have the problem of reducing the samples dramatically. The assumptions drawn from this preliminary examination are insightful but have flaws.

    References