Data Source & Purpose of Data Analysis
Data Source: Movie Industry dataset from Kaggle
Purpose of Data Analysis: Which movie features correlate with gross earnings?
If you want to see this project on GitHub, please click this link.
1. Prepare Data
1.1 Import Libraries and Data
# Import libraries
import pandas as pd
import seaborn as sns
import os
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from matplotlib.pyplot import figure
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12,8) # Adjusts the configuration of the plots we will create
# Read in the data
pwd = os.getcwd()
filepath = os.path.join(pwd, "movies.csv") # Builds the path portably instead of hard-coding a Windows separator
df = pd.read_csv(filepath)
1.2 Explore Data
# Let's look at the data
df.head()
data:image/s3,"s3://crabby-images/391be/391be151f3fb31a34b1058f5f26cfb4493ab359e" alt=""
2. Clean Data
2.1 Deal with missing data
# Let's see if there is any missing data
for col in df.columns:
    print(df[col].isnull().value_counts(), "\n")
data:image/s3,"s3://crabby-images/42a38/42a38908f2a64c29691afe20c48356819f77493a" alt=""
# Drop rows with missing data
df = df.dropna()
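Before dropping rows, it can help to see how many values are missing per column and how many rows would survive. The following is a minimal sketch on a small made-up DataFrame; the name `toy` and its values are assumptions for illustration and are not taken from movies.csv.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "budget": [1.0e6, None, 3.0e6, 4.0e6],
    "gross": [2.0e6, 5.0e6, None, 9.0e6],
})

# Number of missing values in each column, in one compact Series
missing_counts = toy.isna().sum()
print(missing_counts)

# How many rows remain if every row with any missing value is dropped
rows_before = len(toy)
rows_after = len(toy.dropna())
print(f"dropna() keeps {rows_after} of {rows_before} rows")
```

The same two calls can be run on `df` to check how much data `dropna()` removes before committing to it.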
2.2 Organize the data types
# Data types for our columns
df.dtypes
data:image/s3,"s3://crabby-images/f96a4/f96a44668e04ea5ff9c5904a4d1de69d9e171ec8" alt=""
# Change data type of columns
df['budget'] = df['budget'].astype('int64')
df['gross'] = df['gross'].astype('int64')
df['runtime'] = df['runtime'].astype('int64')
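The integer casts above only work because the rows with missing values were dropped first. A minimal sketch of why the order matters, using a made-up `toy` DataFrame (an assumption, not data from the original notebook): casting a float column that still contains NaN to `int64` raises an error, while the cast succeeds after `dropna()`. The nullable `Int64` dtype is shown as one possible alternative that keeps the missing rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing budget
toy = pd.DataFrame({"budget": [19000000.0, np.nan, 4500000.0]})

# Casting a column that contains NaN to a plain integer dtype fails
try:
    toy["budget"].astype("int64")
    cast_failed = False
except ValueError:
    cast_failed = True
print("cast with NaN failed:", cast_failed)

# After dropping the missing row, the cast works
cleaned = toy.dropna()
budget_int = cleaned["budget"].astype("int64")
print(budget_int.tolist())

# Alternative: pandas' nullable integer dtype keeps the missing value as <NA>
budget_nullable = toy["budget"].astype("Int64")
print(budget_nullable)
```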
# Split the string to separate the date from the country in parentheses
new = df['released'].str.split(r" \(", n = 1, expand = True)
df['released_date'] = new[0]
# Convert the datatype to datetime
df['released_date'] = pd.to_datetime(df['released_date'])
df['released_date']
data:image/s3,"s3://crabby-images/52e5f/52e5fd50190c1e2ebe6db69af2e591eb86bbcd23" alt=""
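To make the split-and-convert step concrete, here is a minimal sketch on two sample strings. The strings are illustrative values written in the "date (country)" shape that the split above implies, and `release_year` is a hypothetical extra column that is not part of the original notebook.

```python
import pandas as pd

# Illustrative values in the assumed "Month day, year (Country)" format
released = pd.Series([
    "June 13, 1980 (United States)",
    "July 25, 1980 (United States)",
])

# Keep only the text before " (", i.e. the date part
date_part = released.str.split(r" \(", n=1, expand=True)[0]

# Parse the date strings into datetime64 values
released_date = pd.to_datetime(date_part, format="%B %d, %Y")

# Hypothetical follow-up: pull out the release year for grouping or plotting
release_year = released_date.dt.year
print(release_year.tolist())
```

If some rows in the real data do not follow this format, passing `errors="coerce"` to `pd.to_datetime` turns them into `NaT` instead of raising an error.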
3. Analyze and Visualize the data
3.1 Scatter Plot 1 - Budget vs Gross earning
# Scatter plot
plt.scatter(x=df['budget'], y=df['gross'])
plt.title('Budget vs Gross Earnings')
plt.xlabel('Budget for Film')
plt.ylabel('Gross Earnings')
plt.show()
data:image/s3,"s3://crabby-images/b3970/b397086b0f757d1c7b6e47ff6cbfe3d4948e292a" alt=""
3.2 Scatter Plot 2 - Budget vs Gross earning with regression line
# Plot budget vs gross earnings using seaborn
sns.regplot(x='budget', y='gross', data=df, scatter_kws={"color":"red"}, line_kws={"color":"blue"})
data:image/s3,"s3://crabby-images/bf32d/bf32d30e4d4e9532bf21ff984c6ed146aa772d83" alt=""
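By default, `sns.regplot` draws an ordinary least-squares straight line through the points. The numbers behind such a line can be sketched with NumPy as follows; the toy values (in millions of dollars) are assumptions chosen for illustration, not figures from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, in millions of dollars
toy = pd.DataFrame({
    "budget": [10.0, 20.0, 30.0, 40.0],
    "gross": [25.0, 45.0, 65.0, 85.0],
})

# Fit a degree-1 polynomial: gross is approximated by slope * budget + intercept
slope, intercept = np.polyfit(toy["budget"], toy["gross"], 1)

# Pearson correlation coefficient between the two columns
r = np.corrcoef(toy["budget"], toy["gross"])[0, 1]

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```

Running the same calls on `df['budget']` and `df['gross']` gives a slope that can be read as the estimated extra gross per extra dollar of budget, under a purely linear model.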
3.3 Heatmap - Correlation between movie features
# Let's start looking at correlation
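df.select_dtypes(include='number').corr() # Restrict to numeric columns; recent pandas versions raise an error on text columns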
data:image/s3,"s3://crabby-images/b532d/b532d9734d3a0ae311ae0cbab5f4330b970a3409" alt=""
# Heatmap of Pearson correlations; budget and gross are highly correlated
correlation_matrix = df.select_dtypes(include='number').corr(method='pearson')
sns.heatmap(correlation_matrix, annot=True)
plt.title('Correlation Matrix for Numeric Features')
plt.xlabel('Movie Features')
plt.ylabel('Movie Features')
plt.show()
data:image/s3,"s3://crabby-images/ffb8d/ffb8daddc75672488c8a0a69d44f32b89554e4c9" alt=""
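Since the question is which features correlate with gross earnings, the correlation matrix can also be reduced to a single ranked column. This is a minimal sketch on made-up data; `toy`, its values, and the `ranked` variable are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data with one non-numeric column
toy = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "runtime": [120, 95, 110, 100, 105],
    "score": [6.0, 7.5, 6.5, 8.0, 7.0],
    "gross": [30, 45, 80, 90, 140],
    "genre": ["Drama", "Comedy", "Action", "Drama", "Action"],
})

# Keep numeric columns only, then take the 'gross' column of the Pearson matrix
numeric = toy.select_dtypes(include="number")
corr_with_gross = numeric.corr(method="pearson")["gross"].drop("gross")

# Order features by the strength (absolute value) of their correlation with gross
order = corr_with_gross.abs().sort_values(ascending=False).index
ranked = corr_with_gross.reindex(order)
print(ranked)
```

Sorting by absolute value keeps strongly negative correlations near the top as well, which a plain descending sort would push to the bottom.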