DATA WRANGLING AND VISUALIZATION OF NETFLIX DATA

Netflix is an American over-the-top content platform and production company headquartered in Los Gatos, California. It was founded in 1997 as a DVD delivery service. Today it has over 203.7 million subscribers worldwide and is a leading streaming service that lets its customers watch a wide variety of award-winning TV shows, movies, documentaries, and more on thousands of internet-connected devices.

I began this analysis with the aim of understanding Netflix’s history and goals, its target market, and its existing product. I also wanted to know what people want to watch (the genres of movies and TV shows) and how movies and TV shows on Netflix are rated. Lastly, I wanted to identify the countries that are major contributors to Netflix views, understand the growth of Netflix over the years, and spot any areas that may need improvement.

In this article, we will perform basic wrangling and visualization on a dataset of all the TV shows and movies available on Netflix.

DATA WRANGLING

Data wrangling is one of the crucial tasks in data science and analysis. It includes operations such as data sorting, data filtration, data reduction, data access, and data processing.

Data wrangling is a preprocessing phase where data is transformed from one form to another. The phase aims to make data available for analytics, and it includes data collection, exploratory data analysis, etc. In this project, I performed data wrangling on a Netflix dataset from Kaggle.

Dataset

This dataset consists of TV shows and movies available on Netflix as of 2019. The dataset was collected from Flixable, a third-party Netflix search engine, and can be obtained from Kaggle.

Data loading

Data is an integral part of analysis and is often stored in files (CSV, Excel, JSON, XML, SQL, etc.), so pandas has built-in support for loading data from files into a DataFrame. The first step is to import all the libraries and load the data needed for analysis and visualization.

#import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os, sys
import warnings
warnings.filterwarnings('ignore')

Since the data is in CSV format, use the pd.read_csv() function to read it.

#loading datasets
data = pd.read_csv('/Users/wuraolaifeoluwa/Documents/Rasheed data doc /datasets/netflix_titles.csv')

To view the first 5 rows of the data, we use the .head() function. Here is an overview of what the dataset looks like, obtained by calling .head() on the DataFrame:

#view first 5 rows of the dataset
data.head()

The .info() function prints a concise summary of a DataFrame, including the index type, column types, non-null counts, and memory usage. See the output below for basic information about the data.

#information about the data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   show_id       7787 non-null   object
 1   type          7787 non-null   object
 2   title         7787 non-null   object
 3   director      5398 non-null   object
 4   cast          7069 non-null   object
 5   country       7280 non-null   object
 6   date_added    7777 non-null   object
 7   release_year  7787 non-null   int64
 8   rating        7780 non-null   object
 9   duration      7787 non-null   object
 10  listed_in     7787 non-null   object
 11  description   7787 non-null   object
dtypes: int64(1), object(11)
memory usage: 730.2+ KB

Data Cleaning

Now that you have imported the data, the next step is to clean it. Speaking from both personal and professional experience, your analysis will only be as good as the quality of your data, so it is very important to see this step through carefully.

Cleaning the data can include tasks such as checking for null values, imputing missing values, checking for outliers, or making sure columns are named correctly. Datasets are not always perfect, and duplicated data or missing values can affect the analysis. We can count the null values in each column using data.isnull().sum() and use Seaborn’s heatmap to visualize them. Below is a detailed overview of the null values present in our Netflix dataset.

data.isnull().sum()
show_id            0
type               0
title              0
director        2389
cast             718
country          507
date_added        10
release_year       0
rating             7
duration           0
listed_in          0
description        0
dtype: int64

From the above, a few columns contain null values (‘director’, ‘cast’, ‘country’, ‘date_added’, ‘rating’). Firstly, I dropped the ‘director’ and ‘cast’ columns because they are not needed in this analysis. Secondly, I filled the ‘rating’ and ‘country’ columns with their most frequently occurring values because both columns are categorical. Lastly, I used the forward fill method to fill the null values in the ‘date_added’ column.

#drop the 'director' and 'cast' column
data = data.drop(['director', 'cast'], axis = 1)
#rating
data['rating'] = data.rating.fillna(data['rating'].mode()[0])
#date_added
data['date_added'] = data['date_added'].ffill()
#country
data['country'] = data.country.fillna(data['country'].mode()[0])
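
To make the effect of these fills concrete, here is a minimal sketch on a small made-up frame. The `toy` DataFrame and its values are hypothetical stand-ins for the Netflix data, not rows from the real dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Netflix data.
toy = pd.DataFrame({
    "rating": ["TV-MA", None, "TV-MA", "TV-14"],
    "country": ["United States", "India", None, "United States"],
    "date_added": ["January 1, 2020", None, "March 5, 2020", "April 9, 2020"],
})

# Fill each categorical column with its most frequent value (the mode).
for col in ["rating", "country"]:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

# Forward fill copies the previous row's value into each gap.
toy["date_added"] = toy["date_added"].ffill()

remaining_nulls = int(toy.isnull().sum().sum())
print(toy)
print("Remaining nulls:", remaining_nulls)
```

Note that forward fill depends on row order, so the filled date is only as meaningful as the ordering of the rows.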

To get the data ready for visualization, I performed some more wrangling such as using .rename() function on the dataset to change ‘listed_in’ to ‘Genre’ for a better description.

#rename columns
data.rename(columns = {'listed_in' : 'Genre',}, inplace = True)

Also, I used the lambda function to create features like year_added and month_added from ‘date_added’ for better analysis.

#use lambda function to create year_added and month_added features
#strip whitespace so the year has no leading space and the month is never empty
year_func = lambda x: x.split(',')[-1].strip()
data['year_added'] = data['date_added'].apply(year_func)

month_func = lambda x: x.strip().split(' ')[0]
data['month_added'] = data['date_added'].apply(month_func)
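
An alternative to string splitting is to parse the column into real dates and read the year and month from them. The sketch below assumes the values follow a "Month day, Year" pattern; the `dates` Series is a hypothetical sample, not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample in a "Month day, Year" style.
dates = pd.Series(["August 14, 2020", " December 23, 2016", "January 1, 2019"])

# Strip stray whitespace, then parse the strings into datetime values.
parsed = pd.to_datetime(dates.str.strip(), format="%B %d, %Y")

# The .dt accessor exposes date parts directly.
year_added = parsed.dt.year
month_added = parsed.dt.month_name()

print(year_added.tolist())
print(month_added.tolist())
```

Parsing gives integer years that sort and plot correctly, at the cost of failing loudly if a value does not match the assumed format.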

One way to quickly check the summary of numeric values is to call the .describe() method on your DataFrame. This provides a summary for each numeric column, including the count, mean, standard deviation, minimum and maximum values, and percentiles.

Since .describe() skips columns whose datatype is string by default, I selected the string columns separately to get their summary statistics, such as count, unique, and frequency.
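
A minimal sketch of both summaries follows. The `toy` frame is hypothetical, and selecting the non-numeric columns with `select_dtypes` is one possible way to do it, not necessarily the approach used in the original notebook.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and two text columns.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "rating": ["TV-MA", "TV-14", "TV-MA"],
    "release_year": [2019, 2020, 2017],
})

# Default describe() covers numeric columns only.
numeric_summary = toy.describe()

# Describing the non-numeric columns gives count, unique, top, and freq.
text_summary = toy.select_dtypes(exclude="number").describe()

print(numeric_summary)
print(text_summary)
```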

VISUALIZATION

Next, we explore the data to gain insights into our dataset and what it contains. This includes, but is not limited to, checking for outliers or unusual data by looking at data distributions.

Analysis of the distribution of tv shows to movies

Observation

This shows that movies account for 69.1% of the titles on Netflix, compared with 30.9% for TV shows. Clearly, users have more movies than TV shows to explore on Netflix. A likely reason is that the longer users spend on Netflix, the more their appetite for watching movies grows.
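
A split like this can be computed and drawn with a few lines. The sketch below uses a hypothetical `toy` frame and a matplotlib pie chart; the chart type and styling are assumptions, since the original plot is not shown here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy frame; the real data has a 'type' column with the same two values.
toy = pd.DataFrame({"type": ["Movie", "Movie", "TV Show", "Movie"]})

# Share of each type as a percentage of all titles.
share = toy["type"].value_counts(normalize=True).mul(100).round(1)

fig, ax = plt.subplots()
ax.pie(share, labels=share.index, autopct="%1.1f%%")
ax.set_title("Movies vs TV shows (share of titles)")

print(share)
```

In a notebook, calling plt.show() after this displays the chart.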

Top 10 Genres of tv shows on Netflix

Observation

This shows that Kids’ TV (205) is the most common genre among TV shows on Netflix, followed by International TV Shows, TV Dramas (111), and Crime TV Shows (106).
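
One way to count genres is sketched below, under the assumption that a title can list several comma-separated genres in one cell. The `toy` frame is hypothetical, and whether the original analysis split the genre strings or counted them whole is not stated. If the column has already been renamed, use 'Genre' instead of 'listed_in'.

```python
import pandas as pd

# Hypothetical toy frame; genres are comma-separated within a cell.
toy = pd.DataFrame({
    "type": ["TV Show", "TV Show", "Movie", "TV Show"],
    "listed_in": [
        "Kids' TV, TV Comedies",
        "Kids' TV",
        "Dramas, International Movies",
        "Crime TV Shows, TV Dramas",
    ],
})

# Keep TV shows only, split each cell into a list, then give each genre its own row.
tv = toy[toy["type"] == "TV Show"]
genre_counts = tv["listed_in"].str.split(", ").explode().value_counts()

# value_counts() sorts by frequency, so head(10) is the top 10.
top_genres = genre_counts.head(10)
print(top_genres)
```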

Top 10 Countries as a contributor to Netflix.

Observation

The United States is the largest contributor, with a major stake of 56.5% of the shows on Netflix, followed by India with 17.0% and the United Kingdom with 7.3%. This is likely because Netflix is an American company.

Distribution of the type of movies/tv shows watched in the top country contributors

Observation

Unsurprisingly, the United States stands out in both movies and TV shows because Netflix is an American company. India surprisingly ranks second in movies, followed by the UK. This result shows that movies outnumber TV shows in the countries contributing most to Netflix.
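
A breakdown of type by country can be sketched with a cross-tabulation. The `toy` frame below is hypothetical, and keeping only the top 2 countries is just to keep the example small.

```python
import pandas as pd

# Hypothetical toy frame with one country per title.
toy = pd.DataFrame({
    "country": ["United States", "India", "United States",
                "United Kingdom", "India", "United States"],
    "type": ["Movie", "Movie", "TV Show", "TV Show", "Movie", "Movie"],
})

# Pick the countries with the most titles, then restrict the data to them.
top_countries = toy["country"].value_counts().head(2).index
subset = toy[toy["country"].isin(top_countries)]

# Rows are countries, columns are types, cells are title counts.
by_type = pd.crosstab(subset["country"], subset["type"])
print(by_type)
```

A table in this shape can be passed to a grouped bar chart to compare movies and TV shows per country.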

Analysis on Rating

Observation

1. Programming rated TV-MA in the United States by the TV Parental Guidelines signifies content for mature audiences.

2. Programming rated TV-14 in the United States by the TV Parental Guidelines signifies content with parents strongly cautioned. Content may be inappropriate for children younger than 14 years of age.

3. An R-rated film is a film that has been assessed as having material that may be unsuitable for children under the age of 17.

The plot above shows that TV-MA is the most common rating, followed by TV-14 and then R. We can conclude from these findings that Netflix mostly comprises movies and TV shows rated TV-MA, TV-14, and R.

Understanding what content is available in different countries

Observation

To understand the content available in different countries, I randomly selected 2 genres to give an overview of the movies/TV shows available in each country. My observation shows that kids’ content is available and viewed in many countries, such as the United States, Italy, Canada, and others in the table above. Horror movie content is viewed mostly in the United States, Canada, Thailand, Mexico, and the other countries listed above.

Latest to oldest movies on Netflix

Observation

The table above shows that Netflix’s oldest movie is ‘Pioneers: First Women Filmmakers’, released in 1925, while Netflix’s latest movie is ‘Gabby’s Dollhouse’, released in 2021.
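
An ordering like this comes from sorting on 'release_year'. The sketch below uses a hypothetical `toy` frame with placeholder titles.

```python
import pandas as pd

# Hypothetical toy frame with placeholder titles.
toy = pd.DataFrame({
    "title": ["Title A", "Title B", "Title C"],
    "release_year": [1925, 2021, 2010],
})

# Sort from latest to oldest and renumber the rows.
ordered = toy.sort_values("release_year", ascending=False).reset_index(drop=True)

latest = ordered.iloc[0]["title"]
oldest = ordered.iloc[-1]["title"]
print(ordered)
```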

Distribution of movies/tv shows added to Netflix monthly and over the years

Observation

From my observation in the first plot, 2019 shows the highest number of movies/tv shows added to Netflix over the years, followed by 2020 and 2018. From the second plot, December shows the highest number of movies/tv shows added to Netflix monthly, followed by October and January.
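
The counts behind plots like these can be sketched as follows. The `toy` frame is hypothetical and assumes 'year_added' and 'month_added' columns already exist as strings.

```python
import pandas as pd

# Hypothetical toy frame with the derived year and month columns.
toy = pd.DataFrame({
    "year_added": ["2019", "2019", "2020", "2018", "2019"],
    "month_added": ["December", "October", "December", "January", "December"],
})

# Titles added per year, in chronological order.
per_year = toy["year_added"].value_counts().sort_index()

# Titles added per month, most frequent first.
per_month = toy["month_added"].value_counts()

print(per_year)
print(per_month)
```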

Distribution of movies/tv shows released over the years on Netflix

Observation

This plot shows Netflix released a high number of movies in the 20s and it’s still growing to date.

Netflix growth over the years (Licensed vs Original)

Observation

To analyze the growth of Netflix over the years, let’s assume that ‘release_year’ gives the years in which Netflix released its original movies/shows, while ‘date_added’ gives the dates on which licensed movies/shows were added, i.e. when Netflix acquired a license for them to be viewed on Netflix.

This shows that Netflix’s original films started way back in the 20s, and the number has been constantly increasing from 2014 to date. Interestingly, Netflix has grown strong and is now producing more of its own shows.
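
Under that assumption, the two yearly counts can be placed side by side as sketched below. The `toy` frame is hypothetical, 'year_added' is assumed to have been converted to integers, and the column names `released` and `added` are illustrative labels only.

```python
import pandas as pd

# Hypothetical toy frame; year_added is assumed to be an integer here.
toy = pd.DataFrame({
    "release_year": [2014, 2018, 2019, 2019, 2020],
    "year_added": [2019, 2019, 2019, 2020, 2020],
})

# Align both yearly counts on a shared year index; years missing from one side become 0.
growth = (
    pd.DataFrame({
        "released": toy["release_year"].value_counts(),
        "added": toy["year_added"].value_counts(),
    })
    .fillna(0)
    .astype(int)
    .sort_index()
)
print(growth)
```

Plotting the two columns as lines over the year index gives a comparison of the kind described above.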

Summary

Though many of us are familiar with Netflix, it was important to get a better idea of the overall industry and Netflix’s contributors, as well as perform some statistical analysis to better understand who is using Netflix and how they are using it. A summary of my findings:

  • Movies account for 69.1% of titles on Netflix, compared with 30.9% for TV shows.
  • The United States has a major stake, with 56.5% of shows on Netflix. This is likely because Netflix is an American company.
  • Netflix has grown strong and is now producing more of its own shows.

For the complete exploratory and explanatory analysis, click here to view all the code on GitHub. Thanks for reading!
