author: Christian Cleberg <hello@cleberg.net> 2024-03-29 01:42:38 -0500
committer: Christian Cleberg <hello@cleberg.net> 2024-03-29 01:42:38 -0500
commit: 00b2726e0561f174393ae600f0f11adb8afebaab
tree: a4733d553ce68f64277ffa3a52f800dc58ff72de /content/blog/2020-07-20-video-game-sales.org
parent: 8ba3d90a0f3db7e5ed29e25ff6d0c1b557ed3ca0
parent: 41bd0ad58e44244fe67cb36e066d4bb68738516f
merge org branch into main
Diffstat (limited to 'content/blog/2020-07-20-video-game-sales.org')
-rw-r--r-- content/blog/2020-07-20-video-game-sales.org | 175
1 files changed, 175 insertions, 0 deletions
diff --git a/content/blog/2020-07-20-video-game-sales.org b/content/blog/2020-07-20-video-game-sales.org
new file mode 100644
index 0000000..672558d
--- /dev/null
+++ b/content/blog/2020-07-20-video-game-sales.org
@@ -0,0 +1,175 @@
+#+title: Data Exploration: Video Game Sales
+#+date: 2020-07-20
+#+description: Exploring and visualizing data with Python.
+#+filetags: :data:
+
+* Background Information
+This dataset (obtained from
+[[https://www.kaggle.com/gregorut/videogamesales/data][Kaggle]])
+contains a list of video games with sales greater than 100,000 copies.
+It was generated by a scrape of vgchartz.com.
+
+Fields include:
+
+- Rank: Ranking of overall sales
+- Name: The game name
+- Platform: Platform of the game release (e.g., PC, PS4)
+- Year: Year of the game's release
+- Genre: Genre of the game
+- Publisher: Publisher of the game
+- NA_Sales: Sales in North America (in millions)
+- EU_Sales: Sales in Europe (in millions)
+- JP_Sales: Sales in Japan (in millions)
+- Other_Sales: Sales in the rest of the world (in millions)
+- Global_Sales: Total worldwide sales (in millions)
+
+The dataset contains 16,598 records. Two records were dropped due to
+incomplete information.
+
+* Import the Data
+#+begin_src python
+# Import the Python libraries we will be using
+import pandas as pd
+import numpy as np
+import seaborn as sns; sns.set()
+import matplotlib.pyplot as plt
+
+# Load the file using the path to the downloaded file
+file = r'video_game_sales.csv'
+df = pd.read_csv(file)
+df
+#+end_src
+
+#+caption: Dataframe Results
+[[https://img.cleberg.net/blog/20200720-data-exploration-video-game-sales/01_dataframe-min.png]]
+
+* Explore the Data
+#+begin_src python
+# describe() shows basic statistics for the numeric columns.
+# The 'Year' column has a lower count than the others, which indicates missing values.
+df.describe()
+#+end_src
+
+#+caption: df.describe()
+[[https://img.cleberg.net/blog/20200720-data-exploration-video-game-sales/02_describe-min.png]]
+
+#+begin_src python
+# Locate the NaN values: the first array holds row positions, the second holds column positions.
+# For example, df.iloc[179, 3] is NaN.
+np.where(pd.isnull(df))
+
+# Output:
+# (array([179, ..., 16553], dtype=int64),
+#  array([3, ..., 5], dtype=int64))
+#+end_src
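+
+The positional arrays can be hard to read on a large dataframe. As a minimal sketch, assuming a small made-up dataframe (the rows and the names =toy=, =missing_per_column=, and =complete= are hypothetical, not part of the original analysis), the missing values can also be counted per column and the incomplete rows dropped:
+
```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; the real data comes from the CSV file
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Year": [2006.0, np.nan, 2008.0, 2010.0],
    "Publisher": ["Pub X", "Pub Y", None, "Pub X"],
    "Global_Sales": [1.5, 0.7, 0.3, 2.1],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the rows where both 'Year' and 'Publisher' are present
complete = toy.dropna(subset=["Year", "Publisher"])
print(len(toy), "rows before,", len(complete), "rows after")
```
+
+Whether to drop or keep incomplete rows depends on the question being asked; this post keeps them and lets the plotting functions skip the missing values.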
+
+* Visualize the Data
+#+begin_src python
+# Plot the global sales of each game, grouped by platform
+sns.catplot(x='Platform', y='Global_Sales', data=df, jitter=False).set_xticklabels(rotation=90)
+#+end_src
+
+#+caption: Plot of Global Sales by Platform
+[[https://img.cleberg.net/blog/20200720-data-exploration-video-game-sales/03_plot-min.png]]
+
+#+begin_src python
+# Plot the global sales of each game, grouped by genre
+sns.catplot(x='Genre', y='Global_Sales', data=df, jitter=False).set_xticklabels(rotation=45)
+#+end_src
+
+#+caption: Plot of Global Sales by Genre
+[[https://img.cleberg.net/blog/20200720-data-exploration-video-game-sales/04_plot-min.png]]
+
+#+begin_src python
+# Plot the global sales of each game by year, with a fitted regression line
+sns.lmplot(x='Year', y='Global_Sales', data=df).set_xticklabels(rotation=45)
+#+end_src
+
+#+caption: Plot of Global Sales by Year
+[[https://img.cleberg.net/blog/20200720-data-exploration-video-game-sales/05_plot-min.png]]
+
+#+begin_src python
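+# Plot four lines to show the yearly sales in each region.
+# The global sales line is commented out, but can be included for comparison.
+sales_columns = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']
+df2 = df.groupby('Year')[sales_columns].sum()
+# Take the x-axis values from the grouped index so they match the years present in the data
+years = df2.index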
+
+a = df2['NA_Sales']
+b = df2['EU_Sales']
+c = df2['JP_Sales']
+d = df2['Other_Sales']
+# e = df2['Global_Sales']
+
+fig, ax = plt.subplots(figsize=(12,12))
+ax.set_ylabel('Region Sales (in Millions)')
+ax.set_xlabel('Year')
+
+ax.plot(years, a, label='NA_Sales')
+ax.plot(years, b, label='EU_Sales')
+ax.plot(years, c, label='JP_Sales')
+ax.plot(years, d, label='Other_Sales')
+# ax.plot(years, e, label='Global_Sales')
+
+ax.legend()
+plt.show()
+#+end_src
+
+#+caption: Plot of Regional Sales by Year
+[[https://img.cleberg.net/blog/20200720-data-exploration-video-game-sales/06_plot-min.png]]
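+
+When plotting grouped totals, the x-axis values need to line up with the years that actually appear in the grouped result. A minimal sketch, assuming made-up numbers (the rows and the names =toy= and =yearly= are hypothetical), shows that the grouped index only contains the years present in the data, including any gaps:
+
```python
import pandas as pd

# Made-up rows for illustration only; note the gap between 2008 and 2020
toy = pd.DataFrame({
    "Year": [2006, 2006, 2008, 2020],
    "NA_Sales": [1.0, 2.0, 0.5, 0.1],
    "EU_Sales": [0.5, 1.0, 0.2, 0.1],
})

# Sum the regional sales for each year
yearly = toy.groupby("Year")[["NA_Sales", "EU_Sales"]].sum()

# The index holds exactly the years that occur in the data
years = yearly.index.tolist()
print(years)
print(yearly)
```
+
+Using the index as the x-axis keeps each total attached to its own year, which a fixed range of years cannot guarantee when some years are missing.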
+
+** Investigate Outliers
+#+begin_src python
+# Find the game with the highest sales in North America
+df.loc[df['NA_Sales'].idxmax()]
+
+# Output:
+# Rank                     1
+# Name            Wii Sports
+# Platform               Wii
+# Year                  2006
+# Genre               Sports
+# Publisher         Nintendo
+# NA_Sales             41.49
+# EU_Sales             29.02
+# JP_Sales              3.77
+# Other_Sales           8.46
+# Global_Sales         82.74
+# Name: 0, dtype: object
+
+# Explore statistics in the year 2006 (highest selling year)
+df3 = df[(df['Year'] == 2006)]
+df3.describe()
+#+end_src
+
+#+caption: Descriptive Statistics of 2006 Sales
+[[https://img.cleberg.net/blog/20200720-data-exploration-video-game-sales/07_2006_stats-min.png]]
+
+#+begin_src python
+# Plot the games from the previous dataframe (released in 2006).
+# We can see the year's results were largely carried by Wii Sports.
+sns.catplot(x="Genre", y="Global_Sales", data=df3, jitter=False).set_xticklabels(rotation=45)
+#+end_src
+
+#+caption: Plot of 2006 Sales
+[[https://img.cleberg.net/blog/20200720-data-exploration-video-game-sales/08_plot-min.png]]
+
+#+begin_src python
+# We can see 4 outliers in the graph above, so let's get the top 5 games from that dataframe
+# The results below show that Nintendo had all top 5 games (3 on the Wii and 2 on the DS)
+df3.sort_values(by=['Global_Sales'], ascending=False).head(5)
+#+end_src
+
+#+caption: Outliers of 2006 Sales
+[[https://img.cleberg.net/blog/20200720-data-exploration-video-game-sales/09_outliers-min.png]]
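+
+How much a single title carries a year can also be expressed as a share of that year's total. A minimal sketch, assuming made-up rows (the game names, figures, and the names =toy=, =top=, and =share= are hypothetical):
+
```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Year": [2006, 2006, 2006, 2007],
    "Global_Sales": [8.0, 1.5, 0.5, 3.0],
})

# Select one year, then take its best-selling title
year_2006 = toy[toy["Year"] == 2006]
top = year_2006.nlargest(1, "Global_Sales").iloc[0]

# Share of the year's global sales that the top title accounts for
share = top["Global_Sales"] / year_2006["Global_Sales"].sum()
print(f"{top['Name']}: {share:.0%} of 2006 global sales")
```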
+
+* Discussion
+The purpose of exploring datasets is to ask questions, answer questions,
+and discover intelligence that can be used to inform decision-making.
+So, what have we found in this dataset?
+
+Today we simply explored a publicly available dataset to see what kind
+of information it contained. During that exploration, we found that
+video game sales peaked in 2006. That peak was largely due to Nintendo,
+which published the top five games of 2006 and has a number of games in
+the top 10 list for the years 1980-2020. Additionally, the top four
+platforms by global sales (Wii, NES, GB, DS) are owned by Nintendo.
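+
+Findings like these can be checked directly against grouped totals instead of being read off a plot. A minimal sketch, assuming made-up rows (the publisher names, figures, and the names =toy=, =sales_by_year=, =peak_year=, and =top_publishers= are hypothetical, so the results say nothing about the real dataset):
+
```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D", "Game E"],
    "Year": [2006, 2006, 2008, 2008, 2010],
    "Publisher": ["Pub X", "Pub X", "Pub Y", "Pub X", "Pub Z"],
    "Global_Sales": [8.0, 1.5, 4.0, 3.0, 2.0],
})

# Total global sales per year, and the year with the largest total
sales_by_year = toy.groupby("Year")["Global_Sales"].sum()
peak_year = sales_by_year.idxmax()
print("Peak year:", peak_year)

# How often each publisher appears among the three best-selling games
top_publishers = toy.nlargest(3, "Global_Sales")["Publisher"].value_counts()
print(top_publishers)
```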
+
+We didn't explore everything this dataset has to offer, but we can tell
+from a brief analysis that Nintendo seems to rule sales in the video
+gaming world. Further analysis could provide insight into which genres,
+regions, publishers, or world events are correlated with sales.