author     Christian Cleberg <hello@cleberg.net>    2024-04-22 14:07:21 -0500
committer  Christian Cleberg <hello@cleberg.net>    2024-04-22 14:07:21 -0500
commit     3def68d80edf87e28473609c31970507d9f03467 (patch)
tree       a64fb6363727dbfba4125d1b3c9d5c1423019b5e /content/blog/2020-07-26-business-analysis.org
parent     9ad1dcee850864fd2c8564ac90e4154ce68ae2b8 (diff)
format a portion of blog posts
Diffstat (limited to 'content/blog/2020-07-26-business-analysis.org')
-rw-r--r--  content/blog/2020-07-26-business-analysis.org | 124
1 file changed, 59 insertions, 65 deletions
diff --git a/content/blog/2020-07-26-business-analysis.org b/content/blog/2020-07-26-business-analysis.org
index 6d60471..098dce7 100644
--- a/content/blog/2020-07-26-business-analysis.org
+++ b/content/blog/2020-07-26-business-analysis.org
@@ -4,9 +4,9 @@
 #+filetags: :data:

 * Background Information
-This project aims to help investors learn more about a random city in
-order to determine optimal locations for business investments. The data
-used in this project was obtained using Foursquare's developer API.
+This project aims to help investors learn more about a random city in order to
+determine optimal locations for business investments. The data used in this
+project was obtained using Foursquare's developer API.

 Fields include:
@@ -15,12 +15,12 @@ Fields include:
 - Venue Latitude
 - Venue Longitude

-There are 232 records found using the center of Lincoln as the area of
-interest with a radius of 10,000.
+There are 232 records found using the center of Lincoln as the area of interest
+with a radius of 10,000.

 * Import the Data
-The first step is the simplest: import the applicable libraries. We will
-be using the libraries below for this project.
+The first step is the simplest: import the applicable libraries. We will be
+using the libraries below for this project.

 #+begin_src python
 # Import the Python libraries we will be using
@@ -33,10 +33,10 @@ from pandas.io.json import json_normalize
 from sklearn.cluster import KMeans
 #+end_src

-To begin our analysis, we need to import the data for this project. The
-data we are using in this project comes directly from the Foursquare
-API. The first step is to get the latitude and longitude of the city
-being studied (Lincoln, NE) and setting up the folium map.
+To begin our analysis, we need to import the data for this project. The data we
+are using in this project comes directly from the Foursquare API. The first step
+is to get the latitude and longitude of the city being studied (Lincoln, NE) and
+setting up the folium map.

 #+begin_src python
 # Define the latitude and longitude, then map the results
@@ -50,11 +50,11 @@ map_LNK
 #+end_src

 #+caption: Blank Map
 [[https://img.cleberg.net/blog/20200726-ibm-data-science/01_blank_map-min.png]]

-Now that we have defined our city and created the map, we need to go get
-the business data. The Foursquare API will limit the results to 100 per
-API call, so we use our first API call below to determine the total
-results that Foursquare has found. Since the total results are 232, we
-perform the API fetching process three times (100 + 100 + 32 = 232).
+Now that we have defined our city and created the map, we need to go get the
+business data. The Foursquare API will limit the results to 100 per API call, so
+we use our first API call below to determine the total results that Foursquare
+has found. Since the total results are 232, we perform the API fetching process
+three times (100 + 100 + 32 = 232).
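The credential block and the URL construction for these three calls fall
between the hunks shown here. A minimal sketch of the paginated fetch described
above, assuming placeholder credentials, the Foursquare v2 =venues/explore=
endpoint, and the =latitude= and =longitude= variables defined in the earlier
block, might look like this:

#+begin_src python
# Hypothetical sketch of the paginated fetch: the Foursquare explore
# endpoint returns at most 100 venues per call, so three calls with
# increasing offsets cover all 232 records (100 + 100 + 32).
import requests

CLIENT_ID = 'your-client-id'          # placeholder, not the real credential
CLIENT_SECRET = 'your-client-secret'  # placeholder, not the real credential
VERSION = '20200726'                  # assumed API version date

def fetch_venues(lat, lng, offset):
    # Build the explore URL for one page of up to 100 venues
    url = ('https://api.foursquare.com/v2/venues/explore'
           '?client_id={}&client_secret={}&v={}'
           '&ll={},{}&radius=10000&limit=100&offset={}').format(
               CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, offset)
    return requests.get(url).json()

# latitude and longitude are assumed to be defined in the earlier block
pages = [fetch_venues(latitude, longitude, offset) for offset in (0, 100, 200)]
#+end_src

The =offset= parameter pages through the result set in blocks of =limit=, which
is why three calls are enough for 232 records.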
 #+begin_src python
 # Foursquare API credentials
@@ -117,13 +117,12 @@ results3 = requests.get(url3).json()
 #+end_src

 * Clean the Data
-Now that we have our data in three separate dataframes, we need to
-combine them into a single dataframe and make sure to reset the index so
-that we have a unique ID for each business. The =get_category_type=
-function below will pull the categories and name from each business's
-entry in the Foursquare data automatically. Once all the data has been
-labeled and combined, the results are stored in the =nearby_venues=
-dataframe.
+Now that we have our data in three separate dataframes, we need to combine them
+into a single dataframe and make sure to reset the index so that we have a
+unique ID for each business. The =get_category_type= function below will pull
+the categories and name from each business's entry in the Foursquare data
+automatically. Once all the data has been labeled and combined, the results are
+stored in the =nearby_venues= dataframe.
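The body of this helper sits between the hunks, so only its leading comment is
visible below. Assuming the conventional name =get_category_type= and the field
layout of the Foursquare response, a sketch of the elided function might be:

#+begin_src python
# Hypothetical reconstruction of the elided helper: return the first
# category name from one venue row of the Foursquare response.
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # json_normalize flattens nested keys, so the column may be
        # prefixed with 'venue.'
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']
#+end_src

The helper is presumably applied row by row after the JSON response is
flattened with =json_normalize=, which the post imports at the start.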
 #+begin_src python
 # This function will extract the category of the venue from the API dictionary
@@ -194,9 +193,9 @@ nearby_venues
 [[https://img.cleberg.net/blog/20200726-ibm-data-science/02_clean_data-min.png]]

 * Visualize the Data
-We now have a complete, clean data set. The next step is to visualize
-this data onto the map we created earlier. We will be using folium's
-=CircleMarker()= function to do this.
+We now have a complete, clean data set. The next step is to visualize this data
+onto the map we created earlier. We will be using folium's =CircleMarker()=
+function to do this.

 #+begin_src python
 # add markers to map
@@ -220,15 +219,14 @@ map_LNK
 data map]]

 * Clustering: /k-means/
-To cluster the data, we will be using the /k-means/ algorithm. This
-algorithm is iterative and will automatically make sure that data points
-in each cluster are as close as possible to each other, while being as
-far as possible away from other clusters.
+To cluster the data, we will be using the /k-means/ algorithm. This algorithm is
+iterative and will automatically make sure that data points in each cluster are
+as close as possible to each other, while being as far as possible away from
+other clusters.

-However, we first have to figure out how many clusters to use (defined
-as the variable /'k'/). To do so, we will use the next two functions to
-calculate the sum of squares within clusters and then return the optimal
-number of clusters.
+However, we first have to figure out how many clusters to use (defined as the
+variable /'k'/). To do so, we will use the next two functions to calculate the
+sum of squares within clusters and then return the optimal number of clusters.
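The bodies of both helper functions fall between the hunks that follow. A
minimal sketch, assuming they implement the standard elbow heuristic (the k
range of 2 to 20 and the point-to-line distance rule are assumptions), might
look like this:

#+begin_src python
# Hypothetical reconstruction of the two elided helpers: compute the
# within-cluster sum of squares (WCSS) for a range of k, then pick the
# k whose WCSS lies farthest from the line joining the first and last
# points (the "elbow" of the curve).
import math
from sklearn.cluster import KMeans

def calculate_wcss(data):
    wcss = []
    for k in range(2, 21):
        model = KMeans(n_clusters=k, random_state=0).fit(data)
        wcss.append(model.inertia_)  # sum of squares for this k
    return wcss

def optimal_number_of_clusters(wcss):
    # Distance from each (k, WCSS) point to the line through the
    # first and last points; the farthest point is the elbow
    x1, y1 = 2, wcss[0]
    x2, y2 = 20, wcss[-1]
    distances = []
    for i, y0 in enumerate(wcss):
        x0 = i + 2
        num = abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1)
        den = math.sqrt((y2 - y1) ** 2 + (x2 - x1) ** 2)
        distances.append(num / den)
    return distances.index(max(distances)) + 2

# cluster_df is assumed to hold the venue coordinates
sum_of_squares = calculate_wcss(cluster_df)
#+end_src

The returned list feeds =n = optimal_number_of_clusters(sum_of_squares)= in the
next hunk, which the post then uses as the cluster count for =KMeans=.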
 #+begin_src python
 # This function will return the sum of squares found in the data
@@ -266,9 +264,9 @@ def optimal_number_of_clusters(wcss):
 n = optimal_number_of_clusters(sum_of_squares)
 #+end_src

-Now that we have found that our optimal number of clusters is six, we
-need to perform k-means clustering. When this clustering occurs, each
-business is assigned a cluster number from 0 to 5 in the dataframe.
+Now that we have found that our optimal number of clusters is six, we need to
+perform k-means clustering. When this clustering occurs, each business is
+assigned a cluster number from 0 to 5 in the dataframe.

 #+begin_src python
 # set number of clusters equal to the optimal number
 kclusters = n
@@ -281,9 +279,8 @@ kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(cluster_df)
 nearby_venues.insert(0, 'Cluster Labels', kmeans.labels_)
 #+end_src

-Success! We now have a dataframe with clean business data, along with a
-cluster number for each business. Now let's map the data using six
-different colors.
+Success! We now have a dataframe with clean business data, along with a cluster
+number for each business. Now let's map the data using six different colors.

 #+begin_src python
 # create map with clusters
@@ -310,12 +307,11 @@ map_clusters
 [[https://img.cleberg.net/blog/20200726-ibm-data-science/04_clusters-min.png]]

 * Investigate Clusters
-Now that we have figured out our clusters, let's do a little more
-analysis to provide more insight into the clusters. With the information
-below, we can see which clusters are more popular for businesses and
-which are less popular. The results below show us that clusters 0
-through 3 are popular, while clusters 4 and 5 are not very popular at
-all.
+Now that we have figured out our clusters, let's do a little more analysis to
+provide more insight into the clusters. With the information below, we can see
+which clusters are more popular for businesses and which are less popular. The
+results below show us that clusters 0 through 3 are popular, while clusters 4
+and 5 are not very popular at all.

 #+begin_src python
 # Show how many venues are in each cluster
@@ -329,9 +325,9 @@ for x in range(0,6):
 #+caption: Venues per Cluster
 [[https://img.cleberg.net/blog/20200726-ibm-data-science/05_venues_per_cluster-min.png]]

-Our last piece of analysis is to summarize the categories of businesses
-within each cluster. With these results, we can clearly see that
-restaurants, coffee shops, and grocery stores are the most popular.
+Our last piece of analysis is to summarize the categories of businesses within
+each cluster. With these results, we can clearly see that restaurants, coffee
+shops, and grocery stores are the most popular.

 #+begin_src python
 # Calculate how many venues there are in each category
@@ -362,19 +358,17 @@ with pd.option_context('display.max_rows', None, 'display.max_columns', None):
 [[https://img.cleberg.net/blog/20200726-ibm-data-science/07_categories_per_cluster_pt2-min.png]]

 * Discussion
-In this project, we gathered location data for Lincoln, Nebraska, USA
-and clustered the data using the k-means algorithm in order to identify
-the unique clusters of businesses in Lincoln. Through these actions, we
-found that there are six unique business clusters in Lincoln and that
-two of the clusters are likely unsuitable for investors. The remaining
-four clusters have a variety of businesses, but are largely dominated by
-restaurants and grocery stores.
-
-Using this project, investors can now make more informed decisions when
-deciding the location and category of business in which to invest.
-
-Further studies may involve other attributes for business locations,
-such as population density, average wealth across the city, or crime
-rates. In addition, further studies may include additional location data
-and businesses by utilizing multiple sources, such as Google Maps and
-OpenStreetMap.
+In this project, we gathered location data for Lincoln, Nebraska, USA and
+clustered the data using the k-means algorithm in order to identify the unique
+clusters of businesses in Lincoln. Through these actions, we found that there
+are six unique business clusters in Lincoln and that two of the clusters are
+likely unsuitable for investors. The remaining four clusters have a variety of
+businesses, but are largely dominated by restaurants and grocery stores.
+
+Using this project, investors can now make more informed decisions when deciding
+the location and category of business in which to invest.
+
+Further studies may involve other attributes for business locations, such as
+population density, average wealth across the city, or crime rates. In addition,
+further studies may include additional location data and businesses by utilizing
+multiple sources, such as Google Maps and OpenStreetMap.