# Capstone Project Report - The Battle of Neighborhoods

---
## 1. Introduction

### 1.1 Background

The purpose of this project is to help investors discover optimal locations for business investment in Lincoln, Nebraska, USA. Many investors know where they want to invest or which type of business they want to invest in, but few know how business characteristics vary across a city such as Lincoln. Lincoln is a growing investment location for many industries and has recently been dubbed part of the "Silicon Prairie". However, it is still a small city with ~350,000 citizens. As a result, investors who are new to the area may need more information before they can comfortably invest in a business in Lincoln.

### 1.2 Business Problem

Lincoln is a growing city, but the businesses available to investors can vary drastically by location and category. Investors need more information about the businesses and areas in Lincoln before they can make final decisions.

### 1.3 Interest
Investors need help when researching new business locations. This project shows the number of businesses in different areas of Lincoln, as well as the types of businesses, through the following outputs:
1. Optimally clustered areas of Lincoln businesses.
2. Sorted lists of business types in each cluster.

---
## 2. Data
### 2.1 Data Sources

I will need data for businesses within Lincoln, which I will fetch through the Foursquare API. Foursquare is a location data provider with information about all manner of venues and events within an area of interest, including venue names, locations, menus, and even photos. Because all of the required information can be obtained through the API, the Foursquare location platform will be the sole data source.

### 2.2 Data Cleaning

The Foursquare API returns up to 100 venues in the search location per API call, along with many characteristics for each venue. To clean and transform the data, this project drops all characteristics for each venue except the following:
1. Venue Name
2. Venue Category
3. Venue Latitude
4. Venue Longitude
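
As a rough illustration of this trimming step, the sketch below keeps only the four fields from a venue response. The nested layout (`response` → `groups` → `items` → `venue`) is an assumption modeled on Foursquare's explore-style responses, since the report does not state which endpoint was used; the function name and the sample venue are hypothetical.

```python
def extract_venue_fields(response_json):
    """Flatten an explore-style response into rows holding only the four kept fields.

    The nested key layout is an assumption; adjust it to the actual response.
    """
    rows = []
    for group in response_json["response"]["groups"]:
        for item in group["items"]:
            venue = item["venue"]
            categories = venue.get("categories", [])
            rows.append({
                "name": venue["name"],
                # Use the first listed category, or None if the venue has none.
                "category": categories[0]["name"] if categories else None,
                "latitude": venue["location"]["lat"],
                "longitude": venue["location"]["lng"],
            })
    return rows


# Made-up sample response containing a single placeholder venue.
sample_response = {
    "response": {
        "groups": [{
            "items": [{
                "venue": {
                    "name": "Example Cafe",
                    "categories": [{"name": "Coffee Shop"}],
                    "location": {"lat": 40.8136, "lng": -96.7026},
                    "id": "dropped-field",
                }
            }]
        }]
    }
}

rows = extract_venue_fields(sample_response)
```

Every other field on the venue (such as `id` in the sample) is simply never copied, which has the same effect as dropping those columns afterwards.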

This project also checks the total number of venues within the search area and repeats the search as many times as needed, so that the data set is not limited to 100 venues. The search results confirmed that there are 232 venues in this area, so I performed three API calls to obtain the complete data set.
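
The arithmetic behind the three calls can be sketched as follows. The helper name `page_offsets` is hypothetical, and the idea of passing each offset alongside a page size of 100 (for example as `offset` and `limit` request parameters) is an assumption about how the calls were paged, not something the report specifies.

```python
from math import ceil


def page_offsets(total_results, page_size=100):
    """Return the starting offset of each page needed to cover total_results."""
    number_of_pages = ceil(total_results / page_size)
    return [page * page_size for page in range(number_of_pages)]


# 232 venues at 100 per call need three calls, starting at 0, 100, and 200.
offsets = page_offsets(232)
```

The last page is only partially filled (32 venues), which is why the combined result holds 232 rows rather than 300.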

### 2.3 Feature Selection

After data cleaning, there were 232 venues listed with four characteristics each (name, category, latitude, longitude). To further explore the data, I decided to group the venues into clusters, each representing a candidate investment area, and to find the optimal number of clusters. To do this, I used the k-means clustering algorithm, which is a form of unsupervised machine learning.
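
For reference, the quantity that k-means minimizes is the within-cluster sum of squares (WCSS). The notation below is introduced here for illustration only: $x_i$ is the (latitude, longitude) pair of venue $i$, $C_j$ is the set of venues assigned to cluster $j$, and $\mu_j$ is the centroid of that cluster.

$$
\mathrm{WCSS}(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
$$

WCSS always decreases as $k$ grows, so the optimal $k$ is usually taken to be the point where adding another cluster stops reducing it by much.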

---
## 3. Methodology

### 3.1 Materials
This project utilizes Python 3.7.6, along with the following libraries:
* Pandas: For creating and manipulating dataframes.
* Folium: For visualizing the distribution of venue clusters on an interactive Leaflet map.
* scikit-learn: For importing k-means clustering.
* JSON: For handling JSON files.
* Requests: For fetching API results.
* Math: For performing square root calculations.

### 3.2 Design
This project examined a data set of 232 business venues within Lincoln and inspected four variables associated with each business venue:
1. Venue Name
2. Venue Category
3. Venue Latitude
4. Venue Longitude

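This project uses both within-group and between-group comparisons via the k-means clustering algorithm, which assigns venues to clusters so that the distances within each cluster are as small as possible.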

### 3.3 Procedures
1. To begin, I defined the latitude, longitude, and zoom variables for Lincoln. I used these variables to create a map of Lincoln using Folium.
2. Next, I used the Foursquare API with the latitude and longitude to obtain the total number of business venues within Lincoln. This returned 232, so I made three API calls to Foursquare, since calls are limited to 100 results each.
3. Then, I created a complete data set by combining all three API results. This data set dropped all columns except the four mentioned in section 3.2.
4. Once the data set was cleaned, I plotted each of the venues on the Folium map using blue markers.
5. To begin clustering, I calculated the 'within clusters sum-of-squares' and used those results to calculate the optimal number of clusters. These functions indicated that this data set should use 6 clusters.
6. I used 6 clusters to fit my data set with k-means clustering. Once the fitting was complete, I appended the cluster numbers (0-5) to each row in the data set.
7. Now that the data set was stratified by cluster number, I plotted the points back onto the Folium map. This time, each cluster was represented by a unique color.
8. In order to provide more descriptive insight into each cluster, I calculated the number of venues in each cluster.
9. Finally, I displayed the venue categories that occurred more than once within each cluster.
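
The report does not show how the optimal number of clusters in step 5 was derived from the WCSS values. One common approach that needs only a square root is to plot WCSS against the number of clusters, draw a straight line from the first point to the last, and pick the point farthest from that line (the "elbow"). The sketch below assumes that approach; the function name, the starting value `k_min`, and the WCSS values are illustrative and are not taken from the project.

```python
from math import sqrt


def optimal_number_of_clusters(wcss, k_min=2):
    """Return the k whose (k, WCSS) point lies farthest from the line
    joining the first and last points of the WCSS curve.

    wcss[i] is the within-cluster sum of squares for k = k_min + i.
    """
    k_max = k_min + len(wcss) - 1
    x1, y1 = k_min, wcss[0]
    x2, y2 = k_max, wcss[-1]
    # Length of the line segment between the two end points.
    line_length = sqrt((y2 - y1) ** 2 + (x2 - x1) ** 2)

    distances = []
    for index, y0 in enumerate(wcss):
        x0 = k_min + index
        # Perpendicular distance from (x0, y0) to the end-to-end line.
        numerator = abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1)
        distances.append(numerator / line_length)

    return k_min + distances.index(max(distances))


# Made-up WCSS values for k = 2..8, used only to exercise the function.
example_wcss = [100, 60, 30, 10, 8, 6, 5]
best_k = optimal_number_of_clusters(example_wcss)
```

In practice, the WCSS list would come from fitting k-means once per candidate `k` and recording the resulting sum of squares each time.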

---
## 4. Results

### 4.1 Clustering Results

The results of the k-means clustering algorithm show that there are six unique clusters of businesses in Lincoln. These clusters can also be described by their general location within the city:
1. Downtown (Blue)
2. South (Red)
3. East (Dark Green)
4. West (Black)
5. North (Pink)
6. Southeast (Purple)

#### Map of Clusters in Lincoln  
![](https://img.cmc.pub/blog/014-ibm-data-science/04_clusters-min.png)  

These clusters closely follow the street layout shown on the map. The clusters with many businesses (0 - 3) are in the most populous areas of Lincoln, while clusters 4 - 5 are in the areas of the city with fewer streets.

#### Number of Venues by Cluster in Lincoln  
![](https://img.cmc.pub/blog/014-ibm-data-science/05_venues_per_cluster-min.png)  

Finally, we can take a closer look at the venues in these clusters. Investors interested in Mexican restaurants, for example, would have to decide whether to invest in an area that already has many Mexican restaurants (cluster #1) or in a cluster with fewer of them. Likewise, we can see which areas are more popular for other business categories.  

#### Venue Categories by Clusters in Lincoln (>1)  
![](https://img.cmc.pub/blog/014-ibm-data-science/06_categories_per_cluster_pt1-min.png)  
![](https://img.cmc.pub/blog/014-ibm-data-science/07_categories_per_cluster_pt2-min.png)  
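
The filtering behind these tables (categories that appear more than once in a cluster) can be sketched with the standard library alone. This is a minimal illustration, not the project's own code: the function name, the `cluster` and `category` keys, and the sample venues are all hypothetical.

```python
from collections import Counter, defaultdict


def repeated_categories_by_cluster(venues, min_count=2):
    """Map each cluster to its (category, count) pairs with count >= min_count,
    ordered from most to least frequent."""
    counts = defaultdict(Counter)
    for venue in venues:
        counts[venue["cluster"]][venue["category"]] += 1

    return {
        cluster: [
            (category, count)
            for category, count in counter.most_common()
            if count >= min_count
        ]
        for cluster, counter in sorted(counts.items())
    }


# Made-up venues, used only to exercise the function.
sample_venues = [
    {"cluster": 0, "category": "Mexican Restaurant"},
    {"cluster": 0, "category": "Coffee Shop"},
    {"cluster": 0, "category": "Mexican Restaurant"},
    {"cluster": 1, "category": "Coffee Shop"},
    {"cluster": 1, "category": "Coffee Shop"},
    {"cluster": 1, "category": "Park"},
]

repeated = repeated_categories_by_cluster(sample_venues)
```

Categories that occur only once in a cluster (such as `Park` in the sample) are left out, which keeps each list focused on the business types that are already established in that area.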

---
## 5. Discussion

The results of this project show that there are six distinct business areas to consider in Lincoln when determining where to invest. Further, clusters 0 - 3 (Downtown, East, South, and Southeast) hold the most venues, while clusters 4 - 5 (North and West) have very few. This suggests that potential investors should focus on the first four clusters of businesses in Lincoln and avoid the North and West areas. In addition, we found that venue categories such as Mexican restaurants and coffee shops are currently the most numerous. Investors will need to decide whether to invest in a well-represented business category or take a chance on one that does not have a large presence in the city.

These findings are limited by the data provided by Foursquare. A more conclusive study would utilize multiple data sources for business location data.

---
## 6. Conclusion

In this project, we gathered location data for Lincoln, Nebraska, USA and clustered the data using the k-means algorithm in order to identify the unique clusters of businesses in Lincoln. Through these actions, we found that there are six unique business clusters in Lincoln and that two of the clusters are likely unsuitable for investors. The remaining four clusters have a variety of businesses, but are largely dominated by restaurants and grocery stores.

With these results, investors can make more informed decisions about the location and category of business in which to invest.

Further studies may involve other attributes for business locations, such as population density, average wealth across the city, or crime rates. In addition, further studies may include additional location data and businesses by utilizing multiple sources, such as Google Maps and OpenStreetMap.