SF Fire Data Analysis With Databricks And Spark
Let's dive into analyzing San Francisco Fire Department (SF Fire) incident data using Databricks, Spark, and the datasets that accompany Learning Spark v2. This is a cool way to get hands-on with big data tools and see how data analysis can shed light on urban incidents. We'll be working primarily with the sf-fire-calls.csv dataset, exploring its contents, and performing some interesting transformations and analyses. So, buckle up, data enthusiasts!
Understanding the SF Fire Calls Dataset
The sf-fire-calls.csv dataset is a treasure trove of information regarding fire incidents in San Francisco. When you're starting with any data analysis task, the first step is always to understand your data. This involves looking at the schema, understanding what each column represents, and identifying any potential data quality issues. This dataset usually contains details about each fire call, such as the call type, location, time, units involved, and potentially even the disposition of the incident. Understanding this data is crucial because it forms the foundation for all subsequent analysis.
Why is this initial understanding so important, guys? Well, imagine trying to build a house without knowing what materials you have! You need to know if you have wood, bricks, or concrete to plan your construction effectively. Similarly, in data analysis, knowing your data types (is it a number, text, or date?), the range of values (are there any outliers?), and the relationships between columns (does one column influence another?) will guide your analysis and ensure the insights you derive are meaningful and accurate.
To truly grasp the dataset, consider a few key questions. What time range does the data cover? This helps you understand if you're looking at recent trends or historical data. What are the most common types of calls? This can reveal the primary challenges the fire department faces. Where do the most incidents occur? Identifying hotspots can help with resource allocation and prevention efforts. How long does it typically take to resolve an incident? This can highlight areas where efficiency can be improved. Answering these questions through initial exploration sets the stage for deeper, more targeted analysis.
Furthermore, don't underestimate the power of data visualization at this stage. Simple charts and graphs can quickly reveal patterns and anomalies that might be missed when staring at raw data. For example, a histogram of incident times can show peak hours, while a map of incident locations can highlight problem areas. Tools like Databricks make it easy to create these visualizations directly from your Spark dataframes, allowing you to quickly gain insights into your data. Remember, the more you understand your data upfront, the better equipped you'll be to extract valuable insights and make data-driven decisions later on.
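To make that concrete, here's the kind of quick chart you can produce once the data is loaded; it's a sketch that assumes the sf_fire_df DataFrame and the parsed CallDateTime_TS timestamp column we create later in this post.
from pyspark.sql import functions as F
# Call volume by hour of day; display() renders this as a chart in Databricks
display(
    sf_fire_df
    .withColumn('hourOfDay', F.hour('CallDateTime_TS'))
    .groupBy('hourOfDay')
    .count()
    .orderBy('hourOfDay')
)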
Setting Up Your Databricks Environment for Spark
Alright, let’s get our hands dirty! Setting up your Databricks environment for learning Spark v2 is crucial for working with the SF Fire dataset. Databricks provides a collaborative, cloud-based platform optimized for Apache Spark. Think of it as your data science workshop, equipped with all the tools you need to explore, process, and analyze big data.
First, you'll need a Databricks account. If you don’t already have one, head over to the Databricks website and sign up for a free community edition or a trial of the full platform. Once you’re in, you'll be greeted with a workspace where you can create notebooks. Notebooks are your primary interface for interacting with Spark. They allow you to write and execute code in multiple languages (Python, Scala, R, SQL) and visualize your results inline.
Next, you'll want to create a new cluster. A cluster is essentially a group of virtual machines that work together to process your data. When creating a cluster, you'll need to choose a Spark version (make sure it's Spark v2, in keeping with our Learning Spark v2 focus), the instance type for your worker nodes (the machines that do the heavy lifting), and the number of worker nodes. For smaller datasets like sf-fire-calls.csv, a small cluster with a few worker nodes should be sufficient. However, if you're planning to work with larger datasets in the future, you might want to consider a larger cluster with more powerful instances. Don't worry too much about optimizing your cluster configuration at this stage; you can always adjust it later as needed. The key is to get a working environment up and running so you can start experimenting with your data.
Once your cluster is up and running, you can upload the sf-fire-calls.csv dataset to the Databricks File System (DBFS). DBFS is a distributed file system that makes your data accessible to all the nodes in your cluster. You can upload the file directly through the Databricks UI or use the Databricks CLI. After the file is uploaded, you can read it into a Spark DataFrame using the spark.read.csv() method. Make sure to specify header=True if your CSV file has a header row, and either set inferSchema=True so Spark infers the data types automatically or supply an explicit schema, as we'll do in the next section.
Finally, it's a good idea to test your setup by running a simple Spark query to verify that everything is working correctly. For example, you could count the number of rows in your dataframe or display the first few rows using the show() method. If you can successfully execute these commands, congratulations! You've successfully set up your Databricks environment and are ready to start exploring the SF Fire dataset.
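For example, a minimal sanity check might look like the snippet below; the DBFS path is just the one used later in this post, so adjust it to wherever you uploaded the file.
# Read the uploaded CSV and confirm Spark can see it
df_check = spark.read.csv('dbfs:/FileStore/tables/sf_fire_calls.csv', header=True, inferSchema=True)
print(f"Row count: {df_check.count()}")
df_check.show(5)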
Loading and Inspecting the SF Fire Calls CSV Data
Time to load and inspect the sf-fire-calls.csv data within our Databricks environment. This step is vital for understanding the structure and content of the data before we start any serious analysis. Let's walk through the process of loading the data into a Spark DataFrame and then inspecting its schema and a few rows to get a feel for what we're working with.
First, we'll use Spark's CSV reader to load the data into a DataFrame. Assuming you've already uploaded the sf-fire-calls.csv file to DBFS, you can use the following code snippet in a Databricks notebook to load the data:
from pyspark.sql.types import *
sf_fire_schema = StructType([
    StructField('CallNumber', IntegerType(), True),
    StructField('UnitID', StringType(), True),
    StructField('IncidentNumber', IntegerType(), True),
    StructField('CallType', StringType(), True),
    StructField('CallDate', StringType(), True),
    StructField('WatchDate', StringType(), True),
    StructField('CallTime', StringType(), True),
    StructField('WatchTime', StringType(), True),
    StructField('CallDateTime', TimestampType(), True),
    StructField('WatchDateTime', TimestampType(), True),
    StructField('AlarmDtTm', TimestampType(), True),
    StructField('ArrivalDtTm', TimestampType(), True),
    StructField('CloseDtTm', TimestampType(), True),
    StructField('City', StringType(), True),
    StructField('zipcode', IntegerType(), True),
    StructField('Battalion', StringType(), True),
    StructField('StationArea', StringType(), True),
    StructField('Box', StringType(), True),
    StructField('SuppressionUnits', IntegerType(), True),
    StructField('SuppressionPersonnel', IntegerType(), True),
    StructField('FirePreventionUnits', IntegerType(), True),
    StructField('FirePreventionPersonnel', IntegerType(), True),
    StructField('EMSUnits', IntegerType(), True),
    StructField('EMSPersonnel', IntegerType(), True),
    StructField('OtherUnits', IntegerType(), True),
    StructField('OtherPersonnel', IntegerType(), True),
    StructField('FirstUnitOnScene', StringType(), True),
    StructField('ALSUnit', BooleanType(), True),
    StructField('CallTypeGroup', StringType(), True),
    StructField('NumAlarms', IntegerType(), True),
    StructField('UnitType', StringType(), True),
    StructField('UnitSequence', IntegerType(), True),
    StructField('FirePreventionDistrict', StringType(), True),
    StructField('SupervisorDistrict', StringType(), True),
    StructField('Neighborhood', StringType(), True),
    StructField('Location', StringType(), True),
    StructField('RowID', StringType(), True),
    StructField('lat', DoubleType(), True),
    StructField('lng', DoubleType(), True),
])
sf_fire_df = spark.read.csv('dbfs:/FileStore/tables/sf_fire_calls.csv', header=True, schema=sf_fire_schema, timestampFormat='MM/dd/yyyy hh:mm:ss a')
This code tells Spark to read the CSV file located at dbfs:/FileStore/tables/sf_fire_calls.csv (adjust the path if you uploaded it to a different location). The header=True option tells Spark that the first row of the CSV file contains the column headers, and schema=sf_fire_schema applies the explicit schema we just defined, which is faster and more predictable than asking Spark to infer the types with inferSchema=True. The timestampFormat option tells the reader how to parse the *DtTm columns; it assumes they use the same MM/dd/yyyy hh:mm:ss AM/PM layout we parse explicitly later on, so adjust it if your copy of the file differs.
Once the data is loaded into a DataFrame, the next step is to inspect the schema. The schema provides information about the column names and their corresponding data types. You can print the schema using the printSchema() method:
sf_fire_df.printSchema()
This will output a tree-like structure that shows the name and data type of each column in the DataFrame. Pay close attention to the data types. Are they what you expect? Because we supplied an explicit schema, each column should come back with exactly the type we declared; for example, the *DtTm columns should appear as timestamps rather than plain strings. If something looks off, adjust the schema or plan to perform data type conversions later on.
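Since the schema was supplied explicitly, the output is fully determined by the StructType defined above, so the first few lines should look something like this:
root
 |-- CallNumber: integer (nullable = true)
 |-- UnitID: string (nullable = true)
 |-- IncidentNumber: integer (nullable = true)
 |-- CallType: string (nullable = true)
 ...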
Finally, it's always a good idea to peek at a few rows of the data to get a sense of its contents. You can use the show() method to display the first few rows of the DataFrame:
sf_fire_df.show(5)
This will display the first 5 rows of the DataFrame in a tabular format. Look for any unexpected values or patterns. Are there any missing values? Are there any inconsistencies in the data? By carefully inspecting the data, you can identify potential data quality issues and plan your data cleaning and transformation steps accordingly.
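For example, here's one quick way to count missing values; the handful of columns listed is just an illustration, so swap in whichever columns matter to you.
from pyspark.sql import functions as F
# Count how many rows have a null in each of these columns
null_counts = sf_fire_df.select([
    F.count(F.when(F.col(c).isNull(), c)).alias(c)
    for c in ['CallType', 'ArrivalDtTm', 'lat', 'lng']
])
null_counts.show()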
Analyzing Call Types and Response Times
Now for some real analysis! Let's start by analyzing the call types and response times in the sf-fire-calls.csv dataset. This will give us insights into the types of incidents the fire department responds to most frequently and how quickly they are able to reach the scene. We'll use Spark's powerful aggregation and transformation capabilities to perform these analyses.
First, let's determine the most common types of calls. We can do this by grouping the data by the CallType column and counting the number of calls for each type. Here's the Spark code to do this:
from pyspark.sql import functions as F
call_type_counts = sf_fire_df.groupBy('CallType').count().orderBy(F.desc('count'))
call_type_counts.show(10, truncate=False)
This code groups the DataFrame by the CallType column, counts the number of rows in each group, and then orders the results in descending order of the count. The show(10, truncate=False) method displays the top 10 call types and their counts, with truncate=False ensuring that the full call type descriptions are displayed.
By examining the output of this query, you can identify the most frequent types of incidents. For example, you might find that medical emergencies, structure fires, or traffic accidents are the most common call types. This information can be valuable for resource allocation and training purposes.
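If you want to see how dominant the top call types really are, a small extension of the query above expresses each type's count as a share of all calls (a sketch reusing the same sf_fire_df DataFrame).
from pyspark.sql import functions as F
# Express each call type's count as a percentage of all calls
total_calls = sf_fire_df.count()
call_type_share = (
    sf_fire_df.groupBy('CallType').count()
    .withColumn('pct_of_calls', F.round(F.col('count') / total_calls * 100, 2))
    .orderBy(F.desc('count'))
)
call_type_share.show(10, truncate=False)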
Next, let's analyze response times. Response time is a critical metric for emergency services, as it directly impacts the outcome of an incident. To calculate response time, we need to subtract the time the call was received from the time the first unit arrived on the scene. In this dataset the call time is split across the 'CallDate' and 'CallTime' columns, while the arrival time lives in 'ArrivalDtTm', so here's how you can calculate the response time in seconds:
from pyspark.sql.functions import col, concat_ws, to_timestamp, unix_timestamp
# Combine the call date and time strings into a single timestamp
sf_fire_df = sf_fire_df.withColumn('CallDateTime_TS', to_timestamp(concat_ws(' ', col('CallDate'), col('CallTime')), 'MM/dd/yyyy hh:mm:ss a'))
# Make sure the arrival time is a proper timestamp (a pass-through if the reader already parsed it)
sf_fire_df = sf_fire_df.withColumn('ArrivalDtTm_TS', to_timestamp(col('ArrivalDtTm'), 'MM/dd/yyyy hh:mm:ss a'))
sf_fire_df = sf_fire_df.withColumn(
    'responseTime',
    (unix_timestamp(col('ArrivalDtTm_TS')) - unix_timestamp(col('CallDateTime_TS')))
)
sf_fire_df.select('CallDateTime_TS', 'ArrivalDtTm_TS', 'responseTime').show(10)
This code snippet first combines CallDate and CallTime into a single timestamp using concat_ws and to_timestamp, then adds a new column called responseTime to the DataFrame, which contains the response time in seconds. The unix_timestamp function converts the timestamp columns to Unix timestamps (seconds since the epoch), and the subtraction operation calculates the difference between the two timestamps.
After calculating the response time, you can analyze it further to identify trends and outliers. For example, you could calculate the average response time for different call types or different locations. You could also identify incidents with unusually long response times and investigate the reasons behind the delays. This information can be used to improve the efficiency and effectiveness of the fire department's response efforts.
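As a starting point, here's a minimal sketch of the first idea, the average response time per call type, filtering out null or negative values that usually indicate missing or mis-ordered timestamps.
from pyspark.sql import functions as F
# Average response time (in seconds) per call type, slowest first
avg_response_by_type = (
    sf_fire_df
    .filter(F.col('responseTime').isNotNull() & (F.col('responseTime') >= 0))
    .groupBy('CallType')
    .agg(F.avg('responseTime').alias('avgResponseSeconds'))
    .orderBy(F.desc('avgResponseSeconds'))
)
avg_response_by_type.show(10, truncate=False)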
Visualizing Incident Locations on a Map
Let's add some visual flair by plotting incident locations on a map! Visualizing data geographically can reveal patterns and trends that might not be obvious from looking at raw data or tables. We'll leverage the latitude and longitude information in the sf-fire-calls.csv dataset to create a map of incident locations.
Before we start plotting, let's make sure our latitude and longitude data is clean and complete. We'll check for missing values and invalid coordinates. Here's how you can filter out rows with missing latitude or longitude values:
from pyspark.sql import functions as F
sf_fire_df_valid_coords = sf_fire_df.filter(F.col('lat').isNotNull() & F.col('lng').isNotNull())
print(f"Rows with valid coordinates: {sf_fire_df_valid_coords.count()}")
This code filters the DataFrame to include only rows where both the lat and lng columns have non-null values. The isNotNull() method checks for null values, and the & operator combines the two conditions. After filtering the data, it's a good idea to verify that the remaining rows have valid coordinate values within the expected range. Latitude values should be between -90 and 90, and longitude values should be between -180 and 180.
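If you want to be strict about it, a simple follow-up filter along these lines keeps only coordinates in those ranges (for San Francisco specifically, you could tighten the bounds much further).
# Keep only rows whose coordinates fall within the valid global ranges
sf_fire_df_valid_coords = sf_fire_df_valid_coords.filter(
    F.col('lat').between(-90, 90) & F.col('lng').between(-180, 180)
)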
Once you have a clean dataset with valid latitude and longitude coordinates, you can use a variety of tools to plot the incident locations on a map. One popular option is to use a Python library like matplotlib or plotly. However, since we're working in Databricks, we can also leverage Databricks' built-in visualization capabilities. First, we need to sample our dataset so we don't try to render too many points.
sampled_df = sf_fire_df_valid_coords.sample(withReplacement=False, fraction=0.1, seed=42)
display(sampled_df)
By visualizing the incident locations on a map, you can gain valuable insights into the spatial distribution of fire incidents in San Francisco. You might identify hotspots where incidents are concentrated, or you might notice patterns related to land use, population density, or other geographic factors. This information can be used to inform resource allocation, prevention efforts, and emergency response planning.
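If you'd rather put numbers on those hotspots instead of eyeballing the map, the Neighborhood column in our schema gives you a quick tabular view (a sketch using the filtered DataFrame from above).
from pyspark.sql import functions as F
# Incident counts by neighborhood, busiest first
(
    sf_fire_df_valid_coords
    .groupBy('Neighborhood')
    .count()
    .orderBy(F.desc('count'))
    .show(10, truncate=False)
)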
Advanced Analysis: Predicting Call Types
Ready for a challenge? Let's try some advanced analysis by building a machine learning model to predict call types based on other features in the sf-fire-calls.csv dataset. This is a classic classification problem that can be tackled using Spark's MLlib library. It goes a bit beyond the core Learning Spark v2 material, but it's a great way to stretch what you've learned.
First, we need to prepare our data for machine learning. This involves selecting the features we want to use for prediction, cleaning and transforming the data, and splitting it into training and testing sets. Let's start by selecting a few relevant features, such as the time of day, day of week, and location of the incident. We'll also need to convert categorical values, such as the call type we're trying to predict, into numerical representations that our machine learning model can use.
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.sql.functions import hour, dayofweek
# Derive numeric time features from the call timestamp, since VectorAssembler
# only accepts numeric, boolean, and vector columns (not timestamps)
ml_df = (sf_fire_df_valid_coords
    .withColumn('CallHour', hour('CallDateTime_TS'))
    .withColumn('CallDayOfWeek', dayofweek('CallDateTime_TS')))
# Define the features to use for prediction
feature_cols = ['CallHour', 'CallDayOfWeek', 'lat', 'lng']
# Create a StringIndexer to convert the call type to a numerical index
# (handleInvalid='skip' drops rows with a missing call type)
indexer = StringIndexer(inputCol='CallType', outputCol='CallTypeIndex', handleInvalid='skip')
# Create a VectorAssembler to combine the features into a single vector
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features', handleInvalid='skip')
# Create a pipeline to chain the indexer and assembler
pipeline = Pipeline(stages=[indexer, assembler])
# Fit the pipeline to the data
pipeline_model = pipeline.fit(ml_df)
# Transform the data
transformed_df = pipeline_model.transform(ml_df)
This code snippet first derives numeric CallHour and CallDayOfWeek features from the call timestamp, since MLlib's VectorAssembler works with numeric inputs rather than timestamps. It then uses Spark's StringIndexer to convert the CallType column into a numerical index and VectorAssembler to combine the selected features into a single vector, which is required by most Spark MLlib algorithms. Finally, it creates a Pipeline to chain the indexer and assembler together, making it easier to apply the same transformations to both the training and testing data.
Once the data is prepared, we can choose a machine learning algorithm to use for prediction. For this example, let's use a simple decision tree classifier. We'll train the model on the training data and then evaluate its performance on the testing data.
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Split the data into training and testing sets
training_data, testing_data = transformed_df.randomSplit([0.8, 0.2], seed=42)
# Create a DecisionTreeClassifier
dt = DecisionTreeClassifier(labelCol='CallTypeIndex', featuresCol='features')
# Train the model
model = dt.fit(training_data)
# Make predictions on the testing data
predictions = model.transform(testing_data)
# Evaluate the model
evaluator = MulticlassClassificationEvaluator(labelCol='CallTypeIndex', predictionCol='prediction', metricName='accuracy')
accuracy = evaluator.evaluate(predictions)
print(f'Accuracy: {accuracy}')
This code snippet trains a decision tree classifier on the training data and then makes predictions on the testing data. It then evaluates the model's performance using the MulticlassClassificationEvaluator, which calculates the accuracy of the predictions. The accuracy score provides an indication of how well the model is able to predict call types based on the selected features.
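Accuracy on its own can be misleading when classes are imbalanced, so a useful sanity check is to compare it against a majority-class baseline; here's a minimal sketch using the transformed_df from earlier.
from pyspark.sql import functions as F
# Accuracy you'd get by always predicting the single most common call type
most_common_count = (
    transformed_df.groupBy('CallTypeIndex').count()
    .orderBy(F.desc('count'))
    .first()['count']
)
baseline_accuracy = most_common_count / transformed_df.count()
print(f'Majority-class baseline accuracy: {baseline_accuracy:.3f}')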
Conclusion: Sparking Insights from Fire Data
We've journeyed through analyzing the SF Fire dataset using Databricks and Spark, from understanding the dataset to predicting call types. We started by loading and inspecting the sf-fire-calls.csv dataset, gaining an understanding of its structure and content. We then performed various analyses, such as determining the most common types of calls, calculating response times, and visualizing incident locations on a map. Finally, we even tackled an advanced problem by building a machine learning model to predict call types based on other features in the dataset. The capabilities and possibilities are endless!