Movement Analysis of the Aldabra Giant Tortoise¶
By Erin Hopper¶
Introduction¶
Fine-grained animal movement data reveals a great deal about an animal: which locations are important to it, what its habits are, and which activities fill its day. However, dense and thorough movement data can be difficult to find. In this project, I aim to use the data science pipeline to analyze the behavior of the Aldabra giant tortoise and categorize his activity into specific behaviors, such as eating, foraging, or sleeping. This will uncover insights about the daily behavior and schedule of the Aldabra giant tortoise, and allow wildlife researchers to predict his movement patterns or prepare to support these tortoises in captivity.
Data Curation¶
The first step in the data science pipeline is identifying data to work with and making sure it is in the right format for analysis.
This project will analyze the movement data of an Aldabra giant tortoise (Aldabrachelys gigantea) over roughly 11.5 days of continuous recording, which consists of about 990,000 locations at one-second intervals. It was collected via accelerometer, magnetometer, and GPS in 2018. The data is very thorough and clean, and its specific nature (one individual over a short period of time) will allow for thorough analysis.
Furthermore, this data is being used in the 2024 MoveModel competition, so preliminary analysis through this project will prepare me to submit to this competition.
Data References
Redcliffe, James; Cole, Nik; Tatayah, Vikash; Wilson, Rory; Börger, Luca (2018). Aldabra giant tortoise (Aldabrachelys gigantea) high resolution movement path on Round Island (Mauritius). figshare. Dataset. https://doi.org/10.6084/m9.figshare.5808330.v1
I will begin by importing all necessary packages. These contain the tools we will use to analyze the data.
# Data storage and processing
import math
import datetime as dt
import utm
import scipy
import numpy as np
import pandas as pd
from collections import Counter
# Plots and visualizations
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import seaborn as sns
from kneed import KneeLocator
# Machine learning
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN
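Note that utm and kneed are not part of the standard scientific Python stack; if they are not already installed, they can usually be added with pip install utm kneed.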
Next, I will read the dataset into a pandas DataFrame. The .head() command displays the first few lines of the dataset.
tortoise = pd.read_csv("tortoise.csv")
tortoise.head()
|   | X | Y |
|---|---|---|
| 0 | 582628.660871 | -2.195359e+06 |
| 1 | 582628.661313 | -2.195359e+06 |
| 2 | 582628.657275 | -2.195359e+06 |
| 3 | 582628.653591 | -2.195359e+06 |
| 4 | 582628.649807 | -2.195359e+06 |
Luckily, the dataset is already very clean and thorough. However, notice that the data contains only location coordinates. I know from the dataset information that the locations were collected at 1-second intervals, so I can create a "time since data collection began" column. For now, I will use an arbitrary starting date and time: this can be changed later.
# Make time column
tortoise.insert(2, 'TimeSinceZero', range(0, len(tortoise)))
# Make a DateTime object for the time column, which is also useful
tortoise['DateTime'] = tortoise['TimeSinceZero'].astype('timedelta64[s]') + dt.datetime(2024,1,1,0,0)
# Make an animal ID column, which will be useful later on for counting
tortoise['AnimalID'] = 1
tortoise.head()
|   | X | Y | TimeSinceZero | DateTime | AnimalID |
|---|---|---|---|---|---|
| 0 | 582628.660871 | -2.195359e+06 | 0 | 2024-01-01 00:00:00 | 1 |
| 1 | 582628.661313 | -2.195359e+06 | 1 | 2024-01-01 00:00:01 | 1 |
| 2 | 582628.657275 | -2.195359e+06 | 2 | 2024-01-01 00:00:02 | 1 |
| 3 | 582628.653591 | -2.195359e+06 | 3 | 2024-01-01 00:00:03 | 1 |
| 4 | 582628.649807 | -2.195359e+06 | 4 | 2024-01-01 00:00:04 | 1 |
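As an aside, the astype('timedelta64[s]') conversion above relies on pandas casting integers to timedeltas; an equivalent and arguably more explicit spelling (a small sketch, assuming a reasonably recent pandas version) uses pd.to_timedelta:
# Equivalent DateTime construction using pd.to_timedelta (assumes pandas >= 1.0)
tortoise['DateTime'] = pd.to_timedelta(tortoise['TimeSinceZero'], unit='s') + dt.datetime(2024, 1, 1)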
Next, note that the coordinates are in UTM (Universal Transverse Mercator) rather than latitude/longitude. To aid analysis, we will convert to traditional lat/long. The tortoise is from Round Island, Mauritius, in the Southern Hemisphere, which is why we negate the Y column and pass northern=False.
convert = lambda row: utm.to_latlon(row['X'], -1*row['Y'], 40, northern=False)
latlong = tortoise.apply(convert, axis=1)
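As a quick sanity check on the conversion, a single point can be passed to utm.to_latlon directly (the values here are the first fix from the table above, rounded):
# Spot-check the UTM -> lat/long conversion on the first fix (zone 40, southern hemisphere)
print(utm.to_latlon(582628.66, 2195359.10, 40, northern=False))
# expected: roughly (-70.3342, 59.2004), matching the first row of the converted table below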
After converting to lat/long, I display the top few rows of our database to get a feel for how it looks.
tortoise[['Lat', 'Long']] = pd.DataFrame(latlong.tolist(), index=tortoise.index)
tortoise.head(10)
|   | X | Y | TimeSinceZero | DateTime | AnimalID | Lat | Long |
|---|---|---|---|---|---|---|---|
| 0 | 582628.660871 | -2.195359e+06 | 0 | 2024-01-01 00:00:00 | 1 | -70.334233 | 59.200370 |
| 1 | 582628.661313 | -2.195359e+06 | 1 | 2024-01-01 00:00:01 | 1 | -70.334233 | 59.200370 |
| 2 | 582628.657275 | -2.195359e+06 | 2 | 2024-01-01 00:00:02 | 1 | -70.334233 | 59.200370 |
| 3 | 582628.653591 | -2.195359e+06 | 3 | 2024-01-01 00:00:03 | 1 | -70.334233 | 59.200370 |
| 4 | 582628.649807 | -2.195359e+06 | 4 | 2024-01-01 00:00:04 | 1 | -70.334233 | 59.200370 |
| 5 | 582628.645878 | -2.195359e+06 | 5 | 2024-01-01 00:00:05 | 1 | -70.334233 | 59.200370 |
| 6 | 582628.641846 | -2.195359e+06 | 6 | 2024-01-01 00:00:06 | 1 | -70.334233 | 59.200369 |
| 7 | 582628.637819 | -2.195359e+06 | 7 | 2024-01-01 00:00:07 | 1 | -70.334233 | 59.200369 |
| 8 | 582628.633810 | -2.195359e+06 | 8 | 2024-01-01 00:00:08 | 1 | -70.334233 | 59.200369 |
| 9 | 582628.629736 | -2.195359e+06 | 9 | 2024-01-01 00:00:09 | 1 | -70.334233 | 59.200369 |
Exploratory Analysis¶
During data exploration, I will calculate statistics about the data to gain insight into how it is structured and what features it has. I can use statistical tools, like hypothesis testing and summary statistics, as well as visualizations, like charts and graphs, to do this.
First, we calculate some very preliminary summary statistics that will inform our data exploration.
# number of data points
print("Number of data points: ", len(tortoise))
# minimum and maximum of each location
print("X range: ", min(tortoise['X']), max(tortoise['X']))
print("Y range: ", min(tortoise['Y']), max(tortoise['Y']))
Number of data points:  990860
X range:  582488.314805739 582628.66131287
Y range:  -2195359.09875836 -2194908.79436392
Exploration 1: Activity levels over time of day¶
For the first data exploration, I will plot the change in movement per second over time to estimate the time of day that data collection begins. I will use Haversine distance to calculate the change in location between any location and the location 60 seconds ago.
My haversine_vectorize function is from this article:
“How to Calculate Distance in Python and Pandas Using Scipy Spatial and Distance Functions.” Kanoki, 27 Dec. 2019, kanoki.org/2019/12/27/how-to-calculate-distance-in-python-and-pandas-using-scipy-spatial-and-distance-functions/.
def haversine_vectorize(lon1, lat1, lon2, lat2):
# Returns distance, in meters, between one set of longitude/latitude coordinates and another
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
newlon = lon2 - lon1
newlat = lat2 - lat1
haver_formula = np.sin(newlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(newlon/2.0)**2
dist = 2 * np.arcsin(np.sqrt(haver_formula ))
km = 6367 * dist #6367 for distance in KM for miles use 3958
return km*1000
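# Quick sanity check on the helper (hypothetical points, not from the dataset): one thousandth
# of a degree of longitude at the equator should be roughly 111 meters under the R = 6367 km
# approximation used above.
print(haversine_vectorize(0.0, 0.0, 0.001, 0.0))  # ~111.1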
# Take speed as change in distance over the past minute
tortoise['Speed(m/min)'] = haversine_vectorize(tortoise.Long, tortoise.Lat,
tortoise.Long.shift(60), tortoise.Lat.shift(60)).fillna(0)
# Converting the time to days in a separate column
tortoise['timeDays'] = tortoise['TimeSinceZero']/86400
The following is a line graph of the tortoise's speed, in meters per minute, over time (in days). Notice that the tortoise has very clear periods of activity and inactivity.
ax = tortoise.plot(x='timeDays', y='Speed(m/min)',figsize=(15,5), color="green")
ax.set_xlabel("Time (Days)")
ax.set_ylabel("Speed (meters per minute)")
ax.set_title("Tortoise Speed over Time")
ax.get_legend().remove()
plt.xticks(np.arange(0,13,1.0))
ax.grid(True, axis='x', which='major')
From Wikipedia, I know that the Aldabra giant tortoise is most active in the mornings, so I guess that the data collection begins at 6:00 am. Now, I can use a hypothesis test to see if activity levels during the morning hours (6:00 am to 12:00 pm) are significantly higher than during other hours of the day. I'll begin by categorizing the data points as either "morning" (6:00 am to 12:00 pm), "afternoon" (12:00 pm to 6:00 pm), "evening" (6:00 pm to 12:00 am), or "night" (12:00 am to 6:00 am).
def t(seconds):
ofday = seconds%(60*60*24) # time of day (in seconds)
if ofday <= (60*60*6): # during first six hours
return "morning"
elif ofday <= (60*60*12): # during first twelve hours, but not first six
return "afternoon"
elif ofday <= (60*60*18): # during first eighteen hours, but not first twelve
return "evening"
else: # during final six hours
return "night"
tortoise['timeOfDay'] = tortoise['TimeSinceZero'].apply(t)
I can now prepare to apply an ANOVA to the data to determine if activity levels are different at different times of day. ANOVA tests are used to determine if different groups of data, usually more than two groups, have the same population mean or not. More information on the ANOVA test can be found here: https://researchmethod.net/anova/
Null hypothesis: Activity levels of the tortoise are not different at different times of day.
Alternative hypothesis: Activity levels of the tortoise are different at different times of day.
I will use an alpha of 0.01 for this test.
morning = tortoise['Speed(m/min)'][tortoise['timeOfDay']=='morning']
afternoon = tortoise['Speed(m/min)'][tortoise['timeOfDay']=='afternoon']
evening = tortoise['Speed(m/min)'][tortoise['timeOfDay'] == 'evening']
night = tortoise['Speed(m/min)'][tortoise['timeOfDay'] == 'night']
print(scipy.stats.f_oneway(morning, afternoon, evening, night))
F_onewayResult(statistic=59041.457362993075, pvalue=0.0)
The p-value was so low that floating point arithmetic rounded it to zero. Since p is less than alpha, I have enough evidence to reject the null hypothesis and conclude that the tortoise's activity levels differ across times of day. Whether the period I guessed to be morning is actually the most active still needs to be checked directly, which the post-hoc test and sample means below will do.
Now, I will perform a post-hoc test to check which groups differed from the others significantly. Post-hoc tests are follow-ups to ANOVA that compare each group to every other group to see which ones stand out. I will use Tukey's HSD test.
Null hypothesis: For a given pair of time categories, the mean tortoise speeds are equal.
Alternative hypothesis: For a given pair of time categories, the mean tortoise speeds differ.
We will use an alpha of 0.01 for this test.
print(scipy.stats.tukey_hsd(morning, afternoon, evening, night))
Tukey's HSD Pairwise Group Comparisons (95.0% Confidence Interval)
Comparison  Statistic  p-value  Lower CI  Upper CI
 (0 - 1)      0.094     0.000     0.093     0.096
 (0 - 2)      0.150     0.000     0.149     0.151
 (0 - 3)      0.145     0.000     0.144     0.146
 (1 - 0)     -0.094     0.000    -0.096    -0.093
 (1 - 2)      0.055     0.000     0.054     0.056
 (1 - 3)      0.051     0.000     0.049     0.052
 (2 - 0)     -0.150     0.000    -0.151    -0.149
 (2 - 1)     -0.055     0.000    -0.056    -0.054
 (2 - 3)     -0.005     0.000    -0.006    -0.004
 (3 - 0)     -0.145     0.000    -0.146    -0.144
 (3 - 1)     -0.051     0.000    -0.052    -0.049
 (3 - 2)      0.005     0.000     0.004     0.006
Since all p-values are 0, I can reject the null hypotheses and conclude that all categories have significantly different mean speeds. To decide whether the average speed in the morning is greater than the others, I will find the sample average for each category.
print("Morning: ", morning.mean())
print("Afternoon: ", afternoon.mean())
print("Evening: ", evening.mean())
print("Night: ", night.mean())
fig, ax = plt.subplots()
ind = np.arange(4)
width = 0.7
means = [morning.mean(), afternoon.mean(), evening.mean(), night.mean()]
rects = ax.bar(ind, means, width, color='g')
ax.set_ylabel('Sample Mean Speed (meters/minute)')
ax.set_title('Sample Mean Speed for each Time Category')
ax.set_xticks(ind)
ax.set_xticklabels(('Morning', 'Afternoon', 'Evening', 'Night'))
ax.set_ylim([0,0.2])
for rect in rects:
height = rect.get_height()
ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
str(int(height*10000)/10000),
ha='center', va='bottom')
Morning:  0.1728025598387896
Afternoon:  0.07831414332904804
Evening:  0.022951714499125055
Night:  0.027771349415464115
Since the activity level is clearly higher during the morning, I can now say that the data collection does start at 6:00 am and change our time column to reflect this. Note that while the times are now accurate, the days are still arbitrary.
tortoise['DateTime'] = tortoise['DateTime'] + dt.timedelta(hours=6)
Exploration 2: Activity levels of each day¶
Next, I'll investigate the tortoise's speed over time a little more closely: by grouping the speeds of his movements, I can begin to gain insights about his behaviors. To begin, I will look at the distribution of the per-minute speed data.
ax = tortoise['Speed(m/min)'].hist(bins=30, color="green")
ax.set_xlabel("Speed (meters/minute)")
ax.set_title("Histogram of Tortoise Speed")
plt.show()
We can see from the histogram that the tortoise has roughly four distinct speed modes: sedentary (near zero movement), slow (up to about 0.2 meters/minute), medium (0.2 to 0.6 meters/minute), and fast (above 0.6 meters/minute, centered around 0.8).
Let's assign each timestamp a speed category and visualize that data.
def speed_class(speed):
if speed<0.05:
return 0
elif speed < 0.2:
return 1
elif speed < 0.6:
return 2
else:
return 3
tortoise['speedCategory'] = tortoise['Speed(m/min)'].apply(speed_class)
ax = tortoise.plot(x='timeDays', y='speedCategory', kind='scatter', s=1, color="green")
ax.grid(True, axis='x')
ax.set_xlabel("Number of Days")
ax.set_ylabel("Speed Category (0-3)")
ax.set_title("Speed Category over Time")
plt.yticks(np.arange(0, 4, 1.0))
plt.xticks(np.arange(0,13,1.0))
plt.show()
It seems from this chart that the tortoise moves much more during the first days of data collection than during the final days. I will evaluate this prediction by finding the total distance traveled each day and creating a linear regression equation.
Note: grouping the movement data by day avoids bias from the time of day (e.g. the turtle moving less at night or during the afternoon heat).
tortoise['Speed(m/s)'] = haversine_vectorize(tortoise.Long, tortoise.Lat,
tortoise.Long.shift(1), tortoise.Lat.shift(1)).fillna(0)
# Taking the integer rounds off the time in days, assigning each point the day number
tortoise['dayNum'] = tortoise['timeDays'].astype(int)
# Total distance traveled over each day
movementByDay = tortoise.groupby('dayNum')['Speed(m/s)'].sum()
ax = movementByDay.plot(color="green")
ax.set_title("Distance traveled each Day over Time")
ax.set_xlabel("Time in Days")
ax.set_ylabel("Distance per Day (meters)")
plt.show()
Next, I will fit a linear regression to this data to estimate the negative linear relationship that I predicted. A linear regression line is the single line that minimizes the total squared vertical distance between itself and the data points, and it can be used to investigate the relationship between two variables.
# Find the line of best fit
slope, intercept, rvalue, pvalue, stderr = scipy.stats.linregress(range(0, 12), movementByDay)
plot1, ax = plt.subplots()
plot1 = matplotlib.pyplot.scatter(range(0, 12), movementByDay, color="green")
x = np.arange(0, 12)
y = slope*x + intercept
plot1 = matplotlib.pyplot.plot(x,y, color='blue')
ax.set_xlabel("Time in Days")
ax.set_ylabel("Distance per Day (meters)")
ax.set_title("Linear Regression for Distance per Day over Days")
plot1
print("Distance = ", slope, "* Days + ", intercept)
print ("r = ", rvalue)
Distance =  -16.857338198721422 * Days +  201.9866665975539
r =  -0.5050858802645146
The slope of the linear regression line is negative, so the tortoise's movement per day does roughly decrease over the course of the study. However, the r-value is only -0.505, which indicates only a moderate linear relationship between day number and total distance. There are many possible explanations for the tortoise's decrease in movement (a change in temperature over the data collection period, for example). The linear relationship is not very strong, but the general decrease in activity levels over the course of the data collection should be kept in mind for the rest of the analysis.
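Since linregress already returned a p-value for the slope, an optional follow-up (not shown in the original output) is to print it alongside r to formally test whether the slope differs from zero:
# Optional check: p-value for the null hypothesis that the slope is zero
print("slope p-value: ", pvalue)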
Exploration 3: Sites of Interest¶
Next, I will attempt to find points of interest to the tortoise: areas where it spends disproportionate amounts of time. After finding these areas, I will apply a chi-squared goodness-of-fit test to see whether the predicted areas of interest are significant.
Begin by graphing the tortoise's locations. I am using UTM coordinates to minimize location warping while maintaining simplicity. Because the axes are equalized, the plot can be read as a map of the tortoise's trajectory.
fig, axs = matplotlib.pyplot.subplots(figsize = (5,10))
axs.axis('equal')
axs.set_title("Map of Tortoise Location over Time")
axs.text(.40, 1.80, 'Each square in \nthe grid is 20x20 \nmeters.', fontsize=9, transform=ax.transAxes, bbox=dict(facecolor='white', alpha=0.5))
axs.grid(True)
fig = plt.scatter(tortoise['X'],tortoise['Y'], c=tortoise['TimeSinceZero'], cmap='plasma', s=1)
axs.set_xticks(np.arange(582420, 582680, 20))
axs.set_xticklabels([])
axs.set_yticks(np.arange(-2195400, -2194860, 20))
axs.set_yticklabels([])
plt.show()
In designing a plan to find locations of interest for the tortoise, I split the tortoise's location tracks with a grid structure. I can then tally how much time he spends in each area to determine his significant places.
Each grid location will be an (x,y) tuple, where x starts at 0 and increases with every 10 m increase beyond 582470, and y starts at 0 and increases with every 20 meter increase beyond -2195400.
Note that I filter out all areas that the turtle is in for less than 2% of the data collection time to simplify the bar chart.
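As a concrete example of the mapping (using the approximate coordinates of the first fix from the table above), the arithmetic works out like this:
# Worked example: the first fix (X ~ 582628.66, Y ~ -2195359.0) falls in grid cell (15, 2)
x_cell = int((582628.66 - 582470) / 10)    # 158.66 / 10 -> 15
y_cell = int((-2195359.0 + 2195400) / 20)  # 41.0 / 20   -> 2
print((x_cell, y_cell))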
def get_area(row):
x = row['X']
y = row['Y']
x_final = int((x-582470)/10)
y_final = int((y+2195400)/20)
return (x_final, y_final)
tortoise['area'] = tortoise.apply(get_area, axis=1)
# Find the amount of time he spends in each area
areacounts = tortoise.groupby('area')['AnimalID'].sum()
# Keep only the areas where he spends at least 20,000 seconds (about 2% of the data)
totaltime = areacounts.sum()
areacounts2 = areacounts.loc[lambda x : x >= 20000]
timeleft = areacounts2.sum()
# Create an 'other' category for the areas we filtered out
print(totaltime-timeleft)
areacounts2.at['other']= totaltime-timeleft
areacounts2pct = (areacounts2*100)/sum(areacounts2)
fig, ax = plt.subplots()
ax.set_title("Percent Time Spent in Each Area")
ax.set_xlabel("Area")
ax.set_ylabel("Percent of time in Area")
rects = ax.bar(np.arange(9), areacounts2pct, 0.5, color='g')
ax.set_xticks(np.arange(9))
ax.set_xticklabels(areacounts2pct.index, rotation=45, ha='right')
ax.set_ylim((0,40))
for rect in rects:
height = rect.get_height()
ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
str(int(height*10)/10),
ha='center', va='bottom')
plt.show()
277287
From the bar chart, it is clear that the tortoise spends the most time (well over 5% each) in the following areas:
- (4, 24): 32.2 %
- (3, 13): 12.5 %
- (2, 19): 9.4 %
I predict that these are areas of importance. To see if the tortoise actually spends a statistically significant amount of time in these areas, we can use a chi-squared goodness-of-fit test, which will determine whether the observed distribution of time across areas differs from a uniform distribution. More information on the chi-squared test and how the DOF (degrees of freedom) parameter was calculated can be found here.
Null hypothesis: The sample occurrences come from a uniform distribution.
Alternative hypothesis: The sample occurrences do not come from a uniform distribution; i.e., the differences between the occurrence distribution and a uniform distribution are statistically significant.
I will use alpha = 0.01 for this significance test.
areacountsdf = pd.DataFrame(areacounts)
areacountsdf = areacountsdf.rename(columns={"AnimalID":"SampleOccurences"})
dof = len(areacountsdf)-1
# The default expected distribution for chisquare is uniform, so f_exp does not need to be specified.
# Note: scipy's ddof is *subtracted* from the default k-1 degrees of freedom, so passing
# ddof = k-1 leaves zero degrees of freedom; this is what produces the nan p-value below.
scipy.stats.chisquare(areacountsdf['SampleOccurences'], ddof=dof)
Power_divergenceResult(statistic=10476917.027180428, pvalue=nan)
That's a very high test statistic! The p-value came back as nan, but this is an artifact of the ddof argument rather than numerical overflow: scipy subtracts ddof from the default k - 1 degrees of freedom, so passing ddof = k - 1 leaves zero degrees of freedom and the p-value is undefined. With the proper k - 1 degrees of freedom, a statistic in the millions corresponds to a p-value effectively equal to zero, which is certainly lower than alpha = 0.01, so I can reject the null hypothesis and claim that the observed occurrences do not come from a uniform distribution.
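To make the ddof behavior concrete, here is a tiny illustration on made-up counts (not tortoise data); the default call keeps k - 1 degrees of freedom, while ddof = k - 1 reproduces the nan seen above.
# Toy illustration of scipy's ddof argument (hypothetical counts, k = 3 categories)
toy_counts = [10, 20, 30]
print(scipy.stats.chisquare(toy_counts))          # dof = k-1 = 2: statistic 10.0, p ~ 0.007
print(scipy.stats.chisquare(toy_counts, ddof=2))  # dof = k-1-ddof = 0: p-value is nan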
The test still may not be accurate, though, because I included many areas that the tortoise barely spends any time in; he may just pass through them while traveling or clip a corner of them while moving from place to place. I will repeat the chi-squared test, but remove the categories where the tortoise spends an insignificant (< 6 hr) amount of time. All hypotheses remain the same.
areacounts = areacounts.loc[lambda x : x >= 6*60*60]
areacountsdf = pd.DataFrame(areacounts)
areacountsdf = areacountsdf.rename(columns={"AnimalID":"SampleOccurences"})
dof = len(areacountsdf)-1
# As before, ddof = k-1 zeroes out the degrees of freedom, so the p-value will again be nan
scipy.stats.chisquare(areacountsdf['SampleOccurences'], ddof=dof)
Power_divergenceResult(statistic=777413.591017317, pvalue=nan)
The test statistic is smaller than in the last test, and the p-value is again nan for the same reason (the ddof argument leaves zero degrees of freedom). With the proper k - 1 degrees of freedom, a statistic this large still corresponds to a p-value far below alpha, so I have enough evidence to reject the null hypothesis and accept the claim that the tortoise does have significant areas of interest in the places mentioned above.
To finish off the data exploration, I will lay these areas over the original movement plot to see where in his trajectory the tortoise spends the most time. The blue rectangles are the top three identified areas of interest.
x = [4, 3, 2]
y = [24, 13, 19]
data = {'X': x, 'Y': y}
sigpoints = pd.DataFrame(data)
sigpoints['X'] = (sigpoints['X']*10)+582470
sigpoints['Y'] = (sigpoints['Y']*20)-2195400
fig, axs = matplotlib.pyplot.subplots(figsize = (5,10))
axs.axis('equal')
add = lambda row: axs.add_patch(plt.Rectangle((row['X'], row['Y']), 10, 20))
sigpoints.apply(add, axis=1)
axs.set_title("Map of Tortoise Location over Time")
axs.text(.40, 1.80, 'Each square in \nthe grid is 20x20 \nmeters.', fontsize=9, transform=ax.transAxes, bbox=dict(facecolor='white', alpha=0.5))
axs.grid(True)
fig = plt.scatter(tortoise['X'],tortoise['Y'], c=tortoise['TimeSinceZero'], cmap='plasma', s=1)
axs.set_xticks(np.arange(582420, 582680, 20))
axs.set_xticklabels([])
axs.set_yticks(np.arange(-2195400, -2194860, 20))
axs.set_yticklabels([])
plt.show()
Primary Analysis and Visualizations¶
Based on my analysis, the tortoise has clearly defined behavioral patterns, including a daily schedule of active vs. inactive times and clear favorite spots where he spends his time. Therefore, the next step of my analysis is to use machine learning to categorize his patterns of behavior (based on location, activity level, and time of day) into behaviors like foraging, sleeping, or travelling.
To do this, I will use an unsupervised learning model. These are machine learning models that take unlabeled data and attempt to cluster it into groups of similar points. There are different types of clustering, so I will begin by selecting a model that matches my data. More information about types of clustering can be found here.
Since most clustering methods perform best on quantitative data, I will first make edits to my dataset to prime it for clustering. Before using any machine learning method, I will apply Principal Component Analysis to reduce the dimensionality of my data and make the clusters easier to visualize. Then, based on the spatial nature of my data and the unknown number of clusters, I will use the DBSCAN clustering method to identify different tortoise behaviors.
Preparing the Data¶
First, I will prepare the features of the dataset that I plan to use for clustering. One important feature is the speed of the tortoise; I will use speed over the past five minutes to reduce noise from the tortoise stopping and starting. I will also use the time of day, assuming that tortoise behavior cycles throughout the day. Finally, I will need some measure of how much he likes an area. From the last exploration, I have a categorical measure of this (the amount of time spent in each grid area), but I will need a quantitative one.
Also, because of the time-intensive nature of the machine learning calculations (and to reduce noise), I will trim the dataset to one point per minute. I will do this after the aggregated speed calculation so as not to lose data.
Creating these new features will be fairly simple.
Note that this code block takes about 2 hours to run because of the data-intensive calculations and aggregation in the count_adjacent_rows() function. To avoid running it, the csv file with the output of this cell (and all the cells above) can be found here.
# Add a column for time (in seconds) of the day
tortoise['Time'] = tortoise['TimeSinceZero']%(60*60*24)
# Note that this means each day "splits" at 6am, which is ok because we see a sharp behavioral change at 6am anyway.
# Add a column for mean speed over the last five minutes (in meters/min)
tortoise['aggregateSpeed(m/min)'] = (.2)*haversine_vectorize(tortoise.Long, tortoise.Lat,
tortoise.Long.shift(300), tortoise.Lat.shift(300)).fillna(0)
# Trim the dataset to one point every minute
tortoiseOld = tortoise.copy()
tortoise = tortoiseOld[tortoiseOld['TimeSinceZero'] % 60 == 0]
print("New length of dataset: ", len(tortoise))
# Add a column for how many minutes (consecutive one-minute samples) the tortoise spends within 20 meters of his current location
# recall that the haversine_vectorize function calculates distance between two lat-longs
def count_adjacent_rows(index, df):
current_row = df.loc[index]
count = 0
for i in range(index - 60, -60, -60):
distance = haversine_vectorize(current_row['Long'], current_row['Lat'], df.loc[i]['Long'], df.loc[i]['Lat'])
if distance <= 20:
count += 1
else:
break
for i in range(index + 60, len(df)*60, 60):
distance = haversine_vectorize(current_row['Long'], current_row['Lat'], df.loc[i]['Long'], df.loc[i]['Lat'])
if distance <= 20:
count += 1
else:
break
return count
tortoise = tortoise.assign(**{'timeNearby':tortoise.apply(lambda row: count_adjacent_rows(row.name, tortoise), axis=1)})
#tortoise.to_csv('tortoiseoutput.csv', index=False)
New length of dataset: 16515
Finally, create a new dataframe with only these columns. I will apply the machine learning techniques to this new dataframe.
#tortoise = pd.read_csv('tortoiseoutput.csv')
newtortoise = tortoise[['Time', 'aggregateSpeed(m/min)', 'timeNearby']]
Principal Component Analysis and DBSCAN¶
To begin, I will perform Principal Component Analysis on the dataframe. This helps reduce the dimensionality of the dataset by using the fact that some of the features are already correlated with each other. PCA also lets us plot results in 2 dimensions instead of 3 while still capturing most of the information in the dataset, which will make results easier to visualize later. However, since I am already working with only three dimensions, reducing further may not be worthwhile if most of the variance cannot be explained by just two of them.
First, let's make a heatmap to display the correlation matrix between each dimension.
# Display the correlation matrix using a heatmap
fig, ax = plt.subplots(figsize=(4, 4))
ax.set_title("Heatmap of Tortoise Features")
sns.heatmap(newtortoise.corr(), cmap="YlGnBu", annot=True)
plt.show()
This heatmap shows the correlation of each data feature with every other feature. Since no two features are highly correlated, they are all separate factors that are important to consider. Now I can proceed with PCA. I will first scale the data, then fit sklearn's PCA and examine the explained variance ratio, which shows how much of the variance in the data each principal component captures.
# Scale the data
scaler = StandardScaler()
newtortoise = pd.DataFrame(
scaler.fit_transform(newtortoise), columns=['Time', 'aggregateSpeed(m/min)', 'timeNearby']
)
# Perform PCA
pcaModel = PCA(n_components=3)
pcaModel.fit(newtortoise)
print(pcaModel.explained_variance_ratio_)
[0.5561368 0.29258821 0.15127499]
From the explained variance ratio, the first component explains about 56% of the variance, the first two components explain about 85%, and all three of course explain 100%. Therefore, I will continue to use all 3 dimensions, since dropping the last component would lose a meaningful share of the variation. However, I will still use the PCA transform so that results can be visualized in the first two dimensions. This will be useful when refining hyperparameters of the DBSCAN model.
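For reference, the cumulative version of the same output makes the 56% / 85% / 100% figures explicit:
# Cumulative explained variance across the three principal components
print(np.cumsum(pcaModel.explained_variance_ratio_))  # roughly [0.556, 0.849, 1.0]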
pca = PCA(n_components = 3)
tortoise_PCA = pca.fit_transform(newtortoise)
tortoise_PCA = pd.DataFrame(tortoise_PCA)
tortoise_PCA.columns = ['P1', 'P2', 'P3']
print(tortoise_PCA.head())
         P1        P2        P3
0  1.262829 -0.659211 -1.749047
1  1.261455 -0.657479 -1.747995
2  1.260082 -0.655748 -1.746943
3  1.258438 -0.654272 -1.745619
4  1.257065 -0.652540 -1.744568
Now that I have scaled and transformed the dataset, I can perform DBSCAN to cluster the points. DBSCAN has two hyperparameters that I will have to select - epsilon and min_samples. Epsilon is the maximum distance between two points for them to be considered "neighbors." Min_samples is the smallest number of neighbors a point must have to be considered a core point. Clusters are eventually made of core points that are neighbors of each other, and all of their neighbors. A more detailed walkthrough of the DBSCAN algorithm can be found here.
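Before applying DBSCAN to the tortoise data, a minimal toy sketch (made-up points, my own example) shows how the two hyperparameters behave: with eps=0.5 and min_samples=2, two tight groups become clusters and the lone far-away point gets the outlier label -1.
# Toy DBSCAN illustration on made-up 2-D points (not tortoise data)
toy_points = np.array([[0, 0], [0, 0.1], [0.1, 0],
                       [5, 5], [5, 5.1], [5.1, 5],
                       [10, 10]])
toy_db = DBSCAN(eps=0.5, min_samples=2).fit(toy_points)
print(toy_db.labels_)  # expected: [0 0 0 1 1 1 -1]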
First, I will find a value for the hyperparameter epsilon. One way to do this is with a k-nearest-neighbors approach. Without going into detail, I will find the distances from each point to its k nearest neighbors, then plot these distances on a graph. I will use the "elbow method" to pick a value that captures as much information as possible while reducing noise. More information on this approach can be found here. For a scholarly, slightly denser walkthrough, see this paper. This second paper also explains why I use k = 2*(number of dimensions) - 1 in KNN.
# Find k-value
k = 2 * tortoise_PCA.shape[-1] - 1
# Take KNN on the data
nbrs = NearestNeighbors(n_neighbors=k, radius=1).fit(tortoise_PCA)
# Store and sort the distances
distances, indices = nbrs.kneighbors(tortoise_PCA)
distances = np.sort(distances, axis=0)
numdist = len(distances)
distances = distances[:, k-1]
# Make the elbow plot
fig, ax = plt.subplots(figsize=(6,6))
fig = plt.plot(distances, color='g')
ax.set_xlabel('Number of Points in the dataset', fontsize=12)
ax.set_ylabel(f'{k}-nearest neighbor distance', fontsize=12)
ax.set_title('Elbow Plot for KNN distances')
ax.set_yticks(np.arange(0, 1.2, 0.05))
ax.grid(visible=True)
plt.show()
I can use the KneeLocator package to find where the elbow of the graph is. It can be difficult to tell by eye alone, since the value appears to be anywhere from 0.05 to 0.3.
# Set up and find the y-value of the elbow point
kneedle = KneeLocator(x = range(1, numdist+1), y = distances, S = 1.0,
curve = "concave", direction = "increasing", online=True)
print(kneedle.knee_y)
0.19342877260316022
Therefore, I will start with epsilon = 0.2. From the above paper, a good estimate for min_samples is k+1, so 6. Note that these are just starting estimates: choosing hyperparameters for DBSCAN is very tricky, so I will try a variety of different parameters and combinations to see what works best with this specific data.
# Set hyperparameters
epsilon = 0.2
min_samples = 6
# Initialize and fit the model
db_start = DBSCAN(eps = epsilon, min_samples = min_samples).fit(tortoise_PCA)
labels = db_start.labels_
# Print summary of the clustering
nclusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Number of clusters: ", nclusters)
print("Number of outliers: ", list(labels).count(-1))
counts = Counter(db_start.labels_)
for i in range(0, nclusters):
print("Cluster", i, ":", counts[i], "items")
Number of clusters:  18
Number of outliers:  78
Cluster 0 : 7200 items
Cluster 1 : 8 items
Cluster 2 : 9 items
Cluster 3 : 356 items
Cluster 4 : 130 items
Cluster 5 : 21 items
Cluster 6 : 31 items
Cluster 7 : 5413 items
Cluster 8 : 27 items
Cluster 9 : 24 items
Cluster 10 : 16 items
Cluster 11 : 33 items
Cluster 12 : 9 items
Cluster 13 : 907 items
Cluster 14 : 2041 items
Cluster 15 : 98 items
Cluster 16 : 102 items
Cluster 17 : 12 items
From this initial parameterization, I can see three main clusters (clusters 0, 7, and 14) and a few smaller ones. Graphing this clustering in the first two dimensions will show us these clusters. Note that there is a third dimension of the data that is not visualized, so any clusters that appear overlapping in this plot likely differ in the third dimension. Also, outliers are represented by the blue -1 cluster. My method for plotting DBSCAN results is from this blog post.
p = sns.scatterplot(data = tortoise_PCA, x = "P1", y = "P2", hue = db_start.labels_, legend = "full", palette = "muted", linewidth = 0)
sns.move_legend(p, "upper right", bbox_to_anchor = (1.17, 1.), title = 'Clusters')
p.set(title="Initial Clustering Results")
plt.show()
This initial clustering is fine, but it's worthwhile to try other hyperparameters to see if I can improve the clustering and group some of those separate points together. Finding an elbow point is not an exact science, so I will try a range of epsilon values from 0.05 to 0.5. Also, the right min_samples varies depending on the dataset, so I will try the surrounding values (5, 6, and 7). I will investigate and plot the results of each combination, and then select the best one as the final clustering.
# Lists to store the hyperparameters and models
possible_epsilon = np.arange(0,10)
possible_min_samples = np.arange(0,3)
models = []
def get_model(epsilon, min_s):
return DBSCAN(eps = ((epsilon*0.05)+0.05), min_samples = (min_s+5)).fit(tortoise_PCA)
def get_summarystats(epsilon, min_s, verbose=True):
global models
eps = ((epsilon*0.05)+.05)
ms = min_s+5
model = models[epsilon][min_s]
labels = model.labels_
nclusters = len(set(labels)) - (1 if -1 in labels else 0)
counts = Counter(model.labels_)
if verbose:
print(f"Epsilon={eps:.2f}, min_samples={ms}: {nclusters} clusters and {list(labels).count(-1)} outliers")
print(list(counts.values()))
else:
total = 0
for i in counts.values():
if i >= 1000:
total += 1
print(f"Epsilon={eps:.2f}, min_samples={ms}: {nclusters} clusters, {list(labels).count(-1)} outliers, and {total} significant groups")
print("")
# Create all the models and print summary statistics of each
for i in possible_epsilon:
models.append([])
for j in possible_min_samples:
models[i].append(get_model(i, j))
get_summarystats(i,j, verbose=False)
Epsilon=0.05, min_samples=5: 71 clusters, 503 outliers, and 4 significant groups
Epsilon=0.05, min_samples=6: 64 clusters, 587 outliers, and 4 significant groups
Epsilon=0.05, min_samples=7: 59 clusters, 672 outliers, and 3 significant groups
Epsilon=0.10, min_samples=5: 44 clusters, 187 outliers, and 6 significant groups
Epsilon=0.10, min_samples=6: 40 clusters, 234 outliers, and 6 significant groups
Epsilon=0.10, min_samples=7: 38 clusters, 259 outliers, and 6 significant groups
Epsilon=0.15, min_samples=5: 23 clusters, 98 outliers, and 4 significant groups
Epsilon=0.15, min_samples=6: 22 clusters, 110 outliers, and 4 significant groups
Epsilon=0.15, min_samples=7: 23 clusters, 119 outliers, and 5 significant groups
Epsilon=0.20, min_samples=5: 20 clusters, 63 outliers, and 3 significant groups
Epsilon=0.20, min_samples=6: 18 clusters, 78 outliers, and 3 significant groups
Epsilon=0.20, min_samples=7: 17 clusters, 89 outliers, and 3 significant groups
Epsilon=0.25, min_samples=5: 16 clusters, 38 outliers, and 3 significant groups
Epsilon=0.25, min_samples=6: 15 clusters, 46 outliers, and 3 significant groups
Epsilon=0.25, min_samples=7: 16 clusters, 48 outliers, and 3 significant groups
Epsilon=0.30, min_samples=5: 13 clusters, 23 outliers, and 3 significant groups
Epsilon=0.30, min_samples=6: 13 clusters, 31 outliers, and 3 significant groups
Epsilon=0.30, min_samples=7: 13 clusters, 33 outliers, and 3 significant groups
Epsilon=0.35, min_samples=5: 10 clusters, 18 outliers, and 3 significant groups
Epsilon=0.35, min_samples=6: 11 clusters, 23 outliers, and 3 significant groups
Epsilon=0.35, min_samples=7: 12 clusters, 24 outliers, and 3 significant groups
Epsilon=0.40, min_samples=5: 10 clusters, 8 outliers, and 3 significant groups
Epsilon=0.40, min_samples=6: 9 clusters, 14 outliers, and 3 significant groups
Epsilon=0.40, min_samples=7: 10 clusters, 14 outliers, and 3 significant groups
Epsilon=0.45, min_samples=5: 9 clusters, 4 outliers, and 3 significant groups
Epsilon=0.45, min_samples=6: 9 clusters, 5 outliers, and 3 significant groups
Epsilon=0.45, min_samples=7: 10 clusters, 5 outliers, and 3 significant groups
Epsilon=0.50, min_samples=5: 6 clusters, 2 outliers, and 2 significant groups
Epsilon=0.50, min_samples=6: 6 clusters, 3 outliers, and 2 significant groups
Epsilon=0.50, min_samples=7: 6 clusters, 3 outliers, and 2 significant groups
Criteria for a good model include not having too many outliers, not introducing too much noise through an excess of tiny clusters, and not losing any important information. The key is finding models that are neither underfit nor overfit.
Based on these summary statistics, I can rule out epsilon=0.5, since it clusters the data into mainly two large categories, losing the information of the third significant category present in the other clusterings; this means the model is underfit. Also, epsilon < 0.3 produces lots of small clusters that are likely noise and lots of outliers, meaning those models are likely overfit. Furthermore, the value of min_samples appears not to change the clustering much, except that a lower min_samples tends to group outliers into large clusters, so I will stick with the lowest choice, min_samples = 5. Now there are only four models to choose from: min_samples = 5 with epsilon in the set {0.3, 0.35, 0.4, 0.45}.
Now, re-print the summary statistics and make plots for these models to further investigate.
possible_epsilon = np.arange(5,9)
min_samples = 0
fig, axs = plt.subplots(2, 2, figsize=(16,16))
p1 = sns.scatterplot(data = tortoise_PCA, x = "P1", y = "P2", hue = models[5][0].labels_, legend = "full", palette = "muted", linewidth = 0, ax = axs[0,0])
sns.move_legend(p1, "upper right", bbox_to_anchor = (1.17, 1.), title = 'Clusters')
axs[0, 0].set_title(f"Clustering Results with epsilon={(5*0.05)+0.05:.2f} and min_samples=5")
p2 = sns.scatterplot(data = tortoise_PCA, x = "P1", y = "P2", hue = models[6][0].labels_, legend = "full", palette = "muted", linewidth = 0, ax = axs[0,1])
sns.move_legend(p2, "upper right", bbox_to_anchor = (1.17, 1.), title = 'Clusters')
axs[0, 1].set_title(f"Clustering Results with epsilon={(6*0.05)+0.05:.2f} and min_samples=5")
p3 = sns.scatterplot(data = tortoise_PCA, x = "P1", y = "P2", hue = models[7][0].labels_, legend = "full", palette = "muted", linewidth = 0, ax = axs[1,0])
sns.move_legend(p3, "upper right", bbox_to_anchor = (1.17, 1.), title = 'Clusters')
axs[1, 0].set_title(f"Clustering Results with epsilon={(7*0.05)+0.05:.2f} and min_samples=5")
p4 = sns.scatterplot(data = tortoise_PCA, x = "P1", y = "P2", hue = models[8][0].labels_, legend = "full", palette = "muted", linewidth = 0, ax = axs[1,1])
sns.move_legend(p4, "upper right", bbox_to_anchor = (1.17, 1.), title = 'Clusters')
axs[1, 1].set_title(f"Clustering Results with epsilon={(8*0.05)+0.05:.2f} and min_samples=5")
plt.show()
get_summarystats(5, 0)
get_summarystats(6, 0)
get_summarystats(7, 0)
get_summarystats(8, 0)
Epsilon=0.30, min_samples=5: 13 clusters and 23 outliers
[5, 7378, 23, 22, 367, 24, 31, 5413, 27, 16, 45, 2951, 111, 102]
Epsilon=0.35, min_samples=5: 10 clusters and 18 outliers
[5, 7750, 18, 22, 24, 31, 5413, 27, 16, 3098, 111]
Epsilon=0.40, min_samples=5: 10 clusters and 8 outliers
[5, 7753, 8, 25, 24, 32, 5416, 27, 16, 3098, 111]
Epsilon=0.45, min_samples=5: 9 clusters and 4 outliers
[7761, 4, 25, 24, 32, 5416, 28, 16, 3098, 111]
Now I have enough information to select a parameterization. The differences between the plots are subtle, but not negligible. Note the purple category on the far right in the epsilon=0.3 graph: it should probably be clustered with the rest of the green category, as it is in the other three graphs, so I will rule out epsilon=0.3.
The other differences are much less important. There are a few points that land in different categories depending on the model, but it is hard to say that one model is better than another. I will choose epsilon=0.4, the middle of the remaining values, to balance the behavior of all three graphs.
It's also important to realize that none of these categorizations is more "correct" than the others: plenty of these data points are influenced by noise, and there is no single true category for them. I will still take the results with a grain of salt, since no clustering is perfect. All of the final candidate models were very similar, so the ultimate parameter choice could differ depending on the data scientist building the model.
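As an optional, more quantitative tie-breaker (my addition; the choice above does not depend on it), the shortlisted models could be compared with a silhouette score computed on the non-outlier points:
# Optional tie-breaker sketch: silhouette score per shortlisted model, ignoring DBSCAN outliers (-1)
from sklearn.metrics import silhouette_score
for eps_idx in [5, 6, 7, 8]:  # epsilon = 0.30, 0.35, 0.40, 0.45 with min_samples = 5
    labels = models[eps_idx][0].labels_
    mask = labels != -1
    score = silhouette_score(tortoise_PCA[mask], labels[mask])
    print(f"epsilon={(eps_idx*0.05)+0.05:.2f}: silhouette={score:.3f}")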
Now, I can assign these categorizations back to specific datapoints in the tortoise dataset.
realmodel = models[7][0]  # epsilon = (7*0.05)+0.05 = 0.40, min_samples = 5
tortoise = tortoise.assign(**{'Behavior':realmodel.labels_})
Interpretation¶
Now that I have behavioral categories, we can attempt to interpret them to see what they mean for the tortoise! Start by seeing how many occurrences of each behavior we have.
ax = tortoise['Behavior'].value_counts().sort_index().plot(kind='bar', color='g')
ax.set_title("Counts for Tortoise Behavior")
ax.set_ylabel("Counts")
plt.show()
We can see that the only significant categories are 1, 5, and 8. Since classifying animal behavior is not an exact science, we can write off the other points as outliers or assign them the behavior the tortoise is exhibiting at nearby times. Analyzing those small groups as deeply as the large groups would be overanalyzing and overfitting, as categories that small are probably just noise.
To get a sense of these behaviors, let's see how they compare to a few different metrics of the tortoise that we already have. We can start by looking at his behaviors over time.
ax = tortoise.plot(x='timeDays', y='Behavior', kind='scatter', s=1, color="green")
ax.set_xlabel("Number of Days")
ax.set_ylabel("Behavior")
ax.set_title("Behavior over Time")
plt.yticks(np.arange(0, 10, 1.0))
plt.xticks(np.arange(0,13,1.0))
plt.show()
It appears from this plot that the tortoise's behaviors vary more from day to day than cycling within each day like we might expect. He might have "lazy days," "active days," "travel days," etc.
From the plot, we can also begin to group those outlier behaviors. Behaviors -1, 0 and 2 are clearly subsets of behavior 1, so we can group them in with behavior 1. Similarly, behaviors 7 and 9 are subsets of behavior 8. We can also group behavior 3 and 4 together since they occur near each other temporally, as well as behaviors 5 and 6.
When grouping behavior like this, I am not saying that two behavior groups share characteristics, only that they are one "behavioral act" that has elements of both categories. For example, the behavior represented by categories 7, 8, and 9 might be an act like foraging where the tortoise spends time moving to find food and stopping to eat it, which are subsets of the same act.
Also, we might expect behaviors 3-4 to represent a transition between two of the main behaviors.
Let's redefine the behavior categories to group them into the four categories described above.
def reclassify(x):
if x == -1 or x == 0 or x == 2 or x == 1:
return 1
elif x == 3 or x == 4:
return 3
elif x == 5 or x == 6:
return 5
elif x == 7 or x == 9 or x == 8:
return 8
tortoise = tortoise.assign(**{'Behavior':tortoise['Behavior'].apply(reclassify)})
ax = tortoise.plot(x='timeDays', y='Behavior', kind='scatter', s=1, color="green")
ax.set_xlabel("Number of Days")
ax.set_ylabel("Behavior")
ax.set_title("Behavior over Time")
plt.yticks(np.arange(0, 9, 1.0))
plt.xticks(np.arange(0,13,1.0))
plt.show()
Next, we can begin to match activity patterns to behavioral categories by finding some summary statistics. For behaviors 1, 3, 5, and 8, we will plot histograms of speed and timeNearby. Then, we will plot his behavior over the spatial graph to see if location matches up to behavior at all. From this analysis, we will characterize each of these behaviors.
Note that the y-axes of the following graphs do not align; because these are histograms, the proportions matter much more than the raw frequencies.
fig, axs = plt.subplots(2, 2, figsize=(16,16))
p1 = tortoise['Speed(m/min)'][tortoise['Behavior'] == 1].hist(bins=30, color="green", ax = axs[0,0])
axs[0,0].set_xlabel("Speed (meters/minute)")
axs[0,0].set_title("Histogram of Tortoise Speed, Behavior=1")
axs[0,0].set_xlim([0, 0.9])
p2 = tortoise['Speed(m/min)'][tortoise['Behavior'] == 3].hist(bins=30, color="green", ax = axs[0,1])
axs[0,1].set_xlabel("Speed (meters/minute)")
axs[0,1].set_title("Histogram of Tortoise Speed, Behavior=3")
axs[0,1].set_xlim([0, 0.9])
p3 = tortoise['Speed(m/min)'][tortoise['Behavior'] == 5].hist(bins=30, color="green", ax = axs[1,0])
axs[1, 0].set_xlabel("Speed (meters/minute)")
axs[1, 0].set_title("Histogram of Tortoise Speed, Behavior=5")
axs[1, 0].set_xlim([0, 0.9])
p4 = tortoise['Speed(m/min)'][tortoise['Behavior'] == 8].hist(bins=30, color="green", ax = axs[1,1])
axs[1, 1].set_xlabel("Speed (meters/minute)")
axs[1, 1].set_title("Histogram of Tortoise Speed, Behavior=8")
axs[1, 1].set_xlim([0, 0.9])
plt.show()
We can generalize the speeds by behavior as follows:
- Behavior 1 (main behavior): Includes all high speed motion, as well as some low speed motion.
- Behavior 3 (transition behavior): Consistent mid speed motion.
- Behavior 5 (main behavior): Almost no motion.
- Behavior 8 (main behavior): Fairly consistent low speed motion with breaks.
Next, we will plot histograms of behavior vs. time nearby to see which behaviors occur in areas of interest and which behaviors occur in areas that the tortoise is just passing through.
fig, axs = plt.subplots(2, 2, figsize=(16,16))
p1 = tortoise['timeNearby'][tortoise['Behavior'] == 1].hist(bins=30, color="green", ax = axs[0,0])
axs[0,0].set_xlabel("Minutes Spent within 20 Meters")
axs[0,0].set_title("Histogram of Time Nearby, Behavior=1")
axs[0,0].set_xlim([0, 6500])
p2 = tortoise['timeNearby'][tortoise['Behavior'] == 3].hist(bins=30, color="green", ax = axs[0,1])
axs[0,1].set_xlabel("Minutes Spent within 20 Meters")
axs[0,1].set_title("Histogram of Time Nearby, Behavior=3")
axs[0,1].set_xlim([0, 6500])
p3 = tortoise['timeNearby'][tortoise['Behavior'] == 5].hist(bins=10, color="green", ax = axs[1,0])
axs[1, 0].set_xlabel("Minutes Spent within 20 Meters")
axs[1, 0].set_title("Histogram of Time Nearby, Behavior=5")
axs[1, 0].set_xlim([0, 6500])
p4 = tortoise['timeNearby'][tortoise['Behavior'] == 8].hist(bins=30, color="green", ax = axs[1,1])
axs[1, 1].set_xlabel("Minutes Spent within 20 Meters")
axs[1, 1].set_title("Histogram of Time Nearby, Behavior=8")
axs[1, 1].set_xlim([0, 6500])
plt.show()
We can generalize the time spent nearby by behavior as follows:
- Behavior 1 (main behavior): Low time nearby indicates just passing through many areas.
- Behavior 3 (transition behavior): Fairly high time nearby indicates areas of interest or rest. This behavior includes some time at a high-interest spot and some time at mid-interest spots.
- Behavior 5 (main behavior): Very high time nearby indicates areas of interest or rest.
- Behavior 8 (main behavior): High time nearby indicates areas of interest or rest.
Finally, we will graph behavior in relation to location (by color) to see where he is physically during each behavior.
fig, axs = matplotlib.pyplot.subplots(figsize = (5,10))
axs.axis('equal')
axs.set_title("Map of Tortoise Location over Time by Behavior")
axs.text(.4, 1.55, 'Each square in \nthe grid is 20x20 \nmeters.', fontsize=9, transform=ax.transAxes, bbox=dict(facecolor='white', alpha=0.5))
axs.grid(True)
plt.scatter(tortoise['X'][tortoise['Behavior']==1],tortoise['Y'][tortoise['Behavior']==1], c='red', s=1)
plt.scatter(tortoise['X'][tortoise['Behavior']==3],tortoise['Y'][tortoise['Behavior']==3], c='green', s=1)
plt.scatter(tortoise['X'][tortoise['Behavior']==5],tortoise['Y'][tortoise['Behavior']==5], c='blue', s=1)
plt.scatter(tortoise['X'][tortoise['Behavior']==8],tortoise['Y'][tortoise['Behavior']==8], c='orange', s=1)
axs.set_xticks(np.arange(582420, 582680, 20))
axs.set_xticklabels([])
axs.set_yticks(np.arange(-2195400, -2194860, 20))
axs.set_yticklabels([])
custom_lines = [Line2D([0], [0], color='red', lw=4),
Line2D([0], [0], color='green', lw=4),
Line2D([0], [0], color='blue', lw=4),
Line2D([0], [0], color='orange', lw=4)]
axs.legend(custom_lines, ['Behavior 1', 'Behavior 3', 'Behavior 5', 'Behavior 8'])
plt.show()
From this map, we can visually see that behavior 1 is mostly for moving from place to place, while behaviors 5 and 8 are primarily in one location. Also, the movement in behavior 5 is far more clustered than in behavior 8.
Based on my analysis of speed, time nearby, and spatial location that characterizes each behavior, I will give them the following labels:
Behavior 1: Traveling. The range of speeds, including very high, as well as the low time spent in any given area indicate that the tortoise is moving through areas from one place to another.
Behavior 5: Resting. The low movement, safe familiar place, and specific movement to a location to complete the behavior (as indicated in the map) indicates the tortoise is stopping and resting for a period of time.
Behavior 8: Foraging/Eating. The specific area that the tortoise stays in as well as the low speed motion with breaks to stop indicate that the tortoise is moving around and looking for something, probably vegetation to eat.
Behavior 3: Transition. Recall from the behavior-over-time plot that this behavior only occurred in the gaps between other behaviors; that, along with the variable speeds and time nearby, indicates that the tortoise is preparing to switch behaviors.
Finally, we will edit the dataset to include these descriptive labels.
def reclassify(x):
if x == 1:
return "Traveling"
elif x == 3:
return "Transition"
elif x == 5:
return "Resting"
elif x == 8:
return "Foraging"
# Reclassify to add words to each behavior
tortoise = tortoise.assign(**{'Behavior':tortoise['Behavior'].apply(reclassify)})
Conclusions¶
Based on my analysis, the Aldabra giant tortoise in the wild has distinct behaviors, like foraging, travelling, and resting, that he cycles through over the span of days and weeks rather than cycling through behaviors daily like a person. This insight is useful to zookeepers or veterinarians who have to support the Aldabra giant tortoise in captivity. For more thorough analysis, investigating a longer data span or multiple animals would yield wider trends that could be generalized to the entire species. Overall, the analysis of animal movement data yields valuable insights into animal behavior that are useful in both research and practice.