Human trafficking can happen anywhere, and predominantly affects children and young adults. It is difficult to track, because reaching those entrapped by human trafficking is hard, and keeping their information safe once they become survivors is imperative. The Counter-Trafficking Data Collaborative has created an international platform where individual organizations can upload their tracking data on survivors. Because this only tracks those who have been able to escape trafficking, we can assume the number of individuals affected by trafficking is much larger. Using this data, however, we can get a glimpse of who is affected, and how, leading to better strategies for breaking the cycle as a whole. Let's explore with knowledge learned from the course Data Analysis with Python: Zero to Pandas.
Let's import the packages needed to download the data and explore it.
#Importing the packages:
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import os
!pip install --quiet pycountry_convert
from pycountry_convert import country_alpha2_to_country_name, country_name_to_country_alpha3
Let's begin by loading the downloaded dataset file into a pandas DataFrame.
#Importing the uncleaned world data:
world_data = pd.read_csv('The Global Dataset 14 Apr 2020.csv', dtype={'typeOfSexConcatenated': str, 'RecruiterRelationship': str, 'majorityStatusAtExploit':str})
The dataset has now been read into the DataFrame world_data.
Now that we have the data loaded, we need to clean it. We will want to pare down the number of columns and make the values uniform where they are not.
#Let's take a quick look at the data
world_data.head()
By using this data you agree to the Terms of Use: https://www.ctdatacollaborative.org/terms-use | yearOfRegistration | Datasource | gender | ageBroad | majorityStatus | majorityStatusAtExploit | majorityEntry | citizenship | meansOfControlDebtBondage | ... | typeOfSexPrivateSexualServices | typeOfSexConcatenated | isAbduction | RecruiterRelationship | CountryOfExploitation | recruiterRelationIntimatePartner | recruiterRelationFriend | recruiterRelationFamily | recruiterRelationOther | recruiterRelationUnknown | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | 2002 | Case Management | Female | 18--20 | Adult | -99 | -99 | CO | -99 | ... | -99 | -99 | -99 | -99 | -99 | 0 | 0 | 0 | 0 | 1 |
1 | NaN | 2002 | Case Management | Female | 18--20 | Adult | -99 | -99 | CO | -99 | ... | -99 | -99 | -99 | -99 | -99 | 0 | 0 | 0 | 0 | 1 |
2 | NaN | 2002 | Case Management | Female | 18--20 | Adult | -99 | -99 | CO | -99 | ... | -99 | -99 | -99 | -99 | -99 | 0 | 0 | 0 | 0 | 1 |
3 | NaN | 2002 | Case Management | Female | 18--20 | Adult | -99 | -99 | CO | -99 | ... | -99 | -99 | -99 | -99 | -99 | 0 | 0 | 0 | 0 | 1 |
4 | NaN | 2002 | Case Management | Female | 18--20 | Adult | -99 | -99 | CO | -99 | ... | -99 | -99 | -99 | -99 | -99 | 0 | 0 | 0 | 0 | 1 |
5 rows × 64 columns
world_data.shape
(48801, 64)
As we can see, the data seems to lack null values. This is because null values are signified by the placeholder -99, stored either as a number or as a string. We also want to remove the first column, as well as a few of the inner columns that we will not be exploring today.
#Creating a subset of the original dataframe
world_data = world_data[['CountryOfExploitation', 'yearOfRegistration', 'Datasource',
'gender', 'ageBroad', 'majorityStatus', 'majorityStatusAtExploit',
'majorityEntry', 'citizenship', 'isForcedLabour', 'isSexualExploit',
'isOtherExploit', 'isSexAndLabour','typeOfExploitConcatenated']]
#Replacing the -99 placeholder (stored as a string or as a number) with null values
world_data = world_data.replace(['-99', -99], np.nan)
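As an alternative to replacing the placeholder after loading, pandas can treat it as missing while the file is being read. The sketch below is only an illustration: the contents of `sample_csv` are made up (the column names are borrowed from the dataset), and it relies on the `na_values` parameter of `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics the dataset's -99 placeholder.
sample_csv = io.StringIO(
    "gender,ageBroad,isAbduction\n"
    "Female,18--20,-99\n"
    "Male,-99,1\n"
    "Female,9--17,0\n"
)

# na_values lists extra strings that should be parsed as NaN.
sample = pd.read_csv(sample_csv, na_values=["-99"])

# Count the missing cells in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Whether this is preferable depends on the file: it only works if -99 never appears as a legitimate value in any column.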
Let's now see what the cleaned data looks like:
world_data.head()
CountryOfExploitation | yearOfRegistration | Datasource | gender | ageBroad | majorityStatus | majorityStatusAtExploit | majorityEntry | citizenship | isForcedLabour | isSexualExploit | isOtherExploit | isSexAndLabour | typeOfExploitConcatenated | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | 2002 | Case Management | Female | 18--20 | Adult | NaN | NaN | CO | 0.0 | 1.0 | 0.0 | 0.0 | Sexual exploitation |
1 | NaN | 2002 | Case Management | Female | 18--20 | Adult | NaN | NaN | CO | 0.0 | 1.0 | 0.0 | 0.0 | Sexual exploitation |
2 | NaN | 2002 | Case Management | Female | 18--20 | Adult | NaN | NaN | CO | 0.0 | 1.0 | 0.0 | 0.0 | Sexual exploitation |
3 | NaN | 2002 | Case Management | Female | 18--20 | Adult | NaN | NaN | CO | 0.0 | 1.0 | 0.0 | 0.0 | Sexual exploitation |
4 | NaN | 2002 | Case Management | Female | 18--20 | Adult | NaN | NaN | CO | 0.0 | 1.0 | 0.0 | 0.0 | Sexual exploitation |
world_data.shape
(48801, 14)
world_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48801 entries, 0 to 48800
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   CountryOfExploitation      38626 non-null  object
 1   yearOfRegistration         48801 non-null  int64
 2   Datasource                 48801 non-null  object
 3   gender                     48801 non-null  object
 4   ageBroad                   36439 non-null  object
 5   majorityStatus             36439 non-null  object
 6   majorityStatusAtExploit    9290 non-null   object
 7   majorityEntry              6491 non-null   object
 8   citizenship                48523 non-null  object
 9   isForcedLabour             26102 non-null  float64
 10  isSexualExploit            23861 non-null  float64
 11  isOtherExploit             30938 non-null  float64
 12  isSexAndLabour             23456 non-null  float64
 13  typeOfExploitConcatenated  32627 non-null  object
dtypes: float64(4), int64(1), object(9)
memory usage: 5.2+ MB
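The non-null counts above can also be expressed as the share of missing values per column. For example, majorityEntry has 6,491 non-null values out of 48,801 rows, so roughly 87% of it is missing. A minimal sketch of that calculation, using a made-up four-row frame named `toy` rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the real world_data has 48,801 rows.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Female"],
    "ageBroad": ["18--20", np.nan, "9--17", np.nan],
    "majorityEntry": [np.nan, np.nan, np.nan, "Minor"],
})

# isna() marks missing cells; the column-wise mean of those booleans
# is the fraction of rows that are missing in each column.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Columns with a very high missing share are worth treating with caution in any later chart.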
Now that the data has been cleaned, we can start exploring it.
One of the best ways to start exploring data is to plot it, and visualize the relationships. That is what we are going to do below. Using matplotlib, seaborn, and plotly, let's dive into the data.
But first, let's set some plot defaults and create a smaller dataframe to explore:
#Setting new plot defaults
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_palette('Set3')
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
#Selecting a new dataframe of only survivors who were exploited in the US.
us_data = world_data[world_data['CountryOfExploitation']=='US']
#Selecting a new dataframe of only US survivors who were recorded in 2017 and 2018.
us_17_18_data = us_data[us_data['yearOfRegistration'] >= 2017]
#Checking a random sample to make sure the new dataframe is correctly filtered.
us_17_18_data.sample(10)
CountryOfExploitation | yearOfRegistration | Datasource | gender | ageBroad | majorityStatus | majorityStatusAtExploit | majorityEntry | citizenship | isForcedLabour | isSexualExploit | isOtherExploit | isSexAndLabour | typeOfExploitConcatenated | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
48361 | US | 2018 | Hotline | Female | 9--17 | Minor | Minor | NaN | 00 | NaN | NaN | 0.0 | NaN | NaN |
40891 | US | 2017 | Hotline | Female | 27--29 | Adult | NaN | NaN | US | 0.0 | 1.0 | 0.0 | 0.0 | Sexual exploitation |
40123 | US | 2017 | Hotline | Female | 21--23 | Adult | Minor | NaN | 00 | 0.0 | 1.0 | 0.0 | 0.0 | Sexual exploitation |
48610 | US | 2018 | Hotline | Male | 0--8 | Minor | Minor | Minor | 00 | 0.0 | 1.0 | 0.0 | 0.0 | Sexual exploitation |
41418 | US | 2017 | Hotline | Female | 39--47 | Adult | NaN | NaN | US | 0.0 | 1.0 | 0.0 | 0.0 | Sexual exploitation |
46618 | US | 2018 | Hotline | Female | 30--38 | Adult | NaN | NaN | 00 | 0.0 | 1.0 | 0.0 | 0.0 | Sexual exploitation |
45355 | US | 2018 | Hotline | Female | 21--23 | Adult | NaN | NaN | 00 | 0.0 | 1.0 | 0.0 | 0.0 | Sexual exploitation |
45556 | US | 2018 | Hotline | Female | 21--23 | Adult | NaN | NaN | 00 | 0.0 | 1.0 | 0.0 | 0.0 | Sexual exploitation |
46396 | US | 2018 | Hotline | Female | 30--38 | Adult | NaN | NaN | 00 | 0.0 | 1.0 | 0.0 | 0.0 | Sexual exploitation |
47529 | US | 2018 | Hotline | Female | 9--17 | Minor | Minor | NaN | 00 | 0.0 | 1.0 | 0.0 | 0.0 | Sexual exploitation |
#How many survivors were reported in 2017 vs. 2018?
#countplot draws one bar per year whose height is the number of rows (survivors) in that year
ax = sns.countplot(data=us_17_18_data,
                   x='yearOfRegistration');
As we can see above, roughly the same number of survivors were reported in each of these two years, just over 4,000.
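Reading counts off a chart is approximate, so it can help to print them as well. The sketch below shows one way to do that with `value_counts`, on a made-up five-row frame (`toy` and its values are hypothetical, only the column name comes from the dataset):

```python
import pandas as pd

# Hypothetical sample: two records from 2017 and three from 2018.
toy = pd.DataFrame({"yearOfRegistration": [2017, 2017, 2018, 2018, 2018]})

# value_counts() counts the rows for each distinct year;
# sort_index() puts the years in chronological order.
per_year = toy["yearOfRegistration"].value_counts().sort_index()
print(per_year)
```

Applied to us_17_18_data, the same two calls would give the exact number of reported survivors per year.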
#Exploring the number of male and female survivors
sns.catplot(x='gender',
kind='count',
palette="RdBu",
data=us_17_18_data);
Overwhelmingly, more female survivors were reported in 2017 and 2018, with over 7,000 cases in this two-year span. What does this look like in percentages?
#Creating a plotly pie chart to break down gender prevalence:
gender = us_17_18_data.groupby(['gender', 'ageBroad']).size().reset_index()
gender.rename(columns = {0:'Number of Survivors'}, inplace=True)
fig = px.pie(gender.groupby('gender').sum().reset_index(),
values = 'Number of Survivors',
names = 'gender',
title = 'Gender of Human Trafficking Survivors',
color_discrete_sequence=px.colors.sequential.GnBu)
fig.show();
#At what age range did these victims become survivors?
count = us_17_18_data.ageBroad.value_counts()
plt.plot(count);
A majority of individuals who escape human trafficking are 9-20 years old. What does this look like when we also account for gender?
fig = px.bar(gender, x = 'ageBroad', y = 'Number of Survivors', color = 'gender', color_discrete_sequence=px.colors.qualitative.D3,
category_orders = {'ageBroad': ['0--8', '9--17', '18--20', '21--23', '24--26', '27--29', '30--38', '39--47', '48+']})
fig.show()
It appears the majority of reported males are 17 and younger, with a few in the age range of 21-23.
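The same age-by-gender breakdown can be viewed as a table, which makes small groups easier to read than stacked bars. This is a minimal sketch on made-up rows (`toy` is hypothetical). Because the age bands are strings, they are put into their natural order explicitly with `reindex`.

```python
import pandas as pd

# Hypothetical sample rows using the dataset's age-band labels.
toy = pd.DataFrame({
    "ageBroad": ["9--17", "9--17", "18--20", "0--8", "21--23"],
    "gender": ["Male", "Female", "Female", "Male", "Female"],
})

# Age bands sort alphabetically by default, so list the intended order.
age_order = ["0--8", "9--17", "18--20", "21--23"]

# Count rows per (age band, gender), pivot gender into columns,
# then reorder the rows to follow age_order.
counts = (
    toy.groupby(["ageBroad", "gender"])
    .size()
    .unstack(fill_value=0)
    .reindex(age_order, fill_value=0)
)
print(counts)
```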
#Below, we are going to make a stacked bar chart using plotly.
us_17_18_data['Survivors'] = len(us_17_18_data) #new column made
data_bar_mg = pd.DataFrame(us_17_18_data.groupby(['gender', 'majorityStatus'])['majorityStatus'].agg(Survivors='count')).reset_index() #creating a dataframe aggregating age groups
fig = px.bar(data_bar_mg, x="majorityStatus", y="Survivors", color="gender",
title="Human Trafficking Demographics at Time of Survivors Status",
labels={'majorityStatus':'Age'},
color_discrete_sequence= px.colors.sequential.Plasma_r) #this is the plot
fig.update_traces(texttemplate='%{value}', textposition='outside') #adding values to the plot
fig.update_layout(hovermode='x') #adding info when hovering over the plot
fig.show();
Based on this data, female survivors are more likely to be adults, while males are more likely to become survivors while still minors.
#Using a similar plotly graph to the graph above, exploring age at first exploit.
exploit = us_17_18_data.groupby(['gender', 'majorityStatusAtExploit']).size().reset_index()
exploit.rename(columns = {0:'Number of Survivors'}, inplace=True)
fig = px.bar(exploit, x = 'majorityStatusAtExploit',
y = 'Number of Survivors',
color = 'gender',
title="Human Trafficking Demographics at Time of First Exploit",
labels={'majorityStatusAtExploit':'Age'},
color_discrete_sequence= px.colors.sequential.Plasma_r,
category_orders = {'majorityStatusAtExploit': ['Minor', 'Adult']})
fig.update_traces(texttemplate='%{value}', textposition='outside')
fig.show()
This chart contrasts with our first. First exploitation of individuals tends to be while they are minors, but it takes years for them to eventually escape human trafficking.
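One way to put a number on this contrast is to count the survivors who were minors when first exploited but adults when registered. The sketch below does that on made-up rows (`toy` is hypothetical) and drops records where the status at first exploitation is unknown, which is the case for most of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; NaN stands for an unknown status at first exploit.
toy = pd.DataFrame({
    "majorityStatusAtExploit": ["Minor", "Minor", "Adult", np.nan, "Minor"],
    "majorityStatus": ["Adult", "Minor", "Adult", "Adult", "Adult"],
})

# Keep only the rows where the status at first exploit is known.
known = toy.dropna(subset=["majorityStatusAtExploit"])

exploited_as_minor = known["majorityStatusAtExploit"] == "Minor"
registered_as_adult = known["majorityStatus"] == "Adult"

# Survivors first exploited as minors who were adults at registration.
n_aged_out = int((exploited_as_minor & registered_as_adult).sum())
share_aged_out = n_aged_out / int(exploited_as_minor.sum())
print(n_aged_out, round(share_aged_out, 2))
```

On real data the result only describes the records where both fields were filled in, so it should be read as indicative rather than representative.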
#Using seaborn catplot to explore categories of exploit
g = sns.catplot(x='typeOfExploitConcatenated',
kind='count',
height=10,
palette="Set3",
data=us_17_18_data);
g.set_axis_labels("", "Number of Survivors").set_xticklabels(["Sexual Exploit", "Labor Exploit", "Both"]).despine(left=True);
From the data we can see the majority of survivors reported sexual exploitation, and very few reported labor exploitation.
#Let's draw a map of where the survivors are being reported.
#First we have to get the countries into a plottable format using Python:
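def get_alpha3(col):
    #Convert a country name to its ISO alpha-3 code; return NaN if it cannot be converted
    try:
        iso_3 = country_name_to_country_alpha3(col)
    except Exception:
        iso_3 = np.nan
    return iso_3
def get_name(col):
    #Convert an ISO alpha-2 code to a country name; return NaN if it cannot be converted
    try:
        name = country_alpha2_to_country_name(col)
    except Exception:
        name = np.nan
    return name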
#Now let's group by country of exploitation:
world_data['CountryOfExploitation'] = world_data['CountryOfExploitation'].apply(lambda x: get_name(x)) #renaming the country of exploitation using function above
world_data['alpha_3'] = world_data['CountryOfExploitation'].apply(lambda x: get_alpha3(x)) #converting the country name to its 3-letter abbreviation in a new column
exploitation_map = pd.DataFrame(world_data.groupby(['CountryOfExploitation', 'alpha_3'])['alpha_3'].agg(Survivors='count')).reset_index() #creating a dataframe with the new columns and renamed countries so plotly can read it
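The two-step conversion (2-letter code to name, name to 3-letter code) with a NaN fallback can be illustrated without the library. In the sketch below, `ALPHA2_TO_NAME` and `NAME_TO_ALPHA3` are hand-written stand-ins covering three countries only. They are assumptions for illustration and not part of pycountry_convert.

```python
import pandas as pd

# Hand-written stand-ins for the library lookups (three countries only).
ALPHA2_TO_NAME = {"US": "United States", "UA": "Ukraine", "PH": "Philippines"}
NAME_TO_ALPHA3 = {"United States": "USA", "Ukraine": "UKR", "Philippines": "PHL"}

# Sample codes, including an unknown code ("00") and a missing value.
codes = pd.Series(["US", "UA", "US", "00", None, "PH"])

# Series.map with a dict returns NaN for keys that are not in the dict,
# which mirrors the try/except fallback in the functions above.
names = codes.map(ALPHA2_TO_NAME)
alpha3 = names.map(NAME_TO_ALPHA3)

print(pd.DataFrame({"code": codes, "name": names, "alpha_3": alpha3}))
```

Rows that end up as NaN are silently dropped by the groupby, so unknown codes such as "00" never reach the map.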
#Now for the plotly map graph:
fig = px.choropleth(exploitation_map, locations='alpha_3',
color='Survivors',
hover_name='CountryOfExploitation',
color_continuous_scale='Viridis_r')
fig.update_layout(title_text="Human Trafficking Survivors Based on Reported Country of Exploitation")
fig.show()
#What are the numbers?
exploitation_map[['CountryOfExploitation', 'Survivors']].set_index('CountryOfExploitation').sort_values(by='Survivors', ascending=False).head(10) #creating a quick series table
CountryOfExploitation | Survivors |
---|---|
United States | 12512 |
Ukraine | 5399 |
Moldova, Republic of | 4504 |
Russian Federation | 2738 |
Philippines | 1988 |
Indonesia | 1777 |
Cambodia | 1000 |
Malaysia | 930 |
Ghana | 544 |
United Arab Emirates | 504 |
As we can see, the US has a very large number of reported survivors. Why is this, and are the demographics similar to the rest of the world?
#Let's look at this as a dataframe:
most_common_exploit = world_data.typeOfExploitConcatenated.value_counts(ascending=False).head(5) #taking only the top 5
most_common_exploit
Sexual exploitation              15989
Forced labour                     8969
Other                             7063
Slavery and similar practices      359
Forced marriage                    168
Name: typeOfExploitConcatenated, dtype: int64
#Seaborn bar graph of the most prevalent forms of exploitation:
g = sns.barplot(x=most_common_exploit.index, y=most_common_exploit.values)
labels=['Sexual Exploit', 'Forced Labor', 'Other', 'Slavery or Similar', 'Forced Marriage'] #renaming the columns for ease of reading
g.set_xticklabels(labels=labels, rotation=80);
Globally, the most common form of human trafficking is sexual exploitation, followed by forced labor. Unlike in the US, however, forced labor and other types of trafficking are roughly 56% and 44% as prevalent as sexual exploitation (8,969 and 7,063 reports against 15,989), a dramatically higher share than the United States sees.
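The relative prevalence of each category can be worked out directly from the counts printed above. The short calculation below copies the top three counts from that output and divides each by the sexual exploitation count:

```python
# Counts copied from the value_counts output above.
counts = {
    "Sexual exploitation": 15989,
    "Forced labour": 8969,
    "Other": 7063,
}

# Express every category as a fraction of the largest one.
baseline = counts["Sexual exploitation"]
relative = {name: n / baseline for name, n in counts.items()}

for name, ratio in relative.items():
    print(f"{name}: {ratio:.1%} of the sexual exploitation count")
```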
#Using plotly to view male vs. female survivors in a pie chart
gender = world_data.groupby(['gender', 'ageBroad']).size().reset_index() #first creating a subset dataframe
gender.rename(columns = {0:'Number of Survivors'}, inplace=True) #renaming the columns
fig = px.pie(gender.groupby('gender').sum().reset_index(),
values = 'Number of Survivors',
names = 'gender',
title = 'Gender of Human Trafficking Survivors',
color_discrete_sequence=px.colors.sequential.RdBu) #graph created
fig.show();
Male survivors are still less prevalent than females, but they make up closer to a quarter of survivors worldwide, compared to 3.7% in the US.
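Gender shares for two groups can be placed side by side in a single table instead of compared across two pie charts. The sketch below uses `pd.crosstab` with `normalize="index"` on made-up rows. The `region` column and its values are hypothetical, and note that the global figures in this notebook include the US records, whereas the toy example keeps the two groups separate.

```python
import pandas as pd

# Hypothetical sample: four US records and four records from elsewhere.
toy = pd.DataFrame({
    "region": ["US"] * 4 + ["Rest of world"] * 4,
    "gender": ["Female", "Female", "Female", "Male",
               "Female", "Female", "Male", "Male"],
})

# normalize="index" turns each row of counts into shares that sum to 1.
shares = pd.crosstab(toy["region"], toy["gender"], normalize="index")
print(shares)
```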
As we can see, thousands of people around the world have been reported as survivors of human trafficking. In this exploration, we just scratched the surface of who this affects and where they are affected.
Because this dataset relies on individual organizations to upload the contacts they made with survivors, we don't know if this is a full picture of human trafficking in the world. This basic comparison showed that in 2017 and 2018, a majority of reported US trafficking survivors were young adult females who were sexually exploited. This is not mirrored in the larger global view of a 17-year period. The global comparison showed us about 75% of the survivors were female, and the most prevalent forms of exploitation were split between sexual exploitation, forced labor, and "other".
Looking at the global data we can also see that the US is leading in human trafficking survivors. But what does this mean? Does the US have more reported human trafficking survivors because it has the best organizations to help rescue individuals from trafficking? Better record systems in place to track survivors? More prevalence of human trafficking overall?
This is not a question easily answered with the data we have here. Very few organizations record and release human trafficking survivor data, as doing so can intrude on the privacy of survivors. The resource I used here, the Counter-Trafficking Data Collaborative, is still very new, having begun in 2017, and it will take time for more data to be collected and reported to the public, so that a better understanding of human trafficking can be developed.
With more awareness, we can help to fight human trafficking.
As more data becomes available, more thorough analysis can be made of human trafficking data, and more robust strategies to combat it can be developed.
Source: Counter-Trafficking Data Collaborative (CTDC), [October, 2020]
Useful websites: