install.packages("tidyverse")
install.packages("janitor")
install.packages("gt")
One important step in data analysis that can not be overlooked is data cleaning, this ensure that the data is error free, complete, accurate and consistent by identifying and handling missing data, correcting or removing invalid and inconsistent records across the dataset. In doing this you will not only ensure that any insights or decisions made from the data are accurate and reliable, but also improve data storage by reducing the amount of unnecessary or redundant data, which can save storage space and improve processing times.
In this tutorial, we will be working with the FIFA-2021 data of different players around the world. This dataset contains information that can help us gain several insights about the demographics and attributes of professional football players However, like many other real-world datasets, it is messy and contains errors, inconsistencies, and missing data. Today we will clean and prepare the data for future analysis using several data cleaning techniques.
Libraries
We will be using the some of the core tidyverse packages such as readr, tidyr, dplyr and stringr, as well as some additional packages that will help us with specific data cleaning tasks. install the following package if you don’t already have them.
library(tidyverse)
library(stringi)
library(janitor)
library(lubridate)
library(gt)
Data Loading
Next we will load in the FIFA-21 dataset. you can download the data following this link and store it in a folder called data in your current working directory.
<- read_csv("data/fifa21 raw data v2.csv") raw_data
Rows: 18979 Columns: 77
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (23): Name, LongName, photoUrl, playerUrl, Nationality, Club, Contract, ...
dbl (54): ID, Age, ↓OVA, POT, BOV, Attacking, Crossing, Finishing, Heading A...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(raw_data)
# A tibble: 6 × 77
ID Name LongN…¹ photo…² playe…³ Natio…⁴ Age `↓OVA` POT Club Contr…⁵
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
1 158023 L. Me… Lionel… https:… http:/… Argent… 33 93 93 "\n\… 2004 ~…
2 20801 Crist… C. Ron… https:… http:/… Portug… 35 92 92 "\n\… 2018 ~…
3 200389 J. Ob… Jan Ob… https:… http:/… Sloven… 27 91 93 "\n\… 2014 ~…
4 192985 K. De… Kevin … https:… http:/… Belgium 29 91 91 "\n\… 2015 ~…
5 190871 Neyma… Neymar… https:… http:/… Brazil 28 91 91 "\n\… 2017 ~…
6 188545 R. Le… Robert… https:… http:/… Poland 31 91 91 "\n\… 2014 ~…
# … with 66 more variables: Positions <chr>, Height <chr>, Weight <chr>,
# `Preferred Foot` <chr>, BOV <dbl>, `Best Position` <chr>, Joined <chr>,
# `Loan Date End` <chr>, Value <chr>, Wage <chr>, `Release Clause` <chr>,
# Attacking <dbl>, Crossing <dbl>, Finishing <dbl>, `Heading Accuracy` <dbl>,
# `Short Passing` <dbl>, Volleys <dbl>, Skill <dbl>, Dribbling <dbl>,
# Curve <dbl>, `FK Accuracy` <dbl>, `Long Passing` <dbl>,
# `Ball Control` <dbl>, Movement <dbl>, Acceleration <dbl>, …
Data Cleaning
Name
Before we proceed further, some of the variable name contain spaces and different letter case which can be challenging when handling them through out the cleaning process, so i will transform the names by setting them to lower case and replace any space with an underscore ’_’ using the clean_names()
function from the janitor package.
<- clean_names(raw_data)
edit_raw
colnames(edit_raw)
[1] "id" "name" "long_name" "photo_url"
[5] "player_url" "nationality" "age" "ova"
[9] "pot" "club" "contract" "positions"
[13] "height" "weight" "preferred_foot" "bov"
[17] "best_position" "joined" "loan_date_end" "value"
[21] "wage" "release_clause" "attacking" "crossing"
[25] "finishing" "heading_accuracy" "short_passing" "volleys"
[29] "skill" "dribbling" "curve" "fk_accuracy"
[33] "long_passing" "ball_control" "movement" "acceleration"
[37] "sprint_speed" "agility" "reactions" "balance"
[41] "power" "shot_power" "jumping" "stamina"
[45] "strength" "long_shots" "mentality" "aggression"
[49] "interceptions" "positioning" "vision" "penalties"
[53] "composure" "defending" "marking" "standing_tackle"
[57] "sliding_tackle" "goalkeeping" "gk_diving" "gk_handling"
[61] "gk_kicking" "gk_positioning" "gk_reflexes" "total_stats"
[65] "base_stats" "w_f" "sm" "a_w"
[69] "d_w" "ir" "pac" "sho"
[73] "pas" "dri" "def" "phy"
[77] "hits"
Club
The club
variable represent the current club in which the player have a contract with as of the year 2021.
select(edit_raw, club)
# A tibble: 18,979 × 1
club
<chr>
1 "\n\n\n\nFC Barcelona"
2 "\n\n\n\nJuventus"
3 "\n\n\n\nAtlético Madrid"
4 "\n\n\n\nManchester City"
5 "\n\n\n\nParis Saint-Germain"
6 "\n\n\n\nFC Bayern München"
7 "\n\n\n\nLiverpool"
8 "\n\n\n\nLiverpool"
9 "\n\n\n\nParis Saint-Germain"
10 "\n\n\n\nFC Barcelona"
# … with 18,969 more rows
If you notice the column have some “\n” special character which is also called an escape sequence, it represents a new line character and is used to insert a line break in a string, we don’t need this here so we will remove them. some other things you might have missed are:
There are some players without a current contract.
Some records contain non-English characters such as é, ü, ş.
First, We will remove the “\n” character then convert non-English characters or special characters into their closest English equivalents using the stri_trans_general()
function from the stringi package, this process is called transliteration.
<- edit_raw |>
edit_raw mutate(club = str_remove_all(club, "\\n") |> stri_trans_general("Latin-ASCII"))
select(edit_raw, club) |> head(5) |> gt()
club |
---|
FC Barcelona |
Juventus |
Atletico Madrid |
Manchester City |
Paris Saint-Germain |
Check
It is a good practice to check for any misrepresentation of the data after making some changes in them. So we will verify if there are NA
values.
|> filter(is.na(club)) |> select(club) edit_raw
# A tibble: 0 × 1
# … with 1 variable: club <chr>
Great! no missing values, so all club names with special character have been transformed successfully.
Contract
This column contains the information about the year in which the players contract with their current club started and the year the contract will or is supposed to end.
select(edit_raw, contract)
# A tibble: 18,979 × 1
contract
<chr>
1 2004 ~ 2021
2 2018 ~ 2022
3 2014 ~ 2023
4 2015 ~ 2023
5 2017 ~ 2022
6 2014 ~ 2023
7 2017 ~ 2023
8 2018 ~ 2024
9 2018 ~ 2022
10 2014 ~ 2022
# … with 18,969 more rows
It is important that the two different information above should be in their respective column. This provide a clear and consistent record, there by making it easy to perform both simple and complex analysis using them.
To restructure the contract variable, we need to separate each of the identified variable into a start_year
and end_year
column. Before we do that, here are some observations we should keep in mind:
Players with no current club are considered free agents and are with no contract start year and end year.
|> filter(club == "No Club") |> select(club, contract) edit_raw
# A tibble: 237 × 2
club contract
<chr> <chr>
1 No Club Free
2 No Club Free
3 No Club Free
4 No Club Free
5 No Club Free
6 No Club Free
7 No Club Free
8 No Club Free
9 No Club Free
10 No Club Free
# … with 227 more rows
Note that when a player is on loan, no contract start year was specified.
|>
edit_raw filter(str_detect(contract, "Loan")) |>
select(club, contract, loan_date_end)
# A tibble: 1,013 × 3
club contract loan_date_end
<chr> <chr> <chr>
1 Tottenham Hotspur Jun 30, 2021 On Loan Jun 30, 2021
2 Fulham Jun 30, 2021 On Loan Jun 30, 2021
3 Paris Saint-Germain Jun 30, 2021 On Loan Jun 30, 2021
4 Napoli Jun 30, 2021 On Loan Jun 30, 2021
5 Atletico Madrid Jun 30, 2021 On Loan Jun 30, 2021
6 Bayer 04 Leverkusen Jun 30, 2021 On Loan Jun 30, 2021
7 FC Porto Jun 30, 2021 On Loan Jun 30, 2021
8 Arsenal Jun 30, 2021 On Loan Jun 30, 2021
9 RB Leipzig Jun 30, 2021 On Loan Jun 30, 2021
10 Sporting CP Jun 30, 2021 On Loan Jun 30, 2021
# … with 1,003 more rows
Also, all contract records do not have the same contract start year and end year in number, some are in both words and number dates, mostly for players on loan.
The contract start year and end year will be separated using a the tilde ~ character and the separate()
function from tidyr package.
<- edit_raw |>
edit_raw separate(col = contract,
into = c("start_year", "end_year"),
sep = " ~ ",
extra = "merge",
remove = FALSE)
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1250 rows [206,
249, 255, 290, 293, 303, 307, 333, 337, 370, 375, 376, 377, 378, 379, 380, 381,
398, 406, 417, ...].
select(edit_raw, contract, start_year, end_year) |> head(5) |> gt()
contract | start_year | end_year |
---|---|---|
2004 ~ 2021 | 2004 | 2021 |
2018 ~ 2022 | 2018 | 2022 |
2014 ~ 2023 | 2014 | 2023 |
2015 ~ 2023 | 2015 | 2023 |
2017 ~ 2022 | 2017 | 2022 |
There is a warning message which state that rows in which the expression “~” given to the sep
argument can not be used to separate the values. let us view some of these rows.
|>
edit_raw select(club, contract, start_year, end_year) |>
slice(206, 249, 255, 290, 293, 303, 307, 333, 337, 370, 375, 376)
# A tibble: 12 × 4
club contract start_year end_year
<chr> <chr> <chr> <chr>
1 Tottenham Hotspur Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>
2 Fulham Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>
3 Paris Saint-Germain Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>
4 No Club Free Free <NA>
5 No Club Free Free <NA>
6 Napoli Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>
7 Atletico Madrid Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>
8 Bayer 04 Leverkusen Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>
9 FC Porto Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>
10 No Club Free Free <NA>
11 No Club Free Free <NA>
12 No Club Free Free <NA>
If you remember earlier when some observation of the contract column were put in place, some of the records have their date stated in words such as Free
(majorly players with no club), dates in words and number of when the contract began (majorly for players on loan) instead of the start and end year, this is the reason why there are some missing values in the end year because the separate()
function doesn’t know how to separate such record. In order to solve this problem, another variable called contract_type
will be created which will help in distinguishing between each type of contract allowing the start and end year the freedom of having just the numeric year data type.
<- edit_raw |>
edit_raw mutate(
# create a categorical variable for the type of contract.
contract_type = case_when(str_detect(start_year, "Loan") ~ "Loan",
== "Free" ~ "Free",
start_year TRUE ~ "Permanent"),
# send all Loan end year from the contract start year to the end year.
end_year = case_when(str_detect(start_year, "Loan") ~ str_extract(start_year, "\\d{4}"),
TRUE ~ end_year),
# convert all start_year records with `Free` and `Loan` to NA
start_year = case_when(contract == "Free" ~ NA_character_,
str_detect(start_year, "Loan") ~ NA_character_,
TRUE ~ start_year)
|>
) # convert start and end year to a numeric data type.
mutate(across(c(start_year, end_year), as.integer))
select(edit_raw, club, contract_type, start_year, end_year) |> head(5) |> gt()
club | contract_type | start_year | end_year |
---|---|---|---|
FC Barcelona | Permanent | 2004 | 2021 |
Juventus | Permanent | 2018 | 2022 |
Atletico Madrid | Permanent | 2014 | 2023 |
Manchester City | Permanent | 2015 | 2023 |
Paris Saint-Germain | Permanent | 2017 | 2022 |
There you have it, the type of contract, the year the contract started and ended all on their own column which create a lot of flexibility when using the variables one at a time or all at the same time.
check
Again we will perform some checks on the new columns created. first let us verify that all players with a permanent contract do not have missing records for the contract start and end year.
|>
edit_raw filter(contract_type == "Permanent" & is.na(start_year) & is.na(end_year)) |>
select(contract_type, start_year, end_year)
# A tibble: 0 × 3
# … with 3 variables: contract_type <chr>, start_year <int>, end_year <int>
Zero row returned great!
Now let us do some checks on the start and end date, ideally the start date should be earlier that the end date.
|>
edit_raw mutate(check = start_year <= end_year) |>
# condition to return poor records with earlier end year than start year
filter(check == FALSE) |>
select(start_year, end_year, check)
# A tibble: 0 × 3
# … with 3 variables: start_year <int>, end_year <int>, check <lgl>
Great! zero row, what this means is that all contract start year comes before the end year.
Joined national team date and loan contract expiration date.
Next, some closely related variables to the one above, which includes the date each player joined
their various national team and also the end date of a player loan contract, if the player was on a loan contract.
select(edit_raw, joined, loan_date_end) |> head(5)
# A tibble: 5 × 2
joined loan_date_end
<chr> <chr>
1 Jul 1, 2004 <NA>
2 Jul 10, 2018 <NA>
3 Jul 16, 2014 <NA>
4 Aug 30, 2015 <NA>
5 Aug 3, 2017 <NA>
Observation:
The data type of the
joined
variable is character, given that it is supposed to represented as date variable, it will be converted from a character to a date data type.
<- edit_raw |>
edit_raw mutate(joined = mdy(joined),
loan_date_end = mdy(loan_date_end)) |>
rename(loan_end_date = loan_date_end, joined_national_team = joined)
|>
edit_raw filter(contract_type == "Loan") |>
select(joined_national_team, loan_end_date) |>
head(5) |>
gt()
joined_national_team | loan_end_date |
---|---|
2013-09-01 | 2021-06-30 |
2010-07-01 | 2021-06-30 |
2015-07-02 | 2021-06-30 |
2019-07-01 | 2021-06-30 |
2018-07-12 | 2021-06-30 |
Check
National team date column
|> filter(is.na(joined_national_team)) |> select(joined_national_team) edit_raw
# A tibble: 0 × 1
# … with 1 variable: joined_national_team <date>
Verify if loan end date have missing value when a player is on loan.
|>
edit_raw select(contract_type, loan_end_date, end_year) |>
filter(contract_type == "Loan" & is.na(loan_end_date))
# A tibble: 0 × 3
# … with 3 variables: contract_type <chr>, loan_end_date <date>, end_year <int>
Just in case you didn’t notice, the loan contract end_year
is available in the loan_end_date
column it will be helpful to compare both value for each player to check if there are inconsistencies.
|>
edit_raw
# get the year
mutate(loan_end_year = year(loan_end_date)) |>
# return columns where the player is on loan and the contract end_year is not
# the same as the year in the loan end date column
filter(contract_type == "Loan" & end_year != loan_end_year) |>
select(contract_type, end_year, loan_end_year)
# A tibble: 0 × 3
# … with 3 variables: contract_type <chr>, end_year <int>, loan_end_year <dbl>
No missing value or misrepresented value detected.
Value, wage and release clause.
The value
variable represent the worth of each player as of the year 2021, the wages
represent the amount players earn from their respective clubs, while the release clause which is also known as a buyout clause or a buyout option.
select(edit_raw, value, wage, release_clause) |> head()
# A tibble: 6 × 3
value wage release_clause
<chr> <chr> <chr>
1 €103.5M €560K €138.4M
2 €63M €220K €75.9M
3 €120M €125K €159.4M
4 €129M €370K €161M
5 €132M €270K €166.5M
6 €111M €240K €132M
If we take a closer look at them, these variables are supposed to have numeric data type, they also seem to have similar problem such as:
A euro sign at the beginning of the value.
Either a (K) representing value in thousands and (M) in millions.
The values are collated in a short format i.e 12K instead of 12000.
Character data type.
To change these variables to the right value and data type the above stated observations will need to be rectified accordingly. All character signs and letter will be dropped and values with K will be converted to thousands in numeric value and M to millions.
<- edit_raw |>
edit_raw mutate(
# selecting the columns (value, wage, release_clause) to iterate over.
across(c(value, wage, release_clause), function(col) {
# convert each variable value to their respective numerical value.
case_when(
str_detect(col, "M") ~ as.double(str_replace_all(col, "(€|M)", "")) * 1000000,
str_detect(col, "K") ~ as.double(str_replace_all(col, "(€|K)", "")) * 1000,
TRUE ~ as.double(str_replace(col, "€", ""))
)
})
)
select(edit_raw, value, wage, release_clause) |> head(5) |> gt()
value | wage | release_clause |
---|---|---|
103500000 | 560000 | 138400000 |
63000000 | 220000 | 75900000 |
120000000 | 125000 | 159400000 |
129000000 | 370000 | 161000000 |
132000000 | 270000 | 166500000 |
Check
checking for NA
values.
filter(edit_raw, is.na(value)) |> select(value)
# A tibble: 0 × 1
# … with 1 variable: value <dbl>
filter(edit_raw, is.na(wage)) |> select(wage)
# A tibble: 0 × 1
# … with 1 variable: wage <dbl>
filter(edit_raw, is.na(release_clause)) |> select(release_clause)
# A tibble: 0 × 1
# … with 1 variable: release_clause <dbl>
We have no NA
values in all the columns.
Height & weight
The height and weight gives the exact physical description of the players, they also have a few problem of their own so let us check them.
select(edit_raw, height, weight)
# A tibble: 18,979 × 2
height weight
<chr> <chr>
1 170cm 72kg
2 187cm 83kg
3 188cm 87kg
4 181cm 70kg
5 175cm 68kg
6 184cm 80kg
7 175cm 71kg
8 191cm 91kg
9 178cm 73kg
10 187cm 85kg
# … with 18,969 more rows
|> filter(str_detect(height, "'")) |> select(height) edit_raw
# A tibble: 40 × 1
height
<chr>
1 "6'2\""
2 "6'3\""
3 "6'5\""
4 "5'11\""
5 "6'4\""
6 "6'1\""
7 "6'0\""
8 "6'1\""
9 "5'11\""
10 "6'2\""
# … with 30 more rows
Starting with the height
variable, this was recorded in both centimeter (170cm) and in feet and inches which means that the value “6’2” represent 6 feet and 2 inches. Of course we cant leave it this way, so we will have to convert the height to an appropriate numeric measure. I will be using the centimeter as the unit of measurement for players height as it is the most commonly used metric system, it also have other added advantages over feet & inches such as ease of use, compatibility etc.
To clean up the height column the all records in feet will be converted to centimeters. To do that the rows in feet & inches will be converted to inches initially then converted to CM. If we want to convert 6’2 to cm then we can use the formula: (6 * 12 + 2) * 2.54
Where:
12 inches = 1 foot, this means that any length measured in feet can be converted to inches by multiplying the number of feet by 12.
2.54 cm = 1 inch, so we can convert the height in inches to CM by multiplying the number of inches by 2.54
First we will create a custom function that convert each record height to centimeter.
<- function(str) {
convert_to_cm
map(str, function(s) {
if (stringr::str_detect(s, "'")) {
<- stringr::str_split(s, "'")[[1]]
feet_inches
2] <- stringr::str_remove(feet_inches[2], "\"")
feet_inches[
as.numeric(feet_inches[1]) * 12 + as.numeric(feet_inches[2])) * 2.54
(
else {
}
s
}|>
}) unlist()
}
<- edit_raw |>
edit_raw mutate(height = convert_to_cm(height) |> str_remove("cm") |> as.double()) |>
rename(height_cm = height)
select(edit_raw, height_cm) |> head(5) |> gt()
height_cm |
---|
170 |
187 |
188 |
181 |
175 |
Check
filter(edit_raw, is.na(height_cm)) |> select(height_cm)
# A tibble: 0 × 1
# … with 1 variable: height_cm <dbl>
Moving to the weight
variable, the column holds record of weight in both pounds(lbs) and kilograms(kg) to unify the unit of measurement, the weight will be converted to kg for the reason that kilograms is a more common unit of measurement for weight and provides a more precise and standardized measurement compared to lbs.
To fix the weight variable, all records in lbs will be converted to kg using the formula:
1 lb = 0.45359237 kg. This means that to convert lbs to kg we will need to multiply the number of pounds by 0.45359237
Again we will create a custom function that help us convert the weights to kilograms.
<- function(str) {
convert_to_kg
map(str, function(s) {
if (stringr::str_detect(s, "lbs")) {
<- stringr::str_remove(s, "lbs") |> as.numeric()
lbs
round(lbs * 0.45359237, 2)
else {
}
s
}|>
}) unlist()
}
<- edit_raw |>
edit_raw mutate(weight = convert_to_kg(weight) |> str_remove("kg") |> as.double()) |>
rename(weight_kg = weight)
select(edit_raw, weight_kg) |> head(5) |> gt()
weight_kg |
---|
72 |
83 |
87 |
70 |
68 |
checks
filter(edit_raw, is.na(weight_kg)) |> select(weight_kg)
# A tibble: 0 × 1
# … with 1 variable: weight_kg <dbl>
Great! the height values were correctly converted to centimeters without any error and the weight to numeric representation of the kilogram metric.
Player name
select(edit_raw, long_name, name)
# A tibble: 18,979 × 2
long_name name
<chr> <chr>
1 Lionel Messi L. Messi
2 C. Ronaldo dos Santos Aveiro Cristiano Ronaldo
3 Jan Oblak J. Oblak
4 Kevin De Bruyne K. De Bruyne
5 Neymar da Silva Santos Jr. Neymar Jr
6 Robert Lewandowski R. Lewandowski
7 Mohamed Salah M. Salah
8 Alisson Ramses Becker Alisson
9 Kylian Mbappé K. Mbappé
10 Marc-André ter Stegen M. ter Stegen
# … with 18,969 more rows
The long_name
and name
variable contains non-English characters or special characters so we will use the stri_trans_general()
function to convert special characters, this is to ensure that the data is in a consistent format that can be easily processed and analyzed.
<- edit_raw |>
edit_raw mutate(across(c(long_name, name), \(n) stri_trans_general(n, "Latin-ASCII")))
select(edit_raw, long_name, name) |> head(5) |> gt()
long_name | name |
---|---|
Lionel Messi | L. Messi |
C. Ronaldo dos Santos Aveiro | Cristiano Ronaldo |
Jan Oblak | J. Oblak |
Kevin De Bruyne | K. De Bruyne |
Neymar da Silva Santos Jr. | Neymar Jr |
Some drawbacks that should be noted about transliteration is that, it can lead to loss of important information such as accent marks or diacritical symbols that can change the meaning or pronunciation of a name, it can also be cultural insensitivity. Overall, it is important to carefully consider the potential risks and benefits of removing special characters from people names when cleaning data.
Checks
filter(edit_raw, is.na(long_name)) |> select(long_name)
# A tibble: 0 × 1
# … with 1 variable: long_name <chr>
filter(edit_raw, is.na(name)) |> select(name)
# A tibble: 0 × 1
# … with 1 variable: name <chr>
No NA
which mean the conversion was successful for all records.
Nationality
select(edit_raw, nationality)
# A tibble: 18,979 × 1
nationality
<chr>
1 Argentina
2 Portugal
3 Slovenia
4 Belgium
5 Brazil
6 Poland
7 Egypt
8 Brazil
9 France
10 Germany
# … with 18,969 more rows
The country includes some non-english character which will be converted. also while aiming for consistency across all country names, the &
ampersand symbol will be replaced with the letter ‘and’, also the Korea DPR will be updated to Korea DPRK where the ‘DPRK’ stands for ‘Democratic People’s Republic of Korea’.
<- edit_raw |>
edit_raw mutate(nationality = stri_trans_general(nationality, "Latin-ASCII"),
nationality = str_replace(nationality, "&", "and"),
nationality = if_else(nationality == "Korea DPR", "Korea DPRK", nationality))
select(edit_raw, nationality) |> head(5) |> gt()
nationality |
---|
Argentina |
Portugal |
Slovenia |
Belgium |
Brazil |
Check
filter(edit_raw, is.na(nationality)) |> select(nationality)
# A tibble: 0 × 1
# … with 1 variable: nationality <chr>
Player functionality rating.
The w/f
column representing player’s weak foot rating, SM
which represent player’s skill moves rating and finally IR
representing the injury rating. these variables are all on a scale of 1-5. Which is to say that they must be a numeric variable with an minimum value of 1 and a maximum value of 5.
select(edit_raw, w_f, sm, ir)
# A tibble: 18,979 × 3
w_f sm ir
<chr> <chr> <chr>
1 4 ★ 4★ 5 ★
2 4 ★ 5★ 5 ★
3 3 ★ 1★ 3 ★
4 5 ★ 4★ 4 ★
5 5 ★ 5★ 5 ★
6 4 ★ 4★ 4 ★
7 3 ★ 4★ 3 ★
8 3 ★ 1★ 3 ★
9 4 ★ 5★ 3 ★
10 4 ★ 1★ 3 ★
# … with 18,969 more rows
These variables will be converted to a numeric data type by simply removing the star symbol and changing the data type to integer.
<- edit_raw |>
edit_raw # iterating over each column and extracting only the digit present in them.
mutate(across(c(w_f, sm, ir), \(s) str_extract(s, "[:digit:]") |> as.integer())) |>
rename(weak_foot = w_f, skill_move = sm, injury_rating = ir)
select(edit_raw, weak_foot, skill_move, injury_rating) |> head(5) |> gt()
weak_foot | skill_move | injury_rating |
---|---|---|
4 | 4 | 5 |
4 | 5 | 5 |
3 | 1 | 3 |
5 | 4 | 4 |
5 | 5 | 5 |
Check
To make sure every value in the three variables above are within 1 to 5 and there are no NA
value, we will perform a descriptive summary on each of them.
|> select(weak_foot, skill_move, injury_rating) |> summary() edit_raw
weak_foot skill_move injury_rating
Min. :1.000 Min. :1.000 Min. :1.000
1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000
Median :3.000 Median :2.000 Median :1.000
Mean :2.941 Mean :2.365 Mean :1.092
3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:1.000
Max. :5.000 Max. :5.000 Max. :5.000
Looking at the minimum (Min.) and maximum (Max.) values of each variable, none of them have values below 1 and above 5.
filter(edit_raw, is.na(weak_foot)) |> select(weak_foot)
# A tibble: 0 × 1
# … with 1 variable: weak_foot <int>
filter(edit_raw, is.na(skill_move)) |> select(skill_move)
# A tibble: 0 × 1
# … with 1 variable: skill_move <int>
filter(edit_raw, is.na(injury_rating)) |> select(injury_rating)
# A tibble: 0 × 1
# … with 1 variable: injury_rating <int>
Hit
select(edit_raw, hits)
# A tibble: 18,979 × 1
hits
<chr>
1 771
2 562
3 150
4 207
5 595
6 248
7 246
8 120
9 1.6K
10 130
# … with 18,969 more rows
A few problem can be picked out from the hits
variable.
Some of the hits are up to 1000 which were formatted into 1K so converting them back to their original numeric representation will be done by multiplying any record with a
K
present in them by 1000.
There are missing values.
<- edit_raw |>
edit_raw # creating a numeric hits column
mutate(hit_num = str_replace(hits, "K", "") |> as.double(),
# change all 1.1K to 1100 based on the condition that there is a K
# string in the record.
hits = case_when(str_detect(hits, "K") ~ as.character(hit_num * 1000),
TRUE ~ hits ) |> as.double()) |>
# drop the hit_num column.
select(-hit_num)
select(edit_raw, hits) |> head(5) |> gt()
hits |
---|
771 |
562 |
150 |
207 |
595 |
Player positions
The position variable represent all the position a player have played.
select(edit_raw, positions)
# A tibble: 18,979 × 1
positions
<chr>
1 RW, ST, CF
2 ST, LW
3 GK
4 CAM, CM
5 LW, CAM
6 ST
7 RW
8 GK
9 ST, LW, RW
10 GK
# … with 18,969 more rows
Looking at the top 10 rows there are several positions merged into a single variable, this violates the first normal form (1NF) in database normalization because each row in a column should contain a single value, not multiple values. To comply with 1NF, the multivalued positions
variable will be moved into a separate table with a one-to-many relationship between the original FIFA-21 table and the new position table. Some of the major advantages of this technique are reduction of data redundancy, simplified data analysis process and enhanced data flexibility.
<- edit_raw |>
player_position select(id) |>
# join the separated positions to with the player id
bind_cols(
$positions |>
edit_rawstr_split(" ", simplify = TRUE) |>
as_tibble(.name_repair = "universal")
|>
) # keep all position value in a single column.
pivot_longer(cols = `...2`:`...4`,
names_to = "num_position",
values_to = "position") |>
# remove all empty rows
filter(position != "") |>
# remove all comma from the position string.
mutate(position = str_remove(position, ",")) |>
# drop the num_position variable.
select(-num_position)
Check
Check for valid positions abbreviations
<- c(
position_dict "GK" = "Goalkeeper",
# Defenders ------------------------------>
"CB" = "Central back",
"S" = "Sweeper",
"FB" = "Full back",
"LB" = "Left back",
"RB" = "Right back",
"WB" = "Wing back",
"RWB"= "Right wing back",
"LWB"= "Left wing back",
# Midfielders ---------------------------->
"CM" = "Central midfielder",
"DM" = "Defensive midfielder",
"AM" = "Attacking midfielder",
"CAM"= "Centeral attacking midfielder",
"CDM"= "Centeral Defensive midfielder",
"LM" = "Left midfielder",
"RM" = "Right midfielder",
"WM" = "Wide midfielder",
# Forward -------------------------------->
"ST" = "Striker",
"SS" = "Second striker",
"CF" = "Center forward",
"W" = "Winger",
"LW" = "Left winger",
"RW" = "Right winger"
)
<- player_position |> distinct(position) |> pull()
pos
all(pos %in% names(position_dict))
[1] TRUE
Join tables
You can combine the new position table with the main table using the left_join()
function from the dplyr package, all you have to pass to the function is the position table and the player id.
|>
edit_raw left_join(player_position, by = "id") |>
select(id, long_name, position)
# A tibble: 30,493 × 3
id long_name position
<dbl> <chr> <chr>
1 158023 Lionel Messi RW
2 158023 Lionel Messi ST
3 158023 Lionel Messi CF
4 20801 C. Ronaldo dos Santos Aveiro ST
5 20801 C. Ronaldo dos Santos Aveiro LW
6 200389 Jan Oblak GK
7 192985 Kevin De Bruyne CAM
8 192985 Kevin De Bruyne CM
9 190871 Neymar da Silva Santos Jr. LW
10 190871 Neymar da Silva Santos Jr. CAM
# … with 30,483 more rows
Data validation
These columns seem to be in their perfect data type so just a little extra data validity check will be done on them.
Age
The question we need to ask here is whether the age distribution is a representative of the overall football age distribution.
summary(edit_raw$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
16.00 21.00 25.00 25.19 29.00 53.00
The average is 25 compared to the average age from 2021 edition of the EA Sports FIFA which was around 27-28 years old.
Variables measured in percentage.
OVA: Players overall analysis or rate.
POT: Players potential.
BOV: Best Overall in percentage.
|>
edit_raw select(ova, pot, bov) |>
summarise(across(everything(),
list(min = min, max = max),
.names = "{.col}_{.fn}"))
# A tibble: 1 × 6
ova_min ova_max pot_min pot_max bov_min bov_max
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 47 93 47 95 48 93
All minimum value maximum values are within 0 to 100.
Player rating
These includes variable such as attacking, crossing, dribbling, etc, they are all supposed to be numeric.
|>
edit_raw select(attacking:base_stats, pac:phy) |>
map_lgl(is.double) |> all()
[1] TRUE
Checking for missing values in any of the selected variable.
|>
edit_raw select(attacking:base_stats, pac:phy) |>
transmute(across(everything(), is.na)) |>
summarise(across(everything(), sum)) |>
pivot_longer(everything(), names_to = "rating", values_to = "missing") |>
filter(missing > 0)
# A tibble: 0 × 2
# … with 2 variables: rating <chr>, missing <int>
Attacking & defensive work rate.
The rating of players in this segment should only include any of high, medium or low. Any other rating will be considered an incorrect record.
|>
edit_raw select(a_w, d_w) |>
names() |>
map(\(i) edit_raw |> distinct(.data[[i]]))
[[1]]
# A tibble: 3 × 1
a_w
<chr>
1 Medium
2 High
3 Low
[[2]]
# A tibble: 3 × 1
d_w
<chr>
1 Low
2 Medium
3 High
Data quality check
|> select(where(is.numeric)) |> names() edit_raw
[1] "id" "age" "ova" "pot"
[5] "start_year" "end_year" "height_cm" "weight_kg"
[9] "bov" "value" "wage" "release_clause"
[13] "attacking" "crossing" "finishing" "heading_accuracy"
[17] "short_passing" "volleys" "skill" "dribbling"
[21] "curve" "fk_accuracy" "long_passing" "ball_control"
[25] "movement" "acceleration" "sprint_speed" "agility"
[29] "reactions" "balance" "power" "shot_power"
[33] "jumping" "stamina" "strength" "long_shots"
[37] "mentality" "aggression" "interceptions" "positioning"
[41] "vision" "penalties" "composure" "defending"
[45] "marking" "standing_tackle" "sliding_tackle" "goalkeeping"
[49] "gk_diving" "gk_handling" "gk_kicking" "gk_positioning"
[53] "gk_reflexes" "total_stats" "base_stats" "weak_foot"
[57] "skill_move" "injury_rating" "pac" "sho"
[61] "pas" "dri" "def" "phy"
[65] "hits"
|> select(where(is.character)) |> names() edit_raw
[1] "name" "long_name" "photo_url" "player_url"
[5] "nationality" "club" "contract" "positions"
[9] "preferred_foot" "best_position" "a_w" "d_w"
[13] "contract_type"
|> select(where(is.Date)) |> names() edit_raw
[1] "joined_national_team" "loan_end_date"
Arranging the columns
<- edit_raw |>
edit_raw select(
id, long_name, name, club, contract, contract_type, start_year, :release_clause,
end_year, loan_end_date, joined_national_team, value:best_position, ova,
nationality, age, weight_kg, height_cm, preferred_foot:hits, positions, player_url, photo_url
pot, attacking
)
|> head(5) |> gt() edit_raw
id | long_name | name | club | contract | contract_type | start_year | end_year | loan_end_date | joined_national_team | value | wage | release_clause | nationality | age | weight_kg | height_cm | preferred_foot | bov | best_position | ova | pot | attacking | crossing | finishing | heading_accuracy | short_passing | volleys | skill | dribbling | curve | fk_accuracy | long_passing | ball_control | movement | acceleration | sprint_speed | agility | reactions | balance | power | shot_power | jumping | stamina | strength | long_shots | mentality | aggression | interceptions | positioning | vision | penalties | composure | defending | marking | standing_tackle | sliding_tackle | goalkeeping | gk_diving | gk_handling | gk_kicking | gk_positioning | gk_reflexes | total_stats | base_stats | weak_foot | skill_move | a_w | d_w | injury_rating | pac | sho | pas | dri | def | phy | hits | positions | player_url | photo_url |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
158023 | Lionel Messi | L. Messi | FC Barcelona | 2004 ~ 2021 | Permanent | 2004 | 2021 | NA | 2004-07-01 | 103500000 | 560000 | 138400000 | Argentina | 33 | 72 | 170 | Left | 93 | RW | 93 | 93 | 429 | 85 | 95 | 70 | 91 | 88 | 470 | 96 | 93 | 94 | 91 | 96 | 451 | 91 | 80 | 91 | 94 | 95 | 389 | 86 | 68 | 72 | 69 | 94 | 347 | 44 | 40 | 93 | 95 | 75 | 96 | 91 | 32 | 35 | 24 | 54 | 6 | 11 | 15 | 14 | 8 | 2231 | 466 | 4 | 4 | Medium | Low | 5 | 85 | 92 | 91 | 95 | 38 | 65 | 771 | RW, ST, CF | http://sofifa.com/player/158023/lionel-messi/210006/ | https://cdn.sofifa.com/players/158/023/21_60.png |
20801 | C. Ronaldo dos Santos Aveiro | Cristiano Ronaldo | Juventus | 2018 ~ 2022 | Permanent | 2018 | 2022 | NA | 2018-07-10 | 63000000 | 220000 | 75900000 | Portugal | 35 | 83 | 187 | Right | 92 | ST | 92 | 92 | 437 | 84 | 95 | 90 | 82 | 86 | 414 | 88 | 81 | 76 | 77 | 92 | 431 | 87 | 91 | 87 | 95 | 71 | 444 | 94 | 95 | 84 | 78 | 93 | 353 | 63 | 29 | 95 | 82 | 84 | 95 | 84 | 28 | 32 | 24 | 58 | 7 | 11 | 15 | 14 | 11 | 2221 | 464 | 4 | 5 | High | Low | 5 | 89 | 93 | 81 | 89 | 35 | 77 | 562 | ST, LW | http://sofifa.com/player/20801/c-ronaldo-dos-santos-aveiro/210006/ | https://cdn.sofifa.com/players/020/801/21_60.png |
200389 | Jan Oblak | J. Oblak | Atletico Madrid | 2014 ~ 2023 | Permanent | 2014 | 2023 | NA | 2014-07-16 | 120000000 | 125000 | 159400000 | Slovenia | 27 | 87 | 188 | Right | 91 | GK | 91 | 93 | 95 | 13 | 11 | 15 | 43 | 13 | 109 | 12 | 13 | 14 | 40 | 30 | 307 | 43 | 60 | 67 | 88 | 49 | 268 | 59 | 78 | 41 | 78 | 12 | 140 | 34 | 19 | 11 | 65 | 11 | 68 | 57 | 27 | 12 | 18 | 437 | 87 | 92 | 78 | 90 | 90 | 1413 | 489 | 3 | 1 | Medium | Medium | 3 | 87 | 92 | 78 | 90 | 52 | 90 | 150 | GK | http://sofifa.com/player/200389/jan-oblak/210006/ | https://cdn.sofifa.com/players/200/389/21_60.png |
192985 | Kevin De Bruyne | K. De Bruyne | Manchester City | 2015 ~ 2023 | Permanent | 2015 | 2023 | NA | 2015-08-30 | 129000000 | 370000 | 161000000 | Belgium | 29 | 70 | 181 | Right | 91 | CAM | 91 | 91 | 407 | 94 | 82 | 55 | 94 | 82 | 441 | 88 | 85 | 83 | 93 | 92 | 398 | 77 | 76 | 78 | 91 | 76 | 408 | 91 | 63 | 89 | 74 | 91 | 408 | 76 | 66 | 88 | 94 | 84 | 91 | 186 | 68 | 65 | 53 | 56 | 15 | 13 | 5 | 10 | 13 | 2304 | 485 | 5 | 4 | High | High | 4 | 76 | 86 | 93 | 88 | 64 | 78 | 207 | CAM, CM | http://sofifa.com/player/192985/kevin-de-bruyne/210006/ | https://cdn.sofifa.com/players/192/985/21_60.png |
190871 | Neymar da Silva Santos Jr. | Neymar Jr | Paris Saint-Germain | 2017 ~ 2022 | Permanent | 2017 | 2022 | NA | 2017-08-03 | 132000000 | 270000 | 166500000 | Brazil | 28 | 68 | 175 | Right | 91 | LW | 91 | 91 | 408 | 85 | 87 | 62 | 87 | 87 | 448 | 95 | 88 | 89 | 81 | 95 | 453 | 94 | 89 | 96 | 91 | 83 | 357 | 80 | 62 | 81 | 50 | 84 | 356 | 51 | 36 | 87 | 90 | 92 | 93 | 94 | 35 | 30 | 29 | 59 | 9 | 9 | 15 | 15 | 11 | 2175 | 451 | 5 | 5 | High | Medium | 5 | 91 | 85 | 86 | 94 | 36 | 59 | 595 | LW, CAM | http://sofifa.com/player/190871/neymar-da-silva-santos-jr/210006/ | https://cdn.sofifa.com/players/190/871/21_60.png |
Saving data
We can now go ahead and save the clean data in the data directory we created at the start of the tutorial.
write_csv(fifa_21, "data/fifa21_clean_data.csv")
write_csv(player_position, "data/player_position.csv")
You can access the full code by clicking on this link
Summary
Now that we have our cleaned data, Here is a brief highlight of the steps taken to clean the data.
* Loading the data: We imported the data set using the read_csv()
function.
* Column names: All column name were converted to lower case with an underscore.
* Variable cleaning: Separated the contract variable, change the loan end date and start date of each player’s national career to a date data type, remove additional characters from the height and weight column, converted players name and country to all english character, also removed symbols and escape characters from the ratings and hits column.
* Position table: Created a new table with each player id and field positions.
* Data validation: Validated all other variables for completeness and consistency.
* Data quality check: Verified all data types for each column.
* Saving the data: Saved both the cleaned data and the new position table in a .csv file.
Conclusion
The primary goal of data cleaning is to improve the quality and accuracy of the data so that it can be useful for analysis and decision-making. Data cleaning is also a critical step in the data analysis/science process and cannot be ignored or overlooked when working with data. With the data cleaned, we can proceed into data exploration, visualization, preprocessing, etc.