FIFA 2021 Data Cleaning In R

R
data cleaning
Author

Ayomide Akinwande

Published

March 16, 2023

One important step in data analysis that can not be overlooked is data cleaning, this ensure that the data is error free, complete, accurate and consistent by identifying and handling missing data, correcting or removing invalid and inconsistent records across the dataset. In doing this you will not only ensure that any insights or decisions made from the data are accurate and reliable, but also improve data storage by reducing the amount of unnecessary or redundant data, which can save storage space and improve processing times.

In this tutorial, we will be working with the FIFA-2021 data of different players around the world. This dataset contains information that can help us gain several insights about the demographics and attributes of professional football players However, like many other real-world datasets, it is messy and contains errors, inconsistencies, and missing data. Today we will clean and prepare the data for future analysis using several data cleaning techniques.

Libraries

We will be using the some of the core tidyverse packages such as readr, tidyr, dplyr and stringr, as well as some additional packages that will help us with specific data cleaning tasks. install the following package if you don’t already have them.

install.packages("tidyverse")
install.packages("janitor")
install.packages("gt")
library(tidyverse)
library(stringi)
library(janitor)
library(lubridate)

library(gt)

Data Loading

Next we will load in the FIFA-21 dataset. you can download the data following this link and store it in a folder called data in your current working directory.

raw_data <- read_csv("data/fifa21 raw data v2.csv")
Rows: 18979 Columns: 77
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (23): Name, LongName, photoUrl, playerUrl, Nationality, Club, Contract, ...
dbl (54): ID, Age, ↓OVA, POT, BOV, Attacking, Crossing, Finishing, Heading A...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(raw_data)
# A tibble: 6 × 77
      ID Name   LongN…¹ photo…² playe…³ Natio…⁴   Age `↓OVA`   POT Club  Contr…⁵
   <dbl> <chr>  <chr>   <chr>   <chr>   <chr>   <dbl>  <dbl> <dbl> <chr> <chr>  
1 158023 L. Me… Lionel… https:… http:/… Argent…    33     93    93 "\n\… 2004 ~…
2  20801 Crist… C. Ron… https:… http:/… Portug…    35     92    92 "\n\… 2018 ~…
3 200389 J. Ob… Jan Ob… https:… http:/… Sloven…    27     91    93 "\n\… 2014 ~…
4 192985 K. De… Kevin … https:… http:/… Belgium    29     91    91 "\n\… 2015 ~…
5 190871 Neyma… Neymar… https:… http:/… Brazil     28     91    91 "\n\… 2017 ~…
6 188545 R. Le… Robert… https:… http:/… Poland     31     91    91 "\n\… 2014 ~…
# … with 66 more variables: Positions <chr>, Height <chr>, Weight <chr>,
#   `Preferred Foot` <chr>, BOV <dbl>, `Best Position` <chr>, Joined <chr>,
#   `Loan Date End` <chr>, Value <chr>, Wage <chr>, `Release Clause` <chr>,
#   Attacking <dbl>, Crossing <dbl>, Finishing <dbl>, `Heading Accuracy` <dbl>,
#   `Short Passing` <dbl>, Volleys <dbl>, Skill <dbl>, Dribbling <dbl>,
#   Curve <dbl>, `FK Accuracy` <dbl>, `Long Passing` <dbl>,
#   `Ball Control` <dbl>, Movement <dbl>, Acceleration <dbl>, …

Data Cleaning

Name

Before we proceed further, some of the variable name contain spaces and different letter case which can be challenging when handling them through out the cleaning process, so i will transform the names by setting them to lower case and replace any space with an underscore ’_’ using the clean_names() function from the janitor package.

edit_raw <- clean_names(raw_data)

colnames(edit_raw)
 [1] "id"               "name"             "long_name"        "photo_url"       
 [5] "player_url"       "nationality"      "age"              "ova"             
 [9] "pot"              "club"             "contract"         "positions"       
[13] "height"           "weight"           "preferred_foot"   "bov"             
[17] "best_position"    "joined"           "loan_date_end"    "value"           
[21] "wage"             "release_clause"   "attacking"        "crossing"        
[25] "finishing"        "heading_accuracy" "short_passing"    "volleys"         
[29] "skill"            "dribbling"        "curve"            "fk_accuracy"     
[33] "long_passing"     "ball_control"     "movement"         "acceleration"    
[37] "sprint_speed"     "agility"          "reactions"        "balance"         
[41] "power"            "shot_power"       "jumping"          "stamina"         
[45] "strength"         "long_shots"       "mentality"        "aggression"      
[49] "interceptions"    "positioning"      "vision"           "penalties"       
[53] "composure"        "defending"        "marking"          "standing_tackle" 
[57] "sliding_tackle"   "goalkeeping"      "gk_diving"        "gk_handling"     
[61] "gk_kicking"       "gk_positioning"   "gk_reflexes"      "total_stats"     
[65] "base_stats"       "w_f"              "sm"               "a_w"             
[69] "d_w"              "ir"               "pac"              "sho"             
[73] "pas"              "dri"              "def"              "phy"             
[77] "hits"            

Club

The club variable represent the current club in which the player have a contract with as of the year 2021.

select(edit_raw, club)
# A tibble: 18,979 × 1
   club                         
   <chr>                        
 1 "\n\n\n\nFC Barcelona"       
 2 "\n\n\n\nJuventus"           
 3 "\n\n\n\nAtlético Madrid"    
 4 "\n\n\n\nManchester City"    
 5 "\n\n\n\nParis Saint-Germain"
 6 "\n\n\n\nFC Bayern München"  
 7 "\n\n\n\nLiverpool"          
 8 "\n\n\n\nLiverpool"          
 9 "\n\n\n\nParis Saint-Germain"
10 "\n\n\n\nFC Barcelona"       
# … with 18,969 more rows

If you notice the column have some “\n” special character which is also called an escape sequence, it represents a new line character and is used to insert a line break in a string, we don’t need this here so we will remove them. some other things you might have missed are:

There are some players without a current contract.

Some records contain non-English characters such as é, ü, ş.

First, We will remove the “\n” character then convert non-English characters or special characters into their closest English equivalents using the stri_trans_general() function from the stringi package, this process is called transliteration.

edit_raw <- edit_raw |>
  mutate(club = str_remove_all(club, "\\n") |> stri_trans_general("Latin-ASCII")) 

select(edit_raw, club) |> head(5) |> gt()
club
FC Barcelona
Juventus
Atletico Madrid
Manchester City
Paris Saint-Germain
Check

It is a good practice to check for any misrepresentation of the data after making some changes in them. So we will verify if there are NA values.

edit_raw |> filter(is.na(club)) |> select(club)
# A tibble: 0 × 1
# … with 1 variable: club <chr>

Great! no missing values, so all club names with special character have been transformed successfully.

Contract

This column contains the information about the year in which the players contract with their current club started and the year the contract will or is supposed to end.

select(edit_raw, contract)
# A tibble: 18,979 × 1
   contract   
   <chr>      
 1 2004 ~ 2021
 2 2018 ~ 2022
 3 2014 ~ 2023
 4 2015 ~ 2023
 5 2017 ~ 2022
 6 2014 ~ 2023
 7 2017 ~ 2023
 8 2018 ~ 2024
 9 2018 ~ 2022
10 2014 ~ 2022
# … with 18,969 more rows

It is important that the two different information above should be in their respective column. This provide a clear and consistent record, there by making it easy to perform both simple and complex analysis using them.
To restructure the contract variable, we need to separate each of the identified variable into a start_year and end_year column. Before we do that, here are some observations we should keep in mind:

Players with no current club are considered free agents and are with no contract start year and end year.

edit_raw |> filter(club == "No Club") |> select(club, contract)
# A tibble: 237 × 2
   club    contract
   <chr>   <chr>   
 1 No Club Free    
 2 No Club Free    
 3 No Club Free    
 4 No Club Free    
 5 No Club Free    
 6 No Club Free    
 7 No Club Free    
 8 No Club Free    
 9 No Club Free    
10 No Club Free    
# … with 227 more rows

Note that when a player is on loan, no contract start year was specified.

edit_raw |> 
  filter(str_detect(contract, "Loan")) |> 
  select(club, contract, loan_date_end)
# A tibble: 1,013 × 3
   club                contract             loan_date_end
   <chr>               <chr>                <chr>        
 1 Tottenham Hotspur   Jun 30, 2021 On Loan Jun 30, 2021 
 2 Fulham              Jun 30, 2021 On Loan Jun 30, 2021 
 3 Paris Saint-Germain Jun 30, 2021 On Loan Jun 30, 2021 
 4 Napoli              Jun 30, 2021 On Loan Jun 30, 2021 
 5 Atletico Madrid     Jun 30, 2021 On Loan Jun 30, 2021 
 6 Bayer 04 Leverkusen Jun 30, 2021 On Loan Jun 30, 2021 
 7 FC Porto            Jun 30, 2021 On Loan Jun 30, 2021 
 8 Arsenal             Jun 30, 2021 On Loan Jun 30, 2021 
 9 RB Leipzig          Jun 30, 2021 On Loan Jun 30, 2021 
10 Sporting CP         Jun 30, 2021 On Loan Jun 30, 2021 
# … with 1,003 more rows

Also, all contract records do not have the same contract start year and end year in number, some are in both words and number dates, mostly for players on loan.

The contract start year and end year will be separated using a the tilde ~ character and the separate() function from tidyr package.

edit_raw <- edit_raw |>
  separate(col = contract, 
           into = c("start_year", "end_year"), 
           sep = " ~ ",
           extra = "merge",
           remove = FALSE)
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1250 rows [206,
249, 255, 290, 293, 303, 307, 333, 337, 370, 375, 376, 377, 378, 379, 380, 381,
398, 406, 417, ...].
select(edit_raw, contract, start_year, end_year) |> head(5) |> gt()
contract start_year end_year
2004 ~ 2021 2004 2021
2018 ~ 2022 2018 2022
2014 ~ 2023 2014 2023
2015 ~ 2023 2015 2023
2017 ~ 2022 2017 2022

There is a warning message which state that rows in which the expression “~” given to the sep argument can not be used to separate the values. let us view some of these rows.

edit_raw |>
  select(club, contract, start_year, end_year) |>
  slice(206, 249, 255, 290, 293, 303, 307, 333, 337, 370, 375, 376)
# A tibble: 12 × 4
   club                contract             start_year           end_year
   <chr>               <chr>                <chr>                <chr>   
 1 Tottenham Hotspur   Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>    
 2 Fulham              Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>    
 3 Paris Saint-Germain Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>    
 4 No Club             Free                 Free                 <NA>    
 5 No Club             Free                 Free                 <NA>    
 6 Napoli              Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>    
 7 Atletico Madrid     Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>    
 8 Bayer 04 Leverkusen Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>    
 9 FC Porto            Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>    
10 No Club             Free                 Free                 <NA>    
11 No Club             Free                 Free                 <NA>    
12 No Club             Free                 Free                 <NA>    

If you remember earlier when some observation of the contract column were put in place, some of the records have their date stated in words such as Free (majorly players with no club), dates in words and number of when the contract began (majorly for players on loan) instead of the start and end year, this is the reason why there are some missing values in the end year because the separate() function doesn’t know how to separate such record. In order to solve this problem, another variable called contract_type will be created which will help in distinguishing between each type of contract allowing the start and end year the freedom of having just the numeric year data type.

edit_raw <- edit_raw |>
  mutate(
    # create a categorical variable for the type of contract.
    contract_type = case_when(str_detect(start_year, "Loan") ~ "Loan",
                              start_year == "Free" ~ "Free",
                              TRUE ~ "Permanent"),
    
    # send all Loan end year from the contract start year to the end year.     
    end_year = case_when(str_detect(start_year, "Loan") ~ str_extract(start_year, "\\d{4}"),
                         TRUE ~ end_year),
    
    # convert all start_year records with `Free` and `Loan` to NA
    start_year = case_when(contract == "Free" ~ NA_character_, 
                           str_detect(start_year, "Loan") ~ NA_character_,
                           TRUE ~ start_year)
  ) |>
  # convert start and end year to a numeric data type.
  mutate(across(c(start_year, end_year), as.integer))

select(edit_raw, club, contract_type, start_year, end_year) |> head(5) |> gt()
club contract_type start_year end_year
FC Barcelona Permanent 2004 2021
Juventus Permanent 2018 2022
Atletico Madrid Permanent 2014 2023
Manchester City Permanent 2015 2023
Paris Saint-Germain Permanent 2017 2022

There you have it, the type of contract, the year the contract started and ended all on their own column which create a lot of flexibility when using the variables one at a time or all at the same time.

check

Again we will perform some checks on the new columns created. first let us verify that all players with a permanent contract do not have missing records for the contract start and end year.

edit_raw |> 
  filter(contract_type == "Permanent" & is.na(start_year) & is.na(end_year)) |> 
  select(contract_type, start_year, end_year)
# A tibble: 0 × 3
# … with 3 variables: contract_type <chr>, start_year <int>, end_year <int>

Zero row returned great!

Now let us do some checks on the start and end date, ideally the start date should be earlier that the end date.

edit_raw |>
  mutate(check = start_year <= end_year) |>
  # condition to return poor records with earlier end year than start year
  filter(check == FALSE) |>
  select(start_year, end_year, check)
# A tibble: 0 × 3
# … with 3 variables: start_year <int>, end_year <int>, check <lgl>

Great! zero row, what this means is that all contract start year comes before the end year.

Joined national team date and loan contract expiration date.

Next, some closely related variables to the one above, which includes the date each player joined their various national team and also the end date of a player loan contract, if the player was on a loan contract.

select(edit_raw, joined, loan_date_end) |> head(5)
# A tibble: 5 × 2
  joined       loan_date_end
  <chr>        <chr>        
1 Jul 1, 2004  <NA>         
2 Jul 10, 2018 <NA>         
3 Jul 16, 2014 <NA>         
4 Aug 30, 2015 <NA>         
5 Aug 3, 2017  <NA>         

Observation:

The data type of the joined variable is character, given that it is supposed to represented as date variable, it will be converted from a character to a date data type.

edit_raw <- edit_raw |>
  mutate(joined = mdy(joined),
         loan_date_end = mdy(loan_date_end)) |>
  rename(loan_end_date = loan_date_end, joined_national_team = joined)

edit_raw |>
  filter(contract_type == "Loan") |>
  select(joined_national_team, loan_end_date) |> 
  head(5) |>
  gt()
joined_national_team loan_end_date
2013-09-01 2021-06-30
2010-07-01 2021-06-30
2015-07-02 2021-06-30
2019-07-01 2021-06-30
2018-07-12 2021-06-30
Check

National team date column

edit_raw |> filter(is.na(joined_national_team)) |> select(joined_national_team)
# A tibble: 0 × 1
# … with 1 variable: joined_national_team <date>

Verify if loan end date have missing value when a player is on loan.

edit_raw |> 
  select(contract_type, loan_end_date, end_year) |>
  filter(contract_type == "Loan" & is.na(loan_end_date))
# A tibble: 0 × 3
# … with 3 variables: contract_type <chr>, loan_end_date <date>, end_year <int>

Just in case you didn’t notice, the loan contract end_year is available in the loan_end_date column it will be helpful to compare both value for each player to check if there are inconsistencies.

edit_raw |> 
  
  # get the year
  mutate(loan_end_year = year(loan_end_date)) |>
  
  # return columns where the player is on loan and the contract end_year is not  
  # the same as the year in the loan end date column
  filter(contract_type == "Loan" & end_year != loan_end_year) |>
  
  select(contract_type, end_year, loan_end_year)
# A tibble: 0 × 3
# … with 3 variables: contract_type <chr>, end_year <int>, loan_end_year <dbl>

No missing value or misrepresented value detected.

Value, wage and release clause.

The value variable represent the worth of each player as of the year 2021, the wages represent the amount players earn from their respective clubs, while the release clause which is also known as a buyout clause or a buyout option.

select(edit_raw, value, wage, release_clause) |> head()
# A tibble: 6 × 3
  value   wage  release_clause
  <chr>   <chr> <chr>         
1 €103.5M €560K €138.4M       
2 €63M    €220K €75.9M        
3 €120M   €125K €159.4M       
4 €129M   €370K €161M         
5 €132M   €270K €166.5M       
6 €111M   €240K €132M         

If we take a closer look at them, these variables are supposed to have numeric data type, they also seem to have similar problem such as:

A euro sign at the beginning of the value.

Either a (K) representing value in thousands and (M) in millions.

The values are collated in a short format i.e 12K instead of 12000.

Character data type.

To change these variables to the right value and data type the above stated observations will need to be rectified accordingly. All character signs and letter will be dropped and values with K will be converted to thousands in numeric value and M to millions.

edit_raw <- edit_raw |>
   mutate(
     # selecting the columns (value, wage, release_clause) to iterate over.
     across(c(value, wage, release_clause), function(col) {
      # convert each variable value to their respective numerical value.
      case_when(
        str_detect(col, "M") ~ as.double(str_replace_all(col, "(€|M)", "")) * 1000000,
        str_detect(col, "K") ~ as.double(str_replace_all(col, "(€|K)", "")) * 1000,
        TRUE ~ as.double(str_replace(col, "€", ""))
      )
    })
   )

select(edit_raw, value, wage, release_clause) |> head(5) |> gt()
value wage release_clause
103500000 560000 138400000
63000000 220000 75900000
120000000 125000 159400000
129000000 370000 161000000
132000000 270000 166500000
Check

checking for NA values.

filter(edit_raw, is.na(value)) |> select(value)
# A tibble: 0 × 1
# … with 1 variable: value <dbl>
filter(edit_raw, is.na(wage)) |> select(wage)
# A tibble: 0 × 1
# … with 1 variable: wage <dbl>
filter(edit_raw, is.na(release_clause)) |> select(release_clause)
# A tibble: 0 × 1
# … with 1 variable: release_clause <dbl>

We have no NA values in all the columns.

Height & weight

The height and weight gives the exact physical description of the players, they also have a few problem of their own so let us check them.

select(edit_raw, height, weight)
# A tibble: 18,979 × 2
   height weight
   <chr>  <chr> 
 1 170cm  72kg  
 2 187cm  83kg  
 3 188cm  87kg  
 4 181cm  70kg  
 5 175cm  68kg  
 6 184cm  80kg  
 7 175cm  71kg  
 8 191cm  91kg  
 9 178cm  73kg  
10 187cm  85kg  
# … with 18,969 more rows
edit_raw |> filter(str_detect(height, "'")) |> select(height)
# A tibble: 40 × 1
   height  
   <chr>   
 1 "6'2\"" 
 2 "6'3\"" 
 3 "6'5\"" 
 4 "5'11\""
 5 "6'4\"" 
 6 "6'1\"" 
 7 "6'0\"" 
 8 "6'1\"" 
 9 "5'11\""
10 "6'2\"" 
# … with 30 more rows

Starting with the height variable, this was recorded in both centimeter (170cm) and in feet and inches which means that the value “6’2” represent 6 feet and 2 inches. Of course we cant leave it this way, so we will have to convert the height to an appropriate numeric measure. I will be using the centimeter as the unit of measurement for players height as it is the most commonly used metric system, it also have other added advantages over feet & inches such as ease of use, compatibility etc.

To clean up the height column the all records in feet will be converted to centimeters. To do that the rows in feet & inches will be converted to inches initially then converted to CM. If we want to convert 6’2 to cm then we can use the formula: (6 * 12 + 2) * 2.54
Where:

12 inches = 1 foot, this means that any length measured in feet can be converted to inches by multiplying the number of feet by 12.

2.54 cm = 1 inch, so we can convert the height in inches to CM by multiplying the number of inches by 2.54

First we will create a custom function that convert each record height to centimeter.

convert_to_cm <- function(str) {
  
  map(str, function(s) {
      if (stringr::str_detect(s, "'")) {
        feet_inches <- stringr::str_split(s, "'")[[1]]
        
        feet_inches[2] <- stringr::str_remove(feet_inches[2], "\"") 

        (as.numeric(feet_inches[1]) * 12 + as.numeric(feet_inches[2])) * 2.54
        
      } else {
        s
      }
    }) |>
     unlist()
  
}

edit_raw <- edit_raw |>
  mutate(height = convert_to_cm(height) |> str_remove("cm") |> as.double()) |>
  rename(height_cm = height)

select(edit_raw, height_cm) |> head(5) |> gt()
height_cm
170
187
188
181
175
Check
filter(edit_raw, is.na(height_cm)) |> select(height_cm)
# A tibble: 0 × 1
# … with 1 variable: height_cm <dbl>

Moving to the weight variable, the column holds record of weight in both pounds(lbs) and kilograms(kg) to unify the unit of measurement, the weight will be converted to kg for the reason that kilograms is a more common unit of measurement for weight and provides a more precise and standardized measurement compared to lbs.
To fix the weight variable, all records in lbs will be converted to kg using the formula:

1 lb = 0.45359237 kg. This means that to convert lbs to kg we will need to multiply the number of pounds by 0.45359237

Again we will create a custom function that help us convert the weights to kilograms.

convert_to_kg <- function(str) {
  
  map(str, function(s) {
    if (stringr::str_detect(s, "lbs")) {
      
      lbs <- stringr::str_remove(s, "lbs") |> as.numeric()
      
      round(lbs * 0.45359237, 2)
      
    } else {
      s
    }
  }) |>
    unlist()
}

edit_raw <- edit_raw |>
  mutate(weight = convert_to_kg(weight) |> str_remove("kg") |> as.double()) |>
  rename(weight_kg = weight)

select(edit_raw, weight_kg) |> head(5) |> gt()
weight_kg
72
83
87
70
68
checks
filter(edit_raw, is.na(weight_kg)) |> select(weight_kg)
# A tibble: 0 × 1
# … with 1 variable: weight_kg <dbl>

Great! the height values were correctly converted to centimeters without any error and the weight to numeric representation of the kilogram metric.

Player name

select(edit_raw, long_name, name)
# A tibble: 18,979 × 2
   long_name                    name             
   <chr>                        <chr>            
 1 Lionel Messi                 L. Messi         
 2 C. Ronaldo dos Santos Aveiro Cristiano Ronaldo
 3 Jan Oblak                    J. Oblak         
 4 Kevin De Bruyne              K. De Bruyne     
 5 Neymar da Silva Santos Jr.   Neymar Jr        
 6 Robert Lewandowski           R. Lewandowski   
 7 Mohamed Salah                M. Salah         
 8 Alisson Ramses Becker        Alisson          
 9 Kylian Mbappé                K. Mbappé        
10 Marc-André ter Stegen        M. ter Stegen    
# … with 18,969 more rows

The long_name and name variable contains non-English characters or special characters so we will use the stri_trans_general() function to convert special characters, this is to ensure that the data is in a consistent format that can be easily processed and analyzed.

edit_raw <- edit_raw |> 
  mutate(across(c(long_name, name), \(n) stri_trans_general(n, "Latin-ASCII")))

select(edit_raw, long_name, name) |> head(5) |> gt()
long_name name
Lionel Messi L. Messi
C. Ronaldo dos Santos Aveiro Cristiano Ronaldo
Jan Oblak J. Oblak
Kevin De Bruyne K. De Bruyne
Neymar da Silva Santos Jr. Neymar Jr

Some drawbacks that should be noted about transliteration is that, it can lead to loss of important information such as accent marks or diacritical symbols that can change the meaning or pronunciation of a name, it can also be cultural insensitivity. Overall, it is important to carefully consider the potential risks and benefits of removing special characters from people names when cleaning data.

Checks
filter(edit_raw, is.na(long_name)) |> select(long_name)
# A tibble: 0 × 1
# … with 1 variable: long_name <chr>
filter(edit_raw, is.na(name)) |> select(name)
# A tibble: 0 × 1
# … with 1 variable: name <chr>

No NA which mean the conversion was successful for all records.

Nationality

select(edit_raw, nationality)
# A tibble: 18,979 × 1
   nationality
   <chr>      
 1 Argentina  
 2 Portugal   
 3 Slovenia   
 4 Belgium    
 5 Brazil     
 6 Poland     
 7 Egypt      
 8 Brazil     
 9 France     
10 Germany    
# … with 18,969 more rows

The country includes some non-english character which will be converted. also while aiming for consistency across all country names, the & ampersand symbol will be replaced with the letter ‘and’, also the Korea DPR will be updated to Korea DPRK where the ‘DPRK’ stands for ‘Democratic People’s Republic of Korea’.

edit_raw <- edit_raw |> 
  mutate(nationality = stri_trans_general(nationality, "Latin-ASCII"),
         nationality = str_replace(nationality, "&", "and"),
         nationality = if_else(nationality == "Korea DPR", "Korea DPRK", nationality))

select(edit_raw, nationality) |> head(5) |> gt()
nationality
Argentina
Portugal
Slovenia
Belgium
Brazil
Check
filter(edit_raw, is.na(nationality)) |> select(nationality)
# A tibble: 0 × 1
# … with 1 variable: nationality <chr>

Player functionality rating.

The w/f column representing player’s weak foot rating, SM which represent player’s skill moves rating and finally IR representing the injury rating. these variables are all on a scale of 1-5. Which is to say that they must be a numeric variable with an minimum value of 1 and a maximum value of 5.

select(edit_raw, w_f, sm, ir)
# A tibble: 18,979 × 3
   w_f   sm    ir   
   <chr> <chr> <chr>
 1 4 ★   4★    5 ★  
 2 4 ★   5★    5 ★  
 3 3 ★   1★    3 ★  
 4 5 ★   4★    4 ★  
 5 5 ★   5★    5 ★  
 6 4 ★   4★    4 ★  
 7 3 ★   4★    3 ★  
 8 3 ★   1★    3 ★  
 9 4 ★   5★    3 ★  
10 4 ★   1★    3 ★  
# … with 18,969 more rows

These variables will be converted to a numeric data type by simply removing the star symbol and changing the data type to integer.

edit_raw <- edit_raw |>
  # iterating over each column and extracting only the digit present in them.
  mutate(across(c(w_f, sm, ir), \(s) str_extract(s, "[:digit:]") |> as.integer())) |>
  rename(weak_foot = w_f, skill_move = sm, injury_rating = ir)

select(edit_raw, weak_foot, skill_move, injury_rating) |> head(5)  |> gt()
weak_foot skill_move injury_rating
4 4 5
4 5 5
3 1 3
5 4 4
5 5 5
Check

To make sure every value in the three variables above are within 1 to 5 and there are no NA value, we will perform a descriptive summary on each of them.

edit_raw |> select(weak_foot, skill_move, injury_rating) |> summary()
   weak_foot       skill_move    injury_rating  
 Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000  
 Median :3.000   Median :2.000   Median :1.000  
 Mean   :2.941   Mean   :2.365   Mean   :1.092  
 3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:1.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000  

Looking at the minimum (Min.) and maximum (Max.) values of each variable, none of them have values below 1 and above 5.

filter(edit_raw, is.na(weak_foot)) |> select(weak_foot)
# A tibble: 0 × 1
# … with 1 variable: weak_foot <int>
filter(edit_raw, is.na(skill_move)) |> select(skill_move)
# A tibble: 0 × 1
# … with 1 variable: skill_move <int>
filter(edit_raw, is.na(injury_rating)) |> select(injury_rating)
# A tibble: 0 × 1
# … with 1 variable: injury_rating <int>

Hit

select(edit_raw, hits)
# A tibble: 18,979 × 1
   hits 
   <chr>
 1 771  
 2 562  
 3 150  
 4 207  
 5 595  
 6 248  
 7 246  
 8 120  
 9 1.6K 
10 130  
# … with 18,969 more rows

A few problem can be picked out from the hits variable.

Some of the hits are up to 1000 which were formatted into 1K so converting them back to their original numeric representation will be done by multiplying any record with a K present in them by 1000.

There are missing values.

edit_raw <- edit_raw |>
         # creating a numeric hits column
  mutate(hit_num = str_replace(hits, "K", "") |> as.double(),
         
         # change all 1.1K to 1100 based on the condition that there is a K 
         # string in the record.
         hits = case_when(str_detect(hits, "K") ~ as.character(hit_num * 1000),
                          TRUE ~ hits ) |> as.double()) |>
  # drop the hit_num column.
  select(-hit_num) 


select(edit_raw, hits) |> head(5) |> gt()
hits
771
562
150
207
595

Player positions

The position variable represent all the position a player have played.

select(edit_raw, positions)
# A tibble: 18,979 × 1
   positions 
   <chr>     
 1 RW, ST, CF
 2 ST, LW    
 3 GK        
 4 CAM, CM   
 5 LW, CAM   
 6 ST        
 7 RW        
 8 GK        
 9 ST, LW, RW
10 GK        
# … with 18,969 more rows

Looking at the top 10 rows there are several positions merged into a single variable, this violates the first normal form (1NF) in database normalization because each row in a column should contain a single value, not multiple values. To comply with 1NF, the multivalued positions variable will be moved into a separate table with a one-to-many relationship between the original FIFA-21 table and the new position table. Some of the major advantages of this technique are reduction of data redundancy, simplified data analysis process and enhanced data flexibility.

player_position <- edit_raw |> 
  select(id) |>
  # join the separated positions to with the player id
  bind_cols(
    edit_raw$positions |>
      str_split(" ", simplify = TRUE) |>
      as_tibble(.name_repair = "universal")
  ) |>
  # keep all position value in a single column. 
  pivot_longer(cols = `...2`:`...4`,    
               names_to = "num_position", 
               values_to = "position") |>
  # remove all empty rows
  filter(position != "") |>
  # remove all comma from the position string.
  mutate(position = str_remove(position, ",")) |>
  # drop the num_position variable.
  select(-num_position) 
Check

Check for valid positions abbreviations

position_dict <- c(
  "GK" = "Goalkeeper",
  
  # Defenders ------------------------------>
  "CB" = "Central back",
  "S"  = "Sweeper", 
  "FB" = "Full back",
  "LB" = "Left back",
  "RB" = "Right back",
  "WB" = "Wing back",
  "RWB"= "Right wing back",
  "LWB"= "Left wing back",
  
  # Midfielders ---------------------------->
  "CM" = "Central midfielder",
  "DM" = "Defensive midfielder",
  "AM" = "Attacking midfielder",
  "CAM"= "Centeral attacking midfielder",
  "CDM"= "Centeral Defensive midfielder",
  "LM" = "Left midfielder",
  "RM" = "Right midfielder",
  "WM" = "Wide midfielder",
  
  # Forward -------------------------------->
  "ST" = "Striker",
  "SS" = "Second striker",
  "CF" = "Center forward",
  "W"  = "Winger",
  "LW" = "Left winger",
  "RW" = "Right winger"
)


pos <- player_position |> distinct(position) |> pull()

all(pos %in% names(position_dict))
[1] TRUE
Join tables

You can combine the new position table with the main table using the left_join() function from the dplyr package, all you have to pass to the function is the position table and the player id.

edit_raw |> 
  left_join(player_position, by = "id") |>
  select(id, long_name, position)
# A tibble: 30,493 × 3
       id long_name                    position
    <dbl> <chr>                        <chr>   
 1 158023 Lionel Messi                 RW      
 2 158023 Lionel Messi                 ST      
 3 158023 Lionel Messi                 CF      
 4  20801 C. Ronaldo dos Santos Aveiro ST      
 5  20801 C. Ronaldo dos Santos Aveiro LW      
 6 200389 Jan Oblak                    GK      
 7 192985 Kevin De Bruyne              CAM     
 8 192985 Kevin De Bruyne              CM      
 9 190871 Neymar da Silva Santos Jr.   LW      
10 190871 Neymar da Silva Santos Jr.   CAM     
# … with 30,483 more rows

Data validation

These columns seem to be in their perfect data type so just a little extra data validity check will be done on them.

Age

The question we need to ask here is whether the age distribution is a representative of the overall football age distribution.

summary(edit_raw$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  16.00   21.00   25.00   25.19   29.00   53.00 

The average is 25 compared to the average age from 2021 edition of the EA Sports FIFA which was around 27-28 years old.

Variables measured in percentage.

OVA: Players overall analysis or rate.
POT: Players potential.
BOV: Best Overall in percentage.

edit_raw |>
  select(ova, pot, bov) |>
  summarise(across(everything(), 
                   list(min = min, max = max), 
                   .names = "{.col}_{.fn}"))
# A tibble: 1 × 6
  ova_min ova_max pot_min pot_max bov_min bov_max
    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1      47      93      47      95      48      93

All minimum value maximum values are within 0 to 100.

Player rating

These includes variable such as attacking, crossing, dribbling, etc, they are all supposed to be numeric.

edit_raw |>
  select(attacking:base_stats, pac:phy) |>
  map_lgl(is.double) |> all()
[1] TRUE

Checking for missing values in any of the selected variable.

edit_raw |>
  select(attacking:base_stats, pac:phy) |>
  transmute(across(everything(), is.na)) |>
  summarise(across(everything(), sum)) |>
  pivot_longer(everything(), names_to = "rating", values_to = "missing") |>
  filter(missing > 0)
# A tibble: 0 × 2
# … with 2 variables: rating <chr>, missing <int>

Attacking & defensive work rate.

The rating of players in this segment should only include any of high, medium or low. Any other rating will be considered an incorrect record.

edit_raw |>
  select(a_w, d_w) |> 
  names() |>
  map(\(i) edit_raw |> distinct(.data[[i]]))
[[1]]
# A tibble: 3 × 1
  a_w   
  <chr> 
1 Medium
2 High  
3 Low   

[[2]]
# A tibble: 3 × 1
  d_w   
  <chr> 
1 Low   
2 Medium
3 High  

Data quality check

edit_raw |> select(where(is.numeric)) |> names()
 [1] "id"               "age"              "ova"              "pot"             
 [5] "start_year"       "end_year"         "height_cm"        "weight_kg"       
 [9] "bov"              "value"            "wage"             "release_clause"  
[13] "attacking"        "crossing"         "finishing"        "heading_accuracy"
[17] "short_passing"    "volleys"          "skill"            "dribbling"       
[21] "curve"            "fk_accuracy"      "long_passing"     "ball_control"    
[25] "movement"         "acceleration"     "sprint_speed"     "agility"         
[29] "reactions"        "balance"          "power"            "shot_power"      
[33] "jumping"          "stamina"          "strength"         "long_shots"      
[37] "mentality"        "aggression"       "interceptions"    "positioning"     
[41] "vision"           "penalties"        "composure"        "defending"       
[45] "marking"          "standing_tackle"  "sliding_tackle"   "goalkeeping"     
[49] "gk_diving"        "gk_handling"      "gk_kicking"       "gk_positioning"  
[53] "gk_reflexes"      "total_stats"      "base_stats"       "weak_foot"       
[57] "skill_move"       "injury_rating"    "pac"              "sho"             
[61] "pas"              "dri"              "def"              "phy"             
[65] "hits"            
edit_raw |> select(where(is.character)) |> names()
 [1] "name"           "long_name"      "photo_url"      "player_url"    
 [5] "nationality"    "club"           "contract"       "positions"     
 [9] "preferred_foot" "best_position"  "a_w"            "d_w"           
[13] "contract_type" 
edit_raw |> select(where(is.Date)) |> names()
[1] "joined_national_team" "loan_end_date"       

Arranging the columns

edit_raw <- edit_raw |>
  select(
    id, long_name, name, club, contract, contract_type, start_year, 
    end_year, loan_end_date, joined_national_team, value:release_clause, 
    nationality, age, weight_kg, height_cm, preferred_foot:best_position, ova, 
    pot, attacking:hits, positions, player_url, photo_url
  )

edit_raw |> head(5) |> gt()
id long_name name club contract contract_type start_year end_year loan_end_date joined_national_team value wage release_clause nationality age weight_kg height_cm preferred_foot bov best_position ova pot attacking crossing finishing heading_accuracy short_passing volleys skill dribbling curve fk_accuracy long_passing ball_control movement acceleration sprint_speed agility reactions balance power shot_power jumping stamina strength long_shots mentality aggression interceptions positioning vision penalties composure defending marking standing_tackle sliding_tackle goalkeeping gk_diving gk_handling gk_kicking gk_positioning gk_reflexes total_stats base_stats weak_foot skill_move a_w d_w injury_rating pac sho pas dri def phy hits positions player_url photo_url
158023 Lionel Messi L. Messi FC Barcelona 2004 ~ 2021 Permanent 2004 2021 NA 2004-07-01 103500000 560000 138400000 Argentina 33 72 170 Left 93 RW 93 93 429 85 95 70 91 88 470 96 93 94 91 96 451 91 80 91 94 95 389 86 68 72 69 94 347 44 40 93 95 75 96 91 32 35 24 54 6 11 15 14 8 2231 466 4 4 Medium Low 5 85 92 91 95 38 65 771 RW, ST, CF http://sofifa.com/player/158023/lionel-messi/210006/ https://cdn.sofifa.com/players/158/023/21_60.png
20801 C. Ronaldo dos Santos Aveiro Cristiano Ronaldo Juventus 2018 ~ 2022 Permanent 2018 2022 NA 2018-07-10 63000000 220000 75900000 Portugal 35 83 187 Right 92 ST 92 92 437 84 95 90 82 86 414 88 81 76 77 92 431 87 91 87 95 71 444 94 95 84 78 93 353 63 29 95 82 84 95 84 28 32 24 58 7 11 15 14 11 2221 464 4 5 High Low 5 89 93 81 89 35 77 562 ST, LW http://sofifa.com/player/20801/c-ronaldo-dos-santos-aveiro/210006/ https://cdn.sofifa.com/players/020/801/21_60.png
200389 Jan Oblak J. Oblak Atletico Madrid 2014 ~ 2023 Permanent 2014 2023 NA 2014-07-16 120000000 125000 159400000 Slovenia 27 87 188 Right 91 GK 91 93 95 13 11 15 43 13 109 12 13 14 40 30 307 43 60 67 88 49 268 59 78 41 78 12 140 34 19 11 65 11 68 57 27 12 18 437 87 92 78 90 90 1413 489 3 1 Medium Medium 3 87 92 78 90 52 90 150 GK http://sofifa.com/player/200389/jan-oblak/210006/ https://cdn.sofifa.com/players/200/389/21_60.png
192985 Kevin De Bruyne K. De Bruyne Manchester City 2015 ~ 2023 Permanent 2015 2023 NA 2015-08-30 129000000 370000 161000000 Belgium 29 70 181 Right 91 CAM 91 91 407 94 82 55 94 82 441 88 85 83 93 92 398 77 76 78 91 76 408 91 63 89 74 91 408 76 66 88 94 84 91 186 68 65 53 56 15 13 5 10 13 2304 485 5 4 High High 4 76 86 93 88 64 78 207 CAM, CM http://sofifa.com/player/192985/kevin-de-bruyne/210006/ https://cdn.sofifa.com/players/192/985/21_60.png
190871 Neymar da Silva Santos Jr. Neymar Jr Paris Saint-Germain 2017 ~ 2022 Permanent 2017 2022 NA 2017-08-03 132000000 270000 166500000 Brazil 28 68 175 Right 91 LW 91 91 408 85 87 62 87 87 448 95 88 89 81 95 453 94 89 96 91 83 357 80 62 81 50 84 356 51 36 87 90 92 93 94 35 30 29 59 9 9 15 15 11 2175 451 5 5 High Medium 5 91 85 86 94 36 59 595 LW, CAM http://sofifa.com/player/190871/neymar-da-silva-santos-jr/210006/ https://cdn.sofifa.com/players/190/871/21_60.png

Saving data

We can now go ahead and save the clean data in the data directory we created at the start of the tutorial.

write_csv(fifa_21, "data/fifa21_clean_data.csv")
write_csv(player_position, "data/player_position.csv")

You can access the full code by clicking on this link

Summary

Now that we have our cleaned data, Here is a brief highlight of the steps taken to clean the data.
* Loading the data: We imported the data set using the read_csv() function.
* Column names: All column name were converted to lower case with an underscore.
* Variable cleaning: Separated the contract variable, change the loan end date and start date of each player’s national career to a date data type, remove additional characters from the height and weight column, converted players name and country to all english character, also removed symbols and escape characters from the ratings and hits column.
* Position table: Created a new table with each player id and field positions.
* Data validation: Validated all other variables for completeness and consistency.
* Data quality check: Verified all data types for each column.
* Saving the data: Saved both the cleaned data and the new position table in a .csv file.

Conclusion

The primary goal of data cleaning is to improve the quality and accuracy of the data so that it can be useful for analysis and decision-making. Data cleaning is also a critical step in the data analysis/science process and cannot be ignored or overlooked when working with data. With the data cleaned, we can proceed into data exploration, visualization, preprocessing, etc.