Data Analysis Challenge - FIFA 2021 Data Cleaning In R

One important step in data analysis that can not be overlooked is data cleaning, this ensure that the data is error free, complete, accurate and consistent by identifying and handling missing data, correcting or removing invalid and inconsistent records across the dataset. In doing this you will not only ensure that any insights or decisions made from the data are accurate and reliable, but also improve data storage by reducing the amount of unnecessary or redundant data, which can save storage space and improve processing times.

In this tutorial, we will be working with the FIFA-2021 data of different players around the world. This dataset contains information that can help us gain several insights about the demographics and attributes of professional football players However, like many other real-world datasets, it is messy and contains errors, inconsistencies, and missing data. Today we will clean and prepare the data for future analysis using several data cleaning techniques.

Libraries

We will be using the some of the core tidyverse packages such as readr, tidyr, dplyr and stringr, as well as some additional packages that will help us with specific data cleaning tasks. install the following package if you don’t already have them.

install.packages("tidyverse")
install.packages("janitor")
install.packages("gt")

library(tidyverse)
library(stringi)
library(janitor)
library(lubridate)

library(gt)

Data Loading

Next we will load in the FIFA-21 dataset. you can download the data following this link and store it in a folder called data in your current working directory.

raw_data <- read_csv("data/fifa21 raw data v2.csv")

Rows: 18979 Columns: 77
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (23): Name, LongName, photoUrl, playerUrl, Nationality, Club, Contract, ...
dbl (54): ID, Age, ↓OVA, POT, BOV, Attacking, Crossing, Finishing, Heading A...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(raw_data)

# A tibble: 6 × 77
      ID Name   LongN…¹ photo…² playe…³ Natio…⁴   Age `↓OVA`   POT Club  Contr…⁵
   <dbl> <chr>  <chr>   <chr>   <chr>   <chr>   <dbl>  <dbl> <dbl> <chr> <chr>  
1 158023 L. Me… Lionel… https:… http:/… Argent…    33     93    93 "\n\… 2004 ~…
2  20801 Crist… C. Ron… https:… http:/… Portug…    35     92    92 "\n\… 2018 ~…
3 200389 J. Ob… Jan Ob… https:… http:/… Sloven…    27     91    93 "\n\… 2014 ~…
4 192985 K. De… Kevin … https:… http:/… Belgium    29     91    91 "\n\… 2015 ~…
5 190871 Neyma… Neymar… https:… http:/… Brazil     28     91    91 "\n\… 2017 ~…
6 188545 R. Le… Robert… https:… http:/… Poland     31     91    91 "\n\… 2014 ~…
# … with 66 more variables: Positions <chr>, Height <chr>, Weight <chr>,
#   `Preferred Foot` <chr>, BOV <dbl>, `Best Position` <chr>, Joined <chr>,
#   `Loan Date End` <chr>, Value <chr>, Wage <chr>, `Release Clause` <chr>,
#   Attacking <dbl>, Crossing <dbl>, Finishing <dbl>, `Heading Accuracy` <dbl>,
#   `Short Passing` <dbl>, Volleys <dbl>, Skill <dbl>, Dribbling <dbl>,
#   Curve <dbl>, `FK Accuracy` <dbl>, `Long Passing` <dbl>,
#   `Ball Control` <dbl>, Movement <dbl>, Acceleration <dbl>, …

Data Cleaning

Name

Before we proceed further, some of the variable name contain spaces and different letter case which can be challenging when handling them through out the cleaning process, so i will transform the names by setting them to lower case and replace any space with an underscore ’_’ using the clean_names() function from the janitor package.

edit_raw <- clean_names(raw_data)

colnames(edit_raw)

 [1] "id"               "name"             "long_name"        "photo_url"       
 [5] "player_url"       "nationality"      "age"              "ova"             
 [9] "pot"              "club"             "contract"         "positions"       
[13] "height"           "weight"           "preferred_foot"   "bov"             
[17] "best_position"    "joined"           "loan_date_end"    "value"           
[21] "wage"             "release_clause"   "attacking"        "crossing"        
[25] "finishing"        "heading_accuracy" "short_passing"    "volleys"         
[29] "skill"            "dribbling"        "curve"            "fk_accuracy"     
[33] "long_passing"     "ball_control"     "movement"         "acceleration"    
[37] "sprint_speed"     "agility"          "reactions"        "balance"         
[41] "power"            "shot_power"       "jumping"          "stamina"         
[45] "strength"         "long_shots"       "mentality"        "aggression"      
[49] "interceptions"    "positioning"      "vision"           "penalties"       
[53] "composure"        "defending"        "marking"          "standing_tackle" 
[57] "sliding_tackle"   "goalkeeping"      "gk_diving"        "gk_handling"     
[61] "gk_kicking"       "gk_positioning"   "gk_reflexes"      "total_stats"     
[65] "base_stats"       "w_f"              "sm"               "a_w"             
[69] "d_w"              "ir"               "pac"              "sho"             
[73] "pas"              "dri"              "def"              "phy"             
[77] "hits"

Club

The club variable represent the current club in which the player have a contract with as of the year 2021.

select(edit_raw, club)

# A tibble: 18,979 × 1
   club                         
   <chr>                        
 1 "\n\n\n\nFC Barcelona"       
 2 "\n\n\n\nJuventus"           
 3 "\n\n\n\nAtlético Madrid"    
 4 "\n\n\n\nManchester City"    
 5 "\n\n\n\nParis Saint-Germain"
 6 "\n\n\n\nFC Bayern München"  
 7 "\n\n\n\nLiverpool"          
 8 "\n\n\n\nLiverpool"          
 9 "\n\n\n\nParis Saint-Germain"
10 "\n\n\n\nFC Barcelona"       
# … with 18,969 more rows

If you notice the column have some “\n” special character which is also called an escape sequence, it represents a new line character and is used to insert a line break in a string, we don’t need this here so we will remove them. some other things you might have missed are:

There are some players without a current contract.

Some records contain non-English characters such as é, ü, ş.

First, We will remove the “\n” character then convert non-English characters or special characters into their closest English equivalents using the stri_trans_general() function from the stringi package, this process is called transliteration.

edit_raw <- edit_raw |>
  mutate(club = str_remove_all(club, "\\n") |> stri_trans_general("Latin-ASCII")) 

select(edit_raw, club) |> head(5) |> gt()

club
FC Barcelona
Juventus
Atletico Madrid
Manchester City
Paris Saint-Germain

Check

It is a good practice to check for any misrepresentation of the data after making some changes in them. So we will verify if there are NA values.

edit_raw |> filter(is.na(club)) |> select(club)

# A tibble: 0 × 1
# … with 1 variable: club <chr>

Great! no missing values, so all club names with special character have been transformed successfully.

Contract

This column contains the information about the year in which the players contract with their current club started and the year the contract will or is supposed to end.

select(edit_raw, contract)

# A tibble: 18,979 × 1
   contract   
   <chr>      
 1 2004 ~ 2021
 2 2018 ~ 2022
 3 2014 ~ 2023
 4 2015 ~ 2023
 5 2017 ~ 2022
 6 2014 ~ 2023
 7 2017 ~ 2023
 8 2018 ~ 2024
 9 2018 ~ 2022
10 2014 ~ 2022
# … with 18,969 more rows

It is important that the two different information above should be in their respective column. This provide a clear and consistent record, there by making it easy to perform both simple and complex analysis using them.
To restructure the contract variable, we need to separate each of the identified variable into a start_year and end_year column. Before we do that, here are some observations we should keep in mind:

Players with no current club are considered free agents and are with no contract start year and end year.

edit_raw |> filter(club == "No Club") |> select(club, contract)

# A tibble: 237 × 2
   club    contract
   <chr>   <chr>   
 1 No Club Free    
 2 No Club Free    
 3 No Club Free    
 4 No Club Free    
 5 No Club Free    
 6 No Club Free    
 7 No Club Free    
 8 No Club Free    
 9 No Club Free    
10 No Club Free    
# … with 227 more rows

Note that when a player is on loan, no contract start year was specified.

edit_raw |> 
  filter(str_detect(contract, "Loan")) |> 
  select(club, contract, loan_date_end)

# A tibble: 1,013 × 3
   club                contract             loan_date_end
   <chr>               <chr>                <chr>        
 1 Tottenham Hotspur   Jun 30, 2021 On Loan Jun 30, 2021 
 2 Fulham              Jun 30, 2021 On Loan Jun 30, 2021 
 3 Paris Saint-Germain Jun 30, 2021 On Loan Jun 30, 2021 
 4 Napoli              Jun 30, 2021 On Loan Jun 30, 2021 
 5 Atletico Madrid     Jun 30, 2021 On Loan Jun 30, 2021 
 6 Bayer 04 Leverkusen Jun 30, 2021 On Loan Jun 30, 2021 
 7 FC Porto            Jun 30, 2021 On Loan Jun 30, 2021 
 8 Arsenal             Jun 30, 2021 On Loan Jun 30, 2021 
 9 RB Leipzig          Jun 30, 2021 On Loan Jun 30, 2021 
10 Sporting CP         Jun 30, 2021 On Loan Jun 30, 2021 
# … with 1,003 more rows

Also, all contract records do not have the same contract start year and end year in number, some are in both words and number dates, mostly for players on loan.

The contract start year and end year will be separated using a the tilde ~ character and the separate() function from tidyr package.

edit_raw <- edit_raw |>
  separate(col = contract, 
           into = c("start_year", "end_year"), 
           sep = " ~ ",
           extra = "merge",
           remove = FALSE)

Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1250 rows [206,
249, 255, 290, 293, 303, 307, 333, 337, 370, 375, 376, 377, 378, 379, 380, 381,
398, 406, 417, ...].

select(edit_raw, contract, start_year, end_year) |> head(5) |> gt()

contract	start_year	end_year
2004 ~ 2021	2004	2021
2018 ~ 2022	2018	2022
2014 ~ 2023	2014	2023
2015 ~ 2023	2015	2023
2017 ~ 2022	2017	2022

There is a warning message which state that rows in which the expression “~” given to the sep argument can not be used to separate the values. let us view some of these rows.

edit_raw |>
  select(club, contract, start_year, end_year) |>
  slice(206, 249, 255, 290, 293, 303, 307, 333, 337, 370, 375, 376)

# A tibble: 12 × 4
   club                contract             start_year           end_year
   <chr>               <chr>                <chr>                <chr>   
 1 Tottenham Hotspur   Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>    
 2 Fulham              Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>    
 3 Paris Saint-Germain Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>    
 4 No Club             Free                 Free                 <NA>    
 5 No Club             Free                 Free                 <NA>    
 6 Napoli              Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>    
 7 Atletico Madrid     Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>    
 8 Bayer 04 Leverkusen Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>    
 9 FC Porto            Jun 30, 2021 On Loan Jun 30, 2021 On Loan <NA>    
10 No Club             Free                 Free                 <NA>    
11 No Club             Free                 Free                 <NA>    
12 No Club             Free                 Free                 <NA>

If you remember earlier when some observation of the contract column were put in place, some of the records have their date stated in words such as Free (majorly players with no club), dates in words and number of when the contract began (majorly for players on loan) instead of the start and end year, this is the reason why there are some missing values in the end year because the separate() function doesn’t know how to separate such record. In order to solve this problem, another variable called contract_type will be created which will help in distinguishing between each type of contract allowing the start and end year the freedom of having just the numeric year data type.

edit_raw <- edit_raw |>
  mutate(
    # create a categorical variable for the type of contract.
    contract_type = case_when(str_detect(start_year, "Loan") ~ "Loan",
                              start_year == "Free" ~ "Free",
                              TRUE ~ "Permanent"),
    
    # send all Loan end year from the contract start year to the end year.     
    end_year = case_when(str_detect(start_year, "Loan") ~ str_extract(start_year, "\\d{4}"),
                         TRUE ~ end_year),
    
    # convert all start_year records with `Free` and `Loan` to NA
    start_year = case_when(contract == "Free" ~ NA_character_, 
                           str_detect(start_year, "Loan") ~ NA_character_,
                           TRUE ~ start_year)
  ) |>
  # convert start and end year to a numeric data type.
  mutate(across(c(start_year, end_year), as.integer))

select(edit_raw, club, contract_type, start_year, end_year) |> head(5) |> gt()

club	contract_type	start_year	end_year
FC Barcelona	Permanent	2004	2021
Juventus	Permanent	2018	2022
Atletico Madrid	Permanent	2014	2023
Manchester City	Permanent	2015	2023
Paris Saint-Germain	Permanent	2017	2022

There you have it, the type of contract, the year the contract started and ended all on their own column which create a lot of flexibility when using the variables one at a time or all at the same time.

check

Again we will perform some checks on the new columns created. first let us verify that all players with a permanent contract do not have missing records for the contract start and end year.

edit_raw |> 
  filter(contract_type == "Permanent" & is.na(start_year) & is.na(end_year)) |> 
  select(contract_type, start_year, end_year)

# A tibble: 0 × 3
# … with 3 variables: contract_type <chr>, start_year <int>, end_year <int>

Zero row returned great!

Now let us do some checks on the start and end date, ideally the start date should be earlier that the end date.

edit_raw |>
  mutate(check = start_year <= end_year) |>
  # condition to return poor records with earlier end year than start year
  filter(check == FALSE) |>
  select(start_year, end_year, check)

# A tibble: 0 × 3
# … with 3 variables: start_year <int>, end_year <int>, check <lgl>

Great! zero row, what this means is that all contract start year comes before the end year.

Joined national team date and loan contract expiration date.

Next, some closely related variables to the one above, which includes the date each player joined their various national team and also the end date of a player loan contract, if the player was on a loan contract.

select(edit_raw, joined, loan_date_end) |> head(5)

# A tibble: 5 × 2
  joined       loan_date_end
  <chr>        <chr>        
1 Jul 1, 2004  <NA>         
2 Jul 10, 2018 <NA>         
3 Jul 16, 2014 <NA>         
4 Aug 30, 2015 <NA>         
5 Aug 3, 2017  <NA>

Observation:

The data type of the joined variable is character, given that it is supposed to represented as date variable, it will be converted from a character to a date data type.

edit_raw <- edit_raw |>
  mutate(joined = mdy(joined),
         loan_date_end = mdy(loan_date_end)) |>
  rename(loan_end_date = loan_date_end, joined_national_team = joined)

edit_raw |>
  filter(contract_type == "Loan") |>
  select(joined_national_team, loan_end_date) |> 
  head(5) |>
  gt()

joined_national_team	loan_end_date
2013-09-01	2021-06-30
2010-07-01	2021-06-30
2015-07-02	2021-06-30
2019-07-01	2021-06-30
2018-07-12	2021-06-30

Check

National team date column

edit_raw |> filter(is.na(joined_national_team)) |> select(joined_national_team)

# A tibble: 0 × 1
# … with 1 variable: joined_national_team <date>

Verify if loan end date have missing value when a player is on loan.

edit_raw |> 
  select(contract_type, loan_end_date, end_year) |>
  filter(contract_type == "Loan" & is.na(loan_end_date))

# A tibble: 0 × 3
# … with 3 variables: contract_type <chr>, loan_end_date <date>, end_year <int>

Just in case you didn’t notice, the loan contract end_year is available in the loan_end_date column it will be helpful to compare both value for each player to check if there are inconsistencies.

edit_raw |> 
  
  # get the year
  mutate(loan_end_year = year(loan_end_date)) |>
  
  # return columns where the player is on loan and the contract end_year is not  
  # the same as the year in the loan end date column
  filter(contract_type == "Loan" & end_year != loan_end_year) |>
  
  select(contract_type, end_year, loan_end_year)

# A tibble: 0 × 3
# … with 3 variables: contract_type <chr>, end_year <int>, loan_end_year <dbl>

No missing value or misrepresented value detected.

Value, wage and release clause.

The value variable represent the worth of each player as of the year 2021, the wages represent the amount players earn from their respective clubs, while the release clause which is also known as a buyout clause or a buyout option.

select(edit_raw, value, wage, release_clause) |> head()

# A tibble: 6 × 3
  value   wage  release_clause
  <chr>   <chr> <chr>         
1 €103.5M €560K €138.4M       
2 €63M    €220K €75.9M        
3 €120M   €125K €159.4M       
4 €129M   €370K €161M         
5 €132M   €270K €166.5M       
6 €111M   €240K €132M

If we take a closer look at them, these variables are supposed to have numeric data type, they also seem to have similar problem such as:

A euro sign at the beginning of the value.

Either a (K) representing value in thousands and (M) in millions.

The values are collated in a short format i.e 12K instead of 12000.

Character data type.

To change these variables to the right value and data type the above stated observations will need to be rectified accordingly. All character signs and letter will be dropped and values with K will be converted to thousands in numeric value and M to millions.

edit_raw <- edit_raw |>
   mutate(
     # selecting the columns (value, wage, release_clause) to iterate over.
     across(c(value, wage, release_clause), function(col) {
      # convert each variable value to their respective numerical value.
      case_when(
        str_detect(col, "M") ~ as.double(str_replace_all(col, "(€|M)", "")) * 1000000,
        str_detect(col, "K") ~ as.double(str_replace_all(col, "(€|K)", "")) * 1000,
        TRUE ~ as.double(str_replace(col, "€", ""))
      )
    })
   )

select(edit_raw, value, wage, release_clause) |> head(5) |> gt()

value	wage	release_clause
103500000	560000	138400000
63000000	220000	75900000
120000000	125000	159400000
129000000	370000	161000000
132000000	270000	166500000

Check

checking for NA values.

filter(edit_raw, is.na(value)) |> select(value)

# A tibble: 0 × 1
# … with 1 variable: value <dbl>

filter(edit_raw, is.na(wage)) |> select(wage)

# A tibble: 0 × 1
# … with 1 variable: wage <dbl>

filter(edit_raw, is.na(release_clause)) |> select(release_clause)

# A tibble: 0 × 1
# … with 1 variable: release_clause <dbl>

We have no NA values in all the columns.

Height & weight

The height and weight gives the exact physical description of the players, they also have a few problem of their own so let us check them.

select(edit_raw, height, weight)

# A tibble: 18,979 × 2
   height weight
   <chr>  <chr> 
 1 170cm  72kg  
 2 187cm  83kg  
 3 188cm  87kg  
 4 181cm  70kg  
 5 175cm  68kg  
 6 184cm  80kg  
 7 175cm  71kg  
 8 191cm  91kg  
 9 178cm  73kg  
10 187cm  85kg  
# … with 18,969 more rows

edit_raw |> filter(str_detect(height, "'")) |> select(height)

# A tibble: 40 × 1
   height  
   <chr>   
 1 "6'2\"" 
 2 "6'3\"" 
 3 "6'5\"" 
 4 "5'11\""
 5 "6'4\"" 
 6 "6'1\"" 
 7 "6'0\"" 
 8 "6'1\"" 
 9 "5'11\""
10 "6'2\"" 
# … with 30 more rows

Starting with the height variable, this was recorded in both centimeter (170cm) and in feet and inches which means that the value “6’2” represent 6 feet and 2 inches. Of course we cant leave it this way, so we will have to convert the height to an appropriate numeric measure. I will be using the centimeter as the unit of measurement for players height as it is the most commonly used metric system, it also have other added advantages over feet & inches such as ease of use, compatibility etc.

To clean up the height column the all records in feet will be converted to centimeters. To do that the rows in feet & inches will be converted to inches initially then converted to CM. If we want to convert 6’2 to cm then we can use the formula: (6 * 12 + 2) * 2.54
Where:

12 inches = 1 foot, this means that any length measured in feet can be converted to inches by multiplying the number of feet by 12.

2.54 cm = 1 inch, so we can convert the height in inches to CM by multiplying the number of inches by 2.54

First we will create a custom function that convert each record height to centimeter.

convert_to_cm <- function(str) {
  
  map(str, function(s) {
      if (stringr::str_detect(s, "'")) {
        feet_inches <- stringr::str_split(s, "'")[[1]]
        
        feet_inches[2] <- stringr::str_remove(feet_inches[2], "\"") 

        (as.numeric(feet_inches[1]) * 12 + as.numeric(feet_inches[2])) * 2.54
        
      } else {
        s
      }
    }) |>
     unlist()
  
}

edit_raw <- edit_raw |>
  mutate(height = convert_to_cm(height) |> str_remove("cm") |> as.double()) |>
  rename(height_cm = height)

select(edit_raw, height_cm) |> head(5) |> gt()

height_cm
170
187
188
181
175

Check

filter(edit_raw, is.na(height_cm)) |> select(height_cm)

# A tibble: 0 × 1
# … with 1 variable: height_cm <dbl>

Moving to the weight variable, the column holds record of weight in both pounds(lbs) and kilograms(kg) to unify the unit of measurement, the weight will be converted to kg for the reason that kilograms is a more common unit of measurement for weight and provides a more precise and standardized measurement compared to lbs.
To fix the weight variable, all records in lbs will be converted to kg using the formula:

1 lb = 0.45359237 kg. This means that to convert lbs to kg we will need to multiply the number of pounds by 0.45359237

Again we will create a custom function that help us convert the weights to kilograms.

convert_to_kg <- function(str) {
  
  map(str, function(s) {
    if (stringr::str_detect(s, "lbs")) {
      
      lbs <- stringr::str_remove(s, "lbs") |> as.numeric()
      
      round(lbs * 0.45359237, 2)
      
    } else {
      s
    }
  }) |>
    unlist()
}

edit_raw <- edit_raw |>
  mutate(weight = convert_to_kg(weight) |> str_remove("kg") |> as.double()) |>
  rename(weight_kg = weight)

select(edit_raw, weight_kg) |> head(5) |> gt()

weight_kg
72
83
87
70
68

checks

filter(edit_raw, is.na(weight_kg)) |> select(weight_kg)

# A tibble: 0 × 1
# … with 1 variable: weight_kg <dbl>

Great! the height values were correctly converted to centimeters without any error and the weight to numeric representation of the kilogram metric.

Player name

select(edit_raw, long_name, name)

# A tibble: 18,979 × 2
   long_name                    name             
   <chr>                        <chr>            
 1 Lionel Messi                 L. Messi         
 2 C. Ronaldo dos Santos Aveiro Cristiano Ronaldo
 3 Jan Oblak                    J. Oblak         
 4 Kevin De Bruyne              K. De Bruyne     
 5 Neymar da Silva Santos Jr.   Neymar Jr        
 6 Robert Lewandowski           R. Lewandowski   
 7 Mohamed Salah                M. Salah         
 8 Alisson Ramses Becker        Alisson          
 9 Kylian Mbappé                K. Mbappé        
10 Marc-André ter Stegen        M. ter Stegen    
# … with 18,969 more rows

The long_name and name variable contains non-English characters or special characters so we will use the stri_trans_general() function to convert special characters, this is to ensure that the data is in a consistent format that can be easily processed and analyzed.

edit_raw <- edit_raw |> 
  mutate(across(c(long_name, name), \(n) stri_trans_general(n, "Latin-ASCII")))

select(edit_raw, long_name, name) |> head(5) |> gt()

long_name	name
Lionel Messi	L. Messi
C. Ronaldo dos Santos Aveiro	Cristiano Ronaldo
Jan Oblak	J. Oblak
Kevin De Bruyne	K. De Bruyne
Neymar da Silva Santos Jr.	Neymar Jr

Some drawbacks that should be noted about transliteration is that, it can lead to loss of important information such as accent marks or diacritical symbols that can change the meaning or pronunciation of a name, it can also be cultural insensitivity. Overall, it is important to carefully consider the potential risks and benefits of removing special characters from people names when cleaning data.

Checks

filter(edit_raw, is.na(long_name)) |> select(long_name)

# A tibble: 0 × 1
# … with 1 variable: long_name <chr>

filter(edit_raw, is.na(name)) |> select(name)

# A tibble: 0 × 1
# … with 1 variable: name <chr>

No NA which mean the conversion was successful for all records.

Nationality

select(edit_raw, nationality)

# A tibble: 18,979 × 1
   nationality
   <chr>      
 1 Argentina  
 2 Portugal   
 3 Slovenia   
 4 Belgium    
 5 Brazil     
 6 Poland     
 7 Egypt      
 8 Brazil     
 9 France     
10 Germany    
# … with 18,969 more rows

The country includes some non-english character which will be converted. also while aiming for consistency across all country names, the & ampersand symbol will be replaced with the letter ‘and’, also the Korea DPR will be updated to Korea DPRK where the ‘DPRK’ stands for ‘Democratic People’s Republic of Korea’.

edit_raw <- edit_raw |> 
  mutate(nationality = stri_trans_general(nationality, "Latin-ASCII"),
         nationality = str_replace(nationality, "&", "and"),
         nationality = if_else(nationality == "Korea DPR", "Korea DPRK", nationality))

select(edit_raw, nationality) |> head(5) |> gt()

nationality
Argentina
Portugal
Slovenia
Belgium
Brazil

Check

filter(edit_raw, is.na(nationality)) |> select(nationality)

# A tibble: 0 × 1
# … with 1 variable: nationality <chr>

Player functionality rating.

The w/f column representing player’s weak foot rating, SM which represent player’s skill moves rating and finally IR representing the injury rating. these variables are all on a scale of 1-5. Which is to say that they must be a numeric variable with an minimum value of 1 and a maximum value of 5.

select(edit_raw, w_f, sm, ir)

# A tibble: 18,979 × 3
   w_f   sm    ir   
   <chr> <chr> <chr>
 1 4 ★   4★    5 ★  
 2 4 ★   5★    5 ★  
 3 3 ★   1★    3 ★  
 4 5 ★   4★    4 ★  
 5 5 ★   5★    5 ★  
 6 4 ★   4★    4 ★  
 7 3 ★   4★    3 ★  
 8 3 ★   1★    3 ★  
 9 4 ★   5★    3 ★  
10 4 ★   1★    3 ★  
# … with 18,969 more rows

These variables will be converted to a numeric data type by simply removing the star symbol and changing the data type to integer.

edit_raw <- edit_raw |>
  # iterating over each column and extracting only the digit present in them.
  mutate(across(c(w_f, sm, ir), \(s) str_extract(s, "[:digit:]") |> as.integer())) |>
  rename(weak_foot = w_f, skill_move = sm, injury_rating = ir)

select(edit_raw, weak_foot, skill_move, injury_rating) |> head(5)  |> gt()

weak_foot	skill_move	injury_rating
4	4	5
4	5	5
3	1	3
5	4	4
5	5	5

Check

To make sure every value in the three variables above are within 1 to 5 and there are no NA value, we will perform a descriptive summary on each of them.

edit_raw |> select(weak_foot, skill_move, injury_rating) |> summary()

   weak_foot       skill_move    injury_rating  
 Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000  
 Median :3.000   Median :2.000   Median :1.000  
 Mean   :2.941   Mean   :2.365   Mean   :1.092  
 3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:1.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000

Looking at the minimum (Min.) and maximum (Max.) values of each variable, none of them have values below 1 and above 5.

filter(edit_raw, is.na(weak_foot)) |> select(weak_foot)

# A tibble: 0 × 1
# … with 1 variable: weak_foot <int>

filter(edit_raw, is.na(skill_move)) |> select(skill_move)

# A tibble: 0 × 1
# … with 1 variable: skill_move <int>

filter(edit_raw, is.na(injury_rating)) |> select(injury_rating)

# A tibble: 0 × 1
# … with 1 variable: injury_rating <int>

Hit

select(edit_raw, hits)

# A tibble: 18,979 × 1
   hits 
   <chr>
 1 771  
 2 562  
 3 150  
 4 207  
 5 595  
 6 248  
 7 246  
 8 120  
 9 1.6K 
10 130  
# … with 18,969 more rows

A few problem can be picked out from the hits variable.

Some of the hits are up to 1000 which were formatted into 1K so converting them back to their original numeric representation will be done by multiplying any record with a K present in them by 1000.

There are missing values.

edit_raw <- edit_raw |>
         # creating a numeric hits column
  mutate(hit_num = str_replace(hits, "K", "") |> as.double(),
         
         # change all 1.1K to 1100 based on the condition that there is a K 
         # string in the record.
         hits = case_when(str_detect(hits, "K") ~ as.character(hit_num * 1000),
                          TRUE ~ hits ) |> as.double()) |>
  # drop the hit_num column.
  select(-hit_num) 


select(edit_raw, hits) |> head(5) |> gt()

hits
771
562
150
207
595

Player positions

The position variable represent all the position a player have played.

select(edit_raw, positions)

# A tibble: 18,979 × 1
   positions 
   <chr>     
 1 RW, ST, CF
 2 ST, LW    
 3 GK        
 4 CAM, CM   
 5 LW, CAM   
 6 ST        
 7 RW        
 8 GK        
 9 ST, LW, RW
10 GK        
# … with 18,969 more rows

Looking at the top 10 rows there are several positions merged into a single variable, this violates the first normal form (1NF) in database normalization because each row in a column should contain a single value, not multiple values. To comply with 1NF, the multivalued positions variable will be moved into a separate table with a one-to-many relationship between the original FIFA-21 table and the new position table. Some of the major advantages of this technique are reduction of data redundancy, simplified data analysis process and enhanced data flexibility.

player_position <- edit_raw |> 
  select(id) |>
  # join the separated positions to with the player id
  bind_cols(
    edit_raw$positions |>
      str_split(" ", simplify = TRUE) |>
      as_tibble(.name_repair = "universal")
  ) |>
  # keep all position value in a single column. 
  pivot_longer(cols = `...2`:`...4`,    
               names_to = "num_position", 
               values_to = "position") |>
  # remove all empty rows
  filter(position != "") |>
  # remove all comma from the position string.
  mutate(position = str_remove(position, ",")) |>
  # drop the num_position variable.
  select(-num_position)

Check

Check for valid positions abbreviations

position_dict <- c(
  "GK" = "Goalkeeper",
  
  # Defenders ------------------------------>
  "CB" = "Central back",
  "S"  = "Sweeper", 
  "FB" = "Full back",
  "LB" = "Left back",
  "RB" = "Right back",
  "WB" = "Wing back",
  "RWB"= "Right wing back",
  "LWB"= "Left wing back",
  
  # Midfielders ---------------------------->
  "CM" = "Central midfielder",
  "DM" = "Defensive midfielder",
  "AM" = "Attacking midfielder",
  "CAM"= "Centeral attacking midfielder",
  "CDM"= "Centeral Defensive midfielder",
  "LM" = "Left midfielder",
  "RM" = "Right midfielder",
  "WM" = "Wide midfielder",
  
  # Forward -------------------------------->
  "ST" = "Striker",
  "SS" = "Second striker",
  "CF" = "Center forward",
  "W"  = "Winger",
  "LW" = "Left winger",
  "RW" = "Right winger"
)


pos <- player_position |> distinct(position) |> pull()

all(pos %in% names(position_dict))

[1] TRUE

Join tables

You can combine the new position table with the main table using the left_join() function from the dplyr package, all you have to pass to the function is the position table and the player id.

edit_raw |> 
  left_join(player_position, by = "id") |>
  select(id, long_name, position)

# A tibble: 30,493 × 3
       id long_name                    position
    <dbl> <chr>                        <chr>   
 1 158023 Lionel Messi                 RW      
 2 158023 Lionel Messi                 ST      
 3 158023 Lionel Messi                 CF      
 4  20801 C. Ronaldo dos Santos Aveiro ST      
 5  20801 C. Ronaldo dos Santos Aveiro LW      
 6 200389 Jan Oblak                    GK      
 7 192985 Kevin De Bruyne              CAM     
 8 192985 Kevin De Bruyne              CM      
 9 190871 Neymar da Silva Santos Jr.   LW      
10 190871 Neymar da Silva Santos Jr.   CAM     
# … with 30,483 more rows

Data validation

These columns seem to be in their perfect data type so just a little extra data validity check will be done on them.

Age

The question we need to ask here is whether the age distribution is a representative of the overall football age distribution.

summary(edit_raw$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  16.00   21.00   25.00   25.19   29.00   53.00

The average is 25 compared to the average age from 2021 edition of the EA Sports FIFA which was around 27-28 years old.

Variables measured in percentage.

OVA: Players overall analysis or rate.
POT: Players potential.
BOV: Best Overall in percentage.

edit_raw |>
  select(ova, pot, bov) |>
  summarise(across(everything(), 
                   list(min = min, max = max), 
                   .names = "{.col}_{.fn}"))

# A tibble: 1 × 6
  ova_min ova_max pot_min pot_max bov_min bov_max
    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1      47      93      47      95      48      93

All minimum value maximum values are within 0 to 100.

Player rating

These includes variable such as attacking, crossing, dribbling, etc, they are all supposed to be numeric.

edit_raw |>
  select(attacking:base_stats, pac:phy) |>
  map_lgl(is.double) |> all()

[1] TRUE

Checking for missing values in any of the selected variable.

edit_raw |>
  select(attacking:base_stats, pac:phy) |>
  transmute(across(everything(), is.na)) |>
  summarise(across(everything(), sum)) |>
  pivot_longer(everything(), names_to = "rating", values_to = "missing") |>
  filter(missing > 0)

# A tibble: 0 × 2
# … with 2 variables: rating <chr>, missing <int>

Attacking & defensive work rate.

The rating of players in this segment should only include any of high, medium or low. Any other rating will be considered an incorrect record.

edit_raw |>
  select(a_w, d_w) |> 
  names() |>
  map(\(i) edit_raw |> distinct(.data[[i]]))

[[1]]
# A tibble: 3 × 1
  a_w   
  <chr> 
1 Medium
2 High  
3 Low   

[[2]]
# A tibble: 3 × 1
  d_w   
  <chr> 
1 Low   
2 Medium
3 High

Data quality check

edit_raw |> select(where(is.numeric)) |> names()

 [1] "id"               "age"              "ova"              "pot"             
 [5] "start_year"       "end_year"         "height_cm"        "weight_kg"       
 [9] "bov"              "value"            "wage"             "release_clause"  
[13] "attacking"        "crossing"         "finishing"        "heading_accuracy"
[17] "short_passing"    "volleys"          "skill"            "dribbling"       
[21] "curve"            "fk_accuracy"      "long_passing"     "ball_control"    
[25] "movement"         "acceleration"     "sprint_speed"     "agility"         
[29] "reactions"        "balance"          "power"            "shot_power"      
[33] "jumping"          "stamina"          "strength"         "long_shots"      
[37] "mentality"        "aggression"       "interceptions"    "positioning"     
[41] "vision"           "penalties"        "composure"        "defending"       
[45] "marking"          "standing_tackle"  "sliding_tackle"   "goalkeeping"     
[49] "gk_diving"        "gk_handling"      "gk_kicking"       "gk_positioning"  
[53] "gk_reflexes"      "total_stats"      "base_stats"       "weak_foot"       
[57] "skill_move"       "injury_rating"    "pac"              "sho"             
[61] "pas"              "dri"              "def"              "phy"             
[65] "hits"

edit_raw |> select(where(is.character)) |> names()

 [1] "name"           "long_name"      "photo_url"      "player_url"    
 [5] "nationality"    "club"           "contract"       "positions"     
 [9] "preferred_foot" "best_position"  "a_w"            "d_w"           
[13] "contract_type"

edit_raw |> select(where(is.Date)) |> names()

[1] "joined_national_team" "loan_end_date"

Arranging the columns

edit_raw <- edit_raw |>
  select(
    id, long_name, name, club, contract, contract_type, start_year, 
    end_year, loan_end_date, joined_national_team, value:release_clause, 
    nationality, age, weight_kg, height_cm, preferred_foot:best_position, ova, 
    pot, attacking:hits, positions, player_url, photo_url
  )

edit_raw |> head(5) |> gt()

id	long_name	name	club	contract	contract_type	start_year	end_year	loan_end_date	joined_national_team	value	wage	release_clause	nationality	age	weight_kg	height_cm	preferred_foot	bov	best_position	ova	pot	attacking	crossing	finishing	heading_accuracy	short_passing	volleys	skill	dribbling	curve	fk_accuracy	long_passing	ball_control	movement	acceleration	sprint_speed	agility	reactions	balance	power	shot_power	jumping	stamina	strength	long_shots	mentality	aggression	interceptions	positioning	vision	penalties	composure	defending	marking	standing_tackle	sliding_tackle	goalkeeping	gk_diving	gk_handling	gk_kicking	gk_positioning	gk_reflexes	total_stats	base_stats	weak_foot	skill_move	a_w	d_w	injury_rating	pac	sho	pas	dri	def	phy	hits	positions	player_url	photo_url
158023	Lionel Messi	L. Messi	FC Barcelona	2004 ~ 2021	Permanent	2004	2021	NA	2004-07-01	103500000	560000	138400000	Argentina	33	72	170	Left	93	RW	93	93	429	85	95	70	91	88	470	96	93	94	91	96	451	91	80	91	94	95	389	86	68	72	69	94	347	44	40	93	95	75	96	91	32	35	24	54	6	11	15	14	8	2231	466	4	4	Medium	Low	5	85	92	91	95	38	65	771	RW, ST, CF	http://sofifa.com/player/158023/lionel-messi/210006/	https://cdn.sofifa.com/players/158/023/21_60.png
20801	C. Ronaldo dos Santos Aveiro	Cristiano Ronaldo	Juventus	2018 ~ 2022	Permanent	2018	2022	NA	2018-07-10	63000000	220000	75900000	Portugal	35	83	187	Right	92	ST	92	92	437	84	95	90	82	86	414	88	81	76	77	92	431	87	91	87	95	71	444	94	95	84	78	93	353	63	29	95	82	84	95	84	28	32	24	58	7	11	15	14	11	2221	464	4	5	High	Low	5	89	93	81	89	35	77	562	ST, LW	http://sofifa.com/player/20801/c-ronaldo-dos-santos-aveiro/210006/	https://cdn.sofifa.com/players/020/801/21_60.png
200389	Jan Oblak	J. Oblak	Atletico Madrid	2014 ~ 2023	Permanent	2014	2023	NA	2014-07-16	120000000	125000	159400000	Slovenia	27	87	188	Right	91	GK	91	93	95	13	11	15	43	13	109	12	13	14	40	30	307	43	60	67	88	49	268	59	78	41	78	12	140	34	19	11	65	11	68	57	27	12	18	437	87	92	78	90	90	1413	489	3	1	Medium	Medium	3	87	92	78	90	52	90	150	GK	http://sofifa.com/player/200389/jan-oblak/210006/	https://cdn.sofifa.com/players/200/389/21_60.png
192985	Kevin De Bruyne	K. De Bruyne	Manchester City	2015 ~ 2023	Permanent	2015	2023	NA	2015-08-30	129000000	370000	161000000	Belgium	29	70	181	Right	91	CAM	91	91	407	94	82	55	94	82	441	88	85	83	93	92	398	77	76	78	91	76	408	91	63	89	74	91	408	76	66	88	94	84	91	186	68	65	53	56	15	13	5	10	13	2304	485	5	4	High	High	4	76	86	93	88	64	78	207	CAM, CM	http://sofifa.com/player/192985/kevin-de-bruyne/210006/	https://cdn.sofifa.com/players/192/985/21_60.png
190871	Neymar da Silva Santos Jr.	Neymar Jr	Paris Saint-Germain	2017 ~ 2022	Permanent	2017	2022	NA	2017-08-03	132000000	270000	166500000	Brazil	28	68	175	Right	91	LW	91	91	408	85	87	62	87	87	448	95	88	89	81	95	453	94	89	96	91	83	357	80	62	81	50	84	356	51	36	87	90	92	93	94	35	30	29	59	9	9	15	15	11	2175	451	5	5	High	Medium	5	91	85	86	94	36	59	595	LW, CAM	http://sofifa.com/player/190871/neymar-da-silva-santos-jr/210006/	https://cdn.sofifa.com/players/190/871/21_60.png

Saving data

We can now go ahead and save the clean data in the data directory we created at the start of the tutorial.

write_csv(fifa_21, "data/fifa21_clean_data.csv")
write_csv(player_position, "data/player_position.csv")

You can access the full code by clicking on this link

Summary

Now that we have our cleaned data, Here is a brief highlight of the steps taken to clean the data.
* Loading the data: We imported the data set using the read_csv() function.
* Column names: All column name were converted to lower case with an underscore.
* Variable cleaning: Separated the contract variable, change the loan end date and start date of each player’s national career to a date data type, remove additional characters from the height and weight column, converted players name and country to all english character, also removed symbols and escape characters from the ratings and hits column.
* Position table: Created a new table with each player id and field positions.
* Data validation: Validated all other variables for completeness and consistency.
* Data quality check: Verified all data types for each column.
* Saving the data: Saved both the cleaned data and the new position table in a .csv file.

Conclusion

The primary goal of data cleaning is to improve the quality and accuracy of the data so that it can be useful for analysis and decision-making. Data cleaning is also a critical step in the data analysis/science process and cannot be ignored or overlooked when working with data. With the data cleaned, we can proceed into data exploration, visualization, preprocessing, etc.