R Cran Read Csv Strings as Factor

Reading and Writing CSV Files

Overview

Teaching: 30 min
Exercises: 0 min

Questions

  • How do I read data from a CSV file into R?

  • How do I write data to a CSV file?

Objectives

  • Read in a .csv, and explore the arguments of the csv reader.

  • Write the altered data set to a new .csv, and explore the arguments.

The most common way that scientists store data is in Excel spreadsheets. While there are R packages designed to access data from Excel spreadsheets (e.g., gdata, RODBC, XLConnect, xlsx, RExcel), users often find it easier to save their spreadsheets in comma-separated values files (CSV) and then use R's built-in functionality to read and manipulate the data. In this short lesson, we'll learn how to read data from a .csv and write to a new .csv, and explore the arguments that allow you to read and write the data correctly for your needs.

Read a .csv and Explore the Arguments

Let's start by opening a .csv file containing information on the speeds at which cars of different colors were clocked in 45 mph zones in the four-corners states (car-speeds.csv). We will use the built-in read.csv(...) function call, which reads the data in as a data frame, and assign the data frame to a variable (using <-) so that it is stored in R's memory. Then we will explore some of the basic arguments that can be supplied to the function. First, open the RStudio project containing the scripts and data you were working on in episode 'Analyzing Patient Data'.

    # Import the data and look at the first six rows
    carSpeeds <- read.csv(file = 'data/car-speeds.csv')
    head(carSpeeds)

      Color Speed     State
    1  Blue    32 NewMexico
    2   Red    45   Arizona
    3  Blue    35  Colorado
    4 White    34   Arizona
    5   Red    25   Arizona
    6  Blue    41   Arizona

Changing Delimiters

The default delimiter of the read.csv() function is a comma, but you can use other delimiters by supplying the 'sep' argument to the function (e.g., typing sep = ';' allows a semi-colon separated file to be correctly imported - see ?read.csv() for more information on this and other options for working with different file types).
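As a minimal sketch of the sep argument, the example below reads a small semicolon-separated table. The inline data passed through the text argument, and the names semicolonData and carsSemi, are made up for illustration and are not part of the lesson's data files.

```r
# Hypothetical semicolon-separated data, supplied inline via the `text` argument
semicolonData <- "Color;Speed;State
Blue;32;NewMexico
Red;45;Arizona"

# With the default sep = ',' each line would be read as a single column;
# sep = ';' splits the fields as intended
carsSemi <- read.csv(text = semicolonData, sep = ';')

ncol(carsSemi)    # number of columns after splitting on ';'
names(carsSemi)   # column names taken from the first line
```

Reading from an inline string keeps the sketch self-contained; with a real file you would pass file = instead of text =.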

The read.csv(file = 'data/car-speeds.csv') call above will import the data, but we have not taken advantage of several handy arguments that can be helpful in loading the data in the format we want. Let's explore some of these arguments.

The default for read.csv(...) is to set the header argument to TRUE. This means that the first row of values in the .csv is set as header information (column names). If your data set does not have a header, set the header argument to FALSE:

    # The first row of the data without setting the header argument:
    carSpeeds[1, ]

      Color Speed     State
    1  Blue    32 NewMexico

    # The first row of the data if the header argument is set to FALSE:
    carSpeeds <- read.csv(file = 'data/car-speeds.csv', header = FALSE)
    carSpeeds[1, ]

         V1    V2    V3
    1 Color Speed State

Clearly this is not the desired behavior for this data set, but it may be useful if you have a dataset without headers.
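For a file that really has no header row, a sketch like the following may help. It uses made-up inline data (noHeader) and shows both the automatic V1, V2, V3 names and the col.names argument of read.csv(), which supplies names at import time. The object names are hypothetical.

```r
# Hypothetical data with no header row
noHeader <- "Blue,32,NewMexico
Red,45,Arizona"

# header = FALSE keeps the first line as data; R names the columns V1, V2, V3
carsNoHeader <- read.csv(text = noHeader, header = FALSE)
names(carsNoHeader)

# col.names supplies meaningful column names while importing
carsNamed <- read.csv(text = noHeader, header = FALSE,
                      col.names = c('Color', 'Speed', 'State'))
names(carsNamed)
nrow(carsNamed)   # both lines are kept as data rows
```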

The stringsAsFactors Argument

In older versions of R (prior to 4.0) this was perhaps the most important argument in read.csv(), particularly if you were working with categorical data. This is because the default behavior of R was to convert character strings into factors, which may make it difficult to do such things as replace values. It is important to be aware of this behaviour, which we will demonstrate. For example, let's say we find out that the data collector was color blind, and accidentally recorded green cars as being blue. In order to correct the data set, let's replace 'Blue' with 'Green' in the $Color column:

    # Here we will use R's `ifelse` function, in which we provide the test phrase,
    # the outcome if the result of the test is 'TRUE', and the outcome if the
    # result is 'FALSE'. We will also assign the results to the Color column,
    # using '<-'

    # First - reload the data with a header
    carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = TRUE)

    carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
    carSpeeds$Color

      [1] "Green" "1"     "Green" "5"     "4"     "Green" "Green" "2"     "5"
     [10] "4"     "4"     "5"     "Green" "Green" "2"     "4"     "Green" "Green"
     [19] "5"     "Green" "Green" "Green" "4"     "Green" "4"     "4"     "4"
     [28] "4"     "5"     "Green" "4"     "5"     "2"     "4"     "2"     "2"
     [37] "Green" "4"     "2"     "4"     "2"     "2"     "4"     "4"     "5"
     [46] "2"     "Green" "4"     "4"     "2"     "2"     "4"     "5"     "4"
     [55] "Green" "Green" "2"     "Green" "5"     "2"     "4"     "Green" "Green"
     [64] "5"     "2"     "4"     "4"     "2"     "Green" "5"     "Green" "4"
     [73] "5"     "5"     "Green" "Green" "Green" "Green" "Green" "5"     "2"
     [82] "Green" "5"     "2"     "2"     "4"     "4"     "5"     "5"     "5"
     [91] "5"     "4"     "4"     "4"     "5"     "2"     "5"     "2"     "2"
    [100] "5"

What happened?!? It looks like 'Blue' was replaced with 'Green', but every other color was turned into a number (as a character string, given the quote marks before and after). This is because the colors of the cars were loaded as factors, and the factor level was reported following replacement.
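The mechanism can be sketched on a tiny made-up factor (colorsDemo, a hypothetical stand-in for the $Color column). A factor stores integer codes plus a set of labels, and ifelse() ends up using the codes for the values it does not replace:

```r
# A small made-up factor, standing in for the Color column
colorsDemo <- factor(c('Blue', 'Red', 'Blue', 'White'))

levels(colorsDemo)       # the labels, in alphabetical order
as.integer(colorsDemo)   # the integer code stored for each element

# The 'no' values are taken from the factor's integer codes, so the
# entries that are not replaced come back as codes rather than labels
replaced <- ifelse(colorsDemo == 'Blue', 'Green', colorsDemo)
replaced
```

Here 'Red' and 'White' are the second and third levels, so they come back as the strings "2" and "3".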

To see the internal structure, we can use another function, str(). In this case, the dataframe's internal structure includes the format of each column, which is what we are interested in. str() will be reviewed a little more in the lesson Data Types and Structures.

    # Reload the data with a header (the previous ifelse call modifies attributes)
    carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = TRUE)
    str(carSpeeds)

    'data.frame':	100 obs. of  3 variables:
     $ Color: Factor w/ 5 levels " Red","Black",..: 3 1 3 5 4 3 3 2 5 4 ...
     $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
     $ State: Factor w/ 4 levels "Arizona","Colorado",..: 3 1 2 1 1 1 3 2 1 2 ...

We can see that the $Color and $State columns are factors and $Speed is a numeric column.

Now, let's load the dataset using stringsAsFactors=FALSE, and see what happens when we try to replace 'Blue' with 'Green' in the $Color column:

    carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = FALSE)
    str(carSpeeds)

    'data.frame':	100 obs. of  3 variables:
     $ Color: chr  "Blue" " Red" "Blue" "White" ...
     $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
     $ State: chr  "NewMexico" "Arizona" "Colorado" "Arizona" ...

    carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
    carSpeeds$Color

      [1] "Green" " Red"  "Green" "White" "Red"   "Green" "Green" "Black" "White"
     [10] "Red"   "Red"   "White" "Green" "Green" "Black" "Red"   "Green" "Green"
     [19] "White" "Green" "Green" "Green" "Red"   "Green" "Red"   "Red"   "Red"
     [28] "Red"   "White" "Green" "Red"   "White" "Black" "Red"   "Black" "Black"
     [37] "Green" "Red"   "Black" "Red"   "Black" "Black" "Red"   "Red"   "White"
     [46] "Black" "Green" "Red"   "Red"   "Black" "Black" "Red"   "White" "Red"
     [55] "Green" "Green" "Black" "Green" "White" "Black" "Red"   "Green" "Green"
     [64] "White" "Black" "Red"   "Red"   "Black" "Green" "White" "Green" "Red"
     [73] "White" "White" "Green" "Green" "Green" "Green" "Green" "White" "Black"
     [82] "Green" "White" "Black" "Black" "Red"   "Red"   "White" "White" "White"
     [91] "White" "Red"   "Red"   "Red"   "White" "Black" "White" "Black" "Black"
    [100] "White"

That's better! And we can see how the data now is read as character instead of factor. From R version 4.0 onwards we do not have to specify stringsAsFactors=FALSE; this is the default behavior.
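Because the default depends on the R version, one option is to state the argument explicitly so that a script behaves the same everywhere. The sketch below uses made-up inline data (sampleData) and hypothetical object names to compare the two settings:

```r
# Hypothetical inline data standing in for a .csv file
sampleData <- "Color,Speed
Blue,32
Red,45"

# Setting the argument explicitly removes any dependence on the R version
asFactors <- read.csv(text = sampleData, stringsAsFactors = TRUE)
asStrings <- read.csv(text = sampleData, stringsAsFactors = FALSE)

class(asFactors$Color)   # text column stored as a factor
class(asStrings$Color)   # text column stored as character strings
```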

The as.is Argument

This is an extension of the stringsAsFactors argument, but gives you control over individual columns. For example, if we want the colors of cars imported as strings, but we want the names of the states imported as factors, we would load the data set as:

    carSpeeds <- read.csv(file = 'data/car-speeds.csv', as.is = 1)

    # Note, the 1 applies as.is to the first column only

Now we can see that if we try to replace 'Blue' with 'Green' in the $Color column everything looks fine, while trying to replace 'Arizona' with 'Ohio' in the $State column returns the factor numbers for the names of states that we haven't replaced:

    str(carSpeeds)

    'data.frame':	100 obs. of  3 variables:
     $ Color: chr  "Blue" " Red" "Blue" "White" ...
     $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
     $ State: Factor w/ 4 levels "Arizona","Colorado",..: 3 1 2 1 1 1 3 2 1 2 ...

    carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
    carSpeeds$Color

      [1] "Green" " Red"  "Green" "White" "Red"   "Green" "Green" "Black" "White"
     [10] "Red"   "Red"   "White" "Green" "Green" "Black" "Red"   "Green" "Green"
     [19] "White" "Green" "Green" "Green" "Red"   "Green" "Red"   "Red"   "Red"
     [28] "Red"   "White" "Green" "Red"   "White" "Black" "Red"   "Black" "Black"
     [37] "Green" "Red"   "Black" "Red"   "Black" "Black" "Red"   "Red"   "White"
     [46] "Black" "Green" "Red"   "Red"   "Black" "Black" "Red"   "White" "Red"
     [55] "Green" "Green" "Black" "Green" "White" "Black" "Red"   "Green" "Green"
     [64] "White" "Black" "Red"   "Red"   "Black" "Green" "White" "Green" "Red"
     [73] "White" "White" "Green" "Green" "Green" "Green" "Green" "White" "Black"
     [82] "Green" "White" "Black" "Black" "Red"   "Red"   "White" "White" "White"
     [91] "White" "Red"   "Red"   "Red"   "White" "Black" "White" "Black" "Black"
    [100] "White"

    carSpeeds$State <- ifelse(carSpeeds$State == 'Arizona', 'Ohio', carSpeeds$State)
    carSpeeds$State

      [1] "3"    "Ohio" "2"    "Ohio" "Ohio" "Ohio" "3"    "2"    "Ohio" "2"
     [11] "4"    "4"    "4"    "4"    "4"    "3"    "Ohio" "3"    "Ohio" "4"
     [21] "4"    "4"    "3"    "2"    "2"    "3"    "2"    "4"    "2"    "4"
     [31] "3"    "2"    "2"    "4"    "2"    "2"    "3"    "Ohio" "4"    "2"
     [41] "2"    "3"    "Ohio" "4"    "Ohio" "2"    "3"    "3"    "3"    "2"
     [51] "Ohio" "4"    "4"    "Ohio" "3"    "2"    "4"    "2"    "4"    "4"
     [61] "4"    "2"    "3"    "2"    "3"    "2"    "3"    "Ohio" "3"    "4"
     [71] "4"    "2"    "Ohio" "4"    "2"    "2"    "2"    "Ohio" "3"    "Ohio"
     [81] "4"    "2"    "2"    "Ohio" "Ohio" "Ohio" "4"    "Ohio" "4"    "4"
     [91] "4"    "Ohio" "Ohio" "3"    "2"    "2"    "4"    "3"    "Ohio" "4"

We can see that the $Color column is a character while $State is a factor.
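According to the ?read.table help page, as.is also accepts column names in place of positions, which can be easier to read in a script. A minimal sketch, using made-up inline data (sampleCars) and a hypothetical object name carsByName:

```r
# Hypothetical inline data with the same columns as car-speeds.csv
sampleCars <- "Color,Speed,State
Blue,32,NewMexico
Red,45,Arizona"

# Name the column to keep as character; other text columns become factors
carsByName <- read.csv(text = sampleCars, as.is = 'Color')

class(carsByName$Color)   # kept as character
class(carsByName$State)   # converted to a factor
```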

Updating Values in a Factor

Suppose we want to keep the colors of cars as factors for some other operations we want to perform. Write code for replacing 'Blue' with 'Green' in the $Color column of the cars dataset without importing the data with stringsAsFactors=FALSE.

Solution

    carSpeeds <- read.csv(file = 'data/car-speeds.csv')
    # Replace 'Blue' with 'Green' in cars$Color without using the stringsAsFactors
    # or as.is arguments
    carSpeeds$Color <- ifelse(as.character(carSpeeds$Color) == 'Blue',
                              'Green',
                              as.character(carSpeeds$Color))
    # Convert colors back to factors
    carSpeeds$Color <- as.factor(carSpeeds$Color)
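An alternative to the solution above, not taken from the lesson, is to rename the factor level itself with levels(), which avoids the round trip through character strings. A minimal sketch on a made-up factor (colorFactor is a hypothetical name):

```r
# A small made-up factor, standing in for carSpeeds$Color
colorFactor <- factor(c('Blue', 'Red', 'Blue', 'White'))

# Rename the 'Blue' level; every element carrying that level changes with it
levels(colorFactor)[levels(colorFactor) == 'Blue'] <- 'Green'

colorFactor
levels(colorFactor)
```

The result is still a factor, so no conversion back with as.factor() is needed.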

The strip.white Argument

It is not uncommon for mistakes to have been made when the data were recorded, for example a space (whitespace) may have been inserted before a data value. By default this whitespace will be kept in the R environment, such that ' Red' will be recognized as a different value than 'Red'. In order to avoid this type of error, use the strip.white argument. Let's see how this works by checking for the unique values in the $Color column of our dataset:

Here, the data recorder added a space before the color of the car in one of the cells:

    # We use the built-in unique() function to extract the unique colors in our dataset
    unique(carSpeeds$Color)

    [1] Green  Red  White Red   Black
    Levels:  Red Black Green Red White

Oops, we see two values for red cars.

Let's try again, this time importing the data using the strip.white argument. NOTE - this argument must be accompanied by the sep argument, by which we indicate the type of delimiter in the file (the comma for most .csv files).

    carSpeeds <- read.csv(
      file = 'data/car-speeds.csv',
      stringsAsFactors = FALSE,
      strip.white = TRUE,
      sep = ','
    )

    unique(carSpeeds$Color)

    [1] "Blue"  "Red"   "White" "Black"

That's better!
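The same effect can be reproduced on a few lines of made-up data, which may be handy for checking the behaviour without the lesson files. The inline string spacedData and the object names are hypothetical:

```r
# Hypothetical data in which one value has a stray leading space
spacedData <- "Color,Speed
Blue,32
 Red,45
Red,25"

withSpace <- read.csv(text = spacedData, stringsAsFactors = FALSE)
stripped  <- read.csv(text = spacedData, stringsAsFactors = FALSE,
                      strip.white = TRUE, sep = ',')

unique(withSpace$Color)   # ' Red' and 'Red' are treated as different values
unique(stripped$Color)    # the leading space is removed on import
```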

Specify Missing Data When Loading

It is common for data sets to have missing values, or mistakes. The convention for recording missing values often depends on the individual who collected the data and can be recorded as n.a., --, or empty cells " ". R recognises the reserved character string NA as a missing value, but not some of the examples above. Let's say the inflammation scale in the data set we used earlier, inflammation-01.csv, actually starts at 1 for no inflammation and the zero values (0) were a missed observation. Looking at the ?read.csv help page, is there an argument we could use to ensure all zeros (0) are read in as NA? Perhaps the car-speeds.csv data contains mistakes and the person measuring the car speeds could not accurately distinguish between "Black" or "Blue" cars. Is there a way to specify more than one 'string', such as "Black" and "Blue", to be replaced by NA?

Solution

    read.csv(file = "data/inflammation-01.csv", na.strings = "0")

or, in car-speeds.csv, use a character vector for multiple values.

    read.csv(
      file = 'data/car-speeds.csv',
      na.strings = c("Black", "Blue")
    )
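A small self-contained sketch of na.strings, using made-up inline data (messyData) with the n.a. and -- conventions mentioned above. The object names are hypothetical:

```r
# Hypothetical data using two different missing-value conventions
messyData <- "Color,Speed
Blue,32
n.a.,45
Red,--"

# Every string listed in na.strings is read in as NA
cleaned <- read.csv(text = messyData, na.strings = c('n.a.', '--'),
                    stringsAsFactors = FALSE)

is.na(cleaned$Color)
is.na(cleaned$Speed)
is.numeric(cleaned$Speed)   # Speed stays numeric once '--' is treated as NA
```

Without na.strings, the -- entry would force the whole Speed column to be read as text.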

Write a New .csv and Explore the Arguments

After altering our cars dataset by replacing 'Blue' with 'Green' in the $Color column, we now want to save the output. There are several arguments for the write.csv(...) function call, a few of which are particularly important for how the data are exported. Let's explore these now.

    # Export the data. The write.csv() function requires a minimum of two
    # arguments, the data to be saved and the name of the output file.

    write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv')

If you open the file, you'll see that it has header names, because the data had headers within R, but that there are numbers in the first column.

csv written without row.names argument

The row.names Argument

This argument allows us to set the names of the rows in the output data file. R's default for this argument is TRUE, and since it does not know what else to name the rows for the cars data set, it resorts to using row numbers. To correct this, we can set row.names to FALSE:

    write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv', row.names = FALSE)

Now we see:

csv written with row.names argument
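To compare the two outputs without opening the files in a spreadsheet, a sketch like the following writes a tiny made-up data frame (smallCars) to temporary files and prints the raw lines. The object names are hypothetical:

```r
# A tiny made-up data frame, standing in for carSpeeds
smallCars <- data.frame(Color = c('Blue', 'Red'), Speed = c(32, 45))

# Temporary files, so nothing is written into the project folder
withRowNames    <- tempfile(fileext = '.csv')
withoutRowNames <- tempfile(fileext = '.csv')

write.csv(smallCars, file = withRowNames)
write.csv(smallCars, file = withoutRowNames, row.names = FALSE)

readLines(withRowNames)      # extra first column holding the row numbers
readLines(withoutRowNames)   # only the Color and Speed columns
```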

Setting Column Names

There is also a col.names argument, which can be used to set the column names for a data set without headers. If the data set already has headers (e.g., we used the header = TRUE argument when importing the data) then a col.names argument will be ignored.

The na Argument

There are times when we want to specify certain values for NAs in the data set (e.g., we are going to pass the data to a program that only accepts -9999 as a nodata value). In this case, we want to set the NA value of our output file to the desired value, using the na argument. Let's see how this works:

    # First, replace the speed in the 3rd row with NA, by using an index (square
    # brackets to indicate the position of the value we want to replace)
    carSpeeds$Speed[3] <- NA
    head(carSpeeds)

      Color Speed     State
    1  Blue    32 NewMexico
    2   Red    45   Arizona
    3  Blue    NA  Colorado
    4 White    34   Arizona
    5   Red    25   Arizona
    6  Blue    41   Arizona

    write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv', row.names = FALSE)

Now we'll set NA to -9999 when we write the new .csv file:

    # Note - the na argument requires a string input
    write.csv(carSpeeds,
              file = 'data/car-speeds-cleaned.csv',
              row.names = FALSE,
              na = '-9999')

And we see:

csv written with -9999 as NA
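If the file is later read back into R, the same code can be declared as missing with na.strings. The round trip below is a sketch on a tiny made-up data frame (smallCars) written to a temporary file. The object names are hypothetical:

```r
# A tiny made-up data frame with one missing speed
smallCars <- data.frame(Color = c('Blue', 'Red'), Speed = c(32, NA))

outFile <- tempfile(fileext = '.csv')
write.csv(smallCars, file = outFile, row.names = FALSE, na = '-9999')

readLines(outFile)   # the NA is written out as -9999

# Reading the file back, -9999 is declared as the missing-value code
roundTrip <- read.csv(outFile, na.strings = '-9999')
is.na(roundTrip$Speed)
```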

Key Points

  • Import data from a .csv file using the read.csv(...) function.

  • Understand some of the key arguments available for importing the data properly, including header, stringsAsFactors, as.is, and strip.white.

  • Write data to a new .csv file using the write.csv(...) function.

  • Understand some of the key arguments available for exporting the data properly, such as row.names, col.names, and na.

penaaptir1973.blogspot.com

Source: https://swcarpentry.github.io/r-novice-inflammation/11-supp-read-write-csv/
