5 handy options in R data.table’s fread
Like all functions in the data.table R package, fread is fast. Very fast. But there's more to fread than speed. It has several helpful functions and options when importing external data into R. Here are five of the most useful.
Note: If you'd like to follow along, download the New York Times CSV file of daily Covid-19 cases by U.S. county at https://github.com/nytimes/covid-19-data/raw/master/us-counties.csv.
Use fread's nrows option
Is your file large? Would you like to examine its structure before importing the whole thing – without having to open it in a text editor or Excel? Use fread's nrows option to import only a portion of a file for exploration.
The code below imports just the first 10 rows of the CSV.
mydt10 <- fread("us-counties.csv", nrows = 10)
If you just want to see column names without any data at all, you can use nrows = 0.
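To try the nrows option without downloading the full file, here is a minimal sketch that writes a tiny CSV to a temporary file and reads it back. The three data rows and the names tmp, header_only, and first_two are made up for illustration; only the column names follow the New York Times file.

```r
library(data.table)

# Hypothetical three-row sample with the same columns as us-counties.csv
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "date,county,state,fips,cases,deaths",
  "2020-01-21,Snohomish,Washington,53061,1,0",
  "2020-01-22,Snohomish,Washington,53061,1,0",
  "2020-01-24,Cook,Illinois,17031,1,0"
), tmp)

# nrows = 0 reads the header only: column names, zero rows
header_only <- fread(tmp, nrows = 0)
names(header_only)
nrow(header_only)

# nrows = 2 reads just the first two data rows
first_two <- fread(tmp, nrows = 2)
nrow(first_two)
```

Checking names() on the zero-row result is a cheap way to get the column names before deciding what to import.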
Use fread's select option
Once you know the file structure, you can choose which columns to import. fread's select option lets you pick the columns you want to keep. select takes a vector of either column names or column-position numbers. If names, they need to be in quotation marks, like most vectors of character strings:
mydt <- fread("us-counties.csv",
              select = c("date", "county", "state", "cases"))
As always, numbers don't need quotation marks:
mydt <- fread("us-counties.csv", select = c(1,2,3,5))
You can use an R object holding a vector of column names inside fread, as you can see in this next group of code. I create a vector my_cols with date, county, state, and cases, then I use that vector inside fread.
my_cols <- c("date", "county", "state", "cases")
mydt <- fread("us-counties.csv", select = my_cols)
The opposite of select is drop. You can choose to import all columns except the ones you specify with drop, such as:
mydt <- fread("us-counties.csv", drop = c("fips", "deaths"))
As with select, drop takes a vector of column names or numerical positions.
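The sketch below shows the three approaches side by side on a small in-memory sample, so it runs without the downloaded file. The sample rows and the names sample_lines, by_name, by_position, and by_drop are assumptions for illustration. It uses fread's text argument to read from a character vector instead of a file.

```r
library(data.table)

# Made-up sample lines with the same columns as us-counties.csv
sample_lines <- c(
  "date,county,state,fips,cases,deaths",
  "2020-01-21,Snohomish,Washington,53061,1,0",
  "2020-01-24,Cook,Illinois,17031,1,0"
)

# Keep four columns by name, by position, or by dropping the other two
by_name     <- fread(text = sample_lines, select = c("date", "county", "state", "cases"))
by_position <- fread(text = sample_lines, select = c(1, 2, 3, 5))
by_drop     <- fread(text = sample_lines, drop = c("fips", "deaths"))

# All three calls end up with the same column names
identical(names(by_name), names(by_position))
identical(names(by_name), names(by_drop))
```

Selecting by name tends to be safer than selecting by position if the column order of the source file might change.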
Use fread with grep
If you're familiar with Unix, you can execute command-line tools right from inside fread. For example, if I just wanted California data, I could use grep to import only lines that contain the text "California." Note that this searches each entire row as a text string, not a specific column, so your data has to be in a format where that makes sense.
ca <- fread("grep California us-counties.csv")
Unfortunately, grep doesn't understand the original file's column names, so you end up with default names.
head(ca)
           V1          V2         V3   V4 V5 V6
1: 2020-01-25      Orange California 6059  1  0
2: 2020-01-26 Los Angeles California 6037  1  0
3: 2020-01-26      Orange California 6059  1  0
4: 2020-01-27 Los Angeles California 6037  1  0
5: 2020-01-27      Orange California 6059  1  0
6: 2020-01-28 Los Angeles California 6037  1  0
However, fread lets us specify column names with the col.names option. I can set the names based on the names from mydt10 that I created above.
ca <- fread("grep California us-counties.csv", col.names = names(mydt10))
> head(ca)
         date      county      state fips cases deaths
1: 2020-01-25      Orange California 6059     1      0
2: 2020-01-26 Los Angeles California 6037     1      0
3: 2020-01-26      Orange California 6059     1      0
4: 2020-01-27 Los Angeles California 6037     1      0
5: 2020-01-27      Orange California 6059     1      0
6: 2020-01-28 Los Angeles California 6037     1      0
We can also use regular expressions, with grep's -E option, letting us do more complex searches, such as looking for four states at once.
states4 <- fread(cmd = "grep -E 'Texas|Arizona|Florida|South Carolina' us-counties.csv",
                 col.names = names(mydt10))
Once again, a reminder: This is looking for each of those state names anywhere in the row, not just in the state column. If you run the code above and check which states are included in the results with unique(states4$state), you'll see Oklahoma and Missouri in the state column along with Texas, Arizona, Florida, and South Carolina. That's because both Oklahoma and Missouri have counties named Texas.
So, grep during file import is a way to filter out a lot of data you don't want from a very large data set, but it doesn't guarantee you only get what you want. After this kind of import, you should still filter specifically on column data to make sure you didn't get anything unexpected.
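That follow-up filter can be sketched as below. The rows are made up to stand in for a grep result (including an Oklahoma county named Texas), and wanted and states4_clean are hypothetical names. With the real import you would apply the same filter to states4 directly.

```r
library(data.table)

# Made-up rows standing in for the grep import; note the county named Texas
states4 <- data.table(
  date   = c("2020-03-20", "2020-03-21", "2020-03-22"),
  county = c("Harris", "Texas", "Maricopa"),
  state  = c("Texas", "Oklahoma", "Arizona"),
  cases  = c(10L, 1L, 5L)
)

wanted <- c("Texas", "Arizona", "Florida", "South Carolina")

# Keep only rows whose state column is one of the four states
states4_clean <- states4[state %in% wanted]
unique(states4_clean$state)
```

The grep step cuts down what gets read into memory, and the column filter then removes rows that matched only because of a county name.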
Use fread's colClasses option
You can set column classes during import – for just a few columns, not every one. For example, the date column in this data is coming in as character strings, even though it's in year-month-day format. We can set the column named date to the data type Date during import using the colClasses option.
mydt <- fread("us-counties.csv", colClasses = c("date" = "Date"))
Now, dates are Dates.
> str(mydt)
Classes 'data.table' and 'data.frame':	322651 obs. of  6 variables:
 $ date  : Date, format: "2020-01-21" "2020-01-22" "2020-01-23" ...
 $ county: chr  "Snohomish" "Snohomish" "Snohomish" "Cook" ...
 $ state : chr  "Washington" "Washington" "Washington" "Illinois" ...
 $ fips  : int  53061 53061 53061 17031 53061 6059 17031 53061 4013 6037 ...
 $ cases : int  1 1 1 1 1 1 1 1 1 1 ...
 $ deaths: int  0 0 0 0 0 0 0 0 0 0 ...
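colClasses accepts more than one named entry, so you can set classes for a couple of columns and let fread guess the rest. The sketch below uses made-up sample lines. Reading fips as character is my own illustrative choice (it keeps any leading zeros as text) and is not something the steps above require.

```r
library(data.table)

# Made-up sample lines; fips is written with a leading zero here
sample_lines <- c(
  "date,county,state,fips,cases,deaths",
  "2020-01-25,Orange,California,06059,1,0",
  "2020-01-26,Los Angeles,California,06037,1,0"
)

# Set classes for two columns only; fread detects the others
mydt_sample <- fread(
  text = sample_lines,
  colClasses = c(date = "Date", fips = "character")
)

class(mydt_sample$date)
mydt_sample$fips
```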
Use fread on zipped files
You can import a zipped file without unzipping it first. fread can import gz and bz2 files directly, such as mydt <- fread("myfile.gz"). If you need to import a zip file, you can unzip it with the unzip system command within fread, using the syntax mydt <- fread(cmd = 'unzip -cq myfile.zip').
For more R tips, head to InfoWorld's Do More With R page.
Copyright © 2020 IDG Communications, Inc.