Monday, December 3, 2018

Release notes for the DataFrames.jl package v0.15.0

The DataFrames.jl package is getting closer to a 1.0 release. In order to reach this level of maturity, a number of significant changes are introduced in this release. It should also be expected that in the near future there will be several more major changes in the package to synchronize it with Julia 1.0.

This post is divided into three sections:
  1. A brief statement of major changes in the package since the last release.
  2. Planned changes in the coming releases.
  3. A more detailed look at the selected changes.
Soon I will update to reflect those changes.

A brief statement of major changes in the package since the last release

  • Finish the deprecation period of the makeunique keyword argument; data frames will now throw an error when supplied with duplicate column names unless auto-generation of non-conflicting column names is explicitly allowed;
  • Major redesign of split-apply-combine methods leading to cleaner code and improved performance (see below for the details); also, all column-mapping functions now accept functions, types or functors to perform transformations (earlier only functions were allowed)
  • If an anonymous function is used in split-apply-combine functions (e.g. by or aggregate), then its auto-generated name is function;
  • Allow comparisons for GroupedDataFrame
  • A decision to treat a data frame as a collection of rows; this mainly affects the names of supported functions: deprecation of length (use size instead), delete! (renamed), merge! (removed), insert! (renamed), head (use first), and tail (use last)
  • Deprecate setindex! with nothing on the RHS, which dropped a column in the past
  • A major review of getindex and view methods for all types defined in the DataFrames.jl package to make them consistent with Julia 1.0 behavior
  • Allow specifying columns to completecases, dropmissing and dropmissing!
  • Make all major types defined in the DataFrames.jl package immutable to avoid their allocation (they were mostly mutable in the past)
  • Add a convenience method so that copy on a DataFrameRow produces a NamedTuple; this should make using DataFrameRow more convenient, as it is now more commonly encountered because getindex with a single row selected will return a value of this type
  • Fixed show methods for data frames with csv/tsv target
  • HTML output of the data frame also prints its dimensions
  • significant improvements in the documentation of the DataFrames.jl package
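
A few of the changes listed above can be tried directly. The following is a minimal sketch, assuming DataFrames.jl 0.15 or later; the data frame contents and variable names are made up for illustration.

```julia
using DataFrames

# Duplicate column names are an error unless makeunique=true is passed,
# in which case non-conflicting names are generated automatically.
df = DataFrame([[1, 2, 3], [4, 5, 6]], [:a, :a], makeunique=true)

# A data frame is treated as a collection of rows:
nrows, ncols = size(df)   # use size instead of length
top = first(df, 2)        # replaces head
bottom = last(df, 2)      # replaces tail

# completecases and dropmissing accept a column selection.
dm = DataFrame(x=[1, missing, 3], y=[missing, "b", "c"])
mask = completecases(dm, :x)   # only column :x is checked
kept = dropmissing(dm, :x)     # rows where :x is missing are dropped

# copy on a DataFrameRow produces a NamedTuple.
nt = copy(first(eachrow(df)))
```

Note that column :y of dm still contains a missing value after dropmissing(dm, :x), because only :x was selected.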

Planned changes in the coming releases

  • further, minor, cleanup of split-apply-combine interface (there are some corner cases still left to be fixed)
  • major review of setindex! (similar to what was done with getindex in this release); in particular to support broadcasting consistent with Julia 1.0; expect breaking (and possibly major) changes here
  • finish deprecation periods for getindex and eachcol
(this list will probably be longer but those are things that are a priority)

A more detailed look at the selected changes

An improved split-apply-combine

The first thing I want to highlight is a new split-apply-combine API with improved performance (a major contribution of @nalimilan).

Consider the following basic setting:

using DataFrames

df = DataFrame(x=categorical(repeat(string.('a':'f') .^ 4, 10^6)),
               y = 1:6*10^6)

Below, I report all timings with @time to get the feel of real workflow delay, but the times are reported after precompilation.
Code that worked under DataFrames 0.14.1 along with its timing is the following:

julia> @time by(df, :x, v -> DataFrame(s=sum(v.y)));
  0.474845 seconds (70.90 k allocations: 296.545 MiB, 31.18% gc time)

Now under DataFrames 0.15.0 it is:

julia> @time by(df, :x, v -> DataFrame(s=sum(v[:,:y])));
  0.234375 seconds (6.34 k allocations: 229.213 MiB, 35.98% gc time)

julia> @time by(df, :x, s = :y=>sum);
  0.114782 seconds (332 allocations: 137.347 MiB, 14.21% gc time)
Observe that there are two levels of speedup:
  • even under the old API we get a 2x speedup due to better handling of grouping;
  • if we use the new type-stable API, the code is not only shorter but also faster.
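
For intuition about what the grouped sum computes, here is a Base-only sketch of the split-apply-combine idea on a tiny made-up data set. It is an illustration only, not how DataFrames.jl implements by; the names keys_col, vals_col and sums are hypothetical.

```julia
# Toy data: a grouping column and a value column (made up for illustration).
keys_col = repeat(["a", "b", "c"], 2)   # "a", "b", "c", "a", "b", "c"
vals_col = collect(1:6)

# Split rows by key, apply sum, and combine into one result per group.
sums = Dict{String,Int}()
for (k, v) in zip(keys_col, vals_col)
    sums[k] = get(sums, k, 0) + v
end

# Print one line per group, in sorted key order.
for k in sort(collect(keys(sums)))
    println(k, " => ", sums[k])
end
```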
Now let us dig into the options the new API provides. I will show them all by example (I am omitting the old API with function passed - it still works unchanged):
by(df, :x, s = :y=>sum, p = :y=>maximum) # one or more keyword arguments
by(df, :x, :y=>sum, :y=>maximum) # one or more positional arguments
by(:y=>sum, df, :x) # a Pair as the first argument
by((s = :y=>sum, p = :y=>maximum), df, :x) # a NamedTuple of Pairs
by((:y=>sum, :y=>maximum), df, :x) # a Tuple of Pairs
by([:y=>sum, :y=>maximum], df, :x) # a vector of Pairs
Now, if you use a Pair, a tuple or a vector option (i.e. all other than keyword arguments or NamedTuple) then you can return a NamedTuple instead of a DataFrame to give names to the columns, which is faster, especially when there are many small groups, e.g.:
by(df, :x, x->(a=1, b=sum(x[:, :y])))
by(df, :x, :y => x->(a=1, b=sum(x))) # faster with a column selector
instead of the old:
by(df, :x, x->DataFrame(a=1, b=sum(x[:, :y])))

You can pass more than one column in this way. Then the columns are passed as a named tuple, e.g.

julia> using Statistics

julia> df = DataFrame(x = repeat(1:2, 3), a=1:6, b=1:6);

julia> by(df, :x, str = (:a, :b) => string)
2×2 DataFrame
│ Row │ x     │ str                            │
│     │ Int64 │ String                         │
│ 1   │ 1     │ (a = [1, 3, 5], b = [1, 3, 5]) │
│ 2   │ 2     │ (a = [2, 4, 6], b = [2, 4, 6]) │

julia> by(df, :x, cor = (:a, :b) => x->cor(x...))
2×2 DataFrame
│ Row │ x     │ cor     │
│     │ Int64 │ Float64 │
│ 1   │ 1     │ 1.0     │
│ 2   │ 2     │ 1.0     │
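
The splat in x->cor(x...) works because splatting a NamedTuple passes its values as positional arguments. A small sketch using only Base and the Statistics standard library, with a hand-built named tuple (the name cols is hypothetical) standing in for what by passes to the function:

```julia
using Statistics

# Hand-built stand-in for the named tuple of columns of one group.
cols = (a = [1, 3, 5], b = [1, 3, 5])

# Splatting a NamedTuple passes its values positionally,
# so this is the same call as cor(cols.a, cols.b).
c = cor(cols...)
```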

A more flexible eachrow and eachcol

Now the eachrow and eachcol functions return a value that is a read-only subtype of AbstractVector. This allows users to flexibly use all getindex mechanics from Base on these return values. For example:
julia> using DataFrames

julia> df = DataFrame(x=1:5, y='a':'e')
5×2 DataFrame
│ Row │ x     │ y    │
│     │ Int64 │ Char │
│ 1   │ 1     │ 'a'  │
│ 2   │ 2     │ 'b'  │
│ 3   │ 3     │ 'c'  │
│ 4   │ 4     │ 'd'  │
│ 5   │ 5     │ 'e'  │

julia> er = eachrow(df)
5-element DataFrames.DataFrameRows{DataFrame}:
 DataFrameRow (row 1)
x  1
y  a
 DataFrameRow (row 2)
x  2
y  b
 DataFrameRow (row 3)
x  3
y  c
 DataFrameRow (row 4)
x  4
y  d
 DataFrameRow (row 5)
x  5
y  e

julia> ec = eachcol(df)
┌ Warning: In the future eachcol will have names argument set to false by default
│   caller = top-level scope at none:0
└ @ Core none:0
2-element DataFrames.DataFrameColumns{DataFrame,Pair{Symbol,AbstractArray{T,1} where T}}:
┌ Warning: Indexing into a return value of eachcol will return a pair of column name and column value
│   caller = _getindex at abstractarray.jl:928 [inlined]
└ @ Core .\abstractarray.jl:928
┌ Warning: Indexing into a return value of eachcol will return a pair of column name and column value
│   caller = _getindex at abstractarray.jl:928 [inlined]
└ @ Core .\abstractarray.jl:928
 ┌ Warning: Indexing into a return value of eachcol will return a pair of column name and column value
│   caller = _getindex at abstractarray.jl:928 [inlined]
└ @ Core .\abstractarray.jl:928
[1, 2, 3, 4, 5]
 ['a', 'b', 'c', 'd', 'e']
And now you can index into them like this (essentially any indexing Base allows for AbstractVector):
julia> ec[end]
┌ Warning: Indexing into a return value of eachcol will return a pair of column name and column value
│   caller = top-level scope at none:0
└ @ Core none:0
5-element Array{Char,1}:
 'a'
 'b'
 'c'
 'd'
 'e'

julia> er[1:3]
3-element Array{DataFrameRow{DataFrame},1}:
 DataFrameRow (row 1)
x  1
y  a
 DataFrameRow (row 2)
x  2
y  b
 DataFrameRow (row 3)
x  3
y  c
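
To see why being a read-only subtype of AbstractVector is enough to get all of this indexing, here is a sketch with a toy type. ReadOnlyRows is a hypothetical name invented for this illustration and is not the DataFrames.jl implementation; it only defines size and scalar getindex, and Base supplies the rest.

```julia
# Hypothetical toy type; not the actual DataFrameRows implementation.
struct ReadOnlyRows{T} <: AbstractVector{T}
    data::Vector{T}
end

# The minimal read-only AbstractVector interface: size and scalar getindex.
Base.size(r::ReadOnlyRows) = size(r.data)
Base.getindex(r::ReadOnlyRows, i::Int) = r.data[i]
Base.IndexStyle(::Type{<:ReadOnlyRows}) = IndexLinear()

rows = ReadOnlyRows([(x=1, y='a'), (x=2, y='b'), (x=3, y='c')])

last_row = rows[end]                   # `end` works through Base
head_rows = rows[1:2]                  # range indexing returns a plain Vector
picked = rows[[true, false, true]]     # logical indexing works as well

# No setindex! method is defined, so assignment throws an error.
threw = try
    rows[1] = (x=0, y='z')
    false
catch
    true
end
```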
You will notice a large number of warnings when using eachcol. They will be removed in the next release of the DataFrames.jl package and are due to two reasons:
  • there are now two variants of eachcol: one returning plain columns (called by eachcol(df, false)), the other returning column names and values (called by eachcol(df, true)); in the past calling eachcol(df) defaulted to the true option; in the future it will default to false to be consistent with;
  • getting values from an eachcol result that carries column names was inconsistent in the past, depending on whether we indexed into it or iterated over it; in the future it will return a Pair in all cases.

Consistent getindex and view methods

There was a major redesign of how getindex and view work for all types that the DataFrames.jl package defines. Now they are as consistent with Base as possible. The lengthy details are outlined in Here are the key highlights of the new rules:
  • using @view on getindex will always consistently return a view containing the same values as getindex would return (in the past this was not the case);
  • selecting a single row with an integer from a data frame will return a DataFrameRow (it was a DataFrame in the past); this was a tough decision because DataFrameRow is a view, so one should be careful when using setindex! on such an object, but it is guided by the rule that selecting a single row should drop a dimension, like indexing in Base;
  • selecting multiple rows of a data frame will always perform a copy of columns (this was not consistent earlier; also the behavior follows what Base does); selecting columns without specifying rows returns an underlying vector; so for example, the difference is that now df[:, cols] performs a copy and df[cols] will not perform a copy of the underlying vectors.
Currently you will get many deprecation warnings where indexing rules will change. In the next release of the DataFrames.jl package these changes will be made.
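
The Base behavior that these rules mirror can be seen on an ordinary matrix. This sketch uses only Base arrays, not data frames, since the new data frame rules are still behind deprecation warnings in this release:

```julia
m = [1 2; 3 4; 5 6]

r = m[1, :]        # an integer row index drops a dimension: a Vector
s = m[1:1, :]      # a range row index keeps it: a 1×2 Matrix

c = m[1:2, :]      # selecting multiple rows makes a copy
c[1, 1] = 100      # so m is unchanged by this write

v = @view m[1:2, :]  # a view holds the same values as getindex would return
v[1, 1] = -1         # but writes go through to m
```

The last line is the reason for the caution about DataFrameRow above: writing through a view changes the parent.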