Wednesday, December 20, 2017

An update of my tutorial on DataFrames

Since my last post about my tutorial to DataFrames I had switched to use them a lot more in my daily Julia workflow. Based on the experience I have added three sections to it:

  1. performance recommendations;
  2. possible pitfalls when using DataFrames;
  3. useful packages, currently FreqTables and DataFramesMeta.
Actually all three sections would benefit from the experience of the community, so if you have a comment please make an issue or PR on https://github.com/bkamins/Julia-DataFrames-Tutorial.

As for useful packages I have tried to use raw DataFrames to reduce dependencies of my code, but actually FreqTables and DataFramesMeta helped me a lot and did not give too much mental overhead of things to remember.

I would like to especially recommend FreqTables - a small package, but really useful. I would say that it deserves much more attention from the community than it gets (looking at the number of stars). So let me write a bit about it.

There are three reasons I like it:
  1. simply I make contingency tables almost all the time; previously I have used countmap from StatsBase a lot, but it returns a dictionary which is not very handy; freqtable returns a much nicer result (e.g. if possible it is sorted) and allows for more than one dimension;
  2. with freqtable I can use vectors or work on data frames, it nicely handles missings and allows for weighting;
  3. freqtable is faster than countmap (I was surprised when I learned this, maybe not a critical thing but a nice plus).
So how does the output from freqtable  look? Here is a sampler:


julia> using DataFrames, FreqTables

julia> srand(1); df = DataFrame(rand(1:3, 10, 2))
10×2 DataFrames.DataFrame
│ Row │ x1 │ x2 │
├─────┼────┼────┤
│ 1   │ 3  │ 2  │
│ 2   │ 3  │ 1  │
│ 3   │ 3  │ 1  │
│ 4   │ 3  │ 2  │
│ 5   │ 1  │ 2  │
│ 6   │ 1  │ 2  │
│ 7   │ 1  │ 3  │
│ 8   │ 2  │ 3  │
│ 9   │ 1  │ 1  │
│ 10  │ 1  │ 2  │

julia> freqtable(df, :x1, :x2)
3×3 Named Array{Int64,2}
x1 ╲ x2 │ 1  2  3
────────┼────────
1       │ 1  3  1
2       │ 0  0  1
3       │ 2  2  0


And now a simple benchmark against countmap :


julia> using DataFrames, FreqTables, StatsBase, BenchmarkTools

julia> srand(1); x = rand(1:100, 10^6); y = categorical(x); z = string.(x);

julia> @benchmark freqtable($x)
BenchmarkTools.Trial:
  memory estimate:  25.89 KiB
  allocs estimate:  83
  --------------
  minimum time:     24.246 ms (0.00% GC)
  median time:      24.672 ms (0.00% GC)
  mean time:        25.425 ms (0.00% GC)
  maximum time:     39.739 ms (0.00% GC)
  --------------
  samples:          197
  evals/sample:     1

julia> @benchmark countmap($x)
BenchmarkTools.Trial:
  memory estimate:  6.61 KiB
  allocs estimate:  10
  --------------
  minimum time:     42.230 ms (0.00% GC)
  median time:      42.813 ms (0.00% GC)
  mean time:        43.110 ms (0.00% GC)
  maximum time:     46.244 ms (0.00% GC)
  --------------
  samples:          116
  evals/sample:     1

julia> @benchmark freqtable($y)
BenchmarkTools.Trial:
  memory estimate:  10.16 KiB
  allocs estimate:  76
  --------------
  minimum time:     1.064 ms (0.00% GC)
  median time:      1.112 ms (0.00% GC)
  mean time:        1.129 ms (0.09% GC)
  maximum time:     3.485 ms (66.72% GC)
  --------------
  samples:          4403
  evals/sample:     1

julia> @benchmark countmap($y)
BenchmarkTools.Trial:
  memory estimate:  6.61 KiB
  allocs estimate:  10
  --------------
  minimum time:     87.141 ms (0.00% GC)
  median time:      88.167 ms (0.00% GC)
  mean time:        88.510 ms (0.00% GC)
  maximum time:     92.177 ms (0.00% GC)
  --------------
  samples:          57
  evals/sample:     1

julia> @benchmark freqtable($z)
BenchmarkTools.Trial:
  memory estimate:  45.81 MiB
  allocs estimate:  2000285
  --------------
  minimum time:     75.712 ms (3.94% GC)
  median time:      77.057 ms (3.94% GC)
  mean time:        77.346 ms (4.16% GC)
  maximum time:     83.298 ms (3.35% GC)
  --------------
  samples:          65
  evals/sample:     1

julia> @benchmark countmap($z)
BenchmarkTools.Trial:
  memory estimate:  6.61 KiB
  allocs estimate:  10
  --------------
  minimum time:     81.931 ms (0.00% GC)
  median time:      83.128 ms (0.00% GC)
  mean time:        83.472 ms (0.00% GC)
  maximum time:     89.977 ms (0.00% GC)
  --------------
  samples:          60
  evals/sample:     1


As you can see freqtable does really a good job on different types of inputs.

Actually there is a third way to do a similar using by form DataFrames  which is also quite fast but it is more messy. freqtable is more specialized - does one job, but does it well.

Here are the benchmarks of by:


julia> @benchmark by(DataFrame(x = $x), :x, nrow)
BenchmarkTools.Trial:
  memory estimate:  38.91 MiB
  allocs estimate:  5986
  --------------
  minimum time:     28.946 ms (1.56% GC)
  median time:      34.440 ms (14.82% GC)
  mean time:        34.291 ms (14.80% GC)
  maximum time:     41.079 ms (20.70% GC)
  --------------
  samples:          146
  evals/sample:     1

julia> @benchmark by(DataFrame(x = $y), :x, nrow)
BenchmarkTools.Trial:
  memory estimate:  38.92 MiB
  allocs estimate:  6198
  --------------
  minimum time:     44.810 ms (3.52% GC)
  median time:      50.244 ms (10.53% GC)
  mean time:        49.715 ms (10.38% GC)
  maximum time:     56.052 ms (17.67% GC)
  --------------
  samples:          101
  evals/sample:     1

julia> @benchmark by(DataFrame(x = $z), :x, nrow)
BenchmarkTools.Trial:
  memory estimate:  38.91 MiB
  allocs estimate:  5986
  --------------
  minimum time:     46.891 ms (0.93% GC)
  median time:      53.657 ms (10.12% GC)
  mean time:        52.539 ms (9.68% GC)
  maximum time:     60.736 ms (16.24% GC)
  --------------
  samples:          96
  evals/sample:     1

7 comments:

  1. Thanks for mentioning FreqTables, I did not know about it but it turns out to be very useful.

    ReplyDelete
  2. Thanks for the publicity! These functions should ideally live in StatsBase, but that's not possible without introducing a dependency on NamedArrays, which is blocked in particular by the competition with AxisArrays.

    FWIW, StatsBase.countmap() should be faster in recent StatsBase releases, normally even faster than freqtable() with bits types which can work with radix sort (at the expense of allocating a copy of the input).

    ReplyDelete
  3. Yes from my own testing `countmap` is faster; my `Pkg.status("StatsBase")` gives `# - StatsBase 0.19.4`.

    Also you may want to consider the unpublished (as of 1Jan2018) `fastby` from [FastGroupBy.jl](https://github.com/xiaodaigh/FastGroupBy.jl/blob/master/README.md). It allows you to do more than frequency counts for each by-group.

    (Shameless self-promotion aside) Eventually many of `FastGroupBy.jl` functionalities will pushed to `DataFrame` and `StatsBase` etc too.

    ReplyDelete
    Replies
    1. Thanks for the comment. It would be excellent to have faster *by* operations ported to DataFrames and StatsBase!

      I have tested your framework and have three suggestions you might consider (I am referring to master version):
      1) file string_sort/ccmp_sort.jl does not compile (presumably because you use `warning` instead of standard `warn` in several places);
      2) you could consider adding support for CategoricalVector as it internally stores data in a type that you support;
      3) you assume in the code that `fn` has the same return type on atomic value and on vector. This is often not true. E.g. `fastby(size, x, y)` fails. You could consider rewriting all `typeof(fn(y[1]))` and `typeof(fn(valvec[1]))` to `typeof(fn(y[1:1]))` and `typeof(fn(valvec[1:1]))`. This will work assuming that `fn`s return value is type-stable (but I guess this is the application area you are targeting).
      4) you could consider handling the case `length(byvec) == length(valvec) == 0` by returning empty `Dict` (now the function throws an errror).

      Delete
  4. yes there is support for categorical array and pooledarray but only in the function sumby. will add more. actually the faster countmap in StatsBase was contributed by me and i will continue to contribute more these fastby function become mature

    ReplyDelete