Wednesday, December 20, 2017

An update of my tutorial on DataFrames

Since my last post about my tutorial to DataFrames I had switched to use them a lot more in my daily Julia workflow. Based on the experience I have added three sections to it:

  1. performance recommendations;
  2. possible pitfalls when using DataFrames;
  3. useful packages, currently FreqTables and DataFramesMeta.
Actually all three sections would benefit from the experience of the community, so if you have a comment please make an issue or PR on https://github.com/bkamins/Julia-DataFrames-Tutorial.

As for useful packages I have tried to use raw DataFrames to reduce dependencies of my code, but actually FreqTables and DataFramesMeta helped me a lot and did not give too much mental overhead of things to remember.

I would like to especially recommend FreqTables - a small package, but really useful. I would say that it deserves much more attention from the community than it gets (looking at the number of stars). So let me write a bit about it.

There are three reasons I like it:
  1. simply I make contingency tables almost all the time; previously I have used countmap from StatsBase a lot, but it returns a dictionary which is not very handy; freqtable returns a much nicer result (e.g. if possible it is sorted) and allows for more than one dimension;
  2. with freqtable I can use vectors or work on data frames, it nicely handles missings and allows for weighting;
  3. freqtable is faster than countmap (I was surprised when I learned this, maybe not a critical thing but a nice plus).
So how does the output from freqtable  look? Here is a sampler:


julia> using DataFrames, FreqTables

julia> srand(1); df = DataFrame(rand(1:3, 10, 2))
10×2 DataFrames.DataFrame
│ Row │ x1 │ x2 │
├─────┼────┼────┤
│ 1   │ 3  │ 2  │
│ 2   │ 3  │ 1  │
│ 3   │ 3  │ 1  │
│ 4   │ 3  │ 2  │
│ 5   │ 1  │ 2  │
│ 6   │ 1  │ 2  │
│ 7   │ 1  │ 3  │
│ 8   │ 2  │ 3  │
│ 9   │ 1  │ 1  │
│ 10  │ 1  │ 2  │

julia> freqtable(df, :x1, :x2)
3×3 Named Array{Int64,2}
x1 ╲ x2 │ 1  2  3
────────┼────────
1       │ 1  3  1
2       │ 0  0  1
3       │ 2  2  0


And now a simple benchmark against countmap :


julia> using DataFrames, FreqTables, StatsBase, BenchmarkTools

julia> srand(1); x = rand(1:100, 10^6); y = categorical(x); z = string.(x);

julia> @benchmark freqtable($x)
BenchmarkTools.Trial:
  memory estimate:  25.89 KiB
  allocs estimate:  83
  --------------
  minimum time:     24.246 ms (0.00% GC)
  median time:      24.672 ms (0.00% GC)
  mean time:        25.425 ms (0.00% GC)
  maximum time:     39.739 ms (0.00% GC)
  --------------
  samples:          197
  evals/sample:     1

julia> @benchmark countmap($x)
BenchmarkTools.Trial:
  memory estimate:  6.61 KiB
  allocs estimate:  10
  --------------
  minimum time:     42.230 ms (0.00% GC)
  median time:      42.813 ms (0.00% GC)
  mean time:        43.110 ms (0.00% GC)
  maximum time:     46.244 ms (0.00% GC)
  --------------
  samples:          116
  evals/sample:     1

julia> @benchmark freqtable($y)
BenchmarkTools.Trial:
  memory estimate:  10.16 KiB
  allocs estimate:  76
  --------------
  minimum time:     1.064 ms (0.00% GC)
  median time:      1.112 ms (0.00% GC)
  mean time:        1.129 ms (0.09% GC)
  maximum time:     3.485 ms (66.72% GC)
  --------------
  samples:          4403
  evals/sample:     1

julia> @benchmark countmap($y)
BenchmarkTools.Trial:
  memory estimate:  6.61 KiB
  allocs estimate:  10
  --------------
  minimum time:     87.141 ms (0.00% GC)
  median time:      88.167 ms (0.00% GC)
  mean time:        88.510 ms (0.00% GC)
  maximum time:     92.177 ms (0.00% GC)
  --------------
  samples:          57
  evals/sample:     1

julia> @benchmark freqtable($z)
BenchmarkTools.Trial:
  memory estimate:  45.81 MiB
  allocs estimate:  2000285
  --------------
  minimum time:     75.712 ms (3.94% GC)
  median time:      77.057 ms (3.94% GC)
  mean time:        77.346 ms (4.16% GC)
  maximum time:     83.298 ms (3.35% GC)
  --------------
  samples:          65
  evals/sample:     1

julia> @benchmark countmap($z)
BenchmarkTools.Trial:
  memory estimate:  6.61 KiB
  allocs estimate:  10
  --------------
  minimum time:     81.931 ms (0.00% GC)
  median time:      83.128 ms (0.00% GC)
  mean time:        83.472 ms (0.00% GC)
  maximum time:     89.977 ms (0.00% GC)
  --------------
  samples:          60
  evals/sample:     1


As you can see freqtable does really a good job on different types of inputs.

Actually there is a third way to do a similar using by form DataFrames  which is also quite fast but it is more messy. freqtable is more specialized - does one job, but does it well.

Here are the benchmarks of by:


julia> @benchmark by(DataFrame(x = $x), :x, nrow)
BenchmarkTools.Trial:
  memory estimate:  38.91 MiB
  allocs estimate:  5986
  --------------
  minimum time:     28.946 ms (1.56% GC)
  median time:      34.440 ms (14.82% GC)
  mean time:        34.291 ms (14.80% GC)
  maximum time:     41.079 ms (20.70% GC)
  --------------
  samples:          146
  evals/sample:     1

julia> @benchmark by(DataFrame(x = $y), :x, nrow)
BenchmarkTools.Trial:
  memory estimate:  38.92 MiB
  allocs estimate:  6198
  --------------
  minimum time:     44.810 ms (3.52% GC)
  median time:      50.244 ms (10.53% GC)
  mean time:        49.715 ms (10.38% GC)
  maximum time:     56.052 ms (17.67% GC)
  --------------
  samples:          101
  evals/sample:     1

julia> @benchmark by(DataFrame(x = $z), :x, nrow)
BenchmarkTools.Trial:
  memory estimate:  38.91 MiB
  allocs estimate:  5986
  --------------
  minimum time:     46.891 ms (0.93% GC)
  median time:      53.657 ms (10.12% GC)
  mean time:        52.539 ms (9.68% GC)
  maximum time:     60.736 ms (16.24% GC)
  --------------
  samples:          96
  evals/sample:     1

Tuesday, December 5, 2017

A Tutorial on DataFrames in Julia

This time a blog post is more of an announcement :).

After the release of DataFrames 0.11 in Julia I have decided that it the ecosystem has stabilized enough to add it to my teaching materials.

There were some Issues and PRs submitted in the process. Finally, I have managed to compile the basic notes about using DataFrames. If you would be interested in them I have published them on GitHub here.

Sunday, November 26, 2017

Basics of generating random numbers in Julia

Recently there were two questions regarding random number generation in Julia: one in discussion on Discourse and the other on Stack Overflow.
In this post I have tried to collect the basic information related to those two topics.

How does rand() work?

The default random number generator in Julia uses Mersenne Twister algorithm. By default a global instance of this generator is used. It can be directly accessed using variable Base.GLOBAL_RNG. Therefore calling rand() and rand(Base.GLOBAL_RNG) provide identical results.

You can set a seed to random number generator using srand function. There are several options for this but two most popular are:
  1. Calling srand() which will make Julia to use the system entropy to get a seed (you can think of this as random seed);
  2. Calling srand(i) where i is an integer. In this case you will get a reproducible sequence of random numbers.
An important detail of rand() is that it produces a value in the interval [0,1[ and generates exactly 52 bits of randomness - it can produce 2⁵² distinct values. In order to understand why you must know that internally Julia generates a value from the interval [1, 2[ and then subtracts 1 from it. The reason for this approach is that Float64 has 52 bit significand and in range [1,2[ all representable floating point values are equally spaced, see e.g. Wikipedia. This ensures that values generated by rand() are uniformly distributed.

Generating something else than one float from [0, 1[ interval

There are many options here, please refer to the Julia Manual for details. What I want to highlight here are the consequences of the fact that the underlying random number generator produces only 52 bits of randomness. Below we measure the time of obtaining one random sample of Int32 and Int64. Here is the benchmark:

julia> using BenchmarkTools

julia> @btime rand()
  7.464 ns (0 allocations: 0 bytes)
0.5055246442408914

julia> @btime rand(Int32)
  7.464 ns (0 allocations: 0 bytes)
37355051

julia> @btime rand(Int64)
  10.730 ns (0 allocations: 0 bytes)
2952626120715896373

The reason for the difference is that in order to get Int64 we need to sample twice because 64>52. You might ask if this would matter in practical code. Actually it does, because on most machines you run Julia Int is Int64. Here is an example of sampling a value from a range (my machine is 64 bit):

julia> @btime rand(1:10)
  37.789 ns (0 allocations: 0 bytes)
5

julia> @btime Int(rand(Int32(1):Int32(10)))
  26.126 ns (0 allocations: 0 bytes)
4

If you can accept a bit of inaccuracy in the distribution you can get more speed for generating 1:n range if n is small with the following code (I use n=10 as above):

julia> @btime ceil(Int, 10rand())
  15.395 ns (0 allocations: 0 bytes)
6

The discussion why it is inaccurate is presented on Discourse. Here let me give you an extreme example (the proper value of the evaluated expression is 0.5 in both cases):

julia> mean(ceil(Int64, 2^60 * rand()) % Bool for i in 1:10^6)
0.0

julia> mean(rand(1:2^60) % Bool for i in 1:10^6)
0.500582

We know why the first approach fails - Float64 does not have enough precision and we will always get an even number if we multiply such a value by 2⁶⁰.

And this is a less extreme example, but when multiplying by something smaller than 2⁵² (the correct result is 1):

julia> mean(rand(1:(2^50+2^49)) % 3 for i in 1:10^7)
1.0000001

julia> mean(ceil(Int64, (2^50+2^49) * rand()) % 3 for i in 1:10^7)
0.9586021

And again we see a bias (although not as large).

Using multiple random number generators

Sometimes it is enough to have a single source of pseudorandomness. In such cases Base.GLOBAL_RNG is enough. However, in many settings (like the one discussed on Stack Overflow) we need more than one independent random number generator. The usual approach is to generate seeds for consecutive random number generators that ensure that sequences produced by them do not overlap. In the paper Efficient Jump Ahead for F₂-Linear RandomNumber Generators you can find the detailed discussion of the topic.

In Julia you can generate such independent generators using randjump function. You give it a number of generators you require and they are guaranteed to use seeds that correspond to random number generator states separated by 10²⁰ steps.

There is one good practice if you decide use randjump. Call it only once do produce the required generators. Consecutive uses of randjump based on the same random number generator can give surprising results. In order to understand the reason for this approach consider the following example:

julia> srand(1)
MersenneTwister(UInt32[0x00000001], Base.dSFMT.DSFMT_state(Int32[1749029653, 1072851681, 1610647787, 1072862326, 1841712345, 1073426746, -198061126, 1073322060, -156153802, 1073567984  …  1977574422, 1073209915, 278919868, 1072835605, 1290372147, 18858467, 1815133874, -1716870370, 382, 0]), [1.23603, 1.34652, 1.31271, 1.00791, 1.48861, 1.21097, 1.95192, 1.9999, 1.25166, 1.98667  …  1.1489, 1.77623, 1.16774, 1.00214, 1.46738, 1.03244, 1.36938, 1.75271, 1.43027, 1.71293], 382)

julia> rand(100);

julia> x = randjump(Base.GLOBAL_RNG, 2)
2-element Array{MersenneTwister,1}:
 MersenneTwister(UInt32[0x00000001], Base.dSFMT.DSFMT_state(Int32[-423222098, 1072940746, 1823958146, 1073056597, 94617959, 1073021145, 2081944769, 1072701541, -1344696523, 1073205595  …  -1005791883, 1073144418, 24484970, 1073440808, 1370926729, 1336278534, -1527371338, -19485865, 382, 0]), [1.23603, 1.34652, 1.31271, 1.00791, 1.48861, 1.21097, 1.95192, 1.9999, 1.25166, 1.98667  …
  1.1489, 1.77623, 1.16774, 1.00214, 1.46738, 1.03244, 1.36938, 1.75271, 1.43027, 1.71293], 100)
 MersenneTwister(UInt32[0x00000001], Base.dSFMT.DSFMT_state(Int32[2095040610, 1073104849, 93118390, 1073625566, 1967429533, 1073003175, -889983042, 1073278562, 29166119, 1073602896  …  1102161927, 1072737217, -1901932835, 1073559995, 1843649418, -1848103721, -1079494630, -1219397333, 382, 0]), [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 382)

julia> rand();

julia> y = randjump(Base.GLOBAL_RNG, 2)
2-element Array{MersenneTwister,1}:
 MersenneTwister(UInt32[0x00000001], Base.dSFMT.DSFMT_state(Int32[-423222098, 1072940746, 1823958146, 1073056597, 94617959, 1073021145, 2081944769, 1072701541, -1344696523, 1073205595  …  -1005791883, 1073144418, 24484970, 1073440808, 1370926729, 1336278534, -1527371338, -19485865, 382, 0]), [1.23603, 1.34652, 1.31271, 1.00791, 1.48861, 1.21097, 1.95192, 1.9999, 1.25166, 1.98667  …
  1.1489, 1.77623, 1.16774, 1.00214, 1.46738, 1.03244, 1.36938, 1.75271, 1.43027, 1.71293], 101)
 MersenneTwister(UInt32[0x00000001], Base.dSFMT.DSFMT_state(Int32[2095040610, 1073104849, 93118390, 1073625566, 1967429533, 1073003175, -889983042, 1073278562, 29166119, 1073602896  …  1102161927, 1072737217, -1901932835, 1073559995, 1843649418, -1848103721, -1079494630, -1219397333, 382, 0]), [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 382)

julia> x[1] === y[1]
true

julia> x[2] === y[2]
false

julia> x[2] == y[2]
true

What do we learn from it:

  • The first element of a vector of random number generators returned by randjump is simply its first argument - we can see it because x[1] === y[1] is true. No copying or jumping ahead is done.
  • Even though we have produced one random sample after creating x and before creating y (thus the state of Base.GLOBAL_RNG was not identical) the second elements of x and y have the same state, which can be seen by testing x[2] == y[2] (although they do not point to the same generator as x[2] === y[2] is false); the reason for this is that rand in Julia generates random numbers in batches of 382 size and calling rand() once in our case did not trigger the generation of the next batch of random numbers. So even though the state of Base.GLOBAL_RNG when generating x and y was not identical the internal state of Mersenne Twister generator was (it can be checked using Base.GLOBAL_RNG.state).
In short - it is safest not to call randjump several times in the same code.

Thursday, October 19, 2017

PyData Warsaw 2017 example

Following several requests in this post I am presenting the Asian option pricing example that I have discussed during PyData Warsaw 2017 (after the first day of talks I highly recommend everyone to come tomorrow as it is an excellent event).

The problem is taken from the book Foundations and Methods of Stochastic Simulation by Barry Nelson where it is solved using VBA in section 4.5.

I will not repeat the whole discussion of the model, but focus on the numerical aspect. We are asked to calculate the value of the expression:


where X is generated by a continuous state geometric Brownian motion. We will use Monte Carlo simulation to approximate the above expected value.

A single pass of the Monte Carlo simulation is approximated by a discrete sum:

The parameters we will use in the simulation are: T=1, r=0.05, K=55, σ=0.3, m=1000 and starting value of X at time 0 is 50. We will run 100,000 replications of Monte Carlo simulation and calculate the average.

I want to compare five implementations of the above problem:
  • Julia using loops
  • Julia using vectorized code
  • Python using loops
  • Numba using loops
  • NumPy using vectorized code
First go the codes in Julia (running time is given in the comment at the end):

And here are the codes in Python:

Julia implementation is a bit more fancy as it promotes the arguments to a common type on the run and in Python I pass all values as floats (which is simpler). Also vectorized code in Julia uses in-place updating. In general I tried to make the implementations a normal code in respective languages.

The key take-aways are the following:

  • standard Python is a no-go solution for this problem;
  • loops in Julia are fastest;
  • somewhat surprisingly vectorized Julia code is faster than Numba although the former has to allocate more memory;
  • NumPy implementation is around three times slower than vectorized Julia;
  • Vectorized Julia code hugely benefits from in-place operations (that avoid memory allocation); however, even without these optimizations it was faster than Numba.

Tuesday, September 26, 2017

Bridge hand distribution: simulation vs exact calculation

I have decided to start blogging about Julia language after having blogged about R for several years.
The objective of this blog is to provide short examples of how Julia can be used to solve computational problems.

For the start I have thought about reproducing the code generating distribution of high card points (HCP) in a pair or bridge hands that I have implemented in R some time ago (you can find it here).

The reason I have chosen this example for the first post is that implementation in Julia is much simpler than in R (well - it is also faster, but the gains are eaten by compilation times required for plotting).

Here is the code:
By simpler I mean that in this problem using loops is a natural approach. The benefits in my opinion are (you can compare to code in R here):

  • the code uses much less memory; which would be a significant issue in simulation approach in R implementation, consider sampling not 1,000,000, but e.g. 10,000,000,000 points (and in Julia it would be simple to run such a simulation on multiple cores);
  • the exact calculation code is much simpler to understand as you do not have to make workarounds to vectorize your calculations.

And here is the result you should get when you run the code: