Documentation

Overview

    Bigpi is an example bigmachine program that estimates digits of Pi using the Monte Carlo method. It distributes work by starting multiple machines and calling each of them to draw samples. Each call returns the number of its samples that fell inside the unit circle, and these counts are added up into a total.

    We can run it locally with a small number of samples to test it:

    % bigpi -n 1000000
    2018/03/16 15:21:05 waiting for machines to come online
    2018/03/16 15:21:08 machine http://localhost:63880/ RUNNING
    2018/03/16 15:21:08 machine http://localhost:63878/ RUNNING
    2018/03/16 15:21:08 machine http://localhost:63879/ RUNNING
    2018/03/16 15:21:08 machine http://localhost:63881/ RUNNING
    2018/03/16 15:21:08 machine http://localhost:63877/ RUNNING
    2018/03/16 15:21:08 all machines are ready
    2018/03/16 15:21:08 distributing work among 5 cores
    http://localhost:63878/: 2018/03/16 15:21:08 0/200000
    http://localhost:63880/: 2018/03/16 15:21:08 0/200000
    http://localhost:63879/: 2018/03/16 15:21:08 0/200000
    http://localhost:63881/: 2018/03/16 15:21:08 0/200000
    2018/03/16 15:21:08 total=784425 nsamples=1000000
    π = 3.1377
    

    By using large EC2 instances, we can easily distribute the work over hundreds of cores:

    % bigpi -bigsystem ec2 -bigec2type c5.18xlarge -n 1000000000000
    2018/03/20 21:00:05 waiting for machines to come online
    2018/03/20 21:01:09 machine https://ec2-54-213-185-145.us-west-2.compute.amazonaws.com:2000/ RUNNING
    2018/03/20 21:01:09 machine https://ec2-35-164-137-2.us-west-2.compute.amazonaws.com:2000/ RUNNING
    2018/03/20 21:01:09 machine https://ec2-34-208-105-231.us-west-2.compute.amazonaws.com:2000/ RUNNING
    2018/03/20 21:01:09 machine https://ec2-34-211-149-59.us-west-2.compute.amazonaws.com:2000/ RUNNING
    2018/03/20 21:01:09 machine https://ec2-34-223-251-92.us-west-2.compute.amazonaws.com:2000/ RUNNING
    2018/03/20 21:01:09 all machines are ready
    2018/03/20 21:01:09 distributing work among 360 cores
    https://ec2-34-208-105-231.us-west-2.compute.amazonaws.com:2000/: 2018/03/20 21:01:09 0/2777777777
    https://ec2-34-223-251-92.us-west-2.compute.amazonaws.com:2000/: 2018/03/20 21:01:09 0/2777777777
    ...
    2018/03/20 21:13:27 total=785397678380 nsamples=1000000000000
    π = 3.141590713520
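
    The per-machine progress lines (0/200000 with 5 cores locally, 0/2777777777 with 360 cores on EC2) are consistent with an even split of the samples using integer division. The sketch below reproduces those figures; perCore is a hypothetical name, and how the program treats any remainder is not shown in the logs, so this example simply drops it.

```go
package main

import "fmt"

// perCore returns the share of samples handed to each core when nsamples is
// split evenly with integer division. Any remainder is dropped here, which is
// an assumption of this sketch rather than documented behavior.
func perCore(nsamples, ncores uint64) uint64 {
	return nsamples / ncores
}

func main() {
	fmt.Println(perCore(1000000, 5))         // local run: 200000
	fmt.Println(perCore(1000000000000, 360)) // EC2 run: 2777777777
}
```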
    

    Once a bigmachine program is running, we can profile it using the standard Go pprof tooling. The returned profile is sampled from the whole cluster and merged. In the first iteration of this program, this helped find a bug: we were using the global rand.Float64, which requires a lock. The CPU profile made the lock contention easy to spot:

    % go tool pprof localhost:3333/debug/bigmachine/pprof/profile
    Fetching profile over HTTP from http://localhost:3333/debug/bigmachine/pprof/profile
    Saved profile in /Users/marius/pprof/pprof.045821636.samples.cpu.001.pb.gz
    File: 045821636
    Type: cpu
    Time: Mar 16, 2018 at 3:17pm (PDT)
    Duration: 2.51mins, Total samples = 16.80mins (669.32%)
    Entering interactive mode (type "help" for commands, "o" for options)
    (pprof) top
    Showing nodes accounting for 779.47s, 77.31% of 1008.18s total
    Dropped 51 nodes (cum <= 5.04s)
    Showing top 10 nodes out of 58
          flat  flat%   sum%        cum   cum%
       333.11s 33.04% 33.04%    333.11s 33.04%  runtime.procyield
       116.71s 11.58% 44.62%    469.55s 46.57%  runtime.lock
        76.35s  7.57% 52.19%    347.21s 34.44%  sync.(*Mutex).Lock
        65.79s  6.53% 58.72%     65.79s  6.53%  runtime.futex
        41.48s  4.11% 62.83%    202.05s 20.04%  sync.(*Mutex).Unlock
        34.10s  3.38% 66.21%    364.36s 36.14%  runtime.findrunnable
           33s  3.27% 69.49%        33s  3.27%  runtime.cansemacquire
        32.72s  3.25% 72.73%     51.01s  5.06%  runtime.runqgrab
        24.88s  2.47% 75.20%     57.72s  5.73%  runtime.unlock
        21.33s  2.12% 77.31%     21.33s  2.12%  math/rand.(*rngSource).Uint64
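
    One common way to avoid this kind of contention is to give every goroutine its own *rand.Rand instead of calling the package-level rand.Float64. The sketch below shows that pattern; sampleParallel, the seeding scheme, and the worker count are assumptions made for this example and may differ from what the program actually does.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// sampleParallel splits n samples evenly across the given number of
// goroutines and returns the total number of hits. Each goroutine owns a
// private *rand.Rand, so no generator (and no lock) is shared between them.
// (Hypothetical helper, for illustration.)
func sampleParallel(n int64, workers int, seed int64) int64 {
	counts := make([]int64, workers) // one slot per goroutine, no sharing
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			// A distinct seed per goroutine keeps the streams different.
			rng := rand.New(rand.NewSource(seed + int64(w)))
			var inside int64
			for i := int64(0); i < n/int64(workers); i++ {
				x, y := rng.Float64(), rng.Float64()
				if x*x+y*y <= 1 {
					inside++
				}
			}
			counts[w] = inside
		}(w)
	}
	wg.Wait()

	var total int64
	for _, c := range counts {
		total += c
	}
	return total
}

func main() {
	const n, workers = 4000000, 4
	total := sampleParallel(n, workers, 1)
	fmt.Printf("π ≈ %.4f\n", 4*float64(total)/float64(n))
}
```

    A *rand.Rand created with rand.New is not safe for concurrent use, which is exactly why it carries no lock: it must stay confined to the goroutine that created it.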
    

    Source Files