Simple Statistics 2

Illustration of a tree for the Simple Statistics 2 Release

var simpleStatistics = require('simple-statistics');

I just released Simple Statistics 2.0.0. I started the Simple Statistics project four years ago as a way to learn statistics and rally for readable code in the world of math.

In the meantime, I shipped lots of projects on my own and at Mapbox. CommonJS and Browserify became expectations, not wild experiments. JavaScript agglomerated features like Flow and Babel’s ability to use ES6, ES7, and beyond.

And I learned a bit about how documentation, coding, open source, and everything else works. I’ve been working on Simple Statistics at a steady pace, trying to move it along as I learn more and keep it compatible with the rest of the world. As I wrote back then in ‘Gravity Always Wins’, unmaintained projects become broken by default because the world moves away from them; versions increase, standards change, and software no longer fits.

Documentation

I’ve tried to focus my time on levers: the basic, primitive, and universally applicable things that produce the most change. So for Simple Statistics 2, a lot of my time was spent on a related project - documentation.js. By switching to modules, building a flexible system around JSDoc, and even testing code examples with jsdoctest, Simple Statistics can be both literate programming, as it was originally intended to be, and also as understandable and skimmable as a traditional JavaScript library.
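To make the idea of tested code examples concrete, here is a rough sketch of a JSDoc comment with an @example tag. The function name sumExample is hypothetical and is not taken from the library, and the `// =>` result comment is the convention I understand jsdoctest to read, so treat that detail as an assumption.

```javascript
/**
 * Add up a list of numbers.
 * Hypothetical helper for illustration; not the library's source.
 *
 * @param {Array<number>} x input values
 * @returns {number} the total, or 0 for an empty array
 * @example
 * sumExample([1, 2, 3]);
 * // => 6
 */
function sumExample(x) {
  let total = 0;
  for (let i = 0; i < x.length; i++) {
    total += x[i];
  }
  return total;
}
```

The same comment feeds the generated documentation and doubles as a small test, which is what lets the source stay readable top to bottom.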

Tests

I spent a lot of time ensuring that regressions will be rare and pull requests will be easy to review: Simple Statistics now has an extremely strict eslint configuration for code style, node-tap tests with over 99% coverage, and Flow annotations. Adding Flow coverage to this project was inspired by my experience adding Flow to Mapbox Studio: it identifies issues, but it also shows you blind spots where types are accidentally imprecise. Simple Statistics finally settles what to do with invalid input: instead of returning a variety of null, undefined, and NaN, it will always return NaN for unknown output and throw errors for invalid input.
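The NaN-versus-throw convention can be sketched roughly like this. The function sampleVarianceSketch is hypothetical, and the choice of which cases count as "invalid input" and which as "unknown output" is my assumption for illustration, not a description of what each library method actually does.

```javascript
// Hypothetical function illustrating the convention; not the library's source.
function sampleVarianceSketch(x) {
  // Invalid input: the caller passed something that is not a list of numbers.
  if (!Array.isArray(x)) {
    throw new TypeError('expected an array of numbers');
  }
  for (let i = 0; i < x.length; i++) {
    if (typeof x[i] !== 'number') {
      throw new TypeError('expected every element to be a number');
    }
  }

  // Unknown output: the input is well formed, but the sample variance
  // of fewer than two values is not defined, so the answer is NaN.
  if (x.length < 2) {
    return NaN;
  }

  let mean = 0;
  for (let i = 0; i < x.length; i++) {
    mean += x[i] / x.length;
  }

  let sumOfSquares = 0;
  for (let i = 0; i < x.length; i++) {
    sumOfSquares += (x[i] - mean) * (x[i] - mean);
  }
  return sumOfSquares / (x.length - 1);
}
```

Callers then have exactly two things to handle: a thrown error, or a number that may be NaN.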

Performance & improvements

I had a few ‘aha’ moments that inspired improvements in Simple Statistics: realizing the critical role of sorting inspired faster sorted-input versions of methods, and thinking hard about big computations led me to an implementation of Kahan summation as a better default than naïve summation.
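For readers who haven't met it, Kahan summation keeps a second running value that captures the low-order bits lost each time a small number is added to a large total. The following is a textbook sketch next to a naïve loop; the names naiveSum and kahanSum are mine, and this is not copied from the library's source.

```javascript
// Plain left-to-right summation, for comparison.
function naiveSum(x) {
  let sum = 0;
  for (let i = 0; i < x.length; i++) {
    sum += x[i];
  }
  return sum;
}

// Textbook Kahan (compensated) summation; an illustrative sketch.
function kahanSum(x) {
  let sum = 0;
  // Running estimate of the rounding error accumulated so far.
  let compensation = 0;
  for (let i = 0; i < x.length; i++) {
    // Remove the previously lost error from the incoming value.
    const corrected = x[i] - compensation;
    const next = sum + corrected;
    // (next - sum) is what was actually added; subtracting what we
    // meant to add leaves the rounding error from this step.
    compensation = (next - sum) - corrected;
    sum = next;
  }
  return sum;
}
```

On long runs of values such as 0.1, which has no exact binary representation, the compensated total typically lands much closer to the exact answer than the naïve one, at the cost of a few extra operations per element.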

My friend and coworker Vladimir Agafonkin contributed a change that calculates the standard normal table instead of storing it, which cuts the library's byte size - it is really tiny now. He also contributed an implementation of the Quickselect algorithm, which gives some of the advantages of sorting by only partially sorting an input.
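To show what "partially sorting" means, here is a plain textbook quickselect that finds the k-th smallest value without sorting the whole array. It uses a simple middle-element pivot and works on a copy; the name quickselectSketch is hypothetical, and this is not the implementation that was contributed to the library.

```javascript
// Swap two positions of an array in place.
function swapAt(arr, i, j) {
  const tmp = arr[i];
  arr[i] = arr[j];
  arr[j] = tmp;
}

// Return the k-th smallest value (0-indexed) of `values`.
// Textbook sketch for illustration; the input array is not modified.
function quickselectSketch(values, k) {
  if (!Array.isArray(values) || values.length === 0) {
    throw new Error('expected a non-empty array');
  }
  if (!Number.isInteger(k) || k < 0 || k >= values.length) {
    throw new RangeError('k is out of range');
  }

  const arr = values.slice();
  let left = 0;
  let right = arr.length - 1;

  while (left < right) {
    // Pick the middle element as pivot and park it at the right edge.
    const mid = left + ((right - left) >> 1);
    swapAt(arr, mid, right);
    const pivot = arr[right];

    // Move everything smaller than the pivot to the front of the range.
    let store = left;
    for (let i = left; i < right; i++) {
      if (arr[i] < pivot) {
        swapAt(arr, i, store);
        store++;
      }
    }
    // Put the pivot into its final sorted position.
    swapAt(arr, store, right);

    if (store === k) {
      return arr[k];
    }
    // Only the side that contains position k needs more work.
    if (store < k) {
      left = store + 1;
    } else {
      right = store - 1;
    }
  }
  return arr[k];
}
```

For example, quickselectSketch([7, 2, 9, 4, 1], 2) returns 4, the median, while leaving most of the array unordered. That is the saving over a full sort: each pass discards one side of the partition.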

James McGuigan contributed a new product() method that computes the product of an array of numbers. All in all, Simple Statistics now has 21 total contributors and I’d like to thank them all for making it a great project.

What’s next

Simple Statistics is pretty good: it does what it sets out to do. I think that if there are people who want it to handle gigantic datasets, implementing the online algorithms as reducers will be the way to do that. That would be interesting, but needs a concrete user in order to be really built.
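As a sketch of what "online algorithms as reducers" could look like, here is Welford's method for a running mean and variance written as a function that fits Array.prototype.reduce. Welford's method is my choice of example, and the names welfordReducer, initialWelfordState, and sampleVarianceFromState are hypothetical; none of this exists in the library.

```javascript
// Starting state: no values seen yet.
const initialWelfordState = { count: 0, mean: 0, m2: 0 };

// Fold one value into the running state. `m2` accumulates the sum of
// squared differences from the current mean (Welford's method).
function welfordReducer(state, value) {
  const count = state.count + 1;
  const delta = value - state.mean;
  const mean = state.mean + delta / count;
  const m2 = state.m2 + delta * (value - mean);
  return { count: count, mean: mean, m2: m2 };
}

// Sample variance from a state; NaN when fewer than two values were seen.
function sampleVarianceFromState(state) {
  return state.count < 2 ? NaN : state.m2 / (state.count - 1);
}
```

Because the state is a small fixed-size object, the data never has to sit in memory all at once: `[2, 4, 4, 4, 5, 5, 7, 9].reduce(welfordReducer, initialWelfordState)` yields a mean of 5 and a sample variance of 32/7, up to floating-point rounding, and the same reducer could be fed from a stream one value at a time.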

There are still a few statistical methods that we haven’t implemented yet and that would be fun to add. New machine learning algorithms would also be a fun addition, if anyone wants to implement them.

Check out the simple-statistics repository and website and docs!

June 2, 2016 Tom MacWright (@tmcw, @tmcw@mastodon.social)