jStart Adopts Knowledge Anyhow for Advanced Data Analysis
By: Ted Morris
In the beginning was the REPL
That’s Read-Evaluate-Print Loop, perhaps more familiarly known as an interactive command shell. For data analysts working in Python or R or Scala/Spark, the REPL is often the point of departure for learning, and later doing, interactive data analysis. Sadly, as an analyst’s skill increases and the scope of the problem at hand grows, the REPL quickly becomes a source of frustration: sessions are lost, ideas are difficult to reconstruct, output is hard to manage, and code reuse is nearly impossible. And so came the IPython Notebook, an electronic version of the diaries, journals, and notebooks that thinkers of all stripes have kept since far, far back in history.
Leonardo da Vinci’s notebooks are renowned as a record of the evolution of his thinking and worldview. They contain many of his sketches and drawings, including the Vitruvian Man, the famous illustration based on Vitruvius’ treatise on symmetry and proportion. Other well-known notebooks include those by Einstein (“Travel Diary to the USA”), Mark Twain, Beatrix Potter, Thomas Edison, Alan Turing, Jack Kerouac, Charles Darwin, Ludwig van Beethoven, Pablo Picasso … well, the list goes on and on.
The notebook isn’t just for famous writers, inventors, and scientists, though; it’s a tool for all of us. The notebook format, with original text intermingled with sketches, references, and quotes, is a great way to record ideas. Far beyond that, it’s the foundation for a process of thinking, a way of having a dialog with ideas and of recording that dialog for future reference. Extended to the world of computational data analysis, the notebook is a powerful vehicle for writing code, trying ideas, documenting thought processes, and encapsulating results in a form that can be consumed by others. As an analyst, I think of it as my vehicle for having a dialog with my data, and – perhaps more importantly – for telling a story with my data.
IPython is an interactive notebook environment for Python and its associated analytics stack, which includes NumPy, SciPy, pandas, Matplotlib, and related Python libraries. Developed originally for scientific computing by Fernando Pérez, IPython has blossomed into a widely adopted open source tool used by analysts in many domains. A large user community has developed, and Pérez and team have curated a rich and diverse collection of notebook examples. A telling analysis of IPython notebooks on GitHub may be found here. This, by the way, is also a nice example of the kinds of things you can do with the Python analytics libraries in IPython.
Our team has embraced the notebook concept for data-driven customer solutions. To help bootstrap our work, we use IBM Knowledge Anyhow Workbench (KAWB), an enhanced IPython/Jupyter notebook hosted as a SaaS offering on IBM SoftLayer. KAWB gives us a walk-up-and-use cloud environment for doing ad hoc analytics and creating data products in interactive notebooks. It includes features for managing and searching collections of notebooks, reusing analytics, sharing notebooks with colleagues, and publishing results to the cloud. For more details and to sign up for your own workbench, see https://knowledgeanyhow.org.
As the client-facing arm of IBM Emerging Tech, jStart is diving into Knowledge Anyhow with both feet. We’re spinning up pilot projects in domains as diverse as ocean powerboat racing, elementary education reading proficiency, and biochemical research. We are especially excited about the newly released support for Spark and the Berkeley Data Analytics Stack (BDAS), and will be using the combination in several projects over the coming months. If you are interested in trying out the notebook approach to data analysis, whether in Python, Scala, or R, we’d like to help you get started.