Docker for research

… and data analysis

J. Fernando Sánchez (jf.sanchez@upm.es)

2018

Intro

Before we begin

Code available at:

https://github.com/balkian/lab-in-a-box

Live demos at:

https://github.todevnull.com

https://lab.todevnull.com

https://hub.todevnull.com

Feel free to log in, but try not to break them for now 😉

My name is Fernando and…

At Grupo de Sistemas Inteligentes

  • Machine Learning and Big Data
  • NLP and Sentiment Analysis
  • Social Network Analysis
  • Agents and Simulation
  • Linked Data and Semantic Technologies

http://www.gsi.dit.upm.es

And I ❤ Docker

  • Docker+research for 3+ years
  • Advocate for ~2 years
  • Internal infrastructure: Ansible, Kubernetes and Docker
  • Teach (with) it

About this talk

Takeaway: you can set up an isolated, multi-user data analysis environment in minutes

Plus: using Docker to run and share experiments is even easier

Related Meetups:

Big Data and Machine Learning with Docker

Using Docker in Machine Learning Projects

For researchers

Experiment, publish, repeat

Reproducibility

@ianholmes

Obstacles

  • Missing data
  • Bleeding edge tools and libraries
  • Throwaway software
    • Hacky
    • Little to no documentation
  • Multiple languages

Obstacles

Is it a problem?

https://www.nature.com/

Jupyter notebooks

Jupyter architecture

http://jupyter.readthedocs.io

Docker to the rescue

towardsdatascience.com

Jupyter/docker-stacks

Reproducible environment

And friendly, too
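
As a rough sketch of what a reproducible notebook environment can look like in practice, the Python snippet below assembles a `docker run` command for one of the jupyter/docker-stacks images. The image name `jupyter/scipy-notebook`, the `latest` tag, the port, the mount point and the helper name `build_run_command` are assumptions for illustration; none of them come from the talk.

```python
import shlex
from pathlib import Path


def build_run_command(image="jupyter/scipy-notebook", tag="latest",
                      workdir=".", port=8888):
    """Return the argument list for running a docker-stacks notebook image.

    All defaults are illustrative assumptions. For reproducibility, replace
    "latest" with a specific tag that you have tested.
    """
    host_dir = Path(workdir).resolve()
    return [
        "docker", "run", "--rm",
        # Publish the notebook server (container port 8888) on the host.
        "-p", f"{port}:8888",
        # Mount the experiment folder into the default user's work directory.
        "-v", f"{host_dir}:/home/jovyan/work",
        f"{image}:{tag}",
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected or pasted into a shell.
    print(shlex.join(build_run_command()))
```

Pinning the tag is what makes the environment repeatable: two people running the same image and tag get the same libraries, regardless of what is installed on their own machines.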

For small groups

Requirements

  • Shared environments
  • Resource sharing
  • Easy configuration
  • Versioning
  • Backups

And little to no overhead

Isolation

JupyterHub

Authenticators

  • Local
  • OAuth
  • LDAP
  • JWT

Spawners

  • Local
  • Docker
  • Kubernetes
  • Marathon
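
Authenticators and spawners are pluggable: JupyterHub picks each one from a class path set in its configuration file. The sketch below generates a minimal `jupyterhub_config.py` from the short names used on these slides. The lookup tables and the helper `render_hub_config` are hypothetical, and the choice of GitHub as the OAuth provider is an assumption. JWT and Marathon are omitted because they come from third-party packages whose class paths are not assumed here. Every non-local class lives in a separate package that must be installed alongside JupyterHub.

```python
# Hypothetical lookup tables: short names from the slides mapped to class
# paths from JupyterHub and its companion packages.
AUTHENTICATORS = {
    "local": "jupyterhub.auth.PAMAuthenticator",
    "oauth": "oauthenticator.github.GitHubOAuthenticator",  # assumed provider
    "ldap": "ldapauthenticator.LDAPAuthenticator",
}
SPAWNERS = {
    "local": "jupyterhub.spawner.LocalProcessSpawner",
    "docker": "dockerspawner.DockerSpawner",
    "kubernetes": "kubespawner.KubeSpawner",
}


def render_hub_config(authenticator, spawner):
    """Return the text of a minimal jupyterhub_config.py.

    Raises KeyError if either short name is not in the tables above.
    """
    lines = [
        "c = get_config()",
        f"c.JupyterHub.authenticator_class = {AUTHENTICATORS[authenticator]!r}",
        f"c.JupyterHub.spawner_class = {SPAWNERS[spawner]!r}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # One container per user, with logins delegated to an OAuth provider.
    print(render_hub_config("oauth", "docker"), end="")
```

A real deployment needs more settings than these two lines (OAuth credentials, the Docker image to spawn, networking), so this is only the skeleton of the idea.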

More infrastructure

Demo

It’s demo time

https://github.todevnull.com

https://github.com/balkian/lab-in-a-box

Other tools

Zeppelin

  • Alternative to Jupyter
https://zeppelin.apache.org/

CoCalc

  • Alternative to Jupyter
https://cocalc.org/

Docker-Nvidia

  • CUDA for Docker
https://github.com/NVIDIA/nvidia-docker

Jupyter Binder

  • Custom Jupyter from git repositories
https://mybinder.org/
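
Binder launches a notebook server from a public git repository through a plain URL. The helper below, a small sketch, builds such a launch link for a GitHub repository. The function name `binder_url` and the default `master` ref are assumptions, and the example repository would still need whatever environment files Binder expects before the link works.

```python
from urllib.parse import quote


def binder_url(user, repo, ref="master"):
    """Build a mybinder.org launch URL for a GitHub repository.

    `ref` is a branch, tag or commit; "master" is only an assumed default.
    """
    # Percent-encode each component so that slashes in a ref do not create
    # extra path segments.
    parts = [quote(part, safe="") for part in (user, repo, ref)]
    return "https://mybinder.org/v2/gh/" + "/".join(parts)


if __name__ == "__main__":
    print(binder_url("balkian", "lab-in-a-box"))
```

Sharing such a link alongside a paper lets readers open the notebooks without installing anything locally.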

Knowledge-Repo

http://knowledge-repo.readthedocs.io/

Conclusions

Lessons learned

  • Docker + Docker-compose
    • Reproducible environments (partially)
    • Less tooling and experience required
    • Ephemeral containers force you to automate/document installation
  • JupyterHub
    • Shared environments
    • Web interface (no prior knowledge needed)

What’s missing?

  • Roles and permissions
  • Backups

  • Ideas:
    • Kubernetes?
    • OpenShift?

Thanks for listening!

https://github.com/balkian/lab-in-a-box

jf.sanchez@upm.es