Docker Registry Distribution in a Deep Learning Pipeline
Once you’re past the R&D phase, you have working deep neural nets and need to run them somewhere. If you want to avoid dependency headaches, you have probably chosen to package your neural nets as docker images: TensorFlow (or an equivalent framework) + your code + your trained weights.
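Such an image is usually little more than a framework base image plus your code and weights. As a rough illustration (the base image tag, file names, and entrypoint below are hypothetical, not our actual setup):

# Hypothetical Dockerfile for a predict image: framework + code + weights
FROM tensorflow/tensorflow:1.10.1-py3

WORKDIR /app

# Python dependencies beyond TensorFlow itself
COPY requirements.txt .
RUN pip install -r requirements.txt

# Prediction code and trained weights baked into the image
COPY predict.py .
COPY weights/ ./weights/

ENTRYPOINT ["python", "predict.py"]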
The problem is that if your nets are of significant size, your docker images can easily reach 1 GB or more. In the context of a computing cluster running all your predict jobs (plus, potentially, a CI system for your neural nets), a centralized local docker registry quickly becomes a bottleneck, making your predicts slow (awfully long times to pull images) or even making them time out.
At first we decided to evaluate DockerHub’s commercial service, but it turned out not to be suitable for production: DockerHub would frequently ban our infrastructure from connecting due to too many login attempts. We couldn’t find any documentation about those rate limits (only a few years-old, unresolved StackOverflow threads), nor any logs or configuration levers in our DockerHub web portal. We abandoned DockerHub very quickly, since our infrastructure was banned and paralyzed every couple of hours.
That’s a pity, by the way: we use docker a lot, and an off-the-shelf distributed docker registry is exactly the kind of service we would have gladly paid for. We also looked at quay.io’s offering, but couldn’t find documentation about its rate limits either, so we decided it was a no-go.
In the end, we went for a primitive but efficient solution. For each predict job, our cluster spins up an EC2 instance. Each instance now starts its own local docker registry, and all these registries use the same storage backend (an S3 bucket).
As a result, each of our deep learning predicts now pulls docker images from its own local docker registry, and all these registries share the same S3 backend. The bottleneck therefore becomes S3 throughput, which in practice means no bottleneck at all.
Of course, predict jobs only pull from the registry and never push images, so there is no collision or consistency risk (two processes trying to modify and push the same docker image at the same time).
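Pushes happen elsewhere, for instance from a build or CI machine. A minimal sketch of that side, assuming the build machine runs the same registry container (also mapped to port 5500) against the same S3 bucket, and with a placeholder image name:

# Build the predict image and push it through the S3-backed registry
docker build -t my-net-predict:1.0 .
docker tag my-net-predict:1.0 127.0.0.1:5500/my-net-predict:1.0
docker push 127.0.0.1:5500/my-net-predict:1.0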
This simple distributed docker registry solution (one storage backend, multiple local docker registry instances) has been working great so far.
Here is the docker run command each predict job runs locally before starting:
docker run -d -p 127.0.0.1:5500:5000 --restart=always --name registry \
  -v /home/ubuntu/registry-config.yml:/etc/docker/registry/config.yml \
  -e REGISTRY_STORAGE_S3_ACCESSKEY=XXXXX \
  -e REGISTRY_STORAGE_S3_SECRETKEY=XXXXX \
  registry:2
and here is the registry-config.yml:
version: 0.1
log:
  fields:
    service: registry
storage:
  s3:
    region: eu-central-1
    bucket: XXXXX-registry
    encrypt: true
    secure: true
    v4auth: true
    chunksize: 5242880
    multipartcopychunksize: 33554432
    multipartcopymaxconcurrency: 100
    multipartcopythresholdsize: 33554432
  cache:
    blobdescriptor: inmemory
http:
  addr: :5000
  headers:
    X-Content-Type-Options: [nosniff]
health:
  storagedriver:
    enabled: true
    interval: 10s
    threshold: 3
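Once the local registry is running, the predict job pulls its image through it by prefixing the image name with the local registry address (the host port 5500 is mapped to the registry’s port 5000 above). For example, with a placeholder image name:

# Pull the predict image through the local, S3-backed registry, then run it
docker pull 127.0.0.1:5500/my-net-predict:1.0
docker run --rm 127.0.0.1:5500/my-net-predict:1.0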
Originally published at fruty.io on October 1, 2018.