Merge pull request #1184 from alemorcuq/add-pytorch-chart

Add PyTorch chart
Alejandro Moreno
2019-05-16 18:05:40 +02:00
committed by GitHub
13 changed files with 1091 additions and 0 deletions


@@ -59,6 +59,7 @@ $ helm search bitnami
- [nginx](https://github.com/bitnami/charts/tree/master/bitnami/nginx)
- [nginx-ingress-controller](https://github.com/bitnami/charts/tree/master/bitnami/nginx-ingress-controller)
- [NodeJS](https://github.com/bitnami/charts/tree/master/bitnami/node)
- [PyTorch](https://github.com/bitnami/charts/tree/master/bitnami/pytorch)
- [TensorFlow ResNet](https://github.com/bitnami/charts/tree/master/bitnami/tensorflow-resnet)
- [Tomcat](https://github.com/bitnami/charts/tree/master/bitnami/tomcat)
- [WildFly](https://github.com/bitnami/charts/tree/master/bitnami/wildfly)


@@ -0,0 +1,22 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/


@@ -0,0 +1,18 @@
apiVersion: v1
name: pytorch
version: 0.0.1
appVersion: 1.1.0
description: Deep learning platform that accelerates the transition from research prototyping to production deployment
keywords:
- pytorch
- python
- machine
- learning
home: http://pytorch.org/
sources:
- https://github.com/bitnami/bitnami-docker-pytorch
maintainers:
- name: Bitnami
  email: containers@bitnami.com
engine: gotpl
icon: https://bitnami.com/assets/stacks/pytorch/img/pytorch-stack-110x117.png

bitnami/pytorch/README.md Normal file

@@ -0,0 +1,157 @@
# PyTorch
[PyTorch](http://pytorch.org/) is a deep learning platform that accelerates the transition from research prototyping to production deployment. It is built for full integration with Python, so you can use it alongside Python's libraries and main packages.
## TL;DR;
```console
$ helm install bitnami/pytorch
```
## Introduction
This chart bootstraps a [PyTorch](https://github.com/bitnami/bitnami-docker-pytorch) deployment on a [Kubernetes](http://kubernetes.io) cluster using the [Helm](https://helm.sh) package manager.
Bitnami charts can be used with [Kubeapps](https://kubeapps.com/) for deployment and management of Helm Charts in clusters. This Helm chart has been tested on top of [Bitnami Kubernetes Production Runtime](https://kubeprod.io/) (BKPR). Deploy BKPR to get automated TLS certificates, logging and monitoring for your applications.
## Prerequisites
- Kubernetes 1.8+ with Beta APIs enabled
- PV provisioner support in the underlying infrastructure
## Installing the Chart
To install the chart with the release name `my-release`:
```console
$ helm install --name my-release bitnami/pytorch
```
The command deploys PyTorch on the Kubernetes cluster in the default configuration. The [configuration](#configuration) section lists the parameters that can be configured.
> **Tip**: List all releases using `helm list`
## Uninstalling the Chart
To uninstall/delete the `my-release` deployment:
```console
$ helm delete my-release
```
The command removes all the Kubernetes components associated with the chart and deletes the release.
## Configuration
The following table lists the configurable parameters of the PyTorch chart and their default values.
| Parameter | Description | Default |
| ------------------------------------ | -------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
| `global.imageRegistry` | Global Docker image registry | `nil` |
| `global.imagePullSecrets` | Global Docker registry secret names as an array | `[]` (does not add image pull secrets to deployed pods) |
| `image.registry` | PyTorch image registry | `docker.io` |
| `image.repository` | PyTorch image name | `bitnami/pytorch` |
| `image.tag` | PyTorch image tag | `{VERSION}` |
| `image.pullPolicy` | Image pull policy | `IfNotPresent` |
| `image.pullSecrets` | Specify docker-registry secret names as an array | `[]` (does not add image pull secrets to deployed pods) |
| `image.debug` | Specify if debug logs should be enabled | `false` |
| `git.registry` | Git image registry | `docker.io` |
| `git.repository` | Git image name | `bitnami/git` |
| `git.tag` | Git image tag | `latest` |
| `git.pullPolicy` | Git image pull policy | `Always` |
| `git.pullSecrets` | Specify docker-registry secret names as an array | `[]` (does not add image pull secrets to deployed pods) |
| `entrypoint.file` | Main entrypoint to your application | `''` |
| `entrypoint.args` | Args required by your entrypoint | `nil` |
| `mode` | Run PyTorch in standalone or distributed mode (possible values: `standalone`, `distributed`) | `standalone` |
| `worldSize` | Number of nodes that will execute your code | `nil` |
| `port` | PyTorch master port | `49875` |
| `configMap` | Config map that contains the files you want to load in PyTorch | `nil` |
| `cloneFilesFromGit.enabled` | Enable in order to download files from git repository | `false` |
| `cloneFilesFromGit.repository` | Repository that holds the files | `nil` |
| `cloneFilesFromGit.revision` | Revision from the repository to checkout | `master` |
| `extraEnvVars` | Extra environment variables to add to master and workers pods | `nil` |
| `nodeSelector` | Node labels for pod assignment | `{}` |
| `tolerations` | Toleration labels for pod assignment | `[]` |
| `affinity` | Map of node/pod affinities | `{}` |
| `resources` | Pod resources | `{}` |
| `securityContext.enabled` | Enable security context | `true` |
| `securityContext.fsGroup` | Group ID for the container | `1001` |
| `securityContext.runAsUser` | User ID for the container | `1001` |
| `livenessProbe.enabled` | Enable/disable the Liveness probe | `true` |
| `livenessProbe.initialDelaySeconds` | Delay before liveness probe is initiated | `5` |
| `livenessProbe.periodSeconds` | How often to perform the probe | `5` |
| `livenessProbe.timeoutSeconds` | When the probe times out | `5` |
| `livenessProbe.successThreshold` | Minimum consecutive successes for the probe to be considered successful after having failed. | `1` |
| `livenessProbe.failureThreshold` | Minimum consecutive failures for the probe to be considered failed after having succeeded. | `5` |
| `readinessProbe.enabled` | Enable/disable the Readiness probe | `true` |
| `readinessProbe.initialDelaySeconds` | Delay before readiness probe is initiated | `5` |
| `readinessProbe.periodSeconds` | How often to perform the probe | `5` |
| `readinessProbe.timeoutSeconds` | When the probe times out | `1` |
| `readinessProbe.successThreshold` | Minimum consecutive successes for the probe to be considered successful after having failed. | `1` |
| `readinessProbe.failureThreshold` | Minimum consecutive failures for the probe to be considered failed after having succeeded. | `5` |
| `persistence.enabled` | Use a PVC to persist data | `true` |
| `persistence.mountPath` | Path to mount the volume at | `/bitnami/pytorch` |
| `persistence.storageClass` | Storage class of backing PVC | `nil` (uses alpha storage class annotation) |
| `persistence.accessModes` | Persistent Volume access modes | `[ReadWriteOnce]` |
| `persistence.size` | Size of data volume | `8Gi` |
| `persistence.annotations` | Persistent Volume annotations | `{}` |
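
The chart checks `mode` and `worldSize` when it renders and aborts the installation if they are inconsistent. As a rough illustration, the Python sketch below mirrors those checks so a values combination can be tested before calling `helm install`. The function name `validate_values` is hypothetical and not part of the chart, and the rule it encodes (distributed mode needs a world size greater than 1) is taken from the chart's own error message.

```python
def validate_values(mode, world_size=None):
    """Return a list of error messages; an empty list means the values look valid."""
    errors = []
    if mode not in ("standalone", "distributed"):
        errors.append('mode must be "standalone" or "distributed"')
    # An unset worldSize is treated as 0 in this sketch.
    if mode == "distributed" and int(world_size or 0) < 2:
        errors.append("worldSize must be greater than 1 in distributed mode")
    return errors


print(validate_values("distributed", 4))  # accepted: empty list
print(validate_values("distributed", 1))  # rejected: one message about worldSize
```
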
Specify each parameter using the `--set key=value[,key=value]` argument to `helm install`. For example,
```console
$ helm install --name my-release \
--set mode=distributed \
--set worldSize=4 \
bitnami/pytorch
```
The above command creates 4 pods for PyTorch: one master and three workers.
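
In distributed mode every pod receives the `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE` and `RANK` environment variables: the master runs with rank 0, and each worker derives its rank from its StatefulSet ordinal plus one. The following is a minimal sketch of how an entrypoint script might consume them. The helper names (`worker_rank`, `read_dist_env`, `init_distributed`), the example pod name and the `gloo` backend are illustrative assumptions, not something the chart prescribes.

```python
import os


def worker_rank(pod_name):
    """Rank of a worker pod: the ordinal after the last '-' in its name, plus one."""
    return int(pod_name.rsplit("-", 1)[1]) + 1


def read_dist_env(env=None):
    """Collect the rendezvous settings that the chart exports to every pod."""
    env = os.environ if env is None else env
    return {
        "init_method": "tcp://{}:{}".format(env["MASTER_ADDR"], env["MASTER_PORT"]),
        "world_size": int(env["WORLD_SIZE"]),
        "rank": int(env["RANK"]),
    }


def init_distributed(cfg, backend="gloo"):
    """Join the process group; torch is imported lazily so the helpers above run without it."""
    import torch.distributed as dist

    dist.init_process_group(backend=backend, **cfg)


# Values that the first worker of a release called `my-release` could see (hypothetical).
example_env = {
    "MASTER_ADDR": "my-release-pytorch",
    "MASTER_PORT": "49875",
    "WORLD_SIZE": "4",
    "RANK": str(worker_rank("my-release-pytorch-worker-0")),
}
print(read_dist_env(example_env))
```

Inside a pod, a script would call `init_distributed(read_dist_env())` before building its model; the example above only prints the parsed settings.
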
Alternatively, a YAML file that specifies the values for the parameters can be provided while installing the chart. For example,
```console
$ helm install --name my-release -f values.yaml bitnami/pytorch
```
> **Tip**: You can use the default [values.yaml](values.yaml)
## Loading your files
The PyTorch chart supports three different ways to load your files. In order of priority, they are:
1. Existing config map
2. Files under the `files` directory
3. Cloning a git repository
This means that if you specify a config map with your files, it won't look for the `files/` directory nor the git repository.
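
The priority order above can be sketched as a simple chain of checks. This Python snippet is only an illustration of the selection logic, with a hypothetical function name and arguments; in the chart itself the decision is made by the templates, which mount the selected source at `/app`.

```python
def select_file_source(config_map=None, has_local_files=False, clone_from_git=False):
    """Return which source ends up mounted at /app, following the priority order."""
    if config_map:
        return "configMap"
    if has_local_files:
        return "files"
    if clone_from_git:
        return "git"
    return None  # nothing is mounted at /app


# A config map wins even when a git repository is also configured.
print(select_file_source(config_map="my-config-map", clone_from_git=True))
```
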
In order to use an existing config map:
```console
$ helm install --name my-release \
--set configMap=my-config-map \
bitnami/pytorch
```
To load your files from the `files/` directory you don't have to set any option. Just copy your files inside and don't specify a `ConfigMap`:
```console
$ helm install --name my-release \
bitnami/pytorch
```
Finally, if you want to clone a git repository:
```console
$ helm install --name my-release \
--set cloneFilesFromGit.enabled=true \
--set cloneFilesFromGit.repository=https://github.com/my-user/my-repo \
--set cloneFilesFromGit.revision=master \
bitnami/pytorch
```
## Persistence
The [Bitnami PyTorch](https://github.com/bitnami/bitnami-docker-pytorch) image can persist data. If enabled, the persisted path is `/bitnami/pytorch` by default.
The chart mounts a [Persistent Volume](http://kubernetes.io/docs/user-guide/persistent-volumes/) at this location. The volume is created using dynamic volume provisioning.
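
When persistence is enabled, it can help to know which claims to expect. The sketch below is an approximation under stated assumptions: it ignores `nameOverride` and `fullnameOverride`, the helper names `fullname` and `claim_names` are hypothetical, and the worker claim names follow the usual StatefulSet convention of `<volume name>-<pod name>`.

```python
def fullname(release, chart="pytorch"):
    """Default resource name: reuse the release name if it already contains the chart name."""
    name = release if chart in release else "{}-{}".format(release, chart)
    name = name[:63]  # Kubernetes name fields are limited to 63 characters
    return name[:-1] if name.endswith("-") else name


def claim_names(release, mode="standalone", world_size=1):
    """List the PersistentVolumeClaims expected for a release."""
    base = fullname(release)
    if mode != "distributed":
        return [base]
    # One claim for the master, then one per worker pod.
    workers = ["data-{}-worker-{}".format(base, i) for i in range(world_size - 1)]
    return [base + "-master"] + workers


print(claim_names("my-release", "distributed", 4))
```

The resulting names can be compared against the output of `kubectl get pvc` in the release namespace.
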


@@ -0,0 +1,37 @@
{{- if or (.Values.configMap) (.Files.Glob "files/*") (.Values.cloneFilesFromGit.enabled) }}
{{- if .Values.entrypoint.file }}
The provided file {{ .Values.entrypoint.file }} is being executed. You can see the logs of each running node with:
kubectl logs [POD_NAME]
and the list of pods:
kubectl get pods --namespace {{ .Release.Namespace }} -l "app.kubernetes.io/name={{ include "pytorch.name" . }},app.kubernetes.io/instance={{ .Release.Name }}"
{{- else }}
You didn't specify any entrypoint to your code.
To run it, you can either deploy again using the `entrypoint.file` option to specify your entrypoint, or execute it manually by jumping into the pods:
1. Get the running pods
kubectl get pods --namespace {{ .Release.Namespace }} -l "app.kubernetes.io/name={{ include "pytorch.name" . }},app.kubernetes.io/instance={{ .Release.Name }}"
2. Get into a pod
kubectl exec -ti [POD_NAME] bash
3. Execute your script as you would normally do.
{{- end }}
{{- else }}
WARNING: You haven't loaded any file. You can access the Python REPL by jumping into the pods:
1. Get the running pods
kubectl get pods --namespace {{ .Release.Namespace }} -l "app.kubernetes.io/name={{ include "pytorch.name" . }},app.kubernetes.io/instance={{ .Release.Name }}"
2. Run the Python REPL
kubectl exec -ti [POD_NAME] python
This chart allows three different methods to load your files:
1. Load the files from an existing ConfigMap, using the `configMap` option.
2. Putting your files in a `files` folder in the root of the Chart.
3. Cloning a Git repository with the `cloneFilesFromGit` option.
Examples for the different methods can be found in the README.
{{- end }}
{{ include "pytorch.validateValues" . }}


@@ -0,0 +1,147 @@
{{/* vim: set filetype=mustache: */}}
{{/*
Expand the name of the chart.
*/}}
{{- define "pytorch.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}}
{{- end -}}
{{/*
Create a default fully qualified app name.
We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).
If release name contains chart name it will be used as a full name.
*/}}
{{- define "pytorch.fullname" -}}
{{- if .Values.fullnameOverride -}}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" -}}
{{- else -}}
{{- $name := default .Chart.Name .Values.nameOverride -}}
{{- if contains $name .Release.Name -}}
{{- .Release.Name | trunc 63 | trimSuffix "-" -}}
{{- else -}}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "pytorch.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" -}}
{{- end -}}
{{/*
Return the proper PyTorch image name
*/}}
{{- define "pytorch.image" -}}
{{- $registryName := .Values.image.registry -}}
{{- $repositoryName := .Values.image.repository -}}
{{- $tag := .Values.image.tag | toString -}}
{{/*
Helm 2.11 supports the assignment of a value to a variable defined in a different scope,
but Helm 2.9 and 2.10 don't support it, so we need to implement this if-else logic.
Also, we can't use a single if because lazy evaluation is not an option
*/}}
{{- if .Values.global }}
{{- if .Values.global.imageRegistry }}
{{- printf "%s/%s:%s" .Values.global.imageRegistry $repositoryName $tag -}}
{{- else -}}
{{- printf "%s/%s:%s" $registryName $repositoryName $tag -}}
{{- end -}}
{{- else -}}
{{- printf "%s/%s:%s" $registryName $repositoryName $tag -}}
{{- end -}}
{{- end -}}
{{/*
Return the proper git image name
*/}}
{{- define "git.image" -}}
{{- $registryName := .Values.git.registry -}}
{{- $repositoryName := .Values.git.repository -}}
{{- $tag := .Values.git.tag | toString -}}
{{/*
Helm 2.11 supports the assignment of a value to a variable defined in a different scope,
but Helm 2.9 and 2.10 don't support it, so we need to implement this if-else logic.
Also, we can't use a single if because lazy evaluation is not an option
*/}}
{{- if .Values.global }}
{{- if .Values.global.imageRegistry }}
{{- printf "%s/%s:%s" .Values.global.imageRegistry $repositoryName $tag -}}
{{- else -}}
{{- printf "%s/%s:%s" $registryName $repositoryName $tag -}}
{{- end -}}
{{- else -}}
{{- printf "%s/%s:%s" $registryName $repositoryName $tag -}}
{{- end -}}
{{- end -}}
{{/*
Return the proper Docker Image Registry Secret Names
*/}}
{{- define "pytorch.imagePullSecrets" -}}
{{/*
Helm 2.11 supports the assignment of a value to a variable defined in a different scope,
but Helm 2.9 and 2.10 does not support it, so we need to implement this if-else logic.
Also, we can not use a single if because lazy evaluation is not an option
*/}}
{{- if .Values.global }}
{{- if .Values.global.imagePullSecrets }}
imagePullSecrets:
{{- range .Values.global.imagePullSecrets }}
- name: {{ . }}
{{- end }}
{{- else if or .Values.image.pullSecrets .Values.git.pullSecrets }}
imagePullSecrets:
{{- range .Values.image.pullSecrets }}
- name: {{ . }}
{{- end }}
{{- range .Values.git.pullSecrets }}
- name: {{ . }}
{{- end }}
{{- end -}}
{{- else if or .Values.image.pullSecrets .Values.git.pullSecrets }}
imagePullSecrets:
{{- range .Values.image.pullSecrets }}
- name: {{ . }}
{{- end }}
{{- range .Values.git.pullSecrets }}
- name: {{ . }}
{{- end }}
{{- end -}}
{{- end -}}
{{/*
Compile all warnings into a single message, and call fail.
*/}}
{{- define "pytorch.validateValues" -}}
{{- $messages := list -}}
{{- $messages := append $messages (include "pytorch.validateValues.mode" .) -}}
{{- $messages := append $messages (include "pytorch.validateValues.worldSize" .) -}}
{{- $messages := without $messages "" -}}
{{- $message := join "\n" $messages -}}
{{- if $message -}}
{{- printf "\nVALUES VALIDATION:\n%s" $message | fail -}}
{{- end -}}
{{- end -}}
{{/* Validate values of PyTorch - must provide a valid mode ("distributed" or "standalone") */}}
{{- define "pytorch.validateValues.mode" -}}
{{- if and (ne .Values.mode "distributed") (ne .Values.mode "standalone") -}}
pytorch: mode
Invalid mode selected. Valid values are "distributed" and
"standalone". Please set a valid mode (--set mode="xxxx")
{{- end -}}
{{- end -}}
{{/* Validate values of PyTorch - world size must be greater than 1 in distributed mode */}}
{{- define "pytorch.validateValues.worldSize" -}}
{{- $replicaCount := int .Values.worldSize }}
{{- if and (eq .Values.mode "distributed") (lt $replicaCount 2) -}}
pytorch: worldSize
World size must be greater than 1 in distributed mode!!
Please set a valid world size (--set worldSize=X)
{{- end -}}
{{- end -}}


@@ -0,0 +1,13 @@
{{- if .Files.Glob "files/*" }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "pytorch.fullname" . }}-files
  labels:
    app.kubernetes.io/name: {{ include "pytorch.name" . }}
    helm.sh/chart: {{ include "pytorch.chart" . }}
    app.kubernetes.io/instance: {{ .Release.Name }}
    app.kubernetes.io/managed-by: {{ .Release.Service }}
data:
{{ (.Files.Glob "files/*").AsConfig | indent 2 }}
{{ end }}


@@ -0,0 +1,146 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "pytorch.fullname" . }}{{ if eq .Values.mode "distributed" }}-master{{ end }}
  labels:
    app.kubernetes.io/name: {{ include "pytorch.name" . }}
    helm.sh/chart: {{ include "pytorch.chart" . }}
    app.kubernetes.io/instance: {{ .Release.Name }}
    app.kubernetes.io/managed-by: {{ .Release.Service }}
    app.kubernetes.io/component: "master"
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: {{ include "pytorch.name" . }}
      app.kubernetes.io/instance: {{ .Release.Name }}
      app.kubernetes.io/component: "master"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: {{ include "pytorch.name" . }}
        helm.sh/chart: {{ include "pytorch.chart" . }}
        app.kubernetes.io/instance: {{ .Release.Name }}
        app.kubernetes.io/managed-by: {{ .Release.Service }}
        app.kubernetes.io/component: "master"
    spec:
      {{- include "pytorch.imagePullSecrets" . | nindent 6 }}
      {{- if .Values.securityContext.enabled }}
      securityContext:
        fsGroup: {{ .Values.securityContext.fsGroup }}
        runAsUser: {{ .Values.securityContext.runAsUser }}
      {{- end }}
      {{- if .Values.nodeSelector }}
      nodeSelector: {{ toYaml .Values.nodeSelector | nindent 8 }}
      {{- end }}
      {{- if .Values.tolerations }}
      tolerations: {{ toYaml .Values.tolerations | nindent 8 }}
      {{- end }}
      {{- if .Values.affinity }}
      affinity: {{ toYaml .Values.affinity | nindent 8 }}
      {{- end }}
      {{- if .Values.cloneFilesFromGit.enabled }}
      initContainers:
      - name: git-clone-repository
        image: {{ include "git.image" . }}
        imagePullPolicy: {{ .Values.git.pullPolicy | quote }}
        command:
        - /bin/sh
        - -c
        - |
          git clone {{ .Values.cloneFilesFromGit.repository }} /app
          cd /app
          git checkout {{ .Values.cloneFilesFromGit.revision }}
        volumeMounts:
        - name: git-cloned-files
          mountPath: /app
      {{- end }}
      containers:
      - name: master
        image: {{ include "pytorch.image" . }}
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        command:
        - bash
        - -c
        - |
          {{- if .Values.entrypoint.file }}
          python {{ .Values.entrypoint.file }} {{ if .Values.entrypoint.args }}{{ .Values.entrypoint.args }}{{ end }}
          {{- end }}
          sleep infinity
        env:
        {{- if eq .Values.mode "distributed" }}
        - name: MASTER_ADDR
          value: "127.0.0.1"
        - name: MASTER_PORT
          value: {{ .Values.port | quote }}
        - name: WORLD_SIZE
          value: {{ .Values.worldSize | quote }}
        - name: RANK
          value: "0"
        {{- end }}
        {{- if .Values.extraEnvVars }}
{{ toYaml .Values.extraEnvVars | indent 8 }}
        {{- end }}
        ports:
        - name: pytorch
          containerPort: {{ .Values.port }}
        {{- if .Values.livenessProbe.enabled }}
        livenessProbe:
          exec:
            command:
            - python
            - -c
            - import torch; torch.__version__
          initialDelaySeconds: {{ .Values.livenessProbe.initialDelaySeconds }}
          periodSeconds: {{ .Values.livenessProbe.periodSeconds }}
          timeoutSeconds: {{ .Values.livenessProbe.timeoutSeconds }}
          successThreshold: {{ .Values.livenessProbe.successThreshold }}
          failureThreshold: {{ .Values.livenessProbe.failureThreshold }}
        {{- end }}
        {{- if .Values.readinessProbe.enabled }}
        readinessProbe:
          exec:
            command:
            - python
            - -c
            - import torch; torch.__version__
          initialDelaySeconds: {{ .Values.readinessProbe.initialDelaySeconds }}
          periodSeconds: {{ .Values.readinessProbe.periodSeconds }}
          timeoutSeconds: {{ .Values.readinessProbe.timeoutSeconds }}
          successThreshold: {{ .Values.readinessProbe.successThreshold }}
          failureThreshold: {{ .Values.readinessProbe.failureThreshold }}
        {{- end }}
        resources: {{ toYaml .Values.resources | nindent 12 }}
        volumeMounts:
        {{- if .Values.configMap }}
        - name: ext-files
          mountPath: /app
        {{- else if .Files.Glob "files/*" }}
        - name: local-files
          mountPath: /app
        {{- else if .Values.cloneFilesFromGit.enabled }}
        - name: git-cloned-files
          mountPath: /app
        {{- end }}
        - name: data
          mountPath: {{ .Values.persistence.mountPath }}
      volumes:
      {{- if .Values.configMap }}
      - name: ext-files
        configMap:
          name: {{ .Values.configMap }}
      {{- else if .Files.Glob "files/*" }}
      - name: local-files
        configMap:
          name: {{ include "pytorch.fullname" . }}-files
      {{- else if .Values.cloneFilesFromGit.enabled }}
      - name: git-cloned-files
        emptyDir: {}
      {{- end }}
      - name: data
      {{- if .Values.persistence.enabled }}
        persistentVolumeClaim:
          claimName: {{ include "pytorch.fullname" . }}{{ if eq .Values.mode "distributed" }}-master{{ end }}
      {{- else }}
        emptyDir: {}
      {{- end }}


@@ -0,0 +1,26 @@
{{- if .Values.persistence.enabled }}
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: {{ include "pytorch.fullname" . }}{{ if eq .Values.mode "distributed" }}-master{{ end }}
  labels:
    app.kubernetes.io/name: {{ include "pytorch.name" . }}
    helm.sh/chart: {{ include "pytorch.chart" . }}
    app.kubernetes.io/instance: {{ .Release.Name }}
    app.kubernetes.io/managed-by: {{ .Release.Service }}
spec:
  accessModes:
  {{- range .Values.persistence.accessModes }}
    - {{ . | quote }}
  {{- end }}
  resources:
    requests:
      storage: {{ .Values.persistence.size | quote }}
  {{- if .Values.persistence.storageClass }}
  {{- if (eq "-" .Values.persistence.storageClass) }}
  storageClassName: ""
  {{- else }}
  storageClassName: "{{ .Values.persistence.storageClass }}"
  {{- end }}
  {{- end }}
{{- end }}


@@ -0,0 +1,20 @@
apiVersion: v1
kind: Service
metadata:
  name: {{ include "pytorch.fullname" . }}
  labels:
    app.kubernetes.io/name: {{ include "pytorch.name" . }}
    helm.sh/chart: {{ include "pytorch.chart" . }}
    app.kubernetes.io/instance: {{ .Release.Name }}
    app.kubernetes.io/managed-by: {{ .Release.Service }}
    app.kubernetes.io/component: "master"
spec:
  type: ClusterIP
  ports:
  - port: {{ .Values.port }}
    targetPort: pytorch
    name: pytorch
  selector:
    app.kubernetes.io/name: {{ include "pytorch.name" . }}
    app.kubernetes.io/instance: {{ .Release.Name }}
    app.kubernetes.io/component: "master"


@@ -0,0 +1,166 @@
{{- if eq .Values.mode "distributed" }}
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: {{ include "pytorch.fullname" . }}-worker
  labels:
    app.kubernetes.io/name: {{ include "pytorch.name" . }}
    helm.sh/chart: {{ include "pytorch.chart" . }}
    app.kubernetes.io/instance: {{ .Release.Name }}
    app.kubernetes.io/managed-by: {{ .Release.Service }}
    app.kubernetes.io/component: "worker"
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: {{ include "pytorch.name" . }}
      app.kubernetes.io/instance: {{ .Release.Name }}
      app.kubernetes.io/component: "worker"
  replicas: {{ sub .Values.worldSize 1 }}
  template:
    metadata:
      labels:
        app.kubernetes.io/name: {{ include "pytorch.name" . }}
        helm.sh/chart: {{ include "pytorch.chart" . }}
        app.kubernetes.io/instance: {{ .Release.Name }}
        app.kubernetes.io/component: "worker"
    spec:
      {{- include "pytorch.imagePullSecrets" . | nindent 6 }}
      {{- if .Values.securityContext.enabled }}
      securityContext:
        fsGroup: {{ .Values.securityContext.fsGroup }}
        runAsUser: {{ .Values.securityContext.runAsUser }}
      {{- end }}
      {{- if .Values.nodeSelector }}
      nodeSelector: {{ toYaml .Values.nodeSelector | nindent 8 }}
      {{- end }}
      {{- if .Values.tolerations }}
      tolerations: {{ toYaml .Values.tolerations | nindent 8 }}
      {{- end }}
      {{- if .Values.affinity }}
      affinity: {{ toYaml .Values.affinity | nindent 8 }}
      {{- end }}
      {{- if .Values.cloneFilesFromGit.enabled }}
      initContainers:
      - name: git-clone-repository
        image: {{ include "git.image" . }}
        imagePullPolicy: {{ .Values.git.pullPolicy | quote }}
        command:
        - /bin/sh
        - -c
        - |
          git clone {{ .Values.cloneFilesFromGit.repository }} /app
          cd /app
          git checkout {{ .Values.cloneFilesFromGit.revision }}
        volumeMounts:
        - name: git-cloned-files
          mountPath: /app
      {{- end }}
      containers:
      - name: worker
        image: {{ include "pytorch.image" . }}
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        command:
        - bash
        - -c
        - |
          RANK=${POD_NAME##*-}
          ((RANK++))
          export RANK
          {{- if .Values.entrypoint.file }}
          python {{ .Values.entrypoint.file }} {{ if .Values.entrypoint.args }}{{ .Values.entrypoint.args }}{{ end }}
          {{- end }}
          sleep infinity
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: MASTER_ADDR
          value: {{ include "pytorch.fullname" . }}
        - name: MASTER_PORT
          value: {{ .Values.port | quote }}
        - name: WORLD_SIZE
          value: {{ .Values.worldSize | quote }}
        {{- if .Values.extraEnvVars }}
{{ toYaml .Values.extraEnvVars | indent 8 }}
        {{- end }}
        {{- if .Values.livenessProbe.enabled }}
        livenessProbe:
          exec:
            command:
            - python
            - -c
            - import torch; torch.__version__
          initialDelaySeconds: {{ .Values.livenessProbe.initialDelaySeconds }}
          periodSeconds: {{ .Values.livenessProbe.periodSeconds }}
          timeoutSeconds: {{ .Values.livenessProbe.timeoutSeconds }}
          successThreshold: {{ .Values.livenessProbe.successThreshold }}
          failureThreshold: {{ .Values.livenessProbe.failureThreshold }}
        {{- end }}
        {{- if .Values.readinessProbe.enabled }}
        readinessProbe:
          exec:
            command:
            - python
            - -c
            - import torch; torch.__version__
          initialDelaySeconds: {{ .Values.readinessProbe.initialDelaySeconds }}
          periodSeconds: {{ .Values.readinessProbe.periodSeconds }}
          timeoutSeconds: {{ .Values.readinessProbe.timeoutSeconds }}
          successThreshold: {{ .Values.readinessProbe.successThreshold }}
          failureThreshold: {{ .Values.readinessProbe.failureThreshold }}
        {{- end }}
        resources: {{ toYaml .Values.resources | nindent 12 }}
        volumeMounts:
        {{- if .Values.configMap }}
        - name: ext-files
          mountPath: /app
        {{- else if .Files.Glob "files/*" }}
        - name: local-files
          mountPath: /app
        {{- else if .Values.cloneFilesFromGit.enabled }}
        - name: git-cloned-files
          mountPath: /app
        {{- end }}
        - name: data
          mountPath: {{ .Values.persistence.mountPath }}
      volumes:
      {{- if .Values.configMap }}
      - name: ext-files
        configMap:
          name: {{ .Values.configMap }}
      {{- else if .Files.Glob "files/*" }}
      - name: local-files
        configMap:
          name: {{ include "pytorch.fullname" . }}-files
      {{- else if .Values.cloneFilesFromGit.enabled }}
      - name: git-cloned-files
        emptyDir: {}
      {{- end }}
  {{- if .Values.persistence.enabled }}
  volumeClaimTemplates:
  - metadata:
      name: data
      labels:
        app.kubernetes.io/name: {{ include "pytorch.name" . }}
        app.kubernetes.io/instance: {{ .Release.Name }}
      {{- if .Values.persistence.annotations }}
      annotations: {{ toYaml .Values.persistence.annotations | nindent 8 }}
      {{- end }}
    spec:
      accessModes: {{ toYaml .Values.persistence.accessModes | nindent 8 }}
      {{- if .Values.persistence.storageClass }}
      {{- if (eq "-" .Values.persistence.storageClass) }}
      storageClassName: ""
      {{- else }}
      storageClassName: {{ .Values.persistence.storageClass | quote }}
      {{- end }}
      {{- end }}
      resources:
        requests:
          storage: {{ .Values.persistence.size | quote }}
  {{- else }}
      - name: data
        emptyDir: {}
  {{- end }}
{{- end }}


@@ -0,0 +1,169 @@
## Global Docker image parameters
## Please, note that this will override the image parameters, including dependencies, configured to use the global value
## Current available global Docker image parameters: imageRegistry and imagePullSecrets
##
# global:
#   imageRegistry: myRegistryName
#   imagePullSecrets:
#     - myRegistryKeySecretName
## Bitnami PyTorch image version
## ref: https://hub.docker.com/r/bitnami/pytorch/tags/
##
image:
  registry: docker.io
  repository: bitnami/pytorch
  tag: 1.1.0
  ## Specify an imagePullPolicy
  ## Defaults to 'Always' if image tag is 'latest', else set to 'IfNotPresent'
  ## ref: http://kubernetes.io/docs/user-guide/images/#pre-pulling-images
  ##
  pullPolicy: IfNotPresent
  ## Optionally specify an array of imagePullSecrets.
  ## Secrets must be manually created in the namespace.
  ## ref: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
  ##
  # pullSecrets:
  #   - myRegistryKeySecretName
  ##
  ## Set to true if you would like to see extra information on logs
  ## It turns on BASH and NAMI debugging in minideb
  ## ref: https://github.com/bitnami/minideb-extras/#turn-on-bash-debugging
  debug: false
## Bitnami git image version
## ref: https://hub.docker.com/r/bitnami/git/tags/
##
git:
  registry: docker.io
  repository: bitnami/git
  tag: latest
  pullPolicy: Always
  ## Optionally specify an array of imagePullSecrets.
  ## Secrets must be manually created in the namespace.
  ## ref: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
  ##
  # pullSecrets:
  #   - myRegistryKeySecretName
## PyTorch configuration
##
## The main entrypoint of your app, this will be executed as:
## python [file] [args]
entrypoint:
  file:
  #args:
## Set to `distributed` in order to enable distributed mode
## mode: distributed
##
mode: distributed
## Number of nodes that will run the code
## WORLD_SIZE will be set to this value
##
worldSize: 4
## The port used to communicate with the master
## MASTER_PORT will be set to this value
##
port: 49875
## Name of an existing config map containing all the files you want to load in PyTorch
##
#configMap:
## Enable in order to download files from git repository.
##
cloneFilesFromGit:
  enabled: false
  # repository:
  # revision: master
## Additional environment variables
##
# extraEnvVars:
#   - name: NCCL_DEBUG
#     value: "INFO"
#   - name: NCCL_DEBUG_SUBSYS
#     value: "ALL"
## Node labels for pod assignment
## Ref: https://kubernetes.io/docs/user-guide/node-selection/
##
nodeSelector: {}
## Tolerations for pod assignment
## Ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
##
tolerations: []
## Affinity for pod assignment
## Ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
##
affinity: {}
## Configure resource requests and limits
## ref: http://kubernetes.io/docs/user-guide/compute-resources/
##
resources: {}
## Pod Security Context
## ref: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/
##
securityContext:
  enabled: true
  fsGroup: 1001
  runAsUser: 1001
## Configure liveness and readiness probes
## ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#configure-probes
##
livenessProbe:
  enabled: true
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 5
readinessProbe:
  enabled: true
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 1
  successThreshold: 1
  failureThreshold: 5
## Enable persistence using Persistent Volume Claims
## ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
##
persistence:
  ## If true, use a Persistent Volume Claim
  ##
  enabled: true
  ## Data volume mount path
  ##
  mountPath: /bitnami/pytorch
  ## Persistent Volume Access Mode
  ##
  accessModes:
    - ReadWriteOnce
  ## Persistent Volume size
  ##
  size: 8Gi
  ## Persistent Volume Storage Class
  ## If defined, storageClassName: <storageClass>
  ## If set to "-", storageClassName: "", which disables dynamic provisioning
  ## If undefined (the default) or set to null, no storageClassName spec is
  ##   set, choosing the default provisioner. (gp2 on AWS, standard on
  ##   GKE, AWS & OpenStack)
  ##
  # storageClass: "-"
  ## Persistent Volume Claim annotations
  ##
  annotations: {}

bitnami/pytorch/values.yaml Normal file

@@ -0,0 +1,169 @@
## Global Docker image parameters
## Please, note that this will override the image parameters, including dependencies, configured to use the global value
## Current available global Docker image parameters: imageRegistry and imagePullSecrets
##
# global:
#   imageRegistry: myRegistryName
#   imagePullSecrets:
#     - myRegistryKeySecretName
## Bitnami PyTorch image version
## ref: https://hub.docker.com/r/bitnami/pytorch/tags/
##
image:
  registry: docker.io
  repository: bitnami/pytorch
  tag: 1.1.0
  ## Specify an imagePullPolicy
  ## Defaults to 'Always' if image tag is 'latest', else set to 'IfNotPresent'
  ## ref: http://kubernetes.io/docs/user-guide/images/#pre-pulling-images
  ##
  pullPolicy: IfNotPresent
  ## Optionally specify an array of imagePullSecrets.
  ## Secrets must be manually created in the namespace.
  ## ref: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
  ##
  # pullSecrets:
  #   - myRegistryKeySecretName
  ##
  ## Set to true if you would like to see extra information on logs
  ## It turns on BASH and NAMI debugging in minideb
  ## ref: https://github.com/bitnami/minideb-extras/#turn-on-bash-debugging
  debug: false
## Bitnami git image version
## ref: https://hub.docker.com/r/bitnami/git/tags/
##
git:
  registry: docker.io
  repository: bitnami/git
  tag: latest
  pullPolicy: Always
  ## Optionally specify an array of imagePullSecrets.
  ## Secrets must be manually created in the namespace.
  ## ref: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
  ##
  # pullSecrets:
  #   - myRegistryKeySecretName
## PyTorch configuration
##
## The main entrypoint of your app, this will be executed as:
## python [file] [args]
entrypoint:
  file:
  #args:
## Set to `distributed` in order to enable distributed mode
## mode: distributed
##
mode: standalone
## Number of nodes that will run the code
## WORLD_SIZE will be set to this value
##
#worldSize:
## The port used to communicate with the master
## MASTER_PORT will be set to this value
##
port: 49875
## Name of an existing config map containing all the files you want to load in PyTorch
##
#configMap:
## Enable in order to download files from git repository.
##
cloneFilesFromGit:
  enabled: false
  # repository:
  # revision: master
## Additional environment variables
##
# extraEnvVars:
#   - name: NCCL_DEBUG
#     value: "INFO"
#   - name: NCCL_DEBUG_SUBSYS
#     value: "ALL"
## Node labels for pod assignment
## Ref: https://kubernetes.io/docs/user-guide/node-selection/
##
nodeSelector: {}
## Tolerations for pod assignment
## Ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
##
tolerations: []
## Affinity for pod assignment
## Ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
##
affinity: {}
## Configure resource requests and limits
## ref: http://kubernetes.io/docs/user-guide/compute-resources/
##
resources: {}
## Pod Security Context
## ref: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/
##
securityContext:
  enabled: true
  fsGroup: 1001
  runAsUser: 1001
## Configure liveness and readiness probes
## ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#configure-probes
##
livenessProbe:
  enabled: true
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 5
readinessProbe:
  enabled: true
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 1
  successThreshold: 1
  failureThreshold: 5
## Enable persistence using Persistent Volume Claims
## ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
##
persistence:
  ## If true, use a Persistent Volume Claim
  ##
  enabled: true
  ## Data volume mount path
  ##
  mountPath: /bitnami/pytorch
  ## Persistent Volume Access Mode
  ##
  accessModes:
    - ReadWriteOnce
  ## Persistent Volume size
  ##
  size: 8Gi
  ## Persistent Volume Storage Class
  ## If defined, storageClassName: <storageClass>
  ## If set to "-", storageClassName: "", which disables dynamic provisioning
  ## If undefined (the default) or set to null, no storageClassName spec is
  ##   set, choosing the default provisioner. (gp2 on AWS, standard on
  ##   GKE, AWS & OpenStack)
  ##
  # storageClass: "-"
  ## Persistent Volume Claim annotations
  ##
  annotations: {}