As a deep learning management system developed by Inspur, DeepEngine targets deep learning training clusters and supports several types of deep learning frameworks. It can rapidly deploy training environments for deep learning and comprehensively manage deep learning training tasks.
It provides an efficient and convenient platform for users. Unified management, scheduling, and monitoring of CPU and GPU resources across training runs on clusters effectively raise the utilization of computation resources and improve productivity.
Functions and Features:
A complete deep learning workflow that is convenient and easy to use
Deep learning training tasks involve many steps. The AIStation provides full-process support: data pre-processing, parameter tuning, allocation of computation resources, launching of training tasks, training monitoring, and result analysis.
One-click deployment of a deep learning training environment, for rapid launching of training tasks and better efficiency
Deep learning offers many frameworks and models, which often depend on different environments; when several are used in parallel, they require a complicated development setup. The AIStation makes the isolation and rapid deployment of resources and working environments an easy task.
Monitoring and visualization of training tasks, management of training progress and quality at any time, and quick detection of model problems
Deep learning model training is time-consuming: a training run can last from several hours to several days, and potential problems are often discovered only after the run completes. To avoid this, the AIStation provides real-time monitoring and visualization of training tasks, logging the loss, training error, and test error at each step.
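The per-step logging and early problem detection described above can be sketched as follows. The `TrainingMonitor` class, its JSON log format, and the divergence heuristic are illustrative assumptions, not AIStation's actual API:

```python
import json
import time

class TrainingMonitor:
    """Hypothetical per-step metric logger (not AIStation's actual API)."""

    def __init__(self):
        self.history = []

    def log_step(self, step, loss, train_error, test_error):
        record = {
            "step": step,
            "time": time.time(),
            "loss": loss,
            "train_error": train_error,
            "test_error": test_error,
        }
        self.history.append(record)
        # Emit one JSON line per step so a dashboard can tail the log live.
        print(json.dumps(record))

    def diverged(self, window=3, factor=2.0):
        # Flag a run whose recent losses grew sharply relative to the
        # preceding window, so problems surface during training rather
        # than only after the run completes.
        if len(self.history) < 2 * window:
            return False
        losses = [r["loss"] for r in self.history]
        recent = sum(losses[-window:]) / window
        earlier = sum(losses[-2 * window:-window]) / window
        return recent > factor * earlier

monitor = TrainingMonitor()
for step, loss in enumerate([1.0, 0.8, 0.6, 0.5, 0.45, 0.4]):
    monitor.log_step(step, loss, train_error=loss / 2, test_error=loss / 2 + 0.05)
```

Emitting one JSON line per step keeps the log machine-readable, so a monitoring dashboard can plot loss curves without parsing free-form text.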
Dynamic allocation of GPU resources to improve utilization
When allocating and dispatching GPU resources, the AIStation can balance the needs of large and small, long-term and short-term tasks. Dynamic allocation of GPU resources enables reasonable resource sharing and improves GPU efficiency.
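One simple way to balance long-term and short-term tasks is to reserve part of the GPU pool for short jobs, so long-running jobs cannot starve them. The sketch below illustrates that idea only; it is an assumption, not AIStation's actual allocation policy:

```python
class GpuPool:
    """Hypothetical GPU allocator that holds back a few GPUs for short
    tasks, so long-running jobs cannot starve short or interactive ones."""

    def __init__(self, total_gpus, short_reserve):
        self.free = total_gpus
        self.short_reserve = short_reserve  # GPUs kept back for short tasks

    def allocate(self, gpus_needed, short_task):
        # Long tasks may only draw from the unreserved portion of the pool.
        available = self.free if short_task else self.free - self.short_reserve
        if gpus_needed <= available:
            self.free -= gpus_needed
            return True
        return False

    def release(self, gpus):
        self.free += gpus

pool = GpuPool(total_gpus=8, short_reserve=2)
pool.allocate(6, short_task=False)               # long job takes the 6 unreserved GPUs
long_again = pool.allocate(1, short_task=False)  # denied: only the reserve remains
short_ok = pool.allocate(2, short_task=True)     # short job may use the reserve
```

A production scheduler would track per-task leases and time limits as well, but the reserve captures the basic sharing trade-off between long and short tasks.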
Comprehensive cluster monitoring and management, real-time control of utilization and operation of CPU/GPU resources
Real-time monitoring of clusters, reasonable scheduling of training tasks, and timely detection of training problems improve cluster reliability.
Deployment of deep learning environment
– Multiple deep learning frameworks: Caffe/TensorFlow/CNTK etc.
– Support for various models: GoogleNet/VGG/ResNet etc.
– One-key deployment of the distributed computing environment
– Application orchestration, rapid start of apps
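One-key environment deployment of this kind is typically built on containers: a framework template maps to a container image, and the platform assembles the launch command. The sketch below shows the idea; the `TEMPLATES` catalogue and image tags are illustrative assumptions, not AIStation's actual templates:

```python
# Hypothetical framework templates; the image tags are illustrative
# examples, not AIStation's actual catalogue.
TEMPLATES = {
    "tensorflow": "tensorflow/tensorflow:latest-gpu",
    "caffe": "bvlc/caffe:gpu",
    "cntk": "mcr.microsoft.com/cntk/release",
}

def deploy_command(framework, gpus, workdir="/workspace"):
    """Build a container launch command for the chosen framework,
    isolating the environment and pinning the requested GPU devices."""
    image = TEMPLATES[framework]
    gpu_list = ",".join(str(g) for g in gpus)
    return (
        f"docker run --rm --gpus '\"device={gpu_list}\"' "
        f"-v $PWD:{workdir} -w {workdir} {image}"
    )

cmd = deploy_command("tensorflow", gpus=[0, 1])
```

Because each framework lives in its own image, Caffe, TensorFlow, and CNTK jobs can run side by side on one cluster without their dependencies conflicting.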
Management of deep learning training tasks
– Support for application templates, rapid submission of training tasks
– Workflow support: data pre-processing, model training, visualization
– Management of training tasks, observation of training progress and precision, parameter tuning
Computing resource management and scheduling
– Monitoring of CPU/GPU running state and performance
– On-demand scheduling of GPU resources, resource isolation
– Scheduling strategies: fair share, preemption, backfilling
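Backfilling, one of the strategies listed above, lets small jobs run ahead of a blocked queue head when they fit in the currently free GPUs. The sketch below is a simplified illustration (a full backfill scheduler would also verify that backfilled jobs finish before the head job's reservation), not AIStation's actual scheduler:

```python
from collections import namedtuple

Job = namedtuple("Job", "name gpus runtime")

def backfill_schedule(jobs, free_gpus):
    """Simplified backfilling: jobs start in queue order; once the head
    job cannot fit, later jobs may still start immediately ("backfill")
    as long as they fit in the GPUs that remain free."""
    started, waiting = [], []
    blocked = False
    for job in jobs:
        if not blocked and job.gpus <= free_gpus:
            free_gpus -= job.gpus
            started.append(job.name)
        elif not blocked:
            blocked = True            # head of queue cannot start yet
            waiting.append(job.name)
        elif job.gpus <= free_gpus:   # small job slips past the blocker
            free_gpus -= job.gpus
            started.append(job.name)
        else:
            waiting.append(job.name)
    return started, waiting

queue = [Job("train-A", gpus=4, runtime=10),
         Job("train-B", gpus=6, runtime=20),
         Job("debug-C", gpus=1, runtime=1)]
started, waiting = backfill_schedule(queue, free_gpus=8)
```

Here `train-B` needs 6 GPUs but only 4 are left after `train-A` starts, so it waits, while the small `debug-C` job backfills into the idle GPUs, improving utilization.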
Statistics and analysis
– Recording of CPU/GPU resource usage
– Statistics on cluster resource usage, generation of monthly reports
– Statistics and analysis of training tasks
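Monthly usage reports of this kind can be produced by aggregating raw accounting records per user and month. The record format `(user, day, gpu_hours)` below is an assumption for illustration, not AIStation's actual accounting schema:

```python
from collections import defaultdict
from datetime import date

def monthly_gpu_hours(records):
    """Aggregate GPU-hours per (month, user) from raw usage records.
    Each record is a hypothetical (user, day, gpu_hours) tuple."""
    totals = defaultdict(float)
    for user, day, gpu_hours in records:
        totals[(day.strftime("%Y-%m"), user)] += gpu_hours
    return dict(totals)

records = [
    ("alice", date(2023, 5, 3), 12.0),
    ("alice", date(2023, 5, 20), 8.0),
    ("bob", date(2023, 5, 7), 5.5),
    ("bob", date(2023, 6, 1), 4.0),
]
report = monthly_gpu_hours(records)
```

Keying the totals by `(month, user)` makes it trivial to render a monthly report table or to bill departments for their share of cluster time.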