
Training Service

Overview

The Training Service on AI Cloud provides a user-friendly interface for managing machine learning model training tasks. With this service, you can easily deploy, monitor, and manage your training jobs using the computational resources available on the platform.

Accessing the Training Dashboard

Step 1: Log in to the AI Cloud Console

Access the AI Cloud platform and navigate to the Training section.

Step 2: View Training Tasks

Upon entering the Training section, you will see a list of all your current training tasks. Each task is displayed with its name, GPU configuration, GPU memory, model, resource usage, and available operations.

Creating a New Training Task

Step 1: Click the "New" Button

Click the "+" button next to a task to create a new training task.

Step 2: Configure Task Parameters

Fill in the necessary details for your training task using the following parameters:

| Parameter | Description |
| --- | --- |
| Name | A unique identifier for your training task. |
| Model | Optional. Mounts one model from the model repository at the container's /data/ path (e.g., meta-llama-3-8b-instruct). |
| Workers | The number of nodes your training task needs (e.g., 2 for distributed multi-node training). |
| GPUs | The number and type of GPUs required for your task (e.g., 8 * NVIDIA HGX H100 for one worker). |
| GPU Memory | The amount of GPU memory allocated for the task (e.g., 80GiB). |
| Image | The container image to use as the runtime environment. |
| Enable Tensorboard | Optional. Enables TensorBoard monitoring. Requires a folder path containing TensorBoard visualization files. |
| Start Command | Optional. The command to execute in the container. |
| File system Mount | Optional. Mounts a path from the file system into the container. |

Configuring Your Task

Ensure that you select the appropriate GPUs and memory to match your training requirements.
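
As a rough sanity check on the Workers, GPUs, and GPU Memory values, the sketch below totals the resources a configuration requests. It assumes that the GPUs value is per worker (as in the "8 * NVIDIA HGX H100 for one worker" example) and that GPU Memory is per GPU; the second assumption is not stated on this page. `TaskResources` is a hypothetical helper for illustration, not part of AI Cloud.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResources:
    """Hypothetical helper mirroring the Workers / GPUs / GPU Memory fields."""

    workers: int          # number of nodes (the "Workers" field)
    gpus_per_worker: int  # GPUs selected for each worker
    gpu_memory_gib: int   # ASSUMPTION: memory per GPU, in GiB

    @property
    def total_gpus(self) -> int:
        # Every worker gets the same GPU count.
        return self.workers * self.gpus_per_worker

    @property
    def total_gpu_memory_gib(self) -> int:
        # Aggregate GPU memory across all workers.
        return self.total_gpus * self.gpu_memory_gib


if __name__ == "__main__":
    task = TaskResources(workers=2, gpus_per_worker=8, gpu_memory_gib=80)
    print(task.total_gpus, "GPUs,", task.total_gpu_memory_gib, "GiB in total")
```

Under these assumptions, the example values from the parameter descriptions (2 workers, 8 GPUs each, 80GiB) come to 16 GPUs and 1280 GiB of GPU memory.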

Step 3: Review and Deploy

Review your configurations and click "Confirm" to deploy the training task.

Builtin Environments

For each training task, AI Cloud provides a set of predefined environment variables that are injected into every worker's container.

| Variable | Description |
| --- | --- |
| $MASTER_ADDR | Address of the master node. |
| $MASTER_PORT | Port of the master node. |
| $NODE_RANK | Rank of the current node. |
| $HOST_NUM | Number of nodes. |
| $HOST_GPU_NUM | Number of GPUs per node. |
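
One way these variables might be used is to assemble a distributed launch command for the Start Command field. The sketch below builds a `torchrun` invocation from them. It assumes the chosen image ships PyTorch (which provides `torchrun`); `train.py` and its `--epochs` argument are hypothetical placeholders, and the helper functions are illustrative rather than anything AI Cloud provides.

```python
import os
import shlex


def world_size(env=None) -> int:
    """Total number of processes: nodes times GPUs per node."""
    env = os.environ if env is None else env
    return int(env["HOST_NUM"]) * int(env["HOST_GPU_NUM"])


def build_torchrun_command(script, script_args=(), env=None):
    """Map the injected variables onto torchrun's launch flags."""
    env = os.environ if env is None else env
    return [
        "torchrun",
        f"--nnodes={env['HOST_NUM']}",              # number of nodes
        f"--nproc_per_node={env['HOST_GPU_NUM']}",  # one process per GPU
        f"--node_rank={env['NODE_RANK']}",          # rank of this node
        f"--master_addr={env['MASTER_ADDR']}",
        f"--master_port={env['MASTER_PORT']}",
        script,
        *script_args,
    ]


if __name__ == "__main__":
    # Example values only; inside a worker container, os.environ is used instead.
    example_env = {
        "MASTER_ADDR": "10.0.0.1",
        "MASTER_PORT": "29500",
        "NODE_RANK": "0",
        "HOST_NUM": "2",
        "HOST_GPU_NUM": "8",
    }
    cmd = build_torchrun_command("train.py", ["--epochs", "3"], example_env)
    print(shlex.join(cmd))
    print("world size:", world_size(example_env))
```

Because every worker receives its own `$NODE_RANK`, the same command can be used on all workers; only the node rank differs between containers.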

Managing Training Tasks

Viewing Task Details

Click the Detail button of a task to view detailed information about its status, resource usage, and performance metrics.

Deleting a Task

To delete a training task, click the "Cancel" button next to the task in the list.

Permanent Deletion

Deletion is permanent and cannot be undone. Ensure that you no longer need the task before deleting.

TensorBoard Integration

For tasks that enable TensorBoard, click the "TensorBoard" link to visualize training metrics and logs.
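
TensorBoard can only show what the training code writes into the folder specified when TensorBoard was enabled. The sketch below shows one way to write scalar metrics there with PyTorch's `SummaryWriter`. It assumes PyTorch and the `tensorboard` package are installed in the image; the per-node subfolder layout is a design choice made here, not a platform requirement, and the example folder path is hypothetical.

```python
import os
from pathlib import Path


def run_log_dir(base_dir, env=None) -> Path:
    """Return a per-node subfolder so nodes do not mix their event files."""
    env = os.environ if env is None else env
    rank = env.get("NODE_RANK", "0")  # falls back to 0 outside the platform
    return Path(base_dir) / f"node_{rank}"


def log_losses(base_dir, losses):
    """Write one 'train/loss' scalar per step into the TensorBoard folder."""
    # Imported lazily so run_log_dir works even without PyTorch installed.
    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter(log_dir=str(run_log_dir(base_dir)))
    for step, loss in enumerate(losses):
        writer.add_scalar("train/loss", loss, global_step=step)
    writer.close()  # flush pending events to disk


# Usage inside a training script (hypothetical folder path):
#   log_losses("/mnt/shared/tb_logs", [0.9, 0.7, 0.55])
```

The base folder passed to `log_losses` would need to match the folder path entered for the Enable Tensorboard option.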

Monitoring Task Progress

Open the Detail page of a worker to see the latest 1000 lines of logs from the corresponding container. To download the complete logs, click the Save Logs button.
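
Once the complete logs have been downloaded with the Save Logs button, the same "latest N lines" view can be reproduced locally. The sketch below is a minimal example using only the Python standard library; the file name in the usage note is hypothetical, and the format of the saved file is assumed to be plain text.

```python
from collections import deque


def tail_lines(path, n=1000):
    """Return the last n lines of a text file without loading it all into memory."""
    # errors="replace" keeps going if the log contains undecodable bytes.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        # A bounded deque discards older lines as newer ones are read.
        return list(deque(f, maxlen=n))
```

For example, `tail_lines("worker-0.log", n=50)` would return the last 50 lines of a saved log, each still ending with its newline character.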

Monitoring Resource Usage

Check the "Metrics" column on the detail page to monitor the GPU metrics of each task. This helps in managing costs and resource allocation effectively.

Cost Management

Monitor your resource usage to optimize costs and resource utilization.

Webshell

Each worker has an independent webshell. You can find it on the worker's detail page and use it to execute commands in that worker's container.
For the whole task, use the WebShell button in the Operation column of the task item. This webshell executes commands simultaneously in all workers' containers.

Next Steps

After deploying your training tasks, you can use the Training Dashboard to monitor their progress and performance. Adjust your training configurations as needed to optimize results. For further assistance or to explore advanced features, refer to the related sections or visit our support page on the AI Cloud platform.

For more information on fine-tuning your models, proceed to the Finetune Service.