Training Service¶

Overview¶

The Training Service on AI Cloud provides a user-friendly interface for managing machine learning model training tasks. With this service, you can easily deploy, monitor, and manage your training jobs using the computational resources available on the platform.

Accessing the Training Dashboard¶

Step 1: Log in to the AI Cloud Console¶

Access the AI Cloud platform and navigate to the Training section.

Step 2: View Training Tasks¶

Upon entering the Training section, you will see a list of all your current training tasks. Each task is displayed with its name, GPU configuration, GPU memory, model, resource usage, and available operations.

Creating a New Training Task¶

Step 1: Click the "New" Button¶

In the task list view, click the "New" button (shown as "+") to create a new training task.

Step 2: Configure Task Parameters¶

Fill in the necessary details for your training task using the following parameters:

Parameter	Description
Name	A unique identifier for your training task.
Model	Optional. Mounts one model from the model repository at your container's `/data/` path (e.g., meta-llama-3-8b-instruct).
Workers	The number of nodes your training task needs (e.g., 2 for distributed multi-node training).
GPUs	The number and type of GPUs required per worker (e.g., 8 * NVIDIA HGX H100 for one worker).
GPU Memory	The amount of GPU memory allocated for the task (e.g., 80GiB).
Image	The container image to use as the runtime environment.
Enable TensorBoard	Optional. Enables TensorBoard monitoring. You must specify the folder path that contains the TensorBoard visualization files.
Start Command	Optional. The command to execute in the container.
File System Mount	Optional. Mounts a path from the file system into the container.

Configuring Your Task

Ensure that you select the appropriate GPUs and memory to match your training requirements.
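
As a rough aid for sizing, the sketch below estimates how much GPU memory the model state alone would take per GPU. It is a back-of-the-envelope calculation, not a platform feature. It assumes mixed-precision training with the Adam optimizer (a common rule of thumb of about 16 bytes per parameter for weights, gradients, and optimizer state), assumes the state is fully sharded across all GPUs (for example with FSDP or ZeRO), and assumes the GPU Memory value is per GPU. Activations and framework overhead are not included, and with plain data parallelism every GPU would hold a full copy of the state instead of a share. The function names and the 8-billion parameter count are illustrative.

```python
GIB = 1024 ** 3


def training_state_gib(num_params, bytes_per_param=16):
    """Approximate memory for weights, gradients, and Adam optimizer state.

    16 bytes per parameter is a rule of thumb for mixed-precision Adam
    training (an assumption); activations are NOT included.
    """
    return num_params * bytes_per_param / GIB


def per_gpu_state_gib(num_params, workers, gpus_per_worker):
    """Per-GPU share of the model state, assuming it is fully sharded."""
    total_gpus = workers * gpus_per_worker
    return training_state_gib(num_params) / total_gpus


# Illustrative values: roughly 8 billion parameters, 1 worker with 8 GPUs,
# 80 GiB per GPU (assumed to be what the GPU Memory field expresses).
share = per_gpu_state_gib(8e9, workers=1, gpus_per_worker=8)
gib_per_gpu = 80
print(f"model state per GPU: {share:.1f} GiB of {gib_per_gpu} GiB")
print("leaves room for activations" if share < gib_per_gpu else "does not fit")
```

If the per-GPU share is close to or above the selected GPU memory, consider more workers, more GPUs per worker, or a smaller model.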

Step 3: Review and Deploy¶

Review your configurations and click "Confirm" to deploy the training task.

Builtin Environments¶

For each training task, AI Cloud injects a set of predefined environment variables into every worker's container.

Variable	Description
$MASTER_ADDR	Address of the master node.
$MASTER_PORT	Port of the master node.
$NODE_RANK	Rank of the current node.
$HOST_NUM	Total number of nodes.
$HOST_GPU_NUM	Number of GPUs per node.
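
These variables map naturally onto the arguments of a distributed launcher. The sketch below shows one possible way to turn them into a `torchrun` command line, for example as the basis of a Start Command. It assumes PyTorch is available in the chosen image and that `$NODE_RANK` is zero-based, neither of which is stated above. The script name `train.py` and the fallback values are hypothetical.

```python
import os
import shlex


def build_torchrun_command(env, script="train.py"):
    """Build a torchrun argument list from the injected variables.

    `script` is a hypothetical training entry point, not something the
    platform provides.
    """
    nodes = int(env["HOST_NUM"])              # total number of nodes
    gpus_per_node = int(env["HOST_GPU_NUM"])  # GPUs on each node
    command = [
        "torchrun",
        f"--nnodes={nodes}",
        f"--nproc_per_node={gpus_per_node}",
        f"--node_rank={env['NODE_RANK']}",
        f"--master_addr={env['MASTER_ADDR']}",
        f"--master_port={env['MASTER_PORT']}",
        script,
    ]
    world_size = nodes * gpus_per_node  # one training process per GPU
    return command, world_size


if __name__ == "__main__":
    # Illustrative fallbacks so the sketch also runs outside a worker container.
    fallbacks = {
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": "29500",
        "NODE_RANK": "0",
        "HOST_NUM": "1",
        "HOST_GPU_NUM": "8",
    }
    env = {key: os.environ.get(key, value) for key, value in fallbacks.items()}
    command, world_size = build_torchrun_command(env)
    print(shlex.join(command))
    print("world size:", world_size)
```

Every worker runs the same command. Only the value of `$NODE_RANK` differs between nodes, which is what lets the launcher assign distinct ranks.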

Managing Training Tasks¶

Viewing Task Details¶

Click the Detail button of a task to view detailed information about its status, resource usage, and performance metrics.

Deleting a Task¶

To delete a training task, click the "Cancel" button next to the task in the list.

Permanent Deletion

Deletion is permanent and cannot be undone. Ensure that you no longer need the task before deleting.

TensorBoard Integration¶

For tasks that enable TensorBoard, click the "TensorBoard" link to visualize training metrics and logs.

Monitoring Task Progress¶

Open the Detail page of a worker to see the latest 1000 lines of logs from the corresponding container. To download the complete logs, click the Save Logs button.
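
Once a complete log has been saved, it can be inspected locally. The following is a minimal sketch, assuming the saved log is a plain text file; the helper names and the demo log contents are hypothetical and not part of the platform. It reads the last lines of a file, mirroring the 1000-line view, and filters them for a keyword.

```python
import os
import tempfile
from collections import deque


def tail_lines(path, count=1000):
    """Return the last `count` lines of a text file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        # A bounded deque keeps only the most recent `count` lines.
        return list(deque(handle, maxlen=count))


def lines_containing(lines, keyword):
    """Case-insensitive keyword filter, e.g. to spot errors in a log."""
    needle = keyword.lower()
    return [line.rstrip("\n") for line in lines if needle in line.lower()]


if __name__ == "__main__":
    # Demo with a generated file standing in for a saved worker log.
    with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as demo:
        for step in range(1500):
            demo.write(f"step {step} loss ok\n")
        demo.write("RuntimeError: example failure\n")
        demo_path = demo.name

    recent = tail_lines(demo_path, count=1000)
    print(len(recent), "lines kept")
    print(lines_containing(recent, "error"))
    os.remove(demo_path)
```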

Monitoring Resource Usage¶

Watch the "Metrics" column on the detail page to monitor the GPU metrics of each task. This helps you manage costs and resource allocation effectively.

Cost Management

Monitor your resource usage to optimize costs and resource utilization.

Webshell¶

Each worker has its own independent webshell. You can find it on the detail page and use it to execute commands in that worker's container.
For the whole task, use the WebShell button in the Operation column of the task item. This webshell executes commands simultaneously in all workers' containers.

Next Steps¶

After deploying your training tasks, you can use the Training Dashboard to monitor their progress and performance. Adjust your training configurations as needed to optimize results. For further assistance or to explore advanced features, refer to the related sections or visit our support page on the AI Cloud platform.

For more information on fine-tuning your models, proceed to the Finetune Service.