
Model Upload

Use the Gear model building tool to upload your custom model.

1. Instructions

  • This tool supports macOS and Linux only
  • The machine running it must have a working Docker environment (a quick check is shown below)
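
For example, you can confirm that Docker is installed and the daemon is reachable with:

$ docker info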

2. Install Gear

sudo curl -o /usr/local/bin/gear -L http://oss-high-qy01.cdsgss.com/gear/gear_`uname -s`_`uname -m` && sudo chmod +x /usr/local/bin/gear

3. Start the Model

3.1. Initialization

$ mkdir qwen0_5B && cd qwen0_5B
$ gear init

Setting up the current directory for use with gear...

✅ Created /data/qwen/predict.py
✅ Created /data/qwen/.dockerignore
✅ Created /data/qwen/.github/workflows/push.yaml
✅ Created /data/qwen/gear.yaml

Done!

3.2. Prepare Model Files

Pulling model files from a remote source is not currently supported; you must download the model to the local machine yourself.

$ git clone https://hf-mirror.com/Qwen/Qwen2.5-0.5B
Cloning into 'Qwen2.5-0.5B'...
remote: Enumerating objects: 42, done.
remote: Counting objects: 100% (39/39), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 42 (delta 18), reused 0 (delta 0), pack-reused 3 (from 1)
Unpacking objects: 100% (42/42), 3.61 MiB | 3.99 MiB/s, done.

3.3. Prepare gear.yaml Configuration File

# Configuration for gear ⚙️

build:
  # Whether GPU is needed
  gpu: true

  # Ubuntu system packages to install
  # system_packages:
  #   - "libgl1-mesa-glx"
  #   - "libglib2.0-0"

  # Python version '3.11'
  python_version: "3.11"

  # Python packages <package-name>==<version>
  python_packages:
    - "transformers"
    - "torch"
    - "accelerate>=0.26.0"

  # Commands to run after environment setup
  # run:
  #   - "echo env is ready!"
  #   - "echo another command if needed"

# predict.py defines how to run prediction on the model
predict: "predict.py:Predictor"

# Control request concurrency, if specified, `def predict` in predict.py must be changed to `async def predict`
# concurrency:
#   max: 20

3.4. Prepare predict.py File

from gear import BasePredictor, Input, ConcatenateIterator

class Predictor(BasePredictor):
    def setup(self) -> None:
        """Load the model into memory to efficiently run multiple predictions"""
        # self.model = AutoModelForCausalLM.from_pretrained(
        #     "./Qwen2.5-0.5B",
        #     torch_dtype="auto",
        #     device_map="auto"
        # )
        # self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    async def predict(
        self,
        prompt: str = Input(description="prompt input", default="Hello")
    ) -> ConcatenateIterator[str]:
        """Run a prediction on the model"""
        tokens = ["This", "is", "a", "test", "example"]

        for token in tokens:
            yield token

3.5. Local Test Model

Run the test command:

$ gear serve

Building Docker image from environment in gear.yaml...
[+] Building 5.6s (13/13) FINISHED docker:default
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 844B 0.0s
=> resolve image config for docker-image://registry-bj.capitalonline.net/maas/dockerfile:1.4 5.2s
=> CACHED docker-image://registry-bj.capitalonline.net/maas/dockerfile:1.4@sha256:1f6e06f58b23c0700df6c05284ac25b29088 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 366B 0.0s
=> [internal] load metadata for registry-bj.capitalonline.net/maas/kit-base:cuda11.8-python3.11 0.1s
=> [stage-0 1/6] FROM registry-bj.capitalonline.net/maas/kit-base:cuda11.8-python3.11@sha256:0e9a503b356d548998e326f24 0.0s
=> => resolve registry-bj.capitalonline.net/maas/kit-base:cuda11.8-python3.11@sha256:0e9a503b356d548998e326f24d6ca50d4 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 137.36kB 0.0s
=> CACHED [stage-0 2/6] COPY .gear/tmp/build20250321152525.358211857500444/gear-1.0.0-py3-none-any.whl /tmp/gear-1.0.0 0.0s
=> CACHED [stage-0 3/6] RUN --mount=type=cache,target=/root/.cache/pip pip install --no-cache-dir /tmp/gear-1.0.0-py3- 0.0s
=> CACHED [stage-0 4/6] COPY .gear/tmp/build20250321152525.358211857500444/requirements.txt /tmp/requirements.txt 0.0s
=> CACHED [stage-0 5/6] RUN --mount=type=cache,target=/root/.cache/pip pip install -r /tmp/requirements.txt 0.0s
=> CACHED [stage-0 6/6] WORKDIR /src 0.0s
=> exporting to image 0.0s
=> => exporting layers 0.0s
=> => preparing layers for inline cache 0.0s
=> => writing image sha256:9342ca519d1f66e0f642cb7630581700a56f8a86f11c83ca89e9bf440a461b33 0.0s
=> => naming to docker.io/library/gear-qwen05b-base 0.0s

Running 'python --check-hash-based-pycs never -m gear.server.http --await-explicit-shutdown true' in Docker with the current directory mounted as a volume...

Serving at http://127.0.0.1:8393

....

After the service starts, verify:


$ curl -X POST http://127.0.0.1:8393/predictions -H 'Content-Type: application/json' -H 'Stream: true' -d '{"input": {"prompt": "1+1=?"}}'

{"input": {"prompt": "1+1=?"}, "output": ["This is"], "id": "1e525032-cbaa-4fa5-82a3-b915754f072c", "version": null, "created_at": null, "started_at": "2025-03-21T08:05:56.956757+00:00", "completed_at": null, "logs": "", "error": null, "status": "processing", "metrics": null}

{"input": {"prompt": "1+1=?"}, "output": ["一", "个"], "id": "1e525032-cbaa-4fa5-82a3-b915754f072c", "version": null, "created_at": null, "started_at": "2025-03-21T08:05:56.956757+00:00", "completed_at": null, "logs": "", "error": null, "status": "processing", "metrics": null}

{"input": {"prompt": "1+1=?"}, "output": ["Test", "Example"], "id": "1e525032-cbaa-4fa5-82a3-b915754f072c", "version": null, "created_at": null, "started_at": "2025-03-21T08:05:56.956757+00:00", "completed_at": null, "logs": "", "error": null, "status": "processing", "metrics": null}

{"input": {"prompt": "1+1=?"}, "output": [], "id": "1e525032-cbaa-4fa5-82a3-b915754f072c", "version": null, "created_at": null, "started_at": "2025-03-21T08:05:56.956757+00:00", "completed_at": "2025-03-21T08:05:57.154953+00:00", "logs": "", "error": null, "status": "succeeded", "metrics": {"predict_time": 0.198196}}

4. Publish Model on GpuGeek

4.1. Create Model

Click Create Model

4.2. Get Image Repository Token

Click Get Token

4.3. Login to Image Repository

Use the obtained token to log in to the image repository:

$ gear login 

4.4. Build and Upload Image

$ gear build -t maas-harbor-cn.yun-paas.com/maas-<user-id>/<maas-name>:<tag>
$ docker push maas-harbor-cn.yun-paas.com/maas-<user-id>/<maas-name>:<tag>

You can also use the `gear push` command, but then you must define `image` in gear.yaml:

# Configuration for gear ⚙️

build:
  ...

# predict.py defines how to run prediction on the model
predict: "predict.py:Predictor"

image: maas-harbor-cn.yun-paas.com/maas-<user-id>/<maas-name>:<tag>

5. Predictor

Predictor.setup()

Prepare the model so multiple predictions can run efficiently. Use this optional method for any expensive one-time operations, such as loading the trained model or instantiating data transformations.

Predictor.predict(**kwargs)

Run a single prediction. This required method is where you call the model you loaded in setup(). You may also want to add preprocessing and postprocessing code here.

The predict() method takes arbitrary named parameters; each name must correspond to an Input() annotation (see the sketch below).
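
For illustration, a minimal sketch that ties the two methods together, assuming the Qwen2.5-0.5B files downloaded in step 3.2 and the transformers package declared in gear.yaml:

from transformers import AutoModelForCausalLM, AutoTokenizer
from gear import BasePredictor, Input

class Predictor(BasePredictor):
    def setup(self) -> None:
        # Expensive one-time work: load the locally downloaded model
        self.tokenizer = AutoTokenizer.from_pretrained("./Qwen2.5-0.5B")
        self.model = AutoModelForCausalLM.from_pretrained(
            "./Qwen2.5-0.5B", torch_dtype="auto", device_map="auto"
        )

    def predict(self, prompt: str = Input(description="prompt input", default="Hello")) -> str:
        # Per-request work: tokenize, generate, decode
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=64)
        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)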

Input()

Define each parameter in predict() using a gear.Input() object. The Input() function takes the following keyword arguments:

  • description: A description of this input passed to the model user.
  • default: Sets the default value of the input. If not passed, the input is required. If explicitly set to None, the input is optional.
  • ge: For int or float types, the value must be greater than or equal to this number.
  • le: For int or float types, the value must be less than or equal to this number.
  • min_length: For str types, the minimum string length.
  • max_length: For str types, the maximum string length.
  • regex: For str types, the string must match this regex.
  • choices: For str or int types, a list of possible values for this input.
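
As a minimal sketch, a predict() signature combining several of these keyword arguments (the parameter names here are only illustrative) might look like:

from gear import BasePredictor, Input

class Predictor(BasePredictor):
    def predict(
        self,
        prompt: str = Input(description="prompt input", min_length=1, max_length=4096),
        temperature: float = Input(description="sampling temperature", default=0.7, ge=0.0, le=2.0),
        language: str = Input(description="output language", default="en", choices=["en", "zh"]),
    ) -> str:
        # Echo the validated inputs back as a single string
        return f"{prompt} (language={language}, temperature={temperature})"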

Path()

The gear.Path object represents the path of a file on disk and is used to pass files into and out of the model. It is useful for tasks such as text-to-image or text-to-video, where the prediction returns a file path.

from gear import BasePredictor, Path

class Predictor(BasePredictor):
    def setup(self) -> None:
        print("setup is ready...")

    def predict(self) -> Path:
        # Return the path of the generated file
        output_path = "/tmp/outp.mp3"
        return Path(output_path)

Streaming Output

from gear import BasePredictor, ConcatenateIterator

class Predictor(BasePredictor):
    def predict(self) -> ConcatenateIterator[str]:
        tokens = ["Test", "streaming", "output", "."]
        for token in tokens:
            yield token + " "

Async and Concurrency

You can declare the predict() method as async def predict(...). If you have an asynchronous predict() function, you can also make the setup() function asynchronous:

from gear import BasePredictor

class Predictor(BasePredictor):
    async def setup(self) -> None:
        print("async setup is also supported...")

    async def predict(self) -> str:
        print("async predict")
        return "hello world"

Models with asynchronous predict() functions can run predictions concurrently, up to the limit specified in gear.yaml under concurrency.max. Attempts to exceed this limit will return a 409 Conflict response.
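
For example, to allow up to 20 concurrent predictions, the concurrency block shown commented out in section 3.3 can be enabled in gear.yaml:

# Control request concurrency; predict() must be async def
concurrency:
  max: 20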

OpenAI-Compatible

import json

from gear import BasePredictor, ConcatenateIterator, Input

class Predictor(BasePredictor):
    def setup(self) -> None:
        print("setup is ready...")

    async def predict(
        self,
        prompt: str = Input(description="prompt input"),
        max_tokens: int = Input(
            description="Maximum number of tokens generated by the model.",
            default=1024, ge=1, le=8192),
        temperature: float = Input(
            description="Controls randomness of the model output. Higher values (closer to 1) make output more random, lower values make it more deterministic.",
            default=0.7, ge=0.00, le=2.0),
        top_p: float = Input(
            description="Controls output diversity. Lower values make output more focused, higher values make it more diverse.",
            default=0.7, ge=0.1, le=1.0),
        top_k: int = Input(
            description="Samples from the top k tokens. Helps speed up generation and can improve quality.",
            default=50, ge=1, le=100),
        frequency_penalty: float = Input(
            description="Reduces the likelihood of repeating words by penalizing those already used frequently.",
            default=0.0, ge=-2.0, le=2.0)
    ) -> ConcatenateIterator[str]:

        # The prompt may be an OpenAI-style JSON message list; fall back to a plain prompt otherwise
        messages = []
        try:
            messages = json.loads(prompt)
            if not isinstance(messages, list):
                raise ValueError("messages must be a list")
            for msg in messages:
                if not isinstance(msg, dict):
                    raise TypeError(f"Message item must be a dict, got {type(msg)}")
                if "role" not in msg:
                    raise KeyError("Message missing 'role' field")
                if "content" not in msg:
                    raise KeyError("Message missing 'content' field")
                if not isinstance(msg["content"], str):
                    raise TypeError("content must be a string")
        except Exception:
            messages = [
                {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
                {"role": "user", "content": prompt},
            ]

        # youmodel is a placeholder for your own model's streaming generate call
        for token in youmodel.generate(messages=messages):
            yield token

Currently, only the parameters prompt, max_tokens, temperature, top_p, top_k, and frequency_penalty are supported, and messages must be handled as shown above (see the example request below).
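
For example, against the local gear serve endpoint from section 3.5, an OpenAI-style message list can be passed as a JSON string in the prompt field (the values below are only illustrative):

$ curl -X POST http://127.0.0.1:8393/predictions -H 'Content-Type: application/json' -H 'Stream: true' -d '{"input": {"prompt": "[{\"role\": \"user\", \"content\": \"1+1=?\"}]", "max_tokens": 1024, "temperature": 0.7}}'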