
Model Upload

Use the Gear model building tool to upload your custom model.

1. Instructions

  • This tool supports macOS and Linux only
  • The machine running it must have a working Docker environment (a quick check is shown below)
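
For example, you can confirm that Docker is installed and the daemon is reachable with:

$ docker info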

2. Install Gear

sudo curl -o /usr/local/bin/gear -L http://oss-high-qy01.cdsgss.com/gear/gear_`uname -s`_`uname -m` && sudo chmod +x /usr/local/bin/gear

3. Start the Model

3.1. Initialization

$ mkdir qwen0_5B && cd qwen0_5B
$ gear init

Setting up the current directory for use with gear...

✅ Created /data/qwen/predict.py
✅ Created /data/qwen/.dockerignore
✅ Created /data/qwen/.github/workflows/push.yaml
✅ Created /data/qwen/gear.yaml

Done!

3.2. Prepare Model Files

Pulling model files from a remote source is not currently supported; you must download the model to the local machine yourself.

$ git clone https://hf-mirror.com/Qwen/Qwen2.5-0.5B
Cloning into 'Qwen2.5-0.5B'...
remote: Enumerating objects: 42, done.
remote: Counting objects: 100% (39/39), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 42 (delta 18), reused 0 (delta 0), pack-reused 3 (from 1)
Unpacking objects: 100% (42/42), 3.61 MiB | 3.99 MiB/s, done.

3.3. Prepare gear.yaml Configuration File

# Configuration for gear ⚙️

build:
  # Whether GPU is needed
  gpu: true

  # Ubuntu system packages to install
  # system_packages:
  #   - "libgl1-mesa-glx"
  #   - "libglib2.0-0"

  # Python version '3.11'
  python_version: "3.11"

  # Python packages <package-name>==<version>
  python_packages:
    - "transformers"
    - "torch"
    - "accelerate>=0.26.0"

  # Commands to run after environment setup
  # run:
  #   - "echo env is ready!"
  #   - "echo another command if needed"

# predict.py defines how to run prediction on the model
predict: "predict.py:Predictor"

# Control request concurrency, if specified, `def predict` in predict.py must be changed to `async def predict`
# concurrency:
#   max: 20

3.4. Prepare predict.py File

from gear import BasePredictor, Input, ConcatenateIterator

class Predictor(BasePredictor):
    def setup(self) -> None:
        """Load the model into memory to efficiently run multiple predictions"""
        # self.model = AutoModelForCausalLM.from_pretrained(
        #     "./Qwen2.5-0.5B",
        #     torch_dtype="auto",
        #     device_map="auto"
        # )
        # self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    async def predict(
        self,
        prompt: str = Input(description="prompt input", default="Hello")
    ) -> ConcatenateIterator[str]:
        """Run a prediction on the model"""
        tokens = ["This", "is", "a", "test", "example"]

        for token in tokens:
            yield token

3.5. Local Test Model

Run the test command:

$ gear serve

Building Docker image from environment in gear.yaml...
[+] Building 5.6s (13/13) FINISHED docker:default
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 844B 0.0s
=> resolve image config for docker-image://registry-bj.capitalonline.net/maas/dockerfile:1.4 5.2s
=> CACHED docker-image://registry-bj.capitalonline.net/maas/dockerfile:1.4@sha256:1f6e06f58b23c0700df6c05284ac25b29088 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 366B 0.0s
=> [internal] load metadata for registry-bj.capitalonline.net/maas/kit-base:cuda11.8-python3.11 0.1s
=> [stage-0 1/6] FROM registry-bj.capitalonline.net/maas/kit-base:cuda11.8-python3.11@sha256:0e9a503b356d548998e326f24 0.0s
=> => resolve registry-bj.capitalonline.net/maas/kit-base:cuda11.8-python3.11@sha256:0e9a503b356d548998e326f24d6ca50d4 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 137.36kB 0.0s
=> CACHED [stage-0 2/6] COPY .gear/tmp/build20250321152525.358211857500444/gear-1.0.0-py3-none-any.whl /tmp/gear-1.0.0 0.0s
=> CACHED [stage-0 3/6] RUN --mount=type=cache,target=/root/.cache/pip pip install --no-cache-dir /tmp/gear-1.0.0-py3- 0.0s
=> CACHED [stage-0 4/6] COPY .gear/tmp/build20250321152525.358211857500444/requirements.txt /tmp/requirements.txt 0.0s
=> CACHED [stage-0 5/6] RUN --mount=type=cache,target=/root/.cache/pip pip install -r /tmp/requirements.txt 0.0s
=> CACHED [stage-0 6/6] WORKDIR /src 0.0s
=> exporting to image 0.0s
=> => exporting layers 0.0s
=> => preparing layers for inline cache 0.0s
=> => writing image sha256:9342ca519d1f66e0f642cb7630581700a56f8a86f11c83ca89e9bf440a461b33 0.0s
=> => naming to docker.io/library/gear-qwen05b-base 0.0s

Running 'python --check-hash-based-pycs never -m gear.server.http --await-explicit-shutdown true' in Docker with the current directory mounted as a volume...

Serving at http://127.0.0.1:8393

....

After the service starts, verify:


$ curl -X POST http://127.0.0.1:8393/predictions -H 'Content-Type: application/json' -H 'Stream: true' -d '{"input": {"prompt": "1+1=?"}}'

{"input": {"prompt": "1+1=?"}, "output": ["This is"], "id": "1e525032-cbaa-4fa5-82a3-b915754f072c", "version": null, "created_at": null, "started_at": "2025-03-21T08:05:56.956757+00:00", "completed_at": null, "logs": "", "error": null, "status": "processing", "metrics": null}

{"input": {"prompt": "1+1=?"}, "output": ["一", "个"], "id": "1e525032-cbaa-4fa5-82a3-b915754f072c", "version": null, "created_at": null, "started_at": "2025-03-21T08:05:56.956757+00:00", "completed_at": null, "logs": "", "error": null, "status": "processing", "metrics": null}

{"input": {"prompt": "1+1=?"}, "output": ["Test", "Example"], "id": "1e525032-cbaa-4fa5-82a3-b915754f072c", "version": null, "created_at": null, "started_at": "2025-03-21T08:05:56.956757+00:00", "completed_at": null, "logs": "", "error": null, "status": "processing", "metrics": null}

{"input": {"prompt": "1+1=?"}, "output": [], "id": "1e525032-cbaa-4fa5-82a3-b915754f072c", "version": null, "created_at": null, "started_at": "2025-03-21T08:05:56.956757+00:00", "completed_at": "2025-03-21T08:05:57.154953+00:00", "logs": "", "error": null, "status": "succeeded", "metrics": {"predict_time": 0.198196}}

4. Publish Model on GpuGeek

4.1. Create Model

Click Create Model

4.2. Get Image Repository Token

Click Get Token

4.3. Login to Image Repository

Use the obtained token to log in to the image repository:

$ gear login 

4.4. Build and Upload Image

$ gear build -t maas-harbor-cn.yun-paas.com/maas-<user-id>/<maas-name>:<tag>
$ docker push maas-harbor-cn.yun-paas.com/maas-<user-id>/<maas-name>:<tag>

You can also use the `gear push` command, but then you must define `image` in gear.yaml:

# Configuration for gear ⚙️

build:
  ...

# predict.py defines how to run prediction on the model
predict: "predict.py:Predictor"

image: maas-harbor-cn.yun-paas.com/maas-<user-id>/<maas-name>:<tag>

5. Predictor

Predictor.setup()

Prepare the model so multiple predictions can run efficiently. Use this optional method for any expensive one-time operations, such as loading the trained model or instantiating data transformations.

Predictor.predict(**kwargs)

Run a single prediction. This required method is where you call the model you loaded in setup(). You may also want to add preprocessing and postprocessing code here.

The predict() method takes arbitrary named parameters; each name must correspond to an Input() annotation (see the sketch below).
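
For illustration, a minimal sketch that ties the two methods together, assuming the Qwen2.5-0.5B files downloaded in step 3.2 and the transformers package declared in gear.yaml:

from transformers import AutoModelForCausalLM, AutoTokenizer
from gear import BasePredictor, Input

class Predictor(BasePredictor):
    def setup(self) -> None:
        # Expensive one-time work: load the locally downloaded model
        self.tokenizer = AutoTokenizer.from_pretrained("./Qwen2.5-0.5B")
        self.model = AutoModelForCausalLM.from_pretrained(
            "./Qwen2.5-0.5B", torch_dtype="auto", device_map="auto"
        )

    def predict(self, prompt: str = Input(description="prompt input", default="Hello")) -> str:
        # Per-request work: tokenize, generate, decode
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=64)
        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)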

Input()

Define each parameter in predict() using a gear.Input() object. The Input() function takes the following keyword arguments:

  • description: A description of this input passed to the model user.
  • default: Sets the default value of the input. If not passed, the input is required. If explicitly set to None, the input is optional.
  • ge: For int or float types, the value must be greater than or equal to this number.
  • le: For int or float types, the value must be less than or equal to this number.
  • min_length: For str types, the minimum string length.
  • max_length: For str types, the maximum string length.
  • regex: For str types, the string must match this regex.
  • choices: For str or int types, a list of possible values for this input.
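
As a minimal sketch, a predict() signature combining several of these keyword arguments (the parameter names here are only illustrative) might look like:

from gear import BasePredictor, Input

class Predictor(BasePredictor):
    def predict(
        self,
        prompt: str = Input(description="prompt input", min_length=1, max_length=4096),
        temperature: float = Input(description="sampling temperature", default=0.7, ge=0.0, le=2.0),
        language: str = Input(description="output language", default="en", choices=["en", "zh"]),
    ) -> str:
        # Echo the validated inputs back as a single string
        return f"{prompt} (language={language}, temperature={temperature})"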

Path()

The gear.Path object represents the path of a file on disk and is used to pass files into and out of the model. It is useful for tasks such as text-to-image or text-to-video, where the prediction returns a file path.

from gear import BasePredictor, Path

class Predictor(BasePredictor):
    def setup(self) -> None:
        print("setup is ready...")

    def predict(self) -> Path:
        # Return the path of the generated file
        output_path = "/tmp/outp.mp3"
        return Path(output_path)

Streaming Output

from gear import BasePredictor, ConcatenateIterator

class Predictor(BasePredictor):
    def predict(self) -> ConcatenateIterator[str]:
        tokens = ["Test", "streaming", "output", "."]
        for token in tokens:
            yield token + " "

Async and Concurrency

You can declare the predict() method as async def predict(...). If you have an asynchronous predict() function, you can also make the setup() function asynchronous:

from gear import BasePredictor

class Predictor(BasePredictor):
    async def setup(self) -> None:
        print("async setup is also supported...")

    async def predict(self) -> str:
        print("async predict")
        return "hello world"

Models with asynchronous predict() functions can run predictions concurrently, up to the limit specified in gear.yaml under concurrency.max. Attempts to exceed this limit will return a 409 Conflict response.
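
For example, to allow up to 20 concurrent predictions, the concurrency block shown commented out in section 3.3 can be enabled in gear.yaml:

# Control request concurrency; predict() must be async def
concurrency:
  max: 20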

OpenAI-Compatible

import json

from gear import BasePredictor, ConcatenateIterator, Input

class Predictor(BasePredictor):
    def setup(self) -> None:
        print("setup is ready...")

    async def predict(
        self,
        prompt: str = Input(description="prompt input"),
        max_tokens: int = Input(
            description="Maximum number of tokens generated by the model.",
            default=1024, ge=1, le=8192),
        temperature: float = Input(
            description="Controls randomness of the model output. Higher values (closer to 1) make output more random, lower values make it more deterministic.",
            default=0.7, ge=0.00, le=2.0),
        top_p: float = Input(
            description="Controls output diversity. Lower values make output more focused, higher values make it more diverse.",
            default=0.7, ge=0.1, le=1.0),
        top_k: int = Input(
            description="Samples from the top k tokens. Helps speed up generation and can improve quality.",
            default=50, ge=1, le=100),
        frequency_penalty: float = Input(
            description="Reduces the likelihood of repeating words by penalizing those already used frequently.",
            default=0.0, ge=-2.0, le=2.0)
    ) -> ConcatenateIterator[str]:

        # The prompt may be an OpenAI-style JSON message list; fall back to a plain prompt otherwise
        messages = []
        try:
            messages = json.loads(prompt)
            if not isinstance(messages, list):
                raise ValueError("messages must be a list")
            for msg in messages:
                if not isinstance(msg, dict):
                    raise TypeError(f"Message item must be a dict, got {type(msg)}")
                if "role" not in msg:
                    raise KeyError("Message missing 'role' field")
                if "content" not in msg:
                    raise KeyError("Message missing 'content' field")
                if not isinstance(msg["content"], str):
                    raise TypeError("content must be a string")
        except Exception:
            messages = [
                {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
                {"role": "user", "content": prompt},
            ]

        # youmodel is a placeholder for your own model's streaming generate call
        for token in youmodel.generate(messages=messages):
            yield token

Currently, only the parameters prompt, max_tokens, temperature, top_p, top_k, and frequency_penalty are supported, and messages must be handled as shown above (see the example request below).
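
For example, against the local gear serve endpoint from section 3.5, an OpenAI-style message list can be passed as a JSON string in the prompt field (the values below are only illustrative):

$ curl -X POST http://127.0.0.1:8393/predictions -H 'Content-Type: application/json' -H 'Stream: true' -d '{"input": {"prompt": "[{\"role\": \"user\", \"content\": \"1+1=?\"}]", "max_tokens": 1024, "temperature": 0.7}}'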