Self-Hosted Embeddings
Run Hub embeddings against a self-hosted, OpenAI-compatible embeddings endpoint with the bundled Helm runtime or your own service.
Hub’s similar-feedback and semantic-search endpoints rely on text embeddings. By default these are produced by a managed provider, but Hub can also run against a self-hosted, OpenAI-compatible embeddings endpoint. This keeps open-text feedback inside your own infrastructure and removes the dependency on an external AI provider.
This guide shows how to configure Hub to use a self-hosted embeddings runtime and how to deploy the bundled Hugging Face Text Embeddings Inference (TEI) runtime that ships with the Hub Helm chart.
Recommended Model
The default recommended model for self-hosted Hub deployments is Alibaba-NLP/gte-multilingual-base:
- Apache-2.0 licensed, no Hugging Face access gating.
- Native 768-dimensional output, matching Hub’s stored embedding dimension.
- 8192-token context window.
- 75+ languages out of the box.
- Runs on CPU; a 4 vCPU / 8 GiB node is sufficient for the bundled runtime.
You can run any model that TEI supports as long as it returns 768-dimensional vectors, but the rest of this guide assumes the recommended model.
How Hub Talks to a Self-Hosted Endpoint
Hub speaks the OpenAI embeddings protocol when EMBEDDING_PROVIDER=openai and
sends requests to the URL in EMBEDDING_BASE_URL. The endpoint must accept
POST /v1/embeddings with an Authorization: Bearer <key> header and return
an OpenAI-compatible response.
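As a rough illustration of the request shape described above, the following Python sketch builds (without sending) an OpenAI-style embeddings request. The function name, the example URL, and the key are placeholders chosen for this example; this is not Hub's actual client code.

```python
import json
import urllib.request


def build_embeddings_request(
    base_url: str, api_key: str, model: str, text: str
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style embeddings request."""
    # Hub appends /embeddings to EMBEDDING_BASE_URL; mirror that here.
    url = base_url.rstrip("/") + "/embeddings"
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = build_embeddings_request(
        "https://embeddings.example.com/v1",  # hypothetical endpoint
        "s3cret-internal-key",  # placeholder key
        "Alibaba-NLP/gte-multilingual-base",
        "hello",
    )
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req, timeout=10)
```

Separating request construction from sending makes it easy to inspect the exact URL and headers before pointing the code at a live endpoint.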
Hub Environment Variables
Set these on hub-api and hub-worker. Both processes must agree on the
provider and model.
| Variable | Required | Example | Description |
|---|---|---|---|
| EMBEDDING_PROVIDER | Yes | openai | Must be openai to use a custom OpenAI-compatible endpoint. |
| EMBEDDING_MODEL | Yes | Alibaba-NLP/gte-multilingual-base | Model identifier Hub sends in the model field. Must match the model name served by your endpoint. |
| EMBEDDING_BASE_URL | Yes | https://embeddings.example.com/v1 | OpenAI-compatible embeddings root. Hub appends /embeddings to this URL. |
| EMBEDDING_PROVIDER_API_KEY | Yes | s3cret-internal-key | Bearer token Hub sends to the endpoint. Required even when the endpoint runs inside your private network. |
| EMBEDDING_MAX_CONCURRENT | No | 5 | Maximum number of concurrent embedding jobs run by hub-worker. Defaults to 5. |
| EMBEDDING_NORMALIZE | No | false | Whether Hub normalizes the returned vector before storage. Leave at false unless your model requires it. |
See Environment Variables for the full runtime configuration reference.
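A pre-flight check along these lines can catch a missing or inconsistent variable before a deployment. This is a minimal sketch based only on the table above; the EmbeddingConfig class, the load_embedding_config function, and the way the boolean is parsed are assumptions made for this example and do not reflect how Hub parses its configuration.

```python
from dataclasses import dataclass
from typing import Mapping

REQUIRED = (
    "EMBEDDING_PROVIDER",
    "EMBEDDING_MODEL",
    "EMBEDDING_BASE_URL",
    "EMBEDDING_PROVIDER_API_KEY",
)


@dataclass(frozen=True)
class EmbeddingConfig:
    provider: str
    model: str
    base_url: str
    api_key: str
    max_concurrent: int = 5  # documented default
    normalize: bool = False  # documented default


def load_embedding_config(env: Mapping[str, str]) -> EmbeddingConfig:
    """Validate the documented variables and apply the documented defaults."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    if env["EMBEDDING_PROVIDER"] != "openai":
        raise ValueError("EMBEDDING_PROVIDER must be 'openai' for a custom endpoint")
    return EmbeddingConfig(
        provider=env["EMBEDDING_PROVIDER"],
        model=env["EMBEDDING_MODEL"],
        base_url=env["EMBEDDING_BASE_URL"],
        api_key=env["EMBEDDING_PROVIDER_API_KEY"],
        max_concurrent=int(env.get("EMBEDDING_MAX_CONCURRENT", "5")),
        # Assumption: only the literal string "true" enables normalization.
        normalize=env.get("EMBEDDING_NORMALIZE", "false").strip().lower() == "true",
    )
```

Running load_embedding_config(os.environ) in both the hub-api and hub-worker environments and comparing the results is one way to confirm that the two processes agree on provider and model.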
Option 1: Bundled TEI Runtime (Helm)
The Hub Helm chart can deploy TEI as a sidecar Deployment and wire Hub to it
automatically. This is the simplest path when you already deploy Hub with the
chart.
Enable it under the embeddings key in your values.yaml:

    embeddings:
      enabled: true
      model: Alibaba-NLP/gte-multilingual-base

      # Optional: pin a model revision for reproducible builds.
      # revision: "<commit-sha-from-Hugging-Face>"

      image:
        repository: ghcr.io/huggingface/text-embeddings-inference
        tag: cpu-1.9

      # Keeps the default GTE CPU runtime within the default memory budget.
      extraArgs:
        - --dtype
        - float16

      resources:
        requests:
          cpu: "4"
          memory: 8Gi
        limits:
          memory: 8Gi

      persistence:
        enabled: true
        size: 10Gi
        accessModes:
          - ReadWriteOnce

When embeddings.enabled is true, the chart automatically sets the following
on hub-api and hub-worker, so you do not need to define them under
config.data or secrets.stringData:
- EMBEDDING_PROVIDER=openai
- EMBEDDING_MODEL from embeddings.servedModelName (or embeddings.model when unset)
- EMBEDDING_BASE_URL from the in-cluster Service
- EMBEDDING_PROVIDER_API_KEY from the embeddings Secret
- EMBEDDING_MAX_CONCURRENT from embeddings.maxConcurrent
- EMBEDDING_NORMALIZE from embeddings.normalize
Secrets
The bundled runtime uses two secrets, one of which is optional:
- Internal API key: Hub and TEI must share the same bearer token so Hub’s requests are accepted by TEI.
- Hugging Face token (optional): Only required for private or gated models. Alibaba-NLP/gte-multilingual-base is public, so this is normally left unset.
Internal API key
You have three options, in order of preference for production:

1. Provide an existing Secret. Recommended when you manage secrets with an external system such as External Secrets Operator or sealed-secrets.

       embeddings:
         auth:
           enabled: true
           existingSecret: hub-embeddings-auth
           secretKey: EMBEDDING_PROVIDER_API_KEY

2. Provide the key inline. Suitable for evaluation and tightly controlled environments.

       embeddings:
         auth:
           enabled: true
           apiKey: "replace-with-a-strong-random-value"

3. Let the chart generate one. If neither existingSecret nor apiKey is set, the chart generates a random 32-character key on first install and keeps it stable across upgrades. Useful for getting started, but harder to rotate.
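If you supply the key yourself (options 1 or 2), it should come from a cryptographically secure source. The sketch below is one way to produce such a value in Python; the 32-character length simply mirrors the chart-generated key, and the alphanumeric alphabet and the function name are assumptions made for this example.

```python
import secrets
import string

# Assumption: letters and digits only, to avoid quoting issues in YAML and shells.
ALPHABET = string.ascii_letters + string.digits


def generate_api_key(length: int = 32) -> str:
    """Return a random key drawn from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    print(generate_api_key())
```

The printed value can then be stored in the Secret referenced by existingSecret, or pasted into apiKey for an evaluation install.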
Hugging Face token
Only set this if you switch to a gated model.

    embeddings:
      huggingFace:
        existingSecret: hub-embeddings-hf
        tokenKey: HF_TOKEN

Or inline:

    embeddings:
      huggingFace:
        token: "hf_xxx"

Verify the Endpoint
After helm upgrade, port-forward the embeddings Service and check that it
returns 768-dimensional vectors.

    kubectl port-forward svc/<release>-hub-embeddings 8080:8080

    curl -s http://localhost:8080/v1/embeddings \
      -H "Authorization: Bearer <EMBEDDING_PROVIDER_API_KEY>" \
      -H "Content-Type: application/json" \
      -d '{"model":"Alibaba-NLP/gte-multilingual-base","input":"hello"}' \
      | jq '.data[0].embedding | length'
    # 768

A 768 response confirms that TEI is healthy, the model is loaded, and the
auth token matches what Hub will send.
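The same dimension check can be scripted, for example in a smoke test. The following sketch assumes the OpenAI-style response layout that the jq expression above relies on (a data array whose items carry an embedding list); the function names are made up for this example.

```python
import json

EXPECTED_DIM = 768  # Hub's stored embedding dimension, per this guide


def embedding_dimensions(response_body: str) -> list[int]:
    """Return the vector length of every item in an embeddings response."""
    payload = json.loads(response_body)
    return [len(item["embedding"]) for item in payload["data"]]


def check_response(response_body: str, expected_dim: int = EXPECTED_DIM) -> None:
    """Raise ValueError unless every returned vector has the expected length."""
    dims = embedding_dimensions(response_body)
    if not dims:
        raise ValueError("response contains no embeddings")
    bad = sorted({d for d in dims if d != expected_dim})
    if bad:
        raise ValueError(f"expected {expected_dim}-dimensional vectors, got {bad}")
```

Passing the raw response body from the curl command to check_response returns silently on success and raises with the offending dimensions otherwise.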
Operational Limits
The default chart settings are intentionally conservative:
- Single replica. embeddings.replicaCount defaults to 1.
- Autoscaling disabled. embeddings.autoscaling.enabled defaults to false.
- ReadWriteOnce model cache. embeddings.persistence.accessModes defaults to [ReadWriteOnce], which is incompatible with multiple replicas on most CSI drivers.
To run more than one replica you need to choose one of:
- Use a ReadWriteMany volume. Set embeddings.persistence.accessModes: [ReadWriteMany] and a storage class that supports RWX. Each replica can then mount the same model cache.
- Disable persistence. Set embeddings.persistence.enabled: false. Each replica downloads the model into an emptyDir on startup, which trades storage for slower rollouts and more egress.
- Pre-bake the model into a custom image. Build a TEI image with the model files baked in, set embeddings.image.repository to your image, and keep persistence disabled. Best for air-gapped or rollout-sensitive environments.
Then enable autoscaling:

    embeddings:
      replicaCount: 2
      autoscaling:
        enabled: true
        minReplicas: 2
        maxReplicas: 4

Option 2: Bring Your Own Endpoint
If you already operate an embeddings service or use a different runtime, set
embeddings.enabled: false (or omit the section entirely) and configure Hub
directly. Any service that accepts POST /v1/embeddings with a bearer token
and returns 768-dimensional vectors works.
Set the variables on hub-api and hub-worker:

    EMBEDDING_PROVIDER=openai
    EMBEDDING_MODEL=Alibaba-NLP/gte-multilingual-base
    EMBEDDING_BASE_URL=https://embeddings.internal.example.com/v1
    EMBEDDING_PROVIDER_API_KEY=<bearer-token>

In the Helm chart these go under config.data (non-secret) and
secrets.stringData (secret), or via your external secret store.
Then verify against your endpoint with the same curl command as above.
Troubleshooting
Embedding Jobs Stay in discarded
Check river_job for feedback_embedding rows and inspect their errors
column. The two most common causes:
- connection reset by peer or connection refused: Hub reached the endpoint before it was ready. Wait for the runtime to finish model warmup, then re-enqueue with make run-backfill-embeddings.
- 401 Unauthorized: EMBEDDING_PROVIDER_API_KEY does not match the bearer token the endpoint expects.
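For the first cause, it can help to wait for the runtime explicitly before re-enqueueing. The sketch below is a generic polling helper, not part of Hub or TEI: the probe callable, the attempt count, and the delay are all assumptions, and the probe is whatever readiness check suits your endpoint (for example, a function that sends the verification request from this guide and returns True on success).

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe until it returns True or the attempts run out."""
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except OSError:
            # Connection refused/reset (and urllib's URLError) are OSError
            # subclasses; treat them as "not ready yet" and keep polling.
            pass
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

Once wait_until_ready returns True, the backfill command can be run with less risk of the jobs being discarded again for the same reason.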
Search Endpoints Return 503
Hub returns 503 Service Unavailable from
/v1/feedback-records/search/semantic and /v1/feedback-records/{id}/similar
when embeddings are not configured. Confirm both EMBEDDING_PROVIDER and
EMBEDDING_MODEL are set on hub-api.
Vector Length Is Not 768
Hub rejects writes whose vector dimension does not equal 768. Configure your
model or runtime to return 768-dimensional output, either by selecting a
native-768 model or by requesting a 768-dimensional slice if the model
supports Matryoshka truncation.
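To illustrate what a Matryoshka-style slice means, the sketch below keeps the first 768 components of a longer vector and rescales the result to unit length. It is only meaningful for models trained to support this kind of truncation, and the function name and the decision to re-normalize are assumptions made for this example; in practice the truncation would usually be done by the runtime or a proxy in front of it, not by Hub.

```python
import math
from typing import Sequence


def truncate_and_normalize(vector: Sequence[float], dim: int = 768) -> list[float]:
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    if len(vector) < dim:
        raise ValueError(f"vector has {len(vector)} dimensions, need at least {dim}")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]
```

Re-normalizing after the slice keeps cosine and dot-product similarity consistent, since dropping components changes the length of the vector.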