Self-Hosted Embeddings

Run Hub embeddings against a self-hosted, OpenAI-compatible embeddings endpoint with the bundled Helm runtime or your own service.

Hub’s similar-feedback and semantic-search endpoints rely on text embeddings. By default these are produced by a managed provider, but Hub can also run against a self-hosted, OpenAI-compatible embeddings endpoint. This keeps open-text feedback inside your own infrastructure and removes the dependency on an external AI provider.

This guide shows how to configure Hub to use a self-hosted embeddings runtime and how to deploy the bundled Hugging Face Text Embeddings Inference (TEI) runtime that ships with the Hub Helm chart.

The default recommended model for self-hosted Hub deployments is Alibaba-NLP/gte-multilingual-base:

  • Apache-2.0 licensed, no Hugging Face access gating.
  • Native 768-dimensional output, matching Hub’s stored embedding dimension.
  • 8192-token context window.
  • 75+ languages out of the box.
  • Runs on CPU; a 4 vCPU / 8 GiB node is sufficient for the bundled runtime.

You can run any model that TEI supports as long as it returns 768-dimensional vectors, but the rest of this guide assumes the recommended model.

Hub speaks the OpenAI embeddings protocol when EMBEDDING_PROVIDER=openai and sends requests to the URL in EMBEDDING_BASE_URL. The endpoint must accept POST /v1/embeddings with an Authorization: Bearer <key> header and return an OpenAI-compatible response.
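As a rough illustration of the request shape described above, the following sketch builds (but does not send) such a request using only the Python standard library. The function name and the trailing-slash handling are assumptions made for this example; Hub's own client code is not shown in this guide.

```python
import json
import urllib.request


def build_embeddings_request(base_url: str, api_key: str, model: str, text: str) -> urllib.request.Request:
    """Build a POST <base_url>/embeddings request without sending it."""
    # Assumption: tolerate a trailing slash on the base URL before appending the path.
    url = base_url.rstrip("/") + "/embeddings"
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = build_embeddings_request(
        "https://embeddings.example.com/v1",
        "s3cret-internal-key",
        "Alibaba-NLP/gte-multilingual-base",
        "hello",
    )
    print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it; an OpenAI-compatible endpoint is expected to answer with a JSON body whose `data[0].embedding` field holds the vector.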

Set these variables on both hub-api and hub-worker. The two processes must agree on the provider and model.

| Variable | Required | Example | Description |
| --- | --- | --- | --- |
| EMBEDDING_PROVIDER | Yes | openai | Must be openai to use a custom OpenAI-compatible endpoint. |
| EMBEDDING_MODEL | Yes | Alibaba-NLP/gte-multilingual-base | Model identifier Hub sends in the model field. Must match the model name served by your endpoint. |
| EMBEDDING_BASE_URL | Yes | https://embeddings.example.com/v1 | OpenAI-compatible embeddings root. Hub appends /embeddings to this URL. |
| EMBEDDING_PROVIDER_API_KEY | Yes | s3cret-internal-key | Bearer token Hub sends to the endpoint. Required even when the endpoint runs inside your private network. |
| EMBEDDING_MAX_CONCURRENT | No | 5 | Max concurrent embedding jobs run by hub-worker. Defaults to 5. |
| EMBEDDING_NORMALIZE | No | false | Whether Hub normalizes the returned vector before storage. Leave at false unless your model requires it. |

See Environment Variables for the full runtime configuration reference.
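A pre-deployment check along the following lines can catch missing or mismatched settings before Hub starts. This is a minimal sketch, not part of Hub: the function name and the idea of passing each process's environment as a dictionary are assumptions made for illustration.

```python
REQUIRED = (
    "EMBEDDING_PROVIDER",
    "EMBEDDING_MODEL",
    "EMBEDDING_BASE_URL",
    "EMBEDDING_PROVIDER_API_KEY",
)
# hub-api and hub-worker must agree on these two.
SHARED = ("EMBEDDING_PROVIDER", "EMBEDDING_MODEL")


def embedding_env_problems(api_env: dict, worker_env: dict) -> list:
    """Return human-readable problems; an empty list means the settings look consistent."""
    problems = []
    for name, env in (("hub-api", api_env), ("hub-worker", worker_env)):
        for var in REQUIRED:
            if not env.get(var):
                problems.append(f"{name}: {var} is not set")
        provider = env.get("EMBEDDING_PROVIDER")
        if provider and provider != "openai":
            problems.append(f"{name}: EMBEDDING_PROVIDER must be 'openai' for a custom endpoint")
    for var in SHARED:
        if api_env.get(var) != worker_env.get(var):
            problems.append(f"{var} differs between hub-api and hub-worker")
    return problems
```

In practice the two dictionaries could be read from wherever you render each process's environment, for example the output of your deployment templating.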

The Hub Helm chart can deploy TEI as a sidecar Deployment and wire Hub to it automatically. This is the simplest path when you already deploy Hub with the chart.

Enable it under the embeddings key in your values.yaml:

embeddings:
  enabled: true
  model: Alibaba-NLP/gte-multilingual-base
  # Optional: pin a model revision for reproducible builds.
  # revision: "<commit-sha-from-Hugging-Face>"
  image:
    repository: ghcr.io/huggingface/text-embeddings-inference
    tag: cpu-1.9
  # Keeps the default GTE CPU runtime within the default memory budget.
  extraArgs:
    - --dtype
    - float16
  resources:
    requests:
      cpu: "4"
      memory: 8Gi
    limits:
      memory: 8Gi
  persistence:
    enabled: true
    size: 10Gi
    accessModes:
      - ReadWriteOnce

When embeddings.enabled is true the chart automatically sets the following on hub-api and hub-worker, so you do not need to define them under config.data or secrets.stringData:

  • EMBEDDING_PROVIDER=openai
  • EMBEDDING_MODEL from embeddings.servedModelName (or embeddings.model when unset)
  • EMBEDDING_BASE_URL from the in-cluster Service
  • EMBEDDING_PROVIDER_API_KEY from the embeddings Secret
  • EMBEDDING_MAX_CONCURRENT from embeddings.maxConcurrent
  • EMBEDDING_NORMALIZE from embeddings.normalize

The bundled runtime needs two secrets:

  • Internal API key — Hub and TEI must share the same bearer token so Hub’s requests are accepted by TEI.
  • Hugging Face token (optional) — Only required for private or gated models. Alibaba-NLP/gte-multilingual-base is public, so this is normally left unset.

For the internal API key you have three options, in order of preference for production:

  1. Provide an existing Secret. Recommended when you manage secrets with an external system such as External Secrets Operator or sealed-secrets.

    embeddings:
      auth:
        enabled: true
        existingSecret: hub-embeddings-auth
        secretKey: EMBEDDING_PROVIDER_API_KEY
  2. Provide the key inline. Suitable for evaluation and tightly controlled environments.

    embeddings:
      auth:
        enabled: true
        apiKey: "replace-with-a-strong-random-value"
  3. Let the chart generate one. If neither existingSecret nor apiKey is set, the chart generates a random 32-character key on first install and keeps it stable across upgrades. Useful for getting started, but harder to rotate.
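If you supply the key yourself (options 1 and 2), one way to produce a strong random value is Python's `secrets` module. This is only a sketch: the helper name and the alphanumeric alphabet are choices made for this example, and it does not describe how the chart generates its own key.

```python
import secrets
import string

# Letters and digits only, so the value needs no quoting or escaping in YAML or shell.
ALPHABET = string.ascii_letters + string.digits


def generate_api_key(length: int = 32) -> str:
    """Return a random key drawn from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    print(generate_api_key())
```

The printed value can be stored in your secret manager or pasted into `embeddings.auth.apiKey`.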

Only set a Hugging Face token if you switch to a private or gated model.

embeddings:
  huggingFace:
    existingSecret: hub-embeddings-hf
    tokenKey: HF_TOKEN

Or inline:

embeddings:
  huggingFace:
    token: "hf_xxx"

After helm upgrade, port-forward the embeddings Service and check that it returns 768-dimensional vectors.

# Terminal 1: forward the embeddings Service (runs until interrupted).
kubectl port-forward svc/<release>-hub-embeddings 8080:8080

# Terminal 2: request one embedding and print its length.
curl -s http://localhost:8080/v1/embeddings \
  -H "Authorization: Bearer <EMBEDDING_PROVIDER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"model":"Alibaba-NLP/gte-multilingual-base","input":"hello"}' \
  | jq '.data[0].embedding | length'
# 768

A 768 response confirms that TEI is healthy, the model is loaded, and the auth token matches what Hub will send.
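If you prefer to script this check, for example in a post-deploy smoke test, the same assertion can be made on the raw response body. The sketch below assumes the standard OpenAI response layout (`data[i].embedding`); the function names are hypothetical.

```python
import json

EXPECTED_DIM = 768  # Hub's stored embedding dimension


def embedding_dimensions(response_body: str) -> list:
    """Return the length of every embedding in an OpenAI-style response body."""
    payload = json.loads(response_body)
    return [len(item["embedding"]) for item in payload["data"]]


def check_dimensions(response_body: str, expected: int = EXPECTED_DIM) -> None:
    """Raise ValueError unless every returned vector has the expected length."""
    dims = embedding_dimensions(response_body)
    if not dims:
        raise ValueError("response contains no embeddings")
    wrong = sorted({d for d in dims if d != expected})
    if wrong:
        raise ValueError(f"expected {expected}-dimensional vectors, got {wrong}")
```

Feeding it the body returned by the curl command above should raise nothing when the runtime serves the recommended model.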

The default chart settings are intentionally conservative:

  • Single replica. embeddings.replicaCount defaults to 1.
  • Autoscaling disabled. embeddings.autoscaling.enabled defaults to false.
  • ReadWriteOnce model cache. embeddings.persistence.accessModes defaults to [ReadWriteOnce], which is incompatible with multiple replicas on most CSI drivers.

To run more than one replica you need to choose one of:

  1. Use a ReadWriteMany volume. Set embeddings.persistence.accessModes: [ReadWriteMany] and a storage class that supports RWX. Each replica can then mount the same model cache.
  2. Disable persistence. Set embeddings.persistence.enabled: false. Each replica downloads the model into an emptyDir on startup, which trades storage for slower rollouts and more egress.
  3. Pre-bake the model into a custom image. Build a TEI image with the model files baked in, set embeddings.image.repository to your image, and keep persistence disabled. Best for air-gapped or rollout-sensitive environments.

Then enable autoscaling:

embeddings:
  replicaCount: 2
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 4
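The constraint between replicas and the model cache can be expressed as a small check over the `embeddings` section of your values, parsed into a dictionary. This is an illustrative sketch rather than chart logic; the function name is hypothetical, and treating persistence as enabled when the key is absent is an assumption.

```python
def replica_problems(embeddings: dict) -> list:
    """Flag multi-replica settings that still rely on a ReadWriteOnce model cache."""
    replicas = embeddings.get("replicaCount", 1)
    autoscaling = embeddings.get("autoscaling", {})
    if autoscaling.get("enabled", False):
        # With autoscaling on, the upper bound is what matters.
        replicas = max(replicas, autoscaling.get("maxReplicas", replicas))

    persistence = embeddings.get("persistence", {})
    persistent = persistence.get("enabled", True)  # assumption: enabled unless set to false
    modes = persistence.get("accessModes", ["ReadWriteOnce"])

    problems = []
    if replicas > 1 and persistent and "ReadWriteMany" not in modes:
        problems.append(
            "more than one replica requires a ReadWriteMany volume, "
            "persistence.enabled: false, or a pre-baked image without persistence"
        )
    return problems
```

An empty result for your values means the replica and persistence settings are at least not in direct conflict; it does not verify that your storage class actually supports RWX.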

If you already operate an embeddings service or use a different runtime, set embeddings.enabled: false (or omit the section entirely) and configure Hub directly. Any service that accepts POST /v1/embeddings with a bearer token and returns 768-dim vectors works.

Set the variables on hub-api and hub-worker:

EMBEDDING_PROVIDER=openai
EMBEDDING_MODEL=Alibaba-NLP/gte-multilingual-base
EMBEDDING_BASE_URL=https://embeddings.internal.example.com/v1
EMBEDDING_PROVIDER_API_KEY=<bearer-token>

In the Helm chart these go under config.data (non-secret) and secrets.stringData (secret), or via your external secret store.

Then verify against your endpoint with the same curl command as above.

If embeddings are not being generated, check the river_job table for feedback_embedding rows and inspect their errors column. The two most common causes:

  • connection reset by peer or connection refused: Hub reached the endpoint before it was ready. Wait for the runtime to finish model warmup, then re-enqueue with make run-backfill-embeddings.
  • 401 Unauthorized: EMBEDDING_PROVIDER_API_KEY does not match the bearer token the endpoint expects.
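For the first cause, waiting for warmup can be automated with a simple polling loop before re-enqueueing. The sketch below is an assumption-laden illustration: `probe` is a hypothetical callable you supply (for instance, one that sends a test embedding request and returns True on success), and the attempt count and delay are arbitrary defaults.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except OSError:
            # Covers connection refused and connection reset while the runtime warms up.
            pass
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

Once it returns True, running make run-backfill-embeddings re-enqueues the failed work as described above.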

Hub returns 503 Service Unavailable from /v1/feedback-records/search/semantic and /v1/feedback-records/{id}/similar when embeddings are not configured. Confirm both EMBEDDING_PROVIDER and EMBEDDING_MODEL are set on hub-api.

Hub rejects writes whose vector dimension does not equal 768. Configure your model or runtime to return 768-dim output, either by selecting a native-768 model or by requesting a 768-dim slice if the model supports Matryoshka truncation.
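To make the Matryoshka option concrete: a 768-dim slice keeps the first 768 components of a longer vector, and it is common practice to re-normalize the result to unit length afterwards. The sketch below shows only that arithmetic. Where it would run (a runtime option or a thin proxy in front of your endpoint) is not covered by this guide, and whether truncation preserves quality depends on the model having been trained for it.

```python
import math

TARGET_DIM = 768  # Hub's required vector dimension


def truncate_and_normalize(vector, dim: int = TARGET_DIM) -> list:
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    if len(vector) < dim:
        raise ValueError(f"vector has {len(vector)} dimensions, need at least {dim}")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero slice cannot be normalized
    return [x / norm for x in head]
```

For example, a 1024-dimensional output passed through this function yields a 768-dimensional unit vector that Hub would accept on write.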