Module huggingface
avi0ra/huggingface
Hugging Face Connector for Ballerina
Connects Ballerina applications to the Hugging Face Inference API for running state-of-the-art machine learning models hosted on the Hugging Face Hub.
This package provides a typed Client with strongly typed request and response records for 17+ AI/ML operations. It also includes a generic inferModel helper for models without a typed operation, a Retrieval-Augmented Generation (RAG) pipeline, stateful Conversation management, batch inference, streaming chat completions, server-side model wait (waitForModel), automatic retry with exponential backoff for cold-starting models, and multi-modal helpers for images and audio.
Supported AI Capabilities
| Capability | Resource Path | Example Model |
|---|---|---|
| Chat Completion | /v1/chat/completions | Qwen/Qwen2.5-7B-Instruct |
| Streaming Chat | /v1/chat/completions/streamed | Qwen/Qwen2.5-7B-Instruct |
| Text Generation | /hf-inference/models/{model} | openai-community/gpt2 |
| Fill Mask | /hf-inference/models/{model}/fill-mask | google-bert/bert-base-uncased |
| Text Classification | /hf-inference/models/{model}/text-classification | distilbert-base-uncased-finetuned-sst-2-english |
| Token Classification (NER) | /hf-inference/models/{model}/token-classification | dslim/bert-base-NER |
| Feature Extraction | /hf-inference/models/{model}/feature-extraction | intfloat/multilingual-e5-large |
| Sentence Similarity | /hf-inference/models/{model}/sentence-similarity | sentence-transformers/all-MiniLM-L6-v2 |
| Question Answering | /hf-inference/models/{model}/question-answering | deepset/roberta-base-squad2 |
| Summarization | /hf-inference/models/{model}/summarization | facebook/bart-large-cnn |
| Translation | /hf-inference/models/{model}/translation | Helsinki-NLP/opus-mt-en-fr |
| Zero-Shot Classification | /hf-inference/models/{model}/zero-shot-classification | facebook/bart-large-mnli |
| Text-to-Image | /hf-inference/models/{model}/text-to-image | black-forest-labs/FLUX.1-schnell |
| Text-to-Speech | /hf-inference/models/{model}/text-to-speech | facebook/mms-tts-eng |
| Image Classification | /hf-inference/models/{model}/image-classification | google/vit-base-patch16-224 |
| Image Captioning (Image-to-Text) | /hf-inference/models/{model}/image-to-text | Salesforce/blip-image-captioning-large |
| Automatic Speech Recognition | /hf-inference/models/{model}/automatic-speech-recognition | openai/whisper-large-v3-turbo |
| Batch Operations | /hf-inference/models/{model}/.../batch | Any compatible model |
Any model available on the Hugging Face Hub can be used — not just the examples above. Browse by task at huggingface.co/models.
Setup
1. Get a Hugging Face token
- Create a free account at huggingface.co
- Go to Settings → Access Tokens
- Click New token, choose Read type, enable Inference Providers under the Inference section
- Copy the token
2. Add the connector
bal add avi0ra/huggingface
3. Configure the token
In Config.toml:
token = "<YOUR_HF_TOKEN>"
Or via environment variable:
export HF_TOKEN="<YOUR_HF_TOKEN>"
Quickstart
Chat Completion
import ballerina/io;
import ballerina/os;
import avi0ra/huggingface;

configurable string token = os:getEnv("HF_TOKEN");

public function main() returns error? {
    huggingface:Client hf = check new ({auth: {token}});

    huggingface:ChatCompletionResponse resp = check hf->/v1/chat/completions.post({
        model: "Qwen/Qwen2.5-7B-Instruct",
        messages: [{role: "user", content: "What is Ballerina?"}],
        maxTokens: 100,
        topP: 0.9
    });

    io:println(resp?.choices);
    io:println("Tokens used: ", resp?.usage?.totalTokens);
}
Streaming Chat Completion
Iterate chunks from the parsed SSE response:
import ballerina/io;
import ballerina/os;
import avi0ra/huggingface;

configurable string token = os:getEnv("HF_TOKEN");

public function main() returns error? {
    huggingface:Client hf = check new ({auth: {token}});

    stream<huggingface:ChatCompletionChunk, error?> tokenStream =
        check hf->/v1/chat/completions/streamed.post({
            model: "Qwen/Qwen2.5-7B-Instruct",
            messages: [{role: "user", content: "Count from 1 to 5."}],
            maxTokens: 50
        });

    check from huggingface:ChatCompletionChunk chunk in tokenStream
        do {
            huggingface:ChatCompletionChunkChoice[]? choices = chunk?.choices;
            if choices is huggingface:ChatCompletionChunkChoice[] && choices.length() > 0 {
                string? content = choices[0].delta?.content;
                if content is string {
                    io:print(content);
                }
            }
        };
    io:println();
}
Note: The current implementation collects the full SSE response before returning the stream.
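To make the buffering note concrete, here is a minimal sketch of how a fully collected SSE body could be turned into payloads. It is not the connector's parser: parseSsePayloads is a hypothetical helper, and the "data:" prefix and "[DONE]" terminator are assumptions based on the common OpenAI-style streaming format.

```ballerina
import ballerina/io;

// Hypothetical helper, not part of the connector: extracts the payload of
// every "data:" line from an SSE body that has already been read in full.
function parseSsePayloads(string body) returns string[] {
    string[] payloads = [];
    // Split on \n; trim() below also drops the \r of \r\n line endings.
    foreach string line in re `\n`.split(body) {
        string trimmed = line.trim();
        if !trimmed.startsWith("data:") {
            continue; // blank separator lines and non-data fields
        }
        string payload = trimmed.substring(5).trim();
        if payload == "[DONE]" {
            break; // assumed end-of-stream marker
        }
        payloads.push(payload);
    }
    return payloads;
}

public function main() {
    string body = "data: {\"id\":1}\r\n\r\ndata: {\"id\":2}\r\n\r\ndata: [DONE]\r\n\r\n";
    foreach string payload in parseSsePayloads(body) {
        io:println(payload);
    }
}
```

Because the whole body is available before parsing starts, every chunk becomes visible at once rather than as the model produces it.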
Stateful Chat Conversation
Maintain cross-turn chat history automatically using the Conversation class:
import ballerina/io;
import ballerina/os;
import avi0ra/huggingface;

configurable string token = os:getEnv("HF_TOKEN");

public function main() returns error? {
    huggingface:Client hf = check new ({auth: {token}});

    huggingface:Conversation conv = new (
        hf,
        "Qwen/Qwen2.5-7B-Instruct",
        systemPrompt = "You are a helpful assistant."
    );

    string reply1 = check conv.chat("What is Ballerina?");
    io:println("Assistant: ", reply1);

    string reply2 = check conv.chat("Who created it?");
    io:println("Assistant: ", reply2);

    io:println("Turns completed: ", conv.turnCount());
}
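For readers wondering what "cross-turn history" amounts to, the sketch below shows the bookkeeping in isolation. It is only an illustration under assumptions: ChatMessage and History are hypothetical names, and the connector's Conversation class may store and send history differently.

```ballerina
import ballerina/io;

// Hypothetical message shape, mirroring the role/content pairs used in the
// chat completion examples.
type ChatMessage record {|
    string role;
    string content;
|};

// Hypothetical history holder, not the connector's Conversation class.
class History {
    private ChatMessage[] messages = [];

    function init(string? systemPrompt = ()) {
        if systemPrompt is string {
            self.messages.push({role: "system", content: systemPrompt});
        }
    }

    // Record one completed turn: the user's text and the model's reply.
    function addTurn(string userText, string assistantText) {
        self.messages.push({role: "user", content: userText});
        self.messages.push({role: "assistant", content: assistantText});
    }

    // Copy of everything that would accompany the next request.
    function snapshot() returns ChatMessage[] {
        return self.messages.clone();
    }
}

public function main() {
    History history = new ("You are a helpful assistant.");
    history.addTurn("What is Ballerina?", "A language for cloud-native integration.");
    io:println(history.snapshot().length());
}
```

Sending the accumulated messages with each new request is what lets a follow-up such as "Who created it?" resolve "it" correctly.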
RAG Pipeline
End-to-end Retrieval Augmented Generation in a single function call:
import ballerina/io;
import ballerina/os;
import avi0ra/huggingface;

configurable string token = os:getEnv("HF_TOKEN");

public function main() returns error? {
    huggingface:Client hf = check new ({auth: {token}});

    huggingface:RagDocument[] documents = [
        {
            id: "doc1",
            content: "Ballerina is an open-source language for cloud-native integration by WSO2.",
            metadata: {"source": "ballerina.io"}
        },
        {
            id: "doc2",
            content: "WSO2 is a Sri Lankan technology company founded in 2005.",
            metadata: {"source": "wso2.com"}
        }
    ];

    huggingface:RagResult result = check huggingface:ragQuery(
        hf,
        "Who created Ballerina?",
        documents
    );

    io:println("Answer: ", result.answer);
    io:println("Sources used: ", result.sources.length());
    io:println("Top relevance score: ", result.scores[0]);
}
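The retrieval step of a RAG pipeline is commonly a similarity ranking over embeddings. The sketch below shows that idea with cosine similarity. It assumes, without confirmation from the connector's source, that relevance scores are computed this way; cosineSimilarity and mostRelevant are hypothetical helpers.

```ballerina
import ballerina/io;

// Cosine similarity of two vectors of equal length (the caller must ensure
// the lengths match). Returns 0.0 if either vector is all zeros.
function cosineSimilarity(float[] a, float[] b) returns float {
    float dot = 0.0;
    float normA = 0.0;
    float normB = 0.0;
    foreach int i in 0 ..< a.length() {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    if normA == 0.0 || normB == 0.0 {
        return 0.0;
    }
    return dot / (float:sqrt(normA) * float:sqrt(normB));
}

// Index of the document embedding closest to the query, or -1 if there are
// no documents.
function mostRelevant(float[] query, float[][] docs) returns int {
    int best = -1;
    float bestScore = -2.0; // below the minimum possible cosine value
    foreach int i in 0 ..< docs.length() {
        float score = cosineSimilarity(query, docs[i]);
        if score > bestScore {
            bestScore = score;
            best = i;
        }
    }
    return best;
}

public function main() {
    float[] query = [1.0, 0.0];
    float[][] docs = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]];
    io:println(mostRelevant(query, docs));
}
```

In a real pipeline the vectors would come from an embedding model, and the top-ranked documents would be passed to the chat model as context.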
Auto-Retry for Cold Models
Models on the free tier go cold after inactivity and return 503 while loading. The connector retries automatically with exponential backoff:
huggingface:Client hf = check new (
    {auth: {token}},
    retryConfig = {
        maxRetries: 5,
        initialDelay: 2.0,
        maxDelay: 30.0
    }
);
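To see what these settings could mean in practice, the sketch below lists the wait before each retry. It rests on assumptions the README does not state: that delays are in seconds, double after every attempt, are capped at maxDelay, and carry no jitter. backoffDelays is a hypothetical helper, not a connector function.

```ballerina
import ballerina/io;

// Hypothetical helper: the delay before each retry, assuming the delay
// doubles every attempt and never exceeds maxDelay.
function backoffDelays(int maxRetries, float initialDelay, float maxDelay) returns float[] {
    float[] delays = [];
    float delay = initialDelay;
    while delays.length() < maxRetries {
        delays.push(float:min(delay, maxDelay));
        delay *= 2.0;
    }
    return delays;
}

public function main() {
    // Same values as the retryConfig shown above.
    io:println(backoffDelays(5, 2.0, 30.0));
}
```

Under those assumptions the five retries wait 2, 4, 8, 16, and 30 seconds, so a model gets up to a minute to load before the call fails.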
Eliminating Cold-Start Latency with waitForModel
huggingface:Client hf = check new ({
    auth: {token},
    waitForModel: true,
    timeout: 120
});
Setting waitForModel: true sends x-wait-for-model: true on every request so the server
waits for cold models to load instead of immediately returning 503. This eliminates most retry
cycles. Pair it with a higher timeout to cover the model load time.
Multi-Modal Helpers
Load images and audio from files or URLs directly:
// Image from file
huggingface:ImageClassificationResult[] res =
    check hf->/hf\-inference/models/["google/vit-base-patch16-224"]/image\-classification/file.post(
        "path/to/image.jpg"
    );

// Image captioning from URL
huggingface:ImageToTextResult[] captions =
    check hf->/hf\-inference/models/["Salesforce/blip-image-captioning-large"]/image\-to\-text/url.post(
        "https://example.com/photo.jpg"
    );

// Audio from file
huggingface:AutomaticSpeechRecognitionResponse resp =
    check hf->/hf\-inference/models/["openai/whisper-large-v3-turbo"]/automatic\-speech\-recognition/file.post(
        "path/to/audio.flac"
    );
All Supported Operations
Fill Mask
huggingface:FillMaskResult[] res =
    check hf->/hf\-inference/models/["google-bert/bert-base-uncased"]/fill\-mask.post({
        inputs: "Paris is the [MASK] of France."
    });
io:println(res[0]?.tokenStr, " (", res[0]?.score, ")");
Text Classification
huggingface:ClassificationLabel[][] res =
    check hf->/hf\-inference/models/["distilbert-base-uncased-finetuned-sst-2-english"]/text\-classification.post({
        inputs: "Ballerina makes integration elegant!"
    });
io:println(res[0][0]?.label, " (", res[0][0]?.score, ")");
Token Classification (NER)
huggingface:TokenClassificationEntity[] entities =
    check hf->/hf\-inference/models/["dslim/bert-base-NER"]/token\-classification.post({
        inputs: "WSO2 is based in Sri Lanka."
    });
io:println(entities);
Feature Extraction (Embeddings)
float[] embeddings =
    check hf->/hf\-inference/models/["intfloat/multilingual-e5-large"]/feature\-extraction.post({
        inputs: "Ballerina cloud-native integration."
    });
io:println("Dimensions: ", embeddings.length());
Sentence Similarity
float[] scores =
    check hf->/hf\-inference/models/["sentence-transformers/all-MiniLM-L6-v2"]/sentence\-similarity.post({
        inputs: {
            source_sentence: "What is Ballerina?",
            sentences: ["Ballerina is a cloud-native language.", "Python is for data science."]
        }
    });
io:println("Scores: ", scores);
Question Answering
huggingface:QuestionAnsweringResponse ans =
    check hf->/hf\-inference/models/["deepset/roberta-base-squad2"]/question\-answering.post({
        inputs: {
            question: "What is Ballerina?",
            context: "Ballerina is an open-source language for cloud-native integration by WSO2."
        }
    });
io:println(ans?.answer);
Summarization
huggingface:SummarizationResult[] res =
    check hf->/hf\-inference/models/["facebook/bart-large-cnn"]/summarization.post({
        inputs: "Ballerina is a modern open-source programming language designed for cloud-native integration...",
        parameters: {maxLength: 40, minLength: 15}
    });
io:println(res[0].summaryText);
Translation
huggingface:TranslationResult[] res =
    check hf->/hf\-inference/models/["Helsinki-NLP/opus-mt-en-fr"]/translation.post({
        inputs: "Hello, how are you?"
    });
io:println(res[0].translationText);
Zero-Shot Classification
huggingface:ZeroShotClassificationResponse res =
    check hf->/hf\-inference/models/["facebook/bart-large-mnli"]/zero\-shot\-classification.post({
        inputs: "Ballerina is a programming language for cloud integration.",
        parameters: {candidateLabels: ["technology", "sports", "politics"]}
    });
io:println(res);
Text-to-Image Generation
byte[] imageBytes =
    check hf->/hf\-inference/models/["black-forest-labs/FLUX.1-schnell"]/text\-to\-image.post({
        inputs: "A robot writing Ballerina code"
    });
check io:fileWriteBytes("output.png", imageBytes);
Text-to-Speech
byte[] audioBytes =
    check hf->/hf\-inference/models/["facebook/mms-tts-eng"]/text\-to\-speech.post({
        inputs: "Hello from Ballerina!"
    });
check io:fileWriteBytes("speech.wav", audioBytes);
Image Classification
byte[] payload = check io:fileReadBytes("image.jpg");
huggingface:ImageClassificationResult[] res =
    check hf->/hf\-inference/models/["google/vit-base-patch16-224"]/image\-classification.post(payload);
io:println(res[0]?.label, " (", res[0]?.score, ")");
Image Captioning (Image-to-Text)
byte[] payload = check io:fileReadBytes("photo.jpg");
huggingface:ImageToTextResult[] captions =
    check hf->/hf\-inference/models/["Salesforce/blip-image-captioning-large"]/image\-to\-text.post(payload);
io:println(captions[0]?.generatedText);
Automatic Speech Recognition
huggingface:AutomaticSpeechRecognitionResponse resp =
    check hf->/hf\-inference/models/["openai/whisper-large-v3-turbo"]/automatic\-speech\-recognition/file.post(
        "audio.flac"
    );
io:println(resp?.text);
Universal Model Runner
The ModelRunner class works with any Hugging Face model. Provide the model ID and it
auto-detects the pipeline task from the Hub, then routes every call to the correct typed endpoint.
// Summarisation: just name the model
huggingface:ModelRunner runner = new (hf, "facebook/bart-large-cnn");
io:println("Task: ", runner.getPipelineTag()); // "summarization"
json summary = check runner.run(
    "Ballerina is a modern open-source language designed for cloud-native integration."
);
io:println(summary);

// NER: same API, different model
huggingface:ModelRunner ner = new (hf, "dslim/bert-base-NER");
json entities = check ner.run("WSO2 is based in Sri Lanka.");

// Translation
huggingface:ModelRunner xlat = new (hf, "Helsinki-NLP/opus-mt-en-fr");
json translated = check xlat.run("Hello, how are you?");

// Question Answering: structured JSON input
huggingface:ModelRunner qa = new (hf, "deepset/roberta-base-squad2");
json answer = check qa.runWithJson({
    inputs: {question: "What is Ballerina?", context: "Ballerina is..."}
});

// Image classification from file
huggingface:ModelRunner clf = new (hf, "google/vit-base-patch16-224");
json labels = check clf.runImageFile("photo.jpg");

// Image generation: returns raw bytes
huggingface:ModelRunner img = new (hf, "black-forest-labs/FLUX.1-schnell");
byte[] png = check img.generateMedia("A robot writing Ballerina code");
check io:fileWriteBytes("output.png", png);

// ASR from audio file
huggingface:ModelRunner whisper = new (hf, "openai/whisper-large-v3-turbo");
json transcript = check whisper.runAudioFile("audio.flac", huggingface:AUDIO_FLAC);
ModelRunner method reference
| Method | Input | Output | Auto-routed tasks |
|---|---|---|---|
| run(string) | Plain text | json | text-generation, fill-mask, text-classification, token-classification, feature-extraction, summarization, translation |
| runWithJson(json) | Custom JSON | json | question-answering, zero-shot-classification, sentence-similarity, chat-completion |
| runBytes(byte[], contentType) | Binary | json | image-classification, image-to-text, automatic-speech-recognition |
| generateMedia(string) | Prompt | byte[] | text-to-image, text-to-speech |
| runImageFile(path) | File path | json | Same as runBytes |
| runImageUrl(url) | Public URL | json | Same as runBytes |
| runAudioFile(path) | File path | json | ASR |
| runAudioUrl(url) | Public URL | json | ASR |
One-shot convenience functions
// Auto-detect + run in one line
json result = check huggingface:autoRun(hf, "facebook/bart-large-cnn", "Long article...");

// With structured JSON payload
json answer = check huggingface:autoRunJson(hf, "deepset/roberta-base-squad2", {
    inputs: {question: "What is Ballerina?", context: "Ballerina is..."}
});

// Binary media generation
byte[] png = check huggingface:autoGenerateMedia(
    hf,
    "black-forest-labs/FLUX.1-schnell",
    "A robot coding in Ballerina"
);
Tip: Reuse a ModelRunner instance for repeated calls; the Hub lookup happens only once, at construction. autoRun() and friends perform the lookup on every call.
Generic Inference Helper
Call any Hugging Face model that the typed operations do not cover. Since 1.1.0, inferModel honours the retry configuration when a cold model returns 503:
json result = check huggingface:inferModel(
    hf,
    "openai-community/gpt2",
    {inputs: "Ballerina is designed for"}
);
io:println(result);
Using Custom Models
The connector works with any model on the Hugging Face Hub. Pass any model ID as long as it matches the task:
check hf->/hf\-inference/models/["Helsinki-NLP/opus-mt-en-si"]/translation.post({
    inputs: "Hello"
});
Browse available models by task at huggingface.co/models.
Model Metadata & Batch Helpers
Retrieve model information and check inference availability:
huggingface:ModelInfo info = check huggingface:getModelInfo(hf, "gpt2");
io:println("Downloads: ", info.downloads);

huggingface:ModelAvailability availability = check huggingface:checkModelAvailability(hf, "gpt2");
io:println("Available for inference: ", availability.available);
Run batch inference efficiently:
json[] batchResults = check huggingface:batchInfer(
    hf,
    ["Hello world", "Ballerina is great"],
    "openai-community/gpt2"
);
Compute semantic similarity with embedding-based scoring:
float[] scores = check huggingface:sentenceSimilarity(
    hf,
    "What is Ballerina?",
    ["Ballerina is a cloud-native language.", "Python is for data science."]
);
io:println("Scores: ", scores);
Changelog
1.1.0
- Added ModelRunner class: a universal model runner that auto-detects the pipeline task from the Hub and routes to the correct typed endpoint. Works with any Hugging Face model.
- Added autoRun(), autoRunJson(), autoGenerateMedia() convenience functions.
- Added waitForModel flag to ConnectionConfig; it sends the x-wait-for-model: true header to eliminate most cold-start 503 round-trips.
- Added fill-mask endpoint for BERT-style masked token prediction.
- Added image-to-text (captioning) endpoint with bytes, file, and URL variants.
- Added text-to-speech endpoint for audio synthesis.
- Added sentence-similarity typed endpoint.
- Added sentenceSimilarity embedding-based helper function.
- Added topP, stop, seed, frequencyPenalty, presencePenalty to ChatCompletionRequest.
- Added doSample, topK, topP, repetitionPenalty to TextGenerationParameters.
- Added guidanceScale, negativePrompt, seed to TextToImageParameters.
- Added UsageStats type and usage field to ChatCompletionResponse.
- Fixed inferModel and batchInfer to use postWithRetry; they now honour the retry config on 503.
- Fixed RetryConfig validation: maxRetries >= 1 and initialDelay <= maxDelay enforced at init.
- Fixed SSE streaming parser to handle \r\n line endings.
- Increased default timeout in ConnectionConfig from 30 s to 60 s.
1.0.0
- Added stateful Conversation class for automated chat history management.
- Added batch inference operations (batchInfer and typed /batch endpoints).
- Added Model Metadata APIs (getModelInfo, checkModelAvailability).
- Upgraded ragQuery to use batch embeddings and RagConfig.
0.3.0
- Added streaming chat completions via /v1/chat/completions/streamed.
- Added RAG pipeline helper ragQuery (initial version).
- Added automatic retry with exponential backoff for cold-starting models (503).
- Added image classification from file path and URL.
- Added ASR from file path and URL.
- Introduced RetryConfig, RagDocument, RagResult, ImageContentType, AudioContentType types.
- Improved generic inferModel helper with rich error handling.
0.2.0
- Initial release of the avi0ra/huggingface connector.
- Native support for 12 AI/ML inference operations.
- Generic inferModel helper.
Issues and contributions
Report issues at github.com/HasithaErandika/module-ballerinax-huggingface/issues.
For Ballerina community support: Discord · Stack Overflow #ballerina