ballerinax/ai.microsoft.sharepoint Ballerina library

1.0.0
Ballerina SharePoint Data Loader

The ballerinax/ai.microsoft.sharepoint module provides a TextDataLoader that retrieves documents from SharePoint document libraries and returns them as ai:TextDocument values, ready to be chunked, embedded, and indexed by the Ballerina AI module. Inherently textual files are decoded directly, while PDF documents have their text extracted with Apache Tika.

It implements the ai:DataLoader abstraction, so it can be used anywhere an ai:DataLoader is expected (for example, in a Retrieval-Augmented Generation ingestion pipeline).

Overview

  • Resolves a document library (drive) for each configured site using the Microsoft Graph sites API.
  • Downloads file content through the Microsoft Graph drive (drive items) API.
  • Supports loading individual files as well as entire folders, optionally recursively.
  • Optionally loads SharePoint site pages (modern web-part pages) as ai:TextDocuments.
  • Reads from multiple sites and libraries with a single loader instance.
  • Converts each supported file to an ai:TextDocument, based on its MIME type / extension:
    • Inherently textual files (e.g. txt, md, html, json, csv, xml) are decoded directly.
    • pdf files have their text extracted with Apache Tika.
    • Other files that cannot be represented as text (e.g. images, audio, archives) are skipped with a logged warning; explicitly naming such a file as a path is an error.

Authentication

SharePoint is accessed through the Microsoft Graph API. The loader supports three authentication mechanisms via ConnectionConfig.auth:

| Mechanism | Type | Best for |
| --- | --- | --- |
| Client credentials grant | OAuth2ClientCredentialsGrantConfig | App-only, server-to-server access (requires the Sites.Read.All application permission, which also covers site pages; use Sites.Selected for least-privilege, per-site access) |
| Refresh token grant | OAuth2RefreshTokenGrantConfig | User-delegated access |
| Bearer token | http:BearerTokenConfig | A pre-obtained access token (testing or externally managed tokens) |
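
The Initialization example under Usage uses the client credentials grant. The other two mechanisms might be configured as in the sketch below. The field names refreshUrl, refreshToken, and token are assumptions borrowed from the standard ballerina/http auth records; this page does not list the fields of the module's own grant config types, so check them against the module's API docs. All values are placeholders.

```ballerina
import ballerinax/ai.microsoft.sharepoint;

// Bearer token: a pre-obtained Microsoft Graph access token (placeholder value).
sharepoint:ConnectionConfig bearerConfig = {
    auth: {token: "<access-token>"}
};

// Refresh token grant for user-delegated access.
// Assumption: field names follow http:OAuth2RefreshTokenGrantConfig.
sharepoint:ConnectionConfig refreshConfig = {
    auth: {
        refreshUrl: "https://login.microsoftonline.com/<tenant-id>/oauth2/v2.0/token",
        refreshToken: "<refresh-token>",
        clientId: "<client-id>",
        clientSecret: "<client-secret>"
    }
};
```

In real deployments the secrets would normally come from configurable variables or a secret store rather than string literals.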

Usage

Initialization

import ballerina/ai;
import ballerinax/ai.microsoft.sharepoint;

final sharepoint:TextDataLoader loader = check new (
    {
        auth: {
            tokenUrl: "https://login.microsoftonline.com/<tenant-id>/oauth2/v2.0/token",
            clientId: "<client-id>",
            clientSecret: "<client-secret>",
            scopes: ["https://graph.microsoft.com/.default"]
        }
    },
    {
        // Load only PDFs found under /Onboarding (recursively), plus one explicit file.
        siteId: "contoso.sharepoint.com:/sites/HR",
        libraries: [
            {
                paths: ["/Policies/leave-policy.pdf", "/Onboarding"],
                recursive: true,
                includeExtensions: ["pdf"]
            }
        ]
    },
    {
        // Read from several document libraries on the same site; each library
        // carries its own paths and traversal options.
        siteId: "contoso.sharepoint.com:/sites/Engineering",
        libraries: [
            {name: "Specs", paths: ["/api-design.md"]},
            {name: "Site Assets", paths: ["/diagrams"], recursive: true}
        ]
    },
    {
        // Files and site pages can be combined; `pages: ["*"]` loads every page,
        // an empty/omitted `pages` (the default) loads none.
        siteId: "contoso.sharepoint.com:/sites/News",
        pages: ["*"]
    }
);

Loading site pages

Set the pages field on a Source to ingest SharePoint modern pages as text:

  • pages: ["*"] — load every page on the site.
  • pages: ["Home", "Q3-Update"] — load specific pages, matched by name, title, or id. Note that a page's name is its file name including the .aspx extension (e.g. Home.aspx), so matching by the human-friendly title is usually more convenient.
  • pages omitted or () (the default) — load no pages.

Each page becomes a single ai:TextDocument whose content is the page title followed by the plain text extracted from its web parts (innerHtml and server-processed searchable text, with HTML stripped). Metadata includes the page title, webUrl, and timestamps.

Note: Web-part text extraction relies on the Microsoft Graph pages/webParts API, which is best-effort — for some pages Graph returns little or no web-part content, in which case the resulting document may contain only the page title.
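
As an illustration, a Source that loads only two named pages and no document-library files could look like the following. The site path and page names are placeholders, and the sketch assumes the record type is exposed as sharepoint:Source, matching the name used in the configuration reference.

```ballerina
import ballerinax/ai.microsoft.sharepoint;

// Pages only: `libraries: []` turns off document-library loading for this site.
sharepoint:Source newsPages = {
    siteId: "contoso.sharepoint.com:/sites/News",
    libraries: [],
    // Matched by name, title, or id; titles avoid the ".aspx" suffix.
    pages: ["Home", "Q3-Update"]
};
```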

Reading from multiple libraries

A SharePoint site can host several document libraries. The libraries field takes one Library per library, each binding a library to its own paths and traversal options — mirroring the Microsoft Graph model, where a path is always resolved relative to a specific library (a Graph drive):

libraries: [
    {name: "Documents", paths: ["/Reports"], recursive: true},
    {name: "Specs", paths: ["/api-design.md"]}
]

Each Library's paths default to ["/"], so a bare {name: "Specs"} loads that whole library; set paths to narrow it. The libraries field itself defaults to [{}], so a Source with only a siteId loads the whole of the site's default document library; set libraries to [] to load no document-library content.

  • name: "Documents" (the default) selects the standard library. Note that this is the English display name; on tenants provisioned in another language the default library is reported by a localized name (e.g. Dokumente, Documentos, 文档), so set name explicitly to match what SharePoint shows.
  • name: "*" reads from every library on the site. Because that Library's paths are then applied to all libraries, a path that does not exist in a given library is skipped for it rather than treated as an error. A named library resolves to exactly one drive, so a missing path there is still reported as an error to help catch typos.
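
The two naming cases above can be sketched as Source values. The site paths, the /Reports folder, and the variable names are placeholders chosen for illustration.

```ballerina
import ballerinax/ai.microsoft.sharepoint;

// Wildcard: look for /Reports in every library on the site. Libraries that
// have no /Reports folder are skipped rather than reported as errors.
sharepoint:Source everyLibrary = {
    siteId: "contoso.sharepoint.com:/sites/Engineering",
    libraries: [
        {name: "*", paths: ["/Reports"], recursive: true}
    ]
};

// Localized tenant: name the default library as SharePoint displays it.
sharepoint:Source germanTenant = {
    siteId: "contoso.sharepoint.com:/sites/HR",
    libraries: [
        {name: "Dokumente"}
    ]
};
```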

Filtering by file type

Each Library has its own includeExtensions to restrict which files are loaded from that library's folders:

  • includeExtensions: ["pdf"] — only PDF files.
  • includeExtensions: ["pdf", ".md", "TXT"] — case-insensitive; a leading dot is optional.
  • omitted / () (the default) — load all file types.

The filter applies only to files discovered while traversing folders. A file listed explicitly in that Library's paths is always loaded, even if its extension isn't in the list.
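
To make the matching rules concrete, here is a minimal standalone sketch of case-insensitive matching with an optional leading dot. It is not the module's implementation: normalizeExtension and isIncluded are hypothetical helpers, and treating a file without an extension as excluded when a filter is set is an assumption.

```ballerina
import ballerina/io;

// Hypothetical helper: maps "pdf", ".PDF", and "Pdf" to "pdf".
function normalizeExtension(string ext) returns string {
    string lower = ext.toLowerAscii();
    return lower.startsWith(".") ? lower.substring(1) : lower;
}

// Hypothetical helper: decides whether a file found during folder traversal
// passes the allowlist.
function isIncluded(string fileName, string[]? includeExtensions) returns boolean {
    if includeExtensions is () {
        return true; // no filter configured: all file types pass
    }
    int? dot = fileName.lastIndexOf(".");
    if dot is () {
        return false; // assumption: extensionless files fail a configured filter
    }
    string fileExt = fileName.substring(dot + 1).toLowerAscii();
    foreach string ext in includeExtensions {
        if normalizeExtension(ext) == fileExt {
            return true;
        }
    }
    return false;
}

public function main() {
    string[] allowlist = ["pdf", ".md", "TXT"];
    io:println(isIncluded("Handbook.PDF", allowlist)); // true
    io:println(isIncluded("notes.txt", allowlist)); // true
    io:println(isIncluded("logo.png", allowlist)); // false
    io:println(isIncluded("logo.png", ())); // true
}
```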

Loading documents

public function main() returns error? {
    ai:Document[]|ai:Document documents = check loader.load();
    // Pass the documents to a chunker / embedding provider / vector store ...
}

load() returns a single ai:Document when exactly one file is resolved, and an ai:Document[] otherwise (mirroring ai:TextDataLoader).
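
Because of this union return type, callers often normalize the result to an array before further processing. The sketch below shows one way to do so. toArray and textContents are hypothetical helper names, and the sample document in main stands in for the value returned by loader.load(); constructing an ai:TextDocument from content alone assumes its type field has a default.

```ballerina
import ballerina/ai;
import ballerina/io;

// Hypothetical helper: wraps a single document so callers always get an array.
function toArray(ai:Document[]|ai:Document loaded) returns ai:Document[] {
    return loaded is ai:Document[] ? loaded : [loaded];
}

// Hypothetical helper: collects the text of every ai:TextDocument.
function textContents(ai:Document[]|ai:Document loaded) returns string[] {
    string[] contents = [];
    foreach ai:Document doc in toArray(loaded) {
        if doc is ai:TextDocument {
            contents.push(doc.content);
        }
    }
    return contents;
}

public function main() {
    // Stand-in for `check loader.load()` so the sketch runs without SharePoint.
    ai:TextDocument sample = {content: "Leave policy: 20 days per year."};
    foreach string text in textContents(sample) {
        io:println(text.length(), " characters loaded");
    }
}
```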

Configuration reference

ConnectionConfig

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| auth | http:BearerTokenConfig \| OAuth2ClientCredentialsGrantConfig \| OAuth2RefreshTokenGrantConfig | | Authentication configuration |
| serviceUrl | string | "https://graph.microsoft.com/v1.0" | Microsoft Graph base URL |
| httpVersion | http:HttpVersion | http:HTTP_2_0 | HTTP version understood by the client |
| http1Settings | ClientHttp1Settings | | HTTP/1.x protocol settings (keep-alive, chunking, proxy) |
| http2Settings | http:ClientHttp2Settings | | HTTP/2 protocol settings |
| timeout | decimal | 30 | Response timeout, in seconds |
| forwarded | string | "disable" | Handling of the forwarded/x-forwarded header |
| followRedirects | http:FollowRedirects | {enabled: true, maxCount: 5, allowAuthHeaders: true} | Redirect handling. Enabled by default so the Graph /items/{id}/content 302 download redirect is followed |
| poolConfig | http:PoolConfiguration | | Request pooling configuration |
| cache | http:CacheConfig | | HTTP caching configuration |
| compression | http:Compression | http:COMPRESSION_AUTO | accept-encoding handling |
| circuitBreaker | http:CircuitBreakerConfig | | Circuit breaker configuration |
| retryConfig | http:RetryConfig | | Retry configuration |
| cookieConfig | http:CookieConfig | | Cookie configuration |
| responseLimits | http:ResponseLimitConfigs | | Inbound response size limits |
| secureSocket | http:ClientSecureSocket | | SSL/TLS options |
| proxy | http:ProxyConfig | | Proxy server options |
| socketConfig | http:ClientSocketConfig | {} | Client socket configuration |
| validation | boolean | true | Inbound payload validation |
| laxDataBinding | boolean | true | Relaxed data binding |

The HTTP-level fields mirror those of the underlying Microsoft Graph sites and pages clients and are forwarded to both.
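
For example, a connection that allows slower responses and retries throttled requests might be configured as below. The specific numbers and status codes are illustrative choices, not recommendations from the module; the retryConfig fields are those of the standard http:RetryConfig record.

```ballerina
import ballerinax/ai.microsoft.sharepoint;

sharepoint:ConnectionConfig tunedConfig = {
    auth: {token: "<access-token>"}, // placeholder token
    timeout: 60, // seconds; the default is 30
    retryConfig: {
        count: 3, // retry up to three times
        interval: 2, // initial wait between retries, in seconds
        backOffFactor: 2.0, // multiply the wait after each attempt
        maxWaitInterval: 20, // cap on the wait, in seconds
        statusCodes: [429, 503] // retry on throttling and service-unavailable responses
    }
};
```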

Source

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| siteId | string | | Microsoft Graph site id. Accepts the composite id ({hostname},{spsite-guid},{spweb-guid}) or the path form ({hostname}:/sites/{site-name}) |
| libraries | Library[] | [{}] | Document libraries to read from, each paired with its own paths and traversal options. The default [{}] loads the whole of the site's default document library; [] loads no document-library content |
| pages | string[]? | () | Site pages to load as text. Matched by name, title, or id. Use ["*"] for all pages; () for none |

Library

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | "Documents" | Display name of the document library to read from, as shown in SharePoint. The default "Documents" is the English name; localized tenants use a translated name. Use "*" for every library on the site |
| paths | string[] | ["/"] | File and/or folder paths relative to this library's root. The default ["/"] loads the entire library; [] loads nothing from it |
| recursive | boolean | false | Whether folder paths are traversed recursively |
| includeExtensions | string[]? | () | Extension allowlist applied to this library's folder contents (e.g. ["pdf"]). Case-insensitive; () loads all types. Explicit file paths bypass it |

Import

import ballerinax/ai.microsoft.sharepoint;

Other versions

1.0.0

Metadata

Released date: 15 days ago

Version: 1.0.0

License: Apache-2.0


Compatibility

Platform: java21

Ballerina version: 2201.12.0

GraalVM compatible: Yes


Pull count

Total: 154

Current version: 154


Keywords

Vendor/Microsoft

Area/AI & Machine Learning

Type/Data Loader

sharepoint
