Module ai.microsoft.sharepoint

ballerinax/ai.microsoft.sharepoint Ballerina library

1.0.0

Ballerina SharePoint Data Loader

The ballerinax/ai.microsoft.sharepoint module provides a TextDataLoader that retrieves documents from SharePoint document libraries and returns them as ai:TextDocument values, ready to be chunked, embedded, and indexed by the Ballerina AI module. Inherently textual files are decoded directly, while PDF documents have their text extracted with Apache Tika.

It implements the ai:DataLoader abstraction, so it can be used anywhere an ai:DataLoader is expected (for example, in a Retrieval-Augmented Generation ingestion pipeline).

Overview

- Resolves a document library (drive) for each configured site using the Microsoft Graph sites API.
- Downloads file content through the Microsoft Graph drive (drive items) API.
- Supports loading individual files as well as entire folders, optionally recursively.
- Optionally loads SharePoint site pages (modern web-part pages) as ai:TextDocument values.
- Reads from multiple sites and libraries with a single loader instance.
- Returns every file as an ai:TextDocument, based on its MIME type / extension:
  - Inherently textual files (e.g. txt, md, html, json, csv, xml) are decoded directly.
  - PDF files have their text extracted with Apache Tika.
  - Other files that cannot be represented as text (e.g. images, audio, archives) are skipped with a logged warning; explicitly naming such a file as a path is an error.

Authentication

SharePoint is accessed through the Microsoft Graph API. The loader supports three authentication mechanisms via ConnectionConfig.auth:

| Mechanism | Type | Best for |
|---|---|---|
| Client credentials grant | `OAuth2ClientCredentialsGrantConfig` | App-only, server-to-server access (requires the `Sites.Read.All` application permission, which also covers site pages; use `Sites.Selected` for least-privilege, per-site access) |
| Refresh token grant | `OAuth2RefreshTokenGrantConfig` | User-delegated access |
| Bearer token | `http:BearerTokenConfig` | A pre-obtained access token (testing or externally managed tokens) |
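
The bearer token and refresh token mechanisms can be configured much like the client credentials example under Usage. The following is a minimal sketch rather than an official sample: the configurable variable names are hypothetical, `token` is the field of `http:BearerTokenConfig`, the refresh-token field names are those listed for OAuth2RefreshTokenGrantConfig in the Records section, and the scope value is an assumption that should match the delegated permissions granted to your app.

```ballerina
import ballerina/http;
import ballerinax/ai.microsoft.sharepoint;

// Hypothetical configurable values, supplied for example through Config.toml.
configurable string accessToken = ?;
configurable string refreshToken = ?;
configurable string clientId = ?;
configurable string clientSecret = ?;

// Bearer token: a pre-obtained access token (testing or externally managed tokens).
final http:BearerTokenConfig bearerAuth = {token: accessToken};
final sharepoint:ConnectionConfig bearerConnection = {auth: bearerAuth};

// Refresh token grant: user-delegated access.
final sharepoint:OAuth2RefreshTokenGrantConfig refreshAuth = {
    refreshUrl: "https://login.microsoftonline.com/<tenant-id>/oauth2/v2.0/token",
    refreshToken: refreshToken,
    clientId: clientId,
    clientSecret: clientSecret,
    // Assumption: adjust to the delegated scopes your app registration uses.
    scopes: ["https://graph.microsoft.com/.default"]
};
final sharepoint:ConnectionConfig delegatedConnection = {auth: refreshAuth};
```

Either ConnectionConfig value can then be passed as the first argument when constructing the loader.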

Usage

Initialization


```ballerina
import ballerina/ai;
import ballerinax/ai.microsoft.sharepoint;

final sharepoint:TextDataLoader loader = check new (
    {
        auth: {
            tokenUrl: "https://login.microsoftonline.com/<tenant-id>/oauth2/v2.0/token",
            clientId: "<client-id>",
            clientSecret: "<client-secret>",
            scopes: ["https://graph.microsoft.com/.default"]
        }
    },
    {
        // Load only PDFs found under /Onboarding (recursively), plus one explicit file.
        siteId: "contoso.sharepoint.com:/sites/HR",
        libraries: [
            {
                paths: ["/Policies/leave-policy.pdf", "/Onboarding"],
                recursive: true,
                includeExtensions: ["pdf"]
            }
        ]
    },
    {
        // Read from several document libraries on the same site; each library
        // carries its own paths and traversal options.
        siteId: "contoso.sharepoint.com:/sites/Engineering",
        libraries: [
            {name: "Specs", paths: ["/api-design.md"]},
            {name: "Site Assets", paths: ["/diagrams"], recursive: true}
        ]
    },
    {
        // Files and site pages can be combined; `pages: ["*"]` loads every page,
        // an empty/omitted `pages` (the default) loads none.
        siteId: "contoso.sharepoint.com:/sites/News",
        pages: ["*"]
    }
);
```

Loading site pages

Set the pages field on a Source to ingest SharePoint modern pages as text:

- `pages: ["*"]` loads every page on the site.
- `pages: ["Home", "Q3-Update"]` loads specific pages, matched by name, title, or id. Note that a page's name is its file name including the .aspx extension (e.g. Home.aspx), so matching by the human-friendly title is usually more convenient.
- `pages` omitted or `()` (the default) loads no pages.

Each page becomes a single ai:TextDocument whose content is the page title followed by the plain text extracted from its web parts (innerHtml and server-processed searchable text, with HTML stripped). Metadata includes the page title, webUrl, and timestamps.

Note: Web-part text extraction relies on the Microsoft Graph pages/webParts API, which is best-effort — for some pages Graph returns little or no web-part content, in which case the resulting document may contain only the page title.
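
To ingest pages without any document-library files, `pages` can be combined with an empty `libraries` list. This is a small configuration sketch based on the field descriptions in this document; the site and the page title "Q3 Update" are hypothetical.

```ballerina
import ballerinax/ai.microsoft.sharepoint;

// Pages only: `libraries: []` turns off document-library loading for this site.
final sharepoint:Source newsPages = {
    siteId: "contoso.sharepoint.com:/sites/News",
    libraries: [],
    // One page matched by its (hypothetical) title, one by its file name.
    pages: ["Q3 Update", "Home.aspx"]
};
```

Leaving `libraries` unset here would also load the site's default document library, because the field defaults to `[{}]`.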

Reading from multiple libraries

A SharePoint site can host several document libraries. The libraries field takes one Library per library, each binding a library to its own paths and traversal options — mirroring the Microsoft Graph model, where a path is always resolved relative to a specific library (a Graph drive):


```ballerina
libraries: [
    {name: "Documents", paths: ["/Reports"], recursive: true},
    {name: "Specs", paths: ["/api-design.md"]}
]
```

Each library's paths default to ["/"], so a bare {name: "Specs"} loads that whole library; set paths to narrow it. The libraries field itself defaults to [{}], so a Source with only a siteId loads the whole of the site's default document library; set libraries to [] to load no document-library content.

- `name: "Documents"` (the default) selects the standard library. Note that this is the English display name; on tenants provisioned in another language the default library is reported by a localized name (e.g. Dokumente, Documentos, 文档), so set name explicitly to match what SharePoint shows.
- `name: "*"` reads from every library on the site. Because that entry's paths are then applied to all libraries, a path that does not exist in a given library is skipped for it rather than treated as an error. A named library resolves to exactly one drive, so a missing path there is still reported as an error to help catch typos.
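
The two special cases above might be configured as follows. This is a sketch under assumptions: the site ids, folder names, and the localized library name are hypothetical and must match what your tenant actually shows.

```ballerina
import ballerinax/ai.microsoft.sharepoint;

// Every library on the site: /Archive is loaded wherever it exists and
// skipped, without an error, in libraries that do not contain it.
final sharepoint:Source engineeringArchives = {
    siteId: "contoso.sharepoint.com:/sites/Engineering",
    libraries: [
        {name: "*", paths: ["/Archive"], recursive: true}
    ]
};

// A tenant provisioned in German: the default library is named explicitly.
// A missing /Richtlinien folder is reported as an error here.
final sharepoint:Source germanHr = {
    siteId: "contoso.sharepoint.com:/sites/Personal",
    libraries: [
        {name: "Dokumente", paths: ["/Richtlinien"]}
    ]
};
```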

Filtering by file type

Each Library has its own includeExtensions to restrict which files are loaded from that library's folders:

- `includeExtensions: ["pdf"]` loads only PDF files.
- `includeExtensions: ["pdf", ".md", "TXT"]` shows that matching is case-insensitive and that a leading dot is optional.
- Omitted or `()` (the default) loads all file types.

The filter applies to files discovered while traversing folders. A file listed explicitly in that library's paths is always loaded, even if its extension isn't in the list.
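
The matching rule can be illustrated with a small standalone function. This is not the module's implementation, only a sketch of the documented behaviour (case-insensitive, optional leading dot, `()` meaning no filter); the name `isIncluded` is hypothetical, and rejecting files that have no extension when a filter is set is an assumption.

```ballerina
import ballerina/io;

// Hypothetical helper that mimics the documented includeExtensions rule.
function isIncluded(string fileName, string[]? includeExtensions) returns boolean {
    if includeExtensions is () {
        return true; // no filter configured: every file type is accepted
    }
    int? dotIndex = fileName.lastIndexOf(".");
    if dotIndex is () {
        return false; // assumption: a file without an extension never matches
    }
    string extension = fileName.substring(dotIndex + 1).toLowerAscii();
    foreach string allowed in includeExtensions {
        // Normalize each entry: lower-case it and drop an optional leading dot.
        string normalized = allowed.toLowerAscii();
        if normalized.startsWith(".") {
            normalized = normalized.substring(1);
        }
        if normalized == extension {
            return true;
        }
    }
    return false;
}

public function main() {
    string[] filter = ["pdf", ".md", "TXT"];
    io:println(isIncluded("Handbook.PDF", filter)); // true
    io:println(isIncluded("notes.md", filter)); // true
    io:println(isIncluded("logo.png", filter)); // false
    io:println(isIncluded("logo.png", ())); // true
}
```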

Loading documents


```ballerina
public function main() returns error? {
    ai:Document[]|ai:Document documents = check loader.load();
    // Pass the documents to a chunker / embedding provider / vector store ...
}
```

load() returns a single ai:Document when exactly one file is resolved, and an ai:Document[] otherwise (mirroring ai:TextDataLoader).
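
Because the result can take either shape, downstream code often wants a uniform array. The helper below is a minimal sketch, assuming only the `load()` signature documented here; `loadAsArray` is a hypothetical name and not part of the module.

```ballerina
import ballerina/ai;

// Hypothetical helper: always hands back an array, whichever shape load() produced.
function loadAsArray(ai:DataLoader dataLoader) returns ai:Document[]|ai:Error {
    ai:Document[]|ai:Document loaded = check dataLoader.load();
    if loaded is ai:Document[] {
        return loaded;
    }
    // Exactly one file was resolved, so wrap the single document.
    return [loaded];
}
```

Since the parameter is typed as `ai:DataLoader`, the same helper accepts the SharePoint loader or any other loader that implements the abstraction.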

Configuration reference

`ConnectionConfig`

| Field | Type | Default | Description |
|---|---|---|---|
| `auth` | `http:BearerTokenConfig \| OAuth2ClientCredentialsGrantConfig \| OAuth2RefreshTokenGrantConfig` | - | Authentication configuration |
| `serviceUrl` | `string` | `"https://graph.microsoft.com/v1.0"` | Microsoft Graph base URL |
| `httpVersion` | `http:HttpVersion` | `http:HTTP_2_0` | HTTP version understood by the client |
| `http1Settings` | `ClientHttp1Settings` | - | HTTP/1.x protocol settings (keep-alive, chunking, proxy) |
| `http2Settings` | `http:ClientHttp2Settings` | - | HTTP/2 protocol settings |
| `timeout` | `decimal` | `30` | Response timeout, in seconds |
| `forwarded` | `string` | `"disable"` | Handling of the `forwarded`/`x-forwarded` header |
| `followRedirects` | `http:FollowRedirects` | `{enabled: true, maxCount: 5, allowAuthHeaders: true}` | Redirect handling. Enabled by default so the Graph `/items/{id}/content` 302 download redirect is followed |
| `poolConfig` | `http:PoolConfiguration` | - | Request pooling configuration |
| `cache` | `http:CacheConfig` | - | HTTP caching configuration |
| `compression` | `http:Compression` | `http:COMPRESSION_AUTO` | `accept-encoding` handling |
| `circuitBreaker` | `http:CircuitBreakerConfig` | - | Circuit breaker configuration |
| `retryConfig` | `http:RetryConfig` | - | Retry configuration |
| `cookieConfig` | `http:CookieConfig` | - | Cookie configuration |
| `responseLimits` | `http:ResponseLimitConfigs` | - | Inbound response size limits |
| `secureSocket` | `http:ClientSecureSocket` | - | SSL/TLS options |
| `proxy` | `http:ProxyConfig` | - | Proxy server options |
| `socketConfig` | `http:ClientSocketConfig` | `{}` | Client socket configuration |
| `validation` | `boolean` | `true` | Inbound payload validation |
| `laxDataBinding` | `boolean` | `true` | Relaxed data binding |

The HTTP-level fields mirror those of the underlying Microsoft Graph sites and pages clients and are forwarded to both.
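
As an illustration of the HTTP-level fields, the sketch below raises the response timeout and enables retries. The values are arbitrary examples, the configurable names are hypothetical, and the `retryConfig` field names are assumed to be those of `http:RetryConfig` in the Ballerina HTTP library; check them against the HTTP library version you use.

```ballerina
import ballerinax/ai.microsoft.sharepoint;

// Hypothetical configurable values, supplied for example through Config.toml.
configurable string clientId = ?;
configurable string clientSecret = ?;

final sharepoint:OAuth2ClientCredentialsGrantConfig clientCredentials = {
    tokenUrl: "https://login.microsoftonline.com/<tenant-id>/oauth2/v2.0/token",
    clientId: clientId,
    clientSecret: clientSecret,
    scopes: ["https://graph.microsoft.com/.default"]
};

final sharepoint:ConnectionConfig tunedConnection = {
    auth: clientCredentials,
    // Allow slow downloads of large files up to 60 seconds per response.
    timeout: 60,
    // Assumed http:RetryConfig fields: retry up to 3 times with exponential
    // backoff when Graph answers with throttling or unavailability.
    retryConfig: {
        count: 3,
        interval: 2,
        backOffFactor: 2.0,
        maxWaitInterval: 20,
        statusCodes: [429, 503]
    }
};
```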

`Source`

| Field | Type | Default | Description |
|---|---|---|---|
| `siteId` | `string` | - | Microsoft Graph site id. Accepts the composite id (`{hostname},{spsite-guid},{spweb-guid}`) or the path form (`{hostname}:/sites/{site-name}`) |
| `libraries` | `Library[]` | `[{}]` | Document libraries to read from, each paired with its own paths and traversal options. The default `[{}]` loads the whole of the site's default document library; `[]` loads no document-library content |
| `pages` | `string[]?` | `()` | Site pages to load as text. Matched by name, title, or id. Use `["*"]` for all pages; `()` for none |

`Library`

| Field | Type | Default | Description |
|---|---|---|---|
| `name` | `string` | `"Documents"` | Display name of the document library to read from, as shown in SharePoint. The default `"Documents"` is the English name; localized tenants use a translated name. Use `"*"` for every library on the site |
| `paths` | `string[]` | `["/"]` | File and/or folder paths relative to this library's root. The default `["/"]` loads the entire library; `[]` loads nothing from it |
| `recursive` | `boolean` | `false` | Whether folder paths are traversed recursively |
| `includeExtensions` | `string[]?` | `()` | Extension allowlist applied to this library's folder contents (e.g. `["pdf"]`). Case-insensitive; `()` loads all types. Explicit file paths bypass it |

Classes

ai.microsoft.sharepoint: TextDataLoader

Isolated

A data loader that retrieves documents from SharePoint document libraries as text.

Constructor

Initializes the SharePoint data loader.

init (ConnectionConfig sharePointConnectionConfigs, Source[] sources)

sharePointConnectionConfigs ConnectionConfig - The authentication and service configuration shared by all sources

sources Source[] - One or more SharePoint sources to load documents from

load

Isolated Function

function load() returns Document[]|Document|Error

Loads the configured SharePoint documents.

Return Type

Document[]|Document|Error - The loaded document when a single file is resolved, an array of documents otherwise, or an ai:Error on failure

Records

ai.microsoft.sharepoint: ConnectionConfig

Closed record

Authentication and connection configuration used to reach SharePoint through the Microsoft Graph API. The HTTP-level options mirror those of the underlying Microsoft Graph sites and pages clients and are forwarded to both.

Fields

auth OAuth2ClientCredentialsGrantConfig|BearerTokenConfig|OAuth2RefreshTokenGrantConfig - Configurations related to client authentication

httpVersion HttpVersion(default http:HTTP_2_0) - The HTTP version understood by the client

http1Settings ClientHttp1Settings(default {}) - Configurations related to HTTP/1.x protocol

http2Settings ClientHttp2Settings(default {}) - Configurations related to HTTP/2 protocol

timeout decimal(default 30) - The maximum time to wait (in seconds) for a response before closing the connection

forwarded string(default "disable") - The choice of setting forwarded/x-forwarded header

followRedirects FollowRedirects(default { enabled: true, maxCount: 5, allowAuthHeaders: true }) - Configurations associated with Redirection

poolConfig? PoolConfiguration - Configurations associated with request pooling

cache CacheConfig(default {}) - HTTP caching related configurations

compression Compression(default http:COMPRESSION_AUTO) - Specifies the way of handling compression (accept-encoding) header

circuitBreaker? CircuitBreakerConfig - Configurations associated with the behaviour of the Circuit Breaker

retryConfig? RetryConfig - Configurations associated with retrying

cookieConfig? CookieConfig - Configurations associated with cookies

responseLimits ResponseLimitConfigs(default {}) - Configurations associated with inbound response size limits

secureSocket? ClientSecureSocket - SSL/TLS-related options

proxy? ProxyConfig - Proxy server related options

socketConfig ClientSocketConfig(default {}) - Provides settings related to client socket configuration

validation boolean(default true) - Enables the inbound payload validation functionality, which is provided by the constraint package. Enabled by default

laxDataBinding boolean(default true) - Enables relaxed data binding on the client side. When enabled, nil values are treated as optional, and absent fields are handled as nilable types. Enabled by default.

serviceUrl string(default "https://graph.microsoft.com/v1.0") - The base URL of the Microsoft Graph service

ai.microsoft.sharepoint: Library

Closed record

A rule selecting what to load from one library within a site, or from every library on the site when name is "*" (the paths then apply to each).

Fields

name string(default "Documents") - The library's display name as shown in SharePoint, or "*" for every library on the site. Defaults to "Documents", the English name of the standard library; localized tenants report a translated name (e.g. Dokumente, Documentos), so set this to match what SharePoint shows

paths string[](default ["/"]) - File/folder paths relative to the library root (e.g. /Reports). Defaults to ["/"], the whole library; [] skips this library

recursive boolean(default false) - Whether folder paths are traversed into sub-folders. Defaults to false

includeExtensions string[]?(default ()) - Case-insensitive extension allowlist for folder contents. Defaults to (), all types

ai.microsoft.sharepoint: OAuth2ClientCredentialsGrantConfig

Closed record

OAuth2 Client Credentials Grant Configs

Fields

Fields Included from *OAuth2ClientCredentialsGrantConfig

tokenUrl string
clientId string
clientSecret string
scopes string|string[]
defaultTokenExpTime decimal
clockSkew decimal
optionalParams map<string>
credentialBearer CredentialBearer
clientConfig ClientConfiguration

tokenUrl string(default "https://login.microsoftonline.com/common/oauth2/v2.0/token") - Token URL

ai.microsoft.sharepoint: OAuth2RefreshTokenGrantConfig

Closed record

OAuth2 Refresh Token Grant Configs

Fields

Fields Included from *OAuth2RefreshTokenGrantConfig

refreshUrl string
refreshToken string
clientId string
clientSecret string
scopes string|string[]
defaultTokenExpTime decimal
clockSkew decimal
optionalParams map<string>
credentialBearer CredentialBearer
clientConfig ClientConfiguration

refreshUrl string(default "https://login.microsoftonline.com/common/oauth2/v2.0/token") - Refresh URL

ai.microsoft.sharepoint: Source

Closed record

A SharePoint site to load documents from. Several may be configured per loader.

Fields

siteId string - The Graph site id, which uniquely identifies a SharePoint site. It is either the composite id ({hostname},{spsite-guid},{spweb-guid}) or the path form ({hostname}:/sites/{site-name}). The browser URL is not supported directly, but you can derive the site id from the site URL as follows:
```
https://contoso.sharepoint.com/sites/Marketing
        └──────────┬─────────┘└──────┬───────┘
              hostname          server-relative path
siteID = contoso.sharepoint.com:/sites/Marketing
```
Alternatively, you can obtain the site id by calling the SharePoint REST API at {site-url}/_api/site/id, for example:
```
https://contoso.sharepoint.com/sites/Marketing/_api/site/id
```
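
The URL-to-site-id derivation shown above can also be sketched in code. `toSiteId` is a hypothetical helper, not part of the module; it assumes an `https://` site URL with a server-relative path and no trailing slash, and returns an error otherwise.

```ballerina
import ballerina/io;

// Hypothetical helper: turns a site URL into the path-form site id
// ({hostname}:{server-relative path}).
function toSiteId(string siteUrl) returns string|error {
    string prefix = "https://";
    if !siteUrl.startsWith(prefix) {
        return error("expected an https:// site URL");
    }
    string rest = siteUrl.substring(prefix.length());
    int? slashIndex = rest.indexOf("/");
    if slashIndex is () {
        return error("expected a server-relative path such as /sites/Marketing");
    }
    string hostname = rest.substring(0, slashIndex);
    string path = rest.substring(slashIndex);
    return string `${hostname}:${path}`;
}

public function main() returns error? {
    // Prints: contoso.sharepoint.com:/sites/Marketing
    io:println(check toSiteId("https://contoso.sharepoint.com/sites/Marketing"));
}
```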

libraries Library[](default [{}]) - The libraries to read from, each with its own paths and options. Defaults to the site's default document library.

pages string[]?(default ()) - Site pages to load as text, matched by name, title, or id. Defaults to (), no pages

Import

import ballerinax/ai.microsoft.sharepoint;

Other versions

1.0.0

Metadata

Released date: 15 days ago

Version: 1.0.0

License: Apache-2.0

Compatibility

Platform: java21

Ballerina version: 2201.12.0

GraalVM compatible: Yes

Pull count

Total: 154

Current version: 154

Keywords

Vendor/Microsoft

Area/AI & Machine Learning

Type/Data Loader

sharepoint

Dependencies

ballerina/oauth2/2.14.1 ballerina/log/2.14.0 ballerina/url/2.6.1
