ballerina/etl Ballerina library

0.8.0

Overview

This package provides APIs for processing and manipulating record-based datasets in ETL (extract, transform, load) workflows.

The APIs in this package are categorized into the following ETL process stages:

Data Categorization
Data Cleaning
Data Enrichment
Data Filtering
Data Security
Unstructured Data Extraction

Features

Data Categorization

categorizeNumeric: Categorizes a dataset based on a numeric field and specified ranges.
categorizeRegex: Categorizes a dataset based on a string field using a set of regular expressions.
categorizeSemantic: Categorizes a dataset based on a string field using semantic classification.

Data Cleaning

groupApproximateDuplicates: Identifies and groups approximate duplicates in a dataset, returning a nested array with unique records first, followed by groups of similar records.
handleWhiteSpaces: Returns a new dataset with all extra whitespace removed from string fields.
removeDuplicates: Returns a new dataset with all duplicate records removed.
removeEmptyValues: Returns a new dataset with all records containing nil or empty string values removed.
removeField: Returns a new dataset with a specified field removed from each record.
replaceText: Returns a new dataset where matches of the given regex pattern in a specified string field are replaced with a new value.
sortData: Returns a new dataset sorted by a specified field in ascending or descending order.
standardizeData: Returns a new dataset with all string values in a specified field standardized to a set of standard values.

Data Enrichment

joinData: Merges two datasets based on a common specified field and returns a new dataset with the merged records.
mergeData: Merges multiple datasets into a single dataset by flattening a nested array of records.

Data Filtering

filterDataByRatio: Filters a random set of records from a dataset based on a specified ratio.
filterDataByRegex: Filters a dataset based on a regex pattern match.
filterDataByRelativeExp: Filters a dataset based on a relative numeric comparison expression.

Data Security

decryptData: Returns a new dataset with specified fields decrypted using AES-ECB decryption with a given symmetric key.
encryptData: Returns a new dataset with specified fields encrypted using AES-ECB encryption with a given symmetric key.
maskSensitiveData: Returns a new dataset with PII (Personally Identifiable Information) fields masked using a specified character.

Unstructured Data Extraction

extractFromText: Extracts structured data from unstructured text and maps it to a Ballerina record.

Usage

Configurations

The following APIs in this package use OpenAI services and require an OpenAI API key:

categorizeSemantic
extractFromText
groupApproximateDuplicates
maskSensitiveData
standardizeData

Note: Configuration is required only for the APIs listed above. It is not needed for the use of any other APIs in this package.

Setting up the OpenAI API Key

Create an OpenAI account and obtain an API key.
Add the obtained API key and a supported GPT model in the Config.toml file as shown below:


[ballerina.etl.modelConfig]
openAiToken = "<OPENAI_API_KEY>"
model = "<GPT_MODEL>"

Supported GPT Models

"gpt-4-turbo"
"gpt-4o"
"gpt-4o-mini"

(Optional) Overriding Client Timeout

The default client timeout is set to 60 seconds. This value can be adjusted by specifying the timeout field as shown below:


[ballerina.etl.modelConfig]
openAiToken = "<OPENAI_API_KEY>"
model = "<GPT_MODEL>"
timeout = 120.0

Dependent Type Support

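All APIs in this package support dependent types: the returnType parameter defaults to the type expected at the call site, so results come back as the caller's own record type without a cast. For example: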


import ballerina/etl;
import ballerina/io;

type Customer record {|
   string name;
   string city;
|};

public function main() returns error? {
   Customer[] dataset = [
      { name: "Alice", city: "New York" },
      { name: "Bob", city: "Los Angeles" },
      { name: "Alice", city: "New York" }
   ];
   Customer[] uniqueData = check etl:removeDuplicates(dataset);
   io:println(`Customer Data Without Duplicates : ${uniqueData}`);
}
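Because `returnType` is declared with an inferred default (`<>`), it should also be possible to pass it explicitly, which is the usual Ballerina behaviour for dependently-typed functions. The following is a minimal sketch under that assumption, reusing the `Customer` type from the example above; the variable names `inferred` and `typed` are illustrative only.

```ballerina
import ballerina/etl;
import ballerina/io;

type Customer record {|
    string name;
    string city;
|};

public function main() returns error? {
    Customer[] dataset = [
        {name: "Alice", city: "New York"},
        {name: "Bob", city: "Los Angeles"},
        {name: "Alice", city: "New York"}
    ];

    // returnType is inferred from the declared type on the left-hand side.
    Customer[] inferred = check etl:removeDuplicates(dataset);

    // returnType is passed explicitly, so the variable can be declared with `var`.
    var typed = check etl:removeDuplicates(dataset, Customer);

    // Both calls are expected to yield the same number of records.
    io:println(inferred.length() == typed.length());
}
```

The explicit form is mainly useful when no contextually expected type is available, for example when the result is passed straight into another expression.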

Examples

The ballerina/etl package provides practical examples illustrating its usage in various scenarios. Explore these examples, covering different use cases:

Customer Data Processing - Processes customer data collected from various sources by extracting relevant information, cleaning and validating fields, enriching with additional metadata, and categorizing the data for downstream applications.
Product Catalog Processing - Consolidates product catalog data from multiple sources by extracting and merging entries, encrypting sensitive fields, classifying products into relevant categories, and storing the structured data securely in a MySQL database for easy access and analysis.
User Feedback Analysis - Handles raw user feedback by extracting and standardizing input, classifying comments based on content and sentiment, and storing the processed feedback for further analysis.
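Independently of the examples listed above, a much smaller pipeline can be sketched by chaining three of the cleaning APIs that need no OpenAI configuration. The dataset and the replacement value `"NYC"` are made up for illustration, and the sketch has not been run against the package, so no output is shown.

```ballerina
import ballerina/etl;
import ballerina/io;

type Customer record {|
    string name;
    string city;
|};

public function main() returns error? {
    // Illustrative input: the first and third records differ only in whitespace.
    Customer[] raw = [
        {name: "  Bob ", city: "Los  Angeles"},
        {name: "Alice", city: "New York"},
        {name: "Bob", city: "Los Angeles"}
    ];

    // 1. Normalize whitespace so that otherwise identical records compare equal.
    Customer[] trimmed = check etl:handleWhiteSpaces(raw);

    // 2. Drop exact duplicates left after normalization.
    Customer[] unique = check etl:removeDuplicates(trimmed);

    // 3. Rewrite one city name using a regular expression.
    Customer[] cleaned = check etl:replaceText(unique, "city", re `New York`, "NYC");

    io:println(cleaned);
}
```

Each step returns a new dataset, so the intermediate values stay available for inspection if one stage produces unexpected results.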

Functions

categorizeNumeric

function categorizeNumeric(record {}[] dataset, string fieldName, CategoryRanges categoryRanges, typedesc<record {}> returnType) returns returnType[][]|Error

Categorizes a dataset based on a numeric field and specified ranges.


Order[] dataset = [
    { orderId: 1, customerName: "Alice", totalAmount: 5.3 },
    { orderId: 2, customerName: "Bob", totalAmount: 10.5 },
    { orderId: 3, customerName: "John", totalAmount: 15.0 },
    { orderId: 4, customerName: "Charlie", totalAmount: 25.0 },
    { orderId: 5, customerName: "David", totalAmount: 29.2 }
];
CategoryRanges categoryRanges = [0, [10,20], 30];
Order[][] categorized = check etl:categorizeNumeric(dataset, "totalAmount", categoryRanges);
=>[[{ orderId: 1, customerName: "Alice", totalAmount: 5.3 }],
   [{ orderId: 2, customerName: "Bob", totalAmount: 10.5 }, { orderId: 3, customerName: "John", totalAmount: 15.0 }],
   [{ orderId: 4, customerName: "Charlie", totalAmount: 25.0 }, { orderId: 5, customerName: "David", totalAmount: 29.2 }]]

Parameters

dataset record {}[] - Array of records containing numeric values.

fieldName string - Name of the numeric field to categorize.

categoryRanges CategoryRanges - Numeric ranges for categorization.

returnType typedesc<record {}> (default <>) - The type of the return value (Ballerina record array).

Return Type

returnType[][]|Error - A nested array of categorized records or an etl:Error.

categorizeRegex

function categorizeRegex(record {}[] dataset, string fieldName, RegExp[] regexArray, typedesc<record {}> returnType) returns returnType[][]|Error

Categorizes a dataset based on a string field using a set of regular expressions.


Customer[] dataset = [
    { name: "Alice", city: "New York" },
    { name: "Bob", city: "Colombo" },
    { name: "Charlie", city: "Los Angeles" },
    { name: "John", city: "Boston" }
];
regexp:RegExp[] regexArray = [re `^A.*$`, re `^B.*$`, re `^C.*$`];
Customer[][] categorized = check etl:categorizeRegex(dataset, "name", regexArray);
=>[[{ name: "Alice", city: "New York" }],
   [{ name: "Bob", city: "Colombo" }],
   [{ name: "Charlie", city: "Los Angeles" }]]

Parameters

dataset record {}[] - Array of records containing string values.

fieldName string - Name of the string field to categorize.

regexArray RegExp[] - Array of regular expressions for matching categories.

returnType typedesc<record {}> (default <>) - The type of the return value (Ballerina record array).

Return Type

returnType[][]|Error - A nested array of categorized records or an etl:Error.

categorizeSemantic

function categorizeSemantic(record {}[] dataset, string fieldName, string[] categories, typedesc<record {}> returnType) returns returnType[][]|Error

Categorizes a dataset based on a string field using semantic classification.


Review[] dataset = [
    { id: 1, comment: "Great service!" },
    { id: 2, comment: "Good service!" },
    { id: 3, comment: "blh blh blh" },
    { id: 4, comment: "Terrible experience" }
];
Review[][] categorized = check etl:categorizeSemantic(dataset, "comment", ["Positive", "Negative"]);
=>[[{ id: 1, comment: "Great service!" }, { id: 2, comment: "Good service!" }],
   [{ id: 4, comment: "Terrible experience" }]]

Parameters

dataset record {}[] - Array of records containing textual data.

fieldName string - Name of the field to categorize.

categories string[] - Array of category names for classification.

returnType typedesc<record {}> (default <>) - The type of the return value (Ballerina record array).

Return Type

returnType[][]|Error - A nested array of categorized records or an etl:Error.

decryptData

function decryptData(record {}[] dataset, string[] fieldNames, byte[] key, typedesc<record {}> returnType) returns returnType[]|Error

Returns a new dataset with specified fields decrypted using AES-ECB decryption with a given symmetric key.


Customer[] encryptedDataset = [
    { name: "kHKa63v98rbDm+FB2DJ3ig==", age: 23 },
    { name: "S0x+hpmvSOIT7UE8hOGZkA==", age: 35 }
];
byte[16] key = [78, 45, 73, 76, 56, 73, 116, 116, 72, 70, 105, 108, 97, 110, 65, 100];
DecryptedCustomer[] decryptedData = check etl:decryptData(encryptedDataset, ["name"], key);
=> [{ name: "Alice", age: 23 },
    { name: "Bob", age: 35 }]

Parameters

dataset record {}[] - The dataset containing records with Base64-encoded encrypted fields.

fieldNames string[] - An array of field names that should be decrypted.

key byte[] - The AES decryption key in byte array format.

returnType typedesc<record {}> (default <>) - The type of the return value (Ballerina record).

Return Type

returnType[]|Error - A dataset with the specified fields decrypted or an etl:Error.

encryptData

function encryptData(record {}[] dataset, string[] fieldNames, byte[] key, typedesc<record {}> returnType) returns returnType[]|Error

Returns a new dataset with specified fields encrypted using AES-ECB encryption with a given symmetric key.


Customer[] dataset = [
    { id: 1, name: "Alice", age: 25 },
    { id: 2, name: "Bob", age: 30 }
];
byte[16] key = [78, 45, 73, 76, 56, 73, 116, 116, 72, 70, 105, 108, 97, 110, 65, 100];
EncryptedCustomer[] encryptedData = check etl:encryptData(dataset, ["name"], key);
=>[{ id: 1, name: "kHKa63v98rbDm+FB2DJ3ig==", age: 25 },
   { id: 2, name: "S0x+hpmvSOIT7UE8hOGZkA==", age: 30 }]

Parameters

dataset record {}[] - The dataset containing records where specific fields need encryption.

fieldNames string[] - An array of field names that should be encrypted.

key byte[] - The AES encryption key in byte array format.

returnType typedesc<record {}> (default <>) - The type of the return value (Ballerina record).

Return Type

returnType[]|Error - A dataset with specified fields encrypted and Base64-encoded or an etl:Error.
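Since encryptData and decryptData take the same field list and the same symmetric key, a round trip is a natural sanity check. The sketch below assumes that decrypting with the key used for encryption restores the original values; the key is the sample key from the examples above and is not suitable for real use.

```ballerina
import ballerina/etl;
import ballerina/io;

type Customer record {|
    int id;
    string name;
    int age;
|};

public function main() returns error? {
    Customer[] dataset = [
        {id: 1, name: "Alice", age: 25},
        {id: 2, name: "Bob", age: 30}
    ];

    // 16-byte (128-bit) AES key, copied from the documentation example.
    byte[16] key = [78, 45, 73, 76, 56, 73, 116, 116, 72, 70, 105, 108, 97, 110, 65, 100];

    // Encrypt only the name field; it stays a string (Base64 text) in the result.
    Customer[] encrypted = check etl:encryptData(dataset, ["name"], key);

    // Decrypt the same field with the same key.
    Customer[] decrypted = check etl:decryptData(encrypted, ["name"], key);

    // Deep equality: expected to be true if the round trip is lossless.
    io:println(decrypted == dataset);
}
```

In practice the key would be loaded from a secret store or a configurable variable rather than hard-coded.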

extractFromText

function extractFromText(string sourceText, typedesc<record {}> returnType) returns returnType|Error

Extracts structured data from a raw text input and maps it to a Ballerina record.


type Review record {|
    string goodPoints;
    string badPoints;
    string improvements;
|};
string reviews = "The smartphone has an impressive camera and smooth performance, making it great for photography and gaming. However, the battery drains quickly, and the charging speed could be improved. The UI is intuitive, but some features feel outdated and need a refresh.";
Review extractedDetails = check etl:extractFromText(reviews);
=> { goodPoints: "The smartphone has an impressive camera and smooth performance, making it great for photography and gaming.",
     badPoints: "However, the battery drains quickly, and the charging speed could be improved.",
     improvements: "The UI is intuitive, but some features feel outdated and need a refresh." }

Parameters

sourceText string - The raw text input from which structured data is to be extracted.

returnType typedesc<record {}> (default <>) - The type of the return value (Ballerina record).

Return Type

returnType|Error - A record with extracted details mapped to the specified field names or an etl:Error.

filterDataByRatio

function filterDataByRatio(record {}[] dataset, float ratio, typedesc<record {}> returnType) returns returnType[]|Error

Filters a random set of records from a dataset based on a specified ratio.


Customer[] dataset = [
    { id: 1, name: "Alice" },
    { id: 2, name: "Bob" },
    { id: 3, name: "Charlie" },
    { id: 4, name: "David" }
];
Customer[] filteredDataset = check etl:filterDataByRatio(dataset, 0.75);
=> [{ id: 4, name: "David" }, { id: 2, name: "Bob" }, { id: 3, name: "Charlie" }]

Parameters

dataset record {}[] - Array of records to sample from.

ratio float - Fraction of records to keep (e.g., 0.75 keeps 75% of the records).

returnType typedesc<record {}> (default <>) - The type of the return value (Ballerina record array).

Return Type

returnType[]|Error - Filtered dataset containing a random subset of records or an etl:Error.

filterDataByRegex

function filterDataByRegex(record {}[] dataset, string fieldName, RegExp regexPattern, typedesc<record {}> returnType) returns returnType[]|Error

Filters a dataset based on a regex pattern match.


Customer[] dataset = [
    { id: 1, city: "New York" },
    { id: 2, city: "Los Angeles" },
    { id: 3, city: "San Francisco" }
];
string fieldName = "city";
regexp:RegExp regexPattern = re `^New.*$`;
Customer[] filteredDataset = check etl:filterDataByRegex(dataset, fieldName, regexPattern);
=> [{ id: 1, city: "New York" }]

Parameters

dataset record {}[] - Array of records to be filtered.

fieldName string - Name of the field to apply the regex filter.

regexPattern RegExp - Regular expression to match values in the field.

returnType typedesc<record {}> (default <>) - The type of the return value (Ballerina record array).

Return Type

returnType[]|Error - Filtered dataset containing records that match the regex pattern or an etl:Error.

filterDataByRelativeExp

function filterDataByRelativeExp(record {}[] dataset, string fieldName, Operation operation, float value, typedesc<record {}> returnType) returns returnType[]|Error

Filters a dataset based on a relative numeric comparison expression.


Customer[] dataset = [
    { id: 1, name: "Alice", age: 25 },
    { id: 2, name: "Bob", age: 30 },
    { id: 3, name: "Charlie", age: 22 },
    { id: 4, name: "David", age: 28 }
];
Customer[] filteredDataset = check etl:filterDataByRelativeExp(dataset, "age", etl:GREATER_THAN, 25);
=> [{ id: 2, name: "Bob", age: 30 }, { id: 4, name: "David", age: 28 }]

Parameters

dataset record {}[] - Array of records containing numeric fields for comparison.

fieldName string - Name of the field to evaluate.

operation Operation - Comparison operation to apply as etl:Operation.

value float - Numeric value to compare against.

returnType typedesc<record {}> (default <>) - The type of the return value (Ballerina record array).

Return Type

returnType[]|Error - Filtered dataset containing records that match the comparison or an etl:Error.

groupApproximateDuplicates

function groupApproximateDuplicates(record {}[] dataset, typedesc<record {}> returnType) returns returnType[][]|Error

Identifies and groups approximate duplicates in a dataset, returning a nested array with unique records first, followed by groups of similar records.


Customer[] dataset = [
    { name: "Alice", city: "New York" },
    { name: "Alice", city: "new york" },
    { name: "Bob", city: "Boston" },
    { name: "Charlie", city: "Los Angeles" },
    { name: "Charlie", city: "los angeles - usa" },
    { name: "John", city: "Chicago" }
];
Customer[][] result = check etl:groupApproximateDuplicates(dataset);
=> [[{ name: "Bob", city: "Boston" },{ name: "John", city: "Chicago" }],
    [{ name: "Alice", city: "New York" },{ name: "Alice", city: "new york" }],
    [{ name: "Charlie", city: "Los Angeles" },{ name: "Charlie", city: "los angeles - usa" }]]

Parameters

dataset record {}[] - Array of records that may contain approximate duplicates.

returnType typedesc<record {}> (default <>) - The type of the return value (Ballerina record).

Return Type

returnType[][]|Error - A nested array of records where the first array contains all unique records that do not have any duplicates, and the remaining arrays contain duplicate groups or an etl:Error.

handleWhiteSpaces

function handleWhiteSpaces(record {}[] dataset, typedesc<record {}> returnType) returns returnType[]|Error

Returns a new dataset with all extra whitespace removed from string fields.


Customer[] dataset = [
    { name: "  Alice  ", city: "  New   York  " },
    { name: "  Bob  ", city: "  Los Angeles  " },
    { name: "  Charlie  ", city: "  Chicago  " }
];
Customer[] cleanedData = check etl:handleWhiteSpaces(dataset);
=> [{ name: "Alice", city: "New York" },
    { name: "Bob", city: "Los Angeles" },
    { name: "Charlie", city: "Chicago" }]

Parameters

dataset record {}[] - Array of records with possible extra spaces.

returnType typedesc<record {}> (default <>) - The type of the return value (Ballerina record).

Return Type

returnType[]|Error - A dataset in which runs of spaces are collapsed to a single space and values are trimmed, or an etl:Error.

joinData

function joinData(record {}[] dataset1, record {}[] dataset2, string fieldName, typedesc<record {}> returnType) returns returnType[]|Error

Merges two datasets based on a common specified field and returns a new dataset with the merged records.


CustomerPersonalDetails[] dataset1 = [{ id: 1, name: "Alice" }, { id: 2, name: "Bob" }];
CustomerContactDetails[] dataset2 = [{ id: 1, phone: "0123456789" }, { id: 2, phone: "0987654321" }];
Customer[] mergedData = check etl:joinData(dataset1, dataset2, "id");
=> [{ id: 1, name: "Alice", phone: "0123456789" },
    { id: 2, name: "Bob", phone: "0987654321" }]

Parameters

dataset1 record {}[] - First dataset containing base records.

dataset2 record {}[] - Second dataset with additional data to be merged.

fieldName string - The field used to match records between the datasets.

returnType typedesc<record {}> (default <>) - The type of the return value (Ballerina record).

Return Type

returnType[]|Error - A merged dataset with updated records or an etl:Error.

maskSensitiveData

function maskSensitiveData(record {}[] dataset, Char maskingCharacter, typedesc<record {}> returnType) returns returnType[]|Error

Returns a new dataset with PII (Personally Identifiable Information) fields masked using a specified character.


Customer[] dataset = [
    { id: 1, name: "John Doe", email: "john@example.com" },
    { id: 2, name: "Jane Smith", email: "jane@example.com" }
];
MaskedCustomer[] maskedData = check etl:maskSensitiveData(dataset);
=> [{ id: 1, name: "XXX XXX", email: "XXXXXXXXXXXXXXX" },
    { id: 2, name: "XXXX XXXX", email: "XXXXXXXXXXXXXXX" }]

Parameters

dataset record {}[] - The dataset containing records where sensitive fields should be masked.

maskingCharacter Char (default "X") - The character to use for masking sensitive fields.

returnType typedesc<record {}> (default <>) - The type of the return value (Ballerina record).

Return Type

returnType[]|Error - A dataset in which fields containing PII are masked with the given masking character, or an etl:Error.

mergeData

function mergeData(record {}[][] datasets, typedesc<record {}> returnType) returns returnType[]|Error

Merges multiple datasets into a single dataset by flattening a nested array of records.


Customer[][] dataSets = [
    [{ id: 1, name: "Alice" }, { id: 2, name: "Bob" }],
    [{ id: 3, name: "Charlie" }, { id: 4, name: "David" }]
];
Customer[] mergedData = check etl:mergeData(dataSets);
=> [{ id: 1, name: "Alice" },
    { id: 2, name: "Bob" },
    { id: 3, name: "Charlie" },
    { id: 4, name: "David" }]

Parameters

datasets record {}[][] - An array of datasets, where each dataset is an array of records.

returnType typedesc<record {}> (default <>) - The type of the return value (Ballerina record).

Return Type

returnType[]|Error - A single merged dataset containing all records or an etl:Error.

removeDuplicates

function removeDuplicates(record {}[] dataset, typedesc<record {}> returnType) returns returnType[]|Error

Returns a new dataset with all duplicate records removed.


Customer[] dataset = [
    { name: "Alice", city: "New York" },
    { name: "Alice", city: "New York" },
    { name: "Bob", city: "Los Angeles" },
    { name: "Charlie", city: "Chicago" },
    { name: "Charlie", city: "Chicago" }
];
Customer[] uniqueData = check etl:removeDuplicates(dataset);
=> [{ name: "Alice", city: "New York" },
    { name: "Bob", city: "Los Angeles" },
    { name: "Charlie", city: "Chicago" }]

Parameters

dataset record {}[] - Array of records that may contain duplicates.

returnType typedesc<record {}> (default <>) - The type of the return value (Ballerina record).

Return Type

returnType[]|Error - A dataset with duplicates removed or an etl:Error.

removeEmptyValues

function removeEmptyValues(record {}[] dataset, typedesc<record {}> returnType) returns returnType[]|Error

Returns a new dataset with all records containing nil or empty string values removed.


Customer[] dataset = [
    { name: "Alice", city: "New York" },
    { name: "Bob", city: null },
    { name: "", city: "Los Angeles" },
    { name: "Charlie", city: "Boston" },
    { name: "David", city: () }
];
NewCustomer[] filteredData = check etl:removeEmptyValues(dataset);
=> [{ name: "Alice", city: "New York" },
    { name: "Charlie", city: "Boston" }]

Parameters

dataset record {}[] - Array of records containing potential null or empty fields.

returnType typedesc<record {}> (default <>) - The type of the return value (Ballerina record).

Return Type

returnType[]|Error - A dataset with records containing nil or empty string values removed or an etl:Error.

removeField

function removeField(record {}[] dataset, string fieldName, typedesc<record {}> returnType) returns returnType[]|Error

Returns a new dataset with a specified field removed from each record.


Customer[] dataset = [
    { name: "Alice", city: "New York", age: 25 },
    { name: "Bob", city: "Los Angeles", age: 30 },
    { name: "Charlie", city: "Chicago", age: 35 }
];
NewCustomer[] updatedData = check etl:removeField(dataset, "age");
=> [{ name: "Alice", city: "New York" },
    { name: "Bob", city: "Los Angeles" },
    { name: "Charlie", city: "Chicago" }]

Parameters

dataset record {}[] - Array of records with fields to be removed.

fieldName string - The name of the field to remove from each record.

returnType typedesc<record {}> (default <>) - The type of the return value (Ballerina record).

Return Type

returnType[]|Error - A new dataset with the specified field removed from each record or an etl:Error.

replaceText

function replaceText(record {}[] dataset, string fieldName, RegExp searchValue, string replaceValue, typedesc<record {}> returnType) returns returnType[]|Error

Returns a new dataset where matches of the given regex pattern in a specified string field are replaced with a new value.


Customer[] dataset = [
    { name: "Alice", city: "New York" },
    { name: "Bob", city: "Los Angeles" },
    { name: "Charlie", city: "Chicago" }
];
Customer[] updatedData = check etl:replaceText(dataset, "city", re `New York`, "San Francisco");
=> [{ name: "Alice", city: "San Francisco" },
    { name: "Bob", city: "Los Angeles" },
    { name: "Charlie", city: "Chicago" }]

Parameters

dataset record {}[] - Array of records where text in a specified field will be replaced.

fieldName string - The name of the field where text replacement will occur.

searchValue RegExp - A regular expression to match text that will be replaced.

replaceValue string - The value that will replace the matched text.

returnType typedesc<record {}> (default <>) - The type of the return value (Ballerina record).

Return Type

returnType[]|Error - A new dataset with the replaced text in the specified field or an etl:Error.

sortData

function sortData(record {}[] dataset, string fieldName, SortDirection direction, typedesc<record {}> returnType) returns returnType[]|Error

Returns a new dataset sorted by a specified field in ascending or descending order.


Customer[] dataset = [
    { name: "Alice", age: 25 },
    { name: "Bob", age: 30 },
    { name: "Charlie", age: 22 }
];
Customer[] sortedData = check etl:sortData(dataset, "age");
=> [{ name: "Charlie", age: 22 },
    { name: "Alice", age: 25 },
    { name: "Bob", age: 30 }]

Parameters

dataset record {}[] - Array of records to be sorted.

fieldName string - The field by which sorting is performed.

direction SortDirection (default ASCENDING) - The direction in which to sort the data.

returnType typedesc<record {}> (default <>) - The type of the return value (Ballerina record).

Return Type

returnType[]|Error - A sorted dataset based on the specified field or an etl:Error.

standardizeData

function standardizeData(record {}[] dataset, string fieldName, string[] standardValues, typedesc<record {}> returnType) returns returnType[]|Error

Returns a new dataset with all string values in a specified field standardized to a set of standard values.


Customer[] dataset = [
    { name: "Alice", city: "New York" },
    { name: "Bob", city: "new york" },
    { name: "Charlie", city: "los-angeles" },
    { name: "John", city: "newyork -usa" }
];
Customer[] standardizedData = check etl:standardizeData(dataset, "city", ["New York", "Los Angeles"]);
=> [{ name: "Alice", city: "New York" },
    { name: "Bob", city: "New York" },
    { name: "Charlie", city: "Los Angeles" },
    { name: "John", city: "New York" }]

Parameters

dataset record {}[] - Array of records containing string values to be standardized.

fieldName string - The name of the field to standardize.

standardValues string[] - An array of standard values to replace approximate matches.

returnType typedesc<record {}> (default <>) - The type of the return value (Ballerina record).

Return Type

returnType[]|Error - An updated dataset with standardized string values or an etl:Error.

Enums

etl: Model

Represents the supported OpenAI GPT models.

Members

GPT_4_TURBO - GPT-4 Turbo model

GPT_4O - GPT-4o model

GPT_4O_MINI - GPT-4o mini model

etl: Operation

Represents the available comparison operations for the filterDataByRelativeExp API.

Members

GREATER_THAN - Checks if the left operand is greater than the right operand.

LESS_THAN - Checks if the left operand is less than the right operand.

EQUAL - Checks if the left and right operands are equal.

NOT_EQUAL - Checks if the left and right operands are not equal.

GREATER_THAN_OR_EQUAL - Checks if the left operand is greater than or equal to the right operand.

LESS_THAN_OR_EQUAL - Checks if the left operand is less than or equal to the right operand.
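As a brief illustration of a member other than GREATER_THAN, the sketch below applies LESS_THAN_OR_EQUAL to the dataset from the filterDataByRelativeExp example. It assumes the field value is the left operand and the `value` argument the right operand, as the member descriptions suggest; the `Customer` type shape is assumed.

```ballerina
import ballerina/etl;
import ballerina/io;

type Customer record {|
    int id;
    string name;
    int age;
|};

public function main() returns error? {
    Customer[] dataset = [
        {id: 1, name: "Alice", age: 25},
        {id: 2, name: "Bob", age: 30},
        {id: 3, name: "Charlie", age: 22},
        {id: 4, name: "David", age: 28}
    ];

    // Keep records whose age is less than or equal to 25.
    Customer[] upTo25 = check etl:filterDataByRelativeExp(dataset, "age", etl:LESS_THAN_OR_EQUAL, 25);

    io:println(upTo25);
}
```

Under the stated assumption, this should retain the records for Alice (25) and Charlie (22).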

etl: SortDirection

Represents the sort direction for the sortData API.

Members

ASCENDING - Sorts the data in ascending order.

DESCENDING - Sorts the data in descending order.
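The sortData example earlier relies on the default ASCENDING direction. A minimal sketch of the descending case, assuming the same `Customer` shape as in that example, might look like this:

```ballerina
import ballerina/etl;
import ballerina/io;

type Customer record {|
    string name;
    int age;
|};

public function main() returns error? {
    Customer[] dataset = [
        {name: "Alice", age: 25},
        {name: "Bob", age: 30},
        {name: "Charlie", age: 22}
    ];

    // Pass the direction explicitly to sort from the highest age to the lowest.
    Customer[] oldestFirst = check etl:sortData(dataset, "age", etl:DESCENDING);

    io:println(oldestFirst);
}
```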

Configurables

etl: modelConfig

record { string openAiToken; decimal timeout; Model model; } (default { openAiToken: "", timeout: 60, model: GPT_4O_MINI })

Errors

etl: Error

Distinct

Represents ETL module related errors.
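The examples in this reference use `check`, which propagates any failure to the caller. Where a failure should be handled locally instead, the result can be tested against etl:Error. The sketch below shows one way to do that; which conditions actually produce an etl:Error is not specified here, so the message text is whatever the package returns.

```ballerina
import ballerina/etl;
import ballerina/io;

type Customer record {|
    string name;
    string city;
|};

public function main() {
    Customer[] dataset = [
        {name: "Alice", city: "New York"},
        {name: "Alice", city: "New York"}
    ];

    // Keep the union type instead of using `check`; returnType is passed explicitly.
    Customer[]|etl:Error result = etl:removeDuplicates(dataset, Customer);

    if result is etl:Error {
        // Handle the failure locally, for example by logging and returning.
        io:println("ETL operation failed: ", result.message());
        return;
    }

    // After the early return, result is narrowed to Customer[].
    io:println("Unique records: ", result.length());
}
```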

Tuple types

etl: CategoryRanges

[float, float[], float]

CategoryRanges

Represents the category ranges in the categorizeNumeric API

float - Represents the minimum value.
float[] - Represents the intermediate breakpoints.
float - Represents the maximum value.
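To make the tuple layout concrete, the sketch below expands a CategoryRanges value into consecutive lower and upper bounds, one pair per category. The helper `toIntervals` is hypothetical and not part of the package, and the local `CategoryRanges` type only mirrors the definition above. Whether each boundary value falls into the lower or the upper category is not stated in this reference, so the sketch does not decide that.

```ballerina
import ballerina/io;

// Local mirror of etl:CategoryRanges, declared only to keep the sketch self-contained.
type CategoryRanges [float, float[], float];

// Hypothetical helper: expands [min, breakpoints, max] into consecutive [lower, upper] pairs.
function toIntervals(CategoryRanges ranges) returns [float, float][] {
    // Flatten the tuple into one ordered list of boundaries.
    float[] bounds = [ranges[0]];
    bounds.push(...ranges[1]);
    bounds.push(ranges[2]);

    // Pair each boundary with its successor.
    [float, float][] intervals = [];
    int lastIndex = bounds.length() - 1;
    foreach int i in 0 ..< lastIndex {
        intervals.push([bounds[i], bounds[i + 1]]);
    }
    return intervals;
}

public function main() {
    // Same ranges as in the categorizeNumeric example: 0 to 10, 10 to 20 and 20 to 30.
    io:println(toIntervals([0, [10, 20], 30]));
}
```

Two breakpoints therefore yield three categories, which matches the three groups in the categorizeNumeric example output.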

Import

import ballerina/etl;

Metadata

Released date: 11 days ago

Version: 0.8.0

License: Apache-2.0

Compatibility

Platform: java21

Ballerina version: 2201.12.3

GraalVM compatible: Yes

Pull count

Total: 0

Current version: 5

Keywords

etl
