Module etl
ballerina/etl Ballerina library
Overview
This package provides a collection of APIs designed for data processing and manipulation, enabling seamless ETL workflows and supporting a variety of use cases.
The APIs in this package are categorized into the following ETL process stages:
- Data Categorization
- Data Cleaning
- Data Enrichment
- Data Filtering
- Data Security
- Unstructured Data Extraction
Features
Data Categorization
- categorizeNumeric: Categorizes a dataset based on a numeric field and specified ranges.
- categorizeRegexData: Categorizes a dataset based on a string field using a set of regular expressions.
- categorizeSemantic: Categorizes a dataset based on a string field using semantic classification.
Data Cleaning
- groupApproximateDuplicates: Identifies and groups approximate duplicates in a dataset, returning a nested array with unique records first, followed by groups of similar records.
- handleWhiteSpaces: Returns a new dataset with all extra whitespace removed from string fields.
- removeDuplicates: Returns a new dataset with all duplicate records removed.
- removeEmptyValues: Returns a new dataset with all records containing nil or empty string values removed.
- removeField: Returns a new dataset with a specified field removed from each record.
- replaceText: Returns a new dataset where matches of the given regex pattern in a specified string field are replaced with a new value.
- sortData: Returns a new dataset sorted by a specified field in ascending or descending order.
- standardizeData: Returns a new dataset with all string values in a specified field standardized to a set of standard values.
Data Enrichment
- joinData: Merges two datasets based on a common specified field and returns a new dataset with the merged records.
- mergeData: Merges multiple datasets into a single dataset by flattening a nested array of records.
Data Filtering
- filterDataByRatio: Filters a random set of records from a dataset based on a specified ratio.
- filterDataByRegex: Filters a dataset based on a regex pattern match.
- filterDataByRelativeExp: Filters a dataset based on a relative numeric comparison expression.
Data Security
- decryptData: Returns a new dataset with specified fields decrypted using AES-ECB decryption with a given symmetric key.
- encryptData: Returns a new dataset with specified fields encrypted using AES-ECB encryption with a given symmetric key.
- maskSensitiveData: Returns a new dataset with PII (Personally Identifiable Information) fields masked using a specified character.
Unstructured Data Extraction
- extractFromText: Extracts unstructured data from a string and maps it to a Ballerina record.
Usage
Configurations
The following APIs in this package use OpenAI services and require an OpenAI API key:
- categorizeSemantic
- extractFromText
- groupApproximateDuplicates
- maskSensitiveData
- standardizeData
Note: Configuration is required only for the APIs listed above. It is not needed for the use of any other APIs in this package.
Setting up the OpenAI API Key
- Create an OpenAI account and obtain an API key.
- Add the obtained API key and a supported GPT model to the Config.toml file as shown below:

    [ballerina.etl.modelConfig]
    openAiToken = "<OPENAI_API_KEY>"
    model = "<GPT_MODEL>"
Supported GPT Models
- "gpt-4-turbo"
- "gpt-4o"
- "gpt-4o-mini"
(Optional) Overriding Client Timeout
The default client timeout is set to 60 seconds. This value can be adjusted by specifying the timeout field as shown below:
    [ballerina.etl.modelConfig]
    openAiToken = "<OPENAI_API_KEY>"
    model = "<GPT_MODEL>"
    timeout = 120.0
Dependent Type Support
All APIs in this package support dependent types. Here is an example of how to use them:
    import ballerina/etl;
    import ballerina/io;

    type Customer record {|
        string name;
        string city;
    |};

    public function main() returns error? {
        Customer[] dataset = [
            {name: "Alice", city: "New York"},
            {name: "Bob", city: "Los Angeles"},
            {name: "Alice", city: "New York"}
        ];
        // The return type follows the expected type on the left-hand side (Customer[]).
        Customer[] uniqueData = check etl:removeDuplicates(dataset);
        io:println(`Customer Data Without Duplicates : ${uniqueData}`);
    }
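To make the dependent typing more visible, the following sketch calls the same function with two different record types. It assumes removeDuplicates behaves as in the example above; the Product and Purchase types and their fields are hypothetical and exist only for this illustration.

```ballerina
import ballerina/etl;
import ballerina/io;

// Hypothetical record types, introduced only for this illustration.
type Product record {|
    string sku;
    float price;
|};

type Purchase record {|
    int id;
    string status;
|};

public function main() returns error? {
    Product[] products = [
        {sku: "A-100", price: 9.99},
        {sku: "A-100", price: 9.99},
        {sku: "B-200", price: 24.5}
    ];
    Purchase[] purchases = [
        {id: 1, status: "shipped"},
        {id: 1, status: "shipped"}
    ];

    // The expected type on the left-hand side determines the element type
    // of the returned array, so no cast is written at either call site.
    Product[] uniqueProducts = check etl:removeDuplicates(products);
    Purchase[] uniquePurchases = check etl:removeDuplicates(purchases);

    io:println(uniqueProducts.length());
    io:println(uniquePurchases.length());
}
```

The point of the sketch is that one API serves both record types without a conversion step after the call.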
Examples
The ballerina/etl package provides practical examples illustrating its usage in various scenarios. Explore these examples, covering different use cases:
- Customer Data Processing - Processes customer data collected from various sources by extracting relevant information, cleaning and validating fields, enriching with additional metadata, and categorizing the data for downstream applications.
- Product Catalog Processing - Consolidates product catalog data from multiple sources by extracting and merging entries, encrypting sensitive fields, classifying products into relevant categories, and storing the structured data securely in a MySQL database for easy access and analysis.
- User Feedback Analysis - Handles raw user feedback by extracting and standardizing input, classifying comments based on content and sentiment, and storing the processed feedback for further analysis.
Import
    import ballerina/etl;