ballerinax/dataflowkit Ballerina library

1.5.1

Overview

This is a generated connector for the Dataflow Kit API v1.3 OpenAPI specification.
The Dataflow Kit API provides the capability to automate web scraping tasks, extract, process, and transform data from multiple pages at any scale.

Prerequisites

Before using this connector in your Ballerina application, complete the following:

Create a Dataflow Kit account.
Obtain tokens - Follow this guide.

Quickstart

To use the Dataflow Kit connector in your Ballerina application, update the .bal file as follows:

Step 1: Import connector

First, import the ballerinax/dataflowkit module into the Ballerina project.


import ballerinax/dataflowkit;

Step 2: Create a new connector instance

Create the connection configuration from your Dataflow Kit API key, then use it to initialize the client.


// The API key is read from the Ballerina configuration (for example, Config.toml).
configurable dataflowkit:ApiKeysConfig & readonly apiKeyConfig = ?;
dataflowkit:Client baseClient = check new (apiKeyConfig);

Step 3: Invoke connector operation

The following code demonstrates how to convert a URL to a PDF using the ballerinax/dataflowkit connector.


public function main() returns error? {
    string entity = check baseClient->urlToPdf({url: "https://dataflowkit.com/doc-api#section/Authentication"});
}

Clients

dataflowkit: Client

Isolated

This is a generated connector for the Dataflow Kit API v1.3 OpenAPI specification. Render JavaScript-driven pages, while we internally manage headless Chrome and proxies for you.

Build a custom web scraper with our Visual point-and-click toolkit.
Scrape the most popular Search engines result pages (SERP).
Convert web pages to PDF and capture screenshots.

Authentication

The Dataflow Kit API requires you to sign up for an API key in order to use the API.

The API key can be found in the DFK Dashboard after free registration.

Pass a secret API Key to all API requests to the server as the api_key query parameter.

Constructor

Gets invoked to initialize the connector. The connector initialization requires setting the API credentials. Create a Dataflow Kit account and obtain tokens by following this guide.

init (ApiKeysConfig apiKeyConfig, ConnectionConfig config, string serviceUrl)

apiKeyConfig ApiKeysConfig - API keys for authorization

config ConnectionConfig {} - The configurations to be used when initializing the connector

serviceUrl string "https://api.dataflowkit.com/v1" - URL of the target service
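
As a minimal sketch of the constructor described above, the client can be created with only the API key, or with an explicit ConnectionConfig. The configurable variable name apiKey, the helper function newClient, and the 120-second timeout are illustrative assumptions, not part of the connector.

```ballerina
import ballerinax/dataflowkit;

// Assumption: the key is supplied through the Ballerina configuration,
// for example as `apiKey = "<your key>"` in Config.toml.
configurable string apiKey = ?;

// Hypothetical helper that builds a client with a longer response timeout.
// serviceUrl is omitted, so the default service URL listed above is used.
function newClient() returns dataflowkit:Client|error {
    return new ({apiKey: apiKey}, {timeout: 120});
}
```

Passing only `{apiKey: apiKey}` leaves both config and serviceUrl at their defaults.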

urlToPdf

Isolated Function, Remote Function

function urlToPdf(Url2pdfrequest payload) returns string|error

Save web page as PDF

Parameters

payload Url2pdfrequest - URL to be converted

Return Type

string|error - A PDF file.

urlToScreenshot

Isolated Function, Remote Function

function urlToScreenshot(Url2screenshotrequest payload) returns string|error

Capture web page Screenshots.

Parameters

payload Url2screenshotrequest - URL to be converted

Return Type

string|error - Returns jpg or png file.
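
A hedged sketch of a screenshot call, assuming the API key is provided as a configurable named apiKey and using https://example.com as a placeholder target. It relies on the output field documented under Url2screenshotrequest: when set to file, the service is described as returning a link to the stored image instead of the image content.

```ballerina
import ballerina/io;
import ballerinax/dataflowkit;

// Assumption: supplied through Config.toml.
configurable string apiKey = ?;

public function main() returns error? {
    dataflowkit:Client dfk = check new ({apiKey: apiKey});

    // Capture the whole page and ask for a link rather than the image bytes.
    string link = check dfk->urlToScreenshot({
        url: "https://example.com",
        fullPage: true,
        output: "file"
    });
    io:println(link);
}
```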

fetch

Isolated Function, Remote Function

function fetch(Fetchrequest payload) returns json|error

Download web page content

Parameters

payload Fetchrequest - The Base fetcher type is the right choice for fetching server-side rendered pages. It takes fewer resources and works faster than rendering HTML with the Chrome fetcher.

Return Type

json|error - Returns utf8 encoded web page content.
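
The following sketch shows one way to call fetch with the Base fetcher, assuming a configurable apiKey and a placeholder URL. Note that the type field is written as 'type because type is a reserved word in Ballerina.

```ballerina
import ballerina/io;
import ballerinax/dataflowkit;

// Assumption: supplied through Config.toml.
configurable string apiKey = ?;

public function main() returns error? {
    dataflowkit:Client dfk = check new ({apiKey: apiKey});

    // "base" selects the Base fetcher; use "chrome" for pages that need rendering.
    json content = check dfk->fetch({'type: "base", url: "https://example.com"});
    io:println(content);
}
```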

parse

Isolated Function, Remote Function

function parse(Parserequest payload) returns json|error

Extract structured data from web pages

Parameters

payload Parserequest - Field types and attributes

Return Type

json|error - Returns data in one of the following formats: JSON, JSON Lines, CSV, MS Excel, XML
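
A rough sketch of a parse request, built from the Parserequest and Field records documented below. The collection name, CSS selectors, target URL, the format value "json", and the attribute name "text" are all assumptions made for illustration; check the Dataflow Kit API documentation for the accepted values.

```ballerina
import ballerina/io;
import ballerinax/dataflowkit;

// Assumption: supplied through Config.toml.
configurable string apiKey = ?;

public function main() returns error? {
    dataflowkit:Client dfk = check new ({apiKey: apiKey});

    dataflowkit:Parserequest payload = {
        name: "example-collection",
        format: "json", // assumed format identifier
        commonParent: "div.item", // hypothetical block selector
        fields: [
            // 'type 1 is the text selector type per the Field record.
            {name: "title", selector: "h2", 'type: 1, attrs: ["text"]}
        ],
        request: {'type: "base", url: "https://example.com"}
    };

    json result = check dfk->parse(payload);
    io:println(result);
}
```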

serp

Isolated Function, Remote Function

function serp(Serprequest payload) returns json|error

Collect search results from search engines

Parameters

payload Serprequest - Search parameters

Return Type

json|error - Returns data in one of the following formats: JSON, JSON Lines, CSV, MS Excel, XML
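
A minimal sketch of a SERP request, assuming a configurable apiKey. It follows the Serprequest field notes: the chrome fetcher type and a proxy are always specified, with country-any selecting random geo-targets. The collection name, search URL, and the format value "json" are illustrative assumptions.

```ballerina
import ballerina/io;
import ballerinax/dataflowkit;

// Assumption: supplied through Config.toml.
configurable string apiKey = ?;

public function main() returns error? {
    dataflowkit:Client dfk = check new ({apiKey: apiKey});

    json results = check dfk->serp({
        name: "example-serp",
        format: "json", // assumed format identifier
        proxy: "country-any",
        'type: "chrome",
        url: "https://www.google.com/search?q=ballerina+lang", // hypothetical query
        pageNum: 2 // crawl two result pages
    });
    io:println(results);
}
```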

Records

dataflowkit: ApiKeysConfig

Closed record

Provides API key configurations needed when communicating with a remote HTTP endpoint.

Fields

apiKey string - Represents API Key api_key

dataflowkit: ClientHttp1Settings

Closed record

Provides settings related to HTTP/1.x protocol.

Fields

keepAlive KeepAlive(default http:KEEPALIVE_AUTO) - Specifies whether to reuse a connection for multiple requests

chunking Chunking(default http:CHUNKING_AUTO) - The chunking behaviour of the request

proxy ProxyConfig? - Proxy server related options

dataflowkit: ConnectionConfig

Closed record

Provides a set of configurations for controlling the behaviours when communicating with a remote HTTP endpoint.

Fields

httpVersion HttpVersion(default http:HTTP_2_0) - The HTTP version understood by the client

http1Settings ClientHttp1Settings? - Configurations related to HTTP/1.x protocol

http2Settings ClientHttp2Settings? - Configurations related to HTTP/2 protocol

timeout decimal(default 60) - The maximum time to wait (in seconds) for a response before closing the connection

forwarded string(default "disable") - The choice of setting forwarded/x-forwarded header

poolConfig PoolConfiguration? - Configurations associated with request pooling

cache CacheConfig? - HTTP caching related configurations

compression Compression(default http:COMPRESSION_AUTO) - Specifies the way of handling compression (accept-encoding) header

circuitBreaker CircuitBreakerConfig? - Configurations associated with the behaviour of the Circuit Breaker

retryConfig RetryConfig? - Configurations associated with retrying

responseLimits ResponseLimitConfigs? - Configurations associated with inbound response size limits

secureSocket ClientSecureSocket? - SSL/TLS-related options

proxy ProxyConfig? - Proxy server related options

validation boolean(default true) - Enables the inbound payload validation functionality, which is provided by the constraint package. Enabled by default

dataflowkit: Fetchrequest

Fields

actions Action[](default []) - Use actions to automate manual workflows while rendering web pages. They simulate real-world human interaction with pages. (Chrome fetcher type only)

ignoreHTTPStatusErrCodes boolean? - The HTTP 200 OK success status response code indicates that the request has succeeded. Sometimes a server returns normal HTML content even with an erroneous Non-200 HTTP response status code. The IgnoreHTTPStatusCode option is useful when you need to force the return of HTML content. Defaults to "false."

initialCookies InitialCookie[](default []) - The "Initial Cookies" option is useful for crawling websites that require a login. The simplest solution to get an array of cookies for specific websites is to use a web browser "EditThisCookie" extension. Copy a cookie array with "EditThisCookie" and paste it into the "Initial cookie" field.

output string(default "buffer") - If set to file, the downloaded HTML content is uploaded to Dataflow Kit Storage first. Then the link to this file is returned. Otherwise, the downloaded content is returned in the response body.

proxy string? - Specify proxy by adding country ISO code to country- value to send requests through a proxy in the specified country. Use country-any to use random geo-targets.

'type string - If set to base, the Base fetcher is used for downloading web page content. Use chrome for fetching content with a headless Chrome browser. If omitted, the Base fetcher is used by default.

url string - Specify URL to download.

waitDelay decimal? - Specify a wait delay (in seconds). This may be useful if certain elements of the web site need to be rendered after the initial page load. (Chrome fetcher type only)

dataflowkit: Field

Fields

attrs string[] - A set of attributes to extract from a Field. Find more information about attributes

details Parserequest? - Details themselves represent an independent Parse request that extracts data from linked pages.

filters (record { name string }|record { name string, param string })[]? - Filters are used to pre-process text data during extraction.

name string - Field name is used to aggregate results.

selector string - Selector represents a CSS selector for data extraction within the given block.

'type int - Selector type. ( 0 - image, 1 - text, 2 - link)

dataflowkit: InitialCookie

Fields

domain string? -

expirationDate decimal? -

hostOnly boolean? -

httpOnly boolean? -

id decimal? -

name string? -

path string? -

sameSite string? -

secure boolean? -

session boolean? -

storeID string? -

value string? -

dataflowkit: Paginator

Fields

nextPageSelector string? -

pageNum int? -

dataflowkit: Parserequest

Fields

commonParent string? - Specifies common ancestor block for a set of fields used to extract data from a web page. (CSS Selector)

fields Field[] - Define a set of fields used to extract data from a web page. A Field represents a given chunk of extracted data from every block on each page.

format string - Extracted data is returned either in CSV, MS Excel, JSON, JSON(Lines) or XML format.

name string - Collection name.

paginator Paginator? -

path boolean(default false) - Path is a special parameter specifying navigation pages only. It collects information from detailed pages. No results from the current page return. Defaults to false.

request Fetchrequest? -

dataflowkit: ProxyConfig

Closed record

Proxy server configurations to be used with the HTTP client endpoint.

Fields

host string(default "") - Host name of the proxy server

port int(default 0) - Proxy server port

userName string(default "") - Proxy server username

password string(default "") - Proxy server password

dataflowkit: Serprequest

Fields

fields Field[]? - Specify CSS selectors (patterns) used to gather data from Search Engine Result Pages. Ready-to-use payloads for collecting search results from the most popular Search Engines are available. These payloads are customizable, though.

format string - Extracted data is returned either in CSV, MS Excel, JSON, JSON(Lines) or XML format.

name string - Collection name.

pageNum int(default 1) - Specify number of pages to crawl.

proxy string - Always specify a proxy for sending SERP requests. Add the chosen country ISO code to the country- value to send requests through a proxy in the specified country. Use country-any to use random geo-targets.

'type string - For SERP requests, always use the chrome type to fetch content with a headless Chrome browser.

url string - url holds the link to a Search Engine to use, and other optional parameters like languages or country.

dataflowkit: Url2pdfrequest

Fields

actions Action[](default []) - Use actions to automate manual workflows while rendering web pages. They simulate real-world human interaction with pages.

ignoreHTTPStatusErrCodes boolean? - The HTTP 200 OK success status response code indicates that the request has succeeded. Sometimes a server returns normal HTML content even with an erroneous Non-200 HTTP response status code. The IgnoreHTTPStatusCode option is useful when you need to force the return of HTML content. Defaults to "false."

initialCookies InitialCookie[](default []) - The "Initial Cookies" option is useful for crawling websites that require a login. The simplest solution to get an array of cookies for specific websites is to use a web browser "EditThisCookie" extension. Copy a cookie array with "EditThisCookie" and paste it into the "Initial cookie" field.

landscape boolean(default false) - Paper orientation. landscape = false means portrait orientation. Set landscape to true for landscape page orientation.

marginBottom decimal(default 0.4) - Bottom Margin of the PDF (in inches)

marginLeft decimal(default 0.4) - Left Margin of the PDF (in inches)

marginRight decimal(default 0.4) - Right Margin of the PDF (in inches)

marginTop decimal(default 0.4) - Top Margin of the PDF (in inches)

output string(default "buffer") - If set to file, the resulting PDF is uploaded to Dataflow Kit Storage first. Then the link to this file is returned. Otherwise, PDF content is returned in the response body.

pageRanges string? - Specify page ranges to convert. Defaults to the empty value, which means convert all pages.

paperSize string(default "A4") - Page size parameter consists of the most popular page formats.

printBackground boolean(default false) - Print background graphics in the PDF.

printHeaderFooter boolean(default false) - Print a header and footer consisting of the date, the name of the web page, the page URL, and the number of pages in the printed document.

proxy string? - Specify proxy by adding country ISO code to country- value to send requests through a proxy in the specified country. Use country-any to use random geo-targets.

scale decimal(default 1) - By default, PDF document content is generated according to dimensions of the original web page content. Using the scale parameter, you can specify a custom zoom factor from 0.1 to 5.0 of the webpage rendering.

url string - The full URL address (including HTTP/HTTPS) of a web page that you want to save as PDF

waitDelay decimal(default 0.5) - Specify a wait delay (in seconds). This may be useful if certain elements of the web site need to be rendered after the initial page load.

dataflowkit: Url2screenshotrequest

Fields

actions Action[](default []) - Use actions to automate manual workflows while rendering web pages. They simulate real-world human interaction with pages.

clipSelector string? - Captures a screenshot of specified CSS element on a web page.

format string(default "png") - Sets the Format of output image

fullPage boolean(default false) - Takes a screenshot of the full web page. It ignores the offsetx, offsety, width, and height argument values.

height int(default 600) - Rectangle height in device independent pixels (dip).

ignoreHTTPStatusErrCodes boolean? - The HTTP 200 OK success status response code indicates that the request has succeeded. Sometimes a server returns normal HTML content even with an erroneous Non-200 HTTP response status code. The IgnoreHTTPStatusCode option is useful when you need to force the return of HTML content. Defaults to "false."

initialCookies InitialCookie[](default []) - The "Initial Cookies" option is useful for crawling websites that require a login. The simplest solution to get an array of cookies for specific websites is to use a web browser "EditThisCookie" extension. Copy a cookie array with "EditThisCookie" and paste it into the "Initial cookie" field.

offsetx int(default 0) - X offset in device independent pixels (dip).

offsety int(default 0) - Y offset in device independent pixels (dip).

output string(default "buffer") - If set to file, the resulting screenshot is uploaded to Dataflow Kit Storage first. Then the link to this file is returned. Otherwise, the website screenshot is returned in the response body.

printBackground boolean(default false) - Print background graphics in the PDF.

proxy string? - Specify proxy by adding country ISO code to country- value to send requests through a proxy in the specified country. Use country-any to use random geo-targets.

quality int(default 80) - Sets the Quality of output image. Compression quality from range [0..100] (jpeg only).

scale decimal(default 1) - Image scale factor. range [0.1 .. 3]

url string - The full URL address (including HTTP/HTTPS) of a web page that you want to capture

waitDelay decimal(default 0.5) - Specify a wait delay (in seconds). This may be useful if certain elements of the web site need to be rendered after the initial page load.

width int(default 800) - Rectangle width in device independent pixels (dip).

Union types

dataflowkit: Action

record { ignoreIfNotPresent boolean, selector string, value string }|record { ignoreIfNotPresent boolean, selector string, value string }|record { ignoreIfNotPresent boolean, selector string, skipLastIteration boolean }|record { ignoreIfNotPresent boolean, selector string, skipLastIteration boolean }|record { ignoreIfNotPresent boolean, selector string, skipLastIteration boolean }|record { selector string }|record { selector string }|record { selector string }|record { waitDelay string }|record { script string }|record { actions Action[], times decimal }|record { skipLastIteration boolean }|record { scrollByPixels decimal, scrollingElementSelector string, selector string, times int }

Action

Import

import ballerinax/dataflowkit;

Metadata

Released date: about 2 years ago

Version: 1.5.1

License: Apache-2.0

Compatibility

Platform: any

Ballerina version: 2201.4.1

GraalVM compatible: Yes

Pull count

Total: 8

Current version: 8

Keywords

Website & App Building/Web Scraper

Cost/Freemium

Other versions

1.5.1

1.5.0 1.4.0 1.2.0 1.1.0

Dependencies

ballerina/url/2.2.3 ballerina/http/2.6.1
