dataflowkit
ballerinax/dataflowkit Ballerina library
Overview
This is a generated connector for the Dataflow Kit API v1.3 OpenAPI specification.
The Dataflow Kit API provides the capability to automate web scraping tasks, extract, process, and transform data from multiple pages at any scale.
Prerequisites
Before using this connector in your Ballerina application, complete the following:
- Create a Dataflow Kit account.
- Obtain tokens by following this guide.
Quickstart
To use the Dataflow Kit connector in your Ballerina application, update the .bal file as follows:
Step 1: Import connector
First, import the ballerinax/dataflowkit module into the Ballerina project.
import ballerinax/dataflowkit;
Step 2: Create a new connector instance
Create the connection configuration from your Dataflow Kit API key, then pass it to the client.
configurable dataflowkit:ApiKeysConfig & readonly apiKeyConfig = ?;
dataflowkit:Client baseClient = check new (apiKeyConfig);
Step 3: Invoke connector operation
The following code demonstrates how to convert a URL to a PDF using the ballerinax/dataflowkit connector.
public function main() returns error? {
    string entity = check baseClient->urlToPdf({url: "https://dataflowkit.com/doc-api#section/Authentication"});
}
Clients
dataflowkit: Client
This is a generated connector for the Dataflow Kit API v1.3 OpenAPI specification. Render JavaScript-driven pages, while we internally manage Headless Chrome and proxies for you.
- Build a custom web scraper with our Visual point-and-click toolkit.
- Scrape the most popular Search engines result pages (SERP).
- Convert web pages to PDF and capture screenshots.
Authentication
The Dataflow Kit API requires you to sign up for an API key in order to use the API.
The API key can be found in the DFK Dashboard after free registration.
Pass the secret API key with every API request to the server as the api_key query parameter.
Constructor
Gets invoked to initialize the connector.
The connector initialization requires setting the API credentials.
Create a Dataflow Kit account and obtain tokens by following this guide.
init (ApiKeysConfig apiKeyConfig, ConnectionConfig config, string serviceUrl)
- apiKeyConfig ApiKeysConfig - API keys for authorization
- config ConnectionConfig {} - The configurations to be used when initializing the connector
- serviceUrl string "https://api.dataflowkit.com/v1" - URL of the target service
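The constructor parameters above can be sketched as follows. This is a minimal example, not the connector's official sample: the configurable name apiKey, the helper newClient, and the 120 second timeout are assumptions chosen for illustration. Only the timeout is overridden; serviceUrl is left out so the documented default applies.

```ballerina
import ballerina/io;
import ballerinax/dataflowkit;

// Hypothetical configurable; supply it in Config.toml as: apiKey = "<your-api-key>"
configurable string apiKey = ?;

// Hypothetical helper: creates a client that waits up to 120 seconds for a response.
// All other ConnectionConfig fields keep their defaults, and serviceUrl falls back
// to https://api.dataflowkit.com/v1.
function newClient() returns dataflowkit:Client|error {
    dataflowkit:ConnectionConfig connectionConfig = {timeout: 120};
    return new ({apiKey: apiKey}, connectionConfig);
}

public function main() returns error? {
    dataflowkit:Client baseClient = check newClient();
    string pdf = check baseClient->urlToPdf({url: "https://example.com"});
    io:println(pdf.length());
}
```

A longer timeout may be worth considering for slow pages, since rendering happens on the remote service before the response is returned.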
urlToPdf
function urlToPdf(Url2pdfrequest payload) returns string|error
Save web page as PDF
Parameters
- payload Url2pdfrequest - URL to be converted
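As an illustration of the payload, the sketch below requests a landscape A4 PDF with background graphics and 0.5 inch margins. The helper buildPdfRequest and the configurable apiKey are hypothetical names; the field names come from the Url2pdfrequest record described later on this page.

```ballerina
import ballerina/io;
import ballerinax/dataflowkit;

// Hypothetical configurable; supply it in Config.toml.
configurable string apiKey = ?;

// Hypothetical helper: landscape A4 with background graphics and 0.5 inch margins.
// Fields that are not set here keep the defaults listed for Url2pdfrequest.
function buildPdfRequest(string url) returns dataflowkit:Url2pdfrequest {
    return {
        url: url,
        landscape: true,
        paperSize: "A4",
        printBackground: true,
        marginTop: 0.5,
        marginBottom: 0.5,
        marginLeft: 0.5,
        marginRight: 0.5
    };
}

public function main() returns error? {
    dataflowkit:Client baseClient = check new ({apiKey: apiKey});
    string result = check baseClient->urlToPdf(buildPdfRequest("https://example.com"));
    // With the default output of "buffer", the PDF content is in the response body.
    io:println(result.length());
}
```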
urlToScreenshot
function urlToScreenshot(Url2screenshotrequest payload) returns string|error
Capture web page Screenshots.
Parameters
- payload Url2screenshotrequest - URL to be converted
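A comparable sketch for screenshots is shown below. The helper buildScreenshotRequest and the configurable apiKey are hypothetical, and the format value "jpeg" is an assumption inferred from the quality field's "(jpeg only)" note rather than a confirmed enum value.

```ballerina
import ballerina/io;
import ballerinax/dataflowkit;

// Hypothetical configurable; supply it in Config.toml.
configurable string apiKey = ?;

// Hypothetical helper: a 1280x720 capture at quality 90.
function buildScreenshotRequest(string url) returns dataflowkit:Url2screenshotrequest {
    return {
        url: url,
        format: "jpeg", // assumed value; the documented default is "png"
        quality: 90,
        width: 1280,
        height: 720
    };
}

public function main() returns error? {
    dataflowkit:Client baseClient = check new ({apiKey: apiKey});
    string result = check baseClient->urlToScreenshot(buildScreenshotRequest("https://example.com"));
    io:println(result.length());
}
```

Setting fullPage to true instead would capture the whole page and, per the field description, ignore the offset, width and height values.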
fetch
function fetch(Fetchrequest payload) returns json|error
Download web page content
Parameters
- payload Fetchrequest - The Base fetcher type is the right choice for fetching server-side rendered pages. It takes fewer resources and works faster than rendering HTML with the Chrome fetcher.
Return Type
- json|error - Returns utf8 encoded web page content.
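The choice between the two fetcher types can be sketched as below, assuming a page that needs JavaScript rendering. The helper buildFetchRequest and the configurable apiKey are hypothetical names; the 'type values base and chrome come from the Fetchrequest field description.

```ballerina
import ballerina/io;
import ballerinax/dataflowkit;

// Hypothetical configurable; supply it in Config.toml.
configurable string apiKey = ?;

// Hypothetical helper: uses the Chrome fetcher when the page needs rendering,
// otherwise the lighter Base fetcher.
function buildFetchRequest(string url, boolean needsRendering) returns dataflowkit:Fetchrequest {
    if needsRendering {
        // waitDelay applies to the Chrome fetcher type only.
        return {'type: "chrome", url: url, waitDelay: 1.5};
    }
    return {'type: "base", url: url};
}

public function main() returns error? {
    dataflowkit:Client baseClient = check new ({apiKey: apiKey});
    json content = check baseClient->fetch(buildFetchRequest("https://example.com", true));
    io:println(content.toJsonString());
}
```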
parse
function parse(Parserequest payload) returns json|error
Extract structured data from web pages
Parameters
- payload Parserequest - Field types and attributes
Return Type
- json|error - Returns data in one of the following formats: JSON, JSON Lines, CSV, MS Excel, XML
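A parse request combines a fetch request with a set of fields. The sketch below is illustrative only: the helper buildParseRequest, the configurable apiKey, the CSS selectors, the format string "json" and the attrs value "text" are all assumptions and should be checked against the Dataflow Kit API documentation. The selector type 1 (text) comes from the Field record described later on this page.

```ballerina
import ballerina/io;
import ballerinax/dataflowkit;

// Hypothetical configurable; supply it in Config.toml.
configurable string apiKey = ?;

// Hypothetical helper: extracts one text field from every block matching commonParent.
function buildParseRequest(string url) returns dataflowkit:Parserequest {
    return {
        name: "titles",
        format: "json", // assumed format string
        commonParent: ".item", // assumed selector for the repeating block
        fields: [
            {
                name: "title",
                selector: "h2", // assumed selector inside each block
                'type: 1, // 1 - text
                attrs: ["text"] // assumed attribute name
            }
        ],
        request: {'type: "base", url: url}
    };
}

public function main() returns error? {
    dataflowkit:Client baseClient = check new ({apiKey: apiKey});
    json result = check baseClient->parse(buildParseRequest("https://example.com"));
    io:println(result.toJsonString());
}
```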
serp
function serp(Serprequest payload) returns json|error
Collect search results from search engines
Parameters
- payload Serprequest - Search parameters
Return Type
- json|error - Returns data in one of the following formats: JSON, JSON Lines, CSV, MS Excel, XML
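A SERP request might look like the sketch below. The helper buildSerpRequest, the configurable apiKey, the search URL and the format string "json" are assumptions for illustration; the chrome type and the country-any proxy value follow the Serprequest field descriptions.

```ballerina
import ballerina/io;
import ballerinax/dataflowkit;

// Hypothetical configurable; supply it in Config.toml.
configurable string apiKey = ?;

// Hypothetical helper: crawls the first two result pages of the given search URL.
function buildSerpRequest(string searchUrl) returns dataflowkit:Serprequest {
    return {
        name: "search-results",
        format: "json", // assumed format string
        'type: "chrome", // SERP requests should always use the chrome type
        proxy: "country-any", // random geo-target
        url: searchUrl,
        pageNum: 2
    };
}

public function main() returns error? {
    dataflowkit:Client baseClient = check new ({apiKey: apiKey});
    // Assumed search URL; any supported search engine link could be used instead.
    json results = check baseClient->serp(buildSerpRequest("https://www.google.com/search?q=ballerina"));
    io:println(results.toJsonString());
}
```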
Records
dataflowkit: ApiKeysConfig
Provides API key configurations needed when communicating with a remote HTTP endpoint.
Fields
- apiKey string - Represents the API key (api_key)
dataflowkit: ClientHttp1Settings
Provides settings related to HTTP/1.x protocol.
Fields
- keepAlive KeepAlive(default http:KEEPALIVE_AUTO) - Specifies whether to reuse a connection for multiple requests
- chunking Chunking(default http:CHUNKING_AUTO) - The chunking behaviour of the request
- proxy ProxyConfig? - Proxy server related options
dataflowkit: ConnectionConfig
Provides a set of configurations for controlling the behaviours when communicating with a remote HTTP endpoint.
Fields
- httpVersion HttpVersion(default http:HTTP_2_0) - The HTTP version understood by the client
- http1Settings ClientHttp1Settings? - Configurations related to HTTP/1.x protocol
- http2Settings ClientHttp2Settings? - Configurations related to HTTP/2 protocol
- timeout decimal(default 60) - The maximum time to wait (in seconds) for a response before closing the connection
- forwarded string(default "disable") - The choice of setting the forwarded/x-forwarded header
- poolConfig PoolConfiguration? - Configurations associated with request pooling
- cache CacheConfig? - HTTP caching related configurations
- compression Compression(default http:COMPRESSION_AUTO) - Specifies the way of handling the compression (accept-encoding) header
- circuitBreaker CircuitBreakerConfig? - Configurations associated with the behaviour of the Circuit Breaker
- retryConfig RetryConfig? - Configurations associated with retrying
- responseLimits ResponseLimitConfigs? - Configurations associated with inbound response size limits
- secureSocket ClientSecureSocket? - SSL/TLS-related options
- proxy ProxyConfig? - Proxy server related options
- validation boolean(default true) - Enables the inbound payload validation functionality, which is provided by the constraint package. Enabled by default
dataflowkit: Fetchrequest
Fields
- actions Action[](default []) - Use actions to automate manual workflows while rendering web pages. They simulate real-world human interaction with pages. (Chrome fetcher type only)
- ignoreHTTPStatusErrCodes boolean? - The HTTP 200 OK success status response code indicates that the request has succeeded. Sometimes a server returns normal HTML content even with an erroneous Non-200 HTTP response status code. The IgnoreHTTPStatusCode option is useful when you need to force the return of HTML content. Defaults to "false."
- initialCookies InitialCookie[](default []) - The "Initial Cookies" option is useful for crawling websites that require a login. The simplest solution to get an array of cookies for specific websites is to use a web browser "EditThisCookie" extension. Copy a cookie array with "EditThisCookie" and paste it into the "Initial cookie" field.
- output string(default "buffer") - If set to file, the content of downloaded HTML is uploaded to Dataflow Kit Storage first. Then the link to this file is returned. Otherwise, downloaded content is returned in the response body.
- proxy string? - Specify a proxy by adding a country ISO code to the country- value to send requests through a proxy in the specified country. Use country-any to use random geo-targets.
- 'type string - If set to base, the Base fetcher is used for downloading web page content. Use chrome for fetching content with a headless Chrome browser. If omitted, the base fetcher is used by default.
- url string - Specify URL to download.
- waitDelay decimal? - Specify a wait delay (in seconds). This may be useful if certain elements of the web site need to be rendered after the initial page load. (Chrome fetcher type only)
dataflowkit: Field
Fields
- attrs string[] - A set of attributes to extract from a Field. Find more information about attributes
- details Parserequest? - Details themselves represent an independent Parse request that extracts data from linked pages.
- name string - Field name is used to aggregate results.
- selector string - Selector represents a CSS selector for data extraction within the given block.
- 'type int - Selector type. ( 0 - image, 1 - text, 2 - link)
dataflowkit: InitialCookie
Fields
- domain string? -
- expirationDate decimal? -
- hostOnly boolean? -
- httpOnly boolean? -
- id decimal? -
- name string? -
- path string? -
- sameSite string? -
- secure boolean? -
- session boolean? -
- storeID string? -
- value string? -
dataflowkit: Paginator
Fields
- nextPageSelector string? -
- pageNum int? -
dataflowkit: Parserequest
Fields
- commonParent string? - Specifies common ancestor block for a set of fields used to extract data from a web page. (CSS Selector)
- fields Field[] - Define a set of fields used to extract data from a web page. A Field represents a given chunk of extracted data from every block on each page.
- format string - Extracted data is returned either in CSV, MS Excel, JSON, JSON(Lines) or XML format.
- name string - Collection name.
- paginator Paginator? -
- path boolean(default false) - Path is a special parameter specifying navigation pages only. It collects information from detailed pages. No results from the current page return. Defaults to false.
- request Fetchrequest? -
dataflowkit: ProxyConfig
Proxy server configurations to be used with the HTTP client endpoint.
Fields
- host string(default "") - Host name of the proxy server
- port int(default 0) - Proxy server port
- userName string(default "") - Proxy server username
- password string(default "") - Proxy server password
dataflowkit: Serprequest
Fields
- fields Field[]? - Specify CSS selectors (patterns) used to gather data from Search Engine Result Pages. Ready-to-use payloads for collecting search results from the most popular Search Engines are available. These payloads are customizable, though.
- format string - Extracted data is returned either in CSV, MS Excel, JSON, JSON(Lines) or XML format.
- name string - Collection name.
- pageNum int(default 1) - Specify number of pages to crawl.
- proxy string - Always specify a proxy for sending SERP requests. Add the chosen country ISO code to the country- value to send requests through a proxy in the specified country. Use country-any to use random geo-targets.
- 'type string - For SERP requests you should always use the chrome type to fetch content with a headless Chrome browser
- url string - url holds the link to a Search Engine to use, and other optional parameters like languages or country.
dataflowkit: Url2pdfrequest
Fields
- actions Action[](default []) - Use actions to automate manual workflows while rendering web pages. They simulate real-world human interaction with pages.
- ignoreHTTPStatusErrCodes boolean? - The HTTP 200 OK success status response code indicates that the request has succeeded. Sometimes a server returns normal HTML content even with an erroneous Non-200 HTTP response status code. The IgnoreHTTPStatusCode option is useful when you need to force the return of HTML content. Defaults to "false."
- initialCookies InitialCookie[](default []) - The "Initial Cookies" option is useful for crawling websites that require a login. The simplest solution to get an array of cookies for specific websites is to use a web browser "EditThisCookie" extension. Copy a cookie array with "EditThisCookie" and paste it into the "Initial cookie" field.
- landscape boolean(default false) - Paper orientation. Parameter landscape = false means portrait orientation. Set landscape to true for landscape page orientation.
- marginBottom decimal(default 0.4) - Bottom Margin of the PDF (in inches)
- marginLeft decimal(default 0.4) - Left Margin of the PDF (in inches)
- marginRight decimal(default 0.4) - Right Margin of the PDF (in inches)
- marginTop decimal(default 0.4) - Top Margin of the PDF (in inches)
- output string(default "buffer") - If set to file, the resulting PDF is uploaded to Dataflow Kit Storage first. Then the link to this file is returned. Otherwise, PDF content is returned in the response body.
- pageRanges string? - Specify page ranges to convert. Defaults to the empty value, which means convert all pages.
- paperSize string(default "A4") - Page size parameter consists of the most popular page formats.
- printBackground boolean(default false) - Print background graphics in the PDF.
- printHeaderFooter boolean(default false) - The printed header and footer consist of the date, the name of the web page, the page URL, and the number of pages in the document you are printing.
- proxy string? - Specify a proxy by adding a country ISO code to the country- value to send requests through a proxy in the specified country. Use country-any to use random geo-targets.
- scale decimal(default 1) - By default, PDF document content is generated according to dimensions of the original web page content. Using the scale parameter, you can specify a custom zoom factor from 0.1 to 5.0 of the webpage rendering.
- url string - The full URL address (including HTTP/HTTPS) of a web page that you want to save as PDF
- waitDelay decimal(default 0.5) - Specify a wait delay (in seconds). This may be useful if certain elements of the web site need to be rendered after the initial page load.
dataflowkit: Url2screenshotrequest
Fields
- actions Action[](default []) - Use actions to automate manual workflows while rendering web pages. They simulate real-world human interaction with pages.
- clipSelector string? - Captures a screenshot of specified CSS element on a web page.
- format string(default "png") - Sets the Format of output image
- fullPage boolean(default false) - Takes a screenshot of a full web page. It ignores the offsetx, offsety, width and height argument values.
- height int(default 600) - Rectangle height in device independent pixels (dip).
- ignoreHTTPStatusErrCodes boolean? - The HTTP 200 OK success status response code indicates that the request has succeeded. Sometimes a server returns normal HTML content even with an erroneous Non-200 HTTP response status code. The IgnoreHTTPStatusCode option is useful when you need to force the return of HTML content. Defaults to "false."
- initialCookies InitialCookie[](default []) - The "Initial Cookies" option is useful for crawling websites that require a login. The simplest solution to get an array of cookies for specific websites is to use a web browser "EditThisCookie" extension. Copy a cookie array with "EditThisCookie" and paste it into the "Initial cookie" field.
- offsetx int(default 0) - X offset in device independent pixels (dip).
- offsety int(default 0) - Y offset in device independent pixels (dip).
- output string(default "buffer") - If set to file, the resulting screenshot is uploaded to Dataflow Kit Storage first. Then the link to this file is returned. Otherwise, the web site screenshot is returned in the response body.
- printBackground boolean(default false) - Print background graphics in the PDF.
- proxy string? - Specify a proxy by adding a country ISO code to the country- value to send requests through a proxy in the specified country. Use country-any to use random geo-targets.
- quality int(default 80) - Sets the Quality of output image. Compression quality from range [0..100] (jpeg only).
- scale decimal(default 1) - Image scale factor. range [0.1 .. 3]
- url string - The full URL address (including HTTP/HTTPS) of a web page that you want to capture
- waitDelay decimal(default 0.5) - Specify a wait delay (in seconds). This may be useful if certain elements of the web site need to be rendered after the initial page load.
- width int(default 800) - Rectangle width in device independent pixels (dip).
Union types
dataflowkit: Action
Action
Import
import ballerinax/dataflowkit;
Metadata
Released date: over 1 year ago
Version: 1.5.1
License: Apache-2.0
Compatibility
Platform: any
Ballerina version: 2201.4.1
GraalVM compatible: Yes
Pull count
Total: 0
Current version: 4
Keywords
Website & App Building/Web Scraper
Cost/Freemium