Developing Client Applications with Apache Arrow Flight
You can create client applications that use Arrow Flight to query data lakes at data-transfer speeds greater than those possible with ODBC and JDBC, without incurring the cost in time and CPU resources of deserializing data. As the volume of transferred data grows, the performance benefit of using Arrow Flight rather than ODBC or JDBC grows as well.
You can run queries on datasets that are in the default project of a Dremio Cloud organization. Dremio Cloud is able to determine the organization and the default project from the authentication token that a Flight client uses. To query datasets in a non-default project, you can pass in the ID for the non-default project.
Dremio Cloud provides these endpoints for Arrow Flight connections:
- In the US control plane: data.dremio.cloud:443
- In the EU control plane: data.eu.dremio.cloud:443
All traffic within a control plane between Flight clients and Dremio Cloud goes through the endpoint for that control plane. However, Dremio Cloud can scale up or down automatically to accommodate increasing and decreasing traffic on the endpoint.
Unless you pass in a different project ID, Arrow Flight clients run queries only against datasets that are in the default project or on data sources that are associated with the default project. By default, Dremio Cloud uses the oldest project in an organization as that organization's default project.
Organization administrators can specify which project to use as the default project. See "Setting the Default Project" in Managing Projects.
Supported Versions of Apache Arrow
Dremio Cloud supports client applications that use Arrow Flight in Apache Arrow version 6.0.
Supported Authentication Method
Client applications can authenticate to Dremio Cloud with personal access tokens (PATs). To create a PAT, follow the steps in the section Creating a Token.
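As a minimal sketch, a PyArrow client might combine one of the endpoints listed earlier with a PAT as shown below. The helper names (flight_location, bearer_headers, run_query) are illustrative and not part of any Dremio API. run_query imports pyarrow lazily, so the two helper functions can be tried without PyArrow installed. The sketch does not replay session cookies, so it should be read as an outline of the calls involved rather than a complete client.

```python
# Endpoints from this document, keyed by control plane.
ENDPOINTS = {
    "us": ("data.dremio.cloud", 443),
    "eu": ("data.eu.dremio.cloud", 443),
}


def flight_location(control_plane: str = "us") -> str:
    """Build a TLS-enabled gRPC URI for the chosen control plane."""
    host, port = ENDPOINTS[control_plane]
    return f"grpc+tls://{host}:{port}"


def bearer_headers(pat: str) -> list:
    """Return the call headers that carry the PAT as a bearer token."""
    return [(b"authorization", f"Bearer {pat}".encode("utf-8"))]


def run_query(sql: str, pat: str, control_plane: str = "us"):
    """Run one query and return the results as a pyarrow.Table."""
    from pyarrow import flight  # imported lazily; requires the pyarrow package

    client = flight.FlightClient(flight_location(control_plane))
    options = flight.FlightCallOptions(headers=bearer_headers(pat))
    # Ask for FlightInfo describing where and how to fetch the results.
    descriptor = flight.FlightDescriptor.for_command(sql)
    info = client.get_flight_info(descriptor, options)
    # Fetch the result stream using the ticket of the first endpoint in FlightInfo.
    reader = client.do_get(info.endpoints[0].ticket, options)
    return reader.read_all()
```

With a valid token, a call such as run_query("SELECT 1", pat, "eu") would target the EU control plane. Keeping the token out of source code (for example, reading it from an environment variable) is a reasonable precaution.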
Flight Sessions
A Flight session has a duration of 120 minutes during which a Flight client interacts with Dremio Cloud. A Flight client initiates a new session by passing a getFlightInfo() request that does not include a Cookie header that specifies a session ID that was obtained from Dremio Cloud. All requests that pass the same session ID are considered to be in the same session.
- The Flight client, having obtained a PAT from Dremio Cloud, sends a getFlightInfo() request that includes the query to run, the URI for the endpoint, and the bearer token (PAT). A single bearer token can be used for requests until it expires.
- If Dremio Cloud is able to authenticate the Flight client by using the bearer token, it sends a response that includes FlightInfo, a Set-Cookie header with the session ID, the bearer token, and a Set-Cookie header with the ID of the default project in the organization.
  FlightInfo responses from Dremio Cloud include the single endpoint for the control plane being used and the ticket for that endpoint. There is only one endpoint listed in FlightInfo responses.
  Session IDs are generated by Dremio Cloud.
- The client sends a getStream() request that includes the ticket, a Cookie header for the session ID, the bearer token, and a Cookie header for the ID of the default project.
- Dremio Cloud returns the query results in one flight.
- The Flight client sends another getFlightInfo() request using the same session ID and bearer token. If this second request did not include the session ID that Dremio Cloud sent in response to the first request, then Dremio Cloud would send a new session ID and a new session would begin.
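The cookie handling in the steps above can be sketched with the Python standard library alone. The class below mirrors the received_headers and sending_headers hooks of pyarrow.flight.ClientMiddleware but does not subclass it, so it runs without PyArrow installed. The class name, the cookie names session_id and project_id, and the sample values are assumptions made for illustration; they are not necessarily what Dremio Cloud sends.

```python
from http.cookies import SimpleCookie


class SessionCookieJar:
    """Stores cookies from Set-Cookie headers and replays them as a Cookie header."""

    def __init__(self):
        self.cookies = {}  # cookie name -> cookie value

    def received_headers(self, headers):
        """Record every cookie found in a response's Set-Cookie headers."""
        for key, values in headers.items():
            if key.lower() != "set-cookie":
                continue
            for raw in values:
                parsed = SimpleCookie()
                parsed.load(raw)
                for name, morsel in parsed.items():
                    self.cookies[name] = morsel.value

    def sending_headers(self):
        """Build the Cookie header for the next request, if any cookies are stored."""
        if not self.cookies:
            return {}
        pairs = "; ".join(f"{name}={value}" for name, value in self.cookies.items())
        return {"cookie": pairs}


jar = SessionCookieJar()
# First request: nothing is stored yet, so no Cookie header is sent.
first_request = jar.sending_headers()
# Hypothetical response headers carrying a session ID and a project ID.
jar.received_headers({"set-cookie": ["session_id=abc123; Path=/", "project_id=proj-1"]})
# Later requests replay both cookies, which keeps them in the same session.
later_request = jar.sending_headers()
print(first_request)
print(later_request)
```

Dropping the stored cookies before a request has the effect described in the last step: the request carries no session ID, so a new session would begin.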
Use a Non-Default Project
To run queries on datasets and data sources that are in a non-default project, use PyArrow to pass in the ID for the non-default project. In connection.py, replace project_id with the ID of the project you want to use:
from http.cookies import SimpleCookie

# Excerpt; these lines run inside a method of the connection class.
# Client cookie middleware, usually used as a black box.
client_cookie_middleware = CookieMiddlewareFactory()
tls_args = {}
if self.tls:
    tls_args = self._set_tls_connection_args()
    scheme = "grpc+tls"
if self.project_id:
    cookie = SimpleCookie()
    # Load "project_id=<project-uuid>" into the Cookie container.
    # Note we're no longer using it as a black box, and the client
    # is making up its own cookie which is less than conformant
    # to RFC 6265. This should ideally not be used in production
    # systems.
    cookie['project_id'] = self.project_id
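The excerpt stops after assigning the cookie. As a hedged, standard-library-only sketch of the idea, the project ID can be rendered into the Cookie header value that would accompany each request. The function name and the sample UUID are illustrative; how the sample client hands the cookie to its middleware is not shown in the excerpt and is not reproduced here.

```python
from http.cookies import SimpleCookie
from typing import Optional


def project_cookie_header(project_id: Optional[str]) -> dict:
    """Return a Cookie header selecting a non-default project, or {} for the default."""
    if not project_id:
        return {}
    cookie = SimpleCookie()
    cookie["project_id"] = project_id
    # Render "name=value" pairs the way a Cookie request header expects them.
    value = "; ".join(f"{name}={morsel.value}" for name, morsel in cookie.items())
    return {"cookie": value}


# Hypothetical project UUID, for illustration only.
header = project_cookie_header("7a1c9c3e-0000-4000-8000-123456789abc")
print(header)
```

Returning an empty mapping when no project ID is given leaves the request unchanged, in which case queries run against the default project.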
Managing Workloads
Dremio administrators can use the Arrow Flight server endpoint to manage query workloads by adding the following connection properties to Flight clients:
Flight Client Property | Description |
---|---|
ENGINE | Name of the engine to use to process all queries issued during the current session. |
SCHEMA | Name of the schema (data source or folder, including child paths, such as mySource.folder1 and folder1.folder2) to use by default when a schema is not specified in a query. |
Sample Arrow Flight Client Applications
Dremio provides sample Arrow Flight client applications in Java and Python at Dremio Hub. The Go client in this repository does not support connections to Dremio Cloud.
Both sample clients use the hostname local and the port number 32010 by default, so be sure to override these defaults with the hostname data.dremio.cloud or data.eu.dremio.cloud and the port number 443.
At this time, you can only connect to the default Sonar project in Dremio Cloud.