CREATE TABLE, INSERT, SELECT, and SHOW TABLES SQL statements) or use it for ad‑hoc reads and writes.
Key Features
The Ocient Spark connector includes these key features:- Read Pushdown — The connector accelerates reads by pushing column selection, filters (including on nested fields), aggregations, and queries that only need the first N rows down to the Ocient System while still letting Spark validate the final results.
- Read Partitioning — The connector parallelizes reads by splitting data into multiple Spark partitions based on a partition column. For details, see Read Partitioning Options.
- DataFrame Write Behavior and Save Modes — The connector controls how it writes DataFrames to Ocient tables by honoring Spark save modes to append (
Append), truncate‑and‑replace (Overwrite), or fail on existing tables (ErrorIfExists). - Catalog Support — The connector exposes Ocient as a Spark catalog so you can use standard Spark SQL directly on an Ocient System.
Prerequisites
To use the Ocient Spark connector, your system must meet these software requirements.| Software | Version |
|---|---|
| Ocient | Use Ocient System version 26.1 or later. |
| Operating System (OS) | , , or . Use the latest version of each OS. |
| Apache Spark | Version 3.5 or later. |
| Version 8 or later. | |
| Ocient JDBC driver | Version 4.0 or later. |
SELECT, INSERT, CREATE, and DELETE user privileges for the specified database. For details, see Data Control Language (DCL) Statement Reference.
Ocient Spark Connector Setup and Initial Use
To start working with the Ocient Spark Connector, register the connector. Then, you can start executing SQL statements.Connector Registration
For best results, first register the connector as a catalog in Spark. To register the connector, edit thespark-defaults.conf file in your Spark install to include these lines. Replace the username and password fields with your Ocient System credentials.
Text
Use SQL Statements
After registration, the Spark connector lets you treat your Ocient System like any other Spark catalog. The connector routes SQL operations through the catalog implementation. Execute the Spark commandUSE to switch to your Ocient catalog and schema for SQL statements. In this case, use the ocient_cat catalog and my_schema schema.
SQL
my_new_table with identifier id, name name, event timestamp event, and the structure of an integer and string nested_date.
SQL
SQL
SQL
SQL
SQL
Use Scala DataFrames
The Ocient Spark connector integrates directly with the Spark DataFrame API, so you can read from and write to Ocient tables using familiar Spark patterns. After you configure the Ocient catalog, you can reference fully qualified table names, and the connector handles all JDBC connectivity and type mapping. The examples in this section use Scala to interact with an Ocient catalog. Examples Write from Spark to Ocient This example takes an existing Spark DataFramedf and writes its rows into an Ocient table my_table.
my_table and writes its rows into a new Spark DataFrame df2.
Ad Hoc Usage
The Ocient Spark connector supports ad‑hoc reads and writes using the Spark.format("ocient") method. This method is useful for brief operations, but it cannot use the Spark catalog system to create, drop, or list tables.
For example, this Spark command reads an Ocient table and creates the DataFrame df from its contents. Substitute jdbc_connection with the JDBC connection string for the database, the username and pwd values for your Ocient username and password, and my_schema and my_table with the schema and table name for the table to read.
df and appends its contents into an Ocient table.
Bulk Loading Best Practices
Use these recommended OS and Spark settings to get reliable performance and avoid inconsistent writes when using the Ocient JDBC bulk loader with Spark. For details on bulk loading, see JDBC Bulk Loading.Linux SSH Configuration
Increase the SSH connection capacity on Loader Nodes:- Set
MaxStartups 1024in the OSsshd_configconfiguration file on theloader/SSHendpoint hosts that accept SSH connections from the bulk loader. - Restart the SSH service to apply the updated
sshd_configconfiguration. For example, on an system, runsudo systemctl restart ssh.
Spark Configuration
Edit thespark-defaults.conf configuration file to include these settings:
spark.task.maxFailures = 1— This configuration prevents Spark from retrying failed tasks and potentially duplicating writes.spark.speculation = false— This configuration prevents Spark from launching speculative duplicate tasks that can re-run writes against Ocient.
Configuration Options
You can set specific configurations for the Ocient Spark connector through standard Spark options:- Set globally using Spark configuration: Add options to your
spark-defaults.conffile or your cluster Spark settings (e.g.,spark.sql.catalog.ocient_cat.url=...). - Set options per job or per operation: Use Spark (
.option()) or command-line (--conf) statements to set options for one-time usage.
Connection Options
These options control how the connector establishes a JDBC connection to Ocient and identify which table or query Spark should use. All options are for both read and write operations.| Option | Default | Description |
|---|---|---|
url | None | Required. The Ocient JDBC URL. If you do not specify the sparkMode setting, the connector automatically sets sparkMode=true. |
user | None | Required. The Ocient username. |
password | None | Required. The password of the user. |
dbtable | None | Required for adhoc commands using .format("ocient"). This option is the Ocient schema and table name (for example, schema.table). |
maskPassword | 1 | Optional. Determines whether passwords are exposed in Spark connector logs. Supported values are 0 or 1. If this option is set to 1, Spark connector logs mask password fields. Otherwise, Spark connector logs include password fields. |
Read Partitioning Options
These options control how Spark splits a read into multiple partitions based on a column range, affecting parallelism and data distribution during Ocient table scans. All options are for read operations only. If you do not specify any of these options, you have only one partition.| Option | Default | Description |
|---|---|---|
numPartitions | 1 | Optional. The number of Spark partitions to create for reading. |
partitionColumn | None | Optional. This is a numeric, date, or timestamp column to use for partitioning the read. If you specify the partitionColumn option, you can also define a value range to partition using the lowerBound and upperBound options. If you do not specify a range, the connector automatically uses the full range of values. |
lowerBound | None | Optional. The minimum value of the range for the partitionColumn option.If you use this option, you must also include the upperBound option. |
upperBound | None | Optional. The maximum value of the range for the partitionColumn option.If you use this option, you must also include the lowerBound option. |
ocient.minRowsPerPartition | 1 | Optional. Minimum target number of rows per Spark partition for the read. The connector uses this as a hint to avoid creating many tiny partitions. This option guarantees that each planned partition covers at least this many estimated rows, where possible. |
Read Performance Options
These options tune how efficiently the connector fetches rows from Ocient during reads, including JDBC fetch size and Ocient System internal parallelism. All options are for read operations only.| Option | Default | Description |
|---|---|---|
fetchSize | 0 | The JDBC fetchSize in rows for read operations. When you set fetchSize = N, the connector asks Ocient to send up to N rows per network fetch call, which can reduce round-trips for large result sets. The default is 0, which lets the driver choose an appropriate fetch size. This option behaves the same way as the Spark standard JDBC fetchsize option. For details, see the Spark documentation. |
ocient.parallelism | 1 | Controls Ocient internal parallel execution level for read queries. When you set ocient.parallelism = N, the connector appends the USING PARALLELISM N clause to SELECT SQL statements so that Ocient executes each query with N internal workers. This option is separate from the numPartitions option, which controls the number of Spark tasks that run in parallel. |
Write Performance Options
This option tunes how efficiently Spark writes data to Ocient, primarily by controlling the JDBC batch size used for inserts. This option is for write operations only.| Option | Default | Description |
|---|---|---|
batchsize | 4800000 | Controls how many rows Spark sends to Ocient in each JDBC batch during a write operation. When you set batchsize = N, the connector groups up to N rows per insertion batch, which can significantly improve write throughput for large DataFrame writes. The default is 4,800,000 rows. This option behaves the same way as the Spark standard JDBC batchsize option. For details, see the Spark documentation. |
Data Type Mapping
The Ocient Spark connector supports all Ocient primitive types and complex types (such as array or tuple). When Spark creates a table, the connector writes the full Spark logical type into an OcientTYPE HINT clause on each column. For example, for a column that uses an Ocient TUPLE type and maps to a Spark STRUCT type, the connector generates column DDL that includes a type hint such as TYPE HINT 'STRUCT<myCol: STRING, another: INT>'.
During a read operation, the connector parses this TYPE_HINT field to reconstruct the original Spark schema, including nested field names. If the connector does not find the hint (e.g., for a pre-existing table), the connector maps Ocient TUPLE types to Spark STRUCT types with default field names (_1, _2, etc.).
Data Types
This table shows the Spark data types that correspond to the equivalent Ocient types. The table also lists whether each Spark type supports round-trips, meaning you can write the type from Spark to Ocient (creating the table) and then read it back into Spark while preserving the original Spark type and structure.| Spark Type | Ocient Type (in CREATE TABLE SQL Statement) | Round-trip |
|---|---|---|
StringType | VARCHAR | Yes |
LongType | BIGINT | Yes |
IntegerType | INTEGER | Yes |
ShortType | SMALLINT | Yes |
ByteType | TINYINT | Yes |
DoubleType | DOUBLE | Yes |
FloatType | FLOAT | Yes |
DecimalType(p,s) | DECIMAL(p,s) | Yes |
BooleanType | BOOLEAN | Yes |
BinaryType | VARBINARY | Yes |
DateType | DATE | Yes |
TimestampType | TIMESTAMP | Yes |
TimestampNTZType | TIMESTAMP | Yes (with TYPE_HINT) |
ArrayType | ElementType[] | Yes (with TYPE_HINT) |
StructType | TUPLE<<...>> | Yes (with TYPE_HINT) |
MapType | TUPLE<<key, val>>[] | Yes (with TYPE_HINT) |
TYPE_HINT values, without introducing a compile-time dependency on Spark 4.0 APIs.
| Spark 4.0 Type | Ocient Type (in CREATE TABLE SQL Statement) | TYPE_HINT |
|---|---|---|
IntervalYearMonthType | INTEGER | SPARK_INTERVAL_YEAR_MONTH |
IntervalDayTimeType | BIGINT | SPARK_INTERVAL_DAY_TIME |
isVariantType | VARCHAR | SPARK_VARIANT |
Related Links
Connect Using JDBC JDBC Manual JDBC Classes and Methods.Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries.

