The connector is a Spark DataSourceV2 implementation that adapts an Ocient System to operate as a first-class source and sink for Spark workloads. Built on top of the Ocient JDBC driver, the connector allows Spark to read from and write to Ocient tables using Spark APIs and SQL statements. The connector implements Spark catalog and table interfaces so you can register Ocient as a catalog (for CREATE TABLE, INSERT, SELECT, and SHOW TABLES SQL statements) or use it for ad‑hoc reads and writes.

Key Features

The Ocient Spark connector includes these key features:
  • Read Pushdown — The connector accelerates reads by pushing column selection, filters (including on nested fields), aggregations, and queries that only need the first N rows down to the Ocient System while still letting Spark validate the final results.
  • Read Partitioning — The connector parallelizes reads by splitting data into multiple Spark partitions based on a partition column. For details, see Read Partitioning Options.
  • DataFrame Write Behavior and Save Modes — The connector controls how it writes DataFrames to Ocient tables by honoring Spark save modes to append (Append), truncate‑and‑replace (Overwrite), or fail on existing tables (ErrorIfExists).
  • Catalog Support — The connector exposes Ocient as a Spark catalog so you can use standard Spark SQL directly on an Ocient System.

Prerequisites

To use the Ocient Spark connector, your system must meet these software requirements.
Software | Version
Ocient | Use Ocient System version 26.1 or later.
Operating System (OS) | , , or . Use the latest version of each OS.
Apache Spark | Version 3.5 or later.
 | Version 8 or later.
Ocient JDBC driver | Version 4.0 or later.
Additionally, you must have the SELECT, INSERT, CREATE, and DELETE user privileges for the specified database. For details, see Data Control Language (DCL) Statement Reference.

Ocient Spark Connector Setup and Initial Use

To start working with the Ocient Spark connector, register the connector. Then, you can start executing SQL statements.

Connector Registration

For best results, first register the connector as a catalog in Spark. To register the connector, edit the spark-defaults.conf file in your Spark install to include these lines. Replace the username and password fields with your Ocient System credentials.
Text
spark.sql.catalog.ocient_cat=com.ocient.spark.v2.DefaultSource
spark.sql.catalog.ocient_cat.url=jdbc:ocient://host:port/db
spark.sql.catalog.ocient_cat.user=<username>
spark.sql.catalog.ocient_cat.password=<password>
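
When editing spark-defaults.conf is not convenient, the same catalog properties can, in principle, be supplied while the Spark session is built. The following is a minimal sketch rather than the documented setup path. The object name OcientCatalogSession and the OCIENT_USER and OCIENT_PASSWORD environment variables are hypothetical, and host, port, and db remain placeholders.

```scala
import org.apache.spark.sql.SparkSession

object OcientCatalogSession {
  // Same keys as the spark-defaults.conf example. Credentials come from
  // hypothetical environment variables so they stay out of source control.
  val catalogConf: Map[String, String] = Map(
    "spark.sql.catalog.ocient_cat"          -> "com.ocient.spark.v2.DefaultSource",
    "spark.sql.catalog.ocient_cat.url"      -> "jdbc:ocient://host:port/db",
    "spark.sql.catalog.ocient_cat.user"     -> sys.env.getOrElse("OCIENT_USER", "<username>"),
    "spark.sql.catalog.ocient_cat.password" -> sys.env.getOrElse("OCIENT_PASSWORD", "<password>")
  )

  // Applies every entry to the session builder. The master URL is expected to
  // come from spark-submit or the cluster configuration.
  def build(): SparkSession =
    catalogConf
      .foldLeft(SparkSession.builder().appName("ocient-catalog-example")) {
        case (builder, (key, value)) => builder.config(key, value)
      }
      .getOrCreate()
}
```

Calling OcientCatalogSession.build() returns a session in which ocient_cat should resolve as a catalog, assuming the connector and the Ocient JDBC driver are on the classpath.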

Use SQL Statements

After registration, the Spark connector lets you treat your Ocient System like any other Spark catalog. The connector routes SQL operations through the catalog implementation. Execute the Spark command USE to switch to your Ocient catalog and schema for SQL statements. In this case, use the ocient_cat catalog and my_schema schema.
SQL
USE ocient_cat.my_schema;
Subsequent commands default to your Ocient catalog and schema, so you no longer need to reference them. This example creates a new table my_new_table with an identifier column id, a name column name, an event timestamp column event_time, and a struct column nested_data that contains an integer field a and a string field b.
SQL
CREATE TABLE my_new_table (
    id BIGINT,
    name STRING,
    event_time TIMESTAMP,
    nested_data STRUCT<a: INT, b: STRING>
);
Insert a row into the new table.
SQL
INSERT INTO my_new_table VALUES
    (1,'foo', '2025-01-01 12:00:00', (100, 'bar'));
Read the row from the table.
SQL
SELECT * FROM my_new_table WHERE id = 1;
List the table.
SQL
SHOW TABLES;
Drop the table.
SQL
DROP TABLE my_new_table;

Use Scala DataFrames

The Ocient Spark connector integrates directly with the Spark DataFrame API, so you can read from and write to Ocient tables using familiar Spark patterns. After you configure the Ocient catalog, you can reference fully qualified table names, and the connector handles all JDBC connectivity and type mapping. The examples in this section use Scala to interact with an Ocient catalog.

Examples

Write from Spark to Ocient

This example takes an existing Spark DataFrame df and writes its rows into an Ocient table my_table.
df.write.saveAsTable("ocient_cat.my_schema.my_table")

Read from Ocient into Spark

This example reads the rows of the Ocient table my_table into a new Spark DataFrame df2.
val df2 = spark.table("ocient_cat.my_schema.my_table")
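
The save modes listed under Key Features can be selected explicitly on the DataFrame writer. The following is a minimal sketch, assuming the ocient_cat catalog is already registered and reusing the table name from the examples above. The object name OcientSaveModes and its method names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object OcientSaveModes {
  // Fully qualified table name: catalog, schema, table.
  val table = "ocient_cat.my_schema.my_table"

  // Adds the rows of df to the table.
  def append(df: DataFrame): Unit =
    df.write.mode(SaveMode.Append).saveAsTable(table)

  // Replaces the current contents of the table with the rows of df.
  def replace(df: DataFrame): Unit =
    df.write.mode(SaveMode.Overwrite).saveAsTable(table)

  // Fails if the table already exists instead of modifying it.
  def createOnly(df: DataFrame): Unit =
    df.write.mode(SaveMode.ErrorIfExists).saveAsTable(table)
}
```

When no mode is set, Spark uses ErrorIfExists for saveAsTable, so stating the mode explicitly makes the intent of each write clear.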

Ad Hoc Usage

The Ocient Spark connector supports ad‑hoc reads and writes using the Spark .format("ocient") method. This method is useful for brief operations, but it cannot use the Spark catalog system to create, drop, or list tables.

For example, this Spark command reads an Ocient table and creates the DataFrame df from its contents. Replace jdbc_connection with the JDBC connection string for the database, username and pwd with your Ocient username and password, and my_schema and my_table with the schema and name of the table to read.
val df = spark.read
    .format("ocient")
    .option("url", "jdbc_connection")
    .option("user", "username")
    .option("password", "pwd")
    .option("dbtable", "my_schema.my_table")
    .load()
This command takes the DataFrame df and appends its contents into an Ocient table.
df.write
    .format("ocient")
    .option("url", "jdbc_connection")
    .option("user", "username")
    .option("password", "pwd")
    .option("dbtable", "my_schema.my_table")
    .mode("append")
    .save()
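
Because the ad‑hoc read and write repeat the same connection options, it can help to collect them in one place. The following is a minimal sketch: the object name OcientAdHoc and the OCIENT_JDBC_URL, OCIENT_USER, and OCIENT_PASSWORD environment variables are hypothetical, and the fallback values are the same placeholders used in the commands above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object OcientAdHoc {
  // Connection options shared by ad hoc reads and writes.
  def connectionOptions(table: String): Map[String, String] = Map(
    "url"      -> sys.env.getOrElse("OCIENT_JDBC_URL", "jdbc_connection"),
    "user"     -> sys.env.getOrElse("OCIENT_USER", "username"),
    "password" -> sys.env.getOrElse("OCIENT_PASSWORD", "pwd"),
    "dbtable"  -> table
  )

  // Reads an Ocient table (schema.table) into a DataFrame.
  def read(spark: SparkSession, table: String): DataFrame =
    spark.read.format("ocient").options(connectionOptions(table)).load()

  // Appends the rows of df to an Ocient table (schema.table).
  def append(df: DataFrame, table: String): Unit =
    df.write.format("ocient").options(connectionOptions(table)).mode("append").save()
}
```

For example, OcientAdHoc.read(spark, "my_schema.my_table") performs the same read as the first command in this section.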

Bulk Loading Best Practices

Use these recommended OS and Spark settings to get reliable performance and avoid inconsistent writes when using the Ocient JDBC bulk loader with Spark. For details on bulk loading, see JDBC Bulk Loading.

Linux SSH Configuration

Increase the SSH connection capacity on Loader Nodes:
  • Set MaxStartups 1024 in the OS sshd_config configuration file on the loader/SSH endpoint hosts that accept SSH connections from the bulk loader.
  • Restart the SSH service to apply the updated sshd_config configuration. For example, on a system that uses systemd, run sudo systemctl restart ssh.

Spark Configuration

Edit the spark-defaults.conf configuration file to include these settings:
  • spark.task.maxFailures = 1 — This configuration prevents Spark from retrying failed tasks and potentially duplicating writes.
  • spark.speculation = false — This configuration prevents Spark from launching speculative duplicate tasks that can re-run writes against Ocient.
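
For a job that builds its own session, the two settings above can also be passed to the session builder. The following is a minimal sketch, and the object name BulkLoadSession is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object BulkLoadSession {
  // The two settings recommended above for bulk loading.
  val bulkLoadConf: Map[String, String] = Map(
    "spark.task.maxFailures" -> "1",     // do not retry failed tasks
    "spark.speculation"      -> "false"  // do not launch speculative duplicate tasks
  )

  // These are core Spark settings, so they take effect only if no Spark
  // context is running yet. getOrCreate() otherwise returns the existing session.
  def build(): SparkSession =
    bulkLoadConf
      .foldLeft(SparkSession.builder().appName("ocient-bulk-load")) {
        case (builder, (key, value)) => builder.config(key, value)
      }
      .getOrCreate()
}
```

Setting the values in spark-defaults.conf, as described above, avoids the ordering concern because Spark reads that file before any context starts.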

Configuration Options

You can set specific configurations for the Ocient Spark connector through standard Spark options:
  • Set globally using Spark configuration: Add options to your spark-defaults.conf file or your cluster Spark settings (e.g., spark.sql.catalog.ocient_cat.url=...).
  • Set options per job or per operation: Use Spark (.option()) or command-line (--conf) statements to set options for one-time usage.
The connector passes most of these settings through to the underlying Ocient JDBC driver as connection properties, but the connector interprets a few of them directly to shape the generated SQL.

Connection Options

These options control how the connector establishes a JDBC connection to Ocient and identify which table or query Spark should use. All options are for both read and write operations.
Option | Default | Description
url | None | Required. The Ocient JDBC URL. If you do not specify the sparkMode setting, the connector automatically sets sparkMode=true.
user | None | Required. The Ocient username.
password | None | Required. The password of the user.
dbtable | None | Required for ad hoc commands using .format("ocient"). This option is the Ocient schema and table name (for example, schema.table).
maskPassword | 1 | Optional. Determines whether passwords are exposed in Spark connector logs. Supported values are 0 or 1. If this option is set to 1, Spark connector logs mask password fields. Otherwise, Spark connector logs include password fields.

Read Partitioning Options

These options control how Spark splits a read into multiple partitions based on a column range, affecting parallelism and data distribution during Ocient table scans. All options are for read operations only. If you do not specify any of these options, the read uses a single partition.
Option | Default | Description
numPartitions | 1 | Optional. The number of Spark partitions to create for reading.
partitionColumn | None | Optional. This is a numeric, date, or timestamp column to use for partitioning the read. If you specify the partitionColumn option, you can also define a value range to partition using the lowerBound and upperBound options. If you do not specify a range, the connector automatically uses the full range of values.
lowerBound | None | Optional. The minimum value of the range for the partitionColumn option. If you use this option, you must also include the upperBound option.
upperBound | None | Optional. The maximum value of the range for the partitionColumn option. If you use this option, you must also include the lowerBound option.
ocient.minRowsPerPartition | 1 | Optional. Minimum target number of rows per Spark partition for the read. The connector uses this value as a hint to avoid creating many tiny partitions. Where possible, the connector plans each partition to cover at least this many estimated rows.
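
The partitioning options in this table can be combined on a single read. The following is a minimal sketch, assuming a table with a numeric id column whose values fall roughly between 1 and 1,000,000. The object name OcientPartitionedRead, the column name, the bounds, and the partition count are illustrative assumptions, and the connection map is expected to hold the url, user, password, and dbtable options.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object OcientPartitionedRead {
  // Builds the partitioning options. lowerBound and upperBound are always
  // supplied together, as the option table requires.
  def partitionOptions(column: String, lower: Long, upper: Long, numPartitions: Int): Map[String, String] = {
    require(numPartitions >= 1, "numPartitions must be at least 1")
    require(lower < upper, "lowerBound must be less than upperBound")
    Map(
      "partitionColumn" -> column,
      "lowerBound"      -> lower.toString,
      "upperBound"      -> upper.toString,
      "numPartitions"   -> numPartitions.toString
    )
  }

  // Reads with eight partitions over the hypothetical id range.
  def read(spark: SparkSession, connection: Map[String, String]): DataFrame =
    spark.read
      .format("ocient")
      .options(connection)
      .options(partitionOptions("id", 1L, 1000000L, 8))
      .load()
}
```

After the read, df.rdd.getNumPartitions shows how many partitions Spark actually planned, which is a quick way to check the effect of these options.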

Read Performance Options

These options tune how efficiently the connector fetches rows from Ocient during reads, including JDBC fetch size and Ocient System internal parallelism. All options are for read operations only.
Option | Default | Description
fetchSize | 0 | The JDBC fetchSize in rows for read operations. When you set fetchSize = N, the connector asks Ocient to send up to N rows per network fetch call, which can reduce round-trips for large result sets. The default is 0, which lets the driver choose an appropriate fetch size. This option behaves the same way as the Spark standard JDBC fetchsize option. For details, see the Spark documentation.
ocient.parallelism | 1 | Controls Ocient internal parallel execution level for read queries. When you set ocient.parallelism = N, the connector appends the USING PARALLELISM N clause to SELECT SQL statements so that Ocient executes each query with N internal workers. This option is separate from the numPartitions option, which controls the number of Spark tasks that run in parallel.
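
Both read performance options are set like any other read option. The following is a minimal sketch: the object name OcientReadTuning is hypothetical, the values 10000 and 4 are arbitrary illustrations rather than recommendations, and the connection map is expected to hold the url, user, password, and dbtable options.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object OcientReadTuning {
  // fetchSize: rows per network fetch (0 lets the driver choose).
  // ocient.parallelism: internal Ocient workers per query.
  def tuningOptions(fetchSize: Int, parallelism: Int): Map[String, String] = {
    require(fetchSize >= 0, "fetchSize must not be negative")
    require(parallelism >= 1, "parallelism must be at least 1")
    Map(
      "fetchSize"          -> fetchSize.toString,
      "ocient.parallelism" -> parallelism.toString
    )
  }

  // Ad hoc read with illustrative tuning values.
  def read(spark: SparkSession, connection: Map[String, String]): DataFrame =
    spark.read
      .format("ocient")
      .options(connection)
      .options(tuningOptions(fetchSize = 10000, parallelism = 4))
      .load()
}
```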

Write Performance Options

This option tunes how efficiently Spark writes data to Ocient, primarily by controlling the JDBC batch size used for inserts. This option is for write operations only.
Option | Default | Description
batchsize | 4800000 | Controls how many rows Spark sends to Ocient in each JDBC batch during a write operation. When you set batchsize = N, the connector groups up to N rows per insertion batch, which can significantly improve write throughput for large DataFrame writes. The default is 4,800,000 rows. This option behaves the same way as the Spark standard JDBC batchsize option. For details, see the Spark documentation.
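
The batch size can be overridden for a single write. The following is a minimal sketch: the object name OcientBatchedWrite is hypothetical, and the connection map is expected to hold the url, user, password, and dbtable options.

```scala
import org.apache.spark.sql.DataFrame

object OcientBatchedWrite {
  // Builds the write option for a given JDBC batch size in rows.
  def writeOptions(batchSize: Int): Map[String, String] = {
    require(batchSize > 0, "batchsize must be positive")
    Map("batchsize" -> batchSize.toString)
  }

  // Appends df to the table named in the connection options,
  // sending up to batchSize rows per JDBC batch.
  def append(df: DataFrame, connection: Map[String, String], batchSize: Int): Unit =
    df.write
      .format("ocient")
      .options(connection)
      .options(writeOptions(batchSize))
      .mode("append")
      .save()
}
```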

Data Type Mapping

The Ocient Spark connector supports all Ocient primitive types and complex types (such as array or tuple). When Spark creates a table, the connector writes the full Spark logical type into an Ocient TYPE HINT clause on each column. For example, for a column that uses an Ocient TUPLE type and maps to a Spark STRUCT type, the connector generates column DDL that includes a type hint such as TYPE HINT 'STRUCT<myCol: STRING, another: INT>'. During a read operation, the connector parses this TYPE_HINT field to reconstruct the original Spark schema, including nested field names. If the connector does not find the hint (e.g., for a pre-existing table), the connector maps Ocient TUPLE types to Spark STRUCT types with default field names (_1, _2, etc.).
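
When a table has no type hint and a struct column arrives with the default field names, the names can be restored on the Spark side by casting the column to a struct type with the intended names, because Spark casts struct fields by position. The following is a minimal sketch that runs in local mode without an Ocient System. A Scala tuple column stands in for the hint-less read because it also produces fields named _1 and _2. The object name RenameTupleFields and the field names a and b are illustrative.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object RenameTupleFields {
  // Casts a struct column, by position, to a struct type with named fields.
  def withNamedFields(df: DataFrame, column: String, structDdl: String): DataFrame =
    df.withColumn(column, col(column).cast(structDdl))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("rename-tuple-fields")
      .getOrCreate()
    import spark.implicits._

    // The tuple becomes a struct with fields _1 and _2, like a TUPLE column
    // read from a table that has no type hint.
    val raw = Seq((1L, (100, "bar"))).toDF("id", "nested_data")

    // After the cast, nested_data has the fields a (int) and b (string).
    val named = withNamedFields(raw, "nested_data", "struct<a:int,b:string>")
    named.printSchema()

    spark.stop()
  }
}
```

The same withNamedFields call can be applied to a DataFrame returned by spark.table or by an ad hoc read, provided the target struct has the same number of fields with compatible types.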

Data Types

This table shows the Spark data types that correspond to the equivalent Ocient types. The table also lists whether each Spark type supports round-trips, meaning you can write the type from Spark to Ocient (creating the table) and then read it back into Spark while preserving the original Spark type and structure.
Spark Type | Ocient Type (in CREATE TABLE SQL Statement) | Round-trip
StringType | VARCHAR | Yes
LongType | BIGINT | Yes
IntegerType | INTEGER | Yes
ShortType | SMALLINT | Yes
ByteType | TINYINT | Yes
DoubleType | DOUBLE | Yes
FloatType | FLOAT | Yes
DecimalType(p,s) | DECIMAL(p,s) | Yes
BooleanType | BOOLEAN | Yes
BinaryType | VARBINARY | Yes
DateType | DATE | Yes
TimestampType | TIMESTAMP | Yes
TimestampNTZType | TIMESTAMP | Yes (with TYPE_HINT)
ArrayType | ElementType[] | Yes (with TYPE_HINT)
StructType | TUPLE<<...>> | Yes (with TYPE_HINT)
MapType | TUPLE<<key, val>>[] | Yes (with TYPE_HINT)

Spark 4.0 Types

When you run the connector on Spark 4.0 or later, the connector detects these additional Spark types at runtime using reflection and maps them to the corresponding Ocient types and TYPE_HINT values, without introducing a compile-time dependency on Spark 4.0 APIs.
Spark 4.0 Type | Ocient Type (in CREATE TABLE SQL Statement) | TYPE_HINT
IntervalYearMonthType | INTEGER | SPARK_INTERVAL_YEAR_MONTH
IntervalDayTimeType | BIGINT | SPARK_INTERVAL_DAY_TIME
isVariantType | VARCHAR | SPARK_VARIANT
Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries.
Last modified on May 27, 2026