> ## Documentation Index
> Fetch the complete documentation index at: https://docs.ocient.com/llms.txt
> Use this file to discover all available pages before exploring further.

# JDBC Spark Connector

export const Windows = "Windows®";

export const Ubuntu = "Ubuntu®";

export const Spark = "Apache® Spark™";

export const Ocient = "Ocient®";

export const macos = "macOS®";

export const Linux = "Linux®";

export const Java = "Java®";

The {Ocient} {Spark} connector is a [Spark DataSourceV2](https://downloads.apache.org/spark/docs/2.3.1/api/java/index.html?org/apache/spark/sql/sources/v2/DataSourceV2.html) implementation that adapts an Ocient System to operate as a first-class source and sink for Spark workloads. Built on top of the Ocient JDBC driver, the connector allows Spark to read from and write to Ocient tables using Spark APIs and SQL statements.

The connector implements Spark catalog and table interfaces so you can register Ocient as a catalog (for `CREATE TABLE`, `INSERT`, `SELECT`, and `SHOW TABLES` SQL statements) or use it for ad‑hoc reads and writes.

## Key Features

The Ocient Spark connector includes these key features:

* Read Pushdown — The connector accelerates reads by pushing column selection, filters (including on nested fields), aggregations, and queries that only need the first N rows down to the Ocient System while still letting Spark validate the final results.
* Read Partitioning  — The connector parallelizes reads by splitting data into multiple Spark partitions based on a partition column. For details, see [Read Partitioning Options](#read-partitioning-options).
* [DataFrame](https://spark.apache.org/docs/latest/sql-programming-guide.html) Write Behavior and Save Modes — The connector controls how it writes DataFrames to Ocient tables by honoring Spark save modes to append (`Append`), truncate‑and‑replace (`Overwrite`), or fail on existing tables (`ErrorIfExists`).
* Catalog Support — The connector exposes Ocient as a Spark catalog so you can use standard Spark SQL directly on an Ocient System.

## Prerequisites

To use the Ocient Spark connector, your system must meet these software requirements.

| **Software**          | **Version**                                                             |
| --------------------- | ----------------------------------------------------------------------- |
| Ocient                | Use Ocient System version 26.1 or later.                                |
| Operating System (OS) | {Windows}, {Linux}, or {macos}.<br />Use the latest version of each OS. |
| Apache Spark          | Version 3.5 or later.                                                   |
| {Java}                | Version 8 or later.                                                     |
| Ocient JDBC driver    | Version 4.0 or later.                                                   |

Additionally, you must have the `SELECT`, `INSERT`, `CREATE`, and `DELETE` user privileges for the specified database. For details, see [Data Control Language (DCL) Statement Reference](/data-control-language-dcl-statement-reference).

## Ocient Spark Connector Setup and Initial Use

To start working with the Ocient Spark Connector, register the connector. Then, you can start executing SQL statements.

### Connector Registration

For best results, first register the connector as a catalog in Spark.

To register the connector, edit the `spark-defaults.conf` file in your Spark install to include these lines. Replace the `username` and `password` fields with your Ocient System credentials.

```none Text theme={null}
spark.sql.catalog.ocient_cat=com.ocient.spark.v2.DefaultSource
spark.sql.catalog.ocient_cat.url=jdbc:ocient://host:port/db
spark.sql.catalog.ocient_cat.user=<username>
spark.sql.catalog.ocient_cat.password=<password>
```

### Use SQL Statements

After registration, the Spark connector lets you treat your Ocient System like any other Spark catalog. The connector routes SQL operations through the catalog implementation.

Execute the Spark command `USE` to switch to your Ocient catalog and schema for SQL statements. In this case, use the `ocient_cat` catalog and `my_schema` schema.

```sql SQL theme={null}
USE ocient_cat.my_schema;
```

Subsequent commands default to your Ocient catalog and schema, so you no longer need to reference them.

This example creates a new table `my_new_table` with identifier `id`, name `name`, event timestamp `event`, and the structure of an integer and string `nested_date`.

```sql SQL theme={null}
CREATE TABLE my_new_table (
    id BIGINT,
    name VARCHAR,
    event_time TIMESTAMP,
    nested_data STRUCT<a: INT, b: STRING>
);
```

Insert a row into the new table.

```sql SQL theme={null}
INSERT INTO my_new_table VALUES
    (1,'foo', '2025-01-01 12:00:00', (100, 'bar'));
```

Read the row from the table.

```sql SQL theme={null}
SELECT * FROM my_new_table WHERE id = 1;
```

List the table.

```sql SQL theme={null}
SHOW TABLES;
```

Drop the table.

```sql SQL theme={null}
DROP TABLE my_new_table;
```

### Use Scala DataFrames

The Ocient Spark connector integrates directly with the Spark DataFrame API, so you can read from and write to Ocient tables using familiar Spark patterns. After you configure the Ocient catalog, you can reference fully qualified table names, and the connector handles all JDBC connectivity and type mapping.

The examples in this section use [Scala](https://www.scala-lang.org/) to interact with an Ocient catalog.

**Examples**

**Write from Spark to Ocient**

This example takes an existing Spark DataFrame `df` and writes its rows into an Ocient table `my_table`.

<CodeGroup>
  ```javascript Scala theme={null}
  df.write.saveAsTable("ocient_cat.my_schema.my_table")
  ```
</CodeGroup>

**Write from Ocient to Spark**

This example reads from the Ocient table `my_table` and writes its rows into a new Spark DataFrame `df2`.

<CodeGroup>
  ```javascript Scala theme={null}
  val df2 = spark.table("ocient_cat.my_schema.my_table")
  ```
</CodeGroup>

### Ad Hoc Usage

The Ocient Spark connector supports ad‑hoc reads and writes using the Spark `.format("ocient")` method. This method is useful for brief operations, but it cannot use the Spark catalog system to create, drop, or list tables.

For example, this Spark command reads an Ocient table and creates the DataFrame `df` from its contents. Substitute `jdbc_connection` with the JDBC connection string for the database, the `username` and `pwd` values for your Ocient username and password, and `my_schema` and `my_table` with the schema and table name for the table to read.

<CodeGroup>
  ```javascript Scala theme={null}
  val df = spark.read
      .format("ocient")
      .option("url", "jdbc_connection")
      .option("user", "username")
      .option("password", "pwd")
      .option("dbtable", "my_schema.my_table")
      .load()
  ```
</CodeGroup>

This command takes the DataFrame `df` and appends its contents into an Ocient table.

<CodeGroup>
  ```javascript Scala theme={null}
  df.write
      .format("ocient")
      .option("url", "jdbc_connection")
      .option("user", "username")
      .option("password", "pwd")
      .option("dbtable", "my_schema.my_table")
      .mode("append")
      .save()
  ```
</CodeGroup>

## **Bulk Loading Best Practices**

Use these recommended OS and Spark settings to get reliable performance and avoid inconsistent writes when using the Ocient JDBC bulk loader with Spark.

For details on bulk loading, see [JDBC Bulk Loading](/jdbc-manual#jdbc-bulk-loading).

### Linux SSH Configuration

Increase the SSH connection capacity on Loader Nodes:

* Set `MaxStartups 1024` in the OS `sshd_config` configuration file on the `loader/SSH` endpoint hosts that accept SSH connections from the bulk loader.
* Restart the SSH service to apply the updated `sshd_config` configuration. For example, on an {Ubuntu} system, run `sudo systemctl restart ssh`.

### Spark Configuration

Edit the `spark-defaults.conf` configuration file to include these settings:

* `spark.task.maxFailures = 1` — This configuration prevents Spark from retrying failed tasks and potentially duplicating writes.
* `spark.speculation = false`  — This configuration prevents Spark from launching speculative duplicate tasks that can re-run writes against Ocient.

## Configuration Options

You can set specific configurations for the Ocient Spark connector through standard Spark options:

* Set globally using Spark configuration: Add options to your `spark-defaults.conf` file or your cluster Spark settings (e.g., `spark.sql.catalog.ocient_cat.url=...`).
* Set options per job or per operation: Use Spark (`.option()`) or command-line (`--conf`) statements to set options for one-time usage.

The connector passes most of these settings through to the underlying Ocient JDBC driver as connection properties, but the system interprets a few directly by the connector to shape the generated SQL.

### Connection Options

These options control how the connector establishes a JDBC connection to Ocient and identify which table or query Spark should use. All options are for both read and write operations.

| **Option**     | **Default** | **Description**                                                                                                                                                                                                                                            |
| -------------- | ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `url`          | None        | Required. The Ocient JDBC URL. <br /><br />If you do not specify the `sparkMode` setting, the connector automatically sets `sparkMode=true`.                                                                                                               |
| `user`         | None        | Required. The Ocient username.                                                                                                                                                                                                                             |
| `password`     | None        | Required. The password of the user.                                                                                                                                                                                                                        |
| `dbtable`      | None        | Required for adhoc commands using `.format("ocient")`. This option is the Ocient schema and table name (for example, `schema.table`).                                                                                                                      |
| `maskPassword` | 1           | Optional. Determines whether passwords are exposed in Spark connector logs. Supported values are `0` or `1`. <br /><br />If this option is set to `1`, Spark connector logs mask password fields. Otherwise, Spark connector logs include password fields. |

### Read Partitioning Options

These options control how Spark splits a read into multiple partitions based on a column range, affecting parallelism and data distribution during Ocient table scans. All options are for read operations only. If you do not specify any of these options, you have only one partition.

| **Option**                   | **Default** | **Description**                                                                                                                                                                                                                                                                                                                            |
| ---------------------------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `numPartitions`              | 1           | Optional. The number of Spark partitions to create for reading.                                                                                                                                                                                                                                                                            |
| `partitionColumn`            | None        | Optional. This is a numeric, date, or timestamp column to use for partitioning the read.<br /><br />If you specify the `partitionColumn `option, you can also define a value range to partition using the `lowerBound` and `upperBound` options. If you do not specify a range, the connector automatically uses the full range of values. |
| `lowerBound`                 | None        | Optional. The minimum value of the range for the `partitionColumn` option.<br /><br />If you use this option, you must also include the `upperBound` option.                                                                                                                                                                               |
| `upperBound`                 | None        | Optional. The maximum value of the range for the `partitionColumn` option.<br /><br />If you use this option, you must also include the `lowerBound` option.                                                                                                                                                                               |
| `ocient.minRowsPerPartition` | 1           | Optional. Minimum target number of rows per Spark partition for the read. The connector uses this as a hint to avoid creating many tiny partitions. This option guarantees that each planned partition covers at least this many estimated rows, where possible.                                                                           |

### Read Performance Options

These options tune how efficiently the connector fetches rows from Ocient during reads, including JDBC fetch size and Ocient System internal parallelism. All options are for read operations only.

| **Option**           | **Default** | **Description**                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| -------------------- | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `fetchSize`          | 0           | The JDBC `fetchSize` in rows for read operations. <br />When you set `fetchSize = N`, the connector asks Ocient to send up to `N` rows per network fetch call, which can reduce round-trips for large result sets. The default is 0, which lets the driver choose an appropriate fetch size. This option behaves the same way as the Spark standard JDBC `fetchsize` option. For details, see the [Spark documentation](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html). |
| `ocient.parallelism` | 1           | Controls Ocient internal parallel execution level for read queries. When you set `ocient.parallelism = N`, the connector appends the `USING PARALLELISM N` clause to `SELECT` SQL statements so that Ocient executes each query with `N` internal workers. This option is separate from the `numPartitions` option, which controls the number of Spark tasks that run in parallel.                                                                                                        |

### Write Performance Options

This option tunes how efficiently Spark writes data to Ocient, primarily by controlling the JDBC batch size used for inserts. This option is for write operations only.

| **Option**  | **Default**            | **Description**                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| ----------- | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `batchsize` | `4800000` <br /><br /> | Controls how many rows Spark sends to Ocient in each JDBC batch during a write operation. When you set `batchsize = N`, the connector groups up to `N` rows per insertion batch, which can significantly improve write throughput for large DataFrame writes. The default is 4,800,000 rows. This option behaves the same way as the Spark standard JDBC `batchsize` option. For details, see the [Spark documentation](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html). |

## Data Type Mapping

The Ocient Spark connector supports all Ocient primitive types and complex types (such as array or tuple).

When Spark creates a table, the connector writes the full Spark logical type into an Ocient `TYPE HINT` clause on each column. For example, for a column that uses an Ocient TUPLE type and maps to a Spark STRUCT type, the connector generates column DDL that includes a type hint such as `TYPE HINT 'STRUCT<myCol: STRING, another: INT>'`.

During a read operation, the connector parses this `TYPE_HINT` field to reconstruct the original Spark schema, including nested field names. If the connector does not find the hint  (e.g., for a pre-existing table), the connector maps Ocient TUPLE types to Spark STRUCT types with default field names (`_1`, `_2`, etc.).

### Data Types

This table shows the Spark data types that correspond to the equivalent Ocient types. The table also lists whether each Spark type supports round-trips, meaning you can write the type from Spark to Ocient (creating the table) and then read it back into Spark while preserving the original Spark type and structure.

| **Spark Type**     | **Ocient Type (in** `CREATE TABLE` **SQL Statement)** | **Round-trip**         |
| ------------------ | ----------------------------------------------------- | ---------------------- |
| `StringType`       | `VARCHAR`                                             | Yes                    |
| `LongType`         | `BIGINT`                                              | Yes                    |
| `IntegerType`      | `INTEGER`                                             | Yes                    |
| `ShortType`        | `SMALLINT`                                            | Yes                    |
| `ByteType`         | `TINYINT`                                             | Yes                    |
| `DoubleType`       | `DOUBLE`                                              | Yes                    |
| `FloatType`        | `FLOAT`                                               | Yes                    |
| `DecimalType(p,s)` | `DECIMAL(p,s)`                                        | Yes                    |
| `BooleanType`      | `BOOLEAN`                                             | Yes                    |
| `BinaryType`       | `VARBINARY`                                           | Yes                    |
| `DateType`         | `DATE`                                                | Yes                    |
| `TimestampType`    | `TIMESTAMP`                                           | Yes                    |
| `TimestampNTZType` | `TIMESTAMP`                                           | Yes (with `TYPE_HINT`) |
| `ArrayType`        | `ElementType[]`                                       | Yes (with `TYPE_HINT`) |
| `StructType`       | `TUPLE<<...>>`                                        | Yes (with `TYPE_HINT`) |
| `MapType`          | `TUPLE<<key, val>>[]`                                 | Yes (with `TYPE_HINT`) |

**Spark 4.0 Types**

When you run the connector on Spark 4.0 or later, the connector detects these additional Spark types at runtime using reflection and maps them to the corresponding Ocient types and `TYPE_HINT` values, without introducing a compile-time dependency on Spark 4.0 APIs.

| **Spark 4.0 Type**      | **Ocient Type (in** `CREATE TABLE` **SQL Statement)** | `TYPE_HINT`                 |
| ----------------------- | ----------------------------------------------------- | --------------------------- |
| `IntervalYearMonthType` | `INTEGER`                                             | `SPARK_INTERVAL_YEAR_MONTH` |
| `IntervalDayTimeType`   | `BIGINT`                                              | `SPARK_INTERVAL_DAY_TIME`   |
| `isVariantType`         | `VARCHAR`                                             | `SPARK_VARIANT`             |

## Related Links

[Connect Using JDBC](/connect-using-jdbc)

[JDBC Manual](/jdbc-manual)

[JDBC Classes and Methods](/jdbc-classes-and-methods).

***

*Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries.*