JDBC Spark Connector

The {{ocient}} {{spark}} connector is a DataSourceV2 (https://downloads.apache.org/spark/docs/2.3.1/api/java/index.html?org/apache/spark/sql/sources/v2/DataSourceV2.html) implementation that adapts an Ocient system to operate as a first-class source and sink for Spark workloads. Built on top of the Ocient JDBC driver, the connector allows Spark to read from and write to Ocient tables using Spark APIs and SQL statements. The connector implements Spark catalog and table interfaces so you can register Ocient as a catalog (for CREATE TABLE, INSERT, SELECT, and SHOW TABLES SQL statements) or use it for ad-hoc reads and writes.

Key Features

The Ocient Spark Connector includes these key features:

- Read pushdown: The connector accelerates reads by pushing column selection, filters (including on nested fields), aggregations, and queries that only need the first N rows down to the Ocient system while still letting Spark validate the final results.
- Read partitioning: The connector parallelizes reads by splitting data into multiple Spark partitions based on a partition column. For details, see docid\ pp91aew4 1hy1pft3f4zs and https://spark.apache.org/docs/latest/sql-programming-guide.html.
- Write behavior and save modes: The connector controls how it writes DataFrames to Ocient tables by honoring Spark save modes to append (Append), truncate-and-replace (Overwrite), or fail on existing tables (ErrorIfExists).
- Catalog support: The connector exposes Ocient as a Spark catalog so you can use standard Spark SQL directly on an Ocient system.

Prerequisites

To use the Ocient Spark Connector, your system must meet these software requirements.

| Software | Version |
| --- | --- |
| Ocient | Use Ocient system version 26.1 or later. |
| Operating system (OS) | {{windows}}, {{linux}}, or {{macos}}. Use the latest version of each OS. |
| Apache Spark | Version 3.5 or later |
| {{java}} | Version 8 or later |
| Ocient JDBC driver | Version 4.0 or later |

Additionally, you must have the SELECT, INSERT, CREATE, and DELETE user privileges for the specified
database. For details, see docid\ f55ngxtki0f7kkmyatvug.

Ocient Spark Connector Setup and Initial Use

To start working with the Ocient Spark Connector, register the connector. Then, you can start executing SQL statements.

Connector Registration

For best results, first register the connector as a catalog in Spark. To register the connector, edit the spark-defaults.conf file in your Spark installation to include these lines. Replace the <username> and <password> fields with your Ocient system credentials.

    spark.sql.catalog.ocient_cat=com.ocient.spark.v2.DefaultSource
    spark.sql.catalog.ocient_cat.url=jdbc:ocient://host:port/db
    spark.sql.catalog.ocient_cat.user=<username>
    spark.sql.catalog.ocient_cat.password=<password>

Use SQL Statements

After registration, the Spark connector lets you treat your Ocient system like any other Spark catalog. The connector routes SQL operations through the catalog implementation.

Execute the Spark command USE to switch to your Ocient catalog and schema for SQL statements. In this case, use the ocient_cat catalog and the my_schema schema.

    USE ocient_cat.my_schema;

Subsequent commands default to your Ocient catalog and schema, so you no longer need to reference them.

This example creates a new table my_new_table with the identifier id, the name name, the event timestamp event_time, and the column nested_data, which is a structure of an integer and a string.

    CREATE TABLE my_new_table (
      id BIGINT,
      name VARCHAR,
      event_time TIMESTAMP,
      nested_data STRUCT<a INT, b STRING>
    );

Insert a row into the new table.

    INSERT INTO my_new_table VALUES (1, 'foo', '2025-01-01 12:00:00', (100, 'bar'));

Read the row from the table.

    SELECT * FROM my_new_table WHERE id = 1;

List the table.

    SHOW TABLES;

Drop the table.

    DROP TABLE my_new_table;

Use Scala DataFrames

The Ocient Spark Connector integrates directly with the Spark DataFrame API, so you can read from and write to Ocient tables using familiar Spark patterns. After you configure the Ocient catalog, you can reference fully qualified table names, and the connector handles
all JDBC connectivity and type mapping. The examples in this section use Scala (https://www.scala-lang.org/) to interact with an Ocient catalog.

Examples

Write from Spark to Ocient

This example takes an existing Spark DataFrame df and writes its rows into an Ocient table my_table.

    df.write.saveAsTable("ocient_cat.my_schema.my_table")

Read from Ocient to Spark

This example reads from the Ocient table my_table and loads its rows into a new Spark DataFrame df2.

    val df2 = spark.table("ocient_cat.my_schema.my_table")

Ad-Hoc Usage

The Ocient Spark Connector supports ad-hoc reads and writes using the Spark format("ocient") method. This method is useful for brief operations, but it cannot use the Spark catalog system to create, drop, or list tables.

For example, this Spark command reads an Ocient table and creates the DataFrame df from its contents. Substitute jdbc_connection with the JDBC connection string for the database, the username and pwd values with your Ocient username and password, and my_schema and my_table with the schema and table name of the table to read.

    val df = spark.read
      .format("ocient")
      .option("url", "jdbc_connection")
      .option("user", "username")
      .option("password", "pwd")
      .option("dbtable", "my_schema.my_table")
      .load()

This command takes the DataFrame df and appends its contents to an Ocient table.

    df.write
      .format("ocient")
      .option("url", "jdbc_connection")
      .option("user", "username")
      .option("password", "pwd")
      .option("dbtable", "my_schema.my_table")
      .mode("append")
      .save()

Bulk Loading Best Practices

Use these recommended OS and Spark settings to get reliable performance and avoid inconsistent writes when using the Ocient JDBC bulk loader with Spark. For details on bulk loading, see docid apnndn tjqmjdd5oqdvd.

Linux SSH Configuration

Increase the SSH connection capacity on loader nodes. Set MaxStartups 1024 in the OS sshd_config configuration file on the loader/SSH endpoint hosts that accept SSH connections from the bulk loader. Restart the SSH service
to apply the updated sshd_config configuration. For example, on an {{ubuntu}} system, run:

    sudo systemctl restart ssh

Spark Configuration

Edit the spark-defaults.conf configuration file to include these settings:

- spark.task.maxFailures = 1: This configuration prevents Spark from retrying failed tasks and potentially duplicating writes.
- spark.speculation = false: This configuration prevents Spark from launching speculative duplicate tasks that can re-run writes against Ocient.

Configuration Options

You can set specific configurations for the Ocient Spark Connector through standard Spark options.

- Set globally using Spark configuration: Add options to your spark-defaults.conf file or your cluster Spark settings (e.g., spark.sql.catalog.ocient_cat.url=).
- Set options per job or per operation: Use Spark (.option()) or command-line (--conf) statements to set options for one-time usage.

The connector passes most of these settings through to the underlying Ocient JDBC driver as connection properties, but the connector interprets a few directly to shape the generated SQL.

Connection Options

These options control how the connector establishes a JDBC connection to Ocient and identify which table or query Spark should use. All options apply to both read and write operations.

| Option | Default | Description |
| --- | --- | --- |
| url | None | Required. The Ocient JDBC URL. If you do not specify the sparkmode setting, the connector automatically sets sparkmode=true. |
| user | None | Required. The Ocient username. |
| password | None | Required. The password of the user. |
| dbtable | None | Required for ad-hoc commands using format("ocient"). This option is the Ocient schema and table name (for example, schema.table). |
| maskpassword | 1 | Optional. Determines whether passwords are exposed in Spark connector logs. Supported values are 0 or 1. If this option is set to 1, Spark connector logs mask password fields. Otherwise, Spark connector logs include password fields. |

Read Partitioning Options

These options control how Spark splits a read into
multiple partitions based on a column range, affecting parallelism and data distribution during Ocient table scans. All options are for read operations only. If you do not specify any of these options, the read uses only one partition.

| Option | Default | Description |
| --- | --- | --- |
| numpartitions | 1 | Optional. The number of Spark partitions to create for reading. |
| partitioncolumn | None | Optional. A numeric, date, or timestamp column to use for partitioning the read. If you specify the partitioncolumn option, you can also define a value range to partition using the lowerbound and upperbound options. If you do not specify a range, the connector automatically uses the full range of values. |
| lowerbound | None | Optional. The minimum value of the range for the partitioncolumn option. If you use this option, you must also include the upperbound option. |
| upperbound | None | Optional. The maximum value of the range for the partitioncolumn option. If you use this option, you must also include the lowerbound option. |
| ocient.minrowsperpartition | 1 | Optional. The minimum target number of rows per Spark partition for the read. The connector uses this value as a hint to avoid creating many tiny partitions: where possible, each planned partition covers at least this many estimated rows. |

Read Performance Options

These options tune how efficiently the connector fetches rows from Ocient during reads, including the JDBC fetch size and Ocient system internal parallelism. All options are for read operations only.

| Option | Default | Description |
| --- | --- | --- |
| fetchsize | 0 | The JDBC fetch size in rows for read operations. When you set fetchsize = N, the connector asks Ocient to send up to N rows per network fetch call, which can reduce round trips for large result sets. The default is 0, which lets the driver choose an appropriate fetch size. This option behaves the same way as the Spark standard JDBC fetchsize option. For details, see https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html. |
| ocient.parallelism | 1 | Controls the Ocient internal parallel execution level for read queries. When you set ocient.parallelism = N, the connector appends the USING PARALLELISM N clause to SELECT SQL statements so that Ocient executes each query with N internal workers. This option is separate from the numpartitions option, which controls the number of Spark tasks that run in parallel. |

Write Performance Options

This option tunes how efficiently Spark writes data to Ocient, primarily by controlling the JDBC batch size used for inserts. This option is for write operations only.

| Option | Default | Description |
| --- | --- | --- |
| batchsize | 4800000 | Controls how many rows Spark sends to Ocient in each JDBC batch during a write operation. When you set batchsize = N, the connector groups up to N rows per insertion batch, which can significantly improve write throughput for large DataFrame writes. The default is 4,800,000 rows. This option behaves the same way as the Spark standard JDBC batchsize option. For details, see https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html. |

Data Type Mapping

The Ocient Spark Connector supports all Ocient primitive types and complex types (such as array or tuple).

When Spark creates a table, the connector writes the full Spark logical type into an Ocient type hint clause on each column. For example, for a column that uses an Ocient tuple type and maps to a Spark struct type, the connector generates column DDL that includes a type hint such as type hint 'struct<mycol string, another int>'.

During a read operation, the connector parses this type hint field to reconstruct the original Spark schema, including nested field names. If the connector does not find the hint (e.g., for a pre-existing table), the connector maps Ocient tuple types to Spark struct types with default field names (_1, _2, etc.).

Data Types

This table shows the Spark data types that correspond to the equivalent Ocient types. The table also lists whether each Spark type supports round trips, meaning you can write the type from Spark to Ocient
(creating the table) and then read it back into Spark while preserving the original Spark type and structure.

| Spark Type | Ocient Type (in CREATE TABLE SQL statement) | Round Trip |
| --- | --- | --- |
| StringType | VARCHAR | Yes |
| LongType | BIGINT | Yes |
| IntegerType | INTEGER | Yes |
| ShortType | SMALLINT | Yes |
| ByteType | TINYINT | Yes |
| DoubleType | DOUBLE | Yes |
| FloatType | FLOAT | Yes |
| DecimalType(p,s) | DECIMAL(p,s) | Yes |
| BooleanType | BOOLEAN | Yes |
| BinaryType | VARBINARY | Yes |
| DateType | DATE | Yes |
| TimestampType | TIMESTAMP | Yes |
| TimestampNTZType | TIMESTAMP | Yes (with type hint) |
| ArrayType | elementType[] | Yes (with type hint) |
| StructType | TUPLE<< >> | Yes (with type hint) |
| MapType | TUPLE<<key, val>>[] | Yes (with type hint) |

Spark 4.0 Types

When you run the connector on Spark 4.0 or later, the connector detects these additional Spark types at runtime using reflection and maps them to the corresponding Ocient types and type hint values, without introducing a compile-time dependency on Spark 4.0 APIs.

| Spark 4.0 Type | Ocient Type (in CREATE TABLE SQL statement) | Type Hint |
| --- | --- | --- |
| intervalyearmonthtype | INTEGER | spark interval year month |
| intervaldaytimetype | BIGINT | spark interval day time |
| isvarianttype | VARCHAR | spark variant |

Related Links

docid 1 p8y vgpzkd8k 0hxqd7
docid apnndn tjqmjdd5oqdvd
docid\ vknnjxbrekwndt3kpt3ln

{{linux}} is the registered trademark of Linus Torvalds in the U.S. and other countries.