Loading and Transformation Ref...

Data Types in Loading

After extracting from source files or topics, data can be transformed and mapped to target columns in tables.

LAT automatically binds many primitive source data types to Ocient column data types and automatically converts some source types to the target column type. Different source data types automatically bind to target columns using different conventions.

For example, a string in JSON such as '123' that loads into an INT column properly binds to the corresponding integer value 123. Similarly, an integer like 456 can automatically bind to an INT column as 456, a FLOAT column as 456.0, and a VARCHAR column as '456'. Finally, a string including Well-Known Text (WKT) representation of a POINT automatically binds into an ST_POINT column.

Empty String Handling

When you attempt to load an empty string into a target column with LAT, most data types raise an error and the record does not load into the target table. The exceptions are varchar, decimal, binary, and array column types. In those cases, the data warehouse loads these values:

  • varchar: An empty string
  • decimal: 0.0 (with the appropriate precision)
  • binary: An empty binary blob
  • array: An empty array

When you extract data from text-based formats like delimited data, the data warehouse loads empty data as an empty string unless you use the empty_as_null setting.

NULL Handling

If you do not provide a value for a column in the target table, this table explains what value the data warehouse stores in the column. For the purposes of default value handling, these are equivalent:

  • Pipeline is missing a column.
  • Pipeline includes a column but the value in the JSON or the source data evaluates to null.
  • Pipeline includes a column but the field is missing on the source record (e.g., a.b does not exist in a JSON record).

To transform a NULL or missing value into a default in the pipeline, use the transformation functions. For example: not_null(a.b, 'my_default')

Ocient Column Definition

Database Behavior When Column is Missing or Column Value Evaluates to NULL

NOT NULL with default

Default

NOT NULL with no default

Record fails

Nullable with default

Default

Nullable with no default

NULL

Data Type Binding

Refer to the Data Types for detailed information about each data type. These tables describe the automatic binding conventions of each data type.

BIGINT

Valid JSON Bind Types: 

Number, String 

Note: 

Number and String can be either integer or floating point values and are converted to 64-bit signed integers. Decimals are dropped. '+','-' can be used at the beginning of a string to indicate sign. String overflows result in errors, while numeric overflows are silent. 

Example: 

{  "number_bigint": 1.0,  "string_bigint": "1"  }  -> 1, 1 

BINARY

Valid JSON Bind Types: 

String 

Note: 

Converted as UTF-8 to binary: each character in the string taking its binary representation. 

Example: 

{  "string_binary": "hello"  }  -> BINARY(01101000 01100101 01101100 01101100 01101111) 

BOOLEAN

Valid JSON Bind Types: 

Boolean, String 

Note: 

For a value of true, a string must be one of (case insensitive): true, t, yes, on, y, 1. 

For a value of false, a string must be one of (case insensitive): false, f, no, off, n, 0. 

Example: 

{  "boolean": true,  "string_0": "yes",  "string_1": "FALSE"  }  -> true, true, false 

DATE

Valid JSON Bind Types: 

String, Number 

Note: 

If String, must be in format "YYYY-MM-DD". If Number must be an EPOCH_DAY. Use the built-in TO_DATE function to convert a string with a different date format to an EPOCH_DAY and bind. 

Example: 

{  "string_date": "2021-11-04",  "epoch_date": 0  }  -> DATE(2021-11-04), DATE(1970-01-01) 

DECIMAL

Valid JSON Bind Types: 

Number, String 



Valid Logical Types: 

Decimal 



Note: 

Number and String can be either integer or floating point values. Precision and scale for the loaded data are defined by the database table definition. The target column must support enough digits both before and after the decimal point to accommodate decimal values extracted from Fixed Width Binary or other record formats that support decimal data. If insufficient precision or scale exists in the target column, the data will fail to load. 



Example: 

{  "number_decimal": 1.0,  "string_decimal": "1.1"  }  -> DECIMAL(1.0), DECIMAL(1.1) 



DOUBLE, DOUBLE PRECISION

Valid JSON Bind Types: 

Number, String 

Note: 

Number and String can be either integer or floating point values. 

Example: 

{  "number_double": 1.0,  "string_double": "1"  }  -> 1.0, 1.0 

HASH

Valid JSON Bind Types: 

String 

Note: 

Converted as hexadecimal to binary: each character in the string must be a valid hexadecimal digit /[0-9a-f]+/i. 

Example: 

{  "hex_string": "0a1b"  }  -> HASH(0000 1010 0001 1011) 

INT

Valid JSON Bind Types: 

Number, String 

Note: 

Number and String can be either integer or floating point values and are converted to 32-bit signed integers. Decimals are dropped. '+','-' can be used at the beginning of a string to indicate sign. String overflows result in errors, while numeric overflows are silent. 

Example: 

{  "number_int": 1.0,  "string_int": "1"  }  -> 1, 1 

IPV4

Valid JSON Bind Types: 

String 

Note: 

Source data must be in the form: [0-255].[0-255].[0-255].[0-255] 

Example: 

{  "string_ipv4": "192.168.0.1"  }  -> IPV4(192.168.0.1) 

IP

Valid JSON Bind Types: 

String 

Note: 

Source string can be in IPV4 or IPV6 format. IPV6 can appear in full form as shown in this example or in condensed form. Condensed form can use :: notation (e.g., 2001:cdba::3257:9652) and leading zeros in IPV6 hextets can also be omitted (e.g., 2001:0db8:0000:0000:0000:0000:3257:9652 → 2001:db8:0:0:0:0:3257:9652). IPV6 can also appear as a string of 32 hex characters without colons (e.g., 2001cdba000000000000000032579652). 

Example: 

{  "string_ip_ipv4": "192.168.0.1",  "string_ip_ipv6": "2001:cdba:0000:0000:0000:0000:3257:9652"  }  -> IP(192.168.0.1), IP(2001:cdba:0000:0000:0000:0000:3257:9652) 

LINESTRING, ST_LINESTRING

Valid JSON Bind Types: 

String 

Note: 

Must be in the WKT format "LINESTRING(LONG LAT, … )" or "LINESTRING EMPTY" 

Example: 

{  "string_st_linestring": "LINESTRING(-71.064544 42.28787, -90.444444 89.562993)"  }  -> ST_LINESTRING(-71.064544 42.28787, -90.444444 89.562993) 

MATRIX

Valid JSON Bind Types: 

Array[Array], String 

Note: 

Arrays must be 2D with the same rows and columns as specified in the table. String must be in JSON 2D array format. Each element will be cast according the the element type’s rules in this section. 

Example: 

{  "int_matrix": [[0, 1, 2],[3, 4, 5],[6, 7, 8]],  "string_matrix": "[[0,1],[2,3]]"  }  -> MATRIX[[0,1,2],[3,4,5],[6,7,8]], MATRIX[[0,1],[2,3]] 

POINT, ST_POINT

Valid JSON Bind Types: 

String 

Note: 

Must be in the format "POINT(LONG LAT)" 

Example: 

{  "string_st_point": "POINT(-71.064544 42.28787)"  }  -> ST_POINT(-71.064544, 42.28787) 

POLYGON, ST_POLYGON

Ensure that polygons are oriented in your intended way when you load them as polygon orientation matters in Ocient semantics. If you intend all polygons to be oriented counter-clockwise, use the st_forcepolygonccw transform function to enforce counter-clockwise semantics.

Valid JSON Bind Types: 

String 

Note: 

Must be in the WKT format "POLYGON((LONG LAT, … ))" or "POLYGON((LONG LAT, … ), (LONG LAT, … ))" or "POLYGON EMPTY" 

Example: 

{  "string_st_polygon": "POLYGON((-87.62 41.87, -89.40 43.07, -87.906 43.04, -87.62 41.87))"  }  -> ST_POLYGON((-87.62 41.87, -89.40 43.07, -87.906 43.04, -87.62 41.87)) 

REAL, FLOAT, SINGLE PRECISION

Valid JSON Bind Types: 

Number, String 

Note: 

Number and String can be either integer or floating point values. Precision and scale are inherited from the table definition. 

Example: 

{  "number_float": 1.0,  "string_float": "1"  }  -> 1.0, 1.0 

SMALLINT

Valid JSON Bind Types: 

Number, String 

Note: 

Number and String can be either integer or floating point values and are converted to 16-bit signed integers. Decimals are dropped. '+','-' can be used at the beginning of a string to indicate sign. String overflows result in errors, while numeric overflows are silent. 

Example: 

{  "number_smallint": 1.0,  "string_smallint": "1"  }  -> 1, 1 

TIME

Valid JSON Bind Types: 

Number, String 

Note: 

String must be in the format: HH:MM:SS[.SSSSSSSSS]. Number must represent nanoseconds since midnight. Use the TO_TIME transformation function to convert an arbitrary string format to nanoseconds and bind. 

Example: 

{  "number_time": 1000000000,  "string_time": "07:33:10.0001"  }  -> TIME(00:00:01.0000), TIME(07:33:10.0001) 

TIMESTAMP

Valid JSON Bind Types: 

Number, String 

Note: 

String must be in the format: YYYY-MM-DD HH:MM:SS[.SSSSSSSSS]. Number must represent nanoseconds since epoch. Use the TO_TIMESTAMP transformation function to convert an arbitrary string format to nanoseconds and bind. 

Example: 

{  "number_timestamp": 1000000000,  "string_timestamp": "2021-11-04 07:33:10.0001"  }  -> TIMESTAMP(1970-01-01 00:00:01.0000), TIMESTAMP(2021-11-04 07:33:10.0001) 

TINYINT, BYTE

Valid JSON Bind Types: 

Number, String 

Note: 

Number and String can be either integer or floating point values and are converted to 8-bit signed integers. Decimals are dropped. '+','-' can be used at the beginning of a string to indicate sign. String overflows result in errors, while numeric overflows are silent. 

Example: 

{  "number_smallint": 1.0,  "string_smallint": "1"  }  -> 1, 1 

TUPLE

Valid JSON Bind Types: 

Array, String 

Note: 

String must be in JSON array format "[e0,e1,e2]". Each element will be cast to the tuple element type at that element’s index from the Ocient table according to the rules for that element type in this section. 

Example: 

{  "double_int": [1.0, 1],  "string_array": ["Ocient", [0, 1]],  "string_tuple": "[1.0, 1]",  "tuple_tuple": [[1.0, 1],["Ocient", 2]]  }  -> TUPLE(DOUBLE, INT)(1.0, 1), TUPLE(STRING, INT[])("Ocient", INT[0, 1]), TUPLE(DOUBLE, INT)(1.0, 1), TUPLE(TUPLE(DOUBLE, INT), TUPLE(VARCHAR, INT)((1.0, 1), ("Ocient", 1)) 

TYPE[]

Valid JSON Bind Types: 

Array, String 

Note: 

String must be in JSON array format "[e0,e1,e2]". Each element will be cast to the array element type from the Ocient table according to the rules for that element type in this section. 

Example: 

{  "int_array": [0,1,2],  "2d_array": [[0, 1, 2],[3, 4]],  "string_array": "[0,1,2]",  "tuple_array": [[0, 1.0, "Ocient"],[1, 2.0, "rocks"]]  }  -> INT[0,1,2], INT[][INT[0,1,2],INT[3,4]], INT[0,1,2] TUPLE(INT, FLOAT, VARCHAR)[(0, 1.0, "Ocient"), (1, 2.0. "rocks")] 

UUID

Valid JSON Bind Types: 

String 

Note: 

Source data must be a valid hyphen-separated UUID. 

Example: 

{  "string_uuid": "40e6215d-b5c6-4896-987c-f30f3678f608"  }  -> UUID(40e6215d-b5c6-4896-987c-f30f3678f608) 

VARCHAR, CHAR

Valid JSON Bind Types: 

String, Number, Boolean 

Note: 

Number and Boolean types are converted to their string representations. 

Example: 

{  "number_string": 1000,  "boolean_string": true,  "string_string": "string"  }  -> "1000", "true", "string" 

Load Geospatial Data

Geospatial data can be loaded from a variety of formats into Ocient. The Ocient system requires special considerations when you load geospatial data into Ocient including:

  • Coordinate System
  • Supported Geospatial Loading Formats
  • Geospatial Type Conversion
  • POINT Normalization
  • POLYGON Normalization

Coordinate System

Ocient uses Spatial Reference System ID (SRID) 4326 for all internal geospatial data types. SRID 4326 is defined in the WGS84 Standard. Before you load the source data, all data should have the SRID 4326 data type. The loading operation ignores any source data that contains the coordinate system SRID information.

Supported Geospatial Loading Formats

Well-Known Text

The primary data format for loading geospatial data in Ocient is the Well-Known Text (WKT) format. This string format is a portable format that can represent points, lines, and polygons.

String data in WKT format can automatically load into ST_POINT, ST_LINESTRING, and ST_POLYGON Ocient column types. In addition, arrays of strings in WKT format can automatically load into array columns of a suitable geospatial data type. The Ocient System assumes raw strings are in WKT format when the system loads these strings into geospatial column types.

Each LINESTRING or POLYGON value can be up to a maximum of 512 MB in size. This means a LINESTRING or POLYGON can contain approximately 32 million point values.

  • WKT strings are whitespace insensitive, so POINT(10 20) is equivalent to POINT (10 20).
  • WKT strings are case insensitive
  • Extended Well-Known Text (EWKT) is not supported. To load EWKT data, load the substring after the EWKT semicolon as the WKT format.

Well-Known Binary and Extended Well-Known Binary

The other data formats that can load into Ocient are Well-Known Binary (WKB) and Extended Well-Known Binary (EWKB) formats. The Ocient System requires an explicit transformation function for both WKB and EWKB formats when you load data and convert it into the target column type.

You can use the function st_geomfromewkb to transform WKB and EWKB data.

In both cases, the Ocient System represents the WKB or EWKB string data as hexadecimal string data. The hexadecimal string can be one of these formats:

  • Only hexadecimal digits: 000000000140000000000000004010000000000000
  • Prefixed with 0x: 0x000000000140000000000000004010000000000000
  • Prefixed with \x: \x000000000140000000000000004010000000000000

Geometry Conversion

Ocient provides the automatic conversion of geometries from simpler types to more complex types. The conversion allows the load of a ST_LINESTRING column with a POINT, or a ST_POLYGON column with a POINT or LINESTRING. This capability also allows the load of arrays of ST_POLYGON data from the source arrays of a combination of POINT, LINESTRING, and POLYGON data, which represents a geometry collection.

POINT Normalization

During loading, Ocient automatically performs normalization of POINT data into a regular format used within Ocient. The Ocient System performs the following operations on POINT data during the load:

  • Constrain longitude to [-180, 180) and latitude to [-90, 90]; wrap around invalid coordinates using correct geographical handling.
  • Snap points near the pole to the pole.
  • Set longitude of points on the pole to 0
  • Remove signed zeros from coordinates, so -0 becomes 0.

POLYGON Normalization

Ocient follows a counterclockwise rotation convention to indicate the outer ring of a POLYGON. The inner ring follows a clockwise rotation. If the Ocient System loads a POLYGON with a clockwise outer ring, the system indicates that the POLYGON is outside of the given ring.

In some cases, the POLYGON load can lead to unexpected results when the source data follows a different polygon rotation convention.

To account for this, the LAT provides the st_forcepolygonccw function that forces a counterclockwise rotation of the outer polygon ring and a clockwise rotation of the inner polygon ring when applicable.

Geospatial Transformation Functions

The LAT has a select set of transformation functions to construct geospatial types and manipulate them during the load. You can find the supported geospatial transformation functions in Transformation Functions.