The data pipeline functionality enables you to load data in the structured JSON data format. Use these examples to see how you can access the structure in different ways. For an overview of loading JSON data, see:
These examples load a scalar into the VARCHAR column named customer_name. Use the CREATE TABLE SQL statement to create the table with this column.
CREATE TABLE my_schema.my_table (customer_name VARCHAR NOT NULL);
For a JSON file with the value at the top level, the selector can refer to the key directly.
Load a file with this data.
Use the selector $name.
CREATE PIPELINE my_pipeline
SOURCE filesystem
FILTER 'data.json'
EXTRACT
FORMAT json
INTO my_schema.my_table
SELECT
$name AS customer_name;
List all top-level JSON keys used in the SELECT statement in JSON_FIELDS of the CREATE PIPELINE SQL statement.
For a JSON file that has a nested value, you need a complex selector.
{"a": {"b": {"name": "John"}}}
In this case, the same CREATE PIPELINE statement for this file has $a.b.name instead of $name.
CREATE PIPELINE my_pipeline
SOURCE filesystem
FILTER 'data.json'
EXTRACT
FORMAT json
INTO my_schema.my_table
SELECT
$a.b.name AS customer_name;
NULL and Empty Handling for JSON Scalars
The System handles all JSON NULL, empty, and missing values in the same way. The system loads these values as NULL. These values fail to load into non-nullable columns. Provide an explicit default in the pipeline or use the COLUMN_DEFAULT_IF_NULL option to accept the configured column default instead of attempting to load NULL values.
The JSON data contains this information.
{"a": {"b": {"name": "John"}}}
{"blah": {}}
In this case, the output table has two rows because there are two records in the JSON source file.
CREATE PIPELINE my_pipeline
SOURCE filesystem
FILTER 'data.json'
EXTRACT
FORMAT json
INTO my_schema.my_table
SELECT
$a.b.name AS customer_name;
Output
Ocient> select * from my_schema.my_table;
name
---------------------------------------------
John
NULL
Fetched 2 rows
When you load all these JSON files with the same pipeline definition, the Ocient System creates a table with the same two rows.
The middle value is empty.
{"a": {"b": {"name": "John"}}}
{"a": {"b": {}}}
The middle value is NULL.
{"a": {"b": {"name": "John"}}}
{"a": {"b": null}
The leaf value is NULL.
{"a": {"b": {"name": "John"}}}
{"a": {"b": {"name": null}}}
Supported array extraction scenarios are:
- One-dimensional arrays.
- Multi-dimensional arrays.
- Array projection is the nested application of JSON selectors over the elements of arrays for both one-dimensional and multidimensional arrays.
- Arbitrary nesting of arrays within JSON objects, including splitting array dimensions across the JSON path.
- Loading individual array elements, which can also be arrays.
Mapping a transformation function over array elements is not supported.
One-Dimensional Arrays
The next few examples load the same data into the my_table table both directly and using array projection.
CREATE TABLE my_schema.my_table (customer_names VARCHAR[] NOT NULL);
Load data directly from a JSON array of strings. The JSON data contains this information.
{
"a": {
"b": {
"names": ["John", "Bob", "Rajiv"]
}
}
}
CREATE PIPELINE my_pipeline
SOURCE filesystem
FILTER 'data.json'
EXTRACT
FORMAT json
INTO my_schema.my_table
SELECT
$a.b.names[] AS customer_names;
Output
Ocient> select * from my_schema.my_table;
customer_names
--------------------------------------------------------------------------------
[John, Bob, Rajiv]
Fetched 1 row
Array Projection
Array projection is the application of the specified selector over all elements in an array, similar to a for loop.
Load data using array projection into the same table using this JSON data.
{ "a": [ {"name": "John", "hometown": "Chicago"},
{"name": "Bob", "hometown": "Cucamonga"},
{"name": "Rajiv", "hometown": "Glendale Heights"} ] }
CREATE PIPELINE my_pipeline
SOURCE filesystem
FILTER 'data.json'
EXTRACT
FORMAT json
INTO my_schema.my_table
SELECT
$a[].name AS customer_names;
Output
Ocient> select * from my_schema.my_table;
customer_names
--------------------------------------------------------------------------------
[John, Bob, Rajiv]
Fetched 1 row
The output of both loading operations is the same.
Repeat this example with data in an array that contains nested objects. The JSON file contains this information.
{ "a": [ {"b": {"name": "John", "hometown": "Chicago"}},
{"b": {"name": "Bob", "hometown": "Cucamonga"}},
{"b": {"name": "Rajiv", "hometown": "Glendale Heights"}} ] }
CREATE PIPELINE my_pipeline
SOURCE filesystem
FILTER 'data.json'
EXTRACT
FORMAT json
INTO my_schema.my_table
SELECT
$a[].b.name AS customer_names;
Output
Ocient> select * from my_schema.my_table;
customer_names
--------------------------------------------------------------------------------
[John, Bob, Rajiv]
Fetched 1 row
The output is still the same.
Multi-Dimensional Arrays
Multi-dimensional arrays work similarly to one-dimensional arrays.
These examples load the same data into the my_table table both directly, and then using array projection.
CREATE TABLE my_schema.my_table (customer_names VARCHAR[][][] NOT NULL);
Load data directly from a JSON array of strings. The JSON file contains this information.
{ "a": {
"b": {
"names": [
[["John", "Bob", "Rajiv"], ["Anna", "Hanna", "Vanna", "Rosanna"]],
[["Masha", "Natasha", "Sasha"], ["Chad", "Thad", "Brad"]]] }
} }
CREATE PIPELINE my_pipeline
SOURCE filesystem
FILTER 'data.json'
EXTRACT
FORMAT json
INTO my_schema.my_table
SELECT
$a.b.names[][][] AS customer_names;
Load data using array projection. The JSON file contains this information.
{
"a": [
[
[
{"b": {"name": "John", "hometown": "Chicago"}},
{"b": {"name": "Bob", "hometown": "Cucamonga"}},
{"b": {"name": "Rajiv", "hometown": "Glendale Heights"}},
],
[
{"b": {"name": "Anna", "hometown": "Chicago"}},
{"b": {"name": "Hanna", "hometown": "Cucamonga"}},
{"b": {"name": "Vanna", "hometown": "Glendale Heights"}},
{"b": {"name": "Rosanna", "hometown": "Mahwah"}},
],
],
[
[
{"b": {"name": "Masha", "hometown": "Boston"}},
{"b": {"name": "Natasha", "hometown": "Cambridge"}},
{"b": {"name": "Sasha", "hometown": "Summerville"}},
],
[
{"b": {"name": "Chad", "hometown": "Champaign"}},
{"b": {"name": "Thad", "hometown": "Urbana"}},
{"b": {"name": "Brad", "hometown": "Mahomet"}},
],
],
]
}
$a[][][].b.name accesses each object, such as {"b": {"name": "John", "hometown": "Chicago"}}, and uses b.name to apply this nested attribute selector on all of the elements in the inner arrays.
CREATE PIPELINE my_pipeline
SOURCE filesystem
FILTER 'data.json'
EXTRACT
FORMAT json
INTO my_schema.my_table
SELECT
$a[][][].b.name AS customer_names;
Output
Ocient> select * from my_schema.my_table;
customer_names
--------------------------------------------------------------------------------
[[[John, Bob, Rajiv], [Anna, Hanna, Vanna, Rosanna]], [[Masha, Natasha, Sasha],
[Chad, Thad, Brad]]]
Fetched 1 row
The output of both loading operations is the same.
Multi-Dimensional Arrays With Dimensions Split Across the JSON Path
To load a two-dimensional array, for example, use a one-dimensional array that contains JSON objects, each of which contains another one-dimensional array. Load the relevant parts of this JSON object into a two-dimensional array.
In this example, the column has VARCHAR[][] type and the SELECT expression is $x.a[].b.c[].name. The loaded data is a two-dimensional array of names.
The JSON file contains this information.
{
"x": {
"a": [
{
"b": {
"c": [
{"name": "John", "hometown": "Chicago"},
{"name": "Bob", "hometown": "Cucamonga"},
{"name": "Rajiv", "hometown": "Glendale Heights"},
]
}
},
{
"b": {
"c": [
{"name": "Anna", "hometown": "Chicago"},
{"name": "Hanna", "hometown": "Cucamonga"},
{"name": "Vanna", "hometown": "Glendale Heights"},
{"name": "Rosanna", "hometown": "Mahwah"},
]
}
},
]
}
}
CREATE TABLE my_schema.my_table (customer_names VARCHAR[][]);
CREATE PIPELINE my_pipeline
SOURCE filesystem
FILTER 'data.json'
EXTRACT
FORMAT json
INTO my_schema.my_table
SELECT
$x.a[].b.c[].name AS customer_names;
Output
Ocient> select * from my_schema.my_table;
customer_names
--------------------------------------------------------------------------------
[[John, Bob, Rajiv], [Anna, Hanna, Vanna, Rosanna]]
Fetched 1 row
The output is a two-dimensional array of names.
Extract data from an element of a JSON array. That element can also be an array.
In this example, extract the first element of a, which is an array. JSON array indexes start at 1 so that all array indexing in Pipelines are consistent with the SQL standard.
The JSON file contains this information.
{
"a": [
{"b": [{"name": "John", "hometown": "Chicago"}]},
{
"b": [
{"name": "Bob", "hometown": "Cucamonga"},
{"name": "Rajiv", "hometown": "Glendale Heights"},
]
},
{"b": [{"name": "Rajiv", "hometown": "Glendale Heights"}]},
]
}
CREATE TABLE my_schema.my_table (customer_names VARCHAR[]);
CREATE PIPELINE my_pipeline
SOURCE filesystem
FILTER 'data.json'
EXTRACT
FORMAT json
INTO my_schema.my_table
SELECT
$a[2].b[].name AS customer_names;
Output
Ocient> select * from my_table.my_schema;
customer_names
--------------------------------------------------------------------------------
[Bob, Rajiv]
Fetched 1 row
NULL and Empty Handling for JSON Arrays
The Ocient System handles NULL, empty, and missing values the same way for arrays as for scalars. The system converts a value that is NULL, empty, or missing to NULL and loads it as NULL.
Provide an explicit default in the pipeline or use the COLUMN_DEFAULT_IF_NULL option to accept the configured column default instead of attempting to load NULL.
Tuples of Scalars
Data pipeline loading supports tuples of scalars in the JSON source.
The specification of tuple elements using curly braces is only supported for the basic case (i.e. $a.{b,c}). More complex selectors are not supported.
The recommended way of specifying tuples is to use an individual JSON selector for each tuple element, such as $a.b or $a.c.
You can apply functions to tuple elements.
Tuple Construction
This example loads two strings into a simple tuple.
{"users_hometowns": {"a": {"name": "Bob", "hometown": "Cucamonga"}}}
CREATE TABLE my_schema.my_table (users_hometowns TUPLE<<VARCHAR,VARCHAR>>);
CREATE PIPELINE my_pipeline
...
SELECT
tuple<<char,char>>($users_hometowns.a.name, $users_hometowns.a.hometown)
AS users_hometowns;
Output
Ocient> select * from my_schema.my_table;
users_hometowns
--------------------------------------------------------------------------------
<<Bob, Cucamonga>>
Fetched 1 row
Advanced Tuple Construction
Use these examples to explore how to construct more complex tuples with nested objects and arrays.
Select Fields from a Nested Object into a Tuple
When the target fields are nested inside a named object, prefix the braces with the path to that object.
Selector:
Transform equivalent:
This data pipeline constructs a TUPLE<<VARCHAR, INT>> from fields b and c inside object a.
PREVIEW PIPELINE test_pl
SOURCE INLINE '{"a": {"b": "two", "c": 2}}'
EXTRACT FORMAT JSON
INTO test.tgt_tbl
SELECT $a.{b, c} AS tgt_col4;
Output
tgt_col4
--------------------------------------------------------------------------------
("two", 2)
Select an Array of Tuples from an Array of Objects
Appending [] to the parent path maps the tuple selector over every element in the array, producing an array of tuples.
Selector:
Transform equivalent:
zip_with($a[].b, $a[].c, tuple)
This data pipeline produces a TUPLE<<INT, VARCHAR>>[] from an array of objects, each containing fields b and c.
PREVIEW PIPELINE test_pl
SOURCE INLINE '{"a": [{"b": 1, "c": "John"},
{"b": 2, "c": "Bob"},
{"b": 3, "c": "Rajiv"}]}'
EXTRACT FORMAT JSON
INTO test.tgt_tbl
SELECT $a[].{b, c} AS tgt_col5;
Output
tgt_col5
--------------------------------------------------------------------------------
[(1, "John"),(2, "Bob"),(3, "Rajiv")]
Select a Tuple Containing an Array Element
Individual fields inside braces can have their own brackets [] to indicate that the field is an array. This selector produces a tuple with one element, an array, and the other a scalar.
Selector:
Transform equivalent:
This data pipeline constructs a TUPLE<<VARCHAR[], INT>> where the first element is an array of strings and the second element is an integer.
PREVIEW PIPELINE test_pl
SOURCE INLINE '{"a": {"b": ["John", "Bob", "Rajiv"], "c": 3}}'
EXTRACT FORMAT JSON
INTO test.tgt_tbl
SELECT $a.{b[], c} AS tgt_col6;
Output
tgt_col6
--------------------------------------------------------------------------------
([John,Bob,Rajiv], 3)
Select an Array of Tuples with an Array Element
Combining brackets [] on the parent path with [] on a child field produces an array of tuples where each tuple contains an array element and a scalar element.
Selector:
Transform equivalent:
zip_with($a[].b[], $a[].c, tuple)
This data pipeline produces a TUPLE<<VARCHAR[], INT>>[] from an array of objects that each contain a nested array.
PREVIEW PIPELINE test_pl
SOURCE INLINE '{"a": [{"b": ["John", "Bob", "Rajiv"], "c": 4},
{"b": ["Rajiv", "Bob", "John"], "c": 5}]}'
EXTRACT FORMAT JSON
INTO test.tgt_tbl
SELECT $a[].{b[], c} AS tgt_col7;
Output
tgt_col7
--------------------------------------------------------------------------------
[([John,Bob,Rajiv], 4),([Rajiv,Bob,John], 5)]
Select a Tuple with Nested Tuple Elements
Nesting tuple selectors produces tuples where the elements are also tuples.
Selector:
Transform equivalent:
tuple(tuple($a.b, $a.c), tuple($d.e, $d.f))
This data pipeline constructs a TUPLE<<TUPLE<<VARCHAR, INT>>, TUPLE<<INT, VARCHAR>>>> from two nested objects.
PREVIEW PIPELINE test_pl
SOURCE INLINE '{"a": {"b": "John", "c": 6},
"d": {"f": "Bob", "e": 7}}'
EXTRACT FORMAT JSON
INTO test.tgt_tbl
SELECT $.{a.{b, c}, d.{e, f}} AS tgt_col8;
Output
tgt_col8
--------------------------------------------------------------------------------
(("John", 6), ("Bob", 7))
Select an Array of Tuples with a Nested Tuple Element
A dot-brace group preceded by . (with no field name) creates an inline tuple element within an outer tuple. When you combine such an element with brackets [] on the parent path, this selector produces an array of tuples where one element is a tuple.
Selector:
$a[].{b, .{c[1], c[2], c[3]}}
Transform equivalent:
zip_with($a[].b, zip_with($a[].c[1], $a[].c[2], $a[].c[3], tuple), tuple)
This data pipeline produces a TUPLE<<INT, TUPLE<<VARCHAR, VARCHAR, VARCHAR>>>>[] by selecting a scalar field and an inline tuple of indexed array elements.
PREVIEW PIPELINE test_pl
SOURCE INLINE '{"a": [{"c": ["John", "Bob", "Rajiv"], "b": 4},
{"c": ["Rajiv", "Bob", "John"], "b": 5}]}'
EXTRACT FORMAT JSON
INTO test.tgt_tbl
SELECT $a[].{b, .{c[1], c[2], c[3]}} AS tgt_col10;
Output
tgt_col10
--------------------------------------------------------------------------------
[(4, ("John", "Bob", "Rajiv")),(5, ("Rajiv", "Bob", "John"))]
In this example, load a tuple where elements have different types, VARCHAR and TIMESTAMP, and where the second element has to be converted from a JSON string into an Ocient timestamp. The JSON file contains this information.
{"event_timestamp": {"event": "my_event", "timestamp": "1980-02-03 15:16:17.12345"}}
CREATE TABLE my_schema.my_table (event_timestamp TUPLE<<VARCHAR,TIMESTAMP>> NOT NULL);
CREATE PIPELINE my_pipeline
...
SELECT
tuple<<char,timestamp>>($event_timestamp.event,
TO_TIMESTAMP(char($event_timestamp.timestamp), 'yyyy-MM-dd HH:mm:ss.SSSSSS'))
AS event_timestamp;
Output
Ocient> select * from my_schema.my_table;
event_timestamp
--------------------------------------------------------------------------------
<<my_event, 1980-02-03 15:16:17.123450000>>
Fetched 1 row
The output table has one row of tuples.
NULL and Empty Handling for JSON Tuples
All the rules for handling NULL, empty, and missing elements that apply to scalars and arrays also apply to tuples. If any part of the selector is NULL, empty, or missing, data pipeline loading converts that value to NULL.
Additionally, because you can apply functions to tuple elements (and not array elements), you can use the NULL_IF function to convert a tuple element to NULL. For example, tuple<<char,varchar>>( $a.name, NULL_IF($a.hometown, 'N/A') ) indicates to the pipeline that the string 'N/A' signifies NULL for the hometown element but not for the name element.
Special Characters in JSON Keys
This example shows loading data where the JSON key names contain a special character.
Create a table for the load using the customer_name column.
CREATE TABLE my_schema.my_table (customer_name VARCHAR NOT NULL);
Load a file with this data. The JSON key name contains the - special character.
Use double quotes for the selector $first-name in the CREATE PIPELINE SQL statement.
CREATE PIPELINE my_pipeline
SOURCE filesystem
FILTER 'data.json'
EXTRACT
FORMAT json
INTO my_schema.my_table
SELECT
$"first-name" AS customer_name;
Other operations exist in the source field selector syntax to flatten or compact arrays while extracting the data.
Flatten Arrays
Flattening means that the extracted array should not increase the rank, and instead, sub-arrays should be concatenated. To flatten an array, use the underscore character in the array selector (e.g., $data.array[_]).
Example
This JSON data contains a nested array with two sub-arrays having data [1,2,3] and [4,5,6].
{
a: {
b: [
{c: [1,2,3] },
{c: [4,5,6] }
]
}
}
Flatten the nested array.
Output: [1,2,3,4,5,6]
This selected array has rank 1, and not 2. Contrast this with the non-flattened selector a.b[].c[], which would have the output [[1,2,3],[4,5,6]].
You can also flatten arrays with the FLATTEN function. For details, see Array Data Transformation Functions. The functionality of the operator and function is equivalent.
Compact Arrays
Compaction eliminates NULL values from the output data. The exact bit pattern of NULL in the source data is source-type dependent. For JSON, null is a literal keyword that is unambiguous. For CSV or other less-defined types, the configuration determines which exact bits equate to NULL. To compact an array, use the exclamation mark character in the array selector(e.g., $data.array[!]).
Example
This JSON data contains a nested array with four NULL values.
{
a: {
b: {
c: [ 1,null,2,null,3,null,null]
}
}
}
Compact the nested array.
Output: [1,2,3]
You can also compact arrays with the ARRAY_COMPACT function. For details, see Array Data Transformation Functions. The functionality of the operator and function is equivalent.
Data Pipeline Load of Parquet Data from S3
Data Pipeline Load of Parquet Data from S3
Data Types for Data Pipelines
Data Formats for Data Pipelines
Monitor Data Pipelines