Reading YAML Files¶

The YAML extension provides flexible options for reading YAML files into DuckDB tables.

Direct File Access¶

The simplest way to query YAML files is using the file path directly in the FROM clause:

-- Single file
SELECT * FROM 'config.yaml';

-- Glob pattern
SELECT * FROM 'data/*.yaml';

-- Both .yaml and .yml extensions work
SELECT * FROM 'settings.yml';

read_yaml Function¶

For more control, use the read_yaml table function:

SELECT * FROM read_yaml('path/to/file.yaml');

Input Types¶

The function accepts several input formats:

Single FileGlob PatternFile ListDirectory

SELECT * FROM read_yaml('config.yaml');

SELECT * FROM read_yaml('data/*.yaml');
SELECT * FROM read_yaml('configs/**/*.yml');

SELECT * FROM read_yaml(['config1.yaml', 'config2.yaml', 'config3.yaml']);

-- Reads all .yaml and .yml files in the directory
SELECT * FROM read_yaml('data/configs/');

Parameters¶

Parameter	Type	Default	Description
`auto_detect`	BOOLEAN	`true`	Enable automatic type detection
`columns`	STRUCT	-	Explicit column type specification
`records`	VARCHAR	-	Path to nested array of records (dot notation)
`multi_document`	BOOLEAN	`true`	Handle multiple YAML documents
`expand_root_sequence`	BOOLEAN	`true`	Expand top-level sequences into rows
`ignore_errors`	BOOLEAN	`false`	Continue on parsing errors
`maximum_object_size`	INTEGER	16777216	Maximum file size in bytes (16MB)
`sample_size`	INTEGER	20480	Rows to sample for schema detection
`maximum_sample_files`	INTEGER	32	Files to sample for schema detection

auto_detect¶

Controls automatic type detection for columns:

-- Enabled (default): dates, times, numbers detected automatically
SELECT * FROM read_yaml('data.yaml', auto_detect = true);

-- Disabled: all values remain as VARCHAR
SELECT * FROM read_yaml('data.yaml', auto_detect = false);

columns¶

Specify explicit column types:

SELECT * FROM read_yaml('data.yaml', columns = {
    'id': 'INTEGER',
    'name': 'VARCHAR',
    'created_at': 'TIMESTAMP',
    'metadata': 'JSON'
});

See Column Type Specification for details.

records¶

Extract records from a nested array within the YAML structure using dot notation:

# config.yaml
metadata:
  version: "1.0"
  author: admin
data:
  users:
    - id: 1
      name: Alice
      role: admin
    - id: 2
      name: Bob
      role: user

-- Extract the nested users array as rows
SELECT * FROM read_yaml('config.yaml', records = 'data.users');

id	name	role
1	Alice	admin
2	Bob	user

The records parameter supports:

Single-level paths: records = 'items'
Nested paths: records = 'data.users' or records = 'config.database.tables'
Unicode keys: records = 'données.éléments'

When to use records

Use records when your YAML file contains metadata at the top level and the actual data records are nested within a specific field.

multi_document¶

Handle files with multiple YAML documents:

---
id: 1
name: First
---
id: 2
name: Second

-- Each document becomes a row (default)
SELECT * FROM read_yaml('multi.yaml', multi_document = true);

-- Disable to read only the first document
SELECT * FROM read_yaml('multi.yaml', multi_document = false);

expand_root_sequence¶

Control how top-level sequences are handled:

- name: Alice
  age: 30
- name: Bob
  age: 25

-- Each item becomes a row (default)
SELECT * FROM read_yaml('people.yaml', expand_root_sequence = true);
-- Returns: 2 rows

-- Keep as a single row with array
SELECT * FROM read_yaml('people.yaml', expand_root_sequence = false);
-- Returns: 1 row with array column

ignore_errors¶

Continue processing when some files or documents have errors:

-- Stop on first error (default)
SELECT * FROM read_yaml('data/*.yaml', ignore_errors = false);

-- Skip invalid files/documents
SELECT * FROM read_yaml('data/*.yaml', ignore_errors = true);

Sampling Parameters¶

Control schema detection for large datasets:

-- Sample more rows for complex schemas
SELECT * FROM read_yaml('data/*.yaml', sample_size = 50000);

-- Sample more files
SELECT * FROM read_yaml('data/*.yaml', maximum_sample_files = 100);

-- Unlimited sampling (reads all data for schema detection)
SELECT * FROM read_yaml('data/*.yaml',
    sample_size = -1,
    maximum_sample_files = -1
);

When to increase sampling

If you see type mismatch errors, some files may have different schemas. Increase sample_size or maximum_sample_files to capture all variations.

Column Type Specification¶

Explicitly define column types to ensure consistent schemas:

Basic Types¶

SELECT * FROM read_yaml('data.yaml', columns = {
    'id': 'INTEGER',
    'name': 'VARCHAR',
    'price': 'DECIMAL(10,2)',
    'active': 'BOOLEAN',
    'created': 'TIMESTAMP'
});

Complex Types¶

SELECT * FROM read_yaml('config.yaml', columns = {
    'server': 'STRUCT(host VARCHAR, port INTEGER)',
    'features': 'VARCHAR[]',
    'settings': 'MAP(VARCHAR, VARCHAR)'
});

Preserving YAML Structure¶

SELECT * FROM read_yaml('nested.yaml', columns = {
    'config': 'YAML',    -- Keep as YAML for later processing
    'metadata': 'JSON'   -- Convert to JSON type
});

Partial Specification¶

Specify some columns, let others be auto-detected:

SELECT * FROM read_yaml('data.yaml',
    auto_detect = true,
    columns = {
        'id': 'BIGINT',      -- Force this to BIGINT
        'name': 'VARCHAR'     -- Force this to VARCHAR
    }
    -- Other columns auto-detected
);

read_yaml_objects¶

For preserving the complete document structure (one row per file):

SELECT * FROM read_yaml_objects('config.yaml');

This is useful when you want to:

Keep the entire YAML as a single value
Process complex nested structures
Maintain document boundaries

Multi-Document YAML¶

Files with multiple YAML documents (separated by ---) are handled automatically:

---
id: 1
type: user
name: Alice
---
id: 2
type: admin
name: Bob

SELECT * FROM read_yaml('users.yaml');
-- Returns 2 rows

Sequence Handling¶

Top-level sequences are expanded into rows:

- name: Alice
  score: 95
- name: Bob
  score: 87
- name: Carol
  score: 92

SELECT * FROM read_yaml('scores.yaml');
-- Returns 3 rows with columns: name, score

Examples¶

Reading Configuration Files¶

# app.yaml
database:
  host: localhost
  port: 5432
  name: myapp
cache:
  enabled: true
  ttl: 3600

SELECT
    database,
    yaml_extract_string(database, '$.host') AS db_host,
    yaml_extract(database, '$.port') AS db_port,
    cache
FROM read_yaml('app.yaml');

Combining Multiple Files¶

-- Read all configs and include filename
SELECT
    filename,
    yaml_extract_string(data, '$.name') AS name,
    yaml_extract(data, '$.version') AS version
FROM read_yaml_objects('configs/*.yaml');

Creating Tables from YAML¶

-- Create a table from YAML with explicit types
CREATE TABLE events AS
SELECT * FROM read_yaml('events.yaml', columns = {
    'id': 'INTEGER',
    'timestamp': 'TIMESTAMP',
    'type': 'VARCHAR',
    'data': 'JSON'
});