Type Detection¶
The YAML extension includes comprehensive automatic type detection, inferring appropriate DuckDB types from YAML values.
Overview¶
When auto_detect = true (default), the extension analyzes YAML values and assigns optimal types:
-- With auto_detect (default)
SELECT typeof(date_col), typeof(count_col), typeof(active_col)
FROM read_yaml('data.yaml');
-- Returns: DATE, INTEGER, BOOLEAN
Detected Types¶
Temporal Types¶
DATE¶
Detected formats:
| Format | Example |
|---|---|
| ISO 8601 | 2024-01-15 |
| Slash (US) | 01/15/2024 |
| Slash (EU) | 15/01/2024 |
| Dot | 15.01.2024 |
TIME¶
Detected formats:
| Format | Example |
|---|---|
| Standard | 14:30:00 |
| With milliseconds | 14:30:00.123 |
| Short | 14:30 |
TIMESTAMP¶
Detected formats:
| Format | Example |
|---|---|
| ISO 8601 with Z | 2024-01-15T14:30:00Z |
| ISO 8601 with offset | 2024-01-15T14:30:00+05:00 |
| ISO 8601 no timezone | 2024-01-15T14:30:00 |
| Space separator | 2024-01-15 14:30:00 |
# All detected as TIMESTAMP
ts1: 2024-01-15T14:30:00Z
ts2: 2024-01-15T14:30:00+05:00
ts3: 2024-01-15 14:30:00
Numeric Types¶
Numbers are detected with optimal precision:
| Type | Range | Example |
|---|---|---|
| TINYINT | -128 to 127 | 42 |
| SMALLINT | -32,768 to 32,767 | 1000 |
| INTEGER | -2.1B to 2.1B | 100000 |
| BIGINT | Larger integers | 9223372036854775807 |
| DOUBLE | Decimals | 3.14159 |
# Detected as TINYINT
small: 42
# Detected as INTEGER
medium: 100000
# Detected as BIGINT
large: 9223372036854775807
# Detected as DOUBLE
decimal: 3.14159
scientific: 1.23e-4
Special Float Values¶
Boolean Type¶
All recognized as BOOLEAN:
| True Values | False Values |
|---|---|
true |
false |
True |
False |
TRUE |
FALSE |
yes |
no |
Yes |
No |
YES |
NO |
on |
off |
On |
Off |
ON |
OFF |
y |
n |
Y |
N |
t |
f |
T |
F |
NULL Values¶
All recognized as NULL:
String Type¶
Any value not matching the above patterns defaults to VARCHAR:
Array Detection¶
Arrays are typed based on their elements:
Homogeneous Arrays¶
# INTEGER[]
numbers: [1, 2, 3, 4, 5]
# VARCHAR[]
names: [Alice, Bob, Carol]
# BOOLEAN[]
flags: [true, false, true]
Mixed-Type Arrays¶
When elements have different types, the array falls back to VARCHAR[]:
Empty Arrays¶
Empty arrays default to VARCHAR[]:
Struct Detection¶
Nested objects are detected as STRUCT types:
user:
name: John
age: 30
email: john@example.com
# Detected as: STRUCT(name VARCHAR, age TINYINT, email VARCHAR)
Controlling Type Detection¶
Disable Auto-Detection¶
Override Specific Columns¶
-- Override some types, auto-detect others
SELECT * FROM read_yaml('data.yaml',
auto_detect = true,
columns = {
'id': 'BIGINT', -- Force BIGINT instead of detected INTEGER
'price': 'DECIMAL(10,2)' -- Force DECIMAL instead of DOUBLE
}
);
Force All Column Types¶
-- Specify all column types explicitly
SELECT * FROM read_yaml('data.yaml',
auto_detect = false,
columns = {
'id': 'INTEGER',
'name': 'VARCHAR',
'created': 'TIMESTAMP',
'active': 'BOOLEAN'
}
);
Schema Sampling¶
For large datasets with multiple files, the extension samples a subset to determine schema:
| Parameter | Default | Description |
|---|---|---|
sample_size |
20480 | Number of rows to sample |
maximum_sample_files |
32 | Number of files to sample |
-- Increase sampling for complex schemas
SELECT * FROM read_yaml('data/*.yaml',
sample_size = 50000,
maximum_sample_files = 100
);
-- Sample all data (slow for large datasets)
SELECT * FROM read_yaml('data/*.yaml',
sample_size = -1,
maximum_sample_files = -1
);
Schema Mismatch Errors
If sampling misses files with different structures, you may see cast errors.
Increase sampling parameters or use explicit columns to handle this.
Type Coercion¶
When a column has mixed types across rows, DuckDB applies type coercion:
Result: value column becomes VARCHAR to accommodate both.
Best Practices¶
When to Use Auto-Detection¶
- Data has consistent, predictable types
- Quick exploration of unknown datasets
- Prototyping queries
When to Specify Types¶
- Production pipelines requiring consistent schemas
- Data with ambiguous formats (dates that look like strings)
- Performance-critical applications
- Ensuring specific numeric precision
Handling Date Ambiguity¶
Dates like 01/02/2024 are ambiguous (January 2 or February 1?):
-- Force a specific interpretation
SELECT * FROM read_yaml('dates.yaml', columns = {
'date': 'DATE'
});
-- Or use strptime for explicit parsing
SELECT strptime(date_string, '%m/%d/%Y')::DATE
FROM read_yaml('dates.yaml', auto_detect = false);
See Also¶
- Reading YAML Files - The
columnsparameter - YAML Type - Storing parsed YAML data
- Parameters Reference - All available parameters