feat: add monte-carlo data observability skills (#481)

* feat: add monte-carlo data observability skills

Add 4 data observability skills from monte-carlo-data/mc-agent-toolkit:
- monte-carlo-prevent: table health, alerts, lineage before SQL/dbt edits
- monte-carlo-monitor-creation: monitor creation via MCP tools
- monte-carlo-push-ingestion: push ingestion API guide
- monte-carlo-validation-notebook: SQL validation notebooks for dbt PRs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add monte-carlo-data/mc-agent-toolkit to community contributors

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: replace skill summaries with full original content

Copy the actual SKILL.md content from monte-carlo-data/mc-agent-toolkit
instead of rewritten summaries. Only the frontmatter is changed to match
the antigravity template format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add reference docs, scripts, and templates for all 4 skills

Addresses Codex review feedback — copies supporting files from
monte-carlo-data/mc-agent-toolkit that the SKILL.md files reference:

- prevent: references/workflows.md, parameters.md, TROUBLESHOOTING.md
- monitor-creation: references/ (5 monitor type docs)
- push-ingestion: references/ (8 docs) + scripts/templates/ (6 warehouses)
- validation-notebook: scripts/ (2 helper scripts)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: update script paths to match antigravity directory name

${CLAUDE_PLUGIN_ROOT}/skills/generate-validation-notebook/scripts/ →
${CLAUDE_PLUGIN_ROOT}/skills/monte-carlo-validation-notebook/scripts/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in: cryptoque, 2026-04-09 10:54:12 -07:00, committed by GitHub
parent 0107845f32, commit ea0eb05fc6
76 changed files with 15461 additions and 0 deletions


@@ -284,6 +284,7 @@ This collection would not be possible without the incredible work of the Claude
### Community Contributors
- **[monte-carlo-data/mc-agent-toolkit](https://github.com/monte-carlo-data/mc-agent-toolkit)**: Monte Carlo data observability skills — table health checks, change impact assessment, monitor creation, push ingestion, and SQL validation notebooks for dbt changes.
- **[pumanitro/global-chat](https://github.com/pumanitro/global-chat)**: Source for the Global Chat Agent Discovery skill - cross-protocol discovery of MCP servers and AI agents across multiple registries.
- **[bitjaru/styleseed](https://github.com/bitjaru/styleseed)**: StyleSeed Toss UI and UX skill collection - setup wizard, page and pattern generation, design-token management, accessibility review, UX audits, feedback states, and microcopy guidance for professional mobile-first UI.
- **[milkomida77/guardian-agent-prompts](https://github.com/milkomida77/guardian-agent-prompts)**: Source for the Multi-Agent Task Orchestrator skill - production-tested delegation patterns, anti-duplication, and quality gates for coordinated agent work.


@@ -0,0 +1,222 @@
---
name: monte-carlo-monitor-creation
description: "Guides creation of Monte Carlo monitors via MCP tools, producing monitors-as-code YAML for CI/CD deployment."
category: data
risk: safe
source: community
source_repo: monte-carlo-data/mc-agent-toolkit
source_type: community
date_added: "2026-04-08"
author: monte-carlo-data
tags: [data-observability, monitoring, monte-carlo, monitors-as-code]
tools: [claude, cursor, codex]
---
# Monte Carlo Monitor Creation Skill
This skill teaches you to create Monte Carlo monitors correctly via MCP. Every creation tool runs in **dry-run mode** and returns monitors-as-code (MaC) YAML. No monitors are created directly -- the user applies the YAML via the Monte Carlo CLI or CI/CD.
Reference files live next to this skill file. **Use the Read tool** (not MCP resources) to access them:
- Metric monitor details: `references/metric-monitor.md` (relative to this file)
- Validation monitor details: `references/validation-monitor.md` (relative to this file)
- Custom SQL monitor details: `references/custom-sql-monitor.md` (relative to this file)
- Comparison monitor details: `references/comparison-monitor.md` (relative to this file)
- Table monitor details: `references/table-monitor.md` (relative to this file)
## When to activate this skill
Activate when the user:
- Asks to create, add, or set up a monitor (e.g. "add a monitor for...", "create a freshness check on...", "set up validation for...")
- Mentions monitoring a specific table, field, or metric
- Wants to check data quality rules or enforce data contracts
- Asks about monitoring options for a table or dataset
- Requests monitors-as-code YAML generation
- Wants to add monitoring after new transformation logic (when the prevent skill is not active)
## When NOT to activate this skill
Do not activate when the user is:
- Just querying data or exploring table contents
- Triaging or responding to active alerts (use the prevent skill's Workflow 3)
- Running impact assessments before code changes (use the prevent skill's Workflow 4)
- Asking about existing monitor configuration (use `getMonitors` directly)
- Editing or deleting existing monitors
---
## Available MCP tools
All tools are available via the `monte-carlo` MCP server.
| Tool | Purpose |
| ---------------------------- | ---------------------------------------------------------- |
| `testConnection` | Verify auth and connectivity before starting |
| `search` | Find tables/assets by name; use `include_fields` for columns |
| `getTable` | Schema, stats, metadata, domain membership, capabilities |
| `getValidationPredicates` | List available validation rule types for a warehouse |
| `getDomains` | List MC domains (only needed if table has no domain info) |
| `createMetricMonitorMac` | Generate metric monitor YAML (dry-run) |
| `createValidationMonitorMac` | Generate validation monitor YAML (dry-run) |
| `createComparisonMonitorMac` | Generate comparison monitor YAML (dry-run) |
| `createCustomSqlMonitorMac` | Generate custom SQL monitor YAML (dry-run) |
| `createTableMonitorMac` | Generate table monitor YAML (dry-run) |
---
## Monitor types
| Type | Tool | Use When |
| -------------- | ---------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| **Metric** | `createMetricMonitorMac` | Track statistical metrics on fields (null rates, unique counts, numeric stats) or row count changes over time. Requires a timestamp field for aggregation. |
| **Validation** | `createValidationMonitorMac` | Row-level data quality checks with conditions (e.g. "field X is never null", "status is in allowed set"). Alerts on INVALID data. |
| **Custom SQL** | `createCustomSqlMonitorMac` | Run arbitrary SQL returning a single number and alert on thresholds. Most flexible; use when other types don't fit. |
| **Comparison** | `createComparisonMonitorMac` | Compare metrics between two tables (e.g. dev vs prod, source vs target). |
| **Table** | `createTableMonitorMac` | Monitor groups of tables for freshness, schema changes, and volume. Uses asset selection at database/schema level. |
---
## Procedure
Follow these steps in order. Do NOT skip steps.
### Validation Phase (Steps 1-3) -- MUST complete before any creation tool is called
The number one error pattern is agents skipping validation and calling a creation tool with guessed or incomplete parameters. **Every field in the creation call must be grounded in data retrieved during this phase.** Do not proceed to Step 4 until Steps 1-3 are fully satisfied.
#### Step 1: Understand the request
Ask yourself:
- What does the user want to monitor? (a specific table, a metric, a data quality rule, cross-table consistency, freshness/volume at schema level)
- Which monitor type fits? Use the monitor types table above.
- Does the user have all the details, or do they need guidance?
If the user's intent is unclear, ask a focused question before proceeding.
#### Step 2: Identify the table(s) and columns
If you don't have the table MCON:
1. Use `search` with the table name and `include_fields: ["field_names"]` to find the MCON and get column names.
2. If the user provided a full table ID like `database:schema.table`, search for it.
3. Once you have the MCON, call `getTable` with `include_fields: true` and `include_table_capabilities: true` to verify capabilities and get domain info.
If you already have the MCON:
1. Call `getTable` with the MCON, `include_fields: true`, and `include_table_capabilities: true`.
**CRITICAL: You need the actual column names from `getTable` results. NEVER guess or hallucinate column names.** This is the most common source of monitor creation failures.
For monitor types that require a timestamp column (metric monitors), review the column names and identify likely timestamp candidates. Present them to the user if ambiguous.
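As an illustrative sketch of these lookups (the table name and MCON are hypothetical, and the exact argument key for the search term is an assumption; `include_fields` and the `getTable` flags come from the steps above):

```json
{
  "search": {
    "query": "core.orders",
    "include_fields": ["field_names"]
  },
  "getTable": {
    "mcon": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++warehouse:core.orders",
    "include_fields": true,
    "include_table_capabilities": true
  }
}
```

Once `getTable` returns, every column name used in later steps must come from its field list, never from memory.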
#### Step 3: Handle domain assignment
Monitors must be assigned to a domain that contains the table being monitored. The `getTable` response includes a `domains` list with `uuid` and `name`.
1. If `domains` is empty: skip domain assignment.
2. If `domains` has exactly one entry: default `domain_id` to that domain's UUID.
3. If `domains` has multiple entries: present only those domains and ask the user to pick.
Do NOT present all account domains as options -- only domains that contain the table are valid.
**ALWAYS check the table's `domains` BEFORE calling any creation tool.**
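For illustration, suppose `getTable` returns a `domains` list like this (UUID and name hypothetical):

```json
{
  "domains": [
    { "uuid": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "name": "finance" }
  ]
}
```

With exactly one entry, default `domain_id` to that UUID without asking; with `"domains": []`, omit `domain_id` entirely.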
---
### Creation Phase (Steps 4-8)
Only enter this phase after the validation phase is complete with real data from MCP tools.
#### Step 4: Load the sub-skill reference
Based on the monitor type, read the detailed reference for parameter guidance:
- **Metric** -- Read the detailed reference: `references/metric-monitor.md` (relative to this file)
- **Validation** -- Read the detailed reference: `references/validation-monitor.md` (relative to this file)
- **Custom SQL** -- Read the detailed reference: `references/custom-sql-monitor.md` (relative to this file)
- **Comparison** -- Read the detailed reference: `references/comparison-monitor.md` (relative to this file)
- **Table** -- Read the detailed reference: `references/table-monitor.md` (relative to this file)
#### Step 5: Ask about scheduling
**Skip this step for table monitors.** Table monitors do not support the `schedule` field in MaC YAML — adding it will cause a validation error on `montecarlo monitors apply`. Table monitor scheduling is managed automatically by Monte Carlo.
For all other monitor types, the creation tools default to a fixed schedule running every 60 minutes. Present these options:
1. **Fixed interval** -- any integer for `interval_minutes` (30, 60, 90, 120, 360, 720, 1440, etc.)
2. **Dynamic** -- MC auto-determines when to run based on table update patterns.
3. **Loose** -- runs once per day.
Schedule format in MaC YAML:
- Fixed: `schedule: { type: fixed, interval_minutes: <N> }`
- Dynamic: `schedule: { type: dynamic }`
- Loose: `schedule: { type: loose, start_time: "00:00" }`
#### Step 6: Confirm with the user
Before calling the creation tool, present the monitor configuration in plain language:
- Monitor type
- Target table (and columns if applicable)
- What it checks / what triggers an alert
- Domain assignment
- Schedule
Ask: "Does this look correct? I'll generate the monitor configuration."
**NEVER call the creation tool without user confirmation.**
#### Step 7: Create the monitor
Call the appropriate creation tool with the parameters built in previous steps. Always pass an MCON when possible. If only table name is available, also pass warehouse.
#### Step 8: Present results
**CRITICAL: Always include the YAML in your response.** The user needs copy-pasteable YAML.
1. If a non-default schedule was chosen, modify the schedule section in the YAML before presenting.
2. Wrap the YAML in the full MaC structure (see "MaC YAML format" section below).
3. ALWAYS present the full YAML in a fenced `yaml` code block.
4. Explain where to put it and how to apply it (see below).
5. ALWAYS use ISO 8601 format for datetime values.
6. NEVER reformat YAML values returned by creation tools.
---
## MaC YAML format
The YAML returned by creation tools is the monitor definition. It must be wrapped in the standard MaC structure to be applied:
```yaml
montecarlo:
<monitor_type>:
- <returned yaml>
```
For example, a metric monitor would look like:
```yaml
montecarlo:
metric:
- <yaml returned by createMetricMonitorMac>
```
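Combining this wrapper with a Step 5 schedule override, a sketch of a final monitor file (the tool-returned fields are elided here, not invented) might look like:

```yaml
montecarlo:
  metric:
    - # fields exactly as returned by createMetricMonitorMac, unmodified
      schedule:
        type: fixed
        interval_minutes: 120
```

Only the `schedule` block is edited; everything else stays byte-for-byte as the tool returned it, per the rules in Step 8.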
**Important:** `montecarlo.yml` (without a directory path) is a separate Monte Carlo project configuration file -- it is NOT the same as a monitor definition file. Monitor definitions go in their own `.yml` files, typically in a `monitors/` directory or alongside dbt model schema files.
Tell the user:
- Save the YAML to a `.yml` file (e.g. `monitors/<table_name>.yml` or in their dbt schema)
- Apply via the Monte Carlo CLI: `montecarlo monitors apply --namespace <namespace>`
- Or integrate into CI/CD for automatic deployment on merge
---
## Common mistakes to avoid
- **NEVER guess column names.** Always get them from `getTable`.
- **NEVER skip the confirmation step** (Step 6).
- For metric monitors, `aggregate_time_field` MUST be a real timestamp column from the table.
- For validation monitors, conditions match INVALID data, not valid data.
- Always pass an MCON when possible. If only table name is available, also pass warehouse.
- **ALWAYS check table's `domains` BEFORE calling any creation tool.**
- ALWAYS use ISO 8601 format for datetime values.
- NEVER reformat YAML values returned by creation tools.
- Do not call creation tools before the validation phase is complete.


@@ -0,0 +1,426 @@
# Comparison Monitor Reference
Detailed reference for building `createComparisonMonitorMac` tool calls.
## When to Use
Use a comparison monitor when the user wants to:
- Compare data between two tables (e.g., source vs target, dev vs prod)
- Validate data consistency after migration or replication
- Check row count parity across environments
- Compare field-level metrics between tables (null counts, sums, distributions)
---
## Pre-Step: Verify Both Tables and Fields
Before constructing alert conditions, you MUST verify that both tables exist and that any referenced fields are real columns. This is the most common source of comparison monitor failures.
1. **Resolve both MCONs.** Use `search` to find the source and target tables. If the user provided `database:schema.table` format, search for each to get the MCON.
2. **Get full schemas.** Call `getTable` with `include_fields: true` on BOTH the source table and the target table. You need the column lists from both.
3. **For field-level metrics, verify fields exist on both sides.** Confirm that `sourceField` exists in the source table's column list AND `targetField` exists in the target table's column list. Field names are case-sensitive on most warehouses.
4. **Check field type compatibility.** The metric must be compatible with the column types on both sides. For example, `NUMERIC_MEAN` requires numeric columns in both the source and target tables. If the source column is numeric but the target is a string, the comparison will fail.
5. If any field does not exist or types are incompatible, stop and ask the user to clarify. Do not guess.
---
## Required Parameters
| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | string | Unique identifier for the monitor. Use a descriptive slug (e.g., `orders_dev_prod_compare`). |
| `description` | string | Human-readable description of what the monitor checks. |
| `source_table` | string | Source table MCON (preferred) or `database:schema.table` format. If not MCON, also pass `source_warehouse`. |
| `target_table` | string | Target table MCON (preferred) or `database:schema.table` format. If not MCON, also pass `target_warehouse`. |
| `alert_conditions` | array | List of comparison conditions (see Alert Conditions below). |
## Optional Parameters
| Parameter | Type | Description |
|-----------|------|-------------|
| `source_warehouse` | string | Warehouse name or UUID for the source table. Required if `source_table` is not an MCON. |
| `target_warehouse` | string | Warehouse name or UUID for the target table. Required if `target_table` is not an MCON. |
| `segment_fields` | array of string | Fields to segment the comparison by. Must exist in BOTH tables with the same name. |
| `domain_id` | string (uuid) | Domain UUID (use `getDomains` to list). Only one domain can be assigned per monitor. |
---
## Cross-Warehouse Comparisons
When the source and target tables live in different warehouses (e.g., comparing a Snowflake staging table against a BigQuery production table), you MUST provide both `source_warehouse` and `target_warehouse` explicitly. The tool cannot auto-resolve warehouses when tables are in different environments.
Even when both tables are identified by MCONs, if they belong to different warehouses, pass both warehouse parameters to be safe. Omitting them in cross-warehouse scenarios can cause silent failures or incorrect results.
Common cross-warehouse patterns:
- **Dev vs prod:** same warehouse type, different databases or schemas
- **Migration validation:** source in old warehouse, target in new warehouse
- **Replication checks:** primary warehouse vs replica or downstream warehouse
---
## Alert Conditions
Each condition compares a metric between the source and target tables.
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `metric` | string | Yes | The metric to compare (see Metrics Reference below). |
| `sourceField` | string | For field-level metrics | Column in the source table. Required for ALL metrics except `ROW_COUNT`. |
| `targetField` | string | For field-level metrics | Column in the target table. Required for ALL metrics except `ROW_COUNT`. |
| `thresholdValue` | number | No | Threshold for acceptable difference between source and target. |
| `isThresholdRelative` | boolean | No | `false` = absolute difference (default), `true` = percentage difference. |
| `customMetric` | object | No | Custom SQL expressions for source and target (see Custom Metrics below). |
---
## ROW_COUNT and Fields: A Critical Rule
> **NEVER pass `sourceField` or `targetField` when using the `ROW_COUNT` metric.**
`ROW_COUNT` is a table-level metric -- it counts all rows in the table, not values in a column. Passing field names with `ROW_COUNT` causes the API call to fail or produce unexpected behavior.
This is the single most common mistake with comparison monitors. Before submitting any alert condition with `ROW_COUNT`, verify that `sourceField` and `targetField` are both absent from the condition object.
| Metric | Fields needed? | What happens if you pass fields? |
|--------|---------------|----------------------------------|
| `ROW_COUNT` | **No -- NEVER pass fields** | API error or undefined behavior |
| All other metrics | **Yes -- always pass both fields** | Required for the comparison to work |
---
## Metrics Reference
### Table-level metric (no fields needed)
| Metric | Description |
|--------|-------------|
| `ROW_COUNT` | Compare total row counts between source and target. |
### Field-level metrics (require `sourceField` and `targetField`)
#### Uniqueness and duplicates
| Metric | Description |
|--------|-------------|
| `UNIQUE_COUNT` | Count of distinct values. |
| `DUPLICATE_COUNT` | Count of duplicate (non-unique) values. |
| `APPROX_DISTINCT_COUNT` | Approximate distinct count (faster on large tables). |
#### Null and empty checks
| Metric | Description |
|--------|-------------|
| `NULL_COUNT` | Count of null values. |
| `NON_NULL_COUNT` | Count of non-null values. |
| `EMPTY_STRING_COUNT` | Count of empty string values. |
| `TEXT_ALL_SPACES_COUNT` | Count of values that are all whitespace. |
| `NAN_COUNT` | Count of NaN values. |
| `TEXT_NULL_KEYWORD_COUNT` | Count of values containing null-like keywords (e.g., "NULL", "None"). |
#### Numeric statistics
| Metric | Description |
|--------|-------------|
| `NUMERIC_MEAN` | Mean of numeric field. |
| `NUMERIC_MEDIAN` | Median of numeric field. |
| `NUMERIC_MIN` | Minimum value. |
| `NUMERIC_MAX` | Maximum value. |
| `NUMERIC_STDDEV` | Standard deviation. |
| `SUM` | Sum of numeric field. |
| `ZERO_COUNT` | Count of zero values. |
| `NEGATIVE_COUNT` | Count of negative values. |
#### Percentiles
| Metric | Description |
|--------|-------------|
| `PERCENTILE_20` | 20th percentile value. |
| `PERCENTILE_40` | 40th percentile value. |
| `PERCENTILE_60` | 60th percentile value. |
| `PERCENTILE_80` | 80th percentile value. |
#### Text statistics
| Metric | Description |
|--------|-------------|
| `TEXT_MAX_LENGTH` | Maximum string length. |
| `TEXT_MIN_LENGTH` | Minimum string length. |
| `TEXT_MEAN_LENGTH` | Mean string length. |
| `TEXT_STD_LENGTH` | Standard deviation of string length. |
#### Text format checks
| Metric | Description |
|--------|-------------|
| `TEXT_NOT_INT_COUNT` | Count of values not parseable as integers. |
| `TEXT_NOT_NUMBER_COUNT` | Count of values not parseable as numbers. |
| `TEXT_NOT_UUID_COUNT` | Count of values not matching UUID format. |
| `TEXT_NOT_SSN_COUNT` | Count of values not matching SSN format. |
| `TEXT_NOT_US_PHONE_COUNT` | Count of values not matching US phone format. |
| `TEXT_NOT_US_STATE_CODE_COUNT` | Count of values not matching US state codes. |
| `TEXT_NOT_US_ZIP_CODE_COUNT` | Count of values not matching US zip codes. |
| `TEXT_NOT_EMAIL_ADDRESS_COUNT` | Count of values not matching email format. |
| `TEXT_NOT_TIMESTAMP_COUNT` | Count of values not parseable as timestamps. |
#### Boolean
| Metric | Description |
|--------|-------------|
| `TRUE_COUNT` | Count of true values. |
| `FALSE_COUNT` | Count of false values. |
#### Timestamp
| Metric | Description |
|--------|-------------|
| `FUTURE_TIMESTAMP_COUNT` | Count of timestamps in the future. |
| `PAST_TIMESTAMP_COUNT` | Count of timestamps unreasonably far in the past. |
| `UNIX_ZERO_COUNT` | Count of timestamps equal to Unix epoch zero (1970-01-01). |
---
## Choosing the Right Metric
| User intent | Correct metric | Fields needed? |
|-------------|---------------|----------------|
| Row count parity | `ROW_COUNT` | **No** -- never pass fields |
| Distinct values in a column | `UNIQUE_COUNT` | Yes |
| Null values in a column | `NULL_COUNT` | Yes |
| Sum, average, min, max | `SUM`, `NUMERIC_MEAN`, `NUMERIC_MIN`, `NUMERIC_MAX` | Yes |
| Data completeness | `NON_NULL_COUNT` | Yes |
| String format validation | `TEXT_NOT_EMAIL_ADDRESS_COUNT`, `TEXT_NOT_UUID_COUNT`, etc. | Yes |
| Custom computed expressions | Use `customMetric` instead of `metric` | No (SQL handles it) |
---
## Custom Metrics
Use custom metrics when:
- **Column names differ** between source and target and you need a computed expression (not just a direct field comparison).
- **You need a derived calculation** like `SUM(quantity * unit_price)` rather than a simple column metric.
- **Standard metrics do not cover the comparison** (e.g., comparing a ratio, a conditional aggregate, or a windowed calculation).
If the columns simply have different names but you want a standard metric (e.g., compare `SUM` of `revenue` in source vs `total_revenue` in target), you do NOT need a custom metric -- just use the standard metric with different `sourceField` and `targetField` values.
Custom metric structure:
```json
{
"customMetric": {
"displayName": "Revenue Sum",
"sourceSqlExpression": "SUM(revenue)",
"targetSqlExpression": "SUM(total_revenue)"
}
}
```
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `displayName` | string | Yes | Human-readable name for the metric in alerts and dashboards. |
| `sourceSqlExpression` | string | Yes | SQL expression evaluated against the source table. |
| `targetSqlExpression` | string | Yes | SQL expression evaluated against the target table. |
When using `customMetric`, do NOT also pass `metric`, `sourceField`, or `targetField` in the same alert condition. The custom metric replaces all of those.
---
## Threshold Guidance
### Absolute thresholds (`isThresholdRelative: false` or omitted)
The `thresholdValue` is the maximum acceptable absolute difference between the source and target metric values.
- `thresholdValue: 0` -- source and target must match exactly.
- `thresholdValue: 100` -- up to 100 units of difference is acceptable.
### Relative (percentage) thresholds (`isThresholdRelative: true`)
The `thresholdValue` is the maximum acceptable percentage difference.
- `thresholdValue: 5` -- up to 5% difference is acceptable.
- `thresholdValue: 0.1` -- up to 0.1% difference is acceptable.
### When to use each
| Scenario | Recommended threshold type |
|----------|---------------------------|
| Exact replication (row counts must match) | Absolute, `thresholdValue: 0` |
| Near-real-time sync with small lag | Absolute, small value (e.g., 10-100) |
| Tables at different scales | Relative, percentage-based |
| Aggregated metrics (sums, means) | Relative, to handle floating-point differences |
---
## Examples
### Row count parity with absolute threshold
Compare row counts between dev and prod, alerting if they differ by more than 100 rows.
```json
{
"name": "orders_dev_prod_row_count",
"description": "Verify dev and prod orders tables have similar row counts",
"source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++dev_warehouse:core.orders",
"target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++prod_warehouse:core.orders",
"alert_conditions": [
{
"metric": "ROW_COUNT",
"thresholdValue": 100,
"isThresholdRelative": false
}
]
}
```
Note: no `sourceField` or `targetField` -- `ROW_COUNT` is table-level.
### Row count parity with percentage threshold
Alert if row counts differ by more than 5%.
```json
{
"name": "orders_replication_check",
"description": "Verify replicated orders table is within 5% of source row count",
"source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++primary:sales.orders",
"target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++replica:sales.orders",
"alert_conditions": [
{
"metric": "ROW_COUNT",
"thresholdValue": 5,
"isThresholdRelative": true
}
]
}
```
### Field-level comparison (different column names)
Compare the sum of `revenue` in the source table against `total_revenue` in the target table.
```json
{
"name": "revenue_source_target_sum",
"description": "Verify revenue sums match between staging and production",
"source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++staging:finance.transactions",
"target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++production:finance.transactions",
"alert_conditions": [
{
"metric": "SUM",
"sourceField": "revenue",
"targetField": "total_revenue",
"thresholdValue": 1,
"isThresholdRelative": true
}
]
}
```
### Segmented comparison
Compare null counts on `email` between source and target, segmented by `country`. The `country` field must exist in both tables.
```json
{
"name": "email_nulls_by_country",
"description": "Compare email null counts by country between ETL source and target",
"source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++raw:crm.contacts",
"target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++analytics:crm.contacts",
"segment_fields": ["country"],
"alert_conditions": [
{
"metric": "NULL_COUNT",
"sourceField": "email",
"targetField": "email",
"thresholdValue": 0,
"isThresholdRelative": false
}
]
}
```
### Cross-warehouse comparison with explicit warehouses
When source and target are in different warehouses, both warehouse parameters must be provided.
```json
{
"name": "migration_users_row_count",
"description": "Validate user row counts match after Snowflake to BigQuery migration",
"source_table": "snowflake_db:public.users",
"source_warehouse": "snowflake-prod",
"target_table": "bigquery_project:public.users",
"target_warehouse": "bigquery-prod",
"alert_conditions": [
{
"metric": "ROW_COUNT",
"thresholdValue": 0,
"isThresholdRelative": false
}
]
}
```
### Custom metric comparison
Compare a computed revenue expression when the SQL differs between source and target.
```json
{
"name": "computed_revenue_compare",
"description": "Compare total revenue computation between legacy and new schema",
"source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++warehouse:legacy.orders",
"target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++warehouse:v2.orders",
"alert_conditions": [
{
"customMetric": {
"displayName": "Total Revenue",
"sourceSqlExpression": "SUM(quantity * unit_price)",
"targetSqlExpression": "SUM(total_amount)"
},
"thresholdValue": 0.01,
"isThresholdRelative": true
}
]
}
```
### Multiple alert conditions
Compare both row counts and field-level metrics in a single monitor.
```json
{
"name": "orders_full_comparison",
"description": "Full comparison of orders between staging and production",
"source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++staging:core.orders",
"target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++production:core.orders",
"domain_id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
"alert_conditions": [
{
"metric": "ROW_COUNT",
"thresholdValue": 0,
"isThresholdRelative": false
},
{
"metric": "NULL_COUNT",
"sourceField": "customer_id",
"targetField": "customer_id",
"thresholdValue": 0,
"isThresholdRelative": false
},
{
"metric": "SUM",
"sourceField": "amount",
"targetField": "amount",
"thresholdValue": 0.1,
"isThresholdRelative": true
}
]
}
```
Note: the `ROW_COUNT` condition has no fields, while the field-level conditions each specify both `sourceField` and `targetField`.


@@ -0,0 +1,207 @@
# Custom SQL Monitor Reference
Detailed reference for building `createCustomSqlMonitorMac` tool calls.
## When to Use
Use a custom SQL monitor when the user wants to:
- Run a specific SQL query and alert on its result
- Implement cross-table logic (joins, subqueries, CTEs)
- Apply business-specific aggregations or calculations that don't map to a single metric
- Monitor a condition that spans multiple columns or tables
- Use a SQL query they already have in mind
---
## The Universal Fallback
Custom SQL is the fallback monitor type. Reach for it whenever another monitor type cannot express what the user needs:
- **Validation monitor won't work** because the column doesn't exist yet, or the logic requires joins across tables.
- **Metric monitor can't express the business logic** -- for example, a ratio between two columns, a conditional aggregation, or a calculation that spans multiple tables.
- **Cross-table joins are needed** -- metric and validation monitors operate on a single table. If the check requires data from two or more tables, custom SQL is the only option.
- **The user already has a SQL query** -- don't force it into another monitor type. Wrap it in a custom SQL monitor.
If you find yourself contorting another monitor type to fit the user's intent, stop and use custom SQL instead.
---
## Required Parameters
| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | string | Unique identifier for the monitor. Use a descriptive slug (e.g., `orphan_orders_check`). |
| `description` | string | Human-readable description of what the monitor checks. |
| `warehouse` | string | Warehouse name or UUID where the SQL query will be executed. |
| `sql` | string | SQL query that returns a **single numeric value** (one row, one column). |
| `alert_conditions` | array | List of threshold conditions (see Alert Conditions below). |
## Optional Parameters
| Parameter | Type | Description |
|-----------|------|-------------|
| `domain_id` | string (uuid) | Domain UUID (use `getDomains` to list). Only one domain can be assigned per monitor. |
---
## Alert Conditions
Each alert condition compares the query result against a threshold.
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `operator` | string | Yes | `"GT"`, `"LT"`, `"EQ"`, `"GTE"`, `"LTE"`, `"NE"` |
| `thresholdValue` | number | Yes | Numeric threshold to compare the query result against. |
### No AUTO Support
Custom SQL monitors do **NOT** support `AUTO` (anomaly detection). You must specify an explicit operator and threshold for every alert condition. This is a common mistake -- if the user asks for anomaly detection, steer them toward a metric monitor instead, which does support `AUTO`.
If the user is unsure what threshold to set, help them reason about it: "What value would indicate a problem? If the query returns X, should that fire an alert?"
---
## SQL Query Requirements
The SQL query MUST return exactly **one row with one numeric column**. This is non-negotiable -- the monitor compares that single value against the alert conditions.
### Rules
- Use aggregate functions: `COUNT(*)`, `SUM()`, `AVG()`, `MAX()`, `MIN()`, or similar.
- Can reference any table, view, or materialized view accessible in the warehouse.
- Can use joins, subqueries, CTEs, window functions -- any valid SQL.
- Do **NOT** include trailing semicolons.
- Do **NOT** include comments (`--` or `/* */`) -- some warehouses strip them inconsistently.
### SQL Validation Tips
These are the most common mistakes that cause custom SQL monitors to fail or produce misleading results:
1. **Handle NULLs with COALESCE.** If your aggregate could return NULL (e.g., `SUM(amount)` on an empty result set), wrap it: `SELECT COALESCE(SUM(amount), 0) FROM ...`. A NULL result cannot be compared against a threshold and will not trigger alerts.
2. **Ensure exactly one row, one column.** If your query could return zero rows (e.g., a filtered `SELECT` with no `GROUP BY`), wrap it in an outer aggregate: `SELECT COUNT(*) FROM (SELECT ...) sub`. If it returns multiple columns, select only the one you need.
3. **Test the query mentally.** Before finalizing, ask: "If this query returns 5, will the alert condition fire correctly?" Walk through the logic with a concrete number.
4. **For time-windowed checks, use appropriate date functions.** SQL syntax for date arithmetic varies by warehouse (see Warehouse-Specific SQL Notes below). Always scope time windows to avoid scanning the entire table history.
5. **Avoid non-deterministic results.** Queries using `LIMIT` without `ORDER BY`, or `RANDOM()`, produce unpredictable results that make alerting unreliable.
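Tips 1 and 2 combine into a single defensive pattern. The sketch below uses hypothetical table and column names; note that the explanatory comments must be stripped before pasting into the `sql` parameter, per the rules above:

```sql
-- Illustrative only; strip comments before use in the monitor.
-- The outer COALESCE guards a possibly-NULL SUM, and wrapping the
-- filtered SELECT in an aggregate guarantees exactly one row
-- even when the inner query matches nothing.
SELECT COALESCE(SUM(sub.amount), 0)
FROM (
  SELECT amount
  FROM analytics.core.orders
  WHERE status = 'refunded'
    AND created_at >= DATEADD(day, -1, CURRENT_TIMESTAMP())
) sub
```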
---
## Warehouse-Specific SQL Notes
SQL syntax for date arithmetic and functions varies across warehouses. When writing time-windowed queries, use the correct syntax for the user's warehouse:
| Operation | Snowflake | BigQuery | Redshift |
|-----------|-----------|----------|----------|
| Subtract 1 day from now | `DATEADD(day, -1, CURRENT_TIMESTAMP())` | `DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)` | `DATEADD(day, -1, GETDATE())` |
| Subtract 1 hour from now | `DATEADD(hour, -1, CURRENT_TIMESTAMP())` | `TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)` | `DATEADD(hour, -1, GETDATE())` |
| Current timestamp | `CURRENT_TIMESTAMP()` | `CURRENT_TIMESTAMP()` | `GETDATE()` |
| Date truncation | `DATE_TRUNC('day', col)` | `DATE_TRUNC(col, DAY)` | `DATE_TRUNC('day', col)` |
When unsure which warehouse the user is on, ask. Getting the syntax wrong causes the monitor to fail on every scheduled run.
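As a concrete illustration, the same trailing-24-hour row count is written differently in each dialect (hypothetical table and column names; comments are explanatory and must be removed before use in a monitor):

```sql
-- Snowflake
SELECT COUNT(*) FROM analytics.core.events
WHERE event_ts >= DATEADD(day, -1, CURRENT_TIMESTAMP())

-- BigQuery
SELECT COUNT(*) FROM project.analytics.events
WHERE event_ts >= DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)

-- Redshift
SELECT COUNT(*) FROM analytics.core.events
WHERE event_ts >= DATEADD(day, -1, GETDATE())
```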
---
## Examples
### Orphan records (GT 0)
Alert when orders reference customers that don't exist.
```json
{
"name": "orphan_orders_check",
"description": "Detect orders referencing non-existent customers",
"warehouse": "production_snowflake",
"sql": "SELECT COUNT(*) FROM analytics.core.orders o LEFT JOIN analytics.core.customers c ON o.customer_id = c.id WHERE c.id IS NULL",
"alert_conditions": [
{
"operator": "GT",
"thresholdValue": 0
}
]
}
```
### Daily revenue floor (LT threshold)
Alert when total revenue for the past 24 hours drops below a minimum.
```json
{
"name": "daily_revenue_floor",
"description": "Alert when daily revenue falls below $10,000",
"warehouse": "production_snowflake",
"sql": "SELECT COALESCE(SUM(amount), 0) FROM analytics.billing.transactions WHERE created_at >= DATEADD(day, -1, CURRENT_TIMESTAMP())",
"alert_conditions": [
{
"operator": "LT",
"thresholdValue": 10000
}
]
}
```
### Duplicate rate exceeds threshold
Alert when the duplicate rate on a key field exceeds 1%.
```json
{
"name": "order_id_duplicate_rate",
"description": "Alert when order_id duplicate rate exceeds 1%",
"warehouse": "production_snowflake",
"sql": "SELECT COALESCE(1.0 - (COUNT(DISTINCT order_id) * 1.0 / NULLIF(COUNT(*), 0)), 0) FROM analytics.core.orders WHERE created_at >= DATEADD(day, -1, CURRENT_TIMESTAMP())",
"alert_conditions": [
{
"operator": "GT",
"thresholdValue": 0.01
}
]
}
```
### Multiple threshold conditions (range check)
Alert when a value falls outside an acceptable range. Multiple conditions act as independent checks -- each one that evaluates to true fires its own alert.
```json
{
"name": "avg_order_amount_range",
"description": "Alert when average order amount is outside the $20-$500 range",
"warehouse": "production_snowflake",
"sql": "SELECT COALESCE(AVG(amount), 0) FROM analytics.core.orders WHERE created_at >= DATEADD(day, -1, CURRENT_TIMESTAMP()) AND status = 'completed'",
"alert_conditions": [
{
"operator": "LT",
"thresholdValue": 20
},
{
"operator": "GT",
"thresholdValue": 500
}
]
}
```
### Cross-table freshness check (BigQuery syntax)
Alert when the latest row in a downstream table is more than 2 hours behind the source.
```json
{
"name": "pipeline_lag_check",
"description": "Alert when downstream table lags source by more than 2 hours",
"warehouse": "production_bigquery",
"sql": "SELECT COALESCE(TIMESTAMP_DIFF(s.max_ts, t.max_ts, MINUTE), 9999) FROM (SELECT MAX(event_timestamp) AS max_ts FROM project.raw.events) s CROSS JOIN (SELECT MAX(processed_at) AS max_ts FROM project.analytics.events_processed) t",
"alert_conditions": [
{
"operator": "GT",
"thresholdValue": 120
}
]
}
```
# Metric Monitor Reference
Detailed reference for building `createMetricMonitorMac` tool calls.
## When to Use
Use a metric monitor when the user wants to:
- Track row count changes over time
- Monitor null rates, unique counts, or other statistical metrics on specific fields
- Detect anomalies in numeric distributions (mean, max, min, percentiles)
- Monitor data freshness (time since last row count change)
- Segment metrics by dimensions (e.g., by country, status)
---
## Required Parameters
| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | string | Unique identifier for the monitor. Use a descriptive slug (e.g., `orders_null_check`). |
| `description` | string | Human-readable description of what the monitor checks. |
| `table` | string | Table MCON (preferred) or `database:schema.table` format. If not MCON, also pass `warehouse`. |
| `aggregate_time_field` | string | **MUST be a real timestamp/datetime column from the table.** NEVER guess this value. |
| `alert_conditions` | array | List of alert condition objects (see Alert Conditions below). |
## Optional Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `warehouse` | string | auto-resolved | Warehouse name or UUID. Required if `table` is not an MCON. |
| `segment_fields` | array of string | none | Fields to group/segment metrics by (e.g., `["country", "status"]`). |
| `aggregate_by` | string | `"day"` | Time interval: `"hour"`, `"day"`, `"week"`, `"month"`. |
| `where_condition` | string | none | SQL WHERE clause (without `WHERE` keyword) to filter rows before computing metrics. |
| `interval_minutes` | int | auto | Schedule interval in minutes. Must be compatible with `aggregate_by` (see note below). If not specified, the tool defaults to the minimum valid interval for the chosen `aggregate_by`. |
| `domain_id` | string (uuid) | none | Domain UUID (use `getDomains` to list). |
---
## Schedule and Aggregation Compatibility
The schedule interval must be compatible with `aggregate_by`. Daily aggregation requires an interval that is a multiple of 1440 minutes (24 hours), weekly requires a multiple of 10080, etc. If you pass `interval_minutes`, make sure it satisfies this constraint. If you omit it, the tool picks a sensible default.
| `aggregate_by` | Minimum `interval_minutes` | Default if omitted |
|---|---|---|
| `hour` | 60 | 60 |
| `day` | 1440 | 1440 |
| `week` | 10080 | 10080 |
| `month` | 43200 | 43200 |
For example, to run a daily-aggregated monitor every other day, pass `aggregate_by: "day"` and `interval_minutes: 2880`.
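That every-other-day case looks like this as a full tool call (table, warehouse, and field names are hypothetical; note `interval_minutes: 2880` is a multiple of the 1440-minute minimum for daily aggregation):

```json
{
  "name": "orders_row_count_every_other_day",
  "description": "Row count anomaly check, aggregated daily, run every 2 days",
  "table": "analytics:core.orders",
  "warehouse": "production_snowflake",
  "aggregate_time_field": "created_at",
  "aggregate_by": "day",
  "interval_minutes": 2880,
  "alert_conditions": [
    {"metric": "ROW_COUNT_CHANGE", "operator": "AUTO"}
  ]
}
```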
---
## Choosing the Timestamp Field
The `aggregate_time_field` is the most critical parameter. It MUST be a real column from the table that contains timestamp or datetime values. This is the number one source of monitor creation failures.
### How to pick it
1. You should already have the column names from `getTable` with `include_fields: true` (done in Step 2 of the main skill).
2. Look for columns whose names suggest a timestamp: `created_at`, `updated_at`, `modified_at`, `timestamp`, `event_timestamp`, or columns with `_ts`, `_dt`, `_time` suffixes, or `date`, `datetime`.
3. If the user specified one, verify it exists in the column list.
4. If exactly one obvious candidate exists, suggest it.
5. If multiple candidates exist, present them and ask the user.
6. If NO obvious timestamp columns exist, suggest a custom SQL monitor instead (which does not need a timestamp field).
**NEVER** proceed without confirming the timestamp field exists in the table schema.
### Common timestamp field mistakes
- **Using a DATE column (not TIMESTAMP):** This may work, but aggregation granularity is limited. For example, `aggregate_by: "hour"` is meaningless on a DATE column because the time component is always midnight. Warn the user and default to `aggregate_by: "day"` or coarser.
- **Using a field that contains many nulls:** If the timestamp column has significant null values, rows with null timestamps are excluded from aggregation windows, producing unreliable or misleading results. Check the column's null rate from `getTable` field stats if available, and warn the user if it is high.
- **Guessing a field name that does not exist:** Always verify the column name against the `getTable` output. A typo or assumed name (e.g., `created_date` when the actual column is `created_at`) causes the monitor creation to fail silently or error.
---
## Field-Type-to-Metric Compatibility Matrix
**Before selecting a metric, check the column's data type from `getTable` results.** Passing a metric incompatible with the column type is the most common source of creation failures after timestamp issues.
| Column Type | Compatible Metrics |
|-------------|-------------------|
| **Numeric** (int, float, decimal, bigint) | `NUMERIC_MEAN`, `NUMERIC_MEDIAN`, `NUMERIC_MIN`, `NUMERIC_MAX`, `NUMERIC_STDDEV`, `SUM`, `ZERO_COUNT`, `ZERO_RATE`, `NEGATIVE_COUNT`, `NEGATIVE_RATE`, `NULL_COUNT`, `NULL_RATE`, `UNIQUE_COUNT`, `UNIQUE_RATE`, `DUPLICATE_COUNT` |
| **String / Text** (varchar, char, text) | `TEXT_MAX_LENGTH`, `TEXT_MIN_LENGTH`, `TEXT_MEAN_LENGTH`, `TEXT_INT_RATE`, `TEXT_NUMBER_RATE`, `TEXT_UUID_RATE`, `TEXT_EMAIL_ADDRESS_RATE`, `EMPTY_STRING_COUNT`, `EMPTY_STRING_RATE`, `NULL_COUNT`, `NULL_RATE`, `UNIQUE_COUNT`, `UNIQUE_RATE`, `DUPLICATE_COUNT` |
| **Boolean** | `TRUE_COUNT`, `FALSE_COUNT`, `NULL_COUNT`, `NULL_RATE` |
| **Timestamp / Date** | `FUTURE_TIMESTAMP_COUNT`, `PAST_TIMESTAMP_COUNT`, `UNIX_ZERO_TIMESTAMP_COUNT`, `NULL_COUNT`, `NULL_RATE`, `UNIQUE_COUNT`, `UNIQUE_RATE` |
| **Any type** | `NULL_COUNT`, `NULL_RATE`, `UNIQUE_COUNT`, `UNIQUE_RATE`, `DUPLICATE_COUNT` |
### Rules
- **NEVER** apply `NUMERIC_*`, `SUM`, `ZERO_*`, or `NEGATIVE_*` metrics to string, boolean, or timestamp columns.
- **NEVER** apply `TEXT_*` or `EMPTY_STRING_*` metrics to numeric, boolean, or timestamp columns.
- **NEVER** apply `TRUE_COUNT` or `FALSE_COUNT` to non-boolean columns.
- **NEVER** apply `FUTURE_TIMESTAMP_COUNT`, `PAST_TIMESTAMP_COUNT`, or `UNIX_ZERO_TIMESTAMP_COUNT` to non-timestamp columns.
- When in doubt, `NULL_COUNT`, `NULL_RATE`, `UNIQUE_COUNT`, and `UNIQUE_RATE` are safe for any column type.
---
## Alert Conditions
Each alert condition has:
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `metric` | string | Yes | The metric to monitor (see Metrics Reference below). |
| `operator` | string | Yes | `"AUTO"` (anomaly detection), `"GT"`, `"LT"`, `"EQ"`, `"GTE"`, `"LTE"`, `"NE"`. |
| `threshold` | number | For explicit operators | The threshold value. Required when using `GT`, `LT`, `EQ`, `GTE`, `LTE`, or `NE`. Not used with `AUTO`. |
| `fields` | array of string | Depends | Column names to apply the metric to. Required for field-level metrics. Not needed for table-level metrics. |
---
## Operator Guidance
### When to use `AUTO` (anomaly detection)
- Best when you do not know the expected range of values and want Monte Carlo's ML to learn normal patterns and alert on deviations.
- Works well for organic metrics that vary day-to-day (row counts, null rates on evolving data, numeric distributions).
- Some metrics **require** `AUTO` -- see the table below.
### When to use explicit thresholds (`GT`, `LT`, `EQ`, `GTE`, `LTE`, `NE`)
- Use when there is a known business rule or data contract (e.g., "null rate on `email` should never exceed 5%", "order amount must always be greater than 0").
- Provides deterministic alerting -- no training period needed, alerts fire immediately when the condition is met.
- Requires a `threshold` value in the alert condition.
### Operator restrictions by metric
| Metric | Allowed Operators | Notes |
|--------|-------------------|-------|
| `ROW_COUNT_CHANGE` | `AUTO` only | Anomaly detection on row count delta. |
| `TIME_SINCE_LAST_ROW_COUNT_CHANGE` | `AUTO` only | Anomaly detection on staleness duration. |
| `RELATIVE_ROW_COUNT` | `AUTO` only | Anomaly detection on segment distribution. Requires `segment_fields`. |
| All other metrics | `AUTO`, `GT`, `LT`, `EQ`, `GTE`, `LTE`, `NE` | Any operator is valid. |
---
## Metrics Reference
### Table-level metrics (no `fields` needed)
| Metric | Operator | Description |
|--------|----------|-------------|
| `ROW_COUNT_CHANGE` | Must use `AUTO` | Alert on anomalous changes in total row count. |
| `TIME_SINCE_LAST_ROW_COUNT_CHANGE` | Must use `AUTO` | Alert when the table has not been updated for an unusual duration. |
### Field-level metrics (must specify `fields`)
| Metric | Column Types | Description |
|--------|-------------|-------------|
| `NULL_COUNT` | Any | Count of null values. |
| `NULL_RATE` | Any | Rate of null values (0.0 to 1.0). |
| `UNIQUE_COUNT` | Any | Count of distinct values. |
| `UNIQUE_RATE` | Any | Rate of distinct values (0.0 to 1.0). |
| `DUPLICATE_COUNT` | Any | Count of duplicate (non-unique) values. |
| `EMPTY_STRING_COUNT` | String/Text | Count of empty string values. |
| `EMPTY_STRING_RATE` | String/Text | Rate of empty string values. |
| `NUMERIC_MEAN` | Numeric | Mean of numeric field. |
| `NUMERIC_MEDIAN` | Numeric | Median of numeric field. |
| `NUMERIC_MIN` | Numeric | Minimum value of numeric field. |
| `NUMERIC_MAX` | Numeric | Maximum value of numeric field. |
| `NUMERIC_STDDEV` | Numeric | Standard deviation of numeric field. |
| `SUM` | Numeric | Sum of numeric field. |
| `ZERO_COUNT` | Numeric | Count of zero values. |
| `ZERO_RATE` | Numeric | Rate of zero values. |
| `NEGATIVE_COUNT` | Numeric | Count of negative values. |
| `NEGATIVE_RATE` | Numeric | Rate of negative values. |
| `TRUE_COUNT` | Boolean | Count of true values. |
| `FALSE_COUNT` | Boolean | Count of false values. |
| `TEXT_MAX_LENGTH` | String/Text | Maximum string length. |
| `TEXT_MIN_LENGTH` | String/Text | Minimum string length. |
| `TEXT_MEAN_LENGTH` | String/Text | Mean string length. |
| `TEXT_INT_RATE` | String/Text | Rate of values parseable as integers. |
| `TEXT_NUMBER_RATE` | String/Text | Rate of values parseable as numbers. |
| `TEXT_UUID_RATE` | String/Text | Rate of values matching UUID format. |
| `TEXT_EMAIL_ADDRESS_RATE` | String/Text | Rate of values matching email format. |
| `FUTURE_TIMESTAMP_COUNT` | Timestamp/Date | Count of timestamps in the future. |
| `PAST_TIMESTAMP_COUNT` | Timestamp/Date | Count of timestamps unreasonably far in the past. |
| `UNIX_ZERO_TIMESTAMP_COUNT` | Timestamp/Date | Count of timestamps equal to Unix epoch zero (1970-01-01). |
### Segmentation metric
| Metric | Operator | Description |
|--------|----------|-------------|
| `RELATIVE_ROW_COUNT` | Must use `AUTO` | Alert on anomalous changes in distribution across segments. MUST use `segment_fields`. |
---
## Examples
### Row count anomaly detection
```json
{
"name": "orders_row_count",
"description": "Detect anomalous changes in daily order volume",
"table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++analytics:core.orders",
"aggregate_time_field": "created_at",
"aggregate_by": "day",
"alert_conditions": [
{
"metric": "ROW_COUNT_CHANGE",
"operator": "AUTO"
}
]
}
```
### Null monitoring on specific fields
```json
{
"name": "orders_null_check",
"description": "Alert when email or user_id nulls exceed 50 per day",
"table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++analytics:core.orders",
"aggregate_time_field": "created_at",
"aggregate_by": "day",
"alert_conditions": [
{
"metric": "NULL_COUNT",
"operator": "GT",
"threshold": 50,
"fields": ["email", "user_id"]
}
]
}
```
### Segmented monitoring
```json
{
"name": "orders_by_country_distribution",
"description": "Detect anomalous shifts in order distribution across countries",
"table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++analytics:core.orders",
"aggregate_time_field": "created_at",
"aggregate_by": "day",
"segment_fields": ["country"],
"alert_conditions": [
{
"metric": "RELATIVE_ROW_COUNT",
"operator": "AUTO"
}
]
}
```
### Numeric range monitoring with filter
```json
{
"name": "completed_orders_amount_check",
"description": "Detect anomalous max order amounts for completed orders",
"table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++analytics:core.orders",
"aggregate_time_field": "created_at",
"aggregate_by": "day",
"where_condition": "status = 'completed'",
"alert_conditions": [
{
"metric": "NUMERIC_MAX",
"operator": "AUTO",
"fields": ["amount"]
}
]
}
```
### Multiple alert conditions in one monitor
```json
{
"name": "payments_quality_check",
"description": "Monitor payment amount stats and null rate on transaction_id",
"table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++warehouse:billing.payments",
"aggregate_time_field": "processed_at",
"aggregate_by": "day",
"domain_id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
"alert_conditions": [
{
"metric": "NUMERIC_MEAN",
"operator": "AUTO",
"fields": ["amount"]
},
{
"metric": "NULL_RATE",
"operator": "GT",
"threshold": 0.01,
"fields": ["transaction_id"]
}
]
}
```
# Table Monitor Reference
Detailed reference for building `createTableMonitorMac` tool calls.
## When to Use
Use a table monitor when the user wants to:
- Monitor many tables at once across an entire database or schema
- Track freshness (when was each table last updated?)
- Detect schema changes (columns added, removed, or type-changed)
- Monitor volume changes (row count anomalies) across a broad set of tables
- Apply broad coverage with anomaly detection (no custom thresholds needed)
**Do NOT use a table monitor when the user wants to:**
- Track field-level metrics on a single table (use a metric monitor)
- Apply custom thresholds or explicit operators like GT/LT (use a metric monitor)
- Validate row-level business rules or referential integrity (use a validation monitor)
---
## Required Parameters
| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | string | Unique identifier for the table monitor. Must be unique across all table monitors in the same namespace. |
| `description` | string | Human-readable description of what the monitor checks (max 512 characters). |
| `warehouse` | string | Warehouse name or UUID. Use `getTable` or `search` to find it. |
| `asset_selection` | object | Asset selection config defining which tables to monitor (see Asset Selection below). |
## Optional Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `alert_conditions` | array of strings | `["last_updated_on", "schema", "total_row_count", "total_row_count_last_changed_on"]` | Metric names to monitor (see Alert Conditions below). |
| `domain_id` | string (uuid) | none | Domain UUID (use `getDomains` to list). |
---
## Pre-Step: Verify Warehouse
Before creating a table monitor, resolve the warehouse name or UUID. The `warehouse` parameter is required and must match an existing warehouse in the Monte Carlo account.
1. If the user provides a table name, call `getTable` to retrieve the table details -- the response includes the warehouse name and UUID.
2. If the user provides a database or schema name without a specific table, call `search` with the database or schema name to find assets and identify the warehouse.
3. Use either the warehouse name or UUID in the `warehouse` parameter.
**NEVER guess the warehouse value.** If you cannot resolve it, ask the user.
---
## Asset Selection
The `asset_selection` object defines which tables the monitor covers. It must include a `databases` list.
**Use database and schema scoping to select which tables to monitor.** This is the reliable approach and covers most use cases.
> **Known limitation:** The MCP tool supports `filters` and `exclusions` parameters, but the tool's schema describes the wrong format for them. Until this is fixed ([K2-269](https://linear.app/montecarlodata/issue/K2-269)), **do not pass `filters` or `exclusions`** -- they will cause errors. Use database/schema scoping instead to narrow the set of monitored tables. If the user needs regex or pattern-based filtering, explain this limitation and suggest either (a) using schema-level scoping to get close, or (b) creating individual metric monitors for specific tables.
### Database-Level Selection
To monitor all tables in an entire database, specify only the database name with no `schemas` list:
```json
{
"databases": [
{"name": "analytics"}
]
}
```
This monitors every table in every schema within the `analytics` database.
### Schema-Level Selection
To monitor all tables in specific schemas, include the `schemas` list:
```json
{
"databases": [
{
"name": "analytics",
"schemas": ["core", "staging"]
}
]
}
```
This monitors every table in the `core` and `staging` schemas within `analytics`, but not tables in other schemas.
### Multiple Databases
You can monitor tables across multiple databases in a single monitor:
```json
{
"databases": [
{"name": "analytics", "schemas": ["core"]},
{"name": "raw_data"},
{"name": "reporting", "schemas": ["public", "internal"]}
]
}
```
---
## Alert Conditions
Alert conditions define which metrics the table monitor tracks. The operator is always AUTO (anomaly detection) -- custom thresholds are not available for table monitors.
| Metric | Description |
|--------|-------------|
| `last_updated_on` | Freshness monitoring. Alerts when a table has not been updated within its normal cadence. |
| `schema` | Any schema change. Alerts when columns are added, removed, or their types change. |
| `schema_fields_added` | New columns detected. Alerts only when new columns appear in the table. |
| `schema_fields_removed` | Columns removed. Alerts only when existing columns are dropped from the table. |
| `schema_fields_type_change` | Column type changes. Alerts only when a column's data type changes. |
| `total_row_count` | Row count changes. Alerts on anomalous changes in total row count. |
| `total_row_count_last_changed_on` | Time since last volume change. Alerts when the row count has not changed for an unusual duration. |
### Notes
- **All operators are AUTO (anomaly detection).** Table monitors do not support custom thresholds like GT, LT, or explicit operators. If the user needs custom thresholds, use a metric monitor instead.
- **No `schedule` field.** Table monitors do not support the `schedule` field in MaC YAML. Adding it will cause a validation error on `montecarlo monitors apply`. Table monitor scheduling is managed automatically by Monte Carlo. Do NOT add a schedule block to the generated YAML.
- The default set (`last_updated_on`, `schema`, `total_row_count`, `total_row_count_last_changed_on`) provides broad coverage and is appropriate for most use cases. Only override the defaults when the user specifically requests a subset.
- `schema` is a superset of `schema_fields_added`, `schema_fields_removed`, and `schema_fields_type_change`. If using `schema`, there is no need to also include the granular schema metrics.
---
## Examples
### Monitor all tables in a database (minimal config)
```json
{
"name": "analytics_db_monitor",
"description": "Monitor all tables in the analytics database for freshness, schema changes, and volume",
"warehouse": "production_warehouse",
"asset_selection": {
"databases": [
{"name": "analytics"}
]
}
}
```
Uses the default alert conditions (`last_updated_on`, `schema`, `total_row_count`, `total_row_count_last_changed_on`).
### Monitor specific schemas with default alerts
```json
{
"name": "core_schemas_monitor",
"description": "Monitor all tables in core and reporting schemas",
"warehouse": "production_warehouse",
"asset_selection": {
"databases": [
{
"name": "analytics",
"schemas": ["core", "reporting"]
}
]
}
}
```
Monitors every table in the `core` and `reporting` schemas, leaving other schemas unmonitored.
### Monitor multiple schemas across databases
```json
{
"name": "prod_tables_monitor",
"description": "Monitor production tables across analytics and raw_data databases",
"warehouse": "production_warehouse",
"asset_selection": {
"databases": [
{
"name": "analytics",
"schemas": ["core", "reporting"]
},
{
"name": "raw_data",
"schemas": ["ingestion"]
}
]
}
}
```
Monitors tables in specific production schemas, leaving development and staging schemas unmonitored.
### Schema change monitoring only
```json
{
"name": "warehouse_schema_watch",
"description": "Track schema changes across the entire data warehouse",
"warehouse": "production_warehouse",
"asset_selection": {
"databases": [
{"name": "analytics"},
{"name": "raw_data"}
]
},
"alert_conditions": [
"schema_fields_added",
"schema_fields_removed",
"schema_fields_type_change"
]
}
```
Monitors only schema changes (not freshness or volume) across multiple databases. Uses the granular schema metrics instead of `schema` to allow selectively enabling/disabling each type.
---
## Table Monitor vs Metric Monitor
| Aspect | Table Monitor | Metric Monitor |
|--------|---------------|----------------|
| **Scope** | Multiple tables (database/schema level) | Single table |
| **Metrics** | Freshness, schema changes, row count | Field-level metrics (null rate, mean, sum, etc.) |
| **Operator** | AUTO only (anomaly detection) | AUTO or explicit thresholds (GT, LT, EQ, etc.) |
| **Asset selection** | Database/schema with filters and exclusions | Single table specified by MCON or name |
| **Timestamp field** | Not required | Required (`aggregate_time_field`) |
| **Segmentation** | Not available | Available via `segment_fields` |
| **Best for** | Broad coverage, freshness, schema drift | Targeted field-level data quality checks |
**Rule of thumb:** If the user wants to monitor a specific field on a specific table with specific thresholds, use a metric monitor. If the user wants broad monitoring across many tables with automatic anomaly detection, use a table monitor.
# Validation Monitor Reference
Detailed reference for building `createValidationMonitorMac` tool calls.
## When to Use
Use a validation monitor when the user wants to:
- Check that specific fields are never null
- Validate that values are within an allowed set (e.g., status in 'active', 'pending', 'inactive')
- Enforce referential integrity (field values exist in another table)
- Apply row-level business rules (e.g., "amount must be positive")
- Combine multiple conditions with AND/OR logic
---
## Getting the Logic Right: Conditions Match INVALID Data
This is the single most confusing aspect of validation monitors and the number one source of mistakes. **Conditions describe what INVALID data looks like -- the data you want to be alerted about.** They do NOT describe what valid data looks like.
Think of it this way: the monitor scans rows and fires an alert when it finds rows matching the condition. So the condition must match the BAD rows.
| User wants | Condition should match | Common mistake |
|------------|----------------------|----------------|
| "id should never be null" | id IS NULL (alert when null found) | id IS NOT NULL (would alert on every valid row) |
| "status must be in [active, pending]" | status NOT IN [active, pending] (alert on unexpected values) | status IN [active, pending] (would alert on valid rows) |
| "amount must be positive" | amount IS NEGATIVE (alert on bad values) | amount > 0 (would alert on valid rows) |
| "email must not be empty" | email IS NULL **OR** email = '' (alert on missing) | email IS NOT NULL (would alert on valid rows) |
**Before building any condition, ask yourself: "If a row matches this condition, is the row INVALID?" If the answer is no, the logic is backwards.**
---
## Pre-Step: Verify Field Existence
Before constructing the `alert_condition`, verify that every field name you plan to reference exists in the table's column list. This is the number two source of validation monitor failures -- referencing columns that do not exist or are misspelled.
1. You should already have the column list from `getTable` with `include_fields: true` (done in Step 2 of the main skill).
2. For every field name in your planned conditions, confirm it appears in the column list exactly as spelled (field names are case-sensitive on most warehouses).
3. If a field does not exist, stop and ask the user to clarify the correct column name. Do not guess.
---
## Required Parameters
| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | string | Unique identifier for the monitor. Use a descriptive slug (e.g., `orders_not_null_check`). |
| `description` | string | Human-readable description of what the monitor checks. |
| `table` | string | Table MCON (preferred) or `database:schema.table` format. If not MCON, also pass `warehouse`. |
| `alert_condition` | object | Condition tree defining when to alert (see Alert Condition Structure below). |
## Optional Parameters
| Parameter | Type | Description |
|-----------|------|-------------|
| `warehouse` | string | Warehouse name or UUID. Required if `table` is not an MCON. |
| `domain_id` | string (uuid) | Domain UUID (use `getDomains` to list). |
---
## Alert Condition Structure
The top level of `alert_condition` must always be a GROUP node. This GROUP contains one or more conditions combined with AND or OR logic.
```json
{
"type": "GROUP",
"operator": "AND",
"conditions": [...]
}
```
### Condition Types
There are four condition types: UNARY, BINARY, SQL, and GROUP.
#### UNARY (single-value checks)
Used for predicates that operate on a single field with no comparison value.
```json
{
"type": "UNARY",
"predicate": {"name": "null", "negated": false},
"value": [{"type": "FIELD", "field": "column_name"}]
}
```
- `predicate.name` -- the predicate to apply (see Predicates Reference below).
- `predicate.negated` -- set to `true` to invert the predicate (e.g., `null` with `negated: true` means "is NOT null").
- `value` -- an array with a single value descriptor (usually a FIELD reference).
#### BINARY (comparison checks)
Used for predicates that compare a field against a value.
```json
{
"type": "BINARY",
"predicate": {"name": "greater_than", "negated": false},
"left": [{"type": "FIELD", "field": "column_name"}],
"right": [{"type": "LITERAL", "literal": "0"}]
}
```
- `left` -- the left-hand side of the comparison (typically a FIELD reference).
- `right` -- the right-hand side (typically a LITERAL value, SQL expression, or FIELD reference).
- Both `left` and `right` are arrays of value descriptors.
#### SQL (custom SQL expression)
Used for complex conditions that are difficult to express with UNARY/BINARY nodes. The SQL expression should evaluate to true for INVALID rows.
```json
{
"type": "SQL",
"sql": "amount > 0 AND amount < 1000000"
}
```
#### GROUP (nested conditions)
Used to combine multiple conditions with AND or OR logic. Groups can be nested.
```json
{
"type": "GROUP",
"operator": "OR",
"conditions": [
{"type": "UNARY", "...": "..."},
{"type": "BINARY", "...": "..."}
]
}
```
---
## Value Types
Value descriptors appear in the `value`, `left`, and `right` arrays of UNARY and BINARY conditions.
| Type | Field | Description | Example |
|------|-------|-------------|---------|
| `FIELD` | `"field": "column_name"` | References a column in the table. | `{"type": "FIELD", "field": "user_id"}` |
| `LITERAL` | `"literal": "value"` | A static value (always a string, even for numbers). | `{"type": "LITERAL", "literal": "100"}` |
| `SQL` | `"sql": "SELECT ..."` | A SQL expression or subquery. | `{"type": "SQL", "sql": "SELECT MAX(id) FROM ref_table"}` |
---
## Predicates Reference
Before building conditions, call `getValidationPredicates` to get the full list of supported predicates for the connected warehouse. The list below covers common predicates but may not be exhaustive.
### Unary Predicates
These predicates take no comparison value -- they check a property of the field itself.
| Predicate | Description | Example use |
|-----------|-------------|-------------|
| `null` | Field value is null. | Alert on null ids. |
| `is_negative` | Field value is negative. | Alert on negative amounts. |
| `is_between_0_and_1` | Field value is between 0 and 1 (inclusive). | Alert on rates stored as fractions (0-1) when they should be percentages (0-100). |
| `is_future_date` | Field value is a date/timestamp in the future. | Alert on future-dated records. |
| `is_uuid` | Field value matches UUID format. | Alert on non-UUID values in a UUID field (use with `negated: true`). |
### Binary Predicates
These predicates compare a field against a value.
| Predicate | Right-hand side | Description | Example use |
|-----------|----------------|-------------|-------------|
| `equal` | Single LITERAL | Field equals the given value. | Alert when `status` equals `'deleted'`. |
| `greater_than` | Single LITERAL | Field is greater than the given value. | Alert when `discount_pct` exceeds 100. |
| `less_than` | Single LITERAL | Field is less than the given value. | Alert when `quantity` is below 0. |
| `in_set` | Multiple LITERALs | Field value is in the given set. | Alert when `status` is in an invalid set (see example below). |
| `contains` | Single LITERAL | Field value contains the given substring. | Alert when `email` contains `'test@'`. |
| `starts_with` | Single LITERAL | Field value starts with the given prefix. | Alert when `phone` starts with `'000'`. |
| `between` | Two LITERALs | Field value is between the two given values (inclusive). | Alert when `score` is between 0 and 10 (if that range is invalid). |
### Using `negated` to Invert Predicates
Any predicate can be inverted by setting `"negated": true` in the predicate object. This is essential for "must be in set" validations:
- **"status must be in [active, pending]"** becomes `in_set` with values `["active", "pending"]` and `negated: true` -- meaning "alert when status is NOT in [active, pending]".
- **"id must not be null"** becomes `null` with `negated: false` -- meaning "alert when id IS null" (no inversion needed since the condition already matches invalid data).
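The inversion pattern can be sketched with small helper functions that build the condition dictionaries. This is a hypothetical illustration only -- the helpers (`field`, `literal`, `unary`, `binary`, `group`) are not part of the MCP toolkit; the resulting dict is what you would pass as `alert_condition`:

```python
# Hypothetical builders for alert_condition nodes (illustrative,
# not part of the Monte Carlo toolkit).

def field(name):
    return {"type": "FIELD", "field": name}

def literal(value):
    # LITERAL values are always strings, even for numbers.
    return {"type": "LITERAL", "literal": str(value)}

def unary(predicate, column, negated=False):
    return {
        "type": "UNARY",
        "predicate": {"name": predicate, "negated": negated},
        "value": [field(column)],
    }

def binary(predicate, column, rhs_values, negated=False):
    return {
        "type": "BINARY",
        "predicate": {"name": predicate, "negated": negated},
        "left": [field(column)],
        "right": [literal(v) for v in rhs_values],
    }

def group(operator, *conditions):
    return {"type": "GROUP", "operator": operator, "conditions": list(conditions)}

# "status must be in [active, pending]" -> alert when it is NOT in the set
condition = group("AND", binary("in_set", "status", ["active", "pending"], negated=True))
```

Note how the "must be in set" rule becomes `in_set` with `negated: True` -- the condition matches the invalid rows, per the convention above.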
---
## Examples
### Alert when id is null
Verify that `id` exists in the table schema from `getTable` before proceeding.
```json
{
"name": "orders_id_not_null",
"description": "Alert when order id is null",
"table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++analytics:core.orders",
"alert_condition": {
"type": "GROUP",
"operator": "AND",
"conditions": [
{
"type": "UNARY",
"predicate": {"name": "null", "negated": false},
"value": [{"type": "FIELD", "field": "id"}]
}
]
}
}
```
The condition matches rows where `id` IS NULL -- these are the invalid rows we want to be alerted about.
### Alert when status is not in allowed set
Verify that `status` exists in the table schema from `getTable` before proceeding.
```json
{
"name": "orders_status_allowed_values",
"description": "Alert when order status is outside the allowed set",
"table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++analytics:core.orders",
"alert_condition": {
"type": "GROUP",
"operator": "AND",
"conditions": [
{
"type": "BINARY",
"predicate": {"name": "in_set", "negated": true},
"left": [{"type": "FIELD", "field": "status"}],
"right": [
{"type": "LITERAL", "literal": "active"},
{"type": "LITERAL", "literal": "pending"},
{"type": "LITERAL", "literal": "inactive"}
]
}
]
}
}
```
Note `negated: true` -- the predicate is `in_set`, but we want to alert when the value is NOT in the set. This catches any unexpected status values.
### Alert when amount is negative
Verify that `amount` exists in the table schema from `getTable` before proceeding.
```json
{
"name": "orders_positive_amount",
"description": "Alert when order amount is negative",
"table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++analytics:core.orders",
"alert_condition": {
"type": "GROUP",
"operator": "AND",
"conditions": [
{
"type": "UNARY",
"predicate": {"name": "is_negative", "negated": false},
"value": [{"type": "FIELD", "field": "amount"}]
}
]
}
}
```
The condition matches rows where `amount` is negative -- these are the invalid rows.
### Combined conditions: null OR negative
Verify that both `amount` and `quantity` exist in the table schema from `getTable` before proceeding.
```json
{
"name": "orders_amount_quality",
"description": "Alert when amount is null or quantity is negative",
"table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++analytics:core.orders",
"alert_condition": {
"type": "GROUP",
"operator": "OR",
"conditions": [
{
"type": "UNARY",
"predicate": {"name": "null", "negated": false},
"value": [{"type": "FIELD", "field": "amount"}]
},
{
"type": "UNARY",
"predicate": {"name": "is_negative", "negated": false},
"value": [{"type": "FIELD", "field": "quantity"}]
}
]
}
}
```
The OR operator means an alert fires if either condition matches -- the row has a null amount OR a negative quantity.
### Between check with nested AND/OR
Verify that `score` and `status` exist in the table schema from `getTable` before proceeding.
```json
{
"name": "records_score_validation",
"description": "Alert when score is outside 0-100 range for active records",
"table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++warehouse:metrics.records",
"alert_condition": {
"type": "GROUP",
"operator": "AND",
"conditions": [
{
"type": "BINARY",
"predicate": {"name": "equal", "negated": false},
"left": [{"type": "FIELD", "field": "status"}],
"right": [{"type": "LITERAL", "literal": "active"}]
},
{
"type": "BINARY",
"predicate": {"name": "between", "negated": true},
"left": [{"type": "FIELD", "field": "score"}],
"right": [
{"type": "LITERAL", "literal": "0"},
{"type": "LITERAL", "literal": "100"}
]
}
]
}
}
```
This uses `between` with `negated: true` to alert when score is outside the 0-100 range, but only for active records (the AND operator requires both conditions to match).
### Referential integrity with SQL subquery
Verify that `customer_id` exists in the table schema from `getTable` before proceeding.
```json
{
"name": "orders_valid_customer",
"description": "Alert when customer_id does not exist in customers table",
"table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++analytics:core.orders",
"alert_condition": {
"type": "GROUP",
"operator": "AND",
"conditions": [
{
"type": "SQL",
"sql": "customer_id IS NOT NULL AND customer_id NOT IN (SELECT id FROM analytics.core.customers)"
}
]
}
}
```
The SQL condition type is useful for referential integrity checks that require subqueries. The `customer_id IS NOT NULL` guard avoids alerting on null values (which should be caught by a separate null check if needed).
### Contains and starts_with checks
Verify that `email` and `phone` exist in the table schema from `getTable` before proceeding.
```json
{
"name": "contacts_format_validation",
"description": "Alert when email contains test data or phone has invalid prefix",
"table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++warehouse:crm.contacts",
"alert_condition": {
"type": "GROUP",
"operator": "OR",
"conditions": [
{
"type": "BINARY",
"predicate": {"name": "contains", "negated": false},
"left": [{"type": "FIELD", "field": "email"}],
"right": [{"type": "LITERAL", "literal": "@test.example.com"}]
},
{
"type": "BINARY",
"predicate": {"name": "starts_with", "negated": false},
"left": [{"type": "FIELD", "field": "phone"}],
"right": [{"type": "LITERAL", "literal": "000"}]
}
]
}
}
```
---
## Fallback: Custom SQL Monitor
If `createValidationMonitorMac` fails -- for example because a referenced column does not exist yet in the live table, or the warehouse does not support a particular predicate -- fall back to `createCustomSqlMonitorMac` with an explicit SQL query instead.
A custom SQL monitor lets you express any validation logic as a SQL query that returns rows or a count. This is always available as a backup when the structured validation condition tree cannot express what you need or encounters an API error.
When falling back:
1. Translate the intended validation logic into a SQL query.
2. The SQL should select rows that violate the rule (matching the same "conditions match INVALID data" principle).
3. Use `createCustomSqlMonitorMac` with the translated query.
4. Inform the user that you used a custom SQL monitor as a fallback and explain why.
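As an illustration of steps 1-2, a hypothetical helper that wraps a violation predicate into the query a custom SQL monitor would run. The table name and predicate here are examples, not real monitors:

```python
# Hypothetical translation step for the custom SQL fallback
# (illustrative only -- adapt table and predicate to the real model).

def violation_query(table, violation_predicate):
    # Select rows that VIOLATE the rule, matching the same
    # "conditions match INVALID data" principle: any returned row
    # (or a count > 0) should trigger the alert.
    return f"SELECT *\nFROM {table}\nWHERE {violation_predicate}"

sql = violation_query(
    "analytics.core.orders",
    "status NOT IN ('active', 'pending', 'inactive')",
)
```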
@@ -0,0 +1,252 @@
---
name: monte-carlo-prevent
description: "Surfaces Monte Carlo data observability context (table health, alerts, lineage, blast radius) before SQL/dbt edits."
category: data
risk: safe
source: community
source_repo: monte-carlo-data/mc-agent-toolkit
source_type: community
date_added: "2026-04-08"
author: monte-carlo-data
tags: [data-observability, dbt, schema, monte-carlo, lineage]
tools: [claude, cursor, codex]
---
# Monte Carlo Prevent Skill
This skill brings Monte Carlo's data observability context directly into your editor. When you're modifying a dbt model or SQL pipeline, use it to surface table health, lineage, active alerts, and to generate monitors-as-code without leaving Claude Code.
Reference files live next to this skill file. **Use the Read tool** (not MCP resources) to access them:
- Full workflow step-by-step instructions: `references/workflows.md` (relative to this file)
- MCP parameter details: `references/parameters.md` (relative to this file)
- Troubleshooting: `references/TROUBLESHOOTING.md` (relative to this file)
## When to activate this skill
**Do not wait to be asked.** Run the appropriate workflow automatically whenever the user:
- References or opens a `.sql` file or dbt model (files in `models/`) → run Workflow 1
- Mentions a table name, dataset, or dbt model name in passing → run Workflow 1
- Describes a planned change to a model (new column, join update, filter change, refactor) → **STOP — run Workflow 4 before writing any code**
- Adds a new column, metric, or output expression to an existing
model → run Workflow 4 first, then ALWAYS offer Workflow 2
regardless of risk tier — do not skip the monitor offer
- Asks about data quality, freshness, row counts, or anomalies → run Workflow 1
- Wants to triage or respond to a data quality alert → run Workflow 3
Present the results as context the engineer needs before proceeding — not as a response to a question.
## When NOT to activate this skill
Do not invoke Monte Carlo tools for:
- Seed files (files in seeds/ directory)
- Analysis files (files in analyses/ directory)
- One-off or ad-hoc SQL scripts not part of a dbt project
- Configuration files (dbt_project.yml, profiles.yml, packages.yml)
- Test files unless the user is specifically asking about data quality
If uncertain whether a file is a dbt model, check for {{ ref() }} or {{ source() }}
Jinja references — if absent, do not activate.
### Macros and snapshots — gate edits, skip auto-context
Macro files (`macros/`) and snapshot files (`snapshots/`) are **not** models, so
do not auto-fetch Monte Carlo context (Workflow 1) when they are opened. However,
macros are inlined into every model that calls them at compile time — a one-line
macro change can silently alter dozens of models. Snapshots control historical
tracking and are similarly sensitive.
**The pre-edit hook gates these files.** If the hook fires for a macro or snapshot,
identify which models are affected and run the change impact assessment (Workflow 4)
for those models before proceeding with the edit.
---
## REQUIRED: Change impact assessment before any SQL edit
**Before editing or writing any SQL for a dbt model or pipeline, you MUST run Workflow 4.**
This applies whenever the user expresses intent to modify a model — including phrases like:
- "I want to add a column…"
- "Let me add / I'm adding…"
- "I'd like to change / update / rename…"
- "Can you add / modify / refactor…"
- "Let's add…" / "Add a `<column>` column"
- Any other description of a planned schema or logic change
- "Exclude / filter out / remove [records/customers/rows]…"
- "Adjust / increase / decrease [threshold/parameter/value]…"
- "Fix / bugfix / patch [issue/bug]…"
- "Revert / restore / undo [change/previous behavior]…"
- "Disable / enable [feature/logic/flag]…"
- "Clean up / remove [references/columns/code]…"
- "Implement [backend/feature] for…"
- "Create [models/dbt models] for…" (when modifying existing referenced tables)
- "Increase / decrease / change [max_tokens/threshold/date constant/numeric parameter]…"
- Any change to a hardcoded value, constant, or configuration parameter within SQL
- "Drop / remove / delete [column/field/table]"
- "Rename [column/field] to [new name]"
- "Add [column]" (short imperative form, e.g. "add a created_at column")
- Any single-verb imperative command targeting a column, table, or model
(e.g. "drop X", "rename Y", "add Z", "remove W")
Parameter changes (threshold values, date constants, numeric limits) appear
safe but silently change model output. Treat them the same as logic changes
for impact assessment purposes.
**Do not write or edit any SQL until the change impact assessment (Workflow 4) has been presented to the user.** The assessment must come first — not after the edit, not in parallel.
---
## Pre-edit gate — check before modifying any file
**Before calling Edit, Write, or MultiEdit on any `.sql` or dbt model
file, you MUST check:**
1. Has the synthesis step been run for THIS SPECIFIC CHANGE in the
current prompt?
2. **If YES** → proceed with the edit
3. **If NO** → stop immediately, run Workflow 4, present the full
report with synthesis connected to this specific change.
**If risk is High or Medium:** ask "Do you want me to proceed
with the edit?" and wait for explicit confirmation.
**If risk is Low:** use judgment — proceed if straightforward
and no concerns found, otherwise ask before editing.
**Important: "Workflow 4 already ran this session" is NOT sufficient
to proceed.** Each distinct change prompt requires its own synthesis
step connecting the MC findings to that specific change.
The synthesis must reference the specific columns, filters, or logic
being changed in the current prompt — not just general table health.
Example:
- ✅ "Given 34 downstream models depend on is_paying_workspace,
adding 'MC Internal' to the exclusion list will exclude these
workspaces from all downstream health scores and exports.
Confirm?"
- ❌ "Workflow 4 already ran. Making the edit now."
The only exception: if the user explicitly acknowledges the risk
and confirms they want to skip (e.g. "I know the risks, just make
the change") — proceed but note the skipped assessment.
## Available MCP tools
All tools are available via the `monte-carlo` MCP server.
| Tool | Purpose |
| ---------------------------- | -------------------------------------------------------------------- |
| `testConnection` | Verify auth and connectivity |
| `search` | Find tables/assets by name |
| `getTable` | Schema, stats, metadata for a table |
| `getAssetLineage` | Upstream/downstream dependencies (call with mcons array + direction) |
| `getAlerts` | Active incidents and alerts |
| `getMonitors` | Monitor configs — filter by table using mcons array |
| `getQueriesForTable` | Recent query history |
| `getQueryData` | Full SQL for a specific query |
| `createValidationMonitorMac` | Generate validation monitors-as-code YAML |
| `createMetricMonitorMac` | Generate metric monitors-as-code YAML |
| `createComparisonMonitorMac` | Generate comparison monitors-as-code YAML |
| `createCustomSqlMonitorMac` | Generate custom SQL monitors-as-code YAML |
| `getValidationPredicates` | List available validation rule types |
| `updateAlert` | Update alert status/severity |
| `setAlertOwner` | Assign alert ownership |
| `createOrUpdateAlertComment` | Add comments to alerts |
| `getAudiences` | List notification audiences |
| `getDomains` | List MC domains |
| `getUser` | Current user info |
| `getCurrentTime` | ISO timestamp for API calls |
## Core workflows
Each workflow has detailed step-by-step instructions in `references/workflows.md` (Read tool).
### 1. Table health check
**When:** User opens a dbt model or mentions a table.
**What:** Surfaces health, lineage, alerts, and risk signals. Auto-escalates to Workflow 4 if change intent is detected and risk signals are present.
### 2. Add a monitor
**When:** New column, filter, or business rule is added to a model.
**What:** Suggests and generates monitors-as-code YAML using the appropriate `create*MonitorMac` tool. Saves to `monitors/<table_name>.yml`.
### 3. Alert triage
**When:** User is investigating an active data quality incident.
**What:** Lists open alerts, checks table state, traces lineage for root cause, reviews recent queries.
### 4. Change impact assessment — REQUIRED before modifying a model
**When:** Any intent to modify a dbt model's logic, columns, joins, or filters.
**What:** Surfaces blast radius, downstream dependencies, active incidents, monitor coverage, and query exposure. Produces a risk-tiered report with synthesis connecting findings to specific code recommendations. See `references/workflows.md` for the full assessment sequence, report format, and synthesis rules.
### 5. Change validation queries
**When:** Explicit engineer request only (e.g. "validate this change", "ready to commit").
**What:** Generates 3-5 targeted SQL queries to verify the change behaved as intended. Uses Workflow 4 context — requires both impact assessment and file edit in session.
---
## Post-synthesis confirmation rules
Always end the synthesis with one clear, specific recommendation in plain English:
"Given the above, I recommend: [specific action]"
**If the risk is High or Medium:** STOP and wait for confirmation before editing
any file. You must ask the engineer and receive an explicit "yes", "go ahead",
"proceed", or similar confirmation before making code changes.
Say: "Do you want me to proceed with the edit?"
Do NOT say: "Proceeding with the edit." — that skips the engineer's decision.
**If the risk is Low:** Use your judgment based on the synthesis findings. If
the change is straightforward and the synthesis found no concerns, you may
proceed. If anything is surprising or worth flagging, ask before editing.
---
## Session markers
These markers coordinate between the skill and the plugin's hooks. Output each
on its own line when the condition is met.
### Impact check complete
After the engineer confirms (High/Medium) or after presenting the synthesis (Low),
output one marker per assessed table. **IMPORTANT: use only the table/model name, not the full MCON:**
<!-- MC_IMPACT_CHECK_COMPLETE: <table_name> -->
(Use the model filename without .sql extension — NOT "acme.analytics.orders" or "prod.public.client_hub")
How many markers to emit depends on how the assessment was triggered:
**Hook-triggered** (the pre-edit hook blocked an edit and instructed you to run
the assessment): Be strict — only emit markers for tables whose lineage **and**
monitor coverage were fetched directly via Monte Carlo tools in this session. If
the engineer describes changes to multiple tables but only one was formally
assessed, emit only one marker. The pre-edit hook will gate the other tables and
prompt for their own Workflow 4 runs.
**Voluntarily invoked** (the engineer proactively asked for an impact assessment):
Be looser — emit markers for all tables the assessment meaningfully covered, even
if some were assessed via lineage context rather than direct MC tool calls. The
engineer is already safety-conscious; don't force redundant assessments for tables
they clearly considered.
### Monitor coverage gap
When Workflow 4 finds zero custom monitors on a table's affected columns, output:
<!-- MC_MONITOR_GAP: <table_name> -->
Use only the table/model name (NOT the full MCON). This allows the plugin's hooks
to remind the engineer about monitor coverage at commit time. Only output this
marker when the gap is specifically about the columns or logic being changed —
not for general table-level monitor absence.
@@ -0,0 +1,23 @@
## Troubleshooting
### MCP connection fails:
```bash
# Verify the server is reachable
curl -s -o /dev/null -w "%{http_code}" https://integrations.getmontecarlo.com/mcp/
```
**If using the plugin (OAuth):** Run `/mcp` in Claude Code, select the `monte-carlo` server, and re-authenticate. If the browser flow doesn't complete, copy the callback URL from your browser's address bar into the URL prompt that appears in Claude Code.
**Legacy (header-based auth, for MCP clients without HTTP transport):** Check that `x-mcd-id` and `x-mcd-token` are set correctly in your MCP config. The key format is `<KEY_ID>:<KEY_SECRET>` — these are split across two separate headers.
### Monitor creation errors:
**`montecarlo monitors apply` fails with "Unknown field":**
Monitor definition files must have `montecarlo:` as the root key — do not copy the `validation:` or `custom_sql:` output from the MCP tools directly. Reformat using the `montecarlo: > custom_sql:` structure shown in Workflow 2.
**`montecarlo monitors apply` fails with "Not a Monte Carlo project":**
Ensure `montecarlo.yml` (the project config) exists in the working directory. This file must contain only `version`, `namespace`, and `default_resource` — not monitor definitions.
**`createValidationMonitorMac` fails with a Snowflake error:**
This tool validates the condition SQL against the live table. If the column doesn't exist yet (e.g. you're writing the monitor before deploying the model change), fall back to `createCustomSqlMonitorMac` with an explicit SQL query instead.
@@ -0,0 +1,32 @@
# MCP Parameter Notes
Important parameter details for Monte Carlo MCP tools. Consult when making API
calls to avoid common mistakes.
---
## `getAlerts` — use snake_case parameters
The MCP tool uses Python snake_case, **not** the camelCase params from the MC web UI:
```
✓ created_after (not createdTime.after)
✓ created_before (not createdTime.before)
✓ order_by (not orderBy)
✓ table_mcons (not tableMcons)
```
Always provide `created_after` and `created_before`. Max window is 60 days.
Use `getCurrentTime()` to get the current ISO timestamp when needed.
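If you need to compute the window locally instead of via `getCurrentTime` (e.g. when scripting outside an MCP session), a minimal sketch that respects the 60-day cap -- the `alert_window` helper is hypothetical, not an MCP tool:

```python
# Hypothetical local computation of created_after / created_before
# for getAlerts. The window must not exceed 60 days; 7 days is the
# lookback used by the prevent workflows.
from datetime import datetime, timedelta, timezone

def alert_window(days_back=7):
    if not 0 < days_back <= 60:
        raise ValueError("getAlerts window is capped at 60 days")
    now = datetime.now(timezone.utc)
    return {
        "created_after": (now - timedelta(days=days_back)).isoformat(),
        "created_before": now.isoformat(),
    }

params = alert_window(7)
```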
---
## `search` — finding the right table identifier
MC uses MCONs (Monte Carlo Object Names) as table identifiers. Always use
`search` first to resolve a table name to its MCON before calling `getTable`,
`getAssetLineage`, or `getAlerts`.
```
search(query="orders_status") → returns mcon, full_table_id, warehouse
```
@@ -0,0 +1,478 @@
# Workflow Details
Detailed step-by-step instructions for each Monte Carlo Prevent workflow.
These are referenced from the main SKILL.md — consult the relevant section when
executing a workflow.
---
## Workflow 1: Table health check — when opening or editing a model
When the user opens a dbt model or mentions a table, run this sequence automatically:
```
1. search(query="<table_name>") → get the full MCON/table identifier
2. getTable(mcon="<mcon>") → schema, freshness, row count, importance score, monitoring status
3. getAssetLineage(mcon="<mcon>") → upstream sources, downstream dependents
4. getAlerts(created_after="<7 days ago>", created_before="<now>", table_mcons=["<mcon>"]) → active alerts
```
Summarize for the user:
- **Health**: last updated, row count, is it monitored?
- **Lineage**: N upstream sources, M downstream consumers (name the important ones)
- **Alerts**: any active/unacknowledged incidents — lead with these if present
- **Risk signals** (lite): flag if importance score is high, if key assets are downstream, or if alerts are already firing — these indicate the table warrants extra care before modification
Example summary to offer unprompted when a dbt model file is opened:
> "The table `orders_status` was last updated 2 hours ago with 142K rows. It has 3 downstream dependents including `order_status_snapshot` (key asset). There are 2 active freshness alerts — this table warrants extra care before modification. Want me to run a full change impact assessment?"
**Auto-escalation rule — after completing steps 1-4 above:**
First, check whether the user has expressed intent to modify the model
in this session (e.g. mentioned a change, asked to add/edit/fix something).
IF change intent has been expressed AND any of the following are true:
- One or more active/unacknowledged alerts exist on the table
- One or more downstream dependents are key assets
- The table's importance score is above 0.8
→ Ask the user before running Workflow 4:
"This is a high-importance table with [N active alerts / key asset
dependents / importance score 0.989]. Do you want me to run a full
change impact assessment before proceeding? (yes/no)"
→ Wait for confirmation. If yes → run Workflow 4.
If no → proceed but note: "Skipping impact assessment at your request."
IF risk signals exist but NO change intent has been expressed:
→ Surface the health summary and note the risk signals only:
"This is a high-importance table with key asset dependents. When
you're ready to make changes, say 'run impact assessment' or just
describe your change and I'll run it automatically."
→ Do NOT run Workflow 4. Do NOT ask about running Workflow 4.
### New model creation variant
When the user is creating a new .sql dbt model file (not editing an existing one):
1. Parse all {{ ref('...') }} and {{ source('...', '...') }} calls from the SQL
2. For each referenced table, run the standard Workflow 1 health check:
search() → getTable() → getAlerts()
3. Surface a consolidated upstream health summary:
"Your new model references N upstream tables. Here's their current health:"
- List each with: last updated, active alerts (if any), key asset flag
4. Flag any upstream table with active alerts as a risk:
"⚠️ <table_name> has <N> active alerts — your new model will inherit this data quality issue"
Skip getAssetLineage for new models — they have no downstream dependents yet.
Skip Workflow 4 for new models — there is no existing blast radius to assess.
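Step 1 of this variant -- extracting referenced tables from the new model's SQL -- can be sketched with a simple regex. This is illustrative only; real dbt Jinja (loops, variables, multi-line calls) can exceed what this pattern handles:

```python
import re

# Illustrative ref()/source() extraction covering the common
# single- and double-quoted, single-line forms.
REF_RE = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")
SOURCE_RE = re.compile(
    r"\{\{\s*source\(\s*['\"]([^'\"]+)['\"]\s*,\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}"
)

def upstream_tables(sql):
    refs = REF_RE.findall(sql)
    sources = [f"{s}.{t}" for s, t in SOURCE_RE.findall(sql)]
    return refs + sources

sql = """
select o.id, c.name
from {{ ref('stg_orders') }} o
join {{ source('crm', 'customers') }} c on c.id = o.customer_id
"""
```

Each extracted name then goes through the standard `search() → getTable() → getAlerts()` health check.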
---
## Workflow 2: Add a monitor — when new transformation logic is added
> **For detailed monitor creation guidance** — including parameter validation, field-type compatibility checks, and common error prevention — see the `monte-carlo-monitor-creation` skill (`skills/monte-carlo-monitor-creation/SKILL.md`). The workflow below is a quick-path for the common "just added a column, offer a monitor" case within a prevent session.
When the user adds a new column, filter, or business rule, suggest adding a monitor. First, choose the monitor type based on what the new logic does:
```
- New column with a row-level condition (null check, range, regex)
→ createValidationMonitorMac
- New aggregate metric (row count, sum, average, percentile over time)
→ createMetricMonitorMac
- Logic that should match another table or a prior time period
→ createComparisonMonitorMac
- Complex business rule that doesn't fit the above
→ createCustomSqlMonitorMac
```
Then run the appropriate sequence:
```
1. Read the SQL file being edited to extract the specific transformation logic:
- Confirm the file path from conversation context (do not guess or assume)
- If no file path is clear, ask the engineer: "Which file contains the new logic?"
- Extract the specific new column definition, filter condition, or business rule
- Use this logic directly when constructing the monitor condition in step 3
2. For validation monitors: getValidationPredicates() → show what validation types are available
For all types: determine the right tool from the selection guide above
3. Call the selected create*MonitorMac tool:
- createValidationMonitorMac(mcon, description, condition_sql) → returns YAML
- createMetricMonitorMac(mcon, description, metric, operator) → returns YAML
- createComparisonMonitorMac(source_table, target_table, metric) → returns YAML
- createCustomSqlMonitorMac(mcon, description, sql) → returns YAML
⚠ If createValidationMonitorMac fails (e.g. column doesn't exist yet in the live table),
fall back to createCustomSqlMonitorMac with an explicit SQL query instead.
4. Save the YAML to <project>/monitors/<table_name>.yml
5. Run: montecarlo monitors apply --dry-run (to preview)
6. Run: montecarlo monitors apply --auto-yes (to apply)
```
**Important — YAML format for `monitors apply`:**
All `create*MonitorMac` tools return YAML that is not directly compatible with `montecarlo monitors apply`. Reformat the output into a standalone monitor file with `montecarlo:` as the root key. The second-level key matches the monitor type: `custom_sql:`, `validation:`, `metric:`, or `comparison:`. The example below shows `custom_sql:` — substitute the appropriate key for other monitor types.
```yaml
# monitors/<table_name>.yml ← monitor definitions only, NOT montecarlo.yml
montecarlo:
custom_sql:
- warehouse: <warehouse_name>
name: <monitor_name>
description: <description>
schedule:
interval_minutes: 720
start_time: '<ISO timestamp>'
sql: <your validation SQL>
alert_conditions:
- operator: GT
threshold_value: 0.0
```
The `montecarlo.yml` project config is a **separate file** in the project root containing only:
```yaml
# montecarlo.yml ← project config only, NOT monitor definitions
version: 1
namespace: <your-namespace>
default_resource: <warehouse_name>
```
Do NOT put `version:`, `namespace:`, or `default_resource:` inside monitor definition files.
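The two-file split can be checked mechanically before running `montecarlo monitors apply`. A hypothetical sanity check, shown here over already-parsed YAML (plain dicts, so no YAML library is required):

```python
# Hypothetical pre-apply check for the monitor-file layout.
# Operates on parsed YAML represented as plain dicts.
PROJECT_KEYS = {"version", "namespace", "default_resource"}
MONITOR_TYPES = {"custom_sql", "validation", "metric", "comparison"}

def check_monitor_file(doc):
    errors = []
    if set(doc) != {"montecarlo"}:
        errors.append("monitor file must have 'montecarlo:' as the only root key")
    else:
        body = doc["montecarlo"]
        if PROJECT_KEYS & set(body):
            errors.append("project config keys belong in montecarlo.yml, not here")
        if not MONITOR_TYPES & set(body):
            errors.append("no recognized monitor type key found")
    return errors

good = {"montecarlo": {"custom_sql": [{"name": "orders_check"}]}}
bad = {"montecarlo": {"version": 1, "custom_sql": []}}
```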
---
## Workflow 3: Alert triage — when investigating an active incident
```
1. getAlerts(
created_after="<start>",
created_before="<end>",
order_by="-createdTime",
statuses=["NOT_ACKNOWLEDGED"]
) → list open alerts
2. getTable(mcon="<affected_table_mcon>") → check current table state
3. getAssetLineage(mcon="<mcon>") → identify upstream cause or downstream blast radius
4. getQueriesForTable(mcon="<mcon>") → recent queries that might explain the anomaly
```
To respond to an alert:
- `updateAlert(alert_id="<id>", status="ACKNOWLEDGED")` — acknowledge it
- `setAlertOwner(alert_id="<id>", owner="<email>")` — assign ownership
- `createOrUpdateAlertComment(alert_id="<id>", comment="<text>")` — add context
---
## Workflow 4: Change impact assessment — REQUIRED before modifying a model
**Trigger:** Any expressed intent to add, rename, drop, or change a column, join, filter, or model logic. Run this immediately — before writing any code — even if the user hasn't asked for it.
### Bugfixes and reverts require impact assessment too
When the user says "fix", "revert", "restore", or "undo", run this workflow
before writing any code — even if the change seems small or safe.
A revert that undoes a column addition or changes join logic has the same
blast radius as the original change. Downstream models may have already
adapted to the "incorrect" behavior, meaning the fix itself could break them.
Pay special attention to:
- Whether the revert removes a column other models now depend on
- Whether downstream models reference the specific logic being reverted
- Whether active alerts may be related to the change being reverted
When the user is about to rename or drop a column, change a join condition, alter a filter, or refactor a model's logic, run this sequence to surface the blast radius before any changes are committed:
```
1. search(query="<table_name>") + getTable(mcon="<mcon>")
→ importance score, query volume (reads/writes per day), key asset flag
2. getAssetLineage(mcon="<mcon>")
→ full list of downstream dependents; for each, note whether it is a key asset
3. getTable(mcon="<downstream_mcon>") for each key downstream asset
→ importance score, last updated, monitoring status
4. getAlerts(
created_after="<7 days ago>",
created_before="<now>",
table_mcons=["<mcon>", "<downstream_mcon_1>", ...],
statuses=["NOT_ACKNOWLEDGED"]
)
→ any active incidents already affecting this table or its dependents
5. getQueriesForTable(mcon="<mcon>")
→ recent queries; scan for references to the specific columns being changed
→ use getQueryData(query_id="<id>") to fetch full SQL for ambiguous cases
5b. Supplementary local search for downstream dbt refs:
- Search the local models/ directory for ref('<table_name>') (single-hop only)
- Compare results against getAssetLineage output from step 2
- If any local models reference this table but are NOT in MC's lineage results:
"⚠️ Found N local model(s) referencing this table not yet in MC's lineage: [list]"
- If no models/ directory exists in the current project, skip silently
- MC lineage remains the authoritative source — local grep is supplementary only
6. getMonitors(mcon="<mcon>")
→ which monitors are watching columns or metrics affected by the change
```
### Risk tier assessment
| Tier | Conditions |
|---|---|
| 🔴 High | Key asset downstream, OR active alerts already firing, OR >50 reads/day |
| 🟡 Medium | Non-key assets downstream, OR monitors on affected columns, OR moderate query volume |
| 🟢 Low | No downstream dependents, no active alerts, low query volume |
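The tier table can be read as a simple predicate. A sketch, assuming the "moderate query volume" cutoff is >10 reads/day (only the >50 reads/day threshold comes from the table; the rest of the names are illustrative):

```python
def risk_tier(
    has_key_asset_downstream: bool,
    active_alerts: int,
    reads_per_day: int,
    downstream_count: int,
    monitors_on_affected_columns: int,
) -> str:
    """Classify a change per the tier table.

    The >50 reads/day cutoff comes from the table; the >10 reads/day
    "moderate volume" cutoff is an illustrative assumption.
    """
    if has_key_asset_downstream or active_alerts > 0 or reads_per_day > 50:
        return "High"
    if downstream_count > 0 or monitors_on_affected_columns > 0 or reads_per_day > 10:
        return "Medium"
    return "Low"
```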
### Multi-model changes
When the user is changing multiple models in the same session or same domain
(e.g., 3 timeseries models, 4 criticality_score models):
- Run a single consolidated impact assessment across all changed tables
- Deduplicate downstream dependents — if two changed tables share a downstream
dependent, count it once and note that it's affected by multiple upstream changes
- Present a unified blast radius report rather than N separate reports
- Escalate risk tier if the combined blast radius is larger than any individual table
Example consolidated report header:
"## Change Impact: 3 models in timeseries domain
Combined downstream blast radius: 28 tables (deduplicated)
Highest risk table: timeseries_detector_routing (22 downstream refs)"
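The deduplication step amounts to a set union over per-table lineage results. A sketch (function name and report keys are illustrative):

```python
def consolidated_blast_radius(downstream_by_table: dict[str, set[str]]) -> dict:
    """Deduplicate downstream dependents across several changed tables.

    downstream_by_table maps each changed table to the set of downstream
    dependents reported by getAssetLineage for that table.
    """
    combined: set[str] = set()
    hit_counts: dict[str, int] = {}
    for deps in downstream_by_table.values():
        combined |= deps
        for dep in deps:
            hit_counts[dep] = hit_counts.get(dep, 0) + 1
    return {
        # deduplicated total for the unified report header
        "combined_count": len(combined),
        # dependents affected by more than one upstream change
        "affected_by_multiple": sorted(d for d, n in hit_counts.items() if n > 1),
    }
```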
### Report format
```
## Change Impact: <table_name>
Risk: 🔴 High / 🟡 Medium / 🟢 Low
Downstream blast radius:
- <N> tables depend on this model
- Key assets affected: <list or "none">
Active incidents:
- <alert title, status> or "none"
Column exposure (for columns being changed):
- Found in <N> recent queries (e.g. <query snippet>)
Monitor coverage:
- <monitor name> watches <metric> — will be affected by this change
- If zero custom monitors exist → append:
"⚠️ No custom monitors on this table. After making your changes,
I'll suggest a monitor for the new logic — or say 'add a monitor'
to do it now."
Recommendation:
- <specific callout, e.g. "Notify owners of downstream_table before deploying",
"Coordinate with the freshness alert owner", "Add a monitor for the new column">
```
If risk is 🔴 High:
1. Call `getAudiences()` to retrieve configured notification audiences
2. Include in the recommendation: "Notify: <audience names / channels>"
3. Proactively suggest:
- Notifying owners of downstream key assets (`setAlertOwner` / `createOrUpdateAlertComment` on active alerts)
- Adding a monitor for the new logic before deploying (Workflow 2)
- Running `montecarlo monitors apply --dry-run` after changes to verify nothing breaks
### Synthesis: translate findings into code recommendations
After presenting the impact report, use the findings to shape your code suggestion.
Do not present MC data and then write code as if the data wasn't there.
Explicitly connect each key finding to a specific recommendation:
- Active alerts firing on the table:
→ Recommend deferring or minimally scoping the change until alerts are resolved
→ Explain: "There are N active alerts on this table — making this change now
risks compounding an existing data quality issue"
- Key assets downstream:
→ Recommend defensive coding patterns: null guards, backward-compatible changes,
additive-only schema changes where possible
→ Explain: "X downstream key assets depend on this table — I'd recommend
writing this as [specific pattern] to avoid breaking [specific dependent]"
- Monitors on affected columns:
→ Call out that the change will affect monitor coverage
→ Recommend updating monitors alongside the code change (offer Workflow 2)
→ Explain: "The existing monitor on [column] will need to be updated to
account for this change"
- New output column or logic being added:
→ Always offer Workflow 2 after the impact assessment, regardless
of existing monitor coverage
→ Do not skip this step even if risk tier is 🟢 Low
→ Say explicitly: "This adds new output logic — would you like me
to generate a monitor for it? I can add a null check, range
validation, or custom SQL rule."
→ Wait for the user's response before proceeding with the edit
- High read volume (>50 reads/day):
→ Recommend extra caution around column renames or removals
→ Suggest backward-compatible transition (add new column, deprecate old one)
→ Explain: "This table has [N] reads/day — a column rename without a
transition period would break downstream consumers immediately"
- Column renames, even inside CTEs:
→ Never assume a CTE-internal rename is safe. Always check:
1. Does this column appear in the final SELECT, directly or
via a CTE that feeds into the final SELECT?
2. If yes — treat as a breaking change. Recommend a
backward-compatible transition: add the correctly-named
column, keep the old one temporarily, remove in a
follow-up PR.
3. If truly internal and never surfaces in output — confirm
this explicitly before proceeding.
→ Explain: "Even though this column is defined in a CTE, if it
surfaces in the final SELECT it is a public output column —
renaming it breaks any downstream model selecting it by name."
---
## Workflow 5: Change validation queries — after a code change is made
**Trigger:** Explicit engineer intent only. Activate when the engineer says something like:
- "generate validation queries", "validate this change", "I'm done with this change"
- "let me test this", "write queries to check this", "ready to commit"
**Required session context — do not activate without both:**
1. Workflow 4 (change impact assessment) has run for this table in this session
2. A file edit was made to a `.sql` or dbt model file for that same table
**Do NOT activate automatically after file edits. Do NOT proactively offer after Workflow 4 or file edits. The engineer asks when they are ready.**
---
### What this workflow does
Using the context already in the session — the Workflow 4 findings, the file diff, and the `getTable` result — generate 3-5 targeted SQL validation queries that directly test whether this specific change behaved as intended.
These are not generic templates. Use the semantic meaning of the change from Workflow 4 context: which columns changed and why, what business logic was affected, what downstream models depend on this table, and what monitors exist. A null check on a new `days_since_contract_start` column should verify it is never negative and never null for rows with a `contract_start_date` — not just check for nulls generically.
---
### Step 1 — Identify the change type from session context
From Workflow 4 findings and the file diff, classify the primary change. A change may span multiple types — classify the dominant one and note secondaries:
- **New column** — a new output column was added to the SELECT
- **Filter change** — a WHERE clause, IN-list, or CASE condition was modified
- **Join change** — a JOIN condition or join target was modified
- **Column rename or drop** — an existing output column was renamed or removed
- **Parameter change** — a hardcoded threshold, constant, or numeric value was changed
- **New model** — the file was newly created, no production baseline exists
---
### Step 2 — Determine warehouse context from Workflow 4
From the `getTable` result already in session context, extract:
- **Fully qualified table name** — e.g. `analytics.prod_internal_bi.client_hub_master`
- **Warehouse type** — Snowflake, BigQuery, Redshift, Databricks
- **Schema** — already resolved, do not re-derive
Use the correct SQL dialect for the warehouse type. Key differences:
| Warehouse | Date diff | Current timestamp | Notes |
|---|---|---|---|
| Snowflake | `DATEDIFF('day', a, b)` | `CURRENT_TIMESTAMP()` | `QUALIFY` supported |
| BigQuery | `DATE_DIFF(a, b, DAY)` | `CURRENT_TIMESTAMP()` | Use subquery instead of `QUALIFY` |
| Redshift | `DATEDIFF('day', a, b)` | `GETDATE()` | |
| Databricks | `DATEDIFF(a, b)` | `CURRENT_TIMESTAMP()` | |
For the dev database, use the placeholder `<YOUR_DEV_DATABASE>` with a comment instructing the engineer to replace it. Do not guess the dev database name.
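When generating queries, a small lookup keeps the dialect differences straight. A sketch mirroring the date-diff column of the table above (the helper itself is illustrative):

```python
# Day-difference expression templates, per the dialect table.
DATE_DIFF_DAYS = {
    "snowflake":  "DATEDIFF('day', {a}, {b})",
    "bigquery":   "DATE_DIFF({a}, {b}, DAY)",
    "redshift":   "DATEDIFF('day', {a}, {b})",
    "databricks": "DATEDIFF({a}, {b})",
}

def date_diff_days(warehouse: str, a: str, b: str) -> str:
    """Render a day-difference expression in the warehouse's dialect."""
    return DATE_DIFF_DAYS[warehouse.lower()].format(a=a, b=b)
```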
---
### Step 3 — Apply database targeting rules (mandatory)
These rules are not negotiable — violating them produces queries that will fail at runtime:
- **Columns or logic that only exist post-change** → dev database only. Never query production for a column that doesn't exist there yet.
- **Comparison queries (before vs after)** → both production and dev databases
- **New model (no production baseline)** → dev database only for all queries
- **Row count comparison** → always include, always query both databases
---
### Step 4 — Generate targeted validation queries
Always include a row count comparison regardless of change type — it's the baseline signal that something unexpected happened.
Then generate change-specific queries based on what needs to be validated for this change type. Use the exact conditions, column names, and business logic from the diff and Workflow 4 findings — not generic placeholders. The goal for each change type:
**New column:** Verify the column is non-null where it should be non-null (based on its business meaning), that its value range is plausible, and that its distribution makes sense given the underlying data. Query dev only.
**Filter change:** Verify that only the intended rows were reclassified — generate a before/after count showing how many rows were added or removed by the new condition using the exact filter logic from the diff, and a sample of the rows that changed classification. The sample helps the engineer confirm the right records moved.
**Join change:** Verify that the join didn't introduce duplicates — a uniqueness check on the join key is essential. Also verify row count didn't change unexpectedly. Query dev for uniqueness, both databases for row count.
**Column rename or drop:** Verify the old column name is absent and the new column (if renamed) is present in the dev schema. Also verify that downstream models referencing the old column name are identified — use the local ref() grep results from Workflow 4 if available.
**Parameter or threshold change:** Verify the distribution of values affected by the change — how many rows moved above or below the new threshold, and whether the count matches the engineer's expectation. Query both databases to compare before and after.
**New model:** No production comparison possible. Verify row count is non-zero and plausible, sample rows look correct, and key columns are non-null. Query dev only.
---
### Step 5 — Add change-specific context to each query
For every query, include a SQL comment block that explains:
- What the query is checking
- What a healthy result looks like **for this specific change**
- What would indicate a problem
Derive this context from Workflow 4 findings. Use the business meaning of the change, not generic descriptions. For example, for adding `days_since_contract_start`:
```sql
/*
Null rate check: days_since_contract_start (new column, dev only)
What to look for:
- Null count should equal workspaces with no contract_start_date
- All rows with contract_start_date should have a non-null, non-negative value
- Values above 3650 (~10 years) are suspicious and may indicate a data issue
*/
```
This is what differentiates these queries from generic validation — the comment tells the engineer exactly what pass and fail look like for their specific change.
---
### Step 6 — Save to local file
Save all generated queries to:
```
validation/<table_name>_<YYYYMMDD_HHMM>.sql
```
Include a header at the top of the file:
```sql
/*
Validation queries for: <fully_qualified_table>
Change type: <change type from Step 1>
Generated: <timestamp>
Workflow 4 risk tier: <tier from this session>
Instructions:
1. Replace <YOUR_DEV_DATABASE> with your personal or branch database
2. Run the row count comparison first
3. Run change-specific queries to validate intended behavior
4. Unexpected results should be investigated before merging
*/
```
Then tell the engineer:
> "Validation queries saved to `validation/<table_name>_<timestamp>.sql`.
> Replace `<YOUR_DEV_DATABASE>` with your dev database and run in Snowflake
> or your preferred SQL client to verify the change behaved as expected."
---
### What this workflow does NOT do
- Does not execute queries (Phase 2)
- Does not require warehouse MCP connection
- Does not generate Monte Carlo notebook YAML
- Does not trigger automatically — only on explicit engineer request
- Does not activate if Workflow 4 has not run for this table in this session

---
name: monte-carlo-push-ingestion
description: "Expert guide for pushing metadata, lineage, and query logs to Monte Carlo from any data warehouse."
category: data
risk: safe
source: community
source_repo: monte-carlo-data/mc-agent-toolkit
source_type: community
date_added: "2026-04-08"
author: monte-carlo-data
tags: [data-observability, ingestion, monte-carlo, pycarlo, metadata]
tools: [claude, cursor, codex]
---
# Monte Carlo Push Ingestion
You are an agent that helps customers collect metadata, lineage, and query logs from their
data warehouses and push that data to Monte Carlo via the push ingestion API. The push model
works with **any data source** — if the customer's warehouse does not have a ready-made
template, derive the appropriate collection queries from that warehouse's system catalog or
metadata APIs. The push format and pycarlo SDK calls are the same regardless of source.
Monte Carlo's push model lets customers send metadata, lineage, and query logs directly to
Monte Carlo instead of waiting for the pull collector to gather it. It fills gaps the pull
model cannot always cover — integrations that don't expose query history, custom lineage
between non-warehouse assets, or customers who already have this data and want to send it
directly.
Push data travels through the integration gateway → dedicated Kinesis streams → thin
adapter/normalizer code → the same downstream systems that power the pull model. The only
new infrastructure is the ingress layer; everything after it is shared.
## MANDATORY — Always start from templates
When generating any push-ingestion script, you MUST:
1. **Read the corresponding template** before writing any code. Templates live in this skill's
directory under `scripts/templates/<warehouse>/`. To find them, glob for
`**/push-ingestion/scripts/templates/<warehouse>/*.py` — this works regardless of where the
skill is installed. Do NOT search from the current working directory alone.
2. **Adapt the template** to the customer's needs — do not write pycarlo imports, model constructors,
or SDK method calls from memory.
3. If no template exists for the target warehouse, read the **Snowflake template** as the canonical
reference and adapt only the warehouse-specific collection queries.
Template files follow this naming pattern:
- `collect_<flow>.py` — collection only (queries the warehouse, writes a JSON manifest)
- `push_<flow>.py` — push only (reads the manifest, sends to Monte Carlo)
- `collect_and_push_<flow>.py` — combined (imports from both, runs in sequence)
**After running any push script**, you MUST surface the `invocation_id`(s) returned by the API
to the user. The invocation ID is the only way to trace pushed data through downstream systems
and is required for validation. Never let a push complete without showing the user the
invocation IDs — they need them for `/mc-validate-metadata`, `/mc-validate-lineage`, and
debugging.
## Canonical pycarlo API — authoritative reference
The following imports, classes, and method signatures are the **ONLY** correct pycarlo API for
push ingestion. If your training data suggests different names, **it is wrong**. Use exactly
what is listed here.
### Imports and client setup
```python
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
# Metadata
RelationalAsset, AssetMetadata, AssetField, AssetVolume, AssetFreshness, Tag,
# Lineage
LineageEvent, LineageAssetRef, ColumnLineageField, ColumnLineageSourceField,
# Query logs
QueryLogEntry,
)
client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
service = IngestionService(mc_client=client)
```
### Method signatures
```python
# Metadata
service.send_metadata(resource_uuid=..., resource_type=..., events=[RelationalAsset(...)])
# Lineage (table or column)
service.send_lineage(resource_uuid=..., resource_type=..., events=[LineageEvent(...)])
# Query logs — note: log_type, NOT resource_type
service.send_query_logs(resource_uuid=..., log_type=..., events=[QueryLogEntry(...)])
# Extract invocation ID from any response
service.extract_invocation_id(result)
```
### RelationalAsset structure (nested, NOT flat)
```python
RelationalAsset(
type="TABLE", # ONLY "TABLE" or "VIEW" (uppercase) — normalize warehouse-native values
metadata=AssetMetadata(
name="my_table",
database="analytics",
schema="public",
description="optional description",
),
fields=[
AssetField(name="id", type="INTEGER", description=None),
AssetField(name="amount", type="DECIMAL(10,2)"),
],
volume=AssetVolume(row_count=1000000, byte_count=111111111), # optional
freshness=AssetFreshness(last_update_time="2026-03-12T14:30:00Z"), # optional
)
```
## Environment variable conventions
All generated scripts MUST use these exact variable names. Do NOT invent alternatives like
`MCD_KEY_ID`, `MC_TOKEN`, `MONTE_CARLO_KEY`, etc.
| Variable | Purpose | Used by |
|---|---|---|
| `MCD_INGEST_ID` | Ingestion key ID (scope=Ingestion) | push scripts |
| `MCD_INGEST_TOKEN` | Ingestion key secret | push scripts |
| `MCD_ID` | GraphQL API key ID | verification scripts |
| `MCD_TOKEN` | GraphQL API key secret | verification scripts |
| `MCD_RESOURCE_UUID` | Warehouse resource UUID | all scripts |
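Generated scripts can fail fast when the conventional variables are missing, rather than erroring mid-push. A sketch (the helper name is illustrative):

```python
import os

# The exact variable names from the convention table above.
REQUIRED_PUSH_VARS = ("MCD_INGEST_ID", "MCD_INGEST_TOKEN", "MCD_RESOURCE_UUID")

def load_push_env() -> dict[str, str]:
    """Read the conventional variables, failing fast if any are missing."""
    missing = [v for v in REQUIRED_PUSH_VARS if not os.environ.get(v)]
    if missing:
        raise SystemExit(f"Missing required environment variables: {', '.join(missing)}")
    return {v: os.environ[v] for v in REQUIRED_PUSH_VARS}
```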
## What this skill can build for you
Tell Claude your warehouse or data platform and Monte Carlo resource UUID and this skill will
generate a ready-to-run Python script that:
- Connects to your warehouse using the idiomatic driver for that platform
- Discovers databases, schemas, and tables
- Extracts the right columns — names, types, row counts, byte counts, last modified time, descriptions
- Builds the correct pycarlo `RelationalAsset`, `LineageEvent`, or `QueryLogEntry` objects
- Pushes to Monte Carlo and saves an output manifest with the `invocation_id` for tracing
Templates are available for common warehouses (Snowflake, BigQuery, BigQuery Iceberg,
Databricks, Redshift, Hive). For any other platform, Claude will derive the appropriate
collection queries from the warehouse's system catalog or metadata APIs and generate an
equivalent script.
### Ready-to-run examples
Production-ready example scripts built from these templates are published in the
[mcd-public-resources](https://github.com/monte-carlo-data/mcd-public-resources) repo:
- **[BigQuery Iceberg (BigLake) tables](https://github.com/monte-carlo-data/mcd-public-resources/tree/main/examples/push-ingestion/bigquery/push-iceberg-tables)** —
metadata and query log collection for BigQuery Iceberg tables that are invisible to Monte
Carlo's standard pull collector (which uses `__TABLES__`). Includes a `--only-freshness-and-volume`
flag for fast periodic pushes that skip the schema/fields query — useful for hourly cron jobs
after the initial full metadata push.
## Reference docs — when to load
| Reference file | Load when… |
|---|---|
| `references/prerequisites.md` | Customer is setting up for the first time, has auth errors, or needs help creating API keys |
| `references/push-metadata.md` | Building or debugging a metadata collection script |
| `references/push-lineage.md` | Building or debugging a lineage collection script |
| `references/push-query-logs.md` | Building or debugging a query log collection script |
| `references/custom-lineage.md` | Customer needs custom lineage nodes or edges via GraphQL |
| `references/validation.md` | Verifying pushed data, running GraphQL checks, or deleting push-ingested tables |
| `references/direct-http-api.md` | Customer wants to call push APIs directly via curl/HTTP without pycarlo |
| `references/anomaly-detection.md` | Customer asks why freshness or volume detectors aren't firing |
## Prerequisites — read this first
→ Load `references/prerequisites.md`
Two separate API keys are required. This is the most common setup stumbling block:
- **Ingestion key** (scope=Ingestion) — for pushing data
- **GraphQL API key** — for verification queries
Both use the same `x-mcd-id` / `x-mcd-token` headers but point to different endpoints.
## What you can push
| Flow | pycarlo method | Push endpoint | Type field | Expiration |
|---|---|---|---|---|
| Table metadata | `send_metadata()` | `/ingest/v1/metadata` | `resource_type` (e.g. `"data-lake"`) | **Never expires** |
| Table lineage | `send_lineage()` | `/ingest/v1/lineage` | `resource_type` (same as metadata) | **Never expires** |
| Column lineage | `send_lineage()` (events include `fields`) | `/ingest/v1/lineage` | `resource_type` (same as metadata) | **Expires after 10 days** |
| Query logs | `send_query_logs()` | `/ingest/v1/querylogs` | **`log_type`** (not `resource_type`!) | Same as pulled |
| Custom lineage | GraphQL mutations | `api.getmontecarlo.com/graphql` | N/A — uses GraphQL API key | 7 days default; set `expireAt: "9999-12-31"` for permanent |
**Important**: Query logs use `log_type` instead of `resource_type`. This is the only push
endpoint where the field name differs. See `references/push-query-logs.md` for the full list
of supported `log_type` values.
The pycarlo SDK is optional — you can also call the push APIs directly via HTTP/curl. See
`references/direct-http-api.md` for examples.
Every push returns an `invocation_id` — save it. It is your primary debugging handle across
all downstream systems.
## Step 1 — Generate your collection scripts
Ask Claude to build the script for your warehouse:
> "Build me a metadata collection script for Snowflake. My MC resource UUID is `abc-123`."
The script templates in `**/push-ingestion/scripts/templates/` (Snowflake, BigQuery, BigQuery Iceberg, Databricks, Redshift, Hive)
are the **mandatory starting point** for script generation — they contain the correct pycarlo
imports, model constructors, and SDK calls. **They are not an exhaustive list.** If the
customer's warehouse is not listed, use the templates as a guide and determine the appropriate
queries or file-collection approach for their platform. For file-based sources (like Hive
Metastore logs), provide the command to retrieve the file, parse it, and transform it into the
format required by the push APIs. The push format and SDK calls are identical regardless of
source; only the collection queries change.
**Batching**: For large payloads, split events into batches. Use a batch size of **50 assets**
per push call. The pycarlo HTTP client has a hardcoded 10-second read timeout that cannot be
overridden (`Session` and `Client` do not accept a `timeout` parameter) — larger batches (200+)
will timeout on warehouses with thousands of tables. The compressed request body must also not
exceed **1MB** (Kinesis limit). All push endpoints support batching.
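The 50-asset rule amounts to a chunking loop around the push call. A sketch (`batched` is an illustrative helper; the commented usage follows the canonical API above):

```python
from typing import Iterator, TypeVar

T = TypeVar("T")

def batched(events: list[T], size: int = 50) -> Iterator[list[T]]:
    """Yield successive batches of at most `size` events (50 per push call)."""
    for start in range(0, len(events), size):
        yield events[start:start + size]

# Usage sketch — per-batch pushes keep each request under the 1MB Kinesis
# limit and the SDK's fixed 10-second read timeout:
#
# for batch in batched(assets, 50):
#     result = service.send_metadata(resource_uuid=uuid,
#                                    resource_type="data-lake", events=batch)
#     print(service.extract_invocation_id(result))  # always surface this
```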
**Push frequency**: Push at most **once per hour**. Sub-hourly pushes produce unpredictable
anomaly detector behavior because the training pipeline aggregates into hourly buckets.
**Per flow, see:**
- Metadata (schema + volume + freshness): `references/push-metadata.md`
- Table and column lineage: `references/push-lineage.md`
- Query logs: `references/push-query-logs.md`
## Step 2 — Validate pushed data
After pushing, verify data is visible in Monte Carlo using the GraphQL API (GraphQL API key).
`references/validation.md` — all verification queries (getTable, getMetricsV4,
getTableLineage, getDerivedTablesPartialLineage, getAggregatedQueries)
Timing expectations:
- **Metadata**: visible within a few minutes
- **Table lineage**: visible within seconds to a few minutes (fast direct path to Neo4j)
- **Column lineage**: a few minutes
- **Query logs**: at least **15-20 minutes** (async processing pipeline)
## Step 3 — Anomaly detection (optional)
If you want Monte Carlo's freshness and volume detectors to fire on pushed data, you need to
push consistently over time — detectors require historical data to train.
`references/anomaly-detection.md` — recommended push frequency, minimum samples,
training windows, and what to tell customers who ask why detectors aren't activating
## Custom lineage nodes and edges
For non-warehouse assets (dbt models, Airflow DAGs, custom ETL pipelines) or cross-resource
lineage, use the GraphQL mutations directly:
`references/custom-lineage.md` — `createOrUpdateLineageNode`, `createOrUpdateLineageEdge`,
`deleteLineageNode`, and the critical `expireAt: "9999-12-31"` rule
## Deleting push-ingested tables
Push tables are excluded from the normal pull-based deletion flow (intentionally). To delete
them explicitly, use `deletePushIngestedTables` — covered in `references/validation.md`
under "Table management operations".
## Available slash commands
Customers can invoke these explicitly instead of describing their intent in prose:
| Command | Purpose |
|---|---|
| `/mc-build-metadata-collector` | Generate a metadata collection script |
| `/mc-build-lineage-collector` | Generate a lineage collection script |
| `/mc-build-query-log-collector` | Generate a query log collection script |
| `/mc-validate-metadata` | Verify pushed metadata via the GraphQL API |
| `/mc-validate-lineage` | Verify pushed lineage via the GraphQL API |
| `/mc-validate-query-logs` | Verify pushed query logs via the GraphQL API |
| `/mc-create-lineage-node` | Create a custom lineage node |
| `/mc-create-lineage-edge` | Create a custom lineage edge |
| `/mc-delete-lineage-node` | Delete a custom lineage node |
| `/mc-delete-push-tables` | Delete push-ingested tables |
## Debugging checkpoints
When pushed data isn't appearing, work through these five checkpoints in order:
1. **Did the SDK return a `202` and an `invocation_id`?**
If not, the gateway rejected the request — check auth headers and `resource.uuid`.
2. **Is the integration key the right type?**
Must be scope `Ingestion`, created via `montecarlo integrations create-key --scope Ingestion`.
A standard GraphQL API key will not work for push.
3. **Is `resource.uuid` correct and authorized?**
The key can be scoped to specific warehouse UUIDs. If the UUID doesn't match, you get `403`.
4. **Did the normalizer process it?**
Use the `invocation_id` to search CloudWatch logs for the relevant Lambda. For query logs,
check the `log_type` — Hive requires `"hive-s3"`, not `"hive"`.
5. **Did the downstream system pick it up?**
- Metadata: query `getTable` in GraphQL
- Table lineage: check Neo4j within seconds to minutes (fast path via PushLineageProcessor)
- Query logs: wait at least 15-20 minutes; check `getAggregatedQueries`
## Known gotchas
- **`log_type` vs `resource_type`**: metadata and lineage use `resource_type` (e.g. `"data-lake"`);
query logs use **`log_type`** — the only endpoint where the field name differs. Wrong value →
`Unsupported ingest query-log log_type` error.
- **`invocation_id` must be saved**: every output manifest should include it — it's your
only tracing handle once the request leaves the SDK.
- **Query log async delay**: at least 15-20 minutes. `getAggregatedQueries` will return 0 until
processing completes — this is expected, not a bug.
- **Custom lineage `expireAt` defaults to 7 days**: nodes vanish silently unless you set
`expireAt: "9999-12-31"` for permanent nodes.
- **Push tables are never auto-deleted**: the periodic cleanup job excludes them by default
(`exclude_push_tables=True`). Delete them explicitly via `deletePushIngestedTables` (max
1,000 MCONs per call; also deletes lineage nodes and all edges touching those nodes).
- **Anomaly detectors need history**: pushing once is not enough. Freshness needs 7+ pushes
over ~2 weeks; volume needs 10-48 samples over ~42 days. Push at most once per hour.
- **Batching required for large payloads**: the compressed request body must not exceed 1MB.
Split large event lists into batches.
- **Column lineage expires after 10 days**: unlike table metadata and table lineage (which
never expire), column lineage has a 10-day TTL, same as pulled column lineage.
- **Quote SQL identifiers in warehouse queries**: database, schema, and table names must be
quoted to handle mixed-case or special characters. The quoting syntax varies by warehouse —
Snowflake and Redshift use double quotes (`"{db}"`), BigQuery/Databricks/Hive use backticks
(`` `db` ``). The templates already handle this correctly for each warehouse — follow the
same quoting pattern when adapting.
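A sketch of per-warehouse identifier quoting following the conventions above (the helper is illustrative; identifiers containing the quote character itself are handled defensively rather than per each warehouse's exact escaping rules):

```python
def quote_identifier(warehouse: str, name: str) -> str:
    """Quote a database/schema/table identifier for the target warehouse."""
    w = warehouse.lower()
    if w in ("snowflake", "redshift"):
        # Double quotes; embedded double quotes are doubled.
        return '"' + name.replace('"', '""') + '"'
    if w in ("bigquery", "databricks", "hive"):
        # Backticks; this sketch rejects embedded backticks rather than
        # guessing each engine's escape syntax.
        if "`" in name:
            raise ValueError("backtick in identifier not supported by this sketch")
        return f"`{name}`"
    raise ValueError(f"Unknown warehouse: {warehouse}")
```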
## Memory safety
Generated scripts must include a startup memory check. The collection phase loads query history
rows into memory for parsing — on large warehouses with long lookback windows, this can exhaust
available RAM and cause the process to be silently killed (SIGKILL / exit 137) with no traceback.
Add this pattern near the top of every generated script, after imports:
```python
import os
def _check_available_memory(min_gb: float = 2.0) -> None:
"""Warn if available memory is below the threshold."""
try:
if hasattr(os, "sysconf"): # Linux / macOS
page_size = os.sysconf("SC_PAGE_SIZE")
avail_pages = os.sysconf("SC_AVPHYS_PAGES")
avail_gb = (page_size * avail_pages) / (1024 ** 3)
else:
return # Windows — skip check
except (ValueError, OSError):
return
if avail_gb < min_gb:
print(
f"WARNING: Only {avail_gb:.1f} GB of memory available "
f"(minimum recommended: {min_gb:.1f} GB). "
f"Consider reducing the lookback window or increasing available memory."
)
```
Call `_check_available_memory()` before connecting to the warehouse.
Additionally, when fetching query history:
- Use `cursor.fetchmany(batch_size)` in a loop instead of `cursor.fetchall()` when possible
- For very large result sets, consider adding a LIMIT clause and processing in windows
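The `fetchmany` pattern looks like this for any DB-API 2.0 cursor (a sketch; the batch size is illustrative):

```python
def iter_rows(cursor, batch_size: int = 10_000):
    """Stream rows in batches instead of materializing the full result set.

    Works with any DB-API 2.0 cursor that has already executed a query;
    avoids the fetchall() memory spike on large query-history results.
    """
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield from rows
```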

# Anomaly Detection for Push-Ingested Data
Push volume and freshness data feeds the same anomaly detectors as the pull model.
The detectors don't activate immediately — they need enough historical data to learn
expected behavior before they can alert on deviations.
## Recommended push frequency: hourly
- Push at most **once per hour** — pushing more frequently produces unpredictable detector
behavior because the training pipeline aggregates data into hourly buckets
- Push **consistently** — gaps of more than a few days delay activation or deactivate
previously-active detectors
## Freshness detector
The freshness detector learns how often a table is updated and fires when it has not been
updated for longer than expected.
**What it trains on**: consecutive differences (`delta_sec`) between `last_update_time`
values across pushes. A push only counts if `last_update_time` actually changed.
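The sample-counting rule can be illustrated with a small sketch (illustrative code, not MC's training pipeline):

```python
def freshness_delta_samples(last_update_times):
    """Return the delta_sec training samples a sequence of pushes yields.

    Consecutive pushes with an unchanged last_update_time collapse into a
    single observation; only a changed value produces a new delta sample.
    """
    distinct = []
    for ts in last_update_times:
        if not distinct or ts != distinct[-1]:
            distinct.append(ts)
    return [b - a for a, b in zip(distinct, distinct[1:])]
```

Hourly pushes of a table that only updates daily therefore accumulate roughly one sample per day, not 24.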
**Requirements to activate:**
| Requirement | Value |
|---|---|
| Minimum samples | 7 pushes where `last_update_time` changed (or coverage ≥ 0.8 for slow tables) |
| Minimum coverage | 0.15 (= `median_update_secs × n_samples / 22 days`) |
| Training window | 35 days |
| Supported update cycle | 5 minutes – 7.7 days |
| Minimum table age | ~14 days on older warehouses |
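The coverage requirement can be checked numerically (a sketch of the formula from the table above, using the stated 22-day constant):

```python
def freshness_coverage(median_update_secs: float, n_samples: int) -> float:
    """coverage = median_update_secs * n_samples / 22 days (in seconds)."""
    return (median_update_secs * n_samples) / (22 * 86400)

# A table updating every 5 days easily clears the 0.15 threshold with
# 7 samples; an hourly table relies on the minimum-sample path instead.
slow = freshness_coverage(5 * 86400, 7)  # well above 0.15
fast = freshness_coverage(3600, 7)       # well below 0.15
```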
**Deactivation triggers:**
- No push for **14 days** → `"no recent data"`
- Gap > 7 days in last 14 days, for fast tables (median update ≤ 26.4 hours) → `"gap of over a week in last 2 weeks"`
## Volume detector (Volume Change + Unchanged Size)
Detects unexpected spikes/drops in row count or byte count.
**Requirements to activate:**
| Requirement | Value |
|---|---|
| Minimum samples (daily) | 10 |
| Minimum samples (subdaily, ~12x/day) | 48 |
| Minimum samples (weekly) | 5 |
| Minimum coverage | 0.30 (= `N × median_update_secs / 42 days`) |
| Training window | 42 days |
| Minimum table age | 5 days |
| Regularity check | 75th/25th percentile of update intervals ≥ 0.2 |
**Deactivation**: No hard gap limit, but coverage degrades as the 42-day window advances
without new data. Eventually drops below 0.3 and deactivates.
## Summary table
| | Freshness | Volume Change / Unchanged Size |
|---|---|---|
| Recommended frequency | Hourly | Hourly |
| Maximum frequency | Once per hour | Once per hour |
| Training window | 35 days | 42 days |
| Minimum samples | 7 | 10 (daily) / 48 (subdaily) / 5 (weekly) |
| Minimum coverage | 0.15 | 0.30 |
| Hard deactivation gap | 14 days | No (coverage degrades) |
| Fast-table gap warning | 7 days in last 14 | N/A |
## What to tell customers
When a customer asks "why isn't my anomaly detection working?":
1. **Check detector status** in the MC UI or via GraphQL (`getTable.thresholds.freshness.status`).
A `"training"` status means not enough data yet. `"inactive"` means a deactivation
condition was hit — check the reason code.
2. **Verify push frequency** — are they pushing exactly once per hour? Both too-fast and
too-slow rates cause problems.
3. **Verify that `last_update_time` changes** — for freshness to accumulate training samples,
each push must carry a *different* `last_update_time` than the previous one. If the table
hasn't actually updated, the push still arrives but doesn't advance the sample count.
4. **Set realistic expectations** — freshness detectors need about 1–2 weeks of hourly pushes.
Volume detectors need 10+ days for daily tables, up to 42 days for subdaily tables.
Anomaly detection is not instant.
5. **Don't pause pushes and then resume** — if a customer stops pushing for a week and then
resumes, the freshness detector may deactivate. They should keep pushing even when the
table hasn't changed (just repeat the same `last_update_time`) to maintain coverage,
even though that specific push won't count as a new freshness sample.


@@ -0,0 +1,203 @@
# Custom Lineage Nodes and Edges
## When to use this
The `send_lineage()` pycarlo method is the right choice for warehouse tables you own.
The **GraphQL mutations** in this document are for:
- Non-warehouse assets: dbt models, Airflow DAGs, Fivetran connectors, custom ETL jobs
- Connecting nodes across different MC resources (warehouses)
- One-off lineage corrections not tied to a collector run
- Fine-grained control over node properties, object types, and expiry
All mutations use the **GraphQL API key** (not the Ingestion key) and the endpoint
`https://api.getmontecarlo.com/graphql`.
## Critical: expireAt
If you don't set `expireAt`, nodes and edges expire after **7 days** and vanish from the
lineage graph silently. For any node or edge that should persist:
```
expireAt: "9999-12-31"
```
This is the same value that `PushLineageProcessor` uses internally for all push-ingested
lineage. Forgetting this is the most common cause of "my lineage disappeared after a week".
---
## createOrUpdateLineageNode
Creates or updates a node in the lineage graph. If a node with the same
`objectType` + `objectId` + `resourceId` already exists, it is updated.
```graphql
mutation CreateOrUpdateLineageNode(
$objectType: String!
$objectId: String!
$resourceId: UUID
$resourceName: String
$name: String
$properties: [ObjectPropertyInput]
$expireAt: DateTime
) {
createOrUpdateLineageNode(
objectType: $objectType
objectId: $objectId
resourceId: $resourceId
resourceName: $resourceName
name: $name
properties: $properties
expireAt: $expireAt
) {
node {
mcon
displayName
objectType
isCustom
expireAt
}
}
}
```
**Variables:**
```json
{
"objectType": "table",
"objectId": "analytics:analytics.orders",
"resourceId": "<warehouse-uuid>",
"name": "orders",
"expireAt": "9999-12-31"
}
```
`objectType` can be any string — common values: `"table"`, `"view"`, `"report"`,
`"dashboard"`, `"job"`, `"model"`.
`objectId` should be a stable unique identifier for the asset within the resource.
For tables, use the `fullTableId` format: `database:schema.table`.
The returned `mcon` is the stable MC identifier for this node — save it if you plan to
reference it in edges or deletions.
---
## createOrUpdateLineageEdge
Creates or updates a directed edge: source → destination (default: IS_DOWNSTREAM).
```graphql
mutation CreateOrUpdateLineageEdge(
$source: NodeInput!
$destination: NodeInput!
$expireAt: DateTime
$edgeType: EdgeType
) {
createOrUpdateLineageEdge(
source: $source
destination: $destination
expireAt: $expireAt
edgeType: $edgeType
) {
edge {
source { mcon displayName objectType }
destination { mcon displayName objectType }
isCustom
expireAt
}
}
}
```
`NodeInput` shape:
```json
{
"objectType": "table",
"objectId": "analytics:analytics.orders",
"resourceId": "<warehouse-uuid>"
}
```
**Full example — dbt model → warehouse table:**
```json
{
"source": {
"objectType": "model",
"objectId": "dbt://my_project/models/staging/stg_orders",
"resourceName": "dbt-production"
},
"destination": {
"objectType": "table",
"objectId": "analytics:analytics.orders",
"resourceId": "<snowflake-warehouse-uuid>"
},
"expireAt": "9999-12-31",
"edgeType": "IS_DOWNSTREAM"
}
```
---
## deleteLineageNode
Deletes a node and **all its edges and objects**. This is irreversible.
```graphql
mutation DeleteLineageNode($mcon: String!) {
deleteLineageNode(mcon: $mcon) {
objectsDeleted
nodesDeleted
edgesDeleted
}
}
```
Get the MCON from `createOrUpdateLineageNode`'s response, or from:
```graphql
query {
getTable(fullTableId: "analytics:analytics.orders", dwId: "<warehouse-uuid>") {
mcon
}
}
```
---
## Python helper for all three mutations
```python
import requests
GRAPHQL_URL = "https://api.getmontecarlo.com/graphql"
HEADERS = {
"x-mcd-id": "<graphql-api-key-id>",
"x-mcd-token": "<graphql-api-key-secret>",
"Content-Type": "application/json",
}
def run_mutation(query: str, variables: dict) -> dict:
resp = requests.post(GRAPHQL_URL, json={"query": query, "variables": variables}, headers=HEADERS)
resp.raise_for_status()
data = resp.json()
if "errors" in data:
raise RuntimeError(data["errors"])
return data["data"]
# Example: create a permanent node
result = run_mutation(
"""mutation($objectType: String!, $objectId: String!, $resourceId: UUID, $expireAt: DateTime) {
createOrUpdateLineageNode(objectType: $objectType, objectId: $objectId,
resourceId: $resourceId, expireAt: $expireAt) {
node { mcon displayName }
}
}""",
{
"objectType": "table",
"objectId": "analytics:analytics.orders",
"resourceId": "<warehouse-uuid>",
"expireAt": "9999-12-31",
}
)
print("MCON:", result["createOrUpdateLineageNode"]["node"]["mcon"])
```


@@ -0,0 +1,207 @@
# Direct HTTP API (without pycarlo)
The `pycarlo` SDK is optional. You can call the push APIs directly over HTTPS from any
language or tool (curl, Postman, etc.) as long as you:
- authenticate with an integration key whose scope is `Ingestion`
- send a JSON body that matches the ingest schema
- send to the correct integration gateway endpoint
## Endpoint
The host is environment-specific:
- **Production**: `https://integrations.getmontecarlo.com`
## Authentication headers
All requests use the same headers:
```
x-mcd-id: <integration-key-id>
x-mcd-token: <integration-key-secret>
Content-Type: application/json
```
## Response
On success, all endpoints return:
```json
{"invocation_id": "<uuid>"}
```
Save the `invocation_id` — it is the primary trace ID for debugging across downstream systems.
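Since every request returns a fresh `invocation_id`, it helps to append each one to a local audit log (a hypothetical convention, not an MC requirement):

```python
import json
from datetime import datetime, timezone

def record_invocation(invocation_id: str, endpoint: str,
                      log_path: str = "push_invocations.jsonl") -> None:
    """Append one JSONL line per push so invocation_ids can be traced later."""
    entry = {
        "invocation_id": invocation_id,
        "endpoint": endpoint,
        "pushed_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```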
---
## Metadata example
`POST /ingest/v1/metadata`
```bash
curl -X POST "https://integrations.getmontecarlo.com/ingest/v1/metadata" \
-H "Content-Type: application/json" \
-H "x-mcd-id: <integration-key-id>" \
-H "x-mcd-token: <integration-key-secret>" \
-d '{
"event_type": "RELATIONAL_ASSET",
"resource": {
"uuid": "<warehouse-uuid>",
"resource_type": "snowflake"
},
"events": [
{
"type": "TABLE",
"metadata": {
"name": "orders",
"database": "analytics",
"schema": "public",
"description": "Orders table"
},
"fields": [
{"name": "id", "type": "INTEGER"},
{"name": "amount", "type": "DECIMAL(10,2)"}
],
"volume": {
"row_count": 1000000,
"byte_count": 111111111
},
"freshness": {
"last_update_time": "2026-03-12T14:30:00Z"
}
}
]
}'
```
`volume` and `freshness` are optional — you can push schema-only metadata.
---
## Table lineage example
`POST /ingest/v1/lineage` with `event_type: "LINEAGE"`
```bash
curl -X POST "https://integrations.getmontecarlo.com/ingest/v1/lineage" \
-H "Content-Type: application/json" \
-H "x-mcd-id: <integration-key-id>" \
-H "x-mcd-token: <integration-key-secret>" \
-d '{
"event_type": "LINEAGE",
"resource": {
"uuid": "<warehouse-uuid>",
"resource_type": "snowflake"
},
"events": [
{
"source": {
"name": "orders_raw",
"database": "analytics",
"schema": "public"
},
"destination": {
"name": "orders_curated",
"database": "analytics",
"schema": "public"
}
}
]
}'
```
---
## Column lineage example
`POST /ingest/v1/lineage` with `event_type: "COLUMN_LINEAGE"`
Same endpoint as table lineage. Column lineage automatically creates the parent table-level
edge too.
```bash
curl -X POST "https://integrations.getmontecarlo.com/ingest/v1/lineage" \
-H "Content-Type: application/json" \
-H "x-mcd-id: <integration-key-id>" \
-H "x-mcd-token: <integration-key-secret>" \
-d '{
"event_type": "COLUMN_LINEAGE",
"resource": {
"uuid": "<warehouse-uuid>",
"resource_type": "snowflake"
},
"events": [
{
"source": {
"name": "customers",
"database": "analytics",
"schema": "public"
},
"destination": {
"name": "customer_orders",
"database": "analytics",
"schema": "public"
},
"col_mappings": [
{
"destination_col": "customer_id",
"source_cols": ["customer_id"]
},
{
"destination_col": "full_name",
"source_cols": ["first_name", "last_name"]
}
]
}
]
}'
```
---
## Query log example
`POST /ingest/v1/querylogs`
**Important**: this endpoint uses `log_type` instead of `resource_type` in the resource object.
This is the only endpoint where the field name differs.
```bash
curl -X POST "https://integrations.getmontecarlo.com/ingest/v1/querylogs" \
-H "Content-Type: application/json" \
-H "x-mcd-id: <integration-key-id>" \
-H "x-mcd-token: <integration-key-secret>" \
-d '{
"event_type": "QUERY_LOG",
"resource": {
"uuid": "<warehouse-uuid>",
"log_type": "snowflake"
},
"events": [
{
"start_time": "2026-03-02T12:00:00Z",
"end_time": "2026-03-02T12:00:05Z",
"query_text": "SELECT * FROM analytics.public.orders",
"query_id": "query-123",
"user": "analyst@company.com",
"returned_rows": 10
}
]
}'
```
Supported `log_type` values: `snowflake`, `bigquery`, `databricks`, `redshift`, `hive-s3`,
`athena`, `teradata`, `clickhouse`, `databricks-metastore-sql-warehouse`, `s3`, `presto-s3`.
---
## Batching
The compressed request body must not exceed **1MB** (Kinesis limit). For large payloads, split
events into multiple requests. Each request returns its own `invocation_id`.
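One way to respect the limit (a sketch — the 1 MB constant and gzip compression mirror the Kinesis constraint described above, but the helper itself is hypothetical):

```python
import gzip
import json

MAX_COMPRESSED_BYTES = 1_000_000  # Kinesis limit on the compressed body

def batch_events(events, envelope, max_bytes=MAX_COMPRESSED_BYTES):
    """Yield event lists whose compressed JSON body stays under max_bytes."""
    batch = []
    for event in events:
        candidate = batch + [event]
        body = json.dumps({**envelope, "events": candidate}).encode()
        if batch and len(gzip.compress(body)) > max_bytes:
            yield batch  # flush: adding this event would exceed the cap
            batch = [event]
        else:
            batch = candidate
    if batch:
        yield batch
```

Each yielded batch becomes its own POST and receives its own `invocation_id`.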
## Expiration summary
| Flow | Expiration |
|---|---|
| Table metadata | Never expires |
| Table lineage | Never expires |
| Column lineage | Expires after 10 days |
| Query logs | Same as pulled query logs |


@@ -0,0 +1,150 @@
# Prerequisites
## Two keys, two purposes
Push ingestion requires **two separate Monte Carlo API keys** — one for pushing data, one
for reading/verifying it. They use identical header names but different endpoints.
| Key | Purpose | Endpoint |
|---|---|---|
| **Ingestion key** (scope=`Ingestion`) | Push metadata, lineage, query logs | `https://integrations.getmontecarlo.com` |
| **GraphQL API key** | Verify pushed data, run management mutations | `https://api.getmontecarlo.com/graphql` |
Both authenticate with:
```
x-mcd-id: <key-id>
x-mcd-token: <key-secret>
```
The secret for both is shown **only once** at creation time — store it securely immediately.
---
## Create the Ingestion key (for pushing)
Use the Monte Carlo CLI:
```bash
montecarlo integrations create-key \
--scope Ingestion \
--description "Push ingestion key"
```
Output:
```
Key id: <id>
Key secret: <secret> ← only shown once
```
Install the CLI if needed:
```bash
pip install montecarlodata
montecarlo configure # enter your API key when prompted
```
**Optional — restrict to a specific warehouse:**
If you want the key to only work for one warehouse UUID, use the GraphQL mutation instead:
```graphql
mutation {
createIntegrationKey(
description: "Push key for warehouse XYZ"
scope: Ingestion
warehouseIds: ["<warehouse-uuid>"]
) {
key { id secret }
}
}
```
---
## Create the GraphQL API key (for verification)
1. Go to **https://getmontecarlo.com/settings/api**
2. Click **Add**
3. Choose key type (personal or account-level — account-level requires Account Owner role)
4. Copy the **Key ID** and **Secret** immediately
The GraphQL endpoint is: `https://api.getmontecarlo.com/graphql`
Test it:
```bash
curl -s -X POST https://api.getmontecarlo.com/graphql \
-H "x-mcd-id: <id>" \
-H "x-mcd-token: <secret>" \
-H "Content-Type: application/json" \
-d '{"query": "{ getUser { email } }"}' | python3 -m json.tool
```
---
## Find your warehouse (resource) UUID
The Ingestion key needs to reference the correct MC resource UUID. To find it:
```graphql
query {
getUser {
account {
warehouses {
uuid
name
connectionType
}
}
}
}
```
Or in the MC UI: **Settings → Integrations** → click the warehouse → copy the UUID from the URL.
---
## Install pycarlo (optional)
The pycarlo SDK simplifies push calls, but is not required. You can also call the push APIs
directly via HTTP/curl — see `references/direct-http-api.md`.
```bash
pip install pycarlo
```
Initialize the ingestion client in your script:
```python
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
client = Client(session=Session(
mcd_id="<ingestion-key-id>",
mcd_token="<ingestion-key-secret>",
scope="Ingestion",
))
service = IngestionService(mc_client=client)
```
Load credentials from environment variables (recommended):
```python
import os
service = IngestionService(mc_client=Client(session=Session(
mcd_id=os.environ["MCD_INGEST_ID"],
mcd_token=os.environ["MCD_INGEST_TOKEN"],
scope="Ingestion",
)))
```
---
## Environment variable conventions
The script templates use these env var names by default:
| Variable | Key type | Used by |
|---|---|---|
| `MCD_INGEST_ID` | Ingestion key ID | push and collect_and_push scripts |
| `MCD_INGEST_TOKEN` | Ingestion key secret | push and collect_and_push scripts |
| `MCD_ID` | GraphQL API key ID | verification scripts, slash commands |
| `MCD_TOKEN` | GraphQL API key secret | verification scripts, slash commands |
| `MCD_RESOURCE_UUID` | Warehouse UUID | all scripts |


@@ -0,0 +1,160 @@
# Pushing Table and Column Lineage
## Overview
Both table-level and column-level lineage use the same endpoint: `POST /ingest/v1/lineage`.
The `event_type` field distinguishes them:
- `LINEAGE` — table-level: source table → destination table
- `COLUMN_LINEAGE` — column-level: source table.column → destination table.column
(also automatically creates the parent table-level edge)
Push lineage is **typically visible in the MC lineage graph within seconds to a few minutes**
via the fast direct path (PushLineageProcessor → S3 CSVs → neo4jLineageLoaderPrivate → Neo4j).
**Expiration**:
- Pushed **table lineage does not expire** (`expire_at = 9999-12-31`).
- Pushed **column lineage expires after 10 days** (same as pulled column lineage).
**Batching**: For large numbers of lineage events, split into batches. The compressed request
body must not exceed **1MB** (Kinesis limit).
## pycarlo models
```python
from pycarlo.features.ingestion import (
IngestionService,
LineageEvent,
LineageAssetRef,
ColumnLineageField,
ColumnLineageSourceField,
)
```
## Table lineage example
```python
event = LineageEvent(
destination=LineageAssetRef(
database="analytics",
schema="public",
table="customer_orders",
),
sources=[
LineageAssetRef(database="analytics", schema="public", table="customers"),
LineageAssetRef(database="analytics", schema="public", table="orders"),
],
)
result = service.send_lineage(
resource_uuid="<your-resource-uuid>",
resource_type="data-lake",
events=[event],
)
invocation_id = service.extract_invocation_id(result)
print("invocation_id:", invocation_id)
```
## Column lineage example
```python
event = LineageEvent(
destination=LineageAssetRef(
database="analytics",
schema="public",
table="customer_orders",
),
sources=[
LineageAssetRef(database="analytics", schema="public", table="customers"),
LineageAssetRef(database="analytics", schema="public", table="orders"),
],
# column mappings: dest_col ← src_table.src_col
fields=[
ColumnLineageField(
destination_field="customer_id",
source_fields=[
ColumnLineageSourceField(
database="analytics", schema="public",
table="customers", field="customer_id",
)
],
),
ColumnLineageField(
destination_field="order_amount",
source_fields=[
ColumnLineageSourceField(
database="analytics", schema="public",
table="orders", field="amount",
)
],
),
],
)
result = service.send_lineage(
resource_uuid=resource_uuid,
resource_type="data-lake",
events=[event],
)
```
Column lineage push automatically creates a table-level edge too, so you don't need to
send separate table and column lineage events for the same relationship.
## Extracting lineage from SQL logs
For warehouses that don't expose a native lineage table, extract lineage by parsing query
history SQL for `CREATE TABLE AS SELECT`, `INSERT INTO ... SELECT`, and `MERGE INTO` patterns.
Simplified example regex:
```python
import re
CTAS_PATTERN = re.compile(
r"CREATE\s+(?:OR\s+REPLACE\s+)?TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\S+)\s+AS\s+SELECT",
re.IGNORECASE,
)
INSERT_PATTERN = re.compile(
r"INSERT\s+(?:OVERWRITE\s+)?(?:INTO\s+)?(\S+).*?FROM\s+(\S+)",
re.IGNORECASE | re.DOTALL,
)
```
For Snowflake, BigQuery, and Redshift the query history tables provide this SQL.
For Databricks, use `system.access.table_lineage` directly (no parsing needed).
For Hive, parse the HiveServer2 log file.
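Applying the CTAS pattern to a statement yields a candidate lineage edge (a simplified sketch; production parsers must also handle quoting, CTEs, and subqueries):

```python
import re

CTAS_PATTERN = re.compile(
    r"CREATE\s+(?:OR\s+REPLACE\s+)?TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\S+)\s+AS\s+SELECT",
    re.IGNORECASE,
)

def extract_ctas_edge(sql: str):
    """Return (destination, [source_tables]) for a CTAS statement, or None."""
    m = CTAS_PATTERN.search(sql)
    if not m:
        return None
    # Naively treat every FROM/JOIN target as a source table.
    sources = re.findall(r"\b(?:FROM|JOIN)\s+(\S+)", sql, re.IGNORECASE)
    return m.group(1), sources
```

For example, `extract_ctas_edge("CREATE TABLE analytics.public.orders_curated AS SELECT * FROM analytics.public.orders_raw")` returns the destination table with its single source.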
## Output manifest (include invocation_id)
```python
manifest = {
"resource_uuid": resource_uuid,
"invocation_id": service.extract_invocation_id(result), # ← save this
"collected_at": datetime.now(tz=timezone.utc).isoformat(),
"edges": [
{
"destination": {"database": e.destination.database, "table": e.destination.table},
"sources": [{"database": s.database, "table": s.table} for s in e.sources],
}
for e in events
],
}
with open("lineage_output.json", "w") as f:
json.dump(manifest, f, indent=2)
```
## How push lineage is distinguished from query-derived lineage
Push-ingested lineage nodes and edges carry `origin = push_ingest` in Neo4j and
`origin_type = DIRECT_LINEAGE` in the normalized lineage model. This prevents the lineage
DAG from overwriting them with query-log-derived edges and gives MC a clear audit trail.
## Neo4j node expiry
Push-ingested **table lineage** nodes and edges are written with `expire_at = 9999-12-31`
(never expire). This is handled internally by PushLineageProcessor — you do not need to set
this manually when using `send_lineage()`.
Push-ingested **column lineage** expires after **10 days**, same as pulled column lineage.
For custom nodes created via GraphQL mutations, you **do** need to set
`expireAt: "9999-12-31"` explicitly — see `references/custom-lineage.md`.


@@ -0,0 +1,158 @@
# Pushing Table Metadata
## Overview
Metadata push sends three types of signals per table:
- **Schema** — column names and types
- **Volume** — row count and byte count
- **Freshness** — last update timestamp
All three travel together in a single `RelationalAsset` object via `POST /ingest/v1/metadata`.
**Expiration**: Pushed table metadata **does not expire**. Once pushed, it remains in Monte
Carlo until explicitly deleted via `deletePushIngestedTables`.
**Batching**: For large numbers of tables, split assets into batches. The compressed request
body must not exceed **1MB** (Kinesis limit).
## pycarlo models
```python
from pycarlo.features.ingestion import (
IngestionService,
RelationalAsset,
AssetMetadata,
AssetField,
AssetVolume,
AssetFreshness,
)
```
## Minimal example
```python
asset = RelationalAsset(
type="TABLE", # ONLY "TABLE" or "VIEW" — normalize warehouse-native values
metadata=AssetMetadata(
name="orders",
database="analytics",
schema="public",
description="Order transactions",
),
fields=[
AssetField(name="order_id", type="INTEGER"),
AssetField(name="amount", type="DECIMAL"),
AssetField(name="created_at", type="TIMESTAMP"),
],
volume=AssetVolume(
row_count=1_500_000,
byte_count=250_000_000,
),
freshness=AssetFreshness(
last_update_time="2024-03-01T12:00:00Z", # ISO 8601 string, NOT a datetime object
),
)
result = service.send_metadata(
resource_uuid="<your-resource-uuid>",
resource_type="data-lake", # see note below on resource_type
events=[asset],
)
invocation_id = service.extract_invocation_id(result)
print("invocation_id:", invocation_id) # save this!
```
## resource_type
The `resource_type` value must match the type of the MC resource (warehouse connection) you
are pushing to. Use the same string that appears in the MC UI or the `connectionType` field
from `getUser { account { warehouses { connectionType } } }`.
Common values:
- `"data-lake"` — Hive, EMR, Glue, generic data lake connections
- `"snowflake"` — Snowflake
- `"bigquery"` — BigQuery
- `"databricks"` — Databricks Unity Catalog
- `"redshift"` — Redshift
## Asset type
The `type` parameter on `RelationalAsset` must be one of two values (uppercase):
- `"TABLE"` — tables, external tables, dynamic tables, materialized views, etc.
- `"VIEW"` — views, secure views
**Important**: Warehouse-native type values like `"BASE TABLE"` (Snowflake), `"MANAGED"` /
`"EXTERNAL"` (Databricks), or `"MATERIALIZED_VIEW"` (BigQuery) are **NOT accepted** by the
MC API and will cause a 400 error. Always normalize to `"TABLE"` or `"VIEW"` before pushing.
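A normalization helper might look like this (a sketch; the mapping follows the rules above, with anything view-like except materialized views mapped to `VIEW`):

```python
def normalize_asset_type(native_type: str) -> str:
    """Map warehouse-native table types onto MC's accepted values.

    MC accepts only "TABLE" or "VIEW"; materialized views count as tables.
    """
    t = native_type.strip().upper().replace("_", " ")
    if "VIEW" in t and "MATERIALIZED" not in t:
        return "VIEW"
    return "TABLE"
```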
## Field types
Normalize to SQL-standard uppercase strings. Monte Carlo accepts any string but canonical
values like `INTEGER`, `BIGINT`, `VARCHAR`, `FLOAT`, `BOOLEAN`, `TIMESTAMP`, `DATE`,
`DECIMAL`, `ARRAY`, `STRUCT` work best with downstream features.
## Volume and freshness are optional
If your warehouse doesn't expose row counts or last-modified timestamps, omit `volume`
and/or `freshness` — schema-only metadata is valid.
If you send `freshness`, each push must carry a **changed** `last_update_time` to count as
a new data point for the anomaly detector (repeated identical timestamps don't advance the
training clock).
## Freshness + volume only mode (skip schema)
For periodic pushes (e.g. hourly cron), you often don't need to re-collect the full schema
on every run — field definitions rarely change. Collection scripts can support a
`--only-freshness-and-volume` flag that skips the `COLUMNS` / `INFORMATION_SCHEMA` query
and omits `fields` from the manifest. This is significantly faster on warehouses with many
tables. Use the full collection (with fields) on the first push and on a daily schedule,
and the freshness+volume only mode for hourly pushes in between. See the
[BigQuery Iceberg example](https://github.com/monte-carlo-data/mcd-public-resources/tree/main/examples/push-ingestion/bigquery/push-iceberg-tables)
for a working implementation of this pattern.
## Batch multiple tables
`events` accepts a list. Push all tables in a single call or in batches:
```python
result = service.send_metadata(
resource_uuid=resource_uuid,
resource_type="data-lake",
events=[asset1, asset2, asset3, ...],
)
```
## Output manifest (include invocation_id)
Always write a local manifest so you can trace issues later:
```python
import json
from datetime import datetime, timezone
manifest = {
"resource_uuid": resource_uuid,
"invocation_id": service.extract_invocation_id(result), # ← critical for debugging
"collected_at": datetime.now(tz=timezone.utc).isoformat(),
"assets": [
{
"database": a.metadata.database,
"schema": a.metadata.schema,
"table": a.metadata.name,
"row_count": a.volume.row_count if a.volume else None,
"fields": [{"name": f.name, "type": f.type} for f in a.fields],
}
for a in assets
],
}
with open("metadata_output.json", "w") as f:
json.dump(manifest, f, indent=2)
```
## Push frequency for anomaly detection
To keep volume and freshness anomaly detectors active:
- Push **at most once per hour** (pushing more frequently produces unpredictable behavior)
- Push **consistently** — gaps longer than a few days will deactivate detectors
- See `references/anomaly-detection.md` for minimum sample requirements


@@ -0,0 +1,219 @@
# Pushing Query Logs
## Overview
Query logs let Monte Carlo build table usage history, populate query lineage, and surface
query-level insights in the catalog. Push them via `POST /ingest/v1/querylogs`.
**Important timing note**: MC processes pushed query logs asynchronously. Logs pushed now
may not be visible in `getAggregatedQueries` for **at least 15-20 minutes**. This is expected
behavior, not a bug.
**Expiration**: Pushed query logs expire on the same schedule as pulled query logs.
**Batching**: For large query log sets, split events into batches. The compressed request body
must not exceed **1MB** (Kinesis limit). A conservative default is 250 entries per batch.
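A count-based chunker implementing the 250-entry default (illustrative):

```python
def chunked(entries, size: int = 250):
    """Split a list of query-log entries into fixed-size batches."""
    for i in range(0, len(entries), size):
        yield entries[i:i + size]
```

Push each chunk with its own `send_query_logs` call and record each returned `invocation_id`.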
## pycarlo model
```python
from pycarlo.features.ingestion import IngestionService, QueryLogEntry
```
`QueryLogEntry` required fields:
- `start_time` (`datetime`) — when the query started
- `end_time` (`datetime`) — when the query finished (**required**, easy to miss)
- `query_text` (`str`) — the SQL statement
Optional fields:
- `query_id` (`str`) — warehouse-assigned query ID
- `user` (`str`) — user/email who ran the query
- `returned_rows` (`int`) — rows returned to the client
- `default_database` (`str`) — default database context
## Basic example
```python
from datetime import datetime, timezone
entries = [
QueryLogEntry(
start_time=datetime(2024, 3, 1, 10, 0, 0, tzinfo=timezone.utc),
end_time=datetime(2024, 3, 1, 10, 0, 5, tzinfo=timezone.utc),
query_text="SELECT * FROM analytics.public.orders WHERE status = 'pending'",
query_id="query-abc-123",
user="analyst@company.com",
returned_rows=847,
),
]
result = service.send_query_logs(
resource_uuid="<your-resource-uuid>",
log_type="snowflake", # ← warehouse-specific! see table below
entries=entries,
)
invocation_id = service.extract_invocation_id(result)
print("invocation_id:", invocation_id)
```
## log_type per warehouse
**Important**: the query-log endpoint uses `log_type`, not `resource_type`. This is the only
push endpoint where the field name differs from metadata/lineage. The `log_type` value must
match what the MC normalizer expects for your warehouse. Using the wrong value causes:
`ValueError: Unsupported ingest query-log log_type: <value>`
| Warehouse | log_type |
|---|---|
| Snowflake | `"snowflake"` |
| BigQuery | `"bigquery"` |
| Databricks | `"databricks"` |
| Redshift | `"redshift"` |
| Hive (EMR/S3) | `"hive-s3"` |
| Athena | `"athena"` |
| Teradata | `"teradata"` |
| ClickHouse | `"clickhouse"` |
| Databricks (SQL Warehouse) | `"databricks-metastore-sql-warehouse"` |
| S3 | `"s3"` |
| Presto (S3) | `"presto-s3"` |
## Warehouse-specific fields
Some warehouses support extra fields beyond the base `QueryLogEntry`. Pass them as keyword
arguments — the normalizer knows which fields are valid per warehouse.
**Snowflake extras:**
```python
QueryLogEntry(
...
bytes_scanned=1024000,
warehouse_name="COMPUTE_WH",
warehouse_size="X-Small",
role_name="ANALYST",
query_tag="reporting",
execution_status="SUCCESS",
)
```
**BigQuery extras:**
```python
QueryLogEntry(
...
total_bytes_billed=10485760,
statement_type="SELECT",
job_type="QUERY",
default_dataset="analytics.public",
)
```
**Athena extras:**
```python
QueryLogEntry(
...
bytes_scanned=2048000,
catalog="AwsDataCatalog",
database="analytics",
output_location="s3://my-bucket/results/",
state="SUCCEEDED",
)
```
## Collecting query logs per warehouse
### Snowflake
```sql
SELECT
query_id,
query_text,
start_time,
end_time,
user_name,
database_name,
warehouse_name,
bytes_scanned,
rows_produced AS returned_rows,
execution_status
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(hour, -24, CURRENT_TIMESTAMP())
AND execution_status = 'SUCCESS'
ORDER BY start_time
```
Note: `ACCOUNT_USAGE` views have up to 45 minutes of latency. Don't collect the last hour.
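A collection window that respects this latency might be computed like so (a sketch; the one-hour buffer is an assumption rounded up from the 45-minute figure):

```python
from datetime import datetime, timedelta, timezone

def safe_collection_window(lookback_hours: int = 24, buffer_hours: int = 1):
    """Return (start, end), ending before the ACCOUNT_USAGE latency window."""
    end = datetime.now(timezone.utc) - timedelta(hours=buffer_hours)
    return end - timedelta(hours=lookback_hours), end
```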
### BigQuery
```python
from google.cloud import bigquery

client = bigquery.Client(project=project_id)
jobs = client.list_jobs(all_users=True, min_creation_time=start_dt, max_creation_time=end_dt)
query_jobs = []
for job in jobs:
    # Keep only jobs that carry SQL text (load/copy/extract jobs have no query)
    if getattr(job, "query", None):
        # Useful attributes: job.job_id, job.query, job.created, job.ended, job.user_email
        query_jobs.append(job)
```
### Databricks
```sql
SELECT
statement_id AS query_id,
statement_text AS query_text,
start_time,
end_time,
executed_by AS user,
produced_rows AS returned_rows
FROM system.query.history
WHERE start_time >= DATEADD(HOUR, -24, NOW())
AND status = 'FINISHED'
```
### Redshift (modern clusters)
```sql
SELECT
query_id,
query_text, -- may need text assembly from SYS_QUERYTEXT for long queries
start_time,
end_time,
user_id,
status
FROM sys_query_history
WHERE start_time >= DATEADD(hour, -24, GETDATE())
AND status = 'success'
```
For long queries (text > 4000 chars), assemble from `SYS_QUERYTEXT`:
```sql
SELECT query_id, LISTAGG(text, '') WITHIN GROUP (ORDER BY sequence) AS full_text
FROM sys_querytext
WHERE query_id = <id>
GROUP BY query_id
```
### Hive
Parse the HiveServer2 log file (default: `/tmp/root/hive.log`) for lines matching:
```
(Executing|Starting) command\(queryId=(\S*)\): (?P<command>.*)
```
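In Python, that pattern extracts the queryId and SQL from matching lines (a sketch; the surrounding log-file iteration is omitted):

```python
import re

HIVE_CMD_RE = re.compile(
    r"(Executing|Starting) command\(queryId=(\S*)\): (?P<command>.*)"
)

line = "2024-03-01T10:00:00 INFO Executing command(queryId=hive_123): SELECT * FROM analytics.orders"
m = HIVE_CMD_RE.search(line)
if m:
    query_id, sql = m.group(2), m.group("command")
```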
## Output manifest (include invocation_id)
```python
manifest = {
"resource_uuid": resource_uuid,
"invocation_id": service.extract_invocation_id(result), # ← save this
"collected_at": datetime.now(tz=timezone.utc).isoformat(),
"entry_count": len(entries),
"window_start": min(e.start_time for e in entries).isoformat(),
"window_end": max(e.end_time for e in entries).isoformat(),
"queries": [
{
"query_id": e.query_id,
"start_time": e.start_time.isoformat(),
"end_time": e.end_time.isoformat(),
"returned_rows": e.returned_rows,
"query": e.query_text[:200], # truncate for readability
}
for e in entries
],
}
with open("query_logs_output.json", "w") as f:
json.dump(manifest, f, indent=2)
```


@@ -0,0 +1,257 @@
# Validating Pushed Data
All verification queries use the **GraphQL API key** at `https://api.getmontecarlo.com/graphql`.
---
## Resolve a table's MCON and fullTableId
Before running most queries you need either the `mcon` or `fullTableId`.
`fullTableId` format: `<database>:<schema>.<table>` — e.g. `analytics:public.orders`
```graphql
query GetTable($fullTableId: String!, $dwId: UUID!) {
getTable(fullTableId: $fullTableId, dwId: $dwId) {
mcon
fullTableId
displayName
}
}
```
Variables:
```json
{
"fullTableId": "analytics:public.orders",
"dwId": "<warehouse-uuid>"
}
```
---
## Verify metadata (schema + columns)
```graphql
query GetTableMetadata($mcon: String!) {
getTable(mcon: $mcon) {
mcon
fullTableId
versions {
edges {
node {
fields {
name
fieldType
}
}
}
}
}
}
```
Check that the fields list matches your pushed schema.
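A minimal comparison sketch, assuming the response is shaped like `versions.edges[0].node.fields` above (the column names here are hypothetical):

```python
# Hypothetical fields list, shaped like versions.edges[0].node.fields
returned = [
    {"name": "order_id", "fieldType": "INTEGER"},
    {"name": "created_at", "fieldType": "TIMESTAMP"},
]
# Columns you believe you pushed (hypothetical)
expected = {"order_id", "created_at", "status"}

# Case-insensitive name comparison; anything left over was not ingested.
got = {f["name"].lower() for f in returned}
missing = sorted(expected - got)
print(missing)
```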
---
## Verify volume and freshness metrics
Use `getMetricsV4` to fetch row counts and last-modified timestamps:
```graphql
query GetMetrics(
$mcon: String!
$metricName: String!
$startTime: DateTime!
$endTime: DateTime!
) {
getMetricsV4(
dwId: null
mcon: $mcon
metricName: $metricName
startTime: $startTime
endTime: $endTime
) {
metricsJson
}
}
```
Variables (row count):
```json
{
"mcon": "<table-mcon>",
"metricName": "total_row_count",
"startTime": "2024-03-01T00:00:00Z",
"endTime": "2024-03-02T00:00:00Z"
}
```
`metricsJson` is a JSON string. Parse it and look for `value` and `measurementTimestamp`
(camelCase) in each data point.
Other useful metric names:
- `"total_row_count"` — row count
- `"total_byte_count"` — byte size
- `"total_row_count_last_changed_on"` — Unix epoch float of when the row count last changed
---
## Verify table lineage
```graphql
query GetTableLineage($mcon: String!) {
getTableLineage(mcon: $mcon, direction: "upstream", hops: 1) {
connectedNodes {
mcon
displayName
objectType
}
flattenedEdges {
directlyConnectedMcons
}
}
}
```
Check that your expected source tables appear in `connectedNodes` or
`flattenedEdges[].directlyConnectedMcons`.
---
## Verify column lineage
```graphql
query GetColumnLineage($mcon: String!, $column: String!) {
getDerivedTablesPartialLineage(mcon: $mcon, column: $column, pageSize: 1000) {
destinations {
table { mcon displayName }
columns { columnName }
}
}
}
```
Variables: `mcon` = source table MCON, `column` = source column name.
Check that each destination table and column appears in the response.
---
## Verify query logs
```graphql
query GetAggregatedQueries(
$mcon: String!
$queryType: String!
$startTime: DateTime!
$endTime: DateTime!
$first: Int
$after: String
) {
getAggregatedQueries(
mcon: $mcon
queryType: $queryType
startTime: $startTime
endTime: $endTime
first: $first
after: $after
) {
edges { node { queryHash queryCount lastSeen } }
pageInfo { hasNextPage endCursor }
}
}
```
Variables:
```json
{
"mcon": "<table-mcon>",
"queryType": "read",
"startTime": "2024-03-01T00:00:00Z",
"endTime": "2024-03-02T00:00:00Z",
"first": 100
}
```
**Remember**: query logs take up to 1 hour to process after push. If you see 0 results
immediately after pushing, wait and try again.
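One way to handle that lag is a polling loop around the query-log check. The sketch below takes an injectable `fetch_count` callable (stubbed here; a real one would run the `getAggregatedQueries` query above and sum the counts):

```python
import time

def wait_for_query_logs(fetch_count, max_attempts=6, delay_seconds=600):
    """Poll until fetch_count() returns a non-zero total or attempts run out."""
    for attempt in range(max_attempts):
        total = fetch_count()
        if total > 0:
            return total
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    return 0

# Stub fetcher: returns 0 once, then 5 — simulating delayed processing.
results = iter([0, 5])
total = wait_for_query_logs(lambda: next(results), delay_seconds=0)
print(total)
```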
---
## Check detector thresholds (anomaly detection status)
```graphql
query GetDetectorStatus($mcon: String!) {
getTable(mcon: $mcon) {
thresholds {
freshness {
lower { value }
upper { value }
status
}
size {
lower { value }
upper { value }
status
}
}
}
}
```
`status` will be `"no data"` or `"inactive"` on a newly pushed table. Detectors need
historical data to train — see `references/anomaly-detection.md` for requirements.
---
## Table management operations
### Delete push-ingested tables
Only works on push-ingested tables — pull-collected tables are excluded by default.
```graphql
mutation DeletePushTables($mcons: [String!]!) {
deletePushIngestedTables(mcons: $mcons) {
success
deletedCount
}
}
```
Variables:
```json
{
"mcons": ["<mcon-1>", "<mcon-2>"]
}
```
Resolve MCONs first with `getTable(fullTableId: ..., dwId: ...)`.
---
## Python helper
```python
import requests, json
GRAPHQL_URL = "https://api.getmontecarlo.com/graphql"
def graphql(query: str, variables: dict, key_id: str, key_token: str) -> dict:
resp = requests.post(
GRAPHQL_URL,
json={"query": query, "variables": variables},
headers={
"x-mcd-id": key_id,
"x-mcd-token": key_token,
"Content-Type": "application/json",
},
)
resp.raise_for_status()
data = resp.json()
if "errors" in data:
raise RuntimeError(json.dumps(data["errors"], indent=2))
return data["data"]
```


@@ -0,0 +1,357 @@
#!/usr/bin/env python3
"""
Monte Carlo Push Ingestion — Verification Helper
Queries the Monte Carlo GraphQL API to verify that pushed metadata, lineage, and
query logs are visible in the platform.
Prerequisites:
pip install requests
Set environment variables:
MCD_ID — GraphQL API key ID (from getmontecarlo.com/settings/api)
MCD_TOKEN — GraphQL API key secret
MCD_RESOURCE_UUID — Your MC warehouse/resource UUID
Usage:
python sample_verify.py \
--full-table-id "analytics:public.orders" \
--check-schema \
--check-metrics \
--check-detectors \
--check-lineage \
--expected-sources "analytics:public.customers" "analytics:public.raw_orders"
"""
import argparse
import json
import os
import sys
from datetime import datetime, timedelta, timezone
import requests
GRAPHQL_URL = "https://api.getmontecarlo.com/graphql"
def graphql(query: str, variables: dict, key_id: str, key_token: str) -> dict:
"""Execute a GraphQL query/mutation and return the data payload."""
resp = requests.post(
GRAPHQL_URL,
json={"query": query, "variables": variables},
headers={
"x-mcd-id": key_id,
"x-mcd-token": key_token,
"Content-Type": "application/json",
},
timeout=30,
)
resp.raise_for_status()
body = resp.json()
if "errors" in body:
raise RuntimeError(json.dumps(body["errors"], indent=2))
return body["data"]
# ---------------------------------------------------------------------------
# Step 1: Resolve MCON from fullTableId
# ---------------------------------------------------------------------------
def get_table_mcon(full_table_id: str, dw_id: str, key_id: str, key_token: str) -> str:
"""Resolve a fullTableId + warehouse UUID to an MCON."""
data = graphql(
"""query GetTable($fullTableId: String!, $dwId: UUID!) {
getTable(fullTableId: $fullTableId, dwId: $dwId) {
mcon fullTableId displayName
}
}""",
{"fullTableId": full_table_id, "dwId": dw_id},
key_id, key_token,
)
table = data.get("getTable")
if not table:
raise ValueError(f"Table not found: {full_table_id} in resource {dw_id}")
print(f" Resolved: {table['fullTableId']} → MCON: {table['mcon']}")
return table["mcon"]
# ---------------------------------------------------------------------------
# Step 2: Verify schema (columns)
# ---------------------------------------------------------------------------
def verify_schema(mcon: str, expected_fields: list[str], key_id: str, key_token: str) -> bool:
"""Check that the table's column names match expected_fields."""
data = graphql(
"""query GetSchema($mcon: String!) {
getTable(mcon: $mcon) {
versions {
edges {
node {
fields { name fieldType }
}
}
}
}
}""",
{"mcon": mcon},
key_id, key_token,
)
edges = (data.get("getTable") or {}).get("versions", {}).get("edges", [])
if not edges:
print(" WARN: no schema versions found")
return False
fields = edges[0]["node"]["fields"]
got_names = {f["name"].lower() for f in fields}
print(f" Schema: {len(fields)} column(s) — {', '.join(f['name'] for f in fields[:8])}{'...' if len(fields) > 8 else ''}")
if expected_fields:
missing = [e for e in expected_fields if e.lower() not in got_names]
if missing:
print(f" FAIL: missing columns: {missing}")
return False
        print(" PASS: all expected columns present")
return True
# ---------------------------------------------------------------------------
# Step 3: Verify volume/freshness metrics
# ---------------------------------------------------------------------------
def verify_metrics(mcon: str, key_id: str, key_token: str) -> None:
"""Fetch and display the latest row_count and freshness metrics."""
end = datetime.now(tz=timezone.utc)
start = end - timedelta(days=7)
for metric_name in ("total_row_count", "total_row_count_last_changed_on"):
data = graphql(
"""query GetMetrics($mcon: String!, $metricName: String!, $start: DateTime!, $end: DateTime!) {
getMetricsV4(dwId: null, mcon: $mcon, metricName: $metricName,
startTime: $start, endTime: $end) {
metricsJson
}
}""",
{"mcon": mcon, "metricName": metric_name,
"start": start.isoformat(), "end": end.isoformat()},
key_id, key_token,
)
metrics_json = (data.get("getMetricsV4") or {}).get("metricsJson")
if not metrics_json:
print(f" {metric_name}: no data")
continue
points = json.loads(metrics_json)
if not points:
print(f" {metric_name}: no data points")
continue
latest = max(points, key=lambda p: p.get("measurementTimestamp") or "")
val = latest.get("value")
ts = latest.get("measurementTimestamp")
if metric_name == "total_row_count_last_changed_on" and val:
ts_fmt = datetime.fromtimestamp(float(val), tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
print(f" {metric_name}: {ts_fmt}")
else:
print(f" {metric_name}: {val} (at {ts})")
# ---------------------------------------------------------------------------
# Step 3b: Verify detector status (freshness + volume)
# ---------------------------------------------------------------------------
def verify_detectors(mcon: str, key_id: str, key_token: str) -> None:
"""Check the status of freshness and volume anomaly detectors."""
data = graphql(
"""query GetDetectors($mcon: String!) {
getTable(mcon: $mcon) {
thresholds {
freshness { status }
size { status }
}
}
}""",
{"mcon": mcon},
key_id, key_token,
)
thresholds = (data.get("getTable") or {}).get("thresholds") or {}
freshness = thresholds.get("freshness") or {}
size = thresholds.get("size") or {}
freshness_status = freshness.get("status", "not available")
size_status = size.get("status", "not available")
print(f" Freshness detector: {freshness_status}")
print(f" Volume detector: {size_status}")
if freshness_status in ("no data", "training"):
print(" ↳ Freshness needs 7+ pushes with changed last_update_time over ~2 weeks")
if size_status in ("no data", "training"):
print(" ↳ Volume needs 10-48 samples over ~42 days (push hourly, consistently)")
# ---------------------------------------------------------------------------
# Step 4: Verify table lineage (upstream)
# ---------------------------------------------------------------------------
def verify_table_lineage(
mcon: str,
expected_source_mcons: list[str],
key_id: str,
key_token: str,
) -> bool:
"""Check that expected source MCONs appear in the upstream lineage."""
data = graphql(
"""query GetLineage($mcon: String!) {
getTableLineage(mcon: $mcon, direction: "upstream", hops: 1) {
connectedNodes { mcon displayName objectType }
flattenedEdges { directlyConnectedMcons }
}
}""",
{"mcon": mcon},
key_id, key_token,
)
lineage = data.get("getTableLineage") or {}
connected = {n["mcon"] for n in lineage.get("connectedNodes", [])}
flat = {m for e in lineage.get("flattenedEdges", []) for m in e.get("directlyConnectedMcons", [])}
all_found = connected | flat
print(f" Upstream nodes: {len(connected)}")
if not expected_source_mcons:
return True
missing = [s for s in expected_source_mcons if s not in all_found]
if missing:
print(f" FAIL: missing sources: {missing}")
return False
print(" PASS: all expected sources present")
return True
# ---------------------------------------------------------------------------
# Step 5: Verify column lineage
# ---------------------------------------------------------------------------
def verify_column_lineage(
source_mcon: str,
source_column: str,
expected_dest_mcon: str,
expected_dest_column: str,
key_id: str,
key_token: str,
) -> bool:
"""Check that source_column flows to expected_dest_column on expected_dest_mcon."""
data = graphql(
"""query GetColLineage($mcon: String!, $column: String!) {
getDerivedTablesPartialLineage(mcon: $mcon, column: $column, pageSize: 1000) {
destinations {
table { mcon displayName }
columns { columnName }
}
}
}""",
{"mcon": source_mcon, "column": source_column},
key_id, key_token,
)
destinations = (data.get("getDerivedTablesPartialLineage") or {}).get("destinations", [])
for dest in destinations:
if dest["table"]["mcon"] == expected_dest_mcon:
cols = {c["columnName"] for c in dest.get("columns", [])}
if expected_dest_column in cols:
                print(f" PASS: {source_column} → {dest['table']['displayName']}.{expected_dest_column}")
return True
    print(f" FAIL: {source_column} → {expected_dest_mcon}.{expected_dest_column} not found")
return False
# ---------------------------------------------------------------------------
# Step 6: Verify query logs
# ---------------------------------------------------------------------------
def verify_query_logs(
mcon: str,
start_time: datetime,
end_time: datetime,
key_id: str,
key_token: str,
) -> None:
"""Report read/write query counts for a table within the given time window."""
for query_type in ("read", "write"):
cursor = None
total = 0
while True:
data = graphql(
"""query GetQueries($mcon: String!, $type: String!, $start: DateTime!, $end: DateTime!, $after: String) {
getAggregatedQueries(mcon: $mcon, queryType: $type,
startTime: $start, endTime: $end,
first: 200, after: $after) {
edges { node { queryHash queryCount lastSeen } }
pageInfo { hasNextPage endCursor }
}
}""",
{"mcon": mcon, "type": query_type,
"start": start_time.isoformat(), "end": end_time.isoformat(),
"after": cursor},
key_id, key_token,
)
result = data.get("getAggregatedQueries") or {}
total += sum(e["node"]["queryCount"] for e in result.get("edges", []))
page = result.get("pageInfo", {})
if not page.get("hasNextPage"):
break
cursor = page["endCursor"]
print(f" {query_type} queries: {total}")
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def main() -> None:
parser = argparse.ArgumentParser(description="Verify Monte Carlo push-ingested data via GraphQL")
parser.add_argument("--key-id", default=os.environ.get("MCD_ID"))
parser.add_argument("--key-token", default=os.environ.get("MCD_TOKEN"))
parser.add_argument("--resource-uuid", default=os.environ.get("MCD_RESOURCE_UUID"), required=False)
parser.add_argument("--full-table-id", required=True, help="e.g. analytics:public.orders")
parser.add_argument("--mcon", help="Use MCON directly instead of resolving from fullTableId")
parser.add_argument("--check-schema", action="store_true")
parser.add_argument("--check-metrics", action="store_true")
parser.add_argument("--check-detectors", action="store_true", help="Check freshness/volume detector status")
parser.add_argument("--check-lineage", action="store_true")
parser.add_argument("--check-query-logs", action="store_true")
parser.add_argument("--expected-fields", nargs="*", default=[])
parser.add_argument("--expected-sources", nargs="*", default=[], help="Source MCONs for lineage check")
parser.add_argument("--lookback-hours", type=int, default=24, help="For query log check (default: 24)")
args = parser.parse_args()
if not args.key_id or not args.key_token:
print("ERROR: Provide --key-id/--key-token or set MCD_ID/MCD_TOKEN", file=sys.stderr)
sys.exit(1)
print(f"\n{'='*60}")
print(f"Verifying: {args.full_table_id}")
print(f"{'='*60}")
mcon = args.mcon
if not mcon:
if not args.resource_uuid:
print("ERROR: --resource-uuid required when --mcon is not provided", file=sys.stderr)
sys.exit(1)
mcon = get_table_mcon(args.full_table_id, args.resource_uuid, args.key_id, args.key_token)
if args.check_schema:
print("\n[Schema]")
verify_schema(mcon, args.expected_fields, args.key_id, args.key_token)
if args.check_metrics:
print("\n[Metrics]")
verify_metrics(mcon, args.key_id, args.key_token)
if args.check_detectors:
print("\n[Detectors]")
verify_detectors(mcon, args.key_id, args.key_token)
if args.check_lineage:
print("\n[Table Lineage]")
verify_table_lineage(mcon, args.expected_sources, args.key_id, args.key_token)
if args.check_query_logs:
print("\n[Query Logs]")
end = datetime.now(tz=timezone.utc)
start = end - timedelta(hours=args.lookback_hours)
verify_query_logs(mcon, start, end, args.key_id, args.key_token)
print("\nDone.")
if __name__ == "__main__":
main()


@@ -0,0 +1,71 @@
"""
BigQuery Iceberg — Metadata Collect & Push (combined)
=====================================================
Convenience wrapper that runs collect_metadata.collect() followed by
push_metadata.push() in a single invocation. Supports
``--only-freshness-and-volume`` for fast periodic pushes.
Prerequisites:
pip install google-cloud-bigquery pycarlo>=0.12.251
"""
from __future__ import annotations
import argparse
import os
from collect_metadata import collect
from push_metadata import push
def main() -> None:
parser = argparse.ArgumentParser(
description="Collect BigQuery Iceberg metadata and push to Monte Carlo",
)
# Collection args
parser.add_argument("--project-id", default=os.getenv("BIGQUERY_PROJECT_ID"))
parser.add_argument("--datasets", nargs="+", default=None)
parser.add_argument("--tables", nargs="+", default=None)
parser.add_argument(
"--only-freshness-and-volume",
action="store_true",
help="Skip field/schema collection — only collect freshness and volume.",
)
parser.add_argument("--manifest-file", default="metadata_output.json")
# Push args
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--batch-size", type=int, default=500)
parser.add_argument("--push-result-file", default="metadata_push_result.json")
args = parser.parse_args()
if not args.project_id:
parser.error("--project-id or BIGQUERY_PROJECT_ID env var is required")
required_push = ["resource_uuid", "key_id", "key_token"]
missing = [k for k in required_push if getattr(args, k) is None]
if missing:
parser.error(f"Missing required push arguments/env vars: {missing}")
collect(
project_id=args.project_id,
datasets=args.datasets,
tables=args.tables,
only_freshness_and_volume=args.only_freshness_and_volume,
output_file=args.manifest_file,
)
push(
input_file=args.manifest_file,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
output_file=args.push_result_file,
)
if __name__ == "__main__":
main()


@@ -0,0 +1,64 @@
"""
BigQuery Iceberg — Query Log Collect & Push (combined)
=====================================================
Convenience wrapper that runs collect_query_logs.collect() followed by
push_query_logs.push() in a single invocation.
Prerequisites:
pip install google-cloud-bigquery pycarlo>=0.12.251 python-dateutil>=2.8.0
"""
from __future__ import annotations
import argparse
import os
from collect_query_logs import LOOKBACK_HOURS, LOOKBACK_LAG_HOURS, collect
from push_query_logs import push
def main() -> None:
parser = argparse.ArgumentParser(
description="Collect BigQuery query logs and push to Monte Carlo",
)
# Collection args
parser.add_argument("--project-id", default=os.getenv("BIGQUERY_PROJECT_ID"))
parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
parser.add_argument("--lookback-lag-hours", type=int, default=LOOKBACK_LAG_HOURS)
parser.add_argument("--manifest-file", default="query_logs_output.json")
# Push args
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--batch-size", type=int, default=100)
parser.add_argument("--push-result-file", default="query_logs_push_result.json")
args = parser.parse_args()
if not args.project_id:
parser.error("--project-id or BIGQUERY_PROJECT_ID env var is required")
required_push = ["resource_uuid", "key_id", "key_token"]
missing = [k for k in required_push if getattr(args, k) is None]
if missing:
parser.error(f"Missing required push arguments/env vars: {missing}")
collect(
project_id=args.project_id,
lookback_hours=args.lookback_hours,
lookback_lag_hours=args.lookback_lag_hours,
output_file=args.manifest_file,
)
push(
input_file=args.manifest_file,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
output_file=args.push_result_file,
)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,253 @@
"""
BigQuery Iceberg — Metadata Collection (collect only)
=====================================================
Collects table schemas, row counts, byte sizes, and freshness for BigQuery
Iceberg (BigLake-managed) tables using INFORMATION_SCHEMA.TABLE_STORAGE and
INFORMATION_SCHEMA.COLUMNS. Standard BigQuery collection uses __TABLES__ which
does not include Iceberg tables — this template fills that gap.
Can be run standalone via CLI or imported (use the ``collect()`` function).
Supports a ``--only-freshness-and-volume`` flag to skip the COLUMNS query for
fast periodic pushes after the initial full metadata push.
Substitution points (search for "← SUBSTITUTE"):
- BIGQUERY_PROJECT_ID : GCP project ID to collect from
- GOOGLE_APPLICATION_CREDENTIALS : path to service-account JSON key file
- REGION : BigQuery region (default "us")
Prerequisites:
pip install google-cloud-bigquery
"""
from __future__ import annotations
import argparse
import json
import logging
import os
from datetime import datetime, timezone
from google.cloud import bigquery
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
RESOURCE_TYPE = "bigquery"
# BigQuery type → Monte Carlo canonical type
BQ_TYPE_MAP: dict[str, str] = {
"INT64": "INTEGER",
"INTEGER": "INTEGER",
"FLOAT64": "FLOAT",
"FLOAT": "FLOAT",
"BOOL": "BOOLEAN",
"BOOLEAN": "BOOLEAN",
"STRING": "VARCHAR",
"BYTES": "BINARY",
"DATE": "DATE",
"DATETIME": "DATETIME",
"TIMESTAMP": "TIMESTAMP",
"TIME": "TIME",
"NUMERIC": "DECIMAL",
"BIGNUMERIC": "DECIMAL",
"RECORD": "STRUCT",
"STRUCT": "STRUCT",
"REPEATED": "ARRAY",
"JSON": "JSON",
"GEOGRAPHY": "GEOGRAPHY",
}
def map_bq_type(bq_type: str) -> str:
base = bq_type.split("(")[0].strip().upper()
return BQ_TYPE_MAP.get(base, bq_type.upper())
def _fetch_iceberg_tables(
client: bigquery.Client,
project_id: str,
datasets: list[str] | None = None,
tables: list[str] | None = None,
) -> list[dict]:
"""Query TABLE_STORAGE for BigLake (Iceberg) tables."""
conditions = [
"managed_table_type = 'BIGLAKE'",
"deleted = FALSE",
]
if datasets:
ds_list = ", ".join(f"'{d}'" for d in datasets)
conditions.append(f"table_schema IN ({ds_list})")
if tables:
tbl_list = ", ".join(f"'{t}'" for t in tables)
conditions.append(f"table_name IN ({tbl_list})")
where = " AND ".join(conditions)
query = f"""
SELECT
table_schema,
table_name,
total_rows,
current_physical_bytes,
storage_last_modified_time,
creation_time
FROM `{project_id}.region-us`.INFORMATION_SCHEMA.TABLE_STORAGE -- ← SUBSTITUTE: change region if needed
WHERE {where}
ORDER BY table_schema, table_name
"""
log.info("Querying TABLE_STORAGE for Iceberg tables ...")
rows = list(client.query(query).result())
log.info("Found %d Iceberg table(s).", len(rows))
return [dict(row) for row in rows]
def _fetch_columns(
client: bigquery.Client,
project_id: str,
dataset: str,
table_name: str,
) -> list[dict]:
"""Fetch column metadata for a specific table."""
query = f"""
SELECT column_name, data_type, ordinal_position, is_nullable, column_default
FROM `{project_id}.{dataset}.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = '{table_name}'
ORDER BY ordinal_position
"""
return [
{
"name": row["column_name"],
"type": map_bq_type(row["data_type"]),
}
for row in client.query(query).result()
]
def _resolve_freshness(row: dict) -> str:
"""Return the best available freshness timestamp as ISO8601.
Uses storage_last_modified_time if Google has populated it (expected
early April 2026). Falls back to current time with a warning.
"""
if row.get("storage_last_modified_time"):
return row["storage_last_modified_time"].isoformat()
    log.warning(
        "storage_last_modified_time is NULL for %s.%s; "
"falling back to current time. Google's TABLE_STORAGE update "
"for Iceberg tables may not have shipped yet.",
row["table_schema"],
row["table_name"],
)
return datetime.now(timezone.utc).isoformat()
def collect(
project_id: str,
datasets: list[str] | None = None,
tables: list[str] | None = None,
only_freshness_and_volume: bool = False,
output_file: str = "metadata_output.json",
) -> dict:
"""Collect Iceberg table metadata and write a JSON manifest.
When only_freshness_and_volume is True, skips the COLUMNS query and
omits fields from the manifest. Use this for periodic hourly pushes
after the initial full metadata push.
"""
client = bigquery.Client(project=project_id) # ← SUBSTITUTE: adjust auth if needed
if only_freshness_and_volume:
log.info("Running in freshness+volume only mode (skipping fields).")
iceberg_tables = _fetch_iceberg_tables(client, project_id, datasets, tables)
if not iceberg_tables:
log.warning("No Iceberg tables found matching the criteria.")
return {"resource_type": RESOURCE_TYPE, "assets": []}
assets: list[dict] = []
for row in iceberg_tables:
dataset = row["table_schema"]
name = row["table_name"]
asset = {
"name": name,
"database": project_id,
"schema": dataset,
"type": "TABLE",
"volume": {
"row_count": row["total_rows"],
"byte_count": row["current_physical_bytes"],
},
"freshness": {
"last_updated_time": _resolve_freshness(row),
},
}
if not only_freshness_and_volume:
asset["description"] = None
asset["fields"] = _fetch_columns(client, project_id, dataset, name)
assets.append(asset)
log.info(
"Collected %s.%s.%s — rows=%s, bytes=%s",
project_id, dataset, name,
row["total_rows"], row["current_physical_bytes"],
)
manifest = {
"resource_type": RESOURCE_TYPE,
"collected_at": datetime.now(timezone.utc).isoformat(),
"assets": assets,
}
with open(output_file, "w") as fh:
json.dump(manifest, fh, indent=2)
log.info("Manifest written to %s (%d assets)", output_file, len(assets))
return manifest
def main() -> None:
parser = argparse.ArgumentParser(
description="Collect BigQuery Iceberg table metadata into a JSON manifest",
)
parser.add_argument(
"--project-id",
default=os.getenv("BIGQUERY_PROJECT_ID"), # ← SUBSTITUTE
help="GCP project ID (or set BIGQUERY_PROJECT_ID env var)",
)
parser.add_argument(
"--datasets",
nargs="+",
default=None,
help="Limit to specific dataset(s). Omit to scan all datasets.",
)
parser.add_argument(
"--tables",
nargs="+",
default=None,
help="Limit to specific table name(s) within the datasets.",
)
parser.add_argument(
"--only-freshness-and-volume",
action="store_true",
help="Skip field/schema collection — only collect freshness and volume. "
"Use for periodic hourly pushes after the initial full metadata push.",
)
parser.add_argument("--output-file", default="metadata_output.json")
args = parser.parse_args()
if not args.project_id:
parser.error("--project-id or BIGQUERY_PROJECT_ID env var is required")
collect(
project_id=args.project_id,
datasets=args.datasets,
tables=args.tables,
only_freshness_and_volume=args.only_freshness_and_volume,
output_file=args.output_file,
)
if __name__ == "__main__":
main()


@@ -0,0 +1,149 @@
"""
BigQuery Iceberg — Query Log Collection (collect only)
======================================================
Queries the BigQuery Jobs API for completed query jobs within a time
window and writes a JSON manifest that can be fed to push_query_logs.py.
Can be run standalone via CLI or imported (use the ``collect()`` function).
Substitution points (search for "← SUBSTITUTE"):
- BIGQUERY_PROJECT_ID : GCP project ID to collect from
- GOOGLE_APPLICATION_CREDENTIALS : path to service-account JSON key file
Prerequisites:
pip install google-cloud-bigquery
"""
from __future__ import annotations
import argparse
import json
import logging
import os
from datetime import datetime, timedelta, timezone
from google.cloud import bigquery
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
LOG_TYPE = "bigquery"
LOOKBACK_HOURS: int = int(os.getenv("LOOKBACK_HOURS", "25"))
LOOKBACK_LAG_HOURS: int = int(os.getenv("LOOKBACK_LAG_HOURS", "1"))
MAX_JOBS: int = int(os.getenv("MAX_JOBS", "10000"))
# Limit to specific statement types — empty list means collect all.
STATEMENT_TYPE_FILTER: list[str] = []
def _safe_isoformat(dt: datetime | None) -> str | None:
if dt is None:
return None
if dt.tzinfo is None:
dt = dt.replace(tzinfo=timezone.utc)
return dt.isoformat()
def _collect_query_logs(
bq_client: bigquery.Client,
project_id: str,
start_dt: datetime,
end_dt: datetime,
) -> list[dict]:
"""Collect query logs from BigQuery job history."""
entries: list[dict] = []
log.info(
"Listing jobs for project=%s from %s to %s",
project_id, start_dt.isoformat(), end_dt.isoformat(),
)
for job in bq_client.list_jobs(
project=project_id,
all_users=True,
min_creation_time=start_dt,
max_creation_time=end_dt,
):
sql: str = getattr(job, "query", None) or ""
if not sql.strip():
continue
statement_type: str = getattr(job, "statement_type", None) or ""
if STATEMENT_TYPE_FILTER and statement_type not in STATEMENT_TYPE_FILTER:
continue
entries.append({
"query_id": job.job_id,
"query_text": sql,
"start_time": _safe_isoformat(getattr(job, "created", None)),
"end_time": _safe_isoformat(getattr(job, "ended", None)),
"user": getattr(job, "user_email", None),
"total_bytes_billed": getattr(job, "total_bytes_billed", None),
"statement_type": statement_type or None,
})
if len(entries) >= MAX_JOBS:
log.warning("Reached MAX_JOBS=%d — stopping early", MAX_JOBS)
break
return entries
def collect(
project_id: str,
lookback_hours: int = LOOKBACK_HOURS,
lookback_lag_hours: int = LOOKBACK_LAG_HOURS,
output_file: str = "query_logs_output.json",
) -> dict:
"""Collect query logs and write a JSON manifest."""
bq_client = bigquery.Client(project=project_id)
end_dt = datetime.now(timezone.utc) - timedelta(hours=lookback_lag_hours)
start_dt = end_dt - timedelta(hours=lookback_hours)
entries = _collect_query_logs(bq_client, project_id, start_dt, end_dt)
log.info("Collected %d query log entries.", len(entries))
manifest = {
"log_type": LOG_TYPE,
"collected_at": datetime.now(timezone.utc).isoformat(),
"window_start": start_dt.isoformat(),
"window_end": end_dt.isoformat(),
"query_log_count": len(entries),
"queries": entries,
}
with open(output_file, "w") as fh:
json.dump(manifest, fh, indent=2)
log.info("Query log manifest written to %s", output_file)
return manifest
def main() -> None:
parser = argparse.ArgumentParser(
description="Collect BigQuery query logs into a JSON manifest",
)
parser.add_argument(
"--project-id",
default=os.getenv("BIGQUERY_PROJECT_ID"),
help="GCP project ID (or set BIGQUERY_PROJECT_ID env var)",
)
parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
parser.add_argument("--lookback-lag-hours", type=int, default=LOOKBACK_LAG_HOURS)
parser.add_argument("--output-file", default="query_logs_output.json")
args = parser.parse_args()
if not args.project_id:
parser.error("--project-id or BIGQUERY_PROJECT_ID env var is required")
collect(
project_id=args.project_id,
lookback_hours=args.lookback_hours,
lookback_lag_hours=args.lookback_lag_hours,
output_file=args.output_file,
)
if __name__ == "__main__":
main()


@@ -0,0 +1,190 @@
"""
BigQuery Iceberg — Metadata Push (push only)
============================================
Reads a JSON manifest produced by collect_metadata.py and pushes table
metadata to Monte Carlo using the pycarlo SDK's IngestionService.
Can be run standalone via CLI or imported (use the ``push()`` function).
Substitution points (search for "← SUBSTITUTE"):
- MCD_INGEST_ID : Monte Carlo Ingestion API key ID
- MCD_INGEST_TOKEN : Monte Carlo Ingestion API key token
- MCD_RESOURCE_UUID : Monte Carlo warehouse resource UUID
Prerequisites:
pip install pycarlo>=0.12.251
"""
from __future__ import annotations
import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
AssetField,
AssetFreshness,
AssetMetadata,
AssetVolume,
RelationalAsset,
)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
RESOURCE_TYPE = "bigquery"
_BATCH_SIZE = 500
_ENDPOINT = "https://integrations.getmontecarlo.com"
def _asset_from_dict(d: dict) -> RelationalAsset:
"""Reconstruct a RelationalAsset from a manifest dict entry."""
fields = [
AssetField(
name=f["name"],
type=f.get("type"),
description=f.get("description"),
)
for f in d.get("fields", [])
]
volume = None
if d.get("volume"):
volume = AssetVolume(
row_count=d["volume"].get("row_count"),
byte_count=d["volume"].get("byte_count"),
)
freshness = None
if d.get("freshness") and d["freshness"].get("last_updated_time"):
freshness = AssetFreshness(
last_update_time=d["freshness"]["last_updated_time"],
)
return RelationalAsset(
type=d.get("type", "TABLE"),
metadata=AssetMetadata(
name=d["name"],
database=d["database"],
schema=d["schema"],
description=d.get("description"),
),
fields=fields,
volume=volume,
freshness=freshness,
)
def push(
input_file: str,
resource_uuid: str,
key_id: str,
key_token: str,
batch_size: int = _BATCH_SIZE,
output_file: str = "metadata_push_result.json",
) -> dict:
"""Read a metadata manifest and push assets to Monte Carlo in batches."""
endpoint = _ENDPOINT
log.info("Using endpoint: %s", endpoint)
with open(input_file) as fh:
manifest = json.load(fh)
asset_dicts = manifest.get("assets", [])
resource_type = manifest.get("resource_type", RESOURCE_TYPE)
assets = [_asset_from_dict(d) for d in asset_dicts]
log.info("Loaded %d asset(s) from %s", len(assets), input_file)
    # max(len(assets), 1) yields at least one (possibly empty) batch, so a push is always sent
    batches = [assets[i : i + batch_size] for i in range(0, max(len(assets), 1), batch_size)]
total_batches = len(batches)
def _push_batch(batch: list[RelationalAsset], batch_num: int) -> str | None:
client = Client(session=Session(
mcd_id=key_id, mcd_token=key_token, scope="Ingestion", endpoint=endpoint,
))
service = IngestionService(mc_client=client)
result = service.send_metadata(
resource_uuid=resource_uuid,
resource_type=resource_type,
events=batch,
)
invocation_id = service.extract_invocation_id(result)
log.info(
"Pushed batch %d/%d (%d assets) — invocation_id=%s",
batch_num, total_batches, len(batch), invocation_id,
)
return invocation_id
max_workers = min(4, total_batches)
invocation_ids: list[str | None] = [None] * total_batches
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = {
pool.submit(_push_batch, batch, i + 1): i
for i, batch in enumerate(batches)
}
for future in as_completed(futures):
idx = futures[future]
try:
invocation_ids[idx] = future.result()
except Exception as exc:
log.error("ERROR pushing batch %d: %s", idx + 1, exc)
raise
log.info("All %d batch(es) pushed.", total_batches)
push_result = {
"resource_uuid": resource_uuid,
"resource_type": resource_type,
"invocation_ids": invocation_ids,
"pushed_at": datetime.now(timezone.utc).isoformat(),
"total_assets": len(assets),
"batch_count": total_batches,
"batch_size": batch_size,
}
with open(output_file, "w") as fh:
json.dump(push_result, fh, indent=2)
log.info("Push result written to %s", output_file)
return push_result
def main() -> None:
parser = argparse.ArgumentParser(
description="Push BigQuery Iceberg metadata from a manifest to Monte Carlo",
)
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--input-file", default="metadata_output.json")
parser.add_argument("--output-file", default="metadata_push_result.json")
parser.add_argument(
"--batch-size",
type=int,
default=_BATCH_SIZE,
help=f"Max assets per push batch (default: {_BATCH_SIZE})",
)
args = parser.parse_args()
required = ["resource_uuid", "key_id", "key_token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
push(
input_file=args.input_file,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
output_file=args.output_file,
)
if __name__ == "__main__":
main()
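
The batching arithmetic in `push()` can be checked in isolation. A small sketch with a hypothetical asset count; plain integers stand in for `RelationalAsset` objects:

```python
# Batching sketch: 1201 stand-in assets split into chunks of at most 500,
# mirroring the slicing expression in push() above.
batch_size = 500
assets = list(range(1201))  # hypothetical count; integers stand in for assets

batches = [assets[i : i + batch_size] for i in range(0, max(len(assets), 1), batch_size)]
print(len(batches))               # 3
print([len(b) for b in batches])  # [500, 500, 201]
```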


@@ -0,0 +1,208 @@
"""
BigQuery Iceberg — Query Log Push (push only)
=============================================
Reads a JSON manifest produced by collect_query_logs.py and pushes query
log entries to Monte Carlo using the pycarlo SDK's IngestionService.
Uses dateutil.parser.isoparse() to convert ISO 8601 strings back to datetime
objects (QueryLogEntry requires datetime, not str).
Can be run standalone via CLI or imported (use the ``push()`` function).
Required credentials (CLI flags or environment variables; this push-only
script has no inline "← SUBSTITUTE" markers):
- MCD_INGEST_ID     : Monte Carlo Ingestion API key ID
- MCD_INGEST_TOKEN  : Monte Carlo Ingestion API key token
- MCD_RESOURCE_UUID : Monte Carlo warehouse resource UUID
Prerequisites:
pip install pycarlo>=0.12.251 python-dateutil>=2.8.0
"""
from __future__ import annotations
import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from dateutil.parser import isoparse
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import QueryLogEntry
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
LOG_TYPE = "bigquery"
# Query logs include full SQL text — keep batches small to stay under the
# 1 MB compressed payload limit.
_BATCH_SIZE = 100
# Truncate very long SQL to prevent 413 errors.
_MAX_QUERY_TEXT_LEN = 10_000
_ENDPOINT = "https://integrations.getmontecarlo.com"
def _build_query_log_entries(queries: list[dict]) -> list[QueryLogEntry]:
"""Convert manifest query dicts into QueryLogEntry objects."""
entries = []
truncated = 0
for q in queries:
query_text = q.get("query_text") or ""
if len(query_text) > _MAX_QUERY_TEXT_LEN:
query_text = query_text[:_MAX_QUERY_TEXT_LEN] + "... [TRUNCATED]"
truncated += 1
extra = {}
if q.get("total_bytes_billed") is not None:
extra["total_bytes_billed"] = q["total_bytes_billed"]
if q.get("statement_type") is not None:
extra["statement_type"] = q["statement_type"]
start_time = q.get("start_time")
end_time = q.get("end_time")
entry = QueryLogEntry(
query_id=q.get("query_id"),
query_text=query_text,
start_time=isoparse(start_time) if start_time else None,
end_time=isoparse(end_time) if end_time else None,
user=q.get("user"),
extra=extra or None,
)
entries.append(entry)
if truncated:
log.info("Truncated %d query text(s) exceeding %d chars", truncated, _MAX_QUERY_TEXT_LEN)
return entries
def push(
input_file: str,
resource_uuid: str,
key_id: str,
key_token: str,
batch_size: int = _BATCH_SIZE,
output_file: str = "query_logs_push_result.json",
) -> dict:
"""Read a query log manifest and push entries to Monte Carlo in batches."""
endpoint = _ENDPOINT
log.info("Using endpoint: %s", endpoint)
with open(input_file) as fh:
manifest = json.load(fh)
queries = manifest.get("queries", [])
log_type = manifest.get("log_type", LOG_TYPE)
entries = _build_query_log_entries(queries)
    log.info("Loaded %d query log entries from %s", len(entries), input_file)
if not entries:
log.info("No query log entries to push.")
push_result = {
"resource_uuid": resource_uuid,
"log_type": log_type,
"invocation_ids": [],
"pushed_at": datetime.now(timezone.utc).isoformat(),
"total_entries": 0,
"batch_count": 0,
"batch_size": batch_size,
}
with open(output_file, "w") as fh:
json.dump(push_result, fh, indent=2)
return push_result
batches = [entries[i : i + batch_size] for i in range(0, len(entries), batch_size)]
total_batches = len(batches)
def _push_batch(batch: list[QueryLogEntry], batch_num: int) -> str | None:
client = Client(session=Session(
mcd_id=key_id, mcd_token=key_token, scope="Ingestion", endpoint=endpoint,
))
service = IngestionService(mc_client=client)
result = service.send_query_logs(
resource_uuid=resource_uuid,
log_type=log_type,
events=batch,
)
invocation_id = service.extract_invocation_id(result)
log.info(
"Pushed batch %d/%d (%d entries) — invocation_id=%s",
batch_num, total_batches, len(batch), invocation_id,
)
return invocation_id
max_workers = min(4, total_batches)
invocation_ids: list[str | None] = [None] * total_batches
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = {
pool.submit(_push_batch, batch, i + 1): i
for i, batch in enumerate(batches)
}
for future in as_completed(futures):
idx = futures[future]
try:
invocation_ids[idx] = future.result()
except Exception as exc:
log.error("ERROR pushing batch %d: %s", idx + 1, exc)
raise
log.info("All %d batch(es) pushed.", total_batches)
push_result = {
"resource_uuid": resource_uuid,
"log_type": log_type,
"invocation_ids": invocation_ids,
"pushed_at": datetime.now(timezone.utc).isoformat(),
"total_entries": len(entries),
"batch_count": total_batches,
"batch_size": batch_size,
}
with open(output_file, "w") as fh:
json.dump(push_result, fh, indent=2)
log.info("Push result written to %s", output_file)
return push_result
def main() -> None:
parser = argparse.ArgumentParser(
description="Push BigQuery query logs from a manifest to Monte Carlo",
)
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--input-file", default="query_logs_output.json")
parser.add_argument("--output-file", default="query_logs_push_result.json")
parser.add_argument(
"--batch-size",
type=int,
default=_BATCH_SIZE,
help=f"Max entries per push batch (default: {_BATCH_SIZE})",
)
args = parser.parse_args()
required = ["resource_uuid", "key_id", "key_token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
push(
input_file=args.input_file,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
output_file=args.output_file,
)
if __name__ == "__main__":
main()
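
The truncation rule in `_build_query_log_entries()` can be exercised on its own. A sketch with a synthetic long query; the constant mirrors `_MAX_QUERY_TEXT_LEN` above:

```python
# Truncation sketch: SQL longer than the cap is cut and suffixed with a marker,
# mirroring _build_query_log_entries() above.
MAX_QUERY_TEXT_LEN = 10_000

def truncate(sql: str) -> str:
    if len(sql) > MAX_QUERY_TEXT_LEN:
        return sql[:MAX_QUERY_TEXT_LEN] + "... [TRUNCATED]"
    return sql

long_sql = "SELECT " + "col, " * 5000  # synthetic query well over the cap
out = truncate(long_sql)
print(out.endswith("... [TRUNCATED]"))  # True
print(truncate("SELECT 1"))             # SELECT 1
```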


@@ -0,0 +1,70 @@
"""
BigQuery — Lineage Collection and Push (combined)
===================================================
Imports ``collect()`` from ``collect_lineage`` and ``push()`` from
``push_lineage``, runs both in sequence.
Substitution points (search for "← SUBSTITUTE"):
- BIGQUERY_PROJECT_ID : GCP project ID to collect from
- BIGQUERY_REGION : BigQuery region for INFORMATION_SCHEMA queries (e.g. "us", "eu")
- LOOKBACK_HOURS : how far back to scan job history (default 24 h)
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the BigQuery connection in Monte Carlo
Prerequisites:
pip install google-cloud-bigquery pycarlo
"""
from __future__ import annotations
import argparse
import os
from collect_lineage import collect, LOOKBACK_HOURS
from push_lineage import push, _BATCH_SIZE
def main() -> None:
parser = argparse.ArgumentParser(description="Push BigQuery lineage to Monte Carlo")
parser.add_argument("--project-id", default=os.getenv("BIGQUERY_PROJECT_ID")) # ← SUBSTITUTE
parser.add_argument("--region", default=os.getenv("BIGQUERY_REGION", "us")) # ← SUBSTITUTE
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
parser.add_argument("--output-file", default="lineage_output.json")
parser.add_argument("--push-result-file", default="lineage_push_result.json")
parser.add_argument(
"--batch-size",
type=int,
default=_BATCH_SIZE,
help=f"Max events per push batch (default: {_BATCH_SIZE})",
)
args = parser.parse_args()
required = ["project_id", "resource_uuid", "key_id", "key_token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
# Step 1: Collect
collect(
project_id=args.project_id,
region=args.region,
lookback_hours=args.lookback_hours,
output_file=args.output_file,
)
# Step 2: Push
push(
input_file=args.output_file,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
output_file=args.push_result_file,
)
if __name__ == "__main__":
main()


@@ -0,0 +1,65 @@
"""
BigQuery — Metadata Collection and Push (combined)
===================================================
Imports ``collect()`` from ``collect_metadata`` and ``push()`` from
``push_metadata``, runs both in sequence.
Substitution points (search for "← SUBSTITUTE"):
- BIGQUERY_PROJECT_ID : GCP project ID to collect from
- GOOGLE_APPLICATION_CREDENTIALS : path to service-account JSON key file
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the BigQuery connection in Monte Carlo
- DATASET_EXCLUSIONS : datasets to skip (informational / system datasets)
Prerequisites:
pip install google-cloud-bigquery pycarlo
"""
from __future__ import annotations
import argparse
import os
from collect_metadata import collect
from push_metadata import push, _BATCH_SIZE
def main() -> None:
parser = argparse.ArgumentParser(description="Push BigQuery metadata to Monte Carlo")
parser.add_argument("--project-id", default=os.getenv("BIGQUERY_PROJECT_ID")) # ← SUBSTITUTE
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--output-file", default="metadata_output.json")
parser.add_argument("--push-result-file", default="metadata_push_result.json")
parser.add_argument(
"--batch-size",
type=int,
default=_BATCH_SIZE,
help=f"Max assets per push batch (default: {_BATCH_SIZE})",
)
args = parser.parse_args()
    required = ["project_id", "resource_uuid", "key_id", "key_token"]
    missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
# Step 1: Collect
collect(
project_id=args.project_id,
output_file=args.output_file,
)
# Step 2: Push
push(
input_file=args.output_file,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
output_file=args.push_result_file,
)
if __name__ == "__main__":
main()


@@ -0,0 +1,70 @@
"""
BigQuery — Query Log Collection and Push (combined)
=====================================================
Imports ``collect()`` from ``collect_query_logs`` and ``push()`` from
``push_query_logs``, runs both in sequence.
Substitution points (search for "← SUBSTITUTE"):
- BIGQUERY_PROJECT_ID : GCP project ID to collect query logs from
- GOOGLE_APPLICATION_CREDENTIALS : path to service-account JSON key file
- LOOKBACK_HOURS : how many hours back to collect (default 25, skip last 1 h)
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the BigQuery connection in Monte Carlo
Prerequisites:
pip install google-cloud-bigquery pycarlo
"""
from __future__ import annotations
import argparse
import os
from collect_query_logs import collect, LOOKBACK_HOURS, LOOKBACK_LAG_HOURS
from push_query_logs import push, _BATCH_SIZE
def main() -> None:
parser = argparse.ArgumentParser(description="Push BigQuery query logs to Monte Carlo")
parser.add_argument("--project-id", default=os.getenv("BIGQUERY_PROJECT_ID")) # ← SUBSTITUTE
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
parser.add_argument("--lookback-lag-hours", type=int, default=LOOKBACK_LAG_HOURS)
parser.add_argument("--output-file", default="query_logs_output.json")
parser.add_argument("--push-result-file", default="query_logs_push_result.json")
parser.add_argument(
"--batch-size",
type=int,
default=_BATCH_SIZE,
help=f"Max entries per push batch (default: {_BATCH_SIZE})",
)
args = parser.parse_args()
required = ["project_id", "resource_uuid", "key_id", "key_token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
# Step 1: Collect
collect(
project_id=args.project_id,
lookback_hours=args.lookback_hours,
lookback_lag_hours=args.lookback_lag_hours,
output_file=args.output_file,
)
# Step 2: Push
push(
input_file=args.output_file,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
output_file=args.push_result_file,
)
if __name__ == "__main__":
main()


@@ -0,0 +1,214 @@
"""
BigQuery — Lineage Collection (collect only)
=============================================
Collects table-level lineage from two sources:
1. INFORMATION_SCHEMA.SCHEMATA_LINKS — cross-project dataset shares (per region)
2. Job query history — SQL parsing for CREATE TABLE AS SELECT and INSERT INTO
SELECT patterns to derive source->destination relationships.
Writes the collected lineage edges to a JSON manifest file.
Can be run standalone via CLI or imported (use the ``collect()`` function).
Substitution points (search for "← SUBSTITUTE"):
- BIGQUERY_PROJECT_ID : GCP project ID to collect from
- BIGQUERY_REGION : BigQuery region for INFORMATION_SCHEMA queries (e.g. "us", "eu")
- LOOKBACK_HOURS : how far back to scan job history (default 24 h)
Prerequisites:
pip install google-cloud-bigquery
"""
from __future__ import annotations
import argparse
import json
import logging
import os
import re
from datetime import datetime, timedelta, timezone
from google.cloud import bigquery
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
RESOURCE_TYPE = "bigquery"
LOOKBACK_HOURS = int(os.getenv("LOOKBACK_HOURS", "24")) # ← SUBSTITUTE: adjust lookback window
# Regex patterns to detect CTAS and INSERT INTO SELECT in BigQuery SQL
_CTAS_PATTERN = re.compile(
r"CREATE\s+(?:OR\s+REPLACE\s+)?(?:TABLE|VIEW)\s+`?(?P<dest>[\w.\-]+)`?"
r".*?(?:AS\s+)?SELECT\b",
re.IGNORECASE | re.DOTALL,
)
_INSERT_PATTERN = re.compile(
r"INSERT\s+(?:INTO\s+)?`?(?P<dest>[\w.\-]+)`?.*?SELECT\b",
re.IGNORECASE | re.DOTALL,
)
_TABLE_REF_PATTERN = re.compile(r"`?([\w\-]+\.[\w\-]+\.[\w\-]+)`?", re.IGNORECASE)
def _parse_full_name(full_name: str) -> tuple[str, str, str]:
"""Split 'project.dataset.table' into (project, dataset, table)."""
parts = full_name.replace("`", "").split(".")
if len(parts) == 3:
return parts[0], parts[1], parts[2]
if len(parts) == 2:
return "", parts[0], parts[1]
return "", "", parts[0]
def _collect_schema_link_lineage(
bq_client: bigquery.Client,
project_id: str,
region: str,
) -> list[dict]:
"""Collect cross-project lineage from INFORMATION_SCHEMA.SCHEMATA_LINKS."""
query = f"""
SELECT
CATALOG_NAME AS source_project,
SCHEMA_NAME AS source_dataset,
LINKED_SCHEMA_CATALOG_NAME AS destination_project,
LINKED_SCHEMA_NAME AS destination_dataset
FROM `{project_id}`.`{region}`.INFORMATION_SCHEMA.SCHEMATA_LINKS
""" # ← SUBSTITUTE: update project_id and region as needed
edges: list[dict] = []
try:
for row in bq_client.query(query).result():
edges.append(
{
"destination": {
"database": row.destination_project,
"schema": row.destination_dataset,
"table": "*",
},
"sources": [
{
"database": row.source_project,
"schema": row.source_dataset,
"table": "*",
}
],
}
)
except Exception:
log.warning("SCHEMATA_LINKS query failed — skipping dataset-share lineage", exc_info=True)
return edges
def _collect_query_lineage(
bq_client: bigquery.Client,
project_id: str,
lookback_hours: int,
) -> list[dict]:
"""Derive lineage by parsing CTAS/INSERT patterns in job query history."""
end_dt = datetime.now(timezone.utc)
start_dt = end_dt - timedelta(hours=lookback_hours)
edges: list[dict] = []
for job in bq_client.list_jobs(all_users=True, min_creation_time=start_dt, max_creation_time=end_dt):
sql: str = getattr(job, "query", None) or ""
if not sql.strip():
continue
dest_match = _CTAS_PATTERN.search(sql) or _INSERT_PATTERN.search(sql)
if not dest_match:
continue
dest_full = dest_match.group("dest")
dest_project, dest_dataset, dest_table = _parse_full_name(dest_full)
if not dest_table:
continue
# Collect all 3-part table references in the query as sources, excluding destination
source_refs = [
m.group(1)
for m in _TABLE_REF_PATTERN.finditer(sql)
if m.group(1) != dest_full
]
if not source_refs:
continue
unique_sources = list(dict.fromkeys(source_refs))
sources = []
for ref in unique_sources:
p, d, t = _parse_full_name(ref)
sources.append({"database": p, "schema": d, "table": t})
edges.append(
{
"destination": {
"database": dest_project or project_id,
"schema": dest_dataset,
"table": dest_table,
},
"sources": sources,
}
)
return edges
def collect(
project_id: str,
region: str = "us",
lookback_hours: int = LOOKBACK_HOURS,
output_file: str = "lineage_output.json",
) -> dict:
"""
Connect to BigQuery, collect lineage edges, and write a JSON manifest.
Returns the manifest dict.
"""
bq_client = bigquery.Client(project=project_id)
log.info("Collecting lineage from project %s ...", project_id)
schema_edges = _collect_schema_link_lineage(bq_client, project_id, region)
query_edges = _collect_query_lineage(bq_client, project_id, lookback_hours)
all_edges = schema_edges + query_edges
log.info(
"Collected %d lineage edges (%d schema-link, %d query-derived)",
len(all_edges), len(schema_edges), len(query_edges),
)
manifest = {
"resource_type": RESOURCE_TYPE,
"collected_at": datetime.now(timezone.utc).isoformat(),
"schema_link_edges": len(schema_edges),
"query_derived_edges": len(query_edges),
"edges": all_edges,
}
with open(output_file, "w") as fh:
json.dump(manifest, fh, indent=2)
log.info("Lineage manifest written to %s", output_file)
return manifest
def main() -> None:
parser = argparse.ArgumentParser(
description="Collect BigQuery lineage and write to a manifest file",
)
parser.add_argument("--project-id", default=os.getenv("BIGQUERY_PROJECT_ID")) # ← SUBSTITUTE
parser.add_argument("--region", default=os.getenv("BIGQUERY_REGION", "us")) # ← SUBSTITUTE
parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
parser.add_argument("--output-file", default="lineage_output.json")
args = parser.parse_args()
required = ["project_id"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
collect(
project_id=args.project_id,
region=args.region,
lookback_hours=args.lookback_hours,
output_file=args.output_file,
)
if __name__ == "__main__":
main()
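
The regex-based lineage extraction above can be demonstrated on a single statement. A sketch using the same patterns; the SQL itself is hypothetical:

```python
# Lineage-parsing sketch: the CTAS pattern captures the destination table and
# the 3-part table-reference pattern yields the sources, as in
# _collect_query_lineage() above.
import re

CTAS_PATTERN = re.compile(
    r"CREATE\s+(?:OR\s+REPLACE\s+)?(?:TABLE|VIEW)\s+`?(?P<dest>[\w.\-]+)`?"
    r".*?(?:AS\s+)?SELECT\b",
    re.IGNORECASE | re.DOTALL,
)
TABLE_REF_PATTERN = re.compile(r"`?([\w\-]+\.[\w\-]+\.[\w\-]+)`?", re.IGNORECASE)

sql = "CREATE OR REPLACE TABLE `proj.mart.daily` AS SELECT * FROM proj.raw.events"
dest = CTAS_PATTERN.search(sql).group("dest")
# All 3-part references except the destination become sources.
sources = [ref for ref in TABLE_REF_PATTERN.findall(sql) if ref != dest]
print(dest)     # proj.mart.daily
print(sources)  # ['proj.raw.events']
```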


@@ -0,0 +1,160 @@
"""
BigQuery — Metadata Collection (collect only)
==============================================
Collects table schemas, row counts, byte sizes, and descriptions from all
datasets in a BigQuery project and writes them to a JSON manifest file.
Can be run standalone via CLI or imported (use the ``collect()`` function).
Substitution points (search for "← SUBSTITUTE"):
- BIGQUERY_PROJECT_ID : GCP project ID to collect from
- GOOGLE_APPLICATION_CREDENTIALS : path to service-account JSON key file
- DATASET_EXCLUSIONS : datasets to skip (informational / system datasets)
Prerequisites:
pip install google-cloud-bigquery
"""
from __future__ import annotations
import argparse
import json
import logging
import os
from datetime import datetime, timezone
from google.cloud import bigquery
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
RESOURCE_TYPE = "bigquery"
# Datasets to skip — add any internal / system datasets here
DATASET_EXCLUSIONS = { # ← SUBSTITUTE: add datasets to exclude
"_bqc_",
"INFORMATION_SCHEMA",
}
# BigQuery type → Monte Carlo canonical type
BQ_TYPE_MAP: dict[str, str] = {
"INT64": "INTEGER",
"INTEGER": "INTEGER",
"FLOAT64": "FLOAT",
"FLOAT": "FLOAT",
"BOOL": "BOOLEAN",
"BOOLEAN": "BOOLEAN",
"STRING": "VARCHAR",
"BYTES": "BINARY",
"DATE": "DATE",
"DATETIME": "DATETIME",
"TIMESTAMP": "TIMESTAMP",
"TIME": "TIME",
"NUMERIC": "DECIMAL",
"BIGNUMERIC": "DECIMAL",
"RECORD": "STRUCT",
"STRUCT": "STRUCT",
"REPEATED": "ARRAY",
"JSON": "JSON",
"GEOGRAPHY": "GEOGRAPHY",
}
def map_bq_type(bq_type: str) -> str:
return BQ_TYPE_MAP.get(bq_type.upper(), bq_type.upper())
def _collect_assets(bq_client: bigquery.Client, project_id: str) -> list[dict]:
"""Collect table metadata from BigQuery and return as a list of dicts."""
assets: list[dict] = []
for dataset_item in bq_client.list_datasets():
dataset_id = dataset_item.dataset_id
if any(exc in dataset_id for exc in DATASET_EXCLUSIONS):
log.info("Skipping dataset %s", dataset_id)
continue
dataset_ref = bq_client.dataset(dataset_id)
for table_item in bq_client.list_tables(dataset_ref):
table_ref = dataset_ref.table(table_item.table_id)
table = bq_client.get_table(table_ref)
fields = [
{
"name": field.name,
"type": map_bq_type(field.field_type),
"description": field.description or None,
}
for field in table.schema
]
asset = {
"name": table.table_id,
"database": project_id, # ← SUBSTITUTE: use project or dataset as database
"schema": dataset_id,
"type": "VIEW" if table.table_type == "VIEW" else "TABLE",
"description": table.description or None,
"fields": fields,
"volume": {
"row_count": table.num_rows,
"byte_count": table.num_bytes,
},
"freshness": {
"last_updated_time": table.modified.isoformat() if table.modified else None,
},
}
assets.append(asset)
log.info("Queued %s.%s.%s", project_id, dataset_id, table.table_id)
return assets
def collect(
project_id: str,
output_file: str = "metadata_output.json",
) -> dict:
"""
Connect to BigQuery, collect table metadata, and write a JSON manifest.
Returns the manifest dict.
"""
bq_client = bigquery.Client(project=project_id) # ← SUBSTITUTE: adjust auth if needed
log.info("Collecting metadata from project %s ...", project_id)
assets = _collect_assets(bq_client, project_id)
log.info("Collected %d asset(s).", len(assets))
manifest = {
"resource_type": RESOURCE_TYPE,
"collected_at": datetime.now(timezone.utc).isoformat(),
"assets": assets,
}
with open(output_file, "w") as fh:
json.dump(manifest, fh, indent=2)
log.info("Asset manifest written to %s", output_file)
return manifest
def main() -> None:
parser = argparse.ArgumentParser(
description="Collect BigQuery metadata and write to a manifest file",
)
parser.add_argument("--project-id", default=os.getenv("BIGQUERY_PROJECT_ID")) # ← SUBSTITUTE
parser.add_argument("--output-file", default="metadata_output.json")
args = parser.parse_args()
    required = ["project_id"]
    missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
collect(
project_id=args.project_id,
output_file=args.output_file,
)
if __name__ == "__main__":
main()
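
The type mapping above falls back to the uppercased input for types not in the table. A small sketch; the two-entry map here is a subset of `BQ_TYPE_MAP` for illustration:

```python
# Type-mapping sketch: known BigQuery types map to Monte Carlo canonical types,
# unknown ones pass through uppercased, mirroring map_bq_type() above.
BQ_TYPE_MAP = {"INT64": "INTEGER", "STRING": "VARCHAR"}  # subset for illustration

def map_bq_type(bq_type: str) -> str:
    return BQ_TYPE_MAP.get(bq_type.upper(), bq_type.upper())

print(map_bq_type("int64"))     # INTEGER
print(map_bq_type("interval"))  # INTERVAL
```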


@@ -0,0 +1,164 @@
"""
BigQuery — Query Log Collection (collect only)
================================================
Collects completed job query logs from BigQuery job history and writes them to
a JSON manifest file for later push to Monte Carlo.
Can be run standalone via CLI or imported (use the ``collect()`` function).
Substitution points (search for "← SUBSTITUTE"):
- BIGQUERY_PROJECT_ID : GCP project ID to collect query logs from
- GOOGLE_APPLICATION_CREDENTIALS : path to service-account JSON key file
- LOOKBACK_HOURS : how many hours back to collect (default 25, skip last 1 h)
- STATEMENT_TYPE_FILTER : restrict to specific statement types, or leave empty for all
- MAX_JOBS : cap on number of jobs to collect per run
Prerequisites:
pip install google-cloud-bigquery
"""
from __future__ import annotations
import argparse
import json
import logging
import os
from datetime import datetime, timedelta, timezone
from google.cloud import bigquery
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
LOG_TYPE = "bigquery"
# Collect jobs from [now - LOOKBACK_HOURS] to [now - LOOKBACK_LAG_HOURS].
# The lag avoids collecting in-flight jobs that have not yet completed.
LOOKBACK_HOURS: int = int(os.getenv("LOOKBACK_HOURS", "25")) # ← SUBSTITUTE
LOOKBACK_LAG_HOURS: int = int(os.getenv("LOOKBACK_LAG_HOURS", "1")) # ← SUBSTITUTE
# Limit statement types — e.g. ["SELECT", "CREATE_TABLE_AS_SELECT", "INSERT"]
# Set to an empty list to collect all statement types.
STATEMENT_TYPE_FILTER: list[str] = [] # ← SUBSTITUTE
# Maximum number of jobs to collect in a single run to avoid runaway costs
MAX_JOBS: int = int(os.getenv("MAX_JOBS", "10000")) # ← SUBSTITUTE
def _safe_isoformat(dt: datetime | None) -> str | None:
if dt is None:
return None
if dt.tzinfo is None:
dt = dt.replace(tzinfo=timezone.utc)
return dt.isoformat()
def _collect_query_logs(
bq_client: bigquery.Client,
project_id: str,
start_dt: datetime,
end_dt: datetime,
) -> list[dict]:
"""Collect query logs from BigQuery job history and return as a list of dicts."""
entries: list[dict] = []
log.info(
"Listing jobs for project=%s from %s to %s",
project_id, start_dt.isoformat(), end_dt.isoformat(),
)
for job in bq_client.list_jobs(
project=project_id,
all_users=True,
min_creation_time=start_dt,
max_creation_time=end_dt,
):
# Only process query jobs that have SQL text
sql: str = getattr(job, "query", None) or ""
if not sql.strip():
continue
statement_type: str = getattr(job, "statement_type", None) or ""
if STATEMENT_TYPE_FILTER and statement_type not in STATEMENT_TYPE_FILTER:
continue # ← SUBSTITUTE: adjust filter as needed
total_bytes_billed: int | None = getattr(job, "total_bytes_billed", None)
entries.append(
{
"query_id": job.job_id,
"query_text": sql,
"start_time": _safe_isoformat(getattr(job, "created", None)),
"end_time": _safe_isoformat(getattr(job, "ended", None)),
"user": getattr(job, "user_email", None),
"total_bytes_billed": total_bytes_billed,
"statement_type": statement_type or None,
}
)
if len(entries) >= MAX_JOBS:
log.warning("Reached MAX_JOBS=%d — stopping early", MAX_JOBS)
break
return entries
def collect(
project_id: str,
lookback_hours: int = LOOKBACK_HOURS,
lookback_lag_hours: int = LOOKBACK_LAG_HOURS,
output_file: str = "query_logs_output.json",
) -> dict:
"""
Connect to BigQuery, collect query logs, and write a JSON manifest.
Returns the manifest dict.
"""
bq_client = bigquery.Client(project=project_id) # ← SUBSTITUTE: adjust auth if needed
end_dt = datetime.now(timezone.utc) - timedelta(hours=lookback_lag_hours)
start_dt = end_dt - timedelta(hours=lookback_hours)
entries = _collect_query_logs(bq_client, project_id, start_dt, end_dt)
log.info("Collected %d query log entries.", len(entries))
manifest = {
"log_type": LOG_TYPE,
"collected_at": datetime.now(timezone.utc).isoformat(),
"window_start": start_dt.isoformat(),
"window_end": end_dt.isoformat(),
"query_log_count": len(entries),
"queries": entries,
}
with open(output_file, "w") as fh:
json.dump(manifest, fh, indent=2)
log.info("Query log manifest written to %s", output_file)
return manifest
def main() -> None:
parser = argparse.ArgumentParser(
description="Collect BigQuery query logs and write to a manifest file",
)
parser.add_argument("--project-id", default=os.getenv("BIGQUERY_PROJECT_ID")) # ← SUBSTITUTE
parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
parser.add_argument("--lookback-lag-hours", type=int, default=LOOKBACK_LAG_HOURS)
parser.add_argument("--output-file", default="query_logs_output.json")
args = parser.parse_args()
required = ["project_id"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
collect(
project_id=args.project_id,
lookback_hours=args.lookback_hours,
lookback_lag_hours=args.lookback_lag_hours,
output_file=args.output_file,
)
if __name__ == "__main__":
main()

@@ -0,0 +1,198 @@
"""
BigQuery — Lineage Push (push only)
====================================
Reads a manifest file produced by ``collect_lineage.py`` and pushes the lineage
events to Monte Carlo using the pycarlo push ingestion API. Large payloads are
split into batches to stay under the 1 MB compressed limit.
Can be run standalone via CLI or imported (use the ``push()`` function).
Substitution points (search for "← SUBSTITUTE"):
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the BigQuery connection in Monte Carlo
Prerequisites:
pip install pycarlo
"""
from __future__ import annotations
import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
LineageAssetRef,
LineageEvent,
)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
RESOURCE_TYPE = "bigquery"
# Maximum events per batch — conservative default to keep compressed payload under 1 MB
# ← SUBSTITUTE: tune based on average edge complexity (number of sources per event)
_BATCH_SIZE = 500
def _make_ref(database: str, schema: str, table: str) -> LineageAssetRef:
return LineageAssetRef(
type="TABLE",
name=table,
database=database,
schema=schema,
)
def _build_events(edges: list[dict]) -> list[LineageEvent]:
"""Build LineageEvent objects from manifest edge dicts."""
events = []
for edge in edges:
dest = edge["destination"]
sources = edge.get("sources", [])
if not sources:
continue
events.append(
LineageEvent(
destination=_make_ref(dest["database"], dest["schema"], dest["table"]),
sources=[
_make_ref(s["database"], s["schema"], s["table"])
for s in sources
],
)
)
return events
def push(
input_file: str,
resource_uuid: str,
key_id: str,
key_token: str,
batch_size: int = _BATCH_SIZE,
output_file: str = "lineage_push_result.json",
) -> dict:
"""
Read a lineage manifest and push events to Monte Carlo in batches.
Returns a result dict with invocation IDs for each batch.
"""
with open(input_file) as fh:
manifest = json.load(fh)
edges = manifest.get("edges", [])
resource_type = manifest.get("resource_type", RESOURCE_TYPE)
events = _build_events(edges)
log.info("Loaded %d lineage event(s) from %s", len(events), input_file)
if not events:
log.info("No lineage events to push.")
push_result = {
"resource_uuid": resource_uuid,
"resource_type": resource_type,
"invocation_ids": [],
"pushed_at": datetime.now(timezone.utc).isoformat(),
"total_events": 0,
"batch_count": 0,
"batch_size": batch_size,
}
with open(output_file, "w") as fh:
json.dump(push_result, fh, indent=2)
return push_result
# Split into batches
batches = []
for i in range(0, len(events), batch_size):
batches.append(events[i : i + batch_size])
total_batches = len(batches)
def _push_batch(batch: list, batch_num: int) -> str | None:
"""Push a single batch using a dedicated Session (thread-safe)."""
log.info("Pushing batch %d/%d (%d events) ...", batch_num, total_batches, len(batch))
client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
service = IngestionService(mc_client=client)
result = service.send_lineage(
resource_uuid=resource_uuid,
resource_type=resource_type,
events=batch,
)
invocation_id = service.extract_invocation_id(result)
if invocation_id:
log.info(" Batch %d: invocation_id=%s", batch_num, invocation_id)
return invocation_id
# Push batches in parallel (each thread gets its own pycarlo Session)
max_workers = min(4, total_batches)
invocation_ids: list[str | None] = [None] * total_batches
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = {
pool.submit(_push_batch, batch, i + 1): i
for i, batch in enumerate(batches)
}
for future in as_completed(futures):
idx = futures[future]
try:
invocation_ids[idx] = future.result()
except Exception as exc:
log.error("ERROR pushing batch %d: %s", idx + 1, exc)
raise
log.info("All %d batches pushed (%d workers)", total_batches, max_workers)
push_result = {
"resource_uuid": resource_uuid,
"resource_type": resource_type,
"invocation_ids": invocation_ids,
"pushed_at": datetime.now(timezone.utc).isoformat(),
"total_events": len(events),
"batch_count": total_batches,
"batch_size": batch_size,
}
with open(output_file, "w") as fh:
json.dump(push_result, fh, indent=2)
log.info("Push result written to %s", output_file)
return push_result
def main() -> None:
parser = argparse.ArgumentParser(
description="Push BigQuery lineage from a manifest to Monte Carlo",
)
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--input-file", default="lineage_output.json")
parser.add_argument("--output-file", default="lineage_push_result.json")
parser.add_argument(
"--batch-size",
type=int,
default=_BATCH_SIZE,
help=f"Max events per push batch (default: {_BATCH_SIZE})",
)
args = parser.parse_args()
required = ["resource_uuid", "key_id", "key_token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
push(
input_file=args.input_file,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
output_file=args.output_file,
)
if __name__ == "__main__":
main()
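The batch-splitting step inside `push()` is a plain slice loop; its size arithmetic can be checked in isolation (the function name here is illustrative):

```python
def split_batches(events: list, batch_size: int) -> list[list]:
    """Slice a flat event list into consecutive batches of at most batch_size items."""
    return [events[i : i + batch_size] for i in range(0, len(events), batch_size)]

# 1203 events with the default batch size of 500 yield two full batches plus a remainder:
batches = split_batches(list(range(1203)), 500)
```

The last batch carries the remainder, so no event is dropped and no batch exceeds the configured size.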

@@ -0,0 +1,193 @@
"""
BigQuery — Metadata Push (push only)
=====================================
Reads a manifest file produced by ``collect_metadata.py`` and pushes the assets
to Monte Carlo using the pycarlo push ingestion API. Large payloads are split
into batches to stay under the 1 MB compressed limit.
Can be run standalone via CLI or imported (use the ``push()`` function).
Substitution points (search for "← SUBSTITUTE"):
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the BigQuery connection in Monte Carlo
Prerequisites:
pip install pycarlo
"""
from __future__ import annotations
import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
AssetField,
AssetFreshness,
AssetMetadata,
AssetVolume,
RelationalAsset,
)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
RESOURCE_TYPE = "bigquery"
# Maximum assets per batch — conservative default to keep compressed payload under 1 MB
# ← SUBSTITUTE: tune based on average asset size (fields per table, description length, etc.)
_BATCH_SIZE = 500
def _asset_from_dict(d: dict) -> RelationalAsset:
"""Reconstruct a RelationalAsset from a manifest dict entry."""
fields = [
AssetField(
name=f["name"],
type=f.get("type"),
description=f.get("description"),
)
for f in d.get("fields", [])
]
volume = None
if d.get("volume"):
volume = AssetVolume(
row_count=d["volume"].get("row_count"),
byte_count=d["volume"].get("byte_count"),
)
freshness = None
if d.get("freshness"):
freshness = AssetFreshness(
last_update_time=d["freshness"].get("last_update_time"),
)
return RelationalAsset(
type=d.get("type", "TABLE"),
metadata=AssetMetadata(
name=d["name"],
database=d["database"], # ← SUBSTITUTE: use project or dataset as database
schema=d["schema"],
description=d.get("description"),
),
fields=fields,
volume=volume,
freshness=freshness,
)
def push(
input_file: str,
resource_uuid: str,
key_id: str,
key_token: str,
batch_size: int = _BATCH_SIZE,
output_file: str = "metadata_push_result.json",
) -> dict:
"""
Read a metadata manifest and push assets to Monte Carlo in batches.
Returns a result dict with invocation IDs for each batch.
"""
with open(input_file) as fh:
manifest = json.load(fh)
asset_dicts = manifest.get("assets", [])
resource_type = manifest.get("resource_type", RESOURCE_TYPE)
assets = [_asset_from_dict(d) for d in asset_dicts]
log.info("Loaded %d asset(s) from %s", len(assets), input_file)
    # Split into batches. Note: max(len(assets), 1) ensures an empty manifest
    # still yields one (empty) batch, so min(4, total_batches) below never
    # creates a zero-worker thread pool.
    batches = []
    for i in range(0, max(len(assets), 1), batch_size):
        batches.append(assets[i : i + batch_size])
total_batches = len(batches)
def _push_batch(batch: list, batch_num: int) -> str | None:
"""Push a single batch using a dedicated Session (thread-safe)."""
client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
service = IngestionService(mc_client=client)
result = service.send_metadata(
resource_uuid=resource_uuid,
resource_type=resource_type,
events=batch,
)
invocation_id = service.extract_invocation_id(result)
log.info("Pushed batch %d/%d (%d assets) — invocation_id=%s", batch_num, total_batches, len(batch), invocation_id)
return invocation_id
# Push batches in parallel (each thread gets its own pycarlo Session)
max_workers = min(4, total_batches)
invocation_ids: list[str | None] = [None] * total_batches
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = {
pool.submit(_push_batch, batch, i + 1): i
for i, batch in enumerate(batches)
}
for future in as_completed(futures):
idx = futures[future]
try:
invocation_ids[idx] = future.result()
except Exception as exc:
log.error("ERROR pushing batch %d: %s", idx + 1, exc)
raise
log.info("All %d batches pushed (%d workers)", total_batches, max_workers)
push_result = {
"resource_uuid": resource_uuid,
"resource_type": resource_type,
"invocation_ids": invocation_ids,
"pushed_at": datetime.now(timezone.utc).isoformat(),
"total_assets": len(assets),
"batch_count": total_batches,
"batch_size": batch_size,
}
with open(output_file, "w") as fh:
json.dump(push_result, fh, indent=2)
log.info("Push result written to %s", output_file)
return push_result
def main() -> None:
parser = argparse.ArgumentParser(
description="Push BigQuery metadata from a manifest to Monte Carlo",
)
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--input-file", default="metadata_output.json")
parser.add_argument("--output-file", default="metadata_push_result.json")
parser.add_argument(
"--batch-size",
type=int,
default=_BATCH_SIZE,
help=f"Max assets per push batch (default: {_BATCH_SIZE})",
)
args = parser.parse_args()
required = ["resource_uuid", "key_id", "key_token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
push(
input_file=args.input_file,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
output_file=args.output_file,
)
if __name__ == "__main__":
main()
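Unlike the lineage and query-log push scripts, this one has no early return for an empty manifest; instead its `max(len(assets), 1)` bound guarantees at least one batch. A standalone sketch of that edge case (helper name is illustrative):

```python
def split_batches_min_one(assets: list, batch_size: int) -> list[list]:
    """Batch like the push() above, but guarantee at least one (possibly empty)
    batch so min(4, len(batches)) never yields a zero-worker thread pool."""
    return [assets[i : i + batch_size] for i in range(0, max(len(assets), 1), batch_size)]

empty_case = split_batches_min_one([], 500)       # one empty batch, not zero batches
full_case = split_batches_min_one(list(range(7)), 3)
```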

@@ -0,0 +1,207 @@
"""
BigQuery — Query Log Push (push only)
======================================
Reads a manifest file produced by ``collect_query_logs.py`` and pushes the query
log entries to Monte Carlo using the pycarlo push ingestion API. Large payloads
are split into batches to stay under the 1 MB compressed limit.
Can be run standalone via CLI or imported (use the ``push()`` function).
Substitution points (search for "← SUBSTITUTE"):
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the BigQuery connection in Monte Carlo
Prerequisites:
pip install pycarlo
"""
from __future__ import annotations
import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from dateutil.parser import isoparse
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import QueryLogEntry
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
LOG_TYPE = "bigquery"
# Maximum entries per batch. Query logs include full SQL text, so keep batches
# small to stay under the 1 MB compressed payload limit; on active warehouses
# even 50 entries can trigger HTTP 413.
# ← SUBSTITUTE: tune based on average query length
_BATCH_SIZE = 100
# Truncate query_text longer than this to prevent 413 errors.
# Some SQL statements (e.g., generated by BI tools) can be 100KB+ and blow up
# compressed payloads even at small batch sizes.
_MAX_QUERY_TEXT_LEN = 10_000
def _build_query_log_entries(queries: list[dict]) -> list[QueryLogEntry]:
"""Convert manifest query dicts into QueryLogEntry objects."""
entries = []
truncated = 0
for q in queries:
query_text = q.get("query_text") or ""
# Truncate very long SQL to prevent 413 Request Too Large
if len(query_text) > _MAX_QUERY_TEXT_LEN:
query_text = query_text[:_MAX_QUERY_TEXT_LEN] + "... [TRUNCATED]"
truncated += 1
extra = {}
if q.get("total_bytes_billed") is not None:
extra["total_bytes_billed"] = q["total_bytes_billed"]
if q.get("statement_type") is not None:
extra["statement_type"] = q["statement_type"]
start_time = q.get("start_time")
end_time = q.get("end_time")
entry = QueryLogEntry(
query_id=q.get("query_id"),
query_text=query_text,
start_time=isoparse(start_time) if start_time else None,
end_time=isoparse(end_time) if end_time else None,
user=q.get("user"),
extra=extra or None,
)
entries.append(entry)
if truncated:
log.info("Truncated %d query text(s) exceeding %d chars", truncated, _MAX_QUERY_TEXT_LEN)
return entries
def push(
input_file: str,
resource_uuid: str,
key_id: str,
key_token: str,
batch_size: int = _BATCH_SIZE,
output_file: str = "query_logs_push_result.json",
) -> dict:
"""
Read a query log manifest and push entries to Monte Carlo in batches.
Returns a result dict with invocation IDs for each batch.
"""
with open(input_file) as fh:
manifest = json.load(fh)
queries = manifest.get("queries", [])
log_type = manifest.get("log_type", LOG_TYPE)
entries = _build_query_log_entries(queries)
    log.info("Loaded %d query log entries from %s", len(entries), input_file)
if not entries:
log.info("No query log entries to push.")
push_result = {
"resource_uuid": resource_uuid,
"log_type": log_type,
"invocation_ids": [],
"pushed_at": datetime.now(timezone.utc).isoformat(),
"total_entries": 0,
"batch_count": 0,
"batch_size": batch_size,
}
with open(output_file, "w") as fh:
json.dump(push_result, fh, indent=2)
return push_result
# Split into batches
batches = []
for i in range(0, len(entries), batch_size):
batches.append(entries[i : i + batch_size])
total_batches = len(batches)
def _push_batch(batch: list, batch_num: int) -> str | None:
"""Push a single batch using a dedicated Session (thread-safe)."""
client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
service = IngestionService(mc_client=client)
result = service.send_query_logs(
resource_uuid=resource_uuid,
log_type=log_type,
events=batch,
)
invocation_id = service.extract_invocation_id(result)
log.info("Pushed batch %d/%d (%d entries) — invocation_id=%s", batch_num, total_batches, len(batch), invocation_id)
return invocation_id
# Push batches in parallel (each thread gets its own pycarlo Session)
max_workers = min(4, total_batches)
invocation_ids: list[str | None] = [None] * total_batches
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = {
pool.submit(_push_batch, batch, i + 1): i
for i, batch in enumerate(batches)
}
for future in as_completed(futures):
idx = futures[future]
try:
invocation_ids[idx] = future.result()
except Exception as exc:
log.error("ERROR pushing batch %d: %s", idx + 1, exc)
raise
log.info("All %d batches pushed (%d workers)", total_batches, max_workers)
push_result = {
"resource_uuid": resource_uuid,
"log_type": log_type,
"invocation_ids": invocation_ids,
"pushed_at": datetime.now(timezone.utc).isoformat(),
"total_entries": len(entries),
"batch_count": total_batches,
"batch_size": batch_size,
}
with open(output_file, "w") as fh:
json.dump(push_result, fh, indent=2)
log.info("Push result written to %s", output_file)
return push_result
def main() -> None:
parser = argparse.ArgumentParser(
description="Push BigQuery query logs from a manifest to Monte Carlo",
)
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--input-file", default="query_logs_output.json")
parser.add_argument("--output-file", default="query_logs_push_result.json")
parser.add_argument(
"--batch-size",
type=int,
default=_BATCH_SIZE,
help=f"Max entries per push batch (default: {_BATCH_SIZE})",
)
args = parser.parse_args()
required = ["resource_uuid", "key_id", "key_token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
push(
input_file=args.input_file,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
output_file=args.output_file,
)
if __name__ == "__main__":
main()
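The query-text truncation guard in `_build_query_log_entries` is a simple length cap; sketched standalone here, with the limit mirroring `_MAX_QUERY_TEXT_LEN` above:

```python
MAX_QUERY_TEXT_LEN = 10_000  # mirrors _MAX_QUERY_TEXT_LEN in the script above

def truncate_query_text(query_text: str, limit: int = MAX_QUERY_TEXT_LEN) -> str:
    """Cap oversized SQL so the compressed push payload stays under the size limit."""
    if len(query_text) > limit:
        return query_text[:limit] + "... [TRUNCATED]"
    return query_text

short_sql = truncate_query_text("SELECT 1")   # unchanged
long_sql = truncate_query_text("x" * 20_000)  # cut to the limit plus a marker
```

The visible `[TRUNCATED]` marker makes clipped statements easy to spot downstream without inflating the payload.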

@@ -0,0 +1,83 @@
"""
Databricks — Lineage Collect & Push (combined)
================================================
Collects table-level and (optionally) column-level lineage from Databricks Unity
Catalog system tables, then pushes them to Monte Carlo via the push ingestion API.
This script imports and calls collect() from collect_lineage and push() from
push_lineage, running both in sequence.
Substitution points (search for "← SUBSTITUTE"):
- DATABRICKS_HOST : workspace hostname
- DATABRICKS_HTTP_PATH : SQL warehouse HTTP path
- DATABRICKS_TOKEN : PAT or service-principal secret
- LOOKBACK_DAYS : how many days back to collect lineage (default 30)
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Databricks connection in Monte Carlo
- PUSH_BATCH_SIZE : number of events per API call (default 500)
Use the --column-lineage flag to also push column-level lineage (disabled by default).
Prerequisites:
pip install databricks-sql-connector pycarlo
"""
from __future__ import annotations
import argparse
import logging
import os
from collect_lineage import LOOKBACK_DAYS, collect
from push_lineage import DEFAULT_BATCH_SIZE, push
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
def main() -> None:
parser = argparse.ArgumentParser(description="Collect and push Databricks lineage to Monte Carlo")
parser.add_argument("--host", default=os.getenv("DATABRICKS_HOST")) # ← SUBSTITUTE
parser.add_argument("--http-path", default=os.getenv("DATABRICKS_HTTP_PATH")) # ← SUBSTITUTE
parser.add_argument("--token", default=os.getenv("DATABRICKS_TOKEN")) # ← SUBSTITUTE
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--lookback-days", type=int, default=LOOKBACK_DAYS)
parser.add_argument(
"--column-lineage", action="store_true",
help="Also collect column-level lineage (requires system.access.column_lineage access)",
)
parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
parser.add_argument("--manifest", default="manifest_lineage.json")
args = parser.parse_args()
required = ["host", "http_path", "token", "resource_uuid", "key_id", "key_token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
log.info("Step 1: Collecting lineage …")
collect(
host=args.host,
http_path=args.http_path,
token=args.token,
manifest_path=args.manifest,
include_column_lineage=args.column_lineage,
lookback_days=args.lookback_days,
)
log.info("Step 2: Pushing lineage to Monte Carlo …")
push(
manifest_path=args.manifest,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
)
log.info("Done — collect and push complete.")
if __name__ == "__main__":
main()
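All of these CLIs share the same validation pattern: flags default to environment variables, and anything still `None` after parsing is fatal. A minimal sketch (the argument names and example values here are illustrative):

```python
import argparse
import os

def parse_required(argv: list[str]) -> argparse.Namespace:
    """CLI flags win; env vars are the fallback; missing values abort via parser.error."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", default=os.getenv("DATABRICKS_HOST"))
    parser.add_argument("--token", default=os.getenv("DATABRICKS_TOKEN"))
    args = parser.parse_args(argv)
    missing = [k for k in ("host", "token") if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")
    return args

args = parse_required(["--host", "adb-example.azuredatabricks.net", "--token", "dapi-example"])
```

Because `parser.error` exits with a usage message, callers never see a half-configured run.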

@@ -0,0 +1,77 @@
"""
Databricks — Metadata Collect & Push (combined)
=================================================
Collects table schemas, row counts, and byte sizes from Databricks Unity Catalog,
then pushes them to Monte Carlo via the push ingestion API.
This script imports and calls collect() from collect_metadata and push() from
push_metadata, running both in sequence.
Substitution points (search for "← SUBSTITUTE"):
- DATABRICKS_HOST : workspace hostname (e.g. adb-1234.azuredatabricks.net)
- DATABRICKS_HTTP_PATH : SQL warehouse HTTP path (e.g. /sql/1.0/warehouses/abc123)
- DATABRICKS_TOKEN : personal access token or service-principal secret
- DATABRICKS_CATALOG : catalog to collect from (default: "hive_metastore" or "main")
- SCHEMA_EXCLUSIONS : schemas to skip
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Databricks connection in Monte Carlo
- PUSH_BATCH_SIZE : number of assets per API call (default 500)
Prerequisites:
pip install databricks-sql-connector pycarlo
"""
from __future__ import annotations
import argparse
import logging
import os
from collect_metadata import collect
from push_metadata import DEFAULT_BATCH_SIZE, push
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
def main() -> None:
parser = argparse.ArgumentParser(description="Collect and push Databricks metadata to Monte Carlo")
parser.add_argument("--host", default=os.getenv("DATABRICKS_HOST")) # ← SUBSTITUTE
parser.add_argument("--http-path", default=os.getenv("DATABRICKS_HTTP_PATH")) # ← SUBSTITUTE
parser.add_argument("--token", default=os.getenv("DATABRICKS_TOKEN")) # ← SUBSTITUTE
parser.add_argument("--catalog", default=os.getenv("DATABRICKS_CATALOG", "hive_metastore"))
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
parser.add_argument("--manifest", default="manifest_metadata.json")
args = parser.parse_args()
required = ["host", "http_path", "token", "resource_uuid", "key_id", "key_token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
log.info("Step 1: Collecting metadata …")
collect(
host=args.host,
http_path=args.http_path,
token=args.token,
catalog=args.catalog,
manifest_path=args.manifest,
)
log.info("Step 2: Pushing metadata to Monte Carlo …")
push(
manifest_path=args.manifest,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
)
log.info("Done — collect and push complete.")
if __name__ == "__main__":
main()

@@ -0,0 +1,83 @@
"""
Databricks — Query Log Collect & Push (combined)
==================================================
Collects finished query execution records from the Databricks system table
system.query.history and pushes them to Monte Carlo for query-pattern analysis,
lineage derivation, and usage attribution.
This script imports and calls collect() from collect_query_logs and push() from
push_query_logs, running both in sequence.
Substitution points (search for "← SUBSTITUTE"):
- DATABRICKS_HOST : workspace hostname
- DATABRICKS_HTTP_PATH : SQL warehouse HTTP path
- DATABRICKS_TOKEN : PAT or service-principal secret
- LOOKBACK_HOURS : hours back from [now - LAG_HOURS] to collect (default 25)
- LOOKBACK_LAG_HOURS : hours to lag behind now to avoid in-flight queries (default 1)
- MAX_ROWS : maximum query rows to collect per run (default 10000)
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Databricks connection in Monte Carlo
- PUSH_BATCH_SIZE : number of entries per API call (default 250)
Prerequisites:
pip install databricks-sql-connector pycarlo
"""
from __future__ import annotations
import argparse
import logging
import os
from collect_query_logs import LOOKBACK_HOURS, LOOKBACK_LAG_HOURS, MAX_ROWS, collect
from push_query_logs import DEFAULT_BATCH_SIZE, push
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
def main() -> None:
parser = argparse.ArgumentParser(description="Collect and push Databricks query logs to Monte Carlo")
parser.add_argument("--host", default=os.getenv("DATABRICKS_HOST")) # ← SUBSTITUTE
parser.add_argument("--http-path", default=os.getenv("DATABRICKS_HTTP_PATH")) # ← SUBSTITUTE
parser.add_argument("--token", default=os.getenv("DATABRICKS_TOKEN")) # ← SUBSTITUTE
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
parser.add_argument("--lookback-lag-hours", type=int, default=LOOKBACK_LAG_HOURS)
parser.add_argument("--max-rows", type=int, default=MAX_ROWS)
parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
parser.add_argument("--manifest", default="manifest_query_logs.json")
args = parser.parse_args()
required = ["host", "http_path", "token", "resource_uuid", "key_id", "key_token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
log.info("Step 1: Collecting query logs …")
collect(
host=args.host,
http_path=args.http_path,
token=args.token,
manifest_path=args.manifest,
lookback_hours=args.lookback_hours,
lookback_lag_hours=args.lookback_lag_hours,
max_rows=args.max_rows,
)
log.info("Step 2: Pushing query logs to Monte Carlo …")
push(
manifest_path=args.manifest,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
)
log.info("Done — collect and push complete.")
if __name__ == "__main__":
main()

@@ -0,0 +1,240 @@
"""
Databricks — Lineage Collection (collect-only)
================================================
Collects table-level and (optionally) column-level lineage from Databricks Unity
Catalog system tables (system.access.table_lineage and system.access.column_lineage).
No SQL parsing required — Databricks provides first-class lineage metadata.
Writes a JSON manifest file that can be consumed by push_lineage.py.
Substitution points (search for "← SUBSTITUTE"):
- DATABRICKS_HOST : workspace hostname
- DATABRICKS_HTTP_PATH : SQL warehouse HTTP path
- DATABRICKS_TOKEN : PAT or service-principal secret
- LOOKBACK_DAYS : how many days back to collect lineage (default 30)
Use the --column-lineage flag to also collect column-level lineage (disabled by default).
Prerequisites:
pip install databricks-sql-connector
"""
from __future__ import annotations
import argparse
import json
import logging
import os
from datetime import datetime, timezone
from typing import Any
from databricks import sql
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
RESOURCE_TYPE = "databricks"
LOOKBACK_DAYS: int = int(os.getenv("LOOKBACK_DAYS", "30")) # ← SUBSTITUTE
def _check_available_memory(min_gb: float = 2.0) -> None:
"""Warn if available memory is below the threshold."""
try:
if hasattr(os, "sysconf"): # Linux / macOS
page_size = os.sysconf("SC_PAGE_SIZE")
avail_pages = os.sysconf("SC_AVPHYS_PAGES")
avail_gb = (page_size * avail_pages) / (1024 ** 3)
else:
return # Windows — skip check
except (ValueError, OSError):
return
if avail_gb < min_gb:
log.warning(
"Only %.1f GB of memory available (minimum recommended: %.1f GB). "
"Consider reducing the collection scope or increasing available memory.",
avail_gb,
min_gb,
)
def _query(cursor: Any, sql_text: str) -> list[dict[str, Any]]:
cursor.execute(sql_text)
cols = [d[0] for d in cursor.description]
rows = []
while True:
chunk = cursor.fetchmany(1000)
if not chunk:
break
rows.extend(dict(zip(cols, row)) for row in chunk)
return rows
def _parse_full_name(full_name: str) -> tuple[str, str, str]:
"""Split 'catalog.schema.table' into (catalog, schema, table)."""
parts = (full_name or "").split(".")
if len(parts) == 3:
return parts[0], parts[1], parts[2]
if len(parts) == 2:
return "", parts[0], parts[1]
return "", "", full_name
def collect_table_lineage(cursor: Any, lookback_days: int) -> list[dict[str, Any]]:
rows = _query(
cursor,
f"""
        SELECT
source_table_full_name,
target_table_full_name,
created_by,
MAX(event_time) AS last_seen
FROM system.access.table_lineage
WHERE event_time >= DATEADD(DAY, -{lookback_days}, CURRENT_TIMESTAMP())
AND source_table_full_name IS NOT NULL
AND target_table_full_name IS NOT NULL
GROUP BY source_table_full_name, target_table_full_name, created_by
LIMIT 50000
""", # ← SUBSTITUTE: adjust lookback_days, LIMIT, or add catalog/schema filters
)
events: list[dict[str, Any]] = []
for row in rows:
src_catalog, src_schema, src_table = _parse_full_name(row["source_table_full_name"])
dst_catalog, dst_schema, dst_table = _parse_full_name(row["target_table_full_name"])
if not src_table or not dst_table:
continue
events.append({
"sources": [{"database": src_catalog, "schema": src_schema, "asset_name": src_table}],
"destination": {"database": dst_catalog, "schema": dst_schema, "asset_name": dst_table},
"lineage_type": "table",
})
return events
def collect_column_lineage(cursor: Any, lookback_days: int) -> list[dict[str, Any]]:
rows = _query(
cursor,
f"""
SELECT DISTINCT
source_table_full_name,
source_column_name,
target_table_full_name,
target_column_name
FROM system.access.column_lineage
WHERE event_time >= DATEADD(DAY, -{lookback_days}, CURRENT_TIMESTAMP())
AND source_table_full_name IS NOT NULL
AND target_table_full_name IS NOT NULL
LIMIT 50000
""", # ← SUBSTITUTE: adjust LIMIT or add catalog/schema filters if needed
)
# Group by destination table so we can build one event per destination
grouped: dict[str, dict[str, Any]] = {}
for row in rows:
dst_key = row["target_table_full_name"]
if dst_key not in grouped:
grouped[dst_key] = {"dst_full": dst_key, "columns": []}
grouped[dst_key]["columns"].append(row)
events: list[dict[str, Any]] = []
for dst_key, group in grouped.items():
dst_catalog, dst_schema, dst_table = _parse_full_name(group["dst_full"])
if not dst_table:
continue
col_fields: list[dict[str, Any]] = []
for row in group["columns"]:
src_catalog, src_schema, src_table = _parse_full_name(row["source_table_full_name"])
col_fields.append({
"destination_field": row["target_column_name"],
"sources": [{
"database": src_catalog,
"schema": src_schema,
"asset_name": src_table,
"field": row["source_column_name"],
}],
})
events.append({
"sources": [], # column lineage carries source refs inside col_fields
"destination": {"database": dst_catalog, "schema": dst_schema, "asset_name": dst_table},
"column_lineage": col_fields,
"lineage_type": "column",
})
return events
def collect(
host: str,
http_path: str,
token: str,
manifest_path: str = "manifest_lineage.json",
include_column_lineage: bool = False,
lookback_days: int = LOOKBACK_DAYS,
) -> list[dict[str, Any]]:
"""Connect to Databricks, collect lineage, write a JSON manifest, and return events."""
_check_available_memory(min_gb=2.0)
collected_at = datetime.now(timezone.utc).isoformat()
with sql.connect(
server_hostname=host, # ← SUBSTITUTE
http_path=http_path, # ← SUBSTITUTE
access_token=token, # ← SUBSTITUTE
) as conn:
with conn.cursor() as cursor:
table_events = collect_table_lineage(cursor, lookback_days)
col_events = collect_column_lineage(cursor, lookback_days) if include_column_lineage else []
all_events = table_events + col_events
log.info(
"Collected %d lineage events (%d table, %d column)",
len(all_events), len(table_events), len(col_events),
)
manifest = {
"resource_type": RESOURCE_TYPE,
"collected_at": collected_at,
"lookback_days": lookback_days,
"table_lineage_events": len(table_events),
"column_lineage_events": len(col_events),
"events": all_events,
}
with open(manifest_path, "w") as fh:
json.dump(manifest, fh, indent=2)
log.info("Manifest written to %s (%d events)", manifest_path, len(all_events))
return all_events
def main() -> None:
parser = argparse.ArgumentParser(description="Collect Databricks lineage to a manifest file")
parser.add_argument("--host", default=os.getenv("DATABRICKS_HOST")) # ← SUBSTITUTE
parser.add_argument("--http-path", default=os.getenv("DATABRICKS_HTTP_PATH")) # ← SUBSTITUTE
parser.add_argument("--token", default=os.getenv("DATABRICKS_TOKEN")) # ← SUBSTITUTE
parser.add_argument("--lookback-days", type=int, default=LOOKBACK_DAYS)
parser.add_argument(
"--column-lineage", action="store_true",
help="Also collect column-level lineage (requires system.access.column_lineage access)",
)
parser.add_argument("--manifest", default="manifest_lineage.json")
args = parser.parse_args()
required = ["host", "http_path", "token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
collect(
host=args.host,
http_path=args.http_path,
token=args.token,
manifest_path=args.manifest,
include_column_lineage=args.column_lineage,
lookback_days=args.lookback_days,
)
if __name__ == "__main__":
main()

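For reference, a minimal sketch of the table-level event shape that collect_table_lineage() emits into the manifest — the catalog, schema, and table names below are illustrative, not from a real workspace:

```python
import json

# One table-level lineage event, shaped like the dicts collect() serialises.
event = {
    "sources": [{"database": "main", "schema": "raw", "asset_name": "orders"}],
    "destination": {"database": "main", "schema": "analytics", "asset_name": "orders_daily"},
    "lineage_type": "table",
}

# The manifest wraps a list of such events plus bookkeeping fields.
manifest = {"resource_type": "databricks", "events": [event]}
serialized = json.dumps(manifest, indent=2)
```

Column-level events use the same envelope but carry their source references inside `column_lineage` instead of `sources`.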

@@ -0,0 +1,212 @@
"""
Databricks — Metadata Collection (collect-only)
=================================================
Collects table schemas, row counts, and byte sizes from Databricks Unity Catalog
using INFORMATION_SCHEMA and DESCRIBE DETAIL, then writes a JSON manifest file
that can be consumed by push_metadata.py.
Substitution points (search for "← SUBSTITUTE"):
- DATABRICKS_HOST : workspace hostname (e.g. adb-1234.azuredatabricks.net)
- DATABRICKS_HTTP_PATH : SQL warehouse HTTP path (e.g. /sql/1.0/warehouses/abc123)
- DATABRICKS_TOKEN : personal access token or service-principal secret
- DATABRICKS_CATALOG : catalog to collect from (default: "hive_metastore" or "main")
- SCHEMA_EXCLUSIONS : schemas to skip
Prerequisites:
pip install databricks-sql-connector
"""
from __future__ import annotations
import argparse
import json
import logging
import os
from datetime import datetime, timezone
from typing import Any
from databricks import sql
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
RESOURCE_TYPE = "databricks"
# Schemas to skip across all catalogs
SCHEMA_EXCLUSIONS: set[str] = { # ← SUBSTITUTE: add any internal schemas to skip
"information_schema",
"__databricks_internal",
}
def _check_available_memory(min_gb: float = 2.0) -> None:
"""Warn if available memory is below the threshold."""
try:
if hasattr(os, "sysconf"): # Linux / macOS
page_size = os.sysconf("SC_PAGE_SIZE")
avail_pages = os.sysconf("SC_AVPHYS_PAGES")
avail_gb = (page_size * avail_pages) / (1024 ** 3)
else:
return # Windows — skip check
except (ValueError, OSError):
return
if avail_gb < min_gb:
log.warning(
"Only %.1f GB of memory available (minimum recommended: %.1f GB). "
"Consider reducing the collection scope or increasing available memory.",
avail_gb,
min_gb,
)
def _query(cursor: Any, sql_text: str, params: tuple | None = None) -> list[dict[str, Any]]:
cursor.execute(sql_text, params)
cols = [d[0] for d in cursor.description]
rows = []
while True:
chunk = cursor.fetchmany(1000)
if not chunk:
break
rows.extend(dict(zip(cols, row)) for row in chunk)
return rows
def collect_tables(cursor: Any, catalog: str) -> list[dict[str, Any]]:
return _query(
cursor,
f"""
SELECT table_catalog, table_schema, table_name, table_type, comment
FROM {catalog}.information_schema.tables
WHERE table_schema NOT IN ({", ".join(f"'{s}'" for s in SCHEMA_EXCLUSIONS)})
ORDER BY table_schema, table_name
""", # ← SUBSTITUTE: add additional WHERE filters if needed
)
def collect_columns(cursor: Any, catalog: str, schema: str, table: str) -> list[dict[str, Any]]:
return _query(
cursor,
f"""
SELECT column_name, data_type, comment
FROM {catalog}.information_schema.columns
WHERE table_schema = '{schema}' AND table_name = '{table}'
ORDER BY ordinal_position
""",
)
def collect_detail(cursor: Any, catalog: str, schema: str, table: str) -> dict[str, Any] | None:
try:
rows = _query(cursor, f"DESCRIBE DETAIL `{catalog}`.`{schema}`.`{table}`")
return rows[0] if rows else None
except Exception:
log.debug("DESCRIBE DETAIL failed for %s.%s.%s", catalog, schema, table, exc_info=True)
return None
def collect(
host: str,
http_path: str,
token: str,
catalog: str,
manifest_path: str = "manifest_metadata.json",
) -> list[dict[str, Any]]:
"""Connect to Databricks, collect metadata, write a JSON manifest, and return the asset dicts.
The manifest contains serialised asset dicts that push_metadata.py can read.
"""
_check_available_memory(min_gb=2.0)
collected_at = datetime.now(timezone.utc).isoformat()
assets: list[dict[str, Any]] = []
with sql.connect(
server_hostname=host, # ← SUBSTITUTE
http_path=http_path, # ← SUBSTITUTE
access_token=token, # ← SUBSTITUTE
) as conn:
with conn.cursor() as cursor:
tables = collect_tables(cursor, catalog)
log.info("Found %d tables in catalog %s", len(tables), catalog)
for row in tables:
schema = row["table_schema"]
table_name = row["table_name"]
columns = collect_columns(cursor, catalog, schema, table_name)
fields = [
{
"name": col["column_name"],
"type": col["data_type"].upper(),
"description": col.get("comment") or None,
}
for col in columns
]
detail = collect_detail(cursor, catalog, schema, table_name)
row_count: int | None = None
byte_count: int | None = None
last_updated: str | None = None
if detail:
row_count = detail.get("numRows")
byte_count = detail.get("sizeInBytes")
last_modified = detail.get("lastModified")
if last_modified:
last_updated = (
last_modified.isoformat()
if hasattr(last_modified, "isoformat")
else str(last_modified)
)
asset = {
"asset_name": table_name,
"database": catalog, # ← SUBSTITUTE: use catalog as database
"schema": schema,
"asset_type": "VIEW" if row.get("table_type", "").upper() == "VIEW" else "TABLE",
"description": row.get("comment") or None,
"fields": fields,
"row_count": row_count,
"byte_count": byte_count,
"last_updated": last_updated,
}
assets.append(asset)
log.info("Collected %s.%s.%s", catalog, schema, table_name)
manifest = {
"resource_type": RESOURCE_TYPE,
"collected_at": collected_at,
"catalog": catalog,
"asset_count": len(assets),
"assets": assets,
}
with open(manifest_path, "w") as fh:
json.dump(manifest, fh, indent=2)
log.info("Manifest written to %s (%d assets)", manifest_path, len(assets))
return assets
def main() -> None:
parser = argparse.ArgumentParser(description="Collect Databricks metadata to a manifest file")
parser.add_argument("--host", default=os.getenv("DATABRICKS_HOST")) # ← SUBSTITUTE
parser.add_argument("--http-path", default=os.getenv("DATABRICKS_HTTP_PATH")) # ← SUBSTITUTE
parser.add_argument("--token", default=os.getenv("DATABRICKS_TOKEN")) # ← SUBSTITUTE
parser.add_argument("--catalog", default=os.getenv("DATABRICKS_CATALOG", "hive_metastore"))
parser.add_argument("--manifest", default="manifest_metadata.json")
args = parser.parse_args()
required = ["host", "http_path", "token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
collect(
host=args.host,
http_path=args.http_path,
token=args.token,
catalog=args.catalog,
manifest_path=args.manifest,
)
if __name__ == "__main__":
main()

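As a sketch of how a DESCRIBE DETAIL row feeds the asset dict — the detail values here are made up; in the script they come from collect_detail():

```python
# Hypothetical DESCRIBE DETAIL result for one Delta table.
detail = {"numRows": 1200, "sizeInBytes": 48_000, "lastModified": None}

# Mapped the same way collect() builds its asset dicts.
asset = {
    "asset_name": "orders",
    "database": "main",   # catalog doubles as the database field
    "schema": "raw",
    "asset_type": "TABLE",
    "row_count": detail.get("numRows"),
    "byte_count": detail.get("sizeInBytes"),
    "last_updated": None,  # lastModified was absent, so freshness stays unset
}
```

Tables where DESCRIBE DETAIL fails (e.g. views or permission errors) simply end up with all three of these fields set to None.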

@@ -0,0 +1,204 @@
"""
Databricks — Query Log Collection (collect-only)
==================================================
Collects finished query execution records from the Databricks system table
system.query.history and writes a JSON manifest file that can be consumed
by push_query_logs.py.
Substitution points (search for "← SUBSTITUTE"):
- DATABRICKS_HOST : workspace hostname
- DATABRICKS_HTTP_PATH : SQL warehouse HTTP path
- DATABRICKS_TOKEN : PAT or service-principal secret
- LOOKBACK_HOURS : hours back from [now - LAG_HOURS] to collect (default 25)
- LOOKBACK_LAG_HOURS : hours to lag behind now to avoid in-flight queries (default 1)
- MAX_ROWS : maximum query rows to collect per run (default 10000)
Prerequisites:
pip install databricks-sql-connector
"""
from __future__ import annotations
import argparse
import json
import logging
import os
from datetime import datetime, timezone
from typing import Any
from databricks import sql
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
LOG_TYPE = "databricks"
LOOKBACK_HOURS: int = int(os.getenv("LOOKBACK_HOURS", "25")) # ← SUBSTITUTE
LOOKBACK_LAG_HOURS: int = int(os.getenv("LOOKBACK_LAG_HOURS", "1")) # ← SUBSTITUTE
MAX_ROWS: int = int(os.getenv("MAX_ROWS", "10000")) # ← SUBSTITUTE
_QUERY_LOG_SQL = """\
SELECT
statement_id AS query_id,
statement_text AS query_text,
start_time,
end_time,
executed_by AS user_name,
produced_rows AS returned_rows,
total_task_duration_ms,
read_rows,
read_bytes
FROM system.query.history
WHERE start_time >= DATEADD(HOUR, -{lookback_hours}, NOW())
AND start_time < DATEADD(HOUR, -{lag_hours}, NOW())
AND status = 'FINISHED'
ORDER BY start_time
LIMIT {max_rows}
""" # ← SUBSTITUTE: adjust status filter or add warehouse_id filter as needed
def _check_available_memory(min_gb: float = 2.0) -> None:
"""Warn if available memory is below the threshold."""
try:
if hasattr(os, "sysconf"): # Linux / macOS
page_size = os.sysconf("SC_PAGE_SIZE")
avail_pages = os.sysconf("SC_AVPHYS_PAGES")
avail_gb = (page_size * avail_pages) / (1024 ** 3)
else:
return # Windows — skip check
except (ValueError, OSError):
return
if avail_gb < min_gb:
log.warning(
"Only %.1f GB of memory available (minimum recommended: %.1f GB). "
"Consider reducing the collection scope or increasing available memory.",
avail_gb,
min_gb,
)
def _safe_isoformat(dt: Any) -> str | None:
if dt is None:
return None
if hasattr(dt, "isoformat"):
if dt.tzinfo is None:
dt = dt.replace(tzinfo=timezone.utc)
return dt.isoformat()
return str(dt)
def _query(cursor: Any, sql_text: str) -> list[dict[str, Any]]:
cursor.execute(sql_text)
cols = [d[0] for d in cursor.description]
rows = []
while True:
chunk = cursor.fetchmany(1000)
if not chunk:
break
rows.extend(dict(zip(cols, row)) for row in chunk)
return rows
def collect_query_logs(
cursor: Any,
lookback_hours: int,
lag_hours: int,
max_rows: int,
) -> list[dict[str, Any]]:
rendered_sql = _QUERY_LOG_SQL.format(
lookback_hours=lookback_hours + lag_hours, # offset from NOW() to cover the window
lag_hours=lag_hours,
max_rows=max_rows,
)
rows = _query(cursor, rendered_sql)
log.info("Retrieved %d query log rows from system.query.history", len(rows))
entries: list[dict[str, Any]] = []
for row in rows:
query_text: str = row.get("query_text") or ""
if not query_text.strip():
continue # ← SUBSTITUTE: decide whether to skip empty-text rows
entry = {
"query_id": row.get("query_id"),
"query_text": query_text,
"start_time": _safe_isoformat(row.get("start_time")),
"end_time": _safe_isoformat(row.get("end_time")),
"user": row.get("user_name"),
"returned_rows": row.get("returned_rows"),
"total_task_duration_ms": row.get("total_task_duration_ms"),
"read_rows": row.get("read_rows"),
"read_bytes": row.get("read_bytes"),
}
entries.append(entry)
return entries
def collect(
host: str,
http_path: str,
token: str,
manifest_path: str = "manifest_query_logs.json",
lookback_hours: int = LOOKBACK_HOURS,
lookback_lag_hours: int = LOOKBACK_LAG_HOURS,
max_rows: int = MAX_ROWS,
) -> list[dict[str, Any]]:
"""Connect to Databricks, collect query logs, write a JSON manifest, and return entries."""
_check_available_memory(min_gb=2.0)
collected_at = datetime.now(timezone.utc).isoformat()
with sql.connect(
server_hostname=host, # ← SUBSTITUTE
http_path=http_path, # ← SUBSTITUTE
access_token=token, # ← SUBSTITUTE
) as conn:
with conn.cursor() as cursor:
entries = collect_query_logs(cursor, lookback_hours, lookback_lag_hours, max_rows)
log.info("Collected %d query log entries", len(entries))
manifest = {
"log_type": LOG_TYPE,
"collected_at": collected_at,
"lookback_hours": lookback_hours,
"lookback_lag_hours": lookback_lag_hours,
"query_log_count": len(entries),
"entries": entries,
}
with open(manifest_path, "w") as fh:
json.dump(manifest, fh, indent=2)
log.info("Manifest written to %s (%d entries)", manifest_path, len(entries))
return entries
def main() -> None:
parser = argparse.ArgumentParser(description="Collect Databricks query logs to a manifest file")
parser.add_argument("--host", default=os.getenv("DATABRICKS_HOST")) # ← SUBSTITUTE
parser.add_argument("--http-path", default=os.getenv("DATABRICKS_HTTP_PATH")) # ← SUBSTITUTE
parser.add_argument("--token", default=os.getenv("DATABRICKS_TOKEN")) # ← SUBSTITUTE
parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
parser.add_argument("--lookback-lag-hours", type=int, default=LOOKBACK_LAG_HOURS)
parser.add_argument("--max-rows", type=int, default=MAX_ROWS)
parser.add_argument("--manifest", default="manifest_query_logs.json")
args = parser.parse_args()
required = ["host", "http_path", "token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
collect(
host=args.host,
http_path=args.http_path,
token=args.token,
manifest_path=args.manifest,
lookback_hours=args.lookback_hours,
lookback_lag_hours=args.lookback_lag_hours,
max_rows=args.max_rows,
)
if __name__ == "__main__":
main()

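The lookback arithmetic in collect_query_logs() can be sketched in isolation: the SQL offset is widened by the lag, so the collected window is [now - (lookback + lag), now - lag):

```python
lookback_hours, lag_hours = 25, 1            # the script's defaults
sql_lookback = lookback_hours + lag_hours    # value rendered into the SQL template

window_start_offset = -sql_lookback          # hours relative to NOW()
window_end_offset = -lag_hours
window_width = window_end_offset - window_start_offset  # hours actually covered
```

The lag keeps in-flight queries out of the window; the width still matches the requested lookback.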

@@ -0,0 +1,192 @@
"""
Databricks — Lineage Push (push-only)
=======================================
Reads a JSON manifest file produced by collect_lineage.py and pushes the lineage
events to Monte Carlo via the push ingestion API, with configurable batching to
keep compressed payloads under 1 MB.
Substitution points (search for "← SUBSTITUTE"):
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Databricks connection in Monte Carlo
- PUSH_BATCH_SIZE : number of events per API call (default 500)
Prerequisites:
pip install pycarlo
"""
from __future__ import annotations
import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from typing import Any
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
ColumnLineageField,
ColumnLineageSourceField,
LineageAssetRef,
LineageEvent,
)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
RESOURCE_TYPE = "databricks"
DEFAULT_BATCH_SIZE = 500 # ← SUBSTITUTE: conservative default to stay under 1 MB compressed
def _ref_from_dict(d: dict[str, Any]) -> LineageAssetRef:
database = d.get("database", "")
schema = d.get("schema", "")
name = d["asset_name"]
return LineageAssetRef(
type="TABLE",
name=name,
database=database,
schema=schema,
asset_id=f"{database}__{schema}__{name}",
)
def _event_from_dict(d: dict[str, Any]) -> LineageEvent:
"""Reconstruct a LineageEvent from a manifest dict."""
sources = [_ref_from_dict(s) for s in d.get("sources", [])]
destination = _ref_from_dict(d["destination"])
fields: list[ColumnLineageField] | None = None
if d.get("column_lineage"):
fields = []
for cl in d["column_lineage"]:
src_fields = []
for s in cl.get("sources", []):
asset_id = f"{s.get('database', '')}__{s.get('schema', '')}__{s['asset_name']}"
src_fields.append(
ColumnLineageSourceField(
asset_id=asset_id,
field_name=s["field"],
)
)
fields.append(
ColumnLineageField(
name=cl["destination_field"],
source_fields=src_fields,
)
)
return LineageEvent(
sources=sources,
destination=destination,
fields=fields,
)
def push(
manifest_path: str,
resource_uuid: str,
key_id: str,
key_token: str,
batch_size: int = DEFAULT_BATCH_SIZE,
) -> dict[str, Any]:
"""Read a collect manifest and push lineage events to Monte Carlo in batches.
Returns a summary dict with invocation IDs and counts.
"""
with open(manifest_path) as fh:
manifest = json.load(fh)
event_dicts: list[dict[str, Any]] = manifest["events"]
events = [_event_from_dict(d) for d in event_dicts]
log.info("Loaded %d lineage events from %s", len(events), manifest_path)
# Split into batches
batches = []
for i in range(0, len(events), batch_size):
batches.append(events[i : i + batch_size])
total_batches = len(batches)
def _push_batch(batch: list, batch_num: int) -> str | None:
"""Push a single batch using a dedicated Session (thread-safe)."""
log.info("Pushing batch %d/%d (%d events) ...", batch_num, total_batches, len(batch))
client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
service = IngestionService(mc_client=client)
result = service.send_lineage(
resource_uuid=resource_uuid,
resource_type=RESOURCE_TYPE,
events=batch,
)
invocation_id = service.extract_invocation_id(result)
if invocation_id:
log.info("Batch %d: invocation_id=%s", batch_num, invocation_id)
return invocation_id
# Push batches in parallel (each thread gets its own pycarlo Session)
    max_workers = max(1, min(4, total_batches))  # at least 1 so an empty manifest cannot crash the pool
invocation_ids: list[str | None] = [None] * total_batches
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = {
pool.submit(_push_batch, batch, i + 1): i
for i, batch in enumerate(batches)
}
for future in as_completed(futures):
idx = futures[future]
try:
invocation_ids[idx] = future.result()
except Exception as exc:
log.error("ERROR pushing batch %d: %s", idx + 1, exc)
raise
log.info("All %d batches pushed (%d workers)", total_batches, max_workers)
pushed_at = datetime.now(timezone.utc).isoformat()
summary = {
"resource_uuid": resource_uuid,
"resource_type": RESOURCE_TYPE,
"invocation_ids": invocation_ids,
"pushed_at": pushed_at,
"event_count": len(events),
"batch_count": total_batches,
"batch_size": batch_size,
"lookback_days": manifest.get("lookback_days"),
"table_lineage_events": manifest.get("table_lineage_events"),
"column_lineage_events": manifest.get("column_lineage_events"),
}
push_manifest_path = manifest_path.replace(".json", "_push_result.json")
with open(push_manifest_path, "w") as fh:
json.dump(summary, fh, indent=2)
log.info("Push result written to %s", push_manifest_path)
return summary
def main() -> None:
parser = argparse.ArgumentParser(description="Push Databricks lineage to Monte Carlo from manifest")
parser.add_argument("--manifest", default="manifest_lineage.json")
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
args = parser.parse_args()
required = ["resource_uuid", "key_id", "key_token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
push(
manifest_path=args.manifest,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
)
if __name__ == "__main__":
main()

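The asset_id convention used by _ref_from_dict() and the column-lineage source fields is a plain double-underscore join — a standalone sketch:

```python
def make_asset_id(database: str, schema: str, name: str) -> str:
    # Same convention as _ref_from_dict: database__schema__name
    return f"{database}__{schema}__{name}"

asset_id = make_asset_id("main", "raw", "orders")  # "main__raw__orders"
```

The same string is built in both the table refs and the column source fields, which is what lets Monte Carlo tie a column's source back to its table.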

@@ -0,0 +1,178 @@
"""
Databricks — Metadata Push (push-only)
========================================
Reads a JSON manifest file produced by collect_metadata.py and pushes the assets
to Monte Carlo via the push ingestion API, with configurable batching to keep
compressed payloads under 1 MB.
Substitution points (search for "← SUBSTITUTE"):
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Databricks connection in Monte Carlo
- PUSH_BATCH_SIZE : number of assets per API call (default 500)
Prerequisites:
pip install pycarlo
"""
from __future__ import annotations
import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from typing import Any
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
AssetField,
AssetFreshness,
AssetMetadata,
AssetVolume,
RelationalAsset,
)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
RESOURCE_TYPE = "databricks"
DEFAULT_BATCH_SIZE = 500 # ← SUBSTITUTE: conservative default to stay under 1 MB compressed
def _asset_from_dict(d: dict[str, Any]) -> RelationalAsset:
"""Reconstruct a RelationalAsset from a manifest dict."""
fields = [
AssetField(
name=f["name"],
type=f.get("type"),
description=f.get("description"),
)
for f in d.get("fields", [])
]
volume = None
if d.get("row_count") is not None or d.get("byte_count") is not None:
volume = AssetVolume(row_count=d.get("row_count"), byte_count=d.get("byte_count"))
freshness = None
if d.get("last_updated") is not None:
freshness = AssetFreshness(last_update_time=d.get("last_updated"))
return RelationalAsset(
type=d.get("asset_type", "TABLE"),
metadata=AssetMetadata(
name=d["asset_name"],
database=d["database"], # ← SUBSTITUTE: use catalog as database
schema=d["schema"],
description=d.get("description"),
),
fields=fields,
volume=volume,
freshness=freshness,
)
def push(
manifest_path: str,
resource_uuid: str,
key_id: str,
key_token: str,
batch_size: int = DEFAULT_BATCH_SIZE,
) -> dict[str, Any]:
"""Read a collect manifest and push assets to Monte Carlo in batches.
Returns a summary dict with invocation IDs and counts.
"""
with open(manifest_path) as fh:
manifest = json.load(fh)
asset_dicts: list[dict[str, Any]] = manifest["assets"]
assets = [_asset_from_dict(d) for d in asset_dicts]
log.info("Loaded %d assets from %s", len(assets), manifest_path)
# Split into batches
batches = []
    for i in range(0, len(assets), batch_size):
batches.append(assets[i : i + batch_size])
total_batches = len(batches)
def _push_batch(batch: list, batch_num: int) -> str | None:
"""Push a single batch using a dedicated Session (thread-safe)."""
client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
service = IngestionService(mc_client=client)
result = service.send_metadata(
resource_uuid=resource_uuid,
resource_type=RESOURCE_TYPE,
events=batch,
)
invocation_id = service.extract_invocation_id(result)
log.info("Pushed batch %d/%d (%d assets) — invocation_id=%s", batch_num, total_batches, len(batch), invocation_id)
return invocation_id
# Push batches in parallel (each thread gets its own pycarlo Session)
    max_workers = max(1, min(4, total_batches))  # at least 1 so an empty manifest cannot crash the pool
invocation_ids: list[str | None] = [None] * total_batches
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = {
pool.submit(_push_batch, batch, i + 1): i
for i, batch in enumerate(batches)
}
for future in as_completed(futures):
idx = futures[future]
try:
invocation_ids[idx] = future.result()
except Exception as exc:
log.error("ERROR pushing batch %d: %s", idx + 1, exc)
raise
log.info("All %d batches pushed (%d workers)", total_batches, max_workers)
pushed_at = datetime.now(timezone.utc).isoformat()
summary = {
"resource_uuid": resource_uuid,
"resource_type": RESOURCE_TYPE,
"invocation_ids": invocation_ids,
"pushed_at": pushed_at,
"asset_count": len(assets),
"batch_count": total_batches,
"batch_size": batch_size,
"catalog": manifest.get("catalog"),
}
# Write push result alongside the collect manifest
push_manifest_path = manifest_path.replace(".json", "_push_result.json")
with open(push_manifest_path, "w") as fh:
json.dump(summary, fh, indent=2)
log.info("Push result written to %s", push_manifest_path)
return summary
def main() -> None:
parser = argparse.ArgumentParser(description="Push Databricks metadata to Monte Carlo from manifest")
parser.add_argument("--manifest", default="manifest_metadata.json")
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
args = parser.parse_args()
required = ["resource_uuid", "key_id", "key_token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
push(
manifest_path=args.manifest,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
)
if __name__ == "__main__":
main()

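The batch-splitting step in push() is ordinary list slicing; a standalone sketch with dummy payloads in place of RelationalAsset objects:

```python
events = list(range(1050))  # dummy payloads; the script slices asset lists
batch_size = 500

batches = [events[i : i + batch_size] for i in range(0, len(events), batch_size)]
sizes = [len(b) for b in batches]  # [500, 500, 50]
```

The final batch is simply whatever remains, so no events are dropped regardless of how the count divides.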

@@ -0,0 +1,200 @@
"""
Databricks — Query Log Push (push-only)
=========================================
Reads a JSON manifest file produced by collect_query_logs.py and pushes the query
log entries to Monte Carlo via the push ingestion API, with configurable batching
to keep compressed payloads under 1 MB.
Substitution points (search for "← SUBSTITUTE"):
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Databricks connection in Monte Carlo
- PUSH_BATCH_SIZE : number of entries per API call (default 100)
Prerequisites:
    pip install pycarlo python-dateutil
"""
from __future__ import annotations
import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from typing import Any
from dateutil.parser import isoparse
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import QueryLogEntry
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
LOG_TYPE = "databricks"
DEFAULT_BATCH_SIZE = 100 # ← SUBSTITUTE: conservative default to stay under 1 MB compressed
# Truncate query_text longer than this to prevent 413 errors.
# Some SQL statements (e.g., generated by BI tools) can be 100KB+ and blow up
# compressed payloads even at small batch sizes.
_MAX_QUERY_TEXT_LEN = 10_000
def _build_query_log_entries(entry_dicts: list[dict[str, Any]]) -> list[QueryLogEntry]:
"""Convert manifest query dicts into QueryLogEntry objects."""
entries = []
truncated = 0
for d in entry_dicts:
query_text = d.get("query_text") or ""
# Truncate very long SQL to prevent 413 Request Too Large
if len(query_text) > _MAX_QUERY_TEXT_LEN:
query_text = query_text[:_MAX_QUERY_TEXT_LEN] + "... [TRUNCATED]"
truncated += 1
extra = {}
if d.get("total_task_duration_ms") is not None:
extra["total_task_duration_ms"] = d["total_task_duration_ms"]
if d.get("read_rows") is not None:
extra["read_rows"] = d["read_rows"]
if d.get("read_bytes") is not None:
extra["read_bytes"] = d["read_bytes"]
start_time = d.get("start_time")
end_time = d.get("end_time")
entries.append(
QueryLogEntry(
query_id=d.get("query_id"),
query_text=query_text,
start_time=isoparse(start_time) if start_time else None,
end_time=isoparse(end_time) if end_time else None,
user=d.get("user"),
returned_rows=d.get("returned_rows"),
extra=extra or None,
)
)
if truncated:
log.info("Truncated %d query text(s) exceeding %d chars", truncated, _MAX_QUERY_TEXT_LEN)
return entries
def push(
manifest_path: str,
resource_uuid: str,
key_id: str,
key_token: str,
batch_size: int = DEFAULT_BATCH_SIZE,
) -> dict[str, Any]:
"""Read a collect manifest and push query log entries to Monte Carlo in batches.
Returns a summary dict with invocation IDs and counts.
"""
with open(manifest_path) as fh:
manifest = json.load(fh)
entry_dicts: list[dict[str, Any]] = manifest["entries"]
entries = _build_query_log_entries(entry_dicts)
log.info("Loaded %d query log entries from %s", len(entries), manifest_path)
if not entries:
log.info("No query log entries to push.")
summary = {
"resource_uuid": resource_uuid,
"log_type": LOG_TYPE,
"invocation_ids": [],
"pushed_at": datetime.now(timezone.utc).isoformat(),
"query_log_count": 0,
"batch_count": 0,
"batch_size": batch_size,
}
push_manifest_path = manifest_path.replace(".json", "_push_result.json")
with open(push_manifest_path, "w") as fh:
json.dump(summary, fh, indent=2)
return summary
# Split into batches
batches = []
for i in range(0, len(entries), batch_size):
batches.append(entries[i : i + batch_size])
total_batches = len(batches)
def _push_batch(batch: list, batch_num: int) -> str | None:
"""Push a single batch using a dedicated Session (thread-safe)."""
client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
service = IngestionService(mc_client=client)
result = service.send_query_logs(
resource_uuid=resource_uuid,
log_type=LOG_TYPE,
events=batch,
)
invocation_id = service.extract_invocation_id(result)
log.info("Pushed batch %d/%d (%d entries) — invocation_id=%s", batch_num, total_batches, len(batch), invocation_id)
return invocation_id
# Push batches in parallel (each thread gets its own pycarlo Session)
max_workers = min(4, total_batches)
invocation_ids: list[str | None] = [None] * total_batches
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = {
pool.submit(_push_batch, batch, i + 1): i
for i, batch in enumerate(batches)
}
for future in as_completed(futures):
idx = futures[future]
try:
invocation_ids[idx] = future.result()
except Exception as exc:
log.error("ERROR pushing batch %d: %s", idx + 1, exc)
raise
log.info("All %d batches pushed (%d workers)", total_batches, max_workers)
pushed_at = datetime.now(timezone.utc).isoformat()
summary = {
"resource_uuid": resource_uuid,
"log_type": LOG_TYPE,
"invocation_ids": invocation_ids,
"pushed_at": pushed_at,
"query_log_count": len(entries),
"batch_count": total_batches,
"batch_size": batch_size,
"lookback_hours": manifest.get("lookback_hours"),
"lookback_lag_hours": manifest.get("lookback_lag_hours"),
}
push_manifest_path = manifest_path.replace(".json", "_push_result.json")
with open(push_manifest_path, "w") as fh:
json.dump(summary, fh, indent=2)
log.info("Push result written to %s", push_manifest_path)
return summary
def main() -> None:
parser = argparse.ArgumentParser(description="Push Databricks query logs to Monte Carlo from manifest")
parser.add_argument("--manifest", default="manifest_query_logs.json")
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
args = parser.parse_args()
required = ["resource_uuid", "key_id", "key_token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
push(
manifest_path=args.manifest,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
)
if __name__ == "__main__":
main()

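The truncation guard in _build_query_log_entries() can be exercised on its own — the limit here mirrors _MAX_QUERY_TEXT_LEN:

```python
MAX_LEN = 10_000  # mirrors _MAX_QUERY_TEXT_LEN

query_text = "SELECT " + "col, " * 5_000  # synthetic oversized statement
if len(query_text) > MAX_LEN:
    query_text = query_text[:MAX_LEN] + "... [TRUNCATED]"
```

Clipping the text (rather than dropping the row) preserves the query's identity for lineage parsing while keeping compressed payloads under the API's size limit.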

@@ -0,0 +1,119 @@
#!/usr/bin/env python3
"""
Extract Hive lineage from a local log file and push it to Monte Carlo in one step.
Thin wrapper that calls ``collect()`` from ``collect_lineage`` followed by
``push()`` from ``push_lineage``, then writes the final manifest (with
``resource_uuid`` and ``invocation_id``) to ``--output-file``.
Substitution points
-------------------
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID for this connection
- --log-file : path to local HiveServer2 log
Prerequisites
-------------
pip install pycarlo python-dotenv
Usage (table-level):
python collect_and_push_lineage.py \\
--key-id <MCD_INGEST_ID> \\
--key-token <MCD_INGEST_TOKEN> \\
--resource-uuid <MCD_RESOURCE_UUID> \\
--log-file /tmp/root/hive.log
Usage (column-level):
python collect_and_push_lineage.py ... --column-lineage
"""
import argparse
import json
import os
from collect_lineage import collect
from push_lineage import DEFAULT_BATCH_SIZE, DEFAULT_TIMEOUT_SECONDS, push
def main() -> None:
parser = argparse.ArgumentParser(
description="Extract Hive lineage from a local log file and push to Monte Carlo",
)
# Collect args
parser.add_argument(
"--log-file",
default="/tmp/root/hive.log",
help="Path to local HiveServer2 log file (default: /tmp/root/hive.log)", # ← SUBSTITUTE: your log path
)
# Push / MC args
parser.add_argument(
"--key-id",
default=os.environ.get("MCD_INGEST_ID"),
help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",
)
parser.add_argument(
"--key-token",
default=os.environ.get("MCD_INGEST_TOKEN"),
help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",
)
parser.add_argument(
"--resource-uuid",
default=os.environ.get("MCD_RESOURCE_UUID"),
help="Monte Carlo resource UUID for this Hive connection (env: MCD_RESOURCE_UUID)",
)
parser.add_argument(
"--column-lineage",
action="store_true",
help="Push column-level lineage instead of table-level",
)
parser.add_argument(
"--output-file",
default="lineage_output.json",
help="Path to write the lineage manifest (default: lineage_output.json)",
)
parser.add_argument(
"--batch-size",
type=int,
default=DEFAULT_BATCH_SIZE,
metavar="N",
help=f"Max events per POST (default: {DEFAULT_BATCH_SIZE})",
)
parser.add_argument(
"--timeout",
type=int,
default=DEFAULT_TIMEOUT_SECONDS,
metavar="SEC",
help=f"HTTP timeout per request in seconds (default: {DEFAULT_TIMEOUT_SECONDS})",
)
args = parser.parse_args()
if not args.key_id or not args.key_token:
parser.error("--key-id and --key-token are required (or set MCD_INGEST_ID / MCD_INGEST_TOKEN)")
if not args.resource_uuid:
parser.error("--resource-uuid is required (or set MCD_RESOURCE_UUID)")
manifest = collect(log_file=args.log_file)
if not manifest["edges"]:
print("No lineage edges detected — no CTAS or INSERT INTO ... SELECT patterns found.")
return
push(
manifest=manifest,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
column_lineage=args.column_lineage,
batch_size=args.batch_size,
timeout_seconds=args.timeout,
)
with open(args.output_file, "w") as fh:
json.dump(manifest, fh, indent=2)
print(f"Lineage manifest written to {args.output_file}")
print("Done.")
if __name__ == "__main__":
main()


@@ -0,0 +1,119 @@
#!/usr/bin/env python3
"""
Collect Hive table metadata and push it to Monte Carlo in one step.
Thin wrapper that calls ``collect()`` from ``collect_metadata`` followed by
``push()`` from ``push_metadata``, then writes the final manifest (with
``resource_uuid`` and ``invocation_id``) to ``--output-file``.
Substitution points
-------------------
- HIVE_HOST (env) / --hive-host (CLI) : HiveServer2 hostname
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID for this connection
Prerequisites
-------------
pip install pycarlo pyhive python-dotenv
Usage
-----
python collect_and_push_metadata.py \\
--key-id <MCD_INGEST_ID> \\
--key-token <MCD_INGEST_TOKEN> \\
--resource-uuid <MCD_RESOURCE_UUID> \\
--hive-host <HIVESERVER2_HOSTNAME>
"""
import argparse
import json
import os
from collect_metadata import collect
from push_metadata import DEFAULT_BATCH_SIZE, DEFAULT_TIMEOUT_SECONDS, push
def main() -> None:
parser = argparse.ArgumentParser(
description="Collect Hive table metadata and push to Monte Carlo",
)
# Hive / collect args
parser.add_argument(
"--hive-host",
default=os.environ.get("HIVE_HOST"),
help="HiveServer2 hostname (env: HIVE_HOST)", # ← SUBSTITUTE: your EMR master DNS or Hive host
)
parser.add_argument(
"--hive-port",
type=int,
default=10000,
help="HiveServer2 port (default: 10000)", # ← SUBSTITUTE if your cluster uses a non-standard port
)
# Push / MC args
parser.add_argument(
"--key-id",
default=os.environ.get("MCD_INGEST_ID"),
help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)", # ← SUBSTITUTE env var name if different
)
parser.add_argument(
"--key-token",
default=os.environ.get("MCD_INGEST_TOKEN"),
help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)", # ← SUBSTITUTE env var name if different
)
parser.add_argument(
"--resource-uuid",
default=os.environ.get("MCD_RESOURCE_UUID"),
help="Monte Carlo resource UUID for this Hive connection (env: MCD_RESOURCE_UUID)",
)
parser.add_argument(
"--output-file",
default="metadata_output.json",
help="Path to write the output manifest (default: metadata_output.json)",
)
parser.add_argument(
"--batch-size",
type=int,
default=DEFAULT_BATCH_SIZE,
metavar="N",
help=f"Max assets per POST (default: {DEFAULT_BATCH_SIZE})",
)
parser.add_argument(
"--timeout",
type=int,
default=DEFAULT_TIMEOUT_SECONDS,
metavar="SEC",
help=f"HTTP timeout per request in seconds (default: {DEFAULT_TIMEOUT_SECONDS})",
)
args = parser.parse_args()
if not args.hive_host:
parser.error("--hive-host is required (or set HIVE_HOST)")
if not args.key_id or not args.key_token:
parser.error("--key-id and --key-token are required (or set MCD_INGEST_ID / MCD_INGEST_TOKEN)")
if not args.resource_uuid:
parser.error("--resource-uuid is required (or set MCD_RESOURCE_UUID)")
manifest = collect(
hive_host=args.hive_host,
hive_port=args.hive_port,
)
push(
manifest=manifest,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
timeout_seconds=args.timeout,
)
with open(args.output_file, "w") as fh:
json.dump(manifest, fh, indent=2)
print(f"Manifest written to {args.output_file}")
print("Done.")
if __name__ == "__main__":
main()


@@ -0,0 +1,117 @@
#!/usr/bin/env python3
"""
Collect Hive query logs from a local log file and push them to Monte Carlo
in one step.
Thin wrapper that calls ``collect()`` from ``collect_query_logs`` followed by
``push()`` from ``push_query_logs``, then writes the final manifest (with
``resource_uuid`` and ``invocation_id``) to ``--output-file``.
Substitution points
-------------------
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID (optional for query logs)
- --log-file path to local HiveServer2 log (default: /tmp/root/hive.log)
- --op-logs-dir optional directory of per-query <queryId>.log files
Prerequisites
-------------
pip install pycarlo python-dateutil python-dotenv
Usage
-----
python collect_and_push_query_logs.py \\
--key-id <MCD_INGEST_ID> \\
--key-token <MCD_INGEST_TOKEN> \\
--resource-uuid <MCD_RESOURCE_UUID> \\
--log-file /tmp/root/hive.log \\
[--op-logs-dir /var/log/hive/operation_logs]
"""
import argparse
import json
import os
from collect_query_logs import collect
from push_query_logs import DEFAULT_BATCH_SIZE, DEFAULT_TIMEOUT_SECONDS, push
def main() -> None:
parser = argparse.ArgumentParser(
description="Collect Hive query logs from a local log file and push to Monte Carlo",
)
# Collect args
parser.add_argument(
"--log-file",
default="/tmp/root/hive.log",
help="Path to local HiveServer2 log file (default: /tmp/root/hive.log)", # ← SUBSTITUTE: your log path
)
parser.add_argument(
"--op-logs-dir",
default=None,
help=(
"Directory containing per-query Hive operation logs (<queryId>.log). "
"When provided, returned_rows is populated from SelectOperator RECORDS_OUT counts."
),
# ← SUBSTITUTE: e.g. /var/log/hive/operation_logs or wherever Hive writes op logs
)
# Push / MC args
parser.add_argument(
"--key-id",
default=os.environ.get("MCD_INGEST_ID"),
help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",
)
parser.add_argument(
"--key-token",
default=os.environ.get("MCD_INGEST_TOKEN"),
help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",
)
parser.add_argument(
"--resource-uuid",
default=os.environ.get("MCD_RESOURCE_UUID"),
help="Monte Carlo resource UUID (optional for query logs) (env: MCD_RESOURCE_UUID)",
)
parser.add_argument(
"--output-file",
default="query_logs_output.json",
help="Path to write the output manifest (default: query_logs_output.json)",
)
parser.add_argument(
"--batch-size",
type=int,
default=DEFAULT_BATCH_SIZE,
metavar="N",
help=f"Max events per POST (default: {DEFAULT_BATCH_SIZE})",
)
parser.add_argument(
"--timeout",
type=int,
default=DEFAULT_TIMEOUT_SECONDS,
metavar="SEC",
help=f"HTTP timeout per request in seconds (default: {DEFAULT_TIMEOUT_SECONDS})",
)
args = parser.parse_args()
if not args.key_id or not args.key_token:
parser.error("--key-id and --key-token are required (or set MCD_INGEST_ID / MCD_INGEST_TOKEN)")
manifest = collect(log_file=args.log_file, op_logs_dir=args.op_logs_dir)
push(
manifest=manifest,
key_id=args.key_id,
key_token=args.key_token,
resource_uuid=args.resource_uuid,
batch_size=args.batch_size,
timeout_seconds=args.timeout,
)
with open(args.output_file, "w") as fh:
json.dump(manifest, fh, indent=2)
print(f"Query log manifest written to {args.output_file}")
print("Done.")
if __name__ == "__main__":
main()


@@ -0,0 +1,265 @@
#!/usr/bin/env python3
"""
Extract table and column lineage from a local HiveServer2 log file — collection only.
Reads a plain-text Hive log file (not compressed), extracts SQL query blocks
from "Executing command" / "Starting command" entries, detects CTAS and
INSERT INTO ... SELECT patterns to build lineage edges, then writes a JSON
manifest file.
Can be run standalone via CLI or imported (use the ``collect()`` function).
Substitution points
-------------------
- --log-file path to local HiveServer2 log (default: /tmp/root/hive.log)
Prerequisites
-------------
pip install python-dotenv
Usage
-----
python collect_lineage.py \\
--log-file /tmp/root/hive.log \\
--output-file lineage_output.json
"""
from __future__ import annotations
import argparse
import json
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone
# ← SUBSTITUTE: set RESOURCE_TYPE to match your Monte Carlo connection type
RESOURCE_TYPE = "data-lake"
# Regex for CTAS: CREATE TABLE [IF NOT EXISTS] db.table AS SELECT ... FROM db.table
_CTAS_RE = re.compile(
r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?"
r"(?P<dest_db>\w+)\.(?P<dest_table>\w+)"
r".*?AS\s+SELECT\s+(?P<select_cols>.+?)\s+FROM\s+(?P<src_db>\w+)\.(?P<src_table>\w+)",
re.IGNORECASE | re.DOTALL,
)
# Regex for INSERT INTO/OVERWRITE db.table SELECT ... FROM db.table
_INSERT_RE = re.compile(
r"INSERT\s+(?:INTO|OVERWRITE)\s+(?:TABLE\s+)?(?P<dest_db>\w+)\.(?P<dest_table>\w+)"
r".*?SELECT\s+(?P<select_cols>.+?)\s+FROM\s+(?P<src_db>\w+)\.(?P<src_table>\w+)",
re.IGNORECASE | re.DOTALL,
)
# Regex to detect additional JOIN sources beyond the primary FROM clause
_JOIN_RE = re.compile(r"JOIN\s+(?P<src_db>\w+)\.(?P<src_table>\w+)", re.IGNORECASE)
# Simple column alias extraction: [alias.]col [AS dest]
_COL_RE = re.compile(r"(?:(\w+)\.)?(\w+)(?:\s+AS\s+(\w+))?", re.IGNORECASE)
# Hive string literals — strip before scanning so words inside 'status' AS ...
# are not treated as column refs
_STR_LITERAL_RE = re.compile(r"'(?:''|[^'])*'")
# ROW_NUMBER() OVER (...) AS alias — whole expression has no single source column;
# removing it avoids bogus tokens in col_mappings
_WINDOW_AS_ALIAS_RE = re.compile(
r"\b(?:ROW_NUMBER|RANK|DENSE_RANK|NTILE)\s*\(\s*\)\s+OVER\s*\([^)]*\)\s+AS\s+\w+",
re.IGNORECASE,
)
# Regex to pull query text out of Hive log "Executing/Starting command" lines
_COMMAND_START_RE = re.compile(
r"(?:Executing|Starting)\s+command\(queryId=\S*\):\s+(?P<query>.+?)(?=\n\d{4}-\d{2}-\d{2}|\Z)",
re.DOTALL,
)
# Tokens that are almost never real column names — SQL keywords, functions, casts, etc.
_SQL_SCAN_NOISE = frozenset(
{
"ROW_NUMBER", "RANK", "DENSE_RANK", "NTILE", "OVER", "PARTITION",
"ORDER", "BY", "CASE", "WHEN", "THEN", "ELSE", "END", "AND", "OR",
"NOT", "IN", "IS", "DISTINCT", "CAST", "CONVERT", "CURRENT_TIMESTAMP",
"CURRENT_DATE", "TRUE", "FALSE", "NULL", "BETWEEN", "LIKE", "EXISTS",
"ASC", "DESC", "LIMIT", "OFFSET", "GROUP", "HAVING", "UNION", "ALL",
"INNER", "LEFT", "RIGHT", "FULL", "OUTER", "CROSS", "JOIN", "ON",
"WHERE", "SELECT", "FROM", "AS", "STRING", "BIGINT", "INT", "SMALLINT",
"TINYINT", "DOUBLE", "FLOAT", "REAL", "DECIMAL", "BOOLEAN", "DATE",
"TIMESTAMP", "VARCHAR", "CHAR", "BINARY", "ARRAY", "MAP", "STRUCT",
"SUM", "AVG", "COUNT", "MIN", "MAX", "STDDEV", "VARIANCE", "VAR_POP",
"COALESCE", "IF", "SUBSTRING", "YEAR", "MONTH", "DAY", "LEAD", "LAG",
"FIRST_VALUE", "LAST_VALUE",
}
)
@dataclass
class _LineageEdge:
dest_db: str
dest_table: str
sources: list[tuple[str, str]] = field(default_factory=list)
# col_mappings: (dest_col, src_table, src_col)
col_mappings: list[tuple[str, str, str]] = field(default_factory=list)
def _prepare_select_for_col_scan(select_clause: str) -> str:
"""Remove literals and window headers so _COL_RE sees fewer false positives."""
s = _STR_LITERAL_RE.sub(" ", select_clause)
s = _WINDOW_AS_ALIAS_RE.sub(" ", s)
return s
def _dedupe_col_mappings(mappings: list[tuple[str, str, str]]) -> list[tuple[str, str, str]]:
seen: set[tuple[str, str, str]] = set()
out: list[tuple[str, str, str]] = []
for t in mappings:
if t in seen:
continue
seen.add(t)
out.append(t)
return out
def _extract_query_blocks(log_text: str) -> list[str]:
"""Extract individual SQL query strings from a Hive log file."""
return [m.group("query").strip() for m in _COMMAND_START_RE.finditer(log_text)]
def _parse_select_cols(select_clause: str, src_table: str) -> list[tuple[str, str, str]]:
"""
Lightweight column mapping: for each `alias.col AS dest` or `col AS dest`
in the SELECT clause, return (dest_col, src_table, src_col).
Strips string literals and window function headers first to reduce false
positives, and filters out SQL keywords/noise tokens.
"""
prepared = _prepare_select_for_col_scan(select_clause)
mappings = []
for m in _COL_RE.finditer(prepared):
src_col = m.group(2)
dest_col = m.group(3) or src_col
if src_col.upper() in ("FROM", "SELECT", "WHERE", "JOIN", "ON", "AS", "*"):
continue
if src_col.upper() in _SQL_SCAN_NOISE or dest_col.upper() in _SQL_SCAN_NOISE:
continue
# After stripping 'literal' AS col, we get " AS col" — skip bare (col, col) with no source expr.
if dest_col == src_col:
prefix = prepared[: m.start()].rstrip()
if prefix.upper().endswith("AS"):
continue
mappings.append((dest_col, src_table, src_col))
return _dedupe_col_mappings(mappings)
def _parse_edges(queries: list[str]) -> list[_LineageEdge]:
"""Parse SQL query strings into _LineageEdge objects."""
edges: dict[str, _LineageEdge] = {}
for sql in queries:
# Strip string literals to avoid false table/column matches inside quoted strings
sql_clean = re.sub(r"\s+", " ", _STR_LITERAL_RE.sub(" ", sql)).strip()
for pattern in (_CTAS_RE, _INSERT_RE):
m = pattern.search(sql_clean)
if not m:
continue
dest_db = m.group("dest_db").lower()
dest_table = m.group("dest_table").lower()
src_db = m.group("src_db").lower()
src_table = m.group("src_table").lower()
select_cols = m.group("select_cols")
key = f"{dest_db}.{dest_table}"
if key not in edges:
edges[key] = _LineageEdge(dest_db=dest_db, dest_table=dest_table)
edge = edges[key]
src_pair = (src_db, src_table)
if src_pair not in edge.sources:
edge.sources.append(src_pair)
# Pick up additional JOIN sources
for jm in _JOIN_RE.finditer(sql_clean):
jp = (jm.group("src_db").lower(), jm.group("src_table").lower())
if jp not in edge.sources:
edge.sources.append(jp)
edge.col_mappings.extend(_parse_select_cols(select_cols, src_table))
break # matched one pattern, move to next query
# Deduplicate column mappings per edge (same INSERT may appear many times in HS2 logs)
for e in edges.values():
e.col_mappings = _dedupe_col_mappings(e.col_mappings)
return list(edges.values())
def collect(log_file: str) -> dict:
"""
Parse lineage edges from a HiveServer2 log file and return a manifest dict.
Args:
log_file: Path to a local HiveServer2 log file.
Returns:
Manifest dict with keys: resource_type, collected_at, edges.
Each edge has destination, sources, and col_mappings lists.
"""
print(f"Reading Hive log file: {log_file} ...")
with open(log_file, errors="replace") as fh:
log_text = fh.read()
queries = _extract_query_blocks(log_text)
print(f" Extracted {len(queries)} query block(s).")
edges = _parse_edges(queries)
print(f" Parsed {len(edges)} lineage edge(s).")
manifest = {
"resource_type": RESOURCE_TYPE,
"collected_at": datetime.now(tz=timezone.utc).isoformat(),
"edges": [
{
"destination": {"database": e.dest_db, "table": e.dest_table},
"sources": [{"database": sdb, "table": stbl} for sdb, stbl in e.sources],
"col_mappings": [
{"dest_col": dc, "src_table": st, "src_col": sc}
for dc, st, sc in e.col_mappings
],
}
for e in edges
],
}
return manifest
def main() -> None:
parser = argparse.ArgumentParser(
description="Extract Hive lineage from a local log file and write a JSON manifest",
)
parser.add_argument(
"--log-file",
default="/tmp/root/hive.log",
help="Path to local HiveServer2 log file (default: /tmp/root/hive.log)", # ← SUBSTITUTE: your log path
)
parser.add_argument(
"--output-file",
default="lineage_output.json",
help="Path to write the lineage manifest (default: lineage_output.json)",
)
args = parser.parse_args()
manifest = collect(log_file=args.log_file)
if not manifest["edges"]:
print("No lineage edges detected — no CTAS or INSERT INTO ... SELECT patterns found.")
return
with open(args.output_file, "w") as fh:
json.dump(manifest, fh, indent=2)
print(f"Lineage manifest written to {args.output_file}")
print("Done.")
if __name__ == "__main__":
main()
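As a quick sanity check on the extraction above, the same CTAS pattern can be exercised standalone; the sample SQL below is illustrative only:

```python
import re

# Same CTAS regex as collect_lineage.py, reproduced here as a standalone check.
_CTAS_RE = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?"
    r"(?P<dest_db>\w+)\.(?P<dest_table>\w+)"
    r".*?AS\s+SELECT\s+(?P<select_cols>.+?)\s+FROM\s+(?P<src_db>\w+)\.(?P<src_table>\w+)",
    re.IGNORECASE | re.DOTALL,
)

sql = "CREATE TABLE analytics.daily_orders AS SELECT order_id, total FROM raw.orders"
m = _CTAS_RE.search(sql)
# Yields one edge: raw.orders -> analytics.daily_orders, with the SELECT list
# available for the lightweight column-mapping pass.
edge = (m.group("dest_db"), m.group("dest_table"), m.group("src_db"), m.group("src_table"))
```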


@@ -0,0 +1,313 @@
#!/usr/bin/env python3
"""
Collect table metadata from a Hive Metastore — collection only.
Connects to HiveServer2 (default port 10000), discovers all databases and
tables via SHOW DATABASES / SHOW TABLES, reads schema and table statistics
via DESCRIBE FORMATTED, then writes a JSON manifest file.
Can be run standalone via CLI or imported (use the ``collect()`` function).
Substitution points
-------------------
- HIVE_HOST (env) / --hive-host (CLI) : HiveServer2 hostname
- HIVE_PORT (env) / --hive-port (CLI) : HiveServer2 port (default 10000)
Prerequisites
-------------
pip install pyhive python-dotenv
Usage
-----
python collect_metadata.py \\
--hive-host <HIVESERVER2_HOSTNAME> \\
--output-file metadata_output.json
"""
import argparse
import json
import os
import re
from datetime import datetime, timezone
from pyhive import hive
def _check_available_memory(min_gb: float = 2.0) -> None:
"""Warn if available memory is below the threshold."""
try:
if hasattr(os, "sysconf"): # Linux / macOS
page_size = os.sysconf("SC_PAGE_SIZE")
avail_pages = os.sysconf("SC_AVPHYS_PAGES")
avail_gb = (page_size * avail_pages) / (1024 ** 3)
else:
return # Windows — skip check
except (ValueError, OSError):
return
if avail_gb < min_gb:
print(
f"WARNING: Only {avail_gb:.1f} GB of memory available "
f"(minimum recommended: {min_gb:.1f} GB). "
f"Consider reducing the number of databases/tables or increasing available memory."
)
# ← SUBSTITUTE: set RESOURCE_TYPE to match your Monte Carlo connection type
RESOURCE_TYPE = "data-lake"
# Map Hive native types to SQL-standard uppercase types expected by Monte Carlo
_HIVE_TYPE_MAP: dict[str, str] = {
"tinyint": "TINYINT",
"smallint": "SMALLINT",
"int": "INTEGER",
"integer": "INTEGER",
"bigint": "BIGINT",
"float": "FLOAT",
"double": "DOUBLE",
"double precision": "DOUBLE",
"decimal": "DECIMAL",
"numeric": "DECIMAL",
"boolean": "BOOLEAN",
"string": "VARCHAR",
"varchar": "VARCHAR",
"char": "CHAR",
"binary": "BINARY",
"timestamp": "TIMESTAMP",
"date": "DATE",
"interval": "INTERVAL",
"array": "ARRAY",
"map": "MAP",
"struct": "STRUCT",
"uniontype": "UNION",
}
# ← SUBSTITUTE: add any internal table name prefixes you want to skip
_INTERNAL_TABLE_PREFIXES = ("tmp_", "__", "hive_")
def _normalize_hive_type(hive_type: str) -> str:
"""Uppercase and normalize a Hive type string to a SQL-standard form.
Parametrized types like ``decimal(10,2)`` or ``varchar(255)`` keep their
suffix; the base type is mapped through ``_HIVE_TYPE_MAP``.
"""
    stripped = hive_type.strip()
    base = stripped.lower().split("(")[0].strip()
    suffix = stripped[len(base):].strip()  # preserve original params, e.g. decimal(10,2)
    return _HIVE_TYPE_MAP.get(base, base.upper()) + suffix
def _connect(host: str, port: int) -> hive.Connection:
# ← SUBSTITUTE: update username/auth if your cluster requires Kerberos or LDAP
return hive.connect(host=host, port=port, username="hadoop", auth="NONE")
def _fetch_rows(cursor, query: str) -> list[tuple]:
"""Execute a query and fetch results in memory-safe chunks."""
cursor.execute(query)
rows: list[tuple] = []
while True:
chunk = cursor.fetchmany(1000)
if not chunk:
break
rows.extend(chunk)
return rows
def _parse_describe_formatted(rows: list[tuple]) -> dict:
"""
Parse DESCRIBE FORMATTED <db>.<table> output into a structured dict:
columns, row_count, total_size, last_modified, description, created_on
"""
result: dict = {
"columns": [],
"row_count": None,
"total_size": None,
"last_modified": None,
"description": None,
"created_on": None,
}
in_col_info = False
in_table_info = False
for row in rows:
col_name = (row[0] or "").strip()
data_type = (row[1] or "").strip()
comment = (row[2] or "").strip() if len(row) > 2 else ""
if col_name.startswith("# col_name"):
in_col_info = True
in_table_info = False
continue
if col_name.startswith("# Detailed Table Information"):
in_col_info = False
in_table_info = True
continue
if col_name.startswith("#"):
in_col_info = False
continue
if in_col_info and col_name and data_type:
result["columns"].append(
{
"name": col_name,
"type": _normalize_hive_type(data_type),
"description": comment or None,
}
)
if in_table_info:
# Table Parameters rows have an empty col_name; key is in data_type, value in comment
param_key = data_type.strip() if not col_name else col_name.strip().rstrip(":")
param_val = (comment.strip() if not col_name else data_type.strip()) or ""
if re.search(r"numRows", param_key, re.IGNORECASE):
try:
result["row_count"] = int(param_val)
except (ValueError, TypeError):
pass
elif re.search(r"totalSize", param_key, re.IGNORECASE):
try:
result["total_size"] = int(param_val)
except (ValueError, TypeError):
pass
elif re.search(r"last_modified_time", param_key, re.IGNORECASE):
try:
result["last_modified"] = datetime.fromtimestamp(
int(param_val), tz=timezone.utc
).isoformat()
except (ValueError, TypeError):
pass
elif re.search(r"^CreateTime", param_key):
# e.g. "Wed Mar 18 20:15:40 UTC 2026"
try:
result["created_on"] = datetime.strptime(
param_val, "%a %b %d %H:%M:%S %Z %Y"
).replace(tzinfo=timezone.utc).isoformat()
except (ValueError, TypeError):
pass
elif param_key == "comment" and not result["description"] and param_val:
result["description"] = param_val
return result
def collect(
hive_host: str,
hive_port: int = 10000,
) -> dict:
"""
Connect to HiveServer2, discover all databases and tables, and return a
manifest dict with collected asset metadata.
Args:
hive_host: HiveServer2 hostname.
hive_port: HiveServer2 port (default 10000).
Returns:
Manifest dict with keys: resource_type, collected_at, assets.
"""
_check_available_memory()
print(f"Connecting to HiveServer2 at {hive_host}:{hive_port} ...")
conn = _connect(hive_host, hive_port)
cursor = conn.cursor()
assets: list[dict] = []
print("Collecting table metadata ...")
databases = [row[0] for row in _fetch_rows(cursor, "SHOW DATABASES")]
print(f" Found databases: {databases}")
for db in databases:
# ← SUBSTITUTE: add any system databases you want to skip
if db in ("information_schema",):
continue
tables = _fetch_rows(cursor, f"SHOW TABLES IN {db}")
table_names = [row[0] for row in tables]
print(f" {db}: {len(table_names)} table(s)")
for table in table_names:
if any(table.startswith(p) for p in _INTERNAL_TABLE_PREFIXES):
continue
try:
desc_rows = _fetch_rows(cursor, f"DESCRIBE FORMATTED {db}.{table}")
except Exception as exc:
print(f" WARNING: could not describe {db}.{table}: {exc}")
continue
info = _parse_describe_formatted(desc_rows)
row_count = info["row_count"] if info["row_count"] and info["row_count"] > 0 else None
byte_count = info["total_size"] if info["total_size"] and info["total_size"] > 0 else None
assets.append(
{
"database": db,
"schema": db,
"name": table,
"description": info["description"],
"created_on": info["created_on"],
"row_count": row_count,
"byte_count": byte_count,
"last_modified": info["last_modified"],
"fields": [
{"name": col["name"], "type": col["type"], "description": col["description"]}
for col in info["columns"]
],
}
)
print(
f" + {db}.{table} ({len(info['columns'])} columns, "
f"desc={info['description']!r}, created={info['created_on']})"
)
cursor.close()
conn.close()
print(f"\nCollected {len(assets)} table(s).")
manifest = {
"resource_type": RESOURCE_TYPE,
"collected_at": datetime.now(tz=timezone.utc).isoformat(),
"assets": assets,
}
return manifest
def main() -> None:
parser = argparse.ArgumentParser(
description="Collect Hive table metadata and write a JSON manifest",
)
parser.add_argument(
"--hive-host",
default=os.environ.get("HIVE_HOST"),
help="HiveServer2 hostname (env: HIVE_HOST)", # ← SUBSTITUTE: your EMR master DNS or Hive host
)
parser.add_argument(
"--hive-port",
type=int,
default=10000,
help="HiveServer2 port (default: 10000)", # ← SUBSTITUTE if your cluster uses a non-standard port
)
parser.add_argument(
"--output-file",
default="metadata_output.json",
help="Path to write the output manifest (default: metadata_output.json)",
)
args = parser.parse_args()
if not args.hive_host:
parser.error("--hive-host is required (or set HIVE_HOST)")
manifest = collect(
hive_host=args.hive_host,
hive_port=args.hive_port,
)
with open(args.output_file, "w") as fh:
json.dump(manifest, fh, indent=2)
print(f"Asset manifest written to {args.output_file}")
print("Done.")
if __name__ == "__main__":
main()
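A minimal standalone sketch of the type normalization above, with a trimmed-down map for illustration (the full script maps many more Hive types):

```python
# Illustrative subset of the Hive -> SQL-standard type map used above.
_HIVE_TYPE_MAP = {"string": "VARCHAR", "int": "INTEGER", "decimal": "DECIMAL"}

def normalize(hive_type: str) -> str:
    """Uppercase a Hive type, keeping any (precision,scale) suffix intact."""
    stripped = hive_type.strip()
    base = stripped.lower().split("(")[0].strip()
    suffix = stripped[len(base):].strip()  # e.g. "(10,2)" for decimal(10,2)
    return _HIVE_TYPE_MAP.get(base, base.upper()) + suffix
```

Unmapped types such as `bigint` fall through to a plain uppercase form.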


@@ -0,0 +1,284 @@
#!/usr/bin/env python3
"""
Collect Hive query logs from a local HiveServer2 log file — collection only.
Parses a plain-text HiveServer2 log for "Executing/Starting command" entries
to extract query text, query ID, start time and end time. Optionally reads
per-query operation logs to populate ``returned_rows`` from SelectOperator
``RECORDS_OUT`` counters. Deduplicates entries by query ID.
Can be run standalone via CLI or imported (use the ``collect()`` function).
Substitution points
-------------------
- --log-file path to local HiveServer2 log (default: /tmp/root/hive.log)
- --op-logs-dir optional directory of per-query <queryId>.log files
Prerequisites
-------------
pip install python-dateutil python-dotenv
Usage
-----
python collect_query_logs.py \\
--log-file /tmp/root/hive.log \\
[--op-logs-dir /var/log/hive/operation_logs] \\
--output-file query_logs_output.json
"""
from __future__ import annotations
import argparse
import json
import re
from datetime import datetime, timezone
from io import StringIO
from pathlib import Path
from dateutil.parser import isoparse
# NOTE: the normalizer requires "hive-s3" — do not change to "hive" or "data-lake"
LOG_TYPE = "hive-s3"
# Matches the start of a new query block in the Hive log
_COMMAND_START_RE = re.compile(
r"(Executing|Starting)\s+command\(queryId=(?P<query_id>\S*)\):\s+(?P<command>.*)$"
)
# Extracts returned row counts from per-query Hive operation logs
_RECORDS_OUT_RE = re.compile(r"RECORDS_OUT_OPERATOR_SEL_\d+:(\d+)")
def _parse_log_entries(log_text: str) -> list[dict]:
"""
Parse a HiveServer2 log file and return a list of dicts:
query_id, start_time (datetime), end_time (datetime), query (str)
Each timestamped "Executing/Starting command" line starts a new entry.
The previous entry's end_time is set to the timestamp of the next line.
"""
entries = []
query = ""
query_id = ""
start_time: datetime | None = None
last_timestamp: datetime | None = None
for line in StringIO(log_text):
parts = line.split()
if not parts:
continue
try:
timestamp = isoparse(parts[0])
if not timestamp.tzinfo:
timestamp = timestamp.replace(tzinfo=timezone.utc)
except ValueError:
# Continuation line for a multi-line query
if query:
query += "\n" + line.rstrip()
continue
command_start = _COMMAND_START_RE.search(line)
if command_start:
# Emit the previous entry before starting a new one
if query and start_time:
entries.append(
{
"query_id": query_id,
"start_time": start_time,
"end_time": timestamp,
"query": query,
}
)
query_id = command_start.group("query_id")
start_time = timestamp
query = command_start.group("command").strip()
elif query and start_time:
# A timestamped non-command line closes the current entry
entries.append(
{
"query_id": query_id,
"start_time": start_time,
"end_time": timestamp,
"query": query,
}
)
query = ""
query_id = ""
start_time = None
last_timestamp = timestamp
# Flush any trailing entry
if query and start_time:
end_time = last_timestamp or start_time
entries.append(
{
"query_id": query_id,
"start_time": start_time,
"end_time": end_time,
"query": query,
}
)
return entries
def _load_returned_rows(op_logs_dir: str) -> dict[str, int]:
"""
Scan a directory of per-query Hive operation logs (named <queryId>.log) and
return a mapping of query_id -> rows returned.
The row count is taken from the last RECORDS_OUT_OPERATOR_SEL_N value in
each file, which reflects the final number of rows delivered to the client.
"""
rows_by_id: dict[str, int] = {}
for log_file in Path(op_logs_dir).glob("*.log"):
query_id = log_file.stem
last_count: int | None = None
try:
text = log_file.read_text(errors="replace")
except OSError:
continue
for m in _RECORDS_OUT_RE.finditer(text):
last_count = int(m.group(1))
if last_count is not None:
rows_by_id[query_id] = last_count
return rows_by_id
def _build_query_log_entries(
raw_entries: list[dict],
rows_by_id: dict[str, int] | None = None,
) -> list[dict]:
"""
Deduplicate raw log entries by query_id and enrich with returned_rows.
Returns plain dicts so that ``push_query_logs.py`` can reconstruct
QueryLogEntry objects from the JSON manifest.
"""
seen: set[str] = set()
entries = []
for r in raw_entries:
qid = r["query_id"]
if qid and qid in seen:
continue
if qid:
seen.add(qid)
returned_rows: int | None = rows_by_id.get(qid) if rows_by_id and qid else None
entries.append(
{
"query_id": qid or None,
"start_time": r["start_time"].isoformat(),
"end_time": r["end_time"].isoformat(),
"query_text": r["query"],
"user": "hadoop", # ← SUBSTITUTE: set the user appropriate for your cluster
"returned_rows": returned_rows,
}
)
return entries
def collect(
log_file: str,
op_logs_dir: str | None = None,
) -> dict:
"""
Parse query log entries from a HiveServer2 log file and return a manifest dict.
Args:
log_file: Path to a local HiveServer2 log file.
op_logs_dir: Optional directory containing per-query operation logs
(<queryId>.log). When provided, returned_rows is populated
from SelectOperator RECORDS_OUT counts.
Returns:
Manifest dict with keys: log_type, collected_at, entry_count,
window_start, window_end, queries.
"""
print(f"Reading Hive log file: {log_file} ...")
with open(log_file, errors="replace") as fh:
log_text = fh.read()
raw_entries = _parse_log_entries(log_text)
print(f" Parsed {len(raw_entries)} query log entry/entries.")
if not raw_entries:
print("No query log entries found.")
return {
"log_type": LOG_TYPE,
"collected_at": datetime.now(tz=timezone.utc).isoformat(),
"entry_count": 0,
"window_start": None,
"window_end": None,
"queries": [],
}
rows_by_id: dict[str, int] | None = None
if op_logs_dir:
rows_by_id = _load_returned_rows(op_logs_dir)
print(f" Loaded row counts for {len(rows_by_id)} query/queries from {op_logs_dir}")
queries = _build_query_log_entries(raw_entries, rows_by_id)
start_times = [r["start_time"] for r in raw_entries]
end_times = [r["end_time"] for r in raw_entries]
manifest = {
"log_type": LOG_TYPE,
"collected_at": datetime.now(tz=timezone.utc).isoformat(),
"entry_count": len(queries),
"window_start": min(start_times).isoformat() if start_times else None,
"window_end": max(end_times).isoformat() if end_times else None,
"queries": [
{
"query_id": q["query_id"],
"start_time": q["start_time"],
"end_time": q["end_time"],
"query": q["query_text"],
"user": q["user"],
"returned_rows": q["returned_rows"],
}
for q in queries
],
}
return manifest
def main() -> None:
parser = argparse.ArgumentParser(
description="Collect Hive query logs from a local log file and write a JSON manifest",
)
parser.add_argument(
"--log-file",
default="/tmp/root/hive.log",
help="Path to local HiveServer2 log file (default: /tmp/root/hive.log)", # ← SUBSTITUTE: your log path
)
parser.add_argument(
"--op-logs-dir",
default=None,
help=(
"Directory containing per-query Hive operation logs (<queryId>.log). "
"When provided, returned_rows is populated from SelectOperator RECORDS_OUT counts."
),
# ← SUBSTITUTE: e.g. /var/log/hive/operation_logs or wherever Hive writes op logs
)
parser.add_argument(
"--output-file",
default="query_logs_output.json",
help="Path to write the output manifest (default: query_logs_output.json)",
)
args = parser.parse_args()
manifest = collect(log_file=args.log_file, op_logs_dir=args.op_logs_dir)
with open(args.output_file, "w") as fh:
json.dump(manifest, fh, indent=2)
print(f"Query log manifest written to {args.output_file}")
print("Done.")
if __name__ == "__main__":
main()

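The first-wins deduplication used in `_build_query_log_entries` above (and again in `push_query_logs.py`) can be isolated as a small sketch. The dict shapes here are illustrative only, not the real manifest schema:

```python
def dedupe_by_id(raw_entries):
    """Keep the first entry per query_id; entries without an ID always pass."""
    seen = set()
    out = []
    for r in raw_entries:
        qid = r.get("query_id")
        if qid and qid in seen:
            continue
        if qid:
            seen.add(qid)
        out.append(r)
    return out

entries = [
    {"query_id": "q1"},
    {"query_id": "q1"},   # duplicate: dropped
    {"query_id": None},   # no ID: kept
    {"query_id": "q2"},
]
print(len(dedupe_by_id(entries)))  # 3
```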
@@ -0,0 +1,309 @@
#!/usr/bin/env python3
"""
Push a collected Hive lineage manifest to Monte Carlo — push only.
Reads a JSON manifest produced by ``collect_lineage.py``, builds LineageEvent
objects (table-level or column-level), and calls ``send_lineage`` in batches.
The manifest is updated in-place with ``resource_uuid`` and ``invocation_id``
after a successful push.
Can be run standalone via CLI or imported (use the ``push()`` function).
Substitution points
-------------------
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID for this connection
Prerequisites
-------------
pip install pycarlo python-dotenv
Usage (table-level):
python push_lineage.py \\
--key-id <MCD_INGEST_ID> \\
--key-token <MCD_INGEST_TOKEN> \\
--resource-uuid <MCD_RESOURCE_UUID> \\
--input-file lineage_output.json
Usage (column-level):
python push_lineage.py ... --column-lineage
"""
import argparse
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
ColumnLineageField,
ColumnLineageSourceField,
LineageAssetRef,
LineageEvent,
)
# ← SUBSTITUTE: set RESOURCE_TYPE to match your Monte Carlo connection type
RESOURCE_TYPE = "data-lake"
# ← SUBSTITUTE: default batch size for lineage push (events per request)
DEFAULT_BATCH_SIZE = 500
# ← SUBSTITUTE: HTTP timeout for MC ingestion requests (seconds)
DEFAULT_TIMEOUT_SECONDS = 120
def _build_table_lineage(edges_data: list[dict]) -> list[LineageEvent]:
"""Build table-level LineageEvent objects from raw edge dicts."""
events = []
for edge in edges_data:
sources = edge.get("sources", [])
if not sources:
continue
dest = edge["destination"]
events.append(
LineageEvent(
destination=LineageAssetRef(
type="TABLE",
name=dest["table"],
database=dest["database"],
schema=dest["database"],
),
sources=[
LineageAssetRef(
type="TABLE",
name=src["table"],
database=src["database"],
schema=src["database"],
)
for src in sources
],
)
)
return events
def _build_column_lineage(edges_data: list[dict]) -> list[LineageEvent]:
"""Build column-level LineageEvent objects from raw edge dicts."""
events = []
for edge in edges_data:
sources = edge.get("sources", [])
if not sources:
continue
dest = edge["destination"]
dest_asset_id = f"{dest['database']}__{dest['table']}"
source_asset_ids = {
(src["database"], src["table"]): f"{src['database']}__{src['table']}"
for src in sources
}
col_fields: dict[str, ColumnLineageField] = {}
for mapping in edge.get("col_mappings", []):
dest_col = mapping["dest_col"]
src_table = mapping["src_table"]
src_col = mapping["src_col"]
# Find the matching source db for this src_table
src_db = next(
(src["database"] for src in sources if src["table"] == src_table),
dest["database"],
)
src_aid = source_asset_ids.get((src_db, src_table), f"{src_db}__{src_table}")
if dest_col not in col_fields:
col_fields[dest_col] = ColumnLineageField(name=dest_col, source_fields=[])
col_fields[dest_col].source_fields.append(
ColumnLineageSourceField(asset_id=src_aid, field_name=src_col)
)
events.append(
LineageEvent(
destination=LineageAssetRef(
type="TABLE",
name=dest["table"],
database=dest["database"],
schema=dest["database"],
asset_id=dest_asset_id,
),
sources=[
LineageAssetRef(
type="TABLE",
name=src["table"],
database=src["database"],
schema=src["database"],
asset_id=source_asset_ids[(src["database"], src["table"])],
)
for src in sources
],
fields=list(col_fields.values()) if col_fields else None,
)
)
return events
def push(
manifest: dict,
resource_uuid: str,
key_id: str,
key_token: str,
column_lineage: bool = False,
batch_size: int = DEFAULT_BATCH_SIZE,
timeout_seconds: int = DEFAULT_TIMEOUT_SECONDS,
) -> str | None:
"""
Push collected lineage to Monte Carlo and update the manifest in-place.
Events are sent in batches of ``batch_size`` (default 500) to avoid
oversized payloads. Supports both table-level and column-level lineage.
Args:
manifest: Dict loaded from a ``collect_lineage.py`` output file.
resource_uuid: MC resource UUID for this Hive connection.
key_id: MC ingestion key ID.
key_token: MC ingestion key token.
column_lineage: When True, push column-level lineage; otherwise table-level.
batch_size: Events per POST request (default 500).
timeout_seconds: HTTP timeout per request (default 120).
Returns:
The last invocation ID string if returned by MC, otherwise None.
"""
resource_type = manifest.get("resource_type", RESOURCE_TYPE)
edges_data = manifest.get("edges", [])
if column_lineage:
events = _build_column_lineage(edges_data)
label = "column-level"
else:
events = _build_table_lineage(edges_data)
label = "table-level"
print(f"Loaded {len(events)} {label} lineage event(s) from manifest")
if not events:
print("No lineage events to push.")
manifest["resource_uuid"] = resource_uuid
manifest["invocation_id"] = None
return None
# Split into batches
batch_list = []
for i in range(0, len(events), batch_size):
batch_list.append(events[i : i + batch_size])
total_batches = len(batch_list)
def _push_batch(batch: list, batch_num: int) -> str | None:
"""Push a single batch using a dedicated Session (thread-safe)."""
print(f" Pushing batch {batch_num}/{total_batches} ({len(batch)} events) ...")
client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
service = IngestionService(mc_client=client)
result = service.send_lineage(
resource_uuid=resource_uuid,
resource_type=resource_type,
events=batch,
)
invocation_id = service.extract_invocation_id(result)
if invocation_id:
print(f" Batch {batch_num}: invocation_id={invocation_id}")
return invocation_id
# Push batches in parallel (each thread gets its own pycarlo Session)
max_workers = min(4, total_batches)
invocation_ids: list[str | None] = [None] * total_batches
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = {
pool.submit(_push_batch, batch, i + 1): i
for i, batch in enumerate(batch_list)
}
for future in as_completed(futures):
idx = futures[future]
try:
invocation_ids[idx] = future.result()
except Exception as exc:
print(f" ERROR pushing batch {idx + 1}: {exc}")
raise
print(f" All {total_batches} batches pushed ({max_workers} workers)")
manifest["resource_uuid"] = resource_uuid
manifest["invocation_id"] = invocation_ids[-1] if invocation_ids else None
if len([i for i in invocation_ids if i]) > 1:
manifest["invocation_ids"] = invocation_ids
elif "invocation_ids" in manifest:
del manifest["invocation_ids"]
return manifest.get("invocation_id")
def main() -> None:
parser = argparse.ArgumentParser(
description="Push a collected Hive lineage manifest to Monte Carlo",
)
parser.add_argument(
"--key-id",
default=os.environ.get("MCD_INGEST_ID"),
help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",
)
parser.add_argument(
"--key-token",
default=os.environ.get("MCD_INGEST_TOKEN"),
help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",
)
parser.add_argument(
"--resource-uuid",
default=os.environ.get("MCD_RESOURCE_UUID"),
help="Monte Carlo resource UUID for this Hive connection (env: MCD_RESOURCE_UUID)",
)
parser.add_argument(
"--input-file",
default="lineage_output.json",
help="Path to the JSON manifest written by collect_lineage.py (default: lineage_output.json)",
)
parser.add_argument(
"--column-lineage",
action="store_true",
help="Push column-level lineage instead of table-level",
)
parser.add_argument(
"--batch-size",
type=int,
default=DEFAULT_BATCH_SIZE,
metavar="N",
help=f"Max events per POST (default: {DEFAULT_BATCH_SIZE})",
)
parser.add_argument(
"--timeout",
type=int,
default=DEFAULT_TIMEOUT_SECONDS,
metavar="SEC",
help=f"HTTP timeout per request in seconds (default: {DEFAULT_TIMEOUT_SECONDS})",
)
args = parser.parse_args()
if not args.key_id or not args.key_token:
parser.error("--key-id and --key-token are required (or set MCD_INGEST_ID / MCD_INGEST_TOKEN)")
if not args.resource_uuid:
parser.error("--resource-uuid is required (or set MCD_RESOURCE_UUID)")
with open(args.input_file) as fh:
manifest = json.load(fh)
push(
manifest=manifest,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
column_lineage=args.column_lineage,
batch_size=args.batch_size,
timeout_seconds=args.timeout,
)
with open(args.input_file, "w") as fh:
json.dump(manifest, fh, indent=2)
print(f"Manifest updated in-place: {args.input_file}")
print("Done.")
if __name__ == "__main__":
main()

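`push()` above splits events into fixed-size batches and fans them out over a small thread pool while keeping results in batch order. A self-contained sketch of that pattern, with `sum` standing in for the real per-batch push call:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_batched(items, batch_size, worker, max_workers=4):
    """Split items into fixed-size batches, process them concurrently,
    and return per-batch results in batch order."""
    batches = [items[i : i + batch_size] for i in range(0, len(items), batch_size)]
    results = [None] * len(batches)
    workers = min(max_workers, len(batches)) or 1  # executor needs >= 1 worker
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(worker, batch): i for i, batch in enumerate(batches)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results

print(run_batched(list(range(10)), 4, sum))  # [6, 22, 17]
```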
@@ -0,0 +1,245 @@
#!/usr/bin/env python3
"""
Push a collected Hive metadata manifest to Monte Carlo — push only.
Reads a JSON manifest produced by ``collect_metadata.py``, builds
RelationalAsset objects, and calls ``send_metadata`` in batches. The manifest
is updated in-place with ``resource_uuid`` and ``invocation_id`` after a
successful push.
Can be run standalone via CLI or imported (use the ``push()`` function).
Substitution points
-------------------
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID for this connection
Prerequisites
-------------
pip install pycarlo python-dotenv
Usage
-----
python push_metadata.py \\
--key-id <MCD_INGEST_ID> \\
--key-token <MCD_INGEST_TOKEN> \\
--resource-uuid <MCD_RESOURCE_UUID> \\
--input-file metadata_output.json
"""
import argparse
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
AssetField,
AssetFreshness,
AssetMetadata,
AssetVolume,
RelationalAsset,
)
# ← SUBSTITUTE: default batch size for metadata push (assets per request)
DEFAULT_BATCH_SIZE = 500
# ← SUBSTITUTE: HTTP timeout for MC ingestion requests (seconds)
DEFAULT_TIMEOUT_SECONDS = 120
def _build_assets(manifest: dict) -> list[RelationalAsset]:
"""Rebuild RelationalAsset objects from a collected metadata manifest."""
assets = []
for a in manifest.get("assets", []):
fields = [
AssetField(
name=f["name"],
type=f["type"],
description=f.get("description"),
)
for f in a.get("fields", [])
]
volume = None
row_count = a.get("row_count")
byte_count = a.get("byte_count")
if row_count or byte_count:
volume = AssetVolume(
row_count=row_count if row_count and row_count > 0 else None,
byte_count=byte_count if byte_count and byte_count > 0 else None,
)
freshness = None
last_modified = a.get("last_modified")
if last_modified:
freshness = AssetFreshness(last_update_time=last_modified)
assets.append(
RelationalAsset(
type="TABLE",
metadata=AssetMetadata(
name=a["name"],
database=a["database"],
schema=a["schema"],
description=a.get("description"),
created_on=a.get("created_on"),
),
fields=fields,
volume=volume,
freshness=freshness,
)
)
return assets
def push(
manifest: dict,
resource_uuid: str,
key_id: str,
key_token: str,
batch_size: int = DEFAULT_BATCH_SIZE,
timeout_seconds: int = DEFAULT_TIMEOUT_SECONDS,
) -> str | None:
"""
Push collected metadata to Monte Carlo and update the manifest in-place.
Assets are sent in batches of ``batch_size`` (default 500) to avoid
oversized payloads. The manifest is enriched with ``resource_uuid``
and the last ``invocation_id`` from the response.
Args:
manifest: Dict loaded from a ``collect_metadata.py`` output file.
resource_uuid: MC resource UUID for this Hive connection.
key_id: MC ingestion key ID.
key_token: MC ingestion key token.
batch_size: Assets per POST request (default 500).
timeout_seconds: HTTP timeout per request (default 120).
Returns:
The last invocation ID string if returned by MC, otherwise None.
"""
resource_type = manifest.get("resource_type", "data-lake")
assets = _build_assets(manifest)
    n = len(assets)
    print(f"Loaded {n} asset(s) from manifest")
    if not assets:
        print("No assets to push.")
        manifest["resource_uuid"] = resource_uuid
        manifest["invocation_id"] = None
        return None
    # Split into batches
    batch_list = []
    for i in range(0, n, batch_size):
        batch_list.append(assets[i : i + batch_size])
total_batches = len(batch_list)
def _push_batch(batch: list, batch_num: int) -> str | None:
"""Push a single batch using a dedicated Session (thread-safe)."""
client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
service = IngestionService(mc_client=client)
result = service.send_metadata(
resource_uuid=resource_uuid,
resource_type=resource_type,
events=batch,
)
invocation_id = service.extract_invocation_id(result)
print(f" Pushed batch {batch_num}/{total_batches} ({len(batch)} assets) — invocation_id={invocation_id}")
return invocation_id
# Push batches in parallel (each thread gets its own pycarlo Session)
max_workers = min(4, total_batches)
invocation_ids: list[str | None] = [None] * total_batches
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = {
pool.submit(_push_batch, batch, i + 1): i
for i, batch in enumerate(batch_list)
}
for future in as_completed(futures):
idx = futures[future]
try:
invocation_ids[idx] = future.result()
except Exception as exc:
print(f" ERROR pushing batch {idx + 1}: {exc}")
raise
print(f" All {total_batches} batches pushed ({max_workers} workers)")
manifest["resource_uuid"] = resource_uuid
manifest["invocation_id"] = invocation_ids[-1] if invocation_ids else None
if len([i for i in invocation_ids if i]) > 1:
manifest["invocation_ids"] = invocation_ids
elif "invocation_ids" in manifest:
del manifest["invocation_ids"]
return manifest.get("invocation_id")
def main() -> None:
parser = argparse.ArgumentParser(
description="Push a collected Hive metadata manifest to Monte Carlo",
)
parser.add_argument(
"--key-id",
default=os.environ.get("MCD_INGEST_ID"),
help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)", # ← SUBSTITUTE env var name if different
)
parser.add_argument(
"--key-token",
default=os.environ.get("MCD_INGEST_TOKEN"),
help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)", # ← SUBSTITUTE env var name if different
)
parser.add_argument(
"--resource-uuid",
default=os.environ.get("MCD_RESOURCE_UUID"),
help="Monte Carlo resource UUID for this Hive connection (env: MCD_RESOURCE_UUID)",
)
parser.add_argument(
"--input-file",
default="metadata_output.json",
help="Path to the JSON manifest written by collect_metadata.py (default: metadata_output.json)",
)
parser.add_argument(
"--batch-size",
type=int,
default=DEFAULT_BATCH_SIZE,
metavar="N",
help=f"Max assets per POST (default: {DEFAULT_BATCH_SIZE})",
)
parser.add_argument(
"--timeout",
type=int,
default=DEFAULT_TIMEOUT_SECONDS,
metavar="SEC",
help=f"HTTP timeout per request in seconds (default: {DEFAULT_TIMEOUT_SECONDS})",
)
args = parser.parse_args()
if not args.key_id or not args.key_token:
parser.error("--key-id and --key-token are required (or set MCD_INGEST_ID / MCD_INGEST_TOKEN)")
if not args.resource_uuid:
parser.error("--resource-uuid is required (or set MCD_RESOURCE_UUID)")
with open(args.input_file) as fh:
manifest = json.load(fh)
push(
manifest=manifest,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
timeout_seconds=args.timeout,
)
with open(args.input_file, "w") as fh:
json.dump(manifest, fh, indent=2)
print(f"Manifest updated in-place: {args.input_file}")
print("Done.")
if __name__ == "__main__":
main()

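`_build_assets` only attaches an `AssetVolume` when at least one count is present, and nulls out zero or negative counts inside it. That guard in isolation, with plain dicts standing in for the pycarlo model:

```python
def volume_fields(row_count, byte_count):
    """Omit the volume block when neither count is usable; inside it,
    null out anything that is zero, negative, or missing."""
    if not (row_count or byte_count):
        return None
    return {
        "row_count": row_count if row_count and row_count > 0 else None,
        "byte_count": byte_count if byte_count and byte_count > 0 else None,
    }

print(volume_fields(0, 0))      # None
print(volume_fields(10, None))  # {'row_count': 10, 'byte_count': None}
```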
@@ -0,0 +1,255 @@
#!/usr/bin/env python3
"""
Push a collected Hive query log manifest to Monte Carlo — push only.
Reads a JSON manifest produced by ``collect_query_logs.py``, builds
QueryLogEntry objects, and calls ``send_query_logs`` in batches. The manifest
is updated in-place with ``resource_uuid`` and ``invocation_id`` after a
successful push.
Can be run standalone via CLI or imported (use the ``push()`` function).
Substitution points
-------------------
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID (optional for query logs)
Prerequisites
-------------
pip install pycarlo python-dateutil python-dotenv
Usage
-----
python push_query_logs.py \\
--key-id <MCD_INGEST_ID> \\
--key-token <MCD_INGEST_TOKEN> \\
--resource-uuid <MCD_RESOURCE_UUID> \\
--input-file query_logs_output.json
"""
import argparse
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import timezone
from dateutil.parser import isoparse
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import QueryLogEntry
# ← SUBSTITUTE: default batch size for query log push (events per request)
# Query logs include full SQL text — keep batches small to stay under the 1 MB
# compressed payload limit. 500 entries can trigger 413 on active warehouses.
DEFAULT_BATCH_SIZE = 100
# ← SUBSTITUTE: HTTP timeout for MC ingestion requests (seconds)
DEFAULT_TIMEOUT_SECONDS = 120
# Truncate query_text longer than this to prevent 413 errors.
# Some SQL statements (e.g., generated by BI tools) can be 100KB+ and blow up
# compressed payloads even at small batch sizes.
_MAX_QUERY_TEXT_LEN = 10_000
def _build_events(manifest: dict) -> list[QueryLogEntry]:
"""
Rebuild QueryLogEntry objects from a collected query log manifest.
ISO timestamp strings are parsed back to datetime. Entries are
deduplicated by query_id.
"""
seen: set[str] = set()
events = []
truncated = 0
for q in manifest.get("queries", []):
qid = q.get("query_id")
if qid and qid in seen:
continue
if qid:
seen.add(qid)
start_time = isoparse(q["start_time"])
if not start_time.tzinfo:
start_time = start_time.replace(tzinfo=timezone.utc)
end_time = isoparse(q["end_time"])
if not end_time.tzinfo:
end_time = end_time.replace(tzinfo=timezone.utc)
query_text = q.get("query") or ""
# Truncate very long SQL to prevent 413 Request Too Large
if len(query_text) > _MAX_QUERY_TEXT_LEN:
query_text = query_text[:_MAX_QUERY_TEXT_LEN] + "... [TRUNCATED]"
truncated += 1
events.append(
QueryLogEntry(
start_time=start_time,
end_time=end_time,
query_text=query_text,
query_id=qid or None,
user=q.get("user", "hadoop"), # ← SUBSTITUTE: set the user appropriate for your cluster
returned_rows=q.get("returned_rows"),
)
)
if truncated:
print(f" Truncated {truncated} query text(s) exceeding {_MAX_QUERY_TEXT_LEN} chars")
return events
def push(
manifest: dict,
key_id: str,
key_token: str,
resource_uuid: str | None = None,
batch_size: int = DEFAULT_BATCH_SIZE,
timeout_seconds: int = DEFAULT_TIMEOUT_SECONDS,
) -> str | None:
"""
Push collected query logs to Monte Carlo and update the manifest in-place.
Events are sent in batches of ``batch_size`` (default 100) to avoid
oversized payloads.
Args:
manifest: Dict loaded from a ``collect_query_logs.py`` output file.
key_id: MC ingestion key ID.
key_token: MC ingestion key token.
resource_uuid: Optional MC resource UUID.
batch_size: Events per POST request (default 100).
timeout_seconds: HTTP timeout per request (default 120).
Returns:
The last invocation ID string if returned by MC, otherwise None.
"""
log_type = manifest.get("log_type", "hive-s3")
events = _build_events(manifest)
n = len(events)
print(f"Loaded {n} query log entry/entries from manifest")
if not events:
print("No query log entries to push.")
manifest["log_type"] = log_type
if resource_uuid is not None:
manifest["resource_uuid"] = resource_uuid
manifest["invocation_id"] = None
return None
# Split into batches
batch_list = []
for i in range(0, n, batch_size):
batch_list.append(events[i : i + batch_size])
total_batches = len(batch_list)
def _push_batch(batch: list, batch_num: int) -> str | None:
"""Push a single batch using a dedicated Session (thread-safe)."""
client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
service = IngestionService(mc_client=client)
result = service.send_query_logs(
resource_uuid=resource_uuid,
log_type=log_type,
events=batch,
)
invocation_id = service.extract_invocation_id(result)
print(f" Pushed batch {batch_num}/{total_batches} ({len(batch)} entries) — invocation_id={invocation_id}")
return invocation_id
# Push batches in parallel (each thread gets its own pycarlo Session)
max_workers = min(4, total_batches)
invocation_ids: list[str | None] = [None] * total_batches
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = {
pool.submit(_push_batch, batch, i + 1): i
for i, batch in enumerate(batch_list)
}
for future in as_completed(futures):
idx = futures[future]
try:
invocation_ids[idx] = future.result()
except Exception as exc:
print(f" ERROR pushing batch {idx + 1}: {exc}")
raise
print(f" All {total_batches} batches pushed ({max_workers} workers)")
manifest["log_type"] = log_type
if resource_uuid is not None:
manifest["resource_uuid"] = resource_uuid
manifest["invocation_id"] = invocation_ids[-1] if invocation_ids else None
if len([i for i in invocation_ids if i]) > 1:
manifest["invocation_ids"] = invocation_ids
elif "invocation_ids" in manifest:
del manifest["invocation_ids"]
return manifest.get("invocation_id")
def main() -> None:
parser = argparse.ArgumentParser(
description="Push a collected Hive query log manifest to Monte Carlo",
)
parser.add_argument(
"--key-id",
default=os.environ.get("MCD_INGEST_ID"),
help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",
)
parser.add_argument(
"--key-token",
default=os.environ.get("MCD_INGEST_TOKEN"),
help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",
)
parser.add_argument(
"--resource-uuid",
default=os.environ.get("MCD_RESOURCE_UUID"),
help="Monte Carlo resource UUID (optional for query logs) (env: MCD_RESOURCE_UUID)",
)
parser.add_argument(
"--input-file",
default="query_logs_output.json",
help="Path to the JSON manifest written by collect_query_logs.py (default: query_logs_output.json)",
)
parser.add_argument(
"--batch-size",
type=int,
default=DEFAULT_BATCH_SIZE,
metavar="N",
help=f"Max events per POST (default: {DEFAULT_BATCH_SIZE})",
)
parser.add_argument(
"--timeout",
type=int,
default=DEFAULT_TIMEOUT_SECONDS,
metavar="SEC",
help=f"HTTP timeout per request in seconds (default: {DEFAULT_TIMEOUT_SECONDS})",
)
args = parser.parse_args()
if not args.key_id or not args.key_token:
parser.error("--key-id and --key-token are required (or set MCD_INGEST_ID / MCD_INGEST_TOKEN)")
with open(args.input_file) as fh:
manifest = json.load(fh)
push(
manifest=manifest,
key_id=args.key_id,
key_token=args.key_token,
resource_uuid=args.resource_uuid,
batch_size=args.batch_size,
timeout_seconds=args.timeout,
)
with open(args.input_file, "w") as fh:
json.dump(manifest, fh, indent=2)
print(f"Manifest updated in-place: {args.input_file}")
print("Done.")
if __name__ == "__main__":
main()

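`_build_events` parses ISO timestamps with `isoparse` and backfills UTC on naive values. The same round-trip can be sketched with the stdlib alone; `datetime.fromisoformat` handles the `isoformat()` strings these scripts emit:

```python
from datetime import datetime, timezone

def parse_utc(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp, assuming UTC when no offset is present."""
    dt = datetime.fromisoformat(ts)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt

print(parse_utc("2026-04-09T10:54:12").isoformat())  # 2026-04-09T10:54:12+00:00
```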
@@ -0,0 +1,78 @@
"""
Redshift — Lineage Collect & Push (combined)
==============================================
Collects table-level lineage from Redshift by parsing query history, then pushes
the derived lineage events to Monte Carlo via the push ingestion API.
This script imports and calls collect() from collect_lineage and push() from
push_lineage, running both in sequence.
Substitution points (search for "← SUBSTITUTE"):
- REDSHIFT_HOST / REDSHIFT_DB / REDSHIFT_USER / REDSHIFT_PASSWORD : connection
- LOOKBACK_HOURS : how far back to scan query history (default 24 h)
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Redshift connection in Monte Carlo
- PUSH_BATCH_SIZE : number of events per API call (default 500)
Prerequisites:
pip install psycopg2-binary pycarlo
"""
from __future__ import annotations
import argparse
import logging
import os
from collect_lineage import LOOKBACK_HOURS, collect
from push_lineage import DEFAULT_BATCH_SIZE, push
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
def main() -> None:
parser = argparse.ArgumentParser(description="Collect and push Redshift lineage to Monte Carlo")
parser.add_argument("--host", default=os.getenv("REDSHIFT_HOST")) # ← SUBSTITUTE
parser.add_argument("--db", default=os.getenv("REDSHIFT_DB")) # ← SUBSTITUTE
parser.add_argument("--user", default=os.getenv("REDSHIFT_USER")) # ← SUBSTITUTE
parser.add_argument("--password", default=os.getenv("REDSHIFT_PASSWORD")) # ← SUBSTITUTE
parser.add_argument("--port", type=int, default=int(os.getenv("REDSHIFT_PORT", "5439")))
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
parser.add_argument("--manifest", default="manifest_lineage.json")
args = parser.parse_args()
required = ["host", "db", "user", "password", "resource_uuid", "key_id", "key_token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
log.info("Step 1: Collecting lineage …")
collect(
host=args.host,
db=args.db,
user=args.user,
password=args.password,
manifest_path=args.manifest,
port=args.port,
lookback_hours=args.lookback_hours,
)
log.info("Step 2: Pushing lineage to Monte Carlo …")
push(
manifest_path=args.manifest,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
)
log.info("Done — collect and push complete.")
if __name__ == "__main__":
main()

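All three combined Redshift scripts share the same argparse convention: environment variables provide the defaults, then a single post-parse check rejects anything still `None`. A minimal sketch of that validation:

```python
import argparse
import os

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="env-default sketch")
    # CLI flag wins; the env var is the fallback; None means neither was set.
    parser.add_argument("--host", default=os.getenv("REDSHIFT_HOST"))
    parser.add_argument("--db", default=os.getenv("REDSHIFT_DB"))
    return parser

def find_missing(args: argparse.Namespace, names: list[str]) -> list[str]:
    """Names still unset after CLI and environment resolution."""
    return [n for n in names if getattr(args, n) is None]

args = argparse.Namespace(host="cluster.example.com", db=None)
print(find_missing(args, ["host", "db"]))  # ['db']
```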
@@ -0,0 +1,80 @@
"""
Redshift — Metadata Collect & Push (combined)
===============================================
Collects table schemas, row counts, and byte sizes from Amazon Redshift,
then pushes them to Monte Carlo via the push ingestion API.
This script imports and calls collect() from collect_metadata and push() from
push_metadata, running both in sequence.
Substitution points (search for "← SUBSTITUTE"):
- REDSHIFT_HOST : Redshift cluster endpoint or serverless workgroup endpoint
- REDSHIFT_DB : database name to connect to
- REDSHIFT_USER : database user (or IAM role user)
- REDSHIFT_PASSWORD : database password
- DB_EXCLUSIONS : databases to skip
- SCHEMA_EXCLUSIONS : schemas to skip in every database
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Redshift connection in Monte Carlo
- PUSH_BATCH_SIZE : number of assets per API call (default 500)
Prerequisites:
pip install psycopg2-binary pycarlo
"""
from __future__ import annotations
import argparse
import logging
import os
from collect_metadata import collect
from push_metadata import DEFAULT_BATCH_SIZE, push
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
def main() -> None:
parser = argparse.ArgumentParser(description="Collect and push Redshift metadata to Monte Carlo")
parser.add_argument("--host", default=os.getenv("REDSHIFT_HOST")) # ← SUBSTITUTE
parser.add_argument("--db", default=os.getenv("REDSHIFT_DB")) # ← SUBSTITUTE
parser.add_argument("--user", default=os.getenv("REDSHIFT_USER")) # ← SUBSTITUTE
parser.add_argument("--password", default=os.getenv("REDSHIFT_PASSWORD")) # ← SUBSTITUTE
parser.add_argument("--port", type=int, default=int(os.getenv("REDSHIFT_PORT", "5439")))
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
parser.add_argument("--manifest", default="manifest_metadata.json")
args = parser.parse_args()
required = ["host", "db", "user", "password", "resource_uuid", "key_id", "key_token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
log.info("Step 1: Collecting metadata …")
collect(
host=args.host,
db=args.db,
user=args.user,
password=args.password,
manifest_path=args.manifest,
port=args.port,
)
log.info("Step 2: Pushing metadata to Monte Carlo …")
push(
manifest_path=args.manifest,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
)
log.info("Done — collect and push complete.")
if __name__ == "__main__":
main()

@@ -0,0 +1,88 @@
"""
Redshift — Query Log Collect & Push (combined)
================================================
Collects completed query execution records from Redshift using sys_query_history
and sys_query_text, then pushes them to Monte Carlo for query-pattern analysis,
lineage derivation, and usage attribution.
This script imports and calls collect() from collect_query_logs and push() from
push_query_logs, running both in sequence.
Substitution points (search for "← SUBSTITUTE"):
- REDSHIFT_HOST / REDSHIFT_DB / REDSHIFT_USER / REDSHIFT_PASSWORD : connection
- LOOKBACK_HOURS : hours back from [now - LAG_HOURS] to collect (default 25)
- LOOKBACK_LAG_HOURS: lag behind now to avoid in-flight queries (default 1)
- BATCH_SIZE : number of query_ids to fetch texts for in one SQL call
- MAX_QUERIES : maximum query rows to process per run
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Redshift connection in Monte Carlo
- PUSH_BATCH_SIZE : number of entries per API call (default 250)
Prerequisites:
pip install psycopg2-binary pycarlo
"""
from __future__ import annotations
import argparse
import logging
import os
from collect_query_logs import BATCH_SIZE, LOOKBACK_HOURS, LOOKBACK_LAG_HOURS, MAX_QUERIES, collect
from push_query_logs import DEFAULT_BATCH_SIZE, push
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
def main() -> None:
parser = argparse.ArgumentParser(description="Collect and push Redshift query logs to Monte Carlo")
parser.add_argument("--host", default=os.getenv("REDSHIFT_HOST")) # ← SUBSTITUTE
parser.add_argument("--db", default=os.getenv("REDSHIFT_DB")) # ← SUBSTITUTE
parser.add_argument("--user", default=os.getenv("REDSHIFT_USER")) # ← SUBSTITUTE
parser.add_argument("--password", default=os.getenv("REDSHIFT_PASSWORD")) # ← SUBSTITUTE
parser.add_argument("--port", type=int, default=int(os.getenv("REDSHIFT_PORT", "5439")))
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
parser.add_argument("--lookback-lag-hours", type=int, default=LOOKBACK_LAG_HOURS)
parser.add_argument("--batch-size", type=int, default=BATCH_SIZE)
parser.add_argument("--max-queries", type=int, default=MAX_QUERIES)
parser.add_argument("--push-batch-size", type=int, default=DEFAULT_BATCH_SIZE)
parser.add_argument("--manifest", default="manifest_query_logs.json")
args = parser.parse_args()
required = ["host", "db", "user", "password", "resource_uuid", "key_id", "key_token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
log.info("Step 1: Collecting query logs …")
collect(
host=args.host,
db=args.db,
user=args.user,
password=args.password,
manifest_path=args.manifest,
port=args.port,
lookback_hours=args.lookback_hours,
lookback_lag_hours=args.lookback_lag_hours,
batch_size=args.batch_size,
max_queries=args.max_queries,
)
log.info("Step 2: Pushing query logs to Monte Carlo …")
push(
manifest_path=args.manifest,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.push_batch_size,
)
log.info("Done — collect and push complete.")
if __name__ == "__main__":
main()
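Each of these entry points uses the same argparse pattern: every credential flag defaults to an environment variable, and a flag counts as missing only when neither the CLI nor the environment supplied a value. A minimal sketch of that validation logic (the env var names mirror the scripts; the demo values are made up):

```python
import argparse
import os

def build_parser() -> argparse.ArgumentParser:
    # Each flag falls back to an environment variable, so the script works
    # both from the CLI and from a scheduler that injects env vars.
    p = argparse.ArgumentParser(description="env-default demo")
    p.add_argument("--host", default=os.getenv("REDSHIFT_HOST"))
    p.add_argument("--db", default=os.getenv("REDSHIFT_DB"))
    return p

def missing_required(args: argparse.Namespace, required: list[str]) -> list[str]:
    # Mirrors the scripts' check: a value is "missing" only when neither
    # the CLI flag nor the environment variable provided it.
    return [k for k in required if getattr(args, k) is None]

os.environ.pop("REDSHIFT_HOST", None)   # simulate: host not configured
os.environ["REDSHIFT_DB"] = "analytics" # simulate: db set via environment
args = build_parser().parse_args([])
print(missing_required(args, ["host", "db"]))  # ['host']
```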


@@ -0,0 +1,235 @@
"""
Redshift — Lineage Collection (collect-only)
==============================================
Collects table-level lineage from Redshift by fetching recent successful query
history from sys_query_history + sys_querytext and parsing CREATE TABLE AS SELECT
(CTAS) and INSERT INTO SELECT patterns to derive source->destination relationships.
Writes a JSON manifest file that can be consumed by push_lineage.py.
Substitution points (search for "← SUBSTITUTE"):
- REDSHIFT_HOST / REDSHIFT_DB / REDSHIFT_USER / REDSHIFT_PASSWORD : connection
- LOOKBACK_HOURS : how far back to scan query history (default 24 h)
Prerequisites:
pip install psycopg2-binary
"""
from __future__ import annotations
import argparse
import json
import logging
import os
import re
from datetime import datetime, timezone
from typing import Any
import psycopg2
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
RESOURCE_TYPE = "redshift"
LOOKBACK_HOURS: int = int(os.getenv("LOOKBACK_HOURS", "24")) # ← SUBSTITUTE
def _check_available_memory(min_gb: float = 2.0) -> None:
"""Warn if available memory is below the threshold."""
try:
if hasattr(os, "sysconf"): # Linux / macOS
page_size = os.sysconf("SC_PAGE_SIZE")
avail_pages = os.sysconf("SC_AVPHYS_PAGES")
avail_gb = (page_size * avail_pages) / (1024 ** 3)
else:
return # Windows — skip check
except (ValueError, OSError):
return
if avail_gb < min_gb:
log.warning(
"Only %.1f GB of memory available (minimum recommended: %.1f GB). "
"Consider reducing the collection scope or increasing available memory.",
avail_gb,
min_gb,
)
# Regex: CTAS — CREATE [OR REPLACE] TABLE|VIEW <dest> AS SELECT
_CTAS_RE = re.compile(
r"CREATE\s+(?:OR\s+REPLACE\s+)?(?:TABLE|VIEW)\s+(?P<dest>\"?[\w.\"]+\"?)\s*(?:\([^)]*\))?\s*AS\s+SELECT\b",
re.IGNORECASE | re.DOTALL,
)
# Regex: INSERT INTO <dest> … SELECT
_INSERT_RE = re.compile(
r"INSERT\s+INTO\s+(?P<dest>\"?[\w.\"]+\"?)\s.*?SELECT\b",
re.IGNORECASE | re.DOTALL,
)
# Matches any schema.table or database.schema.table reference in the query
_TABLE_REF_RE = re.compile(r'"?([\w]+)"?\."?([\w]+)"?(?:\."?([\w]+)"?)?', re.IGNORECASE)
def _clean_name(name: str) -> str:
return name.strip('"').strip()
def _parse_ref(ref: str) -> tuple[str, str, str]:
"""Parse 'db.schema.table' or 'schema.table' -> (database, schema, table)."""
parts = [_clean_name(p) for p in ref.split(".")]
if len(parts) == 3:
return parts[0], parts[1], parts[2]
if len(parts) == 2:
return "", parts[0], parts[1]
return "", "", parts[0]
def _dictfetch(cursor: Any, sql: str, params: tuple | None = None) -> list[dict[str, Any]]:
cursor.execute(sql, params)
cols = [d.name for d in cursor.description]
rows = []
while True:
chunk = cursor.fetchmany(1000)
if not chunk:
break
rows.extend(dict(zip(cols, row)) for row in chunk)
return rows
def fetch_query_texts(cursor: Any, lookback_hours: int) -> list[str]:
"""Assemble full query texts from sys_query_history + sys_querytext."""
rows = _dictfetch(
cursor,
f"""
SELECT
sq.query_id,
LISTAGG(
CASE WHEN LEN(st.text) <= 200 THEN st.text ELSE LEFT(st.text, 200) END,
''
) WITHIN GROUP (ORDER BY st.sequence) AS full_text
FROM sys_query_history sq
JOIN sys_querytext st ON sq.query_id = st.query_id
WHERE sq.start_time >= DATEADD(hour, -{lookback_hours}, GETDATE())
AND sq.status = 'success'
GROUP BY sq.query_id
LIMIT 50000
""", # ← SUBSTITUTE: adjust lookback_hours, LIMIT, or add user/database filters
)
return [r["full_text"] for r in rows if r.get("full_text")]
def parse_lineage_from_sql(sql_text: str) -> list[dict[str, Any]]:
events: list[dict[str, Any]] = []
dest_match = _CTAS_RE.search(sql_text) or _INSERT_RE.search(sql_text)
if not dest_match:
return events
dest_raw = dest_match.group("dest")
dest_db, dest_schema, dest_table = _parse_ref(dest_raw)
if not dest_table:
return events
# Find all schema.table refs in the query, excluding the destination
source_refs: list[str] = []
for m in _TABLE_REF_RE.finditer(sql_text):
if m.group(3):
ref = f"{m.group(1)}.{m.group(2)}.{m.group(3)}"
else:
ref = f"{m.group(1)}.{m.group(2)}"
db, schema, table = _parse_ref(ref)
if not table or (db == dest_db and schema == dest_schema and table == dest_table):
continue
source_refs.append(ref)
if not source_refs:
return events
# Deduplicate sources while preserving order
seen: set[str] = set()
sources: list[dict[str, str]] = []
for ref in source_refs:
if ref not in seen:
seen.add(ref)
db, schema, table = _parse_ref(ref)
sources.append({"database": db, "schema": schema, "asset_name": table})
events.append({
"sources": sources,
"destination": {"database": dest_db, "schema": dest_schema, "asset_name": dest_table},
})
return events
def collect(
host: str,
db: str,
user: str,
password: str,
manifest_path: str = "manifest_lineage.json",
port: int = 5439,
lookback_hours: int = LOOKBACK_HOURS,
) -> list[dict[str, Any]]:
"""Connect to Redshift, collect lineage, write a JSON manifest, and return events."""
_check_available_memory()
collected_at = datetime.now(timezone.utc).isoformat()
conn = psycopg2.connect(
host=host, port=port, dbname=db, user=user, password=password, connect_timeout=30,
)
try:
with conn.cursor() as cursor:
query_texts = fetch_query_texts(cursor, lookback_hours)
finally:
conn.close()
log.info("Parsing lineage from %d query texts …", len(query_texts))
all_events: list[dict[str, Any]] = []
for sql_text in query_texts:
all_events.extend(parse_lineage_from_sql(sql_text))
log.info("Collected %d lineage events", len(all_events))
manifest = {
"resource_type": RESOURCE_TYPE,
"collected_at": collected_at,
"lookback_hours": lookback_hours,
"queries_scanned": len(query_texts),
"lineage_event_count": len(all_events),
"events": all_events,
}
with open(manifest_path, "w") as fh:
json.dump(manifest, fh, indent=2)
log.info("Manifest written to %s (%d events)", manifest_path, len(all_events))
return all_events
def main() -> None:
parser = argparse.ArgumentParser(description="Collect Redshift lineage to a manifest file")
parser.add_argument("--host", default=os.getenv("REDSHIFT_HOST")) # ← SUBSTITUTE
parser.add_argument("--db", default=os.getenv("REDSHIFT_DB")) # ← SUBSTITUTE
parser.add_argument("--user", default=os.getenv("REDSHIFT_USER")) # ← SUBSTITUTE
parser.add_argument("--password", default=os.getenv("REDSHIFT_PASSWORD")) # ← SUBSTITUTE
parser.add_argument("--port", type=int, default=int(os.getenv("REDSHIFT_PORT", "5439")))
parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
parser.add_argument("--manifest", default="manifest_lineage.json")
args = parser.parse_args()
required = ["host", "db", "user", "password"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
collect(
host=args.host,
db=args.db,
user=args.user,
password=args.password,
manifest_path=args.manifest,
port=args.port,
lookback_hours=args.lookback_hours,
)
if __name__ == "__main__":
main()


@@ -0,0 +1,219 @@
"""
Redshift — Metadata Collection (collect-only)
===============================================
Collects table schemas, row counts, and byte sizes from Amazon Redshift using
SVV system views, then writes a JSON manifest file that can be consumed by
push_metadata.py.
Substitution points (search for "← SUBSTITUTE"):
- REDSHIFT_HOST : Redshift cluster endpoint or serverless workgroup endpoint
- REDSHIFT_DB : database name to connect to
- REDSHIFT_USER : database user (or IAM role user)
- REDSHIFT_PASSWORD : database password
- DB_EXCLUSIONS : databases to skip
- SCHEMA_EXCLUSIONS : schemas to skip in every database
Prerequisites:
pip install psycopg2-binary
"""
from __future__ import annotations
import argparse
import json
import logging
import os
from datetime import datetime, timezone
from typing import Any
import psycopg2
import psycopg2.extras
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
RESOURCE_TYPE = "redshift"
DB_EXCLUSIONS: set[str] = {"dev", "padb_harvest"} # ← SUBSTITUTE: add internal databases
SCHEMA_EXCLUSIONS: set[str] = { # ← SUBSTITUTE: add internal schemas
"information_schema",
"pg_catalog",
"pg_internal",
"catalog_history",
}
def _check_available_memory(min_gb: float = 2.0) -> None:
"""Warn if available memory is below the threshold."""
try:
if hasattr(os, "sysconf"): # Linux / macOS
page_size = os.sysconf("SC_PAGE_SIZE")
avail_pages = os.sysconf("SC_AVPHYS_PAGES")
avail_gb = (page_size * avail_pages) / (1024 ** 3)
else:
return # Windows — skip check
except (ValueError, OSError):
return
if avail_gb < min_gb:
log.warning(
"Only %.1f GB of memory available (minimum recommended: %.1f GB). "
"Consider reducing the collection scope or increasing available memory.",
avail_gb,
min_gb,
)
def _dictfetch(cursor: Any, sql: str, params: tuple | None = None) -> list[dict[str, Any]]:
cursor.execute(sql, params)
cols = [d.name for d in cursor.description]
rows = []
while True:
chunk = cursor.fetchmany(1000)
if not chunk:
break
rows.extend(dict(zip(cols, row)) for row in chunk)
return rows
def collect_databases(cursor: Any) -> list[str]:
rows = _dictfetch(
cursor,
"SELECT database_name FROM svv_redshift_databases ORDER BY database_name",
)
return [r["database_name"] for r in rows if r["database_name"] not in DB_EXCLUSIONS]
def collect_tables(cursor: Any, db: str) -> list[dict[str, Any]]:
schema_list = ", ".join(f"'{s}'" for s in SCHEMA_EXCLUSIONS)
return _dictfetch(
cursor,
f"""
SELECT
database AS db,
schema,
"table" AS table_name,
"rows" AS row_count,
size * 1024 * 1024 AS byte_count
FROM svv_table_info
WHERE database = %s
AND schema NOT IN ({schema_list})
ORDER BY schema, "table"
""", # ← SUBSTITUTE: add additional WHERE clauses to narrow scope
(db,),
)
def collect_columns(cursor: Any, db: str, schema: str, table: str) -> list[dict[str, Any]]:
return _dictfetch(
cursor,
"""
SELECT column_name, data_type, remarks AS comment
FROM svv_columns
WHERE table_catalog = %s
AND table_schema = %s
AND table_name = %s
ORDER BY ordinal_position
""",
(db, schema, table),
)
def collect(
host: str,
db: str,
user: str,
password: str,
manifest_path: str = "manifest_metadata.json",
port: int = 5439,
) -> list[dict[str, Any]]:
"""Connect to Redshift, collect metadata, write a JSON manifest, and return asset dicts."""
_check_available_memory()
collected_at = datetime.now(timezone.utc).isoformat()
assets: list[dict[str, Any]] = []
conn = psycopg2.connect(
host=host, # ← SUBSTITUTE
port=port,
dbname=db, # ← SUBSTITUTE
user=user, # ← SUBSTITUTE
password=password, # ← SUBSTITUTE
connect_timeout=30,
)
try:
with conn.cursor() as cursor:
databases = collect_databases(cursor)
log.info("Found databases: %s", databases)
for database in databases:
tables = collect_tables(cursor, database)
log.info("Database %s: %d tables", database, len(tables))
for t in tables:
schema = t["schema"]
table_name = t["table_name"]
columns = collect_columns(cursor, database, schema, table_name)
fields = [
{
"name": col["column_name"],
"type": col["data_type"].upper(),
"description": col.get("comment") or None,
}
for col in columns
]
asset = {
"asset_name": table_name,
"database": database, # ← SUBSTITUTE: use database as top-level namespace
"schema": schema,
"asset_type": "TABLE",
"fields": fields,
"row_count": t.get("row_count"),
"byte_count": t.get("byte_count"),
}
assets.append(asset)
log.info("Collected %s.%s.%s", database, schema, table_name)
finally:
conn.close()
manifest = {
"resource_type": RESOURCE_TYPE,
"collected_at": collected_at,
"asset_count": len(assets),
"assets": assets,
}
with open(manifest_path, "w") as fh:
json.dump(manifest, fh, indent=2)
log.info("Manifest written to %s (%d assets)", manifest_path, len(assets))
return assets
def main() -> None:
parser = argparse.ArgumentParser(description="Collect Redshift metadata to a manifest file")
parser.add_argument("--host", default=os.getenv("REDSHIFT_HOST")) # ← SUBSTITUTE
parser.add_argument("--db", default=os.getenv("REDSHIFT_DB")) # ← SUBSTITUTE
parser.add_argument("--user", default=os.getenv("REDSHIFT_USER")) # ← SUBSTITUTE
parser.add_argument("--password", default=os.getenv("REDSHIFT_PASSWORD")) # ← SUBSTITUTE
parser.add_argument("--port", type=int, default=int(os.getenv("REDSHIFT_PORT", "5439")))
parser.add_argument("--manifest", default="manifest_metadata.json")
args = parser.parse_args()
required = ["host", "db", "user", "password"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
collect(
host=args.host,
db=args.db,
user=args.user,
password=args.password,
manifest_path=args.manifest,
port=args.port,
)
if __name__ == "__main__":
main()
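The `_dictfetch` helper shared by these collectors streams rows with `fetchmany` instead of `fetchall`, keeping memory bounded on large result sets. A sketch of the pattern against a stub cursor (the `FakeCursor` class is invented here purely to exercise the logic; it only mimics the parts of the psycopg2 cursor surface that `_dictfetch` touches):

```python
from collections import namedtuple

Col = namedtuple("Col", ["name"])

class FakeCursor:
    """Stub mimicking the psycopg2 cursor surface used by _dictfetch."""
    def __init__(self, cols, rows):
        self.description = [Col(c) for c in cols]
        self._rows = rows
    def execute(self, sql, params=None):
        pass
    def fetchmany(self, size):
        chunk, self._rows = self._rows[:size], self._rows[size:]
        return chunk

def dictfetch(cursor, sql, params=None):
    # Same chunked fetch as the scripts above: pull rows 1000 at a time
    # and zip each tuple with the column names into a dict.
    cursor.execute(sql, params)
    cols = [d.name for d in cursor.description]
    rows = []
    while True:
        chunk = cursor.fetchmany(1000)
        if not chunk:
            break
        rows.extend(dict(zip(cols, row)) for row in chunk)
    return rows

print(dictfetch(FakeCursor(["schema", "table_name"], [("public", "orders")]), "SELECT ..."))
# [{'schema': 'public', 'table_name': 'orders'}]
```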


@@ -0,0 +1,239 @@
"""
Redshift — Query Log Collection (collect-only)
================================================
Collects completed query execution records from Redshift using sys_query_history
and sys_querytext (modern RA3/serverless), assembles full SQL text from
multi-row text chunks, and writes a JSON manifest file that can be consumed
by push_query_logs.py.
Substitution points (search for "← SUBSTITUTE"):
- REDSHIFT_HOST / REDSHIFT_DB / REDSHIFT_USER / REDSHIFT_PASSWORD : connection
- LOOKBACK_HOURS : hours back from [now - LAG_HOURS] to collect (default 25)
- LOOKBACK_LAG_HOURS: lag behind now to avoid in-flight queries (default 1)
- BATCH_SIZE : number of query_ids to fetch texts for in one SQL call
- MAX_QUERIES : maximum query rows to process per run
Prerequisites:
pip install psycopg2-binary
"""
from __future__ import annotations
import argparse
import json
import logging
import os
from datetime import datetime, timezone
from typing import Any
import psycopg2
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
LOG_TYPE = "redshift"
LOOKBACK_HOURS: int = int(os.getenv("LOOKBACK_HOURS", "25")) # ← SUBSTITUTE
LOOKBACK_LAG_HOURS: int = int(os.getenv("LOOKBACK_LAG_HOURS", "1")) # ← SUBSTITUTE
BATCH_SIZE: int = int(os.getenv("BATCH_SIZE", "200")) # ← SUBSTITUTE
MAX_QUERIES: int = int(os.getenv("MAX_QUERIES", "10000")) # ← SUBSTITUTE
def _check_available_memory(min_gb: float = 2.0) -> None:
"""Warn if available memory is below the threshold."""
try:
if hasattr(os, "sysconf"): # Linux / macOS
page_size = os.sysconf("SC_PAGE_SIZE")
avail_pages = os.sysconf("SC_AVPHYS_PAGES")
avail_gb = (page_size * avail_pages) / (1024 ** 3)
else:
return # Windows — skip check
except (ValueError, OSError):
return
if avail_gb < min_gb:
log.warning(
"Only %.1f GB of memory available (minimum recommended: %.1f GB). "
"Consider reducing the collection scope or increasing available memory.",
avail_gb,
min_gb,
)
def _dictfetch(cursor: Any, sql: str, params: tuple | None = None) -> list[dict[str, Any]]:
cursor.execute(sql, params)
cols = [d.name for d in cursor.description]
rows = []
while True:
chunk = cursor.fetchmany(1000)
if not chunk:
break
rows.extend(dict(zip(cols, row)) for row in chunk)
return rows
def _safe_isoformat(dt: Any) -> str | None:
if dt is None:
return None
if hasattr(dt, "isoformat"):
if dt.tzinfo is None:
dt = dt.replace(tzinfo=timezone.utc)
return dt.isoformat()
return str(dt)
def fetch_query_metadata(
cursor: Any,
lookback_hours: int,
lag_hours: int,
max_queries: int,
) -> list[dict[str, Any]]:
"""Fetch query execution metadata from sys_query_history."""
return _dictfetch(
cursor,
f"""
SELECT
query_id,
start_time,
end_time,
status,
user_id,
database_name,
elapsed_time
FROM sys_query_history
WHERE start_time >= DATEADD(hour, -{lookback_hours}, GETDATE())
AND start_time < DATEADD(hour, -{lag_hours}, GETDATE())
AND status = 'success'
ORDER BY start_time
LIMIT {max_queries}
""", # ← SUBSTITUTE: add AND database_name = 'mydb' to narrow scope
)
def fetch_query_texts_batch(cursor: Any, query_ids: list[int]) -> dict[int, str]:
"""Batch-fetch and assemble multi-row query texts for a list of query_ids."""
if not query_ids:
return {}
# Build a comma-separated ID list for the IN clause to avoid large parameter arrays
id_list = ", ".join(str(qid) for qid in query_ids)
rows = _dictfetch(
cursor,
f"""
SELECT
query_id,
LISTAGG(
CASE WHEN LEN(text) <= 200 THEN text ELSE LEFT(text, 200) END,
''
) WITHIN GROUP (ORDER BY sequence) AS query_text
FROM sys_querytext
WHERE query_id IN ({id_list})
GROUP BY query_id
""",
)
return {r["query_id"]: r["query_text"] for r in rows if r.get("query_text")}
def collect(
host: str,
db: str,
user: str,
password: str,
manifest_path: str = "manifest_query_logs.json",
port: int = 5439,
lookback_hours: int = LOOKBACK_HOURS,
lookback_lag_hours: int = LOOKBACK_LAG_HOURS,
batch_size: int = BATCH_SIZE,
max_queries: int = MAX_QUERIES,
) -> list[dict[str, Any]]:
"""Connect to Redshift, collect query logs, write a JSON manifest, and return entries."""
_check_available_memory()
collected_at = datetime.now(timezone.utc).isoformat()
conn = psycopg2.connect(
host=host, port=port, dbname=db, user=user, password=password, connect_timeout=30,
)
try:
with conn.cursor() as cursor:
query_meta = fetch_query_metadata(cursor, lookback_hours, lookback_lag_hours, max_queries)
log.info("Retrieved %d query metadata rows", len(query_meta))
# Batch-fetch texts to avoid enormous single queries
query_ids = [r["query_id"] for r in query_meta]
text_map: dict[int, str] = {}
for i in range(0, len(query_ids), batch_size):
batch = query_ids[i : i + batch_size]
text_map.update(fetch_query_texts_batch(cursor, batch))
log.debug("Fetched texts for batch %d-%d", i, i + len(batch))
finally:
conn.close()
entries: list[dict[str, Any]] = []
for row in query_meta:
qid = row["query_id"]
query_text = text_map.get(qid, "")
if not query_text.strip():
continue # ← SUBSTITUTE: decide whether to push rows with missing text
entry = {
"query_id": str(qid),
"query_text": query_text,
"start_time": _safe_isoformat(row.get("start_time")),
"end_time": _safe_isoformat(row.get("end_time")),
"user": str(row.get("user_id")) if row.get("user_id") is not None else None,
"database_name": row.get("database_name"),
"elapsed_time_us": row.get("elapsed_time"),
}
entries.append(entry)
log.info("Collected %d query log entries", len(entries))
manifest = {
"log_type": LOG_TYPE,
"collected_at": collected_at,
"lookback_hours": lookback_hours,
"lookback_lag_hours": lookback_lag_hours,
"query_log_count": len(entries),
"entries": entries,
}
with open(manifest_path, "w") as fh:
json.dump(manifest, fh, indent=2)
log.info("Manifest written to %s (%d entries)", manifest_path, len(entries))
return entries
def main() -> None:
parser = argparse.ArgumentParser(description="Collect Redshift query logs to a manifest file")
parser.add_argument("--host", default=os.getenv("REDSHIFT_HOST")) # ← SUBSTITUTE
parser.add_argument("--db", default=os.getenv("REDSHIFT_DB")) # ← SUBSTITUTE
parser.add_argument("--user", default=os.getenv("REDSHIFT_USER")) # ← SUBSTITUTE
parser.add_argument("--password", default=os.getenv("REDSHIFT_PASSWORD")) # ← SUBSTITUTE
parser.add_argument("--port", type=int, default=int(os.getenv("REDSHIFT_PORT", "5439")))
parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
parser.add_argument("--lookback-lag-hours", type=int, default=LOOKBACK_LAG_HOURS)
parser.add_argument("--batch-size", type=int, default=BATCH_SIZE)
parser.add_argument("--max-queries", type=int, default=MAX_QUERIES)
parser.add_argument("--manifest", default="manifest_query_logs.json")
args = parser.parse_args()
required = ["host", "db", "user", "password"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
collect(
host=args.host,
db=args.db,
user=args.user,
password=args.password,
manifest_path=args.manifest,
port=args.port,
lookback_hours=args.lookback_hours,
lookback_lag_hours=args.lookback_lag_hours,
batch_size=args.batch_size,
max_queries=args.max_queries,
)
if __name__ == "__main__":
main()
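One detail worth noting in the collector above: `_safe_isoformat` assumes that naive timestamps coming back from Redshift are UTC, attaching `tzinfo` before serializing into the manifest. A standalone sketch of that normalization:

```python
from datetime import datetime, timezone

def safe_isoformat(dt):
    # Same normalization as _safe_isoformat above: naive timestamps are
    # treated as UTC so manifest entries always carry an offset.
    if dt is None:
        return None
    if hasattr(dt, "isoformat"):
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.isoformat()
    return str(dt)

print(safe_isoformat(datetime(2024, 1, 2, 3, 4, 5)))  # 2024-01-02T03:04:05+00:00
print(safe_isoformat(None))  # None
```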


@@ -0,0 +1,178 @@
"""
Redshift — Lineage Push (push-only)
=====================================
Reads a JSON manifest file produced by collect_lineage.py and pushes the lineage
events to Monte Carlo via the push ingestion API, with configurable batching to
keep compressed payloads under 1 MB.
Substitution points (search for "← SUBSTITUTE"):
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Redshift connection in Monte Carlo
- PUSH_BATCH_SIZE : number of events per API call (default 500)
Prerequisites:
pip install pycarlo
"""
from __future__ import annotations
import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from typing import Any
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
LineageAssetRef,
LineageEvent,
)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
RESOURCE_TYPE = "redshift"
DEFAULT_BATCH_SIZE = 500 # ← SUBSTITUTE: conservative default to stay under 1 MB compressed
def _ref_from_dict(d: dict[str, Any]) -> LineageAssetRef:
return LineageAssetRef(
type="TABLE",
name=d["asset_name"],
database=d.get("database", ""),
schema=d.get("schema", ""),
)
def _event_from_dict(d: dict[str, Any]) -> LineageEvent:
"""Reconstruct a LineageEvent from a manifest dict."""
sources = [_ref_from_dict(s) for s in d.get("sources", [])]
destination = _ref_from_dict(d["destination"])
return LineageEvent(
sources=sources,
destination=destination,
)
def push(
manifest_path: str,
resource_uuid: str,
key_id: str,
key_token: str,
batch_size: int = DEFAULT_BATCH_SIZE,
) -> dict[str, Any]:
"""Read a collect manifest and push lineage events to Monte Carlo in batches.
Returns a summary dict with invocation IDs and counts.
"""
with open(manifest_path) as fh:
manifest = json.load(fh)
event_dicts: list[dict[str, Any]] = manifest["events"]
events = [_event_from_dict(d) for d in event_dicts]
log.info("Loaded %d lineage events from %s", len(events), manifest_path)
if not events:
log.info("No lineage events to push.")
summary = {
"resource_uuid": resource_uuid,
"resource_type": RESOURCE_TYPE,
"invocation_ids": [],
"pushed_at": datetime.now(timezone.utc).isoformat(),
"event_count": 0,
"batch_count": 0,
"batch_size": batch_size,
}
push_manifest_path = manifest_path.replace(".json", "_push_result.json")
with open(push_manifest_path, "w") as fh:
json.dump(summary, fh, indent=2)
return summary
# Split into batches
batches = []
for i in range(0, len(events), batch_size):
batches.append(events[i : i + batch_size])
total_batches = len(batches)
def _push_batch(batch: list, batch_num: int) -> str | None:
"""Push a single batch using a dedicated Session (thread-safe)."""
log.info("Pushing batch %d/%d (%d events) ...", batch_num, total_batches, len(batch))
client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
service = IngestionService(mc_client=client)
result = service.send_lineage(
resource_uuid=resource_uuid,
resource_type=RESOURCE_TYPE,
events=batch,
)
invocation_id = service.extract_invocation_id(result)
if invocation_id:
log.info("Batch %d: invocation_id=%s", batch_num, invocation_id)
return invocation_id
# Push batches in parallel (each thread gets its own pycarlo Session)
max_workers = min(4, total_batches)
invocation_ids: list[str | None] = [None] * total_batches
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = {
pool.submit(_push_batch, batch, i + 1): i
for i, batch in enumerate(batches)
}
for future in as_completed(futures):
idx = futures[future]
try:
invocation_ids[idx] = future.result()
except Exception as exc:
log.error("ERROR pushing batch %d: %s", idx + 1, exc)
raise
log.info("All %d batches pushed (%d workers)", total_batches, max_workers)
summary = {
"resource_uuid": resource_uuid,
"resource_type": RESOURCE_TYPE,
"invocation_ids": invocation_ids,
"pushed_at": datetime.now(timezone.utc).isoformat(),
"event_count": len(events),
"batch_count": total_batches,
"batch_size": batch_size,
"lookback_hours": manifest.get("lookback_hours"),
"queries_scanned": manifest.get("queries_scanned"),
}
push_manifest_path = manifest_path.replace(".json", "_push_result.json")
with open(push_manifest_path, "w") as fh:
json.dump(summary, fh, indent=2)
log.info("Push result written to %s", push_manifest_path)
return summary
def main() -> None:
parser = argparse.ArgumentParser(description="Push Redshift lineage to Monte Carlo from manifest")
parser.add_argument("--manifest", default="manifest_lineage.json")
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
args = parser.parse_args()
required = ["resource_uuid", "key_id", "key_token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
push(
manifest_path=args.manifest,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
)
if __name__ == "__main__":
main()
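The parallel push in this script relies on indexing results through the `futures` map so that `invocation_ids` stays aligned with batch order even though `as_completed` yields futures in completion order. A minimal sketch of that pattern with a fake push function (`fake_push` stands in for `_push_batch`; no network calls are made):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_push(batch, batch_num):
    # Stand-in for _push_batch: returns a deterministic "invocation id".
    return f"inv-{batch_num}"

batches = [[1, 2], [3, 4], [5]]
results = [None] * len(batches)
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {pool.submit(fake_push, b, i + 1): i for i, b in enumerate(batches)}
    for future in as_completed(futures):
        # Recover each batch's original index from the futures map, so the
        # results list matches batch order regardless of completion order.
        results[futures[future]] = future.result()
print(results)  # ['inv-1', 'inv-2', 'inv-3']
```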


@@ -0,0 +1,178 @@
"""
Redshift — Metadata Push (push-only)
======================================
Reads a JSON manifest file produced by collect_metadata.py and pushes the assets
to Monte Carlo via the push ingestion API, with configurable batching to keep
compressed payloads under 1 MB.
Substitution points (search for "← SUBSTITUTE"):
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Redshift connection in Monte Carlo
- PUSH_BATCH_SIZE : number of assets per API call (default 500)
Prerequisites:
pip install pycarlo
"""
from __future__ import annotations
import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from typing import Any
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
AssetField,
AssetFreshness,
AssetMetadata,
AssetVolume,
RelationalAsset,
)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
RESOURCE_TYPE = "redshift"
DEFAULT_BATCH_SIZE = 500 # ← SUBSTITUTE: conservative default to stay under 1 MB compressed
def _asset_from_dict(d: dict[str, Any]) -> RelationalAsset:
"""Reconstruct a RelationalAsset from a manifest dict."""
fields = [
AssetField(
name=f["name"],
type=f.get("type"),
description=f.get("description"),
)
for f in d.get("fields", [])
]
volume = None
if d.get("row_count") is not None or d.get("byte_count") is not None:
volume = AssetVolume(
row_count=d.get("row_count"),
byte_count=d.get("byte_count"),
)
freshness = None
if d.get("last_updated") is not None:
freshness = AssetFreshness(last_update_time=d.get("last_updated"))
return RelationalAsset(
type=d.get("asset_type", "TABLE"),
metadata=AssetMetadata(
name=d["asset_name"],
database=d["database"], # ← SUBSTITUTE: use database as top-level namespace
schema=d["schema"],
description=d.get("description"),
),
fields=fields,
volume=volume,
freshness=freshness,
)
def push(
manifest_path: str,
resource_uuid: str,
key_id: str,
key_token: str,
batch_size: int = DEFAULT_BATCH_SIZE,
) -> dict[str, Any]:
"""Read a collect manifest and push assets to Monte Carlo in batches.
Returns a summary dict with invocation IDs and counts.
"""
with open(manifest_path) as fh:
manifest = json.load(fh)
asset_dicts: list[dict[str, Any]] = manifest["assets"]
assets = [_asset_from_dict(d) for d in asset_dicts]
log.info("Loaded %d assets from %s", len(assets), manifest_path)
# Split into batches
batches = []
for i in range(0, max(len(assets), 1), batch_size):
batches.append(assets[i : i + batch_size])
total_batches = len(batches)
def _push_batch(batch: list, batch_num: int) -> str | None:
"""Push a single batch using a dedicated Session (thread-safe)."""
client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
service = IngestionService(mc_client=client)
result = service.send_metadata(
resource_uuid=resource_uuid,
resource_type=RESOURCE_TYPE,
events=batch,
)
invocation_id = service.extract_invocation_id(result)
log.info("Pushed batch %d/%d (%d assets) — invocation_id=%s", batch_num, total_batches, len(batch), invocation_id)
return invocation_id
# Push batches in parallel (each thread gets its own pycarlo Session)
max_workers = min(4, total_batches)
invocation_ids: list[str | None] = [None] * total_batches
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = {
pool.submit(_push_batch, batch, i + 1): i
for i, batch in enumerate(batches)
}
for future in as_completed(futures):
idx = futures[future]
try:
invocation_ids[idx] = future.result()
except Exception as exc:
log.error("ERROR pushing batch %d: %s", idx + 1, exc)
raise
log.info("All %d batches pushed (%d workers)", total_batches, max_workers)
summary = {
"resource_uuid": resource_uuid,
"resource_type": RESOURCE_TYPE,
"invocation_ids": invocation_ids,
"pushed_at": datetime.now(timezone.utc).isoformat(),
"asset_count": len(assets),
"batch_count": total_batches,
"batch_size": batch_size,
}
push_manifest_path = manifest_path.replace(".json", "_push_result.json")
with open(push_manifest_path, "w") as fh:
json.dump(summary, fh, indent=2)
log.info("Push result written to %s", push_manifest_path)
return summary
def main() -> None:
parser = argparse.ArgumentParser(description="Push Redshift metadata to Monte Carlo from manifest")
parser.add_argument("--manifest", default="manifest_metadata.json")
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
args = parser.parse_args()
required = ["resource_uuid", "key_id", "key_token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
push(
manifest_path=args.manifest,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
)
if __name__ == "__main__":
main()

@@ -0,0 +1,196 @@
"""
Redshift — Query Log Push (push-only)
=======================================
Reads a JSON manifest file produced by collect_query_logs.py and pushes the query
log entries to Monte Carlo via the push ingestion API, with configurable batching
to keep compressed payloads under 1 MB.
Substitution points (search for "← SUBSTITUTE"):
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Redshift connection in Monte Carlo
- PUSH_BATCH_SIZE : number of entries per API call (default 100)
Prerequisites:
pip install pycarlo python-dateutil
"""
from __future__ import annotations
import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from typing import Any
from dateutil.parser import isoparse
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import QueryLogEntry
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
LOG_TYPE = "redshift"
DEFAULT_BATCH_SIZE = 100 # ← SUBSTITUTE: conservative default to stay under 1 MB compressed
# Truncate query_text longer than this to prevent 413 errors.
# Some SQL statements (e.g., generated by BI tools) can be 100KB+ and blow up
# compressed payloads even at small batch sizes.
_MAX_QUERY_TEXT_LEN = 10_000
def _build_query_log_entries(entry_dicts: list[dict[str, Any]]) -> list[QueryLogEntry]:
"""Convert manifest query dicts into QueryLogEntry objects."""
entries = []
truncated = 0
for d in entry_dicts:
query_text = d.get("query_text") or ""
# Truncate very long SQL to prevent 413 Request Too Large
if len(query_text) > _MAX_QUERY_TEXT_LEN:
query_text = query_text[:_MAX_QUERY_TEXT_LEN] + "... [TRUNCATED]"
truncated += 1
extra = {}
if d.get("database_name") is not None:
extra["database_name"] = d["database_name"]
if d.get("elapsed_time_us") is not None:
extra["elapsed_time_us"] = d["elapsed_time_us"]
start_time = d.get("start_time")
end_time = d.get("end_time")
entries.append(
QueryLogEntry(
query_id=d.get("query_id"),
query_text=query_text,
start_time=isoparse(start_time) if start_time else None,
end_time=isoparse(end_time) if end_time else None,
user=d.get("user"),
extra=extra or None,
)
)
if truncated:
log.info("Truncated %d query text(s) exceeding %d chars", truncated, _MAX_QUERY_TEXT_LEN)
return entries
def push(
manifest_path: str,
resource_uuid: str,
key_id: str,
key_token: str,
batch_size: int = DEFAULT_BATCH_SIZE,
) -> dict[str, Any]:
"""Read a collect manifest and push query log entries to Monte Carlo in batches.
Returns a summary dict with invocation IDs and counts.
"""
with open(manifest_path) as fh:
manifest = json.load(fh)
entry_dicts: list[dict[str, Any]] = manifest["entries"]
entries = _build_query_log_entries(entry_dicts)
log.info("Loaded %d query log entries from %s", len(entries), manifest_path)
if not entries:
log.info("No query log entries to push.")
summary = {
"resource_uuid": resource_uuid,
"log_type": LOG_TYPE,
"invocation_ids": [],
"pushed_at": datetime.now(timezone.utc).isoformat(),
"query_log_count": 0,
"batch_count": 0,
"batch_size": batch_size,
}
push_manifest_path = manifest_path.replace(".json", "_push_result.json")
with open(push_manifest_path, "w") as fh:
json.dump(summary, fh, indent=2)
return summary
# Split into batches
batches = []
for i in range(0, len(entries), batch_size):
batches.append(entries[i : i + batch_size])
total_batches = len(batches)
def _push_batch(batch: list, batch_num: int) -> str | None:
"""Push a single batch using a dedicated Session (thread-safe)."""
client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
service = IngestionService(mc_client=client)
result = service.send_query_logs(
resource_uuid=resource_uuid,
log_type=LOG_TYPE,
events=batch,
)
invocation_id = service.extract_invocation_id(result)
log.info("Pushed batch %d/%d (%d entries) — invocation_id=%s", batch_num, total_batches, len(batch), invocation_id)
return invocation_id
# Push batches in parallel (each thread gets its own pycarlo Session)
max_workers = min(4, total_batches)
invocation_ids: list[str | None] = [None] * total_batches
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = {
pool.submit(_push_batch, batch, i + 1): i
for i, batch in enumerate(batches)
}
for future in as_completed(futures):
idx = futures[future]
try:
invocation_ids[idx] = future.result()
except Exception as exc:
log.error("ERROR pushing batch %d: %s", idx + 1, exc)
raise
log.info("All %d batches pushed (%d workers)", total_batches, max_workers)
summary = {
"resource_uuid": resource_uuid,
"log_type": LOG_TYPE,
"invocation_ids": invocation_ids,
"pushed_at": datetime.now(timezone.utc).isoformat(),
"query_log_count": len(entries),
"batch_count": total_batches,
"batch_size": batch_size,
"lookback_hours": manifest.get("lookback_hours"),
"lookback_lag_hours": manifest.get("lookback_lag_hours"),
}
push_manifest_path = manifest_path.replace(".json", "_push_result.json")
with open(push_manifest_path, "w") as fh:
json.dump(summary, fh, indent=2)
log.info("Push result written to %s", push_manifest_path)
return summary
def main() -> None:
parser = argparse.ArgumentParser(description="Push Redshift query logs to Monte Carlo from manifest")
parser.add_argument("--manifest", default="manifest_query_logs.json")
parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
args = parser.parse_args()
required = ["resource_uuid", "key_id", "key_token"]
missing = [k for k in required if getattr(args, k) is None]
if missing:
parser.error(f"Missing required arguments/env vars: {missing}")
push(
manifest_path=args.manifest,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
)
if __name__ == "__main__":
main()
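The query-text truncation guard above can be exercised standalone. `MAX_LEN` here is an arbitrary illustrative limit, not the script's 10,000-character default:

```python
MAX_LEN = 20  # illustrative limit; the script uses 10_000


def truncate_query_text(query_text: str, max_len: int = MAX_LEN) -> tuple[str, bool]:
    """Return (possibly truncated text, whether truncation happened)."""
    if len(query_text) > max_len:
        return query_text[:max_len] + "... [TRUNCATED]", True
    return query_text, False


short, t1 = truncate_query_text("SELECT 1")
long_sql, t2 = truncate_query_text("SELECT " + ", ".join(f"col{i}" for i in range(50)))
print(t1, t2)  # False True
```

Note the truncated string still grows by the length of the `... [TRUNCATED]` marker, so the effective cap is `max_len` plus the marker length.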

@@ -0,0 +1,154 @@
#!/usr/bin/env python3
"""
Collect lineage from Snowflake and push it to Monte Carlo — combined.
Imports ``collect()`` from ``collect_lineage`` and ``push()`` from
``push_lineage``, then runs both in sequence.
Substitution points
-------------------
- SNOWFLAKE_ACCOUNT (env) / --account (CLI) : Snowflake account identifier
- SNOWFLAKE_USER (env) / --user (CLI) : Snowflake username
- SNOWFLAKE_PASSWORD (env) / --password (CLI) : Snowflake password
- SNOWFLAKE_WAREHOUSE (env) / --warehouse (CLI) : Snowflake virtual warehouse
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID for this connection
Prerequisites
-------------
pip install pycarlo snowflake-connector-python
Usage (table-level):
python collect_and_push_lineage.py \\
--account <SNOWFLAKE_ACCOUNT> \\
--user <SNOWFLAKE_USER> \\
--password <SNOWFLAKE_PASSWORD> \\
--warehouse <SNOWFLAKE_WAREHOUSE> \\
--key-id <MCD_INGEST_ID> \\
--key-token <MCD_INGEST_TOKEN> \\
--resource-uuid <MCD_RESOURCE_UUID>
Usage (column-level):
python collect_and_push_lineage.py ... --column-lineage
"""
from __future__ import annotations
import argparse
import os
from collect_lineage import collect, _LOOKBACK_HOURS
from push_lineage import push, _BATCH_SIZE
def main() -> None:
parser = argparse.ArgumentParser(
description="Collect Snowflake lineage from ACCOUNT_USAGE and push to Monte Carlo",
)
parser.add_argument(
"--account",
default=os.environ.get("SNOWFLAKE_ACCOUNT"),
help="Snowflake account identifier (env: SNOWFLAKE_ACCOUNT)",
)
parser.add_argument(
"--user",
default=os.environ.get("SNOWFLAKE_USER"),
help="Snowflake username (env: SNOWFLAKE_USER)",
)
parser.add_argument(
"--password",
default=os.environ.get("SNOWFLAKE_PASSWORD"),
help="Snowflake password (env: SNOWFLAKE_PASSWORD)",
)
parser.add_argument(
"--warehouse",
default=os.environ.get("SNOWFLAKE_WAREHOUSE"),
help="Snowflake virtual warehouse (env: SNOWFLAKE_WAREHOUSE)",
)
parser.add_argument(
"--key-id",
default=os.environ.get("MCD_INGEST_ID"),
help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",
)
parser.add_argument(
"--key-token",
default=os.environ.get("MCD_INGEST_TOKEN"),
help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",
)
parser.add_argument(
"--resource-uuid",
default=os.environ.get("MCD_RESOURCE_UUID"),
help="Monte Carlo resource UUID for this Snowflake connection (env: MCD_RESOURCE_UUID)",
)
parser.add_argument(
"--lookback-hours",
type=int,
default=_LOOKBACK_HOURS,
help=f"Hours of QUERY_HISTORY to scan (default: {_LOOKBACK_HOURS})",
)
parser.add_argument(
"--column-lineage",
action="store_true",
help="Push column-level lineage instead of table-level",
)
parser.add_argument(
"--output-file",
default="lineage_output.json",
help="Path for the intermediate collect manifest (default: lineage_output.json)",
)
parser.add_argument(
"--push-result-file",
default="lineage_push_result.json",
help="Path to write the push result (default: lineage_push_result.json)",
)
parser.add_argument(
"--batch-size",
type=int,
default=_BATCH_SIZE,
help=f"Max events per push batch (default: {_BATCH_SIZE})",
)
args = parser.parse_args()
missing = [
name
for name, val in [
("--account", args.account),
("--user", args.user),
("--password", args.password),
("--warehouse", args.warehouse),
("--key-id", args.key_id),
("--key-token", args.key_token),
("--resource-uuid", args.resource_uuid),
]
if not val
]
if missing:
parser.error(f"Missing required arguments: {', '.join(missing)}")
# Step 1: Collect
collect(
account=args.account,
user=args.user,
password=args.password,
warehouse=args.warehouse,
lookback_hours=args.lookback_hours,
column_lineage=args.column_lineage,
output_file=args.output_file,
)
# Step 2: Push
push(
input_file=args.output_file,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
output_file=args.push_result_file,
)
print("Done.")
if __name__ == "__main__":
main()
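The env-var fallback and missing-argument check shared by these wrapper scripts can be sketched as follows. The environment variable is cleared first so the demo is deterministic regardless of the host environment:

```python
import argparse
import os

os.environ.pop("MCD_INGEST_ID", None)  # ensure a deterministic demo

parser = argparse.ArgumentParser()
# The CLI flag falls back to the environment variable when not supplied
parser.add_argument("--key-id", default=os.environ.get("MCD_INGEST_ID"))
args = parser.parse_args([])  # simulate running with no CLI arguments

missing = [name for name, val in [("--key-id", args.key_id)] if not val]
print(missing)  # ['--key-id']
```

Because the env-var lookup happens at `add_argument` time, exporting `MCD_INGEST_ID` before invoking the script satisfies the check just as a `--key-id` flag would.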

@@ -0,0 +1,137 @@
#!/usr/bin/env python3
"""
Collect table metadata from Snowflake and push it to Monte Carlo — combined.
Imports ``collect()`` from ``collect_metadata`` and ``push()`` from
``push_metadata``, then runs both in sequence.
Substitution points
-------------------
- SNOWFLAKE_ACCOUNT (env) / --account (CLI) : Snowflake account identifier (e.g. xy12345.us-east-1)
- SNOWFLAKE_USER (env) / --user (CLI) : Snowflake username
- SNOWFLAKE_PASSWORD (env) / --password (CLI) : Snowflake password
- SNOWFLAKE_WAREHOUSE (env) / --warehouse (CLI) : Snowflake virtual warehouse
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID for this connection
Prerequisites
-------------
pip install pycarlo snowflake-connector-python
Usage
-----
python collect_and_push_metadata.py \\
--account <SNOWFLAKE_ACCOUNT> \\
--user <SNOWFLAKE_USER> \\
--password <SNOWFLAKE_PASSWORD> \\
--warehouse <SNOWFLAKE_WAREHOUSE> \\
--key-id <MCD_INGEST_ID> \\
--key-token <MCD_INGEST_TOKEN> \\
--resource-uuid <MCD_RESOURCE_UUID>
"""
import argparse
import os
from collect_metadata import collect
from push_metadata import push, _BATCH_SIZE
def main() -> None:
parser = argparse.ArgumentParser(
description="Collect Snowflake table metadata and push to Monte Carlo",
)
parser.add_argument(
"--account",
default=os.environ.get("SNOWFLAKE_ACCOUNT"),
help="Snowflake account identifier, e.g. xy12345.us-east-1 (env: SNOWFLAKE_ACCOUNT)", # ← SUBSTITUTE
)
parser.add_argument(
"--user",
default=os.environ.get("SNOWFLAKE_USER"),
help="Snowflake username (env: SNOWFLAKE_USER)", # ← SUBSTITUTE
)
parser.add_argument(
"--password",
default=os.environ.get("SNOWFLAKE_PASSWORD"),
help="Snowflake password (env: SNOWFLAKE_PASSWORD)", # ← SUBSTITUTE
)
parser.add_argument(
"--warehouse",
default=os.environ.get("SNOWFLAKE_WAREHOUSE"),
help="Snowflake virtual warehouse (env: SNOWFLAKE_WAREHOUSE)", # ← SUBSTITUTE
)
parser.add_argument(
"--key-id",
default=os.environ.get("MCD_INGEST_ID"),
help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",
)
parser.add_argument(
"--key-token",
default=os.environ.get("MCD_INGEST_TOKEN"),
help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",
)
parser.add_argument(
"--resource-uuid",
default=os.environ.get("MCD_RESOURCE_UUID"),
help="Monte Carlo resource UUID for this Snowflake connection (env: MCD_RESOURCE_UUID)",
)
parser.add_argument(
"--output-file",
default="metadata_output.json",
help="Path for the intermediate collect manifest (default: metadata_output.json)",
)
parser.add_argument(
"--push-result-file",
default="metadata_push_result.json",
help="Path to write the push result (default: metadata_push_result.json)",
)
parser.add_argument(
"--batch-size",
type=int,
default=_BATCH_SIZE,
help=f"Max assets per push batch (default: {_BATCH_SIZE})",
)
args = parser.parse_args()
missing = [
name
for name, val in [
("--account", args.account),
("--user", args.user),
("--password", args.password),
("--warehouse", args.warehouse),
("--key-id", args.key_id),
("--key-token", args.key_token),
("--resource-uuid", args.resource_uuid),
]
if not val
]
if missing:
parser.error(f"Missing required arguments: {', '.join(missing)}")
# Step 1: Collect
collect(
account=args.account,
user=args.user,
password=args.password,
warehouse=args.warehouse,
output_file=args.output_file,
)
# Step 2: Push
push(
input_file=args.output_file,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
output_file=args.push_result_file,
)
print("Done.")
if __name__ == "__main__":
main()

@@ -0,0 +1,137 @@
#!/usr/bin/env python3
"""
Collect query logs from Snowflake and push them to Monte Carlo — combined.
Imports ``collect()`` from ``collect_query_logs`` and ``push()`` from
``push_query_logs``, then runs both in sequence.
Substitution points
-------------------
- SNOWFLAKE_ACCOUNT (env) / --account (CLI) : Snowflake account identifier
- SNOWFLAKE_USER (env) / --user (CLI) : Snowflake username
- SNOWFLAKE_PASSWORD (env) / --password (CLI) : Snowflake password
- SNOWFLAKE_WAREHOUSE (env) / --warehouse (CLI) : Snowflake virtual warehouse
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID for this connection
Prerequisites
-------------
pip install pycarlo snowflake-connector-python
Usage
-----
python collect_and_push_query_logs.py \\
--account <SNOWFLAKE_ACCOUNT> \\
--user <SNOWFLAKE_USER> \\
--password <SNOWFLAKE_PASSWORD> \\
--warehouse <SNOWFLAKE_WAREHOUSE> \\
--key-id <MCD_INGEST_ID> \\
--key-token <MCD_INGEST_TOKEN> \\
--resource-uuid <MCD_RESOURCE_UUID>
"""
import argparse
import os
from collect_query_logs import collect
from push_query_logs import push, _BATCH_SIZE
def main() -> None:
parser = argparse.ArgumentParser(
description="Collect Snowflake query logs from ACCOUNT_USAGE and push to Monte Carlo",
)
parser.add_argument(
"--account",
default=os.environ.get("SNOWFLAKE_ACCOUNT"),
help="Snowflake account identifier, e.g. xy12345.us-east-1 (env: SNOWFLAKE_ACCOUNT)", # ← SUBSTITUTE
)
parser.add_argument(
"--user",
default=os.environ.get("SNOWFLAKE_USER"),
help="Snowflake username (env: SNOWFLAKE_USER)",
)
parser.add_argument(
"--password",
default=os.environ.get("SNOWFLAKE_PASSWORD"),
help="Snowflake password (env: SNOWFLAKE_PASSWORD)",
)
parser.add_argument(
"--warehouse",
default=os.environ.get("SNOWFLAKE_WAREHOUSE"),
help="Snowflake virtual warehouse (env: SNOWFLAKE_WAREHOUSE)", # ← SUBSTITUTE
)
parser.add_argument(
"--key-id",
default=os.environ.get("MCD_INGEST_ID"),
help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",
)
parser.add_argument(
"--key-token",
default=os.environ.get("MCD_INGEST_TOKEN"),
help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",
)
parser.add_argument(
"--resource-uuid",
default=os.environ.get("MCD_RESOURCE_UUID"),
help="Monte Carlo resource UUID for this Snowflake connection (env: MCD_RESOURCE_UUID)",
)
parser.add_argument(
"--output-file",
default="query_logs_output.json",
help="Path for the intermediate collect manifest (default: query_logs_output.json)",
)
parser.add_argument(
"--push-result-file",
default="query_logs_push_result.json",
help="Path to write the push result (default: query_logs_push_result.json)",
)
parser.add_argument(
"--batch-size",
type=int,
default=_BATCH_SIZE,
help=f"Max entries per push batch (default: {_BATCH_SIZE})",
)
args = parser.parse_args()
missing = [
name
for name, val in [
("--account", args.account),
("--user", args.user),
("--password", args.password),
("--warehouse", args.warehouse),
("--key-id", args.key_id),
("--key-token", args.key_token),
("--resource-uuid", args.resource_uuid),
]
if not val
]
if missing:
parser.error(f"Missing required arguments: {', '.join(missing)}")
# Step 1: Collect
collect(
account=args.account,
user=args.user,
password=args.password,
warehouse=args.warehouse,
output_file=args.output_file,
)
# Step 2: Push
push(
input_file=args.output_file,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
output_file=args.push_result_file,
)
print("Done.")
if __name__ == "__main__":
main()
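Each collect/push pair communicates through an intermediate JSON manifest on disk rather than passing data in memory, which makes each half independently re-runnable. The handoff reduces to a plain JSON round-trip:

```python
import json
import tempfile
from datetime import datetime, timezone

# Step 1: collect() writes a manifest describing what was gathered
manifest = {
    "resource_type": "snowflake",
    "collected_at": datetime.now(tz=timezone.utc).isoformat(),
    "edges": [],
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fh:
    json.dump(manifest, fh, indent=2)
    manifest_path = fh.name

# Step 2: push() reads the same file back before batching and sending
with open(manifest_path) as fh:
    loaded = json.load(fh)

print(loaded["resource_type"])  # snowflake
```

A failed push can therefore be retried from the saved manifest without re-querying the warehouse.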

@@ -0,0 +1,349 @@
#!/usr/bin/env python3
"""
Collect table and column lineage from Snowflake — collection only.
Queries ACCOUNT_USAGE for DML/DDL statements in the last 24 hours, parses each
QUERY_TEXT with regex to extract source and destination tables, then writes the
resulting lineage edges to a JSON manifest file.
Can be run standalone via CLI or imported (use the ``collect()`` function).
Note: ACCOUNT_USAGE views have an approximate latency of 45 minutes, so very
recent queries may not yet appear.
Substitution points
-------------------
- SNOWFLAKE_ACCOUNT (env) / --account (CLI) : Snowflake account identifier
- SNOWFLAKE_USER (env) / --user (CLI) : Snowflake username
- SNOWFLAKE_PASSWORD (env) / --password (CLI) : Snowflake password
- SNOWFLAKE_WAREHOUSE (env) / --warehouse (CLI) : Snowflake virtual warehouse
Prerequisites
-------------
pip install snowflake-connector-python
Usage (table-level):
python collect_lineage.py \\
--account <SNOWFLAKE_ACCOUNT> \\
--user <SNOWFLAKE_USER> \\
--password <SNOWFLAKE_PASSWORD> \\
--warehouse <SNOWFLAKE_WAREHOUSE>
Usage (column-level):
python collect_lineage.py ... --column-lineage
"""
from __future__ import annotations
import argparse
import json
import os
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone
import snowflake.connector
# ← SUBSTITUTE: set RESOURCE_TYPE to match your Monte Carlo connection type
RESOURCE_TYPE = "snowflake"
def _check_available_memory(min_gb: float = 2.0) -> None:
"""Warn if available memory is below the threshold."""
try:
if hasattr(os, "sysconf"): # Linux / macOS
page_size = os.sysconf("SC_PAGE_SIZE")
avail_pages = os.sysconf("SC_AVPHYS_PAGES")
avail_gb = (page_size * avail_pages) / (1024 ** 3)
else:
return # Windows — skip check
except (ValueError, OSError):
return
if avail_gb < min_gb:
print(
f"WARNING: Only {avail_gb:.1f} GB of memory available "
f"(minimum recommended: {min_gb:.1f} GB). "
f"Consider reducing the lookback window or increasing available memory."
)
# Hours to look back in ACCOUNT_USAGE.QUERY_HISTORY
# ← SUBSTITUTE: adjust the lookback window to match your collection cadence
_LOOKBACK_HOURS = 24
# Regex for CTAS: CREATE [OR REPLACE] [TRANSIENT] TABLE [IF NOT EXISTS] [db.][schema.]table AS SELECT
_CTAS_RE = re.compile(
r"CREATE\s+(?:OR\s+REPLACE\s+)?(?:TRANSIENT\s+)?TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?"
r"(?:(?P<dest_db>\w+)\.)?(?:(?P<dest_schema>\w+)\.)?(?P<dest_table>\w+)"
r".*?AS\s+SELECT\s+(?P<select_cols>.+?)\s+FROM\s+"
r"(?:(?P<src_db>\w+)\.)?(?:(?P<src_schema>\w+)\.)?(?P<src_table>\w+)",
re.IGNORECASE | re.DOTALL,
)
# Regex for INSERT INTO [db.][schema.]table SELECT ... FROM [db.][schema.]table
_INSERT_RE = re.compile(
r"INSERT\s+(?:INTO|OVERWRITE)\s+"
r"(?:(?P<dest_db>\w+)\.)?(?:(?P<dest_schema>\w+)\.)?(?P<dest_table>\w+)"
r".*?SELECT\s+(?P<select_cols>.+?)\s+FROM\s+"
r"(?:(?P<src_db>\w+)\.)?(?:(?P<src_schema>\w+)\.)?(?P<src_table>\w+)",
re.IGNORECASE | re.DOTALL,
)
# Regex for CREATE [OR REPLACE] VIEW [db.][schema.]view AS SELECT ... FROM ...
_CREATE_VIEW_RE = re.compile(
r"CREATE\s+(?:OR\s+REPLACE\s+)?(?:SECURE\s+)?VIEW\s+"
r"(?:(?P<dest_db>\w+)\.)?(?:(?P<dest_schema>\w+)\.)?(?P<dest_table>\w+)"
r".*?AS\s+SELECT\s+(?P<select_cols>.+?)\s+FROM\s+"
r"(?:(?P<src_db>\w+)\.)?(?:(?P<src_schema>\w+)\.)?(?P<src_table>\w+)",
re.IGNORECASE | re.DOTALL,
)
# Additional JOIN sources
_JOIN_RE = re.compile(
r"JOIN\s+(?:(?P<src_db>\w+)\.)?(?:(?P<src_schema>\w+)\.)?(?P<src_table>\w+)",
re.IGNORECASE,
)
# Simple column alias extraction from SELECT clause
_COL_RE = re.compile(r"(?:(\w+)\.)?(\w+)(?:\s+AS\s+(\w+))?", re.IGNORECASE)
_SQL_KEYWORDS = {
"FROM", "SELECT", "WHERE", "JOIN", "ON", "AS", "*", "AND", "OR",
"GROUP", "ORDER", "BY", "HAVING", "LIMIT", "DISTINCT", "CASE", "WHEN",
"THEN", "ELSE", "END", "NULL", "NOT", "IN", "IS", "BETWEEN",
}
@dataclass
class _LineageEdge:
dest_db: str
dest_schema: str
dest_table: str
sources: list[tuple[str, str, str]] = field(default_factory=list)
# col_mappings: (dest_col, src_table, src_col)
col_mappings: list[tuple[str, str, str]] = field(default_factory=list)
def _parse_select_cols(select_clause: str, src_table: str) -> list[tuple[str, str, str]]:
mappings = []
for m in _COL_RE.finditer(select_clause):
src_col = m.group(2)
dest_col = m.group(3) or src_col
if src_col.upper() in _SQL_KEYWORDS:
continue
mappings.append((dest_col, src_table, src_col))
return mappings
def _parse_edges(rows: list[dict]) -> list[_LineageEdge]:
"""Parse QUERY_HISTORY rows into _LineageEdge objects."""
edges: dict[str, _LineageEdge] = {}
for row in rows:
query_text = row.get("QUERY_TEXT") or ""
default_db = (row.get("DATABASE_NAME") or "").lower()
sql_clean = re.sub(r"\s+", " ", query_text).strip()
for pattern in (_CTAS_RE, _INSERT_RE, _CREATE_VIEW_RE):
m = pattern.search(sql_clean)
if not m:
continue
dest_db = (m.group("dest_db") or default_db).lower()
dest_schema = (m.group("dest_schema") or "public").lower()
dest_table = m.group("dest_table").lower()
src_db = (m.group("src_db") or default_db).lower()
src_schema = (m.group("src_schema") or "public").lower()
src_table = m.group("src_table").lower()
select_cols = m.group("select_cols")
key = f"{dest_db}.{dest_schema}.{dest_table}"
if key not in edges:
edges[key] = _LineageEdge(
dest_db=dest_db, dest_schema=dest_schema, dest_table=dest_table
)
edge = edges[key]
src_triple = (src_db, src_schema, src_table)
if src_triple not in edge.sources:
edge.sources.append(src_triple)
for jm in _JOIN_RE.finditer(sql_clean):
jt = jm.group("src_table").lower()
jschema = (jm.group("src_schema") or src_schema).lower()
jdb = (jm.group("src_db") or src_db).lower()
jp = (jdb, jschema, jt)
if jp not in edge.sources:
edge.sources.append(jp)
edge.col_mappings.extend(_parse_select_cols(select_cols, src_table))
break
return list(edges.values())
def _fetch_query_history(conn, lookback_hours: int) -> list[dict]:
cursor = conn.cursor()
cursor.execute(
f"""
SELECT QUERY_ID, QUERY_TEXT, START_TIME, END_TIME, USER_NAME, DATABASE_NAME, EXECUTION_STATUS
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE START_TIME >= DATEADD(hour, -{lookback_hours}, CURRENT_TIMESTAMP())
AND EXECUTION_STATUS = 'SUCCESS'
AND QUERY_TYPE IN ('CREATE_TABLE_AS_SELECT', 'INSERT', 'MERGE', 'CREATE_VIEW')
ORDER BY START_TIME
LIMIT 50000
"""
# ← SUBSTITUTE: adjust QUERY_TYPE list, LIMIT, or add a WHERE clause to scope to specific databases
)
columns = [col[0] for col in cursor.description]
rows = []
while True:
batch = cursor.fetchmany(1000)
if not batch:
break
rows.extend(dict(zip(columns, row)) for row in batch)
cursor.close()
return rows
def collect(
account: str,
user: str,
password: str,
warehouse: str,
lookback_hours: int = _LOOKBACK_HOURS,
column_lineage: bool = False,
output_file: str = "lineage_output.json",
) -> dict:
"""
Connect to Snowflake, collect lineage edges, and write a JSON manifest.
Returns the manifest dict.
"""
_check_available_memory()
print(f"Connecting to Snowflake account: {account} ...")
conn = snowflake.connector.connect(
account=account,
user=user,
password=password,
warehouse=warehouse,
)
print(f"Fetching QUERY_HISTORY for the last {lookback_hours} hour(s) ...")
rows = _fetch_query_history(conn, lookback_hours)
conn.close()
    print(f" Retrieved {len(rows)} qualifying queries.")
if not rows:
print("No lineage queries found in the specified window.")
manifest = {
"resource_type": RESOURCE_TYPE,
"collected_at": datetime.now(tz=timezone.utc).isoformat(),
"column_lineage": column_lineage,
"edges": [],
}
with open(output_file, "w") as fh:
json.dump(manifest, fh, indent=2)
return manifest
edges = _parse_edges(rows)
print(f" Parsed {len(edges)} lineage edge(s).")
manifest = {
"resource_type": RESOURCE_TYPE,
"collected_at": datetime.now(tz=timezone.utc).isoformat(),
"column_lineage": column_lineage,
"edges": [
{
"destination": {
"database": e.dest_db,
"schema": e.dest_schema,
"table": e.dest_table,
},
"sources": [
{"database": sdb, "schema": sschema, "table": stbl}
for sdb, sschema, stbl in e.sources
],
"col_mappings": [
{"dest_col": dc, "src_table": st, "src_col": sc}
for dc, st, sc in e.col_mappings
],
}
for e in edges
],
}
with open(output_file, "w") as fh:
json.dump(manifest, fh, indent=2)
print(f"Lineage manifest written to {output_file}")
return manifest
def main() -> None:
parser = argparse.ArgumentParser(
description="Collect Snowflake lineage from ACCOUNT_USAGE and write to a manifest file",
)
parser.add_argument(
"--account",
default=os.environ.get("SNOWFLAKE_ACCOUNT"),
help="Snowflake account identifier (env: SNOWFLAKE_ACCOUNT)",
)
parser.add_argument(
"--user",
default=os.environ.get("SNOWFLAKE_USER"),
help="Snowflake username (env: SNOWFLAKE_USER)",
)
parser.add_argument(
"--password",
default=os.environ.get("SNOWFLAKE_PASSWORD"),
help="Snowflake password (env: SNOWFLAKE_PASSWORD)",
)
parser.add_argument(
"--warehouse",
default=os.environ.get("SNOWFLAKE_WAREHOUSE"),
help="Snowflake virtual warehouse (env: SNOWFLAKE_WAREHOUSE)",
)
parser.add_argument(
"--lookback-hours",
type=int,
default=_LOOKBACK_HOURS,
help=f"Hours of QUERY_HISTORY to scan (default: {_LOOKBACK_HOURS})",
)
parser.add_argument(
"--column-lineage",
action="store_true",
help="Include column-level lineage mappings in the manifest",
)
parser.add_argument(
"--output-file",
default="lineage_output.json",
help="Path to write the lineage manifest (default: lineage_output.json)",
)
args = parser.parse_args()
missing = [
name
for name, val in [
("--account", args.account),
("--user", args.user),
("--password", args.password),
("--warehouse", args.warehouse),
]
if not val
]
if missing:
parser.error(f"Missing required arguments: {', '.join(missing)}")
collect(
account=args.account,
user=args.user,
password=args.password,
warehouse=args.warehouse,
lookback_hours=args.lookback_hours,
column_lineage=args.column_lineage,
output_file=args.output_file,
)
print("Done.")
if __name__ == "__main__":
main()
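The CTAS pattern above can be checked against a sample statement; the database, schema, and table names below are made up for illustration:

```python
import re

# Same pattern as _CTAS_RE in collect_lineage.py above
_CTAS_RE = re.compile(
    r"CREATE\s+(?:OR\s+REPLACE\s+)?(?:TRANSIENT\s+)?TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?"
    r"(?:(?P<dest_db>\w+)\.)?(?:(?P<dest_schema>\w+)\.)?(?P<dest_table>\w+)"
    r".*?AS\s+SELECT\s+(?P<select_cols>.+?)\s+FROM\s+"
    r"(?:(?P<src_db>\w+)\.)?(?:(?P<src_schema>\w+)\.)?(?P<src_table>\w+)",
    re.IGNORECASE | re.DOTALL,
)

sql = (
    "CREATE OR REPLACE TABLE analytics.marts.orders AS "
    "SELECT id, total FROM analytics.staging.raw_orders"
)
m = _CTAS_RE.search(sql)
print(m.group("dest_table"), "<-", m.group("src_table"))  # orders <- raw_orders
```

Regex parsing like this covers common CTAS/INSERT/VIEW shapes but not arbitrary SQL (CTEs, subqueries, quoted identifiers), which is why the script treats the extracted edges as best-effort lineage.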

@@ -0,0 +1,329 @@
#!/usr/bin/env python3
"""
Collect table metadata from Snowflake — collection only.
Connects to Snowflake, discovers all accessible databases and schemas, then
queries INFORMATION_SCHEMA.TABLES for volume/freshness and
INFORMATION_SCHEMA.COLUMNS for field definitions. The collected assets are
written to a JSON manifest file.
Can be run standalone via CLI or imported (use the ``collect()`` function).
Substitution points
-------------------
- SNOWFLAKE_ACCOUNT (env) / --account (CLI) : Snowflake account identifier (e.g. xy12345.us-east-1)
- SNOWFLAKE_USER (env) / --user (CLI) : Snowflake username
- SNOWFLAKE_PASSWORD (env) / --password (CLI) : Snowflake password
- SNOWFLAKE_WAREHOUSE (env) / --warehouse (CLI) : Snowflake virtual warehouse
Prerequisites
-------------
pip install snowflake-connector-python
Usage
-----
python collect_metadata.py \\
--account <SNOWFLAKE_ACCOUNT> \\
--user <SNOWFLAKE_USER> \\
--password <SNOWFLAKE_PASSWORD> \\
--warehouse <SNOWFLAKE_WAREHOUSE>
"""
import argparse
import json
import os
from datetime import datetime, timezone
import snowflake.connector
# ← SUBSTITUTE: set RESOURCE_TYPE to match your Monte Carlo connection type
RESOURCE_TYPE = "snowflake"
def _check_available_memory(min_gb: float = 2.0) -> None:
"""Warn if available memory is below the threshold."""
try:
if hasattr(os, "sysconf"): # Linux / macOS
page_size = os.sysconf("SC_PAGE_SIZE")
avail_pages = os.sysconf("SC_AVPHYS_PAGES")
avail_gb = (page_size * avail_pages) / (1024 ** 3)
else:
return # Windows — skip check
except (ValueError, OSError):
return
if avail_gb < min_gb:
print(
f"WARNING: Only {avail_gb:.1f} GB of memory available "
f"(minimum recommended: {min_gb:.1f} GB). "
f"Consider reducing the lookback window or increasing available memory."
)
# Databases that are Snowflake system databases — skip them
_SKIP_DATABASES = {"SNOWFLAKE", "SNOWFLAKE_SAMPLE_DATA"}
# Schemas that are Snowflake system schemas — skip them
_SKIP_SCHEMAS = {"INFORMATION_SCHEMA"}
# Snowflake TABLE_TYPE → Monte Carlo RelationalAsset.type mapping.
# The MC API only accepts "TABLE" or "VIEW" (uppercase).
_TABLE_TYPE_MAP = {
"BASE TABLE": "TABLE",
"TABLE": "TABLE",
"DYNAMIC TABLE": "TABLE",
"EXTERNAL TABLE": "TABLE",
"VIEW": "VIEW",
"MATERIALIZED VIEW": "VIEW",
"SECURE VIEW": "VIEW",
}
def _normalize_table_type(raw_type: str | None) -> str:
"""Map Snowflake's TABLE_TYPE value to MC-accepted 'TABLE' or 'VIEW'."""
if not raw_type:
return "TABLE"
return _TABLE_TYPE_MAP.get(raw_type.upper(), "TABLE")
def _connect(account: str, user: str, password: str, warehouse: str):
# ← SUBSTITUTE: add role= or authenticator= kwargs if your org requires them
return snowflake.connector.connect(
account=account,
user=user,
password=password,
warehouse=warehouse,
)
def _collect_assets(conn) -> list[dict]:
"""Collect table metadata from Snowflake and return as a list of dicts."""
cursor = conn.cursor()
assets: list[dict] = []
# --- Discover databases ---
cursor.execute("SHOW DATABASES")
# SHOW DATABASES returns (created_on, name, …); column index 1 is the name
all_db_rows = []
while True:
chunk = cursor.fetchmany(1000)
if not chunk:
break
all_db_rows.extend(chunk)
databases = [row[1] for row in all_db_rows if row[1] not in _SKIP_DATABASES]
print(f" Found {len(databases)} database(s): {databases}")
for db in databases:
# --- Discover schemas in each database ---
try:
cursor.execute(f'SHOW SCHEMAS IN DATABASE "{db}"')
except Exception as exc:
print(f" WARNING: could not list schemas in {db}: {exc}")
continue
# Column index 1 is the schema name
all_schema_rows = []
while True:
chunk = cursor.fetchmany(1000)
if not chunk:
break
all_schema_rows.extend(chunk)
schemas = [row[1] for row in all_schema_rows if row[1] not in _SKIP_SCHEMAS]
# --- Collect tables, volume, and freshness via INFORMATION_SCHEMA ---
try:
cursor.execute(
f"""
SELECT
TABLE_CATALOG,
TABLE_SCHEMA,
TABLE_NAME,
TABLE_TYPE,
ROW_COUNT,
BYTES,
LAST_ALTERED,
COMMENT
FROM "{db}".INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA != 'INFORMATION_SCHEMA'
ORDER BY TABLE_SCHEMA, TABLE_NAME
"""
)
except Exception as exc:
print(f" WARNING: could not query INFORMATION_SCHEMA.TABLES in {db}: {exc}")
continue
table_rows = []
while True:
chunk = cursor.fetchmany(1000)
if not chunk:
break
table_rows.extend(chunk)
print(f" {db}: {len(table_rows)} table(s)")
# Build a set of schema names present in the table result to know which
# INFORMATION_SCHEMA.COLUMNS queries to run
schemas_with_tables: set[str] = {row[1] for row in table_rows}
# Pre-fetch all columns for this database in one query per schema
columns_by_table: dict[tuple[str, str], list[dict]] = {}
for schema in schemas_with_tables:
if schema not in schemas:
continue # respect the earlier schema skip list
try:
cursor.execute(
f"""
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, COMMENT
FROM "{db}".INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = %s
ORDER BY TABLE_NAME, ORDINAL_POSITION
""",
(schema,),
)
except Exception as exc:
print(f" WARNING: could not fetch columns for {db}.{schema}: {exc}")
continue
all_col_rows = []
while True:
chunk = cursor.fetchmany(1000)
if not chunk:
break
all_col_rows.extend(chunk)
for col_row in all_col_rows:
table_name, col_name, data_type, col_comment = col_row
key = (schema, table_name)
if key not in columns_by_table:
columns_by_table[key] = []
columns_by_table[key].append(
{
"name": col_name,
"type": data_type,
"description": col_comment or None,
}
)
# Build asset dicts
for row in table_rows:
tbl_catalog, tbl_schema, tbl_name, tbl_type, row_count, byte_count, last_altered, tbl_comment = row
volume = None
if row_count is not None or byte_count is not None:
volume = {
"row_count": int(row_count) if row_count is not None else None,
"byte_count": int(byte_count) if byte_count is not None else None,
}
freshness = None
if last_altered is not None:
freshness = {
"last_update_time": last_altered.isoformat() if hasattr(last_altered, "isoformat") else str(last_altered),
}
fields = columns_by_table.get((tbl_schema, tbl_name), [])
assets.append(
{
"type": _normalize_table_type(tbl_type),
"database": tbl_catalog,
"schema": tbl_schema,
"name": tbl_name,
"description": tbl_comment or None,
"fields": fields,
"volume": volume,
"freshness": freshness,
}
)
print(f" + {tbl_catalog}.{tbl_schema}.{tbl_name} ({len(fields)} columns)")
cursor.close()
return assets
def collect(
account: str,
user: str,
password: str,
warehouse: str,
output_file: str = "metadata_output.json",
) -> dict:
"""
Connect to Snowflake, collect table metadata, and write a JSON manifest.
Returns the manifest dict.
"""
_check_available_memory()
print(f"Connecting to Snowflake account: {account} ...")
conn = _connect(account, user, password, warehouse)
print("Collecting table metadata ...")
assets = _collect_assets(conn)
conn.close()
print(f"\nCollected {len(assets)} table(s).")
manifest = {
"resource_type": RESOURCE_TYPE,
"collected_at": datetime.now(tz=timezone.utc).isoformat(),
"assets": assets,
}
with open(output_file, "w") as fh:
json.dump(manifest, fh, indent=2)
print(f"Asset manifest written to {output_file}")
return manifest
def main() -> None:
parser = argparse.ArgumentParser(
description="Collect Snowflake table metadata and write to a manifest file",
)
parser.add_argument(
"--account",
default=os.environ.get("SNOWFLAKE_ACCOUNT"),
help="Snowflake account identifier, e.g. xy12345.us-east-1 (env: SNOWFLAKE_ACCOUNT)", # ← SUBSTITUTE
)
parser.add_argument(
"--user",
default=os.environ.get("SNOWFLAKE_USER"),
help="Snowflake username (env: SNOWFLAKE_USER)", # ← SUBSTITUTE
)
parser.add_argument(
"--password",
default=os.environ.get("SNOWFLAKE_PASSWORD"),
help="Snowflake password (env: SNOWFLAKE_PASSWORD)", # ← SUBSTITUTE
)
parser.add_argument(
"--warehouse",
default=os.environ.get("SNOWFLAKE_WAREHOUSE"),
help="Snowflake virtual warehouse (env: SNOWFLAKE_WAREHOUSE)", # ← SUBSTITUTE
)
parser.add_argument(
"--output-file",
default="metadata_output.json",
help="Path to write the output manifest (default: metadata_output.json)",
)
args = parser.parse_args()
missing = [
name
for name, val in [
("--account", args.account),
("--user", args.user),
("--password", args.password),
("--warehouse", args.warehouse),
]
if not val
]
if missing:
parser.error(f"Missing required arguments: {', '.join(missing)}")
collect(
account=args.account,
user=args.user,
password=args.password,
warehouse=args.warehouse,
output_file=args.output_file,
)
print("Done.")
if __name__ == "__main__":
main()
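The `TABLE_TYPE` normalization above is pure lookup logic and can be sanity-checked without a Snowflake connection. A minimal standalone mirror of `_TABLE_TYPE_MAP` and `_normalize_table_type`:

```python
# Standalone mirror of the script's TABLE_TYPE normalization, for quick checks.
# Snowflake reports several TABLE_TYPE values; Monte Carlo accepts only
# uppercase "TABLE" or "VIEW".
TABLE_TYPE_MAP = {
    "BASE TABLE": "TABLE",
    "TABLE": "TABLE",
    "DYNAMIC TABLE": "TABLE",
    "EXTERNAL TABLE": "TABLE",
    "VIEW": "VIEW",
    "MATERIALIZED VIEW": "VIEW",
    "SECURE VIEW": "VIEW",
}

def normalize_table_type(raw_type):
    # Missing or unmapped types fall back to "TABLE", matching the script.
    if not raw_type:
        return "TABLE"
    return TABLE_TYPE_MAP.get(raw_type.upper(), "TABLE")

print(normalize_table_type("base table"))   # TABLE
print(normalize_table_type("Secure View"))  # VIEW
print(normalize_table_type(None))           # TABLE
print(normalize_table_type("STREAM"))       # TABLE (unmapped fallback)
```

The uppercase call makes the mapping case-insensitive, which matters because `SHOW`-style commands and `INFORMATION_SCHEMA` can differ in casing.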

@@ -0,0 +1,254 @@
#!/usr/bin/env python3
"""
Collect query logs from Snowflake ACCOUNT_USAGE.QUERY_HISTORY — collection only.
Queries a 24-hour window ending 1 hour ago (ACCOUNT_USAGE views have an
approximate 45-minute ingestion latency, so the last hour is intentionally
skipped to avoid incomplete data). The collected query logs are written to a
JSON manifest file.
Can be run standalone via CLI or imported (use the ``collect()`` function).
Substitution points
-------------------
- SNOWFLAKE_ACCOUNT (env) / --account (CLI) : Snowflake account identifier
- SNOWFLAKE_USER (env) / --user (CLI) : Snowflake username
- SNOWFLAKE_PASSWORD (env) / --password (CLI) : Snowflake password
- SNOWFLAKE_WAREHOUSE (env) / --warehouse (CLI) : Snowflake virtual warehouse
Prerequisites
-------------
pip install snowflake-connector-python
Usage
-----
python collect_query_logs.py \\
--account <SNOWFLAKE_ACCOUNT> \\
--user <SNOWFLAKE_USER> \\
--password <SNOWFLAKE_PASSWORD> \\
--warehouse <SNOWFLAKE_WAREHOUSE>
"""
import argparse
import json
import os
from datetime import datetime, timezone
import snowflake.connector
# ← SUBSTITUTE: set LOG_TYPE to match your warehouse type (query logs use log_type, not resource_type)
LOG_TYPE = "snowflake"
def _check_available_memory(min_gb: float = 2.0) -> None:
"""Warn if available memory is below the threshold."""
try:
if hasattr(os, "sysconf"): # Linux / macOS
page_size = os.sysconf("SC_PAGE_SIZE")
avail_pages = os.sysconf("SC_AVPHYS_PAGES")
avail_gb = (page_size * avail_pages) / (1024 ** 3)
else:
return # Windows — skip check
except (ValueError, OSError):
return
if avail_gb < min_gb:
print(
f"WARNING: Only {avail_gb:.1f} GB of memory available "
f"(minimum recommended: {min_gb:.1f} GB). "
f"Consider reducing the lookback window or increasing available memory."
)
# How many hours to look back from the trailing-edge cutoff
# ← SUBSTITUTE: adjust to match your collection cadence (e.g. 2 for every-2-hours runs)
_WINDOW_HOURS = 25
# Hours to skip at the trailing edge — ACCOUNT_USAGE has ~45-minute latency;
# skipping 1 hour provides a comfortable buffer.
# ← SUBSTITUTE: lower to 0 if you have confirmed real-time access to ACCOUNT_USAGE
_TRAILING_SKIP_HOURS = 1
# Maximum rows to collect per run — increase if your warehouse has higher query volume
# ← SUBSTITUTE: adjust based on your Snowflake query volume
_QUERY_LIMIT = 10000
def _fetch_query_history(conn) -> list[dict]:
"""
Fetch recent query history from SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY.
Collection window: [NOW - _WINDOW_HOURS, NOW - _TRAILING_SKIP_HOURS]
This intentionally excludes the most recent hour to avoid the ACCOUNT_USAGE
ingestion latency gap.
"""
cursor = conn.cursor()
cursor.execute(
f"""
SELECT
QUERY_ID,
QUERY_TEXT,
START_TIME,
END_TIME,
USER_NAME,
DATABASE_NAME,
WAREHOUSE_NAME,
BYTES_SCANNED,
ROWS_PRODUCED,
EXECUTION_STATUS,
QUERY_TAG,
ROLE_NAME
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE START_TIME >= DATEADD(hour, -{_WINDOW_HOURS}, CURRENT_TIMESTAMP())
AND START_TIME < DATEADD(hour, -{_TRAILING_SKIP_HOURS}, CURRENT_TIMESTAMP())
AND EXECUTION_STATUS = 'SUCCESS'
ORDER BY START_TIME
LIMIT {_QUERY_LIMIT}
"""
# ← SUBSTITUTE: add AND DATABASE_NAME = '<db>' or AND WAREHOUSE_NAME = '<wh>'
# to restrict collection to a specific database or warehouse
)
columns = [col[0] for col in cursor.description]
rows = []
while True:
chunk = cursor.fetchmany(1000)
if not chunk:
break
rows.extend(dict(zip(columns, row)) for row in chunk)
cursor.close()
return rows
def _iso(dt: object) -> str | None:
if dt is None:
return None
return dt.isoformat() if hasattr(dt, "isoformat") else str(dt)
def collect(
account: str,
user: str,
password: str,
warehouse: str,
output_file: str = "query_logs_output.json",
) -> dict:
"""
Connect to Snowflake, collect query logs, and write a JSON manifest.
Returns the manifest dict.
"""
_check_available_memory()
print(f"Connecting to Snowflake account: {account} ...")
conn = snowflake.connector.connect(
account=account,
user=user,
password=password,
warehouse=warehouse,
)
print(
f"Fetching QUERY_HISTORY (last {_WINDOW_HOURS}h, excluding final {_TRAILING_SKIP_HOURS}h, "
f"limit {_QUERY_LIMIT}) ..."
)
rows = _fetch_query_history(conn)
conn.close()
print(f" Retrieved {len(rows)} query log row(s).")
if not rows:
print("No query log rows found in the specified window.")
manifest = {
"log_type": LOG_TYPE,
"collected_at": datetime.now(tz=timezone.utc).isoformat(),
"entry_count": 0,
"window_start": None,
"window_end": None,
"queries": [],
}
with open(output_file, "w") as fh:
json.dump(manifest, fh, indent=2, default=str)
return manifest
start_times = [r["START_TIME"] for r in rows if r.get("START_TIME") is not None]
end_times = [r["END_TIME"] for r in rows if r.get("END_TIME") is not None]
manifest = {
"log_type": LOG_TYPE,
"collected_at": datetime.now(tz=timezone.utc).isoformat(),
"entry_count": len(rows),
"window_start": _iso(min(start_times)) if start_times else None,
"window_end": _iso(max(end_times)) if end_times else None,
"queries": [
{
"query_id": r.get("QUERY_ID"),
"query_text": r.get("QUERY_TEXT") or "",
"start_time": _iso(r.get("START_TIME")),
"end_time": _iso(r.get("END_TIME")),
"user": r.get("USER_NAME"),
"warehouse": r.get("WAREHOUSE_NAME"),
"bytes_scanned": r.get("BYTES_SCANNED"),
"rows_produced": r.get("ROWS_PRODUCED"),
}
for r in rows
],
}
with open(output_file, "w") as fh:
json.dump(manifest, fh, indent=2, default=str)
print(f"Query log manifest written to {output_file}")
return manifest
def main() -> None:
parser = argparse.ArgumentParser(
description="Collect Snowflake query logs from ACCOUNT_USAGE and write to a manifest file",
)
parser.add_argument(
"--account",
default=os.environ.get("SNOWFLAKE_ACCOUNT"),
help="Snowflake account identifier, e.g. xy12345.us-east-1 (env: SNOWFLAKE_ACCOUNT)", # ← SUBSTITUTE
)
parser.add_argument(
"--user",
default=os.environ.get("SNOWFLAKE_USER"),
help="Snowflake username (env: SNOWFLAKE_USER)",
)
parser.add_argument(
"--password",
default=os.environ.get("SNOWFLAKE_PASSWORD"),
help="Snowflake password (env: SNOWFLAKE_PASSWORD)",
)
parser.add_argument(
"--warehouse",
default=os.environ.get("SNOWFLAKE_WAREHOUSE"),
help="Snowflake virtual warehouse (env: SNOWFLAKE_WAREHOUSE)", # ← SUBSTITUTE
)
parser.add_argument(
"--output-file",
default="query_logs_output.json",
help="Path to write the output manifest (default: query_logs_output.json)",
)
args = parser.parse_args()
missing = [
name
for name, val in [
("--account", args.account),
("--user", args.user),
("--password", args.password),
("--warehouse", args.warehouse),
]
if not val
]
if missing:
parser.error(f"Missing required arguments: {', '.join(missing)}")
collect(
account=args.account,
user=args.user,
password=args.password,
warehouse=args.warehouse,
output_file=args.output_file,
)
print("Done.")
if __name__ == "__main__":
main()
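The collection window described in the docstring, `[NOW - 25h, NOW - 1h]`, can be sketched with plain `datetime` arithmetic to confirm it spans exactly 24 hours of complete data:

```python
from datetime import datetime, timedelta, timezone

WINDOW_HOURS = 25        # lookback from now, as in _WINDOW_HOURS
TRAILING_SKIP_HOURS = 1  # skipped to absorb ACCOUNT_USAGE's ~45-min latency

now = datetime.now(tz=timezone.utc)
window_start = now - timedelta(hours=WINDOW_HOURS)
window_end = now - timedelta(hours=TRAILING_SKIP_HOURS)

span = window_end - window_start
print(f"window: [{window_start.isoformat()}, {window_end.isoformat()}]")
print(f"span:   {span}")  # 24:00:00, a full day of settled query history
```

With a 25-hour lookback and a 1-hour trailing skip, consecutive daily runs overlap by one hour, so rows that land near a window edge are not silently dropped.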

@@ -0,0 +1,307 @@
#!/usr/bin/env python3
"""
Push lineage events to Monte Carlo from a JSON manifest — push only.
Reads a manifest file produced by ``collect_lineage.py`` and sends the lineage
events to Monte Carlo using the pycarlo push ingestion API. Large payloads are
split into batches to stay under the 1 MB compressed limit.
Can be run standalone via CLI or imported (use the ``push()`` function).
Substitution points
-------------------
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID for this connection
Prerequisites
-------------
pip install pycarlo
Usage
-----
python push_lineage.py \\
--key-id <MCD_INGEST_ID> \\
--key-token <MCD_INGEST_TOKEN> \\
--resource-uuid <MCD_RESOURCE_UUID> \\
--input-file lineage_output.json
"""
from __future__ import annotations
import argparse
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
ColumnLineageField,
ColumnLineageSourceField,
LineageAssetRef,
LineageEvent,
)
# ← SUBSTITUTE: set RESOURCE_TYPE to match your Monte Carlo connection type
RESOURCE_TYPE = "snowflake"
# Maximum events per batch — conservative default to keep compressed payload under 1 MB
# ← SUBSTITUTE: tune based on average edge complexity (number of sources, column mappings)
_BATCH_SIZE = 500
def _build_table_lineage_events(edges: list[dict]) -> list[LineageEvent]:
"""Build table-level LineageEvent objects from manifest edge dicts."""
events = []
for edge in edges:
dest = edge["destination"]
sources = edge.get("sources", [])
if not sources:
continue
events.append(
LineageEvent(
destination=LineageAssetRef(
type="TABLE",
name=dest["table"],
database=dest["database"],
schema=dest["schema"],
),
sources=[
LineageAssetRef(
type="TABLE",
name=s["table"],
database=s["database"],
schema=s["schema"],
)
for s in sources
],
)
)
return events
def _build_column_lineage_events(edges: list[dict]) -> list[LineageEvent]:
"""Build column-level LineageEvent objects from manifest edge dicts."""
events = []
for edge in edges:
dest = edge["destination"]
sources = edge.get("sources", [])
col_mappings = edge.get("col_mappings", [])
if not sources:
continue
dest_asset_id = f"{dest['database']}__{dest['schema']}__{dest['table']}"
source_asset_ids = {
(s["database"], s["schema"], s["table"]): f"{s['database']}__{s['schema']}__{s['table']}"
for s in sources
}
col_fields: dict[str, ColumnLineageField] = {}
for mapping in col_mappings:
dest_col = mapping["dest_col"]
src_table = mapping["src_table"]
src_col = mapping["src_col"]
# Match src_table to the first source with that table name
match = next(
(s for s in sources if s["table"] == src_table),
sources[0] if sources else None,
)
if not match:
continue
src_aid = source_asset_ids[(match["database"], match["schema"], match["table"])]
if dest_col not in col_fields:
col_fields[dest_col] = ColumnLineageField(name=dest_col, source_fields=[])
col_fields[dest_col].source_fields.append(
ColumnLineageSourceField(asset_id=src_aid, field_name=src_col)
)
events.append(
LineageEvent(
destination=LineageAssetRef(
type="TABLE",
name=dest["table"],
database=dest["database"],
schema=dest["schema"],
asset_id=dest_asset_id,
),
sources=[
LineageAssetRef(
type="TABLE",
name=s["table"],
database=s["database"],
schema=s["schema"],
asset_id=source_asset_ids[(s["database"], s["schema"], s["table"])],
)
for s in sources
],
fields=list(col_fields.values()) if col_fields else None,
)
)
return events
def push(
input_file: str,
resource_uuid: str,
key_id: str,
key_token: str,
batch_size: int = _BATCH_SIZE,
output_file: str = "lineage_push_result.json",
) -> dict:
"""
Read a lineage manifest and push events to Monte Carlo in batches.
Returns a result dict with invocation IDs for each batch.
"""
with open(input_file) as fh:
manifest = json.load(fh)
edges = manifest.get("edges", [])
resource_type = manifest.get("resource_type", RESOURCE_TYPE)
column_lineage = manifest.get("column_lineage", False)
if column_lineage:
events = _build_column_lineage_events(edges)
label = "column-level"
else:
events = _build_table_lineage_events(edges)
label = "table-level"
print(f"Loaded {len(events)} {label} lineage event(s) from {input_file}")
if not events:
print("No lineage events to push.")
push_result = {
"resource_uuid": resource_uuid,
"resource_type": resource_type,
"invocation_ids": [],
"pushed_at": datetime.now(tz=timezone.utc).isoformat(),
"total_events": 0,
"batch_count": 0,
"batch_size": batch_size,
}
with open(output_file, "w") as fh:
json.dump(push_result, fh, indent=2)
return push_result
# Split into batches
batches = []
for i in range(0, len(events), batch_size):
batches.append(events[i : i + batch_size])
total_batches = len(batches)
def _push_batch(batch: list, batch_num: int) -> str | None:
"""Push a single batch using a dedicated Session (thread-safe)."""
print(f" Pushing batch {batch_num}/{total_batches} ({len(batch)} events) ...")
client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
service = IngestionService(mc_client=client)
result = service.send_lineage(
resource_uuid=resource_uuid,
resource_type=resource_type,
events=batch,
)
invocation_id = service.extract_invocation_id(result)
if invocation_id:
print(f" Batch {batch_num}: invocation_id={invocation_id}")
return invocation_id
# Push batches in parallel (each thread gets its own pycarlo Session)
max_workers = min(4, total_batches)
invocation_ids: list[str | None] = [None] * total_batches
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = {
pool.submit(_push_batch, batch, i + 1): i
for i, batch in enumerate(batches)
}
for future in as_completed(futures):
idx = futures[future]
try:
invocation_ids[idx] = future.result()
except Exception as exc:
print(f" ERROR pushing batch {idx + 1}: {exc}")
raise
print(f" All {total_batches} batches pushed ({max_workers} workers)")
push_result = {
"resource_uuid": resource_uuid,
"resource_type": resource_type,
"invocation_ids": invocation_ids,
"pushed_at": datetime.now(tz=timezone.utc).isoformat(),
"total_events": len(events),
"batch_count": total_batches,
"batch_size": batch_size,
"edges": edges, # preserve for downstream validation
}
with open(output_file, "w") as fh:
json.dump(push_result, fh, indent=2)
print(f"Push result written to {output_file}")
return push_result
def main() -> None:
parser = argparse.ArgumentParser(
description="Push Snowflake lineage from a manifest to Monte Carlo",
)
parser.add_argument(
"--key-id",
default=os.environ.get("MCD_INGEST_ID"),
help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",
)
parser.add_argument(
"--key-token",
default=os.environ.get("MCD_INGEST_TOKEN"),
help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",
)
parser.add_argument(
"--resource-uuid",
default=os.environ.get("MCD_RESOURCE_UUID"),
help="Monte Carlo resource UUID for this Snowflake connection (env: MCD_RESOURCE_UUID)",
)
parser.add_argument(
"--input-file",
default="lineage_output.json",
help="Path to the collect manifest to read (default: lineage_output.json)",
)
parser.add_argument(
"--output-file",
default="lineage_push_result.json",
help="Path to write the push result (default: lineage_push_result.json)",
)
parser.add_argument(
"--batch-size",
type=int,
default=_BATCH_SIZE,
help=f"Max events per push batch (default: {_BATCH_SIZE})",
)
args = parser.parse_args()
missing = [
name
for name, val in [
("--key-id", args.key_id),
("--key-token", args.key_token),
("--resource-uuid", args.resource_uuid),
]
if not val
]
if missing:
parser.error(f"Missing required arguments: {', '.join(missing)}")
push(
input_file=args.input_file,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
output_file=args.output_file,
)
print("Done.")
if __name__ == "__main__":
main()
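The column-lineage builder above groups `col_mappings` rows by destination column under `database__schema__table` asset IDs. A stdlib-only sketch of that grouping; the table and column names here are hypothetical, chosen only to illustrate the manifest shape:

```python
# Hypothetical manifest fragment in the shape the collect step emits.
sources = [{"database": "ANALYTICS", "schema": "RAW", "table": "ORDERS"}]
col_mappings = [
    {"dest_col": "order_id", "src_table": "ORDERS", "src_col": "id"},
    {"dest_col": "total", "src_table": "ORDERS", "src_col": "amount"},
    {"dest_col": "total", "src_table": "ORDERS", "src_col": "tax"},
]

def asset_id(s):
    # Same convention push_lineage.py uses: database__schema__table.
    return f"{s['database']}__{s['schema']}__{s['table']}"

# Group source columns by destination column, mirroring the accumulation
# _build_column_lineage_events performs with ColumnLineageField objects.
fields = {}
for m in col_mappings:
    src = next(s for s in sources if s["table"] == m["src_table"])
    fields.setdefault(m["dest_col"], []).append((asset_id(src), m["src_col"]))

print(fields)
```

Note that a destination column can accumulate several source columns (here `total` draws from both `amount` and `tax`), which is why the real builder keeps a dict of `ColumnLineageField`s rather than one entry per mapping row.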

@@ -0,0 +1,228 @@
#!/usr/bin/env python3
"""
Push table metadata to Monte Carlo from a JSON manifest — push only.
Reads a manifest file produced by ``collect_metadata.py`` and sends the assets
to Monte Carlo as RelationalAsset events using the pycarlo push ingestion API.
Large payloads are split into batches to stay under the 1 MB compressed limit.
Can be run standalone via CLI or imported (use the ``push()`` function).
Substitution points
-------------------
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID for this connection
Prerequisites
-------------
pip install pycarlo
Usage
-----
python push_metadata.py \\
--key-id <MCD_INGEST_ID> \\
--key-token <MCD_INGEST_TOKEN> \\
--resource-uuid <MCD_RESOURCE_UUID> \\
--input-file metadata_output.json
"""
import argparse
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
AssetField,
AssetFreshness,
AssetMetadata,
AssetVolume,
RelationalAsset,
)
# ← SUBSTITUTE: set RESOURCE_TYPE to match your Monte Carlo connection type
RESOURCE_TYPE = "snowflake"
# Maximum assets per batch — conservative default to keep compressed payload under 1 MB
# ← SUBSTITUTE: tune based on average asset size (fields per table, description length, etc.)
_BATCH_SIZE = 500
def _asset_from_dict(d: dict) -> RelationalAsset:
"""Reconstruct a RelationalAsset from a manifest dict entry."""
fields = [
AssetField(
name=f["name"],
type=f.get("type"),
description=f.get("description"),
)
for f in d.get("fields", [])
]
volume = None
if d.get("volume"):
volume = AssetVolume(
row_count=d["volume"].get("row_count"),
byte_count=d["volume"].get("byte_count"),
)
freshness = None
if d.get("freshness"):
freshness = AssetFreshness(
last_update_time=d["freshness"].get("last_update_time"),
)
return RelationalAsset(
type=d.get("type", "TABLE"),
metadata=AssetMetadata(
name=d["name"],
database=d["database"],
schema=d["schema"],
description=d.get("description"),
),
fields=fields,
volume=volume,
freshness=freshness,
)
def push(
input_file: str,
resource_uuid: str,
key_id: str,
key_token: str,
batch_size: int = _BATCH_SIZE,
output_file: str = "metadata_push_result.json",
) -> dict:
"""
Read a metadata manifest and push assets to Monte Carlo in batches.
Returns a result dict with invocation IDs for each batch.
"""
with open(input_file) as fh:
manifest = json.load(fh)
asset_dicts = manifest.get("assets", [])
resource_type = manifest.get("resource_type", RESOURCE_TYPE)
assets = [_asset_from_dict(d) for d in asset_dicts]
print(f"Loaded {len(assets)} asset(s) from {input_file}")
    # Split into batches; max(len(assets), 1) guarantees at least one batch,
    # so an empty manifest still results in a single (empty) push.
    batches = []
    for i in range(0, max(len(assets), 1), batch_size):
        batches.append(assets[i : i + batch_size])
total_batches = len(batches)
def _push_batch(batch: list, batch_num: int) -> str | None:
"""Push a single batch using a dedicated Session (thread-safe)."""
client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
service = IngestionService(mc_client=client)
result = service.send_metadata(
resource_uuid=resource_uuid,
resource_type=resource_type,
events=batch,
)
invocation_id = service.extract_invocation_id(result)
print(f" Pushed batch {batch_num}/{total_batches} ({len(batch)} assets) — invocation_id={invocation_id}")
return invocation_id
# Push batches in parallel (each thread gets its own pycarlo Session)
max_workers = min(4, total_batches)
invocation_ids: list[str | None] = [None] * total_batches
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = {
pool.submit(_push_batch, batch, i + 1): i
for i, batch in enumerate(batches)
}
for future in as_completed(futures):
idx = futures[future]
try:
invocation_ids[idx] = future.result()
except Exception as exc:
print(f" ERROR pushing batch {idx + 1}: {exc}")
raise
print(f" All {total_batches} batches pushed ({max_workers} workers)")
push_result = {
"resource_uuid": resource_uuid,
"resource_type": resource_type,
"invocation_ids": invocation_ids,
"pushed_at": datetime.now(tz=timezone.utc).isoformat(),
"total_assets": len(assets),
"batch_count": total_batches,
"batch_size": batch_size,
}
with open(output_file, "w") as fh:
json.dump(push_result, fh, indent=2)
print(f"Push result written to {output_file}")
return push_result
def main() -> None:
parser = argparse.ArgumentParser(
description="Push Snowflake table metadata from a manifest to Monte Carlo",
)
parser.add_argument(
"--key-id",
default=os.environ.get("MCD_INGEST_ID"),
help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",
)
parser.add_argument(
"--key-token",
default=os.environ.get("MCD_INGEST_TOKEN"),
help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",
)
parser.add_argument(
"--resource-uuid",
default=os.environ.get("MCD_RESOURCE_UUID"),
help="Monte Carlo resource UUID for this Snowflake connection (env: MCD_RESOURCE_UUID)",
)
parser.add_argument(
"--input-file",
default="metadata_output.json",
help="Path to the collect manifest to read (default: metadata_output.json)",
)
parser.add_argument(
"--output-file",
default="metadata_push_result.json",
help="Path to write the push result (default: metadata_push_result.json)",
)
parser.add_argument(
"--batch-size",
type=int,
default=_BATCH_SIZE,
help=f"Max assets per push batch (default: {_BATCH_SIZE})",
)
args = parser.parse_args()
missing = [
name
for name, val in [
("--key-id", args.key_id),
("--key-token", args.key_token),
("--resource-uuid", args.resource_uuid),
]
if not val
]
if missing:
parser.error(f"Missing required arguments: {', '.join(missing)}")
push(
input_file=args.input_file,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
output_file=args.output_file,
)
print("Done.")
if __name__ == "__main__":
main()
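The batching pattern shared by all three push scripts is a stride slice over the event list. A minimal sketch, including the `max(len(...), 1)` twist this script uses to guarantee at least one batch even for an empty manifest:

```python
def split_batches(items, batch_size, ensure_one=False):
    # Stride slicing: [0:N], [N:2N], ..., the same shape as the scripts' loops.
    stop = max(len(items), 1) if ensure_one else len(items)
    return [items[i : i + batch_size] for i in range(0, stop, batch_size)]

assets = list(range(1203))
batches = split_batches(assets, 500)
print([len(b) for b in batches])      # [500, 500, 203]
print(split_batches([], 500, True))   # [[]]: one empty batch, still pushed
print(split_batches([], 500, False))  # []:  no push at all
```

The `ensure_one=False` path matches push_lineage.py and push_query_logs.py, which return early on an empty manifest instead of sending an empty batch.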

@@ -0,0 +1,248 @@
#!/usr/bin/env python3
"""
Push query logs to Monte Carlo from a JSON manifest — push only.
Reads a manifest file produced by ``collect_query_logs.py`` and sends the query
log entries to Monte Carlo using the pycarlo push ingestion API. Large payloads
are split into batches to stay under the 1 MB compressed limit.
Can be run standalone via CLI or imported (use the ``push()`` function).
Substitution points
-------------------
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID for this connection
Prerequisites
-------------
pip install pycarlo
Usage
-----
python push_query_logs.py \\
--key-id <MCD_INGEST_ID> \\
--key-token <MCD_INGEST_TOKEN> \\
--resource-uuid <MCD_RESOURCE_UUID> \\
--input-file query_logs_output.json
"""
import argparse
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from dateutil.parser import isoparse
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import QueryLogEntry
# ← SUBSTITUTE: set LOG_TYPE to match your warehouse type (query logs use log_type, not resource_type)
LOG_TYPE = "snowflake"
# Maximum entries per batch. Query logs include full SQL text, so this default
# is deliberately conservative: without the truncation below, batches as small
# as 50 entries can exceed the 1 MB compressed payload limit (HTTP 413) on
# active warehouses.
# ← SUBSTITUTE: tune based on average query length
_BATCH_SIZE = 100
# Truncate query_text longer than this to prevent 413 errors.
# Some SQL statements (e.g., generated by BI tools) can be 100KB+ and blow up
# compressed payloads even at small batch sizes.
_MAX_QUERY_TEXT_LEN = 10_000
def _build_query_log_entries(queries: list[dict]) -> list[QueryLogEntry]:
"""Convert manifest query dicts into QueryLogEntry objects."""
entries = []
truncated = 0
for q in queries:
start_time = q.get("start_time")
end_time = q.get("end_time")
query_text = q.get("query_text") or ""
query_id = q.get("query_id")
user_name = q.get("user")
warehouse_name = q.get("warehouse")
bytes_scanned = q.get("bytes_scanned")
rows_produced = q.get("rows_produced")
# Truncate very long SQL to prevent 413 Request Too Large
if len(query_text) > _MAX_QUERY_TEXT_LEN:
query_text = query_text[:_MAX_QUERY_TEXT_LEN] + "... [TRUNCATED]"
truncated += 1
extra = {}
if warehouse_name is not None:
extra["warehouse_name"] = warehouse_name
if bytes_scanned is not None:
extra["bytes_scanned"] = int(bytes_scanned)
entries.append(
QueryLogEntry(
start_time=isoparse(start_time) if start_time else None,
end_time=isoparse(end_time) if end_time else None,
query_text=query_text,
query_id=query_id,
user=user_name,
returned_rows=int(rows_produced) if rows_produced is not None else None,
extra=extra or None,
)
)
if truncated:
print(f" Truncated {truncated} query text(s) exceeding {_MAX_QUERY_TEXT_LEN} chars")
return entries
def push(
input_file: str,
resource_uuid: str,
key_id: str,
key_token: str,
batch_size: int = _BATCH_SIZE,
output_file: str = "query_logs_push_result.json",
) -> dict:
"""
Read a query log manifest and push entries to Monte Carlo in batches.
Returns a result dict with invocation IDs for each batch.
"""
with open(input_file) as fh:
manifest = json.load(fh)
queries = manifest.get("queries", [])
log_type = manifest.get("log_type", LOG_TYPE)
entries = _build_query_log_entries(queries)
    print(f"Loaded {len(entries)} query log entries from {input_file}")
if not entries:
print("No query log entries to push.")
push_result = {
"resource_uuid": resource_uuid,
"log_type": log_type,
"invocation_ids": [],
"pushed_at": datetime.now(tz=timezone.utc).isoformat(),
"total_entries": 0,
"batch_count": 0,
"batch_size": batch_size,
}
with open(output_file, "w") as fh:
json.dump(push_result, fh, indent=2)
return push_result
# Split into batches
batches = []
for i in range(0, len(entries), batch_size):
batches.append(entries[i : i + batch_size])
total_batches = len(batches)
def _push_batch(batch: list, batch_num: int) -> str | None:
"""Push a single batch using a dedicated Session (thread-safe)."""
client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
service = IngestionService(mc_client=client)
result = service.send_query_logs(
resource_uuid=resource_uuid,
log_type=log_type,
events=batch,
)
invocation_id = service.extract_invocation_id(result)
print(f" Pushed batch {batch_num}/{total_batches} ({len(batch)} entries) — invocation_id={invocation_id}")
return invocation_id
# Push batches in parallel (each thread gets its own pycarlo Session)
max_workers = min(4, total_batches)
invocation_ids: list[str | None] = [None] * total_batches
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = {
pool.submit(_push_batch, batch, i + 1): i
for i, batch in enumerate(batches)
}
for future in as_completed(futures):
idx = futures[future]
try:
invocation_ids[idx] = future.result()
except Exception as exc:
print(f" ERROR pushing batch {idx + 1}: {exc}")
raise
print(f" All {total_batches} batches pushed ({max_workers} workers)")
push_result = {
"resource_uuid": resource_uuid,
"log_type": log_type,
"invocation_ids": invocation_ids,
"pushed_at": datetime.now(tz=timezone.utc).isoformat(),
"total_entries": len(entries),
"batch_count": total_batches,
"batch_size": batch_size,
}
with open(output_file, "w") as fh:
json.dump(push_result, fh, indent=2)
print(f"Push result written to {output_file}")
return push_result
def main() -> None:
parser = argparse.ArgumentParser(
description="Push Snowflake query logs from a manifest to Monte Carlo",
)
parser.add_argument(
"--key-id",
default=os.environ.get("MCD_INGEST_ID"),
help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",
)
parser.add_argument(
"--key-token",
default=os.environ.get("MCD_INGEST_TOKEN"),
help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",
)
parser.add_argument(
"--resource-uuid",
default=os.environ.get("MCD_RESOURCE_UUID"),
help="Monte Carlo resource UUID for this Snowflake connection (env: MCD_RESOURCE_UUID)",
)
parser.add_argument(
"--input-file",
default="query_logs_output.json",
help="Path to the collect manifest to read (default: query_logs_output.json)",
)
parser.add_argument(
"--output-file",
default="query_logs_push_result.json",
help="Path to write the push result (default: query_logs_push_result.json)",
)
parser.add_argument(
"--batch-size",
type=int,
default=_BATCH_SIZE,
help=f"Max entries per push batch (default: {_BATCH_SIZE})",
)
args = parser.parse_args()
missing = [
name
for name, val in [
("--key-id", args.key_id),
("--key-token", args.key_token),
("--resource-uuid", args.resource_uuid),
]
if not val
]
if missing:
parser.error(f"Missing required arguments: {', '.join(missing)}")
push(
input_file=args.input_file,
resource_uuid=args.resource_uuid,
key_id=args.key_id,
key_token=args.key_token,
batch_size=args.batch_size,
output_file=args.output_file,
)
print("Done.")
if __name__ == "__main__":
main()


@@ -0,0 +1,340 @@
#!/usr/bin/env python3
"""
Smoke test that every pycarlo model construction used by the templates
actually works with the real SDK. A wrong parameter name raises TypeError.
Run:
pip install pycarlo
python test_template_sdk_usage.py
"""
from datetime import datetime, timezone
from pycarlo.features.ingestion.models import (
AssetField,
AssetFreshness,
AssetMetadata,
AssetVolume,
ColumnLineageField,
ColumnLineageSourceField,
LineageAssetRef,
LineageEvent,
QueryLogEntry,
RelationalAsset,
Tag,
build_lineage_payload,
build_metadata_payload,
build_query_log_payload,
)
PASSED = 0
FAILED = 0
def check(label: str, fn):
global PASSED, FAILED
try:
obj = fn()
# Also verify serialization works
if hasattr(obj, "to_dict"):
obj.to_dict()
PASSED += 1
print(f" PASS {label}")
except Exception as exc:
FAILED += 1
print(f" FAIL {label}: {exc}")
def test_metadata_models():
print("\n== Metadata models ==")
check("AssetField(name, type)", lambda: AssetField(name="id", type="INTEGER"))
check(
"AssetField(name, type, description)",
lambda: AssetField(name="id", type="INTEGER", description="Primary key"),
)
check(
"AssetMetadata(name, database, schema)",
lambda: AssetMetadata(name="orders", database="analytics", schema="public"),
)
check(
"AssetMetadata(name, database, schema, description, view_query, created_on)",
lambda: AssetMetadata(
name="orders_view",
database="analytics",
schema="public",
description="A view",
view_query="SELECT * FROM orders",
created_on="2026-01-01T00:00:00Z",
),
)
check("AssetVolume(row_count)", lambda: AssetVolume(row_count=1000))
check(
"AssetVolume(row_count, byte_count)",
lambda: AssetVolume(row_count=1000, byte_count=50000),
)
check(
"AssetFreshness(last_update_time)",
lambda: AssetFreshness(last_update_time="2026-03-12T14:30:00Z"),
)
check("Tag(key, value)", lambda: Tag(key="env", value="prod"))
check("Tag(key only)", lambda: Tag(key="pii"))
check(
"RelationalAsset — full nested structure",
lambda: RelationalAsset(
type="TABLE",
metadata=AssetMetadata(
name="orders",
database="analytics",
schema="public",
description="Orders table",
),
fields=[
AssetField(name="id", type="INTEGER"),
AssetField(name="amount", type="DECIMAL(10,2)", description="Order total"),
],
volume=AssetVolume(row_count=1000000, byte_count=111111111),
freshness=AssetFreshness(last_update_time="2026-03-12T14:30:00Z"),
tags=[Tag(key="env", value="prod")],
),
)
check(
"RelationalAsset — minimal (no volume, freshness, tags)",
lambda: RelationalAsset(
type="VIEW",
metadata=AssetMetadata(name="v_orders", database="db", schema="sch"),
),
)
def test_lineage_models():
print("\n== Lineage models ==")
check(
"LineageAssetRef(type, name, database, schema)",
lambda: LineageAssetRef(
type="TABLE", name="orders", database="analytics", schema="public"
),
)
check(
"LineageAssetRef(type, name, database, schema, asset_id)",
lambda: LineageAssetRef(
type="TABLE",
name="orders",
database="analytics",
schema="public",
asset_id="analytics:public.orders",
),
)
check(
"LineageEvent — table lineage",
lambda: LineageEvent(
destination=LineageAssetRef(
type="TABLE", name="curated", database="db", schema="sch"
),
sources=[
LineageAssetRef(type="TABLE", name="raw", database="db", schema="sch"),
],
),
)
check(
"ColumnLineageSourceField(asset_id, field_name)",
lambda: ColumnLineageSourceField(
asset_id="db:sch.raw", field_name="amount"
),
)
check(
"ColumnLineageField(name, source_fields)",
lambda: ColumnLineageField(
name="total_amount",
source_fields=[
ColumnLineageSourceField(asset_id="db:sch.raw", field_name="amount"),
],
),
)
check(
"LineageEvent — column lineage",
lambda: LineageEvent(
destination=LineageAssetRef(
type="TABLE",
name="curated",
database="db",
schema="sch",
asset_id="db:sch.curated",
),
sources=[
LineageAssetRef(
type="TABLE",
name="raw",
database="db",
schema="sch",
asset_id="db:sch.raw",
),
],
fields=[
ColumnLineageField(
name="total_amount",
source_fields=[
ColumnLineageSourceField(
asset_id="db:sch.raw", field_name="amount"
),
],
),
],
),
)
def test_query_log_models():
print("\n== Query log models ==")
now = datetime.now(tz=timezone.utc)
check(
"QueryLogEntry — minimal",
lambda: QueryLogEntry(
start_time=now,
end_time=now,
query_text="SELECT 1",
),
)
check(
"QueryLogEntry — full with extra",
lambda: QueryLogEntry(
start_time=now,
end_time=now,
query_text="SELECT * FROM orders",
query_id="query-123",
user="analyst@company.com",
returned_rows=100,
error_code=None,
error_text=None,
extra={
"warehouse_name": "COMPUTE_WH",
"bytes_scanned": 12345,
},
),
)
check(
"QueryLogEntry — Snowflake extra fields",
lambda: QueryLogEntry(
start_time=now,
end_time=now,
query_text="SELECT 1",
extra={"warehouse_name": "WH", "bytes_scanned": 100},
),
)
check(
"QueryLogEntry — BigQuery extra fields",
lambda: QueryLogEntry(
start_time=now,
end_time=now,
query_text="SELECT 1",
extra={"total_bytes_billed": 999, "statement_type": "SELECT"},
),
)
check(
"QueryLogEntry — Databricks extra fields",
lambda: QueryLogEntry(
start_time=now,
end_time=now,
query_text="SELECT 1",
extra={"total_task_duration_ms": 500, "read_rows": 10, "read_bytes": 200},
),
)
check(
"QueryLogEntry — Redshift extra fields",
lambda: QueryLogEntry(
start_time=now,
end_time=now,
query_text="SELECT 1",
extra={"database_name": "dev", "elapsed_time_us": 123456},
),
)
def test_payload_builders():
print("\n== Payload builders ==")
now = datetime.now(tz=timezone.utc)
check(
"build_metadata_payload",
lambda: build_metadata_payload(
resource_uuid="uuid-123",
resource_type="snowflake",
events=[
RelationalAsset(
type="TABLE",
metadata=AssetMetadata(name="t", database="d", schema="s"),
)
],
),
)
check(
"build_lineage_payload — table",
lambda: build_lineage_payload(
resource_uuid="uuid-123",
resource_type="snowflake",
events=[
LineageEvent(
destination=LineageAssetRef(
type="TABLE", name="dst", database="d", schema="s"
),
sources=[
LineageAssetRef(
type="TABLE", name="src", database="d", schema="s"
)
],
)
],
),
)
check(
"build_query_log_payload",
lambda: build_query_log_payload(
resource_uuid="uuid-123",
log_type="snowflake",
events=[
QueryLogEntry(
start_time=now,
end_time=now,
query_text="SELECT 1",
)
],
),
)
if __name__ == "__main__":
test_metadata_models()
test_lineage_models()
test_query_log_models()
test_payload_builders()
print(f"\n{'='*40}")
print(f"Results: {PASSED} passed, {FAILED} failed")
if FAILED:
print("SOME TESTS FAILED — templates use wrong parameter names!")
raise SystemExit(1)
else:
print("All tests passed — all model constructions are valid.")


@@ -0,0 +1,685 @@
---
name: monte-carlo-validation-notebook
description: "Generates SQL validation notebooks for dbt PR changes with before/after comparison queries."
category: data
risk: safe
source: community
source_repo: monte-carlo-data/mc-agent-toolkit
source_type: community
date_added: "2026-04-08"
author: monte-carlo-data
tags: [data-observability, validation, dbt, monte-carlo, sql-notebook]
tools: [claude, cursor, codex]
---
> **Tip:** This skill works well with Sonnet. Run `/model sonnet` before invoking for faster generation.
Generate a SQL Notebook with validation queries for dbt changes.
**Arguments:** $ARGUMENTS
Parse the arguments:
- **Target** (required): first argument — a GitHub PR URL or local dbt repo path
- **MC Base URL** (optional): `--mc-base-url <URL>` — defaults to `https://getmontecarlo.com`
- **Models** (optional): `--models <model1,model2,...>` — comma-separated list of model filenames (without `.sql` extension) to generate queries for. Only these models will be included. By default, all changed models are included up to a maximum of 10.
---
# Setup
**Prerequisites:**
- **`gh`** (GitHub CLI) — required for PR mode. Must be authenticated (`gh auth status`).
- **`python3`** — required for helper scripts.
- **`pyyaml`** — install with `pip3 install pyyaml` (or `pip install pyyaml`, `uv pip install pyyaml`, etc.)
**Note:** Generated SQL uses ANSI-compatible syntax that works across Snowflake, BigQuery, Redshift, and Athena. Minor adjustments may be needed for specific warehouse quirks.
This skill includes two helper scripts in `${CLAUDE_PLUGIN_ROOT}/skills/monte-carlo-validation-notebook/scripts/`:
- **`resolve_dbt_schema.py`** - Resolves dbt model output schemas from `dbt_project.yml` routing rules and model config overrides.
- **`generate_notebook_url.py`** - Encodes notebook YAML into a base64 import URL and opens it in the browser.
# Mode Detection
Auto-detect mode from the target argument:
- If target looks like a URL (contains `://` or `github.com`) -> **PR mode**
- If target is a path (`.`, `/path/to/repo`, relative path) -> **Local mode**
---
# Context
This command generates a SQL Notebook containing validation queries for dbt changes. The notebook can be opened in the MC Bridge SQL Notebook interface for interactive validation.
The output is an import URL that opens directly in the notebook interface:
```
<MC_BASE_URL>/notebooks/import#<base64-encoded-yaml>
```
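The encoding step is handled by `generate_notebook_url.py`; as a rough sketch (assuming standard base64 here — the real script may use a URL-safe variant and additional sanitization), the URL is built like this:

```python
import base64


def build_import_url(yaml_text: str, mc_base_url: str = "https://getmontecarlo.com") -> str:
    """Encode notebook YAML into the fragment of a notebook import URL."""
    # The notebook YAML travels entirely in the URL fragment, so it is
    # decoded client-side by the notebook interface.
    encoded = base64.b64encode(yaml_text.encode("utf-8")).decode("ascii")
    return f"{mc_base_url}/notebooks/import#{encoded}"
```

Decoding the fragment with `base64.b64decode` recovers the original YAML byte-for-byte.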
**Key Features:**
- **Database Parameters**: Two `text` parameters (`prod_db` and `dev_db`) for selecting databases
- **Schema Inference**: Automatically infers schema per model from `dbt_project.yml` and model configs
- **Single-table queries**: Basic validation queries using `{{prod_db}}.<SCHEMA>.<TABLE>`
- **Comparison queries**: Before/after queries comparing `{{prod_db}}` vs `{{dev_db}}`
- **Flexible usage**: Users can set both parameters to the same database for single-database analysis
# Notebook YAML Spec Reference
Key structure:
```yaml
version: 1
metadata:
id: string # kebab-case + random suffix
name: string # display name
created_at: string # ISO 8601
updated_at: string # ISO 8601
default_context: # optional database/schema context
database: string
schema: string
cells:
- id: string
type: sql | markdown | parameter
content: string # SQL, markdown, or parameter config (JSON)
display_type: table | bar | timeseries
```
## Parameter Cell Spec
Parameter cells allow defining variables referenced in SQL via `{{param_name}}` syntax:
```yaml
- id: param-prod-db
type: parameter
content:
name: prod_db # variable name
config:
type: text # free-form text input
default_value: "ANALYTICS"
placeholder: "Prod database"
display_type: table
```
Parameter types:
- `text`: Free-form text input (used for database names)
- `schema_selector`: Two dropdowns (database -> schema), value stored as `DATABASE.SCHEMA`
- `dropdown`: Select from predefined options
# Task
Generate a SQL Notebook with validation queries based on the mode and target.
## Phase 1: Get Changed Files
The approach differs based on mode:
### If PR mode (GitHub PR):
1. Extract the PR number and repo from the target URL.
- Example: `https://github.com/monte-carlo-data/dbt/pull/3386` -> owner=`monte-carlo-data`, repo=`dbt`, PR=`3386`
2. Fetch PR metadata using `gh`:
```bash
gh pr view <PR#> --repo <owner>/<repo> --json number,title,author,mergedAt,headRefOid
```
3. Fetch the list of changed files:
```bash
gh pr view <PR#> --repo <owner>/<repo> --json files --jq '.files[].path'
```
4. Fetch the diff:
```bash
gh pr diff <PR#> --repo <owner>/<repo>
```
5. Filter the changed files list to only `.sql` files under `models/` or `snapshots/` directories (at any depth — e.g., `models/`, `analytics/models/`, `dbt/models/`). These are the dbt models to analyze. If no model SQL files were changed, report that and stop.
6. For each changed model file, fetch the full file content at the head SHA:
```bash
gh api repos/<owner>/<repo>/contents/<file_path>?ref=<head_sha> --jq '.content' | python3 -c "import sys,base64; sys.stdout.write(base64.b64decode(sys.stdin.read()).decode())"
```
7. **Fetch dbt_project.yml** for schema resolution. Detect the dbt project root by looking at the changed file paths — find the common parent directory that contains `dbt_project.yml`. Try these paths in order until one succeeds:
```bash
gh api repos/<owner>/<repo>/contents/<dbt_root>/dbt_project.yml?ref=<head_sha> --jq '.content' | python3 -c "import sys,base64; sys.stdout.write(base64.b64decode(sys.stdin.read()).decode())"
```
Common `<dbt_root>` locations: `analytics`, `.` (repo root), `dbt`, `transform`. Try each until found.
Save `dbt_project.yml` to `/tmp/validation_notebook_working/<PR#>/dbt_project.yml`.
### If Local mode (Local Directory):
1. Change to the target directory.
2. Get current branch info:
```bash
git rev-parse --abbrev-ref HEAD
```
3. Detect base branch - try `main`, `master`, `develop` in order, or use upstream tracking branch.
4. Get the list of changed SQL files compared to base branch:
```bash
git diff --name-only <base_branch>...HEAD -- '*.sql'
```
5. Filter to only `.sql` files under `models/` or `snapshots/` directories (at any depth — e.g., `models/`, `analytics/models/`, `dbt/models/`). If no model SQL files were changed, report that and stop.
6. Get the diff for each changed file:
```bash
git diff <base_branch>...HEAD -- <file_path>
```
7. Read model files directly from the filesystem.
8. **Find dbt_project.yml**:
```bash
find . -name "dbt_project.yml" -type f | head -1
```
9. For notebook metadata in local mode, use:
- **ID**: `local-<branch-name>-<timestamp>`
- **Title**: `Local: <branch-name>`
- **Author**: Output of `git config user.name`
- **Merged**: "N/A (local)"
### Model Selection (applies to both modes)
After filtering to `.sql` files under `models/` or `snapshots/`:
1. **If `--models` was specified:** Filter the changed files list to only include models whose filename (without `.sql` extension, case-insensitive) matches one of the specified model names. If any specified model is not found in the changed files, warn the user but continue with the models that were found. If none match, report that and stop.
2. **Model cap:** If more than 10 models remain after filtering, select the first 10 (by file path order) and warn the user:
```
⚠️ <total_count> models changed — generating validation queries for the first 10 only.
To generate for specific models, re-run with: --models <model1,model2,...>
Skipped models: <list of skipped model filenames>
```
## Phase 2: Parse Changed Models
For EACH changed dbt model `.sql` file, parse and extract:
### 2a. Model Metadata
**Output table name** -- Derive from file name:
- `<any_path>/models/<subdir>/<model_name>.sql` -> table is `<MODEL_NAME>` (uppercase, taken from the filename)
**Output schema** -- Use the schema resolution script:
1. **Setup**: Save `dbt_project.yml` and model files to `/tmp/validation_notebook_working/<id>/` preserving paths:
```
/tmp/validation_notebook_working/<id>/
+-- dbt_project.yml
+-- models/
+-- <path>/<model>.sql
```
2. **Run the script** for each model:
```bash
python3 ${CLAUDE_PLUGIN_ROOT}/skills/monte-carlo-validation-notebook/scripts/resolve_dbt_schema.py /tmp/validation_notebook_working/<id>/dbt_project.yml /tmp/validation_notebook_working/<id>/models/<path>/<model>.sql
```
3. **Error handling**: If the script fails, **STOP immediately** and report the error. Do NOT proceed with notebook generation if schema resolution fails.
4. **Output**: The script prints the resolved schema (e.g., `PROD`, `PROD_STAGE`, `PROD_LINEAGE`)
**Note**: Do NOT manually parse dbt_project.yml or model configs for schema -- always use the script. It handles model config overrides, dbt_project.yml routing rules, PROD_ prefix for custom schemas, and defaults to `PROD`.
**Config block** -- Look for `{{ config(...) }}` and extract:
- `materialized` -- 'table', 'view', 'incremental', 'ephemeral'
- `unique_key` -- the dedup key (may be a string or list)
- `cluster_by` -- clustering fields (may contain the time axis)
**Core segmentation fields** -- Scan the entire model SQL for fields likely to be business keys:
- Fields named `*_id` (e.g., `account_id`, `resource_id`, `monitor_id`) that appear in JOIN ON, GROUP BY, PARTITION BY, or `unique_key`
- Deduplicate and rank by frequency. Take the top 3.
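A minimal sketch of this heuristic as a hypothetical helper (the skill leaves the exact ranking to the agent; this version counts every `*_id` token in the SQL rather than restricting to JOIN/GROUP BY/PARTITION BY contexts):

```python
import re
from collections import Counter


def top_segmentation_fields(model_sql: str, limit: int = 3) -> list[str]:
    # Count every *_id identifier in the model SQL; the most frequent
    # ones are the strongest business-key candidates.
    counts = Counter(
        match.group(0).lower() for match in re.finditer(r"\b\w+_id\b", model_sql)
    )
    return [field for field, _ in counts.most_common(limit)]
```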
**Time axis field** -- Detect the model's time dimension (in priority order):
1. `is_incremental()` block: field used in the WHERE comparison
2. `cluster_by` config: timestamp/date fields
3. Field name conventions: `ingest_ts`, `created_time`, `date_part`, `timestamp`, `run_start_time`, `export_ts`, `event_created_time`
4. ORDER BY DESC in QUALIFY/ROW_NUMBER
If no time axis is found, skip time-axis queries for this model.
### 2b. Diff Analysis
Parse the diff hunks for this file. Classify each changed line:
- **Changed fields** -- Lines added/modified in SELECT clauses or CTE definitions. Extract the output column name.
- **Changed filters** -- Lines added/modified in WHERE clauses.
- **Changed joins** -- Lines added/modified in JOIN ON conditions.
- **Changed unique_key** -- If `unique_key` in config was modified, note both old and new values.
- **New columns** -- Columns in "after" SELECT that don't appear in "before" (pure additions).
### 2c. Model Classification
Classify each model as **new** or **modified** based on the diff:
- If the diff for this file contains `new file mode` → classify as **new**
- Otherwise → classify as **modified**
This classification determines which query patterns are generated in Phase 3.
**Note:** For **new models**, Phase 2b diff analysis is skipped (there is no "before" to compare against). Phase 2a metadata extraction still applies.
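The classification itself is a one-line check on the raw diff text (a sketch; both `git diff` and `gh pr diff` emit a `new file mode` header for pure additions):

```python
def classify_model(diff_text: str) -> str:
    # git prints a "new file mode 100644"-style header only for files
    # that did not exist on the base branch.
    return "new" if "new file mode" in diff_text else "modified"
```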
## Phase 3: Generate Validation Queries
For each changed model, generate the applicable queries based on its classification (new vs modified).
**CRITICAL: Parameter Placeholder Syntax**
Use **double curly braces** `{{...}}` for parameter placeholders. Do NOT use `${...}` or any other syntax.
Correct: `{{prod_db}}.PROD.AGENT_RUNS`
Wrong: `${prod_db}.PROD.AGENT_RUNS`
**Table Reference Format:**
- Use `{{prod_db}}.<SCHEMA>.<TABLE_NAME>` for prod queries
- Use `{{dev_db}}.<SCHEMA>.<TABLE_NAME>` for dev queries
- `<SCHEMA>` is **hardcoded per-model** using the output from the schema resolution script
---
### Query Patterns for NEW Models
For new models, all queries target `{{dev_db}}` only. No comparison queries are generated since no prod table exists.
#### Pattern 7-new: Total Row Count
**Trigger:** Always.
```sql
SELECT COUNT(*) AS total_rows
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
```
#### Pattern 9: Sample Data Preview
**Trigger:** Always.
```sql
SELECT *
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
LIMIT 20
```
#### Pattern 2-new: Core Segmentation Counts
**Trigger:** Always.
```sql
SELECT
<segmentation_field>,
COUNT(*) AS row_count
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
GROUP BY <segmentation_field>
ORDER BY row_count DESC
LIMIT 100
```
#### Pattern 5: Uniqueness Check
**Trigger:** Always for new models (verify unique_key constraint from the start).
```sql
SELECT
COUNT(*) AS total_rows,
COUNT(DISTINCT <key_fields>) AS distinct_keys,
COUNT(*) - COUNT(DISTINCT <key_fields>) AS duplicate_count
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
```
```sql
SELECT <key_fields>, COUNT(*) AS n
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
GROUP BY <key_fields>
HAVING COUNT(*) > 1
ORDER BY n DESC
LIMIT 100
```
#### Pattern 6-new: NULL Rate Check (all columns)
**Trigger:** Always. Checks all output columns since everything is new.
```sql
SELECT
COUNT(*) AS total_rows,
SUM(CASE WHEN <col1> IS NULL THEN 1 ELSE 0 END) AS <col1>_null_count,
ROUND(100.0 * SUM(CASE WHEN <col1> IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0), 2) AS <col1>_null_pct,
SUM(CASE WHEN <col2> IS NULL THEN 1 ELSE 0 END) AS <col2>_null_count,
ROUND(100.0 * SUM(CASE WHEN <col2> IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0), 2) AS <col2>_null_pct
-- repeat for each output column
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
```
#### Pattern 8: Time-Axis Continuity
**Trigger:** Model is `materialized='incremental'` OR a time axis field was identified.
```sql
SELECT
CAST(<time_axis> AS DATE) AS day,
COUNT(*) AS row_count
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
WHERE <time_axis> >= CURRENT_TIMESTAMP - INTERVAL '14' DAY
GROUP BY day
ORDER BY day DESC
LIMIT 30
```
---
### Query Patterns for MODIFIED Models
For modified models, single-table queries use `{{prod_db}}` and comparison queries use both.
#### Pattern 7: Total Row Count
**Trigger:** Always.
```sql
SELECT COUNT(*) AS total_rows
FROM {{prod_db}}.<SCHEMA>.<TABLE_NAME>
```
#### Pattern 9: Sample Data Preview
**Trigger:** Always.
```sql
SELECT *
FROM {{prod_db}}.<SCHEMA>.<TABLE_NAME>
LIMIT 20
```
#### Pattern 2: Core Segmentation Counts
**Trigger:** Always.
```sql
SELECT
<segmentation_field>,
COUNT(*) AS row_count
FROM {{prod_db}}.<SCHEMA>.<TABLE_NAME>
GROUP BY <segmentation_field>
ORDER BY row_count DESC
LIMIT 100
```
#### Pattern 1: Changed Field Distribution
**Trigger:** Changed fields found in Phase 2b. **Exclude added columns** (from "New columns" in Phase 2b) — only include fields that exist in prod.
```sql
SELECT
<changed_field>,
COUNT(*) AS row_count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS pct
FROM {{prod_db}}.<SCHEMA>.<TABLE_NAME>
GROUP BY <changed_field>
ORDER BY row_count DESC
LIMIT 100
```
#### Pattern 5: Uniqueness Check
**Trigger:** JOIN condition changed, `unique_key` changed, or model is incremental.
```sql
SELECT
COUNT(*) AS total_rows,
COUNT(DISTINCT <key_fields>) AS distinct_keys,
COUNT(*) - COUNT(DISTINCT <key_fields>) AS duplicate_count
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
```
```sql
SELECT <key_fields>, COUNT(*) AS n
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
GROUP BY <key_fields>
HAVING COUNT(*) > 1
ORDER BY n DESC
LIMIT 100
```
#### Pattern 6: NULL Rate Check
**Trigger:** New column added, or column wrapped in COALESCE/NULLIF.
**Important:** Added columns (from "New columns" in Phase 2b) do NOT exist in prod yet. For added columns, query `{{dev_db}}` only. For modified columns (COALESCE/NULLIF changes), compare both databases.
**For added columns** (dev only):
```sql
SELECT
COUNT(*) AS total_rows,
SUM(CASE WHEN <column> IS NULL THEN 1 ELSE 0 END) AS null_count,
ROUND(100.0 * SUM(CASE WHEN <column> IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0), 2) AS null_pct
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
```
**For modified columns** (prod vs dev):
```sql
SELECT
'prod' AS source,
COUNT(*) AS total_rows,
SUM(CASE WHEN <column> IS NULL THEN 1 ELSE 0 END) AS null_count,
ROUND(100.0 * SUM(CASE WHEN <column> IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0), 2) AS null_pct
FROM {{prod_db}}.<SCHEMA>.<TABLE_NAME>
UNION ALL
SELECT
'dev' AS source,
COUNT(*) AS total_rows,
SUM(CASE WHEN <column> IS NULL THEN 1 ELSE 0 END) AS null_count,
ROUND(100.0 * SUM(CASE WHEN <column> IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0), 2) AS null_pct
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
```
#### Pattern 8: Time-Axis Continuity
**Trigger:** Model is `materialized='incremental'` OR a time axis field was identified.
```sql
SELECT
CAST(<time_axis> AS DATE) AS day,
COUNT(*) AS row_count
FROM {{prod_db}}.<SCHEMA>.<TABLE_NAME>
WHERE <time_axis> >= CURRENT_TIMESTAMP - INTERVAL '14' DAY
GROUP BY day
ORDER BY day DESC
LIMIT 30
```
#### Pattern 3: Before/After Comparison
**Trigger:** Always (for changed fields + top segmentation field). **Modified models only.**
**Important:** Exclude added columns (from "New columns" in Phase 2b) from `<group_fields>`. Only use fields that exist in BOTH prod and dev. Added columns don't exist in prod and will cause query errors.
```sql
WITH prod AS (
SELECT <group_fields>, COUNT(*) AS cnt
FROM {{prod_db}}.<SCHEMA>.<TABLE_NAME>
GROUP BY <group_fields>
),
dev AS (
SELECT <group_fields>, COUNT(*) AS cnt
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
GROUP BY <group_fields>
)
SELECT
COALESCE(b.<field>, d.<field>) AS <field>,
COALESCE(b.cnt, 0) AS cnt_prod,
COALESCE(d.cnt, 0) AS cnt_dev,
COALESCE(d.cnt, 0) - COALESCE(b.cnt, 0) AS diff
FROM prod b
FULL OUTER JOIN dev d ON b.<field> = d.<field>
ORDER BY ABS(diff) DESC
LIMIT 100
```
#### Pattern 7b: Row Count Comparison
**Trigger:** Always. **Modified models only.**
```sql
SELECT 'prod' AS source, COUNT(*) AS row_count FROM {{prod_db}}.<SCHEMA>.<TABLE_NAME>
UNION ALL
SELECT 'dev' AS source, COUNT(*) AS row_count FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
```
## Phase 4: Build Notebook YAML
### 4a. Metadata
```yaml
version: 1
metadata:
id: validation-pr-<PR_NUMBER>-<random_suffix>
name: "Validation: PR #<PR_NUMBER> - <PR_TITLE_TRUNCATED>"
created_at: "<current_iso_timestamp>"
updated_at: "<current_iso_timestamp>"
```
### 4b. Parameter Cells
**Only include `prod_db` if there are modified models.** If all models are new, only include `dev_db`.
```yaml
# Include ONLY if there are modified models:
- id: param-prod-db
type: parameter
content:
name: prod_db
config:
type: text
default_value: "ANALYTICS"
placeholder: "Prod database (e.g., ANALYTICS)"
display_type: table
# Always include:
- id: param-dev-db
type: parameter
content:
name: dev_db
config:
type: text
default_value: "PERSONAL_<USER>"
placeholder: "Dev database (e.g., PERSONAL_JSMITH)"
display_type: table
```
### 4c. Markdown Summary Cell
```yaml
- id: cell-summary
type: markdown
content: |
# Validation Queries for <PR or Local Branch>
## Summary
- **Title:** <title>
- **Author:** <author>
- **Source:** <PR URL or "Local branch: <branch>">
- **Status:** <merge_timestamp or "Not yet merged" or "N/A (local)">
## Changes
<brief description based on diff analysis>
## Changed Models
- `<SCHEMA>.<TABLE_NAME>` (from `<file_path>`)
## How to Use
1. Select your Snowflake connector above
2. Set **dev_db** to your dev database (e.g., `PERSONAL_JSMITH`)
3. If modified models are present, set **prod_db** to your prod database (e.g., `ANALYTICS`)
4. Run single-table queries first, then comparison queries
display_type: table
```
### 4d. SQL Cell Format
```yaml
- id: cell-<pattern>-<model>-<index>
type: sql
content: |
/*
========================================
<Pattern Name (human-readable, e.g. "Total Row Count" — do NOT include pattern numbers like "Pattern 7:")>
========================================
Model: <SCHEMA>.<TABLE_NAME>
Triggered by: <why this pattern was generated>
What to look for: <interpretation guidance>
----------------------------------------
*/
<actual_sql_query>
display_type: table
```
### 4e. Cell Organization
Cells are ordered consistently for both model types, following this sequence:
**New models:**
1. Summary markdown cell (note that model is new)
2. Parameter cells (dev_db only — no prod_db if all models are new)
3. Total row count (Pattern 7-new)
4. Sample data preview (Pattern 9)
5. Core segmentation counts (Pattern 2-new)
6. Uniqueness check (Pattern 5), NULL rate check (Pattern 6-new), Time-axis continuity (Pattern 8)
**Modified models:**
1. Summary markdown cell
2. Parameter cells (prod_db, dev_db)
3. Total row count (Pattern 7)
4. Sample data preview (Pattern 9)
5. Core segmentation counts (Pattern 2)
6. Changed field distribution (Pattern 1)
7. Uniqueness check (Pattern 5), NULL rate check (Pattern 6), Time-axis continuity (Pattern 8)
8. Before/after comparisons (Pattern 3), Row count comparison (Pattern 7b)
## Phase 5: Generate Import URL
1. Write notebook YAML to `/tmp/validation_notebook_working/<id>/notebook.yaml`
2. Run the URL generation script:
```bash
python3 ${CLAUDE_PLUGIN_ROOT}/skills/monte-carlo-validation-notebook/scripts/generate_notebook_url.py /tmp/validation_notebook_working/<id>/notebook.yaml --mc-base-url <MC_BASE_URL>
```
3. The script validates both YAML syntax and notebook schema (required fields on metadata and cells). If validation fails, read the error messages carefully, fix the YAML to match the spec in Phase 4, and re-run.
## Phase 6: Output
Present:
```markdown
# Validation Notebook Generated
## Summary
- **Source:** PR #<number> - <title> OR Local: <branch>
- **Author:** <author>
- **Changed Models:** <count> models (of <total_count> changed)
- **Generated Queries:** <count> queries
> ⚠️ If models were capped: "Only the first 10 of <total_count> changed models were included. Re-run with `--models` to select specific models."
## Notebook Opened
The notebook has been opened directly in your browser.
Select your Snowflake connector in the notebook interface to begin running queries.
*Make sure MC Bridge is running. Let me know if you want tips on how to install this locally*
```
## Important Guidelines
1. **Do NOT execute queries** -- only generate the notebook
2. **Keep SQL readable** -- proper formatting and meaningful aliases
3. **Include LIMIT 100** on queries that could return many rows
4. **Use double curly braces** -- `{{prod_db}}` NOT `${prod_db}`
5. **Use correct table format** -- `{{prod_db}}.<SCHEMA>.<TABLE>` and `{{dev_db}}.<SCHEMA>.<TABLE>`
6. **Always use the schema resolution script** -- do NOT manually parse dbt_project.yml
7. **Schema is NOT a parameter** -- only `prod_db` and `dev_db` are parameters
8. **Skip ephemeral models** -- they have no physical table
9. **Truncate notebook name** -- keep under 50 chars
10. **Generate unique cell IDs** -- use pattern like `cell-p3-model-1`
11. **YAML multiline content** -- use `|` block scalar for SQL with comments
12. **ASCII-only YAML** -- the script sanitizes and validates before encoding
## Query Pattern Reference
| Pattern | Name | Trigger | Model Type | Database | Order |
|---------|------|---------|------------|----------|-------|
| 7 / 7-new | Total Row Count | Always | Both | `{{prod_db}}` (modified) / `{{dev_db}}` (new) | 1 |
| 9 | Sample Data Preview | Always | Both | `{{prod_db}}` (modified) / `{{dev_db}}` (new) | 2 |
| 2 / 2-new | Core Segmentation Counts | Always | Both | `{{prod_db}}` (modified) / `{{dev_db}}` (new) | 3 |
| 1 | Changed Field Distribution | Column modified in diff (not added) | Modified only | `{{prod_db}}` | 4 |
| 5 | Uniqueness Check | JOIN/unique_key changed (modified) / Always (new) | Both | `{{dev_db}}` | 5 |
| 6 / 6-new | NULL Rate Check | New column or COALESCE (modified) / Always (new) | Both | Added col: `{{dev_db}}` only; COALESCE: Both (modified) / `{{dev_db}}` (new) | 5 |
| 8 | Time-Axis Continuity | Incremental or time field | Both | `{{prod_db}}` (modified) / `{{dev_db}}` (new) | 5 |
| 3 | Before/After Comparison | Changed fields (not added) | Modified only | Both | 6 |
| 7b | Row Count Comparison | Always | Modified only | Both | 6 |
## MC Bridge Setup Help
If the user asks how to install or set up MC Bridge, fetch the README from the mc-bridge repo and show the relevant quick start / setup instructions:
```bash
gh api repos/monte-carlo-data/mc-bridge/readme --jq '.content' | base64 --decode
```
Focus on: how to install, configure connections, and run MC Bridge. Don't dump the entire README — extract just the setup-relevant sections.


@@ -0,0 +1,141 @@
#!/usr/bin/env python3
"""
Encode a notebook YAML file into a base64 import URL and open it in the browser.

Usage:
    python3 generate_notebook_url.py <notebook_yaml_path> [--mc-base-url URL]
"""
import argparse
import base64
import os
import re
import subprocess
import sys

import yaml


def sanitize_yaml(content: str) -> str:
    """Replace non-ASCII characters with ASCII equivalents."""
    replacements = {
        "\u2014": "-",
        "\u2013": "-",
        "\u2018": "'",
        "\u2019": "'",
        "\u201c": '"',
        "\u201d": '"',
        "\u2026": "...",
        "\u00a0": " ",
    }
    for char, replacement in replacements.items():
        content = content.replace(char, replacement)
    # Anything left outside ASCII is replaced with '?'
    content = re.sub(r"[^\x00-\x7F]", "?", content)
    return content


def validate_yaml(content: str) -> None:
    """Parse YAML, validate notebook schema, and exit with context on failure."""
    try:
        doc = yaml.safe_load(content)
    except yaml.YAMLError as e:
        print(f"YAML validation failed: {e}", file=sys.stderr)
        sys.exit(1)

    errors: list[str] = []

    # Top-level structure
    if not isinstance(doc, dict):
        errors.append("Root must be a mapping")
    else:
        if "version" not in doc:
            errors.append("Missing top-level 'version'")
        metadata = doc.get("metadata")
        if not isinstance(metadata, dict):
            errors.append("Missing or invalid 'metadata' mapping")
        else:
            for field in ("id", "name", "created_at", "updated_at"):
                if field not in metadata:
                    errors.append(f"metadata.{field}: missing required field")
            for bad_field in ("title", "description", "pr_number", "generated_by"):
                if bad_field in metadata:
                    errors.append(
                        f"metadata.{bad_field}: unexpected field (use 'name' for the notebook title)"
                    )
        cells = doc.get("cells")
        if not isinstance(cells, list):
            errors.append("Missing or invalid 'cells' list")
        else:
            for i, cell in enumerate(cells):
                prefix = f"cells[{i}]"
                if not isinstance(cell, dict):
                    errors.append(f"{prefix}: must be a mapping")
                    continue
                if "id" not in cell:
                    errors.append(f"{prefix}: missing 'id'")
                if "type" not in cell:
                    errors.append(f"{prefix}: missing 'type'")
                cell_type = cell.get("type")
                if cell_type not in ("sql", "markdown", "parameter"):
                    errors.append(
                        f"{prefix}: invalid type '{cell_type}' (must be sql, markdown, or parameter)"
                    )
                if "display_type" not in cell:
                    errors.append(f"{prefix}: missing 'display_type'")
                if cell_type == "parameter":
                    content_val = cell.get("content")
                    if not isinstance(content_val, dict):
                        errors.append(
                            f"{prefix}: parameter cell 'content' must be a mapping with 'name' and 'config'"
                        )
                    else:
                        if "name" not in content_val:
                            errors.append(f"{prefix}: parameter content missing 'name'")
                        if "config" not in content_val:
                            errors.append(f"{prefix}: parameter content missing 'config'")

    if errors:
        print("Invalid notebook:", file=sys.stderr)
        for err in errors:
            print(f"  - {err}", file=sys.stderr)
        sys.exit(1)


def main() -> None:
    parser = argparse.ArgumentParser(description="Encode notebook YAML to import URL")
    parser.add_argument("yaml_path", help="Path to notebook YAML file")
    parser.add_argument(
        "--mc-base-url",
        default="https://getmontecarlo.com",
        help="MC Bridge base URL",
    )
    args = parser.parse_args()

    with open(args.yaml_path) as f:
        notebook_yaml = f.read()

    yaml_content = sanitize_yaml(notebook_yaml.strip())
    validate_yaml(yaml_content)

    encoded = base64.b64encode(yaml_content.encode()).decode()
    url = f"{args.mc_base_url}/notebooks/import#{encoded}"
    print(f"URL length: {len(url)} chars")

    # Save URL to file alongside the YAML
    url_file = os.path.join(os.path.dirname(os.path.abspath(args.yaml_path)), "notebook_url.txt")
    with open(url_file, "w") as f:
        f.write(url)
    print(f"URL saved to: {url_file}")

    print("\n" + "=" * 60)
    print("NOTEBOOK URL:")
    print("=" * 60)
    print(url)
    print("=" * 60 + "\n")

    print("Opening notebook in browser...")
    # 'open' is macOS-specific; on Linux use xdg-open, or webbrowser.open() for portability
    subprocess.run(["open", url])


if __name__ == "__main__":
    main()
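For reference, the import URL this script produces is just the sanitized YAML base64-encoded into the URL fragment. A minimal round-trip (the sample YAML is illustrative, and the base URL is the script's default):

```python
import base64

notebook_yaml = "version: 1\nmetadata:\n  id: demo\n  name: Demo\n"
encoded = base64.b64encode(notebook_yaml.encode()).decode()
url = f"https://getmontecarlo.com/notebooks/import#{encoded}"

# The fragment decodes back to the original YAML, byte for byte
decoded = base64.b64decode(url.split("#", 1)[1]).decode()
assert decoded == notebook_yaml
```

Putting the payload in the fragment (after `#`) means the notebook body never leaves the browser in the request path, though very large notebooks can still run into browser URL-length limits, which is why the script prints the URL length.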

View File

@@ -0,0 +1,161 @@
#!/usr/bin/env python3
"""
Resolve the output schema for a dbt model.

Usage:
    python3 resolve_dbt_schema.py <dbt_project_yml_path> <model_sql_path>

Returns the resolved schema name (uppercase), e.g., "PROD", "PROD_STAGE", "PROD_LINEAGE".
"""
import argparse
import re
import sys
from pathlib import Path
from typing import Dict, List, Optional, Tuple, Union

import yaml


def parse_model_config_schema(model_content: str) -> Optional[str]:
    """Extract schema from model's config block."""
    pattern = r"\{\{\s*config\s*\([^)]*\bschema\s*=\s*['\"]([^'\"]+)['\"][^)]*\)\s*\}\}"
    match = re.search(pattern, model_content, re.IGNORECASE | re.DOTALL)
    if match:
        return match.group(1).upper()
    # Snapshots set their schema via target_schema=generate_schema_name('...')
    snapshot_pattern = r"target_schema\s*=\s*generate_schema_name\s*\(\s*['\"]([^'\"]+)['\"]"
    match = re.search(snapshot_pattern, model_content, re.IGNORECASE | re.DOTALL)
    if match:
        return match.group(1).upper()
    return None


def parse_dbt_project_routing(
    dbt_project: dict, project_name: str
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Extract schema and database routing rules from dbt_project.yml."""
    schema_routing: Dict[str, str] = {}
    database_routing: Dict[str, str] = {}
    models_config = dbt_project.get("models", {})
    project_config = models_config.get(project_name, {})

    def extract_routing(config: dict, current_path: str = "") -> None:
        for key, value in config.items():
            if key.startswith("+"):
                continue
            if not isinstance(value, dict):
                continue
            new_path = f"{current_path}/{key}" if current_path else key
            schema = value.get("schema") or value.get("+schema")
            # Skip Jinja-templated values; only literal schemas can be routed statically
            if schema and "{{" not in schema:
                schema_routing[new_path] = schema.upper()
            database = value.get("database") or value.get("+database")
            if database and "{{" not in database:
                database_routing[new_path] = database.upper()
            extract_routing(value, new_path)

    extract_routing(project_config)
    return schema_routing, database_routing


def parse_dbt_project_schema_routing(dbt_project: dict, project_name: str) -> Dict[str, str]:
    schema_routing, _ = parse_dbt_project_routing(dbt_project, project_name)
    return schema_routing


def get_model_relative_path(dbt_project_path: Path, model_path: Path) -> str:
    dbt_project_dir = dbt_project_path.parent
    model_relative = model_path.relative_to(dbt_project_dir)
    parts = model_relative.parts
    if parts and parts[0] == "models":
        return str(Path(*parts[1:]))
    return str(model_relative)


def find_matching_schema(
    model_relative_path: str, routing: Dict[str, str]
) -> Optional[str]:
    model_dir = str(Path(model_relative_path).parent)
    matches: List[Tuple[str, str]] = []
    for route_path, schema in routing.items():
        if model_dir == route_path or model_dir.startswith(route_path + "/"):
            matches.append((route_path, schema))
    if not matches:
        return None
    # Prefer the most specific (longest) matching route
    matches.sort(key=lambda x: len(x[0]), reverse=True)
    return matches[0][1]


def apply_schema_prefix(schema: str, target_schema: str = "PROD") -> str:
    if not schema or schema.upper() == target_schema.upper():
        return target_schema.upper()
    return f"{target_schema.upper()}_{schema.upper()}"


def resolve_schema(
    dbt_project_path: Union[str, Path],
    model_path: Union[str, Path],
    default_schema: str = "PROD",
    apply_prefix: bool = True,
) -> str:
    dbt_project_path = Path(dbt_project_path)
    model_path = Path(model_path)

    # 1. A schema set in the model's own config block wins
    model_content = model_path.read_text()
    config_schema = parse_model_config_schema(model_content)
    if config_schema:
        if apply_prefix:
            return apply_schema_prefix(config_schema, default_schema)
        return config_schema

    # 2. Otherwise fall back to directory routing in dbt_project.yml
    with open(dbt_project_path) as f:
        dbt_project = yaml.safe_load(f)
    project_name = dbt_project.get("name", "")
    routing = parse_dbt_project_schema_routing(dbt_project, project_name)
    model_relative = get_model_relative_path(dbt_project_path, model_path)
    matched_schema = find_matching_schema(model_relative, routing)
    if matched_schema:
        if apply_prefix:
            return apply_schema_prefix(matched_schema, default_schema)
        return matched_schema

    # 3. No config and no routing match: use the default
    return default_schema.upper()


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Resolve the output schema for a dbt model"
    )
    parser.add_argument("dbt_project_path", help="Path to dbt_project.yml")
    parser.add_argument("model_path", help="Path to the model SQL file")
    parser.add_argument("--default", default="PROD", help="Default schema (default: PROD)")
    parser.add_argument("--no-prefix", action="store_true", help="Don't apply PROD_ prefix")
    args = parser.parse_args()

    dbt_project_path = Path(args.dbt_project_path)
    model_path = Path(args.model_path)
    if not dbt_project_path.exists():
        print(f"Error: dbt_project.yml not found: {dbt_project_path}", file=sys.stderr)
        sys.exit(1)
    if not model_path.exists():
        print(f"Error: Model file not found: {model_path}", file=sys.stderr)
        sys.exit(1)

    apply_prefix = not args.no_prefix
    schema = resolve_schema(dbt_project_path, model_path, args.default, apply_prefix)
    print(schema)


if __name__ == "__main__":
    main()
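The `apply_schema_prefix` rule is the core of the naming convention this script implements, and it can be exercised in isolation. Restated standalone:

```python
def apply_schema_prefix(schema: str, target_schema: str = "PROD") -> str:
    # Restated from the script above: a custom schema gets a "<target>_" prefix,
    # while an empty or target-matching schema collapses to the target itself
    if not schema or schema.upper() == target_schema.upper():
        return target_schema.upper()
    return f"{target_schema.upper()}_{schema.upper()}"

print(apply_schema_prefix("stage"))    # PROD_STAGE
print(apply_schema_prefix("prod"))     # PROD
print(apply_schema_prefix("", "DEV"))  # DEV
```

This mirrors dbt's default `generate_schema_name` behavior of concatenating the target schema with the custom schema, which is why resolved names like "PROD_STAGE" and "PROD_LINEAGE" appear in the docstring.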