mirror of https://github.com/sickn33/antigravity-awesome-skills.git
synced 2026-04-25 17:25:12 +02:00
chore: sync repo state [ci skip]
@@ -1,7 +1,7 @@
{
"name": "antigravity-awesome-skills",
"version": "9.10.0",
"description": "Plugin-safe Claude Code distribution of Antigravity Awesome Skills with 1,378 supported skills.",
"description": "Plugin-safe Claude Code distribution of Antigravity Awesome Skills with 1,382 supported skills.",
"author": {
"name": "sickn33 and contributors",
"url": "https://github.com/sickn33/antigravity-awesome-skills"

10 CATALOG.md
@@ -2,7 +2,7 @@

Generated at: 2026-02-08T00:00:00.000Z

Total skills: 1392
Total skills: 1396

## architecture (93)

@@ -183,7 +183,7 @@ Total skills: 1392
| `warren-buffett` | Agent simulating Warren Buffett, the greatest investor of the 20th and 21st centuries, CEO of Berkshire Hathaway, disciple of Benjamin Graham and intellectual partner of Char... | persona, investing, value-investing, business | persona, investing, value-investing, business, warren, buffett, agente, que, simula, maior, investidor, do |
| `whatsapp-automation` | Automate WhatsApp Business tasks via Rube MCP (Composio): send messages, manage templates, upload media, and handle contacts. Always search tools first for c... | whatsapp | whatsapp, automation, automate, business, tasks, via, rube, mcp, composio, send, messages, upload |

## data-ai (262)
## data-ai (264)

| Skill | Description | Tags | Triggers |
| --- | --- | --- | --- |

@@ -367,6 +367,8 @@ Total skills: 1392
| `maxia` | Connect to MAXIA AI-to-AI marketplace on Solana. Discover, buy, sell AI services. Earn USDC. 13 MCP tools, A2A protocol, DeFi yields, sentiment analysis, rug... | solana, crypto, marketplace, ai-agents, mcp, defi, usdc, web3, a2a | solana, crypto, marketplace, ai-agents, mcp, defi, usdc, web3, a2a, maxia, connect, ai |
| `mlops-engineer` | Build comprehensive ML pipelines, experiment tracking, and model registries with MLflow, Kubeflow, and modern MLOps tools. | mlops | mlops, engineer, ml, pipelines, experiment, tracking, model, registries, mlflow, kubeflow |
| `molykit` | CRITICAL: Use for MolyKit AI chat toolkit. Triggers on: BotClient, OpenAI, SSE streaming, AI chat, molykit, PlatformSend, spawn(), ThreadToken, cross-platfor... | molykit | molykit, critical, ai, chat, toolkit, triggers, botclient, openai, sse, streaming, platformsend, spawn |
| `monte-carlo-push-ingestion` | Expert guide for pushing metadata, lineage, and query logs to Monte Carlo from any data warehouse. | data-observability, ingestion, monte-carlo, pycarlo, metadata | data-observability, ingestion, monte-carlo, pycarlo, metadata, monte, carlo, push, pushing, lineage, query, logs |
| `monte-carlo-validation-notebook` | Generates SQL validation notebooks for dbt PR changes with before/after comparison queries. | data-observability, validation, dbt, monte-carlo, sql-notebook | data-observability, validation, dbt, monte-carlo, sql-notebook, monte, carlo, notebook, generates, sql, notebooks, pr |
| `moyu` | Anti-over-engineering guardrail that activates when an AI coding agent expands scope, adds abstractions, or changes files the user did not request. | moyu | moyu, anti, engineering, guardrail, activates, ai, coding, agent, expands, scope, adds, abstractions |
| `n8n-expression-syntax` | Validate n8n expression syntax and fix common errors. Use when writing n8n expressions, using {{}} syntax, accessing $json/$node variables, troubleshooting e... | n8n, expression, syntax | n8n, expression, syntax, validate, fix, common, errors, writing, expressions, accessing, json, node |
| `nanobanana-ppt-skills` | AI-powered PPT generation with document analysis and styled images | nanobanana, ppt, skills | nanobanana, ppt, skills, ai, powered, generation, document, analysis, styled, images |

@@ -996,7 +998,7 @@ Total skills: 1392
| `yann-lecun-tecnico` | Technical sub-skill of Yann LeCun. Covers CNNs, LeNet, backpropagation, JEPA (I-JEPA, V-JEPA, MC-JEPA), AMI (Advanced Machinery of Intelligence), Self-Supervise... | persona, cnn, jepa, self-supervised, pytorch | persona, cnn, jepa, self-supervised, pytorch, yann, lecun, tecnico, sub, skill, cnica, de |
| `youtube-summarizer` | Extract transcripts from YouTube videos and generate comprehensive, detailed summaries using intelligent analysis frameworks | video, summarization, transcription, youtube, content-analysis | video, summarization, transcription, youtube, content-analysis, summarizer, extract, transcripts, videos, generate, detailed, summaries |

## infrastructure (124)
## infrastructure (126)

| Skill | Description | Tags | Triggers |
| --- | --- | --- | --- |

@@ -1090,6 +1092,8 @@ Total skills: 1392
| `microservices-patterns` | Master microservices architecture patterns including service boundaries, inter-service communication, data management, and resilience patterns for building d... | microservices | microservices, architecture, including, boundaries, inter, communication, data, resilience, building, distributed |
| `ml-engineer` | Build production ML systems with PyTorch 2.x, TensorFlow, and modern ML frameworks. Implements model serving, feature engineering, A/B testing, and monitoring. | ml | ml, engineer, pytorch, tensorflow, frameworks, implements, model, serving, feature, engineering, testing, monitoring |
| `ml-pipeline-workflow` | Complete end-to-end MLOps pipeline orchestration from data preparation through model deployment. | ml, pipeline | ml, pipeline, complete, mlops, orchestration, data, preparation, through, model, deployment |
| `monte-carlo-monitor-creation` | Guides creation of Monte Carlo monitors via MCP tools, producing monitors-as-code YAML for CI/CD deployment. | data-observability, monitoring, monte-carlo, monitors-as-code | data-observability, monitoring, monte-carlo, monitors-as-code, monte, carlo, monitor, creation, guides, monitors, via, mcp |
| `monte-carlo-prevent` | Surfaces Monte Carlo data observability context (table health, alerts, lineage, blast radius) before SQL/dbt edits. | data-observability, dbt, schema, monte-carlo, lineage | data-observability, dbt, schema, monte-carlo, lineage, monte, carlo, prevent, surfaces, data, observability, context |
| `moodle-external-api-development` | This skill guides you through creating custom external web service APIs for Moodle LMS, following Moodle's external API framework and coding standards. | moodle, external, api | moodle, external, api, development, skill, guides, through, creating, custom, web, apis, lms |
| `multi-agent-task-orchestrator` | Route tasks to specialized AI agents with anti-duplication, quality gates, and 30-minute heartbeat monitoring | multi-agent, orchestration, task-routing, quality-gates, anti-duplication | multi-agent, orchestration, task-routing, quality-gates, anti-duplication, multi, agent, task, orchestrator, route, tasks, specialized |
| `multi-cloud-architecture` | Decision framework and patterns for architecting applications across AWS, Azure, and GCP. | multi, cloud, architecture | multi, cloud, architecture, decision, framework, architecting, applications, aws, azure, gcp |

15 README.md
@@ -1,13 +1,13 @@
<!-- registry-sync: version=9.10.0; skills=1392; stars=31557; updated_at=2026-04-08T16:50:54+00:00 -->
# 🌌 Antigravity Awesome Skills: 1,392+ Agentic Skills for Claude Code, Gemini CLI, Cursor, Copilot & More
<!-- registry-sync: version=9.10.0; skills=1396; stars=31557; updated_at=2026-04-08T16:50:54+00:00 -->
# 🌌 Antigravity Awesome Skills: 1,396+ Agentic Skills for Claude Code, Gemini CLI, Cursor, Copilot & More

> **Installable GitHub library of 1,392+ agentic skills for Claude Code, Cursor, Codex CLI, Gemini CLI, Antigravity, and other AI coding assistants.**
> **Installable GitHub library of 1,396+ agentic skills for Claude Code, Cursor, Codex CLI, Gemini CLI, Antigravity, and other AI coding assistants.**

Antigravity Awesome Skills is an installable GitHub library and npm installer for reusable `SKILL.md` playbooks. It is designed for Claude Code, Cursor, Codex CLI, Gemini CLI, Antigravity, Kiro, OpenCode, GitHub Copilot, and other AI coding assistants that benefit from structured operating instructions. Instead of collecting one-off prompt snippets, this repository gives you a searchable, installable catalog of skills, bundles, workflows, plugin-safe distributions, and practical docs that help agents perform recurring tasks with better context, stronger constraints, and clearer outputs.

You can use this repo to install a broad multi-tool skill library, start from role-based bundles, or jump into workflow-driven execution for planning, coding, debugging, testing, security review, infrastructure, product work, and growth tasks. The root README is intentionally a high-signal landing page: understand what the project is, install it quickly, choose the right tool path, and then follow deeper docs only when you need them.

**Start here:** [Star the repo](https://github.com/sickn33/antigravity-awesome-skills/stargazers) · [Install in 1 minute](#installation) · [Choose your tool](#choose-your-tool) · [Best skills by tool](#best-skills-by-tool) · [📚 Browse 1,392+ Skills](#browse-1392-skills) · [Bundles](docs/users/bundles.md) · [Workflows](docs/users/workflows.md) · [Plugins for Claude Code and Codex](docs/users/plugins.md)
**Start here:** [Star the repo](https://github.com/sickn33/antigravity-awesome-skills/stargazers) · [Install in 1 minute](#installation) · [Choose your tool](#choose-your-tool) · [Best skills by tool](#best-skills-by-tool) · [📚 Browse 1,396+ Skills](#browse-1396-skills) · [Bundles](docs/users/bundles.md) · [Workflows](docs/users/workflows.md) · [Plugins for Claude Code and Codex](docs/users/plugins.md)

[](https://github.com/sickn33/antigravity-awesome-skills/stargazers)
[](LICENSE)

@@ -28,7 +28,7 @@ You can use this repo to install a broad multi-tool skill library, start from ro
- **Installable, not just inspirational**: use `npx antigravity-awesome-skills` to put skills where your tool expects them.
- **Built for major agent workflows**: Claude Code, Cursor, Codex CLI, Gemini CLI, Antigravity, Kiro, OpenCode, Copilot, and more.
- **Broad coverage with real utility**: 1,392+ skills across development, testing, security, infrastructure, product, and marketing.
- **Broad coverage with real utility**: 1,396+ skills across development, testing, security, infrastructure, product, and marketing.
- **Faster onboarding**: bundles and workflows reduce the time from "I found this repo" to "I used my first skill".
- **Useful whether you want breadth or curation**: browse the full catalog, start with top bundles, or compare alternatives before installing.

@@ -127,7 +127,7 @@ Start with [Bundles](docs/users/bundles.md) for role-based recommendations, [Wor

## Best Skills By Tool

If you want a faster answer than "browse all 1,392+ skills", start with a tool-specific guide:
If you want a faster answer than "browse all 1,396+ skills", start with a tool-specific guide:

- **[Claude Code skills](docs/users/claude-code-skills.md)**: install paths, starter skills, prompt examples, and plugin marketplace flow.
- **[Cursor skills](docs/users/cursor-skills.md)**: best starter skills for `.cursor/skills/`, UI-heavy work, and pair-programming flows.

@@ -183,7 +183,7 @@ If Antigravity starts hitting context limits with too many active skills, the ac
If you use OpenCode or another `.agents/skills` host, prefer a reduced install up front instead of copying the full library into a context-sensitive runtime. The installer now supports `--risk`, `--category`, and `--tags` so you can keep the installed set narrow.

## Browse 1,392+ Skills
## Browse 1,396+ Skills

Use the root repo as a landing page, then jump into the deeper surface that matches your intent.

@@ -540,6 +540,7 @@ We officially thank the following contributors for their help in making this rep
- [@warmskull](https://github.com/warmskull)
- [@Wittlesus](https://github.com/Wittlesus)
- [@digitamaz](https://github.com/digitamaz)
- [@cryptoque](https://github.com/cryptoque)

## Star History

@@ -12,6 +12,30 @@
<changefreq>weekly</changefreq>
<priority>0.7</priority>
</url>
<url>
<loc>http://localhost/skill/monte-carlo-monitor-creation</loc>
<lastmod>2026-04-09</lastmod>
<changefreq>weekly</changefreq>
<priority>0.7</priority>
</url>
<url>
<loc>http://localhost/skill/monte-carlo-prevent</loc>
<lastmod>2026-04-09</lastmod>
<changefreq>weekly</changefreq>
<priority>0.7</priority>
</url>
<url>
<loc>http://localhost/skill/monte-carlo-push-ingestion</loc>
<lastmod>2026-04-09</lastmod>
<changefreq>weekly</changefreq>
<priority>0.7</priority>
</url>
<url>
<loc>http://localhost/skill/monte-carlo-validation-notebook</loc>
<lastmod>2026-04-09</lastmod>
<changefreq>weekly</changefreq>
<priority>0.7</priority>
</url>
<url>
<loc>http://localhost/skill/ui-a11y</loc>
<lastmod>2026-04-09</lastmod>

@@ -222,28 +246,4 @@
<changefreq>weekly</changefreq>
<priority>0.7</priority>
</url>
<url>
<loc>http://localhost/skill/subject-line-psychologist</loc>
<lastmod>2026-04-09</lastmod>
<changefreq>weekly</changefreq>
<priority>0.7</priority>
</url>
<url>
<loc>http://localhost/skill/trust-calibrator</loc>
<lastmod>2026-04-09</lastmod>
<changefreq>weekly</changefreq>
<priority>0.7</priority>
</url>
<url>
<loc>http://localhost/skill/ux-persuasion-engineer</loc>
<lastmod>2026-04-09</lastmod>
<changefreq>weekly</changefreq>
<priority>0.7</priority>
</url>
<url>
<loc>http://localhost/skill/visual-emotion-engineer</loc>
<lastmod>2026-04-09</lastmod>
<changefreq>weekly</changefreq>
<priority>0.7</priority>
</url>
</urlset>

@@ -18609,6 +18609,94 @@
"reasons": []
}
},
{
"id": "monte-carlo-monitor-creation",
"path": "skills/monte-carlo-monitor-creation",
"category": "data",
"name": "monte-carlo-monitor-creation",
"description": "Guides creation of Monte Carlo monitors via MCP tools, producing monitors-as-code YAML for CI/CD deployment.",
"risk": "safe",
"source": "community",
"date_added": "2026-04-08",
"plugin": {
"targets": {
"codex": "supported",
"claude": "supported"
},
"setup": {
"type": "none",
"summary": "",
"docs": null
},
"reasons": []
}
},
{
"id": "monte-carlo-prevent",
"path": "skills/monte-carlo-prevent",
"category": "data",
"name": "monte-carlo-prevent",
"description": "Surfaces Monte Carlo data observability context (table health, alerts, lineage, blast radius) before SQL/dbt edits.",
"risk": "safe",
"source": "community",
"date_added": "2026-04-08",
"plugin": {
"targets": {
"codex": "supported",
"claude": "supported"
},
"setup": {
"type": "none",
"summary": "",
"docs": null
},
"reasons": []
}
},
{
"id": "monte-carlo-push-ingestion",
"path": "skills/monte-carlo-push-ingestion",
"category": "data",
"name": "monte-carlo-push-ingestion",
"description": "Expert guide for pushing metadata, lineage, and query logs to Monte Carlo from any data warehouse.",
"risk": "safe",
"source": "community",
"date_added": "2026-04-08",
"plugin": {
"targets": {
"codex": "supported",
"claude": "supported"
},
"setup": {
"type": "none",
"summary": "",
"docs": null
},
"reasons": []
}
},
{
"id": "monte-carlo-validation-notebook",
"path": "skills/monte-carlo-validation-notebook",
"category": "data",
"name": "monte-carlo-validation-notebook",
"description": "Generates SQL validation notebooks for dbt PR changes with before/after comparison queries.",
"risk": "safe",
"source": "community",
"date_added": "2026-04-08",
"plugin": {
"targets": {
"codex": "supported",
"claude": "supported"
},
"setup": {
"type": "none",
"summary": "",
"docs": null
},
"reasons": []
}
},
{
"id": "moodle-external-api-development",
"path": "skills/moodle-external-api-development",
@@ -94,6 +94,8 @@
"llm-application-optimize": "llm-application-dev-prompt-optimize",
"machine-learning-pipeline": "machine-learning-ops-ml-pipeline",
"microsoft-azure-dotnet": "microsoft-azure-webjobs-extensions-authentication-events-dotnet",
"monte-carlo-creation": "monte-carlo-monitor-creation",
"monte-carlo-notebook": "monte-carlo-validation-notebook",
"moodle-external-development": "moodle-external-api-development",
"multi-agent-orchestrator": "multi-agent-task-orchestrator",
"multi-platform-apps": "multi-platform-apps-multi-platform",

@@ -611,6 +611,9 @@
"mailchimp-automation",
"microservices-patterns",
"ml-pipeline-workflow",
"monte-carlo-prevent",
"monte-carlo-push-ingestion",
"monte-carlo-validation-notebook",
"n8n-expression-syntax",
"native-data-fetching",
"neon-postgres",

@@ -741,6 +744,8 @@
"memory-forensics",
"ml-engineer",
"ml-pipeline-workflow",
"monte-carlo-monitor-creation",
"monte-carlo-prevent",
"multi-agent-task-orchestrator",
"observability-engineer",
"observability-monitoring-monitor-setup",

@@ -1233,6 +1238,7 @@
"landing-page-generator",
"llm-structured-output",
"local-legal-seo-audit",
"monte-carlo-prevent",
"odoo-ecommerce-configurator",
"postgres-best-practices",
"postgresql",
@@ -1,6 +1,6 @@
{
"generatedAt": "2026-02-08T00:00:00.000Z",
"total": 1392,
"total": 1396,
"skills": [
{
"id": "00-andruia-consultant",

@@ -20828,6 +20828,117 @@
],
"path": "skills/monorepo-management/SKILL.md"
},
{
"id": "monte-carlo-monitor-creation",
"name": "monte-carlo-monitor-creation",
"description": "Guides creation of Monte Carlo monitors via MCP tools, producing monitors-as-code YAML for CI/CD deployment.",
"category": "infrastructure",
"tags": [
"data-observability",
"monitoring",
"monte-carlo",
"monitors-as-code"
],
"triggers": [
"data-observability",
"monitoring",
"monte-carlo",
"monitors-as-code",
"monte",
"carlo",
"monitor",
"creation",
"guides",
"monitors",
"via",
"mcp"
],
"path": "skills/monte-carlo-monitor-creation/SKILL.md"
},
{
"id": "monte-carlo-prevent",
"name": "monte-carlo-prevent",
"description": "Surfaces Monte Carlo data observability context (table health, alerts, lineage, blast radius) before SQL/dbt edits.",
"category": "infrastructure",
"tags": [
"data-observability",
"dbt",
"schema",
"monte-carlo",
"lineage"
],
"triggers": [
"data-observability",
"dbt",
"schema",
"monte-carlo",
"lineage",
"monte",
"carlo",
"prevent",
"surfaces",
"data",
"observability",
"context"
],
"path": "skills/monte-carlo-prevent/SKILL.md"
},
{
"id": "monte-carlo-push-ingestion",
"name": "monte-carlo-push-ingestion",
"description": "Expert guide for pushing metadata, lineage, and query logs to Monte Carlo from any data warehouse.",
"category": "data-ai",
"tags": [
"data-observability",
"ingestion",
"monte-carlo",
"pycarlo",
"metadata"
],
"triggers": [
"data-observability",
"ingestion",
"monte-carlo",
"pycarlo",
"metadata",
"monte",
"carlo",
"push",
"pushing",
"lineage",
"query",
"logs"
],
"path": "skills/monte-carlo-push-ingestion/SKILL.md"
},
{
"id": "monte-carlo-validation-notebook",
"name": "monte-carlo-validation-notebook",
"description": "Generates SQL validation notebooks for dbt PR changes with before/after comparison queries.",
"category": "data-ai",
"tags": [
"data-observability",
"validation",
"dbt",
"monte-carlo",
"sql-notebook"
],
"triggers": [
"data-observability",
"validation",
"dbt",
"monte-carlo",
"sql-notebook",
"monte",
"carlo",
"notebook",
"generates",
"sql",
"notebooks",
"pr"
],
"path": "skills/monte-carlo-validation-notebook/SKILL.md"
},
{
"id": "moodle-external-api-development",
"name": "moodle-external-api-development",
@@ -16178,6 +16178,82 @@
},
"runtime_files": []
},
{
"id": "monte-carlo-monitor-creation",
"path": "skills/monte-carlo-monitor-creation",
"targets": {
"codex": "supported",
"claude": "supported"
},
"setup": {
"type": "none",
"summary": "",
"docs": null
},
"reasons": [],
"blocked_reasons": {
"codex": [],
"claude": []
},
"runtime_files": []
},
{
"id": "monte-carlo-prevent",
"path": "skills/monte-carlo-prevent",
"targets": {
"codex": "supported",
"claude": "supported"
},
"setup": {
"type": "none",
"summary": "",
"docs": null
},
"reasons": [],
"blocked_reasons": {
"codex": [],
"claude": []
},
"runtime_files": []
},
{
"id": "monte-carlo-push-ingestion",
"path": "skills/monte-carlo-push-ingestion",
"targets": {
"codex": "supported",
"claude": "supported"
},
"setup": {
"type": "none",
"summary": "",
"docs": null
},
"reasons": [],
"blocked_reasons": {
"codex": [],
"claude": []
},
"runtime_files": []
},
{
"id": "monte-carlo-validation-notebook",
"path": "skills/monte-carlo-validation-notebook",
"targets": {
"codex": "supported",
"claude": "supported"
},
"setup": {
"type": "none",
"summary": "",
"docs": null
},
"reasons": [],
"blocked_reasons": {
"codex": [],
"claude": []
},
"runtime_files": []
},
{
"id": "moodle-external-api-development",
"path": "skills/moodle-external-api-development",
@@ -26610,10 +26686,10 @@
}
],
"summary": {
"total_skills": 1392,
"total_skills": 1396,
"supported": {
"codex": 1363,
"claude": 1378
"codex": 1367,
"claude": 1382
},
"blocked": {
"codex": 29,
@@ -1,9 +1,9 @@
---
title: Jetski/Cortex + Gemini Integration Guide
description: "Use antigravity-awesome-skills with Jetski/Cortex without hitting context-window overflow with 1,392+ skills."
description: "Use antigravity-awesome-skills with Jetski/Cortex without hitting context-window overflow with 1,396+ skills."
---

# Jetski/Cortex + Gemini: safe integration with 1,392+ skills
# Jetski/Cortex + Gemini: safe integration with 1,396+ skills

This guide shows how to integrate the `antigravity-awesome-skills` repository with an agent based on **Jetski/Cortex + Gemini** (or similar frameworks) **without exceeding the model context window**.

@@ -23,7 +23,7 @@ Never do:
- concatenate all `SKILL.md` content into a single system prompt;
- re-inject the entire library for **every** request.

With over 1,392 skills, this approach fills the context window before user messages are even added, causing truncation.
With over 1,396 skills, this approach fills the context window before user messages are even added, causing truncation.

---
@@ -20,7 +20,7 @@ This example shows one way to integrate **antigravity-awesome-skills** with a Je
- How to enforce a **maximum number of skills per turn** via `maxSkillsPerTurn`.
- How to choose whether to **truncate or error** when too many skills are requested via `overflowBehavior`.

This pattern avoids context overflow when you have 1,392+ skills installed.
This pattern avoids context overflow when you have 1,396+ skills installed.
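The per-turn cap described above can be sketched as follows. The parameter names `maxSkillsPerTurn` and `overflowBehavior` come from the example guide itself; the function name and shape are a hypothetical illustration, not the repository's actual implementation:

```typescript
type OverflowBehavior = "truncate" | "error";

// Hypothetical sketch: enforce a per-turn skill budget before any
// SKILL.md content is injected into the model context.
function selectSkillsForTurn(
  requested: string[],
  maxSkillsPerTurn: number,
  overflowBehavior: OverflowBehavior
): string[] {
  if (requested.length <= maxSkillsPerTurn) {
    return requested;
  }
  if (overflowBehavior === "error") {
    // Fail loudly so the caller can narrow its request.
    throw new Error(
      `Requested ${requested.length} skills; limit is ${maxSkillsPerTurn}`
    );
  }
  // "truncate": silently keep only the first maxSkillsPerTurn skills.
  return requested.slice(0, maxSkillsPerTurn);
}
```

With `overflowBehavior: "truncate"` the agent keeps running with a reduced skill set; with `"error"` the oversized request is rejected before the context window can overflow.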

---

@@ -6,7 +6,7 @@ This document keeps the repository's GitHub-facing discovery copy aligned with t

Preferred positioning:

> Installable GitHub library of 1,392+ agentic skills for Claude Code, Cursor, Codex CLI, Gemini CLI, Antigravity, and other AI coding assistants.
> Installable GitHub library of 1,396+ agentic skills for Claude Code, Cursor, Codex CLI, Gemini CLI, Antigravity, and other AI coding assistants.

Key framing:

@@ -20,7 +20,7 @@ Key framing:

Preferred description:

> Installable GitHub library of 1,392+ agentic skills for Claude Code, Cursor, Codex CLI, Gemini CLI, Antigravity, and more. Includes installer CLI, bundles, workflows, and official/community skill collections.
> Installable GitHub library of 1,396+ agentic skills for Claude Code, Cursor, Codex CLI, Gemini CLI, Antigravity, and more. Includes installer CLI, bundles, workflows, and official/community skill collections.

Preferred homepage:

@@ -28,7 +28,7 @@ Preferred homepage:

Preferred social preview:

- use a clean preview image that says `1,392+ Agentic Skills`;
- use a clean preview image that says `1,396+ Agentic Skills`;
- mention Claude Code, Cursor, Codex CLI, and Gemini CLI;
- avoid dense text and tiny logos that disappear in social cards.

@@ -69,7 +69,7 @@ For manual updates, you need:

The update process refreshes:
- Skills index (`skills_index.json`)
- Web app skills data (`apps/web-app/public/skills.json`)
- All 1,392+ skills from the skills directory
- All 1,396+ skills from the skills directory

## When to Update

@@ -673,4 +673,4 @@ Found a skill that should be in a bundle? Or want to create a new bundle? [Open

---

_Last updated: March 2026 | Total Skills: 1,392+ | Total Bundles: 37_
_Last updated: March 2026 | Total Skills: 1,396+ | Total Bundles: 37_

@@ -12,7 +12,7 @@ Install the library into Claude Code, then invoke focused skills directly in the

## Why use this repo for Claude Code

- It includes 1,392+ skills instead of a narrow single-domain starter pack.
- It includes 1,396+ skills instead of a narrow single-domain starter pack.
- It supports the standard `.claude/skills/` path and the Claude Code plugin marketplace flow.
- It also ships generated bundle plugins so teams can install focused packs like `Essentials` or `Security Developer` from the marketplace metadata.
- It includes onboarding docs, bundles, and workflows so new users do not need to guess where to begin.

@@ -12,7 +12,7 @@ Install into the Gemini skills path, then ask Gemini to apply one skill at a tim

- It installs directly into the expected Gemini skills path.
- It includes both core software engineering skills and deeper agent/LLM-oriented skills.
- It helps new users get started with bundles and workflows rather than forcing a cold start from 1,392+ files.
- It helps new users get started with bundles and workflows rather than forcing a cold start from 1,396+ files.
- It is useful whether you want a broad internal skill library or a single repo to test many workflows quickly.

## Install Gemini CLI Skills

@@ -18,7 +18,7 @@ Kiro is AWS's agentic AI IDE that combines:

Kiro's agentic capabilities are enhanced by skills that provide:

- **Domain expertise** across 1,392+ specialized areas
- **Domain expertise** across 1,396+ specialized areas
- **Best practices** from Anthropic, OpenAI, Google, Microsoft, and AWS
- **Workflow automation** for common development tasks
- **AWS-specific patterns** for serverless, infrastructure, and cloud architecture

@@ -14,7 +14,7 @@ If you came in through a **Claude Code** or **Codex** plugin instead of a full l

When you ran `npx antigravity-awesome-skills` or cloned the repository, you:

✅ **Downloaded 1,392+ skill files** to your computer (default: `~/.gemini/antigravity/skills/`; or a custom path like `~/.agent/skills/` if you used `--path`)
✅ **Downloaded 1,396+ skill files** to your computer (default: `~/.gemini/antigravity/skills/`; or a custom path like `~/.agent/skills/` if you used `--path`)
✅ **Made them available** to your AI assistant
❌ **Did NOT enable them all automatically** (they're just sitting there, waiting)

@@ -34,7 +34,7 @@ Bundles are **curated groups** of skills organized by role. They help you decide

**Analogy:**

- You installed a toolbox with 1,392+ tools (✅ done)
- You installed a toolbox with 1,396+ tools (✅ done)
- Bundles are like **labeled organizer trays** saying: "If you're a carpenter, start with these 10 tools"
- You can either **pick skills from the tray** or install that tray as a focused marketplace bundle plugin

@@ -212,7 +212,7 @@ Let's actually use a skill right now. Follow these steps:

## Step 5: Picking Your First Skills (Practical Advice)

Don't try to use all 1,392+ skills at once. Here's a sensible approach:
Don't try to use all 1,396+ skills at once. Here's a sensible approach:

If you want a tool-specific starting point before choosing skills, use:

@@ -343,7 +343,7 @@ Usually no, but if your AI doesn't recognize a skill:

### "Can I load all skills into the model at once?"

No. Even though you have 1,392+ skills installed locally, you should **not** concatenate every `SKILL.md` into a single system prompt or context block.
No. Even though you have 1,396+ skills installed locally, you should **not** concatenate every `SKILL.md` into a single system prompt or context block.
|
||||
|
||||
The intended pattern is:
|
||||
|
||||
|
||||
@@ -34,7 +34,7 @@ antigravity-awesome-skills/
|
||||
├── 📄 CONTRIBUTING.md ← Contributor workflow
|
||||
├── 📄 CATALOG.md ← Full generated catalog
|
||||
│
|
||||
├── 📁 skills/ ← 1,392+ skills live here
|
||||
├── 📁 skills/ ← 1,396+ skills live here
|
||||
│ │
|
||||
│ ├── 📁 brainstorming/
|
||||
│ │ └── 📄 SKILL.md ← Skill definition
|
||||
@@ -47,7 +47,7 @@ antigravity-awesome-skills/
|
||||
│ │ └── 📁 2d-games/
|
||||
│ │ └── 📄 SKILL.md ← Nested skills also supported
|
||||
│ │
|
||||
│ └── ... (1,392+ total)
|
||||
│ └── ... (1,396+ total)
|
||||
│
|
||||
├── 📁 apps/
|
||||
│ └── 📁 web-app/ ← Interactive browser
|
||||
@@ -100,7 +100,7 @@ antigravity-awesome-skills/
|
||||
|
||||
```
|
||||
┌─────────────────────────┐
|
||||
│ 1,392+ SKILLS │
|
||||
│ 1,396+ SKILLS │
|
||||
└────────────┬────────────┘
|
||||
│
|
||||
┌────────────────────────┼────────────────────────┐
|
||||
@@ -201,7 +201,7 @@ If you want a workspace-style manual install instead, cloning into `.agent/skill
|
||||
│ ├── 📁 brainstorming/ │
|
||||
│ ├── 📁 stripe-integration/ │
|
||||
│ ├── 📁 react-best-practices/ │
|
||||
│ └── ... (1,392+ total) │
|
||||
│ └── ... (1,396+ total) │
|
||||
└─────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
|
||||
@@ -1,7 +1,7 @@
{
  "name": "antigravity-awesome-skills",
  "version": "9.10.0",
  "description": "1,392+ agentic skills for Claude Code, Gemini CLI, Cursor, Antigravity & more. Installer CLI.",
  "description": "1,396+ agentic skills for Claude Code, Gemini CLI, Cursor, Antigravity & more. Installer CLI.",
  "license": "MIT",
  "scripts": {
    "validate": "node tools/scripts/run-python.js tools/scripts/validate_skills.py",

@@ -1,7 +1,7 @@
{
  "name": "antigravity-awesome-skills",
  "version": "9.10.0",
  "description": "Plugin-safe Claude Code distribution of Antigravity Awesome Skills with 1,378 supported skills.",
  "description": "Plugin-safe Claude Code distribution of Antigravity Awesome Skills with 1,382 supported skills.",
  "author": {
    "name": "sickn33 and contributors",
    "url": "https://github.com/sickn33/antigravity-awesome-skills"

@@ -0,0 +1,222 @@
---
name: monte-carlo-monitor-creation
description: "Guides creation of Monte Carlo monitors via MCP tools, producing monitors-as-code YAML for CI/CD deployment."
category: data
risk: safe
source: community
source_repo: monte-carlo-data/mc-agent-toolkit
source_type: community
date_added: "2026-04-08"
author: monte-carlo-data
tags: [data-observability, monitoring, monte-carlo, monitors-as-code]
tools: [claude, cursor, codex]
---

# Monte Carlo Monitor Creation Skill

This skill teaches you to create Monte Carlo monitors correctly via MCP. Every creation tool runs in **dry-run mode** and returns monitors-as-code (MaC) YAML. No monitors are created directly -- the user applies the YAML via the Monte Carlo CLI or CI/CD.

Reference files live next to this skill file. **Use the Read tool** (not MCP resources) to access them:

- Metric monitor details: `references/metric-monitor.md` (relative to this file)
- Validation monitor details: `references/validation-monitor.md` (relative to this file)
- Custom SQL monitor details: `references/custom-sql-monitor.md` (relative to this file)
- Comparison monitor details: `references/comparison-monitor.md` (relative to this file)
- Table monitor details: `references/table-monitor.md` (relative to this file)

## When to activate this skill

Activate when the user:

- Asks to create, add, or set up a monitor (e.g. "add a monitor for...", "create a freshness check on...", "set up validation for...")
- Mentions monitoring a specific table, field, or metric
- Wants to check data quality rules or enforce data contracts
- Asks about monitoring options for a table or dataset
- Requests monitors-as-code YAML generation
- Wants to add monitoring after new transformation logic (when the prevent skill is not active)

## When NOT to activate this skill

Do not activate when the user is:

- Just querying data or exploring table contents
- Triaging or responding to active alerts (use the prevent skill's Workflow 3)
- Running impact assessments before code changes (use the prevent skill's Workflow 4)
- Asking about existing monitor configuration (use `getMonitors` directly)
- Editing or deleting existing monitors

---

## Available MCP tools

All tools are available via the `monte-carlo` MCP server.

| Tool | Purpose |
| ---------------------------- | ---------------------------------------------------------- |
| `testConnection` | Verify auth and connectivity before starting |
| `search` | Find tables/assets by name; use `include_fields` for columns |
| `getTable` | Schema, stats, metadata, domain membership, capabilities |
| `getValidationPredicates` | List available validation rule types for a warehouse |
| `getDomains` | List MC domains (only needed if table has no domain info) |
| `createMetricMonitorMac` | Generate metric monitor YAML (dry-run) |
| `createValidationMonitorMac` | Generate validation monitor YAML (dry-run) |
| `createComparisonMonitorMac` | Generate comparison monitor YAML (dry-run) |
| `createCustomSqlMonitorMac` | Generate custom SQL monitor YAML (dry-run) |
| `createTableMonitorMac` | Generate table monitor YAML (dry-run) |

---

## Monitor types

| Type | Tool | Use When |
| -------------- | ---------------------------- | -------- |
| **Metric** | `createMetricMonitorMac` | Track statistical metrics on fields (null rates, unique counts, numeric stats) or row count changes over time. Requires a timestamp field for aggregation. |
| **Validation** | `createValidationMonitorMac` | Row-level data quality checks with conditions (e.g. "field X is never null", "status is in allowed set"). Alerts on INVALID data. |
| **Custom SQL** | `createCustomSqlMonitorMac` | Run arbitrary SQL returning a single number and alert on thresholds. Most flexible; use when other types don't fit. |
| **Comparison** | `createComparisonMonitorMac` | Compare metrics between two tables (e.g. dev vs prod, source vs target). |
| **Table** | `createTableMonitorMac` | Monitor groups of tables for freshness, schema changes, and volume. Uses asset selection at database/schema level. |

---

## Procedure

Follow these steps in order. Do NOT skip steps.

### Validation Phase (Steps 1-3) -- MUST complete before any creation tool is called

The number one error pattern is agents skipping validation and calling a creation tool with guessed or incomplete parameters. **Every field in the creation call must be grounded in data retrieved during this phase.** Do not proceed to Step 4 until Steps 1-3 are fully satisfied.

#### Step 1: Understand the request

Ask yourself:
- What does the user want to monitor? (a specific table, a metric, a data quality rule, cross-table consistency, freshness/volume at schema level)
- Which monitor type fits? Use the monitor types table above.
- Does the user have all the details, or do they need guidance?

If the user's intent is unclear, ask a focused question before proceeding.

#### Step 2: Identify the table(s) and columns

If you don't have the table MCON:
1. Use `search` with the table name and `include_fields: ["field_names"]` to find the MCON and get column names.
2. If the user provided a full table ID like `database:schema.table`, search for it.
3. Once you have the MCON, call `getTable` with `include_fields: true` and `include_table_capabilities: true` to verify capabilities and get domain info.

If you already have the MCON:
1. Call `getTable` with the MCON, `include_fields: true`, and `include_table_capabilities: true`.

**CRITICAL: You need the actual column names from `getTable` results. NEVER guess or hallucinate column names.** This is the most common source of monitor creation failures.

For monitor types that require a timestamp column (metric monitors), review the column names and identify likely timestamp candidates. Present them to the user if ambiguous.

#### Step 3: Handle domain assignment

Monitors must be assigned to a domain that contains the table being monitored. The `getTable` response includes a `domains` list with `uuid` and `name`.

1. If `domains` is empty: skip domain assignment.
2. If `domains` has exactly one entry: default `domain_id` to that domain's UUID.
3. If `domains` has multiple entries: present only those domains and ask the user to pick.

Do NOT present all account domains as options -- only domains that contain the table are valid.

**ALWAYS check the table's `domains` BEFORE calling any creation tool.**

---

### Creation Phase (Steps 4-8)

Only enter this phase after the validation phase is complete with real data from MCP tools.

#### Step 4: Load the sub-skill reference

Based on the monitor type, read the detailed reference for parameter guidance:

- **Metric** -- Read the detailed reference: `references/metric-monitor.md` (relative to this file)
- **Validation** -- Read the detailed reference: `references/validation-monitor.md` (relative to this file)
- **Custom SQL** -- Read the detailed reference: `references/custom-sql-monitor.md` (relative to this file)
- **Comparison** -- Read the detailed reference: `references/comparison-monitor.md` (relative to this file)
- **Table** -- Read the detailed reference: `references/table-monitor.md` (relative to this file)

#### Step 5: Ask about scheduling

**Skip this step for table monitors.** Table monitors do not support the `schedule` field in MaC YAML — adding it will cause a validation error on `montecarlo monitors apply`. Table monitor scheduling is managed automatically by Monte Carlo.

For all other monitor types, the creation tools default to a fixed schedule running every 60 minutes. Present these options:

1. **Fixed interval** -- any integer for `interval_minutes` (30, 60, 90, 120, 360, 720, 1440, etc.)
2. **Dynamic** -- MC auto-determines when to run based on table update patterns.
3. **Loose** -- runs once per day.

Schedule format in MaC YAML:
- Fixed: `schedule: { type: fixed, interval_minutes: <N> }`
- Dynamic: `schedule: { type: dynamic }`
- Loose: `schedule: { type: loose, start_time: "00:00" }`
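
As a sketch, the three schedule forms map onto these dictionaries before being serialized into the YAML (helper name and `start_time` default are illustrative):

```python
def build_schedule(kind: str, interval_minutes: int = 60) -> dict:
    """Build the MaC `schedule` mapping for fixed, dynamic, or loose schedules."""
    if kind == "fixed":
        return {"type": "fixed", "interval_minutes": interval_minutes}
    if kind == "dynamic":
        return {"type": "dynamic"}
    if kind == "loose":
        return {"type": "loose", "start_time": "00:00"}
    raise ValueError(f"unknown schedule type: {kind}")
```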

#### Step 6: Confirm with the user

Before calling the creation tool, present the monitor configuration in plain language:
- Monitor type
- Target table (and columns if applicable)
- What it checks / what triggers an alert
- Domain assignment
- Schedule

Ask: "Does this look correct? I'll generate the monitor configuration."

**NEVER call the creation tool without user confirmation.**

#### Step 7: Create the monitor

Call the appropriate creation tool with the parameters built in previous steps. Always pass an MCON when possible. If only the table name is available, also pass the warehouse.

#### Step 8: Present results

**CRITICAL: Always include the YAML in your response.** The user needs copy-pasteable YAML.

1. If a non-default schedule was chosen, modify the schedule section in the YAML before presenting.
2. Wrap the YAML in the full MaC structure (see "MaC YAML format" section below).
3. ALWAYS present the full YAML in a ```yaml code block.
4. Explain where to put it and how to apply it (see below).
5. ALWAYS use ISO 8601 format for datetime values.
6. NEVER reformat YAML values returned by creation tools.

---

## MaC YAML format

The YAML returned by creation tools is the monitor definition. It must be wrapped in the standard MaC structure to be applied:

```yaml
montecarlo:
  <monitor_type>:
    - <returned yaml>
```

For example, a metric monitor would look like:

```yaml
montecarlo:
  metric:
    - <yaml returned by createMetricMonitorMac>
```
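
The wrapping step can be sketched with plain string handling — deliberately no YAML library, since the returned values must not be reformatted (function name is illustrative):

```python
def wrap_mac(monitor_type: str, returned_yaml: str) -> str:
    """Wrap a returned monitor definition in the MaC structure without
    reformatting it: the block becomes the first list item under the
    monitor-type key, indented but otherwise byte-for-byte unchanged."""
    lines = returned_yaml.strip().splitlines()
    item = ["    - " + lines[0]] + ["      " + ln for ln in lines[1:]]
    return "montecarlo:\n  {}:\n{}".format(monitor_type, "\n".join(item))
```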

**Important:** `montecarlo.yml` (without a directory path) is a separate Monte Carlo project configuration file -- it is NOT the same as a monitor definition file. Monitor definitions go in their own `.yml` files, typically in a `monitors/` directory or alongside dbt model schema files.

Tell the user:
- Save the YAML to a `.yml` file (e.g. `monitors/<table_name>.yml` or in their dbt schema)
- Apply via the Monte Carlo CLI: `montecarlo monitors apply --namespace <namespace>`
- Or integrate into CI/CD for automatic deployment on merge

---

## Common mistakes to avoid

- **NEVER guess column names.** Always get them from `getTable`.
- **NEVER skip the confirmation step** (Step 6).
- For metric monitors, `aggregate_time_field` MUST be a real timestamp column from the table.
- For validation monitors, conditions match INVALID data, not valid data.
- Always pass an MCON when possible. If only the table name is available, also pass the warehouse.
- **ALWAYS check the table's `domains` BEFORE calling any creation tool.**
- ALWAYS use ISO 8601 format for datetime values.
- NEVER reformat YAML values returned by creation tools.
- Do not call creation tools before the validation phase is complete.
@@ -0,0 +1,426 @@
# Comparison Monitor Reference

Detailed reference for building `createComparisonMonitorMac` tool calls.

## When to Use

Use a comparison monitor when the user wants to:

- Compare data between two tables (e.g., source vs target, dev vs prod)
- Validate data consistency after migration or replication
- Check row count parity across environments
- Compare field-level metrics between tables (null counts, sums, distributions)

---

## Pre-Step: Verify Both Tables and Fields

Before constructing alert conditions, you MUST verify that both tables exist and that any referenced fields are real columns. This is the most common source of comparison monitor failures.

1. **Resolve both MCONs.** Use `search` to find the source and target tables. If the user provided `database:schema.table` format, search for each to get the MCON.
2. **Get full schemas.** Call `getTable` with `include_fields: true` on BOTH the source table and the target table. You need the column lists from both.
3. **For field-level metrics, verify fields exist on both sides.** Confirm that `sourceField` exists in the source table's column list AND `targetField` exists in the target table's column list. Field names are case-sensitive on most warehouses.
4. **Check field type compatibility.** The metric must be compatible with the column types on both sides. For example, `NUMERIC_MEAN` requires numeric columns in both the source and target tables. If the source column is numeric but the target is a string, the comparison will fail.
5. If any field does not exist or types are incompatible, stop and ask the user to clarify. Do not guess.
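
The existence check in step 3 can be sketched like this, assuming you have already pulled the column lists from `getTable` (the schema shape and helper name are illustrative). Note the deliberately exact comparison -- field names are case-sensitive on most warehouses:

```python
def verify_fields(source_cols: set[str], target_cols: set[str],
                  source_field: str, target_field: str) -> list[str]:
    """Return a list of problems; an empty list means the fields check out."""
    problems = []
    if source_field not in source_cols:  # exact, case-sensitive match
        problems.append(f"source field {source_field!r} not in source table")
    if target_field not in target_cols:
        problems.append(f"target field {target_field!r} not in target table")
    return problems
```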

---

## Required Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | string | Unique identifier for the monitor. Use a descriptive slug (e.g., `orders_dev_prod_compare`). |
| `description` | string | Human-readable description of what the monitor checks. |
| `source_table` | string | Source table MCON (preferred) or `database:schema.table` format. If not an MCON, also pass `source_warehouse`. |
| `target_table` | string | Target table MCON (preferred) or `database:schema.table` format. If not an MCON, also pass `target_warehouse`. |
| `alert_conditions` | array | List of comparison conditions (see Alert Conditions below). |

## Optional Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `source_warehouse` | string | Warehouse name or UUID for the source table. Required if `source_table` is not an MCON. |
| `target_warehouse` | string | Warehouse name or UUID for the target table. Required if `target_table` is not an MCON. |
| `segment_fields` | array of string | Fields to segment the comparison by. Must exist in BOTH tables with the same name. |
| `domain_id` | string (uuid) | Domain UUID (use `getDomains` to list). Only one domain can be assigned per monitor. |

---

## Cross-Warehouse Comparisons

When the source and target tables live in different warehouses (e.g., comparing a Snowflake staging table against a BigQuery production table), you MUST provide both `source_warehouse` and `target_warehouse` explicitly. The tool cannot auto-resolve warehouses when tables are in different environments.

Even when both tables are MCONs, if they belong to different warehouses, pass both warehouse parameters to be safe. Omitting them in cross-warehouse scenarios causes silent failures or incorrect results.

Common cross-warehouse patterns:
- **Dev vs prod:** same warehouse type, different databases or schemas
- **Migration validation:** source in old warehouse, target in new warehouse
- **Replication checks:** primary warehouse vs replica or downstream warehouse

---

## Alert Conditions

Each condition compares a metric between the source and target tables.

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `metric` | string | Yes | The metric to compare (see Metrics Reference below). |
| `sourceField` | string | For field-level metrics | Column in the source table. Required for ALL metrics except `ROW_COUNT`. |
| `targetField` | string | For field-level metrics | Column in the target table. Required for ALL metrics except `ROW_COUNT`. |
| `thresholdValue` | number | No | Threshold for acceptable difference between source and target. |
| `isThresholdRelative` | boolean | No | `false` = absolute difference (default), `true` = percentage difference. |
| `customMetric` | object | No | Custom SQL expressions for source and target (see Custom Metrics below). |

---

## ROW_COUNT and Fields: A Critical Rule

> **NEVER pass `sourceField` or `targetField` when using the `ROW_COUNT` metric.**

`ROW_COUNT` is a table-level metric -- it counts all rows in the table, not values in a column. Passing field names with `ROW_COUNT` causes the API call to fail or produce unexpected behavior.

This is the single most common mistake with comparison monitors. Before submitting any alert condition with `ROW_COUNT`, verify that `sourceField` and `targetField` are both absent from the condition object.

| Metric | Fields needed? | What happens if you pass fields? |
|--------|---------------|----------------------------------|
| `ROW_COUNT` | **No -- NEVER pass fields** | API error or undefined behavior |
| All other metrics | **Yes -- always pass both fields** | Required for the comparison to work |
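
This rule is mechanical enough to check before every tool call. A minimal sketch (the condition shape mirrors the JSON examples below; the helper name is illustrative, and conditions using `customMetric` carry no `metric` key and are skipped):

```python
def check_condition(cond: dict) -> list[str]:
    """Validate one alert condition against the ROW_COUNT field rule."""
    problems = []
    has_fields = "sourceField" in cond or "targetField" in cond
    if cond.get("metric") == "ROW_COUNT" and has_fields:
        problems.append("ROW_COUNT is table-level: remove sourceField/targetField")
    if cond.get("metric") not in (None, "ROW_COUNT") and not (
        "sourceField" in cond and "targetField" in cond
    ):
        problems.append(f"{cond['metric']} needs both sourceField and targetField")
    return problems
```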

---

## Metrics Reference

### Table-level metric (no fields needed)

| Metric | Description |
|--------|-------------|
| `ROW_COUNT` | Compare total row counts between source and target. |

### Field-level metrics (require `sourceField` and `targetField`)

#### Uniqueness and duplicates

| Metric | Description |
|--------|-------------|
| `UNIQUE_COUNT` | Count of distinct values. |
| `DUPLICATE_COUNT` | Count of duplicate (non-unique) values. |
| `APPROX_DISTINCT_COUNT` | Approximate distinct count (faster on large tables). |

#### Null and empty checks

| Metric | Description |
|--------|-------------|
| `NULL_COUNT` | Count of null values. |
| `NON_NULL_COUNT` | Count of non-null values. |
| `EMPTY_STRING_COUNT` | Count of empty string values. |
| `TEXT_ALL_SPACES_COUNT` | Count of values that are all whitespace. |
| `NAN_COUNT` | Count of NaN values. |
| `TEXT_NULL_KEYWORD_COUNT` | Count of values containing null-like keywords (e.g., "NULL", "None"). |

#### Numeric statistics

| Metric | Description |
|--------|-------------|
| `NUMERIC_MEAN` | Mean of numeric field. |
| `NUMERIC_MEDIAN` | Median of numeric field. |
| `NUMERIC_MIN` | Minimum value. |
| `NUMERIC_MAX` | Maximum value. |
| `NUMERIC_STDDEV` | Standard deviation. |
| `SUM` | Sum of numeric field. |
| `ZERO_COUNT` | Count of zero values. |
| `NEGATIVE_COUNT` | Count of negative values. |

#### Percentiles

| Metric | Description |
|--------|-------------|
| `PERCENTILE_20` | 20th percentile value. |
| `PERCENTILE_40` | 40th percentile value. |
| `PERCENTILE_60` | 60th percentile value. |
| `PERCENTILE_80` | 80th percentile value. |

#### Text statistics

| Metric | Description |
|--------|-------------|
| `TEXT_MAX_LENGTH` | Maximum string length. |
| `TEXT_MIN_LENGTH` | Minimum string length. |
| `TEXT_MEAN_LENGTH` | Mean string length. |
| `TEXT_STD_LENGTH` | Standard deviation of string length. |

#### Text format checks

| Metric | Description |
|--------|-------------|
| `TEXT_NOT_INT_COUNT` | Count of values not parseable as integers. |
| `TEXT_NOT_NUMBER_COUNT` | Count of values not parseable as numbers. |
| `TEXT_NOT_UUID_COUNT` | Count of values not matching UUID format. |
| `TEXT_NOT_SSN_COUNT` | Count of values not matching SSN format. |
| `TEXT_NOT_US_PHONE_COUNT` | Count of values not matching US phone format. |
| `TEXT_NOT_US_STATE_CODE_COUNT` | Count of values not matching US state codes. |
| `TEXT_NOT_US_ZIP_CODE_COUNT` | Count of values not matching US zip codes. |
| `TEXT_NOT_EMAIL_ADDRESS_COUNT` | Count of values not matching email format. |
| `TEXT_NOT_TIMESTAMP_COUNT` | Count of values not parseable as timestamps. |

#### Boolean

| Metric | Description |
|--------|-------------|
| `TRUE_COUNT` | Count of true values. |
| `FALSE_COUNT` | Count of false values. |

#### Timestamp

| Metric | Description |
|--------|-------------|
| `FUTURE_TIMESTAMP_COUNT` | Count of timestamps in the future. |
| `PAST_TIMESTAMP_COUNT` | Count of timestamps unreasonably far in the past. |
| `UNIX_ZERO_COUNT` | Count of timestamps equal to Unix epoch zero (1970-01-01). |

---

## Choosing the Right Metric

| User intent | Correct metric | Fields needed? |
|-------------|---------------|----------------|
| Row count parity | `ROW_COUNT` | **No** -- never pass fields |
| Distinct values in a column | `UNIQUE_COUNT` | Yes |
| Null values in a column | `NULL_COUNT` | Yes |
| Sum, average, min, max | `SUM`, `NUMERIC_MEAN`, `NUMERIC_MIN`, `NUMERIC_MAX` | Yes |
| Data completeness | `NON_NULL_COUNT` | Yes |
| String format validation | `TEXT_NOT_EMAIL_ADDRESS_COUNT`, `TEXT_NOT_UUID_COUNT`, etc. | Yes |
| Custom computed expressions | Use `customMetric` instead of `metric` | No (SQL handles it) |

---

## Custom Metrics

Use custom metrics when:

- **Column names differ** between source and target and you need a computed expression (not just a direct field comparison).
- **You need a derived calculation** like `SUM(quantity * unit_price)` rather than a simple column metric.
- **Standard metrics do not cover the comparison** (e.g., comparing a ratio, a conditional aggregate, or a windowed calculation).

If the columns simply have different names but you want a standard metric (e.g., compare `SUM` of `revenue` in source vs `total_revenue` in target), you do NOT need a custom metric -- just use the standard metric with different `sourceField` and `targetField` values.

Custom metric structure:

```json
{
  "customMetric": {
    "displayName": "Revenue Sum",
    "sourceSqlExpression": "SUM(revenue)",
    "targetSqlExpression": "SUM(total_revenue)"
  }
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `displayName` | string | Yes | Human-readable name for the metric in alerts and dashboards. |
| `sourceSqlExpression` | string | Yes | SQL expression evaluated against the source table. |
| `targetSqlExpression` | string | Yes | SQL expression evaluated against the target table. |

When using `customMetric`, do NOT also pass `metric`, `sourceField`, or `targetField` in the same alert condition. The custom metric replaces all of those.

---

## Threshold Guidance

### Absolute thresholds (`isThresholdRelative: false` or omitted)

The `thresholdValue` is the maximum acceptable absolute difference between the source and target metric values.

- `thresholdValue: 0` -- source and target must match exactly.
- `thresholdValue: 100` -- up to 100 units of difference is acceptable.

### Relative (percentage) thresholds (`isThresholdRelative: true`)

The `thresholdValue` is the maximum acceptable percentage difference.

- `thresholdValue: 5` -- up to 5% difference is acceptable.
- `thresholdValue: 0.1` -- up to 0.1% difference is acceptable.

### When to use each

| Scenario | Recommended threshold type |
|----------|---------------------------|
| Exact replication (row counts must match) | Absolute, `thresholdValue: 0` |
| Near-real-time sync with small lag | Absolute, small value (e.g., 10-100) |
| Tables at different scales | Relative, percentage-based |
| Aggregated metrics (sums, means) | Relative, to handle floating-point differences |
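
The two threshold types reduce to a simple comparison. A sketch, assuming the percentage difference is measured against the source value (the exact reference value Monte Carlo uses internally is not specified here):

```python
def breaches(source: float, target: float,
             threshold_value: float, is_relative: bool) -> bool:
    """True if the source/target difference exceeds the threshold."""
    diff = abs(source - target)
    if is_relative:
        # Percentage difference relative to the source value (assumption).
        return diff / abs(source) * 100 > threshold_value
    return diff > threshold_value
```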
|
||||
|
||||
---
|
||||
|
||||
## Examples
|
||||
|
||||
### Row count parity with absolute threshold
|
||||
|
||||
Compare row counts between dev and prod, alerting if they differ by more than 100 rows.
|
||||
|
||||
```json
|
||||
{
|
||||
"name": "orders_dev_prod_row_count",
|
||||
"description": "Verify dev and prod orders tables have similar row counts",
|
||||
"source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++dev_warehouse:core.orders",
|
||||
"target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++prod_warehouse:core.orders",
|
||||
"alert_conditions": [
|
||||
{
|
||||
"metric": "ROW_COUNT",
|
||||
"thresholdValue": 100,
|
||||
"isThresholdRelative": false
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Note: no `sourceField` or `targetField` -- `ROW_COUNT` is table-level.
|
||||
|
||||
### Row count parity with percentage threshold
|
||||
|
||||
Alert if row counts differ by more than 5%.
|
||||
|
||||
```json
|
||||
{
|
||||
"name": "orders_replication_check",
|
||||
"description": "Verify replicated orders table is within 5% of source row count",
|
||||
"source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++primary:sales.orders",
|
||||
"target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++replica:sales.orders",
|
||||
"alert_conditions": [
|
||||
{
|
||||
"metric": "ROW_COUNT",
|
||||
"thresholdValue": 5,
|
||||
"isThresholdRelative": true
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Field-level comparison (different column names)
|
||||
|
||||
Compare the sum of `revenue` in the source table against `total_revenue` in the target table.
|
||||
|
||||
```json
|
||||
{
|
||||
"name": "revenue_source_target_sum",
|
||||
"description": "Verify revenue sums match between staging and production",
|
||||
"source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++staging:finance.transactions",
|
||||
"target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++production:finance.transactions",
|
||||
"alert_conditions": [
|
||||
{
|
||||
"metric": "SUM",
|
||||
"sourceField": "revenue",
|
||||
"targetField": "total_revenue",
|
||||
"thresholdValue": 1,
|
||||
"isThresholdRelative": true
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Segmented comparison
|
||||
|
||||
Compare null counts on `email` between source and target, segmented by `country`. The `country` field must exist in both tables.
|
||||
|
||||
```json
|
||||
{
|
||||
"name": "email_nulls_by_country",
|
||||
"description": "Compare email null counts by country between ETL source and target",
|
||||
"source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++raw:crm.contacts",
|
||||
"target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++analytics:crm.contacts",
|
||||
"segment_fields": ["country"],
|
||||
"alert_conditions": [
|
||||
{
|
||||
"metric": "NULL_COUNT",
|
||||
"sourceField": "email",
|
||||
"targetField": "email",
|
||||
"thresholdValue": 0,
|
||||
"isThresholdRelative": false
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Cross-warehouse comparison with explicit warehouses
|
||||
|
||||
When source and target are in different warehouses, both warehouse parameters must be provided.
|
||||
|
||||
```json
|
||||
{
|
||||
"name": "migration_users_row_count",
|
||||
"description": "Validate user row counts match after Snowflake to BigQuery migration",
|
||||
"source_table": "snowflake_db:public.users",
|
||||
"source_warehouse": "snowflake-prod",
|
||||
"target_table": "bigquery_project:public.users",
|
||||
"target_warehouse": "bigquery-prod",
|
||||
"alert_conditions": [
|
||||
{
|
||||
"metric": "ROW_COUNT",
|
||||
"thresholdValue": 0,
|
||||
"isThresholdRelative": false
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Custom metric comparison

Compare a computed revenue expression when the SQL differs between source and target.

```json
{
  "name": "computed_revenue_compare",
  "description": "Compare total revenue computation between legacy and new schema",
  "source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++warehouse:legacy.orders",
  "target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++warehouse:v2.orders",
  "alert_conditions": [
    {
      "customMetric": {
        "displayName": "Total Revenue",
        "sourceSqlExpression": "SUM(quantity * unit_price)",
        "targetSqlExpression": "SUM(total_amount)"
      },
      "thresholdValue": 0.01,
      "isThresholdRelative": true
    }
  ]
}
```

### Multiple alert conditions

Compare both row counts and field-level metrics in a single monitor.

```json
{
  "name": "orders_full_comparison",
  "description": "Full comparison of orders between staging and production",
  "source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++staging:core.orders",
  "target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++production:core.orders",
  "domain_id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "alert_conditions": [
    {
      "metric": "ROW_COUNT",
      "thresholdValue": 0,
      "isThresholdRelative": false
    },
    {
      "metric": "NULL_COUNT",
      "sourceField": "customer_id",
      "targetField": "customer_id",
      "thresholdValue": 0,
      "isThresholdRelative": false
    },
    {
      "metric": "SUM",
      "sourceField": "amount",
      "targetField": "amount",
      "thresholdValue": 0.1,
      "isThresholdRelative": true
    }
  ]
}
```

Note: the `ROW_COUNT` condition has no fields, while the field-level conditions each specify both `sourceField` and `targetField`.

@@ -0,0 +1,207 @@

# Custom SQL Monitor Reference

Detailed reference for building `createCustomSqlMonitorMac` tool calls.

## When to Use

Use a custom SQL monitor when the user wants to:

- Run a specific SQL query and alert on its result
- Implement cross-table logic (joins, subqueries, CTEs)
- Apply business-specific aggregations or calculations that don't map to a single metric
- Monitor a condition that spans multiple columns or tables
- Use a SQL query they already have in mind

---

## The Universal Fallback

Custom SQL is the fallback monitor type. Reach for it whenever another monitor type cannot express what the user needs:

- **Validation monitor won't work** because the column doesn't exist yet, or the logic requires joins across tables.
- **Metric monitor can't express the business logic** -- for example, a ratio between two columns, a conditional aggregation, or a calculation that spans multiple tables.
- **Cross-table joins are needed** -- metric and validation monitors operate on a single table. If the check requires data from two or more tables, custom SQL is the only option.
- **The user already has a SQL query** -- don't force it into another monitor type. Wrap it in a custom SQL monitor.

If you find yourself contorting another monitor type to fit the user's intent, stop and use custom SQL instead.

## Required Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | string | Unique identifier for the monitor. Use a descriptive slug (e.g., `orphan_orders_check`). |
| `description` | string | Human-readable description of what the monitor checks. |
| `warehouse` | string | Warehouse name or UUID where the SQL query will be executed. |
| `sql` | string | SQL query that returns a **single numeric value** (one row, one column). |
| `alert_conditions` | array | List of threshold conditions (see Alert Conditions below). |

## Optional Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `domain_id` | string (uuid) | Domain UUID (use `getDomains` to list). Only one domain can be assigned per monitor. |

## Alert Conditions

Each alert condition compares the query result against a threshold.

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `operator` | string | Yes | `"GT"`, `"LT"`, `"EQ"`, `"GTE"`, `"LTE"`, `"NE"` |
| `thresholdValue` | number | Yes | Numeric threshold to compare the query result against. |

### No AUTO Support

Custom SQL monitors do **NOT** support `AUTO` (anomaly detection). You must specify an explicit operator and threshold for every alert condition. This is a common mistake -- if the user asks for anomaly detection, steer them toward a metric monitor instead, which does support `AUTO`.

If the user is unsure what threshold to set, help them reason about it: "What value would indicate a problem? If the query returns X, should that fire an alert?"

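The operator semantics above can be sketched in a few lines. This is an illustrative model of the comparison, not part of the MCP tool; the function name is hypothetical.

```python
# Sketch: how a custom SQL alert condition is evaluated. The query's single
# numeric result is compared against thresholdValue with the chosen operator.
OPERATORS = {
    "GT": lambda v, t: v > t,
    "LT": lambda v, t: v < t,
    "EQ": lambda v, t: v == t,
    "GTE": lambda v, t: v >= t,
    "LTE": lambda v, t: v <= t,
    "NE": lambda v, t: v != t,
}

def breaches(result: float, operator: str, threshold: float) -> bool:
    """Return True if the query result should fire an alert."""
    if operator not in OPERATORS:
        # Note: "AUTO" lands here -- custom SQL monitors have no anomaly mode.
        raise ValueError(f"unsupported operator: {operator}")
    return OPERATORS[operator](result, threshold)

print(breaches(5, "GT", 0))  # an orphan count of 5 fires a GT-0 condition
```

Walking a concrete number through this table is exactly the "test the query mentally" step recommended later in this document.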
---

## SQL Query Requirements

The SQL query MUST return exactly **one row with one numeric column**. This is non-negotiable -- the monitor compares that single value against the alert conditions.

### Rules

- Use aggregate functions: `COUNT(*)`, `SUM()`, `AVG()`, `MAX()`, `MIN()`, or similar.
- Can reference any table, view, or materialized view accessible in the warehouse.
- Can use joins, subqueries, CTEs, window functions -- any valid SQL.
- Do **NOT** include trailing semicolons.
- Do **NOT** include comments (`--` or `/* */`) -- some warehouses strip them inconsistently.

### SQL Validation Tips

These are the most common mistakes that cause custom SQL monitors to fail or produce misleading results:

1. **Handle NULLs with COALESCE.** If your aggregate could return NULL (e.g., `SUM(amount)` on an empty result set), wrap it: `SELECT COALESCE(SUM(amount), 0) FROM ...`. A NULL result cannot be compared against a threshold and will not trigger alerts.

2. **Ensure exactly one row, one column.** If your query could return zero rows (e.g., a filtered `SELECT` with no `GROUP BY`), wrap it in an outer aggregate: `SELECT COUNT(*) FROM (SELECT ...) sub`. If it returns multiple columns, select only the one you need.

3. **Test the query mentally.** Before finalizing, ask: "If this query returns 5, will the alert condition fire correctly?" Walk through the logic with a concrete number.

4. **For time-windowed checks, use appropriate date functions.** SQL syntax for date arithmetic varies by warehouse (see Warehouse-Specific SQL Notes below). Always scope time windows to avoid scanning the entire table history.

5. **Avoid non-deterministic results.** Queries using `LIMIT` without `ORDER BY`, or `RANDOM()`, produce unpredictable results that make alerting unreliable.

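The mechanical rules above (no trailing semicolon, no comments, one aggregate value) can be checked with a quick lint pass before submitting the monitor. This sketch is illustrative only; the function name is hypothetical and the checks are deliberately rough (string matching, not SQL parsing).

```python
import re

def lint_monitor_sql(sql: str) -> list[str]:
    """Rough pre-flight checks for a custom SQL monitor query (sketch)."""
    problems = []
    if sql.rstrip().endswith(";"):
        problems.append("remove the trailing semicolon")
    # Naive comment detection; may false-positive on '--' inside string literals.
    if "--" in sql or re.search(r"/\*.*?\*/", sql, re.DOTALL):
        problems.append("remove SQL comments")
    if not re.search(r"\b(COUNT|SUM|AVG|MAX|MIN)\s*\(", sql, re.IGNORECASE):
        problems.append("query should return a single aggregate value")
    return problems

print(lint_monitor_sql("SELECT COUNT(*) FROM orders;"))
# ['remove the trailing semicolon']
```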
---

## Warehouse-Specific SQL Notes

SQL syntax for date arithmetic and functions varies across warehouses. When writing time-windowed queries, use the correct syntax for the user's warehouse:

| Operation | Snowflake | BigQuery | Redshift |
|-----------|-----------|----------|----------|
| Subtract 1 day from now | `DATEADD(day, -1, CURRENT_TIMESTAMP())` | `DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)` | `DATEADD(day, -1, GETDATE())` |
| Subtract 1 hour from now | `DATEADD(hour, -1, CURRENT_TIMESTAMP())` | `TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)` | `DATEADD(hour, -1, GETDATE())` |
| Current timestamp | `CURRENT_TIMESTAMP()` | `CURRENT_TIMESTAMP()` | `GETDATE()` |
| Date truncation | `DATE_TRUNC('day', col)` | `DATE_TRUNC(col, DAY)` | `DATE_TRUNC('day', col)` |

When unsure which warehouse the user is on, ask. Getting the syntax wrong causes the monitor to fail on every scheduled run.

---

## Examples

### Orphan records (GT 0)

Alert when orders reference customers that don't exist.

```json
{
  "name": "orphan_orders_check",
  "description": "Detect orders referencing non-existent customers",
  "warehouse": "production_snowflake",
  "sql": "SELECT COUNT(*) FROM analytics.core.orders o LEFT JOIN analytics.core.customers c ON o.customer_id = c.id WHERE c.id IS NULL",
  "alert_conditions": [
    {
      "operator": "GT",
      "thresholdValue": 0
    }
  ]
}
```

### Daily revenue floor (LT threshold)

Alert when total revenue for the past 24 hours drops below a minimum.

```json
{
  "name": "daily_revenue_floor",
  "description": "Alert when daily revenue falls below $10,000",
  "warehouse": "production_snowflake",
  "sql": "SELECT COALESCE(SUM(amount), 0) FROM analytics.billing.transactions WHERE created_at >= DATEADD(day, -1, CURRENT_TIMESTAMP())",
  "alert_conditions": [
    {
      "operator": "LT",
      "thresholdValue": 10000
    }
  ]
}
```

### Duplicate rate exceeds threshold

Alert when the duplicate rate on a key field exceeds 1%.

```json
{
  "name": "order_id_duplicate_rate",
  "description": "Alert when order_id duplicate rate exceeds 1%",
  "warehouse": "production_snowflake",
  "sql": "SELECT COALESCE(1.0 - (COUNT(DISTINCT order_id) * 1.0 / NULLIF(COUNT(*), 0)), 0) FROM analytics.core.orders WHERE created_at >= DATEADD(day, -1, CURRENT_TIMESTAMP())",
  "alert_conditions": [
    {
      "operator": "GT",
      "thresholdValue": 0.01
    }
  ]
}
```

### Multiple threshold conditions (range check)

Alert when a value falls outside an acceptable range. Multiple conditions act as independent checks -- each one that evaluates to true fires its own alert.

```json
{
  "name": "avg_order_amount_range",
  "description": "Alert when average order amount is outside the $20-$500 range",
  "warehouse": "production_snowflake",
  "sql": "SELECT COALESCE(AVG(amount), 0) FROM analytics.core.orders WHERE created_at >= DATEADD(day, -1, CURRENT_TIMESTAMP()) AND status = 'completed'",
  "alert_conditions": [
    {
      "operator": "LT",
      "thresholdValue": 20
    },
    {
      "operator": "GT",
      "thresholdValue": 500
    }
  ]
}
```

### Cross-table freshness check (BigQuery syntax)

Alert when the latest row in a downstream table is more than 2 hours behind the source.

```json
{
  "name": "pipeline_lag_check",
  "description": "Alert when downstream table lags source by more than 2 hours",
  "warehouse": "production_bigquery",
  "sql": "SELECT COALESCE(TIMESTAMP_DIFF(s.max_ts, t.max_ts, MINUTE), 9999) FROM (SELECT MAX(event_timestamp) AS max_ts FROM project.raw.events) s CROSS JOIN (SELECT MAX(processed_at) AS max_ts FROM project.analytics.events_processed) t",
  "alert_conditions": [
    {
      "operator": "GT",
      "thresholdValue": 120
    }
  ]
}
```

@@ -0,0 +1,292 @@

# Metric Monitor Reference

Detailed reference for building `createMetricMonitorMac` tool calls.

## When to Use

Use a metric monitor when the user wants to:

- Track row count changes over time
- Monitor null rates, unique counts, or other statistical metrics on specific fields
- Detect anomalies in numeric distributions (mean, max, min, percentiles)
- Monitor data freshness (time since last row count change)
- Segment metrics by dimensions (e.g., by country, status)

---

## Required Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | string | Unique identifier for the monitor. Use a descriptive slug (e.g., `orders_null_check`). |
| `description` | string | Human-readable description of what the monitor checks. |
| `table` | string | Table MCON (preferred) or `database:schema.table` format. If not MCON, also pass `warehouse`. |
| `aggregate_time_field` | string | **MUST be a real timestamp/datetime column from the table.** NEVER guess this value. |
| `alert_conditions` | array | List of alert condition objects (see Alert Conditions below). |

## Optional Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `warehouse` | string | auto-resolved | Warehouse name or UUID. Required if `table` is not an MCON. |
| `segment_fields` | array of string | none | Fields to group/segment metrics by (e.g., `["country", "status"]`). |
| `aggregate_by` | string | `"day"` | Time interval: `"hour"`, `"day"`, `"week"`, `"month"`. |
| `where_condition` | string | none | SQL WHERE clause (without `WHERE` keyword) to filter rows before computing metrics. |
| `interval_minutes` | int | auto | Schedule interval in minutes. Must be compatible with `aggregate_by` (see note below). If not specified, the tool defaults to the minimum valid interval for the chosen `aggregate_by`. |
| `domain_id` | string (uuid) | none | Domain UUID (use `getDomains` to list). |

---

## Schedule and Aggregation Compatibility

The schedule interval must be compatible with `aggregate_by`. Daily aggregation requires an interval that is a multiple of 1440 minutes (24 hours), weekly requires a multiple of 10080, etc. If you pass `interval_minutes`, make sure it satisfies this constraint. If you omit it, the tool picks a sensible default.

| `aggregate_by` | Minimum `interval_minutes` | Default if omitted |
|---|---|---|
| `hour` | 60 | 60 |
| `day` | 1440 | 1440 |
| `week` | 10080 | 10080 |
| `month` | 43200 | 43200 |

For example, to run a daily-aggregated monitor every other day, pass `aggregate_by: "day"` and `interval_minutes: 2880`.

---

## Choosing the Timestamp Field

The `aggregate_time_field` is the most critical parameter. It MUST be a real column from the table that contains timestamp or datetime values. This is the number one source of monitor creation failures.

### How to pick it

1. You should already have the column names from `getTable` with `include_fields: true` (done in Step 2 of the main skill).
2. Look for columns whose names suggest a timestamp: `created_at`, `updated_at`, `modified_at`, `timestamp`, `event_timestamp`, or columns with `_ts`, `_dt`, `_time` suffixes, or `date`, `datetime`.
3. If the user specified one, verify it exists in the column list.
4. If exactly one obvious candidate exists, suggest it.
5. If multiple candidates exist, present them and ask the user.
6. If NO obvious timestamp columns exist, suggest a custom SQL monitor instead (which does not need a timestamp field).

**NEVER** proceed without confirming the timestamp field exists in the table schema.

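The naming heuristics in step 2 can be expressed as a small filter over the `getTable` column list. The patterns are transcribed from the list above; the function name is illustrative:

```python
# Exact names and suffixes that suggest a timestamp column (from step 2).
EXACT = {"created_at", "updated_at", "modified_at", "timestamp",
         "event_timestamp", "date", "datetime"}
SUFFIXES = ("_ts", "_dt", "_time")

def timestamp_candidates(columns: list[str]) -> list[str]:
    """Return columns whose names suggest a timestamp, preserving order."""
    return [c for c in columns
            if c.lower() in EXACT or c.lower().endswith(SUFFIXES)]

cols = ["id", "amount", "created_at", "processed_ts", "status"]
print(timestamp_candidates(cols))  # ['created_at', 'processed_ts']
```

A name matching these patterns is only a candidate -- the column's actual data type still has to be confirmed against the schema before use.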
### Common timestamp field mistakes

- **Using a DATE column (not TIMESTAMP):** This may work, but aggregation granularity is limited. For example, `aggregate_by: "hour"` is meaningless on a DATE column because the time component is always midnight. Warn the user and default to `aggregate_by: "day"` or coarser.
- **Using a field that contains many nulls:** If the timestamp column has significant null values, rows with null timestamps are excluded from aggregation windows, producing unreliable or misleading results. Check the column's null rate from `getTable` field stats if available, and warn the user if it is high.
- **Guessing a field name that does not exist:** Always verify the column name against the `getTable` output. A typo or assumed name (e.g., `created_date` when the actual column is `created_at`) causes the monitor creation to fail silently or error.

---

## Field-Type-to-Metric Compatibility Matrix

**Before selecting a metric, check the column's data type from `getTable` results.** Passing a metric incompatible with the column type is the most common source of creation failures after timestamp issues.

| Column Type | Compatible Metrics |
|-------------|-------------------|
| **Numeric** (int, float, decimal, bigint) | `NUMERIC_MEAN`, `NUMERIC_MEDIAN`, `NUMERIC_MIN`, `NUMERIC_MAX`, `NUMERIC_STDDEV`, `SUM`, `ZERO_COUNT`, `ZERO_RATE`, `NEGATIVE_COUNT`, `NEGATIVE_RATE`, `NULL_COUNT`, `NULL_RATE`, `UNIQUE_COUNT`, `UNIQUE_RATE`, `DUPLICATE_COUNT` |
| **String / Text** (varchar, char, text) | `TEXT_MAX_LENGTH`, `TEXT_MIN_LENGTH`, `TEXT_MEAN_LENGTH`, `TEXT_INT_RATE`, `TEXT_NUMBER_RATE`, `TEXT_UUID_RATE`, `TEXT_EMAIL_ADDRESS_RATE`, `EMPTY_STRING_COUNT`, `EMPTY_STRING_RATE`, `NULL_COUNT`, `NULL_RATE`, `UNIQUE_COUNT`, `UNIQUE_RATE`, `DUPLICATE_COUNT` |
| **Boolean** | `TRUE_COUNT`, `FALSE_COUNT`, `NULL_COUNT`, `NULL_RATE` |
| **Timestamp / Date** | `FUTURE_TIMESTAMP_COUNT`, `PAST_TIMESTAMP_COUNT`, `UNIX_ZERO_TIMESTAMP_COUNT`, `NULL_COUNT`, `NULL_RATE`, `UNIQUE_COUNT`, `UNIQUE_RATE` |
| **Any type** | `NULL_COUNT`, `NULL_RATE`, `UNIQUE_COUNT`, `UNIQUE_RATE`, `DUPLICATE_COUNT` |

### Rules

- **NEVER** apply `NUMERIC_*`, `SUM`, `ZERO_*`, or `NEGATIVE_*` metrics to string, boolean, or timestamp columns.
- **NEVER** apply `TEXT_*` or `EMPTY_STRING_*` metrics to numeric, boolean, or timestamp columns.
- **NEVER** apply `TRUE_COUNT` or `FALSE_COUNT` to non-boolean columns.
- **NEVER** apply `FUTURE_TIMESTAMP_COUNT`, `PAST_TIMESTAMP_COUNT`, or `UNIX_ZERO_TIMESTAMP_COUNT` to non-timestamp columns.
- When in doubt, `NULL_COUNT`, `NULL_RATE`, `UNIQUE_COUNT`, and `UNIQUE_RATE` are safe for any column type.

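A minimal compatibility check following the matrix and rules above. The sets are transcribed from the tables here (using the "safe for any column type" list from the rules for the any-type metrics); the helper name is hypothetical:

```python
# Safe for any column type, per the last rule above.
SAFE_ANY = {"NULL_COUNT", "NULL_RATE", "UNIQUE_COUNT", "UNIQUE_RATE"}

# Type-specific metrics transcribed from the compatibility matrix.
BY_TYPE = {
    "numeric": {"NUMERIC_MEAN", "NUMERIC_MEDIAN", "NUMERIC_MIN", "NUMERIC_MAX",
                "NUMERIC_STDDEV", "SUM", "ZERO_COUNT", "ZERO_RATE",
                "NEGATIVE_COUNT", "NEGATIVE_RATE", "DUPLICATE_COUNT"},
    "string": {"TEXT_MAX_LENGTH", "TEXT_MIN_LENGTH", "TEXT_MEAN_LENGTH",
               "TEXT_INT_RATE", "TEXT_NUMBER_RATE", "TEXT_UUID_RATE",
               "TEXT_EMAIL_ADDRESS_RATE", "EMPTY_STRING_COUNT",
               "EMPTY_STRING_RATE", "DUPLICATE_COUNT"},
    "boolean": {"TRUE_COUNT", "FALSE_COUNT"},
    "timestamp": {"FUTURE_TIMESTAMP_COUNT", "PAST_TIMESTAMP_COUNT",
                  "UNIX_ZERO_TIMESTAMP_COUNT"},
}

def is_compatible(metric: str, column_type: str) -> bool:
    """Check a metric against a column type per the matrix (sketch)."""
    return metric in SAFE_ANY or metric in BY_TYPE.get(column_type, set())

print(is_compatible("SUM", "string"))         # False -- rejected by rule 1
print(is_compatible("NULL_RATE", "boolean"))  # True -- safe for any type
```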
---

## Alert Conditions

Each alert condition has:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `metric` | string | Yes | The metric to monitor (see Metrics Reference below). |
| `operator` | string | Yes | `"AUTO"` (anomaly detection), `"GT"`, `"LT"`, `"EQ"`, `"GTE"`, `"LTE"`, `"NE"`. |
| `threshold` | number | For explicit operators | The threshold value. Required when using `GT`, `LT`, `EQ`, `GTE`, `LTE`, or `NE`. Not used with `AUTO`. |
| `fields` | array of string | Depends | Column names to apply the metric to. Required for field-level metrics. Not needed for table-level metrics. |

---

## Operator Guidance

### When to use `AUTO` (anomaly detection)

- Best when you do not know the expected range of values and want Monte Carlo's ML to learn normal patterns and alert on deviations.
- Works well for organic metrics that vary day-to-day (row counts, null rates on evolving data, numeric distributions).
- Some metrics **require** `AUTO` -- see the table below.

### When to use explicit thresholds (`GT`, `LT`, `EQ`, `GTE`, `LTE`, `NE`)

- Use when there is a known business rule or data contract (e.g., "null rate on `email` should never exceed 5%", "order amount must always be greater than 0").
- Provides deterministic alerting -- no training period needed, alerts fire immediately when the condition is met.
- Requires a `threshold` value in the alert condition.

### Operator restrictions by metric

| Metric | Allowed Operators | Notes |
|--------|-------------------|-------|
| `ROW_COUNT_CHANGE` | `AUTO` only | Anomaly detection on row count delta. |
| `TIME_SINCE_LAST_ROW_COUNT_CHANGE` | `AUTO` only | Anomaly detection on staleness duration. |
| `RELATIVE_ROW_COUNT` | `AUTO` only | Anomaly detection on segment distribution. Requires `segment_fields`. |
| All other metrics | `AUTO`, `GT`, `LT`, `EQ`, `GTE`, `LTE`, `NE` | Any operator is valid. |

---

## Metrics Reference

### Table-level metrics (no `fields` needed)

| Metric | Operator | Description |
|--------|----------|-------------|
| `ROW_COUNT_CHANGE` | Must use `AUTO` | Alert on anomalous changes in total row count. |
| `TIME_SINCE_LAST_ROW_COUNT_CHANGE` | Must use `AUTO` | Alert when the table has not been updated for an unusual duration. |

### Field-level metrics (must specify `fields`)

| Metric | Column Types | Description |
|--------|-------------|-------------|
| `NULL_COUNT` | Any | Count of null values. |
| `NULL_RATE` | Any | Rate of null values (0.0 to 1.0). |
| `UNIQUE_COUNT` | Any | Count of distinct values. |
| `UNIQUE_RATE` | Any | Rate of distinct values (0.0 to 1.0). |
| `DUPLICATE_COUNT` | Any | Count of duplicate (non-unique) values. |
| `EMPTY_STRING_COUNT` | String/Text | Count of empty string values. |
| `EMPTY_STRING_RATE` | String/Text | Rate of empty string values. |
| `NUMERIC_MEAN` | Numeric | Mean of numeric field. |
| `NUMERIC_MEDIAN` | Numeric | Median of numeric field. |
| `NUMERIC_MIN` | Numeric | Minimum value of numeric field. |
| `NUMERIC_MAX` | Numeric | Maximum value of numeric field. |
| `NUMERIC_STDDEV` | Numeric | Standard deviation of numeric field. |
| `SUM` | Numeric | Sum of numeric field. |
| `ZERO_COUNT` | Numeric | Count of zero values. |
| `ZERO_RATE` | Numeric | Rate of zero values. |
| `NEGATIVE_COUNT` | Numeric | Count of negative values. |
| `NEGATIVE_RATE` | Numeric | Rate of negative values. |
| `TRUE_COUNT` | Boolean | Count of true values. |
| `FALSE_COUNT` | Boolean | Count of false values. |
| `TEXT_MAX_LENGTH` | String/Text | Maximum string length. |
| `TEXT_MIN_LENGTH` | String/Text | Minimum string length. |
| `TEXT_MEAN_LENGTH` | String/Text | Mean string length. |
| `TEXT_INT_RATE` | String/Text | Rate of values parseable as integers. |
| `TEXT_NUMBER_RATE` | String/Text | Rate of values parseable as numbers. |
| `TEXT_UUID_RATE` | String/Text | Rate of values matching UUID format. |
| `TEXT_EMAIL_ADDRESS_RATE` | String/Text | Rate of values matching email format. |
| `FUTURE_TIMESTAMP_COUNT` | Timestamp/Date | Count of timestamps in the future. |
| `PAST_TIMESTAMP_COUNT` | Timestamp/Date | Count of timestamps unreasonably far in the past. |
| `UNIX_ZERO_TIMESTAMP_COUNT` | Timestamp/Date | Count of timestamps equal to Unix epoch zero (1970-01-01). |

### Segmentation metric

| Metric | Operator | Description |
|--------|----------|-------------|
| `RELATIVE_ROW_COUNT` | Must use `AUTO` | Alert on anomalous changes in distribution across segments. MUST use `segment_fields`. |

---

## Examples

### Row count anomaly detection

```json
{
  "name": "orders_row_count",
  "description": "Detect anomalous changes in daily order volume",
  "table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++analytics:core.orders",
  "aggregate_time_field": "created_at",
  "aggregate_by": "day",
  "alert_conditions": [
    {
      "metric": "ROW_COUNT_CHANGE",
      "operator": "AUTO"
    }
  ]
}
```

### Null monitoring on specific fields

```json
{
  "name": "orders_null_check",
  "description": "Alert when email or user_id nulls exceed 50 per day",
  "table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++analytics:core.orders",
  "aggregate_time_field": "created_at",
  "aggregate_by": "day",
  "alert_conditions": [
    {
      "metric": "NULL_COUNT",
      "operator": "GT",
      "threshold": 50,
      "fields": ["email", "user_id"]
    }
  ]
}
```

### Segmented monitoring

```json
{
  "name": "orders_by_country_distribution",
  "description": "Detect anomalous shifts in order distribution across countries",
  "table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++analytics:core.orders",
  "aggregate_time_field": "created_at",
  "aggregate_by": "day",
  "segment_fields": ["country"],
  "alert_conditions": [
    {
      "metric": "RELATIVE_ROW_COUNT",
      "operator": "AUTO"
    }
  ]
}
```

### Numeric range monitoring with filter

```json
{
  "name": "completed_orders_amount_check",
  "description": "Detect anomalous max order amounts for completed orders",
  "table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++analytics:core.orders",
  "aggregate_time_field": "created_at",
  "aggregate_by": "day",
  "where_condition": "status = 'completed'",
  "alert_conditions": [
    {
      "metric": "NUMERIC_MAX",
      "operator": "AUTO",
      "fields": ["amount"]
    }
  ]
}
```

### Multiple alert conditions in one monitor

```json
{
  "name": "payments_quality_check",
  "description": "Monitor payment amount stats and null rate on transaction_id",
  "table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++warehouse:billing.payments",
  "aggregate_time_field": "processed_at",
  "aggregate_by": "day",
  "domain_id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "alert_conditions": [
    {
      "metric": "NUMERIC_MEAN",
      "operator": "AUTO",
      "fields": ["amount"]
    },
    {
      "metric": "NULL_RATE",
      "operator": "GT",
      "threshold": 0.01,
      "fields": ["transaction_id"]
    }
  ]
}
```

@@ -0,0 +1,231 @@

# Table Monitor Reference

Detailed reference for building `createTableMonitorMac` tool calls.

## When to Use

Use a table monitor when the user wants to:

- Monitor many tables at once across an entire database or schema
- Track freshness (when was each table last updated?)
- Detect schema changes (columns added, removed, or type-changed)
- Monitor volume changes (row count anomalies) across a broad set of tables
- Apply broad coverage with anomaly detection (no custom thresholds needed)

**Do NOT use a table monitor when the user wants to:**

- Track field-level metrics on a single table (use a metric monitor)
- Apply custom thresholds or explicit operators like GT/LT (use a metric monitor)
- Validate row-level business rules or referential integrity (use a validation monitor)

---

## Required Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | string | Unique identifier for the table monitor. Must be unique across all table monitors in the same namespace. |
| `description` | string | Human-readable description of what the monitor checks (max 512 characters). |
| `warehouse` | string | Warehouse name or UUID. Use `getTable` or `search` to find it. |
| `asset_selection` | object | Asset selection config defining which tables to monitor (see Asset Selection below). |

## Optional Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `alert_conditions` | array of strings | `["last_updated_on", "schema", "total_row_count", "total_row_count_last_changed_on"]` | Metric names to monitor (see Alert Conditions below). |
| `domain_id` | string (uuid) | none | Domain UUID (use `getDomains` to list). |

---

## Pre-Step: Verify Warehouse

Before creating a table monitor, resolve the warehouse name or UUID. The `warehouse` parameter is required and must match an existing warehouse in the Monte Carlo account.

1. If the user provides a table name, call `getTable` to retrieve the table details -- the response includes the warehouse name and UUID.
2. If the user provides a database or schema name without a specific table, call `search` with the database or schema name to find assets and identify the warehouse.
3. Use either the warehouse name or UUID in the `warehouse` parameter.

**NEVER guess the warehouse value.** If you cannot resolve it, ask the user.

---

## Asset Selection

The `asset_selection` object defines which tables the monitor covers. It must include a `databases` list.

**Use database and schema scoping to select which tables to monitor.** This is the reliable approach and covers most use cases.

> **Known limitation:** The MCP tool supports `filters` and `exclusions` parameters, but the tool's schema describes the wrong format for them. Until this is fixed ([K2-269](https://linear.app/montecarlodata/issue/K2-269)), **do not pass `filters` or `exclusions`** -- they will cause errors. Use database/schema scoping instead to narrow the set of monitored tables. If the user needs regex or pattern-based filtering, explain this limitation and suggest either (a) using schema-level scoping to get close, or (b) creating individual metric monitors for specific tables.

### Database-Level Selection

To monitor all tables in an entire database, specify only the database name with no `schemas` list:

```json
{
  "databases": [
    {"name": "analytics"}
  ]
}
```

This monitors every table in every schema within the `analytics` database.

### Schema-Level Selection

To monitor all tables in specific schemas, include the `schemas` list:

```json
{
  "databases": [
    {
      "name": "analytics",
      "schemas": ["core", "staging"]
    }
  ]
}
```

This monitors every table in the `core` and `staging` schemas within `analytics`, but not tables in other schemas.

### Multiple Databases

You can monitor tables across multiple databases in a single monitor:

```json
{
  "databases": [
    {"name": "analytics", "schemas": ["core"]},
    {"name": "raw_data"},
    {"name": "reporting", "schemas": ["public", "internal"]}
  ]
}
```

---

## Alert Conditions

Alert conditions define which metrics the table monitor tracks. The operator is always AUTO (anomaly detection) -- custom thresholds are not available for table monitors.

| Metric | Description |
|--------|-------------|
| `last_updated_on` | Freshness monitoring. Alerts when a table has not been updated within its normal cadence. |
| `schema` | Any schema change. Alerts when columns are added, removed, or their types change. |
| `schema_fields_added` | New columns detected. Alerts only when new columns appear in the table. |
| `schema_fields_removed` | Columns removed. Alerts only when existing columns are dropped from the table. |
| `schema_fields_type_change` | Column type changes. Alerts only when a column's data type changes. |
| `total_row_count` | Row count changes. Alerts on anomalous changes in total row count. |
| `total_row_count_last_changed_on` | Time since last volume change. Alerts when the row count has not changed for an unusual duration. |

### Notes

- **All operators are AUTO (anomaly detection).** Table monitors do not support custom thresholds like GT, LT, or explicit operators. If the user needs custom thresholds, use a metric monitor instead.
- **No `schedule` field.** Table monitors do not support the `schedule` field in MaC YAML. Adding it will cause a validation error on `montecarlo monitors apply`. Table monitor scheduling is managed automatically by Monte Carlo. Do NOT add a schedule block to the generated YAML.
- The default set (`last_updated_on`, `schema`, `total_row_count`, `total_row_count_last_changed_on`) provides broad coverage and is appropriate for most use cases. Only override the defaults when the user specifically requests a subset.
- `schema` is a superset of `schema_fields_added`, `schema_fields_removed`, and `schema_fields_type_change`. If using `schema`, there is no need to also include the granular schema metrics.

---

## Examples

### Monitor all tables in a database (minimal config)

```json
{
  "name": "analytics_db_monitor",
  "description": "Monitor all tables in the analytics database for freshness, schema changes, and volume",
  "warehouse": "production_warehouse",
  "asset_selection": {
    "databases": [
      {"name": "analytics"}
    ]
  }
}
```

Uses the default alert conditions (`last_updated_on`, `schema`, `total_row_count`, `total_row_count_last_changed_on`).

### Monitor specific schemas with default alerts

```json
{
  "name": "core_schemas_monitor",
  "description": "Monitor all tables in core and reporting schemas",
  "warehouse": "production_warehouse",
  "asset_selection": {
    "databases": [
      {
        "name": "analytics",
        "schemas": ["core", "reporting"]
      }
    ]
  }
}
```

Monitors every table in the `core` and `reporting` schemas, leaving other schemas unmonitored.

### Monitor multiple schemas across databases

```json
{
  "name": "prod_tables_monitor",
  "description": "Monitor production tables across analytics and raw_data databases",
  "warehouse": "production_warehouse",
  "asset_selection": {
    "databases": [
      {
        "name": "analytics",
        "schemas": ["core", "reporting"]
      },
      {
        "name": "raw_data",
        "schemas": ["ingestion"]
      }
    ]
  }
}
```

Monitors tables in specific production schemas, leaving development and staging schemas unmonitored.

### Schema change monitoring only

```json
{
  "name": "warehouse_schema_watch",
  "description": "Track schema changes across the entire data warehouse",
  "warehouse": "production_warehouse",
  "asset_selection": {
    "databases": [
      {"name": "analytics"},
      {"name": "raw_data"}
    ]
  },
  "alert_conditions": [
    "schema_fields_added",
    "schema_fields_removed",
    "schema_fields_type_change"
  ]
}
```

Monitors only schema changes (not freshness or volume) across multiple databases. Uses the granular schema metrics instead of `schema` to allow selectively enabling/disabling each type.

---

## Table Monitor vs Metric Monitor

| Aspect | Table Monitor | Metric Monitor |
|--------|---------------|----------------|
| **Scope** | Multiple tables (database/schema level) | Single table |
| **Metrics** | Freshness, schema changes, row count | Field-level metrics (null rate, mean, sum, etc.) |
| **Operator** | AUTO only (anomaly detection) | AUTO or explicit thresholds (GT, LT, EQ, etc.) |
| **Asset selection** | Database/schema with filters and exclusions | Single table specified by MCON or name |
| **Timestamp field** | Not required | Required (`aggregate_time_field`) |
| **Segmentation** | Not available | Available via `segment_fields` |
| **Best for** | Broad coverage, freshness, schema drift | Targeted field-level data quality checks |

**Rule of thumb:** If the user wants to monitor a specific field on a specific table with specific thresholds, use a metric monitor. If the user wants broad monitoring across many tables with automatic anomaly detection, use a table monitor.

# Validation Monitor Reference

Detailed reference for building `createValidationMonitorMac` tool calls.

## When to Use

Use a validation monitor when the user wants to:

- Check that specific fields are never null
- Validate that values are within an allowed set (e.g., status in 'active', 'pending', 'inactive')
- Enforce referential integrity (field values exist in another table)
- Apply row-level business rules (e.g., "amount must be positive")
- Combine multiple conditions with AND/OR logic

---

## Getting the Logic Right: Conditions Match INVALID Data

This is the single most confusing aspect of validation monitors and the number one source of mistakes. **Conditions describe what INVALID data looks like -- the data you want to be alerted about.** They do NOT describe what valid data looks like.

Think of it this way: the monitor scans rows and fires an alert when it finds rows matching the condition. So the condition must match the BAD rows.

| User wants | Condition should match | Common mistake |
|------------|----------------------|----------------|
| "id should never be null" | id IS NULL (alert when null found) | id IS NOT NULL (would alert on every valid row) |
| "status must be in [active, pending]" | status NOT IN [active, pending] (alert on unexpected values) | status IN [active, pending] (would alert on valid rows) |
| "amount must be positive" | amount IS NEGATIVE (alert on bad values) | amount > 0 (would alert on valid rows) |
| "email must not be empty" | email IS NULL **OR** email = '' (alert on missing) | email IS NOT NULL (would alert on valid rows) |

**Before building any condition, ask yourself: "If a row matches this condition, is the row INVALID?" If the answer is no, the logic is backwards.**

---

## Pre-Step: Verify Field Existence

Before constructing the `alert_condition`, verify that every field name you plan to reference exists in the table's column list. This is the number two source of validation monitor failures -- referencing columns that do not exist or are misspelled.

1. You should already have the column list from `getTable` with `include_fields: true` (done in Step 2 of the main skill).
2. For every field name in your planned conditions, confirm it appears in the column list exactly as spelled (field names are case-sensitive on most warehouses).
3. If a field does not exist, stop and ask the user to clarify the correct column name. Do not guess.

---

## Required Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | string | Unique identifier for the monitor. Use a descriptive slug (e.g., `orders_not_null_check`). |
| `description` | string | Human-readable description of what the monitor checks. |
| `table` | string | Table MCON (preferred) or `database:schema.table` format. If not an MCON, also pass `warehouse`. |
| `alert_condition` | object | Condition tree defining when to alert (see Alert Condition Structure below). |

## Optional Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `warehouse` | string | Warehouse name or UUID. Required if `table` is not an MCON. |
| `domain_id` | string (uuid) | Domain UUID (use `getDomains` to list). |

---

## Alert Condition Structure

The top level of `alert_condition` must always be a GROUP node. This GROUP contains one or more conditions combined with AND or OR logic.

```json
{
  "type": "GROUP",
  "operator": "AND",
  "conditions": [...]
}
```

### Condition Types

There are four condition types: UNARY, BINARY, SQL, and GROUP.

#### UNARY (single-value checks)

Used for predicates that operate on a single field with no comparison value.

```json
{
  "type": "UNARY",
  "predicate": {"name": "null", "negated": false},
  "value": [{"type": "FIELD", "field": "column_name"}]
}
```

- `predicate.name` -- the predicate to apply (see Predicates Reference below).
- `predicate.negated` -- set to `true` to invert the predicate (e.g., `null` with `negated: true` means "is NOT null").
- `value` -- an array with a single value descriptor (usually a FIELD reference).

#### BINARY (comparison checks)

Used for predicates that compare a field against a value.

```json
{
  "type": "BINARY",
  "predicate": {"name": "greater_than", "negated": false},
  "left": [{"type": "FIELD", "field": "column_name"}],
  "right": [{"type": "LITERAL", "literal": "0"}]
}
```

- `left` -- the left-hand side of the comparison (typically a FIELD reference).
- `right` -- the right-hand side (typically a LITERAL value, SQL expression, or FIELD reference).
- Both `left` and `right` are arrays of value descriptors.

#### SQL (custom SQL expression)

Used for complex conditions that are difficult to express with UNARY/BINARY nodes. The SQL expression should evaluate to true for INVALID rows.

```json
{
  "type": "SQL",
  "sql": "amount <= 0 OR amount >= 1000000"
}
```

Note that the expression matches the bad rows (non-positive or implausibly large amounts), consistent with the "conditions match INVALID data" rule.

#### GROUP (nested conditions)

Used to combine multiple conditions with AND or OR logic. Groups can be nested.

```json
{
  "type": "GROUP",
  "operator": "OR",
  "conditions": [
    {"type": "UNARY", "...": "..."},
    {"type": "BINARY", "...": "..."}
  ]
}
```

---

## Value Types

Value descriptors appear in the `value`, `left`, and `right` arrays of UNARY and BINARY conditions.

| Type | Field | Description | Example |
|------|-------|-------------|---------|
| `FIELD` | `"field": "column_name"` | References a column in the table. | `{"type": "FIELD", "field": "user_id"}` |
| `LITERAL` | `"literal": "value"` | A static value (always a string, even for numbers). | `{"type": "LITERAL", "literal": "100"}` |
| `SQL` | `"sql": "SELECT ..."` | A SQL expression or subquery. | `{"type": "SQL", "sql": "SELECT MAX(id) FROM ref_table"}` |

---

## Predicates Reference

Before building conditions, call `getValidationPredicates` to get the full list of supported predicates for the connected warehouse. The list below covers common predicates but may not be exhaustive.

### Unary Predicates

These predicates take no comparison value -- they check a property of the field itself.

| Predicate | Description | Example use |
|-----------|-------------|-------------|
| `null` | Field value is null. | Alert on null ids. |
| `is_negative` | Field value is negative. | Alert on negative amounts. |
| `is_between_0_and_1` | Field value is between 0 and 1 (inclusive). | Alert on rates that should be percentages (0-100). |
| `is_future_date` | Field value is a date/timestamp in the future. | Alert on future-dated records. |
| `is_uuid` | Field value matches UUID format. | Alert on non-UUID values in a UUID field (use with `negated: true`). |

### Binary Predicates

These predicates compare a field against a value.

| Predicate | Right-hand side | Description | Example use |
|-----------|----------------|-------------|-------------|
| `equal` | Single LITERAL | Field equals the given value. | Alert when `status` equals `'deleted'`. |
| `greater_than` | Single LITERAL | Field is greater than the given value. | Alert when `discount_pct` exceeds 100. |
| `less_than` | Single LITERAL | Field is less than the given value. | Alert when `quantity` is below 0. |
| `in_set` | Multiple LITERALs | Field value is in the given set. | Alert when `status` is in an invalid set (see example below). |
| `contains` | Single LITERAL | Field value contains the given substring. | Alert when `email` contains `'test@'`. |
| `starts_with` | Single LITERAL | Field value starts with the given prefix. | Alert when `phone` starts with `'000'`. |
| `between` | Two LITERALs | Field value is between the two given values (inclusive). | Alert when `score` is between 0 and 10 (if that range is invalid). |

### Using `negated` to Invert Predicates

Any predicate can be inverted by setting `"negated": true` in the predicate object. This is essential for "must be in set" validations:

- **"status must be in [active, pending]"** becomes `in_set` with values `["active", "pending"]` and `negated: true` -- meaning "alert when status is NOT in [active, pending]".
- **"id must not be null"** becomes `null` with `negated: false` -- meaning "alert when id IS null" (no inversion needed since the condition already matches invalid data).

---

## Examples

### Alert when id is null

Verify that `id` exists in the table schema from `getTable` before proceeding.

```json
{
  "name": "orders_id_not_null",
  "description": "Alert when order id is null",
  "table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++analytics:core.orders",
  "alert_condition": {
    "type": "GROUP",
    "operator": "AND",
    "conditions": [
      {
        "type": "UNARY",
        "predicate": {"name": "null", "negated": false},
        "value": [{"type": "FIELD", "field": "id"}]
      }
    ]
  }
}
```

The condition matches rows where `id` IS NULL -- these are the invalid rows we want to be alerted about.

### Alert when status is not in allowed set

Verify that `status` exists in the table schema from `getTable` before proceeding.

```json
{
  "name": "orders_status_allowed_values",
  "description": "Alert when order status is outside the allowed set",
  "table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++analytics:core.orders",
  "alert_condition": {
    "type": "GROUP",
    "operator": "AND",
    "conditions": [
      {
        "type": "BINARY",
        "predicate": {"name": "in_set", "negated": true},
        "left": [{"type": "FIELD", "field": "status"}],
        "right": [
          {"type": "LITERAL", "literal": "active"},
          {"type": "LITERAL", "literal": "pending"},
          {"type": "LITERAL", "literal": "inactive"}
        ]
      }
    ]
  }
}
```

Note `negated: true` -- the predicate is `in_set`, but we want to alert when the value is NOT in the set. This catches any unexpected status values.

### Alert when amount is negative

Verify that `amount` exists in the table schema from `getTable` before proceeding.

```json
{
  "name": "orders_positive_amount",
  "description": "Alert when order amount is negative",
  "table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++analytics:core.orders",
  "alert_condition": {
    "type": "GROUP",
    "operator": "AND",
    "conditions": [
      {
        "type": "UNARY",
        "predicate": {"name": "is_negative", "negated": false},
        "value": [{"type": "FIELD", "field": "amount"}]
      }
    ]
  }
}
```

The condition matches rows where `amount` is negative -- these are the invalid rows.

### Combined conditions: null OR negative

Verify that both `amount` and `quantity` exist in the table schema from `getTable` before proceeding.

```json
{
  "name": "orders_amount_quality",
  "description": "Alert when amount is null or quantity is negative",
  "table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++analytics:core.orders",
  "alert_condition": {
    "type": "GROUP",
    "operator": "OR",
    "conditions": [
      {
        "type": "UNARY",
        "predicate": {"name": "null", "negated": false},
        "value": [{"type": "FIELD", "field": "amount"}]
      },
      {
        "type": "UNARY",
        "predicate": {"name": "is_negative", "negated": false},
        "value": [{"type": "FIELD", "field": "quantity"}]
      }
    ]
  }
}
```

The OR operator means an alert fires if either condition matches -- the row has a null amount OR a negative quantity.

### Between check with nested AND/OR

Verify that `score` and `status` exist in the table schema from `getTable` before proceeding.

```json
{
  "name": "records_score_validation",
  "description": "Alert when score is outside 0-100 range for active records",
  "table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++warehouse:metrics.records",
  "alert_condition": {
    "type": "GROUP",
    "operator": "AND",
    "conditions": [
      {
        "type": "BINARY",
        "predicate": {"name": "equal", "negated": false},
        "left": [{"type": "FIELD", "field": "status"}],
        "right": [{"type": "LITERAL", "literal": "active"}]
      },
      {
        "type": "BINARY",
        "predicate": {"name": "between", "negated": true},
        "left": [{"type": "FIELD", "field": "score"}],
        "right": [
          {"type": "LITERAL", "literal": "0"},
          {"type": "LITERAL", "literal": "100"}
        ]
      }
    ]
  }
}
```

This uses `between` with `negated: true` to alert when score is outside the 0-100 range, but only for active records (the AND operator requires both conditions to match).

### Referential integrity with SQL subquery

Verify that `customer_id` exists in the table schema from `getTable` before proceeding.

```json
{
  "name": "orders_valid_customer",
  "description": "Alert when customer_id does not exist in customers table",
  "table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++analytics:core.orders",
  "alert_condition": {
    "type": "GROUP",
    "operator": "AND",
    "conditions": [
      {
        "type": "SQL",
        "sql": "customer_id IS NOT NULL AND customer_id NOT IN (SELECT id FROM analytics.core.customers)"
      }
    ]
  }
}
```

The SQL condition type is useful for referential integrity checks that require subqueries. The `customer_id IS NOT NULL` guard avoids alerting on null values (which should be caught by a separate null check if needed).

### Contains and starts_with checks

Verify that `email` and `phone` exist in the table schema from `getTable` before proceeding.

```json
{
  "name": "contacts_format_validation",
  "description": "Alert when email contains test data or phone has invalid prefix",
  "table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++warehouse:crm.contacts",
  "alert_condition": {
    "type": "GROUP",
    "operator": "OR",
    "conditions": [
      {
        "type": "BINARY",
        "predicate": {"name": "contains", "negated": false},
        "left": [{"type": "FIELD", "field": "email"}],
        "right": [{"type": "LITERAL", "literal": "@test.example.com"}]
      },
      {
        "type": "BINARY",
        "predicate": {"name": "starts_with", "negated": false},
        "left": [{"type": "FIELD", "field": "phone"}],
        "right": [{"type": "LITERAL", "literal": "000"}]
      }
    ]
  }
}
```

---

## Fallback: Custom SQL Monitor

If `createValidationMonitorMac` fails -- for example because a referenced column does not exist yet in the live table, or the warehouse does not support a particular predicate -- fall back to `createCustomSqlMonitorMac` with an explicit SQL query instead.

A custom SQL monitor lets you express any validation logic as a SQL query that returns rows or a count. This is always available as a backup when the structured validation condition tree cannot express what you need or encounters an API error.

When falling back:

1. Translate the intended validation logic into a SQL query.
2. The SQL should select rows that violate the rule (matching the same "conditions match INVALID data" principle).
3. Use `createCustomSqlMonitorMac` with the translated query.
4. Inform the user that you used a custom SQL monitor as a fallback and explain why.
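
As an illustration of steps 1-2, a "status must be in an allowed set" validation could translate into a fallback query like the following. The table and column names are hypothetical, and the exact query shape expected by `createCustomSqlMonitorMac` may differ:

```python
# Hypothetical translation of a "status must be in [...]" validation
# into a SQL query that selects the VIOLATING rows.
allowed = ["active", "pending", "inactive"]
in_list = ", ".join(f"'{v}'" for v in allowed)

fallback_sql = (
    "SELECT * FROM analytics.core.orders "
    f"WHERE status IS NULL OR status NOT IN ({in_list})"
)

# The query matches invalid rows: null statuses and unexpected values.
assert "NOT IN ('active', 'pending', 'inactive')" in fallback_sql
```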

---
name: monte-carlo-prevent
description: "Surfaces Monte Carlo data observability context (table health, alerts, lineage, blast radius) before SQL/dbt edits."
category: data
risk: safe
source: community
source_repo: monte-carlo-data/mc-agent-toolkit
source_type: community
date_added: "2026-04-08"
author: monte-carlo-data
tags: [data-observability, dbt, schema, monte-carlo, lineage]
tools: [claude, cursor, codex]
---

# Monte Carlo Prevent Skill

This skill brings Monte Carlo's data observability context directly into your editor. When you're modifying a dbt model or SQL pipeline, use it to surface table health, lineage, and active alerts, and to generate monitors-as-code without leaving Claude Code.

Reference files live next to this skill file. **Use the Read tool** (not MCP resources) to access them:

- Full workflow step-by-step instructions: `references/workflows.md` (relative to this file)
- MCP parameter details: `references/parameters.md` (relative to this file)
- Troubleshooting: `references/TROUBLESHOOTING.md` (relative to this file)

## When to activate this skill

**Do not wait to be asked.** Run the appropriate workflow automatically whenever the user:

- References or opens a `.sql` file or dbt model (files in `models/`) → run Workflow 1
- Mentions a table name, dataset, or dbt model name in passing → run Workflow 1
- Describes a planned change to a model (new column, join update, filter change, refactor) → **STOP — run Workflow 4 before writing any code**
- Adds a new column, metric, or output expression to an existing model → run Workflow 4 first, then ALWAYS offer Workflow 2 regardless of risk tier — do not skip the monitor offer
- Asks about data quality, freshness, row counts, or anomalies → run Workflow 1
- Wants to triage or respond to a data quality alert → run Workflow 3

Present the results as context the engineer needs before proceeding — not as a response to a question.

## When NOT to activate this skill

Do not invoke Monte Carlo tools for:

- Seed files (files in `seeds/` directory)
- Analysis files (files in `analyses/` directory)
- One-off or ad-hoc SQL scripts not part of a dbt project
- Configuration files (`dbt_project.yml`, `profiles.yml`, `packages.yml`)
- Test files unless the user is specifically asking about data quality

If uncertain whether a file is a dbt model, check for `{{ ref() }}` or `{{ source() }}` Jinja references — if absent, do not activate.
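
A rough sketch of that check as a simple regex heuristic (not a full Jinja parser, so it may miss exotic formatting):

```python
import re

# Matches {{ ref('...') }} or {{ source('...', '...') }} with flexible spacing.
DBT_REF_RE = re.compile(r"\{\{\s*(ref|source)\s*\(")

def looks_like_dbt_model(sql_text):
    """True when the SQL contains a dbt ref() or source() Jinja reference."""
    return bool(DBT_REF_RE.search(sql_text))

assert looks_like_dbt_model("select * from {{ ref('stg_orders') }}")
assert looks_like_dbt_model("select * from {{source('raw', 'orders')}}")
assert not looks_like_dbt_model("select * from analytics.core.orders")
```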

### Macros and snapshots — gate edits, skip auto-context

Macro files (`macros/`) and snapshot files (`snapshots/`) are **not** models, so do not auto-fetch Monte Carlo context (Workflow 1) when they are opened. However, macros are inlined into every model that calls them at compile time — a one-line macro change can silently alter dozens of models. Snapshots control historical tracking and are similarly sensitive.

**The pre-edit hook gates these files.** If the hook fires for a macro or snapshot, identify which models are affected and run the change impact assessment (Workflow 4) for those models before proceeding with the edit.

---

## REQUIRED: Change impact assessment before any SQL edit

**Before editing or writing any SQL for a dbt model or pipeline, you MUST run Workflow 4.**

This applies whenever the user expresses intent to modify a model — including phrases like:

- "I want to add a column…"
- "Let me add / I'm adding…"
- "I'd like to change / update / rename…"
- "Can you add / modify / refactor…"
- "Let's add…" / "Add a `<column>` column"
- Any other description of a planned schema or logic change
- "Exclude / filter out / remove [records/customers/rows]…"
- "Adjust / increase / decrease [threshold/parameter/value]…"
- "Fix / bugfix / patch [issue/bug]…"
- "Revert / restore / undo [change/previous behavior]…"
- "Disable / enable [feature/logic/flag]…"
- "Clean up / remove [references/columns/code]…"
- "Implement [backend/feature] for…"
- "Create [models/dbt models] for…" (when modifying existing referenced tables)
- "Increase / decrease / change [max_tokens/threshold/date constant/numeric parameter]…"
- Any change to a hardcoded value, constant, or configuration parameter within SQL
- "Drop / remove / delete [column/field/table]"
- "Rename [column/field] to [new name]"
- "Add [column]" (short imperative form, e.g. "add a created_at column")
- Any single-verb imperative command targeting a column, table, or model (e.g. "drop X", "rename Y", "add Z", "remove W")

Parameter changes (threshold values, date constants, numeric limits) appear safe but silently change model output. Treat them the same as logic changes for impact assessment purposes.

**Do not write or edit any SQL until the change impact assessment (Workflow 4) has been presented to the user.** The assessment must come first — not after the edit, not in parallel.

---

## Pre-edit gate — check before modifying any file

**Before calling Edit, Write, or MultiEdit on any `.sql` or dbt model file, you MUST check:**

1. Has the synthesis step been run for THIS SPECIFIC CHANGE in the current prompt?
2. **If YES** → proceed with the edit.
3. **If NO** → stop immediately, run Workflow 4, and present the full report with synthesis connected to this specific change.

**If risk is High or Medium:** ask "Do you want me to proceed with the edit?" and wait for explicit confirmation.

**If risk is Low:** use judgment — proceed if the change is straightforward and no concerns were found, otherwise ask before editing.

**Important: "Workflow 4 already ran this session" is NOT sufficient to proceed.** Each distinct change prompt requires its own synthesis step connecting the MC findings to that specific change.

The synthesis must reference the specific columns, filters, or logic being changed in the current prompt — not just general table health.

Example:

- ✅ "Given 34 downstream models depend on is_paying_workspace, adding 'MC Internal' to the exclusion list will exclude these workspaces from all downstream health scores and exports. Confirm?"
- ❌ "Workflow 4 already ran. Making the edit now."

The only exception: if the user explicitly acknowledges the risk and confirms they want to skip (e.g. "I know the risks, just make the change") — proceed but note the skipped assessment.

## Available MCP tools

All tools are available via the `monte-carlo` MCP server.

| Tool | Purpose |
|------|---------|
| `testConnection` | Verify auth and connectivity |
| `search` | Find tables/assets by name |
| `getTable` | Schema, stats, metadata for a table |
| `getAssetLineage` | Upstream/downstream dependencies (call with mcons array + direction) |
| `getAlerts` | Active incidents and alerts |
| `getMonitors` | Monitor configs — filter by table using mcons array |
| `getQueriesForTable` | Recent query history |
| `getQueryData` | Full SQL for a specific query |
| `createValidationMonitorMac` | Generate validation monitors-as-code YAML |
| `createMetricMonitorMac` | Generate metric monitors-as-code YAML |
| `createComparisonMonitorMac` | Generate comparison monitors-as-code YAML |
| `createCustomSqlMonitorMac` | Generate custom SQL monitors-as-code YAML |
| `getValidationPredicates` | List available validation rule types |
| `updateAlert` | Update alert status/severity |
| `setAlertOwner` | Assign alert ownership |
| `createOrUpdateAlertComment` | Add comments to alerts |
| `getAudiences` | List notification audiences |
| `getDomains` | List MC domains |
| `getUser` | Current user info |
| `getCurrentTime` | ISO timestamp for API calls |

## Core workflows

Each workflow has detailed step-by-step instructions in `references/workflows.md` (Read tool).

### 1. Table health check

**When:** User opens a dbt model or mentions a table.
**What:** Surfaces health, lineage, alerts, and risk signals. Auto-escalates to Workflow 4 if change intent is detected and risk signals are present.

### 2. Add a monitor

**When:** New column, filter, or business rule is added to a model.
**What:** Suggests and generates monitors-as-code YAML using the appropriate `create*MonitorMac` tool. Saves to `monitors/<table_name>.yml`.

### 3. Alert triage

**When:** User is investigating an active data quality incident.
**What:** Lists open alerts, checks table state, traces lineage for root cause, reviews recent queries.

### 4. Change impact assessment — REQUIRED before modifying a model

**When:** Any intent to modify a dbt model's logic, columns, joins, or filters.
**What:** Surfaces blast radius, downstream dependencies, active incidents, monitor coverage, and query exposure. Produces a risk-tiered report with synthesis connecting findings to specific code recommendations. See `references/workflows.md` for the full assessment sequence, report format, and synthesis rules.

### 5. Change validation queries

**When:** Explicit engineer request only (e.g. "validate this change", "ready to commit").
**What:** Generates 3-5 targeted SQL queries to verify the change behaved as intended. Uses Workflow 4 context — requires both the impact assessment and the file edit in session.

---
|
||||
|
||||
## Post-synthesis confirmation rules
|
||||
|
||||
Always end the synthesis with one clear, specific recommendation in plain English:
|
||||
"Given the above, I recommend: [specific action]"
|
||||
|
||||
**If the risk is High or Medium:** STOP and wait for confirmation before editing
|
||||
any file. You must ask the engineer and receive an explicit "yes", "go ahead",
|
||||
"proceed", or similar confirmation before making code changes.
|
||||
Say: "Do you want me to proceed with the edit?"
|
||||
Do NOT say: "Proceeding with the edit." — that skips the engineer's decision.
|
||||
|
||||
**If the risk is Low:** Use your judgment based on the synthesis findings. If
|
||||
the change is straightforward and the synthesis found no concerns, you may
|
||||
proceed. If anything is surprising or worth flagging, ask before editing.
|
||||
|
||||
---
|
||||
|
||||
## Session markers
|
||||
|
||||
These markers coordinate between the skill and the plugin's hooks. Output each
|
||||
on its own line when the condition is met.
|
||||
|
||||
### Impact check complete
|
||||
|
||||
After the engineer confirms (High/Medium) or after presenting the synthesis (Low),
|
||||
output one marker per assessed table. **IMPORTANT: use only the table/model name, not the full MCON:**
|
||||
|
||||
<!-- MC_IMPACT_CHECK_COMPLETE: <table_name> -->
|
||||
|
||||
(Use the model filename without .sql extension — NOT "acme.analytics.orders" or "prod.public.client_hub")
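The path-to-marker mapping above can be sketched as a tiny helper. The marker format is the one documented here; the helper name itself is hypothetical:

```python
from pathlib import Path

def impact_marker(model_path: str) -> str:
    """Build the impact-check marker from a dbt model path.

    Illustrative sketch: the marker uses the model filename without
    its .sql extension, never the fully qualified MCON.
    """
    table_name = Path(model_path).stem  # "models/marts/orders_status.sql" -> "orders_status"
    return f"<!-- MC_IMPACT_CHECK_COMPLETE: {table_name} -->"

print(impact_marker("models/marts/orders_status.sql"))
```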
How many markers to emit depends on how the assessment was triggered:

**Hook-triggered** (the pre-edit hook blocked an edit and instructed you to run
the assessment): Be strict — only emit markers for tables whose lineage **and**
monitor coverage were fetched directly via Monte Carlo tools in this session. If
the engineer describes changes to multiple tables but only one was formally
assessed, emit only one marker. The pre-edit hook will gate the other tables and
prompt for their own Workflow 4 runs.

**Voluntarily invoked** (the engineer proactively asked for an impact assessment):
Be looser — emit markers for all tables the assessment meaningfully covered, even
if some were assessed via lineage context rather than direct MC tool calls. The
engineer is already safety-conscious; don't force redundant assessments for tables
they clearly considered.

### Monitor coverage gap

When Workflow 4 finds zero custom monitors on a table's affected columns, output:

<!-- MC_MONITOR_GAP: <table_name> -->

Use only the table/model name (NOT the full MCON). This allows the plugin's hooks
to remind the engineer about monitor coverage at commit time. Only output this
marker when the gap is specifically about the columns or logic being changed —
not for general table-level monitor absence.
@@ -0,0 +1,23 @@
## Troubleshooting

### MCP connection fails:

```bash
# Verify the server is reachable
curl -s -o /dev/null -w "%{http_code}" https://integrations.getmontecarlo.com/mcp/
```

**If using the plugin (OAuth):** Run `/mcp` in Claude Code, select the `monte-carlo` server, and re-authenticate. If the browser flow doesn't complete, copy the callback URL from your browser's address bar into the URL prompt that appears in Claude Code.

**Legacy (header-based auth, for MCP clients without HTTP transport):** Check that `x-mcd-id` and `x-mcd-token` are set correctly in your MCP config. The key format is `<KEY_ID>:<KEY_SECRET>` — these are split across two separate headers.
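A minimal sketch of the key-to-headers split described above. The header names come from this document; the helper itself is hypothetical:

```python
def mcd_headers(api_key: str) -> dict:
    """Split a '<KEY_ID>:<KEY_SECRET>' key into the two legacy auth headers.

    Sketch only — splits on the first colon so a secret containing ':'
    is preserved intact.
    """
    key_id, _, key_secret = api_key.partition(":")
    if not key_secret:
        raise ValueError("expected key in the form '<KEY_ID>:<KEY_SECRET>'")
    return {"x-mcd-id": key_id, "x-mcd-token": key_secret}
```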
### Monitor creation errors:

**`montecarlo monitors apply` fails with "Unknown field":**
Monitor definition files must have `montecarlo:` as the root key — do not copy the `validation:` or `custom_sql:` output from the MCP tools directly. Reformat using the `montecarlo: > custom_sql:` structure shown in Workflow 2.

**`montecarlo monitors apply` fails with "Not a Monte Carlo project":**
Ensure `montecarlo.yml` (the project config) exists in the working directory. This file must contain only `version`, `namespace`, and `default_resource` — not monitor definitions.

**`createValidationMonitorMac` fails with a Snowflake error:**
This tool validates the condition SQL against the live table. If the column doesn't exist yet (e.g. you're writing the monitor before deploying the model change), fall back to `createCustomSqlMonitorMac` with an explicit SQL query instead.
@@ -0,0 +1,32 @@
# MCP Parameter Notes

Important parameter details for Monte Carlo MCP tools. Consult when making API
calls to avoid common mistakes.

---

## `getAlerts` — use snake_case parameters

The MCP tool uses Python snake_case, **not** the camelCase params from the MC web UI:

```
✓ created_after   (not createdTime.after)
✓ created_before  (not createdTime.before)
✓ order_by        (not orderBy)
✓ table_mcons     (not tableMcons)
```

Always provide `created_after` and `created_before`. Max window is 60 days.
Use `getCurrentTime()` to get the current ISO timestamp when needed.
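A sketch of building a compliant time window locally, capping the lookback at the documented 60-day maximum (the helper name is an illustration, not an MCP tool):

```python
from datetime import datetime, timedelta, timezone

def alert_window(days: int = 7) -> tuple[str, str]:
    """Return (created_after, created_before) as ISO timestamps.

    Caps the lookback at 60 days, the documented maximum for getAlerts.
    """
    days = min(days, 60)
    now = datetime.now(timezone.utc)
    start = now - timedelta(days=days)
    return start.isoformat(), now.isoformat()

created_after, created_before = alert_window(90)  # silently capped to 60 days
```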
---

## `search` — finding the right table identifier

MC uses MCONs (Monte Carlo Object Names) as table identifiers. Always use
`search` first to resolve a table name to its MCON before calling `getTable`,
`getAssetLineage`, or `getAlerts`.

```
search(query="orders_status") → returns mcon, full_table_id, warehouse
```
@@ -0,0 +1,478 @@
# Workflow Details

Detailed step-by-step instructions for each Monte Carlo Prevent workflow.
These are referenced from the main SKILL.md — consult the relevant section when
executing a workflow.

---

## Workflow 1: Table health check — when opening or editing a model

When the user opens a dbt model or mentions a table, run this sequence automatically:

```
1. search(query="<table_name>") → get the full MCON/table identifier
2. getTable(mcon="<mcon>") → schema, freshness, row count, importance score, monitoring status
3. getAssetLineage(mcon="<mcon>") → upstream sources, downstream dependents
4. getAlerts(created_after="<7 days ago>", created_before="<now>", table_mcons=["<mcon>"]) → active alerts
```

Summarize for the user:
- **Health**: last updated, row count, is it monitored?
- **Lineage**: N upstream sources, M downstream consumers (name the important ones)
- **Alerts**: any active/unacknowledged incidents — lead with these if present
- **Risk signals** (lite): flag if importance score is high, if key assets are downstream, or if alerts are already firing — these indicate the table warrants extra care before modification

Example summary to offer unprompted when a dbt model file is opened:
> "The table `orders_status` was last updated 2 hours ago with 142K rows. It has 3 downstream dependents including `order_status_snapshot` (key asset). There are 2 active freshness alerts — this table warrants extra care before modification. Want me to run a full change impact assessment?"

**Auto-escalation rule — after completing steps 1–4 above:**

First, check whether the user has expressed intent to modify the model
in this session (e.g. mentioned a change, asked to add/edit/fix something).

IF change intent has been expressed AND any of the following are true:
- One or more active/unacknowledged alerts exist on the table
- One or more downstream dependents are key assets
- The table's importance score is above 0.8
→ Ask the user before running Workflow 4:
  "This is a high-importance table with [N active alerts / key asset
  dependents / importance score 0.989]. Do you want me to run a full
  change impact assessment before proceeding? (yes/no)"
→ Wait for confirmation. If yes → run Workflow 4.
  If no → proceed but note: "Skipping impact assessment at your request."

IF risk signals exist but NO change intent has been expressed:
→ Surface the health summary and note the risk signals only:
  "This is a high-importance table with key asset dependents. When
  you're ready to make changes, say 'run impact assessment' or just
  describe your change and I'll run it automatically."
→ Do NOT run Workflow 4. Do NOT ask about running Workflow 4.

### New model creation variant

When the user is creating a new .sql dbt model file (not editing an existing one):

1. Parse all {{ ref('...') }} and {{ source('...', '...') }} calls from the SQL
2. For each referenced table, run the standard Workflow 1 health check:
   search() → getTable() → getAlerts()
3. Surface a consolidated upstream health summary:
   "Your new model references N upstream tables. Here's their current health:"
   - List each with: last updated, active alerts (if any), key asset flag
4. Flag any upstream table with active alerts as a risk:
   "⚠️ <table_name> has <N> active alerts — your new model will inherit this data quality issue"

Skip getAssetLineage for new models — they have no downstream dependents yet.
Skip Workflow 4 for new models — there is no existing blast radius to assess.
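The ref()/source() parsing in step 1 can be sketched with a regex. This is an illustration only — dbt's own parser handles more call styles (double quotes, keyword arguments) than this covers:

```python
import re

REF_RE = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")
SOURCE_RE = re.compile(r"\{\{\s*source\(\s*'([^']+)'\s*,\s*'([^']+)'\s*\)\s*\}\}")

def upstream_tables(sql: str) -> list[str]:
    """Collect tables referenced via ref()/source() in a new dbt model."""
    refs = REF_RE.findall(sql)
    sources = [f"{src}.{tbl}" for src, tbl in SOURCE_RE.findall(sql)]
    return refs + sources

sql = "select * from {{ ref('orders') }} join {{ source('raw', 'payments') }} using (id)"
```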
---

## Workflow 2: Add a monitor — when new transformation logic is added

> **For detailed monitor creation guidance** — including parameter validation, field-type compatibility checks, and common error prevention — see the `monitor-creation` skill (`skills/monitor-creation/SKILL.md`). The workflow below is a quick-path for the common "just added a column, offer a monitor" case within a prevent session.

When the user adds a new column, filter, or business rule, suggest adding a monitor. First, choose the monitor type based on what the new logic does:

```
- New column with a row-level condition (null check, range, regex)
  → createValidationMonitorMac

- New aggregate metric (row count, sum, average, percentile over time)
  → createMetricMonitorMac

- Logic that should match another table or a prior time period
  → createComparisonMonitorMac

- Complex business rule that doesn't fit the above
  → createCustomSqlMonitorMac
```

Then run the appropriate sequence:

```
1. Read the SQL file being edited to extract the specific transformation logic:
   - Confirm the file path from conversation context (do not guess or assume)
   - If no file path is clear, ask the engineer: "Which file contains the new logic?"
   - Extract the specific new column definition, filter condition, or business rule
   - Use this logic directly when constructing the monitor condition in step 3
2. For validation monitors: getValidationPredicates() → show what validation types are available
   For all types: determine the right tool from the selection guide above
3. Call the selected create*MonitorMac tool:
   - createValidationMonitorMac(mcon, description, condition_sql) → returns YAML
   - createMetricMonitorMac(mcon, description, metric, operator) → returns YAML
   - createComparisonMonitorMac(source_table, target_table, metric) → returns YAML
   - createCustomSqlMonitorMac(mcon, description, sql) → returns YAML
   ⚠ If createValidationMonitorMac fails (e.g. column doesn't exist yet in the live table),
     fall back to createCustomSqlMonitorMac with an explicit SQL query instead.
4. Save the YAML to <project>/monitors/<table_name>.yml
5. Run: montecarlo monitors apply --dry-run (to preview)
6. Run: montecarlo monitors apply --auto-yes (to apply)
```

**Important — YAML format for `monitors apply`:**
All `create*MonitorMac` tools return YAML that is not directly compatible with `montecarlo monitors apply`. Reformat the output into a standalone monitor file with `montecarlo:` as the root key. The second-level key matches the monitor type: `custom_sql:`, `validation:`, `metric:`, or `comparison:`. The example below shows `custom_sql:` — substitute the appropriate key for other monitor types.

```yaml
# monitors/<table_name>.yml ← monitor definitions only, NOT montecarlo.yml
montecarlo:
  custom_sql:
    - warehouse: <warehouse_name>
      name: <monitor_name>
      description: <description>
      schedule:
        interval_minutes: 720
        start_time: '<ISO timestamp>'
      sql: <your validation SQL>
      alert_conditions:
        - operator: GT
          threshold_value: 0.0
```

The `montecarlo.yml` project config is a **separate file** in the project root containing only:
```yaml
# montecarlo.yml ← project config only, NOT monitor definitions
version: 1
namespace: <your-namespace>
default_resource: <warehouse_name>
```

Do NOT put `version:`, `namespace:`, or `default_resource:` inside monitor definition files.

---

## Workflow 3: Alert triage — when investigating an active incident

```
1. getAlerts(
     created_after="<start>",
     created_before="<end>",
     order_by="-createdTime",
     statuses=["NOT_ACKNOWLEDGED"]
   ) → list open alerts
2. getTable(mcon="<affected_table_mcon>") → check current table state
3. getAssetLineage(mcon="<mcon>") → identify upstream cause or downstream blast radius
4. getQueriesForTable(mcon="<mcon>") → recent queries that might explain the anomaly
```

To respond to an alert:
- `updateAlert(alert_id="<id>", status="ACKNOWLEDGED")` — acknowledge it
- `setAlertOwner(alert_id="<id>", owner="<email>")` — assign ownership
- `createOrUpdateAlertComment(alert_id="<id>", comment="<text>")` — add context
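A sketch of a triage ordering over the alert list: unacknowledged alerts first, then newest first. The dict field names here are assumptions for illustration, not the exact MCP response schema:

```python
def triage_order(alerts: list[dict]) -> list[dict]:
    """Order alerts for triage: unacknowledged first, newest first within each group.

    Sketch over plain dicts shaped roughly like getAlerts results;
    verify field names against the actual response.
    """
    # Sort newest-first, then stable-sort so acknowledged alerts sink to the bottom
    by_newest = sorted(alerts, key=lambda a: a.get("created_time", ""), reverse=True)
    return sorted(by_newest, key=lambda a: a.get("status") == "ACKNOWLEDGED")

alerts = [
    {"id": 1, "status": "ACKNOWLEDGED", "created_time": "2026-02-03"},
    {"id": 2, "status": "NOT_ACKNOWLEDGED", "created_time": "2026-02-01"},
    {"id": 3, "status": "NOT_ACKNOWLEDGED", "created_time": "2026-02-02"},
]
```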
---

## Workflow 4: Change impact assessment — REQUIRED before modifying a model

**Trigger:** Any expressed intent to add, rename, drop, or change a column, join, filter, or model logic. Run this immediately — before writing any code — even if the user hasn't asked for it.

### Bugfixes and reverts require impact assessment too

When the user says "fix", "revert", "restore", or "undo", run this workflow
before writing any code — even if the change seems small or safe.

A revert that undoes a column addition or changes join logic has the same
blast radius as the original change. Downstream models may have already
adapted to the "incorrect" behavior, meaning the fix itself could break them.

Pay special attention to:
- Whether the revert removes a column other models now depend on
- Whether downstream models reference the specific logic being reverted
- Whether active alerts may be related to the change being reverted

When the user is about to rename or drop a column, change a join condition, alter a filter, or refactor a model's logic, run this sequence to surface the blast radius before any changes are committed:

```
1. search(query="<table_name>") + getTable(mcon="<mcon>")
   → importance score, query volume (reads/writes per day), key asset flag

2. getAssetLineage(mcon="<mcon>")
   → full list of downstream dependents; for each, note whether it is a key asset

3. getTable(mcon="<downstream_mcon>") for each key downstream asset
   → importance score, last updated, monitoring status

4. getAlerts(
     created_after="<7 days ago>",
     created_before="<now>",
     table_mcons=["<mcon>", "<downstream_mcon_1>", ...],
     statuses=["NOT_ACKNOWLEDGED"]
   )
   → any active incidents already affecting this table or its dependents

5. getQueriesForTable(mcon="<mcon>")
   → recent queries; scan for references to the specific columns being changed
   → use getQueryData(query_id="<id>") to fetch full SQL for ambiguous cases

5b. Supplementary local search for downstream dbt refs:
   - Search the local models/ directory for ref('<table_name>') (single-hop only)
   - Compare results against getAssetLineage output from step 2
   - If any local models reference this table but are NOT in MC's lineage results:
     "⚠️ Found N local model(s) referencing this table not yet in MC's lineage: [list]"
   - If no models/ directory exists in the current project, skip silently
   - MC lineage remains the authoritative source — local grep is supplementary only

6. getMonitors(mcon="<mcon>")
   → which monitors are watching columns or metrics affected by the change
```
### Risk tier assessment

| Tier | Conditions |
|---|---|
| 🔴 High | Key asset downstream, OR active alerts already firing, OR >50 reads/day |
| 🟡 Medium | Non-key assets downstream, OR monitors on affected columns, OR moderate query volume |
| 🟢 Low | No downstream dependents, no active alerts, low query volume |
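The tier table can be sketched as a classifier. The High conditions come straight from the table; the "moderate query volume" cutoff (here >10 reads/day) is an assumption, since the document does not define it:

```python
def risk_tier(key_asset_downstream: bool, active_alerts: int, reads_per_day: int,
              downstream_count: int, monitored_columns_affected: bool) -> str:
    """Map Workflow 4 findings to a risk tier per the table above.

    Sketch only — the Medium threshold for query volume is assumed.
    """
    if key_asset_downstream or active_alerts > 0 or reads_per_day > 50:
        return "High"
    if downstream_count > 0 or monitored_columns_affected or reads_per_day > 10:
        return "Medium"
    return "Low"
```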
### Multi-model changes

When the user is changing multiple models in the same session or same domain
(e.g., 3 timeseries models, 4 criticality_score models):

- Run a single consolidated impact assessment across all changed tables
- Deduplicate downstream dependents — if two changed tables share a downstream
  dependent, count it once and note that it's affected by multiple upstream changes
- Present a unified blast radius report rather than N separate reports
- Escalate risk tier if the combined blast radius is larger than any individual table

Example consolidated report header:
"## Change Impact: 3 models in timeseries domain
Combined downstream blast radius: 28 tables (deduplicated)
Highest risk table: timeseries_detector_routing (22 downstream refs)"
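The deduplication step above can be sketched as set arithmetic over per-model lineage results (the data shape is an illustration, not the MCP response format):

```python
from collections import Counter

def combined_blast_radius(lineage: dict[str, set[str]]) -> tuple[set[str], set[str]]:
    """Deduplicate downstream dependents across several changed models.

    Returns (all_dependents, dependents_hit_by_multiple_upstream_changes).
    """
    counts = Counter(dep for deps in lineage.values() for dep in deps)
    all_deps = set(counts)
    multi_hit = {dep for dep, n in counts.items() if n > 1}
    return all_deps, multi_hit

lineage = {"model_a": {"x", "y"}, "model_b": {"y", "z"}}
```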
### Report format

```
## Change Impact: <table_name>

Risk: 🔴 High / 🟡 Medium / 🟢 Low

Downstream blast radius:
- <N> tables depend on this model
- Key assets affected: <list or "none">

Active incidents:
- <alert title, status> or "none"

Column exposure (for columns being changed):
- Found in <N> recent queries (e.g. <query snippet>)

Monitor coverage:
- <monitor name> watches <metric> — will be affected by this change
- If zero custom monitors exist → append:
  "⚠️ No custom monitors on this table. After making your changes,
  I'll suggest a monitor for the new logic — or say 'add a monitor'
  to do it now."

Recommendation:
- <specific callout, e.g. "Notify owners of downstream_table before deploying",
  "Coordinate with the freshness alert owner", "Add a monitor for the new column">
```

If risk is 🔴 High:
1. Call `getAudiences()` to retrieve configured notification audiences
2. Include in the recommendation: "Notify: <audience names / channels>"
3. Proactively suggest:
   - Notifying owners of downstream key assets (`setAlertOwner` / `createOrUpdateAlertComment` on active alerts)
   - Adding a monitor for the new logic before deploying (Workflow 2)
   - Running `montecarlo monitors apply --dry-run` after changes to verify nothing breaks
### Synthesis: translate findings into code recommendations

After presenting the impact report, use the findings to shape your code suggestion.
Do not present MC data and then write code as if the data wasn't there.
Explicitly connect each key finding to a specific recommendation:

- Active alerts firing on the table:
  → Recommend deferring or minimally scoping the change until alerts are resolved
  → Explain: "There are N active alerts on this table — making this change now
    risks compounding an existing data quality issue"

- Key assets downstream:
  → Recommend defensive coding patterns: null guards, backward-compatible changes,
    additive-only schema changes where possible
  → Explain: "X downstream key assets depend on this table — I'd recommend
    writing this as [specific pattern] to avoid breaking [specific dependent]"

- Monitors on affected columns:
  → Call out that the change will affect monitor coverage
  → Recommend updating monitors alongside the code change (offer Workflow 2)
  → Explain: "The existing monitor on [column] will need to be updated to
    account for this change"

- New output column or logic being added:
  → Always offer Workflow 2 after the impact assessment, regardless
    of existing monitor coverage
  → Do not skip this step even if risk tier is 🟢 Low
  → Say explicitly: "This adds new output logic — would you like me
    to generate a monitor for it? I can add a null check, range
    validation, or custom SQL rule."
  → Wait for the user's response before proceeding with the edit

- High read volume (>50 reads/day):
  → Recommend extra caution around column renames or removals
  → Suggest backward-compatible transition (add new column, deprecate old one)
  → Explain: "This table has [N] reads/day — a column rename without a
    transition period would break downstream consumers immediately"

- Column renames, even inside CTEs:
  → Never assume a CTE-internal rename is safe. Always check:
    1. Does this column appear in the final SELECT, directly or
       via a CTE that feeds into the final SELECT?
    2. If yes — treat as a breaking change. Recommend a
       backward-compatible transition: add the correctly-named
       column, keep the old one temporarily, remove in a
       follow-up PR.
    3. If truly internal and never surfaces in output — confirm
       this explicitly before proceeding.
  → Explain: "Even though this column is defined in a CTE, if it
    surfaces in the final SELECT it is a public output column —
    renaming it breaks any downstream model selecting it by name."
---

## Workflow 5: Change validation queries — after a code change is made

**Trigger:** Explicit engineer intent only. Activate when the engineer says something like:
- "generate validation queries", "validate this change", "I'm done with this change"
- "let me test this", "write queries to check this", "ready to commit"

**Required session context — do not activate without both:**
1. Workflow 4 (change impact assessment) has run for this table in this session
2. A file edit was made to a `.sql` or dbt model file for that same table

**Do NOT activate automatically after file edits. Do NOT proactively offer after Workflow 4 or file edits. The engineer asks when they are ready.**

---

### What this workflow does

Using the context already in the session — the Workflow 4 findings, the file diff, and the `getTable` result — generate 3–5 targeted SQL validation queries that directly test whether this specific change behaved as intended.

These are not generic templates. Use the semantic meaning of the change from Workflow 4 context: which columns changed and why, what business logic was affected, what downstream models depend on this table, and what monitors exist. A null check on a new `days_since_contract_start` column should verify it is never negative and never null for rows with a `contract_start_date` — not just check for nulls generically.

---

### Step 1 — Identify the change type from session context

From Workflow 4 findings and the file diff, classify the primary change. A change may span multiple types — classify the dominant one and note secondaries:

- **New column** — a new output column was added to the SELECT
- **Filter change** — a WHERE clause, IN-list, or CASE condition was modified
- **Join change** — a JOIN condition or join target was modified
- **Column rename or drop** — an existing output column was renamed or removed
- **Parameter change** — a hardcoded threshold, constant, or numeric value was changed
- **New model** — the file was newly created, no production baseline exists

---

### Step 2 — Determine warehouse context from Workflow 4

From the `getTable` result already in session context, extract:
- **Fully qualified table name** — e.g. `analytics.prod_internal_bi.client_hub_master`
- **Warehouse type** — Snowflake, BigQuery, Redshift, Databricks
- **Schema** — already resolved, do not re-derive

Use the correct SQL dialect for the warehouse type. Key differences:

| Warehouse | Date diff | Current timestamp | Notes |
|---|---|---|---|
| Snowflake | `DATEDIFF('day', a, b)` | `CURRENT_TIMESTAMP()` | `QUALIFY` supported |
| BigQuery | `DATE_DIFF(a, b, DAY)` | `CURRENT_TIMESTAMP()` | Use subquery instead of `QUALIFY` |
| Redshift | `DATEDIFF('day', a, b)` | `GETDATE()` | |
| Databricks | `DATEDIFF(a, b)` | `CURRENT_TIMESTAMP()` | |
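The dialect table can be sketched as a template map for generating the date-diff expression per warehouse. The templates mirror the table above; verify against your warehouse's documentation before relying on them:

```python
DATE_DIFF_DAYS = {
    # "days between a and b" per warehouse, matching the dialect table
    "snowflake":  "DATEDIFF('day', {a}, {b})",
    "bigquery":   "DATE_DIFF({a}, {b}, DAY)",
    "redshift":   "DATEDIFF('day', {a}, {b})",
    "databricks": "DATEDIFF({a}, {b})",
}

def date_diff_days(warehouse: str, a: str, b: str) -> str:
    """Render a day-difference expression for the given warehouse type."""
    return DATE_DIFF_DAYS[warehouse.lower()].format(a=a, b=b)
```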
For the dev database, use the placeholder `<YOUR_DEV_DATABASE>` with a comment instructing the engineer to replace it. Do not guess the dev database name.

---

### Step 3 — Apply database targeting rules (mandatory)

These rules are not negotiable — violating them produces queries that will fail at runtime:

- **Columns or logic that only exist post-change** → dev database only. Never query production for a column that doesn't exist there yet.
- **Comparison queries (before vs after)** → both production and dev databases
- **New model (no production baseline)** → dev database only for all queries
- **Row count comparison** → always include, always query both databases

---

### Step 4 — Generate targeted validation queries

Always include a row count comparison regardless of change type — it's the baseline signal that something unexpected happened.

Then generate change-specific queries based on what needs to be validated for this change type. Use the exact conditions, column names, and business logic from the diff and Workflow 4 findings — not generic placeholders. The goal for each change type:

**New column:** Verify the column is non-null where it should be non-null (based on its business meaning), that its value range is plausible, and that its distribution makes sense given the underlying data. Query dev only.

**Filter change:** Verify that only the intended rows were reclassified — generate a before/after count showing how many rows were added or removed by the new condition using the exact filter logic from the diff, and a sample of the rows that changed classification. The sample helps the engineer confirm the right records moved.

**Join change:** Verify that the join didn't introduce duplicates — a uniqueness check on the join key is essential. Also verify row count didn't change unexpectedly. Query dev for uniqueness, both databases for row count.

**Column rename or drop:** Verify the old column name is absent and the new column (if renamed) is present in the dev schema. Also verify that downstream models referencing the old column name are identified — use the local ref() grep results from Workflow 4 if available.

**Parameter or threshold change:** Verify the distribution of values affected by the change — how many rows moved above or below the new threshold, and whether the count matches the engineer's expectation. Query both databases to compare before and after.

**New model:** No production comparison possible. Verify row count is non-zero and plausible, sample rows look correct, and key columns are non-null. Query dev only.

---

### Step 5 — Add change-specific context to each query

For every query, include a SQL comment block that explains:
- What the query is checking
- What a healthy result looks like **for this specific change**
- What would indicate a problem

Derive this context from Workflow 4 findings. Use the business meaning of the change, not generic descriptions. For example, for adding `days_since_contract_start`:

```sql
/*
Null rate check: days_since_contract_start (new column, dev only)
What to look for:
- Null count should equal workspaces with no contract_start_date
- All rows with contract_start_date should have a non-null, non-negative value
- Values above 3650 (~10 years) are suspicious and may indicate a data issue
*/
```

This is what differentiates these queries from generic validation — the comment tells the engineer exactly what pass and fail look like for their specific change.

---

### Step 6 — Save to local file

Save all generated queries to:
```
validation/<table_name>_<YYYYMMDD_HHMM>.sql
```
|
||||
|
||||
Include a header at the top of the file:
```sql
/*
Validation queries for: <fully_qualified_table>
Change type: <change type from Step 1>
Generated: <timestamp>
Workflow 4 risk tier: <tier from this session>

Instructions:
1. Replace <YOUR_DEV_DATABASE> with your personal or branch database
2. Run the row count comparison first
3. Run change-specific queries to validate intended behavior
4. Unexpected results should be investigated before merging
*/
```

Then tell the engineer:
> "Validation queries saved to `validation/<table_name>_<timestamp>.sql`.
> Replace `<YOUR_DEV_DATABASE>` with your dev database and run in Snowflake
> or your preferred SQL client to verify the change behaved as expected."

---

### What this workflow does NOT do
- Does not execute queries (Phase 2)
- Does not require warehouse MCP connection
- Does not generate Monte Carlo notebook YAML
- Does not trigger automatically — only on explicit engineer request
- Does not activate if Workflow 4 has not run for this table in this session
@@ -0,0 +1,363 @@
---
name: monte-carlo-push-ingestion
description: "Expert guide for pushing metadata, lineage, and query logs to Monte Carlo from any data warehouse."
category: data
risk: safe
source: community
source_repo: monte-carlo-data/mc-agent-toolkit
source_type: community
date_added: "2026-04-08"
author: monte-carlo-data
tags: [data-observability, ingestion, monte-carlo, pycarlo, metadata]
tools: [claude, cursor, codex]
---

# Monte Carlo Push Ingestion

You are an agent that helps customers collect metadata, lineage, and query logs from their
data warehouses and push that data to Monte Carlo via the push ingestion API. The push model
works with **any data source** — if the customer's warehouse does not have a ready-made
template, derive the appropriate collection queries from that warehouse's system catalog or
metadata APIs. The push format and pycarlo SDK calls are the same regardless of source.

Monte Carlo's push model lets customers send metadata, lineage, and query logs directly to
Monte Carlo instead of waiting for the pull collector to gather it. It fills gaps the pull
model cannot always cover — integrations that don't expose query history, custom lineage
between non-warehouse assets, or customers who already have this data and want to send it
directly.

Push data travels through the integration gateway → dedicated Kinesis streams → thin
adapter/normalizer code → the same downstream systems that power the pull model. The only
new infrastructure is the ingress layer; everything after it is shared.

## MANDATORY — Always start from templates

When generating any push-ingestion script, you MUST:

1. **Read the corresponding template** before writing any code. Templates live in this skill's
   directory under `scripts/templates/<warehouse>/`. To find them, glob for
   `**/push-ingestion/scripts/templates/<warehouse>/*.py` — this works regardless of where the
   skill is installed. Do NOT search from the current working directory alone.
2. **Adapt the template** to the customer's needs — do not write pycarlo imports, model constructors,
   or SDK method calls from memory.
3. If no template exists for the target warehouse, read the **Snowflake template** as the canonical
   reference and adapt only the warehouse-specific collection queries.

Template files follow this naming pattern:
- `collect_<flow>.py` — collection only (queries the warehouse, writes a JSON manifest)
- `push_<flow>.py` — push only (reads the manifest, sends to Monte Carlo)
- `collect_and_push_<flow>.py` — combined (imports from both, runs in sequence)

**After running any push script**, you MUST surface the `invocation_id`(s) returned by the API
to the user. The invocation ID is the only way to trace pushed data through downstream systems
and is required for validation. Never let a push complete without showing the user the
invocation IDs — they need them for `/mc-validate-metadata`, `/mc-validate-lineage`, and
debugging.

## Canonical pycarlo API — authoritative reference

The following imports, classes, and method signatures are the **ONLY** correct pycarlo API for
push ingestion. If your training data suggests different names, **it is wrong**. Use exactly
what is listed here.

### Imports and client setup

```python
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
    # Metadata
    RelationalAsset, AssetMetadata, AssetField, AssetVolume, AssetFreshness, Tag,
    # Lineage
    LineageEvent, LineageAssetRef, ColumnLineageField, ColumnLineageSourceField,
    # Query logs
    QueryLogEntry,
)

client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
service = IngestionService(mc_client=client)
```

### Method signatures

```python
# Metadata
service.send_metadata(resource_uuid=..., resource_type=..., events=[RelationalAsset(...)])

# Lineage (table or column)
service.send_lineage(resource_uuid=..., resource_type=..., events=[LineageEvent(...)])

# Query logs — note: log_type, NOT resource_type
service.send_query_logs(resource_uuid=..., log_type=..., events=[QueryLogEntry(...)])

# Extract invocation ID from any response
service.extract_invocation_id(result)
```

### RelationalAsset structure (nested, NOT flat)

```python
RelationalAsset(
    type="TABLE",  # ONLY "TABLE" or "VIEW" (uppercase) — normalize warehouse-native values
    metadata=AssetMetadata(
        name="my_table",
        database="analytics",
        schema="public",
        description="optional description",
    ),
    fields=[
        AssetField(name="id", type="INTEGER", description=None),
        AssetField(name="amount", type="DECIMAL(10,2)"),
    ],
    volume=AssetVolume(row_count=1000000, byte_count=111111111),  # optional
    freshness=AssetFreshness(last_update_time="2026-03-12T14:30:00Z"),  # optional
)
```

## Environment variable conventions

All generated scripts MUST use these exact variable names. Do NOT invent alternatives like
`MCD_KEY_ID`, `MC_TOKEN`, `MONTE_CARLO_KEY`, etc.

| Variable | Purpose | Used by |
|---|---|---|
| `MCD_INGEST_ID` | Ingestion key ID (scope=Ingestion) | push scripts |
| `MCD_INGEST_TOKEN` | Ingestion key secret | push scripts |
| `MCD_ID` | GraphQL API key ID | verification scripts |
| `MCD_TOKEN` | GraphQL API key secret | verification scripts |
| `MCD_RESOURCE_UUID` | Warehouse resource UUID | all scripts |

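Reading these variables in a generated script can look like the following — a minimal sketch that fails fast on anything missing; `load_push_env` is an illustrative helper, not part of the templates:

```python
import os

# Exact names from the table above — push scripts use the Ingestion-scoped pair.
REQUIRED_PUSH_VARS = ("MCD_INGEST_ID", "MCD_INGEST_TOKEN", "MCD_RESOURCE_UUID")

def load_push_env():
    """Read the push-script variables, failing fast on anything missing."""
    missing = [name for name in REQUIRED_PUSH_VARS if not os.environ.get(name)]
    if missing:
        raise SystemExit("Missing required environment variables: " + ", ".join(missing))
    return {name: os.environ[name] for name in REQUIRED_PUSH_VARS}
```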
## What this skill can build for you

Tell Claude your warehouse or data platform and Monte Carlo resource UUID and this skill will
generate a ready-to-run Python script that:
- Connects to your warehouse using the idiomatic driver for that platform
- Discovers databases, schemas, and tables
- Extracts the right columns — names, types, row counts, byte counts, last modified time, descriptions
- Builds the correct pycarlo `RelationalAsset`, `LineageEvent`, or `QueryLogEntry` objects
- Pushes to Monte Carlo and saves an output manifest with the `invocation_id` for tracing

Templates are available for common warehouses (Snowflake, BigQuery, BigQuery Iceberg,
Databricks, Redshift, Hive). For any other platform, Claude will derive the appropriate
collection queries from the warehouse's system catalog or metadata APIs and generate an
equivalent script.

### Ready-to-run examples

Production-ready example scripts built from these templates are published in the
[mcd-public-resources](https://github.com/monte-carlo-data/mcd-public-resources) repo:

- **[BigQuery Iceberg (BigLake) tables](https://github.com/monte-carlo-data/mcd-public-resources/tree/main/examples/push-ingestion/bigquery/push-iceberg-tables)** —
  metadata and query log collection for BigQuery Iceberg tables that are invisible to Monte
  Carlo's standard pull collector (which uses `__TABLES__`). Includes a `--only-freshness-and-volume`
  flag for fast periodic pushes that skip the schema/fields query — useful for hourly cron jobs
  after the initial full metadata push.

## Reference docs — when to load

| Reference file | Load when… |
|---|---|
| `references/prerequisites.md` | Customer is setting up for the first time, has auth errors, or needs help creating API keys |
| `references/push-metadata.md` | Building or debugging a metadata collection script |
| `references/push-lineage.md` | Building or debugging a lineage collection script |
| `references/push-query-logs.md` | Building or debugging a query log collection script |
| `references/custom-lineage.md` | Customer needs custom lineage nodes or edges via GraphQL |
| `references/validation.md` | Verifying pushed data, running GraphQL checks, or deleting push-ingested tables |
| `references/direct-http-api.md` | Customer wants to call push APIs directly via curl/HTTP without pycarlo |
| `references/anomaly-detection.md` | Customer asks why freshness or volume detectors aren't firing |

## Prerequisites — read this first

→ Load `references/prerequisites.md`

Two separate API keys are required. This is the most common setup stumbling block:
- **Ingestion key** (scope=Ingestion) — for pushing data
- **GraphQL API key** — for verification queries

Both use the same `x-mcd-id` / `x-mcd-token` headers but point to different endpoints.

## What you can push

| Flow | pycarlo method | Push endpoint | Type field | Expiration |
|---|---|---|---|---|
| Table metadata | `send_metadata()` | `/ingest/v1/metadata` | `resource_type` (e.g. `"data-lake"`) | **Never expires** |
| Table lineage | `send_lineage()` | `/ingest/v1/lineage` | `resource_type` (same as metadata) | **Never expires** |
| Column lineage | `send_lineage()` (events include `fields`) | `/ingest/v1/lineage` | `resource_type` (same as metadata) | **Expires after 10 days** |
| Query logs | `send_query_logs()` | `/ingest/v1/querylogs` | **`log_type`** (not `resource_type`!) | Same as pulled |
| Custom lineage | GraphQL mutations | `api.getmontecarlo.com/graphql` | N/A — uses GraphQL API key | 7 days default; set `expireAt: "9999-12-31"` for permanent |

**Important**: Query logs use `log_type` instead of `resource_type`. This is the only push
endpoint where the field name differs. See `references/push-query-logs.md` for the full list
of supported `log_type` values.

The pycarlo SDK is optional — you can also call the push APIs directly via HTTP/curl. See
`references/direct-http-api.md` for examples.

Every push returns an `invocation_id` — save it. It is your primary debugging handle across
all downstream systems.

## Step 1 — Generate your collection scripts

Ask Claude to build the script for your warehouse:

> "Build me a metadata collection script for Snowflake. My MC resource UUID is `abc-123`."

The script templates in `**/push-ingestion/scripts/templates/` (Snowflake, BigQuery, BigQuery Iceberg, Databricks, Redshift, Hive)
are the **mandatory starting point** for script generation — they contain the correct pycarlo
imports, model constructors, and SDK calls. **They are not an exhaustive list.** If the
customer's warehouse is not listed, use the templates as a guide and determine the appropriate
queries or file-collection approach for their platform. For file-based sources (like Hive
Metastore logs), provide the command to retrieve the file, parse it, and transform it into the
format required by the push APIs. The push format and SDK calls are identical regardless of
source; only the collection queries change.

**Batching**: For large payloads, split events into batches. Use a batch size of **50 assets**
per push call. The pycarlo HTTP client has a hardcoded 10-second read timeout that cannot be
overridden (`Session` and `Client` do not accept a `timeout` parameter) — larger batches (200+)
will time out on warehouses with thousands of tables. The compressed request body must also not
exceed **1MB** (Kinesis limit). All push endpoints support batching.

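The batching rule above can be sketched as a generic loop. Here `send` and `extract` stand in for the pycarlo calls (e.g. `service.send_metadata(...)` and `service.extract_invocation_id(...)`); they are injected as parameters so the loop is testable without a live client — an assumption of this sketch, not the templates' exact structure:

```python
def push_in_batches(events, send, extract, batch_size=50):
    """Push events in fixed-size batches, collecting every invocation_id.

    send(batch) performs one push call; extract(result) pulls its invocation_id.
    Returns the list of invocation_ids so the caller can surface them to the user.
    """
    invocation_ids = []
    for start in range(0, len(events), batch_size):
        batch = events[start:start + batch_size]
        invocation_ids.append(extract(send(batch)))
    return invocation_ids
```

Keeping the batch size at 50 stays comfortably inside both the 10-second read timeout and the 1MB compressed-body limit.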
**Push frequency**: Push at most **once per hour**. Sub-hourly pushes produce unpredictable
anomaly detector behavior because the training pipeline aggregates into hourly buckets.

**Per flow, see:**
- Metadata (schema + volume + freshness): `references/push-metadata.md`
- Table and column lineage: `references/push-lineage.md`
- Query logs: `references/push-query-logs.md`

## Step 2 — Validate pushed data

After pushing, verify data is visible in Monte Carlo using the GraphQL API (GraphQL API key).

→ `references/validation.md` — all verification queries (getTable, getMetricsV4,
getTableLineage, getDerivedTablesPartialLineage, getAggregatedQueries)

Timing expectations:
- **Metadata**: visible within a few minutes
- **Table lineage**: visible within seconds to a few minutes (fast direct path to Neo4j)
- **Column lineage**: a few minutes
- **Query logs**: at least **15-20 minutes** (async processing pipeline)

## Step 3 — Anomaly detection (optional)

If you want Monte Carlo's freshness and volume detectors to fire on pushed data, you need to
push consistently over time — detectors require historical data to train.

→ `references/anomaly-detection.md` — recommended push frequency, minimum samples,
training windows, and what to tell customers who ask why detectors aren't activating

## Custom lineage nodes and edges

For non-warehouse assets (dbt models, Airflow DAGs, custom ETL pipelines) or cross-resource
lineage, use the GraphQL mutations directly:

→ `references/custom-lineage.md` — `createOrUpdateLineageNode`, `createOrUpdateLineageEdge`,
`deleteLineageNode`, and the critical `expireAt: "9999-12-31"` rule

## Deleting push-ingested tables

Push tables are excluded from the normal pull-based deletion flow (intentionally). To delete
them explicitly, use `deletePushIngestedTables` — covered in `references/validation.md`
under "Table management operations".

## Available slash commands

Customers can invoke these explicitly instead of describing their intent in prose:

| Command | Purpose |
|---|---|
| `/mc-build-metadata-collector` | Generate a metadata collection script |
| `/mc-build-lineage-collector` | Generate a lineage collection script |
| `/mc-build-query-log-collector` | Generate a query log collection script |
| `/mc-validate-metadata` | Verify pushed metadata via the GraphQL API |
| `/mc-validate-lineage` | Verify pushed lineage via the GraphQL API |
| `/mc-validate-query-logs` | Verify pushed query logs via the GraphQL API |
| `/mc-create-lineage-node` | Create a custom lineage node |
| `/mc-create-lineage-edge` | Create a custom lineage edge |
| `/mc-delete-lineage-node` | Delete a custom lineage node |
| `/mc-delete-push-tables` | Delete push-ingested tables |

## Debugging checkpoints

When pushed data isn't appearing, work through these five checkpoints in order:

1. **Did the SDK return a `202` and an `invocation_id`?**
   If not, the gateway rejected the request — check auth headers and `resource.uuid`.

2. **Is the integration key the right type?**
   Must be scope `Ingestion`, created via `montecarlo integrations create-key --scope Ingestion`.
   A standard GraphQL API key will not work for push.

3. **Is `resource.uuid` correct and authorized?**
   The key can be scoped to specific warehouse UUIDs. If the UUID doesn't match, you get `403`.

4. **Did the normalizer process it?**
   Use the `invocation_id` to search CloudWatch logs for the relevant Lambda. For query logs,
   check the `log_type` — Hive requires `"hive-s3"`, not `"hive"`.

5. **Did the downstream system pick it up?**
   - Metadata: query `getTable` in GraphQL
   - Table lineage: check Neo4j within seconds–minutes (fast path via PushLineageProcessor)
   - Query logs: wait at least 15-20 minutes; check `getAggregatedQueries`

## Known gotchas

- **`log_type` vs `resource_type`**: metadata and lineage use `resource_type` (e.g. `"data-lake"`);
  query logs use **`log_type`** — the only endpoint where the field name differs. Wrong value →
  `Unsupported ingest query-log log_type` error.
- **`invocation_id` must be saved**: every output manifest should include it — it's your
  only tracing handle once the request leaves the SDK.
- **Query log async delay**: at least 15-20 minutes. `getAggregatedQueries` will return 0 until
  processing completes — this is expected, not a bug.
- **Custom lineage `expireAt` defaults to 7 days**: nodes vanish silently unless you set
  `expireAt: "9999-12-31"` for permanent nodes.
- **Push tables are never auto-deleted**: the periodic cleanup job excludes them by default
  (`exclude_push_tables=True`). Delete them explicitly via `deletePushIngestedTables` (max
  1,000 MCONs per call; also deletes lineage nodes and all edges touching those nodes).
- **Anomaly detectors need history**: pushing once is not enough. Freshness needs 7+ pushes
  over ~2 weeks; volume needs 10–48 samples over ~42 days. Push at most once per hour.
- **Batching required for large payloads**: the compressed request body must not exceed 1MB.
  Split large event lists into batches.
- **Column lineage expires after 10 days**: unlike table metadata and table lineage (which
  never expire), column lineage has a 10-day TTL, same as pulled column lineage.
- **Quote SQL identifiers in warehouse queries**: database, schema, and table names must be
  quoted to handle mixed-case or special characters. The quoting syntax varies by warehouse —
  Snowflake and Redshift use double quotes (`"{db}"`), BigQuery/Databricks/Hive use backticks
  (`` `db` ``). The templates already handle this correctly for each warehouse — follow the
  same quoting pattern when adapting.

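The per-warehouse quoting rule from the last gotcha can be captured in a tiny helper — a sketch only; escaping embedded quote characters by doubling them is the common convention for these dialects, but verify against your warehouse's documentation:

```python
def quote_ident(name, warehouse):
    """Quote a database/schema/table identifier for the target warehouse."""
    if warehouse in ("snowflake", "redshift"):
        # Double-quote style; embedded double quotes are doubled
        return '"' + name.replace('"', '""') + '"'
    if warehouse in ("bigquery", "databricks", "hive"):
        # Backtick style; embedded backticks are doubled
        return "`" + name.replace("`", "``") + "`"
    raise ValueError("Unknown warehouse: " + warehouse)
```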
## Memory safety

Generated scripts must include a startup memory check. The collection phase loads query history
rows into memory for parsing — on large warehouses with long lookback windows, this can exhaust
available RAM and cause the process to be silently killed (SIGKILL / exit 137) with no traceback.

Add this pattern near the top of every generated script, after imports:

```python
import os

def _check_available_memory(min_gb: float = 2.0) -> None:
    """Warn if available memory is below the threshold."""
    try:
        if hasattr(os, "sysconf"):  # Linux / macOS
            page_size = os.sysconf("SC_PAGE_SIZE")
            avail_pages = os.sysconf("SC_AVPHYS_PAGES")
            avail_gb = (page_size * avail_pages) / (1024 ** 3)
        else:
            return  # Windows — skip check
    except (ValueError, OSError):
        return
    if avail_gb < min_gb:
        print(
            f"WARNING: Only {avail_gb:.1f} GB of memory available "
            f"(minimum recommended: {min_gb:.1f} GB). "
            f"Consider reducing the lookback window or increasing available memory."
        )
```

Call `_check_available_memory()` before connecting to the warehouse.

Additionally, when fetching query history:
- Use `cursor.fetchmany(batch_size)` in a loop instead of `cursor.fetchall()` when possible
- For very large result sets, consider adding a LIMIT clause and processing in windows
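The `fetchmany` loop from the first bullet can be sketched as a small generator that works with any DB-API cursor — an illustration, not part of the templates:

```python
def iter_rows(cursor, batch_size=10_000):
    """Yield rows from a DB-API cursor without loading the full result set."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:  # empty list signals the result set is exhausted
            break
        for row in rows:
            yield row
```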
@@ -0,0 +1,87 @@
# Anomaly Detection for Push-Ingested Data

Push volume and freshness data feeds the same anomaly detectors as the pull model.
The detectors don't activate immediately — they need enough historical data to learn
expected behavior before they can alert on deviations.

## Recommended push frequency: hourly

- Push at most **once per hour** — pushing more frequently produces unpredictable detector
  behavior because the training pipeline aggregates data into hourly buckets
- Push **consistently** — gaps of more than a few days delay activation or deactivate
  previously active detectors

## Freshness detector

The freshness detector learns how often a table is updated and fires when it has not been
updated for longer than expected.

**What it trains on**: consecutive differences (`delta_sec`) between `last_update_time`
values across pushes. A push only counts if `last_update_time` actually changed.

**Requirements to activate:**
| Requirement | Value |
|---|---|
| Minimum samples | 7 pushes where `last_update_time` changed (or coverage ≥ 0.8 for slow tables) |
| Minimum coverage | 0.15 (= `median_update_secs × n_samples / 22 days`) |
| Training window | 35 days |
| Supported update cycle | 5 minutes – 7.7 days |
| Minimum table age | ~14 days on older warehouses |

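The coverage formula from the table above can be checked numerically — a simplified sketch that applies only the 7-sample and 0.15-coverage rules, ignoring the slow-table exception and training window:

```python
def freshness_coverage(median_update_secs, n_samples):
    """coverage = median_update_secs * n_samples / 22 days (per the table above)."""
    return median_update_secs * n_samples / (22 * 24 * 3600)

def can_activate(median_update_secs, n_samples):
    """Simplified check: at least 7 changed-sample pushes and coverage >= 0.15."""
    return n_samples >= 7 and freshness_coverage(median_update_secs, n_samples) >= 0.15
```

For a daily table (median update 86400 seconds), 7 changed samples give coverage 7/22 ≈ 0.32, comfortably above the 0.15 floor.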
**Deactivation triggers:**
- No push for **14 days** → `"no recent data"`
- Gap > 7 days in last 14 days, for fast tables (median update ≤ 26.4 hours) → `"gap of over a week in last 2 weeks"`

## Volume detector (Volume Change + Unchanged Size)

Detects unexpected spikes/drops in row count or byte count.

**Requirements to activate:**
| Requirement | Value |
|---|---|
| Minimum samples (daily) | 10 |
| Minimum samples (subdaily, ~12x/day) | 48 |
| Minimum samples (weekly) | 5 |
| Minimum coverage | 0.30 (= `N × median_update_secs / 42 days`) |
| Training window | 42 days |
| Minimum table age | 5 days |
| Regularity check | 75th/25th percentile of update intervals ≥ 0.2 |

**Deactivation**: No hard gap limit, but coverage degrades as the 42-day window advances
without new data. Eventually drops below 0.3 and deactivates.

## Summary table

| | Freshness | Volume Change / Unchanged Size |
|---|---|---|
| Recommended frequency | Hourly | Hourly |
| Maximum frequency | Once per hour | Once per hour |
| Training window | 35 days | 42 days |
| Minimum samples | 7 | 10 (daily) / 48 (subdaily) / 5 (weekly) |
| Minimum coverage | 0.15 | 0.30 |
| Hard deactivation gap | 14 days | No (coverage degrades) |
| Fast-table gap warning | 7 days in last 14 | N/A |

## What to tell customers

When a customer asks "why isn't my anomaly detection working?":

1. **Check detector status** in the MC UI or via GraphQL (`getTable.thresholds.freshness.status`).
   A `"training"` status means not enough data yet. `"inactive"` means a deactivation
   condition was hit — check the reason code.

2. **Verify push frequency** — are they pushing exactly once per hour? Both too-fast and
   too-slow rates cause problems.

3. **Verify that `last_update_time` changes** — for freshness to accumulate training samples,
   each push must carry a *different* `last_update_time` than the previous one. If the table
   hasn't actually updated, the push still arrives but doesn't advance the sample count.

4. **Set realistic expectations** — freshness detectors need about 1–2 weeks of hourly pushes.
   Volume detectors need 10+ days for daily tables, up to 42 days for subdaily tables.
   Anomaly detection is not instant.

5. **Don't pause pushes and then resume** — if a customer pauses pushes for a week and then
   resumes, the freshness detector may deactivate. They should keep pushing even when the
   table hasn't changed (just repeat the same `last_update_time`) to maintain coverage,
   even though that specific push won't count as a new freshness sample.

@@ -0,0 +1,203 @@
# Custom Lineage Nodes and Edges

## When to use this

The `send_lineage()` pycarlo method is the right choice for warehouse tables you own.
The **GraphQL mutations** in this document are for:
- Non-warehouse assets: dbt models, Airflow DAGs, Fivetran connectors, custom ETL jobs
- Connecting nodes across different MC resources (warehouses)
- One-off lineage corrections not tied to a collector run
- Fine-grained control over node properties, object types, and expiry

All mutations use the **GraphQL API key** (not the Ingestion key) and the endpoint
`https://api.getmontecarlo.com/graphql`.

## Critical: expireAt

If you don't set `expireAt`, nodes and edges expire after **7 days** and vanish from the
lineage graph silently. For any node or edge that should persist:

```
expireAt: "9999-12-31"
```

This is the same value that `PushLineageProcessor` uses internally for all push-ingested
lineage. Forgetting this is the most common cause of "my lineage disappeared after a week".

---

## createOrUpdateLineageNode

Creates or updates a node in the lineage graph. If a node with the same
`objectType` + `objectId` + `resourceId` already exists, it is updated.

```graphql
mutation CreateOrUpdateLineageNode(
  $objectType: String!
  $objectId: String!
  $resourceId: UUID
  $resourceName: String
  $name: String
  $properties: [ObjectPropertyInput]
  $expireAt: DateTime
) {
  createOrUpdateLineageNode(
    objectType: $objectType
    objectId: $objectId
    resourceId: $resourceId
    resourceName: $resourceName
    name: $name
    properties: $properties
    expireAt: $expireAt
  ) {
    node {
      mcon
      displayName
      objectType
      isCustom
      expireAt
    }
  }
}
```

**Variables:**
```json
{
  "objectType": "table",
  "objectId": "analytics:analytics.orders",
  "resourceId": "<warehouse-uuid>",
  "name": "orders",
  "expireAt": "9999-12-31"
}
```

`objectType` can be any string — common values: `"table"`, `"view"`, `"report"`,
`"dashboard"`, `"job"`, `"model"`.

`objectId` should be a stable unique identifier for the asset within the resource.
For tables, use the `fullTableId` format: `database:schema.table`.

The returned `mcon` is the stable MC identifier for this node — save it if you plan to
reference it in edges or deletions.

---

## createOrUpdateLineageEdge

Creates or updates a directed edge: source → destination (default: IS_DOWNSTREAM).

```graphql
mutation CreateOrUpdateLineageEdge(
  $source: NodeInput!
  $destination: NodeInput!
  $expireAt: DateTime
  $edgeType: EdgeType
) {
  createOrUpdateLineageEdge(
    source: $source
    destination: $destination
    expireAt: $expireAt
    edgeType: $edgeType
  ) {
    edge {
      source { mcon displayName objectType }
      destination { mcon displayName objectType }
      isCustom
      expireAt
    }
  }
}
```

`NodeInput` shape:
```json
{
  "objectType": "table",
  "objectId": "analytics:analytics.orders",
  "resourceId": "<warehouse-uuid>"
}
```

**Full example — dbt model → warehouse table:**
```json
{
  "source": {
    "objectType": "model",
    "objectId": "dbt://my_project/models/staging/stg_orders",
    "resourceName": "dbt-production"
  },
  "destination": {
    "objectType": "table",
    "objectId": "analytics:analytics.orders",
    "resourceId": "<snowflake-warehouse-uuid>"
  },
  "expireAt": "9999-12-31",
  "edgeType": "IS_DOWNSTREAM"
}
```

---

## deleteLineageNode

Deletes a node and **all its edges and objects**. This is irreversible.

```graphql
mutation DeleteLineageNode($mcon: String!) {
  deleteLineageNode(mcon: $mcon) {
    objectsDeleted
    nodesDeleted
    edgesDeleted
  }
}
```

Get the MCON from `createOrUpdateLineageNode`'s response, or from:
```graphql
query {
  getTable(fullTableId: "analytics:analytics.orders", dwId: "<warehouse-uuid>") {
    mcon
  }
}
```

---

## Python helper for all three mutations

```python
import requests

GRAPHQL_URL = "https://api.getmontecarlo.com/graphql"
HEADERS = {
    "x-mcd-id": "<graphql-api-key-id>",
    "x-mcd-token": "<graphql-api-key-secret>",
    "Content-Type": "application/json",
}

def run_mutation(query: str, variables: dict) -> dict:
    resp = requests.post(
        GRAPHQL_URL,
        json={"query": query, "variables": variables},
        headers=HEADERS,
        timeout=30,  # fail loudly instead of hanging on a stalled connection
    )
    resp.raise_for_status()
    data = resp.json()
    if "errors" in data:
        raise RuntimeError(data["errors"])
    return data["data"]

# Example: create a permanent node
result = run_mutation(
    """mutation($objectType: String!, $objectId: String!, $resourceId: UUID, $expireAt: DateTime) {
        createOrUpdateLineageNode(objectType: $objectType, objectId: $objectId,
                                  resourceId: $resourceId, expireAt: $expireAt) {
            node { mcon displayName }
        }
    }""",
    {
        "objectType": "table",
        "objectId": "analytics:analytics.orders",
        "resourceId": "<warehouse-uuid>",
        "expireAt": "9999-12-31",
    },
)
print("MCON:", result["createOrUpdateLineageNode"]["node"]["mcon"])
```

@@ -0,0 +1,207 @@
|
||||
# Direct HTTP API (without pycarlo)
|
||||
|
||||
The `pycarlo` SDK is optional. You can call the push APIs directly over HTTPS from any
|
||||
language or tool (curl, Postman, etc.) as long as you:
|
||||
- authenticate with an integration key whose scope is `Ingestion`
|
||||
- send a JSON body that matches the ingest schema
|
||||
- send to the correct integration gateway endpoint
|
||||
|
||||
## Endpoint
|
||||
|
||||
The host is environment-specific:
|
||||
- **Production**: `https://integrations.getmontecarlo.com`
|
||||
|
||||
## Authentication headers
|
||||
|
||||
All requests use the same headers:
|
||||
```
|
||||
x-mcd-id: <integration-key-id>
|
||||
x-mcd-token: <integration-key-secret>
|
||||
Content-Type: application/json
|
||||
```
|
||||
|
||||
## Response
|
||||
|
||||
On success, all endpoints return:
|
||||
```json
|
||||
{"invocation_id": "<uuid>"}
|
||||
```
|
||||
|
||||
Save the `invocation_id` — it is the primary trace ID for debugging across downstream systems.

---

## Metadata example

`POST /ingest/v1/metadata`

```bash
curl -X POST "https://integrations.getmontecarlo.com/ingest/v1/metadata" \
  -H "Content-Type: application/json" \
  -H "x-mcd-id: <integration-key-id>" \
  -H "x-mcd-token: <integration-key-secret>" \
  -d '{
    "event_type": "RELATIONAL_ASSET",
    "resource": {
      "uuid": "<warehouse-uuid>",
      "resource_type": "snowflake"
    },
    "events": [
      {
        "type": "TABLE",
        "metadata": {
          "name": "orders",
          "database": "analytics",
          "schema": "public",
          "description": "Orders table"
        },
        "fields": [
          {"name": "id", "type": "INTEGER"},
          {"name": "amount", "type": "DECIMAL(10,2)"}
        ],
        "volume": {
          "row_count": 1000000,
          "byte_count": 111111111
        },
        "freshness": {
          "last_update_time": "2026-03-12T14:30:00Z"
        }
      }
    ]
  }'
```

`volume` and `freshness` are optional — you can push schema-only metadata.

---

## Table lineage example

`POST /ingest/v1/lineage` with `event_type: "LINEAGE"`

```bash
curl -X POST "https://integrations.getmontecarlo.com/ingest/v1/lineage" \
  -H "Content-Type: application/json" \
  -H "x-mcd-id: <integration-key-id>" \
  -H "x-mcd-token: <integration-key-secret>" \
  -d '{
    "event_type": "LINEAGE",
    "resource": {
      "uuid": "<warehouse-uuid>",
      "resource_type": "snowflake"
    },
    "events": [
      {
        "source": {
          "name": "orders_raw",
          "database": "analytics",
          "schema": "public"
        },
        "destination": {
          "name": "orders_curated",
          "database": "analytics",
          "schema": "public"
        }
      }
    ]
  }'
```

---

## Column lineage example

`POST /ingest/v1/lineage` with `event_type: "COLUMN_LINEAGE"`

Same endpoint as table lineage. Column lineage automatically creates the parent table-level
edge too.

```bash
curl -X POST "https://integrations.getmontecarlo.com/ingest/v1/lineage" \
  -H "Content-Type: application/json" \
  -H "x-mcd-id: <integration-key-id>" \
  -H "x-mcd-token: <integration-key-secret>" \
  -d '{
    "event_type": "COLUMN_LINEAGE",
    "resource": {
      "uuid": "<warehouse-uuid>",
      "resource_type": "snowflake"
    },
    "events": [
      {
        "source": {
          "name": "customers",
          "database": "analytics",
          "schema": "public"
        },
        "destination": {
          "name": "customer_orders",
          "database": "analytics",
          "schema": "public"
        },
        "col_mappings": [
          {
            "destination_col": "customer_id",
            "source_cols": ["customer_id"]
          },
          {
            "destination_col": "full_name",
            "source_cols": ["first_name", "last_name"]
          }
        ]
      }
    ]
  }'
```

---

## Query log example

`POST /ingest/v1/querylogs`

**Important**: this endpoint uses `log_type` instead of `resource_type` in the resource object.
This is the only endpoint where the field name differs.

```bash
curl -X POST "https://integrations.getmontecarlo.com/ingest/v1/querylogs" \
  -H "Content-Type: application/json" \
  -H "x-mcd-id: <integration-key-id>" \
  -H "x-mcd-token: <integration-key-secret>" \
  -d '{
    "event_type": "QUERY_LOG",
    "resource": {
      "uuid": "<warehouse-uuid>",
      "log_type": "snowflake"
    },
    "events": [
      {
        "start_time": "2026-03-02T12:00:00Z",
        "end_time": "2026-03-02T12:00:05Z",
        "query_text": "SELECT * FROM analytics.public.orders",
        "query_id": "query-123",
        "user": "analyst@company.com",
        "returned_rows": 10
      }
    ]
  }'
```

Supported `log_type` values: `snowflake`, `bigquery`, `databricks`, `redshift`, `hive-s3`,
`athena`, `teradata`, `clickhouse`, `databricks-metastore-sql-warehouse`, `s3`, `presto-s3`.

---

## Batching

The compressed request body must not exceed **1MB** (Kinesis limit). For large payloads, split
events into multiple requests. Each request returns its own `invocation_id`.
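
One way to respect the limit is to chunk events by serialized size before sending. A rough sketch (the 900KB threshold is an assumption chosen to leave headroom for the request envelope; compression would shrink payloads further):

```python
import json

def batch_events(events, max_bytes=900_000):
    """Yield lists of events whose combined JSON size stays under max_bytes."""
    batch, size = [], 0
    for event in events:
        event_size = len(json.dumps(event).encode())
        if batch and size + event_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(event)
        size += event_size
    if batch:
        yield batch

# each yielded batch goes into its own POST request (and gets its own invocation_id)
events = [{"type": "TABLE", "metadata": {"name": f"t{i}"}} for i in range(3)]
batches = list(batch_events(events, max_bytes=120))
print([len(b) for b in batches])  # → [2, 1]
```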

## Expiration summary

| Flow | Expiration |
|---|---|
| Table metadata | Never expires |
| Table lineage | Never expires |
| Column lineage | Expires after 10 days |
| Query logs | Same as pulled query logs |
@@ -0,0 +1,150 @@
# Prerequisites

## Two keys, two purposes

Push ingestion requires **two separate Monte Carlo API keys** — one for pushing data, one
for reading/verifying it. They use identical header names but different endpoints.

| Key | Purpose | Endpoint |
|---|---|---|
| **Ingestion key** (scope=`Ingestion`) | Push metadata, lineage, query logs | `https://integrations.getmontecarlo.com` |
| **GraphQL API key** | Verify pushed data, run management mutations | `https://api.getmontecarlo.com/graphql` |

Both authenticate with:
```
x-mcd-id: <key-id>
x-mcd-token: <key-secret>
```

The secret for both is shown **only once** at creation time — store it securely immediately.

---

## Create the Ingestion key (for pushing)

Use the Monte Carlo CLI:

```bash
montecarlo integrations create-key \
  --scope Ingestion \
  --description "Push ingestion key"
```

Output:
```
Key id: <id>
Key secret: <secret>  ← only shown once
```

Install the CLI if needed:
```bash
pip install montecarlodata
montecarlo configure  # enter your API key when prompted
```

**Optional — restrict to a specific warehouse:**
If you want the key to only work for one warehouse UUID, use the GraphQL mutation instead:

```graphql
mutation {
  createIntegrationKey(
    description: "Push key for warehouse XYZ"
    scope: Ingestion
    warehouseIds: ["<warehouse-uuid>"]
  ) {
    key { id secret }
  }
}
```

---

## Create the GraphQL API key (for verification)

1. Go to **https://getmontecarlo.com/settings/api**
2. Click **Add**
3. Choose key type (personal or account-level — account-level requires Account Owner role)
4. Copy the **Key ID** and **Secret** immediately

The GraphQL endpoint is: `https://api.getmontecarlo.com/graphql`

Test it:
```bash
curl -s -X POST https://api.getmontecarlo.com/graphql \
  -H "x-mcd-id: <id>" \
  -H "x-mcd-token: <secret>" \
  -H "Content-Type: application/json" \
  -d '{"query": "{ getUser { email } }"}' | python3 -m json.tool
```

---

## Find your warehouse (resource) UUID

The Ingestion key needs to reference the correct MC resource UUID. To find it:

```graphql
query {
  getUser {
    account {
      warehouses {
        uuid
        name
        connectionType
      }
    }
  }
}
```

Or in the MC UI: **Settings → Integrations** → click the warehouse → copy the UUID from the URL.

---

## Install pycarlo (optional)

The pycarlo SDK simplifies push calls, but is not required. You can also call the push APIs
directly via HTTP/curl — see `references/direct-http-api.md`.

```bash
pip install pycarlo
```

Initialize the ingestion client in your script:

```python
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService

client = Client(session=Session(
    mcd_id="<ingestion-key-id>",
    mcd_token="<ingestion-key-secret>",
    scope="Ingestion",
))
service = IngestionService(mc_client=client)
```

Load credentials from environment variables (recommended):

```python
import os
service = IngestionService(mc_client=Client(session=Session(
    mcd_id=os.environ["MCD_INGEST_ID"],
    mcd_token=os.environ["MCD_INGEST_TOKEN"],
    scope="Ingestion",
)))
```

---

## Environment variable conventions

The script templates use these env var names by default:

| Variable | Key type | Used by |
|---|---|---|
| `MCD_INGEST_ID` | Ingestion key ID | push and collect_and_push scripts |
| `MCD_INGEST_TOKEN` | Ingestion key secret | push and collect_and_push scripts |
| `MCD_ID` | GraphQL API key ID | verification scripts, slash commands |
| `MCD_TOKEN` | GraphQL API key secret | verification scripts, slash commands |
| `MCD_RESOURCE_UUID` | Warehouse UUID | all scripts |
@@ -0,0 +1,160 @@
# Pushing Table and Column Lineage

## Overview

Both table-level and column-level lineage use the same endpoint: `POST /ingest/v1/lineage`.
The `event_type` field distinguishes them:
- `LINEAGE` — table-level: source table → destination table
- `COLUMN_LINEAGE` — column-level: source table.column → destination table.column
  (also automatically creates the parent table-level edge)

Push lineage is **typically visible in the MC lineage graph within seconds to a few minutes**
via the fast direct path (PushLineageProcessor → S3 CSVs → neo4jLineageLoaderPrivate → Neo4j).

**Expiration**:
- Pushed **table lineage does not expire** (`expire_at = 9999-12-31`).
- Pushed **column lineage expires after 10 days** (same as pulled column lineage).

**Batching**: For large numbers of lineage events, split into batches. The compressed request
body must not exceed **1MB** (Kinesis limit).

## pycarlo models

```python
from pycarlo.features.ingestion import (
    IngestionService,
    LineageEvent,
    LineageAssetRef,
    ColumnLineageField,
    ColumnLineageSourceField,
)
```

## Table lineage example

```python
event = LineageEvent(
    destination=LineageAssetRef(
        database="analytics",
        schema="public",
        table="customer_orders",
    ),
    sources=[
        LineageAssetRef(database="analytics", schema="public", table="customers"),
        LineageAssetRef(database="analytics", schema="public", table="orders"),
    ],
)

result = service.send_lineage(
    resource_uuid="<your-resource-uuid>",
    resource_type="data-lake",
    events=[event],
)
invocation_id = service.extract_invocation_id(result)
print("invocation_id:", invocation_id)
```

## Column lineage example

```python
event = LineageEvent(
    destination=LineageAssetRef(
        database="analytics",
        schema="public",
        table="customer_orders",
    ),
    sources=[
        LineageAssetRef(database="analytics", schema="public", table="customers"),
        LineageAssetRef(database="analytics", schema="public", table="orders"),
    ],
    # column mappings: dest_col ← src_table.src_col
    fields=[
        ColumnLineageField(
            destination_field="customer_id",
            source_fields=[
                ColumnLineageSourceField(
                    database="analytics", schema="public",
                    table="customers", field="customer_id",
                )
            ],
        ),
        ColumnLineageField(
            destination_field="order_amount",
            source_fields=[
                ColumnLineageSourceField(
                    database="analytics", schema="public",
                    table="orders", field="amount",
                )
            ],
        ),
    ],
)

result = service.send_lineage(
    resource_uuid=resource_uuid,
    resource_type="data-lake",
    events=[event],
)
```

Column lineage push automatically creates a table-level edge too, so you don't need to
send separate table and column lineage events for the same relationship.

## Extracting lineage from SQL logs

For warehouses that don't expose a native lineage table, extract lineage by parsing query
history SQL for `CREATE TABLE AS SELECT`, `INSERT INTO ... SELECT`, and `MERGE INTO` patterns.

Simplified example regex:
```python
import re

CTAS_PATTERN = re.compile(
    r"CREATE\s+(?:OR\s+REPLACE\s+)?TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\S+)\s+AS\s+SELECT",
    re.IGNORECASE,
)
INSERT_PATTERN = re.compile(
    r"INSERT\s+(?:OVERWRITE\s+)?(?:INTO\s+)?(\S+).*?FROM\s+(\S+)",
    re.IGNORECASE | re.DOTALL,
)
```

For Snowflake, BigQuery, and Redshift the query history tables provide this SQL.
For Databricks, use `system.access.table_lineage` directly (no parsing needed).
For Hive, parse the HiveServer2 log file.
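
As a sketch of how such patterns turn query history into lineage edges, the CTAS regex from above can be paired with a crude source extractor (the `SOURCE_PATTERN` and example SQL here are illustrative, not part of the skill's scripts, and a real parser would need to handle aliases, CTEs, and quoting):

```python
import re

CTAS_PATTERN = re.compile(
    r"CREATE\s+(?:OR\s+REPLACE\s+)?TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\S+)\s+AS\s+SELECT",
    re.IGNORECASE,
)
# very rough source extraction: every table name following FROM or JOIN
SOURCE_PATTERN = re.compile(r"(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

sql = "CREATE TABLE analytics.public.orders_curated AS SELECT * FROM analytics.public.orders_raw"
match = CTAS_PATTERN.search(sql)
if match:
    destination = match.group(1)
    sources = SOURCE_PATTERN.findall(sql)
    # one LineageEvent per destination, with all discovered sources
    print(destination, sources)
```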

## Output manifest (include invocation_id)

```python
manifest = {
    "resource_uuid": resource_uuid,
    "invocation_id": service.extract_invocation_id(result),  # ← save this
    "collected_at": datetime.now(tz=timezone.utc).isoformat(),
    "edges": [
        {
            "destination": {"database": e.destination.database, "table": e.destination.table},
            "sources": [{"database": s.database, "table": s.table} for s in e.sources],
        }
        for e in events
    ],
}
with open("lineage_output.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

## How push lineage is distinguished from query-derived lineage

Push-ingested lineage nodes and edges carry `origin = push_ingest` in Neo4j and
`origin_type = DIRECT_LINEAGE` in the normalized lineage model. This prevents the lineage
DAG from overwriting them with query-log-derived edges and gives MC a clear audit trail.

## Neo4j node expiry

Push-ingested **table lineage** nodes and edges are written with `expire_at = 9999-12-31`
(never expire). This is handled internally by PushLineageProcessor — you do not need to set
this manually when using `send_lineage()`.

Push-ingested **column lineage** expires after **10 days**, same as pulled column lineage.

For custom nodes created via GraphQL mutations, you **do** need to set
`expireAt: "9999-12-31"` explicitly — see `references/custom-lineage.md`.
@@ -0,0 +1,158 @@
# Pushing Table Metadata

## Overview

Metadata push sends three types of signals per table:
- **Schema** — column names and types
- **Volume** — row count and byte count
- **Freshness** — last update timestamp

All three travel together in a single `RelationalAsset` object via `POST /ingest/v1/metadata`.

**Expiration**: Pushed table metadata **does not expire**. Once pushed, it remains in Monte
Carlo until explicitly deleted via `deletePushIngestedTables`.

**Batching**: For large numbers of tables, split assets into batches. The compressed request
body must not exceed **1MB** (Kinesis limit).

## pycarlo models

```python
from pycarlo.features.ingestion import (
    IngestionService,
    RelationalAsset,
    AssetMetadata,
    AssetField,
    AssetVolume,
    AssetFreshness,
)
```

## Minimal example

```python
asset = RelationalAsset(
    type="TABLE",  # ONLY "TABLE" or "VIEW" — normalize warehouse-native values
    metadata=AssetMetadata(
        name="orders",
        database="analytics",
        schema="public",
        description="Order transactions",
    ),
    fields=[
        AssetField(name="order_id", type="INTEGER"),
        AssetField(name="amount", type="DECIMAL"),
        AssetField(name="created_at", type="TIMESTAMP"),
    ],
    volume=AssetVolume(
        row_count=1_500_000,
        byte_count=250_000_000,
    ),
    freshness=AssetFreshness(
        last_update_time="2024-03-01T12:00:00Z",  # ISO 8601 string, NOT a datetime object
    ),
)

result = service.send_metadata(
    resource_uuid="<your-resource-uuid>",
    resource_type="data-lake",  # see note below on resource_type
    events=[asset],
)
invocation_id = service.extract_invocation_id(result)
print("invocation_id:", invocation_id)  # save this!
```

## resource_type

The `resource_type` value must match the type of the MC resource (warehouse connection) you
are pushing to. Use the same string that appears in the MC UI or the `connectionType` field
from `getUser { account { warehouses { connectionType } } }`.

Common values:
- `"data-lake"` — Hive, EMR, Glue, generic data lake connections
- `"snowflake"` — Snowflake
- `"bigquery"` — BigQuery
- `"databricks"` — Databricks Unity Catalog
- `"redshift"` — Redshift

## Asset type

The `type` parameter on `RelationalAsset` must be one of two values (uppercase):
- `"TABLE"` — tables, external tables, dynamic tables, materialized views, etc.
- `"VIEW"` — views, secure views

**Important**: Warehouse-native type values like `"BASE TABLE"` (Snowflake), `"MANAGED"` /
`"EXTERNAL"` (Databricks), or `"MATERIALIZED_VIEW"` (BigQuery) are **NOT accepted** by the
MC API and will cause a 400 error. Always normalize to `"TABLE"` or `"VIEW"` before pushing.

## Field types

Normalize to SQL-standard uppercase strings. Monte Carlo accepts any string but canonical
values like `INTEGER`, `BIGINT`, `VARCHAR`, `FLOAT`, `BOOLEAN`, `TIMESTAMP`, `DATE`,
`DECIMAL`, `ARRAY`, `STRUCT` work best with downstream features.

## Volume and freshness are optional

If your warehouse doesn't expose row counts or last-modified timestamps, omit `volume`
and/or `freshness` — schema-only metadata is valid.

If you send `freshness`, each push must carry a **changed** `last_update_time` to count as
a new data point for the anomaly detector (repeated identical timestamps don't advance the
training clock).

## Freshness + volume only mode (skip schema)

For periodic pushes (e.g. hourly cron), you often don't need to re-collect the full schema
on every run — field definitions rarely change. Collection scripts can support a
`--only-freshness-and-volume` flag that skips the `COLUMNS` / `INFORMATION_SCHEMA` query
and omits `fields` from the manifest. This is significantly faster on warehouses with many
tables. Use the full collection (with fields) on the first push and on a daily schedule,
and the freshness+volume only mode for hourly pushes in between. See the
[BigQuery Iceberg example](https://github.com/monte-carlo-data/mcd-public-resources/tree/main/examples/push-ingestion/bigquery/push-iceberg-tables)
for a working implementation of this pattern.

## Batch multiple tables

`events` accepts a list. Push all tables in a single call or in batches:

```python
result = service.send_metadata(
    resource_uuid=resource_uuid,
    resource_type="data-lake",
    events=[asset1, asset2, asset3, ...],
)
```

## Output manifest (include invocation_id)

Always write a local manifest so you can trace issues later:

```python
import json
from datetime import datetime, timezone

manifest = {
    "resource_uuid": resource_uuid,
    "invocation_id": service.extract_invocation_id(result),  # ← critical for debugging
    "collected_at": datetime.now(tz=timezone.utc).isoformat(),
    "assets": [
        {
            "database": a.metadata.database,
            "schema": a.metadata.schema,
            "table": a.metadata.name,
            "row_count": a.volume.row_count if a.volume else None,
            "fields": [{"name": f.name, "type": f.type} for f in a.fields],
        }
        for a in assets
    ],
}
with open("metadata_output.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

## Push frequency for anomaly detection

To keep volume and freshness anomaly detectors active:
- Push **at most once per hour** (pushing more frequently produces unpredictable behavior)
- Push **consistently** — gaps longer than a few days will deactivate detectors
- See `references/anomaly-detection.md` for minimum sample requirements
@@ -0,0 +1,219 @@
# Pushing Query Logs

## Overview

Query logs let Monte Carlo build table usage history, populate query lineage, and surface
query-level insights in the catalog. Push them via `POST /ingest/v1/querylogs`.

**Important timing note**: MC processes pushed query logs asynchronously. Logs pushed now
may not be visible in `getAggregatedQueries` for **at least 15-20 minutes**. This is expected
behavior, not a bug.

**Expiration**: Pushed query logs expire on the same schedule as pulled query logs.

**Batching**: For large query log sets, split events into batches. The compressed request body
must not exceed **1MB** (Kinesis limit). A conservative default is 250 entries per batch.
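
Splitting by a fixed entry count is a one-line slicer. A sketch using the 250-entry default mentioned above:

```python
def chunk_entries(entries, size=250):
    """Split query log entries into fixed-size batches for separate pushes."""
    return [entries[i:i + size] for i in range(0, len(entries), size)]

# e.g. 600 entries become batches of 250, 250, and 100
batches = chunk_entries(list(range(600)), size=250)
print([len(b) for b in batches])  # → [250, 250, 100]
```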

## pycarlo model

```python
from pycarlo.features.ingestion import IngestionService, QueryLogEntry
```

`QueryLogEntry` required fields:
- `start_time` (`datetime`) — when the query started
- `end_time` (`datetime`) — when the query finished (**required**, easy to miss)
- `query_text` (`str`) — the SQL statement

Optional fields:
- `query_id` (`str`) — warehouse-assigned query ID
- `user` (`str`) — user/email who ran the query
- `returned_rows` (`int`) — rows returned to the client
- `default_database` (`str`) — default database context

## Basic example

```python
from datetime import datetime, timezone

entries = [
    QueryLogEntry(
        start_time=datetime(2024, 3, 1, 10, 0, 0, tzinfo=timezone.utc),
        end_time=datetime(2024, 3, 1, 10, 0, 5, tzinfo=timezone.utc),
        query_text="SELECT * FROM analytics.public.orders WHERE status = 'pending'",
        query_id="query-abc-123",
        user="analyst@company.com",
        returned_rows=847,
    ),
]

result = service.send_query_logs(
    resource_uuid="<your-resource-uuid>",
    log_type="snowflake",  # ← warehouse-specific! see table below
    entries=entries,
)
invocation_id = service.extract_invocation_id(result)
print("invocation_id:", invocation_id)
```

## log_type per warehouse

**Important**: the query-log endpoint uses `log_type`, not `resource_type`. This is the only
push endpoint where the field name differs from metadata/lineage. The `log_type` value must
match what the MC normalizer expects for your warehouse. Using the wrong value causes:
`ValueError: Unsupported ingest query-log log_type: <value>`

| Warehouse | log_type |
|---|---|
| Snowflake | `"snowflake"` |
| BigQuery | `"bigquery"` |
| Databricks | `"databricks"` |
| Redshift | `"redshift"` |
| Hive (EMR/S3) | `"hive-s3"` |
| Athena | `"athena"` |
| Teradata | `"teradata"` |
| ClickHouse | `"clickhouse"` |
| Databricks (SQL Warehouse) | `"databricks-metastore-sql-warehouse"` |
| S3 | `"s3"` |
| Presto (S3) | `"presto-s3"` |

## Warehouse-specific fields

Some warehouses support extra fields beyond the base `QueryLogEntry`. Pass them as keyword
arguments — the normalizer knows which fields are valid per warehouse.

**Snowflake extras:**
```python
QueryLogEntry(
    ...
    bytes_scanned=1024000,
    warehouse_name="COMPUTE_WH",
    warehouse_size="X-Small",
    role_name="ANALYST",
    query_tag="reporting",
    execution_status="SUCCESS",
)
```

**BigQuery extras:**
```python
QueryLogEntry(
    ...
    total_bytes_billed=10485760,
    statement_type="SELECT",
    job_type="QUERY",
    default_dataset="analytics.public",
)
```

**Athena extras:**
```python
QueryLogEntry(
    ...
    bytes_scanned=2048000,
    catalog="AwsDataCatalog",
    database="analytics",
    output_location="s3://my-bucket/results/",
    state="SUCCEEDED",
)
```

## Collecting query logs per warehouse

### Snowflake
```sql
SELECT
  query_id,
  query_text,
  start_time,
  end_time,
  user_name,
  database_name,
  warehouse_name,
  bytes_scanned,
  rows_produced AS returned_rows,
  execution_status
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(hour, -24, CURRENT_TIMESTAMP())
  AND execution_status = 'SUCCESS'
ORDER BY start_time
```

Note: `ACCOUNT_USAGE` views have up to 45 minutes of latency. Don't collect the last hour.
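
A collector can bake that latency in by ending its window an hour before now. A small sketch (the one-hour buffer and function name are conventions of this example, not of the SDK):

```python
from datetime import datetime, timedelta, timezone

def collection_window(hours=24, latency_buffer_hours=1):
    """Return a (start, end) UTC window that avoids the ACCOUNT_USAGE latency tail."""
    end = datetime.now(tz=timezone.utc) - timedelta(hours=latency_buffer_hours)
    start = end - timedelta(hours=hours)
    return start, end

start, end = collection_window()
print(end - start)  # → 1 day, 0:00:00 (a 24-hour window ending one hour ago)
```

The returned bounds slot into the `WHERE start_time >= ...` predicate (or `min_creation_time` / `max_creation_time` parameters on other warehouses).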

### BigQuery
```python
from google.cloud import bigquery

client = bigquery.Client(project=project_id)
jobs = client.list_jobs(all_users=True, min_creation_time=start_dt, max_creation_time=end_dt)
for job in jobs:
    if hasattr(job, 'query') and job.query:
        pass  # use job.job_id, job.query, job.created, job.ended, job.user_email
```

### Databricks
```sql
SELECT
  statement_id AS query_id,
  statement_text AS query_text,
  start_time,
  end_time,
  executed_by AS user,
  produced_rows AS returned_rows
FROM system.query.history
WHERE start_time >= DATEADD(HOUR, -24, NOW())
  AND status = 'FINISHED'
```

### Redshift (modern clusters)
```sql
SELECT
  query_id,
  query_text,  -- may need text assembly from SYS_QUERYTEXT for long queries
  start_time,
  end_time,
  user_id,
  status
FROM sys_query_history
WHERE start_time >= DATEADD(hour, -24, GETDATE())
  AND status = 'success'
```

For long queries (text > 4000 chars), assemble from `SYS_QUERYTEXT`:
```sql
SELECT query_id, LISTAGG(text, '') WITHIN GROUP (ORDER BY sequence) AS full_text
FROM sys_querytext
WHERE query_id = <id>
GROUP BY query_id
```

### Hive
Parse the HiveServer2 log file (default: `/tmp/root/hive.log`) for lines matching:
```
(Executing|Starting) command\(queryId=(\S*)\): (?P<command>.*)
```

## Output manifest (include invocation_id)

```python
manifest = {
    "resource_uuid": resource_uuid,
    "invocation_id": service.extract_invocation_id(result),  # ← save this
    "collected_at": datetime.now(tz=timezone.utc).isoformat(),
    "entry_count": len(entries),
    "window_start": min(e.start_time for e in entries).isoformat(),
    "window_end": max(e.end_time for e in entries).isoformat(),
    "queries": [
        {
            "query_id": e.query_id,
            "start_time": e.start_time.isoformat(),
            "end_time": e.end_time.isoformat(),
            "returned_rows": e.returned_rows,
            "query": e.query_text[:200],  # truncate for readability
        }
        for e in entries
    ],
}
with open("query_logs_output.json", "w") as f:
    json.dump(manifest, f, indent=2)
```
@@ -0,0 +1,257 @@
# Validating Pushed Data

All verification queries use the **GraphQL API key** at `https://api.getmontecarlo.com/graphql`.

---

## Resolve a table's MCON and fullTableId

Before running most queries you need either the `mcon` or `fullTableId`.

`fullTableId` format: `<database>:<schema>.<table>` — e.g. `analytics:public.orders`

```graphql
query GetTable($fullTableId: String!, $dwId: UUID!) {
  getTable(fullTableId: $fullTableId, dwId: $dwId) {
    mcon
    fullTableId
    displayName
  }
}
```

Variables:
```json
{
  "fullTableId": "analytics:public.orders",
  "dwId": "<warehouse-uuid>"
}
```

---

## Verify metadata (schema + columns)

```graphql
query GetTableMetadata($mcon: String!) {
  getTable(mcon: $mcon) {
    mcon
    fullTableId
    versions {
      edges {
        node {
          fields {
            name
            fieldType
          }
        }
      }
    }
  }
}
```

Check that the fields list matches your pushed schema.

---

## Verify volume and freshness metrics

Use `getMetricsV4` to fetch row counts and last-modified timestamps:

```graphql
query GetMetrics(
  $mcon: String!
  $metricName: String!
  $startTime: DateTime!
  $endTime: DateTime!
) {
  getMetricsV4(
    dwId: null
    mcon: $mcon
    metricName: $metricName
    startTime: $startTime
    endTime: $endTime
  ) {
    metricsJson
  }
}
```

Variables (row count):
```json
{
  "mcon": "<table-mcon>",
  "metricName": "total_row_count",
  "startTime": "2024-03-01T00:00:00Z",
  "endTime": "2024-03-02T00:00:00Z"
}
```

`metricsJson` is a JSON string. Parse it and look for `value` and `measurementTimestamp`
(camelCase) in each data point.
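
Parsing it is a single `json.loads` on the string field. A sketch against a mocked response (the nesting under `data.getMetricsV4` and the data-point values here are illustrative):

```python
import json

# mocked response — a real one comes back from the GraphQL POST
response = {"data": {"getMetricsV4": {"metricsJson": json.dumps(
    [{"value": 1500000, "measurementTimestamp": "2024-03-01T12:00:00Z"}]
)}}}

# metricsJson is a string, so it needs a second decode step
points = json.loads(response["data"]["getMetricsV4"]["metricsJson"])
for p in points:
    print(p["value"], p["measurementTimestamp"])  # → 1500000 2024-03-01T12:00:00Z
```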

Other useful metric names:
- `"total_row_count"` — row count
- `"total_byte_count"` — byte size
- `"total_row_count_last_changed_on"` — Unix epoch float of when the row count last changed

---

## Verify table lineage

```graphql
query GetTableLineage($mcon: String!) {
  getTableLineage(mcon: $mcon, direction: "upstream", hops: 1) {
    connectedNodes {
      mcon
      displayName
      objectType
    }
    flattenedEdges {
      directlyConnectedMcons
    }
  }
}
```

Check that your expected source tables appear in `connectedNodes` or
`flattenedEdges[].directlyConnectedMcons`.

---

## Verify column lineage

```graphql
query GetColumnLineage($mcon: String!, $column: String!) {
  getDerivedTablesPartialLineage(mcon: $mcon, column: $column, pageSize: 1000) {
    destinations {
      table { mcon displayName }
      columns { columnName }
    }
  }
}
```

Variables: `mcon` = source table MCON, `column` = source column name.

Check that each destination table and column appears in the response.

---

## Verify query logs

```graphql
query GetAggregatedQueries(
  $mcon: String!
  $queryType: String!
  $startTime: DateTime!
  $endTime: DateTime!
  $first: Int
  $after: String
) {
  getAggregatedQueries(
    mcon: $mcon
    queryType: $queryType
    startTime: $startTime
    endTime: $endTime
    first: $first
    after: $after
  ) {
    edges { node { queryHash queryCount lastSeen } }
    pageInfo { hasNextPage endCursor }
  }
}
```

Variables:
```json
{
  "mcon": "<table-mcon>",
  "queryType": "read",
  "startTime": "2024-03-01T00:00:00Z",
  "endTime": "2024-03-02T00:00:00Z",
  "first": 100
}
```

**Remember**: query logs take up to 1 hour to process after push. If you see 0 results
immediately after pushing, wait and try again.
|
||||
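The cursor pagination implied by `pageInfo`/`endCursor` can be sketched as follows. `fetch_page` is a hypothetical stand-in for a function that runs the query above with a given `after` cursor and returns the `getAggregatedQueries` payload; the stub pages stand in for API responses:

```python
def total_query_count(fetch_page) -> int:
    """Sum queryCount across all pages, following endCursor until exhausted."""
    total, cursor = 0, None
    while True:
        page = fetch_page(cursor)  # cursor is passed as the `after` variable
        total += sum(e["node"]["queryCount"] for e in page["edges"])
        info = page["pageInfo"]
        if not info["hasNextPage"]:
            return total
        cursor = info["endCursor"]

# Two stub pages standing in for API responses:
_pages = {
    None: {"edges": [{"node": {"queryCount": 3}}],
           "pageInfo": {"hasNextPage": True, "endCursor": "c1"}},
    "c1": {"edges": [{"node": {"queryCount": 4}}],
           "pageInfo": {"hasNextPage": False, "endCursor": None}},
}
total = total_query_count(lambda after: _pages[after])  # 3 + 4 = 7
```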

---

## Check detector thresholds (anomaly detection status)

```graphql
query GetDetectorStatus($mcon: String!) {
  getTable(mcon: $mcon) {
    thresholds {
      freshness {
        lower { value }
        upper { value }
        status
      }
      size {
        lower { value }
        upper { value }
        status
      }
    }
  }
}
```

`status` will be `"no data"` or `"inactive"` on a newly-pushed table. Detectors need
historical data to train — see `references/anomaly-detection.md` for requirements.

---

## Table management operations

### Delete push-ingested tables

Only works on push-ingested tables — pull-collected tables are excluded by default.

```graphql
mutation DeletePushTables($mcons: [String!]!) {
  deletePushIngestedTables(mcons: $mcons) {
    success
    deletedCount
  }
}
```

Variables:
```json
{
  "mcons": ["<mcon-1>", "<mcon-2>"]
}
```

Resolve MCONs first with `getTable(fullTableId: ..., dwId: ...)`.
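A minimal sketch of the two-step flow (resolve the MCON, then delete), using only the standard library. The `_post` helper is illustrative and mirrors the header scheme used elsewhere in this skill; it is not the official client:

```python
import json
import urllib.request

GRAPHQL_URL = "https://api.getmontecarlo.com/graphql"

RESOLVE_QUERY = """query GetTable($fullTableId: String!, $dwId: UUID!) {
  getTable(fullTableId: $fullTableId, dwId: $dwId) { mcon }
}"""

DELETE_MUTATION = """mutation DeletePushTables($mcons: [String!]!) {
  deletePushIngestedTables(mcons: $mcons) { success deletedCount }
}"""


def _post(query: str, variables: dict, key_id: str, key_token: str) -> dict:
    """POST one GraphQL operation and return its `data` payload."""
    req = urllib.request.Request(
        GRAPHQL_URL,
        data=json.dumps({"query": query, "variables": variables}).encode(),
        headers={"x-mcd-id": key_id, "x-mcd-token": key_token,
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["data"]


def delete_push_table(full_table_id: str, dw_id: str,
                      key_id: str, key_token: str) -> int:
    """Resolve a fullTableId to its MCON, delete it, return deletedCount."""
    mcon = _post(RESOLVE_QUERY, {"fullTableId": full_table_id, "dwId": dw_id},
                 key_id, key_token)["getTable"]["mcon"]
    result = _post(DELETE_MUTATION, {"mcons": [mcon]}, key_id, key_token)
    return result["deletePushIngestedTables"]["deletedCount"]
```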

---

## Python helper

```python
import json

import requests

GRAPHQL_URL = "https://api.getmontecarlo.com/graphql"


def graphql(query: str, variables: dict, key_id: str, key_token: str) -> dict:
    resp = requests.post(
        GRAPHQL_URL,
        json={"query": query, "variables": variables},
        headers={
            "x-mcd-id": key_id,
            "x-mcd-token": key_token,
            "Content-Type": "application/json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    if "errors" in data:
        raise RuntimeError(json.dumps(data["errors"], indent=2))
    return data["data"]
```
@@ -0,0 +1,357 @@
#!/usr/bin/env python3
"""
Monte Carlo Push Ingestion — Verification Helper

Queries the Monte Carlo GraphQL API to verify that pushed metadata, lineage, and
query logs are visible in the platform.

Prerequisites:
    pip install requests

Set environment variables:
    MCD_ID            — GraphQL API key ID (from getmontecarlo.com/settings/api)
    MCD_TOKEN         — GraphQL API key secret
    MCD_RESOURCE_UUID — Your MC warehouse/resource UUID

Usage:
    python sample_verify.py \
        --full-table-id "analytics:public.orders" \
        --check-schema \
        --check-metrics \
        --check-detectors \
        --check-lineage \
        --expected-sources "analytics:public.customers" "analytics:public.raw_orders"
"""

import argparse
import json
import os
import sys
from datetime import datetime, timedelta, timezone

import requests

GRAPHQL_URL = "https://api.getmontecarlo.com/graphql"


def graphql(query: str, variables: dict, key_id: str, key_token: str) -> dict:
    """Execute a GraphQL query/mutation and return the data payload."""
    resp = requests.post(
        GRAPHQL_URL,
        json={"query": query, "variables": variables},
        headers={
            "x-mcd-id": key_id,
            "x-mcd-token": key_token,
            "Content-Type": "application/json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    body = resp.json()
    if "errors" in body:
        raise RuntimeError(json.dumps(body["errors"], indent=2))
    return body["data"]


# ---------------------------------------------------------------------------
# Step 1: Resolve MCON from fullTableId
# ---------------------------------------------------------------------------

def get_table_mcon(full_table_id: str, dw_id: str, key_id: str, key_token: str) -> str:
    """Resolve a fullTableId + warehouse UUID to an MCON."""
    data = graphql(
        """query GetTable($fullTableId: String!, $dwId: UUID!) {
            getTable(fullTableId: $fullTableId, dwId: $dwId) {
                mcon fullTableId displayName
            }
        }""",
        {"fullTableId": full_table_id, "dwId": dw_id},
        key_id, key_token,
    )
    table = data.get("getTable")
    if not table:
        raise ValueError(f"Table not found: {full_table_id} in resource {dw_id}")
    print(f"  Resolved: {table['fullTableId']} → MCON: {table['mcon']}")
    return table["mcon"]


# ---------------------------------------------------------------------------
# Step 2: Verify schema (columns)
# ---------------------------------------------------------------------------

def verify_schema(mcon: str, expected_fields: list[str], key_id: str, key_token: str) -> bool:
    """Check that the table's column names match expected_fields."""
    data = graphql(
        """query GetSchema($mcon: String!) {
            getTable(mcon: $mcon) {
                versions {
                    edges {
                        node {
                            fields { name fieldType }
                        }
                    }
                }
            }
        }""",
        {"mcon": mcon},
        key_id, key_token,
    )
    edges = (data.get("getTable") or {}).get("versions", {}).get("edges", [])
    if not edges:
        print("  WARN: no schema versions found")
        return False
    fields = edges[0]["node"]["fields"]
    got_names = {f["name"].lower() for f in fields}
    print(f"  Schema: {len(fields)} column(s) — {', '.join(f['name'] for f in fields[:8])}{'...' if len(fields) > 8 else ''}")
    if expected_fields:
        missing = [e for e in expected_fields if e.lower() not in got_names]
        if missing:
            print(f"  FAIL: missing columns: {missing}")
            return False
        print("  PASS: all expected columns present")
    return True


# ---------------------------------------------------------------------------
# Step 3: Verify volume/freshness metrics
# ---------------------------------------------------------------------------

def verify_metrics(mcon: str, key_id: str, key_token: str) -> None:
    """Fetch and display the latest row_count and freshness metrics."""
    end = datetime.now(tz=timezone.utc)
    start = end - timedelta(days=7)
    for metric_name in ("total_row_count", "total_row_count_last_changed_on"):
        data = graphql(
            """query GetMetrics($mcon: String!, $metricName: String!, $start: DateTime!, $end: DateTime!) {
                getMetricsV4(dwId: null, mcon: $mcon, metricName: $metricName,
                             startTime: $start, endTime: $end) {
                    metricsJson
                }
            }""",
            {"mcon": mcon, "metricName": metric_name,
             "start": start.isoformat(), "end": end.isoformat()},
            key_id, key_token,
        )
        metrics_json = (data.get("getMetricsV4") or {}).get("metricsJson")
        if not metrics_json:
            print(f"  {metric_name}: no data")
            continue
        points = json.loads(metrics_json)
        if not points:
            print(f"  {metric_name}: no data points")
            continue
        latest = max(points, key=lambda p: p.get("measurementTimestamp") or "")
        val = latest.get("value")
        ts = latest.get("measurementTimestamp")
        if metric_name == "total_row_count_last_changed_on" and val:
            ts_fmt = datetime.fromtimestamp(float(val), tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
            print(f"  {metric_name}: {ts_fmt}")
        else:
            print(f"  {metric_name}: {val} (at {ts})")


# ---------------------------------------------------------------------------
# Step 3b: Verify detector status (freshness + volume)
# ---------------------------------------------------------------------------

def verify_detectors(mcon: str, key_id: str, key_token: str) -> None:
    """Check the status of freshness and volume anomaly detectors."""
    data = graphql(
        """query GetDetectors($mcon: String!) {
            getTable(mcon: $mcon) {
                thresholds {
                    freshness { status }
                    size { status }
                }
            }
        }""",
        {"mcon": mcon},
        key_id, key_token,
    )
    thresholds = (data.get("getTable") or {}).get("thresholds") or {}
    freshness = thresholds.get("freshness") or {}
    size = thresholds.get("size") or {}
    freshness_status = freshness.get("status", "not available")
    size_status = size.get("status", "not available")
    print(f"  Freshness detector: {freshness_status}")
    print(f"  Volume detector: {size_status}")
    if freshness_status in ("no data", "training"):
        print("    ↳ Freshness needs 7+ pushes with changed last_update_time over ~2 weeks")
    if size_status in ("no data", "training"):
        print("    ↳ Volume needs 10-48 samples over ~42 days (push hourly, consistently)")


# ---------------------------------------------------------------------------
# Step 4: Verify table lineage (upstream)
# ---------------------------------------------------------------------------

def verify_table_lineage(
    mcon: str,
    expected_source_mcons: list[str],
    key_id: str,
    key_token: str,
) -> bool:
    """Check that expected source MCONs appear in the upstream lineage."""
    data = graphql(
        """query GetLineage($mcon: String!) {
            getTableLineage(mcon: $mcon, direction: "upstream", hops: 1) {
                connectedNodes { mcon displayName objectType }
                flattenedEdges { directlyConnectedMcons }
            }
        }""",
        {"mcon": mcon},
        key_id, key_token,
    )
    lineage = data.get("getTableLineage") or {}
    connected = {n["mcon"] for n in lineage.get("connectedNodes", [])}
    flat = {m for e in lineage.get("flattenedEdges", []) for m in e.get("directlyConnectedMcons", [])}
    all_found = connected | flat
    print(f"  Upstream nodes: {len(connected)}")
    if not expected_source_mcons:
        return True
    missing = [s for s in expected_source_mcons if s not in all_found]
    if missing:
        print(f"  FAIL: missing sources: {missing}")
        return False
    print("  PASS: all expected sources present")
    return True


# ---------------------------------------------------------------------------
# Step 5: Verify column lineage
# ---------------------------------------------------------------------------

def verify_column_lineage(
    source_mcon: str,
    source_column: str,
    expected_dest_mcon: str,
    expected_dest_column: str,
    key_id: str,
    key_token: str,
) -> bool:
    """Check that source_column flows to expected_dest_column on expected_dest_mcon."""
    data = graphql(
        """query GetColLineage($mcon: String!, $column: String!) {
            getDerivedTablesPartialLineage(mcon: $mcon, column: $column, pageSize: 1000) {
                destinations {
                    table { mcon displayName }
                    columns { columnName }
                }
            }
        }""",
        {"mcon": source_mcon, "column": source_column},
        key_id, key_token,
    )
    destinations = (data.get("getDerivedTablesPartialLineage") or {}).get("destinations", [])
    for dest in destinations:
        if dest["table"]["mcon"] == expected_dest_mcon:
            cols = {c["columnName"] for c in dest.get("columns", [])}
            if expected_dest_column in cols:
                print(f"  PASS: {source_column} → {dest['table']['displayName']}.{expected_dest_column}")
                return True
    print(f"  FAIL: {source_column} → {expected_dest_mcon}.{expected_dest_column} not found")
    return False


# ---------------------------------------------------------------------------
# Step 6: Verify query logs
# ---------------------------------------------------------------------------

def verify_query_logs(
    mcon: str,
    start_time: datetime,
    end_time: datetime,
    key_id: str,
    key_token: str,
) -> None:
    """Report read/write query counts for a table within the given time window."""
    for query_type in ("read", "write"):
        cursor = None
        total = 0
        while True:
            data = graphql(
                """query GetQueries($mcon: String!, $type: String!, $start: DateTime!, $end: DateTime!, $after: String) {
                    getAggregatedQueries(mcon: $mcon, queryType: $type,
                                         startTime: $start, endTime: $end,
                                         first: 200, after: $after) {
                        edges { node { queryHash queryCount lastSeen } }
                        pageInfo { hasNextPage endCursor }
                    }
                }""",
                {"mcon": mcon, "type": query_type,
                 "start": start_time.isoformat(), "end": end_time.isoformat(),
                 "after": cursor},
                key_id, key_token,
            )
            result = data.get("getAggregatedQueries") or {}
            total += sum(e["node"]["queryCount"] for e in result.get("edges", []))
            page = result.get("pageInfo", {})
            if not page.get("hasNextPage"):
                break
            cursor = page["endCursor"]
        print(f"  {query_type} queries: {total}")


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------

def main() -> None:
    parser = argparse.ArgumentParser(description="Verify Monte Carlo push-ingested data via GraphQL")
    parser.add_argument("--key-id", default=os.environ.get("MCD_ID"))
    parser.add_argument("--key-token", default=os.environ.get("MCD_TOKEN"))
    parser.add_argument("--resource-uuid", default=os.environ.get("MCD_RESOURCE_UUID"), required=False)
    parser.add_argument("--full-table-id", required=True, help="e.g. analytics:public.orders")
    parser.add_argument("--mcon", help="Use MCON directly instead of resolving from fullTableId")
    parser.add_argument("--check-schema", action="store_true")
    parser.add_argument("--check-metrics", action="store_true")
    parser.add_argument("--check-detectors", action="store_true", help="Check freshness/volume detector status")
    parser.add_argument("--check-lineage", action="store_true")
    parser.add_argument("--check-query-logs", action="store_true")
    parser.add_argument("--expected-fields", nargs="*", default=[])
    parser.add_argument("--expected-sources", nargs="*", default=[], help="Source MCONs for lineage check")
    parser.add_argument("--lookback-hours", type=int, default=24, help="For query log check (default: 24)")
    args = parser.parse_args()

    if not args.key_id or not args.key_token:
        print("ERROR: Provide --key-id/--key-token or set MCD_ID/MCD_TOKEN", file=sys.stderr)
        sys.exit(1)

    print(f"\n{'='*60}")
    print(f"Verifying: {args.full_table_id}")
    print(f"{'='*60}")

    mcon = args.mcon
    if not mcon:
        if not args.resource_uuid:
            print("ERROR: --resource-uuid required when --mcon is not provided", file=sys.stderr)
            sys.exit(1)
        mcon = get_table_mcon(args.full_table_id, args.resource_uuid, args.key_id, args.key_token)

    if args.check_schema:
        print("\n[Schema]")
        verify_schema(mcon, args.expected_fields, args.key_id, args.key_token)

    if args.check_metrics:
        print("\n[Metrics]")
        verify_metrics(mcon, args.key_id, args.key_token)

    if args.check_detectors:
        print("\n[Detectors]")
        verify_detectors(mcon, args.key_id, args.key_token)

    if args.check_lineage:
        print("\n[Table Lineage]")
        verify_table_lineage(mcon, args.expected_sources, args.key_id, args.key_token)

    if args.check_query_logs:
        print("\n[Query Logs]")
        end = datetime.now(tz=timezone.utc)
        start = end - timedelta(hours=args.lookback_hours)
        verify_query_logs(mcon, start, end, args.key_id, args.key_token)

    print("\nDone.")


if __name__ == "__main__":
    main()
@@ -0,0 +1,71 @@
"""
BigQuery Iceberg — Metadata Collect & Push (combined)
=====================================================
Convenience wrapper that runs collect_metadata.collect() followed by
push_metadata.push() in a single invocation. Supports
``--only-freshness-and-volume`` for fast periodic pushes.

Prerequisites:
    pip install google-cloud-bigquery pycarlo>=0.12.251
"""

from __future__ import annotations

import argparse
import os

from collect_metadata import collect
from push_metadata import push


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Collect BigQuery Iceberg metadata and push to Monte Carlo",
    )
    # Collection args
    parser.add_argument("--project-id", default=os.getenv("BIGQUERY_PROJECT_ID"))
    parser.add_argument("--datasets", nargs="+", default=None)
    parser.add_argument("--tables", nargs="+", default=None)
    parser.add_argument(
        "--only-freshness-and-volume",
        action="store_true",
        help="Skip field/schema collection — only collect freshness and volume.",
    )
    parser.add_argument("--manifest-file", default="metadata_output.json")

    # Push args
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--batch-size", type=int, default=500)
    parser.add_argument("--push-result-file", default="metadata_push_result.json")

    args = parser.parse_args()

    if not args.project_id:
        parser.error("--project-id or BIGQUERY_PROJECT_ID env var is required")
    required_push = ["resource_uuid", "key_id", "key_token"]
    missing = [k for k in required_push if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required push arguments/env vars: {missing}")

    collect(
        project_id=args.project_id,
        datasets=args.datasets,
        tables=args.tables,
        only_freshness_and_volume=args.only_freshness_and_volume,
        output_file=args.manifest_file,
    )

    push(
        input_file=args.manifest_file,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
        output_file=args.push_result_file,
    )


if __name__ == "__main__":
    main()
@@ -0,0 +1,64 @@
"""
BigQuery Iceberg — Query Log Collect & Push (combined)
======================================================
Convenience wrapper that runs collect_query_logs.collect() followed by
push_query_logs.push() in a single invocation.

Prerequisites:
    pip install google-cloud-bigquery pycarlo>=0.12.251 python-dateutil>=2.8.0
"""

from __future__ import annotations

import argparse
import os

from collect_query_logs import LOOKBACK_HOURS, LOOKBACK_LAG_HOURS, collect
from push_query_logs import push


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Collect BigQuery query logs and push to Monte Carlo",
    )
    # Collection args
    parser.add_argument("--project-id", default=os.getenv("BIGQUERY_PROJECT_ID"))
    parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
    parser.add_argument("--lookback-lag-hours", type=int, default=LOOKBACK_LAG_HOURS)
    parser.add_argument("--manifest-file", default="query_logs_output.json")

    # Push args
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--batch-size", type=int, default=100)
    parser.add_argument("--push-result-file", default="query_logs_push_result.json")

    args = parser.parse_args()

    if not args.project_id:
        parser.error("--project-id or BIGQUERY_PROJECT_ID env var is required")
    required_push = ["resource_uuid", "key_id", "key_token"]
    missing = [k for k in required_push if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required push arguments/env vars: {missing}")

    collect(
        project_id=args.project_id,
        lookback_hours=args.lookback_hours,
        lookback_lag_hours=args.lookback_lag_hours,
        output_file=args.manifest_file,
    )

    push(
        input_file=args.manifest_file,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
        output_file=args.push_result_file,
    )


if __name__ == "__main__":
    main()
@@ -0,0 +1,253 @@
"""
BigQuery Iceberg — Metadata Collection (collect only)
=====================================================
Collects table schemas, row counts, byte sizes, and freshness for BigQuery
Iceberg (BigLake-managed) tables using INFORMATION_SCHEMA.TABLE_STORAGE and
INFORMATION_SCHEMA.COLUMNS. Standard BigQuery collection uses __TABLES__ which
does not include Iceberg tables — this template fills that gap.

Can be run standalone via CLI or imported (use the ``collect()`` function).

Supports a ``--only-freshness-and-volume`` flag to skip the COLUMNS query for
fast periodic pushes after the initial full metadata push.

Substitution points (search for "← SUBSTITUTE"):
    - BIGQUERY_PROJECT_ID            : GCP project ID to collect from
    - GOOGLE_APPLICATION_CREDENTIALS : path to service-account JSON key file
    - REGION                         : BigQuery region (default "us")

Prerequisites:
    pip install google-cloud-bigquery
"""

from __future__ import annotations

import argparse
import json
import logging
import os
from datetime import datetime, timezone

from google.cloud import bigquery

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

RESOURCE_TYPE = "bigquery"

# BigQuery type → Monte Carlo canonical type
BQ_TYPE_MAP: dict[str, str] = {
    "INT64": "INTEGER",
    "INTEGER": "INTEGER",
    "FLOAT64": "FLOAT",
    "FLOAT": "FLOAT",
    "BOOL": "BOOLEAN",
    "BOOLEAN": "BOOLEAN",
    "STRING": "VARCHAR",
    "BYTES": "BINARY",
    "DATE": "DATE",
    "DATETIME": "DATETIME",
    "TIMESTAMP": "TIMESTAMP",
    "TIME": "TIME",
    "NUMERIC": "DECIMAL",
    "BIGNUMERIC": "DECIMAL",
    "RECORD": "STRUCT",
    "STRUCT": "STRUCT",
    "REPEATED": "ARRAY",
    "JSON": "JSON",
    "GEOGRAPHY": "GEOGRAPHY",
}


def map_bq_type(bq_type: str) -> str:
    base = bq_type.split("(")[0].strip().upper()
    return BQ_TYPE_MAP.get(base, bq_type.upper())


def _fetch_iceberg_tables(
    client: bigquery.Client,
    project_id: str,
    datasets: list[str] | None = None,
    tables: list[str] | None = None,
) -> list[dict]:
    """Query TABLE_STORAGE for BigLake (Iceberg) tables."""
    conditions = [
        "managed_table_type = 'BIGLAKE'",
        "deleted = FALSE",
    ]
    if datasets:
        ds_list = ", ".join(f"'{d}'" for d in datasets)
        conditions.append(f"table_schema IN ({ds_list})")
    if tables:
        tbl_list = ", ".join(f"'{t}'" for t in tables)
        conditions.append(f"table_name IN ({tbl_list})")

    where = " AND ".join(conditions)
    query = f"""
        SELECT
            table_schema,
            table_name,
            total_rows,
            current_physical_bytes,
            storage_last_modified_time,
            creation_time
        FROM `{project_id}.region-us`.INFORMATION_SCHEMA.TABLE_STORAGE  -- ← SUBSTITUTE: change region if needed
        WHERE {where}
        ORDER BY table_schema, table_name
    """
    log.info("Querying TABLE_STORAGE for Iceberg tables ...")
    rows = list(client.query(query).result())
    log.info("Found %d Iceberg table(s).", len(rows))
    return [dict(row) for row in rows]


def _fetch_columns(
    client: bigquery.Client,
    project_id: str,
    dataset: str,
    table_name: str,
) -> list[dict]:
    """Fetch column metadata for a specific table."""
    query = f"""
        SELECT column_name, data_type, ordinal_position, is_nullable, column_default
        FROM `{project_id}.{dataset}.INFORMATION_SCHEMA.COLUMNS`
        WHERE table_name = '{table_name}'
        ORDER BY ordinal_position
    """
    return [
        {
            "name": row["column_name"],
            "type": map_bq_type(row["data_type"]),
        }
        for row in client.query(query).result()
    ]


def _resolve_freshness(row: dict) -> str:
    """Return the best available freshness timestamp as ISO8601.

    Uses storage_last_modified_time if Google has populated it (expected
    early April 2026). Falls back to current time with a warning.
    """
    if row.get("storage_last_modified_time"):
        return row["storage_last_modified_time"].isoformat()

    log.warning(
        "storage_last_modified_time is NULL for %s.%s — "
        "falling back to current time. Google's TABLE_STORAGE update "
        "for Iceberg tables may not have shipped yet.",
        row["table_schema"],
        row["table_name"],
    )
    return datetime.now(timezone.utc).isoformat()


def collect(
    project_id: str,
    datasets: list[str] | None = None,
    tables: list[str] | None = None,
    only_freshness_and_volume: bool = False,
    output_file: str = "metadata_output.json",
) -> dict:
    """Collect Iceberg table metadata and write a JSON manifest.

    When only_freshness_and_volume is True, skips the COLUMNS query and
    omits fields from the manifest. Use this for periodic hourly pushes
    after the initial full metadata push.
    """
    client = bigquery.Client(project=project_id)  # ← SUBSTITUTE: adjust auth if needed

    if only_freshness_and_volume:
        log.info("Running in freshness+volume only mode (skipping fields).")

    iceberg_tables = _fetch_iceberg_tables(client, project_id, datasets, tables)
    if not iceberg_tables:
        log.warning("No Iceberg tables found matching the criteria.")
        return {"resource_type": RESOURCE_TYPE, "assets": []}

    assets: list[dict] = []
    for row in iceberg_tables:
        dataset = row["table_schema"]
        name = row["table_name"]

        asset = {
            "name": name,
            "database": project_id,
            "schema": dataset,
            "type": "TABLE",
            "volume": {
                "row_count": row["total_rows"],
                "byte_count": row["current_physical_bytes"],
            },
            "freshness": {
                "last_updated_time": _resolve_freshness(row),
            },
        }

        if not only_freshness_and_volume:
            asset["description"] = None
            asset["fields"] = _fetch_columns(client, project_id, dataset, name)

        assets.append(asset)
        log.info(
            "Collected %s.%s.%s — rows=%s, bytes=%s",
            project_id, dataset, name,
            row["total_rows"], row["current_physical_bytes"],
        )

    manifest = {
        "resource_type": RESOURCE_TYPE,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "assets": assets,
    }
    with open(output_file, "w") as fh:
        json.dump(manifest, fh, indent=2)
    log.info("Manifest written to %s (%d assets)", output_file, len(assets))

    return manifest


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Collect BigQuery Iceberg table metadata into a JSON manifest",
    )
    parser.add_argument(
        "--project-id",
        default=os.getenv("BIGQUERY_PROJECT_ID"),  # ← SUBSTITUTE
        help="GCP project ID (or set BIGQUERY_PROJECT_ID env var)",
    )
    parser.add_argument(
        "--datasets",
        nargs="+",
        default=None,
        help="Limit to specific dataset(s). Omit to scan all datasets.",
    )
    parser.add_argument(
        "--tables",
        nargs="+",
        default=None,
        help="Limit to specific table name(s) within the datasets.",
    )
    parser.add_argument(
        "--only-freshness-and-volume",
        action="store_true",
        help="Skip field/schema collection — only collect freshness and volume. "
             "Use for periodic hourly pushes after the initial full metadata push.",
    )
    parser.add_argument("--output-file", default="metadata_output.json")
    args = parser.parse_args()

    if not args.project_id:
        parser.error("--project-id or BIGQUERY_PROJECT_ID env var is required")

    collect(
        project_id=args.project_id,
        datasets=args.datasets,
        tables=args.tables,
        only_freshness_and_volume=args.only_freshness_and_volume,
        output_file=args.output_file,
    )


if __name__ == "__main__":
    main()
@@ -0,0 +1,149 @@
"""
BigQuery Iceberg — Query Log Collection (collect only)
======================================================
Queries the BigQuery Jobs API for completed query jobs within a time
window and writes a JSON manifest that can be fed to push_query_logs.py.

Can be run standalone via CLI or imported (use the ``collect()`` function).

Substitution points (search for "← SUBSTITUTE"):
- BIGQUERY_PROJECT_ID : GCP project ID to collect from
- GOOGLE_APPLICATION_CREDENTIALS : path to service-account JSON key file

Prerequisites:
    pip install google-cloud-bigquery
"""

from __future__ import annotations

import argparse
import json
import logging
import os
from datetime import datetime, timedelta, timezone

from google.cloud import bigquery

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

LOG_TYPE = "bigquery"

LOOKBACK_HOURS: int = int(os.getenv("LOOKBACK_HOURS", "25"))
LOOKBACK_LAG_HOURS: int = int(os.getenv("LOOKBACK_LAG_HOURS", "1"))
MAX_JOBS: int = int(os.getenv("MAX_JOBS", "10000"))

# Limit to specific statement types — empty list means collect all.
STATEMENT_TYPE_FILTER: list[str] = []


def _safe_isoformat(dt: datetime | None) -> str | None:
    if dt is None:
        return None
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.isoformat()


def _collect_query_logs(
    bq_client: bigquery.Client,
    project_id: str,
    start_dt: datetime,
    end_dt: datetime,
) -> list[dict]:
    """Collect query logs from BigQuery job history."""
    entries: list[dict] = []

    log.info(
        "Listing jobs for project=%s from %s to %s",
        project_id, start_dt.isoformat(), end_dt.isoformat(),
    )

    for job in bq_client.list_jobs(
        project=project_id,
        all_users=True,
        min_creation_time=start_dt,
        max_creation_time=end_dt,
    ):
        sql: str = getattr(job, "query", None) or ""
        if not sql.strip():
            continue

        statement_type: str = getattr(job, "statement_type", None) or ""
        if STATEMENT_TYPE_FILTER and statement_type not in STATEMENT_TYPE_FILTER:
            continue

        entries.append({
            "query_id": job.job_id,
            "query_text": sql,
            "start_time": _safe_isoformat(getattr(job, "created", None)),
            "end_time": _safe_isoformat(getattr(job, "ended", None)),
            "user": getattr(job, "user_email", None),
            "total_bytes_billed": getattr(job, "total_bytes_billed", None),
            "statement_type": statement_type or None,
        })

        if len(entries) >= MAX_JOBS:
            log.warning("Reached MAX_JOBS=%d — stopping early", MAX_JOBS)
            break

    return entries


def collect(
    project_id: str,
    lookback_hours: int = LOOKBACK_HOURS,
    lookback_lag_hours: int = LOOKBACK_LAG_HOURS,
    output_file: str = "query_logs_output.json",
) -> dict:
    """Collect query logs and write a JSON manifest."""
    bq_client = bigquery.Client(project=project_id)

    end_dt = datetime.now(timezone.utc) - timedelta(hours=lookback_lag_hours)
    start_dt = end_dt - timedelta(hours=lookback_hours)

    entries = _collect_query_logs(bq_client, project_id, start_dt, end_dt)
    log.info("Collected %d query log entries.", len(entries))

    manifest = {
        "log_type": LOG_TYPE,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "window_start": start_dt.isoformat(),
        "window_end": end_dt.isoformat(),
        "query_log_count": len(entries),
        "queries": entries,
    }
    with open(output_file, "w") as fh:
        json.dump(manifest, fh, indent=2)
    log.info("Query log manifest written to %s", output_file)

    return manifest


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Collect BigQuery query logs into a JSON manifest",
    )
    parser.add_argument(
        "--project-id",
        default=os.getenv("BIGQUERY_PROJECT_ID"),
        help="GCP project ID (or set BIGQUERY_PROJECT_ID env var)",
    )
    parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
    parser.add_argument("--lookback-lag-hours", type=int, default=LOOKBACK_LAG_HOURS)
    parser.add_argument("--output-file", default="query_logs_output.json")
    args = parser.parse_args()

    if not args.project_id:
        parser.error("--project-id or BIGQUERY_PROJECT_ID env var is required")

    collect(
        project_id=args.project_id,
        lookback_hours=args.lookback_hours,
        lookback_lag_hours=args.lookback_lag_hours,
        output_file=args.output_file,
    )


if __name__ == "__main__":
    main()
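As a quick illustration of the collection window above (`end = now - lag`, `start = end - lookback`), here is a minimal standalone sketch. The helper name `compute_window` is ours, not part of the script:

```python
from datetime import datetime, timedelta, timezone


def compute_window(lookback_hours: int = 25, lag_hours: int = 1, now=None):
    """Mirror collect(): skip the most recent lag_hours (still-running
    jobs), then look back lookback_hours from that point."""
    now = now or datetime.now(timezone.utc)
    end_dt = now - timedelta(hours=lag_hours)
    start_dt = end_dt - timedelta(hours=lookback_hours)
    return start_dt, end_dt


# Fixed "now" so the result is reproducible.
fixed_now = datetime(2026, 2, 8, 12, 0, tzinfo=timezone.utc)
start, end = compute_window(now=fixed_now)
print(end.isoformat())    # 2026-02-08T11:00:00+00:00
print(start.isoformat())  # 2026-02-07T10:00:00+00:00
```

The defaults of 25 h lookback with a 1 h lag give a one-hour overlap with a daily run, so jobs that finish near the cut are not missed.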
@@ -0,0 +1,190 @@
"""
BigQuery Iceberg — Metadata Push (push only)
============================================
Reads a JSON manifest produced by collect_metadata.py and pushes table
metadata to Monte Carlo using the pycarlo SDK's IngestionService.

Can be run standalone via CLI or imported (use the ``push()`` function).

Substitution points (search for "← SUBSTITUTE"):
- MCD_INGEST_ID : Monte Carlo Ingestion API key ID
- MCD_INGEST_TOKEN : Monte Carlo Ingestion API key token
- MCD_RESOURCE_UUID : Monte Carlo warehouse resource UUID

Prerequisites:
    pip install pycarlo>=0.12.251
"""

from __future__ import annotations

import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone

from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
    AssetField,
    AssetFreshness,
    AssetMetadata,
    AssetVolume,
    RelationalAsset,
)

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

RESOURCE_TYPE = "bigquery"
_BATCH_SIZE = 500

_ENDPOINT = "https://integrations.getmontecarlo.com"


def _asset_from_dict(d: dict) -> RelationalAsset:
    """Reconstruct a RelationalAsset from a manifest dict entry."""
    fields = [
        AssetField(
            name=f["name"],
            type=f.get("type"),
            description=f.get("description"),
        )
        for f in d.get("fields", [])
    ]

    volume = None
    if d.get("volume"):
        volume = AssetVolume(
            row_count=d["volume"].get("row_count"),
            byte_count=d["volume"].get("byte_count"),
        )

    freshness = None
    if d.get("freshness") and d["freshness"].get("last_updated_time"):
        freshness = AssetFreshness(
            last_update_time=d["freshness"]["last_updated_time"],
        )

    return RelationalAsset(
        type=d.get("type", "TABLE"),
        metadata=AssetMetadata(
            name=d["name"],
            database=d["database"],
            schema=d["schema"],
            description=d.get("description"),
        ),
        fields=fields,
        volume=volume,
        freshness=freshness,
    )


def push(
    input_file: str,
    resource_uuid: str,
    key_id: str,
    key_token: str,
    batch_size: int = _BATCH_SIZE,
    output_file: str = "metadata_push_result.json",
) -> dict:
    """Read a metadata manifest and push assets to Monte Carlo in batches."""
    endpoint = _ENDPOINT
    log.info("Using endpoint: %s", endpoint)
    with open(input_file) as fh:
        manifest = json.load(fh)

    asset_dicts = manifest.get("assets", [])
    resource_type = manifest.get("resource_type", RESOURCE_TYPE)
    assets = [_asset_from_dict(d) for d in asset_dicts]
    log.info("Loaded %d asset(s) from %s", len(assets), input_file)

    batches = [assets[i : i + batch_size] for i in range(0, max(len(assets), 1), batch_size)]
    total_batches = len(batches)

    def _push_batch(batch: list[RelationalAsset], batch_num: int) -> str | None:
        client = Client(session=Session(
            mcd_id=key_id, mcd_token=key_token, scope="Ingestion", endpoint=endpoint,
        ))
        service = IngestionService(mc_client=client)
        result = service.send_metadata(
            resource_uuid=resource_uuid,
            resource_type=resource_type,
            events=batch,
        )
        invocation_id = service.extract_invocation_id(result)
        log.info(
            "Pushed batch %d/%d (%d assets) — invocation_id=%s",
            batch_num, total_batches, len(batch), invocation_id,
        )
        return invocation_id

    max_workers = min(4, total_batches)
    invocation_ids: list[str | None] = [None] * total_batches

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(_push_batch, batch, i + 1): i
            for i, batch in enumerate(batches)
        }
        for future in as_completed(futures):
            idx = futures[future]
            try:
                invocation_ids[idx] = future.result()
            except Exception as exc:
                log.error("ERROR pushing batch %d: %s", idx + 1, exc)
                raise

    log.info("All %d batch(es) pushed.", total_batches)

    push_result = {
        "resource_uuid": resource_uuid,
        "resource_type": resource_type,
        "invocation_ids": invocation_ids,
        "pushed_at": datetime.now(timezone.utc).isoformat(),
        "total_assets": len(assets),
        "batch_count": total_batches,
        "batch_size": batch_size,
    }
    with open(output_file, "w") as fh:
        json.dump(push_result, fh, indent=2)
    log.info("Push result written to %s", output_file)

    return push_result


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Push BigQuery Iceberg metadata from a manifest to Monte Carlo",
    )
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--input-file", default="metadata_output.json")
    parser.add_argument("--output-file", default="metadata_push_result.json")
    parser.add_argument(
        "--batch-size",
        type=int,
        default=_BATCH_SIZE,
        help=f"Max assets per push batch (default: {_BATCH_SIZE})",
    )
    args = parser.parse_args()

    required = ["resource_uuid", "key_id", "key_token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    push(
        input_file=args.input_file,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
        output_file=args.output_file,
    )


if __name__ == "__main__":
    main()
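One subtlety in the batch-slicing expression in `push()` is worth calling out: `max(len(assets), 1)` guarantees at least one batch, so an empty manifest still results in a single (empty) push rather than none. A self-contained sketch of just that slicing (the function name `make_batches` is ours):

```python
def make_batches(assets: list, batch_size: int = 500) -> list[list]:
    # Same slicing as push(): max(len(assets), 1) forces at least one
    # batch, so an empty manifest still yields one empty push.
    return [assets[i : i + batch_size] for i in range(0, max(len(assets), 1), batch_size)]


assert make_batches([]) == [[]]  # one empty batch, not zero batches
assert len(make_batches(list(range(1200)))) == 3  # 500 + 500 + 200
assert make_batches(list(range(1200)))[-1] == list(range(1000, 1200))
```

Without the `max(..., 1)`, `range(0, 0, 500)` would be empty and nothing would be sent, so no invocation ID would be recorded for an empty run.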
@@ -0,0 +1,208 @@
"""
BigQuery Iceberg — Query Log Push (push only)
=============================================
Reads a JSON manifest produced by collect_query_logs.py and pushes query
log entries to Monte Carlo using the pycarlo SDK's IngestionService.

Uses dateutil.isoparse() to convert ISO8601 strings back to datetime
objects (QueryLogEntry requires datetime, not str).

Can be run standalone via CLI or imported (use the ``push()`` function).

Substitution points (search for "← SUBSTITUTE"):
- MCD_INGEST_ID : Monte Carlo Ingestion API key ID
- MCD_INGEST_TOKEN : Monte Carlo Ingestion API key token
- MCD_RESOURCE_UUID : Monte Carlo warehouse resource UUID

Prerequisites:
    pip install pycarlo>=0.12.251 python-dateutil>=2.8.0
"""

from __future__ import annotations

import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone

from dateutil.parser import isoparse

from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import QueryLogEntry

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

LOG_TYPE = "bigquery"

# Query logs include full SQL text — keep batches small to stay under the
# 1 MB compressed payload limit.
_BATCH_SIZE = 100

# Truncate very long SQL to prevent 413 errors.
_MAX_QUERY_TEXT_LEN = 10_000

_ENDPOINT = "https://integrations.getmontecarlo.com"


def _build_query_log_entries(queries: list[dict]) -> list[QueryLogEntry]:
    """Convert manifest query dicts into QueryLogEntry objects."""
    entries = []
    truncated = 0
    for q in queries:
        query_text = q.get("query_text") or ""

        if len(query_text) > _MAX_QUERY_TEXT_LEN:
            query_text = query_text[:_MAX_QUERY_TEXT_LEN] + "... [TRUNCATED]"
            truncated += 1

        extra = {}
        if q.get("total_bytes_billed") is not None:
            extra["total_bytes_billed"] = q["total_bytes_billed"]
        if q.get("statement_type") is not None:
            extra["statement_type"] = q["statement_type"]

        start_time = q.get("start_time")
        end_time = q.get("end_time")

        entry = QueryLogEntry(
            query_id=q.get("query_id"),
            query_text=query_text,
            start_time=isoparse(start_time) if start_time else None,
            end_time=isoparse(end_time) if end_time else None,
            user=q.get("user"),
            extra=extra or None,
        )
        entries.append(entry)

    if truncated:
        log.info("Truncated %d query text(s) exceeding %d chars", truncated, _MAX_QUERY_TEXT_LEN)
    return entries


def push(
    input_file: str,
    resource_uuid: str,
    key_id: str,
    key_token: str,
    batch_size: int = _BATCH_SIZE,
    output_file: str = "query_logs_push_result.json",
) -> dict:
    """Read a query log manifest and push entries to Monte Carlo in batches."""
    endpoint = _ENDPOINT
    log.info("Using endpoint: %s", endpoint)

    with open(input_file) as fh:
        manifest = json.load(fh)

    queries = manifest.get("queries", [])
    log_type = manifest.get("log_type", LOG_TYPE)
    entries = _build_query_log_entries(queries)
    log.info("Loaded %d query log entry/entries from %s", len(entries), input_file)

    if not entries:
        log.info("No query log entries to push.")
        push_result = {
            "resource_uuid": resource_uuid,
            "log_type": log_type,
            "invocation_ids": [],
            "pushed_at": datetime.now(timezone.utc).isoformat(),
            "total_entries": 0,
            "batch_count": 0,
            "batch_size": batch_size,
        }
        with open(output_file, "w") as fh:
            json.dump(push_result, fh, indent=2)
        return push_result

    batches = [entries[i : i + batch_size] for i in range(0, len(entries), batch_size)]
    total_batches = len(batches)

    def _push_batch(batch: list[QueryLogEntry], batch_num: int) -> str | None:
        client = Client(session=Session(
            mcd_id=key_id, mcd_token=key_token, scope="Ingestion", endpoint=endpoint,
        ))
        service = IngestionService(mc_client=client)
        result = service.send_query_logs(
            resource_uuid=resource_uuid,
            log_type=log_type,
            events=batch,
        )
        invocation_id = service.extract_invocation_id(result)
        log.info(
            "Pushed batch %d/%d (%d entries) — invocation_id=%s",
            batch_num, total_batches, len(batch), invocation_id,
        )
        return invocation_id

    max_workers = min(4, total_batches)
    invocation_ids: list[str | None] = [None] * total_batches

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(_push_batch, batch, i + 1): i
            for i, batch in enumerate(batches)
        }
        for future in as_completed(futures):
            idx = futures[future]
            try:
                invocation_ids[idx] = future.result()
            except Exception as exc:
                log.error("ERROR pushing batch %d: %s", idx + 1, exc)
                raise

    log.info("All %d batch(es) pushed.", total_batches)

    push_result = {
        "resource_uuid": resource_uuid,
        "log_type": log_type,
        "invocation_ids": invocation_ids,
        "pushed_at": datetime.now(timezone.utc).isoformat(),
        "total_entries": len(entries),
        "batch_count": total_batches,
        "batch_size": batch_size,
    }
    with open(output_file, "w") as fh:
        json.dump(push_result, fh, indent=2)
    log.info("Push result written to %s", output_file)

    return push_result


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Push BigQuery query logs from a manifest to Monte Carlo",
    )
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--input-file", default="query_logs_output.json")
    parser.add_argument("--output-file", default="query_logs_push_result.json")
    parser.add_argument(
        "--batch-size",
        type=int,
        default=_BATCH_SIZE,
        help=f"Max entries per push batch (default: {_BATCH_SIZE})",
    )
    args = parser.parse_args()

    required = ["resource_uuid", "key_id", "key_token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    push(
        input_file=args.input_file,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
        output_file=args.output_file,
    )


if __name__ == "__main__":
    main()
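The truncation rule in `_build_query_log_entries` (cap SQL at `_MAX_QUERY_TEXT_LEN` characters and append a visible marker) can be sketched as a pure function. The name `truncate_sql` is ours; the cap and marker string match the script above:

```python
_MAX_QUERY_TEXT_LEN = 10_000  # same cap as the script above


def truncate_sql(query_text: str, max_len: int = _MAX_QUERY_TEXT_LEN) -> str:
    """Keep the first max_len characters and append a marker so
    truncation is visible in the Monte Carlo UI."""
    if len(query_text) > max_len:
        return query_text[:max_len] + "... [TRUNCATED]"
    return query_text


short = "SELECT 1"
long_sql = "SELECT " + "x, " * 5_000  # well over 10_000 chars
assert truncate_sql(short) == short
assert truncate_sql(long_sql).endswith("... [TRUNCATED]")
assert len(truncate_sql(long_sql)) == _MAX_QUERY_TEXT_LEN + len("... [TRUNCATED]")
```

Truncated SQL still lets Monte Carlo attribute the query to tables referenced early in the statement, while keeping each batch safely under the payload limit.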
@@ -0,0 +1,70 @@
"""
BigQuery — Lineage Collection and Push (combined)
===================================================
Imports ``collect()`` from ``collect_lineage`` and ``push()`` from
``push_lineage``, and runs both in sequence.

Substitution points (search for "← SUBSTITUTE"):
- BIGQUERY_PROJECT_ID : GCP project ID to collect from
- BIGQUERY_REGION : BigQuery region for INFORMATION_SCHEMA queries (e.g. "us", "eu")
- LOOKBACK_HOURS : how far back to scan job history (default 24 h)
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the BigQuery connection in Monte Carlo

Prerequisites:
    pip install google-cloud-bigquery pycarlo
"""

from __future__ import annotations

import argparse
import os

from collect_lineage import collect, LOOKBACK_HOURS
from push_lineage import push, _BATCH_SIZE


def main() -> None:
    parser = argparse.ArgumentParser(description="Push BigQuery lineage to Monte Carlo")
    parser.add_argument("--project-id", default=os.getenv("BIGQUERY_PROJECT_ID"))  # ← SUBSTITUTE
    parser.add_argument("--region", default=os.getenv("BIGQUERY_REGION", "us"))  # ← SUBSTITUTE
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
    parser.add_argument("--output-file", default="lineage_output.json")
    parser.add_argument("--push-result-file", default="lineage_push_result.json")
    parser.add_argument(
        "--batch-size",
        type=int,
        default=_BATCH_SIZE,
        help=f"Max events per push batch (default: {_BATCH_SIZE})",
    )
    args = parser.parse_args()

    required = ["project_id", "resource_uuid", "key_id", "key_token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    # Step 1: Collect
    collect(
        project_id=args.project_id,
        region=args.region,
        lookback_hours=args.lookback_hours,
        output_file=args.output_file,
    )

    # Step 2: Push
    push(
        input_file=args.output_file,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
        output_file=args.push_result_file,
    )


if __name__ == "__main__":
    main()
@@ -0,0 +1,65 @@
"""
BigQuery — Metadata Collection and Push (combined)
===================================================
Imports ``collect()`` from ``collect_metadata`` and ``push()`` from
``push_metadata``, and runs both in sequence.

Substitution points (search for "← SUBSTITUTE"):
- BIGQUERY_PROJECT_ID : GCP project ID to collect from
- GOOGLE_APPLICATION_CREDENTIALS : path to service-account JSON key file
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the BigQuery connection in Monte Carlo
- DATASET_EXCLUSIONS : datasets to skip (informational / system datasets)

Prerequisites:
    pip install google-cloud-bigquery pycarlo
"""

from __future__ import annotations

import argparse
import os

from collect_metadata import collect
from push_metadata import push, _BATCH_SIZE


def main() -> None:
    parser = argparse.ArgumentParser(description="Push BigQuery metadata to Monte Carlo")
    parser.add_argument("--project-id", default=os.getenv("BIGQUERY_PROJECT_ID"))  # ← SUBSTITUTE
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--output-file", default="metadata_output.json")
    parser.add_argument("--push-result-file", default="metadata_push_result.json")
    parser.add_argument(
        "--batch-size",
        type=int,
        default=_BATCH_SIZE,
        help=f"Max assets per push batch (default: {_BATCH_SIZE})",
    )
    args = parser.parse_args()

    missing = [
        k for k, v in vars(args).items()
        if v is None and k not in ("output_file", "push_result_file", "batch_size")
    ]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    # Step 1: Collect
    collect(
        project_id=args.project_id,
        output_file=args.output_file,
    )

    # Step 2: Push
    push(
        input_file=args.output_file,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
        output_file=args.push_result_file,
    )


if __name__ == "__main__":
    main()
@@ -0,0 +1,70 @@
"""
BigQuery — Query Log Collection and Push (combined)
=====================================================
Imports ``collect()`` from ``collect_query_logs`` and ``push()`` from
``push_query_logs``, and runs both in sequence.

Substitution points (search for "← SUBSTITUTE"):
- BIGQUERY_PROJECT_ID : GCP project ID to collect query logs from
- GOOGLE_APPLICATION_CREDENTIALS : path to service-account JSON key file
- LOOKBACK_HOURS : how many hours back to collect (default 25, skip last 1 h)
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the BigQuery connection in Monte Carlo

Prerequisites:
    pip install google-cloud-bigquery pycarlo
"""

from __future__ import annotations

import argparse
import os

from collect_query_logs import collect, LOOKBACK_HOURS, LOOKBACK_LAG_HOURS
from push_query_logs import push, _BATCH_SIZE


def main() -> None:
    parser = argparse.ArgumentParser(description="Push BigQuery query logs to Monte Carlo")
    parser.add_argument("--project-id", default=os.getenv("BIGQUERY_PROJECT_ID"))  # ← SUBSTITUTE
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
    parser.add_argument("--lookback-lag-hours", type=int, default=LOOKBACK_LAG_HOURS)
    parser.add_argument("--output-file", default="query_logs_output.json")
    parser.add_argument("--push-result-file", default="query_logs_push_result.json")
    parser.add_argument(
        "--batch-size",
        type=int,
        default=_BATCH_SIZE,
        help=f"Max entries per push batch (default: {_BATCH_SIZE})",
    )
    args = parser.parse_args()

    required = ["project_id", "resource_uuid", "key_id", "key_token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    # Step 1: Collect
    collect(
        project_id=args.project_id,
        lookback_hours=args.lookback_hours,
        lookback_lag_hours=args.lookback_lag_hours,
        output_file=args.output_file,
    )

    # Step 2: Push
    push(
        input_file=args.output_file,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
        output_file=args.push_result_file,
    )


if __name__ == "__main__":
    main()
@@ -0,0 +1,214 @@
"""
BigQuery — Lineage Collection (collect only)
=============================================
Collects table-level lineage from two sources:
1. INFORMATION_SCHEMA.SCHEMATA_LINKS — cross-project dataset shares (per region)
2. Job query history — SQL parsing for CREATE TABLE AS SELECT and INSERT INTO
   SELECT patterns to derive source->destination relationships.

Writes the collected lineage edges to a JSON manifest file.

Can be run standalone via CLI or imported (use the ``collect()`` function).

Substitution points (search for "← SUBSTITUTE"):
- BIGQUERY_PROJECT_ID : GCP project ID to collect from
- BIGQUERY_REGION : BigQuery region for INFORMATION_SCHEMA queries (e.g. "us", "eu")
- LOOKBACK_HOURS : how far back to scan job history (default 24 h)

Prerequisites:
    pip install google-cloud-bigquery
"""

from __future__ import annotations

import argparse
import json
import logging
import os
import re
from datetime import datetime, timedelta, timezone

from google.cloud import bigquery

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

RESOURCE_TYPE = "bigquery"
LOOKBACK_HOURS = int(os.getenv("LOOKBACK_HOURS", "24"))  # ← SUBSTITUTE: adjust lookback window

# Regex patterns to detect CTAS and INSERT INTO SELECT in BigQuery SQL
_CTAS_PATTERN = re.compile(
    r"CREATE\s+(?:OR\s+REPLACE\s+)?(?:TABLE|VIEW)\s+`?(?P<dest>[\w.\-]+)`?"
    r".*?(?:AS\s+)?SELECT\b",
    re.IGNORECASE | re.DOTALL,
)
_INSERT_PATTERN = re.compile(
    r"INSERT\s+(?:INTO\s+)?`?(?P<dest>[\w.\-]+)`?.*?SELECT\b",
    re.IGNORECASE | re.DOTALL,
)
_TABLE_REF_PATTERN = re.compile(r"`?([\w\-]+\.[\w\-]+\.[\w\-]+)`?", re.IGNORECASE)


def _parse_full_name(full_name: str) -> tuple[str, str, str]:
    """Split 'project.dataset.table' into (project, dataset, table)."""
    parts = full_name.replace("`", "").split(".")
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    if len(parts) == 2:
        return "", parts[0], parts[1]
    return "", "", parts[0]


def _collect_schema_link_lineage(
    bq_client: bigquery.Client,
    project_id: str,
    region: str,
) -> list[dict]:
    """Collect cross-project lineage from INFORMATION_SCHEMA.SCHEMATA_LINKS."""
    query = f"""
        SELECT
            CATALOG_NAME AS source_project,
            SCHEMA_NAME AS source_dataset,
            LINKED_SCHEMA_CATALOG_NAME AS destination_project,
            LINKED_SCHEMA_NAME AS destination_dataset
        FROM `{project_id}`.`{region}`.INFORMATION_SCHEMA.SCHEMATA_LINKS
    """  # ← SUBSTITUTE: update project_id and region as needed
    edges: list[dict] = []
    try:
        for row in bq_client.query(query).result():
            edges.append(
                {
                    "destination": {
                        "database": row.destination_project,
                        "schema": row.destination_dataset,
                        "table": "*",
                    },
                    "sources": [
                        {
                            "database": row.source_project,
                            "schema": row.source_dataset,
                            "table": "*",
                        }
                    ],
                }
            )
    except Exception:
        log.warning("SCHEMATA_LINKS query failed — skipping dataset-share lineage", exc_info=True)
    return edges


def _collect_query_lineage(
    bq_client: bigquery.Client,
    project_id: str,
    lookback_hours: int,
) -> list[dict]:
    """Derive lineage by parsing CTAS/INSERT patterns in job query history."""
    end_dt = datetime.now(timezone.utc)
    start_dt = end_dt - timedelta(hours=lookback_hours)

    edges: list[dict] = []
    for job in bq_client.list_jobs(all_users=True, min_creation_time=start_dt, max_creation_time=end_dt):
        sql: str = getattr(job, "query", None) or ""
        if not sql.strip():
            continue

        dest_match = _CTAS_PATTERN.search(sql) or _INSERT_PATTERN.search(sql)
        if not dest_match:
            continue

        dest_full = dest_match.group("dest")
        dest_project, dest_dataset, dest_table = _parse_full_name(dest_full)
        if not dest_table:
            continue

        # Collect all 3-part table references in the query as sources, excluding destination
        source_refs = [
            m.group(1)
            for m in _TABLE_REF_PATTERN.finditer(sql)
            if m.group(1) != dest_full
        ]
        if not source_refs:
            continue

        unique_sources = list(dict.fromkeys(source_refs))
        sources = []
        for ref in unique_sources:
            p, d, t = _parse_full_name(ref)
            sources.append({"database": p, "schema": d, "table": t})

        edges.append(
            {
                "destination": {
                    "database": dest_project or project_id,
                    "schema": dest_dataset,
                    "table": dest_table,
                },
                "sources": sources,
            }
        )

    return edges


def collect(
    project_id: str,
    region: str = "us",
    lookback_hours: int = LOOKBACK_HOURS,
    output_file: str = "lineage_output.json",
) -> dict:
    """
    Connect to BigQuery, collect lineage edges, and write a JSON manifest.

    Returns the manifest dict.
    """
    bq_client = bigquery.Client(project=project_id)

    log.info("Collecting lineage from project %s ...", project_id)
    schema_edges = _collect_schema_link_lineage(bq_client, project_id, region)
    query_edges = _collect_query_lineage(bq_client, project_id, lookback_hours)
    all_edges = schema_edges + query_edges

    log.info(
        "Collected %d lineage edges (%d schema-link, %d query-derived)",
        len(all_edges), len(schema_edges), len(query_edges),
    )

    manifest = {
        "resource_type": RESOURCE_TYPE,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "schema_link_edges": len(schema_edges),
        "query_derived_edges": len(query_edges),
        "edges": all_edges,
    }
    with open(output_file, "w") as fh:
        json.dump(manifest, fh, indent=2)
    log.info("Lineage manifest written to %s", output_file)

    return manifest


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Collect BigQuery lineage and write to a manifest file",
    )
    parser.add_argument("--project-id", default=os.getenv("BIGQUERY_PROJECT_ID"))  # ← SUBSTITUTE
    parser.add_argument("--region", default=os.getenv("BIGQUERY_REGION", "us"))  # ← SUBSTITUTE
    parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
    parser.add_argument("--output-file", default="lineage_output.json")
    args = parser.parse_args()

    required = ["project_id"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    collect(
        project_id=args.project_id,
        region=args.region,
        lookback_hours=args.lookback_hours,
        output_file=args.output_file,
    )


if __name__ == "__main__":
    main()
@@ -0,0 +1,160 @@
"""
BigQuery — Metadata Collection (collect only)
==============================================
Collects table schemas, row counts, byte sizes, and descriptions from all
datasets in a BigQuery project and writes them to a JSON manifest file.

Can be run standalone via CLI or imported (use the ``collect()`` function).

Substitution points (search for "← SUBSTITUTE"):
  - BIGQUERY_PROJECT_ID : GCP project ID to collect from
  - GOOGLE_APPLICATION_CREDENTIALS : path to service-account JSON key file
  - DATASET_EXCLUSIONS : datasets to skip (informational / system datasets)

Prerequisites:
    pip install google-cloud-bigquery
"""

from __future__ import annotations

import argparse
import json
import logging
import os
from datetime import datetime, timezone

from google.cloud import bigquery

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

RESOURCE_TYPE = "bigquery"

# Datasets to skip — add any internal / system datasets here
DATASET_EXCLUSIONS = {  # ← SUBSTITUTE: add datasets to exclude
    "_bqc_",
    "INFORMATION_SCHEMA",
}

# BigQuery type → Monte Carlo canonical type
BQ_TYPE_MAP: dict[str, str] = {
    "INT64": "INTEGER",
    "INTEGER": "INTEGER",
    "FLOAT64": "FLOAT",
    "FLOAT": "FLOAT",
    "BOOL": "BOOLEAN",
    "BOOLEAN": "BOOLEAN",
    "STRING": "VARCHAR",
    "BYTES": "BINARY",
    "DATE": "DATE",
    "DATETIME": "DATETIME",
    "TIMESTAMP": "TIMESTAMP",
    "TIME": "TIME",
    "NUMERIC": "DECIMAL",
    "BIGNUMERIC": "DECIMAL",
    "RECORD": "STRUCT",
    "STRUCT": "STRUCT",
    "REPEATED": "ARRAY",
    "JSON": "JSON",
    "GEOGRAPHY": "GEOGRAPHY",
}


def map_bq_type(bq_type: str) -> str:
    return BQ_TYPE_MAP.get(bq_type.upper(), bq_type.upper())


def _collect_assets(bq_client: bigquery.Client, project_id: str) -> list[dict]:
    """Collect table metadata from BigQuery and return as a list of dicts."""
    assets: list[dict] = []

    for dataset_item in bq_client.list_datasets():
        dataset_id = dataset_item.dataset_id

        if any(exc in dataset_id for exc in DATASET_EXCLUSIONS):
            log.info("Skipping dataset %s", dataset_id)
            continue

        dataset_ref = bq_client.dataset(dataset_id)

        for table_item in bq_client.list_tables(dataset_ref):
            table_ref = dataset_ref.table(table_item.table_id)
            table = bq_client.get_table(table_ref)

            fields = [
                {
                    "name": field.name,
                    "type": map_bq_type(field.field_type),
                    "description": field.description or None,
                }
                for field in table.schema
            ]

            asset = {
                "name": table.table_id,
                "database": project_id,  # ← SUBSTITUTE: use project or dataset as database
                "schema": dataset_id,
                "type": "VIEW" if table.table_type == "VIEW" else "TABLE",
                "description": table.description or None,
                "fields": fields,
                "volume": {
                    "row_count": table.num_rows,
                    "byte_count": table.num_bytes,
                },
                "freshness": {
                    # Key name must match the push script, which reads
                    # "last_update_time" when building AssetFreshness.
                    "last_update_time": table.modified.isoformat() if table.modified else None,
                },
            }
            assets.append(asset)
            log.info("Queued %s.%s.%s", project_id, dataset_id, table.table_id)

    return assets


def collect(
    project_id: str,
    output_file: str = "metadata_output.json",
) -> dict:
    """
    Connect to BigQuery, collect table metadata, and write a JSON manifest.

    Returns the manifest dict.
    """
    bq_client = bigquery.Client(project=project_id)  # ← SUBSTITUTE: adjust auth if needed

    log.info("Collecting metadata from project %s ...", project_id)
    assets = _collect_assets(bq_client, project_id)
    log.info("Collected %d asset(s).", len(assets))

    manifest = {
        "resource_type": RESOURCE_TYPE,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "assets": assets,
    }
    with open(output_file, "w") as fh:
        json.dump(manifest, fh, indent=2)
    log.info("Asset manifest written to %s", output_file)

    return manifest


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Collect BigQuery metadata and write to a manifest file",
    )
    parser.add_argument("--project-id", default=os.getenv("BIGQUERY_PROJECT_ID"))  # ← SUBSTITUTE
    parser.add_argument("--output-file", default="metadata_output.json")
    args = parser.parse_args()

    missing = [k for k, v in vars(args).items() if v is None and k != "output_file"]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    collect(
        project_id=args.project_id,
        output_file=args.output_file,
    )


if __name__ == "__main__":
    main()
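The type-normalisation step in the metadata collector above can be sketched in isolation. This is an illustrative excerpt, not part of the scripts in this commit; the mapping below is a subset of the full `BQ_TYPE_MAP`:

```python
# Excerpt of the BigQuery → canonical type mapping; unmapped types fall
# back to the upper-cased input name.
BQ_TYPE_MAP = {"INT64": "INTEGER", "STRING": "VARCHAR", "BOOL": "BOOLEAN"}


def map_bq_type(bq_type: str) -> str:
    # Normalise case before the lookup so "int64" and "INT64" map the same way
    return BQ_TYPE_MAP.get(bq_type.upper(), bq_type.upper())


mapped = map_bq_type("int64")        # mapped type
passthrough = map_bq_type("geography")  # no entry in the excerpt: falls through
```

The fallback means new BigQuery types degrade gracefully instead of raising.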
@@ -0,0 +1,164 @@
"""
BigQuery — Query Log Collection (collect only)
================================================
Collects completed job query logs from BigQuery job history and writes them to
a JSON manifest file for later push to Monte Carlo.

Can be run standalone via CLI or imported (use the ``collect()`` function).

Substitution points (search for "← SUBSTITUTE"):
  - BIGQUERY_PROJECT_ID : GCP project ID to collect query logs from
  - GOOGLE_APPLICATION_CREDENTIALS : path to service-account JSON key file
  - LOOKBACK_HOURS : how many hours back to collect (default 25, skip last 1 h)
  - STATEMENT_TYPE_FILTER : restrict to specific statement types, or leave empty for all
  - MAX_JOBS : cap on number of jobs to collect per run

Prerequisites:
    pip install google-cloud-bigquery
"""

from __future__ import annotations

import argparse
import json
import logging
import os
from datetime import datetime, timedelta, timezone

from google.cloud import bigquery

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

LOG_TYPE = "bigquery"

# Collect jobs from [now - LOOKBACK_HOURS] to [now - LOOKBACK_LAG_HOURS].
# The lag avoids collecting in-flight jobs that have not yet completed.
LOOKBACK_HOURS: int = int(os.getenv("LOOKBACK_HOURS", "25"))  # ← SUBSTITUTE
LOOKBACK_LAG_HOURS: int = int(os.getenv("LOOKBACK_LAG_HOURS", "1"))  # ← SUBSTITUTE

# Limit statement types — e.g. ["SELECT", "CREATE_TABLE_AS_SELECT", "INSERT"]
# Set to an empty list to collect all statement types.
STATEMENT_TYPE_FILTER: list[str] = []  # ← SUBSTITUTE

# Maximum number of jobs to collect in a single run to avoid runaway costs
MAX_JOBS: int = int(os.getenv("MAX_JOBS", "10000"))  # ← SUBSTITUTE


def _safe_isoformat(dt: datetime | None) -> str | None:
    if dt is None:
        return None
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.isoformat()


def _collect_query_logs(
    bq_client: bigquery.Client,
    project_id: str,
    start_dt: datetime,
    end_dt: datetime,
) -> list[dict]:
    """Collect query logs from BigQuery job history and return as a list of dicts."""
    entries: list[dict] = []

    log.info(
        "Listing jobs for project=%s from %s to %s",
        project_id, start_dt.isoformat(), end_dt.isoformat(),
    )

    for job in bq_client.list_jobs(
        project=project_id,
        all_users=True,
        min_creation_time=start_dt,
        max_creation_time=end_dt,
    ):
        # Only process query jobs that have SQL text
        sql: str = getattr(job, "query", None) or ""
        if not sql.strip():
            continue

        statement_type: str = getattr(job, "statement_type", None) or ""
        if STATEMENT_TYPE_FILTER and statement_type not in STATEMENT_TYPE_FILTER:
            continue  # ← SUBSTITUTE: adjust filter as needed

        total_bytes_billed: int | None = getattr(job, "total_bytes_billed", None)

        entries.append(
            {
                "query_id": job.job_id,
                "query_text": sql,
                "start_time": _safe_isoformat(getattr(job, "created", None)),
                "end_time": _safe_isoformat(getattr(job, "ended", None)),
                "user": getattr(job, "user_email", None),
                "total_bytes_billed": total_bytes_billed,
                "statement_type": statement_type or None,
            }
        )

        if len(entries) >= MAX_JOBS:
            log.warning("Reached MAX_JOBS=%d — stopping early", MAX_JOBS)
            break

    return entries


def collect(
    project_id: str,
    lookback_hours: int = LOOKBACK_HOURS,
    lookback_lag_hours: int = LOOKBACK_LAG_HOURS,
    output_file: str = "query_logs_output.json",
) -> dict:
    """
    Connect to BigQuery, collect query logs, and write a JSON manifest.

    Returns the manifest dict.
    """
    bq_client = bigquery.Client(project=project_id)  # ← SUBSTITUTE: adjust auth if needed

    end_dt = datetime.now(timezone.utc) - timedelta(hours=lookback_lag_hours)
    start_dt = end_dt - timedelta(hours=lookback_hours)

    entries = _collect_query_logs(bq_client, project_id, start_dt, end_dt)
    log.info("Collected %d query log entries.", len(entries))

    manifest = {
        "log_type": LOG_TYPE,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "window_start": start_dt.isoformat(),
        "window_end": end_dt.isoformat(),
        "query_log_count": len(entries),
        "queries": entries,
    }
    with open(output_file, "w") as fh:
        json.dump(manifest, fh, indent=2)
    log.info("Query log manifest written to %s", output_file)

    return manifest


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Collect BigQuery query logs and write to a manifest file",
    )
    parser.add_argument("--project-id", default=os.getenv("BIGQUERY_PROJECT_ID"))  # ← SUBSTITUTE
    parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
    parser.add_argument("--lookback-lag-hours", type=int, default=LOOKBACK_LAG_HOURS)
    parser.add_argument("--output-file", default="query_logs_output.json")
    args = parser.parse_args()

    required = ["project_id"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    collect(
        project_id=args.project_id,
        lookback_hours=args.lookback_hours,
        lookback_lag_hours=args.lookback_lag_hours,
        output_file=args.output_file,
    )


if __name__ == "__main__":
    main()
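The lagged collection window used by the query-log collector can be sketched in isolation: jobs are listed between `[now - LOOKBACK_HOURS]` and `[now - LOOKBACK_LAG_HOURS]`, so jobs created in the last hour (which may still be running) are skipped. A minimal, self-contained sketch with a fixed "now" for illustration:

```python
from datetime import datetime, timedelta, timezone

LOOKBACK_HOURS = 25
LOOKBACK_LAG_HOURS = 1

# Fixed "now" so the example is deterministic; the script uses datetime.now()
now = datetime(2026, 2, 8, 12, 0, tzinfo=timezone.utc)
end_dt = now - timedelta(hours=LOOKBACK_LAG_HOURS)
start_dt = end_dt - timedelta(hours=LOOKBACK_HOURS)
```

The 25-hour window with a daily schedule gives a one-hour overlap between consecutive runs, so no completed job falls through the cracks.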
@@ -0,0 +1,198 @@
"""
BigQuery — Lineage Push (push only)
====================================
Reads a manifest file produced by ``collect_lineage.py`` and pushes the lineage
events to Monte Carlo using the pycarlo push ingestion API. Large payloads are
split into batches to stay under the 1 MB compressed limit.

Can be run standalone via CLI or imported (use the ``push()`` function).

Substitution points (search for "← SUBSTITUTE"):
  - MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
  - MCD_RESOURCE_UUID : UUID of the BigQuery connection in Monte Carlo

Prerequisites:
    pip install pycarlo
"""

from __future__ import annotations

import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone

from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
    LineageAssetRef,
    LineageEvent,
)

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

RESOURCE_TYPE = "bigquery"

# Maximum events per batch — conservative default to keep compressed payload under 1 MB
# ← SUBSTITUTE: tune based on average edge complexity (number of sources per event)
_BATCH_SIZE = 500


def _make_ref(database: str, schema: str, table: str) -> LineageAssetRef:
    return LineageAssetRef(
        type="TABLE",
        name=table,
        database=database,
        schema=schema,
    )


def _build_events(edges: list[dict]) -> list[LineageEvent]:
    """Build LineageEvent objects from manifest edge dicts."""
    events = []
    for edge in edges:
        dest = edge["destination"]
        sources = edge.get("sources", [])
        if not sources:
            continue
        events.append(
            LineageEvent(
                destination=_make_ref(dest["database"], dest["schema"], dest["table"]),
                sources=[
                    _make_ref(s["database"], s["schema"], s["table"])
                    for s in sources
                ],
            )
        )
    return events


def push(
    input_file: str,
    resource_uuid: str,
    key_id: str,
    key_token: str,
    batch_size: int = _BATCH_SIZE,
    output_file: str = "lineage_push_result.json",
) -> dict:
    """
    Read a lineage manifest and push events to Monte Carlo in batches.

    Returns a result dict with invocation IDs for each batch.
    """
    with open(input_file) as fh:
        manifest = json.load(fh)

    edges = manifest.get("edges", [])
    resource_type = manifest.get("resource_type", RESOURCE_TYPE)
    events = _build_events(edges)
    log.info("Loaded %d lineage event(s) from %s", len(events), input_file)

    if not events:
        log.info("No lineage events to push.")
        push_result = {
            "resource_uuid": resource_uuid,
            "resource_type": resource_type,
            "invocation_ids": [],
            "pushed_at": datetime.now(timezone.utc).isoformat(),
            "total_events": 0,
            "batch_count": 0,
            "batch_size": batch_size,
        }
        with open(output_file, "w") as fh:
            json.dump(push_result, fh, indent=2)
        return push_result

    # Split into batches
    batches = []
    for i in range(0, len(events), batch_size):
        batches.append(events[i : i + batch_size])
    total_batches = len(batches)

    def _push_batch(batch: list, batch_num: int) -> str | None:
        """Push a single batch using a dedicated Session (thread-safe)."""
        log.info("Pushing batch %d/%d (%d events) ...", batch_num, total_batches, len(batch))
        client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
        service = IngestionService(mc_client=client)
        result = service.send_lineage(
            resource_uuid=resource_uuid,
            resource_type=resource_type,
            events=batch,
        )
        invocation_id = service.extract_invocation_id(result)
        if invocation_id:
            log.info("  Batch %d: invocation_id=%s", batch_num, invocation_id)
        return invocation_id

    # Push batches in parallel (each thread gets its own pycarlo Session)
    max_workers = min(4, total_batches)
    invocation_ids: list[str | None] = [None] * total_batches

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(_push_batch, batch, i + 1): i
            for i, batch in enumerate(batches)
        }
        for future in as_completed(futures):
            idx = futures[future]
            try:
                invocation_ids[idx] = future.result()
            except Exception as exc:
                log.error("ERROR pushing batch %d: %s", idx + 1, exc)
                raise

    log.info("All %d batches pushed (%d workers)", total_batches, max_workers)

    push_result = {
        "resource_uuid": resource_uuid,
        "resource_type": resource_type,
        "invocation_ids": invocation_ids,
        "pushed_at": datetime.now(timezone.utc).isoformat(),
        "total_events": len(events),
        "batch_count": total_batches,
        "batch_size": batch_size,
    }
    with open(output_file, "w") as fh:
        json.dump(push_result, fh, indent=2)
    log.info("Push result written to %s", output_file)

    return push_result


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Push BigQuery lineage from a manifest to Monte Carlo",
    )
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--input-file", default="lineage_output.json")
    parser.add_argument("--output-file", default="lineage_push_result.json")
    parser.add_argument(
        "--batch-size",
        type=int,
        default=_BATCH_SIZE,
        help=f"Max events per push batch (default: {_BATCH_SIZE})",
    )
    args = parser.parse_args()

    required = ["resource_uuid", "key_id", "key_token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    push(
        input_file=args.input_file,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
        output_file=args.output_file,
    )


if __name__ == "__main__":
    main()
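The batch-splitting step shared by the push scripts is a plain chunking of an ordered list: events are cut into runs of at most `batch_size` items, with the last batch possibly shorter. A minimal, self-contained sketch (the helper name `split_batches` is illustrative, not from the scripts):

```python
def split_batches(events: list, batch_size: int) -> list[list]:
    """Chunk a flat list into consecutive batches of at most batch_size items."""
    return [events[i : i + batch_size] for i in range(0, len(events), batch_size)]


# Seven events with batch_size 3 yield two full batches plus one remainder
batches = split_batches(list(range(7)), 3)
```

Keeping batches ordered lets each batch number map back to a stable index in the `invocation_ids` result list, even when batches complete out of order across threads.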
@@ -0,0 +1,193 @@
"""
BigQuery — Metadata Push (push only)
=====================================
Reads a manifest file produced by ``collect_metadata.py`` and pushes the assets
to Monte Carlo using the pycarlo push ingestion API. Large payloads are split
into batches to stay under the 1 MB compressed limit.

Can be run standalone via CLI or imported (use the ``push()`` function).

Substitution points (search for "← SUBSTITUTE"):
  - MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
  - MCD_RESOURCE_UUID : UUID of the BigQuery connection in Monte Carlo

Prerequisites:
    pip install pycarlo
"""

from __future__ import annotations

import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone

from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
    AssetField,
    AssetFreshness,
    AssetMetadata,
    AssetVolume,
    RelationalAsset,
)

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

RESOURCE_TYPE = "bigquery"

# Maximum assets per batch — conservative default to keep compressed payload under 1 MB
# ← SUBSTITUTE: tune based on average asset size (fields per table, description length, etc.)
_BATCH_SIZE = 500


def _asset_from_dict(d: dict) -> RelationalAsset:
    """Reconstruct a RelationalAsset from a manifest dict entry."""
    fields = [
        AssetField(
            name=f["name"],
            type=f.get("type"),
            description=f.get("description"),
        )
        for f in d.get("fields", [])
    ]

    volume = None
    if d.get("volume"):
        volume = AssetVolume(
            row_count=d["volume"].get("row_count"),
            byte_count=d["volume"].get("byte_count"),
        )

    freshness = None
    if d.get("freshness"):
        freshness = AssetFreshness(
            last_update_time=d["freshness"].get("last_update_time"),
        )

    return RelationalAsset(
        type=d.get("type", "TABLE"),
        metadata=AssetMetadata(
            name=d["name"],
            database=d["database"],  # ← SUBSTITUTE: use project or dataset as database
            schema=d["schema"],
            description=d.get("description"),
        ),
        fields=fields,
        volume=volume,
        freshness=freshness,
    )


def push(
    input_file: str,
    resource_uuid: str,
    key_id: str,
    key_token: str,
    batch_size: int = _BATCH_SIZE,
    output_file: str = "metadata_push_result.json",
) -> dict:
    """
    Read a metadata manifest and push assets to Monte Carlo in batches.

    Returns a result dict with invocation IDs for each batch.
    """
    with open(input_file) as fh:
        manifest = json.load(fh)

    asset_dicts = manifest.get("assets", [])
    resource_type = manifest.get("resource_type", RESOURCE_TYPE)
    assets = [_asset_from_dict(d) for d in asset_dicts]
    log.info("Loaded %d asset(s) from %s", len(assets), input_file)

    # Split into batches (an empty manifest still yields one empty batch)
    batches = []
    for i in range(0, max(len(assets), 1), batch_size):
        batches.append(assets[i : i + batch_size])
    total_batches = len(batches)

    def _push_batch(batch: list, batch_num: int) -> str | None:
        """Push a single batch using a dedicated Session (thread-safe)."""
        client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
        service = IngestionService(mc_client=client)
        result = service.send_metadata(
            resource_uuid=resource_uuid,
            resource_type=resource_type,
            events=batch,
        )
        invocation_id = service.extract_invocation_id(result)
        log.info("Pushed batch %d/%d (%d assets) — invocation_id=%s", batch_num, total_batches, len(batch), invocation_id)
        return invocation_id

    # Push batches in parallel (each thread gets its own pycarlo Session)
    max_workers = min(4, total_batches)
    invocation_ids: list[str | None] = [None] * total_batches

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(_push_batch, batch, i + 1): i
            for i, batch in enumerate(batches)
        }
        for future in as_completed(futures):
            idx = futures[future]
            try:
                invocation_ids[idx] = future.result()
            except Exception as exc:
                log.error("ERROR pushing batch %d: %s", idx + 1, exc)
                raise

    log.info("All %d batches pushed (%d workers)", total_batches, max_workers)

    push_result = {
        "resource_uuid": resource_uuid,
        "resource_type": resource_type,
        "invocation_ids": invocation_ids,
        "pushed_at": datetime.now(timezone.utc).isoformat(),
        "total_assets": len(assets),
        "batch_count": total_batches,
        "batch_size": batch_size,
    }
    with open(output_file, "w") as fh:
        json.dump(push_result, fh, indent=2)
    log.info("Push result written to %s", output_file)

    return push_result


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Push BigQuery metadata from a manifest to Monte Carlo",
    )
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--input-file", default="metadata_output.json")
    parser.add_argument("--output-file", default="metadata_push_result.json")
    parser.add_argument(
        "--batch-size",
        type=int,
        default=_BATCH_SIZE,
        help=f"Max assets per push batch (default: {_BATCH_SIZE})",
    )
    args = parser.parse_args()

    required = ["resource_uuid", "key_id", "key_token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    push(
        input_file=args.input_file,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
        output_file=args.output_file,
    )


if __name__ == "__main__":
    main()
@@ -0,0 +1,207 @@
|
||||
"""
|
||||
BigQuery — Query Log Push (push only)
|
||||
======================================
|
||||
Reads a manifest file produced by ``collect_query_logs.py`` and pushes the query
|
||||
log entries to Monte Carlo using the pycarlo push ingestion API. Large payloads
|
||||
are split into batches to stay under the 1 MB compressed limit.
|
||||
|
||||
Can be run standalone via CLI or imported (use the ``push()`` function).
|
||||
|
||||
Substitution points (search for "← SUBSTITUTE"):
|
||||
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
|
||||
- MCD_RESOURCE_UUID : UUID of the BigQuery connection in Monte Carlo
|
||||
|
||||
Prerequisites:
|
||||
pip install pycarlo
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
from datetime import datetime, timezone
|
||||
|
||||
from dateutil.parser import isoparse
|
||||
from pycarlo.core import Client, Session
|
||||
from pycarlo.features.ingestion import IngestionService
|
||||
from pycarlo.features.ingestion.models import QueryLogEntry
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
LOG_TYPE = "bigquery"
|
||||
|
||||
# Maximum entries per batch — conservative default to keep compressed payload under 1 MB.
|
||||
# Query logs include full SQL text — keep batches small to stay under the 1 MB
|
||||
# compressed payload limit. 50 entries can trigger 413 on active warehouses.
|
||||
# ← SUBSTITUTE: tune based on average query length
|
||||
_BATCH_SIZE = 100
|
||||
|
||||
# Truncate query_text longer than this to prevent 413 errors.
|
||||
# Some SQL statements (e.g., generated by BI tools) can be 100KB+ and blow up
|
||||
# compressed payloads even at small batch sizes.
|
||||
_MAX_QUERY_TEXT_LEN = 10_000
|
||||
|
||||
|
||||
def _build_query_log_entries(queries: list[dict]) -> list[QueryLogEntry]:
|
||||
"""Convert manifest query dicts into QueryLogEntry objects."""
|
||||
entries = []
|
||||
truncated = 0
|
||||
for q in queries:
|
||||
query_text = q.get("query_text") or ""
|
||||
|
||||
# Truncate very long SQL to prevent 413 Request Too Large
|
||||
if len(query_text) > _MAX_QUERY_TEXT_LEN:
|
||||
query_text = query_text[:_MAX_QUERY_TEXT_LEN] + "... [TRUNCATED]"
|
||||
truncated += 1
|
||||
|
||||
extra = {}
|
||||
if q.get("total_bytes_billed") is not None:
|
||||
extra["total_bytes_billed"] = q["total_bytes_billed"]
|
||||
if q.get("statement_type") is not None:
|
||||
extra["statement_type"] = q["statement_type"]
|
||||
|
||||
start_time = q.get("start_time")
|
||||
end_time = q.get("end_time")
|
||||
|
||||
entry = QueryLogEntry(
|
||||
query_id=q.get("query_id"),
|
||||
query_text=query_text,
|
||||
start_time=isoparse(start_time) if start_time else None,
|
||||
end_time=isoparse(end_time) if end_time else None,
|
||||
user=q.get("user"),
|
||||
extra=extra or None,
|
||||
)
|
||||
entries.append(entry)
|
||||
if truncated:
|
||||
log.info("Truncated %d query text(s) exceeding %d chars", truncated, _MAX_QUERY_TEXT_LEN)
|
||||
return entries
|
||||
|
||||
|
||||
def push(
|
||||
input_file: str,
|
||||
resource_uuid: str,
|
||||
key_id: str,
|
||||
key_token: str,
|
||||
batch_size: int = _BATCH_SIZE,
|
||||
output_file: str = "query_logs_push_result.json",
|
||||
) -> dict:
|
||||
"""
|
||||
Read a query log manifest and push entries to Monte Carlo in batches.
|
||||
|
||||
Returns a result dict with invocation IDs for each batch.
|
||||
"""
|
||||
with open(input_file) as fh:
|
||||
manifest = json.load(fh)
|
||||
|
||||
queries = manifest.get("queries", [])
|
||||
log_type = manifest.get("log_type", LOG_TYPE)
|
||||
    entries = _build_query_log_entries(queries)
    log.info("Loaded %d query log entry/entries from %s", len(entries), input_file)

    if not entries:
        log.info("No query log entries to push.")
        push_result = {
            "resource_uuid": resource_uuid,
            "log_type": log_type,
            "invocation_ids": [],
            "pushed_at": datetime.now(timezone.utc).isoformat(),
            "total_entries": 0,
            "batch_count": 0,
            "batch_size": batch_size,
        }
        with open(output_file, "w") as fh:
            json.dump(push_result, fh, indent=2)
        return push_result

    # Split into batches
    batches = []
    for i in range(0, len(entries), batch_size):
        batches.append(entries[i : i + batch_size])
    total_batches = len(batches)

    def _push_batch(batch: list, batch_num: int) -> str | None:
        """Push a single batch using a dedicated Session (thread-safe)."""
        client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
        service = IngestionService(mc_client=client)
        result = service.send_query_logs(
            resource_uuid=resource_uuid,
            log_type=log_type,
            events=batch,
        )
        invocation_id = service.extract_invocation_id(result)
        log.info("Pushed batch %d/%d (%d entries) — invocation_id=%s", batch_num, total_batches, len(batch), invocation_id)
        return invocation_id

    # Push batches in parallel (each thread gets its own pycarlo Session)
    max_workers = min(4, total_batches)
    invocation_ids: list[str | None] = [None] * total_batches

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(_push_batch, batch, i + 1): i
            for i, batch in enumerate(batches)
        }
        for future in as_completed(futures):
            idx = futures[future]
            try:
                invocation_ids[idx] = future.result()
            except Exception as exc:
                log.error("ERROR pushing batch %d: %s", idx + 1, exc)
                raise

    log.info("All %d batches pushed (%d workers)", total_batches, max_workers)

    push_result = {
        "resource_uuid": resource_uuid,
        "log_type": log_type,
        "invocation_ids": invocation_ids,
        "pushed_at": datetime.now(timezone.utc).isoformat(),
        "total_entries": len(entries),
        "batch_count": total_batches,
        "batch_size": batch_size,
    }
    with open(output_file, "w") as fh:
        json.dump(push_result, fh, indent=2)
    log.info("Push result written to %s", output_file)

    return push_result


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Push BigQuery query logs from a manifest to Monte Carlo",
    )
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--input-file", default="query_logs_output.json")
    parser.add_argument("--output-file", default="query_logs_push_result.json")
    parser.add_argument(
        "--batch-size",
        type=int,
        default=_BATCH_SIZE,
        help=f"Max entries per push batch (default: {_BATCH_SIZE})",
    )
    args = parser.parse_args()

    required = ["resource_uuid", "key_id", "key_token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    push(
        input_file=args.input_file,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
        output_file=args.output_file,
    )


if __name__ == "__main__":
    main()
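The batch-splitting step in the push script above is a plain slicing pattern. A minimal standalone sketch of the same idea (the helper name `split_into_batches` is illustrative, not from the script or from pycarlo):

```python
def split_into_batches(entries: list, batch_size: int) -> list[list]:
    """Chunk a list into consecutive slices of at most batch_size items each."""
    return [entries[i : i + batch_size] for i in range(0, len(entries), batch_size)]


batches = split_into_batches(list(range(10)), 4)
# 10 entries with batch_size=4 yields batches of sizes 4, 4, 2
print([len(b) for b in batches])  # -> [4, 4, 2]
```

The last batch may be shorter than `batch_size`; the script's `min(4, total_batches)` worker count then guarantees no idle threads when there are fewer batches than workers.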
@@ -0,0 +1,83 @@
"""
Databricks — Lineage Collect & Push (combined)
================================================
Collects table-level and (optionally) column-level lineage from Databricks Unity
Catalog system tables, then pushes them to Monte Carlo via the push ingestion API.

This script imports and calls collect() from collect_lineage and push() from
push_lineage, running both in sequence.

Substitution points (search for "← SUBSTITUTE"):
- DATABRICKS_HOST : workspace hostname
- DATABRICKS_HTTP_PATH : SQL warehouse HTTP path
- DATABRICKS_TOKEN : PAT or service-principal secret
- LOOKBACK_DAYS : how many days back to collect lineage (default 30)
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Databricks connection in Monte Carlo
- PUSH_BATCH_SIZE : number of events per API call (default 500)

Use the --column-lineage flag to also push column-level lineage (disabled by default).

Prerequisites:
    pip install databricks-sql-connector pycarlo
"""

from __future__ import annotations

import argparse
import logging
import os

from collect_lineage import LOOKBACK_DAYS, collect
from push_lineage import DEFAULT_BATCH_SIZE, push

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)


def main() -> None:
    parser = argparse.ArgumentParser(description="Collect and push Databricks lineage to Monte Carlo")
    parser.add_argument("--host", default=os.getenv("DATABRICKS_HOST"))  # ← SUBSTITUTE
    parser.add_argument("--http-path", default=os.getenv("DATABRICKS_HTTP_PATH"))  # ← SUBSTITUTE
    parser.add_argument("--token", default=os.getenv("DATABRICKS_TOKEN"))  # ← SUBSTITUTE
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--lookback-days", type=int, default=LOOKBACK_DAYS)
    parser.add_argument(
        "--column-lineage", action="store_true",
        help="Also collect column-level lineage (requires system.access.column_lineage access)",
    )
    parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
    parser.add_argument("--manifest", default="manifest_lineage.json")
    args = parser.parse_args()

    required = ["host", "http_path", "token", "resource_uuid", "key_id", "key_token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    log.info("Step 1: Collecting lineage …")
    collect(
        host=args.host,
        http_path=args.http_path,
        token=args.token,
        manifest_path=args.manifest,
        include_column_lineage=args.column_lineage,
        lookback_days=args.lookback_days,
    )

    log.info("Step 2: Pushing lineage to Monte Carlo …")
    push(
        manifest_path=args.manifest,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
    )

    log.info("Done — collect and push complete.")


if __name__ == "__main__":
    main()
@@ -0,0 +1,77 @@
"""
Databricks — Metadata Collect & Push (combined)
=================================================
Collects table schemas, row counts, and byte sizes from Databricks Unity Catalog,
then pushes them to Monte Carlo via the push ingestion API.

This script imports and calls collect() from collect_metadata and push() from
push_metadata, running both in sequence.

Substitution points (search for "← SUBSTITUTE"):
- DATABRICKS_HOST : workspace hostname (e.g. adb-1234.azuredatabricks.net)
- DATABRICKS_HTTP_PATH : SQL warehouse HTTP path (e.g. /sql/1.0/warehouses/abc123)
- DATABRICKS_TOKEN : personal access token or service-principal secret
- DATABRICKS_CATALOG : catalog to collect from (default: "hive_metastore" or "main")
- SCHEMA_EXCLUSIONS : schemas to skip
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Databricks connection in Monte Carlo
- PUSH_BATCH_SIZE : number of assets per API call (default 500)

Prerequisites:
    pip install databricks-sql-connector pycarlo
"""

from __future__ import annotations

import argparse
import logging
import os

from collect_metadata import collect
from push_metadata import DEFAULT_BATCH_SIZE, push

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)


def main() -> None:
    parser = argparse.ArgumentParser(description="Collect and push Databricks metadata to Monte Carlo")
    parser.add_argument("--host", default=os.getenv("DATABRICKS_HOST"))  # ← SUBSTITUTE
    parser.add_argument("--http-path", default=os.getenv("DATABRICKS_HTTP_PATH"))  # ← SUBSTITUTE
    parser.add_argument("--token", default=os.getenv("DATABRICKS_TOKEN"))  # ← SUBSTITUTE
    parser.add_argument("--catalog", default=os.getenv("DATABRICKS_CATALOG", "hive_metastore"))
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
    parser.add_argument("--manifest", default="manifest_metadata.json")
    args = parser.parse_args()

    required = ["host", "http_path", "token", "resource_uuid", "key_id", "key_token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    log.info("Step 1: Collecting metadata …")
    collect(
        host=args.host,
        http_path=args.http_path,
        token=args.token,
        catalog=args.catalog,
        manifest_path=args.manifest,
    )

    log.info("Step 2: Pushing metadata to Monte Carlo …")
    push(
        manifest_path=args.manifest,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
    )

    log.info("Done — collect and push complete.")


if __name__ == "__main__":
    main()
@@ -0,0 +1,83 @@
"""
Databricks — Query Log Collect & Push (combined)
==================================================
Collects finished query execution records from the Databricks system table
system.query.history and pushes them to Monte Carlo for query-pattern analysis,
lineage derivation, and usage attribution.

This script imports and calls collect() from collect_query_logs and push() from
push_query_logs, running both in sequence.

Substitution points (search for "← SUBSTITUTE"):
- DATABRICKS_HOST : workspace hostname
- DATABRICKS_HTTP_PATH : SQL warehouse HTTP path
- DATABRICKS_TOKEN : PAT or service-principal secret
- LOOKBACK_HOURS : hours back from [now - LAG_HOURS] to collect (default 25)
- LOOKBACK_LAG_HOURS : hours to lag behind now to avoid in-flight queries (default 1)
- MAX_ROWS : maximum query rows to collect per run (default 10000)
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Databricks connection in Monte Carlo
- PUSH_BATCH_SIZE : number of entries per API call (default 250)

Prerequisites:
    pip install databricks-sql-connector pycarlo
"""

from __future__ import annotations

import argparse
import logging
import os

from collect_query_logs import LOOKBACK_HOURS, LOOKBACK_LAG_HOURS, MAX_ROWS, collect
from push_query_logs import DEFAULT_BATCH_SIZE, push

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)


def main() -> None:
    parser = argparse.ArgumentParser(description="Collect and push Databricks query logs to Monte Carlo")
    parser.add_argument("--host", default=os.getenv("DATABRICKS_HOST"))  # ← SUBSTITUTE
    parser.add_argument("--http-path", default=os.getenv("DATABRICKS_HTTP_PATH"))  # ← SUBSTITUTE
    parser.add_argument("--token", default=os.getenv("DATABRICKS_TOKEN"))  # ← SUBSTITUTE
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
    parser.add_argument("--lookback-lag-hours", type=int, default=LOOKBACK_LAG_HOURS)
    parser.add_argument("--max-rows", type=int, default=MAX_ROWS)
    parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
    parser.add_argument("--manifest", default="manifest_query_logs.json")
    args = parser.parse_args()

    required = ["host", "http_path", "token", "resource_uuid", "key_id", "key_token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    log.info("Step 1: Collecting query logs …")
    collect(
        host=args.host,
        http_path=args.http_path,
        token=args.token,
        manifest_path=args.manifest,
        lookback_hours=args.lookback_hours,
        lookback_lag_hours=args.lookback_lag_hours,
        max_rows=args.max_rows,
    )

    log.info("Step 2: Pushing query logs to Monte Carlo …")
    push(
        manifest_path=args.manifest,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
    )

    log.info("Done — collect and push complete.")


if __name__ == "__main__":
    main()
@@ -0,0 +1,240 @@
"""
Databricks — Lineage Collection (collect-only)
================================================
Collects table-level and (optionally) column-level lineage from Databricks Unity
Catalog system tables (system.access.table_lineage and system.access.column_lineage).
No SQL parsing required — Databricks provides first-class lineage metadata.

Writes a JSON manifest file that can be consumed by push_lineage.py.

Substitution points (search for "← SUBSTITUTE"):
- DATABRICKS_HOST : workspace hostname
- DATABRICKS_HTTP_PATH : SQL warehouse HTTP path
- DATABRICKS_TOKEN : PAT or service-principal secret
- LOOKBACK_DAYS : how many days back to collect lineage (default 30)

Use the --column-lineage flag to also collect column-level lineage (disabled by default).

Prerequisites:
    pip install databricks-sql-connector
"""

from __future__ import annotations

import argparse
import json
import logging
import os
from datetime import datetime, timezone
from typing import Any

from databricks import sql

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

RESOURCE_TYPE = "databricks"
LOOKBACK_DAYS: int = int(os.getenv("LOOKBACK_DAYS", "30"))  # ← SUBSTITUTE


def _check_available_memory(min_gb: float = 2.0) -> None:
    """Warn if available memory is below the threshold."""
    try:
        if hasattr(os, "sysconf"):  # Linux / macOS
            page_size = os.sysconf("SC_PAGE_SIZE")
            avail_pages = os.sysconf("SC_AVPHYS_PAGES")
            avail_gb = (page_size * avail_pages) / (1024 ** 3)
        else:
            return  # Windows — skip check
    except (ValueError, OSError):
        return
    if avail_gb < min_gb:
        log.warning(
            "Only %.1f GB of memory available (minimum recommended: %.1f GB). "
            "Consider reducing the collection scope or increasing available memory.",
            avail_gb,
            min_gb,
        )


def _query(cursor: Any, sql_text: str) -> list[dict[str, Any]]:
    cursor.execute(sql_text)
    cols = [d[0] for d in cursor.description]
    rows = []
    while True:
        chunk = cursor.fetchmany(1000)
        if not chunk:
            break
        rows.extend(dict(zip(cols, row)) for row in chunk)
    return rows


def _parse_full_name(full_name: str) -> tuple[str, str, str]:
    """Split 'catalog.schema.table' into (catalog, schema, table)."""
    parts = (full_name or "").split(".")
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    if len(parts) == 2:
        return "", parts[0], parts[1]
    return "", "", full_name


def collect_table_lineage(cursor: Any, lookback_days: int) -> list[dict[str, Any]]:
    rows = _query(
        cursor,
        f"""
        SELECT DISTINCT
            source_table_full_name,
            target_table_full_name,
            created_by,
            MAX(event_time) AS last_seen
        FROM system.access.table_lineage
        WHERE event_time >= DATEADD(DAY, -{lookback_days}, CURRENT_TIMESTAMP())
          AND source_table_full_name IS NOT NULL
          AND target_table_full_name IS NOT NULL
        GROUP BY source_table_full_name, target_table_full_name, created_by
        LIMIT 50000
        """,  # ← SUBSTITUTE: adjust lookback_days, LIMIT, or add catalog/schema filters
    )

    events: list[dict[str, Any]] = []
    for row in rows:
        src_catalog, src_schema, src_table = _parse_full_name(row["source_table_full_name"])
        dst_catalog, dst_schema, dst_table = _parse_full_name(row["target_table_full_name"])

        if not src_table or not dst_table:
            continue

        events.append({
            "sources": [{"database": src_catalog, "schema": src_schema, "asset_name": src_table}],
            "destination": {"database": dst_catalog, "schema": dst_schema, "asset_name": dst_table},
            "lineage_type": "table",
        })
    return events


def collect_column_lineage(cursor: Any, lookback_days: int) -> list[dict[str, Any]]:
    rows = _query(
        cursor,
        f"""
        SELECT DISTINCT
            source_table_full_name,
            source_column_name,
            target_table_full_name,
            target_column_name
        FROM system.access.column_lineage
        WHERE event_time >= DATEADD(DAY, -{lookback_days}, CURRENT_TIMESTAMP())
          AND source_table_full_name IS NOT NULL
          AND target_table_full_name IS NOT NULL
        LIMIT 50000
        """,  # ← SUBSTITUTE: adjust LIMIT or add catalog/schema filters if needed
    )

    # Group by destination table so we can build one event per destination
    grouped: dict[str, dict[str, Any]] = {}
    for row in rows:
        dst_key = row["target_table_full_name"]
        if dst_key not in grouped:
            grouped[dst_key] = {"dst_full": dst_key, "columns": []}
        grouped[dst_key]["columns"].append(row)

    events: list[dict[str, Any]] = []
    for dst_key, group in grouped.items():
        dst_catalog, dst_schema, dst_table = _parse_full_name(group["dst_full"])
        if not dst_table:
            continue

        col_fields: list[dict[str, Any]] = []
        for row in group["columns"]:
            src_catalog, src_schema, src_table = _parse_full_name(row["source_table_full_name"])
            col_fields.append({
                "destination_field": row["target_column_name"],
                "sources": [{
                    "database": src_catalog,
                    "schema": src_schema,
                    "asset_name": src_table,
                    "field": row["source_column_name"],
                }],
            })

        events.append({
            "sources": [],  # column lineage carries source refs inside col_fields
            "destination": {"database": dst_catalog, "schema": dst_schema, "asset_name": dst_table},
            "column_lineage": col_fields,
            "lineage_type": "column",
        })
    return events


def collect(
    host: str,
    http_path: str,
    token: str,
    manifest_path: str = "manifest_lineage.json",
    include_column_lineage: bool = False,
    lookback_days: int = LOOKBACK_DAYS,
) -> list[dict[str, Any]]:
    """Connect to Databricks, collect lineage, write a JSON manifest, and return events."""
    _check_available_memory(min_gb=2.0)
    collected_at = datetime.now(timezone.utc).isoformat()

    with sql.connect(
        server_hostname=host,  # ← SUBSTITUTE
        http_path=http_path,  # ← SUBSTITUTE
        access_token=token,  # ← SUBSTITUTE
    ) as conn:
        with conn.cursor() as cursor:
            table_events = collect_table_lineage(cursor, lookback_days)
            col_events = collect_column_lineage(cursor, lookback_days) if include_column_lineage else []

    all_events = table_events + col_events
    log.info(
        "Collected %d lineage events (%d table, %d column)",
        len(all_events), len(table_events), len(col_events),
    )

    manifest = {
        "resource_type": RESOURCE_TYPE,
        "collected_at": collected_at,
        "lookback_days": lookback_days,
        "table_lineage_events": len(table_events),
        "column_lineage_events": len(col_events),
        "events": all_events,
    }
    with open(manifest_path, "w") as fh:
        json.dump(manifest, fh, indent=2)
    log.info("Manifest written to %s (%d events)", manifest_path, len(all_events))

    return all_events


def main() -> None:
    parser = argparse.ArgumentParser(description="Collect Databricks lineage to a manifest file")
    parser.add_argument("--host", default=os.getenv("DATABRICKS_HOST"))  # ← SUBSTITUTE
    parser.add_argument("--http-path", default=os.getenv("DATABRICKS_HTTP_PATH"))  # ← SUBSTITUTE
    parser.add_argument("--token", default=os.getenv("DATABRICKS_TOKEN"))  # ← SUBSTITUTE
    parser.add_argument("--lookback-days", type=int, default=LOOKBACK_DAYS)
    parser.add_argument(
        "--column-lineage", action="store_true",
        help="Also collect column-level lineage (requires system.access.column_lineage access)",
    )
    parser.add_argument("--manifest", default="manifest_lineage.json")
    args = parser.parse_args()

    required = ["host", "http_path", "token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    collect(
        host=args.host,
        http_path=args.http_path,
        token=args.token,
        manifest_path=args.manifest,
        include_column_lineage=args.column_lineage,
        lookback_days=args.lookback_days,
    )


if __name__ == "__main__":
    main()
@@ -0,0 +1,212 @@
"""
Databricks — Metadata Collection (collect-only)
=================================================
Collects table schemas, row counts, and byte sizes from Databricks Unity Catalog
using INFORMATION_SCHEMA and DESCRIBE DETAIL, then writes a JSON manifest file
that can be consumed by push_metadata.py.

Substitution points (search for "← SUBSTITUTE"):
- DATABRICKS_HOST : workspace hostname (e.g. adb-1234.azuredatabricks.net)
- DATABRICKS_HTTP_PATH : SQL warehouse HTTP path (e.g. /sql/1.0/warehouses/abc123)
- DATABRICKS_TOKEN : personal access token or service-principal secret
- DATABRICKS_CATALOG : catalog to collect from (default: "hive_metastore" or "main")
- SCHEMA_EXCLUSIONS : schemas to skip

Prerequisites:
    pip install databricks-sql-connector
"""

from __future__ import annotations

import argparse
import json
import logging
import os
from datetime import datetime, timezone
from typing import Any

from databricks import sql

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

RESOURCE_TYPE = "databricks"

# Schemas to skip across all catalogs
SCHEMA_EXCLUSIONS: set[str] = {  # ← SUBSTITUTE: add any internal schemas to skip
    "information_schema",
    "__databricks_internal",
}


def _check_available_memory(min_gb: float = 2.0) -> None:
    """Warn if available memory is below the threshold."""
    try:
        if hasattr(os, "sysconf"):  # Linux / macOS
            page_size = os.sysconf("SC_PAGE_SIZE")
            avail_pages = os.sysconf("SC_AVPHYS_PAGES")
            avail_gb = (page_size * avail_pages) / (1024 ** 3)
        else:
            return  # Windows — skip check
    except (ValueError, OSError):
        return
    if avail_gb < min_gb:
        log.warning(
            "Only %.1f GB of memory available (minimum recommended: %.1f GB). "
            "Consider reducing the collection scope or increasing available memory.",
            avail_gb,
            min_gb,
        )


def _query(cursor: Any, sql_text: str, params: tuple | None = None) -> list[dict[str, Any]]:
    cursor.execute(sql_text, params)
    cols = [d[0] for d in cursor.description]
    rows = []
    while True:
        chunk = cursor.fetchmany(1000)
        if not chunk:
            break
        rows.extend(dict(zip(cols, row)) for row in chunk)
    return rows


def collect_tables(cursor: Any, catalog: str) -> list[dict[str, Any]]:
    return _query(
        cursor,
        f"""
        SELECT table_catalog, table_schema, table_name, table_type, comment
        FROM {catalog}.information_schema.tables
        WHERE table_schema NOT IN ({", ".join(f"'{s}'" for s in SCHEMA_EXCLUSIONS)})
        ORDER BY table_schema, table_name
        """,  # ← SUBSTITUTE: add additional WHERE filters if needed
    )


def collect_columns(cursor: Any, catalog: str, schema: str, table: str) -> list[dict[str, Any]]:
    return _query(
        cursor,
        f"""
        SELECT column_name, data_type, comment
        FROM {catalog}.information_schema.columns
        WHERE table_schema = '{schema}' AND table_name = '{table}'
        ORDER BY ordinal_position
        """,
    )


def collect_detail(cursor: Any, catalog: str, schema: str, table: str) -> dict[str, Any] | None:
    try:
        rows = _query(cursor, f"DESCRIBE DETAIL `{catalog}`.`{schema}`.`{table}`")
        return rows[0] if rows else None
    except Exception:
        log.debug("DESCRIBE DETAIL failed for %s.%s.%s", catalog, schema, table, exc_info=True)
        return None


def collect(
    host: str,
    http_path: str,
    token: str,
    catalog: str,
    manifest_path: str = "manifest_metadata.json",
) -> list[dict[str, Any]]:
    """Connect to Databricks, collect metadata, write a JSON manifest, and return the asset dicts.

    The manifest contains serialised asset dicts that push_metadata.py can read.
    """
    _check_available_memory(min_gb=2.0)
    collected_at = datetime.now(timezone.utc).isoformat()
    assets: list[dict[str, Any]] = []

    with sql.connect(
        server_hostname=host,  # ← SUBSTITUTE
        http_path=http_path,  # ← SUBSTITUTE
        access_token=token,  # ← SUBSTITUTE
    ) as conn:
        with conn.cursor() as cursor:
            tables = collect_tables(cursor, catalog)
            log.info("Found %d tables in catalog %s", len(tables), catalog)

            for row in tables:
                schema = row["table_schema"]
                table_name = row["table_name"]

                columns = collect_columns(cursor, catalog, schema, table_name)
                fields = [
                    {
                        "name": col["column_name"],
                        "type": col["data_type"].upper(),
                        "description": col.get("comment") or None,
                    }
                    for col in columns
                ]

                detail = collect_detail(cursor, catalog, schema, table_name)
                row_count: int | None = None
                byte_count: int | None = None
                last_updated: str | None = None
                if detail:
                    row_count = detail.get("numRows")
                    byte_count = detail.get("sizeInBytes")
                    last_modified = detail.get("lastModified")
                    if last_modified:
                        last_updated = (
                            last_modified.isoformat()
                            if hasattr(last_modified, "isoformat")
                            else str(last_modified)
                        )

                asset = {
                    "asset_name": table_name,
                    "database": catalog,  # ← SUBSTITUTE: use catalog as database
                    "schema": schema,
                    "asset_type": "VIEW" if row.get("table_type", "").upper() == "VIEW" else "TABLE",
                    "description": row.get("comment") or None,
                    "fields": fields,
                    "row_count": row_count,
                    "byte_count": byte_count,
                    "last_updated": last_updated,
                }
                assets.append(asset)
                log.info("Collected %s.%s.%s", catalog, schema, table_name)

    manifest = {
        "resource_type": RESOURCE_TYPE,
        "collected_at": collected_at,
        "catalog": catalog,
        "asset_count": len(assets),
        "assets": assets,
    }
    with open(manifest_path, "w") as fh:
        json.dump(manifest, fh, indent=2)
    log.info("Manifest written to %s (%d assets)", manifest_path, len(assets))

    return assets


def main() -> None:
    parser = argparse.ArgumentParser(description="Collect Databricks metadata to a manifest file")
    parser.add_argument("--host", default=os.getenv("DATABRICKS_HOST"))  # ← SUBSTITUTE
    parser.add_argument("--http-path", default=os.getenv("DATABRICKS_HTTP_PATH"))  # ← SUBSTITUTE
    parser.add_argument("--token", default=os.getenv("DATABRICKS_TOKEN"))  # ← SUBSTITUTE
    parser.add_argument("--catalog", default=os.getenv("DATABRICKS_CATALOG", "hive_metastore"))
    parser.add_argument("--manifest", default="manifest_metadata.json")
    args = parser.parse_args()

    required = ["host", "http_path", "token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    collect(
        host=args.host,
        http_path=args.http_path,
        token=args.token,
        catalog=args.catalog,
        manifest_path=args.manifest,
    )


if __name__ == "__main__":
    main()
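The metadata collector above renders SCHEMA_EXCLUSIONS into its WHERE clause by string-joining quoted names. A minimal sketch of that rendering as a standalone helper (the function name `exclusion_clause` and the `sorted()` call are my additions for a deterministic result; the script itself joins the unordered set, and the quoting assumes schema names contain no single quotes):

```python
SCHEMA_EXCLUSIONS: set[str] = {"information_schema", "__databricks_internal"}


def exclusion_clause(schemas: set[str]) -> str:
    """Render a NOT IN predicate over schema names, sorted for stable output."""
    quoted = ", ".join(f"'{s}'" for s in sorted(schemas))
    return f"table_schema NOT IN ({quoted})"


print(exclusion_clause(SCHEMA_EXCLUSIONS))
# -> table_schema NOT IN ('__databricks_internal', 'information_schema')
```

Since the exclusion values come from the script's own constant rather than user input, f-string interpolation is acceptable here; values from untrusted sources would need parameter binding instead.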
@@ -0,0 +1,204 @@
|
||||
"""
|
||||
Databricks — Query Log Collection (collect-only)
|
||||
==================================================
|
||||
Collects finished query execution records from the Databricks system table
|
||||
system.query.history and writes a JSON manifest file that can be consumed
|
||||
by push_query_logs.py.
|
||||
|
||||
Substitution points (search for "← SUBSTITUTE"):
|
||||
- DATABRICKS_HOST : workspace hostname
|
||||
- DATABRICKS_HTTP_PATH : SQL warehouse HTTP path
|
||||
- DATABRICKS_TOKEN : PAT or service-principal secret
|
||||
- LOOKBACK_HOURS : hours back from [now - LAG_HOURS] to collect (default 25)
|
||||
- LOOKBACK_LAG_HOURS : hours to lag behind now to avoid in-flight queries (default 1)
- MAX_ROWS : maximum query rows to collect per run (default 10000)

Prerequisites:
    pip install databricks-sql-connector
"""

from __future__ import annotations

import argparse
import json
import logging
import os
from datetime import datetime, timezone
from typing import Any

from databricks import sql

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

LOG_TYPE = "databricks"

LOOKBACK_HOURS: int = int(os.getenv("LOOKBACK_HOURS", "25"))  # ← SUBSTITUTE
LOOKBACK_LAG_HOURS: int = int(os.getenv("LOOKBACK_LAG_HOURS", "1"))  # ← SUBSTITUTE
MAX_ROWS: int = int(os.getenv("MAX_ROWS", "10000"))  # ← SUBSTITUTE

_QUERY_LOG_SQL = """\
SELECT
    statement_id AS query_id,
    statement_text AS query_text,
    start_time,
    end_time,
    executed_by AS user_name,
    produced_rows AS returned_rows,
    total_task_duration_ms,
    read_rows,
    read_bytes
FROM system.query.history
WHERE start_time >= DATEADD(HOUR, -{lookback_hours}, NOW())
  AND start_time < DATEADD(HOUR, -{lag_hours}, NOW())
  AND status = 'FINISHED'
ORDER BY start_time
LIMIT {max_rows}
"""  # ← SUBSTITUTE: adjust status filter or add warehouse_id filter as needed


def _check_available_memory(min_gb: float = 2.0) -> None:
    """Warn if available memory is below the threshold."""
    try:
        if hasattr(os, "sysconf"):  # Linux / macOS
            page_size = os.sysconf("SC_PAGE_SIZE")
            avail_pages = os.sysconf("SC_AVPHYS_PAGES")
            avail_gb = (page_size * avail_pages) / (1024 ** 3)
        else:
            return  # Windows — skip check
    except (ValueError, OSError):
        return
    if avail_gb < min_gb:
        log.warning(
            "Only %.1f GB of memory available (minimum recommended: %.1f GB). "
            "Consider reducing the collection scope or increasing available memory.",
            avail_gb,
            min_gb,
        )


def _safe_isoformat(dt: Any) -> str | None:
    if dt is None:
        return None
    if hasattr(dt, "isoformat"):
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.isoformat()
    return str(dt)


def _query(cursor: Any, sql_text: str) -> list[dict[str, Any]]:
    cursor.execute(sql_text)
    cols = [d[0] for d in cursor.description]
    rows = []
    while True:
        chunk = cursor.fetchmany(1000)
        if not chunk:
            break
        rows.extend(dict(zip(cols, row)) for row in chunk)
    return rows


def collect_query_logs(
    cursor: Any,
    lookback_hours: int,
    lag_hours: int,
    max_rows: int,
) -> list[dict[str, Any]]:
    rendered_sql = _QUERY_LOG_SQL.format(
        lookback_hours=lookback_hours + lag_hours,  # offset from NOW() to cover the window
        lag_hours=lag_hours,
        max_rows=max_rows,
    )
    rows = _query(cursor, rendered_sql)
    log.info("Retrieved %d query log rows from system.query.history", len(rows))

    entries: list[dict[str, Any]] = []
    for row in rows:
        query_text: str = row.get("query_text") or ""
        if not query_text.strip():
            continue  # ← SUBSTITUTE: decide whether to skip empty-text rows

        entry = {
            "query_id": row.get("query_id"),
            "query_text": query_text,
            "start_time": _safe_isoformat(row.get("start_time")),
            "end_time": _safe_isoformat(row.get("end_time")),
            "user": row.get("user_name"),
            "returned_rows": row.get("returned_rows"),
            "total_task_duration_ms": row.get("total_task_duration_ms"),
            "read_rows": row.get("read_rows"),
            "read_bytes": row.get("read_bytes"),
        }
        entries.append(entry)

    return entries


def collect(
    host: str,
    http_path: str,
    token: str,
    manifest_path: str = "manifest_query_logs.json",
    lookback_hours: int = LOOKBACK_HOURS,
    lookback_lag_hours: int = LOOKBACK_LAG_HOURS,
    max_rows: int = MAX_ROWS,
) -> list[dict[str, Any]]:
    """Connect to Databricks, collect query logs, write a JSON manifest, and return entries."""
    _check_available_memory(min_gb=2.0)
    collected_at = datetime.now(timezone.utc).isoformat()

    with sql.connect(
        server_hostname=host,  # ← SUBSTITUTE
        http_path=http_path,  # ← SUBSTITUTE
        access_token=token,  # ← SUBSTITUTE
    ) as conn:
        with conn.cursor() as cursor:
            entries = collect_query_logs(cursor, lookback_hours, lookback_lag_hours, max_rows)

    log.info("Collected %d query log entries", len(entries))

    manifest = {
        "log_type": LOG_TYPE,
        "collected_at": collected_at,
        "lookback_hours": lookback_hours,
        "lookback_lag_hours": lookback_lag_hours,
        "query_log_count": len(entries),
        "entries": entries,
    }
    with open(manifest_path, "w") as fh:
        json.dump(manifest, fh, indent=2)
    log.info("Manifest written to %s (%d entries)", manifest_path, len(entries))

    return entries


def main() -> None:
    parser = argparse.ArgumentParser(description="Collect Databricks query logs to a manifest file")
    parser.add_argument("--host", default=os.getenv("DATABRICKS_HOST"))  # ← SUBSTITUTE
    parser.add_argument("--http-path", default=os.getenv("DATABRICKS_HTTP_PATH"))  # ← SUBSTITUTE
    parser.add_argument("--token", default=os.getenv("DATABRICKS_TOKEN"))  # ← SUBSTITUTE
    parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
    parser.add_argument("--lookback-lag-hours", type=int, default=LOOKBACK_LAG_HOURS)
    parser.add_argument("--max-rows", type=int, default=MAX_ROWS)
    parser.add_argument("--manifest", default="manifest_query_logs.json")
    args = parser.parse_args()

    required = ["host", "http_path", "token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    collect(
        host=args.host,
        http_path=args.http_path,
        token=args.token,
        manifest_path=args.manifest,
        lookback_hours=args.lookback_hours,
        lookback_lag_hours=args.lookback_lag_hours,
        max_rows=args.max_rows,
    )


if __name__ == "__main__":
    main()
@@ -0,0 +1,192 @@
"""
Databricks — Lineage Push (push-only)
=======================================
Reads a JSON manifest file produced by collect_lineage.py and pushes the lineage
events to Monte Carlo via the push ingestion API, with configurable batching to
keep compressed payloads under 1 MB.

Substitution points (search for "← SUBSTITUTE"):
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Databricks connection in Monte Carlo
- PUSH_BATCH_SIZE : number of events per API call (default 500)

Prerequisites:
    pip install pycarlo
"""

from __future__ import annotations

import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from typing import Any

from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
    ColumnLineageField,
    ColumnLineageSourceField,
    LineageAssetRef,
    LineageEvent,
)

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

RESOURCE_TYPE = "databricks"
DEFAULT_BATCH_SIZE = 500  # ← SUBSTITUTE: conservative default to stay under 1 MB compressed


def _ref_from_dict(d: dict[str, Any]) -> LineageAssetRef:
    database = d.get("database", "")
    schema = d.get("schema", "")
    name = d["asset_name"]
    return LineageAssetRef(
        type="TABLE",
        name=name,
        database=database,
        schema=schema,
        asset_id=f"{database}__{schema}__{name}",
    )


def _event_from_dict(d: dict[str, Any]) -> LineageEvent:
    """Reconstruct a LineageEvent from a manifest dict."""
    sources = [_ref_from_dict(s) for s in d.get("sources", [])]
    destination = _ref_from_dict(d["destination"])

    fields: list[ColumnLineageField] | None = None
    if d.get("column_lineage"):
        fields = []
        for cl in d["column_lineage"]:
            src_fields = []
            for s in cl.get("sources", []):
                asset_id = f"{s.get('database', '')}__{s.get('schema', '')}__{s['asset_name']}"
                src_fields.append(
                    ColumnLineageSourceField(
                        asset_id=asset_id,
                        field_name=s["field"],
                    )
                )
            fields.append(
                ColumnLineageField(
                    name=cl["destination_field"],
                    source_fields=src_fields,
                )
            )

    return LineageEvent(
        sources=sources,
        destination=destination,
        fields=fields,
    )


def push(
    manifest_path: str,
    resource_uuid: str,
    key_id: str,
    key_token: str,
    batch_size: int = DEFAULT_BATCH_SIZE,
) -> dict[str, Any]:
    """Read a collect manifest and push lineage events to Monte Carlo in batches.

    Returns a summary dict with invocation IDs and counts.
    """
    with open(manifest_path) as fh:
        manifest = json.load(fh)

    event_dicts: list[dict[str, Any]] = manifest["events"]
    events = [_event_from_dict(d) for d in event_dicts]
    log.info("Loaded %d lineage events from %s", len(events), manifest_path)

    # Split into batches
    batches = []
    for i in range(0, len(events), batch_size):
        batches.append(events[i : i + batch_size])
    total_batches = len(batches)

    def _push_batch(batch: list, batch_num: int) -> str | None:
        """Push a single batch using a dedicated Session (thread-safe)."""
        log.info("Pushing batch %d/%d (%d events) ...", batch_num, total_batches, len(batch))
        client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
        service = IngestionService(mc_client=client)
        result = service.send_lineage(
            resource_uuid=resource_uuid,
            resource_type=RESOURCE_TYPE,
            events=batch,
        )
        invocation_id = service.extract_invocation_id(result)
        if invocation_id:
            log.info("Batch %d: invocation_id=%s", batch_num, invocation_id)
        return invocation_id

    # Push batches in parallel (each thread gets its own pycarlo Session)
    max_workers = min(4, total_batches)
    invocation_ids: list[str | None] = [None] * total_batches

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(_push_batch, batch, i + 1): i
            for i, batch in enumerate(batches)
        }
        for future in as_completed(futures):
            idx = futures[future]
            try:
                invocation_ids[idx] = future.result()
            except Exception as exc:
                log.error("ERROR pushing batch %d: %s", idx + 1, exc)
                raise

    log.info("All %d batches pushed (%d workers)", total_batches, max_workers)

    pushed_at = datetime.now(timezone.utc).isoformat()
    summary = {
        "resource_uuid": resource_uuid,
        "resource_type": RESOURCE_TYPE,
        "invocation_ids": invocation_ids,
        "pushed_at": pushed_at,
        "event_count": len(events),
        "batch_count": total_batches,
        "batch_size": batch_size,
        "lookback_days": manifest.get("lookback_days"),
        "table_lineage_events": manifest.get("table_lineage_events"),
        "column_lineage_events": manifest.get("column_lineage_events"),
    }

    push_manifest_path = manifest_path.replace(".json", "_push_result.json")
    with open(push_manifest_path, "w") as fh:
        json.dump(summary, fh, indent=2)
    log.info("Push result written to %s", push_manifest_path)

    return summary


def main() -> None:
    parser = argparse.ArgumentParser(description="Push Databricks lineage to Monte Carlo from manifest")
    parser.add_argument("--manifest", default="manifest_lineage.json")
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
    args = parser.parse_args()

    required = ["resource_uuid", "key_id", "key_token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    push(
        manifest_path=args.manifest,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
    )


if __name__ == "__main__":
    main()
@@ -0,0 +1,178 @@
"""
Databricks — Metadata Push (push-only)
========================================
Reads a JSON manifest file produced by collect_metadata.py and pushes the assets
to Monte Carlo via the push ingestion API, with configurable batching to keep
compressed payloads under 1 MB.

Substitution points (search for "← SUBSTITUTE"):
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Databricks connection in Monte Carlo
- PUSH_BATCH_SIZE : number of assets per API call (default 500)

Prerequisites:
    pip install pycarlo
"""

from __future__ import annotations

import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from typing import Any

from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
    AssetField,
    AssetFreshness,
    AssetMetadata,
    AssetVolume,
    RelationalAsset,
)

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

RESOURCE_TYPE = "databricks"
DEFAULT_BATCH_SIZE = 500  # ← SUBSTITUTE: conservative default to stay under 1 MB compressed


def _asset_from_dict(d: dict[str, Any]) -> RelationalAsset:
    """Reconstruct a RelationalAsset from a manifest dict."""
    fields = [
        AssetField(
            name=f["name"],
            type=f.get("type"),
            description=f.get("description"),
        )
        for f in d.get("fields", [])
    ]

    volume = None
    if d.get("row_count") is not None or d.get("byte_count") is not None:
        volume = AssetVolume(row_count=d.get("row_count"), byte_count=d.get("byte_count"))

    freshness = None
    if d.get("last_updated") is not None:
        freshness = AssetFreshness(last_update_time=d.get("last_updated"))

    return RelationalAsset(
        type=d.get("asset_type", "TABLE"),
        metadata=AssetMetadata(
            name=d["asset_name"],
            database=d["database"],  # ← SUBSTITUTE: use catalog as database
            schema=d["schema"],
            description=d.get("description"),
        ),
        fields=fields,
        volume=volume,
        freshness=freshness,
    )


def push(
    manifest_path: str,
    resource_uuid: str,
    key_id: str,
    key_token: str,
    batch_size: int = DEFAULT_BATCH_SIZE,
) -> dict[str, Any]:
    """Read a collect manifest and push assets to Monte Carlo in batches.

    Returns a summary dict with invocation IDs and counts.
    """
    with open(manifest_path) as fh:
        manifest = json.load(fh)

    asset_dicts: list[dict[str, Any]] = manifest["assets"]
    assets = [_asset_from_dict(d) for d in asset_dicts]
    log.info("Loaded %d assets from %s", len(assets), manifest_path)

    # Split into batches. Note: max(len(assets), 1) produces a single empty batch
    # when there are no assets, so one push call is still made in that case.
    batches = []
    for i in range(0, max(len(assets), 1), batch_size):
        batches.append(assets[i : i + batch_size])
    total_batches = len(batches)

    def _push_batch(batch: list, batch_num: int) -> str | None:
        """Push a single batch using a dedicated Session (thread-safe)."""
        client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
        service = IngestionService(mc_client=client)
        result = service.send_metadata(
            resource_uuid=resource_uuid,
            resource_type=RESOURCE_TYPE,
            events=batch,
        )
        invocation_id = service.extract_invocation_id(result)
        log.info("Pushed batch %d/%d (%d assets) — invocation_id=%s", batch_num, total_batches, len(batch), invocation_id)
        return invocation_id

    # Push batches in parallel (each thread gets its own pycarlo Session)
    max_workers = min(4, total_batches)
    invocation_ids: list[str | None] = [None] * total_batches

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(_push_batch, batch, i + 1): i
            for i, batch in enumerate(batches)
        }
        for future in as_completed(futures):
            idx = futures[future]
            try:
                invocation_ids[idx] = future.result()
            except Exception as exc:
                log.error("ERROR pushing batch %d: %s", idx + 1, exc)
                raise

    log.info("All %d batches pushed (%d workers)", total_batches, max_workers)

    pushed_at = datetime.now(timezone.utc).isoformat()
    summary = {
        "resource_uuid": resource_uuid,
        "resource_type": RESOURCE_TYPE,
        "invocation_ids": invocation_ids,
        "pushed_at": pushed_at,
        "asset_count": len(assets),
        "batch_count": total_batches,
        "batch_size": batch_size,
        "catalog": manifest.get("catalog"),
    }

    # Write push result alongside the collect manifest
    push_manifest_path = manifest_path.replace(".json", "_push_result.json")
    with open(push_manifest_path, "w") as fh:
        json.dump(summary, fh, indent=2)
    log.info("Push result written to %s", push_manifest_path)

    return summary


def main() -> None:
    parser = argparse.ArgumentParser(description="Push Databricks metadata to Monte Carlo from manifest")
    parser.add_argument("--manifest", default="manifest_metadata.json")
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
    args = parser.parse_args()

    required = ["resource_uuid", "key_id", "key_token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    push(
        manifest_path=args.manifest,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
    )


if __name__ == "__main__":
    main()
@@ -0,0 +1,200 @@
"""
Databricks — Query Log Push (push-only)
=========================================
Reads a JSON manifest file produced by collect_query_logs.py and pushes the query
log entries to Monte Carlo via the push ingestion API, with configurable batching
to keep compressed payloads under 1 MB.

Substitution points (search for "← SUBSTITUTE"):
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Databricks connection in Monte Carlo
- PUSH_BATCH_SIZE : number of entries per API call (default 100)

Prerequisites:
    pip install pycarlo python-dateutil
"""

from __future__ import annotations

import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from typing import Any

from dateutil.parser import isoparse
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import QueryLogEntry

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

LOG_TYPE = "databricks"
DEFAULT_BATCH_SIZE = 100  # ← SUBSTITUTE: conservative default to stay under 1 MB compressed

# Truncate query_text longer than this to prevent 413 errors.
# Some SQL statements (e.g., generated by BI tools) can be 100KB+ and blow up
# compressed payloads even at small batch sizes.
_MAX_QUERY_TEXT_LEN = 10_000


def _build_query_log_entries(entry_dicts: list[dict[str, Any]]) -> list[QueryLogEntry]:
    """Convert manifest query dicts into QueryLogEntry objects."""
    entries = []
    truncated = 0
    for d in entry_dicts:
        query_text = d.get("query_text") or ""

        # Truncate very long SQL to prevent 413 Request Too Large
        if len(query_text) > _MAX_QUERY_TEXT_LEN:
            query_text = query_text[:_MAX_QUERY_TEXT_LEN] + "... [TRUNCATED]"
            truncated += 1

        extra = {}
        if d.get("total_task_duration_ms") is not None:
            extra["total_task_duration_ms"] = d["total_task_duration_ms"]
        if d.get("read_rows") is not None:
            extra["read_rows"] = d["read_rows"]
        if d.get("read_bytes") is not None:
            extra["read_bytes"] = d["read_bytes"]

        start_time = d.get("start_time")
        end_time = d.get("end_time")

        entries.append(
            QueryLogEntry(
                query_id=d.get("query_id"),
                query_text=query_text,
                start_time=isoparse(start_time) if start_time else None,
                end_time=isoparse(end_time) if end_time else None,
                user=d.get("user"),
                returned_rows=d.get("returned_rows"),
                extra=extra or None,
            )
        )
    if truncated:
        log.info("Truncated %d query text(s) exceeding %d chars", truncated, _MAX_QUERY_TEXT_LEN)
    return entries


def push(
    manifest_path: str,
    resource_uuid: str,
    key_id: str,
    key_token: str,
    batch_size: int = DEFAULT_BATCH_SIZE,
) -> dict[str, Any]:
    """Read a collect manifest and push query log entries to Monte Carlo in batches.

    Returns a summary dict with invocation IDs and counts.
    """
    with open(manifest_path) as fh:
        manifest = json.load(fh)

    entry_dicts: list[dict[str, Any]] = manifest["entries"]
    entries = _build_query_log_entries(entry_dicts)
    log.info("Loaded %d query log entries from %s", len(entries), manifest_path)

    if not entries:
        log.info("No query log entries to push.")
        summary = {
            "resource_uuid": resource_uuid,
            "log_type": LOG_TYPE,
            "invocation_ids": [],
            "pushed_at": datetime.now(timezone.utc).isoformat(),
            "query_log_count": 0,
            "batch_count": 0,
            "batch_size": batch_size,
        }
        push_manifest_path = manifest_path.replace(".json", "_push_result.json")
        with open(push_manifest_path, "w") as fh:
            json.dump(summary, fh, indent=2)
        return summary

    # Split into batches
    batches = []
    for i in range(0, len(entries), batch_size):
        batches.append(entries[i : i + batch_size])
    total_batches = len(batches)

    def _push_batch(batch: list, batch_num: int) -> str | None:
        """Push a single batch using a dedicated Session (thread-safe)."""
        client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
        service = IngestionService(mc_client=client)
        result = service.send_query_logs(
            resource_uuid=resource_uuid,
            log_type=LOG_TYPE,
            events=batch,
        )
        invocation_id = service.extract_invocation_id(result)
        log.info("Pushed batch %d/%d (%d entries) — invocation_id=%s", batch_num, total_batches, len(batch), invocation_id)
        return invocation_id

    # Push batches in parallel (each thread gets its own pycarlo Session)
    max_workers = min(4, total_batches)
    invocation_ids: list[str | None] = [None] * total_batches

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(_push_batch, batch, i + 1): i
            for i, batch in enumerate(batches)
        }
        for future in as_completed(futures):
            idx = futures[future]
            try:
                invocation_ids[idx] = future.result()
            except Exception as exc:
                log.error("ERROR pushing batch %d: %s", idx + 1, exc)
                raise

    log.info("All %d batches pushed (%d workers)", total_batches, max_workers)

    pushed_at = datetime.now(timezone.utc).isoformat()
    summary = {
        "resource_uuid": resource_uuid,
        "log_type": LOG_TYPE,
        "invocation_ids": invocation_ids,
        "pushed_at": pushed_at,
        "query_log_count": len(entries),
        "batch_count": total_batches,
        "batch_size": batch_size,
        "lookback_hours": manifest.get("lookback_hours"),
        "lookback_lag_hours": manifest.get("lookback_lag_hours"),
    }

    push_manifest_path = manifest_path.replace(".json", "_push_result.json")
    with open(push_manifest_path, "w") as fh:
        json.dump(summary, fh, indent=2)
    log.info("Push result written to %s", push_manifest_path)

    return summary


def main() -> None:
    parser = argparse.ArgumentParser(description="Push Databricks query logs to Monte Carlo from manifest")
    parser.add_argument("--manifest", default="manifest_query_logs.json")
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
    args = parser.parse_args()

    required = ["resource_uuid", "key_id", "key_token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    push(
        manifest_path=args.manifest,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
    )


if __name__ == "__main__":
    main()
@@ -0,0 +1,119 @@
#!/usr/bin/env python3
"""
Extract Hive lineage from a local log file and push it to Monte Carlo in one step.

Thin wrapper that calls ``collect()`` from ``collect_lineage`` followed by
``push()`` from ``push_lineage``, then writes the final manifest (with
``resource_uuid`` and ``invocation_id``) to ``--output-file``.

Substitution points
-------------------
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID for this connection
- --log-file : path to local HiveServer2 log

Prerequisites
-------------
    pip install pycarlo python-dotenv

Usage (table-level):
    python collect_and_push_lineage.py \\
        --key-id <MCD_INGEST_ID> \\
        --key-token <MCD_INGEST_TOKEN> \\
        --resource-uuid <MCD_RESOURCE_UUID> \\
        --log-file /tmp/root/hive.log

Usage (column-level):
    python collect_and_push_lineage.py ... --column-lineage
"""

import argparse
import json
import os

from collect_lineage import collect
from push_lineage import DEFAULT_BATCH_SIZE, DEFAULT_TIMEOUT_SECONDS, push


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Extract Hive lineage from a local log file and push to Monte Carlo",
    )
    # Collect args
    parser.add_argument(
        "--log-file",
        default="/tmp/root/hive.log",
        help="Path to local HiveServer2 log file (default: /tmp/root/hive.log)",  # ← SUBSTITUTE: your log path
    )
    # Push / MC args
    parser.add_argument(
        "--key-id",
        default=os.environ.get("MCD_INGEST_ID"),
        help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",
    )
    parser.add_argument(
        "--key-token",
        default=os.environ.get("MCD_INGEST_TOKEN"),
        help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",
    )
    parser.add_argument(
        "--resource-uuid",
        default=os.environ.get("MCD_RESOURCE_UUID"),
        help="Monte Carlo resource UUID for this Hive connection (env: MCD_RESOURCE_UUID)",
    )
    parser.add_argument(
        "--column-lineage",
        action="store_true",
        help="Push column-level lineage instead of table-level",
    )
    parser.add_argument(
        "--output-file",
        default="lineage_output.json",
        help="Path to write the lineage manifest (default: lineage_output.json)",
    )
    parser.add_argument(
        "--batch-size",
        type=int,
        default=DEFAULT_BATCH_SIZE,
        metavar="N",
        help=f"Max events per POST (default: {DEFAULT_BATCH_SIZE})",
    )
    parser.add_argument(
        "--timeout",
        type=int,
        default=DEFAULT_TIMEOUT_SECONDS,
        metavar="SEC",
        help=f"HTTP timeout per request in seconds (default: {DEFAULT_TIMEOUT_SECONDS})",
    )
    args = parser.parse_args()

    if not args.key_id or not args.key_token:
        parser.error("--key-id and --key-token are required (or set MCD_INGEST_ID / MCD_INGEST_TOKEN)")
    if not args.resource_uuid:
        parser.error("--resource-uuid is required (or set MCD_RESOURCE_UUID)")

    manifest = collect(log_file=args.log_file)

    if not manifest["edges"]:
        print("No lineage edges detected — no CTAS or INSERT INTO ... SELECT patterns found.")
        return

    push(
        manifest=manifest,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        column_lineage=args.column_lineage,
        batch_size=args.batch_size,
        timeout_seconds=args.timeout,
    )

    with open(args.output_file, "w") as fh:
        json.dump(manifest, fh, indent=2)
    print(f"Lineage manifest written to {args.output_file}")
    print("Done.")


if __name__ == "__main__":
    main()
@@ -0,0 +1,119 @@
#!/usr/bin/env python3
"""
Collect Hive table metadata and push it to Monte Carlo in one step.

Thin wrapper that calls ``collect()`` from ``collect_metadata`` followed by
``push()`` from ``push_metadata``, then writes the final manifest (with
``resource_uuid`` and ``invocation_id``) to ``--output-file``.

Substitution points
-------------------
- HIVE_HOST (env) / --hive-host (CLI) : HiveServer2 hostname
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID for this connection

Prerequisites
-------------
pip install pycarlo pyhive python-dotenv

Usage
-----
python collect_and_push_metadata.py \\
    --key-id <MCD_INGEST_ID> \\
    --key-token <MCD_INGEST_TOKEN> \\
    --resource-uuid <MCD_RESOURCE_UUID> \\
    --hive-host <HIVESERVER2_HOSTNAME>
"""

import argparse
import json
import os

from collect_metadata import collect
from push_metadata import DEFAULT_BATCH_SIZE, DEFAULT_TIMEOUT_SECONDS, push


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Collect Hive table metadata and push to Monte Carlo",
    )
    # Hive / collect args
    parser.add_argument(
        "--hive-host",
        default=os.environ.get("HIVE_HOST"),
        help="HiveServer2 hostname (env: HIVE_HOST)",  # ← SUBSTITUTE: your EMR master DNS or Hive host
    )
    parser.add_argument(
        "--hive-port",
        type=int,
        default=10000,
        help="HiveServer2 port (default: 10000)",  # ← SUBSTITUTE if your cluster uses a non-standard port
    )
    # Push / MC args
    parser.add_argument(
        "--key-id",
        default=os.environ.get("MCD_INGEST_ID"),
        help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",  # ← SUBSTITUTE env var name if different
    )
    parser.add_argument(
        "--key-token",
        default=os.environ.get("MCD_INGEST_TOKEN"),
        help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",  # ← SUBSTITUTE env var name if different
    )
    parser.add_argument(
        "--resource-uuid",
        default=os.environ.get("MCD_RESOURCE_UUID"),
        required=False,
        help="Monte Carlo resource UUID for this Hive connection (env: MCD_RESOURCE_UUID)",
    )
    parser.add_argument(
        "--output-file",
        default="metadata_output.json",
        help="Path to write the output manifest (default: metadata_output.json)",
    )
    parser.add_argument(
        "--batch-size",
        type=int,
        default=DEFAULT_BATCH_SIZE,
        metavar="N",
        help=f"Max assets per POST (default: {DEFAULT_BATCH_SIZE})",
    )
    parser.add_argument(
        "--timeout",
        type=int,
        default=DEFAULT_TIMEOUT_SECONDS,
        metavar="SEC",
        help=f"HTTP timeout per request in seconds (default: {DEFAULT_TIMEOUT_SECONDS})",
    )
    args = parser.parse_args()

    if not args.hive_host:
        parser.error("--hive-host is required (or set HIVE_HOST)")
    if not args.key_id or not args.key_token:
        parser.error("--key-id and --key-token are required (or set MCD_INGEST_ID / MCD_INGEST_TOKEN)")
    if not args.resource_uuid:
        parser.error("--resource-uuid is required (or set MCD_RESOURCE_UUID)")

    manifest = collect(
        hive_host=args.hive_host,
        hive_port=args.hive_port,
    )

    push(
        manifest=manifest,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
        timeout_seconds=args.timeout,
    )

    with open(args.output_file, "w") as fh:
        json.dump(manifest, fh, indent=2)
    print(f"Manifest written to {args.output_file}")
    print("Done.")


if __name__ == "__main__":
    main()
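The credential flags above all follow the same pattern: an argparse default drawn from an environment variable, plus a `parser.error` fast-fail when neither source supplies a value. A small standalone sketch of that pattern (the `env` dict is injected here only so the example can run without touching `os.environ`):

```python
import argparse


def build_parser(env: dict) -> argparse.ArgumentParser:
    # In the real scripts the default comes from os.environ.get("MCD_INGEST_ID")
    parser = argparse.ArgumentParser()
    parser.add_argument("--key-id", default=env.get("MCD_INGEST_ID"))
    return parser


# A CLI value wins; the environment fills the gap when the flag is omitted
args_cli = build_parser({"MCD_INGEST_ID": "from-env"}).parse_args(["--key-id", "from-cli"])
args_env = build_parser({"MCD_INGEST_ID": "from-env"}).parse_args([])
print(args_cli.key_id, args_env.key_id)
```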
@@ -0,0 +1,117 @@
#!/usr/bin/env python3
"""
Collect Hive query logs from a local log file and push them to Monte Carlo
in one step.

Thin wrapper that calls ``collect()`` from ``collect_query_logs`` followed by
``push()`` from ``push_query_logs``, then writes the final manifest (with
``resource_uuid`` and ``invocation_id``) to ``--output-file``.

Substitution points
-------------------
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID (optional for query logs)
- --log-file path to local HiveServer2 log (default: /tmp/root/hive.log)
- --op-logs-dir optional directory of per-query <queryId>.log files

Prerequisites
-------------
pip install pycarlo python-dateutil python-dotenv

Usage
-----
python collect_and_push_query_logs.py \\
    --key-id <MCD_INGEST_ID> \\
    --key-token <MCD_INGEST_TOKEN> \\
    --resource-uuid <MCD_RESOURCE_UUID> \\
    --log-file /tmp/root/hive.log \\
    [--op-logs-dir /var/log/hive/operation_logs]
"""

import argparse
import json
import os

from collect_query_logs import collect
from push_query_logs import DEFAULT_BATCH_SIZE, DEFAULT_TIMEOUT_SECONDS, push


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Collect Hive query logs from a local log file and push to Monte Carlo",
    )
    # Collect args
    parser.add_argument(
        "--log-file",
        default="/tmp/root/hive.log",
        help="Path to local HiveServer2 log file (default: /tmp/root/hive.log)",  # ← SUBSTITUTE: your log path
    )
    parser.add_argument(
        "--op-logs-dir",
        default=None,
        help=(
            "Directory containing per-query Hive operation logs (<queryId>.log). "
            "When provided, returned_rows is populated from SelectOperator RECORDS_OUT counts."
        ),
        # ← SUBSTITUTE: e.g. /var/log/hive/operation_logs or wherever Hive writes op logs
    )
    # Push / MC args
    parser.add_argument(
        "--key-id",
        default=os.environ.get("MCD_INGEST_ID"),
        help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",
    )
    parser.add_argument(
        "--key-token",
        default=os.environ.get("MCD_INGEST_TOKEN"),
        help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",
    )
    parser.add_argument(
        "--resource-uuid",
        default=os.environ.get("MCD_RESOURCE_UUID"),
        help="Monte Carlo resource UUID (optional for query logs) (env: MCD_RESOURCE_UUID)",
    )
    parser.add_argument(
        "--output-file",
        default="query_logs_output.json",
        help="Path to write the output manifest (default: query_logs_output.json)",
    )
    parser.add_argument(
        "--batch-size",
        type=int,
        default=DEFAULT_BATCH_SIZE,
        metavar="N",
        help=f"Max events per POST (default: {DEFAULT_BATCH_SIZE})",
    )
    parser.add_argument(
        "--timeout",
        type=int,
        default=DEFAULT_TIMEOUT_SECONDS,
        metavar="SEC",
        help=f"HTTP timeout per request in seconds (default: {DEFAULT_TIMEOUT_SECONDS})",
    )
    args = parser.parse_args()

    if not args.key_id or not args.key_token:
        parser.error("--key-id and --key-token are required (or set MCD_INGEST_ID / MCD_INGEST_TOKEN)")

    manifest = collect(log_file=args.log_file, op_logs_dir=args.op_logs_dir)

    push(
        manifest=manifest,
        key_id=args.key_id,
        key_token=args.key_token,
        resource_uuid=args.resource_uuid,
        batch_size=args.batch_size,
        timeout_seconds=args.timeout,
    )

    with open(args.output_file, "w") as fh:
        json.dump(manifest, fh, indent=2)
    print(f"Query log manifest written to {args.output_file}")
    print("Done.")


if __name__ == "__main__":
    main()
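The `--op-logs-dir` option above feeds `returned_rows`: `collect_query_logs.py` keeps the last `RECORDS_OUT_OPERATOR_SEL_N` counter found in each per-query operation log. A standalone sanity check of that idea against a synthetic log snippet (the log lines are illustrative, not real Hive output):

```python
import re

# Same pattern collect_query_logs.py uses for SelectOperator row counters
RECORDS_OUT_RE = re.compile(r"RECORDS_OUT_OPERATOR_SEL_\d+:(\d+)")

op_log = """\
INFO counters: RECORDS_OUT_OPERATOR_SEL_3:120
INFO counters: RECORDS_OUT_OPERATOR_SEL_3:240
"""

last_count = None
for m in RECORDS_OUT_RE.finditer(op_log):
    last_count = int(m.group(1))  # keep the final (cumulative) value

print(last_count)
```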
@@ -0,0 +1,265 @@
#!/usr/bin/env python3
"""
Extract table and column lineage from a local HiveServer2 log file — collection only.

Reads a plain-text Hive log file (not compressed), extracts SQL query blocks
from "Executing command" / "Starting command" entries, detects CTAS and
INSERT INTO ... SELECT patterns to build lineage edges, then writes a JSON
manifest file.

Can be run standalone via CLI or imported (use the ``collect()`` function).

Substitution points
-------------------
- --log-file path to local HiveServer2 log (default: /tmp/root/hive.log)

Prerequisites
-------------
pip install python-dotenv

Usage
-----
python collect_lineage.py \\
    --log-file /tmp/root/hive.log \\
    --output-file lineage_output.json
"""

from __future__ import annotations

import argparse
import json
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone

# ← SUBSTITUTE: set RESOURCE_TYPE to match your Monte Carlo connection type
RESOURCE_TYPE = "data-lake"

# Regex for CTAS: CREATE TABLE [IF NOT EXISTS] db.table AS SELECT ... FROM db.table
_CTAS_RE = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?"
    r"(?P<dest_db>\w+)\.(?P<dest_table>\w+)"
    r".*?AS\s+SELECT\s+(?P<select_cols>.+?)\s+FROM\s+(?P<src_db>\w+)\.(?P<src_table>\w+)",
    re.IGNORECASE | re.DOTALL,
)

# Regex for INSERT INTO/OVERWRITE db.table SELECT ... FROM db.table
_INSERT_RE = re.compile(
    r"INSERT\s+(?:INTO|OVERWRITE)\s+(?:TABLE\s+)?(?P<dest_db>\w+)\.(?P<dest_table>\w+)"
    r".*?SELECT\s+(?P<select_cols>.+?)\s+FROM\s+(?P<src_db>\w+)\.(?P<src_table>\w+)",
    re.IGNORECASE | re.DOTALL,
)

# Regex to detect additional JOIN sources beyond the primary FROM clause
_JOIN_RE = re.compile(r"JOIN\s+(?P<src_db>\w+)\.(?P<src_table>\w+)", re.IGNORECASE)

# Simple column alias extraction: [alias.]col [AS dest]
_COL_RE = re.compile(r"(?:(\w+)\.)?(\w+)(?:\s+AS\s+(\w+))?", re.IGNORECASE)

# Hive string literals — strip before scanning so words inside 'status' AS ...
# are not treated as column refs
_STR_LITERAL_RE = re.compile(r"'(?:''|[^'])*'")

# ROW_NUMBER() OVER (...) AS alias — whole expression has no single source column;
# removing it avoids bogus tokens in col_mappings
_WINDOW_AS_ALIAS_RE = re.compile(
    r"\b(?:ROW_NUMBER|RANK|DENSE_RANK|NTILE)\s*\(\s*\)\s+OVER\s*\([^)]*\)\s+AS\s+\w+",
    re.IGNORECASE,
)

# Regex to pull query text out of Hive log "Executing/Starting command" lines
_COMMAND_START_RE = re.compile(
    r"(?:Executing|Starting)\s+command\(queryId=\S*\):\s+(?P<query>.+?)(?=\n\d{4}-\d{2}-\d{2}|\Z)",
    re.DOTALL,
)

# Tokens that are almost never real column names — SQL keywords, functions, casts, etc.
_SQL_SCAN_NOISE = frozenset(
    {
        "ROW_NUMBER", "RANK", "DENSE_RANK", "NTILE", "OVER", "PARTITION",
        "ORDER", "BY", "CASE", "WHEN", "THEN", "ELSE", "END", "AND", "OR",
        "NOT", "IN", "IS", "DISTINCT", "CAST", "CONVERT", "CURRENT_TIMESTAMP",
        "CURRENT_DATE", "TRUE", "FALSE", "NULL", "BETWEEN", "LIKE", "EXISTS",
        "ASC", "DESC", "LIMIT", "OFFSET", "GROUP", "HAVING", "UNION", "ALL",
        "INNER", "LEFT", "RIGHT", "FULL", "OUTER", "CROSS", "JOIN", "ON",
        "WHERE", "SELECT", "FROM", "AS", "STRING", "BIGINT", "INT", "SMALLINT",
        "TINYINT", "DOUBLE", "FLOAT", "REAL", "DECIMAL", "BOOLEAN", "DATE",
        "TIMESTAMP", "VARCHAR", "CHAR", "BINARY", "ARRAY", "MAP", "STRUCT",
        "SUM", "AVG", "COUNT", "MIN", "MAX", "STDDEV", "VARIANCE", "VAR_POP",
        "COALESCE", "IF", "SUBSTRING", "YEAR", "MONTH", "DAY", "LEAD", "LAG",
        "FIRST_VALUE", "LAST_VALUE",
    }
)


@dataclass
class _LineageEdge:
    dest_db: str
    dest_table: str
    sources: list[tuple[str, str]] = field(default_factory=list)
    # col_mappings: (dest_col, src_table, src_col)
    col_mappings: list[tuple[str, str, str]] = field(default_factory=list)


def _prepare_select_for_col_scan(select_clause: str) -> str:
    """Remove literals and window headers so _COL_RE sees fewer false positives."""
    s = _STR_LITERAL_RE.sub(" ", select_clause)
    s = _WINDOW_AS_ALIAS_RE.sub(" ", s)
    return s


def _dedupe_col_mappings(mappings: list[tuple[str, str, str]]) -> list[tuple[str, str, str]]:
    seen: set[tuple[str, str, str]] = set()
    out: list[tuple[str, str, str]] = []
    for t in mappings:
        if t in seen:
            continue
        seen.add(t)
        out.append(t)
    return out


def _extract_query_blocks(log_text: str) -> list[str]:
    """Extract individual SQL query strings from a Hive log file."""
    return [m.group("query").strip() for m in _COMMAND_START_RE.finditer(log_text)]


def _parse_select_cols(select_clause: str, src_table: str) -> list[tuple[str, str, str]]:
    """
    Lightweight column mapping: for each `alias.col AS dest` or `col AS dest`
    in the SELECT clause, return (dest_col, src_table, src_col).

    Strips string literals and window function headers first to reduce false
    positives, and filters out SQL keywords/noise tokens.
    """
    prepared = _prepare_select_for_col_scan(select_clause)
    mappings = []
    for m in _COL_RE.finditer(prepared):
        src_col = m.group(2)
        dest_col = m.group(3) or src_col
        if src_col.upper() in ("FROM", "SELECT", "WHERE", "JOIN", "ON", "AS", "*"):
            continue
        if src_col.upper() in _SQL_SCAN_NOISE or dest_col.upper() in _SQL_SCAN_NOISE:
            continue
        # After stripping 'literal' AS col, we get " AS col" — skip bare (col, col) with no source expr.
        if dest_col == src_col:
            prefix = prepared[: m.start()].rstrip()
            if prefix.upper().endswith("AS"):
                continue
        mappings.append((dest_col, src_table, src_col))
    return _dedupe_col_mappings(mappings)


def _parse_edges(queries: list[str]) -> list[_LineageEdge]:
    """Parse SQL query strings into _LineageEdge objects."""
    edges: dict[str, _LineageEdge] = {}

    for sql in queries:
        # Strip string literals to avoid false table/column matches inside quoted strings
        sql_clean = re.sub(r"\s+", " ", _STR_LITERAL_RE.sub(" ", sql)).strip()

        for pattern in (_CTAS_RE, _INSERT_RE):
            m = pattern.search(sql_clean)
            if not m:
                continue

            dest_db = m.group("dest_db").lower()
            dest_table = m.group("dest_table").lower()
            src_db = m.group("src_db").lower()
            src_table = m.group("src_table").lower()
            select_cols = m.group("select_cols")

            key = f"{dest_db}.{dest_table}"
            if key not in edges:
                edges[key] = _LineageEdge(dest_db=dest_db, dest_table=dest_table)

            edge = edges[key]
            src_pair = (src_db, src_table)
            if src_pair not in edge.sources:
                edge.sources.append(src_pair)

            # Pick up additional JOIN sources
            for jm in _JOIN_RE.finditer(sql_clean):
                jp = (jm.group("src_db").lower(), jm.group("src_table").lower())
                if jp not in edge.sources:
                    edge.sources.append(jp)

            edge.col_mappings.extend(_parse_select_cols(select_cols, src_table))
            break  # matched one pattern, move to next query

    # Deduplicate column mappings per edge (same INSERT may appear many times in HS2 logs)
    for e in edges.values():
        e.col_mappings = _dedupe_col_mappings(e.col_mappings)

    return list(edges.values())


def collect(log_file: str) -> dict:
    """
    Parse lineage edges from a HiveServer2 log file and return a manifest dict.

    Args:
        log_file: Path to a local HiveServer2 log file.

    Returns:
        Manifest dict with keys: resource_type, collected_at, edges.
        Each edge has destination, sources, and col_mappings lists.
    """
    print(f"Reading Hive log file: {log_file} ...")
    with open(log_file, errors="replace") as fh:
        log_text = fh.read()

    queries = _extract_query_blocks(log_text)
    print(f" Extracted {len(queries)} query block(s).")

    edges = _parse_edges(queries)
    print(f" Parsed {len(edges)} lineage edge(s).")

    manifest = {
        "resource_type": RESOURCE_TYPE,
        "collected_at": datetime.now(tz=timezone.utc).isoformat(),
        "edges": [
            {
                "destination": {"database": e.dest_db, "table": e.dest_table},
                "sources": [{"database": sdb, "table": stbl} for sdb, stbl in e.sources],
                "col_mappings": [
                    {"dest_col": dc, "src_table": st, "src_col": sc}
                    for dc, st, sc in e.col_mappings
                ],
            }
            for e in edges
        ],
    }
    return manifest


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Extract Hive lineage from a local log file and write a JSON manifest",
    )
    parser.add_argument(
        "--log-file",
        default="/tmp/root/hive.log",
        help="Path to local HiveServer2 log file (default: /tmp/root/hive.log)",  # ← SUBSTITUTE: your log path
    )
    parser.add_argument(
        "--output-file",
        default="lineage_output.json",
        help="Path to write the lineage manifest (default: lineage_output.json)",
    )
    args = parser.parse_args()

    manifest = collect(log_file=args.log_file)

    if not manifest["edges"]:
        print("No lineage edges detected — no CTAS or INSERT INTO ... SELECT patterns found.")
        return

    with open(args.output_file, "w") as fh:
        json.dump(manifest, fh, indent=2)
    print(f"Lineage manifest written to {args.output_file}")
    print("Done.")


if __name__ == "__main__":
    main()
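The CTAS pattern in `collect_lineage.py` carries most of the lineage detection; a quick sanity check, with the regex copied verbatim so the snippet runs standalone (the table names in the query are illustrative only):

```python
import re

# Copied from collect_lineage.py so this snippet is self-contained
CTAS_RE = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?"
    r"(?P<dest_db>\w+)\.(?P<dest_table>\w+)"
    r".*?AS\s+SELECT\s+(?P<select_cols>.+?)\s+FROM\s+(?P<src_db>\w+)\.(?P<src_table>\w+)",
    re.IGNORECASE | re.DOTALL,
)

sql = "CREATE TABLE IF NOT EXISTS mart.daily AS SELECT a.id AS user_id FROM raw.events"
m = CTAS_RE.search(sql)
print(m.group("dest_db"), m.group("dest_table"), m.group("src_db"), m.group("src_table"))
```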
@@ -0,0 +1,313 @@
#!/usr/bin/env python3
"""
Collect table metadata from a Hive Metastore — collection only.

Connects to HiveServer2 (default port 10000), discovers all databases and
tables via SHOW DATABASES / SHOW TABLES, reads schema and table statistics
via DESCRIBE FORMATTED, then writes a JSON manifest file.

Can be run standalone via CLI or imported (use the ``collect()`` function).

Substitution points
-------------------
- HIVE_HOST (env) / --hive-host (CLI) : HiveServer2 hostname
- HIVE_PORT (env) / --hive-port (CLI) : HiveServer2 port (default 10000)

Prerequisites
-------------
pip install pyhive python-dotenv

Usage
-----
python collect_metadata.py \\
    --hive-host <HIVESERVER2_HOSTNAME> \\
    --output-file metadata_output.json
"""

import argparse
import json
import os
import re
from datetime import datetime, timezone

from pyhive import hive


def _check_available_memory(min_gb: float = 2.0) -> None:
    """Warn if available memory is below the threshold."""
    try:
        if hasattr(os, "sysconf"):  # Linux / macOS
            page_size = os.sysconf("SC_PAGE_SIZE")
            avail_pages = os.sysconf("SC_AVPHYS_PAGES")
            avail_gb = (page_size * avail_pages) / (1024 ** 3)
        else:
            return  # Windows — skip check
    except (ValueError, OSError):
        return
    if avail_gb < min_gb:
        print(
            f"WARNING: Only {avail_gb:.1f} GB of memory available "
            f"(minimum recommended: {min_gb:.1f} GB). "
            f"Consider reducing the number of databases/tables or increasing available memory."
        )

# ← SUBSTITUTE: set RESOURCE_TYPE to match your Monte Carlo connection type
RESOURCE_TYPE = "data-lake"

# Map Hive native types to SQL-standard uppercase types expected by Monte Carlo
_HIVE_TYPE_MAP: dict[str, str] = {
    "tinyint": "TINYINT",
    "smallint": "SMALLINT",
    "int": "INTEGER",
    "integer": "INTEGER",
    "bigint": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "double precision": "DOUBLE",
    "decimal": "DECIMAL",
    "numeric": "DECIMAL",
    "boolean": "BOOLEAN",
    "string": "VARCHAR",
    "varchar": "VARCHAR",
    "char": "CHAR",
    "binary": "BINARY",
    "timestamp": "TIMESTAMP",
    "date": "DATE",
    "interval": "INTERVAL",
    "array": "ARRAY",
    "map": "MAP",
    "struct": "STRUCT",
    "uniontype": "UNION",
}

# ← SUBSTITUTE: add any internal table name prefixes you want to skip
_INTERNAL_TABLE_PREFIXES = ("tmp_", "__", "hive_")


def _normalize_hive_type(hive_type: str) -> str:
    """Uppercase and normalize a Hive type string to a SQL-standard form.

    Parametrized types like ``decimal(10,2)`` or ``varchar(255)`` keep their
    suffix; the base type is mapped through ``_HIVE_TYPE_MAP``.
    """
    lower = hive_type.lower().strip()
    base = lower.split("(")[0].strip()
    suffix = hive_type[len(base):].strip()  # preserve original params, e.g. decimal(10,2)
    return _HIVE_TYPE_MAP.get(base, base.upper()) + suffix


def _connect(host: str, port: int) -> hive.Connection:
    # ← SUBSTITUTE: update username/auth if your cluster requires Kerberos or LDAP
    return hive.connect(host=host, port=port, username="hadoop", auth="NONE")


def _fetch_rows(cursor, query: str) -> list[tuple]:
    """Execute a query and fetch results in memory-safe chunks."""
    cursor.execute(query)
    rows: list[tuple] = []
    while True:
        chunk = cursor.fetchmany(1000)
        if not chunk:
            break
        rows.extend(chunk)
    return rows


def _parse_describe_formatted(rows: list[tuple]) -> dict:
    """
    Parse DESCRIBE FORMATTED <db>.<table> output into a structured dict:
    columns, row_count, total_size, last_modified, description, created_on
    """
    result: dict = {
        "columns": [],
        "row_count": None,
        "total_size": None,
        "last_modified": None,
        "description": None,
        "created_on": None,
    }
    in_col_info = False
    in_table_info = False

    for row in rows:
        col_name = (row[0] or "").strip()
        data_type = (row[1] or "").strip()
        comment = (row[2] or "").strip() if len(row) > 2 else ""

        if col_name.startswith("# col_name"):
            in_col_info = True
            in_table_info = False
            continue
        if col_name.startswith("# Detailed Table Information"):
            in_col_info = False
            in_table_info = True
            continue
        if col_name.startswith("#"):
            in_col_info = False
            continue

        if in_col_info and col_name and data_type:
            result["columns"].append(
                {
                    "name": col_name,
                    "type": _normalize_hive_type(data_type),
                    "description": comment or None,
                }
            )

        if in_table_info:
            # Table Parameters rows have an empty col_name; key is in data_type, value in comment
            param_key = data_type.strip() if not col_name else col_name.strip().rstrip(":")
            param_val = (comment.strip() if not col_name else data_type.strip()) or ""

            if re.search(r"numRows", param_key, re.IGNORECASE):
                try:
                    result["row_count"] = int(param_val)
                except (ValueError, TypeError):
                    pass
            elif re.search(r"totalSize", param_key, re.IGNORECASE):
                try:
                    result["total_size"] = int(param_val)
                except (ValueError, TypeError):
                    pass
            elif re.search(r"last_modified_time", param_key, re.IGNORECASE):
                try:
                    result["last_modified"] = datetime.fromtimestamp(
                        int(param_val), tz=timezone.utc
                    ).isoformat()
                except (ValueError, TypeError):
                    pass
            elif re.search(r"^CreateTime", param_key):
                # e.g. "Wed Mar 18 20:15:40 UTC 2026"
                try:
                    result["created_on"] = datetime.strptime(
                        param_val, "%a %b %d %H:%M:%S %Z %Y"
                    ).replace(tzinfo=timezone.utc).isoformat()
                except (ValueError, TypeError):
                    pass
            elif param_key == "comment" and not result["description"] and param_val:
                result["description"] = param_val

    return result


def collect(
    hive_host: str,
    hive_port: int = 10000,
) -> dict:
    """
    Connect to HiveServer2, discover all databases and tables, and return a
    manifest dict with collected asset metadata.

    Args:
        hive_host: HiveServer2 hostname.
        hive_port: HiveServer2 port (default 10000).

    Returns:
        Manifest dict with keys: resource_type, collected_at, assets.
    """
    _check_available_memory()
    print(f"Connecting to HiveServer2 at {hive_host}:{hive_port} ...")
    conn = _connect(hive_host, hive_port)
    cursor = conn.cursor()
    assets: list[dict] = []

    print("Collecting table metadata ...")
    databases = [row[0] for row in _fetch_rows(cursor, "SHOW DATABASES")]
    print(f" Found databases: {databases}")

    for db in databases:
        # ← SUBSTITUTE: add any system databases you want to skip
        if db in ("information_schema",):
            continue

        tables = _fetch_rows(cursor, f"SHOW TABLES IN {db}")
        table_names = [row[0] for row in tables]
        print(f" {db}: {len(table_names)} table(s)")

        for table in table_names:
            if any(table.startswith(p) for p in _INTERNAL_TABLE_PREFIXES):
                continue

            try:
                desc_rows = _fetch_rows(cursor, f"DESCRIBE FORMATTED {db}.{table}")
            except Exception as exc:
                print(f" WARNING: could not describe {db}.{table}: {exc}")
                continue

            info = _parse_describe_formatted(desc_rows)

            row_count = info["row_count"] if info["row_count"] and info["row_count"] > 0 else None
            byte_count = info["total_size"] if info["total_size"] and info["total_size"] > 0 else None

            assets.append(
                {
                    "database": db,
                    "schema": db,
                    "name": table,
                    "description": info["description"],
                    "created_on": info["created_on"],
                    "row_count": row_count,
                    "byte_count": byte_count,
                    "last_modified": info["last_modified"],
                    "fields": [
                        {"name": col["name"], "type": col["type"], "description": col["description"]}
                        for col in info["columns"]
                    ],
                }
            )
            print(
                f" + {db}.{table} ({len(info['columns'])} columns, "
                f"desc={info['description']!r}, created={info['created_on']})"
            )

    cursor.close()
    conn.close()
    print(f"\nCollected {len(assets)} table(s).")

    manifest = {
        "resource_type": RESOURCE_TYPE,
        "collected_at": datetime.now(tz=timezone.utc).isoformat(),
        "assets": assets,
    }
    return manifest


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Collect Hive table metadata and write a JSON manifest",
    )
    parser.add_argument(
        "--hive-host",
        default=os.environ.get("HIVE_HOST"),
        help="HiveServer2 hostname (env: HIVE_HOST)",  # ← SUBSTITUTE: your EMR master DNS or Hive host
    )
    parser.add_argument(
        "--hive-port",
        type=int,
        default=10000,
        help="HiveServer2 port (default: 10000)",  # ← SUBSTITUTE if your cluster uses a non-standard port
    )
    parser.add_argument(
        "--output-file",
        default="metadata_output.json",
        help="Path to write the output manifest (default: metadata_output.json)",
    )
    args = parser.parse_args()

    if not args.hive_host:
        parser.error("--hive-host is required (or set HIVE_HOST)")

    manifest = collect(
        hive_host=args.hive_host,
        hive_port=args.hive_port,
    )

    with open(args.output_file, "w") as fh:
        json.dump(manifest, fh, indent=2)
    print(f"Asset manifest written to {args.output_file}")
    print("Done.")


if __name__ == "__main__":
    main()
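The type normalization in `collect_metadata.py` maps a lowercased Hive base type through `_HIVE_TYPE_MAP` while preserving any parameter suffix. The same logic, reproduced standalone with an abbreviated copy of the map so it runs without pyhive:

```python
# Abbreviated copy of the mapping logic in collect_metadata.py (trimmed map)
HIVE_TYPE_MAP = {"int": "INTEGER", "string": "VARCHAR", "decimal": "DECIMAL"}


def normalize_hive_type(hive_type: str) -> str:
    lower = hive_type.lower().strip()
    base = lower.split("(")[0].strip()
    suffix = hive_type[len(base):].strip()  # keep params such as (10,2)
    return HIVE_TYPE_MAP.get(base, base.upper()) + suffix


print(normalize_hive_type("decimal(10,2)"), normalize_hive_type("string"), normalize_hive_type("bigint"))
```

Unmapped types fall back to a plain uppercase of the base type, so the map only needs entries where Hive's spelling differs from the SQL-standard one.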
@@ -0,0 +1,284 @@
#!/usr/bin/env python3
"""
Collect Hive query logs from a local HiveServer2 log file — collection only.

Parses a plain-text HiveServer2 log for "Executing/Starting command" entries
to extract query text, query ID, start time and end time. Optionally reads
per-query operation logs to populate ``returned_rows`` from SelectOperator
``RECORDS_OUT`` counters. Deduplicates entries by query ID.

Can be run standalone via CLI or imported (use the ``collect()`` function).

Substitution points
-------------------
- --log-file path to local HiveServer2 log (default: /tmp/root/hive.log)
- --op-logs-dir optional directory of per-query <queryId>.log files

Prerequisites
-------------
pip install python-dateutil python-dotenv

Usage
-----
python collect_query_logs.py \\
    --log-file /tmp/root/hive.log \\
    [--op-logs-dir /var/log/hive/operation_logs] \\
    --output-file query_logs_output.json
"""

from __future__ import annotations

import argparse
import json
import re
from datetime import datetime, timezone
from io import StringIO
from pathlib import Path

from dateutil.parser import isoparse

# NOTE: the normalizer requires "hive-s3" — do not change to "hive" or "data-lake"
LOG_TYPE = "hive-s3"

# Matches the start of a new query block in the Hive log
_COMMAND_START_RE = re.compile(
    r"(Executing|Starting)\s+command\(queryId=(?P<query_id>\S*)\):\s+(?P<command>.*)$"
)

# Extracts returned row counts from per-query Hive operation logs
_RECORDS_OUT_RE = re.compile(r"RECORDS_OUT_OPERATOR_SEL_\d+:(\d+)")


def _parse_log_entries(log_text: str) -> list[dict]:
    """
    Parse a HiveServer2 log file and return a list of dicts:
    query_id, start_time (datetime), end_time (datetime), query (str)

    Each timestamped "Executing/Starting command" line starts a new entry.
    The previous entry's end_time is set to the timestamp of the next line.
    """
    entries = []
    query = ""
    query_id = ""
    start_time: datetime | None = None
    last_timestamp: datetime | None = None

    for line in StringIO(log_text):
        parts = line.split()
        if not parts:
            continue

        try:
            timestamp = isoparse(parts[0])
            if not timestamp.tzinfo:
                timestamp = timestamp.replace(tzinfo=timezone.utc)
        except ValueError:
            # Continuation line for a multi-line query
            if query:
                query += "\n" + line.rstrip()
            continue

        command_start = _COMMAND_START_RE.search(line)
        if command_start:
            # Emit the previous entry before starting a new one
            if query and start_time:
                entries.append(
                    {
                        "query_id": query_id,
                        "start_time": start_time,
                        "end_time": timestamp,
                        "query": query,
                    }
                )
            query_id = command_start.group("query_id")
            start_time = timestamp
            query = command_start.group("command").strip()
        elif query and start_time:
            # A timestamped non-command line closes the current entry
            entries.append(
                {
                    "query_id": query_id,
                    "start_time": start_time,
                    "end_time": timestamp,
                    "query": query,
                }
            )
            query = ""
            query_id = ""
            start_time = None

        last_timestamp = timestamp

    # Flush any trailing entry
    if query and start_time:
        end_time = last_timestamp or start_time
        entries.append(
            {
                "query_id": query_id,
                "start_time": start_time,
                "end_time": end_time,
                "query": query,
            }
        )

    return entries


def _load_returned_rows(op_logs_dir: str) -> dict[str, int]:
    """
    Scan a directory of per-query Hive operation logs (named <queryId>.log) and
    return a mapping of query_id -> rows returned.

    The row count is taken from the last RECORDS_OUT_OPERATOR_SEL_N value in
    each file, which reflects the final number of rows delivered to the client.
    """
    rows_by_id: dict[str, int] = {}
    for log_file in Path(op_logs_dir).glob("*.log"):
        query_id = log_file.stem
        last_count: int | None = None
        try:
            text = log_file.read_text(errors="replace")
        except OSError:
            continue
        for m in _RECORDS_OUT_RE.finditer(text):
            last_count = int(m.group(1))
        if last_count is not None:
            rows_by_id[query_id] = last_count
    return rows_by_id


def _build_query_log_entries(
    raw_entries: list[dict],
    rows_by_id: dict[str, int] | None = None,
) -> list[dict]:
    """
    Deduplicate raw log entries by query_id and enrich with returned_rows.

    Returns plain dicts so that ``push_query_logs.py`` can reconstruct
    QueryLogEntry objects from the JSON manifest.
    """
    seen: set[str] = set()
    entries = []
    for r in raw_entries:
        qid = r["query_id"]
        if qid and qid in seen:
            continue
        if qid:
            seen.add(qid)

        returned_rows: int | None = rows_by_id.get(qid) if rows_by_id and qid else None

        entries.append(
            {
                "query_id": qid or None,
                "start_time": r["start_time"].isoformat(),
                "end_time": r["end_time"].isoformat(),
                "query_text": r["query"],
                "user": "hadoop",  # ← SUBSTITUTE: set the user appropriate for your cluster
                "returned_rows": returned_rows,
            }
        )
    return entries


def collect(
    log_file: str,
    op_logs_dir: str | None = None,
) -> dict:
    """
    Parse query log entries from a HiveServer2 log file and return a manifest dict.

    Args:
        log_file: Path to a local HiveServer2 log file.
        op_logs_dir: Optional directory containing per-query operation logs
            (<queryId>.log). When provided, returned_rows is populated
            from SelectOperator RECORDS_OUT counts.

    Returns:
        Manifest dict with keys: log_type, collected_at, entry_count,
        window_start, window_end, queries.
    """
    print(f"Reading Hive log file: {log_file} ...")
with open(log_file, errors="replace") as fh:
|
||||
log_text = fh.read()
|
||||
|
||||
raw_entries = _parse_log_entries(log_text)
|
||||
print(f" Parsed {len(raw_entries)} query log entry/entries.")
|
||||
|
||||
if not raw_entries:
|
||||
print("No query log entries found.")
|
||||
return {
|
||||
"log_type": LOG_TYPE,
|
||||
"collected_at": datetime.now(tz=timezone.utc).isoformat(),
|
||||
"entry_count": 0,
|
||||
"window_start": None,
|
||||
"window_end": None,
|
||||
"queries": [],
|
||||
}
|
||||
|
||||
rows_by_id: dict[str, int] | None = None
|
||||
if op_logs_dir:
|
||||
rows_by_id = _load_returned_rows(op_logs_dir)
|
||||
print(f" Loaded row counts for {len(rows_by_id)} query/queries from {op_logs_dir}")
|
||||
|
||||
queries = _build_query_log_entries(raw_entries, rows_by_id)
|
||||
|
||||
start_times = [r["start_time"] for r in raw_entries]
|
||||
end_times = [r["end_time"] for r in raw_entries]
|
||||
|
||||
manifest = {
|
||||
"log_type": LOG_TYPE,
|
||||
"collected_at": datetime.now(tz=timezone.utc).isoformat(),
|
||||
"entry_count": len(queries),
|
||||
"window_start": min(start_times).isoformat() if start_times else None,
|
||||
"window_end": max(end_times).isoformat() if end_times else None,
|
||||
"queries": [
|
||||
{
|
||||
"query_id": q["query_id"],
|
||||
"start_time": q["start_time"],
|
||||
"end_time": q["end_time"],
|
||||
"query": q["query_text"],
|
||||
"user": q["user"],
|
||||
"returned_rows": q["returned_rows"],
|
||||
}
|
||||
for q in queries
|
||||
],
|
||||
}
|
||||
return manifest
|
||||
|
||||
|
||||
def main() -> None:
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Collect Hive query logs from a local log file and write a JSON manifest",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--log-file",
|
||||
default="/tmp/root/hive.log",
|
||||
help="Path to local HiveServer2 log file (default: /tmp/root/hive.log)", # ← SUBSTITUTE: your log path
|
||||
)
|
||||
parser.add_argument(
|
||||
"--op-logs-dir",
|
||||
default=None,
|
||||
help=(
|
||||
"Directory containing per-query Hive operation logs (<queryId>.log). "
|
||||
"When provided, returned_rows is populated from SelectOperator RECORDS_OUT counts."
|
||||
),
|
||||
# ← SUBSTITUTE: e.g. /var/log/hive/operation_logs or wherever Hive writes op logs
|
||||
)
|
||||
parser.add_argument(
|
||||
"--output-file",
|
||||
default="query_logs_output.json",
|
||||
help="Path to write the output manifest (default: query_logs_output.json)",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
manifest = collect(log_file=args.log_file, op_logs_dir=args.op_logs_dir)
|
||||
|
||||
with open(args.output_file, "w") as fh:
|
||||
json.dump(manifest, fh, indent=2)
|
||||
print(f"Query log manifest written to {args.output_file}")
|
||||
print("Done.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,309 @@
#!/usr/bin/env python3
"""
Push a collected Hive lineage manifest to Monte Carlo — push only.

Reads a JSON manifest produced by ``collect_lineage.py``, builds LineageEvent
objects (table-level or column-level), and calls ``send_lineage`` in batches.
The manifest is updated in-place with ``resource_uuid`` and ``invocation_id``
after a successful push.

Can be run standalone via CLI or imported (use the ``push()`` function).

Substitution points
-------------------
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID for this connection

Prerequisites
-------------
pip install pycarlo python-dotenv

Usage (table-level):
    python push_lineage.py \\
        --key-id <MCD_INGEST_ID> \\
        --key-token <MCD_INGEST_TOKEN> \\
        --resource-uuid <MCD_RESOURCE_UUID> \\
        --input-file lineage_output.json

Usage (column-level):
    python push_lineage.py ... --column-lineage
"""

import argparse
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
    ColumnLineageField,
    ColumnLineageSourceField,
    LineageAssetRef,
    LineageEvent,
)

# ← SUBSTITUTE: set RESOURCE_TYPE to match your Monte Carlo connection type
RESOURCE_TYPE = "data-lake"

# ← SUBSTITUTE: default batch size for lineage push (events per request)
DEFAULT_BATCH_SIZE = 500

# ← SUBSTITUTE: HTTP timeout for MC ingestion requests (seconds)
DEFAULT_TIMEOUT_SECONDS = 120


def _build_table_lineage(edges_data: list[dict]) -> list[LineageEvent]:
    """Build table-level LineageEvent objects from raw edge dicts."""
    events = []
    for edge in edges_data:
        sources = edge.get("sources", [])
        if not sources:
            continue
        dest = edge["destination"]
        events.append(
            LineageEvent(
                destination=LineageAssetRef(
                    type="TABLE",
                    name=dest["table"],
                    database=dest["database"],
                    schema=dest["database"],
                ),
                sources=[
                    LineageAssetRef(
                        type="TABLE",
                        name=src["table"],
                        database=src["database"],
                        schema=src["database"],
                    )
                    for src in sources
                ],
            )
        )
    return events


def _build_column_lineage(edges_data: list[dict]) -> list[LineageEvent]:
    """Build column-level LineageEvent objects from raw edge dicts."""
    events = []
    for edge in edges_data:
        sources = edge.get("sources", [])
        if not sources:
            continue

        dest = edge["destination"]
        dest_asset_id = f"{dest['database']}__{dest['table']}"
        source_asset_ids = {
            (src["database"], src["table"]): f"{src['database']}__{src['table']}"
            for src in sources
        }

        col_fields: dict[str, ColumnLineageField] = {}
        for mapping in edge.get("col_mappings", []):
            dest_col = mapping["dest_col"]
            src_table = mapping["src_table"]
            src_col = mapping["src_col"]
            # Find the matching source db for this src_table
            src_db = next(
                (src["database"] for src in sources if src["table"] == src_table),
                dest["database"],
            )
            src_aid = source_asset_ids.get((src_db, src_table), f"{src_db}__{src_table}")
            if dest_col not in col_fields:
                col_fields[dest_col] = ColumnLineageField(name=dest_col, source_fields=[])
            col_fields[dest_col].source_fields.append(
                ColumnLineageSourceField(asset_id=src_aid, field_name=src_col)
            )

        events.append(
            LineageEvent(
                destination=LineageAssetRef(
                    type="TABLE",
                    name=dest["table"],
                    database=dest["database"],
                    schema=dest["database"],
                    asset_id=dest_asset_id,
                ),
                sources=[
                    LineageAssetRef(
                        type="TABLE",
                        name=src["table"],
                        database=src["database"],
                        schema=src["database"],
                        asset_id=source_asset_ids[(src["database"], src["table"])],
                    )
                    for src in sources
                ],
                fields=list(col_fields.values()) if col_fields else None,
            )
        )
    return events
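

# A hypothetical example of the edge dict shape that the two builders above
# consume. Every key ("destination", "sources", "col_mappings", "database",
# "table", "dest_col", "src_table", "src_col") comes from the builder code;
# the database/table/column names themselves are illustrative only.
_EXAMPLE_EDGE = {
    "destination": {"database": "sales", "table": "daily_rollup"},
    "sources": [{"database": "raw", "table": "orders"}],
    "col_mappings": [
        {"dest_col": "total", "src_table": "orders", "src_col": "amount"},
    ],
}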


def push(
    manifest: dict,
    resource_uuid: str,
    key_id: str,
    key_token: str,
    column_lineage: bool = False,
    batch_size: int = DEFAULT_BATCH_SIZE,
    timeout_seconds: int = DEFAULT_TIMEOUT_SECONDS,
) -> str | None:
    """
    Push collected lineage to Monte Carlo and update the manifest in-place.

    Events are sent in batches of ``batch_size`` (default 500) to avoid
    oversized payloads. Supports both table-level and column-level lineage.

    Args:
        manifest: Dict loaded from a ``collect_lineage.py`` output file.
        resource_uuid: MC resource UUID for this Hive connection.
        key_id: MC ingestion key ID.
        key_token: MC ingestion key token.
        column_lineage: When True, push column-level lineage; otherwise table-level.
        batch_size: Events per POST request (default 500).
        timeout_seconds: HTTP timeout per request (default 120).

    Returns:
        The last invocation ID string if returned by MC, otherwise None.
    """
    resource_type = manifest.get("resource_type", RESOURCE_TYPE)
    edges_data = manifest.get("edges", [])

    if column_lineage:
        events = _build_column_lineage(edges_data)
        label = "column-level"
    else:
        events = _build_table_lineage(edges_data)
        label = "table-level"

    print(f"Loaded {len(events)} {label} lineage event(s) from manifest")

    if not events:
        print("No lineage events to push.")
        manifest["resource_uuid"] = resource_uuid
        manifest["invocation_id"] = None
        return None

    # Split into batches
    batch_list = []
    for i in range(0, len(events), batch_size):
        batch_list.append(events[i : i + batch_size])
    total_batches = len(batch_list)

    def _push_batch(batch: list, batch_num: int) -> str | None:
        """Push a single batch using a dedicated Session (thread-safe)."""
        print(f" Pushing batch {batch_num}/{total_batches} ({len(batch)} events) ...")
        client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
        service = IngestionService(mc_client=client)
        result = service.send_lineage(
            resource_uuid=resource_uuid,
            resource_type=resource_type,
            events=batch,
        )
        invocation_id = service.extract_invocation_id(result)
        if invocation_id:
            print(f" Batch {batch_num}: invocation_id={invocation_id}")
        return invocation_id

    # Push batches in parallel (each thread gets its own pycarlo Session)
    max_workers = min(4, total_batches)
    invocation_ids: list[str | None] = [None] * total_batches

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(_push_batch, batch, i + 1): i
            for i, batch in enumerate(batch_list)
        }
        for future in as_completed(futures):
            idx = futures[future]
            try:
                invocation_ids[idx] = future.result()
            except Exception as exc:
                print(f" ERROR pushing batch {idx + 1}: {exc}")
                raise

    print(f" All {total_batches} batches pushed ({max_workers} workers)")

    manifest["resource_uuid"] = resource_uuid
    manifest["invocation_id"] = invocation_ids[-1] if invocation_ids else None
    if len([i for i in invocation_ids if i]) > 1:
        manifest["invocation_ids"] = invocation_ids
    elif "invocation_ids" in manifest:
        del manifest["invocation_ids"]

    return manifest.get("invocation_id")


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Push a collected Hive lineage manifest to Monte Carlo",
    )
    parser.add_argument(
        "--key-id",
        default=os.environ.get("MCD_INGEST_ID"),
        help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",
    )
    parser.add_argument(
        "--key-token",
        default=os.environ.get("MCD_INGEST_TOKEN"),
        help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",
    )
    parser.add_argument(
        "--resource-uuid",
        default=os.environ.get("MCD_RESOURCE_UUID"),
        help="Monte Carlo resource UUID for this Hive connection (env: MCD_RESOURCE_UUID)",
    )
    parser.add_argument(
        "--input-file",
        default="lineage_output.json",
        help="Path to the JSON manifest written by collect_lineage.py (default: lineage_output.json)",
    )
    parser.add_argument(
        "--column-lineage",
        action="store_true",
        help="Push column-level lineage instead of table-level",
    )
    parser.add_argument(
        "--batch-size",
        type=int,
        default=DEFAULT_BATCH_SIZE,
        metavar="N",
        help=f"Max events per POST (default: {DEFAULT_BATCH_SIZE})",
    )
    parser.add_argument(
        "--timeout",
        type=int,
        default=DEFAULT_TIMEOUT_SECONDS,
        metavar="SEC",
        help=f"HTTP timeout per request in seconds (default: {DEFAULT_TIMEOUT_SECONDS})",
    )
    args = parser.parse_args()

    if not args.key_id or not args.key_token:
        parser.error("--key-id and --key-token are required (or set MCD_INGEST_ID / MCD_INGEST_TOKEN)")
    if not args.resource_uuid:
        parser.error("--resource-uuid is required (or set MCD_RESOURCE_UUID)")

    with open(args.input_file) as fh:
        manifest = json.load(fh)

    push(
        manifest=manifest,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        column_lineage=args.column_lineage,
        batch_size=args.batch_size,
        timeout_seconds=args.timeout,
    )

    with open(args.input_file, "w") as fh:
        json.dump(manifest, fh, indent=2)
    print(f"Manifest updated in-place: {args.input_file}")
    print("Done.")


if __name__ == "__main__":
    main()
@@ -0,0 +1,245 @@
#!/usr/bin/env python3
"""
Push a collected Hive metadata manifest to Monte Carlo — push only.

Reads a JSON manifest produced by ``collect_metadata.py``, builds
RelationalAsset objects, and calls ``send_metadata`` in batches. The manifest
is updated in-place with ``resource_uuid`` and ``invocation_id`` after a
successful push.

Can be run standalone via CLI or imported (use the ``push()`` function).

Substitution points
-------------------
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID for this connection

Prerequisites
-------------
pip install pycarlo python-dotenv

Usage
-----
    python push_metadata.py \\
        --key-id <MCD_INGEST_ID> \\
        --key-token <MCD_INGEST_TOKEN> \\
        --resource-uuid <MCD_RESOURCE_UUID> \\
        --input-file metadata_output.json
"""

import argparse
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone

from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
    AssetField,
    AssetFreshness,
    AssetMetadata,
    AssetVolume,
    RelationalAsset,
)

# ← SUBSTITUTE: default batch size for metadata push (assets per request)
DEFAULT_BATCH_SIZE = 500

# ← SUBSTITUTE: HTTP timeout for MC ingestion requests (seconds)
DEFAULT_TIMEOUT_SECONDS = 120


def _build_assets(manifest: dict) -> list[RelationalAsset]:
    """Rebuild RelationalAsset objects from a collected metadata manifest."""
    assets = []
    for a in manifest.get("assets", []):
        fields = [
            AssetField(
                name=f["name"],
                type=f["type"],
                description=f.get("description"),
            )
            for f in a.get("fields", [])
        ]

        volume = None
        row_count = a.get("row_count")
        byte_count = a.get("byte_count")
        if row_count or byte_count:
            volume = AssetVolume(
                row_count=row_count if row_count and row_count > 0 else None,
                byte_count=byte_count if byte_count and byte_count > 0 else None,
            )

        freshness = None
        last_modified = a.get("last_modified")
        if last_modified:
            freshness = AssetFreshness(last_update_time=last_modified)

        assets.append(
            RelationalAsset(
                type="TABLE",
                metadata=AssetMetadata(
                    name=a["name"],
                    database=a["database"],
                    schema=a["schema"],
                    description=a.get("description"),
                    created_on=a.get("created_on"),
                ),
                fields=fields,
                volume=volume,
                freshness=freshness,
            )
        )
    return assets
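

# A hypothetical example of one entry under manifest["assets"] as consumed by
# _build_assets above. Every key appears in the builder code; the names,
# counts, and timestamp are illustrative only.
_EXAMPLE_ASSET = {
    "name": "orders",
    "database": "raw",
    "schema": "raw",
    "description": None,
    "created_on": None,
    "fields": [{"name": "id", "type": "bigint", "description": None}],
    "row_count": 1200,
    "byte_count": 524288,
    "last_modified": "2026-02-08T00:00:00+00:00",
}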


def push(
    manifest: dict,
    resource_uuid: str,
    key_id: str,
    key_token: str,
    batch_size: int = DEFAULT_BATCH_SIZE,
    timeout_seconds: int = DEFAULT_TIMEOUT_SECONDS,
) -> str | None:
    """
    Push collected metadata to Monte Carlo and update the manifest in-place.

    Assets are sent in batches of ``batch_size`` (default 500) to avoid
    oversized payloads. The manifest is enriched with ``resource_uuid``
    and the last ``invocation_id`` from the response.

    Args:
        manifest: Dict loaded from a ``collect_metadata.py`` output file.
        resource_uuid: MC resource UUID for this Hive connection.
        key_id: MC ingestion key ID.
        key_token: MC ingestion key token.
        batch_size: Assets per POST request (default 500).
        timeout_seconds: HTTP timeout per request (default 120).

    Returns:
        The last invocation ID string if returned by MC, otherwise None.
    """
    resource_type = manifest.get("resource_type", "data-lake")

    assets = _build_assets(manifest)
    n = len(assets)

    print(f"Loaded {n} asset(s) from manifest")

    if not assets:
        print("No assets to push.")
        manifest["resource_uuid"] = resource_uuid
        manifest["invocation_id"] = None
        return None

    # Split into batches
    batch_list = []
    for i in range(0, n, batch_size):
        batch_list.append(assets[i : i + batch_size])
    total_batches = len(batch_list)

    def _push_batch(batch: list, batch_num: int) -> str | None:
        """Push a single batch using a dedicated Session (thread-safe)."""
        client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
        service = IngestionService(mc_client=client)
        result = service.send_metadata(
            resource_uuid=resource_uuid,
            resource_type=resource_type,
            events=batch,
        )
        invocation_id = service.extract_invocation_id(result)
        print(f" Pushed batch {batch_num}/{total_batches} ({len(batch)} assets) — invocation_id={invocation_id}")
        return invocation_id

    # Push batches in parallel (each thread gets its own pycarlo Session)
    max_workers = min(4, total_batches)
    invocation_ids: list[str | None] = [None] * total_batches

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(_push_batch, batch, i + 1): i
            for i, batch in enumerate(batch_list)
        }
        for future in as_completed(futures):
            idx = futures[future]
            try:
                invocation_ids[idx] = future.result()
            except Exception as exc:
                print(f" ERROR pushing batch {idx + 1}: {exc}")
                raise

    print(f" All {total_batches} batches pushed ({max_workers} workers)")

    manifest["resource_uuid"] = resource_uuid
    manifest["invocation_id"] = invocation_ids[-1] if invocation_ids else None
    if len([i for i in invocation_ids if i]) > 1:
        manifest["invocation_ids"] = invocation_ids
    elif "invocation_ids" in manifest:
        del manifest["invocation_ids"]

    return manifest.get("invocation_id")


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Push a collected Hive metadata manifest to Monte Carlo",
    )
    parser.add_argument(
        "--key-id",
        default=os.environ.get("MCD_INGEST_ID"),
        help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",  # ← SUBSTITUTE env var name if different
    )
    parser.add_argument(
        "--key-token",
        default=os.environ.get("MCD_INGEST_TOKEN"),
        help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",  # ← SUBSTITUTE env var name if different
    )
    parser.add_argument(
        "--resource-uuid",
        default=os.environ.get("MCD_RESOURCE_UUID"),
        required=False,
        help="Monte Carlo resource UUID for this Hive connection (env: MCD_RESOURCE_UUID)",
    )
    parser.add_argument(
        "--input-file",
        default="metadata_output.json",
        help="Path to the JSON manifest written by collect_metadata.py (default: metadata_output.json)",
    )
    parser.add_argument(
        "--batch-size",
        type=int,
        default=DEFAULT_BATCH_SIZE,
        metavar="N",
        help=f"Max assets per POST (default: {DEFAULT_BATCH_SIZE})",
    )
    parser.add_argument(
        "--timeout",
        type=int,
        default=DEFAULT_TIMEOUT_SECONDS,
        metavar="SEC",
        help=f"HTTP timeout per request in seconds (default: {DEFAULT_TIMEOUT_SECONDS})",
    )
    args = parser.parse_args()

    if not args.key_id or not args.key_token:
        parser.error("--key-id and --key-token are required (or set MCD_INGEST_ID / MCD_INGEST_TOKEN)")
    if not args.resource_uuid:
        parser.error("--resource-uuid is required (or set MCD_RESOURCE_UUID)")

    with open(args.input_file) as fh:
        manifest = json.load(fh)

    push(
        manifest=manifest,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
        timeout_seconds=args.timeout,
    )

    with open(args.input_file, "w") as fh:
        json.dump(manifest, fh, indent=2)
    print(f"Manifest updated in-place: {args.input_file}")
    print("Done.")


if __name__ == "__main__":
    main()
@@ -0,0 +1,255 @@
#!/usr/bin/env python3
"""
Push a collected Hive query log manifest to Monte Carlo — push only.

Reads a JSON manifest produced by ``collect_query_logs.py``, builds
QueryLogEntry objects, and calls ``send_query_logs`` in batches. The manifest
is updated in-place with ``resource_uuid`` and ``invocation_id`` after a
successful push.

Can be run standalone via CLI or imported (use the ``push()`` function).

Substitution points
-------------------
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID (optional for query logs)

Prerequisites
-------------
pip install pycarlo python-dateutil python-dotenv

Usage
-----
    python push_query_logs.py \\
        --key-id <MCD_INGEST_ID> \\
        --key-token <MCD_INGEST_TOKEN> \\
        --resource-uuid <MCD_RESOURCE_UUID> \\
        --input-file query_logs_output.json
"""

import argparse
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone

from dateutil.parser import isoparse

from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import QueryLogEntry

# ← SUBSTITUTE: default batch size for query log push (events per request)
# Query logs include full SQL text — keep batches small to stay under the 1 MB
# compressed payload limit. 50 entries can trigger 413 on active warehouses.
DEFAULT_BATCH_SIZE = 100

# ← SUBSTITUTE: HTTP timeout for MC ingestion requests (seconds)
DEFAULT_TIMEOUT_SECONDS = 120

# Truncate query_text longer than this to prevent 413 errors.
# Some SQL statements (e.g., generated by BI tools) can be 100KB+ and blow up
# compressed payloads even at small batch sizes.
_MAX_QUERY_TEXT_LEN = 10_000


def _build_events(manifest: dict) -> list[QueryLogEntry]:
    """
    Rebuild QueryLogEntry objects from a collected query log manifest.

    ISO timestamp strings are parsed back to datetime. Entries are
    deduplicated by query_id.
    """
    seen: set[str] = set()
    events = []
    truncated = 0
    for q in manifest.get("queries", []):
        qid = q.get("query_id")
        if qid and qid in seen:
            continue
        if qid:
            seen.add(qid)

        start_time = isoparse(q["start_time"])
        if not start_time.tzinfo:
            start_time = start_time.replace(tzinfo=timezone.utc)

        end_time = isoparse(q["end_time"])
        if not end_time.tzinfo:
            end_time = end_time.replace(tzinfo=timezone.utc)

        query_text = q.get("query") or ""

        # Truncate very long SQL to prevent 413 Request Too Large
        if len(query_text) > _MAX_QUERY_TEXT_LEN:
            query_text = query_text[:_MAX_QUERY_TEXT_LEN] + "... [TRUNCATED]"
            truncated += 1

        events.append(
            QueryLogEntry(
                start_time=start_time,
                end_time=end_time,
                query_text=query_text,
                query_id=qid or None,
                user=q.get("user", "hadoop"),  # ← SUBSTITUTE: set the user appropriate for your cluster
                returned_rows=q.get("returned_rows"),
            )
        )
    if truncated:
        print(f" Truncated {truncated} query text(s) exceeding {_MAX_QUERY_TEXT_LEN} chars")
    return events
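

# A hypothetical example of one entry under manifest["queries"] as consumed by
# _build_events above. The keys match the builder code; the id, timestamps,
# SQL text, and row count are illustrative only.
_EXAMPLE_QUERY = {
    "query_id": "hive_20260208000000_0001",
    "start_time": "2026-02-08T00:00:00+00:00",
    "end_time": "2026-02-08T00:00:05+00:00",
    "query": "SELECT 1",
    "user": "hadoop",
    "returned_rows": 1,
}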
|
||||
|
||||
|
||||
def push(
|
||||
manifest: dict,
|
||||
key_id: str,
|
||||
key_token: str,
|
||||
resource_uuid: str | None = None,
|
||||
batch_size: int = DEFAULT_BATCH_SIZE,
|
||||
timeout_seconds: int = DEFAULT_TIMEOUT_SECONDS,
|
||||
) -> str | None:
|
||||
"""
|
||||
Push collected query logs to Monte Carlo and update the manifest in-place.
|
||||
|
||||
Events are sent in batches of ``batch_size`` (default 100) to avoid
|
||||
oversized payloads.
|
||||
|
||||
Args:
|
||||
manifest: Dict loaded from a ``collect_query_logs.py`` output file.
|
||||
key_id: MC ingestion key ID.
|
||||
key_token: MC ingestion key token.
|
||||
resource_uuid: Optional MC resource UUID.
|
||||
batch_size: Events per POST request (default 100).
|
||||
timeout_seconds: HTTP timeout per request (default 120).
|
||||
|
||||
Returns:
|
||||
The last invocation ID string if returned by MC, otherwise None.
|
||||
"""
|
||||
log_type = manifest.get("log_type", "hive-s3")
|
||||
|
||||
events = _build_events(manifest)
|
||||
n = len(events)
|
||||
print(f"Loaded {n} query log entry/entries from manifest")
|
||||
|
||||
if not events:
|
||||
print("No query log entries to push.")
|
||||
manifest["log_type"] = log_type
|
||||
if resource_uuid is not None:
|
||||
manifest["resource_uuid"] = resource_uuid
|
||||
manifest["invocation_id"] = None
|
||||
return None
|
||||
|
||||
# Split into batches
|
||||
batch_list = []
|
||||
for i in range(0, n, batch_size):
|
||||
batch_list.append(events[i : i + batch_size])
|
||||
total_batches = len(batch_list)
|
||||
|
||||
def _push_batch(batch: list, batch_num: int) -> str | None:
|
||||
"""Push a single batch using a dedicated Session (thread-safe)."""
|
||||
client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
|
||||
service = IngestionService(mc_client=client)
|
||||
result = service.send_query_logs(
|
||||
resource_uuid=resource_uuid,
|
||||
log_type=log_type,
|
||||
events=batch,
|
||||
)
|
||||
invocation_id = service.extract_invocation_id(result)
|
||||
        print(f" Pushed batch {batch_num}/{total_batches} ({len(batch)} entries) — invocation_id={invocation_id}")
        return invocation_id

    # Push batches in parallel (each thread gets its own pycarlo Session)
    max_workers = min(4, total_batches)
    invocation_ids: list[str | None] = [None] * total_batches

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(_push_batch, batch, i + 1): i
            for i, batch in enumerate(batch_list)
        }
        for future in as_completed(futures):
            idx = futures[future]
            try:
                invocation_ids[idx] = future.result()
            except Exception as exc:
                print(f" ERROR pushing batch {idx + 1}: {exc}")
                raise

    print(f" All {total_batches} batches pushed ({max_workers} workers)")

    manifest["log_type"] = log_type
    if resource_uuid is not None:
        manifest["resource_uuid"] = resource_uuid
    manifest["invocation_id"] = invocation_ids[-1] if invocation_ids else None
    if len([i for i in invocation_ids if i]) > 1:
        manifest["invocation_ids"] = invocation_ids
    elif "invocation_ids" in manifest:
        del manifest["invocation_ids"]

    return manifest.get("invocation_id")


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Push a collected Hive query log manifest to Monte Carlo",
    )
    parser.add_argument(
        "--key-id",
        default=os.environ.get("MCD_INGEST_ID"),
        help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",
    )
    parser.add_argument(
        "--key-token",
        default=os.environ.get("MCD_INGEST_TOKEN"),
        help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",
    )
    parser.add_argument(
        "--resource-uuid",
        default=os.environ.get("MCD_RESOURCE_UUID"),
        help="Monte Carlo resource UUID (optional for query logs) (env: MCD_RESOURCE_UUID)",
    )
    parser.add_argument(
        "--input-file",
        default="query_logs_output.json",
        help="Path to the JSON manifest written by collect_query_logs.py (default: query_logs_output.json)",
    )
    parser.add_argument(
        "--batch-size",
        type=int,
        default=DEFAULT_BATCH_SIZE,
        metavar="N",
        help=f"Max events per POST (default: {DEFAULT_BATCH_SIZE})",
    )
    parser.add_argument(
        "--timeout",
        type=int,
        default=DEFAULT_TIMEOUT_SECONDS,
        metavar="SEC",
        help=f"HTTP timeout per request in seconds (default: {DEFAULT_TIMEOUT_SECONDS})",
    )
    args = parser.parse_args()

    if not args.key_id or not args.key_token:
        parser.error("--key-id and --key-token are required (or set MCD_INGEST_ID / MCD_INGEST_TOKEN)")

    with open(args.input_file) as fh:
        manifest = json.load(fh)

    push(
        manifest=manifest,
        key_id=args.key_id,
        key_token=args.key_token,
        resource_uuid=args.resource_uuid,
        batch_size=args.batch_size,
        timeout_seconds=args.timeout,
    )

    with open(args.input_file, "w") as fh:
        json.dump(manifest, fh, indent=2)
    print(f"Manifest updated in-place: {args.input_file}")
    print("Done.")


if __name__ == "__main__":
    main()
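For reference, the fan-out pattern used above (chunk the entries, submit one future per batch, store each result by batch index so ordering stays stable) can be sketched in isolation. This is a minimal illustration, not part of the script; `push_one` stands in for the real per-batch pycarlo call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def push_in_batches(entries, batch_size, push_one):
    """Split entries into batches and push them concurrently.

    Results are recorded by batch index, so the returned list is ordered
    even though batches may complete out of order.
    """
    batch_list = [entries[i : i + batch_size] for i in range(0, len(entries), batch_size)]
    total_batches = len(batch_list)
    max_workers = min(4, total_batches) or 1  # at least one worker
    results = [None] * total_batches

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(push_one, b): i for i, b in enumerate(batch_list)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    # Stand-in for the real per-batch API call: just report the batch length.
    print(push_in_batches(list(range(10)), 4, len))  # [4, 4, 2]
```

The index-keyed `futures` dict is what lets the script attribute a failure (or an invocation_id) back to a specific batch number.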
@@ -0,0 +1,78 @@
"""
Redshift — Lineage Collect & Push (combined)
==============================================
Collects table-level lineage from Redshift by parsing query history, then pushes
the derived lineage events to Monte Carlo via the push ingestion API.

This script imports and calls collect() from collect_lineage and push() from
push_lineage, running both in sequence.

Substitution points (search for "← SUBSTITUTE"):
- REDSHIFT_HOST / REDSHIFT_DB / REDSHIFT_USER / REDSHIFT_PASSWORD : connection
- LOOKBACK_HOURS : how far back to scan query history (default 24 h)
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Redshift connection in Monte Carlo
- PUSH_BATCH_SIZE : number of events per API call (default 500)

Prerequisites:
    pip install psycopg2-binary pycarlo
"""

from __future__ import annotations

import argparse
import logging
import os

from collect_lineage import LOOKBACK_HOURS, collect
from push_lineage import DEFAULT_BATCH_SIZE, push

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)


def main() -> None:
    parser = argparse.ArgumentParser(description="Collect and push Redshift lineage to Monte Carlo")
    parser.add_argument("--host", default=os.getenv("REDSHIFT_HOST"))  # ← SUBSTITUTE
    parser.add_argument("--db", default=os.getenv("REDSHIFT_DB"))  # ← SUBSTITUTE
    parser.add_argument("--user", default=os.getenv("REDSHIFT_USER"))  # ← SUBSTITUTE
    parser.add_argument("--password", default=os.getenv("REDSHIFT_PASSWORD"))  # ← SUBSTITUTE
    parser.add_argument("--port", type=int, default=int(os.getenv("REDSHIFT_PORT", "5439")))
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
    parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
    parser.add_argument("--manifest", default="manifest_lineage.json")
    args = parser.parse_args()

    required = ["host", "db", "user", "password", "resource_uuid", "key_id", "key_token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    log.info("Step 1: Collecting lineage …")
    collect(
        host=args.host,
        db=args.db,
        user=args.user,
        password=args.password,
        manifest_path=args.manifest,
        port=args.port,
        lookback_hours=args.lookback_hours,
    )

    log.info("Step 2: Pushing lineage to Monte Carlo …")
    push(
        manifest_path=args.manifest,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
    )

    log.info("Done — collect and push complete.")


if __name__ == "__main__":
    main()
@@ -0,0 +1,80 @@
"""
Redshift — Metadata Collect & Push (combined)
===============================================
Collects table schemas, row counts, and byte sizes from Amazon Redshift,
then pushes them to Monte Carlo via the push ingestion API.

This script imports and calls collect() from collect_metadata and push() from
push_metadata, running both in sequence.

Substitution points (search for "← SUBSTITUTE"):
- REDSHIFT_HOST : Redshift cluster endpoint or serverless workgroup endpoint
- REDSHIFT_DB : database name to connect to
- REDSHIFT_USER : database user (or IAM role user)
- REDSHIFT_PASSWORD : database password
- DB_EXCLUSIONS : databases to skip
- SCHEMA_EXCLUSIONS : schemas to skip in every database
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Redshift connection in Monte Carlo
- PUSH_BATCH_SIZE : number of assets per API call (default 500)

Prerequisites:
    pip install psycopg2-binary pycarlo
"""

from __future__ import annotations

import argparse
import logging
import os

from collect_metadata import collect
from push_metadata import DEFAULT_BATCH_SIZE, push

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)


def main() -> None:
    parser = argparse.ArgumentParser(description="Collect and push Redshift metadata to Monte Carlo")
    parser.add_argument("--host", default=os.getenv("REDSHIFT_HOST"))  # ← SUBSTITUTE
    parser.add_argument("--db", default=os.getenv("REDSHIFT_DB"))  # ← SUBSTITUTE
    parser.add_argument("--user", default=os.getenv("REDSHIFT_USER"))  # ← SUBSTITUTE
    parser.add_argument("--password", default=os.getenv("REDSHIFT_PASSWORD"))  # ← SUBSTITUTE
    parser.add_argument("--port", type=int, default=int(os.getenv("REDSHIFT_PORT", "5439")))
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
    parser.add_argument("--manifest", default="manifest_metadata.json")
    args = parser.parse_args()

    required = ["host", "db", "user", "password", "resource_uuid", "key_id", "key_token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    log.info("Step 1: Collecting metadata …")
    collect(
        host=args.host,
        db=args.db,
        user=args.user,
        password=args.password,
        manifest_path=args.manifest,
        port=args.port,
    )

    log.info("Step 2: Pushing metadata to Monte Carlo …")
    push(
        manifest_path=args.manifest,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
    )

    log.info("Done — collect and push complete.")


if __name__ == "__main__":
    main()
@@ -0,0 +1,88 @@
"""
Redshift — Query Log Collect & Push (combined)
================================================
Collects completed query execution records from Redshift using sys_query_history
and sys_querytext, then pushes them to Monte Carlo for query-pattern analysis,
lineage derivation, and usage attribution.

This script imports and calls collect() from collect_query_logs and push() from
push_query_logs, running both in sequence.

Substitution points (search for "← SUBSTITUTE"):
- REDSHIFT_HOST / REDSHIFT_DB / REDSHIFT_USER / REDSHIFT_PASSWORD : connection
- LOOKBACK_HOURS : hours back from [now - LAG_HOURS] to collect (default 25)
- LOOKBACK_LAG_HOURS: lag behind now to avoid in-flight queries (default 1)
- BATCH_SIZE : number of query_ids to fetch texts for in one SQL call
- MAX_QUERIES : maximum query rows to process per run
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Redshift connection in Monte Carlo
- PUSH_BATCH_SIZE : number of entries per API call (default 250)

Prerequisites:
    pip install psycopg2-binary pycarlo
"""

from __future__ import annotations

import argparse
import logging
import os

from collect_query_logs import BATCH_SIZE, LOOKBACK_HOURS, LOOKBACK_LAG_HOURS, MAX_QUERIES, collect
from push_query_logs import DEFAULT_BATCH_SIZE, push

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)


def main() -> None:
    parser = argparse.ArgumentParser(description="Collect and push Redshift query logs to Monte Carlo")
    parser.add_argument("--host", default=os.getenv("REDSHIFT_HOST"))  # ← SUBSTITUTE
    parser.add_argument("--db", default=os.getenv("REDSHIFT_DB"))  # ← SUBSTITUTE
    parser.add_argument("--user", default=os.getenv("REDSHIFT_USER"))  # ← SUBSTITUTE
    parser.add_argument("--password", default=os.getenv("REDSHIFT_PASSWORD"))  # ← SUBSTITUTE
    parser.add_argument("--port", type=int, default=int(os.getenv("REDSHIFT_PORT", "5439")))
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
    parser.add_argument("--lookback-lag-hours", type=int, default=LOOKBACK_LAG_HOURS)
    parser.add_argument("--batch-size", type=int, default=BATCH_SIZE)
    parser.add_argument("--max-queries", type=int, default=MAX_QUERIES)
    parser.add_argument("--push-batch-size", type=int, default=DEFAULT_BATCH_SIZE)
    parser.add_argument("--manifest", default="manifest_query_logs.json")
    args = parser.parse_args()

    required = ["host", "db", "user", "password", "resource_uuid", "key_id", "key_token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    log.info("Step 1: Collecting query logs …")
    collect(
        host=args.host,
        db=args.db,
        user=args.user,
        password=args.password,
        manifest_path=args.manifest,
        port=args.port,
        lookback_hours=args.lookback_hours,
        lookback_lag_hours=args.lookback_lag_hours,
        batch_size=args.batch_size,
        max_queries=args.max_queries,
    )

    log.info("Step 2: Pushing query logs to Monte Carlo …")
    push(
        manifest_path=args.manifest,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.push_batch_size,
    )

    log.info("Done — collect and push complete.")


if __name__ == "__main__":
    main()
@@ -0,0 +1,235 @@
"""
Redshift — Lineage Collection (collect-only)
==============================================
Collects table-level lineage from Redshift by fetching recent successful query
history from sys_query_history + sys_querytext and parsing CREATE TABLE AS SELECT
(CTAS) and INSERT INTO SELECT patterns to derive source->destination relationships.

Writes a JSON manifest file that can be consumed by push_lineage.py.

Substitution points (search for "← SUBSTITUTE"):
- REDSHIFT_HOST / REDSHIFT_DB / REDSHIFT_USER / REDSHIFT_PASSWORD : connection
- LOOKBACK_HOURS : how far back to scan query history (default 24 h)

Prerequisites:
    pip install psycopg2-binary
"""

from __future__ import annotations

import argparse
import json
import logging
import os
import re
from datetime import datetime, timezone
from typing import Any

import psycopg2

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

RESOURCE_TYPE = "redshift"
LOOKBACK_HOURS: int = int(os.getenv("LOOKBACK_HOURS", "24"))  # ← SUBSTITUTE


def _check_available_memory(min_gb: float = 2.0) -> None:
    """Warn if available memory is below the threshold."""
    try:
        if hasattr(os, "sysconf"):  # Linux / macOS
            page_size = os.sysconf("SC_PAGE_SIZE")
            avail_pages = os.sysconf("SC_AVPHYS_PAGES")
            avail_gb = (page_size * avail_pages) / (1024 ** 3)
        else:
            return  # Windows — skip check
    except (ValueError, OSError):
        return
    if avail_gb < min_gb:
        log.warning(
            "Only %.1f GB of memory available (minimum recommended: %.1f GB). "
            "Consider reducing the collection scope or increasing available memory.",
            avail_gb,
            min_gb,
        )


# Regex: CTAS — CREATE [OR REPLACE] TABLE <dest> AS SELECT
_CTAS_RE = re.compile(
    r"CREATE\s+(?:OR\s+REPLACE\s+)?(?:TABLE|VIEW)\s+(?P<dest>\"?[\w.\"]+\"?)\s*(?:\([^)]*\))?\s*AS\s+SELECT\b",
    re.IGNORECASE | re.DOTALL,
)
# Regex: INSERT INTO <dest> … SELECT
_INSERT_RE = re.compile(
    r"INSERT\s+INTO\s+(?P<dest>\"?[\w.\"]+\"?)\s.*?SELECT\b",
    re.IGNORECASE | re.DOTALL,
)
# Matches any schema.table or database.schema.table reference in the query
_TABLE_REF_RE = re.compile(r'"?([\w]+)"?\."?([\w]+)"?(?:\."?([\w]+)"?)?', re.IGNORECASE)


def _clean_name(name: str) -> str:
    return name.strip('"').strip()


def _parse_ref(ref: str) -> tuple[str, str, str]:
    """Parse 'db.schema.table' or 'schema.table' -> (database, schema, table)."""
    parts = [_clean_name(p) for p in ref.split(".")]
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    if len(parts) == 2:
        return "", parts[0], parts[1]
    return "", "", parts[0]


def _dictfetch(cursor: Any, sql: str, params: tuple | None = None) -> list[dict[str, Any]]:
    cursor.execute(sql, params)
    cols = [d.name for d in cursor.description]
    rows = []
    while True:
        chunk = cursor.fetchmany(1000)
        if not chunk:
            break
        rows.extend(dict(zip(cols, row)) for row in chunk)
    return rows


def fetch_query_texts(cursor: Any, lookback_hours: int) -> list[str]:
    """Assemble full query texts from sys_query_history + sys_querytext."""
    rows = _dictfetch(
        cursor,
        f"""
        SELECT
            sq.query_id,
            LISTAGG(
                CASE WHEN LEN(st.text) <= 200 THEN st.text ELSE LEFT(st.text, 200) END,
                ''
            ) WITHIN GROUP (ORDER BY st.sequence) AS full_text
        FROM sys_query_history sq
        JOIN sys_querytext st ON sq.query_id = st.query_id
        WHERE sq.start_time >= DATEADD(hour, -{lookback_hours}, GETDATE())
          AND sq.status = 'success'
        GROUP BY sq.query_id
        LIMIT 50000
        """,  # ← SUBSTITUTE: adjust lookback_hours, LIMIT, or add user/database filters
    )
    return [r["full_text"] for r in rows if r.get("full_text")]


def parse_lineage_from_sql(sql_text: str) -> list[dict[str, Any]]:
    events: list[dict[str, Any]] = []

    dest_match = _CTAS_RE.search(sql_text) or _INSERT_RE.search(sql_text)
    if not dest_match:
        return events

    dest_raw = dest_match.group("dest")
    dest_db, dest_schema, dest_table = _parse_ref(dest_raw)
    if not dest_table:
        return events

    # Find all schema.table refs in the query, excluding the destination
    source_refs: list[str] = []
    for m in _TABLE_REF_RE.finditer(sql_text):
        if m.group(3):
            ref = f"{m.group(1)}.{m.group(2)}.{m.group(3)}"
        else:
            ref = f"{m.group(1)}.{m.group(2)}"

        db, schema, table = _parse_ref(ref)
        if not table or (db == dest_db and schema == dest_schema and table == dest_table):
            continue
        source_refs.append(ref)

    if not source_refs:
        return events

    # Deduplicate sources while preserving order
    seen: set[str] = set()
    sources: list[dict[str, str]] = []
    for ref in source_refs:
        if ref not in seen:
            seen.add(ref)
            db, schema, table = _parse_ref(ref)
            sources.append({"database": db, "schema": schema, "asset_name": table})

    events.append({
        "sources": sources,
        "destination": {"database": dest_db, "schema": dest_schema, "asset_name": dest_table},
    })
    return events


def collect(
    host: str,
    db: str,
    user: str,
    password: str,
    manifest_path: str = "manifest_lineage.json",
    port: int = 5439,
    lookback_hours: int = LOOKBACK_HOURS,
) -> list[dict[str, Any]]:
    """Connect to Redshift, collect lineage, write a JSON manifest, and return events."""
    _check_available_memory()
    collected_at = datetime.now(timezone.utc).isoformat()

    conn = psycopg2.connect(
        host=host, port=port, dbname=db, user=user, password=password, connect_timeout=30,
    )
    try:
        with conn.cursor() as cursor:
            query_texts = fetch_query_texts(cursor, lookback_hours)
    finally:
        conn.close()

    log.info("Parsing lineage from %d query texts …", len(query_texts))
    all_events: list[dict[str, Any]] = []
    for sql_text in query_texts:
        all_events.extend(parse_lineage_from_sql(sql_text))

    log.info("Collected %d lineage events", len(all_events))

    manifest = {
        "resource_type": RESOURCE_TYPE,
        "collected_at": collected_at,
        "lookback_hours": lookback_hours,
        "queries_scanned": len(query_texts),
        "lineage_event_count": len(all_events),
        "events": all_events,
    }
    with open(manifest_path, "w") as fh:
        json.dump(manifest, fh, indent=2)
    log.info("Manifest written to %s (%d events)", manifest_path, len(all_events))

    return all_events


def main() -> None:
    parser = argparse.ArgumentParser(description="Collect Redshift lineage to a manifest file")
    parser.add_argument("--host", default=os.getenv("REDSHIFT_HOST"))  # ← SUBSTITUTE
    parser.add_argument("--db", default=os.getenv("REDSHIFT_DB"))  # ← SUBSTITUTE
    parser.add_argument("--user", default=os.getenv("REDSHIFT_USER"))  # ← SUBSTITUTE
    parser.add_argument("--password", default=os.getenv("REDSHIFT_PASSWORD"))  # ← SUBSTITUTE
    parser.add_argument("--port", type=int, default=int(os.getenv("REDSHIFT_PORT", "5439")))
    parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
    parser.add_argument("--manifest", default="manifest_lineage.json")
    args = parser.parse_args()

    required = ["host", "db", "user", "password"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    collect(
        host=args.host,
        db=args.db,
        user=args.user,
        password=args.password,
        manifest_path=args.manifest,
        port=args.port,
        lookback_hours=args.lookback_hours,
    )


if __name__ == "__main__":
    main()
@@ -0,0 +1,219 @@
"""
Redshift — Metadata Collection (collect-only)
===============================================
Collects table schemas, row counts, and byte sizes from Amazon Redshift using
SVV system views, then writes a JSON manifest file that can be consumed by
push_metadata.py.

Substitution points (search for "← SUBSTITUTE"):
- REDSHIFT_HOST : Redshift cluster endpoint or serverless workgroup endpoint
- REDSHIFT_DB : database name to connect to
- REDSHIFT_USER : database user (or IAM role user)
- REDSHIFT_PASSWORD : database password
- DB_EXCLUSIONS : databases to skip
- SCHEMA_EXCLUSIONS : schemas to skip in every database

Prerequisites:
    pip install psycopg2-binary
"""

from __future__ import annotations

import argparse
import json
import logging
import os
from datetime import datetime, timezone
from typing import Any

import psycopg2
import psycopg2.extras

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

RESOURCE_TYPE = "redshift"

DB_EXCLUSIONS: set[str] = {"dev", "padb_harvest"}  # ← SUBSTITUTE: add internal databases

SCHEMA_EXCLUSIONS: set[str] = {  # ← SUBSTITUTE: add internal schemas
    "information_schema",
    "pg_catalog",
    "pg_internal",
    "catalog_history",
}


def _check_available_memory(min_gb: float = 2.0) -> None:
    """Warn if available memory is below the threshold."""
    try:
        if hasattr(os, "sysconf"):  # Linux / macOS
            page_size = os.sysconf("SC_PAGE_SIZE")
            avail_pages = os.sysconf("SC_AVPHYS_PAGES")
            avail_gb = (page_size * avail_pages) / (1024 ** 3)
        else:
            return  # Windows — skip check
    except (ValueError, OSError):
        return
    if avail_gb < min_gb:
        log.warning(
            "Only %.1f GB of memory available (minimum recommended: %.1f GB). "
            "Consider reducing the collection scope or increasing available memory.",
            avail_gb,
            min_gb,
        )


def _dictfetch(cursor: Any, sql: str, params: tuple | None = None) -> list[dict[str, Any]]:
    cursor.execute(sql, params)
    cols = [d.name for d in cursor.description]
    rows = []
    while True:
        chunk = cursor.fetchmany(1000)
        if not chunk:
            break
        rows.extend(dict(zip(cols, row)) for row in chunk)
    return rows


def collect_databases(cursor: Any) -> list[str]:
    rows = _dictfetch(
        cursor,
        "SELECT database_name FROM svv_redshift_databases ORDER BY database_name",
    )
    return [r["database_name"] for r in rows if r["database_name"] not in DB_EXCLUSIONS]


def collect_tables(cursor: Any, db: str) -> list[dict[str, Any]]:
    schema_list = ", ".join(f"'{s}'" for s in SCHEMA_EXCLUSIONS)
    return _dictfetch(
        cursor,
        f"""
        SELECT
            database AS db,
            schema,
            "table" AS table_name,
            "rows" AS row_count,
            size * 1024 * 1024 AS byte_count
        FROM svv_table_info
        WHERE database = %s
          AND schema NOT IN ({schema_list})
        ORDER BY schema, "table"
        """,  # ← SUBSTITUTE: add additional WHERE clauses to narrow scope
        (db,),
    )


def collect_columns(cursor: Any, db: str, schema: str, table: str) -> list[dict[str, Any]]:
    return _dictfetch(
        cursor,
        """
        SELECT column_name, data_type, remarks AS comment
        FROM svv_columns
        WHERE table_catalog = %s
          AND table_schema = %s
          AND table_name = %s
        ORDER BY ordinal_position
        """,
        (db, schema, table),
    )


def collect(
    host: str,
    db: str,
    user: str,
    password: str,
    manifest_path: str = "manifest_metadata.json",
    port: int = 5439,
) -> list[dict[str, Any]]:
    """Connect to Redshift, collect metadata, write a JSON manifest, and return asset dicts."""
    _check_available_memory()
    collected_at = datetime.now(timezone.utc).isoformat()
    assets: list[dict[str, Any]] = []

    conn = psycopg2.connect(
        host=host,  # ← SUBSTITUTE
        port=port,
        dbname=db,  # ← SUBSTITUTE
        user=user,  # ← SUBSTITUTE
        password=password,  # ← SUBSTITUTE
        connect_timeout=30,
    )
    try:
        with conn.cursor() as cursor:
            databases = collect_databases(cursor)
            log.info("Found databases: %s", databases)

            for database in databases:
                tables = collect_tables(cursor, database)
                log.info("Database %s — %d tables", database, len(tables))

                for t in tables:
                    schema = t["schema"]
                    table_name = t["table_name"]

                    columns = collect_columns(cursor, database, schema, table_name)
                    fields = [
                        {
                            "name": col["column_name"],
                            "type": col["data_type"].upper(),
                            "description": col.get("comment") or None,
                        }
                        for col in columns
                    ]

                    asset = {
                        "asset_name": table_name,
                        "database": database,  # ← SUBSTITUTE: use database as top-level namespace
                        "schema": schema,
                        "asset_type": "TABLE",
                        "fields": fields,
                        "row_count": t.get("row_count"),
                        "byte_count": t.get("byte_count"),
                    }
                    assets.append(asset)
                    log.info("Collected %s.%s.%s", database, schema, table_name)
    finally:
        conn.close()

    manifest = {
        "resource_type": RESOURCE_TYPE,
        "collected_at": collected_at,
        "asset_count": len(assets),
        "assets": assets,
    }
    with open(manifest_path, "w") as fh:
        json.dump(manifest, fh, indent=2)
    log.info("Manifest written to %s (%d assets)", manifest_path, len(assets))

    return assets


def main() -> None:
    parser = argparse.ArgumentParser(description="Collect Redshift metadata to a manifest file")
    parser.add_argument("--host", default=os.getenv("REDSHIFT_HOST"))  # ← SUBSTITUTE
    parser.add_argument("--db", default=os.getenv("REDSHIFT_DB"))  # ← SUBSTITUTE
    parser.add_argument("--user", default=os.getenv("REDSHIFT_USER"))  # ← SUBSTITUTE
    parser.add_argument("--password", default=os.getenv("REDSHIFT_PASSWORD"))  # ← SUBSTITUTE
    parser.add_argument("--port", type=int, default=int(os.getenv("REDSHIFT_PORT", "5439")))
    parser.add_argument("--manifest", default="manifest_metadata.json")
    args = parser.parse_args()

    required = ["host", "db", "user", "password"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    collect(
        host=args.host,
        db=args.db,
        user=args.user,
        password=args.password,
        manifest_path=args.manifest,
        port=args.port,
    )


if __name__ == "__main__":
    main()
@@ -0,0 +1,239 @@
"""
Redshift — Query Log Collection (collect-only)
================================================
Collects completed query execution records from Redshift using sys_query_history
and sys_querytext (modern RA3/serverless), assembles full SQL text from
multi-row text chunks, and writes a JSON manifest file that can be consumed
by push_query_logs.py.

Substitution points (search for "← SUBSTITUTE"):
- REDSHIFT_HOST / REDSHIFT_DB / REDSHIFT_USER / REDSHIFT_PASSWORD : connection
- LOOKBACK_HOURS : hours back from [now - LAG_HOURS] to collect (default 25)
- LOOKBACK_LAG_HOURS: lag behind now to avoid in-flight queries (default 1)
- BATCH_SIZE : number of query_ids to fetch texts for in one SQL call
- MAX_QUERIES : maximum query rows to process per run

Prerequisites:
    pip install psycopg2-binary
"""

from __future__ import annotations

import argparse
import json
import logging
import os
from datetime import datetime, timezone
from typing import Any

import psycopg2

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

LOG_TYPE = "redshift"

LOOKBACK_HOURS: int = int(os.getenv("LOOKBACK_HOURS", "25"))  # ← SUBSTITUTE
LOOKBACK_LAG_HOURS: int = int(os.getenv("LOOKBACK_LAG_HOURS", "1"))  # ← SUBSTITUTE
BATCH_SIZE: int = int(os.getenv("BATCH_SIZE", "200"))  # ← SUBSTITUTE
MAX_QUERIES: int = int(os.getenv("MAX_QUERIES", "10000"))  # ← SUBSTITUTE


def _check_available_memory(min_gb: float = 2.0) -> None:
    """Warn if available memory is below the threshold."""
    try:
        if hasattr(os, "sysconf"):  # Linux / macOS
            page_size = os.sysconf("SC_PAGE_SIZE")
            avail_pages = os.sysconf("SC_AVPHYS_PAGES")
            avail_gb = (page_size * avail_pages) / (1024 ** 3)
        else:
            return  # Windows — skip check
    except (ValueError, OSError):
        return
    if avail_gb < min_gb:
        log.warning(
            "Only %.1f GB of memory available (minimum recommended: %.1f GB). "
            "Consider reducing the collection scope or increasing available memory.",
            avail_gb,
            min_gb,
        )


def _dictfetch(cursor: Any, sql: str, params: tuple | None = None) -> list[dict[str, Any]]:
    cursor.execute(sql, params)
    cols = [d.name for d in cursor.description]
    rows = []
    while True:
        chunk = cursor.fetchmany(1000)
        if not chunk:
            break
        rows.extend(dict(zip(cols, row)) for row in chunk)
    return rows


def _safe_isoformat(dt: Any) -> str | None:
    if dt is None:
        return None
    if hasattr(dt, "isoformat"):
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.isoformat()
    return str(dt)


def fetch_query_metadata(
    cursor: Any,
    lookback_hours: int,
    lag_hours: int,
    max_queries: int,
) -> list[dict[str, Any]]:
    """Fetch query execution metadata from sys_query_history."""
    return _dictfetch(
        cursor,
        f"""
        SELECT
            query_id,
            start_time,
            end_time,
            status,
            user_id,
            database_name,
            elapsed_time
        FROM sys_query_history
        WHERE start_time >= DATEADD(hour, -{lookback_hours}, GETDATE())
          AND start_time < DATEADD(hour, -{lag_hours}, GETDATE())
          AND status = 'success'
        ORDER BY start_time
        LIMIT {max_queries}
        """,  # ← SUBSTITUTE: add AND database_name = 'mydb' to narrow scope
    )


def fetch_query_texts_batch(cursor: Any, query_ids: list[int]) -> dict[int, str]:
    """Batch-fetch and assemble multi-row query texts for a list of query_ids."""
    if not query_ids:
        return {}

    # Build an inline ID list for the IN clause to avoid large parameter arrays
    id_list = ", ".join(str(qid) for qid in query_ids)
    rows = _dictfetch(
        cursor,
        f"""
        SELECT
            query_id,
            LISTAGG(
                CASE WHEN LEN(text) <= 200 THEN text ELSE LEFT(text, 200) END,
                ''
            ) WITHIN GROUP (ORDER BY sequence) AS query_text
        FROM sys_querytext
        WHERE query_id IN ({id_list})
        GROUP BY query_id
        """,
    )
    return {r["query_id"]: r["query_text"] for r in rows if r.get("query_text")}


def collect(
    host: str,
    db: str,
    user: str,
    password: str,
    manifest_path: str = "manifest_query_logs.json",
    port: int = 5439,
    lookback_hours: int = LOOKBACK_HOURS,
    lookback_lag_hours: int = LOOKBACK_LAG_HOURS,
    batch_size: int = BATCH_SIZE,
    max_queries: int = MAX_QUERIES,
) -> list[dict[str, Any]]:
    """Connect to Redshift, collect query logs, write a JSON manifest, and return entries."""
    _check_available_memory()
    collected_at = datetime.now(timezone.utc).isoformat()

    conn = psycopg2.connect(
        host=host, port=port, dbname=db, user=user, password=password, connect_timeout=30,
    )
    try:
        with conn.cursor() as cursor:
            query_meta = fetch_query_metadata(cursor, lookback_hours, lookback_lag_hours, max_queries)
            log.info("Retrieved %d query metadata rows", len(query_meta))

            # Batch-fetch texts to avoid enormous single queries
            query_ids = [r["query_id"] for r in query_meta]
            text_map: dict[int, str] = {}
            for i in range(0, len(query_ids), batch_size):
                batch = query_ids[i : i + batch_size]
|
||||
text_map.update(fetch_query_texts_batch(cursor, batch))
|
||||
log.debug("Fetched texts for batch %d–%d", i, i + len(batch))
|
||||
finally:
|
||||
conn.close()
|
||||
|
||||
entries: list[dict[str, Any]] = []
|
||||
for row in query_meta:
|
||||
qid = row["query_id"]
|
||||
query_text = text_map.get(qid, "")
|
||||
if not query_text.strip():
|
||||
continue # ← SUBSTITUTE: decide whether to push rows with missing text
|
||||
|
||||
entry = {
|
||||
"query_id": str(qid),
|
||||
"query_text": query_text,
|
||||
"start_time": _safe_isoformat(row.get("start_time")),
|
||||
"end_time": _safe_isoformat(row.get("end_time")),
|
||||
"user": str(row.get("user_id")) if row.get("user_id") is not None else None,
|
||||
"database_name": row.get("database_name"),
|
||||
"elapsed_time_us": row.get("elapsed_time"),
|
||||
}
|
||||
entries.append(entry)
|
||||
|
||||
log.info("Collected %d query log entries", len(entries))
|
||||
|
||||
manifest = {
|
||||
"log_type": LOG_TYPE,
|
||||
"collected_at": collected_at,
|
||||
"lookback_hours": lookback_hours,
|
||||
"lookback_lag_hours": lookback_lag_hours,
|
||||
"query_log_count": len(entries),
|
||||
"entries": entries,
|
||||
}
|
||||
with open(manifest_path, "w") as fh:
|
||||
json.dump(manifest, fh, indent=2)
|
||||
log.info("Manifest written to %s (%d entries)", manifest_path, len(entries))
|
||||
|
||||
return entries
|
||||
|
||||
|
||||
def main() -> None:
|
||||
parser = argparse.ArgumentParser(description="Collect Redshift query logs to a manifest file")
|
||||
parser.add_argument("--host", default=os.getenv("REDSHIFT_HOST")) # ← SUBSTITUTE
|
||||
parser.add_argument("--db", default=os.getenv("REDSHIFT_DB")) # ← SUBSTITUTE
|
||||
parser.add_argument("--user", default=os.getenv("REDSHIFT_USER")) # ← SUBSTITUTE
|
||||
parser.add_argument("--password", default=os.getenv("REDSHIFT_PASSWORD")) # ← SUBSTITUTE
|
||||
parser.add_argument("--port", type=int, default=int(os.getenv("REDSHIFT_PORT", "5439")))
|
||||
parser.add_argument("--lookback-hours", type=int, default=LOOKBACK_HOURS)
|
||||
parser.add_argument("--lookback-lag-hours", type=int, default=LOOKBACK_LAG_HOURS)
|
||||
parser.add_argument("--batch-size", type=int, default=BATCH_SIZE)
|
||||
parser.add_argument("--max-queries", type=int, default=MAX_QUERIES)
|
||||
parser.add_argument("--manifest", default="manifest_query_logs.json")
|
||||
args = parser.parse_args()
|
||||
|
||||
required = ["host", "db", "user", "password"]
|
||||
missing = [k for k in required if getattr(args, k) is None]
|
||||
if missing:
|
||||
parser.error(f"Missing required arguments/env vars: {missing}")
|
||||
|
||||
collect(
|
||||
host=args.host,
|
||||
db=args.db,
|
||||
user=args.user,
|
||||
password=args.password,
|
||||
manifest_path=args.manifest,
|
||||
port=args.port,
|
||||
lookback_hours=args.lookback_hours,
|
||||
lookback_lag_hours=args.lookback_lag_hours,
|
||||
batch_size=args.batch_size,
|
||||
max_queries=args.max_queries,
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,178 @@
"""
Redshift — Lineage Push (push-only)
=====================================
Reads a JSON manifest file produced by collect_lineage.py and pushes the lineage
events to Monte Carlo via the push ingestion API, with configurable batching to
keep compressed payloads under 1 MB.

Substitution points (search for "← SUBSTITUTE"):
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Redshift connection in Monte Carlo
- PUSH_BATCH_SIZE : number of events per API call (default 500)

Prerequisites:
    pip install pycarlo
"""

from __future__ import annotations

import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from typing import Any

from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
    LineageAssetRef,
    LineageEvent,
)

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

RESOURCE_TYPE = "redshift"
DEFAULT_BATCH_SIZE = 500  # ← SUBSTITUTE: conservative default to stay under 1 MB compressed


def _ref_from_dict(d: dict[str, Any]) -> LineageAssetRef:
    """Reconstruct a LineageAssetRef from a manifest dict."""
    return LineageAssetRef(
        type="TABLE",
        name=d["asset_name"],
        database=d.get("database", ""),
        schema=d.get("schema", ""),
    )


def _event_from_dict(d: dict[str, Any]) -> LineageEvent:
    """Reconstruct a LineageEvent from a manifest dict."""
    sources = [_ref_from_dict(s) for s in d.get("sources", [])]
    destination = _ref_from_dict(d["destination"])
    return LineageEvent(
        sources=sources,
        destination=destination,
    )


def push(
    manifest_path: str,
    resource_uuid: str,
    key_id: str,
    key_token: str,
    batch_size: int = DEFAULT_BATCH_SIZE,
) -> dict[str, Any]:
    """Read a collect manifest and push lineage events to Monte Carlo in batches.

    Returns a summary dict with invocation IDs and counts.
    """
    with open(manifest_path) as fh:
        manifest = json.load(fh)

    event_dicts: list[dict[str, Any]] = manifest["events"]
    events = [_event_from_dict(d) for d in event_dicts]
    log.info("Loaded %d lineage events from %s", len(events), manifest_path)

    if not events:
        log.info("No lineage events to push.")
        summary = {
            "resource_uuid": resource_uuid,
            "resource_type": RESOURCE_TYPE,
            "invocation_ids": [],
            "pushed_at": datetime.now(timezone.utc).isoformat(),
            "event_count": 0,
            "batch_count": 0,
            "batch_size": batch_size,
        }
        push_manifest_path = manifest_path.replace(".json", "_push_result.json")
        with open(push_manifest_path, "w") as fh:
            json.dump(summary, fh, indent=2)
        return summary

    # Split into batches
    batches = []
    for i in range(0, len(events), batch_size):
        batches.append(events[i : i + batch_size])
    total_batches = len(batches)

    def _push_batch(batch: list, batch_num: int) -> str | None:
        """Push a single batch using a dedicated Session (thread-safe)."""
        log.info("Pushing batch %d/%d (%d events) ...", batch_num, total_batches, len(batch))
        client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
        service = IngestionService(mc_client=client)
        result = service.send_lineage(
            resource_uuid=resource_uuid,
            resource_type=RESOURCE_TYPE,
            events=batch,
        )
        invocation_id = service.extract_invocation_id(result)
        if invocation_id:
            log.info("Batch %d: invocation_id=%s", batch_num, invocation_id)
        return invocation_id

    # Push batches in parallel (each thread gets its own pycarlo Session)
    max_workers = min(4, total_batches)
    invocation_ids: list[str | None] = [None] * total_batches

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(_push_batch, batch, i + 1): i
            for i, batch in enumerate(batches)
        }
        for future in as_completed(futures):
            idx = futures[future]
            try:
                invocation_ids[idx] = future.result()
            except Exception as exc:
                log.error("ERROR pushing batch %d: %s", idx + 1, exc)
                raise

    log.info("All %d batches pushed (%d workers)", total_batches, max_workers)

    summary = {
        "resource_uuid": resource_uuid,
        "resource_type": RESOURCE_TYPE,
        "invocation_ids": invocation_ids,
        "pushed_at": datetime.now(timezone.utc).isoformat(),
        "event_count": len(events),
        "batch_count": total_batches,
        "batch_size": batch_size,
        "lookback_hours": manifest.get("lookback_hours"),
        "queries_scanned": manifest.get("queries_scanned"),
    }

    push_manifest_path = manifest_path.replace(".json", "_push_result.json")
    with open(push_manifest_path, "w") as fh:
        json.dump(summary, fh, indent=2)
    log.info("Push result written to %s", push_manifest_path)

    return summary


def main() -> None:
    parser = argparse.ArgumentParser(description="Push Redshift lineage to Monte Carlo from manifest")
    parser.add_argument("--manifest", default="manifest_lineage.json")
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
    args = parser.parse_args()

    required = ["resource_uuid", "key_id", "key_token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    push(
        manifest_path=args.manifest,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
    )


if __name__ == "__main__":
    main()
@@ -0,0 +1,178 @@
"""
Redshift — Metadata Push (push-only)
======================================
Reads a JSON manifest file produced by collect_metadata.py and pushes the assets
to Monte Carlo via the push ingestion API, with configurable batching to keep
compressed payloads under 1 MB.

Substitution points (search for "← SUBSTITUTE"):
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Redshift connection in Monte Carlo
- PUSH_BATCH_SIZE : number of assets per API call (default 500)

Prerequisites:
    pip install pycarlo
"""

from __future__ import annotations

import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from typing import Any

from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
    AssetField,
    AssetFreshness,
    AssetMetadata,
    AssetVolume,
    RelationalAsset,
)

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

RESOURCE_TYPE = "redshift"
DEFAULT_BATCH_SIZE = 500  # ← SUBSTITUTE: conservative default to stay under 1 MB compressed


def _asset_from_dict(d: dict[str, Any]) -> RelationalAsset:
    """Reconstruct a RelationalAsset from a manifest dict."""
    fields = [
        AssetField(
            name=f["name"],
            type=f.get("type"),
            description=f.get("description"),
        )
        for f in d.get("fields", [])
    ]

    volume = None
    if d.get("row_count") is not None or d.get("byte_count") is not None:
        volume = AssetVolume(
            row_count=d.get("row_count"),
            byte_count=d.get("byte_count"),
        )

    freshness = None
    if d.get("last_updated") is not None:
        freshness = AssetFreshness(last_update_time=d.get("last_updated"))

    return RelationalAsset(
        type=d.get("asset_type", "TABLE"),
        metadata=AssetMetadata(
            name=d["asset_name"],
            database=d["database"],  # ← SUBSTITUTE: use database as top-level namespace
            schema=d["schema"],
            description=d.get("description"),
        ),
        fields=fields,
        volume=volume,
        freshness=freshness,
    )


def push(
    manifest_path: str,
    resource_uuid: str,
    key_id: str,
    key_token: str,
    batch_size: int = DEFAULT_BATCH_SIZE,
) -> dict[str, Any]:
    """Read a collect manifest and push assets to Monte Carlo in batches.

    Returns a summary dict with invocation IDs and counts.
    """
    with open(manifest_path) as fh:
        manifest = json.load(fh)

    asset_dicts: list[dict[str, Any]] = manifest["assets"]
    assets = [_asset_from_dict(d) for d in asset_dicts]
    log.info("Loaded %d assets from %s", len(assets), manifest_path)

    # Split into batches; max(..., 1) ensures at least one (possibly empty) batch is pushed
    batches = []
    for i in range(0, max(len(assets), 1), batch_size):
        batches.append(assets[i : i + batch_size])
    total_batches = len(batches)

    def _push_batch(batch: list, batch_num: int) -> str | None:
        """Push a single batch using a dedicated Session (thread-safe)."""
        client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
        service = IngestionService(mc_client=client)
        result = service.send_metadata(
            resource_uuid=resource_uuid,
            resource_type=RESOURCE_TYPE,
            events=batch,
        )
        invocation_id = service.extract_invocation_id(result)
        log.info("Pushed batch %d/%d (%d assets) — invocation_id=%s", batch_num, total_batches, len(batch), invocation_id)
        return invocation_id

    # Push batches in parallel (each thread gets its own pycarlo Session)
    max_workers = min(4, total_batches)
    invocation_ids: list[str | None] = [None] * total_batches

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(_push_batch, batch, i + 1): i
            for i, batch in enumerate(batches)
        }
        for future in as_completed(futures):
            idx = futures[future]
            try:
                invocation_ids[idx] = future.result()
            except Exception as exc:
                log.error("ERROR pushing batch %d: %s", idx + 1, exc)
                raise

    log.info("All %d batches pushed (%d workers)", total_batches, max_workers)

    summary = {
        "resource_uuid": resource_uuid,
        "resource_type": RESOURCE_TYPE,
        "invocation_ids": invocation_ids,
        "pushed_at": datetime.now(timezone.utc).isoformat(),
        "asset_count": len(assets),
        "batch_count": total_batches,
        "batch_size": batch_size,
    }

    push_manifest_path = manifest_path.replace(".json", "_push_result.json")
    with open(push_manifest_path, "w") as fh:
        json.dump(summary, fh, indent=2)
    log.info("Push result written to %s", push_manifest_path)

    return summary


def main() -> None:
    parser = argparse.ArgumentParser(description="Push Redshift metadata to Monte Carlo from manifest")
    parser.add_argument("--manifest", default="manifest_metadata.json")
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
    args = parser.parse_args()

    required = ["resource_uuid", "key_id", "key_token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    push(
        manifest_path=args.manifest,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
    )


if __name__ == "__main__":
    main()
@@ -0,0 +1,196 @@
"""
Redshift — Query Log Push (push-only)
=======================================
Reads a JSON manifest file produced by collect_query_logs.py and pushes the query
log entries to Monte Carlo via the push ingestion API, with configurable batching
to keep compressed payloads under 1 MB.

Substitution points (search for "← SUBSTITUTE"):
- MCD_INGEST_ID / MCD_INGEST_TOKEN : Monte Carlo API credentials
- MCD_RESOURCE_UUID : UUID of the Redshift connection in Monte Carlo
- PUSH_BATCH_SIZE : number of entries per API call (default 100)

Prerequisites:
    pip install pycarlo python-dateutil
"""

from __future__ import annotations

import argparse
import json
import logging
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from typing import Any

from dateutil.parser import isoparse
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import QueryLogEntry

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

LOG_TYPE = "redshift"
DEFAULT_BATCH_SIZE = 100  # ← SUBSTITUTE: conservative default to stay under 1 MB compressed

# Truncate query_text longer than this to prevent 413 errors.
# Some SQL statements (e.g., generated by BI tools) can be 100KB+ and blow up
# compressed payloads even at small batch sizes.
_MAX_QUERY_TEXT_LEN = 10_000


def _build_query_log_entries(entry_dicts: list[dict[str, Any]]) -> list[QueryLogEntry]:
    """Convert manifest query dicts into QueryLogEntry objects."""
    entries = []
    truncated = 0
    for d in entry_dicts:
        query_text = d.get("query_text") or ""

        # Truncate very long SQL to prevent 413 Request Too Large
        if len(query_text) > _MAX_QUERY_TEXT_LEN:
            query_text = query_text[:_MAX_QUERY_TEXT_LEN] + "... [TRUNCATED]"
            truncated += 1

        extra = {}
        if d.get("database_name") is not None:
            extra["database_name"] = d["database_name"]
        if d.get("elapsed_time_us") is not None:
            extra["elapsed_time_us"] = d["elapsed_time_us"]

        start_time = d.get("start_time")
        end_time = d.get("end_time")

        entries.append(
            QueryLogEntry(
                query_id=d.get("query_id"),
                query_text=query_text,
                start_time=isoparse(start_time) if start_time else None,
                end_time=isoparse(end_time) if end_time else None,
                user=d.get("user"),
                extra=extra or None,
            )
        )
    if truncated:
        log.info("Truncated %d query text(s) exceeding %d chars", truncated, _MAX_QUERY_TEXT_LEN)
    return entries


def push(
    manifest_path: str,
    resource_uuid: str,
    key_id: str,
    key_token: str,
    batch_size: int = DEFAULT_BATCH_SIZE,
) -> dict[str, Any]:
    """Read a collect manifest and push query log entries to Monte Carlo in batches.

    Returns a summary dict with invocation IDs and counts.
    """
    with open(manifest_path) as fh:
        manifest = json.load(fh)

    entry_dicts: list[dict[str, Any]] = manifest["entries"]
    entries = _build_query_log_entries(entry_dicts)
    log.info("Loaded %d query log entries from %s", len(entries), manifest_path)

    if not entries:
        log.info("No query log entries to push.")
        summary = {
            "resource_uuid": resource_uuid,
            "log_type": LOG_TYPE,
            "invocation_ids": [],
            "pushed_at": datetime.now(timezone.utc).isoformat(),
            "query_log_count": 0,
            "batch_count": 0,
            "batch_size": batch_size,
        }
        push_manifest_path = manifest_path.replace(".json", "_push_result.json")
        with open(push_manifest_path, "w") as fh:
            json.dump(summary, fh, indent=2)
        return summary

    # Split into batches
    batches = []
    for i in range(0, len(entries), batch_size):
        batches.append(entries[i : i + batch_size])
    total_batches = len(batches)

    def _push_batch(batch: list, batch_num: int) -> str | None:
        """Push a single batch using a dedicated Session (thread-safe)."""
        client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
        service = IngestionService(mc_client=client)
        result = service.send_query_logs(
            resource_uuid=resource_uuid,
            log_type=LOG_TYPE,
            events=batch,
        )
        invocation_id = service.extract_invocation_id(result)
        log.info("Pushed batch %d/%d (%d entries) — invocation_id=%s", batch_num, total_batches, len(batch), invocation_id)
        return invocation_id

    # Push batches in parallel (each thread gets its own pycarlo Session)
    max_workers = min(4, total_batches)
    invocation_ids: list[str | None] = [None] * total_batches

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(_push_batch, batch, i + 1): i
            for i, batch in enumerate(batches)
        }
        for future in as_completed(futures):
            idx = futures[future]
            try:
                invocation_ids[idx] = future.result()
            except Exception as exc:
                log.error("ERROR pushing batch %d: %s", idx + 1, exc)
                raise

    log.info("All %d batches pushed (%d workers)", total_batches, max_workers)

    summary = {
        "resource_uuid": resource_uuid,
        "log_type": LOG_TYPE,
        "invocation_ids": invocation_ids,
        "pushed_at": datetime.now(timezone.utc).isoformat(),
        "query_log_count": len(entries),
        "batch_count": total_batches,
        "batch_size": batch_size,
        "lookback_hours": manifest.get("lookback_hours"),
        "lookback_lag_hours": manifest.get("lookback_lag_hours"),
    }

    push_manifest_path = manifest_path.replace(".json", "_push_result.json")
    with open(push_manifest_path, "w") as fh:
        json.dump(summary, fh, indent=2)
    log.info("Push result written to %s", push_manifest_path)

    return summary


def main() -> None:
    parser = argparse.ArgumentParser(description="Push Redshift query logs to Monte Carlo from manifest")
    parser.add_argument("--manifest", default="manifest_query_logs.json")
    parser.add_argument("--resource-uuid", default=os.getenv("MCD_RESOURCE_UUID"))
    parser.add_argument("--key-id", default=os.getenv("MCD_INGEST_ID"))
    parser.add_argument("--key-token", default=os.getenv("MCD_INGEST_TOKEN"))
    parser.add_argument("--batch-size", type=int, default=DEFAULT_BATCH_SIZE)
    args = parser.parse_args()

    required = ["resource_uuid", "key_id", "key_token"]
    missing = [k for k in required if getattr(args, k) is None]
    if missing:
        parser.error(f"Missing required arguments/env vars: {missing}")

    push(
        manifest_path=args.manifest,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
    )


if __name__ == "__main__":
    main()
@@ -0,0 +1,154 @@
#!/usr/bin/env python3
"""
Collect lineage from Snowflake and push it to Monte Carlo — combined.

Imports ``collect()`` from ``collect_lineage`` and ``push()`` from
``push_lineage`` and runs both in sequence.

Substitution points
-------------------
- SNOWFLAKE_ACCOUNT (env) / --account (CLI) : Snowflake account identifier
- SNOWFLAKE_USER (env) / --user (CLI) : Snowflake username
- SNOWFLAKE_PASSWORD (env) / --password (CLI) : Snowflake password
- SNOWFLAKE_WAREHOUSE (env) / --warehouse (CLI) : Snowflake virtual warehouse
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID for this connection

Prerequisites
-------------
pip install pycarlo snowflake-connector-python

Usage (table-level):
    python collect_and_push_lineage.py \\
        --account <SNOWFLAKE_ACCOUNT> \\
        --user <SNOWFLAKE_USER> \\
        --password <SNOWFLAKE_PASSWORD> \\
        --warehouse <SNOWFLAKE_WAREHOUSE> \\
        --key-id <MCD_INGEST_ID> \\
        --key-token <MCD_INGEST_TOKEN> \\
        --resource-uuid <MCD_RESOURCE_UUID>

Usage (column-level):
    python collect_and_push_lineage.py ... --column-lineage
"""

from __future__ import annotations

import argparse
import os

from collect_lineage import collect, _LOOKBACK_HOURS
from push_lineage import push, _BATCH_SIZE


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Collect Snowflake lineage from ACCOUNT_USAGE and push to Monte Carlo",
    )
    parser.add_argument(
        "--account",
        default=os.environ.get("SNOWFLAKE_ACCOUNT"),
        help="Snowflake account identifier (env: SNOWFLAKE_ACCOUNT)",
    )
    parser.add_argument(
        "--user",
        default=os.environ.get("SNOWFLAKE_USER"),
        help="Snowflake username (env: SNOWFLAKE_USER)",
    )
    parser.add_argument(
        "--password",
        default=os.environ.get("SNOWFLAKE_PASSWORD"),
        help="Snowflake password (env: SNOWFLAKE_PASSWORD)",
    )
    parser.add_argument(
        "--warehouse",
        default=os.environ.get("SNOWFLAKE_WAREHOUSE"),
        help="Snowflake virtual warehouse (env: SNOWFLAKE_WAREHOUSE)",
    )
    parser.add_argument(
        "--key-id",
        default=os.environ.get("MCD_INGEST_ID"),
        help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",
    )
    parser.add_argument(
        "--key-token",
        default=os.environ.get("MCD_INGEST_TOKEN"),
        help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",
    )
    parser.add_argument(
        "--resource-uuid",
        default=os.environ.get("MCD_RESOURCE_UUID"),
        help="Monte Carlo resource UUID for this Snowflake connection (env: MCD_RESOURCE_UUID)",
    )
    parser.add_argument(
        "--lookback-hours",
        type=int,
        default=_LOOKBACK_HOURS,
        help=f"Hours of QUERY_HISTORY to scan (default: {_LOOKBACK_HOURS})",
    )
    parser.add_argument(
        "--column-lineage",
        action="store_true",
        help="Push column-level lineage instead of table-level",
    )
    parser.add_argument(
        "--output-file",
        default="lineage_output.json",
        help="Path for the intermediate collect manifest (default: lineage_output.json)",
    )
    parser.add_argument(
        "--push-result-file",
        default="lineage_push_result.json",
        help="Path to write the push result (default: lineage_push_result.json)",
    )
    parser.add_argument(
        "--batch-size",
        type=int,
        default=_BATCH_SIZE,
        help=f"Max events per push batch (default: {_BATCH_SIZE})",
    )
    args = parser.parse_args()

    missing = [
        name
        for name, val in [
            ("--account", args.account),
            ("--user", args.user),
            ("--password", args.password),
            ("--warehouse", args.warehouse),
            ("--key-id", args.key_id),
            ("--key-token", args.key_token),
            ("--resource-uuid", args.resource_uuid),
        ]
        if not val
    ]
    if missing:
        parser.error(f"Missing required arguments: {', '.join(missing)}")

    # Step 1: Collect
    collect(
        account=args.account,
        user=args.user,
        password=args.password,
        warehouse=args.warehouse,
        lookback_hours=args.lookback_hours,
        column_lineage=args.column_lineage,
        output_file=args.output_file,
    )

    # Step 2: Push
    push(
        input_file=args.output_file,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
        output_file=args.push_result_file,
    )

    print("Done.")


if __name__ == "__main__":
    main()
@@ -0,0 +1,137 @@
#!/usr/bin/env python3
"""
Collect table metadata from Snowflake and push it to Monte Carlo — combined.

Imports ``collect()`` from ``collect_metadata`` and ``push()`` from
``push_metadata``, and runs both in sequence.

Substitution points
-------------------
- SNOWFLAKE_ACCOUNT (env) / --account (CLI) : Snowflake account identifier (e.g. xy12345.us-east-1)
- SNOWFLAKE_USER (env) / --user (CLI) : Snowflake username
- SNOWFLAKE_PASSWORD (env) / --password (CLI) : Snowflake password
- SNOWFLAKE_WAREHOUSE (env) / --warehouse (CLI) : Snowflake virtual warehouse
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID for this connection

Prerequisites
-------------
pip install pycarlo snowflake-connector-python

Usage
-----
python collect_and_push_metadata.py \\
    --account <SNOWFLAKE_ACCOUNT> \\
    --user <SNOWFLAKE_USER> \\
    --password <SNOWFLAKE_PASSWORD> \\
    --warehouse <SNOWFLAKE_WAREHOUSE> \\
    --key-id <MCD_INGEST_ID> \\
    --key-token <MCD_INGEST_TOKEN> \\
    --resource-uuid <MCD_RESOURCE_UUID>
"""

import argparse
import os

from collect_metadata import collect
from push_metadata import push, _BATCH_SIZE


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Collect Snowflake table metadata and push to Monte Carlo",
    )
    parser.add_argument(
        "--account",
        default=os.environ.get("SNOWFLAKE_ACCOUNT"),
        help="Snowflake account identifier, e.g. xy12345.us-east-1 (env: SNOWFLAKE_ACCOUNT)",  # ← SUBSTITUTE
    )
    parser.add_argument(
        "--user",
        default=os.environ.get("SNOWFLAKE_USER"),
        help="Snowflake username (env: SNOWFLAKE_USER)",  # ← SUBSTITUTE
    )
    parser.add_argument(
        "--password",
        default=os.environ.get("SNOWFLAKE_PASSWORD"),
        help="Snowflake password (env: SNOWFLAKE_PASSWORD)",  # ← SUBSTITUTE
    )
    parser.add_argument(
        "--warehouse",
        default=os.environ.get("SNOWFLAKE_WAREHOUSE"),
        help="Snowflake virtual warehouse (env: SNOWFLAKE_WAREHOUSE)",  # ← SUBSTITUTE
    )
    parser.add_argument(
        "--key-id",
        default=os.environ.get("MCD_INGEST_ID"),
        help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",
    )
    parser.add_argument(
        "--key-token",
        default=os.environ.get("MCD_INGEST_TOKEN"),
        help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",
    )
    parser.add_argument(
        "--resource-uuid",
        default=os.environ.get("MCD_RESOURCE_UUID"),
        help="Monte Carlo resource UUID for this Snowflake connection (env: MCD_RESOURCE_UUID)",
    )
    parser.add_argument(
        "--output-file",
        default="metadata_output.json",
        help="Path for the intermediate collect manifest (default: metadata_output.json)",
    )
    parser.add_argument(
        "--push-result-file",
        default="metadata_push_result.json",
        help="Path to write the push result (default: metadata_push_result.json)",
    )
    parser.add_argument(
        "--batch-size",
        type=int,
        default=_BATCH_SIZE,
        help=f"Max assets per push batch (default: {_BATCH_SIZE})",
    )
    args = parser.parse_args()

    missing = [
        name
        for name, val in [
            ("--account", args.account),
            ("--user", args.user),
            ("--password", args.password),
            ("--warehouse", args.warehouse),
            ("--key-id", args.key_id),
            ("--key-token", args.key_token),
            ("--resource-uuid", args.resource_uuid),
        ]
        if not val
    ]
    if missing:
        parser.error(f"Missing required arguments: {', '.join(missing)}")

    # Step 1: Collect
    collect(
        account=args.account,
        user=args.user,
        password=args.password,
        warehouse=args.warehouse,
        output_file=args.output_file,
    )

    # Step 2: Push
    push(
        input_file=args.output_file,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
        output_file=args.push_result_file,
    )

    print("Done.")


if __name__ == "__main__":
    main()
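The collect and push steps communicate only through the intermediate JSON manifest on disk. A minimal sketch of that handoff, using stub stand-ins for the real `collect()`/`push()` and assuming the manifest shape produced by `collect_metadata.py` (`resource_type`, `collected_at`, `assets`):

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Stand-in for collect(): write a manifest in the shape collect_metadata.py produces.
manifest = {
    "resource_type": "snowflake",
    "collected_at": datetime.now(tz=timezone.utc).isoformat(),
    "assets": [],  # the real script appends one dict per table here
}
out = Path(tempfile.mkdtemp()) / "metadata_output.json"
out.write_text(json.dumps(manifest, indent=2))

# Stand-in for push(): read the same file back before batching to Monte Carlo.
loaded = json.loads(out.read_text())
print(loaded["resource_type"])  # snowflake
```

Writing the manifest to disk between the steps means a failed push can be retried from the saved file without re-querying Snowflake.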
@@ -0,0 +1,137 @@
#!/usr/bin/env python3
"""
Collect query logs from Snowflake and push them to Monte Carlo — combined.

Imports ``collect()`` from ``collect_query_logs`` and ``push()`` from
``push_query_logs``, and runs both in sequence.

Substitution points
-------------------
- SNOWFLAKE_ACCOUNT (env) / --account (CLI) : Snowflake account identifier
- SNOWFLAKE_USER (env) / --user (CLI) : Snowflake username
- SNOWFLAKE_PASSWORD (env) / --password (CLI) : Snowflake password
- SNOWFLAKE_WAREHOUSE (env) / --warehouse (CLI) : Snowflake virtual warehouse
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID for this connection

Prerequisites
-------------
pip install pycarlo snowflake-connector-python

Usage
-----
python collect_and_push_query_logs.py \\
    --account <SNOWFLAKE_ACCOUNT> \\
    --user <SNOWFLAKE_USER> \\
    --password <SNOWFLAKE_PASSWORD> \\
    --warehouse <SNOWFLAKE_WAREHOUSE> \\
    --key-id <MCD_INGEST_ID> \\
    --key-token <MCD_INGEST_TOKEN> \\
    --resource-uuid <MCD_RESOURCE_UUID>
"""

import argparse
import os

from collect_query_logs import collect
from push_query_logs import push, _BATCH_SIZE


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Collect Snowflake query logs from ACCOUNT_USAGE and push to Monte Carlo",
    )
    parser.add_argument(
        "--account",
        default=os.environ.get("SNOWFLAKE_ACCOUNT"),
        help="Snowflake account identifier, e.g. xy12345.us-east-1 (env: SNOWFLAKE_ACCOUNT)",  # ← SUBSTITUTE
    )
    parser.add_argument(
        "--user",
        default=os.environ.get("SNOWFLAKE_USER"),
        help="Snowflake username (env: SNOWFLAKE_USER)",
    )
    parser.add_argument(
        "--password",
        default=os.environ.get("SNOWFLAKE_PASSWORD"),
        help="Snowflake password (env: SNOWFLAKE_PASSWORD)",
    )
    parser.add_argument(
        "--warehouse",
        default=os.environ.get("SNOWFLAKE_WAREHOUSE"),
        help="Snowflake virtual warehouse (env: SNOWFLAKE_WAREHOUSE)",  # ← SUBSTITUTE
    )
    parser.add_argument(
        "--key-id",
        default=os.environ.get("MCD_INGEST_ID"),
        help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",
    )
    parser.add_argument(
        "--key-token",
        default=os.environ.get("MCD_INGEST_TOKEN"),
        help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",
    )
    parser.add_argument(
        "--resource-uuid",
        default=os.environ.get("MCD_RESOURCE_UUID"),
        help="Monte Carlo resource UUID for this Snowflake connection (env: MCD_RESOURCE_UUID)",
    )
    parser.add_argument(
        "--output-file",
        default="query_logs_output.json",
        help="Path for the intermediate collect manifest (default: query_logs_output.json)",
    )
    parser.add_argument(
        "--push-result-file",
        default="query_logs_push_result.json",
        help="Path to write the push result (default: query_logs_push_result.json)",
    )
    parser.add_argument(
        "--batch-size",
        type=int,
        default=_BATCH_SIZE,
        help=f"Max entries per push batch (default: {_BATCH_SIZE})",
    )
    args = parser.parse_args()

    missing = [
        name
        for name, val in [
            ("--account", args.account),
            ("--user", args.user),
            ("--password", args.password),
            ("--warehouse", args.warehouse),
            ("--key-id", args.key_id),
            ("--key-token", args.key_token),
            ("--resource-uuid", args.resource_uuid),
        ]
        if not val
    ]
    if missing:
        parser.error(f"Missing required arguments: {', '.join(missing)}")

    # Step 1: Collect
    collect(
        account=args.account,
        user=args.user,
        password=args.password,
        warehouse=args.warehouse,
        output_file=args.output_file,
    )

    # Step 2: Push
    push(
        input_file=args.output_file,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
        output_file=args.push_result_file,
    )

    print("Done.")


if __name__ == "__main__":
    main()
@@ -0,0 +1,349 @@
#!/usr/bin/env python3
"""
Collect table and column lineage from Snowflake — collection only.

Queries ACCOUNT_USAGE for DML/DDL statements in the last 24 hours, parses each
QUERY_TEXT with regex to extract source and destination tables, then writes the
resulting lineage edges to a JSON manifest file.

Can be run standalone via CLI or imported (use the ``collect()`` function).

Note: ACCOUNT_USAGE views have an approximate latency of 45 minutes, so very
recent queries may not yet appear.

Substitution points
-------------------
- SNOWFLAKE_ACCOUNT (env) / --account (CLI) : Snowflake account identifier
- SNOWFLAKE_USER (env) / --user (CLI) : Snowflake username
- SNOWFLAKE_PASSWORD (env) / --password (CLI) : Snowflake password
- SNOWFLAKE_WAREHOUSE (env) / --warehouse (CLI) : Snowflake virtual warehouse

Prerequisites
-------------
pip install snowflake-connector-python

Usage (table-level):
    python collect_lineage.py \\
        --account <SNOWFLAKE_ACCOUNT> \\
        --user <SNOWFLAKE_USER> \\
        --password <SNOWFLAKE_PASSWORD> \\
        --warehouse <SNOWFLAKE_WAREHOUSE>

Usage (column-level):
    python collect_lineage.py ... --column-lineage
"""

from __future__ import annotations

import argparse
import json
import os
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone

import snowflake.connector

# ← SUBSTITUTE: set RESOURCE_TYPE to match your Monte Carlo connection type
RESOURCE_TYPE = "snowflake"


def _check_available_memory(min_gb: float = 2.0) -> None:
    """Warn if available memory is below the threshold."""
    try:
        if hasattr(os, "sysconf"):  # Linux / macOS
            page_size = os.sysconf("SC_PAGE_SIZE")
            avail_pages = os.sysconf("SC_AVPHYS_PAGES")
            avail_gb = (page_size * avail_pages) / (1024 ** 3)
        else:
            return  # Windows — skip check
    except (ValueError, OSError):
        return
    if avail_gb < min_gb:
        print(
            f"WARNING: Only {avail_gb:.1f} GB of memory available "
            f"(minimum recommended: {min_gb:.1f} GB). "
            f"Consider reducing the lookback window or increasing available memory."
        )


# Hours to look back in ACCOUNT_USAGE.QUERY_HISTORY
# ← SUBSTITUTE: adjust the lookback window to match your collection cadence
_LOOKBACK_HOURS = 24

# Regex for CTAS: CREATE [OR REPLACE] [TRANSIENT] TABLE [IF NOT EXISTS] [db.][schema.]table AS SELECT
_CTAS_RE = re.compile(
    r"CREATE\s+(?:OR\s+REPLACE\s+)?(?:TRANSIENT\s+)?TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?"
    r"(?:(?P<dest_db>\w+)\.)?(?:(?P<dest_schema>\w+)\.)?(?P<dest_table>\w+)"
    r".*?AS\s+SELECT\s+(?P<select_cols>.+?)\s+FROM\s+"
    r"(?:(?P<src_db>\w+)\.)?(?:(?P<src_schema>\w+)\.)?(?P<src_table>\w+)",
    re.IGNORECASE | re.DOTALL,
)

# Regex for INSERT INTO [db.][schema.]table SELECT ... FROM [db.][schema.]table
_INSERT_RE = re.compile(
    r"INSERT\s+(?:INTO|OVERWRITE)\s+"
    r"(?:(?P<dest_db>\w+)\.)?(?:(?P<dest_schema>\w+)\.)?(?P<dest_table>\w+)"
    r".*?SELECT\s+(?P<select_cols>.+?)\s+FROM\s+"
    r"(?:(?P<src_db>\w+)\.)?(?:(?P<src_schema>\w+)\.)?(?P<src_table>\w+)",
    re.IGNORECASE | re.DOTALL,
)

# Regex for CREATE [OR REPLACE] VIEW [db.][schema.]view AS SELECT ... FROM ...
_CREATE_VIEW_RE = re.compile(
    r"CREATE\s+(?:OR\s+REPLACE\s+)?(?:SECURE\s+)?VIEW\s+"
    r"(?:(?P<dest_db>\w+)\.)?(?:(?P<dest_schema>\w+)\.)?(?P<dest_table>\w+)"
    r".*?AS\s+SELECT\s+(?P<select_cols>.+?)\s+FROM\s+"
    r"(?:(?P<src_db>\w+)\.)?(?:(?P<src_schema>\w+)\.)?(?P<src_table>\w+)",
    re.IGNORECASE | re.DOTALL,
)

# Additional JOIN sources
_JOIN_RE = re.compile(
    r"JOIN\s+(?:(?P<src_db>\w+)\.)?(?:(?P<src_schema>\w+)\.)?(?P<src_table>\w+)",
    re.IGNORECASE,
)

# Simple column alias extraction from SELECT clause
_COL_RE = re.compile(r"(?:(\w+)\.)?(\w+)(?:\s+AS\s+(\w+))?", re.IGNORECASE)
_SQL_KEYWORDS = {
    "FROM", "SELECT", "WHERE", "JOIN", "ON", "AS", "*", "AND", "OR",
    "GROUP", "ORDER", "BY", "HAVING", "LIMIT", "DISTINCT", "CASE", "WHEN",
    "THEN", "ELSE", "END", "NULL", "NOT", "IN", "IS", "BETWEEN",
}


@dataclass
class _LineageEdge:
    dest_db: str
    dest_schema: str
    dest_table: str
    sources: list[tuple[str, str, str]] = field(default_factory=list)
    # col_mappings: (dest_col, src_table, src_col)
    col_mappings: list[tuple[str, str, str]] = field(default_factory=list)


def _parse_select_cols(select_clause: str, src_table: str) -> list[tuple[str, str, str]]:
    mappings = []
    for m in _COL_RE.finditer(select_clause):
        src_col = m.group(2)
        dest_col = m.group(3) or src_col
        if src_col.upper() in _SQL_KEYWORDS:
            continue
        mappings.append((dest_col, src_table, src_col))
    return mappings


def _parse_edges(rows: list[dict]) -> list[_LineageEdge]:
    """Parse QUERY_HISTORY rows into _LineageEdge objects."""
    edges: dict[str, _LineageEdge] = {}

    for row in rows:
        query_text = row.get("QUERY_TEXT") or ""
        default_db = (row.get("DATABASE_NAME") or "").lower()
        sql_clean = re.sub(r"\s+", " ", query_text).strip()

        for pattern in (_CTAS_RE, _INSERT_RE, _CREATE_VIEW_RE):
            m = pattern.search(sql_clean)
            if not m:
                continue

            dest_db = (m.group("dest_db") or default_db).lower()
            dest_schema = (m.group("dest_schema") or "public").lower()
            dest_table = m.group("dest_table").lower()
            src_db = (m.group("src_db") or default_db).lower()
            src_schema = (m.group("src_schema") or "public").lower()
            src_table = m.group("src_table").lower()
            select_cols = m.group("select_cols")

            key = f"{dest_db}.{dest_schema}.{dest_table}"
            if key not in edges:
                edges[key] = _LineageEdge(
                    dest_db=dest_db, dest_schema=dest_schema, dest_table=dest_table
                )

            edge = edges[key]
            src_triple = (src_db, src_schema, src_table)
            if src_triple not in edge.sources:
                edge.sources.append(src_triple)

            for jm in _JOIN_RE.finditer(sql_clean):
                jt = jm.group("src_table").lower()
                jschema = (jm.group("src_schema") or src_schema).lower()
                jdb = (jm.group("src_db") or src_db).lower()
                jp = (jdb, jschema, jt)
                if jp not in edge.sources:
                    edge.sources.append(jp)

            edge.col_mappings.extend(_parse_select_cols(select_cols, src_table))
            break

    return list(edges.values())


def _fetch_query_history(conn, lookback_hours: int) -> list[dict]:
    cursor = conn.cursor()
    cursor.execute(
        f"""
        SELECT QUERY_ID, QUERY_TEXT, START_TIME, END_TIME, USER_NAME, DATABASE_NAME, EXECUTION_STATUS
        FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
        WHERE START_TIME >= DATEADD(hour, -{lookback_hours}, CURRENT_TIMESTAMP())
          AND EXECUTION_STATUS = 'SUCCESS'
          AND QUERY_TYPE IN ('CREATE_TABLE_AS_SELECT', 'INSERT', 'MERGE', 'CREATE_VIEW')
        ORDER BY START_TIME
        LIMIT 50000
        """
        # ← SUBSTITUTE: adjust QUERY_TYPE list, LIMIT, or add a WHERE clause to scope to specific databases
    )
    columns = [col[0] for col in cursor.description]
    rows = []
    while True:
        batch = cursor.fetchmany(1000)
        if not batch:
            break
        rows.extend(dict(zip(columns, row)) for row in batch)
    cursor.close()
    return rows


def collect(
    account: str,
    user: str,
    password: str,
    warehouse: str,
    lookback_hours: int = _LOOKBACK_HOURS,
    column_lineage: bool = False,
    output_file: str = "lineage_output.json",
) -> dict:
    """
    Connect to Snowflake, collect lineage edges, and write a JSON manifest.

    Returns the manifest dict.
    """
    _check_available_memory()
    print(f"Connecting to Snowflake account: {account} ...")
    conn = snowflake.connector.connect(
        account=account,
        user=user,
        password=password,
        warehouse=warehouse,
    )

    print(f"Fetching QUERY_HISTORY for the last {lookback_hours} hour(s) ...")
    rows = _fetch_query_history(conn, lookback_hours)
    conn.close()
    print(f"  Retrieved {len(rows)} qualifying queries.")

    if not rows:
        print("No lineage queries found in the specified window.")
        manifest = {
            "resource_type": RESOURCE_TYPE,
            "collected_at": datetime.now(tz=timezone.utc).isoformat(),
            "column_lineage": column_lineage,
            "edges": [],
        }
        with open(output_file, "w") as fh:
            json.dump(manifest, fh, indent=2)
        return manifest

    edges = _parse_edges(rows)
    print(f"  Parsed {len(edges)} lineage edge(s).")

    manifest = {
        "resource_type": RESOURCE_TYPE,
        "collected_at": datetime.now(tz=timezone.utc).isoformat(),
        "column_lineage": column_lineage,
        "edges": [
            {
                "destination": {
                    "database": e.dest_db,
                    "schema": e.dest_schema,
                    "table": e.dest_table,
                },
                "sources": [
                    {"database": sdb, "schema": sschema, "table": stbl}
                    for sdb, sschema, stbl in e.sources
                ],
                "col_mappings": [
                    {"dest_col": dc, "src_table": st, "src_col": sc}
                    for dc, st, sc in e.col_mappings
                ],
            }
            for e in edges
        ],
    }
    with open(output_file, "w") as fh:
        json.dump(manifest, fh, indent=2)
    print(f"Lineage manifest written to {output_file}")

    return manifest


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Collect Snowflake lineage from ACCOUNT_USAGE and write to a manifest file",
    )
    parser.add_argument(
        "--account",
        default=os.environ.get("SNOWFLAKE_ACCOUNT"),
        help="Snowflake account identifier (env: SNOWFLAKE_ACCOUNT)",
    )
    parser.add_argument(
        "--user",
        default=os.environ.get("SNOWFLAKE_USER"),
        help="Snowflake username (env: SNOWFLAKE_USER)",
    )
    parser.add_argument(
        "--password",
        default=os.environ.get("SNOWFLAKE_PASSWORD"),
        help="Snowflake password (env: SNOWFLAKE_PASSWORD)",
    )
    parser.add_argument(
        "--warehouse",
        default=os.environ.get("SNOWFLAKE_WAREHOUSE"),
        help="Snowflake virtual warehouse (env: SNOWFLAKE_WAREHOUSE)",
    )
    parser.add_argument(
        "--lookback-hours",
        type=int,
        default=_LOOKBACK_HOURS,
        help=f"Hours of QUERY_HISTORY to scan (default: {_LOOKBACK_HOURS})",
    )
    parser.add_argument(
        "--column-lineage",
        action="store_true",
        help="Include column-level lineage mappings in the manifest",
    )
    parser.add_argument(
        "--output-file",
        default="lineage_output.json",
        help="Path to write the lineage manifest (default: lineage_output.json)",
    )
    args = parser.parse_args()

    missing = [
        name
        for name, val in [
            ("--account", args.account),
            ("--user", args.user),
            ("--password", args.password),
            ("--warehouse", args.warehouse),
        ]
        if not val
    ]
    if missing:
        parser.error(f"Missing required arguments: {', '.join(missing)}")

    collect(
        account=args.account,
        user=args.user,
        password=args.password,
        warehouse=args.warehouse,
        lookback_hours=args.lookback_hours,
        column_lineage=args.column_lineage,
        output_file=args.output_file,
    )
    print("Done.")


if __name__ == "__main__":
    main()
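The lineage parser's CTAS pattern can be sanity-checked in isolation. The regex below is copied verbatim from `collect_lineage.py`; the sample SQL is an illustrative query, not taken from the repository:

```python
import re

# Copied from collect_lineage.py: CTAS with optional db/schema qualifiers.
_CTAS_RE = re.compile(
    r"CREATE\s+(?:OR\s+REPLACE\s+)?(?:TRANSIENT\s+)?TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?"
    r"(?:(?P<dest_db>\w+)\.)?(?:(?P<dest_schema>\w+)\.)?(?P<dest_table>\w+)"
    r".*?AS\s+SELECT\s+(?P<select_cols>.+?)\s+FROM\s+"
    r"(?:(?P<src_db>\w+)\.)?(?:(?P<src_schema>\w+)\.)?(?P<src_table>\w+)",
    re.IGNORECASE | re.DOTALL,
)

sql = "CREATE OR REPLACE TABLE analytics.mart.daily_sales AS SELECT order_id, amount FROM raw.public.orders"
m = _CTAS_RE.search(sql)
print(m.group("dest_table"), "<-", m.group("src_db"), m.group("src_table"))
# daily_sales <- raw orders
```

Note this pattern only captures the first FROM source; additional tables joined into the query are picked up separately by `_JOIN_RE` in `_parse_edges`, and regex-based SQL parsing remains a heuristic that can miss CTEs and subqueries.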
@@ -0,0 +1,329 @@
#!/usr/bin/env python3
"""
Collect table metadata from Snowflake — collection only.

Connects to Snowflake, discovers all accessible databases and schemas, then
queries INFORMATION_SCHEMA.TABLES for volume/freshness and
INFORMATION_SCHEMA.COLUMNS for field definitions. The collected assets are
written to a JSON manifest file.

Can be run standalone via CLI or imported (use the ``collect()`` function).

Substitution points
-------------------
- SNOWFLAKE_ACCOUNT (env) / --account (CLI) : Snowflake account identifier (e.g. xy12345.us-east-1)
- SNOWFLAKE_USER (env) / --user (CLI) : Snowflake username
- SNOWFLAKE_PASSWORD (env) / --password (CLI) : Snowflake password
- SNOWFLAKE_WAREHOUSE (env) / --warehouse (CLI) : Snowflake virtual warehouse

Prerequisites
-------------
pip install snowflake-connector-python

Usage
-----
python collect_metadata.py \\
    --account <SNOWFLAKE_ACCOUNT> \\
    --user <SNOWFLAKE_USER> \\
    --password <SNOWFLAKE_PASSWORD> \\
    --warehouse <SNOWFLAKE_WAREHOUSE>
"""

from __future__ import annotations

import argparse
import json
import os
from datetime import datetime, timezone

import snowflake.connector

# ← SUBSTITUTE: set RESOURCE_TYPE to match your Monte Carlo connection type
RESOURCE_TYPE = "snowflake"


def _check_available_memory(min_gb: float = 2.0) -> None:
    """Warn if available memory is below the threshold."""
    try:
        if hasattr(os, "sysconf"):  # Linux / macOS
            page_size = os.sysconf("SC_PAGE_SIZE")
            avail_pages = os.sysconf("SC_AVPHYS_PAGES")
            avail_gb = (page_size * avail_pages) / (1024 ** 3)
        else:
            return  # Windows — skip check
    except (ValueError, OSError):
        return
    if avail_gb < min_gb:
        print(
            f"WARNING: Only {avail_gb:.1f} GB of memory available "
            f"(minimum recommended: {min_gb:.1f} GB). "
            f"Consider reducing the lookback window or increasing available memory."
        )


# Databases that are Snowflake system databases — skip them
_SKIP_DATABASES = {"SNOWFLAKE", "SNOWFLAKE_SAMPLE_DATA"}

# Schemas that are Snowflake system schemas — skip them
_SKIP_SCHEMAS = {"INFORMATION_SCHEMA"}


# Snowflake TABLE_TYPE → Monte Carlo RelationalAsset.type mapping.
# The MC API only accepts "TABLE" or "VIEW" (uppercase).
_TABLE_TYPE_MAP = {
    "BASE TABLE": "TABLE",
    "TABLE": "TABLE",
    "DYNAMIC TABLE": "TABLE",
    "EXTERNAL TABLE": "TABLE",
    "VIEW": "VIEW",
    "MATERIALIZED VIEW": "VIEW",
    "SECURE VIEW": "VIEW",
}


def _normalize_table_type(raw_type: str | None) -> str:
    """Map Snowflake's TABLE_TYPE value to MC-accepted 'TABLE' or 'VIEW'."""
    if not raw_type:
        return "TABLE"
    return _TABLE_TYPE_MAP.get(raw_type.upper(), "TABLE")


def _connect(account: str, user: str, password: str, warehouse: str):
    # ← SUBSTITUTE: add role= or authenticator= kwargs if your org requires them
    return snowflake.connector.connect(
        account=account,
        user=user,
        password=password,
        warehouse=warehouse,
    )


def _collect_assets(conn) -> list[dict]:
    """Collect table metadata from Snowflake and return as a list of dicts."""
    cursor = conn.cursor()
    assets: list[dict] = []

    # --- Discover databases ---
    cursor.execute("SHOW DATABASES")
    # SHOW DATABASES returns (created_on, name, …); column index 1 is the name
    all_db_rows = []
    while True:
        chunk = cursor.fetchmany(1000)
        if not chunk:
            break
        all_db_rows.extend(chunk)
    databases = [row[1] for row in all_db_rows if row[1] not in _SKIP_DATABASES]
    print(f"  Found {len(databases)} database(s): {databases}")

    for db in databases:
        # --- Discover schemas in each database ---
        try:
            cursor.execute(f'SHOW SCHEMAS IN DATABASE "{db}"')
        except Exception as exc:
            print(f"  WARNING: could not list schemas in {db}: {exc}")
            continue

        # Column index 1 is the schema name
        all_schema_rows = []
        while True:
            chunk = cursor.fetchmany(1000)
            if not chunk:
                break
            all_schema_rows.extend(chunk)
        schemas = [row[1] for row in all_schema_rows if row[1] not in _SKIP_SCHEMAS]

        # --- Collect tables, volume, and freshness via INFORMATION_SCHEMA ---
        try:
            cursor.execute(
                f"""
                SELECT
                    TABLE_CATALOG,
                    TABLE_SCHEMA,
                    TABLE_NAME,
                    TABLE_TYPE,
                    ROW_COUNT,
                    BYTES,
                    LAST_ALTERED,
                    COMMENT
                FROM "{db}".INFORMATION_SCHEMA.TABLES
                WHERE TABLE_SCHEMA != 'INFORMATION_SCHEMA'
                ORDER BY TABLE_SCHEMA, TABLE_NAME
                """
            )
        except Exception as exc:
            print(f"  WARNING: could not query INFORMATION_SCHEMA.TABLES in {db}: {exc}")
            continue

        table_rows = []
        while True:
            chunk = cursor.fetchmany(1000)
            if not chunk:
                break
            table_rows.extend(chunk)
        print(f"  {db}: {len(table_rows)} table(s)")

        # Build a set of schema names present in the table result to know which
        # INFORMATION_SCHEMA.COLUMNS queries to run
        schemas_with_tables: set[str] = {row[1] for row in table_rows}

        # Pre-fetch all columns for this database in one query per schema
        columns_by_table: dict[tuple[str, str], list[dict]] = {}
        for schema in schemas_with_tables:
            if schema not in schemas:
                continue  # respect the earlier schema skip list
            try:
                cursor.execute(
                    f"""
                    SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, COMMENT
                    FROM "{db}".INFORMATION_SCHEMA.COLUMNS
                    WHERE TABLE_SCHEMA = %s
                    ORDER BY TABLE_NAME, ORDINAL_POSITION
                    """,
                    (schema,),
                )
            except Exception as exc:
                print(f"  WARNING: could not fetch columns for {db}.{schema}: {exc}")
                continue

            all_col_rows = []
            while True:
                chunk = cursor.fetchmany(1000)
                if not chunk:
                    break
                all_col_rows.extend(chunk)
            for col_row in all_col_rows:
                table_name, col_name, data_type, col_comment = col_row
                key = (schema, table_name)
                if key not in columns_by_table:
                    columns_by_table[key] = []
                columns_by_table[key].append(
                    {
                        "name": col_name,
                        "type": data_type,
                        "description": col_comment or None,
                    }
                )

        # Build asset dicts
        for row in table_rows:
            tbl_catalog, tbl_schema, tbl_name, tbl_type, row_count, byte_count, last_altered, tbl_comment = row

            volume = None
            if row_count is not None or byte_count is not None:
                volume = {
                    "row_count": int(row_count) if row_count is not None else None,
                    "byte_count": int(byte_count) if byte_count is not None else None,
                }

            freshness = None
            if last_altered is not None:
                freshness = {
                    "last_update_time": last_altered.isoformat() if hasattr(last_altered, "isoformat") else str(last_altered),
                }

            fields = columns_by_table.get((tbl_schema, tbl_name), [])

            assets.append(
                {
                    "type": _normalize_table_type(tbl_type),
                    "database": tbl_catalog,
                    "schema": tbl_schema,
                    "name": tbl_name,
                    "description": tbl_comment or None,
                    "fields": fields,
                    "volume": volume,
                    "freshness": freshness,
                }
            )
            print(f"    + {tbl_catalog}.{tbl_schema}.{tbl_name} ({len(fields)} columns)")

    cursor.close()
    return assets


def collect(
    account: str,
    user: str,
    password: str,
    warehouse: str,
    output_file: str = "metadata_output.json",
) -> dict:
    """
    Connect to Snowflake, collect table metadata, and write a JSON manifest.

    Returns the manifest dict.
    """
    _check_available_memory()
    print(f"Connecting to Snowflake account: {account} ...")
    conn = _connect(account, user, password, warehouse)

    print("Collecting table metadata ...")
    assets = _collect_assets(conn)
    conn.close()
    print(f"\nCollected {len(assets)} table(s).")

    manifest = {
        "resource_type": RESOURCE_TYPE,
        "collected_at": datetime.now(tz=timezone.utc).isoformat(),
        "assets": assets,
    }
    with open(output_file, "w") as fh:
        json.dump(manifest, fh, indent=2)
    print(f"Asset manifest written to {output_file}")

    return manifest


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Collect Snowflake table metadata and write to a manifest file",
    )
    parser.add_argument(
        "--account",
        default=os.environ.get("SNOWFLAKE_ACCOUNT"),
        help="Snowflake account identifier, e.g. xy12345.us-east-1 (env: SNOWFLAKE_ACCOUNT)",  # ← SUBSTITUTE
    )
    parser.add_argument(
        "--user",
        default=os.environ.get("SNOWFLAKE_USER"),
        help="Snowflake username (env: SNOWFLAKE_USER)",  # ← SUBSTITUTE
    )
    parser.add_argument(
        "--password",
        default=os.environ.get("SNOWFLAKE_PASSWORD"),
        help="Snowflake password (env: SNOWFLAKE_PASSWORD)",  # ← SUBSTITUTE
    )
    parser.add_argument(
        "--warehouse",
        default=os.environ.get("SNOWFLAKE_WAREHOUSE"),
        help="Snowflake virtual warehouse (env: SNOWFLAKE_WAREHOUSE)",  # ← SUBSTITUTE
    )
    parser.add_argument(
        "--output-file",
        default="metadata_output.json",
        help="Path to write the output manifest (default: metadata_output.json)",
    )
    args = parser.parse_args()

    missing = [
        name
        for name, val in [
            ("--account", args.account),
            ("--user", args.user),
            ("--password", args.password),
            ("--warehouse", args.warehouse),
        ]
        if not val
    ]
    if missing:
        parser.error(f"Missing required arguments: {', '.join(missing)}")

    collect(
        account=args.account,
        user=args.user,
        password=args.password,
        warehouse=args.warehouse,
        output_file=args.output_file,
    )
    print("Done.")


if __name__ == "__main__":
    main()
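The TABLE_TYPE normalization in `collect_metadata.py` can be exercised standalone. The mapping and helper below are copied from that file (renamed without the leading underscore for the example); note that unknown types deliberately fall back to "TABLE":

```python
# Snowflake TABLE_TYPE → Monte Carlo RelationalAsset.type, as in collect_metadata.py.
_TABLE_TYPE_MAP = {
    "BASE TABLE": "TABLE",
    "TABLE": "TABLE",
    "DYNAMIC TABLE": "TABLE",
    "EXTERNAL TABLE": "TABLE",
    "VIEW": "VIEW",
    "MATERIALIZED VIEW": "VIEW",
    "SECURE VIEW": "VIEW",
}


def normalize_table_type(raw_type):
    """Map a raw TABLE_TYPE to the MC-accepted 'TABLE' or 'VIEW'."""
    if not raw_type:
        return "TABLE"
    return _TABLE_TYPE_MAP.get(raw_type.upper(), "TABLE")


print(normalize_table_type("Base Table"))   # TABLE
print(normalize_table_type("secure view"))  # VIEW
print(normalize_table_type(None))           # TABLE
```

Uppercasing the input before the lookup makes the mapping case-insensitive, matching however Snowflake happens to report the type.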
@@ -0,0 +1,254 @@
#!/usr/bin/env python3
"""
Collect query logs from Snowflake ACCOUNT_USAGE.QUERY_HISTORY — collection only.

Queries a 24-hour window ending 1 hour ago (ACCOUNT_USAGE views have an
approximate 45-minute ingestion latency, so the last hour is intentionally
skipped to avoid incomplete data). The collected query logs are written to a
JSON manifest file.

Can be run standalone via CLI or imported (use the ``collect()`` function).

Substitution points
-------------------
- SNOWFLAKE_ACCOUNT (env) / --account (CLI) : Snowflake account identifier
- SNOWFLAKE_USER (env) / --user (CLI) : Snowflake username
- SNOWFLAKE_PASSWORD (env) / --password (CLI) : Snowflake password
- SNOWFLAKE_WAREHOUSE (env) / --warehouse (CLI) : Snowflake virtual warehouse

Prerequisites
-------------
pip install snowflake-connector-python

Usage
-----
python collect_query_logs.py \\
    --account <SNOWFLAKE_ACCOUNT> \\
    --user <SNOWFLAKE_USER> \\
    --password <SNOWFLAKE_PASSWORD> \\
    --warehouse <SNOWFLAKE_WAREHOUSE>
"""

import argparse
import json
import os
from datetime import datetime, timezone

import snowflake.connector

# ← SUBSTITUTE: set LOG_TYPE to match your warehouse type (query logs use log_type, not resource_type)
LOG_TYPE = "snowflake"


def _check_available_memory(min_gb: float = 2.0) -> None:
    """Warn if available memory is below the threshold."""
    try:
        if hasattr(os, "sysconf"):  # Linux / macOS
            page_size = os.sysconf("SC_PAGE_SIZE")
            avail_pages = os.sysconf("SC_AVPHYS_PAGES")
            avail_gb = (page_size * avail_pages) / (1024 ** 3)
        else:
            return  # Windows — skip check
    except (ValueError, OSError):
        return
    if avail_gb < min_gb:
        print(
            f"WARNING: Only {avail_gb:.1f} GB of memory available "
            f"(minimum recommended: {min_gb:.1f} GB). "
            f"Consider reducing the lookback window or increasing available memory."
        )


# How many hours to look back from the trailing-edge cutoff
# ← SUBSTITUTE: adjust to match your collection cadence (e.g. 2 for every-2-hours runs)
_WINDOW_HOURS = 25

# Hours to skip at the trailing edge — ACCOUNT_USAGE has ~45-minute latency;
# skipping 1 hour provides a comfortable buffer.
# ← SUBSTITUTE: lower to 0 if you have confirmed real-time access to ACCOUNT_USAGE
_TRAILING_SKIP_HOURS = 1

# Maximum rows to collect per run — increase if your warehouse has higher query volume
|
||||
# ← SUBSTITUTE: adjust based on your Snowflake query volume
|
||||
_QUERY_LIMIT = 10000
|
||||
|
||||
|
||||
def _fetch_query_history(conn) -> list[dict]:
|
||||
"""
|
||||
Fetch recent query history from SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY.
|
||||
|
||||
Collection window: [NOW - _WINDOW_HOURS, NOW - _TRAILING_SKIP_HOURS]
|
||||
This intentionally excludes the most recent hour to avoid the ACCOUNT_USAGE
|
||||
ingestion latency gap.
|
||||
"""
|
||||
cursor = conn.cursor()
|
||||
cursor.execute(
|
||||
f"""
|
||||
SELECT
|
||||
QUERY_ID,
|
||||
QUERY_TEXT,
|
||||
START_TIME,
|
||||
END_TIME,
|
||||
USER_NAME,
|
||||
DATABASE_NAME,
|
||||
WAREHOUSE_NAME,
|
||||
BYTES_SCANNED,
|
||||
ROWS_PRODUCED,
|
||||
EXECUTION_STATUS,
|
||||
QUERY_TAG,
|
||||
ROLE_NAME
|
||||
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
|
||||
WHERE START_TIME >= DATEADD(hour, -{_WINDOW_HOURS}, CURRENT_TIMESTAMP())
|
||||
AND START_TIME < DATEADD(hour, -{_TRAILING_SKIP_HOURS}, CURRENT_TIMESTAMP())
|
||||
AND EXECUTION_STATUS = 'SUCCESS'
|
||||
ORDER BY START_TIME
|
||||
LIMIT {_QUERY_LIMIT}
|
||||
"""
|
||||
# ← SUBSTITUTE: add AND DATABASE_NAME = '<db>' or AND WAREHOUSE_NAME = '<wh>'
|
||||
# to restrict collection to a specific database or warehouse
|
||||
)
|
||||
columns = [col[0] for col in cursor.description]
|
||||
rows = []
|
||||
while True:
|
||||
chunk = cursor.fetchmany(1000)
|
||||
if not chunk:
|
||||
break
|
||||
rows.extend(dict(zip(columns, row)) for row in chunk)
|
||||
cursor.close()
|
||||
return rows
|
||||
|
||||
|
||||
def _iso(dt: object) -> str | None:
|
||||
if dt is None:
|
||||
return None
|
||||
return dt.isoformat() if hasattr(dt, "isoformat") else str(dt)
|
||||
|
||||
|
||||
def collect(
|
||||
account: str,
|
||||
user: str,
|
||||
password: str,
|
||||
warehouse: str,
|
||||
output_file: str = "query_logs_output.json",
|
||||
) -> dict:
|
||||
"""
|
||||
Connect to Snowflake, collect query logs, and write a JSON manifest.
|
||||
|
||||
Returns the manifest dict.
|
||||
"""
|
||||
_check_available_memory()
|
||||
print(f"Connecting to Snowflake account: {account} ...")
|
||||
conn = snowflake.connector.connect(
|
||||
account=account,
|
||||
user=user,
|
||||
password=password,
|
||||
warehouse=warehouse,
|
||||
)
|
||||
|
||||
print(
|
||||
f"Fetching QUERY_HISTORY (last {_WINDOW_HOURS}h, excluding final {_TRAILING_SKIP_HOURS}h, "
|
||||
f"limit {_QUERY_LIMIT}) ..."
|
||||
)
|
||||
rows = _fetch_query_history(conn)
|
||||
conn.close()
|
||||
print(f" Retrieved {len(rows)} query log row(s).")
|
||||
|
||||
if not rows:
|
||||
print("No query log rows found in the specified window.")
|
||||
manifest = {
|
||||
"log_type": LOG_TYPE,
|
||||
"collected_at": datetime.now(tz=timezone.utc).isoformat(),
|
||||
"entry_count": 0,
|
||||
"window_start": None,
|
||||
"window_end": None,
|
||||
"queries": [],
|
||||
}
|
||||
with open(output_file, "w") as fh:
|
||||
json.dump(manifest, fh, indent=2, default=str)
|
||||
return manifest
|
||||
|
||||
start_times = [r["START_TIME"] for r in rows if r.get("START_TIME") is not None]
|
||||
end_times = [r["END_TIME"] for r in rows if r.get("END_TIME") is not None]
|
||||
|
||||
manifest = {
|
||||
"log_type": LOG_TYPE,
|
||||
"collected_at": datetime.now(tz=timezone.utc).isoformat(),
|
||||
"entry_count": len(rows),
|
||||
"window_start": _iso(min(start_times)) if start_times else None,
|
||||
"window_end": _iso(max(end_times)) if end_times else None,
|
||||
"queries": [
|
||||
{
|
||||
"query_id": r.get("QUERY_ID"),
|
||||
"query_text": r.get("QUERY_TEXT") or "",
|
||||
"start_time": _iso(r.get("START_TIME")),
|
||||
"end_time": _iso(r.get("END_TIME")),
|
||||
"user": r.get("USER_NAME"),
|
||||
"warehouse": r.get("WAREHOUSE_NAME"),
|
||||
"bytes_scanned": r.get("BYTES_SCANNED"),
|
||||
"rows_produced": r.get("ROWS_PRODUCED"),
|
||||
}
|
||||
for r in rows
|
||||
],
|
||||
}
|
||||
with open(output_file, "w") as fh:
|
||||
json.dump(manifest, fh, indent=2, default=str)
|
||||
print(f"Query log manifest written to {output_file}")
|
||||
|
||||
return manifest
|
||||
|
||||
|
||||
def main() -> None:
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Collect Snowflake query logs from ACCOUNT_USAGE and write to a manifest file",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--account",
|
||||
default=os.environ.get("SNOWFLAKE_ACCOUNT"),
|
||||
help="Snowflake account identifier, e.g. xy12345.us-east-1 (env: SNOWFLAKE_ACCOUNT)", # ← SUBSTITUTE
|
||||
)
|
||||
parser.add_argument(
|
||||
"--user",
|
||||
default=os.environ.get("SNOWFLAKE_USER"),
|
||||
help="Snowflake username (env: SNOWFLAKE_USER)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--password",
|
||||
default=os.environ.get("SNOWFLAKE_PASSWORD"),
|
||||
help="Snowflake password (env: SNOWFLAKE_PASSWORD)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--warehouse",
|
||||
default=os.environ.get("SNOWFLAKE_WAREHOUSE"),
|
||||
help="Snowflake virtual warehouse (env: SNOWFLAKE_WAREHOUSE)", # ← SUBSTITUTE
|
||||
)
|
||||
parser.add_argument(
|
||||
"--output-file",
|
||||
default="query_logs_output.json",
|
||||
help="Path to write the output manifest (default: query_logs_output.json)",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
missing = [
|
||||
name
|
||||
for name, val in [
|
||||
("--account", args.account),
|
||||
("--user", args.user),
|
||||
("--password", args.password),
|
||||
("--warehouse", args.warehouse),
|
||||
]
|
||||
if not val
|
||||
]
|
||||
if missing:
|
||||
parser.error(f"Missing required arguments: {', '.join(missing)}")
|
||||
|
||||
collect(
|
||||
account=args.account,
|
||||
user=args.user,
|
||||
password=args.password,
|
||||
warehouse=args.warehouse,
|
||||
output_file=args.output_file,
|
||||
)
|
||||
print("Done.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
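The collection window described in the docstring above can be checked numerically. A minimal sketch, assuming a fixed clock for determinism (the script itself uses `CURRENT_TIMESTAMP()` inside the SQL):

```python
from datetime import datetime, timedelta, timezone

# Window constants from the script above
_WINDOW_HOURS = 25
_TRAILING_SKIP_HOURS = 1

# Fixed, hypothetical "now" so the example is deterministic
now = datetime(2026, 2, 8, 12, 0, tzinfo=timezone.utc)
window_start = now - timedelta(hours=_WINDOW_HOURS)
window_end = now - timedelta(hours=_TRAILING_SKIP_HOURS)

# A 25h lookback minus a 1h trailing skip leaves a 24h effective window
print(window_end - window_start)  # → 1 day, 0:00:00
```

This makes the docstring's "24-hour window ending 1 hour ago" explicit: widening `_WINDOW_HOURS` or shrinking `_TRAILING_SKIP_HOURS` both grow the effective window.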
@@ -0,0 +1,307 @@
#!/usr/bin/env python3
"""
Push lineage events to Monte Carlo from a JSON manifest — push only.

Reads a manifest file produced by ``collect_lineage.py`` and sends the lineage
events to Monte Carlo using the pycarlo push ingestion API. Large payloads are
split into batches to stay under the 1 MB compressed limit.

Can be run standalone via CLI or imported (use the ``push()`` function).

Substitution points
-------------------
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID for this connection

Prerequisites
-------------
pip install pycarlo

Usage
-----
python push_lineage.py \\
    --key-id <MCD_INGEST_ID> \\
    --key-token <MCD_INGEST_TOKEN> \\
    --resource-uuid <MCD_RESOURCE_UUID> \\
    --input-file lineage_output.json
"""

from __future__ import annotations

import argparse
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone

from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
    ColumnLineageField,
    ColumnLineageSourceField,
    LineageAssetRef,
    LineageEvent,
)

# ← SUBSTITUTE: set RESOURCE_TYPE to match your Monte Carlo connection type
RESOURCE_TYPE = "snowflake"

# Maximum events per batch — conservative default to keep compressed payload under 1 MB
# ← SUBSTITUTE: tune based on average edge complexity (number of sources, column mappings)
_BATCH_SIZE = 500


def _build_table_lineage_events(edges: list[dict]) -> list[LineageEvent]:
    """Build table-level LineageEvent objects from manifest edge dicts."""
    events = []
    for edge in edges:
        dest = edge["destination"]
        sources = edge.get("sources", [])
        if not sources:
            continue
        events.append(
            LineageEvent(
                destination=LineageAssetRef(
                    type="TABLE",
                    name=dest["table"],
                    database=dest["database"],
                    schema=dest["schema"],
                ),
                sources=[
                    LineageAssetRef(
                        type="TABLE",
                        name=s["table"],
                        database=s["database"],
                        schema=s["schema"],
                    )
                    for s in sources
                ],
            )
        )
    return events


def _build_column_lineage_events(edges: list[dict]) -> list[LineageEvent]:
    """Build column-level LineageEvent objects from manifest edge dicts."""
    events = []
    for edge in edges:
        dest = edge["destination"]
        sources = edge.get("sources", [])
        col_mappings = edge.get("col_mappings", [])
        if not sources:
            continue

        dest_asset_id = f"{dest['database']}__{dest['schema']}__{dest['table']}"
        source_asset_ids = {
            (s["database"], s["schema"], s["table"]): f"{s['database']}__{s['schema']}__{s['table']}"
            for s in sources
        }

        col_fields: dict[str, ColumnLineageField] = {}
        for mapping in col_mappings:
            dest_col = mapping["dest_col"]
            src_table = mapping["src_table"]
            src_col = mapping["src_col"]
            # Match src_table to the first source with that table name
            match = next(
                (s for s in sources if s["table"] == src_table),
                sources[0] if sources else None,
            )
            if not match:
                continue
            src_aid = source_asset_ids[(match["database"], match["schema"], match["table"])]
            if dest_col not in col_fields:
                col_fields[dest_col] = ColumnLineageField(name=dest_col, source_fields=[])
            col_fields[dest_col].source_fields.append(
                ColumnLineageSourceField(asset_id=src_aid, field_name=src_col)
            )

        events.append(
            LineageEvent(
                destination=LineageAssetRef(
                    type="TABLE",
                    name=dest["table"],
                    database=dest["database"],
                    schema=dest["schema"],
                    asset_id=dest_asset_id,
                ),
                sources=[
                    LineageAssetRef(
                        type="TABLE",
                        name=s["table"],
                        database=s["database"],
                        schema=s["schema"],
                        asset_id=source_asset_ids[(s["database"], s["schema"], s["table"])],
                    )
                    for s in sources
                ],
                fields=list(col_fields.values()) if col_fields else None,
            )
        )
    return events


def push(
    input_file: str,
    resource_uuid: str,
    key_id: str,
    key_token: str,
    batch_size: int = _BATCH_SIZE,
    output_file: str = "lineage_push_result.json",
) -> dict:
    """
    Read a lineage manifest and push events to Monte Carlo in batches.

    Returns a result dict with invocation IDs for each batch.
    """
    with open(input_file) as fh:
        manifest = json.load(fh)

    edges = manifest.get("edges", [])
    resource_type = manifest.get("resource_type", RESOURCE_TYPE)
    column_lineage = manifest.get("column_lineage", False)

    if column_lineage:
        events = _build_column_lineage_events(edges)
        label = "column-level"
    else:
        events = _build_table_lineage_events(edges)
        label = "table-level"

    print(f"Loaded {len(events)} {label} lineage event(s) from {input_file}")

    if not events:
        print("No lineage events to push.")
        push_result = {
            "resource_uuid": resource_uuid,
            "resource_type": resource_type,
            "invocation_ids": [],
            "pushed_at": datetime.now(tz=timezone.utc).isoformat(),
            "total_events": 0,
            "batch_count": 0,
            "batch_size": batch_size,
        }
        with open(output_file, "w") as fh:
            json.dump(push_result, fh, indent=2)
        return push_result

    # Split into batches
    batches = []
    for i in range(0, len(events), batch_size):
        batches.append(events[i : i + batch_size])
    total_batches = len(batches)

    def _push_batch(batch: list, batch_num: int) -> str | None:
        """Push a single batch using a dedicated Session (thread-safe)."""
        print(f"  Pushing batch {batch_num}/{total_batches} ({len(batch)} events) ...")
        client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
        service = IngestionService(mc_client=client)
        result = service.send_lineage(
            resource_uuid=resource_uuid,
            resource_type=resource_type,
            events=batch,
        )
        invocation_id = service.extract_invocation_id(result)
        if invocation_id:
            print(f"  Batch {batch_num}: invocation_id={invocation_id}")
        return invocation_id

    # Push batches in parallel (each thread gets its own pycarlo Session)
    max_workers = min(4, total_batches)
    invocation_ids: list[str | None] = [None] * total_batches

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(_push_batch, batch, i + 1): i
            for i, batch in enumerate(batches)
        }
        for future in as_completed(futures):
            idx = futures[future]
            try:
                invocation_ids[idx] = future.result()
            except Exception as exc:
                print(f"  ERROR pushing batch {idx + 1}: {exc}")
                raise

    print(f"  All {total_batches} batches pushed ({max_workers} workers)")

    push_result = {
        "resource_uuid": resource_uuid,
        "resource_type": resource_type,
        "invocation_ids": invocation_ids,
        "pushed_at": datetime.now(tz=timezone.utc).isoformat(),
        "total_events": len(events),
        "batch_count": total_batches,
        "batch_size": batch_size,
        "edges": edges,  # preserve for downstream validation
    }
    with open(output_file, "w") as fh:
        json.dump(push_result, fh, indent=2)
    print(f"Push result written to {output_file}")

    return push_result


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Push Snowflake lineage from a manifest to Monte Carlo",
    )
    parser.add_argument(
        "--key-id",
        default=os.environ.get("MCD_INGEST_ID"),
        help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",
    )
    parser.add_argument(
        "--key-token",
        default=os.environ.get("MCD_INGEST_TOKEN"),
        help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",
    )
    parser.add_argument(
        "--resource-uuid",
        default=os.environ.get("MCD_RESOURCE_UUID"),
        help="Monte Carlo resource UUID for this Snowflake connection (env: MCD_RESOURCE_UUID)",
    )
    parser.add_argument(
        "--input-file",
        default="lineage_output.json",
        help="Path to the collect manifest to read (default: lineage_output.json)",
    )
    parser.add_argument(
        "--output-file",
        default="lineage_push_result.json",
        help="Path to write the push result (default: lineage_push_result.json)",
    )
    parser.add_argument(
        "--batch-size",
        type=int,
        default=_BATCH_SIZE,
        help=f"Max events per push batch (default: {_BATCH_SIZE})",
    )
    args = parser.parse_args()

    missing = [
        name
        for name, val in [
            ("--key-id", args.key_id),
            ("--key-token", args.key_token),
            ("--resource-uuid", args.resource_uuid),
        ]
        if not val
    ]
    if missing:
        parser.error(f"Missing required arguments: {', '.join(missing)}")

    push(
        input_file=args.input_file,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
        output_file=args.output_file,
    )
    print("Done.")


if __name__ == "__main__":
    main()
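The batch-splitting loop inside `push()` above amounts to a simple list partition. A standalone sketch with a hypothetical event count:

```python
# Partition a list of N events into ceil(N / batch_size) chunks of at most
# batch_size each, the same slicing loop push() runs before submitting work
# to the thread pool.
def split_batches(events: list, batch_size: int) -> list[list]:
    return [events[i : i + batch_size] for i in range(0, len(events), batch_size)]

batches = split_batches(list(range(1201)), 500)  # 1201 hypothetical events
print([len(b) for b in batches])  # → [500, 500, 201]
```

Only the final chunk can be short, so every push stays under the per-request size budget while no event is dropped or duplicated.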
@@ -0,0 +1,228 @@
#!/usr/bin/env python3
"""
Push table metadata to Monte Carlo from a JSON manifest — push only.

Reads a manifest file produced by ``collect_metadata.py`` and sends the assets
to Monte Carlo as RelationalAsset events using the pycarlo push ingestion API.
Large payloads are split into batches to stay under the 1 MB compressed limit.

Can be run standalone via CLI or imported (use the ``push()`` function).

Substitution points
-------------------
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID for this connection

Prerequisites
-------------
pip install pycarlo

Usage
-----
python push_metadata.py \\
    --key-id <MCD_INGEST_ID> \\
    --key-token <MCD_INGEST_TOKEN> \\
    --resource-uuid <MCD_RESOURCE_UUID> \\
    --input-file metadata_output.json
"""

import argparse
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone

from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
    AssetField,
    AssetFreshness,
    AssetMetadata,
    AssetVolume,
    RelationalAsset,
)

# ← SUBSTITUTE: set RESOURCE_TYPE to match your Monte Carlo connection type
RESOURCE_TYPE = "snowflake"

# Maximum assets per batch — conservative default to keep compressed payload under 1 MB
# ← SUBSTITUTE: tune based on average asset size (fields per table, description length, etc.)
_BATCH_SIZE = 500


def _asset_from_dict(d: dict) -> RelationalAsset:
    """Reconstruct a RelationalAsset from a manifest dict entry."""
    fields = [
        AssetField(
            name=f["name"],
            type=f.get("type"),
            description=f.get("description"),
        )
        for f in d.get("fields", [])
    ]

    volume = None
    if d.get("volume"):
        volume = AssetVolume(
            row_count=d["volume"].get("row_count"),
            byte_count=d["volume"].get("byte_count"),
        )

    freshness = None
    if d.get("freshness"):
        freshness = AssetFreshness(
            last_update_time=d["freshness"].get("last_update_time"),
        )

    return RelationalAsset(
        type=d.get("type", "TABLE"),
        metadata=AssetMetadata(
            name=d["name"],
            database=d["database"],
            schema=d["schema"],
            description=d.get("description"),
        ),
        fields=fields,
        volume=volume,
        freshness=freshness,
    )


def push(
    input_file: str,
    resource_uuid: str,
    key_id: str,
    key_token: str,
    batch_size: int = _BATCH_SIZE,
    output_file: str = "metadata_push_result.json",
) -> dict:
    """
    Read a metadata manifest and push assets to Monte Carlo in batches.

    Returns a result dict with invocation IDs for each batch.
    """
    with open(input_file) as fh:
        manifest = json.load(fh)

    asset_dicts = manifest.get("assets", [])
    resource_type = manifest.get("resource_type", RESOURCE_TYPE)
    assets = [_asset_from_dict(d) for d in asset_dicts]
    print(f"Loaded {len(assets)} asset(s) from {input_file}")

    # Split into batches; max(len, 1) keeps at least one (possibly empty) batch
    # so the thread pool below always gets a valid max_workers value
    batches = []
    for i in range(0, max(len(assets), 1), batch_size):
        batches.append(assets[i : i + batch_size])
    total_batches = len(batches)

    def _push_batch(batch: list, batch_num: int) -> str | None:
        """Push a single batch using a dedicated Session (thread-safe)."""
        client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
        service = IngestionService(mc_client=client)
        result = service.send_metadata(
            resource_uuid=resource_uuid,
            resource_type=resource_type,
            events=batch,
        )
        invocation_id = service.extract_invocation_id(result)
        print(f"  Pushed batch {batch_num}/{total_batches} ({len(batch)} assets) — invocation_id={invocation_id}")
        return invocation_id

    # Push batches in parallel (each thread gets its own pycarlo Session)
    max_workers = min(4, total_batches)
    invocation_ids: list[str | None] = [None] * total_batches

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(_push_batch, batch, i + 1): i
            for i, batch in enumerate(batches)
        }
        for future in as_completed(futures):
            idx = futures[future]
            try:
                invocation_ids[idx] = future.result()
            except Exception as exc:
                print(f"  ERROR pushing batch {idx + 1}: {exc}")
                raise

    print(f"  All {total_batches} batches pushed ({max_workers} workers)")

    push_result = {
        "resource_uuid": resource_uuid,
        "resource_type": resource_type,
        "invocation_ids": invocation_ids,
        "pushed_at": datetime.now(tz=timezone.utc).isoformat(),
        "total_assets": len(assets),
        "batch_count": total_batches,
        "batch_size": batch_size,
    }
    with open(output_file, "w") as fh:
        json.dump(push_result, fh, indent=2)
    print(f"Push result written to {output_file}")

    return push_result


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Push Snowflake table metadata from a manifest to Monte Carlo",
    )
    parser.add_argument(
        "--key-id",
        default=os.environ.get("MCD_INGEST_ID"),
        help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",
    )
    parser.add_argument(
        "--key-token",
        default=os.environ.get("MCD_INGEST_TOKEN"),
        help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",
    )
    parser.add_argument(
        "--resource-uuid",
        default=os.environ.get("MCD_RESOURCE_UUID"),
        help="Monte Carlo resource UUID for this Snowflake connection (env: MCD_RESOURCE_UUID)",
    )
    parser.add_argument(
        "--input-file",
        default="metadata_output.json",
        help="Path to the collect manifest to read (default: metadata_output.json)",
    )
    parser.add_argument(
        "--output-file",
        default="metadata_push_result.json",
        help="Path to write the push result (default: metadata_push_result.json)",
    )
    parser.add_argument(
        "--batch-size",
        type=int,
        default=_BATCH_SIZE,
        help=f"Max assets per push batch (default: {_BATCH_SIZE})",
    )
    args = parser.parse_args()

    missing = [
        name
        for name, val in [
            ("--key-id", args.key_id),
            ("--key-token", args.key_token),
            ("--resource-uuid", args.resource_uuid),
        ]
        if not val
    ]
    if missing:
        parser.error(f"Missing required arguments: {', '.join(missing)}")

    push(
        input_file=args.input_file,
        resource_uuid=args.resource_uuid,
        key_id=args.key_id,
        key_token=args.key_token,
        batch_size=args.batch_size,
        output_file=args.output_file,
    )
    print("Done.")


if __name__ == "__main__":
    main()
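For reference, a minimal manifest that the metadata `push()` above would accept. The key names mirror what `_asset_from_dict` reads; the table, schema, and values here are hypothetical:

```python
import json

# Hypothetical single-asset manifest matching the keys read by _asset_from_dict
manifest = {
    "resource_type": "snowflake",
    "assets": [
        {
            "name": "ORDERS",
            "database": "ANALYTICS",
            "schema": "PUBLIC",
            "type": "TABLE",
            "description": "Order fact table",
            "fields": [{"name": "ORDER_ID", "type": "NUMBER"}],
            "volume": {"row_count": 123456, "byte_count": 7890123},
            "freshness": {"last_update_time": "2026-02-08T00:00:00+00:00"},
        }
    ],
}

# Round-trip through JSON exactly as the collect/push pair does on disk
payload = json.dumps(manifest, indent=2)
print(json.loads(payload)["assets"][0]["name"])  # → ORDERS
```

Only `name`, `database`, and `schema` are required per asset; `fields`, `volume`, and `freshness` are optional and default to empty/None in `_asset_from_dict`.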
@@ -0,0 +1,248 @@
#!/usr/bin/env python3
"""
Push query logs to Monte Carlo from a JSON manifest — push only.

Reads a manifest file produced by ``collect_query_logs.py`` and sends the query
log entries to Monte Carlo using the pycarlo push ingestion API. Large payloads
are split into batches to stay under the 1 MB compressed limit.

Can be run standalone via CLI or imported (use the ``push()`` function).

Substitution points
-------------------
- MCD_INGEST_ID (env) / --key-id (CLI) : Monte Carlo ingestion key ID
- MCD_INGEST_TOKEN (env) / --key-token (CLI) : Monte Carlo ingestion key token
- MCD_RESOURCE_UUID (env) / --resource-uuid (CLI) : MC resource UUID for this connection

Prerequisites
-------------
pip install pycarlo python-dateutil

Usage
-----
python push_query_logs.py \\
    --key-id <MCD_INGEST_ID> \\
    --key-token <MCD_INGEST_TOKEN> \\
    --resource-uuid <MCD_RESOURCE_UUID> \\
    --input-file query_logs_output.json
"""

import argparse
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone

from dateutil.parser import isoparse
from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import QueryLogEntry

# ← SUBSTITUTE: set LOG_TYPE to match your warehouse type (query logs use log_type, not resource_type)
LOG_TYPE = "snowflake"

# Maximum entries per batch. Query logs include full SQL text — keep batches
# small to stay under the 1 MB compressed payload limit; larger batches can
# trigger 413 errors on active warehouses.
# ← SUBSTITUTE: tune based on average query length
_BATCH_SIZE = 100

# Truncate query_text longer than this to prevent 413 errors.
# Some SQL statements (e.g., generated by BI tools) can be 100KB+ and blow up
# compressed payloads even at small batch sizes.
_MAX_QUERY_TEXT_LEN = 10_000


def _build_query_log_entries(queries: list[dict]) -> list[QueryLogEntry]:
    """Convert manifest query dicts into QueryLogEntry objects."""
    entries = []
    truncated = 0
    for q in queries:
        start_time = q.get("start_time")
        end_time = q.get("end_time")
        query_text = q.get("query_text") or ""
        query_id = q.get("query_id")
        user_name = q.get("user")
        warehouse_name = q.get("warehouse")
        bytes_scanned = q.get("bytes_scanned")
        rows_produced = q.get("rows_produced")

        # Truncate very long SQL to prevent 413 Request Too Large
        if len(query_text) > _MAX_QUERY_TEXT_LEN:
            query_text = query_text[:_MAX_QUERY_TEXT_LEN] + "... [TRUNCATED]"
            truncated += 1

        extra = {}
        if warehouse_name is not None:
            extra["warehouse_name"] = warehouse_name
        if bytes_scanned is not None:
            extra["bytes_scanned"] = int(bytes_scanned)

        entries.append(
            QueryLogEntry(
                start_time=isoparse(start_time) if start_time else None,
                end_time=isoparse(end_time) if end_time else None,
                query_text=query_text,
                query_id=query_id,
                user=user_name,
                returned_rows=int(rows_produced) if rows_produced is not None else None,
                extra=extra or None,
            )
        )
    if truncated:
        print(f"  Truncated {truncated} query text(s) exceeding {_MAX_QUERY_TEXT_LEN} chars")
    return entries


def push(
    input_file: str,
    resource_uuid: str,
    key_id: str,
    key_token: str,
    batch_size: int = _BATCH_SIZE,
    output_file: str = "query_logs_push_result.json",
) -> dict:
    """
    Read a query log manifest and push entries to Monte Carlo in batches.

    Returns a result dict with invocation IDs for each batch.
    """
    with open(input_file) as fh:
        manifest = json.load(fh)

    queries = manifest.get("queries", [])
    log_type = manifest.get("log_type", LOG_TYPE)
    entries = _build_query_log_entries(queries)
    print(f"Loaded {len(entries)} query log entry/entries from {input_file}")

    if not entries:
        print("No query log entries to push.")
        push_result = {
            "resource_uuid": resource_uuid,
            "log_type": log_type,
            "invocation_ids": [],
            "pushed_at": datetime.now(tz=timezone.utc).isoformat(),
            "total_entries": 0,
            "batch_count": 0,
            "batch_size": batch_size,
        }
        with open(output_file, "w") as fh:
            json.dump(push_result, fh, indent=2)
        return push_result

    # Split into batches
    batches = []
    for i in range(0, len(entries), batch_size):
        batches.append(entries[i : i + batch_size])
    total_batches = len(batches)

    def _push_batch(batch: list, batch_num: int) -> str | None:
        """Push a single batch using a dedicated Session (thread-safe)."""
        client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
        service = IngestionService(mc_client=client)
        result = service.send_query_logs(
            resource_uuid=resource_uuid,
            log_type=log_type,
            events=batch,
        )
        invocation_id = service.extract_invocation_id(result)
        print(f"  Pushed batch {batch_num}/{total_batches} ({len(batch)} entries) — invocation_id={invocation_id}")
        return invocation_id

    # Push batches in parallel (each thread gets its own pycarlo Session)
    max_workers = min(4, total_batches)
    invocation_ids: list[str | None] = [None] * total_batches

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(_push_batch, batch, i + 1): i
            for i, batch in enumerate(batches)
        }
        for future in as_completed(futures):
            idx = futures[future]
            try:
                invocation_ids[idx] = future.result()
            except Exception as exc:
                print(f"  ERROR pushing batch {idx + 1}: {exc}")
                raise

    print(f"  All {total_batches} batches pushed ({max_workers} workers)")

    push_result = {
        "resource_uuid": resource_uuid,
        "log_type": log_type,
        "invocation_ids": invocation_ids,
        "pushed_at": datetime.now(tz=timezone.utc).isoformat(),
        "total_entries": len(entries),
        "batch_count": total_batches,
        "batch_size": batch_size,
    }
    with open(output_file, "w") as fh:
        json.dump(push_result, fh, indent=2)
    print(f"Push result written to {output_file}")

    return push_result


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Push Snowflake query logs from a manifest to Monte Carlo",
    )
    parser.add_argument(
        "--key-id",
        default=os.environ.get("MCD_INGEST_ID"),
        help="Monte Carlo ingestion key ID (env: MCD_INGEST_ID)",
    )
    parser.add_argument(
        "--key-token",
        default=os.environ.get("MCD_INGEST_TOKEN"),
        help="Monte Carlo ingestion key token (env: MCD_INGEST_TOKEN)",
    )
    parser.add_argument(
        "--resource-uuid",
        default=os.environ.get("MCD_RESOURCE_UUID"),
        help="Monte Carlo resource UUID for this Snowflake connection (env: MCD_RESOURCE_UUID)",
    )
    parser.add_argument(
        "--input-file",
        default="query_logs_output.json",
        help="Path to the collect manifest to read (default: query_logs_output.json)",
    )
    parser.add_argument(
        "--output-file",
        default="query_logs_push_result.json",
        help="Path to write the push result (default: query_logs_push_result.json)",
    )
    parser.add_argument(
        "--batch-size",
        type=int,
        default=_BATCH_SIZE,
        help=f"Max entries per push batch (default: {_BATCH_SIZE})",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
missing = [
|
||||
name
|
||||
for name, val in [
|
||||
("--key-id", args.key_id),
|
||||
("--key-token", args.key_token),
|
||||
("--resource-uuid", args.resource_uuid),
|
||||
]
|
||||
if not val
|
||||
]
|
||||
if missing:
|
||||
parser.error(f"Missing required arguments: {', '.join(missing)}")
|
||||
|
||||
push(
|
||||
input_file=args.input_file,
|
||||
resource_uuid=args.resource_uuid,
|
||||
key_id=args.key_id,
|
||||
key_token=args.key_token,
|
||||
batch_size=args.batch_size,
|
||||
output_file=args.output_file,
|
||||
)
|
||||
print("Done.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,340 @@
#!/usr/bin/env python3
"""
Smoke test that every pycarlo model construction used by the templates
actually works with the real SDK. A wrong parameter name raises TypeError.

Run:
    pip install pycarlo
    python test_template_sdk_usage.py
"""

from datetime import datetime, timezone

from pycarlo.features.ingestion.models import (
    AssetField,
    AssetFreshness,
    AssetMetadata,
    AssetVolume,
    ColumnLineageField,
    ColumnLineageSourceField,
    LineageAssetRef,
    LineageEvent,
    QueryLogEntry,
    RelationalAsset,
    Tag,
    build_lineage_payload,
    build_metadata_payload,
    build_query_log_payload,
)

PASSED = 0
FAILED = 0


def check(label: str, fn):
    global PASSED, FAILED
    try:
        obj = fn()
        # Also verify serialization works
        if hasattr(obj, "to_dict"):
            obj.to_dict()
        PASSED += 1
        print(f"  PASS {label}")
    except Exception as exc:
        FAILED += 1
        print(f"  FAIL {label}: {exc}")


def test_metadata_models():
    print("\n== Metadata models ==")

    check("AssetField(name, type)", lambda: AssetField(name="id", type="INTEGER"))

    check(
        "AssetField(name, type, description)",
        lambda: AssetField(name="id", type="INTEGER", description="Primary key"),
    )

    check(
        "AssetMetadata(name, database, schema)",
        lambda: AssetMetadata(name="orders", database="analytics", schema="public"),
    )

    check(
        "AssetMetadata(name, database, schema, description, view_query, created_on)",
        lambda: AssetMetadata(
            name="orders_view",
            database="analytics",
            schema="public",
            description="A view",
            view_query="SELECT * FROM orders",
            created_on="2026-01-01T00:00:00Z",
        ),
    )

    check("AssetVolume(row_count)", lambda: AssetVolume(row_count=1000))
    check(
        "AssetVolume(row_count, byte_count)",
        lambda: AssetVolume(row_count=1000, byte_count=50000),
    )

    check(
        "AssetFreshness(last_update_time)",
        lambda: AssetFreshness(last_update_time="2026-03-12T14:30:00Z"),
    )

    check("Tag(key, value)", lambda: Tag(key="env", value="prod"))
    check("Tag(key only)", lambda: Tag(key="pii"))

    check(
        "RelationalAsset — full nested structure",
        lambda: RelationalAsset(
            type="TABLE",
            metadata=AssetMetadata(
                name="orders",
                database="analytics",
                schema="public",
                description="Orders table",
            ),
            fields=[
                AssetField(name="id", type="INTEGER"),
                AssetField(name="amount", type="DECIMAL(10,2)", description="Order total"),
            ],
            volume=AssetVolume(row_count=1000000, byte_count=111111111),
            freshness=AssetFreshness(last_update_time="2026-03-12T14:30:00Z"),
            tags=[Tag(key="env", value="prod")],
        ),
    )

    check(
        "RelationalAsset — minimal (no volume, freshness, tags)",
        lambda: RelationalAsset(
            type="VIEW",
            metadata=AssetMetadata(name="v_orders", database="db", schema="sch"),
        ),
    )


def test_lineage_models():
    print("\n== Lineage models ==")

    check(
        "LineageAssetRef(type, name, database, schema)",
        lambda: LineageAssetRef(
            type="TABLE", name="orders", database="analytics", schema="public"
        ),
    )

    check(
        "LineageAssetRef(type, name, database, schema, asset_id)",
        lambda: LineageAssetRef(
            type="TABLE",
            name="orders",
            database="analytics",
            schema="public",
            asset_id="analytics:public.orders",
        ),
    )

    check(
        "LineageEvent — table lineage",
        lambda: LineageEvent(
            destination=LineageAssetRef(
                type="TABLE", name="curated", database="db", schema="sch"
            ),
            sources=[
                LineageAssetRef(type="TABLE", name="raw", database="db", schema="sch"),
            ],
        ),
    )

    check(
        "ColumnLineageSourceField(asset_id, field_name)",
        lambda: ColumnLineageSourceField(
            asset_id="db:sch.raw", field_name="amount"
        ),
    )

    check(
        "ColumnLineageField(name, source_fields)",
        lambda: ColumnLineageField(
            name="total_amount",
            source_fields=[
                ColumnLineageSourceField(asset_id="db:sch.raw", field_name="amount"),
            ],
        ),
    )

    check(
        "LineageEvent — column lineage",
        lambda: LineageEvent(
            destination=LineageAssetRef(
                type="TABLE",
                name="curated",
                database="db",
                schema="sch",
                asset_id="db:sch.curated",
            ),
            sources=[
                LineageAssetRef(
                    type="TABLE",
                    name="raw",
                    database="db",
                    schema="sch",
                    asset_id="db:sch.raw",
                ),
            ],
            fields=[
                ColumnLineageField(
                    name="total_amount",
                    source_fields=[
                        ColumnLineageSourceField(
                            asset_id="db:sch.raw", field_name="amount"
                        ),
                    ],
                ),
            ],
        ),
    )


def test_query_log_models():
    print("\n== Query log models ==")

    now = datetime.now(tz=timezone.utc)

    check(
        "QueryLogEntry — minimal",
        lambda: QueryLogEntry(
            start_time=now,
            end_time=now,
            query_text="SELECT 1",
        ),
    )

    check(
        "QueryLogEntry — full with extra",
        lambda: QueryLogEntry(
            start_time=now,
            end_time=now,
            query_text="SELECT * FROM orders",
            query_id="query-123",
            user="analyst@company.com",
            returned_rows=100,
            error_code=None,
            error_text=None,
            extra={
                "warehouse_name": "COMPUTE_WH",
                "bytes_scanned": 12345,
            },
        ),
    )

    check(
        "QueryLogEntry — Snowflake extra fields",
        lambda: QueryLogEntry(
            start_time=now,
            end_time=now,
            query_text="SELECT 1",
            extra={"warehouse_name": "WH", "bytes_scanned": 100},
        ),
    )

    check(
        "QueryLogEntry — BigQuery extra fields",
        lambda: QueryLogEntry(
            start_time=now,
            end_time=now,
            query_text="SELECT 1",
            extra={"total_bytes_billed": 999, "statement_type": "SELECT"},
        ),
    )

    check(
        "QueryLogEntry — Databricks extra fields",
        lambda: QueryLogEntry(
            start_time=now,
            end_time=now,
            query_text="SELECT 1",
            extra={"total_task_duration_ms": 500, "read_rows": 10, "read_bytes": 200},
        ),
    )

    check(
        "QueryLogEntry — Redshift extra fields",
        lambda: QueryLogEntry(
            start_time=now,
            end_time=now,
            query_text="SELECT 1",
            extra={"database_name": "dev", "elapsed_time_us": 123456},
        ),
    )


def test_payload_builders():
    print("\n== Payload builders ==")

    now = datetime.now(tz=timezone.utc)

    check(
        "build_metadata_payload",
        lambda: build_metadata_payload(
            resource_uuid="uuid-123",
            resource_type="snowflake",
            events=[
                RelationalAsset(
                    type="TABLE",
                    metadata=AssetMetadata(name="t", database="d", schema="s"),
                )
            ],
        ),
    )

    check(
        "build_lineage_payload — table",
        lambda: build_lineage_payload(
            resource_uuid="uuid-123",
            resource_type="snowflake",
            events=[
                LineageEvent(
                    destination=LineageAssetRef(
                        type="TABLE", name="dst", database="d", schema="s"
                    ),
                    sources=[
                        LineageAssetRef(
                            type="TABLE", name="src", database="d", schema="s"
                        )
                    ],
                )
            ],
        ),
    )

    check(
        "build_query_log_payload",
        lambda: build_query_log_payload(
            resource_uuid="uuid-123",
            log_type="snowflake",
            events=[
                QueryLogEntry(
                    start_time=now,
                    end_time=now,
                    query_text="SELECT 1",
                )
            ],
        ),
    )


if __name__ == "__main__":
    test_metadata_models()
    test_lineage_models()
    test_query_log_models()
    test_payload_builders()
    print(f"\n{'='*40}")
    print(f"Results: {PASSED} passed, {FAILED} failed")
    if FAILED:
        print("SOME TESTS FAILED — templates use wrong parameter names!")
        raise SystemExit(1)
    else:
        print("All tests passed — all model constructions are valid.")
@@ -0,0 +1,685 @@
---
name: monte-carlo-validation-notebook
description: "Generates SQL validation notebooks for dbt PR changes with before/after comparison queries."
category: data
risk: safe
source: community
source_repo: monte-carlo-data/mc-agent-toolkit
source_type: community
date_added: "2026-04-08"
author: monte-carlo-data
tags: [data-observability, validation, dbt, monte-carlo, sql-notebook]
tools: [claude, cursor, codex]
---

> **Tip:** This skill works well with Sonnet. Run `/model sonnet` before invoking for faster generation.

Generate a SQL Notebook with validation queries for dbt changes.

**Arguments:** $ARGUMENTS

Parse the arguments:
- **Target** (required): first argument — a GitHub PR URL or local dbt repo path
- **MC Base URL** (optional): `--mc-base-url <URL>` — defaults to `https://getmontecarlo.com`
- **Models** (optional): `--models <model1,model2,...>` — comma-separated list of model filenames (without the `.sql` extension) to generate queries for. Only these models will be included. By default, all changed models are included, up to a maximum of 10.

---

# Setup

**Prerequisites:**
- **`gh`** (GitHub CLI) — required for PR mode. Must be authenticated (`gh auth status`).
- **`python3`** — required for helper scripts.
- **`pyyaml`** — install with `pip3 install pyyaml` (or `pip install pyyaml`, `uv pip install pyyaml`, etc.)

**Note:** Generated SQL uses ANSI-compatible syntax that works across Snowflake, BigQuery, Redshift, and Athena. Minor adjustments may be needed for specific warehouse quirks.

This skill includes two helper scripts in `${CLAUDE_PLUGIN_ROOT}/skills/monte-carlo-validation-notebook/scripts/`:

- **`resolve_dbt_schema.py`** - Resolves dbt model output schemas from `dbt_project.yml` routing rules and model config overrides.
- **`generate_notebook_url.py`** - Encodes notebook YAML into a base64 import URL and opens it in the browser.

# Mode Detection

Auto-detect the mode from the target argument:
- If the target looks like a URL (contains `://` or `github.com`) -> **PR mode**
- If the target is a path (`.`, `/path/to/repo`, relative path) -> **Local mode**

---

# Context

This command generates a SQL Notebook containing validation queries for dbt changes. The notebook can be opened in the MC Bridge SQL Notebook interface for interactive validation.

The output is an import URL that opens directly in the notebook interface:
```
<MC_BASE_URL>/notebooks/import#<base64-encoded-yaml>
```
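
The encoding step can be sketched in a few lines of Python. This is only an illustration of the URL shape described above, not the `generate_notebook_url.py` script itself; the helper name `build_import_url` and the use of standard (non-URL-safe) base64 are assumptions.

```python
import base64

def build_import_url(notebook_yaml: str, mc_base_url: str = "https://getmontecarlo.com") -> str:
    """Base64-encode the notebook YAML and place it in the URL fragment."""
    # The fragment (after '#') never reaches the server; the notebook UI decodes it client-side.
    encoded = base64.b64encode(notebook_yaml.encode("utf-8")).decode("ascii")
    return f"{mc_base_url}/notebooks/import#{encoded}"

url = build_import_url("version: 1\nmetadata:\n  name: demo\n")
```

Anything that round-trips through base64 survives this scheme, so the notebook YAML can contain arbitrary SQL and markdown.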

**Key Features:**
- **Database Parameters**: Two `text` parameters (`prod_db` and `dev_db`) for selecting databases
- **Schema Inference**: Automatically infers the schema per model from `dbt_project.yml` and model configs
- **Single-table queries**: Basic validation queries using `{{prod_db}}.<SCHEMA>.<TABLE>`
- **Comparison queries**: Before/after queries comparing `{{prod_db}}` vs `{{dev_db}}`
- **Flexible usage**: Users can set both parameters to the same database for single-database analysis

# Notebook YAML Spec Reference

Key structure:
```yaml
version: 1
metadata:
  id: string            # kebab-case + random suffix
  name: string          # display name
  created_at: string    # ISO 8601
  updated_at: string    # ISO 8601
  default_context:      # optional database/schema context
    database: string
    schema: string
cells:
  - id: string
    type: sql | markdown | parameter
    content: string     # SQL, markdown, or parameter config (JSON)
    display_type: table | bar | timeseries
```

## Parameter Cell Spec

Parameter cells allow defining variables referenced in SQL via `{{param_name}}` syntax:

```yaml
- id: param-prod-db
  type: parameter
  content:
    name: prod_db        # variable name
    config:
      type: text         # free-form text input
      default_value: "ANALYTICS"
      placeholder: "Prod database"
  display_type: table
```

Parameter types:
- `text`: Free-form text input (used for database names)
- `schema_selector`: Two dropdowns (database -> schema), value stored as `DATABASE.SCHEMA`
- `dropdown`: Select from predefined options

# Task

Generate a SQL Notebook with validation queries based on the mode and target.

## Phase 1: Get Changed Files

The approach differs based on mode:

### If PR mode (GitHub PR):

1. Extract the PR number and repo from the target URL.
   - Example: `https://github.com/monte-carlo-data/dbt/pull/3386` -> owner=`monte-carlo-data`, repo=`dbt`, PR=`3386`

2. Fetch PR metadata using `gh`:
   ```bash
   gh pr view <PR#> --repo <owner>/<repo> --json number,title,author,mergedAt,headRefOid
   ```

3. Fetch the list of changed files:
   ```bash
   gh pr view <PR#> --repo <owner>/<repo> --json files --jq '.files[].path'
   ```

4. Fetch the diff:
   ```bash
   gh pr diff <PR#> --repo <owner>/<repo>
   ```

5. Filter the changed files list to only `.sql` files under `models/` or `snapshots/` directories (at any depth — e.g., `models/`, `analytics/models/`, `dbt/models/`). These are the dbt models to analyze. If no model SQL files were changed, report that and stop.

6. For each changed model file, fetch the full file content at the head SHA:
   ```bash
   gh api repos/<owner>/<repo>/contents/<file_path>?ref=<head_sha> --jq '.content' | python3 -c "import sys,base64; sys.stdout.write(base64.b64decode(sys.stdin.read()).decode())"
   ```

7. **Fetch dbt_project.yml** for schema resolution. Detect the dbt project root by looking at the changed file paths — find the common parent directory that contains `dbt_project.yml`. Try these paths in order until one succeeds:
   ```bash
   gh api repos/<owner>/<repo>/contents/<dbt_root>/dbt_project.yml?ref=<head_sha> --jq '.content' | python3 -c "import sys,base64; sys.stdout.write(base64.b64decode(sys.stdin.read()).decode())"
   ```
   Common `<dbt_root>` locations: `analytics`, `.` (repo root), `dbt`, `transform`. Try each until found.

   Save `dbt_project.yml` to `/tmp/validation_notebook_working/<PR#>/dbt_project.yml`.
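
Step 1's URL parsing can be sketched with a regular expression. `parse_pr_url` is an illustrative helper name, not part of the skill's scripts:

```python
import re

def parse_pr_url(url: str) -> tuple[str, str, str]:
    """Return (owner, repo, pr_number) extracted from a GitHub PR URL."""
    m = re.search(r"github\.com/([^/]+)/([^/]+)/pull/(\d+)", url)
    if m is None:
        raise ValueError(f"not a GitHub PR URL: {url}")
    return m.group(1), m.group(2), m.group(3)

owner, repo, pr = parse_pr_url("https://github.com/monte-carlo-data/dbt/pull/3386")
```

The three captured groups feed directly into the `gh pr view <PR#> --repo <owner>/<repo>` calls above.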

### If Local mode (Local Directory):

1. Change to the target directory.

2. Get current branch info:
   ```bash
   git rev-parse --abbrev-ref HEAD
   ```

3. Detect the base branch - try `main`, `master`, `develop` in order, or use the upstream tracking branch.

4. Get the list of changed SQL files compared to the base branch:
   ```bash
   git diff --name-only <base_branch>...HEAD -- '*.sql'
   ```

5. Filter to only `.sql` files under `models/` or `snapshots/` directories (at any depth — e.g., `models/`, `analytics/models/`, `dbt/models/`). If no model SQL files were changed, report that and stop.

6. Get the diff for each changed file:
   ```bash
   git diff <base_branch>...HEAD -- <file_path>
   ```

7. Read model files directly from the filesystem.

8. **Find dbt_project.yml**:
   ```bash
   find . -name "dbt_project.yml" -type f | head -1
   ```

9. For notebook metadata in local mode, use:
   - **ID**: `local-<branch-name>-<timestamp>`
   - **Title**: `Local: <branch-name>`
   - **Author**: Output of `git config user.name`
   - **Merged**: "N/A (local)"

### Model Selection (applies to both modes)

After filtering to `.sql` files under `models/` or `snapshots/`:

1. **If `--models` was specified:** Filter the changed files list to only include models whose filename (without the `.sql` extension, case-insensitive) matches one of the specified model names. If any specified model is not found in the changed files, warn the user but continue with the models that were found. If none match, report that and stop.

2. **Model cap:** If more than 10 models remain after filtering, select the first 10 (by file path order) and warn the user:
   ```
   ⚠️ <total_count> models changed — generating validation queries for the first 10 only.
   To generate for specific models, re-run with: --models <model1,model2,...>
   Skipped models: <list of skipped model filenames>
   ```

## Phase 2: Parse Changed Models

For EACH changed dbt model `.sql` file, parse and extract:

### 2a. Model Metadata

**Output table name** -- Derive from the file name:
- `<any_path>/models/<subdir>/<model_name>.sql` -> table is `<MODEL_NAME>` (uppercase, taken from the filename)

**Output schema** -- Use the schema resolution script:

1. **Setup**: Save `dbt_project.yml` and model files to `/tmp/validation_notebook_working/<id>/`, preserving paths:
   ```
   /tmp/validation_notebook_working/<id>/
   +-- dbt_project.yml
   +-- models/
       +-- <path>/<model>.sql
   ```

2. **Run the script** for each model:
   ```bash
   python3 ${CLAUDE_PLUGIN_ROOT}/skills/monte-carlo-validation-notebook/scripts/resolve_dbt_schema.py /tmp/validation_notebook_working/<id>/dbt_project.yml /tmp/validation_notebook_working/<id>/models/<path>/<model>.sql
   ```

3. **Error handling**: If the script fails, **STOP immediately** and report the error. Do NOT proceed with notebook generation if schema resolution fails.

4. **Output**: The script prints the resolved schema (e.g., `PROD`, `PROD_STAGE`, `PROD_LINEAGE`)

**Note**: Do NOT manually parse dbt_project.yml or model configs for the schema -- always use the script. It handles model config overrides, dbt_project.yml routing rules, the PROD_ prefix for custom schemas, and defaults to `PROD`.

**Config block** -- Look for `{{ config(...) }}` and extract:
- `materialized` -- 'table', 'view', 'incremental', 'ephemeral'
- `unique_key` -- the dedup key (may be a string or a list)
- `cluster_by` -- clustering fields (may contain the time axis)

**Core segmentation fields** -- Scan the entire model SQL for fields likely to be business keys:
- Fields named `*_id` (e.g., `account_id`, `resource_id`, `monitor_id`) that appear in JOIN ON, GROUP BY, PARTITION BY, or `unique_key`
- Deduplicate and rank by frequency. Take the top 3.

**Time axis field** -- Detect the model's time dimension (in priority order):
1. `is_incremental()` block: the field used in the WHERE comparison
2. `cluster_by` config: timestamp/date fields
3. Field name conventions: `ingest_ts`, `created_time`, `date_part`, `timestamp`, `run_start_time`, `export_ts`, `event_created_time`
4. ORDER BY DESC in QUALIFY/ROW_NUMBER

If no time axis is found, skip time-axis queries for this model.
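
The segmentation-field ranking can be sketched as below. This is a simplification: it ranks `*_id` identifiers by raw frequency across the whole SQL text, ignoring the clause-position refinement (JOIN ON / GROUP BY / PARTITION BY) described above, and `top_segmentation_fields` is an illustrative name:

```python
import re
from collections import Counter

def top_segmentation_fields(model_sql: str, n: int = 3) -> list[str]:
    """Rank candidate business keys (*_id identifiers) by frequency; return the top n."""
    ids = re.findall(r"\b([A-Za-z_]+_id)\b", model_sql)
    counts = Counter(name.lower() for name in ids)  # case-insensitive dedup
    return [name for name, _ in counts.most_common(n)]

sql = """
SELECT account_id, resource_id, COUNT(*) AS n
FROM raw.events e
JOIN raw.accounts a ON a.account_id = e.account_id
GROUP BY account_id, resource_id
"""
keys = top_segmentation_fields(sql)
```

Fields that recur across joins and groupings naturally float to the top, which is usually what you want for the segmentation queries in Phase 3.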
|
||||
|
||||
### 2b. Diff Analysis
|
||||
|
||||
Parse the diff hunks for this file. Classify each changed line:
|
||||
|
||||
- **Changed fields** -- Lines added/modified in SELECT clauses or CTE definitions. Extract the output column name.
|
||||
- **Changed filters** -- Lines added/modified in WHERE clauses.
|
||||
- **Changed joins** -- Lines added/modified in JOIN ON conditions.
|
||||
- **Changed unique_key** -- If `unique_key` in config was modified, note both old and new values.
|
||||
- **New columns** -- Columns in "after" SELECT that don't appear in "before" (pure additions).
|
||||
|
||||
### 2c. Model Classification
|
||||
|
||||
Classify each model as **new** or **modified** based on the diff:
|
||||
- If the diff for this file contains `new file mode` → classify as **new**
|
||||
- Otherwise → classify as **modified**
|
||||
|
||||
This classification determines which query patterns are generated in Phase 3.
|
||||
|
||||
**Note:** For **new models**, Phase 2b diff analysis is skipped (there is no "before" to compare against). Phase 2a metadata extraction still applies.
|
||||
|
||||
## Phase 3: Generate Validation Queries
|
||||
|
||||
For each changed model, generate the applicable queries based on its classification (new vs modified).
|
||||
|
||||
**CRITICAL: Parameter Placeholder Syntax**
|
||||
|
||||
Use **double curly braces** `{{...}}` for parameter placeholders. Do NOT use `${...}` or any other syntax.
|
||||
|
||||
Correct: `{{prod_db}}.PROD.AGENT_RUNS`
|
||||
Wrong: `${prod_db}.PROD.AGENT_RUNS`
|
||||
|
||||
**Table Reference Format:**
|
||||
- Use `{{prod_db}}.<SCHEMA>.<TABLE_NAME>` for prod queries
|
||||
- Use `{{dev_db}}.<SCHEMA>.<TABLE_NAME>` for dev queries
|
||||
- `<SCHEMA>` is **hardcoded per-model** using the output from the schema resolution script
|
||||
|
||||
---
|
||||
|
||||
### Query Patterns for NEW Models
|
||||
|
||||
For new models, all queries target `{{dev_db}}` only. No comparison queries are generated since no prod table exists.
|
||||
|
||||
#### Pattern 7-new: Total Row Count
|
||||
**Trigger:** Always.
|
||||
|
||||
```sql
|
||||
SELECT COUNT(*) AS total_rows
|
||||
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
|
||||
```
|
||||
|
||||
#### Pattern 9: Sample Data Preview
|
||||
**Trigger:** Always.
|
||||
|
||||
```sql
|
||||
SELECT *
|
||||
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
|
||||
LIMIT 20
|
||||
```
|
||||
|
||||
#### Pattern 2-new: Core Segmentation Counts
|
||||
**Trigger:** Always.
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
<segmentation_field>,
|
||||
COUNT(*) AS row_count
|
||||
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
|
||||
GROUP BY <segmentation_field>
|
||||
ORDER BY row_count DESC
|
||||
LIMIT 100
|
||||
```
|
||||
|
||||
#### Pattern 5: Uniqueness Check
|
||||
**Trigger:** Always for new models (verify unique_key constraint from the start).
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
COUNT(*) AS total_rows,
|
||||
COUNT(DISTINCT <key_fields>) AS distinct_keys,
|
||||
COUNT(*) - COUNT(DISTINCT <key_fields>) AS duplicate_count
|
||||
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
|
||||
```
|
||||
|
||||
```sql
|
||||
SELECT <key_fields>, COUNT(*) AS n
|
||||
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
|
||||
GROUP BY <key_fields>
|
||||
HAVING COUNT(*) > 1
|
||||
ORDER BY n DESC
|
||||
LIMIT 100
|
||||
```
|
||||
|
||||
#### Pattern 6-new: NULL Rate Check (all columns)
|
||||
**Trigger:** Always. Checks all output columns since everything is new.
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
COUNT(*) AS total_rows,
|
||||
SUM(CASE WHEN <col1> IS NULL THEN 1 ELSE 0 END) AS <col1>_null_count,
|
||||
ROUND(100.0 * SUM(CASE WHEN <col1> IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0), 2) AS <col1>_null_pct,
|
||||
SUM(CASE WHEN <col2> IS NULL THEN 1 ELSE 0 END) AS <col2>_null_count,
|
||||
ROUND(100.0 * SUM(CASE WHEN <col2> IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0), 2) AS <col2>_null_pct
|
||||
-- repeat for each output column
|
||||
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
|
||||
```
|
||||
|
||||
#### Pattern 8: Time-Axis Continuity
|
||||
**Trigger:** Model is `materialized='incremental'` OR a time axis field was identified.
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
CAST(<time_axis> AS DATE) AS day,
|
||||
COUNT(*) AS row_count
|
||||
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
|
||||
WHERE <time_axis> >= CURRENT_TIMESTAMP - INTERVAL '14' DAY
|
||||
GROUP BY day
|
||||
ORDER BY day DESC
|
||||
LIMIT 30
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Query Patterns for MODIFIED Models
|
||||
|
||||
For modified models, single-table queries use `{{prod_db}}` and comparison queries use both.
|
||||
|
||||
#### Pattern 7: Total Row Count
|
||||
**Trigger:** Always.
|
||||
|
||||
```sql
|
||||
SELECT COUNT(*) AS total_rows
|
||||
FROM {{prod_db}}.<SCHEMA>.<TABLE_NAME>
|
||||
```
|
||||
|
||||
#### Pattern 9: Sample Data Preview
|
||||
**Trigger:** Always.
|
||||
|
||||
```sql
|
||||
SELECT *
|
||||
FROM {{prod_db}}.<SCHEMA>.<TABLE_NAME>
|
||||
LIMIT 20
|
||||
```
|
||||
|
||||
#### Pattern 2: Core Segmentation Counts
|
||||
**Trigger:** Always.
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
<segmentation_field>,
|
||||
COUNT(*) AS row_count
|
||||
FROM {{prod_db}}.<SCHEMA>.<TABLE_NAME>
|
||||
GROUP BY <segmentation_field>
|
||||
ORDER BY row_count DESC
|
||||
LIMIT 100
|
||||
```
|
||||
|
||||
#### Pattern 1: Changed Field Distribution
|
||||
**Trigger:** Changed fields found in Phase 2b. **Exclude added columns** (from "New columns" in Phase 2b) — only include fields that exist in prod.
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
<changed_field>,
|
||||
COUNT(*) AS row_count,
|
||||
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS pct
|
||||
FROM {{prod_db}}.<SCHEMA>.<TABLE_NAME>
|
||||
GROUP BY <changed_field>
|
||||
ORDER BY row_count DESC
|
||||
LIMIT 100
|
||||
```
|
||||
|
||||
#### Pattern 5: Uniqueness Check
|
||||
**Trigger:** JOIN condition changed, `unique_key` changed, or model is incremental.
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
COUNT(*) AS total_rows,
|
||||
COUNT(DISTINCT <key_fields>) AS distinct_keys,
|
||||
COUNT(*) - COUNT(DISTINCT <key_fields>) AS duplicate_count
|
||||
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
|
||||
```
|
||||
|
||||
```sql
|
||||
SELECT <key_fields>, COUNT(*) AS n
|
||||
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
|
||||
GROUP BY <key_fields>
|
||||
HAVING COUNT(*) > 1
|
||||
ORDER BY n DESC
|
||||
LIMIT 100
|
||||
```
|
||||
|
||||
#### Pattern 6: NULL Rate Check
|
||||
**Trigger:** New column added, or column wrapped in COALESCE/NULLIF.
|
||||
|
||||
**Important:** Added columns (from "New columns" in Phase 2b) do NOT exist in prod yet. For added columns, query `{{dev_db}}` only. For modified columns (COALESCE/NULLIF changes), compare both databases.
|
||||
|
||||
**For added columns** (dev only):
|
||||
```sql
|
||||
SELECT
|
||||
COUNT(*) AS total_rows,
|
||||
SUM(CASE WHEN <column> IS NULL THEN 1 ELSE 0 END) AS null_count,
|
||||
ROUND(100.0 * SUM(CASE WHEN <column> IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0), 2) AS null_pct
|
||||
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
|
||||
```
|
||||
|
||||
**For modified columns** (prod vs dev):
|
||||
```sql
|
||||
SELECT
|
||||
'prod' AS source,
|
||||
COUNT(*) AS total_rows,
|
||||
SUM(CASE WHEN <column> IS NULL THEN 1 ELSE 0 END) AS null_count,
|
||||
ROUND(100.0 * SUM(CASE WHEN <column> IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0), 2) AS null_pct
|
||||
FROM {{prod_db}}.<SCHEMA>.<TABLE_NAME>
|
||||
UNION ALL
|
||||
SELECT
|
||||
'dev' AS source,
|
||||
COUNT(*) AS total_rows,
|
||||
SUM(CASE WHEN <column> IS NULL THEN 1 ELSE 0 END) AS null_count,
|
||||
ROUND(100.0 * SUM(CASE WHEN <column> IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0), 2) AS null_pct
|
||||
FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
|
||||
```
|
||||
|
||||
#### Pattern 8: Time-Axis Continuity
|
||||
**Trigger:** Model is `materialized='incremental'` OR a time axis field was identified.
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
CAST(<time_axis> AS DATE) AS day,
|
||||
COUNT(*) AS row_count
|
||||
FROM {{prod_db}}.<SCHEMA>.<TABLE_NAME>
|
||||
WHERE <time_axis> >= CURRENT_TIMESTAMP - INTERVAL '14' DAY
|
||||
GROUP BY day
|
||||
ORDER BY day DESC
|
||||
LIMIT 30
|
||||
```
|
||||
|
||||
#### Pattern 3: Before/After Comparison

**Trigger:** Always (for changed fields + top segmentation field). **Modified models only.**

**Important:** Exclude added columns (from "New columns" in Phase 2b) from `<group_fields>`. Only use fields that exist in BOTH prod and dev. Added columns don't exist in prod and will cause query errors.

```sql
WITH prod AS (
    SELECT <group_fields>, COUNT(*) AS cnt
    FROM {{prod_db}}.<SCHEMA>.<TABLE_NAME>
    GROUP BY <group_fields>
),
dev AS (
    SELECT <group_fields>, COUNT(*) AS cnt
    FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
    GROUP BY <group_fields>
)
SELECT
    COALESCE(b.<field>, d.<field>) AS <field>,
    COALESCE(b.cnt, 0) AS cnt_prod,
    COALESCE(d.cnt, 0) AS cnt_dev,
    COALESCE(d.cnt, 0) - COALESCE(b.cnt, 0) AS diff
FROM prod b
FULL OUTER JOIN dev d ON b.<field> = d.<field>
ORDER BY ABS(diff) DESC
LIMIT 100
```

#### Pattern 7b: Row Count Comparison

**Trigger:** Always. **Modified models only.**

```sql
SELECT 'prod' AS source, COUNT(*) AS row_count FROM {{prod_db}}.<SCHEMA>.<TABLE_NAME>
UNION ALL
SELECT 'dev' AS source, COUNT(*) AS row_count FROM {{dev_db}}.<SCHEMA>.<TABLE_NAME>
```

## Phase 4: Build Notebook YAML

### 4a. Metadata

```yaml
version: 1
metadata:
  id: validation-pr-<PR_NUMBER>-<random_suffix>
  name: "Validation: PR #<PR_NUMBER> - <PR_TITLE_TRUNCATED>"
  created_at: "<current_iso_timestamp>"
  updated_at: "<current_iso_timestamp>"
```

### 4b. Parameter Cells

**Only include `prod_db` if there are modified models.** If all models are new, only include `dev_db`.

```yaml
# Include ONLY if there are modified models:
- id: param-prod-db
  type: parameter
  content:
    name: prod_db
    config:
      type: text
      default_value: "ANALYTICS"
      placeholder: "Prod database (e.g., ANALYTICS)"
  display_type: table

# Always include:
- id: param-dev-db
  type: parameter
  content:
    name: dev_db
    config:
      type: text
      default_value: "PERSONAL_<USER>"
      placeholder: "Dev database (e.g., PERSONAL_JSMITH)"
  display_type: table
```

### 4c. Markdown Summary Cell

```yaml
- id: cell-summary
  type: markdown
  content: |
    # Validation Queries for <PR or Local Branch>

    ## Summary
    - **Title:** <title>
    - **Author:** <author>
    - **Source:** <PR URL or "Local branch: <branch>">
    - **Status:** <merge_timestamp or "Not yet merged" or "N/A (local)">

    ## Changes
    <brief description based on diff analysis>

    ## Changed Models
    - `<SCHEMA>.<TABLE_NAME>` (from `<file_path>`)

    ## How to Use
    1. Select your Snowflake connector above
    2. Set **dev_db** to your dev database (e.g., `PERSONAL_JSMITH`)
    3. If modified models are present, set **prod_db** to your prod database (e.g., `ANALYTICS`)
    4. Run single-table queries first, then comparison queries
  display_type: table
```

### 4d. SQL Cell Format

```yaml
- id: cell-<pattern>-<model>-<index>
  type: sql
  content: |
    /*
    ========================================
    <Pattern Name (human-readable, e.g. "Total Row Count" -- do NOT include pattern numbers like "Pattern 7:")>
    ========================================
    Model: <SCHEMA>.<TABLE_NAME>
    Triggered by: <why this pattern was generated>
    What to look for: <interpretation guidance>
    ----------------------------------------
    */
    <actual_sql_query>
  display_type: table
```

### 4e. Cell Organization

Cells are ordered consistently for both model types, following this sequence:

**New models:**
1. Summary markdown cell (note that the model is new)
2. Parameter cells (dev_db only -- no prod_db if all models are new)
3. Total row count (Pattern 7-new)
4. Sample data preview (Pattern 9)
5. Core segmentation counts (Pattern 2-new)
6. Uniqueness check (Pattern 5), NULL rate check (Pattern 6-new), time-axis continuity (Pattern 8)

**Modified models:**
1. Summary markdown cell
2. Parameter cells (prod_db, dev_db)
3. Total row count (Pattern 7)
4. Sample data preview (Pattern 9)
5. Core segmentation counts (Pattern 2)
6. Changed field distribution (Pattern 1)
7. Uniqueness check (Pattern 5), NULL rate check (Pattern 6), time-axis continuity (Pattern 8)
8. Before/after comparisons (Pattern 3), row count comparison (Pattern 7b)

## Phase 5: Generate Import URL

1. Write the notebook YAML to `/tmp/validation_notebook_working/<id>/notebook.yaml`
2. Run the URL generation script:

   ```bash
   python3 ${CLAUDE_PLUGIN_ROOT}/skills/monte-carlo-validation-notebook/scripts/generate_notebook_url.py /tmp/validation_notebook_working/<id>/notebook.yaml --mc-base-url <MC_BASE_URL>
   ```

3. The script validates both YAML syntax and notebook schema (required fields on metadata and cells). If validation fails, read the error messages carefully, fix the YAML to match the spec in Phase 4, and re-run.

## Phase 6: Output

Present:

```markdown
# Validation Notebook Generated

## Summary
- **Source:** PR #<number> - <title> OR Local: <branch>
- **Author:** <author>
- **Changed Models:** <count> models (of <total_count> changed)
- **Generated Queries:** <count> queries

> ⚠️ If models were capped: "Only the first 10 of <total_count> changed models were included. Re-run with `--models` to select specific models."

## Notebook Opened
The notebook has been opened directly in your browser.
Select your Snowflake connector in the notebook interface to begin running queries.
*Make sure MC Bridge is running. Let me know if you want tips on how to install it locally.*
```

## Important Guidelines

1. **Do NOT execute queries** -- only generate the notebook
2. **Keep SQL readable** -- proper formatting and meaningful aliases
3. **Include LIMIT 100** on queries that could return many rows
4. **Use double curly braces** -- `{{prod_db}}` NOT `${prod_db}`
5. **Use the correct table format** -- `{{prod_db}}.<SCHEMA>.<TABLE>` and `{{dev_db}}.<SCHEMA>.<TABLE>`
6. **Always use the schema resolution script** -- do NOT manually parse dbt_project.yml
7. **Schema is NOT a parameter** -- only `prod_db` and `dev_db` are parameters
8. **Skip ephemeral models** -- they have no physical table
9. **Truncate the notebook name** -- keep it under 50 chars
10. **Generate unique cell IDs** -- use a pattern like `cell-p3-model-1`
11. **YAML multiline content** -- use the `|` block scalar for SQL with comments
12. **ASCII-only YAML** -- the script sanitizes and validates before encoding

## Query Pattern Reference

| Pattern | Name | Trigger | Model Type | Database | Order |
|---------|------|---------|------------|----------|-------|
| 7 / 7-new | Total Row Count | Always | Both | `{{prod_db}}` (modified) / `{{dev_db}}` (new) | 1 |
| 9 | Sample Data Preview | Always | Both | `{{prod_db}}` (modified) / `{{dev_db}}` (new) | 2 |
| 2 / 2-new | Core Segmentation Counts | Always | Both | `{{prod_db}}` (modified) / `{{dev_db}}` (new) | 3 |
| 1 | Changed Field Distribution | Column modified in diff (not added) | Modified only | `{{prod_db}}` | 4 |
| 5 | Uniqueness Check | JOIN/unique_key changed (modified) / Always (new) | Both | `{{dev_db}}` | 5 |
| 6 / 6-new | NULL Rate Check | New column or COALESCE (modified) / Always (new) | Both | Added col: `{{dev_db}}` only; COALESCE: Both (modified) / `{{dev_db}}` (new) | 5 |
| 8 | Time-Axis Continuity | Incremental or time field | Both | `{{prod_db}}` (modified) / `{{dev_db}}` (new) | 5 |
| 3 | Before/After Comparison | Changed fields (not added) | Modified only | Both | 6 |
| 7b | Row Count Comparison | Always | Modified only | Both | 6 |

## MC Bridge Setup Help

If the user asks how to install or set up MC Bridge, fetch the README from the mc-bridge repo and show the relevant quick start / setup instructions:

```bash
gh api repos/monte-carlo-data/mc-bridge/readme --jq '.content' | base64 --decode
```

Focus on: how to install, configure connections, and run MC Bridge. Don't dump the entire README -- extract just the setup-relevant sections.

@@ -0,0 +1,141 @@
#!/usr/bin/env python3
"""
Encode a notebook YAML file into a base64 import URL and open it in the browser.

Usage:
    python3 generate_notebook_url.py <notebook_yaml_path> [--mc-base-url URL]
"""

import argparse
import base64
import os
import re
import subprocess
import sys

import yaml


def sanitize_yaml(content: str) -> str:
    """Replace non-ASCII characters with ASCII equivalents."""
    replacements = {
        "\u2014": "-",    # em dash
        "\u2013": "-",    # en dash
        "\u2018": "'",    # left single quote
        "\u2019": "'",    # right single quote
        "\u201c": '"',    # left double quote
        "\u201d": '"',    # right double quote
        "\u2026": "...",  # ellipsis
        "\u00a0": " ",    # non-breaking space
    }
    for char, replacement in replacements.items():
        content = content.replace(char, replacement)
    # Anything else outside ASCII becomes "?"
    content = re.sub(r"[^\x00-\x7F]", "?", content)
    return content


def validate_yaml(content: str) -> None:
    """Parse YAML, validate notebook schema, and exit with context on failure."""
    try:
        doc = yaml.safe_load(content)
    except yaml.YAMLError as e:
        print(f"YAML validation failed: {e}", file=sys.stderr)
        sys.exit(1)

    errors: list[str] = []

    # Top-level structure
    if not isinstance(doc, dict):
        errors.append("Root must be a mapping")
    else:
        if "version" not in doc:
            errors.append("Missing top-level 'version'")
        metadata = doc.get("metadata")
        if not isinstance(metadata, dict):
            errors.append("Missing or invalid 'metadata' mapping")
        else:
            for field in ("id", "name", "created_at", "updated_at"):
                if field not in metadata:
                    errors.append(f"metadata.{field}: missing required field")
            for bad_field in ("title", "description", "pr_number", "generated_by"):
                if bad_field in metadata:
                    errors.append(
                        f"metadata.{bad_field}: unexpected field (use 'name' for the notebook title)"
                    )

        cells = doc.get("cells")
        if not isinstance(cells, list):
            errors.append("Missing or invalid 'cells' list")
        else:
            for i, cell in enumerate(cells):
                prefix = f"cells[{i}]"
                if not isinstance(cell, dict):
                    errors.append(f"{prefix}: must be a mapping")
                    continue
                if "id" not in cell:
                    errors.append(f"{prefix}: missing 'id'")
                if "type" not in cell:
                    errors.append(f"{prefix}: missing 'type'")
                cell_type = cell.get("type")
                if cell_type not in ("sql", "markdown", "parameter"):
                    errors.append(
                        f"{prefix}: invalid type '{cell_type}' (must be sql, markdown, or parameter)"
                    )
                if "display_type" not in cell:
                    errors.append(f"{prefix}: missing 'display_type'")
                if cell_type == "parameter":
                    content_val = cell.get("content")
                    if not isinstance(content_val, dict):
                        errors.append(
                            f"{prefix}: parameter cell 'content' must be a mapping with 'name' and 'config'"
                        )
                    else:
                        if "name" not in content_val:
                            errors.append(f"{prefix}: parameter content missing 'name'")
                        if "config" not in content_val:
                            errors.append(f"{prefix}: parameter content missing 'config'")

    if errors:
        print("Invalid notebook:", file=sys.stderr)
        for err in errors:
            print(f"  - {err}", file=sys.stderr)
        sys.exit(1)


def main() -> None:
    parser = argparse.ArgumentParser(description="Encode notebook YAML to import URL")
    parser.add_argument("yaml_path", help="Path to notebook YAML file")
    parser.add_argument(
        "--mc-base-url",
        default="https://getmontecarlo.com",
        help="MC Bridge base URL",
    )
    args = parser.parse_args()

    with open(args.yaml_path) as f:
        notebook_yaml = f.read()

    yaml_content = sanitize_yaml(notebook_yaml.strip())
    validate_yaml(yaml_content)

    encoded = base64.b64encode(yaml_content.encode()).decode()
    url = f"{args.mc_base_url}/notebooks/import#{encoded}"

    print(f"URL length: {len(url)} chars")

    # Save URL to file alongside the YAML
    url_file = os.path.join(os.path.dirname(os.path.abspath(args.yaml_path)), "notebook_url.txt")
    with open(url_file, "w") as f:
        f.write(url)
    print(f"URL saved to: {url_file}")

    print("\n" + "=" * 60)
    print("NOTEBOOK URL:")
    print("=" * 60)
    print(url)
    print("=" * 60 + "\n")

    print("Opening notebook in browser...")
    # "open" is the macOS launcher; use "xdg-open" on Linux
    subprocess.run(["open", url])


if __name__ == "__main__":
    main()
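The encoding step at the heart of the script above is easy to exercise in isolation: the URL fragment after `#` is just the base64 of the sanitized YAML, so decoding the fragment recovers the original text. A minimal round-trip sketch (the YAML here is a hypothetical stand-in, not a complete notebook):

```python
import base64

# Minimal stand-in notebook YAML (not schema-complete; for illustration only)
yaml_text = "version: 1\nmetadata:\n  id: demo\n"

encoded = base64.b64encode(yaml_text.encode()).decode()
url = f"https://getmontecarlo.com/notebooks/import#{encoded}"

# Decoding the fragment recovers the YAML exactly
fragment = url.split("#", 1)[1]
assert base64.b64decode(fragment).decode() == yaml_text
```

Because the whole notebook travels in the URL fragment, URL length grows linearly with the YAML size, which is why the script prints the length as a sanity check.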
@@ -0,0 +1,161 @@
#!/usr/bin/env python3
"""
Resolve the output schema for a dbt model.

Usage:
    python3 resolve_dbt_schema.py <dbt_project_yml_path> <model_sql_path>

Returns the resolved schema name (uppercase), e.g., "PROD", "PROD_STAGE", "PROD_LINEAGE".
"""

import argparse
import re
import sys
from pathlib import Path
from typing import Dict, List, Optional, Tuple, Union

import yaml


def parse_model_config_schema(model_content: str) -> Optional[str]:
    """Extract the schema from the model's config block."""
    pattern = r"\{\{\s*config\s*\([^)]*\bschema\s*=\s*['\"]([^'\"]+)['\"][^)]*\)\s*\}\}"
    match = re.search(pattern, model_content, re.IGNORECASE | re.DOTALL)
    if match:
        return match.group(1).upper()

    snapshot_pattern = r"target_schema\s*=\s*generate_schema_name\s*\(\s*['\"]([^'\"]+)['\"]"
    match = re.search(snapshot_pattern, model_content, re.IGNORECASE | re.DOTALL)
    if match:
        return match.group(1).upper()

    return None


def parse_dbt_project_routing(
    dbt_project: dict, project_name: str
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Extract schema and database routing rules from dbt_project.yml."""
    schema_routing: Dict[str, str] = {}
    database_routing: Dict[str, str] = {}

    models_config = dbt_project.get("models", {})
    project_config = models_config.get(project_name, {})

    def extract_routing(config: dict, current_path: str = "") -> None:
        for key, value in config.items():
            if key.startswith("+"):
                continue
            if not isinstance(value, dict):
                continue
            new_path = f"{current_path}/{key}" if current_path else key
            schema = value.get("schema") or value.get("+schema")
            # Skip Jinja-templated values; they cannot be resolved statically
            if schema and "{{" not in schema:
                schema_routing[new_path] = schema.upper()
            database = value.get("database") or value.get("+database")
            if database and "{{" not in database:
                database_routing[new_path] = database.upper()
            extract_routing(value, new_path)

    extract_routing(project_config)
    return schema_routing, database_routing


def parse_dbt_project_schema_routing(dbt_project: dict, project_name: str) -> Dict[str, str]:
    schema_routing, _ = parse_dbt_project_routing(dbt_project, project_name)
    return schema_routing


def get_model_relative_path(dbt_project_path: Path, model_path: Path) -> str:
    dbt_project_dir = dbt_project_path.parent
    model_relative = model_path.relative_to(dbt_project_dir)
    parts = model_relative.parts
    if parts and parts[0] == "models":
        return str(Path(*parts[1:]))
    return str(model_relative)


def find_matching_schema(
    model_relative_path: str, routing: Dict[str, str]
) -> Optional[str]:
    model_dir = str(Path(model_relative_path).parent)
    matches: List[Tuple[str, str]] = []
    for route_path, schema in routing.items():
        if model_dir == route_path or model_dir.startswith(route_path + "/"):
            matches.append((route_path, schema))
    if not matches:
        return None
    # Prefer the most specific (longest) matching route
    matches.sort(key=lambda x: len(x[0]), reverse=True)
    return matches[0][1]


def apply_schema_prefix(schema: str, target_schema: str = "PROD") -> str:
    if not schema or schema.upper() == target_schema.upper():
        return target_schema.upper()
    return f"{target_schema.upper()}_{schema.upper()}"


def resolve_schema(
    dbt_project_path: Union[str, Path],
    model_path: Union[str, Path],
    default_schema: str = "PROD",
    apply_prefix: bool = True,
) -> str:
    dbt_project_path = Path(dbt_project_path)
    model_path = Path(model_path)

    model_content = model_path.read_text()

    # 1. A schema set in the model's own config block wins
    config_schema = parse_model_config_schema(model_content)
    if config_schema:
        if apply_prefix:
            return apply_schema_prefix(config_schema, default_schema)
        return config_schema

    # 2. Otherwise fall back to directory-based routing in dbt_project.yml
    with open(dbt_project_path) as f:
        dbt_project = yaml.safe_load(f)

    project_name = dbt_project.get("name", "")

    routing = parse_dbt_project_schema_routing(dbt_project, project_name)
    model_relative = get_model_relative_path(dbt_project_path, model_path)
    matched_schema = find_matching_schema(model_relative, routing)
    if matched_schema:
        if apply_prefix:
            return apply_schema_prefix(matched_schema, default_schema)
        return matched_schema

    # 3. No config and no routing match: use the default
    return default_schema.upper()


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Resolve the output schema for a dbt model"
    )
    parser.add_argument("dbt_project_path", help="Path to dbt_project.yml")
    parser.add_argument("model_path", help="Path to the model SQL file")
    parser.add_argument("--default", default="PROD", help="Default schema (default: PROD)")
    parser.add_argument("--no-prefix", action="store_true", help="Don't apply PROD_ prefix")

    args = parser.parse_args()

    dbt_project_path = Path(args.dbt_project_path)
    model_path = Path(args.model_path)

    if not dbt_project_path.exists():
        print(f"Error: dbt_project.yml not found: {dbt_project_path}", file=sys.stderr)
        sys.exit(1)

    if not model_path.exists():
        print(f"Error: Model file not found: {model_path}", file=sys.stderr)
        sys.exit(1)

    apply_prefix = not args.no_prefix
    schema = resolve_schema(dbt_project_path, model_path, args.default, apply_prefix)
    print(schema)


if __name__ == "__main__":
    main()
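The prefixing rule in `apply_schema_prefix` is what produces names like "PROD_STAGE" from the docstring: a custom schema gets prefixed with the target, while the target itself (or an empty schema) passes through unchanged. A direct copy of the function, checked in isolation:

```python
def apply_schema_prefix(schema: str, target_schema: str = "PROD") -> str:
    # Empty schema, or schema equal to the target, resolves to the target itself
    if not schema or schema.upper() == target_schema.upper():
        return target_schema.upper()
    # Otherwise the custom schema is namespaced under the target
    return f"{target_schema.upper()}_{schema.upper()}"

assert apply_schema_prefix("stage") == "PROD_STAGE"
assert apply_schema_prefix("PROD") == "PROD"
assert apply_schema_prefix("") == "PROD"
```

This mirrors dbt's default `generate_schema_name` behavior of concatenating the target schema with a custom schema.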
@@ -19,7 +19,7 @@
   "skills": "./skills/",
   "interface": {
     "displayName": "Antigravity Awesome Skills",
-    "shortDescription": "1,363 plugin-safe skills for coding, security, product, and ops workflows.",
+    "shortDescription": "1,367 plugin-safe skills for coding, security, product, and ops workflows.",
     "longDescription": "Install a plugin-safe Codex distribution of Antigravity Awesome Skills. Skills that still need hardening or target-specific setup remain available in the repo but are excluded from this plugin.",
     "developerName": "sickn33 and contributors",
     "category": "Productivity",

@@ -0,0 +1,222 @@
---
name: monte-carlo-monitor-creation
description: "Guides creation of Monte Carlo monitors via MCP tools, producing monitors-as-code YAML for CI/CD deployment."
category: data
risk: safe
source: community
source_repo: monte-carlo-data/mc-agent-toolkit
source_type: community
date_added: "2026-04-08"
author: monte-carlo-data
tags: [data-observability, monitoring, monte-carlo, monitors-as-code]
tools: [claude, cursor, codex]
---

# Monte Carlo Monitor Creation Skill

This skill teaches you to create Monte Carlo monitors correctly via MCP. Every creation tool runs in **dry-run mode** and returns monitors-as-code (MaC) YAML. No monitors are created directly -- the user applies the YAML via the Monte Carlo CLI or CI/CD.

Reference files live next to this skill file. **Use the Read tool** (not MCP resources) to access them:

- Metric monitor details: `references/metric-monitor.md` (relative to this file)
- Validation monitor details: `references/validation-monitor.md` (relative to this file)
- Custom SQL monitor details: `references/custom-sql-monitor.md` (relative to this file)
- Comparison monitor details: `references/comparison-monitor.md` (relative to this file)
- Table monitor details: `references/table-monitor.md` (relative to this file)

## When to activate this skill

Activate when the user:

- Asks to create, add, or set up a monitor (e.g. "add a monitor for...", "create a freshness check on...", "set up validation for...")
- Mentions monitoring a specific table, field, or metric
- Wants to check data quality rules or enforce data contracts
- Asks about monitoring options for a table or dataset
- Requests monitors-as-code YAML generation
- Wants to add monitoring after new transformation logic (when the prevent skill is not active)

## When NOT to activate this skill

Do not activate when the user is:

- Just querying data or exploring table contents
- Triaging or responding to active alerts (use the prevent skill's Workflow 3)
- Running impact assessments before code changes (use the prevent skill's Workflow 4)
- Asking about existing monitor configuration (use `getMonitors` directly)
- Editing or deleting existing monitors

---

|
||||
## Available MCP tools
|
||||
|
||||
All tools are available via the `monte-carlo` MCP server.
|
||||
|
||||
| Tool | Purpose |
|
||||
| ---------------------------- | ---------------------------------------------------------- |
|
||||
| `testConnection` | Verify auth and connectivity before starting |
|
||||
| `search` | Find tables/assets by name; use `include_fields` for columns |
|
||||
| `getTable` | Schema, stats, metadata, domain membership, capabilities |
|
||||
| `getValidationPredicates` | List available validation rule types for a warehouse |
|
||||
| `getDomains` | List MC domains (only needed if table has no domain info) |
|
||||
| `createMetricMonitorMac` | Generate metric monitor YAML (dry-run) |
|
||||
| `createValidationMonitorMac` | Generate validation monitor YAML (dry-run) |
|
||||
| `createComparisonMonitorMac` | Generate comparison monitor YAML (dry-run) |
|
||||
| `createCustomSqlMonitorMac` | Generate custom SQL monitor YAML (dry-run) |
|
||||
| `createTableMonitorMac` | Generate table monitor YAML (dry-run) |
|
||||
|
||||
---
|
||||
|
||||
## Monitor types
|
||||
|
||||
| Type | Tool | Use When |
|
||||
| -------------- | ---------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| **Metric** | `createMetricMonitorMac` | Track statistical metrics on fields (null rates, unique counts, numeric stats) or row count changes over time. Requires a timestamp field for aggregation. |
|
||||
| **Validation** | `createValidationMonitorMac` | Row-level data quality checks with conditions (e.g. "field X is never null", "status is in allowed set"). Alerts on INVALID data. |
|
||||
| **Custom SQL** | `createCustomSqlMonitorMac` | Run arbitrary SQL returning a single number and alert on thresholds. Most flexible; use when other types don't fit. |
|
||||
| **Comparison** | `createComparisonMonitorMac` | Compare metrics between two tables (e.g. dev vs prod, source vs target). |
|
||||
| **Table** | `createTableMonitorMac` | Monitor groups of tables for freshness, schema changes, and volume. Uses asset selection at database/schema level. |
|
||||
|
||||
---
|
||||
|
||||
## Procedure

Follow these steps in order. Do NOT skip steps.

### Validation Phase (Steps 1-3) -- MUST complete before any creation tool is called

The number one error pattern is agents skipping validation and calling a creation tool with guessed or incomplete parameters. **Every field in the creation call must be grounded in data retrieved during this phase.** Do not proceed to Step 4 until Steps 1-3 are fully satisfied.

#### Step 1: Understand the request

Ask yourself:

- What does the user want to monitor? (a specific table, a metric, a data quality rule, cross-table consistency, freshness/volume at schema level)
- Which monitor type fits? Use the monitor types table above.
- Does the user have all the details, or do they need guidance?

If the user's intent is unclear, ask a focused question before proceeding.

#### Step 2: Identify the table(s) and columns

If you don't have the table MCON:

1. Use `search` with the table name and `include_fields: ["field_names"]` to find the MCON and get column names.
2. If the user provided a full table ID like `database:schema.table`, search for it.
3. Once you have the MCON, call `getTable` with `include_fields: true` and `include_table_capabilities: true` to verify capabilities and get domain info.

If you already have the MCON:

1. Call `getTable` with the MCON, `include_fields: true`, and `include_table_capabilities: true`.

**CRITICAL: You need the actual column names from `getTable` results. NEVER guess or hallucinate column names.** This is the most common source of monitor creation failures.

For monitor types that require a timestamp column (metric monitors), review the column names and identify likely timestamp candidates. Present them to the user if ambiguous.

#### Step 3: Handle domain assignment

Monitors must be assigned to a domain that contains the table being monitored. The `getTable` response includes a `domains` list with `uuid` and `name`.

1. If `domains` is empty: skip domain assignment.
2. If `domains` has exactly one entry: default `domain_id` to that domain's UUID.
3. If `domains` has multiple entries: present only those domains and ask the user to pick.

Do NOT present all account domains as options -- only domains that contain the table are valid.

**ALWAYS check the table's `domains` BEFORE calling any creation tool.**

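The domain-assignment branching in Step 3 can be sketched as follows. This is a minimal illustration only: `domains` is assumed to be the list of `uuid`/`name` mappings from the `getTable` response described above, and `None` stands for "no safe default" (either skip assignment, or ask the user when multiple domains match):

```python
from typing import Optional


def pick_domain_id(domains: list[dict]) -> Optional[str]:
    """Return a default domain_id, or None when none applies or the user must choose."""
    if not domains:
        return None  # no domain on the table: skip domain assignment entirely
    if len(domains) == 1:
        return domains[0]["uuid"]  # exactly one domain: safe default
    # Multiple domains: no safe default; present only these domains to the user
    return None


assert pick_domain_id([]) is None
assert pick_domain_id([{"uuid": "d-1", "name": "Finance"}]) == "d-1"
assert pick_domain_id([{"uuid": "d-1"}, {"uuid": "d-2"}]) is None
```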
---

### Creation Phase (Steps 4-8)

Only enter this phase after the validation phase is complete with real data from MCP tools.

#### Step 4: Load the sub-skill reference

Based on the monitor type, read the detailed reference for parameter guidance:

- **Metric** -- `references/metric-monitor.md` (relative to this file)
- **Validation** -- `references/validation-monitor.md` (relative to this file)
- **Custom SQL** -- `references/custom-sql-monitor.md` (relative to this file)
- **Comparison** -- `references/comparison-monitor.md` (relative to this file)
- **Table** -- `references/table-monitor.md` (relative to this file)

#### Step 5: Ask about scheduling

**Skip this step for table monitors.** Table monitors do not support the `schedule` field in MaC YAML -- adding it will cause a validation error on `montecarlo monitors apply`. Table monitor scheduling is managed automatically by Monte Carlo.

For all other monitor types, the creation tools default to a fixed schedule running every 60 minutes. Present these options:

1. **Fixed interval** -- any integer for `interval_minutes` (30, 60, 90, 120, 360, 720, 1440, etc.)
2. **Dynamic** -- MC auto-determines when to run based on table update patterns.
3. **Loose** -- runs once per day.

Schedule format in MaC YAML:

- Fixed: `schedule: { type: fixed, interval_minutes: <N> }`
- Dynamic: `schedule: { type: dynamic }`
- Loose: `schedule: { type: loose, start_time: "00:00" }`

#### Step 6: Confirm with the user

Before calling the creation tool, present the monitor configuration in plain language:

- Monitor type
- Target table (and columns if applicable)
- What it checks / what triggers an alert
- Domain assignment
- Schedule

Ask: "Does this look correct? I'll generate the monitor configuration."

**NEVER call the creation tool without user confirmation.**

#### Step 7: Create the monitor

Call the appropriate creation tool with the parameters built in previous steps. Always pass an MCON when possible. If only the table name is available, also pass the warehouse.

#### Step 8: Present results

**CRITICAL: Always include the YAML in your response.** The user needs copy-pasteable YAML.

1. If a non-default schedule was chosen, modify the schedule section in the YAML before presenting.
2. Wrap the YAML in the full MaC structure (see "MaC YAML format" below).
3. ALWAYS present the full YAML in a fenced `yaml` code block.
4. Explain where to put it and how to apply it (see below).
5. ALWAYS use ISO 8601 format for datetime values.
6. NEVER reformat YAML values returned by creation tools.

## MaC YAML format

The YAML returned by creation tools is the monitor definition. It must be wrapped in the standard MaC structure to be applied:

```yaml
montecarlo:
  <monitor_type>:
    - <returned yaml>
```

For example, a metric monitor would look like:

```yaml
montecarlo:
  metric:
    - <yaml returned by createMetricMonitorMac>
```

**Important:** `montecarlo.yml` (without a directory path) is a separate Monte Carlo project configuration file -- it is NOT the same as a monitor definition file. Monitor definitions go in their own `.yml` files, typically in a `monitors/` directory or alongside dbt model schema files.

Tell the user:

- Save the YAML to a `.yml` file (e.g. `monitors/<table_name>.yml` or in their dbt schema)
- Apply via the Monte Carlo CLI: `montecarlo monitors apply --namespace <namespace>`
- Or integrate into CI/CD for automatic deployment on merge

---

## Common mistakes to avoid

- **NEVER guess column names.** Always get them from `getTable`.
- **NEVER skip the confirmation step** (Step 6).
- For metric monitors, `aggregate_time_field` MUST be a real timestamp column from the table.
- For validation monitors, conditions match INVALID data, not valid data.
- Always pass an MCON when possible. If only the table name is available, also pass the warehouse.
- **ALWAYS check the table's `domains` BEFORE calling any creation tool.**
- ALWAYS use ISO 8601 format for datetime values.
- NEVER reformat YAML values returned by creation tools.
- Do not call creation tools before the validation phase is complete.

@@ -0,0 +1,426 @@

# Comparison Monitor Reference

Detailed reference for building `createComparisonMonitorMac` tool calls.

## When to Use

Use a comparison monitor when the user wants to:

- Compare data between two tables (e.g., source vs target, dev vs prod)
- Validate data consistency after migration or replication
- Check row count parity across environments
- Compare field-level metrics between tables (null counts, sums, distributions)

---

## Pre-Step: Verify Both Tables and Fields

Before constructing alert conditions, you MUST verify that both tables exist and that any referenced fields are real columns. This is the most common source of comparison monitor failures.

1. **Resolve both MCONs.** Use `search` to find the source and target tables. If the user provided `database:schema.table` format, search for each to get the MCON.
2. **Get full schemas.** Call `getTable` with `include_fields: true` on BOTH the source table and the target table. You need the column lists from both.
3. **For field-level metrics, verify fields exist on both sides.** Confirm that `sourceField` exists in the source table's column list AND `targetField` exists in the target table's column list. Field names are case-sensitive on most warehouses.
4. **Check field type compatibility.** The metric must be compatible with the column types on both sides. For example, `NUMERIC_MEAN` requires numeric columns in both the source and target tables. If the source column is numeric but the target is a string, the comparison will fail.
5. If any field does not exist or types are incompatible, stop and ask the user to clarify. Do not guess.
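The verification steps above can be sketched as a pre-flight check. This is a minimal illustration under assumptions: the `getTable` responses are reduced here to plain `{column: type}` dicts, and the type names and helper are placeholders, not the MCP tool's actual response format:

```python
# Metrics that require numeric columns on both sides (step 4).
NUMERIC_METRICS = {"NUMERIC_MEAN", "NUMERIC_MEDIAN", "NUMERIC_MIN",
                   "NUMERIC_MAX", "NUMERIC_STDDEV", "SUM"}

def verify_fields(metric, source_field, target_field, source_schema, target_schema):
    """Return a list of problems; an empty list means the condition looks valid."""
    problems = []
    if source_field not in source_schema:        # step 3: field must exist in source
        problems.append(f"source column {source_field!r} not found")
    if target_field not in target_schema:        # step 3: field must exist in target
        problems.append(f"target column {target_field!r} not found")
    if not problems and metric in NUMERIC_METRICS:
        # step 4: numeric metrics need numeric columns on BOTH sides
        for side, schema, field in (("source", source_schema, source_field),
                                    ("target", target_schema, target_field)):
            if schema[field] not in ("NUMBER", "FLOAT", "INTEGER"):
                problems.append(
                    f"{side} column {field!r} is {schema[field]}, not numeric")
    return problems

src = {"revenue": "NUMBER", "email": "VARCHAR"}
tgt = {"total_revenue": "VARCHAR"}
print(verify_fields("NUMERIC_MEAN", "revenue", "total_revenue", src, tgt))
# ["target column 'total_revenue' is VARCHAR, not numeric"]
```

If the returned list is non-empty, stop and ask the user (step 5) instead of calling the creation tool.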

---

## Required Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | string | Unique identifier for the monitor. Use a descriptive slug (e.g., `orders_dev_prod_compare`). |
| `description` | string | Human-readable description of what the monitor checks. |
| `source_table` | string | Source table MCON (preferred) or `database:schema.table` format. If not MCON, also pass `source_warehouse`. |
| `target_table` | string | Target table MCON (preferred) or `database:schema.table` format. If not MCON, also pass `target_warehouse`. |
| `alert_conditions` | array | List of comparison conditions (see Alert Conditions below). |

## Optional Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `source_warehouse` | string | Warehouse name or UUID for the source table. Required if `source_table` is not an MCON. |
| `target_warehouse` | string | Warehouse name or UUID for the target table. Required if `target_table` is not an MCON. |
| `segment_fields` | array of string | Fields to segment the comparison by. Must exist in BOTH tables with the same name. |
| `domain_id` | string (uuid) | Domain UUID (use `getDomains` to list). Only one domain can be assigned per monitor. |

---

## Cross-Warehouse Comparisons

When the source and target tables live in different warehouses (e.g., comparing a Snowflake staging table against a BigQuery production table), you MUST provide both `source_warehouse` and `target_warehouse` explicitly. The tool cannot auto-resolve warehouses when tables are in different environments.

Even when both tables are MCONs, if they belong to different warehouses, pass both warehouse parameters to be safe. Omitting them in cross-warehouse scenarios causes silent failures or incorrect results.

Common cross-warehouse patterns:
- **Dev vs prod:** same warehouse type, different databases or schemas
- **Migration validation:** source in old warehouse, target in new warehouse
- **Replication checks:** primary warehouse vs replica or downstream warehouse

---

## Alert Conditions

Each condition compares a metric between the source and target tables.

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `metric` | string | Yes | The metric to compare (see Metrics Reference below). |
| `sourceField` | string | For field-level metrics | Column in the source table. Required for ALL metrics except `ROW_COUNT`. |
| `targetField` | string | For field-level metrics | Column in the target table. Required for ALL metrics except `ROW_COUNT`. |
| `thresholdValue` | number | No | Threshold for acceptable difference between source and target. |
| `isThresholdRelative` | boolean | No | `false` = absolute difference (default), `true` = percentage difference. |
| `customMetric` | object | No | Custom SQL expressions for source and target (see Custom Metrics below). |

---

## ROW_COUNT and Fields: A Critical Rule

> **NEVER pass `sourceField` or `targetField` when using the `ROW_COUNT` metric.**

`ROW_COUNT` is a table-level metric -- it counts all rows in the table, not values in a column. Passing field names with `ROW_COUNT` causes the API call to fail or produce unexpected behavior.

This is the single most common mistake with comparison monitors. Before submitting any alert condition with `ROW_COUNT`, verify that `sourceField` and `targetField` are both absent from the condition object.

| Metric | Fields needed? | What happens if you pass fields? |
|--------|---------------|----------------------------------|
| `ROW_COUNT` | **No -- NEVER pass fields** | API error or undefined behavior |
| All other metrics | **Yes -- always pass both fields** | Required for the comparison to work |

---

## Metrics Reference

### Table-level metric (no fields needed)

| Metric | Description |
|--------|-------------|
| `ROW_COUNT` | Compare total row counts between source and target. |

### Field-level metrics (require `sourceField` and `targetField`)

#### Uniqueness and duplicates

| Metric | Description |
|--------|-------------|
| `UNIQUE_COUNT` | Count of distinct values. |
| `DUPLICATE_COUNT` | Count of duplicate (non-unique) values. |
| `APPROX_DISTINCT_COUNT` | Approximate distinct count (faster on large tables). |

#### Null and empty checks

| Metric | Description |
|--------|-------------|
| `NULL_COUNT` | Count of null values. |
| `NON_NULL_COUNT` | Count of non-null values. |
| `EMPTY_STRING_COUNT` | Count of empty string values. |
| `TEXT_ALL_SPACES_COUNT` | Count of values that are all whitespace. |
| `NAN_COUNT` | Count of NaN values. |
| `TEXT_NULL_KEYWORD_COUNT` | Count of values containing null-like keywords (e.g., "NULL", "None"). |

#### Numeric statistics

| Metric | Description |
|--------|-------------|
| `NUMERIC_MEAN` | Mean of numeric field. |
| `NUMERIC_MEDIAN` | Median of numeric field. |
| `NUMERIC_MIN` | Minimum value. |
| `NUMERIC_MAX` | Maximum value. |
| `NUMERIC_STDDEV` | Standard deviation. |
| `SUM` | Sum of numeric field. |
| `ZERO_COUNT` | Count of zero values. |
| `NEGATIVE_COUNT` | Count of negative values. |

#### Percentiles

| Metric | Description |
|--------|-------------|
| `PERCENTILE_20` | 20th percentile value. |
| `PERCENTILE_40` | 40th percentile value. |
| `PERCENTILE_60` | 60th percentile value. |
| `PERCENTILE_80` | 80th percentile value. |

#### Text statistics

| Metric | Description |
|--------|-------------|
| `TEXT_MAX_LENGTH` | Maximum string length. |
| `TEXT_MIN_LENGTH` | Minimum string length. |
| `TEXT_MEAN_LENGTH` | Mean string length. |
| `TEXT_STD_LENGTH` | Standard deviation of string length. |

#### Text format checks

| Metric | Description |
|--------|-------------|
| `TEXT_NOT_INT_COUNT` | Count of values not parseable as integers. |
| `TEXT_NOT_NUMBER_COUNT` | Count of values not parseable as numbers. |
| `TEXT_NOT_UUID_COUNT` | Count of values not matching UUID format. |
| `TEXT_NOT_SSN_COUNT` | Count of values not matching SSN format. |
| `TEXT_NOT_US_PHONE_COUNT` | Count of values not matching US phone format. |
| `TEXT_NOT_US_STATE_CODE_COUNT` | Count of values not matching US state codes. |
| `TEXT_NOT_US_ZIP_CODE_COUNT` | Count of values not matching US zip codes. |
| `TEXT_NOT_EMAIL_ADDRESS_COUNT` | Count of values not matching email format. |
| `TEXT_NOT_TIMESTAMP_COUNT` | Count of values not parseable as timestamps. |

#### Boolean

| Metric | Description |
|--------|-------------|
| `TRUE_COUNT` | Count of true values. |
| `FALSE_COUNT` | Count of false values. |

#### Timestamp

| Metric | Description |
|--------|-------------|
| `FUTURE_TIMESTAMP_COUNT` | Count of timestamps in the future. |
| `PAST_TIMESTAMP_COUNT` | Count of timestamps unreasonably far in the past. |
| `UNIX_ZERO_COUNT` | Count of timestamps equal to Unix epoch zero (1970-01-01). |

---

## Choosing the Right Metric

| User intent | Correct metric | Fields needed? |
|-------------|---------------|----------------|
| Row count parity | `ROW_COUNT` | **No** -- never pass fields |
| Distinct values in a column | `UNIQUE_COUNT` | Yes |
| Null values in a column | `NULL_COUNT` | Yes |
| Sum, average, min, max | `SUM`, `NUMERIC_MEAN`, `NUMERIC_MIN`, `NUMERIC_MAX` | Yes |
| Data completeness | `NON_NULL_COUNT` | Yes |
| String format validation | `TEXT_NOT_EMAIL_ADDRESS_COUNT`, `TEXT_NOT_UUID_COUNT`, etc. | Yes |
| Custom computed expressions | Use `customMetric` instead of `metric` | No (SQL handles it) |

---

## Custom Metrics

Use custom metrics when:

- **Column names differ** between source and target and you need a computed expression (not just a direct field comparison).
- **You need a derived calculation** like `SUM(quantity * unit_price)` rather than a simple column metric.
- **Standard metrics do not cover the comparison** (e.g., comparing a ratio, a conditional aggregate, or a windowed calculation).

If the columns simply have different names but you want a standard metric (e.g., compare `SUM` of `revenue` in source vs `total_revenue` in target), you do NOT need a custom metric -- just use the standard metric with different `sourceField` and `targetField` values.

Custom metric structure:

```json
{
  "customMetric": {
    "displayName": "Revenue Sum",
    "sourceSqlExpression": "SUM(revenue)",
    "targetSqlExpression": "SUM(total_revenue)"
  }
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `displayName` | string | Yes | Human-readable name for the metric in alerts and dashboards. |
| `sourceSqlExpression` | string | Yes | SQL expression evaluated against the source table. |
| `targetSqlExpression` | string | Yes | SQL expression evaluated against the target table. |

When using `customMetric`, do NOT also pass `metric`, `sourceField`, or `targetField` in the same alert condition. The custom metric replaces all of those.

---

## Threshold Guidance

### Absolute thresholds (`isThresholdRelative: false` or omitted)

The `thresholdValue` is the maximum acceptable absolute difference between the source and target metric values.

- `thresholdValue: 0` -- source and target must match exactly.
- `thresholdValue: 100` -- up to 100 units of difference is acceptable.

### Relative (percentage) thresholds (`isThresholdRelative: true`)

The `thresholdValue` is the maximum acceptable percentage difference.

- `thresholdValue: 5` -- up to 5% difference is acceptable.
- `thresholdValue: 0.1` -- up to 0.1% difference is acceptable.

### When to use each

| Scenario | Recommended threshold type |
|----------|---------------------------|
| Exact replication (row counts must match) | Absolute, `thresholdValue: 0` |
| Near-real-time sync with small lag | Absolute, small value (e.g., 10-100) |
| Tables at different scales | Relative, percentage-based |
| Aggregated metrics (sums, means) | Relative, to handle floating-point differences |
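The two threshold modes come down to a few lines of arithmetic. The sketch below illustrates the semantics described above, not the product's exact evaluation logic -- in particular, the percentage base is assumed here to be the source value:

```python
def breaches(source: float, target: float, threshold: float, relative: bool) -> bool:
    """True when the source/target difference exceeds the acceptable threshold."""
    diff = abs(source - target)
    if relative:
        # Percentage difference, taken here against the source value.
        return (diff / source) * 100 > threshold
    return diff > threshold

print(breaches(10_000, 10_050, 100, relative=False))  # 50-row diff, 100 allowed -> False
print(breaches(10_000, 10_600, 5, relative=True))     # 6% diff, 5% allowed -> True
print(breaches(10_000, 10_000, 0, relative=False))    # exact match required -> False
```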

---

## Examples

### Row count parity with absolute threshold

Compare row counts between dev and prod, alerting if they differ by more than 100 rows.

```json
{
  "name": "orders_dev_prod_row_count",
  "description": "Verify dev and prod orders tables have similar row counts",
  "source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++dev_warehouse:core.orders",
  "target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++prod_warehouse:core.orders",
  "alert_conditions": [
    {
      "metric": "ROW_COUNT",
      "thresholdValue": 100,
      "isThresholdRelative": false
    }
  ]
}
```

Note: no `sourceField` or `targetField` -- `ROW_COUNT` is table-level.

### Row count parity with percentage threshold

Alert if row counts differ by more than 5%.

```json
{
  "name": "orders_replication_check",
  "description": "Verify replicated orders table is within 5% of source row count",
  "source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++primary:sales.orders",
  "target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++replica:sales.orders",
  "alert_conditions": [
    {
      "metric": "ROW_COUNT",
      "thresholdValue": 5,
      "isThresholdRelative": true
    }
  ]
}
```

### Field-level comparison (different column names)

Compare the sum of `revenue` in the source table against `total_revenue` in the target table.

```json
{
  "name": "revenue_source_target_sum",
  "description": "Verify revenue sums match between staging and production",
  "source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++staging:finance.transactions",
  "target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++production:finance.transactions",
  "alert_conditions": [
    {
      "metric": "SUM",
      "sourceField": "revenue",
      "targetField": "total_revenue",
      "thresholdValue": 1,
      "isThresholdRelative": true
    }
  ]
}
```

### Segmented comparison

Compare null counts on `email` between source and target, segmented by `country`. The `country` field must exist in both tables.

```json
{
  "name": "email_nulls_by_country",
  "description": "Compare email null counts by country between ETL source and target",
  "source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++raw:crm.contacts",
  "target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++analytics:crm.contacts",
  "segment_fields": ["country"],
  "alert_conditions": [
    {
      "metric": "NULL_COUNT",
      "sourceField": "email",
      "targetField": "email",
      "thresholdValue": 0,
      "isThresholdRelative": false
    }
  ]
}
```

### Cross-warehouse comparison with explicit warehouses

When source and target are in different warehouses, both warehouse parameters must be provided.

```json
{
  "name": "migration_users_row_count",
  "description": "Validate user row counts match after Snowflake to BigQuery migration",
  "source_table": "snowflake_db:public.users",
  "source_warehouse": "snowflake-prod",
  "target_table": "bigquery_project:public.users",
  "target_warehouse": "bigquery-prod",
  "alert_conditions": [
    {
      "metric": "ROW_COUNT",
      "thresholdValue": 0,
      "isThresholdRelative": false
    }
  ]
}
```

### Custom metric comparison

Compare a computed revenue expression when the SQL differs between source and target.

```json
{
  "name": "computed_revenue_compare",
  "description": "Compare total revenue computation between legacy and new schema",
  "source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++warehouse:legacy.orders",
  "target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++warehouse:v2.orders",
  "alert_conditions": [
    {
      "customMetric": {
        "displayName": "Total Revenue",
        "sourceSqlExpression": "SUM(quantity * unit_price)",
        "targetSqlExpression": "SUM(total_amount)"
      },
      "thresholdValue": 0.01,
      "isThresholdRelative": true
    }
  ]
}
```

### Multiple alert conditions

Compare both row counts and field-level metrics in a single monitor.

```json
{
  "name": "orders_full_comparison",
  "description": "Full comparison of orders between staging and production",
  "source_table": "MCON++a1b2c3d4-e5f6-7890-abcd-ef1234567890++1++1++staging:core.orders",
  "target_table": "MCON++b2c3d4e5-f6a7-8901-bcde-f12345678901++1++1++production:core.orders",
  "domain_id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "alert_conditions": [
    {
      "metric": "ROW_COUNT",
      "thresholdValue": 0,
      "isThresholdRelative": false
    },
    {
      "metric": "NULL_COUNT",
      "sourceField": "customer_id",
      "targetField": "customer_id",
      "thresholdValue": 0,
      "isThresholdRelative": false
    },
    {
      "metric": "SUM",
      "sourceField": "amount",
      "targetField": "amount",
      "thresholdValue": 0.1,
      "isThresholdRelative": true
    }
  ]
}
```

Note: the `ROW_COUNT` condition has no fields, while the field-level conditions each specify both `sourceField` and `targetField`.

# Custom SQL Monitor Reference

Detailed reference for building `createCustomSqlMonitorMac` tool calls.

## When to Use

Use a custom SQL monitor when the user wants to:

- Run a specific SQL query and alert on its result
- Implement cross-table logic (joins, subqueries, CTEs)
- Apply business-specific aggregations or calculations that don't map to a single metric
- Monitor a condition that spans multiple columns or tables
- Use a SQL query they already have in mind

---

## The Universal Fallback

Custom SQL is the fallback monitor type. Reach for it whenever another monitor type cannot express what the user needs:

- **Validation monitor won't work** because the column doesn't exist yet, or the logic requires joins across tables.
- **Metric monitor can't express the business logic** -- for example, a ratio between two columns, a conditional aggregation, or a calculation that spans multiple tables.
- **Cross-table joins are needed** -- metric and validation monitors operate on a single table. If the check requires data from two or more tables, custom SQL is the only option.
- **The user already has a SQL query** -- don't force it into another monitor type. Wrap it in a custom SQL monitor.

If you find yourself contorting another monitor type to fit the user's intent, stop and use custom SQL instead.

---

## Required Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | string | Unique identifier for the monitor. Use a descriptive slug (e.g., `orphan_orders_check`). |
| `description` | string | Human-readable description of what the monitor checks. |
| `warehouse` | string | Warehouse name or UUID where the SQL query will be executed. |
| `sql` | string | SQL query that returns a **single numeric value** (one row, one column). |
| `alert_conditions` | array | List of threshold conditions (see Alert Conditions below). |

## Optional Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `domain_id` | string (uuid) | Domain UUID (use `getDomains` to list). Only one domain can be assigned per monitor. |

---

## Alert Conditions

Each alert condition compares the query result against a threshold.

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `operator` | string | Yes | `"GT"`, `"LT"`, `"EQ"`, `"GTE"`, `"LTE"`, `"NE"` |
| `thresholdValue` | number | Yes | Numeric threshold to compare the query result against. |

### No AUTO Support

Custom SQL monitors do **NOT** support `AUTO` (anomaly detection). You must specify an explicit operator and threshold for every alert condition. This is a common mistake -- if the user asks for anomaly detection, steer them toward a metric monitor instead, which does support `AUTO`.

If the user is unsure what threshold to set, help them reason about it: "What value would indicate a problem? If the query returns X, should that fire an alert?"
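Because every condition carries an explicit operator and threshold, evaluation is easy to reason about. A sketch of how the six operators compare a query result against `thresholdValue` (illustrative only; each condition fires independently):

```python
import operator

OPS = {"GT": operator.gt, "LT": operator.lt, "EQ": operator.eq,
       "GTE": operator.ge, "LTE": operator.le, "NE": operator.ne}

def fires(result: float, conditions: list) -> list:
    """Return the conditions that fire; each one is checked independently."""
    return [c for c in conditions
            if OPS[c["operator"]](result, c["thresholdValue"])]

# Range check: alert when the result is below 20 OR above 500.
conds = [{"operator": "LT", "thresholdValue": 20},
         {"operator": "GT", "thresholdValue": 500}]
print(fires(650, conds))  # [{'operator': 'GT', 'thresholdValue': 500}]
```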

---

## SQL Query Requirements

The SQL query MUST return exactly **one row with one numeric column**. This is non-negotiable -- the monitor compares that single value against the alert conditions.

### Rules

- Use aggregate functions: `COUNT(*)`, `SUM()`, `AVG()`, `MAX()`, `MIN()`, or similar.
- Can reference any table, view, or materialized view accessible in the warehouse.
- Can use joins, subqueries, CTEs, window functions -- any valid SQL.
- Do **NOT** include trailing semicolons.
- Do **NOT** include comments (`--` or `/* */`) -- some warehouses strip them inconsistently.

### SQL Validation Tips

These are the most common mistakes that cause custom SQL monitors to fail or produce misleading results:

1. **Handle NULLs with COALESCE.** If your aggregate could return NULL (e.g., `SUM(amount)` on an empty result set), wrap it: `SELECT COALESCE(SUM(amount), 0) FROM ...`. A NULL result cannot be compared against a threshold and will not trigger alerts.

2. **Ensure exactly one row, one column.** If your query could return zero rows (e.g., a filtered `SELECT` with no `GROUP BY`), wrap it in an outer aggregate: `SELECT COUNT(*) FROM (SELECT ...) sub`. If it returns multiple columns, select only the one you need.

3. **Test the query mentally.** Before finalizing, ask: "If this query returns 5, will the alert condition fire correctly?" Walk through the logic with a concrete number.

4. **For time-windowed checks, use appropriate date functions.** SQL syntax for date arithmetic varies by warehouse (see Warehouse-Specific SQL Notes below). Always scope time windows to avoid scanning the entire table history.

5. **Avoid non-deterministic results.** Queries using `LIMIT` without `ORDER BY`, or `RANDOM()`, produce unpredictable results that make alerting unreliable.
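Tips 1 and 2 reduce to one invariant: the query must yield a single non-null numeric value. A sketch of that check on a fetched result set, assuming rows arrive as a list of tuples (a common client-library convention, not a Monte Carlo API):

```python
def extract_scalar(rows):
    """Validate the 'one row, one column, numeric, non-null' invariant."""
    if len(rows) != 1:
        raise ValueError(f"query must return exactly one row, got {len(rows)}")
    if len(rows[0]) != 1:
        raise ValueError(f"query must return exactly one column, got {len(rows[0])}")
    value = rows[0][0]
    if value is None:
        # Tip 1: a NULL result cannot be compared against a threshold.
        raise ValueError("result is NULL -- wrap the aggregate in COALESCE(..., 0)")
    if not isinstance(value, (int, float)):
        raise ValueError(f"result must be numeric, got {type(value).__name__}")
    return value

print(extract_scalar([(42,)]))  # 42
```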

---

## Warehouse-Specific SQL Notes

SQL syntax for date arithmetic and functions varies across warehouses. When writing time-windowed queries, use the correct syntax for the user's warehouse:

| Operation | Snowflake | BigQuery | Redshift |
|-----------|-----------|----------|----------|
| Subtract 1 day from now | `DATEADD(day, -1, CURRENT_TIMESTAMP())` | `TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)` | `DATEADD(day, -1, GETDATE())` |
| Subtract 1 hour from now | `DATEADD(hour, -1, CURRENT_TIMESTAMP())` | `TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)` | `DATEADD(hour, -1, GETDATE())` |
| Current timestamp | `CURRENT_TIMESTAMP()` | `CURRENT_TIMESTAMP()` | `GETDATE()` |
| Date truncation | `DATE_TRUNC('day', col)` | `DATE_TRUNC(col, DAY)` | `DATE_TRUNC('day', col)` |

When unsure which warehouse the user is on, ask. Getting the syntax wrong causes the monitor to fail on every scheduled run.
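When generating these queries programmatically, the dialect differences are easiest to centralize in one helper. A minimal sketch -- the function and dialect keys are illustrative, and for BigQuery it uses `TIMESTAMP_SUB`, which operates on timestamps:

```python
def days_ago_expr(warehouse: str, days: int) -> str:
    """Return the dialect-specific 'N days before now' SQL expression."""
    templates = {
        "snowflake": f"DATEADD(day, -{days}, CURRENT_TIMESTAMP())",
        "bigquery": f"TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {days} DAY)",
        "redshift": f"DATEADD(day, -{days}, GETDATE())",
    }
    try:
        return templates[warehouse]
    except KeyError:
        # Fail loudly rather than emitting SQL for the wrong dialect.
        raise ValueError(f"unknown warehouse dialect: {warehouse}") from None

print(days_ago_expr("bigquery", 1))
# TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
```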

---

## Examples

### Orphan records (GT 0)

Alert when orders reference customers that don't exist.

```json
{
  "name": "orphan_orders_check",
  "description": "Detect orders referencing non-existent customers",
  "warehouse": "production_snowflake",
  "sql": "SELECT COUNT(*) FROM analytics.core.orders o LEFT JOIN analytics.core.customers c ON o.customer_id = c.id WHERE c.id IS NULL",
  "alert_conditions": [
    {
      "operator": "GT",
      "thresholdValue": 0
    }
  ]
}
```

### Daily revenue floor (LT threshold)

Alert when total revenue for the past 24 hours drops below a minimum.

```json
{
  "name": "daily_revenue_floor",
  "description": "Alert when daily revenue falls below $10,000",
  "warehouse": "production_snowflake",
  "sql": "SELECT COALESCE(SUM(amount), 0) FROM analytics.billing.transactions WHERE created_at >= DATEADD(day, -1, CURRENT_TIMESTAMP())",
  "alert_conditions": [
    {
      "operator": "LT",
      "thresholdValue": 10000
    }
  ]
}
```

### Duplicate rate exceeds threshold

Alert when the duplicate rate on a key field exceeds 1%.

```json
{
  "name": "order_id_duplicate_rate",
  "description": "Alert when order_id duplicate rate exceeds 1%",
  "warehouse": "production_snowflake",
  "sql": "SELECT COALESCE(1.0 - (COUNT(DISTINCT order_id) * 1.0 / NULLIF(COUNT(*), 0)), 0) FROM analytics.core.orders WHERE created_at >= DATEADD(day, -1, CURRENT_TIMESTAMP())",
  "alert_conditions": [
    {
      "operator": "GT",
      "thresholdValue": 0.01
    }
  ]
}
```

### Multiple threshold conditions (range check)

Alert when a value falls outside an acceptable range. Multiple conditions act as independent checks -- each one that evaluates to true fires its own alert.

```json
{
  "name": "avg_order_amount_range",
  "description": "Alert when average order amount is outside the $20-$500 range",
  "warehouse": "production_snowflake",
  "sql": "SELECT COALESCE(AVG(amount), 0) FROM analytics.core.orders WHERE created_at >= DATEADD(day, -1, CURRENT_TIMESTAMP()) AND status = 'completed'",
  "alert_conditions": [
    {
      "operator": "LT",
      "thresholdValue": 20
    },
    {
      "operator": "GT",
      "thresholdValue": 500
    }
  ]
}
```

### Cross-table freshness check (BigQuery syntax)

Alert when the latest row in a downstream table is more than 2 hours behind the source.

```json
{
  "name": "pipeline_lag_check",
  "description": "Alert when downstream table lags source by more than 2 hours",
  "warehouse": "production_bigquery",
  "sql": "SELECT COALESCE(TIMESTAMP_DIFF(s.max_ts, t.max_ts, MINUTE), 9999) FROM (SELECT MAX(event_timestamp) AS max_ts FROM project.raw.events) s CROSS JOIN (SELECT MAX(processed_at) AS max_ts FROM project.analytics.events_processed) t",
  "alert_conditions": [
    {
      "operator": "GT",
      "thresholdValue": 120
    }
  ]
}
```