Architecture
DAG on Spark, truly multi-engine.
NeptunoDQ builds a directed acyclic graph of your quality rules, executes them in parallel while respecting dependencies, and runs identically on Apache Spark open source and Databricks.
Rule DAG
Dependencies respected, failures contained.
NeptunoDQ builds a DAG of your rules before executing them. If Rule B depends on Rule A, B only runs if A passes. If A fails or is skipped, B is automatically cancelled to avoid wasting cluster resources.
- Explicit dependencies. Declare which rules depend on which in the configuration file. The engine calculates the optimal execution order.
- Failure propagation. A failure in a parent node propagates to its descendants, which are marked as skipped without executing.
- No orphan executions. A child rule is never executed against data already known to be invalid by the parent rule.
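As a sketch of what an explicit dependency could look like in the configuration: the depends_on key below is an illustrative assumption, not a documented field name, so check the configuration reference for the exact syntax. The intent matches the behavior described above: the child rule runs only if its parent passes.

rules:
  - control_type: "SQL_TEXT"
    rule_id: "check_ids_not_null"
    text: "select * from users where id is null"
  - control_type: "SQL_TEXT"
    rule_id: "check_ids_unique"
    text: "select id from users group by id having count(*) > 1"
    depends_on: ["check_ids_not_null"]   # assumed key: skipped if the parent fails or is skipped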
Spark Fair Scheduler
Independent branches run in parallel.
Pool per rule
Each rule is assigned to its own scheduler pool, preventing a slow rule from blocking others.
True parallelism
Branches without dependencies run simultaneously. A 50-rule suite does not take 50 times as long as a single rule.
Cluster utilization
The Fair Scheduler distributes cluster resources across active rules, maximizing utilization.
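Under the hood this is Spark's standard Fair Scheduler; NeptunoDQ assigns the pools itself, so no manual pool configuration is required. If you need to tune scheduling, one plausible place is the neptuno_properties block shown in the Variables section below, which carries Spark configuration keys — whether the keys below are honored there is an assumption, not documented behavior.

neptuno_properties:
  spark.neptuno.num.thread: "8"     # product key from the example below; assumed to cap concurrent rules
  spark.scheduler.mode: "FAIR"      # standard Spark property; setting it here is an assumption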
Rule types
Five control types, one inventory.
From SQL files to Databricks notebooks. All share the same lifecycle: proposal, review, approval, execution, and traceability.
SQL_FILE
SQL rules in external files. Ideal for complex queries under version control. The file can live on HDFS, S3, or local storage.
SQL_TEXT
Inline SQL in the configuration. For simple, fast checks that don't justify a separate file. Variables are substituted the same way as in SQL_FILE rules.
TABLE
Predefined checks on a table: NULLS, DUPLICATES, WHITES. No SQL required — just declare which columns to validate.
FILE
Validations on files before loading them: CSV, Parquet, etc. Catch quality issues before they reach the table.
ADBNOTEBOOK
Run a Databricks notebook as a quality step. For complex validation logic that already exists as a notebook.
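SQL_FILE and SQL_TEXT rules are shown in the full configuration example in the Variables section below. For the remaining types, here is a hedged sketch of what the entries might look like; the keys columns, checks, path, format, and notebook_path are illustrative assumptions rather than documented field names.

rules:
  - control_type: "TABLE"
    rule_id: "check_users_table"
    table: "${NEPTUNO_SCHEMA}.${USERS_TABLE}"
    columns: ["id", "email"]                      # assumed key: columns to validate
    checks: ["NULLS", "DUPLICATES", "WHITES"]     # assumed key: the predefined checks named above
  - control_type: "FILE"
    rule_id: "check_landing_csv"
    path: "${LANDING_PATH}/users.csv"             # assumed keys: file location and format
    format: "csv"
  - control_type: "ADBNOTEBOOK"
    rule_id: "check_custom_notebook"
    notebook_path: "/Repos/dq/custom_validation"  # assumed key: Databricks notebook to run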
Variables
Variables and parametric substitution.
Use ${VARIABLE_NAME} in any configuration value. NeptunoDQ substitutes them at runtime according to the defined resolution order.
project_id: "neptuno-demo"
department: "analytics"
mdc: "neptuno-demo-001"

neptuno_properties:
  spark.neptuno.num.thread: "4"
  spark.sql.shuffle.partitions: "10"

rules:
  - control_type: "SQL_FILE"
    rule_id: "check_total_money"
    file: "${PATH}/rules/validate_money.sql"
    table: "${USERS_TABLE}"
    umbral: "0.01"
    variables:
      table: "${NEPTUNO_SCHEMA}.${USERS_TABLE}"
      max_age: "75"
    sql_aggregations:
      total_money: "sum(money)"

  - control_type: "SQL_TEXT"
    rule_id: "check_vip_users"
    text: "select * from ${table} where upper(tier) = 'VIP' and spend < ${min_spend}"
    table: "${USERS_TABLE}"
    variables:
      table: "${NEPTUNO_SCHEMA}.${USERS_TABLE}"
      min_spend: "1000"
    sql_aggregations:
      violation_count: "count(*)"

Resolution order
- 1. CLI arguments: Variables passed with -c KEY=VALUE or --conf KEY=VALUE in the launch command.
- 2. Environment variables: System variables available at runtime. Useful for secrets and environment configuration.
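Assuming the order above means that earlier sources win, a CLI value overrides an environment variable of the same name. A minimal sketch using the spark-submit launcher from the CLI & Launchers section below:

# Environment variable available at runtime (lower precedence)
export USERS_TABLE=users_dev

# CLI argument on the launch command (higher precedence): ${USERS_TABLE} resolves to users_prod
spark-submit \
  --class com.softbenur.neptuno.spark.NeptunoProjectSparkSubmit \
  neptuno-launcher-spark.jar \
  --mode YAML \
  --path /path/to/config.yaml \
  --conf USERS_TABLE=users_prod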
Supported engines
Same engine, multiple environments.
NeptunoDQ runs on Apache Spark open source and Databricks using the same configuration file. Other engines are on the roadmap.
Apache Spark
OSS 3.5, 4.0, 4.1
Databricks
Runtimes 15.4, 16.4, 17.3
Snowflake
AWS
EMR, Glue, S3
Azure
ADB, ADLS
CLI & Launchers
Two launchers, one mental model.
For Apache Spark open source, use spark-submit. For Databricks, create a Workflows Job with the JAR and parameters as a string list.
spark-submit \
--class com.softbenur.neptuno.spark.NeptunoProjectSparkSubmit \
neptuno-launcher-spark.jar \
--mode YAML \
--path /path/to/config.yaml \
--database my_database \
  --conf ENV=prod

For Databricks, pass the same arguments as a string list in the Workflows Job parameters:

["--mode", "JSON", "--path", "dbfs:/configs/quality_rules.json", "--database", "my_database"]

Main class: com.softbenur.neptuno.spark.NeptunoProjectDatabricksSubmit. Task type: JAR. Environment variables can also be passed as --conf ENV=prod.
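For the Databricks side, a sketch of how the JAR task could be declared in the Workflows Job definition. The spark_jar_task structure follows the Databricks Jobs API, but the task name and JAR location are illustrative assumptions, and the cluster settings are omitted:

{
  "tasks": [
    {
      "task_key": "neptuno_dq",
      "spark_jar_task": {
        "main_class_name": "com.softbenur.neptuno.spark.NeptunoProjectDatabricksSubmit",
        "parameters": ["--mode", "JSON", "--path", "dbfs:/configs/quality_rules.json", "--database", "my_database"]
      },
      "libraries": [
        { "jar": "dbfs:/jars/neptuno-launcher-spark.jar" }
      ]
    }
  ]
}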
See the architecture in a real demo.
Explore the architecture demo and the product walkthroughs.