Architecture
DAG on Spark, truly multi-engine.
NeptunoDQ builds a directed acyclic graph of your quality rules, executes them in parallel while respecting dependencies, and runs identically on Apache Spark open source and Databricks.
Rule DAG
Dependencies respected, failures contained.
NeptunoDQ builds a DAG of your rules before executing them. If Rule B depends on Rule A, B only runs if A passes. If A fails or is skipped, B is automatically cancelled to avoid wasting cluster resources.
- Explicit dependencies. Declare which rules depend on which in the configuration file. The engine calculates the optimal execution order.
- Failure propagation. A failure in a parent node propagates to its descendants, which are marked as skipped without executing.
- No orphan executions. A child rule is never executed against data already known to be invalid by the parent rule.
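As a sketch of what an explicit dependency could look like in the configuration: the depends_on key below is an illustrative assumption, not a documented field name, so check the configuration reference for the exact syntax. The intent matches the behavior described above: the child rule runs only if its parent passes.

rules:
  - control_type: "SQL_TEXT"
    rule_id: "check_ids_not_null"
    text: "select * from users where id is null"
  - control_type: "SQL_TEXT"
    rule_id: "check_ids_unique"
    text: "select id from users group by id having count(*) > 1"
    depends_on: ["check_ids_not_null"]   # assumed key: skipped if the parent fails or is skipped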
Spark Fair Scheduler
Independent branches run in parallel.
Pool per rule
Each rule is assigned to its own scheduler pool, preventing a slow rule from blocking others.
True parallelism
Branches without dependencies run simultaneously. A 50-rule suite does not take 50 times as long as a single rule.
Cluster utilization
The Fair Scheduler distributes cluster resources across active rules, maximizing utilization.
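Under the hood this is Spark's standard Fair Scheduler; NeptunoDQ assigns the pools itself, so no manual pool configuration is required. If you need to tune scheduling, one plausible place is the neptuno_properties block shown in the Variables section below, which carries Spark configuration keys — whether the keys below are honored there is an assumption, not documented behavior.

neptuno_properties:
  spark.neptuno.num.thread: "8"     # product key from the example below; assumed to cap concurrent rules
  spark.scheduler.mode: "FAIR"      # standard Spark property; setting it here is an assumption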
Rule types
Five control types, one inventory.
From SQL files to Databricks notebooks. All share the same lifecycle: proposal, review, approval, execution, and traceability.
SQL_FILE
SQL rules in external files. Ideal for complex queries under version control. The file can live on HDFS, S3, or local storage.
SQL_TEXT
Inline SQL in the configuration. For simple, fast checks that don't justify a separate file. Variables are substituted the same way as in SQL_FILE rules.
TABLE
Predefined checks on a table: NULLS, DUPLICATES, WHITES. No SQL required — just declare which columns to validate.
FILE
Validations on files before loading them: CSV, Parquet, etc. Catch quality issues before they reach the table.
ADBNOTEBOOK
Run a Databricks notebook as a quality step. For complex validation logic that already exists as a notebook.
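SQL_FILE and SQL_TEXT rules are shown in the full configuration example in the Variables section below. For the remaining types, here is a hedged sketch of what the entries might look like; the keys columns, checks, path, format, and notebook_path are illustrative assumptions rather than documented field names.

rules:
  - control_type: "TABLE"
    rule_id: "check_users_table"
    table: "${NEPTUNO_SCHEMA}.${USERS_TABLE}"
    columns: ["id", "email"]                      # assumed key: columns to validate
    checks: ["NULLS", "DUPLICATES", "WHITES"]     # assumed key: the predefined checks named above
  - control_type: "FILE"
    rule_id: "check_landing_csv"
    path: "${LANDING_PATH}/users.csv"             # assumed keys: file location and format
    format: "csv"
  - control_type: "ADBNOTEBOOK"
    rule_id: "check_custom_notebook"
    notebook_path: "/Repos/dq/custom_validation"  # assumed key: Databricks notebook to run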
Variables
Variables and parametric substitution.
Use ${VARIABLE_NAME} in any configuration value. NeptunoDQ substitutes them at runtime according to the defined resolution order.
project_id: "neptuno-demo"
department: "analytics"
mdc: "neptuno-demo-001"

neptuno_properties:
  spark.neptuno.num.thread: "4"
  spark.sql.shuffle.partitions: "10"

rules:
  - control_type: "SQL_FILE"
    rule_id: "check_total_money"
    file: "${PATH}/rules/validate_money.sql"
    table: "${USERS_TABLE}"
    umbral: "0.01"
    variables:
      table: "${NEPTUNO_SCHEMA}.${USERS_TABLE}"
      max_age: "75"
    sql_aggregations:
      total_money: "sum(money)"

  - control_type: "SQL_TEXT"
    rule_id: "check_vip_users"
    text: "select * from ${table} where upper(tier) = 'VIP' and spend < ${min_spend}"
    table: "${USERS_TABLE}"
    variables:
      table: "${NEPTUNO_SCHEMA}.${USERS_TABLE}"
      min_spend: "1000"
    sql_aggregations:
      violation_count: "count(*)"

Resolution order
- 1. CLI arguments: Variables passed with -c KEY=VALUE or --conf KEY=VALUE in the launch command.
- 2. Environment variables: System variables available at runtime. Useful for secrets and environment configuration.
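Assuming the order above means that earlier sources win, a CLI value overrides an environment variable of the same name. A minimal sketch using the spark-submit launcher from the CLI & Launchers section below:

# Environment variable available at runtime (lower precedence)
export USERS_TABLE=users_dev

# CLI argument on the launch command (higher precedence): ${USERS_TABLE} resolves to users_prod
spark-submit \
  --class com.softbenur.neptuno.spark.NeptunoProjectSparkSubmit \
  neptuno-launcher-spark.jar \
  --mode YAML \
  --path /path/to/config.yaml \
  --conf USERS_TABLE=users_prod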
Supported engines
Same engine, multiple environments.
NeptunoDQ runs on Apache Spark open source and Databricks using the same configuration file. Other engines are on the roadmap.
Apache Spark
OSS 3.5, 4.0, 4.1
Databricks
Runtimes 15.4, 16.4, 17.3
Snowflake
AWS
EMR, Glue, S3
Azure
ADB, ADLS
CLI & Launchers
Two launchers, one mental model.
For Apache Spark open source, use spark-submit. For Databricks, create a Workflows Job with the JAR and parameters as a string list.
spark-submit \
--class com.softbenur.neptuno.spark.NeptunoProjectSparkSubmit \
neptuno-launcher-spark.jar \
--mode YAML \
--path /path/to/config.yaml \
--database my_database \
  --conf ENV=prod

For Databricks, pass the same arguments as a string list in the Workflows Job parameters:

["--mode", "JSON", "--path", "dbfs:/configs/quality_rules.json", "--database", "my_database"]

Main class: com.softbenur.neptuno.spark.NeptunoProjectDatabricksSubmit. Task type: JAR. Environment variables can also be passed as --conf ENV=prod.
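For the Databricks side, a sketch of how the JAR task could be declared in the Workflows Job definition. The spark_jar_task structure follows the Databricks Jobs API, but the task name and JAR location are illustrative assumptions, and the cluster settings are omitted:

{
  "tasks": [
    {
      "task_key": "neptuno_dq",
      "spark_jar_task": {
        "main_class_name": "com.softbenur.neptuno.spark.NeptunoProjectDatabricksSubmit",
        "parameters": ["--mode", "JSON", "--path", "dbfs:/configs/quality_rules.json", "--database", "my_database"]
      },
      "libraries": [
        { "jar": "dbfs:/jars/neptuno-launcher-spark.jar" }
      ]
    }
  ]
}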
See the architecture in a real demo.
Explore the architecture demo and the product walkthroughs.