From Legacy ETL to Cloud Data Fusion Pipelines: SQL vs. Idiomatic CDAP Approaches

April 6, 2026 · 15 min read · MigryX Team

Google Cloud Data Fusion is a fully managed, CDAP-based data integration service that lets teams build ETL/ELT pipelines visually or programmatically. For organizations migrating off Alteryx, Talend, IBM DataStage, Informatica, SSIS, or Oracle ODI, Data Fusion offers a natural landing zone — it’s a visual pipeline builder backed by a Dataproc Spark execution engine, with native BigQuery, GCS, Pub/Sub, and Cloud SQL connectors out of the box.

But the critical architectural question during migration is: should you push SQL to BigQuery through Data Fusion, or should you use Data Fusion's native CDAP plugins to express transformation logic? The answer depends on what the legacy ETL actually does, and MigryX's parsers make this decision automatically by classifying every transformation in the source system.

Two Approaches, One Pipeline

Cloud Data Fusion supports two fundamentally different patterns for data transformation, and understanding when to use each is key to producing pipelines that are both performant and maintainable.

Approach 1: SQL Pushdown to BigQuery

When legacy ETL logic is fundamentally SQL — SELECT statements with JOINs, GROUP BY, CASE expressions, window functions, MERGE for upserts — the most efficient Data Fusion pipeline pushes that SQL directly to BigQuery. Data Fusion becomes a thin orchestration and scheduling layer, while BigQuery's serverless engine does the actual computation.

This approach is ideal when migrating from SQL-heavy legacy platforms:

# Data Fusion pipeline JSON — BigQuery SQL Pushdown node
{
  "name": "BigQueryPushdown",
  "plugin": {
    "name": "BigQueryPushDown",
    "type": "batchsource",
    "artifact": {"name": "google-cloud", "scope": "SYSTEM"},
    "properties": {
      "project": "my-project",
      "dataset": "staging",
      "sql": "SELECT o.order_id, o.amount, c.name, c.region,
              CASE WHEN o.amount >= 10000 THEN 'enterprise'
                   WHEN o.amount >= 1000  THEN 'mid_market'
                   ELSE 'smb' END AS deal_tier,
              SUM(o.amount) OVER (PARTITION BY c.region) AS region_total
       FROM `staging.orders` o
       JOIN `ref.customers` c ON o.customer_id = c.customer_id
       WHERE o.order_date >= '2025-01-01'",
      "enableQueryPushdown": "true"
    }
  }
}
The SQL pushdown approach involves no data movement for transformations. BigQuery executes the SQL in its serverless engine, and Data Fusion only orchestrates the execution. That means no Dataproc cluster is needed for SQL-only stages, which dramatically reduces both cost and latency.
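
Because BigQuery performs all of the computation, a candidate pushdown statement (for example, the MERGE-based upserts mentioned earlier) can be validated outside the pipeline before it is embedded in a stage. Below is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, table, and column names are illustrative.

# Sketch: validating a pushdown MERGE statement directly against BigQuery
# before embedding it in a Data Fusion stage. All names are illustrative.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

merge_sql = """
MERGE `gold.customer_orders` AS tgt
USING `staging.orders` AS src
ON tgt.order_id = src.order_id
WHEN MATCHED THEN
  UPDATE SET amount = src.amount, updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
  INSERT (order_id, customer_id, amount, updated_at)
  VALUES (src.order_id, src.customer_id, src.amount, CURRENT_TIMESTAMP())
"""

job = client.query(merge_sql)  # runs entirely in BigQuery's serverless engine
job.result()                   # block until the DML statement completes
print(f"Rows affected: {job.num_dml_affected_rows}")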

Approach 2: Idiomatic CDAP Plugins

When legacy ETL involves visual dataflow logic that isn’t naturally expressed in SQL — multi-branch routing, row-level parsing, conditional splits, custom functions, complex type coercion, nested data flattening — MigryX generates native CDAP plugins that map 1:1 to the legacy transformation semantics.

This approach is ideal for visual ETL platforms:

CDAP Plugin Mapping from Legacy ETL

Each legacy transformation type maps to a specific Data Fusion plugin:

Legacy Transformation | Data Fusion Plugin | Notes
Alteryx Select / Filter | Wrangler (filter-rows directive) | Row filtering with expression support
Alteryx Formula | JavaScript Transform | Row-level calculations and derived columns
Alteryx Join / Append | Joiner plugin | Inner, outer, left, right join modes
Alteryx Summarize | GroupBy Aggregate | SUM, AVG, COUNT, MIN, MAX with group keys
Talend tMap | Joiner + JavaScript Transform | Lookup join + expression columns
Talend tNormalize | Wrangler (split-to-rows) | Delimiter-based row explosion
DataStage Transformer | JavaScript / Python Transform | Derivation expressions converted to JS/Python
DataStage Lookup | Joiner (broadcast mode) | Small reference table broadcast join
Informatica Router | Splitter plugin | Conditional routing to multiple output ports
Informatica Expression | Wrangler / JavaScript Transform | Row-level derived fields
Informatica Sequence Generator | JavaScript Transform | Counter logic in custom transform
SSIS Derived Column | Wrangler (set-column directive) | Expression-based column creation
SSIS Conditional Split | Splitter plugin | Condition-based row routing
ODI Interface (IKM SQL) | BigQuery Pushdown | SQL-based KMs go to pushdown mode
ODI Interface (IKM Procedure) | JavaScript / Python Transform | Procedural KMs become custom plugins
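
To make the mapping concrete, consider a DataStage Transformer derivation or an Informatica Expression from the table above landing in a Python Evaluator transform stage. The following is a rough sketch of what the generated script body can look like; field names, thresholds, and the filter rule are illustrative rather than real conversion output.

# Sketch: Python Evaluator transform body for a converted Transformer/Expression stage.
# Field names (amount, deal_tier) and thresholds are illustrative.
def transform(record, emitter, context):
    # Rows the legacy filter/constraint would have rejected are simply dropped
    if record.get('amount') is None or record['amount'] <= 0:
        return

    # Derived column, equivalent to the legacy derivation expression
    if record['amount'] >= 10000:
        record['deal_tier'] = 'enterprise'
    elif record['amount'] >= 1000:
        record['deal_tier'] = 'mid_market'
    else:
        record['deal_tier'] = 'smb'

    emitter.emit(record)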

Example: Alteryx Workflow to Data Fusion Pipeline

Consider an Alteryx workflow with an Input Data tool reading from CSV, a Formula tool adding derived columns, a Filter tool removing invalid rows, a Join tool enriching with a reference table, and a Summarize tool aggregating results before writing to a database output.

# Data Fusion pipeline JSON — Idiomatic CDAP conversion of Alteryx workflow
{
  "name": "alteryx_orders_pipeline",
  "description": "Converted from Alteryx workflow: daily_order_summary.yxmd",
  "stages": [
    {
      "name": "RawOrders",
      "plugin": {"name": "GCSFile", "type": "batchsource"},
      "properties": {
        "path": "gs://raw-data/orders/",
        "format": "csv",
        "schema": "{\"type\":\"record\",\"fields\":[...]}"
      }
    },
    {
      "name": "CleanAndDerive",
      "plugin": {"name": "Wrangler", "type": "transform"},
      "properties": {
        "directives": [
          "filter-rows-on condition-false amount > 0",
          "set-column deal_tier ifelse(amount >= 10000, 'enterprise', ifelse(amount >= 1000, 'mid_market', 'smb'))",
          "set-column order_month format-date(order_date, 'yyyy-MM')"
        ]
      }
    },
    {
      "name": "ProductLookup",
      "plugin": {"name": "Joiner", "type": "batchjoiner"},
      "properties": {
        "joinKeys": "CleanAndDerive.product_id = Products.product_id",
        "selectedFields": "CleanAndDerive.*, Products.product_name, Products.category",
        "requiredInputs": "CleanAndDerive"
      }
    },
    {
      "name": "RegionSummary",
      "plugin": {"name": "GroupByAggregate", "type": "batchaggregator"},
      "properties": {
        "groupByFields": "region, deal_tier, order_month",
        "aggregates": "total_amount:amount:Sum, order_count:order_id:Count, avg_amount:amount:Avg"
      }
    },
    {
      "name": "BigQuerySink",
      "plugin": {"name": "BigQueryTable", "type": "batchsink"},
      "properties": {
        "project": "my-project",
        "dataset": "gold",
        "table": "regional_order_summary",
        "operation": "INSERT",
        "truncateTable": "true"
      }
    }
  ],
  "connections": [
    {"from": "RawOrders", "to": "CleanAndDerive"},
    {"from": "CleanAndDerive", "to": "ProductLookup"},
    {"from": "ProductLookup", "to": "RegionSummary"},
    {"from": "RegionSummary", "to": "BigQuerySink"}
  ]
}

Informatica to Cloud Data Fusion migration — automated end-to-end by MigryX

Hybrid Pipelines: SQL + CDAP Together

Real-world legacy ETL rarely falls cleanly into one category. A single Informatica workflow might have a Source Qualifier with SQL overrides (SQL-expressible), followed by an Expression transformation (could be either), followed by a Router with complex conditions (CDAP plugin), writing to multiple targets with Update Strategy logic (SQL MERGE). MigryX handles this by generating hybrid pipelines.

# Hybrid pipeline — SQL pushdown for extraction, CDAP for routing, SQL for loading
{
  "stages": [
    {
      "name": "ExtractWithSQL",
      "plugin": {"name": "BigQueryPushDown", "type": "batchsource"},
      "properties": {
        "sql": "SELECT * FROM `staging.transactions` WHERE txn_date >= CURRENT_DATE() - 7"
      }
    },
    {
      "name": "ClassifyAndRoute",
      "plugin": {"name": "JavaScriptTransform", "type": "transform"},
      "properties": {
        "script": "function transform(input, emitter, context) {
          if (input.amount >= 50000) { emitter.emit(input, 'high_value'); }
          else if (input.amount >= 5000) { emitter.emit(input, 'standard'); }
          else { emitter.emit(input, 'micro'); }
        }",
        "outputPorts": "high_value,standard,micro"
      }
    },
    {
      "name": "HighValueSink",
      "plugin": {"name": "BigQueryTable", "type": "batchsink"},
      "properties": {
        "dataset": "gold", "table": "high_value_transactions",
        "operation": "UPSERT", "tableKey": "txn_id"
      }
    },
    {
      "name": "StandardSink",
      "plugin": {"name": "BigQueryTable", "type": "batchsink"},
      "properties": {
        "dataset": "gold", "table": "standard_transactions",
        "operation": "INSERT"
      }
    }
  ]
}
Hybrid pipelines are the most common output in real-world migrations. Pure SQL pipelines and pure CDAP pipelines are edge cases. MigryX's parser classifies each transformation independently, so a single legacy workflow can produce a pipeline with both SQL pushdown stages and CDAP plugin stages — optimizing for performance and fidelity at every step.

MigryX: Idiomatic Code, Not Line-by-Line Translation

The difference between MigryX and manual migration is not just speed — it is code quality. MigryX generates idiomatic, platform-optimized code that leverages native features of your target platform. A SAS DATA step does not become a clunky row-by-row loop — it becomes a clean, vectorized DataFrame operation. A PROC SQL query does not become a literal translation — it becomes an optimized query that takes advantage of your platform’s pushdown capabilities.
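
As a schematic illustration of that difference (shown here in pandas purely for readability, not as actual MigryX output), the same derived-column rule can be written as a per-row loop or as a single vectorized expression; column names and thresholds are illustrative.

# Schematic contrast: row-by-row translation vs. idiomatic vectorized code.
# Column names and thresholds are illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [12000.0, 2500.0, 400.0]})

# Literal, line-by-line style: iterate over every row
tiers = []
for _, row in df.iterrows():
    if row["amount"] >= 10000:
        tiers.append("enterprise")
    elif row["amount"] >= 1000:
        tiers.append("mid_market")
    else:
        tiers.append("smb")
df["deal_tier_loop"] = tiers

# Idiomatic, vectorized style: one expression over the whole column
df["deal_tier"] = np.select(
    [df["amount"] >= 10000, df["amount"] >= 1000],
    ["enterprise", "mid_market"],
    default="smb",
)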

Orchestration: Scheduling and Dependencies

Legacy ETL platforms bundle orchestration with transformation. Informatica workflows, Talend job groups, DataStage sequences, and SSIS packages all include scheduling and dependency management. In Data Fusion, pipelines are scheduled natively or orchestrated through Cloud Composer (Airflow) for complex multi-pipeline DAGs.

# Cloud Composer DAG orchestrating multiple Data Fusion pipelines
from airflow import DAG
from airflow.providers.google.cloud.operators.datafusion import (
    CloudDataFusionStartPipelineOperator,
)
from datetime import datetime

dag = DAG(
    "daily_etl_orchestration",
    schedule_interval="0 6 * * *",
    start_date=datetime(2026, 1, 1),
    catchup=False,
)

extract_pipeline = CloudDataFusionStartPipelineOperator(
    task_id="extract_sources",
    pipeline_name="extract_all_sources",
    instance_name="prod-datafusion",
    location="us-central1",
    dag=dag,
)

transform_pipeline = CloudDataFusionStartPipelineOperator(
    task_id="transform_and_enrich",
    pipeline_name="transform_enrich_orders",
    instance_name="prod-datafusion",
    location="us-central1",
    dag=dag,
)

load_gold = CloudDataFusionStartPipelineOperator(
    task_id="load_gold_layer",
    pipeline_name="load_gold_tables",
    instance_name="prod-datafusion",
    location="us-central1",
    dag=dag,
)

extract_pipeline >> transform_pipeline >> load_gold

MigryX precision parser — Deep AST-level analysis ensures every construct is understood before conversion begins

Platform-Specific Optimization by MigryX

MigryX maintains deep knowledge of every target platform’s strengths and best practices. When converting to Snowflake, it leverages Snowpark and native SQL functions. When targeting Databricks, it uses PySpark DataFrame operations optimized for distributed execution. When generating dbt models, it follows dbt best practices for modularity and testability. This platform awareness is what makes MigryX output production-ready from day one.

How MigryX Decides: SQL or CDAP?

MigryX's parser analyzes each legacy transformation and classifies it into one of three categories:

  1. SQL-expressible — The transformation is a standard SQL operation (SELECT, JOIN, GROUP BY, CASE, window function, MERGE). Output: BigQuery pushdown SQL inside a Data Fusion pipeline stage.
  2. Plugin-required — The transformation involves row-level procedural logic, multi-branch routing, custom functions, or data type manipulation that SQL cannot express cleanly. Output: Native CDAP plugin (Wrangler, JavaScript Transform, Splitter, etc.).
  3. Hybrid — The transformation mixes SQL and procedural logic within a single construct (e.g., Informatica mapping with both SQL Override and Expression transformations). Output: Hybrid pipeline with both SQL pushdown and CDAP stages.

This classification is deterministic, auditable, and overridable. MigryX generates a conversion report that shows exactly why each transformation was routed to SQL or CDAP, with the original legacy code and the generated Data Fusion configuration side by side.
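
The full rule set is internal to MigryX, but the shape of that decision is easy to illustrate. The following is a deliberately simplified, hypothetical sketch of a deterministic routing rule; the real classifier works on the parsed AST of each transformation rather than on type names alone.

# Hypothetical, simplified sketch of SQL-vs-CDAP routing; not MigryX's actual rule set.
SQL_EXPRESSIBLE = {"source_qualifier", "aggregator", "joiner", "sorter", "union"}
PLUGIN_REQUIRED = {"router", "normalizer", "java_transform", "sequence_generator"}

def classify(transformation_type: str, has_procedural_logic: bool) -> str:
    """Route one legacy transformation to a target stage category."""
    if transformation_type in PLUGIN_REQUIRED:
        return "cdap_plugin"
    if transformation_type in SQL_EXPRESSIBLE:
        return "hybrid" if has_procedural_logic else "bigquery_pushdown"
    return "needs_review"  # surfaced in the conversion report for manual override

# Example: a Router with complex conditions becomes a CDAP Splitter stage,
# while a plain Source Qualifier is routed to BigQuery pushdown.
print(classify("router", has_procedural_logic=True))             # -> cdap_plugin
print(classify("source_qualifier", has_procedural_logic=False))  # -> bigquery_pushdown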

Key Takeaways

Migrating legacy ETL to Cloud Data Fusion is not a one-size-fits-all exercise. SQL-heavy workloads should push SQL to BigQuery for maximum performance. Visual dataflow logic should use CDAP's native plugin ecosystem. And most real-world migrations will produce hybrid pipelines that combine both approaches. MigryX automates this classification and generates production-ready Data Fusion pipeline JSON from every legacy source — Alteryx, Talend, DataStage, Informatica, SSIS, and ODI — with full lineage published to Dataplex.

Why MigryX Delivers Superior Migration Results

The challenges described throughout this article are exactly what MigryX was built to solve. Here is how MigryX transforms this process:

MigryX combines precision AST parsing with Merlin AI to deliver 99% accurate, production-ready migration — turning what used to be a multi-year manual effort into a streamlined, validated process. See it in action.

Ready to migrate legacy ETL to Cloud Data Fusion?

See how MigryX converts your Alteryx, Talend, DataStage, or Informatica workflows to production-ready Data Fusion pipelines with SQL pushdown and CDAP plugins.

Explore BigQuery Migration   Schedule a Demo