Getting Started with AES Fabric ETL
This guide provides instructions on how to set up, configure, and execute the AES Fabric ETL (aesetl_v1) framework within Microsoft Fabric.
For any implementation assistance or questions, do not hesitate to reach out to support@assuranceeservices.com. We are here to help you with a successful implementation.
1. Prerequisites
Before installation, ensure you meet the following platform and permission requirements:
- Platform Requirements:
- A Microsoft Fabric Workspace with appropriate capacity (for MPE, a minimum F64 is required).
- A Fabric SQL Database (recommended name: DWInstrumentation) for metadata and instrumentation.
- A Fabric Lakehouse (lhdw) for data storage and logging.
- A Fabric Environment with the mssql-python library installed (required for SQL connectivity).
- Required Permissions:
- Read/Write access to target Lakehouse(s).
- Execute, Read, and Write permissions on the instrumentation SQL Database.
- Access to create and run Fabric pipelines.
- Access to Read Items in Workspace.
- Azure Key Vault access if using secret management.
2. Installation and Setup
Follow these steps to deploy the framework:
- Configure Instrumentation Database:
Execute the provided database schema scripts to create tables, stored procedures, and template metadata in your Fabric SQL Database. - Configure Fabric Environment:
- Create a new Fabric Environment (aesetlenv).
- Add the mssql-python library using “Full mode” and publish the environment.
- Deploy Notebook:
- Import the
aesetl_v1.ipynbnotebook into your workspace. - Copy the
builtin/scripts/folder to the notebook’s resource directory. - Attach the newly created Fabric Environment to the notebook.
- Import the
- Configure Lakehouse:
Create alogs/folder structure in your target Lakehouse’s Files area.
3. Running via Fabric Pipeline
To automate your ETL jobs, you can execute them through a Fabric Pipeline:
- Create a Pipeline: Open your Fabric workspace and create a new Fabric Pipeline.
- Add Notebook Activity:
- Select the
aesetl_v1notebook as the target activity. - Configure the following parameters in the activity settings:
strjobname: “Your JobName”(matching the strjobname in the instrumentation DB of table dbo.etljobs).-
instrumentationdbname: “DWInstrumentation”. lakehousetolog: “lhdw”.-
strsessionid:@{pipeline().RunId}. pipelineid:@{pipeline().Id}.
- Select the
- Execute: Run or schedule the pipeline to initiate the metadata-driven ETL process.
4. Scheduling and Monitoring
- Scheduling: Use the native Microsoft Fabric Pipeline scheduler to set your job frequency.
- Parallel Execution: The framework automatically manages parallel processing based on your metadata configuration.
- Parts (
strJobStepPartName) run sequentially. - Process Groups(
intProcessNo) within a part run in parallel simultaneously.
- Parts (
- Monitoring:
- Logs: Check real-time output in the notebook or view log files in your Lakehouse under
Files/logs/{JobName}/{Timestamp}_{SessionID}.log. - Database: Query the
ETLJobsandETLJobStepstables in your instrumentation database for current statuses (e.g., P for Processing, C for Complete, F for Failed).
- Logs: Check real-time output in the notebook or view log files in your Lakehouse under
FAQ
What is AES Fabric ETL?
It is a metadata-driven ETL orchestration framework specifically built for Microsoft Fabric that manages complex data pipelines through SQL-based configurations.
How is the software licensed?
It is licensed under the Business Source License 1.1 (BSL). It is free for community use up to 1,000 task runs per month; exceeding this requires a commercial license. For license or any questions, reach out to support@assuranceeservices.com
Installation & Configuration
Is the pymssql library included in Fabric by default?
No. You must manually create a Fabric Environment and add the mssql-python library in Full mode to enable SQL Server connectivity.
How are job steps organized for execution?
Steps are grouped by strJobStepPartName (executed sequentially) and intProcessNo (executed in parallel within each part), ordered by intOrderNo.
How do I handle sensitive data like API keys?
You should store credentials in Azure Key Vault and reference them in your variables using the $azkv$KeyVaultName .secret.SecretName$azkv syntax.
Functionality & Performance
What happens if a job is triggered while it is already running?
To prevent redundant processing, the framework incorporates a concurrency control mechanism that precludes the simultaneous execution of the same job. Upon initiation, the system validates the current pipeline status; if an active instance is detected, the framework performs a graceful exit and assigns a status of “U” (Unchanged) to maintain data integrity and avoid duplicate workloads.
Does the framework support incremental loading?
Yes. It tracks the strCurrentKeyValue and uses a strNextKeyCommand to determine the new data range for every run.
What merge types are supported for Lakehouse data?
It supports Type 1 SCD (overwrites existing records) and Type 2 SCD (maintains full historical tracking). It also supports Type 2 SCD by Datepart.
What is Type 2 SCD by Datepart?
This is a specialized SCD merge that performs Type 1 (overwrite) updates until a record’s update date passes a specific threshold (e.g., SECOND, HOUR, DAY), at which point it switches to Type 2 (historical tracking).
How do I execute multiple Spark SQL statements in one step?
You can wrap multiple statements in a <sparksql> tag and set <executeparallel> to true to run them concurrently, controlled by the <maxparallelism> parameter.
What is the purpose of the SHADOWDELETEMSSQL2LH command?
It identifies records deleted in the source system by comparing primary keys and propagates those deletions to the Lakehouse by setting a _deleteddate timestamp, which is distinct from standard SCD logic.
Can I trigger one Fabric pipeline from another using this framework?
Yes, the RUNPIPELINE command type allows you to trigger external pipelines with options to wait for completion or skip the task if the target pipeline is already running.
Variable Substitution & Logic
How to use JobVariables?
Variables are stored in the dbo.ETLJobVariables table within your instrumentation database. They are categorized into two scopes: Global Variables and Job-Specific Variables
What is the difference between Global and Job-specific variables?
Variables tagged with the strJobName as ‘Global’ are accessible to every job in the system, while others are restricted to the specific job they are assigned to.
How do I reference variables within my strCommand and strNextKeyCommand?
You use the $$VariableName$$ syntax. The framework will automatically replace these placeholders with values defined in the dbo.ETLJobVariables table before execution.
How can I use the result of one step to trigger the next?
The INSTRUMENTATIONDBSQL command type allows a NextKeyCommand to query the instrumentation database directly, effectively creating a “cascade” effect where a watermark value from one step determines the execution range of the next.
Architecture & Logging
Where can I find the historical record of my job runs?
Execution history is maintained in two primary audit tables: dbo.ETLJobAudit for high-level job status and dbo.ETLJobStepAudit for granular, step-by-step performance and error data.
What are the specific date pattern tokens for DATAFILELOAD?
When organizing file paths or names, you can use yyyy for a 4-digit year, mm for a 2-digit month, and dd for a 2-digit day.
Troubleshooting
What should I do if my job status is stuck on “Processing” but it isn’t actually running? You may need to manually reset the job status in the instrumentation database by updating the strJobStatus to ‘F’ (Failed) in the dbo.ETLJobs table.
Why am I seeing “Access Denied” when retrieving Key Vault secrets?
Ensure that the “Key Vault Secrets User” role has been granted to your Fabric workspace identity and verify the Key Vault name in your variable settings.
What is the resolution for a “Job Already Running” status conflict?
This typically occurs if a previous instance did not terminate correctly, leaving the status as ‘P’ (Processing). If the pipeline is confirmed as inactive, the status must be manually reset to ‘F’ (Failed) in the dbo.ETLJobs table to re-enable execution.
How do I resolve a ModuleNotFoundError during task execution?
This indicates that the framework’s internal helper modules are not in the correct resource directory. Verify that all .py utility files—including fabric_helper, lakehouse_helper , and task_helper -reside in the builtin/scripts/ folder of the notebook’s resources.
Why is my incremental load processing redundant data?
This is often caused by a failure in the strNextKeyCommand or a logic error in the ETLUpdateJobStepNextKeyValue stored procedure. Ensure the next key command returns a valid, updated value and that the strCurrentKeyValue is successfully persisted to the database upon step completion
For further technical support or enterprise licensing, please contact our engineering team at support@assuranceeservices.com.
