The daily load process for customer data was inefficient, taking approximately 7–8 hours to load a 9GB CSV file containing around 30 million records into a SQL Server database. The bottleneck arose from:
1. Using KingswaySoft, a third-party SSIS integration tool, for both data transfer from an Amazon S3 bucket and hashing sensitive customer information (email addresses and mobile numbers) column by column.
2. The record-by-record hashing operation, performed within KingswaySoft, added significant processing time.
This delay impacted downstream processes and hindered timely access to the latest customer data.
Proposed Solution
After analyzing the issue, the following optimized approach was designed and implemented:
1. Data Transfer Optimization:
- Continued using KingswaySoft solely for its reliable Amazon S3 bucket integration, to transfer the CSV file to the local environment.
2. Hashing Process Optimization:
- Moved the hashing operation from KingswaySoft to the SQL Server side, using SQL Server's built-in HASHBYTES function with the MD5 algorithm.
- Imported raw data into a staging table in SQL Server without hashing (a sketch of such a staging table follows this list).
- Applied a batch update query to hash the required columns (email and mobile) directly in the SQL Server database after data ingestion.
3. Parallelism and Indexing:
- Leveraged bulk insert operations and optimized SQL Server configurations to handle the 30 million records efficiently.
- Applied appropriate indexes post-load to speed up updates and queries.
4. Result:
- Reduced the total data load time from 7–8 hours to 45 minutes.
- Maintained the integrity and security of sensitive customer data through hashing.
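For illustration, the staging table referenced in step 2 might look like the following T-SQL sketch; the table name, column names, and the CHAR(32) hex slot for the MD5 output are assumptions rather than details taken from the actual implementation.

```sql
-- Hypothetical staging table for the raw CSV load.
-- Email and Mobile hold raw values on load and are overwritten
-- with 32-character MD5 hex strings during the batch hashing step.
CREATE TABLE dbo.Customer_Staging (
    CustomerId BIGINT       NOT NULL,
    Email      VARCHAR(320) NULL,
    Mobile     VARCHAR(32)  NULL,
    IsHashed   BIT          NOT NULL DEFAULT 0   -- batch-update progress flag
);
```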
Detailed Execution Steps
Maxnet tackled the Client's data challenges by employing a structured, well-defined strategy involving data centralization, integration, and advanced business intelligence tools.
Amazon S3 to Local Storage
- Used KingswaySoft SSIS connectors to pull the 9GB CSV file from the Amazon S3 bucket to local storage.
- Ensured error handling and retries to account for large file size and potential network issues.
Raw Data Import
- Loaded the CSV data into a staging table in SQL Server using bulk insert operations to minimize transfer time.
- No transformations or hashing were applied at this stage (a minimal load sketch follows).
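A minimal sketch of such a raw load is shown below, assuming the hypothetical staging table sketched earlier; the file path and CSV options are illustrative rather than the exact settings used.

```sql
-- The staging table carries an IsHashed flag that is not in the CSV, so the
-- load goes through a thin view exposing only the file's columns (a format
-- file would work equally well).
CREATE VIEW dbo.Customer_Staging_Load
AS
SELECT CustomerId, Email, Mobile
FROM   dbo.Customer_Staging;
GO

BULK INSERT dbo.Customer_Staging_Load
FROM 'D:\DataLoads\customer_extract.csv'      -- hypothetical local path
WITH (
    FIRSTROW        = 2,            -- skip the header row
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '0x0a',
    TABLOCK,                        -- bulk-update lock for minimal logging
    BATCHSIZE       = 1000000       -- commit in chunks to keep the log manageable
);
```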
Batch Hashing
- Used SQL Server's HASHBYTES('MD5', column_name) function to hash sensitive data in batches (a sketch follows below).
- This approach reduced processing overhead by allowing SQL Server to perform hashing in a set-based operation.
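A batched, set-based update along those lines might look like the following sketch; the batch size, the IsHashed flag, and the hex-string storage format are assumptions carried over from the earlier staging-table sketch.

```sql
-- Hash email and mobile in place, one batch at a time, until no unhashed
-- rows remain. CONVERT(CHAR(32), ..., 2) renders the hash as hex without a 0x prefix.
DECLARE @BatchSize INT = 1000000;

WHILE 1 = 1
BEGIN
    UPDATE TOP (@BatchSize) s
    SET    Email    = CONVERT(CHAR(32), HASHBYTES('MD5', Email), 2),
           Mobile   = CONVERT(CHAR(32), HASHBYTES('MD5', Mobile), 2),
           IsHashed = 1
    FROM   dbo.Customer_Staging AS s
    WHERE  s.IsHashed = 0;

    IF @@ROWCOUNT = 0 BREAK;        -- all rows hashed
END;
```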
Index Creation and Final Merge
- After hashing, created necessary indexes to optimize further queries.
- Merged the hashed staging data into the main customer table using an efficient set-based SQL statement (see the sketch below).
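A sketch of this closing step is shown below; dbo.Customer, its CustomerId key, and the column list are illustrative and not the actual production schema.

```sql
-- Index the staging data on the join key, then upsert into the main table.
CREATE INDEX IX_Customer_Staging_CustomerId
    ON dbo.Customer_Staging (CustomerId);

MERGE dbo.Customer AS tgt
USING dbo.Customer_Staging AS src
      ON tgt.CustomerId = src.CustomerId
WHEN MATCHED THEN
    UPDATE SET tgt.Email  = src.Email,
               tgt.Mobile = src.Mobile
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerId, Email, Mobile)
    VALUES (src.CustomerId, src.Email, src.Mobile);
```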
Performance Validation
- Compared end-to-end load times before and after the optimization.
- Monitored SQL Server resource utilization to ensure no bottlenecks occurred during the batch update (a simple monitoring sketch follows).
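One simple way to capture the timing comparison and watch for bottlenecks is sketched below; the dbo.LoadRunLog helper table is hypothetical.

```sql
-- Record the elapsed time of a full run in a hypothetical log table.
DECLARE @Start DATETIME2(3) = SYSUTCDATETIME();

-- ... bulk insert, batch hashing and merge run here ...

INSERT INTO dbo.LoadRunLog (RunDate, DurationMinutes)
VALUES (SYSUTCDATETIME(), DATEDIFF(MINUTE, @Start, SYSUTCDATETIME()));

-- Spot-check active requests during the batch update for blocking or long waits.
SELECT session_id, status, command, wait_type, wait_time, total_elapsed_time
FROM   sys.dm_exec_requests
WHERE  session_id > 50;             -- user sessions only
```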

Impact and Outcome
Performance Gain:
- Reduced load time from 7–8 hours to about 45 minutes (a roughly 90% reduction).
- Freed up critical resources for downstream processes.
Cost Optimization:
- Reduced reliance on KingswaySoft for CPU-intensive hashing, leveraging SQL Server’s native capabilities instead.
Scalability:
- The solution can handle larger datasets in the future with minimal changes, as SQL Server is better equipped for parallel operations than a record-by-record SSIS-based transformation.

Key Takeaways
- Minimize Transformation During Data Transfer: Focus on moving raw data to the database as quickly as possible, deferring transformations to where resources are optimized for such tasks.
- Leverage Database Strengths: SQL Server is designed to handle large-scale operations efficiently, making it well suited for batch hashing and updates.
- Reduce Overhead in ETL Tools: Offloading non-ETL-specific tasks such as hashing from tools like KingswaySoft can significantly improve performance.
- Iterative Optimization: By analyzing the bottlenecks (hashing in this case), tailored solutions can be developed to address specific issues without overhauling the entire pipeline.
This use case highlights how thoughtful analysis and leveraging the strengths of different components in the technology stack can dramatically improve data processing performance.

Ready to reduce your technology cost?