OLAP Expert Inside
Business Intelligence OLAP Business Intelligence OLAP
Product Configurator Product Configurator
Hospital Medical Record Order Entry Hospital Medical Record Order Entry
High-Traffic ASP.NET Web Farm High-Traffic ASP.NET Web Farm
Web Dashboard Web Dashboard
Large-scale ETL from Excel files Large-scale ETL from Excel files
Staff Scheduling Staff Scheduling
ID Scanning ID Scanning
Pharmacy Performance System Pharmacy Performance System
Real Estate Asset Management Real Estate Asset Management
Who We Are Who We Are
Contact Us Contact Us

The Measure of SuccessTM
 
WinMetrics Corporation Microsoft .NET Case Study: Import Excel/CSV Files via Xml/Xsl
Large-scale ETL from Excel and CSV Files
Our client is a software manufacturer that develops comprehensive ERP system. Their customers needed a facility that would extract, transform and load (ETL) data from many files from many locations. They asked us to develop this system because their existing system was far too slow and inaccurate. Here's what we had to do:
Input Variations:
Accepts both CSV and Excel input.
Accepts comma-delimited (CSV) import files and strips enclosing quotations, if present.
For CSV input, optionally ignores column headings in the first row.
Supports both Excel 2003 and 2007 file formats.
Accepts orders in 3 different Excel configurations: one order with multiple items per row on one sheet, one order with items in multiple rows on one sheet AND one order per row in header sheet with items in multiple rows in a detail sheet.
The mapping of input file columns to database table columns is controlled by user-defined import templates.
Imports user-defined variable fields for orders and order items.
User-defined generic recipient attributes.
User-defined enumerated values for Region, Format, Language recipient attributes.
Ignores blank lines in import files.
Windows.Forms GUI can process multiple files that share a common filename prefix with a single invocation.
Generates empty Excel import files that conform to a selected template to make it easy for new users to get started.
Validation and Database Update Features:
Imports sales orders, contact lists, recipients and shipping requests. Also exports contact lists.
Validates data the spans 35 tables across 3 databases, thereby translating user-friendly lookup table codes into their corresponding unique database IDs.
Updates 9 tables: Recipient, Address, CreditCard, Order, OrderItem, ContactList, ContactListMember, RequirementAttribute, RequirementAttributeValue
Optionally overwrites existing recipients.
Assigns unique BatchIDs.
Validates and encrypts credit card numbers.
Encrypts recipient passwords.
Processes valid orders and updates inventory.
Output Features:
Outputs detailed error logs to Excel that can be quickly edited and resubmitted. An Excel error log looks like the input file with the addition of a right-most “Errors” column.
Outputs success logs to Excel so that there is an audit trail of what worked.
Moves input files to Archive directory so that users know what has been processed.
Displays an import error log and opens the error log in Excel on demand.
Architectural Features:
Field validations, e.g. required and datatype, are Xsd schema driven.
If the host has a multi-core CPU, it will use multi-tasking to improve performance via the Parallel Extension to .NET Framework 3.5.
Packaged as a WCF service to support third party access and interfaces.
Includes comprehensive automated unit tests that can run unattended.
Integrates with log4net error logging.
Recipient import rules are consistent whether they are imported stand-alone, inside ContactLists or inside Sales Orders because all of the code is reused.
Used inheritance extensively for data validation classes. Xsd schema determined what classes were instantiated. Lookup table validations were lazy loaded can cached.
ETL System Architecture

We first tried to locate third party software that would do all of the above. But two strong requirements necessitated custom development:

1.
Each custom uses a different subset of input fiields. The mapping of fields to columns was also customer-specific. The templates that defined the mapping is stored in a database.
2.
There could be no runtime licenses or additional installation processes.
Template-driven Excel Import Architecture (click for full-scale image)
Template-driven Excel Import Architecture

The architecture that we planned appears above. We had customers with both Windows.Forms and Web interfaces. We wanted to isolate shared components behind WCF so that they could be maintained in one place. We developed and tested the WCF interface, but ended up bundling the assemblies without it. We came away with a few lessons learned:

We wanted to reuse Xml schema validations rather write our own. But, we were required to report errors by input row. So, we marked each element that corresponded to an input row with a row number attribute. We developed a bottom-up technique for validating XmlDocument sub-trees against the schema. When a validator found an error, we attached an error attribute to the root of the sub-tree being validated. We only validated the next higher level if all the lower-level validations passed. At the end of processing, we concatentated all of the errors below elements with row number attributes and diplayed them in the righmost column of the appropriate row of the Excel output file.
XmlDocument validations are CPU-intensive. Fortunately, they can be executed in parallel in a Parallel.For loop supported by the parallel extensions to .NET 3.5.
Excel import is done by opening the file on disk as though it were a database via a Jet database provider. When we were developing this application, there was no 64-bit Jet DB provider. Therefore, we planned to move the Excel-to-Xml translation to the other side of the interface so that the WCF processes could run on an x64 server.
We needed to call Microsoft XPath Extension Functions in Xslt for metadata.
The SqlXml bulk loader is packaged as a COM object that reads and writes files in its own single-threaded apartment. Therefore, we had to launch the SqlXml BulkLoader in its own STA thread when running under ASP.NET.
It is much faster to use the SqlXml BulkLoader to populate staging tables and then use sql to perform updates than to process UpdateGrams more directly.
VB.NET supports Xml as first-class literals with expression holes that make relatively short work of Xml conversion tasks. This feature earned VB.NET its own assembly in our otherwise C# solution.
WCF can't serialize XmlDocuments, just elements. So, we had to pass root elements instead of documents.
Our Performance Objectives Were Met

With excellent guidance from our client's development staff, WinMetrics developed over 22,000 lines of C# in less than 6 months elapsed time to hit the target:

Imports 10,000 new recipients, addresses and credit cards in less than 65 seconds on a quad-core PC in debug mode.
Imports 5,000 new orders with 5,000 new order items in just over one minute on a quad-core PC in debug mode.
"Carl Kelley is a thoroughbred professional who loves his job and is extremely good at it. I worked with Carl in re-engineering an ETL solution in our product suite. It was a complicated part of our Order Entry that dealt with multiple file types, user defined formats, high volumes and complex business rules. He speeded it up by 15 times while enhancing it to run with a web interface as well. Carl not only grasps the big picture pretty quickly, but dives deeper to find holes and mishaps waiting to happen. He exhibited an enormous talent for spotting things that could go wrong before they went wrong. He was then very persistent in coming up with elegant solutions to plug them. He is also a very keen software engineer, who was not satisfied with code that works once. He was very methodical about going back and re-engineering or refactoring code whenever he had to program new requirements. His work ethic was a treat to watch- working long hours; persistent and methodical. Carl also has an excellent sense of humor and a great communicator of ideas, concepts and issues. He also juggled between multiple projects, never once losing his élan no matter, as he would say, how hot the kitchen got. "
--   Srinivasan Venkataraman, Senior Developer

 
   
The Measure of Success
ASP.NET/C#
 
      Any Comments?   ©1997-2009 WinMetrics Corporation. All rights reserved. Terms of Use.