Tuesday, January 28, 2014

Instructions for Implementing a Microsoft PDW

SQL Server PDW uses Microsoft database software and pre-configured commodity hardware to provide MPP functionality at a lower price point than previously possible.
Follow a proven methodology; we recommend the Kimball Lifecycle.
1.    Understand business requirements so you can engage the business, prioritize your efforts, and deliver business value.
2.    Design the data warehouse data sets for flexibility, usability, and performance.
3.    Build and deliver quick, business process-based increments within an enterprise data framework known as the data warehouse bus matrix.
4.    Design and build a DW/BI system architecture based on your business requirements, data volumes, and IT systems environment.
5.    Build out the extract, transformation, and loading (ETL) system with standard components to deal with common design patterns found in the analytic data environment.
Provide the complete solution, including reports, query tools, applications, portals, documentation, training, and support
the Parallel Data Warehouse work a bit differently. Data is distributed across many independent query nodes across the server. Each of these nodes may hold a subset of the fact data, which may need to join to all of the dimensions. On SQL Server PDW, the standard approach is to replicate all of the dimensions out to each node so the node can perform local joins, thus preserving the physical dimensional model. However, in rare cases it may make sense to normalize and/or distribute very large dimensions on SQL Server PDW to save time in the replication process and to save space on each compute node.
The user data model would be dimensional for usability, and the physical data model would also be dimensional for simplicity and performance.
normalized version is easier to build if the source system is already normalized; but the dimensional version is easier to use and will generally perform better for analytic queries.
 Large organizations have always had to deal with big data, but the boundaries that define big data have expanded with the raw power of computers. The quantity of data you have to deal with only becomes a problem when it becomes difficult to work within your existing systems environment.
Companies such as Google and Yahoo! have helped pioneer new technologies such as MapReduce and Hadoop to help deal with the massive quantities of unstructured data they collect. At the same time, mainstream technologies for dealing with more structured big data sets, such as massively parallel processing (MPP) systems and column store databases, are experiencing significant growth.
Each CPU core can work with any section of memory or disk, and all memory and disk is available to each core. This all-in-one architecture is known as symmetric multi processing (SMP).
the CPU cores connect to the memory and disk via a system bus. This shared connection supports high speed inter-process communication, memory, and data sharing, and is easier to manage because it is a single physical unit.
The large-scale server power of a product like SQL Server PDW allows you to provide a true enterprise information resource: a single version of the truth without the extra time, resources, and maintenance required to copy data out to multiple data marts.
However, SMP systems have a limit when it comes to big data; they can scale only to the point where their system bus gets overloaded. Too many CPUs making simultaneous requests for data on the system bus creates a traffic jam. As usage grows, the system bus becomes a bottleneck and limits the total amount of processing that can take place on an SMP system. There are ways to mitigate this contention by creating localized subsets of CPUs and memory, but this only extends the limit.
One time-tested strategy for handling large amounts of data is to avoid this bottleneck by distributing data and processing across many servers, or nodes, each of which has its own memory and disk so they can share the workload. This approach, known as massively parallel processing (MPP), has been around for several decades and is the basis for many of the largest super computers in existence today. Due to their high cost and complexity, MPP systems have historically been used by the largest companies and governmental organizations.
This massively parallel architecture lies at the heart of Microsoft’s Parallel Data Warehouse system. Parallel Data Warehouse is a Microsoft SQL Server product designed to scale data warehouses from tens to hundreds of terabytes of data. It delivers the MPP architecture using an “appliance” model, providing preconfigured, optimized commodity hardware and software and a single point of support.
A user query request would come into the control node, which breaks the SQL into multiple parallel operations and distributes them out to the compute nodes where the actual data resides. A special module called Data Movement Services coordinates any needed data movement among nodes taking place between and handles any functions that need to be resolved centrally. When the compute nodes are finished, the control node handles post-processing and re-integration of results sets for delivery back to the users.
Each compute node is a separate SMP server running SQL Server. Compute nodes in current SQL Server PDW configurations ship with dual, hex-core CPUs, 96 GB of memory, and local tempdb workspace. They are connected together using dual InfiniBand network to support high-speed node-to-node data sharing for cross-node computations known as data shuffling. This network also connects the compute nodes to the control and administrative nodes to support high-speed data loading, extraction of query results, backup, and other administrative functions.
The disk subsystems for the compute nodes are managed by a storage area network (SAN) component with high-speed dual Fibre Channel connectivity. This data bus supports high-speed I/O, and failover redundancy. The compute nodes and disk drives are physically housed in the same rack, called a data rack.
There are three types of administrative service nodes that share the control rack with the control node. These include:
  • Management Nodes, which provide the DBA or data center operations interface to access and manage the overall solution and support the system’s internal network.
  • A Landing Zone Node, where cleansed data is staged and prepped before loading into the data warehouse.
  • A Backup Node and the appropriate associated storage. The Backup Node provides high-speed integrated backup at the database level. This is tied to the organization’s overall backup strategy and systems.
The SQL Server PDW is a large-scale enterprise class system and has built-in redundancies:
  • Primary data is stored as RAID1.
  • Hardware redundancy includes redundant power supplies, spare disks, compute nodes, control nodes, and management servers, mostly designed to support automatic failover.
SQL Server Parallel Data Warehouse is sold as a data warehouse appliance: a set of commodity hardware and Microsoft software pre-configured to meet the needs of a range of data sizes and performance. This makes sense because configuring the individual components, network and connectivity throughput, and disk subsystem performance is a significant effort, more than most IT shops would care to take on. With the appliance, all components and network connectivity are carefully designed, configured, and balanced for optimal performance, and necessary software on all nodes is pre-installed and pre-configured.
The MPP architecture can be scaled up by adding racks of compute nodes. The base system starts with one rack. On an HP appliance, for example, a full rack holds 10 nodes, and additional 10-node racks can be added up to a total of 40 nodes. The 40-node limit is more due to the definition of the product and not an inherent limit of the system design. SQL Server PDW uses its backup and restore facility to make expanding a SQL Server PDW is fairly straight forward: back up the database, add the new rack, reconfigure, and restore. The database restoration automatically redistributes the data across all nodes.
Microsoft is working with several hardware vendors to offer SQL Server PDW systems. HP is the first to market with a publicly available product at this writing.
The physical architecture of distributed nodes with local data means the large data sets have to be distributed across the nodes in a way that will support both data load and query processes. The goal is to get each node and CPU core working as hard as possible on every query. In the data warehouse, fact tables are distributed evenly across nodes so each node will have work to do.
Efficient processing on nodes results when local fact table subsets can join to local dimensions tables, which can be achieved if dimension tables are replicated to all nodes. SQL Server PDW allows you to specify distributed or replicated tables at time of creation, and then transparently manages placing the appropriate data on the appropriate compute nodes at load time.
Performance is platform-dependent. Indexes and aggregates are the two standard performance tools in the DW/BI system, and these vary widely across database product and platform. For example, in an SMP environment, it can take a long time to run a query that asks for total sales for the last five years. In this case, it makes sense to create aggregated tables once during the ETL process that can be used over and over to answer summary level queries. (Note that these aggregate tables need to be transparent to the user to maintain ease of use.)
However, the MPP environment offers a third performance tool: parallel processing. Distributing query tasks across multiple nodes may allow summary level queries to be answered on the fly. This greatly simplifies the design, tuning, and maintenance of indexes and aggregates. You can generally rely on the brute force power of the underlying parallel processing architecture for excellent query performance in the MPP environment.
If performance tuning or parallel processing isn’t enough, you may have to create separate subsets of the data warehouse and host them on downstream servers. These data marts may be departmental in focus; data is often limited to a few subject areas and summarized. (If the data marts contained atomic-level data from all business processes, you would be back where you started with the enterprise data warehouse.) From a performance point of view, the idea is to offload a subset of users and queries to a dedicated platform. This is a crude form of distributed processing, and is probably less effective than simply adding another rack to the SQL Server PDW machine where it could be used by the entire organization when needed.
There are times when this distributed strategy makes sense. Certain data may be useful or interesting only to a small analytic community. Other data may be sensitive and require strict physical access limitations. In some cases, the desire for a separate server is organizational; a certain department may insist on having its data on its own server. As we will describe in the implementation section, you may have existing data marts with extensive reports and applications built on them. In this case, it’s much easier to initially populate these downstream marts from the SQL Server PDW rather than rewrite the reports and applications to work directly from the SQL Server PDW.
In these cases, the SQL Server PDW can act as the central source of the distributed data warehouse. SQL Server PDW has a Remote Table Copy feature that will propagate tables to these downstream SQL Server systems at high speeds. The target systems need to be physically located close enough to the SQL Server PDW so they can connect to the InfiniBand network, since this is part of the speed component. If the downstream systems are designed based on Microsoft’s Fast Track architecture, the data transfer rates can be significant. The downstream systems can also be any data mart running SQL Server 2008 or above. This includes the new HP Business Data Warehouse, optimized for SQL Server 2008 R2, and the HP Business Data Warehouse, a BI appliance also optimized for SQL Server 2008 R2.
The SQL Server PDW system must live in a data center and involves at least two racks, so you should do some planning with your server management group before the truck shows up on installation day. Since it uses InfiniBand, any other servers you want to benefit from fast data transfer functions will need InfiniBand connections and to be located close enough to the SQL Server PDW server to meet any cable limitations.
Vendor installation is usually part of the purchase and takes a few days depending on what issues show up.
If you are converting an existing SMP SQL Server database to SQL Server PDW, you can use a tool the Microsoft PDW team has built to help
One big advantage of the SQL Server PDW system from the DBA’s perspective is the simplification it brings to physical data management. The physical location of data, including filegroups, disk layout, LUNs, and tempdb location, is all handled automatically as part of the core SQL Server PDW system.
There is one high-level physical decision to be made when moving to a massively parallel environment: how the tables should be split up across the nodes. There are two primary ways to physically instantiate tables in SQL Server PDW: replicated or distributed. The CREATE TABLE DDL includes a distribution clause where this is specified.
A replicated table looks like a single table to anyone who accesses SQL Server PDW, but it is actually replicated out to all compute nodes on the server. That is, there is one copy of the table on each node.
The purpose of replicating tables is to improve performance by having local copies of data on each node to support local joins. Replicated tables are generally used for dimensions and lookup tables to support local joins to the fact tables.
The replicated tables are managed by the system transparently. From the DBA’s perspective, the CREATE TABLE syntax is pretty simple:
     CREATE TABLE Customer (
           CustomerKey int NOT NULL,
           Name varchar(50),
           ZipCode varchar(10))

The default is REPLICATE if the distribution clause is omitted.
The rows of a distributed table are spread across all nodes as evenly as possible. Each row is written out to a distribution which is a storage location on a node. There are eight distributions on each compute node, each with its own disks. In other words, no copies are made; each row in the source table ends up in only one distribution on one compute node. The rows are mapped to the distributions using a hash function on a column from the table.
The goal of distribution is to improve performance by maximizing parallel processing. Fact tables are usually the largest tables in the data warehouse, and are usually distributed.
Here is the DDL for the distributed table shown in Figure 9:
     CREATE TABLE SalesFact (
           DateKey INT NOT NULL,
           CustomerKey INT,
           DollarAmount MONEY)
           (DISTRIBUTION = HASH(CustomerKey));

The choice of the distribution column is key, so to speak. If a few customers accounted for a large percentage of sales, using Customer Key would lead to an imbalance in the data distribution. One or two distributions would end up with a larger percentage of the data. This imbalance is called data skew. One or a few distributions with 10% more rows than average may cause problems, and a difference of greater than 30% will lead to poor performance. This makes sense because each query has to wait for all nodes to complete, and any node with significantly more data will take longer than the others when processing queries involving skewed data.
The primary criteria for selecting a good column for distribution are high cardinality and even row counts. There are other considerations for choosing the distribution column. For example, it’s not a good idea to choose a column that is often constrained to a single value in user queries. If users typically constrain on a single day, then the DateKey column is not a good candidate because all the rows for that day will end up in a single distribution. Other factors come into play when selecting a distribution key, such as distributing multiple fact tables that may need to be joined together to support certain analytics.
There are a few additional design decisions to make in defining the data warehouse tables. There are typically far fewer indexes on an MPP system because they are not needed. Do use clustered indexes where it makes sense. In most cases, this means creating a clustered index on the surrogate key of the dimension tables, and on the same column used for partitioning the fact tables. Use non-clustered indexes with care. In many cases, they are not needed because of the parallel processing speed, and they add maintenance, slow the load process, and take up space.
Fact tables may be partitioned for the same reasons you would partition on an SMP system, such as rolling window management or load isolation that uses a SWITCH operation. Partitioning is conceptually simpler in SQL Server PDW because it is fully specified as part of the table creation DDL rather than through a separate partition function and scheme.
Once you have tables defined in SQL Server PDW, the next step is to load data into them. The initial data transfers will most likely use scripts to bulk copy the existing data warehouse history into the SQL Server PDW. Moving forward, if you were using SQL Server Integration Services, your ETL system should function essentially the same with SQL Server PDW as it did with your prior data warehouse. For example, SQL Server PDW has its own source and destination connections you will use in your Integration Services packages. However, there are a few product differences that will impact your ETL system.
The IDENTITY property of an integer field is not supported in SQL Server PDW. This makes sense when you realize rows in a distributed table will be inserted across many separate nodes. The cost of keeping track of incremental identity assignments across multiple nodes in a parallel process would dramatically slow any insert process. If you were using the IDENTITY property to assign surrogate keys to your dimensions, you will need to manage this either in your ETL process by keeping surrogate key values in a table and assigning them incrementally, or in the INSERT statement by using the ROW NUMBER ACROSS function.
If you use Integration Services Lookup transformations in your existing ETL packages, make sure you select Full cache in the Cache mode section when querying SQL Server PDW, which pre-populates the lookup cache. Using the Lookup transformation to perform a non-cached SELECT operation against incoming Integration Services pipeline rows is inefficient with SQL Server PDW.
The SQL Server PDW system has a separate staging server as part of the control rack called the Landing Zone. Incoming data from the Integration Services connections or the SQL Server PDW bulk loader (DWLoader.exe) flow through the Landing Zone prior to being distributed to the compute nodes for permanent storage. The Landing Zone quickly reads incoming rows from files or Integration Services and sends them off to compute nodes in a round-robin fashion using a module called the Data Movement System (DMS) which, not surprisingly, handles data movement around the system. On each compute node, a DMS instance will hash the rows and send them back out to the DMS instance of the node to which they map. This receiver DMS inserts the row into a staging table where any sorting and indexing takes place. The final step uses SELECT INTO to copy the data from the staging table to the target table. All this happens behind the scenes and is managed by the system.
This whole flow keeps data loading in a highly parallel fashion and minimizes any processing work actually performed on the Landing Zone.
One benefit of parallel processing is the load process can run while users are querying the data. The loader processes get lower priority, so they have little impact on user queries. This means you can process yesterday’s load without having to limit user access. It also means you could do near-real time data loads to give access to current data where it’s needed.
SQL Server PDW has its own variant of SQL with extensions to support parallel processing. Some functions in the SMP SQL Server product have not been implemented in SQL Server PDW. Some of these were omitted because they are functions that do not translate well into a parallel environment. For example, the IDENTITY property is not supported as described in the ETL section.
Transact-SQL compatibility with SQL Server SMP is not yet fully complete, and Microsoft continues to add functionality through frequent updates. You will want to test any existing scripts or stored procedures that are part of your current operations against the latest functionality provided by SQL Server PDW.
 From an ETL processing perspective, SQL Server PDW can act as a large-scale ETL engine to manage the bulk transformation of big data sets. SQL Server PDW can also support near real-time data warehousing, which is critical for certain analytics.
If you need to integrate with existing systems, SQL Server PDW can help. Remote Table Copy is a high-speed table copying function that can transfer tables from the SQL Server PDW to SQL Server running on a locally connected SMP server. Data transfer rates can be as fast as 400 GB per hour. Once the data is in the target SQL Server machine, you would complete the ETL process to properly integrate it into the database with appropriate indexes, partitioning, and any other required constraints.
Organizations dealing with particularly large data sets and operating with narrow load windows may not have time to use a separate ETL system to process the data before loading it into the SQL Server PDW. In these cases, SQL Server PDW can serve as a large-scale transformation engine as part of an overall EDW architecture. This approach generally involves loading the data directly into tables in the SQL Server PDW database, and then performing ETL lookups as INSERT-SELECT operations joining staging tables to dimension tables to lookup surrogate keys in bulk. This approach applies the full power of the parallel environment to the core ETL processes.
While most of the analytic data in the data warehouse does not need to be loaded on a less-than-24-hour basis, some business opportunities require more frequent data loads. SQL Server PDW’s parallel load process supports “near real time loading” under the Read Uncommitted isolation level (Dirty Reads). Loads can be run while users query tables and these data loads have a low impact on the overall performance of concurrently-running queries.
SQL Server Parallel Data Warehouse offers a viable platform for supporting large-scale data warehouses into the hundreds of terabytes. The appliance nature of the system makes it relatively easy to configure, install, tune, manage, and expand. SQL Server PDW provides parallel processing of queries against dimensional models on atomic data to address the Kimball approach’s goals of query performance, usability, and flexibility on an enterprise information resource.

No comments:

Post a Comment