Abstracts

Opening Keynote

The Next Generation of HPC: A Seismic Shift

Peter Ungaro
Cray Inc.

From oil and gas exploration to predicting extreme weather events, science today relies on HPC. But keeping ahead of the exploding demand for computational power requires renewed innovation, especially given the power and programmability challenges of future processors. Developing an exascale – not just exaflops – solution that is meaningful to the petroleum industry is leading to significant changes in system architecture and a more holistic approach to supercomputing. This presentation will provide a perspective on the state of supercomputing and the impact it will have on the challenges facing petroleum companies today and into the future.

Morning Plenary talks

IWAVE: a framework for regular grid finite difference modeling

William Symes
Rice University

Finite difference or finite element modeling methods that use explicit time-stepping on regular spatial grids share significant structural features. Some (dynamic) field updates depend on other (dynamic and static) fields, in the same way at every gridpoint (the stencil); these dependencies implicitly define the sequence of actions that take place within a simulation, including exchange of field subarrays in parallel domain decomposition. The details of the field interactions depend on a small amount of data, and are realized in the same way for every simulator of this type.
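
As a concrete illustration of this commonality, consider the following minimal sketch (ours, in the spirit of such codes, not IWAVE source): a second-order acoustic leapfrog update that applies the same few-point stencil at every interior gridpoint. On entry p1 holds the previous time level, and on return it holds the next one; v is a static material field (squared velocity), and all names are illustrative.

    /* Minimal sketch (not IWAVE source): a 2nd-order acoustic leapfrog
       update applied identically at every interior gridpoint. */
    void step(int nx, int ny, float dt, float h,
              const float *p0, float *p1, const float *v)
    {
        float c = (dt * dt) / (h * h);
        for (int j = 1; j < ny - 1; j++) {
            for (int i = 1; i < nx - 1; i++) {
                int k = j * nx + i;
                float lap = p0[k - 1] + p0[k + 1] + p0[k - nx] + p0[k + nx]
                          - 4.0f * p0[k];                  /* the stencil */
                p1[k] = 2.0f * p0[k] - p1[k] + c * v[k] * lap;
            }
        }
    }

Everything outside the loop body – index bookkeeping, halo handling (here, simply skipping boundary points), parallel exchange – is the same for any simulator of this class, which is exactly the structure IWAVE factors out.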

The IWAVE package evolved from its beginnings as a QC component of the SEAM project into an attempt to take systematic advantage of this structural commonality. IWAVE is written entirely in ISO C99, and is organized around a set of data structures encapsulating the common features of regular grid simulation, and associated sets of functions that manipulate these data structures. Key data structures include (1) arrays and virtual subarrays, all referenced to a common global coordinate system; (2) domains, or lists of arrays, which encapsulate all dynamic and static fields defining a physical model; (3) standard lists of functions, defining the relations between grids and time step update formulas, implementation of which suffices to completely define a simulator. In addition, IWAVE includes auxiliary packages centered around suitable structures for storage and I/O of gridded and time series data, associated with standard disk data formats.
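
The flavor of these structures can be suggested in a few lines of C99; the names below are our own illustration, not IWAVE's actual API:

    /* Hypothetical sketch of the data structures described above. */
    typedef struct {
        int    ndim;           /* spatial dimension                          */
        int    gs[3], ge[3];   /* global start/end indices of a (sub)array   */
        float *data;           /* storage, referenced to global coordinates  */
    } GridArray;

    typedef struct {
        int        narr;       /* all dynamic and static fields of a model   */
        GridArray *arrs;
    } Domain;

    /* Implementing a standard list of functions like this suffices to
       define a complete simulator for one physics/scheme combination. */
    typedef struct {
        int (*set_grids)(Domain *d, const void *pars); /* relate field grids */
        int (*timestep)(Domain *d, int ifield, int it);/* update one field   */
        int (*exchange)(Domain *d, int ifield);        /* subarray exchange  */
    } SimFns;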

I will describe the concepts underlying IWAVE and some of its key features, and illustrate its capabilities via acoustic and elastic modeling applications. I’ll also explain how the structure of IWAVE permits extension beyond modeling, to migration and inversion applications, and present some examples. Finally, I will mention possible extensions beyond regularly gridded modeling.

The primary designer of IWAVE was Igor Terentyev; the central concepts of the design were explained in Igor’s 2009 MA thesis. Other contributors include Tetyana Vdovina, Xin Wang, Dong Sun, Max R. Deschantsreiter (Performance Jones LLC), and the author of this presentation. Development was supported by the SEG’s SEAM project, the National Science Foundation, and the sponsors of The Rice Inversion Project.

Data Center environmental monitoring – efficiencies and future challenges

Farhad Banisadr and Richard Rivera
LANL

Today’s data centers require more power, cooling, and infrastructure enhancements as the demand for computing cycles increases. Combined with advanced control strategies, environmental monitoring and energy metering have become the basis for increasing the efficiencies and reducing the energy costs of our data centers. CFD modeling, power monitoring and wireless technology have provided the tools that are required to reduce the PUE (Power Usage Effectiveness) and optimize the operation of our HPC facilities. Data center managers have been able to reduce the PUE by performing environmental data analysis, executing proper engineering measures toward optimization of facilities, and achieving increased efficiencies.

Current challenges include developing future design strategies for providing power and liquid cooling to exascale computers that can operate with elevated water temperatures at the server racks, while eliminating or minimizing the use of vapor-compression refrigeration systems. Overcoming these challenges will take a combined effort of IT manufacturers, consulting engineers, and data center managers.

Parallel session A: Benchmarking, optimization & performance

Performance Model for 3D Stencil Computation

Mauricio Araya-Polo1 and Raul de la Cruz2
1Repsol USA and 2Barcelona Supercomputing Center

Stencil computations are the core (and most computationally demanding segment) of many scientific computing applications, in particular those based on the finite difference method; reducing their overall execution time is therefore a must. Unfortunately, stencil source code optimization is not a simple task: it is lengthy and tedious. It is lengthy because of the large number of combinations of stencil optimizations to test, which may consume days of computing time, and tedious because of the many slightly different versions of code to implement. Alternatively, models that predict performance can be built without any actual stencil execution, thus reducing the cumbersome optimization task. Previous works have proposed cache miss and execution time models for specific stencil optimizations. Furthermore, most of them were designed for 2D datasets or for stencil sizes that only suit low-order numerical schemes. We propose a flexible and accurate model that covers a wide range of stencil sizes, up to high-order schemes, and captures the behavior of 3D stencil computations using platform parameters. The model has been tested on a group of representative hardware architectures, using realistic dataset sizes. Our model successfully predicts stencil execution times and cache misses. Prediction accuracy depends on the platform; for instance, on x86 architectures prediction errors range between 1% and 20%. We believe that the model is reliable and can help speed up the stencil computation optimization process. To that end, other stencil optimization techniques can be added to the model, essentially providing a framework.
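
For reference, the kind of kernel such a model must capture is sketched below (our illustration, not the authors' code). The stencil half-length R (order 2R in space) and the dataset size n are exactly the knobs that, combined with platform parameters, drive the predicted cache misses and execution time:

    /* Illustrative 3D stencil of half-length R applied to an n^3 grid. */
    void stencil3d(int n, int R, const float *c,   /* c[0..R] coefficients */
                   const float *u, float *v)
    {
        int s1 = 1, s2 = n, s3 = n * n;            /* unit/row/plane strides */
        for (int k = R; k < n - R; k++)
            for (int j = R; j < n - R; j++)
                for (int i = R; i < n - R; i++) {
                    int p = k * s3 + j * s2 + i;
                    float acc = c[0] * u[p];
                    for (int r = 1; r <= R; r++)   /* 6R neighbors in 3D */
                        acc += c[r] * (u[p - r * s1] + u[p + r * s1]
                                     + u[p - r * s2] + u[p + r * s2]
                                     + u[p - r * s3] + u[p + r * s3]);
                    v[p] = acc;
                }
    }

The plane-stride accesses (u[p ± r*s3]) are the ones whose reuse depends on cache capacity, which is one reason prediction accuracy varies across platforms.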

Characterization and modeling of finite difference stencils performance

Philippe Thierry, Leo Borges and Jamel Tayeb
Intel Corp

Thanks to the growth of computational capabilities in the last decade, finite difference (FD) based methods have reappeared in 3D seismic to simulate the full two-way wave equation, together with 3D reverse time migration (RTM) and full waveform inversion (FWI). Given its combined demand for memory bandwidth and floating point operations within a very limited region of source code, the optimization of FD kernels is still a key concern for geophysicists and computer scientists. Since both the physical description of the problem and the computing device are changing rapidly, it is critical to analyze current performance and extrapolate it to future hardware and implementations. The main challenge of performance modeling is to include all scales – from the micro-architecture simulator to the socket and cluster levels – and also to consider the huge space of possible implementations. In this paper we present multiple approaches to characterizing performance and power consumption in terms of the theoretically achievable values of a given piece of hardware. The characterization addresses several implementations of finite difference stencils in terms of stencil lengths and wave equation approximations. The model can help prioritize the optimization work. Together with power consumption measurements, it can underline the impact of an implementation and guide the choice of an optimal hardware configuration. In the longer term, the characterization can help better design future hardware specifications by aligning the specs to the right trend. The model shows that the more complex the wave equation approximation, the more efficient the implementation can be, or the less impactful the I/O. Such sometimes non-intuitive statements can be established for the coming years, since we already know the near-future complexity increase of the wave equation approximation (from acoustic to TTI, elastic, and visco-elastic).
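
One standard way to express "theoretically achievable values of a given piece of hardware" is a roofline-style bound: attainable performance is the lesser of machine peak and memory bandwidth times the kernel's arithmetic intensity. The sketch below is our illustration of that idea, not necessarily the authors' exact model:

    /* Roofline-style bound (our illustration, not the authors' model).
       A high-order stencil doing a few dozen flops per ~8 bytes of
       compulsory traffic is typically bandwidth-bound on x86 sockets. */
    double attainable_gflops(double peak_gflops,    /* machine peak       */
                             double bandwidth_gbs,  /* sustainable B/W    */
                             double flops_per_byte) /* kernel intensity   */
    {
        double mem_bound = bandwidth_gbs * flops_per_byte;
        return mem_bound < peak_gflops ? mem_bound : peak_gflops;
    }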

Advances in HPC for Oil and Gas: The Intel MIC Architecture

Lars Koesterke and Yaakoub El-Khamra
The University of Texas at Austin

On Sept. 22nd the Texas Advanced Computing Center (TACC) announced (www.tacc.utexas.edu/news/press-releases/2011/stampede) the deployment of a 10 PFlops system (Stampede) in January 2013 (www.tacc.utexas.edu/stampede). 80% of its performance will come from Intel’s new Many Integrated Core (MIC) technology. These MIC co-processors are designed to process highly parallel workloads, as frequently found in Oil & Gas applications, and provide the benefit of using the x86 instruction set. MIC co-processors are conveniently programmed in familiar languages (C/C++/Fortran) and parallel programming paradigms (OpenMP/MPI). Application development and porting efforts have already begun at TACC.

Porting code to a new architecture is always a challenge, but the degree of difficulty and level of effort may vary widely. Thanks to the familiar x86-based programming environment, several codes have already been ported to MIC by TACC staff with relative ease. Here we present our efforts to port a reservoir simulator and give the audience a hands-on, relevant demonstration of large heterogeneous systems with MIC co-processors, which have the potential to make exascale computing in Oil/Gas a reality.

We are developing and porting a parallel reservoir simulator based on the black oil model with the Parallel Extensible Toolkit for Scientific Computing (PETSc) libraries. This simulator uses the black oil model with implicit pressure and explicit saturation treatment. PETSc is a suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations. PETSc contains scalable parallel preconditioners, Krylov subspace solvers, parallel Newton-based nonlinear solvers, and parallel time-stepping (ODE) solvers, as well as parallel linear algebra routines and data structures. Since PETSc solvers and data structures use the standard Message Passing Interface (MPI) for parallel computing, allowing them to scale well on thousands of processors, PETSc was a logical choice as the underlying infrastructure for a high performance reservoir simulator.
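
A minimal PETSc solve of this kind is sketched below (our illustration, not the simulator's code; exact call signatures vary slightly across PETSc versions):

    #include <petscksp.h>

    int main(int argc, char **argv)
    {
        Mat A; Vec x, b; KSP ksp;
        PetscInt i, lo, hi, n = 100;

        PetscInitialize(&argc, &argv, NULL, NULL);
        MatCreate(PETSC_COMM_WORLD, &A);
        MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
        MatSetFromOptions(A);
        MatSetUp(A);
        MatGetOwnershipRange(A, &lo, &hi);  /* each MPI rank owns a row block */
        for (i = lo; i < hi; i++) {         /* toy 1D Laplacian as a stand-in */
            MatSetValue(A, i, i, 2.0, INSERT_VALUES);
            if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
            if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
        }
        MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
        MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
        MatCreateVecs(A, &x, &b);
        VecSet(b, 1.0);

        KSPCreate(PETSC_COMM_WORLD, &ksp);
        KSPSetOperators(ksp, A, A);
        KSPSetFromOptions(ksp);    /* e.g. -ksp_type gmres -pc_type bjacobi */
        KSPSolve(ksp, b, x);       /* Krylov solve, scales over MPI ranks   */

        KSPDestroy(&ksp); VecDestroy(&x); VecDestroy(&b); MatDestroy(&A);
        PetscFinalize();
        return 0;
    }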

Early Experiences In Porting IWAVE to OpenCL on the Fusion APU

Ted Barragy1 and William Symes2
1AMD and 2Rice University

This talk describes early experiences porting IWAVE, an open source seismic modeling package developed by The Rice Inversion Project, to OpenCL and implementing it on a Fusion APU. This is a work in progress; as such, the results on the GPU side are currently incomplete.

The presentation begins with an overview of IWAVE, including physical models, numerical algorithms, basic data structures, input dataset formats (SEGY), and the current parallel implementation(s). An overview of the APU architecture (x86 CPU, VLIW4 GPU, and interconnect) is given next. The third section ties the first two together by providing details on the port of a finite difference kernel to OpenCL.

These details demonstrate several points. First, much of the complexity of OpenCL can be hidden in a ‘utility’ library. Second, few code modifications are needed, thanks to the IWAVE architecture. Third, existing OpenMP declarations are leveraged as a programmer’s ‘template’ to ease the burden of the port. Fourth, the implementation demonstrates a partitioning of seismic computations into two parts: the regular grid part is executed under OpenCL on the GPU, while the rest of the computation, such as boundary conditions and MPI parallelism, is executed under C on the CPU. This partitioning demonstrates some of the unique advantages expected of the ‘fused’ APU compared to a CPU plus a discrete GPU. Once common pointers are available across CPU and GPU, explicit data movement is eliminated. Smaller data elements (grid boundaries) are accessed without penalty. Finally, there is hope of software investment protection and excellent GFLOP/$ given that the APU is a commodity part.
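
The host-side flavor of such a partitioning might look as follows (our sketch, not the actual IWAVE port; the context, queue, and compiled kernel are assumed to have been created by the ‘utility’ library mentioned above):

    #include <CL/cl.h>

    /* Run the regular-grid part on the GPU via a zero-copy buffer; the CPU
       then applies boundary conditions and MPI exchanges on the same memory. */
    cl_int run_stencil(cl_context ctx, cl_command_queue q, cl_kernel k,
                       float *host_grid, size_t nx, size_t ny, size_t nz)
    {
        cl_int err;
        cl_mem grid = clCreateBuffer(ctx,
                                     CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                     nx * ny * nz * sizeof(float),
                                     host_grid, &err);
        if (err != CL_SUCCESS) return err;
        err = clSetKernelArg(k, 0, sizeof(cl_mem), &grid);
        size_t gsz[3] = { nx, ny, nz };
        if (err == CL_SUCCESS)
            err = clEnqueueNDRangeKernel(q, k, 3, NULL, gsz, NULL,
                                         0, NULL, NULL);
        if (err == CL_SUCCESS)
            err = clFinish(q);          /* hand the grid back to CPU code */
        clReleaseMemObject(grid);
        return err;
    }

On an APU with a unified address space, CL_MEM_USE_HOST_PTR can avoid a copy entirely; on a discrete GPU the same code still works but implies transfers.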

The talk concludes with some initial performance results for the kernel under OpenCL and estimates of performance once the runtime supports the full Fusion System Architecture (e.g. common pointers).

High Performance Compute Platform based on multi-core DSP for Seismic modeling and imaging

Murtaza Ali1, William Symes2, Xiaohui Li1, Eric Stotzer1 and Jan Odegard2
1Texas Instruments and 2Rice University

High performance compute platforms have become an essential tool in the oil and gas industry for seismic processing, including modeling and imaging. In recent years, we have seen various compute acceleration platforms being used to speed up the computations. Among the various processors that can be used for accelerating these computations, Texas Instruments’ (TI) C66x KeyStone-based multicore DSPs offer the highest compute performance per watt, at about 16 GFLOPS/W. In addition, these devices are programmed through standard programming languages like C and OpenMP. The TMS320C6678 device has eight C66 cores running at up to 1.25 GHz, providing a peak performance of 160 GFLOPS per device. An octal 16-lane PCI Express board with eight such devices provides over a teraflop of performance and is an ideal compute platform for the seismic processing common in the oil and gas industry. In this paper, we use the seismic modeling and inversion/imaging package IWAVE++ from Rice University to benchmark the performance of this multicore DSP. The main computational kernel is an acoustic modeling technique based on the solution of a partial differential equation using the commonly used finite difference technique. This kernel is optimized to match the C66 core. The distribution of processing across cores within a device uses OpenMP, due to its shared memory architecture. The distribution of processing across devices is done using the industry standard Message Passing Interface (MPI) protocol, applicable to distributed memory architectures. In this paper, we will show the results of benchmarks of the kernel and of the distribution of processes across parallel cores and devices on the board.
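
The two-level distribution described here follows a familiar hybrid pattern, sketched below on a toy 1D wavefield (our illustration, not TI's code): MPI ranks stand in for DSP devices, and OpenMP threads for the eight C66 cores sharing memory within one device.

    #include <mpi.h>

    /* u has nloc interior points plus ghost cells u[0] and u[nloc+1]. */
    void step(float *u, float *unew, int nloc, MPI_Comm comm)
    {
        int rank, np;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &np);
        int left  = rank > 0      ? rank - 1 : MPI_PROC_NULL;
        int right = rank < np - 1 ? rank + 1 : MPI_PROC_NULL;

        /* distributed memory: halo exchange between devices */
        MPI_Sendrecv(&u[1],        1, MPI_FLOAT, left,  0,
                     &u[nloc + 1], 1, MPI_FLOAT, right, 0,
                     comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[nloc],     1, MPI_FLOAT, right, 1,
                     &u[0],        1, MPI_FLOAT, left,  1,
                     comm, MPI_STATUS_IGNORE);

        /* shared memory: update split across the cores of one device */
        #pragma omp parallel for
        for (int i = 1; i <= nloc; i++)
            unew[i] = 0.5f * (u[i - 1] + u[i + 1]);
    }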

Parallel session B: Programming models, languages and tools

Loo.py: A Loop Generation Tool for CPUs and GPUs

Andreas Kloeckner1 and Tim Warburton2
1New York University and 2Rice University

Performance-aware implementations of even simple mathematical concepts typically encounter an explosion in code size and complexity. Much of this complexity is artificial, as the ideas driving the implementation are often taken from a rather limited set of tricks. Encoding these tricks in C is tedious, error-prone and leads to redundant, unmaintainable results. This is a failure of the currently available tools. While progress in hardware has been abundant, the situation with tools has been called a “train wreck” (Tim Mattson at a previous edition of this workshop).

Loo.py is our attempt at fighting complexity growth. By extracting just the tuning ideas and leaving their implementation to the machine, visible redundancy is greatly reduced. Our approach is based on program transformation. A user starts with a semi-mathematical algorithm description and then issues transformation commands that gradually make the code suitable for a target machine. The transformations uniquely specify the generated program, which is output in human-readable C. This puts the user in charge of the optimization, eliminates guesswork, and aids debugging. It also differentiates the approach from directive-based compilation, where the compiler has the final say and the result cannot easily be inspected. Further, our transformations are driven from Python, a high-level language, making it practical for user code to apply transformations based on hardware and operational considerations at run time.

I will demonstrate the use of Loo.py and its effectiveness for PDE and linear algebra problems, including ones relevant to oil and gas.

This work is based on an ongoing effort to supply high-level language tools for performance programming, realized in the PyOpenCL and PyCUDA packages. These packages are having broad impact across academia and industry, with 40,000+ downloads in the last 12 months. I will briefly cover the use of these packages, and I will show how they enable tools like Loo.py.

ASIODS: An Application and I/O Collaboration System

Maxime Hugues1, Mike Moretti2, and Serge Petiton3
1INRIA, France, 2DataDirect Networks and 3LIFL, France

With the 10 Pflops barrier broken by the “K” computer, the post-petascale era is only a few steps away and exascale machines are on the way. This continuing and insatiable appetite for computational power comes from the need to accelerate applications and to solve larger, more complex problems with higher accuracy. However, building and operating those machines challenge computer scientists in many respects, such as fault tolerance, energy consumption, programming, and network and storage system performance. Focusing on storage, trends suggest that more than a hundred petabytes of capacity and 20 TB/s of bandwidth will be necessary for applications that process and generate large sets of data.

However, storage performance is an ongoing problem for seismic imaging applications, which generate and rely on many terabytes of data. Commonly, storage systems use caching techniques to improve performance: data are kept in RAM, which avoids disk accesses and gives faster transfers. Nevertheless, this technique cannot optimize irregular I/O accesses. For that case, some applications anticipate data reads by prefetching. Users mostly handle those operations by hand, because the file system and the application operate separately. On massively parallel computers, some researchers have obtained good performance by delegating I/O operations, to avoid disk contention and segregate computation from I/O.

Considering the coming of massively parallel computers and massive I/O, we propose a system that combines the advantages of I/O delegation and user expertise. The expertise describes task dependencies and their related I/O through a graph, which enables cooperation between the application and the I/O system. This has led to the design of an asynchronous and smart I/O delegation system, named ASIODS.

This talk presents the main idea and design of ASIODS. First evaluation results are also presented and demonstrate a potential gain of over 10% in execution time.

*This work was started while M. Hugues was at LIFL.

Implementation and Evaluation of Coarray Fortran

Deepak Eachempati1, Debjyote Majumder1, Barbara Chapman1, Terrence Liao2 and Henri Calandra2
1University of Houston and 2Total

Compiler technology needs to evolve to support parallel programming models for large-scale distributed systems. Global address space (GAS) models are attractive because they provide a familiar programming model. However, there are severe challenges in getting them to run well at scale. Partitioned Global Address Space (PGAS) models enhance GAS by exposing processor-memory affinity, which is critical for implementing GAS models efficiently on distributed architectures. Coarray Fortran (CAF) is a PGAS Fortran extension which has been incorporated into the Fortran 2008 standard. It enables parallel programming in Fortran with minimal changes to the language syntax. In a joint project between UH and Total, we have investigated CAF as a viable programming model for production Oil and Gas applications. We created an open-source CAF implementation within the OpenUH compiler. In contrast to other open-source implementation efforts, our approach is to use a single, unified compiler infrastructure to translate, optimize, and generate binaries from CAF codes. We have also developed a runtime system supporting one-sided communication between executing processes. Experiments using micro-benchmarks and finite difference wave equation solvers from Total show that our implementation achieves better communication performance than MPI. However, CAF lacks many of the features provided by MPI, such as non-blocking communication. Since remote communication is a major performance bottleneck on distributed memory systems, the implementation is responsible for hiding latency by reducing communication or overlapping it with computation. We have implemented optimizations in the CAF runtime to address this. A get-cache is used to reduce the number of remote reads, and non-blocking prefetching is used to increase communication-computation overlap. To improve remote write performance, we make all remote writes automatically non-blocking. We show that, used together, the get-cache and non-blocking put optimizations can provide a 50% performance improvement over the baseline implementation. This work is partially sponsored by Total.
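
The get-cache idea can be sketched in a few lines of C (our illustration, not the UH runtime; remote_get is a hypothetical stand-in for the runtime's one-sided read, e.g. over GASNet or ARMCI):

    #include <stdlib.h>

    /* Hypothetical stand-in for the runtime's one-sided get. */
    extern void remote_get(void *dst, int image, size_t addr, size_t len);

    typedef struct { int image; size_t addr, len; char *copy; } cline;

    /* Serve a remote read from a small software cache when possible;
       on a miss, evict round-robin and fill with one one-sided get. */
    const char *cached_get(cline *c, int n, int *next,
                           int image, size_t addr, size_t len)
    {
        for (int i = 0; i < n; i++)
            if (c[i].copy && c[i].image == image &&
                c[i].addr <= addr && addr + len <= c[i].addr + c[i].len)
                return c[i].copy + (addr - c[i].addr);     /* hit */

        cline *l = &c[*next];
        *next = (*next + 1) % n;
        free(l->copy);
        l->copy  = malloc(len);
        l->image = image; l->addr = addr; l->len = len;
        remote_get(l->copy, image, addr, len);             /* miss */
        return l->copy;
    }

A real runtime must also invalidate such a cache at synchronization points so that writes by other images remain visible.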

Challenges in Application of HPC for Reservoir Simulation

Bret Beckner, Vadim Dyadechko and Serguei Maliassov
ExxonMobil Upstream Research Company

The performance of reservoir simulation codes has historically been driven by advances in CPU clock speeds and the increased memory of the computers on which these codes execute. Improvements in computing power have allowed reservoir simulators to show performance increases with no significant changes to the underlying code. These performance increases have disappeared due to the stagnation in the growth of CPU clock speeds. In their place a new method for increasing computing power, placing more computing units on a chip, has emerged. This shift from serial to parallel hardware architectures requires new software approaches to fully exploit the new chip designs. New programming models for reservoir simulation are required to harvest parallelism in a general, scalable, and architecture-independent fashion. We investigated parallel performance for reservoir simulation on a high performance Linux cluster. A deliberate strategy of performance-based development of the major types of computations encountered in reservoir simulation programs was employed. It was found that even though most of those operations are memory-bandwidth bound, it is possible with careful implementation to get excellent utilization of the available compute power. On most of the linear algebra operations, for a fixed-size problem, it was possible to obtain super-linear speedup with increasing processor counts. To further explore the parallel performance of reservoir simulation algorithms, an experimental parallel reservoir simulator prototype was created and tested extensively on ExxonMobil’s HPC cluster. It was found that under some conditions reservoir simulation computations can also experience super-linear speedup. These experiments demonstrate that with careful implementation it may be possible to run reservoir simulation on HPC clusters significantly faster than on single-CPU computers. The application of reservoir simulation on these very powerful computing systems could provide in excess of a 100x speed increase for large reservoir simulation models.

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit

John Mellor-Crummey
Computer Science, Rice University

HPCToolkit is an integrated suite of multiplatform tools for measuring and analyzing the performance of parallel programs on scalable parallel systems. HPCToolkit uses low-overhead asynchronous sampling to obtain deep insight into performance bottlenecks and identify opportunities for tuning both within and across the nodes of a parallel system. As a parallel program executes, HPCToolkit periodically interrupts each executing thread or process to attribute costs (e.g., time, cache misses) to the currently executing program context. By analyzing profiles based on these measurements, one can pinpoint and quantify costs associated with computation, data movement, and synchronization. Using HPCToolkit’s profile views, one can perform detailed analysis of important application kernels at the loop level using hardware performance counters. Using novel visualizations based on traces of call stack samples, HPCToolkit can provide insight into how a program execution unfolds over time at multiple levels of abstraction. In this talk, I will demonstrate HPCToolkit’s performance analysis capabilities by exploring different aspects of the performance of Rice University’s IWAVE code on a cluster equipped with Intel Westmere processors and an InfiniBand interconnect.

Parallel session C: Facilities and infrastructure

Data Center Efficiency – turning up the heat

David Baldwin
Petroleum Geo-Services

During the hottest and driest summer in Texas’ history, Petroleum Geo-Services (PGS) conducted a four-month-long experiment in a live data center, increasing the ambient temperature in the data hall in an effort to reduce the center’s PUE (power usage effectiveness) and hence our costs.
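
For reference, PUE is the ratio of total facility power to the power delivered to the IT equipment: a data hall drawing 3.0 MW in total for a 2.0 MW IT load runs at a PUE of 1.5. Raising the supply temperature aims to shrink the cooling share of the numerator.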

In the project, the ambient temperature in the room was raised from 72°F to 82°F in 2°F increments approximately every two weeks, and the effects on the power consumed were monitored with the following objectives:

  • Ascertain whether there is a point at which efficiency declines as the temperature increases, because the power consumed by the servers (primarily their fans) outstrips the savings from the fans in the computer room air handlers (CRAH).
  • Investigate any possible effects on the mean time between failures (MTBF) of the disks in the servers as the ambient temperature increases, reducing any savings in the cooling system.
  • Investigate the cost benefits of improving the efficiency of the data center, primarily in the reduction of the electrical utility costs.
  • Describe the environmental impact of potential savings in electricity in terms of kg of CO2 saved.

In this session, I will look at how we answered some of these questions and the conclusions we came to as to how hot you could run a data center.

Design and Implementation of Technical Computing Facilities in Hess Tower

Jeff Davis, Gary Whittle, Jim Breef and Vic Forsyth
Hess Corporation

During the summer of 2011, Hess Corporation moved into a modern, LEED Silver (“Leadership in Energy and Environmental Design”) certified new building in downtown Houston. One of the goals of the new building was to provide a technically world-class facility for oil and gas exploration and production. The new building has multiple datacenters providing differing levels of services and redundancies. These datacenters house HPC nodes using NVIDIA GPUs, general purpose servers, high-end file servers, networking equipment and technical workstations.

Over 450 technical workstations are accessed by users through a custom “Connection Broker” interface and are connected using fiber optic extenders and remote computing techniques such as HP RGS and Microsoft RDP. This model of desktop computing provides for targeted use of network bandwidth, building environment/power, and compute resources in general. Instead of devoting multiple technical workstations to each user, “pools” of workstations can be accessed as needed by technical users. Technical workstations can be quite massive, so to reduce this bulk Hess is implementing rack-friendly blade workstations. This essentially turns the expensive technical workstation into something similar to an HPC node or a piece of the “private cloud”.

Another goal of the new building was to enhance technical collaboration. Over 50 visualization/team rooms were constructed, incorporating cutting-edge audio visual equipment and collaboration technology, such as the RGS shared technical desktop, allowing users to collaborate on all of Hess’ technical platforms and software across the office LAN and Hess global WAN. Users can load up data, prepare and test presentations in their own offices, and then re-connect to these workstations immediately from the visualization room to execute their presentations without lost time.

We will show how Hess chose to implement technical IT infrastructure in Hess Tower, providing the desired technical computation and connectivity options, whilst meeting LEED certification and “green” building goals.

Pushing the Envelope with Large Scale HPC Oil and Gas Clusters – Challenges, Risks, and Business Benefits

Kent Blancett
BP

The business driver for HPC clusters in Oil and Gas is to do more – new algorithms, higher frequencies, more depth, less decimation of data, etc. Over the past decade BP has transitioned from exotic large-memory systems to more cost-effective technologies as Linux clusters have demonstrated value using Ethernet networks. BP has chosen to pursue a constant growth strategy so we always have some of the latest hardware. We must continue to listen to our customers and watch the technology market to see what makes sense and when.

The presenters will describe the BP Advanced Seismic Imaging clusters and how they have evolved into the current configuration. They will also share some of the key challenges, lessons learned, and business benefits derived from growing the clusters over the past 6 years. The following topics will be discussed:

  • Focus of HPC work
  • Application direction
  • What technology is available and when
  • How to split the bucket of money – compute, filesystems, network
  • Risk mitigation methods
  • Best practices for maintaining and scaling cluster networks
  • Issues we have avoided or seen
  • Performance improvements
  • Business Benefits

Large-Scale Visualization in HPC—Issues and Trends

Kelly Gaither1 and Erik Engquist2
1The University of Texas at Austin and 2Rice University

Visualization has become an essential tool in scientific research, spanning many areas of science and engineering research and design. As complexity and data size have scaled dramatically, so have the requirements for processing, managing, and storing the enormous amounts of data needed for scientific visualization and analysis.

In this talk, we will discuss visualization and data analysis at the campus and national research center scale. At TACC, we take a multi-pronged approach to completing the path from data to insight, including remote and collaborative visualization tools and technologies, feature detection techniques to sample and home in on the relevant bits in the data, and scalable visualization and analysis techniques and resources. The DAVinCI project at Rice seeks to leverage the efficiencies of data center resources while providing flexible access for visualization and data analysis users. This involves lowering common barriers to usage while seeking to provide seamless transitions from an individual researcher’s remote visualization to the use of large immersive facilities and multi-use collaborative workspaces. We will provide food for thought on issues and trends in large-scale visualization going forward.

Designing a Flexible, Energy Efficient High Performance Computing Facility

Frank Hersom, Saad Dimachkieh and Gary Kuzma
HOK Architects

HOK was asked by a Major Oil and Gas Company to design a High Performance Computing Facility for their Houston campus. We will discuss how we have designed a facility with the flexibility to adapt to new computing technologies, and handle increasing rack densities. We will discuss the energy saving strategies appropriate for the Houston climate, and how we forecast the PUE for this new facility.

Closing keynote

Towards Exascale Computing for Oil and Gas—Directions, Challenges and Strategies

Rajeeb Hazra
Intel

In this talk we will discuss key challenges for getting to exascale computing and outline our overarching vision for the future of High Performance Computing. In the process we will highlight near-term opportunities and indicate strategies for overcoming key challenges. The talk will also discuss required areas of investment across the HPC ecosystem, with a particular emphasis on how this will be critical for getting to exascale. Borrowing from last year’s talk in this workshop by Scott Morton (Hess), Henri Calandra (Total) and John Etgen (BP), entitled “What We Need and What We Can Do With an Exascale System”, the speaker will connect the discussion to needs in the Oil and Gas Industry.


The Effect of Oil Prices on the Economy

Benjamin Tang, Graduate Student; Yinsheng Li, Graduate Student; and Xi Chen, Undergraduate Student, Statistics, Rice University

Oil price volatility has been shown to impact economic output through various linear and non-linear effects. In addition, it has been observed that positive and negative oil price “shocks” result in non-symmetrical perturbations in economic activity. Contemporary research has looked at the impact of bottlenecks on oil price volatility; these take the form of “production constraints”, as well as demand and supply shocks. Our results, using both a Vector Auto-Regression (VAR) and a Vector Error Correction Model (VECM), demonstrate that oil prices have a significant, negative impact on US GDP, with the cumulative impulse response peaking after 6 quarters. Conversely, US economic activity had no significant impact on West Texas Intermediate (WTI) crude prices. Investigating the causes of oil price volatility, our findings suggest that oil product demand-to-inventory ratios, Strategic Petroleum Reserves (SPR), and Organization of Petroleum Exporting Countries (OPEC) spare capacity were significant explanatory variables for volatility, whereas conventional demand-type indicators of economic activity (refinery inputs and utilization rates) were not useful predictors of oil price volatility. This corroborates our primary finding that GNP does not have a causal impact on oil prices. In terms of public policy implications, we therefore recommend applying the coefficient weights against the cost of the relevant policy (i.e. the cost of developing the SPR, or of developing demand or inventory elasticity) in deriving the optimal policy portfolio to maintain oil price stability.

Stochastic Discontinuous Galerkin Method for Single Phase Flow

Kun Liu, Graduate Student, Computational and Applied Mathematics, Rice University
with Béatrice M. Rivière, Associate Professor, Computational and Applied Mathematics, Rice University

This poster introduces the Monte Carlo discontinuous Galerkin (MCDG) method, which is used to solve a single phase flow model with random input data. The Source Point Method (SPM) is used to generate stochastic permeability fields. A numerical example shows that MCDG reduces CPU time by 98% while producing relatively accurate numerical results.

Methane Hydrate Distribution in Heterogeneous Marine Sediments

Sayantan Chatterjee, Graduate Student, Chemical and Biomolecular Engineering, Rice University
with Walter G. Chapman, Professor and George J. Hirasaki, Professor, Chemical and Biomolecular Engineering; Gerald R. Dickens, Professor and Brandon Dugan, Assistant Professor, Earth Science, Rice University

Solid gas hydrates can form when low molecular weight gas molecules and water combine at relatively high gas concentrations, high pressures, low temperatures, and low salinity conditions. These conditions are present along many continental margins (as well as in permafrost environments) where hydrocarbon gases, especially CH4, accumulate in sediment pore space within a finite depth interval known as the gas hydrate stability zone (GHSZ). While the amount and distribution of marine gas hydrates remain uncertain, they may constitute a potential energy resource, a deep-water geohazard, and an important component of the global carbon cycle. My PhD dissertation focuses on developing a two-dimensional, lithologically heterogeneous, sedimentation-compaction fluid flow model that tracks gas hydrate accumulation over geologic timescales. Simulations with a vertical fracture network, which extends through the GHSZ and has permeability 100 times greater than the surrounding shale formation, show that focused fluid flow causes higher hydrate and free gas saturation within the fracture network compared to the surrounding, lower permeability shale. Systems with high permeability, dipping sand layers also show localized, elevated saturations of hydrate and free gas within the sand layers due to focused fluid flow. Permeability anisotropy, with a small vertical to horizontal permeability ratio, enhances hydrate concentrations within high permeability conduits because anisotropy enhances transport of methane-charged fluid to those conduits. The 2-D, heterogeneous models quantify how focused fluid flow through high permeability zones affects local hydrate accumulation and saturation. Using this model, it is determined that the hydrate and free gas saturations can be characterized by the local Peclet number (localized, focused advective flux relative to diffusion), which is consistent with the Peclet number (net fluid flux relative to diffusion) characterization in 1-D systems. This characterization suggests that even in lithologically complex systems, local hydrate and free gas saturations can be characterized by basic local parameters.
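
(In its generic form, the Peclet number is Pe = vL/D, the ratio of advective to diffusive transport, with v the fluid flux, L a characteristic length, and D the methane diffusivity; the exact normalization used in this work may differ.)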

Characterizing the pore structure of vuggy carbonates

Neeraj Rohilla, Graduate Student, Chemical and Biomolecular Engineering, Rice University
with George J. Hirasaki, Professor, Chemical and Biomolecular Engineering, Rice University

Carbonate reservoirs exhibit complex pore structure, with micropores and macropores/vugs. Vuggy pore space can be divided into separate vugs and touching vugs, depending on vug interconnection. Characterizing pore structure in carbonates is a complex task due to the diverse variety of pore types seen in carbonates and the extreme pore-level heterogeneity. In order to characterize the pore size in vuggy carbonates, we use NMR along with tracer analysis. The distribution of porosity between micro- and macro-porosity can be measured by NMR. However, NMR cannot predict whether different sized vugs are connected or isolated. Tracer analysis is used to characterize the connectivity of the vug system and the matrix. A modified version of the differential capacitance model of Coats and Smith (1964), together with a solution procedure developed by Baker (1975), is used to study dispersion and capacitance effects in core samples. Tracer analysis gives valuable insight into the fraction of dead-end pores, dispersion, and mass transfer effects at the core scale.

2D Harmonic Coordinate Finite Element Method for Acoustic Waves

Xin Wang, Graduate Student, Computational and Applied Mathematics, Rice University
with William W. Symes, Noah Harding Professor, Computational and Applied Mathematics, Rice University

We develop a 2D harmonic coordinate finite element method (HCFEM) on a regular Cartesian grid, which achieves optimal second-order convergence for static and dynamic acoustic boundary value problems with spatially heterogeneous bulk modulus and density, at the additional cost of solving two auxiliary elliptic boundary value problems. Unlike conventional FEM, HCFEM does not require interface-conforming meshes to achieve the optimal convergence rate. HCFEM stiffness and mass matrices are constructed in a systematic procedure, and have the same sparsity pattern as those in the standard regular grid finite element method. Mass-lumping in HCFEM is proved to preserve the optimal convergence order, thanks to the smoothness of acoustic solutions in harmonic coordinates, and results in an efficient explicit time step.
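
For reference, the underlying model is variable-coefficient acoustics, which in standard notation (ours, matching the coefficients named above) reads

    \frac{1}{\kappa(x)} \frac{\partial^2 p}{\partial t^2}
      - \nabla \cdot \left( \frac{1}{\rho(x)} \nabla p \right) = f,

where p is the pressure, \kappa the bulk modulus, \rho the density, and f the source. Harmonic coordinates adapt the element space to discontinuities in \kappa and \rho without fitting the mesh to the interfaces.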

Hybrid finite volume and discontinuous Galerkin method for convection-diffusion problems

Xin Yang, Graduate Student, Computational and Applied Mathematics, Rice University
with Béatrice M. Rivière, Associate Professor, Computational and Applied Mathematics, Rice University

This work formulates and analyzes a new coupled finite volume (FV) and discontinuous Galerkin (DG) method for convection-diffusion problems. DG methods, though costly, have proved to be accurate for solving convection-diffusion problems and capable of handling discontinuous and tensor coefficients. FV methods have proved to be very efficient, but they are only first-order accurate and become ineffective for tensor coefficient problems. The coupled method takes advantage of both the accuracy of DG methods in the regions containing heterogeneous coefficients and the efficiency of FV methods in other regions. Numerical results demonstrate that this coupled method is able to resolve complicated coefficient problems at a decreased computational cost compared to DG methods. This work can be applied to problems such as the transport of contaminants underground, CO2 sequestration, and the transport of cells in the body.

Acceleration of the Front Propagation Segmentation Algorithm: Towards Tumor Volumetrics

Yunming Zhang, Undergraduate Student, Computer Science, Rice University
with Vivek Sarkar, Professor, Vincent Cave, Research Programmer, Zoran Budimlic, Research Scientist, Computer Science, Rice University; Alex Bui, Professor, Radiology, William Hsu, Professor, Radiology, UCLA

I worked on accelerating the front propagation segmentation algorithm with Habanero C and incorporated it into the image processing pipeline at the Center for Domain-Specific Computing, jointly hosted by Rice, UCLA, UCSB and OSU. The pipeline is used to help radiologists better assess a patient’s response to treatment. The program processes 3D images from patients and generates volumetric data of tumors. The algorithm starts with a point of interest identified by a domain expert; this could be a suspected tumor pointed out by a radiologist. For each 2D slice, the algorithm compares the gradient value difference between neighboring pixels and the starting point, and propagates from the starting point to its neighboring pixels. Eventually, the propagation spreads to the entire 2D slice and assigns each pixel in the image a gradient distance from the starting pixel. Using the newly updated gradient distance values, clusters of pixels with similar gradient values form regions of interest on a 2D slice. The program then traces the contours of the regions of interest and selects the right contour based on its proximity to the seed point and its area. In the end, it calculates the volume based on information from all 2D slices of the 3D image. The algorithm is parallelized over the contour tracing for each individual 2D slice. By fully exploiting this data parallelism, creating a separate task for each 2D slice of the image, I achieved as much as a 40% speedup of the segmentation algorithm with Habanero C, a parallel language developed at Rice University.

1D Acoustic Transparency Property

Yin Huang, Graduate Student, Computational and Applied Mathematics, Rice University

A new proof is given of the 1-D acoustic transparency theorem for media of bounded variation. The proof uses a piecewise approximation of a function of bounded variation, and then obtains the bound by analyzing the reflection and transmission of the waves traveling across the depth interval. An example is given to show that bounded variation is not a necessary condition for a medium to have the transparency property.

The formation and evolution of triangle zones in fold-and-thrust belts: Insights from DEM simulations

Sarah Dean, Graduate Student, Earth Science, Rice University
with Tom Fournier, Research Associate; Julia Morgan, Professor, Earth Science, Rice University

Studying the frontal structures of fold and thrust belts can provide valuable insights into the formation of mountain belts, and the mechanical processes that control the incorporation of new material into the deforming belt. Triangle zones are of particular interest since they are often traps for oil and natural gas. However, the mechanics governing the presence or absence of a frontal triangle zone in a fold and thrust belt are not well understood. One hypothesis is that the presence of a mechanically weak cover unit promotes the formation of a frontal backthrust above a duplexed unit, creating a frontal triangle zone, versus a more mechanically strong and homogeneous fold and thrust belt which does not form frontal triangle zones. Simulations carried out using the discrete element method (DEM) tested the effects of different parameters on the formation of frontal structures in a fold and thrust belt, including the effects of a mechanically weak cover sequence. All models were designed by constructing multiple layers with different interparticle bond strengths to simulate mechanical stratigraphy. Each model was 120 km long and 4-5 km thick, composed of 25 to 50 thousand particles with radii of 60 and 80 m. Horizontal compression was imposed by moving the left wall inward, above an elastically deformable basement. Particle positions and displacements were recorded for every 160 m of displacement. The deformation fields for the 1st and 2nd strain invariants were calculated and show the cumulative strain over time for each simulation. Only the models with a stratigraphy consisting of mechanically weak units between stronger units allowed for the formation of triangle zones, indicating that a weak cover sequence is not responsible for the formation of triangle zones.

Coarray Fortran 2.0: A productive language for scalable scientific computing

Chaoran Yang, Graduate Student, Computer Science, Rice University
with John Mellor-Crummey, Professor; Laksono Adhianto, Research Scientist; Guohua Jin, Research Scientist; Mark Krentel, Research Scientist; Dung Nguyen, Research Scientist; William Scherer III, Research Scientist; Scott K. Warren, Research Scientist; and Karthik Murthy, Graduate Student, Computer Science, Rice University

MPI has long been the dominant programming model for writing scalable parallel programs. However, writing MPI is difficult because the application developer has only a local view of data, and MPI’s two-sided communication model can be awkward, particularly for applications that require dynamic load balancing. Partitioned global address space languages address these difficulties by providing a shared global address space abstraction and one-sided communication. Fortran 2008 now includes support for shared data known as “coarrays” to support this style of programming. However, Fortran 2008’s support for parallel programming is quite limited. At Rice, we believe that more extensive support is needed for productive parallel programming. Accordingly, we have been developing a richer set of language extensions, known as Coarray Fortran 2.0 (CAF 2.0), to support developing scalable parallel programs in Fortran. CAF 2.0 is a coherent synthesis of concepts from MPI, Unified Parallel C, and IBM’s X10 programming language. CAF 2.0 includes a broad array of features including process subsets known as teams, asynchronous communication, collective operations, communication topologies, dynamic allocation of shared data, global pointers, and a richer set of synchronization constructs. In this poster, we will present the CAF 2.0 language constructs and explain their utility by using them to express a key kernel in LANL’s Parallel Ocean Program. In a companion poster, we will describe the performance of CAF 2.0 implementations of High Performance Computing Challenge benchmarks on DOE supercomputers.

Coarray Fortran 2.0 on petascale architectures

Karthik Murthy, Graduate Student, Computer Science, Rice University
with John Mellor-Crummey, Professor; Laksono Adhianto, Research Scientist; Guohua Jin, Research Scientist; Mark Krentel, Research Scientist; Dung Nguyen, Research Scientist; William Scherer III, Research Scientist; Scott K. Warren, Research Scientist; and Chaoran Yang, Graduate Student, Computer Science, Rice University

At Supercomputing 2011, Preissl et al. showed that rewriting the communication phase of a parallel scientific simulation of fusion from two-sided communication in MPI to one-sided communication in Coarray Fortran (CAF) yielded a 52% speedup for the application overall. One-sided communication enabled the application to make better use of the communication hardware on a Cray XE6 supercomputer by spreading communication out over time. This result highlights the promise of programming models that support one-sided communication. At Rice, we have been developing Coarray Fortran 2.0, a richer programming model based on one-sided communication. The features of Coarray Fortran 2.0 are presented in a companion poster. This poster showcases our experiences using Coarray Fortran 2.0 to develop implementations of four of the well-known HPC Challenge benchmarks — Global HPL, Global RandomAccess, EP STREAM (Triad), and Global FFT — as well as an Unbalanced Tree Search (UTS) benchmark, which requires dynamic load balancing for scalability and performance. We present the performance of our CAF 2.0 implementations of these benchmarks on Cray supercomputers at DOE national laboratories. We evaluate the scalability and performance of our implementations on up to thousands of cores. Our performance numbers for these benchmarks are competitive with the best performing implementations using other programming models. Our CAF 2.0 implementations won awards for performance and elegance at the HPC Challenge Awards competition at Supercomputing ’10 and Supercomputing ’11.

Kinetic Monte Carlo simulations of mineral dissolution at different scales

Inna Kurganskaya, Graduate Student, Earth Science, Rice University
with Andreas Luttge, Professor, Earth Science, Rice University

Accurate modeling of geological basins requires fundamental knowledge of mineral dissolution and precipitation kinetics. The porosity and permeability characteristics of geological reservoirs are controlled by the dissolution process destroying the rock and by the precipitation of minerals clogging the pores. Understanding the key reaction mechanisms is therefore critical for basin formation modeling and for predicting long-term changes in rock structure. Nowadays mineral dissolution can be studied at scales from single molecular reactions to real geological objects. We use a multiscale approach to understand mineral dissolution kinetics by combining AFM and optical interferometry (VSI) observations of reacting surfaces with molecular-level Kinetic Monte Carlo (KMC) simulations. Our goal is to build a detailed kinetic model of mineral dissolution and to test it by verifying the results of KMC simulations against experimental data. To do that, we need to operate with systems comprising hundreds to thousands of millions of atoms just to simulate the micron-sized areas observed with AFM. At the same time, a large number of surface reactions, e.g. dissolution, precipitation and diffusion of molecules at various surface sites, must be considered in the model. Increasing the number of possible reactions also decreases the size of the system that can be modeled. We thus need to develop effective computational algorithms and to use HPC techniques in order to substantially increase the system size. We believe that these approaches will help us find the best models capable of reproducing experimentally observed dissolution patterns at sizes ranging from one square micron (AFM scale) to 100,000 square microns (VSI scale).
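
As background, one standard KMC step (a Gillespie-style sketch of ours, not the authors' code) selects an event in proportion to its rate and advances time by an exponentially distributed increment; for the system sizes quoted above, the linear event search below is the first thing an HPC implementation replaces with tree-based rate structures:

    #include <math.h>
    #include <stdlib.h>

    /* Pick event i with probability rates[i]/R and advance *t; returns i. */
    int kmc_step(const double *rates, int nevents, double *t)
    {
        double R = 0.0;
        for (int i = 0; i < nevents; i++) R += rates[i];   /* total rate */

        double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);     /* u in (0,1) */
        double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);

        double target = u1 * R, acc = 0.0;
        int chosen = nevents - 1;
        for (int i = 0; i < nevents; i++) {                /* O(n) search */
            acc += rates[i];
            if (acc >= target) { chosen = i; break; }
        }
        *t += -log(u2) / R;                 /* exponential waiting time */
        return chosen;
    }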

Gas Hydrate Distribution and Detection

Guangsheng Gu, Graduate Student, Chemical and Biomolecular Engineering, Rice University
with Walter Chapman, Professor, and George J. Hirasaki, Professor, Chemical & Biomolecular Engineering, and Colin A. Zelt, Professor, Earth Science, Rice University; and Priyank Jaiswal, Assistant Professor, Boone Pickens School of Geology, Oklahoma State University

The most common indication of gas hydrates in seismic data is the presence of a bottom-simulating reflector (BSR), which marks the base of the gas hydrate stability zone (GHSZ). A BSR originates when the acoustic impedance changes sufficiently within one wavelength of the dominant seismic frequency as hydrate dissociates into free gas below the base of the GHSZ. We demonstrate that when hydrates are not purely methane (e.g. with 5% propane), the transition zone that marks the phase change from a purely solid hydrate state to a purely free gas state can span a distance significantly larger (~300 m) than the dominant seismic wavelength (~50 m for a 30 Hz wavelet), resulting in a weak to absent BSR. We generate synthetic responses with seismic wavelets of varying dominant frequencies to show that a strong BSR reflection (>1/10 of the seafloor reflection) occurs if the dominant wavelength is larger than twice the thickness of the transition zone. Our results provide a potential explanation for the absence of a BSR in conventional seismic surveys of mixed (thermogenic and biogenic) hydrate systems such as the Gulf of Mexico.

Performance Analysis on Intel’s Single-Chip Cloud Computer (SCC)

Xu Liu, Graduate Student, Computer Science, Rice University
with John Mellor-Crummey, Professor, Computer Science, Rice University

Intel’s Single-Chip Cloud Computer (SCC) is an experimental multi-core processor designed to scale to hundreds of cores. However, compared to traditional multicores, the SCC places a high burden on application programmers. To date, SCC-based systems have had no performance tool that can help programmers rapidly analyze an application’s performance bottlenecks. To remedy this problem, we ported Rice University’s HPCToolkit performance tools to the SCC to support analysis of hot loops with hardware performance counters, pinpoint scaling bottlenecks in context, identify load imbalance with call path traces, and identify time-varying performance problems such as serialization. This poster describes our experiences using HPCToolkit to analyze the performance of the Trsm code from UT Austin’s Elemental library and the SCC version of the NAS LU benchmark. HPCToolkit enabled us to understand Trsm’s performance in detail and provided insight that enabled us to improve the performance of the NAS LU benchmark by more than 30%.

Next Generation Performance Tools for Heterogeneous Architectures

Milind Chabbi, Graduate Student, Computer Science, Rice University
with Karthik Murthy, Graduate Student, Mike Fagan, Research Scientist, John Mellor-Crummey, Professor, Computer Science, Rice University

General purpose graphics processing units (GPGPUs) deliver high FLOP rates with modest energy consumption. Heterogeneous architectures are becoming commonplace in high performance computing because they can outperform traditional multicore processors by orders of magnitude. Because of its energy efficiency, heterogeneous computing is here to stay, from smart phones to supercomputers. However, writing performance-efficient hybrid codes is complex for two reasons. First, programming techniques that yield efficient hybrid code depend upon accelerator architecture, hardware vendor, and sometimes even the product version. Second, hybrid code inefficiency frequently derives from imbalanced use of resources. For example, offloading an entire computation to GPUs can become a source of inefficiency. Attaining peak performance requires appropriate division and overlap of computations across all available resources. Such division is both program and machine dependent. Consequently, developers need tools to provide insight into inefficiencies in hybrid codes. Current performance tools lack the capability to accurately measure and attribute the sources of bottlenecks in hybrid codes. In addition, vendor hardware and software support for performance measurement is in its infancy.

To address the clear need for hybrid code performance tools, we are enhancing Rice University’s HPCToolkit performance tools to accurately measure, analyze and pinpoint bottlenecks in hybrid codes to help guide programmers to tuning opportunities. HPCToolkit focuses on identifying the causes of inefficiencies rather than the symptoms. For hybrid codes, it is important to identify what code is responsible for idleness of CPU or GPU resources. To analyze the performance of hybrid codes, HPCToolkit uses a novel “blame shifting” approach where it blames code executing on a CPU when a GPU is idle, and the kernels executing on a GPU when a CPU is waiting. HPCToolkit can also identify code regions where both CPU and GPU are working.
