Professor Stephen Jarvis
About
Biography
Professor Stephen Jarvis joined the University of Surrey as President and Vice-Chancellor in September 2025. Prior to this, Professor Jarvis was Provost and Vice-Principal at the University of Birmingham, where he also served as Pro-Vice-Chancellor and Head of the College of Engineering and Physical Sciences during his time there.
Before joining the University of Birmingham he was Deputy Pro-Vice-Chancellor (Research) at the University of Warwick, where he led industry-academic partnerships in the area of big data and established an international scholarship programme in AI. He also supported the establishment of The Alan Turing Institute, the United Kingdom's national institute for data science and artificial intelligence, where he served as a non-executive Director and Trustee between 2018 and 2020.
Professor Jarvis is a computational scientist and has published more than 220 peer-reviewed academic papers, including collaborations with Argonne National Laboratory, ARM, the Centre for Space Research at MIT, CRAY, Fujitsu Labs Europe, IBM TJ Watson Research Laboratory, Intel, Lawrence Berkeley National Laboratory, NASA Ames Research Center, NEC European Research Laboratory, Rolls-Royce, Sandia National Laboratories and the Francis Crick Institute. He has received several major grants from the UK Research Councils and from industry, supporting research in areas including data science, robotics and autonomy, aerodynamics and continuum mechanics, networks and distributed systems, computer graphics and visualization, software engineering and computer architectures.
My qualifications
News
Publications
We present projection sorting, an algorithmic approach to determining pairwise short-range forces between particles in molecular dynamics simulations. We show it can be more effective than the standard approaches when particle density is non-uniform. We implement tuned versions of the algorithm in the context of a biophysical simulation of chromosome condensation, for the modern Intel Broadwell and Knights Landing architectures, across multiple nodes. We demonstrate up to 5× overall speedup and good scaling to large problem sizes and processor counts.
• We present the projection sorting algorithm for molecular dynamics simulations.
• We provide optimised implementations for Intel Broadwell and Knights Landing.
• We extend our implementations to multi-node environments using MPI.
• We investigate the performance of our algorithm in a biophysical MD simulation.
• We observe serial speedups up to 5×, and good MPI scaling behaviour.
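For readers unfamiliar with the approach, the following is a minimal sketch, in illustrative Python, of one plausible reading of the projection-sorting idea (not the paper's tuned implementation): particles are sorted by their projection onto an axis, so the inner loop over candidate neighbours can stop as soon as the projected separation exceeds the cutoff.

import numpy as np

def short_range_pairs(positions, cutoff, axis=np.array([1.0, 0.0, 0.0])):
    """Candidate short-range pairs via projection sorting (illustration only).
    Particles are sorted by their projection onto `axis`; a pair can only
    interact if the projected separation is within `cutoff`, so the inner
    loop terminates early instead of scanning all N particles."""
    proj = positions @ axis                 # scalar projection of each particle
    order = np.argsort(proj)                # indices sorted by projected coordinate
    pairs = []
    for a in range(len(order)):
        i = order[a]
        for b in range(a + 1, len(order)):
            j = order[b]
            if proj[j] - proj[i] > cutoff:  # no later particle can be in range
                break
            if np.linalg.norm(positions[i] - positions[j]) <= cutoff:
                pairs.append((i, j))
    return pairs

# Hypothetical example: 500 particles clustered non-uniformly in a box
rng = np.random.default_rng(0)
pts = rng.normal(loc=5.0, scale=1.0, size=(500, 3))
print(len(short_range_pairs(pts, cutoff=0.5)))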
There is significant national interest in tackling issues surrounding the needs of vulnerable children and adults. At the same time, UK local authorities face severe financial challenges as a result of decreasing financial settlements and increasing demands from growing urban populations. This research employs state-of-the-art data analytics and visualisation techniques to analyse six years of local government social care data for the city of Birmingham, the UK's second most populated city. This analysis identifies: (i) service cost profiles over time; (ii) geographical dimensions to service demand and delivery; (iii) patterns in the provision of services; and (iv) the extent to which data value and data protection interact. The research accesses data held by the local authority to discover patterns and insights that may assist in the understanding of service demand, support decision making and resource management, whilst protecting and safeguarding its most vulnerable citizens. The use of data in this manner could also inform the approach a local authority has to its data, its capture and use, and the potential for supporting data-led management and service improvements.
This paper introduces a novel four-stage methodology for real-estate valuation. The research shows that space, property, economic, neighbourhood and time features all contribute to a house price predictor which, in validation, achieves 96.6% accuracy using Gaussian Process Regression, beating regression-kriging, random forests and an M5P decision tree. The output is integrated into a commercial real-estate decision engine.
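As an illustration only (not the paper's pipeline), a Gaussian Process Regression house-price predictor of the kind described can be prototyped in a few lines; the feature layout and numbers below are hypothetical.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical training matrix: columns might encode space (x, y), property
# attributes, an economic indicator, a neighbourhood score and a time index.
rng = np.random.default_rng(1)
X = rng.random((200, 6))
y = 250_000 + 100_000 * X[:, 0] + 50_000 * X[:, 5] + rng.normal(0, 5_000, 200)

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

price, std = gpr.predict(rng.random((1, 6)), return_std=True)
print(f"predicted price: {price[0]:,.0f} +/- {std[0]:,.0f}")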
While local-area greenspace is associated with reduced symptoms of mental distress and greater life satisfaction, most previous research has measured the amount of local-area greenspace within administrative boundaries, and has found mixed results for associations between greenspace and multidimensional mental wellbeing. This study was designed to examine whether the amount of greenspace within a radius of individuals' homes was associated with mental wellbeing, testing the government guideline that greenspace should be available within 300 m of homes. Individual- and household-level data were drawn from the Annual Population Survey at postcode level (APS, Pooled Dataset 2012–2015), which includes three mental wellbeing measures, covering aspects of life satisfaction, sense of worth and happiness, as well as a range of socio-demographic variables. Greenspace data were obtained from the Greenspace Information for Greater London Group (GiGL) and were used to calculate the amount of greenspace within a 300 m radius of individuals. Linear regression models revealed positive and statistically significant associations between the amount of greenspace and indicators of life satisfaction and worth. Moran's I, an indicator of spatial autocorrelation, revealed statistically significant clustering of the residuals of these models, so Geographically Weighted Regression (GWR) models were calculated in order to adjust for underlying spatial processes within the data and to investigate the geographic variation in the association between local greenspace and mental wellbeing. The global GWR model revealed that a 1 ha increase in greenspace within 300 m of residents was associated with statistically significant increases of 0.803 in life satisfaction, and of 0.740 and 0.521 in worth and happiness, respectively. This therefore provides some support for the provision of greenspace within 300 m of homes. Local GWR coefficients revealed slight variation in the strength of these associations across the study space. Further analyses are therefore required to investigate whether the walking (network) distance, absolute size, or type of each greenspace can explain this spatial variation.
• Associations between local-area greenspace and mental wellbeing are mixed.
• UK Government recommends greenspace within 300 m of individuals.
• Greenspace within buffers of individuals was calculated.
• Greenspace within 300 m is associated with improved mental wellbeing.
• This association varies spatially.
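For context, geographically weighted regression fits location-specific coefficients rather than a single global slope; a standard, study-independent formulation is

y_i = \beta_0(u_i, v_i) + \sum_{k} \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i, \qquad \hat{\boldsymbol{\beta}}(u_i, v_i) = \left( \mathbf{X}^{\top} \mathbf{W}(u_i, v_i)\, \mathbf{X} \right)^{-1} \mathbf{X}^{\top} \mathbf{W}(u_i, v_i)\, \mathbf{y},

where (u_i, v_i) are the coordinates of observation i and \mathbf{W}(u_i, v_i) is a diagonal matrix of distance-decay weights centred on that location.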
Input/output (I/O) operations can represent a significant proportion of the run-time when large scientific applications are run in parallel. Although there have been advances in the form of file-format libraries, file system design and I/O hardware, a growing divergence exists between the performance of parallel file systems and compute processing rates. In this paper we utilise RIOT, an input/output tracing toolkit being developed at the University of Warwick, to assess the performance of three standard industry I/O benchmarks and mini-applications. We present a case study demonstrating the tracing and analysis capabilities of RIOT at scale, using MPI-IO, Parallel HDF-5 and MPI-IO augmented with the Parallel Log-structured File System (PLFS) middleware being developed by the Los Alamos National Laboratory.
Presents the introductory welcome message from the conference proceedings. May include the conference officers' congratulations to all involved with the conference event and publication of the proceedings record.
Analysing and learning from spatio-temporal datasets is an important process in many domains, including transportation, healthcare and meteorology. In particular, data collected by sensors in the environment allows us to understand and model the processes acting within the environment. Recently, the volume of spatio-temporal data collected has increased significantly, presenting several challenges for data scientists. Methods are therefore needed to reduce the quantity of data that needs to be processed in order to analyse and learn from spatio-temporal datasets. In this article, we present the k-Dimensional Spatio-Temporal Reduction method (kD-STR) for reducing the quantity of data used to store a dataset whilst enabling multiple types of analysis on the reduced dataset. kD-STR uses hierarchical partitioning to find spatio-temporal regions of similar instances, and models the instances within each region to summarise the dataset. We demonstrate the generality of kD-STR with three datasets exhibiting different spatio-temporal characteristics and present results for a range of data modelling techniques. Finally, we compare kD-STR with other techniques for reducing the volume of spatio-temporal data. Our results demonstrate that kD-STR is effective in reducing spatio-temporal data and generalises to datasets that exhibit different properties.
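A heavily simplified, one-dimensional sketch of the general partition-then-model idea (illustrative Python; not the kD-STR implementation) follows: a region is split recursively until the instances it contains can be summarised by a simple model within an error tolerance, and only the per-region summaries are stored.

import numpy as np

def partition_and_model(t, values, tol, lo=None, hi=None, regions=None):
    """Recursively split the interval [lo, hi) of timestamps `t` until the
    values inside each region are summarised by their mean to within `tol`
    (max absolute error). Returns a list of (lo, hi, mean) region summaries."""
    if regions is None:
        regions, lo, hi = [], t.min(), t.max() + 1e-9
    mask = (t >= lo) & (t < hi)
    if not mask.any():
        return regions
    model = values[mask].mean()                        # trivial per-region model
    if np.abs(values[mask] - model).max() <= tol:
        regions.append((lo, hi, model))                # region is homogeneous enough
    else:
        mid = (lo + hi) / 2.0                          # otherwise split and recurse
        partition_and_model(t, values, tol, lo, mid, regions)
        partition_and_model(t, values, tol, mid, hi, regions)
    return regions

t = np.linspace(0, 10, 1000)
v = np.where(t < 5, 1.0, 3.0) + 0.05 * np.sin(t)
print(len(partition_and_model(t, v, tol=0.1)))         # a few regions summarise 1,000 points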
Presents the welcome message from the conference proceedings.
Conference Title: 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS). Conference Start Date: 18 November 2019. Conference End Date: 18 November 2019. Conference Location: Denver, CO, USA. Presents the introductory welcome message from the conference proceedings. May include the conference officers' congratulations to all involved with the conference event and publication of the proceedings record.
Presents the introductory welcome message from the conference proceedings. May include the conference officers' congratulations to all involved with the conference event and publication of the proceedings record.
Comparative genome hybridization using the pan-Neisseria microarray identified genes from the gonococcal genetic island (GGI) within Neisseria meningitidis strains of serogroups W-135, H, and Z. While some of these strains contain nearly all of the genes of the GGI, there are differences in the presence of some of these genes between the strains, including between those of the same serogroup. Attempts were then made to determine the location of the GGI in these meningococci. Sequencing of Neisseria gonorrhoeae strain MS11 revealed that the GGI is a conjugative plasmid that can be chromosomally integrated at the dif sites near ung and can also be present in its circularized form. In N. meningitidis, a dif site is present in this location and also serves as the point of chromosomal integration of the GGI in this species.
Recent studies demonstrate that people are increasingly looking online to assess their health, for reasons varying from personal preferences and beliefs to an inability to book a timely appointment with their local medical practice. Records of these activities represent a new source of data about the health of populations, but one which is currently unaccounted for by disease surveillance models. This data could potentially be useful as evidence of individuals' perception of bodily changes and self-diagnosis of early symptoms of an emerging disease. We make use of the Experian geodemographic Mosaic dataset in order to extract Type 2 diabetes candidate risk variables and compare their temporal relationships with the search keywords used to describe early symptoms of the disease on Google. Our results demonstrate that Google Trends can detect early signs of diabetes by monitoring combinations of keywords associated with searches for hypertension treatment and poor living conditions; combined search semantics related to obesity, quitting smoking and improving living conditions (deprivation) can also be employed, but may lead to less accurate results.
This paper aims to explore whether there is a relationship between search patterns for flood risk information on the Web and how badly localities have been affected by flood events. We hypothesize that localities where people stay more actively informed about potential flooding experience less negative impact than localities where people make less effort to be informed. Being informed, of course, does not hold the waters back; however, it may stimulate (or serve as an indicator of) such resilient behaviours as timely use of sandbags, relocation of possessions from basements to upper floors and/or temporary evacuation from flooded homes to alternative accommodation. We make use of open data to test this relationship empirically. Our results demonstrate that although aggregated Web search reflects average rainfall patterns, its eigenvectors predominantly consist of locations with similar flood impacts during 2014–2015. These results are also consistent with statistically significant correlations of Web search eigenvectors with flood warning and incident reporting datasets.
The LOLITA natural language processor is an example of one of the ever-increasing number of large-scale systems written entirely in a functional programming language. The system consists of over 47,000 lines of Haskell code (excluding comments) and is able to perform a wide range of tasks such as semantic and pragmatic analysis of text, information extraction and query analysis. The efficiency of such a system is critical; interactive tasks (such as query analysis) must ensure that the user is not inconvenienced by long pauses, and batch-mode tasks (such as information extraction) must ensure that an adequate throughput can be achieved. For the past three years the profiling tools supplied with GHC and HBC have been used to analyse and reason about the complexity of the LOLITA system. There have been good results; however, experience has shown that in a large system the profiling life-cycle is often too long to make detailed analysis possible, and the results are often misleading. In response to these problems a profiler has been developed which allows the complete set of program costs to be recorded in so-called cost-centre stacks. These program costs are then analysed using a post-processing tool to allow the developer to explore the costs of the program in ways that are either not possible with existing tools or would require repeated compilations and executions of the program. The modifications to the Glasgow Haskell compiler, based on detailed cost semantics and an efficient implementation scheme, are discussed. The results of using this new profiling tool in the analysis of a number of Haskell programs are also presented. The overheads of the scheme are discussed and the benefits of this new system are considered. An outline is also given of how this approach can be modified to assist with the tracing and debugging of programs.
Presents the welcome message from the conference proceedings.
Increasingly, user-generated content (UGC) in social media postings, and its associated metadata such as time and location stamps, is being used to provide useful operational information during natural hazard events such as hurricanes, storms and floods. The main advantages of these new sources of data are twofold. First, in a purely additive sense, they can provide much denser geographical coverage of the hazard than traditional sensor networks. Second, they provide what physical sensors cannot: by documenting personal observations and experiences, they directly record the impact of a hazard on the human environment. For this reason, interpretation of the content (e.g. hashtags, images, text, emojis) and metadata (e.g. keywords, tags, geolocation) has been a focus of much research into social media analytics. However, as the choice of semantic tags in current methods is usually reduced to the exact name or type of the event (e.g. the hashtags '#Sandy' or '#flooding'), the main limitation of such approaches remains their mere nowcasting capacity. In this study we make use of polysemous tags of images posted during several recent flood events and demonstrate how such volunteered geographic data can be used to provide early warning of an event before its outbreak.
Presents the welcome message from the conference proceedings.
In dynamic resource allocation systems, servers are moved between pools when overloading is detected. In this work, we investigate the impact on such systems of combining three adaptive monitoring techniques. First, we employ two well-known switching policies, the Proportional Switching Policy (PSP) and the Bottleneck Aware Switching Policy (BSP), to move servers between server pools as appropriate. Second, we use a meta-forecasting technique to predict the movement in future system workload. Third, we use a Dynamic Active Window Model (DAWM), which defines the period over which workload data is analysed. We have previously shown that request servicing capability can be improved by as much as 40% when the right combination of dynamic server switching and workload forecasting is used. This extended model shows that a further 51.5% improvement can be achieved when the switching policy, meta-forecasting and dynamic active window management are employed together over a real-world workload based on Internet traces.
Presents the introductory welcome message from the conference proceedings. May include the conference officers' congratulations to all involved with the conference event and publication of the proceedings record.
When linking spatio-temporal datasets, the kD-STR algorithm can be used to reduce the datasets and speed up the linking process. However, kD-STR can sacrifice accuracy in the linked dataset whilst retaining unnecessary information. To overcome this, we propose a preprocessing step that removes unnecessary information, and an alternative heuristic for kD-STR that prioritises accuracy in the linked output. These are evaluated in a case study linking a road condition dataset with air temperature, rainfall and road traffic data. In this case study, the alternative heuristic achieved a 19% improvement in mean error for the linked air temperature features and an 18% reduction in storage used for the rainfall dataset, compared to the original kD-STR heuristic. The results in this paper support our hypothesis that, at worst, our alternative heuristic yields a similar error and storage overhead to the original kD-STR heuristic, while in some cases it produces a linked dataset that is more accurate whilst using less storage than the original kD-STR algorithm.
Presents the introductory welcome message from the conference proceedings. May include the conference officers' congratulations to all involved with the conference event and publication of the proceedings record.
The view that interacting with nature enhances mental wellbeing is commonplace, despite a dearth of evidence or even agreed definitions of 'nature'. The aim of this review was to systematically appraise the evidence for associations between greenspace and mental wellbeing, stratified by the different ways in which greenspace has been conceptualised in quantitative research. We undertook a comprehensive database search and thorough screening of articles which included a measure of greenspace and a validated mental wellbeing tool, to capture aspects of hedonic and/or eudaimonic wellbeing. Quality and risk of bias in research were assessed to create grades of evidence. We undertook detailed narrative synthesis of the 50 studies which met the review inclusion criteria, as methodological heterogeneity precluded meta-analysis. Results of a quality assessment and narrative synthesis suggest associations between different greenspace characteristics and mental wellbeing. We identified six ways in which greenspace was conceptualised and measured: (i) amount of local-area greenspace; (ii) greenspace type; (iii) visits to greenspace; (iv) views of greenspace; (v) greenspace accessibility; and (vi) self-reported connection to nature. There was adequate evidence for associations between the amount of local-area greenspace and life satisfaction (hedonic wellbeing), but not personal flourishing (eudaimonic wellbeing). Evidence for associations between mental wellbeing and visits to greenspace, accessibility, and types of greenspace was limited. There was inadequate evidence for associations with views of greenspace and connectedness to nature. Several studies reported variation in associations between greenspace and wellbeing by life course stage, gender, levels of physical activity or attitudes to nature. Greenspace has positive associations with mental wellbeing (particularly hedonic wellbeing), but the evidence is not currently sufficient or specific enough to guide planning decisions. Further studies are needed, based on dynamic measures of greenspace, reflecting access and uses of greenspace, and measures of both eudaimonic and hedonic mental wellbeing.
Presents the welcome message from the conference proceedings.
Many critical infrastructure systems have network structures and are under stress. Despite their national importance, the complexity of large-scale transport networks means that we do not fully understand their vulnerabilities to cascade failures. This paper examines the interdependent rail networks in Greater London and the surrounding commuter area. We focus on the morning commuter hours, when the system is under the most demand stress. There is increasing evidence that the topological shape of the network plays an important role in dynamic cascades. Here, we examine whether the different topological measures of resilience (stability) or robustness (failure) are more appropriate for understanding poor railway performance. The results show that resilience, not robustness, has a strong correlation with the consumer experience statistics. Our results are a way of describing the complexity of cascade dynamics on networks without the involvement of detailed agent-based models, showing that cascade effects are more responsible for poor performance than failures. The network science analysis hints at pathways towards making the network structure more resilient by reducing feedback loops.
We investigate whether increasing cycling activity affects the emergence of new local businesses. Historical amenity data from OpenStreetMap is used to quantify change in shop and sustenance amenity counts. We apply an instrumental variable framework to investigate a causal relationship and to account for endogeneity in the model. Measures of cycling infrastructure serve as instruments. The impact is evaluated on the level of 4835 Lower Super Output Areas in Greater London. Our results indicate that an increase in cycling trips significantly contributes to the emergence of new local shops and businesses. Limitations regarding data quality, zero-inflation and residual spatial autocorrelation are discussed. While our findings correspond to previous investigations stating positive economic effects of cycling, we advance research in the field by providing a new dataset of unprecedented high granularity and size. Furthermore, this is the first study in cycling research looking at business amenities as a measure of economic activity. The insights from our analysis can enhance understandings of how cycling affects the development of local urban economies and may thus be used to assess and evaluate transport policies and investments. Beyond this, our study highlights the value of open data in city research.
Presents the introductory welcome message from the conference proceedings. May include the conference officers' congratulations to all involved with the conference event and publication of the proceedings record.
The scale and complexity of online applications and e-business infrastructures has led service providers to rely on the capabilities of large-scale hosting platforms, i.e., data centers. Dynamic resource allocation (DRA) algorithms have been shown to allow server resource allocation to be matched with application workload, which can improve server resource utilisation and drive down costs for service providers. However, research on DRA algorithms has almost exclusively focused on their performance characteristics at small-scale, precluding their useful application in commercial hosting environments, such as those dedicated to supporting cloud computing. In this paper, we show, and subsequently propose a framework to address, the scalability problems of current DRA algorithms. We then build on the proposed framework to develop a highly-scalable algorithm for dynamic resource allocation.
In this paper we introduce a novel transformation pass written using LLVM that performs kernel fusion. We demonstrate the correctness and performance of the pass on several example programs inspired by scientific applications of interest. The method achieves up to 4× speedup relative to unfused versions of the programs, and exact performance parity with manually fused versions. In contrast to previous work, it also requires minimal user intervention. Our approach is facilitated by a new loop fusion algorithm capable of interprocedurally fusing both skewed and unskewed loops in different kernels.
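To make the transformation concrete, here is a hand-written before/after illustration of the kind of fusion the pass automates (illustrative Python, not the LLVM pass itself): the fused kernel avoids writing and re-reading the intermediate array.

import numpy as np

def unfused(a, b):
    # Kernel 1: element-wise scale; Kernel 2: element-wise add.
    # Two separate loops mean `tmp` is written to and re-read from memory.
    tmp = np.empty_like(a)
    for i in range(len(a)):
        tmp[i] = 2.0 * a[i]
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = tmp[i] + b[i]
    return out

def fused(a, b):
    # Fused kernel: one loop, no intermediate array traffic.
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = 2.0 * a[i] + b[i]
    return out

a, b = np.arange(8.0), np.ones(8)
assert np.allclose(unfused(a, b), fused(a, b))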
Mobile sensors can relocate and self-deploy into a network. While focusing on the problems of coverage, existing deployment schemes largely over-simplify the conditions for network connectivity: they either assume that the communication range is large enough for sensors in geometric neighborhoods to obtain location information through local communication, or they assume a dense network that remains connected. In addition, an obstacle-free field or full knowledge of the field layout is often assumed. We present new schemes that are not governed by these assumptions, and thus adapt to a wider range of application scenarios. The schemes are designed to maximize sensing coverage and also guarantee connectivity for a network with arbitrary sensor communication/sensing ranges or node densities, at the cost of a small moving distance. The schemes do not need any knowledge of the field layout, which can be irregular and have obstacles/holes of arbitrary shape. Our first scheme is an enhanced form of the traditional virtual-force-based method, which we term the Connectivity-Preserved Virtual Force (CPVF) scheme. We show that the localized communication, which is the very reason for its simplicity, results in poor coverage in certain cases. We then describe a Floor-based scheme which overcomes the difficulties of CPVF and, as a result, significantly outperforms it and other state-of-the-art approaches. Throughout the paper our conclusions are corroborated by the results from extensive simulations.
Existing algorithms for orienting sensors in directional sensor networks have primarily concerned themselves with the problem of maximizing the number of covered targets, assuming that target identification is a non-issue. Such an assumption, however, does not hold true in all situations. In this paper, heuristic algorithms for choosing active sensors and orienting them with the goal of balancing coverage and identifiability are presented. The performance of the algorithms is verified via extensive simulations, and they are shown to confer increased target identifiability compared to algorithms originally designed simply to maximize the number of targets covered.
Conjugate Gradient (CG) algorithms form a large part of many HPC applications, examples include bioinformatics and weather applications. These algorithms allow numerical solutions to complex linear systems. Understanding how distributed implementations of these algorithms use a network interconnect will allow system designers to gain a deeper insight into their exacting requirements for existing and future applications. This short paper documents our initial investigation into the communication patterns present in the High Performance Conjugate Gradient (HPCG) benchmark. Through our analysis, we identify patterns and features which may warrant further investigation to improve the performance of CG algorithms and applications which make extensive use of them. In this paper, we capture communication traces from runs of the HPCG benchmark at a variety of different processor counts and then examine this data to identify potential performance bottlenecks. Initial results show that there is a fall in the throughput of the network when more processes are communicating with each other, due to network contention.
Large-scale scientific computing applications frequently make use of closely-coupled distributed parallel components. The performance of such applications is therefore dependent on the component parts and their interaction at run-time. This paper describes a methodology for predictive performance modelling and evaluation of parallel applications composed of multiple interacting components. In this paper, the fundamental steps and required operations involved in the modelling and evaluation process are identified - including component decomposition, component model combination, M x N communication modelling, dataflow analysis and overall performance evaluation. A case study is presented to illustrate the modelling process and the methodology is verified through experimental analysis.
With the maturity of technologies, such as HTML5 and JavaScript, and with the increasing popularity of cross-platform frameworks, such as Apache Cordova, mobile cloud computing as a new design paradigm of mobile application developments is becoming increasingly more accessible to developers. Following this trend, future on-device mobile application ecosystems will not only comprise a mixture of native and remote applications, but also include multiple hybrid mobile cloud applications. The resource competition in such ecosystems and its impact over the performance of mobile cloud applications has not yet been studied. In this paper, we study this competition from a game theoretical perspective and examine how it affects the behavior of mobile cloud applications. Three offload decision models of cooperative and non-cooperative nature are constructed and their efficiency compared. We present an extension to the classic load balancing game to model the offload behaviors within a non-cooperative environment. Mixed-strategy Nash equilibria are derived for the non-cooperative offload game with complete information, which further quantifies the price of anarchy in such ecosystems. We present simulation results that demonstrate the differences between each decision model’s efficiency. Our modeling approach facilitates further research in the design of the offload decision engines of mobile cloud applications. Our extension to the classic load balancing game broadens its applicability to real-life applications.
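For reference, the price of anarchy quantified here is conventionally defined, for a cost-minimisation game, as the ratio of the worst equilibrium cost to the socially optimal cost:

\mathrm{PoA} = \frac{\max_{s \in \mathcal{E}} C(s)}{\min_{s \in S} C(s)},

where S is the set of strategy profiles, \mathcal{E} \subseteq S is the set of (here, mixed-strategy Nash) equilibria and C(\cdot) is the social cost.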
Despite the recent successes of nuclear energy researchers, the scientific community still remains some distance from being able to create controlled, self-sustaining fusion reactions. Inertial Confinement Fusion (ICF) techniques represent one possible option to surpass this barrier, with scientific simulation playing a leading role in guiding and supporting their development. The simulation of such techniques allows for safe and efficient investigation of laser design and pulse shaping, as well as providing insight into the reaction as a whole. The research presented here focuses on the simulation code EPOCH, a fully relativistic particle-in-cell plasma physics code concerned with faithfully recreating laser-plasma interactions at scale. A significant challenge in developing large codes like EPOCH is maintaining effective scientific delivery on successive generations of high-performance computing architecture. To support this process, we adopt the use of mini-applications - small code proxies that encapsulate important computational properties of their larger parent counterparts. Through the development of a mini-application for EPOCH (called miniEPOCH), we investigate a variety of the performance features exhibited in EPOCH, expose opportunities for optimisation and increased scientific capability, and offer our conclusions to guide future changes to similar ICF codes.
Repeatable and accurate tests are important when designing hardware and algorithms for solar-powered wireless sensor networks (WSNs). Since no two days are exactly alike with regard to energy harvesting, tests must be carried out indoors. Solar simulators are traditionally used in replicating the effects of sunlight indoors; however, solar simulators are expensive, have lighting elements that have short lifetimes, and are usually not designed to carry out the types of tests that hardware and algorithm designers require. As a result, hardware and algorithm designers use tests that are inaccurate and not repeatable (both for others and also for the designers themselves). In this article, we propose an indoor test methodology that does not rely on solar simulators. The test methodology has its basis in astronomy and photovoltaic cell design. We present a generic design for a test apparatus that can be used in carrying out the test methodology. We also present a specific design that we use in implementing an actual test apparatus. We test the efficacy of our test apparatus and, to demonstrate the usefulness of the test methodology, perform experiments akin to those required in projects involving solar-powered WSNs. Results of the said tests and experiments demonstrate that the test methodology is an invaluable tool for hardware and algorithm designers working with solar-powered WSNs.
As we move towards the exascale era of supercomputing, node-level failures are becoming more commonplace, and frequent checkpointing is currently used to recover from such failures in long-running science applications. While compute performance has steadily improved year-on-year, parallel I/O performance has stalled, meaning checkpointing is fast becoming a bottleneck to performance. Using current file systems in the most efficient way possible will alleviate some of these issues and will help prepare developers and system designers for exascale; unfortunately, many domain scientists simply submit their jobs with the default file system configuration. In this paper, we analyse previous work on finding optimality on Lustre file systems, demonstrating that by exposing parallelism in the parallel file system, performance can be improved by up to 49x. However, we demonstrate that on systems where many applications are competing for a finite number of object storage targets (OSTs), competing tasks may reduce optimal performance considerably. We show that reducing each job's request for OSTs by 40% decreases performance by only 13%, while increasing the availability and quality of service of the file system. Further, we present a series of metrics designed to analyse and explain the effects of contention on parallel file systems. Finally, we re-evaluate our previous work with the Parallel Log-structured File System (PLFS), comparing it to Lustre at various scales. We show that PLFS may perform better than Lustre in particular configurations, but that at large scale PLFS becomes a bottleneck to performance. We extend the metrics proposed in this paper to explain these performance deficiencies in PLFS, demonstrating that the software creates high levels of self-contention at scale.
Hardware accelerators such as GPGPUs are becoming increasingly common in HPC platforms and their use is widely recognised as being one of the most promising approaches for reaching exascale levels of performance. Large HPC centres, such as AWE, have made huge investments in maintaining their existing scientific software codebases, the vast majority of which were not designed to effectively utilise accelerator devices. Consequently, HPC centres will have to decide how to develop their existing applications to take best advantage of future HPC system architectures. Given limited development and financial resources, it is unlikely that all potential approaches will be evaluated for each application. We are interested in how this decision making can be improved, and this work seeks to directly evaluate three candidate technologies (OpenACC, OpenCL and CUDA) in terms of performance, programmer productivity, and portability using a recently developed Lagrangian-Eulerian explicit hydrodynamics mini-application. We find that OpenACC is an extremely viable programming model for accelerator devices, improving programmer productivity and achieving better performance than OpenCL and CUDA.
High-resolution social media data presents an opportunity to better understand people's behavioural patterns and sentiment. Whilst significant work has been conducted in various targeted social contexts, very little is understood about differentiated behaviour across industrial sectors. In this paper, we present results on how social media usage and general sentiment vary across the geographic and industry sector landscape. Unlike existing studies, we use a novel geocomputational approach to link location-specific Twitter data with business sectors by leveraging the UK Standard Industrial Classification Code (SIC Code). Our baseline results for the Greater London area identify that the Construction, Real Estate, Transport and Financial Services industries consistently have stronger Twitter footprints. We go on to apply natural language processing (NLP) techniques to understand the prevailing sentiment within each business sector and discuss how the evidence can contribute towards de-biasing Twitter data. We believe this research will prove a valuable surveillance tool for policy makers and service providers to monitor ongoing sentiment in different industry sectors and to perceive the impact of new policies, and that it can be used as a low-cost alternative to survey methods in organisational studies.
This paper investigates profit optimisation through the partitioning of server pools. Two types of web request are considered: best-effort requests and multi-class QoS-demanding requests. Each request is associated with a certain amount of revenue. The system earns the associated revenue when completing a best-effort request, while the revenue earned for serving a QoS-demanding request depends on the degree of satisfaction. The complete server pool is divided into two clusters of servers, each dedicated to serving one type of request. The total profits accrued when serving these requests are mathematically modelled in this paper. The optimisation equations for the pool partitioning are also constructed so that the maximum profit can be achieved. An optimal server switching policy is also developed; the policy is optimal in the sense that the maximum profit can be maintained by switching servers from one partition to the other. Costs involved in server switching are factored into the switching model. Supportive experimentation based on simulation has been conducted, and the results verify the effectiveness of the pool partitioning policy and the server switching policy developed in this paper.
As the High Performance Computing industry moves towards the exascale era of computing, parallel scientific and engineering applications are becoming increasingly complex. The use of simulation allows us to predict how an application's performance will change with the adoption of new hardware or software, helping to inform procurement decisions. In this paper, we present a disk simulator designed to predict the performance of read and write operations to a single hard disk drive (HDD). Our simulator uses a geometry discovery benchmark (Diskovery) in order to estimate the data layout of the HDD, as well as the time spent moving the read/write head. We validate our simulator against two different HDDs, using a benchmark designed to simulate common disk read and write patterns, demonstrating accuracy to within 5% of the observed I/O time for sequential operations, and to within 10% of the observed time for seek-heavy workloads.
As core counts increase in the world’s most powerful supercomputers, applications are becoming limited not only by computational power, but also by data availability. In the race to exascale, efficient and effective communication policies are key to achieving optimal application performance. Applications using adaptive mesh refinement (AMR) trade off communication for computational load balancing, to enable the focused computation of specific areas of interest. This class of application is particularly susceptible to the communication performance of the underlying architectures, and are inherently difficult to scale efficiently. In this paper we present a study of the effect of patch distribution strategies on the scalability of an AMR code. We demonstrate the significance of patch placement on communication overheads, and by balancing the computation and communication costs of patches, we develop a scheme to optimise performance of a specific, industry-strength, benchmark application.
LOLITA is a large-scale natural language processing system written in the functional language Haskell. It consists of over 47,000 lines of code spread over 180 different modules. There are currently 20 people working on the system, most of whom are Ph.D. students. The majority of research projects involve the development of an application written around a semantic network: the knowledge representation structure at the core of the system. Because of the variety of these applications, developers often join the team with little or no functional programming experience. For this reason, teaching these developers to the level required to implement their respective applications requires teaching at various levels of abstraction. The strategy chosen means that each researcher only needs to be taught at the particular level of abstraction at which they work. These abstractions give rise to the notion of a domain-specific sublanguage; that is, a programming style in which a different language is created for each desired level of abstraction. In this paper we show how functional languages provide the necessary framework to enable these sublanguages to be created.
Resource management constitutes an important infrastructural component of a computational grid environment. The aim of grid resource management is to efficiently schedule applications over the available resources provided by the supporting grid architecture. Such goals within the high performance community rely, in part, on accurate performance prediction capabilities. This paper introduces a resource management infrastructure for grid computing environments. The technique couples application performance prediction with a hierarchical multi-agent system. An initial system implementation utilises the performance prediction capabilities of the PACE toolkit to provide quantitative data regarding the performance of complex applications running on local grid resources. The validation results show that a high level of accuracy can be obtained, that cross-platform comparisons can be easily undertaken, and that the estimates can be evaluated rapidly. A hierarchy of homogeneous agents are used to provide a scalable and adaptable abstraction of the grid system architecture. An agent is a representative of a local grid resource and is considered to be both a service provider and a service requestor. Agents are organised into a hierarchy and cooperate to provide service advertisement and discovery. A performance monitor and advisor has been developed to optimise the performance of the agent system. A case study with corresponding experimental results is included to demonstrate the efficiency of the resource management and scheduling system. The main features of the system include: hard quality of service support using PACE performance prediction capabilities; agent-based dynamic resource advertisement and discovery capabilities; simulation-based quantitative grid performance analysis and user-oriented scheduling of local grid resources.
Significantly increasing intra-node parallelism is widely recognised as being a key prerequisite for reaching exascale levels of computational performance. In future exascale systems it is likely that this performance improvement will be realised by increasing the parallelism available in traditional CPU devices and using massively-parallel hardware accelerators. The MPI programming model is starting to reach its scalability limit and is unable to take advantage of hardware accelerators; consequently, HPC centres (such as AWE) will have to decide how to develop their existing applications to best take advantage of future HPC system architectures. This work seeks to evaluate OpenCL as a candidate technology for implementing an alternative hybrid programming model, and whether it is able to deliver improved code portability whilst also maintaining or improving performance. On certain platforms the performance of our OpenCL implementation is within 4% of an optimised native version.
Pipelined wavefront applications form a large portion of the high performance scientific computing workloads at supercomputing centres. This paper investigates the viability of graphics processing units (GPUs) for the acceleration of these codes, using NVIDIA's Compute Unified Device Architecture (CUDA). We identify the optimisations suitable for this new architecture and quantify the characteristics of those wavefront codes that are likely to experience speedups.
Internet service providers (ISPs) usually use several server pools to host different web applications, to ensure smooth system management and minimum interference between applications. The workload demand in each of the pools can vary dramatically due to a number of factors, including timing and the types of the hosted applications. Therefore, it is desirable that servers should be able to switch between pools to optimise resource usage and maximise company revenue. Internet applications can be modelled as multi-tier queueing networks, with each network station corresponding to each application tier. The advantage of using an analytical model is that performance metrics can be easily computed, and potential system bottlenecks can be identified without running the actual system. In this paper, an analytical model is used to assist dynamic resource allocation in server pools. In addition, an admission control scheme is also used to deal with system overloading. Performance evaluation is conducted via simulation and the experimental results show the benefits of our approach for various workload scenarios.
Performance engineers are beginning to explore software-level optimisation as a means to reduce the energy consumed when running their codes. This paper presents POSE, a mathematical and visual modelling tool which highlights the relationship between runtime and power consumption. POSE allows developers to assess whether power optimisation is worth pursuing for their codes. We demonstrate POSE by studying the power optimisation characteristics of applications from the Mantevo and Rodinia benchmark suites. We show that LavaMD has the most scope for CPU power optimisation, with improvements in Energy Delay Squared Product (ED2P) of up to 30.59%. Conversely, MiniMD offers the least scope, with improvements to the same metric limited to 7.60%. We also show that no power optimised version of MiniMD operating below 2.3 GHz can match the ED2P performance of the original code running at 3.2 GHz. For LavaMD this limit is marginally less restrictive at 2.2 GHz.
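For readers unfamiliar with the metric, the energy-delay-squared product reported here is conventionally defined as

\mathrm{ED^2P} = E \cdot t^2 = P \cdot t^3,

where E is the energy consumed, t the runtime and P the average power; the cubic dependence on runtime is why, beyond a point, running slower to save power cannot improve the metric.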
Modernizing production-grade, often legacy, applications to take advantage of modern multi-core and many-core architectures can be a difficult and costly undertaking. This is especially true currently, as it is unclear which architectures will dominate future systems. The complexity of these codes can mean that parallelisation for a given architecture requires significant re-engineering. One way to assess the benefit of such an exercise is to use mini-applications that are representative of the legacy programs. In this paper, we investigate different implementations of TeaLeaf, a mini-application from the Mantevo suite that solves the linear heat conduction equation. TeaLeaf has been ported to use many parallel programming models, including OpenMP, CUDA and MPI, among others. It has also been re-engineered to use the OPS embedded DSL and the template libraries Kokkos and RAJA. We use these different implementations to assess the performance portability of each technique on modern multi-core systems. While manually parallelising and optimising the application for each platform gives the best performance, this has the obvious disadvantage of requiring a different version for each and every platform of interest. Frameworks such as OPS, Kokkos and RAJA can automatically produce executables of the program that achieve comparable portability. Based on a recently developed performance portability metric, our results show that OPS and RAJA achieve an application performance portability score of 71% and 77% respectively for this application.
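The metric referred to is, we believe, the harmonic-mean performance portability measure of Pennycook et al.; under that definition, for an application a, problem p and platform set H,

\mathrm{PP}(a,p,H) = \begin{cases} \dfrac{|H|}{\sum_{i \in H} 1/e_i(a,p)} & \text{if } a \text{ is supported on every platform } i \in H,\\ 0 & \text{otherwise,} \end{cases}

where e_i(a,p) is the achieved efficiency (for example, the fraction of the best observed performance) on platform i.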
Large scale simulation performance is dependent on a number of components, however the task of investigation and optimization has long favored computational and communication elements above I/O. Manually extracting the pattern of I/O behavior from a parent application is a useful way of working to address performance issues on a per-application basis, but developing workflows with some degree of automation and flexibility provides a more powerful approach to tackling current and future I/O challenges. In this paper we describe a workload replication workflow that extracts the I/O pattern of an application and recreates its behavior with a flexible proxy application. We demonstrate how simple lightweight characterization can be translated to provide an effective representation of a physics application, and show how a proxy replication can be used as a tool for investigating I/O library paradigms.
The automatic allocation of enterprise workload to resources can be enhanced by being able to make what-if response time predictions whilst different allocations are being considered. We experimentally investigate an historical and a layered queuing performance model and show how they can provide a good level of support for a dynamic-urgent cloud environment. Using this we define, implement and experimentally investigate the effectiveness of a prediction-based cloud workload and resource management algorithm. Based on these experimental analyses we: (i) comparatively evaluate the layered queuing and historical techniques; (ii) evaluate the effectiveness of the management algorithm in different operating scenarios; and (iii) provide guidance on using prediction-based workload and resource management.
The diverging gap between processor and memory performance has been a well discussed aspect of computer architecture literature for some years. The use of multi-core processor designs has, however, brought new problems to the design of memory architectures - increased core density without matched improvement in memory capacity is reducing the available memory per parallel process. Multiple cores accessing memory simultaneously degrades performance as a result of resource contention for memory channels and physical DIMMs. These issues combine to ensure that memory remains an on-going challenge in the design of parallel algorithms which scale. In this paper we present WMTrace, a lightweight tool to trace and analyse memory allocation events in parallel applications. This tool is able to dynamically link to pre-existing application binaries requiring no source code modification or recompilation. A post-execution analysis stage enables in-depth analysis of traces to be performed allowing memory allocations to be analysed by time, size or function. The second half of this paper features a case study in which we apply WMTrace to five parallel scientific applications and benchmarks, demonstrating its effectiveness at recording high-water mark memory consumption as well as memory use per-function over time. An in-depth analysis is provided for an unstructured mesh benchmark which reveals significant memory allocation imbalance across its participating processes.
Computational Grids allow large-scale, pervasive and consistent sharing of geographically dispersed resources. Their inherent nature incorporates issues including the discovery of resources located in different administrative domains, predicting the performance of those resources and monitoring their behaviour. The Monitoring and Discovery Service (MDS), one of the pillars provided by the Globus toolkit, can be used to offer Grid information services to an existing agent-based resource advertisement and discovery system. This paper presents an agent system which implements the GRid Information Protocol (GRIP) and the GRid Registration Protocol (GRRP) of the MDS to discover virtual organisations and monitor their respective resources. The resulting system provides resource brokering, monitoring and performance prediction.
The physical and social processes in urban systems are inherently spatial and hence data describing them contain spatial autocorrelation (a proximity-based interdependency on a variable) that needs to be accounted for. Standard k-fold cross-validation (KCV) techniques that attempt to measure the generalisation performance of machine learning and statistical algorithms are inappropriate in this setting due to their inherent i.i.d. assumption, which is violated by spatial dependency. As such, more appropriate validation methods have been considered, notably blocking and spatial k-fold cross-validation (SKCV). However, the physical barriers and complex network structures which make up a city's landscape mean that these methods are also inappropriate, largely because the travel patterns (and hence spatial autocorrelation (SAC)) in most urban spaces are rarely Euclidean in nature. To overcome this problem, we propose a new road-distance and travel-time k-fold cross-validation method, RT-KCV. We show how this outperforms the prior art in providing better estimates of the true generalisation performance on unseen data.
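A minimal sketch (illustrative only, not the RT-KCV implementation) of building cross-validation folds from a precomputed road-distance or travel-time matrix rather than from random splits: fold seeds are spread out by farthest-point sampling, and every observation joins the fold of its nearest seed, giving spatially separated folds.

import numpy as np

def distance_based_folds(D, k, seed=0):
    """Assign n observations to k folds using a precomputed n x n distance
    matrix D (e.g. road distances or travel times between observation sites).
    Seeds are spread out by farthest-point sampling; each point then joins
    the fold of its nearest seed."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    seeds = [int(rng.integers(n))]
    while len(seeds) < k:
        # next seed is the point farthest (by road distance) from all current seeds
        seeds.append(int(np.argmax(D[:, seeds].min(axis=1))))
    return np.argmin(D[:, seeds], axis=1)    # fold label per observation

# Hypothetical example with Euclidean distances standing in for travel times
pts = np.random.default_rng(1).random((100, 2))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
folds = distance_based_folds(D, k=5)
print(np.bincount(folds))                    # fold sizes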
This paper investigates the development of a molecular dynamics code that is highly portable between architectures. Using OpenCL, we develop an implementation of Sandia's miniMD benchmark that achieves good levels of performance across a wide range of hardware: CPUs, discrete GPUs and integrated GPUs. We demonstrate that the performance bottlenecks of miniMD's short-range force calculation kernel are the same across these architectures, and detail a number of platform-agnostic optimisations that improve its performance by at least 2x on all hardware considered. Our complete code is shown to be 1.7x faster than the original miniMD, and at most 2x slower than implementations individually hand-tuned for a specific architecture.
With the increasing complexity of memory architectures and scientific applications, developing data structures that are performant, portable, scalable, and support developer productivity, is a challenging task. In this paper, we present Warwick Data Store (WDS), a lightweight and extensible C++ template library designed to manage these complexities and allow rapid prototyping. WDS is designed to abstract details of the underlying data structures away from the user, thus easing application development and optimisation. We show that using WDS does not significantly impact achieved performance across a variety of different scientific benchmarks and proxy-applications, compilers, and different architectures. The overheads are largely below 30% for smaller problems, with the overhead decreasing to below 10% when using larger problems. This shows that the library does not significantly impact the performance, while providing additional functionality to data structures, and the ability to optimise data structures without changing the application code.
Performance modelling is an important tool utilised by the High Performance Computing industry to accurately predict the run-time of science applications on a variety of different architectures. Performance models aid in procurement decisions and help to highlight areas for possible code optimisations. This paper presents a performance model for a magnetohydrodynamics physics application, Lare. We demonstrate that this model is capable of accurately predicting the run-time of Lare across multiple platforms with an accuracy of 90% (for both strong and weak scaled problems). We then utilise this model to evaluate the performance of future optimisations. The model is generated using SST/macro, the machine level component of the Structural Simulation Toolkit (SST) from Sandia National Laboratories, and is validated on both a commodity cluster located at the University of Warwick and a large scale capability resource located at Lawrence Livermore National Laboratory.
One of the most important metrics of machine efficiency in HPC is job turnaround time, the time taken for a user to submit a job and receive their results. This time consists of two primary components: run-time, which depends on the resources allocated to the job, and queue wait time, which depends on the resources requested and the present level of machine usage. This paper investigates the effect of applying application performance modelling techniques to producing run-time estimates for jobs to be scheduled on parallel High Performance Computing systems. Our aim through the development of such tools is to improve turnaround time for jobs across the system. This investigation is performed from the perspective of a community of HPC users who would make use of a tool to assist in job submission. We implement and validate a higher-performance implementation of the scheduling algorithm used by the Maui scheduler and demonstrate that it matches the behaviour of an existing Maui configuration. We formulate a method for generating potential performance model results and use it to modify three workloads from production supercomputers to include generated performance model wall-time estimates. We then apply the scheduling simulator to these workloads in order to simulate the effect of using such a tool on the three real workloads. Examining the results from the simulation, we show a randomly selected sample set of tool users obtaining an improvement of up to 23% in average queuing time, and conclude that a tool that uses performance models to generate improved wall-time estimates would be beneficial to users of HPC systems.
All-Reduce is a collective-combine operation frequently utilised in synchronous parameter updates in parallel machine learning algorithms. The performance of this operation - and subsequently of the algorithm itself - is heavily dependent on its implementation, configuration and on the supporting hardware on which it is run. Given the pivotal role of all-reduce, a failure in any of these regards will significantly impact the resulting scientific output. In this research we explore the performance of alternative all-reduce algorithms in data-flow graphs and compare these to the commonly used reduce-broadcast approach. We present an architecture and interface for all-reduce in task-based frameworks, and a parallelization scheme for object-serialization and computation. We present a concrete, novel application of a butterfly all-reduce algorithm on the Apache Spark framework on a high-performance compute cluster, and demonstrate the effectiveness of the new butterfly algorithm with a logarithmic speed-up with respect to the vector length compared with the original reduce-broadcast method - a 9x speed-up is observed for vector lengths in the order of 10^8. This improvement comprises both algorithmic changes (65%) and parallel-processing optimization (35%). The effectiveness of the new butterfly all-reduce is demonstrated using real-world neural network applications with the Spark framework. For the model-update operation we observe significant speed-ups using the new butterfly algorithm compared with the original reduce-broadcast, for both smaller (CIFAR and MNIST) and larger (ImageNet) datasets.
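The butterfly (recursive-doubling) pattern at the heart of this result can be sketched outside of Spark: at step s each worker combines its partial result with that of the worker whose index differs in bit s, so all workers hold the reduced vector after log2(p) steps instead of funnelling through a single reducer. The sketch below is a plain NumPy illustration, not the paper's Spark implementation.

```python
import numpy as np

def butterfly_allreduce(vectors):
    """Butterfly (recursive-doubling) all-reduce over a list of equal-length
    vectors, one per 'worker'.  At step s, worker i combines its partial sum
    with that of worker i XOR 2**s, so after log2(p) steps every worker holds
    the full sum.  A plain NumPy illustration, not the Spark implementation."""
    p = len(vectors)
    assert p > 0 and p & (p - 1) == 0, "sketch assumes a power-of-two worker count"
    partial = [np.asarray(v, dtype=float) for v in vectors]
    step = 1
    while step < p:
        # build a new list so every worker reads its partner's previous value
        partial = [partial[i] + partial[i ^ step] for i in range(p)]
        step *= 2
    return partial  # every 'worker' now holds the same reduced vector
```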
In this paper we define a Distributed Arbitrary Segment Tree (DAST), a distributed tree-like structure that layers the range query processing mechanism over public Distributed Hash Table (DHT) services. Compared with traditional segment trees, the arbitrary segment tree used by a DAST reduces the number of key-space segments that need to be maintained, which in turn results in fewer query operations and lower overheads. Moreover, considering that range queries often contain redundant entries that the clients do not need, we introduce the concept of Accuracy of Results (AoR) for range queries. We demonstrate that by adjusting AoR, the DHT operational overhead can be improved. DAST is implemented on a well-known public DHT service (OpenDHT) and validation through experimentation and supporting simulation is performed. The results demonstrate the effectiveness of DAST over existing methods.
Identification of regions with untypical percentage G+C composition and dinucleotide signatures are two genome analysis techniques used in the identification of horizontally acquired DNA. We describe a generic program framework for performing both types of analysis in linear time. The approach is extended for length >2 oligonucleotide signatures. Using the derived program we test some of the conclusions of Karlin and Burge, the primary exponents of these techniques. We demonstrate that no single method of signature analysis is sufficient for the complete identification of horizontally acquired DNA. Consequently we produce a fast program - and a robust methodology - for the production and employment of genome signatures in the identification of horizontally acquired DNA.
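For illustration, both analyses can be phrased as simple linear-time passes over the sequence: a running G+C count over sliding windows, and Karlin-style relative dinucleotide abundances rho(XY) = f(XY)/(f(X)f(Y)). The sketch below uses illustrative window and step sizes and is not the program framework described in the paper.

```python
from collections import Counter
from itertools import product

def gc_windows(seq, window=5000, step=1000):
    """Percentage G+C in sliding windows, computed in linear time via a prefix
    sum.  Windows whose G+C deviates strongly from the genome average are
    candidate horizontally acquired regions.  Window/step sizes are illustrative."""
    prefix = [0]
    for base in seq:
        prefix.append(prefix[-1] + (1 if base in "GCgc" else 0))
    return [(start, 100.0 * (prefix[start + window] - prefix[start]) / window)
            for start in range(0, len(seq) - window + 1, step)]

def dinucleotide_signature(seq):
    """Relative dinucleotide abundances rho(XY) = f(XY) / (f(X) * f(Y)),
    the signature popularised by Karlin; a single-strand sketch."""
    seq = seq.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n, m = max(sum(mono.values()), 1), max(sum(di.values()), 1)
    return {x + y: ((di[x + y] / m) / ((mono[x] / n) * (mono[y] / n))
                    if mono[x] and mono[y] else 0.0)
            for x, y in product("ACGT", repeat=2)}
```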
As core counts increase in the world's most powerful supercomputers, applications are becoming limited not only by computational power, but also by data availability. In the race to exascale, efficient and effective communication policies are key to achieving optimal application performance. Applications using adaptive mesh refinement (AMR) trade off communication for computational load balancing, to enable the focused computation of specific areas of interest. This class of application is particularly susceptible to the communication performance of the underlying architectures, and is inherently difficult to scale efficiently. In this paper we present a study of the effect of patch distribution strategies on the scalability of an AMR code. We demonstrate the significance of patch placement on communication overheads, and by balancing the computation and communication costs of patches, we develop a scheme to optimise performance of a specific, industry-strength, benchmark application.
The automatic allocation of enterprise workload to resources can be enhanced by being able to make 'what-if' response time predictions whilst different allocations are being considered. It is important to quantitatively compare the effectiveness of different prediction techniques for use in cloud infrastructures. To help make the comparison relevant to a wide range of possible cloud environments, it is useful to consider: (1) urgent cloud customers, such as the emergency services, that can demand cloud resources at short notice (e.g. for our FireGrid emergency response software); (2) dynamic enterprise systems that must rapidly adapt to frequent changes in workload, system configuration and/or available cloud servers; (3) the use of the predictions in a coordinated manner by both the cloud infrastructure and cloud customer management systems; and (4) a broad range of criteria for evaluating each technique. However, there have been no previous comparisons meeting these requirements. This paper, meeting the above requirements, quantitatively compares the layered queuing and ('HYDRA') historical techniques - including our initial thoughts on how they could be combined. Supporting results and experiments include: (i) defining, investigating and hence providing guidelines on the use of a historical and layered queuing model; (ii) using these guidelines to show that both techniques can make low-overhead and typically over 70% accurate predictions, for new server architectures for which only a small number of benchmarks have been run; and (iii) defining and investigating tuning a prediction-based cloud workload and resource management algorithm.
Contemporary High Performance Computing (HPC) applications can exhibit unacceptably high overheads when existing instrumentation-based performance analysis tools are applied. Our experience shows that for some sections of these codes, existing instrumentation-based tools can cause, on average, a fivefold increase in runtime. In a performance modelling context, these less representative runs can misdirect the modelling process. We present an approach to recording call paths for optimised HPC application binaries, without the need for instrumentation. As a result, a new tool has been developed which complements our work on analytical- and simulation-based performance modelling. The utility of this approach, in terms of low and consistent runtime overhead, is demonstrated by a comparative evaluation against existing tools for a range of recognised HPC benchmark codes.
Benchmarking and analyzing I/O performance across high performance computing (HPC) platforms is necessary to identify performance bottlenecks and guide effective use of new and existing storage systems. Doing this with large production applications, which can often be commercially sensitive and lack portability, is not a straightforward task and the availability of a representative proxy for I/O workloads can help to provide a solution. We use Darshan I/O characterization and the MACSio proxy application to replicate five production workloads, showing how these can be used effectively to investigate I/O performance when migrating between HPC systems ranging from small local clusters to leadership scale machines. Preliminary results indicate that it is possible to generate datasets that match the target application with a good degree of accuracy. This enables a predictive performance analysis study of a representative workload to be conducted on five different systems. The results of this analysis are used to identify how workloads exhibit different I/O footprints on a file system and what effect file system configuration can have on performance.
Role-Based Access Control (RBAC) remains one of the most popular authorization control mechanisms. A workflow is a business flow composed of several related tasks, which are interrelated and context-dependent during their execution. Under many circumstances the execution context introduces uncertainty into authorization decisions for tasks. This paper investigates the role-based authorization model with runtime context constraints and dynamic cardinality constraints. A Generalized Stochastic Petri net is used to model the authorization process. Moreover, due to the state explosion problem in the Petri-net formalism, the proposed modeling method is combined with queuing theory to analyze both system-oriented and user-oriented performance. Given the workflow information, its running context and the authorization policies, this work can be used to predict the performance of these workflows running in the system. The prediction information gives insight into how to adjust authorization policies to strike a better balance between security and performance.
In a multicluster architecture, where jobs can be submitted through each constituent cluster, the job arrival rates in individual clusters may be uneven and the load therefore needs to be balanced among clusters. In this paper we investigate load balancing for two types of jobs, namely non-QoS and QoS-demanding jobs, and as a result two performance-specific load balancing strategies (called ORT and OMR) are developed. The ORT strategy is used to obtain the optimised mean response time for non-QoS jobs and the OMR strategy is used to achieve the optimised mean miss rate for QoS-demanding jobs. The ORT and OMR strategies are mathematically modelled using queuing network theory to establish sets of optimisation equations. Numerical solutions are developed to solve these optimisation equations, and a so-called fair workload level is determined for each cluster. When the current workload in a cluster reaches this pre-calculated fair workload level, the jobs subsequently submitted to the cluster are transferred to other clusters for execution. The effectiveness of both strategies is demonstrated through theoretical analysis and experimental verification. The results show that the proposed load balancing mechanisms bring about considerable performance gains for both job types, while the job transfer frequency among clusters is considerably reduced. This has a number of advantages, in particular in the case where scheduling jobs to remote resources involves the transfer of large executable and data files.
Imbalanced workload distribution can significantly degrade the performance of grid computing environments. In the past, the theory of divisible load has been widely investigated in static heterogeneous systems. However, it has not been widely applied to grid environments, which are characterized by heterogeneous resources and dynamic conditions. In this paper, we propose a performance-based approach to workload distribution for master-slave types of applications on grids. Furthermore, applications with irregular workloads are addressed. We implemented three kinds of applications and conducted experiments on our grid test-beds. Experimental results show that this approach performs more efficiently than conventional schemes. Consequently, we claim that dynamic workload distribution can benefit applications in grid environments.
This paper details the construction of an analytical performance model of HYDRA, a production nonlinear multigrid solver used by Rolls-Royce for computational fluid dynamics simulations. The model captures both the computational behaviour of HYDRA's key subroutines and the behaviour of its proprietary communication library, OPlus, with an absolute error consistently under 16% on up to 384 cores of an Intel X5650-based commodity cluster. We demonstrate how a performance model can be used to highlight performance bottlenecks and unexpected communication behaviours, thereby guiding code optimisation efforts. Informed by model predictions, we implement an optimisation in OPlus that decreases the communication and synchronisation time by up to 3.01 times and consequently improves total application performance by 1.41 times.
We propose a generalizable framework for the population estimation of dense, informal settlements in low-income urban areas (so-called 'slums') using high-resolution satellite imagery. Precise population estimates are a crucial factor for efficient resource allocation by government authorities and NGOs, for instance in medical emergencies. We utilize equitable ground-truth data, which is gathered in collaboration with local communities: through training and community mapping, the local population contributes their unique domain knowledge, while also maintaining agency over their data. This practice allows us to avoid carrying forward potential biases into the modeling pipeline, which might arise from a less rigorous ground-truthing approach. We contextualize our approach with respect to the ongoing discussion within the machine learning community, aiming to make real-world machine learning applications more inclusive, fair and accountable. Because of the resource-intensive ground-truth generation process, our training data is limited. We propose a gridded population estimation model, enabling flexible and customizable spatial resolutions. We test our pipeline on three experimental sites in Nigeria, utilizing pre-trained and fine-tuned vision networks to overcome data sparsity. Our findings highlight the difficulties of transferring common benchmark models to real-world tasks. We discuss this and propose steps forward.
It is assumed in this paper that periodic real-time applications are being run on a cluster of heterogeneous workstations, and new non-periodic real-time applications arrive at the system dynamically. In the dynamic scheduling scheme presented in this paper, the new applications are scheduled in such a way that they utilize spare capabilities left by existing periodic applications in the cluster. An admission control is introduced so that new applications are rejected by the system if their deadlines cannot be met. The effectiveness of the proposed scheduling scheme has been evaluated using simulation; experimental results show that the system utilization is significantly improved.
Block-structured adaptive mesh refinement is a technique that can be used when solving partial differential equations to reduce the number of zones necessary to achieve the required accuracy in areas of interest. These areas (shock fronts, material interfaces, etc.) are recursively covered with finer mesh patches that are grouped into a hierarchy of refinement levels. Despite the potential for large savings in computational requirements and memory usage without a corresponding reduction in accuracy, AMR adds overhead in managing the mesh hierarchy, adding complex communication and data movement requirements to a simulation. In this paper, we describe the design and implementation of a native GPU-based AMR library, including: the classes used to manage data on a mesh patch, the routines used for transferring data between GPUs on different nodes, and the data-parallel operators developed to coarsen and refine mesh data. We validate the performance and accuracy of our implementation using three test problems and two architectures: an eight-node cluster, and over four thousand nodes of Oak Ridge National Laboratory’s Titan supercomputer. Our GPU-based AMR hydrodynamics code performs up to 4.87× faster than the CPU-based implementation, and has been scaled to over four thousand GPUs using a combination of MPI and CUDA.
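As a simple illustration of the coarsen and refine operators that any block-structured AMR library must provide, the sketch below shows a 2:1 restriction (averaging 2x2 fine cells onto one coarse cell) and a prolongation by injection. It is written with NumPy for clarity and is not the CUDA kernels used in the library described above.

```python
import numpy as np

def coarsen(fine):
    """Restrict a fine patch onto the next coarser level by averaging each
    2x2 block of fine cells into one coarse cell (assumes even dimensions)."""
    return 0.25 * (fine[0::2, 0::2] + fine[1::2, 0::2] +
                   fine[0::2, 1::2] + fine[1::2, 1::2])

def refine(coarse):
    """Prolong a coarse patch onto the next finer level by injection: each
    coarse cell is copied into a 2x2 block of fine cells."""
    return np.repeat(np.repeat(coarse, 2, axis=0), 2, axis=1)
```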
This book constitutes the refereed proceedings of the 8th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems, PMBS 2017, held in Denver, Colorado, USA, in November 2017. The 10 full papers and 3 short papers included in this volume were carefully reviewed and selected from 36 submissions. They were organized in topical sections named: performance evaluation and analysis; performance modeling and simulation; and short papers.
Crafting scalable analytics in order to extract actionable business intelligence is a challenging endeavour, requiring multiple layers of expertise and experience. Often, this expertise is irreconcilably split between an organisation’s engineers and subject matter domain experts. Previous approaches to this problem have relied on technically adept users with tool-specific training. Such an approach has a number of challenges: Expertise — There are few data-analytic subject domain experts with in-depth technical knowledge of compute architectures; Performance — Analysts do not generally make full use of the performance and scalability capabilities of the underlying architectures; Heterogeneity — Calculating the most performant and scalable mix of real-time (on-line) and batch (off-line) analytics in a problem domain is difficult; Tools — Supporting frameworks will often direct several tasks, including composition, planning, code generation, validation, performance tuning and analysis, but do not typically provide end-to-end solutions embedding all of these activities. In this paper, we present a novel semi-automated approach to the composition, planning, code generation and performance tuning of scalable hybrid analytics, using a semantically rich type system which requires little programming expertise from the user. This approach is the first of its kind to permit domain experts with little or no technical expertise to assemble complex and scalable analytics, for hybrid on- and off-line analytic environments, with no additional requirement for low-level engineering support. This paper describes (i) an abstract model of analytic assembly and execution, (ii) goal-based planning and (iii) code generation for hybrid on- and off-line analytics. An implementation, through a system which we call Mendeleev, is used to (iv) demonstrate the applicability of this technique through a series of case studies, where a single interface is used to create analytics that can be run simultaneously over on- and off-line environments. Finally, we (v) analyse the performance of the planner, and (vi) show that the performance of Mendeleev’s generated code is comparable with that of hand-written analytics. •A new abstract model of assembly and execution for arbitrary analytics, centred around a semantically rich type system.•Goal-based planning of hybrid analytic applications using this abstract model, requiring little programming ability from the user.•Automatic code generation across scalable compute architectures, integrating heterogeneous on- and off-line runtime environments.•Validation of the planning approach through its application to four case studies in telecommunications and image analysis, including an exploration of the performance and scalability of the planning engine for each of these case studies.•A demonstration of comparable performance with equivalent hand-written alternatives in both on- and off-line runtime environments.
We present StressBench, a network benchmarking framework written for testing MPI operations and file I/O concurrently. It is designed specifically to execute MPI communication and file access patterns that are representative of real-world scientific applications. Existing tools consider either the worst case congestion with small abstract patterns or peak performance with simplistic patterns. StressBench allows for a richer study of congestion by allowing orchestration of network load scenarios that are representative of those typically seen at HPC centres, something that is difficult to achieve with existing tools. We demonstrate the versatility of the framework from micro benchmarks through to finely controlled congested runs across a cluster. Validation of the results using four proxy application communication schemes within StressBench against parent applications shows a maximum difference of 15%. Using the I/O modeling capabilities of StressBench, we are able to quantify the impact of file I/O on application traffic showing how it can be used in procurement and performance studies.
Scalable management and scheduling of dynamic grid resources requires new technologies to build the next generation intelligent grid environments. This work demonstrates that AI techniques can be utilised to achieve effective workload and resource management. A combination of intelligent agents and multi-agent approaches is applied to both local grid resource scheduling and global grid load balancing. Each agent is a representative of a local grid resource and utilises predictive application performance data with iterative heuristic algorithms to engineer local load balancing across multiple hosts. At a higher level, agents cooperate with each other to balance workload using a peer-to-peer service advertisement and discovery mechanism.
This volume contains the 9 full papers presented at the 6th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 2015), held as part of the 27th ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2015) at the Austin Convention Center in Austin, Texas on 15-20 November 2015. The SC conference series is the premier international forum for high-performance computing, networking, storage and analysis. The conference is unique in that it hosted a wide range of international participants from academia, national laboratories and industry, and featured over 350 exhibitors in the industry's largest HPC technology fair. This year's conference was themed HPC transforms, recognising the immense impact that high-performance computing has on our lives. Specifically, SC 2015 focussed not only on the very visible way in which HPC is changing the world around us, but also on how HPC is improving every aspect of our lives in the most unexpected ways.
Web service technology will provide a platform for next generation dynamic e-business applications. This paper describes a framework for identifying, monitoring and reporting performance data of critical transactions within a web service using the Java ARM standard, a Transaction Definition Language (TDL) and an Automated Instrumentation technique. Gourmet2Go, a demonstrator provided with the IBM web services toolkit, is used as a case study.
This paper describes a post-mortem call-graph profiling tool that analyses trace information generated during the execution of BSPlib programs. The purpose of the tool is to expose imbalance in either computation or communication, and to highlight portions of code that are amenable to improvement. One of the major benefits of this tool is that the amount of information displayed when visualising a profile for a parallel program is no more complex than that of a sequential program. The simplicity and practical relevance of the tool is demonstrated by analysing an SQL database query evaluation program. The tool is used to guide a series of optimisations that minimise the communication imbalance in the SQL program; this results in significant improvements in the performance of the parallel algorithms used in the program. Unlike other profiling tools, the architecture independent metric of size of data communicated is used to guide the optimisation process.
In recent years the High Performance Computing (HPC) industry has benefited from the development of higher density multi-core processors. With recent chips capable of executing up to 32 tasks in parallel, this rate of growth also shows no sign of slowing. Alongside the development of denser micro-processors has been the considerably more modest rate of improvement in random access memory (RAM) capacities. The effect has been that the available memory-per-core has reduced and current projections suggest that this is still set to reduce further. In this paper we present three studies into the use and measurement of memory in parallel applications; our aim is to capture, understand and, if possible, reduce the memory-per-core needed by complete, multi-component applications. First, we present benchmarked memory usage and runtimes of six scientific benchmarks, which represent algorithms that are common to a host of production-grade codes. Memory usage of each benchmark is measured and reported for a variety of compiler toolkits, and we show greater than 30% variation in memory high-water mark requirements between compilers. Second, we utilise this benchmark data combined with runtime data, to simulate, via the Maui scheduler simulator, the effect on a multi-science workflow if memory-per-core is reduced from 1.5GB to only 256MB. Finally, we present initial results from a new memory profiling tool currently in development at the University of Warwick. This tool is applied to a finite-element benchmark and is able to map high-water-mark memory allocations to individual program functions. This demonstrates a lightweight and accurate method of identifying potential memory problems, a technique we expect to become commonplace as memory capacities decrease.
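One lightweight way to obtain a process's memory high-water mark on Linux, of the kind such profiling builds upon, is to read the VmHWM field from /proc/self/status. The sketch below illustrates this; it is not the Warwick tool itself.

```python
def memory_high_water_mark_kb():
    """Return the process's peak resident set size in kB by reading the VmHWM
    field of /proc/self/status (Linux only).  One lightweight way to obtain a
    memory high-water mark; not the Warwick profiling tool itself."""
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith("VmHWM:"):
                return int(line.split()[1])   # value is reported in kB
    return None
```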
This paper presents a unique study into how to identify a meso-level normative (i.e., institutional) hierarchy of procedures that aim to deliver the ecological status of waterbodies in the UK. Using traditional survey and workshop methods, the majority of recent studies concentrate on engagement practices between macro- (government bodies) and micro- (local residents) level structures, which can be potentially replicated elsewhere. Meso-level elements (middle-level structures of control) are often regarded as ‘subjective institutional change’, e.g., failures to implement programs locally or misinterpretations of reflexive dialogs with communities. Nevertheless, it is often only meso-level structures that are capable of promoting and replicating policies elsewhere. At the same time, there is increasing appeal by governmental departments for communities to ‘self-organize’ and take responsibility for prioritizing environmental tasks, which themselves might be instigated by local trusts and voluntary organizations, the existence of which remains largely unaccounted for by central offices. The recent proliferation of Twitter accounts, with the prominent themes of water, ecology and ecosystems, which include people, organizations, businesses and ‘bots’ of various types, presents new opportunities for digital methods to gain insights into structures and functions of these virtual communities. We hypothesize that our methods can produce invaluable insights into the ‘crafting’ of environmental institutions through approaches commonly ignored by traditional ‘analog’ meso-level mechanisms. We use the example of Integrated Catchment Management in the UK, and specifically the Tamar Catchment in southwest England, in order to demonstrate how well Twitter can capture this transitory meso-level environmental political system.
Input/output (I/O) operations are one of the biggest challenges facing scientific computing as it transitions to exascale. The traditional software stack – comprising parallel file systems, middleware and high-level libraries – has evolved to enable applications to better cope with the demands of enormous datasets. This software stack makes high performance parallel I/O easily accessible to application engineers, however it is important to ensure best performance is not compromised through attempts to enrich these libraries. We present MINIO, a benchmark for the investigation of I/O behaviour focusing on understanding overheads and inefficiencies in high-level library usage. MINIO uses HDF5 and TyphonIO to explore I/O at scale using different application behavioural patterns. A case study is performed using MINIO to identify performance-limiting characteristics present in the TyphonIO library as an example of performance discrepancies in the I/O stack.
Power constraints are forcing HPC systems to continue to increase hardware concurrency. Efficiently scaling applications on future machines will be essential for improved science and it is recognised that the “flat” MPI model will start to reach its scalability limits. The optimal approach is unknown, necessitating the use of mini-applications to rapidly evaluate new approaches. Reducing MPI task count through the use of shared memory programming models will likely be essential. We examine different strategies for improving the strong-scaling performance of explicit Hydrodynamics applications. Using the CloverLeaf mini-application across multiple generations of Cray platforms (XC30, XK6 and XK7), we show the utility of the hybrid approach and document our experiences with OpenMP, CUDA, OpenCL and OpenACC under both the PGI and CCE compilers. We also evaluate Cray Reveal as a tool for automatically hybridising HPC applications and Cray’s MPI rank to network topology-mapping tools for improving application performance.
The paper describes a novel approach for building a UK-wide Automated Land Valuation Model and its implementation into commercial online software. We examine existing approaches to land valuation used in the UK, notably Trade Area Analysis, Spatial Interaction and Comparable Sales. We make the case that land use analysis, demographics and societal preferences affect the potential income and optimal use of parcels of land and hence the value of those parcels. This hypothesis leads to the introduction of a number of additional factors required to facilitate the estimation of land value, including traffic flow, population and site suitability. A number of artificial intelligence (AI) and machine learning spatial-temporal techniques are introduced to predict the value of all land parcels sold since 1995. We introduce a new technique, which includes (i) the application of Support Vector Machines to land use analysis; (ii) the use of predictive techniques for macro-environmental factors; (iii) the use of large, open-source data sets to improve valuation; and (iv) industry alignment through a predefined industrial tool. A number of different mathematical techniques are used to validate the proposed model and we show that our model demonstrates 92% accuracy for residential pricing predictions.
A call-graph profiling tool has been designed and implemented to analyse the efficiency of programs written in BSPlib. This tool highlights computation and communication imbalance in parallel programs, exposing portions of program code which are amenable to improvement. A unique feature of this profiler is that it uses the bulk synchronous parallel cost model, thus providing a mechanism for portable and architecture-independent parallel performance tuning. In order to test the capabilities of the model on a real-world example, the performance characteristics of an SQL query processing application are investigated on a number of different parallel architectures.
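The BSP cost model underpinning this architecture-independent tuning charges each superstep its local computation, the cost of routing an h-relation at g time units per word, and a barrier cost; a one-line sketch makes the metric concrete (w, h, g and l below are the standard BSP symbols, not values from the paper).

```python
def bsp_superstep_cost(w, h, g, l):
    """Cost of one BSP superstep: maximum local computation w, plus an
    h-relation (the largest number of words sent or received by any process)
    charged at g time units per word, plus the barrier synchronisation cost l."""
    return w + h * g + l
```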
This document reports on the six month survey of the UK e-Science demonstrator projects. The report details the eight project consultations in which we have engaged, and provides a summary of the likely compatibility of these projects with the Warwick middleware research. Two of these eight projects have been identified for further collaboration, details of which are documented. In addition, this consultation has been extended to include UK e-Science pilot projects and the details of planned collaboration with one of these pilot projects also forms part of this report.
Predictive performance models of e-Commerce applications will allow Grid workload managers to provide e-Commerce clients with qualities of service (QoS) whilst making efficient use of resources. This paper demonstrates the use of two ‘coarse-grained’ modelling approaches (based on layered queuing modelling and historical performance data analysis) for predicting the performance of dynamic e-Commerce systems on heterogeneous servers. Results for a popular e-Commerce benchmark show how request response times and server throughputs can be predicted on servers with heterogeneous CPUs at different background loads. The two approaches are compared and their usefulness to Grid workload management is considered.
In this paper we present a preliminary predictive model for a key biomedical imaging application in the UK e-Science IXI (Information eXtraction from Images) project. This code represents a significant challenge for our existing performance prediction tools as it has internal structures that exhibit highly variable runtimes depending on qualities in the input data provided. Since the runtime can vary by more than an order of magnitude, it has been difficult to apply meaningful quality of service criteria to workflows that use this code. The model developed here is used in the context of an interactive scheduling system which provides rapid feedback to the users, allowing them to tailor their workloads to available resources, or to allocate extra resources to scheduled workloads.
This paper details the development and application of a model for predictive performance analysis of a pipelined synchronous wavefront application running on commodity processor cluster systems. The performance model builds on existing work [1] by including extensions for modern commodity processor architectures. These extensions, including coarser hardware benchmarking, prove to be essential in countering the effects of modern superscalar processors (e.g. multiple operation pipelines and on-the-fly optimisations), complex memory hierarchies, and the impact of applying modern optimising compilers. The process of application modelling is also extended, combining static source code analysis with run-time profiling results for increased accuracy. The model is validated on several high performance SMP systems and the results show a high predictive accuracy.
Since web workloads are known to vary dynamically with time, in this paper, we argue that dynamic resource allocation techniques are necessary to provide guarantees to web applications running on shared data centers. To address this issue, we use a system architecture that combines online measurements with prediction and resource allocation techniques. To capture the transient behavior of the application workloads, we model a server resource using a time-domain description of a generalized processor sharing (GPS) server. This model relates application resource requirements to their dynamically changing workload characteristics. The parameters of this model are continuously updated using an online monitoring and prediction framework. This framework uses time series analysis techniques to predict expected workload parameters from measured system metrics. We then employ a constrained non-linear optimization technique to dynamically allocate the server resources based on the estimated application requirements. The main advantage of our techniques is that they capture the transient behavior of applications while incorporating nonlinearity in the system model. We evaluate our techniques using simulations with synthetic as well as real-world web workloads. Our results show that these techniques can judiciously allocate system resources, especially under transient overload conditions.
OpenACC is a directive-based programming model designed to allow easy access to emerging advanced architecture systems for existing production codes based on Fortran, C and C++. It also provides an approach to coding contemporary technologies without the need to learn complex vendor-specific languages, or understand the hardware at the deepest level. Portability and performance are the key features of this programming model, which are essential to productivity in real scientific applications. OpenACC support is provided by a number of vendors and is defined by an open standard. However, the standard is relatively new, and the implementations are relatively immature. This paper experimentally evaluates the currently available compilers by assessing two approaches to the OpenACC programming model: the "parallel" and "kernels" constructs. The implementation of both of these constructs is compared, for each vendor, showing performance differences of up to 84%. Additionally, we observe performance differences of up to 13% between the best vendor implementations. OpenACC features which appear to cause performance issues in certain compilers are identified and linked to differing default vector length clauses between vendors. These studies are carried out over a range of hardware including GPU, APU, Xeon and Xeon Phi based architectures. Finally, OpenACC performance, and productivity, are compared against the alternative native programming approaches on each targeted platform, including CUDA, OpenCL, OpenMP 4.0 and Intel Offload, in addition to MPI and OpenMP.
In the approach to exascale, scalable tools are becoming increasingly necessary to support parallel applications. Evaluating an application’s call stack is a vital technique for a wide variety of profilers and debuggers, and can create a significant performance overhead. In this paper we present a heuristic technique to reduce the overhead of frequent call stack evaluations. We use this technique to estimate the similarity between successive call stacks, removing the need for full call stack traversal and eliminating a significant portion of the performance overhead. We demonstrate this technique applied to a parallel memory tracing toolkit, WMTools, and analyse the performance gains and accuracy.
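The heuristic can be illustrated by scoring how much of the previous call stack is shared with the current one, so that only the differing top frames need to be re-resolved. The sketch below is a minimal illustration and not the exact scheme implemented in WMTools.

```python
def stack_similarity(prev_stack, curr_stack):
    """Fraction of frames (from the root) shared between two successive call
    stacks.  When this is high, only the differing top frames need to be
    re-resolved rather than walking the whole stack.  Illustrative heuristic,
    not the exact scheme used in WMTools."""
    shared = 0
    for prev_frame, curr_frame in zip(prev_stack, curr_stack):
        if prev_frame != curr_frame:
            break
        shared += 1
    return shared / max(len(prev_stack), len(curr_stack), 1)
```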
It is common nowadays for multiple cores to reside on the same chip and share the on-chip cache. Resource sharing may cause performance degradation of the co-running jobs. Job co-scheduling is a technique that can effectively alleviate this contention. Many co-schedulers have been developed in the literature, but most of them do not aim to find the optimal co-scheduling solution. Being able to determine the optimal solution is critical for evaluating co-scheduling systems. Moreover, most co-schedulers only consider serial jobs, yet in many settings both parallel and serial jobs exist together. This paper aims to tackle these issues. A graph-based method is developed to find the optimal co-scheduling solution for serial jobs, and the method is then extended to incorporate parallel jobs. Extensive experiments have been conducted to evaluate the effectiveness and efficiency of the proposed co-scheduling algorithms. The results show that the proposed algorithms can find the optimal co-scheduling solution for both serial and parallel jobs.
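For the serial-job case on dual-core chips, the graph formulation can be sketched as a minimum-weight perfect matching, where each edge weight is the measured degradation when two jobs share a cache. The sketch below uses NetworkX and illustrative names; the paper generalises the idea to more cores per chip and to parallel jobs.

```python
import networkx as nx

def optimal_pairing(jobs, degradation):
    """Pair serial jobs onto dual-core chips so that total co-run degradation
    is minimised, formulated as a minimum-weight perfect matching.
    degradation[(a, b)] is the combined slowdown when jobs a and b share the
    on-chip cache (either key order is accepted).  Illustrative sketch of the
    graph formulation for serial jobs only."""
    G = nx.Graph()
    for i, a in enumerate(jobs):
        for b in jobs[i + 1:]:
            d = degradation.get((a, b), degradation.get((b, a)))
            G.add_edge(a, b, weight=-d)  # negate: max-weight matching -> min degradation
    # maxcardinality=True forces a perfect matching (every job is paired)
    return nx.max_weight_matching(G, maxcardinality=True)
```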
A DNA microdot offers a novel way in which to communicate secret information. It is an extension of the more traditional microdot, a greatly reduced photograph of a secret document which replaced a full stop somewhere in an innocent-looking letter. The DNA microdot appears to be the secure modern alternative, exploiting the complexity of DNA in which a coded secret message is hidden. An enemy can only unlock the secret information by first knowing that an intercepted letter is impregnated with microdot DNA, and secondly by finding the message amongst a huge amount of background DNA used to mask the secret information. Using software developed to identify horizontally acquired DNA, we show that this apparently insurmountable task is in fact possible. With the increased speed of DNA sequencing, the information contained in the DNA microdot is far from secure.
This paper describes the design and implementation of DAST, a Distributed Arbitrary Segment Tree structure that gives support of range query for public Distributed Hash Table (DHT) services. DAST does not modify the underlying DHT infrastructure, instead it utilises the scalability and robustness of the DHT while providing simplicity of implementation and deployment for applications. Compared with traditional segment trees, the arbitrary segment tree used by a DAST reduces the number of key-space segments that need to be maintained, which in turn results in fewer query operations and lower overheads. Moreover, considering that range queries often contain redundant entries that the clients do not need, we introduce the concept of Accuracy of Results (AoR) for range queries. We demonstrate that by adjusting AoR, the DHT operational overhead can be improved. DAST is implemented on a well-known public DHT service (OpenDHT) and validation through experimentation and supporting simulation is performed. The results demonstrate the effectiveness of DAST over existing methods.
On-line detection of anomalies in time series is a key technique used in various event-sensitive scenarios such as robotic system monitoring, smart sensor networks and data center security. However, the increasing diversity of data sources and the variety of demands make this task more challenging than ever. First, the rapid increase in unlabeled data means supervised learning is becoming less suitable in many cases. Second, a large portion of time series data have complex seasonality features. Third, on-line anomaly detection needs to be fast and reliable. In light of this, we have developed a prediction-driven, unsupervised anomaly detection scheme, which adopts a backbone model combining the decomposition and the inference of time series data. Further, we propose a novel metric, Local Trend Inconsistency (LTI), and an efficient detection algorithm that computes LTI in a real-time manner and scores each data point robustly in terms of its probability of being anomalous. We have conducted extensive experimentation to evaluate our algorithm with several datasets from both public repositories and production environments. The experimental results show that our scheme outperforms existing representative anomaly detection algorithms in terms of the commonly used metric, Area Under Curve (AUC), while achieving the desired efficiency.
Benchmarking has been used by performance engineers for over three decades to gain better insight into system performance. Numerous benchmarks are used in industry to characterize the performance of standalone systems on the one hand, and to measure the computational efficiency of massively parallel systems such as clusters, grids and supercomputers on the other. One of the common features of performance evaluation of small or large standalone systems is that performance engineers have direct and full control over these systems. The peer-to-peer (P2P) paradigm, which involves volunteer participation for the execution of lengthy and complex scientific computations, operates in an uncontrolled environment. The performance evaluation of participating nodes operating in an uncontrolled P2P environment is a challenging task. Currently, two traditional synthetic benchmarks (Dhrystone and Whetstone) are used as part of the Berkeley Open Infrastructure for Network Computing (BOINC) Accounting System for granting credits to the participating nodes. The analysis of the performance data obtained from running Dhrystone and Whetstone on general purpose computers has highlighted the limitations of these benchmarks. This study proposes a new synthetic light-weight benchmark, MalikStone, which is representative of large P2P projects and has been specifically designed in view of the dynamic nature and challenges of the P2P paradigm. The benchmark integrates the strengths of existing synthetic benchmarks and provides more detailed insight into system performance by capturing the context in which the benchmarking was performed. The results of this newly designed benchmark have been found to be encouraging. Compared with the existing synthetic benchmarks, Dhrystone and Whetstone, and the SPEC CPU2006 benchmarking suite, we highlight MalikStone's superiority in characterizing system performance in P2P settings.
The increasing availability of multi-core and multi-processor architectures provides new opportunities for improving the performance of many computer simulations. Markov Chain Monte Carlo (MCMC) simulations are widely used for approximate counting problems, Bayesian inference and as a means for estimating very high-dimensional integrals. As such MCMC has had a wide variety of applications in fields including computational biology and physics, financial econometrics, machine learning and image processing. One method for improving the performance of Markov Chain Monte Carlo simulations is to use SMP machines to perform ‘speculative moves’, reducing the runtime whilst producing statistically identical results to conventional sequential implementations. In this paper we examine the circumstances under which the original speculative moves method performs poorly, and consider how some of the situations can be addressed by refining the implementation. We extend the technique to perform Markov Chains speculatively, expanding the range of algorithms that may be accelerated by speculative execution to those with non-uniform move processing times. By simulating program runs we can predict the theoretical reduction in runtime that may be achieved by this technique. We compare how efficiently different architectures perform in using this method, and present experiments that demonstrate a runtime reduction of up to 35-42% where using conventional speculative moves would result in execution as slow, if not slower, than sequential processing.
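The essence of a speculative move can be sketched as follows: while the acceptance of proposal A (drawn from the current state) is being evaluated, the proposal B that would be needed if A is rejected (also drawn from the current state) is evaluated concurrently, so the chain remains statistically identical to the sequential one and only the wall-clock time changes. The sketch below assumes a symmetric proposal and uses Python threads purely for illustration; it is not the authors' implementation.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

def speculative_mh(x0, log_post, propose, steps, seed=0):
    """Metropolis-Hastings with one speculative move.  Proposals A and B are
    both drawn from the current state x before the outcome of A is known, so
    B's (speculatively computed) evaluation can be reused whenever A is
    rejected.  Sketch assumes a symmetric proposal; illustrative only."""
    rng = random.Random(seed)
    x, lp_x = x0, log_post(x0)
    step = 0
    with ThreadPoolExecutor(max_workers=2) as pool:
        while step < steps:
            a, b = propose(x, rng), propose(x, rng)
            fut_a = pool.submit(log_post, a)
            fut_b = pool.submit(log_post, b)       # speculative evaluation
            lp_a, lp_b = fut_a.result(), fut_b.result()
            if math.log(rng.random()) < lp_a - lp_x:
                x, lp_x = a, lp_a                  # A accepted: B's work is wasted
                step += 1
            else:
                step += 1                          # A rejected ...
                if step < steps:
                    if math.log(rng.random()) < lp_b - lp_x:
                        x, lp_x = b, lp_b          # ... so reuse B's precomputed result
                    step += 1
    return x
```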
The Resilient Distributed Dataset (RDD) is the core memory abstraction behind the popular data-analytic framework Apache Spark. We present an extension to the Resilient Distributed Dataset for map transformations, that we call MapRDD, which takes advantage of the underlying relations between records in the parent and child datasets, in order to achieve random-access of individual records in a partition. The design is complemented by a new MemoryStore, which manages data sampling and data transfers asynchronously. We use the ImageNet dataset to demonstrate that: (I) The initial data loading phase is redundant and can be completely avoided; (II) Sampling on the CPU can be entirely overlapped with training on the GPU to achieve near full occupancy; (III) CPU processing cycles and memory usage can be reduced by more than 90%, allowing other applications to be run simultaneously; (IV) Constant training step time can be achieved, regardless of the size of the partition, for up to 1.3 million records in our experiments. We expect to obtain the same improvements in other RDD transformations via further research on finer-grained implicit & explicit dataset relations.
We present the first reported OpenCL implementation of EPOCH3D, an extensible particle-in-cell plasma physics code developed at the University of Warwick. We document the challenges and successes of this porting effort, and compare the performance of our implementation executing on a wide variety of hardware from multiple vendors. The focus of our work is on understanding the suitability of existing algorithms for future accelerator-based architectures, and identifying the changes necessary to achieve performance portability for particle-in-cell plasma physics codes. We achieve good levels of performance with limited changes to the algorithmic behaviour of the code. However, our results suggest that a fundamental change to EPOCH3D’s current accumulation step (and its dependency on atomic operations) is necessary in order to fully utilise the massive levels of parallelism supported by emerging parallel architectures.
A performance prediction framework is described in which predictive data generated by the PACE toolkit is stored and published through a Globus MDS-based performance information service. Distributing this data allows additional performance-based middleware tools to be built; this paper describes two such tools, a local-level scheduler and a system for wide-area task management. Experimental evidence shows that by integrating these performance tools for local- and wide-area management, considerable improvements can be made to task scheduling, resource utilisation and load balancing on heterogeneous distributed computing systems.
The development of supportive middleware to manage resources and distributed workload across multiple administrative boundaries is of central importance to Grid computing. Active middleware services that perform look-up, match-making, scheduling and staging are being developed that allow users to identify and utilise appropriate resources that provide sustainable system- and user-level qualities of service. This paper documents two performance-responsive middleware services that address the implications of executing a particular workload on a given set of resources. These services are based on an established performance prediction system that is employed at both the local (intra-domain) and global (multi-domain) levels to provide dynamic workload steering. These additional facilities bring about significant performance improvements, the details of which are presented with regard to the user-perceived quality of service and to resource utilisation.
The prevalence of cloud computing environments and the ever increasing reliance of large organisations on computational resources has meant that service providers must operate at unprecedented scales and levels of efficiency. Dynamic resource allocation (DRA) policies have been shown to allow service providers to improve resource utilisation and operational efficiency in the presence of unpredictable demands, hence maximising profitability. However, practical considerations, such as power and space, have led service providers to adopt rack-based approaches to application servicing. This co-location of computation resources, and the associated common provision of utilities it encourages, has immediate implications for system dependability. Specifically, in the presence of rack crash failures, which can lead to all the servers within a rack becoming unavailable, resource allocation policies need to be cognisant of failures. In this paper, we address this issue and make the following specific contributions: (i) we present a modular architecture for failure-aware resource allocation, where a performance-oriented DRA policy is composed with a failure-aware resource allocator, (ii) we propose a metric, called Capacity Loss, to capture the exposure of an application to a rack failure, (iii) we develop an algorithm for reducing the proposed metric across all applications in a system operating under a DRA policy, and (iv) we evaluate the effectiveness of the proposed architecture on a large-scale DRA policy in the context of rack failures, ultimately concluding that our approach reduces the number of failed requests as compared to a single random allocation. The main benefit of our approach is that we have developed a failure-aware resource allocation framework that can work in tandem with any DRA policy.
The increasing availability of multi-core and multiprocessor architectures provides new opportunities for improving the performance of many computer simulations. Markov Chain Monte Carlo (MCMC) simulations are widely used for approximate counting problems, Bayesian inference and as a means for estimating very high-dimensional integrals. As such MCMC has found a wide variety of applications in fields including computational biology and physics, financial econometrics, machine learning and image processing. Whilst divide and conquer is an obvious means to simplify image processing tasks, "naively" dividing an image into smaller images to be processed separately results in anomalies and breaks the statistical validity of the MCMC algorithm. We present a method of grouping the spatially local moves and temporarily partitioning the image to allow those moves to be processed in parallel, reducing the runtime whilst conserving the properties of the MCMC method. We calculate the theoretical reduction in runtime achievable by this method, and test its effectiveness on a number of different architectures. Experiments are presented that show reductions in runtime of 38% using a dual-core dual-processor machine. In circumstances where absolute statistical validity is not required, an algorithm based upon, but not strictly adhering to, MCMC may suffice. For such situations, two additional algorithms are presented for partitioning of images to which MCMC will be applied. Assuming preconditions are met, these methods may be applied with minimal risk of anomalous results. Although the extent of the runtime reduction will be data dependent, the experiments performed showed the runtime reduced to 27% of its original value.
Efficient utilization of resources is one of the key challenges that project administrators face in the dynamic peer-to-peer (P2P) computing environment. The volatile nature of participation by the general public in running lengthy and complex scientific applications is not bounded by any legal obligation, and thus requires a greedy approach to resource utilization. Generally, the availability statistics of a computational resource and a selected set of scheduling algorithms or heuristics are combined for allocating tasks to the available resources. Berkeley Open Infrastructure for Network Computing (BOINC) is a widely used middleware platform in P2P systems, which uses four interrelated scheduling policies for utilization of the available pool of computational resources. The policies used by BOINC mainly depend on job execution estimates based on the composition of the job and performance results of the considered computing platform obtained using benchmarks. BOINC deploys two traditional synthetic benchmarks, Dhrystone and Whetstone, for measuring the integer and floating-point performance of a considered platform. However, the performance results obtained using these benchmarks show significant variations for similar microprocessor, operating system and hardware configurations. These inconsistent results, when used for scheduling, significantly affect resource-scheduling estimates, particularly for time-constrained jobs. This study proposes a novel scheduling policy based on a more consistent and P2P-representative benchmark, MalikStone. The policy considers the total availability time of a computational resource, the estimated execution time of a work-unit and the available unused time of a computational resource, dynamically slicing a large work-unit into smaller work-units that fit the available unused time. The results reveal that the policy improved the utilization of available computational resources by around 10% under the considered experimental settings.
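The core slicing idea can be illustrated in a few lines: given an estimate of the per-step cost of a work-unit and the expected unused time of a volunteer host, only the portion that fits is assigned and the remainder is re-queued. The names and the simple proportional rule below are assumptions for illustration, not the policy's actual slicing logic.

```python
# Illustrative sketch of slicing a large work-unit to fit a host's unused time.
import math

def slice_work_unit(total_steps, est_time_per_step, available_unused_time):
    """Return the number of steps that fit in the host's available unused time,
    and the remainder to be re-queued for other hosts."""
    fit = min(total_steps, math.floor(available_unused_time / est_time_per_step))
    return fit, total_steps - fit

# A 10,000-step work-unit, ~2.5 s per step, on a host expected to be idle for 4 hours.
here, remainder = slice_work_unit(10_000, 2.5, 4 * 3600)
print(here, remainder)   # 5760 steps assigned here, 4240 steps re-queued
```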
The paper designs an automated valuation model to predict the price of residential property in Coventry, United Kingdom, and achieves this by means of geostatistical Kriging, a popularly employed distance-based learning method. Unlike traditional applications of distance-based learning, this paper implements non-Euclidean distance metrics by approximating road distance, travel time and a linear combination of both, which this paper hypothesizes to be more closely related to house prices than straight-line (Euclidean) distance. Given that a valid variogram must be produced to undertake Kriging, this paper exploits the conforming properties of the Minkowski distance function to approximate road distance and travel time metrics. A least squares approach is put forth for variogram parameter selection and an ordinary Kriging predictor is implemented for interpolation. The predictor is then validated with 10-fold cross-validation and a spatially aware checkerboard hold-out method against the almost exclusively employed Euclidean metric. Comparing results for each distance metric, this paper reports a goodness-of-fit of 0.6901 ± 0.18 SD for real estate price prediction, compared with the traditional (Euclidean) approach, which obtains a suboptimal value of 0.66 ± 0.21 SD.
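The sketch below shows the overall pipeline in miniature: pairwise Minkowski distances, an empirical semivariogram, a least-squares fit of a variogram model, and an ordinary Kriging solve. The exponential variogram, the Minkowski exponent and the synthetic data are assumptions for illustration; they are not the model, exponent or data used in the paper.

```python
# Compact sketch of ordinary Kriging with a Minkowski distance and a least-squares
# variogram fit. Model choices and parameters are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import curve_fit

def exponential_variogram(h, nugget, sill, a):
    return nugget + sill * (1.0 - np.exp(-h / a))

def fit_variogram(coords, values, p=1.4, n_bins=15):
    d = cdist(coords, coords, metric="minkowski", p=p)
    g = 0.5 * (values[:, None] - values[None, :]) ** 2      # semivariance of each pair
    iu = np.triu_indices_from(d, k=1)
    bins = np.linspace(0, d[iu].max(), n_bins + 1)
    idx = np.digitize(d[iu], bins)
    h_emp = np.array([d[iu][idx == b].mean() for b in range(1, n_bins + 1) if np.any(idx == b)])
    g_emp = np.array([g[iu][idx == b].mean() for b in range(1, n_bins + 1) if np.any(idx == b)])
    params, _ = curve_fit(exponential_variogram, h_emp, g_emp,
                          p0=[0.0, g_emp.max(), h_emp.mean()], maxfev=10_000)
    return params

def ordinary_krige(coords, values, targets, params, p=1.4):
    n = len(values)
    gamma = exponential_variogram(cdist(coords, coords, metric="minkowski", p=p), *params)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = gamma
    A[n, n] = 0.0
    preds = []
    for t in targets:
        b = np.ones(n + 1)
        b[:n] = exponential_variogram(
            cdist(coords, t[None, :], metric="minkowski", p=p).ravel(), *params)
        w = np.linalg.solve(A, b)                            # Kriging weights + Lagrange multiplier
        preds.append(w[:n] @ values)
    return np.array(preds)

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(80, 2))
values = np.sin(coords[:, 0]) + 0.1 * rng.standard_normal(80)
params = fit_variogram(coords, values)
print(ordinary_krige(coords, values, np.array([[5.0, 5.0]]), params))
```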
This paper seeks to identify some of the technology challenges and research opportunities in assessing, optimising and assuring performance and dependability on the grid. Grid computing offers many scientific and commercial benefits, but also many technical and organisational challenges. Amongst the challenges for grid computing is the need to provide reliable quality of service to end users, while enabling increasingly dynamic and federated usage scenarios. Understanding and quantifying the performance capabilities of a grid system and its applications is a crucial issue, without which performance improvements will be at best ad hoc. Nevertheless, this is particularly difficult for complex and dynamic systems like the grid, where both the computing environment (in terms of the contributing platforms and networks) and the application base are in continual change. There is an internationally recognised need for research in this area and success will lay the foundation for next generation e-Science.
The advent of concurrent systems undoubtedly promises greatly increased processing power. These systems, however, are inherently complex and present many new problems at both the design and implementation phases. Appropriate specification, design and implementation methods are required, which will facilitate the development of such systems. Techniques based on formal methods, such as the language of CSP (Communicating Sequential Processes) have proven to be the most successful means of conquering complexity in the specification of concurrent, embedded, real-time and distributed systems. Concurrent Systems: Formal Development in CSP offers an up-to-date and practical treatment of this important subject. The basis of the text is the combination of a tutorial approach with convenient reference material. Beginning with a clear introduction to CSP, the book goes on to provide a comprehensive listing of the laws of CSP and describes the more important variants of the language. An illustrative case study, based on the verification of a network protocol, forms a central part of the book and practical implementation in both Occam and Ada 9X is discussed in depth. This book represents a significant, student-friendly contribution to the teaching of CSP. Its example-based, but rigorous approach makes it ideal for both undergraduate and postgraduate courses in concurrency, formal methods, distributed and real-time systems. Special features include: A tutorial introduction combined with comprehensive reference material, coverage of the most important variants of CSP, a fully annotated select bibliography, and implementation details in Occam and Ada 9X.
Crafting scalable analytics in order to extract actionable business intelligence is a challenging endeavour, requiring multiple layers of expertise and experience. Often, this expertise is irreconcilably split between an organisation's engineers and subject matter or domain experts. Previous approaches to this problem have relied on technically adept users with tool-specific training. These approaches have generally not targeted the levels of performance and scalability required to harness the sheer volume and velocity of data in large-scale analytics. In this paper, we present a novel approach to the automated planning of scalable analytics using a semantically rich type system, the use of which requires little programming expertise from the user. This approach is the first of its kind to permit domain experts with little or no technical expertise to assemble complex and scalable analytics, for execution both on- and off-line, with no lower-level engineering support. We describe in detail (i) an abstract model of analytic assembly and execution, (ii) goal-based planning and (iii) code generation using this model for both on- and off-line analytics. Our implementation of this model, Mendeleev, is used to (iv) demonstrate the applicability of our approach through a series of case studies, in which a single interface is used to create analytics that can be run in real-time (on-line) and batch (off-line) environments. We (v) analyse the performance of the planner, and (vi) show that the performance of Mendeleev's generated code is comparable with that of hand-written analytics.
Peer-to-peer computing, involving the participation of thousands of general purpose, public computers, has established itself as a viable paradigm for executing loosely-coupled, complex scientific applications requiring significant computational resources. ClimatePrediction.net is an excellent demonstrator of this technology, having attracted hundreds of thousands of users from more than 200 countries for the efficient and effective execution of complex climate prediction models. This paper is concerned with understanding the underlying compute resources on which ClimatePrediction.net commonly executes. Such a study is advantageous to three different stakeholders, namely the application developers, the project participants and the project administrators. This is the first study of this kind, and while offering general conclusions with specific reference to ClimatePrediction.net, it also illustrates the benefits of such work to scalability analysis of other peer-to-peer projects.
The emergence of modern many-core architectures that offer an extreme level of parallelism makes methods that were previously infeasible due to computational expense now achievable. Particle-in-Cell (PIC) codes often fail to fully leverage this increased performance potential due to their high use of memory bandwidth. The use of higher order PIC methods may offer a solution to this by improving simulation accuracy significantly for an increase in computational intensity when compared to their first order counterparts. This greater expense is accompanied with only a minor increase in the amount of memory throughput required during the simulation. In this presentation we will show the performance of a second order PIC algorithm. Our implementation uses second order finite elements and particles that are represented with a collection of surrounding ghost particles. These ghost particles each have associated weights and offsets around the true particle position and therefore represent a charge distribution. We test our PIC implementation against a first order algorithm on various modern compute architectures including Intel’s Knights Landing (KNL) and NVIDIA’s Tesla P100. Our preliminary results show the viability of second order methods for PIC applications on these architectures when compared to previous generations of many-core hardware. Specifically, we see an order of magnitude improvement in performance for second order methods between the Pascal and Kepler GPU architectures, despite only a 4× improvement in theoretical peak performance between the architectures. Although these initial results show a large increase in runtime over first order methods, we hope to be able to show improved scaling behaviour and increased simulation accuracy in the future.
We are currently investigating the viability of many-core architectures for the acceleration of wavefront applications and this report focuses on graphics processing units (GPUs) in particular. To this end, we have implemented NASA’s LU benchmark – a real world production-grade application – on GPUs employing NVIDIA’s Compute Unified Device Architecture (CUDA). This GPU implementation of the benchmark has been used to investigate the performance of a selection of GPUs, ranging from workstation-grade commodity GPUs to the HPC “Tesla” and “Fermi” GPUs. We have also compared the performance of the GPU solution at scale to that of traditional high performance computing (HPC) clusters based on a range of multi-core CPUs from a number of major vendors, including Intel (Nehalem), AMD (Opteron) and IBM (PowerPC). In previous work we have developed a predictive “plug-and-play” performance model of this class of application running on such clusters, in which CPUs communicate via the Message Passing Interface (MPI). By extending this model to also capture the performance behaviour of GPUs, we are able to: (1) comment on the effects that architectural changes will have on the performance of single-GPU solutions, and (2) make projections regarding the performance of multi-GPU solutions at larger scale.
We present the performance analysis of a port of the LU benchmark from the NAS Parallel Benchmark (NPB) suite to NVIDIA's Compute Unified Device Architecture (CUDA), and report on the optimisation efforts employed to take advantage of this platform. Execution times are reported for several different GPUs, ranging from low-end consumer-grade products to high-end HPC-grade devices, including the Tesla C2050 built on NVIDIA's Fermi processor. We also utilise recently developed performance models of LU to facilitate a comparison between future large-scale distributed clusters of GPU devices and existing clusters built on traditional CPU architectures, including a quad-socket, quad-core AMD Opteron cluster and an IBM BlueGene/P.
It has been a subject of a significant amount of research to automate the execution of workflows (or business processes) on computer resources. However, many workflow scenarios still require human involvement, which introduces additional security and authorization concerns. This paper presents a novel mechanism for modeling the execution of workflows with human involvement under Role-based Authorization Control. Our modeling approach applies Colored Timed Petri-Nets to allow various authorization constraints to be modeled, including role, temporal, cardinality, BoD (Binding of Duty), SoD (Separation of Duty) and role hierarchy constraints. We also model the execution of tasks with different levels of human involvement and as such allow the interactions between workflow authorization and workflow execution to be captured. The modeling mechanism is developed in such a way that the construction of the authorization model for a workflow can be automated. This feature is very helpful for modeling large collections of authorization policies and/or complex workflows. A Petri-net toolkit, the CPN Tools, is utilized in the development of the modeling mechanism and to simulate the constructed models. This paper also presents methods to analyze and calculate the authorization overhead, as well as performance data in terms of various metrics, through model simulations. Based on the simulation results, this paper further proposes approaches to improving performance given the deployed authorization policies. This work can be used for investigating the impact of authorization, for capacity planning, for the design of workload management strategies, and also to estimate execution performance, when human resources and authorization policies are employed in tandem. ► We model the executions of workflows with human involvement under RBAC authorization. ► We analyze the impact of authorization on workflow executions. ► We calculate the performance of workflow executions under RBAC authorization. ► We propose approaches to improving performance given the authorization constraints.
The increasing availability of multi-core and multi-processor architectures provides new opportunities for improving the performance of many computer simulations. Markov Chain Monte Carlo (MCMC) simulations are widely used for approximate counting problems, Bayesian inference and as a means for estimating very high-dimensional integrals. As such MCMC has found a wide variety of applications in biology and biomedical analysis. This paper presents a new method for reducing the runtime of Markov Chain Monte Carlo simulations by using SMP machines to speculatively perform iterations in parallel, reducing the runtime of MCMC programs whilst producing statistically identical results to conventional sequential implementations. We calculate the theoretical reduction in runtime that may be achieved using our technique under perfect conditions, and test and compare the method on a selection of multi-core and multi-processor architectures. Experiments are presented that show reductions in runtime of 35% using two cores and 55% using four cores.
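The sketch below illustrates one common way of speculatively parallelising Metropolis-Hastings: several proposals from the current state are scored concurrently on the assumption that the earlier ones are rejected, and the chain then consumes them in order, discarding the remainder once a proposal is accepted. The Gaussian target, pool size and step size are illustrative assumptions, not the paper's implementation; the parallel benefit only materialises when the target density is expensive to evaluate.

```python
# Minimal sketch of speculative parallel Metropolis-Hastings. Target and tuning
# constants are assumptions for illustration only.
import numpy as np
from multiprocessing import Pool

def log_target(x):
    return -0.5 * x * x          # standard normal, up to a constant

def speculative_mh(n_samples, n_spec=4, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x, lp_x = 0.0, log_target(0.0)
    chain = []
    with Pool(n_spec) as pool:
        while len(chain) < n_samples:
            proposals = x + step * rng.standard_normal(n_spec)
            uniforms = rng.random(n_spec)
            lp_props = pool.map(log_target, proposals)     # speculative, parallel scoring
            for y, u, lp_y in zip(proposals, uniforms, lp_props):
                if np.log(u) < lp_y - lp_x:
                    x, lp_x = y, lp_y
                    chain.append(x)
                    break          # remaining speculation was drawn from the old state: discard
                chain.append(x)    # rejected: the next speculative proposal is still valid
    return np.array(chain[:n_samples])

if __name__ == "__main__":
    samples = speculative_mh(20_000)
    print(samples.mean(), samples.std())   # close to 0 and 1 for the standard normal target
```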
A call-graph profiling tool has been designed and implemented to analyse the efficiency of programs written using BSPlib. This tool highlights computation and communication imbalance in parallel programs, exposing portions of program code which are amenable to improvement. A unique feature of this profiler is that it uses the BSP cost model, thus providing a mechanism for portable and architecture-independent parallel performance tuning. In order to test the capabilities of the model on a real-world example, the performance characteristics of an SQL query processing application are investigated on a number of different parallel architectures.
High Performance Computing (HPC) critically underpins the design of aero-engines. Global emissions targets require a fundamental change in engine design, including designs utilizing sustainable aviation fuels and electric/hybrid flight. Virtual certification of designs with HPC is recognized as a key technology to meet these challenges, but requires analysis of models with higher fidelity, using ultra-large-scale executions. In this explanatory SC-SciVis showcase, we present results from time-accurate simulations of a 4.6B-element full 360-degree model of a production-representative gas turbine engine compressor, the Rig250 at DLR. This represents a grand challenge problem, at the fidelity for virtual certification standards. The results are achieved through Rolls-Royce’s Hydra CFD suite on ARCHER2. The compressor is visualized under off-design conditions, demonstrating flow contours of velocity, Mach number and iso-surfaces of vorticity. The level of detail and the HPC simulations leading to the visualizations demonstrate a step-change towards achieving virtual certification objectives under production settings.
This paper investigates the underlying impact of predictive inaccuracies on execution scheduling, with particular reference to execution time predictions. This study is conducted from two perspectives: from that of job selection and from that of resource allocation, both of which are fundamental components in execution scheduling. A new performance metric, termed the degree of misperception, is introduced to express the probability that the predicted execution times of jobs display different ordering characteristics from their real execution times due to inaccurate prediction. Specific formulae are developed to calculate the degree of misperception in both job selection and resource allocation scenarios. The parameters which influence the degree of misperception are also extensively investigated. The results presented in this paper are of significant benefit to scheduling approaches that take into account predictive data; the results are also of importance to the application of these scheduling techniques to real-world high-performance systems.
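One possible way to make the idea concrete for a pair of jobs is to assume multiplicative Gaussian prediction errors and compute the probability that the predicted ordering disagrees with the real ordering. The sketch below is an illustrative formalisation under that assumption; it is not the set of formulae developed in the paper.

```python
# Illustrative misperception probability for two jobs under multiplicative
# Gaussian prediction errors (an assumption made for this sketch).
from math import sqrt
from scipy.stats import norm

def misperception_probability(t1, t2, sigma1, sigma2):
    """P(predicted order of two jobs disagrees with their real order), with real times
    t1 < t2 and predictions t_i * (1 + e_i), e_i ~ N(0, sigma_i^2) independent."""
    assert t1 < t2
    return 1.0 - norm.cdf((t2 - t1) / sqrt((t1 * sigma1) ** 2 + (t2 * sigma2) ** 2))

# Two jobs whose real times differ by 10%, each predicted with ~20% error:
print(misperception_probability(100.0, 110.0, 0.2, 0.2))   # ≈ 0.37
```

The example shows why the effect matters: even moderate prediction error makes a scheduler mis-order closely matched jobs more than a third of the time.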
With the age of Exascale computing causing a diversification away from traditional CPU-based homogeneous clusters, it is becoming increasingly difficult to ensure that computationally complex codes are able to run on these emerging architectures. This is especially important for large physics simulations that are themselves becoming increasingly complex and computationally expensive. One proposed solution to the problem of ensuring these applications can run on the desired architectures is to develop representative mini-applications that are simpler and so can be ported to new frameworks more easily, but which are also representative of the algorithmic and performance characteristics of the original applications. In this paper we present BookLeaf, an unstructured Arbitrary Lagrangian-Eulerian mini-application to add to the suite of representative applications developed and maintained by the UK Mini-App Consortium (UK-MAC). First, we outline the reference implementation of our application in Fortran. We then discuss a number of alternative implementations using a variety of parallel programming models and discuss the issues that arise when porting such an application to new architectures. To demonstrate our implementation, we present a study of the performance of BookLeaf on a number of platforms using alternative designs, and we document a scaling study showing the behaviour of the application at scale.
In this work we directly evaluate two PGAS programming models, CAF and OpenSHMEM, as candidate technologies for improving the performance and scalability of scientific applications on future exascale HPC platforms. PGAS approaches are considered by many to represent a promising research direction with the potential to solve some of the existing problems preventing codebases from scaling to exascale levels of performance. The aim of this work is to better inform the exascale planning at large HPC centres such as AWE. Such organisations invest significant resources maintaining and updating existing scientific codebases, many of which were not designed to run at the scales required to reach exascale levels of computational performance on future system architectures. We document our approach for implementing a recently developed Lagrangian-Eulerian explicit hydrodynamics mini-application in each of these PGAS languages. Furthermore, we also present our results and experiences from scaling these different approaches to high node counts on two state-of-the-art, large scale system architectures from Cray (XC30) and SGI (ICE-X), and compare their utility against an equivalent existing MPI implementation.
Federated learning (FL) has attracted increasing attention as a promising approach to driving a vast number of end devices with artificial intelligence. However, it is very challenging to guarantee the efficiency of FL considering the unreliable nature of end devices while the cost of device-server communication cannot be neglected. In this article, we propose SAFA, a semi-asynchronous FL protocol, to address the problems in federated learning such as low round efficiency and poor convergence rate in extreme conditions (e.g., clients dropping offline frequently). We introduce novel designs in the steps of model distribution, client selection and global aggregation to mitigate the impacts of stragglers, crashes and model staleness in order to boost efficiency and improve the quality of the global model. We have conducted extensive experiments with typical machine learning tasks. The results demonstrate that the proposed protocol is effective in terms of shortening federated round duration, reducing local resource wastage, and improving the accuracy of the global model at an acceptable communication cost.
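The sketch below gives a rough flavour of a semi-asynchronous aggregation round: the server keeps a per-client cache of the latest usable update, drops updates that are too stale, and aggregates once enough clients have reported. The cache policy, staleness threshold and quorum rule here are illustrative assumptions and are not the SAFA protocol's actual design.

```python
# Minimal sketch of a semi-asynchronous aggregation round. All thresholds and the
# cache policy are assumptions for illustration only.
import numpy as np

def aggregate_round(global_model, cache, arrivals, round_no, lag_tolerance=2, quorum=0.5):
    """arrivals: {client_id: (model_vector, round_trained_on)} received this round."""
    for cid, (update, trained_on) in arrivals.items():
        if round_no - trained_on <= lag_tolerance:
            cache[cid] = update            # fresh enough: keep in the per-client cache
        # otherwise the stale update is dropped; that client resyncs the global model
    if len(arrivals) < quorum * max(len(cache), 1):
        return global_model                # not enough reporters yet: skip aggregation
    return np.mean(list(cache.values()), axis=0)

rng = np.random.default_rng(0)
global_model = np.zeros(4)
cache = {}
arrivals = {f"c{i}": (rng.standard_normal(4), 5 - i % 3) for i in range(6)}
print(aggregate_round(global_model, cache, arrivals, round_no=5))
```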
In this work we directly evaluate five candidate programming models for future exascale applications (MPI, MPI+OpenMP, MPI+OpenACC, MPI+CUDA and CAF) using a recently developed Lagrangian-Eulerian explicit hydrodynamics mini-application. The aim of this work is to better inform the exascale planning at large HPC centres such as AWE. Such organisations invest significant resources maintaining and updating existing scientific codebases, many of which were not designed to run at the scale required to reach exascale levels of computation on future system architectures. We present our results and experiences of scaling these different approaches to high node counts on existing large-scale Cray systems (Titan and HECToR). We also examine the effect that improving the mapping between process layout and the underlying machine interconnect topology can have on performance and scalability, as well as highlighting several communication-focused optimisations.
The ever increasing scale of modern high performance computing platforms poses challenges for system architects and code developers alike. The increase in core count densities and associated cost of components is having a dramatic effect on the viability of high memory-per-core ratios. Whilst the available memory per core is decreasing, the increased scale of parallel jobs is testing the efficiency of MPI implementations with respect to memory overhead. Scalability issues have always plagued both hardware manufacturers and software developers, and the combined effects can be disabling. In this paper we address the issue of MPI memory consumption with regard to InfiniBand network communications. We reaffirm some widely held beliefs regarding the existence of scalability problems under certain conditions. Additionally, we present results testing memory-optimised runtime configurations and vendor provided optimisation libraries. Using Orthrus, a linear solver benchmark developed by AWE, we demonstrate these memory-centric optimisations and their performance implications. We show the growth of OpenMPI memory consumption (demonstrating poor scalability) on both Mellanox and QLogic InfiniBand platforms. We demonstrate a 616× increase in MPI memory consumption for a 64× increase in core count, with a default OpenMPI configuration on Mellanox. Through the use of the Mellanox MXM and QLogic PSM optimisation libraries we are able to observe a 117× and 115× reduction in MPI memory at application memory high water mark. This significantly improves the potential scalability of the code.
Production machine performance has large variability. On the UK National Supercomputing Service, the time a job takes to complete can vary by as much as 53%. Load imbalance and shared resource contention are largely responsible, but we find that previous efforts to model application/architecture performance do not typically take these into account. In this research we model and simulate network contention, which allows us to explore the impact of multiple interacting jobs and approaches to alleviate these effects, including network re-design and communication-staging within applications. We show the utility of this work on a variety of systems and interacting applications.
Input/Output (I/O) operations can represent a significant proportion of run-time when large scientific applications are run in parallel and at scale. In order to address the growing divergence between processing speeds and I/O performance, the Parallel Log-structured File System (PLFS) has been developed by EMC Corporation and the Los Alamos National Laboratory (LANL) to improve the performance of parallel file activities. Currently, PLFS requires the use of either (i) the FUSE Linux Kernel module, (ii) a modified MPI library with a customised ROMIO MPI-IO library, or (iii) an application rewrite to utilise the PLFS API directly. In this paper we present an alternative method of utilising PLFS in applications. This method employs a dynamic library to intercept the low-level POSIX operations and retarget them to use the equivalents offered by PLFS. We demonstrate our implementation of this approach, named LDPLFS, on a set of standard UNIX tools, as well as on a set of standard parallel I/O intensive mini-applications. The results demonstrate almost equivalent performance to a modified build of ROMIO and improvements over the FUSE-based approach. Furthermore, through our experiments we demonstrate decreased performance in PLFS when run at scale on the Lustre file system.
Input/output (I/O) operations can represent a significant proportion of the run-time when large scientific applications are run in parallel. Although there have been advances in the form of file-format libraries, file-system design and I/O hardware, a growing divergence exists between the performance of parallel file-systems and compute processing rates. The effect is often a bottleneck when any form of file-system interaction is required. In this paper we present RIOT – an input/output tracing toolkit being developed at the University of Warwick for dynamic attachment to parallel applications. The two-stage tracing process includes a lightweight library to record I/O events and an in-depth post-execution analysis tool to extract performance metrics such as MPI-IO bandwidth, effective POSIX/file-system bandwidth, duration of individual or aggregated time spent in obtaining or releasing file locks and temporal information relating to parallel file activity. We present a case study on the use of RIOT for three standard industry I/O benchmarks: the BT-IO micro-application from NASA’s Parallel Benchmark suite, FLASH-IO, a benchmark which replicates the check-pointing operations of the FLASH thermonuclear star modelling code and IOR, an industry standard I/O benchmark using HDF-5 and MPI-IO. Furthermore, we utilise RIOT to assess these codes when running with the Parallel Log-structured File System (PLFS) middleware developed by the Los Alamos National Laboratory.
In September 2013, the large laser-based inertial confinement fusion device housed in the National Ignition Facility at Lawrence Livermore National Laboratory, was widely acclaimed to have achieved a milestone in controlled fusion -- successfully initiating a reaction that resulted in the release of more energy than the fuel absorbed. Despite this success, we remain some distance from being able to create controlled, self-sustaining fusion reactions. Inertial Confinement Fusion (ICF) represents one leading design for the generation of energy by nuclear fusion. Since the 1950s, ICF has been supported by computing simulations, providing the mathematical foundations for pulse shaping, lasers, and material shells needed to ensure effective and efficient implosion. The research presented here focuses on one such simulation code, EPOCH, a fully relativistic particle-in-cell plasma physics code, developed by a leading network of over 30 UK researchers. A significant challenge in developing large codes like EPOCH is maintaining effective scientific delivery on successive generations of high-performance computing architecture. To support this process, we adopt the use of mini-applications -- small code proxies that encapsulate important computational properties of their larger parent counterparts. Through the development of a mini-app for EPOCH (called miniEPOCH), we investigate known time-step scaling issues within EPOCH and explore possible optimisations: (i) Employing loop fission to increase levels of vectorisation, (ii) Enforcing particle ordering to allow the exploitation of domain specific knowledge and, (iii) Changing underlying data storage to improve memory locality. When applied to EPOCH, these improvements represent a 2.02× speed-up in the core algorithm and a 1.55× speed-up to the overall application runtime, when executed on EPCC's Cray XC30 ARCHER platform.
With the diversification of HPC architectures beyond traditional CPU-based clusters, a number of new frameworks for performance portability across architectures have arisen. One way of implementing such frameworks is to use C++ templates and lambda expressions to design loop-like functions. However, lower level programming APIs that these implementations must use are often designed with C in mind and do not specify how they interact with C++ features such as lambda expressions. This paper proposes a change to the behavior of the OpenMP specification with respect to lambda expressions such that when functions generated by lambda expressions are called inside GPU regions, any pointers used in the lambda expression correctly refer to device pointers. This change has been implemented in a branch of the Clang C++ compiler and demonstrated with two representative codes. Our results show that the implicit mapping of lambda expressions always exhibits identical performance to an explicit mapping but without breaking the abstraction provided by the high level frameworks, and therefore also reduces the burden on the application developer.
Scientific applications typically have considerable memory and processor requirements. Nevertheless, even with today's fastest supercomputers, the power of available resources falls short of the demands of large, complex simulations and analysis. The solution fortunately lies in the emergence of innovative resource environments such as Grid computing. This paper addresses issues concerning the discovery and monitoring of resources across multiple administrative domains, which can be harnessed by scientific applications. Grid Information Services are an important infrastructure in computational Grid environments as they provide important information about the state and availability of resources. We will present how the Globus Monitoring and Discovery Service (MDS) architecture can be extended to offer a transparent, unified view of Grid resources. Local resource managers ultimately possess the potential to influence where scientific jobs are processed, depending on the availability and load of their scheduler. Consequently, information about the structure and the state of schedulers should be provided to Grid brokers, which will then decide where to schedule the applications. This paper will present the way in which low-level scheduling information is collated from multiple sources and is incorporated into the unified Grid information view. The existence and availability of Grid resource information also allows reliable application performance prediction to be carried out using the Warwick PACE (Performance Analysis and Characterisation Environment) system. Thus, Grid Information Services are a key component in the efficient execution of scientific applications on Grid computing architectures.
Distributed e-business application platforms such as the web services framework will require sophisticated workload management infrastructures for the matching of client requests to the most appropriate resources. This paper introduces a novel dynamic predictive framework that uses both historical data and analytical performance modelling to extrapolate a request's predicted performance under varying workloads and conditions on different systems. Such a framework can facilitate a workload manager in maximising the extent to which service level agreements can be met. Initial results are presented analysing how the response times of an e-business benchmark can be predicted at different system loads and on different application servers.
This paper is concerned with predicting the response times an enterprise information system would provide on new server architectures. These predictions can allow a workload to be transferred onto new servers whilst maintaining quality of service levels. Two common techniques are solving queuing models and extrapolating from previously gathered performance data. The dynamic recalibration of a layered queuing model and a historical model are investigated experimentally using an established distributed enterprise benchmark. The conclusions provide guidelines as to how to select an appropriate technique, including how to dynamically calibrate each model at a low overhead. Using these guidelines it is shown that both techniques can make low overhead predictions for new server architectures at a good level of predictive accuracy (typically over 80%).
BitTorrent is a typical peer-to-peer (P2P) file distribution application that has gained tremendous popularity in recent years. A considerable amount of research exists regarding BitTorrent’s choking algorithm, which has proved to be effective in preventing freeriders. However, the effect of the seeding strategy on the resistance to freeriders in BitTorrent has been largely overlooked. In addition to this, a category of selfish leechers (termed exploiters), who leave the overlay immediately after completion, has never been taken into account in previous research. In this paper two popular seeding strategies, the Original Seeding Strategy (OSS) and the Time-based Seeding Strategy (TSS), are chosen and we study via mathematical models and simulation their effects on freeriders and exploiters in BitTorrent networks. The mathematical model is verified and we discover that both freeriders and exploiters impact on system performance, regardless of the seeding strategy that is employed. However, a selfish-leechers threshold is identified; once the threshold is exceeded, we find that TSS outperforms OSS – that is, TSS reduces the negative impact of selfish leechers more effectively than OSS. Based on these results we discuss the choice of seeding strategy and speculate as to how more effective BitTorrent-based file distribution applications can be built.
Dynamic resource allocation has the potential to provide significant increases in total revenue in enterprise systems through the reallocation of available resources as the demands on hosted applications change over time. This paper investigates the combination of workload prediction algorithms and switching policies: the former aim to forecast the workload associated with Internet services, the latter switch resources between applications according to certain system criteria. An evaluation of two well known switching policies – the proportional switching policy (PSP) and the bottleneck aware switching policy (BSP) – is conducted in the context of seven workload prediction algorithms. This study uses real-world workload traces consisting of approximately 3.5M requests, and models a multi-tiered, cluster-based, multi-server solution. The results show that a combination of the bottleneck aware switching policy and workload predictions based on an autoregressive, integrated, moving-average model can improve system revenue by as much as 43%.
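The pairing of a workload forecast with a switching decision can be illustrated with a short sketch: an ARIMA model (here via statsmodels) forecasts the next interval's request rate per application, and a server is moved from the least- to the most-pressured pool when the forecast pressures differ enough. The ARIMA order, thresholds and switching rule are illustrative assumptions, not the PSP/BSP policies or prediction algorithms evaluated in the paper.

```python
# Illustrative pairing of an ARIMA workload forecast with a simple switching rule.
# The order, thresholds and rule are assumptions for illustration only.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def forecast_next(request_counts, order=(1, 1, 1)):
    """Forecast the next interval's request count for one application."""
    return float(ARIMA(request_counts, order=order).fit().forecast(steps=1)[0])

def plan_switch(workloads, servers, capacity_per_server=1000):
    """Move one server from the least- to the most-pressured application if the
    forecast load per server differs enough to justify the switching cost."""
    pressure = {app: forecast_next(series) / (servers[app] * capacity_per_server)
                for app, series in workloads.items()}
    hot = max(pressure, key=pressure.get)
    cold = min(pressure, key=pressure.get)
    if pressure[hot] > 0.8 and pressure[cold] < 0.5 and servers[cold] > 1:
        servers[hot] += 1
        servers[cold] -= 1
    return servers, pressure

rng = np.random.default_rng(0)
workloads = {
    "shop":  1500 + 40 * np.arange(48) + rng.normal(0, 50, 48),   # ramping up
    "forum": 900 + rng.normal(0, 30, 48),                          # steady
}
print(plan_switch(workloads, {"shop": 3, "forum": 3}))
```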
Application server clusters are often used to service high-throughput web applications. In order to host more than a single application, an organization will usually procure a separate cluster for each application. Over time the utilization of the two clusters will vary, leading to variation in the response times experienced by users of the applications. Techniques that statically assign servers to each application prevent the system from adapting to changes in the workload, and are thus susceptible to providing unacceptable levels of service. This paper investigates a system for allocating server resources to applications dynamically, thus allowing applications to automatically adapt to variable workloads. Such a scheme requires meticulous system monitoring, a method for switching application servers between server pools and a means of calculating when a server switch should be made (balancing switching cost against perceived benefits). Experimentation is performed using such a switching system on a Web application testbed hosting two applications across eight application servers. The testbed is used to compare several theoretically derived switching policies. The Average Flow switching policy is shown to perform best when considering the mean response times for this application.
The importance of memory performance and capacity is a growing concern for high performance computing laboratories around the world. It has long been recognized that improvements in processor speed exceed the rate of improvement in dynamic random access memory speed and, as a result, memory access times can be the limiting factor in high performance scientific codes. The use of multi-core processors exacerbates this problem with the rapid growth in the number of cores not being matched by similar improvements in memory capacity, increasing the likelihood of memory contention. In this paper, we present a lightweight memory tracing tool and analysis framework for parallel codes, which is able to identify peak memory usage and also analyse per-function memory use over time. An evaluation of the tool, in terms of its effectiveness and also its overheads, is performed using nine established scientific applications/benchmark codes representing a variety of programming languages and scientific domains. We also show how the tool can be used to automatically generate a parameterized memory model for one of these applications, a two-dimensional non-linear magnetohydrodynamics application. Through the memory model we are able to identify an unexpected growth term which becomes dominant at scale. With a refined model we are able to predict memory consumption with under 7% error.
Input/Output (I/O) operations can represent a significant proportion of the run-time of parallel scientific computing applications. Although there have been several advances in file format libraries, file system design and I/O hardware, a growing divergence exists between the performance of parallel file systems and the compute clusters that they support. In this paper, we document the design and application of the RIOT I/O toolkit (RIOT) being developed at the University of Warwick with our industrial partners at the Atomic Weapons Establishment and Sandia National Laboratories. We use the toolkit to assess the performance of three industry-standard I/O benchmarks on three contrasting supercomputers, ranging from a mid-sized commodity cluster to a large-scale proprietary IBM BlueGene/P system. RIOT provides a powerful framework in which to analyse I/O and parallel file system behaviour-we demonstrate, for example, the large file locking overhead of IBM's General Parallel File System, which can consume nearly 30% of the total write time in the FLASH-IO benchmark. Through I/O trace analysis, we also assess the performance of HDF-5 in its default configuration, identifying a bottleneck created by the use of suboptimal Message Passing Interface hints. Furthermore, we investigate the performance gains attributed to the Parallel Log-structured File System (PLFS) being developed by EMC Corporation and the Los Alamos National Laboratory. Our evaluation of PLFS involves two high-performance computing systems with contrasting I/O backplanes and illustrates the varied improvements to I/O that result from the deployment of PLFS (ranging from up to 25x speed-up in I/O performance on a large I/O installation to 2x speed-up on the much smaller installation at the University of Warwick).
As the computing hardware landscape becomes more diverse and the complexity of hardware grows, a general-purpose parallel programming model capable of delivering (performance) portable codes has become highly attractive. Intel's OneAPI suite, which is based on the SYCL standard, aims to fill this gap using a modern C++ API. In this paper, we use SYCL to parallelize MGCFD, an unstructured-mesh computational fluid dynamics (CFD) code, to explore the current performance of SYCL. The code is benchmarked on several modern processor systems from Intel (including CPUs and the latest Xe LP GPU), AMD, ARM and Nvidia, making use of a variety of current SYCL compilers, with a particular focus on OneAPI and how it maps to Intel's CPU and GPU architectures. We compare performance with other parallelizations available in OP2, including SIMD, OpenMP, MPI and CUDA. The results are mixed; the performance of this class of applications, when parallelized with SYCL, highly depends on the target architecture and the compiler, but in many cases comes close to the performance of currently prevalent parallel programming models. However, it still requires different parallelization strategies or code-paths to be written for different hardware to obtain the best performance.
Exposure to nature is thought to benefit mental health and wellbeing. However, many studies consider greenspace as a single entity, which overlooks the potential significance of the various forms of greenspace, and natural greenspaces in particular. This study was designed to investigate the association between different types of greenspace and mental wellbeing. Drawing wellbeing and socioeconomic data from the Annual Population Survey (2012-2015), and shapefiles from the Greenspace Information for Greater London group, the amount of greenspace accessible within a 300 m walk of individuals' postcodes was calculated, and categorised according to type. Spatial Error Models were used to account for spatial patterns in the data. Natural greenspace was significantly associated with improved life satisfaction (B = 0.028, p < 0.001) and happiness (B = 0.023, p = 0.019) scores. The spatial autoregressive parameter (lambda) was small but significant (p < 0.001), implying slight second-order spatial variation in the model. These results imply that natural areas may be more important for hedonic mental wellbeing than other greenspaces. Future research is needed to explore causal relationships between exposure to greenspace and mental wellbeing outcomes.
We propose a novel payment-based incentive scheme for peer-to-peer (P2P) live media streaming. Using this approach, peers earn points by forwarding data to others. The data streaming is divided into fixed-length periods; during each of these periods, peers compete with each other for good parents (data suppliers) for the next period in a first-price-auction-like procedure using their points. We design a distributed algorithm to regulate peer competitions and consider various individual strategies for parent selection from a game-theoretic perspective. We then discuss possible strategies that can be used to maximize a peer's expected media quality by planning different bids for its substreams. Finally, in order to encourage off-session users to remain online and continue contributing to the network, we develop an optimal data forwarding strategy that allows peers to accumulate points that can be used in future services. Simulation results show that the proposed methods effectively differentiate the media qualities received by peers making different contributions (which originate from, for example, different forwarding bandwidths or servicing times) and at the same time maintain high overall system performance.
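A first-price-auction-style parent selection round can be sketched compactly: each candidate parent slot goes to the highest bidder among peers with sufficient points, and the winner pays its bid. The data structures and tie-breaking below are illustrative assumptions rather than the paper's protocol, which also covers bid planning across substreams and point accumulation for off-session peers.

```python
# Small sketch of first-price parent auctions for one streaming period.
# Data structures and tie-breaking are assumptions for illustration only.
def run_parent_auctions(bids, points):
    """bids: {parent_id: {peer_id: bid}}. Each parent slot goes to the highest bidder,
    who pays its bid (first-price); peers cannot bid more points than they hold."""
    assignments, spent = {}, {p: 0 for p in points}
    for parent, offers in bids.items():
        valid = {peer: b for peer, b in offers.items()
                 if b <= points[peer] - spent[peer]}
        if not valid:
            continue
        winner = max(valid, key=valid.get)
        assignments[parent] = winner
        spent[winner] += valid[winner]
    points_after = {p: points[p] - spent[p] for p in points}
    return assignments, points_after

points = {"alice": 10, "bob": 6, "carol": 3}
bids = {"parentA": {"alice": 4, "bob": 5}, "parentB": {"alice": 7, "carol": 2}}
print(run_parent_auctions(bids, points))
# parentA -> bob (pays 5), parentB -> alice (pays 7); carol keeps her points
```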
Semantic drift is a well-known concept in distributional semantics, which is used to demonstrate gradual, long-term changes in meanings and sentiments of words and is largely detectable by studying the composition of large corpora. In our previous work, which used ontological relationships between words and phrases, we established that certain kinds of semantic micro-changes can be found in social media emerging around natural hazard events, such as floods. Our previous results confirmed that semantic drift in social media can be used for early detection of floods and to increase the volume of 'useful' geo-referenced data for event monitoring. In this work we use deep learning in order to determine whether images associated with 'semantically drifted' social media tags reflect changes in crowd navigation strategies during floods. Our results show that alternative tags can be used to differentiate naive and experienced crowds witnessing flooding of various degrees of severity.
TCP over bandwidth asymmetric networks such as Cable TV and Asymmetric Digital Subscriber Loop (ADSL) exhibits different characteristics from TCP on symmetric links. A number of techniques have been proposed to address this problem. However, previous research has been largely focused on bulk transfers. This paper investigates the effects of bandwidth asymmetry on Web-like short-lived transfers. Two prediction models, one given in a closed form and the other in a recursive form, are presented for TCP transfers without and with ACK Filtering (AF), a representative optimizing technique for TCP transfers over bandwidth asymmetric links. Ns-2 based experiments show that these models can predict TCP transfer latency with a high degree of accuracy.
In this paper we present an alternative approach to the representation of simulation particles for unstructured electrostatic and electromagnetic PIC simulations. In our modified PIC algorithm we represent particles as having a smooth shape function limited by some specified finite radius, r(0). A unique feature of our approach is the representation of this shape by surrounding simulation particles with a set of virtual particles with delta shape, with fixed offsets and weights derived from Gaussian quadrature rules and the value of r(0). As the virtual particles are purely computational, they provide the additional benefit of increasing the arithmetic intensity of traditionally memory bound particle kernels. The modified algorithm is implemented within Sandia National Laboratories' unstructured EMPIRE-PIC code, for electrostatic and electromagnetic simulations, using periodic boundary conditions. We show results for a representative set of benchmark problems, including electron orbit, a transverse electromagnetic wave propagating through a plasma, numerical heating, and a plasma slab expansion. Good error reduction across all of the chosen problems is achieved as the particles are made progressively smoother, with the optimal particle radius appearing to be problem-dependent.
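The sketch below illustrates the flavour of deriving virtual-particle offsets and weights from Gaussian quadrature over a finite-radius shape function, here in 1D with a simple quadratic shape. The shape function, dimensionality and normalisation are illustrative assumptions and do not reproduce the EMPIRE-PIC implementation.

```python
# Illustrative 1D sketch: delta-shaped "virtual" particles whose offsets and weights
# come from Gauss-Legendre quadrature over a finite-radius shape function.
import numpy as np

def shape_function(x, r0):
    """A smooth, compactly supported shape: S(x) proportional to 1 - (x/r0)^2 on [-r0, r0]."""
    s = np.where(np.abs(x) <= r0, 1.0 - (x / r0) ** 2, 0.0)
    return s / (4.0 * r0 / 3.0)            # normalise so the shape integrates to 1

def virtual_particles(r0, n_points):
    """Offsets and weights of delta particles that reproduce integrals against the shape."""
    nodes, weights = np.polynomial.legendre.leggauss(n_points)   # quadrature on [-1, 1]
    offsets = r0 * nodes                                          # map to [-r0, r0]
    w = r0 * weights * shape_function(offsets, r0)                # include the Jacobian
    return offsets, w

offsets, w = virtual_particles(r0=0.5, n_points=4)
print(w.sum())                       # ≈ 1: total charge carried by the virtual cloud
print((w * offsets**2).sum())        # second moment of the smooth shape (r0^2 / 5 here)
```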
The architectures which support modern supercomputing machinery are as diverse today as at any point during the last twenty years. The variety of processor core arrangements, threading strategies and the arrival of heterogeneous computation nodes are driving modern-day solutions to petaflop speeds. The increasing complexity of such systems, as well as codes written to take advantage of the new computational abilities, poses significant challenges for existing techniques which aim to model and analyse the performance of such hardware and software. In this paper we demonstrate the use of post-execution analysis on trace-based profiles to support the construction of simulation-based models. This involves combining the runtime capture of call-graph information with computational timings, which in turn allows representative models of code behaviour to be extracted. The main advantage of this technique is that it largely automates performance model development, a burden associated with existing techniques. We demonstrate the capabilities of our approach using both the NAS Parallel Benchmark suite and a real-world supercomputing benchmark developed by the United Kingdom Atomic Weapons Establishment. The resulting models, developed in less than two hours per code, have a good degree of predictive accuracy. We also show how one of these models can be used to explore the performance of the code on over 16,000 cores, demonstrating the scalability of our solution.
The increasing availability of multi-core and multiprocessor architectures provides new opportunities for improving the performance of many computer simulations. Markov Chain Monte Carlo (MCMC) simulations are widely used for approximate counting problems, Bayesian inference and as a means for estimating very high-dimensional integrals. As such MCMC has found a wide variety of applications in fields including computational biology and physics, financial econometrics, machine learning and image processing. This paper presents a new method for reducing the run-time of Markov Chain Monte Carlo simulations by using SMP machines to speculatively perform iterations in parallel, reducing the runtime of MCMC programs whilst producing statistically identical results to conventional sequential implementations. We calculate the theoretical reduction in runtime that may be achieved using our technique under perfect conditions, and test and compare the method on a selection of multi-core and multi-processor architectures. Experiments are presented that show reductions in runtime of 35% using two cores and 55% using four cores.
Application server clusters are often used to service high-throughput web applications. In order to host more than a single application, an organisation will usually procure a separate cluster for each application. Over time the utilisation of the clusters will vary, leading to variation in the response times experienced by users of the applications. Techniques that statically assign servers to each application prevent the system from adapting to changes in the workload, and are thus susceptible to providing unacceptable levels of service. This paper investigates a system for allocating server resources to applications dynamically, thus allowing applications to automatically adapt to variable workloads. Such a scheme requires meticulous system monitoring, a method for switching application servers between server pools and a means of calculating when a server switch should be made (balancing switching cost against perceived benefits). Experimentation is performed using such a switching system on a Web application testbed hosting two applications across eight application servers. The testbed is used to compare several theoretically derived switching policies under a variety of workloads. Recommendations are made as to the suitability of different policies under different workload conditions.
Block-structured adaptive mesh refinement (AMR) is a technique that can be used when solving partial differential equations to reduce the number of cells necessary to achieve the required accuracy in areas of interest. These areas (shock fronts, material interfaces, etc.) are recursively covered with finer mesh patches that are grouped into a hierarchy of refinement levels. Despite the potential for large savings in computational requirements and memory usage without a corresponding reduction in accuracy, AMR adds overhead in managing the mesh hierarchy, adding complex communication and data movement requirements to a simulation. In this paper, we describe the design and implementation of a resident GPU-based AMR library, including: the classes used to manage data on a mesh patch, the routines used for transferring data between GPUs on different nodes, and the data-parallel operators developed to coarsen and refine mesh data. We validate the performance and accuracy of our implementation using three test problems and two architectures: an 8 node cluster, and 4,196 nodes of Oak Ridge National Laboratory's Titan supercomputer. Our GPU-based AMR hydrodynamics code performs up to 4.87x faster than the CPU-based implementation, and is scalable on 4,196 K20x GPUs using a combination of MPI and CUDA.
This paper investigates the reliability of application-level multicast based on a distributed hash table (DHT) in a highly dynamic network. Using a node residual lifetime model, we derive the stationary end-to-end delivery ratio of data streaming between a pair of nodes in the worst case, and show through numerical examples that in a practical DHT network, this ratio can be very low (e.g., less than 50%). Leveraging the property of heavy-tailed lifetime distribution, we then consider three optimizing techniques, namely Senior Member Overlay (SMO), Longer-Lived Neighbor Selection (LNS), and Reliable Route Selection (RRS), and present quantitative analysis of data delivery reliability under these schemes. In particular, we discuss the tradeoff between delivery ratio and the load imbalance among nodes. Simulation experiments are also used to evaluate the multicast performance under practical settings. Our model and analytic results provide useful tools for reliability analysis for other overlay-based applications (e.g., those involving persistent data transfers).
We present optimisations applied to a bespoke biophysical molecular dynamics simulation designed to investigate chromosome condensation. Our primary focus is on domain-specific algorithmic improvements to determining short-range interaction forces between particles, as certain qualities of the simulation render traditional methods less effective. We implement tuned versions of the code for both traditional CPU architectures and the modern many-core architecture found in the Intel Xeon Phi coprocessor and compare their effectiveness. We achieve speed-ups starting at a factor of 10 over the original code, facilitating more detailed and larger-scale experiments.
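The sort-and-sweep sketch below captures the general flavour of projection-based neighbour pruning for short-range forces: particles are sorted by their projection onto one axis, and each particle is only tested against later particles whose projections lie within the cutoff. The axis choice, data layout and brute-force distance check are simplified assumptions and not the tuned implementation described above.

```python
# Compact sketch of projection-based pruning for short-range pair search.
# Axis choice and data layout are assumptions for illustration only.
import numpy as np

def short_range_pairs(positions, cutoff, axis=0):
    """Return index pairs (i, j) with |r_i - r_j| < cutoff, using a sort-and-sweep
    over one coordinate to prune the quadratic all-pairs search."""
    order = np.argsort(positions[:, axis])
    sorted_pos = positions[order]
    pairs = []
    n = len(positions)
    for a in range(n):
        for b in range(a + 1, n):
            # once the 1D projections are further apart than the cutoff, no later
            # particle in the sorted order can be within range of particle a
            if sorted_pos[b, axis] - sorted_pos[a, axis] >= cutoff:
                break
            if np.linalg.norm(sorted_pos[b] - sorted_pos[a]) < cutoff:
                pairs.append((order[a], order[b]))
    return pairs

rng = np.random.default_rng(0)
positions = rng.uniform(0, 10, size=(500, 3))
print(len(short_range_pairs(positions, cutoff=1.0)))
```

The pruning is most effective when particle density is non-uniform along the chosen axis, which is the regime where this family of methods is reported to outperform standard cell-list approaches.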
Urban environments are restricted by various physical, regulatory and customary barriers such as buildings, one-way systems and pedestrian crossings. These features create challenges for predictive modelling in urban space, as most proximity-based models rely on Euclidean (straight-line) distance metrics which, given restrictions within the urban landscape, do not fully capture spatial urban processes. Here, we argue that road distance and travel time provide effective alternatives, and we develop a new low-dimensional Euclidean distance metric based on these distances using an isomap approach. The purpose of this is to produce a valid covariance matrix for Kriging. Our primary methodological contribution is the derivation of two symmetric dissimilarity matrices, with which it is possible to compute low-dimensional Euclidean metrics for the production of a positive definite covariance matrix with commonly utilised kernels. This new method is implemented into a Kriging predictor to estimate house prices on 3,669 properties in Coventry, UK. We find that a metric estimating a combination of road distance and travel time produces a superior house price predictor compared with alternative state-of-the-art methods, namely a standard Euclidean metric and a non-restricted road distance metric.
We propose a novel payment-based incentive mechanism for peer-to-peer (P2P) live media streaming. Using this approach, peers earn points by forwarding data to others; the data streaming is divided into fixed-length periods, during each of which peers compete with each other for good parents (data suppliers) for the next period in a first-price-auction-like procedure using their points. We design a distributed algorithm to regulate peer competitions, and consider various individual strategies for parent selection from a game-theoretic perspective. We then discuss possible strategies that can be used to maximize a peer's expected media quality by planning different bids for its substreams. Finally, in order to encourage off-session users to remain online and continue contributing to the network, we develop an optimal data forwarding strategy that allows peers to accumulate points that can be used in future services. Simulation results show that the proposed methods effectively differentiate the media qualities received by peers making different contributions (which originate from, for example, different forwarding bandwidths or servicing times), while at the same time maintaining high system-wide performance.
Wireless sensor networks (WSNs) are potentially a boon to the implementation and enforcement of noise codes which now exist for many major cities all over the world. Unfortunately, the high energy required by the noise measurement process and the reliance of sensor motes on batteries make the management of noise sensing WSNs cumbersome. Giving motes energy harvesting (EH) capabilities could alleviate such a problem, and several EH WSNs have already been demonstrated. Nevertheless, the high frequency nature of the data required to measure noise places significant additional challenges to the design of EH WSNs. In this paper, we characterize the capabilities and limitations of a WSN mote designed to measure noise. We identify, through specification analysis and experimentation, the performance-limiting factors in each step of the noise measurement process. These steps include sound gathering, data processing, and result transmission. The prospect of the entire process being powered by energy-harvesting means is also evaluated. For each step, we also discuss and recommend measures that would help improve the overall system performance.
It is common that Internet service hosting centres use several logical pools to assign server resources to different applications, and that they try to achieve the highest total revenue by making efficient use of these resources. In this paper, multi-tiered enterprise systems are modelled as multi-class closed queueing networks, with each network station corresponding to each application tier. In such queueing networks, bottlenecks can limit overall system performance, and thus should be avoided. We propose a bottleneck-aware server switching policy, which responds to system bottlenecks and switches servers to alleviate these problems as necessary. The switching engine compares the benefits and penalties of a potential switch, and makes a decision as to whether it is likely to be worthwhile switching. We also propose a simple admission control scheme, in addition to the switching policy, to deal with system overloading and optimise the total revenue of multiple applications in the hosting centre. Performance evaluation has been done via simulation and results are compared with those from a proportional switching policy and also a system that implements no switching policy. The experimental results show that the combination of the bottleneck-aware switching policy and the admission control scheme consistently outperforms the other two policies in terms of revenue contribution.
This paper addresses the problem of fault resilience of overlay-based live media streaming from two aspects: (1) how to construct a stable multicast tree that minimizes the negative impact of frequent member departures on the existing overlay, and (2) how to efficiently recover from packet errors caused by end-system or network failures. In particular, this paper makes two contributions: (1) A distributed Reliability-Oriented Switching Tree (ROST) algorithm that minimizes the failure correlation among tree nodes. By exploiting both bandwidth and time properties, the algorithm constructs a more reliable multicast tree than existing algorithms that solely minimize tree depth, while not compromising the quality of the tree in terms of service delay and incurring only a small protocol overhead; (2) A simple Cooperative Error Recovery (CER) protocol that helps recover from packet errors efficiently. Recognizing that a single recovery source is usually incapable of providing timely delivery of the lost data, the protocol recovers from data outages using the residual bandwidths from multiple sources, which are identified using a minimum-loss-correlation algorithm. Extensive simulations are conducted to demonstrate the effectiveness of the proposed schemes.
With the age of Exascale computing causing a diversification away from traditional CPU-based homogeneous clusters, it is becoming increasingly difficult to ensure that computationally complex codes are able to run on these emerging architectures. This is especially important for large physics simulations that are themselves becoming increasingly complex and computationally expensive. One proposed solution to the problem of ensuring these applications can run on the desired architectures is to develop representative mini-applications that are simpler and so can be ported to new frameworks more easily, but which are also representative of the algorithmic and performance characteristics of the original applications. In this paper we present BookLeaf, an unstructured Arbitrary Lagrangian-Eulerian mini-application to add to the suite of representative applications developed and maintained by the UK Mini-App Consortium (UK-MAC). First, we outline the reference implementation of our application in Fortran. We then discuss a number of alternative implementations using a variety of parallel programming models and discuss the issues that arise when porting such an application to new architectures. To demonstrate our implementation, we present a study of the performance of BookLeaf on a number of platforms using alternative designs, and we document a scaling study showing the behaviour of the application at scale.
Wireless sensor networks (WSNs) could potentially help in the measurement and monitoring of noise levels, an important step in mitigating and fighting noise pollution. Unfortunately, the high energy required by the noise measurement process and the reliance of sensor motes on batteries make the management of noise-sensing WSNs cumbersome. Giving motes energy harvesting (EH) capabilities could alleviate such a problem, and several EH-WSNs have already been demonstrated. Nevertheless, the high-frequency nature of the data required to measure noise places significant additional challenges to the design of EH-WSNs. In this paper, we present a design and prototype for a mote extension which enables the mote to detect noise levels while being powered by energy harvesting. The noise level detection carried out by the system relies primarily on the concept of peak detection. Results of performance testing are presented. Aside from the hardware design and prototype, we also discuss methods of assigning charge times for application scenarios where there are multiple pulse loads. We also propose a new opportunistic method for charge time determination. Experiments demonstrate that the new method could improve analytically-derived duty cycles by at least 350%.
Cities of the future are envisioned to be fully optimised, owing to technological advancements, distributed sensor networks and automation. With the proliferation of new data sources, opportunities also exist for better understanding of how people act and make decisions, as well as discerning the conditions in which they wish to live and what they expect from their surrounding environment. Following the recently proposed normative strand in urban planning, this study uses distributed personal underground development as a case study for extracting the values behind this emerging self-build movement, alongside observers' opinions obtained from associated web-based data.
This paper investigates the underlying impact of predictive inaccuracies on execution scheduling, with particular reference to execution time predictions. This study is conducted from two perspectives: from that of job selection and from that of resource allocation, both of which are fundamental components in execution scheduling. A new performance metric, termed the degree of misperception, is introduced to express the probability that the predicted execution times of jobs display different ordering characteristics from their real execution times due to inaccurate prediction. Specific formulae are developed to calculate the degree of misperception in both job selection and resource allocation scenarios. The parameters which influence the degree of misperception are also extensively investigated. The results presented in this paper are of significant benefit to scheduling approaches that take into account predictive data; the results are also of importance to the application of these scheduling techniques to real-world high-performance systems.
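For intuition, the following Monte Carlo sketch (not the paper's closed-form formulae; the uniform real-time range and the Gaussian relative error are assumptions) estimates a degree of misperception for job selection, i.e. the probability that two jobs' predicted execution times are ordered differently from their real execution times.

```python
import random

def degree_of_misperception(trials=100_000, rel_error=0.3):
    flips = 0
    for _ in range(trials):
        r1, r2 = random.uniform(10, 100), random.uniform(10, 100)  # real times
        p1 = r1 * (1 + random.gauss(0, rel_error))   # predicted = real + noise
        p2 = r2 * (1 + random.gauss(0, rel_error))
        if (r1 < r2) != (p1 < p2):                   # ordering disagrees
            flips += 1
    return flips / trials

print(f"degree of misperception ~ {degree_of_misperception():.3f}")
```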
Aggregation is the process of gathering and combining information from a number of sources. In peer-to-peer systems, aggregation is a basic component of a range of applications, including monitoring and complex-query resolution. Peer-to-peer aggregation services themselves are dependent on a number of other fundamental peer-to-peer services - directories, multicasting and system-size approximation. The overall performance characteristics of an aggregation service are affected by the chosen implementation method for these underlying services. To illustrate this relationship, aggregation techniques for internet-based peer-to-peer systems are surveyed and dissected into their component parts. We further consider the problem of running one-off aggregation queries in a peer-to-peer network. A new aggregation service, Bliksum, which uses a novel combination of underlying services, is introduced. Bliksum employs unstructured peer-to-peer techniques for node sampling, multicasting and system-size approximation, in combination with a method of building a temporary tree structure for aggregation itself. Unstructured peer-to-peer techniques have been shown to be highly resilient to node churn, avoiding the problem inherent in structured systems of maintaining the desired structure when the set of nodes changes rapidly. We present experiments showing that Bliksum retains these advantages while reducing communications cost and reducing information loss compared to pure gossip-based aggregation.
In this paper we present the outline of a novel electrostatic, second order Particle-in-Cell (PIC) algorithm that makes use of 'ghost particles' located around true particle positions in order to represent a charge distribution. We implement our algorithm within EMPIRE-PIC, a PIC code developed at Sandia National Laboratories. We test the performance of our algorithm on a variety of many-core architectures including NVIDIA GPUs, conventional CPUs, and Intel's Knights Landing. Our preliminary results show the viability of second order methods for PIC applications on these architectures when compared to previous generations of many-core hardware. Specifically, we see an order of magnitude improvement in performance for second order methods between the Tesla K20 and Tesla P100 GPU devices, despite only a 4x improvement in the theoretical peak performance between the devices. Although these initial results show a large increase in runtime over first order methods, we hope to be able to show improved scaling behaviour and increased simulation accuracy in the future.
Background: With urbanisation increasing, it is important to understand how to design changing environments to promote mental wellbeing. Evidence suggests that local-area proportions of green space may be associated with happiness and life satisfaction; however, the available evidence on such associations with more broadly defined mental wellbeing is still very scarce. This study aimed to establish whether the amount of neighbourhood green space was associated with mental wellbeing. Methods: Data were drawn from Understanding Society, a national survey of 30,900 individuals across 11,096 Census Lower-Layer Super Output Areas (LSOAs) in England, over the period 2009-2010. Measures included the multidimensional Warwick-Edinburgh Mental Well-Being Scale (SWEMWBS) and LSOA proportion of green space, which was derived from the General Land Use Database (GLUD), and were analysed using linear regression, while controlling for individual, household and area-level factors. Results: Those living in areas with greater proportions of green space had significantly higher mental wellbeing scores in unadjusted analyses (an expected increase of 0.17 points (95% CI 0.11, 0.23) in the SWEMWBS score for a standard deviation increase of green space). However, after adjustment for confounding by respondent sociodemographic characteristics and urban/rural location, the association was attenuated to the null (regression coefficient B = -0.01, 95% CI -0.08, 0.05, p = 0.712). Conclusions: While the green space in an individual's local area has been shown through other research to be related to aspects of mental health such as happiness and life satisfaction, the association with multidimensional mental wellbeing is much less clear from our results. While we did not find a statistically significant association between the amount of green space in residents' local areas and mental wellbeing, further research is needed to understand whether other features of green space, such as accessibility, aesthetics or use, are important for mental wellbeing.
Workload demands in e-commerce applications are very dynamic in nature; it is therefore essential for internet service providers to manage server resources effectively to maximize total revenue in server overloading situations. In this paper, a data mining technique is applied to a typical e-commerce application model for identification of composite association rules that capture user navigation patterns. Two algorithms are then developed based on the derived rules for admission control, service differentiation, and priority scheduling. Our approach takes the following aspects into consideration: (a) only final purchase requests result in company revenue; (b) any other request can potentially lead to final purchase, depending upon the likelihood of the navigation sequence that starts from current request and leads to final purchase; (c) service differentiation and priority assignment are based on aggregated confidence and average support of the composite association rules. As identification of composite association rules and computation of confidence and support of the rules can be pre-computed offline, the proposed approach incurs minimum performance overheads. The evaluation results suggest that the proposed approach is effective in terms of request management for revenue maximization.
This book constitutes the thoroughly refereed proceedings of the 5th International Workshop, PMBS 2014 in New Orleans, LA, USA in November 2014. The 12 full and 2 short papers presented in this volume were carefully reviewed and selected from 53 submissions. The papers cover topics on performance benchmarking and optimization; performance analysis and prediction; and power, energy and checkpointing.
COVID-19 was declared a pandemic by the World Health Organisation (WHO) on March 11th, 2020. With half of the world's countries in lockdown as of April due to this pandemic, monitoring and understanding the spread of the virus and infection rates and how these factors relate to behavioural and societal parameters is crucial for developing control strategies. This paper aims to investigate the effectiveness of masks, social distancing, lockdown and self-isolation for reducing the spread of SARS-CoV-2 infections. Our findings from agent-based simulation modelling showed that whilst requiring a lockdown is widely believed to be the most efficient method to quickly reduce infection numbers, the practice of social distancing and the usage of surgical masks can potentially be more effective than requiring a lockdown. Our multivariate analysis of simulation results using the Morris Elementary Effects Method suggests that if a sufficient proportion of the population uses surgical masks and follows social distancing regulations, then SARS-CoV-2 infections can be controlled without requiring a lockdown.
Energy consumption is rapidly becoming a limiting factor in scientific computing. As a result, hardware manufacturers increasingly prioritise energy efficiency in their processor designs. Performance engineers are also beginning to explore software optimisation and hardware/software co-design as a means to reduce energy consumption. Energy efficiency metrics developed by the hardware community are often re-purposed to guide these software optimisation efforts. In this paper we argue that established metrics, and in particular those in the Energy Delay Product (E·t^n) family, are unsuitable for energy-aware software optimisation. A good metric should provide meaningful values for a single experiment, allow fair comparison between experiments, and drive optimisation in a sensible direction. We show that E·t^n metrics are unable to fulfil these basic requirements and present suitable alternatives for guiding energy-aware software optimisation. We finish with a practical demonstration of the utility of our proposed metrics.
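For concreteness, the short sketch below evaluates the E·t^n family on two hypothetical runs of the same code (the energy and runtime figures are invented for the example); note how the preferred configuration flips with the choice of n, illustrating how strongly the verdict depends on the exponent.

```python
def edp(energy_j, time_s, n=1):
    # E * t^n: n=0 is pure energy, n=1 the classic energy-delay product, etc.
    return energy_j * time_s ** n

baseline  = {"energy_j": 5000.0, "time_s": 100.0}
optimised = {"energy_j": 5500.0, "time_s": 80.0}    # faster but hungrier (assumed)

for n in (0, 1, 2):
    b, o = edp(**baseline, n=n), edp(**optimised, n=n)
    print(f"E*t^{n}: baseline={b:.3g} optimised={o:.3g} "
          f"-> prefers {'optimised' if o < b else 'baseline'}")
```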
Automating the execution of workflows (or business processes) on computer resources has been the subject of much research. However, many workflow scenarios still require human involvement, which introduces additional authorization concerns. Role-Based Authorization Control (RBAC), under which the users are assigned to certain roles while the roles are associated with prescribed permissions, is a popular authorisation control scheme. When we allocate resources for workloads and plan system capacities, it is often assumed that when a task is allocated to a resource, the resource will accept the task and start the execution once the processor becomes available. However, the authorization policies impose further constraints on task executions, and therefore may incur a performance penalty and affect both application- and system-oriented performance. This paper investigates the issue of allocating resources for running workflows under the role-based authorization control. The resource allocation strategies are developed in this paper for both human resources and computing resources. The allocation strategy for human resources takes into account the authorization constraints and establishes the optimization equation subject to the constraint of the budget available to hire human resources. Then the optimization equation is solved to obtain the number of human resources allocated to each authorization role. The allocation strategy for computing resources also takes into account authorization constraints, calculating not only the number of computing resources, but also the proportion of processing capacity in each resource allocated to serve the tasks assuming each role. The simulation experiments have been conducted to verify the effectiveness of the developed allocation strategies. The experimental results show that the allocation strategy developed in this paper outperforms the traditional allocation strategies, which do not consider authorization constraints, in terms of both average response time and resource utilization.
In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications, a ubiquitous class of parallel algorithms used for the solution of a number of scientific and engineering applications. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P). Benchmark results are presented for problem classes A to C and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions will far exceed those of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithm on GPU-based architectures.
Trophic coherence, a measure of a graph's hierarchical organisation, has been shown to be linked to a graph's structural and dynamical aspects such as cyclicity, stability and normality. Trophic levels of vertices can reveal their functional properties, and can be used to partition and rank the vertices accordingly. Trophic levels, and hence trophic coherence, can only be defined on graphs with basal vertices, i.e. vertices with zero in-degree. Consequently, trophic analysis of graphs had been restricted until now. In this paper we introduce a hierarchical framework which can be defined on any simple graph. Within this general framework, we develop several metrics: hierarchical levels, a generalisation of the notion of trophic levels, influence centrality, a measure of a vertex's ability to influence dynamics, and democracy coefficient, a measure of overall feedback in the system. We discuss how our generalisation relates to previous attempts and what new insights it offers into the topological and dynamical aspects of graphs. Finally, we show how the hierarchical structure of a network relates to the incidence rate in a SIS epidemic model and the economic insights we can gain through it.
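As background for the generalisation described above, the sketch below computes classical trophic levels, which do require basal vertices, on a small illustrative directed graph; the adjacency matrix is an assumption, and the linear system s_i = 1 + (1/k_i^in) * sum_j A_ji s_j is the standard definition rather than the paper's new hierarchical levels.

```python
import numpy as np

def trophic_levels(A):
    # A[i, j] = 1 means an edge from i to j; basal vertices have in-degree zero
    k_in = A.sum(axis=0)
    n = A.shape[0]
    # rows for basal vertices reduce to s_i = 1; others average their in-neighbours
    W = (A / np.maximum(k_in, 1)).T * (k_in > 0)[:, None]
    return np.linalg.solve(np.eye(n) - W, np.ones(n))

# small food-web-like example: vertices 0 and 1 are basal, vertex 4 sits on top
A = np.zeros((5, 5))
for i, j in [(0, 2), (1, 2), (1, 3), (2, 4), (3, 4)]:
    A[i, j] = 1
print(trophic_levels(A))   # expect roughly [1, 1, 2, 2, 3]
```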
Performance modelling of unstructured mesh codes is a challenging process, due to the difficulty of capturing their memory access patterns and their communication patterns at varying scale. In this paper we first develop extensions to an existing runtime performance model, aimed at overcoming the former, which we validate on up to 1,024 cores of a Haswell-based cluster, using both a geometric partitioning algorithm and ParMETIS to partition the input deck, with a maximum absolute runtime error of 12.63% and 11.55% respectively. To overcome the latter, we develop an application representative of the mesh partitioning process internal to an unstructured mesh code. This application is able to generate partitioning data that is usable with the performance model to produce predicted application runtimes within 7.31% of those produced using empirically collected data. We then demonstrate the use of the performance model by undertaking a predictive comparison among several partitioning algorithms on up to 30,000 cores. Additionally, we correctly predict the ineffectiveness of the geometric partitioning algorithm at 512 and 1024 cores.
Mobile sensors can move and self-deploy into a network. While focusing on the problems of coverage, existing deployment schemes mostly over-simplify the conditions for network connectivity: they either assume that the communication range is large enough for sensors in geometric neighborhoods to obtain each other's location by local communications, or assume a dense network that remains connected. At the same time, an obstacle-free field or full knowledge of the field layout is often assumed. We present new schemes that are not restricted by these assumptions, and thus adapt to a much wider range of application scenarios. While maximizing sensing coverage, our schemes can achieve connectivity for a network with arbitrary sensor communication/sensing ranges or node densities, at the cost of a small moving distance; the schemes do not need any knowledge of the field layout, which can be irregular and have obstacles/holes of arbitrary shape. Simulation results show that the proposed schemes achieve the targeted properties.
We analyse gather-scatter performance bottlenecks in molecular dynamics codes and the challenges that they pose for obtaining benefits from SIMD execution. This analysis informs a number of novel code-level and algorithmic improvements to Sandia's miniMD benchmark, which we demonstrate using three SIMD widths (128-, 256- and 512-bit). The applicability of these optimisations to wider SIMD is discussed, and we show that the conventional approach of exposing more parallelism through redundant computation is not necessarily best. In single precision, our optimised implementation is up to 5x faster than the original scalar code running on Intel Xeon processors with 256-bit SIMD, and adding a single Intel Xeon Phi coprocessor provides up to an additional 2x performance increase. These results demonstrate: (i) the importance of effective SIMD utilisation for molecular dynamics codes on current and future hardware; and (ii) the considerable performance increase afforded by the use of Intel Xeon Phi coprocessors for highly parallel workloads.
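The pattern at issue is sketched below in NumPy (an illustration with random positions and fake neighbour lists, not miniMD itself): each particle's force loop gathers neighbour positions through an index array and accumulates the result, and it is these indexed loads and stores that vectorise poorly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, max_neigh = 1000, 32
pos = rng.random((n, 3)) * 10.0
neigh = rng.integers(0, n, size=(n, max_neigh))     # hypothetical neighbour lists

def forces(pos, neigh, cutoff=2.5):
    f = np.zeros_like(pos)
    for i in range(len(pos)):
        d = pos[i] - pos[neigh[i]]                   # gather: indexed loads of neighbours
        r2 = (d * d).sum(axis=1)
        mask = (r2 < cutoff ** 2) & (r2 > 1e-12)
        coef = np.where(mask, 1.0 / np.where(mask, r2, 1.0), 0.0)   # toy pair force
        f[i] += (coef[:, None] * d).sum(axis=0)      # accumulate onto particle i
        # Newton's-third-law variants would also scatter -coef*d back onto the
        # neighbours, which is the scatter half of the bottleneck
    return f

print(forces(pos, neigh).shape)
```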
The demographics and landscape of cities are changing rapidly, and there is an emphasis on better understanding the factors which influence citizen happiness in order to design smarter urban systems. Few studies have attempted to understand how large-scale sentiment maps to urban human geography. Inferring sentiment from social media data is one such scalable solution. In this paper, we apply natural language processing (NLP) techniques to 0.4 million geo-tagged Tweets in the Greater London area to understand the influence of socioeconomic and urban geography parameters on happiness. Our results not only verify established thinking (that job opportunities correlate with positive sentiments) but also reveal two insights: (1) happiness is negatively correlated with number of children, and (2) happiness has a U-shaped (parabolic) relationship with access to public transportation. The latter implies that the happiest people are those who have good access to public transport, or such poor access that they use private transportation. The number of jobs, children, and transportation availability are everyday facets of urban living and individually account for up to 47% of the variations in people's happiness. Our results show that they influence happiness more significantly than long-term socioeconomic parameters such as degradation, education, income, housing, and crime. This study will enable urban planners and system designers to move beyond the traditional cost-benefit methodology and to incorporate citizens' happiness.
With the diversification of HPC architectures beyond traditional CPU-based clusters, a number of new frameworks for performance portability across architectures have arisen. One way of implementing such frameworks is to use C++ templates and lambda expressions to design loop-like functions. However, lower level programming APIs that these implementations must use are often designed with C in mind and do not specify how they interact with C++ features such as lambda expressions. This paper discusses a change to the behavior of the OpenMP specification with respect to lambda expressions such that when functions generated by lambda expressions are called inside GPU regions, any pointers used in the lambda expression correctly refer to device pointers. This change has been implemented in a branch of the Clang C++ compiler and demonstrated with two representative codes. This change has also been accepted into the draft OpenMP specification for inclusion in OpenMP 5. Our results show that the implicit mapping of lambda expressions always exhibits identical performance to an explicit mapping but without breaking the abstraction provided by the high level frameworks.
The divergence between processor and memory performance has been a well discussed aspect of computer architecture literature for some years. The recent use of multi-core processor designs has, however, brought new problems to the design of memory architectures - as more cores are added to each successive generation of processor, equivalent improvement in memory capacity and memory sub-systems must be made if the compute components of the processor are to remain sufficiently supplied with data. These issues combined with the traditional problem of designing cache-efficient code help to ensure that memory remains an on-going challenge for application and machine designers. In this paper we present a comprehensive discussion of WMTools - a trace-based toolkit designed to support the analysis of memory allocation for parallel applications. This paper features an extended discussion of the WMTrace tracing tool presented in previous work including a revised discussion on trace-compression and several refinements to the tracing methodology to reduce overheads and improve tool scalability. The second half of this paper features a case study in which we apply WMTools to five parallel scientific applications and benchmarks, demonstrating its effectiveness at recording high-water mark memory consumption as well as memory use per-function over time. An in-depth analysis is provided for an unstructured mesh benchmark which reveals significant memory allocation imbalance across its participating processes. This study demonstrates the use of WMTools in elucidating memory allocation issues in high-performance scientific codes.
The cost of state-of-the-art supercomputing resources makes each individual purchase a lengthy and expensive process. Often each candidate architecture will need to be benchmarked using a variety of tools to assess likely performance. However, benchmarking alone only provides a limited insight into the suitability of each architecture for key codes and will give potentially misleading results when assessing their scalability. In this study the authors present a case study of the application of recently developed performance models of the Chimaera benchmarking code written by the United Kingdom Atomic Weapons Establishment (AWE), with a view to analysing how the code will perform and scale on a medium-sized, commodity-based InfiniBand cluster. The models are validated and demonstrate a greater than 90% accuracy for an existing InfiniBand machine; the models are then used as the basis for predicting code performance on a variety of alternative hardware configurations which include changes in the underlying network, the use of faster processors and the use of a higher core density per processor. The results demonstrate the compute-bound nature of Chimaera and its sensitivity to network latency at increased processor counts. By using these insights the authors are able to discuss potential strategies which may be employed during the procurement of future mid-range clusters for wavefront-rich workloads.
Under-reporting and delayed reporting of rape crime are severe issues that can complicate the prosecution of perpetrators and prevent rape survivors from receiving needed support. Building on a massive database of publicly available criminal reports from two US cities, we develop a machine learning framework to predict delayed reporting of rape to help tackle this issue. Motivated by large and unexplained spatial variation in reporting delays, we build predictive models to analyse spatial, temporal and socio-economic factors that might explain this variation. Our findings suggest that we can explain a substantial proportion of the variation in rape reporting delays using only openly available data. The insights from this study can be used to motivate targeted, data-driven policies to assist vulnerable communities. For example, we find that younger rape survivors and crimes committed during holiday seasons exhibit longer delays. Our insights can thus help organizations focused on supporting survivors of sexual violence to provide their services at the right place and time. Due to the non-confidential nature of the data used in our models, even community organizations lacking access to sensitive police data can use these findings to optimize their operations.
Cloud computing and mobile computing are two of the most influential technologies that look set to change the face of computing in the coming years. The combination of the two provides us with an unprecedented opportunity to provide highly portable and yet content-rich and computation-intensive services to the end user. In this paper we investigate the possibility of using code/task offload techniques between mobile and cloud in order to reduce the energy cost of workflows deployed on mobile devices. We first present a vision in which mobile devices are coordinated over a network, which is equipped with a layer of cloud-like infrastructures which we term cloudlets, whose computational resources can be leveraged by the mobile devices to host the execution of mission-critical mobile workflows in an energy-aware manner. We then build a model that encompasses various characteristics of the workflow's software and the network's hardware devices. With this model, we construct the objective functions that guide the offload decisions. We then present a heuristic algorithm that produces static and dynamic offload plans according to these objective functions and their variations both statically and dynamically. We conclude the paper with a series of simulation studies, the results of which give insight into the offload-ability of workflows of different characteristics. The results also illustrate how different hardware specifications can affect offload efficiency. These studies indicate that our offload algorithm can significantly improve the energy efficiency and execution speed of mobile workflows.
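A minimal sketch of the kind of objective that can guide such an offload decision is shown below (with assumed device and network parameters, not the paper's model): offloading wins when the energy spent transmitting the task's data and idling during remote execution is lower than the energy of executing the task locally.

```python
def local_energy(cycles, power_active_w, cpu_hz):
    # energy to run the task on the mobile device itself
    return power_active_w * cycles / cpu_hz

def offload_energy(data_bytes, bandwidth_bps, power_radio_w,
                   remote_time_s, power_idle_w):
    # energy to ship the data to a cloudlet plus idle energy while waiting
    transfer = power_radio_w * (8 * data_bytes / bandwidth_bps)
    return transfer + power_idle_w * remote_time_s

# hypothetical task and device characteristics
task = dict(cycles=2e9, data_bytes=5e6)
e_local = local_energy(task["cycles"], power_active_w=2.0, cpu_hz=1.5e9)
e_off = offload_energy(task["data_bytes"], bandwidth_bps=10e6, power_radio_w=1.2,
                       remote_time_s=0.4, power_idle_w=0.3)
print("offload" if e_off < e_local else "run locally", e_local, e_off)
```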
Spatio-temporal data generated by sensors in the environment, such as traffic data, is widely used in the transportation domain. However, learning from and analysing such data is increasingly problematic as the volume of data grows. Therefore, methods are required to reduce the quantity of data needed for multiple types of subsequent analysis without losing significant information. In this paper, we present the 2-Dimensional Spatio-Temporal Reduction method (2D-STR), which partitions the spatio-temporal matrix of a dataset into regions of similar instances, and reduces each region to a model of its instances. The method is shown to be effective at reducing the volume of a traffic dataset to
Data scientists have applied various analytic models and techniques to address the oft-cited problems of large volume, high velocity data rates and diversity in semantics. Such approaches have traditionally employed analytic techniques in a streaming or batch processing paradigm. This paper presents CRUCIBLE, a first-in-class framework for the analysis of large-scale datasets that exploits both streaming and batch paradigms in a unified manner. The CRUCIBLE framework includes a domain-specific language for describing analyses as a set of communicating sequential processes, a common runtime model for analytic execution in multiple streamed and batch environments, and an approach to automating the management of cell-level security labelling that is applied uniformly across runtimes. This paper shows the applicability of CRUCIBLE to a variety of state-of-the-art analytic environments, and compares a range of runtime models for their scalability and performance against a series of native implementations. The work demonstrates the significant impact of runtime model selection, including improvements of between 2.3x and 480x between runtime models, with an average performance gap of just 14x between CRUCIBLE and a suite of equivalent native implementations.
In a Cloud system, a number of services are often deployed with each service being hosted by a collection of Virtual Machines (VMs). The services may interact with each other and the interaction patterns may be dynamic, varying according to the system information at runtime. These factors impose a challenge in determining the amount of resources required to deliver a desired level of QoS for each service. In this paper, we present a method to determine the sufficient number of VMs for the interacting Cloud services. The proposed method borrows ideas from the Leontief Open Production Model in economics. Further, this paper develops a communication-aware strategy to place the VMs on Physical Machines (PMs), aiming to minimize the communication costs incurred by the service interactions. The developed communication-aware placement strategy is formalized in such a way that it does not need to know the specific communication pattern between individual VMs. A genetic algorithm is developed to find a VM-to-PM placement with low communication costs. Simulation experiments have been conducted to evaluate the performance of the developed communication-aware placement framework. The results show that compared with the placement framework aiming to use the minimal number of PMs to host VMs, the proposed communication-aware framework is able to reduce the communication cost significantly with only a very small increase in PM usage.
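The sketch below illustrates a Leontief-style calculation in the spirit of the abstract (the interaction matrix, demand vector and per-VM capacities are invented, and the paper's exact formulation may differ): if A[i, j] is the number of requests service i issues to service j per request it handles and d is the external demand, the sustained throughput each service needs solves x = A^T x + d, and VM counts follow from per-VM capacity.

```python
import numpy as np

A = np.array([[0.0, 0.5, 0.2],     # service 0 calls services 1 and 2
              [0.0, 0.0, 1.0],     # service 1 calls service 2
              [0.0, 0.0, 0.0]])
d = np.array([100.0, 20.0, 0.0])   # external requests/sec arriving at each service

# Leontief open-model solution: x = (I - A^T)^{-1} d
x = np.linalg.solve(np.eye(3) - A.T, d)

per_vm_capacity = np.array([40.0, 60.0, 80.0])   # requests/sec one VM can sustain
vms = np.ceil(x / per_vm_capacity).astype(int)
print(x, vms)                                    # e.g. [100, 70, 90] -> [3, 2, 2]
```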
Energy harvesting wireless sensor network nodes would not be able to operate without duty cycling. In TinyOS, duty cycling is supported through Low Power Listening or LPL. LPL is sender-centric: the longer the wakeup interval, the more power a receiver saves, at the cost of more energy per transmission for the sender. Due to the limitations of energy storage technologies, there is a limit to the sender wakeup interval which energy harvesting senders could support. Currently, the limit could be derived computationally or experimentally. Computational derivation is overly conservative, while manual experimentation is labour intensive. In this paper, we present a protocol which enables sensor nodes to determine the wakeup interval limit experimentally without human intervention or the aid of other nodes. Not only does the protocol allow for easier determination of the said limit, it also allows network nodes to adjust to environmental changes that nodes encounter while in deployment, such as capacitor ageing.
The cooperation of end users can be exploited to boost the performance of high-bandwidth multicast. While intra-overlay cooperation, the mechanism for cooperation within a single overlay (multicast group), has been extensively studied, little attention has been paid to inter-overlay cooperation. In this paper we explore the possibility and effects of cooperation among co-existing heterogeneous overlays in the context of live media streaming, where bandwidth is the bottleneck resource. To motivate this kind of cooperation, we design a reputation-based incentive mechanism that differentiates users' streaming qualities based on the amount of data actually forwarded by individual users. This not only stimulates users to contribute as much forwarding bandwidth as possible, but also motivates those with spare bandwidth in resource-rich overlays to find downstream users in external, often resource-poor overlays so as to accumulate more reputation scores. Under this mechanism, an adaptive bandwidth exporting/reclaiming algorithm is developed which allows users to dynamically allocate bandwidth according to the resource availability of multiple overlays. Simulation results show enhanced system performance in terms of users' average media quality.
Purpose There is significant national interest in tackling issues surrounding the needs of vulnerable children and adults. This paper aims to argue that much value can be gained from the application of new data-analytic approaches to assist with the care provided to vulnerable children. This paper highlights the ethical and information governance issues raised in the development of a research project that sought to access and analyse children's social care data. Design/methodology/approach The paper documents the process involved in identifying, accessing and using data held in Birmingham City Council's social care system for collaborative research with a partner organisation. This includes identifying the data, its structure and format; understanding the Data Protection Act 1998 and 2018 (DPA) exemptions that are relevant to ensure that legal obligations are met; data security and access management; and the ethical and governance approval process. Findings The findings include approaches to understanding the data, its structure and accessibility; the tasks involved in addressing ethical and legal obligations; and the requirements of the ethical and governance processes. Originality/value The aim of this research is to highlight the potential use of new data-analytic techniques to examine the flow of children's social care data from referral, through the assessment process, to the resulting service provision. Data held by Birmingham City Council are used throughout, and this paper highlights key ethical and information governance issues which were addressed in preparing and conducting the research. The findings provide insight for other data-led studies of a similar nature.
Distributed e-business application platforms such as the web services framework will require sophisticated workload management infrastructures for the routing of client requests to the most appropriate services. This paper introduces a novel and dynamic predictive framework that uses both analytical performance modelling and historical data to extrapolate a request's predicted performance under varying workloads and conditions on different systems. Such a framework can facilitate a routing algorithm in meeting quality of service-based metric requirements. An e-business benchmark is used as a demonstrator to provide an insight into the techniques used in predicting request response times.
When investigating the performance of running scientific/commercial workflows in parallel and distributed systems, we often take into account only the resources allocated to the tasks constituting the workflow, assuming that computational resources will accept the tasks and execute them to completion once the processors are available. In reality, and in particular in Grid or e-business environments, security policies may be implemented in the individual organisations in which the computational resources reside. It is therefore expedient to have methods to calculate the performance of executing workflows under security policies. Authorisation control, which specifies who is allowed to perform which tasks when, is one of the most fundamental security considerations in distributed systems such as Grids. Role-Based Access Control (RBAC), under which the users are assigned to certain roles while the roles are associated with prescribed permissions, remains one of the most popular authorisation control mechanisms. This paper presents a mechanism to theoretically compute the performance of running scientific workflows under RBAC authorisation control. Various performance metrics are calculated, including both system-oriented metrics (such as system utilisation, throughput and mean response time) and user-oriented metrics (such as mean response time of the workflows submitted by a particular client). With this work, if a client informs an organisation of the workflows they are going to submit, the organisation is able to predict the performance of these workflows running in its local computational resources (e.g. a high-performance cluster) enforced with RBAC authorisation control, and can also report client-oriented performance to each individual user.
With the widespread use of the Internet, the number of web services that can provide similar functionality has increased rapidly in recent years. Web service selection has to be based on some non-functional attributes of the services, such as the quality of service (QoS). In this chapter, we use a server switching service that is commonly used in Internet hosting environments to explain how an agent can use a performance model to evaluate services and select the most suitable services among a number of functionally similar services returned by the service discovery. The various criteria that can be used to assess QoS are introduced in this chapter, including mean response time, throughput, system utilisation and others closely related to business such as revenue and operating costs. Service selection in the chosen case study depends on the quality and suitability of various switching policies; in other words, different switching policies can be selected depending on the QoS of the services and the run-time system state. Since the system performance can be evaluated using an analytic model, the QoS of services is assessed based on the output of the performance model.
Pipelined wavefront computations are a ubiquitous class of parallel algorithm used for the solution of a number of scientific and engineering applications. This paper investigates three optimisations to the generic pipelined wavefront algorithm, which are evaluated through the use of predictive analytic models. The modelling of potential optimisations is supported by a recently developed reusable LogGP-based analytic performance model, which allows the speculative evaluation of each optimisation within the context of an industry-strength pipelined wavefront benchmark, developed and maintained by the United Kingdom Atomic Weapons Establishment (AWE). The paper details the quantitative and qualitative benefits of: (1) parallelising computation blocks of the wavefront algorithm using OpenMP; (2) a novel restructuring/shifting of computation within the wavefront code; and (3) performing simultaneous multiple sweeps through the data grid.
Energy harvesting wireless sensor network nodes would not be able to operate without duty cycling. In TinyOS, duty cycling is supported through Low Power Listening or LPL. LPL is sender-centric: the longer the wakeup interval, the more power a receiver saves, at the cost of more energy per transmission for the sender. Due to the limitations of energy storage technologies, there is a limit to the sender wakeup interval which energy harvesting senders could support. Currently, the limit could be derived computationally or experimentally. Computational derivation is overly conservative, while manual experimentation is labor intensive. In this paper, we present a protocol which enables sensor nodes to autonomously determine the wakeup interval limit experimentally, using a client-server architecture. Not only does the protocol allow for easier determination of the said limit, it also allows network nodes to adjust to environmental changes that nodes encounter while in deployment, such as capacitor ageing.
Existing algorithms for orienting sensors in directional sensor networks have primarily concerned themselves with the problem of maximizing the number of covered targets, treating target identification as a non-issue. Such an assumption, however, does not hold true in all situations. In this paper, three heuristic algorithms for choosing active sensors and orienting them with the goal of balancing coverage and identifiability are presented. The first algorithm is parameterizable, allowing users to specify the desired level of trade-off between coverage and identifiability. The other two algorithms are not parameterizable, and are more suited for situations where coverage is preferred over identifiability. The performance of the algorithms is verified via simulations, which show that they confer increased target identifiability compared to algorithms originally designed simply to maximize the number of targets covered.
Advances in processor design have delivered performance improvements for decades. As physical limits are reached, refinements to the same basic technologies are beginning to yield diminishing returns. Unsustainable increases in energy consumption are forcing hardware manufacturers to prioritise energy efficiency in their designs. Research suggests that software modifications may be needed to exploit the resulting improvements in current and future hardware. New tools are required to capitalise on this new class of optimisation. In this article, we present the Power Optimised Software Envelope (POSE) model, which allows developers to assess the potential benefits of power optimisation for their applications. The POSE model is metric agnostic, and in this article we provide derivations using the established Energy-Delay Product metric and the novel Energy-Delay Sum and Energy-Delay Distance metrics that we believe are more appropriate for energy-aware optimisation efforts. We demonstrate POSE on three platforms by studying the optimisation characteristics of applications from the Mantevo benchmark suite. Our results show that the Pathfinder application has very little scope for power optimisation while TeaLeaf has the most, with all other applications in the benchmark suite falling between the two. Finally, we extend our POSE model with a formulation known as System Summary POSE, a meta-heuristic that allows developers to assess the scope a system has for energy-aware software optimisation independent of the code being run.
This paper investigates the resource allocation problem for a type of workflow in pervasive computing. These workflows are abstracted from the enterprise-level applications in the business or commerce area. The activities in these workflows require not only computing resources, but also human resources. Human involvement introduces additional security concerns. When we plan/allocate resource capacities, we often assume that when a task is allocated to a resource, the resource will accept the task and start the execution once the processor becomes available. However, the security policies impose further constraints on task executions, and therefore may affect both application- and system-oriented performance. Authorization is an important aspect in security. This paper investigates the issue of allocating resources for running workflows under the role-based authorization control, which is one of the most popular authorization mechanisms. By taking into account the authorization constraints, the resource allocation strategies are developed in this paper for both human resources and computing resources. In the allocation strategy for human resources, the optimization equation is constructed subject to the constraint of the budget available to hire human resources. Then the optimization equation is solved to obtain the number of human resources allocated to each authorization role. The allocation strategy for computing resources calculates not only the number of computing resources, but also the proportion of processing capacity in each resource allocated to serve the tasks assuming each role. The simulation experiments have been conducted to verify the effectiveness of the developed allocation strategies. The experimental results show that the allocation strategy developed in this paper outperforms the traditional allocation strategies, which do not consider authorization constraints, in terms of both average response time and resource utilization.
In this paper we present a predictive performance model for a key biomedical imaging application found as part of the U.K. e-Science Information eXtraction from Images (IXI) project. This code represents a significant challenge for our existing performance prediction tools as it has internal structures that exhibit highly variable runtimes depending on qualities in the input data provided. Since the runtime can vary by more than an order of magnitude, it has been difficult to apply meaningful quality of service criteria to workflows that use this code. The model developed here is used in the context of an interactive scheduling system which provides rapid feedback to the users, allowing them to tailor their workloads to available resources or to allocate extra resources to scheduled workloads.
This paper develops a plug-and-play reusable LogGP model that can be used to predict the runtime and scaling behavior of different MPI-based pipelined wavefront applications running on modern parallel platforms with multicore nodes. A key new feature of the model is that it requires only a few simple input parameters to project performance for wavefront codes with different structure to the sweeps in each iteration as well as different behavior during each wavefront computation and/or between iterations. We apply the model to three key benchmark applications that are used in high performance computing procurement, illustrating that the model parameters yield insight into the key differences among the codes. We also develop new, simple and highly accurate models of MPI send, receive, and group communication primitives on the dual-core Cray XT system. We validate the reusable model applied to each benchmark on up to 8192 processors on the XT3/XT4. Results show excellent accuracy for all high performance application and platform configurations that we were able to measure. Finally we use the model to assess application and hardware configurations, develop new metrics for procurement and configuration, identify bottlenecks, and assess new application design modifications that, to our knowledge, have not previously been explored.
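To illustrate the flavour of such a model (a heavily simplified sketch with assumed LogGP parameters, message sizes and grid dimensions, not the paper's reusable model), the snippet below costs a pipelined sweep over a Px-by-Py logical processor grid as a pipeline-fill term plus a steady-state term, with each stage paying one block of compute and one LogGP-style message.

```python
def loggp_msg(msg_bytes, L=5e-6, o=2e-6, G=5e-10):
    # LogGP cost of one point-to-point message: latency, two overheads, per-byte gap
    return L + 2 * o + (msg_bytes - 1) * G

def wavefront_time(Px, Py, blocks_per_sweep, sweeps, block_compute_s, msg_bytes):
    stage = block_compute_s + loggp_msg(msg_bytes)
    # (Px + Py - 2) stages of pipeline fill before every rank is busy,
    # then one stage per block in steady state, repeated for each sweep
    steps = sweeps * (blocks_per_sweep + Px + Py - 2)
    return steps * stage

# example: 16x16 grid, 8 sweeps of 64 blocks, 50 us compute per block, 8 KiB messages
print(wavefront_time(16, 16, 64, 8, 50e-6, 8192))
```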
We apply a novel clustering technique to London's bikesharing network, deriving distinctive behavioral patterns and assessing community interactions and spatio-temporal dynamics. The analyses reveal self-contained, interconnected and hybrid clusters that mimic London's physical structure. Exploring changes over time, we find geographically isolated and specialized communities to be relatively consistent, while the remaining system exhibits volatility. We increase understanding of the collective behavior of the bikesharing users.
Node sampling services provide peers in a peer-to-peer system with a source of randomly chosen addresses of other nodes. Ideally, samples should be independent and uniform. The restrictions of a distributed environment, however, introduce various dependencies between samples. We review gossip-based sampling protocols proposed in previous work, and identify sources of inaccuracy. These include replicating the items from which samples are drawn, and imprecise management of the process of refreshing items. Based on this analysis, we propose a new protocol, Eddy, which aims to minimize temporal and spatial dependencies between samples. We demonstrate, through extensive simulation experiments, that these changes lead to an improved sampling service. Eddy maintains a balanced distribution of items representing active system nodes, even in the face of realistic levels of message loss and node churn. As a result, it behaves more like a centralized random number generator than previous protocols. We demonstrate this by showing that using Eddy improves the accuracy of a simple algorithm that uses random samples to estimate the size of a peer-to-peer network.
It is common that multiple cores reside on the same chip and share the on-chip cache. As a result, resource sharing can cause performance degradation of co-running jobs. Job co-scheduling is a technique that can effectively alleviate this contention, and many co-schedulers have been reported in related literature. Most solutions, however, do not aim to find the optimal co-scheduling solution. Being able to determine the optimal solution is critical for evaluating co-scheduling systems. Moreover, most co-schedulers only consider serial jobs, and there often exist both parallel and serial jobs in real-world systems. In this paper a graph-based method is developed to find the optimal co-scheduling solution for serial jobs; the method is then extended to incorporate parallel jobs, including both multi-process and multi-threaded parallel jobs. A number of optimization measures are also developed to accelerate the solving process. Moreover, a flexible approximation technique is proposed to strike a balance between the solving speed and the solution quality. Extensive experiments are conducted to evaluate the effectiveness of the proposed co-scheduling algorithms. The results show that the proposed algorithms can find the optimal co-scheduling solution for both serial and parallel jobs. The proposed approximation technique is also shown to be flexible in the sense that we can control the solving speed by setting the requirement for the solution quality.
We present the development of a predictive performance model for the high-performance computing code Hydra, a hydrodynamics benchmark developed and maintained by the United Kingdom Atomic Weapons Establishment (AWE). The developed model elucidates the parallel computation of Hydra, with which it is possible to predict its run-time and scaling performance on varying large-scale chip multiprocessor (CMP) clusters. A key feature of the model is its granularity; with the model we are able to separate the contributing costs, including computation, point-to-point communications, collectives, message buffering and message synchronisation. The predictions are validated on two contrasting large-scale HPC systems, an AMD Opteron/InfiniBand cluster and an IBM BlueGene/P, both of which are located at the Lawrence Livermore National Laboratory (LLNL) in the US. We validate the model on up to 2,048 cores, where it achieves a >85% accuracy in weak-scaling studies. We also demonstrate use of the model in exposing the increasing costs of collectives for this application, and also the influence of node density on network accesses, therefore highlighting the impact of machine choice when running this hydrodynamics application at scale.
Existing algorithms for orienting sensors in directional sensor networks have primarily concerned themselves with the problem of maximizing the number of covered targets, treating target identification as a non-issue. This assumption, however, does not hold in all situations. In this paper, a distributed heuristic algorithm for choosing active sensors and orienting them with the goal of balancing coverage and identifiability is presented. The performance of the algorithm is verified via simulations, which show that it confers increased target identifiability compared with algorithms designed simply to maximize the number of targets covered.
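As a rough illustration of the trade-off such a heuristic balances, the sketch below scores candidate orientations by weighting newly covered targets against targets that would gain a second observer; the scoring weights and data structures are invented for illustration and do not reproduce the paper's distributed algorithm.

def choose_orientation(candidate_orientations, cover_count, alpha=0.5):
    """Pick the orientation maximising a weighted coverage/identifiability score.

    candidate_orientations: {orientation: set of target ids it would cover}
    cover_count: {target id: number of sensors already covering it}
    alpha trades new coverage against identifiability (illustrative only).
    """
    def score(targets):
        new_cover = sum(1 for t in targets if cover_count.get(t, 0) == 0)
        # a target seen by a second sensor becomes easier to identify unambiguously
        second_look = sum(1 for t in targets if cover_count.get(t, 0) == 1)
        return alpha * new_cover + (1 - alpha) * second_look

    return max(candidate_orientations, key=lambda o: score(candidate_orientations[o]))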
Maintaining the performance of large scientific codes is a difficult task. To aid in this task, a number of mini-applications have been developed that are more tractable to analyze than large-scale production codes while retaining their performance characteristics. These "mini-apps" also enable faster hardware evaluation and, for sensitive commercial codes, allow evaluation of code and system changes outside of access approval processes. In this paper, we develop MG-CFD, a mini-application that represents a geometric multigrid, unstructured computational fluid dynamics (CFD) code, designed to exhibit similar performance characteristics without sharing commercially sensitive code. We detail our experience of developing this application following guidelines set out in existing research, and contribute further to them. Our application is validated against the inviscid flux routine of HYDRA, a CFD code developed by Rolls-Royce plc for turbomachinery design. This paper (1) documents the development of MG-CFD, (2) introduces an associated performance model with which it is possible to assess the performance of HYDRA on new HPC architectures, and (3) demonstrates that it is possible to use MG-CFD and the performance models to predict the performance of HYDRA with a mean error of 9.2% for strong-scaling studies.
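The 9.2% figure is a mean relative error between modelled and measured run-times; the short sketch below shows how such a figure would be computed, using placeholder strong-scaling data rather than the paper's measurements.

def mean_relative_error(predicted, measured):
    """Mean relative error (%) of model predictions against measured run-times."""
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return 100.0 * sum(errors) / len(errors)

# Placeholder strong-scaling run-times (seconds) at increasing core counts
measured = [120.0, 63.0, 34.0, 19.5]
predicted = [118.0, 60.5, 36.0, 21.0]
print(f"mean error: {mean_relative_error(predicted, measured):.1f}%")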
Machine learning has established itself as a powerful tool for the construction of decision-making models and algorithms through the use of statistical techniques on training data. However, a significant impediment to its progress is the time spent training and improving the accuracy of these models: this is a data- and compute-intensive process that can take days, weeks or even months to complete. A common approach to accelerating this process is to employ multiple machines simultaneously, a trait shared with the field of High Performance Computing (HPC) and its clusters. However, existing distributed frameworks for data analytics and machine learning are designed for commodity servers; they do not realize the full potential of an HPC cluster and thus deny the effective use of a readily available and potentially useful resource. In this work we adapt Apache Spark, a distributed data-flow framework, to support machine learning in HPC environments. There are inherent challenges to using Spark in this context: memory management, communication costs and synchronization overheads all pose challenges to its efficiency. To this end we introduce: (i) the application of MapRDD, a fine-grained distributed data representation; (ii) a task-based all-reduce implementation; and (iii) a new asynchronous Stochastic Gradient Descent (SGD) algorithm using non-blocking all-reduce. We demonstrate up to a 2.6x overall speedup (or an 11.2x theoretical speedup with an NVIDIA K80 graphics card), an 82-91% compute ratio, and an 80% reduction in memory usage when training the GoogLeNet model to classify 10% of the ImageNet dataset on a 32-node cluster. We also demonstrate that the new asynchronous SGD achieves a convergence rate comparable to the synchronous method. With increasing use of accelerator cards, larger cluster computers and deeper neural network models, we predict that a further 2x speedup (i.e. a 22.4x accumulated speedup) is obtainable with the new asynchronous SGD algorithm on heterogeneous clusters.
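A minimal sketch of asynchronous SGD driven by a non-blocking all-reduce, here written with mpi4py's Iallreduce; the model, data, learning rate and update rule are placeholders, not the paper's Spark-based GoogLeNet training.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rng = np.random.default_rng(comm.Get_rank())

dim, lr = 1000, 0.01
weights = np.zeros(dim)
local_grad = np.empty(dim)
summed_grad = np.empty(dim)
pending = None  # outstanding non-blocking all-reduce request

for step in range(100):
    # placeholder: local gradient computed on this rank's mini-batch
    local_grad[:] = rng.standard_normal(dim)

    if pending is not None and pending.Test():
        # a previous all-reduce completed: apply the (possibly stale) global gradient
        weights -= lr * summed_grad / comm.Get_size()
        pending = None

    if pending is None:
        # start a new non-blocking reduction and keep computing while it progresses
        pending = comm.Iallreduce(local_grad.copy(), summed_grad, op=MPI.SUM)

if pending is not None:
    pending.Wait()
    weights -= lr * summed_grad / comm.Get_size()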
This paper introduces a novel approach to predicting UK-wide daily traffic counts on all roads in England and Wales, irrespective of sensor data availability. A key finding of this research is that many roads in a network may have no local connection, but may still obey a common underlying law, and this fact can be exploited to improve simulation. In this paper we show that: (1) traffic counts are a function of dependent spatial, temporal and neighbourhood variables; (2) large open-source datasets, such as school locations and public transport hubs, can, with appropriate GIS and machine learning techniques, assist the prediction of traffic counts; (3) real-time simulation can be scaled up to large networks with the aid of machine learning; and (4) such techniques can be employed in real-world tools. Validation of the proposed approach demonstrates an 88.2% prediction accuracy on traffic counts across the UK.
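A hedged sketch of the kind of feature-based regression described: GIS-derived covariates (here, stand-ins such as nearby schools, bus stops, day-of-week and neighbouring counts) feeding a machine-learning regressor. The features, the synthetic data and the choice of scikit-learn's RandomForestRegressor are assumptions for illustration, not the paper's exact pipeline.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative features per road link: [nearby schools, bus stops within 500 m,
# day-of-week, mean count on neighbouring links]; all values are synthetic.
rng = np.random.default_rng(0)
X = rng.random((500, 4)) * [10, 20, 7, 5000]
y = 500 + 0.8 * X[:, 3] + 50 * X[:, 0] + rng.normal(0, 50, 500)  # synthetic daily counts

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:400], y[:400])
pred = model.predict(X[400:])
accuracy = 100 * (1 - np.mean(np.abs(pred - y[400:]) / y[400:]))
print(f"hold-out prediction accuracy: {accuracy:.1f}%")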
This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single-source application is benchmarked on a number of different architectures, and is shown to be 1.3-1.5x slower than native FORTRAN 77 or CUDA implementations on a single node and 1.3-3.1x slower on multiple nodes. We also explore the potential performance gains of OpenCL's device fissioning capability, demonstrating up to a 3x speed-up over our original OpenCL implementation.
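One memory-arrangement choice of the kind alluded to is array-of-structures versus structure-of-arrays layout; the small NumPy illustration below contrasts the two. It is a generic example, not code from the LU port.

import numpy as np

n = 1_000_000

# Array-of-structures: fields are interleaved, so updating one field strides through memory
aos = np.zeros(n, dtype=[("rho", "f8"), ("u", "f8"), ("v", "f8"), ("w", "f8")])

# Structure-of-arrays: each field is contiguous, favouring coalesced/vectorised access on GPUs
soa = {name: np.zeros(n) for name in ("rho", "u", "v", "w")}

aos["rho"] += 1.0   # strided writes across interleaved records
soa["rho"] += 1.0   # contiguous writes over a single array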
This book constitutes the refereed proceedings of the 4th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS 2013), held in Denver, CO, USA, in November 2013. The 14 papers presented in this volume were carefully reviewed and selected from 37 submissions. The selected articles broadly cover topics in massively parallel and high-performance simulations, modeling and simulation, model development and analysis, performance optimization, power estimation and optimization, high performance computing, reliability, performance analysis, and network simulations.
The peer-to-peer (P2P) paradigm operates in an uncontrolled, voluntary environment and, as the members of the general public who contribute shared resources have no legal obligation with respect to resource provisioning, an incentive-based approach is needed to motivate participants to continue with a project and to attract new volunteers. The Berkeley Open Infrastructure for Network Computing (BOINC), one of the most widely used middleware platforms in P2P systems, has devised an Accounting System to provide incentives to project participants in order to attract new participants and retain existing ones. The primary incentive provided by the BOINC Accounting System is the award of 'credits', which themselves have no monetary value. However, one of the problems with the credit system is its reliance on the significantly inconsistent performance results obtained from the Dhrystone and Whetstone benchmarks when calculating credits. This study analyses the existing BOINC credit calculation system and proposes a new credit unit, MalikCredits, based on a more consistent lightweight synthetic benchmark, MalikStone, which is specifically designed for the dynamic challenges of the P2P paradigm. The results show that the newly proposed credit calculation system is more consistent than the existing BOINC credit calculation system.
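To make the idea concrete, the toy sketch below times a fixed synthetic workload and converts the resulting score into credits normalised against a reference machine; it is purely illustrative and is neither MalikStone nor BOINC's actual credit formula.

import time

def synthetic_score(iterations=2_000_000):
    """Toy fixed-work kernel mixing integer and floating-point operations.

    A stand-in for a lightweight synthetic benchmark; the operation mix
    and iteration count are illustrative only.
    """
    acc_i, acc_f = 0, 0.0
    start = time.perf_counter()
    for i in range(iterations):
        acc_i = (acc_i + 3 * i) % 65521
        acc_f = acc_f * 0.999 + i * 1e-7
    elapsed = time.perf_counter() - start
    return iterations / elapsed  # operations per second as the machine's score

def credits_awarded(machine_score, cpu_seconds, reference_score=1_000_000):
    """Credits proportional to work contributed, normalised to a reference machine (illustrative)."""
    return cpu_seconds * machine_score / reference_score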
This paper considers the scenario where multiple clusters of Virtual Machines (termed Virtual Clusters) are hosted in a Cloud system consisting of a cluster of physical nodes. Multiple Virtual Clusters (VCs) cohabit in the physical cluster, with each VC offering a particular type of service for the incoming requests. In this context, VM consolidation, which strives to use a minimal number of nodes to accommodate all VMs in the system, plays an important role in reducing resource consumption. Most existing consolidation methods proposed in the literature regard VMs as "rigid" during consolidation, i.e., VMs' resource capacities remain unchanged. In VC environments, QoS is usually delivered by a VC as a single entity; therefore, there is no reason why a VM's resource capacity cannot be adjusted as long as the whole VC is still able to maintain the desired QoS. Treating VMs as "moldable" during consolidation may allow them to be consolidated onto even fewer nodes. This paper investigates this issue and develops a Genetic Algorithm (GA) to consolidate moldable VMs. The GA is able to evolve an optimized system state, which represents the VM-to-node mapping and the resource capacity allocated to each VM. After the new system state is calculated by the GA, the Cloud transitions from the current system state to the new one. The transition time represents overhead and should be minimized. In this paper, a cost model is formalized to capture the transition overhead, and a reconfiguration algorithm is developed to transition the Cloud to the optimized system state with low overhead. Experiments have been conducted to evaluate the performance of the GA and the reconfiguration algorithm.
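A minimal sketch of how a GA individual for moldable consolidation might encode the VM-to-node mapping together with each VM's allocated capacity, with a fitness that rewards using fewer nodes; the encoding, capacities and penalty weights are assumptions, not the paper's actual GA.

import random

NODE_CAPACITY = 16  # illustrative CPU capacity of a physical node

def random_individual(n_vms, n_nodes, min_cap=1, max_cap=4):
    """One GA individual: for each VM, a (host node, allocated capacity) pair."""
    return [(random.randrange(n_nodes), random.randint(min_cap, max_cap))
            for _ in range(n_vms)]

def fitness(individual, n_nodes):
    """Fewer used nodes is better; overloaded nodes are penalised heavily."""
    load = [0] * n_nodes
    for node, cap in individual:
        load[node] += cap
    used = sum(1 for l in load if l > 0)
    overload = sum(max(0, l - NODE_CAPACITY) for l in load)
    return -(used + 100 * overload)

# Selection, crossover and mutation (omitted here) would evolve the population
population = [random_individual(n_vms=20, n_nodes=8) for _ in range(50)]
best = max(population, key=lambda ind: fitness(ind, n_nodes=8))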
We present the large-scale computational fluid dynamics (CFD) simulation of a full gas-turbine engine compressor, demonstrating capability towards overcoming current limitations for virtual certification of aero-engine design. The simulation is carried out through a performance-portable code-base on multi-core/many-core HPC clusters with a CFD-to-CFD coupled execution, in which an industrial CFD solver is linked using custom coupler software. The application innovates in its design for performance portability through the OP2 domain-specific library for the CFD components, allowing the automatic generation of highly optimized platform-specific parallelizations for both multi-core (CPU) and many-core (GPU) clusters from a single high-level source. The code is used for the simulation of a 4.58B-node, full-annulus, 10-row production-grade test compressor (DLR's Rig250), using a coupled sliding-plane setup on the ARCHER2 and Cirrus supercomputers at EPCC. The OP2-generated parallelizations, together with optimized coupler configurations in heterogeneous/hybrid settings, achieve, for the first time, execution of one revolution in less than 6 hours on 512 nodes of ARCHER2 (65k cores), with a parallel scaling efficiency of over 80% compared to a 107-node run. Results indicate a speedup of the CFD suite by an order of magnitude (approximately 30x) relative to current production capability. Benchmarking and performance modelling project a time-to-solution of less than 5 hours on a cluster of 488 NVIDIA V100 GPUs, about a 3-4x speedup over CPU clusters. The work demonstrates a step-change towards achieving virtual certification of aircraft engines with the requisite fidelity and tractable time-to-solution that was previously out of reach under production settings.
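As a worked example of the scaling-efficiency figure quoted, the snippet below applies the standard strong-scaling efficiency formula to placeholder run-times; only the formula, not the numbers, reflects the paper.

def strong_scaling_efficiency(t_base, nodes_base, t_big, nodes_big):
    """Parallel efficiency relative to a baseline run: (t_base * nodes_base) / (t_big * nodes_big)."""
    return (t_base * nodes_base) / (t_big * nodes_big)

# Placeholder hours-per-revolution figures at 107 and 512 nodes (illustrative only)
t_107_nodes, t_512_nodes = 23.5, 5.8
print(f"efficiency: {strong_scaling_efficiency(t_107_nodes, 107, t_512_nodes, 512):.0%}")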