Friday, 30 July 2010

Hey, you sass that hoopy Fortran?

We have to work with lots of different Fortran compilers at NAG. So far at Mark 22 of our Fortran Library we have built using products from six distinct sources (GNU, IBM, Intel®, NAG, PGI and Sun). The list of what we build with naturally changes from Mark to Mark as different vendors fall by the wayside or as commercial considerations preclude us from creating a particular Library implementation.

With the six vendors listed above we are blessed that the base Fortran coverage is Fortran 95. At previous Marks we still supported compilers covering only Fortran 77 (g77 and pgf77 for example). This forced us to jump through all manner of hoops so that we could do some of the really useful stuff from Fortran 90—like, woah, dynamically allocating memory—in a Fortran 77 way.

Some hoops still exist though, even with Fortran 90 code.


For example, we recently ran into the following:
    PROGRAM str_read
!      .. Implicit None Statement ..
       IMPLICIT NONE
!      .. Parameters ..
       INTEGER, PARAMETER              :: wp = KIND(0.0D0)
!      .. Local Scalars ..
       REAL (kind=wp)                  :: rval
       INTEGER                         :: ioerr
       CHARACTER (200)                 :: my_str
!      .. Executable Statements ..
       my_str = '1.0D400'
       READ (my_str,'(E16.0)',iostat=ioerr) rval
       PRINT *, 'IOERR = ', ioerr
    END PROGRAM str_read
Ignoring the fact that the code fragment is a little contextless, just imagine that we're trying to trap overflowing real values being read from a CHARACTER. On five of the six compilers given above, two return a nonzero ioerr from the READ, two carry on happily and set rval to Infinity, and one core dumps! (At the time of writing, I didn't have access to all six compilers.)

Conferring with the Fortran standard (e.g., the draft Fortran 95) we see the dreaded phrase The set of input/output error conditions is processor dependent. (my italics), so we can't even complain to the compiler vendors about this behaviour! We have good relationships with the vendors, so when our building process reveals a compiler bug we'll report it. But as you can imagine, sometimes it can be a battle to achieve what we want across many Fortran platforms in as simple and maintainable a way as possible.

The next level of complication comes from ensuring that what we do in Fortran is callable from non-Fortran environments. We usually take Microsoft Excel as a sufficiently-far-removed example. I won't go into details of the usual issues raised by cross-language programming; however, now we're at a base level of Fortran 95 we've been looking at pushing our envelope more and trying out some Fortran 2003 features, specifically C Interoperability.

M'colleague Nicolas has been working on a suite of image processing routines. In pure Fortran a NAG_IMAGE TYPE exists to hold the pixel values and other metadata for the image. To make this passable from outside Fortran we're using Fortran 2003. It's understandable with new features in the language that there should be a settling-in period, but it's no exaggeration to say that every compiler we tried did, at some point, fall over on our new code. Thanks to the hard work of the compiler developers on the various teams we're now at a stable state, but there are still some vendors who are lagging behind with Fortran 2003 and for which we have to exclude the image-processing code from our testing.

Looking forward (!) to Fortran 2008, I'm interested to see how the new special-function INTRINSICs (BESSEL_J*, BESSEL_Y*, ERF* and *GAMMA) will behave across different compilers. In theory these new INTRINSICs mean we'd be able to withdraw a sizeable chunk of our S Chapter. But I wonder how easy that will be?


(The images in this post are from now voyager, but are probably copyright PolyGram Filmed Entertainment.)

Friday, 23 July 2010

A Life Well-Lived: Erwin Ruppenthal (1958 – 2010)

There’s probably some unwritten rule that blogs are off-limits to memorials. If so, I’m happy to break it for our colleague Erwin (or Erv as he was known at home). He died early this week after a long struggle with brain cancer, at home, surrounded by his wife Heidi and sons Alex and Kurt. If we take a closer look at his life we can learn a lot about the value of work and about having a purpose in life greater than earning a paycheck.

Erwin taught me a number of lessons about work and I’ll tell you about some of them so that things he valued might live on in each of us, but first let me tell you a little about him. He came to NAG in early August 1990 to work in IT at our US office. What I’m told was that he was an earnest, hardworking German citizen in his early 30’s who most would describe as shy and self-effacing. It was nine years later that I encountered him on my first day as his boss.

He was a man of two countries and two cultures who embraced both. Following sports was a passion: American baseball, American football and European football featured prominently. In his office you could find emblems of the Chicago Bears and Chicago Cubs along with a Bayern Munich FC flag. I read in his obituary that he almost never missed a Naperville Central high school football game over many years. If you knew him, none of this loyalty and dedication would surprise you because he brought the same qualities to work every day,

I’ve worked with few people who could equal his dedication to his work, our mission as a company and the customers we serve. His loyalty to serving our customers showed in the countless problems he solved late at night and early in the morning from home while monitoring our helpdesk inbox or remote web site. He had no hesitation in coming to me if he felt we weren’t being fair to a customer or living up to our promises. He confirmed in me that our first principle had to be to do the right thing by a customer and trust that they would return the same for us.

He taught me some powerful management lessons though I doubt he ever spent a day in business school. Early on, I would see something we were doing that needed change, blindingly obvious (at least to me). Erwin would sit patiently as I explained, asking questions and listening. At first I thought he was just humoring his boss so I’d go back a week later and repeat the process and come away with the same impression. At some point later I would notice that a change had been made with no fanfare. Erwin would take my idea, however half-baked, improve it and implement it. Therein I learned the value of patience, trust and that overused cliché, empowerment. With Erwin, I mostly needed to plant a seed and get out of the way.

Erwin didn’t spend money frivolously in his personal life, as far as I could tell, but when he did advocate spending it at work, he pursued quality, durability and things which made for better customer service. We have (still) a truly ancient UNIX server that serves as an archive of customer data. Through two office moves and occasional disk failures Erwin kept it going reliably. His care for such things could be summed up by a short story of the first of those moves. The move was from one building to another in an office park in suburban Chicago separated by perhaps two football fields of lawn and street. It was late on a cold, moonlit Friday night in early January 2002. The movers had finished earlier in the evening and all of the staff had gone home. All that was left to do was to power down the servers, move them across the street and re-start them. They were already ancient and Erwin wasn’t about to trust them to the movers so he and I methodically shut everything down and loaded it all on a 4-wheel cart. A few minutes later we were pushing the cart gently down the middle of the deserted street when I jokingly wondered what would happen if a police car came by. He had prepared for everything but that. Even in the small things, he planned carefully and followed through. That server is still running.

I could tell you more but these things tell us the enduring values Erwin brought to work: loyalty to people and a purpose bigger than himself, patience, persistence and reliability. He was more than a co-worker for everyone he touched. He was a true colleague and a trusted friend. Erwin may be gone but his memory and his values live on in each of us.

Monday, 19 July 2010

Time Machines and Supercomputers

I found a Linpack App for the iPhone last week. Nothing special, just a bit of five minute fun. It seems a 3G model achieves about 20 MFLOPS. [Note 1]

What's that got to do with time machines? Well it got me thinking "I wonder when 20 MFLOPS was the performance of a leading edge supercomputer?" Actually, it was before the start of the Top500 list (1993), so finding out was beyond the research I was prepared to do for this blog.

So I thought instead about the first supercomputer I used in anger. As soon as I name it, if anyone is still reading this waffle, you will immediately fall into two camps - those who think I'm too young to be nostalgic about old supercomputers yet - and those who think I'm too old to be talking about modern supercomoputers :-).

It was a Cray T3D.

You're still waiting for the time machine bit ... hang on in there.

My application on that T3D sustained about 25 GFLOPS. Which is about the same as a high end PC of recent years. What this means to me is that anyone who cares to apply the effort today with a high end PC, could get comparable results to that work of 15-20 years ago that needed the supercomputer.

Or, in other words, that supercomputer gave us a 15-20 years time advantage over everyone who didn't have supercomputers - or a few years over others with smaller supercomputers. [Note 2]

That is one of the key benefits of High Performance Computing - the ability to get a result before a competitor - you could say HPC is a time machine for simulation and modelling.

Now for the [Notes] - which actually contain the real story!

Note 1 : It's not really true to say the iPhone 3G can do 20 MFLOPs - all we can say is that particular App achieved 20 MFLOPs on that iPhone 3G. The result is a factor of both the software and the hardware. Better performance can come from optimising the application as much as from buying a more powerful phone.

Note 2 : If fact, even with the same supercomputer, it would be hard for most people to replicate the results - simply because there was as much value in the software (physics, algorithms, performance engineering, implementation, etc) and the associated validation and verification program as there was in the supercomputer.

The supercomputer offered us a time machine. But the attention to performance and scalability in the application enabled us to actually use that time machine to get results faster than others - even if those others used the same supercomputer. And the validation and verification effort meant that we could trust what our time machine was telling us.

Thursday, 8 July 2010

Fantasy Football – a classic Portfolio Optimisation problem

England out the World Cup, German colleagues, customers, collaborators (actually not just Germans, Americans, Scots,…) and so called friends all sending me mocking e-mails and texts.

“I hear OXO are making a new product. The packaging is white with a red cross and they're calling it the laughing stock.”

How could I channel my frustration? NAG Blog to the rescue.

In my last blog I promised to reflect on my early career at NAG which isn’t that long ago compared with some of colleagues.

My commute to the NAG office is longer than I’d like, but there is a positive side…

· Valuable thinking time
· Audio books - I recently enjoyed “No One Would Listen” by Harry Markopolos. The exclusive story of the Harry Markopolos–lead investigation into Bernie Madoff and his $65 billion Ponzi scheme.

Back to “thinking time,” on one commute into Oxford I’d been pondering the previous day’s internal training course. I was recalling my senior technical colleague’s wise words. That wise colleague was David (some of you will have met him). He had given an internal training course to the Sales and Marketing group. His presentation was a simple introduction to the topic of Optimisation and he had touched on modern portfolio theory even explaining the efficient frontier.

Some of you may be lucky enough to have heard this talk before….

Imagine an English ice cream manufacturer. He might run a very profitable business, but recognise that his business is very dependant on long sunny, dry summers. Wishing to diversify and minimise his risk he chooses to start another business thus protecting himself against a cold, rainy summer with an umbrella company.

Subsequently he went on to talk about stocks and shares and the benefits of diversification. This struck a chord with me. I learnt early on in my adult life the importance of diversification. Remember GEC that became Marconi. Where did Marconi go wrong? As a result of shares options I had a portfolio dominated by Marconi and consequently suffered a very large paper loss. Oh, why didn’t I meet David earlier? He went on to speak about how diversification can be achieved by holding a selection of stocks from a spread of geographic regions, industry sectors etc.

Of course this whole training course was aimed at helping us understand NAG’s Optimisation Routines . For those of you wishing to learn more about NAG’s optimisation routines you should refer to the NAG Library Manual and the appropriate chapter introduction.

Anyway, as I drove back into work that morning I had that light bulb moment. Fantasy football is like portfolio optimisation! Well, you can imagine how I sprinted into the office from the work car park eager to share my discovery with David.

Me: “Remember your lecture about optimisation and portfolio optimisation? Fantasy football is like this, isn’t it?”

David: “What is Fantasy Football?”

Me: “Well
·You have a set of rules.
· You have a limited pot of money that you can spend to pick a squad of 15 football players.
·You then have to pick X Goal Keepers, Y defenders, …
· You can only pick Z players from any one team
· Players win points by
o Keeping a clean sheet
o Scoring a goal
oAssisting a goal
· Players lose points by
oConceding goals
o Getting red or yellow cards.”

Pause for thought.

Me: “These are constraints aren’t they?”

Patronising nod from David

Me: “And this is an Integer problem, isn’t it?”

Laughter. David: “Well, at least you were listening last week. Now, go and get on with your work.”

Me: “No, you don’t get it. You’re going to help me program this up and we’ll see how the NAG Optimisers perform! I bet they won’t beat me. I’ve won the office fantasy football competition two years in a row. I would be interested to see how it performs though.”

Two or three days pass. I find a way of getting all the historic data from “The Official Fantasy Game Of The Premier League” into an Excel spreadsheet.

A week later I manage to persuade David to code up one of NAG’s optimisation routines to optimise my portfolio of football players. What does this mean? Well one is supplied with a list of players. Each player has a value and their previous season’s points total is listed. So one might choose a strategy which optimises a squad of 15 players based on a maximum spend of £100 million. A more risky approach might be to assume you will always have 11 fit players so you pick the 4 cheapest players and then look to maximise your return from an optimal 11.

Fortunately NAG supplies Visual Basic Declaration Statements and C Types with its DLLs. This makes these libraries easy to use for VB programmers who could easily code up one of NAG’s Optimisers to pick a Fantasy Football side. Yes, NAG’s Libraries are easy to link to Excel and we include a simple Portfolio Optimisation example on our website.



Let me explain this screen shot.

The two teams that were picked entirely by NAG’s Solvers were David’s “Too Hot” and Sven’s “Old NAG’s Best 15.”
· “Too Hot” was based on picking the four cheapest players (with the highest return) and then the “Optimised 11”
· “Old NAG’s Best 15” was based on optimising the entire 15 i.e. taking no account that only 11 players can be picked and used in each match.

In my next blog I’ll share with you how “Too Hot” and “Old NAG’s Best 15” finished the season.

Friday, 2 July 2010

The power within graphics cards

NAG is a company that likes to be present at scientific events all around the world. One of the main reasons for this is we get face-to-face time with NAG users and this gives us the chance to listen to their needs and reiterate that 40 years on, NAG is still a not-for-profit company for which collaboration is essential.


Last week I had the pleasure of representing NAG at two events in North America: GPGPU Developer Briefing in New York City and the 6th World Congress of Bachelier Finance Society in Toronto. While manning the NAG stand together with my colleagues we had many interesting conversations with those attending and made them aware of one of NAG’s most recent developments – the NAG Library for GPUs.


We expected lots of interest in GPUs at the briefing in NYC, and were all happy to describe our successes with Monte Carlo simulations on GPUs. Currently the NAG Library for GPUs consists of two random number generators: pseudorandom L’Ecuyer MRG and quasirandom Sobol, a generator of sample paths for a Brownian bridge, and statistical distributions. Other features, such as PDE methods or optimization will follow soon. It was really good to see the high level of interest in the Library and how to call it from Excel and other environments.


The Bachelier Society Congress in Toronto attracted an audience primarily from financial academia. We spoke to various NAG users who talked about their use of NAG routines for optimization, wavelet transforms, and RNG to name a few, but this audience also expressed keen interest in our GPU work, in particular the work with a major bank where the NAG Library for GPUs performed a Monte Carlo simulation over 200 faster than on a single-core CPU! Of course, this level of speed up cannot always be achieved, but it is an indicator of the power of GPUs combined with the robust, efficient and stable implementation of random number generators in the NAG Library for GPUs.

Wednesday, 23 June 2010

My first experiences calling the NAG Library for SMP and Multicore

A quick introduction: I work for NAG as a Technical Support Engineer in Germany. I look after clients in Germany, Switzerland and Austria and am interested in approximation and interpolation methods as well as .NET languages. Naturally, I’m big fan of the NAG library for .NET as well as the NAG Library for SMP and Multicore.

As a non FORTRAN programmer to be able to run some tests with the new NAG Library for SMP and Multicore I decided to choose a few simple tests that I could run using both CPUs of my laptop. I created some .NET console applications, wrappers around the FORTRAN routines that were able to pass all the arguments to the routines in a .NET fashion and implemented a time counting function. There are many parts of the Library parallelized, including quadrature, partial differential equations, interpolation, curve and surface fitting, linear algebra, correlation and regression analysis, multivariate methods, random number generators, time series analysis, sorting and special functions. I chose routines from the correlation, curve and surface fitting and random number generation chapters.

The results were calculated on a Intel® Core™2 Duo Prozessor P8700 (2.53GHz,1066MHz,3MB) machine with Windows Vista as operating system and Microsoft Visual Studio 2008 (32‐bit project) as compiler. Each function was used to solve a large enough problem to allow for parallelism. Each test case was run with one or two threads. The number of threads was set using the system variable ‘OMP_NUM_THREADS’ in the Visual Studio console window.

A ‘QueryPerformanceCounter’ function, was implemented in all the examples to calculate the time which the routines needed for calculating the result. Below is the class implemented for each of the examples which can be used by first creating the constructor HiPerfTimer and then using the methods Start, Stop and Duration:
HiPerfTimer pt = new HiPerfTimer();
pt.Start();
//routine
pt.Stop();
pt.Duration();

internal class HiPerfTimer
    {
        [DllImport("Kernel32.dll")]
        private static extern bool QueryPerformanceCounter(
            out long lpPerformanceCount);

        [DllImport("Kernel32.dll")]
        private static extern bool QueryPerformanceFrequency(
            out long lpFrequency);

        private long startTime, stopTime;
        private long freq;

        // Constructor

        public HiPerfTimer()
        {
            startTime = 0;
            stopTime = 0;

            if (QueryPerformanceFrequency(out freq) == false)
            {
                throw new Win32Exception();
            }
        }


        // Start the timer

        public void Start()
        {

            Thread.Sleep(0);

            QueryPerformanceCounter(out startTime);
        }


        // Stop the timer

        public void Stop()
        {
            QueryPerformanceCounter(out stopTime);
        }


        // Returns the duration of the timer (in seconds)

        public double Duration
        {
            get
            {
                return (double)(stopTime - startTime) / (double)freq;
            }
        }
    }
After compiling the examples with the C# compiler option ‘csc’, the number of threads is first set to the value 1 and then set to the value 2. The only result which is printed is the time which the routine needs for calculating the result.

Correlation

NAG routine C06PKF calculates the circular convolution or correlation of two complex vectors of period n.
In this example to complex vectors are build up with a period of n (n=5000000) and the correlation and the circular convolution are calculated.
C: \C06PKF\C06PKF>csc c06pkf.cs

C:\ C06PKF\C06PKF> set OMP_NUM_THREADS=1
C:\C06PKF\C06PKF>c06pkf
Duration: 5.74 sec

C: \C06PKF\C06PKF>set OMP_NUM_THREADS=2
C: \C06PKF\C06PKF>c06pkf
Duration: 2.98 sec

Random Number Generation

NAG routine G05YKF generates a quasi-random sequence from a log-normal distribution. It must be preceded by a call to one of the initialization routines G05YLF or G05YNF. The number N (N = 10000000) of quasi-random numbers of dimension IDIM (IDIM = 4) are passed and also the mean (IDIM) of the underlying Normal distribution for each dimension and the std (IDIM) standard deviation of the underlying Normal distribution.
C: \ G05YKF\G05YKF>csc c06pkf.cs

C: \G05YKF\G05YKF>set OMP_NUM_THREADS=1
C: \G05YKF\G05YKF>g05ykf
Duration: 3.52 sec

C: \G05YKF\G05YKF>set OMP_NUM_THREADS=2
C:\ G05YKF\G05YKF>g05ykf
Duration: 2.06 sec

Approximation

NAG Routine E02CAF forms an approximation to the weighted, least-squares Chebyshev series surface fit to data arbitrarily distributed on lines parallel to one independent coordinate axis. It determines a bivariate polynomial approximation of degree k in x and l in y to the set of data points , with weights , for and . That is, the data points are on lines , but the x values may be different on each line. The polynomial is represented in double Chebyshev series form.
N, the number of lines on which data points are given, was set to 1000
K, the required degree of x was set to 100 and
L, the required degree of y was set also to 100.
Also the fitting polynomial was evaluated at the data points using the routine E02CBF.
C:\E02CAF\E02CAF >csc e02caf.cs

C:\ E02CAF\E02CAF >set OMP_NUM_THREADS=1
C:\ E02CAF\E02CAF >e02caf
Duration: 4.77 sec

C:\ E02CAF\E02CAF >set OMP_NUM_THREADS=2
C:\ E02CAF\E02CAF >e02caf
Duration: 2.48 sec

Fig 1: Timings for the SMP enabled NAG routines with one or two threads
My experience with presenting these examples at different clients has been very positive, but also interesting! “Warum verwenden Sie diese Bibliothek in .NET? Das ergibt doch keinen Sinn, da wir nur in Fortran programmieren!” “Why are you using the Library in .NET and not directly in Fortran? This doesn’t make any sense, we are coding only in Fortran!”
Others were excited to learn and understand they might use their second processor!
My conclusions are as follows
  • Never show a C# program to HPC Computer Programmers who love Fortran and think languages such as C are radical!
  • NAG Library for SMP and Multicore really is easy to use and for those who are current NAG Fortran Library users the transition really is painless. Users just need to take care of the number of threads used.
I should admit a bias as a NAG employee, but perhaps more importantly I’d like to use this blog to talk to users / prospective users and offer help. As I said in my introduction I am interesting in this, but also hearing from those who have interests in approximation and interpolation. I plan to work with some of my developer colleagues to expand some of NAG’s interpolation solvers.

Tuesday, 22 June 2010

Technical computing futures part 2: GPU and manycore success

In my previous blog, I suggested that the HPC revolution towards GPUs (or similar many-core technologies) as the primary processor has a lot in common with the move from RISC to commodity x86 processors a few years ago. A new technology appears to offer cheaper (or better) performance than the incumbent, for some porting and tuning pain. Of course, I’m not the first HPC blogger to have made this observation, but I hope to follow it a little further.

In particular, my previous blog suggested the outcome might be: “at first the uptake is tentative ... but in a few years time, we might well look back with nostalgia to when GPU’s were not the dominant processor for HPC systems” – in other words, hard going initially, but GPU/many-core will “win” eventually. I even ended up with an ambitious promise for my next blog (i.e. this one): “an idea of what/who will emerge as the dominant solution ...

Continuing the basis of using the past to guess the future, my prediction is that the next steady state of HPC processors will be GPU-like/manycore technologies (for most of the FLOPS at least) and, just like the current steady state (x86), those few companies with the strongest financial muscle will eventually own the dominant market share. However, other companies will have pioneered many of the technologies that make that dominant market share possible, enjoying good market share surges in the process.

I can even have a go at predicting some of the path that might get us to the next steady state of HPC architecture. NVIDIA has already shown us that GPUs for HPC are sometimes a good solution – and importantly, that a good programming ecosystem (CUDA) really helps adoption. Over the last year or so, I’d say the HPC community has moved from “if GPUs can work in this case ...” to “how do I make GPUs work across my workload?

As Intel’s Knights processors bring us many-core but with a familiar x86 instruction set, we might learn that getting good performance across a broad range of applications is possible, but critically dependent on software tools and hard work by skilled parallel programmers. AMD’s Fusion with tighter links between CPU & GPU, could show that the nature of the integration between the many-core/GPU unit and the rest of the system (be it CPU, network, main memory etc) will affect not only maximum performance on specific applications, but maybe more importantly the ease of getting “good enough” performance across a range of applications.

I don't know of any GPU/many-core/accelerator announcements from IBM, but it’s always possible IBM will throw in another useful contribution before the dust settles. They were one of the first into many-core processors for HPC acceleration with Cell and they cannot be easily counted out of top end HPC solutions - e.g. the forthcoming Blue Waters (POWER7) and Sequoia (BG/Q) chart-toppers.

But back to my “winner” prediction. When the revolution settles into a new steady state of mostly GPU/many-core for HPC processors, there won’t be (can’t be) critical distinctions between the various products anymore for most applications. Whichever product we consider (whether GPU or x86-based or whatever), many-core is sufficiently different from few-core (e.g. 1-8 cores) to mean that the early winners have been those users who are easily able to move their key applications across to get step changes in cost and performance.

The big winners in the next stages of the GPU/manycore emergence will be those users who can move the bulk of their high-value-generating HPC usage to many-core processors with the most attractive transition (economy and speed) compared to their competitors.

So what about the dominant solution I promised? For the technology to be pervasive, first there must be greater commonality between offerings (I stop short of standardization) so that programmers have at least a hope of portability. Secondly, users need to be able to extract the available performance. Ideally these would mean a software method that makes many-core programming “good enough easily enough” is discovered – and if so, that software method will be the dominant solution, across all hardware.

Or, if the magic bullet is still not market ready, skilled parallel programmers will be the dominant solution for achieving competitive performance and cost benefits - just like it is for HPC using commodity x86 processors today.