Even without Safe Harbor, are US cloud providers the only safe harbour?

There’s been a lot written in the last couple of weeks about the demise of the Safe Harbor agreement between the EU and USA. If you believe the more biased posts on LinkedIn and Twitter by various [EU-based] cloud hosting companies and consultancies, it’s now illegal to put your data in the USA. Period.

Of course, this is scaremongering. It’s not illegal to put your data anywhere, prima facie. Whether it resides in the EU or the USA doesn’t really matter – the onus is on the data controller to ensure that the protections afforded to that data meet the data subjects’ legal right to the protection of their personal data. Put simply, if you hold data on EU citizens, you have to protect their data to at least a certain minimum level. Where you do that is largely irrelevant.

Where it gets a bit murky is around your hosting provider’s obligation to do the same. Safe Harbor was an undertaking by US-based companies that they’d adhere to a higher standard of care than US law required in order to meet EU requirements; in essence, a “scout’s honour” promise that although they’re not in the EU, they’d behave as though they were. Crucially, however, for the European Court of Justice, no such obligation existed on US public bodies. If the NSA, CIA, FBI, DMV or any other government body wants to read your data, it need only comply with the US’s (much weaker) privacy laws – and if the data’s hosted with a US-owned hosting company, the US government can compel them to hand over whatever they can.

And therein lies the devil in the detail. The government can force disclosure of whatever the hosting company can provide.

If you’re storing your data, unencrypted and open for anyone to read, in the USA, then it’s a safe bet that anyone in the US government can read it should they so desire. This, quite clearly, breaches the EU’s privacy requirements for that data – and so the ECJ struck down Safe Harbor on the grounds that merely putting the data on a server owned by someone who promises they won’t read it is largely meaningless. But, as we’ve seen in the news this week, if you’re storing your data unencrypted and open for anyone to read in the EU, it’s still a pretty safe bet that anyone who wants to can read it. Not just the US government – they’ve got better things to do than go after the details of #TalkTalk customers and their bank accounts. Anyone.

If, on the other hand, you’re storing your data in a (safely) encrypted form, you’re not storing plain-text or weak-cipher versions alongside it in the same table (#AshleyMadison subscribers can tell you all about this one), and the keys to your encryption are closely guarded, then it’s about as safe as it’s going to get. It really, really doesn’t matter where you put it – whether it’s unreadable in a datacentre in Ireland or Idaho, it’s still unreadable to those who shouldn’t have access to it. And voilà – suddenly, you don’t need Safe Harbor after all, because you’re safeguarding your customers’ data yourself rather than relying on someone else to do so for you.
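To make that concrete, here’s a minimal sketch (in C#, since that’s our stack) of what “encrypt it before it leaves your hands” looks like: AES-256 with a key held in your own key store, never alongside the data. It’s an illustration under those assumptions, not a recipe – a real design also needs authenticated encryption, key rotation and so on.

using System;
using System.IO;
using System.Security.Cryptography;

// Sketch: encrypt client-side so the hosting provider only ever sees ciphertext.
// The key is assumed to come from your own key store (an HSM, or at least
// somewhere that isn't the same account as the data itself).
public static class AtRestEncryption
{
    public static byte[] Encrypt(byte[] plaintext, byte[] key)
    {
        using (var aes = Aes.Create())
        {
            aes.KeySize = 256;
            aes.Key = key;
            aes.GenerateIV();

            using (var ms = new MemoryStream())
            {
                // Prepend the IV so the ciphertext is self-describing; the key never leaves us.
                ms.Write(aes.IV, 0, aes.IV.Length);
                using (var cs = new CryptoStream(ms, aes.CreateEncryptor(), CryptoStreamMode.Write))
                {
                    cs.Write(plaintext, 0, plaintext.Length);
                }
                return ms.ToArray();
            }
        }
    }
}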

But… isn’t it worth choosing an EU provider just to be on the safe side?

Well… probably not.

In the same month as all the Safe Harbor fuss has been dominating discussions about cloud, HP have (fairly quietly) announced that their public cloud offering (Helion) will no longer accept new customers onto the platform, effective this month. More terrifyingly, if you’re an exec who’s already signed off on a move into the cloud and plumped for HP, they’ll turn off what you already have with them by the end of January 2016. That gives you just four short months (of which one is December, hardly the western hemisphere’s most productive month of the year) to migrate your data from HP’s cloud to one of its “selected partner clouds” – basically, #Azure or #AWS.

HP has heavily invested in OpenStack – an effort to reduce the proprietary nature of cloud computing and enable easier vendor swapping – and the logical choice for moving their content was RackSpace, who are pretty much the only other significant OpenStack provider. That RackSpace isn’t HP’s partner of choice is a telling indictment of OpenStack’s biggest weakness – the stack on which your cloud operates is far, far less significant than the scale of the provider running it. Conservative estimates put HP’s cloud at 300PB of data. That’s really quite a lot.

Microsoft and Amazon don’t publish figures about their total storage capacity, but it’s pretty safe to assume it’s well into the multi-exabyte range. Another few hundred petabytes is a nice jump in consumption, but it’s not an order-of-magnitude increase on the current infrastructure. Compare that to some of the smaller cloud providers: Backblaze is growing at around 3PB/month in total storage capacity – taking on HP’s data load would mean 100 months’ worth of growth in a single quarter. It’s pretty safe to assume RackSpace and HP (who already have a working relationship via their OpenStack people) mutually decided that 300PB is just too much to bite off in one go.

The big four cloud providers (AWS, Azure, IBM & Google) now account for more than half the world’s total cloud spend between them. A year ago, it was 46%, and a year before that, 41%. In 2013, the big four earned about $1bn from cloud operations while everyone else collectively raked in $1.5bn. By 2014, the gap had closed a little, and this year Amazon and Microsoft alone have seen revenues almost equal to the combined total of the “not big four” rest of the market. Add in IBM or Google (who together generate about the same revenues as Azure does alone) and you’ve got a majority of the market share from just 3 of the big 4.

It’s this massive lead over the rest of the market that gives the big four their huge advantage in terms of buying power, reach and scalability – and it’s this massive lead which means that choosing a cloud provider other than the big four is a risk which needs to be weighed up against the other factors in a cloud-hosting decision. Tie-in is a problem if the vendor you’re tied in to decides that being a small player in the cloud market just isn’t worth the hassle – and let’s not forget that HP generated $111bn in revenue and $7bn in operating income in 2014, so they certainly could absorb the cost of being a large player.

Safe Harbor made the news this month because it’s consumer-facing – it involved Facebook, and a David-and-Goliath style “little guy against the big corporation” fight. Storing data outside the EU (or inside it, but on US-owned providers) does add an element of extra risk and responsibility, but it’s a pretty small one. Picking an EU-owned provider won’t make your life that much easier from a privacy perspective, and it might just make your life a great deal more difficult if your cloud provider goes the way of HP.


SMEs in your supply chain – backdoors to your IP?

On Wednesday morning, I had the privilege of attending a briefing session arranged by Symantec for CIOs, CTOs and CISOs of a range of public sector and private sector organisations with a shared interest: protection of data of which they’re custodians from attackers who’d like access to it.

I say privilege for a number of reasons – not least the fairly decent bacon sandwiches, but also the collaborative and open forum for exchange of information about the challenges we all face. It’s not often any of us get to discuss the difficulties of securing our networks, assets and people against increasingly determined would-be intruders, much less to hear from those in the more sensitive areas of the secret arena. With attendees from the secret intelligence community and a keynote speaker who worked for GCHQ and at the highest levels of government for decades, it was a rare chance to get a look at the other side of the industry.

The challenges are broadly similar, however

The thing that struck me most of all wasn’t so much the differences between the challenges facing the guys and girls in the doughnut and ours (as a private sector provider of services to retailers and brand owners rather than an intelligence agency), but the similarities between the attacks they’re facing and the ones we’re dealing with in our own sector.

We don’t hold state secrets. Nothing we do relates to nuclear submarines or lists of secret identities for 00-agents. Our data is much more mundane – the most you’ll get from our servers is an early look at what your favourite retailer is planning to launch for Christmas, or what cold and flu remedy is next year’s big release… But it still has value. If you’re a rival supermarket or a counterfeiter, knowing what’ll be on shelf in 9 months is worth enough to make it worthwhile trying to get a sneak peek.

The supply chain is the weak link

The most eye-opening thing for me on Wednesday wasn’t the stats about USB sticks found in company car parks being plugged in by the finder (40%, if you’re interested, and 60% if you put the company logo on them) or the success rate of targeted social engineering attacks against employees (100%), but the revelation that tier 1 targets (banks, retailers, pharmaceuticals etc) are becoming too hard to attack – and so the SME-heavy supply chain has taken their place as attack vectors.

So many SME suppliers to those tier 1 targets have lax network security, poor monitoring, weak authentication and outdated policies that it’s much easier to get into them than the real targets – and many of those same SMEs have access directly into the tier 1’s networks or hold their data, rendering attacks against the upstream company unnecessary. The recent emergence of MalumPOS – malware specifically targeting Micros systems – is a notable shift away from retailer financial system targeting to attacking the suppliers of systems to those retailers, for example.

Consider your supply chain part of your security domain

Your supply chain is as much a part of your organisation’s security concern as your own systems if you allow them access to, or custodianship of, your data. 

SBS’ clients benefit from Microsoft’s own white hat team constantly seeking new Azure and Windows vulnerabilities and patching them to make sure we’re as secure as can be – which is great, but only part of the picture. Our software is penetration tested by CESG CHECK testers to see if there are cracks, and how to fill them. We employ Symantec’s expert MSS team to monitor our cloud and on-premises environments for breaches and suspicious activity, and we investigate it with them. Our ISMS policies are rigorous and span physical, IT and human security, and our management team have a top-down commitment to data security which reflects the fact that we’ve been trusted with our customers’ data for 122 years without incident. We don’t assume we’re secure – we make sure. It’s also quite nice to know the people at Symantec helping protect us and our customers are working with GCHQ and their colleagues by the river to protect UK plc too!

Next time you’re considering adding to or changing your supply chain, ask them how they’ll protect your data in a world where they might just be the weakest link in your chain.

The not-so-silver lining; when cloud goes bad

There’s been quite a lot written about the Azure outage on Wednesday of last week, much of it by Microsoft themselves. I wanted to share some thoughts on our experiences of it, some of the ways it’ll change our plans, and perhaps more significantly on how it won’t.

The background

I’m currently leading a transition from colocated hosting to Azure for an FMCG software provider; we have physical tin owned by us in rack space we rent from SunGard, and for a number of reasons (not least the SunGard account management strategy of being unwilling to visit your clients!) we’ve decided that a traditional CapEx’d replacement is the wrong way to go.

The company sells a SaaS application which is hosted on VMs, and we have around 50 or so running IIS or SQL Server, of which we’ve migrated the majority to Azure’s IaaS offering.

The story until Wednesday

So far, so good. There have been gotchas, sure. The app we’re running (and replacing, for many of these reasons!) doesn’t lend itself well (or at all) to load balancing, so most of Azure’s features around availability sets don’t do us any good at all – and without them, we’re a little more at the mercy of the scheduled (and often mid-working-day) reboots Azure seems fond of. Similarly, it gets fairly unamusing fairly fast setting up firewall rules only to discover that stopping the only VM in a cloud service has the annoying habit of releasing its IP address, buggering up your carefully crafted firewall rules (there’s now a reserved IP feature, but it doesn’t allow you to reserve the existing IP for a service). There are IO limits per storage account, which have bitten us in the form of performance not scaling linearly with instance count, and the 1TB disk (blob) size limit makes for an annoying reliance on stripe sets and a fingers-crossed approach to DR where replication of stripe sets is concerned (at least until the first time it died, when we discovered the happy surprise that locally redundant storage replicates striped volumes quite happily).

Despite those gotchas, however, it’s been mostly awesome.

We’ve seen IO performance consistent with auto-tiering over fast SAS and SSD disk, significantly faster than we got from a dedicated SAN in colo, and without the cost of investing in terabytes and terabytes of SSD. We’ve gained the ability to flex VMs up and down to deal with test and production loads without having to worry about upgrading hosts (we’ve provisioned machines with 56GB of RAM as test instances, used them for a week and then deprovisioned them, without having to worry about where the extra few hundred GB of RAM is coming from). This year’s spend will be around $100k, which is comparable to our yearly spend on colocated facilities (but Microsoft, unlike SunGard, actually come visit for their hundred grand a year…), and we’ve gained 24/7 ops and support, expert maintenance of the physical kit without having to increase spend on IT headcount, and the ability to vary our spend in response to changing business needs.

…and then Wednesday happened

In case you missed it, at just before 1am on Wednesday morning, a faulty storage patch was rolled out – much faster than it should’ve been allowed to be – across the global Azure infrastructure, and chaos ensued. All services that depend on storage (and, frankly, that’s pretty much all services) suffered delays, timeouts and outages to varying degrees. VMs were particularly badly affected, because the storage service is responsible for serving up the blobs holding their VHDs as disks. Operating systems don’t really love having their disks appear and disappear like Mr Benn on Black Friday, it turns out.

This, when you have single-instance VMs running your production apps, is A Bad Thing™.

For the most part, they pretty much sorted themselves out. We had reboots, which were inconvenient, but they came back up pretty quickly – in most cases, literally just a few minutes from losing the disk to restoring service to users.

In two cases, however, the VMs stubbornly remained down (stuck in a ‘starting’ state) and inaccessible. More annoyingly, the VHDs themselves also remained inaccessible, so we couldn’t recreate the VMs as new instances – which left us somewhat stuck.

So… what’ll we change?

Ultimately, I guess this is the most important bit… we’ve learned a few lessons from the Azure downtime that’ll hopefully help insulate us from such a total dependency in future. Specifically:

  • On-premises backups of backups: We’ve been using Azure’s backup service to take system image backups, including of the SQL backup files. That’s already in the process of changing: we’ll still use Azure Backup to do our backups, but we’re putting in place automated copying of the backups (of the SQL data – the IIS content doesn’t change much) to an on-premises solution (there’s a sketch of the copy job after this list). Restoring from them will be a pain in the backside and there will be some data loss (up to 15 minutes of transactions), but if we’re left facing another outage with no clear end in sight, at least it’ll give us the ability to spin up an on-premises replica as a workaround.
  • Very short TTLs on DNS records: Restoring that one problematic live service meant restarting the cloud service – which changes the IP address assigned to it. In most instances, we’re using CNAMEs pointing at the <whatever>.cloudapp.net address and things Just Worked™, but in others (and in particular, customer-owned domain names) we had A records with 24-hour TTLs. That meant changes to the service’s IP address took a full day to propagate to some users, which was bad. We’ve replaced A records with CNAMEs where possible, and reduced the TTL to 1 hour (less for some particularly crucial sites) so any moves take effect as quickly as possible.
  • Reserved IPs: Azure now offers a reserved IP service, though (annoyingly) it’s not possible to reserve an IP which is already in use – in other words, to fix our IPs to avoid the disruption of them changing, we have to undergo the disruption of them changing. This is annoying, but we’re scheduling it in for future upgrades and builds.
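For the curious, the backup-copying job from the first bullet is nothing clever – roughly the following, using the WindowsAzure.Storage client library. The connection-string variable, container name and UNC path are placeholders rather than our real setup, so treat it as a sketch:

using System;
using System.IO;
using System.Linq;
using Microsoft.WindowsAzure.Storage;        // WindowsAzure.Storage NuGet package
using Microsoft.WindowsAzure.Storage.Blob;

// Scheduled task: pull any SQL backup blobs written in the last hour down to an
// on-premises share, so a local replica can be spun up if Azure is unavailable.
class BackupCopyJob
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse(
            Environment.GetEnvironmentVariable("BACKUP_STORAGE_CONNECTION"));  // placeholder setting
        var container = account.CreateCloudBlobClient()
                               .GetContainerReference("sqlbackups");           // placeholder container

        var recent = container.ListBlobs(useFlatBlobListing: true)
                              .OfType<CloudBlockBlob>()
                              .Where(b => b.Properties.LastModified > DateTimeOffset.UtcNow.AddHours(-1));

        foreach (var blob in recent)
        {
            var target = Path.Combine(@"\\onprem-nas\sqlbackups", blob.Name);  // placeholder UNC path
            Directory.CreateDirectory(Path.GetDirectoryName(target));
            blob.DownloadToFile(target, FileMode.Create);                      // overwrite any previous copy
        }
    }
}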

And… well, that’s it.

The combination of those three changes – none of which are exactly groundbreaking – should give us the ability to spin up an on-premises VM pretty quickly in the event of another major outage.

We’ve put a lot of faith in Azure’s infrastructure – trusting that, for example, hosting machines in Azure and backing them up to Azure is safe and that we won’t suffer loss of data. That faith isn’t really dented by this incident – in fact, it’s actually reassuring that even after a fairly catastrophic incident all our data reappeared when the issue was fixed. However, it has highlighted (and it’s not really the way we’d have preferred to have this pointed out) that we’re dependent upon Azure being available to a much greater extent than we realised. Our DR & BCP thinking had all presupposed that the HA measures Azure would implement would work, and that the replication and distribution of data measures would guard against data loss.

We stand by that, and we’re happy with the choice of Azure as a hosting platform (and, for the record, with the support – we got regular updates during the outage, a callback by phone within an hour of logging the initial ticket, and a support team who admittedly couldn’t fix much but kept us informed and updated, and sympathised, throughout).

But it doesn’t hurt to have a backup option either!

Job titles and what they’re worth

Firstly, hello Sarah (who, for those of you who are thinking “eh?!”, tweeted about a previous post). It’s good to know I’m not talking entirely to myself…

Secondly, the post itself. I read an article this evening (which I’d link to in some clever manner if only I knew how) about a chap in the US who had his career “re-positioned” (my words, not theirs) by a recruiter as part of an exercise in determining the relative worth of a software engineer in the career world. It’s a good article, and worth reading here: http://michaelochurch.wordpress.com/2014/07/13/how-the-other-half-works-an-adventure-in-the-low-status-of-software-engineers/

Several things about it got me thinking. When I hire managerial grade staff, I assume (probably unjustly) that they’re due a slightly easier ride – after all, they’ve proved themselves already, right? This makes no sense at all now I think about it. And that’s my fault.

But the bit I’d like to focus on here is the title of a given role, and how much I’ve just realised it means nothing at all. The article refers to a guy with the title “Staff Software Engineer” and then goes on to say it’s a director-level post, and drops into conversation the fact that VP is often a mid-tier, non-managerial role in the US. Which is all a bit strange, when you think about it.

My current title is “Development and Support Manager”, although I have responsibility for infrastructure too. I have 26 direct reports (including a couple of managers), and budgetary responsibility for anything that has a plug and anyone that works with the things that do. I make strategy decisions (like moving out of a physical datacentre to Windows Azure), I sign contracts, I hire people and I fire them, I set the technical direction for the group and am responsible for seeing it delivered, and I sit on the board (along with some, but not all, of our directors). At most other companies, my role would be called CTO or the like, but we don’t do C-level titles.

Reporting to me I have some truly gifted people, many of whom have responsibilities which probably aren’t that far off the VP-level role mentioned above. Keith, my UK technical architect (who’d be Chief Technical Architect if we did C-level titles), is responsible for the rewrite of an application which brings in around £3m/year and which is sold to some of the world’s biggest names – if you eat food in Europe or North America, Keith is leading the rewrite of one of the things that manages the production of something that’s currently sitting in your bowel. Yum. He doesn’t have “managerial” responsibility, in the sense that he doesn’t have to deal with absentee management, disciplinary matters and so on – but he does have to make sure the team delivers, that work is apportioned fairly and done to spec and on time, that people pull their weight, and that the end product is something I’d trust to manage the creation of the stuff I eat (put it this way: if Keith screws up, your vegan nut-and-dairy-free lasagne might well turn out to be beef and walnut pasta bake with cheese sauce).

My support right-hand-man is, in fact, a woman. Sukhy runs the support team with formidable efficiency, line managing 5 people across two continents. Without her I would have floundered a long time ago. We operate longer shifts, across more countries, than I’m able to keep on top of, and customers with varying reporting, compliance and deployment requirements mean multiple processes running concurrently – and even if I wanted to get involved in the minutiae (and I don’t), I wouldn’t have time. Sukhy’s title? “Support Team Lead” – which doesn’t sound half as grand as VP, but means a whole lot more.

Amit is a project manager, officially – another lynchpin in the team that I just couldn’t do without. He not only manages the operational workload of the dev team on a daily basis, but hires and fires in our Indian team, covers for me when I’m on holiday, manages deployments and generally fills the gap that my strategy responsibilities leave in my operational work week. If anything, he’s the dev manager.

All of which means it rankles a little when I read articles like that one; other jobs don’t seem to have the same disparity between titles and responsibilities that we do in tech, and it unfairly plays down the skills of some hugely talented people.

And (as Sarah will no doubt agree :)) it makes it utterly impossible to explain to the current Mrs Evans why, when she’s out for a drink with my test leads, senior devs, architects, BAs, PMs and support leads, their job titles are almost entirely useless in working out what they do.

(Names have been left unchanged to promote the deserving!)

How the Other Half Works: an Adventure in the Low Status of Software Engineers

This is a long, but salutary, post. I’ve interviewed a lot, and I’ve been interviewed a lot too. This is probably the first time I’ve realised that I almost certainly give the managers I’m recruiting an easier ride than the senior devs, and it’s quite eye-opening as to the relative weight of a VP title; I’d have considered it quite significantly senior, perhaps just below C-level, whereas this post paints it more as a senior dev post.

Food for thought!

YAGNI, DRY and DDTTMOTBYITAL

You probably recognise You Ain’t Gonna Need It and Don’t Repeat Yourself, but I’m guessing you might not be so familiar with Don’t Decouple Things Too Much Or They Bite You In The Ass Later.

The back story

We’ve been kicking around some design ideas at my current employer this week for a Brave New Vision™ of our newly rewritten product. It’s a big departure from what we’ve done before – historically our architecture has been brownfield to the extent of being virtually mud, and has more layers (which might be better termed eras) than Kim & Kanye’s wedding cake. It’s been multi-instance, with each instance single-tenant, and it ran on traditional tin and then moved to VMs with the same ‘dedicated server’ mentality behind it. The new one is a step change; it’s a Platform-as-a-Service (PaaS) application running on Windows Azure. It auto-scales to cope with demand and load, and it’s as new as it gets technology-wise: we’re building MVC5 on top of EF6 and backing off to SQL Azure, continuously integrating as we go (with TeamCity) and (in another departure) building in load and functional testing from the outset.

So, understandably, there are some big architecture conversations going on – this is new ground for us as a company, and we want to get the fundamentals right. We’re bouncing ideas off each other – some conservative, some totally left-field, and most somewhere in the middle. But there’s a theme in a number of them which we’re having to consciously address:

We’re trying to avoid problems we’ve hit in The Current App by identifying symptoms and coming up with clever cures.

We have a number of problems in The Current App. This won’t be news to anyone who works in software – there isn’t an app on the planet that doesn’t have bugs, war stories and skeletons in the development closet – so it shouldn’t be taken as an admission that our current app is any worse than most of the others out there. But it has problems; problems we know about, which cost us money to work around (how many of you get the sinking feeling that your support teams are actually not that far removed from the punched-card operators of years gone by, actually operating the systems for your customers rather than just fixing the bits that go wrong from time to time?), and which we don’t want to repeat.

One of these is a fairly unpleasantly close coupling between the various layers of our app. The Current App started out life as a VB6 application with some Classic ASP pages on top (many of which are still there), and then had some ASP.Net Web Forms pages added along the way (complete with ADO.Net bits sprinkled liberally over them). Some years later MVC was in vogue, and so that’s now the technology of choice for the newer bits – but the ADO.Net and VB6 COM components are all still there, and it’s led to ‘business logic’ being implemented vertically throughout those layers rather than as a layer of its own, and to the data structures in the app being tied quite (read: very) closely to the database tables themselves.

Abstraction in all layers?

It’s not surprising, then, that people are kicking around ideas for how to avoid this – while EF certainly helps abstract away the pain, you do still end up (at least by default) with a class per table and a set of DAL objects which look like the tables underlying them. So, this week we had a (brief, but telling) discussion about whether or not we should modify our DAL to work against a set of views and synonyms rather than the tables themselves; a sort of RDBMS version of coding against interfaces rather than implementations, if you will.

And here’s my problem with that: relational database management systems are just that – they manage (very well indeed) relational data which is stored in tables. Adding views on top does provide a much greater degree of isolation from the schema in the DAL, and does mean that changes to the physical storage don’t necessarily have to be reflected at the same time (or at all) in the application itself. But it has a number of downsides too.

The first that springs to mind (because it’s the most obvious) is the performance of views versus tables – even with some of the very funky stuff Microsoft and the other RDBMS vendors have been doing over the last few years with indexed/materialised views, they generally don’t match tables for raw performance.

The second, and the one that’s far harder to measure in terms of impact, is the “ooh, I wasn’t expecting that to happen” effect some years down the line, when everyone’s forgotten that select * from Customers is in fact selecting from a view over data held in a table somewhere else altogether, and someone updates a table called Customers directly before spending several hours wondering why their updated data can’t be seen for love nor money.

Third, and the deciding factor for me (when considered along with #2), is the fact that it’s solving a problem that shouldn’t exist: your DAL should provide the abstraction between the business objects and the physical storage. You shouldn’t need to abstract the DAL from the physical storage by adding a logical layer in the RDBMS – the DAL is the abstraction layer that prevents close coupling between the business objects and the storage.
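By way of illustration, this is the sort of shape I mean – the abstraction lives in the DAL itself, with the EF entity kept out of sight of the business layer. The names here are hypothetical, not our actual schema:

using System.Collections.Generic;
using System.Data.Entity;   // EF6
using System.Linq;

// Business code depends on ICustomerRepository and a plain summary class,
// never on the EF entity or the physical Customers table.
public class CustomerSummary
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public interface ICustomerRepository
{
    IList<CustomerSummary> GetActiveCustomers();
}

public class EfCustomerRepository : ICustomerRepository
{
    private readonly DbContext _context;
    public EfCustomerRepository(DbContext context) { _context = context; }

    public IList<CustomerSummary> GetActiveCustomers()
    {
        // Mapping from storage shape to business shape happens here, in one place –
        // if the table changes, only this class needs to know about it.
        return _context.Set<Customer>()
                       .Where(c => c.IsActive)
                       .Select(c => new CustomerSummary { Id = c.CustomerId, Name = c.TradingName })
                       .ToList();
    }
}

// The EF entity that mirrors the physical table stays internal to the DAL assembly.
internal class Customer
{
    public int CustomerId { get; set; }
    public string TradingName { get; set; }
    public bool IsActive { get; set; }
}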

Avoid cleverness

Back in my days at Insight we came up with a set of coding guidelines which were deliberately high level – and “avoid cleverness” came top of the list. There were exceptions (and in fact I think I’d have serious qualms about working for a company where that rule was inviolable) but for the most part, anything that leaves you with a feeling that you’re a genius is probably either an inelegant solution or unmaintainable by anyone else (and often both).

Decoupling things is good. Dependency Injection is one of a number of tools and techniques the software industry has come up with over the years in order to professionalise our work, and loose dependencies are good for more reasons than just that. But you can take decoupling of dependencies too far – when you’ve got an abstract logical layer that interfaces (on one side) only with another abstract logical layer, you’ve probably got an extra layer adding nothing but complexity and future problems.

DDTTMOTBYITAL

Constantly recycling app pools – “configuration changed”

Recently, we had a sudden and serious problem with our app pools recycling. At busy times of the working day we would get between 4 and 60 seconds between recycles – which is pretty fatal to the overall user experience when you’re making extensive use of in-proc session.

We spent quite a lot of time looking at this, both internally and with Microsoft support (thanks Hari!), and learned a couple of things along the way which I thought I’d share in case it saves someone else some pain:

1) ASP.Net 2.0 introduced a file change watcher which recycles the app pool when a directory anywhere within or below the application’s root is deleted. This is for good reason – it stops the runtime serving up deleted-but-cached pages which might have lived in that folder – and has obviously been around a long time.

2) Those deletions trigger a clean recycle and recompile, and log “configuration changed” as the reason – even when there was no change to any config file. I guess this is because the “physical configuration” of the site has changed, but it is still confusing. Adding extra logging into global.asax.cs to log a stack trace to the Windows event log when it happens doesn’t add any more detail – you’re still left with “configuration changed” as your only clue (there’s a sketch of the logging we added after this list).

3) Our apps exhibited no problems with that FCN feature in .Net 2.0, or when 3.5 came along. When we upgraded to .Net 4.0, however, all hell broke loose. We’re still waiting to hear from Microsoft whether something changed between 2.0/3.5 and 4.0 in the FCN logic, but don’t discount this as the probable cause just because it worked fine in 2.0/3.5. Maybe we just got lucky. Maybe fewer users were using the pages that did the create and delete of sub-folders in the app before. Maybe it’s to do with the order the .Net frameworks were installed and some odd handler or other was still pointing back at a .Net 1.0 runtime. Who knows. Either way: the fact that the FCN feature was introduced back in .Net 2.0 didn’t stop it suddenly and inexplicably biting us royally in the ass two versions later.

4) Separate out your IIS applications into standalone pools as much as possible. Our site had – for legacy reasons nobody can quite remember anymore – the site root application and one sub-application (which just happened to be the most accessed bit of the site by a country mile) in the same pool. When a subfolder of the site root got deleted, it’d nuke the session for the other app in that pool too, neatly doubling the impact of the recycle and associated session loss.
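As promised in point 2, here’s roughly what we wired into Global.asax.cs while chasing this down. HostingEnvironment.ShutdownReason is a public, supported property; the _shutDownMessage and _shutDownStack fields are private HttpRuntime internals pulled out by reflection – a long-standing diagnostic hack, fragile and for troubleshooting only (and in our case they still only said “configuration changed”). The log path is a placeholder.

using System;
using System.IO;
using System.Reflection;
using System.Web;
using System.Web.Hosting;

public class Global : HttpApplication
{
    protected void Application_End(object sender, EventArgs e)
    {
        string message = "unknown";
        string stack = "unknown";

        // HttpRuntime._theRuntime is the singleton instance holding the shutdown details.
        var runtime = (HttpRuntime)typeof(HttpRuntime).InvokeMember(
            "_theRuntime",
            BindingFlags.NonPublic | BindingFlags.Static | BindingFlags.GetField,
            null, null, null);

        if (runtime != null)
        {
            message = (string)runtime.GetType().InvokeMember(
                "_shutDownMessage",
                BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.GetField,
                null, runtime, null);
            stack = (string)runtime.GetType().InvokeMember(
                "_shutDownStack",
                BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.GetField,
                null, runtime, null);
        }

        // Append the official reason plus the runtime's internal message and stack trace.
        File.AppendAllText(@"D:\logs\recycles.log",   // placeholder path
            string.Format("{0:u} shutdown: {1}\r\n{2}\r\n{3}\r\n",
                DateTime.UtcNow, HostingEnvironment.ShutdownReason, message, stack));
    }
}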

We’ve now moved the directory that gets written to (and deleted from) outside the site root and split out the two applications into different pools, and our recycle problem appears to have gone away – from one every few seconds we now see around two a day (one scheduled, one a crash caused by yet another bit of “technical debt”!).

If you’re suddenly seeing a huge number of recycles, maybe some of the above will save you the tremendous amount of pain it caused us!