The following post is an attempt to summarize the learnings from our cloud journey in the first 18 months. You bet, this is biased, but it might help others who come behind us. Those ahead of us may put on their all-knowing smiles.
How to go faster - the first steps in the chaos
Public cloud adoption is an intertwining of grassroots experimentation, a mandate from senior management to establish an enterprise-grade cloud presence and, finally, a crash landing of the first cloud workloads without a proper foundation. The sooner you have a program established around it, the less chaotic the first months will be.
You need a cloud strategy
It should answer questions like:
- why you want the whole thing in the first place, how and when you declare that you have reached this goal and what metrics are used to prove it. (e.g. cost saving may not be a strategic goal, while speed is.)
- what your core design choices are: cloud architectural design (e.g. hub & spoke vs. VWAN), accepted building blocks (cloud services), CI/CD tool set (source and artifact repo, build and deploy tools), key ITSec decisions (e.g. rejecting the use of public IPs, inspecting code coming in from the internet, policy layers, the IaC framework and toolset, i.e. Terraform vs. the cloud provider’s native tooling like Bicep) and, most importantly, a decision-making process for how to reach these choices. (A minimal sketch of such an accepted building block follows this list.)
- the question of ownership: Cloud is much more than a 3rd datacenter (in fact more than any other IT infrastructure), therefore its governance should be established in the context of Business IT, DevOps, IT security and IT Operations. This is not an ITOps internal affair.
- The willingness to change everything: I could not find the source of this quote, but I think it is true: “When digital transformation is done right, it's like a caterpillar turning into a butterfly, but when done wrong, all you have is a really fast caterpillar.” You have to change the processes and the org structure if you want to harvest the advantages of the cloud. Without these changes the result will be just as slow as its on prem counterpart.
- The right level of ITSec control – if too loose, you will be hacked; if too tight, nobody will use your stuff and shadow IT orgs will sprout up everywhere. You need to decide on a few core items:
- single CSP or multi-cloud; distributed cloud yes/no; cloud-native tools vs. 3rd-party tools for monitoring, managing and protecting it all.
- how far you are able (and willing) to go with automation, mostly with Infrastructure as Code (IaC). The dilemma is where to stop. The Pareto principle should give us guidance, but it misses one key point: any manual intervention will defeat the purpose of the entire automation. This quote is from 1935, but it is as relevant as ever: “It is difficult to get a man to understand something, when his salary depends on his not understanding it.” /Upton Sinclair/
- what your cloud operating model is: the conservative approach is when the dev teams file a SNOW ticket for everything in the cloud just like on prem; the avant-garde approach is when you give them the freedom to implement their preferred PaaS components with their own IaC code and to go YBIYRI (you build it, you run it) for components that are not yet supported by central IT Ops.
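To make the accepted-building-block and IaC points above more concrete, here is a minimal, hypothetical Terraform sketch (assuming the azurerm provider; every name and setting is made up for illustration) of a centrally blessed building block: a storage account with the key ITSec decisions baked in as non-negotiable defaults, so application teams consume this instead of hand-crafting their own.

```hcl
# Hypothetical sketch of a centrally approved "building block":
# a storage account with the core ITSec decisions baked in.
# In practice this would live in a versioned module reused by app teams.

variable "name" {
  type        = string
  description = "Storage account name chosen by the application team"
}

resource "azurerm_resource_group" "block" {
  name     = "rg-building-block-demo" # made-up name
  location = "westeurope"
}

resource "azurerm_storage_account" "block" {
  name                     = var.name
  resource_group_name      = azurerm_resource_group.block.name
  location                 = azurerm_resource_group.block.location
  account_tier             = "Standard"
  account_replication_type = "ZRS"

  # Central decisions the application team cannot override:
  public_network_access_enabled   = false    # the "no public IP" decision
  min_tls_version                 = "TLS1_2" # no legacy TLS
  allow_nested_items_to_be_public = false    # no anonymous blob access
}
```

The point is less the specific settings and more the pattern: the guardrails live in the reusable code, not in a review meeting.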
Establishing the Cloud CoE
- A program or an org unit: management needs to decide whether you are a project or a standing org unit. All peer connects (interviews with other enterprises who embarked on this journey earlier) show that introducing the public cloud at enterprise scale is a 5+ year program with likely evergreen residuals. Treating it as a project has implications, e.g. 90+% of the team will leave at the end of the program, taking all the learnings with them.
- Staffing:
- #1: quick learners with a solid technology background are in high demand. Staffing the program with scraps of mediocre performers' time will defeat the purpose of the whole thing.
- #2: the imbalance between supply and demand will crank up the prices to a point that can jeopardize the financial viability of the program.
- #3: be prepared to lose your best cloud engineers to jobs abroad. Our regretted attrition is way above the internal FTE attrition. A replacement takes roughly 3+ months to find and another 3 months to ramp up, i.e. you are down a top engineer for 6+ months.
- #4: we underestimated, and therefore understaffed, the process, governance and compliance tasks. Cloud is not only an engineering task but also heavy lifting on process and compliance, not to mention a major change management undertaking. The non-engineering activities are 30+% of the job. (The process folks claim it is 50+%...)
Key decisions to make
- what the public cloud actually is – a 3rd data center or something completely different? The CCoE was convinced that it is different, while ITOps insisted that it was just another DC and therefore should behave like one: same technologies, same processes, and nothing else.
- how far you want to go with self-service. One approach is to allow You Build It, You Run It where ITOps is not yet ready to operate the new technology. The advantage is that it lets the dev teams go faster, but it requires them to build operations skills and capacity on their side. Another approach is to channel every cloud request into the existing processes and handle it as if it were an on prem request.
- Some dev teams will want to tinker with PaaS components while others will want to concentrate on business logic and application-level tasks. In the latter case, centrally provided cloud services will be required for those who do not want to deal with operating the PaaS components. You need to define the boundaries between YBIYRI and these central cloud services (roles and responsibilities) AND need to establish this managed service layer. (This is mostly not a technical undertaking.)
- Drinking from the firehose – the balance between an R&D workshop and a factory, i.e. the number of PaaS services you adopt vs. the available offerings (let alone the Marketplace). Do not go beyond 10-15% of the total service offerings, otherwise you will be crushed by their sheer quantity.
The forces that will slow you down
There are two forces at play here: ITSec and ITOps. (Compliance is waiting for you around the corner.)
- The on prem ITOps mindset will dictate that anything in the cloud should function just as if it were on prem. They will demand the same technologies and processes, the same IaaS approach to everything. Their – legitimate – reasoning is that 95+% of the workloads are on prem today, therefore anything you create should look like the current stuff, since that is easier to operate. The untold driver is fear, which you need to address upfront: nobody will lose their job, but many will likely have a different job (with a different skillset) within 4-5 years. All of us need to learn and unlearn.
- ITSec requirements dictate technical solutions that take much longer in a bank than in a small (non-financial) account. It is like running a marathon in a heavy diving suit while everyone else runs in shorts… An example: in a public cloud, cross-regional DR capabilities come out of the box, but once you implement private endpoints you lose most of this functionality.
- The nose of the ship cannot travel faster than the back of the ship, i.e. it does not really help to produce designs and technical solutions that other parts of the IT org cannot implement, let alone comprehend. This is a lesson we learned the hard way: you need to move the entire ship. Training, constant communication, demos and regular small updates help the transition.
Dependencies
You will find (at least) the following dependencies:
- Identity and Access Management – the identity management process and technology. E.g. your IAM system does not work with cloud-native identities, and/or it is being replaced and therefore does not accept any changes.
- Ticketing system – your team gravitates toward JIRA (as most SW dev. projects do) while ITOps will demand ServiceNow. Shoveling data manually from SNOW to JIRA is a pain in the neck but you want to track the hours in a single system.
- Click-Ops – your IaC code will bump into manual steps in the process, e.g. a firewall port opening might take a week while your code runs for 45 minutes.
Technical issues
- If you implement IaC you need to pay attention to the smooth coexistence of the IaC code and the policies sitting on top of it. It is a daunting task to debug code when both layers are in constant motion. (See the sketch after this list.)
- On prem proxy servers and multiple firewalls, plus an on prem DNS colliding with your cloud-internal routing design, will give you a bunch of networking and name resolution issues, while you have no access to the monitoring logs of any of the on prem components. Resolving even simple issues, like a wrong conditional access setting, requires smooth collaboration with the network people.
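To illustrate the two moving layers, here is a minimal, hypothetical Terraform sketch (assuming the azurerm provider; every name and rule is made up for illustration): a custom Azure Policy – typically owned by ITSec – that denies storage accounts with public network access, and an application team's storage account managed by IaC. If either side changes without the other, the deployment fails with a policy violation and you end up debugging both layers at once.

```hcl
# Layer 1: the policy guardrail (typically owned by ITSec).
resource "azurerm_resource_group" "app" {
  name     = "rg-app-demo" # made-up name
  location = "westeurope"
}

resource "azurerm_policy_definition" "deny_public_storage" {
  name         = "deny-storage-public-network-access" # made-up name
  policy_type  = "Custom"
  mode         = "Indexed"
  display_name = "Deny storage accounts with public network access"

  policy_rule = jsonencode({
    if = {
      allOf = [
        { field = "type", equals = "Microsoft.Storage/storageAccounts" },
        { field = "Microsoft.Storage/storageAccounts/publicNetworkAccess", notEquals = "Disabled" }
      ]
    }
    then = { effect = "deny" }
  })
}

resource "azurerm_resource_group_policy_assignment" "deny_public_storage" {
  name                 = "deny-public-storage"
  resource_group_id    = azurerm_resource_group.app.id
  policy_definition_id = azurerm_policy_definition.deny_public_storage.id
}

# Layer 2: the application team's IaC. If this attribute drifts from the
# policy above, the apply is rejected with a policy violation.
resource "azurerm_storage_account" "app" {
  name                          = "stappdata001" # made-up name
  resource_group_name           = azurerm_resource_group.app.name
  location                      = azurerm_resource_group.app.location
  account_tier                  = "Standard"
  account_replication_type      = "LRS"
  public_network_access_enabled = false # must stay aligned with the policy
}
```

In real life the policy definition and assignment usually live in a different repo with a different release cadence, which is exactly where the debugging pain comes from.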
The exit strategy
There are 3 caveats with a cloud exit:
- when you mix up a disaster recovery and an exit scenario. The difference is the allowed RTO: the first is measured in hours, the latter in years. It takes the same effort to walk away from a cloud as to walk into it.
- when you allow only technologies that have an on prem equivalent. This way you do preserve your exit but throw away any innovation produced by the cloud provider. The deeper you go into the PaaS/SaaS forest, the less likely it is that you will ever come out.
- when the seller’s state, e.g. the USA, says NO. In this case a cloud-to-cloud exit becomes unattainable (MSFT, Amazon and Google would all leave the local market on the same day).
A reasonable exit strategy should be formulated that is acceptable to the local regulator. Regulatory, compliance and engineering task forces should collaborate, led by an experienced leader (ideally someone who has worked as an auditor before). Think twice before you actually execute this exit: it will ruin the ROI of the whole thing.
The square peg in a round hole – the lack of public IP
If we had to name one item that caused us the most headaches, it is easily the fact that the public cloud is designed with the internet in mind, i.e. all services can be accessed directly from the internet. In an enterprise environment this is not acceptable: you have to go private.
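As a concrete illustration of going private, below is a minimal, hypothetical Terraform sketch (assuming the azurerm provider; names, address ranges and the region are made up): public network access on a storage account is switched off and the service is reached through a private endpoint in a spoke VNet instead. Every PaaS service you onboard needs some variation of this, plus the matching private DNS zone, which is where much of the hidden effort (and many of the name resolution issues mentioned above) lives.

```hcl
# Hypothetical sketch: replacing the default public endpoint of a PaaS
# service (here a storage account) with a private endpoint in a spoke VNet.

resource "azurerm_resource_group" "spoke" {
  name     = "rg-spoke-demo" # made-up name
  location = "westeurope"
}

resource "azurerm_virtual_network" "spoke" {
  name                = "vnet-spoke-demo"
  resource_group_name = azurerm_resource_group.spoke.name
  location            = azurerm_resource_group.spoke.location
  address_space       = ["10.10.0.0/16"] # made-up range
}

resource "azurerm_subnet" "private_endpoints" {
  name                 = "snet-private-endpoints"
  resource_group_name  = azurerm_resource_group.spoke.name
  virtual_network_name = azurerm_virtual_network.spoke.name
  address_prefixes     = ["10.10.1.0/24"]
}

resource "azurerm_storage_account" "data" {
  name                          = "stprivatedemo001" # made-up name
  resource_group_name           = azurerm_resource_group.spoke.name
  location                      = azurerm_resource_group.spoke.location
  account_tier                  = "Standard"
  account_replication_type      = "LRS"
  public_network_access_enabled = false # no access from the internet
}

resource "azurerm_private_endpoint" "storage_blob" {
  name                = "pe-stprivatedemo001-blob"
  resource_group_name = azurerm_resource_group.spoke.name
  location            = azurerm_resource_group.spoke.location
  subnet_id           = azurerm_subnet.private_endpoints.id

  private_service_connection {
    name                           = "psc-stprivatedemo001-blob"
    private_connection_resource_id = azurerm_storage_account.data.id
    subresource_names              = ["blob"]
    is_manual_connection           = false
  }
}

# Without the matching private DNS zone linked to the consuming VNets,
# clients still resolve the public name and fail - a common source of
# "simple but painful" networking issues.
resource "azurerm_private_dns_zone" "blob" {
  name                = "privatelink.blob.core.windows.net"
  resource_group_name = azurerm_resource_group.spoke.name
}
```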
The nonfunctional requirements
- All of these requirements have been known for decades, but they work differently in the cloud, especially for PaaS and SaaS. Think about monitoring, logging, alerting and backup early and make reasonable compromises with their on prem counterparts.
- Cloud monitoring, alerting and logging should be incorporated into the company-level monitoring, alerting and logging. This is unavoidable because cloud-based systems will not operate standalone but integrated with on-prem (and later maybe other cloud) systems. In case of a problem an end-to-end view is needed, and that is only possible with integration between the various monitoring systems. (See the sketch after this list.)
- Backup: you need a clear view on what you need to “bring home”, i.e. back to on prem, and what is okay to store in the cloud. At the end of the day it boils down to the level of trust in your cloud provider and the demands of the regulator. Be aware that some of the backups provided by the provider are not compatible with anything else, i.e. you cannot migrate them to any on prem equivalent (e.g. KeyVault).
- The big shift comes when the Application Operations teams claim a bigger slice of the traditional monitoring and alerting pie, using their own – mostly cloud-native – tooling that overlaps in functionality with the tools used by IT Ops.
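As an example of feeding cloud telemetry into the central view, here is a minimal, hypothetical Terraform sketch (assuming a recent azurerm 3.x provider; all names are made up): a Key Vault's audit log and metrics are routed via a diagnostic setting into a shared Log Analytics workspace, from which the company-level monitoring and alerting can consume them.

```hcl
# Hypothetical sketch: forwarding a PaaS resource's logs and metrics
# into a central Log Analytics workspace shared with IT Ops.

data "azurerm_client_config" "current" {}

resource "azurerm_resource_group" "ops" {
  name     = "rg-ops-demo" # made-up name
  location = "westeurope"
}

resource "azurerm_log_analytics_workspace" "central" {
  name                = "log-central-demo" # made-up name
  resource_group_name = azurerm_resource_group.ops.name
  location            = azurerm_resource_group.ops.location
  sku                 = "PerGB2018"
  retention_in_days   = 90
}

resource "azurerm_key_vault" "app" {
  name                = "kv-app-demo-001" # made-up name
  resource_group_name = azurerm_resource_group.ops.name
  location            = azurerm_resource_group.ops.location
  tenant_id           = data.azurerm_client_config.current.tenant_id
  sku_name            = "standard"
}

# The diagnostic setting is the glue: it ships the vault's audit log and
# metrics to the central workspace instead of leaving them on the resource.
resource "azurerm_monitor_diagnostic_setting" "kv_to_central" {
  name                       = "diag-kv-to-central"
  target_resource_id         = azurerm_key_vault.app.id
  log_analytics_workspace_id = azurerm_log_analytics_workspace.central.id

  enabled_log {
    category = "AuditEvent"
  }

  metric {
    category = "AllMetrics"
  }
}
```

The mechanism is the same for any other PaaS component, only the log categories differ; from the workspace the data can be picked up by company-level alert rules or an on-prem SIEM.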
The non-technical side of the house
We shuffled all non-technical topics into a single team: Process – Governance – Compliance – Cost. In retrospect we underestimated the amount of work and the difficulties related to these topics (engineering myopia). In fact, there is a significant difference between “it works from an engineering aspect” and “it is a service one can provide with a predefined SLA”.
- ITSM processes: IT Service Management processes assume that everything is done by ITOps and the client just files a service request. ITOps is right to claim that an incident is a pain regardless of where it happens, therefore you need a proper incident (and change) management process. If you are an ITIL shop, you will find that a big chunk of the areas covered by ITIL 3 simply do not apply to the cloud (hence the introduction of ITIL 4 several years ago).
- The cost thingy: it is very easy to leave the lights on (the on prem “flat fee – we already paid for it” reflexes kick in), but it will cost you dearly. It is one thing to spin up resources automatically, and it seems like only a small change in the code (create vs. destroy) to tear them down, but somehow that just does not happen without forcing it. It is not by accident that FinOps became a discipline in its own right in the last couple of years. (A minimal budget-alert sketch follows this list.)
- The service catalog: in case of a cloud request the client may ask for a subscription and a predefined set of PaaS components in it, or just for the subscription and then do the rest themselves. I.e. you need to clarify what the service catalog should contain.
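One cheap first guardrail against leaving the lights on is a budget with alert thresholds. Below is a minimal, hypothetical Terraform sketch (assuming the azurerm provider; the subscription ID, amount and contact address are placeholders) that warns the owning team when actual spend passes 80% and forecasted spend passes 100% of the monthly budget.

```hcl
# Hypothetical sketch: a monthly budget on a subscription with alert
# thresholds - a first step before proper FinOps tooling and automated
# clean-up of idle resources.

resource "azurerm_consumption_budget_subscription" "team_budget" {
  name            = "budget-team-demo"                                    # made-up name
  subscription_id = "/subscriptions/00000000-0000-0000-0000-000000000000" # placeholder
  amount          = 5000                                                   # made-up monthly amount
  time_grain      = "Monthly"

  time_period {
    start_date = "2025-01-01T00:00:00Z"
  }

  notification {
    enabled        = true
    threshold      = 80
    operator       = "GreaterThan"
    threshold_type = "Actual"
    contact_emails = ["cloud-team@example.com"] # made-up address
  }

  notification {
    enabled        = true
    threshold      = 100
    operator       = "GreaterThan"
    threshold_type = "Forecasted"
    contact_emails = ["cloud-team@example.com"]
  }
}
```

A budget does not tear anything down by itself; it only makes the waste visible, which is the precondition for anyone acting on it.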
What comes next
I wanted to thank the entire team who walked along with us over the last 18+ months. We are not finished by any measure, and with the quickening speed of change we may not even know what “done” really looks like. What is beyond doubt is that the big players have turned their attention to artificial intelligence. It is a safe bet to forecast that AI will infiltrate all aspects of the cloud within a few years and become the new battleground.
To finish with some fun: I used Midjourney to illustrate this post. The last prompt I used was this: “the magician pulling the rabbit out of the hat but the audience is not happy, cartoon by David Horsey, --ar 3:2”. Is it possible that AI has already gone rogue?
As always, I appreciate any comments or feedback.