To a newborn every joke is new, so I have specialized in the old jokes - I keep telling them again and again.

Floorshrink diaries


Horseshoe bend #6: Galileo Galilei

23 September 2024 - Floorshrink

This post is about how traditional IT Operations relates to the public cloud and why this approach is not aligned with the interests of the organization.

Contradictions

During the 2+ years of running the cloud transformation at a commercial bank I faced contradictory views on the following aspects of how IT could function.

contradictions.jpg

  • One extreme argued that the cloud is just another data centre, therefore it should be treated the same way as our own: the same (ticket-based) processes, the same technologies (i.e. nothing beyond what we already have on prem) and, most importantly, the same speed of letting new things in.
  • The other extreme exclaimed that the cloud should change most aspects of IT as we know it: we should replace the stop signs (approvals) with guardrails (policies), automate every aspect of our daily life and, most importantly, treat IT infrastructure as a product that we want to sell to our (internal) clients.
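To make the "guardrails instead of stop signs" idea a bit more concrete, here is a minimal, purely illustrative Python sketch. The resource format and the two rules are invented examples, not the bank's actual policies; real implementations typically live in the cloud provider's policy engine or a tool like OPA.

```python
# A toy "guardrail" check: codified rules evaluated automatically in a pipeline,
# replacing a manual approval step. Resource format and rules are illustrative only.

PLANNED_RESOURCES = [
    {"type": "vm", "name": "app1-vm", "public_ip": False, "tags": {"owner": "team-app1"}},
    {"type": "storage", "name": "app1data", "public_ip": True, "tags": {}},
]

def no_public_ip(resource):
    """Guardrail: no resource may be exposed with a public IP."""
    return not resource.get("public_ip", False)

def owner_tag_present(resource):
    """Guardrail: every resource must carry an 'owner' tag for cost attribution."""
    return "owner" in resource.get("tags", {})

GUARDRAILS = [no_public_ip, owner_tag_present]

def evaluate(resources):
    """Return a list of violations; an empty list means the deployment may proceed."""
    violations = []
    for res in resources:
        for rule in GUARDRAILS:
            if not rule(res):
                violations.append(f"{res['name']}: violates {rule.__name__}")
    return violations

if __name__ == "__main__":
    problems = evaluate(PLANNED_RESOURCES)
    if problems:
        print("Deployment blocked by guardrails:")
        print("\n".join(problems))
    else:
        print("All guardrails passed - no human approval needed.")
```

The point is not the specific rules but the shift: the policy is versioned code that runs in seconds on every deployment, instead of a ticket waiting for a committee.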

I recall when, 28 years ago - being in charge of introducing Exchange 4.0 at a local commercial bank - I attempted to explain to a deputy general manager that printing and faxing each and every e-mail he sent (in order to make sure the other party received it) was suboptimal and that a read receipt was enough (not kidding). I cared more for the consulting revenue than for the rain forests, so I dropped the case. In the same bank I argued with the network folks that tracking IP addresses for every Windows desktop in a paper-based grid notebook was suboptimal compared to DHCP. I did not drop this one, which won me few friends, until they ran into issues with duplicated IP addresses and the joy of troubleshooting them (then they relented…). It has been bugging me ever since why it took them so long to realize these things. Why is it so darn hard to embrace change?

The psychology of IT Operations

 

psychology_of_it_ops.jpg

One day a manager in IT Operations asked me how many times I could recall IT Ops being praised by senior leadership for things running normally (rarely) versus how many cases I remembered of them being reprimanded after a major service outage (to be honest: all of them). I had to agree with his point: IT Operations is strongly incentivized NOT to change what works, since the bulk of issues are connected to changing some aspect of the service. Hence the need for a CAB (Change Advisory Board) in ITIL. The root cause of the pushback against change is the deep belief that speed and stability are opposite ends of the same dimension.

At this point I have to borrow a page from the book of Matthias Patzak, who in turn borrowed a page from Simon Wardley and tweaked his map by changing the vertical axis from visibility to autonomy. Here is a modified Wardley map explaining why change agents are at odds with IT Operations (a proposed remediation is on the chart).

wardley_map.jpg

The question is unavoidable: How can the infrastructure stay unchanged when everything that uses it changes at an unprecedented speed? My hunch: it cannot. The rest of this post is an attempt to prove this point.

The stakeholders’ view

The voice of the customer – In our case the app dev teams:

  • Putting the cognitive load on the customer of the service is a guaranteed customer-satisfaction killer: when a developer needs to figure out the internal processes of the service provider (eg. filing separate ServiceNow tickets for the VM, the OS, the RDBMS, the DNS entry, the domain join and the admin access), it is like Vogon poetry (the third worst in the Universe). Dissatisfaction is the hotbed of shadow IT. For the record, not just in IT: Ferruccio Lamborghini probably would have stayed with his Ferrari (and his tractor business) if Enzo Ferrari had been a bit nicer to him or had made better clutches.
  • Lack of speed and autonomy leads to disengagement. I recall a developer who wanted to test a new feature of MS SQL Server. It took him 3+ months to get a test server. (He knew that the test bed he was asking for would have taken about an hour to set up had he been given the chance. But he wasn't.) So, after three months, he gave up on the whole idea he wanted to test in the first place.

The voice of the business

  • The top management of companies is concerned about unforeseen changes that may have a devastating impact on the livelihood of their enterprise. Their worries are backed by data. The Corporate Longevity Forecast shows that the time a company spends on the Standard & Poor's 500 list is shrinking. In plain English, even large established companies can drop off the list or become also-rans within a few years. (Nokia, Credit Suisse, GE, Qualcomm bidding for Intel, WTF?) The age of creative destruction is upon us: what worked for decades in the past may not be good enough in the next ten years.
  • Enterprises are trying to be prepared for, and respond quickly to, attacks from any new force in the market. The cloud is one of their bets. All parties but one agree on the following:
    a cloud transformation will deliver its value proposition only if the organization and the underlying processes are changed along with the technology.

When money talks - R&D budgets

  • If we assume that most technology companies spend the same portion of their revenue on R&D and that this R&D has the same impact on the bottom line (sometimes not true), then we may predict that more R&D (when it leads to a breakthrough) results in a quantum leap in profitability.
  • If a firm catches one of these quantum leaps in a lifetime, it is lucky. If it catches two, this has long-lasting consequences for the entire industry. (Data points from 2023: IBM made 8.18 billion USD net income, in the same period HPE made 2 billion, Microsoft 86 billion.) The cloud race is over, the AI race has begun, and the hyperscalers have more money to spend on it than their traditional competitors.

statista_stats.jpg

Source: STATISTA.com (data for HPE is only available from 2015, when it separated from HPQ)

I feel it in my fingers, I feel it in my toes (change is all around you…)

The following chart visualizes the process of obtaining infrastructure for an app. Say the dev team working on App 1 wants an application server with some compute power, an SQL DB, an OS and a VM underneath, plus the whole thing should be accessible by clients via the web. In a traditional org this means 5 separate ServiceNow tickets with manual handovers between them. Eg. the virtualization folks would set their ticket status to done, a human being would intercept this change and would file another ticket to the OS team to install the OS. These teams are measured on meeting their SLAs, so they would close the ticket even if the client is not able to log on to the server. (After all, identity management is a separate step, right?) Imagine a car dealer who tries to sell you an engine, a transmission, a few wheels and bodywork as separate items when you wanted a car…

the_tale_of_5_snow_tickets.jpg

In a cloud infrastructure this is a set of IaC scripts that run at once (a minimal sketch follows the list below). And here comes the problem:

  • This automation could be built by a dedicated cloud group, requiring an org change that is against the will of the existing org units. Injecting the SNOW tickets into the belly of the automation – with the same 5-day SLAs – would take the same time as the traditional setup. If you are against the cloud, all you need to do is insist on sticking to the old process.
  • You can grant the right to execute this automation to the developer teams themselves, but that means relinquishing control and shifting to creating and maintaining the automation scripts and establishing guardrails (policies) instead of the stop signs.
  • Creating and maintaining IaC code, CI/CD pipelines and policies (some people might call it DevSecOps and Site Reliability Engineering) requires new skills and can be seen as a threat by those not interested in the above changes.
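For illustration only, here is a deliberately over-simplified Python sketch of the single-run idea mentioned above. The spec format and the provision_* functions are invented stand-ins; in practice this would be Terraform, Bicep or similar, executed by a CI/CD pipeline.

```python
# Toy illustration of "one IaC run instead of five tickets": a single declarative
# spec is applied end to end, with no human handover between the layers.
# Layer names and the provision_* functions are invented for illustration.

APP1_SPEC = {
    "vm":     {"size": "4vcpu-16gb"},
    "os":     {"image": "ubuntu-22.04"},
    "db":     {"engine": "mssql", "tier": "standard"},
    "dns":    {"record": "app1.example.internal"},
    "access": {"admins": ["team-app1"]},
}

def provision_vm(cfg):     print(f"VM created: {cfg['size']}")
def provision_os(cfg):     print(f"OS image deployed: {cfg['image']}")
def provision_db(cfg):     print(f"Database ready: {cfg['engine']} ({cfg['tier']})")
def provision_dns(cfg):    print(f"DNS record registered: {cfg['record']}")
def provision_access(cfg): print(f"Admin access granted to: {', '.join(cfg['admins'])}")

# The order encodes the dependencies that the ticket chain handled manually.
PIPELINE = [
    ("vm", provision_vm),
    ("os", provision_os),
    ("db", provision_db),
    ("dns", provision_dns),
    ("access", provision_access),
]

def apply(spec):
    """Apply the whole spec in one run - the 'car', not five separate parts."""
    for layer, provision in PIPELINE:
        provision(spec[layer])

if __name__ == "__main__":
    apply(APP1_SPEC)
```

The service is only "done" when the last step (access) has run, which is exactly the end-to-end view that five independently closed tickets cannot give.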

All in all, an innocent technology change proposed by the cloud would require organizational, procedural and skillset changes in an org that does not like change.

There is an interesting observation in the State of DevOps report for 2023: the more frequently you make changes, the more likely you are to succeed. The root cause is simple: more frequent, small changes (with a working rollback) touch fewer things that can go wrong. If we turn it around: the more worried you are about changing the platform, the more time will pass between changes, accumulating more moving parts, which in turn increases the likelihood that something will indeed go wrong.

number_of_changes.jpg

A side effect is that this will make your environment less secure (I will not apply that security hotfix because it might break the application – and to be honest, sometimes it will) and will accumulate more technical debt.
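A back-of-the-envelope calculation illustrates the batching argument. Assume, purely for illustration, that every changed component has a small, independent probability of breaking something; a release that bundles many components is then far more likely to fail than any single small release:

```python
# Illustrative arithmetic only: if each changed component fails independently
# with probability p, a release touching n components succeeds with (1 - p) ** n.

p = 0.02  # assumed per-component failure probability (made-up number)

def failure_probability(components_in_release: int) -> float:
    return 1 - (1 - p) ** components_in_release

# Twelve small monthly releases of 3 components vs. one annual release of 36.
small = failure_probability(3)
big = failure_probability(36)

print(f"One small release (3 changes) fails with ~{small:.1%} probability")
print(f"One big annual release (36 changes) fails with ~{big:.1%} probability")
```

With these made-up numbers a small release fails roughly 6% of the time, while the big annual batch fails more often than not - and the small release is also far easier to roll back, because you know which of the three changes to suspect.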

There is an expression that is the tell-tale sign of a siloed organisation: "he is criss-crossing in my backyard", read: trespassing into a territory that the speaker considers his home turf. "Any time you start something new like [an innovation – eg. the cloud initiative], that cuts across many areas, there's a potential for people feeling like you're in their backyard." (Michael Britt) The problem is that most value creation processes involve multiple departments, therefore one cannot innovate without "trespassing".

I got into a conversation with the cloud transformation lead of a large commercial bank a few days ago. He made an observation that struck a chord: only a minuscule portion of the IT Operations workforce (in this bank) embraced the cloud; the rest honestly believed that everything was okay and that this cloud thingy was unnecessary, and responded accordingly. I think Amara's law is at work here: "We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run." I am biased in this case, but I believe they underestimate the impact of the cloud and miss the opportunity to increase their market value.

Squaring the circle - the way forward

  • The known knowns:
    - the business hates it when costs grow faster than revenue. READ: the days of extensive growth in IT staff are over. (if you care about the whys, check out "Red Plenty") There is one way forward: automation
    - it is likely that those willing to marry stability with speed will gain the upper hand over those who stick to their guns and obstruct change.
  • The known unknowns: 
    - technology will create as many jobs as it eliminates. (a recent study reported by the Guardian suggests that it creates more than it destroys.) What is unclear is which jobs will stay and which will transform into something new. My bet is that the mundane ones (repetitive ticket crunching) will fade, while those requiring more thinking (eg. designing the guardrails mentioned above) will grow in relevance.
    - large IT shops carry an enormous amount of legacy: applications that generate the vast majority of business value for the enterprise today. It is difficult to forecast when the above shift will happen and how long it will take.
  • The unknown unknowns: 
    - IT Operations can hold the business at gunpoint claiming that any org/process change will pose a threat to the current stability of the business, therefore any cloud adoption should happen on their terms and at a speed deemed suitable by them. The real unknown is how long IT Ops can resist the push from their own internal clients and the hyperscalers. (make no mistake: the stick will follow the carrot soon.)
    - for the record: while industry disruptors are already doing it, my prognosis that technology allows for speed while maintaining stability is not yet proven in large enterprises carrying a legacy. 

Famous last words: in 1633 Galileo Galilei had an unpleasant encounter with the Holy Inquisition, which forced him to recant his claim that the Earth moves around the Sun rather than the other way around. After leaving the courtroom he murmured "Eppur si muove" ("and yet it moves") and spent the rest of his life under house arrest.

As always, I will be glad to learn about your feedback.


The memoirs of Kilgore Trout nr. 6: the elephant and the snake

Twelve years ago I gave a presentation at the Budapest University of Economics. I used a drawing from The Little Prince (the boa constrictor that swallowed the elephant) with slight modifications to illustrate the income-over-time curve. Warning: your government wants to keep you as a net contributor to the pension system, while you may want to have a few more good years. In 2012 this seemed like a funny thing; today it looks like a problem. Considering the likelihood that your pension will cause a significant drop in your standard of living, your goal is to push the blue milestone to the right while retaining (some of) your market value.

the_elephant_and_the_snake.jpg

Acceptance of the above curve depends on your age and financial status, but the first reaction usually is that this is wrong, "torque overcomes RPM", experience rules, etc. - bottom line: the market is wrong. If you keep in mind that "the customer is always right", then you might become interested in the root causes of this devaluation and in what we can do about them. If you are under 40, stop reading; if you are over 50, you might want to read on.

your_market_value.jpg

The components of your (job) market value

  1. Your experience – which doctor will you pick for heart surgery on your kid? A newbie (who is eager to do it) or the 40+ year old with 15+ years of proven track record? The untold part of the story is that you do not want a 70-year-old dude with a trembling hand to do this operation either. OK, we are talking about IT, but keep in mind that Oppenheimer was 39 when he joined the Manhattan Project… The problem is twofold: your experience is amortized AND you are unwilling to let it go to make room for new skills and new experiences. You need to learn new things, and learning gets harder as you get older. There is a potential escape route here: move to areas where the half-life of your skills is longer, that is, away from hard-core IT towards something softer like process or project management or farming watermelons. The issue is that this area is already overpopulated with refugees, bringing the prices down. Another way is to move up in the hierarchy, but that comes with the unavoidable and undesirable jostling for positions (then hustling the pretenders…).
  2. Your network – to be precise, a few key people in that network who act as your sponsors are vital to your career. These are the people who trust you, who put a bet on you and who will speak up for you in that vital moment when a decision is made about you (or not you). Side note: this is one of those things where size does not matter, quality does. And now the bad news: like it or not, your network ages with you, which means that those who know what you are capable of might no longer be in a position to stand up for you.
  3. Your college degrees – I was a diploma collector once (3 university degrees). Then one day I asked myself when I had last used a Fourier transform, or whether I could still use my Z80 assembly coding skills. Diplomas in technology get amortized fast. The real value from those years is your capability to learn and the seeds of your network.
  4. Your language skills – whenever I meet an IT person who claims that not speaking English is okay, I lose my marbles. 90+% of the literature in information technology is in English… Bad news: the upper 25% of the new generation speak two languages before entering college. (The only area where I put heavy demands on my kids was a high-level language cert in ENG and GER by the end of high school. Okay, I also put some emphasis on math…)
  5. Your appetite for 60+ hour work weeks – being a workaholic is not a shame (been there, done that), although it will have consequences for your relationships with your loved ones. As the adage goes, the only people who will remember that you worked that much will be your kids, not your boss. For sure this appetite will calm down a bit around 60.
  6. Your ability to learn and to forget – most folks accept that the half-life of any technology-related skill is around 10 years; this means you will have to reinvent yourself at least 3 times during your active years. What many folks do not think about is that one has to "unlearn" the old ways of doing things in order to be able to absorb new things.
  7. The logical multipliers:
    • Your appetite for power – you cannot be a leader without hungering for the right to make decisions. You will not be a great leader if all you care about is power and not your people.
    • Your health – although I accept the gene lottery idea, I think there are a few basic rules you need to play by: very little alcohol, no smoking, no drugs, enough sleep, lots of physical exercise and a wonderful woman (man) by your side.

Bottom line: to a large extent the market is right about reducing the market value of people over 50-55. On the other hand, they are wrong about rejecting old folks upfront without any consideration. I recall a disaster at Liptovský Mikuláš in Slovakia when a storm literally erased an entire forest in 2004 due to one thing: all trees in that forest were the same type, planted at the same time. Old trees are a must in any forest.  (pic below is my own)

liptovsky_mikulas.jpg

The cost side of the house

Homo Economicus beware: I dropped minor items like inflation and mortgages, but I considered things like moving to a smaller home once you become an empty nester, and I inserted luxury items like a costly divorce into the mix.

the_cost_side_of_the_house.jpg

Houston, we have a problem: this curve does not look like a snake that swallowed an elephant.

What to do about this problem?

There is a gap between the income and the cost curve. If we accept the definition of happiness as minimizing the gap between one's desires and one's reality, we have three choices:

  A. lower the bar of your desires and expectations
  B. stay on the job market longer and reduce the degradation of your market value
  C. increase the portion of your income that comes from your savings

Option A is not as bad as it sounds. I have first-hand experience of moving from a 6-cylinder BMW to a 3-cylinder Mini Cooper without any mental or manhood degradation. Fancy objects (cars, watches, gadgets etc.) are not essential to your happiness; collecting an excessive amount of them even suggests that you are compensating for something.

Option C is by far the best. The only caveat is that only a minority of the working population reaches "escape velocity" – the point where they do charity work to save baby seals and rainforests (besides being angel investors, since they want even more money). OK, what about the rest?

salary_vs_return_on_investment.jpg

So here we are: the market is mostly right and becoming a follower of Siddhartha solves only a part of the problem. Here are the ingredients for preserving your livelihood over 55:

  • Drop anything superfluous from your life and use what you already have. This whole life thingy looks like a lease with an expiry date, ie. you will have to hand in all your belongings before leaving the stage.
  • Stop being concerned with everything. As Mark Manson put it: "Maturity is what happens when one learns to only give a f**k about what's truly f**kworthy." A subtler explanation is from Milan Kundera who described it as a choice about the number of mirrors you want to see yourself in. Accept yourself as is, minimize your social media activities and pick only a few people whose opinion you care about. The rest can go and fly a kite.
  • The final thing is from my all-time favorite, the mother of COBOL, Grace Hopper: "The most damaging phrase in the language is: 'it's always been done that way.'"
    DO NOT continue doing things just because that is how you did them in the past. Change in IT is inevitable, not to mention exponential. You need to adapt. It is like a winding road where you need to change speed and direction to stay on it.

long_and_winding_road.jpg

As always, I will be happy to hear your feedback and remarks. Happy riding, Folks! Laszlo

Horseshoe bend #5: Lessons learned so far

The following post is an attempt to summarize the learnings from our cloud journey in the first 18 months. You bet this is biased, but it might help others who come behind us. Those ahead of us may put their all-knowing smiles on.

the_rocky_road_to_dublin_v2.JPG

How to go faster - the first steps in the chaos

Public cloud adoption is an intertwining of grassroots experimentation, a mandate from senior management to establish an enterprise-grade cloud presence and, finally, the crash landing of the first cloud workloads without a proper foundation. The sooner you have a program established around it, the less chaotic the first months will be.

You need a cloud strategy

that answers questions like:

  • why you want the whole thing in the first place, how and when you declare that you have reached this goal and what metrics are used to prove it. (eg. cost saving may not be a strategic goal, while speed is.)
  • what your core design choices are: cloud architectural design (eg. hub & spoke vs. VWAN), accepted building blocks (cloud services), CI/CD tool set (source and artifact repo, build and deploy tools), ITSec key decisions (eg. rejecting the use of public IPs, checking ingress code from the internet, policy layers, the IaC framework and toolset such as Terraform vs. the cloud provider's native tooling like Bicep) and, most importantly, a decision-making process for reaching these choices.
  • the question of ownership: Cloud is much more than a 3rd datacenter (in fact more than any other IT infrastructure), therefore its governance should be established in the context of Business IT, DevOps, IT security and IT Operations. This is not an ITOps internal affair.
  • The willingness to change everything: I could not find the source of this quote but I think it is true: "When digital transformation is done right, it's like a caterpillar turning into a butterfly, but when done wrong, all you have is a really fast caterpillar." You have to change the processes and the org structure if you want to harvest the advantages of the cloud. Without these changes the result will be just as slow as its on prem counterpart.
  • The right level of ITSec control – if too loose, you will be hacked; if too tight, nobody will use your stuff and shadow IT orgs will sprout up everywhere. You need to decide on a few core items:
    • single CSP, or multi cloud, distributed cloud yes/no, cloud native tools vs 3rd party for monitoring, managing, protecting it.
    • how far you are able (and willing) to go with automation, mostly with Infrastructure as Code (IaC). The dilemma is where to stop. The Pareto principle should give us guidance, but it misses one key point: any manual intervention will defeat the purpose of the entire automation. This quote is from 1935, but it is as relevant as ever: "It is difficult to get a man to understand something, when his salary depends on his not understanding it." /Upton Sinclair/
    • what your cloud operating model is: the conservative approach is when the dev teams file a SNOW ticket for everything in the cloud just like on prem, the avant-garde approach is when you give them freedom to implement their preferred PaaS component with their own IaC code and to go YBIYRI (you build it, you run it) for components that are not yet supported by central IT Ops.

Establishing the Cloud CoE

  • A program or an org unit: management needs to decide whether you are a project or an org unit. All peer connects (interviews with other enterprises who embarked on this journey earlier) show that introducing the public cloud at enterprise scale is a 5+ year program with likely evergreen residuals. Treating it as a project has implications, eg. 90+% of the team will leave at the end of the program, taking all the learnings with them.
  • Staffing:
    • #1: quick learners with a solid technology background are in high demand. Staffing the program with scraps of mediocre performers' time will defeat the purpose of the whole thing.
    • #2: the imbalance between supply and demand will crank up the prices to a point that can jeopardize the financial viability of the program.
    • #3: be prepared to lose your best cloud engineers to jobs abroad. Our regretted attrition is way above the internal FTE attrition. Replacement takes circa 3+ months and ramp-up requires another 3 months, ie. you are down a top engineer for 6+ months.
    • #4: we underestimated, and therefore understaffed, the process, governance and compliance tasks. Cloud is not only an engineering task but also heavy lifting on process and compliance, not to mention a major change management undertaking. The non-engineering activities are 30+% of the job. (the process folks claim this is 50+%...)

Key decisions to make

  • what the public cloud actually is – a 3rd data center or something completely different? The CCoE was convinced that it is different, while ITOps insisted that it was just another DC and therefore should behave like one: same technologies, same processes, and nothing else.
  • how far you want to go with self-service. One approach is to allow You Build It - You Run It where ITOps is not ready to operate the new technology. The advantage is that it lets the dev teams go faster, but it requires building operations skills and capacity on their side. Another approach is to channel every cloud request into the existing processes and handle them as if they were on prem requests.
  • Some dev teams will want to tinker with PaaS components, while others will want to concentrate on business logic and application-level tasks. In the latter case, centrally provided cloud services will be required for those who do not want to deal with PaaS component operations. You need to define the boundaries between YBIYRI and these central cloud services (roles and responsibilities) AND you need to establish this managed service layer. (this is mostly not a technical undertaking.)
  • Drinking from the firehose – the balance between an R&D workshop and a factory, ie. the number of PaaS services you let in vs. the available offerings (let alone the Marketplace). Do not go beyond 10-15% of the total service offering, otherwise you will be crushed by the quantity.

The forces that will slow you down

There are two forces at play here: ITSec and ITOps. (Compliance is waiting for you around the corner.)

itsec_and_itops_1.JPG

  • The on prem ITOps mindset will dictate that anything in the cloud should function just as if it were on prem. They will demand the same technologies and processes, the same IaaS approach to everything. Their – legitimate – reasoning is that 95+% of the workloads are on prem today, therefore anything you create should look like the current stuff, since that is easier to operate. The untold driver is fear, which you need to address upfront: nobody will lose their job, but many will likely have a different job (with a different skillset) within 4-5 years. All of us need to learn and unlearn.
  • ITSec requirements dictate technical solutions that take much longer in a bank than in a small (non-financial) account. It is like running a marathon in a heavy diving suit while everyone else runs in shorts… An example: in a public cloud, cross-regional DR capabilities come out of the box – unless you implement private endpoints, at which point you lose most of this functionality.

running_the_marathon_in_a_heavy_diver_suit.JPG

  • The nose of the ship cannot travel faster than the back of the ship, ie. it does not really help to produce designs and technical solutions that other parts of the IT org cannot implement, let alone comprehend. This is a lesson we learned the hard way: you need to move the entire ship. Training, constant communication, demos and regular small updates help the transition.

Dependencies

architect_in_the_spider_web.JPG

You will find (at least) the following dependencies:

  • Identity and Access Management – the identity management process and technology, eg. your IAM system does not work with cloud native identities and/or it is being replaced and therefore does not accept any changes.
  • Ticketing system – your team gravitates toward JIRA (as most SW dev projects do) while ITOps will demand ServiceNow. Shoveling data manually from SNOW to JIRA is a pain in the neck, but you want to track the hours in a single system.
  • Click-Ops – your IaC code will bump into manual steps in the process, eg. a FW port opening might take a week while your code runs for 45 minutes.

Technical issues

  • If you implement IaC you need to pay attention to the smooth coexistence of the IaC code and the policies on top of it. It is a daunting task to debug code where both layers are in constant motion.
  • On prem proxy servers and multiple firewalls, plus an on prem DNS vs. your cloud-internal routing design, will give you a bunch of networking and name resolution issues where you do not have access to the monitoring logs of any of the on prem components. It requires smooth collaboration with the network people to resolve even simple issues like a wrong conditional access setting – a small diagnostic like the sketch below helps.
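For the name resolution part, a small diagnostic that compares what the on prem DNS and the cloud-side resolver return for the same name can save hours of finger-pointing. A hedged sketch (it needs the dnspython package; the server addresses and the hostname are placeholders, not our real setup):

```python
# Compare what two resolvers return for the same name - handy when on prem DNS,
# conditional forwarders and cloud private DNS zones disagree.
# Requires the 'dnspython' package; addresses and hostname below are placeholders.
import dns.resolver

RESOLVERS = {
    "on-prem DNS": "10.10.0.53",      # placeholder on prem DNS server
    "cloud resolver": "10.200.0.10",  # placeholder cloud-side DNS/private resolver
}
NAME = "app1.example.internal"        # placeholder record to check

def lookup(name, server):
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server]
    resolver.lifetime = 3  # seconds before giving up
    try:
        return sorted(str(rdata) for rdata in resolver.resolve(name, "A"))
    except Exception as exc:  # NXDOMAIN, timeout, refused, ...
        return [f"lookup failed: {exc.__class__.__name__}"]

if __name__ == "__main__":
    answers = {label: lookup(NAME, ip) for label, ip in RESOLVERS.items()}
    for label, result in answers.items():
        print(f"{label:>15}: {result}")
    if len({tuple(v) for v in answers.values()}) > 1:
        print("MISMATCH - the two sides see different records for the same name.")
```

If the two answers differ, you at least know on which side of the fence to start looking.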

The exit strategy

There are 3 caveats with a cloud exit:

  • when you mix up a disaster recovery and an exit scenario. The difference is the allowed RTO: the first is measured in hours, the latter in years. It takes the same effort to walk away from a cloud as it takes to walk into it.
  • when you allow only technologies that have an on prem equivalent. This way you do preserve your exit, but you throw away any innovation produced by the cloud provider. The deeper you go into the PaaS/SaaS forest, the less likely it is that you will ever come out.
  • when the seller's state, eg. the USA, says NO. In this case a cloud-to-cloud exit becomes unattainable (MSFT, Amazon and Google would leave the local market on the same day).

A reasonable exit strategy should be formulated that is acceptable to the local regulator. Regulatory, compliance and engineering task forces should collaborate, led by an experienced leader (ideally someone who has worked as an auditor before). Think twice before you execute this exit: it will ruin the ROI of the whole thing.

The square peg in a round hole – the lack of public IP 

If we had to name the one item that caused us the most headache, it is easily the fact that the public cloud is designed with the internet in mind, that is, all services can be accessed directly from the internet. In an enterprise environment this is not the case: you have to go private.

The nonfunctional requirements

  • All of these requirements have been known for decades, but they work differently in the cloud, especially for PaaS and SaaS. Think about monitoring, logging, alerting and backup early and make reasonable compromises with their on prem counterparts.
  • Cloud monitoring, alerting and logging should be incorporated into the company-level monitoring, alerting and logging. This is inevitable because the cloud-based systems will not operate standalone but integrated with on prem (and later maybe other cloud) systems. In case of a problem an end-to-end view is needed, and that is possible only with integration between the various monitoring systems.
  • Backup: you need a clear view of what you need to "bring home", ie. back to on prem, and what is okay to store in the cloud. At the end of the day it boils down to the level of trust in your cloud provider and the demands of the regulator. Be aware that some of the backups provided by the provider are not compatible with anything else, ie. you cannot migrate them to any on prem equivalent. (eg. Key Vault)
  • The big shift is when the Application Operations teams will claim a bigger slice of the traditional monitoring and alerting pie, using their own – mostly cloud native – tooling that will overlap in functionality with the tools used by IT Ops.

The non-technical side of the house

We shuffled all non-technical topics into a single team: Process – Governance – Compliance – Cost. In retrospect we underestimated the amount of work and the difficulties related to these topics. (engineering myopia) In fact there is a significant difference between “it works from an engineering aspect” and “it is a service one can provide with a predefined SLA”.

the_real_x_wing_fighter_and_how_it_looks.JPG

  • ITSM processes: IT Service Management processes assume that everything is done by ITOps and the client just files a service request. ITOps is right to claim that an incident is a pain regardless of where it happens, therefore you need a proper incident (and change) management process. If you are an ITIL shop, you will find out that a big chunk of the areas covered by ITIL3 are simply not applicable to the cloud. (hence the introduction of ITIL4 several years ago.)
  • The cost thingy: it is very easy to leave the lights on (the on prem "flat fee – we already paid for it" reflexes kick in), but it will cost you dearly. It is one thing to spin up resources automatically; tearing them down seems like just a small change in the code (create vs. destroy), but somehow it just does not happen without forcing it (a toy sketch of such forcing follows this list). It is not by accident that FinOps became a discipline in its own right in the last couple of years.
  • The service catalog: in case of a cloud request the client may ask for a subscription and then for a predefined set of PaaS components in it, or just for the subscription, doing the rest themselves. Ie. you need to clarify what the service catalog should contain.
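On the cost bullet above, the "forcing it" part usually ends up as a scheduled job that switches the lights off for anything non-production past its time to live. A toy, provider-agnostic Python sketch (the inventory and the teardown() call are stand-ins for whatever your cloud SDK or IaC pipeline provides):

```python
# Toy FinOps "lights off" job: anything tagged as non-production and past its
# time-to-live gets torn down. The resource records and teardown() are stand-ins
# for real SDK calls - the point is that the teardown is scheduled, not optional.
from datetime import datetime, timedelta, timezone

NOW = datetime.now(timezone.utc)

RESOURCES = [  # placeholder inventory, e.g. from a tagging/inventory API
    {"name": "dev-vm-42",  "env": "dev",  "created": NOW - timedelta(days=9)},
    {"name": "prod-db-01", "env": "prod", "created": NOW - timedelta(days=400)},
    {"name": "test-vm-7",  "env": "test", "created": NOW - timedelta(days=2)},
]

MAX_AGE = {"dev": timedelta(days=7), "test": timedelta(days=14)}  # prod is never touched

def teardown(resource):
    print(f"tearing down {resource['name']} (env={resource['env']})")

def lights_off(resources):
    for res in resources:
        limit = MAX_AGE.get(res["env"])
        if limit is not None and NOW - res["created"] > limit:
            teardown(res)

if __name__ == "__main__":
    lights_off(RESOURCES)   # run this from a scheduler, e.g. nightly
```

Run from a nightly scheduler, this is the create-vs-destroy symmetry that otherwise never happens by itself.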

What comes next

at_the_beginning_of_the_journey.JPG

I want to thank the entire team who have walked along with me over the last 18+ months. We are not finished by any measure, and with the quickening speed of change we may not even know what "done" really looks like. What is beyond doubt is that the big players have turned their attention to artificial intelligence. It is a safe bet to forecast that AI will infiltrate all aspects of the cloud within a few years and will become the new battleground.

To finish with some fun: I used Midjourney to illustrate this post. The last prompt I used was this: "the magician pulling the rabbit out of the hat but the audience is not happy, cartoon by David Horsey, --ar 3:2". Is it possible that AI has already gone rogue?

ai_went_rouge.JPG

As always, I appreciate any comment or feedback.

 

 

 

 

Horseshoe bend #4 – Mount Rushmore (from the Canadian side)

mount_rushmore_the_backside.JPG

In regulated industries you are required to produce an exit plan before you are supposed to make your entrée into the public cloud. On prem stalwarts cite this requirement on a regular basis, demanding a plan as detailed as the inroad itself. For a while I figured this was just an excuse from the luddites to slow down progress, so it puzzled me when I heard it from people whose opinion I do care about. The bug buzzed in my ear for months: what if they are right and this road indeed leads to trouble? What if Mount Rushmore is not so pretty when viewed from the other side? To settle this I typed "vendor lock-in cloud computing" into Google and Bing to learn. Most answers were sponsored either by cloud vendors or by firms like Cloudflare or Red Hat (Cast AI, VMware, Wasabi etc.) whose real objective was to convince you that you can avoid this trouble with their assistance (that is, jumping into their trap instead of Amazon's or Microsoft's). Some were thoughtless, like the one from an HDD manufacturer arguing that cloud lock-in would lead to a lack of scalability (really?), some were lazy enough to copy entire sections (even the drawings) from each other. Okay, this is useless, so let's dig deeper. The rest of this article is the result of this digging and the outcome of consulting with Lydia Leong from Gartner, peppered with my fondness for computer history. Spoiler alert: when was the last time you listened to music on a CD player? Or to phrase it differently: do you have an exit strategy for your Spotify (Netflix etc.) subscription, that is, do you purchase an on prem copy of each song or movie you like? If you don't, then read on!

A few definitions:

Disaster Recovery Plan ≠ Exit Strategy ≠ Exit plan ≠ Testing the Exit plan

  • A Disaster Recovery (DR) plan is part of the Business Continuity Plan (BCP). It has nothing to do with an exit. When somebody asks you to execute a cloud exit in days, that is a DR situation, not an exit. For this reason I omitted situations where the Cloud Service Provider (CSP) becomes insolvent overnight and is forced to shut down its entire service. I also left out cases like a nuclear bomb wiping out all DCs in multiple regions (not just availability zones) of a cloud provider. In that case we have an existential problem way beyond a service disruption. (and yes, Putin is moving these deadly toys into Belarus as we speak…)
  • An Exit Strategy defines the triggers when your Firm will want to or will have to get out of a Cloud agreement. Players in this decision are the Business owners, the IT leadership, Procurement, Legal and the IT architects.
  • An Exit plan is the series of steps – and the players with their specific roles and responsibilities – that are triggered by the events defined in the Exit Strategy. It covers technology and business process related changes; thus, it is not an IT-only problem at all.
  • Two types of cloud exit: moving an application elsewhere or leaving the platform altogether are two different games. Depending on the players involved in the conflict triggering the exit you might face any of these.
  • Testing the Exit plan: walking the talk and moving a workload from the original cloud location to A: another cloud provider or B: back to on prem.

Concentration risk is the risk associated with dependence on a single supplier for multiple business capabilities. This applies to on prem IT environments as well. Imagine that you have to move away overnight from your RDBMS provider when you have a few thousand DBs and a few hundred thousand lines of PL/SQL code holding the bulk of the business logic of your core applications. The same goes for the runtimes and the language itself from the same provider. You bet; you are on the hook. Some smart consultant coined a derivative called cloud concentration risk. This is the risk associated with dependence on a particular cloud provider for multiple business capabilities, such that a single failure can result in a disruption to multiple aspects of the business. Its on prem sibling is a major outage in your primary data center.

The triggers: Who can say no?

 There are five possible actors in any cloud exit: the service provider and the consumer, the buyer’s regulator and two nation states (the vendor’s and the consumer’s).

the_payers_in_an_exit.JPG

  A. Buyer-seller conflicts: this is in scope for this post.
  B. Buyer in conflict with the seller's state – it is a weird idea for any firm (at least in my home country) to get into a fight with the US government, so I risk skipping this.
  C. Seller in conflict with the buyer's state – not impossible (eg. the East India Company vs. China, but that one too ended up as type D).
  D. Conflict between two states – the USA banned the sale of key IT technologies (on prem as well) to Russia after its attack on Ukraine. FTR: it was not allowed to transfer any personal data outside of the Russian Federation anyway, therefore US cloud providers were a no-go even before the war.
  E. The regulator – whoever claims that the (HUN) regulator said no to the public cloud, please show me the actual paragraph in their guidance to prove it.

The types of conflicts between the seller and the buyer (type A):

When the seller says no:

  • A serious violation of the contract terms by the buyer (eg. you posted adult content on your website). In case of an enterprise client this is unusual and would probably trigger a "remove it immediately or…" reminder rather than a hasty service suspension.
  • When you do not pay the bill. This is where the old adage applies: if you owe the bank 50 thousand dollars, that is your problem; if you owe them 5 million dollars, that is the bank's problem. The bigger your consumption, the more likely the vendor will negotiate, although this is not a life insurance.
  • When the seller is told by its state to say no – ie. this is type D. If you plan to substitute AWS with Azure (or the other way around), keep in mind that they are from the same country, ie. subject to any type D issue simultaneously.

When the buyer says no:

  • When the service quality is unacceptable – regular service outages, degradation of service.
  • When the price goes up at renewal without any benefits compensating for it. The usual way of carrying this out is removing an existing discount. This is playing hardball. Not cloud specific – see when the tax collectors of an RDBMS provider show up on December 21st for a little audit.
  • If the cloud provider enters your market as a competitor. (Apple Pay BNPL, anyone?)
  • When you decide to rationalize your cloud footprint, since you realized that 3 providers are probably too many.
  • When the innovation dries up. (for folks in photography, this is when the Hasselblad 501CM became available in ruby red) I think this is by far the most dangerous thing that can happen in a cloud relationship, since it breaks the balance between the price and what you get for it.

 

A word on innovation and its relation to vendor lock-in

 

Repeat after me: innovation comes from differentiation. Maximizing the value of cloud adoption requires exploiting the provider's capabilities, thus increasing lock-in. The flip side: the greater your need for portability, the more you are likely to sacrifice some of the benefits of cloud services – and the greater the complexity and cost. The deeper you walk into the cloud forest, the more likely you will stay there for a long time.

the_price_of_moving_away_from_the_cloud.JPG

I met an IT executive who thought that the cloud was nothing more than a 3rd data center owned by someone else. For this reason he demanded complete symmetry, that is, using components in the cloud only if they had an on prem counterpart (read: IaaS). To be fair, he was right from an exit viewpoint, but he ignored the efforts of all major cloud providers in the last 5+ years, that is, PaaS. This is where most of their R&D spend went, probably besides IT security. Bottom line: the more value you take out of the cloud, the more difficult it becomes to exit from it. In case of SaaS an exit is simply a redo exercise: same cost, same time.

To illustrate the innovation story let me use an old example, the 360 series mainframes from IBM. This was the first modular, general-purpose, upgradeable series of mainframes with the same OS for all models – that is, running the same application without modifications. It introduced micro-coded CPUs, the 8-bit byte (today it sounds funny, but there was financial pressure to use 6-bit bytes, since memory was expensive), the EBCDIC character set, a new floating point architecture, a nine-track magnetic tape drive and backward SW compatibility with older IBM products – all in all a tremendous amount of innovation. It cost half of the development of the atomic bomb and the development time ran way over the original plans, but within 15 years it drove the seven dwarfs out of the computer business (7 dwarfs = Burroughs, Sperry Rand, Control Data, Honeywell, General Electric, RCA and NCR). Was it a true vendor lock-in? You bet it was: it was compatible only with itself, but it was the best of its time, so much so that it was the origin of the saying "Nobody ever gets fired for buying IBM". And guess what, this was the seed of the antitrust lawsuit that almost chopped IBM into pieces. If you are into computer history, check out the book written by Fred Brooks (the PM of the development, working in tandem with Gene Amdahl, the lead architect) titled The Mythical Man-Month.

A word on R&D budgets: if you check out the annual reports of the hyperscale providers and their traditional on prem counterparts, you will find telling numbers. In a nutshell: there is an ongoing shift of profits from the incumbents to the largest cloud players. (eg. Amazon is now the largest database vendor, surpassing Oracle.) Their net earnings are manyfold compared to the traditional HW and on prem SW providers like HP or even IBM. If we assume that each R&D dollar has a similar financial impact at all major players, it is fair to say that the hyperscale providers are on a growth trajectory (because their cloud R&D is larger and is funded by their cloud business, not by a separate cash cow) while their on prem counterparts will face tough times within 5-6 years. This is why IBM paid 34 billion USD for Red Hat, a move triggered by the realization that they had lost the cloud war. The real point is that the war is no longer in the cloud arena – that one is over; the battle has moved to AI territory, with even bigger stakes.

Busting myths

There are no solutions that eliminate lock-in. Vendors just want you to become locked into their solution instead of someone else's. Think about it: if Vendor A's service is 100% compatible with Vendor B's service, then the ONLY differentiating factor will be the price. This would lead to a price war to the bottom that would force both vendors to cut back their R&D budgets. In the end they (and you) would end up with commodities where the only differentiator is the price, read: ZERO innovation. There are competing forces at work here: the appetite for innovation on the buyer's side intertwined with the need for differentiation on the vendor's side, plus the demand for the freedom to escape those providers whose innovation stream has dried up. Since I used a mainframe example for groundbreaking innovation, I have to mention other mainframe providers whose only excuse to exist is that one's primary application runs on their iron and it is very, very expensive to move away – and they know it. On the other hand, you do have a choice about which vendor's lock-in you want to avoid and which one you prefer in order to avoid the other.

A cloud exit plan does not provide any reduction in your availability risk. The period when the cloud service is unavailable is way shorter than the time you would need to execute any exit plan. You need to address availability in your DR plans WITHIN the given cloud itself. (nope, cloud-to-cloud exit is not a panacea for resiliency, see below.)

Multi-cloud is not a solution for cloud resiliency, since it is difficult and expensive to implement. I had a chat with a senior IT executive a few weeks ago. When we got to this issue, he figured he would ask his teams to build software targeted at the public cloud either to be portable OR to develop two versions of the same SW at the same time for the two hyperscale providers. I think both of these ideas are impractical: if you build software that uses only the common subset of functionality, you throw away the bulk of the innovation coming from either provider. If you build for both at the same time, you ruin the business case and the time expectations of the business, ie. I would rather not even start this endeavor.

One more word on multi-cloud: this will eventually happen to most large enterprises, either by choice or by accident, when the business picks a software vendor who happens to use the other CSP. This will put an additional training burden on the internal IT departments of large enterprises, let alone crank up the price tags of those folks literate in both technologies. (I always talk about two hyperscale providers instead of three; no intention to disregard GCP, it is just simpler to express myself this way.)

If your exit is triggered by a change either from the seller or the buyer’s regulator, this will rule out any cloud-to-cloud exit, because a regulatory change (for the record a state decree) will render all of your target exit providers unviable. (eg. Russia, unless you consider Alibaba…)

Your ability to execute an exit from your cloud provider does not improve your negotiating position, since cloud exits are complicated and costly, and the CSP knows that the cost of a cloud switch will exceed any price advantage gained through the switch. To be fair, this is no longer a money printing machine like it used to be in the on prem, perpetual-license days. This is a service with the actual cost of building and running astonishingly large data centers all over the world, let alone their electricity and communication costs. Do not dream about 50% discounts. If you check out the annual reports of the key cloud providers, their profitability is in the range of 30-35%. If you consider their buying power and operational efficiency, chances are that 1 kilogram of CPU from them costs less than 1 kilogram of CPU in your DC. (Leaving the lights on when not needed is a different problem, but that is FinOps, a subject for another post.)

Containers do not eliminate cloud lock-in: Theory (and Kubernetes providers) say that putting applications in containers will solve the cloud lock-in problem with no drawbacks. Tag line: “Once an application is in a container, it is easy and cheap to move it between cloud providers, or between cloud and on-premises environments.” On the one hand containers and microservices became the hallmarks of cloud native development, and they do ease some aspects of portability. On the other hand, they do not address most of the underlying causes of lock-in. Container management platforms are one out of the hundreds of PaaS services available from any of the top cloud service providers. Replacing this with a 3rd party component will have no effect on the dozens of PaaS components also required to run a modern application.

Regulators DO NOT want the whole exit plan executed before you go to the cloud with your app. They will be satisfied with plans that can be executed over a reasonable period of time (such as two years), without requiring that you demonstrate your ability to actually do an exit. The effort required to test an exit scenario is comparable to the effort of moving to the cloud itself. Unless the regulator wants to ruin the whole business case for moving to the cloud, they will not demand it. The good news is that they have heard of FinTech and BigTech and know that if they overdo their "no cloud please" thingy, they hurt the entire industry rather than protect it.

Your options

  • Minimize lock-in as much as possible: Cloud IaaS providers are treated like infrastructure resource commodities, and higher-level functionality is avoided wherever possible. This requires a very high level of skills in the IT team and significant engineering effort, time and risk since you assemble your car from thousands of tiny parts coming from several manufacturers. Not recommended, since you lose the innovation and the developer efficiency gains brought by the PaaS components. You throw the baby out with the bath water.
  • Use overlays to minimize cloud IaaS provider lock-in: You can try to minimize lock-in to the cloud IaaS provider, by overlaying the provider’s resources with third-party solutions that are portable across multiple environments. This results in a high degree of lock-in to the overlay solutions and vendors, as well as the ecosystem around those solutions. The cloud IaaS providers may be treated like infrastructure resource commodities, thus losing the innovation brought by the cloud provider.
  • Be loyal to a single ecosystem: you choose one vendor’s ecosystem to base your strategy on it, accepting the notion that you will have long-term dependency upon that vendor. Innovation, ease of integration and speed of delivery are the highest priorities. You accept that you will become highly dependent on this cloud provider over the long run, and must invest in building a strong, trusted relationship with that vendor. Resiliency is handled within the provider’s ecosystem, using cloud native tools.
  • Be loyal to more ecosystems: you build capabilities on two or more providers, not for resilience purposes but to maintain the balance when negotiating with mega players. You manage cloud concentration risk primarily through a multi-cloud workload placement strategy, rather than through a cloud exit strategy. The two clouds you bet on are likely to be two out of the three hyperscale players.

The final word: you do need to be prepared to exit your cloud provider, but not for the reasons usually quoted by most articles on the web. The real dilemma is to pick the right provider and to maintain the relationship as long as it provides a competitive advantage to your firm. A cloud exit is a complicated and very long journey. Planning an exit in advance will help you shorten the time to a successful execution, thus jumping from a limping horse to a better one in time. To paraphrase Oliver Cromwell: "Trust in your cloud provider but keep your powder dry!"

 As always I will appreciate any feedback on this post.

 


Horseshoe bend #3 – Midway

midway.JPG

The Battle of Midway is symbolic for many reasons: it showed the importance of information security (the key to the US Navy's success was that they had decrypted the Japanese communications and knew Yamamoto's plans), and it marked the end of the era when battleships reigned and the beginning of the supremacy of aircraft carriers. (Not to mention that it was to Japan what Trafalgar was to France.) I realize that the analogy is a bit far-fetched, nevertheless I built this post around it: while IT security is more relevant than ever for any enterprise, the old way of thinking about it will no longer reach the goal. No, I am not talking about quantum computing and its threat of breaking current cryptography in minutes, I am talking about the cloud. ITSec has to change.

Let me nail it down: I do realize how important information security is; history provides ample proof points. As of today, cyber warfare is on equal terms with any other military branch. (Think of Stuxnet.) On the other hand, a recent study by McKinsey found that the average lifespan of companies listed in the Standard & Poor's 500 was 61 years in 1958. Today, it is less than 18 years. If you recall the fate of Blockbuster, Borders Books, Nokia or Kodak, you see the Innovator's Dilemma in action. If you stop innovating, you will wither (sometimes very fast); if you are careless, you will suffer significant material losses (pretty soon).

What we have known for a long time

  • "Navigare necesse est, vivere non est necesse." ("To sail is necessary; to live is not.") Going online (which means mobile) is a must; tweaking your business processes for delivery speed is non-negotiable. Gen Z measures a response in seconds, a whole transaction in minutes, and wants it all anytime, anywhere.

gen_z.jpg

  • The ITSec playing field is not level: a threat actor can do way more damage with 1M USD than the good guys can fend off with the same amount of money.
  • The imbalance between demand and supply for skilled ITSec professionals is cranking up prices to the upper five-digit range (in EUR) in countries where this used to be a mid-management package. Despite the skyrocketing compensation, there is still unmet demand.
  • Hacking is a lucrative profession and a weapon in the arsenal of nation states. The number of data breaches has grown in sync with the number of users and the amount of data generated and exposed to the online world. Ugly: yes. Surprising: no.
  • The biggest concern in any ITSec protection scheme is the human factor combined with organizational inertia, from careless users and unnoticed human config errors to orgs working in silos, not giving a damn about each other's motives and agendas. (Read the case of the London Underground fire at King's Cross and you will know what I mean.)

In summary: as a consequence of the above, more and more firms move a significant part of their business online while not being prepared, exposing their cyber security weaknesses to the outside world.

Something happened - what we learned lately

Let me enumerate the changes that have happened in the last 5-8 years in the ITSec arena.

the_moat.JPG

  • The business demands collaboration with entities outside of the main org, thus a significant portion of the value creation process happens OUTSIDE of the castle that you are trying to protect. The “castle and moat” paradigm even when executed with the outmost rigor is not enough. If we add the growing segment of SaaS based functional delivery this statement becomes more relevant.
  • The public cloud grew indispensable, sucking the bulk of investment dollars from the on prem world, thus becoming a self-fulfilling prophecy. Three groups formed: the hyperscalers, the multi-cloud vendors (riding on these hyperscalers) and the incumbent traditional players.
  • Since hardware is becoming a commodity, there is a power shift towards developers. Yes, they are sometimes closer to a prima donna than to a soldier, demanding weird perks. Live with it. For the record: the price difference between a MacBook Pro and a good Wintel notebook is around two days' compensation for these folks, so be it.
  • A DDoS attack with a botnet made from smart fridges is a novelty, though a pretty sad one. (See my comment on the lack of ITSec expertise, this time at the fridge makers.)
  • The shared responsibility model introduced by the cloud blurs the boundaries and sometimes makes you feel as if it were somebody else's (ie. the cloud provider's) problem.
  • The vast majority of recent and future successful cyber security incidents were and will be enabled by human configuration errors. Throwing more human effort at the problem will only generate more errors. Doing it slowly will not make it more secure either.

The need for speed

  • The ability to respond quickly to events in the business environment has become the no. 1 priority for business leadership, regardless of the industry. (COVID, the Russian invasion of Ukraine and double-digit inflation all came overnight.)
  • There is a widening gap in agility between the cloud- and devops-enabled development units and their ITSec (and IT Ops) counterparts. IT is getting good at producing new code fast, but it is not yet prepared to protect this new code well.
  • You measure the life span of a physical machine in years, a VM in months and a container in minutes. With Kubernetes coming of age with the support of the major cloud players, the traditional ways of creating, managing, monitoring and protecting these compute instances become more and more inadequate.
  • Former U.S. Deputy Secretary of Defense William Lynn argues that “cyber-warfare is like maneuver warfare, in that speed and agility matter most”. This guy probably knows a thing or two about cyber security, since he wrote the Pentagon's cyber strategy in 2010.

What is next

The last part of this post is a list of proposed actions. For the record: being a cloud CoE lead, I am biased, and it is part of my job to be biased. A “conservative revolutionary” is an oxymoron, right?

Accept the paradigm shift

  • A paradigm shift needs to be answered by another paradigm shift: insisting on total manual pre-control and ignoring the importance of speed will put ITSec at odds with the developer communities and eventually with the business. Explain, teach, go beyond saying NO and show how it can be done securely. Sit and breathe with the coders, literally.
  • “Widening the moat”, ie. making it more cumbersome to access data (in the cloud) from within the castle, will not protect the firm. Just as leased lines between company locations became obsolete (my 5G phone runs circles around a 4 Mbit leased line), the moat will soon become obsolete for most volatile apps, or it will move to where the assets to be protected are, ie. to the cloud. It is no accident that MSFT became a significant contender in the unified endpoint management and SIEM (Security Information and Event Management) arena: they had to, in order to make Azure (their new cash cow) prevail.
  • Protecting the identity of users, machines and applications will be (is) the core of the new era. I risk the forecast that biometrics will prevail as the primary means of (human) authentication, despite the current legislative hesitation.
  • Turn your teams into developers themselves, who author and run the configuration-monitoring scripts (Ansible, Terraform, shell, it does not matter) that verify the hardening and patching state of all assets. Realize that these scripts behave like real code: you will store them in a source repo and create new releases of them instead of just replacing a parameter in a shell script on your c:\ drive. (A minimal sketch appears after this list.)
  • Be prepared for increasing pressure from cloud vendors: they will exploit the widening functionality gap between their cloud-based and on-prem offerings, produce licensing arrangements that make their cloud-based services more compelling (eg. Microsoft's Hybrid Benefit, where you double your existing on prem Windows Server license amount IF you use their cloud-based KMS service) and eventually discontinue their on prem product ranges altogether, just as Atlassian has already announced.
  • Change your mindset: thinking in static, dedicated source and destination IPv4 addresses is the past. A cloud provider will not guarantee that the IP address range of a VM scale set or a Kubernetes cluster will be the same two weeks from now as it is today. Think in FQDNs instead of static IP addresses and use the DNS service of the cloud provider.
  • Insist on discipline where it matters: protecting the endpoints, primarily the mobile devices. Discipline applies to senior management as well.
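
To make the “scripts are real code” point concrete, here is a minimal sketch of such a configuration-monitoring check in Python. The baseline values and the sshd_config path are my own illustrative assumptions rather than anything prescribed above; the point is that the desired state lives in a source repo next to the script and the check can fail a pipeline on drift.

```python
#!/usr/bin/env python3
"""Minimal hardening-drift check, meant to live in a repo and run from CI.
The baseline settings and the sshd_config path are illustrative assumptions."""
import sys
from pathlib import Path

# Desired state, versioned together with the script instead of living in someone's head.
BASELINE = {
    "PasswordAuthentication": "no",
    "PermitRootLogin": "no",
    "X11Forwarding": "no",
}

def actual_settings(path: Path) -> dict:
    """Parse 'key value' lines of an sshd_config-style file, ignoring comments."""
    settings = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        settings[key] = value.strip()
    return settings

def main() -> int:
    config = Path(sys.argv[1] if len(sys.argv) > 1 else "/etc/ssh/sshd_config")
    actual = actual_settings(config)
    drift = {k: (v, actual.get(k, "<missing>")) for k, v in BASELINE.items()
             if actual.get(k) != v}
    for key, (want, got) in drift.items():
        print(f"DRIFT {key}: expected {want}, found {got}")
    return 1 if drift else 0   # non-zero exit lets a pipeline fail the run

if __name__ == "__main__":
    sys.exit(main())
```

Run it from CI or a scheduler against every asset; the non-zero exit code is what turns the script from documentation into a guardrail.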

Focus on your people

  • Many companies have the cash to buy the best-of-breed ITSec offerings on the market but lack the skills and capacity to get the most out of them. Reverse this trend: hire the best possible people and explain to HR that compensation tensions are less painful than losing the trust of your clients.
  • Financial realities will force traditional ISVs to port their core offerings to the cloud, and their limited resources will dictate that they place their bets on these cloud-based versions, thus slowly but surely abandoning their on prem versions. The tendency will reinforce itself with every product iteration. The gap will widen. Beef up your cloud related skills and capacity.

Learn to code and automate everything

  • If you measure the latency of a response in months due to capacity shortage, and then you manually execute a process based upon outdated config information, you will miss the target. The more manual steps you put into a process, the more error prone it becomes, introducing “flavors” into the execution. When you add flavors to the process, your quality assurance becomes a lottery. Automate every step in your process, including auditing your own work.
  • Defense in depth: while the “castle and moat” approach is outdated, maintaining various layers of defense is very much alive. The goal is to protect every asset in the org with a vigor and an investment proportional to the value of the asset being protected. Eg. do not protect information that is already on LinkedIn, but create a dedicated subnet for your really important stuff, with well monitored control points into these subnets.
  • Patching a vulnerability a year after it was discovered is an autopsy. Real-time monitoring and detecting and reacting to anomalies in near real time will be crucial. Voluntary “confession” of ITSec considerations in an Excel sheet is as useful as resuscitating a corpse (except for audit purposes). You need to automate the discovery and eventually the whole response. (A minimal detection sketch appears after this list.)
  • Go beyond the static (one-time) snapshot mentality, where the name of the game is making any change difficult; accept the new rules and become able to detect these changes and respond to them very quickly.
  • Focus on AI: the role of AI will become prevalent in ITSec on both the attack and the protection side. Bluntly put, algorithms will fight algorithms within ten years. (I risk the estimate that this is already the case on the attacker side.)
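
To make the near-real-time detection point concrete, here is a minimal sketch that scans a stream of resource-change events and flags guardrail violations as they arrive. The event format and the two example rules (public storage, unencrypted disk) are hypothetical assumptions of mine; a real pipeline would consume the cloud provider's activity log and open a ticket or trigger auto-remediation instead of printing.

```python
"""Minimal post-control sketch: flag policy violations in a stream of change events.
The event schema and the guardrail rules are hypothetical, for illustration only."""
import json
import sys
from typing import Iterable

def violations(events: Iterable[dict]):
    for event in events:
        resource = event.get("resource", {})
        # Guardrail 1: no storage exposed to the public internet.
        if resource.get("type") == "storage" and resource.get("public_access"):
            yield event, "storage opened to public access"
        # Guardrail 2: no unencrypted disks.
        if resource.get("type") == "disk" and not resource.get("encrypted", True):
            yield event, "disk created without encryption"

def main() -> None:
    # One JSON object per line on stdin, e.g. exported from an activity-log feed.
    events = (json.loads(line) for line in sys.stdin if line.strip())
    for event, reason in violations(events):
        # A real setup would raise an incident or auto-remediate here.
        print(f"ALERT {event.get('id', '?')}: {reason}")

if __name__ == "__main__":
    main()
```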

Bottom line: all vectors point in one direction: ITSec needs to change and has to learn to automate, that is, it has to learn to code. As always, I appreciate your feedback.

 

PS: The first image is the IJN Mikuma, a Mogami-class heavy cruiser, sinking during the Battle of Midway. The other images were generated by https://openai.com/dall-e-2

Horseshoe bend #2 – Are we there yet?

are_we_there_yet_jpg.png

Sponsors tend to want to know how any project in their realm is getting along and, above all, what they get for the money they threw at us. They ask the same question over and over again: are we there yet? To be honest, when you have requested a few million bucks for a cloud implementation, it makes sense to know what “there” is and to be able to tell when you have reached that point. This is #2 in the cloud related articles dubbed the Horseshoe bend, focusing on the measurement of the outcomes of a cloud implementation.

In the case of a cloud adoption program there are four sets of folks in your organization whose interests you need to cater for: the business (the guys who fund the whole thing), ITSec – the knights who say Ni (or rather No), IT Ops, who see this whole thing as unnecessary, and last but not least the compliance folks representing regulatory scrutiny. The rest of this article attempts to set reasonable targets for each stakeholder group, define metrics for each of these targets and, at the end, explain why you should not stress the whole thing beyond reason.

The business metrics

  • The ability to respond quickly to a surge in demand (or a sharp decline for that matter) – this is a no-brainer, as long as you apply the ground rules of Infrastructure as Code (AND as long as your cloud provider does not run out of steam). (Metric: being able to spin up additional compute/storage resources within a few hours of the demand.) WARNING: it only makes sense to dynamically scale the infrastructure if the application layer is able to take advantage of this capability.
  • The speed of infrastructure design and implementation, from the request until it actually goes live. This is the one that has a great effect on developer productivity. The way to do it is by using technology building blocks and the underpinning blueprints, combined with automation. I mean full automation, with no manual intervention at all. This requires that ITSec and ITOps GIVE UP pre-control and move to post-control with near real time policy violation detection. Approve the design, not the actual instance, and check whether we have strayed away from this design.

one_ticket.jpg

The caveat is when you need to link your shiny new cloud environment with its on prem buddy, carrying a bunch of legacy technologies and, more importantly, legacy processes. It is like Lightning McQueen pulling Bessie. Yep, it may not be that fast… (Metric: the time between the first and the last related ticket when designing and implementing an IT infrastructure should be 25+% shorter than for its on prem counterpart.)

lightning_mcqueen.jpg

  • Cost transparency – this is easy, just implement proper tagging and a data analysis/visualization tool (a pedestrian Excel with a SQL backend will do) on top of the analytics report. Warning: it can be a double-edged sword in environments with poor cost transparency since – while it can indeed tell to the penny who spends how much on what – this can be pitched as a weakness compared to an on prem alternative where the costs are unknown or where the actual user of a service does not feel the pain of their extravagant requests. (Metric: report AND forecast the cloud spending by cost center. Produce cost reduction suggestions as a bonus. A minimal reporting sketch appears below.)
  • Technology adoption speed – the marketplace of any major cloud provider contains thousands of applications and development/management/monitoring tools, two orders of magnitude more choices than your on prem IT can handle. Balance is the key word here: too much freedom would throw a monkey wrench into IT Operations, while banning the inflow of new technologies would defeat the purpose of the whole thing. Clogging the path of innovation is a very bad idea, therefore when ITOps can no longer handle a new technology, apply the “you build it, you run it” principle.

innovation_vs_complexity.jpg
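
To illustrate the cost transparency metric above, here is a minimal reporting sketch. It assumes a per-resource billing export in CSV with cost_center and cost_eur columns; those column names are my own assumptions for illustration, and real exports differ per provider, but the tag-then-aggregate logic is the same.

```python
"""Minimal cost-transparency sketch: aggregate a billing export by cost-center tag
and compare it with the previous month. Column and file names are assumptions."""
import csv
from collections import defaultdict

def cost_by_cost_center(path: str) -> dict:
    totals = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Untagged resources are the first thing a tagging policy should eliminate.
            totals[row.get("cost_center") or "untagged"] += float(row["cost_eur"])
    return dict(totals)

if __name__ == "__main__":
    current = cost_by_cost_center("billing_current_month.csv")
    previous = cost_by_cost_center("billing_previous_month.csv")
    for cc in sorted(current):
        delta = current[cc] - previous.get(cc, 0.0)
        print(f"{cc:20s} {current[cc]:12.2f} EUR  ({delta:+.2f} vs last month)")
```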

The Technology metrics

As long as you opt for IaaS, you will have to deal with the same duties as if these VMs were in your own data center. And in some cases, you cannot avoid deploying VMs in your cloud subscription. Unless you plan to operate what you have built yourself, you need to accept that demanding the same processes as used on prem is a legitimate ask from Ops. The problem arises when those processes are siloed and littered with manual steps. IMPORTANT: the strength of a cloud infrastructure is given by the level of integration between its components. As soon as you start to operate the various components in separate silos, you kill the essence of the whole thing. This begs for a dedicated Cloud Operations, but that would question the status quo. Anyway, here are the technology metrics:

  • Know what you have: as long as you deal with a computing resource deployed for longer than a few hours, you want it in your CMDB. It is obvious but easily forgotten that this CMDB is on prem. (Metric: all CIs are known to the CMDB.)
  • Config management: automation can be a key differentiator here. Rather than trying to find an error by eyeballing config files, one can write code that makes sure reality equals the design. (Metric: the number of differences between the designed and the actual parameters. A minimal sketch appears after this list.)
  • Monitoring: cloud providers use the same components, architectures, hypervisors etc. (but not the same processes) that you do, therefore they are susceptible to the same errors as their on prem counterparts. Things will go wrong sometimes, so you have to implement monitoring. For smooth coexistence, feed the metric streams into both the traditional on prem monitoring tool and its cloud native alternative. (Metric: key metrics are fed to a monitoring system with alert thresholds defined.) WARNING: it does not matter how good your infra DR capabilities are if the application layer is not prepared to use them.
  • Incident management: what really matters is how fast and meaningful your reaction to an alert is. This topic is dealt with in ITIL, so I rest this case with the assumption that it is mostly the same as on prem, with one key difference: DO NOT allow anybody to tamper with the production environment manually, since it will create a collision between the parameters set by the automation script and those set by an Operations person. The question is whether you will have the discipline to make changes in the IaC code and then run that code, or whether you cannot resist the temptation to make manual changes. My hunch is that you will violate this rule sometimes…
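
Here is a minimal sketch of the “reality equals the design” check mentioned under config management. The file names and parameter values are made-up examples; the output is exactly the proposed metric, the number of parameters that differ from the design.

```python
"""Minimal design-vs-actual drift report. The JSON files are assumed to be flat
key/value exports (one designed, one taken from the provider's inventory)."""
import json

def drift(designed: dict, actual: dict) -> list[tuple[str, object, object]]:
    """Return (key, designed value, actual value) for every mismatching parameter."""
    keys = set(designed) | set(actual)
    return [(k, designed.get(k), actual.get(k))
            for k in sorted(keys) if designed.get(k) != actual.get(k)]

if __name__ == "__main__":
    designed = json.load(open("vm_design.json"))   # e.g. {"size": "D4s_v5", "encrypted": true}
    actual = json.load(open("vm_actual.json"))     # exported from the current inventory
    diffs = drift(designed, actual)
    for key, want, got in diffs:
        print(f"{key}: designed={want!r} actual={got!r}")
    print(f"metric: {len(diffs)} parameter(s) out of line with the design")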

The ITSec metrics

None of us wants to fall victim to a hacker attack. I learned the following maxim from ITSec people who were clearly ahead of me: “You can inflict way more damage with 1 million USD than you can avoid with it.” The playing field is not even; this alone should make you cautious about ITSec. The problem is when you achieve a relatively strong security posture at the expense of business flexibility. The following list is just scratching the surface.

  • Using Multi Factor Authentication (MFA) for any activity – in the case of the public cloud you are exposed by definition, so your first line of defense is the identity of the users. You need decent Identity and Access Management (IAM) tools and processes. The very minimum is to use MFA in all cases, not just for the admins. (Metric: yep, MFA for all.)
  • The granularity of admin rights, aka. reducing the attack surface: I recall my early days in IT in 1990, when I felt like Mr. Important because I got admin access to the NetWare 2.15 server at my first workplace. Of course, it was permanent; revoking it would have meant a demotion, right? Wrong: you do not need admin access to anything unless you have a job to do with that system. Using Privileged Identity Management (PIM) is an essential way to reduce the attack surface, namely the time window. Of course, its efficient use is based on the assumption that the PIM approval process is fast. In fact, the best thing is not to use admin accounts for anything in a production environment, but to use service principals instead. (Metric: admin rights are granted for a few hours, to the least number of people, only when needed. Bury the global admin account in a safe place and use it only as a last resort.)
  • Cloud native security metrics and best practices: cloud providers will create assessments of your cloud implementation, suggesting improvements. 3rd parties will also produce reports on known vulnerabilities (eg. Sysdig, F5, Red Hat). Read these and act upon their findings. It is wise to procure a penetration test against your own implementation on a regular basis. (Metric: a predefined security score – likely from your provider – and the speed of reacting to these findings.)

The compliance metrics:

d'Artagnan did not worry about the duel waiting for him at 2PM with Aramis, since he knew he would probably be dead by then thanks to his duel with Porthos scheduled at 1PM. I am more worried about hackers than auditors, so I do not have metrics for this area yet. (Okay: being in compliance with the regulatory guidelines, whatever their real meaning is.)

Summary – how to prove to your sponsor that you reached the goal?

The next paragraph might look weird after pages spent on defining these metrics: they are less relevant than what they fail to capture, namely the knock-on effects of a good cloud implementation. As Roy Amara put it: “We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run.” I am convinced that cloud computing is going to have a profound effect on how we do computing in the future. It is not an end in itself but an enabler, and we surely do not comprehend all of its implications, since it is hard to notice things in a system that we are part of, and it is hard to notice incremental change because it lacks stark contrast YET. As always, I will be happy to learn your feedback.

Horseshoe bend #1 – the Why

horseshoe_bend.jpg

The world is full of natural wonders that are photographed every minute. The bend of the Colorado river near Page is one of them. As an amateur photographer I took my own version. (Don't go in January, that is snow in the upper left corner…) Somebody also writes an article about the public cloud every minute, so investing effort in writing the N+1st version carries about as much novelty as the picture above. The ominous #1 signals that my content does not fit in a single article, therefore it will arrive in small chunks, just like the coffeehouse novels of the 1930s. Despite this, I hope that whoever invests a few minutes in reading this piece will profit from it.

Why is this cloudy thing relevant?

  • The strongest argument is bridging the chasm between the demand of the developers (and the business behind them) to follow a zigzag path (a.k.a. going agile) and the current capability of the on prem IT infrastructure to satisfy this demand. The business wants to experiment, ever faster and of course at the lowest possible cost, while on prem IT still thinks in annual budget cycles, where rolling out a new piece of hardware takes 3-4 months from approval. (If we consider the current chip shortage, it is easily over 5 months.)
  • There are usually two complaints about IT infrastructure beyond stability: it takes ages to grow and it cannot scale down, ie. it is rigid. Originally I drew the chart below as a fun fact to illustrate my point in a discussion with an executive: the development of IT infrastructure – beyond the brute-force increase in power – is the capability to follow an arbitrary demand curve with an increasingly precise answer (in good old calculus terms, Δt approaching zero). One of the advantages of the public cloud is the capability to support both the dynamically scalable technology (microservices in containers) AND the tooling that can automatically provision and manage it. (aka. Infrastructure as Code)

delta_t_tart_nullahoz.jpg

  • A colleague of mine once argued against the public cloud that since everything is changing so fast (COVID, the war in Ukraine, the looming recession, the growing inflation), one cannot plan even for 3 years, therefore we are at leisure to think about it for a few more years. While the reasoning is correct, the statement it is trying to support is dead wrong: this very unpredictability demands the ability to change direction fast. And the public cloud does exactly this: it prepares you for the unknown. It happens that a republic (“United forever in friendship and labour, Our mighty republics will ever endure.” Aha..) batters down its little brother because it dared to venture too close to countries not loved by the big brother. The Ukrainian National Bank dithered over the public cloud topic for over ten years, then suddenly approved its use for banks within a week this March. There are times when one needs to be fast.
  • My last argument: the bulk of technology innovation lately shows up in the marketplaces of the major cloud providers, aka. “cloud first”. I think we are not far from the era when this switches to “cloud only”. The functionality gap between the cloud and on prem versions of the same product widens every year: first comes the honey on the string, then comes the stick. (Sorry dudes, this is deprecated, you have no choice.)

The rationale of the deniers – slightly ruffled

For the full picture we need to discuss the arguments of the naysayers.

  • “We can do the same thing on prem!” It is true that any technological advancement and process innovation can be copied and implemented in an on prem environment. I am almost as good looking as Thor, all I need to do is just a bit more exercise and I will be there in no time.
    An average hyperscale cloud provider can dedicate more SW engineers to this task than most Hungarian enterprises combined. If we accept the theory that eg. Microsoft allocates its resources to a product line based on its revenue potential AND we take into consideration that MS has roughly 160 thousand employees AND Azure was behind 22 billion USD of the 168 billion total revenue last year, then it is fair to estimate that ca. 21 thousand people at MS are working on Azure day after day. At least one third of this army are developers, with technical leaders like Mark Russinovich. It is darn hard to win this race against MS (or AWS for that matter); we should compete somewhere else.
  • “You did not have to wait for the infrastructure, this was the business messing around wasting time.”
    • In enterprise IT it takes several months from request to fulfillment to serve an infrastructure demand WHEN the hardware was already in the data center at the time of the request. If procurement starts its “Speedy Gonzales” process AFTER the demand arrives, we are talking about at least 6 months.
    • If we replace the term “fidgeting” with experimenting, then we have to accept that the business sometimes does follow a zig-zag path. Although the expression is a bit overused this is still true: the business wants to be agile, it will place its bets on multiple things, it will change its mind and sometimes will make mistakes. The best supporter to “fail fast” and “fail cheap” is the public cloud.
  • „The cloud is expensive!” – There is a large amount of truth in this statement: when used in earnest, cloud services can be pretty expensive. On the other hand, the statement is misleading for the following reasons:
    • The cost of an on prem infrastructure is a flat fee in nature, regardless of its utilization. For the record: an average on prem enterprise IT infra runs with the efficiency of a steam engine, a Diesel engine at best; it uses 20% of its capacity while you pay for 100%. Cloud services pricing, on the other hand, is consumption based, ie. it will cost you dearly if you leave the lights on when not needed. Generations of IT folks grew up on the mantra that leaving the lights on was okay or even a good thing, since it gave a timeslot to the maintenance scripts and patches that ran throughout the night. We will need to override decades-old reflexes.
    • Most enterprises carry a huge amount of technical debt, and controlling departments do not even try to estimate its hidden cost. (If we draw an analogy between technical debt and financial debt, it becomes clear that the “interest” on this technical debt is the firm's slower reaction to change.) If you can reduce your technical debt by using the public cloud, it will make your company go faster, and that is worth money. This is the benefit that you never take into account when examining the cost of the cloud.
    • Most large enterprises cannot tell how much a given IT service exactly costs them. (True respect to those few who can.) This is the equation that is hard to solve: the sum over all N items of the IT service portfolio of (unit price × number of units consumed) = total IT cost. We know the prices of the cloud services, but the on prem service prices are smudged and distorted in the big common pot. It may even happen that cost transparency backfires, when the on prem folks claim “more expensive” while discreetly hiding the fact that they do not even know how much their own stuff costs.
  • The cloud is not secure – Please put the „Common vulnerabilities in Java” string into your (Google) search window. Then, if you are not nervous enough yet, replace the Java part with dotNet, then with your favorite (mobile) OS, etc. How long did it take you to fix all vulnerabilities related to log4j or Heartbleed? The question is NOT whether you are vulnerable but how long it takes you to realize that you have been hacked and to do something about it. I do not want to understate this topic; the last really trustworthy firewall was the two-inch air gap. I want to point out that the cloud is as vulnerable as your on prem infrastructure, and there is a chance that more and better trained ITSec engineers are attempting to reduce the risk there than in your on prem environment. Of course, it is an entirely different cup of tea when the service provider (or the state) itself wants to look into your data.
  • You cannot use the cloud because of compliance requirements – meeting the requirements of PCI-DSS (Payment Card Industry Data Security Standard), SOC 2 (System and Organization Controls 2), HIPAA (Health Insurance Portability and Accountability Act), ISO 27001 etc. is a daunting task indeed. It is quite funny to hear this excuse from the IT people of firms that satisfy none of the standards above; furthermore, pulling off this trick is not even in their plans. The large cloud players did it years ago and withstand the endoscopy of auditors on an annual basis.

Summary – what comes after the curve in the road?

IT infrastructure is becoming a commodity. This commodity is indispensable to our survival (in case of banks to the very existence), but does not bring a sustained competitive advantage compared to others who also use this technology.

The cloud, like many other technological advancements before it, brings something new that previous technologies could not do, and this will change the rules of the game. The question is whether an enterprise can still benefit from cultivating an on prem IT infrastructure and whether an on prem IT can compete with the capabilities of the hyperscale cloud providers. The answer to the first question is a probable yes, to the second one a definite no. As we know from Niels Bohr, prediction is difficult, especially about the future, but I will give it a try.

The balance (between on prem and cloud) will be influenced by the business goals of the firm (a mom-and-pop shop vs. a multinational trying to conquer the globe), the playground defined by the regulators, the sensitivity of the data handled and the optimum between cost and speed. It will vary between industries and company segments. The less legacy you carry (eg. a startup) and the further you are from the heavily regulated industries (ie. not a government), the more likely it is that within a few years the only on prem HW equipment you will end up with is a coffee machine and a photocopier. If you have hundreds of legacy (on prem) applications and you are heavily regulated, chances are the balance will settle around 65-75% on prem vs. 25-35% cloud.


The 4th wise man

img_e9365.JPG

A few months ago I got into a conversation with a senior executive about standards. I elaborated on the merits of standardization, mentioning companies where one could find 3-4 different technologies for the same function, incompatible with each other, not to mention the cost of operating and integrating all of them. The executive listened carefully, then pointed out that he had a worry about standards: they curtailed innovation. So he would rather live with the higher cost to preserve the organization's ability to innovate faster. The debate made me wonder if I had been wrong all along with my thesis, so I started digging. This blog post is my take on the subject.

I based my debunking of the “standards hurt innovation” myth on asking WHEN a standard should be inaugurated. (Credit and thanks to another executive for his help.) The illustration below combines the technology adoption lifecycle curve from Geoffrey Moore and the Hype Cycle model from Gartner.

 hype_cycle_and_adoption_curve_together.JPG

https://www.loadbalanceworks.com/newsDetail.asp?PostID=54624&n=5g-the-slope-of-enlightenment

Wait with setting your company standards until the market starts consolidating and the future winners are in sight, ie. until you can make a good bet on them. The Hype Cycle gives guardrails for when this solidifying moment arrives. As much as a well-chosen standard will help you reduce complexity, hence cost, an overripe standard will indeed become a blocker. There are two more questions that we do not cover today: when to move away from an old standard, and which technology to choose from the existing, stable options.

At this point I still had a nagging question in mind: if it is not standards, then what are the real inhibitors of innovation? We know for sure that if a company loses sight of the “3rd horizon” (3+ years into the future), it risks its demise as soon as another player finds an area where its innovation becomes transformational. (Eg. some fintechs did a decent job on online customer onboarding way before COVID; they just had not found the right business model yet.) The rest of the post is a homegrown set of rules on innovation for established financial institutions.

Rule #1: know where you want to innovate and where you can. I recall an investment bank that decided to create its own container management technology, that is, to beat Docker/Kubernetes. Their argument was that the internals of the bank were so unique that tweaking an off-the-shelf solution would have taken as much effort as writing their own. I also recall a bank that created its own private cloud with a handful of people over several years, not to mention that it started by purchasing a truckload of hardware rather than having a look at its existing provisioning processes. Neither of these cases was founded on a sound financial calculation, and both missed the fact that some very large market players had thrown a hundred times the staff and money at the problem EARLIER than they did. IMPORTANT: innovation <> doing something differently from the mainstream. The 13-bit microprocessor did not become mainstream for a good reason…

rocket_scienetists.JPG

Rule #2: make a clear choice whether you are an innovator or a fast follower.

I recall a meeting from another century when I tried to sell something cool to the CEO of a local GSM firm and quoted (even hummed) their tag line “XY GSM, the cutting edge” to reinforce my message. The CEO interrupted me and made an interesting statement: “We no longer want to be the cutting edge, that is expensive and risky; we want to be fast followers.” In this case a fast ingress process for bringing in external innovation is crucial. Reaction speed trumps everything else in these firms.

silence_please.jpg

Rule #3: Do not try to fix a process problem with (yet another) piece of technology.

There is an overarching theme in wealthy companies: they want an easy solution for a difficult problem, that is, they want to fix process problems with technology. When they inevitably fail to achieve their goal, they blame the given vendor, ditch the chosen technology and pick another one, hoping for a better outcome. (This always makes me frown, since it is darn close to the definition of insanity.) Purchasing a new technology is not innovation by itself, it just shows that you can afford it. (It is like a hobbyist photographer purchasing a medium format camera hoping that it will turn him into the next Ansel Adams overnight.) Technology might speed up a process, but what if it just makes a faulty process faster?

Rule #4: There is no such thing as risk-free innovation where the result is guaranteed.

It is enough to mention Edison, whose team – in search of a filament for the lightbulb that would be durable but inexpensive – tested more than 6,000 possible materials before finding one that fit the bill. For this reason, do not try to use the same set of KPIs that you run your daily business with, BUT do have a separate set of KPIs and a clear definition of success.

Rule #5: regularly scrape the hull of the ship! I think it is not funding that makes innovation. If you do not believe me, check out the movie called “The boy who harnessed the wind.” (Of course, a second bicycle would have made the case easier…) The real impediments to innovation are broken, anachronistic processes, usually sustained by the silo nature of an organization. I recall a company where a business request would go through four disconnected JIRA queues before ending up as a set of separate ServiceNow tickets. (It takes only a few months to get through this maze.) Now imagine Edison requesting the 6,000 (!) different materials he tested before landing on carbonized bamboo through this process. NOPE! Check out the value creation process to see where customer or employee engagement suffers and fix the process BEFORE you do anything else.

scraping_the_hull.JPG

https://www.drjdavidson.com/blog/2013/07/take-time-to-scrape-off-the-barnacles

An alternative solution is to leave the mothership behind and create a semi-independent “Skunkworks” unit where the rules of the mother company do not apply. After all, it worked for Lockheed when they created the SR-71. Note: If you separate the team who is entitled to innovate, make sure there is a natural way to ingest their findings back into the regular business.

Rule #6: Get people who can actually make a difference. To be politically incorrect – and I can joke about this one – I use the “One-Legged Tarzan” sketch to describe the problem of innovating with people who are 10-15 years behind the cutting edge. Tarzan is “a role which traditionally involves the use of a two-legged actor”, and it would be unusual for the part to be taken by a “unidexter”. Of course training can help, but there is still a lingering doubt about waking up one morning to realize that you are the old dog who may no longer learn new tricks…

the_one_legged_tarzan.JPG

https://www.youtube.com/watch?v=njK6zQp2Fdk

Rule #7: Nurture cross-unit collaboration by breaking down the silos. The following quote is from an HBR article, “The Biggest Obstacles to Innovation in Large Companies”, by Michael Britt.

“Any time you start something new like [an innovation initiative] that cuts across many areas, there's a potential for people feeling like you're in their backyard.” In these organizations any change will provoke a strong reaction: people feel attacked because you trespassed into their territory. The problem is that most value creation processes involve multiple departments, therefore one cannot really fix them without “trespassing”.

The last word: an innovation push without a nurturing cultural background is like a new coat of paint on a rusty surface, it will not last. Remove the most important inhibitor, the fear of being seen as vulnerable: even big shots – just like other human beings – may not know the answer to everything and sometimes even make mistakes.

As always, I appreciate any feedback on this post.

 


Dinosaur for breakfast

dinosaur_for_breakfast.png

I got a question from a colleague about how I would approach the replacement of an aging core banking system (CBS). Although I had an encounter with a project of this kind earlier, I wanted to give a more elaborate answer, so I got in touch with a few silverback CIOs of the local IT community who had first-hand experience with these beasts. (I express my appreciation for your help, Guys!)

This post is a summary of the interviews I had with IT executives in the CEE region, mixed with my own observations. I dare say that most of the suggestions below hold true for telco billing systems, or for any major endeavor that touches the core functions of the firm and has interfaces with a large number of other systems. So here we go:

  • Start with the why – if these reasons are not shared by the top management, do not start the whole thing. The best analogy to a CBS replacement is changing the engine of an aircraft in flight: it is the last thing you want to do, since it will bog you down for 5+ years, will cost you your proverbial shirt and nobody can guarantee that it will succeed. You must have a well-understood, easy-to-communicate reason why you do this, and you have to have the board's commitment to support your venture, sworn in blood. (They might forget about their oath in two years...)
  • This is NOT an IT project but rather a process/product level overhaul of the ship with significant technology support. (To prove my point, check out the Standish CHAOS reports or the various lists of the biggest flops in IT – technology itself is NOT the primary reason for failure in most cases; unrealistic changes driven by politics are a bigger danger.)
  • Choose the right time – “When the moon is in the Seventh House and Jupiter aligns with Mars”, ie. when the ownership and management structure of the firm is stable, the economic environment is fine and the regulators are relaxed and not introducing major legislation that demands immediate action.
  • The sponsor and his/her relationship to the PM – you are not tinkering with a pimple on your chin, this is heart surgery! As historical records show, the projects that achieved their business goals were those where the PM had the unwavering trust of the CEO, ie. the head of the program is not a CIO direct report but a CEO direct report. (The CIO would qualify but is usually busy running IT as we know it.)
  • Keep the ten commandments – DO NOT allow any customization in the core, or live with the consequences. There is a concept from Gartner called pace-layered IT architecture. In a single sentence: tinker with the application layer where your competitive advantage actually lies (systems of innovation) and DO NOT mess up the lower layers, especially the systems of record. I guess the 11th commandment, the one the business keeps sacrificing for short-term gains, is “Thou shalt not build frequently changing business functions into the foundation of your institution”. (The 12th is “Thou shalt not create point-to-point interfaces”. Well, they are as hard to keep as “Thou shalt not covet thy neighbour's wife.”)
  • KYD – Know Your Dinosaur – some of these systems linger around for 20+ years, carrying a thick guano of poorly documented changes, with the original developers gone for years by the time your great adventure starts. This is reverse engineering time when the project team tries to understand the process from the code. It will take time.
  • Make your Dino simpler – remove any functionality that does not belong to the core feature set. This means undoing the sins of the past (eg. when you built OLTP functions into your data warehouse or baked the deposit management of a loan into the a/c handling system itself). Of course, when you create a centralized client master data management solution, you will have to create an interface to the master data management in the new system. (Do not allow multi-master solutions.)
  • Staffing – The teams of these projects can grow substantial (up to 150 people). These folks are the ones with the deepest knowledge about your existing processes and systems whom you take away from their day jobs and who will be greatly missed by their line managers. It makes sense to set up a formal process to regulate this exodus of talent from the daily business and to reach out to system integrators or local partners of the COTS vendor to fill the gap.
    A guaranteed source of conflict is when the annual resource planning exercise ASSUMES that these jolly jokers are still in their regular positions and assigns tasks to them. The business will attest that the Earth will stop spinning without these folks and will reclaim them. This is when deadlines start vanishing. A potential way to avoid this conflict is to ask for dedicated people and even to create a dedicated org unit for the project.
  • Coexistence – Life will not stop for the years while you are building the “Great New Thing”. The owners of the current systems will keep churning out new releases, will modify interfaces or even change the underlying data structures. For this reason it is vital to capture these changes and to make sure that you have the latest version of all corresponding systems in your test environment. You need to automate – on top of automating the testing of the new system itself – the buildup of an integrated test environment, including the creation of the test data.
  • Evolution vs. Revolution – I got feedback from an ex-colleague of mine that I missed an aspect, namely that you have to produce something on a regular basis that the business can actually use. This keeps alive the hope that you are moving in the right direction and gives the client a chance to provide feedback.
  • Test data – Gary Larson mentions a separate chamber in Hell for those who drive slowly in the fast lane. I think there is another bucket for those who invented GDPR. Imagine building a test scenario where all systems depersonalize their master data in their own right. You may want to be a bit more forgiving when enforcing those GDPR guidelines… (A minimal depersonalization sketch appears after this list.) 25+ years ago lawmakers in Hungary abolished the use of the personal ID, since it could allow those nasty IT people to link disjunct databases. So the industry went back to using strings instead (name, mother's name, address etc.), UNTIL the government really wanted to identify you and asked for your social security number and your tax ID for mostly anything (the two together are as good as the personal ID was). I think the fact that the whole society is self-profiling itself on social media for poo emojis will cause even bigger trouble, and lawmakers are lukewarm at best about stopping it.

gdpr_resize.png

  • The interfaces – the Achilles heel of any complex IT project. A CBS can have 50+ interfaces, using technologies invented by your ancestors. You set a goal to replace these not-so-secure interfaces with something modern (for the record: ITSec holds you at gunpoint to do it). The issue: it requires changes in the other systems by the very people you just brought over to your project. Oops.. Ok, you decide to EMULATE the old interfaces to the outside world while going super duper inside. Things start to get ugly, so you obtain the first permission to fall back on the old solutions, “just temporarily”. For the record: temporary solutions will stay for 10+ years. The business will never allow you to spend money replacing them!
  • Close coupling – this is the fancy name for not using APIs and, eg., sucking data directly out of another system's database. This is cool until they change the DB layout… If there is a place in life for enterprise architects, it is guarding adherence to design best practices regarding interfaces.

close_coupling.jpg

  • The vendor – vendors love it when you are on the hook. It is like a thick needle in your vein, pumping your money into their pockets, and you just cannot escape. As long as they are good at what they do, this might be acceptable. They will be nice during the courting phase, but the gloves may come off when the first non-acceptance occurs. It makes sense to have escape clauses for both parties, with well-defined milestones.
    It is also important to note that all the promises made by the vendor about new functions rolled out on a regular basis – paid for in the annual maintenance and support fee – WILL NOT be available to you once you start to customize the base offering. (This applies only when you go for a commercial off-the-shelf solution.)
  • The system integrator – as mentioned earlier in the staffing section you are likely to run short on skilled people, so you are likely to turn to a system integrator for help. The issue is when you get an army of rookies instead of the highly skilled folks you met during the presales phase OR when you realize nuances like a kickback from a certain HW vendor. Make sure you really understand the business model of your provider and accept the fact that good people are expensive. A theoretical alternative is to solely rely upon your own internal project team, but the recruitment and ramp up hurdle makes it challenging.
  • 24 x 7 – in the era of instant gratification people just cannot live without an always-on banking service. This requires your new system to work without the old daily closure, during which you had to maintain a shadow balance while the CBS was busy with its end-of-day batch processing.
  • A word on HW: Do not start with buying a bunch of iron and licenses, use a public cloud offering and a bare minimum of licenses until you are done with the process related issues. Consider not buying HW at all, but staying on a public cloud throughout the whole project and moving back only at the end if necessary.
  • A word on SW: make sure that your new CBS (auto)scales out rather than up. Most managers are familiar with the concept of containers and microservices, but some do not realize that the underlying platform only enables this capability; the application layer itself has to take advantage of it.
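
On the test data point above, here is a minimal depersonalization sketch: personal fields are replaced with a keyed hash, so the same client still maps to the same token across systems and the interfaces stay testable, while the real values stay out of the test environment. The field names, the CSV layout and the environment variable are assumptions for illustration only.

```python
"""Minimal test-data depersonalization sketch using a keyed hash (HMAC-SHA256).
Field names, file names and the PSEUDO_KEY variable are illustrative assumptions."""
import csv
import hashlib
import hmac
import os

SECRET = os.environ.get("PSEUDO_KEY", "change-me").encode()   # keep the real key in a vault
PERSONAL_FIELDS = ("name", "mothers_name", "address", "tax_id")

def pseudonymize(value: str) -> str:
    # Same input always yields the same token, so cross-system joins still work in test.
    return hmac.new(SECRET, value.strip().lower().encode(), hashlib.sha256).hexdigest()[:16]

def depersonalize(src: str, dst: str) -> None:
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for field in PERSONAL_FIELDS:
                if row.get(field):
                    row[field] = pseudonymize(row[field])
            writer.writerow(row)

if __name__ == "__main__":
    depersonalize("clients_prod_extract.csv", "clients_test.csv")
```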

 

The not so PC stuff

  • Beware of Conway's law – some financial institutions have more than one core banking system, operated by separate silos. To avoid never-ending turf wars it makes sense to consider moving the various a/c handling systems into the same org unit. As Melvin Conway put it: “Any organization that designs a system will produce a design whose structure is a copy of the organization's communication structure.” (Read: a copy of its org chart.) If management decides to do a major reorg, they should do it well before the project starts, since it takes 6+ months after such an event until functions, competencies and accountabilities are aligned again.
  • The location of the vendor team – “It will be done by next Monday” means different things in Europe and in other parts of the globe. Chances are that your dev team will be located in India, so you need to get accustomed to the cultural differences.
  • Occupational hazard of the PM – if something goes wrong with a large-scale undertaking, upper management will look for a scapegoat, most likely the PM. This is business as usual, as long as you negotiated a decent severance package in advance.

framed.JPG

 

My interviewees highlighted that they were only scratching the surface during our discussions. I hope that you enjoyed reading it. As always I will be happy to hear from you on this topic.
