This post is an attempt to identify the root cause of the apparent divide between the two major branches of IT and to offer a remedy to this problem. (ambitious, isn't it?) As always this coin has two sides, so I would like to learn the view of the Dev folks and IT Operations folks as well.
Contradictions
During the 2+ years of running the cloud transformation at a commercial bank I faced contradicting views on the following aspects of how IT could function.
- One extreme argued that the cloud is just another data centre, therefore it should be treated the same way as our own: same (ticket-based) processes, same technologies (i.e. nothing beyond what we already have on prem) and, most importantly, the same speed of letting new things in.
- The other extreme exclaimed that the cloud shall change most aspects of IT as we know it, we should replace the stop signs (approvals) with guardrails (policies), automate every aspect of our daily life and most importantly treat IT infrastructure as a product that we want to sell to our (internal) clients.
I recall when 28 years ago, being in charge of introducing Exchange 4.0 in a local commercial bank, I attempted to explain to a deputy general manager that printing and faxing each and every e-mail he sent (in order to make sure the other party received it) was suboptimal and a read receipt was enough. (Not kidding.) I cared more for the consulting revenue than the rain forests, so I dropped the case. In the same bank I had arguments with the network folks that tracking IP addresses for every Windows desktop in a paper-based grid notebook was suboptimal compared to DHCP. I did not drop this one, gaining friends until they ran into issues due to duplicated IP addresses and the joy of troubleshooting them. (Then they relented…) It has been bugging me ever since why it took them so long to realize these things. Why is it so darn hard to embrace change?
The psychology of IT Operations
One day a manager at IT Operations asked me how many times I could recall IT Ops being praised by the senior leadership for things running normally (rarely) vs. how many times I could recall them being reprimanded after a major service outage. To be honest: after every one of them. I had to agree with his point: IT Operations is strongly incentivized NOT to change what works, since the bulk of issues are connected to changing some aspect of the service. Hence the need for a CAB (Change Advisory Board) in ITIL. The root cause of the pushback against change is the deep belief that speed and stability are opposite ends of the same dimension.
At this point I have to borrow a page from the book of Matthias Patzak, who in turn borrowed a page from Simon Wardley and tweaked his map by changing the vertical axis (visibility) to autonomy. Here is a modified Wardley map explaining why change agents are at odds with IT Operations. (a proposed remediation is on the chart)
The question is unavoidable: How can the infrastructure stay unchanged when everything that uses it changes at an unprecedented speed? My hunch: it cannot. The rest of this post is an attempt to prove this point.
The stakeholders’ view
The voice of the customer – In our case the app dev teams:
- Putting the cognitive load on the customer of the service is a guaranteed customer satisfaction killer: the developer has to figure out the internal processes of the service provider (e.g. filing separate ServiceNow tickets for the VM, the OS, the RDBMS, the DNS entry, the domain join and the admin access). It is like Vogon poetry. (the 3rd worst in the Universe) Dissatisfaction is the hotbed of shadow IT. For the record, not just in IT: Ferruccio Lamborghini probably would have stayed with his Ferrari (and his tractor business) if Enzo Ferrari had been a bit nicer to him or had made better clutches.
- Lack of speed and autonomy leads to disengagement. I recall a developer who wanted to test a new feature of MS SQL Server. It took him 3+ months to get a test server. (He knew that the test bed he was asking for would have taken about an hour to implement had he been given the chance. But he wasn't.) By that time, he had given up on the whole idea he wanted to test in the first place.
The voice of the business
- The top management of companies are concerned about unforeseen changes that may have a devastating impact on the livelihood of their enterprise. Their worries are backed by data. According to the Corporate Longevity Forecast, the time a company spends on the Standard & Poor's 500 list is shrinking. In plain English, even large established companies can disappear from the list or become an “also-ran” within a few years. (Nokia, Credit Suisse, GE, Qualcomm bidding for Intel, WTF?) The age of creative destruction is upon us: what worked in the past for decades may not be good enough in the next ten years.
- Enterprises are trying to be prepared for and respond quickly to attacks from any new force in the market. The cloud is one of their bets. All parties but one agree on the following:
a cloud transformation will deliver its value proposition only if the organization and the underlying processes are changed along with the technology.
When money talks - R&D budgets
- If we assume that most technology companies spend the same portion of their revenue on R&D and this R&D has the same impact on the bottom line (sometimes not true), then we may predict that more R&D, when it leads to a breakthrough, results in a quantum leap in profitability.
- If a firm catches one of these quantum leaps in a lifetime, it is lucky. If it catches two, this has long-lasting consequences for the entire industry. (Data points from 2023: IBM made 8.18 billion USD net income, in the same period HPE made 2 billion, Microsoft 86 billion.) The cloud race is over, the AI race has begun and the hyperscalers have more money to spend on it than their traditional competitors.
Source: STATISTA.com (data for HPE is available only from 2015, when they separated from HPQ)
I feel it in my fingers, I feel it in my toes (change is all around you…)
The following chart is a visualization of obtaining infrastructure for an app. Say the dev team working on App 1 wants an infrastructure with an application server with some compute power, an SQL DB, an OS and a VM underneath, plus this thing should be accessible via the web to clients. In a traditional org this would mean 5 separate ServiceNow tickets with manual handover between them. E.g. the virtualization folks would set their ticket status to done, a human being would intercept this change and file another ticket to the OS team to install the OS. These teams are measured on meeting their SLAs, so they would close the ticket even if the client is not able to log on to this server. (After all, identity management is a separate step, right?) Imagine a car dealer who tries to sell you an engine, a transmission, a few wheels and the bodywork as separate items, when you wanted a car…
In a cloud infrastructure it is a set of IaC scripts that run at once. And here comes the problem:
- This automation could be built by a dedicated cloud group, which requires an org change that goes against the will of the existing org units. Injecting the SNOW tickets into the belly of the automation, with the same 5-day SLAs, would take the same time as the traditional setup. If you are against the cloud, all you need to do is insist on sticking to the old process.
- You can grant the right to execute this automation to the developer teams themselves, but it would mean relinquishing control and shifting to creating and maintaining the automation scripts and establishing guardrails (policies) instead of the stop signs.
- Creating and maintaining IaC code, CI/CD pipelines and policies (some people might call it DevSecOps and Site Reliability Engineering) require new skills and could be seen as a threat by those not interested in the above changes.
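To make the difference between a stop sign and a guardrail more tangible, here is a minimal Python sketch. It is only an illustration: the request fields, the allowed regions and the vCPU limit are hypothetical values invented for this example, and a real setup would more likely express such rules in the policy engine of the cloud platform rather than in hand-written code. The point is that the rule is evaluated automatically at request time, so a compliant request never waits for a human approver.

```python
from dataclasses import dataclass

# Hypothetical guardrail values, chosen only for illustration.
ALLOWED_REGIONS = {"eu-central-1", "eu-west-1"}
MAX_VCPUS = 16


@dataclass(frozen=True)
class InfraRequest:
    """A self-service infrastructure request filed by a dev team."""
    team: str
    region: str
    vcpus: int
    encrypted_storage: bool


def check_guardrails(req: InfraRequest) -> list[str]:
    """Return the violated guardrails; an empty list means 'go ahead'."""
    violations = []
    if req.region not in ALLOWED_REGIONS:
        violations.append(f"region {req.region} is not allowed")
    if req.vcpus > MAX_VCPUS:
        violations.append(f"{req.vcpus} vCPUs exceeds the limit of {MAX_VCPUS}")
    if not req.encrypted_storage:
        violations.append("storage must be encrypted")
    return violations


def provision(req: InfraRequest) -> str:
    """Provision immediately if compliant; otherwise explain what to fix."""
    violations = check_guardrails(req)
    if violations:
        return "rejected: " + "; ".join(violations)
    return f"provisioned for {req.team} without a ticket"


if __name__ == "__main__":
    print(provision(InfraRequest("app1", "eu-central-1", 8, True)))
    print(provision(InfraRequest("app1", "us-east-1", 32, False)))
```

In this sketch the operations team owns `check_guardrails` (the policy), while the dev team owns the decision of when to call `provision`. That is the shift of control described in the list above.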
All in all, an innocent technology change proposed by the cloud would require organizational, procedural and skillset changes in an org that does not like change.
There is an interesting observation in the State of DevOps report for 2023. The more frequently you make changes, the more likely you will succeed. The root cause is simple: more frequent small changes (with a working rollback) touch fewer things that can go wrong. If we turn it around: the more worried you are about changing the platform, the more time will pass between changes, gathering more moving parts, that in turn will increase the likelihood that something will indeed go wrong.
A side effect is that this will make your environment less secure (I will not apply that security hot fix because it might break the application – to be honest, sometimes it will) and will accumulate more technical debt.
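The batch-size argument above can be illustrated with a back-of-the-envelope model. This is my own simplification, not a formula from the State of DevOps report: assume every individual change item (a patch, a config tweak, a feature) breaks something with the same small probability, independently of the others. The 2% figure below is a hypothetical number picked for the example.

```python
def release_failure_probability(p_item: float, items_per_release: int) -> float:
    """Probability that at least one item in a release breaks something.

    Assumes each item fails independently with probability p_item,
    which is a deliberate simplification.
    """
    return 1.0 - (1.0 - p_item) ** items_per_release


if __name__ == "__main__":
    p_item = 0.02  # hypothetical per-item failure probability
    for batch_size in (1, 5, 20, 50):
        risk = release_failure_probability(p_item, batch_size)
        print(f"{batch_size:>2} items per release -> {risk:.0%} chance of a failed release")
```

Under these assumptions a release that bundles 50 items fails more often than not, while a single-item release fails about once in fifty attempts. The total number of failing items over a year is the same either way, but the small release tells you immediately which item to roll back, while the big one leaves you searching through 50 candidates.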
There is an expression that is the tell-tale sign of a siloed organisation: “he is criss-crossing in my backyard”, read trespassing into a territory that the speaker considers his home turf. “Any time you start something new like [an innovation – eg. the cloud initiative], that cuts across many areas, there’s a potential for people feeling like you’re in their backyard.” (Michael Britt) The problem is that most value creation processes involve multiple departments, therefore one cannot innovate without “trespassing”.
I got into a conversation with the cloud transformation lead of a large commercial bank a few days ago. He made an observation that struck a chord: only a minuscule portion of the IT Operations workforce (in this bank) embraced the cloud. The rest honestly believed that everything was okay and this cloud thingy was unnecessary, and responded accordingly. I think Amara's law is at work here: "We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run." I am biased in this case, but I believe they underestimate the impact of the cloud and miss the opportunity to increase their market value.
Squaring the circle - the way forward
- The known knowns:
  - the business hates it when cost grows faster than revenue. READ: The days of extensive growth in IT staff are over. (if you care for the whys, check out "Red Plenty") There is one way forward: automation.
  - it is likely that those willing to marry stability with speed will gain the upper hand over those who stick to their guns and obstruct change.
- The known unknowns:
  - technology will create as many jobs as it will eliminate. (a study reported by the Guardian suggests that it creates more than it destroys.) What is unclear is which jobs will stay and which will transform into something new. My bet is that the mundane ones (repetitive ticket crunching) will fade, while those requiring more thinking (e.g. designing those guardrails mentioned above) will grow in relevance.
  - large IT shops carry an enormous amount of legacy: applications that generate the vast majority of business value for the enterprise today. It is difficult to forecast when the above shift will happen and how long it will take.
- The unknown unknowns:
  - IT Operations can hold the business at gunpoint, claiming that any org/process change will pose a threat to the current stability of the business, therefore any cloud adoption should happen on their terms and at a speed deemed suitable by them. The real unknown is how long IT Ops can resist the push from their own internal clients and the hyperscalers. (make no mistake: the stick will follow the carrot soon.)
  - for the record: while industry disruptors are already doing it, my prognosis that technology allows for speed while maintaining stability is not yet proven in large enterprises carrying a legacy.
Famous last words: In 1633 Galileo Galilei had an unpleasant encounter with the Roman Inquisition, which forced him to recant his claim that the Earth moves around the Sun rather than the other way around. Legend has it that after leaving the courtroom he murmured "Eppur si muove" ("and yet it moves"). He spent the rest of his life under house arrest.
As always, I will be glad to hear your feedback.
Sources:
- AWS re:Invent 2023 - How to not sabotage your transformation (SEG201) – Matthias Patzak
- The future of Ops is platform engineering | PlatformCon 2023 – Charity Majors
- Beyond Engineering: The Future of Platforms - Craft Conf, 2023 - Manuel Pais
- Skunk Works: A Personal Memoir of My Years at Lockheed - Ben Rich
- How platform teams get stuff done - Pete Hodgson
- Programmable 2023: Strong and Weak Forces - Evan Bottcher
- What I Talk About When I Talk About Platforms - Evan Bottcher
- Programmable 2024: Engineering Platforms Evolved - Scott Shaw
- com - Tech Radar - Platforms-as-products
- Corporate Longevity Forecast - Creative destruction
- Flying Through Giga Berlin and Xiaomi Automobile Super Factory
- https://medium.com/@gpeuc/debunking-bad-design-memes-part-2-candles-and-electric-light-quote-3f9990784cfe
- https://services.google.com/fh/files/misc/2023_final_report_sodr.pdf
- https://hbr.org/2018/07/the-biggest-obstacles-to-innovation-in-large-companies
- E pur si muove - Galileo Galilei
- https://www.theguardian.com/business/2015/aug/17/technology-created-more-jobs-than-destroyed-140-years-data-census
- https://en.wikipedia.org/wiki/There_are_unknown_unknowns
- https://www.amazon.com/Red-Plenty-Francis-Spufford/dp/1555976042