To a newborn every joke is new, so I have specialized in the old jokes - I keep telling them again and again.

Floorshrink diaries


Horseshoe bend #6: Galileo Galilei

23 September 2024 - Floorshrink

This post is about how traditional IT Operations relates to the public cloud and why this approach is not aligned with the interests of the organization.

Contradictions

During the 2+ years of running the cloud transformation at a commercial bank I faced contradictory views on the following aspects of how IT could function.

contradictions.jpg

  • One extreme argued that the cloud is just another data centre, therefore it should be treated the same way as our own: the same (ticket-based) processes, the same technologies (i.e. nothing beyond what we already have on prem) and, most importantly, the same speed of letting new things in.
  • The other extreme exclaimed that the cloud should change most aspects of IT as we know it: we should replace the stop signs (approvals) with guardrails (policies), automate every aspect of our daily life and, most importantly, treat IT infrastructure as a product that we want to sell to our (internal) clients.
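To make the "guardrails instead of stop signs" idea a bit more concrete, here is a minimal, purely illustrative Python sketch. The resource format and the two rules are invented examples, not the bank's actual policies; real implementations typically live in the cloud provider's policy engine or a tool like OPA.

```python
# A toy "guardrail" check: codified rules evaluated automatically in a pipeline,
# replacing a manual approval step. Resource format and rules are illustrative only.

PLANNED_RESOURCES = [
    {"type": "vm", "name": "app1-vm", "public_ip": False, "tags": {"owner": "team-app1"}},
    {"type": "storage", "name": "app1data", "public_ip": True, "tags": {}},
]

def no_public_ip(resource):
    """Guardrail: no resource may be exposed with a public IP."""
    return not resource.get("public_ip", False)

def owner_tag_present(resource):
    """Guardrail: every resource must carry an 'owner' tag for cost attribution."""
    return "owner" in resource.get("tags", {})

GUARDRAILS = [no_public_ip, owner_tag_present]

def evaluate(resources):
    """Return a list of violations; an empty list means the deployment may proceed."""
    violations = []
    for res in resources:
        for rule in GUARDRAILS:
            if not rule(res):
                violations.append(f"{res['name']}: violates {rule.__name__}")
    return violations

if __name__ == "__main__":
    problems = evaluate(PLANNED_RESOURCES)
    if problems:
        print("Deployment blocked by guardrails:")
        print("\n".join(problems))
    else:
        print("All guardrails passed - no human approval needed.")
```

The point is not the specific rules but the shift: the policy is versioned code that runs in seconds on every deployment, instead of a ticket waiting for a committee.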

I recall when, 28 years ago - being in charge of introducing Exchange 4.0 at a local commercial bank - I attempted to explain to a deputy general manager that printing and faxing each and every e-mail he sent (in order to make sure the other party received it) was suboptimal and that a read receipt was enough (not kidding). I cared more for the consulting revenue than for the rain forests, so I dropped the case. In the same bank I argued with the network folks that tracking IP addresses for every Windows desktop in a paper-based grid notebook was suboptimal compared to DHCP. I did not drop this one, which won me few friends, until they ran into issues with duplicated IP addresses and the joy of troubleshooting them (then they relented…). It has been bugging me ever since why it took them so long to realize these things. Why is it so darn hard to embrace change?

The psychology of IT Operations

 

psychology_of_it_ops.jpg

One day a manager in IT Operations asked me how many times I could recall IT Ops being praised by senior leadership for things running normally (rarely) versus how many cases I remembered of them being reprimanded after a major service outage (to be honest: all of them). I had to agree with his point: IT Operations is strongly incentivized NOT to change what works, since the bulk of issues are connected to changing some aspect of the service. Hence the need for a CAB (Change Advisory Board) in ITIL. The root cause of the pushback against change is the deep belief that speed and stability are opposite ends of the same dimension.

At this point I have to borrow a page from the book of Matthias Patzak, who in turn borrowed a page from Simon Wardley and tweaked his map by changing the vertical axis from visibility to autonomy. Here is a modified Wardley map explaining why change agents are at odds with IT Operations (a proposed remediation is on the chart).

wardley_map.jpg

The question is unavoidable: How can the infrastructure stay unchanged when everything that uses it changes at an unprecedented speed? My hunch: it cannot. The rest of this post is an attempt to prove this point.

The stakeholders’ view

The voice of the customer – In our case the app dev teams:

  • Putting the cognitive load on the customer of the service is a guaranteed customer-satisfaction killer: when a developer needs to figure out the internal processes of the service provider (eg. filing separate ServiceNow tickets for the VM, the OS, the RDBMS, the DNS entry, the domain join and the admin access), it is like Vogon poetry (the third worst in the Universe). Dissatisfaction is the hotbed of shadow IT. For the record, not just in IT: Ferruccio Lamborghini probably would have stayed with his Ferrari (and his tractor business) if Enzo Ferrari had been a bit nicer to him or had made better clutches.
  • Lack of speed and autonomy leads to disengagement. I recall a developer who wanted to test a new feature of MS SQL Server. It took him 3+ months to get a test server. (He knew that the test bed he was asking for would have taken about an hour to set up had he been given the chance. But he wasn't.) So, after three months, he gave up on the whole idea he wanted to test in the first place.

The voice of the business

  • The top management of companies is concerned about unforeseen changes that may have a devastating impact on the livelihood of their enterprise. Their worries are backed by data. The Corporate Longevity Forecast shows that the time a company spends on the Standard & Poor's 500 list is shrinking. In plain English, even large established companies can drop off the list or become also-rans within a few years. (Nokia, Credit Suisse, GE, Qualcomm bidding for Intel, WTF?) The age of creative destruction is upon us: what worked for decades in the past may not be good enough in the next ten years.
  • Enterprises are trying to be prepared for, and respond quickly to, attacks from any new force in the market. The cloud is one of their bets. All parties but one agree on the following:
    a cloud transformation will deliver its value proposition only if the organization and the underlying processes are changed along with the technology.

When money talks - R&D budgets

  • If we assume that most technology companies spend the same portion of their revenue on R&D and that this R&D has the same impact on the bottom line (sometimes not true), then we may predict that more R&D (when it leads to a breakthrough) results in a quantum leap in profitability.
  • If a firm catches one of these quantum leaps in a lifetime, it is lucky. If it catches two, this has long-lasting consequences for the entire industry. (Data points from 2023: IBM made 8.18 billion USD net income, in the same period HPE made 2 billion, Microsoft 86 billion.) The cloud race is over, the AI race has begun, and the hyperscalers have more money to spend on it than their traditional competitors.

statista_stats.jpg

Source: STATISTA.com (data for HPE is only available from 2015, when it separated from HPQ)

I feel it in my fingers, I feel it in my toes (change is all around you…)

The following chart visualizes the process of obtaining infrastructure for an app. Say the dev team working on App 1 wants an application server with some compute power, an SQL DB, an OS and a VM underneath, plus the whole thing should be accessible by clients via the web. In a traditional org this means 5 separate ServiceNow tickets with manual handovers between them. Eg. the virtualization folks would set their ticket status to done, a human being would intercept this change and would file another ticket to the OS team to install the OS. These teams are measured on meeting their SLAs, so they would close the ticket even if the client is not able to log on to the server. (After all, identity management is a separate step, right?) Imagine a car dealer who tries to sell you an engine, a transmission, a few wheels and bodywork as separate items when you wanted a car…

the_tale_of_5_snow_tickets.jpg

In a cloud infrastructure this is a set of IaC scripts that run at once (a minimal sketch follows the list below). And here comes the problem:

  • This automation could be built by a dedicated cloud group, requiring an org change that is against the will of the existing org units. Injecting the SNOW tickets into the belly of the automation – with the same 5-day SLAs – would take the same time as the traditional setup. If you are against the cloud, all you need to do is insist on sticking to the old process.
  • You can grant the right to execute this automation to the developer teams themselves, but that means relinquishing control and shifting to creating and maintaining the automation scripts and establishing guardrails (policies) instead of the stop signs.
  • Creating and maintaining IaC code, CI/CD pipelines and policies (some people might call it DevSecOps and Site Reliability Engineering) requires new skills and can be seen as a threat by those not interested in the above changes.
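For illustration only, here is a deliberately over-simplified Python sketch of the single-run idea mentioned above. The spec format and the provision_* functions are invented stand-ins; in practice this would be Terraform, Bicep or similar, executed by a CI/CD pipeline.

```python
# Toy illustration of "one IaC run instead of five tickets": a single declarative
# spec is applied end to end, with no human handover between the layers.
# Layer names and the provision_* functions are invented for illustration.

APP1_SPEC = {
    "vm":     {"size": "4vcpu-16gb"},
    "os":     {"image": "ubuntu-22.04"},
    "db":     {"engine": "mssql", "tier": "standard"},
    "dns":    {"record": "app1.example.internal"},
    "access": {"admins": ["team-app1"]},
}

def provision_vm(cfg):     print(f"VM created: {cfg['size']}")
def provision_os(cfg):     print(f"OS image deployed: {cfg['image']}")
def provision_db(cfg):     print(f"Database ready: {cfg['engine']} ({cfg['tier']})")
def provision_dns(cfg):    print(f"DNS record registered: {cfg['record']}")
def provision_access(cfg): print(f"Admin access granted to: {', '.join(cfg['admins'])}")

# The order encodes the dependencies that the ticket chain handled manually.
PIPELINE = [
    ("vm", provision_vm),
    ("os", provision_os),
    ("db", provision_db),
    ("dns", provision_dns),
    ("access", provision_access),
]

def apply(spec):
    """Apply the whole spec in one run - the 'car', not five separate parts."""
    for layer, provision in PIPELINE:
        provision(spec[layer])

if __name__ == "__main__":
    apply(APP1_SPEC)
```

The service is only "done" when the last step (access) has run, which is exactly the end-to-end view that five independently closed tickets cannot give.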

All in all, an innocent technology change proposed by the cloud would require organizational, procedural and skillset changes in an org that does not like change.

There is an interesting observation in the State of DevOps report for 2023: the more frequently you make changes, the more likely you are to succeed. The root cause is simple: more frequent, small changes (with a working rollback) touch fewer things that can go wrong. If we turn it around: the more worried you are about changing the platform, the more time will pass between changes, accumulating more moving parts, which in turn increases the likelihood that something will indeed go wrong.

number_of_changes.jpg

A side effect is that this will make your environment less secure (I will not apply that security hotfix because it might break the application – and to be honest, sometimes it will) and will accumulate more technical debt.
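A back-of-the-envelope calculation illustrates the batching argument. Assume, purely for illustration, that every changed component has a small, independent probability of breaking something; a release that bundles many components is then far more likely to fail than any single small release:

```python
# Illustrative arithmetic only: if each changed component fails independently
# with probability p, a release touching n components succeeds with (1 - p) ** n.

p = 0.02  # assumed per-component failure probability (made-up number)

def failure_probability(components_in_release: int) -> float:
    return 1 - (1 - p) ** components_in_release

# Twelve small monthly releases of 3 components vs. one annual release of 36.
small = failure_probability(3)
big = failure_probability(36)

print(f"One small release (3 changes) fails with ~{small:.1%} probability")
print(f"One big annual release (36 changes) fails with ~{big:.1%} probability")
```

With these made-up numbers a small release fails roughly 6% of the time, while the big annual batch fails more often than not - and the small release is also far easier to roll back, because you know which of the three changes to suspect.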

There is an expression that is the tell-tale sign of a siloed organisation: "he is criss-crossing in my backyard", read: trespassing into a territory that the speaker considers his home turf. "Any time you start something new like [an innovation – eg. the cloud initiative], that cuts across many areas, there's a potential for people feeling like you're in their backyard." (Michael Britt) The problem is that most value creation processes involve multiple departments, therefore one cannot innovate without "trespassing".

I got into a conversation with the cloud transformation lead of a large commercial bank a few days ago. He made an observation that struck a chord: only a minuscule portion of the IT Operations workforce (in this bank) embraced the cloud; the rest honestly believed that everything was okay and that this cloud thingy was unnecessary, and responded accordingly. I think Amara's law is at work here: "We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run." I am biased in this case, but I believe they underestimate the impact of the cloud and miss the opportunity to increase their market value.

Squaring the circle - the way forward

  • The known knowns:
    - the business hates it when costs grow faster than revenue. READ: the days of extensive growth in IT staff are over. (if you care about the whys, check out "Red Plenty") There is one way forward: automation
    - it is likely that those willing to marry stability with speed will gain the upper hand over those who stick to their guns and obstruct change.
  • The known unknowns: 
    - technology will create as many jobs as it eliminates. (a recent study reported by the Guardian suggests that it creates more than it destroys.) What is unclear is which jobs will stay and which will transform into something new. My bet is that the mundane ones (repetitive ticket crunching) will fade, while those requiring more thinking (eg. designing the guardrails mentioned above) will grow in relevance.
    - large IT shops carry an enormous amount of legacy: applications that generate the vast majority of business value for the enterprise today. It is difficult to forecast when the above shift will happen and how long it will take.
  • The unknown unknowns: 
    - IT Operations can hold the business at gunpoint claiming that any org/process change will pose a threat to the current stability of the business, therefore any cloud adoption should happen on their terms and at a speed deemed suitable by them. The real unknown is how long IT Ops can resist the push from their own internal clients and the hyperscalers. (make no mistake: the stick will follow the carrot soon.)
    - for the record: while industry disruptors are already doing it, my prognosis that technology allows for speed while maintaining stability is not yet proven in large enterprises carrying a legacy. 

Famous last words: in 1633 Galileo Galilei had an unpleasant encounter with the Holy Inquisition, which forced him to recant his claim that the Earth moves around the Sun rather than the other way around. After leaving the courtroom he murmured "Eppur si muove" ("and yet it moves") and spent the rest of his life under house arrest.

As always, I will be glad to learn about your feedback.


The memoirs of Kilgore Trout nr. 6: the elephant and the snake

Twelve years ago I gave a presentation at the Budapest University of Economics. I used a drawing from The Little Prince (the boa constrictor that swallowed the elephant) with slight modifications to illustrate the income-over-time curve. Warning: your government wants to keep you as a net contributor to the pension system, while you may want to have a few more good years. In 2012 this seemed like a funny thing; today it looks like a problem. Considering the likelihood that your pension will cause a significant drop in your standard of living, your goal is to push the blue milestone to the right while retaining (some of) your market value.

the_elephant_and_the_snake.jpg

Acceptance of the above curve depends on your age and financial status, but the first reaction usually is that this is wrong, "torque overcomes RPM", experience rules, etc. - bottom line: the market is wrong. If you keep in mind that "the customer is always right", then you might become interested in the root causes of this devaluation and in what we can do about them. If you are under 40, stop reading; if you are over 50, you might want to read on.

your_market_value.jpg

The components of your (job) market value

  1. Your experience – which doctor will you pick for heart surgery on your kid? A newbie (who is eager to do it) or the 40+ year old with 15+ years of proven track record? The untold part of the story is that you do not want a 70-year-old dude with a trembling hand to do this operation either. OK, we are talking about IT, but keep in mind that Oppenheimer was 39 when he joined the Manhattan Project… The problem is twofold: your experience is amortized AND you are unwilling to let it go to make room for new skills and new experiences. You need to learn new things, and learning gets harder as you get older. There is a potential escape route here: move to areas where the half-life of your skills is longer, that is, away from hard-core IT towards something softer like process or project management or farming watermelons. The issue is that this area is already overpopulated with refugees, bringing the prices down. Another way is to move up in the hierarchy, but that comes with the unavoidable and undesirable jostling for positions (then hustling the pretenders…).
  2. Your network – to be precise, a few key people in that network who act as your sponsors are vital to your career. These are the people who trust you, who put a bet on you and who will speak up for you in that vital moment when a decision is made about you (or not you). Side note: this is one of those things where size does not matter, quality does. And now the bad news: like it or not, your network ages with you, which means that those who know what you are capable of might no longer be in a position to stand up for you.
  3. Your college degrees – I was a diploma collector once (3 university degrees). Then one day I asked myself when I had last used a Fourier transform, or whether I could still use my Z80 assembly coding skills. Diplomas in technology get amortized fast. The real value from those years is your capability to learn and the seeds of your network.
  4. Your language skills – whenever I meet an IT person who claims that not speaking English is okay, I lose my marbles. 90+% of the literature in information technology is in English… Bad news: the upper 25% of the new generation speak two languages before entering college. (The only area where I put heavy demands on my kids was a high-level language cert in ENG and GER by the end of high school. Okay, I also put some emphasis on math…)
  5. Your appetite for 60+ hour work weeks – being a workaholic is not a shame (been there, done that), although it will have consequences for your relationships with your loved ones. As the adage goes, the only people who will remember that you worked that much will be your kids, not your boss. For sure this appetite will calm down a bit around 60.
  6. Your ability to learn and to forget – most folks accept that the half-life of any technology-related skill is around 10 years; this means you will have to reinvent yourself at least 3 times during your active years. What many folks do not think about is that one has to "unlearn" the old ways of doing things in order to be able to absorb new things.
  7. The logical multipliers:
    • Your appetite for power – you cannot be a leader without hungering for the right to make decisions. You will not be a great leader if all you care about is power and not your people.
    • Your health – although I accept the gene lottery idea, I think there are a few basic rules you need to play by: very little alcohol, no smoking, no drugs, enough sleep, lots of physical exercise and a wonderful woman (man) by your side.

Bottom line: to a large extent the market is right about reducing the market value of people over 50-55. On the other hand, they are wrong about rejecting old folks upfront without any consideration. I recall a disaster at Liptovský Mikuláš in Slovakia when a storm literally erased an entire forest in 2004 due to one thing: all trees in that forest were the same type, planted at the same time. Old trees are a must in any forest.  (pic below is my own)

liptovsky_mikulas.jpg

The cost side of the house

Homo Economicus beware: I dropped minor items like inflation and mortgages, but I considered things like moving to a smaller home once you become an empty nester, and I inserted luxury items like a costly divorce into the mix.

the_cost_side_of_the_house.jpg

Houston, we have a problem: this curve does not look like a snake that swallowed an elephant.

What to do about this problem?

There is a gap between the income and the cost curve. If we accept the definition of happiness as minimizing the gap between one's desires and one's reality, we have three choices:

  A. lower the bar of your desires and expectations
  B. stay on the job market longer and reduce the degradation of your market value
  C. increase the portion of your income that comes from your savings

Option A is not as bad as it sounds. I have first-hand experience of moving from a 6-cylinder BMW to a 3-cylinder Mini Cooper without any mental or manhood degradation. Fancy objects (cars, watches, gadgets etc.) are not essential to your happiness; collecting an excessive amount of them even suggests that you are compensating for something.

Option C is by far the best. The only caveat is that only a minority of the working population reaches "escape velocity" – the point where they do charity work to save baby seals and rainforests (besides being angel investors, since they want even more money). OK, what about the rest?

salary_vs_return_on_investment.jpg

So here we are: the market is mostly right and becoming a follower of Siddhartha solves only a part of the problem. Here are the ingredients for preserving your livelihood over 55:

  • Drop anything superfluous from your life and use what you already have. This whole life thingy looks like a lease with an expiry date, ie. you will have to hand in all your belongings before leaving the stage.
  • Stop being concerned with everything. As Mark Manson put it: "Maturity is what happens when one learns to only give a f**k about what's truly f**kworthy." A subtler explanation is from Milan Kundera who described it as a choice about the number of mirrors you want to see yourself in. Accept yourself as is, minimize your social media activities and pick only a few people whose opinion you care about. The rest can go and fly a kite.
  • The final thing is from my all-time favorite, the mother of COBOL, Grace Hopper: "The most damaging phrase in the language is: 'it's always been done that way.'"
    DO NOT continue doing things just because that is how you did them in the past. Change in IT is inevitable, not to mention exponential. You need to adapt. It is like a winding road where you need to change speed and direction to stay on it.

long_and_winding_road.jpg

As always, I will be happy to hear your feedback and remarks. Happy riding, Folks! Laszlo

Horseshoe bend #5: Lessons learned so far

The following post is an attempt to summarize the learnings from our cloud journey in the first 18 months. You bet this is biased, but it might help others who come behind us. Those ahead of us may put their all-knowing smiles on.

the_rocky_road_to_dublin_v2.JPG

How to go faster - the first steps in the chaos

Public cloud adoption is an intertwining of grassroots experimentation, a mandate from senior management to establish an enterprise-grade cloud presence and, finally, the crash landing of the first cloud workloads without a proper foundation. The sooner you have a program established around it, the less chaotic the first months will be.

You need a cloud strategy

that answers questions like:

  • why you want the whole thing in the first place, how and when you declare that you have reached this goal and what metrics are used to prove it. (eg. cost saving may not be a strategic goal, while speed is.)
  • what your core design choices are: cloud architectural design (eg. hub & spoke vs. VWAN), accepted building blocks (cloud services), CI/CD tool set (source and artifact repo, build and deploy tools), ITSec key decisions (eg. rejecting the use of public IPs, checking ingress code from the internet, policy layers, the IaC framework and toolset such as Terraform vs. the cloud provider's native tooling like Bicep) and, most importantly, a decision-making process for reaching these choices.
  • the question of ownership: Cloud is much more than a 3rd datacenter (in fact more than any other IT infrastructure), therefore its governance should be established in the context of Business IT, DevOps, IT security and IT Operations. This is not an ITOps internal affair.
  • The willingness to change everything: I could not find the source of this quote but I think it is true: "When digital transformation is done right, it's like a caterpillar turning into a butterfly, but when done wrong, all you have is a really fast caterpillar." You have to change the processes and the org structure if you want to harvest the advantages of the cloud. Without these changes the result will be just as slow as its on prem counterpart.
  • The right level of ITSec control – if too loose, you will be hacked; if too tight, nobody will use your stuff and shadow IT orgs will sprout up everywhere. You need to decide on a few core items:
    • single CSP, or multi cloud, distributed cloud yes/no, cloud native tools vs 3rd party for monitoring, managing, protecting it.
    • how far you are able (and willing) to go with automation, mostly with Infrastructure as Code (IaC). The dilemma is where to stop. The Pareto principle should give us guidance, but it misses one key point: any manual intervention will defeat the purpose of the entire automation. This quote is from 1935, but it is as relevant as ever: "It is difficult to get a man to understand something, when his salary depends on his not understanding it." /Upton Sinclair/
    • what your cloud operating model is: the conservative approach is when the dev teams file a SNOW ticket for everything in the cloud just like on prem, the avant-garde approach is when you give them freedom to implement their preferred PaaS component with their own IaC code and to go YBIYRI (you build it, you run it) for components that are not yet supported by central IT Ops.

Establishing the Cloud CoE

  • A program or an org unit: management needs to decide whether you are a project or an org unit. All peer connects (interviews with other enterprises who embarked on this journey earlier) show that introducing the public cloud at enterprise scale is a 5+ year program with likely evergreen residuals. Treating it as a project has implications, eg. 90+% of the team will leave at the end of the program, taking all the learnings with them.
  • Staffing:
    • #1: quick learners with a solid technology background are in high demand. Staffing the program with scraps of mediocre performers' time will defeat the purpose of the whole thing.
    • #2: the imbalance between supply and demand will crank up the prices to a point that can jeopardize the financial viability of the program.
    • #3: be prepared to lose your best cloud engineers to jobs abroad. Our regretted attrition is way above the internal FTE attrition. Replacement takes circa 3+ months and ramp-up requires another 3 months, ie. you are down a top engineer for 6+ months.
    • #4: we underestimated, and therefore understaffed, the process, governance and compliance tasks. Cloud is not only an engineering task but also heavy lifting on process and compliance, not to mention a major change management undertaking. The non-engineering activities are 30+% of the job. (the process folks claim this is 50+%...)

Key decisions to make

  • what the public cloud actually is – a 3rd data center or something completely different? The CCoE was convinced that it is different, while ITOps insisted that it was just another DC and therefore should behave like one: same technologies, same processes, and nothing else.
  • how far you want to go with self-service. One approach is to allow You Build It - You Run It where ITOps is not ready to operate the new technology. The advantage is that it lets the dev teams go faster, but it requires building operations skills and capacity on their side. Another approach is to channel every cloud request into the existing processes and handle them as if they were on prem requests.
  • Some dev teams will want to tinker with PaaS components, while others will want to concentrate on business logic and application-level tasks. In the latter case, centrally provided cloud services will be required for those who do not want to deal with PaaS component operations. You need to define the boundaries between YBIYRI and these central cloud services (roles and responsibilities) AND you need to establish this managed service layer. (this is mostly not a technical undertaking.)
  • Drinking from the firehose – the balance between an R&D workshop and a factory, ie. the number of PaaS services you let in vs. the available offerings (let alone the Marketplace). Do not go beyond 10-15% of the total service offering, otherwise you will be crushed by the quantity.

The forces that will slow you down

There are two forces at play here: ITSec and ITOps. (Compliance is waiting for you around the corner.)

itsec_and_itops_1.JPG

  • The on prem ITOps mindset will dictate that anything in the cloud should function just as if it were on prem. They will demand the same technologies and processes, the same IaaS approach to everything. Their – legitimate – reasoning is that 95+% of the workloads are on prem today, therefore anything you create should look like the current stuff, since that is easier to operate. The untold driver is fear, which you need to address upfront: nobody will lose their job, but many will likely have a different job (with a different skillset) within 4-5 years. All of us need to learn and unlearn.
  • ITSec requirements dictate technical solutions that take much longer in a bank than in a small (non-financial) account. It is like running a marathon in a heavy diving suit while everyone else runs in shorts… An example: in a public cloud, cross-regional DR capabilities come out of the box – unless you implement private endpoints, at which point you lose most of this functionality.

running_the_marathon_in_a_heavy_diver_suit.JPG

  • The nose of the ship cannot travel faster than the back of the ship, ie. it does not really help to produce designs and technical solutions that other parts of the IT org cannot implement, let alone comprehend. This is a lesson we learned the hard way: you need to move the entire ship. Training, constant communication, demos and regular small updates help the transition.

Dependencies

architect_in_the_spider_web.JPG

You will find (at least) the following dependencies:

  • Identity and Access Management – the identity management process and technology, eg. your IAM system does not work with cloud native identities and/or it is being replaced and therefore does not accept any changes.
  • Ticketing system – your team gravitates toward JIRA (as most SW dev projects do) while ITOps will demand ServiceNow. Shoveling data manually from SNOW to JIRA is a pain in the neck, but you want to track the hours in a single system.
  • Click-Ops – your IaC code will bump into manual steps in the process, eg. a FW port opening might take a week while your code runs for 45 minutes.

Technical issues

  • If you implement IaC you need to pay attention to the smooth coexistence of the IaC code and the policies on top of it. It is a daunting task to debug code where both layers are in constant motion.
  • On prem proxy servers and multiple firewalls, plus an on prem DNS vs. your cloud-internal routing design, will give you a bunch of networking and name resolution issues where you do not have access to the monitoring logs of any of the on prem components. It requires smooth collaboration with the network people to resolve even simple issues like a wrong conditional access setting – a small diagnostic like the sketch below helps.
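For the name resolution part, a small diagnostic that compares what the on prem DNS and the cloud-side resolver return for the same name can save hours of finger-pointing. A hedged sketch (it needs the dnspython package; the server addresses and the hostname are placeholders, not our real setup):

```python
# Compare what two resolvers return for the same name - handy when on prem DNS,
# conditional forwarders and cloud private DNS zones disagree.
# Requires the 'dnspython' package; addresses and hostname below are placeholders.
import dns.resolver

RESOLVERS = {
    "on-prem DNS": "10.10.0.53",      # placeholder on prem DNS server
    "cloud resolver": "10.200.0.10",  # placeholder cloud-side DNS/private resolver
}
NAME = "app1.example.internal"        # placeholder record to check

def lookup(name, server):
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server]
    resolver.lifetime = 3  # seconds before giving up
    try:
        return sorted(str(rdata) for rdata in resolver.resolve(name, "A"))
    except Exception as exc:  # NXDOMAIN, timeout, refused, ...
        return [f"lookup failed: {exc.__class__.__name__}"]

if __name__ == "__main__":
    answers = {label: lookup(NAME, ip) for label, ip in RESOLVERS.items()}
    for label, result in answers.items():
        print(f"{label:>15}: {result}")
    if len({tuple(v) for v in answers.values()}) > 1:
        print("MISMATCH - the two sides see different records for the same name.")
```

If the two answers differ, you at least know on which side of the fence to start looking.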

The exit strategy

There are 3 caveats with a cloud exit:

  • when you mix up a disaster recovery and an exit scenario. The difference is the allowed RTO: the first is measured in hours, the latter in years. It takes the same effort to walk away from a cloud as it takes to walk into it.
  • when you allow only technologies that have an on prem equivalent. This way you do preserve your exit, but you throw away any innovation produced by the cloud provider. The deeper you go into the PaaS/SaaS forest, the less likely it is that you will ever come out.
  • when the seller's state, eg. the USA, says NO. In this case a cloud-to-cloud exit becomes unattainable (MSFT, Amazon and Google would leave the local market on the same day).

A reasonable exit strategy should be formulated that is acceptable to the local regulator. Regulatory, compliance and engineering task forces should collaborate, led by an experienced leader (ideally someone who has worked as an auditor before). Think twice before you execute this exit: it will ruin the ROI of the whole thing.

The square peg in a round hole – the lack of public IP 

If we had to name the one item that caused us the most headache, it is easily the fact that the public cloud is designed with the internet in mind, that is, all services can be accessed directly from the internet. In an enterprise environment this is not the case: you have to go private.

The nonfunctional requirements

  • All of these requirements have been known for decades, but they work differently in the cloud, especially for PaaS and SaaS. Think about monitoring, logging, alerting and backup early and make reasonable compromises with their on prem counterparts.
  • Cloud monitoring, alerting and logging should be incorporated into the company-level monitoring, alerting and logging. This is inevitable because the cloud-based systems will not operate standalone but integrated with on prem (and later maybe other cloud) systems. In case of a problem an end-to-end view is needed, and that is possible only with integration between the various monitoring systems.
  • Backup: you need a clear view of what you need to "bring home", ie. back to on prem, and what is okay to store in the cloud. At the end of the day it boils down to the level of trust in your cloud provider and the demands of the regulator. Be aware that some of the backups provided by the provider are not compatible with anything else, ie. you cannot migrate them to any on prem equivalent. (eg. Key Vault)
  • The big shift is when the Application Operations teams will claim a bigger slice of the traditional monitoring and alerting pie, using their own – mostly cloud native – tooling that will overlap in functionality with the tools used by IT Ops.

The non-technical side of the house

We shuffled all non-technical topics into a single team: Process – Governance – Compliance – Cost. In retrospect we underestimated the amount of work and the difficulties related to these topics. (engineering myopia) In fact there is a significant difference between “it works from an engineering aspect” and “it is a service one can provide with a predefined SLA”.

the_real_x_wing_fighter_and_how_it_looks.JPG

  • ITSM processes: IT Service Management processes assume that everything is done by ITOps and the client just files a service request. ITOps is right to claim that an incident is a pain regardless of where it happens, therefore you need a proper incident (and change) management process. If you are an ITIL shop, you will find out that a big chunk of the areas covered by ITIL3 are simply not applicable to the cloud. (hence the introduction of ITIL4 several years ago.)
  • The cost thingy: it is very easy to leave the lights on (the on prem "flat fee – we already paid for it" reflexes kick in), but it will cost you dearly. It is one thing to spin up resources automatically; tearing them down seems like just a small change in the code (create vs. destroy), but somehow it just does not happen without forcing it (a toy sketch of such forcing follows this list). It is not by accident that FinOps became a discipline in its own right in the last couple of years.
  • The service catalog: in case of a cloud request the client may ask for a subscription and then for a predefined set of PaaS components in it, or just for the subscription, doing the rest themselves. Ie. you need to clarify what the service catalog should contain.
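On the cost bullet above, the "forcing it" part usually ends up as a scheduled job that switches the lights off for anything non-production past its time to live. A toy, provider-agnostic Python sketch (the inventory and the teardown() call are stand-ins for whatever your cloud SDK or IaC pipeline provides):

```python
# Toy FinOps "lights off" job: anything tagged as non-production and past its
# time-to-live gets torn down. The resource records and teardown() are stand-ins
# for real SDK calls - the point is that the teardown is scheduled, not optional.
from datetime import datetime, timedelta, timezone

NOW = datetime.now(timezone.utc)

RESOURCES = [  # placeholder inventory, e.g. from a tagging/inventory API
    {"name": "dev-vm-42",  "env": "dev",  "created": NOW - timedelta(days=9)},
    {"name": "prod-db-01", "env": "prod", "created": NOW - timedelta(days=400)},
    {"name": "test-vm-7",  "env": "test", "created": NOW - timedelta(days=2)},
]

MAX_AGE = {"dev": timedelta(days=7), "test": timedelta(days=14)}  # prod is never touched

def teardown(resource):
    print(f"tearing down {resource['name']} (env={resource['env']})")

def lights_off(resources):
    for res in resources:
        limit = MAX_AGE.get(res["env"])
        if limit is not None and NOW - res["created"] > limit:
            teardown(res)

if __name__ == "__main__":
    lights_off(RESOURCES)   # run this from a scheduler, e.g. nightly
```

Run from a nightly scheduler, this is the create-vs-destroy symmetry that otherwise never happens by itself.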

What comes next

at_the_beginning_of_the_journey.JPG

I want to thank the entire team who have walked along with me over the last 18+ months. We are not finished by any measure, and with the quickening speed of change we may not even know what "done" really looks like. What is beyond doubt is that the big players have turned their attention to artificial intelligence. It is a safe bet to forecast that AI will infiltrate all aspects of the cloud within a few years and will become the new battleground.

To finish with some fun: I used Midjourney to illustrate this post. The last prompt I used was this: "the magician pulling the rabbit out of the hat but the audience is not happy, cartoon by David Horsey, --ar 3:2". Is it possible that AI has already gone rogue?

ai_went_rouge.JPG

As always, I appreciate any comment or feedback.

 

 

 

 

Horseshoe bend #4 – Mount Rushmore (from the Canadian side)

mount_rushmore_the_backside.JPG

In regulated industries you are required to produce an exit plan before you are supposed to make your entrée into the public cloud. On prem stalwarts cite this requirement on a regular basis, demanding a plan as detailed as the inroad itself. For a while I figured this was just an excuse from the luddites to slow down progress, so it puzzled me when I heard it from people whose opinion I do care about. The bug buzzed in my ear for months: what if they are right and this road indeed leads to trouble? What if Mount Rushmore is not so pretty when viewed from the other side? To settle this I typed "vendor lock-in cloud computing" into Google and Bing to learn. Most answers were sponsored either by cloud vendors or by firms like Cloudflare or Red Hat (Cast AI, VMware, Wasabi etc.) whose real objective was to convince you that you can avoid this trouble with their assistance (that is, jumping into their trap instead of Amazon's or Microsoft's). Some were thoughtless, like the one from an HDD manufacturer arguing that cloud lock-in would lead to a lack of scalability (really?), some were lazy enough to copy entire sections (even the drawings) from each other. Okay, this is useless, so let's dig deeper. The rest of this article is the result of this digging and the outcome of consulting with Lydia Leong from Gartner, peppered with my fondness for computer history. Spoiler alert: when was the last time you listened to music on a CD player? Or to phrase it differently: do you have an exit strategy for your Spotify (Netflix etc.) subscription, that is, do you purchase an on prem copy of each song or movie you like? If you don't, then read on!

A few definitions:

Disaster Recovery Plan ≠ Exit Strategy ≠ Exit plan ≠ Testing the Exit plan

  • A Disaster Recovery (DR) plan is part of the Business Continuity Plan (BCP). It has nothing to do with an exit. When somebody asks you to execute a cloud exit in days, that is a DR situation, not an exit. For this reason I omitted situations where the Cloud Service Provider (CSP) becomes insolvent overnight and is forced to shut down its entire service. I also left out cases like a nuclear bomb wiping out all DCs in multiple regions (not just availability zones) of a cloud provider. In that case we have an existential problem way beyond a service disruption. (and yes, Putin is moving these deadly toys into Belarus as we speak…)
  • An Exit Strategy defines the triggers when your Firm will want to or will have to get out of a Cloud agreement. Players in this decision are the Business owners, the IT leadership, Procurement, Legal and the IT architects.
  • An Exit plan is the series of steps – and the players with their specific roles and responsibilities – that are triggered by the events defined in the Exit Strategy. It covers technology and business process related changes; thus, it is not an IT-only problem at all.
  • Two types of cloud exit: moving an application elsewhere or leaving the platform altogether are two different games. Depending on the players involved in the conflict triggering the exit you might face any of these.
  • Testing the Exit plan: walking the talk and moving a workload from the original cloud location to A: another cloud provider or B: back to on prem.

Concentration risk is the risk associated with dependence on a single supplier for multiple business capabilities. This applies to on prem IT environments as well. Imagine that you have to move away overnight from your RDBMS provider when you have a few thousand DBs and a few hundred thousand lines of PL/SQL code holding the bulk of the business logic of your core applications. The same goes for the runtimes and the language itself from the same provider. You bet; you are on the hook. Some smart consultant coined a derivative called cloud concentration risk. This is the risk associated with dependence on a particular cloud provider for multiple business capabilities, such that a single failure can result in a disruption to multiple aspects of the business. Its on prem sibling is a major outage in your primary data center.

The triggers: Who can say no?

 There are five possible actors in any cloud exit: the service provider and the consumer, the buyer’s regulator and two nation states (the vendor’s and the consumer’s).

the_payers_in_an_exit.JPG

  A. Buyer-seller conflicts: this is in scope for this post.
  B. Buyer in conflict with the seller's state – it is a weird idea for any firm (at least in my home country) to get into a fight with the US government, so I risk skipping this.
  C. Seller in conflict with the buyer's state – not impossible (eg. the East India Company vs. China, but that one too ended up as type D).
  D. Conflict between two states – the USA banned the sale of key IT technologies (on prem as well) to Russia after its attack on Ukraine. FTR: it was not allowed to transfer any personal data outside of the Russian Federation anyway, therefore US cloud providers were a no-go even before the war.
  E. The regulator – whoever claims that the (HUN) regulator said no to the public cloud, please show me the actual paragraph in their guidance to prove it.

The types of conflicts between the seller and the buyer (type A):

When the seller says no:

  • A serious violation of the contract terms by the buyer (eg. you posted adult content on your website). In case of an enterprise client this is unusual and would probably trigger a "remove it immediately or…" reminder rather than a hasty service suspension.
  • When you do not pay the bill. This is where the old adage applies: if you owe the bank 50 thousand dollars, that is your problem; if you owe them 5 million dollars, that is the bank's problem. The bigger your consumption, the more likely the vendor will negotiate, although this is not a life insurance.
  • When the seller is told by its state to say no – ie. this is type D. If you plan to substitute AWS with Azure (or the other way around), keep in mind that they are from the same country, ie. subject to any type D issue simultaneously.

When the buyer says no:

  • When the service quality is unacceptable – regular service outages, degradation of service.
  • When the price goes up at renewal without any benefits compensating for it. The usual way of carrying this out is removing an existing discount. This is playing hardball. Not cloud specific – see when the tax collectors of an RDBMS provider show up on December 21st for a little audit.
  • If the cloud provider enters your market as a competitor. (Apple Pay BNPL, anyone?)
  • When you decide to rationalize your cloud footprint, since you realized that 3 providers are probably too many.
  • When the innovation dries up. (for folks in photography, this is when the Hasselblad 501CM became available in ruby red) I think this is by far the most dangerous thing that can happen in a cloud relationship, since it breaks the balance between the price and what you get for it.

 

A word on innovation and its relation to vendor lock-in

 

Repeat after me: innovation comes from differentiation. Maximizing the value of cloud adoption requires exploiting the provider's capabilities, thus increasing lock-in. The flip side: the greater your need for portability, the more you are likely to sacrifice some of the benefits of cloud services – and the greater the complexity and cost. The deeper you walk into the cloud forest, the more likely you will stay there for a long time.

the_price_of_moving_away_from_the_cloud.JPG

I met an IT executive who thought that the cloud was nothing more than a 3rd data center owned by someone else. For this reason he demanded complete symmetry, that is, using components in the cloud only if they had an on prem counterpart (read: IaaS). To be fair, he was right from an exit viewpoint, but he ignored the efforts of all major cloud providers in the last 5+ years, that is, PaaS. This is where most of their R&D spend went, probably besides IT security. Bottom line: the more value you take out of the cloud, the more difficult it becomes to exit from it. In case of SaaS an exit is simply a redo exercise: same cost, same time.

To illustrate the innovation story let me use an old example, the 360 series mainframes from IBM. This was the first modular, general-purpose, upgradeable series of mainframes with the same OS for all models – that is, running the same application without modifications. It introduced micro-coded CPUs, the 8-bit byte (today it sounds funny, but there was financial pressure to use 6-bit bytes, since memory was expensive), the EBCDIC character set, a new floating point architecture, a nine-track magnetic tape drive and backward SW compatibility with older IBM products – all in all a tremendous amount of innovation. It cost half of the development of the atomic bomb and the development time ran way over the original plans, but within 15 years it drove the seven dwarfs out of the computer business (7 dwarfs = Burroughs, Sperry Rand, Control Data, Honeywell, General Electric, RCA and NCR). Was it a true vendor lock-in? You bet it was: it was compatible only with itself, but it was the best of its time, so much so that it was the origin of the saying "Nobody ever gets fired for buying IBM". And guess what, this was the seed of the antitrust lawsuit that almost chopped IBM into pieces. If you are into computer history, check out the book written by Fred Brooks (the PM of the development, working in tandem with Gene Amdahl, the lead architect) titled The Mythical Man-Month.

A word on R&D budgets: if you check out the annual reports of the hyperscale providers and their traditional on prem counterparts, you will find telling numbers. In a nutshell: there is an ongoing shift of profits from the incumbents to the largest cloud players. (eg. Amazon is now the largest database vendor, surpassing Oracle.) Their net earnings are manyfold compared to the traditional HW and on prem SW providers like HP or even IBM. If we assume that each R&D dollar has a similar financial impact at all major players, it is fair to say that the hyperscale providers are on a growth trajectory (because their cloud R&D is larger and is funded by their cloud business, not by a separate cash cow) while their on prem counterparts will face tough times within 5-6 years. This is why IBM paid 34 billion USD for Red Hat, a move triggered by the realization that they had lost the cloud war. The real point is that the war is no longer in the cloud arena – that one is over; the battle has moved to AI territory, with even bigger stakes.

Busting myths

There are no solutions that eliminate lock-in. Vendors just want you to become locked into their solution instead of someone else's. Think about it: if Vendor A's service is 100% compatible with Vendor B's service, then the ONLY differentiating factor will be the price. This would lead to a price war to the bottom that would force both vendors to cut back their R&D budgets. In the end they (and you) would end up with commodities where the only differentiator is the price, read: ZERO innovation. There are competing forces at work here: the appetite for innovation on the buyer's side intertwined with the need for differentiation on the vendor's side, plus the demand for the freedom to escape those providers whose innovation stream has dried up. Since I used a mainframe example for groundbreaking innovation, I have to mention other mainframe providers whose only excuse to exist is that one's primary application runs on their iron and it is very, very expensive to move away – and they know it. On the other hand, you do have a choice about which vendor's lock-in you want to avoid and which one you prefer in order to avoid the other.

A cloud exit plan does not provide any reduction in your availability risk. The period when the cloud service is unavailable is way shorter than the time you would need to execute any exit plan. You need to address availability in your DR plans WITHIN the given cloud itself. (nope, cloud-to-cloud exit is not a panacea for resiliency, see below.)

Multi-cloud is not a solution for cloud resiliency, since it is difficult and expensive to implement. I had a chat with a senior IT executive a few weeks ago. When we got to this issue, he figured he would ask his teams to build software targeted at the public cloud either to be portable OR to develop two versions of the same SW at the same time for the two hyperscale providers. I think both of these ideas are impractical: if you build software that uses only the common subset of functionality, you throw away the bulk of the innovation coming from either provider. If you build for both at the same time, you ruin the business case and the time expectations of the business, ie. I would rather not even start this endeavor.

One more word on multi-cloud: this will eventually happen to most large enterprises, either by choice or by accident, when the business picks a software vendor who happens to use the other CSP. This will put an additional training burden on the internal IT departments of large enterprises, let alone crank up the price tags of those folks literate in both technologies. (I always talk about two hyperscale providers instead of three; no intention to disregard GCP, it is just simpler to express myself this way.)

If your exit is triggered by a change either from the seller or the buyer’s regulator, this will rule out any cloud-to-cloud exit, because a regulatory change (for the record a state decree) will render all of your target exit providers unviable. (eg. Russia, unless you consider Alibaba…)

Your ability to execute an exit from your cloud provider does not improve your negotiating position, since cloud exits are complicated and costly, and the CSP knows that the cost of a cloud switch will exceed any price advantage gained through the switch. To be fair, this is no longer a money printing machine like it used to be in the on prem, perpetual-license days. This is a service with the actual cost of building and running astonishingly large data centers all over the world, let alone their electricity and communication costs. Do not dream about 50% discounts. If you check out the annual reports of the key cloud providers, their profitability is in the range of 30-35%. If you consider their buying power and operational efficiency, chances are that 1 kilogram of CPU from them costs less than 1 kilogram of CPU in your DC. (Leaving the lights on when not needed is a different problem, but that is FinOps, a subject for another post.)

Containers do not eliminate cloud lock-in: Theory (and Kubernetes providers) say that putting applications in containers will solve the cloud lock-in problem with no drawbacks. Tag line: “Once an application is in a container, it is easy and cheap to move it between cloud providers, or between cloud and on-premises environments.” On the one hand containers and microservices became the hallmarks of cloud native development, and they do ease some aspects of portability. On the other hand, they do not address most of the underlying causes of lock-in. Container management platforms are one out of the hundreds of PaaS services available from any of the top cloud service providers. Replacing this with a 3rd party component will have no effect on the dozens of PaaS components also required to run a modern application.

Regulators DO NOT want the whole exit plan executed before you go to the cloud with your app. They will be satisfied with plans that can be executed over a reasonable period of time (such as two years), without requiring that you demonstrate your ability to actually do an exit. The effort required to test an exit scenario is comparable to the effort of moving to the cloud itself. Unless the regulator wants to ruin the whole business case for moving to the cloud, they will not demand it. The good news is that they have heard of FinTech and BigTech and know that if they overdo their "no cloud please" thingy, they hurt the entire industry rather than protect it.

Your options

  • Minimize lock-in as much as possible: Cloud IaaS providers are treated like infrastructure resource commodities, and higher-level functionality is avoided wherever possible. This requires a very high level of skills in the IT team and significant engineering effort, time and risk since you assemble your car from thousands of tiny parts coming from several manufacturers. Not recommended, since you lose the innovation and the developer efficiency gains brought by the PaaS components. You throw the baby out with the bath water.
  • Use overlays to minimize cloud IaaS provider lock-in: You can try to minimize lock-in to the cloud IaaS provider, by overlaying the provider’s resources with third-party solutions that are portable across multiple environments. This results in a high degree of lock-in to the overlay solutions and vendors, as well as the ecosystem around those solutions. The cloud IaaS providers may be treated like infrastructure resource commodities, thus losing the innovation brought by the cloud provider.
  • Be loyal to a single ecosystem: you choose one vendor’s ecosystem to base your strategy on it, accepting the notion that you will have long-term dependency upon that vendor. Innovation, ease of integration and speed of delivery are the highest priorities. You accept that you will become highly dependent on this cloud provider over the long run, and must invest in building a strong, trusted relationship with that vendor. Resiliency is handled within the provider’s ecosystem, using cloud native tools.
  • Be loyal to more ecosystems: you build capabilities on two or more providers, not for resilience purposes but to maintain the balance when negotiating with mega players. You manage cloud concentration risk primarily through a multi-cloud workload placement strategy, rather than through a cloud exit strategy. The two clouds you bet on are likely to be two out of the three hyperscale players.

The final word: you do need to be prepared to exit your cloud provider, but not for the reasons usually quoted by most articles on the web. The real dilemma is to pick the right provider and to maintain the relationship as long as it provides a competitive advantage to your firm. A cloud exit is a complicated and very long journey. Planning an exit in advance will help you shorten the time to a successful execution, thus jumping from a limping horse to a better one in time. To paraphrase Oliver Cromwell: "Trust in your cloud provider but keep your powder dry!"

 As always I will appreciate any feedback on this post.

 


Horseshoe bend #3 – Midway

midway.JPG

The Battle of Midway is symbolic for many reasons: it showed the importance of information security (the key to the US Navy's success was that they had decrypted the Japanese communications and knew Yamamoto's plans), and it marked the end of the era when battleships reigned and the beginning of the supremacy of aircraft carriers. (Not to mention that it was to Japan what Trafalgar was to France.) I realize that the analogy is a bit far-fetched, nevertheless I built this post around it: while IT security is more relevant than ever for any enterprise, the old way of thinking about it will no longer reach the goal. No, I am not talking about quantum computing and its threat of breaking current cryptography in minutes, I am talking about the cloud. ITSec has to change.

Let me nail it down: I do realize how important information security is; history provides ample proof points. As of today, cyber warfare is on equal terms with any other military branch. (Think of Stuxnet.) On the other hand, a recent study by McKinsey found that the average lifespan of companies listed in the Standard & Poor's 500 was 61 years in 1958. Today, it is less than 18 years. If you recall the fate of Blockbuster, Borders Books, Nokia or Kodak, you see the Innovator's Dilemma in action. If you stop innovating, you will wither (sometimes very fast); if you are careless, you will suffer significant material losses (pretty soon).

What we have known for a long time

  • "Navigare necesse est, vivere non est necesse." ("To sail is necessary; to live is not.") Going online (which means mobile) is a must; tweaking your business processes for delivery speed is non-negotiable. Gen Z measures a response in seconds, a whole transaction in minutes, and wants it all anytime, anywhere.

gen_z.jpg

  • The ITSec playing field is not level: a threat actor can do way more damage with 1M USD than the good guys can fend off with the same amount of money.
  • The imbalance between demand and supply for skilled ITSec professionals is cranking up prices to the upper five-digit range (in EUR) in countries where this used to be a mid-management package. Despite the skyrocketing compensation, there is still unmet demand.
  • Hacking is a lucrative profession and a weapon in the arsenal of nation states. The number of data breaches has grown in sync with the number of users and the amount of data generated and exposed to the online world. Ugly: yes. Surprising: no.
  • The biggest concern in any ITSec protection scheme is the human factor combined with organizational inertia, from careless users and unnoticed human config errors to orgs working in silos, not giving a damn about each other's motives and agendas. (Read the case of the London Underground fire at King's Cross and you will know what I mean.)

In summary: as a consequence of the above, more and more firms move a significant part of their business online while not being prepared, exposing their cyber security weaknesses to the outside world.

Something happened - what we learned lately

Let me enumerate the changes that have happened in the last 5-8 years in the ITSec arena.

the_moat.JPG

  • The business demands collaboration with entities outside of the main org, thus a significant portion of the value creation process happens OUTSIDE of the castle that you are trying to protect. The “castle and moat” paradigm even when executed with the outmost rigor is not enough. If we add the growing segment of SaaS based functional delivery this statement becomes more relevant.
  • The public cloud grew indispensable, sucking the bulk of investment dollars from the on prem world, thus becoming a self-fulfilling prophecy. Three groups formed: the hyperscalers, the multi-cloud vendors (riding on these hyperscalers) and the incumbent traditional players.
  • Since hardware is becoming a commodity, there is a power shift towards developers. Yes, they are sometimes closer to a prima donna than to a soldier, demanding weird perks. Live with it. For the record: the price difference between a MacBook Pro and a good Wintel notebook is around two days' compensation for these folks, so be it.
  • A DDoS attack with a botnet made from smart fridges is a novelty, though a pretty sad one. (See my comment on the lack of ITSec expertise, this time at the fridge makers.)
  • The shared responsibility model introduced by the cloud blurs the boundaries and sometimes makes you feel as if it were somebody else's (ie. the cloud provider's) problem.
  • The vast majority of recent and future successful cyber security incidents were and will be enabled by human configuration errors. Throwing more human effort at the problem will only generate more errors. Doing it slowly will not make it more secure either.

The need for speed

  • The ability to respond quickly to events in the business environment has become the no. 1 priority for business leadership, regardless of the industry. (COVID, the Russian invasion of Ukraine and double-digit inflation all came overnight.)
  • There is a widening gap in agility between the cloud- and devops-enabled development units and their ITSec (and IT Ops) counterparts. IT is getting good at producing new code fast, but it is not yet prepared to protect this new code well.
  • You measure the life span of a physical machine in years, a VM in months and a container in minutes. With Kubernetes coming of age with the support of the major cloud players, the traditional ways of creating, managing, monitoring and protecting these compute instances become more and more inadequate.
  • Former U.S. Deputy Secretary of Defense William Lynn argues that “cyber-warfare is like maneuver warfare, in that speed and agility matter most”. This guy probably knows a thing or two about cyber security, since he wrote the Pentagon's cyber strategy in 2010.

What is next

The last part of this post is a list of proposed actions. For the record: being a cloud CoE lead, I am biased, and it is part of my job to be biased. A “conservative revolutionary” is an oxymoron, right?

Accept the paradigm shift

  • A paradigm shift needs to be answered by another paradigm shift: insisting on total manual pre-control and ignoring the importance of speed will put ITSec at odds with the developer communities and eventually with the business. Explain, teach, go beyond saying NO and show how it can be done securely. Sit and breathe with the coders, literally.
  • “Widening the moat”, ie. making it more cumbersome to access data (in the cloud) from within the castle, will not protect the firm. Just as leased lines between company locations became obsolete (my 5G phone runs circles around a 4 Mbit leased line), the moat will soon become obsolete for most volatile apps, or it will move to where the assets to be protected are, ie. to the cloud. It is no accident that MSFT became a significant contender in the unified endpoint management and SIEM (Security Information and Event Management) arena: they had to, in order to make Azure (their new cash cow) prevail.
  • Protecting the identity of users, machines and applications will be (is) the core of the new era. I risk the forecast that biometrics will prevail as the primary means of (human) authentication, despite the current legislative hesitation.
  • Turn your teams into developers themselves, who author and run the configuration-monitoring scripts (Ansible, Terraform, shell, it does not matter) that verify the hardening and patching state of all assets. Realize that these scripts behave like real code: you will store them in a source repo and create new releases of them instead of just replacing a parameter in a shell script on your c:\ drive. (A minimal sketch appears after this list.)
  • Be prepared for increasing pressure from cloud vendors: they will exploit the widening functionality gap between their cloud-based and on-prem offerings, produce licensing arrangements that make their cloud-based services more compelling (eg. Microsoft's Hybrid Benefit, where you double your existing on prem Windows Server license amount IF you use their cloud-based KMS service) and eventually discontinue their on prem product ranges altogether, just as Atlassian has already announced.
  • Change your mindset: thinking in static, dedicated source and destination IPv4 addresses is the past. A cloud provider will not guarantee that the IP address range of a VM scale set or a Kubernetes cluster will be the same two weeks from now as it is today. Think in FQDNs instead of static IP addresses and use the DNS service of the cloud provider.
  • Insist on discipline where it matters: protecting the endpoints, primarily the mobile devices. Discipline applies to senior management as well.
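
To make the “scripts are real code” point concrete, here is a minimal sketch of such a configuration-monitoring check in Python. The baseline values and the sshd_config path are my own illustrative assumptions rather than anything prescribed above; the point is that the desired state lives in a source repo next to the script and the check can fail a pipeline on drift.

```python
#!/usr/bin/env python3
"""Minimal hardening-drift check, meant to live in a repo and run from CI.
The baseline settings and the sshd_config path are illustrative assumptions."""
import sys
from pathlib import Path

# Desired state, versioned together with the script instead of living in someone's head.
BASELINE = {
    "PasswordAuthentication": "no",
    "PermitRootLogin": "no",
    "X11Forwarding": "no",
}

def actual_settings(path: Path) -> dict:
    """Parse 'key value' lines of an sshd_config-style file, ignoring comments."""
    settings = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        settings[key] = value.strip()
    return settings

def main() -> int:
    config = Path(sys.argv[1] if len(sys.argv) > 1 else "/etc/ssh/sshd_config")
    actual = actual_settings(config)
    drift = {k: (v, actual.get(k, "<missing>")) for k, v in BASELINE.items()
             if actual.get(k) != v}
    for key, (want, got) in drift.items():
        print(f"DRIFT {key}: expected {want}, found {got}")
    return 1 if drift else 0   # non-zero exit lets a pipeline fail the run

if __name__ == "__main__":
    sys.exit(main())
```

Run it from CI or a scheduler against every asset; the non-zero exit code is what turns the script from documentation into a guardrail.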

Focus on your people

  • Many companies have the cash to buy the best-of-breed ITSec offerings on the market but lack the skills and capacity to get the most out of them. Reverse this trend: hire the best possible people and explain to HR that compensation tensions are less painful than losing the trust of your clients.
  • Financial realities will force traditional ISVs to port their core offerings to the cloud, and their limited resources will dictate that they place their bets on these cloud-based versions, thus slowly but surely abandoning their on prem versions. The tendency will reinforce itself with every product iteration. The gap will widen. Beef up your cloud related skills and capacity.

Learn to code and automate everything

  • If you measure the latency of a response in months due to capacity shortage, and then you manually execute a process based upon outdated config information, you will miss the target. The more manual steps you put into a process, the more error prone it becomes, introducing “flavors” into the execution. When you add flavors to the process, your quality assurance becomes a lottery. Automate every step in your process, including auditing your own work.
  • Defense in depth: while the “castle and moat” approach is outdated, maintaining various layers of defense is very much alive. The goal is to protect every asset in the org with a vigor and an investment proportional to the value of the asset being protected. Eg. do not protect information that is already on LinkedIn, but create a dedicated subnet for your really important stuff, with well monitored control points into these subnets.
  • Patching a vulnerability a year after it was discovered is an autopsy. Real-time monitoring and detecting and reacting to anomalies in near real time will be crucial. Voluntary “confession” of ITSec considerations in an Excel sheet is as useful as resuscitating a corpse (except for audit purposes). You need to automate the discovery and eventually the whole response. (A minimal detection sketch appears after this list.)
  • Go beyond the static (one-time) snapshot mentality, where the name of the game is making any change difficult; accept the new rules and become able to detect these changes and respond to them very quickly.
  • Focus on AI: the role of AI will become prevalent in ITSec on both the attack and the protection side. Bluntly put, algorithms will fight algorithms within ten years. (I risk the estimate that this is already the case on the attacker side.)
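
To make the near-real-time detection point concrete, here is a minimal sketch that scans a stream of resource-change events and flags guardrail violations as they arrive. The event format and the two example rules (public storage, unencrypted disk) are hypothetical assumptions of mine; a real pipeline would consume the cloud provider's activity log and open a ticket or trigger auto-remediation instead of printing.

```python
"""Minimal post-control sketch: flag policy violations in a stream of change events.
The event schema and the guardrail rules are hypothetical, for illustration only."""
import json
import sys
from typing import Iterable

def violations(events: Iterable[dict]):
    for event in events:
        resource = event.get("resource", {})
        # Guardrail 1: no storage exposed to the public internet.
        if resource.get("type") == "storage" and resource.get("public_access"):
            yield event, "storage opened to public access"
        # Guardrail 2: no unencrypted disks.
        if resource.get("type") == "disk" and not resource.get("encrypted", True):
            yield event, "disk created without encryption"

def main() -> None:
    # One JSON object per line on stdin, e.g. exported from an activity-log feed.
    events = (json.loads(line) for line in sys.stdin if line.strip())
    for event, reason in violations(events):
        # A real setup would raise an incident or auto-remediate here.
        print(f"ALERT {event.get('id', '?')}: {reason}")

if __name__ == "__main__":
    main()
```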

Bottom line: all vectors point in one direction: ITSec needs to change and has to learn to automate, that is, it has to learn to code. As always, I appreciate your feedback.

 

PS: The first image is the IJN Mikuma, a Mogami-class heavy cruiser, sinking during the Battle of Midway. The other images were generated by https://openai.com/dall-e-2

Horseshoe bend #2 – Are we there yet?

are_we_there_yet_jpg.png

Sponsors tend to want to know how any project in their realm is getting along and, above all, what they get for the money they threw at us. They ask the same question over and over again: are we there yet? To be honest, when you have requested a few million bucks for a cloud implementation, it makes sense to know what “there” is and to be able to tell when you have reached that point. This is #2 in the cloud related articles dubbed the Horseshoe bend, focusing on the measurement of the outcomes of a cloud implementation.

In the case of a cloud adoption program there are four sets of folks in your organization whose interests you need to cater for: the business (the guys who fund the whole thing), ITSec – the knights who say Ni (or rather No), IT Ops, who see this whole thing as unnecessary, and last but not least the compliance folks representing regulatory scrutiny. The rest of this article attempts to set reasonable targets for each stakeholder group, define metrics for each of these targets and, at the end, explain why you should not stress the whole thing beyond reason.

The business metrics

  • The ability to respond quickly to a surge in demand (or a sharp decline for that matter) – this is a no-brainer, as long as you apply the ground rules of Infrastructure as Code (AND as long as your cloud provider does not run out of steam). (Metric: being able to spin up additional compute/storage resources within a few hours of the demand.) WARNING: it only makes sense to dynamically scale the infrastructure if the application layer is able to take advantage of this capability.
  • The speed of infrastructure design and implementation, from the request until it actually goes live. This is the one that has a great effect on developer productivity. The way to do it is by using technology building blocks and the underpinning blueprints, combined with automation. I mean full automation, with no manual intervention at all. This requires that ITSec and ITOps GIVE UP pre-control and move to post-control with near real time policy violation detection. Approve the design, not the actual instance, and check whether we have strayed away from this design.

one_ticket.jpg

The caveat is when you need to link your shiny new cloud environment with its on prem buddy, carrying a bunch of legacy technologies and, more importantly, legacy processes. It is like Lightning McQueen pulling Bessie. Yep, it may not be that fast… (Metric: the time between the first and the last related ticket when designing and implementing an IT infrastructure should be 25+% shorter than for its on prem counterpart.)

lightning_mcqueen.jpg

  • Cost transparency – this is easy, just implement proper tagging and a data analysis/visualization tool (a pedestrian Excel with a SQL backend will do) on top of the analytics report. Warning: it can be a double-edged sword in environments with poor cost transparency since – while it can indeed tell to the penny who spends how much on what – this can be pitched as a weakness compared to an on prem alternative where the costs are unknown or where the actual user of a service does not feel the pain of their extravagant requests. (Metric: report AND forecast the cloud spending by cost center. Produce cost reduction suggestions as a bonus. A minimal reporting sketch appears below.)
  • Technology adoption speed – the marketplace of any major cloud provider contains thousands of applications and development/management/monitoring tools, two orders of magnitude more choices than your on prem IT can handle. Balance is the key word here: too much freedom would throw a monkey wrench into IT Operations, while banning the inflow of new technologies would defeat the purpose of the whole thing. Clogging the path of innovation is a very bad idea, therefore when ITOps can no longer handle a new technology, apply the “you build it, you run it” principle.

innovation_vs_complexity.jpg
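
To illustrate the cost transparency metric above, here is a minimal reporting sketch. It assumes a per-resource billing export in CSV with cost_center and cost_eur columns; those column names are my own assumptions for illustration, and real exports differ per provider, but the tag-then-aggregate logic is the same.

```python
"""Minimal cost-transparency sketch: aggregate a billing export by cost-center tag
and compare it with the previous month. Column and file names are assumptions."""
import csv
from collections import defaultdict

def cost_by_cost_center(path: str) -> dict:
    totals = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Untagged resources are the first thing a tagging policy should eliminate.
            totals[row.get("cost_center") or "untagged"] += float(row["cost_eur"])
    return dict(totals)

if __name__ == "__main__":
    current = cost_by_cost_center("billing_current_month.csv")
    previous = cost_by_cost_center("billing_previous_month.csv")
    for cc in sorted(current):
        delta = current[cc] - previous.get(cc, 0.0)
        print(f"{cc:20s} {current[cc]:12.2f} EUR  ({delta:+.2f} vs last month)")
```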

The Technology metrics

As long as you opt for IaaS, you will have to deal with the same duties as if these VMs were in your own data center. And in some cases, you cannot avoid deploying VMs in your cloud subscription. Unless you plan to operate what you have built yourself, you need to accept that demanding the same processes as used on prem is a legitimate ask from Ops. The problem arises when those processes are siloed and littered with manual steps. IMPORTANT: the strength of a cloud infrastructure is given by the level of integration between its components. As soon as you start to operate the various components in separate silos, you kill the essence of the whole thing. This begs for a dedicated Cloud Operations, but that would question the status quo. Anyway, here are the technology metrics:

  • Know what you have: as long as you deal with a computing resource deployed for longer than a few hours, you want it in your CMDB. It is obvious but easily forgotten that this CMDB is on prem. (Metric: all CIs are known to the CMDB.)
  • Config management: automation can be a key differentiator here. Rather than trying to find an error by eyeballing config files, one can write code that makes sure reality equals the design. (Metric: the number of differences between the designed and the actual parameters. A minimal sketch appears after this list.)
  • Monitoring: cloud providers use the same components, architectures, hypervisors etc. (but not the same processes) that you do, therefore they are susceptible to the same errors as their on prem counterparts. Things will go wrong sometimes, so you have to implement monitoring. For smooth coexistence, feed the metric streams into both the traditional on prem monitoring tool and its cloud native alternative. (Metric: key metrics are fed to a monitoring system with alert thresholds defined.) WARNING: it does not matter how good your infra DR capabilities are if the application layer is not prepared to use them.
  • Incident management: what really matters is how fast and meaningful your reaction to an alert is. This topic is dealt with in ITIL, so I rest this case with the assumption that it is mostly the same as on prem, with one key difference: DO NOT allow anybody to tamper with the production environment manually, since it will create a collision between the parameters set by the automation script and those set by an Operations person. The question is whether you will have the discipline to make changes in the IaC code and then run that code, or whether you cannot resist the temptation to make manual changes. My hunch is that you will violate this rule sometimes…
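
Here is a minimal sketch of the “reality equals the design” check mentioned under config management. The file names and parameter values are made-up examples; the output is exactly the proposed metric, the number of parameters that differ from the design.

```python
"""Minimal design-vs-actual drift report. The JSON files are assumed to be flat
key/value exports (one designed, one taken from the provider's inventory)."""
import json

def drift(designed: dict, actual: dict) -> list[tuple[str, object, object]]:
    """Return (key, designed value, actual value) for every mismatching parameter."""
    keys = set(designed) | set(actual)
    return [(k, designed.get(k), actual.get(k))
            for k in sorted(keys) if designed.get(k) != actual.get(k)]

if __name__ == "__main__":
    designed = json.load(open("vm_design.json"))   # e.g. {"size": "D4s_v5", "encrypted": true}
    actual = json.load(open("vm_actual.json"))     # exported from the current inventory
    diffs = drift(designed, actual)
    for key, want, got in diffs:
        print(f"{key}: designed={want!r} actual={got!r}")
    print(f"metric: {len(diffs)} parameter(s) out of line with the design")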

The ITSec metrics

None of us wants to fall victim to a hacker attack. I learned the following maxim from ITSec people who were clearly ahead of me: “You can inflict way more damage with 1 million USD than you can avoid with it.” The playing field is not even; this alone should make you cautious about ITSec. The problem is when you achieve a relatively strong security posture at the expense of business flexibility. The following list is just scratching the surface.

  • Using Multi Factor Authentication (MFA) for any activity – in the case of the public cloud you are exposed by definition, so your first line of defense is the identity of the users. You need decent Identity and Access Management (IAM) tools and processes. The very minimum is to use MFA in all cases, not just for the admins. (Metric: yep, MFA for all.)
  • The granularity of admin rights, aka. reducing the attack surface: I recall my early days in IT in 1990, when I felt like Mr. Important because I got admin access to the NetWare 2.15 server at my first workplace. Of course, it was permanent; revoking it would have meant a demotion, right? Wrong: you do not need admin access to anything unless you have a job to do with that system. Using Privileged Identity Management (PIM) is an essential way to reduce the attack surface, namely the time window. Of course, its efficient use is based on the assumption that the PIM approval process is fast. In fact, the best thing is not to use admin accounts for anything in a production environment, but to use service principals instead. (Metric: admin rights are granted for a few hours, to the least number of people, only when needed. Bury the global admin account in a safe place and use it only as a last resort.)
  • Cloud native security metrics and best practices: cloud providers will create assessments of your cloud implementation, suggesting improvements. 3rd parties will also produce reports on known vulnerabilities (eg. Sysdig, F5, Red Hat). Read these and act upon their findings. It is wise to procure a penetration test against your own implementation on a regular basis. (Metric: a predefined security score – likely from your provider – and the speed of reacting to these findings.)

The compliance metrics:

d'Artagnan did not worry about the duel waiting for him at 2PM with Aramis, since he knew he would probably be dead by then thanks to his duel with Porthos scheduled at 1PM. I am more worried about hackers than auditors, so I do not have metrics for this area yet. (Okay: being in compliance with the regulatory guidelines, whatever their real meaning is.)

Summary – how to prove to your sponsor that you reached the goal?

The next paragraph might look weird after pages spent on defining these metrics: they are less relevant than what they fail to capture, namely the knock-on effects of a good cloud implementation. As Roy Amara put it: “We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run.” I am convinced that cloud computing is going to have a profound effect on how we do computing in the future. It is not an end in itself but an enabler, and we surely do not comprehend all of its implications, since it is hard to notice things in a system that we are part of, and it is hard to notice incremental change because it lacks stark contrast YET. As always, I will be happy to learn your feedback.

Horseshoe bend #1 – the Why

horseshoe_bend.jpg

The world is full of natural wonders that are photographed every minute. The bend of the Colorado river near Page is one of them. As an amateur photographer I took my own version. (Don't go in January, that is snow in the upper left corner…) Somebody also writes an article about the public cloud every minute, so investing effort in writing the N+1st version carries about as much novelty as the picture above. The ominous #1 signals that my content does not fit in a single article, therefore it will arrive in small chunks, just like the coffeehouse novels of the 1930s. Despite this, I hope that whoever invests a few minutes in reading this piece will profit from it.

Why is this cloudy thing relevant?

  • The strongest argument is bridging the chasm between the demand of the developers (and the business behind them) to follow a zigzag path (a.k.a. going agile) and the current capability of the on prem IT infrastructure to satisfy this demand. The business wants to experiment, ever faster and of course at the lowest possible cost, while on prem IT still thinks in annual budget cycles, where rolling out a new piece of hardware takes 3-4 months from approval. (If we consider the current chip shortage, it is easily over 5 months.)
  • There are usually two complaints about IT infrastructure beyond stability: it takes ages to grow and it cannot scale down, ie. it is rigid. Originally I drew the chart below as a fun fact to illustrate my point in a discussion with an executive: the development of IT infrastructure – beyond the brute-force increase in power – is the capability to follow an arbitrary demand curve with an increasingly precise answer (in good old calculus terms, Δt approaching zero). One of the advantages of the public cloud is the capability to support both the dynamically scalable technology (microservices in containers) AND the tooling that can automatically provision and manage it. (aka. Infrastructure as Code)

delta_t_tart_nullahoz.jpg

  • A colleague of mine once argued against the public cloud that since everything is changing so fast (COVID, the war in Ukraine, the looming recession, the growing inflation), one cannot plan even for 3 years, therefore we are at leisure to think about it for a few more years. While the reasoning is correct, the statement it is trying to support is dead wrong: this very unpredictability demands the ability to change direction fast. And the public cloud does exactly this: it prepares you for the unknown. It happens that a republic (“United forever in friendship and labour, Our mighty republics will ever endure.” Aha..) batters down its little brother because it dared to venture too close to countries not loved by the big brother. The Ukrainian National Bank dithered over the public cloud topic for over ten years, then suddenly approved its use for banks within a week this March. There are times when one needs to be fast.
  • My last argument: the bulk of technology innovation lately shows up in the marketplaces of the major cloud providers, aka. “cloud first”. I think we are not far from the era when this switches to “cloud only”. The functionality gap between the cloud and on prem versions of the same product widens every year: first comes the honey on the string, then comes the stick. (Sorry dudes, this is deprecated, you have no choice.)

The rationale of the deniers – slightly ruffled

For the full picture we need to discuss the arguments of the naysayers.

  • “We can do the same thing on prem!” It is true that any technological advancement and process innovation can be copied and implemented in an on prem environment. I am almost as good looking as Thor, all I need to do is just a bit more exercise and I will be there in no time.
    An average hyperscale cloud provider can dedicate more SW engineers to this task than most Hungarian enterprises combined. If we accept the theory that eg. Microsoft allocates its resources to a product line based on its revenue potential AND we take into consideration that MS has roughly 160 thousand employees AND Azure was behind 22 billion USD of the 168 billion total revenue last year, then it is fair to estimate that ca. 21 thousand people at MS are working on Azure day after day. At least one third of this army are developers, with technical leaders like Mark Russinovich. It is darn hard to win this race against MS (or AWS for that matter); we should compete somewhere else.
  • “You did not have to wait for the infrastructure, this was the business messing around wasting time.”
    • In enterprise IT it takes several months from request to fulfillment to serve an infrastructure demand WHEN the hardware was already in the data center at the time of the request. If procurement starts its “Speedy Gonzales” process AFTER the demand arrives, we are talking about at least 6 months.
    • If we replace the term “fidgeting” with experimenting, then we have to accept that the business sometimes does follow a zig-zag path. Although the expression is a bit overused this is still true: the business wants to be agile, it will place its bets on multiple things, it will change its mind and sometimes will make mistakes. The best supporter to “fail fast” and “fail cheap” is the public cloud.
  • „The cloud is expensive!” – There is a large amount of truth in this statement: when used in earnest, cloud services can be pretty expensive. On the other hand, the statement is misleading for the following reasons:
    • The cost of an on prem infrastructure is a flat fee in nature, regardless of its utilization. For the record: an average on prem enterprise IT infra runs with the efficiency of a steam engine, a Diesel engine at best; it uses 20% of its capacity while you pay for 100%. Cloud services pricing, on the other hand, is consumption based, ie. it will cost you dearly if you leave the lights on when not needed. Generations of IT folks grew up on the mantra that leaving the lights on was okay or even a good thing, since it gave a timeslot to the maintenance scripts and patches that ran throughout the night. We will need to override decades-old reflexes.
    • Most enterprises carry a huge amount of technical debt, and controlling departments do not even try to estimate its hidden cost. (If we draw an analogy between technical debt and financial debt, it becomes clear that the “interest” on this technical debt is the firm's slower reaction to change.) If you can reduce your technical debt by using the public cloud, it will make your company go faster, and that is worth money. This is the benefit that you never take into account when examining the cost of the cloud.
    • Most large enterprises cannot tell how much a given IT service exactly costs them. (True respect to those few who can.) This is the equation that is hard to solve: the sum over all N items of the IT service portfolio of (unit price × number of units consumed) = total IT cost. We know the prices of the cloud services, but the on prem service prices are smudged and distorted in the big common pot. It may even happen that cost transparency backfires, when the on prem folks claim “more expensive” while discreetly hiding the fact that they do not even know how much their own stuff costs.
  • The cloud is not secure – Please put the „Common vulnerabilities in Java” string into your (Google) search window. Then, if you are not nervous enough yet, replace the Java part with dotNet, then with your favorite (mobile) OS, etc. How long did it take you to fix all vulnerabilities related to log4j or Heartbleed? The question is NOT whether you are vulnerable but how long it takes you to realize that you have been hacked and to do something about it. I do not want to understate this topic; the last really trustworthy firewall was the two-inch air gap. I want to point out that the cloud is as vulnerable as your on prem infrastructure, and there is a chance that more and better trained ITSec engineers are attempting to reduce the risk there than in your on prem environment. Of course, it is an entirely different cup of tea when the service provider (or the state) itself wants to look into your data.
  • You cannot use the cloud because of compliance requirements – meeting the requirements of PCI-DSS (Payment Card Industry Data Security Standard), SOC 2 (System and Organization Controls 2), HIPAA (Health Insurance Portability and Accountability Act), ISO 27001 etc. is a daunting task indeed. It is quite funny to hear this excuse from the IT people of firms that satisfy none of the standards above; furthermore, pulling off this trick is not even in their plans. The large cloud players did it years ago and withstand the endoscopy of auditors on an annual basis.

Summary – what comes after the curve in the road?

IT infrastructure is becoming a commodity. This commodity is indispensable to our survival (in case of banks to the very existence), but does not bring a sustained competitive advantage compared to others who also use this technology.

The cloud, like many other technological advancements before it, brings something new that previous technologies could not do, and this will change the rules of the game. The question is whether an enterprise can still benefit from cultivating an on prem IT infrastructure and whether an on prem IT can compete with the capabilities of the hyperscale cloud providers. The answer to the first question is a probable yes, to the second one a definite no. As we know from Niels Bohr, prediction is difficult, especially about the future, but I will give it a try.

The balance (between on prem and cloud) will be influenced by the business goals of the firm (a mom-and-pop shop vs. a multinational trying to conquer the globe), the playground defined by the regulators, the sensitivity of the data handled and the optimum between cost and speed. It will vary between industries and company segments. The less legacy you carry (eg. a startup) and the further you are from the heavily regulated industries (ie. not a government), the more likely it is that within a few years the only on prem HW equipment you will end up with is a coffee machine and a photocopier. If you have hundreds of legacy (on prem) applications and you are heavily regulated, chances are the balance will settle around 65-75% on prem vs. 25-35% cloud.


The 4th wise man

img_e9365.JPG

A few months ago I got into a conversation with a senior executive about standards. I elaborated on the merits of standardization, mentioning companies where one could find 3-4 different technologies for the same function, incompatible with each other, not to mention the cost of operating and integrating all of them. The executive listened carefully, then pointed out that he had a worry about standards: they curtailed innovation. So he would rather live with the higher cost to preserve the organization's ability to innovate faster. The debate made me wonder if I had been wrong all along with my thesis, so I started digging. This blog post is my take on the subject.

I based my debunking of the “standards hurt innovation” myth on asking WHEN a standard should be inaugurated. (Credit and thanks to another executive for his help.) The illustration below combines the technology adoption lifecycle curve from Geoffrey Moore and the Hype Cycle model from Gartner.

 hype_cycle_and_adoption_curve_together.JPG

https://www.loadbalanceworks.com/newsDetail.asp?PostID=54624&n=5g-the-slope-of-enlightenment

Wait with setting your company standards until the market starts consolidating and the future winners are in sight, ie. until you can make a good bet on them. The Hype Cycle gives guardrails for when this solidifying moment arrives. As much as a well-chosen standard will help you reduce complexity, hence cost, an overripe standard will indeed become a blocker. There are two more questions that we do not cover today: when to move away from an old standard, and which technology to choose from the existing, stable options.

At this point I still had a nagging question in mind: if it is not standards, then what are the real inhibitors of innovation? We know for sure that if a company loses sight of the “3rd horizon” (3+ years into the future), it risks its demise as soon as another player finds an area where its innovation becomes transformational. (Eg. some fintechs did a decent job on online customer onboarding way before COVID; they just had not found the right business model yet.) The rest of the post is a homegrown set of rules on innovation for established financial institutions.

Rule #1: know where you want to innovate and where you can. I recall an investment bank that decided to create its own container management technology, that is, to beat Docker/Kubernetes. Their argument was that the internals of the bank were so unique that tweaking an off-the-shelf solution would have taken as much effort as writing their own. I also recall a bank that created its own private cloud with a handful of people over several years, not to mention that it started by purchasing a truckload of hardware rather than having a look at its existing provisioning processes. Neither of these cases was founded on a sound financial calculation, and both missed the fact that some very large market players had thrown a hundred times the staff and money at the problem EARLIER than they did. IMPORTANT: innovation <> doing something differently from the mainstream. The 13-bit microprocessor did not become mainstream for a good reason…

rocket_scienetists.JPG

Rule #2: make a clear choice whether you are an innovator or a fast follower.

I recall a meeting from another century when I tried to sell something cool to the CEO of a local GSM firm and quoted (even hummed) their tag line “XY GSM, the cutting edge” to reinforce my message. The CEO interrupted me and made an interesting statement: “We no longer want to be the cutting edge, that is expensive and risky; we want to be fast followers.” In this case a fast ingress process for bringing in external innovation is crucial. Reaction speed trumps everything else in these firms.

silence_please.jpg

Rule #3: Do not try to fix a process problem with (yet another) piece of technology.

There is an overarching theme in wealthy companies: they want an easy solution for a difficult problem, that is, they want to fix process problems with technology. When they inevitably fail to achieve their goal, they blame the given vendor, ditch the chosen technology and pick another one, hoping for a better outcome. (This always makes me frown, since it is darn close to the definition of insanity.) Purchasing a new technology is not innovation by itself, it just shows that you can afford it. (It is like a hobbyist photographer purchasing a medium format camera hoping that it will turn him into the next Ansel Adams overnight.) Technology might speed up a process, but what if it just makes a faulty process faster?

Rule #4: There is no such thing as risk-free innovation where the result is guaranteed.

It is enough to mention Edison, whose team – in search of a filament for the lightbulb that would be durable but inexpensive – tested more than 6,000 possible materials before finding one that fit the bill. For this reason, do not try to use the same set of KPIs that you run your daily business with, BUT do have a separate set of KPIs and a clear definition of success.

Rule #5: regularly scrape the hull of the ship! I think it is not funding that makes innovation. If you do not believe me, check out the movie called “The boy who harnessed the wind.” (Of course, a second bicycle would have made the case easier…) The real impediments to innovation are broken, anachronistic processes, usually sustained by the silo nature of an organization. I recall a company where a business request would go through four disconnected JIRA queues before ending up as a set of separate ServiceNow tickets. (It takes only a few months to get through this maze.) Now imagine Edison requesting the 6,000 (!) different materials he tested before landing on carbonized bamboo through this process. NOPE! Check out the value creation process to see where customer or employee engagement suffers and fix the process BEFORE you do anything else.

scraping_the_hull.JPG

https://www.drjdavidson.com/blog/2013/07/take-time-to-scrape-off-the-barnacles

An alternative solution is to leave the mothership behind and create a semi-independent “Skunkworks” unit where the rules of the mother company do not apply. After all, it worked for Lockheed when they created the SR-71. Note: If you separate the team who is entitled to innovate, make sure there is a natural way to ingest their findings back into the regular business.

Rule #6: Get people who can actually make a difference. To be politically incorrect – and I can joke about this one – I use the “One-Legged Tarzan” sketch to describe the problem of innovating with people who are 10-15 years behind the cutting edge. Tarzan is “a role which traditionally involves the use of a two-legged actor”, and it would be unusual for the part to be taken by a “unidexter”. Of course training can help, but there is still a lingering doubt about waking up one morning to realize that you are the old dog who may no longer learn new tricks…

the_one_legged_tarzan.JPG

https://www.youtube.com/watch?v=njK6zQp2Fdk

Rule #7: Nurture cross-unit collaboration by breaking down the silos. The following quote is from an HBR article, “The Biggest Obstacles to Innovation in Large Companies”, by Michael Britt.

“Any time you start something new like [an innovation initiative] that cuts across many areas, there's a potential for people feeling like you're in their backyard.” In these organizations any change will provoke a strong reaction: people feel attacked because you trespassed into their territory. The problem is that most value creation processes involve multiple departments, therefore one cannot really fix them without “trespassing”.

The last word: an innovation push without a nurturing cultural background is like a new coat of paint on a rusty surface, it will not last. Remove the most important inhibitor, the fear of being seen as vulnerable: even big shots – just like other human beings – may not know the answer to everything and sometimes even make mistakes.

As always, I appreciate any feedback on this post.

 


Dinosaur for breakfast

dinosaur_for_breakfast.png

I got a question from a colleague about how I would approach the replacement of an aging core banking system (CBS). Although I had an encounter with a project of this kind earlier, I wanted to give a more elaborate answer, so I got in touch with a few silverback CIOs of the local IT community who had first-hand experience with these beasts. (I express my appreciation for your help, Guys!)

This post is a summary of the interviews I had with IT executives in the CEE region, mixed with my own observations. I dare say that most of the suggestions below hold true for telco billing systems, or for any major endeavor that touches the core functions of the firm and has interfaces with a large number of other systems. So here we go:

  • Start with the why – if these reasons are not shared by the top management, do not start the whole thing. The best analogy to a CBS replacement is changing the engine of an aircraft in flight: it is the last thing you want to do, since it will bog you down for 5+ years, will cost you your proverbial shirt and nobody can guarantee that it will succeed. You must have a well-understood, easy-to-communicate reason why you do this, and you have to have the board's commitment to support your venture, sworn in blood. (They might forget about their oath in two years...)
  • This is NOT an IT project but rather a process/product level overhaul of the ship with significant technology support. (To prove my point, check out the Standish CHAOS reports or the various lists of the biggest flops in IT – technology itself is NOT the primary reason for failure in most cases; unrealistic changes driven by politics are a bigger danger.)
  • Choose the right time – “When the moon is in the Seventh House and Jupiter aligns with Mars”, ie. when the ownership and management structure of the firm is stable, the economic environment is fine and the regulators are relaxed and not introducing major legislation that demands immediate action.
  • The sponsor and his/her relationship to the PM – you are not tinkering with a pimple on your chin, this is heart surgery! As historical records show, the projects that achieved their business goals were those where the PM had the unwavering trust of the CEO, ie. the head of the program is not a CIO direct report but a CEO direct report. (The CIO would qualify but is usually busy running IT as we know it.)
  • Keep the ten commandments – DO NOT allow any customization in the core, or live with the consequences. There is a concept from Gartner called pace-layered IT architecture. In a single sentence: tinker with the application layer where your competitive advantage actually lies (systems of innovation) and DO NOT mess up the lower layers, especially the systems of record. I guess the 11th commandment, the one the business keeps sacrificing for short-term gains, is “Thou shalt not build frequently changing business functions into the foundation of your institution”. (The 12th is “Thou shalt not create point-to-point interfaces”. Well, they are as hard to keep as “Thou shalt not covet thy neighbour's wife.”)
  • KYD – Know Your Dinosaur – some of these systems linger around for 20+ years, carrying a thick guano of poorly documented changes, with the original developers gone for years by the time your great adventure starts. This is reverse engineering time when the project team tries to understand the process from the code. It will take time.
  • Make your Dino simpler – remove any functionality that does not belong to the core feature set. This means undoing the sins of the past (eg. when you built OLTP functions into your data warehouse or baked the deposit management of a loan into the a/c handling system itself). Of course, when you create a centralized client master data management solution, you will have to create an interface to the master data management in the new system. (Do not allow multi-master solutions.)
  • Staffing – The teams of these projects can grow substantial (up to 150 people). These folks are the ones with the deepest knowledge about your existing processes and systems whom you take away from their day jobs and who will be greatly missed by their line managers. It makes sense to set up a formal process to regulate this exodus of talent from the daily business and to reach out to system integrators or local partners of the COTS vendor to fill the gap.
    A guaranteed source of conflict is when the annual resource planning exercise ASSUMES that these jolly jokers are still in their regular positions and assigns tasks to them. The business will attest that the Earth will stop spinning without these folks and will reclaim them. This is when deadlines start vanishing. A potential way to avoid this conflict is to ask for dedicated people and even to create a dedicated org unit for the project.
  • Coexistence – Life will not stop for the years while you are building the “Great New Thing”. The owners of the current systems will keep churning out new releases, will modify interfaces or even change the underlying data structures. For this reason it is vital to capture these changes and to make sure that you have the latest version of all corresponding systems in your test environment. You need to automate – on top of automating the testing of the new system itself – the buildup of an integrated test environment, including the creation of the test data.
  • Evolution vs. Revolution – I got feedback from an ex-colleague of mine that I missed an aspect, namely that you have to produce something on a regular basis that the business can actually use. This keeps alive the hope that you are moving in the right direction and gives the client a chance to provide feedback.
  • Test data – Gary Larson mentions a separate chamber in Hell for those who drive slowly in the fast lane. I think there is another bucket for those who invented GDPR. Imagine building a test scenario where all systems depersonalize their master data in their own right. You may want to be a bit more forgiving when enforcing those GDPR guidelines… (A minimal depersonalization sketch appears after this list.) 25+ years ago lawmakers in Hungary abolished the use of the personal ID, since it could allow those nasty IT people to link disjunct databases. So the industry went back to using strings instead (name, mother's name, address etc.), UNTIL the government really wanted to identify you and asked for your social security number and your tax ID for mostly anything (the two together are as good as the personal ID was). I think the fact that the whole society is self-profiling itself on social media for poo emojis will cause even bigger trouble, and lawmakers are lukewarm at best about stopping it.

gdpr_resize.png

  • The interfaces – the Achilles heel of any complex IT project. A CBS can have 50+ interfaces, using technologies invented by your ancestors. You set a goal to replace these not-so-secure interfaces with something modern (for the record: ITSec holds you at gunpoint to do it). The issue: it requires changes in the other systems by the very people you just brought over to your project. Oops.. Ok, you decide to EMULATE the old interfaces to the outside world while going super duper inside. Things start to get ugly, so you obtain the first permission to fall back on the old solutions, “just temporarily”. For the record: temporary solutions will stay for 10+ years. The business will never allow you to spend money replacing them!
  • Close coupling – this is the fancy name for not using APIs and, eg., sucking data directly out of another system's database. This is cool until they change the DB layout… If there is a place in life for enterprise architects, it is guarding adherence to design best practices regarding interfaces.

close_coupling.jpg

  • The vendor – vendors love it when you are on the hook. It is like a thick needle in your vein, pumping your money into their pockets, and you just cannot escape. As long as they are good at what they do, this might be acceptable. They will be nice during the courting phase, but the gloves may come off when the first non-acceptance occurs. It makes sense to have escape clauses for both parties, with well-defined milestones.
    It is also important to note that all the promises made by the vendor about new functions rolled out on a regular basis – paid for in the annual maintenance and support fee – WILL NOT be available to you once you start to customize the base offering. (This applies only when you go for a commercial off-the-shelf solution.)
  • The system integrator – as mentioned earlier in the staffing section you are likely to run short on skilled people, so you are likely to turn to a system integrator for help. The issue is when you get an army of rookies instead of the highly skilled folks you met during the presales phase OR when you realize nuances like a kickback from a certain HW vendor. Make sure you really understand the business model of your provider and accept the fact that good people are expensive. A theoretical alternative is to solely rely upon your own internal project team, but the recruitment and ramp up hurdle makes it challenging.
  • 24 x 7 – in the era of instant gratification people just cannot live without an always-on banking service. This requires your new system to work without the old daily closure, during which you had to maintain a shadow balance while the CBS was busy with its end-of-day batch processing.
  • A word on HW: Do not start with buying a bunch of iron and licenses, use a public cloud offering and a bare minimum of licenses until you are done with the process related issues. Consider not buying HW at all, but staying on a public cloud throughout the whole project and moving back only at the end if necessary.
  • A word on SW: make sure that your new CBS (auto)scales out rather than up. Most managers are familiar with the concept of containers and microservices, but some do not realize that the underlying platform only enables this capability; the application layer itself has to take advantage of it.
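
On the test data point above, here is a minimal depersonalization sketch: personal fields are replaced with a keyed hash, so the same client still maps to the same token across systems and the interfaces stay testable, while the real values stay out of the test environment. The field names, the CSV layout and the environment variable are assumptions for illustration only.

```python
"""Minimal test-data depersonalization sketch using a keyed hash (HMAC-SHA256).
Field names, file names and the PSEUDO_KEY variable are illustrative assumptions."""
import csv
import hashlib
import hmac
import os

SECRET = os.environ.get("PSEUDO_KEY", "change-me").encode()   # keep the real key in a vault
PERSONAL_FIELDS = ("name", "mothers_name", "address", "tax_id")

def pseudonymize(value: str) -> str:
    # Same input always yields the same token, so cross-system joins still work in test.
    return hmac.new(SECRET, value.strip().lower().encode(), hashlib.sha256).hexdigest()[:16]

def depersonalize(src: str, dst: str) -> None:
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for field in PERSONAL_FIELDS:
                if row.get(field):
                    row[field] = pseudonymize(row[field])
            writer.writerow(row)

if __name__ == "__main__":
    depersonalize("clients_prod_extract.csv", "clients_test.csv")
```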

 

The not so PC stuff

  • Beware of Conway's law – some financial institutions have more than one core banking system, operated by separate silos. To avoid never-ending turf wars it makes sense to consider moving the various a/c handling systems into the same org unit. As Melvin Conway put it: “Any organization that designs a system will produce a design whose structure is a copy of the organization's communication structure.” (Read: a copy of its org chart.) If management decides to do a major reorg, they should do it well before the project starts, since it takes 6+ months after such an event until functions, competencies and accountabilities are aligned again.
  • The location of the vendor team – “It will be done by next Monday” means different things in Europe and in other parts of the globe. Chances are that your dev team will be located in India, so you need to get accustomed to the cultural differences.
  • Occupational hazard of the PM – if something goes wrong with a large-scale undertaking, upper management will look for a scapegoat, most likely the PM. This is business as usual, as long as you negotiated a decent severance package in advance.

framed.JPG

 

My interviewees highlighted that they were only scratching the surface during our discussions. I hope that you enjoyed reading it. As always I will be happy to hear from you on this topic.
