To a newborn every joke is new, so I have specialized in old jokes – I tell them over and over again.

Floorshrink diaries

Dinosaur for breakfast

June 27, 2021 - Floorshrink

I got a question from a colleague about how I would approach the replacement of an aging core banking system (CBS). Although I had an encounter with a project of this kind earlier, I wanted to give a more elaborate answer, so I got in touch with a few silverback CIOs of the local IT community who had first-hand experience with these beasts. (I express my appreciation for your help, Guys!)

This post is a summary of the interviews I had with IT executives in the CEE region, mixed with my own observations. I dare say that most of the suggestions below hold true for telco billing systems, or for any major endeavor that touches the core functions of the firm and interfaces with a large number of other systems. So here we go:

  • Start with the why – if these reasons are not shared by the top management, do not start the whole thing. The best analogy to a CBS replacement is changing the engine of an aircraft during flight: this is the last thing you want to do, since it will bog you down for 5+ years, will cost you your proverbial shirt, and nobody can guarantee that it will succeed. You must have a well-understood, easy-to-communicate reason why you are doing this, and you have to have the board's commitment to support your venture, sworn in blood. (They might forget about their oath in two years...)
  • This is NOT an IT project but rather a process / product level overhaul of the ship with significant technology support. (To prove my point, check out the Standish CHAOS reports or the various lists of the biggest flops in IT – in most cases technology itself IS NOT the primary reason for failure. Unrealistic changes driven by politics are a bigger danger.)
  • Choose the right time – “When the moon is in the Seventh House and Jupiter aligns with Mars”, i.e. when the ownership and management structure of the firm is stable, the economic environment is fine and the regulators are relaxed and not introducing major legislation that demands immediate action.
  • The sponsor and his/her relationship to the PM – you are not tinkering with a pimple on your chin, this is heart surgery! As historical records show, the projects that achieved their business goals were those where the PM had the unwavering trust of the CEO, i.e. the head of the program is not a CIO direct report but a CEO direct report. (The CIO would qualify but is usually busy running IT as we know it.)
  • Keep the ten commandments – DO NOT allow any customization in the core, or live with the consequences. There is a theory by Gartner called the Pace-layered IT architecture. In a single sentence: tinker with the application layer where your competitive advantage actually lies (systems of innovation) and DO NOT mess up the lower layers, especially the systems of record. I guess the 11th commandment, which the business keeps sacrificing for short-term gains, is “Thou shalt not build frequently changing business functions into the foundation of your institution”. (The 12th is “Thou shalt not create point-to-point interfaces”. Well, they are as hard to keep as “Thou shalt not covet thy neighbour's wife.”)
  • KYD – Know Your Dinosaur – some of these systems linger around for 20+ years, carrying a thick guano of poorly documented changes, with the original developers gone for years by the time your great adventure starts. This is reverse engineering time when the project team tries to understand the process from the code. It will take time.
  • Make your Dino simpler – remove any functionality that does not belong to the core feature set. This means undoing the sins of the past (when e.g. you built OLTP functions into your data warehouse or baked the deposit management of a loan into the a/c handling system itself). Of course, when you create a centralized client master data management solution, you will have to create an interface to the master data management in the new system. (Do not allow multi-master solutions.)
  • Staffing – The teams of these projects can grow substantially (up to 150 people). These folks are the ones with the deepest knowledge of your existing processes and systems, whom you take away from their day jobs and who will be greatly missed by their line managers. It makes sense to set up a formal process to regulate this exodus of talent from the daily business and to reach out to system integrators or local partners of the COTS vendor to fill the gap.
    A guaranteed source of conflict is when the annual resource planning exercise ASSUMES that these jolly jokers are still in their regular positions and assigns tasks to them. The business will attest that the Earth will stop spinning without these folks and will reclaim them. This is when deadlines start vanishing. A potential way to avoid this conflict is to ask for dedicated people and even to create a dedicated org unit for the project.
  • Coexistence – Life will not stop for the years while you are building the “Great New Thing”. The owners of current systems will keep churning out new releases, will modify interfaces or even change the underlying data structures. For this reason it is vital to capture these changes and to make sure that you have the latest version of all corresponding systems in your test environment. You need to automate – on top of automating the testing of the new system itself – the buildup of an integrated test environment, including the creation of the test data.
  • Evolution vs. Revolution – I got feedback from an ex-colleague of mine that I missed an aspect, namely that you have to produce something on a regular basis that the business can actually use. This keeps the hope alive that you are moving in the right direction and gives the client a chance to provide feedback.
  • Test data – Gary Larson mentions a separate chamber in Hell for those who drive slowly in the fast lane. I think there is another bucket for those who invented GDPR. Imagine building a test scenario where all systems depersonalize their master data on their own. You may want to be a bit more forgiving when enforcing those GDPR guidelines… (25+ years ago lawmakers in Hungary abolished the use of the personal ID, since it could allow those nasty IT people to link disjunct databases. So the industry went back to using strings instead (name, mother’s name, address etc.) – UNTIL the government really wanted to identify you and asked you to enter your social security number and your tax ID for mostly anything; the two together are as good as the personal ID was.) I think the fact that the whole society is profiling itself on social media for poo emojis will cause even bigger trouble, and lawmakers are lukewarm at best about stopping it.

  • The interfaces – the Achilles heel of any complex IT project. A CBS can have 50+ interfaces, using technologies invented by your ancestors. You set a goal to replace these not-so-secure interfaces with something modern (for the record: ITSec holds you at gunpoint to do it). The issue: it requires changes in the other systems by those people whom you just brought over to your project. Oops. OK, you decide to EMULATE the old interfaces to the outside world while going super-duper inside. Things start to get ugly, so you obtain the first permission to fall back on the old solutions, “just temporarily”. For the record: temporary solutions will stay for 10+ years. The business will never allow you to spend money replacing them!
  • Close coupling – this is the fancy name for not using APIs and e.g. sucking data directly out of the database of another system. This is cool until they change the DB layout… If there is a place in life for enterprise architects, then it is guarding adherence to design best practices regarding interfaces.

  • The vendor – vendors love it when you are on the hook. It is like a thick needle in your vein, pumping your money into their pockets, and you just cannot escape. As long as they are good at what they do, this might be acceptable. They will be nice during the courting phase, but the gloves may come off when the first non-acceptance occurs. It makes sense to have escape clauses for both parties, with well-defined milestones.
    It is also important to note that all the promises made by the vendor about new functions rolled out on a regular basis – paid for in the annual maintenance and support fee – WILL NOT be available to you once you have started to customize the base offering. (This applies only when you go for a commercial off-the-shelf solution.)
  • The system integrator – as mentioned earlier in the staffing section, you are likely to run short on skilled people, so you are likely to turn to a system integrator for help. The issue is when you get an army of rookies instead of the highly skilled folks you met during the presales phase, OR when you discover nuances like a kickback from a certain HW vendor. Make sure you really understand the business model of your provider and accept the fact that good people are expensive. A theoretical alternative is to rely solely upon your own internal project team, but the recruitment and ramp-up hurdle makes it challenging.
  • 24 x 7 – in the era of instant gratification people just cannot live without an always on banking service. This requires your new system to be able to work without the old daily closure when you had to maintain a shadow balance since the CBS was busy with its closure batch processing.
  • A word on HW: Do not start with buying a bunch of iron and licenses, use a public cloud offering and a bare minimum of licenses until you are done with the process related issues. Consider not buying HW at all, but staying on a public cloud throughout the whole project and moving back only at the end if necessary.
  • A word on SW: Make sure that your new CBS (auto) scales out rather than scaling up. Most managers are familiar with the concept of containers and microservices, but some of them do not realize that the underlying platform only enables this capability – the application layer itself has to be built to take advantage of it.

 

The not so PC stuff

  • Beware of Conway’s law – some financial institutions have more than one core banking system operated by separate silos. To avoid never ending turf wars it makes sense to consider moving the various a/c handling systems into the same org unit. As Melvin Conway put it: “Any organization that designs a system will produce a design whose structure is a copy of the organization's communication structure.” (Read: a copy of its org chart) If the management decides to do a major reorg, they should do it well before the project starts since it takes 6+ months after such events until functions, competencies and accountabilities will be aligned again.
  • The location of the vendor team – “It will be done by next Monday” means a different thing in Europe vs. other parts of the globe. Chances are that your dev team will be located in India, so you need to get accustomed to the cultural differences.
  • Occupational hazard of the PM – if something goes wrong with a large-scale undertaking, upper management will look for a scapegoat, and that is most likely the PM. This is business as usual, as long as you have negotiated a decent severance package in advance.

framed.JPG

 

My interviewees highlighted that they were only scratching the surface during our discussions. I hope that you enjoyed reading it. As always I will be happy to hear from you on this topic.

Stranger things

A couple of days ago I had the chance to talk with the CIO of a pharmaceutical company. He mentioned that the business does not track the time usage of the internal IT workforce, i.e. it acts as if it were free and limitless. In another discussion, a banking executive mentioned that during a regular business review they had marked 50 items as priority one. Earlier I had the chance to peer into the JIRA queue of a SW development team. There were 17 thousand (not a typo) active items in it.

Something strange is going on in the minds of these otherwise absolutely smart people that we need to set straight in order to go beyond the eternal blame on IT. In the following post I attempt to sort out a few basics. IT folks dealing with economists, you may want to read on.

If everything is a priority, then actually nothing is a priority

If you type “meaning of priority” into Google, the first result is this: “the fact or condition of being regarded or treated as more important than others.” Webster will tell you this: “a preferential rating, especially: one that allocates rights to goods and services usually in limited supply”. In our case the service in limited supply is the capacity of your development team(s). This has an implication: prioritization means ranking, where only one item can be assigned PRIORITY 1 and all others get a lower priority compared to this one. What made the aforementioned business executive tag 50 development requests as priority 1 was that all of these came from regulatory changes that he could not refuse to do. This increased demand – if not accompanied by an increase in the throughput of the dev team – will have consequences:

  • Fulfilling a regulatory demand usually does not make you more competitive, it just keeps you in business, therefore when a significant portion of the dev capacity is burned on satisfying these, the business will get frustrated because of the functions they DID NOT get.
  • It will make the dev team frustrated since their (internal) client is unhappy and will unleash their wrath on them for not meeting the business expectations.
  • The dev team will learn to pick up the workload of those who escalated most efficiently (screamed most loudly or most recently into the right ears). A LIFO in operation: whoever yelled at you last gets served first. Warning: stress will increase performance only for a short period of time; after that it gives way to apathy. (See János (Hans) Selye for the details.)
  • These priorities will change quite often, introducing one more factor, context switching. The bad news is that the human brain is wired in a way that handles context switching with a penalty. I would guesstimate that this phenomenon itself could reduce the team’s throughput by 2-3 %. (to use the juggling analogy, they will drop the ball sometimes…)
  • One of the usual casualties of this conflict is technical debt. Neglected technical debt will increase the resistance of the system against any change, therefore the dev team will have to push harder. Another typical reaction is the reduction of training time. As a result, the actual throughput will decrease. (see the details in the Appendix.)

What can you do about it? Treat priorities as a ranking order, estimate the development effort properly (the most difficult part – it requires regular checking against actuals) and draw the line at capacity. Do not try to force anything into the system beyond capacity; simply park it.
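To make the “rank, estimate, draw the line” idea concrete, here is a minimal sketch in Python (the request names, ranks, efforts and the capacity figure are all invented for illustration):

```python
# Minimal sketch of "rank, estimate, cut at capacity" (illustrative numbers only).
from dataclasses import dataclass

@dataclass
class Request:
    name: str       # hypothetical request name
    rank: int       # 1 = highest priority; ranks are unique, no ties
    effort: int     # estimated effort in developer-days

CAPACITY = 100  # assumed team capacity for the period, in developer-days

backlog = [
    Request("Regulatory report XY", 1, 40),
    Request("New onboarding flow", 2, 35),
    Request("Statement redesign", 3, 30),
    Request("Partner API tweak", 4, 10),
]

committed, parked, used = [], [], 0
for req in sorted(backlog, key=lambda r: r.rank):
    if used + req.effort <= CAPACITY:   # smaller, lower-ranked items may still fit
        committed.append(req)
        used += req.effort
    else:
        parked.append(req)              # over capacity: parked, not forced in

print("committed:", [r.name for r in committed], f"({used}/{CAPACITY} days)")
print("parked:   ", [r.name for r in parked])
```

Everything below the line is parked explicitly rather than squeezed in “somehow”, which is exactly the conversation the ranking is supposed to force.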

Putting more demand on the dev team than they can handle

Let us see what happens if you do overload your system. The short version can be described with the following diagram:

Borrowed from a presentation by Mary Poppendieck

Of course, the Business feels they were treated badly: a large portion of their asks were not fulfilled. They have two choices (after replacing the head of the team):

  1. Reduce the demand by reducing complexity in the processes (and slashing the number of offerings), then move to COTS (Commercial Off-The-Shelf) solutions rather than demanding tailor-made solutions for everything. This move might have an impact on the competitiveness of the firm, so most business folks will not like it.
  2. Increase the capacity to serve the demand. (including tech. debt.)

In the last part of this post, we will have a look at the ways to increase the capacity.

How to increase the capacity of your dev team without increasing the cost?

Imagine the software development team as an engine with a finite throughput (capacity). The Business cares about three things: this throughput, the quality and the associated cost. An obvious solution is to hire more developers, but this approach has its limits. Extensive growth worked in the USSR for a while (more raw materials + more labour = more output, yippee) but it soon reached its limits. We need to go beyond this and try something else.

Here is a 10-sec crash course on the Theory of Constraints: if you buy 10,000 respirator machines while you have only 4,000 trained medical experts to operate them (1 operator for each machine, and training one takes several years), what will be the upper limit of people who can benefit from these machines at a given time? Yep, 4,000: strengthening any link of a chain except the weakest is a waste of energy (in our case, money).

bottleneck.png

Imagine that this Software Development Engine is a pipe shown above. (Agile folks, this is a simplification, lower your guns.) You have to have the same throughput at each stage within this Engine, otherwise the one with the smallest throughput will determine the overall throughput. For this reason, it does not make sense to add one more Demand Manager to the team when the bottlenecks are the developers and the testers. You need to reallocate people to the weakest section of your value creation process in order to increase its overall throughput. A possible way to increase throughput is by using the Theory of Constraints (TOC). The TOC improvement process is named the Five Focusing Steps:

https://www.tocinstitute.org/five-focusing-steps.html

There are various methods to find these bottlenecks in a software development process. One of them is value stream mapping. Once the bottleneck is eliminated, be prepared to find another one somewhere else in the process. This is whack-a-mole, but one with great efficiency gains.
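As a toy illustration of the weakest-link point (the stage names and weekly numbers below are invented, not measurements), the following sketch shows that the pipeline never delivers more than its slowest stage, and that once one bottleneck is relieved the next one appears elsewhere:

```python
# Toy model of a delivery pipeline: overall throughput = min over the stages.
stages = {              # hypothetical stages with items they can process per week
    "demand management": 40,
    "analysis & design": 25,
    "development":       15,   # current bottleneck
    "testing":           18,
    "release":           30,
}

bottleneck = min(stages, key=stages.get)
print(f"pipeline throughput: {min(stages.values())} items/week, limited by '{bottleneck}'")

# Adding capacity anywhere else changes nothing:
stages["demand management"] += 10
print(f"after adding a demand manager: still {min(stages.values())} items/week")

# Reallocating effort to the bottleneck is what moves the needle
# (and promptly exposes the next constraint, testing):
stages["development"] += 5
print(f"after shifting people to development: {min(stages.values())} items/week")
```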

If you want to read on, there is a great article on this topic: Theory of Constraints Best Practices by McKinsey Alum (stratechi.com). If you want to dig deeper, there are two amazing books I recommend:

Henry Ford knew it – we just need to copy him

While we are not at the point where we can automate the creation of code, we are certainly at (or beyond) the point where we can automate the testing of code designed and written by humans. In theory one can automate most key steps in the SDLC process, including the automated creation and initialization of the test environment, the build process and the testing itself. This idea has one caveat: those who can create this magic are developers themselves (okay, test automation folks with strong scripting skills). Test automation has become a separate discipline that one needs to learn.

The takeaway: if you are not happy with the throughput of your development engine, do a thorough analysis on the whole value creation process (eg. with value stream mapping) and eliminate the constraints by repurposing existing capacity from another part of the engine AND automate any repetitive tasks, especially testing (+ creating the test environment) and the build process.

As always, I will be happy to get your feedback on this post.

 

PS: For those with a bit more appetite for analytics, I put together a model to explain what happens over a longer period of time. Let’s assume that the throughput of your dev team over a period is 100.

You have 3 types of user stories: Small ones that cost 1 unit of effort, Medium ones that cost 3 units, and Large ones that cost 5 units. (a simplification from the original 5 T-shirt sizes) A certain request will either fit or will not fit into a given period. (again a simplification, not allowing for partial jobs.)

excel_1.png

 

There are 3 priorities, from Prio 1 to Prio 3. It is important to note that we do not take into consideration the actual value gained from a user story; we assume that the value gained is a linear function of the corresponding cost. (If you cannot measure and cross-charge the cost of any product or service, then your ROI calculations can only justify a decision that you have already made, right?)

In the first period (year) you will cover 91% of the known requirements. The business does not know about technical debt, nor does it want to allocate much time to sharpening the axe (aka training). Technical debt is like financial debt: you take a loan today and will have to pay it back with interest later. This interest is the system’s increased resistance to change. It comes as a penalty on the total throughput of the dev team, so this is down to 97 units vs. the 100 you had last year.

In the second year we assume that the new demand is the same as last year, BUT we have to take care of the load that spilled over from last year. Your performance in the eyes of the business is down to 81%, and you still have not touched the tech debt that will haunt you soon.

At the end of year 3 your performance is down to 71%, with an ever-increasing pile of tech debt waiting for you. This model also explains how those gigantic numbers of open JIRA tickets are generated.
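For the analytically inclined, the three-year trajectory above can be reproduced with a few lines of Python. The new-demand figure of 110 units per year and the 3% annual tech-debt penalty are my assumptions, chosen so that the output matches the 91% / 81% / 71% coverage described above; the original spreadsheet may use different inputs.

```python
# Compact re-creation of the capacity vs. demand model above.
# Assumptions (mine, chosen to reproduce the 91% / 81% / 71% figures):
#   - 110 effort units of new demand arrive every year,
#   - initial capacity is 100 units,
#   - untouched technical debt shaves ~3% off capacity each year,
#   - unmet demand spills over into the next year.
new_demand, capacity, backlog = 110, 100.0, 0.0

for year in (1, 2, 3):
    total_demand = new_demand + backlog
    delivered = min(capacity, total_demand)
    coverage = delivered / total_demand
    backlog = total_demand - delivered
    print(f"year {year}: capacity {capacity:.0f}, demand {total_demand:.0f}, "
          f"coverage {coverage:.0%}, spillover {backlog:.0f}")
    capacity *= 0.97   # the "interest" paid on neglected tech debt
```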

That's a lot bigger than the one you are using now!

A couple of weeks ago I got into a discussion about training budgets. Not surprisingly, the gentleman from finance thought the budget was too high, while I maintained the view that it was too low. His final argument was that we could not spend those dollars anyway, since the business would not give us time for it. His reasoning stuck in my mind. Since the method suggested by Gary Larson (see above) may not work for everybody, I figured I would prove my point by using an accounting analogy. So here we go.

I always had a strange feeling when someone called me (or anyone else for that matter) a “resource”, since I would not even call my dog that, but for my purpose this was the right term. For a minute let us treat humans working at a firm like robots, whose value is recorded in the books as fixed assets with a purchase cost and depreciated over a predefined number of years.

Most people acknowledge that the half-life of IT skills is finite: if one sits on their laurels for 8-10 years (i.e. does not update the skills related to this profession), their knowledge becomes outdated and they slip down to a lower league with little hope of ever returning. We need to sharpen the axe. The question is how much time and money should be spent on this sharpening and who should pay for it.

My next assumption was that sliding down to a lower league roughly halves one's cost to the employer. All guesstimates below are from Hungary, applying the local conditions. I inserted a link to the original model in the PS section if you want to play with it.

The linchpin of the argument is that the job market assesses your value correctly, ie. the gap between the two categories equals the money (including opportunity cost) someone needs to invest to stay in the current league.  (Of course, the whole reasoning is valid only within the same country, since a fresh grad in an investment bank in NY earns twice as much as a senior guy in Eastern Europe working for the same bank.)

Let’s assume that the half-life of an IT person’s knowledge is 8 years. This comes to 3.5k USD “amortization” that we need to backfill annually. (This amortization is not linear and the boundaries between “leagues” are blurred, but I did not want to overcomplicate the model.)

Having talked with several folks on this topic, we came to the conclusion that one needs roughly 10-12 days a year to stay afloat. So all in all we have 3.5k USD in cost and opportunity cost, plus 80 hours, to split between the two parties, the employee and the employer. (A friend of mine from Berlin mentioned that the state can have a word in this, but that is the distant future for CEE, so we ignore it for now.)

the_cost_to_offset_amortization.JPG

With 2080 workable hours and 88% “billable” utilization, an hour of a skilled IT person costs the employer around 30 USD, so this is the money “lost” if this person is not working but learning.

We arrived at the crucial point of this post, the split of time and money spent on training. I assumed that an average IT person would be willing to spend 1% of his/her annual net income on learning (roughly the annual subscription to O’Reilly or Coursera Plus) and also figured that people are more sensitive to spending money than to sacrificing their free time. (If IT is your passion, you will happily spend way more on it than 24 hours a year.)

So here is the model that I ended up with:

Having played with it for a while I arrived at 1.4k USD as a reasonable amount of training budget per head per year. In a COVID era this might be lower, but certainly not below 1k USD.
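For transparency, here is a back-of-the-envelope version of the ingredients of that model in Python. The fully loaded cost and net income figures are rough assumptions of mine (the text only gives the ~30 USD/hour and 3.5k USD results); the original spreadsheet linked in the PS does the actual split that leads to the ~1.4k USD figure.

```python
# Back-of-the-envelope inputs to the training-cost model.
# Only the 2080 hours, 88% utilization, ~80 hours and the 1% rule come from the text;
# the annual cost and net income figures are rough assumptions.
annual_employer_cost = 55_000            # assumed fully loaded annual cost, USD
workable_hours = 2080
billable_utilization = 0.88
hourly_cost = annual_employer_cost / (workable_hours * billable_utilization)

training_hours = 80                      # ~10-12 days a year to stay afloat
opportunity_cost = training_hours * hourly_cost

skills_amortization = 3_500              # annual skills "amortization" to backfill, USD
employee_net_income = 22_000             # assumed annual net income, USD
employee_cash_share = 0.01 * employee_net_income   # ~an O'Reilly / Coursera subscription

print(f"hourly cost to the employer:   ~{hourly_cost:.0f} USD")
print(f"opportunity cost of 80 hours:  ~{opportunity_cost:.0f} USD")
print(f"employee's cash contribution:  ~{employee_cash_share:.0f} USD")
print(f"total 'amortization' to cover: ~{skills_amortization} USD")
```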

As usual, answering a question (regardless of how good the answer is) raises new questions:

  • Should this money be spent evenly on the entire staff or should there be a bias towards high performers or potential high flyers? (My assumption is yes, there should be a bias.)
  • How shall we measure the effectiveness of training? People tend to freak out about certification. (And indeed, some certs are overpriced and lack real-life applicability.)
  • How shall we measure the actual impact of training on the bottom line of the firm? If you cannot measure it, the whole argument becomes weaker.
  • Should we apply different split ratios for different subjects? (eg. English vs. hard core technical skills) My bet is that language skills should be treated as a prerequisite to actual learning, therefore the employee’s problem.

 

As always, I will be glad to receive feedback on this post.

PS: A la recherche du temps perdu – the link to the model.

TLDR #7: Too old to rock & roll but too young to die

too_old_to_rock_and_roll.png

I have been involved with training at large organizations for a number of years. I was surprised to hear the contradicting opinions on this subject, so I decided to form my own. The questions this post is trying to answer are as follows: How much time, effort and money should be invested to maintain the cutting edge, and most importantly, who should invest it? How can this investor measure the return on his/her investment? So here we go, my take on this topic. BTW: if you know where this image comes from, you probably should read on.

The stakeholders in the game: beside the two obvious ones (the employee and the employer), there is a third stakeholder, the state. People in IT tend to generate 2+ times the average GDP per capita and are in many cases a substantial contributor to the country’s exports, hence the state’s interest. I skip governments this time: although they can tilt the playing field by throwing in (taxpayers’) money or creating regulations, their most effective measures are in assisting secondary and higher education.

 The underlying problems – the employee’s headache:

  • Our profession is not like being a baker, where you learn it at the age of 18 and then keep doing it for the next forty years. IT skills have an amortization period of 7-8 years. You have two choices: you upgrade your skills or you downgrade the league you play in. (FYI: this is a one-way street.)
  • The player’s ability to sharpen their axe declines with age. You can claim that the higher torque (your experience) will make up for the lower RPM (no longer willing to work 60+ hour weeks), but this will not hold water over 50. You are not getting stupid; you are getting slow. It takes more effort to learn something new at the age of 50 than at the age of 20. To put it mildly, you are more likely to master “sustaining innovation”, a newer version of your favourite technology, than to embrace a disruptive innovation that is likely to oust the one you already mastered.

 Employers also have a few headaches:

  • The long-term supply-demand imbalance causes wage inflation that surpasses the increase in personal efficiency gains, which means the net margin produced by (made on) a single IT knowledge worker is shrinking. This caused the flood of outsourcing work to cheaper places like Eastern Europe and India.
  • Their assets (us) amortize pretty fast while demanding increased compensation. Without permanent upgrades your staff loses their cutting edge within a decade. In theory the employer could buy these new skills on the market (a euphemistic way of saying they could sack the old folks with expiring knowledge and hire new people from the job market with up-to-date skills), but this would backfire as their reputation would go down the drain, i.e. they would need to pay extra for playing hardball with their staff, not to mention that the ramp-up period is 8+ months in a complex IT environment.

How does the math look for an employer? The efficiency gain that materializes in production (in combination with the gains brought by the new technology the student just learned) should be greater than the cost of the training plus the opportunity cost of not working during the training. The issue is measuring these gains. More on this later.

A few false myths employees should forget:

  • Staying at a firm for a long time = loyalty = increased value. Nope: unless you truly reinvent yourself (e.g. by moving up the food chain to an area where the half-life of your skills is longer), your value increase stops around year 7-8 at the firm. After that it is your know-how of getting things done at the firm + your informal network that keeps you afloat, so get out of your comfort zone and try something new!
  • Degrees and certificates don’t matter. Certs help to reduce subjectivity from the judgement; therefore, they play a significant role in selecting new hires. Unless you plan to stay at the same firm until you retire, it is worth considering a few certs. On the other hand, it is true that not all certs are created equal, and many vendor-specific certs live in a “my stuff alone” world that exists only in the minds of the marketing people. And remember: more certs will not make you a better leader.
  • You can maintain your market value without speaking English. Nope, you just won’t. 85% of the new stuff in IT is documented, or even mentioned, only in English-language publications.

 False myths that should be forgotten by the employers:

  • It is enough to throw cash into the equation and you are done. Nope: if you push your people beyond a certain utilization, by the time they get home their brains will have no capacity left to absorb anything new – not to mention that learning during the night does not help the longevity of your family life. You need to invest money AND time.
  • It is okay to dictate what your people should learn. Nope: motivation is a key factor here, so let them pick at least half of the curriculum. (And turn a blind eye to the report when you find out that the most popular course on the training site you bought a site license for is DSLR photography.)
  • It is okay to keep a team on a dying technology if you pay them well. Nope: when you have enormous tech debt that has solidified into a dinosaur, the people maintaining it will hold you hostage and fight to the last bullet to keep the old system alive, i.e. to keep their jobs. Let them off the hook by offering them a retraining package.
  • You always have to have a study contract. Nope: its power to retain people is negligible as long as the external market thinks your employee is worth more. I recall a case when a firm threw in 60k USD per student in a graduate program and saw half of their grad population leave within 18 months despite the study contract. Sure, you want to keep one if you finance an executive MBA program.

How to measure the effect of training, i.e. what it is worth: there are two conversions involved here:

  1. how much knowledge actually remains in the heads of the students.
  2. how much of the newly gained knowledge is transformed into business value.

The first is a negative exponential, i.e. 6 months after the training – unless you do something about it – it is approaching zero. People simply forget. You have to allow them to practice.

The second is the difficult part: the closest to the real thing is to calculate how much it is worth to be on the market with a new product first (which, in the case of financial services, is roughly anything built on IT) vs. reaching the market several years after the pioneers have harvested it. Another proxy is how much it is worth to avoid a major system failure, say when your web banking site is inaccessible for a day or two. (Okay, this is the value of the system, where training is just a part of the story.)

How much should you invest to preserve the value, and who should invest it? If you treat your people as fixed assets with a 12-year depreciation period, and your annual fully loaded cost is, say, 60k USD per employee, then you should spend 4-5k USD AND 7-8 days per year (covering actual training cost AND opportunity cost, which is cca. 230 USD per day in our case) on training to maintain their edge. The key is to keep both parties’ skin in the game: training helps the employer sustain the value of their people, but the same is true for the employee, therefore he/she should throw in effort and money as well. An easy-to-implement rule is that the firm pays for the first attempt to pass an exam, and the employee pays for all subsequent attempts should he/she fail the first time.
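A quick sanity check of that math in Python; the split between direct training spend and opportunity cost is my interpretation of the 4-5k figure, and the exact breakdown is a judgment call:

```python
# Employer-side math from the paragraph above (a simplification).
fully_loaded_annual_cost = 60_000   # USD per employee per year
depreciation_years = 12
workdays_per_year = 260

annual_backfill = fully_loaded_annual_cost / depreciation_years      # ~5,000 USD
daily_cost = fully_loaded_annual_cost / workdays_per_year             # ~231 USD/day, the "cca. 230 USD" above

training_days = 8
opportunity_cost = training_days * daily_cost                          # ~1,850 USD
direct_training_budget = annual_backfill - opportunity_cost            # ~3,150 USD (assumed split)

print(f"annual skills 'depreciation' to backfill: ~{annual_backfill:,.0f} USD")
print(f"opportunity cost of {training_days} training days: ~{opportunity_cost:,.0f} USD")
print(f"left for courses, exams, conferences:     ~{direct_training_budget:,.0f} USD")
```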

How to get the training delivered? Finally, a few rules of thumb:

  • To make sure the theory gets mixed with the firm-specific environment, internal expert teachers are a great choice. But again, teaching should be treated as real work, not just a hobby of the expert. Otherwise their enthusiasm will vanish within 6 months if the system punishes them for doing the right thing.
  • E-learning sites: forget about site licenses, buy a limited number of slots and make it clear from the beginning that people who do not use it will lose it. Maintain the perceived value. (Anything that comes for free has no value.) Track the usage, who studied what and when. Collect feedback on the trainings, maintain learning paths. 
  • Demand a cert right after the training. Should you let the trainees procrastinate, their chance to pass the exam will vanish exponentially.
  • Do not overspecialise! T-shaped skills easily beat single-minded “Dirac delta” type knowledge (knowing EVERYTHING about a very narrow field approaching nothing – vs. politicians, who know nothing about everything).
  • The Chinese got it right: “Tell me and I’ll forget, show me and I may remember, involve me and I will understand.” Treat training as a project (even with its own PPM code if you like) and let it happen preferably right before the project that will actually use the new skills.
  • Besides technology, learning process-related skills is a must. Yes, processes may taste like eating sawdust (until you have tried accounting), but they can save the day. Remember: more IT systems go down due to a screwed-up change than due to HW errors.
  • Since we are humans, soft skills are key even if you are a foot soldier. The need for these skills increases exponentially if you start managing other human beings.
  • All the animals are equal: if you demand a skill from your new hires, ask the existing staff to pick up the same skills, but give them a grace period.

 As always, I appreciate your comments.

 

If you take the red pill - Dilemmas around software defined storage

red_pill.JPG

Summary: Offerings from public cloud providers have exposed the inherent complexity and poor agility of comparable infrastructure services built by Enterprise IT. For this reason, internal clients are less likely to tolerate provisioning times measured in months, let alone the hiccups when something goes wrong while fiddling with the multi-dimensional puzzle of compute, storage, network, hypervisor and its monitoring. This has forced mid-market buyers (and enterprises) to consider hyper-converged infrastructures. This article pulls together a few observations around HCI, focusing on its software-defined storage (SDS) subset, and makes a prediction about its future.

Problem statement

Storage demand follows an exponential pattern. Without working data governance and IT cost cross-charging in place, getting any data deleted is near mission impossible. The business does not like to make irreversible decisions, so they keep every bit of information since the Magna Carta.

The old “for want of a nail” proverb is still alive in Enterprise IT: an issue with a single interconnect cable can hold up a test DB that is a must to meet a business target. Enterprise IT must come up with something better. An abstraction from the HW is needed.

The promise of SDS

Vendors make the following promises related to SDS:

  • Enormous storage capacity with very big IOPS
  • Low and predictable (load independent) latency
  • The ability to go very big (10+ PB) while doing it in small installments if necessary
  • The ability to go small, ie. serving micro DCs at the edge of the enterprise
  • Unified monitoring for all components (compute, virtualization, network, and disks)
  • Automation of repetitive tasks like provisioning and maintenance by using APIs instead of UI
  • Features like thin provisioning, deduplication, snapshots, async replication etc.
  • Doing all the above with high resiliency, more agility at a lower price than their dedicated SAN counterparts

 

How is it achieved? Software-defined storage fits into the infrastructure commoditization theme, building on ever faster CPUs and the mass adoption of SSDs. It uncouples storage resources from the underlying hardware platform: SDS is designed to run on industry-standard x86 servers, removing the software’s dependence on proprietary hardware (e.g. FC and iSCSI).

Decisions to be made

Ok, you decided to put some of your payloads on SDS. Let’s have a look at the decisions you will have to make before your new storage is ready to serve you.

The workload you want to serve

SDS will provide you with block, file-system or object storage, ready to serve practically any workload. In practice there are impediments from unexpected angles. While it is possible to place a large database on a virtual machine, thus optimizing the utilization of your compute resources, the licensing of a major RDBMS provider is an inhibitor: they count not only the virtual cores that your DB is using but all the cores in the given cluster that it could possibly use. This subtle difference makes it prohibitively expensive.
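A tiny illustration of why that licensing detail matters; the core counts and the per-core price below are invented, not any vendor's actual list price:

```python
# Why per-core licensing can kill the business case for virtualizing a big DB
# (illustrative figures only).
price_per_core = 20_000          # USD, assumed license list price per core
cores_db_actually_uses = 16
cores_in_whole_cluster = 128     # what some vendors count instead

print(f"licensed by vCPUs used:   {cores_db_actually_uses * price_per_core:>12,} USD")
print(f"licensed by cluster size: {cores_in_whole_cluster * price_per_core:>12,} USD")
```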

The natural next step from SDS is hyper-converged infrastructure (HCI). It promises optimal CPU and RAM utilization by allowing the combination of compute and storage workloads on the same x86 box. The problem is that it blurs the boundaries between the storage ops and the compute/network ops departments. You either change the org structure or swallow the bitter pill and stay away from HCI. Again, some RDBMS providers make the choice easier by charging for all cores in all CPUs in the given HCI cluster, even if there are a dozen other workloads they have nothing to do with.

Appliance vs. reference architecture

As the name implies, SDS is software that is supposed to run on any x86 server. Your preferred HW vendor will want a part of this business and will make claims about how good they are with the SW part (or will bring in a boutique service provider as a subcontractor). Buying the SDS solution and the underlying HW (and network) from different vendors may defeat the purpose of off-loading the burden of compatibility testing across the various layers of your infrastructure. This pressure paves the road to appliances, which come hand in hand with vendor lock-in. As a counter-force, your procurement will issue an RFI for the next petabyte of the same SDS and may find cheaper HW. Some SDS vendors decided to settle this case by offering the management SW only with their appliance (while still selling the SW component alone if you like). Bottom line: as a minimum, buy the HW and network components from the same vendor and look for proven alliances between the SW vendors and their HW counterparts.

Storage tiering

It makes sense to use the fastest (and most expensive) storage for the most frequently used data. On the other hand, a mixed storage pool (combining NVMe with SSD and HDD) may take away some of the functionality you could save money with (e.g. snapshots). At the end of the day it boils down to the desired mix of hot, warm and cold data, the price tags, and your appetite for complexity when it comes to chasing performance bottlenecks.
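To make the hot/warm/cold trade-off tangible, here is a toy blended-cost calculation; the tier shares and per-TB prices are invented placeholders, not quotes:

```python
# Blended cost of a tiered storage pool (illustrative prices, per usable TB per month).
tiers = {
    #  name:        (share of data, USD per TB per month)
    "NVMe (hot)":   (0.10, 80.0),
    "SSD (warm)":   (0.30, 35.0),
    "HDD (cold)":   (0.60, 10.0),
}

blended = sum(share * price for share, price in tiers.values())
all_flash = 35.0   # what an all-SSD pool would cost per TB/month in this toy example

print(f"blended cost:  {blended:.2f} USD per TB/month")
print(f"all-SSD cost:  {all_flash:.2f} USD per TB/month")
print(f"saving from tiering: {(1 - blended / all_flash):.0%}")
```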

Redundancy and backup

The question is whether you want two copies of your data, three copies, or even four (two in the primary site and another two in the DR site), or whether you opt for erasure coding. When talking about petabytes, there is a palpable difference in the cost. I think two copies and a backup on disk is okay.

If you have ever tried to restore a 150 TB DB from tape, you will have an urge to find something faster. As HDD prices keep falling, it may make sense to move away from tape and archive to HDD instead.
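The raw-capacity difference between these protection schemes is easy to underestimate at petabyte scale; a rough sketch (assuming 1 PB of usable data and a common 8+3 erasure-coding layout) is below:

```python
# Raw capacity needed for 1 PB of usable data under different protection schemes
# (illustrative only; real overhead depends on the SDS implementation).
usable_pb = 1.0

schemes = {
    "2 copies (replication)":        2.0,
    "3 copies (replication)":        3.0,
    "4 copies (2 sites x 2 copies)": 4.0,
    "erasure coding 8+3":            (8 + 3) / 8,   # ~1.375x overhead
}

for name, factor in schemes.items():
    print(f"{name:32s} -> {usable_pb * factor:.2f} PB raw")
```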

Where to implement which functionality

There are a bunch of storage-related functions that multiple layers of your infrastructure can do for you. The rule of thumb is simple: do not implement the same functionality in two layers. The following list names some of the choices you must make while designing your new storage:

  • Compression – RDBMS or storage: compression takes away CPU cycles that your RDBMS provider charges for, while they offer the broadest choice in the way you compress your data. If you choose storage, you may want to test the performance penalties on your SDS before betting on it.
  • Encryption (in place) – application, RDBMS or storage: there are performance penalties regardless of where you implement this; the key concerns are certificate management and the fact that you still need to look after data encryption in flight.
  • Snapshots – RDBMS or storage: high-end SANs did it a decade ago, saving lots of disk space. As SDS matures, it seems the right place for this functionality. The big deal is how fast the end-to-end process is. For the record: snapshots are not an alternative to backups.
  • Inter-site replication – distributed file systems, RDBMS or storage: if your goal is to cover DR scenarios, storage might be the right place for this, while some RDBMS solutions are also worth considering. If you want a definitive media library across multiple sites, a DFS is the best choice.
 

A word on networking and DC infrastructure

It is possible to run each node with a single NIC but in practice you will end up with 6 NICs per server. The reason is that you will want speed and redundancy and to isolate the payload related traffic from the traffic needed for this redundancy and management/monitoring.

DC infrastructure – power, cooling, physical space for the new racks, or bandwidth – is often treated as a given, while these are finite resources. Running out of free 100 Gbit ports can be as painful as not having the servers to hook onto them. (40 nodes with 6 ports each – do the math…)
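The port math is worth actually doing; a trivial sketch with assumed figures:

```python
# Switch port math for a 40-node SDS/HCI cluster (assumed figures).
import math

nodes = 40
nics_per_node = 6                      # payload, redundancy, management/monitoring
ports_per_switch = 48                  # an assumed, common ToR switch size
ports_needed = nodes * nics_per_node   # 240 ports

switches = math.ceil(ports_needed / ports_per_switch)
print(f"{ports_needed} x 100 Gbit ports needed -> at least {switches} switches, "
      f"before uplinks, spares and growth")
```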

The maturity of your IT staff

While the principles are similar, the implementation is new, therefore your IT staff will need to learn SDS. Give them time to get trained and a sandbox they can experiment with, and make sure you have a vendor with skin in the game to support you when the shit hits the fan.

Potential pitfalls

There are situations where SDS will not fulfill your - potentially unrealistic - expectations.

  • It is not clear how long a snapshot is supposed to be kept alive. I did not find any example where they went beyond a few weeks. This raises the question of how often the developers are willing to move from an old test DB to a new one. The answer depends on how seamless your DB provisioning is.
  • There are functions that are available only after a data evacuation, that is, you need twice the capacity while you do the reconfiguration. This can be a showstopper.
  • The small print again – there are functional combinations you cannot do (e.g. mixing fine and medium granularity block sizes within the same storage pool), or a tiny bit of extra HW that will require you to shut down your servers during installation (e.g. a battery pack for the NVDIMM modules).
  • As usual, trouble comes when you do a version upgrade of your SDS SW. If you want to make one thing bulletproof, it should be the upgrade process.

 

Do the homework

You need to understand the drivers of your vendor: where its core competencies are, where its profit comes from. Make sure this is their core business rather than an “I also have it” exercise they may drop once the hype is gone. Check out public references and talk with others who have already implemented the proposed technology. Never believe a vendor just because they claim something.

Best practices

  • Leave some disk bays empty in the storage nodes. This will allow you to respond to an immediate storage demand with ease at the expense of buying more nodes than necessary.
  • Do not spare money on the small items like NVDIMM that become essential for some of the sought-after functionality like efficient snapshots. Purchasing and installing this thing can be so time consuming that you drop the whole idea.
  • Check the compatibility with 3rd party add-ons, particularly with backup solutions.
  • Look for choices major public cloud providers have embraced. Their testing might be more rigorous than yours, and later, if you want to bridge your DC to their cloud, it may come in very handy to have the same underlying storage technology.

The conclusion

Businesses and the SW developers serving them have been trying to escape the drag of infrastructure realities since the dawn of computing. It’s like the soul trying to leave the body behind – it did not really work in the past. Public cloud providers pour enormous investments into their IT infrastructures to make them more capable and more flexible at the same time. Enterprise IT cannot escape the inevitable consequences of these developments by hiding behind the shield of legal protection. Defending an outdated org structure or sticking to an old technology-process combo will not cut it. They have to apply similar concepts and similar technologies if they want to stay competitive in the long run. They may not move every workload to a cloud provider, but they shall become one themselves.

As always, I appreciate any feedback on this post.

TLDR #6.1: The hurdles of creating a private cloud

The following post is about the pitfalls of creating a private cloud. My aim is to list some of the potential pitfalls down the road of implementing a private cloud, not to take away your ambition. Quite the opposite: you are running against time. Once the regulatory environment (in CEE) lifts the restrictions on the usage of offerings from public cloud providers, you are going to face formidable competition from offerings that have been honed for ten-plus years. You have to find niches where you can beat them before this change happens. So here we go:

Problem #0: I added this paragraph after receiving an interesting comment on the original post. You do not actually know why you want to have a private cloud. The elephant in the room: you have to have applications that can utilize the capabilities of the new infrastructure – most notably auto-scaling to match the actual requirements. And this is a way bigger issue than the infrastructure itself.

Problem #1: the foundation is insufficient. Your virtualized environment is probably the product of several years of patchwork technology choices and makeshift automation fragments. You are tempted to listen to the siren song of the vendors’ sales folks and downplay the importance of dreary things like processes, let alone automation, your mind being spellbound by the big, shiny iron.

 Ask yourself these questions:

  • Do you run a service-based operating model? To be less academic, do you have a service catalog that is actually used by your customers?
  • Do you have technology standards that you can actually enforce, or do you bend to any exotic request? (Imagine your test matrix with 3 hypervisors, 4 server OSs and 2 application servers.)
  • How long does it take to serve a new request for a VM? If the answer is 4+ months, then identify what you will (have to) change if you want to bring down this time to a few hours.
  • Do you know your existing compute-storage-network capacity and their utilization?
  • Do you know what the workload is that consumes the above capacity and can you pair the applications with their infrastructure layer? (do you have an up to date technical asset management DB?)
  • Do you have a well-documented load balancing, High Availability and Disaster Recovery service that you provide to your clients? You can be assured that they will not settle for anything less than they have today.
  • Do you monitor your current physical and virtual infrastructure and the application layer within the same process framework and with the same tools? You will need to provide these services in your private cloud as well, preferably via the same pane of glass and by the same people.
  • Do you have a ballpark idea of the cost of a VM you produce? Do your clients care about the cost, or does it not matter at all?

If the answer to any of these questions is no, then you have homework to do as part of your private cloud project.

Problem #2: treating the effort as if it were just an automation add-on on top of your existing virtualized environment. There are two issues with this approach:

  • Provisioning and decommissioning: without changing the underlying provisioning processes, your new private cloud offering will feel very similar to the existing (and loathed) physical HW provisioning. Equally important: as my favorite band once put it, “When the music’s over, turn out the light”. You have to make sure that unused capacity is returned to the pool, otherwise you will run out of it very soon. And here lies a paradox: you have to gain your clients’ trust that they will get another compute node when they need it, and that they will get it fast – and you need this trust by the time you roll out your first private cloud VM. (Otherwise they will stick to their assets regardless of whether they actually do anything with them or not.)
  • Procurement: Imagine that a public cloud provider tells you to wait a few months with your request for the next VM, claiming that they need to get the purchase approved, then run a procurement process to select the HW vendor, wait a few months, check if there is any free rack available in their DC, and then they will be happy to serve you. You actually do the same when you advertise your fancy new stuff with lightning-fast provisioning (say two days vs. the current 4 months), then add 4 months in the fine print since you did not change the supporting procurement process. The key is to build capacity WITHOUT knowing who will use it. For this you will have to convince your finance department to run the shop as if it were a mini service provider, i.e. not insisting on distributing all costs “somewhere”.

 

Problem #3: driving the whole effort with an engineering-only mindset. “Build it and they will come” is carved into many project tombstones. You should not forget about the payload, and WHY and how this payload will be moved to the new environment. Ask the following questions:

  • What will be the motivation of your clients to move to your new offering? The business could not care less whether a given workload runs on top of physical hardware, on a traditional VM or in your private cloud, especially when there is no established charge-back model in your company. Unless there is a compelling reason, any migration effort that takes key people away from creating new business functionality will be considered an impediment to their progress, i.e. it may be very slow.
  • How fragmented is the application portfolio from an infrastructure-requirement standpoint? The more strictly you stick to the new standards to reduce build and maintenance complexity, the bigger the migration effort becomes, especially when there is a large amount of technical debt piled up over the years under these applications. It creates a gap between the current and the target infrastructure. You need to find a sweet spot of requirements which is large enough to matter and has a reason to move (e.g. when the vendor is no longer interested in providing the fig leaf called extended support that covers your ass in front of the regulators).

 

Problem #4: the human factor. Your colleagues are not “resources”, they are human beings with their own skills, fears and agendas. Check out these questions while you put yourself in their shoes:

  • Does creating a private cloud – let alone a containerized compute platform – require the same skillsets, i.e. the same people, as the old-school physical environment? Spoiler alert: it doesn’t.
  • What is the chance that your current staff will pick up the new skills fast enough? Chances are, if they had this skillset, they would be somewhere else already.
  • Are you prepared to create/hire (let alone retain) a dedicated team of automation engineers (in fact developers) and process people to make it happen? (reallocating 20% of the bandwidth of your existing people won’t cut it.)
  • Are you prepared to handle the compensation gap between the above mentioned two groups?

The paradox is that you definitely need your existing staff to keep the business running. Some of this crew will be prepared for the new technology and processes, while others will fall behind and may even try to make the project fail. If you are still in the mood to create your own private cloud after the questions above, here are a few considerations for you.

The technical considerations

  • Do the homework and define a minimum viable product that answers the needs of a double-digit subset of the existing application portfolio.
  • Walk before you run, start with a pure play IaaS, and support containers only in phase II.
  • Be prepared to offer a very low number of offerings. As Henry Ford once put it: “A customer can have a car painted any color he wants as long as it’s black”. One hypervisor, two guest OS-s (RHEL and Windows are safe bets), SAN based storage with basic HA and DR support, 3 T-Shirt sizes, IaaS only.
  • On the other hand, be generous with RAM and be prepared to offer fast and reliable provisioning with functioning monitoring and management tools. Make sure you have enough bandwidth to serve these VMs.
  • Focus on seamless provisioning with a minimal number of manual steps. It makes little sense to have beautiful scripts spinning up the core VM image if it takes another day to apply all the missing patches or if your DNS propagation needs a day. Automation means using APIs rather than clicking in GUIs (see the sketch after this list). Make sure your automation tools are in sync with those used by the application-layer folks for their build process.
  • When you think about automation, treat all layers equally, i.e. include storage and networking in your automation efforts. Avoid the doom of Conway’s law, i.e. breaking the process along the borders between the various org units responsible for creating a solution. (Imagine when the VM automation process has to create a SNOW ticket to get the disk and an IP address.)
  • Be prepared to answer security considerations: can all apps coexist on the same physical hardware and subnet, or do you need physical isolation between application tiers?
  • Integrate with the existing core services like the corporate directory, firewall, monitoring tools and CMDB, while keeping the load balancers in scope.
  • Provide HA and DR support from the beginning. If the developers are accustomed to a storage-based DB replication, keep it. A naked VM might be good for PoC and testing purposes, but if it falls short compared to current offerings, it will be relegated to the above functions.
  • Go for overprovisioning and avoid reserved instances as much as the political environment allows.
  • Storage quotas are goodness, especially if they trigger a data lifecycle management effort and not just an outcry for more disks.
  • Create some rudimentary billing from the start. Make sure your stuff looks cheaper than the competing physical offerings. If the term cross-financing comes to your mind, team up with your Finance colleagues and make it happen! (Of course, this helps only if there is a cross-charge model in place already.)
  • If this project is a priority, then staff it accordingly. You cannot make it happen by reallocating 20% of the time of your existing people – that’s just tire kicking.
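To illustrate the “APIs, not GUIs” point referenced above, here is a deliberately simple sketch; the endpoint, payload fields and token are hypothetical placeholders standing in for whatever your virtualization, IPAM and DNS layers actually expose:

```python
# A minimal sketch of API-driven provisioning (hypothetical endpoints, not a real product API).
import requests

API = "https://cloud.example.internal/api/v1"     # hypothetical private-cloud API
HEADERS = {"Authorization": "Bearer <token>"}     # placeholder credential

def provision_vm(name: str, size: str, image: str) -> str:
    """Request a VM end to end: compute first, then DNS registration, in one call chain."""
    vm = requests.post(f"{API}/vms",
                       json={"name": name, "size": size, "image": image},
                       headers=HEADERS, timeout=30).json()
    requests.post(f"{API}/dns",
                  json={"hostname": name, "ip": vm["ip"]},
                  headers=HEADERS, timeout=30)
    return vm["id"]

# Hypothetical usage: provision_vm("app-test-01", "M", "rhel8")
```

The point is not this particular snippet, but that every step a human would click through in a GUI becomes a call another system can make, log and repeat.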

 

 And finally, a few DO-s and DO NOT-s

  • Make the business understand the key value proposition of the whole thing: it is agility, not cost! Get their long-term commitment; going back to the budgeting table every time is time-consuming. (except in cases like the current virus-triggered economic meltdown…)
  • Understand the pain points of your user community and create a unique selling point by easing this pain. You need friends to make it happen.
  • Understand what the current application portfolio runs on (down to the Java framework versions, RDBMS versions, application server and OS versions, storage requirements) and be prepared to serve a core subset upfront while resisting the urge to serve every other demand at the start (a sketch of such an inventory check follows this list). Agree on an MVP and make sure you have something tangible to offer soon, while not committing to unrealistic deadlines.
  • Track and coordinate with other key projects affecting the infrastructure. (eg. a corporate directory revamp, a firewall replacement or a simple DC move)
  • Do not start your project with buying a truckload of hardware. Most of your difficulties are not HW related anyway, and you might write the boxes off by the time the project is finished.
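
As a companion to the portfolio point above, a minimal sketch of the inventory check. The MVP constraints and the portfolio entries are made-up examples; the point is simply to see upfront which applications the first release could actually serve.

# A minimal sketch of checking the existing portfolio against the MVP platform.
# The MVP constraints and the inventory entries below are made-up examples.
mvp = {"os": {"rhel8", "windows2019"}, "rdbms": {"postgres13", "mssql2019"}}

portfolio = [
    {"app": "crm",      "os": "rhel8",       "rdbms": "postgres13"},
    {"app": "dwh",      "os": "solaris11",   "rdbms": "oracle19c"},
    {"app": "intranet", "os": "windows2019", "rdbms": "mssql2019"},
]

fits = [a["app"] for a in portfolio
        if a["os"] in mvp["os"] and a["rdbms"] in mvp["rdbms"]]
print(f"{len(fits)}/{len(portfolio)} apps could land on the MVP platform: {fits}")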

 I would like to thank Gabor Illyes and Zeno Horvath for their insight on this topic. As always, I appreciate any feedback or comment.

 

 “If life gives you lemons, make lemonade!”

 

A few days ago a colleague of mine from IVSZ asked me a few questions about the relationship between COVID-19 and the cloud. The answers below are my private views; I would be interested in what the LinkedIn community thinks about this question.

  • How will the coronavirus affect the (cloud) market in the short term?

 

The change touches three dimensions:

  • Areas where, at first hearing, nobody would have thought it was even possible: e.g. can a PE class be held online? My own experience says yes, it can.
  • The second dimension is decision making, in this case the speed of switching to cloud-based solutions: the Pató Pál style “no rush, we will get to it eventually” mentality around remote work and remote education has given way to a “let’s do something, right now” attitude.
  • The third dimension is the nature and ownership of the devices used: home office in many cases assumes that employees use their own devices for work (notebooks have vanished from the shop shelves and new shipments may be months away). The BYOD approach rules out any uniformity of device, operating system or software version, along with a number of previously carved-in-stone IT tenets (e.g. whether the user can be a local admin). The combination of these new conditions practically cries out for SaaS solutions.

 

  • What effect will the current events have on the (cloud) market in the longer term?

 

The new habits and solutions that emerged during the crisis are likely to stay with us and will accelerate the adoption of cloud technologies in the long run. Video conferencing, for example, will become an accepted communication tool for small and medium-sized businesses after the crisis. Both employers and employees can see that it works.

The other area where I expect a breakthrough is remote education, for the same reasons. Both areas are available almost exclusively through some cloud service, so indirectly the acceptance of cloud services may advance as much in a few months as it previously did in a few years. Let me illustrate what is going on in people’s heads with an example: at a Hungarian commercial bank, a very senior executive who previously showed no interest in Skype whatsoever (he had been offline for about half a year) became a regular user within days and will probably remain one after the crisis is over.

 

  • Do companies that are only now catching up on the cloud stand a chance?

 

The most important characteristic of today’s business environment is that the external parameters change at a speed that was unimaginable before. One factor of survival is the speed of the responses to these changes, for which cloud-based IT solutions are essential. The good news is the low barrier to entry (no up-front investment) and the fact that the required IT expertise moves “out of house”; both help a company move fast once it has finally made up its mind. The same can hardly be said about the speed at which corporate processes and corporate culture change. Here a top management that “can think in cloud terms” is indispensable. The cloud is only a tool; what you do with it is mostly up to you (company leaders).

 

  • Do companies have the knowledge to safely navigate the growth in the number of remote workers?

 

This knowledge (capability) is only partly in-house knowledge. In many ways the key is precisely that the executives who ordered remote work hardly had to think about whether the infrastructure of the telcos and the cloud providers (Zoom, Teams, Slack, Webex etc.) would cope with the suddenly doubled load. They assumed, so far rightly, that all of these would simply do their job. (This is far from true for the IT leaders of large enterprises running their own UCC platforms, VPN and remote access; they and their teams are fighting an enormous battle so that users do not notice how extraordinarily overloaded that infrastructure has become recently.)

 

  • What role can cloud technology play in restarting the economy?

 

COVID-19 dramatically accelerates the adoption of cloud-based IT, but its role in restarting the economy will probably be limited. The current crisis does not destroy actual production capacity (thank God this is not a war), so in theory the recovery can be as fast as the decline was. On the other hand, the crisis can (and does) wreak havoc in two key areas: one is the tearing apart of global value chains and the resulting erosion of trust (will the raw material or component arrive from China?); the other is the liquidity, and ultimately the survival, of the economic actors themselves, which will require state intervention and financial stimulus.

The Witcher – or the pitfalls of building a private cloud

The pitfalls of building a private cloud (in the manner of a Laár András verse)

 The little piece below is a lame, alcohol-induced blend of a Netflix series and an IT problem; one day I will write it up properly.

Choosing the technology

The first question is what we actually want to offer: IaaS (VMs, by their maiden name), PaaS (viva la DevOps), or perhaps containers, since the world is moving towards microservices. The problem, of course, is the past: a large enterprise has an application portfolio accumulated over several decades, sitting on dozens of database and operating system combinations, not to mention the well-established backup systems and the solutions providing high availability and disaster recovery. The emphasis here is on the diversity of platform solutions dictated by the application layer.

If you dig in your heels and declare that (at launch) you will offer only 3 kinds of VMs, with a single hypervisor, direct attached storage and no API, only a GUI, then you have defined a manageable scope, but you run the risk that nobody will give a damn about your product, because it is not compatible with the existing application layer. And of course nobody budgets for migrating their existing applications to the new private cloud, since the business keeps nagging them with the latest functional demands. (A classic example of how tech debt is born.)

So there is the urge to support multiple hypervisors, OSes, hardware sizes, SAN storage and functionality accessible via API. The trap is that the scope slowly starts to resemble the functionality of the public services offered by Amazon or Microsoft. And here is the catch: you are neither Amazon nor Microsoft. You are not competitive with these guys in expertise or in developer headcount, not to mention that they started working on the problem 12-14 years earlier.

Choosing the platform

Suppose you managed to talk the naysayers out of their favorite hypervisors and good old VMware remained (after all, that is what your people know). Like D’Artagnan, you can already prepare for the 3 o’clock duel, as you will have to support at least 3 virtualized OSes (without RHEL 7.x the Linux believers will burn you at the stake, Windows Server 2016 is also needed because Finance swears by it, and the DWH folks want Solaris or at least Oracle Linux). Meanwhile your boss calls to say that Hyper-V is needed after all, and unfortunately Kubernetes support as well.

Here you suddenly find yourself in the middle of another religious war: is plain vanilla Kubernetes the winner (straight from the pure source, and its price is decidedly competitive), or do you vote for OpenShift, arguing that it is Kubernetes with extras that make the operators’ life much easier? Except that it is an expensive, paid IBM product… As the Witcher would say after being bitten by zombies: damn it… (This is what I am listening to while writing this article: https://www.youtube.com/watch?v=hqbS7O9qIXE )

 

Automation

Well, the moment of truth has arrived: you have to develop a nice provisioning solution, and you should be able to spin the VMs down, not just up. Error handling, access management, quotas, all of this for four versions of 3 guest OSes, with some web and application server support, Java and .NET (you name it) frameworks, not to mention the teeny-tiny problem that the product should also be registered into DNS (oh, and the ToR switch ports are not virtual, and they have indeed run out… see the Witcher’s earlier remark…).

At this point, as in the old Gazdálkodj okosan board game (the Hungarian Monopoly), you unfortunately rolled a one: two of your four Python developers left for a freshly launched startup where even the fence is made of sausage (or will be, any minute now). Your remaining two developers indicated, with some justification, that the originally agreed deadline is toast and that they too have been approached by a headhunter.

The project sponsor let you know that one of the future users does not want to share the hardware with others; he needs dedicated iron. (He has essentially invented the concept of the reserved instance; the only snag is that if a few more people follow his example, you run out of iron.) The PM, guided by his infinite wisdom, suggests that for simplicity’s sake you carve this strongman’s demand off the available hardware. Hooray, now we have two clouds.

Since you were late and your one committed customer (the only one) needed the machine by a deadline, the capacity intended for them was simply reallocated to their project as bare metal. We are in a hurry, we don’t even need virtualization… The problem is that they took half of the DR site’s storage capacity, just because. From here on the setup is asymmetric, but never mind, we will handle it in software.

Tooling

Tooling essentially means everything you need in order to operate: provisioning, decommissioning, config management, monitoring. The trouble starts when you want to hand over your shiny new VMs to operations. The infrastructure folks indicate that this will be fine, provided you have integrated your stuff into the existing monitoring and management systems, read: vSphere. The SNOW guy also showed up and said the new VMs have to be in the CMDB that is currently being built (only a matter of months, it will definitely be ready). Long story short, you have to cut scope to be able to deliver something that looks like it works.

 

Pricing

Sensation: we have a working VM, it has been running stably for hours, although nobody is allowed to touch it, it is so beautiful… And then a finance person appears and asks how much this costs per VM. You know you have to be cheaper than plain hardware, otherwise nobody will buy your stuff, but if you spread the full hardware and storage cost (minus, of course, what was looted earlier) over the handful of VMs you currently use for testing, it will be very expensive indeed. You need a price. The good news is that you do not know in advance what size VMs the users will ask for, let alone how much disk they will want with them, so guessing it is. Cheap-cheap VMs, get them here…
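
The cost-allocation trap is easy to show with a back-of-the-envelope calculation; a sketch with entirely made-up numbers:

# Back-of-the-envelope sketch of the per-VM pricing dilemma; all numbers are made up.
total_annual_cost = 200_000        # HW + SAN + licenses per year
carved_out_share = 0.25            # capacity grabbed earlier for "dedicated iron"
remaining_cost = total_annual_cost * (1 - carved_out_share)

for vm_count in (5, 50, 500):      # test phase vs. a reasonably filled cloud
    print(f"{vm_count:>4} VMs -> {remaining_cost / vm_count:>9,.0f} per VM per year")

Spread over the five test VMs, the unit price is absurd; the same iron looks cheap once the cloud is reasonably full, which is exactly why the early price quote is guesswork.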

Expanding the capacity of the hardware fleet

Unfortunately nobody told procurement that you have become an internal cloud provider, so you get your iron with the same four-month lead time as anybody else. The snag is that over that much time anybody can get it, and measured against those 4 months, the one day it takes you to create the requested VM is not much better than sticking to the usual method. Then it turns out that your “no approval, just billing” proposal was binned, so a VM in the private cloud has to be requested exactly like traditional hardware, and getting the approval takes roughly the same amount of time. This is the day you start looking into the tanto, the short Japanese sword required for seppuku. Look on the bright side: we never finished the billing anyway…

The blow below the belt

The real problem arises when somebody gets the idea to compare your wares to the public cloud providers’ offerings, either in functionality or in price. The Lord bless the MNB recommendation that, for the time being, prohibits the use of public cloud providers in the financial sector…

TLDR #5: The World according to Garp

I have been tinkering with the taxonomy of the Hungarian IT job market for a while. My goal was to find dimensions that could assist job seekers to pair their strengths and aspirations with the available options. Last week a friend of mine asked me to meet his buddy, a talented young telco executive who was considering a position at a company I knew fairly well. The conversation evolved around the various aspects of the firm and the job, so I gave it a try and used this draft decision-support sheet to provide a framework for the discussion.

The four entities we evaluated were as follows:

  1. The firm – to be precise, a given site of the firm (in case of companies with a WW or regional footprint). One of the first decisions is whether you plan to stay in Hungary or not.
  2. The product or service the job is related to – in case of large companies there can be several product lines independent of each other, so you need to go one layer deeper.
  3. The job – most hiring managers hate to write proper job descriptions, while nailing the core aspects of the job is key to finding the right match.
  4. The employee – how your skills and preferences match the features of the other 3 entities.

The assessment of some of these buckets is dependent on the preferences of the person, eg. some just cannot live without being in the driver’s seat, while others do not care but do appreciate 3-4 days a week of Working From Home (WFH).

For this reason, most features in the following tables are neither good nor bad: eg. a small company can be as good as a giant, or integrating dozens of off-the-shelf components can be as exciting as building one of them at a SW vendor. On the other hand, I did some color coding where I had an opinion. (green = good, yellow = warning, red = stay away)

Most categories are straightforward, but there is one from a former big boss of mine that should be considered by every job seeker. He framed the activities he cared about in an IT organization like this: increase revenue, reduce cost or reduce risk. As a rule of thumb, stay away from any function that does not fall into these three buckets. The good news: more and more verticals realize that they might be the next target of a disruption, therefore they need to embrace digitalization faster than they had thought before. This means more appetite for ICT services.

I highlighted the business model, ie. the way the firm makes money. While burning cash is perfectly acceptable for a startup or a brand-new unit within an established firm, no firm can burn investor money endlessly. If there is no clearly articulated business model or the numbers look bad for several years, you can be assured a merger or acquisition is lurking around.

Pay attention to the ownership structure and any major changes in it. (eg. the departure of the founders is a sign of a coming storm.) In any case, such changes are the harbinger of turmoil in the culture of the firm, in most cases accompanied by a complete overhaul of the management ranks.

There are categories that tend to move in sync, eg. most disruptor firms deliver their service as SaaS, while most established Enterprise SW players need to maintain their on prem installed base while moving their core offerings to the cloud. Nevertheless, I kept them as separate items.  

After all these notes, here we go.

The characteristics of the firm

 A few comments:

  • Glassdoor reviews: check out if there is a large number of positive reviews that came in within a short period of time. Firms with a sluggish track record tend to beautify their score by gently pushing their staff to write nice reviews in a burst.
  • If the firm claims that they have a public cloud-based offering, check it out. The proof of the pudding is in the eating.
  • Size matters: an oil tanker is more likely to weather a storm than a speedboat, while the bigger the firm gets, the more process-driven it becomes. Processes and innovation are not close friends. It is not by accident that large firms create “Skunkworks” where they liberate their best engineers from the oppression of processes.
  • There are verticals where regulation is the norm (eg. law enforcement, pharma or investment banking). This is a “take it or leave it” situation: if you can live with it, they can be a great place to work for; if heavyweight regulation drives you nuts, choose something else.
  • The issue with any IT firm collecting revenues solely from Hungary is twofold: EMEA is roughly one quarter of the WW ICT spending, Eastern Europe is around 14% of the EMEA market, and Hungary is cca. 4% of CEE. Bottom line: the Hungarian market is tiny (0.1-0.2% of the WW cake, see the back-of-the-envelope math after this list). There is a glass ceiling to growth. The other problem is that the state is overrepresented in this mix.
  • Licensing (in case of a SW company): while everybody loves open source as a buyer, few people think about how these firms make money. As long as you are a giant who creates and gives away its IP (intellectual property) in order to innovate in your core business AND to boost the acceptance of your stuff and to find new team mates, while you make your revenue from something else, this is a great thing. As soon as you live from support and professional services only, you will realize its limitations. These scale linearly with the number of staff and are therefore capped at a 15-20% margin, while in the case of SW licenses there is no such linearity (until your competition arrives).
  • No hiring manager or recruiter will tell you about their toxic culture. You have to find it out for yourself. One method is the “Exodus meter”: checking the number of people on LinkedIn who list the target firm as their previous employer. If you find a large number who left within a short period of time, then there is something wrong. If detected, ask around.
  • Multiple SSC-s (shared service centers) mean a strong focus on cost control and risk mitigation, read internal competition and a potential fragmentation of functions where it is difficult to step up regarding product ownership.
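
For reference, the back-of-the-envelope math behind the “tiny market” claim, treating Eastern Europe and CEE as the same bucket, as the list item does:

# The chain of shares quoted above (approximate, illustrative figures).
emea_share_of_ww = 0.25   # EMEA ~ one quarter of WW ICT spending
ee_share_of_emea = 0.14   # Eastern Europe ~ 14% of EMEA
hu_share_of_cee  = 0.04   # Hungary ~ 4% of CEE

hungary_share_of_ww = emea_share_of_ww * ee_share_of_emea * hu_share_of_cee
print(f"Hungary ~ {hungary_share_of_ww:.2%} of the WW ICT cake")  # ~0.14%, within the 0.1-0.2% range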

The characteristics of the product or service

A few comments:

  • Bodyshop: the same profitability cap exists here as with open-source SW firms; on the other hand, this can be a good entry point to the firm you are sold to.
  • An on-prem-only product in 2019 is a yellow flag. Such a firm is likely to face the “Innovator’s Dilemma”. Many of these companies are dead by the time they figure out what went wrong.
  • Watch out for end-of-life technologies and end-of-life products. It is extremely difficult to build a career on top of these.
  • Decision making power is a zero-sum game: if you are not in the HQ, you need to “earn your bones” first before you will be listened to at the big guys’ table.

 The characteristics of the job

A few comments:

  • Whenever you are green with envy when you hear about the money sales folks take home, keep in mind: you are worth as much as your last quarter, and sales is not for everybody, it is a tough job.
  • If you want to be in the loop, you have to be in those meetings where decisions are made. If the HQ is in California, that means long nights away from the family. Do not fool yourself, you cannot get away with a single evening per week.

The characteristics of YOU

A few comments:

  • The key question is if you want to continue exploring, that is learning new stuff, new technologies and new people or you want to exploit what you already learned. Warning: you cannot fall into full exploit mode until the age of 60 or you will be stranded soon.
  • There is this freaking age thing: for a while you may compensate the decreasing RPM with increased torque, but around 55 you suddenly face a perfectly able competition 15-20 years younger than you. The good news: there is a growing appetite for reasonable ICT people, so not all hope is lost, just trim your expectations.
  • A word on the big money waiting for you in London, New York etc. Yes, you can earn up to 3 times more, but there is a caveat: your living costs will increase the same way, housing in particular. When you move out of the city, you save on the rent but spend your life commuting. At the end of the day, your overall standard of living may stay the same.
  • There is a major difference in how the job market works in NY and in Budapest: in NY contractors earn significantly more than FTEs, since the market knows that they run with a lower utilization (ie. they earn nothing between two assignments). In Hungary contractors earn the same as or less than FTEs, and lack all the perks that come with full-time employment.

 Checking all these points above took us a good two times two hours of discussion, but I think it was worth it. The fun begins when you create small matrices cross-referencing the items in the various dimensions.

As always, I appreciate any feedback or comment on this blog post.

 

PS: “The world according to Garp” is one of my favorite books from John Irving. This blog post is the ICT job market according to my knowledge, hence the title.

TLDR nr. 4: The tale of the area integral

Summary: This blog post examines the rationality behind the recurring behavior of firms not willing to accept the wage inflation in the case of their existing employees (ie. “urging” them to leave by keeping them on a flat comp for years) while paying the (much higher) market price for the backfills. Is this the right thing to do, or does it hurt the company’s bottom line in the long run?

It’s been bugging me for years when I lose a colleague in the middle of a project for compensation reasons, then spend significant time and effort to recruit and bring up to speed the backfill, while I know that the compensation of the new hire is higher than the money the departed person would have stayed for. Counteroffers after the resignation are mostly in vain: by the time the employee resigns, she is emotionally disconnected. But the most damaging case is when the employee leaves as an individual contributor (IC) and then returns as an officer within a year because she does not fit in the IC pay range anymore (a smack in the face of the merit-based promotion process). At some firms the problem even has a telling name: “Loyalty tax”.

The output and compensation math of these backfills looks like this (pls. find the assumptions I used in Appendix A):

At first look it seems obvious that attrition hurts delivery and backfilling costs a lot, so if we know that the root cause is compensation related, it makes sense to increase the salaries. If this is correct, then why do companies do it so grudgingly? To test this theory, I built a model to see what happens to a firm’s profitability if it increases the compensation of the existing staff to make them stay longer. (Pls. find the parameters in Appendix B.)
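
To make the comparison tangible, here is a minimal sketch of the compensation side of the math, using the Appendix A assumptions (roughly 2 months of gross comp to recruit, a 20+% premium for the backfill, a 4-year horizon). The monthly salary is normalized to 1 and the raise percentage is only an illustrative input; the delivery lost during the 4-6 month ramp-up and the 50% output of the last month are not even counted here.

# Minimal sketch of the backfill vs. retention math, per the Appendix A assumptions.
# Monthly gross comp is normalized to 1; the 10% raise below is an illustrative input.
MONTHLY_GROSS = 1.0
HORIZON_MONTHS = 48                      # the 4-year window used in the post

def cost_keep(raise_pct: float) -> float:
    """Total comp cost of keeping the current employee with a raise."""
    return HORIZON_MONTHS * MONTHLY_GROSS * (1 + raise_pct)

def cost_backfill(premium_pct: float = 0.20) -> float:
    """Total comp cost of letting her go and hiring a 20+% more expensive backfill."""
    last_month = 1 * MONTHLY_GROSS                 # notice period, paid in full
    recruiting = 2 * MONTHLY_GROSS                 # cca. 2 months gross: agency fee, referral, recruiters
    new_monthly = MONTHLY_GROSS * (1 + premium_pct)
    return last_month + recruiting + (HORIZON_MONTHS - 1) * new_monthly

if __name__ == "__main__":
    print(f"keep with a 10% raise    : {cost_keep(0.10):5.1f} monthly salaries")
    print(f"backfill at a 20% premium: {cost_backfill():5.1f} monthly salaries")

In this sketch the backfill route already comes out more expensive over the 4-year window, even before pricing in the lost delivery.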

I created scenarios along the following lines:

  • The firm can maintain its profitability while the payroll increases (y/n).
  • The firm is willing to share this extra revenue with its employees to make them stay longer (y/n).
  • Some condition will change within max 4 years (function, manager, location, private status), therefore we will no longer be comparing apples to apples; hence the cut-off at 4 years.

Here is what I ended up with:

Pls. find two illustrations below for Scenario A1, when the employee leaves in the middle of year 1, and for the hunky-dory scenario, when the firm maintains profitability, shares it with the employee, and she stays for 4 years.

  • “A” scenarios: most firms will not increase comp when the market is not willing to pay more for their stuff. In this case the firm either automates as many functions as it can, moves the bulk of the development into nearshore or offshore development centers (this is the driving force behind bringing IT jobs to Eastern Europe), imports cheaper workforce from other countries OR accepts less qualified backfills. 
  • “B” scenarios: the firm cannot increase its prices but must adjust the payroll. Firms hate this and do it only as a last resort, eg. with a fixed-fee support contract where the client fences off cheaper alternatives. Other examples are when the employee has a monopoly on skills or expertise, ie. has a strong negotiating position (cannot be replaced easily).
  • “C” scenarios: this is the best option for the firms – increase the prices while keeping the payroll flat as long as your employees are willing to accept it. This may work during economic downturns.
  • “D” scenarios: this is when the firm is willing to split the extra revenue with its employees. This looks like the optimum solution; find a split that both parties accept, then focus on the motivation factors. The dilemma is how to make sure that the extra money has an impact. 

 The following numbers are based on fictitious cost and revenue components; the point is their relative size and how they react to changes in the inputs. You can find the model on this link: Wage inflation v2

I have doubts about applying the bell curve to small populations and then moaning about the consequences of forced ranking, but I maintain the view that in an ideal case a firm has a clear understanding of the achievements and the future potential of its people. In this case I would pick the midpoint between the consumer price index and the wage inflation and would disperse this increase to the upper 50% of the staff, tilted towards the top 25%.
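
A tiny sketch of this dispersal rule with made-up index values (the 70/30 tilt is my own illustrative choice):

# Sketch of the raise-dispersal rule above; index values and the tilt are made up.
cpi = 0.035             # consumer price index (illustrative)
wage_inflation = 0.085  # market wage inflation for developers (illustrative)
pool_rate = (cpi + wage_inflation) / 2     # midpoint -> raise pool as a share of payroll

headcount = 20
payroll = headcount * 1.0                  # monthly salaries normalized to 1
pool = pool_rate * payroll

top25, next25 = headcount // 4, headcount // 4
top_share, next_share = 0.7, 0.3           # tilt towards the top 25%
print(f"raise pool: {pool_rate:.1%} of payroll; "
      f"top 25%: +{pool * top_share / top25:.1%}; "
      f"next 25%: +{pool * next_share / next25:.1%}")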

There are cases when your employees do not leave but lack intrinsic motivation to do their best (eg. watch Youtube for hours or come in at 11AM). Several top brains - Herzberg (two factor theory), Deci & Ryan (self-determination theory), Dan Pink, Seth Godin to name a few – have proven that money itself does not cut it.

Most departures are caused by multiple factors, not just money. BUT: a good developer is approached by recruiters on a monthly basis, and the temptation to listen to their siren song increases with the time spent without any uptick in base comp.

Bottom line: Covering at least the inflation for good performers makes sense. (the objectivity of performance evaluation could be another post) As always, I will be glad to receive comments on this post.

 

Appendix A - Assumptions

I made assumptions here that are reasonable for Hungarian developer jobs today. Here we go:

  • Core assumption #1: the imbalance between supply and demand for skilled software developers keeps cranking up their price tag. This local demand is a reaction to the same wage inflation in developed economies.
  • Core assumption #2: The leaving employee was a solid performer, ie. someone the firm wanted to keep.
  • She left mostly for comp reasons. Her dissatisfaction with the subpar compensation was the primary trigger, while the motivation side was okay. (ie. no graveyard shifts, no regular weekend work, fine boss, cool environment and mostly a cool job with ownership)
  • The function the leaving employee filled is still necessary, and the firm is NOT in crisis mode, when any voluntary departure is cheaper than a layoff with severance payments. Caveat: your best people will be the voluntary leavers, while the not-so-good might stay.
  • It takes 4-6 months to ramp up a new employee to do the same thing his/her predecessor did.
  • Even a devoted employee takes her foot off the gas during her notice period, hence I used a 50% multiplier for her last month.
  • The cost of a backfill is cca. 2 months’ gross base compensation. You pay it as a mix of agency fee, referral bonus and the cost of your own recruiters, but you pay it anyway.
  • The backfill will cost you 20+% more than the guy who left. 15% is the psychological threshold, read: a risk premium for jumping ship. (I have seen several cases beyond 35%.)
  • I calculated with a 4-year period; in developer terms this is a long tenure.

Items that I did not include in the model but that I think would strengthen my argument:

  • The model does not include the time investment (that is also money) needed from the existing employees to ramp up the newcomer.
  • I left out the most important hidden cost, the lost revenue from the delayed delivery. The reason: I could not produce a model that would be accepted by any finance people. 
  • I skipped the value of the institutional knowledge that left with the leaving employee.
  • I ignored the collateral damage: the cost spike will create internal tensions if it leaks out. In Hungary it surely will. A bad thing.
  • I ignored the gains and losses from the FX rate fluctuations (ie. when the budget is in USD while comp is in HUF.)

Appendix B – input parameters in the model

[Figure: input_parameters.png – the input parameters of the model]

 

Appendix C - references

Hungary Gross Average Wages Growth https://tradingeconomics.com/hungary/wage-growth

http://goaleurope.com/2016/10/26/software-developer-salary-europe-survey-results-2016/

https://www.daxx.com/blog/development-trends/it-salaries-software-developer-trends-2019

https://qubit-labs.com/average-software-developer-salaries-salary-comparison-country/

https://stackoverflow.blog/2018/09/05/developer-salaries-in-2018-updating-the-stack-overflow-salary-calculator/

https://www.computerworld.com/article/3182268/it-salary-survey-2017-tech-pay-holds-tight-for-now.html

https://www.hwsw.hu/hirek/57000/hays-salary-guide-fizetes-berek-bertargyalas-hr-fejlesztok-2017.html

https://www.hwsw.hu/hirek/58596/hays-salary-guide-informatikus-fizetes-berek-2018.html

https://www.hwsw.hu/hirek/60222/informatikus-fejleszto-rendszermernok-fizetesek-berek-2019.html

https://cloud.email.hays.com/hu_salary_guide

https://en.wikipedia.org/wiki/Self-determination_theory

https://en.wikipedia.org/wiki/Two-factor_theory
