
Chris Shumba: The crucial role of data infrastructure
Written by
Chris Shumba
February 13, 2025
How many of you know that a tree grows in two different directions at the same time?
The words are gravitropic, growing down, and phototropic, growing up. There are two systems acting on a tree – the root system and the fruit system.
For a lot of people, we only get to see the fruit and enjoy the fruit. For the purposes of this presentation, the fruit is the models, insights, forecasts and recommendations.
But for me, the job is the things that people don’t really see, the things that people don’t even know exist sometimes – the pipelines, the cloud infrastructure and making sure the automated reports get to the right place at the right time.
Most of you probably didn’t even know I existed until I came up on stage.
Starting work in football
Our set-up at Manchester United, in terms of the data science team, is that we have four people. We have two data scientists, who work closely with the coaches and are almost front-facing, then we have a Machine Learning Scientist, a technical lead, called Andrew Davies.
I manage a partner team of Data Engineers who are not employed by us, but we partner with them to create our data infrastructure. I collaborate with the data scientists, I do data engineering, platform engineering, data strategy, data architecture, data modelling, IT security, cost control.
I do a lot of things and there are not really enough hours to do my role but you just have to make it work. I enjoy it, I absolutely love it. The most important thing is you have to be teachable and that’s what shocked me about football.
I had never worked in football before and was really excited to start my role. I had all these IT, architecture things before I started, I interviewed for this role for about four months and probably got to meet everyone in the organisation.
But they were a bit like, ‘You’ve never worked in football.’ When I started, one of the reality checks I got was that technology doesn’t really matter, it’s really about knowing the organisation.
So I’d try and get meetings with anybody I could around the training ground – physical performance team, medical, any director I could get, finance, video analysts, IT, security – and the question I would ask is, ‘What is a normal day in your life?’
Establishing a data strategy
That leads on to how we established a data strategy as a team. For me, coming in, the data strategy is really focused on five key areas (I didn’t invent this, I have just used this for the last six years): infrastructure, people, tools, organisation, processes.
The whole thing is to go from being a data-reactive organisation to being a data-informed organisation.
Whenever I meet a Director, they always ask me, ‘What help do you need from me?’ I say, ‘If you can think of the information you need before making a decision, and we are providing you with that information, then we are doing our job.’
That is why I always like the term data-informed and these are the sentences I use to complete what a data-informed organisation is:
When it comes to infrastructure, it’s that data is collected, discoverable, reliable, understood, compliant and actionable.
When it comes to people, you understand that you have the right people with a good attitude to data. Data doesn’t make the decisions for you, so the people are so important.
Then you have the tools. You have to make sure you have the modern tools, because they make things easier.
You’ve got to make sure you understand the organisation, that you’re complementing the experts, the practitioners, the coaches, the video analysts, the physical performance team.
Then you’re also creating processes that are scalable. If someone leaves the team, the processes shouldn’t stop. You should always try and keep that continuity and depend on those processes, it’s so important.
Infrastructure
When I say infrastructure, I mean the underlying pieces. For example, when you receive a report, what is computing that report? Where is that data coming from? Where is that data being stored?
There are about six key design principles I use:
Managed services: I try not to build anything in-house if there is a tool out there that can do it. The only important factor that I key in is whether this tool can be included in our private network.
What usually happens with managed services is they ask, ‘Send your data to us’, and I’m not really sure how it is secured.
So one of my key requirements is whatever tool we use, that we can spin it up in our network instead, because then I know everything is secure.
Everything as code: I write a lot of code, I still do a lot of code reviews for my team, I do all the data modelling and things like that as well.
It’s important to have all the infrastructure defined as code too. Some of you guys will have heard of Terraform, which is a tool that will spin up your whole cloud infrastructure, all from code, so you are defining your infrastructure as code.
This means that even if I leave my job, someone else can take over who understands that language. And there are a lot of people who will understand that language. It also means we can spin up a new environment in less than a day.
Value centred: always make sure you are talking to your stakeholders. Our data scientists are in constant communication every day.
What is the most important thing we have to work on? What is the most important thing the Technical Director or Director of Football is thinking about? And we are always working backwards to see what use-cases we can build for them.
Consumer access: If you work in football, most people want to be able to access that app, that plot, really quickly. You won’t even believe how quick they need this thing!
Sometimes they want it through email as a PDF. I always think that’s better, because then the information can be on their phones. Directors, their time is really limited.
On the cloud: I love being on the cloud, because you pay for what you use and you can scale down or scale up, depending on what you’re doing. It is also secure.
Identity access management: When it comes to security, there are two things – authorisation (are you allowed to see the things you are looking at, which is really important in football) and authentication (are you who you say you are).
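The difference between the two checks can be sketched in a few lines of Python. This is a minimal illustration only: the tokens, user names and report names are invented for the example, and a real set-up would rely on the cloud provider's identity service rather than in-memory dictionaries.

```python
from typing import Optional

# Hypothetical lookup tables, invented for illustration only.
TOKENS = {"token-abc": "analyst_1", "token-xyz": "director_1"}
PERMISSIONS = {
    "analyst_1": {"match_reports"},
    "director_1": {"match_reports", "recruitment_reports"},
}


def authenticate(token: str) -> Optional[str]:
    """Authentication: are you who you say you are?"""
    return TOKENS.get(token)


def is_authorised(user: str, resource: str) -> bool:
    """Authorisation: are you allowed to see this resource?"""
    return resource in PERMISSIONS.get(user, set())


def can_view(token: str, resource: str) -> bool:
    """Both checks must pass before anything is served."""
    user = authenticate(token)
    if user is None:
        return False
    return is_authorised(user, resource)
```

Here an analyst with a valid token can open match reports but not recruitment reports, and an unknown token is rejected before any permission is looked up.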
Building on the cloud, there are really three things we are looking for – compute, storage and networking. These are the three reasons we are on the cloud.
The three biggest providers are AWS, Google and Azure. We just go for these ones because it makes our life easier, but we could go for a smaller one. Go for the one that gives you the cheapest bill.
Key components
Generation: When it comes to data providers, have a meeting with them to find out how they generate that data and how you can access it. Usually it will be through an API and you can ingest it, transform it, serve it.
If you are ingesting video data, it is a little different and it’s expensive to store. You have to make sure you have a really good use case to spend that amount of money.
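The ingest, transform, serve flow for provider data that arrives through an API could look roughly like the Python sketch below. Everything in it is an assumption for illustration: `fetch_from_provider` stands in for a real API call, and the field names and the `passes_per_90` metric are made up.

```python
import json
from typing import Dict, List


def fetch_from_provider() -> str:
    """Stand-in for a provider API call; returns a hypothetical JSON payload."""
    return json.dumps([
        {"player": "A", "minutes": 90, "passes": 54},
        {"player": "B", "minutes": 45, "passes": 18},
    ])


def ingest(raw: str) -> List[dict]:
    """Ingest: parse the provider payload into rows."""
    return json.loads(raw)


def transform(rows: List[dict]) -> List[dict]:
    """Transform: add a per-90-minutes metric to each row."""
    return [
        {**row, "passes_per_90": round(row["passes"] * 90 / row["minutes"], 1)}
        for row in rows
    ]


def serve(rows: List[dict]) -> Dict[str, dict]:
    """Serve: index rows by player so a report can look them up quickly."""
    return {row["player"]: row for row in rows}


report = serve(transform(ingest(fetch_from_provider())))
```

Keeping the three stages as separate functions means each one can be tested and replaced on its own, for example when a provider changes its payload.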
Then there are also undercurrents. You are not just ingesting data and keeping it there. There is security, data management, data operations, the architecture, the orchestration, but also just really good software engineering practices.
If you are starting from scratch to build a data team, if you are recruiting, I would say look more for software engineers or software engineering practices, they will do really well in this space.
End-product: You’ve done all your data strategies, you’ve got all your use cases and will probably end up with something like this if you’ve got a really good engineer or architect, for each cloud provider, where you say we want to use these services.
If you are ingesting real-time streaming you can use event hubs. You store your raw data in your data lake.
One tip: if you are building a data platform these days, you probably want a data lake house, not a warehouse. The concept of a lake house is that the interface you use to access your data works directly on the raw files in your storage instead.
Usually your data will flow from a bronze layer to a gold layer. Bronze is the more raw form, silver is transformed and gold is the more aggregated data you see in your dashboard.
In my team, the bronze is the provider data, silver is the data we have enhanced, which we have added our own custom metrics to, and anything in the gold layer is anything that is being visualised either on Tableau or Power BI or any other tool.
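As a rough sketch of how data might move through those layers, the Python below takes a few bronze rows, adds a custom metric for silver, and aggregates per player for gold. The rows, column names and the shot-accuracy metric are invented for the example and are not the club's actual data model.

```python
from collections import defaultdict
from typing import Dict, List

# Bronze: rows as a provider might deliver them (hypothetical values).
bronze = [
    {"player_id": 1, "match_id": 10, "shots": 3, "shots_on_target": 2},
    {"player_id": 1, "match_id": 11, "shots": 1, "shots_on_target": 0},
    {"player_id": 2, "match_id": 10, "shots": 0, "shots_on_target": 0},
]


def to_silver(rows: List[dict]) -> List[dict]:
    """Silver: provider rows enhanced with a custom metric."""
    silver = []
    for row in rows:
        # Avoid dividing by zero when a player took no shots.
        accuracy = row["shots_on_target"] / row["shots"] if row["shots"] else None
        silver.append({**row, "shot_accuracy": accuracy})
    return silver


def to_gold(rows: List[dict]) -> Dict[int, dict]:
    """Gold: per-player totals, ready for a dashboard."""
    totals = defaultdict(lambda: {"shots": 0, "shots_on_target": 0})
    for row in rows:
        totals[row["player_id"]]["shots"] += row["shots"]
        totals[row["player_id"]]["shots_on_target"] += row["shots_on_target"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

The bronze rows are never modified, so a metric can be redefined later and the silver and gold layers rebuilt from the same raw data.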
Data providers: If their quality is bad, anything you create is going to be bad. That quality needs to be good.
Connecting information from multiple providers: You need to be able to glue information from multiple providers. That’s how you’re going to get your competitive edge.
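One common way to glue providers together is a mapping table that ties each provider's own player ID to a single internal ID. The Python sketch below assumes that approach; the provider names, IDs and fields are all hypothetical.

```python
from typing import List

# Hypothetical rows from two providers that each use their own player IDs.
provider_a = [{"a_id": "A-100", "name": "Player One", "distance_km": 10.4}]
provider_b = [{"b_id": 7001, "passes": 61}]

# Mapping table maintained by the data team (values invented for the example).
id_map = [{"club_id": 1, "a_id": "A-100", "b_id": 7001}]


def link_providers(mapping: List[dict], a_rows: List[dict], b_rows: List[dict]) -> List[dict]:
    """Join both providers on the internal ID, skipping players missing from either."""
    a_by_id = {row["a_id"]: row for row in a_rows}
    b_by_id = {row["b_id"]: row for row in b_rows}
    linked = []
    for entry in mapping:
        a_row = a_by_id.get(entry["a_id"])
        b_row = b_by_id.get(entry["b_id"])
        if a_row is None or b_row is None:
            continue  # no match in one provider, so there is nothing to combine
        linked.append({
            "club_id": entry["club_id"],
            "name": a_row["name"],
            "distance_km": a_row["distance_km"],
            "passes": b_row["passes"],
        })
    return linked


linked = link_providers(id_map, provider_a, provider_b)
```

Once the rows share one internal ID, physical data from one provider and event data from another can sit side by side in the same report.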
Stakeholder involvement: Keep asking them what’s the most important thing.
Data modelling: Take some time to do some really good data modelling.
Feature engineering: Make sure your data is truthful. The good thing about football is you can always go back and watch the video. Did that thing happen at that place?
Delivery model: If you have a small team, it’s ok to partner up with a consultancy to get them to deliver, but make sure they are building something that actually matters.
