Data Engineer
Data-Driven Decisions: Powering Innovation with RAiD's Data Engineer
Hey Leon, nice to meet you! Could you introduce yourself?
I majored in Materials Science at university. In my first job, I investigated simulations of magnetic fields, and I did so through programming. That got me interested in the IT field. On further research, I came across Kaggle and decided to enter the IT/data field through data engineering.
What do you consider the biggest challenges in building and scaling data pipelines today?
Cost and requirement trade-offs. Most companies have cloud-based databases, and the industry has many available solutions to run pipelines and process data. The key challenge then becomes how to bring the required processing time and resources down to a cost that the company is comfortable with. Failing to communicate and address this properly results in either an over-engineered or an under-optimised solution, both of which would cost the company more in the long run.
How do you approach data governance and security in your projects?
This is a big question. For security, the key is ensuring the right permissions are given to the right people. No one should have more permissions than they need. IAM policies, firewall rules and traffic monitoring are all integral to maintaining vigilance and security. For data governance, mapping out the business process, data schema and data flow from raw to usable data helps me best understand the governance requirements. After that, I believe it's most important to communicate risks and their probabilities to stakeholders.
What critical skills do you think are essential for a data engineer today?
Algorithms, Data Structures, Pipelining, Automation, Structured Query Language (SQL) and Databases are the most important. After covering these, Streaming and Distributed Computing are important too.
If you could have access to any type of data from any point in history, what data would you choose to analyse and why?
Since you mentioned any data, I'll choose human DNA sequencing data for each individual throughout history. I would like to see how the gene pool has changed in each geographic location, how each allele has risen and fallen with major historical events, and how each gene is tied to life expectancy. Potentially, the data would also be useful in detecting cancers and other health issues.
Job description
Specialise in designing, implementing and maintaining systems, architecture and infrastructure to allow for storage, cleaning, verification, and manipulation of RSAF's complex data from multiple sources.
Mould and prepare data for advanced work by data scientists.