5 Ways Data Scientists Can Advance Their Careers
Data and machine learning scientists are joining enterprises with the promise of cutting-edge ML models and technologies. But often they spend 80% of their time cleaning data or dealing with data riddled with missing values and outliers, a frequently changing schema, and massive load times. The gap between expectations and reality can be huge.
While data scientists may initially be excited to tackle advanced insights and models, that enthusiasm quickly deflates amid daily schema changes, tables that stop updating, and other surprises that silently break models and dashboards.
While “data science” applies to a range of roles, from analyzing products to putting statistical models into production, one thing is generally true: data scientists and ML engineers often find themselves at the end of the data pipeline. They are consumers of data, pulling it from data warehouses or from S3 or other centralized sources. They analyze data to help make business decisions or use it as training inputs for machine learning models.
In other words, they are affected by data quality issues but are often not empowered to move upstream in the pipeline to fix them. So they write a ton of defensive data preprocessing into their work or move on to a new project.
If this scenario sounds familiar, you don’t have to give up or complain that upstream data engineering is broken for good. Think like a scientist and experiment. You are the last step in the pipeline before models are released, which means you are responsible for the outcome. While it may seem terrifying or unfair, it’s also a great opportunity to shine and make a big difference in your team’s business impact.
Here are five things data scientists and ML engineers can do to get out of defense mode and ensure that, even if they didn’t create the data quality issues, they prevent those issues from impacting data-dependent teams.
1. Increase confidence through better monitoring of data quality
Business leaders are hesitant to make decisions based solely on data. A KPMG report showed that 60% of companies do not feel very confident in their data and that 49% of management teams do not fully support internal data and analytics strategy.
Good data scientists and ML engineers can help by increasing the accuracy of data and then integrating it into dashboards that help key decision makers. By doing so, they will have a direct positive impact. But manually checking data for quality issues is error-prone and a huge drag on your speed.
Using data quality tests (e.g., with dbt tests) and data observability helps ensure you learn about quality issues before your stakeholders do, earning their trust in you (and the data) over time.
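As a rough illustration of what an automated data quality test checks, here is a minimal sketch in plain Python. It mimics the idea behind dbt’s `not_null` and `unique` generic tests but does not use dbt itself. The function names, the `orders` sample, and its columns are all hypothetical.

```python
from collections import Counter


def check_not_null(rows, column):
    """Return the indexes of rows where `column` is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def check_unique(rows, column):
    """Return the non-null values of `column` that appear more than once."""
    counts = Counter(
        row[column] for row in rows if row.get(column) is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical sample data: one missing amount and one duplicated order_id.
orders = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 35.5},
]

null_amount_rows = check_not_null(orders, "amount")
duplicate_order_ids = check_unique(orders, "order_id")

if null_amount_rows or duplicate_order_ids:
    print("Data quality issues found:", null_amount_rows, duplicate_order_ids)
```

Run on a schedule against fresh data, checks like these can flag a problem before it reaches a dashboard or a training set.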
2. Establish SLAs to avoid confusion and blame
Data quality issues can easily lead to a tedious blame game between data science, data engineering, and software engineering. Who broke the data? And who knew? And who will fix it?
But when bad data goes out into the world, it’s everyone’s fault. Your stakeholders want the data to work so the business can move forward with an accurate picture.
Good data scientists and ML engineers reinforce accountability for all stages of the data pipeline with service level agreements. SLAs define data quality in quantifiable terms, assigning stakeholders who must intervene to resolve issues. SLAs help avoid the blame game altogether.
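To make “quantifiable terms” concrete, here is a minimal sketch of how a data SLA might be written down and checked in Python. The `DataSLA` fields, the thresholds, the `orders` table, and the `data-engineering` owner are assumptions chosen for illustration, not a standard format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DataSLA:
    table: str                # dataset the SLA covers
    owner: str                # team that must step in on a violation
    max_staleness: timedelta  # how old the latest update may be
    max_null_rate: float      # allowed fraction of null values (0.0 to 1.0)


def sla_violations(sla, last_updated, null_rate, now):
    """Return a human-readable message for each SLA term that is violated."""
    violations = []
    if now - last_updated > sla.max_staleness:
        violations.append(f"{sla.table}: stale, notify {sla.owner}")
    if null_rate > sla.max_null_rate:
        violations.append(f"{sla.table}: null rate too high, notify {sla.owner}")
    return violations


# Hypothetical SLA: refreshed at least daily, at most 1% nulls.
sla = DataSLA("orders", "data-engineering", timedelta(hours=24), 0.01)

violations = sla_violations(
    sla,
    last_updated=datetime(2024, 1, 1, 6, 0),
    null_rate=0.005,
    now=datetime(2024, 1, 2, 12, 0),
)
print(violations)
```

Because the owner is part of the SLA itself, a violation already names who has to act, which leaves little room for a blame game.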
3. Faster analysis through experiments
Trust is so fragile and it quickly erodes when your stakeholders catch mistakes and start assigning blame. But what about when they fail to detect quality issues? Then the model is bad, or bad decisions are made. Either way, the business suffers.
For example, what if you have a single entity registered as both “Dallas-Fort Worth” and “DFW” in a database? When testing a new feature, everyone recorded as “Dallas-Fort Worth” is shown Variant A and everyone recorded as “DFW” is shown Variant B. No one catches the discrepancy. You cannot draw conclusions about users in the Dallas-Fort Worth area: your test is invalid because the groups were not properly randomized.
Pave the way for better experimentation and analysis with higher-quality data. By using your expertise to improve quality, you make your data more reliable and enable your sales teams to run meaningful tests. The team can then focus on what to test next instead of doubting the results of the tests.
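One way to catch this kind of problem before an experiment starts is to map raw labels to a canonical name and flag any entity recorded under more than one label. The sketch below assumes a hand-maintained alias table. `CANONICAL_REGIONS`, the function names, and the sample assignments are hypothetical.

```python
from collections import defaultdict

# Hypothetical alias table mapping normalized raw labels to one canonical name.
CANONICAL_REGIONS = {
    "dfw": "Dallas-Fort Worth",
    "dallas-fort worth": "Dallas-Fort Worth",
}


def canonical_region(raw):
    """Normalize case and whitespace, then look up the canonical name."""
    key = " ".join(raw.strip().lower().split())
    return CANONICAL_REGIONS.get(key, raw.strip())


def find_split_entities(assignments):
    """Return canonical regions that are recorded under more than one raw label."""
    labels = defaultdict(set)
    for raw_region, _variant in assignments:
        labels[canonical_region(raw_region)].add(raw_region.strip())
    return {region: sorted(raw) for region, raw in labels.items() if len(raw) > 1}


# (raw region label, experiment variant) pairs from a hypothetical test.
assignments = [
    ("Dallas-Fort Worth", "A"),
    ("DFW", "B"),
    ("DFW", "B"),
    ("Austin", "A"),
    ("Austin", "B"),
]

split_entities = find_split_entities(assignments)
print(split_entities)
```

Any region that shows up in the result should be merged under its canonical name before users are assigned to variants.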
4. Become the point of contact for data quality
Trust in data starts with you. If you don’t have a handle on reliable, high-quality data, you will carry that burden into your interactions with product teams and your colleagues.
So claim your position as the point of contact for data quality and ownership. You can help define quality and delegate responsibility for solving different problems. Remove friction between data science and engineering.
If you can lead the charge to define and improve data quality, you’ll impact nearly every other team in your organization. Your teammates will appreciate the work you do to reduce organization-wide headaches.
5. Minimize data waste
Incomplete or unreliable data can result in terabytes of wasted storage. This data resides in your warehouse and is included in queries that incur compute costs. Poor quality data can be a major drag on your infrastructure bill because it is repeatedly scanned only to be filtered out.
Identifying problematic data is a way to immediately create value for your organization, especially for pipelines that see heavy traffic for product analytics and machine learning. Re-collect, reprocess, or impute and clean up existing values to reduce storage and compute costs.
Keep track of the tables and data you cleanse, as well as the number of queries run against those tables. It’s essential to let your team know how many queries are no longer running on junk data and how many GB of storage have been freed up for better things.
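A simple log is enough to start tracking this. The sketch below shows one possible shape for it. The `CleanupRecord` fields, table names, and figures are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class CleanupRecord:
    table: str           # table that was cleaned
    gb_freed: float      # storage reclaimed, in GB
    weekly_queries: int  # queries per week that now read clean data


def summarize(records):
    """Aggregate cleanup records into totals you can report to the team."""
    return {
        "tables_cleaned": len(records),
        "gb_freed": sum(r.gb_freed for r in records),
        "weekly_queries_on_clean_data": sum(r.weekly_queries for r in records),
    }


# Hypothetical cleanup log.
cleanup_log = [
    CleanupRecord("events_raw", 120.5, 340),
    CleanupRecord("user_sessions", 42.0, 95),
]

summary = summarize(cleanup_log)
print(summary)
```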
All data professionals, from seasoned veterans to newcomers, should be indispensable parts of the organization. You add value by taking ownership of more reliable data. Although the tools, algorithms, and techniques for analysis are becoming more sophisticated, the input data often isn’t: it’s still unique and company-specific. Even the most sophisticated tools and models don’t work well with bad data. By following the five steps above, you can make the impact of data science a boon to your entire organization. Everyone wins when you improve the data your teams depend on.