Charter

The Data Team is a part of the Finance organization within GitLab, but we serve the entire company. We do this by maintaining a data warehouse where information from all business systems is stored and managed for analysis.
Our charter and goals are as follows:

  • Build and maintain a centralized data warehouse (Snowflake) that can support data integration, aggregation, and analysis requirements from all functional groups within the company
  • Create a common data framework and governance practice
  • Create and maintain scalable ELT pipelines that support the data sources needed for analysis
  • Work with functional groups through Finance Business Partners to establish the single source of truth (SSOT) for company Key Performance Indicators (KPIs)
  • Establish a change management process for source systems, data transformations, and reporting
  • Develop a Data Architecture plan in conjunction with functional groups
  • Develop a roadmap for systems evolution in alignment with the Company’s growth plans
  • Collaborate with infrastructure to maintain our self-hosted Snowplow pipeline for web event analytics
  • Create and evangelize analyses produced in our business intelligence tool (Periscope); support others learning to and creating their own analyses

    Data Team Principles

    The Data Team at GitLab is working to establish a world-class data analytics and engineering function by utilizing the tools of DevOps in combination with the core values of GitLab. We believe that data teams have much to learn from DevOps. We will work to model good software development best practices and integrate them into our data management and analytics.
    A typical data team has members who fall along a spectrum of skills and focus. For now, the data function at GitLab has Data Engineers and Data Analysts; eventually, the team will include Data Scientists. Review the team organization section to see the makeup of the team.
    Data Engineers are essentially software engineers who have a particular focus on data movement and orchestration. The transition to DevOps is typically easier for them because much of their work is done using the command line and scripting languages such as Bash and Python. Data pipelines are a particular challenge: most pipelines are not well tested, data movement is typically not idempotent, and auditability of history is difficult.
    Data Analysts are further from DevOps practices than Data Engineers. Most analysts use SQL for their analytics and queries, supplemented by Python or R. In the past, data queries and transformations may have been done by custom tooling or software written by other companies. These tools and approaches share similar traits in that they’re likely not version controlled, there are probably few tests around them, and they are difficult to maintain at scale.
    Data Scientists are probably furthest from integrating DevOps practices into their work. Much of their work is done in tools like Jupyter Notebooks or R Studio. Those who do machine learning create models that are not typically version controlled. Data management and accessibility is also a concern.
    We will work closely with the data and analytics communities to find solutions to these challenges. Some of the solutions may be cultural in nature, and we aim to be a model for other organizations of how a world-class Data and Analytics team can utilize the best of DevOps for all Data Operations.
    Some of our beliefs are:

  • Everything can and should be defined in code

  • Everything can and should be version controlled
  • Data Engineers, Data Analysts, and Data Scientists can and should integrate best practices from DevOps into their workflow
  • It is possible to serve the business while having a high-quality, maintainable code base
  • Analytics, and the code that supports it, can and should be open source
  • There can be a single source of truth for every analytic question within a company
  • Data team managers serve their team and not themselves
  • Glue work is important for the health of the team and is recognized individually for the value it provides. We call this out specifically because women tend to over-index on glue work, and it can negatively affect their careers.
  • We focus our limited resources where data will have the greatest impact
  • Lead indicators are just as important as lag indicators, if not more so
  • All business users should be able to learn how to interpret and calculate simple statistics

    How we Work

    Documentation

    The data team, like the rest of GitLab, works hard to document as much as possible. We believe this framework for types of documentation from Divio is quite valuable. For the most part, what’s captured in the handbook are tutorials, how-to guides, and explanations, while reference documentation lives within the primary analytics project. We have aspirations to tag our documentation with the appropriate function as well as clearly articulate the assumed audience for each piece of documentation.

    OKR Planning

    Data Team OKRs are derived from the higher-level BizOps/Finance OKRs as well as the needs of the team. At the beginning of an FQ, the team will outline all actions that are required to succeed with our KRs and in helping other teams measure the success of their KRs. The best way to do that is via a team brain-dump session in which everyone lays out all the steps they anticipate for each of the relevant actions. This is a great time for the team to raise any blockers or concerns they foresee. These should be recorded for future reference.
    These OKRs drive ~60% of the work that the central data team does in a given quarter. The remaining time is divided between urgent issues that come up and ad hoc/exploratory analyses. Specialty data analysts (who have the title “Data Analyst, Specialty”) should have a similar breakdown of planned to responsive work, but their priorities are set by their specialty manager.

    Milestone Planning

    The data team currently works in two-week intervals, called milestones. Milestones start on Tuesdays and end on Mondays. This discourages last-minute merging on Fridays and allows the team to have milestone planning meetings at the top of the milestone.
    Milestones may be three weeks long if they cover a major holiday or if the majority of the team is on vacation or at Contribute. As work is assigned to a person and a milestone, it gets a weight assigned to it.
    Milestone planning should take into consideration:

  • vacation timelines

  • conference schedules
  • team member availability
  • team member work preferences (specialties are different from preferences)

Milestone planning is owned by the Manager, Data.
The timeline for milestone planning is as follows:

  • Meeting Preparation - Responsible Party: Milestone Planner
    • Investigate and flesh out open issues.
    • Assign issues to the milestone based on alignment with the Team Roadmap.
    • Note: Issues are not assigned to an individual at this stage, except where required.

| Day | Current Milestone | Next Milestone |
| :-- | :-- | :-- |
| 0 - 1st Tuesday | Milestone Start<br>Roll Milestone | - |
| 6 - 1st Monday | - | Groom new issues for planning |
| 7 - 2nd Tuesday | Midpoint<br>Any issues that are at risk of slipping from the milestone must be raised by the assignee | Milestone Review and Planning Meeting<br>Discuss: what we learned from the last milestone, priorities for the new milestone, and what’s coming down the pike<br>Issues are pointed by the relevant team (Engineering or Analytics)<br>Note: Pointing is done without knowledge of who may pick up the task |
| 10 - 2nd Friday | The last day to submit MRs for review<br>MRs must include documentation and testing to be ready to merge<br>No MRs are to be merged on Fridays | Milestone is roughly final<br>Milestone Planner distributes issues to team members, with the appropriate considerations and preferences |
| 13 - 2nd Monday | Last day of Milestone<br>Ready MRs can be merged | - |

The short-term goal of this process is to improve our ability to plan and estimate work through better understanding of our velocity. In order to successfully evaluate how we’re performing against the plan, any issues not raised at the T+7 mark should not be moved until the next milestone begins.
The work of the data team generally falls into the following categories:

  • Infrastructure
  • Analytics
    • Central Team
    • Specialist Team
  • Housekeeping

During the milestone planning process, we point issues, then pull into the milestone the issues expected to be completed in the timeframe. Points are a good measure of consistency: the total points completed should stay near a stable average from milestone to milestone. Issues are then prioritized according to these categories.
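As a back-of-the-envelope illustration of that consistency check, velocity is just the mean of completed points per milestone (the point totals below are hypothetical):

```shell
# Hypothetical completed point totals for the last four milestones.
points="21 19 24 20"

total=0; n=0
for p in $points; do
  total=$((total + p))
  n=$((n + 1))
done

# Integer average is fine at this granularity.
avg=$((total / n))
echo "average velocity: $avg points per milestone"
```

A milestone whose completed total lands well away from this average is a signal to revisit pointing or scope, not a performance grade for individuals.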
Issues are not assigned to individual members of the team, except where necessary, until someone is ready to work on them. Work is not assigned and then managed into a milestone. Every person works on the top-priority issue for their job type. As that issue is completed, they can pick up the next-highest-priority issue. People will likely be working on no more than two issues at a time.
Given the power of the Ivy Lee method, this allows the team to collectively work on priorities as opposed to creating a backlog for any given person. As a tradeoff, every time a central analyst is introduced to a new data source, their velocity may temporarily decrease while they come up to speed; the overall benefit to the organization, namely that any analyst can pick up any issue, will compensate for this. Learn how the product managers groom issues.
Data Engineers work on Infrastructure issues. Central Data Analysts, and sometimes Data Engineers, work on general Analytics issues. Specialty Data Analysts work on analyses for their specialty, e.g. Growth, Finance, etc.
There is a demo of what this proposal would look like in a board.
This approach has many benefits, including:

  1. It helps ensure the highest priority projects are being completed
  2. It can help leadership identify issues that are blocked
  3. It provides leadership view into the work of the data team, including specialty analysts whose priorities are set from outside the data function
  4. It encourages consistent throughput from team members
  5. It makes clear to stakeholders where their ask is in priority
  6. It helps alleviate the pressure of planning the next milestone, as issues are already ranked

    Issue Types

    There are three general types of issues:
  • Discovery
  • Introducing a new data source
  • Work

Not all issues will fall into one of these buckets but 85% should.

Discovery issues

Some issues may need a discovery period to understand requirements, gather feedback, or explore the work that needs to be done. Discovery issues are usually 2 points.

Introducing a new data source

Introducing a new data source requires a heavy lift of understanding that new data source, mapping field names to logic, documenting those, and understanding what issues are being delivered. Usually introducing a new data source is coupled with replicating an existing dashboard from the other data source. This helps verify that numbers are accurate and the original data source and the data team’s analysis are using the same definitions.

Work

This umbrella term helps capture:

  • inbound requests from GitLab team-members that usually materialize into a dashboard
  • housekeeping improvements/technical debt from the data team
  • goals of the data team
  • documentation notes

It is the responsibility of the assignee to be clear on what the scope of their issue is. A well-defined issue has a clearly outlined problem statement. Complex or new issues may also include an outline (not an all-encompassing list) of what steps need to be taken. If an issue is not well-scoped as it’s assigned, it is the responsibility of the assignee to understand how to scope that issue properly and to approach the appropriate team members for guidance early in the milestone.

Issue Pointing

Issue pointing captures the complexity of an issue, not the time it takes to complete an issue. That is why pointing is independent of who the issue assignee is.

  • Refer to the table below for point values and what they represent.
  • We size and point issues as a group.
  • Effective pointing requires more fleshed out issues, but that requirement shouldn’t keep people from creating issues.
  • When pointing work that happens outside of the Data Team projects, add points to the issue in the relevant Data Team project and ensure issues are cross-linked.

| Weight | Description |
| :-- | :-- |
| Null | Meta and discussions that don’t result in an MR |
| 0 | Should not be used. |
| 1 | The simplest possible change, including documentation changes. We are confident there will be no side effects. |
| 2 | A simple change (minimal code changes), where we understand all of the requirements. |
| 3 | A simple change, but the code footprint is bigger (e.g. lots of different files, or tests affected). The requirements are clear. |
| 5 | A more complex change that will impact multiple areas of the codebase; there may also be some refactoring involved. Requirements are understood, but you feel there are likely to be some gaps along the way. |
| 8 | A complex change that will involve much of the codebase or will require lots of input from others to determine the requirements. |
| 13 | A significant change that may have dependencies (other teams or third parties), and we likely still don’t understand all of the requirements. It’s unlikely we would commit to this in a milestone; the preference would be to further clarify requirements and/or break it into smaller issues. |

Issue Labeling

Think of each of these groups of labels as ways of bucketing the work done. All issues should get the following classes of labels assigned to them:

  • Who (Purple): Team for which work is primarily for (Data, Finance, Sales, etc.)
  • What - Data or Tool
    • Data (Light Green): Data being touched (Salesforce, Zuora, Zendesk, Gitlab.com, etc.)
    • Tool (Light Blue) (Periscope, dbt, Stitch, Airflow, etc.)
  • Where (Brown): Which part of the team performs the work (Analytics, Infrastructure, Housekeeping)
  • How (Orange): Type of work (Documentation, Break-fix, Enhancement, Refactor, Testing, Review)

Optional labels are useful to communicate state or other priorities:

  • State (Red) (Won’t Do, Blocked, Needs Consensus, etc.)
  • Inbound: For issues created by folks who are not on the data team; not for asks created by data team members on behalf of others

    Daily Standup

    Members of the data team use Geekbot for our daily standups. These are posted in #data-daily. When Geekbot asks, “What are you planning on working on today? Any blockers?” try answering with specific details so that teammates can proactively unblock you. Instead of “working on Salesforce stuff”, consider “Adding Opportunity Owners for the `sfdc_opportunity_xf` model.” There is no pressure to respond to Geekbot as soon as it messages you. Give responses to Geekbot that truly communicate what you’re working on that day, so that your team can help you understand if some priority has shifted or there is additional context you may need.

    Merge Request Workflow

    Ideally, your workflow should be as follows:
  1. Confirm you have access to the analytics project. If not, request Developer access so you can create branches, merge requests, and issues.
  2. Create an issue, open an existing issue, or assign yourself to an existing issue. The issue is assigned to the person(s) who will be doing the work.
  3. Add appropriate labels to the issue (see above)
  4. Open an MR from the issue using the “Create merge request” button. This automatically creates a unique branch based on the issue name. This marks the issue for closure once the MR is merged.
  5. Push your work to the branch
  6. Run any jobs relevant to the work being proposed
    • e.g. if you’re working on dbt changes, run the job most appropriate for your changes. See the dbt changes MR template checklist for a list of jobs and their uses.
  7. Document in the MR description what the purpose of the MR is, any additional changes that need to happen for the MR to be valid, and if it’s a complicated MR, how you verified that the change works. See this MR for an example of good documentation. The goal is to make it easier for reviewers to understand what the MR is doing so it’s as easy as possible to review.
  8. Assign the MR to a peer to have it reviewed. If assigning to someone who can merge, either leave a comment asking for a review without merge, or simply leave the WIP: label.
    • Note that assigning someone an MR means action is required from them.
    • The peer reviewer should use the native approve button in the MR after they have completed their review and approve of the changes in the MR.
    • Adding someone as an approver is a way to tag them for an FYI. This is similar to doing cc @user in a comment.
    • After approval, the peer reviewer should send the MR back to the author to decide what needs to happen next. The reviewer should not be responsible for the final tasks. The author is responsible for finalizing the checklist, closing threads, removing WIP, and getting it in a merge-ready state.
  9. Once it’s ready for further review and merging, remove the WIP: label, mark the branch for deletion, mark squash commits, and assign to the project’s maintainer. Ensure that the attached issue is appropriately labeled and pointed.
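Steps 4–5 above can be sketched from the command line. This is a minimal, self-contained illustration in a scratch repository; the issue number, branch name, and model file are hypothetical, and in practice the branch is created for you by the “Create merge request” button:

```shell
# Create a scratch repo so the sketch is self-contained.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "you@example.com" && git config user.name "Your Name"
git commit -q --allow-empty -m "initial commit"

# GitLab generates an issue-based branch name like this (hypothetical issue 1234).
git checkout -q -b 1234-add-opportunity-owners

# Do the work and commit it; `git push` would then publish it to the MR's branch.
mkdir -p models && echo "select 1 as placeholder" > models/sfdc_opportunity_xf.sql
git add models/sfdc_opportunity_xf.sql
git commit -qm "Add Opportunity Owners to sfdc_opportunity_xf"
git log --oneline -1
```

Because the branch name starts with the issue number, GitLab links the MR back to the issue and can close it automatically on merge.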

Other tips:

  1. The Merge Request Workflow provides clear expectations; however, there is some wiggle room and freedom around certain steps as follows.
    • For simple changes, it is the MR author who should be responsible for closing the threads. If there is a complex change and the concern has been addressed, either the author or reviewer could resolve the threads if the reviewer approves.
  2. Reviewers have 48 hours to complete a review, so plan ahead as the end of the milestone approaches.
  3. When possible, questions/problems should be discussed with your reviewer before submitting the MR for review. Particularly for large changes, review time is the least efficient time to have to make meaningful changes to code, because you’ve already done most of the work!

    Local Docker Workflow

    To facilitate an easier workflow for analysts and to abstract away some of the complexity around handling dbt and its dependencies locally, the main analytics repo now supports using dbt from within a Docker container. There are commands within the Makefile to facilitate this; if at any time you have questions about the various make commands and what they do, just run make help to get a handy list of the commands and what each of them does.
    Before your initial run (and whenever the containers get updated) make sure to run the following commands:

  1. make update-containers

  2. make cleanup

These commands will ensure you get the newest versions of the containers and generally clean up your local Docker environment.

Using dbt:
  • To start a dbt container and run commands from a shell inside of it, use make dbt-image.
  • This will automatically import everything dbt needs to run, including your local profiles.yml and repo files.
  • To see the docs for your current branch, run make dbt-docs and then visit localhost:8081 in a web browser. (Note: a quoting bug in dbt prevents this from working as intended; the expected fix is in dbt version 0.15.)
  • Once inside of the dbt container, run any dbt commands as you normally would.
  • Changes that are made to any files in the repo will automatically be updated within the container. There is no need to restart the container when you change a file through your editor!