Charter

The data team is part of the Finance organization at GitLab, but we serve the entire company. We do this by maintaining a data warehouse where information from all business systems is stored and managed for analysis.
Our charter and goals are as follows:

  • Build and maintain a centralized data warehouse (Snowflake) that can support the data integration, aggregation, and analysis needs of all functions within the company
  • Create a common data framework and governance practices
  • Create and maintain scalable ELT pipelines to support the data sources needed for analysis
  • Partner with functional groups, via Finance Business Partners, to establish a Single Source of Truth (SSOT) for company Key Performance Indicators (KPIs)
  • Establish change management processes for source systems, data transformations, and reporting
  • Develop data architecture plans together with functional groups
  • Create a roadmap for system evolution in line with the company's growth plans
  • Collaborate with Infrastructure to maintain our self-hosted Snowplow pipeline used for web event analysis
  • Create and promote analyses produced through our business intelligence tool (Periscope); support others in learning to create their own analyses

    Data Team Principles

    GitLab's data team strives to build a world-class data analytics and engineering function by combining DevOps tooling with GitLab's core values. We believe the data team can learn a great deal from DevOps. We will work to model good software development best practices and integrate them into our data management and analytics work.
    A typical data team has members with a variety of skills and focuses. Currently, GitLab's data function has data engineers and data analysts; eventually, the team will include data scientists as well. See the "Team Organization" section to learn about the team's composition.
    Data engineers are essentially software engineers with a particular focus on data movement and orchestration. The transition to DevOps is usually easier for them because most of their work is done with the command line and scripting languages such as bash and Python. Data pipelines are a particular challenge: most pipelines are not well tested, data movement is often not idempotent, and auditability of history is difficult.
    Data analysts are further removed from DevOps practices than data engineers. Most analysts use SQL, along with Python or R, for analysis and querying. In the past, data querying and transformation may have been done with custom tools or software written by other companies. These tools and approaches share similar traits: they are often not version controlled, there is usually little testing around them, and they are difficult to maintain at scale.
    Data scientists are probably the furthest from integrating DevOps practices into their work. Much of their work is done in tools such as Jupyter Notebooks or RStudio. Those doing machine learning often create models that are not version controlled. Data management and accessibility are challenges as well.
    We will work closely with the data and analytics communities to find solutions to these challenges. Some of these solutions may be cultural in nature, and we aim to be a model for other organizations of how a world-class data and analytics team can apply the best of DevOps to all of its data operations.
    Some of our beliefs are:

  • Everything can and should be defined in code

  • Everything can and should be version controlled
  • Data engineers, data analysts, and data scientists can and should integrate DevOps best practices into their workflows
  • Having a high-quality, maintainable codebase serves the business
  • Analytics and the code that supports it can and should be open source
  • It is possible to have a single source of truth for every analytical question in the company
  • Data team managers serve the team, not themselves
  • Glue work is important for the health of the team and is recognized individually for the value it provides. We call this out specifically as women tend to over-index on glue work and it can negatively affect their careers.
  • We focus our limited resources where data will have the greatest impact
  • Lead indicators are just as important as, if not more important than, lag indicators
  • All business users should be able to learn how to interpret and calculate simple statistics

    How we Work

    Documentation

    The data team, like the rest of GitLab, works hard to document as much as possible. We believe this framework for types of documentation from Divio is quite valuable. For the most part, what’s captured in the handbook are tutorials, how-to guides, and explanations, while reference documentation lives within the primary analytics project. We have aspirations to tag our documentation with the appropriate function as well as clearly articulate the assumed audiences for each piece of documentation.

    OKR Planning

    Data Team OKRs are derived from the higher level BizOps/Finance OKRs as well as the needs of the team. At the beginning of a FQ, the team will outline all actions that are required to succeed with our KRs and in helping other teams measure the success of their KRs. The best way to do that is via a team brain dump session in which everyone lays out all the steps they anticipate for each of the relevant actions. This is a great time for the team to raise any blockers or concerns they foresee. These should be recorded for future reference.
    These OKRs drive ~60% of the work that the central data team does in a given quarter. The remaining time is divided between urgent issues that come up and ad hoc/exploratory analyses. Specialty data analysts (who have the title “Data Analyst, Specialty”) should have a similar breakdown of planned work to responsive work, but their priorities are set by their specialty manager.

    Milestone Planning

    The data team currently works in two-week intervals, called milestones. Milestones start on Tuesdays and end on Mondays. This discourages last-minute merging on Fridays and allows the team to have milestone planning meetings at the top of the milestone.
    Milestones may be three weeks long if they cover a major holiday or if the majority of the team is on vacation or at Contribute. As work is assigned to a person and a milestone, it gets a weight assigned to it.
    Milestone planning should take into consideration:

  • vacation timelines

  • conference schedules
  • team member availability
  • team member work preferences (specialties are different from preferences)

The milestone planning is owned by the Manager, Data.
The timeline for milestone planning is as follows:

  • Meeting Preparation - Responsible Party: Milestone Planner
    • Investigate and flesh out open issues.
    • Assign issues to the milestone based on alignment with the Team Roadmap.
    • Note: Issues are not assigned to an individual at this stage, except where required.

| Day | Current Milestone | Next Milestone |
| :--- | :--- | :--- |
| 0 - 1st Tuesday | Milestone Start. Roll Milestone. | - |
| 6 - 1st Monday | - | Groom new issues for planning |
| 7 - 2nd Tuesday | Midpoint. Any issues at risk of slipping from the milestone must be raised by the assignee. | Milestone Review and Planning Meeting. Discuss: what we learned from the last milestone, priorities for the new milestone, and what’s coming down the pike. Issues are pointed by the relevant team (Engineering or Analytics). Note: pointing is done without knowledge of who may pick up the task. |
| 10 - 2nd Friday | Last day to submit MRs for review. MRs must include documentation and testing to be ready to merge. No MRs are to be merged on Fridays. | Milestone is roughly final. Milestone Planner distributes issues to team members, with the appropriate considerations and preferences. |
| 13 - 2nd Monday | Last day of Milestone. Ready MRs can be merged. | - |

The short-term goal of this process is to improve our ability to plan and estimate work through better understanding of our velocity. In order to successfully evaluate how we’re performing against the plan, any issues not raised at the T+7 mark should not be moved until the next milestone begins.
The work of the data team generally falls into the following categories:

  • Infrastructure
  • Analytics
    • Central Team
    • Specialist Team
  • Housekeeping

During the milestone planning process, we point issues, then pull into the milestone the issues expected to be completed in the timeframe. Points are a good measure of consistency: totals should stay close to an average from milestone to milestone. Issues are then prioritized according to these categories.
Issues are not assigned to individual members of the team, except where necessary, until someone is ready to work on it. Work is not assigned and then managed into a milestone. Every person works on the top priority issue for their job type. As that issue is completed, they can pick up the next highest priority issue. People will likely be working on no more than 2 issues at a time.
Given the power of the Ivy Lee method, this allows the team to collectively work on priorities as opposed to creating a backlog for any given person. As a tradeoff, this also means that every time a central analyst is introduced to a new data source their velocity may temporarily decrease as they come up to speed; the overall benefit to the organization that any analyst can pick up any issue will compensate for this, though. Learn how the product managers groom issues.
Data Engineers work on Infrastructure issues. Data Analysts, Central (and sometimes Data Engineers) work on general Analytics issues. Specialty Data Analysts work on analyses for their specialty, e.g. Growth, Finance, etc.
There is a demo of what this proposal would look like in a board.
This approach has many benefits, including:

  1. It helps ensure the highest priority projects are being completed
  2. It can help leadership identify issues that are blocked
  3. It provides leadership view into the work of the data team, including specialty analysts whose priorities are set from outside the data function
  4. It encourages consistent throughput from team members
  5. It makes clear to stakeholders where their ask is in priority
  6. It helps alleviate the pressure of planning the next milestone, as issues are already ranked

    Issue Types

    There are three general types of issues:
  • Discovery
  • Introducing a new data source
  • Work

Not all issues will fall into one of these buckets but 85% should.

Discovery issues

Some issues may need a discovery period to understand requirements, gather feedback, or explore the work that needs to be done. Discovery issues are usually 2 points.

Introducing a new data source

Introducing a new data source requires a heavy lift of understanding that new data source, mapping field names to logic, documenting those, and understanding what issues are being delivered. Usually introducing a new data source is coupled with replicating an existing dashboard from the other data source. This helps verify that numbers are accurate and the original data source and the data team’s analysis are using the same definitions.

Work

This umbrella term helps capture:

  • inbound requests from GitLab team-members that usually materialize into a dashboard
  • housekeeping improvements/technical debt from the data team
  • goals of the data team
  • documentation notes

It is the responsibility of the assignee to be clear on what the scope of their issue is. A well-defined issue has a clearly outlined problem statement. Complex or new issues may also include an outline (not an all-encompassing list) of the steps that need to be taken. If an issue is not well-scoped when it’s assigned, it is the responsibility of the assignee to understand how to scope that issue properly and to approach the appropriate team members for guidance early in the milestone.

Issue Pointing

Issue pointing captures the complexity of an issue, not the time it takes to complete an issue. That is why pointing is independent of who the issue assignee is.

  • Refer to the table below for point values and what they represent.
  • We size and point issues as a group.
  • Effective pointing requires more fleshed out issues, but that requirement shouldn’t keep people from creating issues.
  • When pointing work that happens outside of the Data Team projects, add points to the issue in the relevant Data Team project and ensure issues are cross-linked.

| Weight | Description |
| :--- | :--- |
| Null | Meta and discussions that don’t result in an MR |
| 0 | Should not be used. |
| 1 | The simplest possible change, including documentation changes. We are confident there will be no side effects. |
| 2 | A simple change (minimal code changes), where we understand all of the requirements. |
| 3 | A simple change, but the code footprint is bigger (e.g. lots of different files, or tests affected). The requirements are clear. |
| 5 | A more complex change that will impact multiple areas of the codebase; there may also be some refactoring involved. Requirements are understood, but you feel there are likely to be some gaps along the way. |
| 8 | A complex change that will involve much of the codebase or will require lots of input from others to determine the requirements. |
| 13 | A significant change that may have dependencies (other teams or third parties) and we likely still don’t understand all of the requirements. It’s unlikely we would commit to this in a milestone, and the preference would be to further clarify requirements and/or break it into smaller issues. |

Issue Labeling

Think of each of these groups of labels as ways of bucketing the work done. All issues should get the following classes of labels assigned to them:

  • Who (Purple): Team for which work is primarily for (Data, Finance, Sales, etc.)
  • What - Data or Tool
    • Data (Light Green): Data being touched (Salesforce, Zuora, Zendesk, Gitlab.com, etc.)
    • Tool (Light Blue): Tool being used (Periscope, dbt, Stitch, Airflow, etc.)
  • Where (Brown): Which part of the team performs the work (Analytics, Infrastructure, Housekeeping)
  • How (Orange): Type of work (Documentation, Break-fix, Enhancement, Refactor, Testing, Review)

Optional labels that are useful for communicating state or other priorities:

  • State (Red) (Won’t Do, Blocked, Needs Consensus, etc.)
  • Inbound: For issues created by folks who are not on the data team; not for asks created by data team members on behalf of others

    Daily Standup

    Members of the data team use Geekbot for our daily standups. These are posted in #data-daily. When Geekbot asks, “What are you planning on working on today? Any blockers?” try answering with specific details, so that teammates can proactively unblock you. Instead of “working on Salesforce stuff”, consider “Adding Opportunity Owners to the sfdc_opportunity_xf model.” There is no pressure to respond to Geekbot as soon as it messages you. Give responses to Geekbot that truly communicate to your team what you’re working on that day, so that your team can help you understand if some priority has shifted or there is additional context you may need.

    Merge Request Workflow

    Ideally, your workflow should be as follows:
  1. Confirm you have access to the analytics project. If not, request Developer access so you can create branches, merge requests, and issues.
  2. Create an issue, open an existing issue, or assign yourself to an existing issue. The issue is assigned to the person(s) who will be doing the work.
  3. Add appropriate labels to the issue (see above)
  4. Open an MR from the issue using the “Create merge request” button. This automatically creates a unique branch based on the issue name. This marks the issue for closure once the MR is merged.
  5. Push your work to the branch (a sketch of this local flow appears after this list)
  6. Run any jobs relevant to the work being proposed
    • e.g. if you’re working on dbt changes, run the job most appropriate for your changes. See the dbt changes MR template checklist for a list of jobs and their uses.
  7. Document in the MR description what the purpose of the MR is, any additional changes that need to happen for the MR to be valid, and if it’s a complicated MR, how you verified that the change works. See this MR for an example of good documentation. The goal is to make it easier for reviewers to understand what the MR is doing so it’s as easy as possible to review.
  8. Assign the MR to a peer to have it reviewed. If assigning to someone who can merge, either leave a comment asking for a review without merge, or you can simply leave the WIP: label.
    • Note that assigning someone an MR means action is required from them.
    • The peer reviewer should use the native approve button in the MR after they have completed their review and approve of the changes in the MR.
    • Adding someone as an approver is a way to tag them for an FYI. This is similar to doing cc @user in a comment.
    • After approval, the peer reviewer should send the MR back to the author to decide what needs to happen next. The reviewer should not be responsible for the final tasks. The author is responsible for finalizing the checklist, closing threads, removing WIP, and getting it in a merge-ready state.
  9. Once it’s ready for further review and merging, remove the WIP: label, mark the branch for deletion, mark squash commits, and assign to the project’s maintainer. Ensure that the attached issue is appropriately labeled and pointed.
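
To make the local side of this workflow concrete, here is a minimal sketch of steps 5 and 6, starting from the branch created in step 4. The branch, file, and model names are hypothetical, and the dbt commands assume your MR touches dbt models; substitute whichever jobs are relevant to your change.

```bash
# Fetch the branch that "Create merge request" generated from the issue
# (branch name here is a hypothetical example).
git fetch origin
git checkout 1234-add-opportunity-owner

# Make your changes, then commit and push them to the branch.
git add models/sfdc_opportunity_xf.sql
git commit -m "Add Opportunity Owner to sfdc_opportunity_xf"
git push origin 1234-add-opportunity-owner

# Run the jobs relevant to the change; for dbt work, that might mean
# running and testing the models you touched.
dbt run --models sfdc_opportunity_xf
dbt test --models sfdc_opportunity_xf
```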

Other tips:

  1. The Merge Request Workflow provides clear expectations; however, there is some wiggle room and freedom around certain steps as follows.
    • For simple changes, it is the MR author who should be responsible for closing the threads. If there is a complex change and the concern has been addressed, either the author or reviewer could resolve the threads if the reviewer approves.
  2. Reviewers should have 48 hours to complete a review, so plan ahead as the end of the milestone approaches.
  3. When possible, questions/problems should be discussed with your reviewer before submitting the MR for review. Particularly for large changes, review time is the least efficient time to have to make meaningful changes to code, because you’ve already done most of the work!

    Local Docker Workflow

    To facilitate an easier workflow for analysts and to abstract away some of the complexity around handling dbt and its dependencies locally, the main analytics repo now supports using dbt from within a Docker container. There are commands within the Makefile to facilitate this, and if at any time you have questions about the various make commands and what they do, just use make help to get a handy list of the commands and what each of them does.
    Before your initial run (and whenever the containers get updated) make sure to run the following commands:

  1. make update-containers
  2. make cleanup

These commands will ensure you get the newest versions of the containers and generally clean up your local Docker environment.

Using dbt:
  • To start the dbt container and run commands from a shell inside it, use make dbt-image.
  • This automatically brings in everything dbt needs to run, including your local profiles.yml file and the repo files.
  • To view the documentation for your current branch, run make dbt-docs and then visit localhost:8081 in a web browser. There is a bug in dbt that prevents this from working as expected; the fix is expected in dbt version 0.15.
  • Once inside the dbt container, run any dbt commands as you normally would.
  • Changes made to any files in the repository are automatically picked up inside the container. There is no need to restart the container when you change files in your editor! A sample session is sketched below.
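
Putting the pieces together, a typical session might look like the sketch below. The make targets are the ones named above; the specific dbt commands and the model name are illustrative assumptions, not a prescribed sequence.

```bash
# Refresh containers (before your first run and whenever the containers
# are updated), then tidy the local Docker environment.
make update-containers
make cleanup

# Start the dbt container and open a shell inside it; the local
# profiles.yml and the repo files are brought in automatically.
make dbt-image

# Inside the container, run dbt as usual. The model name below is a
# hypothetical example.
dbt deps
dbt run --models sfdc_opportunity_xf

# From a separate shell on the host, build the docs for the current
# branch, then visit localhost:8081 in a web browser.
make dbt-docs
```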