12. The Future of Data Systems

12. The Future of Data Systems - 图1

If a thing be ordained to another as to its end, its last end cannot consist in the preservation of its being. Hence a captain does not intend as a last end, the preservation of the ship entrusted to him, since a ship is ordained to something else as its end, viz. to navigation.

(Often quoted as: If the highest aim of a captain was the preserve his ship, he would keep it in port forever.)

​ — St. Thomas Aquinas, Summa Theologica (1265–1274)


So far, this book has been mostly about describing things as they are at present. In this final chapter, we will shift our perspective toward the future and discuss how things should be: I will propose some ideas and approaches that, I believe, may funda‐ mentally improve the ways we design and build applications.

Opinions and speculation about the future are of course subjective, and so I will use the first person in this chapter when writing about my personal opinions. You are welcome to disagree with them and form your own opinions, but I hope that the ideas in this chapter will at least be a starting point for a productive discussion and bring some clarity to concepts that are often confused.

The goal of this book was outlined in Chapter 1: to explore how to create applications and systems that are reliable, scalable, and maintainable. These themes have run through all of the chapters: for example, we discussed many fault-tolerance algo‐ rithms that help improve reliability, partitioning to improve scalability, and mecha‐ nisms for evolution and abstraction that improve maintainability. In this chapter we will bring all of these ideas together, and build on them to envisage the future. Our goal is to discover how to design applications that are better than the ones of today— robust, correct, evolvable, and ultimately beneficial to humanity.

……

Summary

In this chapter we discussed new approaches to designing data systems, and I included my personal opinions and speculations about the future. We started with the observation that there is no one single tool that can efficiently serve all possible use cases, and so applications necessarily need to compose several different pieces of software to accomplish their goals. We discussed how to solve this data integration problem by using batch processing and event streams to let data changes flow between different systems.

In this approach, certain systems are designated as systems of record, and other data is derived from them through transformations. In this way we can maintain indexes, materialized views, machine learning models, statistical summaries, and more. By making these derivations and transformations asynchronous and loosely coupled, a problem in one area is prevented from spreading to unrelated parts of the system, increasing the robustness and fault-tolerance of the system as a whole.

Expressing dataflows as transformations from one dataset to another also helps evolve applications: if you want to change one of the processing steps, for example to change the structure of an index or cache, you can just rerun the new transformation code on the whole input dataset in order to rederive the output. Similarly, if some‐ thing goes wrong, you can fix the code and reprocess the data in order to recover.

These processes are quite similar to what databases already do internally, so we recast the idea of dataflow applications as unbundling the components of a database, and building an application by composing these loosely coupled components.

Derived state can be updated by observing changes in the underlying data. Moreover, the derived state itself can further be observed by downstream consumers. We can even take this dataflow all the way through to the end-user device that is displaying the data, and thus build user interfaces that dynamically update to reflect data changes and continue to work offline.

Next, we discussed how to ensure that all of this processing remains correct in the presence of faults. We saw that strong integrity guarantees can be implemented scala‐ bly with asynchronous event processing, by using end-to-end operation identifiers to make operations idempotent and by checking constraints asynchronously. Clients can either wait until the check has passed, or go ahead without waiting but risk hav‐ ing to apologize about a constraint violation. This approach is much more scalable and robust than the traditional approach of using distributed transactions, and fits with how many business processes work in practice.

By structuring applications around dataflow and checking constraints asynchro‐ nously, we can avoid most coordination and create systems that maintain integrity but still perform well, even in geographically distributed scenarios and in the pres‐ ence of faults. We then talked a little about using audits to verify the integrity of data and detect corruption.

Finally, we took a step back and examined some ethical aspects of building data- intensive applications. We saw that although data can be used to do good, it can also do significant harm: making justifying decisions that seriously affect people’s lives and are difficult to appeal against, leading to discrimination and exploitation, nor‐ malizing surveillance, and exposing intimate information. We also run the risk of data breaches, and we may find that a well-intentioned use of data has unintended consequences.

As software and data are having such a large impact on the world, we engineers must remember that we carry a responsibility to work toward the kind of world that we want to live in: a world that treats people with humanity and respect. I hope that we can work together toward that goal.

References


  1. Rachid Belaid: “Postgres Full-Text Search is Good Enough!,” rachbelaid.com, July 13, 2015.

  2. Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “Challenges to Adopting Stronger Consistency at Scale,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.

  3. Pat Helland and Dave Campbell: “Building on Quicksand,” at 4th Biennial Conference on Innovative Data Systems Research (CIDR), January 2009.

  4. Jessica Kerr: “Provenance and Causality in Distributed Systems,” blog.jessitron.com, September 25, 2016.

  5. Kostas Tzoumas: “Batch Is a Special Case of Streaming,” data-artisans.com, September 15, 2015.

  6. Shinji Kim and Robert Blafford: “Stream Windowing Performance Analysis: Concord and Spark Streaming,” concord.io, July 6, 2016.

  7. Jay Kreps: “The Log: What Every Software Engineer Should Know About Real-Time Data’s Unifying Abstraction,” engineering.linkedin.com, December 16, 2013.

  8. Pat Helland: “Life Beyond Distributed Transactions: An Apostate’s Opinion,” at 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 2007.

  9. Great Western Railway (1835–1948),” Network Rail Virtual Archive, networkrail.co.uk.

  10. Jacqueline Xu: “Online Migrations at Scale,” stripe.com, February 2, 2017.

  11. Molly Bartlett Dishman and Martin Fowler: “Agile Architecture,” at O’Reilly Software Architecture Conference, March 2015.

  12. Nathan Marz and James Warren: Big Data: Principles and Best Practices of Scalable Real-Time Data Systems. Manning, 2015. ISBN: 978-1-617-29034-3

  13. Oscar Boykin, Sam Ritchie, Ian O’Connell, and Jimmy Lin: “Summingbird: A Framework for Integrating Batch and Online MapReduce Computations,” at 40th International Conference on Very Large Data Bases (VLDB), September 2014.

  14. Jay Kreps: “Questioning the Lambda Architecture,” oreilly.com, July 2, 2014.

  15. Raul Castro Fernandez, Peter Pietzuch, Jay Kreps, et al.: “Liquid: Unifying Nearline and Offline Big Data Integration,” at 7th Biennial Conference on Innovative Data Systems Research (CIDR), January 2015.

  16. Dennis M. Ritchie and Ken Thompson: “The UNIX Time-Sharing System,” Communications of the ACM, volume 17, number 7, pages 365–375, July 1974. doi:10.1145/361011.361061

  17. Eric A. Brewer and Joseph M. Hellerstein: “CS262a: Advanced Topics in Computer Systems,” lecture notes, University of California, Berkeley, cs.berkeley.edu, August 2011.

  18. Michael Stonebraker: “The Case for Polystores,” wp.sigmod.org, July 13, 2015.

  19. Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, et al.: “The BigDAWG Polystore System,” ACM SIGMOD Record, volume 44, number 2, pages 11–16, June 2015. doi:10.1145/2814710.2814713

  20. Patrycja Dybka: “Foreign Data Wrappers for PostgreSQL,” vertabelo.com, March 24, 2015.

  21. David B. Lomet, Alan Fekete, Gerhard Weikum, and Mike Zwilling: “Unbundling Transaction Services in the Cloud,” at 4th Biennial Conference on Innovative Data Systems Research (CIDR), January 2009.

  22. Martin Kleppmann and Jay Kreps: “Kafka, Samza and the Unix Philosophy of Distributed Data,” IEEE Data Engineering Bulletin, volume 38, number 4, pages 4–14, December 2015.

  23. John Hugg: “Winning Now and in the Future: Where VoltDB Shines,” voltdb.com, March 23, 2016.

  24. Frank McSherry, Derek G. Murray, Rebecca Isaacs, and Michael Isard: “Differential Dataflow,” at 6th Biennial Conference on Innovative Data Systems Research (CIDR), January 2013.

  25. Derek G Murray, Frank McSherry, Rebecca Isaacs, et al.: “Naiad: A Timely Dataflow System,” at 24th ACM Symposium on Operating Systems Principles (SOSP), pages 439–455, November 2013. doi:10.1145/2517349.2522738

  26. Gwen Shapira: “We have a bunch of customers who are implementing ‘database inside-out’ concept and they all ask ‘is anyone else doing it? are we crazy?’twitter.com, July 28, 2016.

  27. Martin Kleppmann: “Turning the Database Inside-out with Apache Samza,” at Strange Loop, September 2014.

  28. Peter Van Roy and Seif Haridi: Concepts, Techniques, and Models of Computer Programming. MIT Press, 2004. ISBN: 978-0-262-22069-9

  29. Juttle Documentation,” juttle.github.io, 2016.

  30. Evan Czaplicki and Stephen Chong: “Asynchronous Functional Reactive Programming for GUIs,” at 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2013. doi:10.1145/2491956.2462161

  31. Engineer Bainomugisha, Andoni Lombide Carreton, Tom van Cutsem, Stijn Mostinckx, and Wolfgang de Meuter: “A Survey on Reactive Programming,” ACM Computing Surveys, volume 45, number 4, pages 1–34, August 2013. doi:10.1145/2501654.2501666

  32. Peter Alvaro, Neil Conway, Joseph M. Hellerstein, and William R. Marczak: “Consistency Analysis in Bloom: A CALM and Collected Approach,” at 5th Biennial Conference on Innovative Data Systems Research (CIDR), January 2011.

  33. Felienne Hermans: “Spreadsheets Are Code,” at Code Mesh, November 2015.

  34. Dan Bricklin and Bob Frankston: “VisiCalc: Information from Its Creators,” danbricklin.com.

  35. D. Sculley, Gary Holt, Daniel Golovin, et al.: “Machine Learning: The High-Interest Credit Card of Technical Debt,” at NIPS Workshop on Software Engineering for Machine Learning (SE4ML), December 2014.

  36. Peter Bailis, Alan Fekete, Michael J Franklin, et al.: “Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity,” at ACM International Conference on Management of Data (SIGMOD), June 2015. doi:10.1145/2723372.2737784

  37. Guy Steele: “Re: Need for Macros (Was Re: Icon),” email to ll1-discuss mailing list, people.csail.mit.edu, December 24, 2001.

  38. David Gelernter: “Generative Communication in Linda,” ACM Transactions on Programming Languages and Systems (TOPLAS), volume 7, number 1, pages 80–112, January 1985. doi:10.1145/2363.2433

  39. Patrick Th. Eugster, Pascal A. Felber, Rachid Guerraoui, and Anne-Marie Kermarrec: “The Many Faces of Publish/Subscribe,” ACM Computing Surveys, volume 35, number 2, pages 114–131, June 2003. doi:10.1145/857076.857078

  40. Ben Stopford: “Microservices in a Streaming World,” at QCon London, March 2016.

  41. Christian Posta: “Why Microservices Should Be Event Driven: Autonomy vs Authority,” blog.christianposta.com, May 27, 2016.

  42. Alex Feyerke: “Say Hello to Offline First,” hood.ie, November 5, 2013.

  43. Sebastian Burckhardt, Daan Leijen, Jonathan Protzenko, and Manuel Fähndrich: “Global Sequence Protocol: A Robust Abstraction for Replicated Shared State,” at 29th European Conference on Object-Oriented Programming (ECOOP), July 2015. doi:10.4230/LIPIcs.ECOOP.2015.568

  44. Mark Soper: “Clearing Up React Data Management Confusion with Flux, Redux, and Relay,” medium.com, December 3, 2015.

  45. Eno Thereska, Damian Guy, Michael Noll, and Neha Narkhede: “Unifying Stream Processing and Interactive Queries in Apache Kafka,” confluent.io, October 26, 2016.

  46. Frank McSherry: “Dataflow as Database,” github.com, July 17, 2016.

  47. Peter Alvaro: “I See What You Mean,” at Strange Loop, September 2015.

  48. Nathan Marz: “Trident: A High-Level Abstraction for Realtime Computation,” blog.twitter.com, August 2, 2012.

  49. Edi Bice: “Low Latency Web Scale Fraud Prevention with Apache Samza, Kafka and Friends,” at Merchant Risk Council MRC Vegas Conference, March 2016.

  50. Charity Majors: “The Accidental DBA,” charity.wtf, October 2, 2016.

  51. Arthur J. Bernstein, Philip M. Lewis, and Shiyong Lu: “Semantic Conditions for Correctness at Different Isolation Levels,” at 16th International Conference on Data Engineering (ICDE), February 2000. doi:10.1109/ICDE.2000.839387

  52. Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan: “Automating the Detection of Snapshot Isolation Anomalies,” at 33rd International Conference on Very Large Data Bases (VLDB), September 2007.

  53. Kyle Kingsbury: Jepsen blog post series, aphyr.com, 2013–2016.

  54. Michael Jouravlev: “Redirect After Post,” theserverside.com, August 1, 2004.

  55. Jerome H. Saltzer, David P. Reed, and David D. Clark: “End-to-End Arguments in System Design,” ACM Transactions on Computer Systems, volume 2, number 4, pages 277–288, November 1984. doi:10.1145/357401.357402

  56. Peter Bailis, Alan Fekete, Michael J. Franklin, et al.: “Coordination-Avoiding Database Systems,” Proceedings of the VLDB Endowment, volume 8, number 3, pages 185–196, November 2014.

  57. Alex Yarmula: “Strong Consistency in Manhattan,” blog.twitter.com, March 17, 2016.

  58. Douglas B Terry, Marvin M Theimer, Karin Petersen, et al.: “Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System,” at 15th ACM Symposium on Operating Systems Principles (SOSP), pages 172–182, December 1995. doi:10.1145/224056.224070

  59. Jim Gray: “The Transaction Concept: Virtues and Limitations,” at 7th International Conference on Very Large Data Bases (VLDB), September 1981.

  60. Hector Garcia-Molina and Kenneth Salem: “Sagas,” at ACM International Conference on Management of Data (SIGMOD), May 1987. doi:10.1145/38713.38742

  61. Pat Helland: “Memories, Guesses, and Apologies,” blogs.msdn.com, May 15, 2007.

  62. Yoongu Kim, Ross Daly, Jeremie Kim, et al.: “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” at 41st Annual International Symposium on Computer Architecture (ISCA), June 2014. doi:10.1145/2678373.2665726

  63. Mark Seaborn and Thomas Dullien: “Exploiting the DRAM Rowhammer Bug to Gain Kernel Privileges,” googleprojectzero.blogspot.co.uk, March 9, 2015.

  64. Jim N. Gray and Catharine van Ingen: “Empirical Measurements of Disk Failure Rates and Error Rates,” Microsoft Research, MSR-TR-2005-166, December 2005.

  65. Annamalai Gurusami and Daniel Price: “Bug #73170: Duplicates in Unique Secondary Index Because of Fix of Bug#68021,” bugs.mysql.com, July 2014.

  66. Gary Fredericks: “Postgres Serializability Bug,” github.com, September 2015.

  67. Xiao Chen: “HDFS DataNode Scanners and Disk Checker Explained,” blog.cloudera.com, December 20, 2016.

  68. Jay Kreps: “Getting Real About Distributed System Reliability,” blog.empathybox.com, March 19, 2012.

  69. Martin Fowler: “The LMAX Architecture,” martinfowler.com, July 12, 2011.

  70. Sam Stokes: “Move Fast with Confidence,” blog.samstokes.co.uk, July 11, 2016.

  71. Sawtooth Lake Documentation,” Intel Corporation, intelledger.github.io, 2016.

  72. Richard Gendal Brown: “Introducing R3 Corda™: A Distributed Ledger Designed for Financial Services,” gendal.me, April 5, 2016.

  73. Trent McConaghy, Rodolphe Marques, Andreas Müller, et al.: “BigchainDB: A Scalable Blockchain Database,” bigchaindb.com, June 8, 2016.

  74. Ralph C. Merkle: “A Digital Signature Based on a Conventional Encryption Function,” at CRYPTO ‘87, August 1987. doi:10.1007/3-540-48184-2_32

  75. Ben Laurie: “Certificate Transparency,” ACM Queue, volume 12, number 8, pages 10-19, August 2014. doi:10.1145/2668152.2668154

  76. Mark D. Ryan: “Enhanced Certificate Transparency and End-to-End Encrypted Mail,” at Network and Distributed System Security Symposium (NDSS), February 2014. doi:10.14722/ndss.2014.23379

  77. Software Engineering Code of Ethics and Professional Practice,” Association for Computing Machinery, acm.org, 1999.

  78. François Chollet: “Software development is starting to involve important ethical choices,” twitter.com, October 30, 2016.

  79. Igor Perisic: “Making Hard Choices: The Quest for Ethics in Machine Learning,” engineering.linkedin.com, November 2016.

  80. John Naughton: “Algorithm Writers Need a Code of Conduct,” theguardian.com, December 6, 2015.

  81. Logan Kugler: “What Happens When Big Data Blunders?,” Communications of the ACM, volume 59, number 6, pages 15–16, June 2016. doi:10.1145/2911975

  82. Bill Davidow: “Welcome to Algorithmic Prison,” theatlantic.com, February 20, 2014.

  83. Don Peck: “They’re Watching You at Work,” theatlantic.com, December 2013.

  84. Leigh Alexander: “Is an Algorithm Any Less Racist Than a Human?theguardian.com, August 3, 2016.

  85. Jesse Emspak: “How a Machine Learns Prejudice,” scientificamerican.com, December 29, 2016.

  86. Maciej Cegłowski: “The Moral Economy of Tech,” idlewords.com, June 2016.

  87. Cathy O’Neil: Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown Publishing, 2016. ISBN: 978-0-553-41881-1

  88. Julia Angwin: “Make Algorithms Accountable,” nytimes.com, August 1, 2016.

  89. Bryce Goodman and Seth Flaxman: “European Union Regulations on Algorithmic Decision-Making and a ‘Right to Explanation’,” arXiv:1606.08813, August 31, 2016.

  90. A Review of the Data Broker Industry: Collection, Use, and Sale of Consumer Data for Marketing Purposes,” Staff Report, United States Senate Committee on Commerce, Science, and Transportation, commerce.senate.gov, December 2013.

  91. Olivia Solon: “Facebook’s Failure: Did Fake News and Polarized Politics Get Trump Elected?theguardian.com, November 10, 2016.

  92. Donella H. Meadows and Diana Wright: Thinking in Systems: A Primer. Chelsea Green Publishing, 2008. ISBN: 978-1-603-58055-7

  93. Daniel J. Bernstein: “Listening to a ‘big data’/‘data science’ talk,” twitter.com, May 12, 2015.

  94. Marc Andreessen: “Why Software Is Eating the World,” The Wall Street Journal, 20 August 2011.

  95. J. M. Porup: “‘Internet of Things’ Security Is Hilariously Broken and Getting Worse,” arstechnica.com, January 23, 2016.

  96. Bruce Schneier: Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World. W. W. Norton, 2015. ISBN: 978-0-393-35217-7

  97. The Grugq: “Nothing to Hide,” grugq.tumblr.com, April 15, 2016.

  98. Tony Beltramelli: “Deep-Spying: Spying Using Smartwatch and Deep Learning,” Masters Thesis, IT University of Copenhagen, December 2015. Available at arxiv.org/abs/1512.05616

  99. Shoshana Zuboff: “Big Other: Surveillance Capitalism and the Prospects of an Information Civilization,” Journal of Information Technology, volume 30, number 1, pages 75–89, April 2015.doi:10.1057/jit.2015.5

  100. Carina C. Zona: “Consequences of an Insightful Algorithm,” at GOTO Berlin, November 2016.

  101. Bruce Schneier: “Data Is a Toxic Asset, So Why Not Throw It Out?,” schneier.com, March 1, 2016.

  102. John E. Dunn: “The UK’s 15 Most Infamous Data Breaches,” techworld.com, November 18, 2016.

  103. Cory Scott: “Data is not toxic - which implies no benefit - but rather hazardous material, where we must balance need vs. want,” twitter.com, March 6, 2016.

  104. Bruce Schneier: “Mission Creep: When Everything Is Terrorism,” schneier.com, July 16, 2013.

  105. Lena Ulbricht and Maximilian von Grafenstein: “Big Data: Big Power Shifts?,” Internet Policy Review, volume 5, number 1, March 2016. doi:10.14763/2016.1.406

  106. Ellen P. Goodman and Julia Powles: “Facebook and Google: Most Powerful and Secretive Empires We’ve Ever Known,” theguardian.com, September 28, 2016.

  107. Directive 95/46/EC on the protection of individuals with regard to the processing of personal data and on the free movement of such data, Official Journal of the European Communities No. L 281/31, eur-lex.europa.eu, November 1995.

  108. Brendan Van Alsenoy: “Regulating Data Protection: The Allocation of Responsibility and Risk Among Actors Involved in Personal Data Processing,” Thesis, KU Leuven Centre for IT and IP Law, August 2016.

  109. Michiel Rhoen: “Beyond Consent: Improving Data Protection Through Consumer Protection Law,” Internet Policy Review, volume 5, number 1, March 2016. doi:10.14763/2016.1.404

  110. Jessica Leber: “Your Data Footprint Is Affecting Your Life in Ways You Can’t Even Imagine,” fastcoexist.com, March 15, 2016.

  111. Maciej Cegłowski: “Haunted by Data,” idlewords.com, October 2015.

  112. Sam Thielman: “You Are Not What You Read: Librarians Purge User Data to Protect Privacy,” theguardian.com, January 13, 2016.

  113. Conor Friedersdorf: “Edward Snowden’s Other Motive for Leaking,” theatlantic.com, May 13, 2014.

  114. Phillip Rogaway: “The Moral Character of Cryptographic Work,” Cryptology ePrint 2015/1162, December 2015.