最近ByteBase的公众号推送了这篇文章,就顺带看了下。文章发于2018年的ICDE

文章地址:https://dl.acm.org/doi/10.1145/3180155.3180194

研究ORM读取数据库的性能影响,有意思

Introduction

These ORM frameworks allow developers to program such database-backed web applications in a DBMS oblivious way, as the frameworks expose APIs for developers to operate persistent data stored in the DBMS as if they are regular heap objects, with regularlooking method calls transparently translated to SQL queries by frameworks when executed.

简要的介绍了下ORM

Given the above, we target three key research questions about real-world ORM applications in this paper:

• RQ 1: How well do real-world database-backed web applications perform as the amount of application data increases?

• RQ 2: What are the common root causes of performance and scalability issues in such applications?

• RQ 3: What are the potential solutions to such issues and can they be applied automatically?

问了几个问题,可以理解为ORM的使用情况怎么样?性能怎么样?有什么问题需要解决?

Overall, our comprehensive study provides motivations and guidelines for future research to help avoid, detect, and fix crossstack performance issues in ORM applications. We have prepared a detailed replication package for all the performance-issue study, profiling, and program analysis conducted in this paper. This package is available on the webpage of our Hyperloop project [16], a project that aims to solve database-related performance problems in ORM applications.

Mark下,有空再瞅瞅:http://hyperloop.cs.uchicago.edu.

Background

Our study focuses on applications written in Ruby on Rails (Rails). Ruby is among the top 3 popular languages on GitHub [38], and Rails is among the top 3 popular web application frameworks

啊,Ruby上Github前三?(想想是2018年,那确实有可能)——难怪江湖有不少关于Ruby的传说doge

这章主要介绍Ruby生态,图看看就好

image-20250516174256439

Profiling & Study Methodology

As mentioned in Section 2, we focus on Rails applications. Since it is impractical to study all open-source Rails applications (about 200 thousand of them on GitHub [12]), we focus on 6 popular application categories1 as shown in Table 1. These 6 categories cover 90% of all Rails applications with more than 100 stars on GitHub.

image-20250516175546549

以Github上的Ruby的6类项目为例(大家比较熟悉的应该有Gitlab和Discourse,OpenSteetmap是Ruby这个我真没想到)

To profile an ORM application, we need to populate its database. Without access to the database contents in the deployed applications, we collect real-world statistics of each application based on its public website

卧槽,真这么做?“填充数据库”,这个工作量应该非常大,下面是Gitlab的数据

image-20250516213602814

we examine each application’s bug-tracking system. For 6 applications that contain fewer than 1000 bug reports, as shown in Table 4, we manually check every bug report. For applications with 1000 to 5000 bug reports, we randomly sample 100 bug reports that have been fixed and contain the keywords performance, slow, or optimization. For Redmine and Gitlab, which have more than 10,000 bug reports, we sample 200 from them in the same way.

甚至还仔细查看了BugReport

Our study of each application’s bug-tracking system does not consider bug reports that are not fixed or not clearly explained. Despite these aspects, we have made our best effort in conducting a comprehensive and unbiased study

“我们对每个应用程序的错误跟踪系统的研究并未考虑未固定或未清楚解释的错误报告”emmm,好吧,也就是实验具备时效性,现在2025年未必是有效的

Profiling Results

We identify the 10 pages with the most loading time for every application under the 20,000-record database configuration and plot their average end-to-end page loading time in Figure 2.

image-20250516214944462

As shown in Figure 3, server time contributes to at least 40% of the end-to-end-latency for more than half of the top 10 pages in all but 1 application

image-20250516220100732

服务端在端到端延迟中占据了40%,这并不意外

但下面还有一段注释,我觉得有些难评价

Part of the server time could overlap with the client time or the network time. However, our measurement shows that the overlap is negligible.

实验环境下网络环境不是问题,但实际网络延迟不会导致ORM出现其他问题么?

Causes of inefficiences

After studying the 64 performance issues in the 40 problematic actions and the 140 issues reported in the applications’ bug-tracking systems, we categorize the inefficiency causes into three categories: ORM API misuses, database design, and application design.

这三个结论:

  1. ORM API misuses
  2. Database design
  3. Application design

说了跟没说一样?😂那得看看后面怎么分析了

ORM API misuses

这一块总结为几类

  • Inefficient Computation (IC)
  • Unnecessary Computation (UC)
  • Inefficient Data Accessing (ID)

Another common problem is developers using API calls that generate queries with unnecessary ordering of the results. For example, Ror, Diaspora, and Spree developers use Object.where(c).first to get an object satisfying predicate c instead of Object.find_by(c), not realizing that the former API orders Objects by primary key after evaluating predicate c

这种情况似乎并不意外?😂ORM总是能提供多种路径实现数据读取,可是使用的SQL就未必高效

image-20250516222004684

Sometimes, queries are repeatedly issued to load the same database contents and hence are unnecessary. For instance, Figure 5 shows the patch from redmine-23334.

image-20250516222402232

Summary. While similar issues in general purpose programs can be eliminated using classic compiler optimization techniques (e.g., loop invariant motion, dead-store elimination), doing so for ORM applications is difficult as it involves understanding database queries. We are unaware of any compilers that perform such transformations.

我是不是能这么理解,上述问题都可以通过优化ORM解决,但真想要优化ORM不太可能😅比如说臭名昭著的N+1问题

这一块基本就是列举ORM的各类错误使用样例,不过里面的”Inefficient rendering”是头一回听说,文章说是“IR reflects a trade-off between readability and performance when a view file renders a set of objects.”

image-20250516224353051

Although slow rendering is complained, such transformation has not yet been proposed by issue reports. Our profiling finds such optimization speeds up 5 problematic actions by 2.5× on average.

也就是说这个问题是可以被优化的?那还行

Database Design Problems

主要分为以下几点

  • Missing Fields (MF)

  • Missing Database Indexes (MI)

这些问题在数据库设计中不可避免存在,基本可以认为是数据库设计人员的锅

Application Design Trade-offs

  • Application Design Trade-offs

Unfortunately, the lack of pagination still widely exists in latest versions of ORM applications in our study. This indicates that ORM developers need database-aware performance-estimation support to remind them of the need to use pagination in webpage design.

“remind them of the need to use pagination in webpage design”——这其实就有些难蚌

  • Application Functionality Trade-offs (FT)

这段其实我没看懂,是想说为了ORM性能不得不放弃些功能?(不会吧😶)

Fixing The Inefficiences

讲修复效果的,情况以下面这张图为准

Figure 10(a) shows the amount of servertime speedup and the sources of the speedup broken down into different anti-patterns as discussed in Section 5.

image-20250516233927725

We have reported these 64 fixes to corresponding developers. So far, we have received developers’ feedback for 14 of them, all of which have been confirmed to be true performance problems and 7 have already been fixed based on our report.

怎么说,对社区贡献还是不错的

Figure 10(b) shows the lines of code changes required to implement the fixes. The biggest change takes 56 lines of code to fix (for an inefficient rendering (IR) anti-pattern), while the smallest change requires only 1 line of code in 27 fixes. More than 78% of fixes require fewer than 5 lines. In addition, among the fixes that improve performance by 3× or more, more than 90% of them take fewer than 10 lines of code. Around 60% of fixes are intra-procedural, involving only one function.

总之就是这些问题都可以通过找有经验的专家解决——那似乎还是Mybatis那种写SQL的ORM会更好些?

Finding more API Misuses

image-20250516234606308

Anyway,API Misuse还是广泛存在的

Discussion

  • Improving ORM APIs

    这没的说,硬要说的话铁定背锅

  • Support for design and development of ORM applications.

    当然还要教育开发者😂

  • Compiler and runtime optimizations

    这个难咯,想法很好,代价很高(你们为什么不选Java.jpg)

  • Generalizing to other ORM frameworks

    基本使用ORM都会有这个问题,用ORM的一个都别跑(为什么没有Hibernate和Mybatis.jpg)

image-20250516235447164

结语

一片很棒的实验文章😂如果文章是在写不出来可以来这里找找写作思路

Ruby On Rails在国内基本没市场,这些问题我反倒是在Django上有些感觉——说白了,还是ORM降低了后端人员的门槛导致的,要是选型Java这种范式成熟的很难想象会有这种问题

结尾贴一些ByteBase发的微信文章:ORM 越方便,数据库越慢?顶会论文列举九大反模式