The Act of Coding

Query objects within an area with GeoDjango (part 2)

May 02, 2021 - 4 min read

GeoDjango The first article mentioned concerns about the speed of the selected approach as for every geometry an additional buffer should be prepared and that is rather time-consuming. It is about time to measure these performance penalties and see what we can do with them.

Prepare sample dataset for the testing

def generate_sample_data(num: int) -> None:
    search_criteria = []
    for index in range(num):
        search_criteria.append(
            SearchCriteria(
                geo_location=Point(
                    random.randint(-180, 180),
                    random.randint(-90, 90),
                ),
                distance=random.randint(1000, 30000)
            )
        )
    SearchCriteria.objects.bulk_create(search_criteria)

From the product estimation we should be dealing with ~ 100k objects, so let's create about that amount of data randomly distributed on our map. This step is required to check the performance of using dynamic buffer generation for our objects.

def generate():
    for i in range(100):
        print(i)
        generate_sample_data(1000)

Measure execution time of building a dynamic buffer

In [1]: poi = POI.objects.first() 
In [2]: %time  SearchCriteria.objects.with_buffer().filter(buff__intersects=poi.geo_location).count()                                                                                           
CPU times: user 1.56 ms, sys: 901 µs, total: 2.46 ms
Wall time: 7.89 s
Out[2]: 4

As we tell, this is absolutely inappropriate execution time of ~ 8sec 🚫. To profile it we could check the SQL query that shows us ST_Buffer function being used twice across the entire table.

SELECT 
    "core_searchcriteria"."id", 
    "core_searchcriteria"."geo_location"::bytea, 
    "core_searchcriteria"."distance", 
    ST_Buffer(
        CAST("core_searchcriteria"."geo_location" AS geography(POINT,4326)), "core_searchcriteria"."distance"
        )::bytea AS "buff" FROM "core_searchcriteria" 
    WHERE ST_Intersects(
        ST_Buffer(
            CAST("core_searchcriteria"."geo_location" AS geography(POINT,4326)), "core_searchcriteria"."distance"), 
        ST_GeomFromEWKB('\001\001\000\000 \346\020\000\000\000\000\000\000\000@U@\000\000\000\000\000\000@@'::bytea)
    )

Using geography columns

For much better performance on distance queries, we could use geography columns that can use their spatial index in distance queries. You can tell GeoDjango to use a geography column by setting geography=True in your field definition.

class SearchCriteria(models.Model):
    geo_location = PointField(
        verbose_name='Location coordinates',
        geography=True,
    )
    distance = models.PositiveIntegerField(
        verbose_name='Search radius (in meters)',
        default=20000,
        validators=[
            MinValueValidator(100),
            MaxValueValidator(30000),
        ],
    )

    objects = SearchCriteriaManager()

Django automatically adds an index to all SpatialFields by settings default spatial_index=True. Let's have a look at how the SQL definition of our table looks like after migration.

searchapp=# \d+ core_searchcriteria;
Table "public.core_searchcriteria"
    Column    |         Type          | Collation | Nullable |                     Default                     | Storage | Stats target | Description 
--------------+-----------------------+-----------+----------+-------------------------------------------------+---------+--------------+-------------
 id           | bigint                |           | not null | nextval('core_searchcriteria_id_seq'::regclass) | plain   |              | 
 geo_location | geography(Point,4326) |           | not null |                                                 | main    |              | 
 distance     | integer               |           | not null |                                                 | plain   |              | 
Indexes:
    "core_searchcriteria_pkey" PRIMARY KEY, btree (id)
    "core_searchcriteria_geo_location_id" gist (geo_location)
Check constraints:
    "core_searchcriteria_distance_check" CHECK (distance >= 0)

With the geography field type we will be able to use regular ST_DWithin lookups in Django.

In [5]: %time SearchCriteria.objects.filter(
    geo_location__dwithin=(poi.geo_location, F('distance'))
 ).count()                                                                                                                                                     
CPU times: user 1.25 ms, sys: 1.04 ms, total: 2.29 ms
Wall time: 85.9 ms

Execution speed was ~90x 🤯 faster comparing to generating buffers and calculating intersections. While it might sound like a silver bullet such an approach comes with certain caveats and shouldn't be considered as a universal solution especially when you perform other different operations on your geometries. Because geography calculations involve more mathematics, only a subset of the PostGIS spatial lookups are available for the geography type. Except for the various distance lookups it supports also bboverlaps, coveredby, coversand intersects. For the other types of operations you could cast your geography field back to regular geometry and you can find more about it here.

An alternative approach with the Distance query

One more possibility that we have without changing the type of our spatial field is based on the pre-calculating distance to the given point of interest and then checking if it lies in the range of the search radius. For that, we will be using another spatial function called distance Distance and extend our manager with another method.

# managers.py

from django.contrib.gis.db.models.functions import GeoFunc, Distance

class SearchCriteriaManager(Manager):
    ...
    
    def distance_to(self, point: Point) -> QuerySet:
        return self.annotate(
            ds=Distance('geo_location', point)
        )

This way we got an increase in performance in ~65x times which could be a great compromise if you don't want converting your fields to the geometry types.

In [7]: poi = POI.objects.first()                                                                                                                                                                                                                                

In [8]: %timeit SearchCriteria.objects.distance_to(poi.geo_location).filter(ds__lte=F('distance')).count()                                                                                                                                                       
124 ms ± 2.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Skipping the count() part the example above gives us the following SQL query where ST_DistanceSphere function will very efficiently be using spatial index in PostGIS.

SELECT 
    "core_searchcriteria"."id", 
    "core_searchcriteria"."geo_location"::bytea,
     "core_searchcriteria"."distance",
     ST_DistanceSphere(
        "core_searchcriteria"."geo_location", 
        ST_GeomFromEWKB('\001\001\000\000 \346\020\000\000\000\000\000\000\000@U@\000\000\000\000\000\000@@'::bytea)) AS "ds" 
     FROM "core_searchcriteria" 
        WHERE ST_DistanceSphere(
            "core_searchcriteria"."geo_location", 
            ST_GeomFromEWKB('\001\001\000\000 \346\020\000\000\000\000\000\000\000@U@\000\000\000\000\000\000@@'::bytea)
        ) <= "core_searchcriteria"."distance";

Footnotes

It is fascinating how one problem could have a variety of different solutions in Django world. The topic of spatial queries is vast, and I genially recommend checking the documentation of the PostGIS when facing similar performance issues with your queries. Bye :)